Zero-Downtime Host Migration: Split a Monolith into Microservices

Split a Monolith During Host Migration—Zero Downtime

Splitting a monolithic web app into microservices during a host migration means incrementally extracting services while changing hosting providers. It relies on the strangler pattern, dual-write or CDC for data sync, and staged DNS and load balancer rules to avoid downtime. It is aimed at engineering teams and ops managers who need zero-downtime cutovers.

Split monolith web apps into microservices summary

Plan and map domains and business transactions with owner and SLA.
Prepare target host, load balancers, NAT and DNS staged rules.
Migrate data with dual-write or CDC and validate consistency.
Extract first service, add API contract and CI/CD pipeline.
Stage traffic using TTL reduction, LB rules, and canary percentages.
Validate KPIs, run rollback triggers, then decommission monolith.

Step 1 Plan and map domains

Map bounded contexts, business transactions, and data ownership before anything else. Use a spreadsheet with columns for domain, transaction, table owners, session state, SLA, and read/write needs. Assign a single owner for each extracted service. This step typically takes 2 to 5 days depending on complexity.

Require these deliverables before any code change. A common error is extracting a service without mapping where sessions live: session state left in the monolith causes split-brain behavior during cutover. Take a snapshot of the monolith schema before starting any migration task.
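The mapping sheet can be linted before sign-off. The sketch below is a minimal illustration, assuming the spreadsheet is exported as rows with the columns listed above. The `DomainRow` type, its field names, the `"in-process"` session label, and the sample rows are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class DomainRow:
    # One spreadsheet row; field names are illustrative, not a fixed schema.
    domain: str
    transaction: str
    table_owners: list[str]
    session_state: str  # e.g. "none", "redis", "in-process"
    sla: str
    owner: str  # single accountable owner; empty string if unassigned


def lint_domain_map(rows: list[DomainRow]) -> list[str]:
    """Return human-readable problems that should block extraction."""
    problems = []
    owners_by_domain: dict[str, set[str]] = {}
    for row in rows:
        owners_by_domain.setdefault(row.domain, set()).add(row.owner)
        if row.session_state == "in-process":
            problems.append(
                f"{row.domain}/{row.transaction}: session state lives in the monolith process"
            )
    for domain, owners in sorted(owners_by_domain.items()):
        owners.discard("")  # an empty owner cell does not count as an owner
        if len(owners) != 1:
            problems.append(f"{domain}: expected exactly one owner, found {len(owners)}")
    return problems


rows = [
    DomainRow("billing", "create-invoice", ["invoices"], "none", "99.9%", "team-billing"),
    DomainRow("billing", "refund", ["invoices", "payments"], "in-process", "99.9%", "team-billing"),
    DomainRow("catalog", "search", ["products"], "none", "99.5%", ""),
]
for problem in lint_domain_map(rows):
    print(problem)
```

With the sample rows, the lint flags the refund transaction for its in-process session state and the catalog domain for having no owner.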

Step 2 Prepare hosting and networking

Prepare the target host and networking with provider-specific rules. For AWS use an internal ALB target group and a public ALB listener, then add rules to route by path or host header. For Cloudflare use staged DNS and proxy rules. For GCP use BackendServices and URL maps.

Example AWS CLI command to create an ALB rule:

aws elbv2 create-rule --listener-arn <listener-arn> \
  --conditions Field=path-pattern,Values='/api/service/*' \
  --priority 10 --actions Type=forward,TargetGroupArn=<tg-arn>

Provider trap: changing the LB without staging DNS causes inconsistent traffic. Always prepare NAT and floating IPs if the provider supports them. This step typically takes 4 to 8 hours for a medium app.

Host/network cutover specifics: define provider-level cutover windows and networking steps in advance. For high-TTL domains, reduce TTL well before the migration (72+ hours ahead if the original TTL was multi-day) and confirm that the new TTL is being served by querying multiple public resolvers.

For IP-level cutover, decide whether you will use floating IPs (DigitalOcean / OpenStack), Elastic IPs (AWS) or BGP announcements (on-prem or colocated). Document the exact reassignment commands and expected propagation times (Elastic IP reassignment is immediate in AWS; BGP updates can take seconds to minutes depending on peers).

Schedule the critical switchover during a low-traffic window (e.g., 02:00–04:00 local time) and document session-affinity handling: if the monolith used sticky cookies, either migrate sessions to a central store (Redis) before cutover or implement a facade that proxies session cookies and falls back to the monolith.

Watch for provider traps such as proxied DNS (Cloudflare) masking origin IPs, and verify that Cloudflare's proxy mode does not block health checks or cause unexpected caching.

Include exact verification steps: traceroute to the new host, validate LB health endpoints from multiple AZs, confirm that floating-IP reassignment completes and the service responds within X seconds, and only then increase the traffic percentage.
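The TTL lead time follows from simple arithmetic: a resolver that cached the record just before the TTL was lowered may keep it for the full old TTL. The sketch below computes the latest safe moment to lower the TTL. The function name, the `safety_factor` margin, and the cutover date are assumptions for illustration, not provider rules.

```python
from datetime import datetime, timedelta


def latest_ttl_change(cutover: datetime, old_ttl_seconds: int,
                      safety_factor: float = 2.0) -> datetime:
    """Latest time to lower the TTL so caches holding the old TTL expire before cutover.

    safety_factor is an assumed margin for resolvers that overstay the TTL.
    """
    return cutover - timedelta(seconds=old_ttl_seconds * safety_factor)


cutover = datetime(2025, 3, 10, 2, 0)  # hypothetical 02:00 cutover window
print(latest_ttl_change(cutover, old_ttl_seconds=86400))  # record had a 1-day TTL
```

A multi-day original TTL pushes this deadline back by the same factor, which is why the lead time above is measured in days, not hours.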

Step 3 Data migration and CDC

Data migration refers to moving ownership of data while keeping reads and writes consistent. Use dual-write for short-lived migrations and change data capture for continuous sync. Debezium and AWS DMS are common CDC options.

Runbook example for a Debezium connector (Kafka Connect JSON snippet). Debezium 2.x replaces database.server.name with topic.prefix, so adjust that key to match your connector version:

{
  "name": "connector-db1",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "db-primary.example",
    "database.dbname": "appdb",
    "database.user": "replicator",
    "database.password": "***",
    "database.server.name": "dbserver1"
  }
}

Staged flow: Monolith (read/write) → CDC (stream changes) → New service (reads from materialized view).

Use this SQL check to validate CDC liveness and lag (PostgreSQL 10 or later):

-- Byte lag between the current WAL position and what each replication client has replayed
SELECT application_name,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)) AS lag_bytes
FROM pg_stat_replication;

One pitfall is easy to miss: overall logical replication lag can look healthy while individual table syncs fall behind. Check the per-table last-applied timestamp in the consumer system as well.
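A per-table check can be sketched as below, assuming the consumer records a last-applied timestamp for each table. The `stale_tables` helper, the table names, and the timestamps are hypothetical; the 2-second threshold is borrowed from the replication-lag KPI used elsewhere in this guide.

```python
from datetime import datetime, timedelta, timezone


def stale_tables(last_applied: dict[str, datetime], now: datetime,
                 max_lag: timedelta) -> dict[str, timedelta]:
    """Return tables whose last applied change is older than max_lag."""
    return {
        table: now - applied_at
        for table, applied_at in last_applied.items()
        if now - applied_at > max_lag
    }


now = datetime(2025, 3, 10, 2, 0, 0, tzinfo=timezone.utc)
last_applied = {
    "orders": now - timedelta(seconds=1),
    "invoices": now - timedelta(minutes=7),
}
print(stale_tables(last_applied, now, max_lag=timedelta(seconds=2)))
```

A table with no recent writes also looks stale under this check, so in practice it would be compared against the source table's last write time or a heartbeat row.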

According to the CNCF 2023 survey, many teams run microservices in production. Tools like Debezium and AWS DMS help reduce cutover risk in most cases.

Concrete live DB migration pattern: a zero-downtime migration is safest when it combines a chunked backfill, CDC for changes made during the backfill, and an idempotent upsert pattern on the target. Example sequence:

(1) Add the new target table/schema and create any necessary indexes.
(2) Run a chunked backfill: INSERT INTO new_schema.table (cols...) SELECT cols... FROM old_schema.table WHERE id BETWEEN X AND Y ORDER BY id LIMIT 10000; repeat with small sleeps to keep load bounded. For example, 10k rows/second yields 1M rows in ~100s.
(3) Enable CDC (Debezium/wal2json) into a change topic and apply a consumer that performs idempotent upserts on the new table using: INSERT INTO new_table (id, col1, ...) VALUES (...) ON CONFLICT (id) DO UPDATE SET col1 = EXCLUDED.col1, ...;
(4) Once the backfill is within an acceptable lag (monitor rows_applied / last_wal_lsn diff), switch writes to dual-write or route writes to the new service while continuing to stream changes until lag is zero.
(5) Run parity checks (row counts, checksum samples, transaction counts) and only then flip read traffic.

Practical numbers: a 10M-row table backfilled at 1k rows/sec takes ~2.8 hours; if throughput needs to be faster, split by range and parallelize while monitoring replication lag. Include concrete ETL idempotency and chunking patterns in the runbook and precompute expected transaction counts to detect silent loss.
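The chunking and idempotent-upsert ideas can be tried locally. The sketch below uses Python's built-in sqlite3 module as a stand-in for PostgreSQL, assuming a bundled SQLite of 3.24 or later (the first version with ON CONFLICT ... DO UPDATE). The table names, the `chunk_ranges` helper, and the data are illustrative only.

```python
import sqlite3


def chunk_ranges(min_id: int, max_id: int, chunk_size: int):
    """Yield inclusive (low, high) id ranges that cover min_id..max_id."""
    low = min_id
    while low <= max_id:
        high = min(low + chunk_size - 1, max_id)
        yield low, high
        low = high + 1


UPSERT = """
INSERT INTO new_table (id, col1) VALUES (?, ?)
ON CONFLICT (id) DO UPDATE SET col1 = excluded.col1
"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE old_table (id INTEGER PRIMARY KEY, col1 TEXT)")
conn.execute("CREATE TABLE new_table (id INTEGER PRIMARY KEY, col1 TEXT)")
conn.executemany("INSERT INTO old_table VALUES (?, ?)",
                 [(i, f"v{i}") for i in range(1, 26)])

# Backfill in chunks of 10 ids; the upsert makes re-running a chunk harmless.
for low, high in chunk_ranges(1, 25, 10):
    rows = conn.execute(
        "SELECT id, col1 FROM old_table WHERE id BETWEEN ? AND ? ORDER BY id",
        (low, high),
    ).fetchall()
    conn.executemany(UPSERT, rows)

# A change event delivered twice leaves the same final state.
for _ in range(2):
    conn.execute(UPSERT, (7, "updated"))

row_count = conn.execute("SELECT COUNT(*) FROM new_table").fetchone()[0]
col1_for_7 = conn.execute("SELECT col1 FROM new_table WHERE id = 7").fetchone()[0]
print(row_count, col1_for_7)
```

This sketch does not handle ordering: if a backfill chunk is applied after a newer change event for the same row, the older value wins. A production consumer would typically compare a version column or LSN before overwriting.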

Step 4 Extract service and deploy with CI/CD

Extract the service by creating a thin API facade that delegates unresolved calls to the monolith. Define an API contract and version it. Build a container image and push it to a registry.
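The routing decision inside such a facade can be sketched in a few lines. This is a minimal illustration, not a full proxy: the upstream URLs and the `resolve_upstream` helper are hypothetical, and the `/api/service/` prefix simply mirrors the ALB rule example from Step 2.

```python
MONOLITH = "http://monolith.internal"  # hypothetical upstream for everything not yet extracted
EXTRACTED = {
    "/api/service/": "http://service.internal",  # hypothetical extracted service
}


def resolve_upstream(path: str) -> str:
    """Pick the upstream for a request path; unknown paths go to the monolith."""
    # Longest prefix wins so nested routes can be extracted independently.
    for prefix in sorted(EXTRACTED, key=len, reverse=True):
        if path.startswith(prefix):
            return EXTRACTED[prefix]
    return MONOLITH


print(resolve_upstream("/api/service/orders/42"))
print(resolve_upstream("/account/settings"))
```

Defaulting to the monolith means each extraction is one added entry in the table, and removing that entry is the rollback.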

Sample GitHub Actions CI pipeline snippet:

name: build-and-deploy
on: [push]
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Assumes the runner is already authenticated to registry.example
      - name: Build image
        run: docker build -t registry.example/service:${{ github.sha }} .
      - name: Push image
        run: docker push registry.example/service:${{ github.sha }}

Automation trap: manual deploys delay rollback and increase human error. The quick route is manual deployment. The correct route is automated CI/CD with a tested rollback step.

Operational runbook: provide a single-page, executable runbook that an on-call engineer can follow step-by-step during cutover. The runbook should list the exact order of actions, terminal commands and expected outputs (for both forward cut and rollback), for example:

Lower DNS TTL (Cloudflare API PATCH) and verify TTL reflected
Deploy new service image (kubectl set image / docker compose up) and wait for health-check green
Modify LB rule (aws elbv2 modify-rule) to shift 1% traffic
Run smoke tests against canary endpoints and validate metrics
Escalate to 10%/100% or execute rollback script (aws elbv2 modify-rule back to monolith).

Include reusable script templates (bash/PowerShell) to automate the LB switch and a tested CI job that runs the rollback command if health checks fail. An example description of an automated rollback job: "CI job: on failed canary health, run aws elbv2 modify-rule --rule-arn --actions Type=forward,TargetGroupArn= and notify Slack/PagerDuty; confirm LB target group health becomes healthy within configured timeout." Packaging these exact commands and expected verification checks into a single runbook reduces human error and shortens mean time to recovery.

Step 5 Cut traffic and validate health

Cut traffic in stages. First, lower DNS TTL to 30 seconds at least 48 hours before cut. Then perform a 1% canary for 15 minutes. Increase to 10% for 1 hour. Finally move to 100% and monitor KPIs.
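On an AWS ALB, a percentage split is expressed as a weighted forward action across two target groups. The sketch below builds that actions payload for each stage of the schedule above. The helper name and the placeholder ARNs are assumptions; the payload is only printed here, not sent to AWS.

```python
import json


def weighted_forward_action(monolith_tg: str, service_tg: str,
                            canary_percent: int) -> list[dict]:
    """Build an ALB forward action that sends canary_percent of requests to the new service."""
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    return [{
        "Type": "forward",
        "ForwardConfig": {
            "TargetGroups": [
                {"TargetGroupArn": monolith_tg, "Weight": 100 - canary_percent},
                {"TargetGroupArn": service_tg, "Weight": canary_percent},
            ]
        },
    }]


# (canary percent, hold time in minutes) taken from the staged schedule
STAGES = [(1, 15), (10, 60), (100, None)]

for percent, hold_minutes in STAGES:
    actions = weighted_forward_action("<monolith-tg-arn>", "<service-tg-arn>", percent)
    print(percent, hold_minutes, json.dumps(actions))
```

Each printed JSON document can be passed as the --actions value of aws elbv2 modify-rule. The split is applied per request at the load balancer, so it does not depend on DNS caching.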

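Example Cloudflare API call to reduce DNS TTL (Cloudflare accepts a 30-second TTL only on Enterprise zones; other plans have a 60-second minimum, and proxied records use an automatic TTL):

curl -X PATCH "https://api.cloudflare.com/client/v4/zones/$ZONE/dns_records/$RECORD" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  --data '{"ttl":30}'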

Validate these KPIs before final cut:

Latency p50 and p95 stable within 10% of baseline.
Error rate under 0.5% for 30 minutes.
Replication lag under 2 seconds and data mismatch under 0.01%.
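These gates are simple enough to encode as a go/no-go check. The sketch below is one possible reading, treating "stable within 10%" as absolute drift from baseline in either direction. The `Snapshot` type, its field names, and the sample numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    p50_ms: float
    p95_ms: float
    error_rate: float         # fraction of failed requests over the last 30 minutes
    replication_lag_s: float
    mismatch_rate: float      # fraction of sampled rows that differ


def gate_failures(baseline: Snapshot, current: Snapshot) -> list[str]:
    """Return the KPI gates that currently block the final cut."""
    failures = []
    for name in ("p50_ms", "p95_ms"):
        base, cur = getattr(baseline, name), getattr(current, name)
        if abs(cur - base) > 0.10 * base:
            failures.append(f"{name} drifted more than 10% from baseline")
    if current.error_rate >= 0.005:
        failures.append("error rate is not under 0.5%")
    if current.replication_lag_s >= 2:
        failures.append("replication lag is not under 2 seconds")
    if current.mismatch_rate >= 0.0001:
        failures.append("data mismatch is not under 0.01%")
    return failures


baseline = Snapshot(p50_ms=40, p95_ms=180, error_rate=0.001, replication_lag_s=0.5, mismatch_rate=0.0)
current = Snapshot(p50_ms=42, p95_ms=205, error_rate=0.002, replication_lag_s=0.8, mismatch_rate=0.0)
print(gate_failures(baseline, current))
```

An empty list means every gate passes. In the sample, only p95 latency blocks the cut.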

Immediate rollback triggers example:

Error rate above 1% sustained 5 minutes.
P95 latency increase >50% for 10 minutes.
Any detected data corruption.
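The word "sustained" matters in these triggers: a single bad minute should not roll back the cut. The sketch below assumes one metric sample per minute, which is an assumption about the monitoring setup; the function names are hypothetical.

```python
from typing import Optional


def sustained(samples: list[float], threshold: float, minutes: int) -> bool:
    """True if the most recent `minutes` one-minute samples all exceed threshold."""
    if len(samples) < minutes:
        return False
    return all(value > threshold for value in samples[-minutes:])


def should_roll_back(error_rates: list[float], p95_ms: list[float],
                     baseline_p95_ms: float, corruption_detected: bool) -> Optional[str]:
    """Return the reason to roll back, or None if no trigger has fired."""
    if corruption_detected:
        return "data corruption detected"
    if sustained(error_rates, 0.01, 5):
        return "error rate above 1% for 5 minutes"
    if sustained(p95_ms, 1.5 * baseline_p95_ms, 10):
        return "p95 latency more than 50% above baseline for 10 minutes"
    return None


error_rates = [0.002] * 5 + [0.02] * 5   # last five minutes are above 1%
p95_ms = [190.0] * 10
print(should_roll_back(error_rates, p95_ms, baseline_p95_ms=180.0, corruption_detected=False))
```

Returning the reason, not just a boolean, lets the rollback job include it in the Slack or PagerDuty notification.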

Run this rollback command to switch the LB rule back to the monolith quickly:

aws elbv2 modify-rule --rule-arn <rule-arn> --actions Type=forward,TargetGroupArn=<monolith-tg>

Pause briefly and observe metrics before proceeding.

Split monolith web apps into microservices checklist

Freeze non-critical schema changes 48 hours before cutover.
Reduce DNS TTL to 30 seconds 48 hours before a staged cut.
Ensure CI/CD rollback step completes in under 5 minutes.
Implement health endpoints and map them to LB health checks.
Predefine SLA metrics and implement alert thresholds.

KPIs to measure success post-cut:

Latency delta p95 under 25% within 60 minutes.
Error rate below 0.5% for 24 hours.
Deployment frequency 1+ per day for small teams after cutover.
Cost delta per service tracked hourly for first 7 days (expected range $5–$200 per service per month in the USA depending on resources).

Errors that ruin the migration

Ignoring DNS TTL is the most common fatal error. Teams change DNS without lowering TTL and see split traffic for hours. Sessions break when session state remains in the monolith. Attempting a database per service without a sync plan causes data corruption.

Other failures include missing rollback automation and skipping chaos tests. A simple chaos test is terminating one new service instance during canary.

💡 Tip
Precompute and store expected transaction counts per minute. This helps detect silent data loss during CDC syncs.
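A reconciliation pass over those counts can be sketched as below. It assumes the expected counts come from the source side, keyed by minute, and that the observed counts come from the target. The `tolerance` value, the function name, and the sample numbers are illustrative assumptions.

```python
def silent_loss_minutes(expected: dict[str, int], observed: dict[str, int],
                        tolerance: float = 0.01) -> list[str]:
    """Return minute keys where the target saw fewer transactions than expected."""
    suspicious = []
    for minute, want in sorted(expected.items()):
        got = observed.get(minute, 0)  # a missing minute counts as zero transactions
        if got < want * (1 - tolerance):
            suspicious.append(minute)
    return suspicious


expected = {"02:00": 1200, "02:01": 1180, "02:02": 1210}
observed = {"02:00": 1200, "02:01": 1100, "02:02": 1205}
print(silent_loss_minutes(expected, observed))
```

The tolerance absorbs events that land in the neighboring minute because of clock skew or in-flight lag. A shortfall that persists across several minutes is the signal worth paging on.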

When this method does not apply and alternatives

This method does not apply when the app is tiny and microservice overhead exceeds benefits. It also fails if the target provider lacks LB or DNS features for staged cutovers. Consider alternatives: a classic lift and shift or a strangler facade only.

Criterion	Split during host migration	Lift and shift	When to choose
Downtime risk	Low with CDC and staged DNS	Medium to high during cutover	Choose split when SLA critical
Effort	High initial engineering effort	Low short-term effort	Choose lift when timeline is urgent
Cost profile	Higher OPEX initial, scalable later	Lower initial migration cost	Choose split if long-term scale matters

Recommendation: choose split during host migration when the business needs zero downtime and expects ongoing scaling. Choose lift and shift when speed beats long-term architecture changes.

Frequently asked questions

How do you split a monolith into microservices?

Split by business capability and traffic slice. Extract a small, high-value service first. Implement a facade that routes unknown calls to the monolith. Use dual-write or CDC and automated tests to prove parity.

What are the challenges of migrating to microservices?

Challenges include data consistency, increased operational overhead, debugging distributed traces, and designing APIs. Teams incorrectly assume the database can be trivially split. Allow 2 to 8 weeks per service depending on data complexity.

Can you migrate to microservices without touching the database?

Short answer: sometimes. Teams can use read-side materialization or API facades to avoid schema changes. Long term, a per-service schema may be needed. Expect additional engineering work for eventual consistency patterns.

What is the strangler fig pattern and how do you apply it?

The strangler fig pattern means extracting features behind a new interface while leaving the monolith active. Apply it by routing specific calls to new services and gradually widening the scope. Validate each extraction with parity tests.

How do you minimize downtime when migrating a monolith during a host migration?

Use staged DNS, low TTL, CDC or dual-write, and LB rules to steer traffic gradually. Automate rollback conditions. Test the full cutover on a staging environment that mirrors production.

How long does it take to migrate a monolith to microservices?

Time varies widely. Expect 3 to 12 months for a medium product. Single service extraction and cutover often takes between 3 and 14 days with a focused team.

What does splitting a monolith into microservices cost per service in the USA?

Typical costs range widely from about $5,000 to $50,000 per service for initial migration in the USA. Costs depend on data complexity, SLA, and vendor pricing. Track cost delta hourly for the first 7 days and report to finance.


Alan Curtis

With over 12 years of experience testing and reviewing web hosting solutions, this author is passionate about helping businesses and individuals find the best hosting, VPS, and cloud services for their needs. Covering performance, speed, uptime, migrations, and provider comparisons, every article on Host Compare is based on hands-on experience and real-world testing. Readers gain trusted insights, actionable advice, and clear guidance to choose hosting solutions confidently and optimize their websites effectively.