Contact

Host Compare
Host Compare
  • Home
  • Blog
  • Hosting by Use
  • Hosting Security
  • Hosting Type
  • Performance & Speed
  • Provider Reviews
  • Website Migration
  • About
  • Contact
Search
  • Home
  • Blog
  • Hosting by Use
  • Hosting Security
  • Hosting Type
  • Performance & Speed
  • Provider Reviews
  • Website Migration
  • About
  • Contact

Safe Cutover Without Replication with Minimal Downtime

Safe Cutover Without Replication with Minimal Downtime

What’s the real downtime when moving a 1 TB transactional database across datacenters over a 100 Mbps link? (Raw transfer time for 1 TB over 100 Mbps = (1 TB * 8) / 100 Mbps ≈ 88,000 seconds ≈ 24.4 hours.)

Use the article’s restore multipliers (3–10x) to plan an RTO spanning multiple days. The later 1 Gbps examples apply only to higher bandwidth and do not apply to a 100 Mbps link.

A sysadmin or DevOps lead faces an imminent host or cloud migration. Replication looks simple on paper but adds cost and operational steps.

This runbook delivers measurable estimates and exact commands. It lists mysqldump, pg_basebackup, rsync, and parallel restore commands. It shows a risk/time matrix and a hash-based reconciliation method. It ends with a rollback playbook and a compliance checklist.

The rollback playbook must list explicit promotion and demotion commands, include traffic-change commands, and provide stepwise actions to return the original source online. It must require a rehearsal that simulates a target failure. The team will measure and verify the real rollback RTO before approving the window.

Migrating a production stateful database without replication risks data loss, long downtime, and inconsistent state. If replication is impossible, accept a planned outage. Take verified full backups, quiesce writes, transfer and restore with timing tuned to bandwidth, run checksum reconciliation, and test rollback.

Table of Contents

    Summary of the process

    1. Inventory and measure: size, write rate, IOPS, and bandwidth.

    2. Rehearse: full-size dry-run under throttled network and target storage.

    3. Quiesce writes and take the final backup or snapshot.

    4. Transfer data (bandwidth-aware) and restore using parallel restores.

    5. Run multi-layer verification: schema diff, row counts, per-table checksums, and app smoke tests.

    6. Cut traffic, monitor, and hold source intact until verification passes.

    7. If verification fails, execute the rollback playbook documented below.

    Expand the rollback playbook into testable steps. Include failure-simulation guidance.

    A robust rollback recipe must include these items. First, keep the source database running and immutable, or preserve a filesystem snapshot.

    Second, keep DNS and load-balancer TTLs low. Document exact commands to re-point traffic. Examples: change HAProxy backend or update DNS A and ALIAS records.

    Third, for Postgres, if the target was promoted, prepare commands to point apps back at the source. Update connection strings and restart connection pools. Provide an ordered sequence to restart or re-promote the original instance if it was demoted (for example, systemctl start postgresql). Ensure pg_hba.conf and postgresql.conf match.

    Fourth, practice rollback in rehearsal. Intentionally fail the target after partial verification, then measure time to rollback. Capture DNS change time, app pool recycle, and verification time. Log those rollback RTOs and require the measured rollback time to fit the SLA before acceptance.

    Decision gate: proceed without replication only when business accepts the measured RTO and RPO from a full rehearsal, source is preserved, and rollback path is tested.

    Safe Cutover Without Replication with Minimal Downtime

    Key takeaways and executive decision summary

    • Small read-heavy databases (<10 GB) with low writes often fit a no-replication cutover.

    • Databases >100 GB or with >100 writes/sec usually need replication. Otherwise expect RTO in hours.

    • Rehearse once on production-sized data. Measure RTO during rehearsal. Require that measured RTO be within the allowed window.

    • Roles: DBA (backup and restore lead), Platform Engineer (network and DNS), SRE (monitoring), Application Owner (quiesce), Network Engineer (throughput).

    Hard stop criteria: inability to quiesce writes, target storage IOPS below need, missing validated backup, or rehearsal that fails to meet SLA.

    Pause.

    Who should never migrate without replication

    • Production OLTP systems with continuous writes like payment or order systems. ACID guarantees can break during a final-dump window.

    • Databases under regulatory regimes requiring continuous replication or transaction logging retention such as PCI DSS, HIPAA, SOX, or FedRAMP. Migration without replication may fail audits.

    • Systems with sub-minute RTO or RPO SLAs. A dump-and-restore strategy rarely meets sub-minute recovery.

    • Multi-region, geo-dependent apps where split-brain or data-residency rules like GDPR matter.

    The Database Administrator, SRE, and Chief Technology Officer must sign off when any item above applies.

    Pause.

    Data loss, consistency and transactional risks explained

    The core risk is the final-dump consistency window. A final logical dump captures state at one point in time. Any transaction after that point is lost unless the app quiesces writes or WAL or binlog is shipped.

    A second risk is silent corruption. File-level copies while a database is partially flushed can yield inconsistent state. Those errors may surface only later during index rebuilds or application queries.

    A third risk is underestimated restore time. Index rebuilds, ANALYZE, VACUUM, and constraint checks often dominate RTO. Transfer time is often the smallest part.

    Final-dump consistency window and write cutover

    A single logical dump is consistent only up to its snapshot. The operator must choose between three approaches. First, application-level quiesce (preferred). Second, DB-level freeze. Third, accept RPO equal to the cutover window.

    Options to quiesce:

    • Stop writers at the app or API gateway. Confirm zero incoming write requests during the chosen window. Note: background jobs and delayed queues often block quiesce.

    • DB-level measures: for Postgres, force read-only at connection pool or app layer. For MySQL, use FLUSH TABLES WITH READ LOCK for logical dumps. Long locks block reads and writes.

    Abort rule: if writers cannot be halted reliably for the estimated restore and verification time, do not proceed without replication.

    How index rebuilds, VACUUM and maintenance dominate RTO

    Restores often need reindexing or index rebuilds. Those tasks are I/O-bound. Target storage IOPS and throughput set the clock. Underprovisioned IOPS will extend restore from hours to a day.

    Common error: assuming transfer time equals downtime. That mistake causes cutovers that overrun maintenance windows.

    Pause.

    Real outage examples when skipping replication

    • Retail SaaS (anonymous): a rehearsal used a 200 GB DB. Transfer took 25 minutes on 1 Gbps. Restore and index rebuild took nine hours due to gp2 EBS with low IOPS. Root cause: not testing on production-like storage.

    • Analytics cluster (anonymous): a file-level copy occurred while the DB was running. Subtle corruption in a materialized view caused hours of diagnosis and a point-in-time restore. Root cause: not quiescing writes and skipping checksums.

    • Payment processor (anonymous): a final logical dump on a 50 GB DB with 200 writes/sec failed. The team accepted a 60-second quiesce window. Asynchronous workers resumed writes. The result: 14,000 lost transactions. Root cause: incomplete app quiesce checks.

    These examples show why rehearsals and multi-layer verification matter.

    Quantified RTO/RPO matrix and time-to-cutover estimates

    Key formula:

    transfer_time_seconds = (dataset_size_bytes * 8) / network_bandwidth_bps

    Examples (raw transfer only):

    • 10 GB over 100 Mbps ≈ 800 seconds ≈ 13 minutes.

    • 100 GB over 1 Gbps ≈ 800 seconds ≈ 13.3 minutes.

    • 1 TB over 1 Gbps ≈ 7,990 seconds ≈ 2.22 hours.

    Restore multipliers (empirical ranges from internal tests):

    • Small OLTP with few indexes: multiplier X = 1–3.

    • Large indexed OLTP: X = 3–10.

    • Heavy indexes, constraints and ANALYZE: X = 5–12.

    Sample worked RTOs (transfer + restore + verify):

    • 10 GB on 1 Gbps: transfer 1.3 minutes. Restore and verify 10–30 minutes. RTO ≈ 15–40 minutes. (Measured during rehearsal.)

    • 100 GB on 1 Gbps: transfer 13 minutes. Restore and index 1.25–3.25 hours. RTO ≈ 1.25–3.25 hours.

    • 1 TB on 1 Gbps: transfer 2.2 hours. Restore and index 6–24 hours. RTO ≈ 8–26 hours. (Observed in a 2024 client migration.)

    Risk scoring (textual):

    • Low risk: size <10 GB, write rate <10 writes/sec, expected RTO <1 hour.

    • Medium risk: size 10–100 GB, write rate 10–100 writes/sec, expected RTO 1–6 hours.

    • High risk: size >100 GB, writes >100/sec, expected RTO >6 hours.

    Storage and network notes: NVMe or gp3 with high IOPS cuts restore time. Network asymmetry such as egress throttling and provider egress costs affects timing. The operator should measure effective throughput with iperf3 before the window.

    1
    Inventory — size, writes/sec, IOPS.
    2
    Rehearse — full-size dry-run under throttled network.
    3
    Quiesce & Backup — final dump or snapshot with writes stopped.
    4
    Transfer & Restore — parallel restore, monitor I/O.
    5
    Verify — schema diff, checksums, smoke tests.
    6
    Switch — LB/DNS cutover and monitor; keep source intact.

    Add a compact quantitative comparison matrix between replication and non-replication approaches. Use the 1 TB over 100 Mbps example to show tradeoffs.

    Approach RPO RTO Infra uplift Notes
    Streaming replication 0–5s 1–15 min +20–50% monthly Continuous cost for replica and network egress.
    Cold cut (no replication) Cutover window (hours) Hours to days Minimal infra uplift High downtime cost; affects revenue during outage.

    For a 1 TB dataset over 100 Mbps, raw transfer = (1,000 GiB * 8) / 100 Mbps ≈ 88,000 seconds ≈ 24.4 hours. With restore multipliers observed in rehearsals (3–10x), expected RTO ranges from ~3 days to multi-day outages.

    Contrast this with streaming replication. It yields RPO in seconds and RTO in minutes. The tradeoff is continuous cost and operational overhead.

    Use this template with revenue-per-minute, dataset size, and bandwidth to compute a clear business decision.

    Pause.

    Detailed migration runbook: pre-cutover, cutover, and post-cutover

    This runbook assumes no replication. Roles: Primary DBA (lead), Platform Engineer, SRE, Application Owner.

    Pre-cutover checklist (copy/paste ready commands):

    • Inventory dataset size (Postgres):

    • psql -Atc "SELECT pg_database_size('dbname');"

    • Inventory dataset size (MySQL):

    • mysql -e "SELECT table_schema AS db, SUM(data_length+index_length) FROM information_schema.tables WHERE table_schema='dbname' GROUP BY table_schema;"

    • Measure active connections (Postgres):

    • psql -c "SELECT count(*) FROM pg_stat_activity WHERE state='active';"

    • Measure write throughput for a 5-minute sample (Postgres):

    • psql -c "SELECT sum(xact_commit+xact_rollback) AS tx FROM pg_stat_database;" before and after five minutes and compute the delta.

    • Create logical backup (Postgres) — fast compressed custom format with parallel jobs:

    • sudo -u postgres pg_dump -Fc -d mydb -f /backups/mydb_$(date +%F_%H%M).dump --no-owner --no-acl

    • Note: pg_dump starts with a transactional snapshot for Postgres. For high-write systems, require quiesce.

    • Create logical backup (MySQL):

    • mysqldump --single-transaction --quick --databases mydb > /backups/mydb_$(date +%F_%H%M).sql

    • For non-transactional engines or heavy writes, use Percona XtraBackup: xtrabackup --backup --target-dir=/backups/xtrabackup_dir.

    • Physical base backup (Postgres) with WAL streaming:

    • pg_basebackup -h source_host -D /backups/pgbase -Ft -z -P -X stream -U replicator

    • Transfer using rsync for cold file copy (ensure DB is stopped):

    • systemctl stop postgresql

    • rsync -avh --inplace --no-whole-file /var/lib/postgresql/ user@target:/var/lib/postgresql/

    • chown -R postgres:postgres /var/lib/postgresql

    • systemctl start postgresql

    • Transfer dump using scp with progress:

    • scp /backups/mydb_*.dump target:/backups/

    • Restore Postgres with parallel jobs:

    • pg_restore -d mydb_new -j 8 /backups/mydb_*.dump

    • Rule of thumb: set -j to the number of vCPUs or to I/O capability.

    • Restore MySQL with pv to show progress:

    • pv /backups/mydb.sql | mysql -u root -p mydb

    Parallel restore tips and common traps:

    • If using pg_restore -j, ensure the dump is in custom format.

    • Logical dumps not created for parallel restore will serialize.

    • For MySQL logical dumps, split by table and import in parallel. Use mydumper and myloader.

    • A common mistake is using a single-threaded restore on a large schema.

    Kubernetes PV migration example (CSI snapshots):

    • Create a VolumeSnapshot via CSI on the source cluster.

    • Copy the snapshot to object storage when moving clusters.

    • Create a PVC from the snapshot in the target cluster.

    • For cross-cloud moves, use Velero with restic to copy PV contents.

    Cutover run steps (operator actions, numbered):

    1. Announce the maintenance window and set a short DNS TTL one day before.

    2. Start pre-cutover health checks and confirm backups and checksums exist.

    3. Application quiesce: stop workers, set API to read-only, and drain queues.

    4. Confirm zero active writers. Check logs, pg_stat_activity, and app counters.

    5. Take the final backup or snapshot.

    6. Transfer data to target and start parallel restore.

    7. Run verification pipeline: schema diff, checksums, and sample queries.

    8. Switch traffic via LB or DNS to target. Keep source intact until verification ends.

    9. Monitor heavily for two hours, then again at 24 and 72 hours.

    Abort if any critical verification fails. Do not delete the source until all checks pass.

    Pause.

    Runbook: example scripts and snippets

    Postgres logical dump and parallel restore (copy/paste):

    • Backup:

    • sudo -u postgres pg_dump -Fc -d mydb -f /backups/mydb_$(date +%F_%H%M).dump

    • Transfer:

    • scp /backups/mydb_*.dump target:/backups/

    • Restore:

    • pg_restore -d mydb_new -j 8 /backups/mydb_*.dump

    Postgres physical base backup and WAL shipping snippet:

    • Source:

    • pg_basebackup -h src -D -Ft -z -P -U repl -X stream -l label

    • Target extract and config: untar the base backup. Create standby.signal or recovery.conf. Set a restore_command.

    MySQL logical parallel using mydumper and myloader:

    • Backup: mydumper -u root -p'pass' -B mydb -o /backups/mydumper --threads 8

    • Restore: myloader -u root -p'pass' -d /backups/mydumper -t 8 -o

    Rsync cold-copy example (must stop DB):

    • systemctl stop postgresql

    • rsync -aHAX --info=progress2 --delete /var/lib/postgresql/ user@target:/var/lib/postgresql/

    • chown -R postgres:postgres /var/lib/postgresql

    Tip: test chown and permissions on a staging run. A frequent blocker is wrong file ownership after rsync.

    Provide concrete checksum and reconciliation recipes so teams can verify bit-for-bit equivalence without guessing. For Postgres, use per-table ordered CSV hashing in chunks. The pattern:

    • On source and target run: psql -d mydb -c "COPY (SELECT * FROM mytable WHERE id BETWEEN X AND Y ORDER BY id) TO STDOUT WITH CSV" | md5sum.

    • Iterate ranges of PKs in chunks of 100k rows. Compare md5 values between source and target. Mismatches show the exact range to re-examine.

    For MySQL, use mysqldump with --single-transaction and pipe to md5sum for table-level checks. Or use pt-table-checksum for online checksum and sync at scale.

    For very large tables, compute lightweight aggregates first: COUNT, SUM of integer PKs, and MIN/MAX timestamp. Run chunked hashes only where aggregates differ.

    Document expected runtime per table given row count and IOPS so verification fits in the cutover window.

    Verification, reconciliation, and integrity checks

    Verification must be multilayered and automated where possible. Pipeline order: schema diff, row counts, per-table checksums, then app smoke tests.

    Recommended tools:

    • PostgreSQL: use pg_dump -s for schema, custom checksum scripts, and pg_comparator for deeper diffs.

    • MySQL: use pt-table-checksum and pt-table-sync from Percona Toolkit for checks and sync.

    • Generic: gzip plus sha256sum or per-table md5 or sha256 to detect bit-level differences.

    Per-table checksum example (Postgres) — copy/paste script:

    for t in $(psql -At -c "SELECT tablename FROM pg_tables WHERE schemaname='public'"); do
    
      pg_dump -a -t "$t" --column-inserts mydb | gzip | sha256sum | awk '{print $1"  "$t}'
    
    done > /tmp/postgres_checksums_source.txt
    
    

    MySQL CRC checksum per-table example (template):

    for t in $(mysql -N -e "SELECT table_name FROM information_schema.tables WHERE table_schema='mydb'"); do
    
      mysql -N -e "SELECT COUNT(*) as rows, SUM(CRC32(CONCAT_WS('#', col1, col2))) as crc FROM $t" mydb
    
    done
    
    

    Use lightweight aggregates as a fast pre-check. Then run chunked hashes where aggregates differ. Schedule verification into the cutover window.

    Pause.

    FAQ

    Q: How long will a 1 TB transfer take over 100 Mbps?

    A: Rough answer: raw transfer ≈ 24.4 hours. Restore multiplies that by 3–10x. Expect days.

    Q: Can a final logical dump capture all transactions?

    A: No. It captures a point-in-time snapshot. Any transaction after the dump is lost unless WAL or binlog is shipped.

    Q: What quick checks prove the target is good before cutover?

    A: Run schema diff, table row counts, and per-table checksums. Also run app smoke tests and end-to-end flows.

    Q: What if the app cannot quiesce writers reliably?

    A: Do not proceed without replication. A cold cut will likely lose transactions and break SLAs.

    Q: How to estimate restore time for a 500 GB DB?

    A: Measure a full rehearsal on similar storage. Use multipliers: 3–10x transfer time. Expect 1–3 days depending on IOPS.

    Q: Any recommended tools for online checksum and sync?

    A: Use Percona Toolkit's pt-table-checksum and pt-table-sync. See the docs at Percona pt-table-checksum.

    SUMMARIZE WITH AI: Extract the important

    Share this article:

    𝕏 X (Twitter) f Facebook in LinkedIn 🔥 Reddit 🐘 Mastodon 🦋 Bluesky 💬 WhatsApp 📱 Telegram 📧 Email
    • Cloud Burst vs Fixed Cloud: Seasonal E-commerce Decision Guide
    • Hosting migration emergency services: 24/7 SLA playbook
    • VPS vs Cloud for HIPAA Apps: Security & Performance
    • Avoid downtime when migrating large WordPress multisite
    Alan Curtis

    Alan Curtis

    With over 12 years of experience testing and reviewing web hosting solutions, this author is passionate about helping businesses and individuals find the best hosting, VPS, and cloud services for their needs. Covering performance, speed, uptime, migrations, and provider comparisons, every article on Host Compare is based on hands-on experience and real-world testing. Readers gain trusted insights, actionable advice, and clear guidance to choose hosting solutions confidently and optimize their websites effectively.

    Published: Thu, 26 Mar 2026
    Updated: Fri, 03 Apr 2026
    By Alan Curtis

    In Website Migration.

    tags: stateful-database-migration dump-and-restore runbook RTO-RPO checksum-verification

    Share this article

    Help us by sharing on your social networks

    𝕏 Twitter f Facebook in LinkedIn
    Legal Notice | Privacy Policy | Cookie Policy
    Article Archives

    Contactar

    © Host Compare. All rights reserved.