Contact

Host Compare
Host Compare
  • Home
  • Blog
  • Hosting by Use
  • Hosting Security
  • Hosting Type
  • Performance & Speed
  • Provider Reviews
  • Website Migration
  • About
  • Contact
Search
  • Home
  • Blog
  • Hosting by Use
  • Hosting Security
  • Hosting Type
  • Performance & Speed
  • Provider Reviews
  • Website Migration
  • About
  • Contact

Migration Checklist for Low‑Downtime Database Switchover

Practical migration checklist for low‑downtime switchover

Are migrations causing unpredictable downtime, inconsistent reads or frantic rollbacks? High‑risk database switchover windows often stem from missing runbooks, unclear cutover steps and inadequate validation queries. A repeatable, minute‑by‑minute checklist tailored to each database engine reduces uncertainty and keeps customers online.

Follow an operationally precise migration checklist for low‑downtime database switchover that includes dry‑runs, monitoring thresholds, validated replication, scripted cutover, and concrete rollback triggers. The content below provides playbooks for Postgres, MySQL, SQL Server and Oracle, commands, validation queries, typical timelines, and automation snippets ready for production.

Table of Contents

    Key takeaways: migration checklist for low‑downtime database switchover explained in a minute

    • Always verify replication health and data consistency before cutover. Replication lag, missing transactions or DDL drift are the most common causes of failed cutovers.
    • Predefine an exact cutover window, roles, and minute‑by‑minute runbook steps that include monitoring, DNS, connections and rollback triggers.
    • Choose replication strategy by workload: logical CDC for schema changes and zero‑write‑loss, physical streaming for minimal latency and large datasets.
    • Automate repetitive checks and include validation queries and dashboards to detect divergence within seconds.
    • Have a tested rollback plan that can be executed within the rollback time objective; rehearse rollbacks during dry‑runs.

    Practical migration checklist for low‑downtime switchover

    How to read the checklist and when to use each pattern

    The checklist is organized by phase: assessment, replication setup, dry runs, cutover, validation and rollback. Each phase contains engine‑specific commands and measurable thresholds. Use the runbook when switching primary roles, migrating between hosting providers (VPS → cloud or cloud → cloud), or upgrading major versions.

    Risk vs downtime comparison of common switchover strategies

    Strategy Typical downtime Operational risk Best when Common tools
    Read‑replica cutover Seconds–minutes Moderate (lag, delayed DDL) High reads, acceptable short lock window Native replicas, pglogical, MySQL GTID
    Logical CDC (Debezium, AWS DMS) Seconds–minutes Low (complex mapping, schema handling) Heterogeneous targets or zero‑write loss Debezium, AWS DMS, Striim
    Cutover with short write freeze Minutes Low–Moderate (user impact) Transactional integrity required, simple topology Pg logical, application feature flagging
    Blue/Green with proxy swap Sub‑second to minutes Low (complex infra) Cloud environments, containerized apps HAProxy, Envoy, Kubernetes Service

    Preparation phase: assessment and prechecks

    • Inventory topology, connection strings, client IPs and connection pools. Map every application component that writes to the database.
    • Define SLA, downtime objective and rollback objective. Example: target cutover downtime < 2 minutes, rollback possible within 5 minutes.
    • Benchmark replication capacity: run sample workload and measure replication lag under peak load.
    • Check schema compatibility: compare DDL between source and target using automated diff tools.
    • Export baseline checksums for key tables for later validation (mysqldump --skip‑lock‑tables not suitable for consistency; use consistent snapshot methods).

    Why it matters: incomplete inventories and unclear objectives cause unplanned windows and failed rollbacks.

    Common mistakes and how to avoid them: - Assuming read replicas are fully up‑to‑date—validate with WAL/GTID positions. - Skipping application connection draining—use connection pool draining and feature flags.

    Replication strategies and engine‑specific setup

    Postgres: streaming + logical replication playbook

    • Recommended approach: streaming replication for minimal latency; logical replication or pglogical for table‑level selective replication and complex DDL.
    • Key checks:
    • Verify WAL sender status: SELECT application_name, state, sent_lsn, write_lsn, flush_lsn, replay_lsn FROM pg_stat_replication;
    • Confirm replication lag threshold: (pg_current_wal_lsn() - replay_lsn) converted to bytes or time. Target: < 1s or < 10MB depending on bandwidth.
    • Example logical replication setup (pg 13+):
    -- On primary
    
    CREATE PUBLICATION mypub FOR TABLE orders, customers;
    
    
    
    -- On replica
    
    CREATE SUBSCRIPTION mysub CONNECTION 'host=primary host=db user=replicator password=secret dbname=appdb' PUBLICATION mypub;
    
    
    • Validation query examples:
    • Row counts: SELECT COUNT(*) FROM schema.table;
    • Per‑table checksum (pgcrypto): SELECT md5(string_agg(md5(t::text), '')) FROM (SELECT * FROM table ORDER BY primary_key) t;
    • Rollback triggers:
    • If replication lag > X seconds after X minutes, halt cutover and revert traffic.

    MySQL / MariaDB: GTID and replica promotion

    • Use GTID based replication to make failover deterministic.
    • Prechecks:
    • SHOW SLAVE STATUS/G and verify Seconds_Behind_Master.
    • Confirm gtid_executed positions on both sides: SHOW MASTER STATUS;
    • Promotion steps (simplified):
    • Stop writes to primary (application feature flag or maintenance mode).
    • Wait for replicas to catch up: ensure Seconds_Behind_Master = 0 and relay log applied.
    • Run STOP SLAVE; RESET SLAVE ALL; on promoted server and point apps to new host.
    • Validation queries:
    • Row counts and checksums: pt-table-checksum (Percona Toolkit) is recommended for large tables.

    SQL Server: transactional replication and failover group

    • Use Always On Availability Groups for minimal downtime when available.
    • Prechecks:
    • Verify data synchronization state: sys.dm_hadr_database_replica_states.
    • Check log synchronization and redo queue.
    • Cutover pattern:
    • Failover AG to secondary after blocking writes.
    • Verify connections and endpoint health.

    Oracle: Data Guard and GoldenGate patterns

    • Data Guard for physical or logical standby; GoldenGate for near‑zero downtime heterogeneous migrations.
    • Prechecks: check archived log apply lag and SCN positions.
    • Cutover: switchover command via DGMGRL or orchestrated GoldenGate stop/start operations.

    Dry runs and rehearsal checklist (must be repeated)

    • Schedule dry runs outside peak hours, but using production‑like load.
    • Execute the full runbook up to but not including final DNS or proxy swap.
    • Log timings for each step (how long replicas take to sync, time to drain connections, DNS TTL effects).
    • Validate rollback steps during a dry run: simulate failure at several cutover points and execute rollback procedures.

    Why it matters: dry runs reveal hidden timing and behavioral issues like slow DDL, locks and slow replicas.

    Minute‑by‑minute cutover runbook (sample 30‑minute window)

    1. T‑30: Final prechecks
    2. Verify replication health and dashboards.
    3. Ensure monitoring alerts are enabled and team channels ready.

    4. T‑20: Lock non‑essential writes

    5. Engage application feature flags to redirect background jobs.
    6. Pause scheduled jobs and batch processors.

    7. T‑10: Enter maintenance / freeze writes

    8. Put the application in read‑only mode or redirect writes to a queue.
    9. Confirm clients report write failures or queued state.

    10. T‑5: Final replication sync

    11. Wait until replica replay_lsn matches primary (Postgres) or Seconds_Behind_Master = 0 (MySQL).
    12. Run final checksum queries for critical tables.

    13. T‑2: Switch traffic

    14. Update proxy or load balancer to point to promoted node (swap VIP or update upstreams).
    15. Reduce TTL for DNS earlier so record change propagates quicker.

    16. T+1: Validation window

    17. Run smoke tests, user login flows and write tests with real‑world payloads.
    18. Check monitoring: error rate, latency, replication metrics.

    19. T+10: Confirm and finalize

    20. Remove maintenance mode and reenable background jobs.
    21. Keep elevated monitoring for at least one full business hour.

    Validation queries, metrics and alert thresholds

    • Essential validation queries:
    • Count parity: SELECT COUNT(*) FROM table;
    • Checksum spot checks: SELECT md5(string_agg(col::text, '')) FROM (SELECT col FROM table ORDER BY pk) s;
    • Last transaction timestamps: SELECT max(updated_at) FROM table;
    • Metrics and thresholds to monitor:
    • Replication lag: alert if > 5s for small datasets, > 30s for large bulk replication.
    • Error rate increase: alert on > 2x baseline 5‑minute error rate.
    • Latency: P95 read/write latency exceeding baseline by 30%.

    When is read‑replica cutover safer than rollback?

    Read‑replica cutover is safer when replicas show consistent state, replication lag is minimal, and the application can tolerate a short freeze for the final promotion. A rollback becomes necessary when replication cannot catch up, schema drift exists, or data divergence is detected after promotion.

    Practical indicators favoring replica promotion: - Replicas show zero or sub‑second lag during dry runs. - All DDLs are applied and validated on replicas. - Application supports quick DNS or proxy swap and connection draining.

    If any of these conditions fail, abort and execute rollback before redirecting production traffic.

    VPS vs cloud for low‑downtime database switchover

    • VPS (self‑managed): lower hosting cost and full control but increased operational burden for high‑availability, replication automation and fast networking between nodes.
    • Cloud managed services: typically offer built‑in replication, automated failover, lower networking latency between zones and easier blue/green deployments at higher cost.

    Comparative implications: - Network latency: cloud providers often provide lower intra‑region latency and predictable throughput, reducing replication lag during cutover. - Orchestration: cloud offers managed failover tools (RDS Multi‑AZ, Cloud SQL replicas) that shrink human steps and potential errors. - Hidden costs: cloud may charge for I/O, snapshot storage and inter‑AZ data transfer, factor these into cutover cost calculations.

    Which hosting provider reduces switchover downtime and latency?

    Providers with integrated managed DB services and private networking typically reduce switchover complexity: AWS RDS/Aurora, Google Cloud SQL/Spanner, Azure Database with Availability Zones. For self‑managed clusters on cloud VMs, choose providers with fast private network and placement groups.

    Provider selection checklist: - Verify replication topology support (cross‑region replicas, GTID support). - Check maintenance windows and automated update behavior. - Confirm available tooling for DNS, load balancers and VIP swaps.

    Example links: - AWS Database Migration Service - Debezium documentation - Percona Toolkit pt‑table‑checksum

    Hidden costs of low‑downtime migration on VPS and cloud

    • Data transfer fees for continuous CDC replication across regions.
    • Increased IOPS and backup storage during full syncs and validation workloads.
    • Engineering hours for troubleshooting replication, DDL handling and rollback rehearsals.
    • Longer maintenance windows for high cardinality tables needing reindexing or hot updates.

    Mitigation: estimate data transfer needs in dry runs, use compressed replication if supported, and budget for extra IOPS during cutover.

    What if DNS propagation or failover delays cutover?

    • Reduce DNS TTL to a low value (e.g., 60s) at least 48–72 hours before the cutover to allow caches to expire.
    • Use a layered approach: prefer proxy/Load Balancer swaps or VIP moves rather than depending solely on DNS when sub‑second or minute‑level cutover is required.
    • If propagation stalls, fallback to connection proxy rules or application config toggles to redirect traffic.

    Rollback triggers if DNS or failover delays exceed acceptable windows: - If application error rate > defined threshold after X minutes, immediately revert traffic to previous primary. - If monitoring shows data divergence after cutover, revert via documented rollback commands.

    Engine runbooks: concrete commands and rollback procedures

    Postgres runbook (example)

    Preconditions: replication established and logical replication up to date.

    Commands: - Check replication: SELECT application_name, state, pg_current_wal_lsn(), replay_lsn FROM pg_stat_replication; - Final sync: SELECT pg_sleep(1) until replay_lsn ~= current_lsn. - Promote replica: pg_ctl promote -D /var/lib/postgresql/data or SELECT pg_promote();

    Validation: - Run checksum queries on top 10 critical tables. - Run smoke transactions that exercise primary write paths.

    Rollback: - If replica promotion failed, stop writes on application and reconfigure application to original primary. - If promotion succeeded but divergence occurs, perform selective repairs from backups or apply missing WAL segments.

    MySQL runbook (example)

    Commands: - SHOW MASTER STATUS; SHOW SLAVE STATUS/G - Ensure Seconds_Behind_Master = 0 on replica. - On replica: STOP SLAVE; RESET SLAVE ALL; CHANGE MASTER TO MASTER_HOST='';

    Validation and rollback similar to Postgres: keep ability to reattach original master and resume writes.

    SQL Server and Oracle runbooks

    Provide equivalent operational steps: AG failover or GoldenGate stop/start. Include checking SCN, LSN or log sequence positions.

    Playbook automation snippets and templates

    • Example Ansible task to check Postgres replication:
    - name: Check postgres replication
    
      shell: psql -U repl -c "SELECT application_name, state FROM pg_stat_replication;"
    
      register: repl_check
    
    
    • Simple feature flag pattern to freeze writes:
    UPDATE feature_flags SET enabled=false WHERE name='allow_writes';
    
    
    • Cronable health check script idea: run validation queries and post metrics to monitoring endpoint.

    Monitoring dashboards and alert rules to enable

    • Replication lag metric (alert if > threshold for > 2 minutes).
    • P95 write latency (alert if > 2x baseline).
    • Write error rate (alert if > 1% of total writes in 1 minute).
    • Transaction commit failures (alert on any unusual SQL errors during cutover).

    Cutover timeline and responsibilities

    Cutover timeline • 30‑minute window
    ✓ Prechecks • ✓ Dry run • ✓ Rollback ready
    T‑30
    Final prechecks, monitor ready ✅
    T‑10
    Freeze writes, drain connections ⚠️
    T‑2
    Final sync and checks ✅
    T‑0
    Swap traffic, validate quickly ⚡
    Team: DBOps ✅ App Owners ✅ SRE ✅
    Rollback ready: Yes

    Semantic checklist printable (quick reference)

    • Inventory mapped and contacts listed
    • Replication health green for 1 dry run
    • DNS TTL lowered 48–72 hours prior
    • Monitoring and alerts configured
    • Feature flags ready to freeze writes
    • Final checksum queries prepared
    • Rollback steps documented and tested

    Balance strategic: what is gained and what is risked with low‑downtime switchover

    ✅ When successful: - Minimal customer impact, preserved transactional integrity, faster time to production improvements.

    ⚠️ Red flags: - Unreliable replicas, untested rollback, or heavy DDL during cutover increase failure probability.

    Common questions about migration checklist for low‑downtime database switchover

    How to reduce switchover downtime to under two minutes?

    Use prebuilt replicas, lower DNS TTL, perform final freeze of writes, and swap traffic via proxy/VIP. Validate via final WAL/GTID positions and execute a scripted proxy swap to minimize DNS dependency.

    Why is replication lag the single biggest risk during cutover?

    Replication lag causes missing transactions on the promoted node, resulting in data loss or divergence. Monitoring and thresholds for lag are essential to avoid inconsistent promotion.

    What if application connection pools keep writing during cutover?

    Drain and/or restart connection pools, use feature flags to reject writes, or implement a write‑proxy that can be toggled during the cutover window.

    Which validation queries are fast and reliable for large tables?

    Use sampling checksums (pt‑table‑checksum for MySQL), per‑partition row counts and targeted checksum queries on critical rows instead of full table scans.

    When should automated CDC like Debezium be preferred over native replica promotion?

    Choose CDC when migrating heterogeneous engines, when selective table replication is required, or when keeping both systems writable temporarily is needed.

    Action plan: start the low‑downtime switchover in under 10 minutes

    Ready in 10 minutes

    1. Verify replication health and set monitoring alerts.
    2. Switch application to read‑only mode via feature flag.
    3. Run a single checksum on top 3 critical tables and confirm parity.

    Ready in 5 minutes

    1. Confirm replica has applied all WAL/GTID and report zero lag.
    2. Execute proxy swap or promote replica and run smoke tests.

    Ready in 1 minute

    1. If smoke tests fail, revert proxy settings and execute rollback plan.

    Appendix: sample validation SQL snippets

    • Postgres latest WAL LSN compare:
    SELECT pg_current_wal_lsn(), replay_lsn FROM pg_stat_replication;
    
    
    • MySQL replication lag check:
    SHOW SLAVE STATUS/G
    
    -- check Seconds_Behind_Master
    
    
    • Simple checksum sample (Postgres):
    SELECT md5(string_agg(md5((id||col1||col2)::text), '')) FROM (SELECT id, col1, col2 FROM table ORDER BY id) s;
    
    

    Final notes on compliance and testing

    • Keep all migration actions logged and timestamped for auditability.
    • Ensure backups and point‑in‑time recovery are available before attempting schema changes.
    • Maintain communication channels with stakeholders and automated rollbacks that can be executed without manual configuration edits.
    SUMMARIZE WITH AI: Extract the important

    Share this article:

    𝕏 X (Twitter) f Facebook in LinkedIn 🔥 Reddit 🐘 Mastodon 🦋 Bluesky 💬 WhatsApp 📱 Telegram 📧 Email
    • Zero-Downtime Host Migration: Split a Monolith into Microservices
    • Migrate Next.js SSR with cache warmup and zero downtime
    • Reseller Hosting vs Cloud Reselling: Profit, Risks & Migration
    • Is moving image/video processing to serverless cost-effective?
    Alan Curtis

    Alan Curtis

    With over 12 years of experience testing and reviewing web hosting solutions, this author is passionate about helping businesses and individuals find the best hosting, VPS, and cloud services for their needs. Covering performance, speed, uptime, migrations, and provider comparisons, every article on Host Compare is based on hands-on experience and real-world testing. Readers gain trusted insights, actionable advice, and clear guidance to choose hosting solutions confidently and optimize their websites effectively.

    Published: Sat, 21 Feb 2026
    Updated: Tue, 09 Jun 2026
    By Alan Curtis

    In Website Migration.

    tags: Migration checklist for low‑downtime database switchover zero downtime migration database cutover runbook read replica cutover db migration playbook

    Share this article

    Help us by sharing on your social networks

    𝕏 Twitter f Facebook in LinkedIn
    Legal Notice | Privacy Policy | Cookie Policy
    Article Archives

    Contactar

    © Host Compare. All rights reserved.