How to Migrate from On-Premise to AWS Without Downtime

The myth of the big-bang migration

Most migration horror stories start the same way: a weekend cutover, a team of engineers on a war-room call, and a database that won't sync correctly at 2 AM. The source environment is still live, the destination is partly broken, and there's no clean way to roll back.

The alternative is a phased migration — where traffic shifts gradually, the source stays live until the destination is proven, and every step has a tested rollback procedure. Here's how we approach it.

Start with a discovery and dependency map

Before moving anything, you need a complete picture of what's running and how the pieces talk to each other. Migrate in the wrong order and you'll break dependencies mid-move.

For each service, document:

What it is: language, framework, database, ports
What it depends on: databases, queues, APIs, shared filesystems
What depends on it: upstream callers, scheduled jobs, monitoring
Migration strategy: lift-and-shift, re-platform, or re-architect

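The output is a dependency graph. You'll migrate leaf nodes (services with no dependents) first and core shared services last. This ensures that nothing still running on-premise ever has to call a service that has already moved: cross-environment calls only go one way, from migrated services in AWS back to the dependencies they still use on-premise.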
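The leaf-first ordering described above can be derived mechanically once the dependency map exists. The following is a minimal sketch, assuming the inventory is kept as a plain mapping from each service to the services it depends on; the service names and the `migration_order` helper are hypothetical, and `graphlib` requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter


def migration_order(depends_on):
    """Return services ordered so every caller moves before what it calls."""
    # Invert the edges: for each service, collect the services that call it.
    dependents = {service: set() for service in depends_on}
    for service, dependencies in depends_on.items():
        for dependency in dependencies:
            dependents.setdefault(dependency, set()).add(service)
    # TopologicalSorter emits a node only after all of its predecessors.
    # Treating dependents as predecessors puts leaf services first.
    return list(TopologicalSorter(dependents).static_order())


# Hypothetical inventory: service -> services it depends on.
inventory = {
    "web": {"api"},
    "api": {"db", "queue"},
    "worker": {"queue"},
    "db": set(),
    "queue": set(),
}
print(migration_order(inventory))
```

With this inventory, web and worker come out ahead of api, and the shared db and queue come last. A circular dependency raises `graphlib.CycleError`, which is a useful signal that two services probably need to move together.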

Choose the right migration strategy per workload

Not everything gets migrated the same way. The "7 Rs" framework (popularized by AWS, building on Gartner's original "5 Rs") simplifies to three practical options:

Lift-and-shift (Rehost) — move the workload as-is to an EC2 instance. Fastest, lowest risk, no code changes. Good for legacy apps or anything where re-architecting isn't worth the effort. Use AWS MGN (Application Migration Service) to replicate the server continuously and cut over with minimal downtime.

Re-platform — make targeted changes to take advantage of managed services. Move from a self-managed MySQL on a VM to RDS. Move from a cron job to Lambda. Same application code, better infrastructure. Usually 20–30% more effort than lift-and-shift but meaningfully reduces ongoing ops overhead.

Re-architect — redesign the service for the cloud. Containerize it, put it on EKS, use SQS instead of direct calls. Most valuable long-term, highest effort. Reserve this for services that are high-traffic, actively developed, or currently painful to operate.

Build the target environment first

Never start migrating until the AWS environment is fully built and tested. Provision everything with Terraform before any workload moves:

module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "~> 5.0"

  name = "production"
  cidr = "10.0.0.0/16"

  azs             = ["us-east-1a", "us-east-1b", "us-east-1c"]
  private_subnets = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]
  public_subnets  = ["10.0.101.0/24", "10.0.102.0/24", "10.0.103.0/24"]

  enable_nat_gateway     = true
  single_nat_gateway     = false
  one_nat_gateway_per_az = true  # HA: one NAT gateway per AZ
  enable_dns_hostnames   = true
}

Key elements of the target environment:

VPC with public/private subnet separation
Security groups mirroring your on-premise firewall rules
RDS or Aurora for databases (not EC2-hosted)
IAM roles instead of access keys
CloudWatch for logs and metrics from day one

Establish connectivity before cutover

During the transition period, traffic will flow between on-premise and AWS. You need a reliable, low-latency connection:

VPN (Site-to-Site): Good for most migrations. AWS Site-to-Site VPN gives you up to 1.25 Gbps per tunnel, costs roughly $36/month per connection (about $0.05 per connection-hour in us-east-1, plus data transfer), and can be set up in an hour. Use BGP for dynamic routing.

AWS Direct Connect: For high-bandwidth needs or latency-sensitive workloads. Takes weeks to provision, costs more, but gives you dedicated bandwidth. Overkill for most SMB migrations.
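A quick way to judge whether a VPN is enough is to estimate how long the initial data copy would take over the link. This is a rough sketch only: the `transfer_hours` helper is hypothetical, and the 70% sustained utilization default is an assumption, not a measured figure for any real tunnel.

```python
def transfer_hours(data_terabytes, link_gbps, utilization=0.7):
    """Estimate hours to copy data over a link at a sustained utilization."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    bits = data_terabytes * 1e12 * 8  # decimal terabytes to bits
    usable_bits_per_second = link_gbps * 1e9 * utilization
    return bits / usable_bits_per_second / 3600


# Compare a few dataset sizes over a single 1.25 Gbps VPN tunnel.
for size_tb in (1, 10, 50):
    hours = transfer_hours(size_tb, 1.25)
    print(f"{size_tb} TB over one VPN tunnel: {hours:.1f} h")
```

If the estimate runs to many days for your dataset, that is a reasonable cue to look at Direct Connect rather than a VPN.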

resource "aws_vpn_gateway" "main" {
  vpc_id = module.vpc.vpc_id
  tags   = { Name = "migration-vpg" }
}

resource "aws_customer_gateway" "onprem" {
  bgp_asn    = 65000
  ip_address = "203.0.113.1"  # your on-prem public IP
  type       = "ipsec.1"
}

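resource "aws_vpn_connection" "main" {
  vpn_gateway_id      = aws_vpn_gateway.main.id
  customer_gateway_id = aws_customer_gateway.onprem.id
  type                = "ipsec.1"
  static_routes_only  = false  # use BGP for dynamic routing
}

# Propagate the BGP-learned on-prem routes into the private route tables;
# without this, instances in the VPC have no route back to on-premise.
resource "aws_vpn_gateway_route_propagation" "private" {
  count          = length(module.vpc.private_route_table_ids)
  vpn_gateway_id = aws_vpn_gateway.main.id
  route_table_id = module.vpc.private_route_table_ids[count.index]
}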

Database migration: the hard part

The database is almost always the bottleneck. You can't just copy-and-paste a running database — you need live replication so the AWS copy stays current while the source is still receiving writes.

Using AWS DMS (Database Migration Service)

DMS handles full load (initial copy) followed by CDC (change data capture) for ongoing replication:

Create a replication instance — a managed EC2 instance that runs the migration
Define source and target endpoints — your on-prem MySQL and the RDS target
Create a migration task — full load + CDC
Monitor replication lag — DMS shows how far behind the target is

Once CDC replication lag is consistently under a few seconds, you're ready for cutover.
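"Consistently under a few seconds" is easier to act on as an explicit check. The sketch below assumes you have already collected lag readings in seconds, for example from the CDCLatencySource and CDCLatencyTarget metrics that DMS publishes to CloudWatch. The `ready_for_cutover` helper, the 5-second limit, and the 30-sample window are illustrative assumptions to tune for your workload.

```python
def ready_for_cutover(lag_samples_seconds, max_lag_seconds=5.0, min_samples=30):
    """True when the latest `min_samples` readings are all at or under the limit."""
    if len(lag_samples_seconds) < min_samples:
        return False  # not enough history to call the lag "consistent"
    recent = lag_samples_seconds[-min_samples:]
    return all(sample <= max_lag_seconds for sample in recent)


# Hypothetical readings: lag was high during full load, then settled.
samples = [60.0] * 10 + [2.0] * 30
print(ready_for_cutover(samples))
```

Looking at a window of recent samples rather than a single reading avoids starting the cutover during a brief lull in an otherwise lagging replication stream.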

The cutover sequence

# 1. Take the application down for a brief maintenance window (5–10 minutes)
# 2. Wait for DMS replication lag to hit 0
# 3. Stop the DMS replication task so RDS becomes the only writable copy
# 4. Verify data integrity with checksums while both databases are idle
# 5. Update application config to point to RDS endpoint
# 6. Bring application back up on AWS
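The checksum step in the sequence above can be as simple as comparing a row count and an order-independent digest per table. This is a minimal sketch, assuming rows have already been fetched from the source and target as tuples with identical column order and Python types; `table_fingerprint` is a hypothetical helper, not part of DMS or any database driver.

```python
import hashlib


def table_fingerprint(rows):
    """Return (row_count, digest) that does not depend on row order."""
    # Hash each row separately, then sort the hashes so ordering is irrelevant.
    row_hashes = sorted(
        hashlib.sha256(repr(tuple(row)).encode("utf-8")).hexdigest()
        for row in rows
    )
    combined = hashlib.sha256("".join(row_hashes).encode("ascii")).hexdigest()
    return len(row_hashes), combined


# Hypothetical rows from the same table on each side.
source_rows = [(1, "alice", "2024-01-01"), (2, "bob", "2024-01-02")]
target_rows = [(2, "bob", "2024-01-02"), (1, "alice", "2024-01-01")]
print(table_fingerprint(source_rows) == table_fingerprint(target_rows))
```

The comparison is only meaningful while neither database is receiving writes. For large tables, computing the digest per primary-key range is likely more practical than pulling every row into one process.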

For zero-downtime migrations, use a dual-write pattern: your application writes to both the old and new database simultaneously during a transition period. This is more complex to implement but eliminates the maintenance window entirely.
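The core of a dual-write layer might look like the sketch below. It assumes the old database stays the source of truth during the transition and that both stores can be treated as simple key-value mappings. The `DualWriter` class and its method names are hypothetical, and a real implementation would also need to handle updates that arrive out of order.

```python
class DualWriter:
    """Write to the old store first, then mirror the write to the new one."""

    def __init__(self, old_store, new_store):
        self.old_store = old_store  # still the source of truth
        self.new_store = new_store  # migration target
        self.failed_keys = []       # keys to reconcile later

    def write(self, key, value):
        # A failure here propagates: the source of truth must accept the write.
        self.old_store[key] = value
        try:
            self.new_store[key] = value
        except Exception:
            # Do not fail the request; record the key for a repair pass.
            self.failed_keys.append(key)

    def reconcile(self):
        """Retry mirrored writes that failed; return keys still out of sync."""
        still_failing = []
        for key in self.failed_keys:
            try:
                self.new_store[key] = self.old_store[key]
            except Exception:
                still_failing.append(key)
        self.failed_keys = still_failing
        return still_failing


# Plain dicts stand in for the on-prem and AWS databases.
writer = DualWriter(old_store={}, new_store={})
writer.write("order:1", "paid")
print(writer.new_store, writer.failed_keys)
```

Recording failed mirror writes instead of failing the request keeps the application available, at the cost of needing a reconciliation pass before the new database can become the source of truth.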

DNS-based traffic shifting

For stateless application tiers, you can shift traffic without any downtime using weighted DNS records.

With Route 53:

resource "aws_route53_record" "app" {
  zone_id = var.zone_id
  name    = "api.yourdomain.com"
  type    = "A"

  weighted_routing_policy {
    weight = 10  # start with 10% to AWS
  }

  set_identifier = "aws"
  alias {
    name                   = aws_lb.main.dns_name
    zone_id                = aws_lb.main.zone_id
    evaluate_target_health = true
  }
}

resource "aws_route53_record" "app_onprem" {
  zone_id = var.zone_id
  name    = "api.yourdomain.com"
  type    = "A"

  weighted_routing_policy {
    weight = 90  # keep 90% on-prem initially
  }

  set_identifier = "onprem"
  records        = ["203.0.113.10"]  # on-prem load balancer IP
  ttl            = 60
}

Shift weights gradually: 10%, then 25%, then 50%, then 100%. Monitor error rates at each step. If errors appear, shift back to 0% on AWS and investigate — the on-prem environment is still running.
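The shift-and-rollback rule can be written down as a small decision function so that whoever runs the cutover is not improvising. This sketch uses the step sequence from the text; the `next_weights` helper and the 1% error-rate threshold are assumptions for illustration, and the returned values would be applied to the `weight` arguments of the two weighted records.

```python
STEPS = [10, 25, 50, 100]  # percent of traffic sent to AWS at each stage


def next_weights(current_aws_weight, error_rate, max_error_rate=0.01):
    """Return (aws_weight, onprem_weight) for the next stage of the shift."""
    if error_rate > max_error_rate:
        return 0, 100  # roll back: send all traffic to on-prem
    for step in STEPS:
        if step > current_aws_weight:
            return step, 100 - step
    return 100, 0  # already fully shifted


print(next_weights(10, error_rate=0.002))  # healthy: advance one stage
print(next_weights(25, error_rate=0.05))   # unhealthy: back to on-prem
```

Route 53 weights are relative integers (0 to 255), so using percentages that sum to 100 keeps the split easy to read. After each change, allow for the record TTL before judging error rates, since resolvers may still hold the previous answer.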

Keep rollback simple

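Every migration phase should have a written rollback procedure, tested before you start. For DNS-based cutover, rollback is easy: set the AWS weight back to 0. For database cutover, rollback means reversing the connection string change, which is only safe if writes made to RDS after cutover have been replicated back to the on-prem database (for example, with a reverse DMS task); otherwise those writes are lost. Document exactly what to run, who runs it, and what the recovery time should be.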

Keep the source environment running in parallel for at least two weeks after full cutover. Don't decommission on-premise systems until you're confident the AWS environment is stable under real load.

Post-migration: don't skip optimization

Most teams cut over and declare victory. The real work starts after:

Right-size instances: your on-prem servers were probably over-provisioned
Add Reserved Instances or Savings Plans for steady-state compute
Set up cost allocation tags so you can track spend by service
Remove the VPN once on-prem is fully decommissioned (you're paying per-hour)
Review security groups: migration often inherits over-permissive rules

A cloud migration that doesn't include post-migration optimization typically costs 30–40% more than it needs to.