To migrate a large media library to S3 plus a CDN without breaking links, break the work into staged, executable steps.
Plan S3 buckets, CloudFront, and a URL preservation strategy.
Run parallel incremental syncs using DataSync or the aws CLI.
Normalize metadata with S3 Batch.
Validate checksums and headers.
Stage a cutover with low TTLs and an ALIAS/DNS swap.
Keep a rollback snapshot for immediate failback.
Planning time varies with scope.
Inventory export typically takes 1–6 hours per 1M objects on a well indexed origin.
Sampling and cost modeling take 4–8 hours.
Initial seed transfer tests take a few hours to a day depending on network and file sizes.
Run at least one full dry run.
A dry run often takes 6–24 hours for many small files.
In practice, a small team can complete planning and a verified dry run in 1–7 days for multi‑million libraries.
Summary of the process
- Inventory and cost estimate for the library with sample throughput numbers and per‑million object cost.
- Create S3 bucket topology, CloudFront distributions, and IAM and CORS policies.
- Run parallel incremental syncs using DataSync or aws s3 plus S3 Batch for metadata fixes.
- Preserve exact URLs using CloudFront origin rewriting, Lambda@Edge, or a reverse proxy.
- Validate parity with checksums and header checks and generate a redirect map for exceptions.
- Stage cutover with lowered TTLs and ALIAS/DNS swap; monitor error rates and pageviews.
- Rollback with versioned objects, a snapshot origin, and automated DNS reversion if parity fails.
The planning phase sets the success envelope.
The engineer must produce a full inventory that lists path, object count, average size, and current headers.
The engineer must estimate time and cost per million objects.
A practical inventory export is a CSV with columns: path, size_bytes, last_modified, content_type, cache_control, orig_url_status.
For multi‑million libraries the inventory export step alone usually takes between 1 and 6 hours.
This depends on source throughput and database performance.
The typical planning error is assuming all files are homogeneous.
Most sites have 5–7 distinct header patterns.
Expect a 2–3% set of legacy files with broken names or missing extensions.
Use this command to extract a listing from a Linux origin that stores paths on disk or a mounted filesystem.
This command runs in about 1–2 hours per 1M files depending on I/O.
find /var/www/uploads -type f -printf "%P,%s,%TY-%Tm-%TdT%TH:%TM:%TS\n" > inventory_raw.csv
If the site stores media references in a database, export the media table with SQL.
WordPress, Magento, and Discourse all store media references differently.
Expert opinion: accurate inventory reduces unexpected remediation by 20–35% during sync.
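A first pass over the raw listing can surface the mix of file types and the legacy files without extensions mentioned above. The following is a minimal sketch, assuming the three-column path,size,mtime layout produced by the find command; the function name summarize_inventory and the sample rows are illustrative only.

```python
import csv
import io
import os
from collections import Counter

def summarize_inventory(rows):
    """Summarize (path, size_bytes, mtime) rows from the find export."""
    count = 0
    total_bytes = 0
    by_ext = Counter()
    missing_ext = []
    for row in rows:
        # A path may itself contain commas, so treat the last two fields
        # as size and mtime and rejoin whatever precedes them as the path.
        path = ",".join(row[:-2])
        size = int(row[-2])
        ext = os.path.splitext(path)[1].lower()
        count += 1
        total_bytes += size
        by_ext[ext or "(none)"] += 1
        if not ext:
            missing_ext.append(path)
    avg = total_bytes / count if count else 0.0
    return {"objects": count, "avg_bytes": avg,
            "by_ext": by_ext, "missing_ext": missing_ext}

# Made-up sample rows in the same shape as inventory_raw.csv.
sample = io.StringIO(
    "2019/01/photo.jpg,204800,2019-01-05T10:00:00.0000000000\n"
    "2019/01/photo-150x150.jpg,10240,2019-01-05T10:00:01.0000000000\n"
    "legacy/README,512,2012-03-01T08:30:00.0000000000\n"
)
summary = summarize_inventory(csv.reader(sample))
print(summary["objects"], round(summary["avg_bytes"]), summary["missing_ext"])
```

On a real export, the same function could be fed csv.reader(open("inventory_raw.csv", newline="")). Header patterns (content_type, cache_control) are not in the find output and would need a separate HTTP sampling pass.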
Cost and throughput planning numbers to use as anchors as of 2026:
- S3 Standard in us-east-1 ≈ $0.023 per GB‑month.
- S3 PUT/COPY/POST ≈ $0.005 per 1,000 requests.
- S3 GET ≈ $0.0004 per 1,000 requests.
- CloudFront egress ≈ $0.085 per GB for the first 10 TB.
Use those values to generate a per‑million object estimate.
For example, at 0.5 MB average size a million objects ≈ 488 GB stored.
That costs ≈ $11.22 per month for S3 storage.
PUTs to upload 1M objects cost ≈ $5 one time.
GETs for validation (head-object) for 1M objects cost ≈ $0.4.
CloudFront first fetch egress for 488 GB at $0.085/GB ≈ $41.5.
Those numbers change by region and usage patterns.
Always fetch live pricing before a large migration.
See AWS S3 pricing and DataSync docs for current numbers: https://aws.amazon.com/s3/pricing/ and https://docs.aws.amazon.com/datasync/latest/userguide/what-is-datasync.html.
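The per-million arithmetic above can be reproduced with a few lines, which makes it easy to rerun with live prices or a different average size. This is a sketch only: the constants are the anchor prices listed above, and the function name estimate_costs is hypothetical.

```python
# Anchor prices copied from the list above; refresh from live pricing before use.
S3_STORAGE_PER_GB_MONTH = 0.023
S3_PUT_PER_1000 = 0.005
S3_GET_PER_1000 = 0.0004
CLOUDFRONT_EGRESS_PER_GB = 0.085

def estimate_costs(object_count, avg_size_mb):
    """Return rough USD figures for storing and first-serving a library."""
    stored_gb = object_count * avg_size_mb / 1024  # MB -> GB, binary, as in the text
    return {
        "stored_gb": stored_gb,
        "storage_per_month": stored_gb * S3_STORAGE_PER_GB_MONTH,
        "put_one_time": object_count / 1000 * S3_PUT_PER_1000,
        "head_validation": object_count / 1000 * S3_GET_PER_1000,
        "cdn_first_fetch": stored_gb * CLOUDFRONT_EGRESS_PER_GB,
    }

est = estimate_costs(1_000_000, 0.5)
for name, value in est.items():
    print(f"{name}: {value:.2f}")
```

The storage figure can differ from the text by a cent because the text rounds the volume to 488 GB before multiplying.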
Bucket topology and naming
Design buckets for ownership and lifecycle.
For most migrations use a single region and one bucket per environment.
Example names: example-com-media-prod and example-com-media-staging.
Use prefixes to mirror the original path structure.
Make the object key equal the original URL path minus the leading slash.
The common trap is adding a hashed or date prefix.
That breaks 1:1 URL preservation and forces application or CDN rewrites.
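The key rule is simple enough to express as a function. The sketch below assumes that query strings do not select different files on the origin, so they are dropped; url_to_key is an illustrative name, not part of any tool.

```python
from urllib.parse import urlsplit, unquote

def url_to_key(url):
    """Map an original media URL to the S3 object key that preserves it 1:1."""
    path = urlsplit(url).path          # drops scheme, host, query string, fragment
    return unquote(path).lstrip("/")   # keys are stored decoded, without a leading slash

print(url_to_key("https://example.com/wp-content/uploads/2019/01/photo.jpg?ver=2"))
print(url_to_key("https://example.com/uploads/legacy%20name(1).png"))
```

Running every inventory path through one function like this, rather than building keys ad hoc in several scripts, keeps the bucket layout identical to the URL space.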
Permissions and access controls must be layered.
Public objects should be served via CloudFront using an Origin Access Identity or Origin Access Control to keep the bucket private.
Private objects require signed URLs.
The engineer must audit which URL patterns use presigned URLs or an auth service before cutover.
Step 1 Inventory and cost estimate per million objects
Start with a realistic sample.
Pick a 100k subset that reflects the actual mix.
Include images, video, thumbnails, and downloads.
Run sampling to estimate average size and header patterns.
The following rough model works for budgeting:
- Storage per million objects at 0.5 MB avg = ~488 GB => S3 storage ≈ $11–13/month (at the S3 Standard rate listed above).
- PUTs to upload 1M objects at $0.005 per 1,000 = $5 one time.
- GETs for validation (head-object) 1M at $0.0004 per 1,000 = $0.4.
- CloudFront first fetch egress for 488 GB at $0.085/GB = $41.5.
- S3 Batch and DataSync have small per‑job overhead but avoid extended EC2 transfer costs.
Estimating time: with DataSync, a single task that saturates a 1 Gbps link transfers at most ~450 GB/hr (125 MB/s).
Practically expect 200–400 GB/hr accounting for overhead.
For 488 GB expect 1–3 hours per million objects at high throughput.
If many small files exist, throughput drops.
For many small files plan 6–24 hours per million objects.
Snowball Edge is an option when network egress is too slow.
Physical shipment adds 3–7 days.
💡 Tip
Measure files per second on the origin rather than raw bandwidth.
Small files kill throughput due to API and disk overhead.
Expect 5–50 files per second over a good link depending on file sizes.
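One way to apply this tip is to compute both limits and take the slower one. The sketch below is a simplification: it assumes files_per_second is the aggregate rate across all parallel workers, and the function name and sample inputs are illustrative.

```python
def estimate_transfer_hours(total_gb, object_count, gb_per_hour, files_per_second):
    """Return the slower of the bandwidth-bound and per-file-bound estimates."""
    bandwidth_hours = total_gb / gb_per_hour
    per_file_hours = object_count / files_per_second / 3600
    return max(bandwidth_hours, per_file_hours)

# Many small files: 1M objects totaling 488 GB is limited by files per second.
small_files = estimate_transfer_hours(488, 1_000_000, gb_per_hour=400, files_per_second=50)

# Few large files: 10,000 objects totaling 488 GB is limited by bandwidth.
large_files = estimate_transfer_hours(488, 10_000, gb_per_hour=400, files_per_second=50)

print(f"small files: {small_files:.1f} h, large files: {large_files:.1f} h")
```

The same volume takes several times longer when it is spread over a million objects, which is why the measured files-per-second rate matters more than link speed for thumbnail-heavy libraries.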
Step 2 Run parallel incremental syncs with DataSync, aws s3 sync, and S3 Batch
For multi‑million libraries use this pattern.
Do a bulk copy pass first.
Then run incremental passes.
Finish with a parity pass.
There are two practical approaches.
The fast method uses DataSync or rsync plus the aws CLI.
The correct method adds S3 Batch for metadata and header normalization.
Use both methods together.
Fast method to seed S3 uses DataSync or the aws CLI.
DataSync handles parallelism and is resilient.
Create a DataSync task from on‑prem to S3 and start incremental executions.
Example start command after locations exist:
aws datasync start-task-execution --task-arn arn:aws:datasync:us-east-1:111122223333:task/task-0123456789abcdef
If direct network transfer is acceptable and the origin is accessible, a seeded aws s3 cp is simpler:
aws s3 cp /var/www/uploads s3://example-com-media-prod/ --recursive --only-show-errors --storage-class STANDARD
The trap: aws s3 cp and aws s3 sync do not set Content-Type, Cache-Control, or custom metadata consistently.
The correct method uses S3 Batch Operations with a manifest.
The job copies objects to themselves while replacing metadata or ACLs.
Sample CSV manifest with two rows.
S3 Batch Operations CSV manifests take no header row; each row is bucket,key, optionally followed by a version ID.
Object keys must be URL‑encoded if necessary.
example-com-media-prod,path/to/image1.jpg
example-com-media-prod,path/to/legacy%20name(1).png
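A manifest like this is usually generated from the inventory rather than written by hand. The following sketch writes bucket,key data rows with URL-encoded keys; write_manifest is a hypothetical helper, and encoding everything except "/" is an assumption that errs on the side of encoding more characters than the sample above does.

```python
import csv
import io
from urllib.parse import quote

def write_manifest(bucket, keys, out):
    """Write bucket,key data rows with URL-encoded keys."""
    writer = csv.writer(out, lineterminator="\n")
    for key in keys:
        # Keep "/" readable; percent-encode spaces and other reserved characters.
        writer.writerow([bucket, quote(key, safe="/")])

buf = io.StringIO()
write_manifest(
    "example-com-media-prod",
    ["path/to/image1.jpg", "path/to/legacy name(1).png"],
    buf,
)
print(buf.getvalue(), end="")
```

For a real job the output would go to a file that is uploaded to S3 and referenced when the job is created.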
S3 Batch job JSON template for a copy operation that enforces content-type and cache-control (partial example; custom metadata goes under NewObjectMetadata.UserMetadata):
{
  "AccountId": "111122223333",
  "Operation": {
    "S3PutObjectCopy": {
      "TargetResource": "arn:aws:s3:::example-com-media-prod",
      "CannedAccessControlList": "private",
      "MetadataDirective": "REPLACE",
      "NewObjectMetadata": {
        "ContentType": "image/jpeg",
        "CacheControl": "max-age=31536000,public",
        "UserMetadata": {"migrated": "true"}
      }
    }
  }
}
Note that the S3 Batch UI accepts a CSV manifest.
The batch job will copy objects to themselves and apply new metadata.
This is the correct way to normalize Content-Type and Cache-Control at scale.
A typical S3 Batch job for 1M objects takes several hours.
Start with a smaller job for testing.
A common error is skipping metadata normalization and serving thumbnails as application/octet-stream.
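Because a copy job applies one set of new metadata to every object in its manifest, normalizing Content-Type in practice means one manifest and one job per content type. The sketch below groups keys by the type guessed from the file extension using Python's standard mimetypes module; the function name and sample keys are illustrative, and extension-based guessing is an assumption that will miss files with wrong or missing extensions.

```python
import mimetypes
from collections import defaultdict

def group_keys_by_content_type(keys, default="application/octet-stream"):
    """Group object keys by the Content-Type guessed from the file extension."""
    groups = defaultdict(list)
    for key in keys:
        guessed, _encoding = mimetypes.guess_type(key)
        groups[guessed or default].append(key)
    return dict(groups)

groups = group_keys_by_content_type([
    "2019/01/photo.jpg",
    "2019/01/photo-150x150.jpg",
    "icons/logo.png",
    "legacy/README",
])
for content_type, keys in sorted(groups.items()):
    print(content_type, len(keys))
```

Keys that land in the default group are the ones worth inspecting by hand before any job rewrites their metadata.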
Example s3cmd and aws s3 sync commands
s3cmd helps when MIME detection needs fine control:
s3cmd sync --guess-mime-type --acl-private /var/www/uploads/ s3://example-com-media-prod/
aws s3 sync example with upload integrity checksums via --checksum-algorithm (AWS CLI v2):
aws s3 sync /var/www/uploads s3://example-com-media-prod/ --checksum-algorithm SHA256
If the environment lacks modern CLI versions, rsync per-directory to a staging host.
Then use aws s3 cp from the staging host.
Flow of the incremental sync process: Inventory → Seed Copy (DataSync) → S3 Batch Metadata Fix → Parity Check → Cutover
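For the parity step, one inexpensive check compares a local MD5 with the ETag returned by a head-object call. This is a sketch under stated assumptions: an ETag equals the MD5 of the content only for single-part uploads that are not encrypted with SSE-KMS or SSE-C, so the function returns None when the ETag looks like a multipart one. The names local_md5 and etag_parity and the demo values are illustrative.

```python
import hashlib
import os
import tempfile

def local_md5(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large media files do not load into memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def etag_parity(path, etag):
    """Return True/False for single-part ETags, None when MD5 comparison does not apply."""
    etag = etag.strip('"')
    if "-" in etag:
        return None  # multipart upload: the ETag is not the MD5 of the whole object
    return local_md5(path) == etag.lower()

# Demo with a temporary file and made-up ETag values.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello media")
demo_etag = '"' + hashlib.md5(b"hello media").hexdigest() + '"'
single_part = etag_parity(tmp.name, demo_etag)
multipart = etag_parity(tmp.name, '"9b2cf535f27731c974343645a3985328-3"')
os.remove(tmp.name)
print(single_part, multipart)
```

Objects that return None need another comparison, such as size plus a checksum stored at upload time.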
Step 3 Preserve original image URLs without root access
There are three proven options to preserve 1:1 URLs.
Choose a reverse proxy, a CloudFront origin rewrite to the S3 website endpoint, or a Lambda@Edge origin request rewrite.
Each option carries tradeoffs.
- Reverse proxy acting as origin. Use Nginx or HAProxy on the origin host. This is the lowest-latency option. The origin continues to respond for unchanged objects and proxies migrated objects to S3. The drawback is needing host access and additional CPU and network on the origin. Use this when the origin is under control and can run proxy software.
- CloudFront custom origin pointing to the S3 website endpoint. The S3 website endpoint preserves path semantics such as /images/photo.jpg. It also supports S3 website-style redirects. But it does not support Origin Access Identity or Origin Access Control. If the bucket must remain private, use the S3 REST virtual-hosted endpoint as the CloudFront origin. Then restrict bucket access to CloudFront via an OAI principal, or via OAC using the CloudFront service principal with an aws:SourceArn condition. Use the website endpoint only when public access is acceptable or when relying on S3 website redirects. Tradeoff: 4xx behavior differs between S3 REST and the website endpoint.
- Lambda@Edge origin request rewrite. Attach a small function at the CloudFront origin request event and rewrite the origin to the S3 bucket for migrated objects. Pros: no change to the origin host and exact host and path are preserved. Cons: slight latency increase (1–15 ms typical), cost per million invocations, and added deployment complexity.
Lambda@Edge sample (Node.js) that rewrites origin to an S3 website endpoint.
Test carefully in a staging distribution before production.
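The origin request rewrite can be sketched as follows. This is a Python illustration of the technique, not the Node.js sample referred to above (Lambda@Edge also offers Python runtimes). The website endpoint name, the MIGRATED_PREFIXES list, and the timeout values are hypothetical; a real deployment would derive the migrated set from the inventory.

```python
# Hypothetical values for illustration only.
WEBSITE_ENDPOINT = "example-com-media-prod.s3-website-us-east-1.amazonaws.com"
MIGRATED_PREFIXES = ("/wp-content/uploads/2019/", "/wp-content/uploads/2020/")

def handler(event, context):
    """CloudFront origin-request handler: send migrated paths to S3, leave the rest alone."""
    request = event["Records"][0]["cf"]["request"]
    if not request["uri"].startswith(MIGRATED_PREFIXES):
        return request  # not migrated yet: keep the configured origin

    # S3 website endpoints speak HTTP only, so this is a custom origin on port 80.
    request["origin"] = {
        "custom": {
            "domainName": WEBSITE_ENDPOINT,
            "port": 80,
            "protocol": "http",
            "path": "",
            "sslProtocols": ["TLSv1.2"],
            "readTimeout": 30,
            "keepaliveTimeout": 5,
            "customHeaders": {},
        }
    }
    # The Host header must match the new origin so S3 can resolve the bucket.
    request["headers"]["host"] = [{"key": "Host", "value": WEBSITE_ENDPOINT}]
    return request

def _event(uri):
    """Build a minimal origin-request event for local testing."""
    return {"Records": [{"cf": {"request": {
        "uri": uri,
        "headers": {"host": [{"key": "Host", "value": "www.example.com"}]},
        "origin": {"custom": {"domainName": "origin.example.com"}},
    }}}]}

migrated = handler(_event("/wp-content/uploads/2019/01/photo.jpg"), None)
untouched = handler(_event("/wp-content/uploads/2012/old.jpg"), None)
print(migrated["origin"]["custom"]["domainName"])
print(untouched["origin"]["custom"]["domainName"])
```

The request URI is left unchanged, so the viewer-facing URL and the object key stay identical, which is the point of the 1:1 key layout.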