Lessons learned from two years on AWS

After two years of building production systems on AWS, here are the patterns and pitfalls I’ve encountered.

What works well

Managed infrastructure reduces operational burden. Services like RDS, DynamoDB, and S3 handle replication, backups, and scaling automatically. This lets small teams run systems that would otherwise require dedicated operations staff.

Core services are battle-tested:

S3: Effectively unlimited storage, designed for 99.999999999% (eleven nines) durability. Pricing is usage-based (per GB stored, per request, and per GB transferred out), so the bill scales predictably with usage.
DynamoDB: Single-digit millisecond latency at any scale. The capacity model (provisioned versus on-demand) takes some study, but once configured correctly, it’s remarkably reliable.
RDS: Managed PostgreSQL/MySQL with automated backups, failover, and encryption. Removes most of the operational complexity of running databases.
CloudWatch: Centralized logging and metrics. The Logs Insights query language takes time to learn, but having logs and metrics in one place simplifies debugging.
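
To put the S3 durability figure in perspective, here is a rough back-of-the-envelope sketch. It assumes the figure is an annual, per-object design target and that objects are lost independently, which is a simplification rather than a statement of how S3 behaves internally. The function names are illustrative.

```python
from fractions import Fraction

# Eleven nines, kept exact with Fraction to avoid floating-point rounding.
# Assumption: this is an annual, per-object durability target.
DURABILITY = Fraction(99_999_999_999, 100_000_000_000)


def expected_annual_losses(object_count: int) -> Fraction:
    """Expected objects lost per year if each object is lost
    independently with probability (1 - DURABILITY)."""
    return object_count * (1 - DURABILITY)


def years_per_lost_object(object_count: int) -> Fraction:
    """Average number of years between single-object losses."""
    return 1 / expected_annual_losses(object_count)


if __name__ == "__main__":
    n = 10_000_000
    print(expected_annual_losses(n))  # Fraction 1/10000
    print(years_per_lost_object(n))   # 10000
```

Under these assumptions, storing ten million objects works out to an expected loss of one object roughly every 10,000 years. Durability says nothing about accidental deletes or overwrites, which remain your responsibility.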

What requires caution

Cost management demands constant attention. AWS pricing is complex, and costs can escalate quickly without monitoring:

Data transfer costs between regions and to the internet add up
Forgotten resources (unused EBS volumes, idle load balancers) accumulate charges
Some services have non-obvious pricing dimensions (CloudWatch log ingestion, API Gateway requests)

Setting up billing alerts and regular cost reviews is essential.
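
As one possible starting point for billing alerts, the sketch below builds the parameters for a CloudWatch alarm on the `EstimatedCharges` metric in the `AWS/Billing` namespace. The helper name, alarm name, threshold, and SNS topic ARN are all hypothetical. It also assumes billing alerts are enabled on the account and that the alarm is created in us-east-1, where billing metrics are published.

```python
def billing_alarm_params(threshold_usd: float, sns_topic_arn: str) -> dict:
    """Build keyword arguments for CloudWatch's PutMetricAlarm call.

    Hypothetical helper: the alarm name format and the 6-hour period
    are illustrative choices, not requirements.
    """
    if threshold_usd <= 0:
        raise ValueError("threshold_usd must be positive")
    return {
        "AlarmName": f"estimated-charges-over-{threshold_usd:g}-usd",
        "Namespace": "AWS/Billing",
        "MetricName": "EstimatedCharges",
        "Dimensions": [{"Name": "Currency", "Value": "USD"}],
        "Statistic": "Maximum",
        "Period": 21600,  # seconds; billing data updates only a few times a day
        "EvaluationPeriods": 1,
        "Threshold": float(threshold_usd),
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],  # SNS topic that notifies the team
    }


if __name__ == "__main__":
    params = billing_alarm_params(
        500, "arn:aws:sns:us-east-1:123456789012:billing-alerts"  # placeholder ARN
    )
    print(params["AlarmName"])
```

With boto3 installed and credentials configured, these parameters could be passed as `boto3.client("cloudwatch", region_name="us-east-1").put_metric_alarm(**params)`. AWS Budgets offers similar alerting with less setup, so treat this as one option rather than the recommended route.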

Service complexity varies significantly. Some services are straightforward (S3, SQS), while others have steep learning curves:

IAM policies require careful design to balance security and usability
VPC networking has many moving parts (subnets, route tables, security groups, NACLs)
The SNS/SQS/Lambda integration patterns can be confusing initially
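
To illustrate the IAM point above, here is a minimal sketch of a narrowly scoped policy: read-only access to a single prefix in a single S3 bucket. The function name, bucket, and prefix are hypothetical; the policy structure follows the standard IAM JSON policy format.

```python
import json


def read_only_prefix_policy(bucket: str, prefix: str) -> dict:
    """Return an IAM policy document allowing list and read access
    to objects under one prefix of one bucket, and nothing else."""
    prefix = prefix.strip("/")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing is a bucket-level action, restricted by prefix condition.
                "Sid": "ListPrefixOnly",
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}/*"]}},
            },
            {
                # Reading is an object-level action, restricted by resource ARN.
                "Sid": "ReadObjectsUnderPrefix",
                "Effect": "Allow",
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}/*",
            },
        ],
    }


if __name__ == "__main__":
    # "example-reports" and "monthly" are placeholder names.
    print(json.dumps(read_only_prefix_policy("example-reports", "monthly"), indent=2))
```

The split between a bucket-level statement and an object-level statement is a common source of confusion: `s3:ListBucket` applies to the bucket ARN, while `s3:GetObject` applies to object ARNs.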

Managed services have constraints. Running software as a managed service means accepting AWS’s operational decisions:

Version upgrades happen on AWS’s schedule
Configuration options are limited compared to self-managed deployments
Debugging is harder when you can’t access the underlying infrastructure

Load balancer behavior requires understanding. ELB (now split into ALB, NLB, and CLB) works well for most use cases, but edge cases exist:

Connection draining and health check timing affect deployments
Sticky sessions add complexity and can cause uneven load distribution
WebSocket and long-polling connections need specific configuration
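
The effect of health check and draining settings on deployments can be estimated with simple arithmetic. The sketch below is a simplified model with illustrative values; it ignores health check timeouts and application startup time, and the class and field names are my own rather than AWS API names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TargetGroupTiming:
    """Simplified timing model for one load balancer target group."""
    health_check_interval_s: int   # seconds between health checks
    healthy_threshold: int         # consecutive passes before receiving traffic
    unhealthy_threshold: int       # consecutive failures before removal
    deregistration_delay_s: int    # connection draining time

    def seconds_until_healthy(self) -> int:
        """Approximate wait before a new target receives traffic."""
        return self.health_check_interval_s * self.healthy_threshold

    def seconds_until_unhealthy(self) -> int:
        """Approximate time a failing target keeps receiving traffic."""
        return self.health_check_interval_s * self.unhealthy_threshold

    def seconds_per_rolling_batch(self) -> int:
        """Rough lower bound for replacing one batch of targets:
        wait for new targets to pass checks, then drain the old ones."""
        return self.seconds_until_healthy() + self.deregistration_delay_s


if __name__ == "__main__":
    # Illustrative values; check your own target group settings.
    timing = TargetGroupTiming(30, 5, 2, 300)
    print(timing.seconds_until_healthy())      # 150
    print(timing.seconds_until_unhealthy())    # 60
    print(timing.seconds_per_rolling_batch())  # 450
```

With these example values, each rolling batch takes at least seven and a half minutes, which is often the dominant cost of a deployment. Shortening the interval or the draining delay speeds things up at the risk of cutting off slow requests.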

Recommendations

Start with managed services. The operational savings usually outweigh the flexibility loss.
Implement cost monitoring early. Use AWS Budgets and Cost Explorer before you need them.
Invest in infrastructure as code. CloudFormation or Terraform makes environments reproducible and changes auditable.
Understand the shared responsibility model. AWS secures the infrastructure; you secure your configuration and data.
Plan for multi-AZ from the start. Retrofitting high availability is harder than building it in.
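
One benefit of infrastructure as code is that rules like "plan for multi-AZ" can be checked automatically. As a minimal sketch, the function below scans a CloudFormation-style template, already loaded as a Python dict, for RDS instances without `MultiAZ` enabled. The function name and the sample template are hypothetical, and the check is deliberately simple: it only recognizes a literal boolean `true` and does not account for Aurora clusters, which handle availability differently.

```python
def rds_instances_without_multi_az(template: dict) -> list[str]:
    """Return logical IDs of AWS::RDS::DBInstance resources whose
    MultiAZ property is missing or not the boolean True."""
    flagged = []
    for logical_id, resource in template.get("Resources", {}).items():
        if resource.get("Type") != "AWS::RDS::DBInstance":
            continue  # only RDS instances are checked in this sketch
        if resource.get("Properties", {}).get("MultiAZ") is not True:
            flagged.append(logical_id)
    return sorted(flagged)


if __name__ == "__main__":
    # Placeholder template; resource names and properties are illustrative.
    sample = {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Resources": {
            "OrdersDb": {
                "Type": "AWS::RDS::DBInstance",
                "Properties": {"Engine": "postgres", "MultiAZ": True},
            },
            "ReportingDb": {
                "Type": "AWS::RDS::DBInstance",
                "Properties": {"Engine": "postgres"},
            },
            "AssetsBucket": {"Type": "AWS::S3::Bucket"},
        },
    }
    print(rds_instances_without_multi_az(sample))  # ['ReportingDb']
```

A check like this could run in CI before a template is deployed, so a single-AZ database is caught in review rather than during an outage.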
