Our AWS bill was getting out of control.

Not in a “we’re going bankrupt” way. More like a slow leak - the kind where you don’t notice until someone asks why you’re spending tens of thousands a month on infrastructure for a product that doesn’t need it.

So I spent six months systematically cutting costs across an enterprise SaaS platform - dedicated AWS accounts per client, dozens of environments. The result? 30% reduction in AWS spend. No outages. No degraded performance. Just less waste.

Here’s what actually worked.

The RDS logging trap

This one’s sneaky.

RDS lets you export logs to CloudWatch: general, slowquery, and error logs. Sounds useful, right?

General query logging in production is insane. Every SELECT, every INSERT, every connection attempt - all of it shipped to CloudWatch Logs, which bills by data ingested. On a busy database, that's gigabytes per day.

We kept slowquery and error logs (actually useful for debugging) and turned off general logging. Immediate savings, zero impact on our ability to troubleshoot.

# Before: logging everything
enabled_cloudwatch_logs_exports = ["general", "slowquery", "error"]

# After: logging what matters
enabled_cloudwatch_logs_exports = ["slowquery", "error"]
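For context, that parameter lives on the `aws_db_instance` resource. A minimal sketch (resource and instance names here are illustrative, not our actual config) - and it's worth pairing with a log-group retention policy so even the logs you keep don't pile up forever:

```hcl
resource "aws_db_instance" "main" {
  # ... engine, instance class, storage, etc. elided ...
  enabled_cloudwatch_logs_exports = ["slowquery", "error"]
}

# RDS writes to /aws/rds/instance/<identifier>/<log-type>.
# Cap retention so slowquery logs don't accumulate indefinitely.
resource "aws_cloudwatch_log_group" "rds_slowquery" {
  name              = "/aws/rds/instance/main/slowquery"
  retention_in_days = 30
}
```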

Fargate Spot changed everything

This was the biggest single win.

Fargate Spot is up to 70% cheaper than on-demand Fargate. The tradeoff? AWS can terminate your tasks with two minutes' notice when it needs the capacity back.

For most workloads, that’s fine.

We moved all non-production environments to 100% Spot. For production, we kept one on-demand task as a baseline and let Spot handle the rest. The key is adding a stopTimeout to your container definitions so tasks can finish in-flight requests before termination.
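In Terraform, that split is a capacity provider strategy on the service. A sketch of the idea (service and cluster names are illustrative): `base = 1` pins one task to on-demand Fargate, and the Spot provider's weight sends everything above the base to Spot.

```hcl
resource "aws_ecs_service" "app" {
  name            = "app"
  cluster         = aws_ecs_cluster.main.id
  task_definition = aws_ecs_task_definition.app.arn
  desired_count   = 4

  # Guarantee one task on on-demand Fargate as the baseline.
  capacity_provider_strategy {
    capacity_provider = "FARGATE"
    base              = 1
    weight            = 0
  }

  # Everything above the base lands on Spot.
  capacity_provider_strategy {
    capacity_provider = "FARGATE_SPOT"
    weight            = 1
  }
}

# In the task's container_definitions JSON, give containers time to
# drain in-flight requests after the SIGTERM:
#   "stopTimeout": 120   # seconds; Fargate caps this at 120
```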

We’ve been running this for months. Spot interruptions are rare, and when they happen, ECS just spins up a replacement. Users never notice.

Your VPC is costing more than you think

AWS charges $3.65/month per public IPv4 address. Doesn’t sound like much until you count them.

We were running 6 availability zones. NAT Gateways in each. Load balancers spanning all of them. Elastic IPs everywhere. And here’s the one that sneaks up on you: VPC endpoints charge per AZ too - about $7.30/month per endpoint per AZ. If you’ve got endpoints for ECR, CloudWatch Logs, S3, SSM, and Secrets Manager across 6 AZs, that’s real money.

We dropped to 3 AZs. Still highly available - AWS recommends 3 AZs for production workloads. But now we’re paying for half the NAT Gateways, half the ALB AZ-hours, half the IPv4 addresses, and half the VPC endpoint hours.

Three AZs is enough. Six is vanity.
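If your VPC is in Terraform, driving everything off a single AZ count makes this a one-line change - NAT Gateways, subnets, and interface endpoints all shrink together. A sketch (names and structure are illustrative):

```hcl
data "aws_availability_zones" "available" {
  state = "available"
}

locals {
  az_count = 3 # was 6
  azs      = slice(data.aws_availability_zones.available.names, 0, local.az_count)
}

# One NAT Gateway per AZ - halving az_count halves these too.
resource "aws_nat_gateway" "this" {
  count         = local.az_count
  subnet_id     = aws_subnet.public[count.index].id
  allocation_id = aws_eip.nat[count.index].id
}
```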

Replace your bastion hosts with SSM

We had bastion hosts in every account. SSH jump boxes sitting there 24/7, waiting for someone to connect.

We replaced them entirely with AWS Systems Manager Session Manager. SSM lets you shell into instances without opening SSH ports, without managing keys, without paying for a dedicated EC2 instance.
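The only setup an instance needs is the managed SSM policy on its instance role (the SSM agent comes preinstalled on Amazon Linux and Ubuntu AMIs). A minimal Terraform sketch, with illustrative names:

```hcl
resource "aws_iam_role" "ssm" {
  name = "ssm-session-role"
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Action    = "sts:AssumeRole"
      Principal = { Service = "ec2.amazonaws.com" }
    }]
  })
}

# The AWS-managed policy that enables Session Manager connectivity.
resource "aws_iam_role_policy_attachment" "ssm_core" {
  role       = aws_iam_role.ssm.name
  policy_arn = "arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore"
}

resource "aws_iam_instance_profile" "ssm" {
  name = aws_iam_role.ssm.name
  role = aws_iam_role.ssm.name
}
```

Connecting is then `aws ssm start-session --target <instance-id>` - no SSH key, no open port.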

  • Bastion host: ~$27/month per account (t3a.medium)
  • SSM Session Manager: $0/month

Across dozens of accounts, it adds up. Plus better security: no inbound ports, full session audit logging, IAM-based access control.

If you’re still running bastion hosts, stop.

RDS backup retention: the hidden cost

Our backup strategy was “keep everything forever, just in case.”

  • Daily RDS snapshots: 30 days
  • Weekly backups: 365 days
  • Cross-region copies: same retention

That’s a lot of storage. We cut it:

  • Daily snapshots: 14 days
  • Weekly backups: 30 days
  • Native RDS backups: 7 days flat

Still plenty of protection. Half the storage costs. If you actually need to restore from a 6-month-old backup, you probably have bigger problems.
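As Terraform, the new numbers might look like this - assuming the daily and weekly jobs run through AWS Backup, which is an assumption on my part; schedules, vault, and resource names are illustrative:

```hcl
# Native RDS automated backups: 7 days flat.
resource "aws_db_instance" "main" {
  # ... other arguments elided ...
  backup_retention_period = 7
}

resource "aws_backup_plan" "rds" {
  name = "rds-backups"

  rule {
    rule_name         = "daily"
    target_vault_name = aws_backup_vault.main.name
    schedule          = "cron(0 5 * * ? *)"
    lifecycle {
      delete_after = 14 # was 30
    }
  }

  rule {
    rule_name         = "weekly"
    target_vault_name = aws_backup_vault.main.name
    schedule          = "cron(0 5 ? * SUN *)"
    lifecycle {
      delete_after = 30 # was 365
    }
  }
}
```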

What I’d do differently

Start with visibility. We built a script to track Fargate vs Spot task distribution across all environments. Should’ve done that first. You can’t optimize what you can’t measure.

Question every default. Most AWS defaults are designed for enterprise compliance requirements you probably don’t have. Multi-AZ RDS? Maybe you need it, maybe you don’t. GuardDuty? Depends on your threat model. Don’t pay for insurance you’ll never claim.

Batch more aggressively. We spread changes out over six months, one initiative at a time. Safe, but slow. I'd batch related changes together - VPC and endpoint reductions in one push, all the Fargate Spot migrations in another. I'd still avoid a big-bang rollout, just move faster on changes that naturally group together.

The takeaway

AWS cost optimization isn’t about finding one magic setting. It’s about systematically questioning assumptions:

  • Do we need this service, or is it checkbox compliance?
  • Is this instance size based on actual load, or someone’s guess from 2019?
  • Are we paying for redundancy we don’t need?

For us, the answers saved 30%. Your mileage will vary, but I guarantee you’re overpaying somewhere.

Got a cost optimization win (or horror story)? I’d love to hear it - reach out.