Cloud Cost Optimization Reference
A reference guide to cloud cost optimization covering reserved capacity and Savings Plans, Spot/preemptible instances, right-sizing, and FinOps practices.
FinOps Framework
FinOps Principles
- Teams need to collaborate - Finance, engineering, and business work together
- Everyone takes ownership - Decentralized cost responsibility
- A centralized team drives FinOps - Center of excellence for best practices
- Reports should be accessible and timely - Real-time visibility
- Decisions are driven by business value - Cost per business outcome
- Take advantage of the variable cost model - Scale up and down as needed
FinOps Lifecycle
        Inform
          |
     +---------+
     |         |
     v         v
 Optimize --> Operate
     ^         |
     |         |
     +---------+
Inform Phase
- Visibility into cloud spend
- Allocation and showback
- Benchmarking and forecasting
Optimize Phase
- Rate optimization (RIs, savings plans)
- Usage optimization (right-sizing)
- Architectural optimization
Operate Phase
- Continuous improvement
- Automation and governance
- Anomaly detection
Compute Cost Optimization
Reserved Instances / Savings Plans
AWS Savings Plans
| Type | Flexibility | Savings |
|---|---|---|
| Compute Savings Plans | Any EC2, Fargate, Lambda | Up to 66% |
| EC2 Instance Savings Plans | Specific instance family, region | Up to 72% |
| Reserved Instances | Specific instance type, AZ | Up to 72% |
Commitment Strategy
Baseline (always-on): 1-year or 3-year Savings Plans
Variable (predictable): Scheduled scaling on On-Demand (AWS no longer sells Scheduled Reserved Instances)
Spiky (unpredictable): On-Demand + Spot
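One way to split usage into the three bands above is to look at the distribution of hourly instance counts. The sketch below is illustrative only: the function names, the percentile cut-offs, and the sample data are assumptions, not an AWS-defined method.

```python
import math


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an ascending, non-empty list (pct in 0-100)."""
    rank = max(1, math.ceil(pct / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def split_usage_bands(hourly_instances, baseline_pct=10, predictable_pct=90):
    """Split hourly instance counts into baseline, predictable, and spiky bands.

    Assumption: the baseline is the 10th percentile of hourly usage, and
    anything above the 90th percentile is treated as spiky.
    """
    values = sorted(hourly_instances)
    baseline = percentile(values, baseline_pct)
    predictable_top = percentile(values, predictable_pct)
    return {
        "baseline": baseline,                       # candidate for Savings Plans
        "predictable": predictable_top - baseline,  # recurring, schedulable load
        "spiky": values[-1] - predictable_top,      # On-Demand + Spot
    }


# Hypothetical day of hourly instance counts
usage = [4] * 8 + [10] * 12 + [10, 12, 18, 25]
bands = split_usage_bands(usage)
print(bands)
```

In practice the input would cover weeks of data rather than one day, and the cut-offs would be tuned to how much commitment risk the team accepts.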
Azure Reservations
# Azure CLI - Purchase reservation
az reservations reservation-order purchase \
  --sku Standard_D2s_v3 \
  --term P1Y \
  --billing-scope /subscriptions/{subscription-id} \
  --quantity 10 \
  --applied-scope-type Shared
GCP Committed Use Discounts
- Resource-based: Specific vCPUs and memory
- Spend-based: Dollar commitment for flexibility
- Typical resource-based rates: about 37% off for 1-year and 55% off for 3-year terms (varies by machine type)
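A commitment is billed whether or not it is used, so a discount only pays off above a certain utilization. The sketch below works through that arithmetic for the discount levels listed above; the function names are hypothetical and on-demand price is normalized to 1.

```python
def break_even_utilization(discount):
    """Fraction of committed hours that must be used to match on-demand cost."""
    if not 0 <= discount < 1:
        raise ValueError("discount must be in [0, 1)")
    return 1 - discount


def effective_savings(discount, utilization):
    """Net savings versus on-demand for the hours actually used.

    A negative result means the commitment cost more than on-demand would have.
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    committed_cost = 1 - discount  # paid for every committed hour
    on_demand_cost = utilization   # what the used hours would cost on demand
    return (on_demand_cost - committed_cost) / on_demand_cost


for term, discount in (("1-year", 0.37), ("3-year", 0.55)):
    print(term, "breaks even at", f"{break_even_utilization(discount):.0%}", "utilization")
```

A 3-year commitment at 55% off that is only used half the time still saves about 10% on the used hours, but the same commitment used 40% of the time loses money.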
Spot/Preemptible Instances
When to Use Spot
- Batch processing and analytics
- CI/CD build agents
- Stateless web servers (with auto-scaling)
- Machine learning training
- Development and testing environments
AWS Spot Best Practices
# EC2 Auto Scaling with Spot
MixedInstancesPolicy:
  InstancesDistribution:
    OnDemandBaseCapacity: 2
    OnDemandPercentageAboveBaseCapacity: 20
    SpotAllocationStrategy: capacity-optimized
  LaunchTemplate:
    Overrides:
      - InstanceType: m5.large
      - InstanceType: m5a.large
      - InstanceType: m4.large
      - InstanceType: r5.large
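To see what the two On-Demand settings above mean for a given group size, the split can be worked out by hand. The sketch below is a rough model: the function name is hypothetical, and rounding fractional On-Demand shares up is an assumption rather than documented Auto Scaling behavior.

```python
def split_capacity(desired, on_demand_base, on_demand_pct_above_base):
    """Estimate the On-Demand/Spot split for a mixed instances policy."""
    base = min(desired, on_demand_base)
    above_base = desired - base
    # Assumption: a fractional On-Demand share is rounded up (ceiling division)
    on_demand_above = -((-above_base * on_demand_pct_above_base) // 100)
    on_demand = base + on_demand_above
    return {"on_demand": on_demand, "spot": desired - on_demand}


# Values from the policy above: base of 2, then 20% On-Demand above the base
print(split_capacity(12, 2, 20))
```

With a desired capacity of 12, two instances form the On-Demand base and the remaining ten are split 20/80, giving four On-Demand and eight Spot instances.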
Spot Interruption Handling
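# Check for a Spot interruption notice (AWS, IMDSv2)
import requests

IMDS = "http://169.254.169.254/latest"

def graceful_shutdown():
    print("Spot interruption notice received")  # Placeholder: drain and checkpoint here

def check_spot_termination():
    try:
        # IMDSv2 needs a session token before metadata can be read
        token = requests.put(f"{IMDS}/api/token", timeout=2,
                             headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"}).text
        response = requests.get(f"{IMDS}/meta-data/spot/termination-time", timeout=2,
                                headers={"X-aws-ec2-metadata-token": token})
    except requests.exceptions.RequestException:
        return False  # Metadata service unreachable (e.g. not running on EC2)
    if response.status_code == 200:
        graceful_shutdown()  # 2-minute warning: shut down gracefully
        return True
    return False  # 404: no interruption scheduled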
GCP Preemptible/Spot VMs
# Terraform - GCP Spot VM
# Required boot_disk and network_interface blocks omitted for brevity
resource "google_compute_instance" "spot" {
  name         = "spot-instance"
  machine_type = "n2-standard-4"

  scheduling {
    preemptible                 = true
    automatic_restart           = false
    provisioning_model          = "SPOT"
    instance_termination_action = "STOP"
  }
}
Right-Sizing
Analysis Process
- Collect metrics (CPU, memory, network, disk I/O)
- Identify idle or underutilized resources
- Recommend appropriate instance size
- Implement changes during maintenance windows
- Monitor and iterate
AWS Compute Optimizer
# Enable Compute Optimizer
aws compute-optimizer update-enrollment-status \
  --status Active \
  --include-member-accounts

# Get recommendations for over-provisioned instances
aws compute-optimizer get-ec2-instance-recommendations \
  --filters name=Finding,values=Overprovisioned
Right-Sizing Thresholds
| Metric | Underutilized | Optimal | Overutilized |
|---|---|---|---|
| CPU | <20% avg | 40-60% avg | >80% avg |
| Memory | <30% avg | 50-70% avg | >85% avg |
| Network | <10% capacity | Variable | >80% capacity |
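The thresholds in the table translate directly into a simple classifier. This is a minimal sketch, assuming average utilization percentages are already available; the function names and the three-way recommendation rule are illustrative choices, not part of any provider tool.

```python
# (underutilized below, overutilized above), taken from the table above
THRESHOLDS = {
    "cpu": (20, 80),
    "memory": (30, 85),
    "network": (10, 80),
}


def classify(metric, avg_pct):
    """Classify one average utilization value against the table thresholds."""
    low, high = THRESHOLDS[metric]
    if avg_pct < low:
        return "underutilized"
    if avg_pct > high:
        return "overutilized"
    return "within range"


def recommend(avg_metrics):
    """Suggest an action from a dict such as {"cpu": 12, "memory": 25}."""
    findings = [classify(name, value) for name, value in avg_metrics.items()]
    if "overutilized" in findings:
        return "upsize"    # any constrained resource blocks a downsize
    if findings and all(f == "underutilized" for f in findings):
        return "downsize"  # only when every metric is idle
    return "keep"


print(recommend({"cpu": 12, "memory": 25, "network": 5}))
```

Requiring every metric to be underutilized before downsizing is deliberately conservative: an instance with idle CPU but busy memory would otherwise be shrunk into a memory problem.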
Azure Advisor Recommendations
# Get cost recommendations
az advisor recommendation list \
  --category Cost \
  --query "[?impact=='High']"
Storage Cost Optimization
Object Storage Tiering
AWS S3 Storage Classes
S3 Standard
|
| (30 days)
v
S3 Standard-IA
|
| (90 days)
v
S3 Glacier Instant Retrieval
|
| (180 days)
v
S3 Glacier Deep Archive
Lifecycle Policy Example
{
  "Rules": [
    {
      "ID": "OptimizeCosts",
      "Status": "Enabled",
      "Filter": { "Prefix": "logs/" },
      "Transitions": [
        { "Days": 30, "StorageClass": "STANDARD_IA" },
        { "Days": 90, "StorageClass": "GLACIER" },
        { "Days": 365, "StorageClass": "DEEP_ARCHIVE" }
      ],
      "Expiration": { "Days": 730 }
    }
  ]
}
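To check what the policy above does to an object of a given age, the rule can be modeled as a lookup. This is a sketch under two assumptions: objects are uploaded as STANDARD, and transitions apply exactly on the listed day (S3 runs lifecycle actions asynchronously, so real transitions can lag). The function name is hypothetical.

```python
# Transitions and expiration copied from the lifecycle policy above
TRANSITIONS = [(30, "STANDARD_IA"), (90, "GLACIER"), (365, "DEEP_ARCHIVE")]
EXPIRATION_DAYS = 730


def storage_class_at(age_days):
    """Return the expected storage class for an object, or None once expired."""
    if age_days >= EXPIRATION_DAYS:
        return None
    current = "STANDARD"  # assumption: objects start in S3 Standard
    for days, storage_class in TRANSITIONS:
        if age_days >= days:
            current = storage_class
    return current


for age in (0, 45, 120, 400, 800):
    print(age, storage_class_at(age))
```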
S3 Intelligent-Tiering
- Automatic tiering based on access patterns
- No retrieval fees
- Small monitoring fee per object
- Best for unpredictable access patterns
Block Storage Optimization
EBS Volume Selection
| Type | Use Case | $/GB-month (us-east-1) |
|---|---|---|
| gp3 | General purpose | $0.08 |
| gp2 | Legacy (migrate to gp3) | $0.10 |
| io2 | High IOPS databases | $0.125+ |
| st1 | Throughput (big data) | $0.045 |
| sc1 | Cold archives | $0.015 |
gp3 Migration (20% savings)
# Modify EBS volume from gp2 to gp3
aws ec2 modify-volume \
  --volume-id vol-12345678 \
  --volume-type gp3 \
  --iops 3000 \
  --throughput 125
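The 20% figure follows from the per-GB prices in the volume table. The sketch below works through the arithmetic; it assumes those table prices and a gp3 volume left at its baseline 3000 IOPS and 125 MiB/s, since provisioned IOPS or throughput above the baseline is billed separately and is not modeled here.

```python
GP2_PRICE_PER_GB = 0.10  # from the EBS table above
GP3_PRICE_PER_GB = 0.08  # from the EBS table above


def gp3_monthly_savings(size_gb, gp2_price=GP2_PRICE_PER_GB, gp3_price=GP3_PRICE_PER_GB):
    """Monthly storage savings in USD from moving one volume from gp2 to gp3."""
    return size_gb * (gp2_price - gp3_price)


def gp3_savings_ratio(gp2_price=GP2_PRICE_PER_GB, gp3_price=GP3_PRICE_PER_GB):
    """Savings as a fraction of the gp2 storage price."""
    return (gp2_price - gp3_price) / gp2_price


# Hypothetical fleet of volume sizes in GB
fleet = [100, 500, 2000]
total = sum(gp3_monthly_savings(size) for size in fleet)
print(f"Estimated savings: ${total:.2f}/month ({gp3_savings_ratio():.0%})")
```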
Database Storage
Aurora Storage Optimization
- Pay only for storage used (auto-scaling)
- No pre-provisioning required
- 10GB increments up to 128TB
DynamoDB Capacity Modes
| Mode | Best For | Pricing |
|---|---|---|
| On-Demand | Unpredictable traffic | Pay per request |
| Provisioned | Steady traffic | Pay per capacity unit |
| Provisioned + Auto Scaling | Variable but predictable | Lower cost than on-demand |
Network Cost Optimization
Data Transfer Costs
AWS Data Transfer Pricing
Inbound: Free
Same AZ: Free
Cross-AZ: $0.01/GB each direction
Same Region (via public IP): $0.01/GB
Cross-Region: $0.02/GB
Internet Egress: $0.09/GB (first 10TB)
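These rates make it possible to estimate a monthly transfer bill from traffic volumes. The sketch below uses the prices listed above as assumptions; the path names and function are hypothetical, cross-AZ traffic is counted at both ends, and only the first internet egress tier is modeled.

```python
# USD per GB, taken from the price list above
RATES_PER_GB = {
    "inbound": 0.0,
    "same_az": 0.0,
    "cross_az": 0.01 * 2,  # charged in each direction
    "same_region_public_ip": 0.01,
    "cross_region": 0.02,
    "internet_egress": 0.09,  # first 10 TB tier only
}


def monthly_transfer_cost(gb_by_path):
    """Estimate monthly cost from a dict of {path: GB transferred}."""
    unknown = set(gb_by_path) - set(RATES_PER_GB)
    if unknown:
        raise ValueError(f"unknown transfer paths: {sorted(unknown)}")
    return sum(RATES_PER_GB[path] * gb for path, gb in gb_by_path.items())


# Hypothetical month: 1 TB of cross-AZ chatter and 500 GB served to the internet
cost = monthly_transfer_cost({"cross_az": 1000, "internet_egress": 500})
print(f"${cost:.2f}")
```

Breaking traffic down this way usually shows whether cross-AZ chatter or internet egress dominates, which in turn decides whether AZ affinity or a CDN is the better first step.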
Optimization Strategies
- Keep traffic within same AZ when possible
- Use VPC endpoints for AWS services
- Use CloudFront for cacheable content
- Compress data before transfer
- Use regional rather than global services
VPC Endpoints (Avoid NAT Gateway)
# Gateway endpoint (free for S3, DynamoDB)
resource "aws_vpc_endpoint" "s3" {
  vpc_id       = aws_vpc.main.id
  service_name = "com.amazonaws.us-east-1.s3"
}

# Interface endpoint (cheaper than NAT for specific services)
resource "aws_vpc_endpoint" "ecr" {
  vpc_id            = aws_vpc.main.id
  service_name      = "com.amazonaws.us-east-1.ecr.api"
  vpc_endpoint_type = "Interface"
}
CDN Optimization
CloudFront Cost Savings
- Lower data transfer rates than direct from origin
- Cache hit ratio optimization (target >90%)
- Use Origin Shield to reduce origin load
- Compress objects (Gzip/Brotli)
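# CloudFront cache optimization
CacheBehaviors:
  - PathPattern: "/static/*"
    CachePolicyId: 658327ea-f89d-4fab-a63d-7e88639e58f6  # Managed CachingOptimized policy
    Compress: true
    # TTLs come from the cache policy rather than the behavior;
    # CachingOptimized uses DefaultTTL 86400 (1 day) and MaxTTL 31536000 (1 year)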
Serverless Cost Optimization
Lambda Optimization
Memory/CPU Tuning
# Use AWS Lambda Power Tuning
# Finds optimal memory for cost vs performance
# Results example (illustrative durations; cost assumes $0.0000166667 per GB-second
# plus $0.20 per million requests):
# 128MB: $0.0000069 per invocation, 3200ms duration
# 256MB: $0.0000069 per invocation, 1600ms duration
# 512MB: $0.0000069 per invocation, 800ms duration
# 1024MB: $0.0000077 per invocation, 450ms duration
# Optimal: 512MB (same cost as 128MB at a quarter of the duration)
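Lambda bills for memory multiplied by duration, so more memory is only more expensive if the duration does not drop in proportion. The sketch below shows that calculation. The default prices are assumptions based on x86 list pricing in us-east-1 and should be checked against current pricing; the function name is hypothetical.

```python
def invocation_cost(memory_mb, duration_ms,
                    price_per_gb_second=0.0000166667,  # assumed x86 list price
                    price_per_request=0.0000002):      # assumed $0.20 per 1M requests
    """Estimated cost in USD of a single Lambda invocation."""
    gb_seconds = (memory_mb / 1024) * (duration_ms / 1000)
    return gb_seconds * price_per_gb_second + price_per_request


# Hypothetical measurements: (memory in MB, average duration in ms)
measurements = [(128, 3200), (512, 800), (1024, 450)]
for memory_mb, duration_ms in measurements:
    print(f"{memory_mb}MB: ${invocation_cost(memory_mb, duration_ms):.7f}")
```

When duration halves each time memory doubles, the cost stays flat and the larger setting is simply faster. Cost only rises once the speed-up stops keeping pace with the added memory.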
Cost Reduction Strategies
- Right-size memory allocation
- Minimize cold starts (provisioned concurrency for critical paths)
- Use ARM64 (Graviton2) - 20% cheaper
- Optimize package size for faster cold starts
- Use Lambda Layers for shared dependencies
Graviton2 Migration
# SAM template with ARM64
Resources:
  MyFunction:
    Type: AWS::Serverless::Function
    Properties:
      Runtime: python3.11
      Architectures:
        - arm64  # 20% cost savings
Container Optimization
Fargate Pricing Optimization
# Fargate Spot: Up to 70% discount
# Use for fault-tolerant workloads
ECS Service:
  CapacityProviderStrategy:
    - CapacityProvider: FARGATE_SPOT
      Weight: 4
    - CapacityProvider: FARGATE
      Weight: 1
      Base: 2  # Minimum on-demand tasks
Right-Size Container Resources
# Analyze actual usage with Container Insights
resources:
  requests:
    memory: "256Mi"  # Based on p95 usage + 20% buffer
    cpu: "100m"      # Based on p95 usage + 20% buffer
  limits:
    memory: "512Mi"  # 2x requests for burst
    cpu: "500m"
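The "p95 usage + 20% buffer" rule in the comments above is easy to turn into a helper. This is a minimal sketch: the function name is hypothetical, the p95 inputs are made-up measurements, and the limit multipliers simply mirror the ratios in the example (2x for memory, 5x for CPU).

```python
import math


def size_request(p95_usage, buffer=0.20, limit_multiplier=2):
    """Return (request, limit) in the same unit as p95_usage (Mi or millicores)."""
    request = math.ceil(p95_usage * (1 + buffer))
    return request, request * limit_multiplier


# Hypothetical p95 measurements: 213 Mi of memory and 83 millicores of CPU
memory_request, memory_limit = size_request(213)
cpu_request, cpu_limit = size_request(83, limit_multiplier=5)
print(f'memory: "{memory_request}Mi" / "{memory_limit}Mi"')
print(f'cpu: "{cpu_request}m" / "{cpu_limit}m"')
```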
Cost Allocation and Tagging
Tagging Strategy
Required Tags
# Terraform - enforce tags
variable "required_tags" {
  default = {
    environment = "prod"
    cost-center = "engineering"
    owner       = "platform-team"
    project     = "api-gateway"
    managed-by  = "terraform"
  }
}

resource "aws_instance" "example" {
  ami           = data.aws_ami.latest.id
  instance_type = "t3.medium"
  tags          = var.required_tags
}
Tag Enforcement
// AWS SCP - Deny untagged resources
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyUntaggedEC2",
      "Effect": "Deny",
      "Action": "ec2:RunInstances",
      "Resource": "arn:aws:ec2:*:*:instance/*",
      "Condition": {
        "Null": {
          "aws:RequestTag/cost-center": "true"
        }
      }
    }
  ]
}
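Preventive controls only cover new resources, so existing ones still need an audit. The sketch below checks a resource's tags against the required keys used in this section; the function name and the sample inventory are hypothetical, and real tag data would come from a provider API or an inventory export.

```python
REQUIRED_TAGS = ("environment", "cost-center", "owner", "project", "managed-by")


def missing_tags(resource_tags, required=REQUIRED_TAGS):
    """Return required tag keys that are absent or blank, in the order listed."""
    return [key for key in required if not str(resource_tags.get(key, "")).strip()]


# Hypothetical inventory: resource ID -> tags
inventory = {
    "i-0aaa": {"environment": "prod", "cost-center": "engineering",
               "owner": "platform-team", "project": "api-gateway",
               "managed-by": "terraform"},
    "i-0bbb": {"environment": "dev", "owner": "platform-team"},
}
report = {rid: missing_tags(tags) for rid, tags in inventory.items() if missing_tags(tags)}
print(report)
```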
Cost Allocation Reports
AWS Cost and Usage Report
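# Enable detailed billing reports (the CUR API is served from us-east-1;
# the S3Prefix value is a placeholder)
aws cur put-report-definition \
  --region us-east-1 \
  --report-definition '{
    "ReportName": "detailed-cost-report",
    "TimeUnit": "HOURLY",
    "Format": "Parquet",
    "Compression": "Parquet",
    "AdditionalSchemaElements": ["RESOURCES"],
    "S3Bucket": "my-billing-bucket",
    "S3Prefix": "cur",
    "S3Region": "us-east-1",
    "AdditionalArtifacts": ["ATHENA"],
    "ReportVersioning": "OVERWRITE_REPORT"
  }'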
Athena Queries for Analysis
-- Cost by service and tag
SELECT
  line_item_product_code AS service,
  resource_tags_user_cost_center AS cost_center,
  SUM(line_item_unblended_cost) AS cost
FROM cost_report
WHERE year = '2024' AND month = '1'  -- CUR Athena tables are partitioned by year and month
GROUP BY 1, 2
ORDER BY 3 DESC;

-- Unused Reserved Instances
SELECT
  reservation_reservation_a_r_n,
  reservation_unused_quantity,
  reservation_unused_normalized_unit_quantity
FROM cost_report
WHERE reservation_unused_quantity > 0;
Automation and Governance
Automated Cost Controls
AWS Budgets with Actions
# CloudFormation - Budget with an email alert at 80% of the limit
Resources:
  MonthlyCostBudget:
    Type: AWS::Budgets::Budget
    Properties:
      Budget:
        BudgetName: monthly-cost-limit
        BudgetLimit:
          Amount: 10000
          Unit: USD
        TimeUnit: MONTHLY
        BudgetType: COST
      NotificationsWithSubscribers:
        - Notification:
            NotificationType: ACTUAL
            ComparisonOperator: GREATER_THAN
            Threshold: 80  # Percent of the budget limit
          Subscribers:
            - SubscriptionType: EMAIL
              Address: finance@company.com
Scheduled Scaling (Dev/Test)
# Stop non-prod resources nights/weekends
# Recurrence is evaluated in UTC unless TimeZone is set
Resources:
  ScaleDownSchedule:
    Type: AWS::AutoScaling::ScheduledAction
    Properties:
      AutoScalingGroupName: !Ref DevASG
      MinSize: 0  # Desired capacity cannot go below the group's minimum
      DesiredCapacity: 0
      Recurrence: "0 20 * * MON-FRI"  # 8 PM weekdays
  ScaleUpSchedule:
    Type: AWS::AutoScaling::ScheduledAction
    Properties:
      AutoScalingGroupName: !Ref DevASG
      DesiredCapacity: 3
      Recurrence: "0 8 * * MON-FRI"  # 8 AM weekdays
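The payoff from this schedule is straightforward arithmetic on running hours. The sketch below estimates the compute-hour reduction for the 8 AM to 8 PM weekday schedule above; the function name is hypothetical, and storage attached to stopped capacity keeps billing, so the overall bill drops by less than this figure.

```python
HOURS_PER_WEEK = 7 * 24


def schedule_savings(hours_per_day, days_per_week):
    """Fraction of always-on compute hours avoided by running only on a schedule."""
    running = hours_per_day * days_per_week
    if not 0 <= running <= HOURS_PER_WEEK:
        raise ValueError("schedule exceeds the hours in a week")
    return 1 - running / HOURS_PER_WEEK


# 12 hours a day, Monday to Friday, as in the schedule above
savings = schedule_savings(12, 5)
print(f"{savings:.1%} fewer instance-hours")
```

Running 60 of the 168 hours in a week cuts instance-hours by roughly 64%.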
Cost Anomaly Detection
AWS Cost Anomaly Detection
# Create anomaly monitor
aws ce create-anomaly-monitor \
  --anomaly-monitor '{
    "MonitorName": "ServiceMonitor",
    "MonitorType": "DIMENSIONAL",
    "MonitorDimension": "SERVICE"
  }'

# Create anomaly subscription (email subscribers need DAILY or WEEKLY frequency)
aws ce create-anomaly-subscription \
  --anomaly-subscription '{
    "SubscriptionName": "CostAlerts",
    "MonitorArnList": ["arn:aws:ce::123456789:anomalymonitor/abc123"],
    "Subscribers": [{"Type": "EMAIL", "Address": "alerts@company.com"}],
    "Frequency": "DAILY",
    "Threshold": 100
  }'
Cost Metrics and KPIs
Key Metrics
| Metric | Formula | Target |
|---|---|---|
| Unit Cost | Total Cost / Business Metric | Decreasing |
| Coverage | Reserved Hours / Total Hours | >70% |
| Utilization | Used Reserved Hours / Purchased | >80% |
| Waste | Idle Resource Cost / Total Cost | <10% |
| Forecast Accuracy | Actual / Forecasted | 90-110% |
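The formulas in the table can be computed together from one month's figures. This is a sketch with hypothetical field names and sample numbers; it reads "Reserved Hours" in the coverage formula as the reserved hours actually used, which is an assumption.

```python
from dataclasses import dataclass


@dataclass
class MonthlyUsage:
    total_cost: float
    idle_cost: float
    forecast_cost: float
    business_units: float            # e.g. requests served or active users
    total_hours: float               # all instance-hours consumed
    reserved_hours_purchased: float
    reserved_hours_used: float


def compute_kpis(u):
    """Apply the KPI formulas from the table above to one month of data."""
    return {
        "unit_cost": u.total_cost / u.business_units,
        "coverage": u.reserved_hours_used / u.total_hours,
        "utilization": u.reserved_hours_used / u.reserved_hours_purchased,
        "waste": u.idle_cost / u.total_cost,
        "forecast_accuracy": u.total_cost / u.forecast_cost,
    }


# Hypothetical month
kpis = compute_kpis(MonthlyUsage(
    total_cost=10000, idle_cost=800, forecast_cost=9500, business_units=50000,
    total_hours=1000, reserved_hours_purchased=800, reserved_hours_used=720,
))
for name, value in kpis.items():
    print(f"{name}: {value:.3f}")
```

In this sample month coverage (72%), utilization (90%), waste (8%), and forecast accuracy (about 105%) all meet the targets in the table.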
Dashboard Example
-- Cost efficiency dashboard metrics
WITH metrics AS (
  SELECT
    date_trunc('month', usage_date) AS month,
    SUM(cost) AS total_cost,
    SUM(CASE WHEN reservation_arn IS NOT NULL THEN cost ELSE 0 END) AS reserved_cost,
    COUNT(DISTINCT user_id) AS active_users
  FROM cloud_costs
  GROUP BY 1
)
SELECT
  month,
  total_cost,
  reserved_cost / NULLIF(total_cost, 0) AS reservation_coverage,  -- share of spend covered by reservations
  total_cost / NULLIF(active_users, 0) AS cost_per_user           -- NULL instead of a divide-by-zero error
FROM metrics;
Quick Wins Checklist
Immediate Savings (This Week)
- Delete unused EBS volumes and snapshots
- Terminate stopped EC2 instances that are no longer needed
- Release unused Elastic IPs
- Delete unused load balancers
- Review and delete old AMIs
Short-Term (This Month)
- Right-size underutilized instances
- Migrate gp2 volumes to gp3
- Implement S3 lifecycle policies
- Enable S3 Intelligent-Tiering
- Schedule dev/test environments
Medium-Term (This Quarter)
- Purchase Savings Plans for baseline
- Implement Spot for fault-tolerant workloads
- Set up cost allocation tags
- Enable Cost Anomaly Detection
- Establish FinOps practices