# Cloud Cost Optimization Reference
Comprehensive guide for cloud cost optimization including reserved instances, spot/preemptible, right-sizing, and FinOps practices.
## FinOps Framework
### FinOps Principles
1. **Teams need to collaborate** - Finance, engineering, and business work together
2. **Everyone takes ownership** - Decentralized cost responsibility
3. **A centralized team drives FinOps** - Center of excellence for best practices
4. **Reports should be accessible and timely** - Real-time visibility
5. **Decisions are driven by business value** - Cost per business outcome
6. **Take advantage of the variable cost model of the cloud** - Scale up and down as needed
### FinOps Lifecycle
```
          Inform
             |
      +------+------+
      |             |
      v             v
  Optimize ----> Operate
      ^             |
      |             |
      +-------------+
```
**Inform Phase**
- Visibility into cloud spend
- Allocation and showback
- Benchmarking and forecasting
**Optimize Phase**
- Rate optimization (RIs, savings plans)
- Usage optimization (right-sizing)
- Architectural optimization
**Operate Phase**
- Continuous improvement
- Automation and governance
- Anomaly detection
## Compute Cost Optimization
### Reserved Instances / Savings Plans
**AWS Savings Plans**
| Type | Flexibility | Savings |
|------|-------------|---------|
| Compute Savings Plans | Any EC2, Fargate, Lambda | Up to 66% |
| EC2 Instance Savings Plans | Specific instance family, region | Up to 72% |
| Reserved Instances | Specific instance type, AZ | Up to 72% |
**Commitment Strategy**
```
Baseline (always-on): 1-year or 3-year Savings Plans
Variable (predictable): On-Demand with scheduled scaling (AWS no longer sells Scheduled Reserved Instances)
Spiky (unpredictable): On-Demand + Spot
```
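The strategy above can be turned into numbers by sizing the commitment from historical hourly usage. The following is a minimal sketch, not AWS guidance: the helper names, the 10th-percentile cut-off, and the sample data are all assumptions chosen for illustration.

```python
def baseline_commitment(hourly_usage, percentile=10):
    """Pick a low percentile of hourly usage as the always-on baseline."""
    if not hourly_usage:
        raise ValueError("hourly_usage must not be empty")
    ordered = sorted(hourly_usage)
    # Simple index into the sorted samples (an assumption of this sketch)
    index = int(len(ordered) * percentile / 100)
    return ordered[min(index, len(ordered) - 1)]


def split_usage(hourly_usage, commitment):
    """Split usage into instance-hours covered by the commitment and overflow."""
    covered = sum(min(u, commitment) for u in hourly_usage)
    overflow = sum(max(u - commitment, 0) for u in hourly_usage)
    return covered, overflow


# Instances running per hour (made-up sample)
usage = [4, 4, 5, 6, 9, 12, 7, 4]
commitment = baseline_commitment(usage)
covered, overflow = split_usage(usage, commitment)
```

The covered hours are candidates for a Savings Plan; the overflow hours would run on On-Demand or Spot.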
**Azure Reservations**
```bash
# Azure CLI - Purchase reservation
az reservations reservation-order purchase \
  --sku Standard_D2s_v3 \
  --term P1Y \
  --billing-scope /subscriptions/{subscription-id} \
  --quantity 10 \
  --applied-scope-type Shared
```
**GCP Committed Use Discounts**
- Resource-based: Specific vCPUs and memory
- Spend-based: Dollar commitment for flexibility
- 1-year (up to 37% discount) or 3-year (up to 55% discount) for most machine types
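A commitment is billed whether or not the resource runs, so a discount only pays off above a certain utilization. The sketch below works out that break-even point under a simplified model (flat on-demand price, commitment billed for every hour of the term); the function names are hypothetical.

```python
def break_even_utilization(discount):
    """Fraction of the term a resource must run for the commitment to pay off.

    Committed cost is (1 - discount) * price for every hour, while
    on-demand cost is price * utilization, so break-even is 1 - discount.
    """
    if not 0 <= discount < 1:
        raise ValueError("discount must be in [0, 1)")
    return 1 - discount


def committed_is_cheaper(expected_utilization, discount):
    """True when the expected utilization exceeds the break-even point."""
    return expected_utilization > break_even_utilization(discount)
```

With a 37% discount the resource needs to run roughly 63% of the time; with 55% the threshold drops to about 45%.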
### Spot/Preemptible Instances
**When to Use Spot**
- Batch processing and analytics
- CI/CD build agents
- Stateless web servers (with auto-scaling)
- Machine learning training
- Development and testing environments
**AWS Spot Best Practices**
```yaml
# EC2 Auto Scaling with Spot
MixedInstancesPolicy:
  InstancesDistribution:
    OnDemandBaseCapacity: 2
    OnDemandPercentageAboveBaseCapacity: 20
    SpotAllocationStrategy: capacity-optimized
  LaunchTemplate:
    Overrides:
      - InstanceType: m5.large
      - InstanceType: m5a.large
      - InstanceType: m4.large
      - InstanceType: r5.large
```
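To see what the distribution settings above mean for a given group size, the On-Demand/Spot split can be estimated as below. This is a rough sketch: `split_capacity` is a hypothetical helper, and rounding the On-Demand share up is an assumption rather than documented Auto Scaling behavior.

```python
import math


def split_capacity(desired, on_demand_base, on_demand_pct_above_base):
    """Estimate (on_demand, spot) instance counts for a desired capacity."""
    # The base capacity is filled with On-Demand first
    on_demand = min(desired, on_demand_base)
    above_base = desired - on_demand
    # Assumption: the On-Demand share above the base is rounded up
    on_demand += math.ceil(above_base * on_demand_pct_above_base / 100)
    return on_demand, desired - on_demand
```

With the values from the policy (base 2, 20% above base), a desired capacity of 12 would give 4 On-Demand and 8 Spot instances.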
**Spot Interruption Handling**
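```python
# Check for spot termination notice (AWS)
import requests

TERMINATION_URL = "http://169.254.169.254/latest/meta-data/spot/termination-time"


def graceful_shutdown():
    """Placeholder: drain connections, checkpoint work, deregister from the load balancer."""


def check_spot_termination():
    """Return True if a termination notice was found and shutdown was triggered."""
    try:
        # Assumes IMDSv1 is reachable; IMDSv2 requires a session token header
        response = requests.get(TERMINATION_URL, timeout=2)
    except requests.exceptions.RequestException:
        return False  # Metadata service unreachable
    if response.status_code == 200:
        # 2-minute warning - gracefully shut down
        graceful_shutdown()
        return True
    return False  # 404 means no termination is scheduled
```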
**GCP Preemptible/Spot VMs**
```hcl
# Terraform - GCP Spot VM
resource "google_compute_instance" "spot" {
  name         = "spot-instance"
  machine_type = "n2-standard-4"
  scheduling {
    preemptible                 = true
    automatic_restart           = false
    provisioning_model          = "SPOT"
    instance_termination_action = "STOP"
  }
}
```
### Right-Sizing
**Analysis Process**
1. Collect metrics (CPU, memory, network, disk I/O)
2. Identify idle or underutilized resources
3. Recommend appropriate instance size
4. Implement changes during maintenance windows
5. Monitor and iterate
**AWS Compute Optimizer**
```bash
# Enable Compute Optimizer
aws compute-optimizer update-enrollment-status \
  --status Active \
  --include-member-accounts
# Get recommendations for over-provisioned instances
aws compute-optimizer get-ec2-instance-recommendations \
  --filters name=Finding,values=Overprovisioned
```
**Right-Sizing Thresholds**
| Metric | Underutilized | Optimal | Overutilized |
|--------|---------------|---------|--------------|
| CPU | <20% avg | 40-60% avg | >80% avg |
| Memory | <30% avg | 50-70% avg | >85% avg |
| Network | <10% capacity | Variable | >80% capacity |
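The CPU and memory thresholds in the table can be applied mechanically to monitoring data. A minimal sketch follows; the `classify` helper is hypothetical, and the "acceptable" label for values that fall between the table's bands is an assumption, since the table leaves those ranges unnamed.

```python
# (underutilized_below, optimal_low, optimal_high, overutilized_above) in percent
THRESHOLDS = {
    "cpu": (20, 40, 60, 80),
    "memory": (30, 50, 70, 85),
}


def classify(metric, avg_percent):
    """Map an average utilization percentage to a right-sizing finding."""
    low, optimal_low, optimal_high, high = THRESHOLDS[metric]
    if avg_percent < low:
        return "underutilized"
    if avg_percent > high:
        return "overutilized"
    if optimal_low <= avg_percent <= optimal_high:
        return "optimal"
    # Between bands: not flagged, but not in the target range either
    return "acceptable"
```

Averages hide bursts, so a finding from this kind of check is a starting point for review rather than a resize decision.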
**Azure Advisor Recommendations**
```bash
# Get cost recommendations
az advisor recommendation list \
  --category Cost \
  --query "[?impact=='High']"
```
## Storage Cost Optimization
### Object Storage Tiering
**AWS S3 Storage Classes**
```
S3 Standard
|
| (30 days)
v
S3 Standard-IA
|
| (90 days)
v
S3 Glacier Instant Retrieval
|
| (180 days)
v
S3 Glacier Deep Archive
```
**Lifecycle Policy Example**
```json
{
  "Rules": [
    {
      "ID": "OptimizeCosts",
      "Status": "Enabled",
      "Filter": { "Prefix": "logs/" },
      "Transitions": [
        { "Days": 30, "StorageClass": "STANDARD_IA" },
        { "Days": 90, "StorageClass": "GLACIER" },
        { "Days": 365, "StorageClass": "DEEP_ARCHIVE" }
      ],
      "Expiration": { "Days": 730 }
    }
  ]
}
```
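To reason about what the policy above does to a given object, the rule can be replayed against an object's age. This is an illustrative sketch only: `storage_class_at` is a hypothetical helper, and S3 applies transitions asynchronously, so real timing is approximate.

```python
# Transition schedule copied from the lifecycle policy above
TRANSITIONS = [(30, "STANDARD_IA"), (90, "GLACIER"), (365, "DEEP_ARCHIVE")]
EXPIRATION_DAYS = 730


def storage_class_at(age_days):
    """Return the expected storage class for an object of the given age."""
    if age_days >= EXPIRATION_DAYS:
        return None  # Object has expired and been deleted
    current = "STANDARD"
    for days, storage_class in TRANSITIONS:
        if age_days >= days:
            current = storage_class
    return current
```

For example, a 100-day-old log object would be expected in `GLACIER`, and anything older than 730 days would be gone.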
**S3 Intelligent-Tiering**
- Automatic tiering based on access patterns
- No retrieval fees
- Small monitoring fee per object
- Best for unpredictable access patterns
### Block Storage Optimization
**EBS Volume Selection**
| Type | Use Case | $/GB/month |
|------|----------|------------|
| gp3 | General purpose | $0.08 |
| gp2 | Legacy (migrate to gp3) | $0.10 |
| io2 | High IOPS databases | $0.125+ |
| st1 | Throughput (big data) | $0.045 |
| sc1 | Cold archives | $0.015 |
**gp3 Migration (20% savings)**
```bash
# Modify EBS volume from gp2 to gp3
aws ec2 modify-volume \
--volume-id vol-12345678 \
--volume-type gp3 \
--iops 3000 \
--throughput 125
```
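The 20% figure follows directly from the per-GB prices in the table. A small sketch of the arithmetic, assuming the baseline 3000 IOPS and 125 MB/s are sufficient (extra provisioned IOPS or throughput on gp3 are billed separately and are not modeled here):

```python
GP2_PRICE = 0.10  # $/GB-month, from the table above
GP3_PRICE = 0.08  # $/GB-month, from the table above


def gp3_monthly_savings(size_gb):
    """Monthly storage savings in dollars from moving a gp2 volume to gp3."""
    return size_gb * (GP2_PRICE - GP3_PRICE)


def gp3_savings_fraction():
    """Relative saving on the storage price."""
    return (GP2_PRICE - GP3_PRICE) / GP2_PRICE
```

A 500 GB volume would save about $10 per month on storage under these prices.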
### Database Storage
**Aurora Storage Optimization**
- Pay only for storage used (auto-scaling)
- No pre-provisioning required
- 10GB increments up to 128TB
**DynamoDB Capacity Modes**
| Mode | Best For | Pricing |
|------|----------|---------|
| On-Demand | Unpredictable traffic | Pay per request |
| Provisioned | Steady traffic | Pay per capacity unit |
| Provisioned + Auto Scaling | Variable but predictable | Lower cost than on-demand |
## Network Cost Optimization
### Data Transfer Costs
**AWS Data Transfer Pricing**
```
Inbound: Free
Same AZ: Free
Cross-AZ: $0.01/GB each direction
Same Region (via public IP): $0.01/GB
Cross-Region: $0.02/GB
Internet Egress: $0.09/GB (first 10TB)
```
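These rates make it straightforward to estimate a monthly transfer bill from traffic volumes. The sketch below uses only the rates listed above; the path names and the `monthly_transfer_cost` helper are hypothetical, and the internet egress rate applies to the first 10 TB tier only.

```python
RATES_PER_GB = {
    "cross_az": 0.01 * 2,      # Charged in each direction
    "cross_region": 0.02,
    "internet_egress": 0.09,   # First 10 TB tier only
}


def monthly_transfer_cost(gb_by_path):
    """Sum the transfer cost for a mapping of path name to GB transferred."""
    return sum(RATES_PER_GB[path] * gb for path, gb in gb_by_path.items())
```

For instance, 1000 GB of cross-AZ traffic plus 500 GB of internet egress comes to roughly $65 per month.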
**Optimization Strategies**
1. Keep traffic within same AZ when possible
2. Use VPC endpoints for AWS services
3. Use CloudFront for cacheable content
4. Compress data before transfer
5. Use regional rather than global services
**VPC Endpoints (Avoid NAT Gateway)**
```hcl
# Gateway endpoint (free for S3, DynamoDB)
resource "aws_vpc_endpoint" "s3" {
  vpc_id       = aws_vpc.main.id
  service_name = "com.amazonaws.us-east-1.s3"
}
# Interface endpoint (cheaper than NAT for specific services)
resource "aws_vpc_endpoint" "ecr" {
  vpc_id            = aws_vpc.main.id
  service_name      = "com.amazonaws.us-east-1.ecr.api"
  vpc_endpoint_type = "Interface"
}
```
### CDN Optimization
**CloudFront Cost Savings**
- Lower data transfer rates than direct from origin
- Cache hit ratio optimization (target >90%)
- Use Origin Shield to reduce origin load
- Compress objects (Gzip/Brotli)
```yaml
# CloudFront cache optimization
CacheBehaviors:
  - PathPattern: "/static/*"
    CachePolicyId: 658327ea-f89d-4fab-a63d-7e88639e58f6  # Managed CachingOptimized policy
    Compress: true
    # TTLs come from the cache policy; CachingOptimized uses
    # DefaultTTL: 86400 (1 day) and MaxTTL: 31536000 (1 year)
```
## Serverless Cost Optimization
### Lambda Optimization
**Memory/CPU Tuning**
```python
# Use AWS Lambda Power Tuning
# Finds optimal memory for cost vs performance
# Results example (illustrative; compute cost at $0.0000166667 per GB-second, x86):
# 128MB:  $0.0000067 per invocation, 3200ms duration
# 256MB:  $0.0000067 per invocation, 1600ms duration
# 512MB:  $0.0000067 per invocation, 800ms duration
# 1024MB: $0.0000075 per invocation, 450ms duration
# Optimal: 512MB (same cost as 128MB at a quarter of the duration)
```
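Per-invocation cost is a function of configured memory and billed duration, so candidate configurations can also be compared by hand. A minimal sketch, assuming x86 list prices of $0.0000166667 per GB-second and $0.20 per million requests; both rates are assumptions to check against current pricing, and a tuning run may report different figures.

```python
PRICE_PER_GB_SECOND = 0.0000166667  # Assumed x86 rate
PRICE_PER_REQUEST = 0.0000002       # Assumed $0.20 per million requests


def invocation_cost(memory_mb, duration_ms):
    """Estimated cost in dollars of one invocation."""
    gb_seconds = (memory_mb / 1024) * (duration_ms / 1000)
    return gb_seconds * PRICE_PER_GB_SECOND + PRICE_PER_REQUEST


# Compare the memory/duration pairs from the example above
candidates = [(128, 3200), (256, 1600), (512, 800), (1024, 450)]
costs = {memory_mb: invocation_cost(memory_mb, ms) for memory_mb, ms in candidates}
```

When duration halves each time memory doubles, the cost stays flat, so the larger setting is faster at no extra charge; cost only rises once the speed-up tails off.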
**Cost Reduction Strategies**
1. Right-size memory allocation
2. Minimize cold starts (provisioned concurrency for critical paths)
3. Use ARM64 (Graviton2) - 20% cheaper
4. Optimize package size for faster cold starts
5. Use Lambda Layers for shared dependencies
**Graviton2 Migration**
```yaml
# SAM template with ARM64
Resources:
  MyFunction:
    Type: AWS::Serverless::Function
    Properties:
      Runtime: python3.11
      Architectures:
        - arm64  # 20% cost savings
```
### Container Optimization
**Fargate Pricing Optimization**
```yaml
# Fargate Spot: Up to 70% discount
# Use for fault-tolerant workloads
ECS Service:
  CapacityProviderStrategy:
    - CapacityProvider: FARGATE_SPOT
      Weight: 4
    - CapacityProvider: FARGATE
      Weight: 1
      Base: 2  # Minimum on-demand tasks
```
**Right-Size Container Resources**
```yaml
# Analyze actual usage with Container Insights
resources:
  requests:
    memory: "256Mi"  # Based on p95 usage + 20% buffer
    cpu: "100m"      # Based on p95 usage + 20% buffer
  limits:
    memory: "512Mi"  # 2x requests for burst
    cpu: "500m"
```
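The "p95 usage + 20% buffer" rule in the comments can be computed from exported usage samples. The sketch below is one way to do it, assuming a nearest-rank percentile and made-up memory samples in MiB; the helper names are hypothetical.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of samples."""
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def recommended_request(samples, pct=95, buffer=0.20):
    """Resource request: the chosen percentile plus a safety buffer, rounded up."""
    return math.ceil(percentile(samples, pct) * (1 + buffer))


# Memory usage samples in MiB (made-up data)
memory_mib = [100, 120, 150, 180, 200, 210, 190, 170, 160, 213]
request_mib = recommended_request(memory_mib)
```

The same function works for CPU samples expressed in millicores.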
## Cost Allocation and Tagging
### Tagging Strategy
**Required Tags**
```hcl
# Terraform - enforce tags
variable "required_tags" {
  default = {
    environment = "prod"
    cost-center = "engineering"
    owner       = "platform-team"
    project     = "api-gateway"
    managed-by  = "terraform"
  }
}
resource "aws_instance" "example" {
  ami           = data.aws_ami.latest.id
  instance_type = "t3.medium"
  tags          = var.required_tags
}
```
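Tag coverage can also be audited after the fact, for example against an inventory export. A minimal sketch, assuming the tag keys from the Terraform variable above and treating empty values as missing; `missing_tags` is a hypothetical helper, not part of any AWS SDK.

```python
REQUIRED_TAG_KEYS = {"environment", "cost-center", "owner", "project", "managed-by"}


def missing_tags(resource_tags):
    """Return the required tag keys that are absent or empty, sorted by name."""
    present = {key for key, value in resource_tags.items() if value}
    return sorted(REQUIRED_TAG_KEYS - present)


# Example: a resource with two tags set and one left blank
gaps = missing_tags({"environment": "prod", "owner": "platform-team", "project": ""})
```

Resources with a non-empty result are the ones that will show up as unallocated in cost reports.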
**Tag Enforcement**
AWS SCP that denies launching EC2 instances without a `cost-center` tag:
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyUntaggedEC2",
      "Effect": "Deny",
      "Action": "ec2:RunInstances",
      "Resource": "arn:aws:ec2:*:*:instance/*",
      "Condition": {
        "Null": {
          "aws:RequestTag/cost-center": "true"
        }
      }
    }
  ]
}
```
### Cost Allocation Reports
**AWS Cost and Usage Report**
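```bash
# Enable detailed billing reports (the CUR API is served from us-east-1)
# The S3Prefix value "cur" is a placeholder; the ATHENA artifact
# requires Parquet format and OVERWRITE_REPORT versioning
aws cur put-report-definition \
  --region us-east-1 \
  --report-definition '{
    "ReportName": "detailed-cost-report",
    "TimeUnit": "HOURLY",
    "Format": "Parquet",
    "Compression": "Parquet",
    "AdditionalSchemaElements": ["RESOURCES"],
    "S3Bucket": "my-billing-bucket",
    "S3Prefix": "cur",
    "S3Region": "us-east-1",
    "AdditionalArtifacts": ["ATHENA"],
    "ReportVersioning": "OVERWRITE_REPORT"
  }'
```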
**Athena Queries for Analysis**
```sql
-- Cost by service and tag
SELECT
    line_item_product_code AS service,
    resource_tags_user_cost_center AS cost_center,
    SUM(line_item_unblended_cost) AS cost
FROM cost_report
-- CUR tables in Athena are partitioned by the string columns year and month
WHERE year = '2024' AND month = '1'
GROUP BY 1, 2
ORDER BY 3 DESC;
-- Unused Reserved Instances
SELECT
    reservation_reservation_a_r_n,
    reservation_unused_quantity,
    reservation_unused_normalized_unit_quantity
FROM cost_report
WHERE reservation_unused_quantity > 0;
```
## Automation and Governance
### Automated Cost Controls
**AWS Budgets with Actions**
```yaml
# CloudFormation - Budget with an email alert at 80% of the limit
# (automated responses such as stopping instances need a separate budget action)
Resources:
  MonthlyCostBudget:
    Type: AWS::Budgets::Budget
    Properties:
      Budget:
        BudgetName: monthly-cost-limit
        BudgetLimit:
          Amount: 10000
          Unit: USD
        TimeUnit: MONTHLY
        BudgetType: COST
      NotificationsWithSubscribers:
        - Notification:
            NotificationType: ACTUAL
            ComparisonOperator: GREATER_THAN
            Threshold: 80  # Percent of the budget limit (default ThresholdType)
          Subscribers:
            - SubscriptionType: EMAIL
              Address: finance@company.com
```
**Scheduled Scaling (Dev/Test)**
```yaml
# Stop non-prod resources nights/weekends (cron times are UTC unless TimeZone is set)
Resources:
  ScaleDownSchedule:
    Type: AWS::AutoScaling::ScheduledAction
    Properties:
      AutoScalingGroupName: !Ref DevASG
      MinSize: 0  # Desired capacity cannot be lower than the group's minimum size
      DesiredCapacity: 0
      Recurrence: "0 20 * * MON-FRI"  # 8 PM weekdays
  ScaleUpSchedule:
    Type: AWS::AutoScaling::ScheduledAction
    Properties:
      AutoScalingGroupName: !Ref DevASG
      DesiredCapacity: 3
      Recurrence: "0 8 * * MON-FRI"  # 8 AM weekdays
```
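The saving from this schedule is easy to estimate: the group runs 12 hours on each of 5 weekdays instead of all 168 hours in a week. A small sketch of that arithmetic, assuming cost scales linearly with running hours (storage and other always-on charges are ignored):

```python
HOURS_PER_WEEK = 24 * 7


def scheduled_savings_fraction(hours_per_day, days_per_week):
    """Fraction of always-on compute hours avoided by the schedule."""
    running_hours = hours_per_day * days_per_week
    return 1 - running_hours / HOURS_PER_WEEK


# 8 AM to 8 PM, Monday to Friday
savings = scheduled_savings_fraction(12, 5)
```

For the schedule above this works out to roughly 64% fewer instance-hours.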
### Cost Anomaly Detection
**AWS Cost Anomaly Detection**
```bash
# Create anomaly monitor
aws ce create-anomaly-monitor \
  --anomaly-monitor '{
    "MonitorName": "ServiceMonitor",
    "MonitorType": "DIMENSIONAL",
    "MonitorDimension": "SERVICE"
  }'
# Create anomaly subscription
# Frequency is required; email subscribers need DAILY or WEEKLY
aws ce create-anomaly-subscription \
  --anomaly-subscription '{
    "SubscriptionName": "CostAlerts",
    "MonitorArnList": ["arn:aws:ce::123456789:anomalymonitor/abc123"],
    "Subscribers": [{"Type": "EMAIL", "Address": "alerts@company.com"}],
    "Frequency": "DAILY",
    "Threshold": 100
  }'
```
## Cost Metrics and KPIs
### Key Metrics
| Metric | Formula | Target |
|--------|---------|--------|
| Unit Cost | Total Cost / Business Metric | Decreasing |
| Coverage | Reserved Hours / Total Hours | >70% |
| Utilization | Used Reserved Hours / Purchased Reserved Hours | >80% |
| Waste | Idle Resource Cost / Total Cost | <10% |
| Forecast Accuracy | Actual / Forecasted | 90-110% |
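The formulas in the table can be evaluated together from a handful of period totals. The sketch below is illustrative: the function name, argument names, and sample figures are assumptions, and "reserved hours" is interpreted here as hours actually covered by a reservation.

```python
def finops_kpis(total_cost, business_units, covered_hours, purchased_hours,
                total_hours, idle_cost, forecast_cost):
    """Compute the KPIs from the table; all inputs cover the same period."""
    return {
        "unit_cost": total_cost / business_units,
        "coverage": covered_hours / total_hours,
        "utilization": covered_hours / purchased_hours,
        "waste": idle_cost / total_cost,
        "forecast_accuracy": total_cost / forecast_cost,
    }


# Made-up monthly totals
kpis = finops_kpis(
    total_cost=50_000, business_units=10_000, covered_hours=700,
    purchased_hours=800, total_hours=1_000, idle_cost=4_000, forecast_cost=48_000,
)
```

In this sample, coverage (70%) sits right at the target boundary while utilization (87.5%) and waste (8%) are within range.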
### Dashboard Example
```sql
-- Cost efficiency dashboard metrics
WITH metrics AS (
    SELECT
        date_trunc('month', usage_date) AS month,
        SUM(cost) AS total_cost,
        -- ELSE 0 keeps the sum at 0 (not NULL) for months with no reserved usage
        SUM(CASE WHEN reservation_arn IS NOT NULL THEN cost ELSE 0 END) AS reserved_cost,
        COUNT(DISTINCT user_id) AS active_users
    FROM cloud_costs
    GROUP BY 1
)
SELECT
    month,
    total_cost,
    -- NULLIF avoids division-by-zero errors
    reserved_cost / NULLIF(total_cost, 0) AS reservation_coverage,
    total_cost / NULLIF(active_users, 0) AS cost_per_user
FROM metrics;
```
## Quick Wins Checklist
**Immediate Savings (This Week)**
- [ ] Delete unused EBS volumes and snapshots
- [ ] Terminate stopped EC2 instances that are no longer needed
- [ ] Remove unused Elastic IPs
- [ ] Delete unused load balancers
- [ ] Review and delete old AMIs
**Short-Term (This Month)**
- [ ] Right-size underutilized instances
- [ ] Migrate gp2 volumes to gp3
- [ ] Implement S3 lifecycle policies
- [ ] Enable S3 Intelligent-Tiering
- [ ] Schedule dev/test environments
**Medium-Term (This Quarter)**
- [ ] Purchase Savings Plans for baseline
- [ ] Implement Spot for fault-tolerant workloads
- [ ] Set up cost allocation tags
- [ ] Enable Cost Anomaly Detection
- [ ] Establish FinOps practices