Cloud Cost Optimization Reference

A reference for cloud cost optimization, covering reserved instances and savings plans, spot/preemptible capacity, right-sizing, and FinOps practices.

FinOps Framework

FinOps Principles

  1. Teams need to collaborate - Finance, engineering, and business work together
  2. Everyone takes ownership - Decentralized cost responsibility
  3. A centralized team drives FinOps - Center of excellence for best practices
  4. Reports should be accessible and timely - Real-time visibility
  5. Decisions are driven by business value - Cost per business outcome
  6. Take advantage of the variable cost model - Scale up and down as needed

FinOps Lifecycle

      Inform
         |
    +---------+
    |         |
    v         v
 Optimize --> Operate
    ^         |
    |         |
    +---------+

Inform Phase

  • Visibility into cloud spend
  • Allocation and showback
  • Benchmarking and forecasting

Optimize Phase

  • Rate optimization (RIs, savings plans)
  • Usage optimization (right-sizing)
  • Architectural optimization

Operate Phase

  • Continuous improvement
  • Automation and governance
  • Anomaly detection

Compute Cost Optimization

Reserved Instances / Savings Plans

AWS Savings Plans

Type                         Flexibility                         Savings
Compute Savings Plans        Any EC2, Fargate, Lambda            Up to 66%
EC2 Instance Savings Plans   Specific instance family, region    Up to 72%
Reserved Instances           Specific instance type, AZ          Up to 72%
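
Because a commitment is billed whether or not it is used, the headline discount only pays off above a certain utilization. A minimal sketch of that break-even arithmetic, assuming a single flat discount; the helper names `break_even_utilization` and `commitment_saves_money` are hypothetical:

```python
def break_even_utilization(discount):
    """Fraction of hours a commitment must be used to beat On-Demand.

    discount: fractional discount off On-Demand, e.g. 0.66 for "up to 66%".
    The commitment costs (1 - discount) * rate for every hour, while
    On-Demand costs rate * utilization, so the two meet at 1 - discount.
    """
    if not 0 <= discount < 1:
        raise ValueError("discount must be in [0, 1)")
    return 1 - discount


def commitment_saves_money(discount, expected_utilization):
    """True when expected usage is above the break-even point."""
    return expected_utilization > break_even_utilization(discount)


# A 30% discount needs more than 70% utilization to pay off.
print(commitment_saves_money(0.30, 0.50))
```

Real discounts depend on term, payment option, region, and instance family, so the rate fed into this check should come from an actual quote rather than the "up to" figure.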

Commitment Strategy

Baseline (always-on): 1-year or 3-year Savings Plans
Variable (predictable): Scheduled scaling on On-Demand (Scheduled Reserved Instances are no longer offered for purchase)
Spiky (unpredictable): On-Demand + Spot
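
One way to find the always-on baseline is to take a low percentile of historical hourly usage and commit only to that level. The sketch below is illustrative: `split_commitment` and the 10th-percentile cutoff are assumptions, not a provider recommendation.

```python
def split_commitment(hourly_usage, percentile=10):
    """Split usage into a committed baseline and an uncommitted remainder.

    hourly_usage: instance-hours (or $/hour) observed for each hour.
    percentile: share of hours allowed to fall below the baseline (assumed 10).
    """
    if not hourly_usage:
        raise ValueError("hourly_usage must not be empty")
    ordered = sorted(hourly_usage)
    # Simple rank-based percentile: pick the value at that position.
    index = min(len(ordered) - 1, len(ordered) * percentile // 100)
    baseline = ordered[index]
    # Usage above the baseline in a given hour stays on On-Demand or Spot.
    variable = [max(0, u - baseline) for u in hourly_usage]
    return baseline, variable


baseline, variable = split_commitment([4, 4, 5, 6, 10, 12, 5, 4, 4, 4])
print(baseline, sum(variable))
```

A lower percentile gives a safer (smaller) commitment; a higher one raises coverage at the risk of paying for unused hours.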

Azure Reservations

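# Azure CLI - Purchase reservation (abridged)
# Run `az reservations reservation-order calculate` first; the purchase call also
# expects the resulting --reservation-order-id plus flags such as
# --reserved-resource-type, --location, and --display-name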
az reservations reservation-order purchase \
  --sku Standard_D2s_v3 \
  --term P1Y \
  --billing-scope /subscriptions/{subscription-id} \
  --quantity 10 \
  --applied-scope-type Shared

GCP Committed Use Discounts

  • Resource-based: Specific vCPUs and memory
  • Spend-based: Dollar commitment for flexibility
  • 1-year (up to 37% discount) or 3-year (up to 55% discount) for most resource-based commitments

Spot/Preemptible Instances

When to Use Spot

  • Batch processing and analytics
  • CI/CD build agents
  • Stateless web servers (with auto-scaling)
  • Machine learning training
  • Development and testing environments

AWS Spot Best Practices

# EC2 Auto Scaling with Spot
MixedInstancesPolicy:
  InstancesDistribution:
    OnDemandBaseCapacity: 2
    OnDemandPercentageAboveBaseCapacity: 20
    SpotAllocationStrategy: capacity-optimized
  LaunchTemplate:
    Overrides:
      - InstanceType: m5.large
      - InstanceType: m5a.large
      - InstanceType: m4.large
      - InstanceType: r5.large

Spot Interruption Handling

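# Check for spot termination notice (AWS, using IMDSv2)
import requests

IMDS = "http://169.254.169.254/latest"

def graceful_shutdown():
    """Placeholder: drain connections, checkpoint work, deregister."""

def check_spot_termination():
    try:
        # IMDSv2 requires a session token on every metadata request
        token = requests.put(
            f"{IMDS}/api/token",
            headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
            timeout=2,
        ).text
        response = requests.get(
            f"{IMDS}/meta-data/spot/termination-time",
            headers={"X-aws-ec2-metadata-token": token},
            timeout=2,
        )
    except requests.exceptions.RequestException:
        return False  # Metadata service unreachable (e.g. not on EC2)
    if response.status_code == 200:
        # 2-minute warning - shut down gracefully
        graceful_shutdown()
        return True
    return False  # 404: no termination scheduled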

GCP Preemptible/Spot VMs

# Terraform - GCP Spot VM
resource "google_compute_instance" "spot" {
  name         = "spot-instance"
  machine_type = "n2-standard-4"

  scheduling {
    preemptible                 = true
    automatic_restart           = false
    provisioning_model          = "SPOT"
    instance_termination_action = "STOP"
  }
}

Right-Sizing

Analysis Process

  1. Collect metrics (CPU, memory, network, disk I/O)
  2. Identify idle or underutilized resources
  3. Recommend appropriate instance size
  4. Implement changes during maintenance windows
  5. Monitor and iterate

AWS Compute Optimizer

# Enable Compute Optimizer
aws compute-optimizer update-enrollment-status \
  --status Active \
  --include-member-accounts

# Get recommendations
aws compute-optimizer get-ec2-instance-recommendations \
  --filters name=Finding,values=Overprovisioned

Right-Sizing Thresholds

Metric    Underutilized    Optimal       Overutilized
CPU       <20% avg         40-60% avg    >80% avg
Memory    <30% avg         50-70% avg    >85% avg
Network   <10% capacity    Variable      >80% capacity
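
The thresholds above can be turned into a simple classifier. This is a sketch only: the function names and the rule for combining metrics (downsize only when every metric is underutilized) are assumptions, not part of any provider tool.

```python
# (underutilized below, overutilized above), in percent, from the thresholds above
THRESHOLDS = {
    "cpu": (20, 80),
    "memory": (30, 85),
    "network": (10, 80),
}


def classify(metric, avg_percent):
    """Return 'underutilized', 'overutilized', or 'ok' for one metric."""
    low, high = THRESHOLDS[metric]
    if avg_percent < low:
        return "underutilized"
    if avg_percent > high:
        return "overutilized"
    return "ok"


def rightsizing_hint(metrics):
    """Combine per-metric findings into one hint (hypothetical rule)."""
    if not metrics:
        raise ValueError("at least one metric is required")
    findings = [classify(name, value) for name, value in metrics.items()]
    if any(f == "overutilized" for f in findings):
        return "upsize"
    if all(f == "underutilized" for f in findings):
        return "downsize"
    return "keep"


print(rightsizing_hint({"cpu": 12, "memory": 25}))
```

Requiring all metrics to agree before downsizing avoids shrinking an instance that is idle on CPU but constrained on memory.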

Azure Advisor Recommendations

# Get cost recommendations
az advisor recommendation list \
  --category Cost \
  --query "[?impact=='High']"

Storage Cost Optimization

Object Storage Tiering

AWS S3 Storage Classes

S3 Standard
    |
    | (30 days)
    v
S3 Standard-IA
    |
    | (90 days)
    v
S3 Glacier Instant Retrieval
    |
    | (180 days)
    v
S3 Glacier Deep Archive
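
For auditing an existing bucket, the ladder above can be expressed as a lookup from object age to expected storage class. A minimal sketch; `storage_class_for_age` is a hypothetical helper and the day boundaries are simply those in the diagram.

```python
# (minimum age in days, storage class), oldest tier first, from the ladder above
TIERS = [
    (180, "S3 Glacier Deep Archive"),
    (90, "S3 Glacier Instant Retrieval"),
    (30, "S3 Standard-IA"),
    (0, "S3 Standard"),
]


def storage_class_for_age(age_days):
    """Return the class an object of this age should be in under the ladder."""
    if age_days < 0:
        raise ValueError("age_days must be non-negative")
    for min_age, name in TIERS:
        if age_days >= min_age:
            return name


print(storage_class_for_age(45))
```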

Lifecycle Policy Example

{
  "Rules": [
    {
      "ID": "OptimizeCosts",
      "Status": "Enabled",
      "Filter": { "Prefix": "logs/" },
      "Transitions": [
        { "Days": 30, "StorageClass": "STANDARD_IA" },
        { "Days": 90, "StorageClass": "GLACIER" },
        { "Days": 365, "StorageClass": "DEEP_ARCHIVE" }
      ],
      "Expiration": { "Days": 730 }
    }
  ]
}

S3 Intelligent-Tiering

  • Automatic tiering based on access patterns
  • No retrieval fees
  • Small monitoring fee per object
  • Best for unpredictable access patterns

Block Storage Optimization

EBS Volume Selection

Type   Use Case                   $/GB/month
gp3    General purpose            $0.08
gp2    Legacy (migrate to gp3)    $0.10
io2    High IOPS databases        $0.125+
st1    Throughput (big data)      $0.045
sc1    Cold archives              $0.015

gp3 Migration (20% savings)

# Modify EBS volume from gp2 to gp3
aws ec2 modify-volume \
  --volume-id vol-12345678 \
  --volume-type gp3 \
  --iops 3000 \
  --throughput 125
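
The 20% figure follows directly from the per-GB prices listed above. A small sketch of the arithmetic, assuming those list prices and a volume that stays at the gp3 baseline of 3000 IOPS and 125 MB/s (extra IOPS or throughput is billed separately and is ignored here):

```python
GP2_PER_GB = 0.10  # $/GB-month, from the volume table above
GP3_PER_GB = 0.08  # $/GB-month, from the volume table above


def gp3_monthly_savings(size_gb):
    """Storage-only monthly savings in dollars for moving a volume to gp3."""
    return size_gb * (GP2_PER_GB - GP3_PER_GB)


def gp3_savings_percent():
    """Relative storage saving of gp3 over gp2, in percent."""
    return (GP2_PER_GB - GP3_PER_GB) / GP2_PER_GB * 100


print(round(gp3_monthly_savings(500), 2), round(gp3_savings_percent(), 1))
```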

Database Storage

Aurora Storage Optimization

  • Pay only for storage used (auto-scaling)
  • No pre-provisioning required
  • 10GB increments up to 128TB

DynamoDB Capacity Modes

Mode                         Best For                    Pricing
On-Demand                    Unpredictable traffic       Pay per request
Provisioned                  Steady traffic              Pay per capacity unit
Provisioned + Auto Scaling   Variable but predictable    Lower cost than on-demand

Network Cost Optimization

Data Transfer Costs

AWS Data Transfer Pricing

Inbound: Free
Same AZ: Free
Cross-AZ: $0.01/GB each direction
Same Region (via public IP): $0.01/GB
Cross-Region: from $0.02/GB (varies by region pair)
Internet Egress: $0.09/GB (first 10TB)
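
To see which paths dominate a bill, the rates above can be applied to measured traffic volumes. This is a rough sketch: the path names and `monthly_transfer_cost` are hypothetical, the rates are the ones listed here, and volume tiers beyond the first 10TB are ignored.

```python
# $/GB, copied from the list above; real rates vary by region and volume tier
RATES = {
    "inbound": 0.00,
    "same_az": 0.00,
    "cross_az": 0.01 * 2,  # charged in each direction
    "same_region_public_ip": 0.01,
    "cross_region": 0.02,
    "internet_egress": 0.09,  # first 10TB tier only
}


def monthly_transfer_cost(gb_by_path):
    """Sum transfer cost for a mapping of path name -> GB transferred."""
    unknown = set(gb_by_path) - set(RATES)
    if unknown:
        raise KeyError(f"unknown transfer paths: {sorted(unknown)}")
    return sum(RATES[path] * gb for path, gb in gb_by_path.items())


print(round(monthly_transfer_cost({"cross_az": 1000, "internet_egress": 500}), 2))
```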

Optimization Strategies

  1. Keep traffic within same AZ when possible
  2. Use VPC endpoints for AWS services
  3. Use CloudFront for cacheable content
  4. Compress data before transfer
  5. Use regional rather than global services

VPC Endpoints (Avoid NAT Gateway)

# Gateway endpoint (free for S3, DynamoDB)
resource "aws_vpc_endpoint" "s3" {
  vpc_id       = aws_vpc.main.id
  service_name = "com.amazonaws.us-east-1.s3"
}

# Interface endpoint (cheaper than NAT for specific services)
resource "aws_vpc_endpoint" "ecr" {
  vpc_id            = aws_vpc.main.id
  service_name      = "com.amazonaws.us-east-1.ecr.api"
  vpc_endpoint_type = "Interface"
}

CDN Optimization

CloudFront Cost Savings

  • Lower data transfer rates than direct from origin
  • Cache hit ratio optimization (target >90%)
  • Use Origin Shield to reduce origin load
  • Compress objects (Gzip/Brotli)

# CloudFront cache optimization
CacheBehaviors:
  - PathPattern: "/static/*"
    CachePolicyId: 658327ea-f89d-4fab-a63d-7e88639e58f6  # CachingOptimized
    Compress: true
    # TTLs come from the cache policy rather than the behavior;
    # CachingOptimized uses DefaultTTL 86400 and MaxTTL 31536000
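
The cache hit ratio target matters because origin traffic scales with the miss rate, not the hit rate. A minimal sketch, assuming the ratio is measured by bytes rather than requests; `origin_gb` is a hypothetical helper:

```python
def origin_gb(total_gb, cache_hit_ratio):
    """GB that still has to be fetched from the origin."""
    if not 0 <= cache_hit_ratio <= 1:
        raise ValueError("cache_hit_ratio must be between 0 and 1")
    return total_gb * (1 - cache_hit_ratio)


# Raising the hit ratio from 80% to 90% halves origin traffic.
print(round(origin_gb(10_000, 0.80)), round(origin_gb(10_000, 0.90)))
```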

Serverless Cost Optimization

Lambda Optimization

Memory/CPU Tuning

# Use AWS Lambda Power Tuning
# Finds optimal memory for cost vs performance

# Results example (illustrative; x86 at $0.0000166667 per GB-second + $0.20 per 1M requests):
# 128MB:  ~$0.0000069 per invocation, 3200ms duration
# 256MB:  ~$0.0000069 per invocation, 1600ms duration
# 512MB:  ~$0.0000069 per invocation, 800ms duration
# 1024MB: ~$0.0000077 per invocation, 450ms duration
# Optimal: 512MB (same cost as 128MB but 4x faster; cost rises beyond it)
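
Lambda bills duration in GB-seconds, so per-invocation cost can be recomputed from any memory/duration pair. The sketch below assumes an x86 rate of $0.0000166667 per GB-second and $0.20 per million requests (check current pricing for your region); `invocation_cost` and `best_memory` are hypothetical helpers, not part of the Power Tuning tool.

```python
PRICE_PER_GB_SECOND = 0.0000166667  # assumed x86 rate
PRICE_PER_REQUEST = 0.20 / 1_000_000  # assumed request charge


def invocation_cost(memory_mb, duration_ms):
    """Cost of one invocation: GB-seconds times rate, plus the request fee."""
    gb_seconds = (memory_mb / 1024) * (duration_ms / 1000)
    return gb_seconds * PRICE_PER_GB_SECOND + PRICE_PER_REQUEST


def best_memory(results, tolerance=0.05):
    """Fastest setting whose cost is within `tolerance` of the cheapest one.

    results: mapping of memory in MB -> measured duration in ms.
    """
    costs = {mem: invocation_cost(mem, ms) for mem, ms in results.items()}
    floor = min(costs.values())
    candidates = [m for m, c in costs.items() if c <= floor * (1 + tolerance)]
    return min(candidates, key=lambda m: results[m])


measured = {128: 3200, 256: 1600, 512: 800, 1024: 450}
print(best_memory(measured))
```

When duration halves each time memory doubles, cost stays flat, so the larger setting is a free speed-up; once the speed-up tapers off, extra memory only adds cost.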

Cost Reduction Strategies

  1. Right-size memory allocation
  2. Minimize cold starts (provisioned concurrency for critical paths)
  3. Use ARM64 (Graviton2) - about 20% lower duration price than x86
  4. Optimize package size for faster cold starts
  5. Use Lambda Layers for shared dependencies

Graviton2 Migration

# SAM template with ARM64
Resources:
  MyFunction:
    Type: AWS::Serverless::Function
    Properties:
      Runtime: python3.11
      Architectures:
        - arm64  # 20% cost savings

Container Optimization

Fargate Pricing Optimization

# Fargate Spot: Up to 70% discount
# Use for fault-tolerant workloads

ECS Service:
  CapacityProviderStrategy:
    - CapacityProvider: FARGATE_SPOT
      Weight: 4
    - CapacityProvider: FARGATE
      Weight: 1
      Base: 2  # Minimum on-demand tasks

Right-Size Container Resources

# Analyze actual usage with Container Insights
resources:
  requests:
    memory: "256Mi"  # Based on p95 usage + 20% buffer
    cpu: "100m"      # Based on p95 usage + 20% buffer
  limits:
    memory: "512Mi"  # 2x requests for burst
    cpu: "500m"
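
The "p95 usage + 20% buffer" rule in the comments can be computed from exported usage samples. A minimal sketch with hypothetical helper names; it models the request rule and the 2x memory burst limit, and uses a nearest-rank percentile for simplicity.

```python
def p95(samples):
    """Nearest-rank 95th percentile of a non-empty list of numbers."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) in integer math
    return ordered[rank - 1]


def recommend(samples, buffer_percent=20, burst_factor=2):
    """Request = p95 plus buffer (rounded up); limit = request * burst_factor."""
    value = p95(samples) * (100 + buffer_percent)
    request = -(-value // 100)  # ceiling division
    return {"request": request, "limit": request * burst_factor}


# Memory samples in Mi: 1..100, so p95 is 95 and the request is 114.
print(recommend(list(range(1, 101))))
```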

Cost Allocation and Tagging

Tagging Strategy

Required Tags

# Terraform - enforce tags
variable "required_tags" {
  default = {
    environment  = "prod"
    cost-center  = "engineering"
    owner        = "platform-team"
    project      = "api-gateway"
    managed-by   = "terraform"
  }
}

resource "aws_instance" "example" {
  ami           = data.aws_ami.latest.id
  instance_type = "t3.medium"
  tags          = var.required_tags
}

Tag Enforcement

// AWS SCP - Deny untagged resources
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyUntaggedEC2",
      "Effect": "Deny",
      "Action": "ec2:RunInstances",
      "Resource": "arn:aws:ec2:*:*:instance/*",
      "Condition": {
        "Null": {
          "aws:RequestTag/cost-center": "true"
        }
      }
    }
  ]
}
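
The SCP above only blocks launches that lack `cost-center`. A pre-deployment check covering the full required tag set could look like the sketch below; `missing_tags` is a hypothetical helper, and the key names are taken from the Terraform variable above.

```python
REQUIRED_TAGS = {"environment", "cost-center", "owner", "project", "managed-by"}


def missing_tags(tags):
    """Return the required tag keys that are absent or blank, sorted."""
    return sorted(
        key for key in REQUIRED_TAGS
        if not str(tags.get(key, "")).strip()
    )


print(missing_tags({"environment": "prod", "owner": "platform-team"}))
```

Running a check like this in CI catches untagged resources before the SCP rejects them at launch time.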

Cost Allocation Reports

AWS Cost and Usage Report

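# Enable detailed billing reports (the CUR API is served from us-east-1;
# the S3Prefix value is a placeholder)
aws cur put-report-definition \
  --region us-east-1 \
  --report-definition '{
    "ReportName": "detailed-cost-report",
    "TimeUnit": "HOURLY",
    "Format": "Parquet",
    "Compression": "Parquet",
    "AdditionalSchemaElements": ["RESOURCES"],
    "S3Bucket": "my-billing-bucket",
    "S3Prefix": "cur",
    "S3Region": "us-east-1",
    "AdditionalArtifacts": ["ATHENA"],
    "ReportVersioning": "OVERWRITE_REPORT"
  }'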

Athena Queries for Analysis

-- Cost by service and tag
SELECT
  line_item_product_code as service,
  resource_tags_user_cost_center as cost_center,
  SUM(line_item_unblended_cost) as cost
FROM cost_report
WHERE year = '2024' AND month = '1'  -- CUR Athena tables are partitioned by year and month
GROUP BY 1, 2
ORDER BY 3 DESC;

-- Unused Reserved Instances
SELECT
  reservation_reservation_a_r_n,
  reservation_unused_quantity,
  reservation_unused_normalized_unit_quantity
FROM cost_report
WHERE reservation_unused_quantity > 0;

Automation and Governance

Automated Cost Controls

AWS Budgets with Actions

# CloudFormation - Budget with email alert at 80% of the limit
Resources:
  MonthlyCostBudget:
    Type: AWS::Budgets::Budget
    Properties:
      Budget:
        BudgetName: monthly-cost-limit
        BudgetLimit:
          Amount: 10000
          Unit: USD
        TimeUnit: MONTHLY
        BudgetType: COST
      NotificationsWithSubscribers:
        - Notification:
            NotificationType: ACTUAL
            ComparisonOperator: GREATER_THAN
            Threshold: 80
          Subscribers:
            - SubscriptionType: EMAIL
              Address: finance@company.com

Scheduled Scaling (Dev/Test)

# Stop non-prod resources nights/weekends
Resources:
  ScaleDownSchedule:
    Type: AWS::AutoScaling::ScheduledAction
    Properties:
      AutoScalingGroupName: !Ref DevASG
      DesiredCapacity: 0  # Only valid if the group's MinSize is 0
      Recurrence: "0 20 * * MON-FRI"  # 8 PM weekdays (UTC unless TimeZone is set)

  ScaleUpSchedule:
    Type: AWS::AutoScaling::ScheduledAction
    Properties:
      AutoScalingGroupName: !Ref DevASG
      DesiredCapacity: 3
      Recurrence: "0 8 * * MON-FRI"   # 8 AM weekdays
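
The saving from this schedule is easy to estimate: the group runs 12 hours on 5 days out of a 168-hour week. A small sketch, assuming cost is proportional to running hours; `scheduled_savings` is a hypothetical helper:

```python
HOURS_PER_WEEK = 7 * 24


def scheduled_savings(start_hour=8, stop_hour=20, days_per_week=5):
    """Fraction of always-on cost avoided by running only in the window."""
    running_hours = (stop_hour - start_hour) * days_per_week
    return 1 - running_hours / HOURS_PER_WEEK


# 60 of 168 hours running, so roughly 64% of compute hours are avoided.
print(round(scheduled_savings() * 100))
```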

Cost Anomaly Detection

AWS Cost Anomaly Detection

# Create anomaly monitor
aws ce create-anomaly-monitor \
  --anomaly-monitor '{
    "MonitorName": "ServiceMonitor",
    "MonitorType": "DIMENSIONAL",
    "MonitorDimension": "SERVICE"
  }'

# Create anomaly subscription (Frequency is required; email supports DAILY or WEEKLY)
aws ce create-anomaly-subscription \
  --anomaly-subscription '{
    "SubscriptionName": "CostAlerts",
    "MonitorArnList": ["arn:aws:ce::123456789012:anomalymonitor/abc123"],
    "Subscribers": [{"Type": "EMAIL", "Address": "alerts@company.com"}],
    "Frequency": "DAILY",
    "Threshold": 100
  }'

Cost Metrics and KPIs

Key Metrics

Metric              Formula                            Target
Unit Cost           Total Cost / Business Metric       Decreasing
Coverage            Reserved Hours / Total Hours       >70%
Utilization         Used Reserved Hours / Purchased    >80%
Waste               Idle Resource Cost / Total Cost    <10%
Forecast Accuracy   Actual / Forecasted                90-110%
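
The ratio metrics above can be computed and checked against their targets in a few lines. This is a sketch under assumptions: the function and parameter names are hypothetical, and target bounds are treated as inclusive.

```python
def ratio(numerator, denominator):
    """Safe division: returns None when the denominator is zero."""
    return numerator / denominator if denominator else None


def kpis(covered_hours, total_hours, used_reserved_hours,
         purchased_reserved_hours, idle_cost, total_cost, actual, forecast):
    """Compute the ratio metrics from the table above."""
    return {
        "coverage": ratio(covered_hours, total_hours),
        "utilization": ratio(used_reserved_hours, purchased_reserved_hours),
        "waste": ratio(idle_cost, total_cost),
        "forecast_accuracy": ratio(actual, forecast),
    }


# Targets from the table above as (minimum, maximum); None means unbounded.
TARGETS = {
    "coverage": (0.70, None),
    "utilization": (0.80, None),
    "waste": (None, 0.10),
    "forecast_accuracy": (0.90, 1.10),
}


def off_target(values):
    """Names of metrics that are undefined or outside their target range."""
    misses = []
    for name, value in values.items():
        low, high = TARGETS[name]
        if (value is None
                or (low is not None and value < low)
                or (high is not None and value > high)):
            misses.append(name)
    return sorted(misses)


values = kpis(800, 1000, 600, 800, 50, 1000, 105, 100)
print(off_target(values))
```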

Dashboard Example

-- Cost efficiency dashboard metrics
WITH metrics AS (
  SELECT
    date_trunc('month', usage_date) as month,
    SUM(cost) as total_cost,
    SUM(CASE WHEN reservation_arn IS NOT NULL THEN cost ELSE 0 END) as reserved_cost,
    COUNT(DISTINCT user_id) as active_users
  FROM cloud_costs
  GROUP BY 1
)
SELECT
  month,
  total_cost,
  reserved_cost / NULLIF(total_cost, 0) as reservation_coverage,  -- share of spend, not hours
  total_cost / NULLIF(active_users, 0) as cost_per_user
FROM metrics;

Quick Wins Checklist

Immediate Savings (This Week)

  • Delete unused EBS volumes and snapshots
  • Terminate stopped EC2 instances that are no longer needed
  • Remove unused Elastic IPs
  • Delete unused load balancers
  • Review and delete old AMIs

Short-Term (This Month)

  • Right-size underutilized instances
  • Migrate gp2 volumes to gp3
  • Implement S3 lifecycle policies
  • Enable S3 Intelligent-Tiering
  • Schedule dev/test environments

Medium-Term (This Quarter)

  • Purchase Savings Plans for baseline
  • Implement Spot for fault-tolerant workloads
  • Set up cost allocation tags
  • Enable Cost Anomaly Detection
  • Establish FinOps practices