
AWS Architecture Reference

Comprehensive guide for AWS services, patterns, and Well-Architected Framework implementation.

Well-Architected Framework

Six Pillars

  1. Operational Excellence

    • Infrastructure as Code (CloudFormation, CDK, Terraform)
    • Continuous integration/deployment
    • Observability (CloudWatch, X-Ray)
    • Runbooks and playbooks
    • Game days and failure injection
  2. Security

    • Identity and Access Management (IAM)
    • Detective controls (GuardDuty, Security Hub)
    • Infrastructure protection (VPC, security groups, NACLs)
    • Data protection (KMS, encryption at rest and in transit)
    • Incident response automation
  3. Reliability

    • Multi-AZ deployments
    • Auto Scaling groups
    • Route 53 health checks and failover
    • Backup and restore (AWS Backup)
    • Chaos engineering (AWS FIS)
  4. Performance Efficiency

    • Right-sizing with Compute Optimizer
    • Caching strategies (CloudFront, ElastiCache)
    • Database optimization (RDS Performance Insights)
    • Serverless architectures
    • Global content delivery
  5. Cost Optimization

    • Reserved Instances and Savings Plans
    • Spot Instances for fault-tolerant workloads
    • S3 Intelligent-Tiering and lifecycle policies
    • Right-sizing recommendations
    • Cost allocation tags and budgets
  6. Sustainability

    • Region selection based on renewable energy availability
    • Serverless to minimize idle resources
    • Efficient data storage patterns
    • Resource utilization optimization

Core Services Architecture

Compute

EC2 (Elastic Compute Cloud)

  • Instance families (examples): general purpose (t3, m5), compute optimized (c5), memory optimized (r5), accelerated computing/GPU (p3, g4)
  • Auto Scaling: Target tracking, step scaling, scheduled scaling
  • Placement groups: Cluster, partition, spread
  • Best practices: Use latest generation, right-size, enable detailed monitoring
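Target tracking scales capacity roughly in proportion to how far the tracked metric sits from its target. The sketch below approximates that proportional rule for a single metric such as average CPU utilization. The function name is hypothetical, and the real service's cooldowns, instance warm-up, and more cautious scale-in are not modeled.

```python
import math


def target_tracking_capacity(current_capacity: int, metric_value: float,
                             target_value: float, min_size: int, max_size: int) -> int:
    """Approximate target tracking: scale capacity by metric/target, then clamp.

    Hypothetical helper; the real policy also applies cooldown and warm-up logic.
    """
    if target_value <= 0:
        raise ValueError("target_value must be positive")
    desired = math.ceil(current_capacity * metric_value / target_value)
    # Never go outside the Auto Scaling group's min/max size.
    return max(min_size, min(max_size, desired))


# 4 instances averaging 90% CPU against a 60% target: ceil(4 * 90 / 60) = 6
print(target_tracking_capacity(4, 90.0, 60.0, min_size=2, max_size=10))  # 6
```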

Lambda

  • Invocation models: Synchronous, asynchronous, event source mapping
  • Concurrency: Reserved, provisioned, burst limits
  • Layers for shared dependencies
  • Best practices: Keep functions small, use environment variables, set timeouts
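For capacity planning, the steady-state concurrency a function needs is approximately its request rate multiplied by its average duration in seconds. A minimal sketch of that estimate, with hypothetical helper names, checks whether a workload fits a given reserved concurrency setting:

```python
import math


def required_concurrency(requests_per_second: float, avg_duration_ms: float) -> int:
    """Steady-state concurrency estimate: requests per second x average duration (s)."""
    return math.ceil(requests_per_second * avg_duration_ms / 1000.0)


def fits_reserved(requests_per_second: float, avg_duration_ms: float,
                  reserved_concurrency: int) -> bool:
    """True if the estimate stays within the function's reserved concurrency."""
    return required_concurrency(requests_per_second, avg_duration_ms) <= reserved_concurrency


# 100 requests/s at 500 ms each keeps about 50 executions in flight
print(required_concurrency(100, 500))                    # 50
print(fits_reserved(100, 500, reserved_concurrency=40))  # False
```

Bursty traffic needs headroom above this average, since the estimate uses mean rate and mean duration.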

ECS/EKS (Container Services)

  • ECS: Fargate launch type for serverless containers, EC2 launch type for host-level control
  • EKS: Managed Kubernetes with AWS integration
  • Service mesh: App Mesh for observability
  • Best practices: Use Fargate for simplicity, EKS for portability

Elastic Beanstalk

  • Managed platform for web apps
  • Auto-scaling and load balancing included
  • Support for multiple languages and Docker

Storage

S3 (Simple Storage Service)

  • Storage classes: Standard, Intelligent-Tiering, Standard-IA, One Zone-IA, Glacier Instant Retrieval, Glacier Flexible Retrieval, Glacier Deep Archive
  • Lifecycle policies for automatic tiering
  • Versioning and MFA delete for protection
  • Cross-region replication for DR
  • Best practices: Enable versioning, use lifecycle policies, block public access
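A lifecycle rule is a small structured document. The sketch below builds one in the dict shape S3's lifecycle configuration API uses (`Filter`, `Status`, `Transitions`, `Expiration`, `NoncurrentVersionExpiration`). The `logs/` prefix, the day counts, and the "expire after the last transition" check are illustrative assumptions; the 30-day minimum for transitions to STANDARD_IA or ONEZONE_IA is an S3 rule.

```python
IA_CLASSES = {"STANDARD_IA", "ONEZONE_IA"}


def lifecycle_rule(prefix: str, transitions: list[tuple[int, str]],
                   expire_after_days: int, noncurrent_expire_days: int) -> dict:
    """Build one lifecycle rule in the dict shape used by S3's lifecycle API."""
    days = [d for d, _ in transitions]
    if days != sorted(days):
        raise ValueError("transitions must be listed in ascending order of days")
    for d, storage_class in transitions:
        # S3 rejects Days < 30 for transitions to STANDARD_IA / ONEZONE_IA.
        if storage_class in IA_CLASSES and d < 30:
            raise ValueError(f"{storage_class} transitions need Days >= 30")
    # Design choice for this sketch (not an S3 rule): expire after the last transition.
    if days and expire_after_days <= days[-1]:
        raise ValueError("expiration should come after the last transition")
    return {
        "ID": f"tier-{prefix.strip('/') or 'all'}",
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Transitions": [{"Days": d, "StorageClass": c} for d, c in transitions],
        "Expiration": {"Days": expire_after_days},
        # With versioning enabled, also clean up old object versions.
        "NoncurrentVersionExpiration": {"NoncurrentDays": noncurrent_expire_days},
    }


# Example values (hypothetical): tier a logs/ prefix down to Deep Archive, expire at 1 year.
config = {"Rules": [lifecycle_rule("logs/",
                                   [(30, "STANDARD_IA"), (90, "GLACIER"), (180, "DEEP_ARCHIVE")],
                                   expire_after_days=365, noncurrent_expire_days=30)]}
print(config["Rules"][0]["ID"])  # tier-logs
```

With boto3, a dict like `config` would be passed as `LifecycleConfiguration=config` to `put_bucket_lifecycle_configuration(Bucket=...)` on an S3 client.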

EBS (Elastic Block Store)

  • Volume types: gp3 (general), io2 (IOPS), st1 (throughput), sc1 (cold)
  • Snapshots to S3 for backup
  • Encryption by default (a per-Region account setting that must be enabled)
  • Best practices: Use gp3 for most workloads, enable encryption
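gp2 performance scales with volume size (3 IOPS per GiB, with a floor of 100 and a ceiling of 16,000 IOPS), whereas every gp3 volume includes a 3,000 IOPS and 125 MiB/s baseline regardless of size. A small sketch comparing the two IOPS baselines, useful when planning a gp2-to-gp3 migration; burst credits and pricing are deliberately left out, and the helper names are hypothetical.

```python
GP3_BASELINE_IOPS = 3000   # included with every gp3 volume, regardless of size


def gp2_baseline_iops(size_gib: int) -> int:
    """gp2 baseline: 3 IOPS per GiB, never below 100 or above 16,000."""
    return max(100, min(16000, 3 * size_gib))


def gp3_extra_iops_needed(size_gib: int) -> int:
    """IOPS to provision on gp3, above its baseline, to match the gp2 baseline."""
    return max(0, gp2_baseline_iops(size_gib) - GP3_BASELINE_IOPS)


for size in (100, 1000, 2000):
    print(size, gp2_baseline_iops(size), gp3_extra_iops_needed(size))
# 100 300 0
# 1000 3000 0
# 2000 6000 3000
```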

EFS (Elastic File System)

  • NFSv4 file system for shared access
  • Performance modes: General purpose, Max I/O
  • Throughput modes: Elastic, bursting, provisioned
  • Best practices: Use lifecycle management, enable encryption

FSx

  • FSx for Windows File Server (SMB)
  • FSx for Lustre (HPC workloads)
  • FSx for NetApp ONTAP
  • FSx for OpenZFS

Database

RDS (Relational Database Service)

  • Engines: MySQL, PostgreSQL, MariaDB, Oracle, SQL Server, Aurora
  • Multi-AZ for high availability
  • Read replicas for scalability
  • Automated backups and point-in-time recovery
  • Best practices: Use Aurora for performance, enable Multi-AZ, use read replicas

Aurora

  • MySQL and PostgreSQL compatible
  • Up to 5x the throughput of standard MySQL and 3x that of standard PostgreSQL (AWS-published figures)
  • Global databases for cross-region DR
  • Serverless v2 for variable workloads
  • Best practices: Use Aurora Serverless v2 for unpredictable workloads

DynamoDB

  • NoSQL key-value and document database
  • On-demand or provisioned capacity
  • Global tables for multi-region replication
  • DynamoDB Streams for change data capture
  • Best practices: Use on-demand for unpredictable traffic, design GSIs carefully (writes that touch a GSI also consume capacity on that index)
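In provisioned mode, one read capacity unit (RCU) covers one strongly consistent read per second of an item up to 4 KB (eventually consistent reads cost half, transactional reads double), and one write capacity unit (WCU) covers one write per second of up to 1 KB (transactional writes double). Item sizes round up to the next 4 KB or 1 KB. A sketch of that arithmetic, with hypothetical function names:

```python
import math


def read_capacity_units(item_size_kb: float, reads_per_second: int,
                        consistency: str = "strong") -> int:
    """RCUs: 1 RCU = one strongly consistent read/s of an item up to 4 KB."""
    units_per_read = math.ceil(item_size_kb / 4)          # round up to 4 KB blocks
    factor = {"eventual": 0.5, "strong": 1.0, "transactional": 2.0}[consistency]
    return math.ceil(units_per_read * reads_per_second * factor)


def write_capacity_units(item_size_kb: float, writes_per_second: int,
                         transactional: bool = False) -> int:
    """WCUs: 1 WCU = one standard write/s of an item up to 1 KB."""
    units_per_write = math.ceil(item_size_kb)             # round up to 1 KB blocks
    return units_per_write * writes_per_second * (2 if transactional else 1)


# 6 KB items round up to 8 KB, so each strongly consistent read costs 2 RCUs
print(read_capacity_units(6, 100))                          # 200
print(read_capacity_units(6, 100, consistency="eventual"))  # 100
print(write_capacity_units(2.5, 10))                        # 30
```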

ElastiCache

  • Redis or Memcached in-memory caching
  • Cluster mode for Redis scalability
  • Best practices: Use for session storage, API caching
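A common way to put ElastiCache in front of a database is the cache-aside (lazy loading) pattern: read from the cache, fall back to the database on a miss, then populate the cache with a TTL. The sketch below uses a plain dict as a stand-in for Redis or Memcached and an injectable clock so expiry can be demonstrated; the class and its loader are hypothetical. With a real Redis client the store calls would become GET and SET with an expiry.

```python
import time
from typing import Callable


class CacheAside:
    """Cache-aside with a TTL; a dict stands in for Redis/Memcached (assumption)."""

    def __init__(self, loader: Callable[[str], str], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self._loader = loader      # reads from the system of record, e.g. RDS
        self._ttl = ttl_seconds
        self._clock = clock
        self._store: dict[str, tuple[str, float]] = {}
        self.misses = 0

    def get(self, key: str) -> str:
        entry = self._store.get(key)
        now = self._clock()
        if entry is not None and entry[1] > now:
            return entry[0]                    # cache hit
        self.misses += 1                       # miss or expired: go to the database
        value = self._loader(key)
        self._store[key] = (value, now + self._ttl)
        return value

    def invalidate(self, key: str) -> None:
        """Call after writing to the database so readers don't see stale data."""
        self._store.pop(key, None)


fake_now = [0.0]
cache = CacheAside(lambda k: f"profile:{k}", ttl_seconds=60, clock=lambda: fake_now[0])
cache.get("42")
cache.get("42")
print(cache.misses)  # 1
fake_now[0] = 61     # past the TTL
cache.get("42")
print(cache.misses)  # 2
```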

Networking

VPC (Virtual Private Cloud)

  • CIDR planning: Avoid overlaps, plan for growth
  • Subnets: Public (route to an internet gateway), private (outbound via a NAT gateway), isolated (no internet route)
  • Route tables and routing decisions
  • Security groups (stateful) and NACLs (stateless)
  • Best practices: Use /16 for VPC, /24 for subnets, plan IP space
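Subnet planning is easy to check with Python's standard `ipaddress` module. The sketch below carves a /16 into /24 subnets per tier and Availability Zone, accounts for the 5 addresses AWS reserves in every subnet, and flags overlapping CIDRs (which matter once VPCs are connected through peering or Transit Gateway). The CIDRs, AZ names, tier names, and helper names are example values.

```python
import ipaddress

# Per subnet: network address, VPC router, DNS, future use, and the last address.
AWS_RESERVED_PER_SUBNET = 5


def plan_subnets(vpc_cidr: str, azs: list[str], tiers: list[str],
                 prefix: int = 24) -> dict:
    """Assign consecutive subnets of the given prefix to each (tier, AZ) pair."""
    pool = ipaddress.ip_network(vpc_cidr).subnets(new_prefix=prefix)
    plan = {}
    for tier in tiers:
        for az in azs:
            plan[(tier, az)] = next(pool)   # StopIteration if the VPC is too small
    return plan


def usable_ips(subnet: ipaddress.IPv4Network) -> int:
    return subnet.num_addresses - AWS_RESERVED_PER_SUBNET


def overlapping(cidrs: list[str]) -> list[tuple[str, str]]:
    """Pairs of CIDRs that overlap and therefore cannot be routed to each other."""
    nets = [ipaddress.ip_network(c) for c in cidrs]
    return [(str(a), str(b)) for i, a in enumerate(nets)
            for b in nets[i + 1:] if a.overlaps(b)]


plan = plan_subnets("10.0.0.0/16", ["us-east-1a", "us-east-1b", "us-east-1c"],
                    ["public", "private", "isolated"])
print(plan[("private", "us-east-1a")])              # 10.0.3.0/24
print(usable_ips(plan[("private", "us-east-1a")]))  # 251
print(overlapping(["10.0.0.0/16", "10.1.0.0/16", "10.0.128.0/17"]))
# [('10.0.0.0/16', '10.0.128.0/17')]
```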

Route 53

  • DNS service with health checks
  • Routing policies: Simple, weighted, latency, failover, geolocation
  • Best practices: Use alias records, enable DNSSEC
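Weighted routing answers each query with a record chosen in proportion to its weight (weight divided by the sum of weights); when every weight is 0, records are returned with equal probability. A simulation sketch of a 90/10 canary split, with hypothetical record names:

```python
import random
from collections import Counter


def pick_weighted(records: dict[str, int], rng: random.Random) -> str:
    """Choose a record with probability weight / sum(weights).

    If every weight is 0, fall back to a uniform choice, mirroring Route 53.
    """
    names = list(records)
    weights = [records[n] for n in names]
    if sum(weights) == 0:
        return rng.choice(names)
    return rng.choices(names, weights=weights, k=1)[0]


# Canary (hypothetical names): about 10% of DNS answers point at the new stack.
records = {"blue.example.com": 90, "green.example.com": 10}
rng = random.Random(7)
counts = Counter(pick_weighted(records, rng) for _ in range(10_000))
print(counts["green.example.com"] / 10_000)  # close to 0.1
```

Because resolvers cache answers for the record's TTL, the traffic actually observed only approximates the configured weights.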

CloudFront

  • Global CDN with edge locations
  • Origin types: S3, ALB, custom origins
  • Lambda@Edge for request/response manipulation
  • Best practices: Enable compression, use field-level encryption for sensitive form fields

VPN and Direct Connect

  • Site-to-Site VPN for encrypted tunnels
  • Direct Connect for dedicated bandwidth
  • Transit Gateway for hub-and-spoke topology
  • Best practices: Use Direct Connect for high bandwidth, Transit Gateway for complex routing

API Gateway

  • REST APIs, HTTP APIs, WebSocket APIs
  • Throttling and quotas
  • Integration with Lambda, HTTP endpoints, AWS services
  • Best practices: Use HTTP APIs for lower cost; use REST APIs when you need features HTTP APIs lack, such as API caching
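API Gateway throttling is based on the token bucket algorithm: the rate limit is the steady refill rate and the burst limit is the bucket's capacity. A minimal token bucket sketch; the class name and timestamps are illustrative, not the service's implementation.

```python
class TokenBucket:
    """Token bucket: refills `rate` tokens per second up to `burst` tokens."""

    def __init__(self, rate: float, burst: int, now: float = 0.0):
        self.rate = rate
        self.burst = burst
        self.tokens = float(burst)   # the bucket starts full
        self.last = now

    def allow(self, now: float) -> bool:
        # Refill for the time elapsed since the previous request, capped at burst.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False                 # API Gateway would respond 429 Too Many Requests


bucket = TokenBucket(rate=10, burst=5)
burst_results = [bucket.allow(0.0) for _ in range(7)]
print(burst_results.count(True))  # 5: the burst is absorbed, the rest is throttled
print(bucket.allow(0.1))          # True: 0.1 s at 10 tokens/s refills one token
```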

Security

IAM (Identity and Access Management)

  • Principle of least privilege
  • Roles for applications, not access keys
  • MFA for privileged users
  • Service Control Policies (SCPs) for organization-wide controls
  • Best practices: Use roles, enable MFA, rotate credentials
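Least privilege is easiest to review when policies name specific actions and resources. The sketch below shows a hypothetical read-only policy scoped to one bucket prefix, plus a simple lint check that flags `*` or `service:*` actions and `*` resources in Allow statements. The bucket name is a placeholder, and the check is intentionally naive: it ignores `NotAction`, `NotResource`, and partial wildcards such as `s3:Get*`. IAM Access Analyzer policy validation covers far more cases.

```python
import json

# Hypothetical least-privilege policy: read objects under one prefix of one bucket.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "ReadReports",
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": "arn:aws:s3:::example-reports-bucket/reports/*",
        },
        {
            "Sid": "ListReportsPrefix",
            "Effect": "Allow",
            "Action": "s3:ListBucket",
            "Resource": "arn:aws:s3:::example-reports-bucket",
            "Condition": {"StringLike": {"s3:prefix": ["reports/*"]}},
        },
    ],
}


def _as_list(value) -> list:
    """IAM allows Statement, Action and Resource to be a single value or a list."""
    return value if isinstance(value, list) else [value]


def overly_broad(policy_doc: dict) -> list[str]:
    """Flag Allow statements with '*' / 'service:*' actions or a '*' resource."""
    findings = []
    for i, stmt in enumerate(_as_list(policy_doc["Statement"])):
        if stmt.get("Effect") != "Allow":
            continue
        sid = stmt.get("Sid", f"#{i}")
        actions = _as_list(stmt.get("Action", []))
        if any(a == "*" or a.endswith(":*") for a in actions):
            findings.append(f"{sid}: wildcard action")
        if "*" in _as_list(stmt.get("Resource", [])):
            findings.append(f"{sid}: wildcard resource")
    return findings


policy_json = json.dumps(policy, indent=2)  # e.g. saved as policy.json for the CLI
print(overly_broad(policy))  # []
print(overly_broad({"Statement": {"Effect": "Allow", "Action": "s3:*", "Resource": "*"}}))
# ['#0: wildcard action', '#0: wildcard resource']
```

The serialized JSON could then be supplied to `aws iam create-policy --policy-name <name> --policy-document file://policy.json`.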

KMS (Key Management Service)

  • Customer managed keys (KMS keys you create and control)
  • Automatic key rotation
  • Envelope encryption pattern
  • Best practices: Enable automatic rotation, use grants for temporary access

Secrets Manager

  • Automatic rotation for RDS credentials
  • Versioning and rollback
  • Best practices: Rotate secrets regularly, use VPC endpoints

Security Hub

  • Centralized security findings
  • CIS AWS Foundations Benchmark
  • Integration with GuardDuty, Inspector, Macie

GuardDuty

  • Threat detection using ML
  • Monitors CloudTrail, VPC Flow Logs, DNS logs

Architecture Patterns

High Availability

Multi-AZ Pattern

- Application Load Balancer across 3 AZs
- Auto Scaling group with instances in each AZ
- RDS Multi-AZ for database
- S3 for static assets (designed for 99.999999999% durability, i.e. eleven 9s)
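Eleven 9s of designed durability corresponds to an average annual loss probability of 10^-11 per object, so expected losses scale with object count. The sketch reproduces AWS's commonly cited illustration (10,000,000 objects, roughly one loss every 10,000 years). It treats object losses as independent, which is a simplifying assumption, and the function name is hypothetical.

```python
def expected_losses_per_year(num_objects: int, nines: int = 11) -> float:
    """Expected objects lost per year, assuming independent losses at 10**-nines each."""
    annual_loss_probability = 10.0 ** -nines   # eleven 9s -> 1e-11 per object per year
    return num_objects * annual_loss_probability


losses = expected_losses_per_year(10_000_000)
print(f"about one object lost every {round(1 / losses):,} years")
# about one object lost every 10,000 years
```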

Multi-Region Pattern

- Route 53 with health checks and failover
- CloudFront for global distribution
- Aurora Global Database for cross-Region replication (typical lag under 1 second, giving an RPO of about 1 second)
- S3 Cross-Region Replication

Serverless Architecture

API-Driven Pattern

API Gateway -> Lambda -> DynamoDB
              |
              v
          EventBridge -> Lambda (async processing)

Event-Driven Pattern

S3 Event -> Lambda -> Process -> SNS
                                  |
                                  v
                              Multiple subscribers
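A sketch of the Lambda function in this pattern: it reads the bucket and key from each S3 event record and forwards a summary for SNS fan-out. Object keys arrive URL-encoded (spaces become `+`), so they are decoded with `unquote_plus`. The `publish` parameter is a hypothetical injection point standing in for an SNS client call such as boto3's `sns.publish(TopicArn=..., Message=...)`, so the handler can run without AWS credentials.

```python
import json
from typing import Callable, Optional
from urllib.parse import unquote_plus


def handler(event: dict, context=None,
            publish: Optional[Callable[[str], None]] = None) -> list[dict]:
    """Process S3 event notification records (sketch)."""
    results = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = unquote_plus(record["s3"]["object"]["key"])   # keys are URL-encoded
        result = {"bucket": bucket, "key": key, "event": record["eventName"]}
        results.append(result)
        if publish is not None:
            publish(json.dumps(result))   # SNS then fans out to every subscriber
    return results


# Hypothetical sample event with only the fields this handler reads.
sample = {"Records": [{"eventName": "ObjectCreated:Put",
                       "s3": {"bucket": {"name": "example-uploads"},
                              "object": {"key": "invoices/March+2024.pdf"}}}]}
sent = []
print(handler(sample, publish=sent.append))
# [{'bucket': 'example-uploads', 'key': 'invoices/March 2024.pdf', 'event': 'ObjectCreated:Put'}]
```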

Microservices on AWS

Container-Based

ALB -> ECS Fargate (multiple services)
    |
    v
Service Discovery (Cloud Map)
    |
    v
RDS/DynamoDB per service

Service Mesh

App Mesh for traffic management
X-Ray for distributed tracing
CloudWatch Container Insights

Data Lake Architecture

Data Sources -> Kinesis Data Streams
                      |
                      v
              Amazon Data Firehose
                      |
                      v
                S3 (raw bucket)
                      |
                      v
        Glue ETL or Lambda processing
                      |
                      v
            S3 (processed bucket)
                      |
                      v
          Athena/Redshift Spectrum
                      |
                      v
              QuickSight dashboards
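Athena and Redshift Spectrum scan less data when the processed bucket is partitioned. A sketch of a Hive-style key layout (`year=/month=/day=`), which Athena can map to partition columns; the zone and dataset names are hypothetical, and the helpers expect timezone-aware datetimes.

```python
from datetime import datetime, timezone


def partitioned_key(dataset: str, event_time: datetime, filename: str,
                    zone: str = "processed") -> str:
    """Hive-style S3 key (year=/month=/day=) so queries can prune partitions.

    Expects a timezone-aware datetime; partitions are cut on UTC dates.
    """
    t = event_time.astimezone(timezone.utc)
    return (f"{zone}/{dataset}/year={t.year:04d}/month={t.month:02d}/"
            f"day={t.day:02d}/{filename}")


def month_prefix(dataset: str, year: int, month: int, zone: str = "processed") -> str:
    """Prefix holding one month of data, i.e. what a year/month-filtered query reads."""
    return f"{zone}/{dataset}/year={year:04d}/month={month:02d}/"


key = partitioned_key("clickstream",
                      datetime(2024, 3, 7, 23, 30, tzinfo=timezone.utc),
                      "part-0001.parquet")
print(key)  # processed/clickstream/year=2024/month=03/day=07/part-0001.parquet
print(key.startswith(month_prefix("clickstream", 2024, 3)))  # True
```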

Migration Strategies (6Rs)

  1. Rehost (Lift-and-Shift)

    • AWS Application Migration Service (MGN)
    • Minimal changes, quick migration
    • Use for legacy apps with compliance constraints
  2. Replatform (Lift-Tinker-and-Shift)

    • Migrate to RDS instead of self-managed databases
    • Use Elastic Beanstalk instead of custom app servers
    • Small optimizations during migration
  3. Repurchase (Drop-and-Shop)

    • Move to SaaS (e.g., Salesforce, Workday)
    • Reduce maintenance burden
  4. Refactor/Re-architect

    • Modernize to serverless or containers
    • Highest effort, potentially highest benefit
    • Use for applications that provide competitive advantage
  5. Retire

    • Decommission unused applications
    • Reduce attack surface and costs
  6. Retain

    • Keep on-premises temporarily
    • Migrate later or keep for regulatory reasons

Landing Zone Design

AWS Control Tower

  • Multi-account strategy (AWS Organizations)
  • Account factory for standardization
  • Controls (formerly called guardrails) for governance: preventive via SCPs, detective via AWS Config rules
  • Centralized logging (CloudTrail, Config)

Account Structure

Root
├── Security OU
│   ├── Log Archive Account
│   └── Security Tooling Account
├── Infrastructure OU
│   ├── Network Account (Transit Gateway, VPN)
│   └── Shared Services Account
└── Workloads OU
    ├── Production Account
    ├── Staging Account
    └── Development Account

Network Design

Transit Gateway (hub)
    |
    ├── Production VPC
    ├── Staging VPC
    ├── Development VPC
    └── On-premises (Direct Connect/VPN)

Cost Optimization Strategies

Compute Savings

  • Compute Savings Plans (up to 66% savings)
  • EC2 Reserved Instances (1-year or 3-year)
  • Spot Instances for batch/fault-tolerant workloads
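  • Lambda: right-size memory (cost scales with memory × duration, so lowering memory saves money only if duration does not grow proportionally); use reserved concurrency to cap runaway invocations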
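To see why Savings Plan commitments are usually sized to the steady baseline rather than to peak usage, here is a simplified hourly model: the commitment is billed every hour, it covers On-Demand usage worth `commitment / (1 - discount)`, and anything above that is billed at On-Demand rates. The single uniform discount rate is an assumption (actual discounts vary by instance family, Region, and term), and the function name is hypothetical.

```python
def hourly_cost_with_savings_plan(on_demand_usage: float, commitment: float,
                                  discount: float) -> float:
    """One hour's bill under a Savings Plan (simplified sketch).

    on_demand_usage: what the hour's usage would cost at On-Demand rates ($/h)
    commitment:      Savings Plan commitment ($/h), charged whether used or not
    discount:        fraction off On-Demand for covered usage (0.5 = 50%)
    """
    covered_on_demand = commitment / (1 - discount)   # On-Demand value the commitment covers
    overflow = max(0.0, on_demand_usage - covered_on_demand)
    return commitment + overflow                      # overflow billed at On-Demand rates


print(hourly_cost_with_savings_plan(10.0, 3.0, 0.5))  # 7.0 instead of 10.0 On-Demand
print(hourly_cost_with_savings_plan(2.0, 3.0, 0.5))   # 3.0: unused commitment is still billed
```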

Storage Savings

  • S3 Intelligent-Tiering for unpredictable access
  • Lifecycle policies to Glacier/Deep Archive
  • EBS gp3 instead of gp2 (up to 20% lower price per GB, with baseline performance independent of volume size)
  • Delete unused snapshots and volumes

Database Savings

  • Aurora Serverless v2 for variable workloads
  • RDS Reserved Instances
  • DynamoDB on-demand for unpredictable workloads
  • Read replicas in the same Region (ideally the same AZ as their readers) to reduce cross-Region and cross-AZ data transfer

Monitoring and Alerting

  • AWS Cost Explorer for analysis
  • AWS Budgets for alerts
  • Cost Anomaly Detection
  • Trusted Advisor for recommendations

Disaster Recovery

RPO and RTO Targets

  • Backup and Restore: Hours RPO/RTO (lowest cost)
  • Pilot Light: Minutes RPO, hours RTO
  • Warm Standby: Seconds RPO, minutes RTO
  • Multi-Site Active/Active: Near-zero RPO/RTO (highest cost)

Implementation

  • AWS Backup for centralized backup management
  • Aurora Global Database for cross-region replication
  • S3 Cross-Region Replication
  • Route 53 health checks and failover routing
  • Regular DR testing, rebuilding recovery environments from CloudFormation/Terraform templates

Monitoring and Observability

CloudWatch

  • Metrics: basic monitoring (5-minute) and detailed monitoring (1-minute) for EC2
  • Alarms with SNS notifications
  • Logs Insights for log analysis
  • Dashboards for visualization
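CloudWatch alarms evaluate the most recent N periods (`EvaluationPeriods`) and go to ALARM when at least M of them breach (`DatapointsToAlarm`). A simplified sketch of that "M out of N" rule for a greater-than threshold; missing-data treatment is not modeled, and the function name and sample values are hypothetical.

```python
def alarm_state(datapoints: list[float], threshold: float, evaluation_periods: int,
                datapoints_to_alarm: int) -> str:
    """'M out of N' evaluation for a GreaterThanThreshold alarm (sketch).

    Only the newest `evaluation_periods` datapoints count; missing data is
    simplified to INSUFFICIENT_DATA when the window is not full.
    """
    window = datapoints[-evaluation_periods:]
    if len(window) < evaluation_periods:
        return "INSUFFICIENT_DATA"
    breaching = sum(1 for v in window if v > threshold)
    return "ALARM" if breaching >= datapoints_to_alarm else "OK"


cpu = [40, 85, 50, 90, 95]   # 1-minute averages, oldest first
print(alarm_state(cpu, threshold=80, evaluation_periods=5, datapoints_to_alarm=3))  # ALARM
print(alarm_state(cpu, threshold=80, evaluation_periods=3, datapoints_to_alarm=3))  # OK
```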

X-Ray

  • Distributed tracing for microservices
  • Service map visualization
  • Trace annotations and metadata
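X-Ray propagates trace context in the `X-Amzn-Trace-Id` header, for example `Root=1-5759e988-bd862e3fe1be46a994272793;Parent=53995c3f42cd8ad8;Sampled=1`, where the root trace ID embeds the request start time as 8 hex digits of epoch seconds. A sketch of parsing it; the helper is hypothetical, and the AWS X-Ray SDKs normally handle this for you.

```python
import re
from datetime import datetime, timezone

# Root trace ID: version 1, 8 hex digits of epoch seconds, 24 hex digits of randomness.
TRACE_ID = re.compile(r"^1-([0-9a-f]{8})-[0-9a-f]{24}$")


def parse_trace_header(header_value: str) -> dict:
    """Split an X-Amzn-Trace-Id value into its key=value fields."""
    fields = dict(part.split("=", 1) for part in header_value.split(";") if "=" in part)
    match = TRACE_ID.match(fields.get("Root", ""))
    if match is None:
        raise ValueError("Root is not a valid X-Ray trace ID")
    fields["start_time"] = datetime.fromtimestamp(int(match.group(1), 16), tz=timezone.utc)
    return fields


h = "Root=1-5759e988-bd862e3fe1be46a994272793;Parent=53995c3f42cd8ad8;Sampled=1"
info = parse_trace_header(h)
print(info["Parent"], info["Sampled"])  # 53995c3f42cd8ad8 1
```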

AWS Config

  • Resource inventory and change tracking
  • Compliance rules evaluation
  • Relationship tracking between resources