# AWS Architecture Reference

Comprehensive guide for AWS services, patterns, and Well-Architected Framework implementation.

## Well-Architected Framework

### Six Pillars

1. **Operational Excellence**
   - Infrastructure as Code (CloudFormation, CDK, Terraform)
   - Continuous integration/deployment
   - Observability (CloudWatch, X-Ray)
   - Runbooks and playbooks
   - Game days and failure injection

2. **Security**
   - Identity and Access Management (IAM)
   - Detective controls (GuardDuty, Security Hub)
   - Infrastructure protection (VPC, security groups, NACLs)
   - Data protection (KMS, encryption at rest/transit)
   - Incident response automation

3. **Reliability**
   - Multi-AZ deployments
   - Auto Scaling groups
   - Route 53 health checks and failover
   - Backup and restore (AWS Backup)
   - Chaos engineering (AWS FIS)

4. **Performance Efficiency**
   - Right-sizing with Compute Optimizer
   - Caching strategies (CloudFront, ElastiCache)
   - Database optimization (RDS Performance Insights)
   - Serverless architectures
   - Global content delivery

5. **Cost Optimization**
   - Reserved Instances and Savings Plans
   - Spot Instances for fault-tolerant workloads
   - S3 Intelligent-Tiering and lifecycle policies
   - Right-sizing recommendations
   - Cost allocation tags and budgets

6. **Sustainability**
   - Region selection for renewable energy
   - Serverless to minimize idle resources
   - Efficient data storage patterns
   - Resource utilization optimization

## Core Services Architecture

### Compute

**EC2 (Elastic Compute Cloud)**
- Instance families: General (t3, m5), Compute (c5), Memory (r5), GPU (p3, g4)
- Auto Scaling: Target tracking, step scaling, scheduled scaling
- Placement groups: Cluster, partition, spread
- Best practices: Use latest generation, right-size, enable detailed monitoring

**Lambda**
- Invocation models: Synchronous, asynchronous, event source mapping
- Concurrency: Reserved, provisioned, burst limits
- Layers for shared dependencies
- Best practices: Keep functions small, use environment variables, set timeouts
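
The Lambda practices above can be sketched as a small Python handler. This is a minimal illustration, not a production template: `TABLE_NAME` and its default are hypothetical, and the response shape assumes the function sits behind an API Gateway Lambda proxy integration. Timeouts are set in the function configuration rather than in code, so they do not appear here.

```python
import json
import os

# Hypothetical setting name; Lambda exposes configured environment
# variables to the function through os.environ.
TABLE_NAME = os.environ.get("TABLE_NAME", "example-table")


def handler(event, context):
    """Entry point in the (event, context) shape used by Python Lambda functions."""
    # Keep the handler thin: parse input, do one thing, format output.
    params = event.get("queryStringParameters") or {}
    name = params.get("name", "world")
    body = {"message": f"hello, {name}", "table": TABLE_NAME}
    # Response shape assumed for an API Gateway Lambda proxy integration.
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(body),
    }
```

Reading configuration once at module load, outside the handler, means warm invocations reuse it.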

**ECS/EKS (Container Services)**
- ECS: Fargate for serverless, EC2 for control
- EKS: Managed Kubernetes with AWS integration
- Service mesh: App Mesh for observability
- Best practices: Use Fargate for simplicity, EKS for portability

**Elastic Beanstalk**
- Managed platform for web apps
- Auto-scaling and load balancing included
- Support for multiple languages and Docker

### Storage

**S3 (Simple Storage Service)**
- Storage classes: Standard, Intelligent-Tiering, Standard-IA, One Zone-IA, Glacier Instant Retrieval, Glacier Flexible Retrieval, Glacier Deep Archive
- Lifecycle policies for automatic tiering
- Versioning and MFA delete for protection
- Cross-region replication for DR
- Best practices: Enable versioning, use lifecycle policies, block public access
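
As a rough sketch of lifecycle-based tiering, the helper below builds a rule in the shape accepted by the S3 `PutBucketLifecycleConfiguration` API. The function name, the prefix, and the day counts are illustrative assumptions. The result is a plain dictionary; applying it to a bucket (for example through an SDK) is left out.

```python
def build_lifecycle_configuration(prefix: str, ia_days: int, glacier_days: int,
                                  noncurrent_expire_days: int) -> dict:
    """Build a single-rule lifecycle configuration as a plain dictionary."""
    if ia_days < 30:
        # S3 does not allow transitions to Standard-IA earlier than 30 days
        # after object creation.
        raise ValueError("ia_days must be at least 30")
    if glacier_days <= ia_days:
        # Sanity check for this sketch: the colder tier should come later.
        raise ValueError("glacier_days must be greater than ia_days")
    rule_id = "tier-" + (prefix.strip("/").replace("/", "-") or "all")
    return {
        "Rules": [
            {
                "ID": rule_id,
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},
                "Transitions": [
                    {"Days": ia_days, "StorageClass": "STANDARD_IA"},
                    {"Days": glacier_days, "StorageClass": "GLACIER"},
                ],
                # Only has an effect on versioned buckets.
                "NoncurrentVersionExpiration": {
                    "NoncurrentDays": noncurrent_expire_days
                },
            }
        ]
    }


# Hypothetical values: logs move to Standard-IA after 30 days, Glacier after 90.
config = build_lifecycle_configuration("logs/", 30, 90, 365)
```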

**EBS (Elastic Block Store)**
- Volume types: gp3 (general), io2 (IOPS), st1 (throughput), sc1 (cold)
- Snapshots to S3 for backup
- Encryption by default (opt-in account setting, enabled per Region)
- Best practices: Use gp3 for most workloads, enable encryption

**EFS (Elastic File System)**
- NFSv4 file system for shared access
- Performance modes: General purpose, Max I/O
- Throughput modes: Elastic, bursting, provisioned
- Best practices: Use lifecycle management, enable encryption

**FSx**
- FSx for Windows File Server (SMB)
- FSx for Lustre (HPC workloads)
- FSx for NetApp ONTAP
- FSx for OpenZFS

### Database

**RDS (Relational Database Service)**
- Engines: MySQL, PostgreSQL, MariaDB, Oracle, SQL Server, Aurora
- Multi-AZ for high availability
- Read replicas for scalability
- Automated backups and point-in-time recovery
- Best practices: Use Aurora for performance, enable Multi-AZ, use read replicas

**Aurora**
- MySQL and PostgreSQL compatible
- Up to 5x the throughput of standard MySQL and up to 3x that of standard PostgreSQL (AWS-stated figures)
- Global databases for cross-region DR
- Serverless v2 for variable workloads
- Best practices: Use Aurora Serverless for unpredictable workloads

**DynamoDB**
- NoSQL key-value and document database
- On-demand or provisioned capacity
- Global tables for multi-region replication
- DynamoDB Streams for change data capture
- Best practices: Use on-demand for unpredictable traffic, implement GSI carefully

**ElastiCache**
- Redis or Memcached in-memory caching
- Cluster mode for Redis scalability
- Best practices: Use for session storage, API caching

### Networking

**VPC (Virtual Private Cloud)**
- CIDR planning: Avoid overlaps, plan for growth
- Subnets: Public (IGW), private (NAT), isolated (no internet)
- Route tables and routing decisions
- Security groups (stateful) and NACLs (stateless)
- Best practices: Use /16 for VPC, /24 for subnets, plan IP space
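
The /16 and /24 guidance can be checked with Python's `ipaddress` module. The sketch below carves one /24 per tier per AZ out of a /16. The AZ names and the `10.0.0.0/16` range are placeholder assumptions, and `plan_subnets` is a hypothetical helper, not an AWS API.

```python
import ipaddress


def plan_subnets(vpc_cidr: str, azs: list[str], tiers: list[str],
                 new_prefix: int = 24) -> dict:
    """Assign one non-overlapping subnet to every (tier, AZ) pair."""
    vpc = ipaddress.ip_network(vpc_cidr)
    blocks = vpc.subnets(new_prefix=new_prefix)  # yields blocks in ascending order
    plan = {}
    for tier in tiers:
        for az in azs:
            plan[(tier, az)] = next(blocks)
    return plan


def usable_addresses(subnet) -> int:
    """AWS reserves five addresses in every subnet (first four and the last)."""
    return subnet.num_addresses - 5


plan = plan_subnets(
    "10.0.0.0/16",
    azs=["us-east-1a", "us-east-1b", "us-east-1c"],
    tiers=["public", "private", "isolated"],
)
for (tier, az), subnet in plan.items():
    print(f"{tier:9} {az}  {subnet}  usable={usable_addresses(subnet)}")
```

Nine /24 blocks use only a small part of the 256 available in a /16, which leaves room for growth.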

**Route 53**
- DNS service with health checks
- Routing policies: Simple, weighted, latency, failover, geolocation
- Best practices: Use alias records, enable DNSSEC
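
For weighted routing, Route 53 answers with a record in proportion to its weight divided by the sum of the weights in the group. The helper below is a small arithmetic sketch of that rule. `traffic_shares` and the record names are illustrative assumptions.

```python
def traffic_shares(weights: dict[str, int]) -> dict[str, float]:
    """Expected share of DNS responses for each weighted record."""
    if any(w < 0 or w > 255 for w in weights.values()):
        raise ValueError("Route 53 weights are integers from 0 to 255")
    total = sum(weights.values())
    if total == 0:
        # When every weight is 0, all records are returned with equal probability.
        return {name: 1 / len(weights) for name in weights}
    return {name: w / total for name, w in weights.items()}


# Hypothetical blue/green rollout sending about 10% of lookups to the new stack.
shares = traffic_shares({"blue": 90, "green": 10})
```

Because resolvers cache answers for the record's TTL, the observed request split only approximates these shares.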

**CloudFront**
- Global CDN with edge locations
- Origin types: S3, ALB, custom origins
- Lambda@Edge for request/response manipulation
- Best practices: Enable compression, use field-level encryption

**VPN and Direct Connect**
- Site-to-Site VPN for encrypted tunnels
- Direct Connect for dedicated bandwidth
- Transit Gateway for hub-and-spoke topology
- Best practices: Use Direct Connect for high bandwidth, Transit Gateway for complex routing

**API Gateway**
- REST APIs, HTTP APIs, WebSocket APIs
- Throttling and quotas
- Integration with Lambda, HTTP endpoints, AWS services
- Best practices: Use HTTP APIs for lower cost, implement caching
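
API Gateway's documentation describes its throttling in token-bucket terms: a steady-state rate plus a burst allowance. The class below is a generic, simplified token bucket meant only to show how those two numbers interact. It is not API Gateway's implementation, and the rate and burst values are arbitrary.

```python
class TokenBucket:
    """Simplified token bucket: `rate` tokens per second, at most `burst` stored."""

    def __init__(self, rate: float, burst: int):
        self.rate = float(rate)
        self.capacity = float(burst)
        self.tokens = float(burst)  # start full so an initial burst is allowed
        self.last = 0.0

    def allow(self, now: float) -> bool:
        """Return True if a request arriving at time `now` (seconds) is admitted."""
        elapsed = max(0.0, now - self.last)
        # Refill for the elapsed time, never exceeding the burst capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # the caller would see a throttling error (HTTP 429)


bucket = TokenBucket(rate=2, burst=3)
burst_results = [bucket.allow(0.0) for _ in range(4)]  # fourth call exceeds the burst
after_refill = bucket.allow(0.5)                       # 0.5 s * 2 tokens/s = 1 token
```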

### Security

**IAM (Identity and Access Management)**
- Principle of least privilege
- Roles for applications, not access keys
- MFA for privileged users
- Service Control Policies (SCPs) for organization-wide controls
- Best practices: Use roles, enable MFA, rotate credentials
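
To make least privilege concrete, the sketch below builds an identity policy document that allows read-only access to a single S3 prefix. The bucket name, prefix, and helper name are hypothetical. The document would normally be attached to a role that the application assumes, rather than to a user with long-lived access keys.

```python
import json


def read_only_prefix_policy(bucket: str, prefix: str) -> dict:
    """IAM policy document granting list and read access under one prefix only."""
    prefix = prefix.strip("/")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ListOnlyThisPrefix",
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
                # Restrict listing to keys under the prefix.
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}/*"]}},
            },
            {
                "Sid": "ReadObjectsUnderPrefix",
                "Effect": "Allow",
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}/*",
            },
        ],
    }


policy = read_only_prefix_policy("example-reports-bucket", "team-a/")
print(json.dumps(policy, indent=2))
```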

**KMS (Key Management Service)**
- Customer managed keys (KMS keys you create and control; "CMK" is the older term, customer master key)
- Automatic key rotation
- Envelope encryption pattern
- Best practices: Enable automatic rotation, use grants for temporary access

**Secrets Manager**
- Automatic rotation for RDS credentials
- Versioning and rollback
- Best practices: Rotate secrets regularly, use VPC endpoints

**Security Hub**
- Centralized security findings
- CIS AWS Foundations Benchmark
- Integration with GuardDuty, Inspector, Macie

**GuardDuty**
- Threat detection using ML
- Monitors CloudTrail, VPC Flow Logs, DNS logs

## Architecture Patterns

### High Availability

**Multi-AZ Pattern**
```
- Application Load Balancer across 3 AZs
- Auto Scaling group with instances in each AZ
- RDS Multi-AZ for database
- S3 for static assets (99.999999999%, i.e. 11 nines, durability)
```
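
The 11 nines durability figure for S3 is easier to reason about as arithmetic. The sketch below treats it as an annual per-object loss probability, which is how AWS describes the design target. The object count is an arbitrary example.

```python
from fractions import Fraction

# 99.999999999% annual durability, kept as an exact fraction to avoid float noise.
DURABILITY = Fraction(99_999_999_999, 100_000_000_000)


def expected_years_per_lost_object(object_count: int) -> Fraction:
    """Average number of years between single-object losses for a given fleet size."""
    annual_loss_probability = 1 - DURABILITY          # per object, per year
    expected_losses_per_year = object_count * annual_loss_probability
    return 1 / expected_losses_per_year


years = expected_years_per_lost_object(10_000_000)
print(f"10 million objects: about one loss every {int(years):,} years")
```

Durability says nothing about availability or about accidental deletion, which is why the pattern still pairs S3 with versioning and replication.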

**Multi-Region Pattern**
```
- Route 53 with health checks and failover
- CloudFront for global distribution
- Aurora Global Database (cross-region replication lag typically under 1 s, so RPO of about 1 s)
- S3 Cross-Region Replication
```

### Serverless Architecture

**API-Driven Pattern**
```
API Gateway -> Lambda -> DynamoDB
              |
              v
          EventBridge -> Lambda (async processing)
```

**Event-Driven Pattern**
```
S3 Event -> Lambda -> Process -> SNS
                                  |
                                  v
                              Multiple subscribers
```
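
The first hop of the event-driven pattern, S3 event to Lambda, can be sketched as below. The record layout follows the S3 event notification structure (`Records[].s3.bucket.name` and `Records[].s3.object.key`). The function names are illustrative, and the processing and SNS publish steps are deliberately left out.

```python
from urllib.parse import unquote_plus


def extract_objects(event: dict) -> list[tuple[str, str]]:
    """Return (bucket, key) pairs from an S3 event notification payload."""
    objects = []
    for record in event.get("Records", []):
        s3 = record["s3"]
        bucket = s3["bucket"]["name"]
        # Object keys arrive URL-encoded (spaces as '+'), so decode before use.
        key = unquote_plus(s3["object"]["key"])
        objects.append((bucket, key))
    return objects


def s3_handler(event, context):
    """Lambda-style entry point; downstream processing and SNS are omitted."""
    objects = extract_objects(event)
    return {"processed": [f"s3://{bucket}/{key}" for bucket, key in objects]}


# Hypothetical, trimmed-down event for local experimentation.
sample_event = {
    "Records": [
        {"s3": {"bucket": {"name": "example-uploads"},
                "object": {"key": "reports/2024/my+file%281%29.csv"}}}
    ]
}
result = s3_handler(sample_event, None)
```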

### Microservices on AWS

**Container-Based**
```
ALB -> ECS Fargate (multiple services)
    |
    v
Service Discovery (Cloud Map)
    |
    v
RDS/DynamoDB per service
```

**Service Mesh**
```
App Mesh for traffic management
X-Ray for distributed tracing
CloudWatch Container Insights
```

### Data Lake Architecture

```
Data Sources -> Kinesis Data Streams
                      |
                      v
            Amazon Data Firehose
                      |
                      v
                S3 (raw bucket)
                      |
                      v
        Glue ETL or Lambda processing
                      |
                      v
            S3 (processed bucket)
                      |
                      v
          Athena/Redshift Spectrum
                      |
                      v
              QuickSight dashboards
```

## Migration Strategies (6Rs)

1. **Rehost (Lift-and-Shift)**
   - AWS Application Migration Service (MGN)
   - Minimal changes, quick migration
   - Use for legacy apps with compliance constraints

2. **Replatform (Lift-Tinker-and-Shift)**
   - Migrate to RDS instead of self-managed databases
   - Use Elastic Beanstalk instead of custom app servers
   - Small optimizations during migration

3. **Repurchase (Drop-and-Shop)**
   - Move to SaaS (e.g., Salesforce, Workday)
   - Reduce maintenance burden

4. **Refactor/Re-architect**
   - Modernize to serverless or containers
   - Highest effort, highest benefit
   - Use for competitive advantage applications

5. **Retire**
   - Decommission unused applications
   - Reduce attack surface and costs

6. **Retain**
   - Keep on-premises temporarily
   - Migrate later or keep for regulatory reasons

## Landing Zone Design

**AWS Control Tower**
- Multi-account strategy (AWS Organizations)
- Account factory for standardization
- Guardrails (controls) for governance: preventive via SCPs, detective via AWS Config rules
- Centralized logging (CloudTrail, Config)
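
As an example of an SCP-based preventive guardrail, the sketch below builds a policy that denies requests outside an approved set of Regions. The helper name and Region list are assumptions. The `NotAction` list of global services is intentionally short and illustrative, so a real policy would need a reviewed, more complete exemption list.

```python
import json


def region_restriction_scp(allowed_regions: list[str]) -> dict:
    """Service control policy denying actions outside the allowed Regions."""
    if not allowed_regions:
        raise ValueError("at least one Region must be allowed")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyOutsideAllowedRegions",
                "Effect": "Deny",
                # Global services are exempted; this list is illustrative, not complete.
                "NotAction": [
                    "iam:*",
                    "organizations:*",
                    "route53:*",
                    "cloudfront:*",
                    "support:*",
                ],
                "Resource": "*",
                "Condition": {
                    "StringNotEquals": {"aws:RequestedRegion": allowed_regions}
                },
            }
        ],
    }


scp = region_restriction_scp(["eu-west-1", "eu-central-1"])
print(json.dumps(scp, indent=2))
```

SCPs only set the maximum permissions available in an account. They grant nothing by themselves, so IAM policies are still required.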

**Account Structure**
```
Root
├── Security OU
│   ├── Log Archive Account
│   └── Security Tooling Account
├── Infrastructure OU
│   ├── Network Account (Transit Gateway, VPN)
│   └── Shared Services Account
└── Workloads OU
    ├── Production Account
    ├── Staging Account
    └── Development Account
```

**Network Design**
```
Transit Gateway (hub)
    |
    ├── Production VPC
    ├── Staging VPC
    ├── Development VPC
    └── On-premises (Direct Connect/VPN)
```

## Cost Optimization Strategies

**Compute Savings**
- Compute Savings Plans (up to 66% savings)
- EC2 Reserved Instances (1-year or 3-year)
- Spot Instances for batch/fault-tolerant workloads
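- Lambda: Right-size memory (too little memory lengthens duration and can raise cost); reserved concurrency caps concurrent executions and so bounds spend, but does not lower the per-invocation price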
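
Whether a commitment such as a Savings Plan or Reserved Instance pays off depends on utilization. The sketch below is a simplified model: it assumes a flat discount, a commitment billed for every hour of the term, and a hypothetical on-demand rate of 1.0 per hour. Real pricing varies by instance family, Region, and payment option.

```python
def break_even_utilization(discount: float) -> float:
    """Fraction of the term a workload must run for a commitment to beat on-demand."""
    if not 0.0 <= discount < 1.0:
        raise ValueError("discount must be in [0, 1)")
    return 1.0 - discount


def cheaper_option(on_demand_hourly: float, discount: float,
                   hours_used: float, hours_in_term: float) -> str:
    """Compare paying on-demand for used hours with a commitment for the whole term."""
    on_demand_cost = on_demand_hourly * hours_used
    committed_cost = on_demand_hourly * (1.0 - discount) * hours_in_term
    return "commitment" if committed_cost < on_demand_cost else "on-demand"


# With the 66% maximum discount, the break-even point is roughly 34% utilization.
threshold = break_even_utilization(0.66)
busy = cheaper_option(1.0, 0.66, hours_used=4000, hours_in_term=8760)
idle = cheaper_option(1.0, 0.66, hours_used=2000, hours_in_term=8760)
```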

**Storage Savings**
- S3 Intelligent-Tiering for unpredictable access
- Lifecycle policies to Glacier/Deep Archive
- EBS gp3 instead of gp2 (up to 20% lower price per GB; 3,000 IOPS and 125 MiB/s baseline regardless of volume size)
- Delete unused snapshots and volumes

**Database Savings**
- Aurora Serverless v2 for variable workloads
- RDS Reserved Instances
- DynamoDB on-demand for unpredictable workloads
- Read replicas in the same Region (and, where practical, the same AZ as their readers) to limit cross-Region and cross-AZ data transfer charges

**Monitoring and Alerting**
- AWS Cost Explorer for analysis
- AWS Budgets for alerts
- Cost Anomaly Detection
- Trusted Advisor for recommendations

## Disaster Recovery

**RPO and RTO Targets**
- Backup and Restore: Hours RPO/RTO (lowest cost)
- Pilot Light: Tens of minutes RPO/RTO
- Warm Standby: Minutes RPO/RTO
- Multi-Site Active/Active: Near-zero RPO/RTO (highest cost)

**Implementation**
- AWS Backup for centralized backup management
- Aurora Global Database for cross-region replication
- S3 Cross-Region Replication
- Route 53 health checks and failover routing
- Regular DR testing with CloudFormation/Terraform

## Monitoring and Observability

**CloudWatch**
- Metrics: Basic monitoring (5 min) and detailed monitoring (1 min) for EC2; high-resolution custom metrics down to 1 s
- Alarms with SNS notifications
- Logs Insights for log analysis
- Dashboards for visualization
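
An alarm with an SNS notification can be sketched as the parameter set for the CloudWatch `PutMetricAlarm` API. The helper name, instance ID, topic ARN, and threshold are hypothetical placeholders. The function only builds the dictionary and leaves the SDK call to the caller.

```python
def cpu_alarm_params(instance_id: str, sns_topic_arn: str,
                     threshold: float = 80.0) -> dict:
    """Parameters for a CPU alarm that notifies an SNS topic."""
    return {
        "AlarmName": f"high-cpu-{instance_id}",
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": 300,            # seconds; matches 5-minute basic monitoring
        "EvaluationPeriods": 3,   # look at the last three periods...
        "DatapointsToAlarm": 2,   # ...and alarm if two of them breach
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],
    }


params = cpu_alarm_params(
    "i-0123456789abcdef0",
    "arn:aws:sns:us-east-1:123456789012:example-alerts",
)
```

Requiring two of three breaching datapoints is one way to avoid paging on a single short spike.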

**X-Ray**
- Distributed tracing for microservices
- Service map visualization
- Trace annotations and metadata

**AWS Config**
- Resource inventory and change tracking
- Compliance rules evaluation
- Relationship tracking between resources