bookworm-smart-assistant/skills/cloud-architect/references/multi-cloud.md

# Multi-Cloud Architecture Reference

Comprehensive guide for multi-cloud strategies, abstraction layers, portability patterns, and vendor lock-in mitigation.

## Multi-Cloud Strategy

### When to Use Multi-Cloud

**Valid Use Cases**
- Regulatory compliance requiring data residency in specific regions
- Best-of-breed service selection (BigQuery for analytics, AWS for ML)
- Acquisition integration (different clouds in merged organizations)
- Disaster recovery with cloud provider as failure domain
- Negotiating leverage with cloud vendors

**Poor Reasons for Multi-Cloud**
- "Avoiding vendor lock-in" without specific exit scenario
- Assuming portability is free (it has significant costs)
- Political decisions without technical justification
- Spreading workloads arbitrarily across providers

### Multi-Cloud Patterns

**Active-Active**
```
Users -> Global Load Balancer
              |
    +---------+---------+
    |                   |
  AWS Region        GCP Region
    |                   |
    +----> Data Sync <--+
```
- Highest complexity and cost
- Best for global latency optimization
- Requires robust data synchronization

**Active-Passive (DR)**
```
Users -> Primary (AWS)
              |
         [Failover]
              |
         Secondary (Azure)
```
- Lower complexity than active-active
- Cloud provider becomes failure domain
- Cold or warm standby in secondary cloud

**Segmented by Workload**
```
Analytics -> GCP (BigQuery)
Core App  -> AWS (ECS, RDS)
Office    -> Azure (M365 integration)
```
- Each workload on best-fit cloud
- No cross-cloud data synchronization
- Simplest multi-cloud pattern

## Abstraction Layers

### Infrastructure Abstraction

**Terraform (Recommended)**
```hcl
# Provider-agnostic module structure
module "compute" {
  source = "./modules/compute"

  provider_type = var.cloud_provider  # aws, azure, gcp
  instance_type = var.instance_size
  region        = var.region
}

# Provider-specific implementations
# modules/compute/aws.tf
resource "aws_instance" "main" {
  count         = var.provider_type == "aws" ? 1 : 0
  ami           = data.aws_ami.latest.id
  instance_type = local.aws_instance_map[var.instance_size]
}

# modules/compute/azure.tf
resource "azurerm_virtual_machine" "main" {
  count    = var.provider_type == "azure" ? 1 : 0
  vm_size  = local.azure_vm_map[var.instance_size]
}
```

**Pulumi (Code-First)**
```typescript
// Abstract cloud resources with TypeScript
interface ComputeConfig {
  size: "small" | "medium" | "large";
  region: string;
}

function createCompute(config: ComputeConfig, provider: "aws" | "gcp") {
  if (provider === "aws") {
    return new aws.ec2.Instance("web", {
      instanceType: sizeMap.aws[config.size],
      // ...
    });
  } else {
    return new gcp.compute.Instance("web", {
      machineType: sizeMap.gcp[config.size],
      // ...
    });
  }
}
```

### Container Orchestration (Kubernetes)

**Portable Kubernetes Deployment**
```yaml
# Same manifests work across EKS, AKS, GKE
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    spec:
      containers:
      - name: web
        image: myregistry/web:v1
        resources:
          requests:
            memory: "256Mi"
            cpu: "250m"
          limits:
            memory: "512Mi"
            cpu: "500m"
```

**Cloud-Specific Considerations**
| Feature | EKS | AKS | GKE |
|---------|-----|-----|-----|
| Load Balancer | ALB/NLB annotations | Azure LB | GCP LB |
| Storage Class | gp3, io2 | managed-premium | pd-ssd |
| IAM Integration | IRSA | Workload Identity | Workload Identity |
| Ingress | AWS ALB Controller | AGIC | GKE Ingress |

### Application Abstraction

**Database Abstraction**
```python
# Use standard protocols (SQL, Redis, S3 API)
from sqlalchemy import create_engine

# Same code works with:
# - AWS RDS PostgreSQL
# - Azure Database for PostgreSQL
# - GCP Cloud SQL PostgreSQL
# - Self-managed PostgreSQL

DATABASE_URL = os.environ["DATABASE_URL"]
engine = create_engine(DATABASE_URL)
```

**Object Storage Abstraction**
```python
import boto3
from botocore.config import Config

# S3-compatible API works with:
# - AWS S3
# - GCP Cloud Storage (interoperability mode)
# - MinIO
# - Cloudflare R2

s3_client = boto3.client(
    's3',
    endpoint_url=os.environ.get("S3_ENDPOINT"),  # Override for non-AWS
    aws_access_key_id=os.environ["ACCESS_KEY"],
    aws_secret_access_key=os.environ["SECRET_KEY"],
)
```

## Data Synchronization

### Database Replication

**Cross-Cloud PostgreSQL**
```
AWS RDS Primary
      |
      | Logical Replication
      v
GCP Cloud SQL Replica (read-only)
```

Configuration:
```sql
-- On primary (AWS RDS)
CREATE PUBLICATION my_publication FOR ALL TABLES;

-- On replica (GCP Cloud SQL)
CREATE SUBSCRIPTION my_subscription
  CONNECTION 'host=aws-rds-endpoint dbname=mydb user=repl'
  PUBLICATION my_publication;
```

**Conflict Resolution Strategies**
- Last-write-wins (timestamp-based)
- Application-level conflict resolution
- CRDT data structures for eventually consistent data
- Avoid multi-master for transactional data

### Object Storage Sync

**Rclone for Cross-Cloud Sync**
```bash
# Sync S3 to GCS
rclone sync s3:my-bucket gcs:my-bucket \
  --transfers 32 \
  --checkers 16 \
  --s3-upload-concurrency 8

# Bidirectional sync with conflict handling
rclone bisync s3:bucket gcs:bucket \
  --conflict-resolve newer
```

**Event-Driven Replication**
```
S3 Bucket -> S3 Event -> Lambda -> GCS Upload
                              |
                              v
                       Consistency Check
```

## Vendor Lock-In Mitigation

### Lock-In Risk Assessment

| Service Type | Lock-In Risk | Mitigation Strategy |
|--------------|--------------|---------------------|
| Compute (VMs) | Low | Standard OS images, IaC |
| Kubernetes | Low | Portable manifests, avoid proprietary add-ons |
| Object Storage | Low | S3-compatible API, standard formats |
| Managed Databases | Medium | Standard SQL, logical backups |
| Serverless Functions | High | Abstraction layers, containers |
| Proprietary AI/ML | High | Open-source alternatives, ONNX models |
| Managed Services | High | Evaluate portability before adoption |

### Mitigation Strategies

**1. Use Open Standards**
- SQL databases over proprietary NoSQL
- Kubernetes over ECS/Cloud Run
- S3 API for object storage
- OpenTelemetry for observability
- OIDC for authentication

**2. Abstract Proprietary Services**
```typescript
// Wrap cloud-specific services
interface QueueService {
  send(message: string): Promise<void>;
  receive(): Promise<string>;
}

class SQSQueue implements QueueService {
  async send(message: string) {
    await this.sqsClient.sendMessage({ QueueUrl: this.url, MessageBody: message });
  }
}

class PubSubQueue implements QueueService {
  async send(message: string) {
    await this.pubsubClient.topic(this.topic).publish(Buffer.from(message));
  }
}

// Factory pattern for cloud selection
function createQueue(provider: string): QueueService {
  switch (provider) {
    case "aws": return new SQSQueue();
    case "gcp": return new PubSubQueue();
  }
}
```

**3. Maintain Exit Capability**
- Regular data export testing
- Document cloud-specific dependencies
- Keep IaC portable across providers
- Estimate migration effort annually

**4. Containerize Everything**
```dockerfile
# Portable container runs anywhere
FROM node:20-alpine
WORKDIR /app
COPY package*.json ./
RUN npm ci --only=production
COPY . .
EXPOSE 3000
CMD ["node", "server.js"]
```

## Network Connectivity

### Cross-Cloud Networking

**VPN Interconnect**
```
AWS VPC                          GCP VPC
   |                                |
   +---> AWS VPN Gateway            |
              |                     |
              | IPsec tunnel        |
              |                     |
              +---> GCP Cloud VPN <-+
```

**Dedicated Interconnect (Enterprise)**
```
On-Premises Data Center
         |
    +----+----+
    |         |
AWS Direct  GCP Cloud
Connect     Interconnect
    |         |
    v         v
AWS VPC    GCP VPC
    |         |
    +----+----+
         |
    Transit Hub (e.g., Megaport, Equinix)
```

### Service Mesh Across Clouds

**Istio Multi-Cluster**
```yaml
# Primary cluster (AWS EKS)
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  values:
    global:
      meshID: multi-cloud-mesh
      multiCluster:
        clusterName: eks-primary
      network: aws-network

# Remote cluster (GCP GKE)
spec:
  values:
    global:
      meshID: multi-cloud-mesh
      multiCluster:
        clusterName: gke-secondary
      network: gcp-network
```

## Cost Management

### Cross-Cloud Cost Visibility

**FinOps Tools**
- CloudHealth by VMware
- Apptio Cloudability
- Spot.io (now part of NetApp)
- Kubecost for Kubernetes

**Unified Tagging Strategy**
```
Required Tags (all clouds):
- environment: prod/staging/dev
- cost-center: engineering/marketing/sales
- owner: team-name
- project: project-code
- managed-by: terraform/manual
```

### Cost Comparison Framework

```
| Workload Type | AWS | Azure | GCP | Decision |
|---------------|-----|-------|-----|----------|
| General Compute | EC2 m5 | D-series | n2-standard | Compare $/vCPU/hour |
| GPU Training | p4d | NC-series | A2 | GCP often cheaper |
| Object Storage | S3 | Blob | GCS | Similar, check egress |
| Analytics | Redshift | Synapse | BigQuery | BigQuery for ad-hoc |
| Kubernetes | EKS | AKS | GKE | GKE Autopilot simplest |
```

## Observability

### Unified Monitoring Stack

**OpenTelemetry Collector**
```yaml
# Collect from all clouds, export to single backend
receivers:
  otlp:
    protocols:
      grpc:
      http:

processors:
  batch:
  attributes:
    actions:
      - key: cloud.provider
        action: upsert
        value: ${CLOUD_PROVIDER}

exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"
  jaeger:
    endpoint: jaeger:14250

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch, attributes]
      exporters: [jaeger]
    metrics:
      receivers: [otlp]
      processors: [batch, attributes]
      exporters: [prometheus]
```

**Grafana for Unified Dashboards**
- AWS CloudWatch data source
- Azure Monitor data source
- GCP Cloud Monitoring data source
- Single pane of glass across all clouds

## Security Considerations

### Identity Federation

**Cross-Cloud Identity**
```
Corporate IdP (Okta/Azure AD)
         |
    SAML/OIDC
         |
    +----+----+----+
    |    |    |    |
  AWS  Azure  GCP  K8s
  IAM   AD   IAM  RBAC
```

### Secrets Management

**HashiCorp Vault (Cloud-Agnostic)**
```hcl
# Single secrets management across clouds
resource "vault_aws_secret_backend_role" "aws_role" {
  backend = vault_aws_secret_backend.aws.path
  name    = "app-role"
  credential_type = "iam_user"
}

resource "vault_gcp_secret_roleset" "gcp_role" {
  backend     = vault_gcp_secret_backend.gcp.path
  roleset     = "app-role"
  project     = var.gcp_project
  token_scopes = ["https://www.googleapis.com/auth/cloud-platform"]
}
```

### Network Security

**Zero-Trust Across Clouds**
- mTLS between all services (service mesh)
- No implicit trust based on network location
- Identity-based access control
- Encrypted transit between clouds (VPN/interconnect)