Multi-Cloud Architecture Reference

Comprehensive guide for multi-cloud strategies, abstraction layers, portability patterns, and vendor lock-in mitigation.

Multi-Cloud Strategy

When to Use Multi-Cloud

Valid Use Cases

  • Regulatory compliance requiring data residency in specific regions
  • Best-of-breed service selection (BigQuery for analytics, AWS for ML)
  • Acquisition integration (different clouds in merged organizations)
  • Disaster recovery with cloud provider as failure domain
  • Negotiating leverage with cloud vendors

Poor Reasons for Multi-Cloud

  • "Avoiding vendor lock-in" without specific exit scenario
  • Assuming portability is free (it has significant costs)
  • Political decisions without technical justification
  • Spreading workloads arbitrarily across providers

Multi-Cloud Patterns

Active-Active

Users -> Global Load Balancer
              |
    +---------+---------+
    |                   |
  AWS Region        GCP Region
    |                   |
    +----> Data Sync <--+

  • Highest complexity and cost
  • Best for global latency optimization
  • Requires robust data synchronization
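
The routing decision a global load balancer makes in this pattern can be sketched roughly as follows. This is a minimal illustration, not any provider's implementation: `pick_region`, the region labels, and the latency figures are assumptions made up for the example.

```python
from typing import Dict, Set


def pick_region(latencies_ms: Dict[str, float], healthy: Set[str]) -> str:
    """Return the healthy region with the lowest measured latency."""
    # Drop regions that are currently failing health checks.
    candidates = {r: ms for r, ms in latencies_ms.items() if r in healthy}
    if not candidates:
        raise RuntimeError("no healthy region available")
    return min(candidates, key=candidates.get)


# Hypothetical measurements for one user location.
measured = {"aws-region": 40.0, "gcp-region": 25.0}
print(pick_region(measured, {"aws-region", "gcp-region"}))
```

Real global load balancers also weigh capacity and session affinity, so treat this as the core idea only.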

Active-Passive (DR)

Users -> Primary (AWS)
              |
         [Failover]
              |
         Secondary (Azure)

  • Lower complexity than active-active
  • Cloud provider becomes failure domain
  • Cold or warm standby in secondary cloud
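
The failover step in the diagram above could be driven by a small controller like the one below. It is a sketch under stated assumptions: the class name, the threshold of three consecutive failed health checks, and the absence of automatic failback are illustrative choices, not part of the pattern itself.

```python
from dataclasses import dataclass, field


@dataclass
class FailoverController:
    """Tracks health-check results and decides which cloud serves traffic."""

    primary: str = "aws"        # names mirror the diagram above
    secondary: str = "azure"
    failure_threshold: int = 3  # hypothetical: consecutive failures before failover
    consecutive_failures: int = field(default=0, init=False)
    active: str = field(default="", init=False)

    def __post_init__(self) -> None:
        self.active = self.primary

    def record_check(self, primary_healthy: bool) -> str:
        """Record one health-check result and return the active target."""
        if primary_healthy:
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failure_threshold:
                # Failover is one-way here; failback is left to an operator.
                self.active = self.secondary
        return self.active
```

Requiring several consecutive failures avoids failing over on a single dropped probe. Keeping failback manual is a common precaution, because the secondary may hold writes the primary has not seen.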

Segmented by Workload

Analytics -> GCP (BigQuery)
Core App  -> AWS (ECS, RDS)
Office    -> Azure (M365 integration)

  • Each workload on best-fit cloud
  • No cross-cloud data synchronization
  • Simplest multi-cloud pattern

Abstraction Layers

Infrastructure Abstraction

Terraform (Recommended)

# Provider-agnostic module structure
module "compute" {
  source = "./modules/compute"

  provider_type = var.cloud_provider  # aws, azure, gcp
  instance_type = var.instance_size
  region        = var.region
}

# Provider-specific implementations
# modules/compute/aws.tf
resource "aws_instance" "main" {
  count         = var.provider_type == "aws" ? 1 : 0
  ami           = data.aws_ami.latest.id
  instance_type = local.aws_instance_map[var.instance_size]
}

# modules/compute/azure.tf
resource "azurerm_virtual_machine" "main" {
  count    = var.provider_type == "azure" ? 1 : 0
  vm_size  = local.azure_vm_map[var.instance_size]
}

Pulumi (Code-First)

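// Abstract cloud resources with TypeScript
import * as aws from "@pulumi/aws";
import * as gcp from "@pulumi/gcp";

interface ComputeConfig {
  size: "small" | "medium" | "large";
  region: string;
}

// Example size mapping; the instance and machine types are illustrative choices
const sizeMap = {
  aws: { small: "t3.small", medium: "t3.medium", large: "t3.large" },
  gcp: { small: "e2-small", medium: "e2-medium", large: "e2-standard-2" },
} as const;

function createCompute(config: ComputeConfig, provider: "aws" | "gcp") {
  if (provider === "aws") {
    return new aws.ec2.Instance("web", {
      instanceType: sizeMap.aws[config.size],
      // ...
    });
  } else {
    return new gcp.compute.Instance("web", {
      machineType: sizeMap.gcp[config.size],
      // ...
    });
  }
}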

Container Orchestration (Kubernetes)

Portable Kubernetes Deployment

# Same manifests work across EKS, AKS, GKE
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web  # must match spec.selector.matchLabels
    spec:
      containers:
      - name: web
        image: myregistry/web:v1
        resources:
          requests:
            memory: "256Mi"
            cpu: "250m"
          limits:
            memory: "512Mi"
            cpu: "500m"

Cloud-Specific Considerations

| Feature | EKS | AKS | GKE |
|---------|-----|-----|-----|
| Load Balancer | ALB/NLB annotations | Azure LB | GCP LB |
| Storage Class | gp3, io2 | managed-premium | pd-ssd |
| IAM Integration | IRSA | Workload Identity | Workload Identity |
| Ingress | AWS ALB Controller | AGIC | GKE Ingress |
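
One way to keep these differences out of application manifests is to inject them when the manifest is generated. The sketch below does that for the storage class row only. The dictionary, the function name, and the platform keys are hypothetical. The values are copied from the table, and some of them are disk types rather than StorageClass object names, so confirm the real names with `kubectl get storageclass` on each cluster.

```python
# Values copied from the table above; names here are hypothetical helpers.
STORAGE_CLASS = {"eks": "gp3", "aks": "managed-premium", "gke": "pd-ssd"}


def pvc_manifest(name: str, size: str, platform: str) -> dict:
    """Build a PersistentVolumeClaim dict with the platform's storage class."""
    if platform not in STORAGE_CLASS:
        raise ValueError(f"unknown platform: {platform}")
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name},
        "spec": {
            "accessModes": ["ReadWriteOnce"],
            "storageClassName": STORAGE_CLASS[platform],
            "resources": {"requests": {"storage": size}},
        },
    }
```

The same lookup approach extends to ingress classes and load balancer annotations. Tools such as Kustomize overlays or Helm values serve the same purpose declaratively.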

Application Abstraction

Database Abstraction

# Use standard protocols (SQL, Redis, S3 API)
import os

from sqlalchemy import create_engine

# Same code works with:
# - AWS RDS PostgreSQL
# - Azure Database for PostgreSQL
# - GCP Cloud SQL PostgreSQL
# - Self-managed PostgreSQL

DATABASE_URL = os.environ["DATABASE_URL"]
engine = create_engine(DATABASE_URL)

Object Storage Abstraction

import os

import boto3

# S3-compatible API works with:
# - AWS S3
# - GCP Cloud Storage (interoperability mode)
# - MinIO
# - Cloudflare R2

s3_client = boto3.client(
    's3',
    endpoint_url=os.environ.get("S3_ENDPOINT"),  # Override for non-AWS
    aws_access_key_id=os.environ["ACCESS_KEY"],
    aws_secret_access_key=os.environ["SECRET_KEY"],
)
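
To make the endpoint override concrete, the helper below builds the client arguments from an environment mapping, so the same code path serves AWS and S3-compatible stores. It is a sketch: `s3_client_kwargs` is a hypothetical name, and it reuses the `S3_ENDPOINT`, `ACCESS_KEY`, and `SECRET_KEY` variables from the snippet above.

```python
import os
from typing import Mapping, Optional


def s3_client_kwargs(env: Optional[Mapping[str, str]] = None) -> dict:
    """Build keyword arguments for boto3.client("s3", **kwargs)."""
    env = os.environ if env is None else env
    kwargs = {
        "aws_access_key_id": env["ACCESS_KEY"],
        "aws_secret_access_key": env["SECRET_KEY"],
    }
    endpoint = env.get("S3_ENDPOINT")
    if endpoint:  # unset or empty means "use the default AWS endpoint"
        kwargs["endpoint_url"] = endpoint
    return kwargs
```

Usage would look like `boto3.client("s3", **s3_client_kwargs())`. For GCP Cloud Storage in interoperability mode, `S3_ENDPOINT` would be `https://storage.googleapis.com` with HMAC keys as the credentials. For a local MinIO it is typically something like `http://localhost:9000`.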

Data Synchronization

Database Replication

Cross-Cloud PostgreSQL

AWS RDS Primary
      |
      | Logical Replication
      v
GCP Cloud SQL Replica (treat as read-only; logical replication does not enforce it)

Configuration:

-- On primary (AWS RDS)
CREATE PUBLICATION my_publication FOR ALL TABLES;

-- On replica (GCP Cloud SQL)
CREATE SUBSCRIPTION my_subscription
  CONNECTION 'host=aws-rds-endpoint dbname=mydb user=repl'
  PUBLICATION my_publication;

Conflict Resolution Strategies

  • Last-write-wins (timestamp-based)
  • Application-level conflict resolution
  • CRDT data structures for eventually consistent data
  • Avoid multi-master for transactional data
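
Two of the strategies above are small enough to sketch directly: a last-write-wins register and a grow-only counter, one of the simplest CRDTs. The class names and the `writer` tie-breaker field are assumptions for illustration. Last-write-wins also assumes reasonably synchronized clocks across clouds, which is not guaranteed.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass(frozen=True)
class LwwRegister:
    """Last-write-wins register: the value with the newest timestamp survives."""

    value: str
    timestamp: float
    writer: str  # tie-breaker so both sides pick the same winner

    def merge(self, other: "LwwRegister") -> "LwwRegister":
        return max(self, other, key=lambda r: (r.timestamp, r.writer))


@dataclass
class GCounter:
    """Grow-only counter CRDT: one slot per replica, merged by element-wise max."""

    counts: Dict[str, int] = field(default_factory=dict)

    def increment(self, replica: str, amount: int = 1) -> None:
        self.counts[replica] = self.counts.get(replica, 0) + amount

    def merge(self, other: "GCounter") -> "GCounter":
        keys = self.counts.keys() | other.counts.keys()
        return GCounter(
            {k: max(self.counts.get(k, 0), other.counts.get(k, 0)) for k in keys}
        )

    @property
    def value(self) -> int:
        return sum(self.counts.values())
```

Both merges give the same result regardless of argument order, and merging a value with itself changes nothing. That is what lets each cloud apply updates in whatever order they arrive. Last-write-wins silently discards the losing write, which is why it is a poor fit for transactional data.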

Object Storage Sync

Rclone for Cross-Cloud Sync

# Sync S3 to GCS
rclone sync s3:my-bucket gcs:my-bucket \
  --transfers 32 \
  --checkers 16 \
  --s3-upload-concurrency 8

# Bidirectional sync with conflict handling
rclone bisync s3:bucket gcs:bucket \
  --conflict-resolve newer

Event-Driven Replication

S3 Bucket -> S3 Event -> Lambda -> GCS Upload
                              |
                              v
                       Consistency Check
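
The Lambda step in this flow might be structured as below. The sketch only covers parsing the S3 event and the consistency check. The three callables (`read_source`, `write_target`, `read_target`) are hypothetical injection points where boto3 and the Google Cloud Storage client would be plugged in, and comparing SHA-256 digests is one possible check among several.

```python
import hashlib
from typing import Callable, Iterator, List, Tuple
from urllib.parse import unquote_plus


def objects_in_event(event: dict) -> Iterator[Tuple[str, str]]:
    """Yield (bucket, key) for each record in an S3 event notification."""
    for record in event.get("Records", []):
        s3 = record["s3"]
        # Keys arrive URL-encoded in S3 notifications (spaces become "+").
        yield s3["bucket"]["name"], unquote_plus(s3["object"]["key"])


def replicate(
    event: dict,
    read_source: Callable[[str, str], bytes],
    write_target: Callable[[str, str, bytes], None],
    read_target: Callable[[str, str], bytes],
) -> List[str]:
    """Copy each object, then return keys whose checksums do not match."""
    mismatched = []
    for bucket, key in objects_in_event(event):
        data = read_source(bucket, key)
        write_target(bucket, key, data)
        copied = read_target(bucket, key)
        if hashlib.sha256(copied).digest() != hashlib.sha256(data).digest():
            mismatched.append(key)
    return mismatched
```

Reading the object back doubles the transfer volume. For large objects, comparing provider-reported checksums is a cheaper alternative.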

Vendor Lock-In Mitigation

Lock-In Risk Assessment

| Service Type | Lock-In Risk | Mitigation Strategy |
|--------------|--------------|---------------------|
| Compute (VMs) | Low | Standard OS images, IaC |
| Kubernetes | Low | Portable manifests, avoid proprietary add-ons |
| Object Storage | Low | S3-compatible API, standard formats |
| Managed Databases | Medium | Standard SQL, logical backups |
| Serverless Functions | High | Abstraction layers, containers |
| Proprietary AI/ML | High | Open-source alternatives, ONNX models |
| Managed Services | High | Evaluate portability before adoption |
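
The table can be applied to a service inventory with a few lines of code. In this sketch the risk levels are copied from the table, while the `assess` function and the sample inventory are hypothetical.

```python
from typing import Dict, List

# Risk levels copied from the table above.
LOCK_IN_RISK = {
    "Compute (VMs)": "Low",
    "Kubernetes": "Low",
    "Object Storage": "Low",
    "Managed Databases": "Medium",
    "Serverless Functions": "High",
    "Proprietary AI/ML": "High",
    "Managed Services": "High",
}


def assess(inventory: Dict[str, str]) -> Dict[str, List[str]]:
    """Group services by risk; inventory maps service name -> service type."""
    report: Dict[str, List[str]] = {"Low": [], "Medium": [], "High": []}
    for service, service_type in sorted(inventory.items()):
        report[LOCK_IN_RISK[service_type]].append(service)
    return report


# Hypothetical inventory for illustration only.
inventory = {
    "orders-db": "Managed Databases",
    "thumbnailer": "Serverless Functions",
    "web": "Kubernetes",
}
print(assess(inventory)["High"])
```

The services that land in the High bucket are the ones whose portability is worth evaluating first.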

Mitigation Strategies

1. Use Open Standards

  • SQL databases over proprietary NoSQL
  • Kubernetes over ECS/Cloud Run
  • S3 API for object storage
  • OpenTelemetry for observability
  • OIDC for authentication

2. Abstract Proprietary Services

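// Wrap cloud-specific services
import { SQS } from "@aws-sdk/client-sqs";
import { PubSub } from "@google-cloud/pubsub";

interface QueueService {
  send(message: string): Promise<void>;
  receive(): Promise<string>;
}

class SQSQueue implements QueueService {
  constructor(private sqsClient: SQS, private url: string) {}

  async send(message: string): Promise<void> {
    await this.sqsClient.sendMessage({ QueueUrl: this.url, MessageBody: message });
  }

  // Receiving is provider-specific (polling for SQS) and is omitted here
  async receive(): Promise<string> {
    throw new Error("SQSQueue.receive is not implemented");
  }
}

class PubSubQueue implements QueueService {
  constructor(private pubsubClient: PubSub, private topic: string) {}

  async send(message: string): Promise<void> {
    await this.pubsubClient.topic(this.topic).publishMessage({ data: Buffer.from(message) });
  }

  // Receiving needs a subscription in Pub/Sub and is omitted here
  async receive(): Promise<string> {
    throw new Error("PubSubQueue.receive is not implemented");
  }
}

// Factory pattern for cloud selection
// QUEUE_URL and QUEUE_TOPIC are placeholder environment variable names
function createQueue(provider: string): QueueService {
  switch (provider) {
    case "aws": return new SQSQueue(new SQS({}), process.env.QUEUE_URL!);
    case "gcp": return new PubSubQueue(new PubSub(), process.env.QUEUE_TOPIC!);
    default: throw new Error(`Unsupported provider: ${provider}`);
  }
}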

3. Maintain Exit Capability

  • Regular data export testing
  • Document cloud-specific dependencies
  • Keep IaC portable across providers
  • Estimate migration effort annually

4. Containerize Everything

# Portable container runs anywhere
FROM node:20-alpine
WORKDIR /app
COPY package*.json ./
RUN npm ci --omit=dev
COPY . .
EXPOSE 3000
CMD ["node", "server.js"]

Network Connectivity

Cross-Cloud Networking

VPN Interconnect

AWS VPC                          GCP VPC
   |                                |
   +---> AWS VPN Gateway            |
              |                     |
              | IPsec tunnel        |
              |                     |
              +---> GCP Cloud VPN <-+

Dedicated Interconnect (Enterprise)

On-Premises Data Center
         |
    +----+----+
    |         |
AWS Direct  GCP Cloud
Connect     Interconnect
    |         |
    v         v
AWS VPC    GCP VPC
    |         |
    +----+----+
         |
    Transit Hub (e.g., Megaport, Equinix)

Service Mesh Across Clouds

Istio Multi-Cluster

# Primary cluster (AWS EKS)
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  values:
    global:
      meshID: multi-cloud-mesh
      multiCluster:
        clusterName: eks-primary
      network: aws-network

---
# Remote cluster (GCP GKE)
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  values:
    global:
      meshID: multi-cloud-mesh
      multiCluster:
        clusterName: gke-secondary
      network: gcp-network

Cost Management

Cross-Cloud Cost Visibility

FinOps Tools

  • CloudHealth by VMware
  • Apptio Cloudability
  • Spot.io (now part of NetApp)
  • Kubecost for Kubernetes

Unified Tagging Strategy

Required Tags (all clouds):
- environment: prod/staging/dev
- cost-center: engineering/marketing/sales
- owner: team-name
- project: project-code
- managed-by: terraform/manual
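
A tagging policy like this is only useful if it is enforced, for example in CI against planned resources. The check below is a minimal sketch. The allowed values come from the list above, `owner` and `project` are treated as free-form, and `tag_violations` is a hypothetical helper name.

```python
from typing import Dict, List, Optional, Set

# Allowed values copied from the list above; None means any non-empty value.
REQUIRED_TAGS: Dict[str, Optional[Set[str]]] = {
    "environment": {"prod", "staging", "dev"},
    "cost-center": {"engineering", "marketing", "sales"},
    "owner": None,
    "project": None,
    "managed-by": {"terraform", "manual"},
}


def tag_violations(tags: Dict[str, str]) -> List[str]:
    """Return a list of problems; an empty list means the tags comply."""
    problems = []
    for key, allowed in REQUIRED_TAGS.items():
        value = tags.get(key)
        if not value:
            problems.append(f"missing tag: {key}")
        elif allowed is not None and value not in allowed:
            problems.append(f"invalid value for {key}: {value}")
    return problems
```

Running the same check against resources from every provider keeps cost reports comparable, since a missing `cost-center` in one cloud otherwise shows up as unallocated spend.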

Cost Comparison Framework

| Workload Type | AWS | Azure | GCP | Decision |
|---------------|-----|-------|-----|----------|
| General Compute | EC2 m5 | D-series | n2-standard | Compare $/vCPU/hour |
| GPU Training | p4d | NC-series | A2 | GCP often cheaper |
| Object Storage | S3 | Blob | GCS | Similar, check egress |
| Analytics | Redshift | Synapse | BigQuery | BigQuery for ad-hoc |
| Kubernetes | EKS | AKS | GKE | GKE Autopilot simplest |
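
The "Compare $/vCPU/hour" and "check egress" decisions reduce to simple arithmetic, sketched below. The `Offer` class and helper names are hypothetical, and every price in the example is a placeholder, not a real quote. Pull current figures from each provider's pricing page before comparing.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Offer:
    provider: str
    instance: str
    vcpus: int
    hourly_usd: float

    @property
    def usd_per_vcpu_hour(self) -> float:
        return self.hourly_usd / self.vcpus


def cheapest(offers: Iterable[Offer]) -> Offer:
    """Return the offer with the lowest price per vCPU-hour."""
    return min(offers, key=lambda o: o.usd_per_vcpu_hour)


def monthly_egress_usd(gb_out: float, usd_per_gb: float) -> float:
    """Egress cost for one month at a flat per-GB rate (real pricing is tiered)."""
    return gb_out * usd_per_gb


# Placeholder prices for illustration only.
offers = [
    Offer("aws", "m5.xlarge", 4, 0.20),
    Offer("gcp", "n2-standard-4", 4, 0.16),
]
print(cheapest(offers).provider)
```

Price per vCPU ignores memory, sustained-use and committed-use discounts, and CPU generation, so it works as a first filter rather than a final answer.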

Observability

Unified Monitoring Stack

OpenTelemetry Collector

# Collect from all clouds, export to single backend
receivers:
  otlp:
    protocols:
      grpc:
      http:

processors:
  batch:
  attributes:
    actions:
      - key: cloud.provider
        action: upsert
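        value: ${env:CLOUD_PROVIDER}

exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"
  # Recent Collector releases no longer ship the dedicated jaeger exporter;
  # Jaeger ingests OTLP directly, so traces go out over OTLP/gRPC.
  otlp/jaeger:
    endpoint: jaeger:4317
    tls:
      insecure: true  # assumes plaintext inside a trusted network; enable TLS otherwise

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch, attributes]
      exporters: [otlp/jaeger]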
    metrics:
      receivers: [otlp]
      processors: [batch, attributes]
      exporters: [prometheus]

Grafana for Unified Dashboards

  • AWS CloudWatch data source
  • Azure Monitor data source
  • GCP Cloud Monitoring data source
  • Single pane of glass across all clouds

Security Considerations

Identity Federation

Cross-Cloud Identity

Corporate IdP (Okta/Azure AD)
         |
    SAML/OIDC
         |
    +----+----+----+
    |    |    |    |
  AWS  Azure  GCP  K8s
  IAM   AD   IAM  RBAC

Secrets Management

HashiCorp Vault (Cloud-Agnostic)

# Single secrets management across clouds
resource "vault_aws_secret_backend_role" "aws_role" {
  backend = vault_aws_secret_backend.aws.path
  name    = "app-role"
  credential_type = "iam_user"
}

resource "vault_gcp_secret_roleset" "gcp_role" {
  backend     = vault_gcp_secret_backend.gcp.path
  roleset     = "app-role"
  project     = var.gcp_project
  token_scopes = ["https://www.googleapis.com/auth/cloud-platform"]
}

Network Security

Zero-Trust Across Clouds

  • mTLS between all services (service mesh)
  • No implicit trust based on network location
  • Identity-based access control
  • Encrypted transit between clouds (VPN/interconnect)