# Multi-Cloud Architecture Reference
Comprehensive guide for multi-cloud strategies, abstraction layers, portability patterns, and vendor lock-in mitigation.
## Multi-Cloud Strategy
### When to Use Multi-Cloud
**Valid Use Cases**
- Regulatory compliance requiring data residency in specific regions
- Best-of-breed service selection (BigQuery for analytics, AWS for ML)
- Acquisition integration (different clouds in merged organizations)
- Disaster recovery that treats an entire cloud provider as a failure domain
- Negotiating leverage with cloud vendors

**Poor Reasons for Multi-Cloud**
- "Avoiding vendor lock-in" without a specific exit scenario
- Assuming portability is free (it has significant costs)
- Political decisions without technical justification
- Spreading workloads arbitrarily across providers
### Multi-Cloud Patterns
**Active-Active**
```
Users -> Global Load Balancer
                  |
        +---------+---------+
        |                   |
    AWS Region          GCP Region
        |                   |
        +----> Data Sync <--+
```
- Highest complexity and cost
- Best for global latency optimization
- Requires robust data synchronization

**Active-Passive (DR)**
```
Users -> Primary (AWS)
              |
          [Failover]
              |
        Secondary (Azure)
```
- Lower complexity than active-active
- Cloud provider becomes failure domain
- Cold or warm standby in secondary cloud
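
The failover step in the diagram above can be sketched as a small state machine driven by health checks. This is a minimal illustration, not a production controller: the class name, the `failure_threshold` parameter, and the site labels are assumptions made for the example, and a real setup would usually drive DNS or a global load balancer instead of returning a string.

```python
from dataclasses import dataclass


@dataclass
class FailoverController:
    """Tracks primary health checks and decides which site serves traffic."""

    failure_threshold: int = 3   # hypothetical: consecutive failures before failing over
    active: str = "primary"
    consecutive_failures: int = 0

    def record(self, primary_healthy: bool) -> str:
        """Record one health-check result and return the site that should be active."""
        if primary_healthy:
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failure_threshold:
                self.active = "secondary"
        # No automatic failback: once on the secondary, stay there until an
        # operator resets the controller after data has been reconciled.
        return self.active


controller = FailoverController(failure_threshold=3)
for healthy in (False, False, False):
    site = controller.record(healthy)
print(site)
```

Requiring several consecutive failures avoids failing over on a single dropped probe; the threshold trades detection speed against false positives.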
**Segmented by Workload**
```
Analytics -> GCP (BigQuery)
Core App -> AWS (ECS, RDS)
Office -> Azure (M365 integration)
```
- Each workload on best-fit cloud
- No cross-cloud data synchronization
- Simplest multi-cloud pattern
## Abstraction Layers
### Infrastructure Abstraction
**Terraform (Recommended)**
```hcl
# Provider-agnostic module structure
module "compute" {
  source        = "./modules/compute"
  provider_type = var.cloud_provider # aws, azure, gcp
  instance_size = var.instance_size  # must match the variable name used inside the module
  region        = var.region
}

# Provider-specific implementations
# modules/compute/aws.tf
resource "aws_instance" "main" {
  count         = var.provider_type == "aws" ? 1 : 0
  ami           = data.aws_ami.latest.id
  instance_type = local.aws_instance_map[var.instance_size]
}

# modules/compute/azure.tf
resource "azurerm_virtual_machine" "main" {
  count   = var.provider_type == "azure" ? 1 : 0
  vm_size = local.azure_vm_map[var.instance_size]
  # ... remaining required arguments omitted
}
```
**Pulumi (Code-First)**
```typescript
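// Abstract cloud resources with TypeScript
import * as aws from "@pulumi/aws";
import * as gcp from "@pulumi/gcp";

interface ComputeConfig {
  size: "small" | "medium" | "large";
  region: string;
}

// Illustrative size mapping; adjust to your workloads
const sizeMap = {
  aws: { small: "t3.small", medium: "t3.medium", large: "t3.large" },
  gcp: { small: "e2-small", medium: "e2-medium", large: "e2-standard-2" },
};

function createCompute(config: ComputeConfig, provider: "aws" | "gcp") {
  if (provider === "aws") {
    return new aws.ec2.Instance("web", {
      instanceType: sizeMap.aws[config.size],
      // ...
    });
  } else {
    return new gcp.compute.Instance("web", {
      machineType: sizeMap.gcp[config.size],
      // ...
    });
  }
}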
```
### Container Orchestration (Kubernetes)
**Portable Kubernetes Deployment**
```yaml
# Same manifests work across EKS, AKS, GKE
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web  # must match spec.selector.matchLabels
    spec:
      containers:
        - name: web
          image: myregistry/web:v1
          resources:
            requests:
              memory: "256Mi"
              cpu: "250m"
            limits:
              memory: "512Mi"
              cpu: "500m"
```
**Cloud-Specific Considerations**
| Feature | EKS | AKS | GKE |
|---------|-----|-----|-----|
| Load Balancer | ALB/NLB annotations | Azure LB | GCP LB |
| Storage Class | gp3, io2 | managed-premium | pd-ssd |
| IAM Integration | IRSA | Workload Identity | Workload Identity |
| Ingress | AWS ALB Controller | AGIC | GKE Ingress |
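
One way to handle the differences in the table is to keep a shared base manifest and apply a small per-platform overlay. The sketch below does this for the storage class only. The storage class names are copied from the table above and should be checked against what your clusters actually expose; `BASE_PVC`, `render_pvc`, and the claim name are hypothetical names used for illustration.

```python
import copy

# Storage class names as listed in the table above; confirm them on your clusters.
STORAGE_CLASS = {"eks": "gp3", "aks": "managed-premium", "gke": "pd-ssd"}

# Shared, platform-neutral part of the manifest (hypothetical claim name and size).
BASE_PVC = {
    "apiVersion": "v1",
    "kind": "PersistentVolumeClaim",
    "metadata": {"name": "web-data"},
    "spec": {
        "accessModes": ["ReadWriteOnce"],
        "resources": {"requests": {"storage": "10Gi"}},
    },
}


def render_pvc(platform: str) -> dict:
    """Return a copy of the base claim with the platform-specific storage class set."""
    if platform not in STORAGE_CLASS:
        raise ValueError(f"unsupported platform: {platform}")
    pvc = copy.deepcopy(BASE_PVC)  # never mutate the shared base
    pvc["spec"]["storageClassName"] = STORAGE_CLASS[platform]
    return pvc


print(render_pvc("eks")["spec"]["storageClassName"])
```

In practice the same idea is usually expressed with Kustomize overlays or Helm values; the Python version only shows the shape of the mapping.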
### Application Abstraction
**Database Abstraction**
```python
# Use standard protocols (SQL, Redis, S3 API)
import os

from sqlalchemy import create_engine

# Same code works with:
# - AWS RDS PostgreSQL
# - Azure Database for PostgreSQL
# - GCP Cloud SQL PostgreSQL
# - Self-managed PostgreSQL
DATABASE_URL = os.environ["DATABASE_URL"]
engine = create_engine(DATABASE_URL)
```
**Object Storage Abstraction**
```python
import os

import boto3

# S3-compatible API works with:
# - AWS S3
# - GCP Cloud Storage (interoperability mode)
# - MinIO
# - Cloudflare R2
s3_client = boto3.client(
    "s3",
    endpoint_url=os.environ.get("S3_ENDPOINT"),  # Override for non-AWS
    aws_access_key_id=os.environ["ACCESS_KEY"],
    aws_secret_access_key=os.environ["SECRET_KEY"],
)
```
## Data Synchronization
### Database Replication
**Cross-Cloud PostgreSQL**
```
AWS RDS Primary
      |
      | Logical Replication
      v
GCP Cloud SQL Replica (read-only)
```
Configuration:
```sql
-- On primary (AWS RDS); requires rds.logical_replication = 1 in the DB parameter group
CREATE PUBLICATION my_publication FOR ALL TABLES;

-- On replica (GCP Cloud SQL)
CREATE SUBSCRIPTION my_subscription
  CONNECTION 'host=aws-rds-endpoint dbname=mydb user=repl'
  PUBLICATION my_publication;
```
**Conflict Resolution Strategies**
- Last-write-wins (timestamp-based)
- Application-level conflict resolution
- CRDT data structures for eventually consistent data
- Avoid multi-master for transactional data
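
Two of the strategies above are small enough to sketch directly. The example below shows last-write-wins on timestamped values and a grow-only counter (G-Counter), one of the simplest CRDTs. The type and function names are assumptions made for illustration, and the last-write-wins variant relies on reasonably synchronized clocks between clouds, which is itself an assumption worth verifying.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Versioned:
    value: str
    timestamp: float  # seconds since epoch, as written by the originating cloud
    origin: str       # e.g. "aws" or "gcp"; used only to break timestamp ties


def last_write_wins(a: Versioned, b: Versioned) -> Versioned:
    """Pick the newer write; ties are broken deterministically by origin name."""
    return max(a, b, key=lambda v: (v.timestamp, v.origin))


@dataclass
class GCounter:
    """Grow-only counter: each node increments only its own slot."""

    counts: dict = field(default_factory=dict)

    def increment(self, node: str, amount: int = 1) -> None:
        self.counts[node] = self.counts.get(node, 0) + amount

    def merge(self, other: "GCounter") -> "GCounter":
        """Element-wise maximum; commutative, associative, and idempotent."""
        nodes = self.counts.keys() | other.counts.keys()
        return GCounter(
            {n: max(self.counts.get(n, 0), other.counts.get(n, 0)) for n in nodes}
        )

    @property
    def value(self) -> int:
        return sum(self.counts.values())


aws_counter, gcp_counter = GCounter(), GCounter()
aws_counter.increment("aws", 2)
gcp_counter.increment("gcp", 3)
print(aws_counter.merge(gcp_counter).value)
```

Last-write-wins silently discards the losing write, so it suits data where that is acceptable (caches, presence, preferences). The counter never loses increments because merging takes the maximum per node instead of choosing one side.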
### Object Storage Sync
**Rclone for Cross-Cloud Sync**
```bash
# Sync S3 to GCS
rclone sync s3:my-bucket gcs:my-bucket \
  --transfers 32 \
  --checkers 16 \
  --s3-upload-concurrency 8

# Bidirectional sync with conflict handling
# (the first bisync run between two paths must include --resync)
rclone bisync s3:bucket gcs:bucket \
  --conflict-resolve newer
```
**Event-Driven Replication**
```
S3 Bucket -> S3 Event -> Lambda -> GCS Upload
                                       |
                                       v
                               Consistency Check
```
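
The consistency check at the end of the pipeline can be as simple as comparing object listings from both sides. The sketch below assumes you have already produced a `{object_key: checksum}` mapping for each bucket using a checksum both providers compute the same way; note that S3 ETags are not plain MD5 digests for multipart uploads, so they are not always directly comparable. The function name `find_drift` is hypothetical.

```python
def find_drift(source: dict, replica: dict) -> dict:
    """Compare {object_key: checksum} listings and report how the replica differs."""
    missing = sorted(k for k in source if k not in replica)
    mismatched = sorted(k for k in source if k in replica and source[k] != replica[k])
    extra = sorted(k for k in replica if k not in source)
    return {"missing": missing, "mismatched": mismatched, "extra": extra}


# Example listings with made-up keys and checksums
source_listing = {"a.txt": "111", "b.txt": "222", "c.txt": "333"}
replica_listing = {"a.txt": "111", "b.txt": "999", "d.txt": "444"}

drift = find_drift(source_listing, replica_listing)
print(drift)
```

Objects reported as `missing` or `mismatched` can be re-queued for replication; `extra` objects usually indicate deletes that were not propagated.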
## Vendor Lock-In Mitigation
### Lock-In Risk Assessment
| Service Type | Lock-In Risk | Mitigation Strategy |
|--------------|--------------|---------------------|
| Compute (VMs) | Low | Standard OS images, IaC |
| Kubernetes | Low | Portable manifests, avoid proprietary add-ons |
| Object Storage | Low | S3-compatible API, standard formats |
| Managed Databases | Medium | Standard SQL, logical backups |
| Serverless Functions | High | Abstraction layers, containers |
| Proprietary AI/ML | High | Open-source alternatives, ONNX models |
| Managed Services | High | Evaluate portability before adoption |
### Mitigation Strategies
**1. Use Open Standards**
- SQL databases over proprietary NoSQL
- Kubernetes over ECS/Cloud Run
- S3 API for object storage
- OpenTelemetry for observability
- OIDC for authentication

**2. Abstract Proprietary Services**
```typescript
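// Wrap cloud-specific services behind a common interface
import { SQS } from "@aws-sdk/client-sqs";
import { PubSub } from "@google-cloud/pubsub";

interface QueueService {
  send(message: string): Promise<void>;
  receive(): Promise<string>;
}

class SQSQueue implements QueueService {
  constructor(private sqsClient: SQS, private url: string) {}

  async send(message: string): Promise<void> {
    await this.sqsClient.sendMessage({ QueueUrl: this.url, MessageBody: message });
  }

  async receive(): Promise<string> {
    throw new Error("receive() not shown in this sketch");
  }
}

class PubSubQueue implements QueueService {
  constructor(private pubsubClient: PubSub, private topic: string) {}

  async send(message: string): Promise<void> {
    await this.pubsubClient.topic(this.topic).publishMessage({ data: Buffer.from(message) });
  }

  async receive(): Promise<string> {
    throw new Error("receive() not shown in this sketch");
  }
}

// Factory pattern for cloud selection; QUEUE_URL and QUEUE_TOPIC are placeholder variable names
function createQueue(provider: "aws" | "gcp"): QueueService {
  switch (provider) {
    case "aws": return new SQSQueue(new SQS({}), process.env.QUEUE_URL!);
    case "gcp": return new PubSubQueue(new PubSub(), process.env.QUEUE_TOPIC!);
    default: throw new Error(`Unsupported provider: ${provider}`);
  }
}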
```
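
A side benefit of wrapping the queue behind an interface is that application code can be tested without any cloud account. The sketch below restates the same interface in Python and adds an in-memory implementation as a test double; `InMemoryQueue` and `process_one` are hypothetical names, not part of any SDK.

```python
from abc import ABC, abstractmethod
from collections import deque
from typing import Optional


class QueueService(ABC):
    """Provider-neutral queue interface that application code depends on."""

    @abstractmethod
    def send(self, message: str) -> None: ...

    @abstractmethod
    def receive(self) -> Optional[str]: ...


class InMemoryQueue(QueueService):
    """Test double: same interface, no cloud dependency."""

    def __init__(self) -> None:
        self._messages = deque()

    def send(self, message: str) -> None:
        self._messages.append(message)

    def receive(self) -> Optional[str]:
        # FIFO order; returns None when the queue is empty
        return self._messages.popleft() if self._messages else None


def process_one(queue: QueueService) -> Optional[str]:
    """Example application logic that only knows about the interface."""
    message = queue.receive()
    return message.upper() if message is not None else None


queue = InMemoryQueue()
queue.send("order-created")
print(process_one(queue))
```

SQS-backed or Pub/Sub-backed classes would implement the same two methods, so swapping providers touches only the factory that constructs the queue.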
**3. Maintain Exit Capability**
- Regular data export testing
- Document cloud-specific dependencies
- Keep IaC portable across providers
- Estimate migration effort annually

**4. Containerize Everything**
```dockerfile
# Portable container runs anywhere
FROM node:20-alpine
WORKDIR /app
COPY package*.json ./
RUN npm ci --omit=dev
COPY . .
EXPOSE 3000
CMD ["node", "server.js"]
```
## Network Connectivity
### Cross-Cloud Networking
**VPN Interconnect**
```
AWS VPC                          GCP VPC
   |                                |
   +---> AWS VPN Gateway            |
              |                     |
              |  IPsec tunnel       |
              |                     |
              +---> GCP Cloud VPN <-+
```
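
Before building a tunnel like the one above, it is worth confirming that the address ranges on each side do not overlap, since overlapping ranges cannot be routed cleanly between the two networks. The following is a minimal sketch using only the Python standard library; the network names and CIDR blocks are made-up examples.

```python
import ipaddress
from itertools import combinations


def overlapping_cidrs(networks: dict) -> list:
    """Return pairs of network names whose CIDR ranges overlap."""
    parsed = {name: ipaddress.ip_network(cidr) for name, cidr in networks.items()}
    return [
        (a, b)
        for a, b in combinations(parsed, 2)
        if parsed[a].overlaps(parsed[b])
    ]


# Hypothetical address plan
plan = {
    "aws-vpc": "10.0.0.0/16",
    "gcp-vpc": "10.0.128.0/20",   # falls inside 10.0.0.0/16
    "azure-vnet": "10.1.0.0/16",
}

conflicts = overlapping_cidrs(plan)
print(conflicts)
```

Running a check like this in CI against the IaC variables catches address-plan conflicts before any gateway is provisioned.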
**Dedicated Interconnect (Enterprise)**
```
      On-Premises Data Center
                |
           +----+----+
           |         |
      AWS Direct  GCP Cloud
       Connect    Interconnect
           |         |
           v         v
        AWS VPC   GCP VPC
           |         |
           +----+----+
                |
Transit Hub (e.g., Megaport, Equinix)
```
### Service Mesh Across Clouds
**Istio Multi-Cluster**
```yaml
# Primary cluster (AWS EKS)
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  values:
    global:
      meshID: multi-cloud-mesh
      multiCluster:
        clusterName: eks-primary
      network: aws-network
---
# Remote cluster (GCP GKE)
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  values:
    global:
      meshID: multi-cloud-mesh
      multiCluster:
        clusterName: gke-secondary
      network: gcp-network
```
## Cost Management
### Cross-Cloud Cost Visibility
**FinOps Tools**
- CloudHealth by VMware
- Apptio Cloudability
- Spot.io (now part of NetApp)
- Kubecost for Kubernetes

**Unified Tagging Strategy**
```
Required Tags (all clouds):
- environment: prod/staging/dev
- cost-center: engineering/marketing/sales
- owner: team-name
- project: project-code
- managed-by: terraform/manual
```
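
A tagging policy like the one above is easiest to enforce when it is checked automatically, for example in CI against planned resources. The sketch below encodes the required tags and their allowed values; `owner` and `project` are treated as free-form because the list above does not enumerate values for them. The function name `validate_tags` is hypothetical.

```python
# Required tags from the list above; None means any non-empty value is accepted.
REQUIRED_TAGS = {
    "environment": {"prod", "staging", "dev"},
    "cost-center": {"engineering", "marketing", "sales"},
    "owner": None,
    "project": None,
    "managed-by": {"terraform", "manual"},
}


def validate_tags(tags: dict) -> list:
    """Return a list of human-readable problems; an empty list means compliant."""
    problems = []
    for key, allowed in REQUIRED_TAGS.items():
        value = tags.get(key)
        if not value:
            problems.append(f"missing tag: {key}")
        elif allowed is not None and value not in allowed:
            problems.append(f"invalid value for {key}: {value}")
    return problems


resource_tags = {"environment": "qa", "owner": "platform"}
for problem in validate_tags(resource_tags):
    print(problem)
```

The same rule set can be applied to AWS tags, Azure tags, and GCP labels as long as the resource metadata is first normalized into a plain dictionary.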
### Cost Comparison Framework
| Workload Type | AWS | Azure | GCP | Decision |
|---------------|-----|-------|-----|----------|
| General Compute | EC2 m5 | D-series | n2-standard | Compare $/vCPU/hour |
| GPU Training | p4d | NC-series | A2 | GCP often cheaper |
| Object Storage | S3 | Blob | GCS | Similar, check egress |
| Analytics | Redshift | Synapse | BigQuery | BigQuery for ad-hoc |
| Kubernetes | EKS | AKS | GKE | GKE Autopilot simplest |
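
The "$/vCPU/hour" comparison for general compute is simple arithmetic: divide the hourly instance price by its vCPU count so that differently sized instances become comparable. The sketch below uses placeholder provider names and made-up prices purely to show the calculation; substitute current list or committed-use prices for the instance types you are evaluating.

```python
def cost_per_vcpu_hour(hourly_price: float, vcpus: int) -> float:
    """Normalize an instance's hourly price by its vCPU count."""
    if vcpus <= 0:
        raise ValueError("vcpus must be positive")
    return hourly_price / vcpus


def cheapest(offers: dict) -> str:
    """Return the offer name with the lowest normalized cost.

    offers maps a name to (hourly_price, vcpus).
    """
    return min(offers, key=lambda name: cost_per_vcpu_hour(*offers[name]))


# Made-up numbers for illustration only; these are NOT real cloud prices.
offers = {
    "provider-a": (0.40, 8),    # 0.050 per vCPU-hour
    "provider-b": (0.22, 4),    # 0.055 per vCPU-hour
    "provider-c": (0.72, 16),   # 0.045 per vCPU-hour
}

print(cheapest(offers))
```

This metric ignores memory per vCPU, CPU generation, and egress charges, so it works best as a first filter before a fuller comparison.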
## Observability
### Unified Monitoring Stack
**OpenTelemetry Collector**
```yaml
# Collect from all clouds, export to single backend
receivers:
  otlp:
    protocols:
      grpc:
      http:
processors:
  batch:
  attributes:
    actions:
      - key: cloud.provider
        action: upsert
        value: ${env:CLOUD_PROVIDER}
exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"
  # Jaeger accepts OTLP natively; the dedicated jaeger exporter
  # was removed from recent Collector releases
  otlp/jaeger:
    endpoint: jaeger:4317
    tls:
      insecure: true  # assumes plaintext traffic inside the cluster
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [attributes, batch]
      exporters: [otlp/jaeger]
    metrics:
      receivers: [otlp]
      processors: [attributes, batch]
      exporters: [prometheus]
```
**Grafana for Unified Dashboards**
- AWS CloudWatch data source
- Azure Monitor data source
- GCP Cloud Monitoring data source
- Single pane of glass across all clouds
## Security Considerations
### Identity Federation
**Cross-Cloud Identity**
```
Corporate IdP (Okta/Azure AD)
            |
        SAML/OIDC
            |
   +-----+--+--+-----+
   |     |     |     |
  AWS  Azure  GCP   K8s
  IAM   AD    IAM   RBAC
```
### Secrets Management
**HashiCorp Vault (Cloud-Agnostic)**
```hcl
# Single secrets management across clouds
resource "vault_aws_secret_backend_role" "aws_role" {
  backend         = vault_aws_secret_backend.aws.path
  name            = "app-role"
  credential_type = "iam_user"
}

resource "vault_gcp_secret_roleset" "gcp_role" {
  backend      = vault_gcp_secret_backend.gcp.path
  roleset      = "app-role"
  project      = var.gcp_project
  token_scopes = ["https://www.googleapis.com/auth/cloud-platform"]
}
```
### Network Security
**Zero-Trust Across Clouds**
- mTLS between all services (service mesh)
- No implicit trust based on network location
- Identity-based access control
- Encrypted transit between clouds (VPN/interconnect)
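
The second and third points above can be made concrete with a small authorization check that looks only at the caller's verified identity and never at its network location. This is a simplified sketch: in a service mesh the identity would come from the peer's mTLS certificate and the policy would live in mesh configuration, and the service names and `ALLOWED_CALLERS` table here are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Request:
    peer_identity: Optional[str]  # identity taken from the verified mTLS client certificate
    source_ip: str                # kept for logging; deliberately not used for authorization


# Hypothetical policy: which caller identities may reach which service.
ALLOWED_CALLERS = {
    "payments": {"checkout", "billing"},
}


def is_allowed(target_service: str, request: Request) -> bool:
    """Authorize by identity only; an internal-looking IP grants nothing."""
    if request.peer_identity is None:
        return False  # unauthenticated peers are always rejected
    return request.peer_identity in ALLOWED_CALLERS.get(target_service, set())


print(is_allowed("payments", Request("checkout", "203.0.113.7")))
print(is_allowed("payments", Request(None, "10.0.0.5")))
```

The first request is accepted even though it arrives from a public address, and the second is rejected even though it comes from a private range, which is the behavior zero-trust calls for.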