# Multi-Cloud Architecture Reference

Comprehensive guide for multi-cloud strategies, abstraction layers, portability patterns, and vendor lock-in mitigation.

## Multi-Cloud Strategy

### When to Use Multi-Cloud

**Valid Use Cases**
- Regulatory compliance requiring data residency in specific regions
- Best-of-breed service selection (BigQuery for analytics, AWS for ML)
- Acquisition integration (different clouds in merged organizations)
- Disaster recovery with cloud provider as failure domain
- Negotiating leverage with cloud vendors

**Poor Reasons for Multi-Cloud**
- "Avoiding vendor lock-in" without specific exit scenario
- Assuming portability is free (it has significant costs)
- Political decisions without technical justification
- Spreading workloads arbitrarily across providers

### Multi-Cloud Patterns

**Active-Active**
```
Users -> Global Load Balancer
                |
      +---------+---------+
      |                   |
  AWS Region          GCP Region
      |                   |
      +----> Data Sync <--+
```
- Highest complexity and cost
- Best for global latency optimization
- Requires robust data synchronization

**Active-Passive (DR)**
```
Users -> Primary (AWS)
              |
         [Failover]
              |
      Secondary (Azure)
```
- Lower complexity than active-active
- Cloud provider becomes failure domain
- Cold or warm standby in secondary cloud

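The failover step in the active-passive diagram is usually driven by health checks. The following is a minimal sketch of that decision, assuming a hypothetical threshold of three consecutive failed checks and assuming that failback to the primary is handled manually; neither assumption comes from the patterns above.

```python
from dataclasses import dataclass


@dataclass
class FailoverMonitor:
    """Tracks primary health-check results and decides when to fail over."""

    failure_threshold: int = 3  # hypothetical: consecutive failures before failover
    consecutive_failures: int = 0
    active: str = "primary"

    def record(self, primary_healthy: bool) -> str:
        """Record one health-check result and return the site that should serve traffic."""
        if self.active == "secondary":
            # Assumption: failback is a manual decision, so stay on the secondary.
            return self.active
        if primary_healthy:
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failure_threshold:
                self.active = "secondary"
        return self.active


monitor = FailoverMonitor()
checks = [True, False, False, True, False, False, False, True]
results = [monitor.record(healthy) for healthy in checks]
# A single recovered check resets the counter; three failures in a row trigger failover.
```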
**Segmented by Workload**
```
Analytics  -> GCP (BigQuery)
Core App   -> AWS (ECS, RDS)
Office     -> Azure (M365 integration)
```
- Each workload on best-fit cloud
- No cross-cloud data synchronization
- Simplest multi-cloud pattern

## Abstraction Layers

### Infrastructure Abstraction

**Terraform (Recommended)**
```hcl
# Provider-agnostic module structure
module "compute" {
  source = "./modules/compute"

  provider_type = var.cloud_provider # aws, azure, gcp
  instance_size = var.instance_size  # abstract size, mapped per provider below
  region        = var.region
}

# Provider-specific implementations
# modules/compute/aws.tf
resource "aws_instance" "main" {
  count         = var.provider_type == "aws" ? 1 : 0
  ami           = data.aws_ami.latest.id
  instance_type = local.aws_instance_map[var.instance_size]
}

# modules/compute/azure.tf
# Abbreviated: a real VM resource needs further required arguments
resource "azurerm_virtual_machine" "main" {
  count   = var.provider_type == "azure" ? 1 : 0
  vm_size = local.azure_vm_map[var.instance_size]
}
```

**Pulumi (Code-First)**
```typescript
// Abstract cloud resources with TypeScript
import * as aws from "@pulumi/aws";
import * as gcp from "@pulumi/gcp";

interface ComputeConfig {
  size: "small" | "medium" | "large";
  region: string;
}

// Illustrative size mapping; choose instance families to suit the workload
const sizeMap = {
  aws: { small: "t3.small", medium: "t3.medium", large: "t3.large" },
  gcp: { small: "e2-small", medium: "e2-medium", large: "e2-standard-2" },
};

function createCompute(config: ComputeConfig, provider: "aws" | "gcp") {
  if (provider === "aws") {
    return new aws.ec2.Instance("web", {
      instanceType: sizeMap.aws[config.size],
      // ...
    });
  } else {
    return new gcp.compute.Instance("web", {
      machineType: sizeMap.gcp[config.size],
      // ...
    });
  }
}
```

### Container Orchestration (Kubernetes)

**Portable Kubernetes Deployment**
```yaml
# Same manifests work across EKS, AKS, GKE
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web # must match spec.selector.matchLabels
    spec:
      containers:
        - name: web
          image: myregistry/web:v1
          resources:
            requests:
              memory: "256Mi"
              cpu: "250m"
            limits:
              memory: "512Mi"
              cpu: "500m"
```

**Cloud-Specific Considerations**
| Feature | EKS | AKS | GKE |
|---------|-----|-----|-----|
| Load Balancer | ALB/NLB annotations | Azure LB | GCP LB |
| Storage Class | gp3, io2 | managed-premium | pd-ssd |
| IAM Integration | IRSA | Workload Identity | Workload Identity |
| Ingress | AWS ALB Controller | AGIC | GKE Ingress |

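The cloud-specific rows in the table are usually the only parts of a manifest that change per cluster. As a hedged sketch, the storage row can be isolated in one lookup while the rest of the manifest stays shared. The `STORAGE_CLASS` mapping and `render_pvc` helper are hypothetical names, and the sketch assumes each cluster actually exposes a StorageClass with the name taken from the table, which should be verified per cluster.

```python
# Storage class per managed Kubernetes service, taken from the table above.
# Assumption: a StorageClass with this exact name exists in the target cluster.
STORAGE_CLASS = {
    "eks": "gp3",
    "aks": "managed-premium",
    "gke": "pd-ssd",
}


def render_pvc(name: str, size_gi: int, cluster: str) -> dict:
    """Build a PersistentVolumeClaim manifest with the cluster-specific storage class."""
    try:
        storage_class = STORAGE_CLASS[cluster]
    except KeyError:
        raise ValueError(f"unknown cluster type: {cluster}") from None
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name},
        "spec": {
            "accessModes": ["ReadWriteOnce"],
            "storageClassName": storage_class,
            "resources": {"requests": {"storage": f"{size_gi}Gi"}},
        },
    }


pvc = render_pvc("web-data", 20, "gke")
```

The same idea is commonly expressed with Kustomize overlays or Helm values; the Python form here only illustrates keeping the provider difference in a single place.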
### Application Abstraction

**Database Abstraction**
```python
# Use standard protocols (SQL, Redis, S3 API)
import os

from sqlalchemy import create_engine

# Same code works with:
# - AWS RDS PostgreSQL
# - Azure Database for PostgreSQL
# - GCP Cloud SQL PostgreSQL
# - Self-managed PostgreSQL

DATABASE_URL = os.environ["DATABASE_URL"]
engine = create_engine(DATABASE_URL)
```

**Object Storage Abstraction**
```python
import os

import boto3
from botocore.config import Config

# S3-compatible API works with:
# - AWS S3
# - GCP Cloud Storage (interoperability mode)
# - MinIO
# - Cloudflare R2

s3_client = boto3.client(
    's3',
    endpoint_url=os.environ.get("S3_ENDPOINT"),  # Override for non-AWS; None uses AWS
    aws_access_key_id=os.environ["ACCESS_KEY"],
    aws_secret_access_key=os.environ["SECRET_KEY"],
    config=Config(signature_version="s3v4"),
)
```

## Data Synchronization

### Database Replication

**Cross-Cloud PostgreSQL**
```
AWS RDS Primary
       |
       | Logical Replication
       v
GCP Cloud SQL Replica (read-only)
```

Configuration:
```sql
-- On primary (AWS RDS); requires rds.logical_replication = 1 in the parameter group
CREATE PUBLICATION my_publication FOR ALL TABLES;

-- On replica (GCP Cloud SQL); table schemas must already exist (DDL is not replicated)
-- Password omitted from the connection string here
CREATE SUBSCRIPTION my_subscription
  CONNECTION 'host=aws-rds-endpoint dbname=mydb user=repl'
  PUBLICATION my_publication;
```

A logical replication subscriber still accepts local writes, so "read-only" in the diagram is a convention the application must enforce rather than something PostgreSQL guarantees.

**Conflict Resolution Strategies**
- Last-write-wins (timestamp-based)
- Application-level conflict resolution
- CRDT data structures for eventually consistent data
- Avoid multi-master for transactional data

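Two of the strategies listed above can be sketched in a few lines. This is an illustrative sketch only: `Versioned`, `lww_merge`, and `gcounter_merge` are hypothetical names, the last-write-wins variant assumes reasonably synchronized clocks and breaks ties on the origin name, and the CRDT shown is a grow-only counter (G-Counter), one of the simplest CRDTs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Versioned:
    value: str
    timestamp: float  # wall-clock write time; assumes clocks are roughly in sync
    origin: str       # e.g. "aws" or "gcp"; used only to break timestamp ties


def lww_merge(a: Versioned, b: Versioned) -> Versioned:
    """Last-write-wins: keep the newer write; ties are broken deterministically."""
    return max(a, b, key=lambda v: (v.timestamp, v.origin))


def gcounter_merge(a: dict, b: dict) -> dict:
    """G-Counter merge: per-replica maximum, so merging is order-independent."""
    return {replica: max(a.get(replica, 0), b.get(replica, 0)) for replica in a.keys() | b.keys()}


def gcounter_value(counter: dict) -> int:
    """Total count is the sum of every replica's own increments."""
    return sum(counter.values())


# Last-write-wins gives the same answer regardless of merge order.
from_aws = Versioned("shipped", timestamp=1700000010.0, origin="aws")
from_gcp = Versioned("pending", timestamp=1700000005.0, origin="gcp")
winner = lww_merge(from_aws, from_gcp)

# Each cloud increments only its own slot; merging never loses increments.
seen_in_aws = {"aws": 5, "gcp": 1}
seen_in_gcp = {"aws": 2, "gcp": 3}
merged = gcounter_merge(seen_in_aws, seen_in_gcp)
total = gcounter_value(merged)
```

Last-write-wins silently discards the older write, which is why the list above steers transactional data away from multi-master setups.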
### Object Storage Sync

**Rclone for Cross-Cloud Sync**
```bash
# Sync S3 to GCS (one-way; makes the destination match the source)
# Note: --s3-upload-concurrency only affects uploads to an S3 destination
rclone sync s3:my-bucket gcs:my-bucket \
  --transfers 32 \
  --checkers 16 \
  --s3-upload-concurrency 8

# Bidirectional sync with conflict handling
# The first bisync run for a pair of paths needs --resync to create a baseline
rclone bisync s3:bucket gcs:bucket \
  --conflict-resolve newer
```

**Event-Driven Replication**
```
S3 Bucket -> S3 Event -> Lambda -> GCS Upload
                            |
                            v
                    Consistency Check
```

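The consistency check in the diagram can be as simple as comparing two listings. The sketch below assumes each bucket has already been listed into a mapping of object key to content checksum; `diff_manifests` is a hypothetical helper, and how the checksums are produced is left open. S3 ETags of multipart uploads are not plain MD5 digests, so a real check needs a checksum computed the same way on both sides.

```python
def diff_manifests(source: dict, replica: dict) -> dict:
    """Compare {object_key: checksum} listings from the source and replica buckets."""
    missing = sorted(key for key in source if key not in replica)
    mismatched = sorted(key for key in source if key in replica and source[key] != replica[key])
    extra = sorted(key for key in replica if key not in source)
    return {"missing": missing, "mismatched": mismatched, "extra": extra}


# Hypothetical listings; the checksum strings are placeholders.
source_listing = {"a.csv": "c1", "b.csv": "c2", "c.csv": "c3"}
replica_listing = {"a.csv": "c1", "b.csv": "stale", "old.csv": "c9"}

report = diff_manifests(source_listing, replica_listing)
in_sync = not any(report.values())
```

Objects reported as `missing` or `mismatched` are candidates for re-replication; `extra` objects usually indicate deletes that were never propagated.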
## Vendor Lock-In Mitigation

### Lock-In Risk Assessment

| Service Type | Lock-In Risk | Mitigation Strategy |
|--------------|--------------|---------------------|
| Compute (VMs) | Low | Standard OS images, IaC |
| Kubernetes | Low | Portable manifests, avoid proprietary add-ons |
| Object Storage | Low | S3-compatible API, standard formats |
| Managed Databases | Medium | Standard SQL, logical backups |
| Serverless Functions | High | Abstraction layers, containers |
| Proprietary AI/ML | High | Open-source alternatives, ONNX models |
| Managed Services | High | Evaluate portability before adoption |

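### Mitigation Strategies

**1. Use Open Standards**
- SQL databases over proprietary NoSQL
- Kubernetes over ECS/Cloud Run
- S3 API for object storage
- OpenTelemetry for observability
- OIDC for authentication

**2. Abstract Proprietary Services**
```typescript
// Wrap cloud-specific services
import { SQS } from "@aws-sdk/client-sqs";
import { PubSub } from "@google-cloud/pubsub";

interface QueueService {
  send(message: string): Promise<void>;
  receive(): Promise<string>;
}

class SQSQueue implements QueueService {
  constructor(private sqsClient: SQS, private url: string) {}

  async send(message: string): Promise<void> {
    await this.sqsClient.sendMessage({ QueueUrl: this.url, MessageBody: message });
  }

  async receive(): Promise<string> {
    throw new Error("receive() is not implemented in this sketch");
  }
}

class PubSubQueue implements QueueService {
  constructor(private pubsubClient: PubSub, private topic: string) {}

  async send(message: string): Promise<void> {
    await this.pubsubClient.topic(this.topic).publishMessage({ data: Buffer.from(message) });
  }

  async receive(): Promise<string> {
    throw new Error("receive() is not implemented in this sketch");
  }
}

// Factory pattern for cloud selection
// QUEUE_URL and PUBSUB_TOPIC are illustrative environment variable names
function createQueue(provider: string): QueueService {
  switch (provider) {
    case "aws": return new SQSQueue(new SQS({}), process.env.QUEUE_URL!);
    case "gcp": return new PubSubQueue(new PubSub(), process.env.PUBSUB_TOPIC!);
    default: throw new Error(`Unsupported provider: ${provider}`);
  }
}
```
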
**3. Maintain Exit Capability**
- Regular data export testing
- Document cloud-specific dependencies
- Keep IaC portable across providers
- Estimate migration effort annually

**4. Containerize Everything**
```dockerfile
# Portable container runs anywhere
FROM node:20-alpine
WORKDIR /app
COPY package*.json ./
# --omit=dev replaces the deprecated --only=production flag
RUN npm ci --omit=dev
COPY . .
EXPOSE 3000
CMD ["node", "server.js"]
```

## Network Connectivity

### Cross-Cloud Networking

**VPN Interconnect**
```
AWS VPC                          GCP VPC
   |                                |
   +---> AWS VPN Gateway            |
               |                    |
               | IPsec tunnel       |
               |                    |
   +---> GCP Cloud VPN <------------+
```

**Dedicated Interconnect (Enterprise)**
```
     On-Premises Data Center
               |
          +----+----+
          |         |
     AWS Direct  GCP Cloud
      Connect   Interconnect
          |         |
          v         v
       AWS VPC   GCP VPC
          |         |
          +----+----+
               |
  Transit Hub (e.g., Megaport, Equinix)
```

### Service Mesh Across Clouds

**Istio Multi-Cluster**
```yaml
# Primary cluster (AWS EKS)
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  values:
    global:
      meshID: multi-cloud-mesh
      multiCluster:
        clusterName: eks-primary
      network: aws-network
---
# Remote cluster (GCP GKE)
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  values:
    global:
      meshID: multi-cloud-mesh
      multiCluster:
        clusterName: gke-secondary
      network: gcp-network
```

## Cost Management

### Cross-Cloud Cost Visibility

**FinOps Tools**
- CloudHealth by VMware
- Apptio Cloudability
- Spot.io (now part of NetApp)
- Kubecost for Kubernetes

**Unified Tagging Strategy**
```
Required Tags (all clouds):
- environment: prod/staging/dev
- cost-center: engineering/marketing/sales
- owner: team-name
- project: project-code
- managed-by: terraform/manual
```

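A tagging strategy only helps cost attribution if it is enforced. The sketch below checks a resource's tags against the required set listed above; `REQUIRED_TAGS` and `validate_tags` are hypothetical names, and treating `owner` and `project` as free-form values is an assumption, since the list gives only placeholders for them.

```python
# Required tags from the list above. None means any non-empty value is accepted
# (assumption: owner and project are free-form).
REQUIRED_TAGS = {
    "environment": {"prod", "staging", "dev"},
    "cost-center": {"engineering", "marketing", "sales"},
    "owner": None,
    "project": None,
    "managed-by": {"terraform", "manual"},
}


def validate_tags(tags: dict) -> list:
    """Return a list of human-readable problems; an empty list means compliant."""
    problems = []
    for key, allowed in REQUIRED_TAGS.items():
        value = tags.get(key)
        if not value:
            problems.append(f"missing tag: {key}")
        elif allowed is not None and value not in allowed:
            problems.append(f"invalid value for {key}: {value}")
    return problems


compliant = validate_tags({
    "environment": "prod",
    "cost-center": "engineering",
    "owner": "platform-team",
    "project": "proj-123",
    "managed-by": "terraform",
})
non_compliant = validate_tags({"environment": "production", "owner": "platform-team"})
```

A check like this could run in CI against a Terraform plan or against exported resource inventories from each cloud, so the same rule applies everywhere.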
### Cost Comparison Framework

| Workload Type | AWS | Azure | GCP | Decision |
|---------------|-----|-------|-----|----------|
| General Compute | EC2 m5 | D-series | n2-standard | Compare $/vCPU/hour |
| GPU Training | p4d | NC-series | A2 | GCP often cheaper |
| Object Storage | S3 | Blob | GCS | Similar, check egress |
| Analytics | Redshift | Synapse | BigQuery | BigQuery for ad-hoc |
| Kubernetes | EKS | AKS | GKE | GKE Autopilot simplest |

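The "$/vCPU/hour" comparison in the table amounts to normalizing each offer by its vCPU count before comparing. The sketch below shows the arithmetic only. The offer names and prices are placeholders invented for illustration, not real list prices, and `price_per_vcpu_hour` is a hypothetical helper.

```python
def price_per_vcpu_hour(hourly_price: float, vcpus: int) -> float:
    """Normalize an instance's hourly price by its vCPU count."""
    if vcpus <= 0:
        raise ValueError("vcpus must be positive")
    return hourly_price / vcpus


# Placeholder offers: (hourly price in USD, vCPU count). Not real pricing data.
offers = {
    "provider-a": (0.40, 8),
    "provider-b": (0.24, 4),
    "provider-c": (0.09, 2),
}

normalized = {name: price_per_vcpu_hour(price, vcpus) for name, (price, vcpus) in offers.items()}
cheapest = min(normalized, key=normalized.get)
```

Here the offer with the lowest sticker price per instance is not automatically the cheapest per vCPU, which is the point of normalizing. A fuller comparison would also account for memory, committed-use discounts, and egress.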
## Observability

### Unified Monitoring Stack

**OpenTelemetry Collector**
```yaml
# Collect from all clouds, export to single backend
receivers:
  otlp:
    protocols:
      grpc:
      http:

processors:
  batch:
  attributes:
    actions:
      - key: cloud.provider
        action: upsert
        value: ${env:CLOUD_PROVIDER}

exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"
  # The dedicated jaeger exporter was removed from recent Collector releases;
  # Jaeger accepts OTLP directly
  otlp/jaeger:
    endpoint: jaeger:4317
    tls:
      insecure: true # assumes a trusted in-cluster hop; enable TLS across clouds

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [attributes, batch]
      exporters: [otlp/jaeger]
    metrics:
      receivers: [otlp]
      processors: [attributes, batch]
      exporters: [prometheus]
```

**Grafana for Unified Dashboards**
- AWS CloudWatch data source
- Azure Monitor data source
- GCP Cloud Monitoring data source
- Single pane of glass across all clouds

## Security Considerations

### Identity Federation

**Cross-Cloud Identity**
```
Corporate IdP (Okta/Azure AD)
            |
        SAML/OIDC
            |
    +----+----+----+
    |    |    |    |
   AWS Azure GCP  K8s
   IAM  AD   IAM  RBAC
```

### Secrets Management

**HashiCorp Vault (Cloud-Agnostic)**
```hcl
# Single secrets management across clouds
resource "vault_aws_secret_backend_role" "aws_role" {
  backend         = vault_aws_secret_backend.aws.path
  name            = "app-role"
  credential_type = "iam_user"
  # iam_user roles need a policy; this managed policy is only an example
  policy_arns     = ["arn:aws:iam::aws:policy/ReadOnlyAccess"]
}

resource "vault_gcp_secret_roleset" "gcp_role" {
  backend      = vault_gcp_secret_backend.gcp.path
  roleset      = "app-role"
  project      = var.gcp_project
  token_scopes = ["https://www.googleapis.com/auth/cloud-platform"]

  # At least one binding is required; the role shown is only an example
  binding {
    resource = "//cloudresourcemanager.googleapis.com/projects/${var.gcp_project}"
    roles    = ["roles/viewer"]
  }
}
```

### Network Security

**Zero-Trust Across Clouds**
- mTLS between all services (service mesh)
- No implicit trust based on network location
- Identity-based access control
- Encrypted transit between clouds (VPN/interconnect)

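The second and third points above can be illustrated with a small authorization check that looks only at the caller's verified identity and never at its network address. This is a sketch under stated assumptions: the SPIFFE-style identities, the service names, and the `authorize` function are hypothetical, and the peer identity is assumed to have been verified already by the mTLS layer (for example, the service mesh sidecar).

```python
from typing import Optional

# Hypothetical allowlist of (caller identity, target service) pairs.
ALLOWED_CALLS = {
    ("spiffe://multi-cloud-mesh/ns/web/sa/frontend", "orders-api"),
    ("spiffe://multi-cloud-mesh/ns/orders/sa/orders-api", "payments-api"),
}


def authorize(peer_identity: Optional[str], target_service: str, source_ip: str) -> bool:
    """Allow a call based on verified identity only.

    source_ip is accepted but deliberately ignored: being inside a VPC or on
    the VPN grants no access by itself.
    """
    if peer_identity is None:
        # No verified client certificate means no identity and no access.
        return False
    return (peer_identity, target_service) in ALLOWED_CALLS


frontend = "spiffe://multi-cloud-mesh/ns/web/sa/frontend"
allowed = authorize(frontend, "orders-api", source_ip="203.0.113.7")
denied_wrong_target = authorize(frontend, "payments-api", source_ip="10.0.0.5")
denied_anonymous = authorize(None, "orders-api", source_ip="10.0.0.5")
```

In a mesh such as Istio, the equivalent rule would normally be written as an authorization policy rather than application code; the function form only makes the decision logic explicit.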