# GCP Architecture Reference
Comprehensive guide for Google Cloud Platform services, patterns, and architecture framework.
## Google Cloud Architecture Framework
### Five Pillars
1. **Operational Excellence**
   - Infrastructure as Code (Deployment Manager, Terraform)
   - CI/CD with Cloud Build
   - Monitoring with Cloud Monitoring (Stackdriver)
   - SRE principles and SLOs
   - Incident management
2. **Security, Privacy, and Compliance**
   - Identity and Access Management (Cloud IAM)
   - VPC Service Controls for data perimeter
   - Binary Authorization for containers
   - Data encryption (default at rest and in transit)
   - Security Command Center
3. **Reliability**
   - Multi-zone and multi-region deployments
   - Load balancing and autoscaling
   - Disaster recovery planning
   - Chaos engineering practices
   - SLIs, SLOs, and error budgets
4. **Cost Optimization**
   - Committed Use Discounts
   - Sustained Use Discounts (automatic)
   - Preemptible VMs and Spot VMs
   - Recommender for right-sizing
   - Active Assist for optimization
5. **Performance Optimization**
   - Cloud CDN and Media CDN
   - Caching strategies (Memorystore)
   - Database performance tuning
   - Network optimization (Premium vs Standard tier)
   - Regional and zonal resource placement
## Core Services Architecture
### Compute
**Compute Engine**
- Machine types: E2 (cost-optimized), N2 (balanced), C2 (compute-optimized), M2 (memory-optimized)
- Custom machine types for specific needs
- Preemptible VMs (up to 80% discount, max 24 hours)
- Spot VMs (successor to preemptible VMs, same pricing model, no 24-hour maximum runtime)
- Instance groups: Managed (with autoscaling), unmanaged
- Best practices: Use latest generation, committed use discounts, Spot for batch jobs
**Cloud Run**
- Fully managed serverless container platform
- Auto-scaling to zero
- Pay per request
- CPU allocated only during request handling by default (always-allocated CPU is optional)
- Best practices: Stateless containers, optimize cold starts, use Cloud Run jobs for batch
**Cloud Functions**
- Event-driven serverless functions
- 1st gen: HTTP and background functions
- 2nd gen: Built on Cloud Run, better performance
- Event sources: Pub/Sub, Cloud Storage, Firestore, HTTP
- Best practices: Use 2nd gen, minimize cold starts, implement retry logic
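
The "implement retry logic" advice can be sketched for outbound calls that a function makes to a downstream service. This is a minimal, illustrative example of capped exponential backoff with full jitter; `call_with_backoff` and its parameters are hypothetical names, not part of any Cloud Functions API, and the built-in retry setting for event-driven functions is a separate mechanism.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 5,
    base_delay_s: float = 0.5,
    max_delay_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `operation`, retrying on any exception with capped exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise  # Out of attempts: surface the last error to the caller.
            # Full jitter: wait a random time up to the capped exponential delay.
            ceiling = min(max_delay_s, base_delay_s * 2 ** (attempt - 1))
            sleep(random.uniform(0, ceiling))
    raise RuntimeError("unreachable")  # Keeps type checkers satisfied.
```

In real code the `except` clause would usually be narrowed to errors known to be transient, and the retried operation should be idempotent.
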
**Google Kubernetes Engine (GKE)**
- Managed Kubernetes with GCP integration
- Autopilot mode: Fully managed, per-pod pricing
- Standard mode: More control, node management
- Workload Identity for secure service access
- Binary Authorization for deployment policies
- Best practices: Use Autopilot for simplicity, enable Workload Identity, implement network policies
**App Engine**
- Fully managed platform (PaaS)
- Standard environment (sandboxed, auto-scaling)
- Flexible environment (Docker containers, custom runtimes)
- Traffic splitting for canary deployments
- Best practices: Use Standard for web apps, Flexible for custom dependencies
### Storage
**Cloud Storage**
- Storage classes: Standard, Nearline (30-day), Coldline (90-day), Archive (365-day)
- Object lifecycle management
- Object versioning and retention policies
- Autoclass for automatic tier transitions
- Requester pays for data transfer
- Best practices: Use Autoclass, enable versioning, implement lifecycle policies
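
For buckets that do not use Autoclass, a lifecycle policy can be expressed as a JSON document of rules. The sketch below builds such a document in Python. The transition ages are illustrative assumptions (they reuse the minimum storage durations listed above, which is not a recommendation), and `build_lifecycle_config` is a hypothetical helper name.

```python
import json


def build_lifecycle_config(
    nearline_after_days: int = 30,
    coldline_after_days: int = 90,
    archive_after_days: int = 365,
) -> dict:
    """Build a bucket lifecycle configuration that moves objects to colder classes."""
    # Each entry: (target class, object age in days, classes allowed to transition).
    transitions = [
        ("NEARLINE", nearline_after_days, ["STANDARD"]),
        ("COLDLINE", coldline_after_days, ["STANDARD", "NEARLINE"]),
        ("ARCHIVE", archive_after_days, ["STANDARD", "NEARLINE", "COLDLINE"]),
    ]
    rules = [
        {
            "action": {"type": "SetStorageClass", "storageClass": target},
            "condition": {"age": age_days, "matchesStorageClass": sources},
        }
        for target, age_days, sources in transitions
    ]
    return {"rule": rules}


if __name__ == "__main__":
    # Print the JSON so it can be saved to a file and applied to a bucket.
    print(json.dumps(build_lifecycle_config(), indent=2))
```

The `matchesStorageClass` condition keeps each rule from moving an object back to a warmer class than the one it is already in.
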
**Persistent Disk**
- Types: Standard (HDD), Balanced SSD, SSD, Extreme
- Zonal and regional persistent disks
- Snapshots for backup (incremental)
- Disk resize without downtime
- Best practices: Use Balanced SSD for most workloads, enable snapshots
**Filestore**
- Managed NFS file storage
- Tiers: Basic (1-63.9 TB), Enterprise (1-10 TB, better performance)
- Backup to Cloud Storage
- Best practices: Use Enterprise for production, implement backups
**Cloud Storage for Firebase**
- Object storage for mobile and web apps
- Client SDKs for direct upload/download
- Security rules for access control
### Database
**Cloud SQL**
- Managed MySQL, PostgreSQL, SQL Server
- High availability configuration (regional)
- Read replicas for scaling
- Automated backups and point-in-time recovery
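- Best practices: Enable HA, use read replicas, connect through the Cloud SQL Auth Proxy, and pool connections in the application or with a pooler such as PgBouncer (the proxy itself does not pool connections)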
**Cloud Spanner**
- Globally distributed relational database
- Horizontal scalability with strong consistency
- Multi-region for 99.999% availability
- TrueTime for global consistency
- Best practices: Design proper schema splits, use commit timestamps, optimize hotspots
**Firestore (Native mode)**
- NoSQL document database
- Real-time synchronization
- Offline support for mobile
- ACID transactions
- Best practices: Design document structure carefully, use collection group queries wisely
**Bigtable**
- NoSQL wide-column database
- Petabyte-scale with single-digit millisecond latency
- HBase API compatible
- Linear scalability by adding nodes
- Best practices: Design row keys to avoid hotspots, use replication for HA
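
One way to read "design row keys to avoid hotspots" is sketched below for a hypothetical time-series table: the key leads with a high-cardinality identifier so writes spread across the key space, and a reversed timestamp makes the newest rows for a device sort first. The `device_id#reversed_timestamp` layout, `MAX_TS_MS`, and `build_row_key` are illustrative assumptions, not a prescribed schema.

```python
# Assumed exclusive upper bound for millisecond timestamps (roughly year 2286).
MAX_TS_MS = 10**13


def build_row_key(device_id: str, event_ts_ms: int) -> str:
    """Return a row key of the form '<device_id>#<reversed, zero-padded timestamp>'."""
    if not 0 <= event_ts_ms < MAX_TS_MS:
        raise ValueError("timestamp out of supported range")
    # Leading with the device ID avoids funnelling all current writes to one
    # key range, which is what a timestamp-first key would do.
    reversed_ts = MAX_TS_MS - 1 - event_ts_ms
    # Zero-padding keeps lexicographic order consistent with numeric order.
    return f"{device_id}#{reversed_ts:013d}"
```

With this layout, a prefix scan on `device_id#` returns the most recent events first.
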
**Memorystore**
- Managed Redis and Memcached
- Standard tier (HA with replica) and Basic tier
- Best practices: Use Standard tier for production, implement connection pooling
**BigQuery**
- Serverless data warehouse
- SQL analytics on petabyte-scale data
- Column-oriented storage
- Automatic caching and optimization
- Best practices: Partition and cluster tables, use approximate functions, control costs with quotas
### Networking
**VPC (Virtual Private Cloud)**
- Global resource (subnets are regional)
- Custom or auto mode networks
- Firewall rules (stateful)
- VPC peering and Shared VPC
- Private Google Access for GCP services
- Best practices: Use custom mode VPC, plan IP ranges, implement firewall rules
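
Planning IP ranges mostly means carving non-overlapping subnets out of a parent block and checking them against ranges used elsewhere (on-premises, peered VPCs). A minimal sketch using only Python's standard `ipaddress` module follows; the CIDR blocks and subnet names are hypothetical.

```python
import ipaddress
from itertools import combinations


def plan_subnets(parent_cidr: str, names: list[str], new_prefix: int) -> dict:
    """Assign consecutive, equally sized subnets of `parent_cidr` to `names`."""
    parent = ipaddress.ip_network(parent_cidr)
    candidates = parent.subnets(new_prefix=new_prefix)
    plan = {}
    for name in names:
        try:
            plan[name] = next(candidates)
        except StopIteration:
            raise ValueError(f"{parent_cidr} has no /{new_prefix} left for {name}") from None
    return plan


def find_overlaps(ranges: dict[str, str]) -> list[tuple[str, str]]:
    """Return pairs of range names whose CIDR blocks overlap."""
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in ranges.items()}
    return [(a, b) for a, b in combinations(nets, 2) if nets[a].overlaps(nets[b])]
```

For example, `plan_subnets("10.0.0.0/16", ["prod", "staging", "dev"], 20)` assigns `10.0.0.0/20`, `10.0.16.0/20`, and `10.0.32.0/20`, and `find_overlaps` can then be run against on-premises ranges before peering or VPN setup.
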
**Cloud Load Balancing**
- Global load balancing (external HTTP(S), SSL Proxy, TCP Proxy)
- Regional load balancing (external TCP/UDP passthrough, internal HTTP(S), internal TCP/UDP)
- Anycast IP for global distribution
- Backend services with health checks
- Best practices: Use global for multi-region, enable CDN, configure health checks
**Cloud CDN**
- Global content delivery network
- Cache invalidation and signed URLs
- Integration with Cloud Storage and compute
- Best practices: Enable compression, use cache-control headers
**Cloud Interconnect and VPN**
- Dedicated Interconnect (10 Gbps or 100 Gbps)
- Partner Interconnect (50 Mbps to 50 Gbps)
- Cloud VPN (HA VPN for 99.99% SLA)
- Best practices: Use HA VPN for redundancy, Dedicated Interconnect for high bandwidth
**Cloud Armor**
- DDoS protection and WAF
- Preconfigured and custom rules
- Adaptive protection (ML-based)
- Best practices: Enable for internet-facing services, use preconfigured rules
**Private Service Connect**
- Private connectivity to Google APIs and services
- Service Directory for service discovery
- Best practices: Reach managed services in production through private endpoints instead of public IPs
### Serverless and Event-Driven
**Pub/Sub**
- Global message queue
- At-least-once delivery
- Push and pull subscriptions
- Message ordering and filtering
- Dead-letter topics
- Best practices: Use message attributes for filtering, implement idempotent processing
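
Because delivery is at-least-once, a subscriber may see the same message more than once, which is why idempotent processing is recommended. The sketch below deduplicates on message ID. The in-memory set is a stand-in assumption for a durable store (for example a database keyed by message ID), and `IdempotentHandler` is a hypothetical class, not part of the Pub/Sub client library.

```python
from typing import Callable


class IdempotentHandler:
    """Wrap a processing function so that redelivered messages are skipped."""

    def __init__(self, process: Callable[[bytes], None]) -> None:
        self._process = process
        self._seen_ids: set[str] = set()  # Stand-in for a durable dedup store.

    def handle(self, message_id: str, data: bytes) -> bool:
        """Return True if the message was processed, False if it was a duplicate."""
        if message_id in self._seen_ids:
            return False
        self._process(data)
        # Record the ID only after success so a failed attempt is retried.
        self._seen_ids.add(message_id)
        return True
```

A production version would also need the "check, process, record" sequence to be atomic or the processing itself to be safe to repeat, since two deliveries can race.
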
**Eventarc**
- Event-driven architecture
- Triggers for Cloud Run, Workflows, GKE
- Sources: Audit Logs, Pub/Sub, custom events
- Best practices: Use for decoupled architectures, implement event filtering
**Cloud Scheduler**
- Fully managed cron service
- HTTP, Pub/Sub, and App Engine targets
- Best practices: Use for periodic tasks, implement retry logic
**Workflows**
- Orchestrate and automate GCP and HTTP services
- YAML-based workflow definition
- Built-in error handling and retry
- Best practices: Use for complex multi-step processes, implement compensating transactions
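
Compensating transactions are easiest to see in a small saga: each step carries an undo action, and when a step fails the completed steps are undone in reverse order. The following is a language-level sketch of that idea in plain Python, not Workflows YAML; `run_saga` and the step names in the usage note are hypothetical.

```python
from typing import Callable

# A step is (name, action, compensation).
Step = tuple[str, Callable[[], None], Callable[[], None]]


def run_saga(steps: list[Step]) -> list[str]:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    log: list[str] = []
    completed: list[Step] = []
    for step in steps:
        name, action, _ = step
        try:
            action()
        except Exception:
            log.append(f"failed:{name}")
            # Undo the most recent successful step first.
            for done_name, _, compensate in reversed(completed):
                compensate()
                log.append(f"compensated:{done_name}")
            return log
        completed.append(step)
        log.append(f"done:{name}")
    return log
```

For an order flow with steps `reserve`, `charge`, and `ship`, a failure in `ship` would refund the charge and then release the reservation. Compensations should themselves be idempotent, because they may be retried.
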
### Security and Identity
**Cloud IAM**
- Resource hierarchy: Organization -> Folders -> Projects -> Resources
- Roles: Basic (Owner, Editor, Viewer; formerly called primitive), Predefined, Custom
- Service accounts for applications
- Workload Identity for GKE
- Best practices: Use predefined roles, least privilege, service accounts for apps
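
A simple least-privilege check is to scan an IAM policy for the broad Owner, Editor, and Viewer roles. The sketch below works on a policy in the JSON shape of `bindings` with `role` and `members`; the sample members are hypothetical and `find_basic_role_bindings` is an illustrative helper, not a Google Cloud API.

```python
BASIC_ROLES = {"roles/owner", "roles/editor", "roles/viewer"}


def find_basic_role_bindings(policy: dict) -> list[tuple[str, str]]:
    """Return (role, member) pairs that grant Owner, Editor, or Viewer."""
    findings = []
    for binding in policy.get("bindings", []):
        role = binding.get("role")
        if role in BASIC_ROLES:
            for member in binding.get("members", []):
                findings.append((role, member))
    return findings


# Hypothetical policy, as it might be exported from a project.
sample_policy = {
    "bindings": [
        {"role": "roles/editor", "members": ["serviceAccount:app@example.iam.gserviceaccount.com"]},
        {"role": "roles/storage.objectViewer", "members": ["group:analysts@example.com"]},
    ]
}
```

Here `find_basic_role_bindings(sample_policy)` flags only the service account holding `roles/editor`, which is a candidate for replacement with narrower predefined roles.
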
**Cloud Key Management (KMS)**
- Encryption key management
- Customer-managed encryption keys (CMEK)
- Hardware Security Module (HSM) backed
- Automatic key rotation
- Best practices: Enable automatic rotation, use separate keys per environment
**Secret Manager**
- Store API keys, passwords, certificates
- Versioning and access control
- Automatic rotation integration
- Best practices: Rotate secrets regularly, use IAM for access control
**Security Command Center**
- Centralized security and risk management
- Asset discovery and vulnerability scanning
- Threat detection and compliance monitoring
- Best practices: Enable all detectors, review findings regularly
**VPC Service Controls**
- Create security perimeters around GCP resources
- Prevent data exfiltration
- Best practices: Use for sensitive data, implement access levels
### AI and Machine Learning
**Vertex AI**
- Unified ML platform
- AutoML for custom models
- Pre-trained models (Vision, Natural Language, etc.)
- MLOps with pipelines
- Best practices: Use AutoML for quick start, implement feature store
**BigQuery ML**
- Create and execute ML models using SQL
- Model types: Linear regression, logistic regression, clustering, etc.
- Integration with Vertex AI
- Best practices: Use for simple models, leverage BigQuery's scale
## Architecture Patterns
### High Availability
**Multi-Zone Pattern**
```
Global HTTP(S) Load Balancer
|
v
Managed Instance Group (multi-zone)
|
v
Cloud SQL (regional, HA configuration)
|
v
Cloud Storage (multi-region)
```
**Multi-Region Pattern**
```
Global HTTP(S) Load Balancer
|
├── Backend Service Region 1 (Cloud Run)
└── Backend Service Region 2 (Cloud Run)
|
v
Cloud Spanner (multi-region)
```
### Serverless Architecture
**Event-Driven Pattern**
```
Cloud Storage upload event
|
v
Pub/Sub topic
|
v
Cloud Functions (image processing)
|
v
Firestore (metadata storage)
```
**API-First Pattern**
```
Cloud Endpoints or API Gateway
|
v
Cloud Run (multiple services)
|
├── Cloud SQL (transactional data)
└── Firestore (user data)
```
### Microservices on GKE
**GKE with Service Mesh**
```
Global Load Balancer
|
v
GKE Ingress
|
v
Cloud Service Mesh (formerly Anthos Service Mesh, Istio-based)
|
v
Microservices (Cloud Spanner, Firestore, Memorystore)
```
### Data Analytics Platform
```
Data Sources
|
v
Pub/Sub (streaming)
|
v
Dataflow (Apache Beam)
|
v
BigQuery (data warehouse)
|
v
Looker or Looker Studio (formerly Data Studio)
```
**Batch Processing**
```
Cloud Storage (raw data)
|
v
Dataproc (Apache Spark)
|
v
BigQuery (analytics)
```
## Landing Zone Design
### Resource Hierarchy
```
Organization
├── Folders (by environment or team)
│   ├── Production Folder
│   │   ├── Project A
│   │   └── Project B
│   ├── Staging Folder
│   └── Development Folder
└── Shared Services Folder
    ├── Networking Project (Shared VPC host)
    ├── Security Project (KMS, Secret Manager)
    └── Logging Project (centralized logs)
```
### Network Design
**Shared VPC Pattern**
```
Host Project (networking team)
└── Shared VPC
    ├── Subnet Production (region A)
    ├── Subnet Staging (region A)
    └── Subnet Development (region B)
Service Projects (application teams)
├── Production Project (uses Production subnet)
├── Staging Project (uses Staging subnet)
└── Development Project (uses Development subnet)
```
**Hub-and-Spoke with VPN**
```
On-premises Network
|
v
Cloud VPN / Interconnect
|
v
Hub VPC (shared services)
|
├── Spoke VPC 1 (production workloads)
├── Spoke VPC 2 (development workloads)
└── Spoke VPC 3 (analytics workloads)
```
### Governance
**Organization Policies**
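- Restrict public IPs on Cloud SQL instances (`constraints/sql.restrictPublicIp`)
- Enforce uniform bucket-level access (`constraints/storage.uniformBucketLevelAccess`)
- Restrict VM external IPs (`constraints/compute.vmExternalIpAccess`)
- Define allowed resource locations (`constraints/gcp.resourceLocations`)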
**IAM Strategy**
- Use Google Groups for role assignments
- Separate duties (network admin, security admin, etc.)
- Service accounts per application
- Workload Identity for GKE workloads
**Logging and Monitoring**
```
All Projects
|
v
Log Router
|
├── Cloud Logging (default sink)
├── BigQuery (long-term analysis)
├── Cloud Storage (archive)
└── Pub/Sub (real-time processing)
```
## Migration Strategies
### Migrate to Virtual Machines
**Tools**
- Migrate to Virtual Machines (formerly Migrate for Compute Engine)
- Supports VMware, AWS, Azure, physical servers
- Agentless or agent-based migration
- Waves and test clones
**Process**
1. Assess: Fit assessment and TCO analysis
2. Plan: Group VMs, define migration waves
3. Deploy: Set up infrastructure (VPC, firewall rules)
4. Migrate: Test migration, cutover, validation
5. Optimize: Right-sizing, committed use discounts
### Database Migration
**Database Migration Service**
- Minimal downtime migrations
- Supports MySQL, PostgreSQL, SQL Server, Oracle
- Continuous replication for cutover flexibility
**Transfer Appliance**
- Physical device for large data transfers
- Up to 1 PB capacity
- Offline data transfer
## Cost Optimization
### Compute Savings
**Committed Use Discounts**
- 1-year or 3-year commitments
- Up to 57% savings for VMs
- Resource-based or spend-based
**Sustained Use Discounts**
- Automatic discounts for running VMs >25% of month
- Up to 30% savings
- No commitment required
**Preemptible and Spot VMs**
- Up to 80% discount
- Can be terminated by GCP
- Best for batch processing, fault-tolerant workloads
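
The discount ceilings quoted above can be turned into a rough lower bound on monthly cost. This sketch applies the "up to" percentages from this section (30%, 57%, 80%) to a hypothetical on-demand hourly price; real discounts vary by machine family, region, and usage, so treat the output as a best case, not a quote.

```python
HOURS_PER_MONTH = 730  # Common approximation for a month of continuous use.

# Maximum discounts as stated in this section; actual rates are often lower.
MAX_DISCOUNTS = {
    "on_demand": 0.0,
    "sustained_use": 0.30,
    "committed_use": 0.57,
    "spot": 0.80,
}


def monthly_cost_floor(on_demand_hourly_usd: float, hours: float = HOURS_PER_MONTH) -> dict[str, float]:
    """Return the best-case monthly cost per pricing option, rounded to cents."""
    base = on_demand_hourly_usd * hours
    return {
        option: round(base * (1 - discount), 2)
        for option, discount in MAX_DISCOUNTS.items()
    }
```

Note that sustained use and committed use discounts do not stack on the same usage, and Spot capacity can be reclaimed at any time, so the options are alternatives for different workload types.
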
**Recommender**
- VM rightsizing recommendations
- Idle resource identification
- Committed use discount recommendations
### Storage Savings
**Cloud Storage**
- Autoclass for automatic tier transitions
- Lifecycle policies (delete or transition)
- Nearline (30+ days), Coldline (90+ days), Archive (365+ days)
- Requester pays for data transfer
**Persistent Disk**
- Delete orphaned disks
- Use balanced SSD instead of SSD when possible
- Resize disks to match actual usage
### BigQuery Savings
**On-Demand Pricing**
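- Billed per TiB scanned ($6.25 per TiB in the US since the 2023 price change, previously $5; check the pricing page for the current rate)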
- Use partitioning and clustering
- Query cache for free repeated queries
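**Capacity-Based Pricing (BigQuery editions)**
- Predictable costs for heavy users; replaces the retired flat-rate plans
- Slot autoscaling between a baseline and a maximum
- Optional 1-year or 3-year slot commitments (Flex slots are no longer offered)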
**Best Practices**
- Use approximate aggregation functions (APPROX_COUNT_DISTINCT)
- Avoid SELECT *, specify columns
- Use materialized views for common queries
- Set up cost controls with custom quotas
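
Because storage is column-oriented and on-demand queries are billed by bytes scanned, selecting fewer columns directly lowers cost. The sketch below estimates that effect. The column sizes and the price are placeholders (`PLACEHOLDER_USD_PER_TIB` is deliberately not a real rate), and both function names are hypothetical.

```python
from typing import Optional

TIB = 1024**4
PLACEHOLDER_USD_PER_TIB = 10.0  # Not a real price; substitute the current rate.


def bytes_scanned(column_sizes: dict[str, int], selected: Optional[list[str]] = None) -> int:
    """Sum the sizes of the referenced columns; None models SELECT *."""
    columns = list(column_sizes) if selected is None else selected
    return sum(column_sizes[name] for name in columns)


def estimate_on_demand_cost(scanned_bytes: int, usd_per_tib: float) -> float:
    """Estimate query cost from bytes scanned and a per-TiB price."""
    return scanned_bytes / TIB * usd_per_tib


# Hypothetical table: two narrow columns and one wide payload column.
table = {"user_id": TIB // 4, "event": TIB // 4, "payload": 2 * TIB}
full_scan = estimate_on_demand_cost(bytes_scanned(table), PLACEHOLDER_USD_PER_TIB)
narrow_scan = estimate_on_demand_cost(
    bytes_scanned(table, ["user_id", "event"]), PLACEHOLDER_USD_PER_TIB
)
```

In this made-up table the narrow query scans one fifth of the bytes that `SELECT *` does. Partition pruning and clustering reduce scanned bytes further, which this sketch does not model.
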
### Monitoring Costs
**Cloud Billing**
- Budgets and alerts
- Cost breakdown by project, service, SKU
- Export to BigQuery for analysis
- Recommendations from Active Assist
## Disaster Recovery
### Backup Strategies
**VM Backups**
- Persistent disk snapshots (incremental)
- Machine images (include metadata and config)
- Cross-region snapshot copy
- Snapshot schedules for automation
**Database Backups**
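- Cloud SQL: Automated daily backups (7 retained by default, configurable up to 365) plus point-in-time recovery
- Cloud Spanner: Backups on demand or scheduled
- Firestore: Scheduled backups (daily or weekly) or managed exports to Cloud Storage
- Bigtable: Managed table backups (stored on the cluster, not in Cloud Storage)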
### High Availability
**RTO/RPO Matrix**
| Pattern | RPO | RTO | Cost |
|---------|-----|-----|------|
| Active-Active Multi-Region | Seconds | Seconds | High |
| Active-Passive with Replication | Minutes | Minutes | Medium |
| Warm Standby | Minutes | 10-30 min | Medium |
| Backup and Restore | Hours | Hours | Low |
**Cloud SQL HA**
- Regional configuration with synchronous replication
- Automatic failover
- 99.95% availability SLA for HA instances (single-zone instances are not covered by the SLA)
**Cloud Spanner**
- Multi-region configuration
- 99.999% availability SLA
- Synchronous replication across regions
### Disaster Recovery Testing
- Regular DR drills (quarterly recommended)
- Document runbooks
- Test restoration procedures
- Measure actual RTO/RPO vs targets
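
Measuring actual RTO and RPO against targets comes down to three timestamps recorded during a drill. A minimal sketch follows, assuming RPO is the gap between the last recoverable data point and the start of the outage, and RTO is the gap between the start of the outage and service restoration; `DrillResult` and `meets_targets` are hypothetical names.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class DrillResult:
    last_recoverable_point: datetime  # Newest data present after restore.
    outage_start: datetime            # When the simulated failure began.
    service_restored: datetime        # When the service was usable again.

    @property
    def actual_rpo(self) -> timedelta:
        """Window of data lost in the drill."""
        return self.outage_start - self.last_recoverable_point

    @property
    def actual_rto(self) -> timedelta:
        """Time taken to restore service."""
        return self.service_restored - self.outage_start


def meets_targets(result: DrillResult, target_rpo: timedelta, target_rto: timedelta) -> dict[str, bool]:
    """Report whether each measured value is within its target."""
    return {
        "rpo": result.actual_rpo <= target_rpo,
        "rto": result.actual_rto <= target_rto,
    }
```

Recording these values for every drill makes it possible to see whether a pattern from the matrix above actually delivers its nominal RPO and RTO in practice.
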
## Monitoring and Observability
### Cloud Monitoring (formerly Stackdriver)
**Metrics**
- System metrics (CPU, memory, disk, network)
- Custom metrics via Cloud Monitoring API
- Metric scopes for multi-project monitoring
- Uptime checks for availability
**Dashboards and Charts**
- Predefined dashboards for GCP services
- Custom dashboards with filters and grouping
- SLO monitoring with error budgets
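
An error budget is the amount of unreliability an SLO permits over a window. The sketch below shows the arithmetic for a time-based budget and for a request-based budget; the function names and example figures are illustrative assumptions and do not call the Cloud Monitoring API.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over the window."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return (1 - slo_target) * window_days * 24 * 60


def budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the request-based error budget still unspent (negative if overspent)."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    allowed_failures = (1 - slo_target) * total_requests
    return 1 - failed_requests / allowed_failures
```

A 99.9% SLO over 30 days allows about 43.2 minutes of downtime, and with one million requests and 250 failures roughly three quarters of the budget remains.
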
### Cloud Logging
**Log Types**
- Admin Activity logs (always enabled, no charge)
- Data Access logs (must be enabled)
- System Event logs
- Access Transparency logs (for Google access)
**Log Sinks**
- Route logs to BigQuery, Cloud Storage, Pub/Sub
- Aggregated sinks at organization/folder level
- Exclusion filters to reduce costs
### Cloud Trace
**Distributed Tracing**
- Automatic instrumentation for App Engine, Cloud Run, GKE
- Manual instrumentation with client libraries
- Latency analysis and performance insights
- Integration with Zipkin
### Cloud Profiler
**Continuous Profiling**
- CPU and memory profiling
- Low overhead (< 0.5% CPU)
- Flame graphs for visualization
- Supported languages: Java, Go, Python, Node.js
### Error Reporting
**Aggregated Error Tracking**
- Automatic error grouping
- Stack trace analysis
- Integration with Cloud Logging
- Notifications for new errors