bookworm-smart-assistant/skills/cloud-architect/references/gcp.md

# GCP Architecture Reference

Comprehensive guide for Google Cloud Platform services, patterns, and architecture framework.

## Google Cloud Architecture Framework

### Five Pillars

1. **Operational Excellence**
   - Infrastructure as Code (Deployment Manager, Terraform)
   - CI/CD with Cloud Build
   - Monitoring with Cloud Monitoring (formerly Stackdriver)
   - SRE principles and SLOs
   - Incident management

2. **Security, Privacy, and Compliance**
   - Identity and Access Management (Cloud IAM)
   - VPC Service Controls for data perimeter
   - Binary Authorization for containers
   - Data encryption (default at rest and in transit)
   - Security Command Center

3. **Reliability**
   - Multi-zone and multi-region deployments
   - Load balancing and autoscaling
   - Disaster recovery planning
   - Chaos engineering practices
   - SLIs, SLOs, and error budgets

4. **Cost Optimization**
   - Committed Use Discounts
   - Sustained Use Discounts (automatic)
   - Preemptible VMs and Spot VMs
   - Recommender for right-sizing
   - Active Assist for optimization

5. **Performance Optimization**
   - Cloud CDN and Media CDN
   - Caching strategies (Memorystore)
   - Database performance tuning
   - Network optimization (Premium vs Standard tier)
   - Regional and zonal resource placement
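
The SLO and error-budget items under Reliability come down to simple arithmetic. The following is a minimal sketch, assuming an availability SLO measured over a fixed window; the function name and the 30-day window are illustrative choices, not part of the framework.

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Return the downtime (in minutes) that an availability SLO allows."""
    if not 0 < slo_percent <= 100:
        raise ValueError("slo_percent must be in (0, 100]")
    total_minutes = window_days * 24 * 60
    return total_minutes * (100 - slo_percent) / 100


# Illustrative targets only; real SLOs should come from user expectations.
for slo in (99.5, 99.9, 99.99):
    print(f"{slo}% over 30 days -> {error_budget_minutes(slo):.2f} min of budget")
```

Once the budget for a window is spent, a team would typically slow feature releases and prioritize reliability work until the window rolls over.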

## Core Services Architecture

### Compute

**Compute Engine**
- Machine types: E2 (cost-optimized), N2 (balanced), C2 (compute-optimized), M2 (memory-optimized)
- Custom machine types for specific needs
- Preemptible VMs (legacy offering; 60-91% discount, 24-hour maximum runtime)
- Spot VMs (successor to preemptible VMs: same pricing model, no maximum runtime, can still be reclaimed at any time)
- Instance groups: Managed (with autoscaling), unmanaged
- Best practices: Use latest generation, committed use discounts, Spot for batch jobs

**Cloud Run**
- Fully managed serverless container platform
- Auto-scaling to zero
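- Request-based billing by default: pay for CPU and memory only while requests are being handled
- CPU allocated only during request handling by default; instance-based billing keeps CPU allocated for the instance lifetime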
- Best practices: Stateless containers, optimize cold starts, use Cloud Run jobs for batch
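
A stateless container for Cloud Run can be very small. The sketch below uses only the Python standard library and assumes the service reads its listening port from the `PORT` environment variable, falling back to 8080 for local runs; `resolve_port`, `Handler`, and `serve` are illustrative names.

```python
import os
from http.server import BaseHTTPRequestHandler, HTTPServer


def resolve_port(environ=os.environ) -> int:
    """Read the listening port from PORT, defaulting to 8080 for local runs."""
    return int(environ.get("PORT", "8080"))


class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        # No per-instance state: the response depends only on the request.
        body = b"ok\n"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int) -> HTTPServer:
    """Create the server; the caller decides when to start serving."""
    return HTTPServer(("0.0.0.0", port), Handler)


# Container entrypoint (blocks forever):
#   serve(resolve_port()).serve_forever()
```

Keeping imports light and doing no work at module load is one way to shorten cold starts; anything that must persist between requests belongs in an external store rather than in process memory.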

**Cloud Functions**
- Event-driven serverless functions
- 1st gen: HTTP and background functions
- 2nd gen: Built on Cloud Run and Eventarc; longer request timeouts, larger instances, and concurrent requests per instance
- Event sources: Pub/Sub, Cloud Storage, Firestore, HTTP
- Best practices: Use 2nd gen, minimize cold starts, implement retry logic

**Google Kubernetes Engine (GKE)**
- Managed Kubernetes with GCP integration
- Autopilot mode: Fully managed, per-pod pricing
- Standard mode: More control, node management
- Workload Identity for secure service access
- Binary Authorization for deployment policies
- Best practices: Use Autopilot for simplicity, enable Workload Identity, implement network policies

**App Engine**
- Fully managed platform (PaaS)
- Standard environment (sandboxed, auto-scaling)
- Flexible environment (Docker containers, custom runtimes)
- Traffic splitting for canary deployments
- Best practices: Use Standard for web apps, Flexible for custom dependencies

### Storage

**Cloud Storage**
- Storage classes: Standard, Nearline (30-day minimum storage duration), Coldline (90-day), Archive (365-day)
- Object lifecycle management
- Object versioning and retention policies
- Autoclass for automatic tier transitions
- Requester Pays option to bill access and data transfer charges to the requester
- Best practices: Use Autoclass, enable versioning, implement lifecycle policies
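
For buckets that manage tiers with lifecycle rules rather than Autoclass, a lifecycle configuration is a JSON document of rules. The sketch below builds one in Python. Reusing the 30/90/365-day figures as transition ages is an illustrative assumption (they are minimum storage durations, not required thresholds), and `lifecycle_config` is a hypothetical helper name.

```python
import json


def lifecycle_config(nearline_after=30, coldline_after=90,
                     archive_after=365, delete_after=None) -> dict:
    """Build lifecycle rules that step objects down by age (in days)."""
    steps = [("NEARLINE", nearline_after),
             ("COLDLINE", coldline_after),
             ("ARCHIVE", archive_after)]
    rules = [
        {"action": {"type": "SetStorageClass", "storageClass": storage_class},
         "condition": {"age": age_days}}
        for storage_class, age_days in steps
    ]
    if delete_after is not None:
        # Optional final rule: delete objects older than delete_after days.
        rules.append({"action": {"type": "Delete"},
                      "condition": {"age": delete_after}})
    return {"rule": rules}


# 2555 days (about seven years) is an arbitrary example retention period.
print(json.dumps(lifecycle_config(delete_after=2555), indent=2))
```

The printed JSON can be saved to a file and applied to a bucket with the storage tooling of your choice.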

**Persistent Disk**
- Types: Standard (HDD), Balanced SSD, SSD, Extreme
- Zonal and regional persistent disks
- Snapshots for backup (incremental)
- Online disk size increase without downtime (disks cannot be shrunk)
- Best practices: Use Balanced SSD for most workloads, enable snapshots

**Filestore**
- Managed NFS file storage
- Tiers: Basic (1-63.9 TB), Enterprise (1-10 TB, better performance)
- Backup to Cloud Storage
- Best practices: Use Enterprise for production, implement backups

**Cloud Storage for Firebase**
- Object storage for mobile and web apps
- Client SDKs for direct upload/download
- Security rules for access control

### Database

**Cloud SQL**
- Managed MySQL, PostgreSQL, SQL Server
- High availability configuration (regional)
- Read replicas for scaling
- Automated backups and point-in-time recovery
- Best practices: Enable HA, use read replicas, connect through the Cloud SQL Auth Proxy, pool connections in the application or with a pooler such as PgBouncer

**Cloud Spanner**
- Globally distributed relational database
- Horizontal scalability with strong consistency
- Multi-region for 99.999% availability
- TrueTime for global consistency
- Best practices: Choose primary keys that avoid hotspots (no monotonically increasing keys), interleave related tables, use commit timestamps

**Firestore (Native mode)**
- NoSQL document database
- Real-time synchronization
- Offline support for mobile
- ACID transactions
- Best practices: Design document structure carefully, use collection group queries wisely

**Bigtable**
- NoSQL wide-column database
- Petabyte-scale with single-digit millisecond latency
- HBase API compatible
- Linear scalability by adding nodes
- Best practices: Design row keys to avoid hotspots, use replication for HA
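
One common way to design row keys that avoid hotspots is to lead with a high-cardinality identifier and append a reversed timestamp. The sketch below is plain Python with a hypothetical key layout (`device_id#reversed_millis`); it does not call the Bigtable client library.

```python
MAX_MILLIS = 10**13 - 1  # hypothetical ceiling used to reverse timestamps


def row_key(device_id: str, event_millis: int) -> bytes:
    """Build a key that spreads writes by device and sorts newest events first."""
    if not 0 <= event_millis <= MAX_MILLIS:
        raise ValueError("event_millis out of range")
    reversed_ts = MAX_MILLIS - event_millis
    # Zero-padding keeps lexicographic order equal to numeric order.
    return f"{device_id}#{reversed_ts:013d}".encode()


older = row_key("sensor-42", 1_700_000_000_000)
newer = row_key("sensor-42", 1_700_000_060_000)
print(sorted([older, newer])[0] == newer)  # newest event sorts first
```

A key that starts with the timestamp would instead send all current writes to the same node, which is the hotspot pattern the best practice warns about.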

**Memorystore**
- Managed Redis and Memcached
- Standard tier (HA with replica) and Basic tier
- Best practices: Use Standard tier for production, implement connection pooling

**BigQuery**
- Serverless data warehouse
- SQL analytics on petabyte-scale data
- Column-oriented storage
- Automatic caching and optimization
- Best practices: Partition and cluster tables, use approximate functions, control costs with quotas

### Networking

**VPC (Virtual Private Cloud)**
- Global resource (subnets are regional)
- Custom or auto mode networks
- Firewall rules (stateful)
- VPC peering and Shared VPC
- Private Google Access for GCP services
- Best practices: Use custom mode VPC, plan IP ranges, implement firewall rules
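
Planning IP ranges for a custom mode VPC can be checked before anything is deployed. The sketch below uses Python's standard `ipaddress` module; the CIDR blocks and subnet names are made-up examples.

```python
import ipaddress


def plan_subnets(vpc_cidr: str, names: list[str],
                 new_prefix: int) -> dict[str, ipaddress.IPv4Network]:
    """Assign each named subnet the next free block of size /new_prefix."""
    blocks = ipaddress.ip_network(vpc_cidr).subnets(new_prefix=new_prefix)
    return {name: next(blocks) for name in names}


def overlaps(a: str, b: str) -> bool:
    """True when two CIDR ranges share addresses (for example VPC vs on-premises)."""
    return ipaddress.ip_network(a).overlaps(ipaddress.ip_network(b))


plan = plan_subnets("10.10.0.0/16", ["prod-a", "staging-a", "dev-b"], 20)
for name, cidr in plan.items():
    print(name, cidr)
```

Running the same overlap check against on-premises and peered ranges helps catch conflicts that would otherwise surface only when VPN, Interconnect, or VPC peering is configured.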

**Cloud Load Balancing**
- Global load balancing (external HTTP(S), TCP Proxy, SSL Proxy)
- Regional load balancing (internal HTTP(S), internal TCP/UDP, external TCP/UDP network load balancing)
- Anycast IP for global distribution
- Backend services with health checks
- Best practices: Use global for multi-region, enable CDN, configure health checks

**Cloud CDN**
- Global content delivery network
- Cache invalidation and signed URLs
- Integration with Cloud Storage and compute
- Best practices: Enable compression, use cache-control headers

**Cloud Interconnect and VPN**
- Dedicated Interconnect (10 Gbps or 100 Gbps)
- Partner Interconnect (50 Mbps to 50 Gbps)
- Cloud VPN (HA VPN for 99.99% SLA)
- Best practices: Use HA VPN for redundancy, Dedicated Interconnect for high bandwidth

**Cloud Armor**
- DDoS protection and WAF
- Preconfigured and custom rules
- Adaptive protection (ML-based)
- Best practices: Enable for internet-facing services, use preconfigured rules

**Private Service Connect**
- Private connectivity to Google APIs and services
- Service Directory for service discovery
- Best practices: Use for all managed services in production

### Serverless and Event-Driven

**Pub/Sub**
- Global message queue
- At-least-once delivery
- Push and pull subscriptions
- Message ordering and filtering
- Dead-letter topics
- Best practices: Use message attributes for filtering, implement idempotent processing
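
Because delivery is at-least-once, a subscriber may see the same message more than once. The sketch below shows one way to make processing idempotent. It is plain Python with an in-memory set standing in for a durable store, and the `IdempotentHandler` class is hypothetical rather than part of any Pub/Sub client library.

```python
class IdempotentHandler:
    """Apply each message's effect once, even if it is delivered again."""

    def __init__(self, apply):
        self._apply = apply
        self._seen = set()  # stand-in for a durable store such as a database table

    def handle(self, message_id: str, payload: dict) -> bool:
        """Return True if the message was processed, False if it was a duplicate."""
        if message_id in self._seen:
            return False  # redelivery: acknowledge without repeating side effects
        self._apply(payload)
        self._seen.add(message_id)  # record only after the effect succeeded
        return True


totals = []
handler = IdempotentHandler(lambda payload: totals.append(payload["amount"]))
handler.handle("order-1", {"amount": 10})
handler.handle("order-1", {"amount": 10})  # redelivered duplicate, ignored
handler.handle("order-2", {"amount": 5})
print(sum(totals))  # 15
```

In practice a business key carried in the payload or in message attributes is often a safer deduplication key than a transport-level ID, since a publisher retry can produce a new ID for the same logical event.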

**Eventarc**
- Event-driven architecture
- Triggers for Cloud Run, Workflows, GKE
- Sources: Audit Logs, Pub/Sub, custom events
- Best practices: Use for decoupled architectures, implement event filtering

**Cloud Scheduler**
- Fully managed cron service
- HTTP, Pub/Sub, and App Engine targets
- Best practices: Use for periodic tasks, implement retry logic

**Workflows**
- Orchestrate and automate GCP and HTTP services
- YAML-based workflow definition
- Built-in error handling and retry
- Best practices: Use for complex multi-step processes, implement compensating transactions
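
Compensating transactions follow the saga idea: each step has an undo action, and a failure triggers the undo actions of the steps that already completed, in reverse order. The sketch below illustrates the control flow in plain Python; it is not Workflows YAML syntax, and the step names are invented.

```python
def run_saga(steps):
    """Run (action, compensation) pairs in order; undo completed steps on failure."""
    completed = []
    try:
        for action, compensation in steps:
            action()
            completed.append(compensation)
    except Exception:
        # Undo in reverse order, then surface the original error.
        for compensation in reversed(completed):
            compensation()
        raise


log = []


def fail_shipping():
    raise RuntimeError("shipping unavailable")


steps = [
    (lambda: log.append("reserve"), lambda: log.append("release")),
    (lambda: log.append("charge"), lambda: log.append("refund")),
    (fail_shipping, lambda: log.append("cancel-shipment")),
]

try:
    run_saga(steps)
except RuntimeError:
    pass

print(log)  # ['reserve', 'charge', 'refund', 'release']
```

In a workflow definition the same shape would be expressed with the service's own error-handling constructs, with each compensation implemented as a call to the service that owns the step.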

### Security and Identity

**Cloud IAM**
- Resource hierarchy: Organization -> Folders -> Projects -> Resources
- Roles: Basic (Owner, Editor, Viewer; formerly called primitive), Predefined, Custom
- Service accounts for applications
- Workload Identity for GKE
- Best practices: Use predefined roles, least privilege, service accounts for apps
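
A least-privilege review often starts by finding grants of the broad Owner, Editor, and Viewer roles. The sketch below scans an IAM policy document, assuming the usual `bindings` list of `role` and `members` entries; the sample policy and the member names are invented.

```python
BASIC_ROLES = {"roles/owner", "roles/editor", "roles/viewer"}


def basic_role_grants(policy: dict) -> list[tuple[str, str]]:
    """Return (role, member) pairs that use a broad basic role."""
    return [
        (binding["role"], member)
        for binding in policy.get("bindings", [])
        if binding["role"] in BASIC_ROLES
        for member in binding.get("members", [])
    ]


# Invented sample policy for illustration.
policy = {
    "bindings": [
        {"role": "roles/editor", "members": ["user:alice@example.com"]},
        {"role": "roles/storage.objectViewer",
         "members": ["serviceAccount:app@example-project.iam.gserviceaccount.com"]},
    ]
}

for role, member in basic_role_grants(policy):
    print(f"replace {role} for {member} with a predefined or custom role")
```

Each finding is a candidate for replacement with a narrower predefined role scoped to the resources the principal actually uses.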

**Cloud Key Management (KMS)**
- Encryption key management
- Customer-managed encryption keys (CMEK)
- Hardware Security Module (HSM) backed
- Automatic key rotation
- Best practices: Enable automatic rotation, use separate keys per environment

**Secret Manager**
- Store API keys, passwords, certificates
- Versioning and access control
- Rotation schedules that publish Pub/Sub notifications (the rotation logic itself is user-implemented)
- Best practices: Rotate secrets regularly, use IAM for access control

**Security Command Center**
- Centralized security and risk management
- Asset discovery and vulnerability scanning
- Threat detection and compliance monitoring
- Best practices: Enable all detectors, review findings regularly

**VPC Service Controls**
- Create security perimeters around GCP resources
- Prevent data exfiltration
- Best practices: Use for sensitive data, implement access levels

### AI and Machine Learning

**Vertex AI**
- Unified ML platform
- AutoML for custom models
- Pre-trained models (Vision, Natural Language, etc.)
- MLOps with pipelines
- Best practices: Use AutoML for quick start, implement feature store

**BigQuery ML**
- Create and execute ML models using SQL
- Model types: Linear regression, logistic regression, clustering, etc.
- Integration with Vertex AI
- Best practices: Use for simple models, leverage BigQuery's scale

## Architecture Patterns

### High Availability

**Multi-Zone Pattern**
```
Global HTTP(S) Load Balancer
    |
    v
Managed Instance Group (multi-zone)
    |
    v
Cloud SQL (regional, HA configuration)
    |
    v
Cloud Storage (multi-region)
```

**Multi-Region Pattern**
```
Global HTTP(S) Load Balancer
    |
    ├── Backend Service Region 1 (Cloud Run)
    └── Backend Service Region 2 (Cloud Run)
         |
         v
    Cloud Spanner (multi-region)
```

### Serverless Architecture

**Event-Driven Pattern**
```
Cloud Storage upload event
    |
    v
Pub/Sub topic
    |
    v
Cloud Functions (image processing)
    |
    v
Firestore (metadata storage)
```

**API-First Pattern**
```
Cloud Endpoints or API Gateway
    |
    v
Cloud Run (multiple services)
    |
    ├── Cloud SQL (transactional data)
    └── Firestore (user data)
```

### Microservices on GKE

**GKE with Service Mesh**
```
Global Load Balancer
    |
    v
GKE Ingress
    |
    v
Cloud Service Mesh (formerly Anthos Service Mesh, Istio-based)
    |
    v
Microservices (Cloud Spanner, Firestore, Memorystore)
```

### Data Analytics Platform

```
Data Sources
    |
    v
Pub/Sub (streaming)
    |
    v
Dataflow (Apache Beam)
    |
    v
BigQuery (data warehouse)
    |
    v
Looker or Looker Studio, formerly Data Studio (visualization)
```

**Batch Processing**
```
Cloud Storage (raw data)
    |
    v
Dataproc (Apache Spark)
    |
    v
BigQuery (analytics)
```

## Landing Zone Design

### Resource Hierarchy

```
Organization
├── Folders (by environment or team)
│   ├── Production Folder
│   │   ├── Project A
│   │   └── Project B
│   ├── Staging Folder
│   └── Development Folder
└── Shared Services Folder
    ├── Networking Project (Shared VPC host)
    ├── Security Project (KMS, Secret Manager)
    └── Logging Project (centralized logs)
```

### Network Design

**Shared VPC Pattern**
```
Host Project (networking team)
├── Shared VPC
│   ├── Subnet Production (region A)
│   ├── Subnet Staging (region A)
│   └── Subnet Development (region B)

Service Projects (application teams)
├── Production Project (uses Production subnet)
├── Staging Project (uses Staging subnet)
└── Development Project (uses Development subnet)
```

**Hub-and-Spoke with VPN**
```
On-premises Network
    |
    v
Cloud VPN / Interconnect
    |
    v
Hub VPC (shared services)
    |
    ├── Spoke VPC 1 (production workloads)
    ├── Spoke VPC 2 (development workloads)
    └── Spoke VPC 3 (analytics workloads)
```

### Governance

**Organization Policies**
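- Restrict public IPs on Cloud SQL instances (constraints/sql.restrictPublicIp)
- Enforce uniform bucket-level access (constraints/storage.uniformBucketLevelAccess)
- Restrict VM external IPs (constraints/compute.vmExternalIpAccess)
- Define allowed resource locations (constraints/gcp.resourceLocations)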

**IAM Strategy**
- Use Google Groups for role assignments
- Separate duties (network admin, security admin, etc.)
- Service accounts per application
- Workload Identity for GKE workloads

**Logging and Monitoring**
```
All Projects
    |
    v
Log Router
    |
    ├── Cloud Logging (default sink)
    ├── BigQuery (long-term analysis)
    ├── Cloud Storage (archive)
    └── Pub/Sub (real-time processing)
```

## Migration Strategies

### Migrate to Virtual Machines

**Tools**
- Migrate to Virtual Machines (formerly Migrate for Compute Engine)
- Supports VMware, AWS, Azure, physical servers
- Agentless or agent-based migration
- Waves and test clones

**Process**
1. Assess: Fit assessment and TCO analysis
2. Plan: Group VMs, define migration waves
3. Deploy: Set up infrastructure (VPC, firewall rules)
4. Migrate: Test migration, cutover, validation
5. Optimize: Right-sizing, committed use discounts

### Database Migration

**Database Migration Service**
- Minimal downtime migrations
- Supports MySQL, PostgreSQL, SQL Server, Oracle
- Continuous replication for cutover flexibility

**Transfer Appliance**
- Physical device for large data transfers
- Up to 1 PB capacity
- Offline data transfer

## Cost Optimization

### Compute Savings

**Committed Use Discounts**
- 1-year or 3-year commitments
- Up to 57% savings for VMs
- Resource-based or spend-based

**Sustained Use Discounts**
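- Automatic discounts for VMs that run more than 25% of a billing month
- Up to 30% savings on eligible machine families (for example N1); not every family qualifies (E2 does not)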
- No commitment required

**Preemptible and Spot VMs**
- 60-91% discount off on-demand pricing
- Can be terminated by GCP
- Best for batch processing, fault-tolerant workloads

**Recommender**
- VM rightsizing recommendations
- Idle resource identification
- Committed use discount recommendations
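
The discount levers above can be compared with a few lines of arithmetic. The sketch below uses a hypothetical on-demand rate of $0.10 per hour and treats the "up to" percentages quoted in this section as inputs; actual discounts depend on machine family, region, and commitment term.

```python
HOURS_PER_MONTH = 730  # common approximation for a full month


def monthly_cost(on_demand_hourly: float, hours: float,
                 discount: float = 0.0) -> float:
    """Cost of one VM for the month after a fractional discount (0.30 = 30%)."""
    if not 0 <= discount < 1:
        raise ValueError("discount must be in [0, 1)")
    return on_demand_hourly * hours * (1 - discount)


rate = 0.10  # hypothetical on-demand price in USD per hour
scenarios = {
    "on-demand": 0.0,
    "sustained use (best case)": 0.30,
    "3-year commitment (best case)": 0.57,
}
for label, discount in scenarios.items():
    print(f"{label}: ${monthly_cost(rate, HOURS_PER_MONTH, discount):.2f}")
```

Commitments are billed whether or not the VM runs, so the comparison only favors them for steady baseline load; bursty or interruptible work is usually a better fit for Spot VMs.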

### Storage Savings

**Cloud Storage**
- Autoclass for automatic tier transitions
- Lifecycle policies (delete or transition)
- Nearline (30+ days), Coldline (90+ days), Archive (365+ days)
- Requester Pays option to bill access and data transfer charges to the requester

**Persistent Disk**
- Delete orphaned disks
- Use balanced SSD instead of SSD when possible
- Size disks to actual usage at creation; persistent disks can grow but cannot be shrunk in place

### BigQuery Savings

**On-Demand Pricing**
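- Billed per TiB scanned ($6.25 per TiB list price in US regions at the time of writing; the first 1 TiB per month is free)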
- Use partitioning and clustering
- Query cache for free repeated queries

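**Capacity Pricing (BigQuery Editions)**
- Predictable costs for heavy users (replaces the retired flat-rate and flex slot plans)
- Standard, Enterprise, and Enterprise Plus editions with autoscaling slots
- Optional 1-year or 3-year slot commitments on Enterprise and Enterprise Plus for lower rates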

**Best Practices**
- Use approximate aggregation functions (APPROX_COUNT_DISTINCT)
- Avoid SELECT *, specify columns
- Use materialized views for common queries
- Set up cost controls with custom quotas
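
The effect of partitioning on on-demand cost can be estimated before a query runs. The sketch below assumes equal-sized daily partitions and takes the price per TiB as a parameter; the 6.25 figure in the example is an assumed list price, so check current pricing for your region.

```python
TIB = 2**40


def on_demand_cost(bytes_scanned: int, price_per_tib: float) -> float:
    """Estimated on-demand charge for a query that scans bytes_scanned."""
    return bytes_scanned / TIB * price_per_tib


def partition_pruned_bytes(table_bytes: int, total_partitions: int,
                           partitions_read: int) -> int:
    """Approximate bytes scanned when a filter selects some equal-sized partitions."""
    return table_bytes * partitions_read // total_partitions


table_bytes = 10 * TIB        # hypothetical 10 TiB table
price = 6.25                  # assumed USD per TiB; verify before relying on it
full = on_demand_cost(table_bytes, price)
pruned = on_demand_cost(partition_pruned_bytes(table_bytes, 365, 7), price)
print(f"full scan: ${full:.2f}, last 7 daily partitions: ${pruned:.2f}")
```

Selecting only the needed columns reduces scanned bytes in the same way, because storage is column-oriented.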

### Monitoring Costs

**Cloud Billing**
- Budgets and alerts
- Cost breakdown by project, service, SKU
- Export to BigQuery for analysis
- Recommendations from Active Assist

## Disaster Recovery

### Backup Strategies

**VM Backups**
- Persistent disk snapshots (incremental)
- Machine images (include metadata and config)
- Cross-region snapshot copy
- Snapshot schedules for automation

**Database Backups**
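- Cloud SQL: Automated backups (7 retained by default, configurable up to 365) plus point-in-time recovery
- Cloud Spanner: Backups on demand or scheduled
- Firestore: Scheduled backups (daily or weekly) and managed exports to Cloud Storage
- Bigtable: Managed table backups, stored on the cluster and copyable to other regions or projects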

### High Availability

**RTO/RPO Matrix**

| Pattern | RPO | RTO | Cost |
|---------|-----|-----|------|
| Active-Active Multi-Region | Seconds | Seconds | High |
| Active-Passive with Replication | Minutes | Minutes | Medium |
| Warm Standby | Minutes | 10-30 min | Medium |
| Backup and Restore | Hours | Hours | Low |

**Cloud SQL HA**
- Regional configuration with synchronous replication
- Automatic failover
- 99.95% availability SLA on Enterprise edition (99.99% on Enterprise Plus); the SLA covers only instances configured for HA

**Cloud Spanner**
- Multi-region configuration
- 99.999% availability SLA
- Synchronous replication across regions

### Disaster Recovery Testing

- Regular DR drills (quarterly recommended)
- Document runbooks
- Test restoration procedures
- Measure actual RTO/RPO vs targets
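
Measuring achieved RTO and RPO after a drill only needs three timestamps. The sketch below is a minimal example with invented times and targets; `measure_drill` and `meets_targets` are illustrative names.

```python
from datetime import datetime, timedelta


def measure_drill(last_good_backup: datetime, outage_start: datetime,
                  service_restored: datetime) -> tuple[timedelta, timedelta]:
    """Return (achieved RPO, achieved RTO) for one drill."""
    rpo = outage_start - last_good_backup   # data written in this gap is lost
    rto = service_restored - outage_start   # time the service was unavailable
    return rpo, rto


def meets_targets(achieved, targets) -> bool:
    """True when every achieved value is within its target."""
    return all(a <= t for a, t in zip(achieved, targets))


achieved = measure_drill(
    last_good_backup=datetime(2024, 1, 15, 2, 0),
    outage_start=datetime(2024, 1, 15, 2, 45),
    service_restored=datetime(2024, 1, 15, 3, 20),
)
targets = (timedelta(minutes=60), timedelta(minutes=30))  # (RPO, RTO) targets
print(achieved, meets_targets(achieved, targets))
```

In this invented drill the RPO target is met (45 minutes of data at risk) but the RTO target is missed (35 minutes to restore), which is the kind of gap a runbook update should address.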

## Monitoring and Observability

### Cloud Monitoring (formerly Stackdriver)

**Metrics**
- System metrics (CPU, memory, disk, network)
- Custom metrics via Cloud Monitoring API
- Metric scopes for multi-project monitoring
- Uptime checks for availability

**Dashboards and Charts**
- Predefined dashboards for GCP services
- Custom dashboards with filters and grouping
- SLO monitoring with error budgets

### Cloud Logging

**Log Types**
- Admin Activity logs (always enabled, no charge)
- Data Access logs (must be enabled)
- System Event logs
- Access Transparency logs (for Google access)

**Log Sinks**
- Route logs to BigQuery, Cloud Storage, Pub/Sub
- Aggregated sinks at organization/folder level
- Exclusion filters to reduce costs
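
Each sink pairs a filter with a destination. The sketch below builds sink definitions as plain data, using the destination string formats for BigQuery datasets, Cloud Storage buckets, and Pub/Sub topics; the project, dataset, bucket, and topic names are placeholders, and `sink_destination` is a hypothetical helper.

```python
def sink_destination(kind: str, project: str, name: str) -> str:
    """Build a sink destination string for the given target type."""
    templates = {
        "bigquery": "bigquery.googleapis.com/projects/{project}/datasets/{name}",
        "storage": "storage.googleapis.com/{name}",
        "pubsub": "pubsub.googleapis.com/projects/{project}/topics/{name}",
    }
    return templates[kind].format(project=project, name=name)


# Placeholder names throughout; filters use the Logging query language.
sinks = [
    {"name": "audit-to-bigquery",
     "filter": 'logName:"cloudaudit.googleapis.com"',
     "destination": sink_destination("bigquery", "logging-project", "audit_logs")},
    {"name": "all-to-archive",
     "filter": "",
     "destination": sink_destination("storage", "logging-project", "org-log-archive")},
    {"name": "errors-to-pubsub",
     "filter": "severity>=ERROR",
     "destination": sink_destination("pubsub", "logging-project", "error-events")},
]

for sink in sinks:
    print(sink["name"], "->", sink["destination"])
```

Keeping sink definitions as data makes it straightforward to review them and to feed them into whichever infrastructure-as-code tool manages the logging project.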

### Cloud Trace

**Distributed Tracing**
- Automatic instrumentation for App Engine, Cloud Run, GKE
- Manual instrumentation with client libraries
- Latency analysis and performance insights
- Integration with Zipkin

### Cloud Profiler

**Continuous Profiling**
- CPU and memory profiling
- Low overhead (< 0.5% CPU)
- Flame graphs for visualization
- Supported languages: Java, Go, Python, Node.js

### Error Reporting

**Aggregated Error Tracking**
- Automatic error grouping
- Stack trace analysis
- Integration with Cloud Logging
- Notifications for new errors
Initial: Bookworm Smart Assistant v6.5.1 (byte-preserved, 809 files, fp 26b83e1b38cdf64a) 2026-04-21 17:57:05 +08:00			`# GCP Architecture Reference`

			`Comprehensive guide for Google Cloud Platform services, patterns, and architecture framework.`

			`## Google Cloud Architecture Framework`

			`### Five Pillars`

			`1. Operational Excellence`
			`- Infrastructure as Code (Deployment Manager, Terraform)`
			`- CI/CD with Cloud Build`
			`- Monitoring with Cloud Monitoring (Stackdriver)`
			`- SRE principles and SLOs`
			`- Incident management`

			`2. Security, Privacy, and Compliance`
			`- Identity and Access Management (Cloud IAM)`
			`- VPC Service Controls for data perimeter`
			`- Binary Authorization for containers`
			`- Data encryption (default at rest and in transit)`
			`- Security Command Center`

			`3. Reliability`
			`- Multi-zone and multi-region deployments`
			`- Load balancing and autoscaling`
			`- Disaster recovery planning`
			`- Chaos engineering practices`
			`- SLIs, SLOs, and error budgets`

			`4. Cost Optimization`
			`- Committed Use Discounts`
			`- Sustained Use Discounts (automatic)`
			`- Preemptible VMs and Spot VMs`
			`- Recommender for right-sizing`
			`- Active Assist for optimization`

			`5. Performance Optimization`
			`- Cloud CDN and Media CDN`
			`- Caching strategies (Memorystore)`
			`- Database performance tuning`
			`- Network optimization (Premium vs Standard tier)`
			`- Regional and zonal resource placement`

			`## Core Services Architecture`

			`### Compute`

			`Compute Engine`
			`- Machine types: E2 (cost-optimized), N2 (balanced), C2 (compute-optimized), M2 (memory-optimized)`
			`- Custom machine types for specific needs`
			`- Preemptible VMs (up to 80% discount, max 24 hours)`
			`- Spot VMs (similar to preemptible, better availability)`
			`- Instance groups: Managed (with autoscaling), unmanaged`
			`- Best practices: Use latest generation, committed use discounts, Spot for batch jobs`

			`Cloud Run`
			`- Fully managed serverless container platform`
			`- Auto-scaling to zero`
			`- Pay per request`
			`- CPU allocated only during request handling`
			`- Best practices: Stateless containers, optimize cold starts, use Cloud Run jobs for batch`

			`Cloud Functions`
			`- Event-driven serverless functions`
			`- 1st gen: HTTP and background functions`
			`- 2nd gen: Built on Cloud Run, better performance`
			`- Event sources: Pub/Sub, Cloud Storage, Firestore, HTTP`
			`- Best practices: Use 2nd gen, minimize cold starts, implement retry logic`

			`Google Kubernetes Engine (GKE)`
			`- Managed Kubernetes with GCP integration`
			`- Autopilot mode: Fully managed, per-pod pricing`
			`- Standard mode: More control, node management`
			`- Workload Identity for secure service access`
			`- Binary Authorization for deployment policies`
			`- Best practices: Use Autopilot for simplicity, enable Workload Identity, implement network policies`

			`App Engine`
			`- Fully managed platform (PaaS)`
			`- Standard environment (sandboxed, auto-scaling)`
			`- Flexible environment (Docker containers, custom runtimes)`
			`- Traffic splitting for canary deployments`
			`- Best practices: Use Standard for web apps, Flexible for custom dependencies`

			`### Storage`

			`Cloud Storage`
			`- Storage classes: Standard, Nearline (30-day), Coldline (90-day), Archive (365-day)`
			`- Object lifecycle management`
			`- Object versioning and retention policies`
			`- Autoclass for automatic tier transitions`
			`- Requester pays for data transfer`
			`- Best practices: Use Autoclass, enable versioning, implement lifecycle policies`

			`Persistent Disk`
			`- Types: Standard (HDD), Balanced SSD, SSD, Extreme`
			`- Zonal and regional persistent disks`
			`- Snapshots for backup (incremental)`
			`- Disk resize without downtime`
			`- Best practices: Use Balanced SSD for most workloads, enable snapshots`

			`Filestore`
			`- Managed NFS file storage`
			`- Tiers: Basic (1-63.9 TB), Enterprise (1-10 TB, better performance)`
			`- Backup to Cloud Storage`
			`- Best practices: Use Enterprise for production, implement backups`

			`Cloud Storage for Firebase`
			`- Object storage for mobile and web apps`
			`- Client SDKs for direct upload/download`
			`- Security rules for access control`

			`### Database`

			`Cloud SQL`
			`- Managed MySQL, PostgreSQL, SQL Server`
			`- High availability configuration (regional)`
			`- Read replicas for scaling`
			`- Automated backups and point-in-time recovery`
			`- Best practices: Enable HA, use read replicas, implement connection pooling with Cloud SQL Proxy`

			`Cloud Spanner`
			`- Globally distributed relational database`
			`- Horizontal scalability with strong consistency`
			`- Multi-region for 99.999% availability`
			`- TrueTime for global consistency`
			`- Best practices: Design proper schema splits, use commit timestamps, optimize hotspots`

			`Firestore (Native mode)`
			`- NoSQL document database`
			`- Real-time synchronization`
			`- Offline support for mobile`
			`- ACID transactions`
			`- Best practices: Design document structure carefully, use collection group queries wisely`

			`Bigtable`
			`- NoSQL wide-column database`
			`- Petabyte-scale with single-digit millisecond latency`
			`- HBase API compatible`
			`- Linear scalability by adding nodes`
			`- Best practices: Design row keys to avoid hotspots, use replication for HA`

			`Memorystore`
			`- Managed Redis and Memcached`
			`- Standard tier (HA with replica) and Basic tier`
			`- Best practices: Use Standard tier for production, implement connection pooling`

			`BigQuery`
			`- Serverless data warehouse`
			`- SQL analytics on petabyte-scale data`
			`- Column-oriented storage`
			`- Automatic caching and optimization`
			`- Best practices: Partition and cluster tables, use approximate functions, control costs with quotas`

			`### Networking`

			`VPC (Virtual Private Cloud)`
			`- Global resource (subnets are regional)`
			`- Custom or auto mode networks`
			`- Firewall rules (stateful)`
			`- VPC peering and Shared VPC`
			`- Private Google Access for GCP services`
			`- Best practices: Use custom mode VPC, plan IP ranges, implement firewall rules`

			`Cloud Load Balancing`
			`- Global load balancing (HTTP(S), TCP/SSL Proxy, external TCP/UDP)`
			`- Regional load balancing (internal HTTP(S), internal TCP/UDP)`
			`- Anycast IP for global distribution`
			`- Backend services with health checks`
			`- Best practices: Use global for multi-region, enable CDN, configure health checks`

			`Cloud CDN`
			`- Global content delivery network`
			`- Cache invalidation and signed URLs`
			`- Integration with Cloud Storage and compute`
			`- Best practices: Enable compression, use cache-control headers`

			`Cloud Interconnect and VPN`
			`- Dedicated Interconnect (10 Gbps or 100 Gbps)`
			`- Partner Interconnect (50 Mbps to 50 Gbps)`
			`- Cloud VPN (HA VPN for 99.99% SLA)`
			`- Best practices: Use HA VPN for redundancy, Dedicated Interconnect for high bandwidth`

			`Cloud Armor`
			`- DDoS protection and WAF`
			`- Preconfigured and custom rules`
			`- Adaptive protection (ML-based)`
			`- Best practices: Enable for internet-facing services, use preconfigured rules`

			`Private Service Connect`
			`- Private connectivity to Google APIs and services`
			`- Service Directory for service discovery`
			`- Best practices: Use for all managed services in production`

			`### Serverless and Event-Driven`

			`Pub/Sub`
			`- Global message queue`
			`- At-least-once delivery`
			`- Push and pull subscriptions`
			`- Message ordering and filtering`
			`- Dead-letter topics`
			`- Best practices: Use message attributes for filtering, implement idempotent processing`

			`Eventarc`
			`- Event-driven architecture`
			`- Triggers for Cloud Run, Workflows, GKE`
			`- Sources: Audit Logs, Pub/Sub, custom events`
			`- Best practices: Use for decoupled architectures, implement event filtering`

			`Cloud Scheduler`
			`- Fully managed cron service`
			`- HTTP, Pub/Sub, and App Engine targets`
			`- Best practices: Use for periodic tasks, implement retry logic`

			`Workflows`
			`- Orchestrate and automate GCP and HTTP services`
			`- YAML-based workflow definition`
			`- Built-in error handling and retry`
			`- Best practices: Use for complex multi-step processes, implement compensating transactions`

			`### Security and Identity`

			`Cloud IAM`
			`- Resource hierarchy: Organization -> Folders -> Projects -> Resources`
			`- Roles: Primitive (Owner, Editor, Viewer), Predefined, Custom`
			`- Service accounts for applications`
			`- Workload Identity for GKE`
			`- Best practices: Use predefined roles, least privilege, service accounts for apps`

			`Cloud Key Management (KMS)`
			`- Encryption key management`
			`- Customer-managed encryption keys (CMEK)`
			`- Hardware Security Module (HSM) backed`
			`- Automatic key rotation`
			`- Best practices: Enable automatic rotation, use separate keys per environment`

			`Secret Manager`
			`- Store API keys, passwords, certificates`
			`- Versioning and access control`
			`- Automatic rotation integration`
			`- Best practices: Rotate secrets regularly, use IAM for access control`

			`Security Command Center`
			`- Centralized security and risk management`
			`- Asset discovery and vulnerability scanning`
			`- Threat detection and compliance monitoring`
			`- Best practices: Enable all detectors, review findings regularly`

			`VPC Service Controls`
			`- Create security perimeters around GCP resources`
			`- Prevent data exfiltration`
			`- Best practices: Use for sensitive data, implement access levels`

			`### AI and Machine Learning`

			`Vertex AI`
			`- Unified ML platform`
			`- AutoML for custom models`
			`- Pre-trained models (Vision, Natural Language, etc.)`
			`- MLOps with pipelines`
			`- Best practices: Use AutoML for quick start, implement feature store`

			`BigQuery ML`
			`- Create and execute ML models using SQL`
			`- Model types: Linear regression, logistic regression, clustering, etc.`
			`- Integration with Vertex AI`
			`- Best practices: Use for simple models, leverage BigQuery's scale`

			`## Architecture Patterns`

			`### High Availability`

			`Multi-Zone Pattern`
			```
			`Global HTTP(S) Load Balancer`
			`\|`
			`v`
			`Managed Instance Group (multi-zone)`
			`\|`
			`v`
			`Cloud SQL (regional, HA configuration)`
			`\|`
			`v`
			`Cloud Storage (multi-region)`
			```

			`Multi-Region Pattern`
			```
			`Global HTTP(S) Load Balancer`
			`\|`
			`├── Backend Service Region 1 (Cloud Run)`
			`└── Backend Service Region 2 (Cloud Run)`
			`\|`
			`v`
			`Cloud Spanner (multi-region)`
			```

			`### Serverless Architecture`

			`Event-Driven Pattern`
			```
			`Cloud Storage upload event`
			`\|`
			`v`
			`Pub/Sub topic`
			`\|`
			`v`
			`Cloud Functions (image processing)`
			`\|`
			`v`
			`Firestore (metadata storage)`
			```

			`API-First Pattern`
			```
			`Cloud Endpoints or API Gateway`
			`\|`
			`v`
			`Cloud Run (multiple services)`
			`\|`
			`├── Cloud SQL (transactional data)`
			`└── Firestore (user data)`
			```

			`### Microservices on GKE`

			`GKE with Service Mesh`
			```
			`Global Load Balancer`
			`\|`
			`v`
			`GKE Ingress`
			`\|`
			`v`
			`Anthos Service Mesh (Istio)`
			`\|`
			`v`
			`Microservices (Cloud Spanner, Firestore, Memorystore)`
			```

			`### Data Analytics Platform`

			```
			`Data Sources`
			`\|`
			`v`
			`Pub/Sub (streaming)`
			`\|`
			`v`
			`Dataflow (Apache Beam)`
			`\|`
			`v`
			`BigQuery (data warehouse)`
			`\|`
			`v`
			`Looker or Data Studio (visualization)`
			```

			`Batch Processing`
			```
			`Cloud Storage (raw data)`
			`\|`
			`v`
			`Dataproc (Apache Spark)`
			`\|`
			`v`
			`BigQuery (analytics)`
			```

			`## Landing Zone Design`

			`### Resource Hierarchy`

			```
			`Organization`
			`├── Folders (by environment or team)`
			`│ ├── Production Folder`
			`│ │ ├── Project A`
			`│ │ └── Project B`
			`│ ├── Staging Folder`
			`│ └── Development Folder`
			`└── Shared Services Folder`
			`├── Networking Project (Shared VPC host)`
			`├── Security Project (KMS, Secret Manager)`
			`└── Logging Project (centralized logs)`
			```

			`### Network Design`

			`Shared VPC Pattern`
			```
			`Host Project (networking team)`
			`├── Shared VPC`
			`│ ├── Subnet Production (region A)`
			`│ ├── Subnet Staging (region A)`
			`│ └── Subnet Development (region B)`

			`Service Projects (application teams)`
			`├── Production Project (uses Production subnet)`
			`├── Staging Project (uses Staging subnet)`
			`└── Development Project (uses Development subnet)`
			```

			`Hub-and-Spoke with VPN`
			```
			`On-premises Network`
			`\|`
			`v`
			`Cloud VPN / Interconnect`
			`\|`
			`v`
			`Hub VPC (shared services)`
			`\|`
			`├── Spoke VPC 1 (production workloads)`
			`├── Spoke VPC 2 (development workloads)`
			`└── Spoke VPC 3 (analytics workloads)`
			```

			`### Governance`

			`Organization Policies`
			`- Restrict public IP assignment`
			`- Enforce uniform bucket-level access`
			`- Restrict VM external IP`
			`- Define allowed resource locations`

			`IAM Strategy`
			`- Use Google Groups for role assignments`
			`- Separate duties (network admin, security admin, etc.)`
			`- Service accounts per application`
			`- Workload Identity for GKE workloads`

			`Logging and Monitoring`
			```
			`All Projects`
			`\|`
			`v`
			`Log Router`
			`\|`
			`├── Cloud Logging (default sink)`
			`├── BigQuery (long-term analysis)`
			`├── Cloud Storage (archive)`
			`└── Pub/Sub (real-time processing)`
			```

			`## Migration Strategies`

			`### Migrate to Virtual Machines`

			`Tools`
			`- Migrate to Virtual Machines (formerly Migrate for Compute Engine)`
			`- Supports VMware, AWS, Azure, physical servers`
			`- Agentless or agent-based migration`
			`- Waves and test clones`

			`Process`
			`1. Assess: Fit assessment and TCO analysis`
			`2. Plan: Group VMs, define migration waves`
			`3. Deploy: Set up infrastructure (VPC, firewall rules)`
			`4. Migrate: Test migration, cutover, validation`
			`5. Optimize: Right-sizing, committed use discounts`

			`### Database Migration`

			`Database Migration Service`
			`- Minimal downtime migrations`
			`- Supports MySQL, PostgreSQL, SQL Server, Oracle`
			`- Continuous replication for cutover flexibility`

			`Transfer Appliance`
			`- Physical device for large data transfers`
			`- Up to 1 PB capacity`
			`- Offline data transfer`

			`## Cost Optimization`

			`### Compute Savings`

			`Committed Use Discounts`
			`- 1-year or 3-year commitments`
			`- Up to 57% savings for VMs`
			`- Resource-based or spend-based`

			`Sustained Use Discounts`
			`- Automatic discounts for running VMs >25% of month`
			`- Up to 30% savings`
			`- No commitment required`

**Preemptible and Spot VMs**
- Up to 80% discount
- Can be preempted by Compute Engine at any time, with a 30-second notice
- Best for batch processing, fault-tolerant workloads

**Recommender**
- VM rightsizing recommendations
- Idle resource identification
- Committed use discount recommendations

### Storage Savings

**Cloud Storage**
- Autoclass for automatic tier transitions
- Lifecycle policies (delete or transition)
- Minimum storage durations: Nearline (30 days), Coldline (90 days), Archive (365 days)
- Requester Pays to bill access and data transfer charges to the requesting project
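
A lifecycle policy is a list of rules, each pairing an action with a condition. The sketch below builds such a list in the shape used by a bucket's `lifecycle` configuration. The age thresholds and the roughly seven-year deletion window are hypothetical policy choices, and `lifecycle_rules` is an illustrative helper rather than a library function.

```python
import json


def lifecycle_rules(nearline_after=30, coldline_after=90, archive_after=365, delete_after=None):
    """Build lifecycle rules that move objects to colder classes as they age (in days)."""
    transitions = [
        ("NEARLINE", nearline_after, ["STANDARD"]),
        ("COLDLINE", coldline_after, ["STANDARD", "NEARLINE"]),
        ("ARCHIVE", archive_after, ["STANDARD", "NEARLINE", "COLDLINE"]),
    ]
    rules = [
        {
            "action": {"type": "SetStorageClass", "storageClass": target},
            # matchesStorageClass keeps a rule from touching objects that are already colder.
            "condition": {"age": age_days, "matchesStorageClass": sources},
        }
        for target, age_days, sources in transitions
    ]
    if delete_after is not None:
        rules.append({"action": {"type": "Delete"}, "condition": {"age": delete_after}})
    return {"rule": rules}


config = lifecycle_rules(delete_after=2555)  # hypothetical retention of about 7 years
print(json.dumps(config, indent=2))
```

Because early deletion from a colder class is still billed for the minimum storage duration, transition ages are usually chosen so that objects stay in each class at least that long.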

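**Persistent Disk**
- Delete orphaned disks
- Use balanced persistent disks (`pd-balanced`) instead of SSD persistent disks (`pd-ssd`) when possible
- Size disks to actual usage (disks can only grow, so shrinking means creating a smaller disk and copying the data)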

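### BigQuery Savings

**On-Demand Pricing**
- Billed per TiB scanned ($6.25 per TiB in US regions since the July 2023 price change, previously $5)
- Use partitioning and clustering to reduce bytes scanned
- Cached results make repeated identical queries free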
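
On-demand cost scales linearly with bytes scanned, which is why partition pruning pays off. A minimal sketch of the arithmetic follows. The rate is passed in as a parameter because it changes over time and by region, the function name and table sizes are hypothetical, and per-query minimums are ignored.

```python
TIB = 2 ** 40  # on-demand analysis is billed per tebibyte scanned


def on_demand_query_cost(bytes_processed, price_per_tib, free_bytes_remaining=0):
    """Estimate the on-demand cost of one query from the bytes it scans."""
    billable = max(0, bytes_processed - free_bytes_remaining)
    return billable / TIB * price_per_tib


rate = 6.25  # assumed USD per TiB; substitute the current rate for your region

full_scan = on_demand_query_cost(2 * TIB, rate)        # whole 2 TiB table
pruned_scan = on_demand_query_cost(50 * 2 ** 30, rate)  # one 50 GiB partition

print(f"full scan:   ${full_scan:.2f}")
print(f"pruned scan: ${pruned_scan:.2f}")
```

The byte count for a real query can be obtained from a dry run before executing it, which makes this kind of estimate usable as a pre-flight check.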

**Capacity Pricing (formerly Flat-Rate)**
- Predictable costs for heavy users
- Autoscaling slots available through BigQuery editions
- Flat-rate and flex slot commitments are no longer sold; editions offer pay-as-you-go slots or 1-year and 3-year commitments

**Best Practices**
- Use approximate aggregation functions (`APPROX_COUNT_DISTINCT`)
- Avoid `SELECT *`; select only the columns you need
- Use materialized views for common queries
- Set up cost controls with custom quotas

### Monitoring Costs

**Cloud Billing**
- Budgets and alerts
- Cost breakdown by project, service, SKU
- Export to BigQuery for analysis
- Recommendations from Active Assist

## Disaster Recovery

### Backup Strategies

**VM Backups**
- Persistent disk snapshots (incremental)
- Machine images (include metadata and config)
- Cross-region snapshot copy
- Snapshot schedules for automation

**Database Backups**
- Cloud SQL: Automated backups (7 retained by default, configurable up to 365)
- Cloud Spanner: Backups on demand or scheduled
- Firestore: Scheduled backups (daily or weekly) or managed exports to Cloud Storage
- Bigtable: Table backups stored on the cluster (export to Cloud Storage for archival)

### High Availability

**RTO/RPO Matrix**

| Pattern | RPO | RTO | Cost |
|---------|-----|-----|------|
| Active-Active Multi-Region | Seconds | Seconds | High |
| Active-Passive with Replication | Minutes | Minutes | Medium |
| Warm Standby | Minutes | 10-30 min | Medium |
| Backup and Restore | Hours | Hours | Low |

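**Cloud SQL HA**
- Regional configuration with synchronous replication to a standby in a second zone
- Automatic failover
- 99.95% availability SLA for HA instances (99.99% on Enterprise Plus edition); single-zone instances fall outside the availability SLA

**Cloud Spanner**
- Multi-region configuration
- 99.999% availability SLA (99.99% for regional configurations)
- Synchronous replication across regions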

### Disaster Recovery Testing

- Regular DR drills (quarterly recommended)
- Document runbooks
- Test restoration procedures
- Measure actual RTO/RPO vs targets
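
Measuring a drill comes down to three timestamps: the last point to which data could be recovered, the start of the simulated failure, and the moment service was restored. The sketch below compares the achieved values with targets; the function name, timestamps, and targets are all hypothetical.

```python
from datetime import datetime, timedelta


def drill_results(last_recoverable, failure_start, service_restored, rpo_target, rto_target):
    """Compare achieved RPO/RTO from a DR drill against the stated targets."""
    achieved_rpo = failure_start - last_recoverable   # window of data that was lost
    achieved_rto = service_restored - failure_start   # time the service was unavailable
    return {
        "rpo": achieved_rpo,
        "rto": achieved_rto,
        "rpo_met": achieved_rpo <= rpo_target,
        "rto_met": achieved_rto <= rto_target,
    }


result = drill_results(
    last_recoverable=datetime(2024, 3, 14, 9, 55),
    failure_start=datetime(2024, 3, 14, 10, 0),
    service_restored=datetime(2024, 3, 14, 10, 42),
    rpo_target=timedelta(minutes=15),
    rto_target=timedelta(minutes=30),
)

print(f"RPO {result['rpo']} (met: {result['rpo_met']})")
print(f"RTO {result['rto']} (met: {result['rto_met']})")
```

In this made-up drill the data loss window is within target but recovery took 42 minutes against a 30-minute goal, which is the kind of gap a runbook review should then address.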

## Monitoring and Observability

### Cloud Monitoring (formerly Stackdriver)

**Metrics**
- System metrics (CPU, memory, disk, network)
- Custom metrics via Cloud Monitoring API
- Metric scopes for multi-project monitoring
- Uptime checks for availability

**Dashboards and Charts**
- Predefined dashboards for GCP services
- Custom dashboards with filters and grouping
- SLO monitoring with error budgets
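
The error budget behind SLO monitoring is simple arithmetic: whatever the SLO does not promise is the budget. A minimal sketch, with hypothetical function names and sample numbers, independent of the Cloud Monitoring API:

```python
def error_budget(slo_target, total_events, bad_events):
    """Summarize error budget usage for a request-based SLO over one window."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be between 0 and 1, e.g. 0.999")
    allowed_bad = (1 - slo_target) * total_events
    return {
        "allowed_bad": allowed_bad,
        "remaining_bad": allowed_bad - bad_events,
        "budget_consumed": bad_events / allowed_bad,
    }


def allowed_downtime_minutes(slo_target, window_days=30):
    """Translate an availability SLO into minutes of tolerated downtime per window."""
    return (1 - slo_target) * window_days * 24 * 60


budget = error_budget(slo_target=0.999, total_events=1_000_000, bad_events=250)
print(f"budget consumed: {budget['budget_consumed']:.0%}")
print(f"99.95% over 30 days allows {allowed_downtime_minutes(0.9995):.1f} minutes of downtime")
```

Tracking the consumed fraction over time, rather than the raw error count, is what lets a team decide when to slow releases in favor of reliability work.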

### Cloud Logging

**Log Types**
- Admin Activity logs (always enabled, no charge)
- Data Access logs (must be enabled explicitly, except for BigQuery)
- System Event logs (always enabled, no charge)
- Access Transparency logs (record actions taken by Google personnel)

**Log Sinks**
- Route logs to BigQuery, Cloud Storage, Pub/Sub
- Aggregated sinks at organization/folder level
- Exclusion filters to reduce costs

### Cloud Trace

**Distributed Tracing**
- Automatic request tracing on App Engine standard and Cloud Run; GKE workloads need application instrumentation
- Manual instrumentation with OpenTelemetry or client libraries
- Latency analysis and performance insights
- Integration with Zipkin

### Cloud Profiler

**Continuous Profiling**
- CPU and memory profiling
- Low overhead (typically under 0.5% CPU when amortized across replicas)
- Flame graphs for visualization
- Supported languages: Java, Go, Python, Node.js

### Error Reporting

**Aggregated Error Tracking**
- Automatic error grouping
- Stack trace analysis
- Integration with Cloud Logging
- Notifications for new errors