# GCP Architecture Reference

Comprehensive guide for Google Cloud Platform services, patterns, and architecture framework.

## Google Cloud Architecture Framework

### Five Pillars

1. **Operational Excellence**
   - Infrastructure as Code (Deployment Manager, Terraform)
   - CI/CD with Cloud Build
   - Monitoring with Cloud Monitoring (Stackdriver)
   - SRE principles and SLOs
   - Incident management

2. **Security, Privacy, and Compliance**
   - Identity and Access Management (Cloud IAM)
   - VPC Service Controls for data perimeter
   - Binary Authorization for containers
   - Data encryption (default at rest and in transit)
   - Security Command Center

3. **Reliability**
   - Multi-zone and multi-region deployments
   - Load balancing and autoscaling
   - Disaster recovery planning
   - Chaos engineering practices
   - SLIs, SLOs, and error budgets

4. **Cost Optimization**
   - Committed Use Discounts
   - Sustained Use Discounts (automatic)
   - Preemptible VMs and Spot VMs
   - Recommender for right-sizing
   - Active Assist for optimization

5. **Performance Optimization**
   - Cloud CDN and Media CDN
   - Caching strategies (Memorystore)
   - Database performance tuning
   - Network optimization (Premium vs Standard tier)
   - Regional and zonal resource placement

## Core Services Architecture

### Compute

**Compute Engine**
- Machine types: E2 (cost-optimized), N2 (balanced), C2 (compute-optimized), M2 (memory-optimized)
- Custom machine types for specific needs
- Preemptible VMs (up to 80% discount, max 24 hours)
- Spot VMs (similar to preemptible, better availability)
- Instance groups: Managed (with autoscaling), unmanaged
- Best practices: Use latest generation, committed use discounts, Spot for batch jobs

**Cloud Run**
- Fully managed serverless container platform
- Auto-scaling to zero
- Pay per request
- CPU allocated only during request handling
- Best practices: Stateless containers, optimize cold starts, use Cloud Run jobs for batch

**Cloud Functions**
- Event-driven serverless functions
- 1st gen: HTTP and background functions
- 2nd gen: Built on Cloud Run, better performance
- Event sources: Pub/Sub, Cloud Storage, Firestore, HTTP
- Best practices: Use 2nd gen, minimize cold starts, implement retry logic

**Google Kubernetes Engine (GKE)**
- Managed Kubernetes with GCP integration
- Autopilot mode: Fully managed, per-pod pricing
- Standard mode: More control, node management
- Workload Identity for secure service access
- Binary Authorization for deployment policies
- Best practices: Use Autopilot for simplicity, enable Workload Identity, implement network policies

**App Engine**
- Fully managed platform (PaaS)
- Standard environment (sandboxed, auto-scaling)
- Flexible environment (Docker containers, custom runtimes)
- Traffic splitting for canary deployments
- Best practices: Use Standard for web apps, Flexible for custom dependencies

### Storage

**Cloud Storage**
- Storage classes: Standard, Nearline (30-day), Coldline (90-day), Archive (365-day)
- Object lifecycle management
- Object versioning and retention policies
- Autoclass for automatic tier transitions
- Requester pays for data transfer
- Best practices: Use Autoclass, enable versioning, implement lifecycle policies

**Persistent Disk**
- Types: Standard (HDD), Balanced SSD, SSD, Extreme
- Zonal and regional persistent disks
- Snapshots for backup (incremental)
- Disk resize without downtime
- Best practices: Use Balanced SSD for most workloads, enable snapshots

**Filestore**
- Managed NFS file storage
- Tiers: Basic (1-63.9 TB), Enterprise (1-10 TB, better performance)
- Backup to Cloud Storage
- Best practices: Use Enterprise for production, implement backups

**Cloud Storage for Firebase**
- Object storage for mobile and web apps
- Client SDKs for direct upload/download
- Security rules for access control

### Database

**Cloud SQL**
- Managed MySQL, PostgreSQL, SQL Server
- High availability configuration (regional)
- Read replicas for scaling
- Automated backups and point-in-time recovery
- Best practices: Enable HA, use read replicas, implement connection pooling with Cloud SQL Proxy

**Cloud Spanner**
- Globally distributed relational database
- Horizontal scalability with strong consistency
- Multi-region for 99.999% availability
- TrueTime for global consistency
- Best practices: Design proper schema splits, use commit timestamps, optimize hotspots

**Firestore (Native mode)**
- NoSQL document database
- Real-time synchronization
- Offline support for mobile
- ACID transactions
- Best practices: Design document structure carefully, use collection group queries wisely

**Bigtable**
- NoSQL wide-column database
- Petabyte-scale with single-digit millisecond latency
- HBase API compatible
- Linear scalability by adding nodes
- Best practices: Design row keys to avoid hotspots, use replication for HA

**Memorystore**
- Managed Redis and Memcached
- Standard tier (HA with replica) and Basic tier
- Best practices: Use Standard tier for production, implement connection pooling

**BigQuery**
- Serverless data warehouse
- SQL analytics on petabyte-scale data
- Column-oriented storage
- Automatic caching and optimization
- Best practices: Partition and cluster tables, use approximate functions, control costs with quotas

### Networking

**VPC (Virtual Private Cloud)**
- Global resource (subnets are regional)
- Custom or auto mode networks
- Firewall rules (stateful)
- VPC peering and Shared VPC
- Private Google Access for GCP services
- Best practices: Use custom mode VPC, plan IP ranges, implement firewall rules

**Cloud Load Balancing**
- Global load balancing (HTTP(S), TCP/SSL Proxy, external TCP/UDP)
- Regional load balancing (internal HTTP(S), internal TCP/UDP)
- Anycast IP for global distribution
- Backend services with health checks
- Best practices: Use global for multi-region, enable CDN, configure health checks

**Cloud CDN**
- Global content delivery network
- Cache invalidation and signed URLs
- Integration with Cloud Storage and compute
- Best practices: Enable compression, use cache-control headers

**Cloud Interconnect and VPN**
- Dedicated Interconnect (10 Gbps or 100 Gbps)
- Partner Interconnect (50 Mbps to 50 Gbps)
- Cloud VPN (HA VPN for 99.99% SLA)
- Best practices: Use HA VPN for redundancy, Dedicated Interconnect for high bandwidth

**Cloud Armor**
- DDoS protection and WAF
- Preconfigured and custom rules
- Adaptive protection (ML-based)
- Best practices: Enable for internet-facing services, use preconfigured rules

**Private Service Connect**
- Private connectivity to Google APIs and services
- Service Directory for service discovery
- Best practices: Use for all managed services in production

### Serverless and Event-Driven

**Pub/Sub**
- Global message queue
- At-least-once delivery
- Push and pull subscriptions
- Message ordering and filtering
- Dead-letter topics
- Best practices: Use message attributes for filtering, implement idempotent processing

**Eventarc**
- Event-driven architecture
- Triggers for Cloud Run, Workflows, GKE
- Sources: Audit Logs, Pub/Sub, custom events
- Best practices: Use for decoupled architectures, implement event filtering

**Cloud Scheduler**
- Fully managed cron service
- HTTP, Pub/Sub, and App Engine targets
- Best practices: Use for periodic tasks, implement retry logic

**Workflows**
- Orchestrate and automate GCP and HTTP services
- YAML-based workflow definition
- Built-in error handling and retry
- Best practices: Use for complex multi-step processes, implement compensating transactions

### Security and Identity

**Cloud IAM**
- Resource hierarchy: Organization -> Folders -> Projects -> Resources
- Roles: Primitive (Owner, Editor, Viewer), Predefined, Custom
- Service accounts for applications
- Workload Identity for GKE
- Best practices: Use predefined roles, least privilege, service accounts for apps

**Cloud Key Management (KMS)**
- Encryption key management
- Customer-managed encryption keys (CMEK)
- Hardware Security Module (HSM) backed
- Automatic key rotation
- Best practices: Enable automatic rotation, use separate keys per environment

**Secret Manager**
- Store API keys, passwords, certificates
- Versioning and access control
- Automatic rotation integration
- Best practices: Rotate secrets regularly, use IAM for access control

**Security Command Center**
- Centralized security and risk management
- Asset discovery and vulnerability scanning
- Threat detection and compliance monitoring
- Best practices: Enable all detectors, review findings regularly

**VPC Service Controls**
- Create security perimeters around GCP resources
- Prevent data exfiltration
- Best practices: Use for sensitive data, implement access levels

### AI and Machine Learning

**Vertex AI**
- Unified ML platform
- AutoML for custom models
- Pre-trained models (Vision, Natural Language, etc.)
- MLOps with pipelines
- Best practices: Use AutoML for quick start, implement feature store

**BigQuery ML**
- Create and execute ML models using SQL
- Model types: Linear regression, logistic regression, clustering, etc.
- Integration with Vertex AI
- Best practices: Use for simple models, leverage BigQuery's scale

## Architecture Patterns

### High Availability

**Multi-Zone Pattern**
```
Global HTTP(S) Load Balancer
    |
    v
Managed Instance Group (multi-zone)
    |
    v
Cloud SQL (regional, HA configuration)
    |
    v
Cloud Storage (multi-region)
```

**Multi-Region Pattern**
```
Global HTTP(S) Load Balancer
    |
    ├── Backend Service Region 1 (Cloud Run)
    └── Backend Service Region 2 (Cloud Run)
         |
         v
    Cloud Spanner (multi-region)
```

### Serverless Architecture

**Event-Driven Pattern**
```
Cloud Storage upload event
    |
    v
Pub/Sub topic
    |
    v
Cloud Functions (image processing)
    |
    v
Firestore (metadata storage)
```

**API-First Pattern**
```
Cloud Endpoints or API Gateway
    |
    v
Cloud Run (multiple services)
    |
    ├── Cloud SQL (transactional data)
    └── Firestore (user data)
```

### Microservices on GKE

**GKE with Service Mesh**
```
Global Load Balancer
    |
    v
GKE Ingress
    |
    v
Anthos Service Mesh (Istio)
    |
    v
Microservices (Cloud Spanner, Firestore, Memorystore)
```

### Data Analytics Platform

```
Data Sources
    |
    v
Pub/Sub (streaming)
    |
    v
Dataflow (Apache Beam)
    |
    v
BigQuery (data warehouse)
    |
    v
Looker or Data Studio (visualization)
```

**Batch Processing**
```
Cloud Storage (raw data)
    |
    v
Dataproc (Apache Spark)
    |
    v
BigQuery (analytics)
```

## Landing Zone Design

### Resource Hierarchy

```
Organization
├── Folders (by environment or team)
│   ├── Production Folder
│   │   ├── Project A
│   │   └── Project B
│   ├── Staging Folder
│   └── Development Folder
└── Shared Services Folder
    ├── Networking Project (Shared VPC host)
    ├── Security Project (KMS, Secret Manager)
    └── Logging Project (centralized logs)
```

### Network Design

**Shared VPC Pattern**
```
Host Project (networking team)
├── Shared VPC
│   ├── Subnet Production (region A)
│   ├── Subnet Staging (region A)
│   └── Subnet Development (region B)

Service Projects (application teams)
├── Production Project (uses Production subnet)
├── Staging Project (uses Staging subnet)
└── Development Project (uses Development subnet)
```

**Hub-and-Spoke with VPN**
```
On-premises Network
    |
    v
Cloud VPN / Interconnect
    |
    v
Hub VPC (shared services)
    |
    ├── Spoke VPC 1 (production workloads)
    ├── Spoke VPC 2 (development workloads)
    └── Spoke VPC 3 (analytics workloads)
```

### Governance

**Organization Policies**
- Restrict public IP assignment
- Enforce uniform bucket-level access
- Restrict VM external IP
- Define allowed resource locations

**IAM Strategy**
- Use Google Groups for role assignments
- Separate duties (network admin, security admin, etc.)
- Service accounts per application
- Workload Identity for GKE workloads

**Logging and Monitoring**
```
All Projects
    |
    v
Log Router
    |
    ├── Cloud Logging (default sink)
    ├── BigQuery (long-term analysis)
    ├── Cloud Storage (archive)
    └── Pub/Sub (real-time processing)
```

## Migration Strategies

### Migrate to Virtual Machines

**Tools**
- Migrate to Virtual Machines (formerly Migrate for Compute Engine)
- Supports VMware, AWS, Azure, physical servers
- Agentless or agent-based migration
- Waves and test clones

**Process**
1. Assess: Fit assessment and TCO analysis
2. Plan: Group VMs, define migration waves
3. Deploy: Set up infrastructure (VPC, firewall rules)
4. Migrate: Test migration, cutover, validation
5. Optimize: Right-sizing, committed use discounts

### Database Migration

**Database Migration Service**
- Minimal downtime migrations
- Supports MySQL, PostgreSQL, SQL Server, Oracle
- Continuous replication for cutover flexibility

**Transfer Appliance**
- Physical device for large data transfers
- Up to 1 PB capacity
- Offline data transfer

## Cost Optimization

### Compute Savings

**Committed Use Discounts**
- 1-year or 3-year commitments
- Up to 57% savings for VMs
- Resource-based or spend-based

**Sustained Use Discounts**
- Automatic discounts for running VMs >25% of month
- Up to 30% savings
- No commitment required

**Preemptible and Spot VMs**
- Up to 80% discount
- Can be terminated by GCP
- Best for batch processing, fault-tolerant workloads

**Recommender**
- VM rightsizing recommendations
- Idle resource identification
- Committed use discount recommendations

### Storage Savings

**Cloud Storage**
- Autoclass for automatic tier transitions
- Lifecycle policies (delete or transition)
- Nearline (30+ days), Coldline (90+ days), Archive (365+ days)
- Requester pays for data transfer

**Persistent Disk**
- Delete orphaned disks
- Use balanced SSD instead of SSD when possible
- Resize disks to match actual usage

### BigQuery Savings

**On-Demand Pricing**
- $5 per TB processed
- Use partitioning and clustering
- Query cache for free repeated queries

**Flat-Rate Pricing**
- Predictable costs for heavy users
- Autoscaling slots available
- Flex slots for short-term commitments

**Best Practices**
- Use approximate aggregation functions (APPROX_COUNT_DISTINCT)
- Avoid SELECT *, specify columns
- Use materialized views for common queries
- Set up cost controls with custom quotas

### Monitoring Costs

**Cloud Billing**
- Budgets and alerts
- Cost breakdown by project, service, SKU
- Export to BigQuery for analysis
- Recommendations from Active Assist

## Disaster Recovery

### Backup Strategies

**VM Backups**
- Persistent disk snapshots (incremental)
- Machine images (include metadata and config)
- Cross-region snapshot copy
- Snapshot schedules for automation

**Database Backups**
- Cloud SQL: Automated backups (7-365 days retention)
- Cloud Spanner: Backups on demand or scheduled
- Firestore: Automated daily exports
- Bigtable: Backups to Cloud Storage

### High Availability

**RTO/RPO Matrix**

| Pattern | RPO | RTO | Cost |
|---------|-----|-----|------|
| Active-Active Multi-Region | Seconds | Seconds | High |
| Active-Passive with Replication | Minutes | Minutes | Medium |
| Warm Standby | Minutes | 10-30 min | Medium |
| Backup and Restore | Hours | Hours | Low |

**Cloud SQL HA**
- Regional configuration with synchronous replication
- Automatic failover
- 99.95% SLA (vs 99.5% for single zone)

**Cloud Spanner**
- Multi-region configuration
- 99.999% availability SLA
- Synchronous replication across regions

### Disaster Recovery Testing

- Regular DR drills (quarterly recommended)
- Document runbooks
- Test restoration procedures
- Measure actual RTO/RPO vs targets

## Monitoring and Observability

### Cloud Monitoring (formerly Stackdriver)

**Metrics**
- System metrics (CPU, memory, disk, network)
- Custom metrics via Cloud Monitoring API
- Metric scopes for multi-project monitoring
- Uptime checks for availability

**Dashboards and Charts**
- Predefined dashboards for GCP services
- Custom dashboards with filters and grouping
- SLO monitoring with error budgets

### Cloud Logging

**Log Types**
- Admin Activity logs (always enabled, no charge)
- Data Access logs (must be enabled)
- System Event logs
- Access Transparency logs (for Google access)

**Log Sinks**
- Route logs to BigQuery, Cloud Storage, Pub/Sub
- Aggregated sinks at organization/folder level
- Exclusion filters to reduce costs

### Cloud Trace

**Distributed Tracing**
- Automatic instrumentation for App Engine, Cloud Run, GKE
- Manual instrumentation with client libraries
- Latency analysis and performance insights
- Integration with Zipkin

### Cloud Profiler

**Continuous Profiling**
- CPU and memory profiling
- Low overhead (< 0.5% CPU)
- Flame graphs for visualization
- Supported languages: Java, Go, Python, Node.js

### Error Reporting

**Aggregated Error Tracking**
- Automatic error grouping
- Stack trace analysis
- Integration with Cloud Logging
- Notifications for new errors