# GCP Architecture Reference

Comprehensive guide for Google Cloud Platform services, patterns, and architecture framework.

## Google Cloud Architecture Framework

### Five Pillars

1. **Operational Excellence**
   - Infrastructure as Code (Deployment Manager, Terraform)
   - CI/CD with Cloud Build
   - Monitoring with Cloud Monitoring (Stackdriver)
   - SRE principles and SLOs
   - Incident management

2. **Security, Privacy, and Compliance**
   - Identity and Access Management (Cloud IAM)
   - VPC Service Controls for data perimeter
   - Binary Authorization for containers
   - Data encryption (default at rest and in transit)
   - Security Command Center

3. **Reliability**
   - Multi-zone and multi-region deployments
   - Load balancing and autoscaling
   - Disaster recovery planning
   - Chaos engineering practices
   - SLIs, SLOs, and error budgets

4. **Cost Optimization**
   - Committed Use Discounts
   - Sustained Use Discounts (automatic)
   - Preemptible VMs and Spot VMs
   - Recommender for right-sizing
   - Active Assist for optimization

5. **Performance Optimization**
   - Cloud CDN and Media CDN
   - Caching strategies (Memorystore)
   - Database performance tuning
   - Network optimization (Premium vs Standard tier)
   - Regional and zonal resource placement

## Core Services Architecture

### Compute

**Compute Engine**
- Machine types: E2 (cost-optimized), N2 (balanced), C2 (compute-optimized), M2 (memory-optimized)
- Custom machine types for specific needs
- Preemptible VMs (up to 80% discount, max 24 hours)
- Spot VMs (successor to preemptible VMs, no 24-hour runtime limit)
- Instance groups: Managed (with autoscaling), unmanaged
- Best practices: Use latest generation, committed use discounts, Spot for batch jobs

**Cloud Run**
- Fully managed serverless container platform
- Auto-scaling to zero
- Pay per request
- CPU allocated only during request handling (default setting)
- Best practices: Stateless containers, optimize cold starts, use Cloud Run jobs for batch

**Cloud Functions**
- Event-driven serverless functions
- 1st gen: HTTP and background functions
- 2nd gen: Built on Cloud Run, better performance
- Event sources: Pub/Sub, Cloud Storage, Firestore, HTTP
- Best practices: Use 2nd gen, minimize cold starts, implement retry logic

**Google Kubernetes Engine (GKE)**
- Managed Kubernetes with GCP integration
- Autopilot mode: Fully managed, per-pod pricing
- Standard mode: More control, node management
- Workload Identity for secure service access
- Binary Authorization for deployment policies
- Best practices: Use Autopilot for simplicity, enable Workload Identity, implement network policies

**App Engine**
- Fully managed platform (PaaS)
- Standard environment (sandboxed, auto-scaling)
- Flexible environment (Docker containers, custom runtimes)
- Traffic splitting for canary deployments
- Best practices: Use Standard for web apps, Flexible for custom dependencies

### Storage

**Cloud Storage**
- Storage classes: Standard, Nearline (30-day minimum storage), Coldline (90-day minimum), Archive (365-day minimum)
- Object lifecycle management
- Object versioning and retention policies
- Autoclass for automatic tier transitions
- Requester Pays for data transfer
- Best practices: Use Autoclass, enable versioning, implement lifecycle policies

**Persistent Disk**
- Types: Standard (HDD), Balanced SSD, SSD, Extreme
- Zonal and regional persistent disks
- Snapshots for backup (incremental)
- Disk resize without downtime
- Best practices: Use Balanced SSD for most workloads, enable snapshots

**Filestore**
- Managed NFS file storage
- Tiers: Basic (1-63.9 TB), Enterprise (1-10 TB, better performance)
- Backup to Cloud Storage
- Best practices: Use Enterprise for production, implement backups

**Cloud Storage for Firebase**
- Object storage for mobile and web apps
- Client SDKs for direct upload/download
- Security rules for access control

### Database

**Cloud SQL**
- Managed MySQL, PostgreSQL, SQL Server
- High availability configuration (regional)
- Read replicas for scaling
- Automated backups and point-in-time recovery
- Best practices: Enable HA, use read replicas, implement connection pooling in the application, connect through the Cloud SQL Auth Proxy

**Cloud Spanner**
- Globally distributed relational database
- Horizontal scalability with strong consistency
- Multi-region for 99.999% availability
- TrueTime for global consistency
- Best practices: Design proper schema splits, use commit timestamps, avoid hotspots

**Firestore (Native mode)**
- NoSQL document database
- Real-time synchronization
- Offline support for mobile
- ACID transactions
- Best practices: Design document structure carefully, use collection group queries wisely

**Bigtable**
- NoSQL wide-column database
- Petabyte-scale with single-digit millisecond latency
- HBase API compatible
- Linear scalability by adding nodes
- Best practices: Design row keys to avoid hotspots, use replication for HA

**Memorystore**
- Managed Redis and Memcached
- Standard tier (HA with replica) and Basic tier
- Best practices: Use Standard tier for production, implement connection pooling

**BigQuery**
- Serverless data warehouse
- SQL analytics on petabyte-scale data
- Column-oriented storage
- Automatic caching and optimization
- Best practices: Partition and cluster tables, use approximate functions, control costs with quotas

### Networking

**VPC (Virtual Private Cloud)**
- Global resource (subnets are regional)
- Custom or auto mode networks
- Firewall rules (stateful)
- VPC peering and Shared VPC
- Private Google Access for GCP services
- Best practices: Use custom mode VPC, plan IP ranges, implement firewall rules

**Cloud Load Balancing**
- Global load balancing (HTTP(S), TCP/SSL Proxy, external TCP/UDP)
- Regional load balancing (internal HTTP(S), internal TCP/UDP)
- Anycast IP for global distribution
- Backend services with health checks
- Best practices: Use global for multi-region, enable CDN, configure health checks

**Cloud CDN**
- Global content delivery network
- Cache invalidation and signed URLs
- Integration with Cloud Storage and compute
- Best practices: Enable compression, use cache-control headers

**Cloud Interconnect and VPN**
- Dedicated Interconnect (10 Gbps or 100 Gbps)
- Partner Interconnect (50 Mbps to 50 Gbps)
- Cloud VPN (HA VPN for 99.99% SLA)
- Best practices: Use HA VPN for redundancy, Dedicated Interconnect for high bandwidth

**Cloud Armor**
- DDoS protection and WAF
- Preconfigured and custom rules
- Adaptive protection (ML-based)
- Best practices: Enable for internet-facing services, use preconfigured rules

**Private Service Connect**
- Private connectivity to Google APIs and services
- Service Directory for service discovery
- Best practices: Use for all managed services in production

### Serverless and Event-Driven

**Pub/Sub**
- Global message queue
- At-least-once delivery
- Push and pull subscriptions
- Message ordering and filtering
- Dead-letter topics
- Best practices: Use message attributes for filtering, implement idempotent processing

**Eventarc**
- Event-driven architecture
- Triggers for Cloud Run, Workflows, GKE
- Sources: Audit Logs, Pub/Sub, custom events
- Best practices: Use for decoupled architectures, implement event filtering

**Cloud Scheduler**
- Fully managed cron service
- HTTP, Pub/Sub, and App Engine targets
- Best practices: Use for periodic tasks, implement retry logic

**Workflows**
- Orchestrate and automate GCP and HTTP services
- YAML-based workflow definition
- Built-in error handling and retry
- Best practices: Use for complex multi-step processes, implement compensating transactions

### Security and Identity

**Cloud IAM**
- Resource hierarchy: Organization -> Folders -> Projects -> Resources
- Roles: Basic (Owner, Editor, Viewer; formerly called primitive roles), Predefined, Custom
- Service accounts for applications
- Workload Identity for GKE
- Best practices: Use predefined roles, least privilege, service accounts for apps

**Cloud Key Management Service (KMS)**
- Encryption key management
- Customer-managed encryption keys (CMEK)
- Hardware Security Module (HSM) backed
- Automatic key rotation
- Best practices: Enable automatic rotation, use separate keys per environment

**Secret Manager**
- Store API keys, passwords, certificates
- Versioning and access control
- Automatic rotation integration
- Best practices: Rotate secrets regularly, use IAM for access control

**Security Command Center**
- Centralized security and risk management
- Asset discovery and vulnerability scanning
- Threat detection and compliance monitoring
- Best practices: Enable all detectors, review findings regularly

**VPC Service Controls**
- Create security perimeters around GCP resources
- Prevent data exfiltration
- Best practices: Use for sensitive data, implement access levels

### AI and Machine Learning

**Vertex AI**
- Unified ML platform
- AutoML for custom models
- Pre-trained models (Vision, Natural Language, etc.)
- MLOps with pipelines
- Best practices: Use AutoML for quick start, implement feature store

**BigQuery ML**
- Create and execute ML models using SQL
- Model types: Linear regression, logistic regression, clustering, etc.
- Integration with Vertex AI
- Best practices: Use for simple models, leverage BigQuery's scale

## Architecture Patterns

### High Availability

**Multi-Zone Pattern**
```
Global HTTP(S) Load Balancer
    |
    v
Managed Instance Group (multi-zone)
    |
    v
Cloud SQL (regional, HA configuration)
    |
    v
Cloud Storage (multi-region)
```

**Multi-Region Pattern**
```
Global HTTP(S) Load Balancer
    |
    ├── Backend Service Region 1 (Cloud Run)
    └── Backend Service Region 2 (Cloud Run)
            |
            v
    Cloud Spanner (multi-region)
```

### Serverless Architecture

**Event-Driven Pattern**
```
Cloud Storage upload event
    |
    v
Pub/Sub topic
    |
    v
Cloud Functions (image processing)
    |
    v
Firestore (metadata storage)
```

**API-First Pattern**
```
Cloud Endpoints or API Gateway
    |
    v
Cloud Run (multiple services)
    |
    ├── Cloud SQL (transactional data)
    └── Firestore (user data)
```

### Microservices on GKE

**GKE with Service Mesh**
```
Global Load Balancer
    |
    v
GKE Ingress
    |
    v
Anthos Service Mesh (Istio)
    |
    v
Microservices (Cloud Spanner, Firestore, Memorystore)
```

### Data Analytics Platform

```
Data Sources
    |
    v
Pub/Sub (streaming)
    |
    v
Dataflow (Apache Beam)
    |
    v
BigQuery (data warehouse)
    |
    v
Looker or Looker Studio, formerly Data Studio (visualization)
```

**Batch Processing**
```
Cloud Storage (raw data)
    |
    v
Dataproc (Apache Spark)
    |
    v
BigQuery (analytics)
```

## Landing Zone Design

### Resource Hierarchy

```
Organization
├── Folders (by environment or team)
│   ├── Production Folder
│   │   ├── Project A
│   │   └── Project B
│   ├── Staging Folder
│   └── Development Folder
└── Shared Services Folder
    ├── Networking Project (Shared VPC host)
    ├── Security Project (KMS, Secret Manager)
    └── Logging Project (centralized logs)
```

### Network Design

**Shared VPC Pattern**
```
Host Project (networking team)
├── Shared VPC
│   ├── Subnet Production (region A)
│   ├── Subnet Staging (region A)
│   └── Subnet Development (region B)

Service Projects (application teams)
├── Production Project (uses Production subnet)
├── Staging Project (uses Staging subnet)
└── Development Project (uses Development subnet)
```

**Hub-and-Spoke with VPN**
```
On-premises Network
    |
    v
Cloud VPN / Interconnect
    |
    v
Hub VPC (shared services)
    |
    ├── Spoke VPC 1 (production workloads)
    ├── Spoke VPC 2 (development workloads)
    └── Spoke VPC 3 (analytics workloads)
```

### Governance

**Organization Policies**
- Restrict public IP assignment
- Enforce uniform bucket-level access
- Restrict VM external IP
- Define allowed resource locations

**IAM Strategy**
- Use Google Groups for role assignments
- Separate duties (network admin, security admin, etc.)
- Service accounts per application
- Workload Identity for GKE workloads

**Logging and Monitoring**
```
All Projects
    |
    v
Log Router
    |
    ├── Cloud Logging (default sink)
    ├── BigQuery (long-term analysis)
    ├── Cloud Storage (archive)
    └── Pub/Sub (real-time processing)
```

## Migration Strategies

### Migrate to Virtual Machines

**Tools**
- Migrate to Virtual Machines (formerly Migrate for Compute Engine)
- Supports VMware, AWS, Azure, physical servers
- Agentless or agent-based migration
- Waves and test clones

**Process**
1. Assess: Fit assessment and TCO analysis
2. Plan: Group VMs, define migration waves
3. Deploy: Set up infrastructure (VPC, firewall rules)
4. Migrate: Test migration, cutover, validation
5. Optimize: Right-sizing, committed use discounts

### Database Migration

**Database Migration Service**
- Minimal downtime migrations
- Supports MySQL, PostgreSQL, SQL Server, Oracle
- Continuous replication for cutover flexibility

**Transfer Appliance**
- Physical device for large data transfers
- Up to 1 PB capacity
- Offline data transfer

## Cost Optimization

### Compute Savings

**Committed Use Discounts**
- 1-year or 3-year commitments
- Up to 57% savings for VMs
- Resource-based or spend-based

**Sustained Use Discounts**
- Automatic discounts for VMs running more than 25% of the month
- Up to 30% savings
- No commitment required

**Preemptible and Spot VMs**
- Up to 80% discount
- Can be terminated by GCP
- Best for batch processing, fault-tolerant workloads

**Recommender**
- VM rightsizing recommendations
- Idle resource identification
- Committed use discount recommendations

### Storage Savings

**Cloud Storage**
- Autoclass for automatic tier transitions
- Lifecycle policies (delete or transition)
- Nearline (30+ days), Coldline (90+ days), Archive (365+ days)
- Requester Pays for data transfer

**Persistent Disk**
- Delete orphaned disks
- Use Balanced SSD instead of SSD when possible
- Resize disks to match actual usage

### BigQuery Savings

**On-Demand Pricing**
- $5 per TB processed
- Use partitioning and clustering
- Query cache for free repeated queries

**Flat-Rate Pricing**
- Predictable costs for heavy users
- Autoscaling slots available
- Flex slots for short-term commitments

**Best Practices**
- Use approximate aggregation functions (APPROX_COUNT_DISTINCT)
- Avoid SELECT *, specify columns
- Use materialized views for common queries
- Set up cost controls with custom quotas

### Monitoring Costs

**Cloud Billing**
- Budgets and alerts
- Cost breakdown by project, service, SKU
- Export to BigQuery for analysis
- Recommendations from Active Assist

## Disaster Recovery

### Backup Strategies

**VM Backups**
- Persistent disk snapshots (incremental)
- Machine images (include metadata and config)
- Cross-region snapshot copy
- Snapshot schedules for automation

**Database Backups**
- Cloud SQL: Automated backups (7-365 days retention)
- Cloud Spanner: Backups on demand or scheduled
- Firestore: Automated daily exports
- Bigtable: Backups to Cloud Storage

### High Availability

**RTO/RPO Matrix**

| Pattern | RPO | RTO | Cost |
|---------|-----|-----|------|
| Active-Active Multi-Region | Seconds | Seconds | High |
| Active-Passive with Replication | Minutes | Minutes | Medium |
| Warm Standby | Minutes | 10-30 min | Medium |
| Backup and Restore | Hours | Hours | Low |

**Cloud SQL HA**
- Regional configuration with synchronous replication
- Automatic failover
- 99.95% SLA (vs 99.5% for single zone)

**Cloud Spanner**
- Multi-region configuration
- 99.999% availability SLA
- Synchronous replication across regions

### Disaster Recovery Testing

- Regular DR drills (quarterly recommended)
- Document runbooks
- Test restoration procedures
- Measure actual RTO/RPO vs targets

## Monitoring and Observability

### Cloud Monitoring (formerly Stackdriver)

**Metrics**
- System metrics (CPU, memory, disk, network)
- Custom metrics via Cloud Monitoring API
- Metric scopes for multi-project monitoring
- Uptime checks for availability

**Dashboards and Charts**
- Predefined dashboards for GCP services
- Custom dashboards with filters and grouping
- SLO monitoring with error budgets

### Cloud Logging

**Log Types**
- Admin Activity logs (always enabled, no charge)
- Data Access logs (must be enabled)
- System Event logs
- Access Transparency logs (for Google access)

**Log Sinks**
- Route logs to BigQuery, Cloud Storage, Pub/Sub
- Aggregated sinks at organization/folder level
- Exclusion filters to reduce costs

### Cloud Trace

**Distributed Tracing**
- Automatic instrumentation for App Engine, Cloud Run, GKE
- Manual instrumentation with client libraries
- Latency analysis and performance insights
- Integration with Zipkin

### Cloud Profiler

**Continuous Profiling**
- CPU and memory profiling
- Low overhead (< 0.5% CPU)
- Flame graphs for visualization
- Supported languages: Java, Go, Python, Node.js

### Error Reporting

**Aggregated Error Tracking**
- Automatic error grouping
- Stack trace analysis
- Integration with Cloud Logging
- Notifications for new errors
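
### Example: Error Budget Arithmetic

The availability figures quoted in this guide (99.5% and 99.95% for Cloud SQL, 99.99% for HA VPN, 99.999% for multi-region Cloud Spanner) are easier to compare when converted into allowed downtime. The sketch below does that conversion and shows how much of an error budget remains after some observed downtime. It is a minimal illustration, not a Google Cloud API: the function names, the `TARGETS` table, and the 30-day window are assumptions chosen for the example, and real SLAs define their own measurement periods and what counts as downtime.

```python
from datetime import timedelta

# Availability targets quoted in this guide, in percent.
# The dictionary itself is a hypothetical helper for this example.
TARGETS = {
    "Cloud SQL (single zone)": 99.5,
    "Cloud SQL (regional HA)": 99.95,
    "HA VPN": 99.99,
    "Cloud Spanner (multi-region)": 99.999,
}

# Assumption: a 30-day window; actual SLAs may measure differently.
DEFAULT_PERIOD = timedelta(days=30)


def allowed_downtime_minutes(
    availability_pct: float, period: timedelta = DEFAULT_PERIOD
) -> float:
    """Return the downtime budget, in minutes, for a target over a period."""
    if not 0 < availability_pct <= 100:
        raise ValueError("availability_pct must be in (0, 100]")
    period_minutes = period.total_seconds() / 60
    return period_minutes * (100 - availability_pct) / 100


def budget_remaining(
    availability_pct: float,
    observed_downtime_minutes: float,
    period: timedelta = DEFAULT_PERIOD,
) -> float:
    """Return the unspent fraction of the error budget (negative if overspent)."""
    budget = allowed_downtime_minutes(availability_pct, period)
    if budget == 0:
        raise ValueError("a 100% target leaves no error budget")
    return 1 - observed_downtime_minutes / budget


if __name__ == "__main__":
    for name, pct in TARGETS.items():
        minutes = allowed_downtime_minutes(pct)
        print(f"{name}: {pct}% -> {minutes:.2f} min per 30 days")
```

Over a 30-day window these targets work out to roughly 216 minutes, 21.6 minutes, 4.32 minutes, and 0.43 minutes of allowed downtime respectively. A call such as `budget_remaining(99.95, 10.8)` returns about 0.5, meaning half of the monthly budget is still available, which is the kind of figure an SLO dashboard with error budgets is meant to surface.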