# GCP Architecture Reference

Comprehensive guide to Google Cloud Platform services, patterns, and the architecture framework.

## Google Cloud Architecture Framework

### Five Pillars

1. **Operational Excellence**
   - Infrastructure as Code (Deployment Manager, Terraform)
   - CI/CD with Cloud Build
   - Monitoring with Cloud Monitoring (formerly Stackdriver)
   - SRE principles and SLOs
   - Incident management

2. **Security, Privacy, and Compliance**
   - Identity and Access Management (Cloud IAM)
   - VPC Service Controls for data perimeter
   - Binary Authorization for containers
   - Data encryption (default at rest and in transit)
   - Security Command Center

3. **Reliability**
   - Multi-zone and multi-region deployments
   - Load balancing and autoscaling
   - Disaster recovery planning
   - Chaos engineering practices
   - SLIs, SLOs, and error budgets

4. **Cost Optimization**
   - Committed Use Discounts
   - Sustained Use Discounts (automatic)
   - Preemptible VMs and Spot VMs
   - Recommender for right-sizing
   - Active Assist for optimization

5. **Performance Optimization**
   - Cloud CDN and Media CDN
   - Caching strategies (Memorystore)
   - Database performance tuning
   - Network optimization (Premium vs Standard tier)
   - Regional and zonal resource placement

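The SLO and error-budget items listed under Operational Excellence and Reliability come down to simple arithmetic. The following is a minimal sketch, assuming a request-based SLI; the function names and the 99.9% target are illustrative and not part of any Google Cloud API.

```python
def error_budget(slo_target: float, total_requests: int) -> float:
    """Number of requests that may fail in the window without breaching the SLO."""
    return (1.0 - slo_target) * total_requests


def budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still unspent (negative means overspent)."""
    budget = error_budget(slo_target, total_requests)
    if budget == 0:
        # A 100% SLO leaves no budget at all.
        return 0.0 if failed_requests == 0 else float("-inf")
    return 1.0 - failed_requests / budget


if __name__ == "__main__":
    # Hypothetical 30-day window: 10 million requests, 4,000 failures.
    print(round(error_budget(0.999, 10_000_000)))
    print(round(budget_remaining(0.999, 10_000_000, 4_000), 2))
```

A remaining fraction near zero is the usual signal to slow feature releases and prioritize reliability work.
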
## Core Services Architecture

### Compute

**Compute Engine**
- Machine types: E2 (cost-optimized), N2 (balanced), C2 (compute-optimized), M2 (memory-optimized)
- Custom machine types for specific needs
- Preemptible VMs (up to 80% discount, max 24 hours)
- Spot VMs (successor to preemptible VMs; same pricing model, no 24-hour runtime limit)
- Instance groups: managed (with autoscaling) and unmanaged
- Best practices: Use latest generation, committed use discounts, Spot for batch jobs

**Cloud Run**
- Fully managed serverless container platform
- Scales to zero when idle
- Pay only for resources used while handling requests
- CPU allocated only during request handling (default; always-allocated CPU is optional)
- Best practices: Stateless containers, optimize cold starts, use Cloud Run jobs for batch

**Cloud Functions**
- Event-driven serverless functions
- 1st gen: HTTP and background functions
- 2nd gen: Built on Cloud Run and Eventarc; supports longer request timeouts and concurrent requests per instance
- Event sources: Pub/Sub, Cloud Storage, Firestore, HTTP
- Best practices: Use 2nd gen, minimize cold starts, implement retry logic

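The retry logic recommended for Cloud Functions above is commonly implemented as exponential backoff with jitter. This is a generic sketch using only the standard library; `TransientError`, `retry_with_backoff`, and the default delays are hypothetical names and values, not part of the Cloud Functions runtime.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for failures that are safe to retry."""


def retry_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation until it succeeds or max_attempts is exhausted."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Full jitter: wait a random time up to the capped exponential delay.
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, cap))
    raise AssertionError("unreachable")
```

Retries only make sense when the wrapped operation is idempotent, since a call that timed out may still have taken effect.
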
**Google Kubernetes Engine (GKE)**
- Managed Kubernetes with GCP integration
- Autopilot mode: Fully managed nodes, per-pod pricing
- Standard mode: More control, you manage the nodes
- Workload Identity for keyless access to Google Cloud APIs from pods
- Binary Authorization for deployment policies
- Best practices: Use Autopilot for simplicity, enable Workload Identity, implement network policies

**App Engine**
- Fully managed platform (PaaS)
- Standard environment (sandboxed, auto-scaling)
- Flexible environment (Docker containers, custom runtimes)
- Traffic splitting for canary deployments
- Best practices: Use Standard for web apps, Flexible for custom dependencies

### Storage

**Cloud Storage**
- Storage classes: Standard, Nearline (30-day minimum storage duration), Coldline (90-day), Archive (365-day)
- Object lifecycle management
- Object versioning and retention policies
- Autoclass for automatic tier transitions
- Requester Pays to bill access and transfer costs to the requester
- Best practices: Use Autoclass, enable versioning, implement lifecycle policies

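A lifecycle policy like the one recommended above is expressed as a JSON document of rules, each pairing an action with a condition. The sketch below builds such a document in Python. The age thresholds are illustrative and the helper name is hypothetical. Because Autoclass manages class transitions itself, this sketch assumes a bucket that does not use Autoclass.

```python
import json
from typing import Optional


def lifecycle_config(
    nearline_after_days: int = 30,
    coldline_after_days: int = 90,
    archive_after_days: int = 365,
    delete_after_days: Optional[int] = None,
) -> dict:
    """Build a lifecycle configuration that moves objects to colder classes by age."""
    transitions = [
        ("NEARLINE", nearline_after_days),
        ("COLDLINE", coldline_after_days),
        ("ARCHIVE", archive_after_days),
    ]
    rules = [
        {
            "action": {"type": "SetStorageClass", "storageClass": storage_class},
            "condition": {"age": age_days},
        }
        for storage_class, age_days in transitions
    ]
    if delete_after_days is not None:
        # Optional final rule: delete objects older than the given age.
        rules.append({"action": {"type": "Delete"}, "condition": {"age": delete_after_days}})
    return {"rule": rules}


if __name__ == "__main__":
    print(json.dumps(lifecycle_config(delete_after_days=730), indent=2))
```

The printed JSON can be saved to a file and applied to a bucket with the storage tooling of your choice.
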
**Persistent Disk**
- Types: Standard (`pd-standard`, HDD), Balanced (`pd-balanced`, SSD), SSD (`pd-ssd`), Extreme (`pd-extreme`)
- Zonal and regional persistent disks
- Snapshots for backup (incremental)
- Disk size can be increased without downtime (disks cannot be shrunk)
- Best practices: Use Balanced SSD for most workloads, schedule snapshots

**Filestore**
- Managed NFS file storage
- Tiers: Basic (1-63.9 TB), Enterprise (1-10 TB, better performance)
- Backup to Cloud Storage
- Best practices: Use Enterprise for production, implement backups

**Cloud Storage for Firebase**
- Object storage for mobile and web apps
- Client SDKs for direct upload/download
- Security rules for access control

### Database

**Cloud SQL**
- Managed MySQL, PostgreSQL, SQL Server
- High availability configuration (regional)
- Read replicas for scaling
- Automated backups and point-in-time recovery
- Best practices: Enable HA, use read replicas, connect through the Cloud SQL Auth Proxy, and pool connections in the application (the proxy secures connections but does not pool them)

**Cloud Spanner**
- Globally distributed relational database
- Horizontal scalability with strong consistency
- Multi-region for 99.999% availability
- TrueTime for global consistency
- Best practices: Choose primary keys that spread load across splits, use commit timestamps, avoid hotspots such as monotonically increasing keys

**Firestore (Native mode)**
- NoSQL document database
- Real-time synchronization
- Offline support for mobile
- ACID transactions
- Best practices: Design document structure carefully, use collection group queries wisely

**Bigtable**
- NoSQL wide-column database
- Petabyte-scale with single-digit millisecond latency
- HBase API compatible
- Linear scalability by adding nodes
- Best practices: Design row keys to avoid hotspots, use replication for HA

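The advice to design Bigtable row keys that avoid hotspots can be illustrated with a small key-building helper. This is a sketch of one common approach (entity identifier first, reversed timestamp second, optional hash prefix). The `row_key` function, the `#` separator, and the `MAX_TS_MS` bound are assumptions made for illustration, not a Bigtable API.

```python
import hashlib

MAX_TS_MS = 10**13  # hypothetical upper bound for millisecond timestamps


def row_key(device_id: str, event_ts_ms: int, salt_buckets: int = 0) -> str:
    """Build a row key that avoids purely sequential writes.

    Leading with the device ID spreads writes across the key space, and the
    reversed timestamp makes the newest event for a device sort first.
    """
    reversed_ts = MAX_TS_MS - event_ts_ms
    key = f"{device_id}#{reversed_ts:013d}"
    if salt_buckets > 0:
        # Optional salt: a stable hash prefix spreads one hot device prefix
        # range across several key ranges, at the cost of fan-out on reads.
        digest = hashlib.sha256(device_id.encode("utf-8")).digest()
        key = f"{digest[0] % salt_buckets:02d}#{key}"
    return key


if __name__ == "__main__":
    print(row_key("sensor-42", 1_700_000_000_000))
    print(row_key("sensor-42", 1_700_000_060_000))
```

A key that starts with a raw timestamp would instead send all current writes to the same tablet, which is the hotspot pattern to avoid.
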
**Memorystore**
- Managed Redis and Memcached
- Standard tier (HA with replica) and Basic tier
- Best practices: Use Standard tier for production, implement connection pooling

**BigQuery**
- Serverless data warehouse
- SQL analytics on petabyte-scale data
- Column-oriented storage
- Automatic query result caching and optimization
- Best practices: Partition and cluster tables, use approximate functions, control costs with quotas

### Networking

**VPC (Virtual Private Cloud)**
- Global resource (subnets are regional)
- Custom or auto mode networks
- Firewall rules (stateful)
- VPC peering and Shared VPC
- Private Google Access to reach Google APIs without external IP addresses
- Best practices: Use custom mode VPC, plan IP ranges, implement least-privilege firewall rules

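Planning IP ranges for a custom mode VPC, as recommended above, can be checked with Python's standard `ipaddress` module before any subnet is created. This is a minimal sketch; the `10.0.0.0/16` block, the `/20` subnet size, and the subnet names are illustrative assumptions.

```python
import ipaddress
from typing import Dict, List


def plan_subnets(vpc_cidr: str, names: List[str], new_prefix: int) -> Dict[str, ipaddress.IPv4Network]:
    """Carve equally sized, non-overlapping subnets out of one address block."""
    network = ipaddress.ip_network(vpc_cidr)
    candidates = network.subnets(new_prefix=new_prefix)
    plan = {}
    for name in names:
        try:
            plan[name] = next(candidates)
        except StopIteration:
            raise ValueError(f"{vpc_cidr} has no room for subnet {name!r}") from None
    return plan


if __name__ == "__main__":
    plan = plan_subnets("10.0.0.0/16", ["production", "staging", "development"], 20)
    for name, subnet in plan.items():
        print(f"{name}: {subnet} ({subnet.num_addresses} addresses)")
```

Leaving unallocated space in the parent block makes it easier to add regions or secondary ranges later without renumbering.
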
**Cloud Load Balancing**
- Global load balancing (external HTTP(S), SSL Proxy, TCP Proxy)
- Regional load balancing (external passthrough TCP/UDP, internal HTTP(S), internal TCP/UDP)
- Anycast IP for global distribution
- Backend services with health checks
- Best practices: Use global for multi-region, enable Cloud CDN for cacheable content, configure health checks

**Cloud CDN**
- Global content delivery network
- Cache invalidation and signed URLs
- Integration with Cloud Storage and compute backends
- Best practices: Enable compression, use cache-control headers

**Cloud Interconnect and VPN**
- Dedicated Interconnect (10 Gbps or 100 Gbps)
- Partner Interconnect (50 Mbps to 50 Gbps)
- Cloud VPN (HA VPN for 99.99% SLA)
- Best practices: Use HA VPN for redundancy, Dedicated Interconnect for high bandwidth

**Cloud Armor**
- DDoS protection and WAF
- Preconfigured and custom rules
- Adaptive protection (ML-based)
- Best practices: Enable for internet-facing services, use preconfigured rules

**Private Service Connect**
- Private connectivity to Google APIs and services
- Service Directory for service discovery
- Best practices: Use for all managed services in production

### Serverless and Event-Driven

**Pub/Sub**
- Global messaging service (publish/subscribe)
- At-least-once delivery by default (exactly-once delivery is optional for pull subscriptions)
- Push and pull subscriptions
- Message ordering and filtering
- Dead-letter topics
- Best practices: Use message attributes for filtering, implement idempotent processing

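Because Pub/Sub may deliver the same message more than once, the idempotent processing recommended above usually means tracking which message IDs have already been handled. The sketch below keeps that record in memory for illustration only. `IdempotentHandler` is a hypothetical class, and a real deployment would need a shared store (for example a database or cache) in place of the in-memory set.

```python
from dataclasses import dataclass, field
from typing import Callable, Set


@dataclass
class IdempotentHandler:
    """Wrap a processing function so redelivered messages are skipped."""

    process: Callable[[bytes], None]
    seen_ids: Set[str] = field(default_factory=set)

    def handle(self, message_id: str, data: bytes) -> bool:
        """Return True if the message was processed, False if it was a duplicate."""
        if message_id in self.seen_ids:
            return False
        self.process(data)
        # Record the ID only after success, so a failed attempt is retried
        # when the message is redelivered.
        self.seen_ids.add(message_id)
        return True


if __name__ == "__main__":
    handler = IdempotentHandler(process=lambda data: print("processing", data))
    print(handler.handle("msg-1", b"resize image"))
    print(handler.handle("msg-1", b"resize image"))  # duplicate delivery
```

Recording the ID after processing trades a small chance of double work for never losing a message; the processing step itself should therefore also be safe to repeat.
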
**Eventarc**
- Event-driven architecture
- Triggers for Cloud Run, Workflows, GKE
- Sources: Audit Logs, Pub/Sub, custom events
- Best practices: Use for decoupled architectures, implement event filtering

**Cloud Scheduler**
- Fully managed cron service
- HTTP, Pub/Sub, and App Engine targets
- Best practices: Use for periodic tasks, implement retry logic

**Workflows**
- Orchestrate and automate GCP and HTTP services
- YAML-based workflow definition
- Built-in error handling and retry
- Best practices: Use for complex multi-step processes, implement compensating transactions

### Security and Identity

**Cloud IAM**
- Resource hierarchy: Organization -> Folders -> Projects -> Resources
- Roles: Basic (formerly primitive: Owner, Editor, Viewer), Predefined, Custom
- Service accounts for applications
- Workload Identity for GKE
- Best practices: Prefer predefined roles over basic roles, apply least privilege, use service accounts for apps

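One way to act on the least-privilege guidance above is to scan an exported IAM policy for basic-role bindings. This sketch assumes the policy has been exported as JSON with a top-level `bindings` list of `role` and `members` entries; the function name and the sample members are hypothetical.

```python
from typing import List, Tuple

# Basic (formerly primitive) roles, which grant broad project-wide access.
BASIC_ROLES = {"roles/owner", "roles/editor", "roles/viewer"}


def find_basic_role_bindings(policy: dict) -> List[Tuple[str, str]]:
    """Return (role, member) pairs where a basic role is granted."""
    findings = []
    for binding in policy.get("bindings", []):
        role = binding.get("role")
        if role in BASIC_ROLES:
            for member in binding.get("members", []):
                findings.append((role, member))
    return findings


if __name__ == "__main__":
    # Hypothetical policy document for illustration.
    sample_policy = {
        "bindings": [
            {"role": "roles/editor", "members": ["user:dev@example.com"]},
            {"role": "roles/storage.objectViewer", "members": ["group:analysts@example.com"]},
        ]
    }
    for role, member in find_basic_role_bindings(sample_policy):
        print(f"{member} holds basic role {role}")
```

Each finding is a candidate for replacement with a narrower predefined or custom role.
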
**Cloud Key Management Service (KMS)**
- Encryption key management
- Customer-managed encryption keys (CMEK)
- Hardware Security Module (HSM) backed keys via Cloud HSM
- Automatic key rotation (symmetric keys)
- Best practices: Enable automatic rotation, use separate keys per environment

**Secret Manager**
- Store API keys, passwords, certificates
- Versioning and access control
- Rotation schedules with Pub/Sub notifications (the rotation itself is performed by your own code)
- Best practices: Rotate secrets regularly, use IAM for access control

**Security Command Center**
- Centralized security and risk management
- Asset discovery and vulnerability scanning
- Threat detection and compliance monitoring
- Best practices: Enable all detectors, review findings regularly

**VPC Service Controls**
- Create security perimeters around GCP resources
- Prevent data exfiltration
- Best practices: Use for sensitive data, implement access levels

### AI and Machine Learning

**Vertex AI**
- Unified ML platform
- AutoML for custom models
- Pre-trained models (Vision, Natural Language, etc.)
- MLOps with pipelines
- Best practices: Use AutoML for quick start, implement feature store

**BigQuery ML**
- Create and execute ML models using SQL
- Model types: Linear regression, logistic regression, clustering, etc.
- Integration with Vertex AI
- Best practices: Use for simple models, leverage BigQuery's scale

## Architecture Patterns

### High Availability

**Multi-Zone Pattern**
```
Global HTTP(S) Load Balancer
         |
         v
Managed Instance Group (multi-zone)
         |
         v
Cloud SQL (regional, HA configuration)
         |
         v
Cloud Storage (multi-region)
```

**Multi-Region Pattern**
```
Global HTTP(S) Load Balancer
         |
         ├── Backend Service Region 1 (Cloud Run)
         └── Backend Service Region 2 (Cloud Run)
                  |
                  v
         Cloud Spanner (multi-region)
```

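The benefit of the multi-zone and multi-region patterns above can be estimated with basic availability arithmetic: components in series multiply, and redundant copies fail only when all copies fail. This sketch assumes independent failures, which real systems only approximate, and the figures are illustrative rather than SLA values.

```python
import math


def serial(*availabilities: float) -> float:
    """Availability of a chain in which every component must be up."""
    return math.prod(availabilities)


def redundant(availability: float, copies: int) -> float:
    """Availability of N independent copies where any one copy is sufficient."""
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return 1.0 - (1.0 - availability) ** copies


if __name__ == "__main__":
    single_region = serial(0.999, 0.9995)    # hypothetical app tier and database
    two_regions = redundant(single_region, 2)
    print(f"single region: {single_region:.6f}")
    print(f"two regions:   {two_regions:.6f}")
```

Shared dependencies such as a global load balancer or a single database sit in series with the redundant tiers and cap the overall figure.
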
### Serverless Architecture

**Event-Driven Pattern**
```
Cloud Storage upload event
         |
         v
Pub/Sub topic
         |
         v
Cloud Functions (image processing)
         |
         v
Firestore (metadata storage)
```

**API-First Pattern**
```
Cloud Endpoints or API Gateway
         |
         v
Cloud Run (multiple services)
         |
         ├── Cloud SQL (transactional data)
         └── Firestore (user data)
```

### Microservices on GKE

**GKE with Service Mesh**
```
Global Load Balancer
         |
         v
GKE Ingress
         |
         v
Anthos Service Mesh (Istio)
         |
         v
Microservices (Cloud Spanner, Firestore, Memorystore)
```

### Data Analytics Platform

```
Data Sources
     |
     v
Pub/Sub (streaming)
     |
     v
Dataflow (Apache Beam)
     |
     v
BigQuery (data warehouse)
     |
     v
Looker or Looker Studio, formerly Data Studio (visualization)
```

**Batch Processing**
```
Cloud Storage (raw data)
     |
     v
Dataproc (Apache Spark)
     |
     v
BigQuery (analytics)
```

## Landing Zone Design

### Resource Hierarchy

```
Organization
├── Folders (by environment or team)
│   ├── Production Folder
│   │   ├── Project A
│   │   └── Project B
│   ├── Staging Folder
│   └── Development Folder
└── Shared Services Folder
    ├── Networking Project (Shared VPC host)
    ├── Security Project (KMS, Secret Manager)
    └── Logging Project (centralized logs)
```

### Network Design

**Shared VPC Pattern**
```
Host Project (networking team)
├── Shared VPC
│   ├── Subnet Production (region A)
│   ├── Subnet Staging (region A)
│   └── Subnet Development (region B)

Service Projects (application teams)
├── Production Project (uses Production subnet)
├── Staging Project (uses Staging subnet)
└── Development Project (uses Development subnet)
```

**Hub-and-Spoke with VPN**
```
On-premises Network
        |
        v
Cloud VPN / Interconnect
        |
        v
Hub VPC (shared services)
        |
        ├── Spoke VPC 1 (production workloads)
        ├── Spoke VPC 2 (development workloads)
        └── Spoke VPC 3 (analytics workloads)
```

### Governance

**Organization Policies**
- Restrict public IP assignment (e.g., on Cloud SQL instances, `constraints/sql.restrictPublicIp`)
- Enforce uniform bucket-level access (`constraints/storage.uniformBucketLevelAccess`)
- Restrict VM external IP (`constraints/compute.vmExternalIpAccess`)
- Define allowed resource locations (`constraints/gcp.resourceLocations`)

**IAM Strategy**
- Use Google Groups for role assignments
- Separate duties (network admin, security admin, etc.)
- Service accounts per application
- Workload Identity for GKE workloads

**Logging and Monitoring**
```
All Projects
     |
     v
Log Router
     |
     ├── Cloud Logging (default sink)
     ├── BigQuery (long-term analysis)
     ├── Cloud Storage (archive)
     └── Pub/Sub (real-time processing)
```

## Migration Strategies

### Migrate to Virtual Machines

**Tools**
- Migrate to Virtual Machines (formerly Migrate for Compute Engine)
- Supports VMware, AWS, Azure, physical servers
- Agentless or agent-based migration
- Migration waves and test clones

**Process**
1. Assess: Fit assessment and TCO analysis
2. Plan: Group VMs, define migration waves
3. Deploy: Set up infrastructure (VPC, firewall rules)
4. Migrate: Test migration, cutover, validation
5. Optimize: Right-sizing, committed use discounts

### Database Migration

**Database Migration Service**
- Minimal downtime migrations
- Supports MySQL, PostgreSQL, SQL Server, Oracle
- Continuous replication for cutover flexibility

**Transfer Appliance**
- Physical device for large data transfers
- Up to 1 PB capacity
- Offline data transfer

## Cost Optimization

### Compute Savings

**Committed Use Discounts**
- 1-year or 3-year commitments
- Up to 57% savings for VMs
- Resource-based or spend-based

**Sustained Use Discounts**
- Automatic discounts for VMs that run more than 25% of a billing month
- Up to 30% savings, depending on machine type
- No commitment required

**Preemptible and Spot VMs**
- Up to 80% discount
- Can be preempted by Compute Engine at any time
- Best for batch processing, fault-tolerant workloads

**Recommender**
- VM rightsizing recommendations
- Idle resource identification
- Committed use discount recommendations

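The sustained use discount described above is applied incrementally: each additional quarter of the month is billed at a lower rate. The sketch below works through that arithmetic. The tier multipliers are an assumption based on the schedule published for N1-class machine types (the one that tops out at 30%), so check current pricing before relying on them.

```python
# (upper bound of usage tier as a fraction of the month, multiplier on base rate)
# Assumed schedule for machine types whose maximum discount is 30%.
SUD_TIERS = [(0.25, 1.0), (0.50, 0.8), (0.75, 0.6), (1.00, 0.4)]


def sustained_use_cost(base_monthly_cost: float, usage_fraction: float) -> float:
    """Cost of running a VM for a fraction of the month with tiered discounts.

    base_monthly_cost is the undiscounted price of running the full month.
    """
    if not 0.0 <= usage_fraction <= 1.0:
        raise ValueError("usage_fraction must be between 0 and 1")
    cost = 0.0
    lower = 0.0
    for upper, multiplier in SUD_TIERS:
        if usage_fraction <= lower:
            break
        # Bill only the part of this tier that was actually used.
        cost += (min(usage_fraction, upper) - lower) * base_monthly_cost * multiplier
        lower = upper
    return cost


if __name__ == "__main__":
    for fraction in (0.25, 0.5, 1.0):
        undiscounted = 100.0 * fraction
        discounted = sustained_use_cost(100.0, fraction)
        print(f"{fraction:.0%} of month: {discounted:.2f} vs {undiscounted:.2f} undiscounted")
```

Under this schedule a VM running the whole month pays 70% of the base price, which is where the 30% figure comes from.
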
### Storage Savings

**Cloud Storage**
- Autoclass for automatic tier transitions
- Lifecycle policies (delete or transition)
- Minimum storage durations: Nearline (30 days), Coldline (90 days), Archive (365 days)
- Requester Pays to bill access and transfer costs to the requester

**Persistent Disk**
- Delete orphaned disks
- Use balanced SSD instead of SSD when possible
- Size disks to match actual usage

### BigQuery Savings

**On-Demand Pricing**
- Billed per TiB of data scanned ($6.25 per TiB in the US since the July 2023 price change; previously $5 per TB)
- Use partitioning and clustering
- Cached results make repeated identical queries free

**Capacity Pricing (formerly Flat-Rate)**
- Predictable costs for heavy users
- Autoscaling slots available
- BigQuery editions replace the legacy flat-rate and flex slot commitments

**Best Practices**
- Use approximate aggregation functions (APPROX_COUNT_DISTINCT)
- Avoid SELECT *, specify columns
- Use materialized views for common queries
- Set up cost controls with custom quotas

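Since on-demand queries are billed by the amount of data scanned, the effect of partitioning can be estimated before a query runs. This is a rough sketch that ignores minimum billing increments and any free tier. The price is passed in as a parameter because it varies by region and over time, and the value used in the example is a placeholder.

```python
TIB = 2 ** 40  # bytes in one tebibyte


def on_demand_cost(bytes_scanned: int, price_per_tib: float) -> float:
    """Approximate on-demand cost of a query that scans bytes_scanned bytes."""
    return bytes_scanned / TIB * price_per_tib


def bytes_after_pruning(table_bytes: int, total_partitions: int, partitions_read: int) -> int:
    """Bytes scanned when only some equally sized partitions are read."""
    if total_partitions <= 0 or not 0 <= partitions_read <= total_partitions:
        raise ValueError("partitions_read must be between 0 and total_partitions")
    return table_bytes * partitions_read // total_partitions


if __name__ == "__main__":
    price = 5.0              # placeholder price per TiB, not a quoted rate
    table_bytes = 10 * TIB   # hypothetical table partitioned by day for one year
    full_scan = on_demand_cost(table_bytes, price)
    one_week = on_demand_cost(bytes_after_pruning(table_bytes, 365, 7), price)
    print(f"full scan: {full_scan:.2f}, last 7 days only: {one_week:.2f}")
```

Selecting only the needed columns reduces scanned bytes in the same way, because storage is column-oriented.
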
### Monitoring Costs

**Cloud Billing**
- Budgets and alerts
- Cost breakdown by project, service, SKU
- Export to BigQuery for analysis
- Recommendations from Active Assist

## Disaster Recovery

### Backup Strategies

**VM Backups**
- Persistent disk snapshots (incremental)
- Machine images (include metadata and config)
- Cross-region snapshot copy
- Snapshot schedules for automation

**Database Backups**
- Cloud SQL: Automated backups (7 retained by default, configurable up to 365)
- Cloud Spanner: Backups on demand or scheduled
- Firestore: Scheduled backups and managed exports to Cloud Storage
- Bigtable: Managed table backups stored on the cluster (copy to another region or project for DR)

### High Availability

**RTO/RPO Matrix**

| Pattern | RPO | RTO | Cost |
|---------|-----|-----|------|
| Active-Active Multi-Region | Seconds | Seconds | High |
| Active-Passive with Replication | Minutes | Minutes | Medium |
| Warm Standby | Minutes | 10-30 min | Medium |
| Backup and Restore | Hours | Hours | Low |

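The matrix above can be turned into a simple selection rule: pick the cheapest pattern whose recovery characteristics meet the targets. The sketch below does this with numeric bounds that are assumptions chosen to match the table's qualitative labels ("Seconds" as 60 s, "Minutes" as 1 hour, "Hours" as 24 hours), so adjust them to your own measurements.

```python
from typing import Optional

# (pattern, assumed worst-case RPO in seconds, assumed worst-case RTO in seconds, cost rank)
PATTERNS = [
    ("Backup and Restore", 24 * 3600, 24 * 3600, 1),
    ("Warm Standby", 3600, 30 * 60, 2),
    ("Active-Passive with Replication", 3600, 3600, 2),
    ("Active-Active Multi-Region", 60, 60, 3),
]


def cheapest_pattern(rpo_target_s: int, rto_target_s: int) -> Optional[str]:
    """Return the lowest-cost pattern meeting both targets, or None if none does."""
    candidates = [
        pattern
        for pattern in PATTERNS
        if pattern[1] <= rpo_target_s and pattern[2] <= rto_target_s
    ]
    if not candidates:
        return None
    # min() keeps the first entry among equal cost ranks.
    return min(candidates, key=lambda pattern: pattern[3])[0]


if __name__ == "__main__":
    print(cheapest_pattern(rpo_target_s=4 * 3600, rto_target_s=45 * 60))
    print(cheapest_pattern(rpo_target_s=30, rto_target_s=30))
```

A `None` result means the targets are tighter than any listed pattern is assumed to deliver, which is a prompt to revisit either the targets or the architecture.
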
**Cloud SQL HA**
- Regional configuration with synchronous replication
- Automatic failover
- 99.95% SLA (vs 99.5% for single zone)

**Cloud Spanner**
- Multi-region configuration
- 99.999% availability SLA
- Synchronous replication across regions

### Disaster Recovery Testing

- Regular DR drills (quarterly recommended)
- Document runbooks
- Test restoration procedures
- Measure actual RTO/RPO vs targets

## Monitoring and Observability

### Cloud Monitoring (formerly Stackdriver)

**Metrics**
- System metrics (CPU, memory, disk, network)
- Custom metrics via Cloud Monitoring API
- Metric scopes for multi-project monitoring
- Uptime checks for availability

**Dashboards and Charts**
- Predefined dashboards for GCP services
- Custom dashboards with filters and grouping
- SLO monitoring with error budgets

### Cloud Logging

**Log Types**
- Admin Activity logs (always enabled, no charge)
- Data Access logs (must be enabled explicitly for most services)
- System Event logs
- Access Transparency logs (actions taken by Google personnel)

**Log Sinks**
- Route logs to BigQuery, Cloud Storage, Pub/Sub
- Aggregated sinks at organization/folder level
- Exclusion filters to reduce costs

### Cloud Trace

**Distributed Tracing**
- Automatic instrumentation for App Engine, Cloud Run, GKE
- Manual instrumentation with client libraries
- Latency analysis and performance insights
- Integration with Zipkin

### Cloud Profiler

**Continuous Profiling**
- CPU and memory profiling
- Low overhead (< 0.5% CPU)
- Flame graphs for visualization
- Supported languages: Java, Go, Python, Node.js

### Error Reporting

**Aggregated Error Tracking**
- Automatic error grouping
- Stack trace analysis
- Integration with Cloud Logging
- Notifications for new errors