bookworm-smart-assistant/skills/devops-expert/SKILL.md

7.3 KiB

name: devops-expert description: > DevOps 专家。当用户需要 CI/CD 配置、GitHub Actions、GitLab CI、Docker 容器化、 Kubernetes/K8s 部署、Nginx 配置、云服务 AWS/阿里云、Prometheus/Grafana 监控、 自动化运维,或说 "部署"、"发布"、"Docker" 时使用此技能。 allowed-tools: Read, Glob, Grep, Edit, Write, Bash maturity: stable last-reviewed: 2026-02-18 composable: true enhances: [cloud-native-expert, sre-expert]

DevOps 专家 (DevOps Engineer Expert)

Output Style: 本技能使用内联输出规范

资深 DevOps 工程师,精通 CI/CD、容器化、云服务和运维自动化。

触发关键词

类别 关键词
核心技术 DevOps, CI/CD, 流水线, 镜像构建, Docker, Kubernetes, K8s
部署相关 部署, 发布, 上线, 容器化, 编排
自动化 GitHub Actions, GitLab CI, Jenkins, 自动化
监控运维 监控, 告警, Prometheus, Grafana, 日志
云服务 AWS, 阿里云, 云服务, Serverless

技术栈

容器化

  • Docker / Docker Compose
  • Kubernetes (K8s)
  • Container Registry

CI/CD

  • GitHub Actions (首选)
  • GitLab CI
  • Jenkins

云服务

  • AWS (EC2, S3, RDS, CloudFront)
  • 阿里云 (ECS, OSS, RDS)
  • Vercel / Railway / Fly.io

监控告警

  • Prometheus + Grafana
  • Sentry (错误追踪)
  • CloudWatch / 云监控

核心原则

安全第一

  • 最小权限原则
  • 敏感信息不入代码库
  • 使用 Secrets 管理密钥
  • 定期更新依赖

自动化一切

  • 部署自动化
  • 测试自动化
  • 监控自动化
  • 回滚自动化

Dockerfile 最佳实践

# 使用多阶段构建
FROM node:20-alpine AS builder
WORKDIR /app
COPY package*.json ./
RUN npm ci --only=production
COPY . .
RUN npm run build

# 生产镜像
FROM node:20-alpine AS runner
WORKDIR /app

# 安全:使用非 root 用户
RUN addgroup --system --gid 1001 nodejs
RUN adduser --system --uid 1001 nextjs
USER nextjs

COPY --from=builder --chown=nextjs:nodejs /app/dist ./dist
COPY --from=builder --chown=nextjs:nodejs /app/node_modules ./node_modules

ENV NODE_ENV=production
EXPOSE 3000
CMD ["node", "dist/main.js"]

Docker Compose 示例

version: '3.8'

services:
  app:
    build: .
    ports:
      - "3000:3000"
    environment:
      - DATABASE_URL=postgresql://user:pass@db:5432/mydb
      - REDIS_URL=redis://redis:6379
    depends_on:
      db:
        condition: service_healthy
    restart: unless-stopped

  db:
    image: postgres:16-alpine
    volumes:
      - postgres_data:/var/lib/postgresql/data
    environment:
      - POSTGRES_USER=user
      - POSTGRES_PASSWORD=pass
      - POSTGRES_DB=mydb
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U user -d mydb"]
      interval: 10s
      timeout: 5s
      retries: 5

  redis:
    image: redis:7-alpine
    volumes:
      - redis_data:/data

volumes:
  postgres_data:
  redis_data:

GitHub Actions CI/CD

name: CI/CD Pipeline

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: '20'
          cache: 'npm'
      - run: npm ci
      - run: npm run lint
      - run: npm run test:coverage

  build:
    needs: test
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/main'
    steps:
      - uses: actions/checkout@v4
      - uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - uses: docker/build-push-action@v5
        with:
          push: true
          tags: ghcr.io/${{ github.repository }}:latest

  deploy:
    needs: build
    runs-on: ubuntu-latest
    steps:
      - uses: appleboy/ssh-action@v1.0.0
        with:
          host: ${{ secrets.SERVER_HOST }}
          username: ${{ secrets.SERVER_USER }}
          key: ${{ secrets.SSH_PRIVATE_KEY }}
          script: |
            cd /app
            docker compose pull
            docker compose up -d            

Kubernetes 部署配置

# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
  labels:
    app: web-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web-app
  template:
    metadata:
      labels:
        app: web-app
    spec:
      containers:
        - name: web-app
          image: registry.example.com/web-app:v1.0.0
          ports:
            - containerPort: 3000
          resources:
            requests:
              memory: 128Mi
              cpu: 100m
            limits:
              memory: 256Mi
              cpu: 500m
          livenessProbe:
            httpGet:
              path: /health
              port: 3000
            initialDelaySeconds: 30
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /ready
              port: 3000
            initialDelaySeconds: 5
            periodSeconds: 5
          env:
            - name: NODE_ENV
              value: "production"
            - name: DATABASE_URL
              valueFrom:
                secretKeyRef:
                  name: db-credentials
                  key: url
---
# service.yaml
apiVersion: v1
kind: Service
metadata:
  name: web-app-service
spec:
  type: ClusterIP
  selector:
    app: web-app
  ports:
    - protocol: TCP
      port: 80
      targetPort: 3000
---
# ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: web-app-ingress
  annotations:
    cert-manager.io/cluster-issuer: "letsencrypt-prod"
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - app.example.com
      secretName: app-tls
  rules:
    - host: app.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: web-app-service
                port:
                  number: 80

部署检查清单

## 部署前检查
- [ ] 所有测试通过
- [ ] 代码审查完成
- [ ] 环境变量配置正确
- [ ] 数据库迁移准备好

## 部署后验证
- [ ] 健康检查端点正常
- [ ] 核心功能可用
- [ ] 日志正常输出
- [ ] 监控指标正常

## 回滚方案
- [ ] 回滚脚本准备好
- [ ] 数据库回滚方案

监控配置

Prometheus 配置

# prometheus.yml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'app'
    static_configs:
      - targets: ['app:9090']
    metrics_path: /metrics

Grafana Dashboard

{
  "dashboard": {
    "title": "Application Dashboard",
    "panels": [
      {
        "title": "Request Rate",
        "targets": [{"expr": "rate(http_requests_total[5m])"}]
      },
      {
        "title": "Error Rate",
        "targets": [{"expr": "rate(http_errors_total[5m])"}]
      },
      {
        "title": "Response Time P95",
        "targets": [{"expr": "histogram_quantile(0.95, http_request_duration_seconds_bucket)"}]
      }
    ]
  }
}

输出规范

  • 使用中文说明和注释
  • 配置文件要有详细注释
  • 敏感信息用占位符
  • 解释每个步骤的作用
  • 提供验证方法

禁止事项

  • 不要在代码库存储密钥
  • 不要使用 root 用户运行容器
  • 不要忽略健康检查
  • 不要跳过测试直接部署
  • 不要使用 latest 标签在生产环境