bookworm-smart-assistant/skills/data-engineer-expert/SKILL.md

206 lines
4.7 KiB
Markdown
Raw Normal View History

---
name: data-engineer-expert
description: >
数据工程师专家。当用户需要进行数据管道设计、ETL 开发、数据仓库建模、流处理、
Spark/Kafka/Airflow/dbt 使用、维度建模、数据质量,
或说 "数据工程"、"ETL"、"数据管道" 时使用此技能。
allowed-tools: Read, Glob, Grep, Edit, Write, Bash
maturity: stable
last-reviewed: 2026-02-18
composable: true
enhances: [data-analyst-expert, database-tuning-expert]
---
# 数据工程师 (Data Engineer Expert)
> **Output Style**: 本技能使用内联输出规范
资深数据工程师精通数据管道设计、ETL 开发、数据仓库建模和流处理技术。
## 触发关键词
- **数据处理**: `ETL`, `数据管道`, `数据流`, `数据处理`
- **存储**: `数据仓库`, `数据湖`, `数据集市`, `OLAP`
- **工具**: `Spark`, `Kafka`, `Airflow`, `dbt`, `Flink`
- **建模**: `维度建模`, `星型模型`, `雪花模型`
- **质量**: `数据质量`, `数据治理`, `数据血缘`
## 技术栈
### 批处理
- **Apache Spark**: 大规模数据处理
- **dbt**: 数据转换和建模
- **Apache Airflow**: 工作流编排
### 流处理
- **Apache Kafka**: 消息队列
- **Apache Flink**: 实时流处理
- **Kafka Streams**: 轻量流处理
### 数据存储
- **Snowflake/BigQuery**: 云数据仓库
- **Delta Lake**: 数据湖格式
- **Apache Iceberg**: 表格式
## 数据管道设计
### Airflow DAG
```python
from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime, timedelta
default_args = {
'owner': 'data-team',
'retries': 3,
'retry_delay': timedelta(minutes=5),
}
with DAG(
dag_id='user_analytics_pipeline',
default_args=default_args,
schedule_interval='0 2 * * *',
start_date=datetime(2024, 1, 1),
catchup=False,
) as dag:
extract = PythonOperator(
task_id='extract_data',
python_callable=extract_from_source,
)
transform = PythonOperator(
task_id='transform_data',
python_callable=transform_data,
)
load = PythonOperator(
task_id='load_to_warehouse',
python_callable=load_to_warehouse,
)
extract >> transform >> load
```
### dbt 模型
```sql
-- models/marts/dim_users.sql
{{ config(materialized='table') }}
WITH source_users AS (
SELECT * FROM {{ ref('stg_users') }}
),
enriched AS (
SELECT
user_id,
email,
created_at,
COALESCE(country, 'Unknown') AS country,
DATE_TRUNC('month', created_at) AS signup_month,
CASE
WHEN last_login_at > CURRENT_DATE - INTERVAL '30 days' THEN 'active'
ELSE 'inactive'
END AS status
FROM source_users
)
SELECT * FROM enriched
```
### Spark ETL
```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, when, sum as spark_sum
spark = SparkSession.builder \
.appName("UserAnalytics") \
.getOrCreate()
# 读取数据
df = spark.read.parquet("s3://data-lake/users/")
# 转换
result = df \
.filter(col("created_at") >= "2024-01-01") \
.groupBy("country") \
.agg(
spark_sum("revenue").alias("total_revenue"),
spark_sum(when(col("is_active"), 1).otherwise(0)).alias("active_users")
)
# 写入
result.write \
.mode("overwrite") \
.parquet("s3://data-warehouse/user_metrics/")
```
## 维度建模
### 星型模型
```sql
-- 事实表
CREATE TABLE fact_orders (
order_id BIGINT PRIMARY KEY,
user_id BIGINT REFERENCES dim_users(user_id),
product_id BIGINT REFERENCES dim_products(product_id),
date_id INT REFERENCES dim_date(date_id),
quantity INT,
revenue DECIMAL(18, 2),
created_at TIMESTAMP
);
-- 维度表
CREATE TABLE dim_users (
user_id BIGINT PRIMARY KEY,
email VARCHAR(255),
name VARCHAR(255),
country VARCHAR(100),
tier VARCHAR(50)
);
CREATE TABLE dim_date (
date_id INT PRIMARY KEY,
date DATE,
year INT,
quarter INT,
month INT,
week INT,
day_of_week INT
);
```
## 数据质量
### Great Expectations
```python
import great_expectations as gx
context = gx.get_context()
# 定义期望
validator = context.sources.pandas_default.read_csv("users.csv")
validator.expect_column_values_to_not_be_null("user_id")
validator.expect_column_values_to_be_unique("email")
validator.expect_column_values_to_be_between("age", 0, 150)
# 验证
results = validator.validate()
```
## 输出规范
- 使用中文注释
- 提供完整的管道代码
- 说明数据流向
- 包含错误处理和重试
- 考虑数据质量检查
## 禁止事项
- ❌ 不要忽略数据质量检查
- ❌ 不要硬编码凭据
- ❌ 不要忽略幂等性设计
- ❌ 不要跳过数据血缘记录