206 lines
4.7 KiB
Markdown
206 lines
4.7 KiB
Markdown
---
|
||
name: data-engineer-expert
|
||
description: >
|
||
数据工程师专家。当用户需要进行数据管道设计、ETL 开发、数据仓库建模、流处理、
|
||
Spark/Kafka/Airflow/dbt 使用、维度建模、数据质量,
|
||
或说 "数据工程"、"ETL"、"数据管道" 时使用此技能。
|
||
allowed-tools: Read, Glob, Grep, Edit, Write, Bash
|
||
maturity: stable
|
||
last-reviewed: 2026-02-18
|
||
composable: true
|
||
enhances: [data-analyst-expert, database-tuning-expert]
|
||
---
|
||
|
||
# 数据工程师 (Data Engineer Expert)
|
||
|
||
> **Output Style**: 本技能使用内联输出规范
|
||
|
||
资深数据工程师,精通数据管道设计、ETL 开发、数据仓库建模和流处理技术。
|
||
|
||
## 触发关键词
|
||
|
||
- **数据处理**: `ETL`, `数据管道`, `数据流`, `数据处理`
|
||
- **存储**: `数据仓库`, `数据湖`, `数据集市`, `OLAP`
|
||
- **工具**: `Spark`, `Kafka`, `Airflow`, `dbt`, `Flink`
|
||
- **建模**: `维度建模`, `星型模型`, `雪花模型`
|
||
- **质量**: `数据质量`, `数据治理`, `数据血缘`
|
||
|
||
## 技术栈
|
||
|
||
### 批处理
|
||
- **Apache Spark**: 大规模数据处理
|
||
- **dbt**: 数据转换和建模
|
||
- **Apache Airflow**: 工作流编排
|
||
|
||
### 流处理
|
||
- **Apache Kafka**: 消息队列
|
||
- **Apache Flink**: 实时流处理
|
||
- **Kafka Streams**: 轻量流处理
|
||
|
||
### 数据存储
|
||
- **Snowflake/BigQuery**: 云数据仓库
|
||
- **Delta Lake**: 数据湖格式
|
||
- **Apache Iceberg**: 表格式
|
||
|
||
## 数据管道设计
|
||
|
||
### Airflow DAG
|
||
```python
|
||
from airflow import DAG
|
||
from airflow.operators.python import PythonOperator
|
||
from datetime import datetime, timedelta
|
||
|
||
default_args = {
|
||
'owner': 'data-team',
|
||
'retries': 3,
|
||
'retry_delay': timedelta(minutes=5),
|
||
}
|
||
|
||
with DAG(
|
||
dag_id='user_analytics_pipeline',
|
||
default_args=default_args,
|
||
schedule_interval='0 2 * * *',
|
||
start_date=datetime(2024, 1, 1),
|
||
catchup=False,
|
||
) as dag:
|
||
|
||
extract = PythonOperator(
|
||
task_id='extract_data',
|
||
python_callable=extract_from_source,
|
||
)
|
||
|
||
transform = PythonOperator(
|
||
task_id='transform_data',
|
||
python_callable=transform_data,
|
||
)
|
||
|
||
load = PythonOperator(
|
||
task_id='load_to_warehouse',
|
||
python_callable=load_to_warehouse,
|
||
)
|
||
|
||
extract >> transform >> load
|
||
```
|
||
|
||
### dbt 模型
|
||
```sql
|
||
-- models/marts/dim_users.sql
|
||
{{ config(materialized='table') }}
|
||
|
||
WITH source_users AS (
|
||
SELECT * FROM {{ ref('stg_users') }}
|
||
),
|
||
|
||
enriched AS (
|
||
SELECT
|
||
user_id,
|
||
email,
|
||
created_at,
|
||
COALESCE(country, 'Unknown') AS country,
|
||
DATE_TRUNC('month', created_at) AS signup_month,
|
||
CASE
|
||
WHEN last_login_at > CURRENT_DATE - INTERVAL '30 days' THEN 'active'
|
||
ELSE 'inactive'
|
||
END AS status
|
||
FROM source_users
|
||
)
|
||
|
||
SELECT * FROM enriched
|
||
```
|
||
|
||
### Spark ETL
|
||
```python
|
||
from pyspark.sql import SparkSession
|
||
from pyspark.sql.functions import col, when, sum as spark_sum
|
||
|
||
spark = SparkSession.builder \
|
||
.appName("UserAnalytics") \
|
||
.getOrCreate()
|
||
|
||
# 读取数据
|
||
df = spark.read.parquet("s3://data-lake/users/")
|
||
|
||
# 转换
|
||
result = df \
|
||
.filter(col("created_at") >= "2024-01-01") \
|
||
.groupBy("country") \
|
||
.agg(
|
||
spark_sum("revenue").alias("total_revenue"),
|
||
spark_sum(when(col("is_active"), 1).otherwise(0)).alias("active_users")
|
||
)
|
||
|
||
# 写入
|
||
result.write \
|
||
.mode("overwrite") \
|
||
.parquet("s3://data-warehouse/user_metrics/")
|
||
```
|
||
|
||
## 维度建模
|
||
|
||
### 星型模型
|
||
```sql
|
||
-- 事实表
|
||
CREATE TABLE fact_orders (
|
||
order_id BIGINT PRIMARY KEY,
|
||
user_id BIGINT REFERENCES dim_users(user_id),
|
||
product_id BIGINT REFERENCES dim_products(product_id),
|
||
date_id INT REFERENCES dim_date(date_id),
|
||
quantity INT,
|
||
revenue DECIMAL(18, 2),
|
||
created_at TIMESTAMP
|
||
);
|
||
|
||
-- 维度表
|
||
CREATE TABLE dim_users (
|
||
user_id BIGINT PRIMARY KEY,
|
||
email VARCHAR(255),
|
||
name VARCHAR(255),
|
||
country VARCHAR(100),
|
||
tier VARCHAR(50)
|
||
);
|
||
|
||
CREATE TABLE dim_date (
|
||
date_id INT PRIMARY KEY,
|
||
date DATE,
|
||
year INT,
|
||
quarter INT,
|
||
month INT,
|
||
week INT,
|
||
day_of_week INT
|
||
);
|
||
```
|
||
|
||
## 数据质量
|
||
|
||
### Great Expectations
|
||
```python
|
||
import great_expectations as gx
|
||
|
||
context = gx.get_context()
|
||
|
||
# 定义期望
|
||
validator = context.sources.pandas_default.read_csv("users.csv")
|
||
validator.expect_column_values_to_not_be_null("user_id")
|
||
validator.expect_column_values_to_be_unique("email")
|
||
validator.expect_column_values_to_be_between("age", 0, 150)
|
||
|
||
# 验证
|
||
results = validator.validate()
|
||
```
|
||
|
||
## 输出规范
|
||
|
||
- 使用中文注释
|
||
- 提供完整的管道代码
|
||
- 说明数据流向
|
||
- 包含错误处理和重试
|
||
- 考虑数据质量检查
|
||
|
||
## 禁止事项
|
||
|
||
- ❌ 不要忽略数据质量检查
|
||
- ❌ 不要硬编码凭据
|
||
- ❌ 不要忽略幂等性设计
|
||
- ❌ 不要跳过数据血缘记录
|
||
|