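nf-core Pipeline Deployment: Workflow Checklist

This chapter provides a complete step-by-step checklist so that you make the right decision at every stage.

Full checklist

□ Step 0: Acquire data (GEO/SRA, optional)
□ Step 1: Check the environment (must pass)
□ Step 2: Select a pipeline (confirm with the user)
□ Step 3: Run the test (must pass)
□ Step 4: Create the samplesheet
□ Step 5: Configure and run (confirm the genome)
□ Step 6: Verify the output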

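Step 0: Acquire data (optional)

If you already have FASTQ files: skip to Step 1.

If you need to download from GEO/SRA:

# 1. Fetch study information
python scripts/sra_geo_fetch.py info GSE110004

# 2. Download the data
python scripts/sra_geo_fetch.py download GSE110004 -o ./fastq -i

# 3. Generate the samplesheet
python scripts/sra_geo_fetch.py samplesheet GSE110004 \
  --fastq-dir ./fastq -o samplesheet.csv

Decision point: before downloading, confirm with the user

Which subset of samples to download
The suggested genome and pipeline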
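Step 1: Check the environment

This must be run first. If the environment check does not pass, the pipeline will fail.

python scripts/check_environment.py

What is checked

Component	Requirement	Check command	Fix
Docker	Installed and running	`docker ps`	https://docs.docker.com/get-docker/
Nextflow	≥ 23.04	`nextflow -version`	`curl -s https://get.nextflow.io | bash`
Java	≥ 11	`java -version`	`sudo apt install openjdk-11-jdk`

Important: do not continue while the environment check is failing.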
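
The version requirements in the table can be checked programmatically. The following is a minimal sketch, not the contents of scripts/check_environment.py, which this chapter does not show. The function names `parse_version`, `meets_minimum`, and `installed_version` are hypothetical, and the parsing assumes the first dotted number printed by the tool is its version.

```python
import re
import shutil
import subprocess

# Minimum versions copied from the table above.
MINIMUMS = {"nextflow": (23, 4), "java": (11,)}


def parse_version(text):
    """Return the first dotted number in text as a tuple of ints, or None."""
    match = re.search(r"\d+(?:\.\d+)*", text)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(0).split("."))


def meets_minimum(found, required):
    """Compare two version tuples after padding the shorter one with zeros."""
    width = max(len(found), len(required))
    found_padded = found + (0,) * (width - len(found))
    required_padded = required + (0,) * (width - len(required))
    return found_padded >= required_padded


def installed_version(command):
    """Run a version command and parse its output (assumption: the version
    is the first number printed on stdout or stderr)."""
    if shutil.which(command[0]) is None:
        return None
    result = subprocess.run(command, capture_output=True, text=True)
    return parse_version(result.stdout + result.stderr)
```

For example, `installed_version(["java", "-version"])` returns a tuple that can be passed to `meets_minimum` together with `MINIMUMS["java"]`. A return value of `None` means the tool was not found on the PATH.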

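Docker troubleshooting

Problem	Solution
Not installed	Install from the official site
Permission denied	`sudo usermod -aG docker $USER`, then log out and back in
Daemon not running	`sudo systemctl start docker`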

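Step 2: Select a pipeline

Decision point: must be confirmed with the user

Matching data type to pipeline

Data type	Pipeline	Version	Purpose
RNA-seq	`rnaseq`	3.22.2	Gene expression analysis
Whole genome/exome	`sarek`	3.7.1	Variant calling
ATAC-seq	`atacseq`	2.1.2	Chromatin accessibility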
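
The table can be captured as a small lookup so that the pipeline name and pinned release always travel together. This is an illustrative sketch. The dictionary keys (`"rna-seq"`, `"wgs"`, `"wes"`, `"atac-seq"`) and the function name `suggest_pipeline` are assumptions chosen for the example, not labels used by the scripts in this chapter.

```python
# Data type -> (nf-core pipeline, pinned release), copied from the table above.
# The key spellings are hypothetical labels for this example.
PIPELINES = {
    "rna-seq": ("rnaseq", "3.22.2"),
    "wgs": ("sarek", "3.7.1"),
    "wes": ("sarek", "3.7.1"),
    "atac-seq": ("atacseq", "2.1.2"),
}


def suggest_pipeline(data_type):
    """Return (pipeline, version) for a data type, or raise ValueError."""
    key = data_type.strip().lower()
    if key not in PIPELINES:
        known = ", ".join(sorted(PIPELINES))
        raise ValueError(f"unknown data type {data_type!r}; expected one of: {known}")
    return PIPELINES[key]
```

The result is only a suggestion. The choice still has to be confirmed with the user, as the decision point above requires.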

Automatic detection

python scripts/detect_data_type.py /path/to/data

Pipeline-specific references

references/pipelines/rnaseq.md
references/pipelines/sarek.md
references/pipelines/atacseq.md

Step 3: Run the test

Validate the environment with the pipeline's small test dataset. This must pass before running real data.

nextflow run nf-core/<pipeline> \
  -r <version> \
  -profile test,docker \
  --outdir test_output

Test commands per pipeline

# RNA-seq
nextflow run nf-core/rnaseq -r 3.22.2 \
  -profile test,docker --outdir test_rnaseq

# Sarek (WGS/WES)
nextflow run nf-core/sarek -r 3.7.1 \
  -profile test,docker --outdir test_sarek

# ATAC-seq
nextflow run nf-core/atacseq -r 2.1.2 \
  -profile test,docker --outdir test_atacseq

Confirm the test succeeded

ls test_output/multiqc/multiqc_report.html
grep "Pipeline completed successfully" .nextflow.log

Step 4: Create the samplesheet

Automatic generation

python scripts/generate_samplesheet.py /path/to/data <pipeline> \
  -o samplesheet.csv

The script will:

Discover FASTQ/BAM/CRAM files
Pair R1/R2 reads
Infer sample metadata
Validate the format
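
The R1/R2 pairing step can be sketched as follows. This is not the code of scripts/generate_samplesheet.py, which is not shown here. It assumes files are named `<sample>_R1.fastq.gz` / `<sample>_R2.fastq.gz` (or `.fq.gz`), optionally with an Illumina-style `_001` suffix. Other naming schemes would need a different pattern. `pair_fastqs` and `samples_missing_a_mate` are hypothetical names.

```python
import re
from collections import defaultdict
from pathlib import Path

# Assumed naming convention; adjust the pattern for other schemes.
READ_PATTERN = re.compile(
    r"^(?P<sample>.+?)_R(?P<read>[12])(?:_001)?\.(?:fastq|fq)\.gz$"
)


def pair_fastqs(paths):
    """Group FASTQ paths by sample: {sample: {"1": r1_path, "2": r2_path}}."""
    pairs = defaultdict(dict)
    for path in sorted(Path(p) for p in paths):
        match = READ_PATTERN.match(path.name)
        if match:  # files that do not match the pattern are ignored
            pairs[match["sample"]][match["read"]] = path.as_posix()
    return dict(pairs)


def samples_missing_a_mate(pairs):
    """Return samples that lack R1 or R2 (these may be genuine single-end data)."""
    return sorted(s for s, reads in pairs.items() if set(reads) != {"1", "2"})
```

A sample reported by `samples_missing_a_mate` is not necessarily an error. It is a prompt to confirm whether the data are single-end before writing the samplesheet.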

Validate an existing samplesheet

python scripts/generate_samplesheet.py --validate samplesheet.csv <pipeline>

Samplesheet formats

RNA-seq:

sample,fastq_1,fastq_2,strandedness
SAMPLE1,/abs/path/R1.fq.gz,/abs/path/R2.fq.gz,auto

Sarek:

patient,sample,lane,fastq_1,fastq_2,status
patient1,tumor,L001,/abs/path/tumor_R1.fq.gz,/abs/path/tumor_R2.fq.gz,1
patient1,normal,L001,/abs/path/normal_R1.fq.gz,/abs/path/normal_R2.fq.gz,0

ATAC-seq:

sample,fastq_1,fastq_2,replicate
CONTROL,/abs/path/ctrl_R1.fq.gz,/abs/path/ctrl_R2.fq.gz,1
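
A header check against the three formats above might look like the sketch below. The column lists are copied from the examples. Treating every listed column as required, and requiring FASTQ paths to start with `/` (mirroring the `/abs/path` examples), are assumptions of this sketch rather than rules stated by the pipelines. `check_samplesheet` is a hypothetical name.

```python
import csv
import io

# Column lists copied from the example samplesheets above.
REQUIRED_COLUMNS = {
    "rnaseq": ["sample", "fastq_1", "fastq_2", "strandedness"],
    "sarek": ["patient", "sample", "lane", "fastq_1", "fastq_2", "status"],
    "atacseq": ["sample", "fastq_1", "fastq_2", "replicate"],
}


def check_samplesheet(text, pipeline):
    """Return a list of problems found in samplesheet CSV text (empty if none)."""
    reader = csv.DictReader(io.StringIO(text))
    header = reader.fieldnames or []
    problems = [
        f"missing column: {column}"
        for column in REQUIRED_COLUMNS[pipeline]
        if column not in header
    ]
    if problems:
        return problems
    # Line 1 is the header, so data rows start at line 2.
    for line_no, row in enumerate(reader, start=2):
        for column in ("fastq_1", "fastq_2"):
            value = row[column]
            # Assumption: non-empty FASTQ paths should be absolute.
            if value and not value.startswith("/"):
                problems.append(f"line {line_no}: {column} is not an absolute path")
    return problems
```

An empty list means the sheet passed these basic checks. It does not replace the validation each pipeline performs at launch.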

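Step 5: Configure and run

5a. Check the genome

# Check whether it is already installed
python scripts/manage_genomes.py check <genome>

# If it is not installed, download it
python scripts/manage_genomes.py download <genome>

Common genomes: GRCh38 (human), GRCh37 (human, legacy), GRCm39 (mouse)

5b. Decision points (confirm with the user)

Decision	Options
Genome	Which reference?
RNA-seq aligner	star_salmon (recommended) or hisat2 (lower memory)
Sarek tools	haplotypecaller (germline) or mutect2 (somatic)
ATAC-seq read_length	50/75/100/150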

5c. Run the pipeline

nextflow run nf-core/<pipeline> \
  -r <version> \
  -profile docker \
  --input samplesheet.csv \
  --outdir results \
  --genome <genome> \
  -resume

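Key parameters:

-r: pins the pipeline release
-profile docker: runs with Docker (use singularity on clusters)
--genome: iGenomes key
-resume: continues from the point where the previous run stopped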
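
When the run is launched from a script, the command from 5c can be assembled as an argument list, which avoids shell-quoting mistakes. This is a minimal sketch. `build_run_command` and `format_command` are hypothetical helpers, and the default values simply repeat the ones used in this chapter.

```python
import shlex


def build_run_command(pipeline, version, genome, samplesheet="samplesheet.csv",
                      outdir="results", profile="docker", resume=True):
    """Assemble the nextflow argument list for an nf-core run."""
    command = [
        "nextflow", "run", f"nf-core/{pipeline}",
        "-r", version,            # pin the pipeline release
        "-profile", profile,      # docker locally, singularity on clusters
        "--input", samplesheet,
        "--outdir", outdir,
        "--genome", genome,       # iGenomes key
    ]
    if resume:
        command.append("-resume")
    return command


def format_command(command):
    """Render the argument list as a shell-safe string for logging or review."""
    return shlex.join(command)
```

The list returned by `build_run_command` can be passed directly to `subprocess.run`. `format_command` is useful for showing the user exactly what will be executed before they confirm.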

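Resource limits (if needed):

--max_cpus 8 --max_memory '32.GB' --max_time '24.h'

Note: pipeline releases built on nf-core template 3.0 or later no longer accept the --max_* parameters. They take limits through process.resourceLimits in a custom config passed with -c. Check the parameter documentation for the release you pinned.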

Step 6: Verify the output

Check that the run completed

ls results/multiqc/multiqc_report.html
grep "Pipeline completed successfully" .nextflow.log
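
The two shell checks above can be combined into a single pass/fail test. The sketch below assumes the report path and log message shown in those commands. `log_reports_success` and `run_succeeded` are hypothetical names.

```python
from pathlib import Path

# Message searched for by the grep command above.
SUCCESS_MARKER = "Pipeline completed successfully"


def log_reports_success(log_text):
    """Return True if the log text contains the success message."""
    return SUCCESS_MARKER in log_text


def run_succeeded(outdir, log_path=".nextflow.log"):
    """Return True only if the MultiQC report exists and the log reports success."""
    report = Path(outdir) / "multiqc" / "multiqc_report.html"
    log = Path(log_path)
    if not report.is_file() or not log.is_file():
        return False
    return log_reports_success(log.read_text(errors="replace"))
```

Requiring both conditions guards against a stale report left over from an earlier run whose most recent attempt failed.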

Key outputs per pipeline

RNA-seq:

results/star_salmon/salmon.merged.gene_counts.tsv - gene counts
results/star_salmon/salmon.merged.gene_tpm.tsv - TPM values

Sarek:

results/variant_calling/*/ - VCF files
results/preprocessing/recalibrated/ - BAM files

ATAC-seq:

results/macs2/narrowPeak/ - peak files
results/bwa/mergedLibrary/bigwig/ - coverage tracks
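
The key outputs listed above can be checked in one pass. This sketch takes the relative paths exactly as written in this chapter. Directory layouts can differ between pipeline releases, so treat the paths as assumptions to verify against the release you ran. `missing_outputs` is a hypothetical name.

```python
from pathlib import Path

# Relative paths copied from the lists above (without the leading results/).
KEY_OUTPUTS = {
    "rnaseq": [
        "star_salmon/salmon.merged.gene_counts.tsv",
        "star_salmon/salmon.merged.gene_tpm.tsv",
    ],
    "sarek": ["variant_calling", "preprocessing/recalibrated"],
    "atacseq": ["macs2/narrowPeak", "bwa/mergedLibrary/bigwig"],
}


def missing_outputs(results_dir, pipeline):
    """Return the expected outputs that do not exist under results_dir."""
    root = Path(results_dir)
    return [rel for rel in KEY_OUTPUTS[pipeline] if not (root / rel).exists()]
```

An empty list from `missing_outputs("results", "rnaseq")` means every listed file or directory is present. It says nothing about their contents, which is what the MultiQC report is for.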

Quick reference

Resume a failed run

# Repeat the original command with the same parameters, adding -resume
nextflow run nf-core/<pipeline> -resume

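Clean the work directory

nextflow clean -n   # dry run: list what would be removed
nextflow clean -f   # delete the work files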

View resource usage

nextflow log <run_name> -f process,cpus,memory,time,duration

Reference documents

references/geo-sra-acquisition.md
references/troubleshooting.md
references/installation.md
references/pipelines/rnaseq.md
references/pipelines/sarek.md
references/pipelines/atacseq.md

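Important notes

Do not skip the test: if the test fails, the run on real data will fail too
Be explicit about versions: use -r to pin the pipeline release
Use -resume: avoids starting over from scratch
Verify the output: check the MultiQC report
Keep the logs: .nextflow.log contains important debugging information

Continue to the application scenarios for real-world examples.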