Running the workflow
We'll run the workflow from within the data/ directory that will contain the BagIt payload files with the outputs of the workflow:
mkdir -p chipseq_20200924/data
cd chipseq_20200924/data
We're going to run the workflow nf-core/chipseq. Nextflow allows running nf-core workflows directly by their GitHub short-name.
Recording
We'll use the test profile, which runs with preconfigured test data; to use your own inputs, see the workflow usage instructions.
Because our BCO should be reproducible, we need to be conscious about which version of the workflow we are using. At the time of writing the latest version is 1.2.2.
If you are using Docker or Singularity containers instead of Conda, change -profile test,conda below to -profile test,docker or -profile test,singularity accordingly.
$ nextflow run nf-core/chipseq -revision 1.2.2 -profile test,conda
N E X T F L O W ~ version 20.07.1
Launching `nf-core/chipseq` [backstabbing_hawking] - revision: 6924b66942 [1.2.2]
----------------------------------------------------
,--./,-.
___ __ __ __ ___ /,-._.--~'
|\ | |__ __ / ` / \ |__) |__ } {
| \| | \__, \__/ | \ |___ \`-._,-`-,
`._,._,'
nf-core/chipseq v1.2.2
----------------------------------------------------
Run Name : backstabbing_hawking
Data Type : Paired-End
Design File : https://raw.githubusercontent.com/nf-core/test-datasets/chipseq/design.csv
Genome : Not supplied
Fasta File : https://raw.githubusercontent.com/nf-core/test-datasets/atacseq/reference/genome.fa
GTF File : https://raw.githubusercontent.com/nf-core/test-datasets/atacseq/reference/genes.gtf
MACS2 Genome Size : 1.2E+7
Min Consensus Reps : 1
…
The workflow will take a while to run. If you previously skipped ahead, now go back to create the skeleton BCO.
Some workflow systems require explicit inputs, while others have them declared as part of the workflow or the workflow config. Nextflow has both options; in this case we used its test profile to pick the minimal inputs suitable for testing.
We already see nf-core reporting reference datasets as inputs, as declared in the profile. Rather than record the test parameter as the workflow input, we'll record the URLs of these file inputs in the BCO as part of the io_domain:
{"io_domain": {
"input_subdomain": [
{
"url": "https://raw.githubusercontent.com/nf-core/test-datasets/chipseq/design.csv"
},
{
"url": "https://raw.githubusercontent.com/nf-core/test-datasets/atacseq/reference/genome.fa"
},
{
"url": "https://raw.githubusercontent.com/nf-core/test-datasets/atacseq/reference/genes.gtf"
}
]
},
"…": []
}
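If you prefer to assemble this fragment programmatically rather than by hand, a minimal Python sketch could look like the following. The URL list is copied from the run summary above; the variable names are ours, not part of any BCO tooling.

```python
import json

# Reference dataset URLs as reported in the nf-core run summary
input_urls = [
    "https://raw.githubusercontent.com/nf-core/test-datasets/chipseq/design.csv",
    "https://raw.githubusercontent.com/nf-core/test-datasets/atacseq/reference/genome.fa",
    "https://raw.githubusercontent.com/nf-core/test-datasets/atacseq/reference/genes.gtf",
]

# Build the io_domain fragment; output_subdomain is left empty for now
io_domain = {
    "io_domain": {
        "input_subdomain": [{"url": u} for u in input_urls],
        "output_subdomain": [],
    }
}

print(json.dumps(io_domain, indent=2))
```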
As the workflow progresses, Nextflow reports status per process:
executor > local (97)
[a9/82ebb7] process > CHECK_DESIGN (design.csv) [100%] 1 of 1 ✔
[51/2105b9] process > BWA_INDEX (genome.fa) [100%] 1 of 1 ✔
[55/6ae158] process > MAKE_GENE_BED (genes.gtf) [100%] 1 of 1 ✔
[0f/984a21] process > MAKE_GENOME_FILTER (genome.fa) [100%] 1 of 1 ✔
[af/85c0f1] process > FASTQC (SPT5_T15_R2_T1) [100%] 6 of 6 ✔
[d1/34ec81] process > TRIMGALORE (SPT5_T15_R1_T1) [100%] 6 of 6 ✔
[17/ae37aa] process > BWA_MEM (SPT5_T15_R1_T1) [100%] 6 of 6 ✔
[25/530a51] process > SORT_BAM (SPT5_T15_R1_T1) [100%] 6 of 6 ✔
[76/3defb1] process > MERGED_BAM (SPT5_T15_R1) [100%] 6 of 6 ✔
[2e/3fcd78] process > MERGED_BAM_FILTER (SPT5_T15_R1) [100%] 6 of 6 ✔
[ca/14e2bf] process > MERGED_BAM_REMOVE_ORPHAN (SPT5_T15_R1) [100%] 6 of 6 ✔
[f1/1da873] process > PRESEQ (SPT5_T15_R1) [100%] 6 of 6, failed: 2 ✔
[99/6ff523] process > PICARD_METRICS (SPT5_T15_R1) [100%] 6 of 6 ✔
[b5/572a3a] process > BIGWIG (SPT5_T15_R1) [100%] 6 of 6 ✔
[0a/7b400e] process > PLOTPROFILE (SPT5_T15_R1) [100%] 6 of 6 ✔
[18/d0b673] process > PHANTOMPEAKQUALTOOLS (SPT5_T15_R1) [100%] 6 of 6 ✔
[3a/d9d296] process > PLOTFINGERPRINT (SPT5_T15_R1 vs SPT5_INPUT_R1) [100%] 4 of 4 ✔
[29/2f9f0b] process > MACS2 (SPT5_T15_R1 vs SPT5_INPUT_R1) [100%] 4 of 4 ✔
[f6/a50dd6] process > MACS2_ANNOTATE (SPT5_T15_R1 vs SPT5_INPUT_R1) [100%] 4 of 4 ✔
[4e/ac1c7b] process > MACS2_QC [100%] 1 of 1 ✔
[19/786250] process > CONSENSUS_PEAKS (SPT5) [100%] 1 of 1 ✔
[70/52a6a1] process > CONSENSUS_PEAKS_ANNOTATE (SPT5) [100%] 1 of 1 ✔
[9d/4b5051] process > CONSENSUS_PEAKS_COUNTS (SPT5) [100%] 1 of 1 ✔
[7f/34cd6c] process > CONSENSUS_PEAKS_DESEQ2 (SPT5) [100%] 1 of 1 ✔
[a1/b86eb7] process > IGV [100%] 1 of 1 ✔
[78/8d9786] process > get_software_versions [100%] 1 of 1 ✔
[6d/7aa2ec] process > MULTIQC (1) [100%] 1 of 1 ✔
[12/a75060] process > output_documentation [100%] 1 of 1 ✔
-Warning, pipeline completed, but with errored process(es) -
-Number of ignored errored process(es) : 2 -
-Number of successfully ran process(es) : 95 -
-[nf-core/chipseq] Pipeline completed successfully-
WARN: To render the execution DAG in the required format it is required to install Graphviz -- See http://www.graphviz.org for more info.
Completed at: 23-Apr-2021 12:11:58
Duration : 2h 4s
CPU hours : 2.7 (0.6% failed)
Succeeded : 95
Ignored : 2
Failed : 2
After execution our data/results folder will be populated with the result files, as marked as output in the Nextflow workflow definition. There will be additional files in the intermediate working directory under data/work, which we'll not publish as part of our BCO or RO-Crate, as they are mostly duplicates of the results.
Listing steps
The BCO should list the individual steps executed in the workflow, ideally with their individual inputs and outputs identified. As we see above, some of these steps are executed multiple times in iterations.
As a first step towards representing the workflow steps in the BCO, we list each step's name, corresponding directly to the report above.
"description_domain": {
"…": [],
"pipeline_steps": [
{"step_number": 1, "name": "CHECK_DESIGN", "description": "", "input_list": [], "output_list": []},
{"step_number": 2, "name": "output_documentation", "description": "", "input_list": [], "output_list": []},
{"step_number": 3, "name": "MAKE_GENE_BED", "description": "", "input_list": [], "output_list": []},
{"step_number": 5, "name": "get_software_versions", "description": "", "input_list": [], "output_list": []},
{"step_number": 6, "name": "BWA_INDEX", "description": "", "input_list": [], "output_list": []},
{"step_number": 7, "name": "MAKE_GENOME_FILTER", "description": "", "input_list": [], "output_list": []},
{"step_number": 8, "name": "FASTQC", "description": "", "input_list": [], "output_list": []},
{"step_number": 9, "name": "TRIMGALORE", "description": "", "input_list": [], "output_list": []},
{"step_number": 10, "name": "BWA_MEM", "description": "", "input_list": [], "output_list": []},
{"step_number": 11, "name": "SORT_BAM", "description": "", "input_list": [], "output_list": []},
{"step_number": 12, "name": "MERGED_BAM", "description": "", "input_list": [], "output_list": []},
{"step_number": 13, "name": "PRESEQ", "description": "", "input_list": [], "output_list": []},
{"step_number": 14, "name": "MERGED_BAM_FILTER", "description": "", "input_list": [], "output_list": []},
{"step_number": 15, "name": "MERGED_BAM_REMOVE_ORPHAN", "description": "", "input_list": [], "output_list": []},
{"step_number": 16, "name": "PHANTOMPEAKQUALTOOLS", "description": "", "input_list": [], "output_list": []},
{"step_number": 17, "name": "BIGWIG", "description": "", "input_list": [], "output_list": []},
{"step_number": 18, "name": "PICARD_METRICS", "description": "", "input_list": [], "output_list": []},
{"step_number": 19, "name": "PLOTFINGERPRINT", "description": "", "input_list": [], "output_list": []},
{"step_number": 20, "name": "MACS2", "description": "", "input_list": [], "output_list": []},
{"step_number": 21, "name": "PLOTPROFILE", "description": "", "input_list": [], "output_list": []},
{"step_number": 22, "name": "MACS2_ANNOTATE", "description": "", "input_list": [], "output_list": []},
{"step_number": 23, "name": "CONSENSUS_PEAKS", "description": "", "input_list": [], "output_list": []},
{"step_number": 24, "name": "CONSENSUS_PEAKS_COUNTS", "description": "", "input_list": [], "output_list": []},
{"step_number": 25, "name": "CONSENSUS_PEAKS_ANNOTATE", "description": "", "input_list": [], "output_list": []},
{"step_number": 26, "name": "MACS2_QC", "description": "", "input_list": [], "output_list": []},
{"step_number": 27, "name": "CONSENSUS_PEAKS_DESEQ2", "description": "", "input_list": [], "output_list": []},
{"step_number": 28, "name": "MULTIQC", "description": "", "input_list": [], "output_list": []},
{"step_number": 29, "name": "IGV", "description": "", "input_list": [], "output_list": []}
]
}
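A skeleton like the above can be derived mechanically from the console status report. The following Python sketch parses "process > NAME" from a few sample lines; the regular expression is our assumption based on the output shown above, not an official Nextflow format guarantee.

```python
import re

# Sample status lines copied from the console report above
status_report = """\
[a9/82ebb7] process > CHECK_DESIGN (design.csv) [100%] 1 of 1
[51/2105b9] process > BWA_INDEX (genome.fa) [100%] 1 of 1
[af/85c0f1] process > FASTQC (SPT5_T15_R2_T1) [100%] 6 of 6
"""

# Each "process > NAME" occurrence becomes one skeleton step entry
pipeline_steps = [
    {"step_number": n, "name": m.group(1),
     "description": "", "input_list": [], "output_list": []}
    for n, m in enumerate(re.finditer(r"process > (\w+)", status_report), start=1)
]
```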
Note the partial order implied by step_number, here deliberately not listing any step 4. You may have noticed during execution that multiple Nextflow steps executed concurrently, as they did not depend on each other. In BCO such parallelism can be indicated more clearly by giving the steps the same step_number, e.g. if FASTQC ran at the same time as TRIMGALORE they can both have "step_number": 8:
{"step_number": 8, "name": "FASTQC", "description": "", "input_list": [], "output_list": []},
{"step_number": 8, "name": "TRIMGALORE", "description": "", "input_list": [], "output_list": []},
A more detailed listing of step execution can be made from Nextflow's data/results/pipeline_info/execution_trace.txt; here we see each execution of FASTQC for different inputs:
8  e1/a217d0 328841 FASTQC (SPT5_INPUT_R1_T1)     COMPLETED 0 2020-09-10 12:56:45.296 15.9s 13s   160.4% 237.9 MB 3.3 GB 38.1 MB  4.4 MB
7  04/a99077 329934 TRIMGALORE (SPT5_INPUT_R1_T1) COMPLETED 0 2020-09-10 12:56:50.451 23.5s 20.2s 163.8% 248.8 MB 3.1 GB 257.9 MB 213 MB
11 bf/715e9e 331956 FASTQC (SPT5_T0_R1_T1)        COMPLETED 0 2020-09-10 12:57:01.183 16.9s 12.6s 166.3% 212.5 MB 3.3 GB 40.3 MB  4 MB
Conveniently, Nextflow's incremental task_id could be used directly as a step_number. However, notice how this log records tasks in the order they completed, while their task_id is assigned at their initial scheduling for execution. With several tasks executing concurrently, they will not necessarily finish in the same order. As the partial ordering of the steps in BCO is only meant as a guide, it is up to you whether to list BCO steps by their "ideal" scheduled order (showing where they could potentially be concurrent), or in their actual start or completion order.
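To illustrate, the trace file is tab-separated and can be re-sorted by task_id with a few lines of Python. The rows below are transcribed from execution_trace.txt above, trimmed to a subset of columns; the field names are assumptions based on Nextflow's default trace fields.

```python
import csv
import io

# Three rows transcribed from execution_trace.txt above (subset of columns)
trace_txt = (
    "task_id\thash\tname\tsubmit\n"
    "8\te1/a217d0\tFASTQC (SPT5_INPUT_R1_T1)\t2020-09-10 12:56:45.296\n"
    "7\t04/a99077\tTRIMGALORE (SPT5_INPUT_R1_T1)\t2020-09-10 12:56:50.451\n"
    "11\tbf/715e9e\tFASTQC (SPT5_T0_R1_T1)\t2020-09-10 12:57:01.183\n"
)

rows = list(csv.DictReader(io.StringIO(trace_txt), delimiter="\t"))
# Re-sort by scheduled task_id rather than the completion order of the log
by_task_id = sorted(rows, key=lambda r: int(r["task_id"]))
```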
Programmatic step names like TRIMGALORE may seem a bit cryptic, and many workflows have less descriptive names than this nf-core example. Therefore it is good to provide a description explaining each step for humans. Luckily nf-core already documents each step, so we can explain each one, e.g.:
{"step_number": 8, "name": "FASTQC", "description": "FastQC http://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/ gives general quality metrics about your reads. It provides information about the quality score distribution across your reads, the per base sequence content (%A/C/G/T). You get information about adapter contamination and other overrepresented sequences", "input_list": [], "output_list": []},
The description field is free text, so the URL to FastQC is not formally captured as linked data. We can provide further detail about each tool under execution_domain in the BCO, as well as in the RO-Crate.
For each step execution, input_list in the BCO lists the files consumed by that step, and likewise output_list can list the files produced.
By looking for Submitted process in the (very) detailed .nextflow.log, we see that in reality FASTQC executed 6 times over different inputs:
Sep-10 12:56:45.296 [Task submitter] INFO nextflow.Session - [e1/a217d0] Submitted process > FASTQC (SPT5_INPUT_R1_T1)
Sep-10 12:57:01.183 [Task submitter] INFO nextflow.Session - [bf/715e9e] Submitted process > FASTQC (SPT5_T0_R1_T1)
Sep-10 12:57:40.582 [Task submitter] INFO nextflow.Session - [f0/0c96b3] Submitted process > FASTQC (SPT5_INPUT_R2_T1)
Sep-10 12:57:54.885 [Task submitter] INFO nextflow.Session - [61/31369d] Submitted process > FASTQC (SPT5_T0_R2_T1)
Sep-10 12:58:09.331 [Task submitter] INFO nextflow.Session - [3e/c3dc21] Submitted process > FASTQC (SPT5_T15_R2_T1)
Sep-10 12:58:39.340 [Task submitter] INFO nextflow.Session - [be/022ad6] Submitted process > FASTQC (SPT5_T15_R1_T1)
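Counting submissions per process can be scripted by scanning the log for these lines. A small sketch, using log fragments copied from above (the regular expression is our assumption about the log format):

```python
import re
from collections import Counter

# Log lines copied from the .nextflow.log excerpt above (timestamps trimmed)
log = """\
[e1/a217d0] Submitted process > FASTQC (SPT5_INPUT_R1_T1)
[bf/715e9e] Submitted process > FASTQC (SPT5_T0_R1_T1)
[f0/0c96b3] Submitted process > FASTQC (SPT5_INPUT_R2_T1)
[61/31369d] Submitted process > FASTQC (SPT5_T0_R2_T1)
[3e/c3dc21] Submitted process > FASTQC (SPT5_T15_R2_T1)
[be/022ad6] Submitted process > FASTQC (SPT5_T15_R1_T1)
"""

# Count how many times each process was submitted
submissions = Counter(
    m.group(1) for m in re.finditer(r"Submitted process > (\w+)", log)
)
```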
Step inputs
In this tutorial we've opted to describe FASTQC in the BCO as a single, simplified larger step, which can then be considered to consume all 6 inputs. However, we first have a challenge: the workflow started with a single CSV file, yet this has been split into 6*2 actual FastQ files to be consumed by FASTQC.
There is an implied shim step in Nextflow that has done this expansion, selecting the paired-end URLs from the CSV file and converting them to file objects connected to FASTQC and TRIMGALORE:
ch_design_reads_csv
    .splitCsv(header:true, sep:',')
    .map { row -> [ row.sample_id, [ file(row.fastq_1, checkIfExists: true), file(row.fastq_2, checkIfExists: true) ] ] }
    .into { ch_raw_reads_fastqc;
            ch_raw_reads_trimgalore }
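For readers less familiar with Nextflow's channel operators, what this shim does could be sketched in Python as follows: split the design CSV and map each sample_id to its pair of FastQ files. The sample row is taken from this tutorial's test data; the variable names are ours.

```python
import csv
import io

# A one-row design CSV; the sample_id and URLs are taken from this tutorial
design_csv = """\
sample_id,fastq_1,fastq_2
SPT5_INPUT_R1_T1,https://raw.githubusercontent.com/nf-core/test-datasets/chipseq/testdata/SRR5204809_Spt5-ChIP_Input1_SacCer_ChIP-Seq_ss100k_R1.fastq.gz,https://raw.githubusercontent.com/nf-core/test-datasets/chipseq/testdata/SRR5204809_Spt5-ChIP_Input1_SacCer_ChIP-Seq_ss100k_R2.fastq.gz
"""

# Map each sample_id to its pair of FastQ files, as the .map step does
raw_reads = {
    row["sample_id"]: [row["fastq_1"], row["fastq_2"]]
    for row in csv.DictReader(io.StringIO(design_csv))
}
```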
While we could have formalized this CSV splitting as an implied ch_design_reads_csv intermediate step, the purpose of the BCO is not to explain every such lower-level shim or data conversion built into the workflow execution level, but rather to explain the computational analysis. Therefore in this case we will rather list the data identifiers actually used, at the cost of them seemingly appearing out of nowhere. Other workflow systems may make such steps explicit (e.g. calling a Python script to retrieve the files), in which case it may be more natural to also list them in the BCO.
The CHECK_DESIGN step has conveniently produced for us a data/results/pipeline_info/design_reads.csv with sample_id values that match what we see in the reports. We can thus record the inputs directly in input_list, which takes a list of URIs:
{"step_number": 8, "name": "FASTQC", "description": "FastQC gives general quality…",
"input_list": [
{"uri": "https://raw.githubusercontent.com/nf-core/test-datasets/chipseq/testdata/SRR5204809_Spt5-ChIP_Input1_SacCer_ChIP-Seq_ss100k_R1.fastq.gz"},
{"uri": "https://raw.githubusercontent.com/nf-core/test-datasets/chipseq/testdata/SRR5204810_Spt5-ChIP_Input2_SacCer_ChIP-Seq_ss100k_R1.fastq.gz"},
{"uri": "https://raw.githubusercontent.com/nf-core/test-datasets/atacseq/testdata/SRR1822153_1.fastq.gz"},
{"uri": "https://raw.githubusercontent.com/nf-core/test-datasets/atacseq/testdata/SRR1822154_1.fastq.gz"},
{"uri": "https://raw.githubusercontent.com/nf-core/test-datasets/atacseq/testdata/SRR1822157_1.fastq.gz"},
{"uri": "https://raw.githubusercontent.com/nf-core/test-datasets/atacseq/testdata/SRR1822158_1.fastq.gz"},
{"uri": "https://raw.githubusercontent.com/nf-core/test-datasets/chipseq/testdata/SRR5204809_Spt5-ChIP_Input1_SacCer_ChIP-Seq_ss100k_R2.fastq.gz"},
{"uri": "https://raw.githubusercontent.com/nf-core/test-datasets/chipseq/testdata/SRR5204810_Spt5-ChIP_Input2_SacCer_ChIP-Seq_ss100k_R2.fastq.gz"},
{"uri": "https://raw.githubusercontent.com/nf-core/test-datasets/atacseq/testdata/SRR1822153_2.fastq.gz"},
{"uri": "https://raw.githubusercontent.com/nf-core/test-datasets/atacseq/testdata/SRR1822154_2.fastq.gz"},
{"uri": "https://raw.githubusercontent.com/nf-core/test-datasets/atacseq/testdata/SRR1822157_2.fastq.gz"},
{"uri": "https://raw.githubusercontent.com/nf-core/test-datasets/atacseq/testdata/SRR1822158_2.fastq.gz"}
],
"output_list": []},
Note that in this simplified approach of a single FASTQC step, with a flat list, we can no longer single out the individual paired-end reads. In the more verbose approach of one FASTQC step per row of execution_trace.txt, we can be slightly more precise by only showing the reads for the given sample SPT5_INPUT_R1_T1:
{"step_number": 8, "name": "FASTQC", "description": "Input (SPT5_INPUT_R1_T1) FastQC gives general quality…",
"input_list": [
{"uri": "https://raw.githubusercontent.com/nf-core/test-datasets/chipseq/testdata/SRR5204809_Spt5-ChIP_Input1_SacCer_ChIP-Seq_ss100k_R1.fastq.gz"},
{"uri": "https://raw.githubusercontent.com/nf-core/test-datasets/chipseq/testdata/SRR5204809_Spt5-ChIP_Input1_SacCer_ChIP-Seq_ss100k_R2.fastq.gz"}
],
"output_list": []}
This is still a simplification, because in reality this particular workflow step executed fastqc separately for each file:
(bco-ro) root@05ce36630c51:/work/be/fc6003d5258941857baea391194fde# cat .command.sh
#!/bin/bash -euo pipefail
[ ! -f SPT5_INPUT_R1_T1_1.fastq.gz ] && ln -s SRR5204809_Spt5-ChIP_Input1_SacCer_ChIP-Seq_ss100k_R1.fastq.gz SPT5_INPUT_R1_T1_1.fastq.gz
[ ! -f SPT5_INPUT_R1_T1_2.fastq.gz ] && ln -s SRR5204809_Spt5-ChIP_Input1_SacCer_ChIP-Seq_ss100k_R2.fastq.gz SPT5_INPUT_R1_T1_2.fastq.gz
fastqc -q -t 2 SPT5_INPUT_R1_T1_1.fastq.gz
fastqc -q -t 2 SPT5_INPUT_R1_T1_2.fastq.gz
Similarly, note that in the BCO we only provide input references as a flat list; we won't document their role or parameters to the tool, nor the exact command line arguments, as for TRIMGALORE:
trim_galore --cores 1 --paired --fastqc --gzip SPT5_INPUT_R1_T1_1.fastq.gz SPT5_INPUT_R1_T1_2.fastq.gz
There are many reasons for not going into that level of detail here; for instance, in this nf-core workflow each step also has pre- and post-steps that handle temporary files, record memory usage, capture the error log, etc. These details are usually not essential for explaining the computational analysis, and therefore do not need to be represented at the BCO level.
Step outputs
Now let's go ahead and describe the output_list of files this step produced, the *.zip and *.html reports, which are also workflow outputs under data/results/fastqc:
"output_list": [
  "results/fastqc/SPT5_INPUT_R1_T1_1_fastqc.html",
  "results/fastqc/SPT5_INPUT_R1_T1_2_fastqc.html",
  "results/fastqc/zips/SPT5_INPUT_R1_T1_1_fastqc.zip",
  "results/fastqc/zips/SPT5_INPUT_R1_T1_2_fastqc.zip"
]
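Rather than typing such paths by hand, the list could be gathered by globbing the results directory. A sketch, assuming the directory layout shown above (the function name is ours):

```python
from pathlib import Path


def list_fastqc_outputs(crate_root):
    """Collect FastQC reports as paths relative to the RO-Crate root."""
    patterns = ("results/fastqc/*_fastqc.html", "results/fastqc/zips/*_fastqc.zip")
    root = Path(crate_root)
    return sorted(
        str(p.relative_to(root))
        for pattern in patterns
        for p in root.glob(pattern)
    )
```

Calling list_fastqc_outputs("data") after the run would return the relative paths used in the output_list above.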
Note that in this case we are referring to output files contained by this RO-Crate; they do not (yet) have any absolute URL, therefore we for now refer to them as relative URI paths from the RO-Crate Root, the data/ folder.
Some workflow systems produce workflow outputs that are directly web-addressable, in which case you may elect to identify them by URL reference instead of including them as part of the RO-Crate payload. However, this makes the BCO more fragile: it refers to things on the web which may change or disappear.
Identifying input/output files as URLs
As we saw above, input_list and output_list identify the workflow step's input and output files using URIs. For files contained in the RO-Crate, we initially included the relative path under filename. It would not be valid to use that path as the uri; the form below uses a relative path, which currently is not valid according to the IEEE 2791 JSON Schema:
{"uri": "data/results/genome/genes.bed"},
One way to make a more lightweight BCO, without all files bundled in, is to publish the workflow outputs on a hosting service. Here we try using GitHub to publish this BCO at https://github.com/biocompute-objects/bco-ro-example-chipseq/
The form below is valid, and uses an HTTP raw URI at GitHub. Note that large files on GitHub might require Git LFS, which could incur billable charges. The use of S3 buckets is discouraged, as they are subject to change. Note that the below uses commit ae950188ef874a9527f2c466354aa19a23ca0043 instead of master, which again could be subject to change.
{"step_number": 3, "name": "MAKE_GENE_BED", "description": "", "input_list": [], "output_list": [
{"uri": "https://raw.githubusercontent.com/biocompute-objects/bco-ro-example-chipseq/ae950188ef874a9527f2c466354aa19a23ca0043/data/results/genome/genes.bed",
"filename": "data/results/genome/genes.bed"
}
]},
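Such commit-pinned raw URIs follow a predictable pattern and can be assembled mechanically. A small Python sketch (the function name is ours; the org, repository, and commit are the ones used in this tutorial):

```python
def raw_github_uri(org, repo, commit, path):
    """Commit-pinned raw.githubusercontent.com URI for a repository file."""
    return f"https://raw.githubusercontent.com/{org}/{repo}/{commit}/{path}"


uri = raw_github_uri(
    "biocompute-objects",
    "bco-ro-example-chipseq",
    "ae950188ef874a9527f2c466354aa19a23ca0043",
    "data/results/genome/genes.bed",
)
```

Pinning to a commit rather than a branch keeps the URI stable even if master later moves.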
This form uses a file:/// path, which is valid and provides provenance of where the file was made locally, but is not portable to other machines:
{"uri": "file:///home/stain/src/bco-ro-example-chipseq/data/results/genome/genes.bed"},
This form uses arcp URIs inside the RO-Crate, based on the UUID in bag-info.txt, but is currently not valid, because the , character is wrongly not permitted by the uri JSON Schema format's authority component (it expects a hostname).
{"uri": "arcp://uuid,9b309ebd-6dfb-4c6d-983b-56b91fca6e06/data/results/genome/genome.fa.include_regions.bed"},
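The authority problem can be demonstrated with Python's standard URI parser. A sketch, using the UUID from this tutorial's bag-info.txt:

```python
from urllib.parse import urlparse

# UUID from bag-info.txt; the arcp URI embeds it in the authority component
uuid = "9b309ebd-6dfb-4c6d-983b-56b91fca6e06"
arcp_uri = f"arcp://uuid,{uuid}/data/results/genome/genome.fa.include_regions.bed"

# The authority is "uuid,<uuid>" - the comma is what a hostname-only
# interpretation of the "uri" JSON Schema format rejects
authority = urlparse(arcp_uri).netloc
```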