The main role of BagIt is to ensure that all files are present and have not been modified or corrupted in transfer. The roles of the files are as follows, conforming to [RFC 8493]:
- bagit.txt: BagIt declaration
- bag-info.txt: BagIt metadata
- manifest-sha512.txt: Payload manifest, checksums of all files under data/
- tagmanifest-sha512.txt: Tag manifest, checksums of other files
The links above go to the https://github.com/biocompute-objects/bco-ro-example-chipseq/ example built by this tutorial.
bagit.txt should have this fixed content:
BagIt-Version: 1.0
Tag-File-Character-Encoding: UTF-8
The main role of this file is to mark the folder as a bag according to RFC 8493. The base directory containing bagit.txt can have any name and can be transferred in any way, e.g. as a ZIP archive, over SFTP, or even exposed on the web.
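The payload manifest manifest-sha512.txt lists one checksum and path per line for every file under data/:
41846747…ee71 data/ro-crate-metadata.json
e1105ed0…5e13 data/chipseq_20200910.json
37fd3a02…bb95 data/results/pipeline_info/design_reads.csv
…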
Creating the payload manifest file without using BagIt tools/libraries can be done as:
$ find data -type f -print0 | xargs -0 sha512sum > manifest-sha512.txt
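As a variation on the command above, the file list can be sorted before hashing so that the manifest order is reproducible between runs and machines. This is a minimal sketch, assuming GNU find, sort and coreutils; the scratch directory and the two sample files are hypothetical stand-ins for real workflow outputs.

```shell
# Scratch bag with two hypothetical payload files (stand-ins for real outputs)
bag=$(mktemp -d)
mkdir -p "$bag/data/results"
printf 'hello\n' > "$bag/data/a.txt"
printf 'world\n' > "$bag/data/results/b.txt"
cd "$bag"

# Sort the NUL-separated file list (byte order) so the manifest order is stable
find data -type f -print0 | LC_ALL=C sort -z | xargs -0 sha512sum > manifest-sha512.txt

cat manifest-sha512.txt
```

Sorting is optional as far as validation is concerned, but a stable order makes it easier to compare manifests from different runs with a plain diff.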
Similarly checking the manifest:
$ sha512sum --quiet -c manifest-sha512.txt
data/chipseq_20200910.json: FAILED
data/ro-crate-metadata.json: FAILED
sha512sum: WARNING: 2 computed checksums did NOT match
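Note that sha512sum -c only checks the files that are listed in the manifest, so a payload file that was added afterwards goes unnoticed. The following sketch compares the listed paths with the files on disk. It assumes GNU coreutils and file names without newlines or backslashes; the scratch bag and the names listed.tmp and present.tmp are hypothetical.

```shell
export LC_ALL=C   # same byte-wise ordering for sort and comm

# Scratch bag with one hypothetical payload file and its manifest
bag=$(mktemp -d)
mkdir -p "$bag/data"
printf 'hello\n' > "$bag/data/a.txt"
cd "$bag"
find data -type f -print0 | xargs -0 sha512sum > manifest-sha512.txt

# A file added after the manifest was written is not noticed by sha512sum -c
printf 'extra\n' > data/unlisted.txt
sha512sum --quiet -c manifest-sha512.txt && echo "checksums OK"

# Strip the 128 hex digits and the two separator characters to get the paths
sed -E 's/^[0-9a-f]{128} [ *]//' manifest-sha512.txt | sort > listed.tmp
find data -type f | sort > present.tmp

unlisted=$(comm -13 listed.tmp present.tmp)   # on disk, not in the manifest
missing=$(comm -23 listed.tmp present.tmp)    # in the manifest, not on disk
echo "unlisted: $unlisted"
echo "missing: $missing"
rm -f listed.tmp present.tmp
```

In this scratch bag the checksum test passes, while the comparison reports data/unlisted.txt as unlisted. BagIt libraries typically perform both checks when validating a bag.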
Notice how the payload manifest lists checksums of all individual data outputs in data/results, as well as the RO-Crate data/ro-crate-metadata.json and the IEEE 2791 BCO JSON data/chipseq_20200910.json. Although these files are strictly speaking metadata, we can consider them part of the data/ payload when transferring an RO-Crate containing a BioCompute Object, as recommended by RO-Crate 1.1.
Alternative checksum algorithms are allowed by RFC 8493, e.g. manifest-sha256.txt, as long as each manifest file is complete. Here we use SHA-512, the default recommended by RFC 8493.
In BagIt it is optional to also checksum files outside data/, so-called tag files, as in tagmanifest-sha512.txt. Because we have included the RO-Crate under data/, the remaining files in the example repository are mainly about creating the BagIt BCO (like a README), as well as the BagIt files themselves:
b0556450…8802 bag-info.txt
1abe59bd…969a Makefile
b5598554…256d README.md
000b27e3…c52e manifest-sha512.txt
…
Unlike the payload manifest above, this file does not need to be complete; however, we recommend that it at least includes manifest-sha512.txt so that consumers can ensure these files are complete.
Creating the tag manifest without BagIt tools is best achieved by listing individual files:
$ sha512sum bagit.txt bag-info.txt manifest-sha512.txt > tagmanifest-sha512.txt
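On the receiving side, the tag manifest can be checked first, so that the payload manifest is only trusted once its own checksum has been confirmed. The sketch below builds a small scratch bag and validates it in that order; the file contents and the status variable are hypothetical, and GNU coreutils is assumed.

```shell
# Scratch bag with hypothetical content
bag=$(mktemp -d)
mkdir -p "$bag/data"
cd "$bag"
printf 'hello\n' > data/a.txt
printf 'BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n' > bagit.txt
printf 'External-Description: Hypothetical scratch bag\n' > bag-info.txt

# Payload manifest first, then the tag manifest that covers it
find data -type f -print0 | xargs -0 sha512sum > manifest-sha512.txt
sha512sum bagit.txt bag-info.txt manifest-sha512.txt > tagmanifest-sha512.txt

# Verify the tag files, then the payload files listed in the verified manifest
if sha512sum --quiet -c tagmanifest-sha512.txt &&
   sha512sum --quiet -c manifest-sha512.txt; then
  status="valid"
else
  status="invalid"
fi
echo "bag is $status"
```

If manifest-sha512.txt is modified after the tag manifest was written, the first check fails and the payload manifest is never consulted.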
To avoid circularity, the tag manifest MUST NOT checksum itself or other tag manifests.
Our example bag-info.txt is quite minimal:
ROCrate_Specification_Identifier: https://w3id.org/ro/crate/1.0/
External-Description: Workflow run of a ChIP-seq peak-calling, QC and differential analysis pipeline
Bagging-Date: 2020-09-10T19:27:45Z
Bag-Size: 396MB
Payload-Oxum, if present, is a compound field, listing the total size of
data/ files in bytes, and the number of payload files (excluding directories).
Bag-Size is the human-readable version of the total size. These numbers could be obtained with:
$ du --apparent-size -b -s data
414893243 data
$ du --apparent-size --human-readable -s data
396M data
$ find data -type f | wc -l
372
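The two numbers can also be combined directly into the OctetCount.StreamCount form used by Payload-Oxum. The following is a rough sketch that sums file contents rather than disk usage; the scratch files and the variable names octets, streams and oxum are hypothetical.

```shell
# Scratch payload with two hypothetical files of known size
bag=$(mktemp -d)
mkdir -p "$bag/data/results"
cd "$bag"
printf 'hello\n' > data/a.txt            # 6 bytes
printf 'world!\n' > data/results/b.txt   # 7 bytes

# Total payload size in bytes: the concatenated file contents, not disk usage
octets=$(find data -type f -print0 | xargs -0 cat | wc -c)
# Number of payload files (directories are not counted)
streams=$(find data -type f | wc -l)

# $(( )) drops any padding that some wc implementations add
oxum="$((octets)).$((streams))"
echo "Payload-Oxum: $oxum"
```

For this scratch payload the script prints Payload-Oxum: 13.2. Reading every file is slower than du, but the result does not depend on file system compression or block size.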
Note the use of --apparent-size: in this case the actual disk usage is 129M due to ZFS compression, while the sum of the individual file sizes is 396 MB. It is NOT RECOMMENDED to include Payload-Oxum if data contains hard or soft linked outputs, as is the case for this example's Nextflow workflow.
Bagging-Date should reflect the time the bag was created, in ISO 8601 format, for instance as output by:
$ date --utc --iso-8601=seconds
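As a small sketch, a bag-info.txt with a freshly generated Bagging-Date could be written as follows. It uses an explicit format string, which also works with non-GNU date and ends the timestamp in Z as in our example bag-info.txt. The External-Description value and the variable name bagging_date are hypothetical.

```shell
bag=$(mktemp -d)
cd "$bag"

# UTC timestamp in ISO 8601 form with a trailing Z
bagging_date=$(date -u +%Y-%m-%dT%H:%M:%SZ)

# Write a minimal bag-info.txt; further keys can be appended the same way
cat > bag-info.txt <<EOF
External-Description: Hypothetical scratch bag
Bagging-Date: $bagging_date
EOF

cat bag-info.txt
```

If the tag manifest includes bag-info.txt, it has to be regenerated after this file is written or changed.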
This example from [RFC 8493] lists other keys that MAY be used:
Source-Organization: FOO University
Organization-Address: 1 Main St., Cupertino, California, 11111
Contact-Name: Jane Doe
Contact-Phone: +1 111-111-1111
Contact-Email: email@example.com
External-Description: Uncompressed greyscale TIFF images from the FOO papers colle...
Bagging-Date: 2008-01-15
External-Identifier: university_foo_001
Payload-Oxum: 279164409832.1198
Bag-Group-Identifier: university_foo
Bag-Count: 1 of 15
Internal-Sender-Identifier: /storage/images/foo
Internal-Sender-Description: Uncompressed greyscale TIFFs created from microfilm and are...