BagIt
BagIt's main role is to ensure that all files are present and have not been modified or corrupted in transfer. The files have the following roles, conforming with [RFC 8493]:
- bagit.txt — BagIt declaration
- bag-info.txt — BagIt metadata
- manifest-sha512.txt — Payload manifest, checksums of all data/ files
- tagmanifest-sha512.txt — Tag manifest, checksums of other files
The links above go to the https://github.com/biocompute-objects/bco-ro-example-chipseq/ example built by this tutorial.
BagIt declaration
bagit.txt should have this fixed content:
BagIt-Version: 1.0
Tag-File-Character-Encoding: UTF-8
The main role of this file is to mark the folder as a bag according to RFC8493. The base directory containing bagit.txt can have any name and can be transferred in any way, e.g. as a ZIP archive, over SFTP, or even exposed on the web.
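As an illustration of how small the declaration is, the following Python sketch writes bagit.txt and checks for it. The function names write_declaration and has_declaration are hypothetical helpers, not part of any BagIt library, and the check deliberately accepts only the exact content shown above.

```python
from pathlib import Path

# The fixed declaration content shown above.
BAGIT_DECLARATION = "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n"


def write_declaration(bag_dir: Path) -> Path:
    """Write bagit.txt into bag_dir (hypothetical helper); the directory may have any name."""
    bag_dir.mkdir(parents=True, exist_ok=True)
    path = bag_dir / "bagit.txt"
    path.write_text(BAGIT_DECLARATION, encoding="utf-8")
    return path


def has_declaration(bag_dir: Path) -> bool:
    """Return True if bag_dir contains bagit.txt with exactly the content above."""
    path = bag_dir / "bagit.txt"
    return path.is_file() and path.read_text(encoding="utf-8") == BAGIT_DECLARATION
```

A declaration alone does not make a valid bag; the data/ directory and a payload manifest are still needed.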
Payload manifest
manifest-sha512.txt is the payload manifest file, containing SHA-512 checksums of all files under the data/ directory, recursively, e.g.:
41846747…ee71 data/ro-crate-metadata.json
e1105ed0…5e13 data/chipseq_20200910.json
37fd3a02…bb95 data/results/pipeline_info/design_reads.csv
…
Creating the payload manifest file without using BagIt tools/libraries can be done as:
$ find data -type f -print0 | xargs -0 sha512sum > manifest-sha512.txt
Similarly, the manifest can be checked; in this run two files have been modified and fail verification:
$ sha512sum --quiet -c manifest-sha512.txt
data/chipseq_20200910.json: FAILED
data/ro-crate-metadata.json: FAILED
sha512sum: WARNING: 2 computed checksums did NOT match
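Where sha512sum is not available, the same two steps can be approximated with the Python standard library. This is a minimal sketch, not a replacement for a BagIt library: sha512_of, write_payload_manifest and failed_entries are hypothetical names, and the manifest is assumed to be UTF-8 with one "checksum path" pair per line.

```python
import hashlib
from pathlib import Path


def sha512_of(path: Path) -> str:
    """Hex SHA-512 digest of a file, read in chunks to cope with large outputs."""
    digest = hashlib.sha512()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_payload_manifest(bag_dir: Path) -> Path:
    """Write manifest-sha512.txt covering every file below bag_dir/data."""
    files = sorted(p for p in (bag_dir / "data").rglob("*") if p.is_file())
    lines = [f"{sha512_of(p)}  {p.relative_to(bag_dir).as_posix()}\n" for p in files]
    manifest = bag_dir / "manifest-sha512.txt"
    manifest.write_text("".join(lines), encoding="utf-8")
    return manifest


def failed_entries(bag_dir: Path, manifest_name: str = "manifest-sha512.txt") -> list[str]:
    """Return the manifest paths that are missing or whose checksum differs."""
    failed = []
    for line in (bag_dir / manifest_name).read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        expected, name = line.split(maxsplit=1)
        name = name.lstrip("*")  # sha512sum prefixes the path with '*' in binary mode
        target = bag_dir / name
        if not target.is_file() or sha512_of(target) != expected.lower():
            failed.append(name)
    return failed
```

An empty list from failed_entries corresponds to sha512sum --quiet -c printing nothing. Note that this only checks files listed in the manifest; it does not detect extra files under data/ that the manifest omits.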
Notice how the payload manifest lists checksums of all individual data outputs in data/results, as well as the RO-Crate data/ro-crate-metadata.json and the IEEE 2791 BCO JSON data/chipseq_20200910.json. Although these files are strictly speaking metadata, we can consider them part of the data/ payload when transferring an RO-Crate containing a BioCompute Object, as recommended by RO-Crate 1.1.
Alternative checksum algorithms are allowed according to RFC8493, e.g. manifest-sha256.txt, as long as each manifest file is complete. Here we use SHA-512 by default, as recommended by RFC8493.
Tag manifest
In BagIt it is optional to also checksum files outside data/, so-called tag files, as in tagmanifest-sha512.txt. Because we have included the RO-Crate under data/, the remaining files in the example repository are mainly about creating the BagIt BCO (like a README), as well as the BagIt files bagit.txt, manifest-sha512.txt etc.:
b0556450…8802 bag-info.txt
1abe59bd…969a Makefile
b5598554…256d README.md
000b27e3…c52e manifest-sha512.txt
…
Unlike the payload manifest above, this file does not need to be complete. However, we recommend that it minimally include bag-info.txt and manifest-sha512.txt so consumers can ensure these files are complete.
Creating the tag manifest without BagIt tools is best achieved by listing individual files:
$ sha512sum bagit.txt bag-info.txt manifest-sha512.txt > tagmanifest-sha512.txt
To avoid circularity the tag manifest MUST NOT checksum itself or other tagmanifest-*.txt files.
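The exclusion rule is easy to get wrong when globbing, so here is a hedged Python sketch that builds the tag manifest while skipping every tagmanifest-*.txt file. It assumes all tag files sit directly in the bag's base directory, as in this example; write_tag_manifest is a hypothetical helper name.

```python
import hashlib
from pathlib import Path


def write_tag_manifest(bag_dir: Path) -> Path:
    """Write tagmanifest-sha512.txt for the top-level files of bag_dir."""
    lines = []
    for path in sorted(bag_dir.iterdir()):
        # Skip directories (including data/) and any tag manifest, to avoid circularity.
        if not path.is_file() or path.name.startswith("tagmanifest-"):
            continue
        digest = hashlib.sha512(path.read_bytes()).hexdigest()
        lines.append(f"{digest}  {path.name}\n")
    tag_manifest = bag_dir / "tagmanifest-sha512.txt"
    tag_manifest.write_text("".join(lines), encoding="utf-8")
    return tag_manifest
```

This should be run after manifest-sha512.txt and bag-info.txt are final, since any later change to them would invalidate their entries here.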
BagIt metadata
bag-info.txt contains BagIt metadata in a loose key-value textual format to describe the bag, primarily for the purpose of transfer.
Our example is quite minimal:
ROCrate_Specification_Identifier: https://w3id.org/ro/crate/1.0/
External-Description: Workflow run of a ChIP-seq peak-calling, QC and differential analysis pipeline
Bagging-Date: 2020-09-10T19:27:45Z
Bag-Size: 396MB
ROCrate_Specification_Identifier is an additional key used by RO-Crate Describo to indicate the presence and version of data/ro-crate-metadata.json. This string SHOULD match the conformsTo URI.
Payload-Oxum, if present, is a compound field, listing the total size of data/ files in bytes, and the number of payload files (excluding directories). Bag-Size is the human-readable version of the total size. These numbers could be obtained with:
$ du --apparent-size -b -s data
414893243 data
$ du --apparent-size --human-readable -s data
396M data
$ find data -type f | wc -l
372
Note the use of --apparent-size, as in this case actual disk usage is 129M due to ZFS compression, while the sum of each file's size is 396 MB. It is NOT RECOMMENDED to include Payload-Oxum if data contains hard or soft linked outputs, as is the case of this example's Nextflow workflow.
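For reference, RFC 8493 writes Payload-Oxum as OctetCount.StreamCount. A minimal Python sketch of that computation follows; payload_oxum is a hypothetical helper, and it sums the apparent size (st_size) of regular files only, so its byte count may differ slightly from a du total that also counts the directory entries themselves.

```python
from pathlib import Path


def payload_oxum(bag_dir: Path) -> str:
    """Return 'OctetCount.StreamCount' for the files below bag_dir/data."""
    files = [p for p in (bag_dir / "data").rglob("*") if p.is_file()]
    # st_size is the apparent file size, independent of filesystem compression.
    octets = sum(p.stat().st_size for p in files)
    return f"{octets}.{len(files)}"
```

Symbolic links are followed by is_file() and stat() here, so linked outputs would be counted at the size of their targets, which is one reason the field is best left out for payloads like this example's.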
The Bagging-Date should reflect the time the bag was created in ISO8601 format, for instance as output by date --utc --iso-8601=seconds.
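If the bag is assembled from a script rather than the shell, an equivalent timestamp can be produced in Python. This sketch assumes the Z-suffixed UTC form used in the example bag-info.txt above; bagging_date is a hypothetical helper name.

```python
from datetime import datetime, timezone
from typing import Optional


def bagging_date(now: Optional[datetime] = None) -> str:
    """Format a timestamp (default: current time) as ISO 8601 UTC with second precision."""
    if now is None:
        now = datetime.now(timezone.utc)
    # Convert any timezone-aware input to UTC before formatting with a literal 'Z'.
    return now.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
```

Passing an explicit datetime makes the value reproducible, e.g. when rebuilding a bag from a recorded workflow end time.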
This example from [RFC 8493] lists other keys that MAY be used:
Source-Organization: FOO University
Organization-Address: 1 Main St., Cupertino, California, 11111
Contact-Name: Jane Doe
Contact-Phone: +1 111-111-1111
Contact-Email: example@example.com
External-Description: Uncompressed greyscale TIFF images from the
     FOO papers colle...
Bagging-Date: 2008-01-15
External-Identifier: university_foo_001
Payload-Oxum: 279164409832.1198
Bag-Group-Identifier: university_foo
Bag-Count: 1 of 15
Internal-Sender-Identifier: /storage/images/foo
Internal-Sender-Description: Uncompressed greyscale TIFFs created
     from microfilm and are...
However, as metadata would primarily be covered by the BCO and RO-Crate, we recommend keeping bag-info.txt minimal, reflecting only transfer-level metadata.
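A consumer that wants to read these keys back needs to handle the one structural feature of the format: per RFC 8493, a long value may continue on following lines that begin with whitespace. The sketch below is a minimal reader under that assumption; parse_bag_info is a hypothetical name, and it returns a list of pairs rather than a dictionary because labels may repeat.

```python
def parse_bag_info(text: str) -> list[tuple[str, str]]:
    """Parse bag-info.txt content into (label, value) pairs, joining folded lines."""
    entries: list[tuple[str, str]] = []
    for line in text.splitlines():
        if not line.strip():
            continue
        if line[0] in " \t" and entries:
            # Continuation line: append to the value of the previous entry.
            label, value = entries[-1]
            entries[-1] = (label, f"{value} {line.strip()}")
        else:
            # Split on the first colon only, so URL and timestamp values stay intact.
            label, _, value = line.partition(":")
            entries.append((label.strip(), value.strip()))
    return entries
```

For the minimal example above, this would yield four pairs, with the ROCrate_Specification_Identifier value kept as the full URI.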