Starting the BCO RO-Crate

We'll create the BCO RO-Crate inside a new folder.

mkdir chipseq_20200924
cd chipseq_20200924
mkdir data

Here we use the name chipseq_20200924 as we ran the chipseq workflow on 2020-09-24 - you can use any reasonable name.

The data/ folder will contain our RO-Crate, following the RO-Crate recommendation for combining it with BagIt. If you will not be using BagIt, you can skip the data/ subfolder.
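For orientation, this is roughly the layout we are working towards in this section (the ro-crate-metadata.json file is created further below):

```
chipseq_20200924/
└── data/
    ├── chipseq_20200924.json
    └── ro-crate-metadata.json
```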

Note that running the workflow can take a while, so this might be a good point to skip ahead and start the Nextflow run in a separate window.
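If you do start the run now, a typical nf-core invocation might look like the following - a sketch only, assuming Nextflow and Docker are installed and using the pipeline's small bundled test profile; substitute your own profile and parameters for a real run:

```bash
# Sketch: run the nf-core/chipseq pipeline, pinning release 1.2.2 (-r)
# and using the bundled test dataset with Docker containers.
nextflow run nf-core/chipseq -r 1.2.2 -profile test,docker
```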

Skeleton BCO

We'll start our editor to create skeleton JSON files for the BCO and RO-Crate.

The BCO JSON can have any name - to make its relative paths easy we'll put it in the root of data/ and name it after the folder, chipseq_20200924.json, which we'll also use as our object_id:

data/chipseq_20200924.json

{
    "object_id": "chipseq_20200924",
    "spec_version": "https://w3id.org/ieee/ieee-2791-schema/",
    "etag": "abcd",
    "provenance_domain": {
        "name": "",
        "version": "",
        "created": "2020-09-24T15:38:00Z",
        "modified": "2020-09-24T15:38:00Z",
        "contributors": [
            {"contribution": ["createdBy"],
             "name": ""
            }   
        ],
        "license": ""
    },
    "usability_domain": [
    ],
    "description_domain": {
        "keywords": [
        ],
        "pipeline_steps": [
        ]
    },
    "execution_domain": {
		"script": [],
        "script_driver": "",
        "software_prerequisites": [
        ],
        "external_data_endpoints": [
        ],
        "environment_variables": {}
    },
    "io_domain": {
      "input_subdomain": [],
      "output_subdomain": []     
    }
}

While the above is valid according to the IEEE 2791 schemas, it obviously does not contain much information yet.

While editing the BCO JSON, have it open as Input JSON in <https://www.jsonschemavalidator.net/> with the below schema:

{ "$ref": "https://w3id.org/ieee/ieee-2791-schema/2791object.json"}
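Alternatively, a generic JSON Schema validator on the command line can be pointed directly at the published 2791object.json schema. Below is a rough, untested sketch assuming the third-party check-jsonschema tool (pip install check-jsonschema); resolving the schema's remote $refs requires network access and may depend on the tool version:

```bash
# Untested sketch using the (assumed) check-jsonschema tool:
# validate the BCO instance against the published IEEE 2791 object schema.
check-jsonschema \
  --schemafile https://w3id.org/ieee/ieee-2791-schema/2791object.json \
  data/chipseq_20200924.json
```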

ETag

From IEEE 2791-2020:

Everything except for the etag, object_id, and spec_version shall be included in the generation of an ETag - which can be “strong” or “weak”. It is recommended that the ETag be deleted or updated if the object file is changed (except in cases using weak ETags in which the entirety of the change comprises a simple re-writing of the JSON).

To calculate the etag value we need some kind of checksum of the BCO's JSON.

Below is a poor man's “weak” attempt at a JSON checksum: it ignores any lines containing etag, object_id, and spec_version when calculating a new etag, and also strips newlines and spaces and normalizes tab characters:

$ egrep -v '"(etag|object_id|spec_version)"' chipseq_20200924.json  | 
   sed -e ':a' -e 'N' -e '$!ba' -e 's/\n/ /g' -e 's/[ ]*//g' | 
   sed -e $'s/\t/|/g' | sha256sum
cfb77c3e5c9b0e937de215190106dadcc9af9f40fa19a4deacf9a5a3a42f5de6  -

From this checksum we can generate and insert the `etag` field for the BCO:

```json
"etag": "cfb77c3e5c9b0e937de215190106dadcc9af9f40fa19a4deacf9a5a3a42f5de6"
```

The above etag calculation can be considered “weak”: it would also give a different checksum for non-structural changes like moving key-value lines around, while missing some structural changes, e.g. renaming "lots of space.txt" to "lotsofspace.txt". RFC 7232 says such ETag weakness should be indicated with a W/ prefix, however that is not currently permitted by the JSON Schema for the BCO etag field.

We'll need to update the etag whenever we substantially change the BCO JSON - so we'll do this again towards the end of this tutorial. Notice how the above egrep command-line ignores the "etag" line to avoid circularly calculating a new checksum.

Performing a “strong” etag calculation would need to be done at tree level by parsing the BCO JSON, e.g. ignoring the order of fields within an object. A strong calculation may also want to assign new etag values if any of the files referenced from the BCO have changed.
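As a sketch of what such a tree-level calculation could look like (an illustration, not a normative ETag algorithm), the jq tool - assuming it is installed - can drop the excluded fields, sort keys and emit compact JSON before checksumming, so that whitespace and key order no longer matter:

```bash
# Sketch only (assumes jq; not a normative ETag algorithm): parse the JSON tree,
# delete the excluded fields, sort keys (-S) and compact the output (-c),
# then checksum the result.
jq -cS 'del(.etag, .object_id, .spec_version)' chipseq_20200924.json | sha256sum
```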

Provenance domain

The provenance_domain describes this BCO overall. Because in this case we are primarily submitting the nf-core/chipseq workflow with supporting data, we'll reflect the workflow's original creation date and contributors, even though they were not involved in making the workflow's description as a BCO.

Luckily nf-core has provided citation info, so we can find the authors (authoredBy) of the workflow from https://doi.org/10.5281/zenodo.3240506.

We add a name with the title from the nf-core/chipseq website, as well as their indicated latest release - at the time of writing "version": "1.2.2". If your workflow does not currently have a version number, consider using Semantic Versioning as recommended by IEEE 2791.

Here created reflects the original release date of the workflow (which did not include a BCO), while we set modified to the current date to reflect when we last updated the BCO JSON (and presumably also changed its etag).

In many cases your workflow will be considered “released” at the same time as the BCO JSON is created by its own authors, in which case this situation is simplified.

"provenance_domain": {
    "name": "nf-core/chipseq: ChIP-seq peak-calling, QC and differential analysis pipeline",
    "version": "1.2.2",
    "created": "2019-06-06T15:15:00.676813+00:00",
    "modified": "2021-04-23T00:27:42.542985Z",
    "contributors": [
        {"contribution": ["authoredBy"], "name": "Harshil Patel" },
        {"contribution": ["authoredBy"], "name": "Chuan Wang" },
        {"contribution": ["authoredBy"], "name": "Phil Ewels" },
        {"contribution": ["authoredBy"], "name": "Tiago Chedraoui Silva" },
        {"contribution": ["authoredBy"], "name": "Alexander Peltzer" },
        {"contribution": ["authoredBy"], "name": "Drew Behrens" },
        {"contribution": ["authoredBy"], "name": "Maxime Garcia" },
        {"contribution": ["authoredBy"], "name": "mashehu" },
        {"contribution": ["authoredBy"], "name": "Rotholandus" },
        {"contribution": ["authoredBy"], "name": "Sofia Haglund" },
        {"contribution": ["authoredBy"], "name": "Winni Kretzschmar" }
    ],
    "license": "https://github.com/nf-core/chipseq/blob/1.2.2/LICENSE"
},

To find the license of the workflow we have to hunt through its GitHub source repository https://github.com/nf-core/chipseq, where we find a LICENSE file - indicated as an MIT License. As this particular license includes a copyright statement, Copyright (c) Philip Ewels, we need to refer to the license in its source repository rather than the generic SPDX identifier https://spdx.org/licenses/MIT.

For open source licenses like the Apache License version 2.0, which are not modified depending on the copyright holder, you should instead use their SPDX identifier like https://spdx.org/licenses/Apache-2.0 (removing .html). If your workflow has its own license that is not on the web (e.g. a non-open license), use "license": "LICENSE.txt" and add the file data/LICENSE.txt with your copyright information.

In this case Stian has created the BCO and performed the workflow run, so we'll also add this as a separate contribution to the contributors array, ideally including a https://orcid.org/ identifier to disambiguate from other people with the same name.

{ "contribution": ["createdBy"],
  "name": "Stian Soiland-Reyes",
  "orcid": "https://orcid.org/0000-0001-9842-9718"
}

PAV contributor relations supported by IEEE 2791 include:

  • authoredBy - who originated or gave existence to the work that is expressed by this BCO (who designed the workflow)
  • createdBy - person who made the representation as saved to disk (JSON of this BCO)
  • createdWith - software used to create the representation (e.g. BCO Editor)
  • curatedBy - who shaped the expression (workflow) into the appropriate format (BCO) or who ensured the quality of the representation (see also review key of provenance_domain)
  • importedBy - software that transcribed the work (workflow) to the current form (BCO)
  • providedBy - original provider of the encoded information (e.g. nf-core)
  • retrievedBy - software that retrieved data from external source
  • sourceAccessedBy - someone who consulted an external source of information
  • contributedBy - someone who provided any sort of help in conceiving the work (if none of the above match)

These additional PAV relations can also be provided; however, for software agents or organizations we would need to be flexible with the name and orcid fields in order to give their URIs.
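As a purely hypothetical illustration (not an entry we add in this tutorial), a providedBy contribution crediting the nf-core community could carry its homepage URL inside the name field:

```json
{ "contribution": ["providedBy"],
  "name": "nf-core community (https://nf-co.re/)"
}
```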

Skeleton RO-Crate

Similarly, let's create the high-level RO-Crate. According to the RO-Crate structure, the RO-Crate Metadata File always has the name ro-crate-metadata.json, which we'll create inside data/:

data/ro-crate-metadata.json

{ "@context": "https://w3id.org/ro/crate/1.1/context", 
  "@graph": [
    {
        "@type": "CreativeWork",
        "@id": "ro-crate-metadata.json",
        "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
        "about": {"@id": "./"}
    },
    
    {
      "@id": "./",
      "@type": "Dataset"
    }
  ]
}

This preamble declares the version of the RO-Crate specification used; it is equivalent to the BCO's spec_version. The remaining RO-Crate elements will be added to the @graph array as data entities (files and directories) or contextual entities (e.g. people, organizations).

The root of the RO-Crate represents the whole dataset, in this case our data/ folder, but it also covers external references like the nf-core workflow and reference data. We're going to use the RO-Crate metadata to provide further details on resources and their origin in the world, while the BCO provides finer-grained details of the workflow execution.

In both cases we will mainly use relative paths within the data/ directory, where both metadata files also reside. Thus ./ for the root Dataset corresponds to the data/ directory.

We'll start by describing the RO-Crate itself under the ./ Dataset, including as a minimum the core properties of the Root Data Entity, although any of the http://schema.org/Dataset properties can be utilized:

{
    "@id": "./",
    "@type": "Dataset",
    "name": "Workflow run of nf-core/chipseq",
    "description": "Workflow run of a ChIP-seq peak-calling, QC and differential analysis pipeline",
    "author": {
        "@id": "https://orcid.org/0000-0001-9842-9718"
    },
    "datePublished": "2020-09-24T23:00:00.000Z",
    "distribution": {
        "@id": "https://github.com/stain/bco-ro-example-chipseq/archive/master.zip"
    },
    "hasPart": [
        {
            "@id": "chipseq_20200924.json"
        }
    ],
    "license": {
        "@id": "https://spdx.org/licenses/CC0-1.0"
    }
}

Already you will notice some differences from the BCO. The name could match the provenance_domain/name of the BCO - but as the BCO focuses more on the workflow and the Dataset includes all the files, we've changed it to include "Workflow run of…". However, if your RO-Crate did not include workflow results, then the two could have the same title. description allows us to provide a longer description - comparable to BCO's usability_domain, which we'll populate later, but again describing the whole dataset.

The reason these fields are mainly at dataset level is that we can further describe individual files and resources later as separate data entities. Therefore, here the author of the dataset is https://orcid.org/0000-0001-9842-9718, the ORCID identifier for Stian, as he ran the workflow and gathered (most of) the files, and the license of the dataset (the whole folder) can be different from the license of the workflow. If need be, license, author etc. can be set differently on the ro-crate-metadata.json entity if someone else made that JSON.

The hasPart lists the content of the RO-Crate, currently just the BCO JSON file. We'll describe it as a data entity and indicate that it follows the IEEE 2791 schemas using conformsTo, adding it to the @graph array:

{
    "@id": "chipseq_20200924.json",
    "@type": "File",
    "name": "chipseq_20200924.json",
    "description": "IEEE 2791 description (BioCompute Object) of nf-core/chipseq",
    "encodingFormat": ["application/json", 
        {"@id": "https://www.nationalarchives.gov.uk/PRONOM/fmt/817"}],    
    "conformsTo": {
        "@id": "https://w3id.org/ieee/ieee-2791-schema/"
    }
}

Given the above you can programmatically identify the BCO file chipseq_20200924.json independent of its filename: parse ro-crate-metadata.json as JSON, iterate over the @graph array until you find an entity with a matching conformsTo, then look up its @id to find its (URI escaped) file path within the data/ folder.
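For example, such a lookup could be done with the jq tool (an assumption - any JSON library would work equally well):

```bash
# Sketch (assumes jq): print the @id of the entity in @graph that declares
# conformance to the IEEE 2791 schema, i.e. the relative path of the BCO file.
jq -r '.["@graph"][]
       | select(.conformsTo["@id"]? == "https://w3id.org/ieee/ieee-2791-schema/")
       | .["@id"]' data/ro-crate-metadata.json
```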

Notice how we also declare the encodingFormat as being JSON – the BCO file follows the general syntax of JSON and can be parsed directly with ordinary JSON libraries. However, the particular keys and structure it conforms to are defined by the IEEE 2791 JSON Schema.

Declaring the IEEE 2791 schema

In case someone consumes the RO-Crate without knowing about IEEE 2791 or BioCompute Objects, or to be precise about which release version of the schema was used, we'll also declare a contextual entity for this JSON schema:

{
    "@id": "https://w3id.org/ieee/ieee-2791-schema/",
    "@type": "CreativeWork",
    "name": "IEEE 2791 Object Schema",
    "version": "1.4",
    "citation": "https://doi.org/10.1109/IEEESTD.2020.9094416",
    "subjectOf": {"@id": "https://www.biocomputeobject.org/"}
}

Notice how citation links to the standard IEEE 2791-2020, as RO-Crate recommends for citations, and subjectOf links to the related website https://www.biocomputeobject.org/ that describes the BCO schema in a more textual way.

We don't describe these two new resources much further (consumers can click the URIs), but we give them at least a @type and a name. Other properties from http://schema.org/ScholarlyArticle and http://schema.org/WebSite could also be used:

{
    "@id": "https://doi.org/10.1109/IEEESTD.2020.9094416",
    "@type": "ScholarlyArticle",
    "name": "IEEE 2791-2020 Standard for Bioinformatics Computations and Analyses Generated by High-Throughput Sequencing (HTS) to Facilitate Communication",
    "alternateName": "IEEE Std 2791-2020"
},
{
    "@id": "https://www.biocomputeobject.org/",
    "@type": "WebSite",
    "name": "BioCompute Objects",
    "about": { "@id": "https://doi.org/10.1109/IEEESTD.2020.9094416" }
}

Assigning unique identifiers to BCOs

If the BCO has an object_id that is globally unique, e.g. a randomly generated UUID, then this can be reflected as an identifier URI in the RO-Crate when combined with the registered urn:uuid: prefix.

This can be useful when looking up or matching BCOs later; RO-Crate Metadata files are valid JSON-LD and can be loaded into a knowledge graph using a triple store like Apache Jena Fuseki for detailed querying.

$ echo urn:uuid:`uuidgen`
urn:uuid:dc308d7c-7949-446a-9c39-511b8ab40caf

So we can change our BCO JSON's self-identity from the not-so-unique chipseq_20200924 to:

"object_id": "urn:uuid:dc308d7c-7949-446a-9c39-511b8ab40caf"

and, now that we have a URI, add a corresponding identifier to the chipseq_20200924.json entry in ro-crate-metadata.json:

"identifier": "urn:uuid:dc308d7c-7949-446a-9c39-511b8ab40caf"