Workflow Provenance

The COMPSs runtime can record details of an application’s execution as metadata, known as Workflow Provenance. With workflow provenance, you can share not only your workflow application (i.e. the source code) but also your workflow run (i.e. the datasets used as inputs, the outputs generated as results, and details of the environment where the application was run). This is supported for both Python and Java COMPSs applications. More technical details on how Provenance is generated in COMPSs, using a lightweight approach that does not introduce overhead to the workflow execution, can be found in the paper:

Provenance information is useful for several purposes, including Governance, Reproducibility, Replicability, Traceability, and Knowledge Extraction. In our case, we have initially targeted workflow provenance recording to let users publish research results obtained with COMPSs as artifacts that can be cited in scientific publications, with a DOI as their persistent identifier. See Section Publish and cite your results with WorkflowHub to learn precisely how to do that. We see a growing number of scientific conferences requesting these reproducible artifacts, such as:

Tip

A step-by-step guide on how to share your COMPSs execution results in scientific papers can be found here.

When the provenance option is activated, the runtime records every access to a file or directory specified in the application, as well as its direction (IN, OUT, INOUT). In addition, it stores other information, such as the parameters passed in the command line that submitted the application, the application's source files, the workflow diagram, task profiling statistics, and the authors and their institutions, among other details. All this information is later used to record the workflow provenance of your application following the RO-Crate specification, with the assistance of the ro-crate-py library. RO-Crate is based on JSON-LD (JavaScript Object Notation for Linked Data) and is much simpler than other standards and tools created to record Provenance, which is why a number of communities have adopted it. Using RO-Crate to register the execution’s information ensures not only that the Provenance of a COMPSs application run is recorded correctly, but also that it is compatible with existing portals that already use RO-Crate as their core format for representing metadata, such as WorkflowHub. Our RO-Crate format is compliant with the Workflow RO-Crate Profile v1.0 and the Workflow Run Crate Profile v0.5.
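
To make the mapping from recorded file accesses to RO-Crate metadata more concrete, the following is a minimal sketch, not the COMPSs implementation. It uses only the Python standard library rather than ro-crate-py. The file names, the "#run" identifier, the build_crate function and the app.py workflow name are all hypothetical, chosen for illustration. It assumes that IN and INOUT accesses become inputs ("object") of a CreateAction describing the run, and that OUT and INOUT accesses become its outputs ("result"), in the general style of run-oriented RO-Crate profiles. The metadata that COMPSs actually generates is considerably richer.

```python
import json

# Hypothetical access log: (path, direction) pairs, similar in spirit to what
# a runtime could record. These names are illustrative, not COMPSs output.
ACCESSES = [
    ("dataset/input.csv", "IN"),
    ("results/output.csv", "OUT"),
    ("state/model.bin", "INOUT"),
]


def build_crate(accesses, workflow_id="app.py"):
    """Build a minimal RO-Crate style JSON-LD document from file accesses."""
    # Assumption: IN and INOUT files are consumed by the run,
    # OUT and INOUT files are produced by it.
    inputs = sorted({p for p, d in accesses if d in ("IN", "INOUT")})
    outputs = sorted({p for p, d in accesses if d in ("OUT", "INOUT")})
    files = sorted(set(inputs) | set(outputs))

    graph = [
        # Metadata descriptor: marks this document as RO-Crate metadata.
        {
            "@id": "ro-crate-metadata.json",
            "@type": "CreativeWork",
            "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
            "about": {"@id": "./"},
        },
        # Root dataset: lists the workflow and every data file it touched.
        {
            "@id": "./",
            "@type": "Dataset",
            "mainEntity": {"@id": workflow_id},
            "hasPart": [{"@id": workflow_id}] + [{"@id": f} for f in files],
            "mentions": [{"@id": "#run"}],
        },
        # The workflow application itself (hypothetical file name).
        {
            "@id": workflow_id,
            "@type": ["File", "SoftwareSourceCode", "ComputationalWorkflow"],
        },
        # One execution of the workflow, linking inputs and outputs.
        {
            "@id": "#run",
            "@type": "CreateAction",
            "instrument": {"@id": workflow_id},
            "object": [{"@id": f} for f in inputs],
            "result": [{"@id": f} for f in outputs],
        },
    ]
    # One File entity per data file referenced above.
    graph += [{"@id": f, "@type": "File"} for f in files]
    return {"@context": "https://w3id.org/ro/crate/1.1/context", "@graph": graph}


crate = build_crate(ACCESSES)
print(json.dumps(crate, indent=2))
```

Because the result is plain JSON-LD, it can be inspected with any JSON tool. In this sketch an INOUT file appears in both the "object" and "result" lists of the run.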