🕵 Workflow Provenance
Quickstart
Submit your COMPSs computational experiment with the -p or -z flags (i.e. using runcompss, enqueue_compss, pycompss [run | job submit]).
When the application finishes, run:
pycompss inspect COMPSs_RO-Crate_*
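The inspect command above is the way to view a generated crate. As a complementary, hedged sketch, the Python snippet below only locates candidate crate directories before they are inspected. It assumes the crates are created under the directory being searched, that their names match the COMPSs_RO-Crate_* pattern used above, and that each one holds a ro-crate-metadata.json file at its root, as the RO-Crate specification prescribes. The helper name find_crates is hypothetical and not part of COMPSs.

```python
from pathlib import Path


def find_crates(base_dir="."):
    """Return crate directories under base_dir, sorted by name.

    A directory qualifies only if its name matches COMPSs_RO-Crate_* and it
    contains ro-crate-metadata.json (assumption: standard RO-Crate layout).
    """
    base = Path(base_dir)
    return sorted(
        candidate
        for candidate in base.glob("COMPSs_RO-Crate_*")
        if candidate.is_dir() and (candidate / "ro-crate-metadata.json").is_file()
    )


if __name__ == "__main__":
    # List the crates found in the current working directory.
    for crate in find_crates():
        print(crate)
```

Each path printed this way can then be passed to pycompss inspect.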
Introduction
The COMPSs runtime can record details of an application's execution as metadata, also known as Workflow Provenance. With workflow provenance, you can share not only your workflow application (i.e. the source code) but also your workflow run (i.e. the datasets used as inputs, the outputs generated as results, and details of the environment where the application was run). This is supported for both Python and Java COMPSs applications. More technical details on how provenance is generated in COMPSs, using a lightweight approach that does not introduce overhead to the workflow execution, can be found in the paper:
Provenance information is useful for many purposes, including Governance, Reproducibility, Replicability, Traceability, and Knowledge Extraction. In our case, we have initially targeted workflow provenance recording to enable users to publish research results obtained with COMPSs as artifacts that can be cited in scientific publications, with their corresponding DOI as a persistent identifier. See Section Publish your experiment to learn precisely how to do that. A growing number of scientific conferences request these reproducible artifacts, such as:
Reproducibility at the International Conference on Parallel Processing (ICPP)
The ACM Special Interest Group on Management of Data (SIGMOD) Reproducibility Award
Call for Artifacts at USENIX Conference on File and Storage Technologies (FAST)
And many more…
Tip
A step-by-step guide on how to share your COMPSs execution results in scientific papers can be found here.
When the provenance option is activated, the runtime records every access to a file or directory specified in the application, as well as its direction (IN, OUT, INOUT). In addition, other information is also stored, such as the parameters passed as inputs in the command line that submitted the application, its source files, the workflow diagram and task profiling statistics, and the authors and their institutions, among others. All this information is later used to record the workflow provenance of your application using the RO-Crate specification, with the assistance of the ro-crate-py library.
RO-Crate is based on JSON-LD (JavaScript Object Notation for Linked Data) and is much simpler than other standards and tools created to record provenance, which is why it has been adopted in a number of communities. Using RO-Crate to register the execution's information ensures not only that the provenance of a COMPSs application run is registered correctly, but also compatibility with existing portals that already embrace RO-Crate as their core format for representing metadata, such as WorkflowHub.
Our RO-Crate format is compliant with the Workflow RO-Crate Profile v1.0 and the Provenance Run Crate Profile v0.5. The Provenance Run Crate Profile is the most detailed one in the Workflow Run RO-Crate Profile Collection: it supports recording the internal details of the workflow execution steps, such as the resource usage and the input/output parameters of each task. Users who want to opt out of this detailed registration of provenance data can select the simpler Workflow Run Crate Profile v0.5 instead (see Section YAML configuration file).
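Because an RO-Crate is plain JSON-LD, its metadata can be examined with nothing more than a JSON parser. The sketch below is illustrative and not part of COMPSs or ro-crate-py. It assumes the crate follows the RO-Crate convention of a ro-crate-metadata.json file at its root holding a flat @graph list, and that the root dataset declares its profiles through conformsTo. The helper names (load_graph, count_types, root_profiles) are hypothetical.

```python
import json
from collections import Counter
from pathlib import Path


def as_list(value):
    """JSON-LD values may be a single item or a list; normalise to a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def load_graph(crate_dir):
    """Read ro-crate-metadata.json and return its @graph (list of entities)."""
    metadata_path = Path(crate_dir) / "ro-crate-metadata.json"
    with metadata_path.open(encoding="utf-8") as handle:
        return json.load(handle)["@graph"]


def count_types(graph):
    """Count how many entities carry each @type (an entity may have several)."""
    counts = Counter()
    for entity in graph:
        for type_name in as_list(entity.get("@type")):
            counts[type_name] += 1
    return counts


def root_profiles(graph):
    """Return the profile URIs listed in the root dataset's conformsTo."""
    by_id = {entity["@id"]: entity for entity in graph if "@id" in entity}
    # The metadata descriptor points at the root dataset through "about".
    descriptor = by_id.get("ro-crate-metadata.json", {})
    root_id = descriptor.get("about", {}).get("@id", "./")
    root = by_id.get(root_id, {})
    return [
        ref["@id"]
        for ref in as_list(root.get("conformsTo"))
        if isinstance(ref, dict) and "@id" in ref
    ]
```

Calling count_types(load_graph(crate_dir)) gives a quick overview of what a crate contains, and root_profiles shows which profiles the crate claims to follow, for example after selecting a different profile in the YAML configuration file.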