Metadata examples
PyCOMPSs example (laptop execution)
In the RO-Crate specification, the root file containing the metadata that describes the crate is named
ro-crate-metadata.json. In this section, we show how to navigate an ro-crate-metadata.json file resulting from
a PyCOMPSs application execution on a laptop, specifically an out-of-core matrix multiplication example that
includes matrices A and B as inputs in an inputs/ sub-directory, and matrix C as the result of their multiplication
(which in the code is also passed as input, to have a matrix initialized with 0s). We also set the data_persistence
term of the YAML configuration file to True to indicate that we want the datasets to be included in the resulting
crate.
For all the specific details on the fields provided in the JSON file, please refer to the
RO-Crate specification website.
The corresponding ro-crate-metadata.json can be found here:
PyCOMPSs Matrix Multiplication, out-of-core using files. Example using DIRECTORY parameters executed at laptop, data persistence True: https://doi.org/10.48546/workflowhub.workflow.1046.1
Intuitively, if you search through the JSON file you can find several interesting terms:
- creator contains the list of authors, identified by their ORCID.
- publisher lists the organizations of the authors.
- hasPart in ./ lists all the files and directories this workflow needs and generates, and also the ones included in the crate. They are referenced with relative paths, since they are included in the crate.
- ComputationalWorkflow is the main file of the application (in the example, application_sources/matmul_directory.py). Includes a reference to the generated workflow diagram in the image field.
- version contains the COMPSs specific version and build used to run this application. In the example: 3.3. This is a very important field to achieve reproducibility or replicability, since COMPSs features may vary their behavior in different versions of the programming model runtime.
- CreateAction details the specific execution of the workflow, compliant with the Workflow Run Crate Profile.
  - The defined Agent is recorded as the agent.
  - The description term records details on the host that ran the workflow (architecture, Operating System version).
  - The environment term includes references to the COMPSs and / or SLURM related environment variables used during the run.
  - The startTime and endTime terms include respectively the starting and ending time of the application as UTC time.
  - The object term makes reference to the input files or directories used by the workflow.
  - The result term references the output files or directories generated by the workflow.
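As a minimal sketch of how these terms can be located programmatically, the following Python snippet indexes the @graph array of an RO-Crate metadata document and collects some of the fields listed above. The SAMPLE dictionary is a hypothetical, heavily trimmed stand-in for a real ro-crate-metadata.json (its identifiers, dates and paths are placeholders, not values copied from the published crate), and the helper names are ours, not part of COMPSs.

```python
import json


def load_crate(path):
    """Read an ro-crate-metadata.json file from disk (not called below)."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


# Hypothetical, trimmed example. All values are placeholders.
SAMPLE = {
    "@context": "https://w3id.org/ro/crate/1.1/context",
    "@graph": [
        {
            "@id": "./",
            "@type": "Dataset",
            "creator": [{"@id": "https://orcid.org/0000-0000-0000-0000"}],
            "hasPart": [
                {"@id": "application_sources/matmul_directory.py"},
                {"@id": "dataset/inputs/"},
            ],
        },
        {
            "@id": "application_sources/matmul_directory.py",
            "@type": ["File", "SoftwareSourceCode", "ComputationalWorkflow"],
        },
        {
            "@id": "#hypothetical-run",
            "@type": "CreateAction",
            "startTime": "2024-01-01T10:00:00+00:00",
            "endTime": "2024-01-01T10:05:00+00:00",
            "object": [{"@id": "dataset/inputs/"}],
            "result": [{"@id": "dataset/C/"}],
        },
    ],
}


def as_list(value):
    """JSON-LD values may be a single item or a list; normalize to a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def find_by_type(graph, type_name):
    """Return every entity whose @type includes type_name."""
    return [e for e in graph if type_name in as_list(e.get("@type"))]


def ids(value):
    """Extract the @id of each reference in a property value."""
    return [ref["@id"] for ref in as_list(value)]


def summarize(crate):
    graph = crate["@graph"]
    by_id = {entity["@id"]: entity for entity in graph}
    root = by_id["./"]  # the root dataset of the crate
    workflow = find_by_type(graph, "ComputationalWorkflow")[0]
    run = find_by_type(graph, "CreateAction")[0]
    return {
        "authors": ids(root.get("creator")),
        "parts": ids(root.get("hasPart")),
        "workflow": workflow["@id"],
        "inputs": ids(run.get("object")),
        "outputs": ids(run.get("result")),
        "start": run.get("startTime"),
        "end": run.get("endTime"),
    }


summary = summarize(SAMPLE)
for key, value in summary.items():
    print(f"{key}: {value}")
```

To inspect a downloaded crate instead of the placeholder data, pass the result of load_crate("ro-crate-metadata.json") to summarize.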
We encourage the reader to navigate through this ro-crate-metadata.json file example to get familiar with its
contents. Many of the fields are easily and directly understandable.
Java COMPSs example (MareNostrum supercomputer execution)
In this second ro-crate-metadata.json example, we illustrate the workflow provenance resulting from a Java COMPSs
application execution on the MareNostrum V supercomputer. We show the execution of a matrix LU factorization
for out-of-core sparse matrices implemented with COMPSs and using the Java programming language. In this algorithm,
matrix A is both input and output of the workflow, since the factorization overwrites the original value of A.
In addition, we have used a hyper-matrix of 4x4 blocks (i.e. the matrix is divided into 16 blocks that contain 16
elements each) and, if a block is all 0s, the corresponding file is not
created in the file system (in the example, this happens for blocks A.0.3, A.1.3, A.3.0 and A.3.1). We
do not define the data_persistence option, which means it will be False, and the datasets are not included in
the resulting crate (i.e. only references to the location of the files are included in the metadata).
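The effect of the all-zero blocks on the set of files can be sketched in a few lines of Python. This is only an illustration: it assumes that each block file is named after its block (A.i.j), and the function names are hypothetical rather than taken from the application sources.

```python
BLOCKS_PER_SIDE = 4  # 4x4 hyper-matrix, 16 blocks in total

# Blocks that are all 0s in the example, so no file is created for them.
EMPTY_BLOCKS = {(0, 3), (1, 3), (3, 0), (3, 1)}


def block_name(i, j, matrix="A"):
    """Assumed naming scheme: <matrix>.<row>.<column>."""
    return f"{matrix}.{i}.{j}"


def existing_block_files(n=BLOCKS_PER_SIDE, empty=EMPTY_BLOCKS):
    """List the block files expected on disk, skipping all-zero blocks."""
    return [
        block_name(i, j)
        for i in range(n)
        for j in range(n)
        if (i, j) not in empty
    ]


files = existing_block_files()
print(len(files))  # 12 of the 16 blocks have a file
```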
The corresponding ro-crate-metadata.json can be found here:
Java COMPSs LU Factorization for Sparse Matrices, MareNostrum V, 3 nodes, no data persistence: https://doi.org/10.48546/workflowhub.workflow.1047.1
The terms already mentioned in the previous example (creator, publisher, hasPart,
ComputationalWorkflow, version, CreateAction) also appear in this crate. Before looking at what is new, we first
observe the YAML configuration file:
COMPSs Workflow Information:
  name: Java COMPSs LU Factorization for Sparse Matrices, MareNostrum V, 3 nodes, no data persistence
  description: |
    ...
  license: Apache-2.0
  sources: [src, jar, xml, Readme, pom.xml]
Authors:
  - name: Raül Sirvent
    e-mail: Raul.Sirvent@bsc.es
    orcid: https://orcid.org/0000-0003-0606-2512
    organisation_name: Barcelona Supercomputing Center
    ror: https://ror.org/05sd8tv96
We can see that we have specified several directories to be added as source files of the application:
the src folder that contains the .java and .class files, the jar folder with the sparseLU.jar file, and the
xml folder with extra XML configuration files. We also add the Readme and pom.xml files
so that they are packed in the resulting crate. This example also shows that the script is able to select the correct
SparseLU.java main file as the ComputationalWorkflow in the RO-Crate, even when three files with the same
name exist in the sources (i.e. they implement three versions of the same algorithm: using files, arrays or
objects). Finally, since no Agent is defined, the first author is considered as such. The resulting
tree for the source files is:
application_sources/
|-- Readme
|-- jar
|   `-- sparseLU.jar
|-- pom.xml
|-- src
|   `-- main
|       `-- java
|           `-- sparseLU
|               |-- arrays
|               |   |-- SparseLU.class
|               |   |-- SparseLU.java
|               |   |-- SparseLUImpl.class
|               |   |-- SparseLUImpl.java
|               |   |-- SparseLUItf.class
|               |   `-- SparseLUItf.java
|               |-- files
|               |   |-- Block.class
|               |   |-- Block.java
|               |   |-- SparseLU.class
|               |   |-- SparseLU.java
|               |   |-- SparseLUImpl.class
|               |   |-- SparseLUImpl.java
|               |   |-- SparseLUItf.class
|               |   `-- SparseLUItf.java
|               `-- objects
|                   |-- Block.class
|                   |-- Block.java
|                   |-- SparseLU.class
|                   |-- SparseLU.java
|                   |-- SparseLUItf.class
|                   `-- SparseLUItf.java
`-- xml
    |-- project.xml
    `-- resources.xml

9 directories, 25 files
Since in this second example we do not explicitly add the input and output files of the workflow (i.e.
data_persistence is False; in some cases, datasets can be extremely large),
our crate does not have a dataset sub-folder and only includes references to the files,
which are meant as pointers to where they can be found, rather than publicly accessible URI references. Therefore,
in this Java COMPSs example, the files can be found on the gs05r2b06-ib0 host, which is an internal hostname of
MareNostrum V (MN5). This means that, for
reproducibility purposes, a new user would have to request access to the MN5 paths specified by the corresponding
URIs (i.e. /gpfs/home/bsc/bsc019057/...).
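To make the structure of such a reference concrete, the sketch below splits a file reference into its host and path parts with Python's standard urllib.parse module. It assumes that the references use the file://hostname/path form. The path in the example is a hypothetical placeholder, not one taken from the crate.

```python
from urllib.parse import urlparse


def split_file_reference(uri):
    """Return (scheme, host, path) for a reference such as file://host/path."""
    parts = urlparse(uri)
    return parts.scheme, parts.netloc, parts.path


# The hostname comes from the example; the path is a hypothetical placeholder.
example_uri = "file://gs05r2b06-ib0/hypothetical/dir/A.0.0"

scheme, host, path = split_file_reference(example_uri)
print(scheme)  # file
print(host)    # gs05r2b06-ib0
print(path)    # /hypothetical/dir/A.0.0
```

The host part tells a reader which machine held the file, and the path part is what access would need to be requested for.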
The CreateAction term also has a richer set of information available from MareNostrum’s SLURM workload manager. We
can see that both the id and the description terms include the SLURM_JOB_ID, which can be used to see more
details and statistics on the job run from SLURM using the User Portal tool.
In addition, many more relevant environment variables are captured (specifically SLURM and COMPSs related),
which provide details on how the execution has been performed (e.g.
SLURM_JOB_NODELIST, SLURM_JOB_NUM_NODES, SLURM_JOB_CPUS_PER_NODE, COMPSS_MASTER_NODE,
COMPSS_WORKER_NODES, among others).
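The captured variables can be collected into a plain dictionary with a short Python sketch. It assumes that each entry of the environment term references a PropertyValue entity carrying name and value fields. The graph below is a hypothetical, trimmed stand-in with placeholder values, and the function name is ours.

```python
# Hypothetical, trimmed @graph. Identifiers and values are placeholders.
GRAPH = [
    {
        "@id": "#hypothetical-run",
        "@type": "CreateAction",
        "environment": [
            {"@id": "#slurm_job_num_nodes"},
            {"@id": "#compss_master_node"},
        ],
    },
    {
        "@id": "#slurm_job_num_nodes",
        "@type": "PropertyValue",
        "name": "SLURM_JOB_NUM_NODES",
        "value": "3",
    },
    {
        "@id": "#compss_master_node",
        "@type": "PropertyValue",
        "name": "COMPSS_MASTER_NODE",
        "value": "hypothetical-node",
    },
]


def run_environment(graph):
    """Map variable names to values for the first CreateAction in the graph."""
    by_id = {entity["@id"]: entity for entity in graph}
    run = next(e for e in graph if e.get("@type") == "CreateAction")
    refs = run.get("environment", [])
    if isinstance(refs, dict):  # a single reference is not wrapped in a list
        refs = [refs]
    variables = {}
    for ref in refs:
        entity = by_id.get(ref["@id"])
        if entity is not None:  # skip references with no matching entity
            variables[entity["name"]] = entity["value"]
    return variables


env = run_environment(GRAPH)
for name, value in sorted(env.items()):
    print(f"{name}={value}")
```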