Metadata examples

PyCOMPSs example (laptop execution)

In the RO-Crate specification, the root file containing the metadata that describes the crate is named ro-crate-metadata.json. This section shows how to navigate an ro-crate-metadata.json file produced by running a PyCOMPSs application on a laptop. The application is an out-of-core matrix multiplication that takes matrices A and B as inputs from an inputs/ sub-directory and produces matrix C as the result of their multiplication (in the code, C is also passed as input, so that it starts as a matrix initialized with 0s). We also set the data_persistence term of the YAML configuration file to True, to indicate that we want the datasets to be included in the resulting crate. For the specific details on the fields provided in the JSON file, please refer to the RO-Crate specification website.

The corresponding ro-crate-metadata.json can be found here:

Intuitively, if you search through the JSON file you can find several interesting terms:

  • creator contains the list of authors, identified by their ORCID.

  • publisher lists the organizations of the authors.

  • hasPart in ./ lists all the files and directories that this workflow needs and generates, as well as the ones included in the crate. In this example they are referenced with relative paths, because they are included in the crate.

  • ComputationalWorkflow identifies the main file of the application (in the example, application_sources/matmul_directory.py). It includes a reference to the generated workflow diagram in the image field.

  • version contains the specific COMPSs version and build used to run this application (in the example, 3.3). This field is very important for reproducibility or replicability, since the behavior of COMPSs features may vary between versions of the programming model runtime.

  • CreateAction details the specific execution of the workflow, compliant with the Workflow Run Crate Profile.

    • The defined Agent is recorded as the agent.

    • The description term records details on the host that ran the workflow (architecture, Operating System version).

    • The environment term includes references to the COMPSs and / or SLURM related environment variables used during the run.

    • The startTime and endTime terms record the start and end times of the application, respectively, in UTC.

    • The object term makes reference to the input files or directories used by the workflow.

    • The result term references the output files or directories generated by the workflow.

We encourage the reader to navigate through this ro-crate-metadata.json file example to get familiar with its contents. Many of the fields are easily and directly understandable.
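As a starting point for that exploration, the terms listed above can be extracted with a few lines of Python. This is only a minimal sketch: the helper names (load_graph, has_type, ref_ids, summarize) are hypothetical, and the sample graph is a heavily trimmed, hand-written stand-in whose identifiers and timestamps are invented (only the workflow path is taken from the text). It assumes the usual RO-Crate layout, in which all entities are stored in a flat "@graph" list and refer to each other through "@id" references.

```python
import json


def load_graph(path):
    """Read ro-crate-metadata.json and return the entities under "@graph"."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)["@graph"]


def has_type(entity, type_name):
    """Check "@type", which may be a single string or a list of strings."""
    types = entity.get("@type", [])
    return type_name in ([types] if isinstance(types, str) else types)


def ref_ids(value):
    """Turn one reference, or a list of references, into a list of "@id" strings."""
    if value is None:
        return []
    refs = [value] if isinstance(value, dict) else value
    return [ref["@id"] for ref in refs]


def summarize(graph):
    """Collect the terms discussed above into a plain dictionary."""
    root = next(e for e in graph if e.get("@id") == "./")
    workflow = next(e for e in graph if has_type(e, "ComputationalWorkflow"))
    action = next(e for e in graph if has_type(e, "CreateAction"))
    return {
        "creators": ref_ids(root.get("creator")),
        "parts": ref_ids(root.get("hasPart")),
        "workflow": workflow["@id"],
        "agent": ref_ids(action.get("agent")),
        "inputs": ref_ids(action.get("object")),
        "outputs": ref_ids(action.get("result")),
        "start": action.get("startTime"),
        "end": action.get("endTime"),
    }


# Hypothetical, trimmed graph: identifiers and timestamps are invented,
# except the workflow path, which is the one quoted in the text.
sample_graph = [
    {
        "@id": "./",
        "@type": "Dataset",
        "creator": [{"@id": "https://orcid.org/0000-0000-0000-0000"}],
        "hasPart": [
            {"@id": "application_sources/matmul_directory.py"},
            {"@id": "example_inputs/"},
            {"@id": "example_output/"},
        ],
    },
    {
        "@id": "application_sources/matmul_directory.py",
        "@type": ["File", "SoftwareSourceCode", "ComputationalWorkflow"],
    },
    {
        "@id": "#example-run",
        "@type": "CreateAction",
        "agent": {"@id": "https://orcid.org/0000-0000-0000-0000"},
        "object": [{"@id": "example_inputs/"}],
        "result": [{"@id": "example_output/"}],
        "startTime": "2000-01-01T00:00:00+00:00",
        "endTime": "2000-01-01T00:01:00+00:00",
    },
]

summary = summarize(sample_graph)
for key, value in summary.items():
    print(f"{key}: {value}")
```

With a real crate, the same summary could be obtained with summarize(load_graph("ro-crate-metadata.json")). Matching on "@type" rather than on fixed identifiers keeps the sketch independent of the file names of a particular application.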

Java COMPSs example (MareNostrum supercomputer execution)

In this second ro-crate-metadata.json example, we illustrate the workflow provenance resulting from the execution of a Java COMPSs application on the MareNostrum V supercomputer. The application is a matrix LU factorization for out-of-core sparse matrices, implemented with COMPSs in the Java programming language. In this algorithm, matrix A is both input and output of the workflow, since the factorization overwrites the original value of A. In addition, we have used a hyper-matrix of 4x4 blocks (i.e. the matrix is divided into 16 blocks that contain 16 elements each) and, if a block is all 0s, the corresponding file is not created in the file system (in the example, this happens for blocks A.0.3, A.1.3, A.3.0 and A.3.1). We do not define the data_persistence option, which means it defaults to False, and the datasets are not included in the resulting crate (i.e. only references to the location of the files are included in the metadata).

The corresponding ro-crate-metadata.json can be found here:

Besides the terms already mentioned in the previous example (creator, publisher, hasPart, ComputationalWorkflow, version, CreateAction), this example has some additional points of interest. Let us first look at the YAML configuration file:

COMPSs Workflow Information:
  name: Java COMPSs LU Factorization for Sparse Matrices, MareNostrum V, 3 nodes, no data persistence
  description: |
    ...
  license: Apache-2.0
  sources: [src, jar, xml, Readme, pom.xml]

Authors:
  - name: Raül Sirvent
    e-mail: Raul.Sirvent@bsc.es
    orcid: https://orcid.org/0000-0003-0606-2512
    organisation_name: Barcelona Supercomputing Center
    ror: https://ror.org/05sd8tv96

We can see that we have specified several directories to be added as source files of the application: the src folder that contains the .java and .class files, the jar folder with the sparseLU.jar file, and the xml folder with extra XML configuration files. We also add the Readme and pom.xml files so that they are packed in the resulting crate. This example also shows that the script is able to select the correct SparseLU.java main file as the ComputationalWorkflow in the RO-Crate, even when three files with the same name exist in the sources (they implement three versions of the same algorithm: using files, arrays or objects). Finally, since no Agent is defined, the first author is considered to be the agent. The resulting tree for the source files is:

application_sources/
|-- Readme
|-- jar
|   `-- sparseLU.jar
|-- pom.xml
|-- src
|   `-- main
|       `-- java
|           `-- sparseLU
|               |-- arrays
|               |   |-- SparseLU.class
|               |   |-- SparseLU.java
|               |   |-- SparseLUImpl.class
|               |   |-- SparseLUImpl.java
|               |   |-- SparseLUItf.class
|               |   `-- SparseLUItf.java
|               |-- files
|               |   |-- Block.class
|               |   |-- Block.java
|               |   |-- SparseLU.class
|               |   |-- SparseLU.java
|               |   |-- SparseLUImpl.class
|               |   |-- SparseLUImpl.java
|               |   |-- SparseLUItf.class
|               |   `-- SparseLUItf.java
|               `-- objects
|                   |-- Block.class
|                   |-- Block.java
|                   |-- SparseLU.class
|                   |-- SparseLU.java
|                   |-- SparseLUItf.class
|                   `-- SparseLUItf.java
`-- xml
    |-- project.xml
    `-- resources.xml

9 directories, 25 files

In this second example we do not explicitly add the input and output files of the workflow (i.e. data_persistence is False), since in some cases datasets could be extremely large. Our crate therefore does not have a dataset sub-folder and only includes references to the files, which are meant as pointers to where they can be found rather than as publicly accessible URI references. In this Java COMPSs example, the files can be found on the gs05r2b06-ib0 host, which is an internal hostname of MareNostrum V (MN5). This means that, for reproducibility purposes, a new user would have to request access to the MN5 paths specified by the corresponding URIs (i.e. /gpfs/home/bsc/bsc019057/...).
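To see which host and path a given reference points to, it can be split with the Python standard library. The sketch below assumes that the references are file:// URIs that carry the hostname in the authority part. The function name split_reference and the example host and path are hypothetical placeholders, not values taken from the crate.

```python
from urllib.parse import urlparse


def split_reference(uri):
    """Return (hostname, path) for a file:// style reference."""
    parsed = urlparse(uri)
    return parsed.netloc, parsed.path


# Hypothetical reference: host and path are invented for illustration.
host, path = split_reference("file://example-host/example/path/A.0.0")
print(f"host: {host}")
print(f"path: {path}")
```

The hostname tells the reader which system stores the file, and the path is the location that access would have to be requested for.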

The CreateAction term also has a richer set of information, obtained from MareNostrum’s SLURM workload manager. We can see that both the id and the description terms include the SLURM_JOB_ID, which can be used to see more details and statistics on the job run from SLURM using the User Portal tool. In addition, many more relevant environment variables are captured (specifically SLURM and COMPSs related), which provide details on how the execution was performed (e.g. SLURM_JOB_NODELIST, SLURM_JOB_NUM_NODES, SLURM_JOB_CPUS_PER_NODE, COMPSS_MASTER_NODE and COMPSS_WORKER_NODES, among others).
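The captured variables can be listed with a short Python sketch. It assumes that each entry of the environment term is an "@id" reference to an entity in the "@graph" list that holds the variable in its name and value fields. The helper names (as_list, environment_variables) are hypothetical, and all identifiers and values in the sample graph are invented for illustration.

```python
def as_list(value):
    """JSON-LD values may be missing, single, or a list; always return a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def environment_variables(graph, prefixes=("SLURM_", "COMPSS_")):
    """Map variable names to values for the CreateAction's environment entries."""
    by_id = {entity["@id"]: entity for entity in graph if "@id" in entity}
    action = next(e for e in graph if "CreateAction" in as_list(e.get("@type")))
    found = {}
    for ref in as_list(action.get("environment")):
        entity = by_id.get(ref["@id"], {})
        name = entity.get("name", "")
        if name.startswith(prefixes):  # str.startswith accepts a tuple
            found[name] = entity.get("value")
    return found


# Hypothetical, trimmed graph: identifiers and values are invented.
sample_graph = [
    {
        "@id": "#example-run",
        "@type": "CreateAction",
        "environment": [
            {"@id": "#slurm_job_id"},
            {"@id": "#compss_master_node"},
            {"@id": "#other_variable"},
        ],
    },
    {"@id": "#slurm_job_id", "@type": "PropertyValue",
     "name": "SLURM_JOB_ID", "value": "0000000"},
    {"@id": "#compss_master_node", "@type": "PropertyValue",
     "name": "COMPSS_MASTER_NODE", "value": "example-host"},
    {"@id": "#other_variable", "@type": "PropertyValue",
     "name": "OTHER_VARIABLE", "value": "ignored"},
]

variables = environment_variables(sample_graph)
for name, value in variables.items():
    print(f"{name}={value}")
```

Filtering by prefix keeps only the SLURM and COMPSs related entries, which are the ones that describe how the execution was performed.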