Remote Access to Computing Servers and Clusters (GOS)

This section walks you through the COMPSs execution environment that allows users to execute a COMPSs application on several remote machines and computing clusters from a local machine. Access to the remote resources relies on the SSH (Secure Shell) and SCP (Secure Copy) protocols, which are the most widely used protocols for establishing a secure, encrypted connection between a client computer and a remote server within a cluster.

Although this feature has been designed to work with resources that have a job submission queue, it can also be used with any other type of machine that can be accessed through an SSH connection.

Requirements

In order to use COMPSs with remote clusters some requirements must be fulfilled:

Generate a public-private key pair and authorize it in every cluster that will be used (more details in section Configure SSH passwordless).
Have these remote resources in the known hosts file located at ~/.ssh/known_hosts.
COMPSs must be installed on both the master and all the remote clusters.
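
The known hosts requirement can be checked before launching anything. The following is a minimal sketch, not part of COMPSs: the function names are hypothetical, and it assumes plain (unhashed) entries in the known hosts file. Hashed entries (lines starting with |1|) and entries for non-default ports are not matched here; for those, ssh-keygen -F <host> is the more reliable check.

```python
from pathlib import Path


def host_in_known_hosts(known_hosts_text, host):
    """Return True if `host` appears as a plain (unhashed) entry."""
    for line in known_hosts_text.splitlines():
        line = line.strip()
        # Skip blank lines, comments and marker lines (@cert-authority, @revoked).
        if not line or line.startswith("#") or line.startswith("@"):
            continue
        # The first field is a comma-separated list of host names and addresses.
        if host in line.split()[0].split(","):
            return True
    return False


def missing_hosts(hosts, known_hosts_path=None):
    """Return the hosts that have no plain entry in the known hosts file."""
    if known_hosts_path:
        path = Path(known_hosts_path)
    else:
        path = Path.home() / ".ssh" / "known_hosts"
    text = path.read_text() if path.exists() else ""
    return [h for h in hosts if not host_in_known_hosts(text, h)]
```

Calling missing_hosts(["glogin1.bsc.es"]) returns an empty list when the host is already known, or the list of hosts that still need a first manual ssh connection.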

Important

Both the client and the remote computing resources should have the same or a compatible version of COMPSs, which must be 3.2 or higher.

Execution

The execution of an application using this method consists of 3 steps:

Step 1: Deployment

The very first step is to copy the application and its necessary files to the remote machines. If the application is written in Java or C, the compiled files must also be transferred to the remote machines or compiled there.

This can be easily accomplished with the scp command as follows:

$ scp -r /local/path/application/ myUser@remoteMachine:/remote/path/.

This must be done once for every new application, which can then be run as many times as needed. If the application is updated, this step must be repeated so that the local and remote copies stay identical.
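
When several remote machines are involved, the same copy has to be repeated for each of them. The sketch below shows one way to script it. The helper names and the dry_run switch are assumptions made for illustration; only the scp -r invocation itself comes from the command shown above.

```python
import subprocess


def build_scp_command(local_dir, user, host, remote_dir):
    """Build the recursive scp command as an argument list (no shell involved)."""
    return ["scp", "-r", local_dir, f"{user}@{host}:{remote_dir}"]


def deploy(local_dir, targets, dry_run=True):
    """Copy local_dir to every (user, host, remote_dir) target.

    With dry_run=True the commands are only returned, not executed.
    """
    commands = [build_scp_command(local_dir, u, h, r) for (u, h, r) in targets]
    if not dry_run:
        for cmd in commands:
            # check=True raises CalledProcessError if scp exits with an error.
            subprocess.run(cmd, check=True)
    return commands
```

For instance, deploy("/local/path/application/", [("myUser", "remoteMachine", "/remote/path/.")]) returns the command from this step without running it, and passing dry_run=False performs the copy.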

Step 2: Configuration

In order to run the application, COMPSs needs the descriptions of the remote machines (e.g. clusters) used for the execution. This information must be provided in two XML files: the resources and project XML files (more details in Resources file and Project file). The resources file has to include the description of the available clusters and the Submission Modes, and the project file has to provide the access information (user, keys) and the location where COMPSs and the application are installed in every cluster.

The following code shows the basic structure of the resources.xml file using interactive submission mode (a working example of the resources.xml file using batch submission mode for MN5 is shown in the Execution example).

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<ResourcesList>
<ComputingCluster Name="COMPSsWorker01">
    <Adaptors>
        <Adaptor Name="es.bsc.compss.gos.master.GOSAdaptor">
            <SubmissionSystem>
                <Interactive/>
            </SubmissionSystem>
            <BrokerAdaptor>sshtrilead</BrokerAdaptor>
        </Adaptor>
    </Adaptors>
    <ClusterNode Name="compute_node_type1">
        <MaxNumNodes>10</MaxNumNodes>
        <Processor Name="P1">
            <ComputingUnits>8</ComputingUnits>
            <Type>CPU</Type>
        </Processor>
        ...
    </ClusterNode>
</ComputingCluster>
</ResourcesList>

The following code shows the structure of the project.xml file using interactive submission mode (a working example of the project.xml file using batch submission mode for MN5 is shown in the Execution example).

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Project>
    <MasterNode/>
    <ComputingCluster Name="COMPSsWorker01">
        <LimitOfTasks>10</LimitOfTasks>
        <Adaptors>
            <Adaptor Name="es.bsc.compss.gos.master.GOSAdaptor">
                <SubmissionSystem>
                    <Interactive/>
                </SubmissionSystem>
                <BrokerAdaptor>sshtrilead</BrokerAdaptor>
            </Adaptor>
        </Adaptors>
        <InstallDir>/opt/COMPSs/</InstallDir>
        <WorkingDir>/tmp/COMPSsWorker01/</WorkingDir>
        <User>myUser</User>
        <ClusterNode Name="compute_node1">
            <NumberOfNodes>2</NumberOfNodes>
        </ClusterNode>
    </ComputingCluster>
</Project>

The Name given to the computing cluster must equal the host name of the remote cluster, and the User tag is the user for that host. For example, to access the remote machine as myUser@remoteMachine, the XML should be written as follows:

<ComputingCluster Name="remoteMachine">
    [... ExtraInformation ...]
    <User>myUser</User>
</ComputingCluster>

Caution

If a user is not provided, the current user in the local node will be used for the remote nodes.

As shown before, the InstallDir tag is necessary and must be the absolute path to the folder where COMPSs is installed in the remote cluster.
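
To make these rules concrete, the sketch below builds the access-related part of a ComputingCluster entry with Python's standard xml.etree.ElementTree module. The function name is hypothetical, and the fragment is deliberately partial: it omits the Adaptors and ClusterNode elements shown earlier and is not validated against the COMPSs schema.

```python
import posixpath
import xml.etree.ElementTree as ET


def cluster_entry(host, install_dir, working_dir, user=None):
    """Build a partial <ComputingCluster> element for project.xml."""
    # InstallDir must be an absolute path on the remote cluster.
    if not posixpath.isabs(install_dir):
        raise ValueError("InstallDir must be an absolute path")
    # The Name attribute is the host name used to reach the cluster.
    cluster = ET.Element("ComputingCluster", Name=host)
    ET.SubElement(cluster, "InstallDir").text = install_dir
    ET.SubElement(cluster, "WorkingDir").text = working_dir
    # Without <User>, the current local user is used for the remote nodes.
    if user is not None:
        ET.SubElement(cluster, "User").text = user
    return cluster


entry = cluster_entry("remoteMachine", "/opt/COMPSs/", "/tmp/COMPSsWorker01/", user="myUser")
print(ET.tostring(entry, encoding="unicode"))
```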

Submission Modes

The SubmissionSystem tag of the resources.xml and project.xml is used to define how to submit the tasks to the remote resources.

This adaptor supports two different forms for submitting the tasks generated by COMPSs:

Interactive Mode
Batch Mode

Important

If both submission systems are defined as possible, the application will run in interactive mode.
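
The precedence rule above can be expressed in a few lines. This is only an illustration of the documented behaviour, not the logic COMPSs uses internally; the function name is hypothetical.

```python
import xml.etree.ElementTree as ET


def effective_submission_mode(adaptor_xml):
    """Return 'interactive' or 'batch' for an <Adaptor> XML fragment."""
    adaptor = ET.fromstring(adaptor_xml)
    system = adaptor.find("SubmissionSystem")
    if system is None:
        raise ValueError("no <SubmissionSystem> element found")
    # Interactive wins whenever it is declared, even if Batch is also present.
    if system.find("Interactive") is not None:
        return "interactive"
    if system.find("Batch") is not None:
        return "batch"
    raise ValueError("no submission mode declared")
```

An adaptor that declares both an Interactive and a Batch element therefore resolves to "interactive".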

Interactive Mode

This mode launches the tasks directly on the remote machines, and should be used when we have direct access to the computing hardware (NO queuing system in the remote machine).

Example of setting the interactive mode. This code MUST be in resources.xml and may OPTIONALLY be in project.xml:

<Adaptors>
    <Adaptor Name="es.bsc.compss.gos.master.GOSAdaptor">
        <SubmissionSystem>
            <Interactive/>
        </SubmissionSystem>
    </Adaptor>
</Adaptors>

Batch Mode

Computing clusters are usually shared by different users and, to enable a proper sharing of resources, the computations are spawned using a job submission system (e.g. SLURM). The Batch Mode option handles that aspect and manages the execution of the application tasks as jobs in the cluster. Consequently, the user has to provide the following information in the project and resources XML files.

Port

The port used for SSH Communication.

Optional ; Default: 22

MaxExecTime

Expected execution time of the application (in minutes).

Optional ; Default: 10

Queue

Specifies which type of queue system the remote resource has. This queue must have a corresponding cfg file in <installation_dir>/Runtime/scripts/queues/queue_systems folder. For more information, please read this section (Configuration Files).

Optional ; Default: computing cluster’s user default queue

FileCFG

Specifies a supercomputer cfg file to further customize the execution. Supercomputer cfg files contain a set of variables that indicate the queue system used by a supercomputer, the paths where the shared disk is mounted, the default values that COMPSs will set in the project and resources files when they are not set by the user, and flags that indicate whether a functionality is available in a supercomputer. The value must be either the name of a cfg file in the <installation_dir>/Runtime/scripts/queues/supercomputers/ folder or an absolute path to a file. For more information, please read this section (Configuration Files).

Optional

Important

Inside this file, you can also specify which queue system is going to be used, instead of using the previous parameter (Queue).

Caution

The .cfg files for queues and supercomputers must be in the remote machine.

Reservation

Some queue systems can reserve resources for jobs executed by selected user accounts. A resource reservation identifies the resources in that reservation and the time period during which the reservation is available. This parameter sets the reservation to use when submitting the job.

Optional ; Default: disabled

QOS

One can specify a Quality of Service (QOS) for each job submitted to the corresponding queue. The quality of service associated with a job might affect the job scheduling priority.

Optional ; Default: computing cluster’s user default qos

ProjectName

It is possible to define the project name required by the queue system of the computing cluster.

Optional ; Default: computing cluster’s user default project name

The following code snippet shows an example for the batch submission system of the MN5 cluster:

<Adaptors>
    <Adaptor Name="es.bsc.compss.gos.master.GOSAdaptor">
        <SubmissionSystem>
            <Batch>
                <Queue>slurm</Queue>
                <BatchProperties>
                    <Port>200</Port>
                    <MaxExecTime>30</MaxExecTime>
                    <Reservation>myReservation</Reservation>
                    <QOS>debug</QOS>
                    <FileCFG>mn5.cfg</FileCFG>
                    <ProjectName>bsc</ProjectName>
                </BatchProperties>
            </Batch>
        </SubmissionSystem>
        <BrokerAdaptor>sshtrilead</BrokerAdaptor>
    </Adaptor>
</Adaptors>
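
Since every batch property is optional, a configuration only needs to list the values that differ from the defaults. The following sketch generates a Batch element containing just the properties passed explicitly. The function name and PROPERTY_ORDER constant are hypothetical, the element order simply follows the example above, and the output is not validated against the COMPSs schema.

```python
import xml.etree.ElementTree as ET

# Order in which the properties appear in the examples of this section.
PROPERTY_ORDER = ("Port", "MaxExecTime", "Reservation", "QOS", "FileCFG", "ProjectName")


def batch_element(queue=None, **properties):
    """Build a <Batch> element containing only the values given explicitly."""
    unknown = set(properties) - set(PROPERTY_ORDER)
    if unknown:
        raise ValueError(f"unknown batch properties: {sorted(unknown)}")
    batch = ET.Element("Batch")
    if queue is not None:
        ET.SubElement(batch, "Queue").text = queue
    if properties:
        props = ET.SubElement(batch, "BatchProperties")
        for name in PROPERTY_ORDER:
            if name in properties:
                # Omitted properties fall back to the documented defaults.
                ET.SubElement(props, name).text = str(properties[name])
    return batch


batch = batch_element("slurm", MaxExecTime=30, QOS="debug", FileCFG="mn5.cfg")
print(ET.tostring(batch, encoding="unicode"))
```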

Important

If batch mode is selected, an environment script is probably necessary. This script will be executed in every computing node that the execution requests from the job submission queue. In these nodes, user-defined variables can NOT be used. Sourcing your own .bashrc might help with some of these problems. However, you might have to redefine these variables in the script.

source /path/to/userDirectory/.bashrc
[... Rest of the environment script ]

Step 3: Run the application

For further details of the runcompss command check its dedicated Section (Runcompss command).

$ runcompss  --project=/local/path/application/project.xml \
             --resources=/local/path/application/resources.xml \
             --comm="es.bsc.compss.gos.master.GOSAdaptor" \
             [options] \
             application_name [application_arguments]
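
When the launch is driven from a script, the same command line can be assembled programmatically. This is a minimal sketch with a hypothetical helper name; it only uses the --project, --resources and --comm flags shown above, and any extra runcompss options are passed through unchanged.

```python
def build_runcompss_command(project, resources, app, app_args=(), options=()):
    """Assemble the runcompss argument list for the GOS adaptor."""
    return [
        "runcompss",
        f"--project={project}",
        f"--resources={resources}",
        "--comm=es.bsc.compss.gos.master.GOSAdaptor",
        *options,   # any additional runcompss flags, passed as-is
        app,
        *app_args,  # arguments forwarded to the application
    ]
```

The resulting list can be handed to subprocess.run(cmd, check=True) on the local machine.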

Execution results

The execution result follows the same pattern as other execution environments (see further details in its section, Results).

Regarding the logs when debug is enabled, the out and err logs from each task are stored in the corresponding log directory within the local node when each task ends.

Caution

In case of an error that prevents retrieving the execution logs (for example, a loss of connection with the remote resources), the logs will be located in <WorkingDir>/BatchOutput/task_ID in the remote machine.

Execution example

Application

In this section, we show how to execute the KMeans Python COMPSs application in batch mode using the MareNostrum 5 (MN5) supercomputer.

In this scenario, the KMeans application is in /home/user/kmeans on our local machine, and the kmeans directory contains only the file kmeans.py. On the remote machine (glogin1.bsc.es), we have the user bsc12345, so we can access this machine with ssh bsc12345@glogin1.bsc.es.

In the first step, we have to be sure that COMPSs and all the application files are available in MN5 (glogin1.bsc.es). For this example, we assume that the application will be deployed in the user home directory (/home/bsc/bsc12345/kmeans) and COMPSs is installed in /apps/GPP/COMPSs/3.4. The following commands are used to deploy the application and check the COMPSs installation:

# In the local machine, copy the application data into MN5
$ scp -r /home/user/kmeans bsc12345@glogin1.bsc.es:/home/bsc/bsc12345/.
$ ssh bsc12345@glogin1.bsc.es
# Inside the remote machine, check where COMPSs is installed
$ module load COMPSs/3.4
$ echo $(builtin cd $(dirname $(which runcompss))/../../..; pwd)
/apps/GPP/COMPSs/3.4
$ exit

In the second step, we create the required XML files, which will be stored in /home/user/kmeans. The following listings show the XML files for this example:

Code 157 project.xml

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Project>
    <MasterNode/>
    <ComputingCluster Name="glogin1.bsc.es">
        <Adaptors>
            <Adaptor Name="es.bsc.compss.gos.master.GOSAdaptor">
                <SubmissionSystem>
                    <Batch>
                        <Queue>slurm</Queue>
                        <BatchProperties>
                            <Port>22</Port>
                            <MaxExecTime>2</MaxExecTime>
                            <Reservation>disabled</Reservation>
                            <QOS>gp_debug</QOS>
                            <FileCFG>mn5.cfg</FileCFG>
                            <ProjectName>bsc19</ProjectName>
                        </BatchProperties>
                    </Batch>
                </SubmissionSystem>
            </Adaptor>
        </Adaptors>
        <InstallDir>/apps/GPP/COMPSs/3.4/</InstallDir>
        <WorkingDir>/home/bsc/bsc12345/kmeans/tmp/</WorkingDir>
        <User>bsc12345</User>
        <LimitOfTasks>1000</LimitOfTasks>
        <Application>
            <Classpath>/home/bsc/bsc12345/kmeans</Classpath>
            <EnvironmentScript>/home/bsc/bsc12345/kmeans/env_mn.sh</EnvironmentScript>
        </Application>
        <ClusterNode Name="compute_node_type">
            <NumberOfNodes>2</NumberOfNodes>
        </ClusterNode>
    </ComputingCluster>
</Project>

Code 158 resources.xml

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<ResourcesList>
<ComputingCluster Name="glogin1.bsc.es">
    <Adaptors>
        <Adaptor Name="es.bsc.compss.gos.master.GOSAdaptor">
            <SubmissionSystem>
                <Batch>
                    <Queue>slurm</Queue>
                </Batch>
            </SubmissionSystem>
        </Adaptor>
    </Adaptors>
    <ClusterNode Name="compute_node_type">
        <MaxNumNodes>4</MaxNumNodes>
        <Processor Name="P1">
            <ComputingUnits>8</ComputingUnits>
            <Type>CPU</Type>
        </Processor>
    </ClusterNode>
</ComputingCluster>
</ResourcesList>
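
Because the two files describe the same clusters, a quick consistency check can catch naming mistakes before launching. The sketch below assumes that every ComputingCluster named in project.xml must also be declared, with the same Name, in resources.xml. That requirement is an assumption made for this illustration, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def cluster_names(xml_text):
    """Return the set of ComputingCluster names found in an XML document."""
    root = ET.fromstring(xml_text)
    return {c.get("Name") for c in root.iter("ComputingCluster")}


def undeclared_clusters(project_xml, resources_xml):
    """Return clusters used in project.xml but missing from resources.xml."""
    return sorted(cluster_names(project_xml) - cluster_names(resources_xml))
```

With the two files of this example read from disk (for instance with pathlib.Path.read_bytes), undeclared_clusters is expected to return an empty list, since both files name the cluster glogin1.bsc.es.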

And the environment script for MN5 (/home/bsc/bsc12345/kmeans/env_mn.sh):

Code 159 env_mn.sh

export COMPSS_PYTHON_VERSION=3.12.1
module load COMPSs/3.4

Finally, we launch the application in the third step. It must be done using the following command within the local machine:

$ runcompss  --project=/home/user/kmeans/project.xml \
             --resources=/home/user/kmeans/resources.xml \
             --comm=es.bsc.compss.gos.master.GOSAdaptor \
             kmeans.py -n 10240000 -f 8 -d 3 -c 8 -i 10

Tip

The same command can be used to run Java or C applications using the GOS adaptor (but take into account that the --classpath flag will be needed for Java and --library_path will be needed for C).

Jupyter notebook

In this section, we show how to execute a Jupyter notebook in batch mode.

The first step is to make sure that COMPSs is available in the remote machine (e.g. glogin1.bsc.es). For this example, we assume that COMPSs is installed in /apps/GPP/COMPSs/3.4.

Important

When using a Jupyter notebook, it is not necessary to transfer the application to the remote machine, since COMPSs will deal with the code automatically.

In the second step, we create the required project and resources XML files, which will be stored in /home/user/notebook. They are the same as the project.xml and resources.xml files defined in the previous example.

Finally, in the third step we can define in our local machine the notebook /home/user/notebook/simple.ipynb. Note that the ipycompss.start call includes the project and resources parameters, as well as the GOS communication adaptor.

import pycompss.interactive as ipycompss
ipycompss.start(comm="GOS",
                project_xml="/home/user/notebook/project.xml",
                resources_xml="/home/user/notebook/resources.xml")

# Now define your tasks and code within the following cells

Hybrid execution example

Sample Application

In this section, we show how to execute a really simple Python application for COMPSs in batch mode using two clusters. In particular, this example uses the two MareNostrum 5 supercomputer partitions (one with powerful CPUs (GPP) and another with GPUs (ACC)) from the local machine.

In this scenario, the application is in /home/user/simple on our local machine, and the simple directory contains only the file simple.py. On the remote machines (glogin1.bsc.es for GPP and alogin1.bsc.es for ACC), we have the user bsc12345, so we can access these machines with ssh bsc12345@glogin1.bsc.es and ssh bsc12345@alogin1.bsc.es.

The application that we are going to use is:

from pycompss.api.task import task
from pycompss.api.constraint import constraint
from pycompss.api.api import compss_wait_on

@constraint(processors=[{'ProcessorType':'CPU', 'ComputingUnits':'100'}])
@task(returns=1)
def increment(value):
    # Code that uses 100 CPU cores
    return value + 1

@constraint(processors=[{'ProcessorType':'CPU', 'ComputingUnits':'20'},
                        {'ProcessorType':'GPU', 'ComputingUnits':'1'}])
@task(returns=1)
def multiply(value):
    # Code that uses 20 CPU cores and 1 GPU
    return value * value

def main():
    value = 2
    results = []
    for i in range(2):
        partial = increment(value)
        complete = multiply(partial)
        results.append(complete)
    results = compss_wait_on(results)
    print(results)

if __name__=="__main__":
    main()

This application defines two tasks (increment and multiply) with different requirements. Since one of the MN5 partitions has GPUs, this example illustrates how COMPSs is able to deal with two different clusters, executing the tasks while respecting their constraints. The increment task represents a function with high internal parallelism, requiring 100 CPU cores, and the multiply task represents a function with less internal parallelism, but requiring one GPU. Consequently, the increment tasks can only be executed in the GPP partition (the ACC partition nodes have only 80 CPU cores), while the multiply tasks can only be executed in the ACC partition (the GPP partition has enough CPU cores but no GPUs). The main function loops over two iterations, invoking the increment and multiply tasks once per iteration. Notice that there is a data dependency between the tasks: each multiply consumes the result of the preceding increment.

In the first step, we have to be sure that COMPSs and the application are available in MN5. For this example, we assume that the application will be deployed in the user home directory (/home/bsc/bsc12345/simple), which is shared among partitions, and that COMPSs is installed in /apps/GPP/COMPSs/3.4 in GPP and in /apps/ACC/COMPSs/3.4 in ACC. The following commands are used to deploy the application and check the COMPSs installation:
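
The placement reasoning can be reproduced with a few lines of plain Python (no COMPSs involved). This is a simplified model, not the COMPSs scheduler: it only compares the computing units requested by each task with the per-node figures declared in the hybrid_resources.xml file of this example (112 CPU units in GPP; 80 CPU units and 4 GPU units in ACC).

```python
# Per-node computing units, as declared in hybrid_resources.xml.
NODES = {
    "GPP": {"CPU": 112},
    "ACC": {"CPU": 80, "GPU": 4},
}

# Computing units requested by the @constraint decorator of each task.
TASKS = {
    "increment": {"CPU": 100},
    "multiply": {"CPU": 20, "GPU": 1},
}


def can_run(constraints, node):
    """True if the node offers enough units of every requested processor type."""
    return all(node.get(ptype, 0) >= units for ptype, units in constraints.items())


def eligible_partitions(task):
    """Return the partitions whose nodes satisfy the constraints of the task."""
    return [name for name, node in NODES.items() if can_run(TASKS[task], node)]


for task in TASKS:
    print(task, "->", eligible_partitions(task))
```

Under this model, increment is eligible only for GPP and multiply only for ACC, which matches the reasoning above.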

# In the local machine, copy the application data into MN5
$ scp -r /home/user/simple bsc12345@glogin1.bsc.es:/home/bsc/bsc12345/.
$ ssh bsc12345@glogin1.bsc.es
# Inside the remote machine within GPP, check where COMPSs is installed
$ module load COMPSs/3.4
$ echo $(builtin cd $(dirname $(which runcompss))/../../..; pwd)
/apps/GPP/COMPSs/3.4
$ exit
$ ssh bsc12345@alogin1.bsc.es
# Inside the remote machine within ACC, check where COMPSs is installed
$ module load COMPSs/3.4
$ echo $(builtin cd $(dirname $(which runcompss))/../../..; pwd)
/apps/ACC/COMPSs/3.4
$ exit

In the second step, we create the required XML files, which will be stored in /home/user/simple. The following listings show the XML files for this example:

Code 160 hybrid_project.xml

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Project>
    <MasterNode/>
    <ComputingCluster Name="glogin1.bsc.es">
        <Adaptors>
            <Adaptor Name="es.bsc.compss.gos.master.GOSAdaptor">
                <SubmissionSystem>
                    <Batch>
                        <Queue>slurm</Queue>
                        <BatchProperties>
                            <Port>22</Port>
                            <MaxExecTime>2</MaxExecTime>
                            <Reservation>disabled</Reservation>
                            <QOS>gp_debug</QOS>
                            <FileCFG>mn5.cfg</FileCFG>
                            <ProjectName>bsc00</ProjectName>
                        </BatchProperties>
                    </Batch>
                </SubmissionSystem>
            </Adaptor>
        </Adaptors>
        <InstallDir>/apps/GPP/COMPSs/3.4/</InstallDir>
        <WorkingDir>/home/bsc/bsc12345/simple/gpp/</WorkingDir>
        <User>bsc12345</User>
        <LimitOfTasks>1000</LimitOfTasks>
        <Application>
            <Classpath>/home/bsc/bsc12345/simple/</Classpath>
            <EnvironmentScript>/home/bsc/bsc12345/simple/env_gpp.sh</EnvironmentScript>
        </Application>
        <ClusterNode Name="compute_node_type">
            <NumberOfNodes>2</NumberOfNodes>
        </ClusterNode>
    </ComputingCluster>
    <ComputingCluster Name="alogin1.bsc.es">
        <Adaptors>
            <Adaptor Name="es.bsc.compss.gos.master.GOSAdaptor">
                <SubmissionSystem>
                    <Batch>
                        <Queue>slurm</Queue>
                        <BatchProperties>
                            <Port>22</Port>
                            <MaxExecTime>2</MaxExecTime>
                            <Reservation>disabled</Reservation>
                            <QOS>acc_debug</QOS>
                            <FileCFG>mn5_acc.cfg</FileCFG>
                            <ProjectName>bsc00</ProjectName>
                        </BatchProperties>
                    </Batch>
                </SubmissionSystem>
            </Adaptor>
        </Adaptors>
        <InstallDir>/apps/ACC/COMPSs/3.4/</InstallDir>
        <WorkingDir>/home/bsc/bsc12345/simple/acc/</WorkingDir>
        <User>bsc12345</User>
        <LimitOfTasks>1000</LimitOfTasks>
        <Application>
            <Classpath>/home/bsc/bsc12345/simple/</Classpath>
            <EnvironmentScript>/home/bsc/bsc12345/simple/env_acc.sh</EnvironmentScript>
        </Application>
        <ClusterNode Name="compute_node_type">
            <NumberOfNodes>2</NumberOfNodes>
        </ClusterNode>
    </ComputingCluster>
</Project>

Code 161 hybrid_resources.xml

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<ResourcesList>
    <SharedDisk Name="Disk1">
        <Storage>
            <Size>100.0</Size>
        </Storage>
    </SharedDisk>
    <ComputingCluster Name="glogin1.bsc.es">
        <Adaptors>
            <Adaptor Name="es.bsc.compss.gos.master.GOSAdaptor">
                <SubmissionSystem>
                    <Batch>
                        <Queue>slurm</Queue>
                    </Batch>
                </SubmissionSystem>
            </Adaptor>
        </Adaptors>
        <SharedDisks>
            <AttachedDisk Name="Disk1">
                <MountPoint>/</MountPoint>
            </AttachedDisk>
        </SharedDisks>
        <ClusterNode Name="compute_node_type">
            <MaxNumNodes>4</MaxNumNodes>
            <Processor Name="CPU_MN5_GPP">
                <Architecture>worker_gpp</Architecture>
                <ComputingUnits>112</ComputingUnits>
                <Type>CPU</Type>
            </Processor>
        </ClusterNode>
    </ComputingCluster>
    <ComputingCluster Name="alogin1.bsc.es">
        <Adaptors>
            <Adaptor Name="es.bsc.compss.gos.master.GOSAdaptor">
                <SubmissionSystem>
                    <Batch>
                        <Queue>slurm</Queue>
                    </Batch>
                </SubmissionSystem>
            </Adaptor>
        </Adaptors>
        <SharedDisks>
            <AttachedDisk Name="Disk1">
                <MountPoint>/</MountPoint>
            </AttachedDisk>
        </SharedDisks>
        <ClusterNode Name="compute_node_type">
            <MaxNumNodes>4</MaxNumNodes>
            <Processor Name="GPU_MN5_ACC">
                <Architecture>worker_acc</Architecture>
                <ComputingUnits>4</ComputingUnits>
                <Type>GPU</Type>
            </Processor>
            <Processor Name="CPU_MN5_ACC">
                <Architecture>worker_acc</Architecture>
                <ComputingUnits>80</ComputingUnits>
                <Type>CPU</Type>
            </Processor>
        </ClusterNode>
    </ComputingCluster>
</ResourcesList>

And the environment scripts for MN5 are /home/bsc/bsc12345/simple/env_gpp.sh and /home/bsc/bsc12345/simple/env_acc.sh:

Code 162 env_gpp.sh

export COMPSS_PYTHON_VERSION=3.12.1
module load COMPSs/3.4

Code 163 env_acc.sh

export COMPSS_PYTHON_VERSION=3.12.1
module load COMPSs/3.4

Finally, we launch the application in the third step. It must be done using the following command within the local machine:

$ runcompss  --project=/home/user/simple/hybrid_project.xml \
             --resources=/home/user/simple/hybrid_resources.xml \
             --comm=es.bsc.compss.gos.master.GOSAdaptor \
             simple.py

Tip

The same command can be used to run Java or C applications using the GOS adaptor (but take into account that the --classpath flag will be needed for Java and --library_path will be needed for C).

Notebook

In this section, we show how to execute a Jupyter notebook in batch mode using multiple computing clusters.

The first step is to make sure that COMPSs is available in the remote machines (e.g. glogin1.bsc.es and alogin1.bsc.es). For this example, we assume that COMPSs is installed in /apps/GPP/COMPSs/3.4 within glogin1.bsc.es, and /apps/ACC/COMPSs/3.4 within alogin1.bsc.es.

Important

When using a Jupyter notebook, it is not necessary to transfer the application to the remote machines, since COMPSs will deal with the code automatically.

In the second step, we create the required project and resources XML files, which will be stored in /home/user/notebook. They are the same as the hybrid_project.xml and hybrid_resources.xml files defined in the previous example.

Finally, in the third step we can define in our local machine the notebook /home/user/notebook/simple.ipynb. Note that the ipycompss.start call includes the project and resources parameters, as well as the GOS communication adaptor.

import pycompss.interactive as ipycompss
ipycompss.start(comm="GOS",
                project_xml="/home/user/notebook/hybrid_project.xml",
                resources_xml="/home/user/notebook/hybrid_resources.xml")

# Now define your tasks and code within the following cells