🚑 Troubleshooting

This section answers the most common issues that arise when executing COMPSs applications and describes the known limitations of COMPSs.

For specific issues not covered in this section, please do not hesitate to contact us at: 📧 support-compss@bsc.es

How to debug

This section shows how to proceed when errors occur during the execution of an application.

First steps

When an error or exception occurs during the execution of an application, the first thing users must do is check the application output:

  • Using runcompss the output is shown in the console.

  • Using enqueue_compss the output is in the compss-<JOB_ID>.out and compss-<JOB_ID>.err files.

If the error happens within a task, it will not appear in these files. Users must check the log folder in order to find what has failed. The log folder is by default in:

  • Using runcompss: $HOME/.COMPSs/<APP_NAME>_XX (where XX is a number between 00 and 99, and increases on each run).

  • Using enqueue_compss: $HOME/.COMPSs/<JOB_ID>

This log folder contains the jobs folder, where all output/errors of the tasks are stored. In particular, when a task fails it produces the JOB<TASK_NUMBER>_NEW.out and JOB<TASK_NUMBER>_NEW.err files.
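As a minimal sketch of the lookup described above, the following Python snippet locates the most recently modified run folder under $HOME/.COMPSs and lists the non-empty .err files in its jobs folder. The helper names latest_log_dir and failed_job_logs are hypothetical, and treating a non-empty .err file as a sign of failure is an assumption rather than a rule enforced by COMPSs.

```python
from pathlib import Path


def latest_log_dir(base=None):
    """Return the most recently modified run folder under ~/.COMPSs, or None."""
    base = Path(base) if base is not None else Path.home() / ".COMPSs"
    if not base.is_dir():
        return None
    runs = [p for p in base.iterdir() if p.is_dir()]
    return max(runs, key=lambda p: p.stat().st_mtime, default=None)


def failed_job_logs(log_dir):
    """Return the non-empty *.err files found in <log_dir>/jobs, sorted by name."""
    jobs_dir = Path(log_dir) / "jobs"
    if not jobs_dir.is_dir():
        return []
    # Assumption: an empty .err file means the job wrote nothing to stderr.
    return sorted(p for p in jobs_dir.glob("*.err") if p.stat().st_size > 0)


if __name__ == "__main__":
    run = latest_log_dir()
    if run is None:
        print("No COMPSs log folder found")
    else:
        for err_file in failed_job_logs(run):
            print(err_file)
```

If the log folder has been moved elsewhere, pass its location to latest_log_dir explicitly.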

Tip

If the user enables debug mode by adding the -d flag to the runcompss or enqueue_compss command, more information is stored in the log folder of each run, easing error detection. In particular, the standard output and error output of all tasks will appear within the jobs folder.

In addition, some more log files will appear:

  • runtime.log

  • pycompss.log (only if using the Python binding).

  • pycompss.err (only if using the Python binding and an error in the binding happens.)

  • resources.log

  • workers folder. This folder will contain four files per worker node:

    • worker_<MACHINE_NAME>.out

    • worker_<MACHINE_NAME>.err

    • binding_worker_<MACHINE_NAME>.out

    • binding_worker_<MACHINE_NAME>.err

As a suggestion, users should check the last lines of the runtime.log. If the file transfers or the tasks are failing, an error message will appear in this file. If the file transfers are successful and the jobs are submitted, users should check the jobs folder and look at the error messages produced inside each job. Note that the presence of RESUBMITTED files means that something inside the job is failing.
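The two checks suggested above can be sketched in a few lines of Python. The helper names tail and resubmitted_jobs are hypothetical; the sketch only assumes that the log folder contains a runtime.log file and a jobs folder as described in this section.

```python
from collections import deque
from pathlib import Path


def tail(path, n=20):
    """Return the last n lines of a text file."""
    with open(path, errors="replace") as handle:
        # A bounded deque keeps only the most recent n lines in memory.
        return list(deque(handle, maxlen=n))


def resubmitted_jobs(log_dir):
    """Return the names of the RESUBMITTED job files inside <log_dir>/jobs."""
    jobs_dir = Path(log_dir) / "jobs"
    return sorted(p.name for p in jobs_dir.glob("*_RESUBMITTED.*"))
```

For example, printing "".join(tail(log_dir / "runtime.log")) shows the end of the runtime log, and a non-empty result from resubmitted_jobs(log_dir) indicates that at least one job was retried.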

If the workers folder is empty, it means that the execution failed and the COMPSs runtime was not able to retrieve the workers' logs. In this case, users must connect to the workers and look directly into the worker logs. Alternatively, if the user is running with a shared disk (e.g. in a supercomputer), the user can define a shared folder with --worker_working_directory=/shared/folder, where a tmp_XXXXXX folder will be created during the application execution and all worker logs will be stored.

Tip

When debug is enabled, the workers also produce log files, which are transferred to the master when the application finishes. By default, these log files are always removed from the workers (even after a failure, to avoid leaving files behind). It is nevertheless possible to disable this removal, so that users can still check the logs on the worker nodes if something fails and they are not transferred to the master node. To this end, add the following flag to runcompss or enqueue_compss:

--keep_workingdir

Please note that the workers will store the log files in the folder defined by --worker_working_directory, which can be a shared or local folder.

Tip

If a segmentation fault occurs, a core dump file can be generated by adding the following flag to runcompss or enqueue_compss:

--gen_coredump

The following subsections show debugging examples depending on the chosen flavour (Java, Python or C/C++).

☕ Java examples

☕ Exception in the main code

TODO

Missing subsection

☕ Exception in a task

TODO

Missing subsection

🐍 Python examples

🐍 Exception in the main code

Consider the following code, where an intentional error has been introduced in the main code to show how it can be debugged.

from pycompss.api.task import task

@task(returns=1)
def increment(value):
    return value + 1

def main():
    initial_value = 1
    result = increment(initial_value)

    result = result + 1  # Try to use result without synchronizing it: Error

    print("Result: " + str(result))

if __name__=='__main__':
    main()

When executed, it produces the following output:

$ runcompss error_in_main.py

[  INFO] Inferred PYTHON language
[  INFO] Using default location for project file: /opt/COMPSs//Runtime/configuration/xml/projects/default_project.xml
[  INFO] Using default location for resources file: /opt/COMPSs//Runtime/configuration/xml/resources/default_resources.xml
[  INFO] Using default execution type: compss

----------------- Executing error_in_main.py --------------------------

WARNING: COMPSs Properties file is null. Setting default values
[(377)    API]  -  Starting COMPSs Runtime v3.4
[ ERROR ]: An exception occurred: unsupported operand type(s) for +: 'Future' and 'int'
Traceback (most recent call last):
  File "/opt/COMPSs//Bindings/python/3/pycompss/runtime/launch.py", line 204, in compss_main
    execfile(APP_PATH, globals())  # MAIN EXECUTION
  File "error_in_main.py", line 16, in <module>
    main()
  File "error_in_main.py", line 11, in main
    result = result + 1  # Try to use result without synchronizing it: Error
TypeError: unsupported operand type(s) for +: 'Future' and 'int'
[ERRMGR]  -  WARNING: Task 1(Action: 1) with name error_in_main.increment has been canceled.
[ERRMGR]  -  WARNING: Task canceled: [[Task id: 1], [Status: CANCELED], [Core id: 0], [Priority: false], [NumNodes: 1], [MustReplicate: false], [MustDistribute: false], [error_in_main.increment(INT_T)]]
[(3609)    API]  -  Execution Finished

Error running application

The output contains the complete traceback, which points to where the error is and states its reason. In this example, the reason is TypeError: unsupported operand type(s) for +: 'Future' and 'int', since we are trying to use an object that has not been synchronized.

Tip

Any exception raised from the main code will appear in the same way, with a traceback that helps to identify the line that produced the exception and its reason.
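For reference, a possible fix for the example above is to synchronize the result with compss_wait_on (imported from pycompss.api.api) before using it. The sketch below is a variation of the same program; the except ImportError branch only defines stand-ins, an assumption added here so that the snippet can also be run sequentially on a machine without PyCOMPSs, and is not part of the binding.

```python
try:
    from pycompss.api.task import task
    from pycompss.api.api import compss_wait_on
except ImportError:
    # Stand-ins (assumption): plain sequential behavior without PyCOMPSs.
    def task(*_args, **_kwargs):
        return lambda func: func

    def compss_wait_on(obj):
        return obj


@task(returns=1)
def increment(value):
    return value + 1


def main():
    initial_value = 1
    result = increment(initial_value)
    # Synchronize first: this turns the Future into the actual value.
    result = compss_wait_on(result)
    result = result + 1
    print("Result: " + str(result))
    return result


if __name__ == "__main__":
    main()
```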

🐍 Exception in a task

Consider the following code, where an intentional error has been introduced in the task code to show how it can be debugged.

from pycompss.api.task import task
from pycompss.api.api import compss_wait_on

@task(returns=1)
def increment(value):
  return value + 1  # value is a string, cannot add an int: Error

def main():
  initial_value = "1"  # the initial value is a string instead of an integer
  result = increment(initial_value)
  result = compss_wait_on(result)
  print("Result: " + str(result))

if __name__=='__main__':
  main()

When executed, it produces the following output:

$ runcompss error_in_task.py

[  INFO] Inferred PYTHON language
[  INFO] Using default location for project file: /opt/COMPSs//Runtime/configuration/xml/projects/default_project.xml
[  INFO] Using default location for resources file: /opt/COMPSs//Runtime/configuration/xml/resources/default_resources.xml
[  INFO] Using default execution type: compss

----------------- Executing error_in_task.py --------------------------

WARNING: COMPSs Properties file is null. Setting default values
[(570)    API]  -  Starting COMPSs Runtime v3.4
[ERRMGR]  -  WARNING: Job 1 for running task 1 on worker localhost has failed; resubmitting task to the same worker.
[ERRMGR]  -  WARNING: Task 1 execution on worker localhost has failed; rescheduling task execution. (changing worker)
[ERRMGR]  -  WARNING: Job 2 for running task 1 on worker localhost has failed; resubmitting task to the same worker.
[ERRMGR]  -  WARNING: Task 1 has already been rescheduled; notifying task failure.
[ERRMGR]  -  WARNING: Task 'error_in_task.increment' TOTALLY FAILED.
                      Possible causes:
                           -Exception thrown by task 'error_in_task.increment'.
                           -Expected output files not generated by task 'error_in_task.increment'.
                           -Could not provide nor retrieve needed data between master and worker.

                      Check files '/home/user/.COMPSs/error_in_task.py_01/jobs/job[1|2'] to find out the error.

[ERRMGR]  -  ERROR:   Task failed: [[Task id: 1], [Status: FAILED], [Core id: 0], [Priority: false], [NumNodes: 1], [MustReplicate: false], [MustDistribute: false], [error_in_task.increment(STRING_T)]]
[ERRMGR]  -  Shutting down COMPSs...
[(4711)    API]  -  Execution Finished
Shutting down the running process

Error running application

The output shows that there has been an issue with task number 1. Since the default behavior of the runtime is to resubmit the failed task, job 2 (the second attempt to run task 1) also fails.

In this case, the runtime suggests checking the log files of the jobs: /home/user/.COMPSs/error_in_task.py_01/jobs/job[1|2]

Looking into the logs folder, it can be seen that the jobs folder contains the logs of the failed tasks:

$HOME/.COMPSs
  └── error_in_task.py_01
        ├── jobs
        │   ├── job1_NEW.err
        │   ├── job1_NEW.out
        │   ├── job1_RESUBMITTED.err
        │   ├── job1_RESUBMITTED.out
        │   ├── job2_NEW.err
        │   ├── job2_NEW.out
        │   ├── job2_RESUBMITTED.err
        │   └── job2_RESUBMITTED.out
        ├── resources.log
        ├── runtime.log
        ├── tmpFiles
        └── workers

The job1_NEW.err file contains the complete traceback of the raised exception (TypeError: cannot concatenate 'str' and 'int' objects, a consequence of passing a string to a task that tries to add 1 to it):

  [EXECUTOR] executeTask - Error in task execution
  es.bsc.compss.types.execution.exceptions.JobExecutionException: Job 1 exit with value 1
      at es.bsc.compss.invokers.external.piped.PipedInvoker.invokeMethod(PipedInvoker.java:78)
      at es.bsc.compss.invokers.Invoker.invoke(Invoker.java:352)
      at es.bsc.compss.invokers.Invoker.processTask(Invoker.java:287)
      at es.bsc.compss.executor.Executor.executeTask(Executor.java:486)
      at es.bsc.compss.executor.Executor.executeTaskWrapper(Executor.java:322)
      at es.bsc.compss.executor.Executor.execute(Executor.java:229)
      at es.bsc.compss.executor.Executor.processRequests(Executor.java:198)
      at es.bsc.compss.executor.Executor.run(Executor.java:153)
      at es.bsc.compss.executor.utils.ExecutionPlatform$2.run(ExecutionPlatform.java:178)
      at java.lang.Thread.run(Thread.java:748)
  Traceback (most recent call last):
  File "/opt/COMPSs/Bindings/python/2/pycompss/worker/commons/worker.py", line 265, in task_execution
    **compss_kwargs)
  File "/opt/COMPSs/Bindings/python/2/pycompss/api/task.py", line 267, in task_decorator
    return self.worker_call(*args, **kwargs)
  File "/opt/COMPSs/Bindings/python/2/pycompss/api/task.py", line 1523, in worker_call
    **user_kwargs)
  File "/home/user/temp/Bugs/documentation/error_in_task.py", line 6, in increment
    return value + 1
TypeError: cannot concatenate 'str' and 'int' objects

Tip

Any exception raised from the task code will appear in the same way, with a traceback that helps to identify the line that produced the exception and its reason.
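When many job files have to be inspected, the final line of the Python traceback is usually the most informative one. The following sketch extracts it from the text of a job error file. The helper name last_exception_line is hypothetical, and the sketch assumes the standard Python traceback layout, where frame lines are indented and the exception line is not.

```python
def last_exception_line(log_text):
    """Return the exception line of the last Python traceback in log_text, or None."""
    lines = log_text.splitlines()
    marker = "Traceback (most recent call last):"
    starts = [i for i, line in enumerate(lines) if line.strip() == marker]
    if not starts:
        return None
    # Frame and source lines are indented; the exception line starts at column 0.
    for line in lines[starts[-1] + 1:]:
        if line and not line[0].isspace():
            return line
    return None
```

Applied to the contents of the job1_NEW.err file shown above, it would return the TypeError line.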

🧙 C/C++ examples

🧙 Exception in the main code

TODO

Missing subsection

🧙 Exception in a task

TODO

Missing subsection

Common Issues

This section shows some common issues and how to deal with or fix them.

Tasks are not executed

If the tasks remain in the Blocked state, there are probably no resources matching the task constraints. This can have two causes: the resources are not correctly loaded into the runtime, or the task constraints do not match any resource.

In the first case, users should take a look at the resources.log and check that all the resources defined in the project.xml file are available to the runtime. In the second case, users should redefine the task constraints taking into account the resource capabilities defined in the resources.xml and project.xml files.

Jobs fail

If all the application’s tasks fail because all the submitted jobs fail, the cause is probably a resource misconfiguration. In most cases, the resource that the application is trying to access does not allow passwordless access for the configured user. This can be checked as follows:

  • Open the project.xml. (The default file is stored under /opt/COMPSs/Runtime/configuration/xml/projects/project.xml)

  • For each resource, note its name and the value inside the User tag. Remember that if there is no User tag, COMPSs will try to connect to this resource with the same username as the one that launches the main application.

  • For each noted resourceName - user pair, try ssh user@resourceName. If the connection asks for a password, the SSH access to the resource is not configured correctly.

The problem can be solved by running the following commands:
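The manual check described in the list above can be sketched in Python. The snippet assumes that resources appear in the project.xml as ComputeNode elements with a Name attribute and an optional User child, without XML namespaces; adapt the tag names if your file differs. The helper name ssh_checks is hypothetical. The ssh option BatchMode=yes makes ssh fail instead of prompting for a password.

```python
import getpass
import xml.etree.ElementTree as ET


def ssh_checks(project_xml_text, default_user=None):
    """Build one non-interactive ssh test command per ComputeNode element."""
    default_user = default_user or getpass.getuser()
    root = ET.fromstring(project_xml_text)
    commands = []
    for node in root.iter("ComputeNode"):  # assumed element name
        name = node.get("Name")
        # Without a User tag, fall back to the user launching the application.
        user = (node.findtext("User") or default_user).strip()
        commands.append(["ssh", "-o", "BatchMode=yes", f"{user}@{name}", "true"])
    return commands
```

Each command can then be executed with subprocess.run(command); a non-zero return code suggests that passwordless access is not available for that resource, or that the resource is unreachable.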

compss@bsc:~$ scp ~/.ssh/id_rsa.pub user@resourceName:./myRSA.pub
compss@bsc:~$ ssh user@resourceName "cat myRSA.pub >> ~/.ssh/authorized_keys; rm ./myRSA.pub"

These commands are a quick solution. For further details, please check the General Section.

Exceptions when starting the Worker processes

When the COMPSs master is not able to communicate with one of the COMPSs workers described in the project.xml and resources.xml files, different exceptions can be raised and logged on the runtime.log of the application. All of them are raised during the worker start up and contain the [WorkerStarter] prefix. Next we provide a list with the common exceptions:

InitNodeException

Exception raised when the remote SSH process to start the worker has failed.

UnstartedNodeException

Exception raised when the worker process has aborted.

Connection refused

Exception raised when the master cannot communicate with the worker process (NIO).

All these exceptions encapsulate an error when starting the worker process. This means that the worker machine is not properly configured, and thus you need to check the environment of the failing worker. Further information about the specific error can be found in the worker log, available at the working directory path on the remote worker machine (the worker working directory specified in the project.xml file).

Next, we list the most common errors and their solutions:

java command not found

Invalid path to the java binary. Check the JAVA_HOME definition at the remote worker machine.

Cannot create WD

Invalid working directory. Check the rw permissions of the worker’s working directory.

No exception

The worker process has started normally and there is no exception. In this case, the issue is normally due to the firewall configuration preventing the communication between the COMPSs master and the worker. Please check that the worker firewall has in and out permissions for TCP and UDP in the adaptor ports (the adaptor ports are specified in the resources.xml file; by default the port range is 43000-44000).
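A quick way to test TCP reachability of a port from the master node is sketched below using only the Python standard library. The helper name tcp_port_open is hypothetical. Note the limits of this check: it only covers TCP, and it only succeeds if a process is actually listening on the port, so it is meaningful while the worker is running.

```python
import socket


def tcp_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False
```

For instance, tcp_port_open("worker-node", 43001) could be called from the master, where the host name is a placeholder and the port is taken from the range configured in the resources.xml file.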

Compilation error: @Method not found

When trying to compile Java applications users can get some of the following compilation errors:

error: package es.bsc.compss.types.annotations does not exist
import es.bsc.compss.types.annotations.Constraints;
                                          ^
error: package es.bsc.compss.types.annotations.task does not exist
import es.bsc.compss.types.annotations.task.Method;
                                          ^
error: package es.bsc.compss.types.annotations does not exist
import es.bsc.compss.types.annotations.Parameter;
                                          ^
error: package es.bsc.compss.types.annotations.parameter does not exist
import es.bsc.compss.types.annotations.parameter.Direction;
                                                    ^
error: package es.bsc.compss.types.annotations.parameter does not exist
import es.bsc.compss.types.annotations.parameter.Type;
                                                    ^
error: cannot find symbol
@Parameter(type = Type.FILE, direction = Direction.INOUT)
^
  symbol:   class Parameter
  location: interface APPLICATION_Itf

error: cannot find symbol
@Constraints(computingUnits = "2")
^
  symbol:   class Constraints
  location: interface APPLICATION_Itf

error: cannot find symbol
@Method(declaringClass = "application.ApplicationImpl")
^
  symbol:   class Method
  location: interface APPLICATION_Itf

All these errors are raised because the compss-engine.jar is not listed in the CLASSPATH. The default COMPSs installation automatically inserts this package into the CLASSPATH but it may have been overwritten or deleted. Please check that your environment variable CLASSPATH contains the compss-engine.jar location by running the following command:

$ echo $CLASSPATH | grep compss-engine

If the result of the previous command is empty it means that you are missing the compss-engine.jar package in your CLASSPATH.
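The same check can be done entry by entry, which avoids false positives when another path merely contains the string compss-engine. This is a minimal sketch with a hypothetical helper name, classpath_jars.

```python
import os


def classpath_jars(classpath, jar_name="compss-engine.jar"):
    """Return the CLASSPATH entries whose file name is exactly jar_name."""
    entries = [entry for entry in classpath.split(os.pathsep) if entry]
    return [entry for entry in entries if os.path.basename(entry) == jar_name]


if __name__ == "__main__":
    found = classpath_jars(os.environ.get("CLASSPATH", ""))
    print(found if found else "compss-engine.jar is not in the CLASSPATH")
```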

The easiest solution is to manually export the CLASSPATH variable into the user session:

$ export CLASSPATH=$CLASSPATH:/opt/COMPSs/Runtime/compss-engine.jar

However, you will need to remember to export this variable every time you log out and back in again. Consequently, we recommend adding this export to the .bashrc file:

$ echo "# COMPSs variables for Java compilation" >> ~/.bashrc
$ echo 'export CLASSPATH=$CLASSPATH:/opt/COMPSs/Runtime/compss-engine.jar' >> ~/.bashrc

Warning

The compss-engine.jar is installed inside the COMPSs installation directory. If you have performed a custom installation, the path of the package may be different.

Jobs failed on method reflection

When executing an application, the main code gets stuck executing a task. Looking at the runtime.log, users can see that the job associated with the task has failed (and all its resubmissions too). Then, opening the jobX_NEW.out or jobX_NEW.err files, users find the following error:

[ERROR|es.bsc.compss.Worker|Executor] Can not get method by reflection
es.bsc.compss.nio.worker.executors.Executor$JobExecutionException: Can not get method by reflection
        at es.bsc.compss.nio.worker.executors.JavaExecutor.executeTask(JavaExecutor.java:142)
        at es.bsc.compss.nio.worker.executors.Executor.execute(Executor.java:42)
        at es.bsc.compss.nio.worker.JobLauncher.executeTask(JobLauncher.java:46)
        at es.bsc.compss.nio.worker.JobLauncher.processRequests(JobLauncher.java:34)
        at es.bsc.compss.util.RequestDispatcher.run(RequestDispatcher.java:46)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NoSuchMethodException: simple.Simple.increment(java.lang.String)
        at java.lang.Class.getMethod(Class.java:1678)
        at es.bsc.compss.nio.worker.executors.JavaExecutor.executeTask(JavaExecutor.java:140)
        ... 5 more

This error is due to the fact that COMPSs cannot find one of the tasks declared in the Java Interface. Commonly this is triggered by one of the following errors:

  • The declaringClass of the tasks in the Java Interface has not been correctly defined.

  • The parameters of the tasks in the Java Interface do not match the task call.

  • The tasks have not been defined as public.

Jobs failed on reflect target invocation null pointer

When executing an application, the main code gets stuck executing a task. Looking at the runtime.log, users can see that the job associated with the task has failed (and all its resubmissions too). Then, opening the jobX_NEW.out or jobX_NEW.err files, users find the following error:

[ERROR|es.bsc.compss.Worker|Executor]
java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at es.bsc.compss.nio.worker.executors.JavaExecutor.executeTask(JavaExecutor.java:154)
        at es.bsc.compss.nio.worker.executors.Executor.execute(Executor.java:42)
        at es.bsc.compss.nio.worker.JobLauncher.executeTask(JobLauncher.java:46)
        at es.bsc.compss.nio.worker.JobLauncher.processRequests(JobLauncher.java:34)
        at es.bsc.compss.util.RequestDispatcher.run(RequestDispatcher.java:46)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NullPointerException
        at simple.Ll.printY(Ll.java:25)
        at simple.Simple.task(Simple.java:72)
        ... 10 more

The cause of this error is that the Java object accessed by the task has not been correctly transferred and one or more of its fields is null. The transfer failure normally occurs because the transferred object is not serializable.

Users should check that all the object parameters in the task either implement the Serializable interface or follow the JavaBeans model (by implementing an empty constructor and getters and setters for each attribute).

Tracing merge failed: too many open files

When too many nodes and threads are instrumented, the tracing merge can fail due to an OS limitation, namely the maximum number of open files. This problem usually happens when using the advanced mode, due to the larger number of threads instrumented. To overcome this issue users have two choices. The first option is to use the Extrae parallel MPI merger. This merger is automatically used if COMPSs was installed with MPI support. In Ubuntu you can install the following packages to get MPI support:

$ sudo apt-get install libcr-dev mpich2 mpich2-doc

Please note that Extrae is never compiled with MPI support when building it locally (with the buildlocal command).

To check if COMPSs was deployed with MPI support, you can check the installation log and look for the following Extrae configuration output:

Package configuration for Extrae VERSION based on extrae/trunk rev. 3966:
-----------------------
Installation prefix: /gpfs/apps/MN3/COMPSs/Trunk/Dependencies/extrae
Cross compilation: no
CC: gcc
CXX: g++
Binary type: 64 bits

MPI instrumentation: yes
    MPI home: /apps/OPENMPI/1.8.1-mellanox
    MPI launcher: /apps/OPENMPI/1.8.1-mellanox/bin/mpirun

Alternatively, if COMPSs is already installed, you can check the Extrae configuration by executing the script /opt/COMPSs/Dependencies/extrae/etc/configured.sh. Users should check that the flags --with-mpi=/usr and --enable-parallel-merge are present and that the MPI path is correct and exists. Sample output:

EXTRAE_HOME is not set. Guessing from the script invoked that Extrae was installed in /opt/COMPSs/Dependencies/extrae
The directory exists .. OK
Loaded specs for Extrae from /opt/COMPSs/Dependencies/extrae/etc/extrae-vars.sh

Extrae SVN branch extrae/trunk at revision 3966

Extrae was configured with:
$ ./configure --enable-gettimeofday-clock --without-mpi --without-unwind --without-dyninst --without-binutils --with-mpi=/usr --enable-parallel-merge --with-papi=/usr --with-java-jdk=/usr/lib/jvm/java-7-openjdk-amd64/ --disable-openmp --disable-nanos --disable-smpss --prefix=/opt/COMPSs/Dependencies/extrae --with-mpi=/usr --enable-parallel-merge --libdir=/opt/COMPSs/Dependencies/extrae/lib

CC was gcc
CFLAGS was -g -O2 -fno-optimize-sibling-calls -Wall -W
CXX was g++
CXXFLAGS was -g -O2 -fno-optimize-sibling-calls -Wall -W

MPI_HOME points to /usr and the directory exists .. OK
LIBXML2_HOME points to /usr and the directory exists .. OK
PAPI_HOME points to /usr and the directory exists .. OK
DYNINST support seems to be disabled
UNWINDing support seems to be disabled (or not needed)
Translating addresses into source code references seems to be disabled (or not needed)

Please, report bugs to tools@bsc.es

Important

Disclaimer: the parallel merge with MPI will not bypass the system’s maximum number of open files; it just distributes the files among the resources. If all resources belong to the same machine, the merge will fail anyway.

The second option is to increase the OS maximum number of open files. For instance, in Ubuntu add ulimit -n 40000 just before the start-stop-daemon line in the do_start section.
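Before changing the limit, it can be useful to see the value currently in effect. The following sketch reads it with the Python standard library resource module, which is available on Unix systems only; the helper name open_file_limits is hypothetical.

```python
import resource


def open_file_limits():
    """Return the (soft, hard) limits on open files for the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


if __name__ == "__main__":
    soft, hard = open_file_limits()
    print(f"open files: soft limit = {soft}, hard limit = {hard}")
```

The soft limit is the one that applies to the process; it can be raised up to the hard limit without administrator privileges.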

Performance issues

Different work directories

Having different work directories for the master and the workers may lead to performance issues, in particular if the work directories belong to different mount points with different performance, since files may need to be copied between them. An example is using folders that are shared across nodes in a supercomputer but have different performance (e.g. scratch and projects in MareNostrum 4) for the master and worker workspaces.

Memory Profiling

This section will show you how to analyze the main memory consumed during an application execution.

Basic profiling

COMPSs provides a mechanism to show the memory usage over time when running Python applications. This is particularly useful when memory issues occur (e.g. memory exhaustion causing the application to crash) or for performance analysis (e.g. problem size scalability).

To this end, the runcompss and enqueue_compss commands provide the --python_memory_profile flag, which generates a set of files (one per node used in the application execution) recording the memory used during the execution. The files are written at the end of the application, in the folder from which the execution was launched.

Important

The memory-profiler and psutil packages are mandatory in order to use the --python_memory_profile flag.

They can be easily installed with pip:

$ python -m pip install psutil memory-profiler --user

Tip

If you want to store the files from the memory profiler in a different folder, export the COMPSS_WORKER_PROFILE_PATH variable with the destination path:

$ export COMPSS_WORKER_PROFILE_PATH=/path/to/destination

When --python_memory_profile is included, a file with name mprofile_<DATE_TIME>.dat is generated for the master memory profiling, while for the workers they are named <WORKER_NODE_NAME>.dat. These files can be displayed with the mprof tool:

$ mprof plot <FILE>.dat

Figure 89 mprof plot example
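When a plot is not convenient (e.g. on a headless node), the peak value can also be read directly from the file. The sketch below assumes that the .dat files follow the text format written by mprof, where each sample is a line of the form MEM <MiB> <timestamp>; this is an assumption about the file contents, and the helper name peak_memory_mib is hypothetical.

```python
def peak_memory_mib(dat_text):
    """Return the largest MEM sample (in MiB) found in mprof-style data, or None."""
    samples = []
    for line in dat_text.splitlines():
        fields = line.split()
        # Assumed sample layout: "MEM <memory in MiB> <unix timestamp>".
        if len(fields) >= 3 and fields[0] == "MEM":
            samples.append(float(fields[1]))
    return max(samples, default=None)
```

It can be applied to the contents of any of the generated files, for instance peak_memory_mib(open(path).read()).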

Advanced profiling

For more fine-grained memory profiling and for analyzing the workers' memory usage, PyCOMPSs provides the @profile decorator. This decorator displays the memory usage per line of code. It can be imported from the PyCOMPSs functions module:

from pycompss.functions.profile import profile

This decorator can be placed over any function:

Over the @task decorator (or over the decorator stack of a task):

This will display the memory usage in the master (through standard output).

Under the @task decorator:

This will display the memory used by the actual task in the worker. The memory usage will be shown through standard output, so it is mandatory to enable debug (--log_level=debug) and check the job output file from .COMPSs/<app_folder>/jobs/.

Over a non task function:

This will display the memory usage of the function in the master (through standard output).
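The placement under the @task decorator can be sketched as follows. The imports are the ones named in this section; the except ImportError branch only defines stand-ins, an assumption added here so that the snippet can also be run sequentially without PyCOMPSs, in which case no profiling report is produced.

```python
try:
    from pycompss.api.task import task
    from pycompss.api.api import compss_wait_on
    from pycompss.functions.profile import profile
except ImportError:
    # Stand-ins (assumption): pass-through decorators, no profiling is done.
    def _passthrough(*_args, **_kwargs):
        return lambda func: func

    task = profile = _passthrough

    def compss_wait_on(obj):
        return obj


@task(returns=1)
@profile()  # under @task: profiles the task body in the worker
def increment(value):
    a = [1] * (10 ** 6)
    b = [2] * (value * 10 ** 6)
    c = [3] * (value * 10 ** 7)
    del b
    return value + 1


if __name__ == "__main__":
    print(compss_wait_on(increment(1)))
```

Swapping the two decorators (placing @profile() over @task) would report the memory usage in the master instead.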

By default, the @profile decorator reports the memory usage line by line:

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
     7     53.3 MiB     53.3 MiB           1   @task(returns=1)
     8                                         @profile()
     9                                         def increment(value):
    10     61.0 MiB      7.7 MiB           1       a = [1] * (10 ** 6)
    11     83.7 MiB     22.7 MiB           1       b = [2] * (value * 10 ** 6)
    12    312.6 MiB    228.9 MiB           1       c = [3] * (value * 10 ** 7)
    13    289.9 MiB    -22.7 MiB           1       del b
    14    289.9 MiB      0.0 MiB           1       return value + 1
Job name: job10_NEW
Task start time: 1653572135.1119144
Elapsed time: 0.10722756385803223
Initial memory: 8150122496
Final memory: 7759843328

This information can be reduced to show only the peak memory usage of each task by setting full_report=False in the @profile decorator (@profile(full_report=False)). More specifically, the profiling information reported will be one line per task showing:

  1. The task start time

  2. The task job name

  3. The file that contains the task

  4. The task name

  5. The task elapsed time

  6. The amount of memory used before executing the task

  7. The amount of memory used after executing the task

  8. The peak memory usage

1653572135.1119144 job10_NEW /path/to/increment.py increment 0.10722756385803223 8150122496 7759843328 312.6 MiB
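Since the reduced report is a whitespace-separated line, it is easy to post-process. The following sketch splits such a line into the eight fields listed above. The names TaskProfile and parse_profile_line are hypothetical, and the sketch assumes that the file path contains no whitespace.

```python
from typing import NamedTuple


class TaskProfile(NamedTuple):
    start_time: float
    job_name: str
    file_path: str
    task_name: str
    elapsed_time: float
    memory_before: int
    memory_after: int
    peak_memory: str


def parse_profile_line(line):
    """Split a one-line @profile report into its eight fields."""
    # maxsplit=7 keeps the peak value and its unit ("312.6 MiB") together.
    start, job, path, name, elapsed, before, after, peak = line.split(maxsplit=7)
    return TaskProfile(float(start), job, path, name, float(elapsed),
                       int(before), int(after), peak.strip())
```

Parsing the sample line above yields, among others, task_name "increment" and peak_memory "312.6 MiB".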

Tip

It is possible to redirect the profiling output to a single file by exporting the COMPSS_PROFILING_FILE environment variable with the path to the destination file.

Please remember that this variable needs to be available in the worker if the @profile decorator is used to report the memory usage of the tasks. Consequently, consider using the --env_script flag of the runcompss command to define a script that exports COMPSS_PROFILING_FILE, making it available in the workers in local executions.

Known Limitations

The current COMPSs version has the following limitations.

Global

Exceptions

The current COMPSs version is not able to propagate exceptions raised from a task to the master. However, the runtime catches any exception and sets the task as failed.

Use of file paths

The persistent workers implementation has a unique Working Directory per worker. This means that tasks should not use hardcoded file names, to avoid file collisions and task misbehavior. We recommend using files declared as task parameters, manually creating a sandbox inside each task execution, and/or generating temporary random file names.
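One way to create such a sandbox inside a task body is the Python standard library tempfile module, which generates a uniquely named directory and removes it afterwards. This is a minimal sketch of the idea with a hypothetical function, work_in_sandbox, standing in for the body of a task; it is not a COMPSs feature.

```python
import os
import tempfile


def work_in_sandbox(data):
    """Write and read a scratch file inside a private temporary directory."""
    # Each call gets its own uniquely named directory, removed on exit.
    with tempfile.TemporaryDirectory(prefix="task_sandbox_") as sandbox:
        scratch = os.path.join(sandbox, "scratch.txt")
        with open(scratch, "w") as handle:
            handle.write(data)
        with open(scratch) as handle:
            return handle.read().upper()
```

Because the directory name is random, two tasks running concurrently in the same worker working directory do not overwrite each other's scratch files.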

With Java Applications

Java tasks

Java tasks must be declared as public. Although tasks can be defined in the main class or in other ones, we recommend defining the tasks in a class separate from the main method to force their public declaration.

Java objects

Objects used by tasks must follow the JavaBeans model (implementing an empty constructor and getters and setters for each attribute) or implement the Serializable interface. This is because objects will be transferred to remote machines to execute the tasks.

Java object aliasing

If a task has an object parameter and returns an object, the returned value must be a new object (or a cloned one) to prevent any aliasing with the task parameters.

// @Method(declaringClass = "...")
// DummyObject incorrectTask (
//    @Parameter(type = Type.OBJECT, direction = Direction.IN) DummyObject a,
//    @Parameter(type = Type.OBJECT, direction = Direction.IN) DummyObject b
// );
public DummyObject incorrectTask (DummyObject a, DummyObject b) {
    if (a.getValue() > b.getValue()) {
        return a;
    }
    return b;
}

// @Method(declaringClass = "...")
// DummyObject correctTask (
//    @Parameter(type = Type.OBJECT, direction = Direction.IN) DummyObject a,
//    @Parameter(type = Type.OBJECT, direction = Direction.IN) DummyObject b
// );
public DummyObject correctTask (DummyObject a, DummyObject b) {
    if (a.getValue() > b.getValue()) {
        return a.clone();
    }
    return b.clone();
}

public static void main() {
    DummyObject a1 = new DummyObject();
    DummyObject b1 = new DummyObject();
    DummyObject c1 = new DummyObject();
    c1 = incorrectTask(a1, b1);
    System.out.println("Initial value: " + c1.getValue());
    a1.modify();
    b1.modify();
    System.out.println("Aliased value: " + c1.getValue());


    DummyObject a2 = new DummyObject();
    DummyObject b2 = new DummyObject();
    DummyObject c2 = new DummyObject();
    c2 = correctTask(a2, b2);
    System.out.println("Initial value: " + c2.getValue());
    a2.modify();
    b2.modify();
    System.out.println("Non-aliased value: " + c2.getValue());
}

With Python Applications

Python constraints in the cloud

When using Python applications with constraints in the cloud, the minimum number of VMs must be set to 0 because the initial VM creation does not respect the task constraints. Note that if no constraints are defined, the initial VMs are still usable.

Intermediate files

Some applications may generate intermediate files that are only used among tasks and are never needed inside the master’s code. However, COMPSs will transfer these files back to the master node at the end of the execution. Currently, the only way to avoid transferring these intermediate files is to manually erase them at the end of the master’s code. Users must take into account that this only applies to files declared as task parameters and not to files created and/or erased inside a task.
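As a rough illustration of erasing intermediate files at the end of the master's code, the following sketch uses only the Python standard library. The helper remove_intermediate_files and the file name partial_0.dat are hypothetical names introduced for this example. The temporary directory stands in for wherever the application keeps its intermediate files.

```python
import os
import tempfile


def remove_intermediate_files(paths):
    """Delete every listed file that exists and return the removed paths."""
    removed = []
    for path in paths:
        if os.path.isfile(path):
            os.remove(path)
            removed.append(path)
    return removed


# Demo: a throwaway file stands in for an intermediate task file.
workdir = tempfile.mkdtemp()
partial = os.path.join(workdir, "partial_0.dat")  # hypothetical file name
with open(partial, "w") as handle:
    handle.write("intermediate data")

# Paths that do not exist are skipped silently.
missing = os.path.join(workdir, "missing.dat")
removed = remove_intermediate_files([partial, missing])
print("removed:", removed)
```

In an application, a call of this kind would be the last step of the master's code, after all results that depend on those files have been retrieved.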

User defined classes in Python

User-defined classes in Python must not be declared in the same file that contains the main method (if __name__ == '__main__') to avoid object serialization problems.

Python object hierarchy dependency detection

Dependencies are detected only on the objects that are task parameters or outputs. Consider the following code:

# a.py
class A:
  def __init__(self, b):
    self.b  = b

# main.py
from a import A
from pycompss.api.task import task
from pycompss.api.parameter import *
from pycompss.api.api import compss_wait_on

@task(obj = IN, returns = int)
def get_b(obj):
  return obj.b

@task(obj = INOUT)
def inc(obj):
  obj += [1]

def main():
  my_a = A([5])
  inc(my_a.b)
  obj = get_b(my_a)
  obj = compss_wait_on(obj)
  print(obj)

if __name__ == '__main__':
  main()

Note that there should be a dependency between A and A.b. However, PyCOMPSs cannot detect dependencies of this kind. These dependencies must be handled (and avoided) manually.

Python modules with global states

Some modules (for example, logging) have internal variables in addition to functions. These modules are not guaranteed to work in PyCOMPSs because master and worker code are executed in different interpreters. For instance, a logging configuration set on a worker will not be visible from the master interpreter instance.
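The effect of separate interpreters can be reproduced with the standard library alone. In the sketch below, a second interpreter started with subprocess plays the role of a worker. This is an assumption made purely for illustration and does not reflect how PyCOMPSs actually launches its workers. The "worker" changes the root logger level, and the first interpreter (the "master") does not observe the change.

```python
import logging
import subprocess
import sys

# Code run in a second interpreter, standing in for a worker
# (illustrative assumption, not the PyCOMPSs worker mechanism).
WORKER_CODE = (
    "import logging\n"
    "logging.getLogger().setLevel(logging.DEBUG)\n"
    "print(logging.getLogger().level)\n"
)

# Root logger level in this interpreter before the "worker" runs.
master_level_before = logging.getLogger().level

# The worker changes its own copy of the logging module state.
worker_output = subprocess.check_output([sys.executable, "-c", WORKER_CODE])
worker_level = int(worker_output.decode().strip())

# The module state of this interpreter is untouched.
master_level_after = logging.getLogger().level

print("worker root logger level:", worker_level)
print("master root logger level:", master_level_after)
```

Any configuration of this kind therefore has to be applied in every interpreter that needs it, for example inside the task code itself.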

Python global variables

This issue is very similar to the previous one. PyCOMPSs does not guarantee that applications that create or modify global variables while worker code is executing will work. In particular, this issue (and the previous one) is due to Python’s Global Interpreter Lock (GIL).

Python application directory as a module

If the Python application root folder is a Python module (i.e., it contains an __init__.py file), then runcompss must be called from the parent folder. For example, if the Python application is in a folder named my_folder that contains an __init__.py file, then PyCOMPSs will resolve all functions, classes and variables as my_folder.object_name instead of object_name. For example, consider the following file tree:

my_apps/
└── kmeans/
    ├── __init__.py
    └── kmeans.py

Then the correct command to call this app is runcompss kmeans/kmeans.py from the my_apps directory.

Python early program exit

All intentional, premature exit operations must be done with sys.exit. PyCOMPSs needs to perform some cleanup tasks before exiting and, if an early exit is performed with sys.exit, the event will be captured, allowing PyCOMPSs to perform these tasks. If the exit operation is done in a different way then there is no guarantee that the application will end properly.
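The reason sys.exit can be captured is that it raises a SystemExit exception rather than terminating the process immediately. The following sketch shows the general Python mechanism. The names run_with_cleanup and application are hypothetical, and this is not the actual PyCOMPSs implementation.

```python
import sys


def run_with_cleanup(app, cleanup):
    """Run app() and always run cleanup(), even if app() calls sys.exit."""
    try:
        app()
        return 0
    except SystemExit as exc:
        # sys.exit raises SystemExit, so the wrapper regains control here.
        code = exc.code
        if code is None:
            return 0
        return code if isinstance(code, int) else 1
    finally:
        # Runs on normal completion and on early exit alike.
        cleanup()


def application():
    # Hypothetical user code that decides to stop early.
    sys.exit(3)


events = []
exit_code = run_with_cleanup(application, lambda: events.append("cleanup done"))
print(exit_code, events)
```

By contrast, a call such as os._exit terminates the process without raising an exception, so a wrapper of this kind never gets the chance to run its cleanup step.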

Python with numpy and MKL

Tasks that invoke numpy and MKL may experience issues if different tasks use a different number of MKL threads. This is because MKL reuses threads across calls and does not change the number of threads from one call to another.

With Services

Services types

The current COMPSs version only supports SOAP-based services that implement the WS interoperability standard. REST services are not supported.