Troubleshooting
This section answers the most common issues that arise when executing COMPSs applications and describes the known limitations of COMPSs.
For specific issues not covered in this section, please do not hesitate to contact us at: support-compss@bsc.es
How to debug
This section explains what to do when errors occur during the execution of an application.
First steps
When an error or exception occurs during the execution of an application, the first thing users must do is check the application output:
- Using runcompss, the output is shown in the console.
- Using enqueue_compss, the output is in the compss-<JOB_ID>.out and compss-<JOB_ID>.err files.
If the error happens within a task, it will not appear in these files. Users must check the log folder to find out what has failed. By default, the log folder is located at:
- Using runcompss: $HOME/.COMPSs/<APP_NAME>_XX (where XX is a number between 00 and 99 that increases on each run).
- Using enqueue_compss: $HOME/.COMPSs/<JOB_ID>
This log folder contains the jobs folder, where all output/errors of the
tasks are stored. In particular, each task produces a job<TASK_NUMBER>_NEW.out
and a job<TASK_NUMBER>_NEW.err file when it fails.
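As a minimal sketch of the lookup described above, the following Python helpers find the most recent run folder and list the job error files it contains. The function names are hypothetical, and treating non-empty .err files as the interesting ones is an assumption of this sketch rather than documented COMPSs behaviour.

```python
from pathlib import Path
from typing import List, Optional


def latest_log_dir(base: Path) -> Optional[Path]:
    """Return the most recently modified run folder under base, or None."""
    runs = [p for p in base.iterdir() if p.is_dir()] if base.is_dir() else []
    return max(runs, key=lambda p: p.stat().st_mtime, default=None)


def failed_job_logs(log_dir: Path) -> List[Path]:
    """Return the non-empty job*.err files stored in <log_dir>/jobs."""
    jobs = log_dir / "jobs"
    if not jobs.is_dir():
        return []
    return sorted(p for p in jobs.glob("job*.err") if p.stat().st_size > 0)


if __name__ == "__main__":
    # Default log location mentioned in this section.
    run = latest_log_dir(Path.home() / ".COMPSs")
    if run is None:
        print("No COMPSs log folder found")
    else:
        for err_file in failed_job_logs(run):
            print(err_file)
```

Run as a script, it prints the candidate error files of the latest execution, which can then be opened one by one.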
Tip
If the user enables the debug mode by adding the -d flag to the
runcompss or enqueue_compss command, more information is
stored in the log folder of each run, which makes errors easier to detect.
In particular, all output and error output of all tasks will appear
within the jobs folder.
In addition, some more log files will appear:
- runtime.log
- pycompss.log (only if using the Python binding).
- pycompss.err (only if using the Python binding and an error in the binding happens).
- resources.log
- workers folder. This folder will contain four files per worker node:
  - worker_<MACHINE_NAME>.out
  - worker_<MACHINE_NAME>.err
  - binding_worker_<MACHINE_NAME>.out
  - binding_worker_<MACHINE_NAME>.err
As a suggestion, users should check the last lines of the runtime.log.
If the file transfers or the tasks are failing, an error message will appear
in this file. If the file transfers are successful and the jobs are
submitted, users should check the jobs folder and look at the error
messages produced inside each job. Note that the presence of
RESUBMITTED files means that something inside the job is failing.
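The two checks suggested above can be sketched in a few lines of Python: reading the last lines of runtime.log and listing which jobs left RESUBMITTED files behind. The helper names are hypothetical; only the file names come from this section.

```python
from collections import deque
from pathlib import Path
from typing import List


def tail(path: Path, n: int = 20) -> List[str]:
    """Return the last n lines of a text file."""
    with path.open(errors="replace") as handle:
        return [line.rstrip("\n") for line in deque(handle, maxlen=n)]


def resubmitted_jobs(log_dir: Path) -> List[str]:
    """Return the sorted job names that have *_RESUBMITTED.* files."""
    names = {p.name.split("_RESUBMITTED")[0]
             for p in (log_dir / "jobs").glob("*_RESUBMITTED.*")}
    return sorted(names)
```

For example, tail(log_dir / "runtime.log") shows the end of the runtime log, and a non-empty result from resubmitted_jobs(log_dir) points at the jobs whose .err files deserve a closer look.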
If the workers folder is empty, it means that the execution failed and
the COMPSs runtime was not able to retrieve the workers logs. In this case,
users must connect to the workers and look directly into the worker logs.
Alternatively, if the user is running with a shared disk (e.g. in a
supercomputer), the user can define a shared folder with
--worker_working_directory=/shared/folder, where a tmp_XXXXXX folder
will be created during the application execution and all worker logs will be
stored.
Tip
When debug is enabled, the workers also produce log files, which are
transferred to the master when the application finishes. These log files
are always removed from the workers (even if there is a failure, to avoid
leaving files behind).
It is nevertheless possible to disable the removal of the log files
produced by the workers, so that users can still check them in the
worker nodes if something fails and these logs are not transferred to the
master node. To this end, add the following flag to runcompss or
enqueue_compss:
--keep_workingdir
Please note that the workers store the log files in the folder
defined by --worker_working_directory, which can be a shared or
local folder.
Tip
If a segmentation fault occurs, the core dump file can be generated by
adding the following flag to runcompss or enqueue_compss:
--gen_coredump
The following subsections show debugging examples depending on the chosen flavour (Java, Python or C/C++).
Java examples
Exception in the main code
TODO
Missing subsection
Exception in a task
TODO
Missing subsection
Python examples
Exception in the main code
Consider the following code, where an intentional error has been introduced in the main code to show how it can be debugged.
from pycompss.api.task import task

@task(returns=1)
def increment(value):
    return value + 1

def main():
    initial_value = 1
    result = increment(initial_value)
    result = result + 1  # Try to use result without synchronizing it: Error
    print("Result: " + str(result))

if __name__=='__main__':
    main()
When executed, it produces the following output:
$ runcompss error_in_main.py
[ INFO] Inferred PYTHON language
[ INFO] Using default location for project file: /opt/COMPSs//Runtime/configuration/xml/projects/default_project.xml
[ INFO] Using default location for resources file: /opt/COMPSs//Runtime/configuration/xml/resources/default_resources.xml
[ INFO] Using default execution type: compss
----------------- Executing error_in_main.py --------------------------
WARNING: COMPSs Properties file is null. Setting default values
[(377) API] - Starting COMPSs Runtime v3.4
[ ERROR ]: An exception occurred: unsupported operand type(s) for +: 'Future' and 'int'
Traceback (most recent call last):
File "/opt/COMPSs//Bindings/python/3/pycompss/runtime/launch.py", line 204, in compss_main
execfile(APP_PATH, globals()) # MAIN EXECUTION
File "error_in_main.py", line 16, in <module>
main()
File "error_in_main.py", line 11, in main
result = result + 1 # Try to use result without synchronizing it: Error
TypeError: unsupported operand type(s) for +: 'Future' and 'int'
[ERRMGR] - WARNING: Task 1(Action: 1) with name error_in_main.increment has been canceled.
[ERRMGR] - WARNING: Task canceled: [[Task id: 1], [Status: CANCELED], [Core id: 0], [Priority: false], [NumNodes: 1], [MustReplicate: false], [MustDistribute: false], [error_in_main.increment(INT_T)]]
[(3609) API] - Execution Finished
Error running application
The output includes the complete traceback, which points to where the error is
and states its reason. In this example, the reason is
TypeError: unsupported operand type(s) for +: 'Future' and 'int'
since we are trying to use an object that has not been synchronized.
Tip
Any exception raised from the main code will appear in the same way: the traceback helps to identify the line that produced the exception and its reason.
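For reference, a possible corrected version of the example synchronizes the task result with compss_wait_on before using it in the main code. The try/except fallback is an assumption of this sketch, added only so that it also runs where PyCOMPSs is not installed; under runcompss the real decorator and API are used.

```python
try:
    from pycompss.api.task import task
    from pycompss.api.api import compss_wait_on
except ImportError:
    # Stand-ins (assumption of this sketch): without PyCOMPSs, tasks run
    # immediately as plain functions and there is nothing to synchronize.
    def task(*_args, **_kwargs):
        return lambda func: func

    def compss_wait_on(obj):
        return obj


@task(returns=1)
def increment(value):
    return value + 1


def main():
    initial_value = 1
    result = increment(initial_value)  # a Future when run with runcompss
    result = compss_wait_on(result)    # synchronize before using the value
    result = result + 1                # now operates on a plain integer
    print("Result: " + str(result))
    return result


if __name__ == "__main__":
    main()
```

The only change with respect to the failing code is the compss_wait_on call placed before the first use of result in the main code.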
Exception in a task
Consider the following code, where an intentional error has been introduced in the task code to show how it can be debugged.
from pycompss.api.task import task
from pycompss.api.api import compss_wait_on

@task(returns=1)
def increment(value):
    return value + 1  # value is a string, cannot add an int: Error

def main():
    initial_value = "1"  # the initial value is a string instead of an integer
    result = increment(initial_value)
    result = compss_wait_on(result)
    print("Result: " + str(result))

if __name__=='__main__':
    main()
When executed, it produces the following output:
$ runcompss error_in_task.py
[ INFO] Inferred PYTHON language
[ INFO] Using default location for project file: /opt/COMPSs//Runtime/configuration/xml/projects/default_project.xml
[ INFO] Using default location for resources file: /opt/COMPSs//Runtime/configuration/xml/resources/default_resources.xml
[ INFO] Using default execution type: compss
----------------- Executing error_in_task.py --------------------------
WARNING: COMPSs Properties file is null. Setting default values
[(570) API] - Starting COMPSs Runtime v3.4
[ERRMGR] - WARNING: Job 1 for running task 1 on worker localhost has failed; resubmitting task to the same worker.
[ERRMGR] - WARNING: Task 1 execution on worker localhost has failed; rescheduling task execution. (changing worker)
[ERRMGR] - WARNING: Job 2 for running task 1 on worker localhost has failed; resubmitting task to the same worker.
[ERRMGR] - WARNING: Task 1 has already been rescheduled; notifying task failure.
[ERRMGR] - WARNING: Task 'error_in_task.increment' TOTALLY FAILED.
Possible causes:
-Exception thrown by task 'error_in_task.increment'.
-Expected output files not generated by task 'error_in_task.increment'.
-Could not provide nor retrieve needed data between master and worker.
Check files '/home/user/.COMPSs/error_in_task.py_01/jobs/job[1|2'] to find out the error.
[ERRMGR] - ERROR: Task failed: [[Task id: 1], [Status: FAILED], [Core id: 0], [Priority: false], [NumNodes: 1], [MustReplicate: false], [MustDistribute: false], [error_in_task.increment(STRING_T)]]
[ERRMGR] - Shutting down COMPSs...
[(4711) API] - Execution Finished
Shutting down the running process
Error running application
The output shows that there has been an issue with task number 1. Since the default behavior of the runtime is to resubmit the failed task, job 2 (the rescheduled execution of task 1) also fails.
In this case, the runtime suggests checking the log files of the jobs:
/home/user/.COMPSs/error_in_task.py_01/jobs/job[1|2]
Looking into the logs folder, it can be seen that the jobs folder contains
the logs of the failed jobs:
$HOME/.COMPSs
└── error_in_task.py_01
    ├── jobs
    │   ├── job1_NEW.err
    │   ├── job1_NEW.out
    │   ├── job1_RESUBMITTED.err
    │   ├── job1_RESUBMITTED.out
    │   ├── job2_NEW.err
    │   ├── job2_NEW.out
    │   ├── job2_RESUBMITTED.err
    │   └── job2_RESUBMITTED.out
    ├── resources.log
    ├── runtime.log
    ├── tmpFiles
    └── workers
The job1_NEW.err file contains the complete traceback of the exception that
has been raised (TypeError: cannot concatenate 'str' and 'int' objects, as a
consequence of passing a string as input to a task that tries to add 1 to it):
[EXECUTOR] executeTask - Error in task execution
es.bsc.compss.types.execution.exceptions.JobExecutionException: Job 1 exit with value 1
at es.bsc.compss.invokers.external.piped.PipedInvoker.invokeMethod(PipedInvoker.java:78)
at es.bsc.compss.invokers.Invoker.invoke(Invoker.java:352)
at es.bsc.compss.invokers.Invoker.processTask(Invoker.java:287)
at es.bsc.compss.executor.Executor.executeTask(Executor.java:486)
at es.bsc.compss.executor.Executor.executeTaskWrapper(Executor.java:322)
at es.bsc.compss.executor.Executor.execute(Executor.java:229)
at es.bsc.compss.executor.Executor.processRequests(Executor.java:198)
at es.bsc.compss.executor.Executor.run(Executor.java:153)
at es.bsc.compss.executor.utils.ExecutionPlatform$2.run(ExecutionPlatform.java:178)
at java.lang.Thread.run(Thread.java:748)
Traceback (most recent call last):
File "/opt/COMPSs/Bindings/python/2/pycompss/worker/commons/worker.py", line 265, in task_execution
**compss_kwargs)
File "/opt/COMPSs/Bindings/python/2/pycompss/api/task.py", line 267, in task_decorator
return self.worker_call(*args, **kwargs)
File "/opt/COMPSs/Bindings/python/2/pycompss/api/task.py", line 1523, in worker_call
**user_kwargs)
File "/home/user/temp/Bugs/documentation/error_in_task.py", line 6, in increment
return value + 1
TypeError: cannot concatenate 'str' and 'int' objects
Tip
Any exception raised from the task code will appear in the same way: the traceback helps to identify the line that produced the exception and its reason.
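Since the job error file mixes the Java stack trace of the executor with the Python traceback of the task, it can help to isolate the latter. The following is a small sketch with a hypothetical helper name. It assumes the usual Python traceback layout: a header line, indented frames, and a final unindented exception line.

```python
from typing import List

TRACEBACK_HEADER = "Traceback (most recent call last):"


def python_traceback(err_text: str) -> List[str]:
    """Return the lines of the last Python traceback found in err_text."""
    lines = err_text.splitlines()
    starts = [i for i, line in enumerate(lines)
              if line.startswith(TRACEBACK_HEADER)]
    if not starts:
        return []
    block = [lines[starts[-1]]]
    for line in lines[starts[-1] + 1:]:
        block.append(line)
        if line and not line[0].isspace():
            break  # unindented line: "ExceptionType: message"
    return block
```

Passing the text of job1_NEW.err to python_traceback returns only the Python part, whose last line states the exception type and message.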
C/C++ examples
Exception in the main code
TODO
Missing subsection
Exception in a task
TODO
Missing subsection
Common Issues
This section describes some common issues and how to fix them.
Tasks are not executed
If the tasks remain in Blocked state, there are probably no resources matching the specific task constraints. This can have two causes: the resources are not correctly loaded into the runtime, or the task constraints do not match any resource.
In the first case, users should take a look at the resources.log and
check that all the resources defined in the project.xml file are
available to the runtime. In the second case, users should redefine the
task constraints taking into account the resource capabilities defined
in the resources.xml and project.xml files.
Jobs fail
If all the application's tasks fail because all the submitted jobs fail, it is probably due to a resource misconfiguration. In most cases, the resource that the application is trying to access has no passwordless access for the configured user. This can be checked as follows:
- Open the project.xml. (The default file is stored under /opt/COMPSs/Runtime/configuration/xml/projects/project.xml)
- For each resource, note its name and the value inside the User tag. Remember that if there is no User tag, COMPSs will try to connect to this resource with the same username as the one that launches the main application.
- For each noted resourceName - user pair, try ssh user@resourceName. If the connection asks for a password, then there is an error in the configuration of the ssh access to the resource.
The problem can be solved by running the following commands:
compss@bsc:~$ scp ~/.ssh/id_rsa.pub user@resourceName:./myRSA.pub
compss@bsc:~$ ssh user@resourceName "cat myRSA.pub >> ~/.ssh/authorized_keys; rm ./myRSA.pub"
These commands are a quick solution. For further details, please check the General Section.
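The manual check of each resource can be sketched in Python as follows. The sketch assumes that resources appear in project.xml as ComputeNode elements with a Name attribute and an optional User child; this layout and the helper names are assumptions, so adapt them to your actual file. The generated command uses the OpenSSH BatchMode option, which makes ssh fail instead of prompting for a password.

```python
import xml.etree.ElementTree as ET
from typing import List, Optional, Tuple


def resource_users(project_xml: str) -> List[Tuple[str, Optional[str]]]:
    """Return (resource name, user or None) for each ComputeNode element."""
    root = ET.fromstring(project_xml)
    pairs = []
    for node in root.iter("ComputeNode"):  # assumed element name
        user = node.findtext("User")       # None when there is no User tag
        pairs.append((node.get("Name"), user.strip() if user else None))
    return pairs


def ssh_check_command(name: str, user: Optional[str]) -> List[str]:
    """Build an ssh command that fails instead of asking for a password."""
    target = f"{user}@{name}" if user else name
    return ["ssh", "-o", "BatchMode=yes", target, "true"]
```

Each command can be run with subprocess.run(cmd).returncode; a non-zero value suggests that passwordless access to that resource is not configured.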
Exceptions when starting the Worker processes
When the COMPSs master is not able to communicate with one of the COMPSs workers described in the project.xml and resources.xml files, different exceptions can be raised and logged on the runtime.log of the application. All of them are raised during the worker start up and contain the [WorkerStarter] prefix. Next we provide a list with the common exceptions:
- InitNodeException
Exception raised when the remote SSH process to start the worker has failed.
- UnstartedNodeException
Exception raised when the worker process has aborted.
- Connection refused
Exception raised when the master cannot communicate with the worker process (NIO).
All these exceptions encapsulate an error when starting the worker process. This means that the worker machine is not properly configured, so you need to check the environment of the failing worker. Further information about the specific error can be found in the worker log, available at the working directory path in the remote worker machine (the worker working directory specified in the project.xml file).
Next, we list the most common errors and their solutions:
- java command not found
Invalid path to the java binary. Check the JAVA_HOME definition at the remote worker machine.
- Cannot create WD
Invalid working directory. Check the rw permissions of the worker's working directory.
- No exception
The worker process has started normally and there is no exception. In this case the issue is normally due to the firewall configuration preventing the communication between the COMPSs master and worker. Please check that the worker firewall has in and out permissions for TCP and UDP in the adaptor ports (the adaptor ports are specified in the resources.xml file; by default the port range is 43000-44000).
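A quick way to test the TCP side of the firewall check from the master is to try opening a connection to an adaptor port of the worker. This is a minimal sketch with a hypothetical helper name; it only covers TCP (UDP cannot be verified this way) and it only succeeds while a worker process is actually listening on the port.

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, filtered by a firewall, unreachable host or timeout.
        return False
```

For instance, tcp_port_open("worker-hostname", 43001) could be called while the application is running, where the hostname is a placeholder and the port is taken from the resources.xml file.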
Compilation error: @Method not found
When trying to compile Java applications users can get some of the following compilation errors:
error: package es.bsc.compss.types.annotations does not exist
import es.bsc.compss.types.annotations.Constraints;
^
error: package es.bsc.compss.types.annotations.task does not exist
import es.bsc.compss.types.annotations.task.Method;
^
error: package es.bsc.compss.types.annotations does not exist
import es.bsc.compss.types.annotations.Parameter;
^
error: package es.bsc.compss.types.annotations.Parameter does not exist
import es.bsc.compss.types.annotations.parameter.Direction;
^
error: package es.bsc.compss.types.annotations.Parameter does not exist
import es.bsc.compss.types.annotations.parameter.Type;
^
error: cannot find symbol
@Parameter(type = Type.FILE, direction = Direction.INOUT)
^
symbol: class Parameter
location: interface APPLICATION_Itf
error: cannot find symbol
@Constraints(computingUnits = "2")
^
symbol: class Constraints
location: interface APPLICATION_Itf
error: cannot find symbol
@Method(declaringClass = "application.ApplicationImpl")
^
symbol: class Method
location: interface APPLICATION_Itf
All these errors are raised because the compss-engine.jar is not
listed in the CLASSPATH. The default COMPSs installation automatically
inserts this package into the CLASSPATH but it may have been overwritten
or deleted. Please check that your environment variable CLASSPATH
contains the compss-engine.jar location by running the following
command:
$ echo $CLASSPATH | grep compss-engine
If the result of the previous command is empty it means that you are
missing the compss-engine.jar package in your CLASSPATH.
The easiest solution is to manually export the CLASSPATH variable into the user session:
$ export CLASSPATH=$CLASSPATH:/opt/COMPSs/Runtime/compss-engine.jar
However, you will need to remember to export this variable every time
you log out and back in again. Consequently, we recommend adding this
export to the .bashrc file:
$ echo "# COMPSs variables for Java compilation" >> ~/.bashrc
$ echo 'export CLASSPATH=$CLASSPATH:/opt/COMPSs/Runtime/compss-engine.jar' >> ~/.bashrc
Warning
The compss-engine.jar is installed inside the COMPSs
installation directory. If you have performed a custom installation,
the path of the package may be different.
Jobs failed on method reflection
When executing an application the main code gets stuck executing a task.
Taking a look at the runtime.log users can check that the job
associated to the task has failed (and all its resubmissions too). Then,
opening the jobX_NEW.out or the jobX_NEW.err files users find
the following error:
[ERROR|es.bsc.compss.Worker|Executor] Can not get method by reflection
es.bsc.compss.nio.worker.executors.Executor$JobExecutionException: Can not get method by reflection
at es.bsc.compss.nio.worker.executors.JavaExecutor.executeTask(JavaExecutor.java:142)
at es.bsc.compss.nio.worker.executors.Executor.execute(Executor.java:42)
at es.bsc.compss.nio.worker.JobLauncher.executeTask(JobLauncher.java:46)
at es.bsc.compss.nio.worker.JobLauncher.processRequests(JobLauncher.java:34)
at es.bsc.compss.util.RequestDispatcher.run(RequestDispatcher.java:46)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NoSuchMethodException: simple.Simple.increment(java.lang.String)
at java.lang.Class.getMethod(Class.java:1678)
at es.bsc.compss.nio.worker.executors.JavaExecutor.executeTask(JavaExecutor.java:140)
... 5 more
This error is due to the fact that COMPSs cannot find one of the tasks declared in the Java Interface. Commonly this is triggered by one of the following errors:
- The declaringClass of the tasks in the Java Interface has not been correctly defined.
- The parameters of the tasks in the Java Interface do not match the task call.
- The tasks have not been defined as public.
Jobs failed on reflect target invocation null pointer
When executing an application the main code gets stuck executing a task.
Taking a look at the runtime.log users can check that the job
associated to the task has failed (and all its resubmissions too). Then,
opening the jobX_NEW.out or the jobX_NEW.err files users find
the following error:
[ERROR|es.bsc.compss.Worker|Executor]
java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at es.bsc.compss.nio.worker.executors.JavaExecutor.executeTask(JavaExecutor.java:154)
at es.bsc.compss.nio.worker.executors.Executor.execute(Executor.java:42)
at es.bsc.compss.nio.worker.JobLauncher.executeTask(JobLauncher.java:46)
at es.bsc.compss.nio.worker.JobLauncher.processRequests(JobLauncher.java:34)
at es.bsc.compss.util.RequestDispatcher.run(RequestDispatcher.java:46)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NullPointerException
at simple.Ll.printY(Ll.java:25)
at simple.Simple.task(Simple.java:72)
... 10 more
The cause of this error is that the Java object accessed by the task has not been correctly transferred and one or more of its fields is null. The transfer failure is normally caused by the transferred object not being serializable.
Users should check that all the object parameters in the task either implement the Serializable interface or follow the JavaBeans model (by implementing an empty constructor and getters and setters for each attribute).
Tracing merge failed: too many open files
When too many nodes and threads are instrumented, the tracing merge can fail due to an OS limitation, namely the maximum number of open files. This problem usually happens when using advanced mode, due to the larger number of instrumented threads. To overcome this issue users have two options. The first option is to use the Extrae parallel MPI merger. This merger is automatically used if COMPSs was installed with MPI support. In Ubuntu you can install the following packages to get MPI support:
$ sudo apt-get install libcr-dev mpich2 mpich2-doc
Please note that Extrae is never compiled with MPI support when building it locally (with buildlocal command).
To check if COMPSs was deployed with MPI support, you can check the installation log and look for the following Extrae configuration output:
Package configuration for Extrae VERSION based on extrae/trunk rev. 3966:
-----------------------
Installation prefix: /gpfs/apps/MN3/COMPSs/Trunk/Dependencies/extrae
Cross compilation: no
CC: gcc
CXX: g++
Binary type: 64 bits
MPI instrumentation: yes
MPI home: /apps/OPENMPI/1.8.1-mellanox
MPI launcher: /apps/OPENMPI/1.8.1-mellanox/bin/mpirun
Alternatively, if you have already installed COMPSs, you can check the
Extrae configuration by executing the script
/opt/COMPSs/Dependencies/extrae/etc/configured.sh. Users should
check that the flags --with-mpi=/usr and --enable-parallel-merge are
present and that the MPI path is correct and exists. Sample output:
EXTRAE_HOME is not set. Guessing from the script invoked that Extrae was installed in /opt/COMPSs/Dependencies/extrae
The directory exists .. OK
Loaded specs for Extrae from /opt/COMPSs/Dependencies/extrae/etc/extrae-vars.sh
Extrae SVN branch extrae/trunk at revision 3966
Extrae was configured with:
$ ./configure --enable-gettimeofday-clock --without-mpi --without-unwind --without-dyninst --without-binutils --with-mpi=/usr --enable-parallel-merge --with-papi=/usr --with-java-jdk=/usr/lib/jvm/java-7-openjdk-amd64/ --disable-openmp --disable-nanos --disable-smpss --prefix=/opt/COMPSs/Dependencies/extrae --with-mpi=/usr --enable-parallel-merge --libdir=/opt/COMPSs/Dependencies/extrae/lib
CC was gcc
CFLAGS was -g -O2 -fno-optimize-sibling-calls -Wall -W
CXX was g++
CXXFLAGS was -g -O2 -fno-optimize-sibling-calls -Wall -W
MPI_HOME points to /usr and the directory exists .. OK
LIBXML2_HOME points to /usr and the directory exists .. OK
PAPI_HOME points to /usr and the directory exists .. OK
DYNINST support seems to be disabled
UNWINDing support seems to be disabled (or not needed)
Translating addresses into source code references seems to be disabled (or not needed)
Please, report bugs to tools@bsc.es
Important
Disclaimer: the parallel merge with MPI will not bypass the system's maximum number of open files; it will just distribute the files among the resources. If all resources belong to the same machine, the merge will fail anyway.
The second option is to increase the OS maximum number of open files. For instance, in Ubuntu add ulimit -n 40000 just before the start-stop-daemon line in the do_start section.
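Before changing the system configuration, it may be worth checking the limit that currently applies. The following sketch uses the standard resource module (Unix only) to read the limit of the current process. The helper names are hypothetical, and the value 40000 in the usage note is simply the one used in the example above.

```python
import resource
from typing import Tuple


def open_files_limit() -> Tuple[int, int]:
    """Return the (soft, hard) limits on open files for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def limit_is_below(target: int) -> bool:
    """Return True if the soft limit is finite and smaller than target."""
    soft, _hard = open_files_limit()
    return soft != resource.RLIM_INFINITY and soft < target
```

If limit_is_below(40000) returns True in the environment that launches the merge, the limit is lower than the value suggested in this section.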
Performance issues
Different work directories
Having different work directories (for master and workers) may lead to
performance issues, in particular if the work directories belong to different
mount points with different performance, since files may need to be copied
between them.
This happens, for example, when the master and worker workspaces use folders
that are shared across nodes in a supercomputer but have different performance
(e.g. scratch and projects in MareNostrum 4).
Memory Profiling
This section will show you how to analyze the main memory consumed during an application execution.
Basic profiling
COMPSs provides a mechanism to show the memory usage over time when running Python applications. This is particularly useful when memory issues happen (e.g. memory exhaustion that crashes the application), or for performance analysis (e.g. problem size scalability).
To this end, the runcompss and enqueue_compss commands provide the
--python_memory_profile flag, which produces a set of files (one per node used
in the application execution) where the memory used during the execution is
recorded at the end of the application.
They are generated in the folder from which the execution has been launched.
Important
The memory-profiler and psutil packages are mandatory in order to
use the --python_memory_profile flag.
They can be easily installed with pip:
$ python -m pip install psutil memory-profiler --user
Tip
If you want to store the files produced by the memory profiler in a different
folder, export the COMPSS_WORKER_PROFILE_PATH variable with the destination path:
$ export COMPSS_WORKER_PROFILE_PATH=/path/to/destination
When --python_memory_profile is included, a file with name
mprofile_<DATE_TIME>.dat is generated for the master memory profiling,
while for the workers they are named <WORKER_NODE_NAME>.dat.
These files can be displayed with the mprof tool:
$ mprof plot <FILE>.dat
Figure 89 mprof plot example
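When no display is available for mprof plot, the peak can also be read directly from the file. The sketch below assumes that the .dat files follow the plain-text layout written by mprof, with sample lines of the form "MEM <MiB> <timestamp>". This layout is an assumption about the file contents and the helper name is hypothetical.

```python
from typing import Optional, Tuple


def peak_memory(dat_text: str) -> Optional[Tuple[float, float]]:
    """Return (peak MiB, timestamp) over the MEM samples, or None."""
    samples = []
    for line in dat_text.splitlines():
        fields = line.split()
        # Assumed sample layout: MEM <memory in MiB> <timestamp>
        if len(fields) == 3 and fields[0] == "MEM":
            samples.append((float(fields[1]), float(fields[2])))
    return max(samples, default=None)
```

Reading a file with open(path).read() and passing its text to peak_memory gives the highest sample and the time at which it was taken.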
Advanced profiling
For a more fine grained memory profiling and analyzing the workers memory
usage, PyCOMPSs provides the @profile decorator. This decorator is able
to display the memory usage per line of the code.
It can be imported from the PyCOMPSs functions module:
from pycompss.functions.profile import profile
This decorator can be placed over any function:
- Over the @task decorator (or over the decorator stack of a task): This will display the memory usage in the master (through standard output).
- Under the @task decorator: This will display the memory used by the actual task in the worker. The memory usage will be shown through standard output, so it is mandatory to enable debug (--log_level=debug) and check the job output file from .COMPSs/<app_folder>/jobs/.
- Over a non task function: Will display the memory usage of the function in the master (through standard output).
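The placements listed above can be illustrated with a short sketch. The task and function bodies are made up for illustration, and the try/except fallback is an assumption of this sketch so that it also runs where PyCOMPSs or its profiling dependencies are not installed.

```python
try:
    from pycompss.api.task import task
    from pycompss.functions.profile import profile
except ImportError:
    # Stand-ins (assumption of this sketch): decorators that do nothing.
    def task(*_args, **_kwargs):
        return lambda func: func

    def profile(*_args, **_kwargs):
        return lambda func: func


@task(returns=1)
@profile()  # under @task: reports the memory used by the task in the worker
def increment(value):
    data = [1] * (10 ** 6)  # allocation that would show up in the report
    del data
    return value + 1


@profile()  # over a non-task function: reports memory usage in the master
def prepare(size):
    return list(range(size))
```

Remember that the report of increment ends up in the job output file, so it is only visible when debug is enabled.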
By default, the @profile decorator reports the memory usage line by line:
Line # Mem usage Increment Occurrences Line Contents
=============================================================
7 53.3 MiB 53.3 MiB 1 @task(returns=1)
8 @profile()
9 def increment(value):
10 61.0 MiB 7.7 MiB 1 a = [1] * (10 ** 6)
11 83.7 MiB 22.7 MiB 1 b = [2] * (value * 10 ** 6)
12 312.6 MiB 228.9 MiB 1 c = [3] * (value * 10 ** 7)
13 289.9 MiB -22.7 MiB 1 del b
14 289.9 MiB 0.0 MiB 1 return value + 1
Job name: job10_NEW
Task start time: 1653572135.1119144
Elapsed time: 0.10722756385803223
Initial memory: 8150122496
Final memory: 7759843328
However, this information can be reduced to show only the peak memory usage of
each task by setting full_report=False in the @profile decorator
(@profile(full_report=False)). More specifically, the profiling information
reported will be one line per task showing:
- The task start time
- The task job name
- The file that contains the task
- The task name
- The task elapsed time
- The amount of memory used before executing the task
- The amount of memory used after executing the task
- The peak memory usage
1653572135.1119144 job10_NEW /path/to/increment.py increment 0.10722756385803223 8150122496 7759843328 312.6 MiB
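To post-process many of these one-line reports, each line can be split into the fields listed above. This is a minimal sketch: the class and function names are hypothetical, and it assumes whitespace-separated fields and a file path without spaces.

```python
from typing import NamedTuple


class TaskProfile(NamedTuple):
    start_time: float
    job_name: str
    file: str
    task_name: str
    elapsed: float
    memory_before: int
    memory_after: int
    peak: str  # value and unit, e.g. "312.6 MiB"


def parse_profile_line(line: str) -> TaskProfile:
    """Split a one-line profile report into its fields."""
    start, job, path, name, elapsed, before, after, *peak = line.split()
    return TaskProfile(float(start), job, path, name, float(elapsed),
                       int(before), int(after), " ".join(peak))
```

Applied to the sample line above, parse_profile_line returns a record whose job_name is job10_NEW and whose peak is "312.6 MiB".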
Tip
It is possible to redirect the profiling output to a single file by
exporting the COMPSS_PROFILING_FILE environment variable with the
path to the destination file.
Please remember that this variable needs to be available in the worker
if the @profile decorator is used to report the memory usage of the
tasks. Consequently, consider using the --env_script flag
of the runcompss command to define a script that exports
COMPSS_PROFILING_FILE in order to make it available in the workers
in local executions.
Known Limitations
The current COMPSs version has the following limitations.
Global
- Exceptions
The current COMPSs version is not able to propagate exceptions raised from a task to the master. However, the runtime catches any exception and sets the task as failed.
- Use of file paths
The persistent workers implementation has a unique Working Directory per worker. This means that tasks should not use hardcoded file names, to avoid file collisions and task misbehavior. We recommend using files declared as task parameters, manually creating a sandbox inside each task execution, and/or generating temporary random file names.
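The sandbox recommendation can be sketched with the standard tempfile module: each call works inside its own uniquely named directory, so two tasks running in the same worker directory never share a file name. The function names and the file name below are hypothetical and only illustrate the pattern for code inside a task.

```python
import os
import tempfile


def run_in_sandbox(work, base_dir=None):
    """Call work(sandbox_path) inside a fresh, uniquely named directory."""
    with tempfile.TemporaryDirectory(prefix="task_", dir=base_dir) as sandbox:
        return work(sandbox)  # the directory is removed afterwards


def write_partial_result(sandbox):
    """Example body: a fixed file name is safe inside a private sandbox."""
    path = os.path.join(sandbox, "partial.txt")
    with open(path, "w") as handle:
        handle.write("42")
    with open(path) as handle:
        return handle.read()
```

A task body would then call run_in_sandbox(write_partial_result) instead of writing partial.txt directly into the shared working directory.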
With Java Applications
- Java tasks
Java tasks must be declared as public. Although tasks can be defined in the main class or in other ones, we recommend defining the tasks in a class separate from the one containing the main method, to force their public declaration.
- Java objects
Objects used by tasks must follow the JavaBeans model (implementing an empty constructor and getters and setters for each attribute) or implement the Serializable interface. This is due to the fact that objects will be transferred to remote machines to execute the tasks.
- Java object aliasing
If a task has an object parameter and returns an object, the returned value must be a new object (or a cloned one) to prevent any aliasing with the task parameters.
// @Method(declaringClass = "...")
// DummyObject incorrectTask (
//    @Parameter(type = Type.OBJECT, direction = Direction.IN) DummyObject a,
//    @Parameter(type = Type.OBJECT, direction = Direction.IN) DummyObject b
// );
public DummyObject incorrectTask (DummyObject a, DummyObject b) {
    if (a.getValue() > b.getValue()) {
        return a;
    }
    return b;
}

// @Method(declaringClass = "...")
// DummyObject correctTask (
//    @Parameter(type = Type.OBJECT, direction = Direction.IN) DummyObject a,
//    @Parameter(type = Type.OBJECT, direction = Direction.IN) DummyObject b
// );
public DummyObject correctTask (DummyObject a, DummyObject b) {
    if (a.getValue() > b.getValue()) {
        return a.clone();
    }
    return b.clone();
}

public static void main() {
    // Aliased case: the returned object is one of the parameters
    DummyObject a1 = new DummyObject();
    DummyObject b1 = new DummyObject();
    DummyObject c1 = new DummyObject();
    c1 = incorrectTask(a1, b1);
    System.out.println("Initial value: " + c1.getValue());
    a1.modify();
    b1.modify();
    System.out.println("Aliased value: " + c1.getValue());

    // Non-aliased case: the returned object is a clone
    DummyObject a2 = new DummyObject();
    DummyObject b2 = new DummyObject();
    DummyObject c2 = new DummyObject();
    c2 = correctTask(a2, b2);
    System.out.println("Initial value: " + c2.getValue());
    a2.modify();
    b2.modify();
    System.out.println("Non-aliased value: " + c2.getValue());
}
With Python Applications
- Python constraints in the cloud
When using Python applications with constraints in the cloud, the minimum number of VMs must be set to 0 because the initial VM creation does not respect the task constraints. Notice that if no constraints are defined, the initial VMs are still usable.
- Intermediate files
Some applications may generate intermediate files that are only used among tasks and are never needed inside the master's code. However, COMPSs will transfer back these files to the master node at the end of the execution. Currently, the only way to avoid transferring these intermediate files is to manually erase them at the end of the master's code. Users must take into account that this only applies to files declared as task parameters and not to files created and/or erased inside a task.
- User defined classes in Python
User defined classes in Python must not be declared in the same file that contains the main method (if __name__=='__main__') to avoid serialization problems of the objects.
- Python object hierarchy dependency detection
Dependencies are detected only on the objects that are task parameters or outputs. Consider the following code:
# a.py
class A:
    def __init__(self, b):
        self.b = b

# main.py
from a import A
from pycompss.api.task import task
from pycompss.api.parameter import *
from pycompss.api.api import compss_wait_on

@task(obj = IN, returns = int)
def get_b(obj):
    return obj.b

@task(obj = INOUT)
def inc(obj):
    obj += [1]

def main():
    my_a = A([5])
    inc(my_a.b)
    obj = get_b(my_a)
    obj = compss_wait_on(obj)
    print(obj)

if __name__ == '__main__':
    main()
Note that there should exist a dependency between A and A.b. However, PyCOMPSs is not capable of detecting dependencies of that kind. These dependencies must be handled (and avoided) manually.
- Python modules with global states
Some modules (for example logging) have internal variables apart from functions. These modules are not guaranteed to work in PyCOMPSs due to the fact that master and worker code are executed in different interpreters. For instance, if a logging configuration is set on some worker, it will not be visible from the master interpreter instance.
- Python global variables
This issue is very similar to the previous one. PyCOMPSs does not guarantee that applications that create or modify global variables while worker code is executed will work. In particular, this issue (and the previous one) is due to Python's Global Interpreter Lock (GIL).
- Python application directory as a module
If the Python application root folder is a Python module (i.e., it contains an __init__.py file) then runcompss must be called from the parent folder. For example, if the Python application is in a folder with an __init__.py file named my_folder then PyCOMPSs will resolve all functions, classes and variables as my_folder.object_name instead of object_name. For example, consider the following file tree:
my_apps/
└── kmeans/
    ├── __init__.py
    └── kmeans.py
Then the correct command to call this app is runcompss kmeans/kmeans.py from the my_apps directory.
- Python early program exit
All intentional, premature exit operations must be done with sys.exit. PyCOMPSs needs to perform some cleanup tasks before exiting and, if an early exit is performed with sys.exit, the event will be captured, allowing PyCOMPSs to perform these tasks. If the exit operation is done in a different way then there is no guarantee that the application will end properly.
- Python with numpy and MKL
Tasks that invoke numpy and MKL may experience issues if tasks use a different number of MKL threads. This is due to the fact that MKL reuses threads along different calls and it does not change the number of threads from one call to another.
With Services
- Services types
The current COMPSs version only supports SOAP based services that implement the WS interoperability standard. REST services are not supported.