Jonathan Passerat-Palmbach, Romain Reuillon, Mathieu Leclaire, Antonios Makropoulos, Emma C. Robinson, Sarah Parisot and Daniel Rueckert, Reproducible Large-Scale Neuroimaging Studies with the OpenMOLE Workflow Management System, published in Frontiers in Neuroinformatics Vol 11, 2017.
The CARETask runs external applications packaged with CARE. The CARE website currently offers an outdated version of CARE, but excellent documentation. CARE makes it possible to package your application from any Linux computer, and then re-execute it on any other Linux computer. The CARE / OpenMOLE pair is a very efficient way to distribute your application at very large scale with very little effort. Please note that this packaging step is only necessary if you plan to distribute your workflow to a heterogeneous computing environment such as the EGI grid. If you target local clusters running the same operating system and sharing a network file system, you can jump directly to the SystemExecTask.
You should first install CARE: download the care binary, make it executable (chmod +x care), and add the folder containing it to your PATH (export PATH=/path/to/the/care/folder:$PATH).
The CARETask was designed to embed native applications such as programs written in C, C++, Fortran, Python, R, Scilab... Embedding an application in a CARETask happens in 2 steps:
First, package your application using the CARE binary you just installed, so that it executes on any Linux environment. This usually consists in prepending care and its options to your command line:
    care -o /path/to/myarchive.tgz.bin -r ~ -p /path/to/mydata1 -p /path/to/mydata2 mycommand myparam1 myparam2
Before going any further, here are a few notes about the options accepted by CARE:
-o indicates where to store the archive. At the moment, OpenMOLE prefers to work with archives stored in .tgz.bin, so please don't toy with the extension ;-)
-r ~ is not compulsory, but it has proved mandatory in some cases. As a rule of thumb, if you encounter problems when packaging your application, try adding or removing it.
-p /path asks CARE not to archive /path. This is particularly useful for input data that will change with your parameters. You probably do not want to embed this data in the archive, and we'll see further down how to inject the necessary input data in the archive from OpenMOLE.
Second, just provide the resulting package along with some other information to OpenMOLE. Et voilà! If you encounter any problem packaging your application, please refer to the corresponding entry in the FAQ.
One very important aspect of CARE is that you only need to package your application once. As long as the execution you use to package your application makes use of all its dependencies (libraries, packages, ...), you should not have any problem re-executing this archive with other parameters.
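As a rough illustration of this point, the sketch below reuses a single archive in two tasks that differ only in their parameters. It assumes the archive produced by the packaging command above is available to OpenMOLE under the name myarchive.tgz.bin; otherparam1 and otherparam2 are hypothetical placeholders, not values from this guide.

```scala
// Assumption: "myarchive.tgz.bin" was produced once by the care command shown above.
val firstRun = CARETask("myarchive.tgz.bin", "mycommand myparam1 myparam2")

// Same archive, different (hypothetical) parameters: no need to package again,
// provided the packaged execution exercised all the dependencies.
val secondRun = CARETask("myarchive.tgz.bin", "mycommand otherparam1 otherparam2")
```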
Setting errorOnReturnValue to false will prevent OpenMOLE from re-scheduling a CARETask that has reported a return code different from 0. You can also get the return code in a variable using the returnValue setting.
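The two return code settings can be combined, as in the minimal sketch below. It only uses options named in this section; the archive name, the command and the variable name code are placeholders chosen for illustration.

```scala
// Variable that will receive the return code of the packaged command
val code = Val[Int]

val tolerantTask =
  CARETask("hello.tgz.bin", "python hello.py") set (
    // A non-zero return code no longer marks the task as failed
    errorOnReturnValue := false,
    // The return code is stored in the dataflow instead
    returnValue := code
  )
```

With this setup, a downstream task can inspect code and decide how to handle a failed execution.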
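The standard and error outputs of a CARETask can be assigned to OpenMOLE variables, and thus injected in the dataflow, using respectively the stdOut and stdErr actions on the task.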
Variables from the dataflow can be injected as environment variables using the environmentVariable += (variable, "variableName") field. If no name is specified, the environment variable is named after the OpenMOLE variable.
Environment variables injected from the dataflow are added to the pre-existing set of environment variables from the execution host. This proves particularly useful to preserve the behaviour of some toolkits when executed on local environments (ssh, clusters, ...) where users control their work environment.
The following snippet creates a task that employs the features described in this section:
// Declare the variables
val output = Val[String]
val error = Val[String]
val value = Val[Int]

// Run the packaged Python script and capture its outputs and return code
val pythonTask =
  CARETask("hello.tgz.bin", "python hello.py") set (
    stdOut := output,
    stdErr := error,
    returnValue := value,
    environmentVariable += (value, "I_AM_AN_ENV_VAR")
  )
You will note that options holding a single value are set using the := operator. Also, the OpenMOLE variables containing the standard and error outputs are automatically marked as outputs of the task, and must not be added to the outputs list.
Additional options can be set on a CARETask using the set operator on a freshly defined task.
val out = Val[Int]
val careTask = CARETask("care.tgz.bin", "executable arg1 arg2 /path/to/my/file /virtual/path arg4") set (
  hostFiles += ("/path/to/my/file"),
  customWorkDirectory := "/tmp",
  returnValue := out
)
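The available options are described hereafter:
hostFiles: binds a file from the execution host into the CARE filesystem. Example: hostFiles += ("/etc/hosts") or with a specific path hostFiles += ("/etc/bash.bashrc", "/home/foo/.bashrc"). Multiple hostFiles entries can be used within the same set block.
environmentVariables: sets the value of an environment variable. Example: environmentVariables += ("VARIABLE1", "42")
workDirectory: sets the directory from which the command is executed. Example: workDirectory := "/tmp"
returnValue: stores the return code in a Val[Int] variable. Example: returnValue := out
errorOnReturnValue: when set to false, a return code different from 0 is not treated as a failure. Example: errorOnReturnValue := false
stdOut: stores the standard output in a Val[String] variable. Example: stdOut := output
stdErr: stores the error output in a Val[String] variable. Example: stdErr := error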
Files from the execution host can be made available to a CARETask through the hostFiles option. This option takes the path of a file on the execution host and binds it to the same path in the CARE filesystem. Optionally, you can provide a second argument to specify the path explicitly. For instance:
val careTask = CARETask("care.tgz.bin", "executable arg1 arg2 /path/to/my/file /virtual/path arg4") set (
  hostFiles += ("/path/to/my/file"),
  hostFiles += ("/path/to/another/file", "/virtual/path")
)
This CARETask will thus have access to /path/to/my/file and /virtual/path.
The SystemExecTask is made to launch native commands on the execution host. There are two modes for using this task:
Calling a command that is assumed to be already available on the execution node.
Copying a local application or script to the execution node using the resources field. Please note that, contrary to the CARETask, there is no guarantee that an application passed as a resource to a SystemExecTask will re-execute successfully on a remote environment.
The SystemExecTask accepts an arbitrary number of commands. These commands will be executed sequentially on the same execution node where the task is instantiated. In other words, it is not possible to split the execution of multiple commands grouped in the same SystemExecTask.
The following example first copies and runs a bash script on the remote host, before calling the remote host's /bin/hostname. The standard and error outputs of both commands are gathered and concatenated into a single OpenMOLE variable each, respectively through stdOut and stdErr:
// Declare the variables
val output = Val[String]
val error = Val[String]

// Copy script.sh to the execution node, run it, then call hostname
val scriptTask =
  SystemExecTask("bash script.sh", "hostname") set (
    resources += workDirectory / "script.sh",
    stdOut := output,
    stdErr := error
  )

scriptTask hook ToStringHook()
In this case the bash script might depend on applications installed on the remote host. Similarly, we assume the presence of /bin/hostname on the execution node. Therefore this task cannot be considered portable.
Note that each execution is isolated in a separate folder on the execution host and that the task execution is considered as failed if the script returns a value different from 0. If you need another behaviour you can use the same advanced options as the CARETask regarding the return code.
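As a hedged sketch of that last remark, the snippet below applies the return code options to a SystemExecTask. It assumes, as the note above suggests, that errorOnReturnValue and returnValue accept the same syntax as on a CARETask; the variable name exitCode is a placeholder.

```scala
// Variable that will receive the return code of the commands
val exitCode = Val[Int]

val lenientScriptTask =
  SystemExecTask("bash script.sh") set (
    // Copy the local script to the execution node, as in the example above
    resources += workDirectory / "script.sh",
    // Assumption: same return code options as the CARETask
    errorOnReturnValue := false,
    returnValue := exitCode
  )

// Display the captured return code
lenientScriptTask hook ToStringHook()
```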