Embed external applications (C, C++, Python, R, Scilab...)


In OpenMOLE, a generic task named CARETask runs external applications packaged with CARE. The CARE site (which currently offers an outdated version of CARE, but great documentation) can be found here. CARE makes it possible to package your application from any Linux computer, and then re-execute it on any other Linux computer. The CARE / OpenMOLE pair is a very efficient way to distribute your application at very large scale with very little effort. Please note that this packaging step is only necessary if you plan to distribute your workflow to a heterogeneous computing environment such as the EGI grid. If you target local clusters running the same operating system and sharing a network file system, you can jump directly to the SystemExecTask section.

You should first install CARE:
  • download the CARE binary from here
  • make it executable (chmod +x care)
  • add the path to the executable to your PATH variable (export PATH=/path/to/the/care/folder:$PATH)

The CARETask has been designed to embed native applications such as programs written in C, C++, Fortran, Python, R, Scilab... Embedding an application in a CARETask happens in two steps:

First you should package your application using the CARE binary you just installed, so that it executes on any Linux environment. This usually consists in prepending your command line with care -o /path/to/myarchive.tgz.bin -r ~ -p /path/to/mydata1 -p /path/to/mydata2 mycommand myparam1 myparam2. Before going any further, here are a few notes about the options passed to CARE:
  • -o indicates where to store the archive. At the moment, OpenMOLE prefers to work with archives stored in .tgz.bin, so please don't toy with the extension ;)
  • -r ~ is not compulsory, but it has proved necessary in some cases. As a rule of thumb, if you encounter problems when packaging your application, try adding or removing it.
  • -p /path asks CARE not to archive /path. This is particularly useful for input data that will change with your parameters. You probably do not want to embed this data in the archive, and we'll see further down how to inject the necessary input data in the archive from OpenMOLE.
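
The way the CARE options are prepended to an existing command line can be sketched in Python. This is only an illustration: build_care_command is a hypothetical helper (it is not part of CARE or OpenMOLE), and it only uses the -o, -r and -p options described above.

```python
def build_care_command(archive, command, r_paths=(), p_paths=()):
    """Prefix `command` (a list of arguments) with care and its options.

    Hypothetical helper, limited to the -o, -r and -p options listed above.
    """
    care = ["care", "-o", archive]      # -o: where to store the archive
    for path in r_paths:                # -r: e.g. "~"
        care += ["-r", path]
    for path in p_paths:                # -p: paths CARE should not archive
        care += ["-p", path]
    # The original command line is appended unchanged
    return care + list(command)


packaging = build_care_command(
    "/path/to/myarchive.tgz.bin",
    ["mycommand", "myparam1", "myparam2"],
    r_paths=["~"],
    p_paths=["/path/to/mydata1", "/path/to/mydata2"],
)
# The arguments contain no spaces here, so a plain join is enough for display
print(" ".join(packaging))
```

The printed line is the packaging command to type in a shell; the original command and its parameters sit untouched at the end.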

Second, just provide the resulting package along with some other information to OpenMOLE. Et voilà! If you encounter any problem packaging your application, please refer to the corresponding entry in the FAQ.

One very important aspect of CARE is that you only need to package your application once. As long as the execution you use to package your application makes use of all the dependencies, you should not have any problem re-executing this archive with other parameters.

Let's study two concrete use cases that take an existing application, package it with CARE, and embed it in OpenMOLE. You should be able to achieve exactly the same process with almost any executable running on Linux. We've chosen an R code and a Python script.

An example with R


Our first example is an R script contained in a file script.R. We want to distribute the execution of this R code to the EGI grid.

First your script should run in headless mode with no input required from the user during the execution. Your script should produce files or write its results to the standard output so that OpenMOLE can retrieve them from the remote execution environment.

Here is an example R script matching these criteria:
args<-commandArgs(trailingOnly = TRUE)
data<-read.csv("data.csv",header=T,sep=",")
result<-as.numeric(args[1])*data
write.csv(result,"result.csv", row.names=FALSE)

With an example data.csv:
h1,h2,h3
7,8,9
9,7,3
1,1,1

This reads a file called data.csv, multiplies its content by a number provided on the command line and writes the result to an output file called result.csv. To call this script from the command line you should type: R -f script.R --slave --args 4, provided R is installed on your system.
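
To make the expected result concrete, here is a rough Python equivalent of the computation, applied to the example data with the factor 4. It is only meant to show the arithmetic: scale_csv is a hypothetical helper, and the formatting of the values may differ from what R's write.csv produces.

```python
import csv
import io


def scale_csv(text, factor):
    """Multiply every numeric cell of a CSV document by `factor`, keeping the header."""
    rows = list(csv.reader(io.StringIO(text)))
    header, body = rows[0], rows[1:]
    scaled = [[float(cell) * factor for cell in row] for row in body]
    return header, scaled


# Same content as the example data.csv
data = "h1,h2,h3\n7,8,9\n9,7,3\n1,1,1\n"
header, scaled = scale_csv(data, 4)
print(header)
for row in scaled:
    print(row)
```

With the factor 4, the three data rows become 28, 32, 36 / 36, 28, 12 / 4, 4, 4.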

Once the script is up and running, remember that the first step to run it from OpenMOLE is to package it. This is done using CARE on your system.
care -r /home/reuillon/ -o r.tgz.bin R -f script.R --slave --args 4

Notice how the command line is identical to the original one. The call to the R script remains unchanged, as CARE and its options are inserted at the beginning of the command line.

An r.tgz.bin file is created. It is an archive containing a portable version of your execution. It can be extracted and executed on any other Linux platform.

The method described here packages everything, including R itself! Therefore there is no need to install R on the target execution machine. All that is needed is for the remote execution host to run Linux, which is the case for the vast majority of (decent) high performance computing environments.

Packaging an application is done once and for all by running the original application against CARE. CARE's re-execution mechanisms allow you to change the original command line when re-running your application. This way you can update the parameters passed on the command line and the re-execution will be impacted accordingly. As long as all the configuration files, libraries, ... were used during the original execution, there is no need to package the application multiple times with different input parameters.

You can now upload this archive to your OpenMOLE workspace along with a data.csv file in a subfolder named data. Let's now explore a complete combination of all the data files with OpenMOLE. The input data files are located in data and the result files are written to a folder called result. A second input parameter is a numeric value i ranging from 1 to 10. The corresponding OpenMOLE script looks like this:

// Declare the variables
val i = Val[Double]
val input = Val[File]
val inputName = Val[String]
val output = Val[File]

// R task
// r.tgz.bin is the archive produced by CARE on the system on which you packaged R
val rTask = CARETask(workDirectory / "r.tgz.bin", "R --slave -f script.R --args ${i}") set (
  (inputs, outputs) += (i, inputName),
  inputFiles += (input, "data.csv"),
  outputFiles += ("result.csv", output)
)

val exploration =
  ExplorationTask(
    (i in (1.0 to 10.0 by 1.0)) x
    (input in (workDirectory / "data").files withName inputName)
  )

val copy = CopyFileHook(output, workDirectory / "result" / "${inputName}-${i}.csv")
exploration -< (rTask hook copy hook ToStringHook())

The CARETask performs two actions: it first unarchives the CARE container by running r.tgz.bin. Then the actual execution takes place as a second command. Note that for each execution of the CARETask, any command starting with / is relative to the root of the CARE archive, and any other command is executed in the current directory. The current directory defaults to the original packaging directory.

Several notions from OpenMOLE are reused in this example. If you're not too familiar with Hooks or Samplings, check the relevant sections of the documentation.
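
The exploration used in the script above pairs every value of i with every file found in the data folder. The Python sketch below enumerates the same combinations outside of OpenMOLE. enumerate_runs is a hypothetical helper, and the way the output name is formatted (in particular how the Double value of i is rendered) is an assumption that may differ from what OpenMOLE actually writes.

```python
from itertools import product


def enumerate_runs(values, file_names):
    """Pair every value of i with every input file name (complete combination)."""
    return [(i, name, f"{name}-{i}.csv") for i, name in product(values, file_names)]


values = [float(i) for i in range(1, 11)]   # i from 1.0 to 10.0 by 1.0
file_names = ["data.csv"]                    # the data folder holds one file in this example
runs = enumerate_runs(values, file_names)
print(len(runs))
print(runs[0])
```

With a single data file the exploration generates 10 executions of the R task; adding a second file to the data folder would double that number.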

Another example with a Python script


The toy Python script for this test case is:
import sys

# Write the first argument into the file named by the second argument
with open(sys.argv[2], 'w') as f:
    f.write(sys.argv[1])
sys.exit(0)

This script is saved to hello.py. We first package it using CARE: care -o hello.tgz.bin python hello.py 42 test.txt

We can now run it in OpenMOLE using the following script:
// Declare the variables
val arg = Val[Int]
val output = Val[File]

// python task
val pythonTask =
  CARETask(workDirectory / "hello.tgz.bin", "python hello.py ${arg} output.txt") set (
    inputs += arg,
    outputFiles += ("output.txt", output),
    outputs += arg
  )

val exploration = ExplorationTask(arg in (0 to 10))

val copy = CopyFileHook(output, workDirectory / "hello${arg}.txt")
val env = LocalEnvironment(4)
exploration -< (pythonTask hook copy on env by 2)

Again, notions from OpenMOLE are reused in this example. If you're not too familiar with Environments or Groupings, check the relevant sections of the documentation.

Two things should be noted from these examples:
  • The procedure to package an application is always the same regardless of the underlying programming language / framework used.
  • The CARETask differs from the SystemExecTask only by the archive given as its first parameter.
These two aspects make it really easy to embed native applications in OpenMOLE.

Advanced options


The CARETask can be customised to fit the needs of a specific application. For instance, some applications disregarding standards might not return the expected 0 value upon completion. The return value of the application is used by OpenMOLE to determine whether the task has been successfully executed, or needs to be re-executed. Setting the boolean flag errorOnReturnValue to false will prevent OpenMOLE from re-scheduling a CARETask that has reported a return code different from 0. You can also get the return code in a variable using the returnValue setting.
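
The logic behind errorOnReturnValue can be illustrated with a small Python sketch. It is not OpenMOLE code: run is a hypothetical helper that only mimics the idea of treating a non-zero return code either as a failure or as a plain value.

```python
import subprocess
import sys


def run(command, error_on_return_value=True):
    """Run `command` and return its exit code.

    When the flag is True a non-zero exit code is treated as a failure,
    otherwise the code is simply reported to the caller.
    """
    code = subprocess.run(command).returncode
    if error_on_return_value and code != 0:
        raise RuntimeError(f"command failed with return code {code}")
    return code


# A stand-in application that exits with code 3 although it did its job
app = [sys.executable, "-c", "import sys; sys.exit(3)"]
return_value = run(app, error_on_return_value=False)
print(return_value)
```

With the flag left to its default, the same call raises an error instead of returning 3.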

Another default behaviour is to print the standard and error outputs of each task in the OpenMOLE console. Such raw prints might not be suitable when a very large number of tasks is involved or when further processing is to be performed on the outputs. A CARETask's standard and error outputs can be assigned to OpenMOLE variables, and thus injected in the dataflow, by using the stdOut and stdErr actions on the task, respectively.

Like any other process, the applications contained in OpenMOLE's native tasks accept environment variables to influence their behaviour. Variables from the dataflow can be injected as environment variables using the environmentVariable += (variable, "variableName") field. If no name is specified, the environment variable is named after the OpenMOLE variable. Environment variables injected from the dataflow are inserted in the pre-existing set of environment variables from the execution host. This proves particularly useful to preserve the behaviour of some toolkits when executed on local environments (ssh, clusters, ...) where users control their work environment.
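
The merge of injected variables into the host's environment can be sketched in Python as follows. This is an illustration of the principle rather than OpenMOLE's implementation; run_with_variables is a hypothetical helper.

```python
import os
import subprocess
import sys


def run_with_variables(command, variables):
    """Run `command` with `variables` added to the host's environment."""
    env = dict(os.environ)  # start from the pre-existing variables of the host
    env.update({name: str(value) for name, value in variables.items()})
    result = subprocess.run(command, env=env, capture_output=True, text=True)
    return result.stdout


# A stand-in application that prints the variable it received
child = [sys.executable, "-c", "import os; print(os.environ['I_AM_AN_ENV_VAR'])"]
print(run_with_variables(child, {"I_AM_AN_ENV_VAR": 42}))
```

Because the host's variables are kept, settings such as PATH remain available to the application.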

The following snippet creates a task that employs the features described in this section:
// Declare the variables
val output = Val[String]
val error  = Val[String]
val value = Val[Int]

// Any task
val pythonTask =
  CARETask("hello.tgz.bin", "python hello.py") set (
    stdOut := output,
    stdErr := error,
    returnValue := value,
    environmentVariable += (value, "I_AM_AN_ENV_VAR")
  )

You will note that options holding a single value are set using the := operator. Also, the OpenMOLE variables containing the standard and error outputs are automatically marked as outputs of the task, and must not be added to the outputs list.

Using local resources


To access data present on the execution node (outside the CARE filesystem) you should use a dedicated option of the CARETask: hostFiles. This option takes the path of a file on the execution host and binds it to the same path in the CARE filesystem. Optionally you can provide a second argument to specify the path explicitly. For instance:
val careTask = CARETask("care.tgz.bin", "executable arg1 arg2 /path/to/my/file /virtual/path arg4") set (
  hostFiles += ("/path/to/my/file"),
  hostFiles += ("/path/to/another/file", "/virtual/path")
)

This CARE task will thus be able to access /path/to/my/file and /virtual/path.
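
The binding rule can be summarised by a short Python sketch. bind_host_files is a hypothetical helper that only models the mapping between host paths and paths in the CARE filesystem; it does not perform any actual binding.

```python
def bind_host_files(*entries):
    """Map each host path to the path under which it is visible in the CARE filesystem."""
    bindings = {}
    for entry in entries:
        host_path = entry[0]
        # Without a second element the file keeps the same path inside the archive
        bindings[host_path] = entry[1] if len(entry) > 1 else host_path
    return bindings


bindings = bind_host_files(
    ("/path/to/my/file",),
    ("/path/to/another/file", "/virtual/path"),
)
for host_path, care_path in bindings.items():
    print(host_path, "->", care_path)
```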

Using a local executable (in non portable tasks)


The CARETask was designed to be portable from one machine to another. However, some use cases require executing specific commands installed on a given cluster. To achieve that you should use another task called SystemExecTask. This task is made to launch native commands on the execution host. There are two modes for using this task:
  • Calling a command that is assumed to be available on any execution node of the environment. The command will be looked for in the system as it would from a traditional command line: searching in the default PATH or an absolute location.
  • Copying a local script not installed on the remote environment. Applications and scripts can be copied to the task's work directory using the resources field. Please note that contrary to the CARETask, there is no guarantee that an application passed as a resource to a SystemExecTask will re-execute successfully on a remote environment.

The SystemExecTask accepts an arbitrary number of commands. These commands will be executed sequentially on the same execution node where the task is instantiated. In other words, it is not possible to split the execution of multiple commands grouped in the same SystemExecTask.

The following example first copies and runs a bash script on the remote host, before calling the remote host's /bin/hostname. The standard outputs of both commands are gathered and concatenated into a single OpenMOLE variable assigned through stdOut, and their error outputs into another one assigned through stdErr. To achieve that you should use a SystemExecTask:
// Declare the variables
val output = Val[String]
val error  = Val[String]

// Any task
val scriptTask =
  SystemExecTask("bash script.sh", "hostname") set (
    resources += workDirectory / "script.sh",
    stdOut := output,
    stdErr := error
  )

scriptTask hook ToStringHook()

In this case the bash script might depend on programs installed on the remote host. Similarly, we assume the presence of /bin/hostname on the execution node. Therefore this task cannot be considered as portable.

Note that each execution is isolated in a separate folder on the execution host and that the task execution is considered as failed if the script returns a value different from 0. If you need another behaviour you can use the same advanced options as the CARETask regarding the return code.
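
The behaviour described in this section (several commands run one after the other, outputs concatenated, failure on a non-zero return code) can be mimicked by the following Python sketch. It is an illustration under these assumptions, not the SystemExecTask implementation; run_sequentially is a hypothetical helper and the two commands are stand-ins for the bash script and hostname.

```python
import subprocess
import sys


def run_sequentially(commands):
    """Run commands one after the other, concatenating their outputs."""
    out, err = "", ""
    for command in commands:
        result = subprocess.run(command, capture_output=True, text=True)
        out += result.stdout
        err += result.stderr
        # A non-zero return code is treated as a failure of the whole sequence
        if result.returncode != 0:
            raise RuntimeError(f"{command} returned {result.returncode}")
    return out, err


first = [sys.executable, "-c", "print('from the script')"]
second = [sys.executable, "-c",
          "import sys; print('from hostname'); print('warning', file=sys.stderr)"]
output, error = run_sequentially([first, second])
print(output)
print(error)
```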

Troubleshooting


You should always try to re-execute your application outside of OpenMOLE first. This allows you to ensure the packaging process with CARE was successful. If something goes wrong at this stage, you should check the official CARE documentation or the archives of the CARE mailing list.

If the packaged application re-executes as you'd expect, but you still struggle to embed it in OpenMOLE, then get in touch with our user community via the OpenMOLE user mailing list.