Scale up on Cluster


Content:

1 - Using clusters with OpenMOLE
2 - PBS
3 - SGE
4 - Slurm
5 - Condor
6 - OAR


Using clusters with OpenMOLE 🔗

Batch systems 🔗

Many distributed computing environments offer batch processing capabilities. OpenMOLE supports most common batch systems.
Batch systems generally work by exposing an entry point on which the user can log in and submit jobs. OpenMOLE accesses this entry point using SSH. Different environments can be assigned to delegate the workload resulting from different tasks or groups of tasks. However, not all clusters expose the same features, so the available options may vary from one environment to another.

Before being able to use a batch system, you should first provide your authentication information to OpenMOLE.
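
For reference, authentication can also be declared from an OpenMOLE script. The sketch below is an assumption about the exact form of the SSHAuthentication API (the PrivateKey constructor and its argument order may differ in your OpenMOLE version); the graphical interface or the authentication documentation remains the authoritative way to set this up.

// Hypothetical sketch: register an SSH private-key authentication for the cluster head node.
// The PrivateKey constructor and its argument order are assumptions to check against your version.
SSHAuthentication +=
  PrivateKey(
    "/home/user/.ssh/id_rsa", // path to the private key
    "login",                  // user name on the cluster head node
    "password",               // passphrase protecting the key
    "machine.domain"          // address of the cluster head node
  )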

Grouping 🔗

Also note that a batch environment is generally not suited to short tasks, i.e. shorter than about one minute on a cluster. If your tasks are short, you can group several executions into each job with the keyword by in your workflow. For instance, the workflow below groups the executions of model by 100 in each job submitted to the environment:

// Define the variables that are transmitted between the tasks
val i = Val[Double]
val res = Val[Double]

// Define the model, here it is a simple task executing "res = i * 2", but it can be your model
val model =
  ScalaTask("val res = i * 2") set (
    inputs += i,
    outputs += (i, res)
  )

// Define a local environment
val env = LocalEnvironment(10)

// Make the model run on the local environment
DirectSampling(
  evaluation = model on env by 100 hook display,
  sampling = i in (0.0 to 1000.0 by 1.0)
)
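
The same grouping keyword works unchanged with the cluster environments described below. As a sketch, assuming a Slurm cluster reachable at machine.domain (the login and host are placeholders), the evaluation would simply target that environment instead:

// Sketch: delegate the grouped jobs to a Slurm cluster instead of the local environment
// ("login" and "machine.domain" are placeholders for your own credentials and head node)
val cluster = SLURMEnvironment("login", "machine.domain")

DirectSampling(
  evaluation = model on cluster by 100 hook display,
  sampling = i in (0.0 to 1000.0 by 1.0)
)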

PBS 🔗

PBS is a venerable batch system for clusters. It is also referred to as Torque. You may use a PBS computing environment as follows:

val env =
  PBSEnvironment(
    "login",
    "machine.domain"
  )

You can also set options by providing additional parameters to the environment (..., option = value, ...); a combined example follows this list:
  • port: the port number used by the ssh server, 22 by default,
  • sharedDirectory: the directory OpenMOLE uses to communicate between the head node of the cluster and the worker nodes (defaults to sharedDirectory = "/home/user/.openmole/.tmp/ssh"),
  • storageSharedLocally: when set to true, OpenMOLE uses symbolic links instead of physically copying files to the remote environment. This assumes that the OpenMOLE instance has access to the same storage space as the remote environment (think same NFS filesystem on desktop machine and cluster). Defaults to false and shouldn't be used unless you're 100% sure of what you're doing!,
  • workDirectory: the directory in which OpenMOLE will execute on the remote server, for instance workDirectory = "${TMP}",
  • queue: the name of the queue on which jobs will be submitted, for instance queue = "longjobs",
  • wallTime: the maximum time a job is permitted to run before being killed, for instance wallTime = 1 hour,
  • memory: the memory for the job, for instance memory = 2 gigabytes,
  • openMOLEMemory: the memory attributed to the OpenMOLE runtime on the execution node. If you run external tasks, you can reduce the memory for the OpenMOLE runtime to 256MB in order to leave more memory for your program on the execution node, for instance openMOLEMemory = 256 megabytes,
  • nodes: the number of nodes requested,
  • threads: the number of threads for concurrent execution of tasks on the worker node, for instance threads = 4,
  • coreByNodes: an alternative to specifying the number of threads; coreByNodes takes the value of threads when not specified, or 1 if neither is specified,
  • flavour: the variant of PBS installed on your cluster. You can choose between Torque (for the open-source PBS/Torque) or PBSPro. Defaults to flavour = Torque,
  • localSubmission: set to true if you are running OpenMOLE from a node of the cluster (useful, for example, if you can only ssh to the cluster from behind a VPN and cannot set up the VPN where your OpenMOLE is running); user and host are not mandatory in this case.
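
As a sketch combining several of the options above (the queue name and resource values are placeholders to adapt to your cluster):

val env =
  PBSEnvironment(
    "login",
    "machine.domain",
    // placeholder values, adapt to your cluster
    queue = "longjobs",
    wallTime = 1 hour,
    memory = 2 gigabytes,
    openMOLEMemory = 256 megabytes,
    flavour = Torque
  )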

SGE 🔗

To delegate some computation load to an SGE-based cluster, you can use the SGEEnvironment as follows:

val env =
  SGEEnvironment(
    "login",
    "machine.domain"
  )

You can also set options by providing additional parameters to the environment (..., option = value, ...); a combined example follows this list:
  • port: the port number used by the ssh server, 22 by default,
  • sharedDirectory: the directory OpenMOLE uses to communicate between the head node of the cluster and the worker nodes (defaults to sharedDirectory = "/home/user/.openmole/.tmp/ssh"),
  • storageSharedLocally: when set to true, OpenMOLE uses symbolic links instead of physically copying files to the remote environment. This assumes that the OpenMOLE instance has access to the same storage space as the remote environment (think same NFS filesystem on desktop machine and cluster). Defaults to false and shouldn't be used unless you're 100% sure of what you're doing!,
  • workDirectory: the directory in which OpenMOLE will execute on the remote server, for instance workDirectory = "${TMP}",
  • queue: the name of the queue on which jobs will be submitted, for instance queue = "longjobs",
  • wallTime: the maximum time a job is permitted to run before being killed, for instance wallTime = 1 hour,
  • memory: the memory for the job, for instance memory = 2 gigabytes,
  • openMOLEMemory: the memory attributed to the OpenMOLE runtime on the execution node. If you run external tasks, you can reduce the memory for the OpenMOLE runtime to 256MB in order to leave more memory for your program on the execution node, for instance openMOLEMemory = 256 megabytes,
  • threads: the number of threads for concurrent execution of tasks on the worker node, for instance threads = 4,
  • localSubmission: set to true if you are running OpenMOLE from a node of the cluster (useful, for example, if you can only ssh to the cluster from behind a VPN and cannot set up the VPN where your OpenMOLE is running); user and host are not mandatory in this case.
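
As a sketch combining several of the options above (the queue name and resource values are placeholders to adapt to your cluster):

val env =
  SGEEnvironment(
    "login",
    "machine.domain",
    // placeholder values, adapt to your cluster
    queue = "longjobs",
    wallTime = 1 hour,
    memory = 2 gigabytes
  )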

Slurm 🔗

To delegate the workload to a Slurm-based cluster, you can use the SLURMEnvironment as follows:

val env =
  SLURMEnvironment(
    "login",
    "machine.domain",
    // optional parameters
    partition = "short-jobs",
    time = 1 hour
  )

You can also set options by providing additional parameters to the environment (..., option = value, ...); a combined example follows this list:
  • port: the port number used by the ssh server, 22 by default,
  • sharedDirectory: the directory OpenMOLE uses to communicate between the head node of the cluster and the worker nodes (defaults to sharedDirectory = "/home/user/.openmole/.tmp/ssh"),
  • storageSharedLocally: when set to true, OpenMOLE uses symbolic links instead of physically copying files to the remote environment. This assumes that the OpenMOLE instance has access to the same storage space as the remote environment (think same NFS filesystem on desktop machine and cluster). Defaults to false and shouldn't be used unless you're 100% sure of what you're doing!,
  • workDirectory: the directory in which OpenMOLE will execute on the remote server, for instance workDirectory = "${TMP}",
  • partition: the name of the partition (queue) on which jobs will be submitted, for instance partition = "longjobs",
  • time: the maximum time a job is permitted to run before being killed, for instance time = 1 hour,
  • memory: the memory for the job, for instance memory = 2 gigabytes,
  • openMOLEMemory: the memory attributed to the OpenMOLE runtime on the execution node. If you run external tasks, you can reduce the memory for the OpenMOLE runtime to 256MB in order to leave more memory for your program on the execution node, for instance openMOLEMemory = 256 megabytes,
  • nodes: the number of nodes requested,
  • threads: the number of threads for concurrent execution of tasks on the worker node, for instance threads = 4; it automatically sets the cpuPerTask entry,
  • cpuPerTask: an alternative to specifying the number of threads; cpuPerTask takes the value of threads when not specified, or 1 if neither is specified,
  • reservation: the name of a SLURM reservation,
  • qos: the Quality of Service (QOS) as defined in the Slurm database,
  • gres: a list of Generic Resources (GRES) requested. A Gres is a pair defined by the name of the resource and the number of resources requested (scalar), for instance gres = List(Gres("resource", 1)),
  • constraints: a list of SLURM-defined constraints which selected nodes must match,
  • localSubmission: set to true if you are running OpenMOLE from a node of the cluster (useful, for example, if you can only ssh to the cluster from behind a VPN and cannot set up the VPN where your OpenMOLE is running); user and host are not mandatory in this case.
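
As a sketch combining several of the options above (the partition, QOS, GRES and constraint names are placeholders to adapt to your cluster):

val env =
  SLURMEnvironment(
    "login",
    "machine.domain",
    // placeholder values, adapt to your cluster configuration
    partition = "longjobs",
    time = 1 hour,
    memory = 2 gigabytes,
    qos = "normal",
    gres = List(Gres("gpu", 1)),
    constraints = List("haswell")
  )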

Condor 🔗

Condor clusters can be leveraged using the following syntax:

val env =
  CondorEnvironment(
    "login",
    "machine.domain"
  )

You can also set options by providing additional parameters to the environment (..., option = value, ...); a combined example follows this list:
  • port: the port number used by the ssh server, 22 by default,
  • sharedDirectory: the directory OpenMOLE uses to communicate between the head node of the cluster and the worker nodes (defaults to sharedDirectory = "/home/user/.openmole/.tmp/ssh"),
  • storageSharedLocally: when set to true, OpenMOLE uses symbolic links instead of physically copying files to the remote environment. This assumes that the OpenMOLE instance has access to the same storage space as the remote environment (think same NFS filesystem on desktop machine and cluster). Defaults to false and shouldn't be used unless you're 100% sure of what you're doing!,
  • workDirectory: the directory in which OpenMOLE will execute on the remote server, for instance workDirectory = "${TMP}",
  • memory: the memory for the job, for instance memory = 2 gigabytes,
  • openMOLEMemory: the memory attributed to the OpenMOLE runtime on the execution node. If you run external tasks, you can reduce the memory for the OpenMOLE runtime to 256MB in order to leave more memory for your program on the execution node, for instance openMOLEMemory = 256 megabytes,
  • threads: the number of threads for concurrent execution of tasks on the worker node, for instance threads = 4,
  • localSubmission: set to true if you are running OpenMOLE from a node of the cluster (useful, for example, if you can only ssh to the cluster from behind a VPN and cannot set up the VPN where your OpenMOLE is running); user and host are not mandatory in this case.
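
As a sketch combining several of the options above (the memory and thread values are placeholders to adapt to your needs):

val env =
  CondorEnvironment(
    "login",
    "machine.domain",
    // placeholder values, adapt to your cluster
    memory = 2 gigabytes,
    openMOLEMemory = 256 megabytes,
    threads = 4
  )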

OAR 🔗

Similarly, OAR clusters are reached as follows:

val env =
  OAREnvironment(
    "login",
    "machine.domain"
  )

You can also set options by providing additional parameters to the environment (..., option = value, ...); a combined example follows this list:
  • port: the port number used by the ssh server, 22 by default,
  • sharedDirectory: the directory OpenMOLE uses to communicate between the head node of the cluster and the worker nodes (defaults to sharedDirectory = "/home/user/.openmole/.tmp/ssh"),
  • storageSharedLocally: when set to true, OpenMOLE uses symbolic links instead of physically copying files to the remote environment. This assumes that the OpenMOLE instance has access to the same storage space as the remote environment (think same NFS filesystem on desktop machine and cluster). Defaults to false and shouldn't be used unless you're 100% sure of what you're doing!,
  • workDirectory: the directory in which OpenMOLE will execute on the remote server, for instance workDirectory = "${TMP}",
  • queue: the name of the queue on which jobs will be submitted, for instance queue = "longjobs",
  • wallTime: the maximum time a job is permitted to run before being killed, for instance wallTime = 1 hour,
  • openMOLEMemory: the memory attributed to the OpenMOLE runtime on the execution node. If you run external tasks, you can reduce the memory for the OpenMOLE runtime to 256MB in order to leave more memory for your program on the execution node, for instance openMOLEMemory = 256 megabytes,
  • threads: the number of threads for concurrent execution of tasks on the worker node, for instance threads = 4,
  • core: the number of cores allocated for each job,
  • cpu: the number of CPUs allocated for each job,
  • bestEffort: a boolean enabling the best effort mode (true by default),
  • localSubmission: set to true if you are running OpenMOLE from a node of the cluster (useful, for example, if you can only ssh to the cluster from behind a VPN and cannot set up the VPN where your OpenMOLE is running); user and host are not mandatory in this case.
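
As a sketch combining several of the options above (the queue name and resource values are placeholders to adapt to your cluster):

val env =
  OAREnvironment(
    "login",
    "machine.domain",
    // placeholder values, adapt to your cluster
    queue = "longjobs",
    wallTime = 1 hour,
    core = 4,
    bestEffort = true
  )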