Data processing often revolves around massive computation over large sets of files.
Model inputs come in many shapes; this is why OpenMOLE features some file exploration functions to manipulate your datasets as easily as possible.
OpenMOLE introduces the concept of Domains, which let a variable range over a set of files.
For instance, to run a program over the set of files in a subdirectory, you may use:
val f = Val[File]
val explo = ExplorationTask (f in (workDirectory / "dir"))
To explore files located in several directories:
val i = Val[Int]
val f = Val[File]
val explo =
  ExplorationTask (
    (i in (0 to 10)) x
    (f in (workDirectory / "dir").files("subdir${i}", recursive = true).filter(f => f.isDirectory && f.getName.startsWith("exp")))
  )
The filter modifier filters the initial file sampling according to a predicate.
You can filter using any function taking a File and producing a Boolean (see the corresponding javadoc or create your own).
Some predicate functions available out of the box are startsWith(), contains() and endsWith().
val f = Val[File]
val explo =
ExplorationTask ( (f in (workDirectory / "dir") filter(_.getName.endsWith(".nii.gz")) ) )
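Since a predicate is just a function from File to Boolean, it can be written and tested in plain Scala before being used in a workflow. The sketch below runs outside of OpenMOLE; the names FilePredicates, hasExtension, both and isExperimentImage are hypothetical helpers introduced for illustration and are not part of the OpenMOLE API.

```scala
import java.io.File

object FilePredicates {
  // A predicate is any function from File to Boolean.
  type FilePredicate = File => Boolean

  // Hypothetical helper: true when the file name ends with one of the given extensions.
  def hasExtension(extensions: String*): FilePredicate =
    file => extensions.exists(ext => file.getName.endsWith(ext))

  // Hypothetical helper: combines two predicates with a logical AND.
  def both(p: FilePredicate, q: FilePredicate): FilePredicate =
    file => p(file) && q(file)

  // Keeps files named "exp..." that carry a NIfTI extension.
  val isExperimentImage: FilePredicate =
    both(_.getName.startsWith("exp"), hasExtension(".nii", ".nii.gz"))
}
```

Assuming filter accepts any File => Boolean as described above, such a function could then be passed to it directly, for example filter(FilePredicates.isExperimentImage).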
Searching deep file trees can be very time consuming, and it is unnecessary when you know how your data is organised.
By default the file selector only explores the direct level under the directory you've passed as a parameter.
If you want it to explore the whole file tree, you can set the option recursive to true, as in files(recursive = true).
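To get a feel for why the recursive option is off by default, the difference between a one-level listing and a full tree walk can be reproduced with the standard Java NIO API. This is a plain Scala sketch (Scala 2.13 or later), not OpenMOLE's implementation; TreeListing, directChildren and wholeTree are names made up for this example.

```scala
import java.nio.file.{Files, Path}
import scala.jdk.CollectionConverters._

object TreeListing {
  // Lists only the entries directly under dir (one level deep).
  def directChildren(dir: Path): List[Path] = {
    val stream = Files.list(dir)
    try stream.iterator().asScala.toList
    finally stream.close()
  }

  // Walks the whole tree below dir, excluding dir itself.
  // The cost grows with the total number of entries in the tree.
  def wholeTree(dir: Path): List[Path] = {
    val stream = Files.walk(dir)
    try stream.iterator().asScala.filter(_ != dir).toList
    finally stream.close()
  }
}
```

On a directory holding one file and one subdirectory that itself contains a file, directChildren returns two entries while wholeTree returns three; on a large dataset the gap can be much wider.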
If you wish to select one single file for each value of i, you may use the select operation:
val i = Val[Int]
val f = Val[File]
val explo =
  ExplorationTask (
    (i in (0 to 10)) x
    (f in File("/path/to/a/dir").select("file${i}.txt"))
  )
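The pairing produced by this kind of selection can be pictured in plain Scala as one (i, file) couple per value of i. The sketch below only mirrors the name expansion and does not reflect how OpenMOLE implements select; the s interpolator stands in for the ${i} expansion of the workflow script, and SelectSketch and selected are hypothetical names.

```scala
import java.io.File

object SelectSketch {
  // For each i in the range, build the single file "file<i>.txt" inside dir.
  def selected(dir: File, range: Range): Seq[(Int, File)] =
    range.map(i => i -> new File(dir, s"file$i.txt"))
}
```

With the range 0 to 10 this yields eleven couples, from (0, file0.txt) to (10, file10.txt), instead of the full cross product of every i with every file in the directory.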
As its name suggests, the files selector manipulates File instances and directly injects them in the dataflow.
If you plan to delegate your workflow to a local cluster environment equipped with a file system shared across all nodes, you don't need data to be automatically copied by OpenMOLE.
In this case, you might prefer the paths selector instead.
paths works exactly like files and accepts the very same options.
The only difference between the two selectors is that paths injects Path variables in the dataflow.
A Path describes a file's location but not its content.
The explored files won't be automatically copied by OpenMOLE when using Path, so this does not fit a grid environment, for instance.
import java.nio.file.Path
val dataDir = "/vol/vipdata/data/HCP100"
val subjectPath = Val[Path]
val subjectID = Val[String]
val exploIDsTask = ExplorationTask ( subjectPath in File(dataDir).paths(filter=".*\\.nii.gz") withName subjectID)
( exploIDsTask hook ToStringHook() ) -- EmptyTask()
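The statement that a Path carries a location but no content can be checked with the standard java.nio.file API alone. The following plain Scala sketch is independent of OpenMOLE; PathSketch, contentIfReachable and the file name subject.nii.gz are hypothetical and only serve the illustration.

```scala
import java.nio.file.{Files, Path, Paths}

object PathSketch {
  // Building a Path never touches the file system: it is only a location.
  // "subject.nii.gz" is a made-up file name.
  val location: Path = Paths.get("/vol/vipdata/data/HCP100", "subject.nii.gz")

  // Reading the content is a separate step, and it only succeeds
  // if the file is reachable from the machine running the code.
  def contentIfReachable(p: Path): Option[Array[Byte]] =
    if (Files.isRegularFile(p)) Some(Files.readAllBytes(p)) else None
}
```

On a node that does not mount the shared file system, contentIfReachable returns None even though the Path value itself is perfectly valid, which is why Path variables suit shared-storage clusters rather than grids.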
More details on the difference between manipulating Files and Paths can be found in the dedicated entry of the FAQ.
You can also learn more about the OpenMOLE dataflow in the dedicated section.
Full examples using OpenMOLE's capabilities to process a dataset can be found in the marketplace.
Files can also be injected in the dataflow through Sources.
They provide more powerful file filtering possibilities using regular expressions and can also target directories only.