Explore: Direct Sampling

Design of Experiments (DoE) is the art of setting up an experiment. In a model simulation context, it boils down to declaring the inputs under study (most of the time, parameters) and the values they will take for a batch of several simulations, with the aim of revealing a property of the model (e.g. sensitivity). Although several state-of-the-art DoE methods are implemented in OpenMOLE, we recommend focusing on OpenMOLE's newer methods: PSE, Calibration and Profiles, which have been designed to address the drawbacks of the classical methods.
Your model inputs can be sampled in the traditional way, by using grid (or regular) sampling, or by sampling uniformly inside their domain.
For higher-dimensional input spaces, specific statistical techniques ensuring low discrepancy, such as Latin Hypercube Sampling and the Sobol sequence, are available.
If you want to use a design of experiments of your own, you can also provide OpenMOLE with a CSV file containing your samples.
By defining your own exploration task on several types of input, you will be able to highlight some of your model's inner properties, such as those revealed by sensitivity analysis, as shown in a toy example or in a real-world example.

Grid Sampling

For a reasonable number of dimensions and discretisation quanta (steps), complete sampling (or grid sampling) consists of producing every combination of the inputs' possible values, given their bounds and discretisation quanta.

[Figure: method scores along four axes: Output Exploration, Input Exploration, Sensitivity, Optimisation]

Method scores
Regular (grid) sampling and uniform sampling are good choices for a first exploration of the input space when you do not yet know anything about its structure. Since they sample from the input space, the values collected from the model executions reveal the outputs obtained for "evenly spaced" inputs. It is not perfect, but it gives some insight into model sensitivity (as input values vary within their domain) and, if the outputs are fitness values, some optimisation information (such as the zone in which the fitness could be minimised).
The sampling does not reveal anything about the structure of the output space, as there is no reason that evenly spaced inputs lead to evenly spaced outputs. Grid sampling is hampered by input space dimensionality: high-dimensional spaces need a lot of samples to be covered, as well as a lot of memory to store them.
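To make the dimensionality cost concrete, here is a minimal plain-Scala sketch, independent of OpenMOLE, that counts how many samples a full grid requires. The helper name gridSize and the choice of 11 levels per input are assumptions made for illustration only.

```scala
// Plain Scala sketch, no OpenMOLE dependency.
// gridSize is a hypothetical helper introduced for illustration only.
object GridSizeSketch {
  // A full grid contains one sample per combination of levels,
  // so its size is the product of the number of levels of each input.
  def gridSize(levelsPerInput: Seq[Int]): BigInt =
    levelsPerInput.map(BigInt(_)).product

  // Grid sizes for 2, 5 and 10 inputs, each discretised into 11 levels
  val sizes: Seq[BigInt] = Seq(2, 5, 10).map(d => gridSize(Seq.fill(d)(11)))
}
```

With 11 levels per input, two inputs need 121 model runs, five inputs need 161051, and ten inputs already need more than 25 billion.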


Grid sampling is declared via a DirectSampling task, in which the bounds and discretisation quantum of each input to vary are declared:


val input_i = Val[Int]
val input_j = Val[Double]

DirectSampling(
  evaluation = my_own_evaluation,
  sampling =
    (input_i in (0 to 10 by 2)) x
    (input_j in (0.0 to 5.0 by 0.5)),
  aggregation = my_aggregation
)

with
  • evaluation is the task (or a composition of tasks) that uses your inputs, typically your model task and a hook.
  • sampling is the sampling task
  • aggregation (optional) is an aggregation task to be performed on the outputs of your evaluation task
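As a rough illustration of what the cartesian product in the sampling above enumerates, the following plain-Scala sketch builds the same grid by hand. It does not use the OpenMOLE DSL, and it assumes that both bounds of each range are included; the object and value names are hypothetical.

```scala
// Plain Scala sketch, no OpenMOLE dependency: enumerate a full grid by hand.
object GridSketch {
  // 0, 2, 4, 6, 8, 10 (6 values)
  val iValues: Seq[Int] = 0 to 10 by 2

  // 0.0, 0.5, ..., 5.0 (11 values), derived from integers
  // so that no floating-point drift accumulates
  val jValues: Seq[Double] = (0 to 10).map(_ * 0.5)

  // Cartesian product: every (i, j) combination, i varying slowest
  val grid: Seq[(Int, Double)] =
    for { i <- iValues; j <- jValues } yield (i, j)
}
```

Under these assumptions the grid holds 6 x 11 = 66 samples, so the evaluation task would be executed 66 times.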


Let's see it in action within a dummy workflow. Suppose we want to explore a model, written here as a ScalaTask, that takes an integer value as input and generates a Double as output.
The exploration script would look like:
//inputs and outputs declaration
val i = Val[Int]
val o = Val[Double]
val avg = Val[Double]

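//Defines the "model" task: multiplies the input by 2 (2.0 makes the result a Double, matching the type of o)
val myModel =
  ScalaTask("val o = i * 2.0") set (
    inputs += i,
    outputs += (i, o)
  )

//Defines the aggregation task: averages the array of outputs
val average =
  ScalaTask("val avg = o.average") set (
    inputs += o.toArray,
    outputs += avg
  )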

val exploration =
  DirectSampling(
    evaluation = myModel hook ToStringHook(),
    sampling = i in (0 to 10 by 1),
    aggregation = average hook ToStringHook()
  )

exploration
Some details:
  • myModel is the task that multiplies the input by 2
  • the evaluation attribute of the DirectSampling task is the composition of myModel and a hook
  • the aggregation attribute of the DirectSampling task is the average task, a ScalaTask that computes the average of an array of Double values
  • the task declared under the name exploration is a DirectSampling task, which means it will generate parallel executions of myModel, one for each sample generated by the sampling task



DirectSampling generates a workflow that is illustrated below. You may recognize the map-reduce design pattern, provided that an aggregation operator is defined (otherwise it would just be a map :-) ).
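The map-reduce analogy can be sketched in plain Scala as follows. This is not OpenMOLE code: the model and average functions are hypothetical stand-ins for the myModel and average tasks, and the map step here runs sequentially, whereas OpenMOLE may distribute the evaluations.

```scala
// Plain Scala sketch of the map-reduce pattern, no OpenMOLE dependency.
object MapReduceSketch {
  // Stand-in for the model task: doubles its input
  def model(i: Int): Double = i * 2.0

  // Stand-in for the aggregation task: average of the collected outputs
  def average(os: Seq[Double]): Double = os.sum / os.size

  // Same sampling as in the script: 0, 1, ..., 10
  val samples: Seq[Int] = 0 to 10 by 1

  // Map step: one model evaluation per sample
  val outputs: Seq[Double] = samples.map(model)

  // Reduce step: a single aggregated value
  val avg: Double = average(outputs)
}
```

Without the reduce step, the workflow would stop at outputs, which corresponds to a DirectSampling with no aggregation.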



Model replication

In the case of a stochastic model, you may want to define a replication task to run several replications of the model for the same parameter values. This is similar to using a uniform distribution sampling on the seed of the model, and OpenMOLE provides a specific constructor for that, namely Replication. A Replication is used as follows:
val mySeed = Val[Int]
val i = Val[Int]
val o = Val[Double]

val myModel =
  ScalaTask("import scala.util.Random; val rng = new Random(mySeed); val o = i * 2 + 0.1 * rng.nextDouble()") set (
    inputs += (i, mySeed),
    outputs += (i, o)
  )

val replication = Replication(
    evaluation = myModel,
    seed = mySeed,
    replications = 100
)

replication
The arguments for Replication are the following:
  • evaluation is the task (or a composition of tasks) that uses your inputs, typically your model task and a hook.
  • seed is the prototype for the seed, which will be sampled with a uniform distribution in its domain (Val[Int] or Val[Long]).
  • replications (Int) is the number of replications.
  • distributionSeed (optional, Long) is an optional seed to be given to the uniform distribution of the seed ("meta-seed").
  • aggregation (optional) is an aggregation task to be performed on the outputs of your evaluation task.
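To illustrate the idea behind the arguments above, here is a minimal plain-Scala sketch of replication: a meta-seed drives a generator that draws one seed per replication, and the model is run once per seed for the same input. It is only an approximation of the concept, not OpenMOLE's implementation; the names model, replicate and metaSeed are hypothetical.

```scala
import scala.util.Random

// Plain Scala sketch of model replication, no OpenMOLE dependency.
object ReplicationSketch {
  // Same stochastic model as in the script above, written as a function
  def model(i: Int, seed: Int): Double = {
    val rng = new Random(seed)
    i * 2 + 0.1 * rng.nextDouble()
  }

  // Draw one seed per replication from a generator initialised with the
  // meta-seed, then run the model once per seed for the same input i
  def replicate(i: Int, replications: Int, metaSeed: Long): Seq[Double] = {
    val seedRng = new Random(metaSeed)
    val seeds = Seq.fill(replications)(seedRng.nextInt())
    seeds.map(seed => model(i, seed))
  }
}
```

Calling ReplicationSketch.replicate(3, 100, 42L) returns 100 outputs that differ only through the seed, and fixing the meta-seed makes the whole batch reproducible.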



Exploration of several inputs

Sampling can be performed on several input domains as well as on several input types, using the cartesian product operator x, introduced in the section dedicated to grid sampling. Here is an example, still supposing you have already defined a task used for evaluation called myModel:
val i = Val[Int]
val j = Val[Double]
val k = Val[String]
val l = Val[Long]
val m = Val[File]

val exploration =
  DirectSampling(
    evaluation = myModel,
    sampling =
      (i in (0 to 10 by 2)) x
      (j in (0.0 to 5.0 by 0.5)) x
      (k in List("Leonardo", "Donatello", "Raphaël", "Michelangelo")) x
      (l in (UniformDistribution[Long]() take 10)) x
      (m in (workDirectory / "dir").files().filter(f => f.getName.startsWith("exp") && f.getName.endsWith(".csv")))
  )

DirectSampling generates every combination of the 5 inputs of various types: Int (i), Double (j), String (k), Long (l) and File (m).

The UniformDistribution[Long]() take 10 is a uniform sampling of 10 numbers of the Long type, taken in the [Long.MIN_VALUE; Long.MAX_VALUE] domain of the Long native type.
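For intuition, drawing 10 values over the whole Long range can be approximated in plain Scala as below. This sketch relies on scala.util.Random rather than on OpenMOLE's UniformDistribution, whose internals are not described here; the helper name uniformLongs and the seed argument are assumptions.

```scala
import scala.util.Random

// Plain Scala sketch, no OpenMOLE dependency.
object UniformLongSketch {
  // Hypothetical helper: n pseudo-random values spread over the Long range.
  // The seed makes the draw reproducible.
  def uniformLongs(n: Int, seed: Long): Seq[Long] = {
    val rng = new Random(seed)
    Seq.fill(n)(rng.nextLong())
  }
}
```

Each of the 10 values is then combined with every value of the other inputs by the cartesian product.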

Files are explored as items of a list. The items are gathered by the files() function applied to the dir directory, and optionally filtered with any File => Boolean function, for instance by testing the file name with String methods such as contains(), startsWith() and endsWith() (see the Java String class documentation for more details).
If your input is one file among many, or one line of a CSV file, use the FileSampling task or the CSVSampling task, respectively.
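The file filter used in the sampling above can be tried out on its own with plain Scala and java.io.File. In this sketch the file names are made up for illustration, and the File objects are only built in memory, so nothing needs to exist on disk.

```scala
import java.io.File

// Plain Scala sketch, no OpenMOLE dependency.
object FileFilterSketch {
  // Same predicate as in the sampling: name starts with "exp" and ends with ".csv"
  def isExperimentCsv(f: File): Boolean =
    f.getName.startsWith("exp") && f.getName.endsWith(".csv")

  // Hypothetical directory content, for illustration only
  val candidates: Seq[File] =
    Seq("exp1.csv", "exp2.csv", "notes.txt", "exp3.txt", "data.csv")
      .map(name => new File("dir", name))

  // Names of the files kept by the filter
  val selected: Seq[String] = candidates.filter(isExperimentCsv).map(_.getName)
}
```

Only exp1.csv and exp2.csv satisfy both conditions, so the m input would take two values in this hypothetical directory.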