The purpose of this vignette is to demonstrate how to use SimDesign to distribute many jobs across independent computing environments, such as high-performance computing (HPC) clusters, in a way that allows the simulation conditions to be reproduced, jobs to be resubmitted when results are incompletely collected within a specified HPC execution time budget, and random number generation across the entire simulation (and subsequent resubmissions, if required) to be properly managed throughout, given the batch nature of the job. As such, the following text and examples are primarily for managing larger simulations, often with thousands of replications over many simulation conditions (i.e., rows in the defined design object), which generally require non-trivial amounts of computing resources to execute (hence, the need for super-computing resources and job schedulers such as Slurm, TORQUE, and MAUI, among others).
For information about Slurm's Job Array support in particular, which this vignette uses as an example, see https://slurm.schedmd.com/job_array.html
To start, the structure of the simulation code used later on to distribute the jobs to the HPC scheduler is effectively the same as the usual generate-analyse-summarise workflow described in runSimulation(), with a few organizational exceptions. As such, this is always a good place to start when designing, testing, and debugging a simulation experiment before submitting to HPC clusters.
IMPORTANT: Only after the vast majority of the bugs and coding logic have been worked out should you consider moving on to the next step involving HPC clusters. If your code is not well vetted at this stage then any later jobs evaluated on the HPC cluster will be a waste of time and resources (garbage-in, garbage-out).
Suppose the following simulation were to be evaluated but, for time constraint reasons, would not be feasible to execute on a single computer (or a smaller network of computers) and therefore should be submitted to an HPC cluster. The script SimDesign_simulation.R below contains a simulation experiment whose instructions are to be submitted to the Slurm scheduler. To do so, the sbatch utility is used along with the set of instructions specifying the type of hardware required in the file slurmInstructions.slurm. On the R side of the simulation, the defined code must grab all available cores (minus 1) that are detectable via parallel::detectCores(), which occurs automatically when using runSimulation(..., parallel=TRUE).
# SimDesign::SimFunctions()
library(SimDesign)
Design <- createDesign(N = c(10, 20, 30))
Generate <- function(condition, fixed_objects = NULL) {
dat <- with(condition, rnorm(N, 10, 5)) # distributed N(10, 5)
dat
}
Analyse <- function(condition, dat, fixed_objects = NULL) {
ret <- c(mean=mean(dat), median=median(dat)) # mean/median of sample data
ret
}
Summarise <- function(condition, results, fixed_objects = NULL){
colMeans(results)
}
# standard setup (not ideal for HPC clusters as parallelization
# occurs within the design conditions, not across)
res <- runSimulation(design=Design, replications=10000, generate=Generate,
analyse=Analyse, summarise=Summarise, parallel=TRUE,
filename='mysim')
In the standard runSimulation(..., parallel=TRUE) setup the 10,000 replications would be distributed to the available computing cores and evaluated independently for each of the three row conditions in the design object. However, this process is executed strictly in sequence: design[1,] is evaluated first and, only after its 10,000 replications are collected, design[2,] is evaluated until complete, and so on.
Moreover, for this approach to be at all optimal the HPC cluster must assign a job containing a very large amount of resources in the form of RAM and CPUs. To demonstrate, in the following slurmInstructions.slurm file a large number of CPUs is requested when building the structure associated with this job, as well as a large amount of RAM.
#!/bin/bash
#SBATCH --job-name="My simulation (multiple CPUs)"
#SBATCH --mail-type=ALL
#SBATCH --mail-user=somewhere@out.there
#SBATCH --output=/dev/null ## (optional) delete .out files
#SBATCH --time=12:00:00 ## HH:MM:SS
#SBATCH --mem=192G ## Build a computer with 192GB of RAM
#SBATCH --cpus-per-task=96 ## Build a computer with 96 cores
module load R/4.3.1
Rscript --vanilla SimDesign_simulation.R
This job requests that a computing cluster be built with 192 GB of RAM and 96 CPUs, within which SimDesign_simulation.R is evaluated, and is submitted to the scheduler via sbatch slurmInstructions.slurm.
While generally effective at distributing the computational load, the above approach has a few limitations, most notably that parallelization occurs only within each design input (i.e., per simulation condition) rather than across conditions, and that a single large block of resources must be granted before the job can begin. To address these and other computational inefficiencies/wasted resources, one can instead switch from the cluster-based approach above to an array distribution approach, discussed in the next section.
For HPC computing it is often more optimal to distribute both replications and conditions simultaneously to unique computing nodes (termed arrays) to effectively break the problem into several mini-batches (e.g., split the simulation into 1000+ independent pieces, and collect the results later). As such, the design object and runSimulation() structure above do not readily lend themselves to optimal distribution for the array scheduler to manage. Nevertheless, the core components are still useful for initial code design, testing, and debugging, and therefore serve as a necessary first step when writing simulation experiment code prior to submitting to an HPC cluster.
After defining and testing your simulation to ensure that it works as expected, it is time to set up the components required for organizing the HPC cluster submission using the runArraySimulation() function.

Moving from a runSimulation() workflow to one for runArraySimulation()
The job of runArraySimulation() is to utilize the relevant information defined in the .sh or .slurm script. This is done by extracting information provided by the scheduler (specifically, via an arrayID), which in turn is used to select specific subsets of the design rows. However, unlike runSimulation(), the function runArraySimulation() has been designed to control important information pertaining to .Random.seed states and other relevant distribution information, allowing the rows in the design object itself to contain repeated experimental condition information. This allows both the design rows and replication information to be optimally distributed to the HPC cluster.
The following example presents the essential modifications required to move from a single runSimulation() workflow to a batch workflow suitable for runArraySimulation() and the Slurm scheduler.
Redefine the design object for each array ID

Suppose that 300 computing cores were independently available on the HPC cluster, though the availability of these cores only trickles in as a function of the scheduler's availability decisions. In this case, the strategy would be to split up the 3 * 10000 = 30000 condition-by-replication experiments across the gradually available resources, where jobs are evaluated in parallel and at different times.
Given the above specifications, you may decide that each of the 300 computing nodes requested from the scheduler should evaluate exactly 100 replications each (nrow(design) * 10000 / 300 = 100). In this case, expandDesign() is used to repeat each row condition 100 times, creating a new expanded design object with 300 rows instead of 3. To accommodate the new rows, the associated replications should now be defined according to how many replications should be used within each of these to-be-distributed conditions, which need not be equal (see below).
Design300 <- expandDesign(Design, repeat_conditions = 100)
Design300
## # A tibble: 300 × 1
## N
## <dbl>
## 1 10
## 2 10
## 3 10
## 4 10
## 5 10
## 6 10
## 7 10
## 8 10
## 9 10
## 10 10
## # ℹ 290 more rows
# target replication number for each condition
rep_target <- 10000
# replications per row in Design300
replications <- rep(rep_target / 100, nrow(Design300))
The above approach assumes that each design condition is equally balanced in terms of computing time and resources. If this is not the case (e.g., the last condition involves notably higher computing times than the first two conditions) then repeat_conditions can be specified as a vector instead, such as repeat_conditions = c(100, 100, 1000), where the latter portion would be associated with 10 replications per distributed node instead of 100.
rc <- c(100, 100, 1000)
DesignUnbalanced <- expandDesign(Design, repeat_conditions = rc)
DesignUnbalanced
## # A tibble: 1,200 × 1
## N
## <dbl>
## 1 10
## 2 10
## 3 10
## 4 10
## 5 10
## 6 10
## 7 10
## 8 10
## 9 10
## 10 10
## # ℹ 1,190 more rows
rep_target <- 10000
replicationsUnbalanced <- rep(rep_target / rc, times = rc)
head(replicationsUnbalanced)
## [1] 100 100 100 100 100 100
table(replicationsUnbalanced)
## replicationsUnbalanced
## 10 100
## 1000 200
Regardless of whether the expanded design is balanced or unbalanced, each row in the resulting expanded design object will be assigned to a unique computing array node, identified according to the Slurm-assigned array ID.
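As a quick sanity check, the per-array replications in either the balanced or unbalanced setup should sum back to the target of 10,000 per original design row. A minimal sketch using the unbalanced specification above:

```r
# Each group of expanded rows must recover the full 10,000 replications
rep_target <- 10000
rc <- c(100, 100, 1000)                   # repeats per original design row
reps <- rep(rep_target / rc, times = rc)  # replications per expanded row
totals <- tapply(reps, rep(seq_along(rc), times = rc), sum)
stopifnot(all(totals == rep_target))      # 10000 for each of the 3 rows
```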
In principle, the expanded Design300 object above could be passed as runSimulation(Design300, replications=100, ...) to evaluate each of the repeated conditions, where each row is now replicated only 100 times under the balanced approach above. However, there is now an issue with respect to random seed management in that use of functions such as set.seed() is no longer viable. This is because repeated use of set.seed() does not itself guarantee independent high-quality random numbers between different instances. For example,
set.seed(0)
x <- runif(100)
set.seed(1)
y <- runif(100)
plot(x, y) ## seemingly independent
plot(x[-1], y[-100]) ## subsets perfectly correlated
This issue is generally not a problem in the standard runSimulation() approach, as within each design condition high-quality random numbers are used by default, and any potentially repeated number sequences across the conditions are highly unlikely to affect the quality of the overall simulation experiment (the conditions themselves typically generate and manage random numbers in different ways due to the varying simulation factors, such as sample sizes, variance conditions, fitted models, number of variables, type of probability distributions used, and so on). However, in the expandDesign() setup the likelihood of witnessing correlated/redundant random samples increases very quickly, which is particularly problematic within each distributed replication set; hence, special care must be taken to ensure that proper seeds are distributed to each job array.
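To see how independence can be achieved in principle, note that the L'Ecuyer-CMRG generator provides sequentially computed sub-streams that are designed to be non-overlapping. The following base-R sketch (using the parallel package rather than SimDesign's internal machinery) illustrates the idea that runArraySimulation() manages automatically:

```r
library(parallel)
RNGkind("L'Ecuyer-CMRG")
set.seed(1276149341)        # a single initial seed (analogous to iseed)
s1 <- .Random.seed          # state for the first stream
s2 <- nextRNGStream(s1)     # state for the second, independent stream

.Random.seed <- s1
x <- runif(100)
.Random.seed <- s2
y <- runif(100)
cor(x, y)                   # near zero; no hidden overlap between streams
```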
Fortunately, seeds are easy to manage with the gen_seeds() function using a two-step approach, which is internally managed by runArraySimulation() by supplying an initial seed (iseed) value and the associated array ID (arrayID). Doing so utilizes L'Ecuyer's (1999) method, which constructs sequentially computed .Random.seed states to ensure independence across all replications and conditions. Note that iseed must be constant across all job arrays, so make sure to define iseed once and only once!
# gen_seeds() # do this once on the main node/home computer and store the number!
iseed <- 1276149341
As discussed in the FAQ section at the bottom, this value will also allow for generation of new .Random.seed elements if (or when) a second or third set of simulation jobs should be submitted to the HPC cluster at a later time but must also generate simulated data that is independent of the initial submission(s).
Construct the .slurm script

When submitting to the HPC cluster you'll need to include information about how the scheduler should distribute the simulation experiment to the workers. On Slurm systems, you may have a script such as the following, stored in a suitable .slurm file:
#!/bin/bash
#SBATCH --job-name="My simulation (array jobs, distributing conditions + replications)"
#SBATCH --mail-type=ALL
#SBATCH --mail-user=somewhere@out.there
#SBATCH --output=/dev/null ## (optional) delete .out files
#SBATCH --time=12:00:00 ## HH:MM:SS
#SBATCH --mem=2G
#SBATCH --cpus-per-task=1
#SBATCH --array=1-300 ## Slurm schedulers often allow up to 10,000 arrays
module load R/4.3.1
Rscript --vanilla mySimDesignScript.R
For reference later, label this file simulation.slurm, as this is the file that must be submitted to the scheduler when it's time.
The top part of this .slurm file provides the BASH instructions for the Slurm scheduler via the #SBATCH statements; in this case, how many array jobs to queue (1 through 300), how much memory to use per job (2 GB), time limits (12 hours), and more; see the official sbatch documentation for details on the supported #SBATCH options.
The most important input to focus on in this context is #SBATCH --array=1-300, as this is what the Slurm scheduler uses to assign a unique ID to each array job. The scheduler takes the defined mySimDesignScript.R script and sends it to 300 independent resources (each with 1 CPU, in this case), where the independent jobs are assigned a single unique array number within the --array=1-300 range (e.g., distribution to the first computing resource would be assigned arrayID=1, the second resource arrayID=2, and so on). In the runArraySimulation() function this is used to subset the Design300 object by row; hence, the array range must correspond to the row identifiers in the design object for proper subsetting!
Collecting this single number assigned by the Slurm scheduler is also easy. Just include
# get assigned array ID (default uses type = 'slurm')
arrayID <- getArrayID()
to obtain the associated array ID, which in this example will be a single integer value between 1 and 300. This value is used in the final execution step via runArraySimulation(..., arrayID=arrayID), which we finally turn to.
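For local testing outside of Slurm it can be useful to know that the scheduler communicates this number through an environment variable (SLURM_ARRAY_TASK_ID on Slurm systems), which getArrayID() consults internally. A hypothetical hand-rolled equivalent, with a fallback for interactive sessions, might look like:

```r
# Hypothetical stand-in for getArrayID() when testing locally;
# falls back to array 1 when no Slurm environment variable is set
get_array_id <- function(default = 1L) {
  id <- Sys.getenv("SLURM_ARRAY_TASK_ID", unset = NA)
  if (is.na(id)) default else as.integer(id)
}
arrayID <- get_array_id()
stopifnot(arrayID >= 1L)  # must fall within the --array=1-300 range
```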
Run the simulation with runArraySimulation()
With all the prerequisite steps above complete, we're finally ready to pass all information to runArraySimulation(), which is effectively a wrapper to runSimulation() that suppresses verbose outputs, takes subsets of the Design300 object given the supplied arrayID (and other objects, such as replications, seeds, etc.), forces evaluation on a single CPU (hence, #SBATCH --cpus-per-task=1 should be used by default, unless further parallelization is to occur within the replications, such as via OpenMP), and saves the SimDesign results to file names based on the filename argument with suffixes associated with the arrayID (e.g., filename='mysim' will save the files mysim-1.rds for array 1, mysim-2.rds for array 2, ..., mysim-300.rds for array 300).
# run the simulation on subset based on arrayID subset information
runArraySimulation(design=Design300, replications=replications,
generate=Generate, analyse=Analyse,
summarise=Summarise, iseed=iseed,
arrayID=arrayID, filename='mysim')
And that's it! The above will store all the mysim-#.rds files in the directory where the job was submitted, which is somewhat on the messy side, so you may also want to specify a directory name in which to store the simulation files. Hence, in the main (i.e., landing) location associated with your ssh account, create a directory using something like mkdir mysimfiles (or, in R, dir.create('mysimfiles')) in the location where your .R and .slurm files are stored. Then the following can be used to store all 300 collected .rds files, making use of the dirname argument.
# run the simulation on subset based on arrayID subset information
runArraySimulation(design=Design300, replications=replications,
generate=Generate, analyse=Analyse,
summarise=Summarise, iseed=iseed, arrayID=arrayID,
dirname='mysimfiles', filename='mysim')
Regardless, the hard part is done here, though other information could be included by way of the control list input if necessary, such as explicit time limits for the R executions within the array jobs themselves (see the FAQ section for further information).
Below is the complete submission script collecting everything that was presented above. This assumes that

- the .R file with the simulation code is stored in the file mySimDesignScript.R,
- the Slurm instructions are stored in the file simulation.slurm, which points to mySimDesignScript.R and includes the relevant R modules, and
- the directory mysimfiles/ has been created for storing the files on the computer used to submit the array job.

library(SimDesign)
Design <- createDesign(N = c(10, 20, 30))
Generate <- function(condition, fixed_objects = NULL) {
dat <- with(condition, rnorm(N, 10, 5)) # distributed N(10, 5)
dat
}
Analyse <- function(condition, dat, fixed_objects = NULL) {
ret <- c(mean=mean(dat), median=median(dat)) # mean/median of sample data
ret
}
Summarise <- function(condition, results, fixed_objects = NULL){
colMeans(results)
}
# expand the design to create 300 rows with associated replications
Design300 <- expandDesign(Design, repeat_conditions = 100)
rep_target <- 10000
replications <- rep(rep_target / 100, nrow(Design300))
# gen_seeds() # do this once on the main node/home computer and store the number!
iseed <- 1276149341
# get assigned array ID (default uses type = 'slurm')
arrayID <- getArrayID()
# run the simulation on subset based on arrayID subset information
runArraySimulation(design=Design300, replications=replications,
generate=Generate, analyse=Analyse,
summarise=Summarise, iseed=iseed, arrayID=arrayID,
dirname='mysimfiles', filename='mysim')
This file is then submitted to the job scheduler via sbatch
, pointing to the .slurm
instructions.
sbatch simulation.slurm
Once complete, you can go get a beer, coffee, or whatever else tickles your fancy to celebrate, as the hard part is over.
So, some time has elapsed, the job submissions are now complete, and you finally have access to the complete set of simulation files with the file names mysim-#.rds. The final step is to collect all independent results into a simulation object that resembles what would have been returned from the canonical runSimulation() function. Fortunately, this is very easy to do with aggregate_simulations(): simply change to the working directory containing the simulation files and supply the file names to be aggregated:
setwd('mysimfiles')
library(SimDesign)
Final <- aggregate_simulations(files=dir())
Final
# A tibble: 3 × 8
N mean median REPLICATIONS SIM_TIME COMPLETED
<dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 10 9.9973 9.9934 10000 23.42s Thu Apr 4 11:50:11 2024
2 20 10.007 10.015 10000 24.24s Thu Apr 4 11:50:35 2024
3 30 10.003 10.007 10000 24.39s Thu Apr 4 11:51:00 2024
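Note that files=dir() assumes the working directory contains only the simulation files. A slightly more defensive sketch restricts the match to the expected file-name pattern:

```r
# Only aggregate files matching the 'mysim-<number>.rds' naming scheme
files <- dir(pattern = "^mysim-\\d+\\.rds$")
Final <- aggregate_simulations(files = files)
```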
This function detects which Design300 rows belong to the original Design object and collapses the meta-statistics and stored results information accordingly. No fuss, no mess. Of course, you'll want to store this object for later use, as it is the complete collection of the results from the 300 array jobs, organized into one neat little (object) package.
# save the aggregated simulation object for subsequent analyses
saveRDS(Final, "../final_sim.rds")
You should now consider moving this "final_sim.rds" file off the Slurm landing node and onto your home computer via scp or your other favourite method (e.g., using WinSCP on Windows).
Of course, nothing really stops you from mixing and matching the above ideas related to multicore computing and array jobs on Slurm and other HPC clusters. For example, if you wanted to take the design object and submit batches of it instead (e.g., submit one or more rows of the original design object as an array job), where multicore processing is requested within each batch, then something like the following would work just fine:
#!/bin/bash
#SBATCH --job-name="My simulation (arrays + multiple CPUs)"
#SBATCH --mail-type=ALL
#SBATCH --mail-user=somewhere@out.there
#SBATCH --output=/dev/null ## (optional) delete .out files
#SBATCH --time=04:00:00 ## HH:MM:SS
#SBATCH --mem=32G ## 32 GB RAM per array
#SBATCH --cpus-per-task=16 ## 16 CPUs per array
#SBATCH --array=1-9 ## 9 array jobs
module load R/4.3.1
Rscript --vanilla mySimDesignScript.R
with the associated .R file containing, in this case, nine simulation conditions across the rows in Design.
library(SimDesign)
Design <- createDesign(N = c(10, 20, 30),
SD = c(1,2,3))
Generate <- function(condition, fixed_objects = NULL) {
dat <- with(condition, rnorm(N, 10, sd=SD)) # distributed N(10, SD)
dat
}
Analyse <- function(condition, dat, fixed_objects = NULL) {
ret <- c(mean=mean(dat), median=median(dat)) # mean/median of sample data
ret
}
Summarise <- function(condition, results, fixed_objects = NULL){
colMeans(results)
}
Design
## # A tibble: 9 × 2
## N SD
## <dbl> <dbl>
## 1 10 1
## 2 20 1
## 3 30 1
## 4 10 2
## 5 20 2
## 6 30 2
## 7 10 3
## 8 20 3
## 9 30 3
Depending on the intensity of the conditions, you may choose to distribute more than one row of the Design object to each array (multirow=TRUE in the following); otherwise, the more natural choice is to distribute each row in the Design object to its own assigned array.
# get array ID
arrayID <- getArrayID()
multirow <- FALSE # submit multiple rows of Design object to array?
if(multirow){
# If selecting multiple design rows per array, such as the first 3 rows,
# then next 3 rows, and so on, something like the following would work
s <- c(seq(from=1, to=nrow(Design), by=3), nrow(Design)+1L)
## For arrayID=1, rows2pick is c(1,2,3); for arrayID=2, rows2pick is c(4,5,6)
rows2pick <- s[arrayID]:(s[arrayID + 1] - 1)
filename <- paste0('mysim-', paste0(rows2pick, collapse=''))
} else {
# otherwise, submit each row independently across array
rows2pick <- arrayID
filename <- paste0('mysim-', rows2pick)
}
# Make sure parallel=TRUE flag is on! Also, it's important to change the computer
# name to something unique to the array job to avoid overwriting files (even temporary ones)
runSimulation(design=Design[rows2pick, ], replications=10000,
generate=Generate, analyse=Analyse, summarise=Summarise,
parallel=TRUE, filename=filename,
save_details=list(compname=paste0('array-', arrayID)))
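Since the indexing arithmetic in the multirow branch is easy to get wrong, it is worth verifying the batches it produces across all array IDs before submitting. A quick check, assuming the 9-row Design and batches of 3 used above:

```r
# Recompute the row batches for every arrayID and confirm they
# partition rows 1-9 into consecutive groups of three
nrowDesign <- 9
s <- c(seq(from = 1, to = nrowDesign, by = 3), nrowDesign + 1L)  # 1 4 7 10
batches <- lapply(1:3, function(arrayID) s[arrayID]:(s[arrayID + 1] - 1))
stopifnot(all(batches[[1]] == 1:3),
          all(batches[[2]] == 4:6),
          all(batches[[3]] == 7:9))
```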
When complete, the function aggregate_simulations() can again be used to put the simulation results together given the nine saved files (or three, if multirow is TRUE and #SBATCH --array=1-3 were used instead).
This type of hybrid approach is a middle ground between submitting the complete job (top of this vignette) and the condition + replication distributed load in the previous section, though it has similar overhead and inefficiency issues as before (albeit less so, as the array jobs are evaluated independently). For very intense simulations, particularly those prone to time-outs or other exhausted resources, the runArraySimulation() approach is recommended, as the max_RAM and max_time fail-safes are more naturally accommodated within the replications, while for moderate-intensity simulations the above approach is likely sufficient, as each (batch of) simulation conditions can be evaluated independently across each array on the HPC cluster.
In addition to using sbatch to submit jobs, the following are other useful Slurm commands.
sbatch <jobfile.slurm> # submit job file to Slurm scheduler
squeue -u <username> # what jobs are currently queued/running for a specific user
sshare -U <username> # check the share usage for a specific user
scancel <jobid> # cancel a specific job
scancel -u <username> # cancel all queued and running jobs for a specific user
This issue is important to consider whenever the HPC cluster has mandatory time/RAM limits for job submissions, where an array job may not complete within the assigned resources; hence, if not properly managed, any valid replication information will be discarded when the job is abruptly terminated. Unfortunately, this is a very likely occurrence, and is largely a function of being unsure about how long each simulation condition/replication will take to complete when distributed across the arrays (some conditions/replications will take longer than others, and it is difficult to be perfectly knowledgeable about this information beforehand) or how large the final objects will grow as the simulation progresses.
To avoid this waste of time/resources it is strongly recommended to add a max_time and/or max_RAM argument to the control list (see help(runArraySimulation) for supported specifications) that is less than the Slurm specifications. These control flags will halt runArraySimulation()/runSimulation() execution early and return only the complete simulation results up to that point. However, this will only work if these arguments are meaningfully less than the allocated Slurm resources; otherwise, you'll run the risk that the job terminates before the SimDesign functions have the chance to store the successfully completed replications. Setting these to around 90% of the respective #SBATCH --time= and #SBATCH --mem= inputs should, however, be sufficient in most cases.
# Return successful results up to the 11 hour mark, and terminate early
# if more than 3.5 GB of RAM are required to store the internal results
runArraySimulation(design=Design300, replications=replications,
generate=Generate, analyse=Analyse,
summarise=Summarise, iseed=iseed, arrayID=arrayID,
dirname='mysimfiles', filename='mysim',
control=list(max_time="11:00:00", max_RAM="3.5GB"))
Of course, if the session does time out early then the target replications will be missed in the first batch of job submissions. Therefore, as covered in the next section, a new job must be submitted to the scheduler that is mindful of the initial simulation history (particularly, the .Random.seed states).