All model runner options

This article breaks down all the options available when running mbg models. For a summary of these terms, see the documentation for the mbg::MbgModelRunner$new() method.


1: MBG model basics

The model always requires two terms: input_data, which includes all point observations of the outcome to be estimated, and id_raster, which lays out the study area.

1.A: Input data

Formatted as a data.frame or data.table::data.table. Should contain at least the following fields:

1.B: ID raster

A terra::SpatRaster object meeting the following requirements:

An ID raster can be created using the mbg::build_id_raster function.

Before running a model, you can use the terra::extract function to confirm that every point in your input data overlaps a non-NA pixel in the ID raster.
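The overlap check described above might look like the following minimal sketch. It builds a toy raster by hand rather than calling mbg::build_id_raster, and it assumes that the point coordinates are stored in columns named x and y. Both the toy objects and those column names are illustrative assumptions, not package requirements confirmed here.

```r
library(terra)

# Toy ID raster: a 4 x 4 grid whose top-left pixel (cell 1) is NA,
# standing in for a pixel outside the study area
id_raster <- terra::rast(
  nrows = 4, ncols = 4, xmin = 0, xmax = 4, ymin = 0, ymax = 4,
  vals = c(NA, 2:16)
)

# Toy input data; the coordinate column names `x` and `y` are assumed
input_data <- data.frame(
  x = c(0.5, 2.5, 3.5),
  y = c(3.5, 1.5, 0.5)
)

# Extract the pixel value under each point; NA means no valid pixel
pixel_ids <- terra::extract(id_raster, as.matrix(input_data[, c("x", "y")]))[, 1]
outside <- is.na(pixel_ids)

if (any(outside)) {
  message(sum(outside), " point(s) fall outside the ID raster")
}
```

Rows flagged in outside could then be inspected, corrected, or dropped before the model is run.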


3: Model effects

The model currently has four effect types which can be toggled and controlled via settings passed to mbg::MbgModelRunner.

3.A: Covariate effects

Relevant settings:

A covariate effect will only be included if use_covariates is TRUE (the default) and covariate_rasters is passed. covariate_rasters is an optional list of terra::SpatRaster objects containing pixel-level predictive covariates. The covariates can be incorporated into the model in two different ways depending on the value of use_stacking:

3.A.i: Standard covariate effect

Only applied if a covariate effect is included and use_stacking is FALSE (the default).

The covariate effect at observation \(i\) is \(\gamma^{covariates}_i = \vec{\beta}X_{s_i}\), where \(\vec{\beta}\) are linear effects on the matrix of covariate values \(X\) evaluated at the location of observation \(i\) (\(s_i\)).

Note that an intercept is not included by default. If you want a model with no covariate effects other than an intercept, pass a covariate_rasters list whose only element is an intercept raster with a value of 1 in every pixel.
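One way to build such an intercept raster is sketched below. The toy ID raster, the list element name intercept, and the choice to keep NA pixels as NA are all assumptions made for illustration.

```r
library(terra)

# Toy ID raster with one NA pixel (illustrative only)
id_raster <- terra::rast(
  nrows = 3, ncols = 3, xmin = 0, xmax = 3, ymin = 0, ymax = 3,
  vals = c(NA, 2:9)
)

# Multiplying by 0 and adding 1 gives 1 in every valid pixel
# and leaves NA pixels as NA
intercept <- id_raster * 0 + 1
names(intercept) <- "intercept"

# Assumed structure: a named list with the intercept as its only covariate
covariate_rasters <- list(intercept = intercept)
```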

A prior is applied to the variance of effects on all covariates other than the intercept: prior_covariate_effect (default list(threshold = 3, prob_above = 0.05)) is a penalized complexity prior that can be expressed as a level of certainty about the standard deviation on each fixed effect \(\beta\). For example, the default prior corresponds to \(P(\sigma_{\beta} > 3) = 0.05\).
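To get a feel for what this prior implies, the sketch below assumes the usual penalized complexity form for a standard deviation, namely an exponential distribution on \(\sigma_{\beta}\) whose rate is chosen to satisfy the tail statement. That form is an assumption about how the setting is interpreted, not something confirmed by the package documentation quoted here.

```r
prior_covariate_effect <- list(threshold = 3, prob_above = 0.05)

# Assumed form: sigma ~ Exponential(lambda), with lambda chosen so that
# P(sigma > threshold) = prob_above
lambda <- -log(prior_covariate_effect$prob_above) / prior_covariate_effect$threshold

# Check the tail probability and summarize the implied prior
tail_prob <- pexp(prior_covariate_effect$threshold, rate = lambda, lower.tail = FALSE)
prior_median <- qexp(0.5, rate = lambda)

print(c(lambda = lambda, tail_prob = tail_prob, prior_median = prior_median))
```

Under this reading, lowering threshold or prob_above shrinks the fixed effects more strongly toward zero.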

3.A.ii: Stacked ensemble model

Only applied if a covariate effect is included and use_stacking is TRUE.

For a stacked ensemble model, the covariate effect for observation \(i\) is: \[ \gamma^{covariates}_i = \sum_{j=1}^{J}\left[ w_{j} f_j(X_{s_i}) \right] \\ \text{Constraints: } w_j > 0 \ \forall \ j, \ \textstyle \sum_{j=1}^{J} w_j = 1 \] where:

  • \(\vec{f}\): Predictions from a set of J regression models fit to the raw covariate data \(X\)
  • \(\vec{w}\): A weighting vector corresponding to each regression model \(f_j\). The weights are constrained to be strictly greater than zero and to sum to one.
  • \(X_{s_i}\): The raw covariate values at the location of observation \(i\), \(s_i\)
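The weighted sum above can be illustrated with a few lines of base R. The prediction values and weights are made up for the example. In a fitted model the weights are estimated rather than supplied by hand.

```r
# Hypothetical predictions f_j(X_{s_i}) from J = 3 component models
# at 4 observation locations (rows = observations, columns = models)
f <- cbind(
  gbm     = c(0.20, 0.35, 0.50, 0.10),
  treebag = c(0.25, 0.30, 0.45, 0.15),
  rf      = c(0.22, 0.40, 0.55, 0.05)
)

# Hypothetical weights satisfying the constraints: all positive, summing to one
w <- c(gbm = 0.5, treebag = 0.2, rf = 0.3)
stopifnot(all(w > 0), isTRUE(all.equal(sum(w), 1)))

# Covariate effect for each observation: sum over j of w_j * f_j(X_{s_i})
gamma_covariates <- as.vector(f %*% w)
print(gamma_covariates)
```

Because the weights form a convex combination, each stacked value lies between the smallest and largest component prediction for that observation.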

Relevant model settings:

  • stacking_model_settings (default list(gbm = NULL, treebag = NULL, rf = NULL)): Defines the list of component models \(f_j(X)\) to be fitted to the covariates. This is a named list: each name corresponds to a regression model in the caret package, and each value stores optional settings that can be passed to that model.
  • stacking_cv_settings (default list(method = 'repeatedcv', number = 5, repeats = 5)): These are used by caret::trainControl to cross-validate each regression model.
  • stacking_use_admin_bounds (default FALSE), admin_bounds (default NULL), admin_bounds_id (default NULL): If stacking_use_admin_bounds is TRUE and the other two values are set, adds administrative fixed effects to each of the component models.
  • stacking_prediction_range (default NULL): Can be used to restrict the prediction range of each component regression model. For binomial data, a reasonable limit is to not predict outside of c(0, 1).
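As a rough sketch, the stacking settings could be assembled as plain R objects like this. The values mirror the defaults listed above. The clamping step at the end is only one plausible reading of what restricting the prediction range means, not a description of the package internals.

```r
# Component models, named after caret methods; NULL means no extra settings
stacking_model_settings <- list(gbm = NULL, treebag = NULL, rf = NULL)

# Cross-validation settings: 5-fold cross-validation repeated 5 times
stacking_cv_settings <- list(method = "repeatedcv", number = 5, repeats = 5)

# Restrict component-model predictions to valid proportions (binomial data)
stacking_prediction_range <- c(0, 1)

# Illustration only: clamp hypothetical out-of-range predictions
raw_predictions <- c(-0.04, 0.31, 0.98, 1.07)
clamped <- pmin(
  pmax(raw_predictions, stacking_prediction_range[1]),
  stacking_prediction_range[2]
)
print(clamped)
```

If caret is installed, do.call(caret::trainControl, stacking_cv_settings) shows how a settings list of this shape maps onto the arguments of caret::trainControl.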

3.B: Gaussian process

If the setting use_gp is TRUE (the default), adds a spatially correlated effect:

\[ Z \sim GP(0, \Sigma_s) \] where \(Z\) is a Gaussian process with mean zero and stationary isotropic Matérn covariance over space (\(\Sigma_s\)).
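To build intuition for this term, the sketch below draws one realization of a mean-zero Gaussian process with Matérn covariance at a handful of random locations. It is not how R-INLA estimates the effect. The smoothness \(\nu = 1\), the scaling \(\kappa = \sqrt{8\nu} / \text{range}\), and all numeric values are assumptions chosen to match a common INLA convention.

```r
# Matérn correlation as a function of distance, with the range parameterized
# so that correlation drops to roughly 0.14 at d = range (assumed convention)
matern_correlation <- function(d, range, nu = 1) {
  kappa <- sqrt(8 * nu) / range
  scaled <- kappa * d
  corr <- (2^(1 - nu) / gamma(nu)) * scaled^nu * besselK(scaled, nu)
  corr[d == 0] <- 1  # limit at zero distance, where besselK is infinite
  corr
}

set.seed(1)
n <- 30
coords <- cbind(x = runif(n), y = runif(n))  # hypothetical locations
distances <- as.matrix(dist(coords))

sigma <- 0.5      # hypothetical marginal standard deviation
gp_range <- 0.3   # hypothetical spatial range
Sigma_s <- sigma^2 * matern_correlation(distances, range = gp_range)

# Draw Z ~ N(0, Sigma_s) using a Cholesky factor; the small jitter on the
# diagonal guards against numerical non-positive-definiteness
chol_factor <- chol(Sigma_s + diag(1e-8, n))
Z <- as.vector(t(chol_factor) %*% rnorm(n))
```

Nearby locations receive similar values of Z, while locations separated by much more than gp_range are nearly independent.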

The Gaussian process is informed by priors on the range and variance:

To simplify estimation, the R-INLA package represents the continuous Gaussian process on a 2D spatial mesh. Three more settings control the mesh:

For more details about the INLA approach to approximate Gaussian process regression, see the papers at the bottom of this page.


3.C: Administrative-level effect

This effect is a random intercept grouped by administrative unit. The administrative level (polygon boundaries) of interest can be set by the user. If the effect is on, then the following term is added:

\[ \gamma^{admin}_{a_i} \sim N(0, \sigma^2_{admin}) \] In other words, \(\vec\gamma^{admin}\) is a vector of random intercepts with length equal to the total number of administrative units, IID normal with mean 0 and variance \(\sigma^2_{admin}\). All observations \(i\) in the same administrative division \(a\) share the same intercept \(\gamma^{admin}_{a_i}\).
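The indexing in this term can be made concrete with a short simulation. The administrative assignments and the value of \(\sigma_{admin}\) below are invented for the example.

```r
set.seed(42)

# a_i: hypothetical administrative unit containing each of 6 observations
admin_ids <- c(1, 1, 2, 3, 3, 3)
n_admin <- 3

# One IID normal intercept per administrative unit
sigma_admin <- 0.4
gamma_admin <- rnorm(n_admin, mean = 0, sd = sigma_admin)

# Each observation picks up the intercept of its administrative unit
effect_by_observation <- gamma_admin[admin_ids]
print(data.frame(admin_ids, effect_by_observation))
```

Observations 1 and 2 share one intercept and observations 4 through 6 share another, which is what distinguishes this effect from an observation-level term such as the nugget.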

Relevant settings:


3.D: Nugget

The nugget is an independently and identically distributed (IID) normal effect applied to each observation. It corresponds to “irreducible variation” not captured by any other model effect:

\[ \gamma^{nugget}_i \sim N(0, \sigma^2_{nugget}) \]

Relevant settings:


4: Aggregation to polygon boundaries

As shown in the introductory tutorial, the mbg::MbgModelRunner object can automatically aggregate predictions to administrative boundaries. The following three objects are required to perform aggregation:


5: Logging

Finally, the setting verbose (default TRUE) governs whether the model will perform detailed logging. You can access model logs afterwards by running mbg::logging_get_timer_log.


6: Further reading

Bakka, H., et al. (2018). Spatial modeling with R‐INLA: A review. Wiley Interdisciplinary Reviews: Computational Statistics, 10(6), e1443. https://doi.org/10.1002/wics.1443

Bhatt, S., Cameron, E., Flaxman, S. R., Weiss, D. J., Smith, D. L., & Gething, P. W. (2017). Improved prediction accuracy for disease risk mapping using Gaussian process stacked generalization. Journal of The Royal Society Interface, 14(134), 20170520. https://doi.org/10.1098/rsif.2017.0520

Freeman, M. (2017). An introduction to hierarchical modeling. http://mfviz.com/hierarchical-models/

Moraga, P. (2019). Geospatial Health Data: Modeling and Visualization with R-INLA and Shiny. Chapman & Hall/CRC Biostatistics Series. ISBN 9780367357955. https://www.paulamoraga.com/book-geospatial/index.html

Opitz, T. (2017). Latent Gaussian modeling and INLA: A review with focus on space-time applications. Journal de la société française de statistique, 158(3), 62-85. https://www.numdam.org/article/JSFS_2017__158_3_62_0.pdf