This vignette introduces mbg
, a package for modern Model-Based Geostatistics. This package offers tools for spatial regression analysis:
mbg
combines functions from several packages:
INLA
: for running Bayesian geostatistical modelsdata.table
: for handling tabular datasf
for handling polygonsterra
for handling raster datacaret
for running ML models with point outcomes and raster predictorsIn this tutorial, we will run a simple mbg
model with prepared data to demonstrate core package functions.
Load the mbg
package, as well as the data.table
, sf
, and terra
packages for data processing, and the ggplot2
package for plotting.
# Load packages
library(data.table)
library(ggplot2)
library(sf)
library(terra)
library(mbg)
For this tutorial, we will estimate stunting rates among children under 5 years of age across Benin. Load point observations of this outcome from the package data and print the first few rows:
# Outcome: child stunting
data.table::fread(
outcomes <-system.file('extdata/child_stunting.csv', package = 'mbg')
)
cluster_id | indicator | samplesize | x | y |
---|---|---|---|---|
3 | 5 | 16 | 1.762278 | 6.267199 |
4 | 5 | 14 | 2.044534 | 6.335044 |
5 | 2 | 13 | 2.420548 | 6.346753 |
6 | 1 | 17 | 2.371199 | 6.349460 |
7 | 8 | 20 | 1.889197 | 6.351104 |
8 | 4 | 23 | 2.369804 | 6.354230 |
The outcome data has 505 rows, each representing one surveyed location, and five columns:
Let’s map this survey data over a map of Benin.
# Administrative boundaries
sf::st_read(
communes <-system.file('extdata/Benin_communes.gpkg', package = 'mbg'),
quiet = TRUE
)
# Convert point data to sf
sf::st_as_sf(
outcome_sf <-
outcomes,coords = c('x', 'y'),
crs = sf::st_crs(communes)
)$stunting_rate <- outcome_sf$indicator / outcome_sf$samplesize
outcome_sf
::ggplot() +
ggplot2 ggplot2::geom_sf(data = communes) +
ggplot2::geom_sf(
data = outcome_sf,
::aes(color = stunting_rate, size = samplesize),
ggplot2alpha = 0.75
+
) ggplot2::scale_size_continuous(range = c(0.5, 3)) +
ggplot2::scale_color_gradientn(
colors = terra::map.pal('viridis'),
labels = scales::percent
+
) ggplot2::labs(
title = 'Stunting point data',
color = 'Observed\nstunting\nrate',
size = 'Sample\nsize'
+
) ggplot2::theme_minimal()
We can see that there seem to be spatial patterns in the surveyed stunting data—stunting generally seems to be lower around the urban regions of Porto Novo in the coastal south and Parakou in the center north, and higher in the far north and the inland south. To better understand the underlying trends in stunting risk across the country, we would ideally like to:
A Bayesian geostatistical model can accomplish all of these goals. We will fit a Bayesian geostatistical model to the outcome data, with three potential predictors formatted as raster surfaces:
# Spatial covariates
list(
covariates <-access = terra::rast(system.file('extdata/access.tif', package = 'mbg')),
evi = terra::rast(system.file('extdata/evi.tif', package = 'mbg')),
temperature = terra::rast(system.file('extdata/temperature.tif', package = 'mbg'))
)# Plot the covariates
plot(terra::rast(covariates), nr = 1)
The geostatistical model takes three more arguments:
id_raster
: A terra::SpatRaster
that lays out the prediction grid. The model will make predictions for all non-NA cells in the id_raster
.aggregation_table
(data.table::data.table
, optional): Used to aggregate model estimates from grid cells to commune polygons, preserving uncertaintypopulation_raster
(terra::SpatRaster
, optional): population-based indicators like child stunting should be aggregated using population weighting, which places greater weight on grid cells with higher populations when summarizing those grid cell results by polygon.# Create ID raster
mbg::build_id_raster(
id_raster <-polygons = communes,
template_raster = covariates[[1]]
)# Table to help with aggregation to higher administrative levels
mbg::build_aggregation_table(
aggregation_table <-polygons = communes,
id_raster = id_raster,
polygon_id_field = 'commune_code'
)# Population raster: used for aggregation to administrative boundaries
terra::rast(
population_raster <-system.file('extdata/under_5_population.tif', package = 'mbg')
)
The entire Bayesian geostatistical regression workflow can be run using the mbg::MbgModelRunner
class. The MbgModelRunner$new()
method creates a new instance of the class, and the MbgModelRunner$run_mbg_pipeline()
runs the entire workflow based on the passed inputs.
The following code block runs the mbg
model using mainly default settings. A typical workflow can take a few minutes to finish, although intermediate steps will be printed to screen. Because we want a regression intercept in this model, the first line in the code block below creates the intercept as one of the covariate rasters.
# Add an intercept to the model
$intercept <- covariates[[1]] * 0 + 1
covariates## Run MBG models
MbgModelRunner$new(
model_runner <-input_data = outcomes,
id_raster = id_raster,
covariate_rasters = covariates,
aggregation_table = aggregation_table,
aggregation_levels = list(
commune = c('commune_code', 'commune', 'department_code', 'department'),
region = c('department_code', 'department')
),population_raster = population_raster
)$run_mbg_pipeline()
model_runner#> MBG model fitting
#> MBG model fitting: 13.857 sec elapsed
#> Generating model predictions
#> Parameter posterior samples
#> Parameter posterior samples: 6.415 sec elapsed
#> Cell draws
#> Cell draws: 2.123 sec elapsed
#> Summarize draws
#> Summarize draws: 0.817 sec elapsed
#> Generating model predictions: 9.357 sec elapsed
#> Running population-weighted aggregation from grid cells to polygons
The model makes predictions at each grid cell, with some uncertainty associated with each grid cell prediction. We capture that uncertainty by taking many (by default, 250) samples from the range of plausible model estimates and generating summary statistics across those samples.
Once the geostatistical workflow has finished, we can pull the gridded raster model predictions from the MbgModelRunner$grid_cell_predictions
attribute. This attribute is a list with the following items:
'parameter_draws'
(matrix
): Samples of the underlying model parameters'cell_draws'
(matrix
): Samples of model estimates at each grid cell location from the id_raster
'cell_pred_mean'
(terra::SpatRaster
): The mean estimates across all sampled cell_draws
, represented as a raster.'cell_pred_lower'
(terra::SpatRaster
): Lower bounds of an uncertainty interval, summarized across the sampled cell_draws
and represented as a raster. By default, we use a 95% uncertainty interval, so this raster shows the 2.5th percentile across the samples for each grid cell.'cell_pred_upper'
(terra::SpatRaster
): Upper bounds of the uncertainty interval, represented as a raster. By default, shows the 97.5th percentile of samples for each grid cell.We can plot the mean estimates from the geostatistical model:
# Get predictions by pixel
model_runner$grid_cell_predictions
grid_cell_predictions <-# Plot mean estimates
plot(
$cell_pred_mean * 100,
grid_cell_predictionsmain = 'MBG mean estimates (%)'
)lines(communes)
We can also show the width of the 95% uncertainty interval by subtracting the “lower” from the “upper” summary raster:
# Plot estimate uncertainty
plot(
$cell_pred_upper - grid_cell_predictions$cell_pred_lower) * 100,
(grid_cell_predictionscol = sf::sf.colors(n = 100),
main = 'MBG estimates: 95% uncertainty interval width (%)'
)lines(communes)
Compare these gridded estimates to the underlying point data. The results seem to meet our goals for estimating stunting risk across Benin:
Many of these observations can be quantified and tested, as demonstrated in later vignettes.
Each sample from the geostatistical model yields a candidate map, a single realization of the outcome across each pixel in the study area that is consistent with the fitted model parameters and uncertainty. These samples can be individually aggregated to administrative boundaries, preserving uncertainty, and then summarized.
Aggregated samples and summaries are available from the MbgModelRunner$aggregated_predictions
attribute. This attribute is a named list, where each item contains polygon estimates for each of the passed aggregation_levels
. Each of those levels has two items:
'draws'
(data.table::data.table
): Predictive model samples by aggregated polygon unit'summary'
(data.table::data.table
): Summaries of draws
for each aggregated polygon unit, including the mean and the bounds of the 95% uncertainty interval across samplesWe can merge the commune estimate summaries back onto the communes
sf object to plot the results.
# Get predictions by commune
model_runner$aggregated_predictions
aggregated_predictions <- aggregated_predictions$commune$summary
commune_summary <- merge(
summary_sf <-x = communes,
y = commune_summary,
by = c('commune_code', 'commune', 'department_code', 'department')
)# Plot aggregated estimates by commune
::ggplot() +
ggplot2 ggplot2::geom_sf(data = summary_sf, ggplot2::aes(fill = mean), color = 'black') +
ggplot2::scale_fill_gradientn(
colors = terra::map.pal('viridis'),
breaks = seq(0.15, 0.45, by = .05),
labels = scales::percent
+
) ggplot2::labs(
title = 'MBG mean estimates by commune',
fill = "Estimated\nstunting\nrate"
+
) ggplot2::theme_minimal()
We can also plot the width of the 95% uncertainty interval by commune.
# Plot aggregated uncertainty interval widths by commune
$ui <- summary_sf$upper - summary_sf$lower
summary_sf::ggplot() +
ggplot2 ggplot2::geom_sf(data = summary_sf, ggplot2::aes(fill = ui), color = 'black') +
ggplot2::scale_fill_gradientn(
colors = sf::sf.colors(n = 100),
labels = scales::percent
+
) ggplot2::labs(
title = 'MBG 95% uncertainty interval width by commune',
fill = 'Uncertainty\ninterval\nwidth'
+
) ggplot2::theme_minimal()
Note that in general, uncertainty intervals tend to be wider at the pixel level than for aggregated polygons. This is because some sources of uncertainty are uncorrelated between pixels, and this uncorrelated uncertainty shrinks when estimates are aggregated across pixels.