A geographer’s introduction to space-time regression with GAMs using stgam

Lex Comber, Paul Harris and Chris Brunsdon

April 2025

Overview

The stgam package (Comber, Harris, and Brunsdon 2024) provides a wrapper around mgcv functionality (S. Wood 2021) to support space-time analysis, with a focus on prediction but, perhaps more importantly, on inference (process understanding). The latter is commonly the objective of geographical analysis, which is often concerned with how processes and statistical relationships vary over space and / or time.

This vignette illustrates an analysis workflow using GAM smooths that quantifies and models variations over space and time through a house price case study. It highlights the importance of investigating spatial and / or temporal variability before constructing space-time Generalized Additive Models (GAMs).

You should load the stgam package and the data. The data are a sample of terraced house sales for 2018 to 2024 extracted from Bin et al. (2024), located via the ONS lookup tables that link UK postcodes to location (actually the geometric centroids of the postcode areas).

library(cols4all)
library(dplyr)
library(ggplot2)
library(tidyr)
library(sf)
library(cowplot)
# load the package and data
library(stgam)
data("hp_data")
data("lb")

You should examine the help for the datasets and especially the variables that hp_data contains:

help(hp_data)
help(lb)

1. Data considerations

The first stage is simply to examine the data, particularly the continuous variables over space and time. The code below examines the data before undertaking some initial investigations.

# examine what is loaded 
hp_data
#> # A tibble: 1,888 × 13
#>    price priceper   tfa dot          yot  beds type    cef   pef ageb      lad            X      Y
#>    <dbl>    <dbl> <dbl> <date>     <dbl> <int> <chr> <int> <int> <chr>     <chr>      <int>  <int>
#>  1 1000     7463. 134   2023-02-22  2023     9 T        60    82 1930-1949 E09000014 531674 188707
#>  2  520     7879.  66   2022-05-27  2022     3 T        57    74 1930-1949 E09000031 536888 189269
#>  3  325     5078.  64   2021-01-15  2021     4 T        50    82 1950-1966 E09000016 549763 190717
#>  4  300     2400  125   2020-01-17  2020     5 T        64    85 1900-1929 E09000031 538646 186513
#>  5  585     7222.  81   2020-11-13  2020     4 T        75    88 1983-1990 E09000027 514710 172535
#>  6  618.    6786.  91   2023-05-18  2023     4 T        61    84 1950-1966 E09000008 533233 170230
#>  7 1183.   12117.  97.6 2022-04-22  2022     5 T        63    66 1900-1929 E09000027 516471 174772
#>  8  800     6154. 130   2023-09-28  2023     7 T        60    80 1900-1929 E09000010 532032 194831
#>  9  560     6022.  93   2022-12-21  2022     5 T        47    79 1930-1949 E09000029 526674 164298
#> 10  300     3750   80   2019-09-16  2019     4 T        71    86 1950-1966 E09000004 548369 179349
#> # ℹ 1,878 more rows

The analyses will model price per unit area (the priceper variable) in units of pounds per square metre. GAM smooth models will be constructed that regress this against other variables in the data, including location and time. The print out of the first 10 records above shows that the dataset contains a time variable (dot) in date format as well as location in metres (X and Y from the OSGB projection). It is often more useful to represent time as continuous scalar values and to have location in kilometres rather than metres. The code below uses the earliest date in the dataset to create a variable called days to represent time (here 100s of days since the earliest date) and rescales the locational data, retaining the original coordinates for mapping:

hp_data <- 
  hp_data |> 
  # create continuous time variable
  mutate(days = as.numeric(dot - min(dot))/100) |>
  relocate(days, .after = dot) |>
  # scale location and retain original coordinates
  mutate(Xo = X, Yo = Y) |>
  mutate(X = X/1000, Y = Y/1000)

Again this can be examined:

hp_data

The data and the target variable can be mapped with the London borough spatial layer (lb) as in the code snippet below. Note the use of the Xo and Yo coordinates in the mapping with ggplot. The resulting map shows the spatial distribution of priceper and also indicates that there were no terraced house sales in the centre of London (in the City of London local authority district).

# map the data layers
lb |> 
  ggplot() + geom_sf() +
  geom_point(data = hp_data, aes(x = Xo, y = Yo, col = priceper)) +
  scale_color_viridis_c(option = "magma") +
  theme_bw()  +xlab("") + ylab("")
The spatial variation of the priceper variable (house price per square metre in £s).

In the analyses described below, location is measured in kilometres, and the lb spatial dataset of London boroughs will be used to provide spatial context to some of the results. To do this, the coordinates of the lb spatial data layer need to be rescaled in the same way as the X and Y variables in hp_data were to aid model construction. The code below creates a rescaled version of lb projected in kilometres (remember the original data can always be reloaded using data(lb)).

# transform to km
lb <- st_transform(lb, pipeline = "+proj=pipeline +step +proj=unitconvert +xy_out=km")
# remove the projection to avoid confusing ggplot
st_crs(lb) <- NA

The correlations and distributions of the numeric variables can be examined. It is important to establish that a regression of the proposed target variable against the predictor variables may yield something interesting and sensible and, just as importantly, to identify any variables that might be a problem when trying to make a model. For example, in this case it is good that the proposed target variable (priceper) has a healthy tail.

The code below generates boxplots and then density histograms.

# boxplots
hp_data |> 
  select(lad, price, priceper, tfa, days, beds, cef, pef, X, Y) |>
  pivot_longer(-lad) |>
  ggplot(aes(x = value)) + 
  geom_boxplot(fill="dodgerblue") +
  facet_wrap(~name, scales = "free") +
  theme_bw()
Boxplots of the numeric variables in the data.
# histograms
hp_data |> 
  select(lad, price, priceper, tfa, days, beds, cef, pef, X, Y) |>
  pivot_longer(-lad) |>
  ggplot(aes(x = value)) + 
  geom_histogram(aes(y=after_stat(density)),bins = 30, 
                 fill="tomato", col="white") +
  geom_density(alpha=.5, fill="#FF6666") +
  facet_wrap(~name, scales = "free") +
  theme_bw()
Histograms of the numeric variables in the data.

Correlations are examined for similar reasons: to check for issues that may cause problems later in the analysis workflow. Here the aim is to establish that there are reasonable correlations with the target variable and no collinearity amongst the predictor variables when they are examined without considering time or space (i.e. globally).

# correlations
hp_data |> 
  select(priceper, price, tfa,  beds, cef, pef) |>
  cor() |> round(3)
#>          priceper  price    tfa   beds    cef    pef
#> priceper    1.000  0.723  0.308  0.179 -0.118 -0.220
#> price       0.723  1.000  0.791  0.549 -0.082 -0.216
#> tfa         0.308  0.791  1.000  0.780 -0.034 -0.202
#> beds        0.179  0.549  0.780  1.000 -0.079 -0.170
#> cef        -0.118 -0.082 -0.034 -0.079  1.000  0.388
#> pef        -0.220 -0.216 -0.202 -0.170  0.388  1.000

Finally, thinking about space-time analysis of your dataset: as a general heuristic, any dataset for use in space-time analysis should have a minimum of about 100 locations and a minimum of 50 observation time periods. You should consider this when you are assembling your data.
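
A rough check of this heuristic for hp_data (a hedged sketch: unique postcode centroid coordinates are treated as locations and calendar months as time periods, one of several reasonable definitions):

# count unique locations and monthly time periods
c(locations = nrow(distinct(hp_data, X, Y)),
  months = length(unique(format(hp_data$dot, "%Y-%m"))))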

2. Detecting variability

The investigations in the previous section were all concerned with establishing the viability of the proposed analysis, using standard exploratory data analysis approaches to examine distributions and correlations. In this section these are extended to test for the presence of variability in the data, now over time, over space and in space-time. This is done by constructing a series of regression models of different forms and with different parameters. Tools for constructing Generalized Additive Models (GAMs) using the mgcv package (S. Wood 2021) are introduced, with a particular focus on using GAM smooths or splines to explore variations.

2.1 Initial investigations with OLS and dummy variables

As an initial baseline model, the code below creates a standard OLS regression model of priceper using the lm function. The model fit is weak, but the model summary indicates that some of the variables (pef and beds) are significant predictors of the target variable. The analysis of variance (ANOVA), calculated using the anova function, shows how much variance in priceper is explained by each predictor and whether that contribution is statistically significant. Here a higher F value test statistic provides stronger evidence that the predictor is useful.

In this case there is evidence that two of the predictors (pef and beds) are statistically significant, and the F-values and associated p-values (Pr(>F)) show strong evidence that each explains some of the variation in priceper.

# an OLS model
m_ols <- lm(priceper ~ cef + pef + beds, data = hp_data)
summary(m_ols)
#> 
#> Call:
#> lm(formula = priceper ~ cef + pef + beds, data = hp_data)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -7057.6 -1752.2  -554.9  1220.4 23271.9 
#> 
#> Coefficients:
#>              Estimate Std. Error t value Pr(>|t|)    
#> (Intercept) 10751.489    690.879  15.562  < 2e-16 ***
#> cef            -9.633      6.325  -1.523    0.128    
#> pef           -59.626      8.047  -7.410 1.90e-13 ***
#> beds          298.329     46.327   6.440 1.52e-10 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 2740 on 1884 degrees of freedom
#> Multiple R-squared:  0.06997,    Adjusted R-squared:  0.06849 
#> F-statistic: 47.25 on 3 and 1884 DF,  p-value: < 2.2e-16
anova(m_ols)
#> Analysis of Variance Table
#> 
#> Response: priceper
#>             Df     Sum Sq   Mean Sq F value    Pr(>F)    
#> cef          1 2.1265e+08 212648515  28.323 1.149e-07 ***
#> pef          1 5.4018e+08 540183718  71.948 < 2.2e-16 ***
#> beds         1 3.1134e+08 311344258  41.468 1.516e-10 ***
#> Residuals 1884 1.4145e+10   7508013                      
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

A final and quick trick to confirm the presence of space-time effects is to interact a proxy for space or time with one of the variables as a dummy term. Here this is done with tfa (total floor area), which unsurprisingly is highly correlated with the target variable (priceper):

# Dummy with LADs
m_dummy1 <- lm(priceper ~ tfa:lad, data = hp_data)
summary(m_dummy1)
anova(m_dummy1)

This shows that there are important and significant differences between the first London borough (which has a lad value equal to "E09000001" and is not shown in the model summary) and many of the rest that are, suggesting some important spatial differences in the interaction of floor area and price in different boroughs. This is confirmed by the F values in the ANOVA.

Similarly, an even larger effect with time is demonstrated when days is used as the dummy variable:

# Dummy with Time
m_dummy2 <- lm(priceper ~ tfa:days, data = hp_data)
summary(m_dummy2)
anova(m_dummy2)

Thus, despite the models being globally weak with low \(R^2\) values, there is evidence of spatial and temporal interactions with the target variable that warrant further, more formal exploration with respect to potential space-time trends. GAMs with smooths offer a route to investigate these.

2.2 Detecting variability using GAM smooths

GAMs with splines (or smooths - the terms are used interchangeably) can be used to explore and quantify spatial and temporal variations. However, before considering these aspects, it is important to outline what splines do.

Splines help to fit a (smooth) curve through a cloud of messy data points. They generate flexible fits (curves) able to adapt to the shape of the data. Because of this, they are able to describe the relationships between variables better than a straight line. For example, consider the scatterplot of some simulated data below: it would be difficult to fit a straight line through it.

# create simulated data 
set.seed(12)
x  <- runif(500)
mu <- sin(2 * (4 * x - 2)) + 2 * exp(-(16 ^ 2) * ((x - .5) ^ 2))
y  <- rnorm(500, mu, .3)
# plot x and y
ggplot() + 
    geom_point(aes(x,y)) +
  theme_bw()
Simulated x and y data.

The code below uses a GAM spline to determine the relationship between x and y without having to pre-specify any particular form (e.g. quadratic, exponential, etc). The wiggliness of the fit is determined automatically by the s function. This is done in different ways depending on the type of smooth that is specified (thin plate regression splines are the default in s from the mgcv package). Effectively, the smooth splits the data into chunks and fits piecewise relationships, rather than fitting a single straight line as in a standard linear regression.

# a GAM illustration with a spline using s
gam_s_example <- gam(y~s(x))
# extract the smooth fit
y.s <- gam_s_example$fitted.values
# plot
ggplot() +
  geom_point(aes(x,y), col = "grey") +
    geom_line(aes(x, y = y.s), lwd = 1) +
  theme_bw()
Simulated x and y data with a GAM spline fitted.

The purpose of this diversion into splines and smooths is to illustrate how they model different relationships (slopes) with y at different locations within the variable feature space (the x axis in the above example). In this way spline smooth curves can be used to capture non-linear relationships in attribute space (here the relationship of x with y). Importantly, in the context of space-time varying coefficient modelling with stgam, a spline can be used to model how the relationship between x and y varies with respect to time or location if the smooth is also parameterised with those, as the sketch below illustrates.
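
A minimal simulated sketch of this idea (the names tt, x2, y2 and gam_vc are illustrative, not part of the case study): the slope of x2 on y2 drifts with a notional time variable tt, and the varying coefficient smooth s(tt, by = x2) recovers that drift.

# simulate a slope that drifts with a notional time variable tt
set.seed(12)
tt <- runif(500)
x2 <- rnorm(500)
y2 <- (1 + 2 * tt) * x2 + rnorm(500, 0, 0.3)
# the by argument turns the smooth into a coefficient that varies with tt
gam_vc <- gam(y2 ~ s(tt, by = x2))
# the fitted s(tt):x2 term should approximate the true slope, 1 + 2*tt
plot(gam_vc)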

2.3 Temporal variability

It would be useful to examine how the target variable, priceper, varies over time. One way of doing this is to extend the use of the GAM smooths illustrated above. The code snippet below models the variation in priceper, but this time using time (the days variable).

# the first GAM 
gam.1 <- gam(priceper~s(days), data = hp_data)
summary(gam.1)
#> 
#> Family: gaussian 
#> Link function: identity 
#> 
#> Formula:
#> priceper ~ s(days)
#> 
#> Parametric coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)  6832.48      64.11   106.6   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Approximate significance of smooth terms:
#>           edf Ref.df     F p-value    
#> s(days) 3.798  4.708 15.76  <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> R-sq.(adj) =  0.0374   Deviance explained = 3.93%
#> GCV = 7.7786e+06  Scale est. = 7.7588e+06  n = 1888

The smooth can be investigated by plotting it against time. Note the use of the actual date variable dot rather than days in the code below to give a friendly x-axis in the plot, and the use of the predict function to extract the standard errors:

# create a data frame with x, predicted y, standard error
x <- hp_data$dot
y <- gam.1$fitted.values
se <- predict(gam.1, se = TRUE, hp_data)$se.fit
u <- y+se
l <- y-se
df <- data.frame(x, y, u, l)
# plot!
ggplot(df, aes(x, y, ymin = l, ymax = u)) + 
  geom_ribbon(fill = "lightblue") + 
  geom_line() + 
  theme_bw() + 
  xlab("Date") + ylab("priceper") 
A plot of the temporal smooth.

This shows variation in the modelled relationship of priceper with time. The code above extracts the predicted values (\(\hat{y}\)) of priceper from the model and plots them against time, using the actual date rather than the continuous days variable. The plot shows how the target variable varies with time, potentially reflecting the effects of the pandemic, with increases in the value of homes with their own gardens and outdoor spaces (even small ones!) over apartments or flats. This provides confirmation of the potential suitability of a temporal analysis of priceper.

2.4 Spatial variability

Spatially varying trends can be examined in a similar way, again using a standard s spline, this time with location.

# the second GAM
gam.2 <- gam(priceper~s(X,Y), data = hp_data)
summary(gam.2)
#> 
#> Family: gaussian 
#> Link function: identity 
#> 
#> Formula:
#> priceper ~ s(X, Y)
#> 
#> Parametric coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)  6832.48      43.62   156.6   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Approximate significance of smooth terms:
#>          edf Ref.df     F p-value    
#> s(X,Y) 27.47  28.85 81.52  <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> R-sq.(adj) =  0.554   Deviance explained = 56.1%
#> GCV = 3.6469e+06  Scale est. = 3.5919e+06  n = 1888
plot(gam.2, asp = 1)
A mgcv plot of the spatial smooth.

Notice that the model fit (\(R^2\)) has increased substantially from the previous models, suggesting some important spatial effects. Again the smooth can be plotted, but this time it is a two-dimensional surface rather than a line. However, it would be useful to visualise this as a surface rather than just over the hp_data point locations. The code below uses the predict function to model the relationship over a hexagonal grid (a surface) covering the extent of the London boroughs, created from lb, which contains X and Y attributes giving the location of each hexagonal cell centroid.

# 1. create a grid object covering the study area from the lb data
l_grid <- 
  st_make_grid(lb, square=FALSE,n=50) |> 
  st_sf() |> 
  st_join(lb) |> 
  filter(!is.na(name))
# rename the geometry, sort row names 
st_geometry(l_grid) = "geometry"
rownames(l_grid) <- 1:nrow(l_grid)
# create and add coordinates X and Y
coords <- l_grid |> st_centroid() |> st_coordinates() 
#> Warning: st_centroid assumes attributes are constant over geometries
l_grid <- l_grid |> bind_cols(coords) 

You may wish to inspect this object:

l_grid
plot(st_geometry(l_grid))

Before continuing with the mapping procedure:

# 2. predict over the grid
yhat <- predict(gam.2, newdata = l_grid)
l_grid |> mutate(yhat = yhat) |>
  # 3. and plot
  ggplot() + 
  geom_sf(aes(fill = yhat), col = NA) +
  # adjust default shading
  scale_fill_continuous_c4a_seq("brewer.yl_or_rd", name = "priceper") +
  # add context
  geom_sf(data = lb, fill = NA) +
  # apply and modify plot theme
  theme_bw() +
  theme(legend.position = "bottom",
        legend.key.width = unit(1, "cm"))
A map of the smoothed (predicted) response variable over the hexagonal grid.

This models the spatial trends (variations over space) of the target variable. The map indicates high values in the centre of the study area, as in the initial map of the data earlier, and again provides confirmatory evidence that some spatial modelling of priceper is appropriate.

2.5 Spatial and temporal variability I

The spline approach can be extended to model the target variable in space-time by including the location and time variables in the smooth. However, a slightly different set of considerations is required when mixing space and time in mgcv smooths. Consider the code snippet below for the space-time smooth: it specifies a smooth using the t2 function rather than s and contains additional parameters in the d argument.

# the third GAM
gam.3 <- gam(priceper~t2(X,Y,days, d = c(2,1)), data = hp_data)
summary(gam.3)
#> 
#> Family: gaussian 
#> Link function: identity 
#> 
#> Formula:
#> priceper ~ t2(X, Y, days, d = c(2, 1))
#> 
#> Parametric coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)  6832.48      42.57   160.5   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Approximate significance of smooth terms:
#>                edf Ref.df     F p-value    
#> t2(X,Y,days) 47.87  52.07 40.12  <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> R-sq.(adj) =  0.576   Deviance explained = 58.6%
#> GCV = 3.5118e+06  Scale est. = 3.4209e+06  n = 1888

It is important to consider the difference between s(X,Y,days) (which would be a continuation of the format above, but has not been run) and t2(X,Y,days,d=c(2,1)) before examining the resultant GAM (gam.3). The spline functions s() and t2() can both model smooth interactions between multiple variables, but they handle dimension scaling, marginal smoothness, and basis construction in different ways.
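
As a quick, illustrative comparison (a sketch, not part of the main analysis; gam.iso is an assumed name): s(X,Y,days) builds a single isotropic thin plate basis that treats kilometres and 100s of days as directly comparable units, whereas t2(X,Y,days,d=c(2,1)) combines a 2D spatial and a 1D temporal marginal basis, each with its own smoothing parameter. Their GCV scores can be compared directly:

# fit the isotropic alternative and compare GCV with the tensor product form
gam.iso <- gam(priceper ~ s(X, Y, days), data = hp_data)
c(isotropic = gam.iso$gcv.ubre, tensor = gam.3$gcv.ubre)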

Thus, in this case a tensor product (TP) smooth is specified because space and time are combined. The code below examines the results and uses the mgcv plot function to show the spatial distributions over 9 discretised time chunks (~310 days each).

plot(gam.3, asp = 1)

However, it is also possible to slice and dice the time variable in other ways. The code snippet below illustrates how this can be done for 365 day periods using the calculate_vcs function in the stgam package. If predictor variables are included in the terms parameter then the function returns coefficient estimates ("b_") and standard errors ("se_") for each of the terms specified. However, in this case we are just interested in the predicted value of the target variable, yhat. The code below uses the grid object created above (l_grid) as a spatial framework to hold the prediction of the priceper variable over these discrete time periods.

# 1. create time intervals (see the creation of days variable above)
pred_days = seq(365, 2555, 365)/100
# 2. create coefficient estimates for each time period (n = 7) 
res_out <- NULL
for (i in pred_days){
  res.i <- calculate_vcs(input_data = l_grid |> mutate(days = i),
                         mgcv_model = gam.3,
                         terms = NULL)
  res_out <- cbind(res_out, res.i$yhat)
}
# 3. name with years and join to the grid
colnames(res_out) <- paste0("Y_", 2018:2024)

l_grid |> cbind(res_out) |>
  # select the variables and pivot longer
  select(-name, -lad, -X, -Y) |>
  pivot_longer(-geometry) |>
  # make the new days object a factor (to enforce plotting order)
  mutate(name = factor(name, levels = paste0("Y_", 2018:2024))) |>
  
  # 4. and plot 
  ggplot() + 
  geom_sf(aes(fill = value), col = NA) +
  # adjust default shading
  scale_fill_continuous_c4a_seq("brewer.yl_or_rd", name = "Predicted \n'priceper'") +
  # facet
  facet_wrap(~name, ncol = 3) +
  # apply and modify plot theme
  theme_bw() + 
  theme(
    legend.position = "inside",
    legend.direction = "horizontal",
    legend.position.inside = c(0.7, 0.15), 
      legend.text=element_text(size=10),
      legend.title=element_text(size=12),
      strip.background = element_rect(fill="white", colour = "white"), 
      strip.text = element_text(size = 8, margin = margin(b=4)),
      legend.key.width = unit(1.5, "cm"),
        axis.title=element_blank(),
        axis.text=element_blank(),
    axis.ticks=element_blank())   
A plot of a spatial smooth over 7 approximately annual time periods.

This time series of maps shows that, without any explanatory variables, the trend in priceper varies spatially and changes in intensity over time (the increase and decrease in values over space), with limited variation in spatial pattern over time.

2.6 Spatial and temporal variability II

The formula used to construct the gam.3 model in the previous subsection specified a smooth that implicitly combined space and time, under an assumption that the variations of priceper in space and time interact with each other. However, this may not be the case. More generally, spatial and temporal dependencies might not be expected to interact in data for a relatively small region, whose observations might be expected to be subject to the same socio-economic or environmental pressures; in such cases, any space and time effects might be expected to be independent of each other. In a national study, by contrast, any space and time effects might be expected to interact more strongly.

It is possible to use a different construction with separate smooths for space and time, using s() splines, to avoid the assumption that space and time effects interact:

# the fourth GAM
gam.4 <- gam(priceper~s(X,Y) + s(days), data = hp_data)
summary(gam.4)
#> 
#> Family: gaussian 
#> Link function: identity 
#> 
#> Formula:
#> priceper ~ s(X, Y) + s(days)
#> 
#> Parametric coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)  6832.48      42.45     161   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Approximate significance of smooth terms:
#>            edf Ref.df     F p-value    
#> s(X,Y)  27.509 28.861 83.76  <2e-16 ***
#> s(days)  3.536  4.392 23.88  <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> R-sq.(adj) =  0.578   Deviance explained = 58.5%
#> GCV = 3.4606e+06  Scale est. = 3.4019e+06  n = 1888

Again there is evidence of spatial and temporal trends in the data, and these can be visualised using the plot.gam function in mgcv (called with just plot).

plot(gam.4, page = 1)
A mgcv plot of the spatial and temporal smooths.

Note that the space-time predictions could be generated in the same way as for the gam.3 model above, using the location (X and Y) values in l_grid and the time slices in pred_days, as in the sketch below.
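
A minimal sketch of this (res_g4 is an illustrative name; l_grid and pred_days are reused from above):

# predict gam.4 over the grid for each time slice
res_g4 <- sapply(pred_days, function(i)
  predict(gam.4, newdata = l_grid |> mutate(days = i)))
colnames(res_g4) <- paste0("Y_", 2018:2024)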

2.7 Including other predictor variables

Up until now, only the response (target) variable, priceper, has been examined for variations in time and space, using GAM smooths specified in different ways. This was to explore the space-time variability of the variable, but also to introduce GAM smooths. The code below includes the pef predictor variable in a GAM model as a fixed parametric term (i.e. in the same way as in the OLS model created above).

# the fifth GAM
gam.5 <- gam(priceper~s(X,Y) + s(days) + pef, data = hp_data)
summary(gam.5)
#> 
#> Family: gaussian 
#> Link function: identity 
#> 
#> Formula:
#> priceper ~ s(X, Y) + s(days) + pef
#> 
#> Parametric coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept) 8826.648    408.776  21.593  < 2e-16 ***
#> pef          -24.802      5.057  -4.905 1.02e-06 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Approximate significance of smooth terms:
#>            edf Ref.df     F p-value    
#> s(X,Y)  27.507 28.861 77.85  <2e-16 ***
#> s(days)  3.473  4.316 24.89  <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> R-sq.(adj) =  0.583   Deviance explained =   59%
#> GCV = 3.4216e+06  Scale est. = 3.3618e+06  n = 1888

The model summary indicates that pef is a significant predictor of priceper and improves the model fit. However, this raises a key question: should predictor variables be included in the model in parametric form or in a smooth? This is addressed in the next section.

As before the smooths can be plotted:

plot(gam.5, page = 1) 

2.8 Summary

This section has introduced GAMs with smooths (splines) as a way of investigating any space-time effects present in the data. The mechanics of smooths were described and illustrated, and a brief introduction to different smooth forms was provided. Variations in the response variable (priceper) in time, space and space-time were explored through elementary model fitting and by plotting the smooth trends. In this case, effects in space and time are present in the priceper response variable, and it looks like there is a universal spatial trend at all times and a universal temporal trend everywhere. The final GAM model included an explanatory variable (pef) which added some explanatory power, but this was not examined with respect to space or time (i.e. in smooths). This is done in the next section.

Such investigations are an important initial step. They provide evidence of space-time variations in the response variable - i.e. whether the effects in space and time are present in the data - and can help determine which predictor variables may be of further interest. They guide subsequent analysis and avoid making assumptions about the presence of space-time interactions, for example by plugging every predictor variable into a space-time smooth of some kind.

3. Effects of Time and Space with predictor variables

The previous section examined variability of the response variable in time, space and space-time (the latter in 2 different ways). In this section these investigations are extended to consider the variation of the response variable (priceper) with the pef predictor variable, but this time with the Intercept as an addressable term in the model.

Consider the gam.5 model created above:

gam.5  <- gam(priceper~s(X,Y) + s(days) + pef, data = hp_data)
summary(gam.5)

This includes the terms s(X,Y) + s(days). These model the spatially varying and temporally varying Intercept plus error. Note that if this were a spatial model (without time), then just s(X,Y) would be included for the Intercept, and similarly just s(days) for a purely temporal model.

The gam.5 model can be reformulated as follows with an addressable Intercept term:

gam.5.new <- gam(priceper ~0 + Intercept + s(X,Y,by=Intercept) + s(days, by=Intercept) + pef,
                data = hp_data |> mutate(Intercept = 1))
summary(gam.5.new)
#> 
#> Family: gaussian 
#> Link function: identity 
#> 
#> Formula:
#> priceper ~ 0 + Intercept + s(X, Y, by = Intercept) + s(days, 
#>     by = Intercept) + pef
#> 
#> Parametric coefficients:
#>           Estimate Std. Error t value Pr(>|t|)    
#> Intercept 8826.648    408.776  21.593  < 2e-16 ***
#> pef        -24.802      5.057  -4.905 1.02e-06 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Approximate significance of smooth terms:
#>                      edf Ref.df     F p-value    
#> s(X,Y):Intercept  27.507 28.861 77.85  <2e-16 ***
#> s(days):Intercept  3.473  4.316 24.89  <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> R-sq.(adj) =  0.583   Deviance explained =   94%
#> GCV = 3.4216e+06  Scale est. = 3.3618e+06  n = 1888

The model summary now includes Intercept as a parametric term (rather than (Intercept)), with s(X,Y):Intercept and s(days):Intercept as smooth terms rather than s(X,Y) and s(days). The coefficient estimates, standard errors, significance values, adjusted \(R^2\) and GCV are all the same in both summaries (the deviance explained differs because removing the default intercept changes the null model against which deviance is measured).
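
This equivalence can be verified directly; a minimal check (the fitted values should agree to within numerical tolerance):

# confirm the two parameterisations produce the same fitted values
all.equal(fitted(gam.5), fitted(gam.5.new))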

The formulas in subsequent models in this vignette include the Intercept as an addressable term. The code below creates this variable in the input data:

hp_data <- 
  hp_data |> 
  mutate(Intercept = 1)

3.1 Time

The code below specifies a GAM regression model with smooths for space and time as in the previous section, but now with the pef predictor variable included both in parametric form and in a temporal smooth with days. The model suggests, in this case, that it is important to model this relationship over time.

gam.t <- gam(priceper~0 + Intercept + s(X,Y,by=Intercept) + s(days, by=Intercept) + 
              pef + s(days, by = pef), 
            data = hp_data)
summary(gam.t)
#> 
#> Family: gaussian 
#> Link function: identity 
#> 
#> Formula:
#> priceper ~ 0 + Intercept + s(X, Y, by = Intercept) + s(days, 
#>     by = Intercept) + pef + s(days, by = pef)
#> 
#> Parametric coefficients:
#>           Estimate Std. Error t value Pr(>|t|)    
#> Intercept 8738.828    410.218  21.303  < 2e-16 ***
#> pef        -11.846      2.538  -4.668 3.27e-06 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Approximate significance of smooth terms:
#>                      edf Ref.df      F p-value    
#> s(X,Y):Intercept  27.526 28.864 78.064 < 2e-16 ***
#> s(days):Intercept  3.507  4.358  4.389 0.00119 ** 
#> s(days):pef        1.500  1.500  6.255 0.00294 ** 
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Rank: 49/50
#> R-sq.(adj) =  0.584   Deviance explained =   94%
#> GCV = 3.4159e+06  Scale est. = 3.3544e+06  n = 1888

The summary of this model indicates that the relationship of pef with priceper changes over time and has a strong negative linear trend (the smooth s(days):pef is significant, as is the pef parametric term). The nature of the temporally varying coefficients can be examined using the calculate_vcs function in the stgam package. This returns a tibble with coefficient estimates ("b_") and standard errors ("se_") for each of the terms specified.

vcs <- calculate_vcs(input_data = hp_data, mgcv_model = gam.t, terms = c("Intercept", "pef"))
head(vcs)
#> # A tibble: 6 × 22
#>   price priceper   tfa dot         days   yot  beds type    cef   pef ageb     lad       X     Y     Xo     Yo Intercept
#>   <dbl>    <dbl> <dbl> <date>     <dbl> <dbl> <int> <chr> <int> <int> <chr>    <chr> <dbl> <dbl>  <int>  <int>     <dbl>
#> 1 1000     7463.   134 2023-02-22 18.8   2023     9 T        60    82 1930-19… E090…  532.  189. 531674 188707         1
#> 2  520     7879.    66 2022-05-27 16.0   2022     3 T        57    74 1930-19… E090…  537.  189. 536888 189269         1
#> 3  325     5078.    64 2021-01-15 11.1   2021     4 T        50    82 1950-19… E090…  550.  191. 549763 190717         1
#> 4  300     2400    125 2020-01-17  7.44  2020     5 T        64    85 1900-19… E090…  539.  187. 538646 186513         1
#> 5  585     7222.    81 2020-11-13 10.4   2020     4 T        75    88 1983-19… E090…  515.  173. 514710 172535         1
#> 6  618.    6786.    91 2023-05-18 19.6   2023     4 T        61    84 1950-19… E090…  533.  170. 533233 170230         1
#> # ℹ 5 more variables: b_Intercept <dbl>, se_Intercept <dbl>, b_pef <dbl>, se_pef <dbl>, yhat <dbl[1d]>

The temporal variation of the relationship of pef with the target variable can be plotted, and in this case the linear trend is confirmed: the relationship of pef with the target variable becomes increasingly negative, declining linearly over time.

vcs |> 
  mutate(u = b_pef + se_pef, 
         l = b_pef - se_pef) |>
  ggplot(aes(x = dot, y = b_pef, ymin = l, ymax = u)) + 
  geom_ribbon(fill = "lightblue") + 
  geom_line() + 
  theme_bw() + 
  xlab("Date") + ylab("pef") 
A plot of the temporal smooth for pef.

3.2 Space

The spatial variation in the predictor variable's relationship with the target variable can also be examined. The code below specifies pef within a spatial smooth, but again with a spatially and temporally varying intercept. Notice how pef is included both in the smooth and as a parametric (global) term.

gam.s <- gam(priceper~0 + Intercept + s(X,Y,by=Intercept) + s(days, by=Intercept) + 
              pef + s(X,Y, by = pef), 
            data = hp_data)
summary(gam.s)
#> 
#> Family: gaussian 
#> Link function: identity 
#> 
#> Formula:
#> priceper ~ 0 + Intercept + s(X, Y, by = Intercept) + s(days, 
#>     by = Intercept) + pef + s(X, Y, by = pef)
#> 
#> Parametric coefficients:
#>           Estimate Std. Error t value Pr(>|t|)    
#> Intercept   8229.6      442.6  18.593   <2e-16 ***
#> pef         -111.8      168.1  -0.665    0.506    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Approximate significance of smooth terms:
#>                      edf Ref.df      F p-value    
#> s(X,Y):Intercept  29.000 29.000  5.742 < 2e-16 ***
#> s(days):Intercept  3.536  4.391 24.374 < 2e-16 ***
#> s(X,Y):pef        21.034 25.628  2.067 0.00122 ** 
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Rank: 69/70
#> R-sq.(adj) =  0.594   Deviance explained = 94.2%
#> GCV = 3.3672e+06  Scale est. = 3.269e+06  n = 1888

Here it is evident that pef varies significantly over space (see s(X,Y):pef) but is not significant as a global fixed term. That is, the global slope coefficient estimate is not significantly different from zero, but it is when allowed to vary over space, with the Intercept varying over space and time. This finding could be related to the age of the houses, which were built in clusters in different locations.

Again the varying coefficients for the predictor variable can be extracted from the model and mapped, both at the observation locations and over the hexagonal grid, provided a time period for the Intercept smooth in gam.s is specified:

# 1.over observation locations
vcs <- calculate_vcs(input_data = hp_data, 
                     mgcv_model = gam.s, 
                     terms = c("Intercept", "pef"))
tit <-expression(paste(""*beta[`pef`]*"")) 
p1 <- 
  ggplot() + geom_sf(data = lb, col = "lightgrey") +
  geom_point(data = vcs, aes(x = X, y = Y, colour = b_pef), alpha = 1) + 
  scale_colour_continuous_c4a_div("brewer.rd_yl_bu", name = tit) +
  theme_bw() + 
  theme(legend.position = "bottom", 
        legend.key.width = unit(1, "cm"),) + 
  xlab("") + ylab("")
# 2. over grid - recall it needs an intercept term and a days value!
vcs <- calculate_vcs(input_data = l_grid |> mutate(Intercept = 1, days = mean(hp_data$days)), 
                     mgcv_model = gam.s, 
                     terms = c("Intercept", "pef"))
p2 <- 
  ggplot() + 
  geom_sf(data = vcs, aes(fill = b_pef), col = NA) + 
  scale_fill_continuous_c4a_div("brewer.rd_yl_bu", name = tit) +
  theme_bw()+
  theme(legend.position = "bottom", 
        legend.key.width = unit(1, "cm"),) + 
  xlab("") + ylab("")
plot_grid(p1, p2)
Maps of the spatial smooth for pef over the original observations and the hexagonal grid.

Here the relationship of pef, the Potential energy efficiency rating, varies over space with a distinct inflection from negative values in the centre of the study area to positive ones in outer regions.

3.3 Space-Time I

It is also possible to examine the space-time dependencies between the target variable and the predictor variables. The code below specifies a GAM with a space-time smooth for the pef variable and a spatially and temporally varying intercept. As before, a TP smooth is used, specified with the dimensions of each basis (a 2D basis for X and Y; a 1D basis for days), as different degrees of smoothness are expected across space and time. In this case, the pef variable is significant globally and also varies significantly over space and time (t2(X,Y,days):pef).

gam.st1 <- gam(priceper~0 + Intercept + s(X,Y,by=Intercept) + s(days, by=Intercept) +  
                pef + t2(X,Y,days, d= c(2,1), by = pef), 
              data = hp_data)
summary(gam.st1)
#> 
#> Family: gaussian 
#> Link function: identity 
#> 
#> Formula:
#> priceper ~ 0 + Intercept + s(X, Y, by = Intercept) + s(days, 
#>     by = Intercept) + pef + t2(X, Y, days, d = c(2, 1), by = pef)
#> 
#> Parametric coefficients:
#>           Estimate Std. Error t value Pr(>|t|)    
#> Intercept 8686.866    411.590  21.106  < 2e-16 ***
#> pef        -10.827      2.262  -4.786 1.83e-06 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Approximate significance of smooth terms:
#>                      edf Ref.df      F  p-value    
#> s(X,Y):Intercept  27.514 28.862 65.183  < 2e-16 ***
#> s(days):Intercept  3.519  4.372  4.383  0.00119 ** 
#> t2(X,Y,days):pef   5.555  5.556  5.862 1.63e-05 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Rank: 164/165
#> R-sq.(adj) =  0.584   Deviance explained =   94%
#> GCV = 3.4246e+06  Scale est. = 3.3556e+06  n = 1888

When the space-time varying coefficients are examined (see below), the increasingly negative relationship of pef with the target variable over time is evident, and the spatial distributions generally indicate a weaker relationship with the target variable in the east of the study area and an increasingly negative one to the west.

# calculate the varying coefficient estimates
vcs <- calculate_vcs(input_data = hp_data, mgcv_model = gam.st1, terms = c("Intercept", "pef"))
# temporal trends
p_time <- 
  vcs |> 
  select(dot, b_Intercept, b_pef) |> 
  pivot_longer(-dot) |>
  mutate(name = recode(name, 
                       "b_Intercept" = '""*beta[Intercept]',
                       "b_pef" = '""*beta[pef]')) |>
  ggplot(aes(x = dot, y = value)) +
  geom_point(alpha = 0.1) + 
  geom_smooth() + 
  facet_wrap(~name,  labeller = label_parsed, scale = "free", ncol = 1) +
  theme_bw() + xlab("Year") + ylab("") 
# spatial trends
tit <-expression(paste(""*beta[`Intercept`]*"")) 
p_sp1 <- 
  ggplot() + geom_sf(data = lb, col = "lightgrey") +
  geom_point(data = vcs, aes(x = X, y = Y, colour = b_Intercept), alpha = 1) + 
  scale_colour_continuous_c4a_seq("brewer.yl_gn_bu", name = tit) +
  theme_bw() + 
  xlab("") + ylab("")
tit <-expression(paste(""*beta[`pef`]*"")) 
p_sp2 <- 
  ggplot() + geom_sf(data = lb, col = "lightgrey") +
  geom_point(data = vcs, aes(x = X, y = Y, colour = b_pef), alpha = 1) + 
  scale_colour_continuous_c4a_div("brewer.rd_yl_bu", name = tit) +
  theme_bw() + 
  xlab("") + ylab("")

plot_grid(p_time, plot_grid(p_sp1, p_sp2, ncol = 1), nrow = 1, rel_widths = c(3.5,6))
Summaries of coefficient estimates from the space-time smooths for pef.

Of course, it is also possible to examine discrete time slices of the spatial distribution of the relationship of pef with priceper. This can be done over grid surfaces as before. Notice in the code below how the terms for the Intercept and pef are extracted for each time period in the looped call to calculate_vcs. The east-west trend and its changes over time are confirmed.

# 1. create time intervals (as above)
pred_days = seq(365, 2555, 365)/100
# 2. create coefficient estimates for each time period (n = 7) 
res_out <- matrix(nrow = nrow(l_grid), ncol = 0)
for (i in pred_days){
  res.i <- calculate_vcs(input_data = l_grid |> mutate(days = i),
                         mgcv_model = gam.st1,
                         terms = c("Intercept", "pef"))
  # select just the coefficient estimates of interest
  res.i <- res.i |> st_drop_geometry() |> select(starts_with("b_pef")) 
  res_out <- cbind(res_out, res.i)
}

# 3. name with years and join to the grid
colnames(res_out) <- paste0("Y", "_", 2018:2024)
# define a title
tit <-expression(paste(""*beta[`pef`]*"")) 
l_grid |> cbind(res_out) |>
  # select the variables and pivot longer
  select(starts_with("Y_")) |> 
  # rename
  rename(`2018` = "Y_2018", `2019` = "Y_2019", `2020` = "Y_2020",
         `2021` = "Y_2021", `2022` = "Y_2022", `2023` = "Y_2023", `2024` = "Y_2024") |>
  pivot_longer(-geometry) |>
  # make the new days object a factor (to enforce plotting order)
  mutate(name = factor(name, levels = colnames(res_out))) |>
  
   # 4. and plot 
  ggplot() + 
  geom_sf(aes(fill = value), col = NA) +
  # adjust default shading
  scale_fill_continuous_c4a_div(name = "Potential Energy\nEfficiency") +
  # facet
  facet_wrap(~name, ncol = 3) +
  # apply and modify plot theme
  theme_bw() + 
  theme(
    legend.position = "inside",
    legend.direction = "horizontal",
    legend.position.inside = c(0.7, 0.15), 
      legend.text=element_text(size=10),
      legend.title=element_text(size=12),
      strip.background = element_rect(fill="white", colour = "white"), 
      strip.text = element_text(size = 8, margin = margin(b=4)),
      legend.key.width = unit(1.5, "cm"),
        axis.title=element_blank(),
        axis.text=element_blank(),
    axis.ticks=element_blank())  
The changes over time of the spatial distribution of the pef coefficient estimate.

3.4 Space-Time II

The previous section created a GAM model with a single combined space-time smooth for the pef predictor variable. It is also possible to include space and time within separate smooths, as was done to investigate the variability of the target variable. The code snippet below does this, and again includes a space-time varying intercept.

gam.st2 <- gam(priceper~0 + Intercept + 
                 s(X,Y,by=Intercept) + s(days, by=Intercept) + 
                 pef + s(X,Y, by = pef) + s(days, by = pef), 
               data = hp_data)
summary(gam.st2)
#> 
#> Family: gaussian 
#> Link function: identity 
#> 
#> Formula:
#> priceper ~ 0 + Intercept + s(X, Y, by = Intercept) + s(days, 
#>     by = Intercept) + pef + s(X, Y, by = pef) + s(days, by = pef)
#> 
#> Parametric coefficients:
#>           Estimate Std. Error t value Pr(>|t|)    
#> Intercept 8204.971    442.323  18.550   <2e-16 ***
#> pef         -5.215     13.861  -0.376    0.707    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Approximate significance of smooth terms:
#>                      edf Ref.df     F p-value    
#> s(X,Y):Intercept  28.000 28.000 5.870 < 2e-16 ***
#> s(days):Intercept  3.579  4.443 3.896 0.00277 ** 
#> s(X,Y):pef        21.654 26.022 2.039 0.00156 ** 
#> s(days):pef        1.333  1.333 1.432 0.32924    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Rank: 77/80
#> R-sq.(adj) =  0.595   Deviance explained = 94.2%
#> GCV = 3.3618e+06  Scale est. = 3.2623e+06  n = 1888

Here the fixed global term is again not significant and, interestingly, neither is the separate temporal smooth, as confirmed by the plots of the smooths. Taken together, this confirms what was found in the previous subsection with a combined TP smooth: there is a spatial trend in the relationship of pef with priceper that changes in intensity over time, but the spatial interaction itself does not change over time. This suggests a super-imposition of spatial and temporal trends: the temporal smooth with days is not significant.

plot(gam.st2, pages = 1)
mgcv plots of the separate space and time smooths for the pef predictor variable.

3.5 Summary

This section has applied smooths to a single predictor variable. These have explored time, space and space-time, by constructing smooths parameterised with time and location. These allow changes in the relationship between the predictor variable and the target variable to be examined over space and time. Importantly, a space-time GAM was constructed in 2 different ways: by including space and time in a single combined smooth, and within separate smooths. These reflect different assumptions about the nature of the space-time dependencies and interactions in the data, and can result in different inferences (understanding) about the process when the significant components of the model are considered. This may not be important if the objective of model construction is prediction, but it is if the objective is process understanding, as is commonly the case in geographical analyses of space-time data.

There are 2 important implications that follow from this:

  1. It suggests the need to consider model form, and to determine how space and time should be included in smooths. This is done in the next section using an automated approach to model selection.
  2. This avoids super-imposing space-time trends by simply constructing a single space-time smooth. In the example above, the spatial effect was clear but was proportionately the same over time, i.e. the mixed partial derivative \(\partial^2 f / (\partial \text{Space} \, \partial \text{Time}) = 0\), as shown in the gam.st2 model above.

4. Working with stgam: model selection

The models constructed in Section 3 used a variety of smooths. For example, the Intercept was modelled as a parametric term with separate space and time smooths (in the gam.st2 model), and the pef predictor variable was included in parametric form alone (gam.5), in parametric form with a single space-time smooth (gam.st1) and in parametric form with separate space and time smooths (gam.st2). This poses the question of which form to use and how to specify the model: which is best? A key focus of the stgam package is to answer this question. It does this by creating and evaluating multiple models.

4.1 Using GCV to evaluate GAMs with smooths

One way to evaluate models is to compare them using some metric. Commonly AIC (Akaike 1998), corrected AIC (AICc) (Hurvich and Tsai 1989) or BIC (Schwarz 1978) are used to compare models as they provide parsimonious measures of model fit (balancing model complexity and prediction accuracy). However, there are some uncertainties over using these to compare mgcv GAM smooth models due to variations in the effective degrees of freedom (EDF) introduced by the smooths. The result is that model EDFs vary depending on how the smoothing parameters are selected, potentially leading to over-penalisation or under-penalisation in BIC calculations. As a result, the model Generalized Cross-Validation (GCV) score is recommended for evaluating and comparing mgcv GAMs (S. N. Wood 2017; Marra and Wood 2011). GCV provides an unbiased risk estimator of model fit. It is similar to BIC in that it balances model fit with complexity, but it does not suffer from some of the problems of applying BIC to GAMs (e.g. bias in non-parametric model comparison, over-penalising complex smooths, etc.). The best model is the one that minimises the GCV score.
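
For reference, in the Gaussian case the GCV score minimised by mgcv has the form below, where \(n\) is the number of observations, \(\mathbf{y}\) the response and \(\mathbf{A}\) the influence (hat) matrix of the fitted model, so that \(\mathrm{tr}(\mathbf{A})\) plays the role of the model's effective degrees of freedom (S. N. Wood 2017):

\[
\text{GCV} = \frac{n \, \lVert \mathbf{y} - \mathbf{A}\mathbf{y} \rVert^{2}}{\left[ n - \mathrm{tr}(\mathbf{A}) \right]^{2}}
\]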

The code below extracts the GCV score from each of the models in the last section and prints them in order. Here it can be seen that the GAM with pef specified in separate space and time smooths, gam.st2, generated the best model.

df <- data.frame(Model = c("Time", "Space", "Space-Time I", "Space-Time II"),
                 GCV = c(gam.t$gcv.ubre, gam.s$gcv.ubre, gam.st1$gcv.ubre, gam.st2$gcv.ubre))
# rank the models 
df |> arrange(GCV)
#>           Model     GCV
#> 1 Space-Time II 3361850
#> 2         Space 3367181
#> 3          Time 3415938
#> 4  Space-Time I 3424640

4.2 Model selection: determining model form

In a space-time model there are 6 options for each predictor variable:

  1. It is omitted.
  2. It is included as a parametric response with no smooth.
  3. It is included in parametric form and in a spatial smooth with location.
  4. It is included in parametric form and in a temporal smooth with time.
  5. It is included in parametric form and in a single space-time smooth.
  6. It is included in parametric form and in 2 separate space and time smooths.

The intercept can be treated similarly, but without it being absent (i.e. 5 options). Obviously there are 3 options for spatial models (i., ii. and iii.) and for temporal models (i., ii. and iv.), with each having 2 options for the intercept. Thus for a purely spatial or purely temporal regression with \(k\) predictor variables there are \(2 \times 3^k\) potential models, and for a space-time regression there are \(5 \times 6^k\) potential models to evaluate (see the quick calculation below).
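
These counts can be computed directly, for example for the \(k = 3\) case considered next:

# size of the model search space for k predictor variables
k <- 3
c(space_or_time = 2 * 3^k,  # purely spatial or purely temporal: 54
  space_time = 5 * 6^k)     # space-time: 1080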

Recall that the hp_data contains a number of numeric variables that could be of use in explaining the space-time variations in the priceper target variable:

head(hp_data)
#> # A tibble: 6 × 17
#>   price priceper   tfa dot         days   yot  beds type    cef   pef ageb     lad       X     Y     Xo     Yo Intercept
#>   <dbl>    <dbl> <dbl> <date>     <dbl> <dbl> <int> <chr> <int> <int> <chr>    <chr> <dbl> <dbl>  <int>  <int>     <dbl>
#> 1 1000     7463.   134 2023-02-22 18.8   2023     9 T        60    82 1930-19… E090…  532.  189. 531674 188707         1
#> 2  520     7879.    66 2022-05-27 16.0   2022     3 T        57    74 1930-19… E090…  537.  189. 536888 189269         1
#> 3  325     5078.    64 2021-01-15 11.1   2021     4 T        50    82 1950-19… E090…  550.  191. 549763 190717         1
#> 4  300     2400    125 2020-01-17  7.44  2020     5 T        64    85 1900-19… E090…  539.  187. 538646 186513         1
#> 5  585     7222.    81 2020-11-13 10.4   2020     4 T        75    88 1983-19… E090…  515.  173. 514710 172535         1
#> 6  618.    6786.    91 2023-05-18 19.6   2023     4 T        61    84 1950-19… E090…  533.  170. 533233 170230         1

These include cef (Current energy efficiency rating), pef (Potential energy efficiency rating) and beds (Number of bedrooms), as well as location (X and Y) and time (days). Thus with \(k = 3\) variables there are \(5 \times 6^3 = 1080\) potential models to evaluate in a space-time case study and \(2 \times 3^3 = 54\) potential models in a spatial case study. Here 2 of the variables are used to illustrate an evaluation of \(5 \times 6^2 = 180\) space-time models: pef as before, and beds.

The evaluate_models() function in the stgam package constructs and compares the full set of potential models, using the GCV score of each model to evaluate them. Note that in the code below ncores is set to 2 to pass CRAN diagnostic checks - you may want to specify more using detectCores()-1 from the parallel package. Remember that the input data needs to have a defined Intercept term, which can be created in the following way: input_data |> mutate(Intercept = 1) (this was done at the start of Section 3 for hp_data).

library(doParallel)
t1 <- Sys.time()
stvc_mods <- evaluate_models(
  input_data = hp_data,
  target_var = "priceper",
  vars = c("pef", "beds"),
  coords_x = "X",
  coords_y = "Y",
  VC_type = "STVC",
  time_var = "days",
  ncores = 2)
Sys.time() - t1  # about 14 minutes (6 minutes with 15 cores!)

The best \(n\) models (i.e. those with the lowest GCV scores) can be extracted using the gam_model_rank() function in the stgam package. Interestingly, the top 5 ranked models all have space and time specified for each predictor variable, but in different combinations of single and separate space-time smooths. The top 4 models all have the beds predictor variable specified in separate space and time smooths, suggesting that while there are spatial and temporal dependencies with the target variable, there are no interacting space-time ones.

mod_comp <- gam_model_rank(stvc_mods, n= 10)
# have a look
mod_comp |> select(-f) 
#>    Rank Intercept       pef      beds      GCV
#> 1     1     t2_ST s_T + s_S s_T + s_S 16813.82
#> 2     2 s_T + s_S s_T + s_S s_T + s_S 16815.01
#> 3     3     t2_ST     t2_ST s_T + s_S 16815.67
#> 4     4 s_T + s_S     t2_ST s_T + s_S 16822.48
#> 5     5 s_T + s_S s_T + s_S     t2_ST 16823.02
#> 6     6       s_S s_T + s_S s_T + s_S 16824.47
#> 7     7     t2_ST       s_S s_T + s_S 16826.02
#> 8     8     t2_ST s_T + s_S     t2_ST 16826.23
#> 9     9     t2_ST s_T + s_S       s_S 16827.10
#> 10   10 s_T + s_S       s_S s_T + s_S 16827.90

4.3 The best model

The best ranked model can be specified for use in analysis. This includes the Intercept in a single space-time TP smooth and separate space and time smooths for pef and beds.

First the formula is extracted and can be inspected:

f <- as.formula(mod_comp$f[1])
f
#> priceper ~ Intercept - 1 + t2(X, Y, days, d = c(2, 1), by = Intercept) + 
#>     s(X, Y, by = pef) + s(days, by = pef) + s(X, Y, by = beds) + 
#>     s(days, by = beds)

This formula is used to specify an mgcv GAM model using a REML approach; the choice of REML is described in Section 5 below. The resulting model is checked for over-fitting using the k.check function in the mgcv package. This underpins the gam.check function but does not display the diagnostic plots. Here the k-index values are near to 1, the k' and edf parameters are not close and, importantly, the edf values are all much less than the k' values, so this model is reasonably well tuned. NB if the k' and edf parameters are close for some smooths, then you may want to increase the k parameter in the relevant smooth, as sketched after the check below. Basis dimensions are automatically determined by the mgcv package but can be specified manually - see the mgcv help for k.check and choose.k.

# specify the model
gam.m <- gam(f, data = hp_data, method = "REML")
# check k
k.check(gam.m)
#>                         k'       edf   k-index p-value
#> t2(X,Y,days):Intercept 124 30.729788 0.8165558  0.0000
#> s(X,Y):pef              30  2.055193 0.6432838  0.0000
#> s(days):pef             10  3.183809 1.0422023  0.9675
#> s(X,Y):beds             30 24.222744 0.6432838  0.0000
#> s(days):beds            10  2.936562 1.0422023  0.9550
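
Had any of the k' and edf values been close, the basis dimension of the offending smooth could be increased via its k argument. A hypothetical sketch (not needed here; the choice of smooth and the value k = 60 are purely illustrative):

# hypothetical: enlarge the basis of the spatial smooth for pef
gam.m.k <- gam(priceper ~ Intercept - 1 +
                 t2(X, Y, days, d = c(2, 1), by = Intercept) +
                 s(X, Y, by = pef, k = 60) + s(days, by = pef) +
                 s(X, Y, by = beds) + s(days, by = beds),
               data = hp_data, method = "REML")
k.check(gam.m.k)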

A summary of the model can be examined; this shows that nearly all of the terms are significant, the exceptions being the spatial smooth for pef and the temporal smooth for beds.

summary(gam.m)
#> 
#> Family: gaussian 
#> Link function: identity 
#> 
#> Formula:
#> priceper ~ Intercept - 1 + t2(X, Y, days, d = c(2, 1), by = Intercept) + 
#>     s(X, Y, by = pef) + s(days, by = pef) + s(X, Y, by = beds) + 
#>     s(days, by = beds)
#> 
#> Parametric coefficients:
#>           Estimate Std. Error t value Pr(>|t|)    
#> Intercept  10040.1      454.6   22.09   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Approximate significance of smooth terms:
#>                           edf Ref.df     F  p-value    
#> t2(X,Y,days):Intercept 30.730 37.032 2.598 1.36e-06 ***
#> s(X,Y):pef              2.055  2.082 1.247  0.28327    
#> s(days):pef             3.184  3.673 4.121  0.00347 ** 
#> s(X,Y):beds            24.223 26.353 5.647  < 2e-16 ***
#> s(days):beds            2.937  3.349 1.062  0.31011    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Rank: 203/205
#> R-sq.(adj) =  0.601   Deviance explained = 61.5%
#> -REML =  16814  Scale est. = 3.2129e+06  n = 1888

4.4 Varying coefficient estimates

The spatially and temporally varying coefficient estimates generated by the model can be extracted in a number of different ways using the calculate_vcs function in the stgam package, as indicated in earlier sections. There are two options for generating the varying coefficient estimates: i) over the original data points or ii) over spatial surfaces for specific time slices.

For the first option the original data, the GAM model and a vector of the model terms (i.e. the predictor variable names) are simply passed to the calculate_vcs function. The second option requires slices of the space-time data to be passed to the calculate_vcs function and a bit more work to process the results. In both cases the results can be summarised and mapped.

Considering first the data of the original observations:

vcs <- calculate_vcs(input_data = hp_data, 
                    mgcv_model = gam.m, 
                    terms = c("Intercept", "pef", "beds"))

A summary over space and time of the coefficient estimates shows that the intercept is large and positive, and that pef and beds are generally negative but with some positive values in some places or at some times.

vcs |> select(starts_with("b_")) |>
  apply(2, summary) |> round(1)
#>         b_Intercept b_pef b_beds
#> Min.         4236.0 -60.0 -552.4
#> 1st Qu.      8303.6 -34.5 -369.7
#> Median      10187.8 -26.3 -260.3
#> Mean        10040.1 -26.1 -240.8
#> 3rd Qu.     11703.2 -17.6 -163.4
#> Max.        15778.7   7.7  388.0

These can be examined over time:

vcs |> 
  select(dot, starts_with("b_")) |> 
  rename(`Intercept` = b_Intercept,
         `Potential Energy Efficiency` = b_pef,
         `Bedrooms` = b_beds) |>
  pivot_longer(-dot) |>
  mutate(name = factor(name, 
                       levels=c("Intercept","Potential Energy Efficiency", "Bedrooms"))) |>
  group_by(dot, name) |>
  summarise(
    lower = quantile(value, 0.25),
    median = median(value),
    upper = quantile(value, 0.75)
  ) |>
  ggplot(aes(x = dot, y = median)) +
  geom_point(col = "blue", alpha = 0.2) +
  geom_smooth() +
  facet_wrap(~name, scale = "free_y") +
  theme_bw() + xlab("") + ylab("") +
  theme(strip.background = element_rect(fill="white"))
#> `summarise()` has grouped output by 'dot'. You can override using the `.groups` argument.
#> `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
The changes over time of the coefficient estimates of the final model.

Standard dplyr and ggplot approaches can be used to join and map the coefficient estimates as in the figure below, which summarises the coefficient estimates for pef (b_pef) by year.

# make spatial data  
vcs_sf <- 
  vcs |> 
  st_as_sf(coords = c("X", "Y"), remove = F) 
# plot 
ggplot()  + 
  geom_sf(data = lb) +
  geom_sf(data = vcs_sf, aes(col = b_pef)) + 
  scale_colour_continuous_c4a_div(palette="brewer.rd_bu", 
                                  name = "Potential\nEnergy Efficiency") + 
  facet_wrap(~yot) + 
  theme_bw() + 
  theme(
    legend.position = "inside",
    legend.direction = "horizontal",
    legend.position.inside = c(0.7, 0.15), 
      legend.text=element_text(size=10),
      legend.title=element_text(size=12),
      strip.background = element_rect(fill="white", colour = "white"), 
      strip.text = element_text(size = 8, margin = margin(b=4)),
      legend.key.width = unit(1.5, "cm"),
        axis.title=element_blank(),
        axis.text=element_blank(),
    axis.ticks=element_blank())  
The varying pef (Potential Energy Efficiency) coefficient estimates over space and time.

Finally, l_grid can be used to summarise the coefficient estimates over space and time, here for each year, in a slightly different way to the use of pred_days above. Notice how the for loop below combines all the coefficient estimates this time.

The maps reflect the relatively small changes over time in the relationship with priceper, as plotted above.

# create time slices
years <- 2018:2024
# calculate over the grid for each time slice
res_out <- matrix(nrow = nrow(l_grid), ncol = 0)
for (i in seq_along(years)) {
  # convert years to days (the days variable is in 100s of days since 2018)
  day.val <- (years[i] - 2018) * 365 / 100
  res.i <- calculate_vcs(input_data = l_grid |> mutate(days = day.val),
                         mgcv_model = gam.m,
                         terms = c("Intercept", "pef", "beds"))
  # select all the coefficient estimates 
  res.i <- 
    res.i |> 
    st_drop_geometry() |> 
    select(starts_with("b_"), 
           starts_with("se_")) 
  # rename them
  names(res.i) <- paste0(names(res.i), "_", years[i])
  # bind to the result
  res_out <- cbind(res_out, res.i)
  cat(years[i], "\t")
}
#> 2018     2019    2020    2021    2022    2023    2024    
# title
tit <- expression(beta[beds])
# join to the grid
l_grid |> cbind(res_out) |>
  # select the variables and pivot longer
  select(starts_with("b_beds")) |> 
  # rename
  rename(`2018` = "b_beds_2018", `2019` = "b_beds_2019", 
         `2020` = "b_beds_2020", `2021` = "b_beds_2021", 
         `2022` = "b_beds_2022", `2023` = "b_beds_2023", 
         `2024` = "b_beds_2024") |>  
  pivot_longer(-geometry) |>
  # make the new days object a factor (to enforce plotting order)
  mutate(name = factor(name, levels = 2018:2024)) |>
   # and plot 
  ggplot() + 
  geom_sf(aes(fill = value), col = NA) +
  # adjust default shading
  scale_fill_continuous_c4a_div("brewer.rd_yl_bu", name = tit) +
  # facet
  facet_wrap(~name, ncol = 3) +
  # apply and modify plot theme
  theme_bw() + 
  theme(
    legend.position = "inside",
    legend.direction = "horizontal",
    legend.position.inside = c(0.7, 0.15), 
      legend.text=element_text(size=10),
      legend.title=element_text(size=12),
      strip.background = element_rect(fill="white", colour = "white"), 
      strip.text = element_text(size = 8, margin = margin(b=4)),
      legend.key.width = unit(1.5, "cm"),
        axis.title=element_blank(),
        axis.text=element_blank(),
    axis.ticks=element_blank())  
The varying beds (Bedrooms) coefficient estimates over time and the grid surface.

4.5 Summary

This section has illustrated the use of the functions in the stgam package. It suggests the following workflow (a condensed code sketch is given after the list):

  1. Prepare the data: structure the data.frame, tibble or sf object containing the data so that each observation has single location and time variables (in the above examples these were X, Y and days), and an Intercept as an addressable term.
  2. Evaluate all possible models. For spatial or temporal problems each predictor is specified in 3 ways; for space-time problems each predictor is specified in 6 ways.
  3. Rank the models by their GCV score, identify any consistent model specification trends in the top ranked models and select the best model with the lowest GCV score.
  4. Extract the formula and create the final model.
  5. Calculate the varying coefficient estimates, to quantify how the relationships between the target and predictor variables vary over space, time or space-time.
  6. Create maps, time series plots etc.
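
This workflow can be condensed into a short code sketch using the functions introduced above (all of the calls are as used earlier in this vignette):

# 1. prepare the data: location, time and an addressable Intercept
hp_data <- hp_data |> mutate(Intercept = 1)
# 2. evaluate all possible space-time model forms
stvc_mods <- evaluate_models(input_data = hp_data, target_var = "priceper",
                             vars = c("pef", "beds"),
                             coords_x = "X", coords_y = "Y",
                             VC_type = "STVC", time_var = "days", ncores = 2)
# 3. rank the models by GCV
mod_comp <- gam_model_rank(stvc_mods, n = 10)
# 4. extract the best formula and create the final model
gam.m <- gam(as.formula(mod_comp$f[1]), data = hp_data, method = "REML")
# 5. calculate the varying coefficient estimates
vcs <- calculate_vcs(input_data = hp_data, mgcv_model = gam.m,
                     terms = c("Intercept", "pef", "beds"))
# 6. map or plot the results as shown in Section 4.4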

This workflow evaluates and ranks multiple models using each model's GCV value. This was done algorithmically using the evaluate_models function. However, this was not undertaken in isolation: rather, it built on the investigations in Section 3 to determine whether space and time effects were present in the data. This set of model investigations was undertaken to both develop and confirm knowledge of processes related to house prices in London. That is, the analysis was both considered and contextual. For example, in Section 3 it was determined that a varying intercept term was appropriate, and with a different dataset there may be a need to explore different avenues. Similarly, more of the model space could have been examined, for example to include cef in the models. The stgam package allows these choices to be validated through an automated approach, providing an exploration of the full set of potential choices, with the GCV as an unbiased risk estimator providing a useful basis for evaluating the models.

5. Notes, resources and summaries

5.1 Use REML in the GAMs

All of the space-time GAM models in Section 4 of this vignette were specified with method = "REML". Model estimation in GAMs can be conducted in two ways: (a) predictive (i.e. out-of-sample minimisation) via Generalized Cross-Validation (GCV) and (b) Bayesian (i.e. attaching priors to the basis coefficients) via restricted maximum likelihood (REML). REML tends to provide more stable estimates of the smoothing parameters (i.e. reducing over-fitting) and more robust estimates of the variance. GCV is more computationally efficient but can over-fit, especially when errors are correlated. GCV is the default in mgcv.
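
As a minimal illustration of the two estimation approaches, using hp_data from this vignette (the model form here is purely illustrative, not the final Section 4 model):

# mgcv defaults to GCV-based smoothness selection
m_gcv <- gam(priceper ~ s(X, Y) + s(days), data = hp_data)
# the REML approach used throughout Section 4
m_reml <- gam(priceper ~ s(X, Y) + s(days), data = hp_data, method = "REML")
# compare the smoothing parameters selected under each approach
m_gcv$sp
m_reml$sp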

5.2 Number of knots in space and time

In Section 4.3, the GAM model was checked for under- and over-fitting using the k.check function in the mgcv package; the advice was to increase k in the smooths if the k' and edf parameters were close. This can be done by specifying k manually in the smooth, as in the hypothetical example below.

gam_m <- gam(y ~ s(X, Y, by = x, k = 40), data = input_data)

The number of knots in the smooth (\(k\)) determines the complexity of the response-to-predictor variable smooths in the GAM. However, determining k is a balance: setting it too low can result in under-fitting, while setting it too large can result in over-fitting. Computational issues (speed and model stability) can also arise if k is set too large, especially in the space-time case with small datasets.
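
A minimal sketch of this check-and-refit loop with the Section 4.3 model (the k values below are illustrative assumptions, not tuned choices):

# check the current basis dimensions
k.check(gam.m)
# if k' and edf are close for a smooth, refit with a larger k;
# for a t2() tensor smooth, k can be set per marginal (space, time)
f2 <- priceper ~ Intercept - 1 +
  t2(X, Y, days, d = c(2, 1), k = c(30, 8), by = Intercept) +
  s(X, Y, by = pef) + s(days, by = pef) +
  s(X, Y, by = beds) + s(days, by = beds)
gam.m2 <- gam(f2, data = hp_data, method = "REML")
k.check(gam.m2)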

5.3 Resources

We think the talks by Noam Ross provide really useful tips and pointers for working with GAMs and mgcv.

5.4 Summary

The rationale for using GAMs with TP smooths for spatially varying coefficient or temporally varying coefficient models is that they quantify how statistical relationships vary over space and / or time without imposing a fixed form on any space-time interaction.

The workflow suggested in this vignette and in the stgam package is to determine the most appropriate model form, with each step evaluated by minimising GCV, an unbiased risk estimator recommended for evaluating mgcv GAMs (S. N. Wood 2017; Marra and Wood 2011). This avoids making potentially unreasonable assumptions about how space and time interact in space-time varying coefficient models. The best model can then be extracted and examined for the nature of the space-time interactions it suggests, before generating the varying coefficient estimates and mapping or plotting the results.

Future developments will seek to examine the scales of spatial dependencies in the data using kriging-based approaches, and will extend the stgam package to work with large data (i.e. to draw from the functionality of the gamm function in mgcv).

References

Akaike, Hirotugu. 1998. “Information Theory and an Extension of the Maximum Likelihood Principle.” In Selected Papers of Hirotugu Akaike, 199–213. Springer.
Bin, Chi, Adam Dennett, Thomas Oléron-Evans, and Robin Morphet. 2024. “House Price Per Square Metre in England and Wales, 1995-2024.” https://reshare.ukdataservice.ac.uk/855033/.
Comber, Lex, Paul Harris, and Chris Brunsdon. 2024. stgam: Spatially and Temporally Varying Coefficient Models Using Generalized Additive Models. https://CRAN.R-project.org/package=stgam.
Hurvich, Clifford M, and Chih-Ling Tsai. 1989. “Regression and Time Series Model Selection in Small Samples.” Biometrika 76 (2): 297–307.
Marra, Giampiero, and Simon N Wood. 2011. “Practical Variable Selection for Generalized Additive Models.” Computational Statistics & Data Analysis 55 (7): 2372–87.
Schwarz, Gideon. 1978. “Estimating the Dimension of a Model.” The Annals of Statistics, 461–64.
Wood, Simon. 2021. mgcv: Mixed GAM Computation Vehicle with Automatic Smoothness Estimation. https://CRAN.R-project.org/package=mgcv.
Wood, Simon N. 2017. Generalized Additive Models: An Introduction with R. CRC Press.