
Leave-one-out cross-validation and error correction

Collin Erickson

2025-04-07

Cross-validation is often used in machine learning to judge how well a model fits. Instead of using the entire data set to fit the model, one part of the data is used to fit a model, which is then tested on the remaining data. This gives an idea of how well the model will generalize to independent data.

Leave-one-out predictions using Gaussian processes

Leave-one-out prediction fits a model to all of the data except a single point, then makes a prediction at that point, which can be compared to the actual value. It seems like this would be very expensive to do, but it is actually an inexpensive computation for a Gaussian process model, as long as the parameters estimated from the full model are reused. Reusing the full-model parameters does bias the predictions toward better results than if the parameters were re-estimated for each left-out point.

Normally each prediction point requires solving a matrix equation. To predict the output $y$ at a point $x$, given input data in the matrix $X_2$ and outputs $y_2$, we use the equation
$$\hat{y} = \hat{\mu} + R(x, X_2) R(X_2)^{-1} (y_2 - \hat{\mu} 1_n)$$
For leave-one-out predictions, the matrix $X_2$ contains all of the design points except the one we are predicting at, and thus is different for each point. However, we already have the correlation matrix $R$ for the full data set from estimating the parameters, and there is a shortcut to find the inverse of a matrix after removing a single row and column.
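As a quick reference, here is a minimal sketch of that predicted-mean formula in R. The names `corr_gauss`, `gp_pred_mean`, the fixed correlation parameter `theta`, and passing `muhat` in directly are all placeholders for this illustration, not definitions used elsewhere in this vignette.

corr_gauss <- function(x, y, theta=30) {exp(-theta*sum((x-y)^2))}  # illustrative Gaussian correlation
gp_pred_mean <- function(x, X2, y2, muhat, theta=30) {
  # correlation matrix of the data, and correlation of x with the data
  R2 <- outer(1:nrow(X2), 1:nrow(X2),
              Vectorize(function(i, j) {corr_gauss(X2[i,], X2[j,], theta)}))
  r  <- sapply(1:nrow(X2), function(i) {corr_gauss(x, X2[i,], theta)})
  # muhat + R(x, X2) R(X2)^{-1} (y2 - muhat * 1_n)
  as.numeric(muhat + r %*% solve(R2, y2 - muhat))
}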

There is a significant speed-up from using a multiplication instead of a matrix solve. The code chunk below shows that solving a linear system with a 200 × 200 matrix is over 30 times slower than a matrix-vector multiplication of the same size.

n <- 200
m1 <- matrix(runif(n*n), ncol=n)   # random n x n matrix
b1 <- runif(n)                     # random vector of length n
if (requireNamespace("microbenchmark", quietly = TRUE)) {
  microbenchmark::microbenchmark(solve(m1, b1), m1 %*% b1)   # solve vs. multiply
}
## Unit: microseconds
##           expr      min       lq       mean    median       uq      max neval
##  solve(m1, b1) 1412.601 1739.552 1915.88804 1890.0005 1985.101 6455.501   100
##      m1 %*% b1   33.901   41.902   50.84209   45.2515   48.501  591.601   100

Getting the inverse of a submatrix

Suppose we have a matrix $K$ and know its inverse $K^{-1}$. Suppose that $K$ has the block structure
$$K = \begin{bmatrix} A & B \\ C & D \end{bmatrix}$$
Now we want to find $A^{-1}$ using $K^{-1}$ instead of doing the full inverse. We can write $K^{-1}$ in block structure as
$$K^{-1} = \begin{bmatrix} E & F \\ G & H \end{bmatrix}$$

Now we use the fact that $K K^{-1} = I$:
$$\begin{bmatrix} I & 0 \\ 0 & I \end{bmatrix} = \begin{bmatrix} A & B \\ C & D \end{bmatrix} \begin{bmatrix} E & F \\ G & H \end{bmatrix}$$

This gives the equations
$$AE + BG = I$$
$$AF + BH = 0$$
$$CE + DG = 0$$
$$CF + DH = I$$

Solving the first equation gives $A = (I - BG)E^{-1}$, or $A^{-1} = E(I - BG)^{-1}$.
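Here is a quick numeric check of that identity on a small random matrix; the block sizes and the added diagonal are arbitrary choices, made only to keep the example well conditioned.

set.seed(1)
K <- matrix(rnorm(36), 6, 6) + 3*diag(6)   # random 6 x 6 matrix, kept well conditioned
Kinv <- solve(K)
A <- K[1:4, 1:4]; B <- K[1:4, 5:6]         # blocks of K
E <- Kinv[1:4, 1:4]; G <- Kinv[5:6, 1:4]   # corresponding blocks of K^{-1}
max(abs(solve(A) - E %*% solve(diag(4) - B %*% G)))  # should be essentially zero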

Leave-one-out covariance matrix inverse for Gaussian processes

For Gaussian processes we can consider the block structure of the covariance (or correlation) matrix when a single row and its corresponding column are removed. Let the first $n-1$ rows and columns hold the covariance of the points in the design matrix $X$, while the last row and column hold the covariance of the point $x$ with $X$ and with itself. Then we have

$$K = \begin{bmatrix} C(X,X) & C(X,x) \\ C(x,X) & C(x,x) \end{bmatrix}$$

Using the notation from the previous subsection, we have $A = C(X,X)$ and $B = C(X,x)$, and $E$ and $G$ will be submatrices of the full $K^{-1}$. $B$ is a column vector, so I'll write it as a vector $b$, and $G$ is a row vector, so I'll write it as $g^T$. So we have
$$C(X,X)^{-1} = E\left(I - C(X,x)\,G\right)^{-1}$$
So if we want to calculate $A^{-1} = E(I - bg^T)^{-1}$, we still have to invert $I - bg^T$, which is a large matrix. However, this can be done efficiently since $bg^T$ is a rank-one matrix, using the Sherman-Morrison formula:
$$(I - bg^T)^{-1} = I^{-1} + \frac{I^{-1} b g^T I^{-1}}{1 - g^T I^{-1} b} = I + \frac{b g^T}{1 - g^T b}$$
Thus we have a shortcut for $A^{-1}$ that requires only multiplication:
$$A^{-1} = E\left(I + \frac{b g^T}{1 - g^T b}\right)$$

To speed this up, it should be calculated as
$$A^{-1} = E + \frac{(Eb)\, g^T}{1 - g^T b}$$
The code below demonstrates that we get a speedup of roughly thirty by using this shortcut.

set.seed(0)
corr <- function(x,y) {exp(sum(-30*(x-y)^2))}   # Gaussian correlation function
n <- 200
d <- 2
X <- matrix(runif(n*d), ncol=2)                 # random design matrix
R <- outer(1:n, 1:n, Vectorize(function(i,j) {corr(X[i,], X[j,])}))
Rinv <- solve(R)                                # inverse of the full correlation matrix
A <- R[-n,-n]                                   # correlation matrix with the last point removed
Ainv <- solve(A)                                # direct inverse, for comparison
E <- Rinv[-n, -n]                               # submatrix of the full inverse
b <- R[n,-n]                                    # correlation of the last point with the others
g <- Rinv[n,-n]                                 # corresponding row of the full inverse
Ainv_shortcut <- E + E %*% b %*% g / (1-sum(g*b))
summary(c(Ainv - Ainv_shortcut))                # differences between the two inverses
##       Min.    1st Qu.     Median       Mean    3rd Qu.       Max. 
## -4.768e-03 -2.300e-08  0.000e+00  0.000e+00  2.300e-08  3.838e-03
if (requireNamespace("microbenchmark", quietly = TRUE)) {
  microbenchmark::microbenchmark(solve(A), E + E %*% b %*% g / (1-sum(g*b)))
}
## Unit: microseconds
##                                expr      min       lq     mean    median
##                            solve(A) 5190.301 6422.701 6921.910 6772.7010
##  E + E %*% b %*% g/(1 - sum(g * b))  122.801  203.950  234.226  227.4515
##        uq       max neval
##  7168.901 10828.201   100
##   253.851   548.101   100

In terms of the covariance matrices already calculated, this is the following, where $M_{-i}$ is the matrix $M$ with the $i$th row and column removed, and $M_{-i,i}$ is the $i$th column of the matrix $M$ with the value from the $i$th row removed.

$$R(X_{-i})^{-1} = (R(X)^{-1})_{-i} + \frac{\left((R(X)^{-1})_{-i}\, R(X)_{-i,i}\right) (R(X)^{-1})^T_{-i,i}}{1 - (R(X)^{-1})^T_{-i,i}\, R(X)_{-i,i}}$$
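Wrapping this up as a function makes it easy to apply to any point $i$. This reuses `R` and `Rinv` from the chunk above; the function name `loo_Rinv` is just for illustration. The differences from the direct inverse are on the same order as those seen earlier, since `R` is poorly conditioned.

loo_Rinv <- function(R, Rinv, i) {
  # inverse of R with row/column i removed, using only the full R and Rinv
  E <- Rinv[-i, -i]
  b <- R[-i, i]        # correlation of point i with the remaining points
  g <- Rinv[-i, i]     # ith column of the inverse (same as the ith row, by symmetry)
  E + (E %*% b) %*% g / (1 - sum(g*b))
}
i <- 7
max(abs(loo_Rinv(R, Rinv, i) - solve(R[-i, -i])))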

Leave-one-out prediction

Recall that the predicted mean at a new point is
$$\hat{y} = \hat{\mu} + R(x, X_2) R(X_2)^{-1} (y_2 - \hat{\mu} 1_n)$$

For the leave-one-out prediction at the design point $x_i$, the data is everything except the $i$th point, so
$$\hat{y}_{-i} = \hat{\mu} + R(x_i, X_{-i}) R(X_{-i})^{-1} (y_{-i} - \hat{\mu} 1_{n-1})$$
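As a rough sketch of how this is used, the code below computes a single leave-one-out prediction with the shortcut inverse from above. The response `y` here is a made-up toy function, and $\hat{\mu}$ is taken to be the usual estimate $\frac{1_n^T R^{-1} y}{1_n^T R^{-1} 1_n}$ from the full data; neither is defined elsewhere in this section, so treat this purely as an illustration.

y <- sin(2*pi*X[,1]) + X[,2]^2                         # hypothetical response at the design points
muhat <- sum(solve(R, y)) / sum(solve(R, rep(1, n)))   # mean estimate from the full data
i <- 7
Rinv_i <- loo_Rinv(R, Rinv, i)                         # leave-one-out inverse via the shortcut
yhat_i <- as.numeric(muhat + R[i, -i] %*% Rinv_i %*% (y[-i] - muhat))
c(loo_prediction = yhat_i, actual = y[i])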