Profiling Portfolio
Implementations
Data Generation and Problem Settings
Import the necessary packages. The library profvis is a sampling profiler that we will use to profile our code.
library(profvis)
library(tidyverse)
library(MASS)
library(microbenchmark)
First, let’s set some parameters, such as the amount of training and testing data.
# Number of test points for plotting
ntest <- 500
# Squared bandwidth (characteristic length-scale) of the RBF kernel
sigmasq <- 1.0
# Number of training points
ntrain <- 10
# Standard deviation of the additive noise
sigma_n <- 0.5
# Number of samples to draw from the GP prior and the GP posterior
nprior_samples <- 5
npost_samples <- 5
Define some helper functions. In particular, the kernel function is a squared exponential. The bandwidth can be chosen as the median of all pairwise distances; here we simply fix \(\sigma^2 = 1\) as set above. We also define a function that applies the squared exponential to the rows of two matrices, i.e. it computes the kernel matrix \(K(X, Y)\) for two matrices \(X\) and \(Y\). Finally, we define a regression function that represents the true structure of the data. For simplicity this is just a quintic polynomial, but it can be replaced by more complicated functions.
# Squared exponential (RBF) kernel between a point x and a centre c
squared_exponential <- function(x, c, sigmasq){
return(exp(-0.5*sum((x - c)^2) / sigmasq))
}
kernel_matrix <- function(X, Xstar, sigmasq){
# compute the kernel matrix; entry (i, j) is k(X[i, ], Xstar[j, ])
K <- apply(
X=Xstar,
MARGIN=1,
FUN=function(xstar_row) apply(
X=X,
MARGIN=1,
FUN=squared_exponential,
xstar_row,
sigmasq
)
)
return(K)
}
# True regression function: a quintic polynomial
regression_function <- function(x){
val <- (x+5)*(x+2)*(x)*(x-4)*(x-3)/10 + 2
return(val)
}
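As a quick illustration (this snippet is not part of the original code), kernel_matrix() returns one row per row of its first argument and one column per row of its second argument:
# Quick illustration (not in the original): a 3 x 1 and a 2 x 1 input matrix
# yield a 3 x 2 kernel matrix.
dim(kernel_matrix(matrix(c(-1, 0, 1)), matrix(c(0, 2)), sigmasq))
## [1] 3 2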
Now let’s actually generate some data.
set.seed(12345)
# training data
xtrain <- matrix(runif(ntrain, min=-5, max=5))
ytrain <- regression_function(xtrain) + matrix(rnorm(ntrain, sd=sigma_n))
# testing data
xtest <- matrix(seq(-5,5, len=ntest))
Naive Vectorized Implementation
The first implementation that we look at is naive in the sense that it translates the prediction equations literally: we invert \(K + \sigma_n^2 I\) directly instead of working with a Cholesky factorization.
source("gp_naive.R", keep.source = TRUE)
profvis(result <- gp_naive(xtrain, ytrain, xtest, sigma_n, sigmasq))
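The contents of gp_naive.R are not reproduced in this document. As a rough sketch of what such a direct-inversion implementation could look like (the function below is an assumption for illustration, not the actual script), one could write:
# Sketch only: an assumed implementation of the naive approach, reusing the
# kernel_matrix() helper defined above; gp_naive.R itself is not shown here.
gp_naive_sketch <- function(xtrain, ytrain, xtest, sigma_n, sigmasq){
  K     <- kernel_matrix(xtrain, xtrain, sigmasq)      # K(X, X), ntrain x ntrain
  Kstar <- kernel_matrix(xtrain, xtest, sigmasq)       # K(X, X*), ntrain x ntest
  Kss   <- kernel_matrix(xtest, xtest, sigmasq)        # K(X*, X*), ntest x ntest
  Kinv  <- solve(K + sigma_n^2 * diag(nrow(xtrain)))   # explicit inverse: the naive part
  gpmean <- t(Kstar) %*% Kinv %*% ytrain               # predictive mean
  gpvcov <- Kss - t(Kstar) %*% Kinv %*% Kstar          # predictive covariance
  return(list(gpmean, gpvcov))
}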
Online Non-Vectorized Cholesky Implementation
This implementation is based on the Cholesky factorization of \(K + \sigma_n^2 I\), as recommended in the “Gaussian Processes for Machine Learning” book, and can also be used online.
source("gp_online.R", keep.source = TRUE)
profvis(result_online <- gp_online(xtrain, ytrain, xtest, sigma_n, sigmasq))
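Again, gp_online.R is not reproduced here. A sketch in the spirit of Algorithm 2.1 of the book, with the kernel vector of each test point computed inside a loop (names, structure and return values below are assumptions for illustration), could look like this:
# Sketch only: an assumed Cholesky-based implementation in the spirit of
# Algorithm 2.1 from the GPML book; gp_online.R itself is not shown here.
gp_online_sketch <- function(xtrain, ytrain, xtest, sigma_n, sigmasq){
  K <- kernel_matrix(xtrain, xtrain, sigmasq)
  L <- t(chol(K + sigma_n^2 * diag(nrow(xtrain))))    # lower-triangular Cholesky factor
  alpha <- backsolve(t(L), forwardsolve(L, ytrain))   # (K + sigma_n^2 I)^{-1} y via triangular solves
  gpmean <- numeric(nrow(xtest))
  gpvar  <- numeric(nrow(xtest))
  for (i in seq_len(nrow(xtest))){
    kstar <- apply(xtrain, 1, squared_exponential, xtest[i, ], sigmasq)  # k(X, x*)
    v <- forwardsolve(L, kstar)
    gpmean[i] <- crossprod(kstar, alpha)              # predictive mean at x*
    gpvar[i]  <- 1 - crossprod(v)                     # predictive variance; k(x*, x*) = 1 here
  }
  return(list(gpmean, gpvar))
}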
Online Vectorized-Kernel Cholesky Implementation
We can see from the profile that most of the time is spent computing the kernel matrix, so let us find a faster way to compute it. The key observation is that the squared distances can be expanded as \(\|x_i - y_j\|^2 = \|x_i\|^2 - 2 x_i^\top y_j + \|y_j\|^2\), so the whole kernel matrix can be assembled from three matrices: one holding the squared row norms of \(X\), one holding the squared row norms of \(Y\), and the cross-product \(X Y^\top\).
kernel_matrix_vectorized <- function(X, sigmasq, Y=NULL){
if (is.null(Y)){
Y <- X
}
n <- nrow(X)
m <- nrow(Y)
# Compute the three matrices from the expansion above
Xnorm <- matrix(apply(X^2, 1, sum), n, m)
Ynorm <- matrix(apply(Y^2, 1, sum), n, m, byrow=TRUE)
XY <- tcrossprod(X, Y)
return(exp(-(Xnorm - 2*XY + Ynorm) / (2*sigmasq)))
}
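Before using it, we can check that the vectorized version agrees with the apply-based kernel_matrix() and compare their speed (this sanity check is not part of the original scripts):
# Sanity check (not in the original scripts): the two versions should agree up
# to floating-point error, and we can compare their speed on the data above.
all.equal(kernel_matrix(xtrain, xtest, sigmasq),
          kernel_matrix_vectorized(xtrain, sigmasq, Y=xtest))
microbenchmark(
  apply_based=kernel_matrix(xtrain, xtest, sigmasq),
  vectorized=kernel_matrix_vectorized(xtrain, sigmasq, Y=xtest),
  times=20
)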
Using this, we get
source("gp_online_vect.R", keep.source = TRUE)
profvis(gp_online_vect(xtrain, ytrain, xtest, sigma_n, sigmasq))
We can see that this implementation uses much less memory and is much faster.
Completely Vectorized Implementation
We can combine these ideas, the vectorized kernel and the Cholesky factorization, to obtain a much faster implementation.
source("gp_completely_vectorized.R", keep.source = TRUE)
profvis(gp_completely_vectorized(xtrain, ytrain, xtest, sigma_n, sigmasq), interval=0.005)
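As with the other scripts, gp_completely_vectorized.R is not reproduced here. A plausible sketch (assumed for illustration; only the list-of-two return format is inferred from the plotting code below) combines the vectorized kernel with matrix-level triangular solves:
# Sketch only: an assumed fully vectorized implementation combining
# kernel_matrix_vectorized() with Cholesky-based triangular solves applied to
# all test points at once; gp_completely_vectorized.R itself is not shown here.
gp_completely_vectorized_sketch <- function(xtrain, ytrain, xtest, sigma_n, sigmasq){
  K     <- kernel_matrix_vectorized(xtrain, sigmasq)            # K(X, X)
  Kstar <- kernel_matrix_vectorized(xtrain, sigmasq, Y=xtest)   # K(X, X*)
  Kss   <- kernel_matrix_vectorized(xtest, sigmasq)             # K(X*, X*)
  L <- t(chol(K + sigma_n^2 * diag(nrow(xtrain))))              # lower-triangular Cholesky factor
  alpha <- backsolve(t(L), forwardsolve(L, ytrain))             # (K + sigma_n^2 I)^{-1} y
  V <- forwardsolve(L, Kstar)                                   # one triangular solve for all test points
  gpmean <- crossprod(Kstar, alpha)                             # predictive mean, ntest x 1
  gpvcov <- Kss - crossprod(V)                                  # predictive covariance, ntest x ntest
  return(list(gpmean, gpvcov))
}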
Visualizations
Before seeing training data
Before seeing the training data we only have the test inputs. Samples from the Gaussian Process prior are therefore random (smooth) functions with mean zero.
Kss <- kernel_matrix_vectorized(xtest, sigmasq)
# Sample nprior_samples Multivariate Normals with mean zero and variance-covariance
# being the kernel matrix
data.frame(x=xtest, t(mvrnorm(nprior_samples, rep(0, length=ntest), Kss))) %>%
setNames(c("x", sprintf("Sample %s", 1:nprior_samples))) %>%
gather("MVN Samples", "Value", -x) %>%
ggplot(aes(x=x, y=Value)) +
# Because the entries of diag(Kss) are all 1, we shade mean +/- 2 standard deviations
geom_rect(xmin=-Inf, xmax=Inf, ymin=-2, ymax=2, fill="grey80") +
geom_line(aes(color=`MVN Samples`)) +
geom_abline(slope=0.0, intercept=0.0, lty=2) +
scale_y_continuous(lim=c(-3, 3)) +
labs(title=paste(nprior_samples, "MVN Samples before seeing the data")) +
theme(plot.title=element_text(hjust=0.5))
After seeing training data
We only need to find the predicted mean and the predicted variance.
# Get predictions. To predict noisy observations, add sigma_n^2*diag(nrow(xtest))
# to the predictive covariance matrix, as implemented in the script
results <- gp_completely_vectorized(xtrain, ytrain, xtest, sigma_n, sigmasq)
gpmean <- results[[1]]
gpvcov <- results[[2]]
# for plotting
dftrain <- data.frame(xtrain=xtrain, ytrain=ytrain)
# Plot
data.frame(x=xtest, t(mvrnorm(npost_samples, gpmean, gpvcov))) %>%
setNames(c("x", sprintf("Sample %s", 1:npost_samples))) %>%
mutate(ymin=gpmean-2*sqrt(diag(gpvcov)), ymax=gpmean+2*sqrt(diag(gpvcov)),
gpmean=gpmean, ytrue=regression_function(xtest)) %>%
gather("MVN Samples", "Value", -x, -ymin, -ymax, -gpmean, -ytrue) %>%
ggplot(aes(x=x, y=Value)) +
geom_ribbon(aes(ymin=ymin, ymax=ymax), fill="grey80") +
geom_line(aes(color=`MVN Samples`)) +
geom_line(aes(y=gpmean), size=1) +
geom_line(aes(y=ytrue), color="darkred", lty=2) +
geom_point(data=dftrain, aes(x=xtrain, y=ytrain), color="red") +
ggtitle(paste(npost_samples, "MVN Samples after seeing", ntrain, "data points")) +
theme(plot.title=element_text(hjust=0.5))
Note on Vectorized Kernel Matrix
One might wonder why in the function kernel_matrix_vectorized() we fill the matrix Ynorm by row. After all, R uses column-major storage, so this should be inefficient. We can compare filling a matrix by row versus filling it by column and then taking the transpose in different cases:
- Number of rows < Number of columns
nrows <- 100
ncols <- 2000
microbenchmark(
rowwise=matrix(0, nrows, ncols, byrow=TRUE),
transpose=t(matrix(0, nrows, ncols))
)
## Unit: microseconds
## expr min lq mean median uq max neval
## rowwise 321.202 343.0085 648.0742 360.0345 697.1185 5734.401 100
## transpose 534.310 569.7350 950.3921 608.9455 1233.3545 3354.230 100
- Number of rows = Number of columns
nrows <- 2000
ncols <- 2000
microbenchmark(
rowwise=matrix(0, nrows, ncols, byrow=TRUE),
transpose=t(matrix(0, nrows, ncols))
)
## Unit: milliseconds
## expr min lq mean median uq max neval
## rowwise 16.69890 24.08596 26.52232 24.48049 28.26676 91.81093 100
## transpose 23.32213 29.73081 38.70225 36.65087 41.26867 96.43356 100
- Number of rows > Number of columns
nrows <- 2000
ncols <- 100
microbenchmark(
rowwise=matrix(0, nrows, ncols, byrow=TRUE),
transpose=t(matrix(0, nrows, ncols))
)
## Unit: microseconds
## expr min lq mean median uq max neval
## rowwise 378.090 383.898 470.1237 392.2410 398.4715 2512.430 100
## transpose 564.521 579.105 688.0039 593.8405 621.8380 2687.371 100
In all three cases, filling with byrow=TRUE is faster than filling column-wise and then transposing, so the row-wise fill in kernel_matrix_vectorized() is not a performance concern.