Metropolis-Hastings library(ggplot2) library(tidyverse) library(gridExtra) library(MASS) Random Walk Metropolis Hastings Let’s implement a Random Walk Metropolis-Hastings algorithm with a symmetric proposal density.
rwmh <- function(start, niter, proposal, target){ # Store accepted, rejected and all samples accepted <- matrix(NA, nrow=niter, dimnames=list(NULL, "accepted")) rejected <- matrix(NA, nrow=niter, dimnames=list(NULL, "rejected")) samples <- matrix(NA, nrow=niter) # Set first sample to start and sample uniform random numbers for efficiency z <- start accepted[1] <- z; rejected[1] <- z; samples[1] <- z u <- runif(niter) # Draw candidate, calculate acceptance prob for niter-1 times for (i in 2:niter){ cand <- proposal$sample(z) prob <- min(1, target$density(cand)/target$density(z)) if (u[i] <= prob) {accepted[i] <- cand; z <- cand} else {rejected[i] <- cand} samples[i] <- z } result <- list("samples"=samples, "accepted"=accepted, "rejected"=rejected, "proposal"=proposal, "target"=target) return(result) } Plotting Function The following function takes the output of the Random-Walk Metropolis-Hastings algorithm and it produces two plots:
library(MASS) library(microbenchmark) library(tidyverse) Utility functions The kernel_matrix function calculates the kernel matrix with squared exponential in a vectorized way. This function is borrowed from the profiling portfolio.
kernel_matrix <- function(X, sigmasq, Y=NULL){ if (is.null(Y)){ Y <- X } n <- nrow(X) m <- nrow(Y) # Find three matrices above Xnorm <- matrix(apply(X^2, 1, sum), n, m) Ynorm <- matrix(apply(Y^2, 1, sum), n, m, byrow=TRUE) XY <- tcrossprod(X, Y) return(exp(-(Xnorm - 2*XY + Ynorm) / (2*sigmasq))) } The function make_flags takes a y vector of length n and containing K class labels 0, 1, , K-1 and returns a matrix with n rows and K columns, where each row is a one-hot encoding for the class corresponding to that observation.
library(Matrix) library(tidyverse) library(igraph) We work with the same dataset used for Tidyverse containing data regarding some of the top-voted kaggle kernels.
# Use `` for column names with spaces kaggle <- "kagglekernels.csv" %>% read_csv(col_types = cols( Votes=col_double(), Owner=col_factor(), Kernel=col_factor(), Dataset=col_factor(), Output=col_character(), `Code Type`=col_factor(), Language=col_factor(), Comments=col_double(), Views=col_double(), Forks=col_double()) ) kaggle # Tibbles automatically print head(tibble) ## # A tibble: 971 x 12 ## Votes Owner Kernel Dataset `Version Histor… Tags Output `Code Type` Language ## <dbl> <fct> <fct> <fct> <chr> <chr> <chr> <fct> <fct> ## 1 2130 Mega… Explo… Titani… Version 8,2017-… tuto… This … Script markdown ## 2 1395 Guid… Full … Data S… Version 19,2017… tuto… This … Notebook Python ## 3 1363 Pedr… Compr… House … Version 47,2018… begi… This … Notebook Python ## 4 1316 Anis… Intro… Titani… Version 93,2018… tuto… This … Notebook Python ## 5 1078 Kaan… Data … Pokemo… Version 389,201… begi… This … Notebook Python ## 6 1003 Phil… Explo… Zillow… Version 44,2017… begi… This … Script markdown ## 7 946 Mana… Titan… Titani… Version 16,2017… tuto… This … Notebook Python ## 8 826 Omar… A Jou… Titani… Version 6,2016-… begi… This … Notebook Python ## 9 814 anok… Data … Quora … <NA> inte… This … Notebook Python ## 10 726 SRK Simpl… Zillow… Version 19,2017… eda,… This … Notebook Python ## # … with 961 more rows, and 3 more variables: Comments <dbl>, Views <dbl>, ## # Forks <dbl> Again, we can use the Tags to create a number of different new variables, each representing one Tag.
Implementations Data Generation and Problem Settings Import the necessary packages. The library provfis is a stochastic profiler and can be used to profile our code.
library(profvis) library(tidyverse) library(MASS) library(microbenchmark) First, let’s decide some parameters and settings such as the amount of training and testing data.
# Number of data points for plotting & Bandwidth squared of the RBF kernel ntest <- 500 # Characteristic Length-scale for kernel sigmasq <- 1.
library(tidyverse) library(plotly) library(forcats) library(reshape2) library(magrittr) Spotify: Genre Popularity by Artist Import the spotify data an drop index column cause it contains redundant information.
spotify <- read_csv("top50.csv", col_names=c( "index", "Song", "Artist", "Genre", "BPM", "Energy", "Danceability", "Loudness", "Liveness", "Valence", "Length", "Acousticness", "Speechiness", "Popularity"), col_types=cols( index=col_double(), Song=col_factor(), Artist=col_factor(), Genre=col_factor()), skip=1) # Index column contains row_numbers spotify <- spotify %>% select(-c("index")) For brevity, will change the name of the genres and change one of the artists cause it has a non utf-8 name.
Mathematical Set Up Suppose we have a training matrix \(X\) with \(n\) observations and \(d\) features \[ X = \begin{pmatrix} x_{11} & x_{12} & \ldots & x_{1d}\\ x_{21} & x_{22} & \ldots & x_{2d} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \ldots & x_{nd} \end{pmatrix} =\begin{pmatrix} {\bf{x}}_1^\top \\ {\bf{x}}_2^\top \\ \vdots \\ {\bf{x}}_n^\top \end{pmatrix} \in\mathbb{R}^{n\times d} \] and that we want to do a feature transform of this data using the Radial Basis Function.
Vectorization As a general rule of thumb, we want to avoid writing for-loops as much as possible, and instead vectorize operations to speed up our calculations. For example suppose that we want to calculate \[\sum_{i=1}^n\cos(i) \times \sin(i) \times \tan(i) - \cos(i)^2-\sin(i)^3\]
At first one could write a function similar to this:
myfunc <- function(n){ total = 0 for (i in 1:n){ total = total + cos(i)*sin(i)*tan(i) - cos(i)^2 - sin(i)^3 } return(total) } We can see see how long a function takes to run in R by using the system.