
Learning about statistical/machine learning can be intimidating, especially if you’re coming from a field other than statistics, machine learning, or mathematics. If you’re like me and coming from another area (e.g., social sciences, life sciences, business), you were undoubtedly exposed to regression at some point. However, when learning about regression, the emphasis probably wasn’t on prediction but rather on interpreting coefficients and praying that the p-values associated with said coefficients fell below .05. In reality, if you dig a bit deeper into the algebra underlying regression models, you’ll find that you’ve already been exposed to a method that can work well for fitting simple statistical learning models. Below, we’ll unpack that process without the help of functions or libraries and explore how linear regression can be used as a basic predictive tool with the classic iris dataset.

data("iris")

head(iris, 10)
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1           5.1         3.5          1.4         0.2  setosa
## 2           4.9         3.0          1.4         0.2  setosa
## 3           4.7         3.2          1.3         0.2  setosa
## 4           4.6         3.1          1.5         0.2  setosa
## 5           5.0         3.6          1.4         0.2  setosa
## 6           5.4         3.9          1.7         0.4  setosa
## 7           4.6         3.4          1.4         0.3  setosa
## 8           5.0         3.4          1.5         0.2  setosa
## 9           4.4         2.9          1.4         0.2  setosa
## 10          4.9         3.1          1.5         0.1  setosa

Let’s begin by looking at how we can fit a linear model and use it to predict new data in base R. We’ll attempt to predict Sepal.Length from the other three numeric variables in the iris dataset. To begin, we’ll grab some training data and fit our model.

iris_sub <- iris[, 1:4] # keep only the four numeric columns

cut <- .7 * nrow(iris) # use the first 70% of rows (105) for training

iris_train <- iris_sub[1:cut, ]

fit <- lm(Sepal.Length ~ ., data = iris_train)

Now, we’ll create some test data and combine that with our model from the training data to create some predictions.

iris_test <- iris_sub[(cut + 1):nrow(iris_sub), ] # hold out the remaining 30% for testing

preds <- predict(fit, iris_test[, -1]) # drop the outcome column before predicting

plot(preds, iris_test[, 1], main = paste('Cor =', round(cor(iris_test[, 1], preds), 2)), xlab = "Predicted Values", ylab = "Observed Values")

The predictions look great! If someone accidentally deleted a bunch of information about sepal lengths in a flower database, we can safely say we’d do a sound job recovering that information.

But what is predict() doing to come up with those values? We’ll get to that in a second, but first let’s examine prediction in the context of linear regression models more generally.

We’ll start by examining how new values can be predicted on a set of data where both the y values and predictor variables are present (using the iris training data we created as an example). First, we need to separate our x values from our y values. We’ll pad the x matrix with a column of ones to represent the intercept.

x <- iris_train[, -1] # predictor columns only

x <- cbind(Intercept = rep(1, nrow(x)), x) # pad with a column of ones for the intercept

x <- as.matrix(x)

y <- as.matrix(iris_train[, 1]) # outcome column

Next, we need to transpose x and multiply it by itself, creating the \(X'X\) matrix. This results in a sums of squares/cross products (SSCP) matrix.

XtX <- t(x) %*% x

XtX
##              Intercept Sepal.Width Petal.Length Petal.Width
## Intercept        105.0      324.80       314.50       89.10
## Sepal.Width      324.8     1027.66       930.13      259.79
## Petal.Length     314.5      930.13      1188.37      364.22
## Petal.Width       89.1      259.79       364.22      115.75
crossprod(x) # the built-in crossprod function gives the same result
##              Intercept Sepal.Width Petal.Length Petal.Width
## Intercept        105.0      324.80       314.50       89.10
## Sepal.Width      324.8     1027.66       930.13      259.79
## Petal.Length     314.5      930.13      1188.37      364.22
## Petal.Width       89.1      259.79       364.22      115.75
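To see why this is called a sums of squares/cross products matrix, we can verify a couple of entries by hand. This is just a sanity check built from the matrix above: the diagonal holds each column’s sum of squares, and the off-diagonal holds cross products between pairs of columns.

all.equal(XtX["Sepal.Width", "Sepal.Width"], sum(x[, "Sepal.Width"]^2)) # diagonal: sum of squares
## [1] TRUE
all.equal(XtX["Sepal.Width", "Petal.Length"], sum(x[, "Sepal.Width"] * x[, "Petal.Length"])) # off-diagonal: cross product
## [1] TRUE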

Now, we are going to calculate a projection matrix, \(X(X'X)^{-1}X'\), also known as the hat matrix (its diagonal holds the observation leverages). This will allow us to map \(y\) into fitted values \(\hat{y}\).

h <- x %*% solve(XtX) %*% t(x)

all.equal(diag(h), hatvalues(fit)) #matches output from hatvalues function
## [1] TRUE
our_fitted_vals <- h %*% y

all.equal(as.numeric(our_fitted_vals), as.numeric(fit$fitted.values)) #our calculated predicted values match the values accessed from the model
## [1] TRUE
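As a quick aside, the trace of the hat matrix equals the number of estimated coefficients, which makes for a handy sanity check on our calculation; here that should be 4 (the intercept plus three slopes).

sum(diag(h)) # trace of the hat matrix; should equal 4, up to floating-point error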

That’s all well and good, but how can we predict new values of y (\(\hat{y}\)) in our test data using our training data? To do that, we need to calculate beta coefficients from our training data that capture the relationship between our y and x values. The formula to do so is \(\hat{\beta} = (X'X)^{-1}X'Y\).

XtY <- t(x) %*% y

betas <- solve(XtX) %*% XtY
data.frame(model_betas = summary(fit)$coefficients[, 1], our_betas = betas) # our calculated betas match betas from lm
##              model_betas  our_betas
## (Intercept)    2.0952291  2.0952291
## Sepal.Width    0.5945672  0.5945672
## Petal.Length   0.7010446  0.7010446
## Petal.Width   -0.6115940 -0.6115940
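A brief caveat: explicitly inverting \(X'X\) with solve() is fine for illustration, but it can become numerically unstable when predictors are strongly correlated. As a sketch of sturdier alternatives, the two-argument form of solve() avoids forming the inverse, and qr.solve() computes the least-squares solution via a QR decomposition (the approach lm() itself uses internally); both should reproduce the betas above.

solve(XtX, XtY) # solves (X'X) b = X'Y without forming the inverse

qr.solve(x, y) # least-squares solution via QR decomposition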

We can multiply our test data by the beta coefficients (intercept excluded) and then add the model intercept to find our predicted values. We’ll also calculate a couple of summary statistics, RMSE and MAE.

our_preds <- betas[1] + as.matrix(iris_test[,-1]) %*% betas[-1]

all.equal(as.numeric(preds), as.numeric(our_preds)) #our predicted values for our test data match that of the predict function 
## [1] TRUE
sqrt(mean((our_preds - iris_test[, 1])^2)) # root mean squared error (RMSE)
## [1] 0.3603703
mean(abs(our_preds - iris_test[, 1])) # mean absolute error (MAE)
## [1] 0.2854695
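To put those numbers in context, it helps to compare them against a naive baseline. The sketch below predicts the training-set mean for every test row; a useful model should beat this comfortably (no claims about the exact values here, which depend on the split).

baseline <- mean(iris_train[, 1]) # predict the training mean everywhere

sqrt(mean((baseline - iris_test[, 1])^2)) # baseline RMSE

mean(abs(baseline - iris_test[, 1])) # baseline MAE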

What we’ve learned so far can easily be extended to a k-fold cross-validation scheme. The function below will create folds and return a list containing the original dataset with the fold assignments as a new column, along with a list of the fold indices themselves.

kfold <- function(df, k){

  df <- df[sample(nrow(df)), ] # shuffle the rows
  
  folds <- cut(seq(1, nrow(df)), breaks = k, labels = FALSE) # assign each row to one of k folds
  
  inds <- list()
  
  for (i in 1:k){
  
    inds[[i]] <- which(folds != i) # training-row indices for fold i (all rows outside the fold)
  
  }
  
  df <- data.frame(df, folds = folds)
  
  out <- list(df = df, fold_inds = inds)
  
  return(out)
  
}
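One caveat before we use it: because kfold() shuffles the rows, the folds change on every call. Setting a seed first (any fixed value works; 42 below is just an example) makes the folds reproducible across runs.

set.seed(42) # fixes the shuffle so the same folds come back every time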

Let’s create the folds and make sure they make sense:

kfold_data <- kfold(iris, 5)

str(kfold_data)
## List of 2
##  $ df       :'data.frame':   150 obs. of  6 variables:
##   ..$ Sepal.Length: num [1:150] 5 5.7 5.4 6.8 6 4.8 5.3 7.6 7.2 5.5 ...
##   ..$ Sepal.Width : num [1:150] 2 3.8 3.4 3.2 2.2 3.1 3.7 3 3.2 2.5 ...
##   ..$ Petal.Length: num [1:150] 3.5 1.7 1.7 5.9 5 1.6 1.5 6.6 6 4 ...
##   ..$ Petal.Width : num [1:150] 1 0.3 0.2 2.3 1.5 0.2 0.2 2.1 1.8 1.3 ...
##   ..$ Species     : Factor w/ 3 levels "setosa","versicolor",..: 2 1 1 3 3 1 1 3 3 2 ...
##   ..$ folds       : int [1:150] 1 1 1 1 1 1 1 1 1 1 ...
##  $ fold_inds:List of 5
##   ..$ : int [1:120] 31 32 33 34 35 36 37 38 39 40 ...
##   ..$ : int [1:120] 1 2 3 4 5 6 7 8 9 10 ...
##   ..$ : int [1:120] 1 2 3 4 5 6 7 8 9 10 ...
##   ..$ : int [1:120] 1 2 3 4 5 6 7 8 9 10 ...
##   ..$ : int [1:120] 1 2 3 4 5 6 7 8 9 10 ...
table(kfold_data$df$folds)
## 
##  1  2  3  4  5 
## 30 30 30 30 30
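Note that each element of fold_inds holds the training rows for a fold (the 120 rows outside it), which is exactly the format caret’s trainControl() expects for its index argument. We can confirm the sizes directly:

lengths(kfold_data$fold_inds) # 120 training rows per fold (150 - 30 held out)
## [1] 120 120 120 120 120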

And fit our model using 5-fold cross-validation:

k <- 5

df <- kfold_data$df

k_out <- list()

for (i in 1:k){
  
  train <- df[df$folds != i, ] # train on everything outside fold i
  
  train_x <- as.matrix(train[, -c(1, 5, 6)]) # drop the outcome, Species, and fold columns
  
  train_x <- cbind(Intercept = rep(1, nrow(train_x)), train_x)
  
  train_y <- as.matrix(train[, 1])
  
  test <- df[df$folds == i, ] # hold out fold i for testing
  
  test_x <- as.matrix(test[, -c(1, 5, 6)])
  
  test_y <- as.matrix(test[, 1])
  
  betas <- solve(t(train_x) %*% train_x) %*% t(train_x) %*% train_y # (X'X)^{-1}X'Y
  
  preds <- betas[1] + test_x %*% betas[-1]
  
  r <- cor(preds, test_y)
  
  r2 <- r^2
  
  rmse <- sqrt(mean((preds - test_y)^2))

  mae <- mean(abs(preds - test_y))
  
  k_out[[i]] <- list(metrics = c(r = r, r2 = r2, rmse = rmse, mae = mae),
                     predictions = preds)
  
}

We can extract and examine the predictive summary statistics.

summary_info <- do.call("rbind", lapply(k_out, `[[`, "metrics" ))

row.names(summary_info) <- paste("fold", 1:5, sep = "_")

summary_info
##                r        r2      rmse       mae
## fold_1 0.9329948 0.8704793 0.3279822 0.2666810
## fold_2 0.9184196 0.8434945 0.3198034 0.2644244
## fold_3 0.9339755 0.8723103 0.3150415 0.2405732
## fold_4 0.9133824 0.8342673 0.3054551 0.2484961
## fold_5 0.9404967 0.8845341 0.3379004 0.2855339

Finally, we can pass our fold indices to caret and confirm our calculations.

# caret is called via :: below rather than attached with library(caret)

control <- caret::trainControl(index = kfold_data[[2]], method = "cv", savePredictions = TRUE)

caret_fit <- caret::train(Sepal.Length ~ ., 
                data = df[, -c(5,6)], 
                method = "lm",
                trControl = control)
## Warning: package 'caret' was built under R version 3.4.2
## Loading required package: lattice
## Loading required package: ggplot2
caret_fit
## Linear Regression 
## 
## 150 samples
##   3 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 120, 120, 120, 120, 120 
## Resampling results:
## 
##   RMSE       Rsquared   MAE      
##   0.3212365  0.8610171  0.2611417
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE

Note that caret’s printout says “(10 fold)” because we left trainControl’s default fold count in place; the summary of sample sizes (five resamples of 120) confirms our custom indices were used. Averaging across folds, our statistics match caret’s output:

colMeans(summary_info)
##         r        r2      rmse       mae 
## 0.9278538 0.8610171 0.3212365 0.2611417