Info

Objectives

By the end of this activity, you should be able to

  1. Load data from xlsx files rather than from csv.

  2. Convert wide data to long data.

  3. Do basic operations with time series - coerce character vectors to native R time/date format, create lag variables and plot time series.

  4. Train ridge regression and LASSO in R with the library caret, and choose optimal hyperparameter values using cross-validation.

  5. Perform variable selection by best subset or forward stepwise procedure.

Preliminaries

You should be familiar with linear regression and cross-validation. You should also be familiar with the pipe operator, splitting data into training and test sets, data summarization and 2D plotting. This material was covered in labs 1-3.

Mode

Please run the R chunks one by one, look at the output, and make sure that you understand how it is produced. Some questions require a short answer - type your answer right in this document; others require modifying R code - modify the code here. You can discuss your work with other students and with the instructors.

Please fill in the survey at the end - it will only take a couple of minutes.

Libraries

library(tidyverse) # for manipulation with data
## -- Attaching packages ----------------------------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.2     v purrr   0.3.4
## v tibble  3.0.3     v dplyr   1.0.2
## v tidyr   1.1.1     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.5.0
## -- Conflicts -------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(caret) # for machine learning, including KNN
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
library(lubridate) # for working with time
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library(readxl) # for reading data from xlsx
library(Hmisc) # for creating lag variables
## Loading required package: survival
## 
## Attaching package: 'survival'
## The following object is masked from 'package:caret':
## 
##     cluster
## Loading required package: Formula
## 
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
## 
##     src, summarize
## The following objects are masked from 'package:base':
## 
##     format.pval, units
library(leaps) # for subset selection

Data

Today we will work with data on Singapore haze in Sep - Oct 2015 that Fedor manually downloaded from the National Environment Agency webpage.

Source: https://www.haze.gov.sg/resources/historical-readings

H <-  read_excel("singapore_haze_sep_2015.xlsx", sheet = 1)
head(H)

Note that

  1. We read the data from an xlsx file rather than from a CSV file.

  2. The class of the variable Date is actually date/time - a special type in R.

  3. The class of the variable Time is character, so we will have to convert it to a date/time (you can verify the classes with the chunk below).
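
A quick way to verify the column types yourself (this chunk is only for illustration):

str(H)          # compact overview of all columns and their classes
class(H$Date)   # should be "POSIXct" "POSIXt"
class(H$Time)   # should be "character"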

Processing time series data

Time

Very often we need to work with times and dates. R has special variable types for them, “POSIXlt” and “POSIXct” (in addition to the familiar “numeric”, “logical”, “character”, and “factor”).

Unlike character values, times and dates can be used in arithmetic operations (time is measured in seconds):

set.seed(280)
x <- sample(H$Date, 3)
cat("Original dates: ")
## Original dates:
print(x)
## [1] "2015-09-30 UTC" "2015-09-22 UTC" "2015-09-30 UTC"
cat("Original dates shifted by 1 day: ")
## Original dates shifted by 1 day:
print(x+86400)
## [1] "2015-10-01 UTC" "2015-09-23 UTC" "2015-10-01 UTC"

Note that the variable Date has been correctly identified as “POSIXlt”/“POSIXct”, while Time has been identified as “character”, and hence we need to convert it. We could use the function hm from the library lubridate for that:

H$Time %>% hm %>% head
## [1] "1H 0M 0S" "2H 0M 0S" "3H 0M 0S" "4H 0M 0S" "5H 0M 0S" "6H 0M 0S"

The class of the result is

H$Time %>% hm %>% class
## [1] "Period"
## attr(,"package")
## [1] "lubridate"

An alternative is the function parse_time (from readr), which requires some manual fine-tuning of the format string:

H$Time %>% parse_time("%I:%M%p") %>% head
## 01:00:00
## 02:00:00
## 03:00:00
## 04:00:00
## 05:00:00
## 06:00:00

The class of the result is

H$Time %>% parse_time("%I:%M%p") %>% class
## [1] "hms"      "difftime"

And, again, “hms” is not a native R class but a special class that comes with the tidyverse (via readr) rather than with lubridate.

However, a better idea is to merge Date and Time into a single variable of a native R date/time class:

H$Timestamp <- H$Date %>% paste(H$Time) %>% ymd_hm 
head(H)

Wide data to long data

Sometimes we need to convert so-called “wide” data to so-called “long” data. Here, instead of the five columns North, South, East, West, and Central, we want to have just two: one containing the PSI value from any of these five columns, and the other containing one of the labels “North”, “South”, “East”, “West”, “Central”. Thus all five columns are merged into one, and the column names become a new categorical variable. We will call this variable location; the variable containing the values from the five original columns will be called PSI Level.

For this operation, we use the function pivot_longer:

long_data <- H %>%
  pivot_longer(North:Central, names_to = "location", values_to = "PSI Level")

long_data %>%
  sample_n(10)

Plotting several time series

Converting wide data to long data is needed to plot several time series together. This is how we do it:

ggplot(data = long_data, aes(x = Timestamp, y = `PSI Level`, group = location,
                             colour = location)) +
  geom_line()

Question 1

Figure out why the plotted series show these strange regular ups and downs, and fix this problem.

Answer This happened because our way of converting character time to proper time worked incorrectly:

H[24:25 , ]

Apparently, “12:00AM” of Day N is midnight between Day N and Day N+1 according to the National Environment Agency, but the same time is midnight between Day N-1 and Day N according to the developers of the library lubridate.

A quick fix is to shift the given dates for measurements taken at 12:00AM by one day.

H$AltDate <- H$Date
H$AltDate[H$Time == "12:00AM"] <- H$AltDate[H$Time == "12:00AM"] + 86400

H <- H %>%
  mutate(Timestamp = ymd_hm(paste(AltDate, Time))) 

long_data <- H %>%
  select(Timestamp, North:Overall) %>%
  pivot_longer(North:Central, names_to = "location", values_to = "PSI Level")

ggplot(data = long_data, aes(x = Timestamp, 
                             y = `PSI Level`, group = location,
                             colour = location)) +
  geom_line()

Creating lag variables

We will try to predict the haze level in one location given the history of observations in all the locations. The motivation is that in order to plan outdoor activities during a haze period, we need to know whether the air quality is going to get better or worse.

To do this, we will need the function Lag from the library Hmisc (it is, in fact, a very simple function and you can write it from scratch; see the sketch after the next chunk):

cat("10 PSI measurements in the West: ", head(H$West, n = 10), "\n")
## 10 PSI measurements in the West:  132 129 126 121 117 114 110 106 103 100
cat("10 PSI measurements in the West with lag 3: ", head(Lag(H$West, 3), n = 10), "\n")
## 10 PSI measurements in the West with lag 3:  NA NA NA 132 129 126 121 117 114 110
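
For the curious, here is a minimal from-scratch sketch of such a lag function for numeric vectors (the name simple_lag is ours):

simple_lag <- function(x, shift = 1) {
  # shift the vector forward by `shift` positions, padding the front with NAs
  if (shift == 0) return(x)
  c(rep(NA, shift), head(x, -shift))
}

simple_lag(head(H$West, 10), 3) # should match Lag(head(H$West, 10), 3)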

We will create 20 new variables for measurements in the 5 locations lagged by 2, 3, 4, and 5 hours.

First, here is how we create a lagged version of the measurements in all the locations for a specific lag. We will use the fact that the measurements are stored in numeric columns of the data, i.e., we can just apply the lag transformation to all the numeric variables. Below we apply the function Lag( . , 1) to all numeric columns and then rename these columns by adding the word “Lag” in front and the number “1” at the back:

H %>%
  transmute(across(where(is.numeric) , function(x) Lag(x, 1))) %>%
  rename_with(function(x) paste("Lag", x, sep = "_")) %>%
  rename_with(function(x) paste(x, 1, sep = "_")) %>%
  head

Now we will create a function with two inputs, a data frame df and an integer L. This function will return a new data frame that contains renamed lagged observations of all numeric variables in the original data frame:

lagged_obs <- function(df, L) {
  # Returns renamed lagged by L versions of all numeric variables in df
  # For convenience, if L = 0, we will just return the original df
  if (L == 0) {
    return(df)
  }
  df %>%
    transmute(across(where(is.numeric) , function(x) Lag(x, L))) %>%
    rename_with(function(x) paste("Lag", x, sep = "_")) %>%
    rename_with(function(x) paste(x, L, sep = "_"))
}

lagged_obs(H, 3) %>% head

Now we will create four versions of our original data frame with different lags and merge them into one big data frame together with the original data. We will use the function bind_cols for this purpose. This is what it is going to look like:

PSI_with_lags <- c(0, 2:5) %>%
  lapply(function(L) lagged_obs(H, L)) %>%
  bind_cols 

head(PSI_with_lags)

However, we need more than that. We also need to remove all rows that contain NA values (these are, in fact, just the first five rows, but it is better to use a universal method), and we keep only the response variable West and, as predictors, all the lagged variables whose names start with “Lag”.

Here is the final transformation:

PSI_with_lags <- c(0, 2:5) %>%
  lapply(function(L) lagged_obs(H, L)) %>%
  bind_cols %>%
  select(West, starts_with("Lag")) %>%
  filter(complete.cases(.))

head(PSI_with_lags)

Question 2

If we really wanted to construct an accurate model that predicts the haze level in the next few hours, what kind of data would we need besides lagged PSI levels across different locations in Singapore? Is it realistic to collect such data?

Answer We need weather data. Most importantly, we need wind direction and wind strength. Some weather data can be found here:

However, it only has daily observations and no wind direction. Wind direction is available here:

But it is shown visually and it is hard to extract a numeric wind direction.

The point of this exercise is to see how hard it may be to get access to simple data even if the data is well-structured and collected automatically by government agencies.

Regularization

Dimensions of our data are

dim(PSI_with_lags)
## [1] 331  21

We could split the data into training and test sets, but the number of observations is quite small, and since today we are not comparing different algorithms we do not need a final validation. We will therefore skip the train/test split.
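
For reference, if you did want a held-out test set, a split with the caret function createDataPartition could look like the following sketch (not used in the rest of this lab; the object names are ours):

set.seed(1)
train_index <- createDataPartition(PSI_with_lags$West, p = 0.8, list = FALSE)
train_set <- PSI_with_lags[train_index, ]
test_set <- PSI_with_lags[-train_index, ]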

You are encouraged to manually collect more air quality data so that you have enough to construct high-quality models.

Elastic net

The library caret in R provides syntax to train elastic net models. Recall from the lecture that elastic net is a combination of ridge regression and LASSO with the following loss function \[ L_E(\beta)=\sum_{i=1}^{N} \left(y^i - \beta_0 - \sum_{j=1}^{p}\beta_jx_j^i\right)^2+ (1-\alpha)\lambda\sum_{j=1}^{p}\beta_j^2+ \alpha\lambda\sum_{j=1}^{p}|\beta_j| \] for linear regression.
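
To make the penalized loss concrete, here is a small sketch that evaluates \(L_E(\beta)\) directly from the formula above for a given coefficient vector (the function and argument names are ours; glmnet optimizes an equivalent but differently scaled objective internally):

elastic_net_loss <- function(beta0, beta, X, y, lambda, alpha) {
  # residual sum of squares plus the ridge and LASSO penalty terms
  rss <- sum((y - beta0 - as.matrix(X) %*% beta)^2)
  rss + (1 - alpha) * lambda * sum(beta^2) + alpha * lambda * sum(abs(beta))
}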

Here is how we implement it in practice:

elastic_net <- train(
  West ~., data = PSI_with_lags, method = "glmnet",
  trControl = trainControl("cv", number = 5)
)

elastic_net
## glmnet 
## 
## 331 samples
##  20 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 265, 266, 264, 265, 264 
## Resampling results across tuning parameters:
## 
##   alpha  lambda      RMSE       Rsquared   MAE      
##   0.10    0.1108337   4.254744  0.9943914   3.124636
##   0.10    1.1083367   5.652521  0.9900256   4.327172
##   0.10   11.0833674  10.010368  0.9693320   7.813251
##   0.55    0.1108337   4.299985  0.9942264   3.217512
##   0.55    1.1083367   7.123940  0.9841027   5.508054
##   0.55   11.0833674  11.123219  0.9752683   8.733915
##   1.00    0.1108337   4.354425  0.9940840   3.337135
##   1.00    1.1083367   7.030602  0.9847464   5.489573
##   1.00   11.0833674  13.142387  0.9846395  10.457791
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 0.1 and lambda = 0.1108337.

Note that the function train automatically picked candidate values of \(\alpha\) and \(\lambda\) and selected the best pair by cross-validation. Here is how we specify the grid of candidate values manually:

elastic_net <- train(
  West ~., data = PSI_with_lags, method = "glmnet",
  tuneGrid = expand.grid(alpha = seq(from = 0, to = 1, length = 5), 
                         lambda = 10^(seq(from = -2, to = 2, length = 5))),
  trControl = trainControl("cv", number = 5)
)
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.
elastic_net
## glmnet 
## 
## 331 samples
##  20 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 266, 265, 263, 267, 263 
## Resampling results across tuning parameters:
## 
##   alpha  lambda  RMSE       Rsquared   MAE      
##   0.00   1e-02    8.023957  0.9781708   6.238625
##   0.00   1e-01    8.023957  0.9781708   6.238625
##   0.00   1e+00    8.023957  0.9781708   6.238625
##   0.00   1e+01    9.358466  0.9704450   7.335704
##   0.00   1e+02   13.120596  0.9492554  10.425877
##   0.25   1e-02    3.924229  0.9951690   2.911008
##   0.25   1e-01    4.269521  0.9942371   3.169942
##   0.25   1e+00    6.047384  0.9876887   4.727404
##   0.25   1e+01    9.878136  0.9706220   7.742145
##   0.25   1e+02   29.768955  0.9552264  24.199975
##   0.50   1e-02    3.782374  0.9954024   2.812805
##   0.50   1e-01    4.292163  0.9941424   3.197779
##   0.50   1e+00    6.877837  0.9840698   5.373592
##   0.50   1e+01   10.484404  0.9739216   8.253056
##   0.50   1e+02   51.694337  0.9742759  42.659005
##   0.75   1e-02    4.323029  0.9941273   3.314524
##   0.75   1e-01    4.364578  0.9939959   3.339371
##   0.75   1e+00    7.016043  0.9837211   5.491419
##   0.75   1e+01   11.407471  0.9782368   9.054474
##   0.75   1e+02   55.848769        NaN  46.091049
##   1.00   1e-02    4.362148  0.9937972   3.279327
##   1.00   1e-01    4.460897  0.9935867   3.379658
##   1.00   1e+00    7.030310  0.9838008   5.503694
##   1.00   1e+01   12.345624  0.9835875   9.763590
##   1.00   1e+02   55.848769        NaN  46.091049
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 0.5 and lambda = 0.01.

LASSO

Usually we will train LASSO rather than ridge regression because LASSO is capable of variable selection. We will try the following values of \(\lambda\):

lambda <- 10^seq(-2, 0 , length = 5)
lambda
## [1] 0.01000000 0.03162278 0.10000000 0.31622777 1.00000000

Below we train our LASSO:

lasso <- train(
  West ~., data = PSI_with_lags, method = "glmnet",
  trControl = trainControl("cv", number = 10),
  tuneGrid = expand.grid(alpha = 1, lambda = lambda),
  preProcess = c("scale")
)

lasso
## glmnet 
## 
## 331 samples
##  20 predictor
## 
## Pre-processing: scaled (20) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 296, 298, 298, 297, 299, 298, ... 
## Resampling results across tuning parameters:
## 
##   lambda      RMSE      Rsquared   MAE     
##   0.01000000  4.082371  0.9949348  3.128353
##   0.03162278  4.082371  0.9949348  3.128353
##   0.10000000  4.243483  0.9945532  3.274762
##   0.31622777  5.240422  0.9918568  4.090273
##   1.00000000  7.015888  0.9858489  5.496858
## 
## Tuning parameter 'alpha' was held constant at a value of 1
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 1 and lambda = 0.03162278.

Note that we did some data normalization (preProcess = c("scale")), even though, strictly speaking, it is not necessary for our problem since all our variables are measured on the same scale. In general, however, normalization is necessary for LASSO and ridge regression.
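
If you wanted to normalize the predictors yourself instead of relying on preProcess, a common choice is to center and scale each column, as in the sketch below (the name scaled_PSI is ours; this corresponds to preProcess = c("center", "scale"), whereas above we only scaled):

scaled_PSI <- PSI_with_lags %>%
  mutate(across(starts_with("Lag"), ~ (.x - mean(.x)) / sd(.x)))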

Here is how we can extract the best hyperparameter values:

lasso$bestTune

Variables that were actually selected and coefficients in the resulting model are

coef(lasso$finalModel, lasso$bestTune$lambda)
## 21 x 1 sparse Matrix of class "dgCMatrix"
##                         1
## (Intercept)     4.7054280
## Lag_North_2     .        
## Lag_South_2    14.5471520
## Lag_East_2      6.4746970
## Lag_West_2     61.6832397
## Lag_Central_2   .        
## Lag_North_3     .        
## Lag_South_3     .        
## Lag_East_3      .        
## Lag_West_3      .        
## Lag_Central_3   .        
## Lag_North_4     .        
## Lag_South_4     .        
## Lag_East_4      .        
## Lag_West_4     -0.2963903
## Lag_Central_4  -0.1841455
## Lag_North_5    -0.5448523
## Lag_South_5    -2.4176092
## Lag_East_5    -11.8779635
## Lag_West_5    -12.2056699
## Lag_Central_5   .

Coefficients for different values of lambda that we tried are

coef(lasso$finalModel, lambda)
## 21 x 5 sparse Matrix of class "dgCMatrix"
##                         1           2           3         4         5
## (Intercept)     4.7054280   4.7054280   5.4531977  6.276419  7.499788
## Lag_North_2     .           .           .          .         .       
## Lag_South_2    14.5471520  14.5471520  16.9121898 19.704282 14.802099
## Lag_East_2      6.4746970   6.4746970   2.5204375  .         .       
## Lag_West_2     61.6832397  61.6832397  58.5243126 46.814933 40.054310
## Lag_Central_2   .           .           .          .         .       
## Lag_North_3     .           .           .          .         .       
## Lag_South_3     .           .           .          .         .       
## Lag_East_3      .           .           .          .         .       
## Lag_West_3      .           .           .          .         .       
## Lag_Central_3   .           .           .          .         .       
## Lag_North_4     .           .           .          .         .       
## Lag_South_4     .           .           .          .         .       
## Lag_East_4      .           .           .          .         .       
## Lag_West_4     -0.2963903  -0.2963903   .          .         .       
## Lag_Central_4  -0.1841455  -0.1841455   .          .         .       
## Lag_North_5    -0.5448523  -0.5448523  -2.7840846 -4.406256  .       
## Lag_South_5    -2.4176092  -2.4176092  -0.2373293  .         .       
## Lag_East_5    -11.8779635 -11.8779635 -10.3346636 -6.902460  .       
## Lag_West_5    -12.2056699 -12.2056699  -9.4458076  .         .       
## Lag_Central_5   .           .           .          .         .

Now we will plot the LASSO coefficients as was done in Lecture 4. First, we do some pre-processing:

lambda_for_plotting <- 10^seq(from = -2, to = 2, length = 100)
lasso_coefs <- coef(lasso$finalModel, lambda_for_plotting) %>%
  as.matrix %>% t %>% as_tibble %>%
  mutate(lambda = lambda_for_plotting)

head(lasso_coefs)

Question 3

Plot coefficients of the LASSO model except for the intercept (i.e., coefficients at different lags) against \(\lambda\) in different colours. Usually \(\lambda\) is chosen on an exponential scale, so add + scale_x_log10() to your plot.

lasso_coefs %>%
  pivot_longer(starts_with("Lag"), names_to = "variable", values_to = "coef") %>%
  ggplot(aes(x = lambda, y = coef, group = variable, colour = variable)) +
  geom_line() + scale_x_log10()

Question 4

Train ridge regression in the same fashion as we trained LASSO and plot the coefficients. Use \(\lambda\) from 10 to 10000.

lambda <- 10^seq(from = 1, to = 4, length = 100)
ridge <- train(
  West ~., data = PSI_with_lags, method = "glmnet",
  trControl = trainControl("cv", number = 10),
  tuneGrid = expand.grid(alpha = 0, lambda = lambda),
  preProcess = c("scale")
)

coef(ridge$finalModel, lambda) %>%
  as.matrix %>% t %>% as_tibble %>%
  mutate(lambda = lambda) %>%
  pivot_longer(starts_with("Lag"), names_to = "variable", values_to = "coef") %>%
  ggplot(aes(x = lambda, y = coef, group = variable, colour = variable)) +
  geom_line() + scale_x_log10()

Best subset selection

Model construction

Here, we will use the forward stepwise procedure for subset selection. It is not hard to implement from scratch and doing so is a good exercise (a sketch is given below), but in this lab we will just quickly review an existing tool.
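
For instance, a bare-bones sketch of forward stepwise selection, choosing at every step the variable that most reduces the residual sum of squares, could look like this (purely illustrative; the function name forward_stepwise is ours):

forward_stepwise <- function(df, response, max_vars) {
  predictors <- setdiff(names(df), response)
  selected <- character(0)
  path <- list()
  for (k in seq_len(max_vars)) {
    candidates <- setdiff(predictors, selected)
    # RSS of the model obtained by adding each remaining candidate variable
    rss <- sapply(candidates, function(v) {
      f <- reformulate(c(selected, v), response = response)
      sum(residuals(lm(f, data = df))^2)
    })
    selected <- c(selected, candidates[which.min(rss)])
    path[[k]] <- selected
  }
  path
}

forward_stepwise(PSI_with_lags, "West", 3) # variables chosen in the first three steps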

Note that if you want the backward procedure instead, change method = "forward" to method = "backward". For a full exhaustive search, delete this option or change it to "exhaustive". The argument nvmax is the maximum number of variables to consider; here we use all of them and print how the selected subset changes from step to step:

subset_mod <- regsubsets(West ~ . , data = PSI_with_lags,
                         nvmax = ncol(PSI_with_lags) - 1,
                         method = "forward")

sum_subset_mod <- summary(subset_mod)
t(sum_subset_mod$outmat)[ , 1:6]
##               1  ( 1 ) 2  ( 1 ) 3  ( 1 ) 4  ( 1 ) 5  ( 1 ) 6  ( 1 )
## Lag_North_2   " "      " "      " "      " "      " "      " "     
## Lag_South_2   " "      " "      "*"      "*"      "*"      "*"     
## Lag_East_2    " "      " "      " "      " "      " "      " "     
## Lag_West_2    "*"      "*"      "*"      "*"      "*"      "*"     
## Lag_Central_2 " "      " "      " "      " "      " "      " "     
## Lag_North_3   " "      " "      " "      " "      " "      " "     
## Lag_South_3   " "      " "      " "      "*"      "*"      "*"     
## Lag_East_3    " "      " "      " "      " "      " "      " "     
## Lag_West_3    " "      "*"      "*"      "*"      "*"      "*"     
## Lag_Central_3 " "      " "      " "      " "      " "      " "     
## Lag_North_4   " "      " "      " "      " "      " "      " "     
## Lag_South_4   " "      " "      " "      " "      " "      " "     
## Lag_East_4    " "      " "      " "      " "      " "      " "     
## Lag_West_4    " "      " "      " "      " "      "*"      "*"     
## Lag_Central_4 " "      " "      " "      " "      " "      " "     
## Lag_North_5   " "      " "      " "      " "      " "      " "     
## Lag_South_5   " "      " "      " "      " "      " "      " "     
## Lag_East_5    " "      " "      " "      " "      " "      "*"     
## Lag_West_5    " "      " "      " "      " "      " "      " "     
## Lag_Central_5 " "      " "      " "      " "      " "      " "

Now we will print coefficients of the linear models for the first six subsets:

coef(subset_mod, 1:6)
## [[1]]
## (Intercept)  Lag_West_2 
##    1.762717    0.990807 
## 
## [[2]]
## (Intercept)  Lag_West_2  Lag_West_3 
##    2.179403    2.672010   -1.686063 
## 
## [[3]]
## (Intercept) Lag_South_2  Lag_West_2  Lag_West_3 
##  3.23385365  0.08581059  2.48267088 -1.58732878 
## 
## [[4]]
## (Intercept) Lag_South_2  Lag_West_2 Lag_South_3  Lag_West_3 
##   2.4065148   0.6358198   2.0440764  -0.5739850  -1.1199339 
## 
## [[5]]
## (Intercept) Lag_South_2  Lag_West_2 Lag_South_3  Lag_West_3  Lag_West_4 
##   2.1418815   0.6797036   2.2553058  -0.6200748  -1.6012397   0.2740064 
## 
## [[6]]
## (Intercept) Lag_South_2  Lag_West_2 Lag_South_3  Lag_West_3  Lag_West_4 
##  2.65475809  0.60966585  2.23931112 -0.48651891 -1.65719309  0.34134475 
##  Lag_East_5 
## -0.06690284

Variable selection

To select the best model, we will use one of the three criteria that the library leaps conveniently provides for us. For example, below we print the values of Mallow’s \(C_p\):

sum_subset_mod$cp
##  [1] 1664.76276  160.73988  123.56771   69.08263   63.63872   57.48548
##  [7]   54.71256   44.40255   44.26639   42.22911   42.71924   43.46708
## [13]   39.66554   27.89344   18.79798   15.61533   17.02980   18.27691
## [19]   19.53327   21.00000

We will create a data frame that contains all the criteria:

criteria <- tibble(
  k = 1:length(sum_subset_mod$cp),
  Cp = sum_subset_mod$cp,
  BIC = sum_subset_mod$bic,
  "1 - Adj.R2" = 1 - sum_subset_mod$adjr2
)

criteria

We will use one of them, say BIC, to select a single model. Below is the plot, with a vertical line indicating the minimum value of BIC:

ggplot(data = criteria, aes(x = k, y = BIC)) +
  geom_line() + geom_point() +
  geom_vline(aes(xintercept = k[which.min(BIC)]), 
             color = "blue", linetype = "dashed")

Cross-validation

Unfortunately, the library leaps does not calculate cross-validation errors. So we will use familiar tools to do it.

First, note that we have a logical matrix that we can use to extract the variables included in each model:

sum_subset_mod$which[1:5, 1:5]
##   (Intercept) Lag_North_2 Lag_South_2 Lag_East_2 Lag_West_2
## 1        TRUE       FALSE       FALSE      FALSE       TRUE
## 2        TRUE       FALSE       FALSE      FALSE       TRUE
## 3        TRUE       FALSE        TRUE      FALSE       TRUE
## 4        TRUE       FALSE        TRUE      FALSE       TRUE
## 5        TRUE       FALSE        TRUE      FALSE       TRUE

We are going to preprocess it a little. Note that (Intercept) is not a variable name, so we do not need to include it explicitly in our linear models. At the same time, all entries in the column (Intercept) are TRUE. So if we rename (Intercept) to West (our response variable), we get a matrix in which each row indicates the response variable and the predictors of the corresponding model:

variable_selection_matrix <- sum_subset_mod$which
colnames(variable_selection_matrix)[1] <- "West"
variable_selection_matrix[1:5, 1:5]
##   West Lag_North_2 Lag_South_2 Lag_East_2 Lag_West_2
## 1 TRUE       FALSE       FALSE      FALSE       TRUE
## 2 TRUE       FALSE       FALSE      FALSE       TRUE
## 3 TRUE       FALSE        TRUE      FALSE       TRUE
## 4 TRUE       FALSE        TRUE      FALSE       TRUE
## 5 TRUE       FALSE        TRUE      FALSE       TRUE

Now how do we find the cross-validation error of, say, the third model? We need to train a linear regression on a subset of our data that only includes variables from the third row of this matrix, i.e.,

PSI_with_lags %>%
  select(which(variable_selection_matrix[3 , ])) %>%
  head

So the resulting model will be as follows:

caret_mod <- train(West ~ . , PSI_with_lags[ , variable_selection_matrix[3 , ]],
      method = "lm", trControl = trainControl(method = "cv", number = 5))
caret_mod
## Linear Regression 
## 
## 331 samples
##   3 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 264, 267, 264, 264, 265 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   3.649813  0.9957569  2.609453
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE

So let us put it all together. First, we write a function whose input is a named logical vector indicating which variables are included and whose output is the cross-validation error of the resulting model:

cv_error <- function(vars_to_select) {
  caret_mod <- train(West ~ . , PSI_with_lags[ , vars_to_select],
      method = "lm", trControl = trainControl(method = "cv", number = 5))
  caret_mod$results$RMSE
}

cv_error(variable_selection_matrix[ 3, ])
## [1] 3.615362

Now we will apply this function to every row of the matrix variable_selection_matrix. To do it, we use apply as follows:

CV_errors <- apply(variable_selection_matrix, 1, cv_error)
CV_errors
##        1        2        3        4        5        6        7        8 
## 7.660655 3.792041 3.615046 3.497264 3.426319 3.368566 3.342994 3.361050 
##        9       10       11       12       13       14       15       16 
## 3.392665 3.327710 3.331828 3.429345 3.474263 3.348198 3.270747 3.332735 
##       17       18       19       20 
## 3.260021 3.196113 3.212919 3.274538

The second argument of apply in the code chunk above is 1. If, instead, we wanted to apply a function to every column of a matrix, we would use 2. If we wanted to apply a function to every entry of a matrix, we would use c(1,2).
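
Here is a tiny illustration of these three options on a made-up matrix:

m <- matrix(1:6, nrow = 2)
apply(m, 1, sum)        # one value per row (row sums)
apply(m, 2, sum)        # one value per column (column sums)
apply(m, c(1, 2), sqrt) # applied to every entry; the result has the same shape as m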

Finally, we will add our cross-validation errors to our criteria:

criteria$CV <- CV_errors
criteria

And now we will plot them:

ggplot(data = criteria, aes(x = k, y = CV)) +
  geom_line() + geom_point() +
  geom_vline(aes(xintercept = k[which.min(CV)]), 
             color = "blue", linetype = "dashed")

Question 5

Figure out what this plot means:

plot(subset_mod, scale = "Cp")

Answer Every row here represents one model in the forward stepwise process. Variables included in the model are shown in black, those not included in white. The models are sorted from bottom to top according to Mallow’s \(C_p\), so the top row is the model with the smallest \(C_p\).

Survey

There is a link to a simple survey after lab 3:

Answers

Here are the answers: