Info

Objectives

By the end of this activity, you should be able to

  1. Load data from xlsx files rather than from csv.

  2. Convert wide data to long data.

  3. Do basic operations with time series - coerce character vectors to native R time/date format, create lag variables and plot time series.

  4. Train ridge regression and LASSO in R with the library caret, and choose optimal hyperparameter values using cross-validation.

  5. Perform variable selection by best subset or forward stepwise procedure.

Preliminaries

You should be familiar with linear regression and cross-validation. You should also be familiar with the pipe operator, splitting data into training and test sets, data summarization and 2D plotting. This material was covered in labs 1-3.

Mode

Please run the R chunks one by one, look at the output, and make sure that you understand how it is produced. There will be questions that either require a short answer (type your answer right in this document) or require modifying R code (modify the code right here). You can discuss your work with other students and with instructors.

Please fill in the survey at the end; it will only take a couple of minutes.

Libraries

library(tidyverse) # for manipulation with data
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.2     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(caret) # for machine learning, including KNN
## Loading required package: lattice
## 
## Attaching package: 'caret'
## 
## The following object is masked from 'package:purrr':
## 
##     lift
library(lubridate) # for working with time
library(readxl) # for reading data from xlsx
library(Hmisc) # for creating lag variables
## 
## Attaching package: 'Hmisc'
## 
## The following objects are masked from 'package:dplyr':
## 
##     src, summarize
## 
## The following objects are masked from 'package:base':
## 
##     format.pval, units
library(leaps) # for subset selection

library(kableExtra) # for formatting tables in the knitted document
## 
## Attaching package: 'kableExtra'
## 
## The following object is masked from 'package:dplyr':
## 
##     group_rows

Data

Today we will work with data on Singapore haze in Sep - Oct 2015 that Fedor manually downloaded from the National Environment Agency webpage.

Source: https://www.haze.gov.sg/resources/historical-readings

H <-  read_excel("singapore_haze_sep_2015.xlsx", sheet = 1)
head(H)

Note that

  1. We read the data from an xlsx file rather than from a CSV.

  2. The class of the variable Date is actually date/time, a special type in R.

  3. The class of the variable Time is character, i.e., we will have to convert it to a date/time.
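
You can verify this directly; here is a quick illustrative check (the expected classes are noted in the comments):

class(H$Date)  # expected: a date-time class such as "POSIXct" "POSIXt"
class(H$Time)  # expected: "character"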

Processing time series data

Time

Very often we need to work with times and dates. Unlike character values, times and dates can be used in arithmetic operations (time is measured in seconds):

set.seed(280)
x <- sample(H$Date, 3)
cat("Original dates: ")
## Original dates:
print(x)
## [1] "2015-09-30 UTC" "2015-09-22 UTC" "2015-09-30 UTC"
cat("Original dates shifted by 1 day: ")
## Original dates shifted by 1 day:
print(x+86400)
## [1] "2015-10-01 UTC" "2015-09-23 UTC" "2015-10-01 UTC"

There are several tidyverse-friendly ways to work with time in R. Below is a short summary of the main functions, their inputs and outputs, and when to use each.

hm("12:34") [lubridate]: input "HH:MM"; output a Period object (12H 34M 0S). Parses hours and minutes into a period, which is convenient for arithmetic with dates and times.
parse_time("12:34") [readr]: input "HH:MM", "HH:MM:SS" or "12:34 PM"; output an hms object (12:34:00). Parses a clock time of day; strict but flexible, and works well when reading data with read_csv.
ymd_hm("2025-09-01 12:34") [lubridate]: input "YYYY-MM-DD HH:MM"; output a POSIXct datetime (2025-09-01 12:34:00 UTC). Parses a full date plus time into a datetime object, useful for timestamps.
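
As a quick illustrative sketch, here are the three parsers applied to literal strings (the expected results are noted in the comments):

hm("12:34")                 # Period: 12H 34M 0S
parse_time("12:34")         # hms: 12:34:00
ymd_hm("2025-09-01 12:34")  # POSIXct: "2025-09-01 12:34:00 UTC"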

Note that the variable Date has been correctly identified as a date-time ("POSIXct"), while Time has been identified as "character", and hence we need to convert it. Below we try all three functions: hm and parse_time to parse the time of day, and ymd_hm to build a full timestamp from the date and the time:

H <- H %>%
  mutate(hm_time = hm(Time)) %>%
  mutate(lub_time = parse_time(Time)) %>%
  mutate(Timestamp = ymd_hm(paste(Date, Time)))

head(H)

Wide data to long data

Sometimes we need to convert so-called "wide" data to so-called "long" data. Here, instead of five columns North, South, East, West, Central, we want to have just two: one containing the PSI value from any of these five columns and the other containing one of the labels "North", "South", "East", "West", "Central". Thus all five columns are merged into one, and the column names become a new categorical variable. We will call this variable location. The variable containing the values from the five original columns will be called PSI Level.

For this operation, we use the function pivot_longer:

long_data <- H %>%
  pivot_longer(North:Central, names_to = "location", values_to = "PSI Level")

long_data %>%
  sample_n(10)

Plotting several time series

Converting wide data to long data is needed to plot several time series together. This is how we do it:

ggplot(data = long_data, 
       aes(x = Timestamp, y = `PSI Level`, group = location,
           colour = location)) +
  geom_line()

Question 1

Figure out why the plot has these strange regular ups and downs and fix the problem.

Answer This happened because our way of converting character time to proper time worked incorrectly:

H %>%
  slice(24:25) %>%
  select(Date, Time, lub_time, Timestamp, Overall)

Apparently, “12:00AM” of Day N is midnight between Day N and Day N+1 according to National Environment Agency but the same time is midnight between Day N-1 and Day N according to developers of the library lubridate.

A quick fix is to shift the given dates for measurements taken at 12:00AM by one day.

H <- H %>%
  mutate(Timestamp = 
           # if_else from dplyr (unlike base ifelse) keeps the date-time class
           if_else(Time == "12:00AM",
                   ymd_hm(paste(Date + 86400, Time)),  # 86400 seconds = 1 day
                   ymd_hm(paste(Date, Time))))
  

long_data <- H %>%
  select(Timestamp, North:Overall) %>%
  pivot_longer(North:Central, names_to = "location", values_to = "PSI Level")

ggplot(data = long_data, aes(x = Timestamp, 
                             y = `PSI Level`, group = location,
                             colour = location)) +
  geom_line()

Creating lag variables

We will try to predict the haze level in some location knowing the history of observations in all the locations. The motivation here is that in order to plan outdoor activities during a haze period, we need to know whether air quality is going to get better or worse.

To do this, we will need the function Lag from the library Hmisc (it is, in fact, a very simple function and you can write it from scratch):

cat("10 PSI measurements in the West: ", head(H$West, n = 10), "\n")
## 10 PSI measurements in the West:  132 129 126 121 117 114 110 106 103 100
cat("10 PSI measurements in the West with lag 3: ", 
    head(Lag(H$West, 3), n = 10), "\n")
## 10 PSI measurements in the West with lag 3:  NA NA NA 132 129 126 121 117 114 110
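
As mentioned, Lag is easy to write from scratch. Here is a minimal sketch (my_lag is a hypothetical helper, assuming a non-negative integer lag):

my_lag <- function(x, k = 1) {
  # Prepend k missing values and drop the last k observations,
  # mimicking Hmisc::Lag for a non-negative integer lag k
  if (k == 0) return(x)
  c(rep(NA, k), head(x, -k))
}

cat("Same 10 measurements lagged by 3 with my_lag: ",
    head(my_lag(H$West, 3), n = 10), "\n")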

We will create 20 new variables for measurements in the 5 locations lagged by 2, 3, 4, and 5 hours.

First, here is how we create a lagged version of the measurements in all the locations with a specific lag. Since the measurements are stored in five numeric columns of the data, we can apply the same lag transformation to each of them. Below we apply the function Lag( . , 1) to all five location columns and then rename these columns by adding the word "Lag" in front and the number "1" at the back:

H %>%
  transmute(across(c(North, South, East, West, Central),
                   ~Lag(., 1),
                   .names = "Lag_{col}_1"))

Now we will create a function with two inputs, a data frame df and an integer L. This function will return a new data frame that contains renamed lagged observations of the location variables in the original data frame:

lagged_obs <- function(df, L) {
  # Returns lagged (by L) and renamed versions of the location variables in df
  # For convenience, if L = 0, we just return the original df
  if (L == 0) {
    return(df)
  }
  df %>%
    transmute(across(c(North, South, East, West, Central),
                     ~Lag(., L),
                     .names = "Lag_{col}_{L}"))
}

lagged_obs(H, 3) %>% head

Now we will create four versions of our original data frame with different lags and merge them into one big data frame together with the original data. We will use the function map_dfc for this purpose.

We will also select West (it will be our response variable) and drop rows containing missing values.

PSI_with_lags <- c(0, 2:5) %>%
  map_dfc(function(L) lagged_obs(H, L)) %>%
  select(West, starts_with("Lag")) %>%
  drop_na()

head(PSI_with_lags)

You are encouraged to comment out parts of the pipe above to see the output of individual steps.
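
For example, stopping the pipe before drop_na() shows how many rows are lost to the missing values introduced by the lags (a small illustrative check; before_dropping is a hypothetical name):

before_dropping <- c(0, 2:5) %>%
  map_dfc(function(L) lagged_obs(H, L)) %>%
  select(West, starts_with("Lag"))

# rows removed by drop_na(), i.e., rows with at least one missing lagged value
nrow(before_dropping) - nrow(drop_na(before_dropping))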

Question 2

If we really wanted to construct an accurate model that predicts the haze level in the next few hours, what kind of data would we need besides lagged PSI levels across different locations in Singapore? Is it realistic to collect such data?

Answer We need weather data. Most importantly, we need wind direction and wind strength. Some weather data can be found here:

However, it only has daily observations and no wind direction. Wind direction is available here:

But it is shown visually and it is hard to extract a numeric wind direction.

The point of this exercise is to see how hard it may be to get access to simple data even if the data is well-structured and collected automatically by government agencies.

Regularization

Dimensions of our data are

dim(PSI_with_lags)
## [1] 331  21

We could split the data into training and test sets. However, the number of observations is quite small, and since today we are not comparing different algorithms, we do not need a final validation. We will therefore skip splitting the data into training and test sets.

You are encouraged to manually collect more air quality data so that you have enough for constructing high-quality models.
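
If you do collect more data, the split can be done with caret as in earlier labs. Here is a minimal sketch (the 80/20 proportion is an arbitrary choice, and for time series a chronological split is often preferable to a random one):

set.seed(1)
train_idx <- createDataPartition(PSI_with_lags$West, p = 0.8, list = FALSE)[, 1]
PSI_train <- PSI_with_lags[train_idx, ]   # hypothetical training set
PSI_test  <- PSI_with_lags[-train_idx, ]  # hypothetical test set
dim(PSI_train)
dim(PSI_test)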

Elastic net

The library caret in R provides syntax to train elastic net models. Recall from the lecture that elastic net is a combination of ridge regression and LASSO with the following loss function \[ L_E(\beta)=\sum_{i=1}^{N} \left(y^i - \beta_0 - \sum_{j=1}^{p}\beta_jx_j^i\right)^2+ (1-\alpha)\lambda\sum_{j=1}^{p}\beta_j^2+ \alpha\lambda\sum_{j=1}^{p}|\beta_j| \] for linear regression.

Here is how we implement it in practice:

elastic_net <- train(
  West ~., data = PSI_with_lags, method = "glmnet",
  trControl = trainControl("cv", number = 5)
)

elastic_net
## glmnet 
## 
## 331 samples
##  20 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 265, 266, 264, 265, 264 
## Resampling results across tuning parameters:
## 
##   alpha  lambda      RMSE       Rsquared   MAE      
##   0.10    0.1108337   4.254744  0.9943914   3.124636
##   0.10    1.1083367   5.652521  0.9900256   4.327172
##   0.10   11.0833674  10.010368  0.9693320   7.813251
##   0.55    0.1108337   4.299985  0.9942264   3.217512
##   0.55    1.1083367   7.123940  0.9841027   5.508054
##   0.55   11.0833674  11.123219  0.9752683   8.733915
##   1.00    0.1108337   4.354425  0.9940840   3.337135
##   1.00    1.1083367   7.030602  0.9847464   5.489573
##   1.00   11.0833674  13.142387  0.9846395  10.457791
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 0.1 and lambda = 0.1108337.

Note that the function train automatically picked values of \(\alpha\) and \(\lambda\). Here is how we do it manually:

elastic_net <- train(
  West ~., data = PSI_with_lags, method = "glmnet",
  tuneGrid = expand.grid(alpha = seq(from = 0, to = 1, length = 5), 
                         lambda = 10^(seq(from = -2, to = 2, length = 5))),
  trControl = trainControl("cv", number = 5)
)
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
elastic_net
## glmnet 
## 
## 331 samples
##  20 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 266, 265, 263, 267, 263 
## Resampling results across tuning parameters:
## 
##   alpha  lambda  RMSE       Rsquared   MAE      
##   0.00   1e-02    8.023957  0.9781708   6.238625
##   0.00   1e-01    8.023957  0.9781708   6.238625
##   0.00   1e+00    8.023957  0.9781708   6.238625
##   0.00   1e+01    9.358466  0.9704450   7.335704
##   0.00   1e+02   13.120596  0.9492554  10.425877
##   0.25   1e-02    3.924229  0.9951690   2.911008
##   0.25   1e-01    4.269521  0.9942371   3.169942
##   0.25   1e+00    6.047384  0.9876887   4.727404
##   0.25   1e+01    9.878136  0.9706220   7.742145
##   0.25   1e+02   29.768955  0.9552264  24.199975
##   0.50   1e-02    3.782374  0.9954024   2.812805
##   0.50   1e-01    4.292163  0.9941424   3.197779
##   0.50   1e+00    6.877837  0.9840698   5.373592
##   0.50   1e+01   10.484404  0.9739216   8.253056
##   0.50   1e+02   51.694337  0.9742759  42.659005
##   0.75   1e-02    4.323029  0.9941273   3.314524
##   0.75   1e-01    4.364578  0.9939959   3.339371
##   0.75   1e+00    7.016043  0.9837211   5.491419
##   0.75   1e+01   11.407471  0.9782368   9.054474
##   0.75   1e+02   55.848769        NaN  46.091049
##   1.00   1e-02    4.362148  0.9937972   3.279327
##   1.00   1e-01    4.460897  0.9935867   3.379658
##   1.00   1e+00    7.030310  0.9838008   5.503694
##   1.00   1e+01   12.345624  0.9835875   9.763590
##   1.00   1e+02   55.848769        NaN  46.091049
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 0.5 and lambda = 0.01.

LASSO

Usually we will train LASSO rather than ridge regression because LASSO is capable of variable selection. We will try the following values of \(\lambda\):

lambda <- 10^seq(-2, 0 , length = 5)
lambda
## [1] 0.01000000 0.03162278 0.10000000 0.31622777 1.00000000

Below we train our LASSO:

lasso <- train(
  West ~., data = PSI_with_lags, method = "glmnet",
  trControl = trainControl("cv", number = 10),
  tuneGrid = expand.grid(alpha = 1, lambda = lambda),
  preProcess = c("scale")
)

lasso
## glmnet 
## 
## 331 samples
##  20 predictor
## 
## Pre-processing: scaled (20) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 296, 298, 298, 297, 299, 298, ... 
## Resampling results across tuning parameters:
## 
##   lambda      RMSE      Rsquared   MAE     
##   0.01000000  4.082371  0.9949348  3.128353
##   0.03162278  4.082371  0.9949348  3.128353
##   0.10000000  4.243483  0.9945532  3.274762
##   0.31622777  5.240422  0.9918568  4.090273
##   1.00000000  7.015888  0.9858489  5.496858
## 
## Tuning parameter 'alpha' was held constant at a value of 1
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 1 and lambda = 0.03162278.

Note that we did some data normalization, even though, strictly speaking, this is not necessary for our problem since all our variables are measured on the same scale. But usually data normalization is necessary for LASSO and ridge regressions.
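
If the predictors were on different scales, we would typically both center and scale them; in caret this only changes the preProcess argument. Here is a sketch (lasso_std is a hypothetical name and is not used below):

lasso_std <- train(
  West ~ ., data = PSI_with_lags, method = "glmnet",
  trControl = trainControl("cv", number = 10),
  tuneGrid = expand.grid(alpha = 1, lambda = lambda),
  preProcess = c("center", "scale")  # center and scale each predictor
)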

Here is how we can extract the best hyperparameter values:

lasso$bestTune

Variables that were actually selected and coefficients in the resulting model are

coef(lasso$finalModel, lasso$bestTune$lambda)
## 21 x 1 sparse Matrix of class "dgCMatrix"
##               s=0.03162278
## (Intercept)      4.7054279
## Lag_North_2      .        
## Lag_South_2     14.5471519
## Lag_East_2       6.4746970
## Lag_West_2      61.6832397
## Lag_Central_2    .        
## Lag_North_3      .        
## Lag_South_3      .        
## Lag_East_3       .        
## Lag_West_3       .        
## Lag_Central_3    .        
## Lag_North_4      .        
## Lag_South_4      .        
## Lag_East_4       .        
## Lag_West_4      -0.2963903
## Lag_Central_4   -0.1841455
## Lag_North_5     -0.5448523
## Lag_South_5     -2.4176092
## Lag_East_5     -11.8779635
## Lag_West_5     -12.2056699
## Lag_Central_5    .

Here is how we can print the estimates as a nicely formatted table:

best_lambda <- lasso$bestTune$lambda

coefs <- coef(lasso$finalModel, s = best_lambda) %>%
  as.matrix() %>%
  as.data.frame() %>%
  rownames_to_column("term") %>%
  rename_with(~"estimate", .cols = everything()[2])  

coefs %>%
  filter(estimate != 0) %>%
  arrange(desc(abs(estimate))) %>%
  kable(digits = 4,
        caption = sprintf("Lasso coefficients (lambda = %.4g)", best_lambda)) %>%
  kable_styling(full_width = FALSE,
                bootstrap_options = c("striped", "hover"))
Lasso coefficients (lambda = 0.03162)
term estimate
Lag_West_2 61.6832
Lag_South_2 14.5472
Lag_West_5 -12.2057
Lag_East_5 -11.8780
Lag_East_2 6.4747
(Intercept) 4.7054
Lag_South_5 -2.4176
Lag_North_5 -0.5449
Lag_West_4 -0.2964
Lag_Central_4 -0.1841

Coefficients for different values of lambda that we tried are

coef(lasso$finalModel, lambda) 
## 21 x 5 sparse Matrix of class "dgCMatrix"
##               s=0.01000000 s=0.03162278 s=0.10000000 s=0.31622777 s=1.00000000
## (Intercept)      4.7054279    4.7054279    5.4531977     6.276419     7.499788
## Lag_North_2      .            .            .             .            .       
## Lag_South_2     14.5471519   14.5471519   16.9121898    19.704282    14.802099
## Lag_East_2       6.4746970    6.4746970    2.5204375     .            .       
## Lag_West_2      61.6832397   61.6832397   58.5243126    46.814933    40.054310
## Lag_Central_2    .            .            .             .            .       
## Lag_North_3      .            .            .             .            .       
## Lag_South_3      .            .            .             .            .       
## Lag_East_3       .            .            .             .            .       
## Lag_West_3       .            .            .             .            .       
## Lag_Central_3    .            .            .             .            .       
## Lag_North_4      .            .            .             .            .       
## Lag_South_4      .            .            .             .            .       
## Lag_East_4       .            .            .             .            .       
## Lag_West_4      -0.2963903   -0.2963903    .             .            .       
## Lag_Central_4   -0.1841455   -0.1841455    .             .            .       
## Lag_North_5     -0.5448523   -0.5448523   -2.7840846    -4.406256     .       
## Lag_South_5     -2.4176092   -2.4176092   -0.2373293     .            .       
## Lag_East_5     -11.8779635  -11.8779635  -10.3346636    -6.902460     .       
## Lag_West_5     -12.2056699  -12.2056699   -9.4458076     .            .       
## Lag_Central_5    .            .            .             .            .

Or, printed as a nicely formatted table:

coef(lasso$finalModel, lambda) %>%
  as.matrix() %>%
  as.data.frame() %>%
  rownames_to_column("term") %>%
  rename_with(~"estimate", .cols = everything()[2]) %>%
  kable(digits = 4, 
        caption = sprintf("Lasso coefficients")) %>%
  kable_styling(full_width = FALSE,
                bootstrap_options = c("striped", "hover"))
Lasso coefficients
term estimate s=0.03162278 s=0.10000000 s=0.31622777 s=1.00000000
(Intercept) 4.7054 4.7054 5.4532 6.2764 7.4998
Lag_North_2 0.0000 0.0000 0.0000 0.0000 0.0000
Lag_South_2 14.5472 14.5472 16.9122 19.7043 14.8021
Lag_East_2 6.4747 6.4747 2.5204 0.0000 0.0000
Lag_West_2 61.6832 61.6832 58.5243 46.8149 40.0543
Lag_Central_2 0.0000 0.0000 0.0000 0.0000 0.0000
Lag_North_3 0.0000 0.0000 0.0000 0.0000 0.0000
Lag_South_3 0.0000 0.0000 0.0000 0.0000 0.0000
Lag_East_3 0.0000 0.0000 0.0000 0.0000 0.0000
Lag_West_3 0.0000 0.0000 0.0000 0.0000 0.0000
Lag_Central_3 0.0000 0.0000 0.0000 0.0000 0.0000
Lag_North_4 0.0000 0.0000 0.0000 0.0000 0.0000
Lag_South_4 0.0000 0.0000 0.0000 0.0000 0.0000
Lag_East_4 0.0000 0.0000 0.0000 0.0000 0.0000
Lag_West_4 -0.2964 -0.2964 0.0000 0.0000 0.0000
Lag_Central_4 -0.1841 -0.1841 0.0000 0.0000 0.0000
Lag_North_5 -0.5449 -0.5449 -2.7841 -4.4063 0.0000
Lag_South_5 -2.4176 -2.4176 -0.2373 0.0000 0.0000
Lag_East_5 -11.8780 -11.8780 -10.3347 -6.9025 0.0000
Lag_West_5 -12.2057 -12.2057 -9.4458 0.0000 0.0000
Lag_Central_5 0.0000 0.0000 0.0000 0.0000 0.0000

Now we will plot coefficients of LASSO as it was done in Lecture 4. First, we do some pre-processing for that

lambda_for_plotting <- 10^seq(from = -2, to = 2, length = 100)

# lasso <- train(
#   West ~., data = PSI_with_lags, method = "glmnet",
#   trControl = trainControl("cv", number = 10),
#   tuneGrid = expand.grid(alpha = 0.5, lambda = lambda_for_plotting),
#   preProcess = c("scale")
# )

lasso_coefs <- coef(lasso$finalModel, lambda_for_plotting) %>%
  as.matrix %>% t %>% as_tibble %>%
  mutate(lambda = lambda_for_plotting)

head(lasso_coefs)

Question 3

Plot coefficients of the LASSO model except for the intercept (i.e., coefficients at different lags) against \(\lambda\) in different colours. Usually \(\lambda\) is chosen on an exponential scale, so add + scale_x_log10() to your plot.

lasso_coefs %>%
  pivot_longer(starts_with("Lag"), names_to = "variable", values_to = "coef") %>%
  ggplot(aes(x = lambda, y = coef, group = variable, colour = variable)) +
  geom_line() + scale_x_log10()

Remark (prettier plots)

Here is how we can make the plot prettier by choosing a different colour scheme and a different way to print labels on the \(\lambda\)-axis:

library(viridis)
## Loading required package: viridisLite
library(scales)
## 
## Attaching package: 'scales'
## The following object is masked from 'package:viridis':
## 
##     viridis_pal
## The following object is masked from 'package:purrr':
## 
##     discard
## The following object is masked from 'package:readr':
## 
##     col_factor
lasso_coefs %>%
  pivot_longer(starts_with("Lag"), names_to = "variable", values_to = "coef") %>%
  ggplot(aes(x = lambda, y = coef, colour = variable)) +
  geom_line() +
  scale_x_log10(
    breaks = 10^(-2:2),
    labels = function(b) scales::math_format(10^.x)(log10(b)),
    minor_breaks = NULL
  ) +
  scale_colour_viridis_d(option = "turbo", end = 0.95, begin = 0.05) +
  theme_minimal()

Question 4

Train ridge regression in the same fashion as we trained LASSO and plot the coefficients. Use \(\lambda\) from 10 to 10000.

lambda <- 10^seq(from = 1, to = 4, length = 100)
ridge <- train(
  West ~., data = PSI_with_lags, method = "glmnet",
  trControl = trainControl("cv", number = 10),
  tuneGrid = expand.grid(alpha = 0, lambda = lambda),
  preProcess = c("scale")
)

coef(ridge$finalModel, lambda) %>%
  as.matrix %>% t %>% as_tibble %>%
  mutate(lambda = lambda) %>%
  pivot_longer(starts_with("Lag"), names_to = "variable", values_to = "coef") %>%
  ggplot(aes(x = lambda, y = coef, group = variable, colour = variable)) +
  geom_line() + scale_x_log10()

Best subset selection

Model construction

Here, we will use the forward stepwise procedure for subset selection. It is not hard to implement it from scratch and it is a good exercise, but here we will just quickly review an existing tool.
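
For reference, here is a minimal from-scratch sketch of forward stepwise selection by residual sum of squares (forward_stepwise is a hypothetical helper written for illustration; the regsubsets tool below is what we actually use):

forward_stepwise <- function(data, response, max_vars = ncol(data) - 1) {
  remaining <- setdiff(names(data), response)
  selected <- character(0)
  for (step in seq_len(max_vars)) {
    # RSS of the current model with each remaining candidate added
    rss <- sapply(remaining, function(v) {
      deviance(lm(reformulate(c(selected, v), response = response), data = data))
    })
    best <- names(which.min(rss))
    selected <- c(selected, best)
    remaining <- setdiff(remaining, best)
  }
  selected  # variables in the order they entered the model
}

forward_stepwise(PSI_with_lags, "West", max_vars = 5)

The order in which variables are added should match the output of the leaps procedure below.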

Note that if you want the backward procedure instead, change method = "forward" to method = "backward"; for a full exhaustive search, delete this option or change it to "exhaustive". The argument nvmax is the maximal number of variables that we will consider. Here we allow all of them and print how the best subset changes from step to step for steps 1 to 5:

p <- ncol(PSI_with_lags) - 1
subset_mod <- regsubsets(West ~ . , data = PSI_with_lags,
                         nvmax = p,
                         method = "forward")

sum_subset_mod <- summary(subset_mod)
t(sum_subset_mod$outmat)[ , 1:5]
##               1  ( 1 ) 2  ( 1 ) 3  ( 1 ) 4  ( 1 ) 5  ( 1 )
## Lag_North_2   " "      " "      " "      " "      " "     
## Lag_South_2   " "      " "      "*"      "*"      "*"     
## Lag_East_2    " "      " "      " "      " "      " "     
## Lag_West_2    "*"      "*"      "*"      "*"      "*"     
## Lag_Central_2 " "      " "      " "      " "      " "     
## Lag_North_3   " "      " "      " "      " "      " "     
## Lag_South_3   " "      " "      " "      "*"      "*"     
## Lag_East_3    " "      " "      " "      " "      " "     
## Lag_West_3    " "      "*"      "*"      "*"      "*"     
## Lag_Central_3 " "      " "      " "      " "      " "     
## Lag_North_4   " "      " "      " "      " "      " "     
## Lag_South_4   " "      " "      " "      " "      " "     
## Lag_East_4    " "      " "      " "      " "      " "     
## Lag_West_4    " "      " "      " "      " "      "*"     
## Lag_Central_4 " "      " "      " "      " "      " "     
## Lag_North_5   " "      " "      " "      " "      " "     
## Lag_South_5   " "      " "      " "      " "      " "     
## Lag_East_5    " "      " "      " "      " "      " "     
## Lag_West_5    " "      " "      " "      " "      " "     
## Lag_Central_5 " "      " "      " "      " "      " "

Now we will print the coefficients of the linear models for the first five subsets:

step_models <- coef(subset_mod, 1:p)
step_models[1:5]
## [[1]]
## (Intercept)  Lag_West_2 
##    1.762717    0.990807 
## 
## [[2]]
## (Intercept)  Lag_West_2  Lag_West_3 
##    2.179403    2.672010   -1.686063 
## 
## [[3]]
## (Intercept) Lag_South_2  Lag_West_2  Lag_West_3 
##  3.23385365  0.08581059  2.48267088 -1.58732878 
## 
## [[4]]
## (Intercept) Lag_South_2  Lag_West_2 Lag_South_3  Lag_West_3 
##   2.4065148   0.6358198   2.0440764  -0.5739850  -1.1199339 
## 
## [[5]]
## (Intercept) Lag_South_2  Lag_West_2 Lag_South_3  Lag_West_3  Lag_West_4 
##   2.1418815   0.6797036   2.2553058  -0.6200748  -1.6012397   0.2740064

Variable selection

To select the best model, we can use one of the three criteria that the library leaps calculates. We will create a data frame that contains all the criteria:

criteria <- tibble(
  k = 1:length(sum_subset_mod$cp),
  Cp = sum_subset_mod$cp,
  BIC = sum_subset_mod$bic,
  "1 - Adj.R2" = 1 - sum_subset_mod$adjr2
)

criteria

We can use any one of them to select the best model. For convenience, here is the plot:

criteria %>%
  pivot_longer(cols = -k) %>%
  ggplot(aes(x = k, y = value)) +
  geom_line() + geom_point() + facet_wrap(vars(name), scales = "free") + 
  theme_minimal()

And here are the values of \(k\) minimizing each of the criteria:

criteria %>%
  summarise(across(-k, ~k[which.min(.)]))

Cross-validation

Unfortunately, the library leaps does not calculate cross-validation errors. So we will use familiar tools, i.e., caret::train to do it.

The first thing to do is to extract the predictor names from the object constructed with the leaps library. Below we do this and print the first five sets of predictors:

p <- ncol(PSI_with_lags) - 1

step_predictors <- 1:p %>% 
  map(~coef(subset_mod, .)[-1]) %>%
  map(names)

step_predictors[1:5]
## [[1]]
## [1] "Lag_West_2"
## 
## [[2]]
## [1] "Lag_West_2" "Lag_West_3"
## 
## [[3]]
## [1] "Lag_South_2" "Lag_West_2"  "Lag_West_3" 
## 
## [[4]]
## [1] "Lag_South_2" "Lag_West_2"  "Lag_South_3" "Lag_West_3" 
## 
## [[5]]
## [1] "Lag_South_2" "Lag_West_2"  "Lag_South_3" "Lag_West_3"  "Lag_West_4"

Question 5

Why do we need [-1] in the R code above?

Answer In the R code above, [-1] removes (Intercept) from the vector of predictor names.
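
For example, for the third model (a small illustrative check):

coef(subset_mod, 3)            # named vector, includes (Intercept)
names(coef(subset_mod, 3)[-1]) # predictor names only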


Now how do we find the cross-validation error of, say, the third model?

step_predictors[[3]]
## [1] "Lag_South_2" "Lag_West_2"  "Lag_West_3"

The model is as follows:

caret_mod <- train(West ~ . , 
                   PSI_with_lags[ , c("West", step_predictors[[3]])],
                   method = "lm", 
                   trControl = trainControl(method = "cv", number = 5))
caret_mod
## Linear Regression 
## 
## 331 samples
##   3 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 264, 267, 264, 264, 265 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   3.649813  0.9957569  2.609453
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE

So let us put it all together. First, we write a function whose input is a character vector of predictor names and whose output is the cross-validation error of the resulting model:

cv_error <- function(vars_to_select) {
  caret_mod <- train(West ~ . , 
                     PSI_with_lags[ , c("West", vars_to_select)],
                     method = "lm", 
                     trControl = trainControl(method = "cv", number = 5))
  caret_mod$results$RMSE
}

cv_error(step_predictors[[3]])
## [1] 3.615362

Now we will apply this function to all variable combinations extracted from the leaps object:

CV_errors <- map_vec(step_predictors, cv_error)
CV_errors
##  [1] 7.660655 3.792041 3.615046 3.497264 3.426319 3.368566 3.342994 3.361050
##  [9] 3.392665 3.327710 3.331828 3.429345 3.474263 3.348198 3.270747 3.332735
## [17] 3.260021 3.196113 3.212919 3.274538

Finally, we will add our cross-validation errors to our criteria:

criteria <- criteria %>% mutate(CV = CV_errors)
criteria

And now we will plot them:

ggplot(data = criteria, aes(x = k, y = CV)) +
  geom_line() + geom_point() +
  geom_vline(aes(xintercept = k[which.min(CV)]), 
             color = "blue", linetype = "dashed")

Question 6

Figure out what this plot means:

plot(subset_mod, scale = "Cp")

Answer Every row here represents one model in the forward stepwise process. Variables included in the model are shown in black; variables not included are shown in white. Models are sorted from bottom to top according to Mallows's \(C_p\), so the top row is the model with the smallest \(C_p\).
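
As a quick illustrative check, the top row of the plot should correspond to the step whose model minimizes \(C_p\) among the criteria computed earlier:

which.min(sum_subset_mod$cp)  # the step k whose model has the smallest Cp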

Survey

There is a link to a simple survey after lab 4:

Answers

Here are the answers: