By the end of this activity, you should be able to

- Load data from xlsx files rather than from csv.
- Convert wide data to long data.
- Do basic operations with time series: coerce character vectors to R's native date/time format, create lag variables, and plot time series.
- Train ridge regression and LASSO in R with the library caret, choosing the optimal values of the hyperparameters by cross-validation.
- Perform variable selection by the best subset or forward stepwise procedure.
You should be familiar with linear regression and cross-validation. You should also be familiar with the pipe operator, splitting data into training and test sets, data summarization and 2D plotting. This material was covered in labs 1-3.
Please run the R chunks one by one, look at the output, and make sure that you understand how it is produced. Some questions require a short answer, which you type right in this document; others require modifying the R code, which you also do here. You can discuss your work with other students and with the instructors.
Please fill in the survey at the end; it will only take a couple of minutes.
library(tidyverse) # for manipulation with data
library(caret) # for machine learning, including KNN
library(lubridate) # for working with time
library(readxl) # for reading data from xlsx
library(Hmisc) # for creating lag variables
library(leaps) # for subset selection
library(kableExtra) # for formatting tables in the knitted document

Today we will work with data on Singapore haze in Sep - Oct 2015 that Fedor manually downloaded from the National Environment Agency webpage.
Source: https://www.haze.gov.sg/resources/historical-readings
H <- read_excel("singapore_haze_sep_2015.xlsx", sheet = 1)
head(H)

Note that

- We read the data from an xlsx file rather than from a CSV.
- The class of the variable Date is a date/time class, a special type in R.
- The class of the variable Time is character, so we will have to convert it to a date/time.
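To check this yourself, you can inspect the column classes directly (a quick sanity check; the exact class names shown in the comments are what readxl usually returns and may differ slightly):

str(H)          # column types are printed next to each column name
class(H$Date)   # a date-time class, e.g. "POSIXct" "POSIXt"
class(H$Time)   # "character" -- this is the one we need to convert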
Very often we need to work with times and dates. Unlike character values, times and dates can be used in arithmetic operations (time is measured in seconds):
set.seed(280)
x <- sample(H$Date, 3)
cat("Original dates: ")## Original dates:print(x)## [1] "2015-09-30 UTC" "2015-09-22 UTC" "2015-09-30 UTC"cat("Original dates shifted by 1 day: ")## Original dates shifted by 1 day:print(x+86400)## [1] "2015-10-01 UTC" "2015-09-23 UTC" "2015-10-01 UTC"There are several tidyverse-friendly methods to work
with time in R. Below is a little table explaining advantages and
disadvanteges of these methods and R codes.
| Function | Package | Input Example | Output Example | Purpose / Notes | 
|---|---|---|---|---|
| hm("12:34") | lubridate | "HH:MM" | Periodobject: 12H 34M 0S | Parses hours and minutes into a duration (good for arithmetic with dates/times). | 
| parse_time("12:34") | readr | "HH:MM"or"HH:MM:SS"or"12:34 PM" | hmsobject: 12:34:00 | Parses a clock time of day (strict and flexible;
works well when reading data with read_csv). | 
| ymd_hm("2025-09-01 12:34") | lubridate | "YYYY-MM-DD HH:MM" | POSIXct datetime: 2025-09-01 12:34:00 UTC | Parses a full date + time into a datetime object, useful for timestamps. | 
Note that the variable Date has been correctly identified as a date-time ("POSIXct"), while Time has been identified as "character", so we need to convert it. We could use the function hm from the library lubridate for that:
H <- H %>%
  mutate(hm_time = hm(Time)) %>%
  mutate(lub_time = parse_time(Time)) %>%
  mutate(Timestamp = ymd_hm(paste(Date, Time)))
head(H)

Sometimes we need to convert so-called "wide" data to so-called
“long” data. Here, instead of five columns North,
South, East, West,
Central, we want to have just two — one containing the
value of PSI from any of these five columns and the second containing
one of the labels “North”, “South”, “East”, “West”, “Central”. Thus all
five columns are merged into one and column names become the new
categorical variable. We will call this variable location.
The variable containing values from all these five original columns will
be called PSI Level.
For this operation, we use the function
pivot_longer:
long_data <- H %>%
  pivot_longer(North:Central, names_to = "location", values_to = "PSI Level")
long_data %>%
  sample_n(10)

Converting wide data to long data is needed to plot several time series together. This is how we do it:
ggplot(data = long_data, 
       aes(x = Timestamp, y = `PSI Level`, group = location,
           colour = location)) +
  geom_line()

Figure out why this plot has these strange regular ups and downs and fix the problem.
Answer This happened because our conversion of the character time to a proper date-time did not work correctly:
H %>%
  slice(24:25) %>%
  select(Date, Time, lub_time, Timestamp, Overall)

Apparently, "12:00AM" of Day N is midnight between Day N and Day N+1
according to National Environment Agency but the same time is midnight
between Day N-1 and Day N according to developers of the library
lubridate.
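You can verify lubridate's convention on a literal string (a quick check, using the same parsing call as above):

ymd_hm(paste("2015-09-25", "12:00AM"))
# "2015-09-25 UTC", i.e. 12:00AM is parsed as the very start of 25 September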
A quick fix is to shift the given dates for measurements taken at 12:00AM by one day.
H <- H %>%
  mutate(Timestamp = 
           if_else(Time == "12:00AM",   # if_else() keeps the date-time class; ifelse() would drop it
                   ymd_hm(paste(Date + 86400, Time)),
                   ymd_hm(paste(Date, Time))))
  
long_data <- H %>%
  select(Timestamp, North:Overall) %>%
  pivot_longer(North:Central, names_to = "location", values_to = "PSI Level")
ggplot(data = long_data, aes(x = Timestamp, 
                             y = `PSI Level`, group = location,
                             colour = location)) +
  geom_line()

We will try to predict the haze level in one location knowing the history of observations in all the locations. The motivation is that in order to plan outdoor activities during a haze period, we need to know whether air quality is going to get better or worse.
To do this, we will need the function Lag from the
library Hmisc (it is, in fact, a very simple function and
you can write it from scratch):
cat("10 PSI measurements in the West: ", head(H$West, n = 10), "\n")## 10 PSI measurements in the West:  132 129 126 121 117 114 110 106 103 100cat("10 PSI measurements in the West with lag 3: ", 
    head(Lag(H$West, 3), n = 10), "\n")## 10 PSI measurements in the West with lag 3:  NA NA NA 132 129 126 121 117 114 110We will create 20 new variables for measurements in the 5 locations lagged by 2, 3, 4, and 5 hours.
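Before building those, here is a minimal from-scratch sketch of what Lag does (my_lag is a hypothetical helper written for illustration, not Hmisc's implementation):

my_lag <- function(x, shift = 1) {
  # pad the front with NAs and drop the last `shift` values
  c(rep(NA, shift), head(x, length(x) - shift))
}
head(my_lag(H$West, 3), n = 10)   # should match Lag(H$West, 3) above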
First, here is how we create a lagged version of the measurements in all the locations with a specific lag. We will use the fact that the measurements are stored in five numeric columns, so we can apply the lag transformation to all of them at once. Below we apply the function Lag( . , 1) to the five location columns and then rename these columns by adding the word "Lag" in front and the number "1" at the back:
H %>%
  transmute(across(c(North, South, East, West, Central),
                   ~Lag(., 1),
                   .names = "Lag_{col}_1"))How we will create a function with two inputs, a data frame
df and an integer L. This function will return
a new data frame that contains renamed lagged observations of all
numeric variables in the original data frame:
lagged_obs <- function(df, L) {
  # Returns renamed lagged by L versions of all numeric variables in df
  # For convenience, if L = 0, we will just return the original df
  if (L == 0) {
    return(df)
  }
  df %>%
    transmute(across(c(North, South, East, West, Central),
                     ~Lag(., L),
                     .names = "Lag_{col}_{L}"))
}
lagged_obs(H, 3) %>% head

Now we will create four versions of our original data frame with
different lags and merge them into one big data frame together with the
original data. We will use the function map_dfc for this
purpose.
We will also select West (it will be our response
variable) and drop rows containing missing values.
PSI_with_lags <- c(0, 2:5) %>%
  map_dfc(function(L) lagged_obs(H, L)) %>%
  select(West, starts_with("Lag")) %>%
  drop_na()
head(PSI_with_lags)

You are encouraged to comment out parts of the pipe above to see the output of individual steps.
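For example, stopping right after the column-binding step (before selecting columns and dropping missing values) shows the full merged data frame:

c(0, 2:5) %>%
  map_dfc(function(L) lagged_obs(H, L)) %>%
  head()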
If we really wanted to construct an accurate model that predicts the haze level in the next few hours, what kind of data would we need besides lagged PSI levels across different locations in Singapore? Is it realistic to collect such data?
Answer We need weather data. Most importantly, we need wind direction and wind strength. Some weather data can be found here:
However, it only has daily observations and no wind direction. Wind direction is available here:
But it is shown visually and it is hard to extract a numeric wind direction.
The point of this exercise is to see how hard it may be to get access to simple data even if the data is well-structured and collected automatically by government agencies.
Dimensions of our data are
dim(PSI_with_lags)
## [1] 331  21

We could split the data into training and test sets, but the number of observations is quite small, and today we are not going to compare different algorithms, so we do not need a final validation; we will therefore skip the training/test split.
You are encouraged to collect more air quality data manually so that you have enough for constructing high-quality models.
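If you did want a held-out test set anyway, one way to create it with caret is createDataPartition (a sketch only; the 80/20 split is an arbitrary choice):

set.seed(1)
train_idx <- createDataPartition(PSI_with_lags$West, p = 0.8, list = FALSE)
PSI_train <- PSI_with_lags[train_idx, ]
PSI_test  <- PSI_with_lags[-train_idx, ]

Keep in mind that for time series a random split leaks future information into the training set; a chronological split would be more appropriate.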
The library caret in R provides syntax to train elastic
net models. Recall from the lecture that elastic net is a combination of
ridge regression and LASSO with the following loss function \[
L_E(\beta)=\sum_{i=1}^{N}
\left(y^i - \beta_0 - \sum_{j=1}^{p}\beta_jx_j^i\right)^2+
(1-\alpha)\lambda\sum_{j=1}^{p}\beta_j^2+
\alpha\lambda\sum_{j=1}^{p}|\beta_j|
\] for linear regression.
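In particular, setting \(\alpha = 0\) recovers ridge regression and \(\alpha = 1\) recovers the LASSO: \[
L_{\text{ridge}}(\beta)=\sum_{i=1}^{N}\left(y^i - \beta_0 - \sum_{j=1}^{p}\beta_jx_j^i\right)^2+\lambda\sum_{j=1}^{p}\beta_j^2,
\qquad
L_{\text{LASSO}}(\beta)=\sum_{i=1}^{N}\left(y^i - \beta_0 - \sum_{j=1}^{p}\beta_jx_j^i\right)^2+\lambda\sum_{j=1}^{p}|\beta_j|.
\]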
Here is how we implement it in practice:
elastic_net <- train(
  West ~., data = PSI_with_lags, method = "glmnet",
  trControl = trainControl("cv", number = 5)
)
elastic_net
## glmnet 
## 
## 331 samples
##  20 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 265, 266, 264, 265, 264 
## Resampling results across tuning parameters:
## 
##   alpha  lambda      RMSE       Rsquared   MAE      
##   0.10    0.1108337   4.254744  0.9943914   3.124636
##   0.10    1.1083367   5.652521  0.9900256   4.327172
##   0.10   11.0833674  10.010368  0.9693320   7.813251
##   0.55    0.1108337   4.299985  0.9942264   3.217512
##   0.55    1.1083367   7.123940  0.9841027   5.508054
##   0.55   11.0833674  11.123219  0.9752683   8.733915
##   1.00    0.1108337   4.354425  0.9940840   3.337135
##   1.00    1.1083367   7.030602  0.9847464   5.489573
##   1.00   11.0833674  13.142387  0.9846395  10.457791
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 0.1 and lambda = 0.1108337.

Note that the function train automatically picked values
of \(\alpha\) and \(\lambda\). Here is how we do it
manually:
elastic_net <- train(
  West ~., data = PSI_with_lags, method = "glmnet",
  tuneGrid = expand.grid(alpha = seq(from = 0, to = 1, length = 5), 
                         lambda = 10^(seq(from = -2, to = 2, length = 5))),
  trControl = trainControl("cv", number = 5)
)
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
elastic_net
## glmnet 
## 
## 331 samples
##  20 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 266, 265, 263, 267, 263 
## Resampling results across tuning parameters:
## 
##   alpha  lambda  RMSE       Rsquared   MAE      
##   0.00   1e-02    8.023957  0.9781708   6.238625
##   0.00   1e-01    8.023957  0.9781708   6.238625
##   0.00   1e+00    8.023957  0.9781708   6.238625
##   0.00   1e+01    9.358466  0.9704450   7.335704
##   0.00   1e+02   13.120596  0.9492554  10.425877
##   0.25   1e-02    3.924229  0.9951690   2.911008
##   0.25   1e-01    4.269521  0.9942371   3.169942
##   0.25   1e+00    6.047384  0.9876887   4.727404
##   0.25   1e+01    9.878136  0.9706220   7.742145
##   0.25   1e+02   29.768955  0.9552264  24.199975
##   0.50   1e-02    3.782374  0.9954024   2.812805
##   0.50   1e-01    4.292163  0.9941424   3.197779
##   0.50   1e+00    6.877837  0.9840698   5.373592
##   0.50   1e+01   10.484404  0.9739216   8.253056
##   0.50   1e+02   51.694337  0.9742759  42.659005
##   0.75   1e-02    4.323029  0.9941273   3.314524
##   0.75   1e-01    4.364578  0.9939959   3.339371
##   0.75   1e+00    7.016043  0.9837211   5.491419
##   0.75   1e+01   11.407471  0.9782368   9.054474
##   0.75   1e+02   55.848769        NaN  46.091049
##   1.00   1e-02    4.362148  0.9937972   3.279327
##   1.00   1e-01    4.460897  0.9935867   3.379658
##   1.00   1e+00    7.030310  0.9838008   5.503694
##   1.00   1e+01   12.345624  0.9835875   9.763590
##   1.00   1e+02   55.848769        NaN  46.091049
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 0.5 and lambda = 0.01.

Usually we will train LASSO rather than ridge regression because LASSO is capable of variable selection: the L1 penalty can shrink some coefficients exactly to zero. We will try the following values of \(\lambda\):
lambda <- 10^seq(-2, 0 , length = 5)
lambda
## [1] 0.01000000 0.03162278 0.10000000 0.31622777 1.00000000

Below we train our LASSO:
lasso <- train(
  West ~., data = PSI_with_lags, method = "glmnet",
  trControl = trainControl("cv", number = 10),
  tuneGrid = expand.grid(alpha = 1, lambda = lambda),
  preProcess = c("scale")
)
lasso
## glmnet 
## 
## 331 samples
##  20 predictor
## 
## Pre-processing: scaled (20) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 296, 298, 298, 297, 299, 298, ... 
## Resampling results across tuning parameters:
## 
##   lambda      RMSE      Rsquared   MAE     
##   0.01000000  4.082371  0.9949348  3.128353
##   0.03162278  4.082371  0.9949348  3.128353
##   0.10000000  4.243483  0.9945532  3.274762
##   0.31622777  5.240422  0.9918568  4.090273
##   1.00000000  7.015888  0.9858489  5.496858
## 
## Tuning parameter 'alpha' was held constant at a value of 1
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 1 and lambda = 0.03162278.

Note that we did some data normalization (preProcess = c("scale")), even though, strictly speaking, this is not necessary for our problem since all our variables are measured on the same scale. In general, however, normalization is necessary for LASSO and ridge regression because the penalty treats all coefficients equally.
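If the predictors were on different scales, we would typically centre them as well; in caret this is just a different preProcess argument (a sketch of the call reusing the same grid; lasso_std is an arbitrary name and the model is not rerun here):

lasso_std <- train(
  West ~ ., data = PSI_with_lags, method = "glmnet",
  trControl = trainControl("cv", number = 10),
  tuneGrid = expand.grid(alpha = 1, lambda = lambda),
  preProcess = c("center", "scale")   # centre and scale each predictor
)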
Here is how we can extract the best hyperparameter values:
lasso$bestTune

The variables that were actually selected and the coefficients of the resulting model are
coef(lasso$finalModel, lasso$bestTune$lambda)
## 21 x 1 sparse Matrix of class "dgCMatrix"
##               s=0.03162278
## (Intercept)      4.7054279
## Lag_North_2      .        
## Lag_South_2     14.5471519
## Lag_East_2       6.4746970
## Lag_West_2      61.6832397
## Lag_Central_2    .        
## Lag_North_3      .        
## Lag_South_3      .        
## Lag_East_3       .        
## Lag_West_3       .        
## Lag_Central_3    .        
## Lag_North_4      .        
## Lag_South_4      .        
## Lag_East_4       .        
## Lag_West_4      -0.2963903
## Lag_Central_4   -0.1841455
## Lag_North_5     -0.5448523
## Lag_South_5     -2.4176092
## Lag_East_5     -11.8779635
## Lag_West_5     -12.2056699
## Lag_Central_5    .        

Here is how we can print the estimates as a nicely formatted table:
best_lambda <- lasso$bestTune$lambda
coefs <- coef(lasso$finalModel, s = best_lambda) %>%
  as.matrix() %>%
  as.data.frame() %>%
  rownames_to_column("term") %>%
  rename_with(~"estimate", .cols = everything()[2])  
coefs %>%
  filter(estimate != 0) %>%
  arrange(desc(abs(estimate))) %>%
  kable(digits = 4,
        caption = sprintf("Lasso coefficients (lambda = %.4g)", best_lambda)) %>%
  kable_styling(full_width = FALSE,
                bootstrap_options = c("striped", "hover"))

| term | estimate | 
|---|---|
| Lag_West_2 | 61.6832 | 
| Lag_South_2 | 14.5472 | 
| Lag_West_5 | -12.2057 | 
| Lag_East_5 | -11.8780 | 
| Lag_East_2 | 6.4747 | 
| (Intercept) | 4.7054 | 
| Lag_South_5 | -2.4176 | 
| Lag_North_5 | -0.5449 | 
| Lag_West_4 | -0.2964 | 
| Lag_Central_4 | -0.1841 | 
Coefficients for different values of lambda that we tried are
coef(lasso$finalModel, lambda) 
## 21 x 5 sparse Matrix of class "dgCMatrix"
##               s=0.01000000 s=0.03162278 s=0.10000000 s=0.31622777 s=1.00000000
## (Intercept)      4.7054279    4.7054279    5.4531977     6.276419     7.499788
## Lag_North_2      .            .            .             .            .       
## Lag_South_2     14.5471519   14.5471519   16.9121898    19.704282    14.802099
## Lag_East_2       6.4746970    6.4746970    2.5204375     .            .       
## Lag_West_2      61.6832397   61.6832397   58.5243126    46.814933    40.054310
## Lag_Central_2    .            .            .             .            .       
## Lag_North_3      .            .            .             .            .       
## Lag_South_3      .            .            .             .            .       
## Lag_East_3       .            .            .             .            .       
## Lag_West_3       .            .            .             .            .       
## Lag_Central_3    .            .            .             .            .       
## Lag_North_4      .            .            .             .            .       
## Lag_South_4      .            .            .             .            .       
## Lag_East_4       .            .            .             .            .       
## Lag_West_4      -0.2963903   -0.2963903    .             .            .       
## Lag_Central_4   -0.1841455   -0.1841455    .             .            .       
## Lag_North_5     -0.5448523   -0.5448523   -2.7840846    -4.406256     .       
## Lag_South_5     -2.4176092   -2.4176092   -0.2373293     .            .       
## Lag_East_5     -11.8779635  -11.8779635  -10.3346636    -6.902460     .       
## Lag_West_5     -12.2056699  -12.2056699   -9.4458076     .            .       
## Lag_Central_5    .            .            .             .            .

Or, printed as a nicely formatted table:
coef(lasso$finalModel, lambda) %>%
  as.matrix() %>%
  as.data.frame() %>%
  rownames_to_column("term") %>%
  rename_with(~"estimate", .cols = everything()[2]) %>%
  kable(digits = 4, 
        caption = sprintf("Lasso coefficients")) %>%
  kable_styling(full_width = FALSE,
                bootstrap_options = c("striped", "hover"))| term | estimate | s=0.03162278 | s=0.10000000 | s=0.31622777 | s=1.00000000 | 
|---|---|---|---|---|---|
| (Intercept) | 4.7054 | 4.7054 | 5.4532 | 6.2764 | 7.4998 | 
| Lag_North_2 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 
| Lag_South_2 | 14.5472 | 14.5472 | 16.9122 | 19.7043 | 14.8021 | 
| Lag_East_2 | 6.4747 | 6.4747 | 2.5204 | 0.0000 | 0.0000 | 
| Lag_West_2 | 61.6832 | 61.6832 | 58.5243 | 46.8149 | 40.0543 | 
| Lag_Central_2 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 
| Lag_North_3 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 
| Lag_South_3 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 
| Lag_East_3 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 
| Lag_West_3 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 
| Lag_Central_3 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 
| Lag_North_4 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 
| Lag_South_4 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 
| Lag_East_4 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 
| Lag_West_4 | -0.2964 | -0.2964 | 0.0000 | 0.0000 | 0.0000 | 
| Lag_Central_4 | -0.1841 | -0.1841 | 0.0000 | 0.0000 | 0.0000 | 
| Lag_North_5 | -0.5449 | -0.5449 | -2.7841 | -4.4063 | 0.0000 | 
| Lag_South_5 | -2.4176 | -2.4176 | -0.2373 | 0.0000 | 0.0000 | 
| Lag_East_5 | -11.8780 | -11.8780 | -10.3347 | -6.9025 | 0.0000 | 
| Lag_West_5 | -12.2057 | -12.2057 | -9.4458 | 0.0000 | 0.0000 | 
| Lag_Central_5 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 
Now we will plot the LASSO coefficients as it was done in Lecture 4. First, we do some pre-processing for that:
lambda_for_plotting <- 10^seq(from = -2, to = 2, length = 100)
# lasso <- train(
#   West ~., data = PSI_with_lags, method = "glmnet",
#   trControl = trainControl("cv", number = 10),
#   tuneGrid = expand.grid(alpha = 0.5, lambda = lambda_for_plotting),
#   preProcess = c("scale")
# )
lasso_coefs <- coef(lasso$finalModel, lambda_for_plotting) %>%
  as.matrix %>% t %>% as_tibble %>%
  mutate(lambda = lambda_for_plotting)
head(lasso_coefs)

Plot the coefficients of the LASSO model except for the intercept (i.e.,
coefficients at different lags) against \(\lambda\) in different colours. Usually
\(\lambda\) is chosen on an exponential
scale, so add + scale_x_log10() to your plot.
lasso_coefs %>%
  pivot_longer(starts_with("Lag"), names_to = "variable", values_to = "coef") %>%
  ggplot(aes(x = lambda, y = coef, group = variable, colour = variable)) +
  geom_line() + scale_x_log10()

Here is how we can make the plot prettier by choosing a different colour scheme and a different way to print labels on the \(\lambda\)-axis:
library(viridis)
library(scales)
lasso_coefs %>%
  pivot_longer(starts_with("Lag"), names_to = "variable", values_to = "coef") %>%
  ggplot(aes(x = lambda, y = coef, colour = variable)) +
  geom_line() +
  scale_x_log10(
    breaks = 10^(-2:2),
    labels = function(b) scales::math_format(10^.x)(log10(b)),
    minor_breaks = NULL
  ) +
  scale_colour_viridis_d(option = "turbo", end = 0.95, begin = 0.05) +
  theme_minimal()

Train ridge regression in the same fashion as we trained LASSO and plot the coefficients. Use \(\lambda\) from 10 to 10000.
lambda <- 10^seq(from = 1, to = 4, length = 100)
ridge <- train(
  West ~., data = PSI_with_lags, method = "glmnet",
  trControl = trainControl("cv", number = 10),
  tuneGrid = expand.grid(alpha = 0, lambda = lambda),
  preProcess = c("scale")
)
coef(ridge$finalModel, lambda) %>%
  as.matrix %>% t %>% as_tibble %>%
  mutate(lambda = lambda) %>%
  pivot_longer(starts_with("Lag"), names_to = "variable", values_to = "coef") %>%
  ggplot(aes(x = lambda, y = coef, group = variable, colour = variable)) +
  geom_line() + scale_x_log10()

Here, we will use the forward stepwise procedure for subset selection. It is not hard to implement from scratch, and doing so is a good exercise, but here we will just quickly review an existing tool.
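For the curious, here is a minimal from-scratch sketch of forward selection that greedily adds the variable giving the largest drop in the residual sum of squares (forward_stepwise is a hypothetical helper written for illustration, not what leaps does internally):

forward_stepwise <- function(data, response, nvmax) {
  candidates <- setdiff(names(data), response)
  selected <- character(0)
  models <- vector("list", nvmax)
  for (k in seq_len(nvmax)) {
    # RSS of the current model extended by each remaining candidate
    rss <- sapply(candidates, function(v) {
      f <- reformulate(c(selected, v), response = response)
      sum(residuals(lm(f, data = data))^2)
    })
    best <- names(which.min(rss))
    selected <- c(selected, best)
    candidates <- setdiff(candidates, best)
    models[[k]] <- selected
  }
  models
}
forward_stepwise(PSI_with_lags, "West", nvmax = 3)

The function regsubsets from the library leaps performs essentially the same search, but more efficiently and with extra bookkeeping.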
Note that if you want to change it to the backward procedure, then
change method = "forward" to
method = "backward". For the full search, delete this
option or change to “exhaustive”. And nvmax is the maximal
number of variables that we will consider. Here, we will just use all of
them and print how the best subset changes from step to step for steps 1
to 5:
p <- ncol(PSI_with_lags) - 1
subset_mod <- regsubsets(West ~ . , data = PSI_with_lags,
                         nvmax = p,
                         method = "forward")
sum_subset_mod <- summary(subset_mod)
t(sum_subset_mod$outmat)[ , 1:5]
##               1  ( 1 ) 2  ( 1 ) 3  ( 1 ) 4  ( 1 ) 5  ( 1 )
## Lag_North_2   " "      " "      " "      " "      " "     
## Lag_South_2   " "      " "      "*"      "*"      "*"     
## Lag_East_2    " "      " "      " "      " "      " "     
## Lag_West_2    "*"      "*"      "*"      "*"      "*"     
## Lag_Central_2 " "      " "      " "      " "      " "     
## Lag_North_3   " "      " "      " "      " "      " "     
## Lag_South_3   " "      " "      " "      "*"      "*"     
## Lag_East_3    " "      " "      " "      " "      " "     
## Lag_West_3    " "      "*"      "*"      "*"      "*"     
## Lag_Central_3 " "      " "      " "      " "      " "     
## Lag_North_4   " "      " "      " "      " "      " "     
## Lag_South_4   " "      " "      " "      " "      " "     
## Lag_East_4    " "      " "      " "      " "      " "     
## Lag_West_4    " "      " "      " "      " "      "*"     
## Lag_Central_4 " "      " "      " "      " "      " "     
## Lag_North_5   " "      " "      " "      " "      " "     
## Lag_South_5   " "      " "      " "      " "      " "     
## Lag_East_5    " "      " "      " "      " "      " "     
## Lag_West_5    " "      " "      " "      " "      " "     
## Lag_Central_5 " "      " "      " "      " "      " "Now we will print coefficients of the linear models for the first six subsets:
step_models <- coef(subset_mod, 1:p)
step_models[1:5]
## [[1]]
## (Intercept)  Lag_West_2 
##    1.762717    0.990807 
## 
## [[2]]
## (Intercept)  Lag_West_2  Lag_West_3 
##    2.179403    2.672010   -1.686063 
## 
## [[3]]
## (Intercept) Lag_South_2  Lag_West_2  Lag_West_3 
##  3.23385365  0.08581059  2.48267088 -1.58732878 
## 
## [[4]]
## (Intercept) Lag_South_2  Lag_West_2 Lag_South_3  Lag_West_3 
##   2.4065148   0.6358198   2.0440764  -0.5739850  -1.1199339 
## 
## [[5]]
## (Intercept) Lag_South_2  Lag_West_2 Lag_South_3  Lag_West_3  Lag_West_4 
##   2.1418815   0.6797036   2.2553058  -0.6200748  -1.6012397   0.2740064

To select the best model, we can use one of the three criteria that
the library leaps calculates. We will create a data frame
that contains all the criteria:
criteria <- tibble(
  k = 1:length(sum_subset_mod$cp),
  Cp = sum_subset_mod$cp,
  BIC = sum_subset_mod$bic,
  "1 - Adj.R2" = 1 - sum_subset_mod$adjr2
)
criteria

We can use any one of them to select the best model. For convenience, here is the plot:
criteria %>%
  pivot_longer(cols = -k) %>%
  ggplot(aes(x = k, y = value)) +
  geom_line() + geom_point() + facet_wrap(vars(name), scales = "free") + 
  theme_minimal()

And here are the values of \(k\) minimizing each of the criteria:
criteria %>%
  summarise(across(-k, ~k[which.min(.)]))

Unfortunately, the library leaps does not calculate
cross-validation errors. So we will use familiar tools, i.e.,
caret::train to do it.
The first thing to do is to extract the predictor names from the object constructed with the leaps library. Below we do it and print the first five sets of predictors:
p <- ncol(PSI_with_lags) - 1
step_predictors <- 1:p %>% 
  map(~coef(subset_mod, .)[-1]) %>%
  map(names)
step_predictors[1:5]
## [[1]]
## [1] "Lag_West_2"
## 
## [[2]]
## [1] "Lag_West_2" "Lag_West_3"
## 
## [[3]]
## [1] "Lag_South_2" "Lag_West_2"  "Lag_West_3" 
## 
## [[4]]
## [1] "Lag_South_2" "Lag_West_2"  "Lag_South_3" "Lag_West_3" 
## 
## [[5]]
## [1] "Lag_South_2" "Lag_West_2"  "Lag_South_3" "Lag_West_3"  "Lag_West_4"Why do we need [-1] in the R code above?
Answer In the R code above, [-1] removes
(Intercept) from the vector of predictor names.
Now how do we find the cross-validation error of, say, the third model?
step_predictors[[3]]
## [1] "Lag_South_2" "Lag_West_2"  "Lag_West_3"

The model is as follows:
caret_mod <- train(West ~ . , 
                   PSI_with_lags[ , c("West", step_predictors[[3]])],
                   method = "lm", 
                   trControl = trainControl(method = "cv", number = 5))
caret_mod
## Linear Regression 
## 
## 331 samples
##   3 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 264, 267, 264, 264, 265 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   3.649813  0.9957569  2.609453
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE

So let us put it all together. First, we write a function whose input is a character vector of predictor names and whose output is the cross-validation error of the resulting model:
cv_error <- function(vars_to_select) {
  caret_mod <- train(West ~ . , 
                     PSI_with_lags[ , c("West", vars_to_select)],
                     method = "lm", 
                     trControl = trainControl(method = "cv", number = 5))
  caret_mod$results$RMSE
}
cv_error(step_predictors[[3]])
## [1] 3.615362

Now we will apply this function to all the variable combinations extracted
from the leaps object:
CV_errors <- map_vec(step_predictors, cv_error)
CV_errors
##  [1] 7.660655 3.792041 3.615046 3.497264 3.426319 3.368566 3.342994 3.361050
##  [9] 3.392665 3.327710 3.331828 3.429345 3.474263 3.348198 3.270747 3.332735
## [17] 3.260021 3.196113 3.212919 3.274538

Finally, we will add our cross-validation errors to our criteria:
criteria <- criteria %>% mutate(CV = CV_errors)
criteria

And now we will plot them:
ggplot(data = criteria, aes(x = k, y = CV)) +
  geom_line() + geom_point() +
  geom_vline(aes(xintercept = k[which.min(CV)]), 
             color = "blue", linetype = "dashed")Figure out what this plot means:
plot(subset_mod, scale = "Cp")Answer Every row here represents one model in the forward stepwise process. Variables that included in the model are black, not included white. Models are sorted from bottom to top according to Mallow’s \(C_p\). The top row is therefore the model with the smallest \(C_p\).
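For reference, one standard form of Mallows's \(C_p\) for a subset model with \(d\) coefficients (including the intercept) is \[
C_p = \frac{\mathrm{RSS}_d}{\hat\sigma^2} - n + 2d,
\] where \(\hat\sigma^2\) is the residual variance estimated from the full model; exact conventions differ slightly between references and implementations.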