By the end of this activity, you should be able to
Load data from xlsx files rather than from csv.
Convert wide data to long data.
Perform basic operations with time series: coerce character vectors to native R date/time format, create lag variables, and plot time series.
Train ridge regression and LASSO in R with the library caret
and choose optimal values of the hyperparameters using cross-validation.
Perform variable selection by the best subset or forward stepwise procedure.
You should be familiar with linear regression and cross-validation. You should also be familiar with the pipe operator, splitting data into training and test sets, data summarization and 2D plotting. This material was covered in labs 1-3.
Please run the R chunks one by one, look at the output and make sure that you understand how it is produced. There will be questions that either require a short answer - then you type your answer right in this document - or require modifying R code - then you modify the R code here. You can discuss your work with other students and with instructors.
Please fill in the survey at the end - it will only take a couple of minutes.
library(tidyverse) # for manipulation with data
## -- Attaching packages ----------------------------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.2 v purrr 0.3.4
## v tibble 3.0.3 v dplyr 1.0.2
## v tidyr 1.1.1 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.5.0
## -- Conflicts -------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(caret) # for machine learning, including elastic net and linear regression
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
library(lubridate) # for working with time
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(readxl) # for reading data from xlsx
library(Hmisc) # for creating lag variables
## Loading required package: survival
##
## Attaching package: 'survival'
## The following object is masked from 'package:caret':
##
## cluster
## Loading required package: Formula
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
##
## src, summarize
## The following objects are masked from 'package:base':
##
## format.pval, units
library(leaps) # for subset selection
Today we will work with data on Singapore haze in Sep - Oct 2015 that Fedor manually downloaded from the National Environment Agency webpage.
Source: https://www.haze.gov.sg/resources/historical-readings
H <- read_excel("singapore_haze_sep_2015.xlsx", sheet = 1)
head(H)
Note that
we read the data from an xlsx file rather than from a CSV;
the class of the variable Date is actually date/time, a special type in R;
the class of the variable Time is character, i.e., we will have to convert it to date/time.
Very often we need to work with times and dates. R has special variable types for times and dates, “POSIXlt” and “POSIXct” (in addition to the familiar “numeric”, “logical”, “character”, and “factor”).
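For example, here is a quick way to see these two classes (this small snippet is just an illustration and is not part of the lab data):
now <- Sys.time()
class(now)               # "POSIXct" "POSIXt"
class(as.POSIXlt(now))   # "POSIXlt" "POSIXt"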
Unlike character values, times and dates can be used in arithmetic operations (time is measured in seconds):
set.seed(280)
x <- sample(H$Date, 3)
cat("Original dates: ")
## Original dates:
print(x)
## [1] "2015-09-30 UTC" "2015-09-22 UTC" "2015-09-30 UTC"
cat("Original dates shifted by 1 day: ")
## Original dates shifted by 1 day:
print(x+86400)
## [1] "2015-10-01 UTC" "2015-09-23 UTC" "2015-10-01 UTC"
Note that the variable Date has been correctly identified as a date/time ("POSIXct"), while Time has been identified as "character", and hence we need to convert it. We could use the function hm from the library lubridate for that:
H$Time %>% hm %>% head
## [1] "1H 0M 0S" "2H 0M 0S" "3H 0M 0S" "4H 0M 0S" "5H 0M 0S" "6H 0M 0S"
The class of the result is
H$Time %>% hm %>% class
## [1] "Period"
## attr(,"package")
## [1] "lubridate"
An alternative is the function parse_time (from the library readr, loaded as part of the tidyverse), which requires some manual fine-tuning:
H$Time %>% parse_time("%I:%M%p") %>% head
## 01:00:00
## 02:00:00
## 03:00:00
## 04:00:00
## 05:00:00
## 06:00:00
The class of the result is
H$Time %>% parse_time("%I:%M%p") %>% class
## [1] "hms" "difftime"
And, again, "hms" is not a native R class but a special class that comes with the tidyverse (the hms package used by readr); it is built on top of the base class "difftime".
However, a better idea is to merge Date and Time into a single variable of a native R class:
H$Timestamp <- H$Date %>% paste(H$Time) %>% ymd_hm
head(H)
Sometimes we need to convert so-called "wide" data to so-called "long" data. Here, instead of five columns North, South, East, West, and Central, we want to have just two: one containing the value of PSI from any of these five columns and the second containing one of the labels "North", "South", "East", "West", "Central". Thus all five columns are merged into one, and the column names become a new categorical variable. We will call this variable location. The variable containing values from all five original columns will be called PSI Level.
For this operation, we use the function pivot_longer:
long_data <- H %>%
pivot_longer(North:Central, names_to = "location", values_to = "PSI Level")
long_data %>%
sample_n(10)
Converting wide data to long data is needed to plot several time series together. This is how we do it:
ggplot(data = long_data, aes(x = Timestamp, y = `PSI Level`, group = location,
colour = location)) +
geom_line()
Figure out why these time series have such strange regular ups and downs and fix this problem.
Answer This happened because our way of converting character time to proper time worked incorrectly:
H[24:25 , ]
Apparently, “12:00AM” of Day N is midnight between Day N and Day N+1 according to the National Environment Agency, but the same time is midnight between Day N-1 and Day N according to the developers of the library lubridate.
A quick fix is to shift the given dates for measurements taken at 12:00AM by one day.
H$AltDate <- H$Date
H$AltDate[H$Time == "12:00AM"] <- H$AltDate[H$Time == "12:00AM"] + 86400
H <- H %>%
mutate(Timestamp = ymd_hm(paste(AltDate, Time)))
long_data <- H %>%
select(Timestamp, North:Overall) %>%
pivot_longer(North:Central, names_to = "location", values_to = "PSI Level")
ggplot(data = long_data, aes(x = Timestamp,
y = `PSI Level`, group = location,
colour = location)) +
geom_line()
We will try to predict haze level in some location knowing the history of observations in all the locations. The motivation here is that in order to plan outdoors activities during a haze period, we need to be able to know whether air quality is going to get better or worse.
To do this, we will need the function Lag from the library Hmisc (it is, in fact, a very simple function and you could write it from scratch):
cat("10 PSI measurements in the West: ", head(H$West, n = 10), "\n")
## 10 PSI measurements in the West: 132 129 126 121 117 114 110 106 103 100
cat("10 PSI measurements in the West with lag 3: ", head(Lag(H$West, 3), n = 10), "\n")
## 10 PSI measurements in the West with lag 3: NA NA NA 132 129 126 121 117 114 110
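As noted above, Lag is easy to write from scratch; here is a minimal sketch for numeric vectors (the name simple_lag is ours for illustration, it is not part of Hmisc):
simple_lag <- function(x, shift = 1) {
  # prepend `shift` NA values and drop the last `shift` observations
  c(rep(NA, shift), head(x, length(x) - shift))
}
simple_lag(head(H$West, n = 10), 3)   # should match Lag(H$West, 3) above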
We will create 20 new variables for measurements in the 5 locations lagged by 2, 3, 4, and 5 hours.
First, here is how we create lagged versions of measurements in all the locations with a specific lag. We will use the fact that measurements are stored in numeric columns of the data, i.e., we can just apply the lag transformation to all the numeric variables. Below we apply the function Lag(., 1) to all numeric columns and then rename these columns by adding the word "Lag" in front and the number "1" at the back:
H %>%
transmute(across(where(is.numeric) , function(x) Lag(x, 1))) %>%
rename_with(function(x) paste("Lag", x, sep = "_")) %>%
rename_with(function(x) paste(x, 1, sep = "_")) %>%
head
Now we will create a function with two inputs, a data frame df and an integer L. This function will return a new data frame that contains renamed lagged observations of all numeric variables in the original data frame:
lagged_obs <- function(df, L) {
# Returns renamed lagged by L versions of all numeric variables in df
# For convenience, if L = 0, we will just return the original df
if (L == 0) {
return(df)
}
df %>%
transmute(across(where(is.numeric) , function(x) Lag(x, L))) %>%
rename_with(function(x) paste("Lag", x, sep = "_")) %>%
rename_with(function(x) paste(x, L, sep = "_"))
}
lagged_obs(H, 3) %>% head
Now we will create four versions of our original data frame with different lags and merge them into one big data frame together with the original data. We will use the function bind_cols for this purpose. This is what it is going to look like:
PSI_with_lags <- c(0, 2:5) %>%
lapply(function(L) lagged_obs(H, L)) %>%
bind_cols
head(PSI_with_lags)
However, we need more than that. We also need to remove all rows that contain NA values (these are, in fact, just the first five rows, but it is better to use a universal method), and we will only keep the response variable West and all the lagged variables whose names start with "Lag" as predictors.
Here is the final transformation:
PSI_with_lags <- c(0, 2:5) %>%
lapply(function(L) lagged_obs(H, L)) %>%
bind_cols %>%
select(West, starts_with("Lag")) %>%
filter(complete.cases(.))
head(PSI_with_lags)
If we really wanted to construct an accurate model that predicts the haze level in the next few hours, what kind of data would we need besides lagged PSI levels across different locations in Singapore? Is it realistic to collect such data?
Answer We need weather data. Most importantly, we need wind direction and wind strength. Some weather data can be found here:
However, it only has daily observations and no wind direction. Wind direction is available here:
But it is shown visually and it is hard to extract a numeric wind direction.
The point of this exercise is to see how hard it may be to get access to simple data even if the data is well-structured and collected automatically by government agencies.
Dimensions of our data are
dim(PSI_with_lags)
## [1] 331 21
We could split the data into training and test sets, but the number of observations is quite small, and today we are not going to compare different algorithms, so we do not need a final validation; we will therefore skip splitting the data into training and test sets.
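If you did want a held-out test set for this kind of time series data, a simple time-ordered split (rather than a random one) could look like the sketch below; the 80/20 proportion is just an assumption for illustration, and this split is not used in the rest of the lab:
n_train <- floor(0.8 * nrow(PSI_with_lags))
train_set <- PSI_with_lags[1:n_train, ]                            # earlier observations
test_set  <- PSI_with_lags[(n_train + 1):nrow(PSI_with_lags), ]    # later observations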
You are encouraged to manually collect more air quality data so that you have enough for constructing high-quality models.
The library caret in R provides syntax to train elastic net models. Recall from the lecture that the elastic net is a combination of ridge regression and LASSO with the following loss function \[
L_E(\beta)=\sum_{i=1}^{N}
\left(y^i - \beta_0 - \sum_{j=1}^{p}\beta_jx_j^i\right)^2+
(1-\alpha)\lambda\sum_{j=1}^{p}\beta_j^2+
\alpha\lambda\sum_{j=1}^{p}|\beta_j|
\] for linear regression.
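To make the formula concrete, here is a small sketch that computes this loss directly; the function name elastic_net_loss and its arguments are purely illustrative (caret/glmnet do not need it):
elastic_net_loss <- function(beta, X, y, lambda, alpha) {
  # beta[1] is the intercept, beta[-1] are the slopes; X is a numeric matrix of predictors
  rss   <- sum((y - beta[1] - X %*% beta[-1])^2)          # residual sum of squares
  ridge <- (1 - alpha) * lambda * sum(beta[-1]^2)         # ridge penalty
  lasso <- alpha * lambda * sum(abs(beta[-1]))            # LASSO penalty
  rss + ridge + lasso
}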
Here is how we implement it in practice:
elastic_net <- train(
West ~., data = PSI_with_lags, method = "glmnet",
trControl = trainControl("cv", number = 5)
)
elastic_net
## glmnet
##
## 331 samples
## 20 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 265, 266, 264, 265, 264
## Resampling results across tuning parameters:
##
## alpha lambda RMSE Rsquared MAE
## 0.10 0.1108337 4.254744 0.9943914 3.124636
## 0.10 1.1083367 5.652521 0.9900256 4.327172
## 0.10 11.0833674 10.010368 0.9693320 7.813251
## 0.55 0.1108337 4.299985 0.9942264 3.217512
## 0.55 1.1083367 7.123940 0.9841027 5.508054
## 0.55 11.0833674 11.123219 0.9752683 8.733915
## 1.00 0.1108337 4.354425 0.9940840 3.337135
## 1.00 1.1083367 7.030602 0.9847464 5.489573
## 1.00 11.0833674 13.142387 0.9846395 10.457791
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 0.1 and lambda = 0.1108337.
Note that the function train automatically picked values of \(\alpha\) and \(\lambda\). Here is how we do it manually:
elastic_net <- train(
West ~., data = PSI_with_lags, method = "glmnet",
tuneGrid = expand.grid(alpha = seq(from = 0, to = 1, length = 5),
lambda = 10^(seq(from = -2, to = 2, length = 5))),
trControl = trainControl("cv", number = 5)
)
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.
elastic_net
## glmnet
##
## 331 samples
## 20 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 266, 265, 263, 267, 263
## Resampling results across tuning parameters:
##
## alpha lambda RMSE Rsquared MAE
## 0.00 1e-02 8.023957 0.9781708 6.238625
## 0.00 1e-01 8.023957 0.9781708 6.238625
## 0.00 1e+00 8.023957 0.9781708 6.238625
## 0.00 1e+01 9.358466 0.9704450 7.335704
## 0.00 1e+02 13.120596 0.9492554 10.425877
## 0.25 1e-02 3.924229 0.9951690 2.911008
## 0.25 1e-01 4.269521 0.9942371 3.169942
## 0.25 1e+00 6.047384 0.9876887 4.727404
## 0.25 1e+01 9.878136 0.9706220 7.742145
## 0.25 1e+02 29.768955 0.9552264 24.199975
## 0.50 1e-02 3.782374 0.9954024 2.812805
## 0.50 1e-01 4.292163 0.9941424 3.197779
## 0.50 1e+00 6.877837 0.9840698 5.373592
## 0.50 1e+01 10.484404 0.9739216 8.253056
## 0.50 1e+02 51.694337 0.9742759 42.659005
## 0.75 1e-02 4.323029 0.9941273 3.314524
## 0.75 1e-01 4.364578 0.9939959 3.339371
## 0.75 1e+00 7.016043 0.9837211 5.491419
## 0.75 1e+01 11.407471 0.9782368 9.054474
## 0.75 1e+02 55.848769 NaN 46.091049
## 1.00 1e-02 4.362148 0.9937972 3.279327
## 1.00 1e-01 4.460897 0.9935867 3.379658
## 1.00 1e+00 7.030310 0.9838008 5.503694
## 1.00 1e+01 12.345624 0.9835875 9.763590
## 1.00 1e+02 55.848769 NaN 46.091049
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 0.5 and lambda = 0.01.
Usually we will train LASSO rather than ridge regression because LASSO is capable of variable selection. We will try the following values of \(\lambda\):
lambda <- 10^seq(-2, 0 , length = 5)
lambda
## [1] 0.01000000 0.03162278 0.10000000 0.31622777 1.00000000
Below we train our LASSO:
lasso <- train(
West ~., data = PSI_with_lags, method = "glmnet",
trControl = trainControl("cv", number = 10),
tuneGrid = expand.grid(alpha = 1, lambda = lambda),
preProcess = c("scale")
)
lasso
## glmnet
##
## 331 samples
## 20 predictor
##
## Pre-processing: scaled (20)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 296, 298, 298, 297, 299, 298, ...
## Resampling results across tuning parameters:
##
## lambda RMSE Rsquared MAE
## 0.01000000 4.082371 0.9949348 3.128353
## 0.03162278 4.082371 0.9949348 3.128353
## 0.10000000 4.243483 0.9945532 3.274762
## 0.31622777 5.240422 0.9918568 4.090273
## 1.00000000 7.015888 0.9858489 5.496858
##
## Tuning parameter 'alpha' was held constant at a value of 1
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 1 and lambda = 0.03162278.
Note that we did some data normalization, even though, strictly speaking, it is not necessary for our problem since all our variables are measured on the same scale. In general, though, data normalization is necessary for LASSO and ridge regression.
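If you also wanted to center the predictors, caret allows passing both options to preProcess; here is a variant of the call above (not required for this lab):
lasso_centered <- train(
  West ~ ., data = PSI_with_lags, method = "glmnet",
  trControl = trainControl("cv", number = 10),
  tuneGrid = expand.grid(alpha = 1, lambda = lambda),
  preProcess = c("center", "scale")   # center and scale predictors before fitting
)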
Here is how we can extract the best hyperparameter values:
lasso$bestTune
The variables that were actually selected and their coefficients in the resulting model are:
coef(lasso$finalModel, lasso$bestTune$lambda)
## 21 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 4.7054280
## Lag_North_2 .
## Lag_South_2 14.5471520
## Lag_East_2 6.4746970
## Lag_West_2 61.6832397
## Lag_Central_2 .
## Lag_North_3 .
## Lag_South_3 .
## Lag_East_3 .
## Lag_West_3 .
## Lag_Central_3 .
## Lag_North_4 .
## Lag_South_4 .
## Lag_East_4 .
## Lag_West_4 -0.2963903
## Lag_Central_4 -0.1841455
## Lag_North_5 -0.5448523
## Lag_South_5 -2.4176092
## Lag_East_5 -11.8779635
## Lag_West_5 -12.2056699
## Lag_Central_5 .
The coefficients for the different values of lambda that we tried are:
coef(lasso$finalModel, lambda)
## 21 x 5 sparse Matrix of class "dgCMatrix"
## 1 2 3 4 5
## (Intercept) 4.7054280 4.7054280 5.4531977 6.276419 7.499788
## Lag_North_2 . . . . .
## Lag_South_2 14.5471520 14.5471520 16.9121898 19.704282 14.802099
## Lag_East_2 6.4746970 6.4746970 2.5204375 . .
## Lag_West_2 61.6832397 61.6832397 58.5243126 46.814933 40.054310
## Lag_Central_2 . . . . .
## Lag_North_3 . . . . .
## Lag_South_3 . . . . .
## Lag_East_3 . . . . .
## Lag_West_3 . . . . .
## Lag_Central_3 . . . . .
## Lag_North_4 . . . . .
## Lag_South_4 . . . . .
## Lag_East_4 . . . . .
## Lag_West_4 -0.2963903 -0.2963903 . . .
## Lag_Central_4 -0.1841455 -0.1841455 . . .
## Lag_North_5 -0.5448523 -0.5448523 -2.7840846 -4.406256 .
## Lag_South_5 -2.4176092 -2.4176092 -0.2373293 . .
## Lag_East_5 -11.8779635 -11.8779635 -10.3346636 -6.902460 .
## Lag_West_5 -12.2056699 -12.2056699 -9.4458076 . .
## Lag_Central_5 . . . . .
Now we will plot the coefficients of LASSO as was done in Lecture 4. First, we do some pre-processing for that:
lambda_for_plotting <- 10^seq(from = -2, to = 2, length = 100)
lasso_coefs <- coef(lasso$finalModel, lambda_for_plotting) %>%
as.matrix %>% t %>% as_tibble %>%
mutate(lambda = lambda_for_plotting)
head(lasso_coefs)
Plot the coefficients of the LASSO model except for the intercept (i.e., the coefficients at different lags) against \(\lambda\) in different colours. Usually \(\lambda\) is chosen on an exponential scale, so add + scale_x_log10() to your plot.
lasso_coefs %>%
pivot_longer(starts_with("Lag"), names_to = "variable", values_to = "coef") %>%
ggplot(aes(x = lambda, y = coef, group = variable, colour = variable)) +
geom_line() + scale_x_log10()
Train ridge regression in the same fashion as we trained LASSO and plot the coefficients. Use \(\lambda\) from 10 to 10000.
lambda <- 10^seq(from = 1, to = 4, length = 100)
ridge <- train(
West ~., data = PSI_with_lags, method = "glmnet",
trControl = trainControl("cv", number = 10),
tuneGrid = expand.grid(alpha = 0, lambda = lambda),
preProcess = c("scale")
)
coef(ridge$finalModel, lambda) %>%
as.matrix %>% t %>% as_tibble %>%
mutate(lambda = lambda) %>%
pivot_longer(starts_with("Lag"), names_to = "variable", values_to = "coef") %>%
ggplot(aes(x = lambda, y = coef, group = variable, colour = variable)) +
geom_line() + scale_x_log10()
Here, we will use the forward stepwise procedure for subset selection. It is not hard to implement it from scratch and it is a good exercise, but here we will just quickly review an existing tool.
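For reference, here is a minimal sketch of the forward stepwise idea implemented from scratch, choosing at each step the variable that most reduces the RSS; the function forward_stepwise is ours and purely illustrative (below we use the library leaps instead):
forward_stepwise <- function(data, response, max_vars = ncol(data) - 1) {
  candidates <- setdiff(names(data), response)
  selected <- character(0)
  steps <- list()
  for (k in seq_len(max_vars)) {
    # fit a model with each remaining candidate added and record its RSS
    rss <- sapply(candidates, function(v) {
      f <- reformulate(c(selected, v), response = response)
      sum(residuals(lm(f, data = data))^2)
    })
    best <- names(which.min(rss))          # the variable giving the smallest RSS
    selected <- c(selected, best)
    candidates <- setdiff(candidates, best)
    steps[[k]] <- selected
  }
  steps
}
# For example, the first three steps on our data:
# forward_stepwise(PSI_with_lags, "West", max_vars = 3)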
Note that if you want to change it to the backward procedure, then change method = "forward" to method = "backward". For the full search, delete this option or change it to "exhaustive". The argument nvmax is the maximal number of variables that we will consider. Here, we will just use all of them and print how the best subset changes from step to step:
subset_mod <- regsubsets(West ~ . , data = PSI_with_lags,
nvmax = ncol(PSI_with_lags) - 1,
method = "forward")
sum_subset_mod <- summary(subset_mod)
t(sum_subset_mod$outmat)[ , 1:6]
## 1 ( 1 ) 2 ( 1 ) 3 ( 1 ) 4 ( 1 ) 5 ( 1 ) 6 ( 1 )
## Lag_North_2 " " " " " " " " " " " "
## Lag_South_2 " " " " "*" "*" "*" "*"
## Lag_East_2 " " " " " " " " " " " "
## Lag_West_2 "*" "*" "*" "*" "*" "*"
## Lag_Central_2 " " " " " " " " " " " "
## Lag_North_3 " " " " " " " " " " " "
## Lag_South_3 " " " " " " "*" "*" "*"
## Lag_East_3 " " " " " " " " " " " "
## Lag_West_3 " " "*" "*" "*" "*" "*"
## Lag_Central_3 " " " " " " " " " " " "
## Lag_North_4 " " " " " " " " " " " "
## Lag_South_4 " " " " " " " " " " " "
## Lag_East_4 " " " " " " " " " " " "
## Lag_West_4 " " " " " " " " "*" "*"
## Lag_Central_4 " " " " " " " " " " " "
## Lag_North_5 " " " " " " " " " " " "
## Lag_South_5 " " " " " " " " " " " "
## Lag_East_5 " " " " " " " " " " "*"
## Lag_West_5 " " " " " " " " " " " "
## Lag_Central_5 " " " " " " " " " " " "
Now we will print coefficients of the linear models for the first six subsets:
coef(subset_mod, 1:6)
## [[1]]
## (Intercept) Lag_West_2
## 1.762717 0.990807
##
## [[2]]
## (Intercept) Lag_West_2 Lag_West_3
## 2.179403 2.672010 -1.686063
##
## [[3]]
## (Intercept) Lag_South_2 Lag_West_2 Lag_West_3
## 3.23385365 0.08581059 2.48267088 -1.58732878
##
## [[4]]
## (Intercept) Lag_South_2 Lag_West_2 Lag_South_3 Lag_West_3
## 2.4065148 0.6358198 2.0440764 -0.5739850 -1.1199339
##
## [[5]]
## (Intercept) Lag_South_2 Lag_West_2 Lag_South_3 Lag_West_3 Lag_West_4
## 2.1418815 0.6797036 2.2553058 -0.6200748 -1.6012397 0.2740064
##
## [[6]]
## (Intercept) Lag_South_2 Lag_West_2 Lag_South_3 Lag_West_3 Lag_West_4
## 2.65475809 0.60966585 2.23931112 -0.48651891 -1.65719309 0.34134475
## Lag_East_5
## -0.06690284
To select the best model, we will use one of the three criteria that the library leaps conveniently provides for us. For example, below we print values of Mallows's \(C_p\):
sum_subset_mod$cp
## [1] 1664.76276 160.73988 123.56771 69.08263 63.63872 57.48548
## [7] 54.71256 44.40255 44.26639 42.22911 42.71924 43.46708
## [13] 39.66554 27.89344 18.79798 15.61533 17.02980 18.27691
## [19] 19.53327 21.00000
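As a reminder, one common way to define Mallows's \(C_p\) for a model with \(k\) predictors is \[C_p = \frac{\mathrm{RSS}_k}{\hat\sigma^2} - N + 2(k+1),\] where \(\hat\sigma^2\) is the estimate of the error variance from the full model and smaller values are better. (This is the usual textbook form; consistent with it, the full model with all 20 predictors has \(C_p = 21\) above.)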
We will create a data frame that contains all the criteria:
criteria <- tibble(
k = 1:length(sum_subset_mod$cp),
Cp = sum_subset_mod$cp,
BIC = sum_subset_mod$bic,
"1 - Adj.R2" = 1 - sum_subset_mod$adjr2
)
criteria
We will use one of them to select a single model, say, BIC. Below is the plot where we added a vertical line indicating the minimal value of BIC:
ggplot(data = criteria, aes(x = k, y = BIC)) +
geom_line() + geom_point() +
geom_vline(aes(xintercept = k[which.min(BIC)]),
color = "blue", linetype = "dashed")
Unfortunately, the library leaps does not calculate cross-validation errors, so we will use familiar tools to do it.
First, note that we have this logical matrix that we can use to extract the variables included in each model:
sum_subset_mod$which[1:5, 1:5]
## (Intercept) Lag_North_2 Lag_South_2 Lag_East_2 Lag_West_2
## 1 TRUE FALSE FALSE FALSE TRUE
## 2 TRUE FALSE FALSE FALSE TRUE
## 3 TRUE FALSE TRUE FALSE TRUE
## 4 TRUE FALSE TRUE FALSE TRUE
## 5 TRUE FALSE TRUE FALSE TRUE
We are going to preprocess it a little. Note that (Intercept) is not a variable name, so we don't really need to explicitly include it in our linear models. At the same time, all entries in the column (Intercept) are TRUE. So if we rename (Intercept) to West (our response variable), we will get a matrix each row of which represents the response variable and the predictors of the corresponding model:
variable_selection_matrix <- sum_subset_mod$which
colnames(variable_selection_matrix)[1] <- "West"
variable_selection_matrix[1:5, 1:5]
## West Lag_North_2 Lag_South_2 Lag_East_2 Lag_West_2
## 1 TRUE FALSE FALSE FALSE TRUE
## 2 TRUE FALSE FALSE FALSE TRUE
## 3 TRUE FALSE TRUE FALSE TRUE
## 4 TRUE FALSE TRUE FALSE TRUE
## 5 TRUE FALSE TRUE FALSE TRUE
Now how do we find the cross-validation error of, say, the third model? We need to train a linear regression on a subset of our data that only includes variables from the third row of this matrix, i.e.,
PSI_with_lags %>%
select(which(variable_selection_matrix[3 , ])) %>%
head
So the resulting model will be as follows:
caret_mod <- train(West ~ . , PSI_with_lags[ , variable_selection_matrix[3 , ]],
method = "lm", trControl = trainControl(method = "cv", number = 5))
caret_mod
## Linear Regression
##
## 331 samples
## 3 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 264, 267, 264, 264, 265
## Resampling results:
##
## RMSE Rsquared MAE
## 3.649813 0.9957569 2.609453
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
So let us put it all together. First, we write a function whose input is a named logical vector indicating which variables are included and whose output is the cross-validation error of the resulting model:
cv_error <- function(vars_to_select) {
caret_mod <- train(West ~ . , PSI_with_lags[ , vars_to_select],
method = "lm", trControl = trainControl(method = "cv", number = 5))
caret_mod$results$RMSE
}
cv_error(variable_selection_matrix[ 3, ])
## [1] 3.615362
Now we will apply this function to every row of the matrix variable_selection_matrix. To do it, we use apply as follows:
CV_errors <- apply(variable_selection_matrix, 1, cv_error)
CV_errors
## 1 2 3 4 5 6 7 8
## 7.660655 3.792041 3.615046 3.497264 3.426319 3.368566 3.342994 3.361050
## 9 10 11 12 13 14 15 16
## 3.392665 3.327710 3.331828 3.429345 3.474263 3.348198 3.270747 3.332735
## 17 18 19 20
## 3.260021 3.196113 3.212919 3.274538
The second argument of apply in the code chunk above is 1. If, instead, we wanted to apply a function to every column of a matrix, we would use 2. If we wanted to apply a function to every entry of a matrix, we would use c(1, 2).
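A quick toy illustration of this argument (the matrix M below is just an example, unrelated to the lab data):
M <- matrix(1:6, nrow = 2)
apply(M, 1, sum)          # one value per row (row sums)
apply(M, 2, sum)          # one value per column (column sums)
apply(M, c(1, 2), sqrt)   # applied to every entry; result has the same shape as M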
Finally, we will add our cross-validation errors to our criteria:
criteria$CV <- CV_errors
criteria
And now we will plot them:
ggplot(data = criteria, aes(x = k, y = CV)) +
geom_line() + geom_point() +
geom_vline(aes(xintercept = k[which.min(CV)]),
color = "blue", linetype = "dashed")
Figure out what this plot means:
plot(subset_mod, scale = "Cp")
Answer Every row here represents one model in the forward stepwise process. Variables that are included in the model are shown in black; variables that are not included are shown in white. The models are sorted from bottom to top according to Mallows's \(C_p\). The top row is therefore the model with the smallest \(C_p\).
Here are the answers: