library(ISLR)
library(discrim) # for LDA and QDA
library(tidymodels)
library(GGally)
library(corrr)
theme_set(theme_bw())
We will be examining the Smarket data set for this lab. It contains a number of numeric variables plus a variable called Direction which has the two labels “Up” and “Down”. Before we do on to modeling, let us take a look at the correlation between the variables.
To look at the correlation, we will use the corrr package. The correlate() function will calculate the correlation matrix between all the variables that it is being fed. We will therefore remove Direction as it is not numeric. Then we pass that to rplot() to quickly visualize the correlation matrix. I have also changed the colours argument to better see what is going on.
###EDA
smarket <- tibble(Smarket)
glimpse(smarket)
Rows: 1,250
Columns: 9
$ Year <dbl> 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, ~
$ Lag1 <dbl> 0.381, 0.959, 1.032, -0.623, 0.614, 0.213, 1.392, -0.403, 0.~
$ Lag2 <dbl> -0.192, 0.381, 0.959, 1.032, -0.623, 0.614, 0.213, 1.392, -0~
$ Lag3 <dbl> -2.624, -0.192, 0.381, 0.959, 1.032, -0.623, 0.614, 0.213, 1~
$ Lag4 <dbl> -1.055, -2.624, -0.192, 0.381, 0.959, 1.032, -0.623, 0.614, ~
$ Lag5 <dbl> 5.010, -1.055, -2.624, -0.192, 0.381, 0.959, 1.032, -0.623, ~
$ Volume <dbl> 1.1913, 1.2965, 1.4112, 1.2760, 1.2057, 1.3491, 1.4450, 1.40~
$ Today <dbl> 0.959, 1.032, -0.623, 0.614, 0.213, 1.392, -0.403, 0.027, 1.~
$ Direction <fct> Up, Up, Down, Up, Up, Up, Down, Up, Up, Up, Down, Down, Up, ~
smarket %>% slice(1:4)
# A tibble: 4 x 9
Year Lag1 Lag2 Lag3 Lag4 Lag5 Volume Today Direction
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct>
1 2001 0.381 -0.192 -2.62 -1.06 5.01 1.19 0.959 Up
2 2001 0.959 0.381 -0.192 -2.62 -1.06 1.30 1.03 Up
3 2001 1.03 0.959 0.381 -0.192 -2.62 1.41 -0.623 Down
4 2001 -0.623 1.03 0.959 0.381 -0.192 1.28 0.614 Up
smarket %>%
group_by(Direction) %>%
count()
# A tibble: 2 x 2
# Groups: Direction [2]
Direction n
<fct> <int>
1 Down 602
2 Up 648
And we see that these variables are more or less uncorrelated with each other. The other pair is Year and Volume that is a little correlated.
If you want to create heatmap styled correlation chart you can also create it manually.
library(ggcorrplot)
correl <- smarket %>%
select(-Direction) %>%
cor()
ggcorrplot(correl, hc.order = T, type = "lower", lab = T)
ggplot(smarket, aes(Year, Volume)) +
geom_jitter()
## Logistic Regression
Now we will fit a logistic regression model. We will again use the parsnip package, and we will use logistic_reg() to create a logistic regression model specification.
log_model <-
logistic_reg() %>%
set_engine("glm")
We can now fit the model like normal. We want to model the Direction of the stock market based on the percentage return from the 5 previous days plus the volume of shares traded. When fitting a classification with parsnip requires that the response variable is a factor. This is the case for the Smarket data set so we don’t need to do adjustments.
log_fit <-
fit(log_model,
Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
data = smarket
)
log_fit
parsnip model object
Fit time: 0ms
Call: stats::glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 +
Lag5 + Volume, family = stats::binomial, data = data)
Coefficients:
(Intercept) Lag1 Lag2 Lag3 Lag4 Lag5
-0.126000 -0.073074 -0.042301 0.011085 0.009359 0.010313
Volume
0.135441
Degrees of Freedom: 1249 Total (i.e. Null); 1243 Residual
Null Deviance: 1731
Residual Deviance: 1728 AIC: 1742
log_fit %>% pluck("fit") %>% summary()
Call:
stats::glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 +
Lag5 + Volume, family = stats::binomial, data = data)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.446 -1.203 1.065 1.145 1.326
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.126000 0.240736 -0.523 0.601
Lag1 -0.073074 0.050167 -1.457 0.145
Lag2 -0.042301 0.050086 -0.845 0.398
Lag3 0.011085 0.049939 0.222 0.824
Lag4 0.009359 0.049974 0.187 0.851
Lag5 0.010313 0.049511 0.208 0.835
Volume 0.135441 0.158360 0.855 0.392
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1731.2 on 1249 degrees of freedom
Residual deviance: 1727.6 on 1243 degrees of freedom
AIC: 1741.6
Number of Fisher Scoring iterations: 3
tidy(log_fit)
# A tibble: 7 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -0.126 0.241 -0.523 0.601
2 Lag1 -0.0731 0.0502 -1.46 0.145
3 Lag2 -0.0423 0.0501 -0.845 0.398
4 Lag3 0.0111 0.0499 0.222 0.824
5 Lag4 0.00936 0.0500 0.187 0.851
6 Lag5 0.0103 0.0495 0.208 0.835
7 Volume 0.135 0.158 0.855 0.392
glance(log_fit)
# A tibble: 1 x 8
null.deviance df.null logLik AIC BIC deviance df.residual nobs
<dbl> <int> <dbl> <dbl> <dbl> <dbl> <int> <int>
1 1731. 1249 -864. 1742. 1778. 1728. 1243 1250
A good performing model would ideally have high numbers along the diagonal (up-left to down-right) with small numbers on the off-diagonal. We see here that the model isn’t great, as it tends to predict “Down” as “Up” more often than it should.
predicting <- augment(log_fit, smarket)
conf_mat(predicting,
truth = Direction,
estimate = .pred_class)
Truth
Prediction Down Up
Down 145 141
Up 457 507
predicting %>%
accuracy(truth = Direction, estimate = .pred_class)
# A tibble: 1 x 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 accuracy binary 0.522
predicting %>%
roc_auc(truth = Direction, estimate = .pred_Up)
# A tibble: 1 x 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 roc_auc binary 0.461
We just fit a model and evaluated it on the same data. This doesn’t give us that much information about the model performance. Let us instead split up the data, train it on some of it and then evaluate it on the other part of the data. Since we are working with some data that has a time component, it is natural to fit the model using the first year’s worth of data and evaluate it on the last year. This would more closely match how such a model would be used in real life.
smarket_train <- smarket %>%
filter(Year != 2005)
smarket_test <- smarket %>%
filter(Year == 2005)
log_fit_train <-
fit(log_model,
Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
data = smarket_train
)
log_fit_train
parsnip model object
Fit time: 0ms
Call: stats::glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 +
Lag5 + Volume, family = stats::binomial, data = data)
Coefficients:
(Intercept) Lag1 Lag2 Lag3 Lag4 Lag5
0.191213 -0.054178 -0.045805 0.007200 0.006441 -0.004223
Volume
-0.116257
Degrees of Freedom: 997 Total (i.e. Null); 991 Residual
Null Deviance: 1383
Residual Deviance: 1381 AIC: 1395
log_fit_train %>% pluck("fit") %>% summary()
Call:
stats::glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 +
Lag5 + Volume, family = stats::binomial, data = data)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.302 -1.190 1.079 1.160 1.350
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.191213 0.333690 0.573 0.567
Lag1 -0.054178 0.051785 -1.046 0.295
Lag2 -0.045805 0.051797 -0.884 0.377
Lag3 0.007200 0.051644 0.139 0.889
Lag4 0.006441 0.051706 0.125 0.901
Lag5 -0.004223 0.051138 -0.083 0.934
Volume -0.116257 0.239618 -0.485 0.628
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1383.3 on 997 degrees of freedom
Residual deviance: 1381.1 on 991 degrees of freedom
AIC: 1395.1
Number of Fisher Scoring iterations: 3
pred2 <- augment(log_fit_train, smarket_test)
conf_mat(pred2,
truth = Direction,
estimate = .pred_class)
Truth
Prediction Down Up
Down 77 97
Up 34 44
pred2 %>%
accuracy(truth = Direction, estimate = .pred_class)
# A tibble: 1 x 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 accuracy binary 0.480
pred2 %>%
roc_auc(truth = Direction, estimate = .pred_Up)
# A tibble: 1 x 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 roc_auc binary 0.480
Now we will perform LDA on the Smarket data. We will use the discrim_linear() function to create a LDA specification. We will continue to use 2 predictors for easy comparison.
lda_model <-
discrim_linear() %>%
set_engine("MASS") %>%
set_mode("classification")
lda_fit <-
fit(lda_model,
Direction ~ Lag1 + Lag2,
data = smarket_train
)
lda_fit
parsnip model object
Fit time: 10ms
Call:
lda(Direction ~ Lag1 + Lag2, data = data)
Prior probabilities of groups:
Down Up
0.491984 0.508016
Group means:
Lag1 Lag2
Down 0.04279022 0.03389409
Up -0.03954635 -0.03132544
Coefficients of linear discriminants:
LD1
Lag1 -0.6420190
Lag2 -0.5135293
predict(lda_fit, smarket_test)
# A tibble: 252 x 1
.pred_class
<fct>
1 Up
2 Up
3 Up
4 Up
5 Up
6 Up
7 Up
8 Up
9 Up
10 Up
# ... with 242 more rows
predict(lda_fit, smarket_test, type = "prob")
# A tibble: 252 x 2
.pred_Down .pred_Up
<dbl> <dbl>
1 0.490 0.510
2 0.479 0.521
3 0.467 0.533
4 0.474 0.526
5 0.493 0.507
6 0.494 0.506
7 0.495 0.505
8 0.487 0.513
9 0.491 0.509
10 0.484 0.516
# ... with 242 more rows
lda_pred <- augment(lda_fit, smarket_test)
lda_pred
# A tibble: 252 x 12
Year Lag1 Lag2 Lag3 Lag4 Lag5 Volume Today Direction .pred_class
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <fct>
1 2005 -0.134 0.008 -0.007 0.715 -0.431 0.787 -0.812 Down Up
2 2005 -0.812 -0.134 0.008 -0.007 0.715 1.51 -1.17 Down Up
3 2005 -1.17 -0.812 -0.134 0.008 -0.007 1.72 -0.363 Down Up
4 2005 -0.363 -1.17 -0.812 -0.134 0.008 1.74 0.351 Up Up
5 2005 0.351 -0.363 -1.17 -0.812 -0.134 1.57 -0.143 Down Up
6 2005 -0.143 0.351 -0.363 -1.17 -0.812 1.48 0.342 Up Up
7 2005 0.342 -0.143 0.351 -0.363 -1.17 1.49 -0.61 Down Up
8 2005 -0.61 0.342 -0.143 0.351 -0.363 1.49 0.398 Up Up
9 2005 0.398 -0.61 0.342 -0.143 0.351 1.56 -0.863 Down Up
10 2005 -0.863 0.398 -0.61 0.342 -0.143 1.51 0.6 Up Up
# ... with 242 more rows, and 2 more variables: .pred_Down <dbl>,
# .pred_Up <dbl>
lda_pred %>% conf_mat(truth = Direction, estimate = .pred_class)
Truth
Prediction Down Up
Down 35 35
Up 76 106
lda_pred %>% accuracy(truth = Direction, estimate = .pred_class)
# A tibble: 1 x 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 accuracy binary 0.560
lda_pred %>% roc_auc(truth = Direction, estimate = .pred_Up)
# A tibble: 1 x 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 roc_auc binary 0.442
knn_model1 <-
nearest_neighbor(neighbors = 1) %>%
set_engine("kknn") %>%
set_mode("classification")
knn_model3 <-
nearest_neighbor(neighbors = 3) %>%
set_engine("kknn") %>%
set_mode("classification")
knn_model5 <-
nearest_neighbor(neighbors = 5) %>%
set_engine("kknn") %>%
set_mode("classification")
knn_fit1 <-
fit(knn_model1,
Direction ~Lag1 + Lag2,
data = smarket_train)
knn_fit3 <-
fit(knn_model3,
Direction ~Lag1 + Lag2,
data = smarket_train)
knn_fit5 <-
fit(knn_model5,
Direction ~Lag1 + Lag2,
data = smarket_train)
knn_pred1 <-
augment(knn_fit1, smarket_test)
knn_pred3 <-
augment(knn_fit3, smarket_test)
knn_pred5 <-
augment(knn_fit5, smarket_test)
knn_pred1 %>%
conf_mat(
truth = Direction,
estimate = .pred_class
)
Truth
Prediction Down Up
Down 43 58
Up 68 83
knn_pred3 %>%
conf_mat(
truth = Direction,
estimate = .pred_class
)
Truth
Prediction Down Up
Down 43 58
Up 68 83
knn_pred5 %>%
conf_mat(
truth = Direction,
estimate = .pred_class
)
Truth
Prediction Down Up
Down 45 57
Up 66 84
knn_pred1 %>% accuracy(truth = Direction, estimate = .pred_class)
# A tibble: 1 x 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 accuracy binary 0.5
knn_pred3 %>% accuracy(truth = Direction, estimate = .pred_class)
# A tibble: 1 x 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 accuracy binary 0.5
knn_pred5 %>% accuracy(truth = Direction, estimate = .pred_class)
# A tibble: 1 x 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 accuracy binary 0.512
knn_pred1 %>% roc_auc(truth = Direction, estimate = .pred_Up)
# A tibble: 1 x 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 roc_auc binary 0.512
knn_pred3 %>% roc_auc(truth = Direction, estimate = .pred_Up)
# A tibble: 1 x 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 roc_auc binary 0.486
knn_pred5 %>% roc_auc(truth = Direction, estimate = .pred_Up)
# A tibble: 1 x 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 roc_auc binary 0.482
This section is new and not part of ISLR. We have fitted a lot of different models in this lab. And we were able to calculate the performance metrics one by one, but it is not ideal if we want to compare the different models. Below is an example of how you can more conveniently calculate performance metrics for multiple models at the same time.
Start of by creating a named list of the fitted models you want to evaluate. I have made sure only to include models that were fitted on the same parameters to make it easier to compare them.
models <- list(
"log_reg" = log_fit_train,
"LDA" = lda_fit,
"KNN1" = knn_fit1,
"KNN5" = knn_fit5
)
preds <- imap_dfr(models,
augment,
new_data = smarket_test,
.id = "model")
preds %>%
dplyr::select(Direction, .pred_class, .pred_Down, .pred_Up)
# A tibble: 1,008 x 4
Direction .pred_class .pred_Down .pred_Up
<fct> <fct> <dbl> <dbl>
1 Down Up 0.472 0.528
2 Down Up 0.484 0.516
3 Down Up 0.477 0.523
4 Up Up 0.486 0.514
5 Down Down 0.502 0.498
6 Up Up 0.499 0.501
7 Down Up 0.497 0.503
8 Up Up 0.490 0.510
9 Down Up 0.496 0.504
10 Up Up 0.489 0.511
# ... with 998 more rows
multi_metric <- metric_set(accuracy, sensitivity, specificity)
preds %>%
group_by(model) %>%
multi_metric(truth = Direction, estimate = .pred_class)
# A tibble: 12 x 4
model .metric .estimator .estimate
<chr> <chr> <chr> <dbl>
1 KNN1 accuracy binary 0.5
2 KNN5 accuracy binary 0.512
3 LDA accuracy binary 0.560
4 log_reg accuracy binary 0.480
5 KNN1 sens binary 0.387
6 KNN5 sens binary 0.405
7 LDA sens binary 0.315
8 log_reg sens binary 0.694
9 KNN1 spec binary 0.589
10 KNN5 spec binary 0.596
11 LDA spec binary 0.752
12 log_reg spec binary 0.312
preds %>%
group_by(model) %>%
roc_curve(Direction, .pred_Down) %>%
autoplot()
An Introduction to Statistcial Learning https://www.statlearning.com/
ISLR tidymodels Labs https://emilhvitfeldt.github.io/ISLR-tidymodels-labs/linear-regression.html
– END
sessionInfo()
R version 4.1.0 (2021-05-18)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19043)
Matrix products: default
locale:
[1] LC_COLLATE=Spanish_Mexico.1252 LC_CTYPE=Spanish_Mexico.1252
[3] LC_MONETARY=Spanish_Mexico.1252 LC_NUMERIC=C
[5] LC_TIME=Spanish_Mexico.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] ggcorrplot_0.1.3 corrr_0.4.3 GGally_2.1.2 yardstick_0.0.8
[5] workflowsets_0.0.2 workflows_0.2.3 tune_0.1.6 tidyr_1.1.3
[9] tibble_3.1.2 rsample_0.1.0 recipes_0.1.16 purrr_0.3.4
[13] modeldata_0.1.1 infer_0.5.4 ggplot2_3.3.5 dplyr_1.0.7
[17] dials_0.0.9 scales_1.1.1 broom_0.7.8 tidymodels_0.1.3
[21] discrim_0.1.3 parsnip_0.1.7 ISLR_1.2
loaded via a namespace (and not attached):
[1] lubridate_1.7.10 DiceDesign_1.9 RColorBrewer_1.1-2 tools_4.1.0
[5] backports_1.2.1 bslib_0.2.5.1 utf8_1.2.1 R6_2.5.0
[9] rpart_4.1-15 DBI_1.1.1 colorspace_2.0-2 nnet_7.3-16
[13] withr_2.4.2 prettyunits_1.1.1 tidyselect_1.1.1 compiler_4.1.0
[17] cli_3.0.0 labeling_0.4.2 sass_0.4.0 stringr_1.4.0
[21] digest_0.6.27 rmarkdown_2.9 pkgconfig_2.0.3 htmltools_0.5.1.1
[25] parallelly_1.26.1 lhs_1.1.1 highr_0.9 rlang_0.4.11
[29] rstudioapi_0.13 farver_2.1.0 jquerylib_0.1.4 generics_0.1.0
[33] jsonlite_1.7.2 magrittr_2.0.1 Matrix_1.3-3 Rcpp_1.0.7
[37] munsell_0.5.0 fansi_0.5.0 GPfit_1.0-8 lifecycle_1.0.0
[41] furrr_0.2.3 stringi_1.6.2 pROC_1.17.0.1 yaml_2.2.1
[45] MASS_7.3-54 plyr_1.8.6 grid_4.1.0 parallel_4.1.0
[49] listenv_0.8.0 crayon_1.4.1 lattice_0.20-44 splines_4.1.0
[53] knitr_1.33 pillar_1.6.1 igraph_1.2.6 reshape2_1.4.4
[57] codetools_0.2-18 glue_1.4.2 evaluate_0.14 vctrs_0.3.8
[61] foreach_1.5.1 gtable_0.3.0 reshape_0.8.8 future_1.21.0
[65] assertthat_0.2.1 xfun_0.24 gower_0.2.2 prodlim_2019.11.13
[69] class_7.3-19 survival_3.2-11 timeDate_3043.102 kknn_1.3.1
[73] iterators_1.0.13 hardhat_0.1.6 lava_1.6.9 globals_0.14.0
[77] ellipsis_0.3.2 ipred_0.9-11