An Introduction to Statistical Learning (2nd ed.)

Labs

Chapter 04 Classification

We will be examining the Smarket data set for this lab. It contains a number of numeric variables plus a variable called Direction which has the two labels “Up” and “Down”. Before we do on to modeling, let us take a look at the correlation between the variables.

To look at the correlation, we will use the corrr package. The correlate() function will calculate the correlation matrix between all the variables that it is being fed. We will therefore remove Direction as it is not numeric. Then we pass that to rplot() to quickly visualize the correlation matrix. I have also changed the colours argument to better see what is going on.

###EDA

smarket <- tibble(Smarket)
glimpse(smarket)

Rows: 1,250
Columns: 9
$ Year      <dbl> 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, ~
$ Lag1      <dbl> 0.381, 0.959, 1.032, -0.623, 0.614, 0.213, 1.392, -0.403, 0.~
$ Lag2      <dbl> -0.192, 0.381, 0.959, 1.032, -0.623, 0.614, 0.213, 1.392, -0~
$ Lag3      <dbl> -2.624, -0.192, 0.381, 0.959, 1.032, -0.623, 0.614, 0.213, 1~
$ Lag4      <dbl> -1.055, -2.624, -0.192, 0.381, 0.959, 1.032, -0.623, 0.614, ~
$ Lag5      <dbl> 5.010, -1.055, -2.624, -0.192, 0.381, 0.959, 1.032, -0.623, ~
$ Volume    <dbl> 1.1913, 1.2965, 1.4112, 1.2760, 1.2057, 1.3491, 1.4450, 1.40~
$ Today     <dbl> 0.959, 1.032, -0.623, 0.614, 0.213, 1.392, -0.403, 0.027, 1.~
$ Direction <fct> Up, Up, Down, Up, Up, Up, Down, Up, Up, Up, Down, Down, Up, ~

smarket %>% slice(1:4)

# A tibble: 4 x 9
   Year   Lag1   Lag2   Lag3   Lag4   Lag5 Volume  Today Direction
  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl> <fct>    
1  2001  0.381 -0.192 -2.62  -1.06   5.01    1.19  0.959 Up       
2  2001  0.959  0.381 -0.192 -2.62  -1.06    1.30  1.03  Up       
3  2001  1.03   0.959  0.381 -0.192 -2.62    1.41 -0.623 Down     
4  2001 -0.623  1.03   0.959  0.381 -0.192   1.28  0.614 Up

smarket %>% 
  group_by(Direction) %>% 
  count()

# A tibble: 2 x 2
# Groups:   Direction [2]
  Direction     n
  <fct>     <int>
1 Down        602
2 Up          648

And we see that these variables are more or less uncorrelated with each other. The other pair is Year and Volume that is a little correlated.

If you want to create heatmap styled correlation chart you can also create it manually.

library(ggcorrplot)

correl <- smarket %>% 
  select(-Direction) %>% 
  cor()

ggcorrplot(correl, hc.order = T, type = "lower", lab = T)

ggplot(smarket, aes(Year, Volume)) + 
  geom_jitter()

## Logistic Regression

Now we will fit a logistic regression model. We will again use the parsnip package, and we will use logistic_reg() to create a logistic regression model specification.

log_model <- 
  logistic_reg() %>% 
  set_engine("glm")

We can now fit the model like normal. We want to model the Direction of the stock market based on the percentage return from the 5 previous days plus the volume of shares traded. When fitting a classification with parsnip requires that the response variable is a factor. This is the case for the Smarket data set so we don’t need to do adjustments.

log_fit <- 
  fit(log_model, 
      Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume, 
      data = smarket
  )

log_fit

parsnip model object

Fit time:  0ms 

Call:  stats::glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + 
    Lag5 + Volume, family = stats::binomial, data = data)

Coefficients:
(Intercept)         Lag1         Lag2         Lag3         Lag4         Lag5  
  -0.126000    -0.073074    -0.042301     0.011085     0.009359     0.010313  
     Volume  
   0.135441  

Degrees of Freedom: 1249 Total (i.e. Null);  1243 Residual
Null Deviance:      1731 
Residual Deviance: 1728     AIC: 1742

log_fit %>% pluck("fit") %>%  summary()


Call:
stats::glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + 
    Lag5 + Volume, family = stats::binomial, data = data)

Deviance Residuals: 
   Min      1Q  Median      3Q     Max  
-1.446  -1.203   1.065   1.145   1.326  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.126000   0.240736  -0.523    0.601
Lag1        -0.073074   0.050167  -1.457    0.145
Lag2        -0.042301   0.050086  -0.845    0.398
Lag3         0.011085   0.049939   0.222    0.824
Lag4         0.009359   0.049974   0.187    0.851
Lag5         0.010313   0.049511   0.208    0.835
Volume       0.135441   0.158360   0.855    0.392

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1731.2  on 1249  degrees of freedom
Residual deviance: 1727.6  on 1243  degrees of freedom
AIC: 1741.6

Number of Fisher Scoring iterations: 3

tidy(log_fit)

# A tibble: 7 x 5
  term        estimate std.error statistic p.value
  <chr>          <dbl>     <dbl>     <dbl>   <dbl>
1 (Intercept) -0.126      0.241     -0.523   0.601
2 Lag1        -0.0731     0.0502    -1.46    0.145
3 Lag2        -0.0423     0.0501    -0.845   0.398
4 Lag3         0.0111     0.0499     0.222   0.824
5 Lag4         0.00936    0.0500     0.187   0.851
6 Lag5         0.0103     0.0495     0.208   0.835
7 Volume       0.135      0.158      0.855   0.392

glance(log_fit)

# A tibble: 1 x 8
  null.deviance df.null logLik   AIC   BIC deviance df.residual  nobs
          <dbl>   <int>  <dbl> <dbl> <dbl>    <dbl>       <int> <int>
1         1731.    1249  -864. 1742. 1778.    1728.        1243  1250

A good performing model would ideally have high numbers along the diagonal (up-left to down-right) with small numbers on the off-diagonal. We see here that the model isn’t great, as it tends to predict “Down” as “Up” more often than it should.

predicting <- augment(log_fit, smarket)

conf_mat(predicting,
         truth = Direction, 
         estimate = .pred_class)

          Truth
Prediction Down  Up
      Down  145 141
      Up    457 507

predicting %>% 
  accuracy(truth = Direction, estimate = .pred_class)

# A tibble: 1 x 3
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 accuracy binary         0.522

predicting %>% 
  roc_auc(truth = Direction, estimate = .pred_Up)

# A tibble: 1 x 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 roc_auc binary         0.461

We just fit a model and evaluated it on the same data. This doesn’t give us that much information about the model performance. Let us instead split up the data, train it on some of it and then evaluate it on the other part of the data. Since we are working with some data that has a time component, it is natural to fit the model using the first year’s worth of data and evaluate it on the last year. This would more closely match how such a model would be used in real life.

smarket_train <- smarket %>% 
  filter(Year != 2005)

smarket_test <- smarket %>% 
  filter(Year == 2005)

log_fit_train <- 
  fit(log_model, 
      Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume, 
      data = smarket_train
  )

log_fit_train

parsnip model object

Fit time:  0ms 

Call:  stats::glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + 
    Lag5 + Volume, family = stats::binomial, data = data)

Coefficients:
(Intercept)         Lag1         Lag2         Lag3         Lag4         Lag5  
   0.191213    -0.054178    -0.045805     0.007200     0.006441    -0.004223  
     Volume  
  -0.116257  

Degrees of Freedom: 997 Total (i.e. Null);  991 Residual
Null Deviance:      1383 
Residual Deviance: 1381     AIC: 1395

log_fit_train %>% pluck("fit") %>%  summary()


Call:
stats::glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + 
    Lag5 + Volume, family = stats::binomial, data = data)

Deviance Residuals: 
   Min      1Q  Median      3Q     Max  
-1.302  -1.190   1.079   1.160   1.350  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept)  0.191213   0.333690   0.573    0.567
Lag1        -0.054178   0.051785  -1.046    0.295
Lag2        -0.045805   0.051797  -0.884    0.377
Lag3         0.007200   0.051644   0.139    0.889
Lag4         0.006441   0.051706   0.125    0.901
Lag5        -0.004223   0.051138  -0.083    0.934
Volume      -0.116257   0.239618  -0.485    0.628

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1383.3  on 997  degrees of freedom
Residual deviance: 1381.1  on 991  degrees of freedom
AIC: 1395.1

Number of Fisher Scoring iterations: 3

pred2 <- augment(log_fit_train, smarket_test)

conf_mat(pred2,
         truth = Direction, 
         estimate = .pred_class)

          Truth
Prediction Down Up
      Down   77 97
      Up     34 44

pred2 %>% 
  accuracy(truth = Direction, estimate = .pred_class)

# A tibble: 1 x 3
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 accuracy binary         0.480

pred2 %>% 
  roc_auc(truth = Direction, estimate = .pred_Up)

# A tibble: 1 x 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 roc_auc binary         0.480

Linear Discriminant Analysis

Now we will perform LDA on the Smarket data. We will use the discrim_linear() function to create a LDA specification. We will continue to use 2 predictors for easy comparison.

lda_model <- 
  discrim_linear() %>% 
  set_engine("MASS") %>% 
  set_mode("classification")

lda_fit <- 
  fit(lda_model,
      Direction ~ Lag1 + Lag2, 
      data = smarket_train
      )

lda_fit

parsnip model object

Fit time:  10ms 
Call:
lda(Direction ~ Lag1 + Lag2, data = data)

Prior probabilities of groups:
    Down       Up 
0.491984 0.508016 

Group means:
            Lag1        Lag2
Down  0.04279022  0.03389409
Up   -0.03954635 -0.03132544

Coefficients of linear discriminants:
            LD1
Lag1 -0.6420190
Lag2 -0.5135293

predict(lda_fit, smarket_test)

# A tibble: 252 x 1
   .pred_class
   <fct>      
 1 Up         
 2 Up         
 3 Up         
 4 Up         
 5 Up         
 6 Up         
 7 Up         
 8 Up         
 9 Up         
10 Up         
# ... with 242 more rows

predict(lda_fit, smarket_test, type = "prob")

# A tibble: 252 x 2
   .pred_Down .pred_Up
        <dbl>    <dbl>
 1      0.490    0.510
 2      0.479    0.521
 3      0.467    0.533
 4      0.474    0.526
 5      0.493    0.507
 6      0.494    0.506
 7      0.495    0.505
 8      0.487    0.513
 9      0.491    0.509
10      0.484    0.516
# ... with 242 more rows

lda_pred <- augment(lda_fit, smarket_test)
lda_pred

# A tibble: 252 x 12
    Year   Lag1   Lag2   Lag3   Lag4   Lag5 Volume  Today Direction .pred_class
   <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl> <fct>     <fct>      
 1  2005 -0.134  0.008 -0.007  0.715 -0.431  0.787 -0.812 Down      Up         
 2  2005 -0.812 -0.134  0.008 -0.007  0.715  1.51  -1.17  Down      Up         
 3  2005 -1.17  -0.812 -0.134  0.008 -0.007  1.72  -0.363 Down      Up         
 4  2005 -0.363 -1.17  -0.812 -0.134  0.008  1.74   0.351 Up        Up         
 5  2005  0.351 -0.363 -1.17  -0.812 -0.134  1.57  -0.143 Down      Up         
 6  2005 -0.143  0.351 -0.363 -1.17  -0.812  1.48   0.342 Up        Up         
 7  2005  0.342 -0.143  0.351 -0.363 -1.17   1.49  -0.61  Down      Up         
 8  2005 -0.61   0.342 -0.143  0.351 -0.363  1.49   0.398 Up        Up         
 9  2005  0.398 -0.61   0.342 -0.143  0.351  1.56  -0.863 Down      Up         
10  2005 -0.863  0.398 -0.61   0.342 -0.143  1.51   0.6   Up        Up         
# ... with 242 more rows, and 2 more variables: .pred_Down <dbl>,
#   .pred_Up <dbl>

lda_pred %>% conf_mat(truth = Direction, estimate = .pred_class)

          Truth
Prediction Down  Up
      Down   35  35
      Up     76 106

lda_pred %>% accuracy(truth = Direction, estimate = .pred_class)

# A tibble: 1 x 3
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 accuracy binary         0.560

lda_pred %>% roc_auc(truth = Direction, estimate = .pred_Up)

# A tibble: 1 x 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 roc_auc binary         0.442

K-Nearest Neighbors

knn_model1 <- 
  nearest_neighbor(neighbors = 1) %>% 
  set_engine("kknn") %>% 
  set_mode("classification")

knn_model3 <- 
  nearest_neighbor(neighbors = 3) %>% 
  set_engine("kknn") %>% 
  set_mode("classification")

knn_model5 <- 
  nearest_neighbor(neighbors = 5) %>% 
  set_engine("kknn") %>% 
  set_mode("classification")

knn_fit1 <- 
  fit(knn_model1,
      Direction ~Lag1 + Lag2,
      data = smarket_train)

knn_fit3 <- 
  fit(knn_model3,
      Direction ~Lag1 + Lag2,
      data = smarket_train)

knn_fit5 <- 
  fit(knn_model5,
      Direction ~Lag1 + Lag2,
      data = smarket_train)

knn_pred1 <- 
  augment(knn_fit1, smarket_test)

knn_pred3 <- 
  augment(knn_fit3, smarket_test)

knn_pred5 <- 
  augment(knn_fit5, smarket_test)

knn_pred1 %>% 
  conf_mat(
    truth = Direction,
    estimate = .pred_class
    )

          Truth
Prediction Down Up
      Down   43 58
      Up     68 83

knn_pred3 %>% 
  conf_mat(
    truth = Direction,
    estimate = .pred_class
    )

          Truth
Prediction Down Up
      Down   43 58
      Up     68 83

knn_pred5 %>% 
  conf_mat(
    truth = Direction,
    estimate = .pred_class
    )

          Truth
Prediction Down Up
      Down   45 57
      Up     66 84

knn_pred1 %>% accuracy(truth = Direction, estimate = .pred_class)

# A tibble: 1 x 3
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 accuracy binary           0.5

knn_pred3 %>% accuracy(truth = Direction, estimate = .pred_class)

# A tibble: 1 x 3
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 accuracy binary           0.5

knn_pred5 %>% accuracy(truth = Direction, estimate = .pred_class)

# A tibble: 1 x 3
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 accuracy binary         0.512

knn_pred1 %>% roc_auc(truth = Direction, estimate = .pred_Up)

# A tibble: 1 x 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 roc_auc binary         0.512

knn_pred3 %>% roc_auc(truth = Direction, estimate = .pred_Up)

# A tibble: 1 x 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 roc_auc binary         0.486

knn_pred5 %>% roc_auc(truth = Direction, estimate = .pred_Up)

# A tibble: 1 x 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 roc_auc binary         0.482

Comapring models

This section is new and not part of ISLR. We have fitted a lot of different models in this lab. And we were able to calculate the performance metrics one by one, but it is not ideal if we want to compare the different models. Below is an example of how you can more conveniently calculate performance metrics for multiple models at the same time.

Start of by creating a named list of the fitted models you want to evaluate. I have made sure only to include models that were fitted on the same parameters to make it easier to compare them.

models <- list(
  "log_reg" = log_fit_train,
  "LDA" = lda_fit,
  "KNN1" = knn_fit1,
  "KNN5" = knn_fit5
  )

preds <- imap_dfr(models, 
                  augment,
                  new_data = smarket_test,
                  .id = "model")

preds %>% 
  dplyr::select(Direction, .pred_class, .pred_Down, .pred_Up)

# A tibble: 1,008 x 4
   Direction .pred_class .pred_Down .pred_Up
   <fct>     <fct>            <dbl>    <dbl>
 1 Down      Up               0.472    0.528
 2 Down      Up               0.484    0.516
 3 Down      Up               0.477    0.523
 4 Up        Up               0.486    0.514
 5 Down      Down             0.502    0.498
 6 Up        Up               0.499    0.501
 7 Down      Up               0.497    0.503
 8 Up        Up               0.490    0.510
 9 Down      Up               0.496    0.504
10 Up        Up               0.489    0.511
# ... with 998 more rows

multi_metric <- metric_set(accuracy, sensitivity, specificity)

preds %>% 
  group_by(model) %>% 
  multi_metric(truth = Direction, estimate = .pred_class)

# A tibble: 12 x 4
   model   .metric  .estimator .estimate
   <chr>   <chr>    <chr>          <dbl>
 1 KNN1    accuracy binary         0.5  
 2 KNN5    accuracy binary         0.512
 3 LDA     accuracy binary         0.560
 4 log_reg accuracy binary         0.480
 5 KNN1    sens     binary         0.387
 6 KNN5    sens     binary         0.405
 7 LDA     sens     binary         0.315
 8 log_reg sens     binary         0.694
 9 KNN1    spec     binary         0.589
10 KNN5    spec     binary         0.596
11 LDA     spec     binary         0.752
12 log_reg spec     binary         0.312

preds %>% 
  group_by(model) %>% 
  roc_curve(Direction, .pred_Down) %>% 
  autoplot()

An Introduction to Statistcial Learning https://www.statlearning.com/

ISLR tidymodels Labs https://emilhvitfeldt.github.io/ISLR-tidymodels-labs/linear-regression.html

– END

sessionInfo()

R version 4.1.0 (2021-05-18)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19043)

Matrix products: default

locale:
[1] LC_COLLATE=Spanish_Mexico.1252  LC_CTYPE=Spanish_Mexico.1252   
[3] LC_MONETARY=Spanish_Mexico.1252 LC_NUMERIC=C                   
[5] LC_TIME=Spanish_Mexico.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] ggcorrplot_0.1.3   corrr_0.4.3        GGally_2.1.2       yardstick_0.0.8   
 [5] workflowsets_0.0.2 workflows_0.2.3    tune_0.1.6         tidyr_1.1.3       
 [9] tibble_3.1.2       rsample_0.1.0      recipes_0.1.16     purrr_0.3.4       
[13] modeldata_0.1.1    infer_0.5.4        ggplot2_3.3.5      dplyr_1.0.7       
[17] dials_0.0.9        scales_1.1.1       broom_0.7.8        tidymodels_0.1.3  
[21] discrim_0.1.3      parsnip_0.1.7      ISLR_1.2          

loaded via a namespace (and not attached):
 [1] lubridate_1.7.10   DiceDesign_1.9     RColorBrewer_1.1-2 tools_4.1.0       
 [5] backports_1.2.1    bslib_0.2.5.1      utf8_1.2.1         R6_2.5.0          
 [9] rpart_4.1-15       DBI_1.1.1          colorspace_2.0-2   nnet_7.3-16       
[13] withr_2.4.2        prettyunits_1.1.1  tidyselect_1.1.1   compiler_4.1.0    
[17] cli_3.0.0          labeling_0.4.2     sass_0.4.0         stringr_1.4.0     
[21] digest_0.6.27      rmarkdown_2.9      pkgconfig_2.0.3    htmltools_0.5.1.1 
[25] parallelly_1.26.1  lhs_1.1.1          highr_0.9          rlang_0.4.11      
[29] rstudioapi_0.13    farver_2.1.0       jquerylib_0.1.4    generics_0.1.0    
[33] jsonlite_1.7.2     magrittr_2.0.1     Matrix_1.3-3       Rcpp_1.0.7        
[37] munsell_0.5.0      fansi_0.5.0        GPfit_1.0-8        lifecycle_1.0.0   
[41] furrr_0.2.3        stringi_1.6.2      pROC_1.17.0.1      yaml_2.2.1        
[45] MASS_7.3-54        plyr_1.8.6         grid_4.1.0         parallel_4.1.0    
[49] listenv_0.8.0      crayon_1.4.1       lattice_0.20-44    splines_4.1.0     
[53] knitr_1.33         pillar_1.6.1       igraph_1.2.6       reshape2_1.4.4    
[57] codetools_0.2-18   glue_1.4.2         evaluate_0.14      vctrs_0.3.8       
[61] foreach_1.5.1      gtable_0.3.0       reshape_0.8.8      future_1.21.0     
[65] assertthat_0.2.1   xfun_0.24          gower_0.2.2        prodlim_2019.11.13
[69] class_7.3-19       survival_3.2-11    timeDate_3043.102  kknn_1.3.1        
[73] iterators_1.0.13   hardhat_0.1.6      lava_1.6.9         globals_0.14.0    
[77] ellipsis_0.3.2     ipred_0.9-11

04. Classification lab

GS

11/8/2021

Packages

An Introduction to Statistical Learning (2nd ed.)

Labs

Chapter 04

Classification

Linear Discriminant Analysis

K-Nearest Neighbors

Comapring models