An Introduction to Statistical Learning (2nd ed.)

Labs

Chapter 12 Unsupervised Learning

library(tidymodels)
library(ISLR)
library(patchwork)
library(factoextra)

theme_set(theme_bw())

usa <- as_tibble(USArrests, rownames = "state")

Notice how the mean of each of the variables is quite different. if we were to apply PCA directly to the data set then Murder would have a very small influence.

usa %>% 
  select(-state) %>% 
  map_dfr(mean)

# A tibble: 1 x 4
  Murder Assault UrbanPop  Rape
   <dbl>   <dbl>    <dbl> <dbl>
1   7.79    171.     65.5  21.2

We will show how to perform PCA in two different ways in this section. First, by using prcomp() directly, using broom to extract the information we need, and secondly by using recipes. prcomp() Takes 1 required argument x which much be a fully numeric data.frame or matrix. Then we pass that to prcomp(). We also set scale = TRUE in prcomp() which will perform the scaling we need.

usa_pca <- 
  usa %>% 
  select(-state) %>% 
  prcomp(scale = T)

usa_pca

Standard deviations (1, .., p=4):
[1] 1.5748783 0.9948694 0.5971291 0.4164494

Rotation (n x k) = (4 x 4):
                PC1        PC2        PC3         PC4
Murder   -0.5358995  0.4181809 -0.3412327  0.64922780
Assault  -0.5831836  0.1879856 -0.2681484 -0.74340748
UrbanPop -0.2781909 -0.8728062 -0.3780158  0.13387773
Rape     -0.5434321 -0.1673186  0.8177779  0.08902432

usa_pca %>%  summary()

Importance of components:
                          PC1    PC2     PC3     PC4
Standard deviation     1.5749 0.9949 0.59713 0.41645
Proportion of Variance 0.6201 0.2474 0.08914 0.04336
Cumulative Proportion  0.6201 0.8675 0.95664 1.00000

now we can use our favorite broom function to extract information from this prcomp object. We start with tidy(). tidy() can be used to extract a couple of different things, see ?broom:::tidy.prcomp() for more information. tidy() will by default extract the scores of a PCA object in long tidy format. The score of is the location of the observation in PCA space. So we can

tidy(usa_pca)

# A tibble: 200 x 3
     row    PC  value
   <int> <dbl>  <dbl>
 1     1     1 -0.976
 2     1     2  1.12 
 3     1     3 -0.440
 4     1     4  0.155
 5     2     1 -1.93 
 6     2     2  1.06 
 7     2     3  2.02 
 8     2     4 -0.434
 9     3     1 -1.75 
10     3     2 -0.738
# ... with 190 more rows

#tidy(usa_pca, matrix = "scores")

tidy(usa_pca, matrix = "loadings")

# A tibble: 16 x 3
   column      PC   value
   <chr>    <dbl>   <dbl>
 1 Murder       1 -0.536 
 2 Murder       2  0.418 
 3 Murder       3 -0.341 
 4 Murder       4  0.649 
 5 Assault      1 -0.583 
 6 Assault      2  0.188 
 7 Assault      3 -0.268 
 8 Assault      4 -0.743 
 9 UrbanPop     1 -0.278 
10 UrbanPop     2 -0.873 
11 UrbanPop     3 -0.378 
12 UrbanPop     4  0.134 
13 Rape         1 -0.543 
14 Rape         2 -0.167 
15 Rape         3  0.818 
16 Rape         4  0.0890

tidy(usa_pca, matrix = "eigenvalues")

# A tibble: 4 x 4
     PC std.dev percent cumulative
  <dbl>   <dbl>   <dbl>      <dbl>
1     1   1.57   0.620       0.620
2     2   0.995  0.247       0.868
3     3   0.597  0.0891      0.957
4     4   0.416  0.0434      1

tidy(usa_pca, matrix = "loadings") %>% 
  ggplot(aes(value, column, fill = factor(PC))) + 
  facet_wrap(~ PC) + 
  geom_col() + 
  theme(legend.position = "none")

tidy(usa_pca, matrix = "eigenvalues") %>% 
  ggplot(aes(PC, percent)) + 
  geom_col() + 
  labs(x = "Principal Components",
       y = "Variance (%)",
       title = "Hola")

augment(usa_pca)

# A tibble: 50 x 5
   .rownames .fittedPC1 .fittedPC2 .fittedPC3 .fittedPC4
   <chr>          <dbl>      <dbl>      <dbl>      <dbl>
 1 1            -0.976      1.12      -0.440     0.155  
 2 2            -1.93       1.06       2.02     -0.434  
 3 3            -1.75      -0.738      0.0542   -0.826  
 4 4             0.140      1.11       0.113    -0.181  
 5 5            -2.50      -1.53       0.593    -0.339  
 6 6            -1.50      -0.978      1.08      0.00145
 7 7             1.34      -1.08      -0.637    -0.117  
 8 8            -0.0472    -0.322     -0.711    -0.873  
 9 9            -2.98       0.0388    -0.571    -0.0953 
10 10           -1.62       1.27      -0.339     1.07   
# ... with 40 more rows

biplot(usa_pca)

pca_rec <- 
  recipe(~ ., data = usa) %>% 
  step_normalize(all_numeric_predictors()) %>% 
  step_pca(all_numeric_predictors(), id = "pca") %>% 
  prep()

pca_rec %>% juice

# A tibble: 50 x 5
   state           PC1     PC2     PC3      PC4
   <fct>         <dbl>   <dbl>   <dbl>    <dbl>
 1 Alabama     -0.976   1.12   -0.440   0.155  
 2 Alaska      -1.93    1.06    2.02   -0.434  
 3 Arizona     -1.75   -0.738   0.0542 -0.826  
 4 Arkansas     0.140   1.11    0.113  -0.181  
 5 California  -2.50   -1.53    0.593  -0.339  
 6 Colorado    -1.50   -0.978   1.08    0.00145
 7 Connecticut  1.34   -1.08   -0.637  -0.117  
 8 Delaware    -0.0472 -0.322  -0.711  -0.873  
 9 Florida     -2.98    0.0388 -0.571  -0.0953 
10 Georgia     -1.62    1.27   -0.339   1.07   
# ... with 40 more rows

tidy(pca_rec, id = "pca", type = "coef")

# A tibble: 16 x 4
   terms      value component id   
   <chr>      <dbl> <chr>     <chr>
 1 Murder   -0.536  PC1       pca  
 2 Assault  -0.583  PC1       pca  
 3 UrbanPop -0.278  PC1       pca  
 4 Rape     -0.543  PC1       pca  
 5 Murder    0.418  PC2       pca  
 6 Assault   0.188  PC2       pca  
 7 UrbanPop -0.873  PC2       pca  
 8 Rape     -0.167  PC2       pca  
 9 Murder   -0.341  PC3       pca  
10 Assault  -0.268  PC3       pca  
11 UrbanPop -0.378  PC3       pca  
12 Rape      0.818  PC3       pca  
13 Murder    0.649  PC4       pca  
14 Assault  -0.743  PC4       pca  
15 UrbanPop  0.134  PC4       pca  
16 Rape      0.0890 PC4       pca

tidy(pca_rec, id = "pca", type = "variance")

# A tibble: 16 x 4
   terms                         value component id   
   <chr>                         <dbl>     <int> <chr>
 1 variance                      2.48          1 pca  
 2 variance                      0.990         2 pca  
 3 variance                      0.357         3 pca  
 4 variance                      0.173         4 pca  
 5 cumulative variance           2.48          1 pca  
 6 cumulative variance           3.47          2 pca  
 7 cumulative variance           3.83          3 pca  
 8 cumulative variance           4             4 pca  
 9 percent variance             62.0           1 pca  
10 percent variance             24.7           2 pca  
11 percent variance              8.91          3 pca  
12 percent variance              4.34          4 pca  
13 cumulative percent variance  62.0           1 pca  
14 cumulative percent variance  86.8           2 pca  
15 cumulative percent variance  95.7           3 pca  
16 cumulative percent variance 100             4 pca

pca_rec <- 
  recipe(~ ., data = usa) %>% 
  step_normalize(all_numeric_predictors()) %>% 
  step_pca(all_numeric_predictors(), num_comp = 3) %>% 
  prep() %>% 
  juice() %>% 
  print()

# A tibble: 50 x 4
   state           PC1     PC2     PC3
   <fct>         <dbl>   <dbl>   <dbl>
 1 Alabama     -0.976   1.12   -0.440 
 2 Alaska      -1.93    1.06    2.02  
 3 Arizona     -1.75   -0.738   0.0542
 4 Arkansas     0.140   1.11    0.113 
 5 California  -2.50   -1.53    0.593 
 6 Colorado    -1.50   -0.978   1.08  
 7 Connecticut  1.34   -1.08   -0.637 
 8 Delaware    -0.0472 -0.322  -0.711 
 9 Florida     -2.98    0.0388 -0.571 
10 Georgia     -1.62    1.27   -0.339 
# ... with 40 more rows

pca_rec <- 
  recipe(~ ., data = usa) %>% 
  step_normalize(all_numeric_predictors()) %>% 
  step_pca(all_numeric_predictors(), threshold = .7) %>% 
  prep() %>% 
  juice() %>% 
  print()

# A tibble: 50 x 3
   state           PC1     PC2
   <fct>         <dbl>   <dbl>
 1 Alabama     -0.976   1.12  
 2 Alaska      -1.93    1.06  
 3 Arizona     -1.75   -0.738 
 4 Arkansas     0.140   1.11  
 5 California  -2.50   -1.53  
 6 Colorado    -1.50   -0.978 
 7 Connecticut  1.34   -1.08  
 8 Delaware    -0.0472 -0.322 
 9 Florida     -2.98    0.0388
10 Georgia     -1.62    1.27  
# ... with 40 more rows

Kmeans Clustering

set.seed(2)

x_df <- tibble(
  V1 = rnorm(n = 50, mean = rep(c(0, 3), each = 25)),
  V2 = rnorm(n = 50, mean = rep(c(0, -4), each = 25))
)

x_df %>%
  ggplot(aes(V1, V2, color = rep(c("A", "B"), each = 25))) +
  geom_point()

set.seed(1234)
(res_kmeans <- kmeans(x_df, centers = 3, nstart = 20))

K-means clustering with 3 clusters of sizes 11, 23, 16

Cluster means:
         V1          V2
1 2.5355362 -2.48605364
2 0.2339095  0.04414551
3 2.8241300 -5.01221675

Clustering vector:
 [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 1 2 2 2 2 3 1 1 1 3 1 3 3 3 3 1 3 3
[39] 3 1 1 1 3 3 3 3 1 3 3 3

Within cluster sum of squares by cluster:
[1] 14.56698 54.84869 26.98215
 (between_SS / total_SS =  76.8 %)

Available components:

[1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
[6] "betweenss"    "size"         "iter"         "ifault"

tidy(res_kmeans)

# A tibble: 3 x 5
     V1      V2  size withinss cluster
  <dbl>   <dbl> <int>    <dbl> <fct>  
1 2.54  -2.49      11     14.6 1      
2 0.234  0.0441    23     54.8 2      
3 2.82  -5.01      16     27.0 3

augment(res_kmeans, data = x_df)

# A tibble: 50 x 3
        V1     V2 .cluster
     <dbl>  <dbl> <fct>   
 1 -0.897  -0.838 2       
 2  0.185   2.07  2       
 3  1.59   -0.562 2       
 4 -1.13    1.28  2       
 5 -0.0803 -1.05  2       
 6  0.132  -1.97  2       
 7  0.708  -0.323 2       
 8 -0.240   0.936 2       
 9  1.98    1.14  2       
10 -0.139   1.67  2       
# ... with 40 more rows

augment(res_kmeans, data = x_df) %>%
  ggplot(aes(V1, V2, color = .cluster)) +
  geom_point()

Hierarchical Clustering

res_hclust_complete <- x_df %>%
  dist() %>%
  hclust(method = "complete")

res_hclust_average <- x_df %>%
  dist() %>%
  hclust(method = "average")

res_hclust_single <- x_df %>%
  dist() %>%
  hclust(method = "single")

fviz_dend(res_hclust_complete, main = "Complete", k = 2)

fviz_dend(res_hclust_average, main = "average", k = 2)

fviz_dend(res_hclust_single, main = "single", k = 2)

If we don’t know the importance of the different predictors in data set it could be beneficial to scale the data such that each variable has the same influence. We can perform scaling by using scale() before dist().

x_df %>%
  scale() %>%
  dist() %>%
  hclust(method = "complete") %>%
  fviz_dend(k = 2)

NCI60 dataset

data(NCI60, package = "ISLR")
nci60 <- NCI60$data %>%
  as_tibble() %>%
  mutate(label = factor(NCI60$labs)) %>%
  relocate(label)

nci60_pca <- nci60 %>%
  select(-label) %>%
  prcomp(scale = TRUE)

nci60_pcs <- bind_cols(
  augment(nci60_pca),
  nci60 %>% select(label)
  )

colors <- unname(palette.colors(n = 14, palette = "Polychrome 36"))

nci60_pcs %>%
  ggplot(aes(.fittedPC1, .fittedPC2, color = label)) +
  geom_point() +
  theme_minimal() +
  scale_color_manual(values = colors)

nci60_pcs %>%
  ggplot(aes(.fittedPC1, .fittedPC3, color = label)) +
  geom_point() +
  theme_minimal() +
  scale_color_manual(values = colors)

tidy(nci60_pca, matrix = "eigenvalues") %>%
  ggplot(aes(PC, percent)) +
  geom_point() +
  geom_line()

tidy(nci60_pca, matrix = "eigenvalues") %>%
  ggplot(aes(PC, cumulative)) +
  geom_point() +
  geom_line()

nci60_scaled <- recipe(~ ., data = nci60) %>%
  step_rm(label) %>%
  step_normalize(all_predictors()) %>%
  prep() %>%
  bake(new_data = NULL)

nci60_complete <- nci60_scaled %>%
    dist() %>%
    hclust(method = "complete")

nci60_average <- nci60_scaled %>%
    dist() %>%
    hclust(method = "average")

nci60_single <- nci60_scaled %>%
    dist() %>%
    hclust(method = "single")

fviz_dend(nci60_complete, main = "Complete")

fviz_dend(nci60_average, main = "Average")

fviz_dend(nci60_single, main = "Single")

nci60_complete %>%
  fviz_dend(k = 4, main = "hclust(complete) on nci60")

Sometimes useful is to perform dimensionality reduction before using the clustering method. Let us use the recipes package to calculate the PCA of nci60 and keep the 5 first components. (we could have started with nci60 too if we added step_rm() and step_normalize()).

nci60_pca <- recipe(~., nci60_scaled) %>%
  step_pca(all_predictors(), num_comp = 5) %>%
  prep() %>%
  bake(new_data = NULL)

nci60_pca %>%
  dist() %>%
  hclust() %>%
  fviz_dend(k = 4, main = "hclust on first five PCs")

An Introduction to Statistcial Learning

ISLR tidymodels Labs

– END

sessionInfo()

R version 4.1.0 (2021-05-18)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19043)

Matrix products: default

locale:
[1] LC_COLLATE=Spanish_Mexico.1252  LC_CTYPE=Spanish_Mexico.1252   
[3] LC_MONETARY=Spanish_Mexico.1252 LC_NUMERIC=C                   
[5] LC_TIME=Spanish_Mexico.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] factoextra_1.0.7   patchwork_1.1.1    ISLR_1.2           yardstick_0.0.8   
 [5] workflowsets_0.0.2 workflows_0.2.3    tune_0.1.6         tidyr_1.1.3       
 [9] tibble_3.1.3       rsample_0.1.0      recipes_0.1.16     purrr_0.3.4       
[13] parsnip_0.1.7      modeldata_0.1.1    infer_0.5.4        ggplot2_3.3.5     
[17] dplyr_1.0.7        dials_0.0.9        scales_1.1.1       broom_0.7.8       
[21] tidymodels_0.1.3  

loaded via a namespace (and not attached):
 [1] colorspace_2.0-2   ggsignif_0.6.2     ellipsis_0.3.2     class_7.3-19      
 [5] rio_0.5.27         rstudioapi_0.13    ggpubr_0.4.0       listenv_0.8.0     
 [9] furrr_0.2.3        farver_2.1.0       ggrepel_0.9.1      prodlim_2019.11.13
[13] fansi_0.5.0        lubridate_1.7.10   codetools_0.2-18   splines_4.1.0     
[17] knitr_1.33         jsonlite_1.7.2     pROC_1.17.0.1      compiler_4.1.0    
[21] backports_1.2.1    assertthat_0.2.1   Matrix_1.3-3       cli_3.0.0         
[25] htmltools_0.5.1.1  tools_4.1.0        gtable_0.3.0       glue_1.4.2        
[29] Rcpp_1.0.7         carData_3.0-4      cellranger_1.1.0   jquerylib_0.1.4   
[33] DiceDesign_1.9     vctrs_0.3.8        iterators_1.0.13   timeDate_3043.102 
[37] gower_0.2.2        xfun_0.24          stringr_1.4.0      globals_0.14.0    
[41] openxlsx_4.2.4     lifecycle_1.0.0    rstatix_0.7.0      dendextend_1.15.1 
[45] future_1.21.0      MASS_7.3-54        ipred_0.9-11       hms_1.1.0         
[49] parallel_4.1.0     yaml_2.2.1         curl_4.3.2         gridExtra_2.3     
[53] sass_0.4.0         rpart_4.1-15       stringi_1.6.2      highr_0.9         
[57] foreach_1.5.1      lhs_1.1.1          hardhat_0.1.6      zip_2.2.0         
[61] lava_1.6.9         rlang_0.4.11       pkgconfig_2.0.3    evaluate_0.14     
[65] lattice_0.20-44    labeling_0.4.2     tidyselect_1.1.1   parallelly_1.26.1 
[69] plyr_1.8.6         magrittr_2.0.1     R6_2.5.0           generics_0.1.0    
[73] DBI_1.1.1          pillar_1.6.2       haven_2.4.1        foreign_0.8-81    
[77] withr_2.4.2        abind_1.4-5        survival_3.2-11    nnet_7.3-16       
[81] crayon_1.4.1       car_3.0-11         utf8_1.2.2         rmarkdown_2.9     
[85] viridis_0.6.1      grid_4.1.0         readxl_1.3.1       data.table_1.14.0 
[89] forcats_0.5.1      digest_0.6.27      GPfit_1.0-8        munsell_0.5.0     
[93] viridisLite_0.4.0  bslib_0.2.5.1

12. Unsupervised Learning

GS

28/8/2021

An Introduction to Statistical Learning (2nd ed.)

Labs

Chapter 12

Unsupervised Learning

Kmeans Clustering

Hierarchical Clustering

NCI60 dataset