library(tidymodels)
library(ISLR)
library(patchwork)
library(factoextra)
theme_set(theme_bw())
usa <- as_tibble(USArrests, rownames = "state")
Notice how the mean of each of the variables is quite different. if we were to apply PCA directly to the data set then Murder would have a very small influence.
usa %>%
select(-state) %>%
map_dfr(mean)
# A tibble: 1 x 4
Murder Assault UrbanPop Rape
<dbl> <dbl> <dbl> <dbl>
1 7.79 171. 65.5 21.2
We will show how to perform PCA in two different ways in this section. First, by using prcomp() directly, using broom to extract the information we need, and secondly by using recipes. prcomp() Takes 1 required argument x which much be a fully numeric data.frame or matrix. Then we pass that to prcomp(). We also set scale = TRUE in prcomp() which will perform the scaling we need.
usa_pca <-
usa %>%
select(-state) %>%
prcomp(scale = T)
usa_pca
Standard deviations (1, .., p=4):
[1] 1.5748783 0.9948694 0.5971291 0.4164494
Rotation (n x k) = (4 x 4):
PC1 PC2 PC3 PC4
Murder -0.5358995 0.4181809 -0.3412327 0.64922780
Assault -0.5831836 0.1879856 -0.2681484 -0.74340748
UrbanPop -0.2781909 -0.8728062 -0.3780158 0.13387773
Rape -0.5434321 -0.1673186 0.8177779 0.08902432
usa_pca %>% summary()
Importance of components:
PC1 PC2 PC3 PC4
Standard deviation 1.5749 0.9949 0.59713 0.41645
Proportion of Variance 0.6201 0.2474 0.08914 0.04336
Cumulative Proportion 0.6201 0.8675 0.95664 1.00000
now we can use our favorite broom function to extract information from this prcomp object. We start with tidy(). tidy() can be used to extract a couple of different things, see ?broom:::tidy.prcomp() for more information. tidy() will by default extract the scores of a PCA object in long tidy format. The score of is the location of the observation in PCA space. So we can
tidy(usa_pca)
# A tibble: 200 x 3
row PC value
<int> <dbl> <dbl>
1 1 1 -0.976
2 1 2 1.12
3 1 3 -0.440
4 1 4 0.155
5 2 1 -1.93
6 2 2 1.06
7 2 3 2.02
8 2 4 -0.434
9 3 1 -1.75
10 3 2 -0.738
# ... with 190 more rows
#tidy(usa_pca, matrix = "scores")
tidy(usa_pca, matrix = "loadings")
# A tibble: 16 x 3
column PC value
<chr> <dbl> <dbl>
1 Murder 1 -0.536
2 Murder 2 0.418
3 Murder 3 -0.341
4 Murder 4 0.649
5 Assault 1 -0.583
6 Assault 2 0.188
7 Assault 3 -0.268
8 Assault 4 -0.743
9 UrbanPop 1 -0.278
10 UrbanPop 2 -0.873
11 UrbanPop 3 -0.378
12 UrbanPop 4 0.134
13 Rape 1 -0.543
14 Rape 2 -0.167
15 Rape 3 0.818
16 Rape 4 0.0890
tidy(usa_pca, matrix = "eigenvalues")
# A tibble: 4 x 4
PC std.dev percent cumulative
<dbl> <dbl> <dbl> <dbl>
1 1 1.57 0.620 0.620
2 2 0.995 0.247 0.868
3 3 0.597 0.0891 0.957
4 4 0.416 0.0434 1
tidy(usa_pca, matrix = "loadings") %>%
ggplot(aes(value, column, fill = factor(PC))) +
facet_wrap(~ PC) +
geom_col() +
theme(legend.position = "none")
tidy(usa_pca, matrix = "eigenvalues") %>%
ggplot(aes(PC, percent)) +
geom_col() +
labs(x = "Principal Components",
y = "Variance (%)",
title = "Hola")
augment(usa_pca)
# A tibble: 50 x 5
.rownames .fittedPC1 .fittedPC2 .fittedPC3 .fittedPC4
<chr> <dbl> <dbl> <dbl> <dbl>
1 1 -0.976 1.12 -0.440 0.155
2 2 -1.93 1.06 2.02 -0.434
3 3 -1.75 -0.738 0.0542 -0.826
4 4 0.140 1.11 0.113 -0.181
5 5 -2.50 -1.53 0.593 -0.339
6 6 -1.50 -0.978 1.08 0.00145
7 7 1.34 -1.08 -0.637 -0.117
8 8 -0.0472 -0.322 -0.711 -0.873
9 9 -2.98 0.0388 -0.571 -0.0953
10 10 -1.62 1.27 -0.339 1.07
# ... with 40 more rows
biplot(usa_pca)
pca_rec <-
recipe(~ ., data = usa) %>%
step_normalize(all_numeric_predictors()) %>%
step_pca(all_numeric_predictors(), id = "pca") %>%
prep()
pca_rec %>% juice
# A tibble: 50 x 5
state PC1 PC2 PC3 PC4
<fct> <dbl> <dbl> <dbl> <dbl>
1 Alabama -0.976 1.12 -0.440 0.155
2 Alaska -1.93 1.06 2.02 -0.434
3 Arizona -1.75 -0.738 0.0542 -0.826
4 Arkansas 0.140 1.11 0.113 -0.181
5 California -2.50 -1.53 0.593 -0.339
6 Colorado -1.50 -0.978 1.08 0.00145
7 Connecticut 1.34 -1.08 -0.637 -0.117
8 Delaware -0.0472 -0.322 -0.711 -0.873
9 Florida -2.98 0.0388 -0.571 -0.0953
10 Georgia -1.62 1.27 -0.339 1.07
# ... with 40 more rows
tidy(pca_rec, id = "pca", type = "coef")
# A tibble: 16 x 4
terms value component id
<chr> <dbl> <chr> <chr>
1 Murder -0.536 PC1 pca
2 Assault -0.583 PC1 pca
3 UrbanPop -0.278 PC1 pca
4 Rape -0.543 PC1 pca
5 Murder 0.418 PC2 pca
6 Assault 0.188 PC2 pca
7 UrbanPop -0.873 PC2 pca
8 Rape -0.167 PC2 pca
9 Murder -0.341 PC3 pca
10 Assault -0.268 PC3 pca
11 UrbanPop -0.378 PC3 pca
12 Rape 0.818 PC3 pca
13 Murder 0.649 PC4 pca
14 Assault -0.743 PC4 pca
15 UrbanPop 0.134 PC4 pca
16 Rape 0.0890 PC4 pca
tidy(pca_rec, id = "pca", type = "variance")
# A tibble: 16 x 4
terms value component id
<chr> <dbl> <int> <chr>
1 variance 2.48 1 pca
2 variance 0.990 2 pca
3 variance 0.357 3 pca
4 variance 0.173 4 pca
5 cumulative variance 2.48 1 pca
6 cumulative variance 3.47 2 pca
7 cumulative variance 3.83 3 pca
8 cumulative variance 4 4 pca
9 percent variance 62.0 1 pca
10 percent variance 24.7 2 pca
11 percent variance 8.91 3 pca
12 percent variance 4.34 4 pca
13 cumulative percent variance 62.0 1 pca
14 cumulative percent variance 86.8 2 pca
15 cumulative percent variance 95.7 3 pca
16 cumulative percent variance 100 4 pca
pca_rec <-
recipe(~ ., data = usa) %>%
step_normalize(all_numeric_predictors()) %>%
step_pca(all_numeric_predictors(), num_comp = 3) %>%
prep() %>%
juice() %>%
print()
# A tibble: 50 x 4
state PC1 PC2 PC3
<fct> <dbl> <dbl> <dbl>
1 Alabama -0.976 1.12 -0.440
2 Alaska -1.93 1.06 2.02
3 Arizona -1.75 -0.738 0.0542
4 Arkansas 0.140 1.11 0.113
5 California -2.50 -1.53 0.593
6 Colorado -1.50 -0.978 1.08
7 Connecticut 1.34 -1.08 -0.637
8 Delaware -0.0472 -0.322 -0.711
9 Florida -2.98 0.0388 -0.571
10 Georgia -1.62 1.27 -0.339
# ... with 40 more rows
pca_rec <-
recipe(~ ., data = usa) %>%
step_normalize(all_numeric_predictors()) %>%
step_pca(all_numeric_predictors(), threshold = .7) %>%
prep() %>%
juice() %>%
print()
# A tibble: 50 x 3
state PC1 PC2
<fct> <dbl> <dbl>
1 Alabama -0.976 1.12
2 Alaska -1.93 1.06
3 Arizona -1.75 -0.738
4 Arkansas 0.140 1.11
5 California -2.50 -1.53
6 Colorado -1.50 -0.978
7 Connecticut 1.34 -1.08
8 Delaware -0.0472 -0.322
9 Florida -2.98 0.0388
10 Georgia -1.62 1.27
# ... with 40 more rows
set.seed(2)
x_df <- tibble(
V1 = rnorm(n = 50, mean = rep(c(0, 3), each = 25)),
V2 = rnorm(n = 50, mean = rep(c(0, -4), each = 25))
)
x_df %>%
ggplot(aes(V1, V2, color = rep(c("A", "B"), each = 25))) +
geom_point()
set.seed(1234)
(res_kmeans <- kmeans(x_df, centers = 3, nstart = 20))
K-means clustering with 3 clusters of sizes 11, 23, 16
Cluster means:
V1 V2
1 2.5355362 -2.48605364
2 0.2339095 0.04414551
3 2.8241300 -5.01221675
Clustering vector:
[1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 1 2 2 2 2 3 1 1 1 3 1 3 3 3 3 1 3 3
[39] 3 1 1 1 3 3 3 3 1 3 3 3
Within cluster sum of squares by cluster:
[1] 14.56698 54.84869 26.98215
(between_SS / total_SS = 76.8 %)
Available components:
[1] "cluster" "centers" "totss" "withinss" "tot.withinss"
[6] "betweenss" "size" "iter" "ifault"
tidy(res_kmeans)
# A tibble: 3 x 5
V1 V2 size withinss cluster
<dbl> <dbl> <int> <dbl> <fct>
1 2.54 -2.49 11 14.6 1
2 0.234 0.0441 23 54.8 2
3 2.82 -5.01 16 27.0 3
augment(res_kmeans, data = x_df)
# A tibble: 50 x 3
V1 V2 .cluster
<dbl> <dbl> <fct>
1 -0.897 -0.838 2
2 0.185 2.07 2
3 1.59 -0.562 2
4 -1.13 1.28 2
5 -0.0803 -1.05 2
6 0.132 -1.97 2
7 0.708 -0.323 2
8 -0.240 0.936 2
9 1.98 1.14 2
10 -0.139 1.67 2
# ... with 40 more rows
augment(res_kmeans, data = x_df) %>%
ggplot(aes(V1, V2, color = .cluster)) +
geom_point()
res_hclust_complete <- x_df %>%
dist() %>%
hclust(method = "complete")
res_hclust_average <- x_df %>%
dist() %>%
hclust(method = "average")
res_hclust_single <- x_df %>%
dist() %>%
hclust(method = "single")
fviz_dend(res_hclust_complete, main = "Complete", k = 2)
fviz_dend(res_hclust_average, main = "average", k = 2)
fviz_dend(res_hclust_single, main = "single", k = 2)
If we don’t know the importance of the different predictors in data set it could be beneficial to scale the data such that each variable has the same influence. We can perform scaling by using scale() before dist().
x_df %>%
scale() %>%
dist() %>%
hclust(method = "complete") %>%
fviz_dend(k = 2)
data(NCI60, package = "ISLR")
nci60 <- NCI60$data %>%
as_tibble() %>%
mutate(label = factor(NCI60$labs)) %>%
relocate(label)
nci60_pca <- nci60 %>%
select(-label) %>%
prcomp(scale = TRUE)
nci60_pcs <- bind_cols(
augment(nci60_pca),
nci60 %>% select(label)
)
colors <- unname(palette.colors(n = 14, palette = "Polychrome 36"))
nci60_pcs %>%
ggplot(aes(.fittedPC1, .fittedPC2, color = label)) +
geom_point() +
theme_minimal() +
scale_color_manual(values = colors)
nci60_pcs %>%
ggplot(aes(.fittedPC1, .fittedPC3, color = label)) +
geom_point() +
theme_minimal() +
scale_color_manual(values = colors)
tidy(nci60_pca, matrix = "eigenvalues") %>%
ggplot(aes(PC, percent)) +
geom_point() +
geom_line()
tidy(nci60_pca, matrix = "eigenvalues") %>%
ggplot(aes(PC, cumulative)) +
geom_point() +
geom_line()
nci60_scaled <- recipe(~ ., data = nci60) %>%
step_rm(label) %>%
step_normalize(all_predictors()) %>%
prep() %>%
bake(new_data = NULL)
nci60_complete <- nci60_scaled %>%
dist() %>%
hclust(method = "complete")
nci60_average <- nci60_scaled %>%
dist() %>%
hclust(method = "average")
nci60_single <- nci60_scaled %>%
dist() %>%
hclust(method = "single")
fviz_dend(nci60_complete, main = "Complete")
fviz_dend(nci60_average, main = "Average")
fviz_dend(nci60_single, main = "Single")
nci60_complete %>%
fviz_dend(k = 4, main = "hclust(complete) on nci60")
Sometimes useful is to perform dimensionality reduction before using the clustering method. Let us use the recipes package to calculate the PCA of nci60 and keep the 5 first components. (we could have started with nci60 too if we added step_rm() and step_normalize()).
nci60_pca <- recipe(~., nci60_scaled) %>%
step_pca(all_predictors(), num_comp = 5) %>%
prep() %>%
bake(new_data = NULL)
nci60_pca %>%
dist() %>%
hclust() %>%
fviz_dend(k = 4, main = "hclust on first five PCs")
An Introduction to Statistcial Learning
– END
sessionInfo()
R version 4.1.0 (2021-05-18)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19043)
Matrix products: default
locale:
[1] LC_COLLATE=Spanish_Mexico.1252 LC_CTYPE=Spanish_Mexico.1252
[3] LC_MONETARY=Spanish_Mexico.1252 LC_NUMERIC=C
[5] LC_TIME=Spanish_Mexico.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] factoextra_1.0.7 patchwork_1.1.1 ISLR_1.2 yardstick_0.0.8
[5] workflowsets_0.0.2 workflows_0.2.3 tune_0.1.6 tidyr_1.1.3
[9] tibble_3.1.3 rsample_0.1.0 recipes_0.1.16 purrr_0.3.4
[13] parsnip_0.1.7 modeldata_0.1.1 infer_0.5.4 ggplot2_3.3.5
[17] dplyr_1.0.7 dials_0.0.9 scales_1.1.1 broom_0.7.8
[21] tidymodels_0.1.3
loaded via a namespace (and not attached):
[1] colorspace_2.0-2 ggsignif_0.6.2 ellipsis_0.3.2 class_7.3-19
[5] rio_0.5.27 rstudioapi_0.13 ggpubr_0.4.0 listenv_0.8.0
[9] furrr_0.2.3 farver_2.1.0 ggrepel_0.9.1 prodlim_2019.11.13
[13] fansi_0.5.0 lubridate_1.7.10 codetools_0.2-18 splines_4.1.0
[17] knitr_1.33 jsonlite_1.7.2 pROC_1.17.0.1 compiler_4.1.0
[21] backports_1.2.1 assertthat_0.2.1 Matrix_1.3-3 cli_3.0.0
[25] htmltools_0.5.1.1 tools_4.1.0 gtable_0.3.0 glue_1.4.2
[29] Rcpp_1.0.7 carData_3.0-4 cellranger_1.1.0 jquerylib_0.1.4
[33] DiceDesign_1.9 vctrs_0.3.8 iterators_1.0.13 timeDate_3043.102
[37] gower_0.2.2 xfun_0.24 stringr_1.4.0 globals_0.14.0
[41] openxlsx_4.2.4 lifecycle_1.0.0 rstatix_0.7.0 dendextend_1.15.1
[45] future_1.21.0 MASS_7.3-54 ipred_0.9-11 hms_1.1.0
[49] parallel_4.1.0 yaml_2.2.1 curl_4.3.2 gridExtra_2.3
[53] sass_0.4.0 rpart_4.1-15 stringi_1.6.2 highr_0.9
[57] foreach_1.5.1 lhs_1.1.1 hardhat_0.1.6 zip_2.2.0
[61] lava_1.6.9 rlang_0.4.11 pkgconfig_2.0.3 evaluate_0.14
[65] lattice_0.20-44 labeling_0.4.2 tidyselect_1.1.1 parallelly_1.26.1
[69] plyr_1.8.6 magrittr_2.0.1 R6_2.5.0 generics_0.1.0
[73] DBI_1.1.1 pillar_1.6.2 haven_2.4.1 foreign_0.8-81
[77] withr_2.4.2 abind_1.4-5 survival_3.2-11 nnet_7.3-16
[81] crayon_1.4.1 car_3.0-11 utf8_1.2.2 rmarkdown_2.9
[85] viridis_0.6.1 grid_4.1.0 readxl_1.3.1 data.table_1.14.0
[89] forcats_0.5.1 digest_0.6.27 GPfit_1.0-8 munsell_0.5.0
[93] viridisLite_0.4.0 bslib_0.2.5.1