(3.5/6) + (1.0/6) + (0.7/6) + (0.4/6)[1] 0.9333333
SUBMISSION INSTRUCTIONS
Consider the following 6 eigenvalues from a \(6\times 6\) correlation matrix:
\[\lambda_1 = 3.5, \lambda_2 = 1.0, \lambda_3 = 0.7, \lambda_4 = 0.4, \lambda_5 = 0.25, \lambda_6 = 0.15\]
If you want to retain enough principal components to explain at least 90% of the variability inherent in the data set, how many should you keep?
(3.5/6) + (1.0/6) + (0.7/6) + (0.4/6)[1] 0.9333333
You need at least the first four principal components to explain at least 90% of the variability in the data set
The iris data set is a classic data set often used to demonstrate PCA. Each iris in the data set contained a measurement of its sepal length, sepal width, petal length, and petal width. Consider the five irises below, following mean-centering and scaling:
library(tidyverse)
five_irises <- data.frame(
row.names = 1:5,
Sepal.Length = c(0.189, 0.551, -0.415, 0.310, -0.898),
Sepal.Width = c(-1.97, 0.786, 2.62, -0.590, 1.70),
Petal.Length = c(0.137, 1.04, -1.34, 0.534, -1.05),
Petal.Width = c(-0.262, 1.58, -1.31, 0.000875, -1.05)
) %>% as.matrixConsider also the loadings for the first two principal components:
# Create the data frame
pc_loadings <- data.frame(
PC1 = c(0.5210659, -0.2693474, 0.5804131, 0.5648565),
PC2 = c(-0.37741762, -0.92329566, -0.02449161, -0.06694199),
row.names = c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")
) %>% as.matrixA plot of the first two PC scores for these five irises is shown in the plot below.
Match the ID of each iris (1-5) to the correct letter of its score coordinates on the plot.
Scores <- five_irises %*% pc_loadings
Scores PC1 PC2
1 0.5606200 1.7617440
2 1.5715031 -1.0649071
3 -2.4396481 -2.1418936
4 0.6308802 0.4146079
5 -2.1283408 -1.1346763
A = 3
B = 1
C = 4
D = 2
E = 5
These data are taken from the Places Rated Almanac, by Richard Boyer and David Savageau, copyrighted and published by Rand McNally. The nine rating criteria used by Places Rated Almanac are:
For all but two of the above criteria, the higher the score, the better. For Housing and Crime, the lower the score the better. The scores are computed using the following component statistics for each criterion (see the Places Rated Almanac for details):
In addition to these, latitude and longitude, population and state are also given, but should not be included in the PCA.
Use PCA to identify the major components of variation in the ratings among cities.
places <- read.csv('Data/Places.csv')
head(places) City Climate Housing HlthCare Crime Transp Educ Arts
1 AbileneTX 521 6200 237 923 4031 2757 996
2 AkronOH 575 8138 1656 886 4883 2438 5564
3 AlbanyGA 468 7339 618 970 2531 2560 237
4 Albany-Schenectady-TroyNY 476 7908 1431 610 6883 3399 4655
5 AlbuquerqueNM 659 8393 1853 1483 6558 3026 4496
6 AlexandriaLA 520 5819 640 727 2444 2972 334
Recreat Econ Long Lat Pop
1 1405 7633 -99.6890 32.5590 110932
2 2632 4350 -81.5180 41.0850 660328
3 859 5250 -84.1580 31.5750 112402
4 1617 5864 -73.7983 42.7327 835880
5 2612 5727 -106.6500 35.0830 419700
6 1018 5254 -92.4530 31.3020 135282
If you want to explore this data set in lower dimensional space using the first \(k\) principal components, how many would you use, and what percent of the total variability would these retained PCs explain? Use a scree plot to help you answer this question.
library(tidyverse)
library(factoextra)Warning: package 'factoextra' was built under R version 4.5.2
Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
places_numeric <- places %>% select(-c('City','Long','Lat','Pop'))
places_pca <- prcomp(places_numeric, scale. = TRUE)fviz_eig(places_pca, geom='line') +
labs(title='Scree plot, % variability explained')+
theme_classic(base_size = 16) summary(places_pca)Importance of components:
PC1 PC2 PC3 PC4 PC5 PC6 PC7
Standard deviation 1.8462 1.1018 1.0684 0.9596 0.8679 0.79408 0.70217
Proportion of Variance 0.3787 0.1349 0.1268 0.1023 0.0837 0.07006 0.05478
Cumulative Proportion 0.3787 0.5136 0.6404 0.7427 0.8264 0.89650 0.95128
PC8 PC9
Standard deviation 0.56395 0.34699
Proportion of Variance 0.03534 0.01338
Cumulative Proportion 0.98662 1.00000
Looking at the scree plot it seems like the first two principal components are all we would need for this data set. These first two PC’s explain 51.36% of the total variability in this data set.
Interpret the retained principal components by examining the loadings (plot(s) of the loadings may be helpful). Which variables will be used to separate cities along the first and second principal axes, and how? Make sure to discuss the signs of the loadings, not just their contributions!
fviz_contrib(places_pca, choice = 'var', axes = 1) +
theme_classic(base_size = 16) +
labs(x = 'Variable',
title = 'Contribution to first dimension')sort(places_pca$rotation[,1], decreasing=TRUE) Arts HlthCare Housing Transp Recreat Crime Educ Climate
0.4630545 0.4602146 0.3565216 0.3511508 0.3278879 0.2812984 0.2752926 0.2064140
Econ
0.1354123
fviz_contrib(places_pca, choice = 'var', axes = 2) +
theme_classic(base_size = 16) +
labs(x = 'Variable',
title = 'Contribution to second dimension')sort(places_pca$rotation[,2], decreasing=TRUE) Econ Recreat Crime Housing Climate Transp Arts
0.4712833 0.3844746 0.3553423 0.2506240 0.2178353 -0.1796045 -0.1947899
HlthCare Educ
-0.2994653 -0.4833821
The variables that will be used to separate the cities along the first principal axis is Arts, Health Care, Housing, and Transpiration. The loading for all four of these variable are positive so Cities that have higher values in these will all be towards the right of the graph and low amounts will be on the left side.
The variables that will be used to separate the cities along the second principal axis is Education, Economy, Recreation, and Crime. The loading for all but Education are positive, so if cities have large amounts of Economy, Recreation, and Crime and a low amount in Education will have higher amounts of PC2 but if they have the opposite they will have low amounts of PC2.
Add the first two PC scores to the places data set. Create a biplot of the first 2 PCs, using repelled labeling to identify the cities. Which are the outlying cities and what characteristics make them unique?
places_with_scores <- places %>%
mutate(PC1 = places_pca$x[,1],
PC2 = places_pca$x[,2])
head(places_with_scores) City Climate Housing HlthCare Crime Transp Educ Arts
1 AbileneTX 521 6200 237 923 4031 2757 996
2 AkronOH 575 8138 1656 886 4883 2438 5564
3 AlbanyGA 468 7339 618 970 2531 2560 237
4 Albany-Schenectady-TroyNY 476 7908 1431 610 6883 3399 4655
5 AlbuquerqueNM 659 8393 1853 1483 6558 3026 4496
6 AlexandriaLA 520 5819 640 727 2444 2972 334
Recreat Econ Long Lat Pop PC1 PC2
1 1405 7633 -99.6890 32.5590 110932 -1.0401799 0.89376897
2 2632 4350 -81.5180 41.0850 660328 0.4398136 0.07506618
3 859 5250 -84.1580 31.5750 112402 -1.8755393 0.06979169
4 1617 5864 -73.7983 42.7327 835880 0.9107414 -1.81758215
5 2612 5727 -106.6500 35.0830 419700 2.1492475 0.32885808
6 1018 5254 -92.4530 31.3020 135282 -1.7879611 -0.78120167
fviz_pca_var(places_pca, axes= c(1,2))Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.
ℹ The deprecated feature was likely used in the ggpubr package.
Please report the issue at <https://github.com/kassambara/ggpubr/issues>.
Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
ℹ Please use tidy evaluation idioms with `aes()`.
ℹ See also `vignette("ggplot2-in-packages")` for more information.
ℹ The deprecated feature was likely used in the factoextra package.
Please report the issue at <https://github.com/kassambara/factoextra/issues>.
fviz_pca_biplot(places_pca,
geom.ind = "point",
label = "var",
repel = TRUE,
addEllipses = FALSE) +
geom_text(data = places_with_scores,
aes(x = PC1, y = PC2, label = City),
color = "black", size = 3, vjust = -0.5)outlying_cities <- places_with_scores %>%
filter(City %in% c("San-FranciscoCA", "Los_AngelesLong-BeachCA", "New-YorkNY", "PhiladelphiaPA-NJ", "ChicagoIL", "BostonMA", "WashingtonDC-MD-VA"))
outlying_cities City Climate Housing HlthCare Crime Transp Educ Arts Recreat
1 BostonMA 623 11609 5301 1215 6801 3479 21042 3066
2 ChicagoIL 514 10913 5766 1034 7742 3486 24846 2856
3 New-YorkNY 638 13358 7850 2498 8625 2984 56745 3579
4 PhiladelphiaPA-NJ 630 8310 5158 1059 5903 3781 17270 1979
5 San-FranciscoCA 910 17158 3726 1619 8299 3371 14226 4600
6 WashingtonDC-MD-VA 631 13724 4361 1317 8236 3635 21701 1578
Econ Long Lat Pop PC1 PC2
1 6363 -71.058 42.362 2805911 6.301057 -1.608742
2 5205 -87.625 41.883 6060387 6.464912 -3.087134
3 5338 -73.880 40.849 8274961 12.426251 -2.061746
4 5638 -75.163 39.950 4716818 4.765370 -3.073287
5 6063 -122.417 37.775 1488871 7.391403 1.227000
6 6072 -77.033 38.892 3250822 6.186471 -2.264874
The plot is very busy but when you look at the outside you can see a couple outlines like, San Francisco, Los-Angeles Long Beach, New York, Boston, Washington DC, Philadelphia, and Chicago. These places all have large amounts of Education, Health Care, Arts, and Transportation.
The data we will look at here come from a study of malignant and benign breast cancer cells using fine needle aspiration conducted at the University of Wisconsin-Madison. The goal was determine if malignancy of a tumor could be established by using shape characteristics of cells obtained via fine needle aspiration (FNA) and digitized scanning of the cells.
The variables in the data file you will be using are:
bc_cells <- read.csv('Data/BreastDiag.csv')
head(bc_cells) Diagnosis Radius Texture Smoothness Compactness Concavity ConcavePts Symmetry
1 M 17.99 10.38 0.11840 0.27760 0.3001 0.14710 0.2419
2 M 20.57 17.77 0.08474 0.07864 0.0869 0.07017 0.1812
3 M 19.69 21.25 0.10960 0.15990 0.1974 0.12790 0.2069
4 M 11.42 20.38 0.14250 0.28390 0.2414 0.10520 0.2597
5 M 20.29 14.34 0.10030 0.13280 0.1980 0.10430 0.1809
6 M 12.45 15.70 0.12780 0.17000 0.1578 0.08089 0.2087
FracDim
1 0.07871
2 0.05667
3 0.05999
4 0.09744
5 0.05883
6 0.07613
My analysis suggests 3 PCs should be retained. Support or refute this suggestion. What percent of variability is explained by the first 3 PCs?
cells_numeric <- bc_cells %>% select(-'Diagnosis')
cells_pca <- prcomp(cells_numeric, scale. = TRUE)
fviz_eig(cells_pca, geom='line') +
labs(title='Scree plot, % variability explained')+
theme_classic(base_size = 16) summary(cells_pca)Importance of components:
PC1 PC2 PC3 PC4 PC5 PC6 PC7
Standard deviation 2.0705 1.3504 0.9087 0.70614 0.61016 0.30355 0.2623
Proportion of Variance 0.5359 0.2279 0.1032 0.06233 0.04654 0.01152 0.0086
Cumulative Proportion 0.5359 0.7638 0.8670 0.92937 0.97591 0.98743 0.9960
PC8
Standard deviation 0.17837
Proportion of Variance 0.00398
Cumulative Proportion 1.00000
I would have to agree that at least the first 3 PCs should be retained possibly even including PC 4 as well. With just the first 3 PCs it explains 86.7% of the total variability within this data set.
Interpret the first 3 principal components by examining the eigenvectors/loadings. Discuss.
fviz_contrib(cells_pca, choice = 'var', axes = 1) +
theme_classic(base_size = 16) +
labs(x = 'Variable',
title = 'Contribution to first dimension')sort(cells_pca$rotation[,1], decreasing=TRUE) Texture FracDim Radius Symmetry Smoothness ConcavePts
-0.1432175 -0.2251375 -0.3003952 -0.3240333 -0.3482386 -0.4459288
Concavity Compactness
-0.4508935 -0.4584098
fviz_contrib(cells_pca, choice = 'var', axes = 2) +
theme_classic(base_size = 16) +
labs(x = 'Variable',
title = 'Contribution to second dimension')sort(cells_pca$rotation[,2], decreasing=TRUE) Radius Texture ConcavePts Concavity Compactness Symmetry
0.52850910 0.35378530 0.22823091 0.12707085 -0.07219238 -0.28112508
Smoothness FracDim
-0.32661945 -0.57996072
fviz_contrib(cells_pca, choice = 'var', axes = 3) +
theme_classic(base_size = 16) +
labs(x = 'Variable',
title = 'Contribution to thrid dimension')sort(cells_pca$rotation[,3], decreasing=TRUE) Radius ConcavePts Smoothness Concavity Compactness Symmetry
0.27751200 0.17458320 0.12684205 0.04245883 -0.02956419 -0.08456832
FracDim Texture
-0.24389523 -0.89839046
For PC1 the driving variables are Compactness, Concavity, and Concave Points. Each one of these loadings are negative values, so if subjects have large amounts of these variables they will have low PC1 values and if they have low amounts they will have high PC1 values.
For PC2 the driving variables are Fractal Dimension, Radius, and Texture. Both Radius and Texture have positive loadings where Fractal Dimensions is negative. So if subjects have high amounts of PC2 they will have to have large amounts of Radius and Texture and low amounts of Fractal Dimensions and to have low they would need the opposite.
For PC3 there is only one driving variable which is Texture. Texture is a negative value so in order to have high amount of PC3 subjects need to have low amounts of Texture and the opposite for low amounts of PC3.
Examine a biplot of the first two PCs. Incorporate the third PC by sizing the points by this variable. (Hint: use fviz_pca to set up a biplot, but set col.ind='white'. Then use geom_point() to maintain full control over the point mapping.) Color-code by whether the cells are benign or malignant. Answer the following:
p <- fviz_pca(cells_pca, geom = "blank", col.ind = "white", col.var = "black") +
geom_point(aes(x = cells_pca$x[,1],
y = cells_pca$x[,2],
color = bc_cells$Diagnosis,
size = abs(cells_pca$x[,3])),
alpha = 0.6) + # <-- transparency here (0 = invisible, 1 = opaque)
scale_color_manual(values = c("B" = "skyblue", "M" = "tomato")) +
labs(color = "Diagnosis", size = "|PC3|",
title = "PCA Biplot: PC1 vs PC2 (size = |PC3|)") +
theme_classic(base_size = 16)
pWith this biplot we can see that the characteristics that distinguish a malignant cell from benign cells are: Texture, Radius, Concave Points, and Concavity. If cells had large amounts of these features they would likely be a malignant cell. If the cells had low amounts of these they would be benign cells.
Of the PCs, it seems that PC1 does the best job at differentiating malignant cells from benign cells. This is because it appears that the two different cells fall majority on the left or right side on the PC1 = 0 line.