Consider the following 6 eigenvalues from a \(6\times 6\) correlation matrix:
\[\lambda_1 = 3.5, \lambda_2 = 1.0, \lambda_3 = 0.7, \lambda_4 = 0.4, \lambda_5 = 0.25, \lambda_6 = 0.15\]
If you want to retain enough principal components to explain at least 90% of the variability inherent in the data set, how many should you keep?
(3.5 + 1 + .7 + .4)/6
[1] 0.9333333
The first 3 eigenvalues explain only about 87% of the variability ((3.5 + 1.0 + 0.7)/6 ≈ 0.867), so to explain at least 90% of the variability we need to keep the first 4 eigenvalues, which explain about 93%.
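As a quick check, the cumulative proportions for all six components can be computed directly in R:

# Eigenvalues of the 6x6 correlation matrix; for a correlation matrix they sum to 6
lambda <- c(3.5, 1.0, 0.7, 0.4, 0.25, 0.15)
# Cumulative proportion of variance explained by the first k components
cumsum(lambda) / sum(lambda)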
The iris data set is a classic data set often used to demonstrate PCA. Each iris in the data set has measurements of its sepal length, sepal width, petal length, and petal width. Consider the five irises below, following mean-centering and scaling:
library(tidyverse)
five_irises <- data.frame(
row.names = 1:5,
Sepal.Length = c(0.189, 0.551, -0.415, 0.310, -0.898),
Sepal.Width = c(-1.97, 0.786, 2.62, -0.590, 1.70),
Petal.Length = c(0.137, 1.04, -1.34, 0.534, -1.05),
Petal.Width = c(-0.262, 1.58, -1.31, 0.000875, -1.05)
) %>% as.matrix()

Consider also the loadings for the first two principal components:
# Create the loadings matrix
pc_loadings <- data.frame(
PC1 = c(0.5210659, -0.2693474, 0.5804131, 0.5648565),
PC2 = c(-0.37741762, -0.92329566, -0.02449161, -0.06694199),
row.names = c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")
) %>% as.matrix()

A plot of the first two PC scores for these five irises is shown in the plot below.
Match the ID of each iris (1-5) to the correct letter of its score coordinates on the plot.
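One way to work out the matching is to compute the score coordinates directly, since the PC scores are just the (centered, scaled) data matrix multiplied by the loading matrix; a minimal sketch using the objects defined above:

# Scores for the five irises on PC1 and PC2: data %*% loadings
iris_scores <- five_irises %*% pc_loadings
round(iris_scores, 2)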
These data are taken from the Places Rated Almanac, by Richard Boyer and David Savageau, copyrighted and published by Rand McNally. The nine rating criteria used by Places Rated Almanac are: Climate, Housing, HlthCare (health care), Crime, Transp (transportation), Educ (education), Arts, Recreat (recreation), and Econ (economics).
For all but two of the above criteria, the higher the score, the better; for Housing and Crime, the lower the score, the better. The scores are computed from several component statistics for each criterion (see the Places Rated Almanac for details).
In addition to these, latitude, longitude, population, and state are also given, but they should not be included in the PCA.
Use PCA to identify the major components of variation in the ratings among cities.
places <- read.csv('Data/Places.csv')
head(places)
                       City Climate Housing HlthCare Crime Transp Educ Arts
1 AbileneTX 521 6200 237 923 4031 2757 996
2 AkronOH 575 8138 1656 886 4883 2438 5564
3 AlbanyGA 468 7339 618 970 2531 2560 237
4 Albany-Schenectady-TroyNY 476 7908 1431 610 6883 3399 4655
5 AlbuquerqueNM 659 8393 1853 1483 6558 3026 4496
6 AlexandriaLA 520 5819 640 727 2444 2972 334
Recreat Econ Long Lat Pop
1 1405 7633 -99.6890 32.5590 110932
2 2632 4350 -81.5180 41.0850 660328
3 859 5250 -84.1580 31.5750 112402
4 1617 5864 -73.7983 42.7327 835880
5 2612 5727 -106.6500 35.0830 419700
6 1018 5254 -92.4530 31.3020 135282
If you want to explore this data set in lower dimensional space using the first \(k\) principal components, how many would you use, and what percent of the total variability would these retained PCs explain? Use a scree plot to help you answer this question.
# Keep only the nine rating criteria and standardize them
places_data <- scale(places[2:10])
pca_places <- prcomp(places_data, scale. = TRUE)  # scale. = TRUE is redundant after scale(), but harmless
library(factoextra)
summary(pca_places)
Importance of components:
PC1 PC2 PC3 PC4 PC5 PC6 PC7
Standard deviation 1.8462 1.1018 1.0684 0.9596 0.8679 0.79408 0.70217
Proportion of Variance 0.3787 0.1349 0.1268 0.1023 0.0837 0.07006 0.05478
Cumulative Proportion 0.3787 0.5136 0.6404 0.7427 0.8264 0.89650 0.95128
PC8 PC9
Standard deviation 0.56395 0.34699
Proportion of Variance 0.03534 0.01338
Cumulative Proportion 0.98662 1.00000
fviz_eig(pca_places)

I would use just the first two PCs, because the scree plot shows the elbow at PC 2: the second PC does not explain much more variance than the third, and the third not much more than the fourth, and so on. Given these diminishing returns, the first PC is clearly the most important, but I would also keep the second PC because it is easy to visualize things in two dimensions. Together these two PCs explain about 51% of the total variability.
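As a numerical check on the roughly 51% figure, the cumulative proportion of variance can also be computed directly from the standard deviations returned by prcomp:

# Cumulative proportion of variance explained by the first k PCs
cumsum(pca_places$sdev^2) / sum(pca_places$sdev^2)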
Interpret the retained principal components by examining the loadings (plot(s) of the loadings may be helpful). Which variables will be used to separate cities along the first and second principal axes, and how? Make sure to discuss the signs of the loadings, not just their contributions!
fviz_pca_var(pca_places, axes = c(1, 2))

Transportation, Arts, and Healthcare are the main variables determining the first PC, since their vectors lie closest to the x-axis; their contributions are all positive. Econ and Education are the main variables determining the second PC, since their vectors lie closest to the y-axis; the contribution from Econ is positive while that from Education is negative.
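Printing the loadings themselves makes it easier to confirm the signs read off the variable plot; a small addition to the output above:

# Loadings (rotation matrix) for the first two PCs
round(pca_places$rotation[, 1:2], 3)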
Add the first two PC scores to the places data set. Create a biplot of the first 2 PCs, using repelled labeling to identify the cities. Which are the outlying cities and what characteristics make them unique?
library(tidyverse)
places <- places %>%
  mutate(PC1 = pca_places$x[, 1],
         PC2 = pca_places$x[, 2])
library(ggrepel)
fviz_pca(pca_places, axes = 1:2, label = 'var') +
  geom_text_repel(data = places, aes(PC1, PC2, label = City), max.overlaps = 10)
Warning: ggrepel: 314 unlabeled data points (too many overlaps). Consider
increasing max.overlaps
New York, Chicago, and Philadelphia stand out for having high Transportation, Arts, Healthcare, and Education scores. Los Angeles-Long Beach and San Francisco are unusual for combining high values on both the positively and negatively loading variables, excluding Education and Econ. Las Vegas and Atlantic City stand out for high levels of Econ, Crime, and Recreation.
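Rather than reading the crowded biplot, the outlying cities can also be pulled out by sorting on the PC scores added above; a short sketch for PC1 (the same idea works for PC2):

# Cities with the most extreme first-PC scores (largest magnitude)
places %>%
  arrange(desc(abs(PC1))) %>%
  select(City, PC1, PC2) %>%
  head()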
The data we will look at here come from a study of malignant and benign breast cancer cells using fine needle aspiration (FNA), conducted at the University of Wisconsin-Madison. The goal was to determine whether the malignancy of a tumor could be established using shape characteristics of cells obtained via FNA and digitized scanning of the cells.
The variables in the data file are Diagnosis (malignant or benign) and eight cell-shape measurements: Radius, Texture, Smoothness, Compactness, Concavity, ConcavePts, Symmetry, and FracDim.
bc_cells <- read.csv('Data/BreastDiag.csv')
head(bc_cells)
  Diagnosis Radius Texture Smoothness Compactness Concavity ConcavePts Symmetry
1 M 17.99 10.38 0.11840 0.27760 0.3001 0.14710 0.2419
2 M 20.57 17.77 0.08474 0.07864 0.0869 0.07017 0.1812
3 M 19.69 21.25 0.10960 0.15990 0.1974 0.12790 0.2069
4 M 11.42 20.38 0.14250 0.28390 0.2414 0.10520 0.2597
5 M 20.29 14.34 0.10030 0.13280 0.1980 0.10430 0.1809
6 M 12.45 15.70 0.12780 0.17000 0.1578 0.08089 0.2087
FracDim
1 0.07871
2 0.05667
3 0.05999
4 0.09744
5 0.05883
6 0.07613
My analysis suggests 3 PCs should be retained. Support or refute this suggestion. What percent of variability is explained by the first 3 PCs?
cells_pca <- prcomp(bc_cells[-1], scale. = TRUE)  # drop Diagnosis, standardize the measurements
fviz_eig(cells_pca)
summary(cells_pca)
Importance of components:
PC1 PC2 PC3 PC4 PC5 PC6 PC7
Standard deviation 2.0705 1.3504 0.9087 0.70614 0.61016 0.30355 0.2623
Proportion of Variance 0.5359 0.2279 0.1032 0.06233 0.04654 0.01152 0.0086
Cumulative Proportion 0.5359 0.7638 0.8670 0.92937 0.97591 0.98743 0.9960
PC8
Standard deviation 0.17837
Proportion of Variance 0.00398
Cumulative Proportion 1.00000
About 87% of the variability is explained by the first 3 PCs. I think using 3 PCs is appropriate: the scree plot shows the elbow at the 6th PC, which would suggest keeping the first 5 PCs before the returns become negligible, but 5 PCs are difficult to visualize, so choosing the first 3 seems the most appropriate for a visual analysis.
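The 87% figure is the cumulative proportion in the PC3 column of the summary above; it can also be extracted programmatically:

# Cumulative proportion of variance explained through PC3
summary(cells_pca)$importance[3, 3]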
Interpret the first 3 principal components by examining the eigenvectors/loadings. Discuss.
cells_pca$rotation[,1:3]
                   PC1         PC2         PC3
Radius -0.3003952 0.52850910 0.27751200
Texture -0.1432175 0.35378530 -0.89839046
Smoothness -0.3482386 -0.32661945 0.12684205
Compactness -0.4584098 -0.07219238 -0.02956419
Concavity -0.4508935 0.12707085 0.04245883
ConcavePts -0.4459288 0.22823091 0.17458320
Symmetry -0.3240333 -0.28112508 -0.08456832
FracDim -0.2251375 -0.57996072 -0.24389523
The first PC gives the most weight to Compactness, Concavity, and ConcavePts, with all of the loadings being negative; it gives Texture the least weight. PC 2 gives the most weight to Radius (positive) and FracDim (negative); it gives the least weight to Compactness and Concavity, and its loadings are roughly half positive and half negative. PC 3 gives the most weight to Texture (negative) and gives the remaining variables very small weights.
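A bar chart of the loadings can make these sign patterns easier to compare across the three PCs; a minimal sketch with the tidyverse (the long-format data frame here is just for plotting):

# Reshape the loadings into long format and plot them as grouped bars
loadings_long <- as.data.frame(cells_pca$rotation[, 1:3]) %>%
  rownames_to_column("Variable") %>%
  pivot_longer(PC1:PC3, names_to = "PC", values_to = "Loading")
ggplot(loadings_long, aes(x = Variable, y = Loading, fill = PC)) +
  geom_col(position = "dodge") +
  coord_flip()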
Examine a biplot of the first two PCs. Incorporate the third PC by sizing the points by this variable. (Hint: use fviz_pca to set up a biplot, but set col.ind='white'. Then use geom_point() to maintain full control over the point mapping.) Color-code by whether the cells are benign or malignant. Answer the following:
points <- data.frame(cells_pca$x[, 1:3]) %>% mutate(Diagnosis = bc_cells$Diagnosis)
fviz_pca(cells_pca, axes = 1:2, col.ind = 'white') +
  geom_point(data = points, aes(x = PC1, y = PC2, size = PC3, color = Diagnosis)) +
  scale_size(range = c(1, 4))

It seems that the most important characteristics are Texture, Radius, ConcavePts, and Concavity. Of the 3 PCs, PC 1 does the best job of differentiating between malignant and benign cells, because it gives high weight to Concavity, ConcavePts, and Radius. PC 2 gives high weight to Radius and Texture but little to Concavity and ConcavePts, and PC 3 gives high weight to Texture but little to the rest of the important variables. That is why PC 1 separates the groups best.
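The claim that PC 1 separates the diagnoses best can be checked by comparing the score distributions by diagnosis; a short sketch using the points data frame built above:

# Boxplots of each PC's scores, split by diagnosis
points %>%
  pivot_longer(PC1:PC3, names_to = "PC", values_to = "Score") %>%
  ggplot(aes(x = Diagnosis, y = Score)) +
  geom_boxplot() +
  facet_wrap(~ PC)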