library(tidyverse)
five_irises <- data.frame(
row.names = 1:5,
Sepal.Length = c(0.189, 0.551, -0.415, 0.310, -0.898),
Sepal.Width = c(-1.97, 0.786, 2.62, -0.590, 1.70),
Petal.Length = c(0.137, 1.04, -1.34, 0.534, -1.05),
Petal.Width = c(-0.262, 1.58, -1.31, 0.000875, -1.05)
) %>% as.matrixActivity 3.3 - PCA implementation
SUBMISSION INSTRUCTIONS
- Render to html
- Publish your html to RPubs
- Submit a link to your published solutions
Problem 1
Consider the following 6 eigenvalues from a \(6\times 6\) correlation matrix:
\[\lambda_1 = 3.5, \lambda_2 = 1.0, \lambda_3 = 0.7, \lambda_4 = 0.4, \lambda_5 = 0.25, \lambda_6 = 0.15\]
If you want to retain enough principal components to explain at least 90% of the variability inherent in the data set, how many should you keep?
λ1=3.5/6 = 58.33%
λ1 + λ2=3.5+1.0=4.5 -> 4.5/6 =75%
λ1+λ2+λ3=4.5+0.7=5.2 -> 5.2/6 = 86.67%
λ1+λ2+λ3+λ4=5.2+0.4=5.6 -> 5.6/6= 93.33%
I will keep 4 PCs since they explain at least 90% of the total variability.
The iris data set is a classic data set often used to demonstrate PCA. Each iris in the data set contained a measurement of its sepal length, sepal width, petal length, and petal width. Consider the five irises below, following mean-centering and scaling:
Consider also the loadings for the first two principal components:
# Create the data frame
pc_loadings <- data.frame(
PC1 = c(0.5210659, -0.2693474, 0.5804131, 0.5648565),
PC2 = c(-0.37741762, -0.92329566, -0.02449161, -0.06694199),
row.names = c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")
) %>% as.matrixA plot of the first two PC scores for these five irises is shown in the plot below.
Match the ID of each iris (1-5) to the correct letter of its score coordinates on the plot.
PC1_score=(Sepal.Length).(0.521)+(Sepal.Width).(−0.269)+(Petal.Length).(0.580)+(Petal.Width).(0.565)
Iris1= (0.189 x0.521) +(-1.97 x −0.269) + (0.137 x 0.580) +(-0.262 x 0.565) = 0.561
Iris2=(0.551 x0.521) +(0.789 x −0.269) + (1.04 x 0.580) +(1.58 x 0.565)= 1.57
Iris3=(-0.415 x0.521) +(2.62 x −0.269) + (-1.34 x 0.580) +(-01.31 x 0.565)= -2.44
Iris4= (0.310 x0.521) +(-0.59 x −0.269) + (0.534 x 0.580) +(-0.0008x 0.565)= 0.631
Iris5= (-0.898 x0.521) +(1.7 x −0.269) + (-1.05 x 0.580) +(-1.05 x 0.565)= -2.128
PC2_score=(Sepal.Length)(−0.377)+(Sepal.Width)(−0.923)+(Petal.Length)(−0.024)+(Petal.Width)(−0.067)
Iris1= (0.189 x−0.377) +(-1.97 x −0.923) + (0.137 x −0.024) +(-0.262 x −0.067) = 1.762
Iris2=(0.551 x−0.377) +(0.789 x −0.923) + (1.04 x −0.024) +(1.58 x −0.067)= -1.065
Iris3=(-0.415 x−0.377) +(2.62 x −0.923) + (-1.34 x −0.024) +(-01.31 x −0.067)= -2.142
Iris4= (0.310 x−0.377) +(-0.59 x −0.923) + (0.534 x −0.024) +(-0.0008x −0.067)= 0.415
Iris5= (-0.898 x−0.377) +(1.7 x −0.923) + (-1.05 x −0.024) +(-1.05 x −0.067)= -1.135
Iris1 =B
Iris2 = D
Iris3 = A
iris4 = C
Iris5 = E
Problem 3
These data are taken from the Places Rated Almanac, by Richard Boyer and David Savageau, copyrighted and published by Rand McNally. The nine rating criteria used by Places Rated Almanac are:
- Climate & Terrain
- Housing
- Health Care & Environment
- Crime
- Transportation
- Education
- The Arts
- Recreation
- Economics
For all but two of the above criteria, the higher the score, the better. For Housing and Crime, the lower the score the better. The scores are computed using the following component statistics for each criterion (see the Places Rated Almanac for details):
- Climate & Terrain: very hot and very cold months, seasonal temperature variation, heating- and cooling-degree days, freezing days, zero-degree days, ninety-degree days.
- Housing: utility bills, property taxes, mortgage payments.
- Health Care & Environment: per capita physicians, teaching hospitals, medical schools, cardiac rehabilitation centers, comprehensive cancer treatment centers, hospices, insurance/hospitalization costs index, flouridation of drinking water, air pollution.
- Crime: violent crime rate, property crime rate.
- Transportation: daily commute, public transportation, Interstate highways, air service, passenger rail service.
- Education: pupil/teacher ratio in the public K-12 system, effort index in K-12, accademic options in higher education.
- The Arts: museums, fine arts and public radio stations, public television stations, universities offering a degree or degrees in the arts, symphony orchestras, theatres, opera companies, dance companies, public libraries.
- Recreation: good restaurants, public golf courses, certified lanes for tenpin bowling, movie theatres, zoos, aquariums, family theme parks, sanctioned automobile race tracks, pari-mutuel betting attractions, major- and minor- league professional sports teams, NCAA Division I football and basketball teams, miles of ocean or Great Lakes coastline, inland water, national forests, national parks, or national wildlife refuges, Consolidated Metropolitan Statistical Area access.
- Economics: average household income adjusted for taxes and living costs, income growth, job growth.
In addition to these, latitude and longitude, population and state are also given, but should not be included in the PCA.
Use PCA to identify the major components of variation in the ratings among cities.
places <- read.csv('Data/Places.csv')
head(places) City Climate Housing HlthCare Crime Transp Educ Arts
1 AbileneTX 521 6200 237 923 4031 2757 996
2 AkronOH 575 8138 1656 886 4883 2438 5564
3 AlbanyGA 468 7339 618 970 2531 2560 237
4 Albany-Schenectady-TroyNY 476 7908 1431 610 6883 3399 4655
5 AlbuquerqueNM 659 8393 1853 1483 6558 3026 4496
6 AlexandriaLA 520 5819 640 727 2444 2972 334
Recreat Econ Long Lat Pop
1 1405 7633 -99.6890 32.5590 110932
2 2632 4350 -81.5180 41.0850 660328
3 859 5250 -84.1580 31.5750 112402
4 1617 5864 -73.7983 42.7327 835880
5 2612 5727 -106.6500 35.0830 419700
6 1018 5254 -92.4530 31.3020 135282
A.
If you want to explore this data set in lower dimensional space using the first \(k\) principal components, how many would you use, and what percent of the total variability would these retained PCs explain? Use a scree plot to help you answer this question.
numeric_places<- places %>% select(Climate:Pop)
rownames(numeric_places) <- make.unique(as.character(places$City))
places_pca <- prcomp(numeric_places, scale. = TRUE)
places_pcaStandard deviations (1, .., p=12):
[1] 2.0237384 1.3472730 1.2312271 1.0174314 0.9721419 0.8376890 0.7603762
[8] 0.6628697 0.5988163 0.5449244 0.3566748 0.3019278
Rotation (n x k) = (12 x 12):
PC1 PC2 PC3 PC4 PC5
Climate 0.17940726 0.15652817 -0.38762556 -0.08299896 0.653062149
Housing 0.29810653 0.07967653 -0.24975082 0.54066971 0.262095878
HlthCare 0.43703748 -0.19697686 0.08607086 -0.10221530 0.074904660
Crime 0.25173873 0.42713952 0.17335496 -0.31894942 -0.133086085
Transp 0.30635618 -0.09564368 -0.08243830 0.09136267 -0.549751213
Educ 0.24493979 -0.32317805 0.29372927 0.19839247 0.126307373
Arts 0.44567911 -0.09634450 0.03660440 -0.18821596 0.018546273
Recreat 0.28103070 0.25389477 -0.18675453 0.20791424 -0.264623566
Econ 0.09841696 0.35317713 0.41158338 0.56910048 -0.004732669
Long -0.01116130 -0.43716148 0.47980658 0.07771995 0.216303881
Lat 0.02250696 -0.49206073 -0.46567587 0.19265491 -0.211211365
Pop 0.42754722 -0.05097002 0.06443198 -0.30857542 0.039030939
PC6 PC7 PC8 PC9 PC10
Climate -0.414744839 0.04651951 -0.18458206 -0.26679321 -0.278526192
Housing 0.186914217 -0.08728499 -0.17439807 0.34254045 0.496594504
HlthCare 0.198566350 -0.06008306 -0.02863062 -0.02852104 0.044367540
Crime -0.257647214 -0.04207741 -0.22668534 0.67165793 -0.179559766
Transp -0.459301871 -0.15131171 -0.40328824 -0.36700555 0.207826949
Educ -0.543764832 -0.05676576 0.59846581 0.16454359 0.053478885
Arts 0.294880377 -0.09324141 -0.01325001 -0.05131517 -0.007011574
Recreat 0.038021030 0.79338414 0.23919422 -0.05689020 -0.117910633
Econ 0.122903845 -0.28099382 -0.05097580 -0.21307498 -0.478670745
Long -0.007256055 0.46896902 -0.52213793 0.06361636 -0.082080334
Lat 0.070332046 -0.12834268 -0.05075675 0.32025281 -0.574455574
Pop 0.268507118 -0.07214939 0.16526903 -0.19975571 -0.125461902
PC11 PC12
Climate -0.018313993 0.04838028
Housing 0.148659661 -0.14960167
HlthCare -0.835364093 -0.04751962
Crime -0.004590256 -0.04196267
Transp 0.055600421 -0.04644566
Educ 0.070675766 0.04446266
Arts 0.320654772 0.74407857
Recreat -0.043807966 0.03537736
Econ -0.027465832 0.01841111
Long 0.135994394 -0.06190192
Lat 0.025805853 -0.03959559
Pop 0.383367687 -0.63755978
library(factoextra)Warning: package 'factoextra' was built under R version 4.4.3
Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
fviz_eig(places_pca) based on the scree plot, I would use k as 4. The 4 dimension is where the cliff ends and the scree starts or diminishing return point. I would say anything after the 4th dimension any PCs adds little information.
B.
Interpret the retained principal components by examining the loadings (plot(s) of the loadings may be helpful). Which variables will be used to separate cities along the first and second principal axes, and how? Make sure to discuss the signs of the loadings, not just their contributions!
library(patchwork)Warning: package 'patchwork' was built under R version 4.4.3
loadings_df <- data.frame(places_pca$rotation) %>%
rownames_to_column('Variable')
base_plot <- ggplot(loadings_df) +
theme_classic(base_size = 12) +
theme(axis.text.x = element_text(angle = 45,hjust=1,vjust=1)) +
labs(x = 'Variable') +
geom_hline(aes(yintercept= 0), linetype=2)
l1 <- base_plot +
geom_col(aes(x = reorder(Variable,PC1, decreasing=TRUE), y = PC1))
l2 <- base_plot+
geom_col(aes(x = reorder(Variable,PC2, decreasing=TRUE), y = PC2))
(l1+l2) +
plot_annotation(title = 'Loadings plots for places PCA',
theme=theme(plot.title = element_text(hjust = 0.5)))Along the first principal axis the variables used to separated the cites are Arts , HealthCare and Pop in positive direction.
Along the second principal axis the variables separating the cites are crime ,econ in the positive direction and lat, long and education in the negative direction.
C.
Add the first two PC scores to the places data set. Create a biplot of the first 2 PCs, using repelled labeling to identify the cities. Which are the outlying cities and what characteristics make them unique?
library(ggrepel)Warning: package 'ggrepel' was built under R version 4.4.3
pcs <- as.data.frame(places_pca$x[, 1:2])
pcs$City <- places$City
pcs$outlier <- abs(scale(pcs$PC1)) > 2 | abs(scale(pcs$PC2)) > 2
fviz_pca(places_pca,
axes = c(1, 2),
col.ind = "white",
label = "var",
repel = TRUE) +
geom_point(data = pcs,
aes(x = PC1, y = PC2),
color = "blue",
alpha = 0.7, size = 2) +
geom_text_repel(data = pcs[pcs$outlier, ],
aes(x = PC1, y = PC2, label = City),
size = 3,
color = "red",
fontface = "bold") +
labs(title = "PCA Biplot: PC1 vs PC2",
x = "PC1",
y = "PC2") +
theme_minimal()The most famous cities are outliers. Cities like Chicago, Philadelphia, Boston, and Washington DC are high on Educ and Health Care. San Francisco and Los Angeles and Seattle stand out for recreate, housing and good climate. Miami , Las-vegas and West Palm Beach lean toward higher crime rate and economic activity. New York is far right bottom where arts, Pop and transport is high.
Problem 4
The data we will look at here come from a study of malignant and benign breast cancer cells using fine needle aspiration conducted at the University of Wisconsin-Madison. The goal was determine if malignancy of a tumor could be established by using shape characteristics of cells obtained via fine needle aspiration (FNA) and digitized scanning of the cells.
The variables in the data file you will be using are:
- ID - patient identification number (not used in PCA)
- Diagnosis determined by biopsy - B = benign or M = malignant
- Radius: mean of distances from center to points on the perimeter
- Texture: standard deviation of gray-scale values
- Smoothness: local variation in radius lengths
- Compactness: perimeter^2 / area - 1.0
- Concavity: severity of concave portions of the contour
- Concavepts: number of concave portions of the contour
- Symmetry: measure of symmetry of the cell nucleus
- FracDim: fractal dimension; “coastline approximation” - 1
bc_cells <- read.csv('Data/BreastDiag.csv')
head(bc_cells) Diagnosis Radius Texture Smoothness Compactness Concavity ConcavePts Symmetry
1 M 17.99 10.38 0.11840 0.27760 0.3001 0.14710 0.2419
2 M 20.57 17.77 0.08474 0.07864 0.0869 0.07017 0.1812
3 M 19.69 21.25 0.10960 0.15990 0.1974 0.12790 0.2069
4 M 11.42 20.38 0.14250 0.28390 0.2414 0.10520 0.2597
5 M 20.29 14.34 0.10030 0.13280 0.1980 0.10430 0.1809
6 M 12.45 15.70 0.12780 0.17000 0.1578 0.08089 0.2087
FracDim
1 0.07871
2 0.05667
3 0.05999
4 0.09744
5 0.05883
6 0.07613
A.
My analysis suggests 3 PCs should be retained. Support or refute this suggestion. What percent of variability is explained by the first 3 PCs?
numeric_bc_cells<- bc_cells %>% select(Radius:FracDim)
rownames(numeric_bc_cells) <- make.unique(as.character(bc_cells$Diagnosis))
bc_cells_pca <- prcomp(numeric_bc_cells, scale. = TRUE)
bc_cells_pcaStandard deviations (1, .., p=8):
[1] 2.0705378 1.3503646 0.9086939 0.7061387 0.6101579 0.3035518 0.2622598
[8] 0.1783697
Rotation (n x k) = (8 x 8):
PC1 PC2 PC3 PC4 PC5
Radius -0.3003952 0.52850910 0.27751200 -0.0449523963 0.04245937
Texture -0.1432175 0.35378530 -0.89839046 -0.0002176232 0.21581443
Smoothness -0.3482386 -0.32661945 0.12684205 0.1097614573 0.84332416
Compactness -0.4584098 -0.07219238 -0.02956419 0.1825835334 -0.23762997
Concavity -0.4508935 0.12707085 0.04245883 0.1571126948 -0.30459047
ConcavePts -0.4459288 0.22823091 0.17458320 0.0608428515 0.01923459
Symmetry -0.3240333 -0.28112508 -0.08456832 -0.8897711849 -0.11359240
FracDim -0.2251375 -0.57996072 -0.24389523 0.3640273309 -0.27912206
PC6 PC7 PC8
Radius -0.518437923 0.36152546 -0.387460874
Texture -0.006127134 0.02418201 0.004590238
Smoothness 0.079444068 -0.04732075 -0.155456892
Compactness -0.388065805 -0.73686177 0.020239147
Concavity 0.700061530 0.02347868 -0.413095816
ConcavePts 0.125314641 0.21313047 0.808318445
Symmetry -0.018262848 0.05764443 -0.023810142
FracDim -0.261064577 0.52365191 -0.026129456
summary(places_pca)Importance of components:
PC1 PC2 PC3 PC4 PC5 PC6 PC7
Standard deviation 2.0237 1.3473 1.2312 1.01743 0.97214 0.83769 0.76038
Proportion of Variance 0.3413 0.1513 0.1263 0.08626 0.07875 0.05848 0.04818
Cumulative Proportion 0.3413 0.4926 0.6189 0.70515 0.78390 0.84238 0.89056
PC8 PC9 PC10 PC11 PC12
Standard deviation 0.66287 0.59882 0.54492 0.3567 0.3019
Proportion of Variance 0.03662 0.02988 0.02475 0.0106 0.0076
Cumulative Proportion 0.92717 0.95706 0.98180 0.9924 1.0000
I would say we need at least the first 5 PCs. The first 3 PCs only explain 61.8% of the total variability.
fviz_eig(bc_cells_pca)All so based on the scree plot I would say the cliff ends on the 5th PC not the 3rd.
B.
Interpret the first 3 principal components by examining the eigenvectors/loadings. Discuss.
sort(bc_cells_pca$rotation[,1], decreasing=TRUE) Texture FracDim Radius Symmetry Smoothness ConcavePts
-0.1432175 -0.2251375 -0.3003952 -0.3240333 -0.3482386 -0.4459288
Concavity Compactness
-0.4508935 -0.4584098
sort(bc_cells_pca$rotation[,1]^2, decreasing=TRUE)Compactness Concavity ConcavePts Smoothness Symmetry Radius
0.21013951 0.20330494 0.19885247 0.12127015 0.10499756 0.09023725
FracDim Texture
0.05068687 0.02051125
loadings_df <- data.frame(bc_cells_pca$rotation) %>%
rownames_to_column('Variable')
base_plot <- ggplot(loadings_df) +
theme_classic(base_size = 12) +
theme(axis.text.x = element_text(angle = 45,hjust=1,vjust=1)) +
labs(x = 'Variable') +
geom_hline(aes(yintercept= 0), linetype=2)
l1 <- base_plot +
geom_col(aes(x = reorder(Variable,PC1, decreasing=TRUE), y = PC1))
l2 <- base_plot+
geom_col(aes(x = reorder(Variable,PC2, decreasing=TRUE), y = PC2))
l3 <- base_plot +
geom_col(aes(x = reorder(Variable,PC3, decreasing=TRUE), y = PC3))
(l1+l2)/(l3)+
plot_annotation(title = 'Loadings plots for bc_cells PCA',
theme=theme(plot.title = element_text(hjust = 0.5)))Along PC1 the variables used to separated the diagnosis are Compactness, Concavity, Concave pts in the negative direction.
Along PC2 the variables used to separated the diagnosis are FracDim and Smothness in a negative direction and Radisus and Texture in the positive direction.
Along PC3 the variables used to separated the diagnosis are texture in the negative direction Radius and Concave pts in the positive direction.
C.
Examine a biplot of the first two PCs. Incorporate the third PC by sizing the points by this variable. (Hint: use fviz_pca to set up a biplot, but set col.ind='white'. Then use geom_point() to maintain full control over the point mapping.) Color-code by whether the cells are benign or malignant. Answer the following:
- What characteristics distinguish malignant from benign cells?
- Of the 3 PCs, which does the best job of differentiating malignant from benign cells?
pcs <- as.data.frame(bc_cells_pca$x[, 1:3])
pcs$Diagnosis <- bc_cells$Diagnosis
base_plot <- fviz_pca(bc_cells_pca,
axes = c(1, 2),
col.ind = "white",
label = "var",
repel = TRUE)
base_plot +
geom_point(data = pcs,
aes(x = PC1, y = PC2, color = Diagnosis, size = abs(PC3)),
alpha = 0.7) +
scale_size_continuous(name = "PC3") +
labs(title = "PCA Biplot: PC1 vs PC2 (size by PC3)",
color = "Diagnosis") Compactness, Concavity, Concave pts and maybe Radius distinguish malignant from benign cells.
PC1 Does the best job when it comes to differentiating malignant from benign cells.