eigs <- c(3.5, 1.0, 0.7, 0.4, 0.25, 0.15)
total_var <- sum(eigs)
cum_var <- cumsum(eigs) / total_var
cum_var[1] 0.5833333 0.7500000 0.8666667 0.9333333 0.9750000 1.0000000
SUBMISSION INSTRUCTIONS
Consider the following 6 eigenvalues from a \(6\times 6\) correlation matrix:
\[\lambda_1 = 3.5, \lambda_2 = 1.0, \lambda_3 = 0.7, \lambda_4 = 0.4, \lambda_5 = 0.25, \lambda_6 = 0.15\]
If you want to retain enough principal components to explain at least 90% of the variability inherent in the data set, how many should you keep?
eigs <- c(3.5, 1.0, 0.7, 0.4, 0.25, 0.15)
total_var <- sum(eigs)
cum_var <- cumsum(eigs) / total_var
cum_var[1] 0.5833333 0.7500000 0.8666667 0.9333333 0.9750000 1.0000000
You should only keep the first 4, as that would explain at 93.33% of the variability.
The iris data set is a classic data set often used to demonstrate PCA. Each iris in the data set contained a measurement of its sepal length, sepal width, petal length, and petal width. Consider the five irises below, following mean-centering and scaling:
library(tidyverse)
five_irises <- data.frame(
row.names = 1:5,
Sepal.Length = c(0.189, 0.551, -0.415, 0.310, -0.898),
Sepal.Width = c(-1.97, 0.786, 2.62, -0.590, 1.70),
Petal.Length = c(0.137, 1.04, -1.34, 0.534, -1.05),
Petal.Width = c(-0.262, 1.58, -1.31, 0.000875, -1.05)
) %>% as.matrixConsider also the loadings for the first two principal components:
# Create the data frame
pc_loadings <- data.frame(
PC1 = c(0.5210659, -0.2693474, 0.5804131, 0.5648565),
PC2 = c(-0.37741762, -0.92329566, -0.02449161, -0.06694199),
row.names = c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")
) %>% as.matrixA plot of the first two PC scores for these five irises is shown in the plot below.
Match the ID of each iris (1-5) to the correct letter of its score coordinates on the plot.
scores <- five_irises %*% pc_loadings
scores PC1 PC2
1 0.5606200 1.7617440
2 1.5715031 -1.0649071
3 -2.4396481 -2.1418936
4 0.6308802 0.4146079
5 -2.1283408 -1.1346763
(1 = b), (2 = d), (3 = a), (4 = c), (5 = e),
These data are taken from the Places Rated Almanac, by Richard Boyer and David Savageau, copyrighted and published by Rand McNally. The nine rating criteria used by Places Rated Almanac are:
For all but two of the above criteria, the higher the score, the better. For Housing and Crime, the lower the score the better. The scores are computed using the following component statistics for each criterion (see the Places Rated Almanac for details):
In addition to these, latitude and longitude, population and state are also given, but should not be included in the PCA.
Use PCA to identify the major components of variation in the ratings among cities.
places <- read.csv('Data/Places.csv')
head(places) City Climate Housing HlthCare Crime Transp Educ Arts
1 AbileneTX 521 6200 237 923 4031 2757 996
2 AkronOH 575 8138 1656 886 4883 2438 5564
3 AlbanyGA 468 7339 618 970 2531 2560 237
4 Albany-Schenectady-TroyNY 476 7908 1431 610 6883 3399 4655
5 AlbuquerqueNM 659 8393 1853 1483 6558 3026 4496
6 AlexandriaLA 520 5819 640 727 2444 2972 334
Recreat Econ Long Lat Pop
1 1405 7633 -99.6890 32.5590 110932
2 2632 4350 -81.5180 41.0850 660328
3 859 5250 -84.1580 31.5750 112402
4 1617 5864 -73.7983 42.7327 835880
5 2612 5727 -106.6500 35.0830 419700
6 1018 5254 -92.4530 31.3020 135282
If you want to explore this data set in lower dimensional space using the first \(k\) principal components, how many would you use, and what percent of the total variability would these retained PCs explain? Use a scree plot to help you answer this question.
places_numeric <- places %>%
select(
Climate,
Housing,
HlthCare,
Crime,
Transp,
Educ,
Arts,
Recreat,
Econ
)
places_pca <- prcomp(places_numeric, scale. = TRUE)
summary(places_pca)Importance of components:
PC1 PC2 PC3 PC4 PC5 PC6 PC7
Standard deviation 1.8462 1.1018 1.0684 0.9596 0.8679 0.79408 0.70217
Proportion of Variance 0.3787 0.1349 0.1268 0.1023 0.0837 0.07006 0.05478
Cumulative Proportion 0.3787 0.5136 0.6404 0.7427 0.8264 0.89650 0.95128
PC8 PC9
Standard deviation 0.56395 0.34699
Proportion of Variance 0.03534 0.01338
Cumulative Proportion 0.98662 1.00000
library(factoextra)Warning: package 'factoextra' was built under R version 4.4.3
Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
fviz_eig(places_pca) +
theme_classic(base_size = 16)Warning in geom_bar(stat = "identity", fill = barfill, color = barcolor, :
Ignoring empty aesthetic: `width`.
I would use 5, it would cover 82% of variability. In this case one could argue that almost any PC can be used after 3 as after that it decreases by a consistent amount, though that amount is very large so there isn’t an easy cut off.
Interpret the retained principal components by examining the loadings (plot(s) of the loadings may be helpful). Which variables will be used to separate cities along the first and second principal axes, and how? Make sure to discuss the signs of the loadings, not just their contributions!
places_pca$rotation[, 1:2] PC1 PC2
Climate 0.2064140 0.2178353
Housing 0.3565216 0.2506240
HlthCare 0.4602146 -0.2994653
Crime 0.2812984 0.3553423
Transp 0.3511508 -0.1796045
Educ 0.2752926 -0.4833821
Arts 0.4630545 -0.1947899
Recreat 0.3278879 0.3844746
Econ 0.1354123 0.4712833
PC1: All loadings on PC1 are positive. Cities with high PC1 scores tend to have strong ratings across most categories, especially Arts, Health Care, Housing, Transportation, and Recreation. Cities with low PC1 scores perform more poorly across these same variables. PC1 separates cities primarily based on overall amenities and service quality.
PC2: PC2 contains both positive and negative loadings. Positive loadings are Economics, Recreation, Crime, Housing, and Climate, while negative loadings are Education, Health Care, Arts, and Transportation. Cities scoring high on PC2 tend to have stronger economies, better climates, and more recreation opportunities but weaker educational, arts, and health-care systems. ## C.
Add the first two PC scores to the places data set. Create a biplot of the first 2 PCs, using repelled labeling to identify the cities. Which are the outlying cities and what characteristics make them unique?
scores_df <- as.data.frame(places_pca$x[, 1:2]) %>%
mutate(City = places$City)
# Identify outliers (high |PC1| or |PC2|)
outliers <- scores_df %>%
filter(abs(PC1) > quantile(abs(PC1), 0.95) |
abs(PC2) > quantile(abs(PC2), 0.95))# This is not necessary I just wanted to try and see if it provides any insight
# (your City entries look like "Los-AngelesLong-BeachCA", so the last 2 letters are state)
places$State <- substr(places$City, nchar(places$City)-1, nchar(places$City))
# Define regions
regions <- list(
Northeast = c("ME","NH","VT","MA","RI","CT","NY","NJ","PA"),
South = c("DE","MD","DC","VA","WV","KY","TN","NC","SC",
"GA","FL","AL","MS","AR","LA","OK","TX"),
Midwest = c("OH","MI","IN","IL","WI","MN","IA","MO","ND","SD","NE","KS"),
West = c("MT","ID","WY","CO","NM","AZ","UT","NV","CA","OR","WA","AK","HI")
)
# Assign region to each city
places$Region <- NA
for(r in names(regions)) {
places$Region[places$State %in% regions[[r]]] <- r
}
table(places$Region) # sanity check
Midwest Northeast South West
89 67 118 55
library(ggplot2)
library(ggrepel)Warning: package 'ggrepel' was built under R version 4.4.3
library(dplyr)
# Extract scores + loadings
scores_df <- as.data.frame(places_pca$x[, 1:2]) %>%
mutate(City = places$City,
Region = places$Region)
loadings_df <- as.data.frame(places_pca$rotation[, 1:2]) %>%
mutate(Variable = rownames(.))
# Identify outlier cities to label
outliers <- scores_df %>%
mutate(dist = sqrt(PC1^2 + PC2^2)) %>%
slice_max(dist, n = 12)
# Region color palette (soft / pastel)
region_colors <- c(
Midwest = "#74aaff",
Northeast = "#ff7070",
South = "#ffd966",
West = "#85d187"
)
# Scale arrows (otherwise they are too short)
arrow_scale <- 6
ggplot() +
# --- FAINT CITY DOTS ---
geom_point(data = scores_df,
aes(PC1, PC2, color = Region),
size = 1.8, alpha = 0.35) +
scale_color_manual(values = region_colors) +
# --- VARIABLE ARROWS ---
geom_segment(
data = loadings_df,
aes(x = 0, y = 0,
xend = PC1 * arrow_scale,
yend = PC2 * arrow_scale),
arrow = arrow(length = unit(0.25, "cm")),
color = "dodgerblue3",
linewidth = 1.2,
alpha = 0.9
) +
# --- VARIABLE LABELS ---
geom_text_repel(
data = loadings_df,
aes(x = PC1 * arrow_scale,
y = PC2 * arrow_scale,
label = Variable),
color = "dodgerblue4",
size = 6,
fontface = "bold"
) +
# --- LABEL OUTLIER CITIES ONLY ---
geom_text_repel(
data = outliers,
aes(PC1, PC2, label = City, color = Region),
size = 4,
fontface = "bold",
box.padding = 0.4,
max.overlaps = Inf
) +
# --- CROSSHAIR LINES ---
geom_hline(yintercept = 0, linetype = "dashed", color = "gray40") +
geom_vline(xintercept = 0, linetype = "dashed", color = "gray40") +
# --- LABELS & THEME ---
labs(
title = "PCA Biplot of U.S. Cities",
x = paste0("PC1 (", round(100*summary(places_pca)$importance[2,1], 1), "%)"),
y = paste0("PC2 (", round(100*summary(places_pca)$importance[2,2], 1), "%)"),
color = "Region"
) +
theme_classic(base_size = 16)#package you wanted us to use
library(factoextra)
library(ggrepel)
library(ggplot2)
library(dplyr)
# PC scores for cities
scores_df <- as.data.frame(places_pca$x[, 1:2]) %>%
mutate(City = places$City)
# TRUE outliers (optional labeling)
outliers <- scores_df %>%
mutate(dist = sqrt(PC1^2 + PC2^2)) %>%
slice_max(dist, n = 12)
# ---- PCA biplot (ONLY VARIABLE ARROWS; cities = dots) ----
p <- fviz_pca_biplot(
places_pca,
axes = c(1,2),
label = "var", # LABEL VARIABLES ONLY
col.var = "steelblue4", # variable arrow color
repel = TRUE,
select.var = list(contrib = 9), # show all 9 variable arrows
geom.ind = "point", # INDIVIDUAL CITIES = POINTS ONLY
col.ind = "gray20", # city color
alpha.ind = 0.35, # faint cities for clarity
pointsize = 1.6 # city point size
) +
theme_classic(base_size = 15) +
labs(title = "PCA Biplot of U.S. Cities",
x = paste0("PC1 (", round(100 * summary(places_pca)$importance[2,1],1), "%)"),
y = paste0("PC2 (", round(100 * summary(places_pca)$importance[2,2],1), "%)"))Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.
ℹ The deprecated feature was likely used in the ggpubr package.
Please report the issue at <https://github.com/kassambara/ggpubr/issues>.
# ---- OPTIONAL: label the outlier cities ----
p <- p +
geom_text_repel(
data = outliers,
aes(PC1, PC2, label = City),
size = 3.5
)
pNew York, very prominent transportation systems, higher in healthcare and education. Chicago, very prominent health care and education systems. Philadelphia, very higher in education but on the lower end of economy. Miami has a lot of recreational activities and a good economy but also suffers from a high crime rate.The outliers are all some of the largest cities in the United States with the exception of Newark, which has a very good transportation system.
The data we will look at here come from a study of malignant and benign breast cancer cells using fine needle aspiration conducted at the University of Wisconsin-Madison. The goal was determine if malignancy of a tumor could be established by using shape characteristics of cells obtained via fine needle aspiration (FNA) and digitized scanning of the cells.
The variables in the data file you will be using are:
bc_cells <- read.csv('Data/BreastDiag.csv')
head(bc_cells) Diagnosis Radius Texture Smoothness Compactness Concavity ConcavePts Symmetry
1 M 17.99 10.38 0.11840 0.27760 0.3001 0.14710 0.2419
2 M 20.57 17.77 0.08474 0.07864 0.0869 0.07017 0.1812
3 M 19.69 21.25 0.10960 0.15990 0.1974 0.12790 0.2069
4 M 11.42 20.38 0.14250 0.28390 0.2414 0.10520 0.2597
5 M 20.29 14.34 0.10030 0.13280 0.1980 0.10430 0.1809
6 M 12.45 15.70 0.12780 0.17000 0.1578 0.08089 0.2087
FracDim
1 0.07871
2 0.05667
3 0.05999
4 0.09744
5 0.05883
6 0.07613
My analysis suggests 3 PCs should be retained. Support or refute this suggestion. What percent of variability is explained by the first 3 PCs?
bc_numeric <- bc_cells %>%
select(Radius, Texture, Smoothness, Compactness,
Concavity, ConcavePts, Symmetry, FracDim)
bc_pca <- prcomp(bc_numeric, scale. = TRUE)
summary(bc_pca)Importance of components:
PC1 PC2 PC3 PC4 PC5 PC6 PC7
Standard deviation 2.0705 1.3504 0.9087 0.70614 0.61016 0.30355 0.2623
Proportion of Variance 0.5359 0.2279 0.1032 0.06233 0.04654 0.01152 0.0086
Cumulative Proportion 0.5359 0.7638 0.8670 0.92937 0.97591 0.98743 0.9960
PC8
Standard deviation 0.17837
Proportion of Variance 0.00398
Cumulative Proportion 1.00000
library(factoextra)
fviz_eig(bc_pca) +
theme_classic(base_size = 16)Warning in geom_bar(stat = "identity", fill = barfill, color = barcolor, :
Ignoring empty aesthetic: `width`.
The first 3 PCs account for 86.7% of the variability. This is a good cut off as it covers a very large portion of the data and after 3 the amount of increase with each PC does not add substantial variability explanation.
Interpret the first 3 principal components by examining the eigenvectors/loadings. Discuss.
bc_pca$rotation PC1 PC2 PC3 PC4 PC5
Radius -0.3003952 0.52850910 0.27751200 -0.0449523963 0.04245937
Texture -0.1432175 0.35378530 -0.89839046 -0.0002176232 0.21581443
Smoothness -0.3482386 -0.32661945 0.12684205 0.1097614573 0.84332416
Compactness -0.4584098 -0.07219238 -0.02956419 0.1825835334 -0.23762997
Concavity -0.4508935 0.12707085 0.04245883 0.1571126948 -0.30459047
ConcavePts -0.4459288 0.22823091 0.17458320 0.0608428515 0.01923459
Symmetry -0.3240333 -0.28112508 -0.08456832 -0.8897711849 -0.11359240
FracDim -0.2251375 -0.57996072 -0.24389523 0.3640273309 -0.27912206
PC6 PC7 PC8
Radius -0.518437923 0.36152546 -0.387460874
Texture -0.006127134 0.02418201 0.004590238
Smoothness 0.079444068 -0.04732075 -0.155456892
Compactness -0.388065805 -0.73686177 0.020239147
Concavity 0.700061530 0.02347868 -0.413095816
ConcavePts 0.125314641 0.21313047 0.808318445
Symmetry -0.018262848 0.05764443 -0.023810142
FracDim -0.261064577 0.52365191 -0.026129456
summary(bc_pca)Importance of components:
PC1 PC2 PC3 PC4 PC5 PC6 PC7
Standard deviation 2.0705 1.3504 0.9087 0.70614 0.61016 0.30355 0.2623
Proportion of Variance 0.5359 0.2279 0.1032 0.06233 0.04654 0.01152 0.0086
Cumulative Proportion 0.5359 0.7638 0.8670 0.92937 0.97591 0.98743 0.9960
PC8
Standard deviation 0.17837
Proportion of Variance 0.00398
Cumulative Proportion 1.00000
PC1 reflects overall tumor abnormality, with high contributions from compactness, concavity, and concave points, meaning it separates large, irregular tumors from smaller, smoother ones. PC2 contrasts size/texture (radius, texture) with smoothness and symmetry. PC3 is dominated by texture and captures finer-scale textural differences among tumors.
Examine a biplot of the first two PCs. Incorporate the third PC by sizing the points by this variable. (Hint: use fviz_pca to set up a biplot, but set col.ind='white'. Then use geom_point() to maintain full control over the point mapping.) Color-code by whether the cells are benign or malignant. Answer the following:
library(tidyverse)
library(factoextra)
library(ggrepel)
# PC scores for first 3 PCs
scores_df <- as.data.frame(bc_pca$x[, 1:3]) %>%
mutate(Diagnosis = bc_cells$Diagnosis) # B or Mbase_biplot <- fviz_pca(
bc_pca,
axes = c(1,2),
label = "var",
col.var = "black",
arrowsize = 1.2,
repel = TRUE,
col.ind = "white" # required to hide individuals
) +
theme_classic(base_size = 16) +
labs(
title = "Breast Cancer PCA: PC1 vs PC2",
x = paste0("PC1 (", round(100 * summary(bc_pca)$importance[2,1], 1), "%)"),
y = paste0("PC2 (", round(100 * summary(bc_pca)$importance[2,2], 1), "%)")
)final_plot <- base_biplot +
# Benign (blue) points — subtle
geom_point(
data = scores_df,
aes(x = PC1, y = PC2, size = PC3, color = Diagnosis), # <- COLOR BACK IN AES
alpha = ifelse(scores_df$Diagnosis == "B", 0.25, 0), # B visible, M suppressed here
stroke = 0.1
) +
# Malignant (red) points — emphasized
geom_point(
data = scores_df %>% filter(Diagnosis == "M"),
aes(x = PC1, y = PC2, size = PC3, color = Diagnosis), # <- COLOR BACK IN AES
alpha = 0.55,
stroke = 0.3,
size = 3
) +
scale_color_manual(
values = c("B" = "steelblue2", "M" = "firebrick2"),
name = "Diagnosis"
) +
scale_size_continuous(
name = "PC3 score",
range = c(1.5, 4)
) +
guides(
size = guide_legend(order = 2),
color = guide_legend(order = 1)
)
final_plotMalignant cells tend to have larger radius, higher texture variation, and more irregular, concave, and complex boundaries (high concavity, compactness, and concave points).Benign cells, in contrast, are generally smaller, smoother, and more symmetric with less irregular features.
PC1 as it best captures the dominant features associated with malignancy, size, concavity, compactness, and boundary irregularity. PC2 and PC3 show some variation, but PC1 is the primary axis that distinguishes the two groups.