Homework: Principal Component Analysis

Published

March 26, 2026

1 Textbook Problems (Johnson & Wichern, Ch. 8)

For problems 1 and 2, work by hand or in code — but do not use prcomp or any built-in PCA function. Show your eigendecomposition steps explicitly.

1.1 Problem 8.1

Consider the covariance matrix

\[ \Sigma = \begin{pmatrix} 1 & -2 \\ -2 & 5 \end{pmatrix} \]

eigen(matrix(c(1, -2, -2, 5), ncol = 2, byrow = T))

eigen() decomposition
$values
[1] 5.8284271 0.1715729

$vectors
           [,1]       [,2]
[1,] -0.3826834 -0.9238795
[2,]  0.9238795 -0.3826834

(a) Find the eigenvalues $\lambda_1 \geq \lambda_2$ of $\Sigma$.

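$\det(\Sigma - \lambda I) = (1 - \lambda)(5 - \lambda) - (-2)(-2) = \lambda^2 - 6\lambda + 1 = 0$, so $\lambda = 3 \pm 2\sqrt{2}$, giving $\lambda_1 \approx 5.83$ and $\lambda_2 \approx 0.17$.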

(b) Find the corresponding unit eigenvectors $e_1$ and $e_2$.

$e_1 = (-0.383, 0.924)'$

$e_2 = (-0.924, -0.383)'$

(c) Write out the first and second population principal components $Y_1 = e_1'X$ and $Y_2 = e_2'X$.

$Y_1 = e_1'X = -0.383X_1 + 0.924X_2$

$Y_2 = e_2'X = -0.924X_1 - 0.383X_2$

(d) Compute $\text{Var}(Y_1)$ and $\text{Var}(Y_2)$. Verify that $\text{Var}(Y_1) + \text{Var}(Y_2) = \text{tr}(\Sigma)$.

$\text{Var}(Y_1) = e_1'\Sigma e_1 = \lambda_1 e_1'e_1 = \lambda_1 \approx 5.83$

$\text{Var}(Y_2) = e_2'\Sigma e_2 = \lambda_2 e_2'e_2 = \lambda_2 \approx 0.17$

$\text{Var}(Y_1) + \text{Var}(Y_2) = 5.83 + 0.17 = 6 = \text{tr}(\Sigma) = 1 + 5$

(e) What proportion of the total variance is explained by the first PC?

The proportion explained by the first PC is the larger eigenvalue over the total variance: $\lambda_1 / \text{tr}(\Sigma) = 5.828/6 \approx 0.9714$, or about 97.1%.
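Since the instructions ask for explicit eigendecomposition steps, the hand calculation for this $2 \times 2$ matrix can also be reproduced in base R without calling eigen(). This is a minimal sketch; the names `tr_S`, `det_S`, `lambda`, and `unit_eigvec` are illustrative and not part of the assignment.

```r
# Hand eigendecomposition of a symmetric 2 x 2 covariance matrix (no eigen()).
Sigma <- matrix(c(1, -2,
                  -2, 5), ncol = 2, byrow = TRUE)

# Characteristic polynomial: lambda^2 - tr(Sigma) * lambda + det(Sigma) = 0
tr_S  <- sum(diag(Sigma))
det_S <- Sigma[1, 1] * Sigma[2, 2] - Sigma[1, 2] * Sigma[2, 1]
disc  <- sqrt(tr_S^2 - 4 * det_S)
lambda <- c((tr_S + disc) / 2, (tr_S - disc) / 2)

# The first row of (Sigma - lambda I) x = 0 is solved by
# x = (Sigma[1, 2], lambda - Sigma[1, 1]); normalize it to unit length.
# Hypothetical helper; it assumes the off-diagonal entry is nonzero.
unit_eigvec <- function(l) {
  x <- c(Sigma[1, 2], l - Sigma[1, 1])
  x / sqrt(sum(x^2))
}
E <- sapply(lambda, unit_eigvec)   # columns are e1 and e2

lambda                     # eigenvalues, largest first
E                          # unit eigenvectors (the sign of each is arbitrary)
sum(lambda)                # should equal tr(Sigma)
lambda[1] / sum(lambda)    # proportion of variance explained by PC1
```

The trace and determinant are all that the quadratic formula needs, which is why the two-variable case is practical to do by hand.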

1.2 Problem 8.2

Consider the covariance matrix

\[ \Sigma = \begin{pmatrix} 5 & 2 \\ 2 & 2 \end{pmatrix} \]

(a) Find the eigenvalues and eigenvectors of $\Sigma$.

$\det(\Sigma - \lambda I) = (5 - \lambda)(2 - \lambda) - 2 \cdot 2 = \lambda^2 - 7\lambda + 6 = (\lambda - 6)(\lambda - 1) = 0$

$\lambda_1 = 6$, $\lambda_2 = 1$

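For $\lambda_1 = 6$, $(\Sigma - 6I)x = 0$ gives

$-x_1 + 2x_2 = 0$

$2x_1 - 4x_2 = 0$

so $x_1 = 2x_2$. Taking $x_2 = 1$ gives $x = (2, 1)'$, which normalizes to $e_1 = (2/\sqrt{5}, 1/\sqrt{5})'$.

For $\lambda_2 = 1$, $(\Sigma - I)x = 0$ gives

$4x_1 + 2x_2 = 0$

$2x_1 + x_2 = 0$

so $x_2 = -2x_1$. Taking $x_1 = 1$ gives $x = (1, -2)'$, which normalizes to $e_2 = (1/\sqrt{5}, -2/\sqrt{5})'$.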

(b) Determine the first two principal components and their variances.

$Y_1 = e_1'X = \frac{2}{\sqrt{5}}X_1 + \frac{1}{\sqrt{5}}X_2$

$Y_2 = e_2'X = \frac{1}{\sqrt{5}}X_1 - \frac{2}{\sqrt{5}}X_2$

$\text{Var}(Y_1) = \lambda_1 = 6$

$\text{Var}(Y_2) = \lambda_2 = 1$

(c) Compute $\text{Cov}(Y_1, Y_2)$ and verify the PCs are uncorrelated.

$\text{Cov}(Y_1, Y_2) = \text{Cov}(e_1'X, e_2'X) = e_1'\Sigma e_2$

$\Sigma e_2 = (5/\sqrt{5} - 4/\sqrt{5},\ 2/\sqrt{5} - 4/\sqrt{5})' = (1/\sqrt{5}, -2/\sqrt{5})'$

$e_1'\Sigma e_2 = (2/\sqrt{5})(1/\sqrt{5}) + (1/\sqrt{5})(-2/\sqrt{5}) = 2/5 - 2/5 = 0$, so $\text{Cov}(Y_1, Y_2) = 0$ and the PCs are uncorrelated.
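The same result follows without plugging in numbers, from the eigenvector property $\Sigma e_2 = \lambda_2 e_2$ together with the fact that eigenvectors of a symmetric matrix belonging to distinct eigenvalues are orthogonal:

\[ \text{Cov}(Y_1, Y_2) = e_1'\Sigma e_2 = e_1'(\lambda_2 e_2) = \lambda_2\, e_1'e_2 = 0 \]

Here $e_1'e_2 = (2/\sqrt{5})(1/\sqrt{5}) + (1/\sqrt{5})(-2/\sqrt{5}) = 0$, which is the orthogonality being used.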

(d) Compute the proportion of variance explained by $Y_1$ alone. Does a single PC seem sufficient here? Explain briefly.

$\lambda_1 / (\lambda_1 + \lambda_2) = 6/7 \approx 0.857$. This falls within the 0.7 to 0.9 range that is usually considered sufficient, so a single PC is a reasonable summary. We might still keep the second PC if, in context, it captures something meaningful that the first does not.

1.3 Problem 8.10 (parts a–b)

The dataset T8-4.DAT contains weekly rates of return for five stocks: JP Morgan (jpmorgan), Citibank (citi), Wells Fargo (wells), Royal Dutch Shell (shell), and ExxonMobil (exmob).

stonks <- read.table("/Users/jdumanski24/Desktop/Stat 388/johnson_wichern_data/T8-4.DAT",
                     col.names = c("jpmorgan", "citi", "wells", "shell", "exmob"))

(a) Compute the sample covariance matrix $S$. Obtain the eigenvalues and eigenvectors of $S$ (you may use eigen() directly). Determine the first two sample principal components and the proportion of total sample variance each explains.

eigen(cov(stonks)) #eigenvalues + vectors

eigen() decomposition
$values
[1] 0.0013676780 0.0007011596 0.0002538024 0.0001426026 0.0001188868

$vectors
          [,1]       [,2]        [,3]       [,4]        [,5]
[1,] 0.2228228 -0.6252260  0.32611218  0.6627590  0.11765952
[2,] 0.3072900 -0.5703900 -0.24959014 -0.4140935 -0.58860803
[3,] 0.1548103 -0.3445049 -0.03763929 -0.4970499  0.78030428
[4,] 0.6389680  0.2479475 -0.64249741  0.3088689  0.14845546
[5,] 0.6509044  0.3218478  0.64586064 -0.2163758 -0.09371777

pc1 <- eigen(cov(stonks))$vectors[,1]
pc2 <- eigen(cov(stonks))$vectors[,2]
pc1 # first principal component

[1] 0.2228228 0.3072900 0.1548103 0.6389680 0.6509044

pc2 # second principal component

[1] -0.6252260 -0.5703900 -0.3445049  0.2479475  0.3218478

eigen(cov(stonks))$values[1]/sum(diag(cov(stonks))) #var explained PC1

[1] 0.5292607

eigen(cov(stonks))$values[2]/sum(diag(cov(stonks))) #var explained PC2

[1] 0.271333

(b) Interpret the first PC. What does the combination of loadings suggest about what $\hat{Y}_1$ is measuring? Are the two industry groups (banking vs. oil) distinguishable in the first or second PC?

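All five loadings on the first principal component are positive, so $\hat{Y}_1$ rises and falls with all five stocks and can be read as a general market component. It is weighted toward the oil stocks, though: Shell and ExxonMobil have loadings above 0.6, while the three banks are all below 0.31. The two industry groups are separated more clearly by the second PC, where the three banks have negative loadings and the two oil stocks have positive loadings, which makes it an industry contrast.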

2 Applied Problems

2.1 Penguins: Correlation Structure and Grouped PCA

library(palmerpenguins)
library(tidyverse)
library(factoextra)
data("penguins")

2.1.1 Part (a) — Correlation structure and scaling decision

Compute the pairwise correlation matrix among bill_length_mm, bill_depth_mm, flipper_length_mm, and body_mass_g (complete cases only).

Display the correlation matrix and at minimum one correlation visualization (e.g., corrplot or ggcorrplot).

penguin_stats <- penguins %>%
  dplyr::select(bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g) %>%
  drop_na()

cor(penguin_stats)

                  bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
bill_length_mm         1.0000000    -0.2350529         0.6561813   0.5951098
bill_depth_mm         -0.2350529     1.0000000        -0.5838512  -0.4719156
flipper_length_mm      0.6561813    -0.5838512         1.0000000   0.8712018
body_mass_g            0.5951098    -0.4719156         0.8712018   1.0000000

library(corrplot)

corrplot 0.95 loaded

corrplot(cor(penguin_stats), method = "ellipse")

Comment on the strength and direction of the relationships you see.

Bill length is moderately to strongly positively correlated with flipper length (0.66) and body mass (0.60), while bill depth is negatively correlated with the other three variables (-0.24 to -0.58). The strongest relationship is the positive correlation between flipper length and body mass (0.87).

Based on the variable units and scales, explain whether you would run PCA on the covariance or correlation matrix, and why.

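We should use the correlation matrix because the variables are in different units and on very different scales (millimeters for the bill and flipper measurements, grams for body mass). On the covariance matrix, body mass would dominate simply because its numbers are the largest. The correlation matrix standardizes each variable so that all four are weighted equally.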

2.1.2 Part (b) — Full-dataset PCA and component selection

Run a PCA on the four variables (all species combined, complete cases, using the scaling you chose in part a).

(pca_penguins <- prcomp(penguin_stats, scale = T))

Standard deviations (1, .., p=4):
[1] 1.6594442 0.8789293 0.6043475 0.3293816

Rotation (n x k) = (4 x 4):
                         PC1          PC2        PC3        PC4
bill_length_mm     0.4552503 -0.597031143 -0.6443012  0.1455231
bill_depth_mm     -0.4003347 -0.797766572  0.4184272 -0.1679860
flipper_length_mm  0.5760133 -0.002282201  0.2320840 -0.7837987
body_mass_g        0.5483502 -0.084362920  0.5966001  0.5798821

Report the proportion of variance explained by each PC and the cumulative proportion.

summary(pca_penguins)

Importance of components:
                          PC1    PC2     PC3     PC4
Standard deviation     1.6594 0.8789 0.60435 0.32938
Proportion of Variance 0.6884 0.1931 0.09131 0.02712
Cumulative Proportion  0.6884 0.8816 0.97288 1.00000

Use a scree plot and at least one additional criterion (proportion threshold) to justify how many PCs you would retain.

screeplot(pca_penguins, type = "lines", main = "Scree Plot — Penguin Measurements")
abline(h = 1, col = "dodgerblue", lty = 2)

The dashed line at 1 on the scree plot marks the Kaiser criterion (eigenvalue > 1). The eigenvalues are the squared standard deviations, about 2.75, 0.77, 0.37, and 0.11, so the Kaiser criterion alone would keep only PC1. PC1 explains 69% of the variance, though, which falls just short of a 70–80% threshold, while the first two PCs together explain 88% and the remaining two add little. We will therefore retain the first two PCs.

Report the loadings for the retained PCs and give a plain-language interpretation of what each retained PC might represent biologically.

pca_penguins$rotation[,c(1,2)]

                         PC1          PC2
bill_length_mm     0.4552503 -0.597031143
bill_depth_mm     -0.4003347 -0.797766572
flipper_length_mm  0.5760133 -0.002282201
body_mass_g        0.5483502 -0.084362920

On PC1, bill length, flipper length, and body mass all have fairly large positive loadings (0.46 to 0.58), while bill depth has a loading of similar size but negative (-0.40). PC1 therefore looks like an overall body-size axis on which larger penguins tend to have shallower bills. This may be related to swimming, where a disproportionately deep bill could be less hydrodynamic. On PC2, all loadings are negative, but only bill length (-0.60) and bill depth (-0.80) are large, so PC2 is essentially a bill-size axis that is nearly independent of flipper length and body mass. It may relate to feeding, for example how precisely a penguin can catch fish.

2.1.3 Part (c) — Species-stratified PCA

Repeat the PCA from part (b) separately for each of the three species (Adelie, Chinstrap, Gentoo).

adelie <- penguins %>% 
  filter(species == "Adelie") %>%
  dplyr::select(bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g) %>%
  drop_na()
chinstrap <- penguins %>%
  filter(species == "Chinstrap") %>%
  dplyr::select(bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g) %>%
  drop_na()
gentoo <- penguins %>%
  filter(species == "Gentoo") %>%
  dplyr::select(bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g) %>%
  drop_na()
prcomp(adelie, scale = T)$rotation

                         PC1         PC2         PC3        PC4
bill_length_mm    -0.4893684  0.25148465  0.77083131 -0.3210812
bill_depth_mm     -0.4938218  0.42468847 -0.62890872 -0.4245628
flipper_length_mm -0.4364077 -0.86610403 -0.08900771 -0.2269135
body_mass_g       -0.5711453  0.07911372 -0.04868819  0.8155756

prcomp(chinstrap, scale = T)$rotation

                        PC1        PC2         PC3         PC4
bill_length_mm    0.4799159 -0.6790106 -0.03702704 -0.55430518
bill_depth_mm     0.5215085 -0.2793392 -0.09797736  0.80024930
flipper_length_mm 0.4918889  0.5513227 -0.64139697 -0.20663636
body_mass_g       0.5057222  0.3961786  0.76002590 -0.09822548

prcomp(gentoo, scale = T)$rotation

                        PC1        PC2        PC3        PC4
bill_length_mm    0.4855410 -0.8562822 -0.0504436 -0.1687783
bill_depth_mm     0.5034179  0.4381959 -0.2676349 -0.6949290
flipper_length_mm 0.5035281  0.2024235  0.8212595  0.1761166
body_mass_g       0.5072276  0.1838200 -0.5013580  0.6764396

Do the first PCs differ meaningfully across species in terms of their loadings?

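Not meaningfully. The Adelie loadings are all negative while the Chinstrap and Gentoo loadings are all positive, but the sign of a principal component is arbitrary, so multiplying the Adelie vector by -1 gives the same axis. In all three species the four loadings share a sign and have similar magnitudes (about 0.44 to 0.57), so PC1 is an overall size axis within each species. This differs from the pooled PC1 in part (b), where bill depth had the opposite sign from the other three variables.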
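To compare loadings across separate fits, it helps to put them on a common sign convention first. The sketch below copies the PC1 loadings printed above and flips any vector whose loadings sum to a negative number. The `align_sign` helper and that particular convention are assumptions made for illustration, not something prcomp provides.

```r
# PC1 loadings copied from the three prcomp() outputs above.
pc1 <- rbind(
  Adelie    = c(-0.4893684, -0.4938218, -0.4364077, -0.5711453),
  Chinstrap = c( 0.4799159,  0.5215085,  0.4918889,  0.5057222),
  Gentoo    = c( 0.4855410,  0.5034179,  0.5035281,  0.5072276)
)
colnames(pc1) <- c("bill_length_mm", "bill_depth_mm",
                   "flipper_length_mm", "body_mass_g")

# Hypothetical convention: flip a vector if its loadings sum to less than zero.
# e and -e describe the same axis and the same variance, so nothing is lost.
align_sign <- function(v) if (sum(v) < 0) -v else v
pc1_aligned <- t(apply(pc1, 1, align_sign))

round(pc1_aligned, 3)
```

After alignment the three rows can be compared entry by entry, which makes it easier to see that the species share a similar first component.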

Do the proportions of variance explained by PC1 differ? What might that tell you about which species has more “structured” variation?

summary(prcomp(adelie, scale = T))

Importance of components:
                          PC1    PC2    PC3     PC4
Standard deviation     1.5252 0.8437 0.7799 0.59476
Proportion of Variance 0.5815 0.1779 0.1521 0.08843
Cumulative Proportion  0.5815 0.7595 0.9116 1.00000

summary(prcomp(chinstrap, scale = T))

Importance of components:
                          PC1    PC2     PC3     PC4
Standard deviation     1.6537 0.7657 0.59634 0.56861
Proportion of Variance 0.6837 0.1466 0.08891 0.08083
Cumulative Proportion  0.6837 0.8303 0.91917 1.00000

summary(prcomp(gentoo, scale = T))

Importance of components:
                         PC1    PC2     PC3     PC4
Standard deviation     1.747 0.6089 0.54788 0.52654
Proportion of Variance 0.763 0.0927 0.07504 0.06931
Cumulative Proportion  0.763 0.8557 0.93069 1.00000

PC1 explains 58% of the variance for Adelie, 68% for Chinstrap, and 76% for Gentoo. Gentoo therefore has the most structured variation: more of the spread in its four measurements lies along a single axis, so it can be summarized with the fewest components.

Produce a single figure that overlays or facets the species-specific PC1 vs. PC2 score plots. Comment on any clustering you observe.

fviz_pca_var(prcomp(adelie, scale = T),
             col.var  = "contrib",
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel    = TRUE,
             title    = "Variable contributions to PCs")

fviz_pca_var(prcomp(chinstrap, scale = T),
             col.var  = "contrib",
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel    = TRUE,
             title    = "Variable contributions to PCs")

fviz_pca_var(prcomp(gentoo, scale = T),
             col.var  = "contrib",
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel    = TRUE,
             title    = "Variable contributions to PCs")

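In the variable plots above, Gentoo and Chinstrap behave very similarly on PC1 and PC2, while Adelie’s arrows point in the opposite direction along PC1. Because the sign of a component is arbitrary, that difference is a reflection of the axis and not a different structure.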
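The prompt asks for score plots (one point per penguin), whereas fviz_pca_var() draws the variable arrows. One possible way to build a single faceted score figure is sketched below. It assumes the palmerpenguins and ggplot2 packages are installed, and the object names `vars`, `complete`, and `scores` are illustrative.

```r
library(palmerpenguins)
library(ggplot2)

vars <- c("bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g")
complete <- as.data.frame(penguins[complete.cases(penguins[, vars]), ])

# Fit a separate scaled PCA per species and keep the first two score columns.
scores <- do.call(rbind, lapply(split(complete, complete$species), function(d) {
  fit <- prcomp(d[, vars], scale. = TRUE)
  data.frame(species = d$species, PC1 = fit$x[, 1], PC2 = fit$x[, 2])
}))

# One panel per species; each panel uses that species' own PC axes.
ggplot(scores, aes(x = PC1, y = PC2, colour = species)) +
  geom_point(alpha = 0.7) +
  facet_wrap(~ species) +
  labs(title = "Species-specific PCA scores",
       x = "PC1 score", y = "PC2 score")
```

Because each panel comes from its own fit, the axes are not directly comparable between species; the figure is useful for looking at spread and clustering within each species.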

2.2 Decathlon: PCA on Athletic Performance

The decathlon2 dataset (from the factoextra package) records results for 27 athletes across 10 events in two competitions.

data("decathlon2", package = "factoextra")
# Use only the 10 event columns (exclude Rank, Points, Competition)
dec <- decathlon2[, 1:10]

2.2.1 Part (a) — Setup and scaling decision

List the 10 events and their units. Should PCA be run on the covariance or correlation matrix here? Justify your answer in one or two sentences.

colnames(dec)

 [1] "X100m"        "Long.jump"    "Shot.put"     "High.jump"    "X400m"       
 [6] "X110m.hurdle" "Discus"       "Pole.vault"   "Javeline"     "X1500m"

The running events (100m, 400m, 110m hurdles, 1500m) are recorded as times in seconds, while the jumps, throws, and pole vault are recorded as distances or heights in meters. This PCA should use scaled data, that is, the correlation matrix, since the events do not share a common unit or scale.

Note: For running and hurdle events a lower time is better; for field events a higher distance/height is better. Does this matter for PCA? Why or why not?

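Scaling makes every event contribute the same variance, but it does not change direction: for the timed events a lower value is better, and for the field events a higher value is better. This does not affect the PCA itself, because negating a variable leaves the eigenvalues and the proportions of variance explained unchanged and only flips the sign of that variable’s loadings. It does matter for interpretation: a negative loading on a running event and a positive loading on a field event both point toward better performance, so we will read the signs event by event in context.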
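A small simulation can illustrate the point about direction. The data below are synthetic stand-ins (a made-up sprint time and jump distance driven by a shared `ability` variable), not the decathlon2 data, and all names are hypothetical.

```r
set.seed(1)

# Synthetic stand-in for two events: a sprint time (lower is better) and a
# jump distance (higher is better), both driven by one underlying ability.
n <- 50
ability <- rnorm(n)
toy <- data.frame(
  sprint_time = 11 - 0.3 * ability + rnorm(n, sd = 0.2),
  jump_dist   = 7.2 + 0.3 * ability + rnorm(n, sd = 0.2)
)

fit_raw <- prcomp(toy, scale. = TRUE)

# Negate the timed event so that higher is better for both columns.
toy_flipped <- transform(toy, sprint_time = -sprint_time)
fit_flipped <- prcomp(toy_flipped, scale. = TRUE)

fit_raw$sdev                 # standard deviations of the PCs
fit_flipped$sdev             # unchanged by the negation
fit_raw$rotation[, 1]        # the two events load with opposite signs
fit_flipped$rotation[, 1]    # the two events load with the same sign
```

Negating the timed columns before the PCA is an optional convenience: it changes nothing about the variance explained, but it lets every loading be read as "higher is better".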

2.2.2 Part (b) — Run PCA and choose the number of components

Run PCA with appropriate scaling. Produce a scree plot and report how many PCs you would retain using:

The 70–80% cumulative variance threshold
The Kaiser criterion (eigenvalue > 1 when using the correlation matrix) [Look this up]

Do the two criteria agree? Which would you go with, and why?

pca_dec <- prcomp(dec, scale = T)
summary(pca_dec)

Importance of components:
                         PC1    PC2    PC3    PC4     PC5     PC6     PC7
Standard deviation     1.936 1.3210 1.2320 1.0160 0.78603 0.65444 0.57089
Proportion of Variance 0.375 0.1745 0.1518 0.1032 0.06178 0.04283 0.03259
Cumulative Proportion  0.375 0.5495 0.7013 0.8045 0.86630 0.90913 0.94172
                           PC8     PC9    PC10
Standard deviation     0.52857 0.43716 0.33511
Proportion of Variance 0.02794 0.01911 0.01123
Cumulative Proportion  0.96966 0.98877 1.00000

screeplot(pca_dec, type = "lines", main = "Scree Plot — Decathlon Measurements")
abline(h = 1, col = "dodgerblue", lty = 2)

The two criteria roughly agree. The cumulative variance passes 70% at three PCs (70.1%) and 80% at four PCs (80.5%). The Kaiser criterion keeps four, since the fourth eigenvalue ($1.016^2 \approx 1.03$) is barely above 1 and the fifth ($0.786^2 \approx 0.62$) is below it. I would use five to be safe, because the fifth PC still adds about 6% of the variance, although both criteria would stop at four.
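Both criteria can be checked numerically from the standard deviations in the summary above. This sketch copies those rounded values by hand, so the results are only as precise as the printed output; the variable names are illustrative.

```r
# Standard deviations copied from summary(pca_dec) above (rounded as printed).
sdev <- c(1.936, 1.3210, 1.2320, 1.0160, 0.78603,
          0.65444, 0.57089, 0.52857, 0.43716, 0.33511)

eigenvalues <- sdev^2                       # variances of the PCs
cum_prop <- cumsum(eigenvalues) / sum(eigenvalues)

n_kaiser <- sum(eigenvalues > 1)            # Kaiser: keep eigenvalues above 1
n_70 <- which(cum_prop >= 0.70)[1]          # first k reaching 70% cumulative
n_80 <- which(cum_prop >= 0.80)[1]          # first k reaching 80% cumulative

round(eigenvalues, 2)
c(kaiser = n_kaiser, cum70 = n_70, cum80 = n_80)
```

With the fitted object available, `pca_dec$sdev` could replace the hand-copied vector.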

2.2.3 Part (c) — Interpret the retained PCs

Examine the loadings of your retained PCs.

pca_dec$rotation[,1:5]

                     PC1        PC2          PC3         PC4        PC5
X100m        -0.42290657 -0.2594748  0.081870461  0.09974877 -0.2796419
Long.jump     0.39189495  0.2887806 -0.005082180 -0.18250903  0.3355025
Shot.put      0.36926619 -0.2135552  0.384621732  0.03553644 -0.3544877
High.jump     0.31422571 -0.4627797  0.003738604  0.07012348  0.3824125
X400m        -0.33248297 -0.1123521  0.418635317  0.26554389  0.2534755
X110m.hurdle -0.36995919 -0.2252392  0.338027983 -0.15726889  0.2048540
Discus        0.37020078 -0.1547241  0.219417086  0.39137188 -0.4319091
Pole.vault   -0.11433982  0.5583051  0.327177839 -0.24759476 -0.3340758
Javeline      0.18341259 -0.0745854  0.564474643 -0.47792535  0.1697426
X1500m        0.03599937  0.4300522  0.286328973  0.64220377  0.3227349

Does PC1 look like an overall athletic ability index, or something more specific?

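Generally, yes. The sprint and hurdle times (100m, 400m, 110m hurdles) have negative loadings, which is the favorable direction for a timed event, and the jumps and throws (long jump, shot put, high jump, discus, javelin) have positive loadings, which is also favorable. Pole vault (-0.11) and the 1500m (0.04) contribute little. A high PC1 score therefore marks an athlete who is fast and strong across most events, so PC1 works as an overall ability index that leans on speed and power more than endurance.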

Do later PCs separate “speed” events from “power/field” events?

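PC3 in particular does this. Nearly all of its loadings are positive, and because a larger time is a worse result, a high PC3 score means strong throws (javelin 0.56, shot put 0.38) together with slower times (400m 0.42, 110m hurdles 0.34). PC3 therefore contrasts field performance with running performance, and it may reflect athletes who are stronger in the field events than on the track.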

Produce a variable loading plot (fviz_pca_var) and a biplot. Describe two or three specific observations about athletes or event groupings that stand out.

fviz_pca_var(pca_dec,
             col.var  = "contrib",
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel    = TRUE,
             title    = "Variable contributions to PCs")

fviz_pca(pca_dec,
         col.var = "contrib",
         gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
         repel   = TRUE,
         title   = "PCA Biplot — Decathlon")

Javelin, discus, shot put, and high jump are grouped together, which may reflect a common factor of explosiveness, meaning the ability to exert force quickly. Athletes like Bernard do this very well. Long jump and all of the track events except the 1500m are also grouped, showing that some athletes are more running-dominant; this group sits almost opposite the power events. Athletes like Sebrle and Karpov seem to be the most balanced, as they lie between the two groupings.

2.2.4 Part (d) — Comparison by competition

The dataset contains two competitions (Decastar and OlympicG). Compute PC1 scores for each athlete and compare the score distributions across competitions using a boxplot or density plot.

Is there evidence that Olympic Games athletes are shifted on the overall PC1 axis relative to Decastar athletes? What would that mean?

decastar <- decathlon2 %>%
  filter(Competition == "Decastar") %>%
  dplyr::select(-Rank, -Points, -Competition)
olympics <- decathlon2 %>%
  filter(Competition == "OlympicG") %>%
  dplyr::select(-Rank, -Points, -Competition)

prcomp(decastar, scale = T)$x[,1]

     SEBRLE        CLAY     BERNARD      YURKOV   ZSIVOCZKY    McMULLEN 
 -1.2632122  -2.5575789   0.1978619   0.7794255  -0.7374314  -1.3157075 
  MARTINEAU       HERNU      BARRAS        NOOL BOURGUIGNON      KARPOV 
  2.0878160   0.8572961   0.6541547   1.9706868   3.0047293  -2.5862681 
    WARNERS 
 -1.0917723

prcomp(olympics, scale = T)$x[,1]

    Sebrle       Clay     Karpov      Macey    Warners  Zsivoczky      Hernu 
 2.6797800  2.8667315  3.6454471  1.0841178  0.4676701 -0.4200804 -1.0037764 
   Bernard   Schwarzl  Pogorelov Schoenbeck     Barras       Nool      Drews 
 0.7311146 -1.9352001 -0.8138519 -1.4718750 -1.6187102 -1.7539722 -2.4573951

boxplot(prcomp(decastar, scale = T)$x[,1],
prcomp(olympics, scale = T)$x[,1], names = c("Decastar", "Olympics"))

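These scores come from two separate PCAs, so each set is centered at zero and the sign of PC1 is arbitrary: Sebrle, Clay, and Karpov have negative scores in the Decastar fit and positive scores in the Olympic fit. The boxplot above therefore cannot show a shift between competitions. A fair comparison would project all athletes onto PC1 of the pooled PCA (pca_dec) and compare those scores by competition. If the Olympic athletes were shifted toward the high-performance end of that shared axis, it would mean that the Olympic field is stronger overall in the speed and power events that drive PC1.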
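A sketch of the pooled comparison follows. It assumes the factoextra package is installed for the decathlon2 data, and the names `pca_pooled` and `pc1_scores` are illustrative. No output is shown because it depends on running the code.

```r
data("decathlon2", package = "factoextra")

# One PCA fitted to all 27 athletes, so every score lies on the same axis.
pca_pooled <- prcomp(decathlon2[, 1:10], scale. = TRUE)

pc1_scores <- data.frame(
  PC1 = pca_pooled$x[, 1],
  Competition = decathlon2$Competition
)

# Compare the PC1 score distributions across the two competitions.
boxplot(PC1 ~ Competition, data = pc1_scores,
        ylab = "PC1 score (pooled PCA)",
        main = "PC1 scores by competition")
tapply(pc1_scores$PC1, pc1_scores$Competition, median)
```

The direction of "better" on this axis still has to be read from the pooled loadings, since the overall sign of PC1 is arbitrary.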

2.3 USArrests: Covariance vs. Correlation PCA

# USArrests: Murder, Assault, UrbanPop, Rape (per 100,000 or %)
head(USArrests)

           Murder Assault UrbanPop Rape
Alabama      13.2     236       58 21.2
Alaska       10.0     263       48 44.5
Arizona       8.1     294       80 31.0
Arkansas      8.8     190       50 19.5
California    9.0     276       91 40.6
Colorado      7.9     204       78 38.7

2.3.1 Part (a) — PCA on the covariance matrix

Run PCA without scaling (scale = FALSE). Report the eigenvalues and PC1/PC2 loadings. Which variable dominates PC1, and why do you think that is? (Hint: check the variances of the four variables first.)

eigen(cov(USArrests))$values[1:2]

[1] 7011.1149  201.9924

prcomp(USArrests, scale = F)$rotation[,1:2]

                PC1         PC2
Murder   0.04170432 -0.04482166
Assault  0.99522128 -0.05876003
UrbanPop 0.04633575  0.97685748
Rape     0.07515550  0.20071807

var(USArrests)

             Murder   Assault   UrbanPop      Rape
Murder    18.970465  291.0624   4.386204  22.99141
Assault  291.062367 6945.1657 312.275102 519.26906
UrbanPop   4.386204  312.2751 209.518776  55.76808
Rape      22.991412  519.2691  55.768082  87.72916

Assault dominates PC1 (loading 0.995) because it is on a much larger numeric scale than the other variables. Its variance (6945) is far larger than those of UrbanPop (210), Rape (88), and Murder (19), so it accounts for the bulk of the total variance, and the direction of maximum variance in the unscaled data is almost entirely the Assault axis.
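The hint about checking variances can be made concrete by computing each variable’s share of the total variance. This is a short sketch using the built-in USArrests data; `v` and `ev` are illustrative names.

```r
# Per-variable variances and each variable's share of the total variance.
v <- sapply(USArrests, var)
round(v, 1)
round(v / sum(v), 3)

# Proportion of total variance carried by PC1 of the unscaled PCA.
ev <- eigen(cov(USArrests))$values
ev[1] / sum(ev)
```

When one variable holds nearly all of the total variance, the first unscaled PC has little choice but to align with it.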

2.3.2 Part (b) — PCA on the correlation matrix

Run PCA with scaling (scale = TRUE). Report the eigenvalues and loadings again.

eigen(cor(USArrests))$values[1:2]

[1] 2.4802416 0.9897652

prcomp(USArrests, scale = T)$rotation[,1:2]

                PC1        PC2
Murder   -0.5358995 -0.4181809
Assault  -0.5831836 -0.1879856
UrbanPop -0.2781909  0.8728062
Rape     -0.5434321  0.1673186

How do the PC1 loadings change compared to part (a)?

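In part (a), Assault had a loading near 1 and the other three loadings were positive but near zero. After scaling, Murder, Assault, and Rape have loadings of similar magnitude (about -0.54 to -0.58) and UrbanPop has a smaller one (-0.28). All four are now negative, but the overall sign of a component is arbitrary.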

What proportion of variance is explained by PC1 and PC2 combined?

(eigen(cor(USArrests))$values[1] + eigen(cor(USArrests))$values[2]) / sum(diag(cor(USArrests)))

[1] 0.8675017

Interpret PC1 and PC2 in plain terms: what does each axis seem to represent?

PC1 seems to be an overall crime score: the three crime variables have similar loadings, so the first axis measures how much violent crime a state has. Because the loadings are negative, a more negative PC1 score means more crime.

PC2 is driven mainly by UrbanPop (0.87), with Murder loading in the opposite direction (-0.42). States with a high PC2 score have large urban populations and relatively low murder rates.

2.3.3 Part (c) — Biplot and state-level interpretation

Using the scaled PCA from part (b), produce a biplot.

pca_arrests <- prcomp(USArrests, scale = TRUE)
fviz_pca(pca_arrests, repel = TRUE)

Identify at least two states that stand out on PC1 (high and low ends) and explain what that means in context.

Florida has the most negative value on the first axis, which means it has a high amount of violent crime relative to the other states. North Dakota has the largest positive value on that axis, meaning it has very low relative crime rates.

Identify at least one state that is unusual on PC2. What characterizes it?

California stands out on PC2: it has the largest urban population in the data (UrbanPop of 91), which pushes it toward the high end of PC2. It also has a strongly negative PC1 value, meaning high crime, so its crime level is roughly what you would expect for such an urbanized state.

Do the four crime/urban variables cluster in an interpretable way in the loading arrows?

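More or less, yes. The arrows for Murder, Assault, and Rape point in a similar direction along PC1, while the UrbanPop arrow points mostly along PC2. The states arrange themselves accordingly: high-crime states are grouped, as are states with high crime and high urban population, and states with low urban population and low crime, so each state’s position has a clear explanation in terms of its measurements.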

2.3.4 Part (d) — Which PCA would you report?

Write 3–5 sentences making a recommendation: covariance-based or correlation-based PCA for this dataset? Justify your choice by referring to variable scales, units, and what the analyst is trying to learn.

I would report the correlation-based PCA for this dataset. The four variables are in different units and on very different scales (arrests per 100,000 for the three crimes, a percentage for UrbanPop), so the covariance-based PCA told us little beyond the fact that Assault has the largest variance. After standardizing, PC1 summarized the three crime types together and PC2 captured urban population, which revealed how crime relates to urbanization. The biplot then showed groups of states characterized by their crime rates relative to their urban population, which is what an analyst would want to learn.

3 Reflection (short answer)

Answer each question in 3–5 sentences.

R1. Your notes state that PCA is not equivariant with respect to scale. In your own words, explain what this means and why it matters practically. Give an example (it can be one from class or made up) where ignoring this could lead you to wrong conclusions.

PCA is not scale-equivariant: rescaling a variable, for example by changing its units, changes the principal components themselves and not just the size of one loading. Variables with larger variance pull the leading PCs toward themselves, so a variable can look important simply because of its units. In the USArrests data, Assault dominated PC1 on the covariance matrix only because its variance was far larger than the others’. After standardizing, its loading was similar to those of Murder and Rape, which shows that the choice between covariance and correlation PCA matters when drawing conclusions.
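The effect can be seen by changing the units of a single variable. The sketch below re-expresses Assault per 1,000 residents (dividing by 100) and refits the covariance PCA; the rescaling is a made-up example chosen for illustration, and the object names are hypothetical.

```r
# Original units: Assault is arrests per 100,000 residents.
pc1_original <- prcomp(USArrests, scale. = FALSE)$rotation[, 1]

# Same data, but with Assault expressed per 1,000 residents.
rescaled <- transform(USArrests, Assault = Assault / 100)
pc1_rescaled <- prcomp(rescaled, scale. = FALSE)$rotation[, 1]

# Correlation-based PC1 for both versions, for comparison.
pc1_cor_a <- prcomp(USArrests, scale. = TRUE)$rotation[, 1]
pc1_cor_b <- prcomp(rescaled, scale. = TRUE)$rotation[, 1]

names(which.max(abs(pc1_original)))         # variable dominating PC1 before
names(which.max(abs(pc1_rescaled)))         # variable dominating PC1 after
all.equal(abs(pc1_cor_a), abs(pc1_cor_b))   # correlation PCA ignores the units
```

The covariance-based answer depends on an arbitrary choice of units, while the correlation-based loadings are the same either way (up to sign).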

R2. The “proportion of variance explained” by the first $r$ PCs is often used to choose how many components to retain. Describe one situation where this criterion could be misleading — that is, where a high proportion of variance explained does not mean the PCA was useful or interpretable.

A high proportion of variance explained does not guarantee that the retained PCs are the useful ones. In the unscaled USArrests PCA, for example, PC1 explains about 97% of the variance but only restates that Assault has the largest variance. Conversely, a PC that explains little of the variance can still lead to interesting observations about how the variables relate, for example by describing rarer cases. Looking at what each PC actually contrasts can therefore be more informative than the variance proportion alone.

R3. You ran species-stratified PCA on penguins in Question 2c. Explain conceptually why running PCA on a pooled dataset (all species together) could give a different — and potentially misleading — first PC compared to running it within each species.

Pooling mixes two sources of variation: differences between species and variation within each species. If the species have very different average measurements, the pooled correlations, and therefore the pooled PC1, can mostly describe the differences between species. In the pooled penguin PCA, bill depth loads negatively on PC1 while the other three variables load positively, yet within each species all four loadings share a sign. The pooled PC1 could therefore mislead us into thinking that deeper bills go with smaller bodies in individual penguins, when that pattern comes from comparing species. Stratifying by species shows how the measurements vary together within a species.
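This can be stated more precisely with the standard decomposition of the pooled scatter matrix into within-species and between-species parts. The notation is introduced here for illustration: $x_{gi}$ is the measurement vector for penguin $i$ of species $g$, $\bar{x}_g$ is the mean for species $g$, $\bar{x}$ is the overall mean, and $n_g$ is the number of penguins in species $g$.

\[ \sum_{g}\sum_{i=1}^{n_g} (x_{gi} - \bar{x})(x_{gi} - \bar{x})' = \sum_{g}\sum_{i=1}^{n_g} (x_{gi} - \bar{x}_g)(x_{gi} - \bar{x}_g)' + \sum_{g} n_g (\bar{x}_g - \bar{x})(\bar{x}_g - \bar{x})' \]

Pooled PCA works from a standardized version of the left-hand side, so when the between-species term is large, the pooled PC1 can point along the differences in species means instead of along any within-species pattern.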