For problems 1 and 2, work by hand or in code — but do not use prcomp or any built-in PCA function. Show your eigendecomposition steps explicitly.
Consider the covariance matrix
\[ \Sigma = \begin{pmatrix} 1 & -2 \\ -2 & 5 \end{pmatrix} \]
eigen(matrix(c(1, -2, -2, 5), ncol = 2, byrow = TRUE))
eigen() decomposition
$values
[1] 5.8284271 0.1715729
$vectors
           [,1]       [,2]
[1,] -0.3826834 -0.9238795
[2,]  0.9238795 -0.3826834
(a) Find the eigenvalues \(\lambda_1 \geq \lambda_2\) of \(\Sigma\).
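\(\det(\Sigma - \lambda I) = (1 - \lambda)(5 - \lambda) - (-2)(-2) = \lambda^2 - 6\lambda + 1 = 0\), so \(\lambda = 3 \pm 2\sqrt{2}\), giving \(\lambda_1 \approx 5.83\) and \(\lambda_2 \approx 0.17\).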
(b) Find the corresponding unit eigenvectors \(e_1\) and \(e_2\).
\(e_1 = (-0.383, 0.924)'\)
\(e_2 = (-0.924, -0.383)'\)
(c) Write out the first and second population principal components \(Y_1 = e_1'X\) and \(Y_2 = e_2'X\).
PC1: \(Y_1 = e_1'X = -0.383X_1 + 0.924X_2\)
PC2: \(Y_2 = e_2'X = -0.924X_1 - 0.383X_2\)
(d) Compute \(\text{Var}(Y_1)\) and \(\text{Var}(Y_2)\). Verify that \(\text{Var}(Y_1) + \text{Var}(Y_2) = \text{tr}(\Sigma)\).
\(\text{Var}(Y_1) = e_1'\Sigma e_1 = \lambda_1 e_1'e_1 = \lambda_1 \approx 5.83\)
\(\text{Var}(Y_2) = e_2'\Sigma e_2 = \lambda_2 e_2'e_2 = \lambda_2 \approx 0.17\)
\(\text{Var}(Y_1) + \text{Var}(Y_2) = 5.83 + 0.17 = 6 = \text{tr}(\Sigma) = 1 + 5\)
(e) What proportion of the total variance is explained by the first PC?
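The proportion explained by the first PC is the larger eigenvalue divided by the total variance: \(\lambda_1 / (\lambda_1 + \lambda_2) = 5.828/6 \approx 0.971\), or about 97.1%.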
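The hand calculation for this problem can be checked numerically. The sketch below is one way to do it, not the required method: it solves the characteristic polynomial with the quadratic formula and builds each unit eigenvector from the first row of \((\Sigma - \lambda I)x = 0\). The helper name `unit_eigvec` is made up for this illustration.

```r
# Covariance matrix from Problem 1.
Sigma <- matrix(c(1, -2, -2, 5), ncol = 2, byrow = TRUE)

# Eigenvalues from lambda^2 - tr(Sigma) * lambda + det(Sigma) = 0.
tr_S  <- sum(diag(Sigma))
det_S <- Sigma[1, 1] * Sigma[2, 2] - Sigma[1, 2] * Sigma[2, 1]
lambda <- (tr_S + c(1, -1) * sqrt(tr_S^2 - 4 * det_S)) / 2

# Hypothetical helper: a unit eigenvector for eigenvalue l of a 2 x 2 matrix S.
# v = (S12, l - S11) satisfies the first row (S11 - l) v1 + S12 v2 = 0.
unit_eigvec <- function(S, l) {
  v <- c(S[1, 2], l - S[1, 1])
  v / sqrt(sum(v^2))
}
E <- sapply(lambda, function(l) unit_eigvec(Sigma, l))  # columns are e1, e2

# Variances of the PCs and the proportion explained by PC1.
pc_var <- diag(t(E) %*% Sigma %*% E)
prop_pc1 <- lambda[1] / sum(lambda)

round(lambda, 4)
round(E, 4)
round(prop_pc1, 4)
```

The eigenvectors are only determined up to sign, so a different but equally valid answer would multiply either column of `E` by -1.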
Consider the covariance matrix
\[ \Sigma = \begin{pmatrix} 5 & 2 \\ 2 & 2 \end{pmatrix} \]
(a) Find the eigenvalues and eigenvectors of \(\Sigma\).
\(\det(\Sigma - \lambda I) = (5 - \lambda)(2 - \lambda) - (2)(2) = \lambda^2 - 7\lambda + 6 = (\lambda - 6)(\lambda - 1) = 0\)
\(\lambda_1 = 6\), \(\lambda_2 = 1\)
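For \(\lambda_1 = 6\), \((\Sigma - 6I)x = 0\) gives
-x1 + 2x2 = 0
2x1 - 4x2 = 0
so x1 = 2x2. Taking x2 = 1 gives x1 = 2, and normalizing to unit length gives \(e_1 = (2/\sqrt{5}, 1/\sqrt{5})'\).
For \(\lambda_2 = 1\), \((\Sigma - I)x = 0\) gives
4x1 + 2x2 = 0
2x1 + x2 = 0
so x2 = -2x1. Taking x1 = 1 gives x2 = -2, and normalizing to unit length gives \(e_2 = (1/\sqrt{5}, -2/\sqrt{5})'\).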
(b) Determine the first two principal components and their variances.
PC1: \(Y_1 = e_1'X = (2/\sqrt{5})X_1 + (1/\sqrt{5})X_2\)
PC2: \(Y_2 = e_2'X = (1/\sqrt{5})X_1 - (2/\sqrt{5})X_2\)
\(\text{Var}(Y_1) = \lambda_1 = 6\)
\(\text{Var}(Y_2) = \lambda_2 = 1\)
(c) Compute \(\text{Cov}(Y_1, Y_2)\) and verify the PCs are uncorrelated.
\(\text{Cov}(Y_1, Y_2) = \text{Cov}(e_1'X, e_2'X) = e_1'\Sigma e_2\)
\(\Sigma e_2 = (5/\sqrt{5} - 4/\sqrt{5},\; 2/\sqrt{5} - 4/\sqrt{5})' = (1/\sqrt{5}, -2/\sqrt{5})' = \lambda_2 e_2\)
\(e_1'\Sigma e_2 = (2/\sqrt{5})(1/\sqrt{5}) + (1/\sqrt{5})(-2/\sqrt{5}) = 2/5 - 2/5 = 0\), so \(\text{Cov}(Y_1, Y_2) = 0\) and the PCs are uncorrelated.
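As a numerical cross-check of parts (b) and (c), the covariance matrix of \((Y_1, Y_2)\) is \(E'\Sigma E\), where the columns of \(E\) are the unit eigenvectors found by hand. A minimal sketch using only base R matrix algebra:

```r
# Covariance matrix and hand-derived unit eigenvectors from this problem.
Sigma <- matrix(c(5, 2, 2, 2), ncol = 2, byrow = TRUE)
e1 <- c(2, 1) / sqrt(5)
e2 <- c(1, -2) / sqrt(5)
E <- cbind(e1, e2)

# Covariance matrix of the principal components (Y1, Y2).
# The diagonal holds Var(Y1) and Var(Y2); the off-diagonal is Cov(Y1, Y2).
cov_Y <- t(E) %*% Sigma %*% E
round(cov_Y, 10)
```

The result should be diagonal, with the eigenvalues 6 and 1 on the diagonal and zero covariance off the diagonal.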
(d) Compute the proportion of variance explained by \(Y_1\) alone. Does a single PC seem sufficient here? Explain briefly.
\(\lambda_1 / (\lambda_1 + \lambda_2) = 6/7 \approx 0.857\). A common rule of thumb is to retain enough PCs to explain roughly 70% to 90% of the total variance, so a single PC is usually sufficient here. We may still want to keep the second PC if it captures a contrast that matters in context and that the first does not.
The dataset T8-4.DAT contains weekly rates of return for five stocks: JP Morgan (jpmorgan), Citibank (citi), Wells Fargo (wells), Royal Dutch Shell (shell), and ExxonMobil (exmob).
stonks <- read.table("/Users/jdumanski24/Desktop/Stat 388/johnson_wichern_data/T8-4.DAT",
                     col.names = c("jpmorgan", "citi", "wells", "shell", "exmob"))
(a) Compute the sample covariance matrix \(S\). Obtain the eigenvalues and eigenvectors of \(S\) (you may use eigen() directly). Determine the first two sample principal components and the proportion of total sample variance each explains.
eigen(cov(stonks)) # eigenvalues and eigenvectors
eigen() decomposition
$values
[1] 0.0013676780 0.0007011596 0.0002538024 0.0001426026 0.0001188868
$vectors
[,1] [,2] [,3] [,4] [,5]
[1,] 0.2228228 -0.6252260 0.32611218 0.6627590 0.11765952
[2,] 0.3072900 -0.5703900 -0.24959014 -0.4140935 -0.58860803
[3,] 0.1548103 -0.3445049 -0.03763929 -0.4970499 0.78030428
[4,] 0.6389680 0.2479475 -0.64249741 0.3088689 0.14845546
[5,] 0.6509044 0.3218478 0.64586064 -0.2163758 -0.09371777
pc1 <- eigen(cov(stonks))$vectors[,1]
pc2 <- eigen(cov(stonks))$vectors[,2]
pc1 # loadings of the first principal component
[1] 0.2228228 0.3072900 0.1548103 0.6389680 0.6509044
pc2 # loadings of the second principal component
[1] -0.6252260 -0.5703900 -0.3445049 0.2479475 0.3218478
eigen(cov(stonks))$values[1]/sum(diag(cov(stonks))) # proportion of variance explained by PC1
[1] 0.5292607
eigen(cov(stonks))$values[2]/sum(diag(cov(stonks))) # proportion of variance explained by PC2
[1] 0.271333
The first two sample PCs explain about 52.9% and 27.1% of the total sample variance, respectively.
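The proportions for all five components can be computed in one step from the eigenvalues, which avoids calling eigen() repeatedly. The sketch below hard-codes the eigenvalues printed above so that it runs without the data file; with the data loaded, `eig_vals` would instead be `eigen(cov(stonks))$values`.

```r
# Eigenvalues of the sample covariance matrix, copied from the output above.
eig_vals <- c(0.0013676780, 0.0007011596, 0.0002538024,
              0.0001426026, 0.0001188868)

# Proportion of total sample variance explained by each PC, and the running total.
prop_var <- eig_vals / sum(eig_vals)
cum_var  <- cumsum(prop_var)

round(rbind(proportion = prop_var, cumulative = cum_var), 4)
```

This works because the sum of the eigenvalues equals the trace of \(S\), which is the total sample variance.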
(b) Interpret the first PC. What does the combination of loadings suggest about what \(\hat{Y}_1\) is measuring? Are the two industry groups (banking vs. oil) distinguishable in the first or second PC?
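All five loadings on the first principal component are positive, so \(\hat{Y}_1\) is a weighted average of the five returns, which suggests an overall market component. It weights the oil stocks most heavily: Shell and ExxonMobil have loadings above 0.6, while the three banks are all below 0.31. The two industry groups are most clearly distinguished in the second PC, where the banks have negative loadings and the oil stocks have positive loadings, so \(\hat{Y}_2\) is a contrast between banking and oil.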
library(palmerpenguins)
library(tidyverse)
library(factoextra)
data("penguins")
Compute the pairwise correlation matrix among bill_length_mm, bill_depth_mm, flipper_length_mm, and body_mass_g (complete cases only).
corrplot or ggcorrplot).
penguin_stats <- penguins %>%
  dplyr::select(bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g) %>%
  drop_na()
cor(penguin_stats)
                  bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
bill_length_mm         1.0000000    -0.2350529         0.6561813   0.5951098
bill_depth_mm         -0.2350529     1.0000000        -0.5838512  -0.4719156
flipper_length_mm      0.6561813    -0.5838512         1.0000000   0.8712018
body_mass_g            0.5951098    -0.4719156         0.8712018   1.0000000
library(corrplot)
corrplot 0.95 loaded
corrplot(cor(penguin_stats), method = "ellipse")
Bill length is moderately to strongly positively correlated with flipper length (0.66) and body mass (0.60), while bill depth is negatively correlated with the other three variables. The strongest relationship is the positive correlation between flipper length and body mass (0.87).
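We would use the correlation matrix (that is, standardize the variables) because the variables are measured in different units and on very different scales: body mass is in grams, while the bill and flipper measurements are in millimetres. With the covariance matrix, the variable with the largest variance would dominate the PCs; the correlation matrix gives each variable unit variance and therefore equal weight.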
Run a PCA on the four variables (all species combined, complete cases, using the scaling you chose in part a).
(pca_penguins <- prcomp(penguin_stats, scale. = TRUE))
Standard deviations (1, .., p=4):
[1] 1.6594442 0.8789293 0.6043475 0.3293816
Rotation (n x k) = (4 x 4):
PC1 PC2 PC3 PC4
bill_length_mm 0.4552503 -0.597031143 -0.6443012 0.1455231
bill_depth_mm -0.4003347 -0.797766572 0.4184272 -0.1679860
flipper_length_mm 0.5760133 -0.002282201 0.2320840 -0.7837987
body_mass_g 0.5483502 -0.084362920 0.5966001 0.5798821
summary(pca_penguins)
Importance of components:
PC1 PC2 PC3 PC4
Standard deviation 1.6594 0.8789 0.60435 0.32938
Proportion of Variance 0.6884 0.1931 0.09131 0.02712
Cumulative Proportion 0.6884 0.8816 0.97288 1.00000
screeplot(pca_penguins, type = "lines", main = "Scree Plot: Penguin Measurements")
abline(h = 1, col = "dodgerblue", lty = 2)
Each PC's variance (eigenvalue) is the square of its standard deviation, so the eigenvalues are about 2.75, 0.77, 0.37, and 0.11. The Kaiser criterion (eigenvalue above 1, the dashed line) would therefore retain only PC1. However, PC1 alone explains only 69% of the variance, while the first two PCs together explain 88%, and the remaining two PCs add only about 12%. That is an acceptable amount, so we will choose the first two PCs.
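Both retention rules can be read directly off the standard deviations that prcomp() reports. The sketch below wraps them in a small helper; `retain_summary` and its 80% default are assumptions made for this illustration, and the Kaiser cutoff of 1 only makes sense for a correlation-based (scaled) PCA, where the average eigenvalue is 1. The standard deviations are copied from the output above so that the block runs without the penguin data.

```r
# Hypothetical helper: number of PCs retained under two common rules.
#   sdev       - component standard deviations, e.g. prcomp(...)$sdev
#   var_target - cumulative proportion of variance to reach (assumed 0.80)
retain_summary <- function(sdev, var_target = 0.80) {
  eig <- sdev^2                          # eigenvalues = component variances
  cum_prop <- cumsum(eig) / sum(eig)     # cumulative proportion explained
  list(
    eigenvalues = eig,
    kaiser      = sum(eig > 1),                      # eigenvalue > 1 rule
    cumulative  = which(cum_prop >= var_target)[1]   # first PC reaching target
  )
}

# Standard deviations from the pooled penguin PCA printed above.
penguin_sdev <- c(1.6594442, 0.8789293, 0.6043475, 0.3293816)
res <- retain_summary(penguin_sdev)
res
```

With the fitted object available, the same call would be `retain_summary(pca_penguins$sdev)`.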
pca_penguins$rotation[,c(1,2)]
PC1 PC2
bill_length_mm 0.4552503 -0.597031143
bill_depth_mm -0.4003347 -0.797766572
flipper_length_mm 0.5760133 -0.002282201
body_mass_g 0.5483502 -0.084362920
On the first PC, bill length, flipper length, and body mass all have sizeable positive loadings (0.46 to 0.58), while bill depth has a negative loading of similar magnitude (-0.40). PC1 therefore contrasts overall body size with bill depth: penguins that score high are large with relatively shallow bills. One speculative functional reading is that a disproportionately deep bill is less hydrodynamic when swimming, but the loadings alone cannot establish this. On the second PC all loadings are negative, but only bill length (-0.60) and bill depth (-0.80) are large, so PC2 is essentially a bill-size axis that is nearly unrelated to flipper length and body mass. Again speculatively, this could relate to tasks where the bill matters most, such as precise fishing.
Repeat the PCA from part (b) separately for each of the three species (Adelie, Chinstrap, Gentoo).
adelie <- penguins %>%
filter(species == "Adelie") %>%
dplyr::select(bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g) %>%
drop_na()
chinstrap <- penguins %>%
filter(species == "Chinstrap") %>%
dplyr::select(bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g) %>%
drop_na()
gentoo <- penguins %>%
filter(species == "Gentoo") %>%
dplyr::select(bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g) %>%
drop_na()
prcomp(adelie, scale. = TRUE)$rotation
PC1 PC2 PC3 PC4
bill_length_mm -0.4893684 0.25148465 0.77083131 -0.3210812
bill_depth_mm -0.4938218 0.42468847 -0.62890872 -0.4245628
flipper_length_mm -0.4364077 -0.86610403 -0.08900771 -0.2269135
body_mass_g -0.5711453 0.07911372 -0.04868819 0.8155756
prcomp(chinstrap, scale. = TRUE)$rotation
PC1 PC2 PC3 PC4
bill_length_mm 0.4799159 -0.6790106 -0.03702704 -0.55430518
bill_depth_mm 0.5215085 -0.2793392 -0.09797736 0.80024930
flipper_length_mm 0.4918889 0.5513227 -0.64139697 -0.20663636
body_mass_g 0.5057222 0.3961786 0.76002590 -0.09822548
prcomp(gentoo, scale. = TRUE)$rotation
PC1 PC2 PC3 PC4
bill_length_mm 0.4855410 -0.8562822 -0.0504436 -0.1687783
bill_depth_mm 0.5034179 0.4381959 -0.2676349 -0.6949290
flipper_length_mm 0.5035281 0.2024235 0.8212595 0.1761166
body_mass_g 0.5072276 0.1838200 -0.5013580 0.6764396
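At first glance the Adelie species looks unique, since all of its PC1 loadings are negative while those of the other two species are all positive. However, the sign of an eigenvector is arbitrary, so this is not a real difference: within every species all four PC1 loadings share the same sign and have similar magnitude (about 0.44 to 0.57), which makes PC1 an overall size component in each case. The real change is relative to the pooled PCA, where bill depth loaded with the opposite sign to the other three variables.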
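Because an eigenvector and its negative are equally valid, loadings are easier to compare across fits after fixing a sign convention. The sketch below uses one possible convention (flip each PC1 so that its loadings sum to a positive number) on the PC1 columns printed above; the convention itself is an arbitrary choice made for this illustration.

```r
# PC1 loadings for each species, copied from the output above.
pc1 <- cbind(
  adelie    = c(-0.4893684, -0.4938218, -0.4364077, -0.5711453),
  chinstrap = c( 0.4799159,  0.5215085,  0.4918889,  0.5057222),
  gentoo    = c( 0.4855410,  0.5034179,  0.5035281,  0.5072276)
)
rownames(pc1) <- c("bill_length_mm", "bill_depth_mm",
                   "flipper_length_mm", "body_mass_g")

# Multiply each column by the sign of its sum so every PC1 points the same way.
pc1_oriented <- sweep(pc1, 2, sign(colSums(pc1)), "*")
round(pc1_oriented, 3)
```

After reorienting, the three columns are directly comparable, and flipping a column does not change the variance that component explains.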
summary(prcomp(adelie, scale. = TRUE))
Importance of components:
PC1 PC2 PC3 PC4
Standard deviation 1.5252 0.8437 0.7799 0.59476
Proportion of Variance 0.5815 0.1779 0.1521 0.08843
Cumulative Proportion 0.5815 0.7595 0.9116 1.00000
summary(prcomp(chinstrap, scale. = TRUE))
Importance of components:
PC1 PC2 PC3 PC4
Standard deviation 1.6537 0.7657 0.59634 0.56861
Proportion of Variance 0.6837 0.1466 0.08891 0.08083
Cumulative Proportion 0.6837 0.8303 0.91917 1.00000
summary(prcomp(gentoo, scale. = TRUE))
Importance of components:
PC1 PC2 PC3 PC4
Standard deviation 1.747 0.6089 0.54788 0.52654
Proportion of Variance 0.763 0.0927 0.07504 0.06931
Cumulative Proportion 0.763 0.8557 0.93069 1.00000
PC1 explains 58.2% of the variance for Adelie, 68.4% for Chinstrap, and 76.3% for Gentoo. Gentoo therefore has the most one-dimensional variance structure: a single component summarizes its measurements better than it does for the other two species.
fviz_pca_var(prcomp(adelie, scale. = TRUE),
             col.var = "contrib",
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel = TRUE,
             title = "Variable contributions to PCs")
fviz_pca_var(prcomp(chinstrap, scale. = TRUE),
             col.var = "contrib",
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel = TRUE,
             title = "Variable contributions to PCs")
fviz_pca_var(prcomp(gentoo, scale. = TRUE),
             col.var = "contrib",
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel = TRUE,
             title = "Variable contributions to PCs")
In the plots, Adelie looks very different from Gentoo and Chinstrap on PC1 because its arrows point in the opposite direction, but that only reflects the arbitrary sign of the eigenvector. Once the sign is accounted for, PC1 is very similar in all three species. The species differ mainly in PC2, which is dominated by flipper length for Adelie and by bill length for Chinstrap and Gentoo.
The decathlon2 dataset (from the factoextra package) records results for 27 athletes across 10 events in two competitions.
data("decathlon2", package = "factoextra")
# Use only the 10 event columns (exclude Rank, Points, Competition)
dec <- decathlon2[, 1:10]
List the 10 events and their units. Should PCA be run on the covariance or correlation matrix here? Justify your answer in one or two sentences.
colnames(dec)
 [1] "X100m"        "Long.jump"    "Shot.put"     "High.jump"    "X400m"
 [6] "X110m.hurdle" "Discus"       "Pole.vault"   "Javeline"     "X1500m"
The four running events (100m, 400m, 110m hurdles, and 1500m) are measured in time (seconds), while the six jumping and throwing events are measured in distance or height (metres). The PCA should standardize the data and use the correlation matrix, since the events are not on a common scale.
Note: For running and hurdle events a lower time is better; for field events a higher distance/height is better. Does this matter for PCA? Why or why not?
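Using the correlation matrix puts every event on the same scale, but it does not change the direction in which each event is scored. This does not affect the PCA itself: reversing the sign of a variable leaves the eigenvalues and the proportion of variance explained unchanged and only flips the sign of that variable's loadings. It does matter for interpretation, because a negative loading on a timed event and a positive loading on a field event both indicate better performance. We will therefore read the signs of the loadings event by event, in context.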
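The claim that scoring direction affects only the signs of the loadings can be checked on a small example. The data below are simulated purely for illustration (the variable names and coefficients are made up and are not the decathlon data): one timed event where lower is better and two distance events where higher is better.

```r
set.seed(42)  # made-up data, for illustration only
n <- 50
ability <- rnorm(n)
events <- data.frame(
  sprint_time = 11 - 0.3 * ability + rnorm(n, sd = 0.2),  # seconds, lower is better
  jump_dist   = 7  + 0.4 * ability + rnorm(n, sd = 0.3),  # metres, higher is better
  throw_dist  = 14 + 0.8 * ability + rnorm(n, sd = 0.7)   # metres, higher is better
)

# Same data with the timed event negated, so that higher is better everywhere.
flipped <- transform(events, sprint_time = -sprint_time)

pca_original <- prcomp(events,  scale. = TRUE)
pca_flipped  <- prcomp(flipped, scale. = TRUE)

# Component variances are identical ...
same_sdev <- isTRUE(all.equal(pca_original$sdev, pca_flipped$sdev))
# ... and the loadings agree in absolute value; only signs can differ.
same_abs_loadings <- isTRUE(all.equal(abs(pca_original$rotation),
                                      abs(pca_flipped$rotation)))
c(same_sdev = same_sdev, same_abs_loadings = same_abs_loadings)
```

Negating a variable multiplies the corresponding row and column of the correlation matrix by -1, which leaves its eigenvalues unchanged.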
Run PCA with appropriate scaling. Produce a scree plot and report how many PCs you would retain using:
Do the two criteria agree? Which would you go with, and why?
pca_dec <- prcomp(dec, scale. = TRUE)
summary(pca_dec)
Importance of components:
PC1 PC2 PC3 PC4 PC5 PC6 PC7
Standard deviation 1.936 1.3210 1.2320 1.0160 0.78603 0.65444 0.57089
Proportion of Variance 0.375 0.1745 0.1518 0.1032 0.06178 0.04283 0.03259
Cumulative Proportion 0.375 0.5495 0.7013 0.8045 0.86630 0.90913 0.94172
PC8 PC9 PC10
Standard deviation 0.52857 0.43716 0.33511
Proportion of Variance 0.02794 0.01911 0.01123
Cumulative Proportion 0.96966 0.98877 1.00000
screeplot(pca_dec, type = "lines", main = "Scree Plot: Decathlon Measurements")
abline(h = 1, col = "dodgerblue", lty = 2)
The two criteria agree. The Kaiser criterion retains four PCs, since the fourth eigenvalue (\(1.016^2 \approx 1.03\)) is just above 1 while the fifth (\(0.786^2 \approx 0.62\)) is below it, and the first four PCs explain just over 80% of the variance (80.45%). Because the fourth PC only barely passes both cutoffs and the fifth still adds about 6% of the variance (86.6% cumulative), I would retain five to be safe.
Examine the loadings of your retained PCs.
pca_dec$rotation[,1:5]
PC1 PC2 PC3 PC4 PC5
X100m -0.42290657 -0.2594748 0.081870461 0.09974877 -0.2796419
Long.jump 0.39189495 0.2887806 -0.005082180 -0.18250903 0.3355025
Shot.put 0.36926619 -0.2135552 0.384621732 0.03553644 -0.3544877
High.jump 0.31422571 -0.4627797 0.003738604 0.07012348 0.3824125
X400m -0.33248297 -0.1123521 0.418635317 0.26554389 0.2534755
X110m.hurdle -0.36995919 -0.2252392 0.338027983 -0.15726889 0.2048540
Discus 0.37020078 -0.1547241 0.219417086 0.39137188 -0.4319091
Pole.vault -0.11433982 0.5583051 0.327177839 -0.24759476 -0.3340758
Javeline 0.18341259 -0.0745854 0.564474643 -0.47792535 0.1697426
X1500m 0.03599937 0.4300522 0.286328973 0.64220377 0.3227349
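Generally, yes. The sprint and hurdle events (100m, 400m, 110m hurdles) have negative loadings on PC1, which is preferable because a lower time is better, and the jumps and throws (long jump, shot put, high jump, discus, javelin) have positive loadings, which is also preferable. The exceptions are pole vault (-0.11) and the 1500m (0.04), which contribute little to PC1. The first PC therefore seems to capture the characteristics of winning athletes: those with high PC1 scores tend to run faster and to jump and throw farther.
PC3 in particular does this, as nearly all events have positive loadings (long jump is essentially zero at -0.005). Because a positive loading is unfavorable for a timed event but favorable for a field event, PC3 differentiates the track events from the field events. This could reflect variance explained by athletes who concentrate on the field events.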
fviz_pca_var) and a biplot. Describe two or three specific observations about athletes or event groupings that stand out.
fviz_pca_var(pca_dec,
             col.var = "contrib",
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel = TRUE,
             title = "Variable contributions to PCs")
fviz_pca(pca_dec,
         col.var = "contrib",
         gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
         repel = TRUE,
         title = "PCA Biplot: Decathlon")
Javelin, discus, shot put, and high jump are grouped together, which may reflect a common factor of explosiveness, that is, the ability to exert force quickly. Athletes like Bernard do this very well. Long jump and all of the track events except the 1500m form a second group, pointing to athletes who are more running-dominant; this group lies almost opposite the power events. Athletes like Sebrle and Karpov appear to be the most balanced, as they lie between the two groups.
The dataset contains two competitions (Decastar and OlympicG). Compute PC1 scores for each athlete and compare the score distributions across competitions using a boxplot or density plot.
Is there evidence that Olympic Games athletes are shifted on the overall PC1 axis relative to Decastar athletes? What would that mean?
decastar <- decathlon2 %>%
filter(Competition == "Decastar") %>%
dplyr::select(-Rank, -Points, -Competition)
olympics <- decathlon2 %>%
filter(Competition == "OlympicG") %>%
dplyr::select(-Rank, -Points, -Competition)
prcomp(decastar, scale. = TRUE)$x[,1]
SEBRLE CLAY BERNARD YURKOV ZSIVOCZKY McMULLEN
-1.2632122 -2.5575789 0.1978619 0.7794255 -0.7374314 -1.3157075
MARTINEAU HERNU BARRAS NOOL BOURGUIGNON KARPOV
2.0878160 0.8572961 0.6541547 1.9706868 3.0047293 -2.5862681
WARNERS
-1.0917723
prcomp(olympics, scale. = TRUE)$x[,1]
Sebrle Clay Karpov Macey Warners Zsivoczky Hernu
2.6797800 2.8667315 3.6454471 1.0841178 0.4676701 -0.4200804 -1.0037764
Bernard Schwarzl Pogorelov Schoenbeck Barras Nool Drews
0.7311146 -1.9352001 -0.8138519 -1.4718750 -1.6187102 -1.7539722 -2.4573951
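boxplot(prcomp(decastar, scale. = TRUE)$x[,1],
        prcomp(olympics, scale. = TRUE)$x[,1], names = c("Decastar", "Olympics"))
The PC1 score distributions for the Olympics and Decastar look slightly different, with the Olympic scores leaning more negative. This comparison should be read with caution, however. Each PCA here is fit separately, so each set of scores is centered at zero on its own PC1 axis, and the sign and loadings of PC1 can differ between the two fits. A genuine shift between competitions can only be assessed by scoring all athletes on one common PC1 axis from the pooled PCA. On that axis the track events load negatively and the field events positively, so a shift would indicate a difference in overall performance between the two competitions rather than a greater focus on the track events.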
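One way to put both competitions on a common axis is to fit a single PCA to all 27 athletes and then split the PC1 scores by competition. This is a sketch of that alternative, not the analysis run above; it assumes the factoextra package is installed, and `pca_pooled` and `pc1_by_comp` are names introduced here for illustration.

```r
# Assumes the factoextra package is installed; it supplies decathlon2.
data("decathlon2", package = "factoextra")

# One PCA on the 10 event columns for all athletes, so scores share an axis.
pca_pooled <- prcomp(decathlon2[, 1:10], scale. = TRUE)

# PC1 score for each athlete, labelled by competition.
pc1_by_comp <- data.frame(
  athlete     = rownames(decathlon2),
  competition = decathlon2$Competition,
  pc1         = pca_pooled$x[, 1]
)

# Compare the two score distributions on the common PC1 axis.
boxplot(pc1 ~ competition, data = pc1_by_comp,
        ylab = "PC1 score (pooled PCA)")
aggregate(pc1 ~ competition, data = pc1_by_comp, FUN = median)
```

The pooled scores average to zero over all athletes, but not within each competition, so a difference in medians between the groups is now interpretable. The sign of PC1 is still arbitrary, so the loadings should be checked before deciding which direction means better performance.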
# USArrests: Murder, Assault, UrbanPop, Rape (per 100,000 or %)
head(USArrests)
           Murder Assault UrbanPop Rape
Alabama      13.2     236       58 21.2
Alaska       10.0     263       48 44.5
Arizona       8.1     294       80 31.0
Arkansas      8.8     190       50 19.5
California    9.0     276       91 40.6
Colorado      7.9     204       78 38.7
Run PCA without scaling (scale = FALSE). Report the eigenvalues and PC1/PC2 loadings. Which variable dominates PC1, and why do you think that is? (Hint: check the variances of the four variables first.)
eigen(cov(USArrests))$values[1:2]
[1] 7011.1149 201.9924
prcomp(USArrests, scale. = FALSE)$rotation[,1:2]
                PC1         PC2
Murder   0.04170432 -0.04482166
Assault  0.99522128 -0.05876003
UrbanPop 0.04633575  0.97685748
Rape     0.07515550  0.20071807
var(USArrests)
             Murder   Assault   UrbanPop      Rape
Murder    18.970465  291.0624   4.386204  22.99141
Assault  291.062367 6945.1657 312.275102 519.26906
UrbanPop   4.386204  312.2751 209.518776  55.76808
Rape      22.991412  519.2691  55.768082  87.72916
Assault dominates the first PC (loading 0.995) because it is measured on a much larger scale than the other variables: its variance is about 6945, compared with about 210 for UrbanPop, 88 for Rape, and 19 for Murder. Since the bulk of the total variance comes from Assault, the direction of maximum variance in the unscaled data is almost exactly the Assault axis.
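The hint about checking the variances can be made concrete by expressing each variable's variance as a share of the total. A minimal sketch using the built-in USArrests data; `var_share` and `pc1_share` are names introduced here.

```r
# Variance of each variable as a share of the total variance.
variances <- diag(var(USArrests))
var_share <- variances / sum(variances)
round(var_share, 3)

# Share of total variance captured by PC1 of the unscaled PCA.
eig_cov   <- eigen(cov(USArrests))$values
pc1_share <- eig_cov[1] / sum(eig_cov)
round(pc1_share, 3)
```

Comparing Assault's share with the share captured by PC1 shows how closely the unscaled PC1 tracks that single variable.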
Run PCA with scaling (scale = TRUE). Report the eigenvalues and loadings again.
eigen(cor(USArrests))$values[1:2]
[1] 2.4802416 0.9897652
prcomp(USArrests, scale. = TRUE)$rotation[,1:2]
                PC1        PC2
Murder   -0.5358995 -0.4181809
Assault  -0.5831836 -0.1879856
UrbanPop -0.2781909  0.8728062
Rape     -0.5434321  0.1673186
After scaling, Murder, Assault, and Rape load on PC1 with similar magnitude (about -0.54 to -0.58) and the same sign, whereas in part (a) PC1 was essentially Assault alone (loading 0.995) and the other loadings were close to zero. The negative signs are arbitrary.
(eigen(cor(USArrests))$values[1] + eigen(cor(USArrests))$values[2]) / sum(diag(cor(USArrests)))
[1] 0.8675017
PC1 seems to be an overall crime score, as the three crime variables load on it with similar weight. Because those loadings are negative, a more negative PC1 score means more crime in the state.
PC2 is dominated by UrbanPop (loading 0.87), with a negative weight on Murder (-0.42), so states with a high PC2 score have large urban populations and relatively low murder rates.
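The scaled PCA above mixes prcomp() for the loadings with eigen(cor()) for the eigenvalues. The sketch below checks that the two routes agree on the built-in USArrests data: the squared standard deviations from prcomp() are the eigenvalues of the correlation matrix, and the loadings match up to sign.

```r
pca_scaled <- prcomp(USArrests, scale. = TRUE)
eig_cor    <- eigen(cor(USArrests))

# Component variances equal the eigenvalues of the correlation matrix.
same_eigenvalues <- isTRUE(all.equal(pca_scaled$sdev^2, eig_cor$values))

# Loadings agree in absolute value; each column may differ by a factor of -1.
same_loadings <- isTRUE(all.equal(abs(unname(pca_scaled$rotation)),
                                  abs(eig_cor$vectors)))

# Cumulative proportion of variance explained by the first two PCs.
cum_prop <- cumsum(eig_cor$values) / sum(eig_cor$values)

c(same_eigenvalues = same_eigenvalues, same_loadings = same_loadings)
round(cum_prop[2], 4)
```

Either route is fine for reporting; prcomp() has the practical advantage of also returning the scores in `pca_scaled$x`.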
Using the scaled PCA from part (b), produce a biplot.
pca_arrests <- prcomp(USArrests, scale. = TRUE)
fviz_pca(pca_arrests, repel = TRUE)
Florida has the most negative score on the first axis, which means that it has a high level of violent crime relative to the other states. North Dakota has the largest positive score on that axis, meaning it has very low relative crime rates.
California also has a strongly negative PC1 score, but it has the highest urban population share in the data (91%), so its crime level is roughly in line with what you would expect for such an urbanized state.
More or less, yes. The biplot shows groups of states with high crime, states with both high crime and high urban population, and states with low crime and low urban population, so each state's position has a clear interpretation in terms of its measurements.
Write 3–5 sentences making a recommendation: covariance-based or correlation-based PCA for this dataset? Justify your choice by referring to variable scales, units, and what the analyst is trying to learn.
I recommend correlation-based PCA for this dataset. The variables are in different units (arrests per 100,000 for the three crimes, a percentage for UrbanPop) and have very different variances, so the covariance-based PCA was dominated by Assault and told us almost nothing about the other variables. After standardizing, the PCA revealed a clear relationship among the three crime types and how they relate to urban population. The biplot also showed groups of states characterized by their crime rates relative to their urban population, which is what an analyst would want to learn.
Answer each question in 3–5 sentences.
R1. Your notes state that PCA is not equivariant with respect to scale. In your own words, explain what this means and why it matters practically. Give an example (it can be one from class or made up) where ignoring this could lead you to wrong conclusions.
PCA is not scale-equivariant: changing the units of a variable changes its variance, and therefore changes the principal components themselves rather than simply rescaling them. A variable can appear more important only because it has a larger variance, not because it is more informative. In the USArrests data, Assault dominated PC1 in the unscaled analysis only because its original scale gave it by far the highest variance. After standardizing, it was clearly no more important than the other crimes, which shows that the choice between covariance-based and correlation-based PCA can change the conclusions.
R2. The “proportion of variance explained” by the first \(r\) PCs is often used to choose how many components to retain. Describe one situation where this criterion could be misleading — that is, where a high proportion of variance explained does not mean the PCA was useful or interpretable.
A high proportion of variance explained only says that the retained PCs reproduce the spread of the data, not that they are interpretable or useful. In the unscaled USArrests PCA, for example, PC1 explains almost all of the variance yet is essentially just the Assault variable. Likewise, if additional PCs seem to repeat the pattern of earlier, more explanatory PCs, it is not necessarily useful to include them. Conversely, a PC that explains little variance can still reveal an interesting relationship among the variables or describe rarer cases, so a diverse set of PCs can sometimes be more informative than the ones with the largest variance.
R3. You ran species-stratified PCA on penguins in Question 2c. Explain conceptually why running PCA on a pooled dataset (all species together) could give a different — and potentially misleading — first PC compared to running it within each species.
When species differ substantially in their average measurements, the correlations in the pooled dataset mix between-species differences with within-species relationships. The pooled first PC can then mostly describe how the species differ rather than how the measurements vary together within a species. For example, bill depth loads negatively on the pooled PC1 but has the same sign as the other variables on PC1 within each species. It is therefore more insightful to run the PCA by species, because conclusions drawn from the pooled PCA may not hold for any single species.