Activity 3.2 - PC concepts

SUBMISSION INSTRUCTIONS

  1. Render to HTML
  2. Publish your HTML to RPubs
  3. Submit a link to your published solutions

Question 1

The data set we will analyze for this question contains results from the 10 events in the Men’s 2024 Olympic Decathlon in Paris.

decathlon <- read.csv('Data/mens_decathlon_paris2024.csv')
head(decathlon)
         Athlete  Medal      Nation Overall X100m LongJump ShotPut HighJump
1   Markus Rooth   Gold      Norway    8796 10.71     7.80   15.25     1.99
2 Leo Neugebauer Silver     Germany    8748 10.67     7.98   16.55     2.05
3  Lindon Victor Bronze     Grenada    8711 10.56     7.48   15.71     2.02
4    Sven Roosen   None Netherlands    8607 10.52     7.56   15.10     1.87
5  Janek Õiglane   None     Estonia    8572 10.89     7.25   14.58     1.99
6   Johannes Erm   None     Estonia    8569 10.64     7.66   14.61     2.08
  X400m X110mHurdle Discus PoleVault Javelin X1500m
1 47.69       14.25  49.80       5.3   66.87  279.6
2 47.70       14.51  53.33       5.0   56.64  284.7
3 47.84       14.62  53.91       4.9   68.22  283.5
4 46.40       13.99  46.88       4.7   63.72  258.5
5 48.02       14.45  43.49       5.3   71.89  265.6
6 47.19       14.35  46.29       4.6   59.58  259.7

For the purposes of this question, assume we have 10-dimensional data - that is, ignore the Overall column.

A)

Explain why we need to scale this data set before performing PCA.

We need to scale the data because each decathlon event is measured in different units: seconds for the runs, meters for the jumps and throws. Without scaling, the events with the largest variances (like the 1500m, whose times are in the hundreds of seconds) would dominate the principal components simply because of their units. Standardizing puts every event on a common scale, so the PCs reflect real performance patterns rather than differences in measurement units.
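A quick check makes this concrete: the raw events sit on wildly different scales, so their spreads differ by orders of magnitude before standardizing. (This just compares two of the columns shown in head() above.)

sapply(decathlon[, c("X1500m", "HighJump")], sd)   # seconds vs. meters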

B)

Use svd() to find the first 2 principal component scores and their loadings. Full credit will only be granted if you use the svd() ingredients u, d, and v. What percent of the overall variability do the first two PCs explain?
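Recall the link between the SVD and PCA: if X is the centered-and-scaled data matrix with SVD X = U D Vᵀ, then the PC scores are XV = UD, the loadings are the columns of V, and the variance along PC i is proportional to dᵢ². The fraction of total variability explained by the first two PCs is therefore (d₁² + d₂²) / Σⱼ dⱼ².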

decathlon <- read.csv("Data/mens_decathlon_paris2024.csv")

# Use only the 10 decathlon event columns (exclude Athlete, Medal, Nation, and Overall)
events <- c("X100m", "LongJump", "ShotPut", "HighJump",
            "X400m", "X110mHurdle", "Discus", "PoleVault",
            "Javelin", "X1500m")

# Scale data
X <- scale(decathlon[, events])

# Perform SVD
svd_out <- svd(X)
u <- svd_out$u
d <- svd_out$d
v <- svd_out$v

# Compute PC scores and loadings using ONLY u, d, v
scores <- u %*% diag(d)
pc_scores <- scores[, 1:2]
pc_loadings <- v[, 1:2]

# Calculate percent of total variance explained by PC1 + PC2
total_var <- sum(d^2)
var_pc1_pc2 <- sum(d[1:2]^2)
percent_var_1_2 <- (var_pc1_pc2 / total_var) * 100

# Display results
print(round(percent_var_1_2, 3))
[1] 48.415
print(round(pc_loadings, 5))
          [,1]     [,2]
 [1,]  0.44661  0.01679
 [2,] -0.49304 -0.06566
 [3,] -0.30943  0.53134
 [4,] -0.15039 -0.05666
 [5,]  0.48597  0.14848
 [6,]  0.34527  0.31686
 [7,] -0.13915  0.59571
 [8,]  0.01983  0.48083
 [9,]  0.25282  0.03137
[10,]  0.00559 -0.01949

The first two principal components explain about 48.4% of the total variability in the athletes’ performances across the ten decathlon events, so a little under half of the variation in the data set can be summarized by just the first two PCs.
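As an optional sanity check (not part of the required svd() ingredients), prcomp() should reproduce these results up to sign flips on each component:

pr <- prcomp(X)   # X is already centered and scaled above
max(abs(abs(pr$x[, 1:2]) - abs(pc_scores)))          # should be ~0
summary(pr)$importance["Cumulative Proportion", 2]   # ~0.484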

C)

Find and print the loadings. Based on the loadings alone, if the first two PCs are plotted in a 2D plane as shown below, which of the four quadrants will the medalists be in? Explain your reasoning.

# All numeric columns except Overall, i.e., the 10 events
X <- decathlon[, sapply(decathlon, is.numeric)]
X <- X[, !names(X) %in% "Overall"]
X <- scale(X)

svd_out <- svd(X)
u <- svd_out$u
d <- svd_out$d
v <- svd_out$v

scores <- u %*% diag(d)
pc_scores_1_2 <- scores[, 1:2]

pc_loadings_1_2 <- v[, 1:2, drop = FALSE]
rownames(pc_loadings_1_2) <- colnames(X)

total_var <- sum(d^2)
pc_var <- d^2 / total_var
percent_var_1_2 <- sum(pc_var[1:2]) * 100

print(round(percent_var_1_2, 3))
[1] 48.415
print(round(pc_loadings_1_2, 5))
                [,1]     [,2]
X100m        0.44661  0.01679
LongJump    -0.49304 -0.06566
ShotPut     -0.30943  0.53134
HighJump    -0.15039 -0.05666
X400m        0.48597  0.14848
X110mHurdle  0.34527  0.31686
Discus      -0.13915  0.59571
PoleVault    0.01983  0.48083
Javelin      0.25282  0.03137
X1500m       0.00559 -0.01949

The loadings tell us which events drive each principal component. On PC1, the running events (X100m, X400m, X110mHurdle) have positive loadings, while the jumps and most throws have negative or near-zero loadings. Since faster times are smaller numbers and longer jumps are bigger numbers, a strong athlete has negative standardized values in the runs and positive ones in the jumps, and both push the PC1 score negative. On PC2, the large positive loadings belong to ShotPut, Discus, and PoleVault, so strong throwers and vaulters get positive PC2 scores. Medalists, being strong across the board, should therefore land in Quadrant II of the plot: negative PC1 and positive PC2.
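To make the sign argument concrete with one number, take the bronze medalist’s 10.56-second 100m and the PC1 loading printed above:

z100 <- (10.56 - mean(decathlon$X100m)) / sd(decathlon$X100m)   # negative: faster than average
z100 * 0.44661   # negative contribution to PC1 from a fast 100m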

D)

Add the PCs to the decathlon data set and create a scatterplot of these PCs, with the points labeled by the athletes’ names. Color-code the points on whether or not the athlete was a medalist. Use the ggrepel package for better labeling. Verify that your intuition from C) is correct.

library(ggplot2)
library(ggrepel)


decathlon <- read.csv("Data/mens_decathlon_paris2024.csv")


events <- c("X100m","LongJump","ShotPut","HighJump",
            "X400m","X110mHurdle","Discus","PoleVault",
            "Javelin","X1500m")

X <- scale(decathlon[, events])

svd_out <- svd(X)
scores <- svd_out$u %*% diag(svd_out$d)   # PC scores = U D

# Attach the first two PC scores and flag the top-3 finishers as medalists
decathlon$PC1 <- scores[, 1]
decathlon$PC2 <- scores[, 2]
decathlon$Medalist <- ifelse(rank(-decathlon$Overall) <= 3, "Medalist", "Other")

ggplot(decathlon, aes(PC1, PC2, color = Medalist, label = Athlete)) +
  geom_point(size = 3) +
  geom_text_repel(show.legend = FALSE, size = 3.5) +
  labs(title = "Decathlon Athletes by First Two Principal Components",
       x = "PC1", y = "PC2", color = "Athlete Type") +
  theme_minimal()

My intuition from C) was correct: the medalists cluster together in the upper-left quadrant (Quadrant II), with negative PC1 scores (fast runs, long jumps) and positive PC2 scores (strong throws and vaults). They stand out as a group because they perform well across all events.
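A small design note: since the data already contains a Medal column (Gold/Silver/Bronze/None, as shown in head() above), the medalist flag could equivalently be built without ranking Overall:

decathlon$Medalist <- ifelse(decathlon$Medal == "None", "Other", "Medalist")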

E)

Canadian Damian Warner won the gold medal in the decathlon in the 2020 Tokyo games. He began the 2024 decathlon but bowed out after three straight missed pole vault attempts.

These are his results in the 10 events in 2020:

warner <- c(10.12, 8.24, 14.8, 2.02, 47.48, 13.46, 48.67, 4.9, 63.44, 271.08)

Would this have won a medal if it had happened in 2024? To answer this, we will compute his PCs with respect to the 2024 athletes and add them to the plot to see how his 2020 gold-medal performance compares to the 2024 field. To do this:

  • Find the mean vector from the 2024 athletes. Call it mean_vec_24.
  • Find the standard deviation vector from the 2024 athletes. Call it sd_vec_24.
  • Standardize Warner’s 2020 results with respect to the 2024 athletes: (warner-mean_vec_24)/sd_vec_24
  • Find Warner’s PC coordinates using the 2024 loadings.
  • Add his point to the scatterplot.
# 2024 event results only: drop Overall and the PC columns added in part D)
X <- decathlon[, sapply(decathlon, is.numeric)]
X <- X[, !names(X) %in% c("Overall", "PC1", "PC2")]

# Mean and standard deviation vectors of the 2024 field
mean_vec_24 <- apply(X, 2, mean)
sd_vec_24 <- apply(X, 2, sd)

# Warner's 2020 marks, standardized with respect to the 2024 athletes
warner <- c(10.12, 8.24, 14.8, 2.02, 47.48, 13.46, 48.67, 4.9, 63.44, 271.08)
names(warner) <- colnames(X)
warner_std <- (warner - mean_vec_24) / sd_vec_24

# 2024 loadings from the SVD of the scaled 2024 data
X_scaled <- scale(X)
svd_out <- svd(X_scaled)
loadings <- svd_out$v[, 1:2]

# Project Warner's standardized 2020 marks onto the 2024 loadings
warner_pcs <- as.numeric(warner_std %*% loadings)

# PC scores and medalist flag for the 2024 athletes
decathlon$PC1 <- svd_out$u[, 1] * svd_out$d[1]
decathlon$PC2 <- svd_out$u[, 2] * svd_out$d[2]
decathlon$Medalist <- ifelse(rank(-decathlon$Overall) <= 3, "Medalist", "Other")

warner_df <- data.frame(
  Athlete = "Damian Warner (2020)",
  PC1 = warner_pcs[1],
  PC2 = warner_pcs[2],
  Medalist = "2020 Gold"
)

plot_data <- rbind(
  decathlon[, c("Athlete", "PC1", "PC2", "Medalist")],
  warner_df
)

ggplot(plot_data, aes(x = PC1, y = PC2, color = Medalist, label = Athlete)) +
  geom_point(size = 3) +
  geom_text_repel(show.legend = FALSE, size = 3.5) +
  labs(
    title = "Damian Warner (2020) Compared to 2024 Decathlon Athletes",
    x = "PC1",
    y = "PC2",
    color = "Athlete Type"
  ) +
  theme_minimal()

Do you think his 2020 performance would have won a medal if it had happened in 2024?

Yes. On the plot, Warner’s point sits below and to the left of the 2024 medalists. Below means weaker throws (lower PC2), but to the left means faster runs and longer jumps (a more negative PC1), which is exactly the direction of the strongest all-around performances. He is not weaker than the 2024 field; he is more extreme on the dominant component. His 2020 total of 9018 points would in fact have beaten the 2024 winning score of 8796, so his 2020 performance would have won a medal, indeed the gold.
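To make the comparison concrete, we can print his coordinates next to the medalists’ average position (a quick check using the objects computed above):

rbind(Warner2020 = warner_pcs,
      MedalistMean = colMeans(decathlon[decathlon$Medalist == "Medalist",
                                        c("PC1", "PC2")]))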

Question 2

Below is a screenshot of a conversation between me and chatbot Claude:

After looking at the graphs, I grew skeptical. So I said:

Behold, Claude’s three data sets which I’ve called claudeA, claudeB, and claudeC:

claudeA <- read.csv('Data/claude_dataA.csv')
claudeB <- read.csv('Data/claude_dataB.csv')
claudeC <- read.csv('Data/claude_dataC.csv')

Each data set has an X and a Y column which represent 2-dimensional variables that we need to rotate.

A)

Scale each data set and plot them side-by-side using the patchwork package. Make sure the aspect ratio of each graph is 1 (i.e., make the height and width of each graph equal). At this point, explain why you think I was skeptical. Specifically, do you think the percent variability explained by the first PC of each data set appears to exceed or fall short of the variability I asked it to?

library(ggplot2)
library(patchwork)

# Standardize each data set so X and Y are on comparable scales
claudeA_scaled <- as.data.frame(scale(claudeA))
claudeB_scaled <- as.data.frame(scale(claudeB))
claudeC_scaled <- as.data.frame(scale(claudeC))


pA <- ggplot(claudeA_scaled, aes(x = X, y = Y)) +
  geom_point(color = "steelblue") +
  ggtitle("Claude A (Claimed 55%)") +
  coord_fixed(ratio = 1) +
  theme_minimal()

pB <- ggplot(claudeB_scaled, aes(x = X, y = Y)) +
  geom_point(color = "darkorange") +
  ggtitle("Claude B (Claimed 75%)") +
  coord_fixed(ratio = 1) +
  theme_minimal()

pC <- ggplot(claudeC_scaled, aes(x = X, y = Y)) +
  geom_point(color = "seagreen") +
  ggtitle("Claude C (Claimed 90%)") +
  coord_fixed(ratio = 1) +
  theme_minimal()


pA + pB + pC

I think you were skeptical because the plots don’t match the percentages Claude claimed. All three scaled scatterplots are tight, nearly linear point clouds. A data set whose first PC explains only 55% of the variance should look far more diffuse than claudeA does, and the same goes for the 75% and 90% claims. In every case, the percent variability explained by the first PC appears to exceed what you asked for.
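In fact, for two standardized variables the first PC always explains (1 + |r|)/2 of the total variance, where r is the correlation between X and Y, so the correlations alone preview part B):

sapply(list(A = claudeA, B = claudeB, C = claudeC),
       function(d) (1 + abs(cor(d$X, d$Y))) / 2)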

B)

Use SVD to find the first PC for each data set, and find the actual percent of total variability explained by each PC using aggregation methods.

XA <- scale(claudeA)
svdA <- svd(XA)
uA <- svdA$u; dA <- svdA$d; vA <- svdA$v
pc1_var_A <- (dA[1]^2) / sum(dA^2) * 100   # % of total variance along PC1


XB <- scale(claudeB)
svdB <- svd(XB)
uB <- svdB$u; dB <- svdB$d; vB <- svdB$v
pc1_var_B <- (dB[1]^2) / sum(dB^2) * 100


XC <- scale(claudeC)
svdC <- svd(XC)
uC <- svdC$u; dC <- svdC$d; vC <- svdC$v
pc1_var_C <- (dC[1]^2) / sum(dC^2) * 100


data.frame(
  Dataset = c("Claude A", "Claude B", "Claude C"),
  `Claimed %` = c(55, 75, 90),
  `Actual % (PC1)` = round(c(pc1_var_A, pc1_var_B, pc1_var_C), 2),
  check.names = FALSE
)
   Dataset Claimed % Actual % (PC1)
1 Claude A        55          97.56
2 Claude B        75          95.92
3 Claude C        90          99.49

All three data sets have a first PC that explains far more variance than claimed: about 97.6% for claudeA, 95.9% for claudeB, and 99.5% for claudeC. In other words, all three are nearly collinear point clouds, so none of them comes close to matching the percentages Claude claimed.
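As a footnote, here is one way data matching a claimed percentage could actually be generated. This is only a sketch, not what Claude did: it uses MASS::mvrnorm with empirical = TRUE so the sample correlation, and hence the PC1 share, is hit exactly.

library(MASS)
set.seed(1)
target <- 0.75                       # desired share of variance for PC1
r <- 2 * target - 1                  # since PC1 share = (1 + r) / 2
fake <- mvrnorm(200, mu = c(0, 0),
                Sigma = matrix(c(1, r, r, 1), 2),
                empirical = TRUE)    # force the sample covariance to equal Sigma
d <- svd(scale(fake))$d
d[1]^2 / sum(d^2)                    # exactly 0.75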