Activity 3.2 - PC concepts

SUBMISSION INSTRUCTIONS

  1. Render to html
  2. Publish your html to RPubs
  3. Submit a link to your published solutions

Question 1

The data set we will analyze for this question covers the 10 events in the Men’s 2024 Olympic Decathlon in Paris.

decathlon <- read.csv('Data/mens_decathlon_paris2024.csv')
head(decathlon)
         Athlete  Medal      Nation Overall X100m LongJump ShotPut HighJump
1   Markus Rooth   Gold      Norway    8796 10.71     7.80   15.25     1.99
2 Leo Neugebauer Silver     Germany    8748 10.67     7.98   16.55     2.05
3  Lindon Victor Bronze     Grenada    8711 10.56     7.48   15.71     2.02
4    Sven Roosen   None Netherlands    8607 10.52     7.56   15.10     1.87
5  Janek Õiglane   None     Estonia    8572 10.89     7.25   14.58     1.99
6   Johannes Erm   None     Estonia    8569 10.64     7.66   14.61     2.08
  X400m X110mHurdle Discus PoleVault Javelin X1500m
1 47.69       14.25  49.80       5.3   66.87  279.6
2 47.70       14.51  53.33       5.0   56.64  284.7
3 47.84       14.62  53.91       4.9   68.22  283.5
4 46.40       13.99  46.88       4.7   63.72  258.5
5 48.02       14.45  43.49       5.3   71.89  265.6
6 47.19       14.35  46.29       4.6   59.58  259.7

For the purposes of this question, assume we have 10-dimensional data - that is, ignore the Overall column.

A)

Explain why we need to scale this data set before performing PCA.

The ten events are measured in different units and on very different scales (seconds for the runs, metres for the jumps and throws). Without scaling, the events with the largest numeric spread would dominate the principal components simply because of their units, hiding the true pattern of variation across athletes. Scaling puts every event on an equal footing.
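
For example, a quick look at the raw standard deviations of the ten events (a minimal sketch, assuming the event columns sit in positions 5 through 14 as in the head() output above) shows how different the scales are:

# Raw standard deviations of the 10 events; without scaling, the events with
# the largest spread (e.g., the 1500m times) would dominate the PCs.
round(apply(decathlon[, 5:14], 2, sd), 2)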

B)

Use svd() to find the first 2 principal component scores and their loadings. Full credit will only be granted if you use the svd() ingredients u, d, and v. What percent of the overall variability do the first two PCs explain?

# 1. Extract numeric event columns (10 events only)
X <- decathlon[, 5:14]

# 2. Scale the data (center + scale)
X_scaled <- scale(X)

# 3. Perform singular value decomposition
svd_out <- svd(X_scaled)

# 4. Compute principal component scores and loadings
U <- svd_out$u
D <- diag(svd_out$d)
V <- svd_out$v

# PC scores (new coordinates for each athlete)
scores <- U %*% D

# Loadings (weights of each event on each PC)
loadings <- V

# 5. Compute proportion of variance explained
var_explained <- svd_out$d^2 / sum(svd_out$d^2)

# 6. Percent of total variability explained by first two PCs
percent_first2 <- sum(var_explained[1:2]) * 100
percent_first2
[1] 48.415
events <- colnames(X)
loadings_df <- data.frame(Event = events,
                          PC1 = round(loadings[,1], 3),
                          PC2 = round(loadings[,2], 3))
loadings_df
         Event    PC1    PC2
1        X100m  0.447  0.017
2     LongJump -0.493 -0.066
3      ShotPut -0.309  0.531
4     HighJump -0.150 -0.057
5        X400m  0.486  0.148
6  X110mHurdle  0.345  0.317
7       Discus -0.139  0.596
8    PoleVault  0.020  0.481
9      Javelin  0.253  0.031
10      X1500m  0.006 -0.019
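
The first two PCs explain about 48.4% of the total variability. As a quick sanity check (a sketch using the objects created above), the scores built from u and d should match the direct projection of the scaled data onto the loadings in v:

# The PC scores U %*% D should equal X_scaled %*% V up to floating-point error.
all.equal(scores, X_scaled %*% V, check.attributes = FALSE)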

C)

Find and print the loadings. Based on the loadings alone, if the first two PCs are plotted in a 2D plane as shown below, which of the four quadrants will the medalists be in? Explain your reasoning.

They will be in Quadrant 1, because both the PC1 and PC2 loadings for the first event, X100m, are positive (PC1: 0.447, PC2: 0.017). A positive PC1 pushes the medalists to the right and a positive PC2 pushes them up, leaving them in Q1. A positive PC1 presumably corresponds to a stronger performance, so the medalists should lie in that quadrant.

D)

Add the PCs to the decathlon data set and create a scatterplot of these PCs, with the points labeled by the athletes’ names. Color-code the points on whether or not the athlete was a medalist. Use the ggrepel package for better labeling. Verify that your intuition from C) is correct.

library(ggplot2)
library(ggrepel)
# 1. Add the first two PC scores to the decathlon data frame
decathlon$PC1 <- scores[, 1]
decathlon$PC2 <- scores[, 2]

# 2. Create a variable for medal status
decathlon$Medalist <- ifelse(decathlon$Medal == "None", "No", "Yes")

# 3. Plot PC1 vs PC2, label athletes, color-code medalists
ggplot(decathlon, aes(x = PC1, y = PC2, color = Medalist, label = Athlete)) +
  geom_point(size = 3) +
  geom_text_repel(size = 3, max.overlaps = 20) +
  labs(
    title = "Decathlon Athletes on First Two Principal Components",
    x = "PC1",
    y = "PC2",
    color = "Medalist"
  ) +
  theme_minimal()

In part (C) I predicted Q1 because both PC1 and PC2 loadings for the first variable were positive. After plotting the actual PC1 vs. PC2 scores, the medalists appear mainly in Q2. My intuition for PC2 was correct, but PC1 is the opposite: PC1 loads positively on the running events, where lower times are better, so the strongest athletes end up with negative PC1 scores.
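
A quick numeric check of the medalists' coordinates (a sketch using the PC1, PC2, and Medalist columns added above) backs this up:

# Medalists' PC coordinates: negative PC1 with positive PC2 places a point in Q2.
subset(decathlon, Medalist == "Yes", select = c(Athlete, Medal, PC1, PC2))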

E)

Canadian Damian Warner won the gold medal in the decathlon in the 2020 Tokyo games. He began the 2024 decathlon but bowed out after three straight missed pole vault attempts.

These are his results in the 10 events in 2020:

warner <- c(10.12, 8.24, 14.8, 2.02, 47.48, 13.46, 48.67, 4.9, 63.44, 271.08)

Would this have won a medal if it had happened in 2024? To answer this, we will compute his PCs with respect to the 2024 athletes and add it to the plot to see where his 2020 gold-medal performance compares to the 2024 athletes. To do this:

  • Find the mean vector from the 2024 athletes. Call it mean_vec_24.
  • Find the standard deviation vector from the 2024 athletes. Call it sd_vec_24.
  • Standardize Warner’s 2020 results with respect to the 2024 athletes: (warner-mean_vec_24)/sd_vec_24
  • Find Warner’s PC coordinates using the 2024 loadings.
  • Add his point to the scatterplot.

Do you think his 2020 performance would have won a medal if it had happened in 2024?

# Warner's 2020 results
warner <- c(10.12, 8.24, 14.8, 2.02, 47.48, 13.46, 48.67, 4.9, 63.44, 271.08)

# mean and standard deviation 2024
mean_vec_24 <- apply(X, 2, mean)
sd_vec_24 <- apply(X, 2, sd)

# standardize 2020 with 2024
warner_scaled <- (warner - mean_vec_24) / sd_vec_24

# PC coordinates using the 2024 loadings
warner_scores <- warner_scaled %*% loadings

# add 2020 point
warner_point <- data.frame(
  Athlete = "Damian Warner (2020)",
  PC1 = warner_scores[1, 1],
  PC2 = warner_scores[1, 2],
  Medalist = "Warner2020"
)

# combine with 2024
combined <- rbind(
  decathlon[, c("Athlete", "PC1", "PC2", "Medalist")],
  warner_point
)

# plot in 2024
ggplot(combined, aes(x = PC1, y = PC2, color = Medalist, label = Athlete)) +
  geom_point(size = 3) +
  geom_text_repel(size = 3, max.overlaps = 20) +
  labs(
    title = "Damian Warner's 2020 Performance vs 2024 Athletes (PCA Space)",
    x = "PC1",
    y = "PC2",
    color = "Medalist"
  ) +
  theme_minimal()

Damian Warner’s 2020 performance would not have resulted in a medal in 2024. His point falls right next to a cluster of non-medalists: although his PC1 is very negative, like the medalists’, his PC2 is low, and in this plot PC2 appears to be the stronger separator of the medalists, so his score on it is too low.
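
One informal way to quantify this (a sketch that uses Euclidean distance in the first two PCs as a rough proxy for similarity; it is not part of the required answer) is to see how far Warner's 2020 point sits from each 2024 medalist:

# Distance from Warner's 2020 coordinates to each 2024 medalist in the PC1-PC2 plane.
medalists <- subset(decathlon, Medalist == "Yes")
setNames(
  sqrt((medalists$PC1 - warner_point$PC1)^2 + (medalists$PC2 - warner_point$PC2)^2),
  medalists$Athlete
)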

Question 2

Below is a screenshot of a conversation between me and chatbot Claude:

After looking at the graphs, I grew skeptical. So I said:

Behold, Claude’s three data sets which I’ve called claudeA, claudeB, and claudeC:

claudeA <- read.csv('Data/claude_dataA.csv')
claudeB <- read.csv('Data/claude_dataB.csv')
claudeC <- read.csv('Data/claude_dataC.csv')

Each data set has an X and a Y column which represent 2-dimensional variables that we need to rotate.

A)

Scale each data set and plot them side-by-side using the patchwork package. Make sure the aspect ratio of each graph is 1 (i.e., make the height and width of each graph equal). At this point, explain why you think I was skeptical. Specifically, do you think the percent variability explained by the first PC of each data set appears to exceed or fall short of the variability I asked it to?

claudeA <- read.csv("Data/claude_dataA.csv")
claudeB <- read.csv("Data/claude_dataB.csv")
claudeC <- read.csv("Data/claude_dataC.csv")

# Scale each dataset
claudeA_scaled <- scale(claudeA)
claudeB_scaled <- scale(claudeB)
claudeC_scaled <- scale(claudeC)
library(ggplot2)
library(patchwork)

pA <- ggplot(as.data.frame(claudeA_scaled), aes(x = X, y = Y)) +
  geom_point(color = "blue") +
  coord_fixed() +  # equal aspect ratio (height = width)
  labs(title = "claudeA (claimed 55% variance)") +
  theme_minimal()

pB <- ggplot(as.data.frame(claudeB_scaled), aes(x = X, y = Y)) +
  geom_point(color = "green") +
  coord_fixed() +
  labs(title = "claudeB (claimed 75% variance)") +
  theme_minimal()

pC <- ggplot(as.data.frame(claudeC_scaled), aes(x = X, y = Y)) +
  geom_point(color = "red") +
  coord_fixed() +
  labs(title = "claudeC (claimed 90% variance)") +
  theme_minimal()


pA | pB | pC   # patchwork: place the three plots side-by-side

I think you were skeptical because claudeA looks like it accounts for far more than 55% of the variability: the points lie very close to a line rather than being spread out. Also, claudeB and claudeC show almost the same degree of linearity as each other yet were given very different percentages. I think all three exceed the variability you asked for.
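
For two standardized variables, the first PC explains (1 + |r|)/2 of the total variance, where r is the correlation between X and Y, so the sample correlations alone already hint at the answer (a quick sketch ahead of the SVD in part B):

# With 2-D standardized data, PC1's share of the variance is (1 + |cor(X, Y)|) / 2,
# so the correlation alone reveals how much of the variability PC1 explains.
sapply(list(claudeA = claudeA, claudeB = claudeB, claudeC = claudeC),
       function(d) cor(d$X, d$Y))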

B)

Use SVD to find the first PC for each data set, and find the actual percent of total variability explained by each PC using aggregation methods.

ds_list <- list(
  claudeA = claudeA,
  claudeB = claudeB,
  claudeC = claudeC
)


pc1_info <- function(df) {
  X <- scale(df)                    # center and scale both columns
  s <- svd(X)                       # singular value decomposition of the scaled data
  prop_var <- (s$d^2) / sum(s$d^2)  # proportion of variance from squared singular values
  list(
    pc1_loading = s$v[, 1],         # loadings of the first PC (first column of v)
    pc1_percent = 100 * prop_var[1] # percent of total variability explained by PC1
  )
}


out_list <- lapply(ds_list, pc1_info)


pc1_table <- data.frame(
  dataset = names(out_list),
  pc1_percent = sapply(out_list, `[[`, "pc1_percent")
)


pc1_table$pc1_percent <- round(pc1_table$pc1_percent, 2)
print(pc1_table)
        dataset pc1_percent
claudeA claudeA       97.56
claudeB claudeB       95.92
claudeC claudeC       99.49

This confirms your skepticism: all three first PCs explain far more of the variability than they were supposed to (97.56%, 95.92%, and 99.49% versus the requested 55%, 75%, and 90%), and they are not even in the requested increasing order.
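
To make the mismatch explicit (a sketch, with the claimed percentages taken from the plot titles above):

# Claimed vs. actual percent of variance explained by PC1 for each data set.
pc1_table$claimed_percent <- c(55, 75, 90)
pc1_table[, c("dataset", "claimed_percent", "pc1_percent")]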