Activity 3.2 - PC concepts

SUBMISSION INSTRUCTIONS

  1. Render to html
  2. Publish your html to RPubs
  3. Submit a link to your published solutions

Question 1

The data set we will analyze for this question covers the 10 events in the Men’s 2024 Olympic Decathlon in Paris.

decathlon <- read.csv('Data/mens_decathlon_paris2024.csv')
head(decathlon)
         Athlete  Medal      Nation Overall X100m LongJump ShotPut HighJump
1   Markus Rooth   Gold      Norway    8796 10.71     7.80   15.25     1.99
2 Leo Neugebauer Silver     Germany    8748 10.67     7.98   16.55     2.05
3  Lindon Victor Bronze     Grenada    8711 10.56     7.48   15.71     2.02
4    Sven Roosen   None Netherlands    8607 10.52     7.56   15.10     1.87
5  Janek Õiglane   None     Estonia    8572 10.89     7.25   14.58     1.99
6   Johannes Erm   None     Estonia    8569 10.64     7.66   14.61     2.08
  X400m X110mHurdle Discus PoleVault Javelin X1500m
1 47.69       14.25  49.80       5.3   66.87  279.6
2 47.70       14.51  53.33       5.0   56.64  284.7
3 47.84       14.62  53.91       4.9   68.22  283.5
4 46.40       13.99  46.88       4.7   63.72  258.5
5 48.02       14.45  43.49       5.3   71.89  265.6
6 47.19       14.35  46.29       4.6   59.58  259.7

For the purposes of this question, assume we have 10-dimensional data - that is, ignore the Overall column.

A)

Explain why we need to scale this data set before performing PCA.

  • The event columns are in different units and scales

  • PCA is based on variances: variables with larger variance would dominate the principal components.

  • Standardizing (z-scores) puts all events on the same footing, so the PCs reflect structure rather than units; the quick check below makes this concrete.
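To see the scale problem directly, here is a quick look at the raw variances (column selection assumes the layout shown in head() above):

# Raw variances differ by orders of magnitude: the 1500m (seconds) would
# swamp events measured in meters if we skipped scaling
raw_events <- decathlon[, sapply(decathlon, is.numeric)]
raw_events$Overall <- NULL   # the question says to ignore Overall
round(apply(raw_events, 2, var), 3)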

B)

Use svd() to find the first 2 principal component scores and their loadings. Full credit will only be granted if you use the svd() ingredients u, d, and v. What percent of the overall variability do the first two PCs explain?

# Drop Overall, then keep only the numeric event columns
event_cols <- setdiff(colnames(decathlon), c("Overall"))
X_raw <- decathlon[, event_cols]

numeric_cols <- sapply(X_raw, is.numeric)
X_raw <- X_raw[, numeric_cols, drop = FALSE]

# Standardize each event (z-scores), so PCA works on the correlation scale
X <- scale(X_raw)
n <- nrow(X)
p <- ncol(X)

# SVD of standardized data matrix
sv <- svd(X)            
U <- sv$u
D <- sv$d
V <- sv$v             

# PC scores = U D (equivalently, X %*% V)
scores <- U %*% diag(D)

var_explained <- (D^2) / sum(D^2)
pct_12 <- 100 * sum(var_explained[1:2])

cat(sprintf("PC1+PC2 explain %.2f%% of total standardized variability.\n", pct_12))
PC1+PC2 explain 48.41% of total standardized variability.
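As a cross-check on the svd() arithmetic (not a substitute for it, since the question requires u, d, and v), prcomp() with scale. = TRUE should report the same cumulative proportion:

# Independent check: prcomp() on the same standardized data
pr <- prcomp(X_raw, scale. = TRUE)
summary(pr)$importance["Cumulative Proportion", 2]  # should be ~0.4841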

C)

Find and print the loadings. Based on the loadings alone, if the first two PCs are plotted in a 2D plane as shown below, which of the four quadrants will the medalists be in? Explain your reasoning.

ev_names <- colnames(X_raw)

L <- as.data.frame(V[,1:2]); names(L) <- c("PC1","PC2"); L$Event <- ev_names
L <- L[,c("Event","PC1","PC2")]
print(L, row.names = FALSE)
       Event          PC1         PC2
       X100m  0.446612410  0.01678592
    LongJump -0.493037935 -0.06565628
     ShotPut -0.309431542  0.53134100
    HighJump -0.150386291 -0.05665757
       X400m  0.485973869  0.14847674
 X110mHurdle  0.345270647  0.31685991
      Discus -0.139145881  0.59570522
   PoleVault  0.019829831  0.48083261
     Javelin  0.252821246  0.03137029
      X1500m  0.005588625 -0.01948870
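One caveat before reading quadrants off this table: svd() determines each column of V only up to a sign flip, so a different implementation could negate a PC and mirror the quadrants. A minimal sketch of one common convention (optional here, and purely illustrative) forces each column’s largest-magnitude loading to be positive:

# Flip each loading column so its largest-|entry| is positive
flip <- sign(V[cbind(apply(abs(V), 2, which.max), seq_len(ncol(V)))])
V_oriented <- sweep(V, 2, flip, `*`)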
# Events where lower is better (running times); for all other events, higher is better
time_events <- c("X100m", "X400m", "X110mHurdle", "X1500m")

# A strong athlete has negative z-scores on the timed events and positive
# z-scores on the jumps, throws, and vault. The expected sign of each PC
# coordinate is therefore the sum of (expected z-score sign) * loading.
better_sign <- ifelse(L$Event %in% time_events, -1, 1)
s1 <- sign(sum(better_sign * L$PC1))
s2 <- sign(sum(better_sign * L$PC2))

quadrant <- dplyr::case_when(
  s1 > 0 & s2 > 0 ~ "I",  s1 < 0 & s2 > 0 ~ "II",
  s1 < 0 & s2 < 0 ~ "III", s1 > 0 & s2 < 0 ~ "IV",
  TRUE ~ "undetermined"
)

cat(sprintf("Predicted medalist quadrant: %s\n", quadrant))
Predicted medalist quadrant: II

The timed events (X100m, X400m, X110mHurdle) load positively on PC1 while LongJump loads negatively, so a medalist, with fast times (negative z-scores) and long jumps (positive z-scores), scores negative on PC1 and sits on the left. ShotPut, Discus, and PoleVault load positively on PC2, so strong throwing and vaulting pushes medalists up. The medalists should therefore land in Quadrant II (upper left).

D)

Add the PCs to the decathlon data set and create a scatterplot of these PCs, with the points labeled by the athletes’ names. Color-code the points on whether or not the athlete was a medalist. Use the ggrepel package for better labeling. Verify that your intuition from C) is correct.

library(ggplot2)
library(ggrepel)
library(dplyr)

theme_set(theme_minimal(base_size = 12))

# Reuse the PC scores (U D) computed in part B
pc_df <- as.data.frame(scores[, 1:2]); names(pc_df) <- c("PC1", "PC2")
dec_w_pc <- bind_cols(decathlon, pc_df)

# Medal is "Gold"/"Silver"/"Bronze"/"None" (see head() above)
dec_w_pc$IsMedalist <- dec_w_pc$Medal %in% c("Gold", "Silver", "Bronze")

p <- ggplot(dec_w_pc, aes(PC1, PC2, color = IsMedalist)) +
  geom_point(size = 3, alpha = 0.9) +
  geom_text_repel(aes(label = Athlete),
                  max.overlaps = 60, min.segment.length = 0, box.padding = 0.4) +
  scale_color_manual(values = c(`TRUE` = "#1f78b4", `FALSE` = "#9e9e9e")) +
  coord_equal() +
  labs(title = "Decathlon (Paris 2024): PC1 vs PC2",
       subtitle = "Scores from SVD on standardized events",
       x = "PC1", y = "PC2", color = "Medalist?")

p

The three medalists (Rooth, Neugebauer, and Victor) cluster in the upper left, i.e., Quadrant II (negative PC1, positive PC2), which confirms the intuition from C).

E)

Canadian Damian Warner won the gold medal in the decathlon in the 2020 Tokyo games. He began the 2024 decathlon but bowed out after three straight missed pole vault attempts.

These are his results in the 10 events in 2020:

warner <- c(10.12, 8.24, 14.8, 2.02, 47.48, 13.46, 48.67, 4.9, 63.44, 271.08)

Would this have won a medal if it had happened in 2024? To answer this, we will compute his PCs with respect to the 2024 athletes and add the point to the plot to see how his 2020 gold-medal performance compares to the 2024 field. To do this:

  • Find the mean vector from the 2024 athletes. Call it mean_vec_24.
  • Find the standard deviation vector from the 2024 athletes. Call it sd_vec_24.
  • Standardize Warner’s 2020 results with respect to the 2024 athletes: (warner-mean_vec_24)/sd_vec_24
  • Find Warner’s PC coordinates using the 2024 loadings.
  • Add his point to the scatterplot.

Do you think his 2020 performance would have won a medal if it had happened in 2024?

warner <- c(10.12, 8.24, 14.8, 2.02, 47.48, 13.46, 48.67, 4.9, 63.44, 271.08)
stopifnot(length(warner) == ncol(X_raw))

mean_vec_24 <- colMeans(X_raw, na.rm = TRUE)
sd_vec_24   <- apply(X_raw, 2, sd, na.rm = TRUE)
warner_std  <- (warner - mean_vec_24) / sd_vec_24

warner_pc   <- as.numeric(warner_std %*% V[,1:2, drop = FALSE])
names(warner_pc) <- c("PC1","PC2")
warner_pc
       PC1        PC2 
-3.4226989 -0.5092531 
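As an optional equivalence check: scale() stored the 2024 centering and scaling vectors as attributes on X, so the hand-built standardization above can be verified against them.

# The attributes saved by scale() should reproduce mean_vec_24 / sd_vec_24
warner_std2 <- (warner - attr(X, "scaled:center")) / attr(X, "scaled:scale")
all.equal(unname(warner_std), unname(warner_std2))  # should be TRUE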
# annotate() draws the single extra point/label without inheriting the
# 20-row plot data, which avoids ggplot's length-1 aesthetics warning
p_warner <- p +
  annotate("point", x = warner_pc[1], y = warner_pc[2],
           shape = 21, fill = "#FF7F00", color = "black", size = 4, stroke = 0.6) +
  annotate("text", x = warner_pc[1], y = warner_pc[2],
           label = "Damian Warner (Tokyo 2020)", vjust = -1, fontface = "bold")

p_warner

Warner’s 2020 performance lands at roughly (-3.42, -0.51): far to the left of the 2024 medalists on PC1 and slightly below zero on PC2. Recall from C) that negative PC1 is the direction that separates the medalists, so his extreme PC1 score signals an exceptionally strong sprint/jump performance, while the mildly negative PC2 reflects a comparatively weaker throws profile. Being well left of the medalist cluster on the dominant component outweighs sitting just below it on PC2, so his 2020 performance very likely would have won a medal in 2024. (Indeed, his 9,018 points in Tokyo, then an Olympic record, comfortably exceed the 8,796 that won gold in Paris.)
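To back up the “far left on PC1” claim, a quick comparison (an optional check using objects already defined above) puts his score next to the spread of the 2024 field:

# Warner's PC1 alongside the min/max PC1 scores of the 2024 athletes
round(c(warner   = unname(warner_pc["PC1"]),
        min_2024 = min(dec_w_pc$PC1),
        max_2024 = max(dec_w_pc$PC1)), 2)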

Question 2

Below is a screenshot of a conversation between me and chatbot Claude:

After looking at the graphs, I grew skeptical. So I said:

Behold, Claude’s three data sets which I’ve called claudeA, claudeB, and claudeC:

claudeA <- read.csv('Data/claude_dataA.csv')
claudeB <- read.csv('Data/claude_dataB.csv')
claudeC <- read.csv('Data/claude_dataC.csv')

Each data set has an X and a Y column which represent 2-dimensional variables that we need to rotate.

A)

Scale each data set and plot them side-by-side using the patchwork package. Make sure the aspect ratio of each graph is 1 (i.e., make the height and width of each graph equal). At this point, explain why you think I was skeptical. Specifically, do you think the percent of variability explained by the first PC of each data set appears to exceed or fall short of the variability I asked for?

library(tidyverse)
library(patchwork)
theme_set(theme_minimal(base_size = 12))

clean_xy <- function(df) {
  df %>%
    select(X, Y) %>%
    mutate(across(everything(), as.numeric)) %>%
    drop_na()
}
A <- clean_xy(claudeA); B <- clean_xy(claudeB); C <- clean_xy(claudeC)

plot_scaled <- function(df, title) {
  Z <- as.data.frame(scale(df))
  ggplot(Z, aes(X, Y)) + geom_point(size = 2, alpha = 0.9) +
    coord_equal() + labs(title = title, subtitle = "scaled (z-scores)", x = "X", y = "Y")
}

(pA <- plot_scaled(A, "claudeA")) | (pB <- plot_scaled(B, "claudeB")) | (pC <- plot_scaled(C, "claudeC"))

  • claudeA: the cloud is tightly stretched along one direction, far more elongated than a cloud whose first PC explains only 55% should be, so it appears to greatly exceed 55%.

  • claudeB: likewise a strong one-direction pattern with only modest scatter, so it looks well above the requested 75%.

  • claudeC: the points fall almost exactly on a line, so PC1 appears to be well above 90% too.

In every case the cloud looks too close to linear for the percentage requested, which is why I was skeptical.

B)

Use SVD to find the first PC for each data set, and find the actual percent of total variability explained by each PC using aggregation methods.

pc_shares <- function(df) {
  Z <- scale(df)
  sv <- svd(Z)
  # Each PC's share of the total variance is d_i^2 / sum(d_j^2)
  tibble(PC1_pct = 100 * sv$d[1]^2 / sum(sv$d^2),
         PC2_pct = 100 * sv$d[2]^2 / sum(sv$d^2))
}

res <- bind_rows(
  pc_shares(A) %>% mutate(Data = "claudeA"),
  pc_shares(B) %>% mutate(Data = "claudeB"),
  pc_shares(C) %>% mutate(Data = "claudeC")
) %>% relocate(Data)

print(res, digits = 3)
# A tibble: 3 × 3
  Data    PC1_pct PC2_pct
  <chr>     <dbl>   <dbl>
1 claudeA    97.6   2.44 
2 claudeB    95.9   4.08 
3 claudeC    99.5   0.509

All three first PCs explain far more variability than requested: 97.6% vs the 55% asked for claudeA, 95.9% vs 75% for claudeB, and 99.5% vs 90% for claudeC. The skepticism in A) was justified.
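A closed-form sanity check: for two standardized variables, the 2x2 correlation matrix has eigenvalues 1 + |r| and 1 - |r|, so the first PC’s share is exactly (1 + |r|)/2. The one-liner below should reproduce the SVD percentages above.

# PC1 share for 2-D standardized data: 100 * (1 + |r|) / 2
sapply(list(claudeA = A, claudeB = B, claudeC = C),
       function(df) 100 * (1 + abs(cor(df$X, df$Y))) / 2)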