Activity 3.2 - PC concepts

SUBMISSION INSTRUCTIONS

Render to html
Publish your html to RPubs
Submit a link to your published solutions

Question 1

The data set we will analyze for this question are on the 10 events in the Men’s 2024 Olympic Decathlon in Paris.

decathlon <- read.csv('Data/mens_decathlon_paris2024.csv')
head(decathlon)

         Athlete  Medal      Nation Overall X100m LongJump ShotPut HighJump
1   Markus Rooth   Gold      Norway    8796 10.71     7.80   15.25     1.99
2 Leo Neugebauer Silver     Germany    8748 10.67     7.98   16.55     2.05
3  Lindon Victor Bronze     Grenada    8711 10.56     7.48   15.71     2.02
4    Sven Roosen   None Netherlands    8607 10.52     7.56   15.10     1.87
5  Janek Õiglane   None     Estonia    8572 10.89     7.25   14.58     1.99
6   Johannes Erm   None     Estonia    8569 10.64     7.66   14.61     2.08
  X400m X110mHurdle Discus PoleVault Javelin X1500m
1 47.69       14.25  49.80       5.3   66.87  279.6
2 47.70       14.51  53.33       5.0   56.64  284.7
3 47.84       14.62  53.91       4.9   68.22  283.5
4 46.40       13.99  46.88       4.7   63.72  258.5
5 48.02       14.45  43.49       5.3   71.89  265.6
6 47.19       14.35  46.29       4.6   59.58  259.7

For the purposes of this question, assume we have 10-dimensional data - that is, ignore the Overall column.

A)

Explain why we need to scale this data set before performing PCA.

We have varying units (time, distance in meters), and some categories will obviously have much higher raw numbers. Scaling it to Z-score allows fairly weighted distributions.

B)

Use svd() to find the first 2 principal component scores and their loadings. Full credit will only be granted if you use the svd() ingredients u, d, and v. What percent of the overall variability do the first two PCs explain?

decathlon_num <- decathlon[, c("X100m","LongJump","ShotPut","HighJump","X400m", "X110mHurdle","Discus","PoleVault","Javelin","X1500m")]

scaled <- scale(decathlon_num)

svd_dec <- svd(scaled)

scores <- svd_dec$u %*% diag(svd_dec$d)


#First 2 PC scores
head(scores[, 1:2])

           [,1]        [,2]
[1,] -0.9696335  0.90753969
[2,] -2.1338950  2.24014162
[3,] -0.7323012  2.05468273
[4,] -1.4369409 -0.05634227
[5,]  1.0587944 -0.11932168
[6,] -1.1274914 -0.28469476

# Percent variability
var_explained <- svd_dec$d^2 / sum(svd_dec$d^2)
sum(var_explained[1:2]) * 100

[1] 48.415

We can see that PC1 and PC2 account for ~48.415% of our variability in the data.

C)

Find and print the loadings. Based on the loadings alone, if the first two PCs are plotted in a 2D plane as shown below, which of the four quadrants will the medalists be in? Explain your reasoning.

loadings <- svd_dec$v

loadings

              [,1]        [,2]        [,3]        [,4]        [,5]       [,6]
 [1,]  0.446612410  0.01678592  0.38576229  0.12493749  0.26964863  0.1667469
 [2,] -0.493037935 -0.06565628  0.27554389 -0.06611412 -0.03078180 -0.3264439
 [3,] -0.309431542  0.53134100 -0.00947694  0.06406468  0.19871416  0.2222259
 [4,] -0.150386291 -0.05665757  0.57613649 -0.37633765  0.19377423  0.5105783
 [5,]  0.485973869  0.14847674 -0.02405499 -0.39754578 -0.08944687  0.1120462
 [6,]  0.345270647  0.31685991 -0.14325429  0.05219004  0.44098456 -0.2900703
 [7,] -0.139145881  0.59570522  0.20471769 -0.20321421  0.12155979 -0.3620240
 [8,]  0.019829831  0.48083261 -0.09567361  0.17454218 -0.59927487  0.3907952
 [9,]  0.252821246  0.03137029  0.55462233  0.08930521 -0.49030585 -0.4020522
[10,]  0.005588625 -0.01948870 -0.25193417 -0.77057276 -0.18036416 -0.1158963
              [,7]        [,8]        [,9]       [,10]
 [1,] -0.262470499  0.59983862  0.26595652 -0.18944330
 [2,]  0.350947009  0.52787978  0.12787448  0.38291183
 [3,] -0.382254696 -0.19049711  0.45086373  0.37838762
 [4,]  0.323966967 -0.28263209 -0.04962895 -0.12277368
 [5,] -0.003756926  0.09533420 -0.32559335  0.66894987
 [6,]  0.612670284 -0.16451909  0.27451904 -0.02724166
 [7,] -0.193051886  0.07844891 -0.49908732 -0.32282941
 [8,]  0.356069620  0.23836573  0.04313143 -0.17740628
 [9,] -0.088583701 -0.36785551  0.26473950  0.06494954
[10,] -0.097936293  0.10450479  0.45251177 -0.26663253

Based on the loadings, medalists would fall in Quadrant I because the events where higher or lower values indicate better performance mostly contribute to positive PC1 and PC2 scores.

D)

Add the PCs to the decathlon data set and create a scatterplot of these PCs, with the points labeled by the athletes’ names. Color-code the points on whether or not the athlete was a medalist. Use the ggrepel package for better labeling. Verify that your intuition from C) is correct.

library(ggplot2)
library(ggrepel)

decathlon$PC1 <- scores[,1]  #PC 1 and 2
decathlon$PC2 <- scores[,2]  

decathlon$Medal <- factor(decathlon$Medal, levels = c("Gold", "Silver", "Bronze", "None"))


ggplot(decathlon, aes(x = PC1, y = PC2, color = Medal)) +
  geom_point(size = 3) +  
  geom_text_repel(aes(label = Athlete), size = 3) +  # Avoid overlap
  scale_color_manual(values = c("Gold" = "#FFD700",     
                                "Silver" = "#C0C0C0",   # Color-code
                                "Bronze" = "#CD7F32",   
                                "None" = "black")) +  
  labs(title = "Decathlon Athletes on First Two Principal Components",
       x = "PC1",
       y = "PC2") +
  theme_minimal()

To my surprise, the medalists land in Quadrant II. This makes sense, as the competitions differ in if you want high values, like long jump, or low values, such as 100 M hurdles. Depending on what choose as PC1 and PC2 values, the best finishes will be in the second or fourth quadrant.

E)

Canadian Damian Warner won the gold medal in the decathlon in the 2020 Tokyo games. He began the 2024 decathlon but bowed out after three straight missed pole vault attempts.

These are his results in the 10 events in 2020:

warner <- c(10.12, 8.24, 14.8, 2.02, 47.48, 13.46, 48.67, 4.9, 63.44, 271.08)

Would this have won a medal if it had happened in 2024? To answer this, we will compute his PCs with respect to the 2024 athletes and add it to the plot to see where his 2020 gold-medal performance compares to the 2024 athletes. To do this:

Find the mean vector from the 2024 athletes. Call it mean_vec_24.
Find the standard deviation vector from the 2024 athletes. Call it sd_vec_24.
Standardize Warner’s 2020 results with respect to the 2024 athletes: (warner-mean_vec_24)/sd_vec_24
Find Warner’s PC coordinates using the 2024 loadings.
Add his point to the scatterplot.

# Compute mean and SD vectors from 2024 athletes
mean_vec_24 <- apply(decathlon_num, 2, mean)
sd_vec_24   <- apply(decathlon_num, 2, sd)
cat("Mean vector:", mean_vec_24, "\nSD vector:", sd_vec_24, "\n")

Mean vector: 10.7465 7.378 14.863 1.9975 48.11 14.364 46.116 4.36 61.627 275.845 
SD vector: 0.2519247 0.3598333 0.8663967 0.06873864 1.222004 0.3759115 3.829339 1.504869 7.104155 12.15537

# 2024 athletes' PCs 
pc_scores <- svd_dec$u %*% diag(svd_dec$d)
decathlon$PC1_centered <- pc_scores[,1] - mean(pc_scores[,1])
decathlon$PC2_centered <- pc_scores[,2] - mean(pc_scores[,2])

# Standardize Warner's 2020 results and compute his PCs
warner_std <- (warner - mean_vec_24) / sd_vec_24
warner_PC <- warner_std %*% svd_dec$v[, 1:2]

# Add Warner to data
warner_df <- data.frame(Athlete="Damian Warner 2020",
                        Medal="Warner 2020",  
                        PC1_centered=warner_PC[1] - mean(pc_scores[,1]),
                        PC2_centered=warner_PC[2] - mean(pc_scores[,2]))
plot_df <- rbind(decathlon[, c("Athlete","Medal","PC1_centered","PC2_centered")], warner_df)

# 5. Scatterplot with Damian 
ggplot(plot_df, aes(x=PC1_centered, y=PC2_centered, color=Medal)) +
  geom_point(size=3) +
  geom_text_repel(aes(label=Athlete), size=3) +
  scale_color_manual(values=c("Gold"="#FFD700","Silver"="#C0C0C0",
                              "Bronze"="#CD7F32","None"="gray50",
                              "Warner 2020"="black")) +  
  labs(title="Decathlon Athletes (2024) with Damian Warner 2020",
       x="PC1 (centered)", y="PC2 (centered)") +
  theme_minimal() +
  geom_hline(yintercept=0, linetype="dashed") +
  geom_vline(xintercept=0, linetype="dashed")

Do you think his 2020 performance would have won a medal if it had happened in 2024?

Looking at his point, his PC1 and PC2 don’t seem as close to (-1,1) as others in the dataset. He would definitely be somewhere in the top 10, but winning a medal is unlikely in my opinion.

Question 2

Below is a screenshot of a conversation between me and chatbot Claude:

After looking at the graphs, I grew skeptical. So I said:

Behold, Claude’s three data sets which I’ve called claudeA, claudeB, and claudeC:

claudeA <- read.csv('Data/claude_dataA.csv')
claudeB <- read.csv('Data/claude_dataB.csv')
claudeC <- read.csv('Data/claude_dataC.csv')

Each data set has an X and a Y column which represent 2-dimensional variables that we need to rotate.

A)

Scale each data set and plot them side-by-side using the patchwork package. Make sure the aspect ratio of each graph is 1 (i.e., make the height and width of each graph equal). At this point, explain why you think I was skeptical. Specifically, do you think the percent variability explained by the first PC of each data set appears to exceed or fall short of the variability I asked it to?

library(patchwork)

# Scale and perform PCA
process_pca <- function(df) {
  df_scaled <- scale(df)        # center and scale X and Y
  pca <- prcomp(df_scaled)      # PCA rotation
  data.frame(PC1 = pca$x[,1], PC2 = pca$x[,2])
}

# Apply PCA to Claude datasets
claudeA_pca <- process_pca(claudeA[, c("X","Y")])
claudeB_pca <- process_pca(claudeB[, c("X","Y")])
claudeC_pca <- process_pca(claudeC[, c("X","Y")])

# Plot PCA, with fixed borders and ratio
plot_pca <- function(df, title_text) {
  ggplot(df, aes(x=PC1, y=PC2)) +
    geom_point() +
    coord_fixed(ratio=3, xlim=c(-3,3), ylim=c(-1,1)) +  # fixed limits
    labs(title=title_text, x="PC1", y="PC2") +
    theme_minimal()
}

# Individual plots
plotA <- plot_pca(claudeA_pca, "Claude A (PC1 ~55%)")
plotB <- plot_pca(claudeB_pca, "Claude B (PC1 ~75%)")
plotC <- plot_pca(claudeC_pca, "Claude C (PC1 ~90%)")

# Side-by-side
combined_plot <- plotA + plotB + plotC + plot_layout(widths=c(5,5,5))

combined_plot

First off, explaining 90% of variability in only PC1 is not likely, and the data would probably have to be altered in some way to make that work.

Second, the variability looks off. While in plot C, the points seem to be elongated onto PC1, plot A looks very similar, when it should probably be way more varied, especially on the PC2 axis. Also, plot B looks way worse than plot A, even though the data should be tighter on the PC1 and PC 2 axis.

Use SVD to find the first PC for each data set, and find the actual percent of total variability explained by each PC using aggregation methods.

# Function to compute SVD on PC1
svd_pc1_variance <- function(df) {
  df_scaled <- scale(df)                # center and scale 
  svd_res <- svd(df_scaled)            
  singular_values <- svd_res$d          
  pc1 <- svd_res$u[,1] * singular_values[1]  
  variance_explained <- (singular_values[1]^2) / sum(singular_values^2) * 100
  list(PC1_scores = pc1, percent_var = variance_explained)
}

claudeA_svd <- svd_pc1_variance(claudeA[, c("X","Y")])
claudeB_svd <- svd_pc1_variance(claudeB[, c("X","Y")])
claudeC_svd <- svd_pc1_variance(claudeC[, c("X","Y")])

 # Print percent variance 
cat("Claude A: PC1 explains", round(claudeA_svd$percent_var, 1), "% of variance\n")

Claude A: PC1 explains 97.6 % of variance

cat("Claude B: PC1 explains", round(claudeB_svd$percent_var, 1), "% of variance\n")

Claude B: PC1 explains 95.9 % of variance

cat("Claude C: PC1 explains", round(claudeC_svd$percent_var, 1), "% of variance\n")

Claude C: PC1 explains 99.5 % of variance