Activity 3.2 - PC concepts

SUBMISSION INSTRUCTIONS

  1. Render to html
  2. Publish your html to RPubs
  3. Submit a link to your published solutions

Question 1

The data set we will analyze for this question contains the results of the 10 events in the Men’s 2024 Olympic Decathlon in Paris.

decathlon <- read.csv('Data/mens_decathlon_paris2024.csv')
head(decathlon)
         Athlete  Medal      Nation Overall X100m LongJump ShotPut HighJump
1   Markus Rooth   Gold      Norway    8796 10.71     7.80   15.25     1.99
2 Leo Neugebauer Silver     Germany    8748 10.67     7.98   16.55     2.05
3  Lindon Victor Bronze     Grenada    8711 10.56     7.48   15.71     2.02
4    Sven Roosen   None Netherlands    8607 10.52     7.56   15.10     1.87
5  Janek Õiglane   None     Estonia    8572 10.89     7.25   14.58     1.99
6   Johannes Erm   None     Estonia    8569 10.64     7.66   14.61     2.08
  X400m X110mHurdle Discus PoleVault Javelin X1500m
1 47.69       14.25  49.80       5.3   66.87  279.6
2 47.70       14.51  53.33       5.0   56.64  284.7
3 47.84       14.62  53.91       4.9   68.22  283.5
4 46.40       13.99  46.88       4.7   63.72  258.5
5 48.02       14.45  43.49       5.3   71.89  265.6
6 47.19       14.35  46.29       4.6   59.58  259.7

For the purposes of this question, assume we have 10-dimensional data - that is, ignore the Overall column.

A)

Explain why we need to scale this data set before performing PCA. Without scaling, the events would not be weighted equally: PCA chases variance, so events recorded on large numeric scales (like 1500m times, in the hundreds of seconds) would dominate the components over events on small scales (like high jump heights, around 2 meters). Scaling gives every event unit variance so each contributes comparably.
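
To see the scale problem concretely, here is a quick check of the raw variances (a minimal sketch; it assumes the ten event columns sit in positions 5 through 14, as in the head() output above):

# raw event variances sit on wildly different scales, e.g. 1500m times
# (hundreds of seconds) versus high jump heights (about 2 meters)
sapply(decathlon[, 5:14], var)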

B)

Use svd() to find the first 2 principal component scores and their loadings. Full credit will only be granted if you use the svd() ingredients u, d, and v. What percent of the overall variability do the first two PCs explain? About 48% (29.1% from PC1 plus 19.3% from PC2; see the output below).

library(tidyverse)
decathlon_numeric <- decathlon %>% select(X100m:X1500m)   # keep only the 10 events
scaled_decathlon <- scale(decathlon_numeric)              # center and scale each event
decathlon_svd <- svd(scaled_decathlon)                    # X = U D V^T
U <- decathlon_svd$u
D <- diag(decathlon_svd$d)
V <- decathlon_svd$v                                      # columns of V are the loadings
# PC scores from the svd() ingredients: since X = U D V^T, the scores X V equal U D
PCs <- U %*% D

# each PC's share of the total variance (the scaled data has total variance 10,
# one unit per column)
(PCs
 %>% data.frame()
 %>% summarize(across(everything(), var))
 %>% mutate(across(everything(), \(x) x/10))
)
         X1        X2       X3        X4        X5         X6         X7
1 0.2913494 0.1928006 0.146543 0.1404117 0.0952215 0.05535841 0.04806271
          X8         X9         X10
1 0.01403932 0.01222556 0.003987768
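
Equivalently, the proportions can be read straight off the singular values, since PC i’s variance is d_i^2/(n-1) and the total variance of the scaled data is fixed at 10:

# proportion of total variance per PC: d_i^2 / sum(d^2)
prop_var <- decathlon_svd$d^2 / sum(decathlon_svd$d^2)
sum(prop_var[1:2])   # first two PCs together, about 0.48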

C)

Find and print the loadings. Based on the loadings alone, if the first two PCs are plotted in a 2D plane as shown below, which of the four quadrants will the medalists be in? Explain your reasoning. Quadrant 2. On PC1, the running events (100m, 400m, hurdles) have positive loadings; since bigger times are worse, fast athletes get pushed left. Most of the jumping and throwing events (long jump, shot put, high jump, discus) have negative loadings, so long marks also push athletes left (javelin is an exception, with a positive loading). Medalists, being fast and long, should therefore have negative PC1. On PC2, the largest loadings are positive and belong to field events (discus, shot put, pole vault), so the strongest throwers and vaulters get pushed up. The medalists should land in the upper left: quadrant 2.

# label the loadings by event so they are readable
loadings_12 <- V %>% as.data.frame() %>% select(V1, V2)
rownames(loadings_12) <- colnames(decathlon_numeric)
loadings_12
                      V1          V2
X100m        0.446612410  0.01678592
LongJump    -0.493037935 -0.06565628
ShotPut     -0.309431542  0.53134100
HighJump    -0.150386291 -0.05665757
X400m        0.485973869  0.14847674
X110mHurdle  0.345270647  0.31685991
Discus      -0.139145881  0.59570522
PoleVault    0.019829831  0.48083261
Javelin      0.252821246  0.03137029
X1500m       0.005588625 -0.01948870
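
The sign pattern behind the quadrant argument is easier to scan than the raw numbers (using the loadings_12 table from above):

# PC1: the running events carry positive loadings (slow times push right),
# while most field events are negative; PC2's big positive loadings are
# discus, shot put, and pole vault
sign(loadings_12)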

D)

Add the PCs to the decathlon data set and create a scatterplot of these PCs, with the points labeled by the athletes’ names. Color-code the points by whether or not the athlete was a medalist. Use the ggrepel package for better labeling. Verify that your intuition from C) is correct. It was! (A numeric check follows the plot.)

library(ggrepel)   # ggplot2 is already attached via the tidyverse
PC_df <- PCs %>% as.data.frame()

decathlon["PC1"] <- PC_df$V1
decathlon["PC2"] <- PC_df$V2

ggplot(data = decathlon) + 
  geom_point(aes(x = PC1, y = PC2, color = Medal), size = 3, alpha = 0.8) + 
  geom_text_repel(aes(x = PC1, y = PC2, label = Athlete),
                  size = 3,
                  max.overlaps = 20) +
  labs(x = 'PC1', y = 'PC2') +
  # no hard xlim/ylim here: fixed limits of c(-4, 4) silently dropped one athlete
  theme_classic(base_size = 14)
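
As a quick numeric check of the quadrant-2 intuition (a sketch using the PC1/PC2 columns just added):

# all three medalists should show PC1 < 0 and PC2 > 0
decathlon %>%
  filter(Medal != "None") %>%
  select(Athlete, Medal, PC1, PC2)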

E)

Canadian Damian Warner won the gold medal in the decathlon in the 2020 Tokyo games. He began the 2024 decathlon but bowed out after three straight missed pole vault attempts.

These are his results in the 10 events in 2020:

warner <- c(10.12, 8.24, 14.8, 2.02, 47.48, 13.46, 48.67, 4.9, 63.44, 271.08)

Would this have won a medal if it had happened in 2024? To answer this, we will compute his PCs with respect to the 2024 athletes and add them to the plot to see how his 2020 gold-medal performance compares to the 2024 field. To do this:

  • Find the mean vector from the 2024 athletes. Call it mean_vec_24.

    mean_vec_24 <- sapply(decathlon_numeric, mean, na.rm = TRUE)

  • Find the standard deviation vector from the 2024 athletes. Call it sd_vec_24.

    sd_vec_24 <- sapply(decathlon_numeric, sd, na.rm = TRUE)

  • Standardize Warner’s 2020 results with respect to the 2024 athletes: (warner-mean_vec_24)/sd_vec_24

    warner_scaled <- (warner - mean_vec_24) / sd_vec_24

  • Find Warner’s PC coordinates using the 2024 loadings.

    warner_coords <- warner_scaled %*% V
    print(warner_coords)
              [,1]       [,2]    [,3]       [,4]      [,5]       [,6]      [,7]
    [1,] -3.422699 -0.5092531 0.58943 -0.2655212 -1.898102 -0.5643154 0.1706496
              [,8]      [,9]     [,10]
    [1,] 0.0436934 -1.323623 0.8834635
  • Add his point to the scatterplot.

ggplot(data = decathlon) + 
  geom_point(aes(x = PC1, y = PC2, color = Medal), size = 3, alpha = 0.8) +
  geom_text_repel(aes(x = PC1, y = PC2, label = Athlete),
                  size = 3,
                  max.overlaps = 20) +
  geom_point(data = data.frame(PC1 = warner_coords[1], PC2 = warner_coords[2]),
             aes(x = PC1, y = PC2),
             color = "black", size = 4, shape = 17) +
  labs(x = "PC1", y = "PC2") +
  # again, no hard axis limits, so no athletes get dropped from the plot
  theme_classic(base_size = 14)

Do you think his 2020 performance would have won a medal if it had happened in 2024? No. The actual medalists sit much higher on PC2 than Warner does, and although PC1 explains more of the variability than PC2, I don’t think his strong PC1 position alone would have been enough.

Question 2

Below is a screenshot of a conversation between me and chatbot Claude:

After looking at the graphs, I grew skeptical. So I said:

Behold, Claude’s three data sets, which I’ve called claudeA, claudeB, and claudeC:

claudeA <- read.csv('Data/claude_dataA.csv')
claudeB <- read.csv('Data/claude_dataB.csv')
claudeC <- read.csv('Data/claude_dataC.csv')
claudeA
     X   Y
1   40 115
2   45 105
3   50 110
4   55  95
5   60 100
6   65  85
7   70  90
8   75  75
9   80  80
10  85  65
11  90  70
12  95  55
13 100  60
14 105  45
15 110  50
16 115  40
17  42 120
18  48 100
19  52 115
20  58  90
21  62 105
22  68  80
23  72  95
24  78  70
25  82  85
26  88  60
27  92  75
28  98  50
29 102  65
30 108  45
31 112  55
32 118  35
33 120  45
34 125  40
35 130  50
36 135  35
37 140  40
38 145  30
39 150  35
40 155  25

Each data set has an X and a Y column, representing two-dimensional data that we need to rotate.

A)

Scale each data set and plot them side-by-side using the patchwork package. Make sure the aspect ratio of each graph is 1 (i.e., make the height and width of each graph equal). At this point, explain why you think I was skeptical. Specifically, do you think the percent variability explained by the first PC of each data set appears to exceed or fall short of the variability I asked it to? In each plot the points hug a single line, so the first PC looks like it would explain the vast majority of the variability, far more than the 75% and 50% I asked for.

library(patchwork)
# scale() returns a matrix, so coerce to data frames before handing them to ggplot
scaled_A <- as.data.frame(scale(claudeA))
scaled_B <- as.data.frame(scale(claudeB))
scaled_C <- as.data.frame(scale(claudeC))

p1 <- ggplot(scaled_A, aes(x = X, y = Y)) +
  geom_point() +
  coord_fixed(ratio = 1) +
  ggtitle("claudeA")

p2 <- ggplot(scaled_B, aes(x = X, y = Y)) +
  geom_point() +
  coord_fixed(ratio = 1) +
  ggtitle("claudeB")

p3 <- ggplot(scaled_C, aes(x = X, y = Y)) +
  geom_point() +
  coord_fixed(ratio = 1) +
  ggtitle("claudeC")

# combine side-by-side
combined_plot <- p1 + p2 + p3 + plot_layout(ncol = 3)
combined_plot
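
In two dimensions there is a handy shortcut: for standardized data, the 2x2 correlation matrix has eigenvalues 1 + |r| and 1 - |r|, so the first PC explains (1 + |r|)/2 of the total variance. That makes the skepticism easy to quantify before doing any SVD (a sketch using the raw data frames loaded above):

# first-PC share for 2-D standardized data: (1 + |r|) / 2
sapply(list(A = claudeA, B = claudeB, C = claudeC),
       \(d) (1 + abs(cor(d$X, d$Y))) / 2)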

B)

Use SVD to find the first PC for each data set, and find the actual percent of total variability explained by each PC using aggregation methods.

scaled_A <- scale(claudeA)
A_svd <- svd(scaled_A)
A_U <- A_svd$u
A_D <- diag(A_svd$d)
A_V <- A_svd$v

A_PCs <- scaled_A %*% A_V


(A_PCs
 %>% data.frame()
 %>% summarize(across(everything(), var))
 %>% mutate(across(everything(), \(x) x/2))
)
         X1         X2
1 0.9756201 0.02437991
scaled_B <- scale(claudeB)
B_svd <- svd(scaled_B)
B_U <- B_svd$u
B_D <- diag(B_svd$d)
B_V <- B_svd$v

B_PCs <- scaled_B %*% B_V


(B_PCs
 %>% data.frame()
 %>% summarize(across(everything(), var))
 %>% mutate(across(everything(), \(x) x/2))
)
         X1         X2
1 0.9591807 0.04081932
scaled_C <- scale(claudeC)
C_svd <- svd(scaled_C)
C_U <- C_svd$u
C_D <- diag(C_svd$d)
C_V <- C_svd$v

C_PCs <- scaled_C %*% C_V


(C_PCs
 %>% data.frame()
 %>% summarize(across(everything(), var))
 %>% mutate(across(everything(), \(x) x/2))
)
         X1          X2
1 0.9949096 0.005090442
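
As a sanity check, the same shares can be pulled straight from the singular values, since PC i’s share of the total variance is d_i^2 / sum(d^2). A small helper (pc1_share is a name I’ve made up) collapses the three repeated blocks above:

# first-PC share of total variance, straight from the singular values
pc1_share <- function(X) {
  d <- svd(scale(X))$d
  d[1]^2 / sum(d^2)
}
sapply(list(A = claudeA, B = claudeB, C = claudeC), pc1_share)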