Activity 3.2 - PC concepts

SUBMISSION INSTRUCTIONS

Render to html
Publish your html to RPubs
Submit a link to your published solutions

library(tidyverse)

Warning: package 'ggplot2' was built under R version 4.4.3

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   4.0.0     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Question 1

The data set we will analyze for this question are on the 10 events in the Men’s 2024 Olympic Decathlon in Paris.

decathlon <- read.csv('./Data/mens_decathlon_paris2024.csv')
head(decathlon)

         Athlete  Medal      Nation Overall X100m LongJump ShotPut HighJump
1   Markus Rooth   Gold      Norway    8796 10.71     7.80   15.25     1.99
2 Leo Neugebauer Silver     Germany    8748 10.67     7.98   16.55     2.05
3  Lindon Victor Bronze     Grenada    8711 10.56     7.48   15.71     2.02
4    Sven Roosen   None Netherlands    8607 10.52     7.56   15.10     1.87
5  Janek Õiglane   None     Estonia    8572 10.89     7.25   14.58     1.99
6   Johannes Erm   None     Estonia    8569 10.64     7.66   14.61     2.08
  X400m X110mHurdle Discus PoleVault Javelin X1500m
1 47.69       14.25  49.80       5.3   66.87  279.6
2 47.70       14.51  53.33       5.0   56.64  284.7
3 47.84       14.62  53.91       4.9   68.22  283.5
4 46.40       13.99  46.88       4.7   63.72  258.5
5 48.02       14.45  43.49       5.3   71.89  265.6
6 47.19       14.35  46.29       4.6   59.58  259.7

For the purposes of this question, assume we have 10-dimensional data - that is, ignore the Overall column.

A)

Explain why we need to scale this data set before performing PCA.

If we do not scale the data, then events like Javelin and Discus will dominate the differences between the observations. This is because these two events have the highest score. Scaling will ensure that all of these events are on the same scale.

B)

Use svd() to find the first 2 principal component scores and their loadings. Full credit will only be granted if you use the svd() ingredients u, d, and v. What percent of the overall variability do the first two PCs explain?

svd_dec_df <- (
  decathlon
  %>% select(-Athlete, -Medal, -Nation, - Overall)
  %>% scale()
)

svd_decathlon <- svd(svd_dec_df)
U_dec <- svd_decathlon$u
D_dec <- diag(svd_decathlon$d)
V_dec <- svd_decathlon$v

PCs_approach1 <- (U_dec %*% D_dec)[,1:2]
PCs_approach1

            [,1]        [,2]
 [1,] -0.9696335  0.90753969
 [2,] -2.1338950  2.24014162
 [3,] -0.7323012  2.05468273
 [4,] -1.4369409 -0.05634227
 [5,]  1.0587944 -0.11932168
 [6,] -1.1274914 -0.28469476
 [7,] -1.5170355  0.60252099
 [8,]  2.7432616  0.22618248
 [9,] -2.5354593 -0.71643964
[10,]  0.4269279 -0.78152461
[11,]  0.8580252  1.76415773
[12,] -1.2047343  0.78549507
[13,] -0.1761558 -0.74356430
[14,]  0.8835651 -1.04662217
[15,] -0.3205275 -1.05965320
[16,]  0.6345050 -0.94344570
[17,]  1.9329181  1.94562825
[18,] -1.6221238 -2.31911240
[19,]  4.1747261  0.40963793
[20,]  1.0635749 -2.86526573

PCs_approach2 <- (svd_dec_df %*% V_dec)[,1:2]
PCs_approach2

            [,1]        [,2]
 [1,] -0.9696335  0.90753969
 [2,] -2.1338950  2.24014162
 [3,] -0.7323012  2.05468273
 [4,] -1.4369409 -0.05634227
 [5,]  1.0587944 -0.11932168
 [6,] -1.1274914 -0.28469476
 [7,] -1.5170355  0.60252099
 [8,]  2.7432616  0.22618248
 [9,] -2.5354593 -0.71643964
[10,]  0.4269279 -0.78152461
[11,]  0.8580252  1.76415773
[12,] -1.2047343  0.78549507
[13,] -0.1761558 -0.74356430
[14,]  0.8835651 -1.04662217
[15,] -0.3205275 -1.05965320
[16,]  0.6345050 -0.94344570
[17,]  1.9329181  1.94562825
[18,] -1.6221238 -2.31911240
[19,]  4.1747261  0.40963793
[20,]  1.0635749 -2.86526573

Both approaches for finding the PCs match.

(
 PCs_approach1
 %>% data.frame()
 %>% summarize(across(everything(), var))
 %>% mutate(across(everything(), \(x) x/10))
)

         X1        X2
1 0.2913494 0.1928006

Our first principal component explains 29.13% of the variability. The second principal component 19.28% of the variability. In total our first two principal components explain 48.41% of the variability. Additionally, we have our loadings stored in V_dec.

C)

Find and print the loadings. Based on the loadings alone, if the first two PCs are plotted in a 2D plane as shown below, which of the four quadrants will the medalists be in? Explain your reasoning.

V_dec[,1:2]

              [,1]        [,2]
 [1,]  0.446612410  0.01678592
 [2,] -0.493037935 -0.06565628
 [3,] -0.309431542  0.53134100
 [4,] -0.150386291 -0.05665757
 [5,]  0.485973869  0.14847674
 [6,]  0.345270647  0.31685991
 [7,] -0.139145881  0.59570522
 [8,]  0.019829831  0.48083261
 [9,]  0.252821246  0.03137029
[10,]  0.005588625 -0.01948870

head(decathlon)

         Athlete  Medal      Nation Overall X100m LongJump ShotPut HighJump
1   Markus Rooth   Gold      Norway    8796 10.71     7.80   15.25     1.99
2 Leo Neugebauer Silver     Germany    8748 10.67     7.98   16.55     2.05
3  Lindon Victor Bronze     Grenada    8711 10.56     7.48   15.71     2.02
4    Sven Roosen   None Netherlands    8607 10.52     7.56   15.10     1.87
5  Janek Õiglane   None     Estonia    8572 10.89     7.25   14.58     1.99
6   Johannes Erm   None     Estonia    8569 10.64     7.66   14.61     2.08
  X400m X110mHurdle Discus PoleVault Javelin X1500m
1 47.69       14.25  49.80       5.3   66.87  279.6
2 47.70       14.51  53.33       5.0   56.64  284.7
3 47.84       14.62  53.91       4.9   68.22  283.5
4 46.40       13.99  46.88       4.7   63.72  258.5
5 48.02       14.45  43.49       5.3   71.89  265.6
6 47.19       14.35  46.29       4.6   59.58  259.7

We would expect the medalists to be in quadrant 2. We can see for starters that PC2 had very few negative loadings. The only 3 negative loadings are small in absolute value compared to the other loadings. That means that the negative loadings will not contribute much to PC2. This tells us that someone who did well in a lot of events (like a medalist) would have a high PC2. The average loading value for PC1 is much close to 0, then the average loading value for PC2. Of the 10 loadings for PC2, 4 of them are negative and 6 are positive, however, 2 of the positive ones are very close to 0. So, we would expect the medalists to be around the x = 0 line, since high scores in each event would balacne each other out. Notably, each medalist perforemd very well in long jump, which is one of the largest negative components, so that is why I say they will be in quadrant 2.

D)

Add the PCs to the decathlon data set and create a scatterplot of these PCs, with the points labeled by the athletes’ names. Color-code the points on whether or not the athlete was a medalist. Use the ggrepel package for better labeling. Verify that your intuition from C) is correct.

library(ggrepel)

Warning: package 'ggrepel' was built under R version 4.4.3

PCs_df_dec <- (bind_cols(decathlon, data.frame(PCs_approach1)) 
               %>% mutate(medalists = case_when(Medal == 'None' ~ "No",
                                                TRUE ~ 'Yes'))
)

ggplot(data = PCs_df_dec, aes(x = X1, y = X2, color = medalists)) +
  geom_point() +
  geom_text_repel(aes(label = Athlete)) +
  xlab('PC1') + ylab('PC2') +
  theme_classic()

This matches our intuition from above. The medalists are in quadrant 2, but are close to the x = 0 line.

E)

Canadian Damian Warner won the gold medal in the decathlon in the 2020 Tokyo games. He began the 2024 decathlon but bowed out after three straight missed pole vault attempts.

These are his results in the 10 events in 2020:

warner <- c(10.12, 8.24, 14.8, 2.02, 47.48, 13.46, 48.67, 4.9, 63.44, 271.08)

Would this have won a medal if it had happened in 2024? To answer this, we will compute his PCs with respect to the 2024 athletes and add it to the plot to see where his 2020 gold-medal performance compares to the 2024 athletes. To do this:

Find the mean vector from the 2024 athletes. Call it mean_vec_24.
Find the standard deviation vector from the 2024 athletes. Call it sd_vec_24.
Standardize Warner’s 2020 results with respect to the 2024 athletes: (warner-mean_vec_24)/sd_vec_24
Find Warner’s PC coordinates using the 2024 loadings.
Add his point to the scatterplot.

mean_vec_24 <- (apply(decathlon %>%  select(-Athlete, -Medal, -Nation, -Overall), 2, mean))
sd_vec_24 <- (apply(decathlon %>%  select(-Athlete, -Medal, -Nation, -Overall), 2, sd))

warner_standard <- ((warner-mean_vec_24)/sd_vec_24)

warner_pcs <- warner_standard %*% V_dec[,1:2]

ggplot(data = PCs_df_dec, aes(x = X1, y = X2, color = medalists)) +
  geom_point() +
  geom_text_repel(aes(label = Athlete)) +
  geom_point(x = warner_pcs[1], y = warner_pcs[2], color = 'black', show.legend = FALSE, size = 5, shape = "*") +
  geom_text(x = warner_pcs[1], y = warner_pcs[2]+0.25, label = 'Warner', show.legend = FALSE, color = 'grey50') +
  xlim(-4,4.5) +
  xlab('PC1') + ylab('PC2') +
  theme_classic()

Do you think his 2020 performance would have won a medal if it had happened in 2024?

I do not think that his 2020 performance would have won a medal if it happened in 2024. His PC2 is much lower than that of the 3 medalists. Since PC2 is high when you ahve high scores across all events, I do not believe this performance would compare.

Question 2

Below is a screenshot of a conversation between me and chatbot Claude:

After looking at the graphs, I grew skeptical. So I said:

Behold, Claude’s three data sets which I’ve called claudeA, claudeB, and claudeC:

claudeA <- read.csv('./Data/claude_dataA.csv')
claudeB <- read.csv('./Data/claude_dataB.csv')
claudeC <- read.csv('./Data/claude_dataC.csv')

Each data set has an X and a Y column which represent 2-dimensional variables that we need to rotate.

A)

Scale each data set and plot them side-by-side using the patchwork package. Make sure the aspect ratio of each graph is 1 (i.e., make the height and width of each graph equal). At this point, explain why you think I was skeptical. Specifically, do you think the percent variability explained by the first PC of each data set appears to exceed or fall short of the variability I asked it to?

library(patchwork)

Warning: package 'patchwork' was built under R version 4.4.3

plot_a <- ggplot(data = claudeA, aes(x = X, y = Y)) +
  geom_point() +
  theme_classic() +
  coord_fixed(ratio = 1)

plot_b <- ggplot(data = claudeB, aes(x = X, y = Y)) +
  geom_point() +
  theme_classic() +
  coord_fixed(ratio = 1)

plot_c <- ggplot(data = claudeC, aes(x = X, y = Y)) +
  geom_point() +
  theme_classic() +
  coord_fixed(ratio = 1)

plot_a + plot_b + plot_c

I would expect the variance explained by the first PC to be higher than what you told Claude for each of these plots. When looking at each plot, I imagine rotating the points so that there is no correlation between the points anymore. If we do this, almost all of the variability for each plot is in the horizontal direction. The plots do not vary much vertically in this scenario. I would expect the first PCs for each plot to be higher than what you specified because of this.

B)

Use SVD to find the first PC for each data set, and find the actual percent of total variability explained by each PC using aggregation methods.

svd_a <- svd(scale(claudeA))
U_a <- svd_a$u
D_a <- diag(svd_a$d)

svd_b <- svd(scale(claudeB))
U_b <- svd_b$u
D_b <- diag(svd_b$d)

svd_c <- svd(scale(claudeC))
U_c <- svd_c$u
D_c <- diag(svd_c$d)

pcs_a <- (U_a %*% D_a)[,1]

(
 pcs_a
 %>% data.frame()
 %>% summarize(across(everything(), var))
 %>% mutate(across(everything(), \(x) x/2))
)

          .
1 0.9756201

pcs_b <- (U_b %*% D_b)[,1]

(
 pcs_b
 %>% data.frame()
 %>% summarize(across(everything(), var))
 %>% mutate(across(everything(), \(x) x/2))
)

          .
1 0.9591807

pcs_c <- (U_c %*% D_c)[,1]

(
 pcs_c
 %>% data.frame()
 %>% summarize(across(everything(), var))
 %>% mutate(across(everything(), \(x) x/2))
)

          .
1 0.9949096

Each PC is much larger than what you told claude. This confirms my intuition from above.