Activity 3.2 - PC concepts

SUBMISSION INSTRUCTIONS

  1. Render to HTML
  2. Publish your HTML to RPubs
  3. Submit a link to your published solutions

Question 1

The data set we will analyze for this question contains the results of the 10 events in the Men’s 2024 Olympic Decathlon in Paris.

decathlon <- read.csv('Data/mens_decathlon_paris2024.csv')
head(decathlon)
         Athlete  Medal      Nation Overall X100m LongJump ShotPut HighJump
1   Markus Rooth   Gold      Norway    8796 10.71     7.80   15.25     1.99
2 Leo Neugebauer Silver     Germany    8748 10.67     7.98   16.55     2.05
3  Lindon Victor Bronze     Grenada    8711 10.56     7.48   15.71     2.02
4    Sven Roosen   None Netherlands    8607 10.52     7.56   15.10     1.87
5  Janek Õiglane   None     Estonia    8572 10.89     7.25   14.58     1.99
6   Johannes Erm   None     Estonia    8569 10.64     7.66   14.61     2.08
  X400m X110mHurdle Discus PoleVault Javelin X1500m
1 47.69       14.25  49.80       5.3   66.87  279.6
2 47.70       14.51  53.33       5.0   56.64  284.7
3 47.84       14.62  53.91       4.9   68.22  283.5
4 46.40       13.99  46.88       4.7   63.72  258.5
5 48.02       14.45  43.49       5.3   71.89  265.6
6 47.19       14.35  46.29       4.6   59.58  259.7

For the purposes of this question, assume we have 10-dimensional data - that is, ignore the Overall column.

A)

Explain why we need to scale this data set before performing PCA.

The data must be scaled because the ten events are measured in different units on very different numeric scales: X1500m times are in the hundreds of seconds, while HighJump marks are around 2 meters. Without scaling, the columns with the largest raw variances would dominate the principal components, so the PCs would reflect measurement units rather than differences in performance.
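
As a quick illustration (not required by the question), the raw column standard deviations show how uneven the scales are:

# Standard deviation of each raw event column; X1500m, measured in
# seconds, dwarfs the events measured in meters, such as HighJump
apply(decathlon[, 5:14], 2, sd)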

B)

Use svd() to find the first 2 principal component scores and their loadings. Full credit will only be granted if you use the svd() ingredients u, d, and v. What percent of the overall variability do the first two PCs explain?

# Keep only the 10 event columns (drop Athlete, Medal, Nation, Overall)
decathlon_10 <- decathlon[, 5:14]

# Center and scale each event so no single event dominates
decathlon_scaled <- scale(decathlon_10)

svd_decathlon <- svd(decathlon_scaled)

# Scores = U D; the loadings are the columns of V
scores <- svd_decathlon$u %*% diag(svd_decathlon$d)
scores_PC1and2 <- scores[, 1:2]

loadings_PC1and2 <- svd_decathlon$v[, 1:2]

# Each PC's share of the total variability is d_i^2 / sum(d^2)
var_explained <- svd_decathlon$d^2 / sum(svd_decathlon$d^2)
percent_var_PC1and2 <- sum(var_explained[1:2]) * 100
percent_var_PC1and2
[1] 48.415

The first two PCs explain about 48.4% of the total variability.
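
As an optional cross-check, prcomp() performs the same SVD internally, and its summary should report a cumulative proportion of about 48.4% through PC2:

# 'Cumulative Proportion' through PC2 should match the figure above
summary(prcomp(decathlon_scaled))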

C)

Find and print the loadings. Based on the loadings alone, if the first two PCs are plotted in a 2D plane as shown below, which of the four quadrants will the medalists be in? Explain your reasoning.

decathlon
                 Athlete  Medal        Nation Overall X100m LongJump ShotPut
1           Markus Rooth   Gold        Norway    8796 10.71     7.80   15.25
2         Leo Neugebauer Silver       Germany    8748 10.67     7.98   16.55
3          Lindon Victor Bronze       Grenada    8711 10.56     7.48   15.71
4            Sven Roosen   None   Netherlands    8607 10.52     7.56   15.10
5          Janek Õiglane   None       Estonia    8572 10.89     7.25   14.58
6           Johannes Erm   None       Estonia    8569 10.64     7.66   14.61
7      Harrison Williams   None United States    8538 10.62     7.42   15.66
8            Niklas Kaul   None       Germany    8445 11.34     7.09   14.24
9    Ayden Owens-Delerme   None   Puerto Rico    8437 10.35     7.66   15.17
10         Heath Baldwin   None United States    8422 10.91     7.38   14.48
11           Karel Tilga   None       Estonia    8377 11.01     7.16   15.88
12       Makenson Gletty   None        France    8309 10.72     7.10   16.64
13          Ken Mullings   None       Bahamas    8226 10.60     7.36   14.19
14 José Fernando Santana   None        Brazil    8213 10.66     7.24   13.97
15       Till Steinforth   None       Germany    8170 10.52     7.61   13.96
16              Rik Taam   None   Netherlands    8046 10.64     7.27   14.27
17        Zachery Ziemek   None United States    7983 10.60     6.86   15.03
18       Sander Skotheim   None        Norway    7757 10.78     8.03   14.31
19      Daniel Golubovic   None     Australia    7566 11.32     6.60   13.89
20           Jorge Ureña   None         Spain    7096 10.87     7.05   13.77
   HighJump X400m X110mHurdle Discus PoleVault Javelin X1500m
1      1.99 47.69       14.25  49.80       5.3   66.87  279.6
2      2.05 47.70       14.51  53.33       5.0   56.64  284.7
3      2.02 47.84       14.62  53.91       4.9   68.22  283.5
4      1.87 46.40       13.99  46.88       4.7   63.72  258.5
5      1.99 48.02       14.45  43.49       5.3   71.89  265.6
6      2.08 47.19       14.35  46.29       4.6   59.58  259.7
7      1.96 46.71       14.28  46.91       5.1   51.17  259.6
8      2.02 49.13       14.53  46.28       4.8   77.78  255.0
9      2.02 46.17       14.09  43.36       4.8   51.17  280.4
10     2.17 49.04       14.04  43.66       4.7   67.59  280.7
11     1.99 48.67       14.66  50.13       4.7   64.16  266.4
12     1.99 47.48       13.96  46.03       4.7   53.02  275.6
13     2.02 49.43       13.70  46.07       4.8   59.83  295.8
14     1.93 48.78       14.00  42.86       4.8   70.58  289.7
15     1.96 47.96       14.37  42.59       4.7   59.14  285.4
16     1.93 47.73       14.78  39.31       4.7   57.08  264.8
17     1.96 50.79       15.11  50.08       5.0   57.05  293.2
18     2.11 47.02       14.15  45.77       0.0   59.79  277.5
19     1.93 50.37       15.15  44.65       4.6   59.33  279.0
20     1.96 48.08       14.29  40.92       0.0   57.93  282.2
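
The loadings themselves can also be printed with the event names attached for readability (loadings_labeled below is just a convenience copy, so the objects from B) are left untouched):

# A labeled copy of the PC1/PC2 loadings
loadings_labeled <- loadings_PC1and2
rownames(loadings_labeled) <- colnames(decathlon_10)
colnames(loadings_labeled) <- c("PC1", "PC2")
loadings_labeled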

Based on the signs of the loadings, performing well across the events pulls an athlete’s score toward negative PC1 and positive PC2. Since the medalists are the strongest all-around performers, they should land in quadrant II (negative PC1, positive PC2).
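
A quick way to back this up numerically (the data are sorted by Overall, so the medalists are rows 1 through 3):

# PC1/PC2 score coordinates of the three medalists
scores_PC1and2[1:3, ]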

D)

Add the PCs to the decathlon data set and create a scatterplot of these PCs, with the points labeled by the athletes’ names. Color-code the points on whether or not the athlete was a medalist. Use the ggrepel package for better labeling. Verify that your intuition from C) is correct.

library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
library(ggplot2)
library(ggrepel)

# Flag the three athletes with the highest Overall scores as medalists
decathlon <- decathlon %>%
  mutate(Medalist = if_else(Overall %in% sort(Overall, decreasing = TRUE)[1:3],
                            "Medalist", "Non-medalist"))

# Attach the PC scores computed in B)
decathlon$PC1 <- scores_PC1and2[, 1]
decathlon$PC2 <- scores_PC1and2[, 2]

ggplot(decathlon, aes(x = PC1, y = PC2, label = Athlete, color = Medalist)) +
  geom_point(size = 3) +
  geom_text_repel(show.legend = FALSE) +
  theme_minimal() +
  labs(title = "PC1 vs PC2 for 2024 Men's Olympic Decathlon",
       x = "PC1",
       y = "PC2",
       color = "Medal Status")

Yes: the plot confirms the intuition from C). All three medalists fall in quadrant II (negative PC1, positive PC2).
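
The same check can be done without reading the plot (an optional tabulation of who lands in quadrant II):

# TRUE = negative PC1 and positive PC2, i.e. quadrant II
with(decathlon, table(Medalist, Quadrant_II = PC1 < 0 & PC2 > 0))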

E)

Canadian Damian Warner won the gold medal in the decathlon in the 2020 Tokyo games. He began the 2024 decathlon but bowed out after three straight missed pole vault attempts.

These are his results in the 10 events in 2020:

warner <- c(10.12, 8.24, 14.8, 2.02, 47.48, 13.46, 48.67, 4.9, 63.44, 271.08)

Would this have won a medal if it had happened in 2024? To answer this, we will compute his PCs with respect to the 2024 athletes and add it to the plot to see where his 2020 gold-medal performance compares to the 2024 athletes. To do this:

  • Find the mean vector from the 2024 athletes. Call it mean_vec_24.
  • Find the standard deviation vector from the 2024 athletes. Call it sd_vec_24.
  • Standardize Warner’s 2020 results with respect to the 2024 athletes: (warner-mean_vec_24)/sd_vec_24
  • Find Warner’s PC coordinates using the 2024 loadings.
  • Add his point to the scatterplot.

Do you think his 2020 performance would have won a medal if it had happened in 2024?

# Mean and SD vectors for the 2024 field (columns 5:14 are still the events)
mean_vec_24 <- colMeans(decathlon[, 5:14])
sd_vec_24 <- apply(decathlon[, 5:14], 2, sd)

# Standardize Warner's 2020 marks with respect to the 2024 athletes
warner_std <- (warner - mean_vec_24) / sd_vec_24

# Project the standardized marks onto the 2024 loadings
warner_PC <- warner_std %*% loadings_PC1and2
warner_PC
          [,1]       [,2]
[1,] -3.422699 -0.5092531

# Package Warner's point so it can be layered onto the existing plot
warner_df <- data.frame(
  Name = "Damian Warner (2020)",
  PC1 = warner_PC[1],
  PC2 = warner_PC[2],
  Medalist = "Warner 2020"
)

ggplot(decathlon, aes(x = PC1, y = PC2, label = Athlete, color = Medalist)) +
  geom_point(size = 3) +
  geom_text_repel(show.legend = FALSE) +
  
  geom_point(
    data = warner_df,
    aes(x = PC1, y = PC2, color = Medalist),
    size = 4,
    shape = 17,
    inherit.aes = FALSE
  ) +
  
  geom_text_repel(
    data = warner_df,
    aes(x = PC1, y = PC2, label = Name),
    fontface = "bold",
    inherit.aes = FALSE
  ) +
  
  scale_color_manual(values = c(
    "Medalist" = "gold",
    "Non-medalist" = "grey",
    "Warner 2020" = "red"
  )) +
  
  theme_minimal() +
  labs(
    title = "PC1 vs PC2 with Warner",
    x = "PC1",
    y = "PC2",
    color = "Legend"
  )

Yes, his 2020 performance likely would have medaled in 2024. On the plot, Warner sits far from the rest of the field, but in the direction the medalists occupy: his PC2 is only slightly negative, while his PC1 of about -3.4 is even more negative than the 2024 medalists’, and negative PC1 is the direction associated with strong all-around performances. (His winning total of 9018 points in Tokyo is also well above the 8796 that won gold in 2024.)
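
One optional numeric check (this assumes, as in C), that more negative PC1 corresponds to better overall performance):

# How many 2024 athletes have a more negative PC1 than Warner's -3.4?
sum(decathlon$PC1 < warner_PC[1])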

Question 2

Below is a screenshot of a conversation between me and chatbot Claude:

After looking at the graphs, I grew skeptical. So I said:

Behold, Claude’s three data sets which I’ve called claudeA, claudeB, and claudeC:

claudeA <- read.csv('Data/claude_dataA.csv')
claudeB <- read.csv('Data/claude_dataB.csv')
claudeC <- read.csv('Data/claude_dataC.csv')

Each data set has an X and a Y column which represent 2-dimensional variables that we need to rotate.

A)

Scale each data set and plot them side-by-side using the patchwork package. Make sure the aspect ratio of each graph is 1 (i.e., make the height and width of each graph equal). At this point, explain why you think I was skeptical. Specifically, do you think the percent variability explained by the first PC of each data set appears to exceed or fall short of the variability I asked it to?

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ lubridate 1.9.4     ✔ tibble    3.3.0
✔ purrr     1.1.0     ✔ tidyr     1.3.1
✔ readr     2.1.5     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(patchwork)

# Scale each data set, then convert back to data frames for ggplot
claudeA_scaled <- as.data.frame(scale(claudeA))
claudeB_scaled <- as.data.frame(scale(claudeB))
claudeC_scaled <- as.data.frame(scale(claudeC))

plotA <- ggplot(claudeA_scaled, aes(x = X, y = Y)) +
  geom_point() +
  coord_equal() +
  labs(title = "Claude A")

plotB <- ggplot(claudeB_scaled, aes(x = X, y = Y)) +
  geom_point() +
  coord_equal() +
  labs(title = "Claude B")

plotC <- ggplot(claudeC_scaled, aes(x = X, y = Y)) +
  geom_point() +
  coord_equal() +
  labs(title = "Claude C")

plotA + plotB + plotC

All three plots look suspiciously linear. Claude A was supposed to have a first PC that explains 55% of the variability, but its points fall almost on a line, suggesting far more than that. The same holds for Claude B (target 75%) and especially Claude C (target 90%), which is nearly perfectly linear. In every case, the percent variability explained by the first PC appears to exceed, not fall short of, what was requested.
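
There is a handy closed form here (a side note, not required): for scaled two-dimensional data, the proportion of variance explained by the first PC is (1 + |r|)/2, where r is the correlation between X and Y. Targets of 55%, 75%, and 90% would therefore correspond to correlations of roughly 0.10, 0.50, and 0.80, and the plots all look far more correlated than that:

# Correlation between X and Y in each data set; compare with the
# 0.10 / 0.50 / 0.80 values the requested percentages imply
c(A = cor(claudeA$X, claudeA$Y),
  B = cor(claudeB$X, claudeB$Y),
  C = cor(claudeC$X, claudeC$Y))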

B)

Use SVD to find the first PC for each data set, and find the actual percent of total variability explained by each PC using aggregation methods.

# SVD of each scaled data set
svdA <- svd(claudeA_scaled)
svdB <- svd(claudeB_scaled)
svdC <- svd(claudeC_scaled)

# Singular values
dA <- svdA$d
dB <- svdB$d
dC <- svdC$d

# PC1's share of the total variability: d_1^2 / sum(d^2)
pc1A <- (dA[1]^2) / sum(dA^2)
pc1B <- (dB[1]^2) / sum(dB^2)
pc1C <- (dC[1]^2) / sum(dC^2)

percent_df <- data.frame(
  Dataset = c("claudeA", "claudeB", "claudeC"),
  PC1_Variance_Proportion = c(pc1A, pc1B, pc1C),
  PC1_Percent = c(pc1A, pc1B, pc1C) * 100
)

percent_df
  Dataset PC1_Variance_Proportion PC1_Percent
1 claudeA               0.9756201    97.56201
2 claudeB               0.9591807    95.91807
3 claudeC               0.9949096    99.49096

The computed proportions confirm the suspicion: the first PC explains about 97.6% of the variability in claudeA, 95.9% in claudeB, and 99.5% in claudeC. All three are far above the 55%, 75%, and 90% that were requested.
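
As a consistency check (optional), the same proportions come out of aggregating the variances of the score columns directly:

# Per-PC variance of the scores (U D); PC1's share should match pc1A
scoresA <- svdA$u %*% diag(svdA$d)
apply(scoresA, 2, var) / sum(apply(scoresA, 2, var))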