Activity 3.2 - PC concepts

SUBMISSION INSTRUCTIONS

  1. Render to html
  2. Publish your html to RPubs
  3. Submit a link to your published solutions

Question 1

The data we will analyze for this question come from the 10 events in the Men’s 2024 Olympic Decathlon in Paris.

getwd()
[1] "C:/Users/wi7536ul/OneDrive - Minnesota State/DSCI 415/notes"
decathlon <- read.csv('Data/mens_decathlon_paris2024.csv')
head(decathlon)
         Athlete  Medal      Nation Overall X100m LongJump ShotPut HighJump
1   Markus Rooth   Gold      Norway    8796 10.71     7.80   15.25     1.99
2 Leo Neugebauer Silver     Germany    8748 10.67     7.98   16.55     2.05
3  Lindon Victor Bronze     Grenada    8711 10.56     7.48   15.71     2.02
4    Sven Roosen   None Netherlands    8607 10.52     7.56   15.10     1.87
5  Janek Õiglane   None     Estonia    8572 10.89     7.25   14.58     1.99
6   Johannes Erm   None     Estonia    8569 10.64     7.66   14.61     2.08
  X400m X110mHurdle Discus PoleVault Javelin X1500m
1 47.69       14.25  49.80       5.3   66.87  279.6
2 47.70       14.51  53.33       5.0   56.64  284.7
3 47.84       14.62  53.91       4.9   68.22  283.5
4 46.40       13.99  46.88       4.7   63.72  258.5
5 48.02       14.45  43.49       5.3   71.89  265.6
6 47.19       14.35  46.29       4.6   59.58  259.7

For the purposes of this question, assume we have 10-dimensional data - that is, ignore the Overall column.

library(HSAUR2)
Loading required package: tools
library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ forcats   1.0.0     ✔ readr     2.1.5
✔ ggplot2   4.0.0     ✔ stringr   1.5.1
✔ lubridate 1.9.4     ✔ tibble    3.2.1
✔ purrr     1.0.4     ✔ tidyr     1.3.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)

A)

Explain why we need to scale this data set before performing PCA.

We need to scale this data set before performing PCA because the ten events are measured in very different units and over very different ranges. Without scaling, events with large variances (e.g., the 1500m, timed in hundreds of seconds, or the javelin, thrown tens of meters) would dominate the principal components simply because of their scale, not because they carry more information about performance. Scaling each event to mean 0 and variance 1 ensures that all of the events contribute equally to our PCA.
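As a quick illustration (a sketch; it assumes decathlon has been read in as above), the raw event variances span more than three orders of magnitude, and scaling removes that disparity:

raw_events <- decathlon %>% select(-Athlete, -Medal, -Nation, -Overall)
round(sapply(raw_events, var), 3)            # X1500m is near 148; X100m is near 0.063
round(apply(scale(raw_events), 2, var), 3)   # after scaling, every event has variance 1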

B)

Use svd() to find the first 2 principal component scores and their loadings. Full credit will only be granted if you use the svd() ingredients u, d, and v. What percent of the overall variability do the first two PCs explain?


decathlon_numeric <- decathlon %>%
  select(-Athlete, -Medal, -Nation, -Overall)

decathlon_scaled <- scale(decathlon_numeric)

# SVD of the scaled data: X = U D V^T, so the PC scores are X V = U D
svd_decathlon_scaled <- svd(decathlon_scaled)

U <- svd_decathlon_scaled$u         # left singular vectors
D <- diag(svd_decathlon_scaled$d)   # diagonal matrix of singular values
V <- svd_decathlon_scaled$v         # right singular vectors (the loadings)

PCs <- U %*% D                      # principal component scores
head(PCs, 3)
           [,1]      [,2]      [,3]       [,4]       [,5]       [,6]
[1,] -0.9696335 0.9075397 0.7205553 -0.1637611 -0.7854975 -0.7506910
[2,] -2.1338950 2.2401416 0.4899618 -1.0902256  0.7897618 -0.2423639
[3,] -0.7323012 2.0546827 0.6180006 -0.8017843 -0.1668149 -1.0966201
             [,7]       [,8]       [,9]       [,10]
[1,] 6.474979e-05  0.4802080 0.22886156 -0.01855389
[2,] 1.877641e-01  0.6018405 0.41095597  0.18266992
[3,] 3.515761e-02 -0.7366448 0.05310666 -0.41620467
cor(PCs) %>% round(2)   # the PCs are uncorrelated by construction
      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
 [1,]    1    0    0    0    0    0    0    0    0     0
 [2,]    0    1    0    0    0    0    0    0    0     0
 [3,]    0    0    1    0    0    0    0    0    0     0
 [4,]    0    0    0    1    0    0    0    0    0     0
 [5,]    0    0    0    0    1    0    0    0    0     0
 [6,]    0    0    0    0    0    1    0    0    0     0
 [7,]    0    0    0    0    0    0    1    0    0     0
 [8,]    0    0    0    0    0    0    0    1    0     0
 [9,]    0    0    0    0    0    0    0    0    1     0
[10,]    0    0    0    0    0    0    0    0    0     1
# Each PC's variance as a share of the total variance, which is 10
# (each of the 10 scaled variables has variance 1)
(PCs
 %>% data.frame()
 %>% summarize(across(everything(), var))
 %>% mutate(across(everything(), \(x) x/10))
)
         X1        X2       X3        X4        X5         X6         X7
1 0.2913494 0.1928006 0.146543 0.1404117 0.0952215 0.05535841 0.04806271
          X8         X9         X10
1 0.01403932 0.01222556 0.003987768

PC1 alone explains 29.13% of the total variability in these 10 variables, and PC2 explains an additional 19.28%. Together, the first two PCs explain about 48.4% of the overall variability.
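Equivalently (a quick cross-check using the svd() ingredients already computed), the proportions can be read directly off the singular values, since the variance along PC j is proportional to d_j^2:

pve <- svd_decathlon_scaled$d^2 / sum(svd_decathlon_scaled$d^2)
round(pve[1:2], 4)   # 0.2913 and 0.1928, matching the table above
sum(pve[1:2])        # about 0.4841 for the first two PCs together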

C)

Find and print the loadings. Based on the loadings alone, if the first two PCs are plotted in a 2D plane as shown below, which of the four quadrants will the medalists be in? Explain your reasoning.

loadings <- svd_decathlon_scaled$v   # the loadings are the right singular vectors V
rownames(loadings) <- colnames(decathlon_numeric)
loadings[, 1:2]
                    [,1]        [,2]
X100m        0.446612410  0.01678592
LongJump    -0.493037935 -0.06565628
ShotPut     -0.309431542  0.53134100
HighJump    -0.150386291 -0.05665757
X400m        0.485973869  0.14847674
X110mHurdle  0.345270647  0.31685991
Discus      -0.139145881  0.59570522
PoleVault    0.019829831  0.48083261
Javelin      0.252821246  0.03137029
X1500m       0.005588625 -0.01948870

Based on the loadings:

PC1 (horizontal axis):

  • largest positive loadings: 100m (0.45), 400m (0.49), 110m hurdles (0.35), javelin (0.25)
  • largest negative loadings: long jump (-0.49), shot put (-0.31)

The running times dominate PC1’s positive side. Since smaller times are better while larger jump and throw marks are better, a strong decathlete has small values on the positively loaded times and large values on the negatively loaded marks, both of which push PC1 down. The best athletes should therefore have negative PC1 scores.

PC2 (vertical axis):

  • largest positive loadings: discus (0.60), shot put (0.53), pole vault (0.48), 110m hurdles (0.32)
  • negative loadings: long jump, high jump, 1500m (all small)

PC2 is driven by the throwing and strength events, where bigger marks are better, so strong throwers and vaulters should have positive PC2 scores.

Putting this together, the medalists should have negative PC1 and positive PC2, which places them in the upper-left quadrant (quadrant II). A quick numeric check below bears this out.
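Checking against the scores from B) (from head(decathlon), the medalists are the first three rows of the data):

round(PCs[1:3, 1:2], 2)   # Gold, Silver, Bronze: all have PC1 < 0 and PC2 > 0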

D)

Add the PCs to the decathlon data set and create a scatterplot of these PCs, with the points labeled by the athletes’ names. Color-code the points on whether or not the athlete was a medalist. Use the ggrepel package for better labeling. Verify that your intuition from C) is correct.

library(ggrepel)
decathlon_with_PCs <- decathlon %>%
  mutate(
    PC1 = PCs[,1],
    PC2 = PCs[,2]
  )
ggplot(decathlon_with_PCs, aes(x = PC1, y = PC2, color = Medal, label = Athlete)) +
  geom_point(size = 3) +
  geom_text_repel(max.overlaps = 20, show.legend = FALSE) +
  theme_minimal() +
  labs(
    title = "PC1 vs PC2 for Decathlon Athletes",
    x = "Principal Component 1",
    y = "Principal Component 2"
  ) +
  scale_color_manual(values = c("Gold" = "steelblue", "Silver" = "steelblue",
                                "Bronze" = "steelblue", "None" = "gray60"))

The plot lines up with my intuition from C): the three medalists sit together in the upper-left quadrant.

E)

Canadian Damian Warner won the gold medal in the decathlon in the 2020 Tokyo games. He began the 2024 decathlon but bowed out after three straight missed pole vault attempts.

These are his results in the 10 events in 2020:

warner <- c(10.12, 8.24, 14.8, 2.02, 47.48, 13.46, 48.67, 4.9, 63.44, 271.08)

Would this have won a medal if it had happened in 2024? To answer this, we will compute his PCs with respect to the 2024 athletes and add them to the plot to see how his 2020 gold-medal performance compares to the 2024 field. To do this:

  • Find the mean vector from the 2024 athletes. Call it mean_vec_24.
  • Find the standard deviation vector from the 2024 athletes. Call it sd_vec_24.
  • Standardize Warner’s 2020 results with respect to the 2024 athletes: (warner-mean_vec_24)/sd_vec_24
  • Find Warner’s PC coordinates using the 2024 loadings.
  • Add his point to the scatterplot.

Do you think his 2020 performance would have won a medal if it had happened in 2024?

warner <- c(10.12, 8.24, 14.8, 2.02, 47.48, 13.46, 48.67, 4.9, 63.44, 271.08)
names(warner) <- colnames(decathlon_numeric)
mean_vec_24 <- decathlon_numeric %>% summarize(across(everything(), mean))
mean_vec_24
    X100m LongJump ShotPut HighJump X400m X110mHurdle Discus PoleVault Javelin
1 10.7465    7.378  14.863   1.9975 48.11      14.364 46.116      4.36  61.627
   X1500m
1 275.845
sd_vec_24   <- decathlon_numeric %>% summarize(across(everything(), sd))
sd_vec_24
      X100m  LongJump   ShotPut   HighJump    X400m X110mHurdle   Discus
1 0.2519247 0.3598333 0.8663967 0.06873864 1.222004   0.3759115 3.829339
  PoleVault  Javelin   X1500m
1  1.504869 7.104155 12.15537
warner_std <- (warner - as.numeric(mean_vec_24)) / as.numeric(sd_vec_24)
warner_std
      X100m    LongJump     ShotPut    HighJump       X400m X110mHurdle 
-2.48685424  2.39555375 -0.07271496  0.32732684 -0.51554643 -2.40482120 
     Discus   PoleVault     Javelin      X1500m 
 0.66695587  0.35883515  0.25520278 -0.39200766 
warner_pcs <- warner_std %*% loadings[, 1:2]   # project onto the first two loading vectors
warner_pcs
          [,1]       [,2]
[1,] -3.422699 -0.5092531
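As a sanity check on this projection (a sketch reusing decathlon_numeric, mean_vec_24, sd_vec_24, and loadings from above), standardizing a 2024 athlete the same way and projecting him onto the loadings should reproduce his scores from B):

rooth_std <- (unlist(decathlon_numeric[1, ]) - as.numeric(mean_vec_24)) / as.numeric(sd_vec_24)
rooth_std %*% loadings[, 1:2]   # should match PCs[1, 1:2]: about -0.97 and 0.91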
ggplot(decathlon_with_PCs, aes(x = PC1, y = PC2, color = Medal, label = Athlete)) +
  geom_point(size = 3) +
  geom_text_repel(max.overlaps = 20, show.legend = FALSE) +
  # annotate() adds Warner as a single point and label, avoiding the
  # "aesthetics have length 1, but the data has 20 rows" warning that a
  # full-data geom layer would trigger
  annotate("point", x = warner_pcs[1], y = warner_pcs[2], color = "steelblue", size = 4) +
  annotate("text", x = warner_pcs[1], y = warner_pcs[2],
           label = "Damian Warner (2020)", color = "steelblue", vjust = -1) +
  theme_minimal() +
  labs(
    title = "PC1 vs PC2 for Decathlon Athletes + Warner",
    x = "Principal Component 1",
    y = "Principal Component 2"
  ) +
  scale_color_manual(values = c("Gold" = "steelblue", "Silver" = "steelblue", 
                                "Bronze" = "steelblue", "None" = "gray60"))

Based on the numbers and where Warner lands on the plot, his 2020 performance sits far to the left of every 2024 medalist on PC1 (about -3.4), which the loadings associate with exceptional sprinting and jumping, but slightly below them on PC2 (about -0.5), reflecting comparatively weaker throws. Since PC1 is the dominant direction of variation and Warner is more extreme on it than anyone in the 2024 field, I think his 2020 performance would likely have contended for a medal. (His winning total in Tokyo was in fact 9018 points, well above the 8796 that won gold in Paris.)

Question 2

Below is a screenshot of a conversation between me and chatbot Claude:

After looking at the graphs, I grew skeptical. So I said:

Behold, Claude’s three data sets which I’ve called claudeA, claudeB, and claudeC:

claudeA <- read.csv('Data/claude_dataA.csv')
claudeB <- read.csv('Data/claude_dataB.csv')
claudeC <- read.csv('Data/claude_dataC.csv')

Each data set has an X and a Y column which represent 2-dimensional variables that we need to rotate.

A)

Scale each data set and plot them side-by-side using the patchwork package. Make sure the aspect ratio of each graph is 1 (i.e., make the height and width of each graph equal). At this point, explain why you think I was skeptical. Specifically, do you think the percent variability explained by the first PC of each data set appears to exceed or fall short of the variability I asked it to?

library(patchwork)
# scale() returns a matrix; convert to data frames so ggplot() will accept them
# (svd() still works on a numeric data frame, since it coerces via as.matrix())
claudedataA_scaled <- as.data.frame(scale(claudeA))
claudedataB_scaled <- as.data.frame(scale(claudeB))
claudedataC_scaled <- as.data.frame(scale(claudeC))

plotA <- ggplot(claudedataA_scaled, aes(x = X, y = Y)) +
  geom_point() +
  coord_equal() +
  labs(title = "Scaled Claude A") +
  theme_minimal()

plotB <- ggplot(claudedataB_scaled, aes(x = X, y = Y)) +
  geom_point() +
  coord_equal() +
  labs(title = "Scaled Claude B") +
  theme_minimal()

plotC <- ggplot(claudedataC_scaled, aes(x = X, y = Y)) +
  geom_point() +
  coord_equal() +
  labs(title = "Scaled Claude C") +
  theme_minimal()

plotA + plotB + plotC

I’m guessing you were skeptical because all three data sets look nearly identical: each is a tight, strongly linear point cloud. Point clouds whose first PC explains 55%, 75%, and 90% of the variance should look visibly different (the 55% cloud nearly round, the 90% cloud clearly elongated), so the percent of variability actually explained by PC1 almost certainly exceeds the values Claude claimed, in all three cases.
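There is also a handy closed form for two standardized variables: the 2x2 correlation matrix has eigenvalues 1 + |r| and 1 - |r|, so PC1 explains (1 + |r|)/2 of the total variance. A quick sketch using that fact (the X and Y column names come from the data sets as described above):

# PC1's share of the variance for 2D standardized data is (1 + |r|) / 2
sapply(list(A = claudeA, B = claudeB, C = claudeC),
       \(d) (1 + abs(cor(d$X, d$Y))) / 2)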

B)

Use SVD to find the first PC for each data set, and find the actual percent of total variability explained by each PC using aggregation methods.

# Proportion of total variability explained by PC1 is d_1^2 / sum(d_j^2)
svd_A <- svd(claudedataA_scaled)
varA <- (svd_A$d[1]^2) / sum(svd_A$d^2)

svd_B <- svd(claudedataB_scaled)
varB <- (svd_B$d[1]^2) / sum(svd_B$d^2)

svd_C <- svd(claudedataC_scaled)
varC <- (svd_C$d[1]^2) / sum(svd_C$d^2)

data.frame(
  Set = c("A", "B", "C"),
  PC1_VariancePercent = round(100 * c(varA, varB, varC), 1)
)
  Set PC1_VariancePercent
1   A                97.6
2   B                95.9
3   C                99.5
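All three far exceed the 55%, 75%, and 90% that were requested, confirming the skepticism from A). As a cross-check (a sketch using base R's prcomp(), which centers, scales, and runs the same SVD internally):

summary(prcomp(claudeA, scale. = TRUE))$importance["Proportion of Variance", ]
summary(prcomp(claudeB, scale. = TRUE))$importance["Proportion of Variance", ]
summary(prcomp(claudeC, scale. = TRUE))$importance["Proportion of Variance", ]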