Activity 3.2 - PC concepts

SUBMISSION INSTRUCTIONS

  1. Render to html
  2. Publish your html to RPubs
  3. Submit a link to your published solutions

Question 1

The data set we will analyze for this question contains the results of the 10 events in the Men’s 2024 Olympic Decathlon in Paris.

library(tidyverse)
Warning: package 'ggplot2' was built under R version 4.4.3
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(ggplot2)
library(ggrepel)
Warning: package 'ggrepel' was built under R version 4.4.3
# keep only the 10 event columns (drop the athlete info and Overall columns)
decathlon <- read.csv('Data/mens_decathlon_paris2024.csv')[,-c(1:4)]
# keep a full copy with athlete names and medal info for labeling later
decathlon_names <- read.csv('Data/mens_decathlon_paris2024.csv')
head(decathlon)
  X100m LongJump ShotPut HighJump X400m X110mHurdle Discus PoleVault Javelin
1 10.71     7.80   15.25     1.99 47.69       14.25  49.80       5.3   66.87
2 10.67     7.98   16.55     2.05 47.70       14.51  53.33       5.0   56.64
3 10.56     7.48   15.71     2.02 47.84       14.62  53.91       4.9   68.22
4 10.52     7.56   15.10     1.87 46.40       13.99  46.88       4.7   63.72
5 10.89     7.25   14.58     1.99 48.02       14.45  43.49       5.3   71.89
6 10.64     7.66   14.61     2.08 47.19       14.35  46.29       4.6   59.58
  X1500m
1  279.6
2  284.7
3  283.5
4  258.5
5  265.6
6  259.7

For the purposes of this question, assume we have 10-dimensional data - that is, ignore the Overall column.

A)

Explain why we need to scale this data set before performing PCA.

The events are measured in different units and on very different scales (sprint times in seconds, throws in meters, the 1500m in hundreds of seconds), so what counts as "significant" variation differs from column to column. If we don't account for that before running PCA, the features with the largest raw variance will dominate the principal components and will look like they explain the most variability in the data simply because of their units. Scaling each event to mean 0 and standard deviation 1 puts the events on equal footing first.
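As a quick illustration (a rough check, reusing the decathlon data frame loaded above), the raw per-event variances differ by orders of magnitude, which is exactly what scaling removes:

# sanity check: raw variances differ wildly across events before scaling
round(apply(decathlon, 2, var), 2)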

B)

Use svd() to find the first 2 principal component scores and their loadings. Full credit will only be granted if you use the svd() ingredients u, d, and v. What percent of the overall variability do the first two PCs explain?

The first two PCs explain slightly less than 50% of the variability in the data (about 48.4%, computed below), which really isn't bad for summarizing a 10-column table with just two dimensions.

# standardize each event (mean 0, sd 1), then take the SVD of the scaled matrix
decaf_scale <- decathlon %>% scale()
components <- decaf_scale %>% as.matrix() %>% svd()

u <- components$u   # left singular vectors
d <- components$d   # singular values (the diagonal of D)
v <- components$v   # right singular vectors = the loadings
# first 2 principal component scores: U D, restricted to the first two columns
(u[, c(1, 2)] %*% diag(d[1:2]))
            [,1]        [,2]
 [1,] -0.9696335  0.90753969
 [2,] -2.1338950  2.24014162
 [3,] -0.7323012  2.05468273
 [4,] -1.4369409 -0.05634227
 [5,]  1.0587944 -0.11932168
 [6,] -1.1274914 -0.28469476
 [7,] -1.5170355  0.60252099
 [8,]  2.7432616  0.22618248
 [9,] -2.5354593 -0.71643964
[10,]  0.4269279 -0.78152461
[11,]  0.8580252  1.76415773
[12,] -1.2047343  0.78549507
[13,] -0.1761558 -0.74356430
[14,]  0.8835651 -1.04662217
[15,] -0.3205275 -1.05965320
[16,]  0.6345050 -0.94344570
[17,]  1.9329181  1.94562825
[18,] -1.6221238 -2.31911240
[19,]  4.1747261  0.40963793
[20,]  1.0635749 -2.86526573
# first 2 loadings (the first two columns of V)
v[,1:2]
              [,1]        [,2]
 [1,]  0.446612410  0.01678592
 [2,] -0.493037935 -0.06565628
 [3,] -0.309431542  0.53134100
 [4,] -0.150386291 -0.05665757
 [5,]  0.485973869  0.14847674
 [6,]  0.345270647  0.31685991
 [7,] -0.139145881  0.59570522
 [8,]  0.019829831  0.48083261
 [9,]  0.252821246  0.03137029
[10,]  0.005588625 -0.01948870
# percent of overall variability explained by the first 2 PCs:
# variance of the PC scores divided by the total variance of the scaled data (10)
sum(var(as.matrix(decaf_scale) %*% v[,1:2]))/10
[1] 0.48415
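Equivalently, the proportion can be read straight off the singular values, since the variance captured by PC k is proportional to d_k^2. This quick cross-check should reproduce the value above:

# proportion of total variance explained by the first two PCs, via the singular values
sum(d[1:2]^2) / sum(d^2)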

C)

Find and print the loadings. Based on the loadings alone, if the first two PCs are plotted in a 2D plane as shown below, which of the four quadrants will the medalists be in? Explain your reasoning.

loadings <- v[, 1:2]
rownames(loadings) <- colnames(decaf_scale)  # label each loading by its event
loadings
                    [,1]        [,2]
X100m        0.446612410  0.01678592
LongJump    -0.493037935 -0.06565628
ShotPut     -0.309431542  0.53134100
HighJump    -0.150386291 -0.05665757
X400m        0.485973869  0.14847674
X110mHurdle  0.345270647  0.31685991
Discus      -0.139145881  0.59570522
PoleVault    0.019829831  0.48083261
Javelin      0.252821246  0.03137029
X1500m       0.005588625 -0.01948870

Since PC1 defines the +x direction, I expected the medalists to lie toward the positive end of PC1, i.e., in quadrant I or IV. Based on the loadings, these athletes would be associated with fast times in the running events but would struggle in the jumping and throwing events (counting the hurdles as a running event for PC1).

Since PC2 defines the +y direction, I expected the medalists to lie toward the positive end of PC2, i.e., in quadrant I or II. These athletes would be associated with slower running times but would excel in many of the jumping and throwing events (again counting the hurdles as a running event).

After making the chart in the next step, I found that my understanding of the data was insufficient. I had assumed everyone in this data set won a medal, when in fact only a few athletes did. I also learned that a loading only gives a direction, not a distance: a negative loading means that event pulls in the opposite direction from a high score on that PC, and PC2 has fewer negative loadings than PC1.

D)

Add the PCs to the decathlon data set and create a scatterplot of these PCs, with the points labeled by the athletes’ names. Color-code the points on whether or not the athlete was a medalist. Use the ggrepel package for better labeling. Verify that your intuition from C) is correct.

It turns out I was pretty wrong.

# data management: attach the PC scores to the athlete info and the scaled event data
pc_df <- as.data.frame(u[, c(1, 2)] %*% diag(d[1:2]))
colnames(pc_df) <- c("PC1", "PC2")
mapped_data <- cbind(decathlon_names[, 1:2], as.data.frame(decaf_scale), pc_df)

# plot PC1 vs PC2, color-coded by medal status and labeled with athlete names

ggplot(mapped_data, aes(x = PC1, y = PC2, color = Medal, label = Athlete)) +
  geom_point(size = 3) +
  geom_text_repel(show.legend = FALSE) +
  #scale_color_manual(values = c("gray60", "gold3")) +
  labs(
    title = "Decathlon PCA (PC1 vs PC2)",
    x = "Principal Component 1",
    y = "Principal Component 2",
    color = "Medal"
  ) +
  theme_minimal(base_size = 13)
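To verify the intuition numerically rather than just by eye, a small tabulation of which quadrant each medalist lands in can help. This is a sketch: it assumes non-medalists are coded as "None" in the Medal column, so the filter may need adjusting to the actual coding.

# quadrant of each medalist in the (PC1, PC2) plane
mapped_data %>%
  filter(Medal != "None") %>%
  mutate(quadrant = case_when(
    PC1 >= 0 & PC2 >= 0 ~ "I",
    PC1 <  0 & PC2 >= 0 ~ "II",
    PC1 <  0 & PC2 <  0 ~ "III",
    TRUE                ~ "IV"
  )) %>%
  select(Athlete, Medal, PC1, PC2, quadrant)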

E)

Canadian Damian Warner won the gold medal in the decathlon in the 2020 Tokyo games. He began the 2024 decathlon but bowed out after three straight missed pole vault attempts.

These are his results in the 10 events in 2020:

warner <- c(10.12, 8.24, 14.8, 2.02, 47.48, 13.46, 48.67, 4.9, 63.44, 271.08)

Would this have won a medal if it had happened in 2024? To answer this, we will compute his PCs with respect to the 2024 athletes and add it to the plot to see where his 2020 gold-medal performance compares to the 2024 athletes. To do this:

  • Find the mean vector from the 2024 athletes. Call it mean_vec_24.
  • Find the standard deviation vector from the 2024 athletes. Call it sd_vec_24.
  • Standardize Warner’s 2020 results with respect to the 2024 athletes: (warner-mean_vec_24)/sd_vec_24
  • Find Warner’s PC coordinates using the 2024 loadings.
  • Add his point to the scatterplot.

Do you think his 2020 performance would have won a medal if it had happened in 2024?

I would say it's fairly unlikely. His point sits well in the negative direction on the PC2 axis, meaning he is weak on the pattern PC2 captures, and that is the component the medalists appear to score most strongly on.

# add Warner's 2020 results to the decathlon data
warner_2 <- c("Warner","None","NA",0,10.12, 8.24, 14.8, 2.02, 47.48, 13.46, 48.67, 4.9, 63.44, 271.08)

decathlon_warner <- rbind(decathlon, warner)                 # 10 event results only
decathlon_names_warner <- rbind(decathlon_names, warner_2)   # athlete info plus event results

# re-scale and re-run the SVD with Warner's row included
decaf_scale_warner <- decathlon_warner %>% scale()
components_warner <- decaf_scale_warner %>% as.matrix() %>% svd()

u_warner <- components_warner$u   # left singular vectors
d_warner <- components_warner$d   # singular values (the diagonal of D)
v_warner <- components_warner$v   # loadings

# first two PC scores and the plotting data frame, now with Warner's row
pc_df_warner <- as.data.frame(u_warner[, c(1, 2)] %*% diag(d_warner[1:2]))
colnames(pc_df_warner) <- c("PC1", "PC2")
mapped_data_warner <- cbind(decathlon_names_warner[, 1:2], as.data.frame(decaf_scale_warner), pc_df_warner)
ggplot(mapped_data_warner, aes(x = PC1, y = PC2, color = Medal, label = Athlete)) +
  geom_point(size = 3) +
  geom_text_repel(show.legend = FALSE) +
  #scale_color_manual(values = c("gray60", "gold3")) +
  labs(
    title = "Decathlon PCA (PC1 vs PC2)",
    x = "Principal Component 1",
    y = "Principal Component 2",
    color = "Medal"
  ) +
  theme_minimal(base_size = 13)
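For comparison, here is a sketch of the projection approach the prompt describes: standardize Warner's marks with the 2024 means and standard deviations and project onto the 2024 loadings, reusing v from part B and mapped_data from part D rather than re-running the SVD with Warner included.

# standardize Warner's 2020 marks with the 2024 athletes' means and SDs
mean_vec_24 <- colMeans(decathlon)
sd_vec_24   <- apply(decathlon, 2, sd)
warner_std  <- (warner - mean_vec_24) / sd_vec_24

# project onto the 2024 loadings to get his PC1 and PC2 coordinates
warner_pcs <- as.numeric(warner_std %*% v[, 1:2])

# add his 2020 performance to the 2024 scatterplot
ggplot(mapped_data, aes(x = PC1, y = PC2, color = Medal, label = Athlete)) +
  geom_point(size = 3) +
  geom_text_repel(show.legend = FALSE) +
  annotate("point", x = warner_pcs[1], y = warner_pcs[2], shape = 17, size = 3) +
  annotate("text", x = warner_pcs[1], y = warner_pcs[2],
           label = "Warner (2020)", vjust = -1) +
  labs(title = "Decathlon PCA with Warner's 2020 performance projected in",
       x = "Principal Component 1", y = "Principal Component 2") +
  theme_minimal(base_size = 13)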

Question 2

Below is a screenshot of a conversation between me and chatbot Claude:

After looking at the graphs, I grew skeptical. So I said:

Behold, Claude’s three data sets which I’ve called claudeA, claudeB, and claudeC:

claudeA <- read.csv('Data/claude_dataA.csv')
claudeB <- read.csv('Data/claude_dataB.csv')
claudeC <- read.csv('Data/claude_dataC.csv')

Each data set has an X and a Y column which represent 2-dimensional variables that we need to rotate.

A)

Scale each data set and plot them side-by-side using the patchwork package. Make sure the aspect ratio of each graph is 1 (i.e., make the height and width of each graph equal). At this point, explain why you think I was skeptical. Specifically, do you think the percent variability explained by the first PC of each data set appears to exceed or fall short of the variability I asked it to?

Overall, the points in each chart hug a line far too tightly, and you can see this even before running a true PCA. The correlation between the two variables is strong enough that a single direction captures nearly all of the variation on its own, so I suspect the percent variability explained by the first PC of each data set exceeds whatever I asked for.
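A quick numeric check supports this: for two standardized variables, the first PC explains (1 + |r|)/2 of the total variance, so a very high |r| forces PC1 to explain nearly everything. A rough check, reusing claudeA, claudeB, and claudeC as loaded above:

# correlation between X and Y in each data set; PC1 explains (1 + |r|)/2 of the variance
sapply(list(A = claudeA, B = claudeB, C = claudeC), function(d) cor(d$X, d$Y))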

library(patchwork)
Warning: package 'patchwork' was built under R version 4.4.3
# scale each data set and convert to data frames for plotting
sa <- claudeA %>% scale() %>% as.data.frame()
sb <- claudeB %>% scale() %>% as.data.frame()
sc <- claudeC %>% scale() %>% as.data.frame()

sap <- ggplot(sa, aes(x = X, y = Y)) +
  geom_point() +
  coord_fixed(ratio = 1)   # 1:1 aspect ratio
sbp <- ggplot(sb, aes(x = X, y = Y)) +
  geom_point() +
  coord_fixed(ratio = 1)
scp <- ggplot(sc, aes(x = X, y = Y)) +
  geom_point() +
  coord_fixed(ratio = 1)
(sap + sbp + scp)   # patchwork: plots side-by-side

B)

Use SVD to find the first PC for each data set, and find the actual percent of total variability explained by each PC using aggregation methods.


# loadings (right singular vectors) for each scaled data set via SVD
sav <- (sa %>% as.matrix() %>% svd())$v
sbv <- (sb %>% as.matrix() %>% svd())$v
scv <- (sc %>% as.matrix() %>% svd())$v

# percent of total variability explained by PC1:
# variance of the PC1 scores divided by the total variance of the scaled data (2)
sa_perc <- sum(var(as.matrix(sa) %*% sav[, 1])) / 2
sb_perc <- sum(var(as.matrix(sb) %*% sbv[, 1])) / 2
sc_perc <- sum(var(as.matrix(sc) %*% scv[, 1])) / 2
sa_perc
[1] 0.9756201
sb_perc
[1] 0.9591807
sc_perc
[1] 0.9949096
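As a cross-check on the aggregation approach above, the same proportions can be computed directly from the singular values (d_1^2 over the sum of squared singular values); this should match the three numbers printed above.

# proportion of variance on the first PC from the singular values
pc1_share <- function(m) {
  d <- svd(as.matrix(m))$d
  d[1]^2 / sum(d^2)
}
sapply(list(A = sa, B = sb, C = sc), pc1_share)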