For the purposes of this question, assume we have 10-dimensional data - that is, ignore the Overall column.
A)
Explain why we need to scale this data set before performing PCA.
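We need to scale this data set because the events are recorded on very different scales. For example, X1500m takes values in the hundreds while the high jump takes values in the single digits. Without scaling, the variables with the largest variances would dominate the principal components simply because of their units.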
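As a rough illustration of the scaling argument, the sketch below uses made-up numbers, not the real decathlon results. The column name High.jump and all simulated values are assumptions chosen only to mimic the difference in magnitude.

```r
# Hypothetical toy data: two events on very different scales
set.seed(1)
toy <- data.frame(
  X1500m    = rnorm(20, mean = 270, sd = 10),   # times in seconds
  High.jump = rnorm(20, mean = 2.0, sd = 0.08)  # heights in meters
)

# Share of the total variance carried by each column before scaling
raw_var   <- sapply(toy, var)
raw_share <- raw_var / sum(raw_var)

# After scale(), every column has mean 0 and standard deviation 1
toy_scaled <- scale(toy)
scaled_var <- apply(toy_scaled, 2, var)

print(round(raw_share, 4))  # X1500m carries almost all of the raw variance
print(scaled_var)           # both columns have variance 1 after scaling
```

On unscaled data like this, the first PC would essentially just be the X1500m axis, which is why scaling comes first.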
B)
Use svd() to find the first 2 principal component scores and their loadings. Full credit will only be granted if you use the svd() ingredients u, d, and v. What percent of the overall variability do the first two PCs explain?
library(tidyverse)
The first two PCs explain about 48.4% of the overall variability.
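The code behind this number is not shown above, so here is a minimal sketch of how the svd() ingredients u, d, and v can produce the scores, loadings, and percent of variability. It runs on random placeholder data, so its percentage will not match 48.4%. The names first_two_PCs and loadings are borrowed from the later code chunks, and X is a hypothetical stand-in for the 10 event columns.

```r
# Placeholder 20 x 10 matrix standing in for the 10 event columns
set.seed(42)
X <- matrix(rnorm(20 * 10), nrow = 20, ncol = 10)
X_scaled <- scale(X)  # center and scale each column

s <- svd(X_scaled)    # X_scaled = u %*% diag(d) %*% t(v)

# Loadings: the first two right singular vectors (columns of v)
loadings <- s$v[, 1:2]

# Scores: u times the singular values, which equals X_scaled %*% v
first_two_PCs <- s$u[, 1:2] %*% diag(s$d[1:2])

# Each PC's share of total variability is d^2 / sum(d^2)
var_explained <- s$d^2 / sum(s$d^2)
pct_first_two <- 100 * sum(var_explained[1:2])
print(pct_first_two)
```

The signs of the columns of u and v are arbitrary, so another PCA routine may return the same components with flipped signs.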
C)
Find and print the loadings. Based on the loadings alone, if the first two PCs are plotted in a 2D plane as shown below, which of the four quadrants will the medalists be in? Explain your reasoning.
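The medalists would probably be in the second quadrant (negative PC1, positive PC2). The most important variables, like longjump and shotput, have negative loadings on the 1st PC, so strong results in those events push an athlete toward negative PC1. Most of the 2nd PC loadings are positive, which points toward positive PC2.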
D)
Add the PCs to the decathlon data set and create a scatterplot of these PCs, with the points labeled by the athletes’ names. Color-code the points on whether or not the athlete was a medalist. Use the ggrepel package for better labeling. Verify that your intuition from C) is correct.
decathlon <- decathlon %>%
  mutate(PC1 = first_two_PCs[, 1], PC2 = first_two_PCs[, 2])

ggplot(data = decathlon) +
  geom_point(aes(x = PC1, y = PC2, color = Medal)) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "black") +
  geom_vline(xintercept = 0, linetype = "dashed", color = "black") +
  theme_classic()
All medalists are located in the second quadrant along with two other competitors.
E)
Canadian Damian Warner won the gold medal in the decathlon in the 2020 Tokyo games. He began the 2024 decathlon but bowed out after three straight missed pole vault attempts.
Would his 2020 performance have won a medal if it had happened in 2024? To answer this, we will compute his PCs with respect to the 2024 athletes and add them to the plot to see how his 2020 gold-medal performance compares to the 2024 athletes. To do this:
Find the mean vector from the 2024 athletes. Call it mean_vec_24.
Find the standard deviation vector from the 2024 athletes. Call it sd_vec_24.
Standardize Warner’s 2020 results with respect to the 2024 athletes: (warner-mean_vec_24)/sd_vec_24
Find Warner’s PC coordinates using the 2024 loadings.
Add his point to the scatterplot.
Do you think his 2020 performance would have won a medal if it had happened in 2024?
# Standardize Warner's 2020 results against 2024 and project onto the 2024 loadings
warner_pc <- as.vector((warner - mean_vec_24) / sd_vec_24) %*% loadings

# annotate() draws a single point, which avoids the
# "All aesthetics have length 1, but the data has 20 rows" warning from geom_point()
ggplot(data = decathlon) +
  geom_point(aes(x = PC1, y = PC2, color = Medal)) +
  annotate("point", x = warner_pc[1], y = warner_pc[2], color = "red") +
  geom_hline(yintercept = 0, linetype = "dashed", color = "black") +
  geom_vline(xintercept = 0, linetype = "dashed", color = "black") +
  theme_classic()
All of the medalists from 2024 are in the second quadrant while Warner is down in the third quadrant. This tells us that he probably would not have won a medal in 2024 with his 2020 performance.
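The projection steps listed in this part can be sketched end to end on placeholder data. Everything here is simulated: X24 stands in for the 2024 event columns and new_obs is a hypothetical stand-in for warner. Only the names mean_vec_24, sd_vec_24, and loadings follow the assignment.

```r
# Placeholder "2024" data: 20 athletes, 10 events
set.seed(7)
X24 <- matrix(rnorm(20 * 10, mean = 50, sd = 5), nrow = 20)

# Mean and standard deviation vectors from the 2024 athletes
mean_vec_24 <- colMeans(X24)
sd_vec_24   <- apply(X24, 2, sd)

# 2024 loadings from the SVD of the standardized 2024 data
X24_scaled <- scale(X24, center = mean_vec_24, scale = sd_vec_24)
loadings   <- svd(X24_scaled)$v[, 1:2]

# Hypothetical new athlete, standardized with the 2024 mean and sd
new_obs <- rnorm(10, mean = 50, sd = 5)
new_z   <- (new_obs - mean_vec_24) / sd_vec_24

# PC coordinates of the new athlete in the 2024 PC space
new_pc <- as.vector(new_z %*% loadings)
print(new_pc)

# Sanity check: projecting a 2024 athlete this way reproduces that athlete's score
row1_pc <- as.vector(((X24[1, ] - mean_vec_24) / sd_vec_24) %*% loadings)
```

Using the 2024 mean, standard deviation, and loadings (rather than recomputing them with the new athlete included) keeps the existing points fixed, so the new point is directly comparable to them.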
Question 2
Below is a screenshot of a conversation between me and chatbot Claude:
After looking at the graphs, I grew skeptical. So I said:
Behold, Claude’s three data sets which I’ve called claudeA, claudeB, and claudeC:
Each data set has an X and a Y column which represent 2-dimensional variables that we need to rotate.
A)
Scale each data set and plot them side-by-side using the patchwork package. Make sure the aspect ratio of each graph is 1 (i.e., make the height and width of each graph equal). At this point, explain why you think I was skeptical. Specifically, do you think the percent variability explained by the first PC of each data set appears to exceed or fall short of the variability I asked it to?
I believe you were skeptical because in every data set the first PC is going to explain the large majority of the variance. The points look highly correlated, so knowing X tells you a lot about Y, and vice versa. The PC1 percentages are therefore going to be much higher than requested, except perhaps for the 90% request.
B)
Use SVD to find the first PC for each data set, and find the actual percent of total variability explained by each PC using aggregation methods.
PC1 explains 97.5% of the total variability in set A, 96% in set B, and 99.5% in set C. All of these are much greater than what we asked for, which is the result we expected.
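One way to see why highly correlated data gives such large percentages: for two scaled variables with correlation r, the correlation matrix has eigenvalues 1 + |r| and 1 - |r|, so PC1 explains (1 + |r|) / 2 of the total variability. The sketch below checks this on simulated data. The data are made up and are not claudeA, claudeB, or claudeC.

```r
# Simulated 2-D data with a moderate positive correlation
set.seed(3)
x <- rnorm(200)
y <- 0.6 * x + rnorm(200, sd = 0.8)
d <- scale(cbind(X = x, Y = y))  # center and scale both columns

# Share of variability explained by PC1, from the singular values
s <- svd(d)
pc1_share <- s$d[1]^2 / sum(s$d^2)

# Closed form for two scaled variables: (1 + |r|) / 2
r <- cor(x, y)
closed_form <- (1 + abs(r)) / 2

print(c(svd = pc1_share, closed_form = closed_form))
```

Under this relationship, PC1 shares of 97.5%, 96%, and 99.5% would correspond to correlations of about 0.95, 0.92, and 0.99 in absolute value, and no 2-D scaled data set can have a PC1 share below 50%.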