── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 4.0.0 ✔ tibble 3.3.0
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.1.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
For the purposes of this question, assume we have 10-dimensional data - that is, ignore the Overall column.
A)
Explain why we need to scale this data set before performing PCA.
It needs to be scaled before the PCA so that the data can be equally distributed. Also helps when there are higher variances which can dominate the analysis. That will explain the most variability by making the weights look equal on the variables.
B)
Use svd() to find the first 2 principal component scores and their loadings. Full credit will only be granted if you use the svd() ingredients u, d, and v. What percent of the overall variability do the first two PCs explain?
PC1 = explains 30.27 % of total variability in the 11 variables
PC2 = explains additionally 20.74%
together they explain ( 30.72% + 20.74% = 51.01%)
C)
Find and print the loadings. Based on the loadings alone, if the first two PCs are plotted in a 2D plane as shown below, which of the four quadrants will the medalists be in? Explain your reasoning.
It would be on the III quadrant because the loadings are mainly negative loadings and PC2 also has negative loadings for many of the variables. so because they are both negative they fall in the III quadrant.
D)
Add the PCs to the decathlon data set and create a scatterplot of these PCs, with the points labeled by the athletes’ names. Color-code the points on whether or not the athlete was a medalist. Use the ggrepel package for better labeling. Verify that your intuition from C) is correct.
library("factoextra")
Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
ggplot(PCdf, aes(x = PC1, y = PC2, color = Medal, shape = Medal)) +geom_point(size =3) +geom_text_repel(aes(label= Athlete),size =3) +labs(x ="PC1", y ="PC2", color ="Medal", shape ="Nation")+theme_classic(base_size =14)
Looks like I was right on where the medals were located in the scatterplot.
E)
Canadian Damian Warner won the gold medal in the decathlon in the 2020 Tokyo games. He began the 2024 decathlon but bowed out after three straight missed pole vault attempts.
Would this have won a medal if it had happened in 2024? To answer this, we will compute his PCs with respect to the 2024 athletes and add it to the plot to see where his 2020 gold-medal performance compares to the 2024 athletes. To do this:
Find the mean vector from the 2024 athletes. Call it mean_vec_24.
Find the standard deviation vector from the 2024 athletes. Call it sd_vec_24.
Standardize Warner’s 2020 results with respect to the 2024 athletes: (warner-mean_vec_24)/sd_vec_24
Find Warner’s PC coordinates using the 2024 loadings.
Add his point to the scatterplot.
Do you think his 2020 performance would have won a medal if it had happened in 2024?
library(ggrepel)ggplot(PCdf, aes(x = PC1, y = PC2, color = Medal, shape = Medal)) +geom_point(size =3) +geom_text_repel(aes(label = Athlete), size =3) +geom_point(aes(x = warner_PC[1], y = warner_PC[2]),color ="black", fill ="gold", shape =8, size =6, stroke =1.5) +geom_text_repel(aes(x = warner_PC[1], y = warner_PC[2], label ="Warner 2020"),size =4, fontface ="bold", color ="black") +labs(x ="PC1", y ="PC2", color ="Medal", shape ="Medal") +theme_classic(base_size =14)
Warning in geom_point(aes(x = warner_PC[1], y = warner_PC[2]), color = "black", : All aesthetics have length 1, but the data has 20 rows.
ℹ Please consider using `annotate()` or provide this layer with data containing
a single row.
Warning: ggrepel: 7 unlabeled data points (too many overlaps). Consider
increasing max.overlaps
Warning: ggrepel: 20 unlabeled data points (too many overlaps). Consider
increasing max.overlaps
I don’t he could have had a chance to win the Gold medal if it had happened in 2024 . I think that is the case, because the point is on the on the left top quadrant but not necessarily straight up from the clusters on the bottom?
Question 2
Below is a screenshot of a conversation between me and chatbot Claude:
After looking at the graphs, I grew skeptical. So I said:
Behold, Claude’s three data sets which I’ve called claudeA, claudeB, and claudeC:
Each data set has an X and a Y column which represent 2-dimensional variables that we need to rotate.
A)
Scale each data set and plot them side-by-side using the patchwork package. Make sure the aspect ratio of each graph is 1 (i.e., make the height and width of each graph equal). At this point, explain why you think I was skeptical. Specifically, do you think the percent variability explained by the first PC of each data set appears to exceed or fall short of the variability I asked it to?
Standard deviations (1, .., p=2):
[1] 1.3968680 0.2208163
Rotation (n x k) = (2 x 2):
PC1 PC2
X 0.7071068 0.7071068
Y -0.7071068 0.7071068
claudeB_PCA
Standard deviations (1, .., p=2):
[1] 1.3850492 0.2857248
Rotation (n x k) = (2 x 2):
PC1 PC2
X 0.7071068 0.7071068
Y 0.7071068 -0.7071068
claudeC_PCA
Standard deviations (1, .., p=2):
[1] 1.4106095 0.1009004
Rotation (n x k) = (2 x 2):
PC1 PC2
X 0.7071068 0.7071068
Y -0.7071068 0.7071068
claudeA_PCA$rotation
PC1 PC2
X 0.7071068 0.7071068
Y -0.7071068 0.7071068
claudeB_PCA$rotation
PC1 PC2
X 0.7071068 0.7071068
Y 0.7071068 -0.7071068
claudeC_PCA$rotation
PC1 PC2
X 0.7071068 0.7071068
Y -0.7071068 0.7071068
summary(claudeA_PCA)
Importance of components:
PC1 PC2
Standard deviation 1.3969 0.22082
Proportion of Variance 0.9756 0.02438
Cumulative Proportion 0.9756 1.00000
summary(claudeB_PCA)
Importance of components:
PC1 PC2
Standard deviation 1.3850 0.28572
Proportion of Variance 0.9592 0.04082
Cumulative Proportion 0.9592 1.00000
summary(claudeC_PCA)
Importance of components:
PC1 PC2
Standard deviation 1.4106 0.10090
Proportion of Variance 0.9949 0.00509
Cumulative Proportion 0.9949 1.00000
c1 <-fviz_pca_var(claudeA_PCA, axes =c(1,2))
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.
ℹ The deprecated feature was likely used in the ggpubr package.
Please report the issue at <https://github.com/kassambara/ggpubr/issues>.
Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
ℹ Please use tidy evaluation idioms with `aes()`.
ℹ See also `vignette("ggplot2-in-packages")` for more information.
ℹ The deprecated feature was likely used in the factoextra package.
Please report the issue at <https://github.com/kassambara/factoextra/issues>.
claudeA_PCA$rotation[,1:2]
PC1 PC2
X 0.7071068 0.7071068
Y -0.7071068 0.7071068
looks like only ClaudeB has more of a correlation, which is not what you asked claude to do. What dataset B is supposed to do is show a strong negative correlation. I dont think these plots show the variability.
B)
Use SVD to find the first PC for each data set, and find the actual percent of total variability explained by each PC using aggregation methods.