Activity 3.2 - PC concepts

SUBMISSION INSTRUCTIONS

Render to html
Publish your html to RPubs
Submit a link to your published solutions

Question 1

The data set we will analyze for this question are on the 10 events in the Men’s 2024 Olympic Decathlon in Paris.

decathlon <- read.csv('Data/mens_decathlon_paris2024.csv')
head(decathlon)

         Athlete  Medal      Nation Overall X100m LongJump ShotPut HighJump
1   Markus Rooth   Gold      Norway    8796 10.71     7.80   15.25     1.99
2 Leo Neugebauer Silver     Germany    8748 10.67     7.98   16.55     2.05
3  Lindon Victor Bronze     Grenada    8711 10.56     7.48   15.71     2.02
4    Sven Roosen   None Netherlands    8607 10.52     7.56   15.10     1.87
5  Janek Õiglane   None     Estonia    8572 10.89     7.25   14.58     1.99
6   Johannes Erm   None     Estonia    8569 10.64     7.66   14.61     2.08
  X400m X110mHurdle Discus PoleVault Javelin X1500m
1 47.69       14.25  49.80       5.3   66.87  279.6
2 47.70       14.51  53.33       5.0   56.64  284.7
3 47.84       14.62  53.91       4.9   68.22  283.5
4 46.40       13.99  46.88       4.7   63.72  258.5
5 48.02       14.45  43.49       5.3   71.89  265.6
6 47.19       14.35  46.29       4.6   59.58  259.7

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   4.0.0     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(dplyr)

For the purposes of this question, assume we have 10-dimensional data - that is, ignore the Overall column.

A)

Explain why we need to scale this data set before performing PCA.

It needs to be scaled before the PCA so that the data can be equally distributed. Also helps when there are higher variances which can dominate the analysis. That will explain the most variability by making the weights look equal on the variables.

B)

Use svd() to find the first 2 principal component scores and their loadings. Full credit will only be granted if you use the svd() ingredients u, d, and v. What percent of the overall variability do the first two PCs explain?

decathlon <- read.csv('Data/mens_decathlon_paris2024.csv') %>%
  select(Overall:`X1500m`)
head(decathlon)

  Overall X100m LongJump ShotPut HighJump X400m X110mHurdle Discus PoleVault
1    8796 10.71     7.80   15.25     1.99 47.69       14.25  49.80       5.3
2    8748 10.67     7.98   16.55     2.05 47.70       14.51  53.33       5.0
3    8711 10.56     7.48   15.71     2.02 47.84       14.62  53.91       4.9
4    8607 10.52     7.56   15.10     1.87 46.40       13.99  46.88       4.7
5    8572 10.89     7.25   14.58     1.99 48.02       14.45  43.49       5.3
6    8569 10.64     7.66   14.61     2.08 47.19       14.35  46.29       4.6
  Javelin X1500m
1   66.87  279.6
2   56.64  284.7
3   68.22  283.5
4   63.72  258.5
5   71.89  265.6
6   59.58  259.7

scaled_decathlon <- scale(decathlon)
svd_decathlon<- svd(scaled_decathlon)
U <- svd_decathlon$u
D <- diag(svd_decathlon$d)
V <- svd_decathlon$v

PCs <- U %*% D
head(PCs, 3)

          [,1]       [,2]       [,3]      [,4]       [,5]       [,6]
[1,] -1.695937 -0.8904955 -0.4001134 0.4589659 -0.5377891 -0.7936640
[2,] -2.774415 -1.0726642  0.9552097 1.4036096  1.0521479 -0.2191248
[3,] -1.591066 -1.6723716  0.4329604 1.1106322  0.1463375 -1.1235656
             [,7]       [,8]        [,9]       [,10]       [,11]
[1,] -0.008081682 -0.4798479 -0.22928502 -0.02026038  0.06312045
[2,] -0.211363885 -0.6006325 -0.41174398  0.18357991 -0.00592933
[3,] -0.085657752  0.7378959 -0.05275103 -0.41488401 -0.03410443

head(cor(PCs) %>% round(2))

     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]
[1,]    1    0    0    0    0    0    0    0    0     0     0
[2,]    0    1    0    0    0    0    0    0    0     0     0
[3,]    0    0    1    0    0    0    0    0    0     0     0
[4,]    0    0    0    1    0    0    0    0    0     0     0
[5,]    0    0    0    0    1    0    0    0    0     0     0
[6,]    0    0    0    0    0    1    0    0    0     0     0

head((PCs
  %>% data.frame()
  %>% summarize(across(everything(), var))
    %>%mutate(across(everything(), \(x) x/11))
))

        X1       X2        X3        X4         X5       X6         X7
1 0.302659 0.207356 0.1410204 0.1290576 0.09682299 0.050653 0.04476458
         X8         X9         X10          X11
1 0.0127684 0.01111458 0.003631489 0.0001519315

PC1 = explains 30.27 % of total variability in the 11 variables

PC2 = explains additionally 20.74%

together they explain ( 30.72% + 20.74% = 51.01%)

C)

Find and print the loadings. Based on the loadings alone, if the first two PCs are plotted in a 2D plane as shown below, which of the four quadrants will the medalists be in? Explain your reasoning.

loadings <- svd_decathlon$v
rownames(loadings) <- colnames(decathlon)
loadings[,1:2]

                   [,1]        [,2]
Overall     -0.43162210 -0.34231222
X100m        0.35798164 -0.25703870
LongJump    -0.43178412  0.22512564
ShotPut     -0.38055298 -0.26022632
HighJump    -0.13793625  0.06639053
X400m        0.39743783 -0.29722922
X110mHurdle  0.25439070 -0.34408887
Discus      -0.25866950 -0.40015712
PoleVault   -0.16547894 -0.47572624
Javelin      0.11544450 -0.29214272
X1500m       0.07053313  0.12415936

The medalist would be on the III quadrant.

It would be on the III quadrant because the loadings are mainly negative loadings and PC2 also has negative loadings for many of the variables. so because they are both negative they fall in the III quadrant.

D)

Add the PCs to the decathlon data set and create a scatterplot of these PCs, with the points labeled by the athletes’ names. Color-code the points on whether or not the athlete was a medalist. Use the ggrepel package for better labeling. Verify that your intuition from C) is correct.

library("factoextra")

Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa

library(dplyr)
library(ggplot2)
library(ggrepel)

dec_data <- decathlon[,1:11]
dec_data <- as.data.frame(lapply(dec_data, as.numeric))

dec_pca <- prcomp(dec_data, scale. = TRUE)

decathlon_1<- read.csv('Data/mens_decathlon_paris2024.csv') 
  
head(decathlon_1)

         Athlete  Medal      Nation Overall X100m LongJump ShotPut HighJump
1   Markus Rooth   Gold      Norway    8796 10.71     7.80   15.25     1.99
2 Leo Neugebauer Silver     Germany    8748 10.67     7.98   16.55     2.05
3  Lindon Victor Bronze     Grenada    8711 10.56     7.48   15.71     2.02
4    Sven Roosen   None Netherlands    8607 10.52     7.56   15.10     1.87
5  Janek Õiglane   None     Estonia    8572 10.89     7.25   14.58     1.99
6   Johannes Erm   None     Estonia    8569 10.64     7.66   14.61     2.08
  X400m X110mHurdle Discus PoleVault Javelin X1500m
1 47.69       14.25  49.80       5.3   66.87  279.6
2 47.70       14.51  53.33       5.0   56.64  284.7
3 47.84       14.62  53.91       4.9   68.22  283.5
4 46.40       13.99  46.88       4.7   63.72  258.5
5 48.02       14.45  43.49       5.3   71.89  265.6
6 47.19       14.35  46.29       4.6   59.58  259.7

PCs <- dec_pca$x[,1:2]
PCdf <- bind_cols(data.frame(PCs), decathlon_1)

ggplot(PCdf, aes(x = PC1, y = PC2, color = Medal, shape = Medal)) +
             geom_point(size =3) +
  geom_text_repel(aes(label= Athlete),size =3)  +
               labs(x = "PC1", y = "PC2", color = "Medal", shape = "Nation")+
             theme_classic(base_size = 14)

Looks like I was right on where the medals were located in the scatterplot.

E)

Canadian Damian Warner won the gold medal in the decathlon in the 2020 Tokyo games. He began the 2024 decathlon but bowed out after three straight missed pole vault attempts.

These are his results in the 10 events in 2020:

warner <- c(10.12, 8.24, 14.8, 2.02, 47.48, 13.46, 48.67, 4.9, 63.44, 271.08)

Would this have won a medal if it had happened in 2024? To answer this, we will compute his PCs with respect to the 2024 athletes and add it to the plot to see where his 2020 gold-medal performance compares to the 2024 athletes. To do this:

Find the mean vector from the 2024 athletes. Call it mean_vec_24.
Find the standard deviation vector from the 2024 athletes. Call it sd_vec_24.
Standardize Warner’s 2020 results with respect to the 2024 athletes: (warner-mean_vec_24)/sd_vec_24
Find Warner’s PC coordinates using the 2024 loadings.
Add his point to the scatterplot.

Do you think his 2020 performance would have won a medal if it had happened in 2024?

mean_vec_24 <- apply(dec_data,2,mean)
sd_vec_24 <- apply(dec_data,2,sd)

warner_std <- (warner- mean_vec_24) / sd_vec_24

Warning in warner - mean_vec_24: longer object length is not a multiple of
shorter object length

warner_std

    Overall       X100m    LongJump     ShotPut    HighJump       X400m 
 -19.424962   -9.949402   20.626218  -14.823464  661.673013  -28.355054 
X110mHurdle      Discus   PoleVault     Javelin      X1500m 
  91.260836  -10.763216   39.259224   29.483171  -21.860700

warner_PC <-as.numeric(warner_std %*%
  dec_pca$rotation[,1:2])
warner_PC

[1] -79.61527  12.96549

library(ggrepel)

ggplot(PCdf, aes(x = PC1, y = PC2, color = Medal, shape = Medal)) +
  geom_point(size = 3) +
  geom_text_repel(aes(label = Athlete), size = 3) +
  
  geom_point(aes(x = warner_PC[1], y = warner_PC[2]),
             color = "black", fill = "gold", shape = 8, size = 6, stroke = 1.5) +
  geom_text_repel(aes(x = warner_PC[1], y = warner_PC[2], label = "Warner 2020"),
                   size = 4, fontface = "bold", color = "black") +
  
  labs(x = "PC1", y = "PC2", color = "Medal", shape = "Medal") +
  theme_classic(base_size = 14)

Warning in geom_point(aes(x = warner_PC[1], y = warner_PC[2]), color = "black", : All aesthetics have length 1, but the data has 20 rows.
ℹ Please consider using `annotate()` or provide this layer with data containing
  a single row.

Warning: ggrepel: 7 unlabeled data points (too many overlaps). Consider
increasing max.overlaps

Warning: ggrepel: 20 unlabeled data points (too many overlaps). Consider
increasing max.overlaps

I don’t he could have had a chance to win the Gold medal if it had happened in 2024 . I think that is the case, because the point is on the on the left top quadrant but not necessarily straight up from the clusters on the bottom?

Question 2

Below is a screenshot of a conversation between me and chatbot Claude:

After looking at the graphs, I grew skeptical. So I said:

Behold, Claude’s three data sets which I’ve called claudeA, claudeB, and claudeC:

claudeA <- read.csv('Data/claude_dataA.csv')
claudeB <- read.csv('Data/claude_dataB.csv')
claudeC <- read.csv('Data/claude_dataC.csv')

Each data set has an X and a Y column which represent 2-dimensional variables that we need to rotate.

A)

Scale each data set and plot them side-by-side using the patchwork package. Make sure the aspect ratio of each graph is 1 (i.e., make the height and width of each graph equal). At this point, explain why you think I was skeptical. Specifically, do you think the percent variability explained by the first PC of each data set appears to exceed or fall short of the variability I asked it to?

library(patchwork)
library(ggplot2)

claudeA_PCA <- prcomp(claudeA, scale.=TRUE)
claudeB_PCA <-prcomp(claudeB, scale.=TRUE)
claudeC_PCA <- prcomp(claudeC, scale.=TRUE)

claudeA_PCA

Standard deviations (1, .., p=2):
[1] 1.3968680 0.2208163

Rotation (n x k) = (2 x 2):
         PC1       PC2
X  0.7071068 0.7071068
Y -0.7071068 0.7071068

claudeB_PCA

Standard deviations (1, .., p=2):
[1] 1.3850492 0.2857248

Rotation (n x k) = (2 x 2):
        PC1        PC2
X 0.7071068  0.7071068
Y 0.7071068 -0.7071068

claudeC_PCA

Standard deviations (1, .., p=2):
[1] 1.4106095 0.1009004

Rotation (n x k) = (2 x 2):
         PC1       PC2
X  0.7071068 0.7071068
Y -0.7071068 0.7071068

claudeA_PCA$rotation

         PC1       PC2
X  0.7071068 0.7071068
Y -0.7071068 0.7071068

claudeB_PCA$rotation

        PC1        PC2
X 0.7071068  0.7071068
Y 0.7071068 -0.7071068

claudeC_PCA$rotation

         PC1       PC2
X  0.7071068 0.7071068
Y -0.7071068 0.7071068

summary(claudeA_PCA)

Importance of components:
                          PC1     PC2
Standard deviation     1.3969 0.22082
Proportion of Variance 0.9756 0.02438
Cumulative Proportion  0.9756 1.00000

summary(claudeB_PCA)

Importance of components:
                          PC1     PC2
Standard deviation     1.3850 0.28572
Proportion of Variance 0.9592 0.04082
Cumulative Proportion  0.9592 1.00000

summary(claudeC_PCA)

Importance of components:
                          PC1     PC2
Standard deviation     1.4106 0.10090
Proportion of Variance 0.9949 0.00509
Cumulative Proportion  0.9949 1.00000

c1 <-fviz_pca_var(claudeA_PCA, axes = c(1,2))

Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.
ℹ The deprecated feature was likely used in the ggpubr package.
  Please report the issue at <https://github.com/kassambara/ggpubr/issues>.

Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
ℹ Please use tidy evaluation idioms with `aes()`.
ℹ See also `vignette("ggplot2-in-packages")` for more information.
ℹ The deprecated feature was likely used in the factoextra package.
  Please report the issue at <https://github.com/kassambara/factoextra/issues>.

claudeA_PCA$rotation[,1:2]

         PC1       PC2
X  0.7071068 0.7071068
Y -0.7071068 0.7071068

c2 <-fviz_pca_var(claudeB_PCA, axes = c(1,2))
claudeB_PCA$rotation[,1:2]

        PC1        PC2
X 0.7071068  0.7071068
Y 0.7071068 -0.7071068

c3 <-fviz_pca_var(claudeC_PCA, axes = c(1,2))
claudeC_PCA$rotation[,1:2]

         PC1       PC2
X  0.7071068 0.7071068
Y -0.7071068 0.7071068

PlotA<- fviz_pca(claudeA_PCA, axes = 1:2)
claudeA_PCA$x[,1:2]

                PC1          PC2
 [1,] -2.2900181908  0.066492899
 [2,] -1.9202720378 -0.083102235
 [3,] -1.9400318501  0.156808597
 [4,] -1.4404503753 -0.122621859
 [5,] -1.4602101876  0.117288972
 [6,] -0.9606287128 -0.162141484
 [7,] -0.9803885251  0.077769347
 [8,] -0.4808070503 -0.201661108
 [9,] -0.5005668626  0.038249723
[10,] -0.0009853878 -0.241180733
[11,] -0.0207452001 -0.001269902
[12,]  0.4788362747 -0.280700358
[13,]  0.4590764624 -0.040789526
[14,]  0.9586579372 -0.320219982
[15,]  0.9388981249 -0.080309151
[16,]  1.3086442779 -0.229904285
[17,] -2.3758233088  0.240358425
[18,] -1.7243914103 -0.146892251
[19,] -2.0258369680  0.330674122
[20,] -1.2445697478 -0.186411875
[21,] -1.5460153055  0.291154498
[22,] -0.7647480853 -0.225931500
[23,] -1.0661936431  0.251634873
[24,] -0.2849264228 -0.265451125
[25,] -0.5863719806  0.212115248
[26,]  0.1948952397 -0.304970749
[27,] -0.1065503181  0.172595624
[28,]  0.6747169022 -0.344490374
[29,]  0.3732713444  0.133075999
[30,]  1.0247032429 -0.254174677
[31,]  0.8530930069  0.093556375
[32,]  1.5045249054 -0.293694301
[33,]  1.2888844656  0.010006546
[34,]  1.5287952969 -0.009753266
[35,]  1.3792001628  0.359992887
[36,]  1.8787816376  0.080562431
[37,]  1.8590218253  0.320473262
[38,]  2.2287679783  0.170878128
[39,]  2.2090081660  0.410788959
[40,]  2.5787543190  0.261193825

head(PlotA)

$data
   name             x            y       coord cos2    contrib
1     1 -2.2900181908  0.066492899 5.248604620    1 6.56075577
2     2 -1.9202720378 -0.083102235 3.694350680    1 4.61793835
3     3 -1.9400318501  0.156808597 3.788312515    1 4.73539064
4     4 -1.4404503753 -0.122621859 2.089933404    1 2.61241675
5     5 -1.4602101876  0.117288972 2.145970495    1 2.68246312
6     6 -0.9606287128 -0.162141484 0.949097385    1 1.18637173
7     7 -0.9803885251  0.077769347 0.967209731    1 1.20901216
8     8 -0.4808070503 -0.201661108 0.271842622    1 0.33980328
9     9 -0.5005668626  0.038249723 0.252030225    1 0.31503778
10   10 -0.0009853878 -0.241180733 0.058169117    1 0.07271140
11   11 -0.0207452001 -0.001269902 0.000431976    1 0.00053997
12   12  0.4788362747 -0.280700358 0.308076869    1 0.38509609
13   13  0.4590764624 -0.040789526 0.212414984    1 0.26551873
14   14  0.9586579372 -0.320219982 1.021565878    1 1.27695735
15   15  0.9388981249 -0.080309151 0.887979249    1 1.10997406
16   16  1.3086442779 -0.229904285 1.765405826    1 2.20675728
17   17 -2.3758233088  0.240358425 5.702308567    1 7.12788571
18   18 -1.7243914103 -0.146892251 2.995103069    1 3.74387884
19   19 -2.0258369680  0.330674122 4.213360796    1 5.26670100
20   20 -1.2445697478 -0.186411875 1.583703244    1 1.97962906
21   21 -1.5460153055  0.291154498 2.474934266    1 3.09366783
22   22 -0.7647480853 -0.225931500 0.635884677    1 0.79485585
23   23 -1.0661936431  0.251634873 1.200088994    1 1.50011124
24   24 -0.2849264228 -0.265451125 0.151647366    1 0.18955921
25   25 -0.5863719806  0.212115248 0.388824978    1 0.48603122
26   26  0.1948952397 -0.304970749 0.130991312    1 0.16373914
27   27 -0.1065503181  0.172595624 0.041142220    1 0.05142777
28   28  0.6747169022 -0.344490374 0.573916516    1 0.71739564
29   29  0.3732713444  0.133075999 0.157040718    1 0.19630090
30   30  1.0247032429 -0.254174677 1.114621502    1 1.39327688
31   31  0.8530930069  0.093556375 0.736520474    1 0.92065059
32   32  1.5045249054 -0.293694301 2.349851533    1 2.93731442
33   33  1.2888844656  0.010006546 1.661323297    1 2.07665412
34   34  1.5287952969 -0.009753266 2.337310186    1 2.92163773
35   35  1.3792001628  0.359992887 2.031787968    1 2.53973496
36   36  1.8787816376  0.080562431 3.536310747    1 4.42038843
37   37  1.8590218253  0.320473262 3.558665259    1 4.44833157
38   38  2.2287679783  0.170878128 4.996606036    1 6.24575754
39   39  2.2090081660  0.410788959 5.048464647    1 6.31058081
40   40  2.5787543190  0.261193825 6.718196052    1 8.39774507

PlotB<- fviz_pca(claudeB_PCA, axes = 1:2)
claudeA_PCA$x[,1:2]

                PC1          PC2
 [1,] -2.2900181908  0.066492899
 [2,] -1.9202720378 -0.083102235
 [3,] -1.9400318501  0.156808597
 [4,] -1.4404503753 -0.122621859
 [5,] -1.4602101876  0.117288972
 [6,] -0.9606287128 -0.162141484
 [7,] -0.9803885251  0.077769347
 [8,] -0.4808070503 -0.201661108
 [9,] -0.5005668626  0.038249723
[10,] -0.0009853878 -0.241180733
[11,] -0.0207452001 -0.001269902
[12,]  0.4788362747 -0.280700358
[13,]  0.4590764624 -0.040789526
[14,]  0.9586579372 -0.320219982
[15,]  0.9388981249 -0.080309151
[16,]  1.3086442779 -0.229904285
[17,] -2.3758233088  0.240358425
[18,] -1.7243914103 -0.146892251
[19,] -2.0258369680  0.330674122
[20,] -1.2445697478 -0.186411875
[21,] -1.5460153055  0.291154498
[22,] -0.7647480853 -0.225931500
[23,] -1.0661936431  0.251634873
[24,] -0.2849264228 -0.265451125
[25,] -0.5863719806  0.212115248
[26,]  0.1948952397 -0.304970749
[27,] -0.1065503181  0.172595624
[28,]  0.6747169022 -0.344490374
[29,]  0.3732713444  0.133075999
[30,]  1.0247032429 -0.254174677
[31,]  0.8530930069  0.093556375
[32,]  1.5045249054 -0.293694301
[33,]  1.2888844656  0.010006546
[34,]  1.5287952969 -0.009753266
[35,]  1.3792001628  0.359992887
[36,]  1.8787816376  0.080562431
[37,]  1.8590218253  0.320473262
[38,]  2.2287679783  0.170878128
[39,]  2.2090081660  0.410788959
[40,]  2.5787543190  0.261193825

head(PlotB)

$data
   name            x            y      coord cos2     contrib
1     1 -2.019222475 -0.204302816 4.11899905    1  5.14874881
2     2 -1.704764121 -0.298610151 2.99538873    1  3.74423591
3     3 -1.880824594  0.097601341 3.54702718    1  4.43378397
4     4 -1.280230258 -0.282841977 1.71898910    1  2.14873637
5     5 -1.456290731  0.113369515 2.13363534    1  2.66704417
6     6 -0.855696394 -0.267073802 0.80354474    1  1.00443092
7     7 -1.031756867  0.129137690 1.08119878    1  1.35149847
8     8 -0.431162531 -0.251305628 0.24905565    1  0.31131956
9     9 -0.607223004  0.144905864 0.38971749    1  0.48714686
10   10 -0.006628667 -0.235537453 0.05552183    1  0.06940229
11   11 -0.182689140  0.160674038 0.05919147    1  0.07398934
12   12  0.417905196 -0.219769279 0.22294329    1  0.27867911
13   13  0.241844723  0.176442213 0.08962072    1  0.11202591
14   14  0.842439059 -0.204001105 0.75132002    1  0.93915002
15   15  0.666378587  0.192210387 0.48100525    1  0.60125657
16   16  1.266972923 -0.188232930 1.64065202    1  2.05081503
17   17 -2.179575116  0.044110232 4.75249340    1  5.94061675
18   18 -1.516089109 -0.355194552 2.42468936    1  3.03086170
19   19 -2.041177235  0.346014389 4.28613046    1  5.35766308
20   20 -1.091555246 -0.339426378 1.30670312    1  1.63337890
21   21 -1.616643371  0.361782563 2.74442241    1  3.43052802
22   22 -0.667021382 -0.323658203 0.54967216    1  0.68709020
23   23 -1.192109508  0.377550738 1.56366964    1  1.95458705
24   24 -0.242487519 -0.307890029 0.15359647    1  0.19199558
25   25 -0.767575645  0.393318912 0.74387214    1  0.92984017
26   26  0.182046345 -0.292121854 0.11847605    1  0.14809506
27   27 -0.343041781  0.409087087 0.28502991    1  0.35628739
28   28  0.606580208 -0.276353680 0.44431091    1  0.55538863
29   29  0.081492082  0.424855261 0.18714295    1  0.23392869
30   30  1.031114072 -0.260585506 1.13110103    1  1.41387629
31   31  0.506025946  0.440623436 0.45021127    1  0.56276409
32   32  1.455647935 -0.244817331 2.17884644    1  2.72355805
33   33  1.295295295  0.003595717 1.67780283    1  2.09725354
34   34  1.691506786 -0.172464756 2.89093930    1  3.61367413
35   35  1.311063469  0.428129581 1.90218236    1  2.37772795
36   36  2.116040650 -0.156696581 4.50218185    1  5.62772731
37   37  1.735597332  0.443897755 3.20934332    1  4.01167915
38   38  2.540574513 -0.140928407 6.47437967    1  8.09297459
39   39  2.160131196  0.459665930 4.87745955    1  6.09682444
40   40  2.965108377 -0.125160232 8.80753277    1 11.00941596

PlotC<- fviz_pca(claudeC_PCA, axes = 1:2)
claudeA_PCA$x[,1:2]

                PC1          PC2
 [1,] -2.2900181908  0.066492899
 [2,] -1.9202720378 -0.083102235
 [3,] -1.9400318501  0.156808597
 [4,] -1.4404503753 -0.122621859
 [5,] -1.4602101876  0.117288972
 [6,] -0.9606287128 -0.162141484
 [7,] -0.9803885251  0.077769347
 [8,] -0.4808070503 -0.201661108
 [9,] -0.5005668626  0.038249723
[10,] -0.0009853878 -0.241180733
[11,] -0.0207452001 -0.001269902
[12,]  0.4788362747 -0.280700358
[13,]  0.4590764624 -0.040789526
[14,]  0.9586579372 -0.320219982
[15,]  0.9388981249 -0.080309151
[16,]  1.3086442779 -0.229904285
[17,] -2.3758233088  0.240358425
[18,] -1.7243914103 -0.146892251
[19,] -2.0258369680  0.330674122
[20,] -1.2445697478 -0.186411875
[21,] -1.5460153055  0.291154498
[22,] -0.7647480853 -0.225931500
[23,] -1.0661936431  0.251634873
[24,] -0.2849264228 -0.265451125
[25,] -0.5863719806  0.212115248
[26,]  0.1948952397 -0.304970749
[27,] -0.1065503181  0.172595624
[28,]  0.6747169022 -0.344490374
[29,]  0.3732713444  0.133075999
[30,]  1.0247032429 -0.254174677
[31,]  0.8530930069  0.093556375
[32,]  1.5045249054 -0.293694301
[33,]  1.2888844656  0.010006546
[34,]  1.5287952969 -0.009753266
[35,]  1.3792001628  0.359992887
[36,]  1.8787816376  0.080562431
[37,]  1.8590218253  0.320473262
[38,]  2.2287679783  0.170878128
[39,]  2.2090081660  0.410788959
[40,]  2.5787543190  0.261193825

head(PlotC)

$data
   name            x             y       coord cos2      contrib
1     1 -2.270556897  0.0470316057 5.157640595    1  6.447050744
2     2 -1.933711475 -0.0696627970 3.744092975    1  4.680116219
3     3 -1.914343931  0.1311206774 3.681905318    1  4.602381647
4     4 -1.486790544 -0.0762816903 2.216365019    1  2.770456273
5     5 -1.467423000  0.1245017841 2.168830954    1  2.711038693
6     6 -1.039869613 -0.0829005835 1.088201319    1  1.360251649
7     7 -1.020502069  0.1178828908 1.055320848    1  1.319151060
8     8 -0.592948682 -0.0895194768 0.359601876    1  0.449502345
9     9 -0.573581137  0.1112639976 0.341374998    1  0.426718748
10   10 -0.146027751 -0.0961383701 0.030566690    1  0.038208363
11   11 -0.126660206  0.1046451043 0.026993406    1  0.033741757
12   12  0.300893180 -0.1027572633 0.101095761    1  0.126369701
13   13  0.320260725  0.0980262110 0.112176070    1  0.140220088
14   14  0.747814112 -0.1093761566 0.571189089    1  0.713986361
15   15  0.767181656  0.0914073177 0.596922991    1  0.746153739
16   16  1.194735043 -0.1159950499 1.440846674    1  1.801058342
17   17 -2.135818728  0.0003538446 4.561721766    1  5.702152207
18   18 -1.776958205 -0.0943254562 3.166477753    1  3.958097192
19   19 -1.779605762  0.0844429163 3.174127275    1  3.967659094
20   20 -1.330037274 -0.1009443495 1.779188911    1  2.223986139
21   21 -1.332684831  0.0778240230 1.782105437    1  2.227631797
22   22 -0.883116343 -0.1075632427 0.791464326    1  0.989330407
23   23 -0.885763900  0.0712051297 0.789647857    1  0.987059821
24   24 -0.436195411 -0.1141821360 0.203303997    1  0.254129996
25   25 -0.438842969  0.0645862365 0.196754533    1  0.245943166
26   26  0.010725520 -0.1208010293 0.014707925    1  0.018384907
27   27  0.008077962  0.0579673432 0.003425466    1  0.004281833
28   28  0.457646451 -0.1274199225 0.225676111    1  0.282095138
29   29  0.454998894  0.0513484499 0.209660657    1  0.262075821
30   30  0.904567382 -0.1340388158 0.836208553    1  1.045260691
31   31  0.901919825  0.0447295566 0.815460104    1  1.019325130
32   32  1.351488313 -0.1406577091 1.846305252    1  2.307881565
33   33  1.214102587  0.0847884245 1.481234169    1  1.851542712
34   34  1.550948009 -0.0319059783 2.406457718    1  3.008072148
35   35  1.570315554  0.1688774961 2.494410547    1  3.118013183
36   36  1.997868940 -0.0385248715 3.992964468    1  4.991205585
37   37  2.017236485  0.1622586028 4.095570890    1  5.119463612
38   38  2.444789871 -0.0451437648 5.979035474    1  7.473794343
39   39  2.464157416  0.1556397096 6.096295490    1  7.620369362
40   40  2.891710802 -0.0517626581 8.364670738    1 10.455838423

library(patchwork)
(PlotA)/(PlotB)/(PlotC)

looks like only ClaudeB has more of a correlation, which is not what you asked claude to do. What dataset B is supposed to do is show a strong negative correlation. I dont think these plots show the variability.

B)

Use SVD to find the first PC for each data set, and find the actual percent of total variability explained by each PC using aggregation methods.

scaled_claudeA <- scale(claudeA)
svd_claudeA<- svd(scaled_claudeA)
U <- svd_claudeA$u
D <- diag(svd_claudeA$d)
V <- svd_claudeA$v

PCs_A <- U %*% D
head(PCs_A, 2)

          [,1]        [,2]
[1,] -2.290018  0.06649290
[2,] -1.920272 -0.08310223

cor(PCs_A) %>% round(2)

     [,1] [,2]
[1,]    1    0
[2,]    0    1

(PCs_A
  %>% data.frame()
  %>% summarize(across(everything(), var))
    %>%mutate(across(everything(), \(x) x/2))
)

         X1         X2
1 0.9756201 0.02437991

scaled_claudeB <- scale(claudeB)
svd_claudeB<- svd(scaled_claudeB)
U <- svd_claudeB$u
D <- diag(svd_claudeB$d)
V <- svd_claudeB$v

PCs_B <- U %*% D
head(PCs_B, 2)

          [,1]       [,2]
[1,] -2.019222 -0.2043028
[2,] -1.704764 -0.2986102

cor(PCs_B) %>% round(2)

     [,1] [,2]
[1,]    1    0
[2,]    0    1

(PCs_B
  %>% data.frame()
  %>% summarize(across(everything(), var))
    %>%mutate(across(everything(), \(x) x/2))
)

         X1         X2
1 0.9591807 0.04081932

scaled_claudeC <- scale(claudeC)
svd_claudeC<- svd(scaled_claudeC)
U <- svd_claudeC$u
D <- diag(svd_claudeC$d)
V <- svd_claudeC$v

PCs_C <- U %*% D
head(PCs_C, 2)

          [,1]        [,2]
[1,] -2.270557  0.04703161
[2,] -1.933711 -0.06966280

cor(PCs_C) %>% round(2)

     [,1] [,2]
[1,]    1    0
[2,]    0    1

(PCs_C
  %>% data.frame()
  %>% summarize(across(everything(), var))
    %>%mutate(across(everything(), \(x) x/2))
)

         X1          X2
1 0.9949096 0.005090442

variability explained

A = PC1= 97.5% , PC2 =2.43%

B = PC1 = 95.92% , PC2 = 4.08%

C = PC1 = 99.50 %, PC2 = 0.50%