The data for Practice 2 contain the national track records of 55 countries in 8 different track races.
The first set of questions performs PCA using the covariance matrix, \(\mathbf{S}\).
# Calculating the covariance matrix
(track_S <- var(track))
## meter100 meter200 meter400 meter800 meter1500 meter5000 meter10000
## meter100 0.1235 0.2090 0.4307 0.01692 0.03837 0.1744 0.402
## meter200 0.2090 0.4156 0.7991 0.03312 0.07789 0.3591 0.812
## meter400 0.4307 0.7991 2.1229 0.08074 0.18974 0.9089 2.073
## meter800 0.0169 0.0331 0.0807 0.00406 0.00912 0.0441 0.100
## meter1500 0.0384 0.0779 0.1897 0.00912 0.02431 0.1159 0.263
## meter5000 0.1744 0.3591 0.9089 0.04406 0.11593 0.6419 1.412
## meter10000 0.4018 0.8117 2.0734 0.10005 0.26344 1.4115 3.268
## Marathon 1.6860 3.5462 9.4779 0.47390 1.24516 6.8910 15.732
## Marathon
## meter100 1.686
## meter200 3.546
## meter400 9.478
## meter800 0.474
## meter1500 1.245
## meter5000 6.891
## meter10000 15.732
## Marathon 85.138
The variance for Marathon is far larger than the variances of the other races, followed by the variance for the 10,000 meter run. Because PCA on the covariance matrix favors high-variance variables, the first 2 PCs will likely represent Marathon and meter10000 and mostly ignore the other 6 races.
# Using R to do principal components: prcomp()
# Since we want to use S, don't set scale. = TRUE
track_PCS <- prcomp(track)
# Scree plot using fviz_screeplot() in factoextra package
fviz_screeplot(track_PCS,
geom = "line",
choice = "eigenvalue") +
labs(title = "Scree Plot of Track Race Time",
caption = "Using the Covariance Matrix",
x = "Number of PCs")
From the scree plot, only 1 PC is needed to summarize the data.
# Looking at the percentage each PC explains
summary(track_PCS)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 9.48 1.1885 0.50975 0.33079 0.1652 0.11284 0.04737
## Proportion of Variance 0.98 0.0154 0.00283 0.00119 0.0003 0.00014 0.00002
## Cumulative Proportion 0.98 0.9955 0.99834 0.99953 0.9998 0.99997 1.00000
## PC8
## Standard deviation 0.0211
## Proportion of Variance 0.0000
## Cumulative Proportion 1.0000
Using the cumulative proportion, the first PC accounts for 98% of the total variation in the track race times, which also indicates we only need 1 PC.
# What is the average eigenvalue?
# We can get the standard deviations of the PCs using pca_object$sdev
# From there, if we square those, we get the variances of the PCs, which are also the eigenvalues!
mean(track_PCS$sdev^2)
## [1] 11.5
# Finding the eigenvalues:
round(track_PCS$sdev^2, 4)
## [1] 89.9136 1.4126 0.2598 0.1094 0.0273 0.0127 0.0022 0.0004
The average eigenvalue is 11.5, and the only PC with an eigenvalue larger than 11.5 is PC 1.
All 3 methods recommend 1 PC!
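As a quick check, the cumulative-proportion and average-eigenvalue criteria can be recomputed by hand from the eigenvalues printed above. This is only a sketch that retypes the rounded output, so the variable names (`eigenvalues`, `n_keep`, and so on) are made up here and the results match the `summary()` output only up to rounding.

```r
# Eigenvalues copied from the rounded output above (covariance-matrix PCA)
eigenvalues <- c(89.9136, 1.4126, 0.2598, 0.1094, 0.0273, 0.0127, 0.0022, 0.0004)

# Proportion and cumulative proportion of the total variance per PC
prop_var <- eigenvalues / sum(eigenvalues)
cum_prop <- cumsum(prop_var)

# Average-eigenvalue rule: keep the PCs whose eigenvalue exceeds the mean
avg_eigenvalue <- mean(eigenvalues)
n_keep <- sum(eigenvalues > avg_eigenvalue)

round(prop_var[1:2], 3)    # share of variance for PC1 and PC2
round(avg_eigenvalue, 1)   # average eigenvalue
n_keep                     # number of PCs above the average
```

With the real object loaded, the same quantities come from replacing the typed vector with `track_PCS$sdev^2`.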
# Using track_PCS$rotation to see how much each variable contributes to the first and second PC
track_PCS$rotation |>
data.frame() |>
dplyr::select(PC1, PC2)
## PC1 PC2
## meter100 0.01987 0.2107
## meter200 0.04155 0.3589
## meter400 0.11063 0.8279
## meter800 0.00549 0.0232
## meter1500 0.01439 0.0447
## meter5000 0.07931 0.1300
## meter10000 0.18110 0.2989
## Marathon 0.97279 -0.1808
PC 1: The first PC is determined almost entirely by the country’s Marathon time (as expected in part 1a).
PC 2: By the same reasoning, PC2 is mostly determined by a country’s 400 meter time. We might have expected PC2 to be dominated by the 10,000 meter run, but that event is strongly correlated with Marathon (r = 0.943), so it contributes little “fresh” information beyond what PC1 already captures, and PC2 must be uncorrelated with PC1.
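The correlation quoted for the 10,000 meter and Marathon times can be recovered from the covariance matrix alone. A minimal sketch, retyping the three rounded entries from the printed output (the variable names are just for illustration):

```r
# Entries copied from the printed covariance matrix (rounded values)
var_10000          <- 3.268    # variance of meter10000
var_marathon       <- 85.138   # variance of Marathon
cov_10000_marathon <- 15.732   # covariance of meter10000 and Marathon

# Correlation = covariance / (sd_x * sd_y)
r <- cov_10000_marathon / sqrt(var_10000 * var_marathon)
round(r, 3)
```

With the full matrix available, `cov2cor(track_S)` should convert every entry at once.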
Since PC1 is essentially the (centered) marathon time, and lower times are better, the countries with the most negative PC1 scores are the best and the countries with the most positive scores are the worst.
# Best countries
track_PCS |>
pluck("x") |>
data.frame() |>
slice_min(PC1, n = 5) |>
dplyr::select(PC1)
## PC1
## usa -8.86
## australia -8.60
## japan -8.11
## portugal -8.07
## netherla -7.83
# Worst Countries
track_PCS |>
pluck("x") |>
data.frame() |>
slice_max(PC1, n = 5) |>
dplyr::select(PC1)
## PC1
## cookis 29.6
## wsamoa 26.1
## singapor 21.2
## domrep 17.6
## malaysia 17.4
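The scores are negative for some countries and positive for others because `prcomp()` centers each variable before projecting. The sketch below illustrates this with the built-in `USArrests` data as a stand-in (an assumption, since the track data are not bundled here): the scores are the column-centered data multiplied by the loadings, so each score is read relative to the average observation.

```r
# Stand-in data: built-in USArrests (not the track data)
pca_S <- prcomp(USArrests)   # covariance-matrix PCA, centered by default

# Scores = column-centered data multiplied by the loadings
centered      <- scale(USArrests, center = TRUE, scale = FALSE)
manual_scores <- centered %*% pca_S$rotation

# Should agree with the scores stored by prcomp()
all.equal(manual_scores, pca_S$x, check.attributes = FALSE)

# Each score column averages to (numerically) zero,
# so the sign of a score is relative to the average observation
round(colMeans(pca_S$x), 10)
```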
The second set of questions will be performing PCA using the correlation matrix, \(\mathbf{R}\)
ggcorr(data = track,
low = "red",
mid = "grey90",
high = "blue",
label = T,
label_round = 2,
hjust = 0.7,
layout.exp = 1)
The closer the race distances are, the stronger the correlation. For example, the 5,000 meter and 10,000 meter runs have a correlation of 0.97, while the weakest correlation, between the 100 meter dash and Marathon, is only 0.52.
# Using R to do principal components: prcomp()
# Since we want to use the correlation matrix, add scale. = T
track_PCR <- prcomp(track,
scale. = T)
# Scree plot using fviz_screeplot() in factoextra package
fviz_screeplot(track_PCR,
geom = "line",
choice = "eigenvalue",
ggtheme = theme_bw()) +
labs(title = "Scree Plot of Track Race Time",
caption = "Using the Correlation Matrix",
x = "Number of PCs")
From the scree plot, we need either 1 or 2 PCs, and definitely no more than 2.
# Looking at the percentage each PC explains
summary(track_PCR)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## Standard deviation 2.573 0.937 0.3992 0.3522 0.28263 0.2607 0.2155 0.15033
## Proportion of Variance 0.828 0.110 0.0199 0.0155 0.00999 0.0085 0.0058 0.00283
## Cumulative Proportion 0.828 0.937 0.9574 0.9729 0.98288 0.9914 0.9972 1.00000
Using the cumulative proportion, the first PC accounts for about 83% of the total variation in the track race times, and using 2 PCs increases the variation retained by about 11 percentage points (to about 94%). This largely agrees with the scree plot: 1 PC is likely enough, and 2 PCs is the safe choice.
# What is the average eigenvalue?
# We can get the standard deviations of the PCs using pca_object$sdev
# From there, if we square those, we get the variances of the PCs, which is also the eigenvalues!
mean(track_PCR$sdev^2)
## [1] 1
# Finding the eigenvalues:
round(track_PCR$sdev^2, 4)
## [1] 6.6221 0.8776 0.1593 0.1240 0.0799 0.0680 0.0464 0.0226
The average eigenvalue is 1 (which is always the case when using the correlation matrix!), and only the first PC has an eigenvalue larger than 1, while PC2 is close to 1 at 0.88.
All 3 methods suggest 1 PC is likely all we need, but 2 PCs if you want to be safe.
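The average eigenvalue is always 1 for the correlation matrix because the eigenvalues sum to the trace of \(\mathbf{R}\), which is the number of variables (every diagonal entry is 1). A small sketch of that fact, using the built-in `USArrests` data as a stand-in (an assumption; any numeric data set should behave the same way):

```r
# Stand-in data: built-in USArrests (not the track data)
R     <- cor(USArrests)
eig_R <- eigen(R)$values

# The eigenvalues sum to the trace of R, i.e. the number of variables
sum(eig_R)
ncol(USArrests)

# so their mean is 1
mean(eig_R)

# prcomp() with scale. = TRUE gives the same eigenvalues
pca_R <- prcomp(USArrests, scale. = TRUE)
all.equal(pca_R$sdev^2, eig_R)
```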
track_PCR |>
pluck("rotation") |>
data.frame() |>
dplyr::select(PC1, PC2)
## PC1 PC2
## meter100 0.318 0.5669
## meter200 0.337 0.4616
## meter400 0.356 0.2483
## meter800 0.369 0.0124
## meter1500 0.373 -0.1398
## meter5000 0.364 -0.3120
## meter10000 0.367 -0.3069
## Marathon 0.342 -0.4390
PC 1: This seems to be a roughly equally weighted average of the standardized race times for all 8 races, indicating it measures the overall “racing ability” of the country across all 8 events. Since the variables are times, lower PC1 scores mean faster records overall.
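PC 2: The second PC measures whether a country is better at long distance running or short distance running: the sprint events load positively and the long distance events load negatively. Because the variables are race times (lower is better), countries with positive PC2 values are better at long distance running relative to short distances, and vice versa for countries with negative PC2 values.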
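The direction of PC2 is easy to get backwards because the variables are times, so here is a small illustration. The loadings are retyped from the output above, but the two country profiles are hypothetical standardized times invented for this sketch, not rows of the data.

```r
# PC2 loadings copied from the correlation-matrix output above
pc2 <- c(meter100 = 0.5669, meter200 = 0.4616, meter400 = 0.2483,
         meter800 = 0.0124, meter1500 = -0.1398, meter5000 = -0.3120,
         meter10000 = -0.3069, Marathon = -0.4390)

# Hypothetical standardized times (z-scores); negative = faster than average
sprinter_nation <- c(-1, -1, -1, 0, 0, 1, 1, 1)   # fast sprints, slow distance races
distance_nation <- -sprinter_nation                # the reverse profile

# PC2 score = loading-weighted sum of the standardized times
score_sprinter <- sum(pc2 * sprinter_nation)
score_distance <- sum(pc2 * distance_nation)
c(sprinter = score_sprinter, distance = score_distance)
```

The sprint-oriented profile gets a negative PC2 score and the distance-oriented profile a positive one, which matches the reading above. Keep in mind that the sign of a PC is arbitrary, so a different run or package could flip both.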
Best Countries
track_PCR |>
pluck("x") |>
data.frame() |>
slice_min(PC1, n = 5)
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## usa -3.43 -1.110 -0.0405 0.3251 0.1221 0.374 -0.1523 0.00521
## gbni -3.02 -0.279 0.2350 -0.3386 -0.1020 0.013 0.0406 -0.15858
## italy -2.73 -0.990 -0.4913 -0.2592 -0.1016 0.379 0.2477 0.12594
## ussr -2.63 -0.757 -0.2027 0.2697 0.1563 0.275 0.1077 0.04742
## gdr -2.59 -0.311 0.0860 0.0612 -0.0177 -0.038 0.0320 0.01564
Worst Countries
track_PCR |>
pluck("x") |>
data.frame() |>
slice_max(PC1, n = 5)
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## cookis 10.56 1.5088 0.0696 -0.8083 0.1301 0.807 0.00919 0.1361
## wsamoa 7.23 -1.9021 -0.6990 1.0493 0.0521 0.240 -0.02471 -0.1128
## mauritiu 4.26 0.6670 1.1966 0.1021 0.5631 -0.286 -0.28960 -0.0731
## png 3.91 0.0855 -0.0531 0.6054 0.0494 0.200 0.48516 -0.0135
## singapor 3.12 -1.7890 -0.0428 0.0542 0.4108 -0.606 0.08274 0.2391
The first PC using the covariance matrix was essentially just the marathon and ignored the other 7 events, while the first PC using the correlation matrix weighted each of the 8 variables about equally.
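The same contrast shows up in other data sets where one variable has a much larger variance than the rest. A hedged sketch using the built-in `USArrests` data as a stand-in (an assumption, chosen only because `Assault` plays the role that Marathon plays here):

```r
# Stand-in data: built-in USArrests (not the track data)
# Assault has a far larger variance than the other three variables
round(apply(USArrests, 2, var), 1)

pc1_S <- prcomp(USArrests)$rotation[, "PC1"]                 # covariance matrix
pc1_R <- prcomp(USArrests, scale. = TRUE)$rotation[, "PC1"]  # correlation matrix

# Absolute loadings side by side (the sign of a PC is arbitrary)
round(abs(cbind(covariance = pc1_S, correlation = pc1_R)), 3)
```

In the covariance column the high-variance variable should take nearly all of the weight, while the correlation column spreads the weight much more evenly.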