Lab 2: Track Example

The data for Practice 2 contains the national track records of 55 countries in 8 different track races.

The meter100, meter200, and meter400 times are recorded in seconds.
The meter800, meter1500, meter5000, meter10000, and Marathon times are recorded in minutes.

Question 1: PCA with the Covariance Matrix

The first set of questions will be performing PCA using the covariance matrix, \(\mathbf{S}\).

Part 1a: Calculate the sample covariance matrix

Calculate the sample covariance matrix

# Calculating the covariance matrix
(track_S <- var(track))

##            meter100 meter200 meter400 meter800 meter1500 meter5000 meter10000
## meter100     0.1235   0.2090   0.4307  0.01692   0.03837    0.1744      0.402
## meter200     0.2090   0.4156   0.7991  0.03312   0.07789    0.3591      0.812
## meter400     0.4307   0.7991   2.1229  0.08074   0.18974    0.9089      2.073
## meter800     0.0169   0.0331   0.0807  0.00406   0.00912    0.0441      0.100
## meter1500    0.0384   0.0779   0.1897  0.00912   0.02431    0.1159      0.263
## meter5000    0.1744   0.3591   0.9089  0.04406   0.11593    0.6419      1.412
## meter10000   0.4018   0.8117   2.0734  0.10005   0.26344    1.4115      3.268
## Marathon     1.6860   3.5462   9.4779  0.47390   1.24516    6.8910     15.732
##            Marathon
## meter100      1.686
## meter200      3.546
## meter400      9.478
## meter800      0.474
## meter1500     1.245
## meter5000     6.891
## meter10000   15.732
## Marathon     85.138

Does the covariance matrix indicate that using it for PCA may cause an issue?

The variance for Marathon (85.14) is far larger than the other races’ variances; the next largest is the variance for the 10,000 meter run (3.27). PCA on \(\mathbf{S}\) maximizes raw variance, so it is sensitive to each variable’s scale. This indicates that the first 2 PCs will likely represent Marathon and meter10000 and mostly ignore the other 6 races.
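To put a number on how lopsided the variances are, here is a minimal sketch that computes each race's share of the total variance. The values are the diagonal of the covariance matrix printed above, typed in by hand so the snippet runs without the data; with the data loaded, `diag(track_S) / sum(diag(track_S))` should give the same shares.

```r
# Variances (diagonal of track_S), copied from the covariance matrix output
variances <- c(meter100 = 0.1235, meter200 = 0.4156, meter400 = 2.1229,
               meter800 = 0.00406, meter1500 = 0.02431, meter5000 = 0.6419,
               meter10000 = 3.268, Marathon = 85.138)

# Share of the total variance (the trace of S) contributed by each race
variance_share <- variances / sum(variances)
round(variance_share, 3)
```

Marathon alone makes up roughly 93% of the total variance, which is why it is expected to dominate the first PC.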

1b: Determine the Number of Principal Components

Use the eigenvalues to determine the appropriate number of PCs to summarize the data. Justify your answer and include any graphs you reference!

# Using R to do principal components: prcomp()
# Since we want to use S, leave scale. at its default (FALSE)
track_PCS <- prcomp(track)

# Scree plot using fviz_screeplot() in the factoextra package
library(factoextra)
fviz_screeplot(track_PCS, 
               geom = "line",
               choice = "eigenvalue") + 
  labs(title = "Scree Plot of Track Race Time",
       caption = "Using the Covariance Matrix",
       x = "Number of PCs")

From the scree plot, only 1 PC is needed to summarize the data.

# Looking at the percentage each PC explains
summary(track_PCS)

## Importance of components:
##                         PC1    PC2     PC3     PC4    PC5     PC6     PC7
## Standard deviation     9.48 1.1885 0.50975 0.33079 0.1652 0.11284 0.04737
## Proportion of Variance 0.98 0.0154 0.00283 0.00119 0.0003 0.00014 0.00002
## Cumulative Proportion  0.98 0.9955 0.99834 0.99953 0.9998 0.99997 1.00000
##                           PC8
## Standard deviation     0.0211
## Proportion of Variance 0.0000
## Cumulative Proportion  1.0000

Using the cumulative proportion, the first PC alone accounts for 98% of the total variation in the track race times, which also indicates we only need 1 PC.

# What is the average eigenvalue?
# We can get the standard deviations of the PCs using pca_object$sdev
# Squaring those gives the variances of the PCs, which are also the eigenvalues!
mean(track_PCS$sdev^2)

## [1] 11.5

# Finding the eigenvalues:
round(track_PCS$sdev^2, 4)

## [1] 89.9136  1.4126  0.2598  0.1094  0.0273  0.0127  0.0022  0.0004

The average eigenvalue is 11.5, and the only PC with an eigenvalue larger than 11.5 is PC 1.

All 3 methods recommend 1 PC!
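The two numeric rules can be reproduced from the eigenvalues alone. The sketch below hard-codes the rounded eigenvalues printed above so it runs without the data; the 90% cutoff is an assumed threshold for illustration, not one given in the lab.

```r
# Eigenvalues of S, copied from round(track_PCS$sdev^2, 4)
eigenvalues <- c(89.9136, 1.4126, 0.2598, 0.1094, 0.0273, 0.0127, 0.0022, 0.0004)

prop_var <- eigenvalues / sum(eigenvalues)  # proportion of variance per PC
cum_prop <- cumsum(prop_var)                # cumulative proportion

# Average-eigenvalue rule: keep the PCs whose eigenvalue exceeds the mean
keep_avg <- which(eigenvalues > mean(eigenvalues))

# Cumulative-proportion rule with an assumed 90% threshold
keep_cum <- which(cum_prop >= 0.90)[1]

keep_avg
keep_cum
```

Both rules keep only PC1 here, matching the conclusion from the scree plot.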

1c: Interpret the first 2 Principal Components

Interpret the first 2 PCs in the context of the data.

# Using track_PCS$rotation to see how much each variable contributes to the first and second PC
track_PCS$rotation |> 
  data.frame() |> 
  dplyr::select(PC1, PC2)

##                PC1     PC2
## meter100   0.01987  0.2107
## meter200   0.04155  0.3589
## meter400   0.11063  0.8279
## meter800   0.00549  0.0232
## meter1500  0.01439  0.0447
## meter5000  0.07931  0.1300
## meter10000 0.18110  0.2989
## Marathon   0.97279 -0.1808

PC 1: The first PC is determined almost entirely by the country’s Marathon time (loading 0.97), as expected from part 1a.

PC 2: Using similar reasoning, PC2 is mostly determined by a country’s 400 meter dash time (loading 0.83). We might have expected PC2 to be mostly represented by the 10,000 meter run, since it has the second-largest variance, but it is strongly correlated with Marathon (r = 0.943). Because PC1 and PC2 must be uncorrelated, the 10,000 meter time contributes little “fresh” information that is not already represented in PC1.
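The correlation quoted for meter10000 and Marathon can be checked directly from the entries of \(\mathbf{S}\) printed in part 1a, using the usual relation between a covariance and a correlation (the subscripts are just shorthand for the two races):

\[
r_{10000,\,\text{Mar}} = \frac{s_{10000,\,\text{Mar}}}{\sqrt{s_{10000,\,10000}\; s_{\text{Mar},\,\text{Mar}}}} = \frac{15.732}{\sqrt{3.268 \times 85.138}} \approx \frac{15.732}{16.680} \approx 0.943
\]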

1e: Best and Worst Countries by PC1

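Since PC1 is essentially the Marathon time and every PC1 loading is positive, slower countries get larger PC1 values. prcomp() centers the data, so the countries with negative values are faster than average (the best) and the countries with positive values are slower than average (the worst).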
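To see why the sign of a score reads as "above or below average", it helps to rebuild the scores by hand. The sketch below uses a small made-up data set (5 hypothetical countries and 2 races, not the track data) and checks that the scores `prcomp()` stores in `x` are the centered data multiplied by the loadings.

```r
# Toy stand-in for the track data: 5 hypothetical countries, 2 races
toy <- data.frame(short_race = c(10.1, 10.4, 10.9, 11.2, 12.0),
                  long_race  = c(130, 128, 140, 150, 165),
                  row.names  = c("A", "B", "C", "D", "E"))

toy_pca <- prcomp(toy)  # covariance-matrix PCA (no scaling)

# Scores = (data - column means) %*% loadings
centered <- scale(toy, center = TRUE, scale = FALSE)
scores_by_hand <- centered %*% toy_pca$rotation

# Compare with the scores stored by prcomp()
all.equal(unname(scores_by_hand), unname(toy_pca$x))

# Each column of scores averages to 0, so a score's sign is relative to the mean
colMeans(toy_pca$x)
```

Keep in mind that the sign of a whole PC is arbitrary, so it is worth checking the signs of the loadings before reading negative scores as "best".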

# Best countries
library(dplyr)  # slice_min(), slice_max()
library(purrr)  # pluck()
track_PCS |> 
  pluck("x") |> 
  data.frame() |> 
  slice_min(PC1, n = 5) |> 
  dplyr::select(PC1)

##             PC1
## usa       -8.86
## australia -8.60
## japan     -8.11
## portugal  -8.07
## netherla  -7.83

# Worst Countries
track_PCS |> 
  pluck("x") |> 
  data.frame() |> 
  slice_max(PC1, n = 5) |> 
  dplyr::select(PC1)

##           PC1
## cookis   29.6
## wsamoa   26.1
## singapor 21.2
## domrep   17.6
## malaysia 17.4

Question 2: PCA with the Correlation Matrix

The second set of questions will be performing PCA using the correlation matrix, \(\mathbf{R}\).

Part 2a: Calculate the Sample Correlation Matrix

Create a correlation plot

# Correlation plot using ggcorr() in the GGally package
library(GGally)
ggcorr(data = track,
       low = "red",
       mid = "grey90",
       high = "blue", 
       label = TRUE,
       label_round = 2,
       hjust = 0.7,
       layout.exp = 1)

Describe the patterns in the correlations between the 8 races.

The closer two race distances are, the stronger the correlation between their times. The strongest correlation is between the 5,000 meter and 10,000 meter races (0.97), while the weakest is between the 100 meter dash and the Marathon (0.52).

2b: Determine the Number of Principal Components with the Correlation Matrix

Using 2 different methods, how many PCs should be used? Justify your answer and include any graphs you reference!

# Using R to do principal components: prcomp()
# Since we want to use the correlation matrix, add scale. = TRUE
track_PCR <- prcomp(track,
                    scale. = TRUE)

# Scree plot using fviz_screeplot() in factoextra package
fviz_screeplot(track_PCR, 
               geom = "line",
               choice = "eigenvalue",
               ggtheme = theme_bw()) + 
  labs(title = "Scree Plot of Track Race Time",
       caption = "Using the Correlation Matrix",
       x = "Number of PCs")

From the scree plot, we need either 1 or 2 PCs. We definitely do not need more than 2.

# Looking at the percentage each PC explains
summary(track_PCR)

## Importance of components:
##                          PC1   PC2    PC3    PC4     PC5    PC6    PC7     PC8
## Standard deviation     2.573 0.937 0.3992 0.3522 0.28263 0.2607 0.2155 0.15033
## Proportion of Variance 0.828 0.110 0.0199 0.0155 0.00999 0.0085 0.0058 0.00283
## Cumulative Proportion  0.828 0.937 0.9574 0.9729 0.98288 0.9914 0.9972 1.00000

Using the cumulative proportion, the first PC accounts for about 83% of the total variation in the track race times, and adding a second PC raises the variation retained by about 11 percentage points, to about 94%. This largely agrees with the scree plot: 1 PC is likely enough, but we could use 2 to be safe.

# What is the average eigenvalue?
# We can get the standard deviations of the PCs using pca_object$sdev
# Squaring those gives the variances of the PCs, which are also the eigenvalues!
mean(track_PCR$sdev^2)

## [1] 1

# Finding the eigenvalues:
round(track_PCR$sdev^2, 4)

## [1] 6.6221 0.8776 0.1593 0.1240 0.0799 0.0680 0.0464 0.0226

The average eigenvalue is 1 (which is always the case when using the correlation matrix!). Only the first PC has an eigenvalue larger than 1, although PC2 is close to 1 at 0.88.
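The reason the average is always 1 is that the eigenvalues of a matrix sum to its trace, and every diagonal entry of \(\mathbf{R}\) is 1. Writing \(p\) for the number of variables (\(p = 8\) here) and \(\lambda_i\) for the eigenvalues of \(\mathbf{R}\):

\[
\bar{\lambda} = \frac{1}{p}\sum_{i=1}^{p}\lambda_i = \frac{\operatorname{tr}(\mathbf{R})}{p} = \frac{p}{p} = 1
\]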

All 3 methods suggest that 1 PC is likely all we need, with 2 PCs as the safer choice.

2c: Interpret the first 2 Principal Components

Interpret the first 2 PCs in the context of the data.

track_PCR |> 
  pluck("rotation") |> 
  data.frame() |> 
  dplyr::select(PC1, PC2)

##              PC1     PC2
## meter100   0.318  0.5669
## meter200   0.337  0.4616
## meter400   0.356  0.2483
## meter800   0.369  0.0124
## meter1500  0.373 -0.1398
## meter5000  0.364 -0.3120
## meter10000 0.367 -0.3069
## Marathon   0.342 -0.4390

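PC 1: This is a roughly equally weighted average of the standardized race times for all 8 races (loadings between 0.32 and 0.37), indicating it measures the overall “racing ability” of the country across all 8 events. Since lower times are better, lower PC1 values indicate stronger countries.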

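PC 2: The second PC measures whether a country is better at long distance running or short distance running. The loadings are positive for the sprints (100, 200, and 400 meters), near zero for the 800 meter, and negative for the longer races. Because a lower time is better, countries with positive PC2 values are better at long distance running relative to short distances, and vice versa for countries with negative PC2 values.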
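A small hand calculation makes the direction of PC2 concrete. The loadings below are copied from the rotation output above; the standardized times are hypothetical, chosen only to represent a "sprinter" country that is 1 standard deviation faster than average in the three shortest races and 1 standard deviation slower in the three longest.

```r
# PC2 loadings copied from the track_PCR rotation output
pc2_loadings <- c(meter100 = 0.5669, meter200 = 0.4616, meter400 = 0.2483,
                  meter800 = 0.0124, meter1500 = -0.1398, meter5000 = -0.3120,
                  meter10000 = -0.3069, Marathon = -0.4390)

# Hypothetical standardized times (z-scores), NOT taken from the data:
# negative = faster than average, positive = slower than average
z_sprinter <- c(-1, -1, -1, 0, 0, 1, 1, 1)

# A PC score is the loading-weighted sum of the standardized times
pc2_sprinter <- sum(pc2_loadings * z_sprinter)
pc2_distance <- sum(pc2_loadings * -z_sprinter)  # the mirror-image country

pc2_sprinter
pc2_distance
```

The hypothetical sprinter country gets a PC2 score of about -2.33 and its mirror image about +2.33, consistent with negative PC2 values marking countries that are relatively stronger at short distances.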

2d: Best and Worst Countries by PC1

Best Countries

track_PCR |> 
  pluck("x") |> 
  data.frame() |> 
  slice_min(PC1, n = 5)

##         PC1    PC2     PC3     PC4     PC5    PC6     PC7      PC8
## usa   -3.43 -1.110 -0.0405  0.3251  0.1221  0.374 -0.1523  0.00521
## gbni  -3.02 -0.279  0.2350 -0.3386 -0.1020  0.013  0.0406 -0.15858
## italy -2.73 -0.990 -0.4913 -0.2592 -0.1016  0.379  0.2477  0.12594
## ussr  -2.63 -0.757 -0.2027  0.2697  0.1563  0.275  0.1077  0.04742
## gdr   -2.59 -0.311  0.0860  0.0612 -0.0177 -0.038  0.0320  0.01564

Worst Countries

track_PCR |> 
  pluck("x") |> 
  data.frame() |> 
  slice_max(PC1, n = 5)

##            PC1     PC2     PC3     PC4    PC5    PC6      PC7     PC8
## cookis   10.56  1.5088  0.0696 -0.8083 0.1301  0.807  0.00919  0.1361
## wsamoa    7.23 -1.9021 -0.6990  1.0493 0.0521  0.240 -0.02471 -0.1128
## mauritiu  4.26  0.6670  1.1966  0.1021 0.5631 -0.286 -0.28960 -0.0731
## png       3.91  0.0855 -0.0531  0.6054 0.0494  0.200  0.48516 -0.0135
## singapor  3.12 -1.7890 -0.0428  0.0542 0.4108 -0.606  0.08274  0.2391

Question 3: How do the results from Questions 1 and 2 compare?

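The first PC using the covariance matrix was essentially just the Marathon time and ignored the other 7 events, while the first PC using the correlation matrix weighted each of the 8 events about equally. The rankings change as a result: usa is the best country under both, but the rest of the top 5 differs, while cookis and wsamoa are the two worst under both.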
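The same contrast shows up in a tiny simulated example. This is only a sketch: the data are made up (two correlated variables on very different scales, with an arbitrary seed), and the variable names are placeholders rather than anything from the track data.

```r
set.seed(1)  # arbitrary seed so the toy example is reproducible

# Hypothetical data: two correlated variables on very different scales
n <- 200
small_scale <- rnorm(n)
large_scale <- 50 * (small_scale + rnorm(n))
toy <- data.frame(small_scale, large_scale)

pca_cov <- prcomp(toy)                 # covariance matrix
pca_cor <- prcomp(toy, scale. = TRUE)  # correlation matrix

# Magnitudes of the PC1 loadings under each approach
abs(pca_cov$rotation[, "PC1"])  # dominated by large_scale
abs(pca_cor$rotation[, "PC1"])  # both equal 1/sqrt(2) in magnitude
```

With the covariance matrix, PC1 is almost entirely the large-variance variable, just as Marathon dominated in Question 1. After standardizing, the two variables get equal weight, which mirrors the roughly equal loadings seen in Question 2.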