What is Principal Component Analysis?

Principal Component Analysis (PCA) is a method that uses the correlation between the original variables (\(\textbf{y}\)) to create a smaller set (\(k\)) or uncorrelated variables (\(\textbf{z}\)).

We use the first eigenvector \(\textbf{e}_1\) of the covariance matrix, \(\textbf{S}\), or correlation matrix, \(\textbf{R}\), to form the first principal component

\[z_1 = \textbf{e}_1' \textbf{y} = e_{11}(y_1) + e_{21}(y_2) + ... + e_{p1}(y_p)\]

The ith principal components will use the ith eigenvector:

\[z_i = \textbf{e}_i' \textbf{y} = e_{1i}(y_1) + e_{2i}(y_2) + ... + e_{pi}(y_p)\]

And the complete set of principal components is

\[\textbf{z} = \textbf{E}'\textbf{y}\]

Initial Examination

We should start by looking at the covariance matrix and correlation matrix before conducting PCA.

  1. Covariance matrix: Want to see if the covariances of the variables are very different
  1. Correlation matrix: PCA uses the correlations between variables to form the new variables. If there aren’t strong associations, then using PCA won’t be effective.

Covariance matrix:

We’ll create a bar chart to compare how the variances compare between our 13 variables:

bones_var <- 
  bones |> 
  # Stacking all the bones into a single column so we can use summarize() to calculate the variances
  pivot_longer(
    cols = everything(),
    names_to = "bone",
    values_to = "length"
  ) |> 
  summarize(
    .by = bone,
    variances = var(length)
  ) |> 
  
  # Assigning each bone where it is located: leg, arm, hip
  mutate(
    bone_location = case_when(
      str_detect(bone, "fem") | str_detect(bone, "tib") ~ "leg",  
      str_detect(bone, "hum") | str_detect(bone, "rad") ~ "arm",
      .default =  "hip")
    )  

bones_var
## # A tibble: 13 × 3
##    bone  variances bone_location
##    <chr>     <dbl> <chr>        
##  1 hum_l     546.  arm          
##  2 hum_r     536.  arm          
##  3 rad_l     343.  arm          
##  4 rad_r     345.  arm          
##  5 fem_l     993.  leg          
##  6 fem_r     991.  leg          
##  7 tib_l     784.  leg          
##  8 tib_r     813.  leg          
##  9 ibl       122.  hip          
## 10 ibr       119.  hip          
## 11 ace_l      17.0 hip          
## 12 ace_r      17.2 hip          
## 13 bib       336.  hip
# Creating the graph
ggplot(
  data = bones_var,
  mapping = aes(
    x = fct_reorder(bone, variances),
    y = variances
  )
) + 
  
  # Using geom_col() to make the bars (columns)
  geom_col(
    mapping = aes(fill = bone_location),
    show.legend = F
  ) + 
  
  # Adding the variances to the top of each bar
  geom_text(
    mapping = aes(label = round(variances)),
    vjust = -0.25
  ) +
  
  # Changing the axes labels
  labs(
    x = "Bone",
    y = "Variance"
  ) +
  
  # Making the bars sit on the x-axis
  scale_y_continuous(
    expand = c(0, 0, 0.05, 0)
  )

Since the variances are as low as 17 and the highest is almost 1000, we should use the correlation matrix!

Correlation plot:

Regardless of if we are using the covariance matrix or correlation matrix to calculate the PCs, we need to check to see if the variables are strongly correlated:

ggcorrplot(
  corr = cor(bones),                   # Correlation matrix to plot
  colors = c("red", "white", "blue"),  # color for positive correlations
  lab = T,                             # Shows the correlation in the box 
  type = "lower",
  hc.order = T,
  ggtheme = theme_void
)   +
  
  theme(legend.position = "none")

Unsurprisingly, most of the variables are strongly correlated, making PCA appropriate for these data!

PCA

Using the covariance matrix

While we should use the correlation matrix, we’ll start by using the covariance matrix.

The prcomp() function will perform PCA for us:

pc_s <- prcomp(x = bones)

# Looking at the linear transformation of the variables for each PC
data.frame(pc_s$rotation)
##         PC1    PC2    PC3      PC4    PC5    PC6    PC7     PC8    PC9    PC10
## hum_l 0.307 -0.108  0.353  0.37292 -0.432  0.075 -0.181  0.6291  0.039  0.0499
## hum_r 0.307 -0.140  0.319  0.31276 -0.352  0.049  0.162 -0.7217 -0.048 -0.0591
## rad_l 0.240  0.142 -0.060  0.44266  0.478  0.017  0.043  0.0948  0.014  0.0385
## rad_r 0.241  0.142 -0.063  0.42743  0.466  0.035  0.059 -0.0654  0.093 -0.0368
## fem_l 0.432 -0.024  0.254 -0.42808  0.208  0.106 -0.066  0.0635  0.076 -0.7024
## fem_r 0.431 -0.042  0.272 -0.43568  0.208  0.087  0.111 -0.0101  0.036  0.6934
## tib_l 0.374  0.274 -0.411 -0.05972 -0.166 -0.045 -0.716 -0.1797 -0.176  0.0844
## tib_r 0.377  0.284 -0.479 -0.08180 -0.334 -0.153  0.614  0.1487  0.064 -0.0590
## ibl   0.102 -0.358 -0.064  0.00636  0.062 -0.589 -0.057 -0.0034  0.132 -0.0452
## ibr   0.101 -0.356 -0.048  0.01398  0.063 -0.564 -0.070  0.0022  0.139  0.0472
## ace_l 0.041 -0.074  0.042  0.00666  0.076 -0.109  0.108  0.0749 -0.671 -0.0377
## ace_r 0.041 -0.076  0.039  0.00091  0.068 -0.115  0.097  0.0725 -0.677 -0.0277
## bib   0.133 -0.710 -0.468  0.02927  0.029  0.506  0.023  0.0261 -0.016  0.0018
##           PC11     PC12     PC13
## hum_l -0.07382  0.01548  0.00256
## hum_r  0.08482 -0.00422 -0.00265
## rad_l  0.69349 -0.00576 -0.02437
## rad_r -0.70643  0.01093  0.01452
## fem_l  0.03750 -0.04297 -0.00676
## fem_r -0.02820  0.04713  0.00497
## tib_l  0.00143  0.00452  0.00875
## tib_r -0.00059 -0.01764 -0.00194
## ibl    0.01000  0.69498  0.00871
## ibr   -0.01234 -0.71556 -0.00295
## ace_l -0.02548 -0.01311  0.70969
## ace_r -0.06524 -0.00028 -0.70376
## bib    0.00382  0.00234 -0.00063

Let’s compare the weights of the first eigenvector to the variances of the original variables:

data.frame(pc_s$rotation) |> 
  # Making the rownames into a column
  rownames_to_column(var = "bone") |> 
  
  # Adding the vector of variances from the bones_var data set
  left_join(
    y = bones_var,
    by = "bone"
  ) |> 
  # Picking the relevant columns
  select(bone, variances, PC1) |>
  # Arranging them from highest to lowest variance
  arrange(desc(variances))
##     bone variances   PC1
## 1  fem_l       993 0.432
## 2  fem_r       991 0.431
## 3  tib_r       813 0.377
## 4  tib_l       784 0.374
## 5  hum_l       546 0.307
## 6  hum_r       536 0.307
## 7  rad_r       345 0.241
## 8  rad_l       343 0.240
## 9    bib       336 0.133
## 10   ibl       122 0.102
## 11   ibr       119 0.101
## 12 ace_r        17 0.041
## 13 ace_l        17 0.041

Notice that the order of the weight sizes of the linear transformation for PC1 is in the same order of the variances. That’s because if some variables have much larger or smaller variances than other variables, the first PC will primarily be determined by the size of the individual variances and not the correlation structure between the variables.

Note: Since the unit of measurement affects the variance of a variable (var(\(ay\)) \(= a^2 \sigma^2\)), we can affect the PCs by just changing the measurement units.

For example, currently all the lengths are in millimeters. Let’s adjust the femur and tibia to be in meters instead (mm/1000 = meter) and redo the PCs

bones |> 
  # Changing the leg bones into meters from millimeters
  mutate(
    fem_l = fem_l/1000,
    fem_r = fem_r/1000,
    tib_l = tib_l/1000,
    tib_r = tib_r/1000
  ) |> 
  
  # Performing PCA
  prcomp() |> 
  
  # Picking the rotation matrix (Eigenvector matrix)
  pluck('rotation') |> 
  
  # Converting it into a data frame so we can use the dplyr functions
  data.frame() |> 
  
  # Picking just the first column
  select(PC1) |> 
  
  # Arranging from smallest to largest
  arrange(PC1)
##           PC1
## tib_l 0.00057
## tib_r 0.00057
## fem_r 0.00067
## fem_l 0.00067
## ace_l 0.07167
## ace_r 0.07184
## ibr   0.18624
## ibl   0.18762
## bib   0.25521
## rad_l 0.39350
## rad_r 0.39539
## hum_r 0.52056
## hum_l 0.52203

Previously, the two leg bones were the most important for PC1, but after we change it, they’re now the least important because the variance went from about \(1000\) to \(1000\)/\(1000^2 = 1\)/\(1000\)

Unless there is an explicit reason why, PCA is done using the correlation matrix instead of the covariance matrix!

PCA with the correlation matrix

The benefit of using the correlation matrix in place of the covariance matrix is that it standardizes the variances to 1, so none of the variables are seen as being more important and the PCs are determined by the correlation structure alone!

We can use prcomp() and the correlation matrix by just including scale. = T

pc_r <- 
  prcomp(
    x = bones,
    scale. = T
  )

# Looking at the eigenvector
data.frame(pc_r$rotation) |> 
  round(digits = 3)
##        PC1    PC2    PC3   PC4    PC5    PC6    PC7    PC8    PC9   PC10   PC11
## hum_l 0.30 -0.104 -0.013  0.36 -0.063  0.448  0.345  0.026  0.646 -0.098  0.030
## hum_r 0.30 -0.083  0.005  0.34 -0.078  0.403  0.241 -0.029 -0.720  0.137  0.000
## rad_l 0.29 -0.250  0.003 -0.26  0.471  0.178 -0.206 -0.017  0.115 -0.073  0.003
## rad_r 0.29 -0.253  0.018 -0.25  0.448  0.174 -0.217  0.061 -0.121 -0.014 -0.008
## fem_l 0.30 -0.136 -0.007  0.18 -0.335 -0.127 -0.467 -0.001  0.067 -0.011  0.070
## fem_r 0.30 -0.127 -0.013  0.19 -0.340 -0.122 -0.476  0.006 -0.007 -0.107 -0.069
## tib_l 0.29 -0.269  0.173 -0.16 -0.095 -0.352  0.268  0.068  0.115  0.738  0.120
## tib_r 0.28 -0.263  0.169 -0.18 -0.122 -0.406  0.444 -0.107 -0.113 -0.602 -0.142
## ibl   0.25  0.423  0.184  0.26  0.308 -0.269  0.005 -0.002 -0.030 -0.131  0.686
## ibr   0.25  0.423  0.178  0.28  0.306 -0.223 -0.025 -0.014  0.052  0.130 -0.695
## ace_l 0.26  0.254 -0.558 -0.22 -0.064 -0.006  0.058 -0.697  0.032  0.096  0.023
## ace_r 0.26  0.262 -0.545 -0.21 -0.090 -0.050  0.113  0.702 -0.036 -0.056 -0.022
## bib   0.20  0.434  0.516 -0.51 -0.340  0.361 -0.050  0.003  0.016 -0.010  0.003
##         PC12   PC13
## hum_l  0.110  0.039
## hum_r -0.124 -0.043
## rad_l -0.685 -0.062
## rad_r  0.700  0.063
## fem_l  0.065 -0.705
## fem_r -0.071  0.691
## tib_l -0.023  0.072
## tib_r  0.015 -0.052
## ibl    0.002  0.056
## ibr    0.005 -0.060
## ace_l  0.046  0.016
## ace_r -0.035 -0.017
## bib   -0.004  0.001

Let’s compare PC1 weights using the correlation and covariance matrix, along with the variance and average correlation for each variable:

# Start by calculating the average correlation:
bones |> 
  cor() |> 
  abs() |> 
  colMeans() |> 
  t() |> 
  data.frame() |> 
  pivot_longer(
    cols = everything(),
    names_to = "bone",
    values_to = "cor"
  ) ->
  avg_cors


bones_var |> 
  # Removing bone location (not needed)
  select(-bone_location) |> 
  # Adding the average correlations
  left_join(
    y = avg_cors,
    by = "bone"
  ) |> 
  # adding PC1 from the covariance matrix to the data set
  left_join(
    y = data.frame(pc_s$rotation) |> rownames_to_column(var = "bone") |> select(bone, PC1),
    by = "bone"
  ) |> 
  # adding PC1 from the correlation matrix to the data set
  left_join(
    y = data.frame(pc_r$rotation) |> rownames_to_column(var = "bone") |> select(bone, PC1),
    by = "bone",
    suffix = c("_S", "_R")
  ) ->
  pc_example


print(pc_example, n = 13)
## # A tibble: 13 × 5
##    bone  variances   cor  PC1_S PC1_R
##    <chr>     <dbl> <dbl>  <dbl> <dbl>
##  1 hum_l     546.  0.796 0.307  0.300
##  2 hum_r     536.  0.807 0.307  0.303
##  3 rad_l     343.  0.761 0.240  0.288
##  4 rad_r     345.  0.762 0.241  0.289
##  5 fem_l     993.  0.809 0.432  0.305
##  6 fem_r     991.  0.810 0.431  0.305
##  7 tib_l     784.  0.758 0.374  0.287
##  8 tib_r     813.  0.751 0.377  0.284
##  9 ibl       122.  0.686 0.102  0.250
## 10 ibr       119.  0.688 0.101  0.251
## 11 ace_l      17.0 0.709 0.0409 0.262
## 12 ace_r      17.2 0.709 0.0411 0.262
## 13 bib       336.  0.555 0.133  0.199
# Creating a correlation plot
pc_example |> 
  # picking just the PCs and correlations and variances
  select(variances, PC1_S, cor, PC1_R) |> 
  # Creating a correlation plot to see how closely the PC1s are to the variances & correlations
  ggcorr(
    high = "blue",     
    low = "red",       
    mid = "grey",  
    label = T,     
    label_round = 2
  )

The correlation matrix shows that:

  1. when using \(\textbf{S}\), the variances mostly determine the first principal component

  2. when using \(\textbf{R}\), the average correlation mostly determine the first principal component

Often when using PCA, we want the correlation structure to determine the how the variables are combined, not the variances (since the variances can be changed while the correlation can not).