Principal Component Analysis (PCA) is a method that uses the correlation between the original variables (\(\textbf{y}\)) to create a smaller set (\(k\)) or uncorrelated variables (\(\textbf{z}\)).
We use the first eigenvector \(\textbf{e}_1\) of the covariance matrix, \(\textbf{S}\), or correlation matrix, \(\textbf{R}\), to form the first principal component
\[z_1 = \textbf{e}_1' \textbf{y} = e_{11}(y_1) + e_{21}(y_2) + ... + e_{p1}(y_p)\]
The ith principal components will use the ith eigenvector:
\[z_i = \textbf{e}_i' \textbf{y} = e_{1i}(y_1) + e_{2i}(y_2) + ... + e_{pi}(y_p)\]
And the complete set of principal components is
\[\textbf{z} = \textbf{E}'\textbf{y}\]
We should start by looking at the covariance matrix and correlation matrix before conducting PCA.
We’ll create a bar chart to compare how the variances compare between our 13 variables:
bones_var <-
bones |>
# Stacking all the bones into a single column so we can use summarize() to calculate the variances
pivot_longer(
cols = everything(),
names_to = "bone",
values_to = "length"
) |>
summarize(
.by = bone,
variances = var(length)
) |>
# Assigning each bone where it is located: leg, arm, hip
mutate(
bone_location = case_when(
str_detect(bone, "fem") | str_detect(bone, "tib") ~ "leg",
str_detect(bone, "hum") | str_detect(bone, "rad") ~ "arm",
.default = "hip")
)
bones_var
## # A tibble: 13 × 3
## bone variances bone_location
## <chr> <dbl> <chr>
## 1 hum_l 546. arm
## 2 hum_r 536. arm
## 3 rad_l 343. arm
## 4 rad_r 345. arm
## 5 fem_l 993. leg
## 6 fem_r 991. leg
## 7 tib_l 784. leg
## 8 tib_r 813. leg
## 9 ibl 122. hip
## 10 ibr 119. hip
## 11 ace_l 17.0 hip
## 12 ace_r 17.2 hip
## 13 bib 336. hip
# Creating the graph
ggplot(
data = bones_var,
mapping = aes(
x = fct_reorder(bone, variances),
y = variances
)
) +
# Using geom_col() to make the bars (columns)
geom_col(
mapping = aes(fill = bone_location),
show.legend = F
) +
# Adding the variances to the top of each bar
geom_text(
mapping = aes(label = round(variances)),
vjust = -0.25
) +
# Changing the axes labels
labs(
x = "Bone",
y = "Variance"
) +
# Making the bars sit on the x-axis
scale_y_continuous(
expand = c(0, 0, 0.05, 0)
)
Since the variances are as low as 17 and the highest is almost 1000, we should use the correlation matrix!
Regardless of if we are using the covariance matrix or correlation matrix to calculate the PCs, we need to check to see if the variables are strongly correlated:
ggcorrplot(
corr = cor(bones), # Correlation matrix to plot
colors = c("red", "white", "blue"), # color for positive correlations
lab = T, # Shows the correlation in the box
type = "lower",
hc.order = T,
ggtheme = theme_void
) +
theme(legend.position = "none")
Unsurprisingly, most of the variables are strongly correlated, making PCA appropriate for these data!
While we should use the correlation matrix, we’ll start by using the covariance matrix.
The prcomp()
function will perform PCA for us:
pc_s <- prcomp(x = bones)
# Looking at the linear transformation of the variables for each PC
data.frame(pc_s$rotation)
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10
## hum_l 0.307 -0.108 0.353 0.37292 -0.432 0.075 -0.181 0.6291 0.039 0.0499
## hum_r 0.307 -0.140 0.319 0.31276 -0.352 0.049 0.162 -0.7217 -0.048 -0.0591
## rad_l 0.240 0.142 -0.060 0.44266 0.478 0.017 0.043 0.0948 0.014 0.0385
## rad_r 0.241 0.142 -0.063 0.42743 0.466 0.035 0.059 -0.0654 0.093 -0.0368
## fem_l 0.432 -0.024 0.254 -0.42808 0.208 0.106 -0.066 0.0635 0.076 -0.7024
## fem_r 0.431 -0.042 0.272 -0.43568 0.208 0.087 0.111 -0.0101 0.036 0.6934
## tib_l 0.374 0.274 -0.411 -0.05972 -0.166 -0.045 -0.716 -0.1797 -0.176 0.0844
## tib_r 0.377 0.284 -0.479 -0.08180 -0.334 -0.153 0.614 0.1487 0.064 -0.0590
## ibl 0.102 -0.358 -0.064 0.00636 0.062 -0.589 -0.057 -0.0034 0.132 -0.0452
## ibr 0.101 -0.356 -0.048 0.01398 0.063 -0.564 -0.070 0.0022 0.139 0.0472
## ace_l 0.041 -0.074 0.042 0.00666 0.076 -0.109 0.108 0.0749 -0.671 -0.0377
## ace_r 0.041 -0.076 0.039 0.00091 0.068 -0.115 0.097 0.0725 -0.677 -0.0277
## bib 0.133 -0.710 -0.468 0.02927 0.029 0.506 0.023 0.0261 -0.016 0.0018
## PC11 PC12 PC13
## hum_l -0.07382 0.01548 0.00256
## hum_r 0.08482 -0.00422 -0.00265
## rad_l 0.69349 -0.00576 -0.02437
## rad_r -0.70643 0.01093 0.01452
## fem_l 0.03750 -0.04297 -0.00676
## fem_r -0.02820 0.04713 0.00497
## tib_l 0.00143 0.00452 0.00875
## tib_r -0.00059 -0.01764 -0.00194
## ibl 0.01000 0.69498 0.00871
## ibr -0.01234 -0.71556 -0.00295
## ace_l -0.02548 -0.01311 0.70969
## ace_r -0.06524 -0.00028 -0.70376
## bib 0.00382 0.00234 -0.00063
Let’s compare the weights of the first eigenvector to the variances of the original variables:
data.frame(pc_s$rotation) |>
# Making the rownames into a column
rownames_to_column(var = "bone") |>
# Adding the vector of variances from the bones_var data set
left_join(
y = bones_var,
by = "bone"
) |>
# Picking the relevant columns
select(bone, variances, PC1) |>
# Arranging them from highest to lowest variance
arrange(desc(variances))
## bone variances PC1
## 1 fem_l 993 0.432
## 2 fem_r 991 0.431
## 3 tib_r 813 0.377
## 4 tib_l 784 0.374
## 5 hum_l 546 0.307
## 6 hum_r 536 0.307
## 7 rad_r 345 0.241
## 8 rad_l 343 0.240
## 9 bib 336 0.133
## 10 ibl 122 0.102
## 11 ibr 119 0.101
## 12 ace_r 17 0.041
## 13 ace_l 17 0.041
Notice that the order of the weight sizes of the linear transformation for PC1 is in the same order of the variances. That’s because if some variables have much larger or smaller variances than other variables, the first PC will primarily be determined by the size of the individual variances and not the correlation structure between the variables.
Note: Since the unit of measurement affects the variance of a variable (var(\(ay\)) \(= a^2 \sigma^2\)), we can affect the PCs by just changing the measurement units.
For example, currently all the lengths are in millimeters. Let’s adjust the femur and tibia to be in meters instead (mm/1000 = meter) and redo the PCs
bones |>
# Changing the leg bones into meters from millimeters
mutate(
fem_l = fem_l/1000,
fem_r = fem_r/1000,
tib_l = tib_l/1000,
tib_r = tib_r/1000
) |>
# Performing PCA
prcomp() |>
# Picking the rotation matrix (Eigenvector matrix)
pluck('rotation') |>
# Converting it into a data frame so we can use the dplyr functions
data.frame() |>
# Picking just the first column
select(PC1) |>
# Arranging from smallest to largest
arrange(PC1)
## PC1
## tib_l 0.00057
## tib_r 0.00057
## fem_r 0.00067
## fem_l 0.00067
## ace_l 0.07167
## ace_r 0.07184
## ibr 0.18624
## ibl 0.18762
## bib 0.25521
## rad_l 0.39350
## rad_r 0.39539
## hum_r 0.52056
## hum_l 0.52203
Previously, the two leg bones were the most important for PC1, but after we change it, they’re now the least important because the variance went from about \(1000\) to \(1000\)/\(1000^2 = 1\)/\(1000\)
Unless there is an explicit reason why, PCA is done using the correlation matrix instead of the covariance matrix!
The benefit of using the correlation matrix in place of the covariance matrix is that it standardizes the variances to 1, so none of the variables are seen as being more important and the PCs are determined by the correlation structure alone!
We can use prcomp()
and the correlation matrix by just
including scale. = T
pc_r <-
prcomp(
x = bones,
scale. = T
)
# Looking at the eigenvector
data.frame(pc_r$rotation) |>
round(digits = 3)
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10 PC11
## hum_l 0.30 -0.104 -0.013 0.36 -0.063 0.448 0.345 0.026 0.646 -0.098 0.030
## hum_r 0.30 -0.083 0.005 0.34 -0.078 0.403 0.241 -0.029 -0.720 0.137 0.000
## rad_l 0.29 -0.250 0.003 -0.26 0.471 0.178 -0.206 -0.017 0.115 -0.073 0.003
## rad_r 0.29 -0.253 0.018 -0.25 0.448 0.174 -0.217 0.061 -0.121 -0.014 -0.008
## fem_l 0.30 -0.136 -0.007 0.18 -0.335 -0.127 -0.467 -0.001 0.067 -0.011 0.070
## fem_r 0.30 -0.127 -0.013 0.19 -0.340 -0.122 -0.476 0.006 -0.007 -0.107 -0.069
## tib_l 0.29 -0.269 0.173 -0.16 -0.095 -0.352 0.268 0.068 0.115 0.738 0.120
## tib_r 0.28 -0.263 0.169 -0.18 -0.122 -0.406 0.444 -0.107 -0.113 -0.602 -0.142
## ibl 0.25 0.423 0.184 0.26 0.308 -0.269 0.005 -0.002 -0.030 -0.131 0.686
## ibr 0.25 0.423 0.178 0.28 0.306 -0.223 -0.025 -0.014 0.052 0.130 -0.695
## ace_l 0.26 0.254 -0.558 -0.22 -0.064 -0.006 0.058 -0.697 0.032 0.096 0.023
## ace_r 0.26 0.262 -0.545 -0.21 -0.090 -0.050 0.113 0.702 -0.036 -0.056 -0.022
## bib 0.20 0.434 0.516 -0.51 -0.340 0.361 -0.050 0.003 0.016 -0.010 0.003
## PC12 PC13
## hum_l 0.110 0.039
## hum_r -0.124 -0.043
## rad_l -0.685 -0.062
## rad_r 0.700 0.063
## fem_l 0.065 -0.705
## fem_r -0.071 0.691
## tib_l -0.023 0.072
## tib_r 0.015 -0.052
## ibl 0.002 0.056
## ibr 0.005 -0.060
## ace_l 0.046 0.016
## ace_r -0.035 -0.017
## bib -0.004 0.001
Let’s compare PC1 weights using the correlation and covariance matrix, along with the variance and average correlation for each variable:
# Start by calculating the average correlation:
bones |>
cor() |>
abs() |>
colMeans() |>
t() |>
data.frame() |>
pivot_longer(
cols = everything(),
names_to = "bone",
values_to = "cor"
) ->
avg_cors
bones_var |>
# Removing bone location (not needed)
select(-bone_location) |>
# Adding the average correlations
left_join(
y = avg_cors,
by = "bone"
) |>
# adding PC1 from the covariance matrix to the data set
left_join(
y = data.frame(pc_s$rotation) |> rownames_to_column(var = "bone") |> select(bone, PC1),
by = "bone"
) |>
# adding PC1 from the correlation matrix to the data set
left_join(
y = data.frame(pc_r$rotation) |> rownames_to_column(var = "bone") |> select(bone, PC1),
by = "bone",
suffix = c("_S", "_R")
) ->
pc_example
print(pc_example, n = 13)
## # A tibble: 13 × 5
## bone variances cor PC1_S PC1_R
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 hum_l 546. 0.796 0.307 0.300
## 2 hum_r 536. 0.807 0.307 0.303
## 3 rad_l 343. 0.761 0.240 0.288
## 4 rad_r 345. 0.762 0.241 0.289
## 5 fem_l 993. 0.809 0.432 0.305
## 6 fem_r 991. 0.810 0.431 0.305
## 7 tib_l 784. 0.758 0.374 0.287
## 8 tib_r 813. 0.751 0.377 0.284
## 9 ibl 122. 0.686 0.102 0.250
## 10 ibr 119. 0.688 0.101 0.251
## 11 ace_l 17.0 0.709 0.0409 0.262
## 12 ace_r 17.2 0.709 0.0411 0.262
## 13 bib 336. 0.555 0.133 0.199
# Creating a correlation plot
pc_example |>
# picking just the PCs and correlations and variances
select(variances, PC1_S, cor, PC1_R) |>
# Creating a correlation plot to see how closely the PC1s are to the variances & correlations
ggcorr(
high = "blue",
low = "red",
mid = "grey",
label = T,
label_round = 2
)
The correlation matrix shows that:
when using \(\textbf{S}\), the variances mostly determine the first principal component
when using \(\textbf{R}\), the average correlation mostly determine the first principal component
Often when using PCA, we want the correlation structure to determine the how the variables are combined, not the variances (since the variances can be changed while the correlation can not).