Let’s look at the variances of each variable:
data.frame(
variance =
bones |>
cov() |>
diag()
) |>
# Changing the row name to a column in the data
rownames_to_column(var = "bone") |>
# Arranging from largest to smallest variance
arrange(-variance)
## bone variance
## 1 fem_l 993
## 2 fem_r 991
## 3 tib_r 813
## 4 tib_l 784
## 5 hum_l 546
## 6 hum_r 536
## 7 rad_r 345
## 8 rad_l 343
## 9 bib 336
## 10 ibl 122
## 11 ibr 119
## 12 ace_r 17
## 13 ace_l 17
# PCA using the covariance matrix, S (prcomp() centers but does not scale by default)
pc_S <- prcomp(x = bones)

# PCA using the correlation matrix, R (scale. = TRUE standardizes each variable)
pc_R <- prcomp(x = bones,
               scale. = TRUE)
The goal of PCA is to reduce the number of variables from \(p\) to \(k\) by using linear combinations to form a set of uncorrelated variables, \(\textbf{z}\).
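As a concrete check (a minimal sketch, assuming bones holds only the numeric measurements), the PC scores are exactly these linear combinations: the centered data multiplied by the matrix of eigenvectors, which prcomp() stores as the rotation matrix.
# The scores in pc_S$x are the centered data times the eigenvector matrix
z <- scale(bones, center = TRUE, scale = FALSE) %*% pc_S$rotation

# Should return TRUE: both ways of computing the scores agree
all.equal(z, pc_S$x, check.attributes = FALSE)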
But how do we choose the number of PCs to use?
By looking at the amount of variability the PCs retain from the original data!
If we want to see the variance of each PC along with the cumulative variance retained, we can look at the eigenvalues, \(\lambda_i\)!
From the covariance matrix:
data.frame(
PC = 1:p, # p = the number of original variables
variance = cov(bones) |>
eigen() |>
pluck('values')
) |>
# Calculating the individual proportion and the cumulative proportion of variance
mutate(
var_prop = variance/sum(variance),
cumul_var_prop = cumsum(var_prop)
)
## PC variance var_prop cumul_var_prop
## 1 1 5083.5 0.85261 0.85
## 2 2 382.4 0.06413 0.92
## 3 3 183.6 0.03080 0.95
## 4 4 129.6 0.02173 0.97
## 5 5 68.1 0.01143 0.98
## 6 6 51.5 0.00863 0.99
## 7 7 20.4 0.00342 0.99
## 8 8 17.2 0.00288 1.00
## 9 9 9.8 0.00165 1.00
## 10 10 8.6 0.00144 1.00
## 11 11 4.4 0.00074 1.00
## 12 12 2.6 0.00044 1.00
## 13 13 0.6 0.00010 1.00
The variance of the first PC is over 5000 and accounts for a little more than 85% of the total variance. If we are using \(\textbf{S}\), we only need 1 PC!
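As a sanity check, the same eigenvalues can be pulled straight from the prcomp object, since the variance of each PC is just its squared standard deviation:
# Eigenvalues of S = squared standard deviations of the PCs
pc_S$sdev^2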
From the correlation matrix:
data.frame(
PC = 1:p,
variance = cor(bones) |>
eigen() |>
pluck('values')
) |>
mutate(
var_prop = variance/sum(variance),
cumul_var_prop = cumsum(var_prop)
)
## PC variance var_prop cumul_var_prop
## 1 1 9.6951 0.74578 0.75
## 2 2 1.7001 0.13078 0.88
## 3 3 0.6188 0.04760 0.92
## 4 4 0.2778 0.02137 0.95
## 5 5 0.2394 0.01841 0.96
## 6 6 0.2315 0.01781 0.98
## 7 7 0.1019 0.00784 0.99
## 8 8 0.0354 0.00272 0.99
## 9 9 0.0325 0.00250 0.99
## 10 10 0.0243 0.00187 1.00
## 11 11 0.0220 0.00170 1.00
## 12 12 0.0128 0.00098 1.00
## 13 13 0.0085 0.00066 1.00
If we are using the correlation matrix, the first PC has a variance of 9.7 and accounts for about 75% of the total variance. The second PC has a variance of 1.7 and accounts for about 13.1% of the overall variability. Combined, the first two account for about 88% of the variance of the original data!
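A handy check here: the eigenvalues of a correlation matrix always sum to the number of variables, \(p\), so the cumulative proportions can be computed directly from the prcomp object:
# trace(R) = p, so the eigenvalues of R sum to ncol(bones) = 13
cumsum(pc_R$sdev^2) / ncol(bones)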
We can find the variance, proportion of variance, and cumulative proportion a little more simply than above by using the summary() function:
summary(pc_S)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 71.299 19.5547 13.5505 11.3830 8.2543 7.17485 4.51378
## Proportion of Variance 0.853 0.0641 0.0308 0.0217 0.0114 0.00863 0.00342
## Cumulative Proportion 0.853 0.9167 0.9475 0.9693 0.9807 0.98933 0.99275
## PC8 PC9 PC10 PC11 PC12 PC13
## Standard deviation 4.14346 3.13217 2.93081 2.10432 1.62389 0.7731
## Proportion of Variance 0.00288 0.00165 0.00144 0.00074 0.00044 0.0001
## Cumulative Proportion 0.99563 0.99727 0.99871 0.99946 0.99990 1.0000
summary(pc_R)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## Standard deviation 3.114 1.304 0.7866 0.5271 0.4892 0.4811 0.31918 0.18803
## Proportion of Variance 0.746 0.131 0.0476 0.0214 0.0184 0.0178 0.00784 0.00272
## Cumulative Proportion 0.746 0.877 0.9242 0.9455 0.9639 0.9817 0.98957 0.99229
## PC9 PC10 PC11 PC12 PC13
## Standard deviation 0.1803 0.15602 0.1485 0.11293 0.09234
## Proportion of Variance 0.0025 0.00187 0.0017 0.00098 0.00066
## Cumulative Proportion 0.9948 0.99667 0.9984 0.99934 1.00000
Note that summary() reports the standard deviations, not the variances!
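If we want the variances (eigenvalues) back, we can square the standard deviations; the full table is also stored as a matrix inside the summary object. A minimal sketch:
# The importance table lives in the summary object as a matrix
summary(pc_R)$importance

# Squaring the first row recovers the eigenvalues (variances)
summary(pc_R)$importance["Standard deviation", ]^2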
A screeplot is a way of visualizing the eigenvalues (variances) of the individual principal components.
In the factoextra package, there is a function called fviz_screeplot() that uses the ggplot2 framework to create the graph!
fviz_screeplot(
X = pc_S, # give X the PCA object created by prcomp()
choice = "eigenvalue",
geom = "line",
linecolor = "steelblue",
ncp = p
) +
labs(
title = "Screeplot using the Covariance Matrix",
x = "Principal Component"
) +
# Average variance line
geom_hline(
yintercept = pc_S$sdev^2 |> mean(),
color = "darkred"
)
# Displaying the percentage explained by using choice = "variance":
fviz_screeplot(
X = pc_S,
choice = "variance",
geom = "line",
linecolor = "steelblue",
ncp = p
) +
labs(
title = "Screeplot using the Covariance Matrix",
x = "Principal Component"
)
When examining a screeplot, you want to look for where the “elbow” of the plot occurs. There is a large “drop” between PC1 and PC2, but the drop between PC2 and PC3 is much smaller. This indicates we only need 1 PC if we are using \(\textbf{S}\).
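The dark red line in the first screeplot marks the average eigenvalue, which suggests another common rule of thumb: keep the PCs whose variance is above average. A quick numeric version of that check:
# How many PCs have above-average variance when using S?
lambda_S <- pc_S$sdev^2
sum(lambda_S > mean(lambda_S)) # only PC1, agreeing with the elbow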
Now let’s look at the correlation matrix:
fviz_screeplot(
X = pc_R,
choice = "eigenvalue",
geom = "line",
linecolor = "steelblue",
ncp = p
) +
labs(
title = "Screeplot using the Correlation Matrix",
x = "Principal Component"
) +
# Average variance line
geom_hline(
yintercept = pc_R$sdev^2 |> mean(),
color = "darkred"
)
# Displaying the percentage explained by using choice = "variance":
fviz_screeplot(
X = pc_R,
choice = "variance",
geom = "line",
linecolor = "steelblue",
ncp = p
) +
labs(
title = "Screeplot using the Correlation Matrix",
x = "Principal Component"
)
There is a large drop from PC1 to PC2, a smaller drop from PC2 to PC3, and very small drops after that, indicating that we should include 1 or 2 PCs if we are using the correlation matrix.
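For the correlation matrix, the average eigenvalue is exactly 1, so the same rule of thumb becomes the familiar “keep the eigenvalues greater than 1” (Kaiser) criterion:
# With R, mean(lambda) = 1, so the rule is "eigenvalue > 1"
lambda_R <- pc_R$sdev^2
sum(lambda_R > 1) # 2 PCs, agreeing with the screeplot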
From now on, we’ll be using the results from the correlation matrix.
While it’s not a requirement for the PCs, we often like to see if the PCs we’re creating have an interpretation. We can check what the PCs represent by examining the weights in the eigenvectors. The easiest way is to create a visual!
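Before building the visual, it can help to glance at the raw weights for the first two PCs:
# Eigenvector weights for PC1 and PC2, rounded for readability
pc_R$rotation[, 1:2] |> round(2)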
pc_R |>
pluck("rotation") |>
data.frame() |>
select(PC1:PC2) |>
rownames_to_column(var = "bone") |>
pivot_longer(
cols = contains("PC"),
names_to = "PC",
values_to = "weights"
) |>
mutate(bone = as_factor(bone)) |>
#filter(abs(weights) >= 0.1) |>
ggplot(
mapping = aes(
x = PC,
y = fct_rev(bone),
fill = weights
)
) +
geom_tile(
mapping = aes(alpha = abs(weights)),
color = "white",
linewidth = 1,
show.legend = F
) +
ggfittext::geom_fit_text(
mapping = aes(
label = round(weights,2),
alpha = abs(weights)
),
contrast = T,
fontface = 2,
show.legend = F
) +
labs(
y = "Variables",
x = "Principal Components"
) +
theme_test() +
scale_x_discrete(expand = c(0,0)) +
scale_y_discrete(expand = c(0,0))+
scale_fill_gradient2()
It looks like the first PC is a combination of the original variables that weights all 13 of them almost equally (with hip width, bib, receiving the smallest weight at 0.20).
The second PC looks to be a combination of the hip measurements minus the short limb measurements. When some of the variables are positive and others are negative, the interpretation is often a comparison (hip bones vs short limb bones).
One common purpose of PCA is to create a scatterplot using all the original variables. PCA accomplishes this by plotting the first two principal components. Like we did to create the screeplot, we’ll use the factoextra package to create our biplot.
Creating a biplot using fviz_pca() from the factoextra package: a biplot plots two PCs (defaulting to the first 2) and includes arrows to indicate how the original variables are connected to the PCs.
fviz_pca(
X = pc_R,
axes = c(1, 2), # Which PCs to plot
geom = "point", # "point" -> dots, "text" -> rownames
habillage = Bones |> # habillage = group_var will color the dots based on groups
drop_na() |>
pull(sex)
) +
coord_equal() +
theme(legend.position = "bottom")
If we want to see how each variable contributes, we can make the dots light by setting alpha.ind close to 0:
fviz_pca(
X = pc_R,
axes = c(1, 2),
geom = "point",
alpha.ind = 0.1
)
We can also compare the “size” of any 2 PCs by using fviz_pca_ind(), which doesn’t plot the arrows for each variable, and adding coord_equal():
fviz_pca_ind(
X = pc_R,
axes = c(1, 7),
geom = "point"
) +
coord_equal()
If linear dependencies exist in the data, we can use PCA to help identify where they occur.
How many sets of linearly dependent variables exist can be determined by looking at how many PCs have very small eigenvalues (variances):
summary(pc_R)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## Standard deviation 3.114 1.304 0.7866 0.5271 0.4892 0.4811 0.31918 0.18803
## Proportion of Variance 0.746 0.131 0.0476 0.0214 0.0184 0.0178 0.00784 0.00272
## Cumulative Proportion 0.746 0.877 0.9242 0.9455 0.9639 0.9817 0.98957 0.99229
## PC9 PC10 PC11 PC12 PC13
## Standard deviation 0.1803 0.15602 0.1485 0.11293 0.09234
## Proportion of Variance 0.0025 0.00187 0.0017 0.00098 0.00066
## Cumulative Proportion 0.9948 0.99667 0.9984 0.99934 1.00000
How small is “small” can be subjective, but the variances of the last two PCs, PC12 and PC13, are very small, indicating that there are two sets of variables with strong linear dependencies. But what are the linear combinations?
We can look at the eigenvectors to find what the linear dependencies are!
# Last 2 make up a very small % of the overall variance
pc_R |>
pluck("rotation") |>
data.frame() |>
dplyr::select(PC12, PC13) |>
# Keeping the rows with above average weights for their respective PC
filter(
abs(PC12) > mean(abs(PC12)) |
abs(PC13) > mean(abs(PC13))
)
## PC12 PC13
## rad_l -0.685 -0.062
## rad_r 0.700 0.063
## fem_l 0.065 -0.705
## fem_r -0.071 0.691
Judging from PC12, the radius bones are nearly collinear: rad_l − rad_r ≈ 0.
From PC13, the femur bones are also nearly collinear: fem_r − fem_l ≈ 0.
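We can verify this numerically (a minimal sketch): if the two radius measurements are nearly collinear, their difference should have almost no variance compared to either measurement on its own:
# Variance of the difference vs. variance of a single measurement
var(bones$rad_r - bones$rad_l)
var(bones$rad_r)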
Creating a graph to see the association between the radius bones:
ggplot(
data = bones,
mapping = aes(
x = rad_l,
y = rad_r
)
) +
# Adding the points
geom_point(
alpha = 0.5
) +
# Adding a trend line
geom_smooth(
method = "lm",
se = F,
formula = y ~ x,
color = "darkred"
) +
# Adding the correlation to the graph
annotate(
geom = "text",
label = paste("r =", cor(bones$rad_l, bones$rad_r) |> round(digits = 3)),
x = 250,
y = 200,
color = "darkred",
size = 5
)
ggplot(
data = bones,
mapping = aes(
x = fem_l,
y = fem_r
)
) +
geom_point(
alpha = 0.5
) +
geom_smooth(
method = "lm",
se = F,
formula = y ~ x,
color = "darkred"
) +
labs(
x = "Left Femur",
y = "Right Femur"
) +
annotate(
geom = "text",
label = paste("r =", cor(bones$fem_l, bones$fem_r) |> round(digits = 3)),
x = 500,
y = 375,
color = "darkred",
size = 5
)