Let’s look at the variances of each variable:
data.frame(
variance =
bones |>
cov() |>
diag()
) |>
# Changing the row name to a column in the data
rownames_to_column(var = "bone") |>
# Arranging from largest to smallest variance
arrange(-variance)
## bone variance
## 1 fem_l 993
## 2 fem_r 991
## 3 tib_r 813
## 4 tib_l 784
## 5 hum_l 546
## 6 hum_r 536
## 7 rad_r 345
## 8 rad_l 343
## 9 bib 336
## 10 ibl 122
## 11 ibr 119
## 12 ace_r 17
## 13 ace_l 17
# PCA using the covariance matrix, S (prcomp() centers but does not scale by default)
pc_S <- prcomp(x = bones)

# PCA using the correlation matrix, R (scale. = TRUE standardizes each variable)
pc_R <- prcomp(x = bones,
               scale. = TRUE)
The goal of PCA is to reduce the number of variables from \(p\) to \(k\) by using linear combinations to form a set of uncorrelated variables, \(\textbf{z}\).
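As a concrete check (a minimal sketch, assuming bones holds only the numeric measurements), the PC scores are exactly these linear combinations: the centered data multiplied by the matrix of eigenvectors, which prcomp() stores as the rotation matrix.
# The scores in pc_S$x are the centered data times the eigenvector matrix
z <- scale(bones, center = TRUE, scale = FALSE) %*% pc_S$rotation

# Should return TRUE: both ways of computing the scores agree
all.equal(z, pc_S$x, check.attributes = FALSE)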
But how do we choose the number of PCs to use?
By looking at the amount of variability the PCs retain from the original data!
If we want to see the variance of each PC along with the cumulative variance retained, we can look at the eigenvalues, \(\lambda_i\)!
From the covariance matrix:
data.frame(
PC = 1:p, # p = the number of original variables
variance = cov(bones) |>
eigen() |>
pluck('values')
) |>
# Calculating the individual proportion and the cumulative proportion of variance
mutate(
var_prop = variance/sum(variance),
cumul_var_prop = cumsum(var_prop)
)
## PC variance var_prop cumul_var_prop
## 1 1 5083.5 0.85261 0.85
## 2 2 382.4 0.06413 0.92
## 3 3 183.6 0.03080 0.95
## 4 4 129.6 0.02173 0.97
## 5 5 68.1 0.01143 0.98
## 6 6 51.5 0.00863 0.99
## 7 7 20.4 0.00342 0.99
## 8 8 17.2 0.00288 1.00
## 9 9 9.8 0.00165 1.00
## 10 10 8.6 0.00144 1.00
## 11 11 4.4 0.00074 1.00
## 12 12 2.6 0.00044 1.00
## 13 13 0.6 0.00010 1.00
The variance of the first PC is over 5000 and accounts for a little more than 85% of the total variance. If we are using \(\textbf{S}\), we only need 1 PC!
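As a sanity check, the same eigenvalues can be pulled straight from the prcomp object, since the variance of each PC is just its squared standard deviation:
# Eigenvalues of S = squared standard deviations of the PCs
pc_S$sdev^2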
From the correlation matrix:
data.frame(
PC = 1:p,
variance = cor(bones) |>
eigen() |>
pluck('values')
) |>
mutate(
var_prop = variance/sum(variance),
cumul_var_prop = cumsum(var_prop)
)
## PC variance var_prop cumul_var_prop
## 1 1 9.6951 0.74578 0.75
## 2 2 1.7001 0.13078 0.88
## 3 3 0.6188 0.04760 0.92
## 4 4 0.2778 0.02137 0.95
## 5 5 0.2394 0.01841 0.96
## 6 6 0.2315 0.01781 0.98
## 7 7 0.1019 0.00784 0.99
## 8 8 0.0354 0.00272 0.99
## 9 9 0.0325 0.00250 0.99
## 10 10 0.0243 0.00187 1.00
## 11 11 0.0220 0.00170 1.00
## 12 12 0.0128 0.00098 1.00
## 13 13 0.0085 0.00066 1.00
If we are using the correlation matrix, the first PC has a variance of 9.7 and accounts for about 75% of the total variance. The second PC has a variance of 1.7 and accounts for about 13.1% of the overall variability. Combined, the first two account for about 88% of the variance of the original data!
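A handy check here: the eigenvalues of a correlation matrix always sum to the number of variables, \(p\), so the cumulative proportions can be computed directly from the prcomp object:
# trace(R) = p, so the eigenvalues of R sum to ncol(bones) = 13
cumsum(pc_R$sdev^2) / ncol(bones)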
We can find the variance, proportion of variance, and cumulative proportion a little more simply than above by using the summary() function:
summary(pc_S)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 71.299 19.5547 13.5505 11.3830 8.2543 7.17485 4.51378
## Proportion of Variance 0.853 0.0641 0.0308 0.0217 0.0114 0.00863 0.00342
## Cumulative Proportion 0.853 0.9167 0.9475 0.9693 0.9807 0.98933 0.99275
## PC8 PC9 PC10 PC11 PC12 PC13
## Standard deviation 4.14346 3.13217 2.93081 2.10432 1.62389 0.7731
## Proportion of Variance 0.00288 0.00165 0.00144 0.00074 0.00044 0.0001
## Cumulative Proportion 0.99563 0.99727 0.99871 0.99946 0.99990 1.0000
summary(pc_R)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## Standard deviation 3.114 1.304 0.7866 0.5271 0.4892 0.4811 0.31918 0.18803
## Proportion of Variance 0.746 0.131 0.0476 0.0214 0.0184 0.0178 0.00784 0.00272
## Cumulative Proportion 0.746 0.877 0.9242 0.9455 0.9639 0.9817 0.98957 0.99229
## PC9 PC10 PC11 PC12 PC13
## Standard deviation 0.1803 0.15602 0.1485 0.11293 0.09234
## Proportion of Variance 0.0025 0.00187 0.0017 0.00098 0.00066
## Cumulative Proportion 0.9948 0.99667 0.9984 0.99934 1.00000
Note that summary() reports the standard deviations, not the variances!
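If we want the variances (eigenvalues) back, we can square the standard deviations; the full table is also stored as a matrix inside the summary object. A minimal sketch:
# The importance table lives in the summary object as a matrix
summary(pc_R)$importance

# Squaring the first row recovers the eigenvalues (variances)
summary(pc_R)$importance["Standard deviation", ]^2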
A screeplot is a way of visualizing the eigenvalues (variances) of the individual principal components.
In the factoextra package, there is a function called fviz_screeplot() that uses the ggplot2 framework to create the graph!
fviz_screeplot(
X = pc_S, # give X the PCA object created by prcomp()
choice = "eigenvalue",
geom = "line",
linecolor = "steelblue",
ncp = p
) +
labs(
title = "Screeplot using the Covariance Matrix",
x = "Principal Component"
) +
# Average variance line
geom_hline(
yintercept = pc_S$sdev^2 |> mean(),
color = "darkred"
)
# Displaying the percentage explained by using choice = "variance":
fviz_screeplot(
X = pc_S,
choice = "variance",
geom = "line",
linecolor = "steelblue",
ncp = p
) +
labs(
title = "Screeplot using the Covariance Matrix",
x = "Principal Component"
)
When examining a screeplot, you want to look for where the “elbow” of the plot occurs. There is a large “drop” between PC1 and PC2, but the drop between PC2 and PC3 is much smaller. This indicates we only need 1 PC if we are using \(\textbf{S}\).
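The dark red line in the first screeplot marks the average eigenvalue, which suggests another common rule of thumb: keep the PCs whose variance is above average. A quick numeric version of that check:
# How many PCs have above-average variance when using S?
lambda_S <- pc_S$sdev^2
sum(lambda_S > mean(lambda_S)) # only PC1, agreeing with the elbow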
Now let’s look at the correlation matrix:
fviz_screeplot(
X = pc_R,
choice = "eigenvalue",
geom = "line",
linecolor = "steelblue",
ncp = p
) +
labs(
title = "Screeplot using the Correlation Matrix",
x = "Principal Component"
) +
# Average variance line
geom_hline(
yintercept = pc_R$sdev^2 |> mean(),
color = "darkred"
)
# Displaying the percentage explained by using choice = "variance":
fviz_screeplot(
X = pc_R,
choice = "variance",
geom = "line",
linecolor = "steelblue",
ncp = p
) +
labs(
title = "Screeplot using the Correlation Matrix",
x = "Principal Component"
)
There is a large drop from PC1 to PC2, a smaller drop from PC2 to PC3, and very small drops after that, indicating that we should include 1 or 2 PCs if we are using the correlation matrix.
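For the correlation matrix, the average eigenvalue is exactly 1, so the same rule of thumb becomes the familiar “keep the eigenvalues greater than 1” (Kaiser) criterion:
# With R, mean(lambda) = 1, so the rule is "eigenvalue > 1"
lambda_R <- pc_R$sdev^2
sum(lambda_R > 1) # 2 PCs, agreeing with the screeplot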
From now on, we’ll be using the results from the correlation matrix.
While it’s not a requirement for the PCs, we often like to see if the PCs we’re creating have an interpretation. We can check what the PCs represent by examining the weights in the eigenvectors. The easiest way is to create a visual!
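Before building the visual, it can help to glance at the raw weights for the first two PCs:
# Eigenvector weights for PC1 and PC2, rounded for readability
pc_R$rotation[, 1:2] |> round(2)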
pc_R |>
pluck("rotation") |>
data.frame() |>
select(PC1:PC2) |>
rownames_to_column(var = "bone") |>
pivot_longer(
cols = contains("PC"),
names_to = "PC",
values_to = "weights"
) |>
mutate(bone = as_factor(bone)) |>
#filter(abs(weights) >= 0.1) |>
ggplot(
mapping = aes(
x = PC,
y = fct_rev(bone),
fill = weights
)
) +
geom_tile(
mapping = aes(alpha = abs(weights)),
color = "white",
linewidth = 1,
show.legend = F
) +
ggfittext::geom_fit_text(
mapping = aes(
label = round(weights,2),
alpha = abs(weights)
),
contrast = T,
fontface = 2,
show.legend = F
) +
labs(
y = "Variables",
x = "Principal Components"
) +
theme_test() +
scale_x_discrete(expand = c(0,0)) +
scale_y_discrete(expand = c(0,0))+
scale_fill_gradient2()
It looks like the first PC is a combination of the original variables that weights all 13 of them almost equally (with hip width, bib, receiving the smallest weight at 0.20).
The second PC looks to be a combination of the hip measurements minus the short limb measurements. When some of the variables are positive and others are negative, the interpretation is often a comparison (hip bones vs short limb bones).
One common purpose of PCA is to create a scatterplot using all the original variables. PCA accomplishes this by plotting the first two principal components. Like we did to create the screeplot, we’ll use the factoextra package to create our biplot.
Creating a biplot using fviz_pca() from the factoextra package: a biplot plots two PCs (defaulting to the first 2) and includes arrows to indicate how the original variables are connected to the PCs.
fviz_pca(
X = pc_R,
axes = c(1, 2), # Which PCs to plot
geom = "point", # "point" -> dots, "text" -> rownames
habillage = Bones |> # habillage = group_var will color the dots based on groups
drop_na() |>
pull(sex)
) +
coord_equal() +
theme(legend.position = "bottom")
If we want to see how each variable contributes, we can make the dots light by setting alpha.ind close to 0:
fviz_pca(
X = pc_R,
axes = c(1, 2),
geom = "point",
alpha.ind = 0.1
)
We can also compare the “size” of any 2 PCs by using fviz_pca_ind(), which doesn’t plot the arrows for each variable, and adding coord_equal():
fviz_pca_ind(
X = pc_R,
axes = c(1, 7),
geom = "point"
) +
coord_equal()
If linear dependencies exist in the data, we can use PCA to help identify where they occur.
How many sets of linearly dependent variables exist can be determined by looking at how many PCs have very small eigenvalues (variances):
summary(pc_R)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## Standard deviation 3.114 1.304 0.7866 0.5271 0.4892 0.4811 0.31918 0.18803
## Proportion of Variance 0.746 0.131 0.0476 0.0214 0.0184 0.0178 0.00784 0.00272
## Cumulative Proportion 0.746 0.877 0.9242 0.9455 0.9639 0.9817 0.98957 0.99229
## PC9 PC10 PC11 PC12 PC13
## Standard deviation 0.1803 0.15602 0.1485 0.11293 0.09234
## Proportion of Variance 0.0025 0.00187 0.0017 0.00098 0.00066
## Cumulative Proportion 0.9948 0.99667 0.9984 0.99934 1.00000
How small is “small” can be subjective, but the variances of the last two PCs, PC12 and PC13, are very small, indicating that there are two sets of variables with strong linear dependencies. But what are the linear combinations?
We can look at the eigenvectors to find what the linear dependencies are!
# Last 2 make up a very small % of the overall variance
pc_R |>
pluck("rotation") |>
data.frame() |>
dplyr::select(PC12, PC13) |>
# Keeping the rows with above average weights for their respective PC
filter(
abs(PC12) > mean(abs(PC12)) |
abs(PC13) > mean(abs(PC13))
)
## PC12 PC13
## rad_l -0.685 -0.062
## rad_r 0.700 0.063
## fem_l 0.065 -0.705
## fem_r -0.071 0.691
Judging from PC12, the radius bones are nearly collinear: rad_l − rad_r ≈ 0.
From PC13, the femur bones are also nearly collinear: fem_r − fem_l ≈ 0.
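We can verify this numerically (a minimal sketch): if the two radius measurements are nearly collinear, their difference should have almost no variance compared to either measurement on its own:
# Variance of the difference vs. variance of a single measurement
var(bones$rad_r - bones$rad_l)
var(bones$rad_r)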
Creating a graph to see the association between the radius bones:
ggplot(
data = bones,
mapping = aes(
x = rad_l,
y = rad_r
)
) +
# Adding the points
geom_point(
alpha = 0.5
) +
# Adding a trend line
geom_smooth(
method = "lm",
se = F,
formula = y ~ x,
color = "darkred"
) +
# Adding the correlation to the graph
annotate(
geom = "text",
label = paste("r =", cor(bones$rad_l, bones$rad_r) |> round(digits = 3)),
x = 250,
y = 200,
color = "darkred",
size = 5
)
ggplot(
data = bones,
mapping = aes(
x = fem_l,
y = fem_r
)
) +
geom_point(
alpha = 0.5
) +
geom_smooth(
method = "lm",
se = F,
formula = y ~ x,
color = "darkred"
) +
labs(
x = "Left Femur",
y = "Right Femur"
) +
annotate(
geom = "text",
label = paste("r =", cor(bones$fem_l, bones$fem_r) |> round(digits = 3)),
x = 500,
y = 375,
color = "darkred",
size = 5
)