Time now to see how to implement PCA in R, including:
Functions for executing PCA
Packages and functions for creating useful plots to gaining insight into our PCA
6D example: Prehistoric goblets
Suppose you have been asked to analyze data on 25 prehistoric goblets from Thailand (Professor C.F.W. Higham, University of Otago, as taken from Manly, B.F.J. 1986. Multivariate Statistical Methods: A Primer. Chapman and Hall, London, 159pp.)
Stem width and mouth width (least important for the first PC) are now the two most important variables for separating goblets in the second PC direction.
Total height is third most important, but has a negative loading
Goblets with large stem and mouth widths, and small heights (i.e., short, squat goblets) will have large PC2 values
Loadings plot with fviz_pca_var()
While the variable contribution plot shows us importance, it doesn’t convey how the variables explain variability in the first 2 PC directions.
That’s the job of the loadings plot, which plots the loadings as arrow vectors.
Total width and base width point almost directly horizontal, as we would expect given the variable contribution plots: the first PC is determined primarily by these variables
Stem width and mouth width point more in the vertical direction than any other variable, consistent with these being the highest contributors to PC2.
All arrows point in somewhat similar directions, consistent with notable positive correlation between all of these variables
Biplot with fviz_pca()
We can put this all together, adding the arrows on top of a plot of the first two PC scores with fviz_pca()
Note the “equal contribution” threshold = \(100\times 1/10 =10\%\)
Calories, fat, and protein are driving PC1
PC2 is driven by carbs, sugar and fiber. Sugar and fiber are main drivers of PC3.
Cholesterol third highest contributor to PC4.
Sign matters!
It takes a bit more work to plot the raw loadings (must create out own data frame), but in this way we can see their signs which provides insight the squared loadings don’t.
loadings_df <-data.frame(food_pca$rotation) %>%rownames_to_column('Variable')base_plot <-ggplot(loadings_df) +theme_classic(base_size =12) +theme(axis.text.x =element_text(angle =45,hjust=1,vjust=1)) +labs(x ='Variable') +geom_hline(aes(yintercept=0), linetype=2)l1 <- base_plot +geom_col(aes(x =reorder(Variable,PC1, decreasing=TRUE), y = PC1))l2 <- base_plot+geom_col(aes(x =reorder(Variable,PC2, decreasing=TRUE), y = PC2)) l3 <- base_plot +geom_col(aes(x =reorder(Variable,PC3, decreasing=TRUE), y = PC3)) l4 <- base_plot +geom_col(aes(x =reorder(Variable,PC4, decreasing=TRUE), y = PC4)) (l1+l2)/(l3+l4) +plot_annotation(title ='Loadings plots for fast food PCA',theme=theme(plot.title =element_text(hjust =0.5)))
Foods low in fiber, sugar, carbs will have high PC2 scores.
PC3 contrasts foods high in sugar and low in fiber.
PC4 contrasts foods high in fiber and low in carbs.
PCA biplot + bells and whistles
We can easily incorporate more information (ggrepeled-point labeling, color-coding) in PCA biplots
Here we are also suppressing the point labeling, labeling only the variable arrows (even repelling doesn’t save us if we label everything!), and cleaning up the legend
fviz_pca(food_pca, axes =c(1,2), col.ind = fastfood$Type, label ='var', # Only label variable arrows, not 'ind' toorepel =TRUE, pointsize =3) +scale_shape_manual(values =c(15:19,8)) +#Control point shapes; see ?pchlabs(color='Food type', shape ='Food type') #Clean up legend title
Interpretation: axes 1 and 2
The “big points” are the mean vectors of the PCs for each food type
Protein, sodium, saturated fat, total fat, cholesterol all highly correlated
Food items that are high in these are obviously sandwiches
Desserts are low in these
These are not that correlated with carbs, sugar, fiber
Low in sugar/fiber/carbs include chicken and wraps
What about PCs 3 and 4?
Want to produce a plot of PC3 vs PC4
Want to label points with more control over the repelling and text size
Need to set up a data frame with the PCs and information for point labelling (food type, item name) to feed to geom_text_repel()
PCS_with_labels <-bind_cols(food_pca$x[,1:4], Item=fastfood$Item, type = fastfood$Type)PCS_with_labels
p1 <-fviz_pca(food_pca, axes =c(1,2), col.ind = fastfood$Type, label ='var', repel=TRUE,select.var =list(contrib=5) # Can control num arrows! Top 5 per PC combo ) +scale_shape_manual(values =c(15:19,8)) +labs(color='Food type', shape ='Food type', title='') +geom_text_repel(aes(x = PC1, y = PC2, label = Item, color = type), data = PCS_with_labels, max.overlaps =7, size =3)p2 <-fviz_pca(food_pca, axes =c(3,4), col.ind = fastfood$Type, label ='var', repel=TRUE,select.var =list(contrib=5) # Can control num arrows! Top 5 per PC combo ) +scale_shape_manual(values =c(15:19,8)) +guides(color='none', shape ='none') +# eliminate color legend; already have onelabs(color='Food type', shape ='Food type', title='') +geom_text_repel(aes(x = PC3, y = PC4, label = Item, color = type), data = PCS_with_labels, max.overlaps =7, size =3)p1 + p2
Plot result
Notice how PCs 3 and 4 are much more focused on the sugar, cholesterol, fiber, and carb content - the “low-correlation” variables in our initial pairs plot
PCs 3 and 4 highlight variability in the salads, desserts, and sides - notice how clumped in the middle the sandwiches and chicken are in the PC3v4 plot, when they were the big story in the PC1v2 plot
Notice how PCs 3 and 4 contrast fiber, sugar, and carbs, which we knew from the loadings plots)