Introduction to Dimension Reduction

Part 1: Starbucks Demo

Thanks to Professor Ron Yurko for doing a similar demo for me when teaching Data Visualization.

In this demo I’ll walk you through a process called PCA - “Principal Components Analysis”. This is the most popular and versatile dimension reduction method out there, and it’s extremely practical to know. With it, we can take a dataset with any number of variables (as long as they are all numeric) and reduce it down to just 2 columns while preserving as much of the variation in the data as possible.

This code chunk reads in the starbucks dataset - a data dictionary and description are available at this link:

https://github.com/rfordatascience/tidytuesday/blob/master/data/2021/2021-12-21/readme.md

Code
# clean the environment
rm(list = ls())

# read in the starbucks dataset
library(tidyverse)
starbucks <- 
  read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-12-21/starbucks.csv", 
           show_col_types = FALSE) %>%
  # Convert columns to numeric that were saved as character
  mutate(trans_fat_g = as.numeric(trans_fat_g),
         fiber_g = as.numeric(fiber_g), 
         milk = factor(milk, 
                       levels = 0:5, 
                       labels = c("none", "nonfat", "2%", 
                                  "soy", "coconut", "whole")), 
         whip = as.logical(whip))

Before we get started doing PCA - let’s take a look at the dataset with str. str gives a summary of the structure of the data - here we can see which columns are numeric.

Code
str(starbucks)
tibble [1,147 × 15] (S3: tbl_df/tbl/data.frame)
 $ product_name   : chr [1:1147] "brewed coffee - dark roast" "brewed coffee - dark roast" "brewed coffee - dark roast" "brewed coffee - dark roast" ...
 $ size           : chr [1:1147] "short" "tall" "grande" "venti" ...
 $ milk           : Factor w/ 6 levels "none","nonfat",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ whip           : logi [1:1147] FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ serv_size_m_l  : num [1:1147] 236 354 473 591 236 354 473 591 236 354 ...
 $ calories       : num [1:1147] 3 4 5 5 3 4 5 5 3 4 ...
 $ total_fat_g    : num [1:1147] 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 ...
 $ saturated_fat_g: num [1:1147] 0 0 0 0 0 0 0 0 0 0 ...
 $ trans_fat_g    : num [1:1147] 0 0 0 0 0 0 0 0 0 0 ...
 $ cholesterol_mg : num [1:1147] 0 0 0 0 0 0 0 0 0 0 ...
 $ sodium_mg      : num [1:1147] 5 10 10 10 5 10 10 10 5 5 ...
 $ total_carbs_g  : num [1:1147] 0 0 0 0 0 0 0 0 0 0 ...
 $ fiber_g        : num [1:1147] 0 0 0 0 0 0 0 0 0 0 ...
 $ sugar_g        : num [1:1147] 0 0 0 0 0 0 0 0 0 0 ...
 $ caffeine_mg    : num [1:1147] 130 193 260 340 15 20 25 30 155 235 ...

First, I will pull out all the numeric variables from the starbucks dataset using dplyr::select.

Code
numerics <- starbucks %>%
    dplyr::select(serv_size_m_l, calories, total_fat_g, saturated_fat_g, 
                  trans_fat_g, cholesterol_mg, sodium_mg, total_carbs_g,
                  fiber_g, sugar_g, caffeine_mg)

Next, I use the function prcomp to perform PCA on the numeric variables. This takes our 11 numeric variables and produces 11 new vectors called “principal components” (PCs for short). The matrix with these vectors is stored in starbucks_pca$x. Setting center = TRUE and scale = TRUE standardizes each variable first (mean 0, standard deviation 1), so variables measured on large scales like sodium_mg don’t dominate the analysis.

The PCs have a special mathematical property: each one accounts for as much of the remaining variance in the data as possible while being uncorrelated with every prior PC. In other words, each PC explains as much of the data as it can while still pointing in a different direction from all the earlier PCs.

Code
starbucks_pca <- prcomp(numerics, center = TRUE, scale = TRUE)
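
If you want to convince yourself of that property, here’s a quick sanity check: the columns of starbucks_pca$x should be essentially uncorrelated with each other, and their variances should shrink as you go from PC1 to PC11.

Code
# correlations between PCs should be ~0 off the diagonal
round(cor(starbucks_pca$x)[1:3, 1:3], 3)

# variances of the PCs should decrease from PC1 to PC11
# (these equal starbucks_pca$sdev^2)
round(apply(starbucks_pca$x, 2, var), 3)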

We can call summary(starbucks_pca) to see the importance of each principal component. Look at the “Proportion of Variance” row - this is the share of the total variance in the data that each PC accounts for. So in this dataset, PC1 accounts for 55.7% of the total variance, and PC2 accounts for 15.5%. Put together, PCs 1 and 2 account for 71.2% of the total variance. That means we can represent 71.2% of the total variance in the data while only looking at 2 variables, instead of the original 11!

Code
summary(starbucks_pca)
Importance of components:
                          PC1    PC2    PC3     PC4     PC5     PC6    PC7
Standard deviation     2.4748 1.3074 1.0571 0.97919 0.67836 0.56399 0.4413
Proportion of Variance 0.5568 0.1554 0.1016 0.08716 0.04183 0.02892 0.0177
Cumulative Proportion  0.5568 0.7122 0.8138 0.90093 0.94276 0.97168 0.9894
                           PC8     PC9    PC10    PC11
Standard deviation     0.28123 0.16874 0.08702 0.04048
Proportion of Variance 0.00719 0.00259 0.00069 0.00015
Cumulative Proportion  0.99657 0.99916 0.99985 1.00000
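
By the way, that “Proportion of Variance” row isn’t magic - it’s just each PC’s variance divided by the total variance, and we can reproduce it from the standard deviations stored in starbucks_pca$sdev:

Code
# reproduce the "Proportion of Variance" row from the PC standard deviations
prop_var <- starbucks_pca$sdev^2 / sum(starbucks_pca$sdev^2)
round(prop_var, 4)
round(cumsum(prop_var), 4) # the "Cumulative Proportion" row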

Now let’s extract the first two PCs, put them on the starbucks dataset, and plot them!

Code
starbucks <- starbucks %>%
    mutate(PC1 = starbucks_pca$x[, 1], 
           PC2 = starbucks_pca$x[, 2])

ggplot(starbucks, aes(x = PC1, y = PC2)) + 
    geom_point(alpha = 0.5) + 
    labs(title = "PCA Plot of Starbucks Data", 
         x = "PC1", y = "PC2")

Ok, that plot is kinda cool but we can’t extract a lot of meaning from it - the PCs don’t have any units. There are a couple of classic ways to handle this. The first is to color the points using variables from the original data, which helps show how those variables relate to the PCs. Here are a couple of plots as an example.

Code
ggplot(starbucks, aes(x = PC1, y = PC2)) + 
    geom_point(alpha = 0.5, aes(color = size)) + 
    labs(title = "PCA Plot of Starbucks Data", 
         x = "PC1", y = "PC2")

Code
ggplot(starbucks, aes(x = PC1, y = PC2)) + 
    geom_point(alpha = 0.5, aes(color = sugar_g)) + 
    labs(title = "PCA Plot of Starbucks Data", 
         x = "PC1", y = "PC2", color = "Sugar (grams)")

Another way to interpret PCA is to use a biplot. We can draw each of the original variables as an arrow on top of the PCA plot to see which direction each variable is pulling in. The angle between two arrows reflects how correlated those variables are (arrows pointing the same way indicate highly correlated variables), and the length of an arrow reflects how strongly that variable contributes to the first two PCs.

So in this example, it looks like sugar and carbs are highly correlated and explain a lot of the variation in the data. Meanwhile, fats and cholesterol are highly correlated with each other but not particularly related to the sugar/carbs axis. Caffeine and fiber both have short arrows, indicating they don’t contribute much to the variation captured by the first two PCs.

Code
# "For the love of god, restart R before and after you install this package" 
# - Ron Yurko
library(factoextra)

fviz_pca_biplot(
    # this function takes the whole PCA object, the output of `prcomp`
    starbucks_pca,
    # this tells factoextra to only put labels on the variables
    label = "var", repel = TRUE,
    # graphics parameters
    alpha.ind = 0.25, alpha.var = 0.75, col.var = "darkblue"
)
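
If you want the numbers behind those arrows, the loadings (how much each original variable contributes to each PC) live in starbucks_pca$rotation:

Code
# each row is one of the original variables; each column is a PC
round(starbucks_pca$rotation[, 1:2], 3)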

Let’s try to make a single plot with LOTS of information on it, just to challenge ourselves. In fact, this probably has too much information on it, but it’s fun to look at.

Code
fviz_pca_biplot(
    # this function takes the whole PCA object, the output of `prcomp`
    starbucks_pca,
    # this tells factoextra to only put labels on the variables
    label = "var", repel = TRUE,
    col.ind = starbucks$milk,
    # graphics parameters
    alpha.ind = 0.5, alpha.var = 0.75, col.var = "darkblue"
)

Part 2: ChatGPT Exercise

Thanks to Professor David Brown for challenging MADS with a somewhat more cryptic version of this exercise.

Ok! Now that you’ve seen an example of PCA for dimension reduction - let’s have you do it.

Code
library(readr)

rm(list = ls())
biber <- read_csv("https://raw.githubusercontent.com/GKrotkov/gkrotkov_bpw/master/biber.csv", 
                  show_col_types = FALSE)

The primary goal here is to see what differences there are between ChatGPT’s writing and human writing. Earlier this year, the CMU Statistics & Data Science Department asked ChatGPT to write a paper similar to what statistics students in 36-200 were asked to write (using the exact same prompt that the students were given). Then, they pooled those results with 100 randomly selected student responses from previous years (before ChatGPT was public), along with 100 papers written by published experts in the field.

The biber dataframe loaded above is extracted from those texts. Each row represents a single paper, with the author column telling you whether it was written by ChatGPT, a student, or an expert. Every other column is numeric, representing some tendency in how the author writes.

Let’s do PCA on it and see what we can turn up!

First - pull out all the numerics.

Code
# hint: writing "-colName" in dplyr::select() will drop that column
numerics <- biber %>%
    dplyr::select() # complete this select statement
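
If the “-colName” trick is new to you, here’s what it looks like on a tiny made-up tibble (not the biber data, so it doesn’t give the answer away). This assumes the tidyverse from Part 1 is still loaded:

Code
# toy example of dropping a column by name with dplyr::select()
toy <- tibble(a = 1:3, b = c("x", "y", "z"), c = 4:6)
toy %>% dplyr::select(-b) # keeps the numeric columns a and c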

Next, let’s run PCA by calling prcomp on the numeric variables, and store the result in biber_pca. Remember to set center = TRUE, scale = TRUE.

Code
# your code here

Now take the top two PCs (biber_pca$x[, 1] and biber_pca$x[, 2]) and attach them to the biber dataframe. There are many ways to do this - I find it easiest to do something like df$new_var_name <- new_variable.

Code
# your code here
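
If that df$new_var_name <- new_variable pattern is unfamiliar, here’s a minimal example on a made-up data frame (again, not the biber data):

Code
# toy example: attaching a new column to a data frame with $
toy <- data.frame(x = 1:3)
new_variable <- c(10, 20, 30)
toy$new_var_name <- new_variable
toy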

Lastly, let’s make some plots!

  • Make a scatterplot (use geom_point()) with PC1 on the x-axis and PC2 on the y-axis. Color the points by the author type, and give the graph a sensible title with labs().
  • Make a biplot based on the biber_pca object, using fviz_pca_biplot. Add the parameter select.var = list(contrib = 10) to limit the number of variables plotted to 10.
Code
# your code here