PCA_tutorial

PCA AND UMAP WITH TIDYMODELS AND TIDYTUESDAY COCKTAIL RECIPES

Lately, I’ve been publishing screencasts demonstrating how to use tidymodels framework, frm first steps in modeling to how to evaluate complex models. Todays’s screencast isn’t about predictive modeling, but about unsupervised machine learning using with this weeks TidyTuesday dataset on cocktail recipes

Explore the data

Our modeling goals is to use unsupervised algorithms for dimensionality reduction with coktail recipes from this week’s TidyTuesday dataset. In my earlier blog post this week, I used one of the cocktail datasets included and here let’s use the other one

suppressMessages(library(tidyverse))
suppressMessages(library(tidymodels))
suppressMessages(library(scales))
suppressMessages(library(ggthemes))

theme_set(theme_light())

boston_cocktails <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-05-26/boston_cocktails.csv")

## Parsed with column specification:
## cols(
##   name = col_character(),
##   category = col_character(),
##   row_id = col_double(),
##   ingredient_number = col_double(),
##   ingredient = col_character(),
##   measure = col_character()
## )

boston_cocktails %>%
  count(ingredient, sort = TRUE)

## # A tibble: 569 x 2
##    ingredient            n
##    <chr>             <int>
##  1 Gin                 176
##  2 Fresh lemon juice   138
##  3 Simple Syrup        115
##  4 Vodka               114
##  5 Light Rum           113
##  6 Dry Vermouth        107
##  7 Fresh Lime Juice    107
##  8 Triple Sec          107
##  9 Powdered Sugar       90
## 10 Grenadine            85
## # ... with 559 more rows

There is a bit of data cleaning to do to start, both for the ingredient column and the measure column.

cocktails_parsed <- boston_cocktails %>% 
               mutate(
                              ingredient = str_to_lower(ingredient),
                              ingredient = str_replace_all(ingredient, "_", " "),
                              ingredient = str_remove(ingredient, " liqueur$"),
                              ingredient = str_remove(ingredient, " (if desired)$"),
                              ingredient = case_when(
                                             str_detect(ingredient, "bitters") ~ 'bitters',
                                             str_detect(ingredient, 'lemon') ~ 'lemon juice', 
                                             str_detect(ingredient, 'lime') ~'lime juice',
                                             str_detect(ingredient, 'grapefruit') ~'grapefruit juice',
                                             str_detect(ingredient, 'orange') ~ 'orange juice',
                                             TRUE ~ ingredient
                              ),
                              measure = case_when(
                                             str_detect(ingredient, 'bitters') ~ str_replace(measure, "oz$", 'dash'), 
                                             TRUE ~ measure
                              ),
                              measure = str_replace(measure, " ?1/2" ,".5"),
                              measure = str_replace(measure, " ?3/4", ".75"),
                              measure = str_replace(measure, " ?1/4", ".25"),
                              measure_number = parse_number(measure),
                              measure_number = ifelse(str_detect(measure, 'dash$'),
                                                      measure_number / 50,
                                                      measure_number)
               ) %>% 
               add_count(ingredient) %>% 
               filter(n > 15) %>%
               select (-n) %>% 
               distinct(row_id, ingredient, .keep_all = TRUE) %>% 
               na.omit()

## Warning: 20 parsing failures.
## row col expected    actual
## 637  -- a number For glass
## 671  -- a number For glass
## 794  -- a number For glass
## 878  -- a number For glass
## 928  -- a number For glass
## ... ... ........ .........
## See problems(...) for more details.

cocktails_parsed

## # A tibble: 2,528 x 7
##    name     category   row_id ingredient_numb~ ingredient measure measure_number
##    <chr>    <chr>       <dbl>            <dbl> <chr>      <chr>            <dbl>
##  1 Gauguin  Cocktail ~      1                1 light rum  2 oz              2   
##  2 Gauguin  Cocktail ~      1                3 lemon jui~ 1 oz              1   
##  3 Gauguin  Cocktail ~      1                4 lime juice 1 oz              1   
##  4 Fort La~ Cocktail ~      2                1 light rum  1.5 oz            1.5 
##  5 Fort La~ Cocktail ~      2                2 sweet ver~ .5 oz             0.5 
##  6 Fort La~ Cocktail ~      2                3 orange ju~ .25 oz            0.25
##  7 Fort La~ Cocktail ~      2                4 lime juice .25 oz            0.25
##  8 Cuban C~ Cocktail ~      4                1 lime juice .5 oz             0.5 
##  9 Cuban C~ Cocktail ~      4                2 powdered ~ .5 oz             0.5 
## 10 Cuban C~ Cocktail ~      4                3 light rum  2 oz              2   
## # ... with 2,518 more rows

I typically do my data cleaning with data in a tidy format, like boston_cocktailsorcocktails_parsed``. When it’s time for modeling, we usually need the data in a wider format, so let’s use pivot_wider() to reshape our data

cocktails_df <- cocktails_parsed %>% 
               select(-ingredient_number, -row_id, -measure) %>%
               pivot_wider( names_from = ingredient, values_from = measure_number, values_fill = 0) %>%
               janitor::clean_names() %>%
               na.omit()

cocktails_df

## # A tibble: 937 x 42
##    name  category light_rum lemon_juice lime_juice sweet_vermouth orange_juice
##    <chr> <chr>        <dbl>       <dbl>      <dbl>          <dbl>        <dbl>
##  1 Gaug~ Cocktai~      2           1          1               0           0   
##  2 Fort~ Cocktai~      1.5         0          0.25            0.5         0.25
##  3 Cuba~ Cocktai~      2           0          0.5             0           0   
##  4 Cool~ Cocktai~      0           0          0               0           1   
##  5 John~ Whiskies      0           1          0               0           0   
##  6 Cher~ Cocktai~      1.25        0          0               0           0   
##  7 Casa~ Cocktai~      2           0          1.5             0           0   
##  8 Cari~ Cocktai~      0.5         0          0               0           0   
##  9 Ambe~ Cordial~      0           0.25       0               0           0   
## 10 The ~ Whiskies      0           0.5        0               0           0   
## # ... with 927 more rows, and 35 more variables: powdered_sugar <dbl>,
## #   dark_rum <dbl>, cranberry_juice <dbl>, pineapple_juice <dbl>,
## #   bourbon_whiskey <dbl>, simple_syrup <dbl>, cherry_flavored_brandy <dbl>,
## #   light_cream <dbl>, triple_sec <dbl>, maraschino <dbl>, amaretto <dbl>,
## #   grenadine <dbl>, apple_brandy <dbl>, brandy <dbl>, gin <dbl>,
## #   anisette <dbl>, dry_vermouth <dbl>, bitters <dbl>,
## #   straight_rye_whiskey <dbl>, benedictine <dbl>, egg_white <dbl>,
## #   half_and_half <dbl>, vodka <dbl>, grapefruit_juice <dbl>,
## #   blended_scotch_whiskey <dbl>, port <dbl>, white_creme_de_cacao <dbl>,
## #   citrus_flavored_vodka <dbl>, whole_egg <dbl>,
## #   apricot_flavored_brandy <dbl>, egg_yolk <dbl>, blended_whiskey <dbl>,
## #   dubonnet <dbl>, blanco_tequila <dbl>, old_mr_boston_dry_gin <dbl>

There are lots more great examples of TidyTuesday EDA out there to explore on Twiter

Principla componenet analysis

This dataset is especially delightful because we get to use recipes wit recipes. Let’s load the tidymodels metapackage and implement principal component analysis with a recipe

library(tidymodels)

pca_rec <- recipe(~., data = cocktails_df) %>% 
               update_role(name, category, new_role = 'id') %>%
               step_normalize(all_predictors()) %>% 
               step_pca(all_predictors())

pca_prep <- prep(pca_rec)

pca_prep

## Data Recipe
## 
## Inputs:
## 
##       role #variables
##         id          2
##  predictor         40
## 
## Training data contained 937 data points and no missing data.
## 
## Operations:
## 
## Centering and scaling for light_rum, lemon_juice, ... [trained]
## PCA extraction with light_rum, lemon_juice, ... [trained]

Let’s walk through the steps in this recipe

Firstm we must tell the recipe() what’s going on with our model (notice the formular with no outcome) and what data we are using
Next, we update the role for cocktail name and category, since these are variables we want to keep around for convenience as identifiers for rows but are not a predictor or outcome
We need to center and scale the numeric predictors, because we are about to implement PCA
Finally, we use step_pca() for the actual principal componenet analysis.

Before using prep() these steps have been defined but not actually run or implemented. The prep() function is where everything gets evaluated.

Once we have that done, we can both explore the results of the PCA. Let’s start with checking out how the PCA turned out. We can tidy() any of our recipe steps including the PCA step, which is the second step. Then let’s make a visualization to see what the components look like.

tidied_pca <- tidy(pca_prep, 2)

tidied_pca %>% 
               filter(component %in% paste0("PC", 1:5)) %>% 
               mutate(component = fct_inorder(component)) %>% 
               ggplot(aes(x = value, y = terms, fill = terms)) +
               geom_col(show.legend = FALSE) + 
               facet_wrap(~component, nrow = 1) + 
               labs(y = NULL)

The biggest difference in PC1 is powdered sugar vs simple syrupt; recipes are not likely to have both, which makes s sense! let’s zoom on the first four components, and understand which cocktails ingredients contribute in the positive and negative direction

library(tidytext)

q= tidied_pca %>% 
               filter(component %in% paste0("PC", 1:4)) %>% 
               group_by(component) %>% 
               top_n(8, abs(value)) %>% 
               ungroup() %>% 
               mutate(terms = reorder_within(terms, abs(value), component)) %>% 
               ggplot(aes(x = abs(value), y = terms, fill = value > 0)) +
               geom_col() +
               facet_wrap(~component, scales = 'free_y') + 
               scale_y_reordered() + 
               labs(
                              x = 'Absolute value of contribution',
                              y = NULL, fill = 'Positive?'
               )

So PC1 is about powdered sugar + egg + gin drinks vs simple syrup + lime + tequila drinks. This is the componenet that explains the most variation in drinks. PC2 is mostly about vermouth, both sweet and dry.

How are the cocktails distributed in the plane of the first two components?

p = juice(pca_prep) %>% 
               ggplot(aes(x = PC1, y = PC2, label = name)) + 
               geom_point(aes(color = category), alpha = 0.7, size = 2) +
               geom_text(check_overlap = TRUE, hjust = 'inward', family = 'IBMPlexSans') + 
               labs(color = NULL)

Fizzy, egg, powdered sugar drinks are to the left
Simple syrup, lime, tequila drinks are to the right
Vermounth drinks are more to the top

You can change out PC2 for PC4, for example, to instead see where drinks with more grenadine are.

UMAP

One of the benefits of the tidymodels ecosystem is the flexibility and ease of trying different approaches for the same kind of task. For example, we can switch out PCA for UMAP, an entirely different algorithm for dimensionality reduction based on ideas from topological data analysis. The embed package provides recipe steps for ways to create embeddings including UMAP. Let’s switch out the PCA step for the UMAP step.

library(embed)

## Warning: package 'embed' was built under R version 4.0.2

umap_rec <- recipe(~., data = cocktails_df) %>% 
               update_role(name, category, new_role = 'id') %>% 
               step_normalize(all_predictors()) %>% 
               step_umap(all_predictors())

#umap_prep <- recipes::prep(umap_rec)

umap_rec

## Data Recipe
## 
## Inputs:
## 
##       role #variables
##         id          2
##  predictor         40
## 
## Operations:
## 
## Centering and scaling for all_predictors
## UMAP embedding for all_predictors

```

PCA_tutorial

Nguyen_LSCM

7/23/2020

Explore the data

Principla componenet analysis

UMAP