📺 Pivot to Video

📺 OK, not really!

This is actually about tidyr::pivot_*() functions, which you can learn all about in the new tidyr version 1.0.0 Pivoting vignette!

But, I will be using some nifty TV-related data. So, I’m sticking with the name.

library(tidyverse)

🍔 Bob’s Burgers: A Belcher Family survey

sheet <- googlesheets::gs_title("bobs_burgers_survey_results")

bobs_ws <- googlesheets::gs_ws_ls(sheet)

raw_dat <- sheet %>% 
  googlesheets::gs_read(ws = glue::glue("{bobs_ws}"))

## Accessing worksheet titled 'Form Responses 1'.

## Parsed with column specification:
## cols(
##   Timestamp = col_character(),
##   `Members of the Belcher family with whom I identify (select all that apply)` = col_character()
## )

belcher_results <- tibble::rowid_to_column(raw_dat, "resp_id") %>%
  dplyr::rename("response" = `Members of the Belcher family with whom I identify (select all that apply)`) %>%
  dplyr::select(-Timestamp)

belcher_results

## # A tibble: 290 x 2
##    resp_id response                      
##      <int> <chr>                         
##  1       1 Linda, Tina, Louise           
##  2       2 Bob, Gene                     
##  3       3 Bob, Tina, Gene               
##  4       4 Bob, Linda, Tina, Gene, Louise
##  5       5 Bob                           
##  6       6 Louise                        
##  7       7 Bob                           
##  8       8 Bob                           
##  9       9 Bob, Gene                     
## 10      10 Bob, Linda, Tina              
## # … with 280 more rows

Because I used a Google Form to collect this data, I don’t have to worry about order of names, since they come out the same every time.

agg_results <- belcher_results %>%
  dplyr::group_by(response) %>%
  dplyr::summarise(total = n()) %>%
  dplyr::arrange(desc(total))

agg_results

## # A tibble: 30 x 2
##    response                       total
##    <chr>                          <int>
##  1 Bob                               53
##  2 Tina                              31
##  3 Louise                            24
##  4 Bob, Tina                         22
##  5 Bob, Louise                       19
##  6 Bob, Tina, Louise                 18
##  7 Tina, Louise                      13
##  8 Bob, Linda, Tina, Gene, Louise    12
##  9 Bob, Tina, Gene                   10
## 10 Bob, Gene                          9
## # … with 20 more rows

`tidyr::separate_rows()`

Let’s make them long with tidyr::separate_rows(). I’m also adding a numeric variable, identify (short for “character(s) with whom I identify”), and ensuring that respondent IDs don’t get erroneously treated as numeric, by converting them to characters.

belcher_results <- belcher_results %>%
  tidyr::separate_rows(response) %>%
  dplyr::mutate(identify = 1,
                resp_id = as.character(resp_id))

## Note: Using an external vector in selections is brittle.
## ℹ If the data contains `vars` it will be selected instead.
## ℹ Use `all_of(vars)` instead of just `vars` to silence this message.

head(belcher_results)

## # A tibble: 6 x 3
##   resp_id response identify
##   <chr>   <chr>       <dbl>
## 1 1       Linda           1
## 2 1       Tina            1
## 3 1       Louise          1
## 4 2       Bob             1
## 5 2       Gene            1
## 6 3       Bob             1

`tidyr::pivot_wider()`

Now we’ll take one of the “new” tidyr verbs for a spin, pivot_wider(). Rather than fill things out with a bunch of NAs, we’ll prepare our data to use with the UpSetR package by turning it into binaries, and ditch the respondent ID in the end.

binary_tib <- belcher_results %>%
  tidyr::pivot_wider(
    names_from = response,
    values_from = identify,
    values_fill = list(identify = 0)
  ) %>%
  dplyr::select(-resp_id)

head(binary_tib)

## # A tibble: 6 x 5
##   Linda  Tina Louise   Bob  Gene
##   <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1     1     1      1     0     0
## 2     0     0      0     1     1
## 3     0     1      0     1     1
## 4     1     1      1     1     1
## 5     0     0      0     1     0
## 6     0     0      1     0     0

`UpSetR::upset()`

I highly recommend Paul Campbell’s code-through using UpSetR, which gave me (among other things) the pro tip that upset() does not like tibbles (hence the as.data.frame() at the end).

binary_df <- as.data.frame(binary_tib)

UpSetR::upset(binary_df, nsets = 5, order.by = "freq")

🎥 I have seen this movie…

Let’s take a look at another dataset I collected with a quick survey, this one asking people whether they had or had not seen a given movie.

sheet <- googlesheets::gs_title("seen_this_movie")

## Sheet successfully identified: "seen_this_movie"

movie_ws <- googlesheets::gs_ws_ls(sheet)

raw_dat <- sheet %>% 
  googlesheets::gs_read(ws = glue::glue("{movie_ws}"))

## Accessing worksheet titled 'Form Responses 1'.

## Parsed with column specification:
## cols(
##   Timestamp = col_character(),
##   `Airplane!` = col_logical(),
##   Anchorman = col_logical(),
##   `Animal House` = col_logical(),
##   `The Big Lebowski` = col_logical(),
##   `The Blues Brothers` = col_logical(),
##   Borat = col_logical(),
##   Bridesmaids = col_logical(),
##   `Office Space` = col_logical(),
##   `Old School` = col_logical(),
##   `This Is Spinal Tap` = col_logical(),
##   `Tommy Boy` = col_logical(),
##   Superbad = col_logical(),
##   `My age is…` = col_double()
## )

First I’ll use tibble::rowid_to_column() to keep track of which user said what (and, again, make sure those are stored as characters rather than numeric, so I don’t accidentally analyze it the wrong way).

I still don’t care about the timestamp, so I’ll get everything but that using dplyr::select(-Timestamp).

The variable names here have a whole bunch of annoying characters (e.g. spaces, exclamation points). So, I’m going to use Sam Firke’s janitor package — specifically the janitor::clean_names() function — to take care of some of the grunt work for me there.

I’m also going to rename my_age_is to age, just because it’s annoying.

movie_results <- tibble::rowid_to_column(raw_dat, "resp_id") %>%
  dplyr::mutate(resp_id = as.character(resp_id)) %>%
  dplyr::select(-Timestamp) %>%
  janitor::clean_names() %>%
  dplyr::rename("age" = my_age_is)

head(movie_results)

## # A tibble: 6 x 14
##   resp_id airplane anchorman animal_house the_big_lebowski the_blues_broth…
##   <chr>   <lgl>    <lgl>     <lgl>        <lgl>            <lgl>           
## 1 1       TRUE     TRUE      TRUE         FALSE            TRUE            
## 2 2       FALSE    TRUE      TRUE         TRUE             FALSE           
## 3 3       FALSE    FALSE     FALSE        TRUE             FALSE           
## 4 4       FALSE    FALSE     FALSE        FALSE            FALSE           
## 5 5       FALSE    TRUE      FALSE        FALSE            FALSE           
## 6 6       FALSE    TRUE      FALSE        TRUE             TRUE            
## # … with 8 more variables: borat <lgl>, bridesmaids <lgl>,
## #   office_space <lgl>, old_school <lgl>, this_is_spinal_tap <lgl>,
## #   tommy_boy <lgl>, superbad <lgl>, age <dbl>

Since this dataset is wide, let’s also take a gander at it using glimpse():

glimpse(movie_results)

## Observations: 158
## Variables: 14
## $ resp_id            <chr> "1", "2", "3", "4", "5", "6", "7", "8", "9", …
## $ airplane           <lgl> TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE…
## $ anchorman          <lgl> TRUE, TRUE, FALSE, FALSE, TRUE, TRUE, TRUE, F…
## $ animal_house       <lgl> TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE…
## $ the_big_lebowski   <lgl> FALSE, TRUE, TRUE, FALSE, FALSE, TRUE, TRUE, …
## $ the_blues_brothers <lgl> TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE…
## $ borat              <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE…
## $ bridesmaids        <lgl> TRUE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE,…
## $ office_space       <lgl> FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALS…
## $ old_school         <lgl> FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALS…
## $ this_is_spinal_tap <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALS…
## $ tommy_boy          <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
## $ superbad           <lgl> TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, TRUE, …
## $ age                <dbl> 48, 31, 30, 20, 29, 28, 24, 42, 33, 34, 33, 3…

`tidyr::pivot_longer()`

Since I only want to elongate the movies, and all of those columns are logical, I’ll first select which columns I want by using select_if() and is.logical() as the predicate function.

logicols <- select_if(movie_results, is.logical) %>%
  colnames()

movie_long <- movie_results %>%
  pivot_longer(
    cols = one_of(logicols),
    names_to = "movie",
    values_to = "seen"
  )

head(movie_long)

## # A tibble: 6 x 4
##   resp_id   age movie              seen 
##   <chr>   <dbl> <chr>              <lgl>
## 1 1          48 airplane           TRUE 
## 2 1          48 anchorman          TRUE 
## 3 1          48 animal_house       TRUE 
## 4 1          48 the_big_lebowski   FALSE
## 5 1          48 the_blues_brothers TRUE 
## 6 1          48 borat              FALSE

Aside: A nice little trick for recoding TRUE/FALSE as 0 and 1, just use as.numeric().

movie_bin <- movie_long %>%
  mutate(seen_num = as.numeric(seen))

movie_bin

## # A tibble: 1,896 x 5
##    resp_id   age movie              seen  seen_num
##    <chr>   <dbl> <chr>              <lgl>    <dbl>
##  1 1          48 airplane           TRUE         1
##  2 1          48 anchorman          TRUE         1
##  3 1          48 animal_house       TRUE         1
##  4 1          48 the_big_lebowski   FALSE        0
##  5 1          48 the_blues_brothers TRUE         1
##  6 1          48 borat              FALSE        0
##  7 1          48 bridesmaids        TRUE         1
##  8 1          48 office_space       FALSE        0
##  9 1          48 old_school         FALSE        0
## 10 1          48 this_is_spinal_tap FALSE        0
## # … with 1,886 more rows

lil_seen <- movie_bin %>%
  select(resp_id, movie, seen_num)

movie_bin %>%
  group_by(movie) %>%
  summarise(pct_seen = sum(seen_num) / n()) %>%
  arrange(desc(pct_seen))

## # A tibble: 12 x 2
##    movie              pct_seen
##    <chr>                 <dbl>
##  1 the_big_lebowski      0.715
##  2 airplane              0.646
##  3 office_space          0.633
##  4 anchorman             0.627
##  5 superbad              0.608
##  6 bridesmaids           0.582
##  7 borat                 0.563
##  8 the_blues_brothers    0.563
##  9 animal_house          0.430
## 10 old_school            0.430
## 11 this_is_spinal_tap    0.367
## 12 tommy_boy             0.323

lil_seen_wide <- lil_seen %>%
  pivot_wider(names_from = movie, values_from = seen_num) %>%
  select(-resp_id)

lil_seen_wide <- as.data.frame(lil_seen_wide)
UpSetR::upset(lil_seen_wide, nsets = 13, order.by = "freq")

Let’s briefly pretend we’re looking at just three movies: Bridesmaids, Anchorman, and Airplane.

three_movies <- c("airplane", "anchorman", "bridesmaids")

movie_long %>%
  filter(movie %in% three_movies)

## # A tibble: 474 x 4
##    resp_id   age movie       seen 
##    <chr>   <dbl> <chr>       <lgl>
##  1 1          48 airplane    TRUE 
##  2 1          48 anchorman   TRUE 
##  3 1          48 bridesmaids TRUE 
##  4 2          31 airplane    FALSE
##  5 2          31 anchorman   TRUE 
##  6 2          31 bridesmaids TRUE 
##  7 3          30 airplane    FALSE
##  8 3          30 anchorman   FALSE
##  9 3          30 bridesmaids FALSE
## 10 4          20 airplane    FALSE
## # … with 464 more rows

For the wide version, we’ll add the respondent id and age to the list of variables we want to look at.

vars_wanted <- c("resp_id", "age", three_movies)

movie_results %>%
  select(vars_wanted)

## Note: Using an external vector in selections is brittle.
## ℹ If the data contains `vars_wanted` it will be selected instead.
## ℹ Use `all_of(vars_wanted)` instead of just `vars_wanted` to silence this message.

## # A tibble: 158 x 5
##    resp_id   age airplane anchorman bridesmaids
##    <chr>   <dbl> <lgl>    <lgl>     <lgl>      
##  1 1          48 TRUE     TRUE      TRUE       
##  2 2          31 FALSE    TRUE      TRUE       
##  3 3          30 FALSE    FALSE     FALSE      
##  4 4          20 FALSE    FALSE     FALSE      
##  5 5          29 FALSE    TRUE      TRUE       
##  6 6          28 FALSE    TRUE      FALSE      
##  7 7          24 TRUE     TRUE      FALSE      
##  8 8          42 TRUE     FALSE     FALSE      
##  9 9          33 TRUE     TRUE      TRUE       
## 10 10         34 TRUE     TRUE      TRUE       
## # … with 148 more rows

Note the change in “shape” of our data, though the contents remain the same: