📺 OK, not really!
This is actually about tidyr::pivot_*() functions, which you can learn all about in the new tidyr version 1.0.0 Pivoting vignette!
But, I will be using some nifty TV-related data. So, I’m sticking with the name.
library(tidyverse)
sheet <- googlesheets::gs_title("bobs_burgers_survey_results")
bobs_ws <- googlesheets::gs_ws_ls(sheet)
raw_dat <- sheet %>%
googlesheets::gs_read(ws = glue::glue("{bobs_ws}"))
## Accessing worksheet titled 'Form Responses 1'.
## Parsed with column specification:
## cols(
## Timestamp = col_character(),
## `Members of the Belcher family with whom I identify (select all that apply)` = col_character()
## )
belcher_results <- tibble::rowid_to_column(raw_dat, "resp_id") %>%
dplyr::rename("response" = `Members of the Belcher family with whom I identify (select all that apply)`) %>%
dplyr::select(-Timestamp)
belcher_results
## # A tibble: 290 x 2
## resp_id response
## <int> <chr>
## 1 1 Linda, Tina, Louise
## 2 2 Bob, Gene
## 3 3 Bob, Tina, Gene
## 4 4 Bob, Linda, Tina, Gene, Louise
## 5 5 Bob
## 6 6 Louise
## 7 7 Bob
## 8 8 Bob
## 9 9 Bob, Gene
## 10 10 Bob, Linda, Tina
## # … with 280 more rows
Because I used a Google Form to collect this data, I don’t have to worry about order of names, since they come out the same every time.
agg_results <- belcher_results %>%
dplyr::group_by(response) %>%
dplyr::summarise(total = n()) %>%
dplyr::arrange(desc(total))
agg_results
## # A tibble: 30 x 2
## response total
## <chr> <int>
## 1 Bob 53
## 2 Tina 31
## 3 Louise 24
## 4 Bob, Tina 22
## 5 Bob, Louise 19
## 6 Bob, Tina, Louise 18
## 7 Tina, Louise 13
## 8 Bob, Linda, Tina, Gene, Louise 12
## 9 Bob, Tina, Gene 10
## 10 Bob, Gene 9
## # … with 20 more rows
tidyr::separate_rows()Let’s make them long with tidyr::separate_rows(). I’m also adding a numeric variable, identify (short for “character(s) with whom I identify”), and ensuring that respondent IDs don’t get erroneously treated as numeric, by converting them to characters.
belcher_results <- belcher_results %>%
tidyr::separate_rows(response) %>%
dplyr::mutate(identify = 1,
resp_id = as.character(resp_id))
## Note: Using an external vector in selections is brittle.
## ℹ If the data contains `vars` it will be selected instead.
## ℹ Use `all_of(vars)` instead of just `vars` to silence this message.
head(belcher_results)
## # A tibble: 6 x 3
## resp_id response identify
## <chr> <chr> <dbl>
## 1 1 Linda 1
## 2 1 Tina 1
## 3 1 Louise 1
## 4 2 Bob 1
## 5 2 Gene 1
## 6 3 Bob 1
tidyr::pivot_wider()Now we’ll take one of the “new” tidyr verbs for a spin, pivot_wider(). Rather than fill things out with a bunch of NAs, we’ll prepare our data to use with the UpSetR package by turning it into binaries, and ditch the respondent ID in the end.
binary_tib <- belcher_results %>%
tidyr::pivot_wider(
names_from = response,
values_from = identify,
values_fill = list(identify = 0)
) %>%
dplyr::select(-resp_id)
head(binary_tib)
## # A tibble: 6 x 5
## Linda Tina Louise Bob Gene
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 1 1 0 0
## 2 0 0 0 1 1
## 3 0 1 0 1 1
## 4 1 1 1 1 1
## 5 0 0 0 1 0
## 6 0 0 1 0 0
UpSetR::upset()I highly recommend Paul Campbell’s code-through using UpSetR, which gave me (among other things) the pro tip that upset() does not like tibbles (hence the as.data.frame() at the end).
binary_df <- as.data.frame(binary_tib)
UpSetR::upset(binary_df, nsets = 5, order.by = "freq")
Let’s take a look at another dataset I collected with a quick survey, this one asking people whether they had or had not seen a given movie.
sheet <- googlesheets::gs_title("seen_this_movie")
## Sheet successfully identified: "seen_this_movie"
movie_ws <- googlesheets::gs_ws_ls(sheet)
raw_dat <- sheet %>%
googlesheets::gs_read(ws = glue::glue("{movie_ws}"))
## Accessing worksheet titled 'Form Responses 1'.
## Parsed with column specification:
## cols(
## Timestamp = col_character(),
## `Airplane!` = col_logical(),
## Anchorman = col_logical(),
## `Animal House` = col_logical(),
## `The Big Lebowski` = col_logical(),
## `The Blues Brothers` = col_logical(),
## Borat = col_logical(),
## Bridesmaids = col_logical(),
## `Office Space` = col_logical(),
## `Old School` = col_logical(),
## `This Is Spinal Tap` = col_logical(),
## `Tommy Boy` = col_logical(),
## Superbad = col_logical(),
## `My age is…` = col_double()
## )
First I’ll use tibble::rowid_to_column() to keep track of which user said what (and, again, make sure those are stored as characters rather than numeric, so I don’t accidentally analyze it the wrong way).
I still don’t care about the timestamp, so I’ll get everything but that using dplyr::select(-Timestamp).
The variable names here have a whole bunch of annoying characters (e.g. spaces, exclamation points). So, I’m going to use Sam Firke’s janitor package — specifically the janitor::clean_names() function — to take care of some of the grunt work for me there.
I’m also going to rename my_age_is to age, just because it’s annoying.
movie_results <- tibble::rowid_to_column(raw_dat, "resp_id") %>%
dplyr::mutate(resp_id = as.character(resp_id)) %>%
dplyr::select(-Timestamp) %>%
janitor::clean_names() %>%
dplyr::rename("age" = my_age_is)
head(movie_results)
## # A tibble: 6 x 14
## resp_id airplane anchorman animal_house the_big_lebowski the_blues_broth…
## <chr> <lgl> <lgl> <lgl> <lgl> <lgl>
## 1 1 TRUE TRUE TRUE FALSE TRUE
## 2 2 FALSE TRUE TRUE TRUE FALSE
## 3 3 FALSE FALSE FALSE TRUE FALSE
## 4 4 FALSE FALSE FALSE FALSE FALSE
## 5 5 FALSE TRUE FALSE FALSE FALSE
## 6 6 FALSE TRUE FALSE TRUE TRUE
## # … with 8 more variables: borat <lgl>, bridesmaids <lgl>,
## # office_space <lgl>, old_school <lgl>, this_is_spinal_tap <lgl>,
## # tommy_boy <lgl>, superbad <lgl>, age <dbl>
Since this dataset is wide, let’s also take a gander at it using glimpse():
glimpse(movie_results)
## Observations: 158
## Variables: 14
## $ resp_id <chr> "1", "2", "3", "4", "5", "6", "7", "8", "9", …
## $ airplane <lgl> TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE…
## $ anchorman <lgl> TRUE, TRUE, FALSE, FALSE, TRUE, TRUE, TRUE, F…
## $ animal_house <lgl> TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE…
## $ the_big_lebowski <lgl> FALSE, TRUE, TRUE, FALSE, FALSE, TRUE, TRUE, …
## $ the_blues_brothers <lgl> TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE…
## $ borat <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE…
## $ bridesmaids <lgl> TRUE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE,…
## $ office_space <lgl> FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALS…
## $ old_school <lgl> FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALS…
## $ this_is_spinal_tap <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALS…
## $ tommy_boy <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
## $ superbad <lgl> TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, TRUE, …
## $ age <dbl> 48, 31, 30, 20, 29, 28, 24, 42, 33, 34, 33, 3…
tidyr::pivot_longer()Since I only want to elongate the movies, and all of those columns are logical, I’ll first select which columns I want by using select_if() and is.logical() as the predicate function.
logicols <- select_if(movie_results, is.logical) %>%
colnames()
movie_long <- movie_results %>%
pivot_longer(
cols = one_of(logicols),
names_to = "movie",
values_to = "seen"
)
head(movie_long)
## # A tibble: 6 x 4
## resp_id age movie seen
## <chr> <dbl> <chr> <lgl>
## 1 1 48 airplane TRUE
## 2 1 48 anchorman TRUE
## 3 1 48 animal_house TRUE
## 4 1 48 the_big_lebowski FALSE
## 5 1 48 the_blues_brothers TRUE
## 6 1 48 borat FALSE
Aside: A nice little trick for recoding TRUE/FALSE as 0 and 1, just use as.numeric().
movie_bin <- movie_long %>%
mutate(seen_num = as.numeric(seen))
movie_bin
## # A tibble: 1,896 x 5
## resp_id age movie seen seen_num
## <chr> <dbl> <chr> <lgl> <dbl>
## 1 1 48 airplane TRUE 1
## 2 1 48 anchorman TRUE 1
## 3 1 48 animal_house TRUE 1
## 4 1 48 the_big_lebowski FALSE 0
## 5 1 48 the_blues_brothers TRUE 1
## 6 1 48 borat FALSE 0
## 7 1 48 bridesmaids TRUE 1
## 8 1 48 office_space FALSE 0
## 9 1 48 old_school FALSE 0
## 10 1 48 this_is_spinal_tap FALSE 0
## # … with 1,886 more rows
lil_seen <- movie_bin %>%
select(resp_id, movie, seen_num)
movie_bin %>%
group_by(movie) %>%
summarise(pct_seen = sum(seen_num) / n()) %>%
arrange(desc(pct_seen))
## # A tibble: 12 x 2
## movie pct_seen
## <chr> <dbl>
## 1 the_big_lebowski 0.715
## 2 airplane 0.646
## 3 office_space 0.633
## 4 anchorman 0.627
## 5 superbad 0.608
## 6 bridesmaids 0.582
## 7 borat 0.563
## 8 the_blues_brothers 0.563
## 9 animal_house 0.430
## 10 old_school 0.430
## 11 this_is_spinal_tap 0.367
## 12 tommy_boy 0.323
lil_seen_wide <- lil_seen %>%
pivot_wider(names_from = movie, values_from = seen_num) %>%
select(-resp_id)
lil_seen_wide <- as.data.frame(lil_seen_wide)
UpSetR::upset(lil_seen_wide, nsets = 13, order.by = "freq")
Let’s briefly pretend we’re looking at just three movies: Bridesmaids, Anchorman, and Airplane.
three_movies <- c("airplane", "anchorman", "bridesmaids")
movie_long %>%
filter(movie %in% three_movies)
## # A tibble: 474 x 4
## resp_id age movie seen
## <chr> <dbl> <chr> <lgl>
## 1 1 48 airplane TRUE
## 2 1 48 anchorman TRUE
## 3 1 48 bridesmaids TRUE
## 4 2 31 airplane FALSE
## 5 2 31 anchorman TRUE
## 6 2 31 bridesmaids TRUE
## 7 3 30 airplane FALSE
## 8 3 30 anchorman FALSE
## 9 3 30 bridesmaids FALSE
## 10 4 20 airplane FALSE
## # … with 464 more rows
For the wide version, we’ll add the respondent id and age to the list of variables we want to look at.
vars_wanted <- c("resp_id", "age", three_movies)
movie_results %>%
select(vars_wanted)
## Note: Using an external vector in selections is brittle.
## ℹ If the data contains `vars_wanted` it will be selected instead.
## ℹ Use `all_of(vars_wanted)` instead of just `vars_wanted` to silence this message.
## # A tibble: 158 x 5
## resp_id age airplane anchorman bridesmaids
## <chr> <dbl> <lgl> <lgl> <lgl>
## 1 1 48 TRUE TRUE TRUE
## 2 2 31 FALSE TRUE TRUE
## 3 3 30 FALSE FALSE FALSE
## 4 4 20 FALSE FALSE FALSE
## 5 5 29 FALSE TRUE TRUE
## 6 6 28 FALSE TRUE FALSE
## 7 7 24 TRUE TRUE FALSE
## 8 8 42 TRUE FALSE FALSE
## 9 9 33 TRUE TRUE TRUE
## 10 10 34 TRUE TRUE TRUE
## # … with 148 more rows
Note the change in “shape” of our data, though the contents remain the same: