Explore your data

Import your data

Read a CSV file

In this exercise, you’ll use read_csv() twice. The first time you will only specify the filename, but you’ll notice a problem with the imported data. The second time you’ll use a new argument called skip to fix the problem. Remember to use ?read_csv to read more about arguments like skip and how to use them.

The data you’ll work with is from “The Great British Bake-Off.” The file “bakeoff.csv” contains data for each episode of the show, organized by series and baker.

Tidyverse chearsheet

# Load readr
library(here)

## here() starts at C:/Users/Takunda Glendice/Documents/max/working with data in tidyverse

library(readr)

# Create bakeoff from "bakeoff.csv"
bakeoff <- read_csv("data/bakeoff.csv")

## 
## -- Column specification --------------------------------------------------------
## cols(
##   series = col_double(),
##   episode = col_double(),
##   baker = col_character(),
##   signature = col_character(),
##   technical = col_double(),
##   showstopper = col_character(),
##   result = col_character(),
##   uk_airdate = col_date(format = ""),
##   us_season = col_double(),
##   us_airdate = col_date(format = "")
## )

# Print bakeoff
bakeoff

## # A tibble: 549 x 10
##    series episode baker signature technical showstopper result uk_airdate
##     <dbl>   <dbl> <chr> <chr>         <dbl> <chr>       <chr>  <date>    
##  1      1       1 Anne~ "Light J~         2 "Chocolate~ IN     2010-08-17
##  2      1       1 David "Chocola~         3 "Black For~ IN     2010-08-17
##  3      1       1 Edd   "Caramel~         1  <NA>       IN     2010-08-17
##  4      1       1 Jasm~ "Fresh M~        NA  <NA>       IN     2010-08-17
##  5      1       1 Jona~ "Carrot ~         9 "Three-lay~ IN     2010-08-17
##  6      1       1 Loui~ "Carrot ~        NA "Never Fai~ IN     2010-08-17
##  7      1       1 Mira~ "Triple ~         8 "Three Tie~ IN     2010-08-17
##  8      1       1 Ruth  "Lemon D~        NA "Classic C~ IN     2010-08-17
##  9      1       1 Lea   "Cranber~        10 "Chocolate~ OUT    2010-08-17
## 10      1       1 Mark  "Sticky ~        NA "Heart-sha~ OUT    2010-08-17
## # ... with 539 more rows, and 2 more variables: us_season <dbl>,
## #   us_airdate <date>

# Create bakeoff but skip first row in case needed
# bakeoff <- read_csv("data/bakeoff.csv", skip = 1)

Our original import had 3 problems: we had 1 extra observation, our first row stored the true variable names, and all column types were characters. Adding skip = 1 worked because the default argument for col_names is TRUE.

Assign missing values

The read_csv() function also has an na argument, which allows you to specify value(s) that represent missing values in your data. The default values for the na argument are c("“,”NA“), so both are recoded as missing (NA) in R. When you read in data, you can add additional values like the string”UNKNOWN" to a vector of missing values using the c() function to combine multiple values into a single vector.

The is.na() function is also helpful for identifying rows with missing values for a variable.

# Load dplyr
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

# Filter rows where showstopper is UNKNOWN 
bakeoff %>% 
    filter(showstopper == "UNKNOWN")

## # A tibble: 0 x 10
## # ... with 10 variables: series <dbl>, episode <dbl>, baker <chr>,
## #   signature <chr>, technical <dbl>, showstopper <chr>, result <chr>,
## #   uk_airdate <date>, us_season <dbl>, us_airdate <date>

# Edit to add list of missing values
bakeoff <- read_csv("data/bakeoff.csv", na = c("", "NA", "UNKNOWN"))

## 
## -- Column specification --------------------------------------------------------
## cols(
##   series = col_double(),
##   episode = col_double(),
##   baker = col_character(),
##   signature = col_character(),
##   technical = col_double(),
##   showstopper = col_character(),
##   result = col_character(),
##   uk_airdate = col_date(format = ""),
##   us_season = col_double(),
##   us_airdate = col_date(format = "")
## )

# Filter rows where showstopper is NA 
bakeoff %>%
  filter(is.na(showstopper))

## # A tibble: 21 x 10
##    series episode baker signature technical showstopper result uk_airdate
##     <dbl>   <dbl> <chr> <chr>         <dbl> <chr>       <chr>  <date>    
##  1      1       1 Edd   "Caramel~         1 <NA>        IN     2010-08-17
##  2      1       1 Jasm~ "Fresh M~        NA <NA>        IN     2010-08-17
##  3      1       6 Mira~ "Lemon C~        NA <NA>        RUNNE~ 2010-09-21
##  4      2       1 Ian   "Apple a~        10 <NA>        IN     2011-08-16
##  5      2       1 Jason "Lemon M~         6 <NA>        IN     2011-08-16
##  6      2       1 Urva~ "Cherry ~         7 <NA>        IN     2011-08-16
##  7      2       1 Yasm~ "Cardamo~         5 <NA>        IN     2011-08-16
##  8      2       1 Holly "Cherry ~         1 <NA>        SB     2011-08-16
##  9      2       2 Ben   "Chorizo~         1 <NA>        IN     2011-08-23
## 10      2       2 Ian   "Stilton~         2 <NA>        IN     2011-08-23
## # ... with 11 more rows, and 2 more variables: us_season <dbl>,
## #   us_airdate <date>

We only had 4 missing values for the showstopper variable to start, but now all 21 are present and accounted for.

Know your data

Arrange and glimpse

You can combine glimpse() with other functions in a sequence using the pipe (%>%) operator. For example, you can use other dplyr functions like arrange first, then use glimpse by adding a line after the final pipe (%>%):

bakers_mini %>% 
  arrange(age) %>% 
  glimpse() # no argument needed here

Take a glimpse of the bakeoff data we imported in the first set of exercises. On which date did the first episode of the show air in the US?

library(skimr)
# Filter and skim
bakeoff %>%
  filter(!is.na(us_season)) %>% 
  skim()

Data summary
Name	Piped data
Number of rows	302
Number of columns	10
_______________________
Column type frequency:
character	4
Date	2
numeric	4
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	n_unique
baker	0	1	3	9	48
signature	1	1	10	125	300
showstopper	1	1	5	126	298
result	0	1	2	9	6

Variable type: Date

skim_variable	n_missing	complete_rate	min	max	median	n_unique
uk_airdate	0	1	2013-08-20	2016-10-26	2014-10-08	40
us_airdate	0	1	2014-12-28	2017-08-04	2015-11-08	35

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
series	0	1	5.49	1.13	4	4	5	6	7	▇▇▁▇▇
episode	0	1	4.37	2.66	1	2	4	6	10	▇▆▅▃▂
technical	1	1	4.85	2.98	1	2	4	7	13	▇▅▅▂▁
us_season	0	1	2.49	1.13	1	1	2	3	4	▇▇▁▇▇

The first episode of the first US season aired on December 28, 2014.

Summarize your data

You can combine skim() with other functions in a sequence using the pipe (%>%) operator. For example, you could use other dplyr functions like group_by first, then use skim() by adding a line after the final pipe.

bakers_mini %>% 
  group_by(series) %>% 
  skim() # no argument needed here

This will produce summary statistics for each series. Let’s practice this!

# Load skimr
library(skimr)

bakeoff %>% 
  group_by(series) %>% 
  skim() # no argument needed here

## Warning in min.default(structure(c(NA_real_, NA_real_, NA_real_, NA_real_, : no
## non-missing arguments to min; returning Inf

## Warning in min.default(structure(c(NA_real_, NA_real_, NA_real_, NA_real_, : no
## non-missing arguments to min; returning Inf

## Warning in min.default(structure(c(NA_real_, NA_real_, NA_real_, NA_real_, : no
## non-missing arguments to min; returning Inf

## Warning in min.default(structure(c(NA_real_, NA_real_, NA_real_, NA_real_, : no
## non-missing arguments to min; returning Inf

## Warning in max.default(structure(c(NA_real_, NA_real_, NA_real_, NA_real_, : no
## non-missing arguments to max; returning -Inf

## Warning in max.default(structure(c(NA_real_, NA_real_, NA_real_, NA_real_, : no
## non-missing arguments to max; returning -Inf

## Warning in max.default(structure(c(NA_real_, NA_real_, NA_real_, NA_real_, : no
## non-missing arguments to max; returning -Inf

## Warning in max.default(structure(c(NA_real_, NA_real_, NA_real_, NA_real_, : no
## non-missing arguments to max; returning -Inf

Data summary
Name	Piped data
Number of rows	549
Number of columns	10
_______________________
Column type frequency:
character	4
Date	2
numeric	3
________________________
Group variables	series

Variable type: character

skim_variable	series	n_missing	complete_rate	min	max	n_unique
baker	1	0	1.00	3	9	10
baker	2	0	1.00	3	9	12
baker	3	0	1.00	4	10	12
baker	4	0	1.00	3	9	13
baker	5	0	1.00	4	7	12
baker	6	0	1.00	3	6	12
baker	7	0	1.00	3	9	12
baker	8	0	1.00	3	6	12
signature	1	0	1.00	8	63	36
signature	2	0	1.00	9	69	59
signature	3	0	1.00	11	119	76
signature	4	0	1.00	10	125	78
signature	5	1	0.99	15	107	73
signature	6	0	1.00	12	64	74
signature	7	0	1.00	12	93	75
signature	8	0	1.00	12	64	75
showstopper	1	3	0.92	13	197	33
showstopper	2	10	0.83	17	118	50
showstopper	3	7	0.91	12	70	69
showstopper	4	0	1.00	5	126	78
showstopper	5	1	0.99	8	82	73
showstopper	6	0	1.00	10	70	73
showstopper	7	0	1.00	5	86	74
showstopper	8	0	1.00	13	68	75
result	1	0	1.00	2	9	4
result	2	0	1.00	2	9	5
result	3	0	1.00	2	9	5
result	4	0	1.00	2	9	5
result	5	0	1.00	2	9	6
result	6	0	1.00	2	9	5
result	7	0	1.00	2	9	5
result	8	0	1.00	2	9	5

Variable type: Date

skim_variable	series	n_missing	complete_rate	min	max	median	n_unique
uk_airdate	1	0	1	2010-08-17	2010-09-21	2010-08-27	6
uk_airdate	2	0	1	2011-08-16	2011-10-04	2011-08-30	8
uk_airdate	3	0	1	2012-08-14	2012-10-16	2012-09-04	10
uk_airdate	4	0	1	2013-08-20	2013-10-22	2013-09-10	10
uk_airdate	5	0	1	2014-08-06	2014-10-08	2014-08-27	10
uk_airdate	6	0	1	2015-08-05	2015-10-07	2015-08-26	10
uk_airdate	7	0	1	2016-08-24	2016-10-26	2016-09-14	10
uk_airdate	8	0	1	2017-08-29	2017-10-31	2017-09-19	10
us_airdate	1	36	0	NA	NA	NA	0
us_airdate	2	60	0	NA	NA	NA	0
us_airdate	3	76	0	NA	NA	NA	0
us_airdate	4	0	1	2014-12-28	2015-03-01	2015-01-18	10
us_airdate	5	0	1	2015-09-06	2015-11-08	2015-09-27	10
us_airdate	6	0	1	2016-07-01	2016-08-12	2016-07-15	7
us_airdate	7	0	1	2017-06-16	2017-08-04	2017-06-30	8
us_airdate	8	75	0	NA	NA	NA	0

Variable type: numeric

skim_variable	series	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
episode	1	0	1.00	2.83	1.65	1	1.00	2.5	4.00	6	▇▂▂▂▁
episode	2	0	1.00	3.55	2.10	1	2.00	3.0	5.00	8	▇▃▅▂▂
episode	3	0	1.00	4.43	2.67	1	2.00	4.0	6.25	10	▇▆▅▅▂
episode	4	0	1.00	4.31	2.66	1	2.00	4.0	6.00	10	▇▆▅▃▂
episode	5	0	1.00	4.38	2.68	1	2.00	4.0	6.00	10	▇▆▅▃▂
episode	6	0	1.00	4.40	2.67	1	2.00	4.0	6.00	10	▇▆▅▃▂
episode	7	0	1.00	4.40	2.67	1	2.00	4.0	6.00	10	▇▆▅▃▂
episode	8	0	1.00	4.40	2.67	1	2.00	4.0	6.00	10	▇▆▅▃▂
technical	1	7	0.81	3.97	2.56	1	2.00	3.0	5.00	10	▇▇▃▂▂
technical	2	0	1.00	4.90	3.03	1	2.00	4.0	7.00	12	▇▅▃▂▂
technical	3	0	1.00	4.82	2.92	1	2.00	4.0	7.00	12	▇▅▃▂▂
technical	4	0	1.00	5.08	3.19	1	2.25	4.5	7.00	13	▇▅▅▂▂
technical	5	1	0.99	4.73	2.93	1	2.00	4.0	7.00	12	▇▅▃▂▂
technical	6	0	1.00	4.80	2.92	1	2.00	4.0	7.00	12	▇▅▃▂▂
technical	7	0	1.00	4.80	2.92	1	2.00	4.0	7.00	12	▇▅▃▂▂
technical	8	0	1.00	4.80	2.92	1	2.00	4.0	7.00	12	▇▅▃▂▂
us_season	1	36	0.00	NaN	NA	NA	NA	NA	NA	NA
us_season	2	60	0.00	NaN	NA	NA	NA	NA	NA	NA
us_season	3	76	0.00	NaN	NA	NA	NA	NA	NA	NA
us_season	4	0	1.00	1.00	0.00	1	1.00	1.0	1.00	1	▁▁▇▁▁
us_season	5	0	1.00	2.00	0.00	2	2.00	2.0	2.00	2	▁▁▇▁▁
us_season	6	0	1.00	3.00	0.00	3	3.00	3.0	3.00	3	▁▁▇▁▁
us_season	7	0	1.00	4.00	0.00	4	4.00	4.0	4.00	4	▁▁▇▁▁
us_season	8	75	0.00	NaN	NA	NA	NA	NA	NA	NA

# Edit to filter, group by, and skim
bakeoff %>% 
  filter(!is.na(us_season)) %>% 
  group_by(us_season)  %>% 
  skim()

Data summary
Name	Piped data
Number of rows	302
Number of columns	10
_______________________
Column type frequency:
character	4
Date	2
numeric	3
________________________
Group variables	us_season

Variable type: character

skim_variable	us_season	n_missing	complete_rate	min	max	n_unique
baker	1	0	1.00	3	9	13
baker	2	0	1.00	4	7	12
baker	3	0	1.00	3	6	12
baker	4	0	1.00	3	9	12
signature	1	0	1.00	10	125	78
signature	2	1	0.99	15	107	73
signature	3	0	1.00	12	64	74
signature	4	0	1.00	12	93	75
showstopper	1	0	1.00	5	126	78
showstopper	2	1	0.99	8	82	73
showstopper	3	0	1.00	10	70	73
showstopper	4	0	1.00	5	86	74
result	1	0	1.00	2	9	5
result	2	0	1.00	2	9	6
result	3	0	1.00	2	9	5
result	4	0	1.00	2	9	5

Variable type: Date

skim_variable	us_season	complete_rate	min	max	median	n_unique
uk_airdate	1	1	2013-08-20	2013-10-22	2013-09-10	10
uk_airdate	2	1	2014-08-06	2014-10-08	2014-08-27	10
uk_airdate	3	1	2015-08-05	2015-10-07	2015-08-26	10
uk_airdate	4	1	2016-08-24	2016-10-26	2016-09-14	10
us_airdate	1	1	2014-12-28	2015-03-01	2015-01-18	10
us_airdate	2	1	2015-09-06	2015-11-08	2015-09-27	10
us_airdate	3	1	2016-07-01	2016-08-12	2016-07-15	7
us_airdate	4	1	2017-06-16	2017-08-04	2017-06-30	8

Variable type: numeric

skim_variable	us_season	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
series	1	0	1.00	4.00	0.00	4	4.00	4.0	4	4	▁▁▇▁▁
series	2	0	1.00	5.00	0.00	5	5.00	5.0	5	5	▁▁▇▁▁
series	3	0	1.00	6.00	0.00	6	6.00	6.0	6	6	▁▁▇▁▁
series	4	0	1.00	7.00	0.00	7	7.00	7.0	7	7	▁▁▇▁▁
episode	1	0	1.00	4.31	2.66	1	2.00	4.0	6	10	▇▆▅▃▂
episode	2	0	1.00	4.38	2.68	1	2.00	4.0	6	10	▇▆▅▃▂
episode	3	0	1.00	4.40	2.67	1	2.00	4.0	6	10	▇▆▅▃▂
episode	4	0	1.00	4.40	2.67	1	2.00	4.0	6	10	▇▆▅▃▂
technical	1	0	1.00	5.08	3.19	1	2.25	4.5	7	13	▇▅▅▂▂
technical	2	1	0.99	4.73	2.93	1	2.00	4.0	7	12	▇▅▃▂▂
technical	3	0	1.00	4.80	2.92	1	2.00	4.0	7	12	▇▅▃▂▂
technical	4	0	1.00	4.80	2.92	1	2.00	4.0	7	12	▇▅▃▂▂

Nice piping & skimming! Producing quick grouped summaries with the skim function is a real time (and sanity) saver.

Know your variable types

How many variables of each type do we have in the bakeoff data? You may also want to try piping a skimmed object to summary(), also from the skimr package:

tibble_name %>% 
  skim() %>%  # no argument needed here
  summary() # no argument needed here

bakeoff %>% 
  skim() %>%  # no argument needed here
  summary() # no argument needed here

Data summary
Name	Piped data
Number of rows	549
Number of columns	10
_______________________
Column type frequency:
character	4
Date	2
numeric	4
________________________
Group variables	None

Count with your data

In every episode of “The Great British Bake-Off”, bakers complete 3 challenges and the show’s judges award the title “Star Baker” to the baker who excelled in that week’s challenges (with the exception of the finale). Each baker’s result for every episode is stored in bakeoff- result is a character variable, and the value “SB” stands for star baker.

Distinct and count

# View distinct results
bakeoff %>%
  distinct(result)

## # A tibble: 6 x 1
##   result   
##   <chr>    
## 1 IN       
## 2 OUT      
## 3 RUNNER UP
## 4 WINNER   
## 5 SB       
## 6 LEFT

# Count whether or not star baker
bakeoff %>% 
  count(result)

## # A tibble: 6 x 2
##   result        n
##   <chr>     <int>
## 1 IN          393
## 2 LEFT          1
## 3 OUT          70
## 4 RUNNER UP    16
## 5 SB           61
## 6 WINNER        8

# Count whether or not star baker
bakeoff %>% 
  count(result == "SB")

## # A tibble: 2 x 2
##   `result == "SB"`     n
##   <lgl>            <int>
## 1 FALSE              488
## 2 TRUE                61

There are 488 results that are not star baker.

Count episodes

Here’s how we can use count() back-to-back to roll up a level of detail across series:

bakers %>% 
  count(aired_us, series) %>% 
  count(aired_us)

Let’s practice counting the number of episodes per series.

bakeoff %>% 
  count(series, episode) %>% 
  count(series)

## # A tibble: 8 x 2
##   series     n
##    <dbl> <int>
## 1      1     6
## 2      2     8
## 3      3    10
## 4      4    10
## 5      5    10
## 6      6    10
## 7      7    10
## 8      8    10

You can see that by series 3 we were treated to 10 episodes with each new series!

Count bakers

We’ll practice that again, focusing on counting bakers.

# Count the number of rows by series and baker
bakers_by_series <- bakeoff %>%
                      count(series, baker)


# Print to view
bakers_by_series

## # A tibble: 95 x 3
##    series baker         n
##     <dbl> <chr>     <int>
##  1      1 Annetha       2
##  2      1 David         4
##  3      1 Edd           6
##  4      1 Jasminder     5
##  5      1 Jonathan      3
##  6      1 Lea           1
##  7      1 Louise        2
##  8      1 Mark          1
##  9      1 Miranda       6
## 10      1 Ruth          6
## # ... with 85 more rows

# Count the number of rows by series and baker
bakers_by_series <- bakeoff %>% 
  count(series, baker)
  
# Print to view
bakers_by_series

## # A tibble: 95 x 3
##    series baker         n
##     <dbl> <chr>     <int>
##  1      1 Annetha       2
##  2      1 David         4
##  3      1 Edd           6
##  4      1 Jasminder     5
##  5      1 Jonathan      3
##  6      1 Lea           1
##  7      1 Louise        2
##  8      1 Mark          1
##  9      1 Miranda       6
## 10      1 Ruth          6
## # ... with 85 more rows

# Count again by series
bakers_by_series %>%
  count(series)

## # A tibble: 8 x 2
##   series     n
##    <dbl> <int>
## 1      1    10
## 2      2    12
## 3      3    12
## 4      4    13
## 5      5    12
## 6      6    12
## 7      7    12
## 8      8    12

# Count again by baker
bakers_by_series %>%
  count(baker, sort = TRUE)

## # A tibble: 86 x 2
##    baker      n
##    <chr>  <int>
##  1 Kate       3
##  2 Ian        2
##  3 James      2
##  4 Louise     2
##  5 Mark       2
##  6 Peter      2
##  7 Robert     2
##  8 Tom        2
##  9 Ali        1
## 10 Alvin      1
## # ... with 76 more rows

Series 4 had the most bakers with 13 total (a baker’s dozen!). Kate is the most popular name- Kates have appeared in 3 different series of the TV show.

Plot counts

You can learn a lot about your data by counting, but sometimes you can learn even more by plotting counts. This is especially true when you have lots of things to count! With eight series, 74 episodes, and 95 bakers, a plot can be more helpful than a table of numbers. We’ll use ggplot2 to visualize the number of bakers across episodes for each series from bakeoff.

library(ggplot2)

ggplot(bakeoff, aes(episode)) + 
    geom_bar() + 
    facet_wrap(~series)

You made it through the first chapter! We have a lot of work left to do with The Great British Bake Off data, but notice how this plot is a great sanity check: the number of bakers tends to go down with every episode (never up!), with a few exceptions where no one was eliminated. Also, you can see that there are fewer episodes in the first two series. Finally, you can see that there are always three bakers who make it to the series finale.

Tame your data

Cast column types

Cast a column to a date

A good workflow for parsing dates using readr is to, for example:

Use parse_date(“2012-14-08”, format = “%Y-%d-%m”) first, then
Use col_date(format = “%Y-%d-%m”) within cols() as the col_types argument of read_csv().

In “desserts.csv”, the variable uk_airdate is formatted like “17 August 2010”. Let’s parse, then cast this variable!

Later on in Chapter 4, we’ll showcase some functions from the lubridate package to help you extract data from dates, once they are cast properly.

# Find format to parse uk_airdate 
parse_date("17 August 2010", format = "%d %B %Y")

## [1] "2010-08-17"

# Edit to cast uk_airdate
desserts <- read_csv("data/desserts.csv", 
                     col_types = cols(
                       uk_airdate = col_date(format = "%d %B %Y")
                     )
                    )

# Arrange by descending uk_airdate
desserts %>% 
    arrange(desc(uk_airdate))

## # A tibble: 549 x 16
##    series episode baker technical result uk_airdate us_season us_airdate
##     <dbl>   <dbl> <chr> <chr>     <chr>  <date>         <dbl> <date>    
##  1      8      10 Kate  3rd       RUNNE~ 2017-10-31        NA NA        
##  2      8      10 Stev~ 1st       RUNNE~ 2017-10-31        NA NA        
##  3      8      10 Soph~ 2nd       WINNER 2017-10-31        NA NA        
##  4      8       9 Kate  4th       IN     2017-10-24        NA NA        
##  5      8       9 Stev~ 3rd       IN     2017-10-24        NA NA        
##  6      8       9 Stac~ 2nd       OUT    2017-10-24        NA NA        
##  7      8       9 Soph~ 1st       SB     2017-10-24        NA NA        
##  8      8       8 Kate  2nd       IN     2017-10-17        NA NA        
##  9      8       8 Soph~ 4th       IN     2017-10-17        NA NA        
## 10      8       8 Stev~ 1st       IN     2017-10-17        NA NA        
## # ... with 539 more rows, and 8 more variables: showstopper_chocolate <chr>,
## #   showstopper_dessert <chr>, showstopper_fruit <chr>, showstopper_nut <chr>,
## #   signature_chocolate <chr>, signature_dessert <chr>, signature_fruit <chr>,
## #   signature_nut <chr>

Notice that us_airdate didn’t need to be cast - this is because the date format %Y-%m-%d is unambiguous, so it is automatically parsed as a date by readr.

Cast a column to a number

We saw a good workflow for parsing columns using readr:

Use parse_number() to practice, then
Use col_number() to cast.

But sometimes you’ll need to start with casting, then diagnose parsing problems using a new readr function called problems(). Using problems() on a result of read_csv() will show you the rows and columns where parsing error occurred, what the parser expected to find (for example, a number), and the actual value that caused the parsing error.

Let’s practice this with “desserts.csv”, which includes data about the judges’ ranks for each baker in the technical challenge, for every episode.

# Try to cast technical as a number
desserts <- read_csv("data/desserts.csv", 
                     col_types = cols(
                       uk_airdate = col_date(format = "%d %B %Y"),
                       technical = col_number()
                     )
                    )

## Warning: 7 parsing failures.
## row       col expected actual                file
##   4 technical a number    N/A 'data/desserts.csv'
##   6 technical a number    N/A 'data/desserts.csv'
##   8 technical a number    N/A 'data/desserts.csv'
##  10 technical a number    N/A 'data/desserts.csv'
##  34 technical a number    N/A 'data/desserts.csv'
## ... ......... ........ ...... ...................
## See problems(...) for more details.

# View parsing problems
desserts %>%
  problems()

## # A tibble: 7 x 5
##     row col       expected actual file               
##   <int> <chr>     <chr>    <chr>  <chr>              
## 1     4 technical a number N/A    'data/desserts.csv'
## 2     6 technical a number N/A    'data/desserts.csv'
## 3     8 technical a number N/A    'data/desserts.csv'
## 4    10 technical a number N/A    'data/desserts.csv'
## 5    34 technical a number N/A    'data/desserts.csv'
## 6    35 technical a number N/A    'data/desserts.csv'
## 7    36 technical a number N/A    'data/desserts.csv'

# Edit code to fix the parsing error 
desserts <- read_csv("data/desserts.csv",
                      col_types = cols(
                        uk_airdate = col_date(format = "%d %B %Y"),
                        technical = col_number()
                      ),
                        na = c("", "NA", "N/A") #solving the problem
                     )

# View parsing problems
problems(desserts)

## [1] row      col      expected actual  
## <0 rows> (or 0-length row.names)

Cast a column as a factor

Factors are categorical variables, where the possible values are a fixed and known set. For example, take a simple factor like bake below:

bake <- c("pie", "cake", "neither") 
parse_factor(bake, levels = NULL)

## [1] pie     cake    neither
## Levels: pie cake neither

You can use parse_factor() to parse variables and col_factor() to cast columns as categorical. Both functions have a levels argument that is used to specify the possible values for the factors. When levels is set to NULL, the possible values will be inferred from the unique values in the dataset. Alternatively, you can pass a list of possible values.

This time, you will look at the result column, which contains the outcome of the competition for each baker.

# Cast result a factor
desserts <- read_csv("data/desserts.csv", 
                     na = c("", "NA", "N/A"),
                     col_types = cols(
                       uk_airdate = col_date(format = "%d %B %Y"),
                       technical = col_number(),                       
                       result = col_factor(levels = NULL)
                     )
                    )
                    
# Glimpse to view
glimpse(desserts)

## Rows: 549
## Columns: 16
## $ series                <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ episode               <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, ...
## $ baker                 <chr> "Annetha", "David", "Edd", "Jasminder", "Jona...
## $ technical             <dbl> 2, 3, 1, NA, 9, NA, 8, NA, 10, NA, 8, 6, 2, 1...
## $ result                <fct> IN, IN, IN, IN, IN, IN, IN, IN, OUT, OUT, IN,...
## $ uk_airdate            <date> 2010-08-17, 2010-08-17, 2010-08-17, 2010-08-...
## $ us_season             <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ us_airdate            <date> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ showstopper_chocolate <chr> "chocolate", "chocolate", "no chocolate", "no...
## $ showstopper_dessert   <chr> "other", "other", "other", "other", "other", ...
## $ showstopper_fruit     <chr> "no fruit", "no fruit", "no fruit", "no fruit...
## $ showstopper_nut       <chr> "no nut", "no nut", "no nut", "no nut", "almo...
## $ signature_chocolate   <chr> "no chocolate", "chocolate", "no chocolate", ...
## $ signature_dessert     <chr> "cake", "cake", "cake", "cake", "cake", "cake...
## $ signature_fruit       <chr> "no fruit", "fruit", "fruit", "fruit", "fruit...
## $ signature_nut         <chr> "no nut", "no nut", "no nut", "no nut", "no n...

While it takes a lot of work up front, casting column types when you import can make your analyses easier to reproduce.

Recode values

Recode a character variable

In this exercise, you’ll recode() the nut variable in the desserts data. This is a character variable that tells us, for each bake, whether a nut was a key ingredient and if so, what kind of nut!

Remember that the arguments of recode() are the variable that you want to recode, and then an expression of the form old_value = new_value. recode() is most useful when used inside a mutate() to create a new variable or reassign the old one.

Often you want to recode a certain value as missing. In this case, it is important to preserve the data type of the variable. R has built-in words to indicate missing data of various types, for example NA_character_ for missing value for a character variable, or NA_integer_ for missing integer data. The NA value is of the logical type.

Remember that you can use ?function_name_here to read more about arguments and how to use them.

# Count rows grouping by nut variable
desserts %>% 
    count(signature_nut, sort = TRUE)

## # A tibble: 8 x 2
##   signature_nut     n
##   <chr>         <int>
## 1 no nut          462
## 2 almond           20
## 3 walnut           20
## 4 pistachio        13
## 5 filbert          12
## 6 pecan            11
## 7 multiple          6
## 8 peanut            5

# Edit code to recode "no nut" as missing
desserts_2 <- desserts %>% 
  mutate(nut = recode(signature_nut, "filbert" = "hazelnut", 
                           "no nut" = NA_character_))

# Count rows again 
desserts_2 %>% 
    count(nut, sort = TRUE)

## # A tibble: 8 x 2
##   nut           n
##   <chr>     <int>
## 1 <NA>        462
## 2 almond       20
## 3 walnut       20
## 4 pistachio    13
## 5 hazelnut     12
## 6 pecan        11
## 7 multiple      6
## 8 peanut        5

Recode a numeric variable

Dummy variables are often used in data analysis to bin a variable into one of two categories to indicate the absence or presence of something. Dummy variables take the value 0 or 1 to stand for, for example, loser or winner. Dummy variables are often factor variables as opposed to numeric - we’ll cover more about factors in the last chapter.

In desserts, the technical variable is a number representing the judges’ ranks for each baker in the technical challenge, for every episode (remember to run glimpse(desserts) to take a peek). In this exercise, you will recode the numeric technical variable to a new factor dummy variable.

## Edit to recode tech_win as factor
desserts <- desserts %>% 
  mutate(tech_win = recode_factor(technical, `1` = 1,
                           .default = 0))

# Count to compare values                      
desserts %>% 
  count(technical == 1, tech_win)

## # A tibble: 3 x 3
##   `technical == 1` tech_win     n
##   <lgl>            <fct>    <int>
## 1 FALSE            0          468
## 2 TRUE             1           73
## 3 NA               <NA>         8

Select variables

Combine functions with select

The ratings data tells us about how many UK viewers tuned in to watch “The Great British Bake-Off” for each episode of each series. If you explore ratings in console, you’ll see that there is a separate variable for the number of viewers for each episode. A key question we’ll return to in Chapter 3 is: how many more viewers watched the finale than the premiere episode?

For series with 10 episodes, which showed the most growth in viewers from the premiere to the finale? Which showed the least?

ratings <- read_csv("data/02.03_messy_ratings.csv")

## 
## -- Column specification --------------------------------------------------------
## cols(
##   .default = col_double(),
##   premiere = col_character(),
##   finale = col_character(),
##   winner = col_character(),
##   day_of_week = col_character(),
##   timeslot = col_time(format = ""),
##   channel = col_character(),
##   runner_up_1 = col_character(),
##   runner_up_2 = col_character(),
##   season_premiere = col_character(),
##   season_finale = col_character(),
##   e1_uk_airdate = col_character(),
##   e2_uk_airdate = col_character(),
##   e3_uk_airdate = col_character(),
##   e4_uk_airdate = col_character(),
##   e5_uk_airdate = col_character(),
##   e6_uk_airdate = col_character(),
##   e7_uk_airdate = col_character(),
##   e8_uk_airdate = col_character(),
##   e9_uk_airdate = col_character(),
##   e10_uk_airdate = col_character()
## )
## i Use `spec()` for the full column specifications.

head(ratings)

## # A tibble: 6 x 44
##   series episodes premiere finale winner avg_uk_viewers day_of_week timeslot
##    <dbl>    <dbl> <chr>    <chr>  <chr>           <dbl> <chr>       <time>  
## 1      1        6 17-Aug-~ 21-Se~ Edd K~           2.77 Tuesday     20:00   
## 2      2        8 14-Aug-~ 4-Oct~ Joann~           4    Tuesday     20:00   
## 3      3       10 14-Aug-~ 16-Oc~ John ~           5    Tuesday     20:00   
## 4      4       10 20-Aug-~ 22-Oc~ Franc~           7.35 Tuesday     20:00   
## 5      5       10 6-Aug-14 8-Oct~ Nancy~          10.0  Wednesday   20:00   
## 6      6       10 5-Aug-15 7-Oct~ Nadiy~          12.5  Wednesday   20:00   
## # ... with 36 more variables: channel <chr>, runner_up_1 <chr>,
## #   runner_up_2 <chr>, season <dbl>, season_premiere <chr>,
## #   season_finale <chr>, e1_viewers_7day <dbl>, e1_viewers_28day <dbl>,
## #   e2_viewers_7day <dbl>, e2_viewers_28day <dbl>, e3_viewers_7day <dbl>,
## #   e3_viewers_28day <dbl>, e4_viewers_7day <dbl>, e4_viewers_28day <dbl>,
## #   e5_viewers_7day <dbl>, e5_viewers_28day <dbl>, e6_viewers_7day <dbl>,
## #   e6_viewers_28day <dbl>, e7_viewers_7day <dbl>, e7_viewers_28day <dbl>,
## #   e8_viewers_7day <dbl>, e8_viewers_28day <dbl>, e9_viewers_7day <dbl>,
## #   e9_viewers_28day <dbl>, e10_viewers_7day <dbl>, e10_viewers_28day <dbl>,
## #   e1_uk_airdate <chr>, e2_uk_airdate <chr>, e3_uk_airdate <chr>,
## #   e4_uk_airdate <chr>, e5_uk_airdate <chr>, e6_uk_airdate <chr>,
## #   e7_uk_airdate <chr>, e8_uk_airdate <chr>, e9_uk_airdate <chr>,
## #   e10_uk_airdate <chr>

ratings <- ratings %>% 
  filter(episodes == 10) %>%
  mutate(viewer_growth = e10_viewers_7day - e1_viewers_7day)

ratings %>% 
  select(series, viewer_growth)

## # A tibble: 6 x 2
##   series viewer_growth
##    <dbl>         <dbl>
## 1      3         2.89 
## 2      4         2.85 
## 3      5         5    
## 4      6         3.43 
## 5      7         2.32 
## 6      8         0.580

Most: series 5, Least = series 8

Recode factor to plot

In the last exercise, you may have used mutate() to find the difference between the number of premiere and finale episode viewers for each series. This variable, which I created and called viewer_growth, is now available in ratings, and feel free to check out it.

Now let’s practice using select() to check the result of another mutate with recode_factor().

If you have examined ratings, you might have seen that series have aired on three different channels: “Channel 4”, “BBC One”, “BBC Two”.

# Recode channel as factor: bbc (1) or not (0)
ratings <- ratings %>% 
  mutate(bbc = recode_factor(channel, 
                             "Channel 4" = 0,
                             .default = 1))
                            
# Select to look at variables to plot next
ratings %>% 
  select(series, channel, bbc, viewer_growth)

## # A tibble: 6 x 4
##   series channel   bbc   viewer_growth
##    <dbl> <chr>     <fct>         <dbl>
## 1      3 BBC Two   1             2.89 
## 2      4 BBC Two   1             2.85 
## 3      5 BBC One   1             5    
## 4      6 BBC One   1             3.43 
## 5      7 BBC One   1             2.32 
## 6      8 Channel 4 0             0.580

# Make a filled bar chart
ggplot(ratings, aes(x = series, y = viewer_growth, fill = bbc)) +
  geom_col()

Remember to look up the other select helper functions using ?select. What is going on with viewer_growth in series 8?

Select and reorder variables

As we have seen, selecting a subset of columns to print can help you check that a mutate() worked as expected, and rearranging columns next to each other can help you spot obvious errors in data entry.

The select() helpers are functions that you can use inside select() to allow you to select variables based on their names.

In this exercise, you’ll work with the ratings data, and you’ll use a new helper function called everything() which can be useful when reordering columns. Use ?everything to read more. You will learn more about using helper functions in the next lesson.

# Move channel to first column
ratings %>% 
  select(channel, everything())

## # A tibble: 6 x 46
##   channel series episodes premiere finale winner avg_uk_viewers day_of_week
##   <chr>    <dbl>    <dbl> <chr>    <chr>  <chr>           <dbl> <chr>      
## 1 BBC Two      3       10 14-Aug-~ 16-Oc~ John ~           5    Tuesday    
## 2 BBC Two      4       10 20-Aug-~ 22-Oc~ Franc~           7.35 Tuesday    
## 3 BBC One      5       10 6-Aug-14 8-Oct~ Nancy~          10.0  Wednesday  
## 4 BBC One      6       10 5-Aug-15 7-Oct~ Nadiy~          12.5  Wednesday  
## 5 BBC One      7       10 24-Aug-~ 26-Oc~ Candi~          13.8  Wednesday  
## 6 Channe~      8       10 29-Aug-~ 31-Oc~ Sophi~           9.29 Tuesday    
## # ... with 38 more variables: timeslot <time>, runner_up_1 <chr>,
## #   runner_up_2 <chr>, season <dbl>, season_premiere <chr>,
## #   season_finale <chr>, e1_viewers_7day <dbl>, e1_viewers_28day <dbl>,
## #   e2_viewers_7day <dbl>, e2_viewers_28day <dbl>, e3_viewers_7day <dbl>,
## #   e3_viewers_28day <dbl>, e4_viewers_7day <dbl>, e4_viewers_28day <dbl>,
## #   e5_viewers_7day <dbl>, e5_viewers_28day <dbl>, e6_viewers_7day <dbl>,
## #   e6_viewers_28day <dbl>, e7_viewers_7day <dbl>, e7_viewers_28day <dbl>,
## #   e8_viewers_7day <dbl>, e8_viewers_28day <dbl>, e9_viewers_7day <dbl>,
## #   e9_viewers_28day <dbl>, e10_viewers_7day <dbl>, e10_viewers_28day <dbl>,
## #   e1_uk_airdate <chr>, e2_uk_airdate <chr>, e3_uk_airdate <chr>,
## #   e4_uk_airdate <chr>, e5_uk_airdate <chr>, e6_uk_airdate <chr>,
## #   e7_uk_airdate <chr>, e8_uk_airdate <chr>, e9_uk_airdate <chr>,
## #   e10_uk_airdate <chr>, viewer_growth <dbl>, bbc <fct>

# Drop 7- and 28-day episode ratings
ratings %>% 
  select(-ends_with("day"))

## # A tibble: 6 x 26
##   series episodes premiere finale winner avg_uk_viewers day_of_week timeslot
##    <dbl>    <dbl> <chr>    <chr>  <chr>           <dbl> <chr>       <time>  
## 1      3       10 14-Aug-~ 16-Oc~ John ~           5    Tuesday     20:00   
## 2      4       10 20-Aug-~ 22-Oc~ Franc~           7.35 Tuesday     20:00   
## 3      5       10 6-Aug-14 8-Oct~ Nancy~          10.0  Wednesday   20:00   
## 4      6       10 5-Aug-15 7-Oct~ Nadiy~          12.5  Wednesday   20:00   
## 5      7       10 24-Aug-~ 26-Oc~ Candi~          13.8  Wednesday   20:00   
## 6      8       10 29-Aug-~ 31-Oc~ Sophi~           9.29 Tuesday     20:00   
## # ... with 18 more variables: channel <chr>, runner_up_1 <chr>,
## #   runner_up_2 <chr>, season <dbl>, season_premiere <chr>,
## #   season_finale <chr>, e1_uk_airdate <chr>, e2_uk_airdate <chr>,
## #   e3_uk_airdate <chr>, e4_uk_airdate <chr>, e5_uk_airdate <chr>,
## #   e6_uk_airdate <chr>, e7_uk_airdate <chr>, e8_uk_airdate <chr>,
## #   e9_uk_airdate <chr>, e10_uk_airdate <chr>, viewer_growth <dbl>, bbc <fct>

# Move channel to front and drop 7-/28-day episode ratings
ratings %>% 
  select(channel, everything(), -ends_with("day"))

## # A tibble: 6 x 26
##   channel series episodes premiere finale winner avg_uk_viewers day_of_week
##   <chr>    <dbl>    <dbl> <chr>    <chr>  <chr>           <dbl> <chr>      
## 1 BBC Two      3       10 14-Aug-~ 16-Oc~ John ~           5    Tuesday    
## 2 BBC Two      4       10 20-Aug-~ 22-Oc~ Franc~           7.35 Tuesday    
## 3 BBC One      5       10 6-Aug-14 8-Oct~ Nancy~          10.0  Wednesday  
## 4 BBC One      6       10 5-Aug-15 7-Oct~ Nadiy~          12.5  Wednesday  
## 5 BBC One      7       10 24-Aug-~ 26-Oc~ Candi~          13.8  Wednesday  
## 6 Channe~      8       10 29-Aug-~ 31-Oc~ Sophi~           9.29 Tuesday    
## # ... with 18 more variables: timeslot <time>, runner_up_1 <chr>,
## #   runner_up_2 <chr>, season <dbl>, season_premiere <chr>,
## #   season_finale <chr>, e1_uk_airdate <chr>, e2_uk_airdate <chr>,
## #   e3_uk_airdate <chr>, e4_uk_airdate <chr>, e5_uk_airdate <chr>,
## #   e6_uk_airdate <chr>, e7_uk_airdate <chr>, e8_uk_airdate <chr>,
## #   e9_uk_airdate <chr>, e10_uk_airdate <chr>, viewer_growth <dbl>, bbc <fct>

everything() is a great little helper! If it had been placed at the end, it would have added back in all the columns that end with “day”. Placing it before deselecting columns, though, is a real time-saver.

Tame variable names

Reformat variables

We saw how to use clean_names() from the janitor package to convert all variable names to snake_case. Here is how:

library(janitor)
young_bakers3 <- young_bakers3 %>% 
   clean_names()

The function clean_names() takes an argument case that can be used to convert variable names to other cases, like “upper_camel” or “all_caps”.

Let’s practice using the messy_ratings tibble. Remember you can use ?clean_names to see the function arguments and how to use them.

messy_ratings <- read_csv("data/02.03_messy_ratings.csv")

## 
## -- Column specification --------------------------------------------------------
## cols(
##   .default = col_double(),
##   premiere = col_character(),
##   finale = col_character(),
##   winner = col_character(),
##   day_of_week = col_character(),
##   timeslot = col_time(format = ""),
##   channel = col_character(),
##   runner_up_1 = col_character(),
##   runner_up_2 = col_character(),
##   season_premiere = col_character(),
##   season_finale = col_character(),
##   e1_uk_airdate = col_character(),
##   e2_uk_airdate = col_character(),
##   e3_uk_airdate = col_character(),
##   e4_uk_airdate = col_character(),
##   e5_uk_airdate = col_character(),
##   e6_uk_airdate = col_character(),
##   e7_uk_airdate = col_character(),
##   e8_uk_airdate = col_character(),
##   e9_uk_airdate = col_character(),
##   e10_uk_airdate = col_character()
## )
## i Use `spec()` for the full column specifications.

colnames(messy_ratings)

names(messy_ratings) <- c("SERIES", "episodes", "premiere", "finale", "winner", "Avg.uk~Viewers", 
"day", "timeslot", "channel", "runner.up~1", "runner.up~2", "season", 
"season.premiere", "season.finale", "Gbbo E1.Viewers~7Day", "Gbbo E1.Viewers~28Day", 
"Gbbo E2.Viewers~7Day", "Gbbo E2.Viewers~28Day", "Gbbo E3.Viewers~7Day", 
"Gbbo E3.Viewers~28Day", "Gbbo E4.Viewers~7Day", "Gbbo E4.Viewers~28Day", 
"Gbbo E5.Viewers~7Day", "Gbbo E5.Viewers~28Day", "Gbbo E6.Viewers~7Day", 
"Gbbo E6.Viewers~28Day", "Gbbo E7.Viewers~7Day", "Gbbo E7.Viewers~28Day", 
"Gbbo E8.Viewers~7Day", "Gbbo E8.Viewers~28Day", "Gbbo E9.Viewers~7Day", 
"Gbbo E9.Viewers~28Day", "Gbbo E10.Viewers~7Day", "Gbbo E10.Viewers~28Day")

## Warning: The `value` argument of ``names<-`()` must have the same length as `x` as of tibble 3.0.0.
## `names` must have length 44, not 34.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.

## Warning: The `value` argument of ``names<-`()` can't be empty as of tibble 3.0.0.
## Columns 35, 36, 37, 38, 39, and 5 more must be named.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.

# Glimpse to see variable names
glimpse(messy_ratings)

## Rows: 8
## Columns: 44
## $ SERIES                   <dbl> 1, 2, 3, 4, 5, 6, 7, 8
## $ episodes                 <dbl> 6, 8, 10, 10, 10, 10, 10, 10
## $ premiere                 <chr> "17-Aug-10", "14-Aug-11", "14-Aug-12", "20...
## $ finale                   <chr> "21-Sep-10", "4-Oct-11", "16-Oct-12", "22-...
## $ winner                   <chr> "Edd Kimber", "Joanne Wheatley", "John Wha...
## $ `Avg.uk~Viewers`         <dbl> 2.77, 4.00, 5.00, 7.35, 10.04, 12.50, 13.8...
## $ day                      <chr> "Tuesday", "Tuesday", "Tuesday", "Tuesday"...
## $ timeslot                 <time> 20:00:00, 20:00:00, 20:00:00, 20:00:00, 2...
## $ channel                  <chr> "BBC Two", "BBC Two", "BBC Two", "BBC Two"...
## $ `runner.up~1`            <chr> "Miranda Gore Browne", "Holly Bell", "Bren...
## $ `runner.up~2`            <chr> "Ruth Clemens", "Mary-Anne Boermans", "Jam...
## $ season                   <dbl> NA, NA, NA, 1, 2, 3, 4, NA
## $ season.premiere          <chr> NA, NA, NA, "12/28/14", "9/6/15", "7/1/16"...
## $ season.finale            <chr> NA, NA, NA, "3/1/15", "11/8/15", "8/12/16"...
## $ `Gbbo E1.Viewers~7Day`   <dbl> 2.24, 3.10, 3.85, 6.60, 8.51, 11.62, 13.58...
## $ `Gbbo E1.Viewers~28Day`  <dbl> NA, NA, NA, NA, NA, 11.73, 13.86, 9.72
## $ `Gbbo E2.Viewers~7Day`   <dbl> 3.00, 3.53, 4.60, 6.65, 8.79, 11.59, 13.45...
## $ `Gbbo E2.Viewers~28Day`  <dbl> NA, NA, NA, NA, NA, 11.84, 13.74, 9.53
## $ `Gbbo E3.Viewers~7Day`   <dbl> 3.00, 3.82, 4.53, 7.17, 9.28, 12.01, 13.01...
## $ `Gbbo E3.Viewers~28Day`  <dbl> NA, NA, NA, NA, NA, NA, 13.38, 9.06
## $ `Gbbo E4.Viewers~7Day`   <dbl> 2.60, 3.60, 4.71, 6.82, 10.25, 12.36, 13.2...
## $ `Gbbo E4.Viewers~28Day`  <dbl> NA, NA, NA, NA, NA, 12.67, 13.88, 8.87
## $ `Gbbo E5.Viewers~7Day`   <dbl> 3.03, 3.83, 4.61, 6.95, 9.95, 12.39, 13.12...
## $ `Gbbo E5.Viewers~28Day`  <dbl> NA, NA, NA, NA, NA, 12.63, 13.33, 8.98
## $ `Gbbo E6.Viewers~7Day`   <dbl> 2.75, 4.25, 4.82, 7.32, 10.13, 12.00, 13.1...
## $ `Gbbo E6.Viewers~28Day`  <dbl> NA, NA, NA, NA, NA, 12.26, 13.41, 8.91
## $ `Gbbo E7.Viewers~7Day`   <dbl> NA, 4.42, 5.10, 7.76, 10.28, 12.35, 13.45,...
## $ `Gbbo E7.Viewers~28Day`  <dbl> NA, NA, NA, NA, NA, 12.58, 13.72, 9.32
## $ `Gbbo E8.Viewers~7Day`   <dbl> NA, 5.060, 5.350, 7.410, 9.023, 11.090, 13...
## $ `Gbbo E8.Viewers~28Day`  <dbl> NA, NA, NA, NA, NA, 11.35, 13.45, 9.19
## $ `Gbbo E9.Viewers~7Day`   <dbl> NA, NA, 5.70, 7.41, 10.67, 12.65, 13.44, 9.03
## $ `Gbbo E9.Viewers~28Day`  <dbl> NA, NA, NA, NA, NA, 12.78, 13.65, 9.16
## $ `Gbbo E10.Viewers~7Day`  <dbl> NA, NA, 6.74, 9.45, 13.51, 15.05, 15.90, 1...
## $ `Gbbo E10.Viewers~28Day` <dbl> NA, NA, NA, NA, NA, 15.16, 16.03, 10.13
## $ NA                       <chr> "8/17/10", "8/16/11", "8/14/12", "8/20/13"...
## $ NA                       <chr> "8/24/10", "8/23/11", "8/21/12", "8/27/13"...
## $ NA                       <chr> "8/31/10", "8/30/11", "8/28/12", "9/3/13",...
## $ NA                       <chr> "9/7/10", "9/6/11", "9/4/12", "9/10/13", "...
## $ NA                       <chr> "9/14/10", "9/13/11", "9/11/12", "9/17/13"...
## $ NA                       <chr> "9/21/10", "9/20/11", "9/18/12", "9/24/13"...
## $ NA                       <chr> NA, "9/27/11", "9/25/12", "10/1/13", "9/17...
## $ NA                       <chr> NA, "10/4/11", "10/2/12", "10/8/13", "9/24...
## $ NA                       <chr> NA, NA, "10/9/12", "10/15/13", "10/1/14", ...
## $ NA                       <chr> NA, NA, "10/16/12", "10/22/13", "10/8/14",...

# Load janitor
library(janitor)

## 
## Attaching package: 'janitor'

## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test

# Reformat to lower camelcase
ratings2 <- messy_ratings %>%
  clean_names("lower_camel")
    
# Glimpse new tibble
glimpse(ratings2)

## Rows: 8
## Columns: 44
## $ series              <dbl> 1, 2, 3, 4, 5, 6, 7, 8
## $ episodes            <dbl> 6, 8, 10, 10, 10, 10, 10, 10
## $ premiere            <chr> "17-Aug-10", "14-Aug-11", "14-Aug-12", "20-Aug-...
## $ finale              <chr> "21-Sep-10", "4-Oct-11", "16-Oct-12", "22-Oct-1...
## $ winner              <chr> "Edd Kimber", "Joanne Wheatley", "John Whaite",...
## $ avgUkViewers        <dbl> 2.77, 4.00, 5.00, 7.35, 10.04, 12.50, 13.85, 9.29
## $ day                 <chr> "Tuesday", "Tuesday", "Tuesday", "Tuesday", "We...
## $ timeslot            <time> 20:00:00, 20:00:00, 20:00:00, 20:00:00, 20:00:...
## $ channel             <chr> "BBC Two", "BBC Two", "BBC Two", "BBC Two", "BB...
## $ runnerUp1           <chr> "Miranda Gore Browne", "Holly Bell", "Brendan L...
## $ runnerUp2           <chr> "Ruth Clemens", "Mary-Anne Boermans", "James Mo...
## $ season              <dbl> NA, NA, NA, 1, 2, 3, 4, NA
## $ seasonPremiere      <chr> NA, NA, NA, "12/28/14", "9/6/15", "7/1/16", "6/...
## $ seasonFinale        <chr> NA, NA, NA, "3/1/15", "11/8/15", "8/12/16", "8/...
## $ gbboE1Viewers7Day   <dbl> 2.24, 3.10, 3.85, 6.60, 8.51, 11.62, 13.58, 9.46
## $ gbboE1Viewers28Day  <dbl> NA, NA, NA, NA, NA, 11.73, 13.86, 9.72
## $ gbboE2Viewers7Day   <dbl> 3.00, 3.53, 4.60, 6.65, 8.79, 11.59, 13.45, 9.23
## $ gbboE2Viewers28Day  <dbl> NA, NA, NA, NA, NA, 11.84, 13.74, 9.53
## $ gbboE3Viewers7Day   <dbl> 3.00, 3.82, 4.53, 7.17, 9.28, 12.01, 13.01, 8.68
## $ gbboE3Viewers28Day  <dbl> NA, NA, NA, NA, NA, NA, 13.38, 9.06
## $ gbboE4Viewers7Day   <dbl> 2.60, 3.60, 4.71, 6.82, 10.25, 12.36, 13.29, 8.55
## $ gbboE4Viewers28Day  <dbl> NA, NA, NA, NA, NA, 12.67, 13.88, 8.87
## $ gbboE5Viewers7Day   <dbl> 3.03, 3.83, 4.61, 6.95, 9.95, 12.39, 13.12, 8.61
## $ gbboE5Viewers28Day  <dbl> NA, NA, NA, NA, NA, 12.63, 13.33, 8.98
## $ gbboE6Viewers7Day   <dbl> 2.75, 4.25, 4.82, 7.32, 10.13, 12.00, 13.13, 8.61
## $ gbboE6Viewers28Day  <dbl> NA, NA, NA, NA, NA, 12.26, 13.41, 8.91
## $ gbboE7Viewers7Day   <dbl> NA, 4.42, 5.10, 7.76, 10.28, 12.35, 13.45, 9.01
## $ gbboE7Viewers28Day  <dbl> NA, NA, NA, NA, NA, 12.58, 13.72, 9.32
## $ gbboE8Viewers7Day   <dbl> NA, 5.060, 5.350, 7.410, 9.023, 11.090, 13.260,...
## $ gbboE8Viewers28Day  <dbl> NA, NA, NA, NA, NA, 11.35, 13.45, 9.19
## $ gbboE9Viewers7Day   <dbl> NA, NA, 5.70, 7.41, 10.67, 12.65, 13.44, 9.03
## $ gbboE9Viewers28Day  <dbl> NA, NA, NA, NA, NA, 12.78, 13.65, 9.16
## $ gbboE10Viewers7Day  <dbl> NA, NA, 6.74, 9.45, 13.51, 15.05, 15.90, 10.04
## $ gbboE10Viewers28Day <dbl> NA, NA, NA, NA, NA, 15.16, 16.03, 10.13
## $ na                  <chr> "8/17/10", "8/16/11", "8/14/12", "8/20/13", "8/...
## $ na_2                <chr> "8/24/10", "8/23/11", "8/21/12", "8/27/13", "8/...
## $ na_3                <chr> "8/31/10", "8/30/11", "8/28/12", "9/3/13", "8/2...
## $ na_4                <chr> "9/7/10", "9/6/11", "9/4/12", "9/10/13", "8/27/...
## $ na_5                <chr> "9/14/10", "9/13/11", "9/11/12", "9/17/13", "9/...
## $ na_6                <chr> "9/21/10", "9/20/11", "9/18/12", "9/24/13", "9/...
## $ na_7                <chr> NA, "9/27/11", "9/25/12", "10/1/13", "9/17/14",...
## $ na_8                <chr> NA, "10/4/11", "10/2/12", "10/8/13", "9/24/14",...
## $ na_9                <chr> NA, NA, "10/9/12", "10/15/13", "10/1/14", "9/30...
## $ na_10               <chr> NA, NA, "10/16/12", "10/22/13", "10/8/14", "10/...

# Glimpse to see variable names
glimpse(messy_ratings)

## Rows: 8
## Columns: 44
## $ SERIES                   <dbl> 1, 2, 3, 4, 5, 6, 7, 8
## $ episodes                 <dbl> 6, 8, 10, 10, 10, 10, 10, 10
## $ premiere                 <chr> "17-Aug-10", "14-Aug-11", "14-Aug-12", "20...
## $ finale                   <chr> "21-Sep-10", "4-Oct-11", "16-Oct-12", "22-...
## $ winner                   <chr> "Edd Kimber", "Joanne Wheatley", "John Wha...
## $ `Avg.uk~Viewers`         <dbl> 2.77, 4.00, 5.00, 7.35, 10.04, 12.50, 13.8...
## $ day                      <chr> "Tuesday", "Tuesday", "Tuesday", "Tuesday"...
## $ timeslot                 <time> 20:00:00, 20:00:00, 20:00:00, 20:00:00, 2...
## $ channel                  <chr> "BBC Two", "BBC Two", "BBC Two", "BBC Two"...
## $ `runner.up~1`            <chr> "Miranda Gore Browne", "Holly Bell", "Bren...
## $ `runner.up~2`            <chr> "Ruth Clemens", "Mary-Anne Boermans", "Jam...
## $ season                   <dbl> NA, NA, NA, 1, 2, 3, 4, NA
## $ season.premiere          <chr> NA, NA, NA, "12/28/14", "9/6/15", "7/1/16"...
## $ season.finale            <chr> NA, NA, NA, "3/1/15", "11/8/15", "8/12/16"...
## $ `Gbbo E1.Viewers~7Day`   <dbl> 2.24, 3.10, 3.85, 6.60, 8.51, 11.62, 13.58...
## $ `Gbbo E1.Viewers~28Day`  <dbl> NA, NA, NA, NA, NA, 11.73, 13.86, 9.72
## $ `Gbbo E2.Viewers~7Day`   <dbl> 3.00, 3.53, 4.60, 6.65, 8.79, 11.59, 13.45...
## $ `Gbbo E2.Viewers~28Day`  <dbl> NA, NA, NA, NA, NA, 11.84, 13.74, 9.53
## $ `Gbbo E3.Viewers~7Day`   <dbl> 3.00, 3.82, 4.53, 7.17, 9.28, 12.01, 13.01...
## $ `Gbbo E3.Viewers~28Day`  <dbl> NA, NA, NA, NA, NA, NA, 13.38, 9.06
## $ `Gbbo E4.Viewers~7Day`   <dbl> 2.60, 3.60, 4.71, 6.82, 10.25, 12.36, 13.2...
## $ `Gbbo E4.Viewers~28Day`  <dbl> NA, NA, NA, NA, NA, 12.67, 13.88, 8.87
## $ `Gbbo E5.Viewers~7Day`   <dbl> 3.03, 3.83, 4.61, 6.95, 9.95, 12.39, 13.12...
## $ `Gbbo E5.Viewers~28Day`  <dbl> NA, NA, NA, NA, NA, 12.63, 13.33, 8.98
## $ `Gbbo E6.Viewers~7Day`   <dbl> 2.75, 4.25, 4.82, 7.32, 10.13, 12.00, 13.1...
## $ `Gbbo E6.Viewers~28Day`  <dbl> NA, NA, NA, NA, NA, 12.26, 13.41, 8.91
## $ `Gbbo E7.Viewers~7Day`   <dbl> NA, 4.42, 5.10, 7.76, 10.28, 12.35, 13.45,...
## $ `Gbbo E7.Viewers~28Day`  <dbl> NA, NA, NA, NA, NA, 12.58, 13.72, 9.32
## $ `Gbbo E8.Viewers~7Day`   <dbl> NA, 5.060, 5.350, 7.410, 9.023, 11.090, 13...
## $ `Gbbo E8.Viewers~28Day`  <dbl> NA, NA, NA, NA, NA, 11.35, 13.45, 9.19
## $ `Gbbo E9.Viewers~7Day`   <dbl> NA, NA, 5.70, 7.41, 10.67, 12.65, 13.44, 9.03
## $ `Gbbo E9.Viewers~28Day`  <dbl> NA, NA, NA, NA, NA, 12.78, 13.65, 9.16
## $ `Gbbo E10.Viewers~7Day`  <dbl> NA, NA, 6.74, 9.45, 13.51, 15.05, 15.90, 1...
## $ `Gbbo E10.Viewers~28Day` <dbl> NA, NA, NA, NA, NA, 15.16, 16.03, 10.13
## $ NA                       <chr> "8/17/10", "8/16/11", "8/14/12", "8/20/13"...
## $ NA                       <chr> "8/24/10", "8/23/11", "8/21/12", "8/27/13"...
## $ NA                       <chr> "8/31/10", "8/30/11", "8/28/12", "9/3/13",...
## $ NA                       <chr> "9/7/10", "9/6/11", "9/4/12", "9/10/13", "...
## $ NA                       <chr> "9/14/10", "9/13/11", "9/11/12", "9/17/13"...
## $ NA                       <chr> "9/21/10", "9/20/11", "9/18/12", "9/24/13"...
## $ NA                       <chr> NA, "9/27/11", "9/25/12", "10/1/13", "9/17...
## $ NA                       <chr> NA, "10/4/11", "10/2/12", "10/8/13", "9/24...
## $ NA                       <chr> NA, NA, "10/9/12", "10/15/13", "10/1/14", ...
## $ NA                       <chr> NA, NA, "10/16/12", "10/22/13", "10/8/14",...

# Load janitor
library(janitor)

# Reformat to snake case
ratings2 <- messy_ratings %>%  
  clean_names()

# Glimpse cleaned names
glimpse(ratings2)

## Rows: 8
## Columns: 44
## $ series                 <dbl> 1, 2, 3, 4, 5, 6, 7, 8
## $ episodes               <dbl> 6, 8, 10, 10, 10, 10, 10, 10
## $ premiere               <chr> "17-Aug-10", "14-Aug-11", "14-Aug-12", "20-A...
## $ finale                 <chr> "21-Sep-10", "4-Oct-11", "16-Oct-12", "22-Oc...
## $ winner                 <chr> "Edd Kimber", "Joanne Wheatley", "John Whait...
## $ avg_uk_viewers         <dbl> 2.77, 4.00, 5.00, 7.35, 10.04, 12.50, 13.85,...
## $ day                    <chr> "Tuesday", "Tuesday", "Tuesday", "Tuesday", ...
## $ timeslot               <time> 20:00:00, 20:00:00, 20:00:00, 20:00:00, 20:...
## $ channel                <chr> "BBC Two", "BBC Two", "BBC Two", "BBC Two", ...
## $ runner_up_1            <chr> "Miranda Gore Browne", "Holly Bell", "Brenda...
## $ runner_up_2            <chr> "Ruth Clemens", "Mary-Anne Boermans", "James...
## $ season                 <dbl> NA, NA, NA, 1, 2, 3, 4, NA
## $ season_premiere        <chr> NA, NA, NA, "12/28/14", "9/6/15", "7/1/16", ...
## $ season_finale          <chr> NA, NA, NA, "3/1/15", "11/8/15", "8/12/16", ...
## $ gbbo_e1_viewers_7day   <dbl> 2.24, 3.10, 3.85, 6.60, 8.51, 11.62, 13.58, ...
## $ gbbo_e1_viewers_28day  <dbl> NA, NA, NA, NA, NA, 11.73, 13.86, 9.72
## $ gbbo_e2_viewers_7day   <dbl> 3.00, 3.53, 4.60, 6.65, 8.79, 11.59, 13.45, ...
## $ gbbo_e2_viewers_28day  <dbl> NA, NA, NA, NA, NA, 11.84, 13.74, 9.53
## $ gbbo_e3_viewers_7day   <dbl> 3.00, 3.82, 4.53, 7.17, 9.28, 12.01, 13.01, ...
## $ gbbo_e3_viewers_28day  <dbl> NA, NA, NA, NA, NA, NA, 13.38, 9.06
## $ gbbo_e4_viewers_7day   <dbl> 2.60, 3.60, 4.71, 6.82, 10.25, 12.36, 13.29,...
## $ gbbo_e4_viewers_28day  <dbl> NA, NA, NA, NA, NA, 12.67, 13.88, 8.87
## $ gbbo_e5_viewers_7day   <dbl> 3.03, 3.83, 4.61, 6.95, 9.95, 12.39, 13.12, ...
## $ gbbo_e5_viewers_28day  <dbl> NA, NA, NA, NA, NA, 12.63, 13.33, 8.98
## $ gbbo_e6_viewers_7day   <dbl> 2.75, 4.25, 4.82, 7.32, 10.13, 12.00, 13.13,...
## $ gbbo_e6_viewers_28day  <dbl> NA, NA, NA, NA, NA, 12.26, 13.41, 8.91
## $ gbbo_e7_viewers_7day   <dbl> NA, 4.42, 5.10, 7.76, 10.28, 12.35, 13.45, 9.01
## $ gbbo_e7_viewers_28day  <dbl> NA, NA, NA, NA, NA, 12.58, 13.72, 9.32
## $ gbbo_e8_viewers_7day   <dbl> NA, 5.060, 5.350, 7.410, 9.023, 11.090, 13.2...
## $ gbbo_e8_viewers_28day  <dbl> NA, NA, NA, NA, NA, 11.35, 13.45, 9.19
## $ gbbo_e9_viewers_7day   <dbl> NA, NA, 5.70, 7.41, 10.67, 12.65, 13.44, 9.03
## $ gbbo_e9_viewers_28day  <dbl> NA, NA, NA, NA, NA, 12.78, 13.65, 9.16
## $ gbbo_e10_viewers_7day  <dbl> NA, NA, 6.74, 9.45, 13.51, 15.05, 15.90, 10.04
## $ gbbo_e10_viewers_28day <dbl> NA, NA, NA, NA, NA, 15.16, 16.03, 10.13
## $ na                     <chr> "8/17/10", "8/16/11", "8/14/12", "8/20/13", ...
## $ na_2                   <chr> "8/24/10", "8/23/11", "8/21/12", "8/27/13", ...
## $ na_3                   <chr> "8/31/10", "8/30/11", "8/28/12", "9/3/13", "...
## $ na_4                   <chr> "9/7/10", "9/6/11", "9/4/12", "9/10/13", "8/...
## $ na_5                   <chr> "9/14/10", "9/13/11", "9/11/12", "9/17/13", ...
## $ na_6                   <chr> "9/21/10", "9/20/11", "9/18/12", "9/24/13", ...
## $ na_7                   <chr> NA, "9/27/11", "9/25/12", "10/1/13", "9/17/1...
## $ na_8                   <chr> NA, "10/4/11", "10/2/12", "10/8/13", "9/24/1...
## $ na_9                   <chr> NA, NA, "10/9/12", "10/15/13", "10/1/14", "9...
## $ na_10                  <chr> NA, NA, "10/16/12", "10/22/13", "10/8/14", "...

You have a ton of case options with the clean_names function- choose wisely!

Rename and subset variables

In the next chapter, when we reshape data to be tidy, you’ll see the importance of having tame variable names.

Now that our variable names in ratings have been converted to snake_case, you’ll practice combining renaming with select() helper functions to take advantage of the variable name structure.

To rename a group of variables, you only need to specify a prefix of the new name.

For example, my_tbl %>% select( new_name_ = starts_with(“oldname”) ) will find all variables in my_tbl whose names start with oldname, enumerate them, then rename each variable as new_name_, where N is a number. If my_tbl has variables oldname, oldname_v1, oldname3, running the statement above will replace these names with new_name_1, new_name_2, new_name_3. Notice how we only specified new_name_ in select(), and the numbers were appended automatically!

Take a second to glimpse() at ratings in your console, and take note of the variables that store data about the number of viewers. Can you see the pattern in their names? In this exercise, you will first select those variables using a helper function, and then rename them!

# Select 7-day viewer data by series
viewers_7day <- ratings %>%
    select(series, ends_with("7day"))
    

# Glimpse
glimpse(viewers_7day)

## Rows: 6
## Columns: 11
## $ series           <dbl> 3, 4, 5, 6, 7, 8
## $ e1_viewers_7day  <dbl> 3.85, 6.60, 8.51, 11.62, 13.58, 9.46
## $ e2_viewers_7day  <dbl> 4.60, 6.65, 8.79, 11.59, 13.45, 9.23
## $ e3_viewers_7day  <dbl> 4.53, 7.17, 9.28, 12.01, 13.01, 8.68
## $ e4_viewers_7day  <dbl> 4.71, 6.82, 10.25, 12.36, 13.29, 8.55
## $ e5_viewers_7day  <dbl> 4.61, 6.95, 9.95, 12.39, 13.12, 8.61
## $ e6_viewers_7day  <dbl> 4.82, 7.32, 10.13, 12.00, 13.13, 8.61
## $ e7_viewers_7day  <dbl> 5.10, 7.76, 10.28, 12.35, 13.45, 9.01
## $ e8_viewers_7day  <dbl> 5.350, 7.410, 9.023, 11.090, 13.260, 8.950
## $ e9_viewers_7day  <dbl> 5.70, 7.41, 10.67, 12.65, 13.44, 9.03
## $ e10_viewers_7day <dbl> 6.74, 9.45, 13.51, 15.05, 15.90, 10.04

# Adapt code to also rename 7-day viewer data
viewers_7day <- ratings %>% 
    select(series, viewers_7day_ = ends_with("7day"))

# Glimpse
glimpse(viewers_7day)

## Rows: 6
## Columns: 11
## $ series          <dbl> 3, 4, 5, 6, 7, 8
## $ viewers_7day_1  <dbl> 3.85, 6.60, 8.51, 11.62, 13.58, 9.46
## $ viewers_7day_2  <dbl> 4.60, 6.65, 8.79, 11.59, 13.45, 9.23
## $ viewers_7day_3  <dbl> 4.53, 7.17, 9.28, 12.01, 13.01, 8.68
## $ viewers_7day_4  <dbl> 4.71, 6.82, 10.25, 12.36, 13.29, 8.55
## $ viewers_7day_5  <dbl> 4.61, 6.95, 9.95, 12.39, 13.12, 8.61
## $ viewers_7day_6  <dbl> 4.82, 7.32, 10.13, 12.00, 13.13, 8.61
## $ viewers_7day_7  <dbl> 5.10, 7.76, 10.28, 12.35, 13.45, 9.01
## $ viewers_7day_8  <dbl> 5.350, 7.410, 9.023, 11.090, 13.260, 8.950
## $ viewers_7day_9  <dbl> 5.70, 7.41, 10.67, 12.65, 13.44, 9.03
## $ viewers_7day_10 <dbl> 6.74, 9.45, 13.51, 15.05, 15.90, 10.04

Using select helper functions is a powerful way to tame your variable names. Tame names make it easier to tidy your data.

Rename and reorder variables

In the previous exercise, you renamed all the variables in ratings that end with “7day” and kept only a subset of variables. In this exercise, you’ll adapt that code to drop some specific columns, while keeping and reordering other columns.

# Adapt code to drop 28-day columns; keep 7-day in front
viewers_7day <- ratings %>% 
    select(viewers_7day_ = ends_with("7day"),
        everything(),
        -ends_with("28day"))
        
# Glimpse
glimpse(viewers_7day)

## Rows: 6
## Columns: 36
## $ viewers_7day_1  <dbl> 3.85, 6.60, 8.51, 11.62, 13.58, 9.46
## $ viewers_7day_2  <dbl> 4.60, 6.65, 8.79, 11.59, 13.45, 9.23
## $ viewers_7day_3  <dbl> 4.53, 7.17, 9.28, 12.01, 13.01, 8.68
## $ viewers_7day_4  <dbl> 4.71, 6.82, 10.25, 12.36, 13.29, 8.55
## $ viewers_7day_5  <dbl> 4.61, 6.95, 9.95, 12.39, 13.12, 8.61
## $ viewers_7day_6  <dbl> 4.82, 7.32, 10.13, 12.00, 13.13, 8.61
## $ viewers_7day_7  <dbl> 5.10, 7.76, 10.28, 12.35, 13.45, 9.01
## $ viewers_7day_8  <dbl> 5.350, 7.410, 9.023, 11.090, 13.260, 8.950
## $ viewers_7day_9  <dbl> 5.70, 7.41, 10.67, 12.65, 13.44, 9.03
## $ viewers_7day_10 <dbl> 6.74, 9.45, 13.51, 15.05, 15.90, 10.04
## $ series          <dbl> 3, 4, 5, 6, 7, 8
## $ episodes        <dbl> 10, 10, 10, 10, 10, 10
## $ premiere        <chr> "14-Aug-12", "20-Aug-13", "6-Aug-14", "5-Aug-15", "...
## $ finale          <chr> "16-Oct-12", "22-Oct-13", "8-Oct-14", "7-Oct-15", "...
## $ winner          <chr> "John Whaite", "Frances Quinn", "Nancy Birtwhistle"...
## $ avg_uk_viewers  <dbl> 5.00, 7.35, 10.04, 12.50, 13.85, 9.29
## $ day_of_week     <chr> "Tuesday", "Tuesday", "Wednesday", "Wednesday", "We...
## $ timeslot        <time> 20:00:00, 20:00:00, 20:00:00, 20:00:00, 20:00:00, ...
## $ channel         <chr> "BBC Two", "BBC Two", "BBC One", "BBC One", "BBC On...
## $ runner_up_1     <chr> "Brendan Lynch", "Kimberley Wilson", "Luis Troyano"...
## $ runner_up_2     <chr> "James Morton", "Ruby Tandoh", "Richard Burr", "Tam...
## $ season          <dbl> NA, 1, 2, 3, 4, NA
## $ season_premiere <chr> NA, "12/28/14", "9/6/15", "7/1/16", "6/16/17", NA
## $ season_finale   <chr> NA, "3/1/15", "11/8/15", "8/12/16", "8/4/17", NA
## $ e1_uk_airdate   <chr> "8/14/12", "8/20/13", "8/6/14", "8/5/15", "8/24/16"...
## $ e2_uk_airdate   <chr> "8/21/12", "8/27/13", "8/13/14", "8/12/15", "8/31/1...
## $ e3_uk_airdate   <chr> "8/28/12", "9/3/13", "8/20/14", "8/19/15", "9/7/16"...
## $ e4_uk_airdate   <chr> "9/4/12", "9/10/13", "8/27/14", "8/26/15", "9/14/16...
## $ e5_uk_airdate   <chr> "9/11/12", "9/17/13", "9/3/14", "9/2/15", "9/21/16"...
## $ e6_uk_airdate   <chr> "9/18/12", "9/24/13", "9/10/14", "9/9/15", "9/28/16...
## $ e7_uk_airdate   <chr> "9/25/12", "10/1/13", "9/17/14", "9/16/15", "10/5/1...
## $ e8_uk_airdate   <chr> "10/2/12", "10/8/13", "9/24/14", "9/23/15", "10/12/...
## $ e9_uk_airdate   <chr> "10/9/12", "10/15/13", "10/1/14", "9/30/15", "10/19...
## $ e10_uk_airdate  <chr> "10/16/12", "10/22/13", "10/8/14", "10/7/15", "10/2...
## $ viewer_growth   <dbl> 2.89, 2.85, 5.00, 3.43, 2.32, 0.58
## $ bbc             <fct> 1, 1, 1, 1, 1, 0

# Adapt code to keep original order
viewers_7day <- ratings %>% 
    select(everything(),
           viewers_7day_ = ends_with("7day"), 
           -ends_with("28day"))

# Glimpse
glimpse(viewers_7day)

## Rows: 6
## Columns: 36
## $ series          <dbl> 3, 4, 5, 6, 7, 8
## $ episodes        <dbl> 10, 10, 10, 10, 10, 10
## $ premiere        <chr> "14-Aug-12", "20-Aug-13", "6-Aug-14", "5-Aug-15", "...
## $ finale          <chr> "16-Oct-12", "22-Oct-13", "8-Oct-14", "7-Oct-15", "...
## $ winner          <chr> "John Whaite", "Frances Quinn", "Nancy Birtwhistle"...
## $ avg_uk_viewers  <dbl> 5.00, 7.35, 10.04, 12.50, 13.85, 9.29
## $ day_of_week     <chr> "Tuesday", "Tuesday", "Wednesday", "Wednesday", "We...
## $ timeslot        <time> 20:00:00, 20:00:00, 20:00:00, 20:00:00, 20:00:00, ...
## $ channel         <chr> "BBC Two", "BBC Two", "BBC One", "BBC One", "BBC On...
## $ runner_up_1     <chr> "Brendan Lynch", "Kimberley Wilson", "Luis Troyano"...
## $ runner_up_2     <chr> "James Morton", "Ruby Tandoh", "Richard Burr", "Tam...
## $ season          <dbl> NA, 1, 2, 3, 4, NA
## $ season_premiere <chr> NA, "12/28/14", "9/6/15", "7/1/16", "6/16/17", NA
## $ season_finale   <chr> NA, "3/1/15", "11/8/15", "8/12/16", "8/4/17", NA
## $ viewers_7day_1  <dbl> 3.85, 6.60, 8.51, 11.62, 13.58, 9.46
## $ viewers_7day_2  <dbl> 4.60, 6.65, 8.79, 11.59, 13.45, 9.23
## $ viewers_7day_3  <dbl> 4.53, 7.17, 9.28, 12.01, 13.01, 8.68
## $ viewers_7day_4  <dbl> 4.71, 6.82, 10.25, 12.36, 13.29, 8.55
## $ viewers_7day_5  <dbl> 4.61, 6.95, 9.95, 12.39, 13.12, 8.61
## $ viewers_7day_6  <dbl> 4.82, 7.32, 10.13, 12.00, 13.13, 8.61
## $ viewers_7day_7  <dbl> 5.10, 7.76, 10.28, 12.35, 13.45, 9.01
## $ viewers_7day_8  <dbl> 5.350, 7.410, 9.023, 11.090, 13.260, 8.950
## $ viewers_7day_9  <dbl> 5.70, 7.41, 10.67, 12.65, 13.44, 9.03
## $ viewers_7day_10 <dbl> 6.74, 9.45, 13.51, 15.05, 15.90, 10.04
## $ e1_uk_airdate   <chr> "8/14/12", "8/20/13", "8/6/14", "8/5/15", "8/24/16"...
## $ e2_uk_airdate   <chr> "8/21/12", "8/27/13", "8/13/14", "8/12/15", "8/31/1...
## $ e3_uk_airdate   <chr> "8/28/12", "9/3/13", "8/20/14", "8/19/15", "9/7/16"...
## $ e4_uk_airdate   <chr> "9/4/12", "9/10/13", "8/27/14", "8/26/15", "9/14/16...
## $ e5_uk_airdate   <chr> "9/11/12", "9/17/13", "9/3/14", "9/2/15", "9/21/16"...
## $ e6_uk_airdate   <chr> "9/18/12", "9/24/13", "9/10/14", "9/9/15", "9/28/16...
## $ e7_uk_airdate   <chr> "9/25/12", "10/1/13", "9/17/14", "9/16/15", "10/5/1...
## $ e8_uk_airdate   <chr> "10/2/12", "10/8/13", "9/24/14", "9/23/15", "10/12/...
## $ e9_uk_airdate   <chr> "10/9/12", "10/15/13", "10/1/14", "9/30/15", "10/19...
## $ e10_uk_airdate  <chr> "10/16/12", "10/22/13", "10/8/14", "10/7/15", "10/2...
## $ viewer_growth   <dbl> 2.89, 2.85, 5.00, 3.43, 2.32, 0.58
## $ bbc             <fct> 1, 1, 1, 1, 1, 0

Tidy your data

Introduction to Tidy Data

Plot untidy data

We’ll start by visualizing the number of show viewers (in millions, cumulative across seven days) across series by episode using the ratings data

ratings <- read_csv("data/messy_ratings.csv",
                     col_types = cols(
                    series = col_factor(levels = NULL)
                    ))

# Plot of episode 1 viewers by series
ggplot(ratings, aes(x = series, y = e1))+
 geom_col()

# Adapt code to plot episode 2 viewers by series
ggplot(ratings, aes(x = series, y = e2)) +
    geom_col()

Series 8 was the first with different hosts on a different channel. Do you notice a pattern? To draw any conclusions, we’d want to plot the viewers for every episode though- aren’t you glad you weren’t asked to make all 10 plots? Wouldn’t it be nice if episode was a variable we could use?

Gather

Gather & plot

we’ll tidy the data, then use row_number() from the dplyr package to create a continuous episode count variable, such that episode 1 from series 1 will have a value of 1, episode 1 from series 2 will have a value of 7, and so on. This new variable will be mapped onto the x-axis using ggplot2.

library(tidyr)
tidy_ratings <- ratings %>%
    # Gather and convert episode to factor
    gather(key = "episode", value = "viewers_7day", -series, 
           factor_key = TRUE, na.rm = TRUE) %>%
    # Sort in ascending order by series and episode
    arrange(series, episode) %>% 
    # Create new variable using row_number()
    mutate(episode_count = row_number())

# Plot viewers by episode and series
ggplot(tidy_ratings, aes(x = episode_count, 
                y = viewers_7day, 
                fill = series)) +
    geom_col()

Beautiful plot! In series 8, one of the judges accidentally tweeted out the winner’s name 12 hours before the finale aired. Do you think that hurt their finale viewer numbers?

Gather & plot non-sequential columns

In this exercise, you’ll combine tidyr with dplyr::select() to keep and gather non-sequential columns in the ratings2 data. You can review the select helper functions here. The ratings2 data includes per-episode data:

e*_7day: viewers within a 7-day window, and
e*_28day: viewers within a 28-day window.

Here, we’ll focus on the 7-day viewers.

ratings2 <- read_csv("data/messy_ratings2.csv",
                     col_types = cols(
                    series = col_factor(levels = NULL)
                    ))

week_ratings <- ratings2  %>% 
    # Select 7-day viewer ratings
    select(series, ends_with("7day")) %>% 
    # Gather 7-day viewers by episode
    gather(episode, viewers_7day, ends_with("7day"), na.rm = TRUE, factor_key = TRUE)

# Plot 7-day viewers by episode and series
ggplot(week_ratings, aes(x = episode, 
                y = viewers_7day, 
                group = series)) +
    geom_line() +
    facet_wrap(~series)

You’re done gathering! This plot tells you a lot, but the x-axis labels are a mess! We need to do more tidying to make this plot better.

Separate

Separate a column

To create the week_ratings data, we first selected all the columns in ratings2 that contain data about the number of 7-day viewers in millions for each series, then we gathered those 10 columns into two columns.

In this exercise, you’ll use the tidyr, dplyr, and readr packages to tidy and plot the week_ratings data so we can read the x-axis, which labels each episode!

week_ratings <- ratings2 %>% 
    select(series, ends_with("7day")) %>% 
    gather(episode, viewers_7day, ends_with("7day"), 
           na.rm = TRUE) %>% 
    separate(episode, into = "episode", extra = "drop") %>% 
    # Edit to parse episode number
    mutate(episode = parse_number(episode))

# Print to view
week_ratings

## # A tibble: 74 x 3
##    series episode viewers_7day
##    <fct>    <dbl>        <dbl>
##  1 1            1         2.24
##  2 2            1         3.1 
##  3 3            1         3.85
##  4 4            1         6.6 
##  5 5            1         8.51
##  6 6            1        11.6 
##  7 7            1        13.6 
##  8 8            1         9.46
##  9 1            2         3   
## 10 2            2         3.53
## # ... with 64 more rows

# Create week_ratings
week_ratings <- ratings2 %>% 
    select(series, ends_with("7day")) %>% 
    gather(episode, viewers_7day, ends_with("7day"), 
           na.rm = TRUE) %>% 
    separate(episode, into = "episode", extra = "drop") %>% 
    mutate(episode = parse_number(episode))
    
# Edit your code to color by series and add a theme
ggplot(week_ratings, aes(x = episode, y = viewers_7day, 
                         group = series, color = series)) +
    geom_line() +
    facet_wrap(~series) +
    guides(color = FALSE) +
    theme_minimal()

### Unite columns

In the tidyr package, the opposite of separate() is unite(). Sometimes you need to paste values from two or more columns together to tidy. Here is an example usage for unite():

data %>%
 unite(new_var, old_var1, old_var2)

The ratings2 dataset includes these variables:

series: 1-8
episode: 1-10
viewers_millions: whole number of 7-day viewers in millions
viewers_decimal: additional number of 7-days viewers in millions

If viewers_millions = 2 and viewers_decimal = .6, then the total number of viewers in millions is 2.6. In this exercise, you’ll practice uniting these two columns into a single new column.

library(readr)
ratings_2 <- read_csv("data/messy_ratings2.csv")

## 
## -- Column specification --------------------------------------------------------
## cols(
##   .default = col_double()
## )
## i Use `spec()` for the full column specifications.

Working with Data in the Tidyverse

Mc Namara Chiwaye

1/14/2021

Explore your data

Import your data

Read a CSV file

Assign missing values

Know your data

Arrange and glimpse

Summarize your data

Know your variable types

Count with your data

Distinct and count

Count episodes

Count bakers

Plot counts

Tame your data

Cast column types

Cast a column to a date

Cast a column to a number

Cast a column as a factor

Recode values

Recode a character variable

Recode a numeric variable

Select variables

Combine functions with select

Recode factor to plot

Select and reorder variables

Tame variable names

Reformat variables

Rename and subset variables

Rename and reorder variables

Tidy your data

Introduction to Tidy Data

Plot untidy data

Gather

Gather & plot

Gather & plot non-sequential columns

Separate

Separate a column