In this exercise, you’ll use read_csv() twice. The first time you will only specify the filename, but you’ll notice a problem with the imported data. The second time you’ll use a new argument called skip to fix the problem. Remember to use ?read_csv to read more about arguments like skip and how to use them.
The data you’ll work with is from “The Great British Bake-Off.” The file “bakeoff.csv” contains data for each episode of the show, organized by series and baker.
# Load readr
library(here)
## here() starts at C:/Users/Takunda Glendice/Documents/max/working with data in tidyverse
library(readr)
# Create bakeoff from "bakeoff.csv"
bakeoff <- read_csv("data/bakeoff.csv")
##
## -- Column specification --------------------------------------------------------
## cols(
## series = col_double(),
## episode = col_double(),
## baker = col_character(),
## signature = col_character(),
## technical = col_double(),
## showstopper = col_character(),
## result = col_character(),
## uk_airdate = col_date(format = ""),
## us_season = col_double(),
## us_airdate = col_date(format = "")
## )
# Print bakeoff
bakeoff
## # A tibble: 549 x 10
## series episode baker signature technical showstopper result uk_airdate
## <dbl> <dbl> <chr> <chr> <dbl> <chr> <chr> <date>
## 1 1 1 Anne~ "Light J~ 2 "Chocolate~ IN 2010-08-17
## 2 1 1 David "Chocola~ 3 "Black For~ IN 2010-08-17
## 3 1 1 Edd "Caramel~ 1 <NA> IN 2010-08-17
## 4 1 1 Jasm~ "Fresh M~ NA <NA> IN 2010-08-17
## 5 1 1 Jona~ "Carrot ~ 9 "Three-lay~ IN 2010-08-17
## 6 1 1 Loui~ "Carrot ~ NA "Never Fai~ IN 2010-08-17
## 7 1 1 Mira~ "Triple ~ 8 "Three Tie~ IN 2010-08-17
## 8 1 1 Ruth "Lemon D~ NA "Classic C~ IN 2010-08-17
## 9 1 1 Lea "Cranber~ 10 "Chocolate~ OUT 2010-08-17
## 10 1 1 Mark "Sticky ~ NA "Heart-sha~ OUT 2010-08-17
## # ... with 539 more rows, and 2 more variables: us_season <dbl>,
## # us_airdate <date>
# Create bakeoff but skip first row in case needed
# bakeoff <- read_csv("data/bakeoff.csv", skip = 1)
Our original import had 3 problems: we had 1 extra observation, our first row stored the true variable names, and all column types were characters. Adding skip = 1 worked because the default argument for col_names is TRUE.
The read_csv() function also has an na argument, which allows you to specify value(s) that represent missing values in your data. The default values for the na argument are c("“,”NA“), so both are recoded as missing (NA) in R. When you read in data, you can add additional values like the string”UNKNOWN" to a vector of missing values using the c() function to combine multiple values into a single vector.
The is.na() function is also helpful for identifying rows with missing values for a variable.
# Load dplyr
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# Filter rows where showstopper is UNKNOWN
bakeoff %>%
filter(showstopper == "UNKNOWN")
## # A tibble: 0 x 10
## # ... with 10 variables: series <dbl>, episode <dbl>, baker <chr>,
## # signature <chr>, technical <dbl>, showstopper <chr>, result <chr>,
## # uk_airdate <date>, us_season <dbl>, us_airdate <date>
# Edit to add list of missing values
bakeoff <- read_csv("data/bakeoff.csv", na = c("", "NA", "UNKNOWN"))
##
## -- Column specification --------------------------------------------------------
## cols(
## series = col_double(),
## episode = col_double(),
## baker = col_character(),
## signature = col_character(),
## technical = col_double(),
## showstopper = col_character(),
## result = col_character(),
## uk_airdate = col_date(format = ""),
## us_season = col_double(),
## us_airdate = col_date(format = "")
## )
# Filter rows where showstopper is NA
bakeoff %>%
filter(is.na(showstopper))
## # A tibble: 21 x 10
## series episode baker signature technical showstopper result uk_airdate
## <dbl> <dbl> <chr> <chr> <dbl> <chr> <chr> <date>
## 1 1 1 Edd "Caramel~ 1 <NA> IN 2010-08-17
## 2 1 1 Jasm~ "Fresh M~ NA <NA> IN 2010-08-17
## 3 1 6 Mira~ "Lemon C~ NA <NA> RUNNE~ 2010-09-21
## 4 2 1 Ian "Apple a~ 10 <NA> IN 2011-08-16
## 5 2 1 Jason "Lemon M~ 6 <NA> IN 2011-08-16
## 6 2 1 Urva~ "Cherry ~ 7 <NA> IN 2011-08-16
## 7 2 1 Yasm~ "Cardamo~ 5 <NA> IN 2011-08-16
## 8 2 1 Holly "Cherry ~ 1 <NA> SB 2011-08-16
## 9 2 2 Ben "Chorizo~ 1 <NA> IN 2011-08-23
## 10 2 2 Ian "Stilton~ 2 <NA> IN 2011-08-23
## # ... with 11 more rows, and 2 more variables: us_season <dbl>,
## # us_airdate <date>
We only had 4 missing values for the showstopper variable to start, but now all 21 are present and accounted for.
You can combine glimpse() with other functions in a sequence using the pipe (%>%) operator. For example, you can use other dplyr functions like arrange first, then use glimpse by adding a line after the final pipe (%>%):
bakers_mini %>%
arrange(age) %>%
glimpse() # no argument needed here
Take a glimpse of the bakeoff data we imported in the first set of exercises. On which date did the first episode of the show air in the US?
library(skimr)
# Filter and skim
bakeoff %>%
filter(!is.na(us_season)) %>%
skim()
| Name | Piped data |
| Number of rows | 302 |
| Number of columns | 10 |
| _______________________ | |
| Column type frequency: | |
| character | 4 |
| Date | 2 |
| numeric | 4 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| baker | 0 | 1 | 3 | 9 | 0 | 48 | 0 |
| signature | 1 | 1 | 10 | 125 | 0 | 300 | 0 |
| showstopper | 1 | 1 | 5 | 126 | 0 | 298 | 0 |
| result | 0 | 1 | 2 | 9 | 0 | 6 | 0 |
Variable type: Date
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| uk_airdate | 0 | 1 | 2013-08-20 | 2016-10-26 | 2014-10-08 | 40 |
| us_airdate | 0 | 1 | 2014-12-28 | 2017-08-04 | 2015-11-08 | 35 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| series | 0 | 1 | 5.49 | 1.13 | 4 | 4 | 5 | 6 | 7 | ▇▇▁▇▇ |
| episode | 0 | 1 | 4.37 | 2.66 | 1 | 2 | 4 | 6 | 10 | ▇▆▅▃▂ |
| technical | 1 | 1 | 4.85 | 2.98 | 1 | 2 | 4 | 7 | 13 | ▇▅▅▂▁ |
| us_season | 0 | 1 | 2.49 | 1.13 | 1 | 1 | 2 | 3 | 4 | ▇▇▁▇▇ |
The first episode of the first US season aired on December 28, 2014.
You can combine skim() with other functions in a sequence using the pipe (%>%) operator. For example, you could use other dplyr functions like group_by first, then use skim() by adding a line after the final pipe.
bakers_mini %>%
group_by(series) %>%
skim() # no argument needed here
This will produce summary statistics for each series. Let’s practice this!
# Load skimr
library(skimr)
bakeoff %>%
group_by(series) %>%
skim() # no argument needed here
## Warning in min.default(structure(c(NA_real_, NA_real_, NA_real_, NA_real_, : no
## non-missing arguments to min; returning Inf
## Warning in min.default(structure(c(NA_real_, NA_real_, NA_real_, NA_real_, : no
## non-missing arguments to min; returning Inf
## Warning in min.default(structure(c(NA_real_, NA_real_, NA_real_, NA_real_, : no
## non-missing arguments to min; returning Inf
## Warning in min.default(structure(c(NA_real_, NA_real_, NA_real_, NA_real_, : no
## non-missing arguments to min; returning Inf
## Warning in max.default(structure(c(NA_real_, NA_real_, NA_real_, NA_real_, : no
## non-missing arguments to max; returning -Inf
## Warning in max.default(structure(c(NA_real_, NA_real_, NA_real_, NA_real_, : no
## non-missing arguments to max; returning -Inf
## Warning in max.default(structure(c(NA_real_, NA_real_, NA_real_, NA_real_, : no
## non-missing arguments to max; returning -Inf
## Warning in max.default(structure(c(NA_real_, NA_real_, NA_real_, NA_real_, : no
## non-missing arguments to max; returning -Inf
| Name | Piped data |
| Number of rows | 549 |
| Number of columns | 10 |
| _______________________ | |
| Column type frequency: | |
| character | 4 |
| Date | 2 |
| numeric | 3 |
| ________________________ | |
| Group variables | series |
Variable type: character
| skim_variable | series | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|---|
| baker | 1 | 0 | 1.00 | 3 | 9 | 0 | 10 | 0 |
| baker | 2 | 0 | 1.00 | 3 | 9 | 0 | 12 | 0 |
| baker | 3 | 0 | 1.00 | 4 | 10 | 0 | 12 | 0 |
| baker | 4 | 0 | 1.00 | 3 | 9 | 0 | 13 | 0 |
| baker | 5 | 0 | 1.00 | 4 | 7 | 0 | 12 | 0 |
| baker | 6 | 0 | 1.00 | 3 | 6 | 0 | 12 | 0 |
| baker | 7 | 0 | 1.00 | 3 | 9 | 0 | 12 | 0 |
| baker | 8 | 0 | 1.00 | 3 | 6 | 0 | 12 | 0 |
| signature | 1 | 0 | 1.00 | 8 | 63 | 0 | 36 | 0 |
| signature | 2 | 0 | 1.00 | 9 | 69 | 0 | 59 | 0 |
| signature | 3 | 0 | 1.00 | 11 | 119 | 0 | 76 | 0 |
| signature | 4 | 0 | 1.00 | 10 | 125 | 0 | 78 | 0 |
| signature | 5 | 1 | 0.99 | 15 | 107 | 0 | 73 | 0 |
| signature | 6 | 0 | 1.00 | 12 | 64 | 0 | 74 | 0 |
| signature | 7 | 0 | 1.00 | 12 | 93 | 0 | 75 | 0 |
| signature | 8 | 0 | 1.00 | 12 | 64 | 0 | 75 | 0 |
| showstopper | 1 | 3 | 0.92 | 13 | 197 | 0 | 33 | 0 |
| showstopper | 2 | 10 | 0.83 | 17 | 118 | 0 | 50 | 0 |
| showstopper | 3 | 7 | 0.91 | 12 | 70 | 0 | 69 | 0 |
| showstopper | 4 | 0 | 1.00 | 5 | 126 | 0 | 78 | 0 |
| showstopper | 5 | 1 | 0.99 | 8 | 82 | 0 | 73 | 0 |
| showstopper | 6 | 0 | 1.00 | 10 | 70 | 0 | 73 | 0 |
| showstopper | 7 | 0 | 1.00 | 5 | 86 | 0 | 74 | 0 |
| showstopper | 8 | 0 | 1.00 | 13 | 68 | 0 | 75 | 0 |
| result | 1 | 0 | 1.00 | 2 | 9 | 0 | 4 | 0 |
| result | 2 | 0 | 1.00 | 2 | 9 | 0 | 5 | 0 |
| result | 3 | 0 | 1.00 | 2 | 9 | 0 | 5 | 0 |
| result | 4 | 0 | 1.00 | 2 | 9 | 0 | 5 | 0 |
| result | 5 | 0 | 1.00 | 2 | 9 | 0 | 6 | 0 |
| result | 6 | 0 | 1.00 | 2 | 9 | 0 | 5 | 0 |
| result | 7 | 0 | 1.00 | 2 | 9 | 0 | 5 | 0 |
| result | 8 | 0 | 1.00 | 2 | 9 | 0 | 5 | 0 |
Variable type: Date
| skim_variable | series | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|---|
| uk_airdate | 1 | 0 | 1 | 2010-08-17 | 2010-09-21 | 2010-08-27 | 6 |
| uk_airdate | 2 | 0 | 1 | 2011-08-16 | 2011-10-04 | 2011-08-30 | 8 |
| uk_airdate | 3 | 0 | 1 | 2012-08-14 | 2012-10-16 | 2012-09-04 | 10 |
| uk_airdate | 4 | 0 | 1 | 2013-08-20 | 2013-10-22 | 2013-09-10 | 10 |
| uk_airdate | 5 | 0 | 1 | 2014-08-06 | 2014-10-08 | 2014-08-27 | 10 |
| uk_airdate | 6 | 0 | 1 | 2015-08-05 | 2015-10-07 | 2015-08-26 | 10 |
| uk_airdate | 7 | 0 | 1 | 2016-08-24 | 2016-10-26 | 2016-09-14 | 10 |
| uk_airdate | 8 | 0 | 1 | 2017-08-29 | 2017-10-31 | 2017-09-19 | 10 |
| us_airdate | 1 | 36 | 0 | NA | NA | NA | 0 |
| us_airdate | 2 | 60 | 0 | NA | NA | NA | 0 |
| us_airdate | 3 | 76 | 0 | NA | NA | NA | 0 |
| us_airdate | 4 | 0 | 1 | 2014-12-28 | 2015-03-01 | 2015-01-18 | 10 |
| us_airdate | 5 | 0 | 1 | 2015-09-06 | 2015-11-08 | 2015-09-27 | 10 |
| us_airdate | 6 | 0 | 1 | 2016-07-01 | 2016-08-12 | 2016-07-15 | 7 |
| us_airdate | 7 | 0 | 1 | 2017-06-16 | 2017-08-04 | 2017-06-30 | 8 |
| us_airdate | 8 | 75 | 0 | NA | NA | NA | 0 |
Variable type: numeric
| skim_variable | series | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|---|
| episode | 1 | 0 | 1.00 | 2.83 | 1.65 | 1 | 1.00 | 2.5 | 4.00 | 6 | ▇▂▂▂▁ |
| episode | 2 | 0 | 1.00 | 3.55 | 2.10 | 1 | 2.00 | 3.0 | 5.00 | 8 | ▇▃▅▂▂ |
| episode | 3 | 0 | 1.00 | 4.43 | 2.67 | 1 | 2.00 | 4.0 | 6.25 | 10 | ▇▆▅▅▂ |
| episode | 4 | 0 | 1.00 | 4.31 | 2.66 | 1 | 2.00 | 4.0 | 6.00 | 10 | ▇▆▅▃▂ |
| episode | 5 | 0 | 1.00 | 4.38 | 2.68 | 1 | 2.00 | 4.0 | 6.00 | 10 | ▇▆▅▃▂ |
| episode | 6 | 0 | 1.00 | 4.40 | 2.67 | 1 | 2.00 | 4.0 | 6.00 | 10 | ▇▆▅▃▂ |
| episode | 7 | 0 | 1.00 | 4.40 | 2.67 | 1 | 2.00 | 4.0 | 6.00 | 10 | ▇▆▅▃▂ |
| episode | 8 | 0 | 1.00 | 4.40 | 2.67 | 1 | 2.00 | 4.0 | 6.00 | 10 | ▇▆▅▃▂ |
| technical | 1 | 7 | 0.81 | 3.97 | 2.56 | 1 | 2.00 | 3.0 | 5.00 | 10 | ▇▇▃▂▂ |
| technical | 2 | 0 | 1.00 | 4.90 | 3.03 | 1 | 2.00 | 4.0 | 7.00 | 12 | ▇▅▃▂▂ |
| technical | 3 | 0 | 1.00 | 4.82 | 2.92 | 1 | 2.00 | 4.0 | 7.00 | 12 | ▇▅▃▂▂ |
| technical | 4 | 0 | 1.00 | 5.08 | 3.19 | 1 | 2.25 | 4.5 | 7.00 | 13 | ▇▅▅▂▂ |
| technical | 5 | 1 | 0.99 | 4.73 | 2.93 | 1 | 2.00 | 4.0 | 7.00 | 12 | ▇▅▃▂▂ |
| technical | 6 | 0 | 1.00 | 4.80 | 2.92 | 1 | 2.00 | 4.0 | 7.00 | 12 | ▇▅▃▂▂ |
| technical | 7 | 0 | 1.00 | 4.80 | 2.92 | 1 | 2.00 | 4.0 | 7.00 | 12 | ▇▅▃▂▂ |
| technical | 8 | 0 | 1.00 | 4.80 | 2.92 | 1 | 2.00 | 4.0 | 7.00 | 12 | ▇▅▃▂▂ |
| us_season | 1 | 36 | 0.00 | NaN | NA | NA | NA | NA | NA | NA | |
| us_season | 2 | 60 | 0.00 | NaN | NA | NA | NA | NA | NA | NA | |
| us_season | 3 | 76 | 0.00 | NaN | NA | NA | NA | NA | NA | NA | |
| us_season | 4 | 0 | 1.00 | 1.00 | 0.00 | 1 | 1.00 | 1.0 | 1.00 | 1 | ▁▁▇▁▁ |
| us_season | 5 | 0 | 1.00 | 2.00 | 0.00 | 2 | 2.00 | 2.0 | 2.00 | 2 | ▁▁▇▁▁ |
| us_season | 6 | 0 | 1.00 | 3.00 | 0.00 | 3 | 3.00 | 3.0 | 3.00 | 3 | ▁▁▇▁▁ |
| us_season | 7 | 0 | 1.00 | 4.00 | 0.00 | 4 | 4.00 | 4.0 | 4.00 | 4 | ▁▁▇▁▁ |
| us_season | 8 | 75 | 0.00 | NaN | NA | NA | NA | NA | NA | NA |
# Edit to filter, group by, and skim
bakeoff %>%
filter(!is.na(us_season)) %>%
group_by(us_season) %>%
skim()
| Name | Piped data |
| Number of rows | 302 |
| Number of columns | 10 |
| _______________________ | |
| Column type frequency: | |
| character | 4 |
| Date | 2 |
| numeric | 3 |
| ________________________ | |
| Group variables | us_season |
Variable type: character
| skim_variable | us_season | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|---|
| baker | 1 | 0 | 1.00 | 3 | 9 | 0 | 13 | 0 |
| baker | 2 | 0 | 1.00 | 4 | 7 | 0 | 12 | 0 |
| baker | 3 | 0 | 1.00 | 3 | 6 | 0 | 12 | 0 |
| baker | 4 | 0 | 1.00 | 3 | 9 | 0 | 12 | 0 |
| signature | 1 | 0 | 1.00 | 10 | 125 | 0 | 78 | 0 |
| signature | 2 | 1 | 0.99 | 15 | 107 | 0 | 73 | 0 |
| signature | 3 | 0 | 1.00 | 12 | 64 | 0 | 74 | 0 |
| signature | 4 | 0 | 1.00 | 12 | 93 | 0 | 75 | 0 |
| showstopper | 1 | 0 | 1.00 | 5 | 126 | 0 | 78 | 0 |
| showstopper | 2 | 1 | 0.99 | 8 | 82 | 0 | 73 | 0 |
| showstopper | 3 | 0 | 1.00 | 10 | 70 | 0 | 73 | 0 |
| showstopper | 4 | 0 | 1.00 | 5 | 86 | 0 | 74 | 0 |
| result | 1 | 0 | 1.00 | 2 | 9 | 0 | 5 | 0 |
| result | 2 | 0 | 1.00 | 2 | 9 | 0 | 6 | 0 |
| result | 3 | 0 | 1.00 | 2 | 9 | 0 | 5 | 0 |
| result | 4 | 0 | 1.00 | 2 | 9 | 0 | 5 | 0 |
Variable type: Date
| skim_variable | us_season | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|---|
| uk_airdate | 1 | 0 | 1 | 2013-08-20 | 2013-10-22 | 2013-09-10 | 10 |
| uk_airdate | 2 | 0 | 1 | 2014-08-06 | 2014-10-08 | 2014-08-27 | 10 |
| uk_airdate | 3 | 0 | 1 | 2015-08-05 | 2015-10-07 | 2015-08-26 | 10 |
| uk_airdate | 4 | 0 | 1 | 2016-08-24 | 2016-10-26 | 2016-09-14 | 10 |
| us_airdate | 1 | 0 | 1 | 2014-12-28 | 2015-03-01 | 2015-01-18 | 10 |
| us_airdate | 2 | 0 | 1 | 2015-09-06 | 2015-11-08 | 2015-09-27 | 10 |
| us_airdate | 3 | 0 | 1 | 2016-07-01 | 2016-08-12 | 2016-07-15 | 7 |
| us_airdate | 4 | 0 | 1 | 2017-06-16 | 2017-08-04 | 2017-06-30 | 8 |
Variable type: numeric
| skim_variable | us_season | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|---|
| series | 1 | 0 | 1.00 | 4.00 | 0.00 | 4 | 4.00 | 4.0 | 4 | 4 | ▁▁▇▁▁ |
| series | 2 | 0 | 1.00 | 5.00 | 0.00 | 5 | 5.00 | 5.0 | 5 | 5 | ▁▁▇▁▁ |
| series | 3 | 0 | 1.00 | 6.00 | 0.00 | 6 | 6.00 | 6.0 | 6 | 6 | ▁▁▇▁▁ |
| series | 4 | 0 | 1.00 | 7.00 | 0.00 | 7 | 7.00 | 7.0 | 7 | 7 | ▁▁▇▁▁ |
| episode | 1 | 0 | 1.00 | 4.31 | 2.66 | 1 | 2.00 | 4.0 | 6 | 10 | ▇▆▅▃▂ |
| episode | 2 | 0 | 1.00 | 4.38 | 2.68 | 1 | 2.00 | 4.0 | 6 | 10 | ▇▆▅▃▂ |
| episode | 3 | 0 | 1.00 | 4.40 | 2.67 | 1 | 2.00 | 4.0 | 6 | 10 | ▇▆▅▃▂ |
| episode | 4 | 0 | 1.00 | 4.40 | 2.67 | 1 | 2.00 | 4.0 | 6 | 10 | ▇▆▅▃▂ |
| technical | 1 | 0 | 1.00 | 5.08 | 3.19 | 1 | 2.25 | 4.5 | 7 | 13 | ▇▅▅▂▂ |
| technical | 2 | 1 | 0.99 | 4.73 | 2.93 | 1 | 2.00 | 4.0 | 7 | 12 | ▇▅▃▂▂ |
| technical | 3 | 0 | 1.00 | 4.80 | 2.92 | 1 | 2.00 | 4.0 | 7 | 12 | ▇▅▃▂▂ |
| technical | 4 | 0 | 1.00 | 4.80 | 2.92 | 1 | 2.00 | 4.0 | 7 | 12 | ▇▅▃▂▂ |
Nice piping & skimming! Producing quick grouped summaries with the skim function is a real time (and sanity) saver.
How many variables of each type do we have in the bakeoff data? You may also want to try piping a skimmed object to summary(), also from the skimr package:
tibble_name %>%
skim() %>% # no argument needed here
summary() # no argument needed here
bakeoff %>%
skim() %>% # no argument needed here
summary() # no argument needed here
| Name | Piped data |
| Number of rows | 549 |
| Number of columns | 10 |
| _______________________ | |
| Column type frequency: | |
| character | 4 |
| Date | 2 |
| numeric | 4 |
| ________________________ | |
| Group variables | None |
In every episode of “The Great British Bake-Off”, bakers complete 3 challenges and the show’s judges award the title “Star Baker” to the baker who excelled in that week’s challenges (with the exception of the finale). Each baker’s result for every episode is stored in bakeoff- result is a character variable, and the value “SB” stands for star baker.
# View distinct results
bakeoff %>%
distinct(result)
## # A tibble: 6 x 1
## result
## <chr>
## 1 IN
## 2 OUT
## 3 RUNNER UP
## 4 WINNER
## 5 SB
## 6 LEFT
# Count whether or not star baker
bakeoff %>%
count(result)
## # A tibble: 6 x 2
## result n
## <chr> <int>
## 1 IN 393
## 2 LEFT 1
## 3 OUT 70
## 4 RUNNER UP 16
## 5 SB 61
## 6 WINNER 8
# Count whether or not star baker
bakeoff %>%
count(result == "SB")
## # A tibble: 2 x 2
## `result == "SB"` n
## <lgl> <int>
## 1 FALSE 488
## 2 TRUE 61
There are 488 results that are not star baker.
Here’s how we can use count() back-to-back to roll up a level of detail across series:
bakers %>%
count(aired_us, series) %>%
count(aired_us)
Let’s practice counting the number of episodes per series.
bakeoff %>%
count(series, episode) %>%
count(series)
## # A tibble: 8 x 2
## series n
## <dbl> <int>
## 1 1 6
## 2 2 8
## 3 3 10
## 4 4 10
## 5 5 10
## 6 6 10
## 7 7 10
## 8 8 10
You can see that by series 3 we were treated to 10 episodes with each new series!
We’ll practice that again, focusing on counting bakers.
# Count the number of rows by series and baker
bakers_by_series <- bakeoff %>%
count(series, baker)
# Print to view
bakers_by_series
## # A tibble: 95 x 3
## series baker n
## <dbl> <chr> <int>
## 1 1 Annetha 2
## 2 1 David 4
## 3 1 Edd 6
## 4 1 Jasminder 5
## 5 1 Jonathan 3
## 6 1 Lea 1
## 7 1 Louise 2
## 8 1 Mark 1
## 9 1 Miranda 6
## 10 1 Ruth 6
## # ... with 85 more rows
# Count the number of rows by series and baker
bakers_by_series <- bakeoff %>%
count(series, baker)
# Print to view
bakers_by_series
## # A tibble: 95 x 3
## series baker n
## <dbl> <chr> <int>
## 1 1 Annetha 2
## 2 1 David 4
## 3 1 Edd 6
## 4 1 Jasminder 5
## 5 1 Jonathan 3
## 6 1 Lea 1
## 7 1 Louise 2
## 8 1 Mark 1
## 9 1 Miranda 6
## 10 1 Ruth 6
## # ... with 85 more rows
# Count again by series
bakers_by_series %>%
count(series)
## # A tibble: 8 x 2
## series n
## <dbl> <int>
## 1 1 10
## 2 2 12
## 3 3 12
## 4 4 13
## 5 5 12
## 6 6 12
## 7 7 12
## 8 8 12
# Count again by baker
bakers_by_series %>%
count(baker, sort = TRUE)
## # A tibble: 86 x 2
## baker n
## <chr> <int>
## 1 Kate 3
## 2 Ian 2
## 3 James 2
## 4 Louise 2
## 5 Mark 2
## 6 Peter 2
## 7 Robert 2
## 8 Tom 2
## 9 Ali 1
## 10 Alvin 1
## # ... with 76 more rows
Series 4 had the most bakers with 13 total (a baker’s dozen!). Kate is the most popular name- Kates have appeared in 3 different series of the TV show.
You can learn a lot about your data by counting, but sometimes you can learn even more by plotting counts. This is especially true when you have lots of things to count! With eight series, 74 episodes, and 95 bakers, a plot can be more helpful than a table of numbers. We’ll use ggplot2 to visualize the number of bakers across episodes for each series from bakeoff.
library(ggplot2)
ggplot(bakeoff, aes(episode)) +
geom_bar() +
facet_wrap(~series)
You made it through the first chapter! We have a lot of work left to do with The Great British Bake Off data, but notice how this plot is a great sanity check: the number of bakers tends to go down with every episode (never up!), with a few exceptions where no one was eliminated. Also, you can see that there are fewer episodes in the first two series. Finally, you can see that there are always three bakers who make it to the series finale.
A good workflow for parsing dates using readr is to, for example:
In “desserts.csv”, the variable uk_airdate is formatted like “17 August 2010”. Let’s parse, then cast this variable!
Later on in Chapter 4, we’ll showcase some functions from the lubridate package to help you extract data from dates, once they are cast properly.
# Find format to parse uk_airdate
parse_date("17 August 2010", format = "%d %B %Y")
## [1] "2010-08-17"
# Edit to cast uk_airdate
desserts <- read_csv("data/desserts.csv",
col_types = cols(
uk_airdate = col_date(format = "%d %B %Y")
)
)
# Arrange by descending uk_airdate
desserts %>%
arrange(desc(uk_airdate))
## # A tibble: 549 x 16
## series episode baker technical result uk_airdate us_season us_airdate
## <dbl> <dbl> <chr> <chr> <chr> <date> <dbl> <date>
## 1 8 10 Kate 3rd RUNNE~ 2017-10-31 NA NA
## 2 8 10 Stev~ 1st RUNNE~ 2017-10-31 NA NA
## 3 8 10 Soph~ 2nd WINNER 2017-10-31 NA NA
## 4 8 9 Kate 4th IN 2017-10-24 NA NA
## 5 8 9 Stev~ 3rd IN 2017-10-24 NA NA
## 6 8 9 Stac~ 2nd OUT 2017-10-24 NA NA
## 7 8 9 Soph~ 1st SB 2017-10-24 NA NA
## 8 8 8 Kate 2nd IN 2017-10-17 NA NA
## 9 8 8 Soph~ 4th IN 2017-10-17 NA NA
## 10 8 8 Stev~ 1st IN 2017-10-17 NA NA
## # ... with 539 more rows, and 8 more variables: showstopper_chocolate <chr>,
## # showstopper_dessert <chr>, showstopper_fruit <chr>, showstopper_nut <chr>,
## # signature_chocolate <chr>, signature_dessert <chr>, signature_fruit <chr>,
## # signature_nut <chr>
Notice that us_airdate didn’t need to be cast - this is because the date format %Y-%m-%d is unambiguous, so it is automatically parsed as a date by readr.
We saw a good workflow for parsing columns using readr:
But sometimes you’ll need to start with casting, then diagnose parsing problems using a new readr function called problems(). Using problems() on a result of read_csv() will show you the rows and columns where parsing error occurred, what the parser expected to find (for example, a number), and the actual value that caused the parsing error.
Let’s practice this with “desserts.csv”, which includes data about the judges’ ranks for each baker in the technical challenge, for every episode.
# Try to cast technical as a number
desserts <- read_csv("data/desserts.csv",
col_types = cols(
uk_airdate = col_date(format = "%d %B %Y"),
technical = col_number()
)
)
## Warning: 7 parsing failures.
## row col expected actual file
## 4 technical a number N/A 'data/desserts.csv'
## 6 technical a number N/A 'data/desserts.csv'
## 8 technical a number N/A 'data/desserts.csv'
## 10 technical a number N/A 'data/desserts.csv'
## 34 technical a number N/A 'data/desserts.csv'
## ... ......... ........ ...... ...................
## See problems(...) for more details.
# View parsing problems
desserts %>%
problems()
## # A tibble: 7 x 5
## row col expected actual file
## <int> <chr> <chr> <chr> <chr>
## 1 4 technical a number N/A 'data/desserts.csv'
## 2 6 technical a number N/A 'data/desserts.csv'
## 3 8 technical a number N/A 'data/desserts.csv'
## 4 10 technical a number N/A 'data/desserts.csv'
## 5 34 technical a number N/A 'data/desserts.csv'
## 6 35 technical a number N/A 'data/desserts.csv'
## 7 36 technical a number N/A 'data/desserts.csv'
# Edit code to fix the parsing error
desserts <- read_csv("data/desserts.csv",
col_types = cols(
uk_airdate = col_date(format = "%d %B %Y"),
technical = col_number()
),
na = c("", "NA", "N/A") #solving the problem
)
# View parsing problems
problems(desserts)
## [1] row col expected actual
## <0 rows> (or 0-length row.names)
Factors are categorical variables, where the possible values are a fixed and known set. For example, take a simple factor like bake below:
bake <- c("pie", "cake", "neither")
parse_factor(bake, levels = NULL)
## [1] pie cake neither
## Levels: pie cake neither
You can use parse_factor() to parse variables and col_factor() to cast columns as categorical. Both functions have a levels argument that is used to specify the possible values for the factors. When levels is set to NULL, the possible values will be inferred from the unique values in the dataset. Alternatively, you can pass a list of possible values.
This time, you will look at the result column, which contains the outcome of the competition for each baker.
# Cast result a factor
desserts <- read_csv("data/desserts.csv",
na = c("", "NA", "N/A"),
col_types = cols(
uk_airdate = col_date(format = "%d %B %Y"),
technical = col_number(),
result = col_factor(levels = NULL)
)
)
# Glimpse to view
glimpse(desserts)
## Rows: 549
## Columns: 16
## $ series <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ episode <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, ...
## $ baker <chr> "Annetha", "David", "Edd", "Jasminder", "Jona...
## $ technical <dbl> 2, 3, 1, NA, 9, NA, 8, NA, 10, NA, 8, 6, 2, 1...
## $ result <fct> IN, IN, IN, IN, IN, IN, IN, IN, OUT, OUT, IN,...
## $ uk_airdate <date> 2010-08-17, 2010-08-17, 2010-08-17, 2010-08-...
## $ us_season <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ us_airdate <date> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ showstopper_chocolate <chr> "chocolate", "chocolate", "no chocolate", "no...
## $ showstopper_dessert <chr> "other", "other", "other", "other", "other", ...
## $ showstopper_fruit <chr> "no fruit", "no fruit", "no fruit", "no fruit...
## $ showstopper_nut <chr> "no nut", "no nut", "no nut", "no nut", "almo...
## $ signature_chocolate <chr> "no chocolate", "chocolate", "no chocolate", ...
## $ signature_dessert <chr> "cake", "cake", "cake", "cake", "cake", "cake...
## $ signature_fruit <chr> "no fruit", "fruit", "fruit", "fruit", "fruit...
## $ signature_nut <chr> "no nut", "no nut", "no nut", "no nut", "no n...
While it takes a lot of work up front, casting column types when you import can make your analyses easier to reproduce.
In this exercise, you’ll recode() the nut variable in the desserts data. This is a character variable that tells us, for each bake, whether a nut was a key ingredient and if so, what kind of nut!
Remember that the arguments of recode() are the variable that you want to recode, and then an expression of the form old_value = new_value. recode() is most useful when used inside a mutate() to create a new variable or reassign the old one.
Often you want to recode a certain value as missing. In this case, it is important to preserve the data type of the variable. R has built-in words to indicate missing data of various types, for example NA_character_ for missing value for a character variable, or NA_integer_ for missing integer data. The NA value is of the logical type.
Remember that you can use ?function_name_here to read more about arguments and how to use them.
# Count rows grouping by nut variable
desserts %>%
count(signature_nut, sort = TRUE)
## # A tibble: 8 x 2
## signature_nut n
## <chr> <int>
## 1 no nut 462
## 2 almond 20
## 3 walnut 20
## 4 pistachio 13
## 5 filbert 12
## 6 pecan 11
## 7 multiple 6
## 8 peanut 5
# Edit code to recode "no nut" as missing
desserts_2 <- desserts %>%
mutate(nut = recode(signature_nut, "filbert" = "hazelnut",
"no nut" = NA_character_))
# Count rows again
desserts_2 %>%
count(nut, sort = TRUE)
## # A tibble: 8 x 2
## nut n
## <chr> <int>
## 1 <NA> 462
## 2 almond 20
## 3 walnut 20
## 4 pistachio 13
## 5 hazelnut 12
## 6 pecan 11
## 7 multiple 6
## 8 peanut 5
Dummy variables are often used in data analysis to bin a variable into one of two categories to indicate the absence or presence of something. Dummy variables take the value 0 or 1 to stand for, for example, loser or winner. Dummy variables are often factor variables as opposed to numeric - we’ll cover more about factors in the last chapter.
In desserts, the technical variable is a number representing the judges’ ranks for each baker in the technical challenge, for every episode (remember to run glimpse(desserts) to take a peek). In this exercise, you will recode the numeric technical variable to a new factor dummy variable.
## Edit to recode tech_win as factor
desserts <- desserts %>%
mutate(tech_win = recode_factor(technical, `1` = 1,
.default = 0))
# Count to compare values
desserts %>%
count(technical == 1, tech_win)
## # A tibble: 3 x 3
## `technical == 1` tech_win n
## <lgl> <fct> <int>
## 1 FALSE 0 468
## 2 TRUE 1 73
## 3 NA <NA> 8
The ratings data tells us about how many UK viewers tuned in to watch “The Great British Bake-Off” for each episode of each series. If you explore ratings in console, you’ll see that there is a separate variable for the number of viewers for each episode. A key question we’ll return to in Chapter 3 is: how many more viewers watched the finale than the premiere episode?
For series with 10 episodes, which showed the most growth in viewers from the premiere to the finale? Which showed the least?
ratings <- read_csv("data/02.03_messy_ratings.csv")
##
## -- Column specification --------------------------------------------------------
## cols(
## .default = col_double(),
## premiere = col_character(),
## finale = col_character(),
## winner = col_character(),
## day_of_week = col_character(),
## timeslot = col_time(format = ""),
## channel = col_character(),
## runner_up_1 = col_character(),
## runner_up_2 = col_character(),
## season_premiere = col_character(),
## season_finale = col_character(),
## e1_uk_airdate = col_character(),
## e2_uk_airdate = col_character(),
## e3_uk_airdate = col_character(),
## e4_uk_airdate = col_character(),
## e5_uk_airdate = col_character(),
## e6_uk_airdate = col_character(),
## e7_uk_airdate = col_character(),
## e8_uk_airdate = col_character(),
## e9_uk_airdate = col_character(),
## e10_uk_airdate = col_character()
## )
## i Use `spec()` for the full column specifications.
head(ratings)
## # A tibble: 6 x 44
## series episodes premiere finale winner avg_uk_viewers day_of_week timeslot
## <dbl> <dbl> <chr> <chr> <chr> <dbl> <chr> <time>
## 1 1 6 17-Aug-~ 21-Se~ Edd K~ 2.77 Tuesday 20:00
## 2 2 8 14-Aug-~ 4-Oct~ Joann~ 4 Tuesday 20:00
## 3 3 10 14-Aug-~ 16-Oc~ John ~ 5 Tuesday 20:00
## 4 4 10 20-Aug-~ 22-Oc~ Franc~ 7.35 Tuesday 20:00
## 5 5 10 6-Aug-14 8-Oct~ Nancy~ 10.0 Wednesday 20:00
## 6 6 10 5-Aug-15 7-Oct~ Nadiy~ 12.5 Wednesday 20:00
## # ... with 36 more variables: channel <chr>, runner_up_1 <chr>,
## # runner_up_2 <chr>, season <dbl>, season_premiere <chr>,
## # season_finale <chr>, e1_viewers_7day <dbl>, e1_viewers_28day <dbl>,
## # e2_viewers_7day <dbl>, e2_viewers_28day <dbl>, e3_viewers_7day <dbl>,
## # e3_viewers_28day <dbl>, e4_viewers_7day <dbl>, e4_viewers_28day <dbl>,
## # e5_viewers_7day <dbl>, e5_viewers_28day <dbl>, e6_viewers_7day <dbl>,
## # e6_viewers_28day <dbl>, e7_viewers_7day <dbl>, e7_viewers_28day <dbl>,
## # e8_viewers_7day <dbl>, e8_viewers_28day <dbl>, e9_viewers_7day <dbl>,
## # e9_viewers_28day <dbl>, e10_viewers_7day <dbl>, e10_viewers_28day <dbl>,
## # e1_uk_airdate <chr>, e2_uk_airdate <chr>, e3_uk_airdate <chr>,
## # e4_uk_airdate <chr>, e5_uk_airdate <chr>, e6_uk_airdate <chr>,
## # e7_uk_airdate <chr>, e8_uk_airdate <chr>, e9_uk_airdate <chr>,
## # e10_uk_airdate <chr>
ratings <- ratings %>%
filter(episodes == 10) %>%
mutate(viewer_growth = e10_viewers_7day - e1_viewers_7day)
ratings %>%
select(series, viewer_growth)
## # A tibble: 6 x 2
## series viewer_growth
## <dbl> <dbl>
## 1 3 2.89
## 2 4 2.85
## 3 5 5
## 4 6 3.43
## 5 7 2.32
## 6 8 0.580
Most: series 5, Least = series 8
In the last exercise, you may have used mutate() to find the difference between the number of premiere and finale episode viewers for each series. This variable, which I created and called viewer_growth, is now available in ratings, and feel free to check out it.
Now let’s practice using select() to check the result of another mutate with recode_factor().
If you have examined ratings, you might have seen that series have aired on three different channels: “Channel 4”, “BBC One”, “BBC Two”.
# Recode channel as factor: bbc (1) or not (0)
ratings <- ratings %>%
mutate(bbc = recode_factor(channel,
"Channel 4" = 0,
.default = 1))
# Select to look at variables to plot next
ratings %>%
select(series, channel, bbc, viewer_growth)
## # A tibble: 6 x 4
## series channel bbc viewer_growth
## <dbl> <chr> <fct> <dbl>
## 1 3 BBC Two 1 2.89
## 2 4 BBC Two 1 2.85
## 3 5 BBC One 1 5
## 4 6 BBC One 1 3.43
## 5 7 BBC One 1 2.32
## 6 8 Channel 4 0 0.580
# Make a filled bar chart
ggplot(ratings, aes(x = series, y = viewer_growth, fill = bbc)) +
geom_col()
Remember to look up the other select helper functions using ?select. What is going on with viewer_growth in series 8?
As we have seen, selecting a subset of columns to print can help you check that a mutate() worked as expected, and rearranging columns next to each other can help you spot obvious errors in data entry.
The select() helpers are functions that you can use inside select() to allow you to select variables based on their names.
In this exercise, you’ll work with the ratings data, and you’ll use a new helper function called everything() which can be useful when reordering columns. Use ?everything to read more. You will learn more about using helper functions in the next lesson.
# Move channel to first column
ratings %>%
select(channel, everything())
## # A tibble: 6 x 46
## channel series episodes premiere finale winner avg_uk_viewers day_of_week
## <chr> <dbl> <dbl> <chr> <chr> <chr> <dbl> <chr>
## 1 BBC Two 3 10 14-Aug-~ 16-Oc~ John ~ 5 Tuesday
## 2 BBC Two 4 10 20-Aug-~ 22-Oc~ Franc~ 7.35 Tuesday
## 3 BBC One 5 10 6-Aug-14 8-Oct~ Nancy~ 10.0 Wednesday
## 4 BBC One 6 10 5-Aug-15 7-Oct~ Nadiy~ 12.5 Wednesday
## 5 BBC One 7 10 24-Aug-~ 26-Oc~ Candi~ 13.8 Wednesday
## 6 Channe~ 8 10 29-Aug-~ 31-Oc~ Sophi~ 9.29 Tuesday
## # ... with 38 more variables: timeslot <time>, runner_up_1 <chr>,
## # runner_up_2 <chr>, season <dbl>, season_premiere <chr>,
## # season_finale <chr>, e1_viewers_7day <dbl>, e1_viewers_28day <dbl>,
## # e2_viewers_7day <dbl>, e2_viewers_28day <dbl>, e3_viewers_7day <dbl>,
## # e3_viewers_28day <dbl>, e4_viewers_7day <dbl>, e4_viewers_28day <dbl>,
## # e5_viewers_7day <dbl>, e5_viewers_28day <dbl>, e6_viewers_7day <dbl>,
## # e6_viewers_28day <dbl>, e7_viewers_7day <dbl>, e7_viewers_28day <dbl>,
## # e8_viewers_7day <dbl>, e8_viewers_28day <dbl>, e9_viewers_7day <dbl>,
## # e9_viewers_28day <dbl>, e10_viewers_7day <dbl>, e10_viewers_28day <dbl>,
## # e1_uk_airdate <chr>, e2_uk_airdate <chr>, e3_uk_airdate <chr>,
## # e4_uk_airdate <chr>, e5_uk_airdate <chr>, e6_uk_airdate <chr>,
## # e7_uk_airdate <chr>, e8_uk_airdate <chr>, e9_uk_airdate <chr>,
## # e10_uk_airdate <chr>, viewer_growth <dbl>, bbc <fct>
# Drop 7- and 28-day episode ratings
ratings %>%
select(-ends_with("day"))
## # A tibble: 6 x 26
## series episodes premiere finale winner avg_uk_viewers day_of_week timeslot
## <dbl> <dbl> <chr> <chr> <chr> <dbl> <chr> <time>
## 1 3 10 14-Aug-~ 16-Oc~ John ~ 5 Tuesday 20:00
## 2 4 10 20-Aug-~ 22-Oc~ Franc~ 7.35 Tuesday 20:00
## 3 5 10 6-Aug-14 8-Oct~ Nancy~ 10.0 Wednesday 20:00
## 4 6 10 5-Aug-15 7-Oct~ Nadiy~ 12.5 Wednesday 20:00
## 5 7 10 24-Aug-~ 26-Oc~ Candi~ 13.8 Wednesday 20:00
## 6 8 10 29-Aug-~ 31-Oc~ Sophi~ 9.29 Tuesday 20:00
## # ... with 18 more variables: channel <chr>, runner_up_1 <chr>,
## # runner_up_2 <chr>, season <dbl>, season_premiere <chr>,
## # season_finale <chr>, e1_uk_airdate <chr>, e2_uk_airdate <chr>,
## # e3_uk_airdate <chr>, e4_uk_airdate <chr>, e5_uk_airdate <chr>,
## # e6_uk_airdate <chr>, e7_uk_airdate <chr>, e8_uk_airdate <chr>,
## # e9_uk_airdate <chr>, e10_uk_airdate <chr>, viewer_growth <dbl>, bbc <fct>
# Move channel to front and drop 7-/28-day episode ratings
ratings %>%
select(channel, everything(), -ends_with("day"))
## # A tibble: 6 x 26
## channel series episodes premiere finale winner avg_uk_viewers day_of_week
## <chr> <dbl> <dbl> <chr> <chr> <chr> <dbl> <chr>
## 1 BBC Two 3 10 14-Aug-~ 16-Oc~ John ~ 5 Tuesday
## 2 BBC Two 4 10 20-Aug-~ 22-Oc~ Franc~ 7.35 Tuesday
## 3 BBC One 5 10 6-Aug-14 8-Oct~ Nancy~ 10.0 Wednesday
## 4 BBC One 6 10 5-Aug-15 7-Oct~ Nadiy~ 12.5 Wednesday
## 5 BBC One 7 10 24-Aug-~ 26-Oc~ Candi~ 13.8 Wednesday
## 6 Channe~ 8 10 29-Aug-~ 31-Oc~ Sophi~ 9.29 Tuesday
## # ... with 18 more variables: timeslot <time>, runner_up_1 <chr>,
## # runner_up_2 <chr>, season <dbl>, season_premiere <chr>,
## # season_finale <chr>, e1_uk_airdate <chr>, e2_uk_airdate <chr>,
## # e3_uk_airdate <chr>, e4_uk_airdate <chr>, e5_uk_airdate <chr>,
## # e6_uk_airdate <chr>, e7_uk_airdate <chr>, e8_uk_airdate <chr>,
## # e9_uk_airdate <chr>, e10_uk_airdate <chr>, viewer_growth <dbl>, bbc <fct>
everything() is a great little helper! If it had been placed at the end, it would have added back in all the columns that end with “day”. Placing it before deselecting columns, though, is a real time-saver.
We saw how to use clean_names() from the janitor package to convert all variable names to snake_case. Here is how:
library(janitor)
young_bakers3 <- young_bakers3 %>%
clean_names()
The function clean_names() takes an argument case that can be used to convert variable names to other cases, like “upper_camel” or “all_caps”.
Let’s practice using the messy_ratings tibble. Remember you can use ?clean_names to see the function arguments and how to use them.
messy_ratings <- read_csv("data/02.03_messy_ratings.csv")
##
## -- Column specification --------------------------------------------------------
## cols(
## .default = col_double(),
## premiere = col_character(),
## finale = col_character(),
## winner = col_character(),
## day_of_week = col_character(),
## timeslot = col_time(format = ""),
## channel = col_character(),
## runner_up_1 = col_character(),
## runner_up_2 = col_character(),
## season_premiere = col_character(),
## season_finale = col_character(),
## e1_uk_airdate = col_character(),
## e2_uk_airdate = col_character(),
## e3_uk_airdate = col_character(),
## e4_uk_airdate = col_character(),
## e5_uk_airdate = col_character(),
## e6_uk_airdate = col_character(),
## e7_uk_airdate = col_character(),
## e8_uk_airdate = col_character(),
## e9_uk_airdate = col_character(),
## e10_uk_airdate = col_character()
## )
## i Use `spec()` for the full column specifications.
colnames(messy_ratings)
names(messy_ratings) <- c("SERIES", "episodes", "premiere", "finale", "winner", "Avg.uk~Viewers",
"day", "timeslot", "channel", "runner.up~1", "runner.up~2", "season",
"season.premiere", "season.finale", "Gbbo E1.Viewers~7Day", "Gbbo E1.Viewers~28Day",
"Gbbo E2.Viewers~7Day", "Gbbo E2.Viewers~28Day", "Gbbo E3.Viewers~7Day",
"Gbbo E3.Viewers~28Day", "Gbbo E4.Viewers~7Day", "Gbbo E4.Viewers~28Day",
"Gbbo E5.Viewers~7Day", "Gbbo E5.Viewers~28Day", "Gbbo E6.Viewers~7Day",
"Gbbo E6.Viewers~28Day", "Gbbo E7.Viewers~7Day", "Gbbo E7.Viewers~28Day",
"Gbbo E8.Viewers~7Day", "Gbbo E8.Viewers~28Day", "Gbbo E9.Viewers~7Day",
"Gbbo E9.Viewers~28Day", "Gbbo E10.Viewers~7Day", "Gbbo E10.Viewers~28Day")
## Warning: The `value` argument of ``names<-`()` must have the same length as `x` as of tibble 3.0.0.
## `names` must have length 44, not 34.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
## Warning: The `value` argument of ``names<-`()` can't be empty as of tibble 3.0.0.
## Columns 35, 36, 37, 38, 39, and 5 more must be named.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
# Glimpse to see variable names
glimpse(messy_ratings)
## Rows: 8
## Columns: 44
## $ SERIES <dbl> 1, 2, 3, 4, 5, 6, 7, 8
## $ episodes <dbl> 6, 8, 10, 10, 10, 10, 10, 10
## $ premiere <chr> "17-Aug-10", "14-Aug-11", "14-Aug-12", "20...
## $ finale <chr> "21-Sep-10", "4-Oct-11", "16-Oct-12", "22-...
## $ winner <chr> "Edd Kimber", "Joanne Wheatley", "John Wha...
## $ `Avg.uk~Viewers` <dbl> 2.77, 4.00, 5.00, 7.35, 10.04, 12.50, 13.8...
## $ day <chr> "Tuesday", "Tuesday", "Tuesday", "Tuesday"...
## $ timeslot <time> 20:00:00, 20:00:00, 20:00:00, 20:00:00, 2...
## $ channel <chr> "BBC Two", "BBC Two", "BBC Two", "BBC Two"...
## $ `runner.up~1` <chr> "Miranda Gore Browne", "Holly Bell", "Bren...
## $ `runner.up~2` <chr> "Ruth Clemens", "Mary-Anne Boermans", "Jam...
## $ season <dbl> NA, NA, NA, 1, 2, 3, 4, NA
## $ season.premiere <chr> NA, NA, NA, "12/28/14", "9/6/15", "7/1/16"...
## $ season.finale <chr> NA, NA, NA, "3/1/15", "11/8/15", "8/12/16"...
## $ `Gbbo E1.Viewers~7Day` <dbl> 2.24, 3.10, 3.85, 6.60, 8.51, 11.62, 13.58...
## $ `Gbbo E1.Viewers~28Day` <dbl> NA, NA, NA, NA, NA, 11.73, 13.86, 9.72
## $ `Gbbo E2.Viewers~7Day` <dbl> 3.00, 3.53, 4.60, 6.65, 8.79, 11.59, 13.45...
## $ `Gbbo E2.Viewers~28Day` <dbl> NA, NA, NA, NA, NA, 11.84, 13.74, 9.53
## $ `Gbbo E3.Viewers~7Day` <dbl> 3.00, 3.82, 4.53, 7.17, 9.28, 12.01, 13.01...
## $ `Gbbo E3.Viewers~28Day` <dbl> NA, NA, NA, NA, NA, NA, 13.38, 9.06
## $ `Gbbo E4.Viewers~7Day` <dbl> 2.60, 3.60, 4.71, 6.82, 10.25, 12.36, 13.2...
## $ `Gbbo E4.Viewers~28Day` <dbl> NA, NA, NA, NA, NA, 12.67, 13.88, 8.87
## $ `Gbbo E5.Viewers~7Day` <dbl> 3.03, 3.83, 4.61, 6.95, 9.95, 12.39, 13.12...
## $ `Gbbo E5.Viewers~28Day` <dbl> NA, NA, NA, NA, NA, 12.63, 13.33, 8.98
## $ `Gbbo E6.Viewers~7Day` <dbl> 2.75, 4.25, 4.82, 7.32, 10.13, 12.00, 13.1...
## $ `Gbbo E6.Viewers~28Day` <dbl> NA, NA, NA, NA, NA, 12.26, 13.41, 8.91
## $ `Gbbo E7.Viewers~7Day` <dbl> NA, 4.42, 5.10, 7.76, 10.28, 12.35, 13.45,...
## $ `Gbbo E7.Viewers~28Day` <dbl> NA, NA, NA, NA, NA, 12.58, 13.72, 9.32
## $ `Gbbo E8.Viewers~7Day` <dbl> NA, 5.060, 5.350, 7.410, 9.023, 11.090, 13...
## $ `Gbbo E8.Viewers~28Day` <dbl> NA, NA, NA, NA, NA, 11.35, 13.45, 9.19
## $ `Gbbo E9.Viewers~7Day` <dbl> NA, NA, 5.70, 7.41, 10.67, 12.65, 13.44, 9.03
## $ `Gbbo E9.Viewers~28Day` <dbl> NA, NA, NA, NA, NA, 12.78, 13.65, 9.16
## $ `Gbbo E10.Viewers~7Day` <dbl> NA, NA, 6.74, 9.45, 13.51, 15.05, 15.90, 1...
## $ `Gbbo E10.Viewers~28Day` <dbl> NA, NA, NA, NA, NA, 15.16, 16.03, 10.13
## $ NA <chr> "8/17/10", "8/16/11", "8/14/12", "8/20/13"...
## $ NA <chr> "8/24/10", "8/23/11", "8/21/12", "8/27/13"...
## $ NA <chr> "8/31/10", "8/30/11", "8/28/12", "9/3/13",...
## $ NA <chr> "9/7/10", "9/6/11", "9/4/12", "9/10/13", "...
## $ NA <chr> "9/14/10", "9/13/11", "9/11/12", "9/17/13"...
## $ NA <chr> "9/21/10", "9/20/11", "9/18/12", "9/24/13"...
## $ NA <chr> NA, "9/27/11", "9/25/12", "10/1/13", "9/17...
## $ NA <chr> NA, "10/4/11", "10/2/12", "10/8/13", "9/24...
## $ NA <chr> NA, NA, "10/9/12", "10/15/13", "10/1/14", ...
## $ NA <chr> NA, NA, "10/16/12", "10/22/13", "10/8/14",...
# Load janitor
library(janitor)
##
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
# Reformat to lower camelcase
ratings2 <- messy_ratings %>%
clean_names("lower_camel")
# Glimpse new tibble
glimpse(ratings2)
## Rows: 8
## Columns: 44
## $ series <dbl> 1, 2, 3, 4, 5, 6, 7, 8
## $ episodes <dbl> 6, 8, 10, 10, 10, 10, 10, 10
## $ premiere <chr> "17-Aug-10", "14-Aug-11", "14-Aug-12", "20-Aug-...
## $ finale <chr> "21-Sep-10", "4-Oct-11", "16-Oct-12", "22-Oct-1...
## $ winner <chr> "Edd Kimber", "Joanne Wheatley", "John Whaite",...
## $ avgUkViewers <dbl> 2.77, 4.00, 5.00, 7.35, 10.04, 12.50, 13.85, 9.29
## $ day <chr> "Tuesday", "Tuesday", "Tuesday", "Tuesday", "We...
## $ timeslot <time> 20:00:00, 20:00:00, 20:00:00, 20:00:00, 20:00:...
## $ channel <chr> "BBC Two", "BBC Two", "BBC Two", "BBC Two", "BB...
## $ runnerUp1 <chr> "Miranda Gore Browne", "Holly Bell", "Brendan L...
## $ runnerUp2 <chr> "Ruth Clemens", "Mary-Anne Boermans", "James Mo...
## $ season <dbl> NA, NA, NA, 1, 2, 3, 4, NA
## $ seasonPremiere <chr> NA, NA, NA, "12/28/14", "9/6/15", "7/1/16", "6/...
## $ seasonFinale <chr> NA, NA, NA, "3/1/15", "11/8/15", "8/12/16", "8/...
## $ gbboE1Viewers7Day <dbl> 2.24, 3.10, 3.85, 6.60, 8.51, 11.62, 13.58, 9.46
## $ gbboE1Viewers28Day <dbl> NA, NA, NA, NA, NA, 11.73, 13.86, 9.72
## $ gbboE2Viewers7Day <dbl> 3.00, 3.53, 4.60, 6.65, 8.79, 11.59, 13.45, 9.23
## $ gbboE2Viewers28Day <dbl> NA, NA, NA, NA, NA, 11.84, 13.74, 9.53
## $ gbboE3Viewers7Day <dbl> 3.00, 3.82, 4.53, 7.17, 9.28, 12.01, 13.01, 8.68
## $ gbboE3Viewers28Day <dbl> NA, NA, NA, NA, NA, NA, 13.38, 9.06
## $ gbboE4Viewers7Day <dbl> 2.60, 3.60, 4.71, 6.82, 10.25, 12.36, 13.29, 8.55
## $ gbboE4Viewers28Day <dbl> NA, NA, NA, NA, NA, 12.67, 13.88, 8.87
## $ gbboE5Viewers7Day <dbl> 3.03, 3.83, 4.61, 6.95, 9.95, 12.39, 13.12, 8.61
## $ gbboE5Viewers28Day <dbl> NA, NA, NA, NA, NA, 12.63, 13.33, 8.98
## $ gbboE6Viewers7Day <dbl> 2.75, 4.25, 4.82, 7.32, 10.13, 12.00, 13.13, 8.61
## $ gbboE6Viewers28Day <dbl> NA, NA, NA, NA, NA, 12.26, 13.41, 8.91
## $ gbboE7Viewers7Day <dbl> NA, 4.42, 5.10, 7.76, 10.28, 12.35, 13.45, 9.01
## $ gbboE7Viewers28Day <dbl> NA, NA, NA, NA, NA, 12.58, 13.72, 9.32
## $ gbboE8Viewers7Day <dbl> NA, 5.060, 5.350, 7.410, 9.023, 11.090, 13.260,...
## $ gbboE8Viewers28Day <dbl> NA, NA, NA, NA, NA, 11.35, 13.45, 9.19
## $ gbboE9Viewers7Day <dbl> NA, NA, 5.70, 7.41, 10.67, 12.65, 13.44, 9.03
## $ gbboE9Viewers28Day <dbl> NA, NA, NA, NA, NA, 12.78, 13.65, 9.16
## $ gbboE10Viewers7Day <dbl> NA, NA, 6.74, 9.45, 13.51, 15.05, 15.90, 10.04
## $ gbboE10Viewers28Day <dbl> NA, NA, NA, NA, NA, 15.16, 16.03, 10.13
## $ na <chr> "8/17/10", "8/16/11", "8/14/12", "8/20/13", "8/...
## $ na_2 <chr> "8/24/10", "8/23/11", "8/21/12", "8/27/13", "8/...
## $ na_3 <chr> "8/31/10", "8/30/11", "8/28/12", "9/3/13", "8/2...
## $ na_4 <chr> "9/7/10", "9/6/11", "9/4/12", "9/10/13", "8/27/...
## $ na_5 <chr> "9/14/10", "9/13/11", "9/11/12", "9/17/13", "9/...
## $ na_6 <chr> "9/21/10", "9/20/11", "9/18/12", "9/24/13", "9/...
## $ na_7 <chr> NA, "9/27/11", "9/25/12", "10/1/13", "9/17/14",...
## $ na_8 <chr> NA, "10/4/11", "10/2/12", "10/8/13", "9/24/14",...
## $ na_9 <chr> NA, NA, "10/9/12", "10/15/13", "10/1/14", "9/30...
## $ na_10 <chr> NA, NA, "10/16/12", "10/22/13", "10/8/14", "10/...
# Glimpse to see variable names
glimpse(messy_ratings)
## Rows: 8
## Columns: 44
## $ SERIES <dbl> 1, 2, 3, 4, 5, 6, 7, 8
## $ episodes <dbl> 6, 8, 10, 10, 10, 10, 10, 10
## $ premiere <chr> "17-Aug-10", "14-Aug-11", "14-Aug-12", "20...
## $ finale <chr> "21-Sep-10", "4-Oct-11", "16-Oct-12", "22-...
## $ winner <chr> "Edd Kimber", "Joanne Wheatley", "John Wha...
## $ `Avg.uk~Viewers` <dbl> 2.77, 4.00, 5.00, 7.35, 10.04, 12.50, 13.8...
## $ day <chr> "Tuesday", "Tuesday", "Tuesday", "Tuesday"...
## $ timeslot <time> 20:00:00, 20:00:00, 20:00:00, 20:00:00, 2...
## $ channel <chr> "BBC Two", "BBC Two", "BBC Two", "BBC Two"...
## $ `runner.up~1` <chr> "Miranda Gore Browne", "Holly Bell", "Bren...
## $ `runner.up~2` <chr> "Ruth Clemens", "Mary-Anne Boermans", "Jam...
## $ season <dbl> NA, NA, NA, 1, 2, 3, 4, NA
## $ season.premiere <chr> NA, NA, NA, "12/28/14", "9/6/15", "7/1/16"...
## $ season.finale <chr> NA, NA, NA, "3/1/15", "11/8/15", "8/12/16"...
## $ `Gbbo E1.Viewers~7Day` <dbl> 2.24, 3.10, 3.85, 6.60, 8.51, 11.62, 13.58...
## $ `Gbbo E1.Viewers~28Day` <dbl> NA, NA, NA, NA, NA, 11.73, 13.86, 9.72
## $ `Gbbo E2.Viewers~7Day` <dbl> 3.00, 3.53, 4.60, 6.65, 8.79, 11.59, 13.45...
## $ `Gbbo E2.Viewers~28Day` <dbl> NA, NA, NA, NA, NA, 11.84, 13.74, 9.53
## $ `Gbbo E3.Viewers~7Day` <dbl> 3.00, 3.82, 4.53, 7.17, 9.28, 12.01, 13.01...
## $ `Gbbo E3.Viewers~28Day` <dbl> NA, NA, NA, NA, NA, NA, 13.38, 9.06
## $ `Gbbo E4.Viewers~7Day` <dbl> 2.60, 3.60, 4.71, 6.82, 10.25, 12.36, 13.2...
## $ `Gbbo E4.Viewers~28Day` <dbl> NA, NA, NA, NA, NA, 12.67, 13.88, 8.87
## $ `Gbbo E5.Viewers~7Day` <dbl> 3.03, 3.83, 4.61, 6.95, 9.95, 12.39, 13.12...
## $ `Gbbo E5.Viewers~28Day` <dbl> NA, NA, NA, NA, NA, 12.63, 13.33, 8.98
## $ `Gbbo E6.Viewers~7Day` <dbl> 2.75, 4.25, 4.82, 7.32, 10.13, 12.00, 13.1...
## $ `Gbbo E6.Viewers~28Day` <dbl> NA, NA, NA, NA, NA, 12.26, 13.41, 8.91
## $ `Gbbo E7.Viewers~7Day` <dbl> NA, 4.42, 5.10, 7.76, 10.28, 12.35, 13.45,...
## $ `Gbbo E7.Viewers~28Day` <dbl> NA, NA, NA, NA, NA, 12.58, 13.72, 9.32
## $ `Gbbo E8.Viewers~7Day` <dbl> NA, 5.060, 5.350, 7.410, 9.023, 11.090, 13...
## $ `Gbbo E8.Viewers~28Day` <dbl> NA, NA, NA, NA, NA, 11.35, 13.45, 9.19
## $ `Gbbo E9.Viewers~7Day` <dbl> NA, NA, 5.70, 7.41, 10.67, 12.65, 13.44, 9.03
## $ `Gbbo E9.Viewers~28Day` <dbl> NA, NA, NA, NA, NA, 12.78, 13.65, 9.16
## $ `Gbbo E10.Viewers~7Day` <dbl> NA, NA, 6.74, 9.45, 13.51, 15.05, 15.90, 1...
## $ `Gbbo E10.Viewers~28Day` <dbl> NA, NA, NA, NA, NA, 15.16, 16.03, 10.13
## $ NA <chr> "8/17/10", "8/16/11", "8/14/12", "8/20/13"...
## $ NA <chr> "8/24/10", "8/23/11", "8/21/12", "8/27/13"...
## $ NA <chr> "8/31/10", "8/30/11", "8/28/12", "9/3/13",...
## $ NA <chr> "9/7/10", "9/6/11", "9/4/12", "9/10/13", "...
## $ NA <chr> "9/14/10", "9/13/11", "9/11/12", "9/17/13"...
## $ NA <chr> "9/21/10", "9/20/11", "9/18/12", "9/24/13"...
## $ NA <chr> NA, "9/27/11", "9/25/12", "10/1/13", "9/17...
## $ NA <chr> NA, "10/4/11", "10/2/12", "10/8/13", "9/24...
## $ NA <chr> NA, NA, "10/9/12", "10/15/13", "10/1/14", ...
## $ NA <chr> NA, NA, "10/16/12", "10/22/13", "10/8/14",...
# Load janitor
library(janitor)
# Reformat to snake case
ratings2 <- messy_ratings %>%
clean_names()
# Glimpse cleaned names
glimpse(ratings2)
## Rows: 8
## Columns: 44
## $ series <dbl> 1, 2, 3, 4, 5, 6, 7, 8
## $ episodes <dbl> 6, 8, 10, 10, 10, 10, 10, 10
## $ premiere <chr> "17-Aug-10", "14-Aug-11", "14-Aug-12", "20-A...
## $ finale <chr> "21-Sep-10", "4-Oct-11", "16-Oct-12", "22-Oc...
## $ winner <chr> "Edd Kimber", "Joanne Wheatley", "John Whait...
## $ avg_uk_viewers <dbl> 2.77, 4.00, 5.00, 7.35, 10.04, 12.50, 13.85,...
## $ day <chr> "Tuesday", "Tuesday", "Tuesday", "Tuesday", ...
## $ timeslot <time> 20:00:00, 20:00:00, 20:00:00, 20:00:00, 20:...
## $ channel <chr> "BBC Two", "BBC Two", "BBC Two", "BBC Two", ...
## $ runner_up_1 <chr> "Miranda Gore Browne", "Holly Bell", "Brenda...
## $ runner_up_2 <chr> "Ruth Clemens", "Mary-Anne Boermans", "James...
## $ season <dbl> NA, NA, NA, 1, 2, 3, 4, NA
## $ season_premiere <chr> NA, NA, NA, "12/28/14", "9/6/15", "7/1/16", ...
## $ season_finale <chr> NA, NA, NA, "3/1/15", "11/8/15", "8/12/16", ...
## $ gbbo_e1_viewers_7day <dbl> 2.24, 3.10, 3.85, 6.60, 8.51, 11.62, 13.58, ...
## $ gbbo_e1_viewers_28day <dbl> NA, NA, NA, NA, NA, 11.73, 13.86, 9.72
## $ gbbo_e2_viewers_7day <dbl> 3.00, 3.53, 4.60, 6.65, 8.79, 11.59, 13.45, ...
## $ gbbo_e2_viewers_28day <dbl> NA, NA, NA, NA, NA, 11.84, 13.74, 9.53
## $ gbbo_e3_viewers_7day <dbl> 3.00, 3.82, 4.53, 7.17, 9.28, 12.01, 13.01, ...
## $ gbbo_e3_viewers_28day <dbl> NA, NA, NA, NA, NA, NA, 13.38, 9.06
## $ gbbo_e4_viewers_7day <dbl> 2.60, 3.60, 4.71, 6.82, 10.25, 12.36, 13.29,...
## $ gbbo_e4_viewers_28day <dbl> NA, NA, NA, NA, NA, 12.67, 13.88, 8.87
## $ gbbo_e5_viewers_7day <dbl> 3.03, 3.83, 4.61, 6.95, 9.95, 12.39, 13.12, ...
## $ gbbo_e5_viewers_28day <dbl> NA, NA, NA, NA, NA, 12.63, 13.33, 8.98
## $ gbbo_e6_viewers_7day <dbl> 2.75, 4.25, 4.82, 7.32, 10.13, 12.00, 13.13,...
## $ gbbo_e6_viewers_28day <dbl> NA, NA, NA, NA, NA, 12.26, 13.41, 8.91
## $ gbbo_e7_viewers_7day <dbl> NA, 4.42, 5.10, 7.76, 10.28, 12.35, 13.45, 9.01
## $ gbbo_e7_viewers_28day <dbl> NA, NA, NA, NA, NA, 12.58, 13.72, 9.32
## $ gbbo_e8_viewers_7day <dbl> NA, 5.060, 5.350, 7.410, 9.023, 11.090, 13.2...
## $ gbbo_e8_viewers_28day <dbl> NA, NA, NA, NA, NA, 11.35, 13.45, 9.19
## $ gbbo_e9_viewers_7day <dbl> NA, NA, 5.70, 7.41, 10.67, 12.65, 13.44, 9.03
## $ gbbo_e9_viewers_28day <dbl> NA, NA, NA, NA, NA, 12.78, 13.65, 9.16
## $ gbbo_e10_viewers_7day <dbl> NA, NA, 6.74, 9.45, 13.51, 15.05, 15.90, 10.04
## $ gbbo_e10_viewers_28day <dbl> NA, NA, NA, NA, NA, 15.16, 16.03, 10.13
## $ na <chr> "8/17/10", "8/16/11", "8/14/12", "8/20/13", ...
## $ na_2 <chr> "8/24/10", "8/23/11", "8/21/12", "8/27/13", ...
## $ na_3 <chr> "8/31/10", "8/30/11", "8/28/12", "9/3/13", "...
## $ na_4 <chr> "9/7/10", "9/6/11", "9/4/12", "9/10/13", "8/...
## $ na_5 <chr> "9/14/10", "9/13/11", "9/11/12", "9/17/13", ...
## $ na_6 <chr> "9/21/10", "9/20/11", "9/18/12", "9/24/13", ...
## $ na_7 <chr> NA, "9/27/11", "9/25/12", "10/1/13", "9/17/1...
## $ na_8 <chr> NA, "10/4/11", "10/2/12", "10/8/13", "9/24/1...
## $ na_9 <chr> NA, NA, "10/9/12", "10/15/13", "10/1/14", "9...
## $ na_10 <chr> NA, NA, "10/16/12", "10/22/13", "10/8/14", "...
You have a ton of case options with the clean_names function- choose wisely!
In the next chapter, when we reshape data to be tidy, you’ll see the importance of having tame variable names.
Now that our variable names in ratings have been converted to snake_case, you’ll practice combining renaming with select() helper functions to take advantage of the variable name structure.
To rename a group of variables, you only need to specify a prefix of the new name.
For example, my_tbl %>% select( new_name_ = starts_with(“oldname”) ) will find all variables in my_tbl whose names start with oldname, enumerate them, then rename each variable as new_name_
Take a second to glimpse() at ratings in your console, and take note of the variables that store data about the number of viewers. Can you see the pattern in their names? In this exercise, you will first select those variables using a helper function, and then rename them!
# Select 7-day viewer data by series
viewers_7day <- ratings %>%
select(series, ends_with("7day"))
# Glimpse
glimpse(viewers_7day)
## Rows: 6
## Columns: 11
## $ series <dbl> 3, 4, 5, 6, 7, 8
## $ e1_viewers_7day <dbl> 3.85, 6.60, 8.51, 11.62, 13.58, 9.46
## $ e2_viewers_7day <dbl> 4.60, 6.65, 8.79, 11.59, 13.45, 9.23
## $ e3_viewers_7day <dbl> 4.53, 7.17, 9.28, 12.01, 13.01, 8.68
## $ e4_viewers_7day <dbl> 4.71, 6.82, 10.25, 12.36, 13.29, 8.55
## $ e5_viewers_7day <dbl> 4.61, 6.95, 9.95, 12.39, 13.12, 8.61
## $ e6_viewers_7day <dbl> 4.82, 7.32, 10.13, 12.00, 13.13, 8.61
## $ e7_viewers_7day <dbl> 5.10, 7.76, 10.28, 12.35, 13.45, 9.01
## $ e8_viewers_7day <dbl> 5.350, 7.410, 9.023, 11.090, 13.260, 8.950
## $ e9_viewers_7day <dbl> 5.70, 7.41, 10.67, 12.65, 13.44, 9.03
## $ e10_viewers_7day <dbl> 6.74, 9.45, 13.51, 15.05, 15.90, 10.04
# Adapt code to also rename 7-day viewer data
viewers_7day <- ratings %>%
select(series, viewers_7day_ = ends_with("7day"))
# Glimpse
glimpse(viewers_7day)
## Rows: 6
## Columns: 11
## $ series <dbl> 3, 4, 5, 6, 7, 8
## $ viewers_7day_1 <dbl> 3.85, 6.60, 8.51, 11.62, 13.58, 9.46
## $ viewers_7day_2 <dbl> 4.60, 6.65, 8.79, 11.59, 13.45, 9.23
## $ viewers_7day_3 <dbl> 4.53, 7.17, 9.28, 12.01, 13.01, 8.68
## $ viewers_7day_4 <dbl> 4.71, 6.82, 10.25, 12.36, 13.29, 8.55
## $ viewers_7day_5 <dbl> 4.61, 6.95, 9.95, 12.39, 13.12, 8.61
## $ viewers_7day_6 <dbl> 4.82, 7.32, 10.13, 12.00, 13.13, 8.61
## $ viewers_7day_7 <dbl> 5.10, 7.76, 10.28, 12.35, 13.45, 9.01
## $ viewers_7day_8 <dbl> 5.350, 7.410, 9.023, 11.090, 13.260, 8.950
## $ viewers_7day_9 <dbl> 5.70, 7.41, 10.67, 12.65, 13.44, 9.03
## $ viewers_7day_10 <dbl> 6.74, 9.45, 13.51, 15.05, 15.90, 10.04
Using select helper functions is a powerful way to tame your variable names. Tame names make it easier to tidy your data.
In the previous exercise, you renamed all the variables in ratings that end with “7day” and kept only a subset of variables. In this exercise, you’ll adapt that code to drop some specific columns, while keeping and reordering other columns.
# Adapt code to drop 28-day columns; keep 7-day in front
viewers_7day <- ratings %>%
select(viewers_7day_ = ends_with("7day"),
everything(),
-ends_with("28day"))
# Glimpse
glimpse(viewers_7day)
## Rows: 6
## Columns: 36
## $ viewers_7day_1 <dbl> 3.85, 6.60, 8.51, 11.62, 13.58, 9.46
## $ viewers_7day_2 <dbl> 4.60, 6.65, 8.79, 11.59, 13.45, 9.23
## $ viewers_7day_3 <dbl> 4.53, 7.17, 9.28, 12.01, 13.01, 8.68
## $ viewers_7day_4 <dbl> 4.71, 6.82, 10.25, 12.36, 13.29, 8.55
## $ viewers_7day_5 <dbl> 4.61, 6.95, 9.95, 12.39, 13.12, 8.61
## $ viewers_7day_6 <dbl> 4.82, 7.32, 10.13, 12.00, 13.13, 8.61
## $ viewers_7day_7 <dbl> 5.10, 7.76, 10.28, 12.35, 13.45, 9.01
## $ viewers_7day_8 <dbl> 5.350, 7.410, 9.023, 11.090, 13.260, 8.950
## $ viewers_7day_9 <dbl> 5.70, 7.41, 10.67, 12.65, 13.44, 9.03
## $ viewers_7day_10 <dbl> 6.74, 9.45, 13.51, 15.05, 15.90, 10.04
## $ series <dbl> 3, 4, 5, 6, 7, 8
## $ episodes <dbl> 10, 10, 10, 10, 10, 10
## $ premiere <chr> "14-Aug-12", "20-Aug-13", "6-Aug-14", "5-Aug-15", "...
## $ finale <chr> "16-Oct-12", "22-Oct-13", "8-Oct-14", "7-Oct-15", "...
## $ winner <chr> "John Whaite", "Frances Quinn", "Nancy Birtwhistle"...
## $ avg_uk_viewers <dbl> 5.00, 7.35, 10.04, 12.50, 13.85, 9.29
## $ day_of_week <chr> "Tuesday", "Tuesday", "Wednesday", "Wednesday", "We...
## $ timeslot <time> 20:00:00, 20:00:00, 20:00:00, 20:00:00, 20:00:00, ...
## $ channel <chr> "BBC Two", "BBC Two", "BBC One", "BBC One", "BBC On...
## $ runner_up_1 <chr> "Brendan Lynch", "Kimberley Wilson", "Luis Troyano"...
## $ runner_up_2 <chr> "James Morton", "Ruby Tandoh", "Richard Burr", "Tam...
## $ season <dbl> NA, 1, 2, 3, 4, NA
## $ season_premiere <chr> NA, "12/28/14", "9/6/15", "7/1/16", "6/16/17", NA
## $ season_finale <chr> NA, "3/1/15", "11/8/15", "8/12/16", "8/4/17", NA
## $ e1_uk_airdate <chr> "8/14/12", "8/20/13", "8/6/14", "8/5/15", "8/24/16"...
## $ e2_uk_airdate <chr> "8/21/12", "8/27/13", "8/13/14", "8/12/15", "8/31/1...
## $ e3_uk_airdate <chr> "8/28/12", "9/3/13", "8/20/14", "8/19/15", "9/7/16"...
## $ e4_uk_airdate <chr> "9/4/12", "9/10/13", "8/27/14", "8/26/15", "9/14/16...
## $ e5_uk_airdate <chr> "9/11/12", "9/17/13", "9/3/14", "9/2/15", "9/21/16"...
## $ e6_uk_airdate <chr> "9/18/12", "9/24/13", "9/10/14", "9/9/15", "9/28/16...
## $ e7_uk_airdate <chr> "9/25/12", "10/1/13", "9/17/14", "9/16/15", "10/5/1...
## $ e8_uk_airdate <chr> "10/2/12", "10/8/13", "9/24/14", "9/23/15", "10/12/...
## $ e9_uk_airdate <chr> "10/9/12", "10/15/13", "10/1/14", "9/30/15", "10/19...
## $ e10_uk_airdate <chr> "10/16/12", "10/22/13", "10/8/14", "10/7/15", "10/2...
## $ viewer_growth <dbl> 2.89, 2.85, 5.00, 3.43, 2.32, 0.58
## $ bbc <fct> 1, 1, 1, 1, 1, 0
# Adapt code to keep original order
viewers_7day <- ratings %>%
select(everything(),
viewers_7day_ = ends_with("7day"),
-ends_with("28day"))
# Glimpse
glimpse(viewers_7day)
## Rows: 6
## Columns: 36
## $ series <dbl> 3, 4, 5, 6, 7, 8
## $ episodes <dbl> 10, 10, 10, 10, 10, 10
## $ premiere <chr> "14-Aug-12", "20-Aug-13", "6-Aug-14", "5-Aug-15", "...
## $ finale <chr> "16-Oct-12", "22-Oct-13", "8-Oct-14", "7-Oct-15", "...
## $ winner <chr> "John Whaite", "Frances Quinn", "Nancy Birtwhistle"...
## $ avg_uk_viewers <dbl> 5.00, 7.35, 10.04, 12.50, 13.85, 9.29
## $ day_of_week <chr> "Tuesday", "Tuesday", "Wednesday", "Wednesday", "We...
## $ timeslot <time> 20:00:00, 20:00:00, 20:00:00, 20:00:00, 20:00:00, ...
## $ channel <chr> "BBC Two", "BBC Two", "BBC One", "BBC One", "BBC On...
## $ runner_up_1 <chr> "Brendan Lynch", "Kimberley Wilson", "Luis Troyano"...
## $ runner_up_2 <chr> "James Morton", "Ruby Tandoh", "Richard Burr", "Tam...
## $ season <dbl> NA, 1, 2, 3, 4, NA
## $ season_premiere <chr> NA, "12/28/14", "9/6/15", "7/1/16", "6/16/17", NA
## $ season_finale <chr> NA, "3/1/15", "11/8/15", "8/12/16", "8/4/17", NA
## $ viewers_7day_1 <dbl> 3.85, 6.60, 8.51, 11.62, 13.58, 9.46
## $ viewers_7day_2 <dbl> 4.60, 6.65, 8.79, 11.59, 13.45, 9.23
## $ viewers_7day_3 <dbl> 4.53, 7.17, 9.28, 12.01, 13.01, 8.68
## $ viewers_7day_4 <dbl> 4.71, 6.82, 10.25, 12.36, 13.29, 8.55
## $ viewers_7day_5 <dbl> 4.61, 6.95, 9.95, 12.39, 13.12, 8.61
## $ viewers_7day_6 <dbl> 4.82, 7.32, 10.13, 12.00, 13.13, 8.61
## $ viewers_7day_7 <dbl> 5.10, 7.76, 10.28, 12.35, 13.45, 9.01
## $ viewers_7day_8 <dbl> 5.350, 7.410, 9.023, 11.090, 13.260, 8.950
## $ viewers_7day_9 <dbl> 5.70, 7.41, 10.67, 12.65, 13.44, 9.03
## $ viewers_7day_10 <dbl> 6.74, 9.45, 13.51, 15.05, 15.90, 10.04
## $ e1_uk_airdate <chr> "8/14/12", "8/20/13", "8/6/14", "8/5/15", "8/24/16"...
## $ e2_uk_airdate <chr> "8/21/12", "8/27/13", "8/13/14", "8/12/15", "8/31/1...
## $ e3_uk_airdate <chr> "8/28/12", "9/3/13", "8/20/14", "8/19/15", "9/7/16"...
## $ e4_uk_airdate <chr> "9/4/12", "9/10/13", "8/27/14", "8/26/15", "9/14/16...
## $ e5_uk_airdate <chr> "9/11/12", "9/17/13", "9/3/14", "9/2/15", "9/21/16"...
## $ e6_uk_airdate <chr> "9/18/12", "9/24/13", "9/10/14", "9/9/15", "9/28/16...
## $ e7_uk_airdate <chr> "9/25/12", "10/1/13", "9/17/14", "9/16/15", "10/5/1...
## $ e8_uk_airdate <chr> "10/2/12", "10/8/13", "9/24/14", "9/23/15", "10/12/...
## $ e9_uk_airdate <chr> "10/9/12", "10/15/13", "10/1/14", "9/30/15", "10/19...
## $ e10_uk_airdate <chr> "10/16/12", "10/22/13", "10/8/14", "10/7/15", "10/2...
## $ viewer_growth <dbl> 2.89, 2.85, 5.00, 3.43, 2.32, 0.58
## $ bbc <fct> 1, 1, 1, 1, 1, 0
We’ll start by visualizing the number of show viewers (in millions, cumulative across seven days) across series by episode using the ratings data
ratings <- read_csv("data/messy_ratings.csv",
col_types = cols(
series = col_factor(levels = NULL)
))
# Plot of episode 1 viewers by series
ggplot(ratings, aes(x = series, y = e1))+
geom_col()
# Adapt code to plot episode 2 viewers by series
ggplot(ratings, aes(x = series, y = e2)) +
geom_col()
Series 8 was the first with different hosts on a different channel. Do you notice a pattern? To draw any conclusions, we’d want to plot the viewers for every episode though- aren’t you glad you weren’t asked to make all 10 plots? Wouldn’t it be nice if episode was a variable we could use?
we’ll tidy the data, then use row_number() from the dplyr package to create a continuous episode count variable, such that episode 1 from series 1 will have a value of 1, episode 1 from series 2 will have a value of 7, and so on. This new variable will be mapped onto the x-axis using ggplot2.
library(tidyr)
tidy_ratings <- ratings %>%
# Gather and convert episode to factor
gather(key = "episode", value = "viewers_7day", -series,
factor_key = TRUE, na.rm = TRUE) %>%
# Sort in ascending order by series and episode
arrange(series, episode) %>%
# Create new variable using row_number()
mutate(episode_count = row_number())
# Plot viewers by episode and series
ggplot(tidy_ratings, aes(x = episode_count,
y = viewers_7day,
fill = series)) +
geom_col()
Beautiful plot! In series 8, one of the judges accidentally tweeted out the winner’s name 12 hours before the finale aired. Do you think that hurt their finale viewer numbers?
In this exercise, you’ll combine tidyr with dplyr::select() to keep and gather non-sequential columns in the ratings2 data. You can review the select helper functions here. The ratings2 data includes per-episode data:
Here, we’ll focus on the 7-day viewers.
ratings2 <- read_csv("data/messy_ratings2.csv",
col_types = cols(
series = col_factor(levels = NULL)
))
week_ratings <- ratings2 %>%
# Select 7-day viewer ratings
select(series, ends_with("7day")) %>%
# Gather 7-day viewers by episode
gather(episode, viewers_7day, ends_with("7day"), na.rm = TRUE, factor_key = TRUE)
# Plot 7-day viewers by episode and series
ggplot(week_ratings, aes(x = episode,
y = viewers_7day,
group = series)) +
geom_line() +
facet_wrap(~series)
You’re done gathering! This plot tells you a lot, but the x-axis labels are a mess! We need to do more tidying to make this plot better.
To create the week_ratings data, we first selected all the columns in ratings2 that contain data about the number of 7-day viewers in millions for each series, then we gathered those 10 columns into two columns.
In this exercise, you’ll use the tidyr, dplyr, and readr packages to tidy and plot the week_ratings data so we can read the x-axis, which labels each episode!
week_ratings <- ratings2 %>%
select(series, ends_with("7day")) %>%
gather(episode, viewers_7day, ends_with("7day"),
na.rm = TRUE) %>%
separate(episode, into = "episode", extra = "drop") %>%
# Edit to parse episode number
mutate(episode = parse_number(episode))
# Print to view
week_ratings
## # A tibble: 74 x 3
## series episode viewers_7day
## <fct> <dbl> <dbl>
## 1 1 1 2.24
## 2 2 1 3.1
## 3 3 1 3.85
## 4 4 1 6.6
## 5 5 1 8.51
## 6 6 1 11.6
## 7 7 1 13.6
## 8 8 1 9.46
## 9 1 2 3
## 10 2 2 3.53
## # ... with 64 more rows
# Create week_ratings
week_ratings <- ratings2 %>%
select(series, ends_with("7day")) %>%
gather(episode, viewers_7day, ends_with("7day"),
na.rm = TRUE) %>%
separate(episode, into = "episode", extra = "drop") %>%
mutate(episode = parse_number(episode))
# Edit your code to color by series and add a theme
ggplot(week_ratings, aes(x = episode, y = viewers_7day,
group = series, color = series)) +
geom_line() +
facet_wrap(~series) +
guides(color = FALSE) +
theme_minimal()
### Unite columns
In the tidyr package, the opposite of separate() is unite(). Sometimes you need to paste values from two or more columns together to tidy. Here is an example usage for unite():
data %>%
unite(new_var, old_var1, old_var2)
The ratings2 dataset includes these variables:
If viewers_millions = 2 and viewers_decimal = .6, then the total number of viewers in millions is 2.6. In this exercise, you’ll practice uniting these two columns into a single new column.
library(readr)
ratings_2 <- read_csv("data/messy_ratings2.csv")
##
## -- Column specification --------------------------------------------------------
## cols(
## .default = col_double()
## )
## i Use `spec()` for the full column specifications.