1 Explore your data

1.1 Import your data

1.1.1 Read a CSV file

In this exercise, you’ll use read_csv() twice. The first time you will only specify the filename, but you’ll notice a problem with the imported data. The second time you’ll use a new argument called skip to fix the problem. Remember to use ?read_csv to read more about arguments like skip and how to use them.

The data you’ll work with is from “The Great British Bake-Off.” The file “bakeoff.csv” contains data for each episode of the show, organized by series and baker.

Tidyverse chearsheet

# Load readr
library(readr)

# Create bakeoff from "bakeoff.csv"
bakeoff <- read_csv("bakeoff.csv")
Parsed with column specification:
cols(
  series = col_double(),
  episode = col_double(),
  baker = col_character(),
  signature = col_character(),
  technical = col_double(),
  showstopper = col_character(),
  result = col_character(),
  uk_airdate = col_date(format = ""),
  us_season = col_double(),
  us_airdate = col_date(format = "")
)
# Print bakeoff
bakeoff

# Create bakeoff but skip first row in case needed
# bakeoff <- read_csv("bakeoff.csv", skip = 1)

Our original import had 3 problems: we had 1 extra observation, our first row stored the true variable names, and all column types were characters. Adding skip = 1 worked because the default argument for col_names is TRUE.

1.1.2 Assign missing values

The read_csv() function also has an na argument, which allows you to specify value(s) that represent missing values in your data. The default values for the na argument are c("“,”NA“), so both are recoded as missing (NA) in R. When you read in data, you can add additional values like the string”UNKNOWN" to a vector of missing values using the c() function to combine multiple values into a single vector.

The is.na() function is also helpful for identifying rows with missing values for a variable.

# Load dplyr
library(dplyr)

# Filter rows where showstopper is UNKNOWN 
bakeoff %>% 
    filter(showstopper == "UNKNOWN")

# Edit to add list of missing values
bakeoff <- read_csv("bakeoff.csv", na = c("", "NA", "UNKNOWN"))
Parsed with column specification:
cols(
  series = col_double(),
  episode = col_double(),
  baker = col_character(),
  signature = col_character(),
  technical = col_double(),
  showstopper = col_character(),
  result = col_character(),
  uk_airdate = col_date(format = ""),
  us_season = col_double(),
  us_airdate = col_date(format = "")
)
# Filter rows where showstopper is NA 
bakeoff %>%
  filter(is.na(showstopper))

We only had 4 missing values for the showstopper variable to start, but now all 21 are present and accounted for.

1.2 Know your data

1.2.1 Arrange and glimpse

You can combine glimpse() with other functions in a sequence using the pipe (%>%) operator. For example, you can use other dplyr functions like arrange first, then use glimpse by adding a line after the final pipe (%>%):

bakers_mini %>% 
  arrange(age) %>% 
  glimpse() # no argument needed here

Take a glimpse of the bakeoff data we imported in the first set of exercises. On which date did the first episode of the show air in the US?

# Filter and skim
bakeoff %>%
  filter(!is.na(us_season)) %>% 
  skim()
── Data Summary ────────────────────────
                           Values    
Name                       Piped data
Number of rows             302       
Number of columns          10        
_______________________              
Column type frequency:               
  character                4         
  Date                     2         
  numeric                  4         
________________________             
Group variables            None      

── Variable type: character ──────────────────────────────────────────────
  skim_variable n_missing complete_rate   min   max empty n_unique
1 baker                 0         1         3     9     0       48
2 signature             1         0.997    10   125     0      300
3 showstopper           1         0.997     5   126     0      298
4 result                0         1         2     9     0        6
  whitespace
1          0
2          0
3          0
4          0

── Variable type: Date ───────────────────────────────────────────────────
  skim_variable n_missing complete_rate min        max        median    
1 uk_airdate            0             1 2013-08-20 2016-10-26 2014-10-08
2 us_airdate            0             1 2014-12-28 2017-08-04 2015-11-08
  n_unique
1       40
2       35

── Variable type: numeric ────────────────────────────────────────────────
  skim_variable n_missing complete_rate  mean    sd    p0   p25   p50
1 series                0         1      5.49  1.13     4     4     5
2 episode               0         1      4.37  2.66     1     2     4
3 technical             1         0.997  4.85  2.98     1     2     4
4 us_season             0         1      2.49  1.13     1     1     2
    p75  p100 hist 
1     6     7 ▇▇▁▇▇
2     6    10 ▇▆▅▃▂
3     7    13 ▇▅▅▂▁
4     3     4 ▇▇▁▇▇

The first episode of the first US season aired on December 28, 2014.

1.2.2 Summarize your data

You can combine skim() with other functions in a sequence using the pipe (%>%) operator. For example, you could use other dplyr functions like group_by first, then use skim() by adding a line after the final pipe.

bakers_mini %>% 
  group_by(series) %>% 
  skim() # no argument needed here

This will produce summary statistics for each series. Let’s practice this!

# Load skimr
library(skimr)
bakeoff %>% 
  group_by(series) %>% 
  skim() # no argument needed here
no non-missing arguments to min; returning Infno non-missing arguments to min; returning Infno non-missing arguments to min; returning Infno non-missing arguments to min; returning Infno non-missing arguments to max; returning -Infno non-missing arguments to max; returning -Infno non-missing arguments to max; returning -Infno non-missing arguments to max; returning -Inf
── Data Summary ────────────────────────
                           Values    
Name                       Piped data
Number of rows             549       
Number of columns          10        
_______________________              
Column type frequency:               
  character                4         
  Date                     2         
  numeric                  3         
________________________             
Group variables            series    

── Variable type: character ──────────────────────────────────────────────
   skim_variable series n_missing complete_rate   min   max empty n_unique
 1 baker              1         0         1         3     9     0       10
 2 baker              2         0         1         3     9     0       12
 3 baker              3         0         1         4    10     0       12
 4 baker              4         0         1         3     9     0       13
 5 baker              5         0         1         4     7     0       12
 6 baker              6         0         1         3     6     0       12
 7 baker              7         0         1         3     9     0       12
 8 baker              8         0         1         3     6     0       12
 9 signature          1         0         1         8    63     0       36
10 signature          2         0         1         9    69     0       59
11 signature          3         0         1        11   119     0       76
12 signature          4         0         1        10   125     0       78
13 signature          5         1         0.986    15   107     0       73
14 signature          6         0         1        12    64     0       74
15 signature          7         0         1        12    93     0       75
16 signature          8         0         1        12    64     0       75
17 showstopper        1         3         0.917    13   197     0       33
18 showstopper        2        10         0.833    17   118     0       50
19 showstopper        3         7         0.908    12    70     0       69
20 showstopper        4         0         1         5   126     0       78
21 showstopper        5         1         0.986     8    82     0       73
22 showstopper        6         0         1        10    70     0       73
23 showstopper        7         0         1         5    86     0       74
24 showstopper        8         0         1        13    68     0       75
25 result             1         0         1         2     9     0        4
26 result             2         0         1         2     9     0        5
27 result             3         0         1         2     9     0        5
28 result             4         0         1         2     9     0        5
29 result             5         0         1         2     9     0        6
30 result             6         0         1         2     9     0        5
31 result             7         0         1         2     9     0        5
32 result             8         0         1         2     9     0        5
   whitespace
 1          0
 2          0
 3          0
 4          0
 5          0
 6          0
 7          0
 8          0
 9          0
10          0
11          0
12          0
13          0
14          0
15          0
16          0
17          0
18          0
19          0
20          0
21          0
22          0
23          0
24          0
25          0
26          0
27          0
28          0
29          0
30          0
31          0
32          0

── Variable type: Date ───────────────────────────────────────────────────
   skim_variable series n_missing complete_rate min        max       
 1 uk_airdate         1         0             1 2010-08-17 2010-09-21
 2 uk_airdate         2         0             1 2011-08-16 2011-10-04
 3 uk_airdate         3         0             1 2012-08-14 2012-10-16
 4 uk_airdate         4         0             1 2013-08-20 2013-10-22
 5 uk_airdate         5         0             1 2014-08-06 2014-10-08
 6 uk_airdate         6         0             1 2015-08-05 2015-10-07
 7 uk_airdate         7         0             1 2016-08-24 2016-10-26
 8 uk_airdate         8         0             1 2017-08-29 2017-10-31
 9 us_airdate         1        36             0 NA         NA        
10 us_airdate         2        60             0 NA         NA        
11 us_airdate         3        76             0 NA         NA        
12 us_airdate         4         0             1 2014-12-28 2015-03-01
13 us_airdate         5         0             1 2015-09-06 2015-11-08
14 us_airdate         6         0             1 2016-07-01 2016-08-12
15 us_airdate         7         0             1 2017-06-16 2017-08-04
16 us_airdate         8        75             0 NA         NA        
   median     n_unique
 1 2010-08-27        6
 2 2011-08-30        8
 3 2012-09-04       10
 4 2013-09-10       10
 5 2014-08-27       10
 6 2015-08-26       10
 7 2016-09-14       10
 8 2017-09-19       10
 9 NA                0
10 NA                0
11 NA                0
12 2015-01-18       10
13 2015-09-27       10
14 2016-07-15        7
15 2017-06-30        8
16 NA                0

── Variable type: numeric ────────────────────────────────────────────────
   skim_variable series n_missing complete_rate   mean    sd    p0   p25
 1 episode            1         0         1       2.83  1.65     1  1   
 2 episode            2         0         1       3.55  2.10     1  2   
 3 episode            3         0         1       4.43  2.67     1  2   
 4 episode            4         0         1       4.31  2.66     1  2   
 5 episode            5         0         1       4.38  2.68     1  2   
 6 episode            6         0         1       4.4   2.67     1  2   
 7 episode            7         0         1       4.4   2.67     1  2   
 8 episode            8         0         1       4.4   2.67     1  2   
 9 technical          1         7         0.806   3.97  2.56     1  2   
10 technical          2         0         1       4.9   3.03     1  2   
11 technical          3         0         1       4.82  2.92     1  2   
12 technical          4         0         1       5.08  3.19     1  2.25
13 technical          5         1         0.986   4.73  2.93     1  2   
14 technical          6         0         1       4.8   2.92     1  2   
15 technical          7         0         1       4.8   2.92     1  2   
16 technical          8         0         1       4.8   2.92     1  2   
17 us_season          1        36         0     NaN    NA       NA NA   
18 us_season          2        60         0     NaN    NA       NA NA   
19 us_season          3        76         0     NaN    NA       NA NA   
20 us_season          4         0         1       1     0        1  1   
21 us_season          5         0         1       2     0        2  2   
22 us_season          6         0         1       3     0        3  3   
23 us_season          7         0         1       4     0        4  4   
24 us_season          8        75         0     NaN    NA       NA NA   
     p50   p75  p100 hist   
 1   2.5  4        6 "▇▂▂▂▁"
 2   3    5        8 "▇▃▅▂▂"
 3   4    6.25    10 "▇▆▅▅▂"
 4   4    6       10 "▇▆▅▃▂"
 5   4    6       10 "▇▆▅▃▂"
 6   4    6       10 "▇▆▅▃▂"
 7   4    6       10 "▇▆▅▃▂"
 8   4    6       10 "▇▆▅▃▂"
 9   3    5       10 "▇▇▃▂▂"
10   4    7       12 "▇▅▃▂▂"
11   4    7       12 "▇▅▃▂▂"
12   4.5  7       13 "▇▅▅▂▂"
13   4    7       12 "▇▅▃▂▂"
14   4    7       12 "▇▅▃▂▂"
15   4    7       12 "▇▅▃▂▂"
16   4    7       12 "▇▅▃▂▂"
17  NA   NA       NA " "    
18  NA   NA       NA " "    
19  NA   NA       NA " "    
20   1    1        1 "▁▁▇▁▁"
21   2    2        2 "▁▁▇▁▁"
22   3    3        3 "▁▁▇▁▁"
23   4    4        4 "▁▁▇▁▁"
24  NA   NA       NA " "    
# Edit to filter, group by, and skim
bakeoff %>% 
  filter(!is.na(us_season)) %>% 
  group_by(us_season)  %>% 
  skim()
── Data Summary ────────────────────────
                           Values    
Name                       Piped data
Number of rows             302       
Number of columns          10        
_______________________              
Column type frequency:               
  character                4         
  Date                     2         
  numeric                  3         
________________________             
Group variables            us_season 

── Variable type: character ──────────────────────────────────────────────
   skim_variable us_season n_missing complete_rate   min   max empty
 1 baker                 1         0         1         3     9     0
 2 baker                 2         0         1         4     7     0
 3 baker                 3         0         1         3     6     0
 4 baker                 4         0         1         3     9     0
 5 signature             1         0         1        10   125     0
 6 signature             2         1         0.986    15   107     0
 7 signature             3         0         1        12    64     0
 8 signature             4         0         1        12    93     0
 9 showstopper           1         0         1         5   126     0
10 showstopper           2         1         0.986     8    82     0
11 showstopper           3         0         1        10    70     0
12 showstopper           4         0         1         5    86     0
13 result                1         0         1         2     9     0
14 result                2         0         1         2     9     0
15 result                3         0         1         2     9     0
16 result                4         0         1         2     9     0
   n_unique whitespace
 1       13          0
 2       12          0
 3       12          0
 4       12          0
 5       78          0
 6       73          0
 7       74          0
 8       75          0
 9       78          0
10       73          0
11       73          0
12       74          0
13        5          0
14        6          0
15        5          0
16        5          0

── Variable type: Date ───────────────────────────────────────────────────
  skim_variable us_season n_missing complete_rate min        max       
1 uk_airdate            1         0             1 2013-08-20 2013-10-22
2 uk_airdate            2         0             1 2014-08-06 2014-10-08
3 uk_airdate            3         0             1 2015-08-05 2015-10-07
4 uk_airdate            4         0             1 2016-08-24 2016-10-26
5 us_airdate            1         0             1 2014-12-28 2015-03-01
6 us_airdate            2         0             1 2015-09-06 2015-11-08
7 us_airdate            3         0             1 2016-07-01 2016-08-12
8 us_airdate            4         0             1 2017-06-16 2017-08-04
  median     n_unique
1 2013-09-10       10
2 2014-08-27       10
3 2015-08-26       10
4 2016-09-14       10
5 2015-01-18       10
6 2015-09-27       10
7 2016-07-15        7
8 2017-06-30        8

── Variable type: numeric ────────────────────────────────────────────────
   skim_variable us_season n_missing complete_rate  mean    sd    p0   p25
 1 series                1         0         1      4     0        4  4   
 2 series                2         0         1      5     0        5  5   
 3 series                3         0         1      6     0        6  6   
 4 series                4         0         1      7     0        7  7   
 5 episode               1         0         1      4.31  2.66     1  2   
 6 episode               2         0         1      4.38  2.68     1  2   
 7 episode               3         0         1      4.4   2.67     1  2   
 8 episode               4         0         1      4.4   2.67     1  2   
 9 technical             1         0         1      5.08  3.19     1  2.25
10 technical             2         1         0.986  4.73  2.93     1  2   
11 technical             3         0         1      4.8   2.92     1  2   
12 technical             4         0         1      4.8   2.92     1  2   
     p50   p75  p100 hist 
 1   4       4     4 ▁▁▇▁▁
 2   5       5     5 ▁▁▇▁▁
 3   6       6     6 ▁▁▇▁▁
 4   7       7     7 ▁▁▇▁▁
 5   4       6    10 ▇▆▅▃▂
 6   4       6    10 ▇▆▅▃▂
 7   4       6    10 ▇▆▅▃▂
 8   4       6    10 ▇▆▅▃▂
 9   4.5     7    13 ▇▅▅▂▂
10   4       7    12 ▇▅▃▂▂
11   4       7    12 ▇▅▃▂▂
12   4       7    12 ▇▅▃▂▂

Nice piping & skimming! Producing quick grouped summaries with the skim function is a real time (and sanity) saver.

1.2.3 Know your variable types

How many variables of each type do we have in the bakeoff data? You may also want to try piping a skimmed object to summary(), also from the skimr package:

tibble_name %>% 
  skim() %>%  # no argument needed here
  summary() # no argument needed here
bakeoff %>% 
  skim() %>%  # no argument needed here
  summary() # no argument needed here
── Data Summary ────────────────────────
                           Values    
Name                       Piped data
Number of rows             549       
Number of columns          10        
_______________________              
Column type frequency:               
  character                4         
  Date                     2         
  numeric                  4         
________________________             
Group variables            None      

1.3 Count with your data

In every episode of “The Great British Bake-Off”, bakers complete 3 challenges and the show’s judges award the title “Star Baker” to the baker who excelled in that week’s challenges (with the exception of the finale). Each baker’s result for every episode is stored in bakeoff- result is a character variable, and the value “SB” stands for star baker.

1.3.1 Distinct and count

# View distinct results
bakeoff %>%
  distinct(result)
# Count whether or not star baker
bakeoff %>% 
  count(result)
# Count whether or not star baker
bakeoff %>% 
  count(result == "SB")

There are 488 results that are not star baker.

1.3.2 Count episodes

Here’s how we can use count() back-to-back to roll up a level of detail across series:

bakers %>% 
  count(aired_us, series) %>% 
  count(aired_us)
  

Let’s practice counting the number of episodes per series.

bakeoff %>% 
  count(series, episode) %>% 
  count(series)
Using `n` as weighting variable
ℹ Quiet this message with `wt = n` or count rows with `wt = 1`

You can see that by series 3 we were treated to 10 episodes with each new series!

1.3.3 Count bakers

We’ll practice that again, focusing on counting bakers.

# Count the number of rows by series and baker
bakers_by_series <- bakeoff %>%
                      count(series, baker)


# Print to view
bakers_by_series

# Count the number of rows by series and baker
bakers_by_series <- bakeoff %>% 
  count(series, baker)
  
# Print to view
bakers_by_series
  
# Count again by series
bakers_by_series %>%
  count(series)
Using `n` as weighting variable
ℹ Quiet this message with `wt = n` or count rows with `wt = 1`
# Count again by baker
bakers_by_series %>%
  count(baker, sort = TRUE)
Using `n` as weighting variable
ℹ Quiet this message with `wt = n` or count rows with `wt = 1`

Series 4 had the most bakers with 13 total (a baker’s dozen!). Kate is the most popular name- Kates have appeared in 3 different series of the TV show.

1.3.4 Plot counts

You can learn a lot about your data by counting, but sometimes you can learn even more by plotting counts. This is especially true when you have lots of things to count! With eight series, 74 episodes, and 95 bakers, a plot can be more helpful than a table of numbers. We’ll use ggplot2 to visualize the number of bakers across episodes for each series from bakeoff.

library(ggplot2)

ggplot(bakeoff, aes(episode)) + 
    geom_bar() + 
    facet_wrap(~series)

You made it through the first chapter! We have a lot of work left to do with The Great British Bake Off data, but notice how this plot is a great sanity check: the number of bakers tends to go down with every episode (never up!), with a few exceptions where no one was eliminated. Also, you can see that there are fewer episodes in the first two series. Finally, you can see that there are always three bakers who make it to the series finale.

2 Tame your data

2.1 Cast column types

2.1.1 Cast a column to a date

A good workflow for parsing dates using readr is to, for example:

  • Use parse_date(“2012-14-08”, format = “%Y-%d-%m”) first, then
  • Use col_date(format = “%Y-%d-%m”) within cols() as the col_types argument of read_csv().

In “desserts.csv”, the variable uk_airdate is formatted like “17 August 2010”. Let’s parse, then cast this variable!

Later on in Chapter 4, we’ll showcase some functions from the lubridate package to help you extract data from dates, once they are cast properly.

# Find format to parse uk_airdate 
parse_date("17 August 2010", format = "%d %B %Y")
[1] "2010-08-17"
# Edit to cast uk_airdate
desserts <- read_csv("desserts.csv", 
                     col_types = cols(
                       uk_airdate = col_date(format = "%d %B %Y")
                     )
                    )

# Arrange by descending uk_airdate
desserts %>% 
    arrange(desc(uk_airdate))

Notice that us_airdate didn’t need to be cast - this is because the date format %Y-%m-%d is unambiguous, so it is automatically parsed as a date by readr.

2.1.2 Cast a column to a number

We saw a good workflow for parsing columns using readr:

  • Use parse_number() to practice, then
  • Use col_number() to cast.

But sometimes you’ll need to start with casting, then diagnose parsing problems using a new readr function called problems(). Using problems() on a result of read_csv() will show you the rows and columns where parsing error occurred, what the parser expected to find (for example, a number), and the actual value that caused the parsing error.

Let’s practice this with “desserts.csv”, which includes data about the judges’ ranks for each baker in the technical challenge, for every episode.

# Try to cast technical as a number
desserts <- read_csv("desserts.csv", 
                     col_types = cols(
                       uk_airdate = col_date(format = "%d %B %Y"),
                       technical = col_number()
                     )
                    )
7 parsing failures.
row       col expected actual           file
  4 technical a number    N/A 'desserts.csv'
  6 technical a number    N/A 'desserts.csv'
  8 technical a number    N/A 'desserts.csv'
 10 technical a number    N/A 'desserts.csv'
 34 technical a number    N/A 'desserts.csv'
... ......... ........ ...... ..............
See problems(...) for more details.
# View parsing problems
desserts %>%
  problems()
# Edit code to fix the parsing error 
desserts <- read_csv("desserts.csv",
                      col_types = cols(
                        uk_airdate = col_date(format = "%d %B %Y"),
                        technical = col_number()
                      ),
                        na = c("", "NA", "N/A") #solving the problem
                     )

# View parsing problems
problems(desserts)

2.1.3 Cast a column as a factor

Factors are categorical variables, where the possible values are a fixed and known set. For example, take a simple factor like bake below:

bake <- c("pie", "cake", "neither") 
parse_factor(bake, levels = NULL) 
[1] pie     cake    neither
Levels: pie cake neither

You can use parse_factor() to parse variables and col_factor() to cast columns as categorical. Both functions have a levels argument that is used to specify the possible values for the factors. When levels is set to NULL, the possible values will be inferred from the unique values in the dataset. Alternatively, you can pass a list of possible values.

This time, you will look at the result column, which contains the outcome of the competition for each baker.

# Cast result a factor
desserts <- read_csv("desserts.csv", 
                     na = c("", "NA", "N/A"),
                     col_types = cols(
                       uk_airdate = col_date(format = "%d %B %Y"),
                       technical = col_number(),                       
                       result = col_factor(levels = NULL)
                     )
                    )
                    
# Glimpse to view
glimpse(desserts)
Rows: 549
Columns: 16
$ series                <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ episode               <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2,…
$ baker                 <chr> "Annetha", "David", "Edd", "Jasminder", "…
$ technical             <dbl> 2, 3, 1, NA, 9, NA, 8, NA, 10, NA, 8, 6, …
$ result                <fct> IN, IN, IN, IN, IN, IN, IN, IN, OUT, OUT,…
$ uk_airdate            <date> 2010-08-17, 2010-08-17, 2010-08-17, 2010…
$ us_season             <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ us_airdate            <date> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
$ showstopper_chocolate <chr> "chocolate", "chocolate", "no chocolate",…
$ showstopper_dessert   <chr> "other", "other", "other", "other", "othe…
$ showstopper_fruit     <chr> "no fruit", "no fruit", "no fruit", "no f…
$ showstopper_nut       <chr> "no nut", "no nut", "no nut", "no nut", "…
$ signature_chocolate   <chr> "no chocolate", "chocolate", "no chocolat…
$ signature_dessert     <chr> "cake", "cake", "cake", "cake", "cake", "…
$ signature_fruit       <chr> "no fruit", "fruit", "fruit", "fruit", "f…
$ signature_nut         <chr> "no nut", "no nut", "no nut", "no nut", "…

While it takes a lot of work up front, casting column types when you import can make your analyses easier to reproduce.

2.2 Recode values

2.2.1 Recode a character variable

In this exercise, you’ll recode() the nut variable in the desserts data. This is a character variable that tells us, for each bake, whether a nut was a key ingredient and if so, what kind of nut!

Remember that the arguments of recode() are the variable that you want to recode, and then an expression of the form old_value = new_value. recode() is most useful when used inside a mutate() to create a new variable or reassign the old one.

Often you want to recode a certain value as missing. In this case, it is important to preserve the data type of the variable. R has built-in words to indicate missing data of various types, for example NA_character_ for missing value for a character variable, or NA_integer_ for missing integer data. The NA value is of the logical type.

Remember that you can use ?function_name_here to read more about arguments and how to use them.

# Count rows grouping by nut variable
desserts %>% 
    count(signature_nut, sort = TRUE)
    
# Edit code to recode "no nut" as missing
desserts_2 <- desserts %>% 
  mutate(nut = recode(signature_nut, "filbert" = "hazelnut", 
                           "no nut" = NA_character_))

# Count rows again 
desserts_2 %>% 
    count(nut, sort = TRUE)

2.2.2 Recode a numeric variable

Dummy variables are often used in data analysis to bin a variable into one of two categories to indicate the absence or presence of something. Dummy variables take the value 0 or 1 to stand for, for example, loser or winner. Dummy variables are often factor variables as opposed to numeric - we’ll cover more about factors in the last chapter.

In desserts, the technical variable is a number representing the judges’ ranks for each baker in the technical challenge, for every episode (remember to run glimpse(desserts) to take a peek). In this exercise, you will recode the numeric technical variable to a new factor dummy variable.

## Edit to recode tech_win as factor
desserts <- desserts %>% 
  mutate(tech_win = recode_factor(technical, `1` = 1,
                           .default = 0))

# Count to compare values                      
desserts %>% 
  count(technical == 1, tech_win)

2.3 Select variables

2.3.1 Combine functions with select

The ratings data tells us about how many UK viewers tuned in to watch “The Great British Bake-Off” for each episode of each series. If you explore ratings in console, you’ll see that there is a separate variable for the number of viewers for each episode. A key question we’ll return to in Chapter 3 is: how many more viewers watched the finale than the premiere episode?

For series with 10 episodes, which showed the most growth in viewers from the premiere to the finale? Which showed the least?

ratings <- read_csv("Ratings/02.03_messy_ratings.csv")
Parsed with column specification:
cols(
  .default = col_double(),
  premiere = col_character(),
  finale = col_character(),
  winner = col_character(),
  day_of_week = col_character(),
  timeslot = col_time(format = ""),
  channel = col_character(),
  runner_up_1 = col_character(),
  runner_up_2 = col_character(),
  season_premiere = col_character(),
  season_finale = col_character(),
  e1_uk_airdate = col_character(),
  e2_uk_airdate = col_character(),
  e3_uk_airdate = col_character(),
  e4_uk_airdate = col_character(),
  e5_uk_airdate = col_character(),
  e6_uk_airdate = col_character(),
  e7_uk_airdate = col_character(),
  e8_uk_airdate = col_character(),
  e9_uk_airdate = col_character(),
  e10_uk_airdate = col_character()
)
See spec(...) for full column specifications.
head(ratings)
ratings <- ratings %>% 
  filter(episodes == 10) %>%
  mutate(viewer_growth = e10_viewers_7day - e1_viewers_7day)
ratings %>% 
  select(series, viewer_growth)

Most: series 5, Least = series 8

2.3.2 Recode factor to plot

In the last exercise, you may have used mutate() to find the difference between the number of premiere and finale episode viewers for each series. This variable, which I created and called viewer_growth, is now available in ratings, and feel free to check out it.

Now let’s practice using select() to check the result of another mutate with recode_factor().

If you have examined ratings, you might have seen that series have aired on three different channels: “Channel 4”, “BBC One”, “BBC Two”.

# Recode channel as factor: bbc (1) or not (0)
ratings <- ratings %>% 
  mutate(bbc = recode_factor(channel, 
                             "Channel 4" = 0,
                             .default = 1))
                            
# Select to look at variables to plot next
ratings %>% 
  select(series, channel, bbc, viewer_growth)
  
# Make a filled bar chart
ggplot(ratings, aes(x = series, y = viewer_growth, fill = bbc)) +
  geom_col()

Remember to look up the other select helper functions using ?select. What is going on with viewer_growth in series 8?

2.3.3 Select and reorder variables

As we have seen, selecting a subset of columns to print can help you check that a mutate() worked as expected, and rearranging columns next to each other can help you spot obvious errors in data entry.

The select() helpers are functions that you can use inside select() to allow you to select variables based on their names.

In this exercise, you’ll work with the ratings data, and you’ll use a new helper function called everything() which can be useful when reordering columns. Use ?everything to read more. You will learn more about using helper functions in the next lesson.

# Move channel to first column
ratings %>% 
  select(channel, everything())

# Drop 7- and 28-day episode ratings
ratings %>% 
  select(-ends_with("day"))

# Move channel to front and drop 7-/28-day episode ratings
ratings %>% 
  select(channel, everything(), -ends_with("day"))

everything() is a great little helper! If it had been placed at the end, it would have added back in all the columns that end with “day”. Placing it before deselecting columns, though, is a real time-saver.

2.4 Tame variable names

2.4.1 Reformat variables

We saw how to use clean_names() from the janitor package to convert all variable names to snake_case. Here is how:

library(janitor)
young_bakers3 <- young_bakers3 %>% 
   clean_names()

The function clean_names() takes an argument case that can be used to convert variable names to other cases, like “upper_camel” or “all_caps”.

Let’s practice using the messy_ratings tibble. Remember you can use ?clean_names to see the function arguments and how to use them.

messy_ratings <- read_csv("Ratings/02.03_messy_ratings.csv")
Parsed with column specification:
cols(
  .default = col_double(),
  premiere = col_character(),
  finale = col_character(),
  winner = col_character(),
  day_of_week = col_character(),
  timeslot = col_time(format = ""),
  channel = col_character(),
  runner_up_1 = col_character(),
  runner_up_2 = col_character(),
  season_premiere = col_character(),
  season_finale = col_character(),
  e1_uk_airdate = col_character(),
  e2_uk_airdate = col_character(),
  e3_uk_airdate = col_character(),
  e4_uk_airdate = col_character(),
  e5_uk_airdate = col_character(),
  e6_uk_airdate = col_character(),
  e7_uk_airdate = col_character(),
  e8_uk_airdate = col_character(),
  e9_uk_airdate = col_character(),
  e10_uk_airdate = col_character()
)
See spec(...) for full column specifications.

colnames(messy_ratings)

names(messy_ratings) <- c("SERIES", "episodes", "premiere", "finale", "winner", "Avg.uk~Viewers", 
"day", "timeslot", "channel", "runner.up~1", "runner.up~2", "season", 
"season.premiere", "season.finale", "Gbbo E1.Viewers~7Day", "Gbbo E1.Viewers~28Day", 
"Gbbo E2.Viewers~7Day", "Gbbo E2.Viewers~28Day", "Gbbo E3.Viewers~7Day", 
"Gbbo E3.Viewers~28Day", "Gbbo E4.Viewers~7Day", "Gbbo E4.Viewers~28Day", 
"Gbbo E5.Viewers~7Day", "Gbbo E5.Viewers~28Day", "Gbbo E6.Viewers~7Day", 
"Gbbo E6.Viewers~28Day", "Gbbo E7.Viewers~7Day", "Gbbo E7.Viewers~28Day", 
"Gbbo E8.Viewers~7Day", "Gbbo E8.Viewers~28Day", "Gbbo E9.Viewers~7Day", 
"Gbbo E9.Viewers~28Day", "Gbbo E10.Viewers~7Day", "Gbbo E10.Viewers~28Day")
The `value` argument of ``names<-`()` must have the same length as `x` as of tibble 3.0.0.
`names` must have length 44, not 34.
This warning is displayed once every 8 hours.
Call `lifecycle::last_warnings()` to see where this warning was generated.The `value` argument of ``names<-`()` can't be empty as of tibble 3.0.0.
Columns 35, 36, 37, 38, 39, and 5 more must be named.
This warning is displayed once every 8 hours.
Call `lifecycle::last_warnings()` to see where this warning was generated.
# Glimpse to see variable names
glimpse(messy_ratings)
Rows: 8
Columns: 44
$ SERIES                   <dbl> 1, 2, 3, 4, 5, 6, 7, 8
$ episodes                 <dbl> 6, 8, 10, 10, 10, 10, 10, 10
$ premiere                 <chr> "17-Aug-10", "14-Aug-11", "14-Aug-12",…
$ finale                   <chr> "21-Sep-10", "4-Oct-11", "16-Oct-12", …
$ winner                   <chr> "Edd Kimber", "Joanne Wheatley", "John…
$ `Avg.uk~Viewers`         <dbl> 2.77, 4.00, 5.00, 7.35, 10.04, 12.50, …
$ day                      <chr> "Tuesday", "Tuesday", "Tuesday", "Tues…
$ timeslot                 <time> 20:00:00, 20:00:00, 20:00:00, 20:00:0…
$ channel                  <chr> "BBC Two", "BBC Two", "BBC Two", "BBC …
$ `runner.up~1`            <chr> "Miranda Gore Browne", "Holly Bell", "…
$ `runner.up~2`            <chr> "Ruth Clemens", "Mary-Anne Boermans", …
$ season                   <dbl> NA, NA, NA, 1, 2, 3, 4, NA
$ season.premiere          <chr> NA, NA, NA, "12/28/14", "9/6/15", "7/1…
$ season.finale            <chr> NA, NA, NA, "3/1/15", "11/8/15", "8/12…
$ `Gbbo E1.Viewers~7Day`   <dbl> 2.24, 3.10, 3.85, 6.60, 8.51, 11.62, 1…
$ `Gbbo E1.Viewers~28Day`  <dbl> NA, NA, NA, NA, NA, 11.73, 13.86, 9.72
$ `Gbbo E2.Viewers~7Day`   <dbl> 3.00, 3.53, 4.60, 6.65, 8.79, 11.59, 1…
$ `Gbbo E2.Viewers~28Day`  <dbl> NA, NA, NA, NA, NA, 11.84, 13.74, 9.53
$ `Gbbo E3.Viewers~7Day`   <dbl> 3.00, 3.82, 4.53, 7.17, 9.28, 12.01, 1…
$ `Gbbo E3.Viewers~28Day`  <dbl> NA, NA, NA, NA, NA, NA, 13.38, 9.06
$ `Gbbo E4.Viewers~7Day`   <dbl> 2.60, 3.60, 4.71, 6.82, 10.25, 12.36, …
$ `Gbbo E4.Viewers~28Day`  <dbl> NA, NA, NA, NA, NA, 12.67, 13.88, 8.87
$ `Gbbo E5.Viewers~7Day`   <dbl> 3.03, 3.83, 4.61, 6.95, 9.95, 12.39, 1…
$ `Gbbo E5.Viewers~28Day`  <dbl> NA, NA, NA, NA, NA, 12.63, 13.33, 8.98
$ `Gbbo E6.Viewers~7Day`   <dbl> 2.75, 4.25, 4.82, 7.32, 10.13, 12.00, …
$ `Gbbo E6.Viewers~28Day`  <dbl> NA, NA, NA, NA, NA, 12.26, 13.41, 8.91
$ `Gbbo E7.Viewers~7Day`   <dbl> NA, 4.42, 5.10, 7.76, 10.28, 12.35, 13…
$ `Gbbo E7.Viewers~28Day`  <dbl> NA, NA, NA, NA, NA, 12.58, 13.72, 9.32
$ `Gbbo E8.Viewers~7Day`   <dbl> NA, 5.060, 5.350, 7.410, 9.023, 11.090…
$ `Gbbo E8.Viewers~28Day`  <dbl> NA, NA, NA, NA, NA, 11.35, 13.45, 9.19
$ `Gbbo E9.Viewers~7Day`   <dbl> NA, NA, 5.70, 7.41, 10.67, 12.65, 13.4…
$ `Gbbo E9.Viewers~28Day`  <dbl> NA, NA, NA, NA, NA, 12.78, 13.65, 9.16
$ `Gbbo E10.Viewers~7Day`  <dbl> NA, NA, 6.74, 9.45, 13.51, 15.05, 15.9…
$ `Gbbo E10.Viewers~28Day` <dbl> NA, NA, NA, NA, NA, 15.16, 16.03, 10.13
$ NA                       <chr> "8/17/10", "8/16/11", "8/14/12", "8/20…
$ NA                       <chr> "8/24/10", "8/23/11", "8/21/12", "8/27…
$ NA                       <chr> "8/31/10", "8/30/11", "8/28/12", "9/3/…
$ NA                       <chr> "9/7/10", "9/6/11", "9/4/12", "9/10/13…
$ NA                       <chr> "9/14/10", "9/13/11", "9/11/12", "9/17…
$ NA                       <chr> "9/21/10", "9/20/11", "9/18/12", "9/24…
$ NA                       <chr> NA, "9/27/11", "9/25/12", "10/1/13", "…
$ NA                       <chr> NA, "10/4/11", "10/2/12", "10/8/13", "…
$ NA                       <chr> NA, NA, "10/9/12", "10/15/13", "10/1/1…
$ NA                       <chr> NA, NA, "10/16/12", "10/22/13", "10/8/…
# Load janitor
library(janitor)

# Reformat to lower camelcase
ratings2 <- messy_ratings %>%
  clean_names("lower_camel")
    
# Glimpse new tibble
glimpse(ratings2)
Rows: 8
Columns: 44
$ series              <dbl> 1, 2, 3, 4, 5, 6, 7, 8
$ episodes            <dbl> 6, 8, 10, 10, 10, 10, 10, 10
$ premiere            <chr> "17-Aug-10", "14-Aug-11", "14-Aug-12", "20-…
$ finale              <chr> "21-Sep-10", "4-Oct-11", "16-Oct-12", "22-O…
$ winner              <chr> "Edd Kimber", "Joanne Wheatley", "John Whai…
$ avgUkViewers        <dbl> 2.77, 4.00, 5.00, 7.35, 10.04, 12.50, 13.85…
$ day                 <chr> "Tuesday", "Tuesday", "Tuesday", "Tuesday",…
$ timeslot            <time> 20:00:00, 20:00:00, 20:00:00, 20:00:00, 20…
$ channel             <chr> "BBC Two", "BBC Two", "BBC Two", "BBC Two",…
$ runnerUp1           <chr> "Miranda Gore Browne", "Holly Bell", "Brend…
$ runnerUp2           <chr> "Ruth Clemens", "Mary-Anne Boermans", "Jame…
$ season              <dbl> NA, NA, NA, 1, 2, 3, 4, NA
$ seasonPremiere      <chr> NA, NA, NA, "12/28/14", "9/6/15", "7/1/16",…
$ seasonFinale        <chr> NA, NA, NA, "3/1/15", "11/8/15", "8/12/16",…
$ gbboE1Viewers7Day   <dbl> 2.24, 3.10, 3.85, 6.60, 8.51, 11.62, 13.58,…
$ gbboE1Viewers28Day  <dbl> NA, NA, NA, NA, NA, 11.73, 13.86, 9.72
$ gbboE2Viewers7Day   <dbl> 3.00, 3.53, 4.60, 6.65, 8.79, 11.59, 13.45,…
$ gbboE2Viewers28Day  <dbl> NA, NA, NA, NA, NA, 11.84, 13.74, 9.53
$ gbboE3Viewers7Day   <dbl> 3.00, 3.82, 4.53, 7.17, 9.28, 12.01, 13.01,…
$ gbboE3Viewers28Day  <dbl> NA, NA, NA, NA, NA, NA, 13.38, 9.06
$ gbboE4Viewers7Day   <dbl> 2.60, 3.60, 4.71, 6.82, 10.25, 12.36, 13.29…
$ gbboE4Viewers28Day  <dbl> NA, NA, NA, NA, NA, 12.67, 13.88, 8.87
$ gbboE5Viewers7Day   <dbl> 3.03, 3.83, 4.61, 6.95, 9.95, 12.39, 13.12,…
$ gbboE5Viewers28Day  <dbl> NA, NA, NA, NA, NA, 12.63, 13.33, 8.98
$ gbboE6Viewers7Day   <dbl> 2.75, 4.25, 4.82, 7.32, 10.13, 12.00, 13.13…
$ gbboE6Viewers28Day  <dbl> NA, NA, NA, NA, NA, 12.26, 13.41, 8.91
$ gbboE7Viewers7Day   <dbl> NA, 4.42, 5.10, 7.76, 10.28, 12.35, 13.45, …
$ gbboE7Viewers28Day  <dbl> NA, NA, NA, NA, NA, 12.58, 13.72, 9.32
$ gbboE8Viewers7Day   <dbl> NA, 5.060, 5.350, 7.410, 9.023, 11.090, 13.…
$ gbboE8Viewers28Day  <dbl> NA, NA, NA, NA, NA, 11.35, 13.45, 9.19
$ gbboE9Viewers7Day   <dbl> NA, NA, 5.70, 7.41, 10.67, 12.65, 13.44, 9.…
$ gbboE9Viewers28Day  <dbl> NA, NA, NA, NA, NA, 12.78, 13.65, 9.16
$ gbboE10Viewers7Day  <dbl> NA, NA, 6.74, 9.45, 13.51, 15.05, 15.90, 10…
$ gbboE10Viewers28Day <dbl> NA, NA, NA, NA, NA, 15.16, 16.03, 10.13
$ na                  <chr> "8/17/10", "8/16/11", "8/14/12", "8/20/13",…
$ na_2                <chr> "8/24/10", "8/23/11", "8/21/12", "8/27/13",…
$ na_3                <chr> "8/31/10", "8/30/11", "8/28/12", "9/3/13", …
$ na_4                <chr> "9/7/10", "9/6/11", "9/4/12", "9/10/13", "8…
$ na_5                <chr> "9/14/10", "9/13/11", "9/11/12", "9/17/13",…
$ na_6                <chr> "9/21/10", "9/20/11", "9/18/12", "9/24/13",…
$ na_7                <chr> NA, "9/27/11", "9/25/12", "10/1/13", "9/17/…
$ na_8                <chr> NA, "10/4/11", "10/2/12", "10/8/13", "9/24/…
$ na_9                <chr> NA, NA, "10/9/12", "10/15/13", "10/1/14", "…
$ na_10               <chr> NA, NA, "10/16/12", "10/22/13", "10/8/14", …
# Glimpse to see variable names
glimpse(messy_ratings)
Rows: 8
Columns: 44
$ SERIES                   <dbl> 1, 2, 3, 4, 5, 6, 7, 8
$ episodes                 <dbl> 6, 8, 10, 10, 10, 10, 10, 10
$ premiere                 <chr> "17-Aug-10", "14-Aug-11", "14-Aug-12",…
$ finale                   <chr> "21-Sep-10", "4-Oct-11", "16-Oct-12", …
$ winner                   <chr> "Edd Kimber", "Joanne Wheatley", "John…
$ `Avg.uk~Viewers`         <dbl> 2.77, 4.00, 5.00, 7.35, 10.04, 12.50, …
$ day                      <chr> "Tuesday", "Tuesday", "Tuesday", "Tues…
$ timeslot                 <time> 20:00:00, 20:00:00, 20:00:00, 20:00:0…
$ channel                  <chr> "BBC Two", "BBC Two", "BBC Two", "BBC …
$ `runner.up~1`            <chr> "Miranda Gore Browne", "Holly Bell", "…
$ `runner.up~2`            <chr> "Ruth Clemens", "Mary-Anne Boermans", …
$ season                   <dbl> NA, NA, NA, 1, 2, 3, 4, NA
$ season.premiere          <chr> NA, NA, NA, "12/28/14", "9/6/15", "7/1…
$ season.finale            <chr> NA, NA, NA, "3/1/15", "11/8/15", "8/12…
$ `Gbbo E1.Viewers~7Day`   <dbl> 2.24, 3.10, 3.85, 6.60, 8.51, 11.62, 1…
$ `Gbbo E1.Viewers~28Day`  <dbl> NA, NA, NA, NA, NA, 11.73, 13.86, 9.72
$ `Gbbo E2.Viewers~7Day`   <dbl> 3.00, 3.53, 4.60, 6.65, 8.79, 11.59, 1…
$ `Gbbo E2.Viewers~28Day`  <dbl> NA, NA, NA, NA, NA, 11.84, 13.74, 9.53
$ `Gbbo E3.Viewers~7Day`   <dbl> 3.00, 3.82, 4.53, 7.17, 9.28, 12.01, 1…
$ `Gbbo E3.Viewers~28Day`  <dbl> NA, NA, NA, NA, NA, NA, 13.38, 9.06
$ `Gbbo E4.Viewers~7Day`   <dbl> 2.60, 3.60, 4.71, 6.82, 10.25, 12.36, …
$ `Gbbo E4.Viewers~28Day`  <dbl> NA, NA, NA, NA, NA, 12.67, 13.88, 8.87
$ `Gbbo E5.Viewers~7Day`   <dbl> 3.03, 3.83, 4.61, 6.95, 9.95, 12.39, 1…
$ `Gbbo E5.Viewers~28Day`  <dbl> NA, NA, NA, NA, NA, 12.63, 13.33, 8.98
$ `Gbbo E6.Viewers~7Day`   <dbl> 2.75, 4.25, 4.82, 7.32, 10.13, 12.00, …
$ `Gbbo E6.Viewers~28Day`  <dbl> NA, NA, NA, NA, NA, 12.26, 13.41, 8.91
$ `Gbbo E7.Viewers~7Day`   <dbl> NA, 4.42, 5.10, 7.76, 10.28, 12.35, 13…
$ `Gbbo E7.Viewers~28Day`  <dbl> NA, NA, NA, NA, NA, 12.58, 13.72, 9.32
$ `Gbbo E8.Viewers~7Day`   <dbl> NA, 5.060, 5.350, 7.410, 9.023, 11.090…
$ `Gbbo E8.Viewers~28Day`  <dbl> NA, NA, NA, NA, NA, 11.35, 13.45, 9.19
$ `Gbbo E9.Viewers~7Day`   <dbl> NA, NA, 5.70, 7.41, 10.67, 12.65, 13.4…
$ `Gbbo E9.Viewers~28Day`  <dbl> NA, NA, NA, NA, NA, 12.78, 13.65, 9.16
$ `Gbbo E10.Viewers~7Day`  <dbl> NA, NA, 6.74, 9.45, 13.51, 15.05, 15.9…
$ `Gbbo E10.Viewers~28Day` <dbl> NA, NA, NA, NA, NA, 15.16, 16.03, 10.13
$ NA                       <chr> "8/17/10", "8/16/11", "8/14/12", "8/20…
$ NA                       <chr> "8/24/10", "8/23/11", "8/21/12", "8/27…
$ NA                       <chr> "8/31/10", "8/30/11", "8/28/12", "9/3/…
$ NA                       <chr> "9/7/10", "9/6/11", "9/4/12", "9/10/13…
$ NA                       <chr> "9/14/10", "9/13/11", "9/11/12", "9/17…
$ NA                       <chr> "9/21/10", "9/20/11", "9/18/12", "9/24…
$ NA                       <chr> NA, "9/27/11", "9/25/12", "10/1/13", "…
$ NA                       <chr> NA, "10/4/11", "10/2/12", "10/8/13", "…
$ NA                       <chr> NA, NA, "10/9/12", "10/15/13", "10/1/1…
$ NA                       <chr> NA, NA, "10/16/12", "10/22/13", "10/8/…
# Load janitor
library(janitor)

# Reformat to snake case
ratings2 <- messy_ratings %>%  
  clean_names()

# Glimpse cleaned names
glimpse(ratings2)
Rows: 8
Columns: 44
$ series                 <dbl> 1, 2, 3, 4, 5, 6, 7, 8
$ episodes               <dbl> 6, 8, 10, 10, 10, 10, 10, 10
$ premiere               <chr> "17-Aug-10", "14-Aug-11", "14-Aug-12", "…
$ finale                 <chr> "21-Sep-10", "4-Oct-11", "16-Oct-12", "2…
$ winner                 <chr> "Edd Kimber", "Joanne Wheatley", "John W…
$ avg_uk_viewers         <dbl> 2.77, 4.00, 5.00, 7.35, 10.04, 12.50, 13…
$ day                    <chr> "Tuesday", "Tuesday", "Tuesday", "Tuesda…
$ timeslot               <time> 20:00:00, 20:00:00, 20:00:00, 20:00:00,…
$ channel                <chr> "BBC Two", "BBC Two", "BBC Two", "BBC Tw…
$ runner_up_1            <chr> "Miranda Gore Browne", "Holly Bell", "Br…
$ runner_up_2            <chr> "Ruth Clemens", "Mary-Anne Boermans", "J…
$ season                 <dbl> NA, NA, NA, 1, 2, 3, 4, NA
$ season_premiere        <chr> NA, NA, NA, "12/28/14", "9/6/15", "7/1/1…
$ season_finale          <chr> NA, NA, NA, "3/1/15", "11/8/15", "8/12/1…
$ gbbo_e1_viewers_7day   <dbl> 2.24, 3.10, 3.85, 6.60, 8.51, 11.62, 13.…
$ gbbo_e1_viewers_28day  <dbl> NA, NA, NA, NA, NA, 11.73, 13.86, 9.72
$ gbbo_e2_viewers_7day   <dbl> 3.00, 3.53, 4.60, 6.65, 8.79, 11.59, 13.…
$ gbbo_e2_viewers_28day  <dbl> NA, NA, NA, NA, NA, 11.84, 13.74, 9.53
$ gbbo_e3_viewers_7day   <dbl> 3.00, 3.82, 4.53, 7.17, 9.28, 12.01, 13.…
$ gbbo_e3_viewers_28day  <dbl> NA, NA, NA, NA, NA, NA, 13.38, 9.06
$ gbbo_e4_viewers_7day   <dbl> 2.60, 3.60, 4.71, 6.82, 10.25, 12.36, 13…
$ gbbo_e4_viewers_28day  <dbl> NA, NA, NA, NA, NA, 12.67, 13.88, 8.87
$ gbbo_e5_viewers_7day   <dbl> 3.03, 3.83, 4.61, 6.95, 9.95, 12.39, 13.…
$ gbbo_e5_viewers_28day  <dbl> NA, NA, NA, NA, NA, 12.63, 13.33, 8.98
$ gbbo_e6_viewers_7day   <dbl> 2.75, 4.25, 4.82, 7.32, 10.13, 12.00, 13…
$ gbbo_e6_viewers_28day  <dbl> NA, NA, NA, NA, NA, 12.26, 13.41, 8.91
$ gbbo_e7_viewers_7day   <dbl> NA, 4.42, 5.10, 7.76, 10.28, 12.35, 13.4…
$ gbbo_e7_viewers_28day  <dbl> NA, NA, NA, NA, NA, 12.58, 13.72, 9.32
$ gbbo_e8_viewers_7day   <dbl> NA, 5.060, 5.350, 7.410, 9.023, 11.090, …
$ gbbo_e8_viewers_28day  <dbl> NA, NA, NA, NA, NA, 11.35, 13.45, 9.19
$ gbbo_e9_viewers_7day   <dbl> NA, NA, 5.70, 7.41, 10.67, 12.65, 13.44,…
$ gbbo_e9_viewers_28day  <dbl> NA, NA, NA, NA, NA, 12.78, 13.65, 9.16
$ gbbo_e10_viewers_7day  <dbl> NA, NA, 6.74, 9.45, 13.51, 15.05, 15.90,…
$ gbbo_e10_viewers_28day <dbl> NA, NA, NA, NA, NA, 15.16, 16.03, 10.13
$ na                     <chr> "8/17/10", "8/16/11", "8/14/12", "8/20/1…
$ na_2                   <chr> "8/24/10", "8/23/11", "8/21/12", "8/27/1…
$ na_3                   <chr> "8/31/10", "8/30/11", "8/28/12", "9/3/13…
$ na_4                   <chr> "9/7/10", "9/6/11", "9/4/12", "9/10/13",…
$ na_5                   <chr> "9/14/10", "9/13/11", "9/11/12", "9/17/1…
$ na_6                   <chr> "9/21/10", "9/20/11", "9/18/12", "9/24/1…
$ na_7                   <chr> NA, "9/27/11", "9/25/12", "10/1/13", "9/…
$ na_8                   <chr> NA, "10/4/11", "10/2/12", "10/8/13", "9/…
$ na_9                   <chr> NA, NA, "10/9/12", "10/15/13", "10/1/14"…
$ na_10                  <chr> NA, NA, "10/16/12", "10/22/13", "10/8/14…

You have a ton of case options with the clean_names function- choose wisely!

2.4.2 Rename and subset variables

In the next chapter, when we reshape data to be tidy, you’ll see the importance of having tame variable names.

Now that our variable names in ratings have been converted to snake_case, you’ll practice combining renaming with select() helper functions to take advantage of the variable name structure.

To rename a group of variables, you only need to specify a prefix of the new name.

For example, my_tbl %>% select( new_name_ = starts_with(“oldname”) ) will find all variables in my_tbl whose names start with oldname, enumerate them, then rename each variable as new_name_, where N is a number. If my_tbl has variables oldname, oldname_v1, oldname3, running the statement above will replace these names with new_name_1, new_name_2, new_name_3. Notice how we only specified new_name_ in select(), and the numbers were appended automatically!

Take a second to glimpse() at ratings in your console, and take note of the variables that store data about the number of viewers. Can you see the pattern in their names? In this exercise, you will first select those variables using a helper function, and then rename them!

# Select 7-day viewer data by series
viewers_7day <- ratings %>%
    select(series, ends_with("7day"))
    

# Glimpse
glimpse(viewers_7day)
Rows: 6
Columns: 11
$ series           <dbl> 3, 4, 5, 6, 7, 8
$ e1_viewers_7day  <dbl> 3.85, 6.60, 8.51, 11.62, 13.58, 9.46
$ e2_viewers_7day  <dbl> 4.60, 6.65, 8.79, 11.59, 13.45, 9.23
$ e3_viewers_7day  <dbl> 4.53, 7.17, 9.28, 12.01, 13.01, 8.68
$ e4_viewers_7day  <dbl> 4.71, 6.82, 10.25, 12.36, 13.29, 8.55
$ e5_viewers_7day  <dbl> 4.61, 6.95, 9.95, 12.39, 13.12, 8.61
$ e6_viewers_7day  <dbl> 4.82, 7.32, 10.13, 12.00, 13.13, 8.61
$ e7_viewers_7day  <dbl> 5.10, 7.76, 10.28, 12.35, 13.45, 9.01
$ e8_viewers_7day  <dbl> 5.350, 7.410, 9.023, 11.090, 13.260, 8.950
$ e9_viewers_7day  <dbl> 5.70, 7.41, 10.67, 12.65, 13.44, 9.03
$ e10_viewers_7day <dbl> 6.74, 9.45, 13.51, 15.05, 15.90, 10.04
# Adapt code to also rename 7-day viewer data
viewers_7day <- ratings %>% 
    select(series, viewers_7day_ = ends_with("7day"))

# Glimpse
glimpse(viewers_7day)
Rows: 6
Columns: 11
$ series          <dbl> 3, 4, 5, 6, 7, 8
$ viewers_7day_1  <dbl> 3.85, 6.60, 8.51, 11.62, 13.58, 9.46
$ viewers_7day_2  <dbl> 4.60, 6.65, 8.79, 11.59, 13.45, 9.23
$ viewers_7day_3  <dbl> 4.53, 7.17, 9.28, 12.01, 13.01, 8.68
$ viewers_7day_4  <dbl> 4.71, 6.82, 10.25, 12.36, 13.29, 8.55
$ viewers_7day_5  <dbl> 4.61, 6.95, 9.95, 12.39, 13.12, 8.61
$ viewers_7day_6  <dbl> 4.82, 7.32, 10.13, 12.00, 13.13, 8.61
$ viewers_7day_7  <dbl> 5.10, 7.76, 10.28, 12.35, 13.45, 9.01
$ viewers_7day_8  <dbl> 5.350, 7.410, 9.023, 11.090, 13.260, 8.950
$ viewers_7day_9  <dbl> 5.70, 7.41, 10.67, 12.65, 13.44, 9.03
$ viewers_7day_10 <dbl> 6.74, 9.45, 13.51, 15.05, 15.90, 10.04

Using select helper functions is a powerful way to tame your variable names. Tame names make it easier to tidy your data.

2.4.3 Rename and reorder variables

In the previous exercise, you renamed all the variables in ratings that end with “7day” and kept only a subset of variables. In this exercise, you’ll adapt that code to drop some specific columns, while keeping and reordering other columns.

# Adapt code to drop 28-day columns; keep 7-day in front
viewers_7day <- ratings %>% 
    select(viewers_7day_ = ends_with("7day"),
        everything(),
        -ends_with("28day"))
        
# Glimpse
glimpse(viewers_7day)
Rows: 6
Columns: 36
$ viewers_7day_1  <dbl> 3.85, 6.60, 8.51, 11.62, 13.58, 9.46
$ viewers_7day_2  <dbl> 4.60, 6.65, 8.79, 11.59, 13.45, 9.23
$ viewers_7day_3  <dbl> 4.53, 7.17, 9.28, 12.01, 13.01, 8.68
$ viewers_7day_4  <dbl> 4.71, 6.82, 10.25, 12.36, 13.29, 8.55
$ viewers_7day_5  <dbl> 4.61, 6.95, 9.95, 12.39, 13.12, 8.61
$ viewers_7day_6  <dbl> 4.82, 7.32, 10.13, 12.00, 13.13, 8.61
$ viewers_7day_7  <dbl> 5.10, 7.76, 10.28, 12.35, 13.45, 9.01
$ viewers_7day_8  <dbl> 5.350, 7.410, 9.023, 11.090, 13.260, 8.950
$ viewers_7day_9  <dbl> 5.70, 7.41, 10.67, 12.65, 13.44, 9.03
$ viewers_7day_10 <dbl> 6.74, 9.45, 13.51, 15.05, 15.90, 10.04
$ series          <dbl> 3, 4, 5, 6, 7, 8
$ episodes        <dbl> 10, 10, 10, 10, 10, 10
$ premiere        <chr> "14-Aug-12", "20-Aug-13", "6-Aug-14", "5-Aug-15…
$ finale          <chr> "16-Oct-12", "22-Oct-13", "8-Oct-14", "7-Oct-15…
$ winner          <chr> "John Whaite", "Frances Quinn", "Nancy Birtwhis…
$ avg_uk_viewers  <dbl> 5.00, 7.35, 10.04, 12.50, 13.85, 9.29
$ day_of_week     <chr> "Tuesday", "Tuesday", "Wednesday", "Wednesday",…
$ timeslot        <time> 20:00:00, 20:00:00, 20:00:00, 20:00:00, 20:00:…
$ channel         <chr> "BBC Two", "BBC Two", "BBC One", "BBC One", "BB…
$ runner_up_1     <chr> "Brendan Lynch", "Kimberley Wilson", "Luis Troy…
$ runner_up_2     <chr> "James Morton", "Ruby Tandoh", "Richard Burr", …
$ season          <dbl> NA, 1, 2, 3, 4, NA
$ season_premiere <chr> NA, "12/28/14", "9/6/15", "7/1/16", "6/16/17", …
$ season_finale   <chr> NA, "3/1/15", "11/8/15", "8/12/16", "8/4/17", NA
$ e1_uk_airdate   <chr> "8/14/12", "8/20/13", "8/6/14", "8/5/15", "8/24…
$ e2_uk_airdate   <chr> "8/21/12", "8/27/13", "8/13/14", "8/12/15", "8/…
$ e3_uk_airdate   <chr> "8/28/12", "9/3/13", "8/20/14", "8/19/15", "9/7…
$ e4_uk_airdate   <chr> "9/4/12", "9/10/13", "8/27/14", "8/26/15", "9/1…
$ e5_uk_airdate   <chr> "9/11/12", "9/17/13", "9/3/14", "9/2/15", "9/21…
$ e6_uk_airdate   <chr> "9/18/12", "9/24/13", "9/10/14", "9/9/15", "9/2…
$ e7_uk_airdate   <chr> "9/25/12", "10/1/13", "9/17/14", "9/16/15", "10…
$ e8_uk_airdate   <chr> "10/2/12", "10/8/13", "9/24/14", "9/23/15", "10…
$ e9_uk_airdate   <chr> "10/9/12", "10/15/13", "10/1/14", "9/30/15", "1…
$ e10_uk_airdate  <chr> "10/16/12", "10/22/13", "10/8/14", "10/7/15", "…
$ viewer_growth   <dbl> 2.89, 2.85, 5.00, 3.43, 2.32, 0.58
$ bbc             <fct> 1, 1, 1, 1, 1, 0
# Adapt code to keep original order
viewers_7day <- ratings %>% 
    select(everything(),
           viewers_7day_ = ends_with("7day"), 
           -ends_with("28day"))

# Glimpse
glimpse(viewers_7day)
Rows: 6
Columns: 36
$ series          <dbl> 3, 4, 5, 6, 7, 8
$ episodes        <dbl> 10, 10, 10, 10, 10, 10
$ premiere        <chr> "14-Aug-12", "20-Aug-13", "6-Aug-14", "5-Aug-15…
$ finale          <chr> "16-Oct-12", "22-Oct-13", "8-Oct-14", "7-Oct-15…
$ winner          <chr> "John Whaite", "Frances Quinn", "Nancy Birtwhis…
$ avg_uk_viewers  <dbl> 5.00, 7.35, 10.04, 12.50, 13.85, 9.29
$ day_of_week     <chr> "Tuesday", "Tuesday", "Wednesday", "Wednesday",…
$ timeslot        <time> 20:00:00, 20:00:00, 20:00:00, 20:00:00, 20:00:…
$ channel         <chr> "BBC Two", "BBC Two", "BBC One", "BBC One", "BB…
$ runner_up_1     <chr> "Brendan Lynch", "Kimberley Wilson", "Luis Troy…
$ runner_up_2     <chr> "James Morton", "Ruby Tandoh", "Richard Burr", …
$ season          <dbl> NA, 1, 2, 3, 4, NA
$ season_premiere <chr> NA, "12/28/14", "9/6/15", "7/1/16", "6/16/17", …
$ season_finale   <chr> NA, "3/1/15", "11/8/15", "8/12/16", "8/4/17", NA
$ viewers_7day_1  <dbl> 3.85, 6.60, 8.51, 11.62, 13.58, 9.46
$ viewers_7day_2  <dbl> 4.60, 6.65, 8.79, 11.59, 13.45, 9.23
$ viewers_7day_3  <dbl> 4.53, 7.17, 9.28, 12.01, 13.01, 8.68
$ viewers_7day_4  <dbl> 4.71, 6.82, 10.25, 12.36, 13.29, 8.55
$ viewers_7day_5  <dbl> 4.61, 6.95, 9.95, 12.39, 13.12, 8.61
$ viewers_7day_6  <dbl> 4.82, 7.32, 10.13, 12.00, 13.13, 8.61
$ viewers_7day_7  <dbl> 5.10, 7.76, 10.28, 12.35, 13.45, 9.01
$ viewers_7day_8  <dbl> 5.350, 7.410, 9.023, 11.090, 13.260, 8.950
$ viewers_7day_9  <dbl> 5.70, 7.41, 10.67, 12.65, 13.44, 9.03
$ viewers_7day_10 <dbl> 6.74, 9.45, 13.51, 15.05, 15.90, 10.04
$ e1_uk_airdate   <chr> "8/14/12", "8/20/13", "8/6/14", "8/5/15", "8/24…
$ e2_uk_airdate   <chr> "8/21/12", "8/27/13", "8/13/14", "8/12/15", "8/…
$ e3_uk_airdate   <chr> "8/28/12", "9/3/13", "8/20/14", "8/19/15", "9/7…
$ e4_uk_airdate   <chr> "9/4/12", "9/10/13", "8/27/14", "8/26/15", "9/1…
$ e5_uk_airdate   <chr> "9/11/12", "9/17/13", "9/3/14", "9/2/15", "9/21…
$ e6_uk_airdate   <chr> "9/18/12", "9/24/13", "9/10/14", "9/9/15", "9/2…
$ e7_uk_airdate   <chr> "9/25/12", "10/1/13", "9/17/14", "9/16/15", "10…
$ e8_uk_airdate   <chr> "10/2/12", "10/8/13", "9/24/14", "9/23/15", "10…
$ e9_uk_airdate   <chr> "10/9/12", "10/15/13", "10/1/14", "9/30/15", "1…
$ e10_uk_airdate  <chr> "10/16/12", "10/22/13", "10/8/14", "10/7/15", "…
$ viewer_growth   <dbl> 2.89, 2.85, 5.00, 3.43, 2.32, 0.58
$ bbc             <fct> 1, 1, 1, 1, 1, 0

3 Tidy your data

3.1 Introduction to Tidy Data

3.1.1 Plot untidy data

We’ll start by visualizing the number of show viewers (in millions, cumulative across seven days) across series by episode using the ratings data

ratings <- read_csv("Ratings/messy_ratings.csv",
                     col_types = cols(
                    series = col_factor(levels = NULL)
                    ))
# Plot of episode 1 viewers by series
ggplot(ratings, aes(x = series, y = e1))+
 geom_col()

# Adapt code to plot episode 2 viewers by series
ggplot(ratings, aes(x = series, y = e2)) +
    geom_col()

Series 8 was the first with different hosts on a different channel. Do you notice a pattern? To draw any conclusions, we’d want to plot the viewers for every episode though- aren’t you glad you weren’t asked to make all 10 plots? Wouldn’t it be nice if episode was a variable we could use?

3.2 Gather

3.2.1 Gather & plot

we’ll tidy the data, then use row_number() from the dplyr package to create a continuous episode count variable, such that episode 1 from series 1 will have a value of 1, episode 1 from series 2 will have a value of 7, and so on. This new variable will be mapped onto the x-axis using ggplot2.

library(tidyr)
tidy_ratings <- ratings %>%
    # Gather and convert episode to factor
    gather(key = "episode", value = "viewers_7day", -series, 
           factor_key = TRUE, na.rm = TRUE) %>%
    # Sort in ascending order by series and episode
    arrange(series, episode) %>% 
    # Create new variable using row_number()
    mutate(episode_count = row_number())

# Plot viewers by episode and series
ggplot(tidy_ratings, aes(x = episode_count, 
                y = viewers_7day, 
                fill = series)) +
    geom_col()

Beautiful plot! In series 8, one of the judges accidentally tweeted out the winner’s name 12 hours before the finale aired. Do you think that hurt their finale viewer numbers?

3.2.2 Gather & plot non-sequential columns

In this exercise, you’ll combine tidyr with dplyr::select() to keep and gather non-sequential columns in the ratings2 data. You can review the select helper functions here. The ratings2 data includes per-episode data:

  • e*_7day: viewers within a 7-day window, and
  • e*_28day: viewers within a 28-day window.

Here, we’ll focus on the 7-day viewers.

ratings2 <- read_csv("Ratings/messy_ratings2.csv",
                     col_types = cols(
                    series = col_factor(levels = NULL)
                    ))
week_ratings <- ratings2  %>% 
    # Select 7-day viewer ratings
    select(series, ends_with("7day")) %>% 
    # Gather 7-day viewers by episode
    gather(episode, viewers_7day, ends_with("7day"), na.rm = TRUE, factor_key = TRUE)
# Plot 7-day viewers by episode and series
ggplot(week_ratings, aes(x = episode, 
                y = viewers_7day, 
                group = series)) +
    geom_line() +
    facet_wrap(~series) 

You’re done gathering! This plot tells you a lot, but the x-axis labels are a mess! We need to do more tidying to make this plot better.

3.3 Separate

3.3.1 Separate a column

To create the week_ratings data, we first selected all the columns in ratings2 that contain data about the number of 7-day viewers in millions for each series, then we gathered those 10 columns into two columns.

In this exercise, you’ll use the tidyr, dplyr, and readr packages to tidy and plot the week_ratings data so we can read the x-axis, which labels each episode!

week_ratings <- ratings2 %>% 
    select(series, ends_with("7day")) %>% 
    gather(episode, viewers_7day, ends_with("7day"), 
           na.rm = TRUE) %>% 
    separate(episode, into = "episode", extra = "drop") %>% 
    # Edit to parse episode number
    mutate(episode = parse_number(episode))

# Print to view
week_ratings
# Create week_ratings
week_ratings <- ratings2 %>% 
    select(series, ends_with("7day")) %>% 
    gather(episode, viewers_7day, ends_with("7day"), 
           na.rm = TRUE) %>% 
    separate(episode, into = "episode", extra = "drop") %>% 
    mutate(episode = parse_number(episode))
    
# Edit your code to color by series and add a theme
ggplot(week_ratings, aes(x = episode, y = viewers_7day, 
                         group = series, color = series)) +
    geom_line() +
    facet_wrap(~series) +
    guides(color = FALSE) +
    theme_minimal()

3.3.2 Unite columns

In the tidyr package, the opposite of separate() is unite(). Sometimes you need to paste values from two or more columns together to tidy. Here is an example usage for unite():

data %>%
 unite(new_var, old_var1, old_var2) 

The ratings2 dataset includes these variables:

  • series: 1-8
  • episode: 1-10
  • viewers_millions: whole number of 7-day viewers in millions
  • viewers_decimal: additional number of 7-days viewers in millions

If viewers_millions = 2 and viewers_decimal = .6, then the total number of viewers in millions is 2.6. In this exercise, you’ll practice uniting these two columns into a single new column.

ratings_2 <- readRDS("ratings_2.rds")
ratings3 <- ratings_2 %>% 
    # Unite viewers in millions and decimals together    
    unite(viewers_7day, viewers_millions, viewers_decimal)

# Print to view
ratings3
ratings3 <- ratings_2  %>% 
    # Adapt to change the separator
    unite(viewers_7day, viewers_millions, viewers_decimal, sep = "")

# Print to view
ratings3
ratings3 <- ratings_2  %>% 
    # Unite and change the separator
    unite(viewers_7day, viewers_millions, viewers_decimal, sep = "") %>%
    # Adapt to cast viewers as a number
    mutate(viewers_7day = parse_number(viewers_7day))

# Print to view
ratings3

3.4 Spread

3.4.1 Spread rows into columns

In this exercise, you’ll use the tidyr, dplyr, and readr packages to tidy the ratings2 data after using dplyr::count(). The tidy_ratings_all data should have four variables to work with:

  • series (integer: 1-8),
  • episode (integer: 1-10),
  • days (integer: 7 or 28), and
  • viewers (numeric).
# Create tidy data with 7- and 28-day viewers
tidy_ratings_all <- ratings2 %>%
    gather(key = episode, value = viewers, ends_with("day"), na.rm = TRUE) %>% 
    separate(episode, into = c("episode", "days")) %>%  
    mutate(episode = parse_number(episode),
           days = parse_number(days))

Using your new tidy data, count the number of viewers grouping by series and days. Remember: you don’t need to use group_by before count, but you do need to use the wt = argument to sum the values in the viewers column.

tidy_ratings_all %>%
    # Count viewers by series and days
    count(series, days, wt = viewers)
tidy_7day <- tidy_ratings_all %>% 
    # Count viewers by series and days
    count(series, days, wt = viewers) %>%
    # Adapt to spread counted values
      spread(key = days, value = n, sep = "_")

The sep = "_" argument is especially helpful when you are spreading by a numeric variable so that you don’t end up with variable names that start with numbers.

3.5 Tidy multiple sets of columns

3.5.1 Masterclass: Tidy

In the last three exercises, you’ll tidy the ratings data to create a scatterplot to see the relationship between the number of premiere and finale UK viewers (7-day only) by series.

# Gather viewer columns and remove NA rows
tidy_ratings <- ratings %>%
    gather(episode, viewers, -series, na.rm = TRUE) %>%
    mutate(episode = parse_number(episode)) %>% 
    group_by(series) %>% 
    filter(episode == 1 | episode == max(episode)) %>% 
    ungroup()
tidy_ratings

Now we have a tidy version of the ratings data with only premiere and finale episodes. Let’s see if that tweet spoiled the Series 8 finale.

In the previous exercise, you gathered, then did a grouped filter to make tidy_ratings. In this exercise, you’ll make two charts in ggplot2 with this tidy data to see if the judge’s tweet spoiled the Series 8 finale ratings.

The two plots you’ll make, a slope chart and a dumbbell chart, are nice ways to show change between two data points combining geom_point() and geom_line().

You’ll use these to visualize the “finale bump” in viewers between the premiere and finale episodes across series.

first_last <- tidy_ratings %>% 
  mutate(episode = recode(episode, `1` = "first", .default = "last")) 
first_last
# Fill in to make slope chart
ggplot(first_last, aes(x = episode, y = viewers, color = series)) +
  geom_point() +
  geom_line(aes(group = series))

# Switch the variables mapping x-axis and color
ggplot(first_last, aes(x = series, y = viewers, color = episode)) +
  geom_point() + # keep
  geom_line(aes(group = series)) + # keep
  coord_flip() # keep

Great work- notice how each of these two plots relied on data that is tidy! Series 8 had a pretty small ‘finale bump’, but we are still looking at raw changes in viewers. Let’s visualize percentage changes in viewers instead.

In the previous exercise, you visualized the size of the “finale bump” in viewers for each series. From those charts, we know that finale episodes always garner more viewers than premieres. We also know that the change in viewers for series 8 seemed smaller compared to some previous series.

But the question of whether the series 8 finale viewers were “spoiled” might be best answered by comparing the relative percentage changes in premiere and finale viewers across series. In this exercise, you’ll reshape first_last once more to calculate and visualize the relative finale bump.

# Calculate relative increase in viewers
bump_by_series <- first_last %>% 
  spread(episode, viewers) %>%   
  mutate(bump = (last - first) / first)
bump_by_series
# Calculate relative increase in viewers
bump_by_series <- first_last %>% 
  spread(episode, viewers) %>%   
  mutate(bump = (last - first) / first)
  
# Fill in to make bar chart of bumps by series
ggplot(bump_by_series, aes(x = series, y = bump)) +
  geom_col() +
  scale_y_continuous(labels = scales::percent) # converts to %

Series 8 definitely had the lowest ‘finale bump’ so far- only 6% more viewers watched the finale than the premiere. Notice that we had to spread here, because the tidy version of the data depends on the question you want to ask.

4 Transform your data

4.1 Complex recoding with case_when

4.1.1 Combine two variables

In this exercise, you’ll use case_when() with the bakers data to create a new variable based on the number of times each baker was crowned as star baker or the technical challenge winner:

  • If star baker is greater than technical wins, recode as super_star
  • If star baker is less than technical wins, recode as high_tech
  • Recode the rest as well_rounded
bakers <- read_csv("Bakers/baker_results.csv",
                   col_types = cols(series = col_factor(levels = NULL),
                                    series_winner = col_factor(levels = NULL))
                   )
head(bakers)
# Create skill variable with 3 levels
bakers_skill <- bakers %>% 
  mutate(skill = case_when(
    star_baker > technical_winner ~ "super_star",
    star_baker < technical_winner ~ "high_tech",
    TRUE ~ "well_rounded"
  ))
  
# Filter zeroes to examine skill variable
bakers_skill %>% 
  filter(star_baker == 0 & technical_winner == 0) %>% 
  count(skill)

Whoops- we may not want bakers who never won star baker or the technical challenge to be called ‘well-rounded’. Let’s add another level to take care of this!

4.1.2 Add another bin

In this exercise, you’ll edit your previous code so that skill has four levels:

  • If both star baker and technical wins are zero, recode as NA_character_
  • If star baker is greater than technical wins, recode as super_star
  • If star baker is less than technical wins, recode as high_tech
  • If star baker is equal to technical wins, recode as well_rounded
# Add pipe to drop skill = NA
bakers_skill <- bakers %>% 
  mutate(skill = case_when(
    star_baker > technical_winner ~ "super_star",
    star_baker < technical_winner ~ "high_tech",
    star_baker == 0 & technical_winner == 0 ~ NA_character_,
    star_baker == technical_winner  ~ "well_rounded"
  )) %>% 
  drop_na(skill)
  
# Count bakers by skill
bakers_skill %>%
  count(skill)

Very nice- now we only have 15 ‘well-rounded’ bakers, and we dropped the 41 bakers who never won star baker or a technical challenge.

4.2 Factors

4.2.1 Cast a factor and examine levels

In this exercise, you’ll start where you left off in the last series of exercises, working with the bakers data to examine your new categorical variable called skill, which has three levels: well_rounded, super_star, and high_tech.

# Cast skill as a factor
bakers <- bakers_skill %>% 
  mutate(skill = as.factor(skill))

# Examine levels
levels(bakers$skill)

Sometimes factors have a natural order to them- like beginner, intermediate, and advanced- but sometimes they don’t. Know your data, and know your factors!

4.2.2 Plot factor counts

In this exercise, you’ll use ggplot2 to make a bar chart with one per level of the skill factor we just made, where the height of the bar shows the number of bakers in each skill level.

Notice a big difference between the data here and the data we have used before to make a bar chart- here, you want the height of the bars to be represent the number of rows in the bakers data in each skill level. This is compared to bar charts we made where the height represented values in the data (when you used geom_col()).

library(forcats)
# Edit to reverse x-axis order
ggplot(bakers, aes(x = fct_rev(skill), fill = series_winner)) +
  geom_bar()

Note that when you use function like fct_rev(), forcats converts the character to a factor for your plot

4.3 Dates

4.3.1 Cast characters as dates

We can use lubridate to parse and cast a date variable within a mutate(). In this exercise, you’ll practice doing this with the baker_dates data, and then go even further to extract data from the dates like the labelled month. Remember to use ?month to read about the month() function and its arguments.

baker_dates <- readRDS("baker_dates.rds")
head(baker_dates)
library(lubridate)

Attaching package: ‘lubridate’

The following objects are masked from ‘package:base’:

    date, intersect, setdiff, union
# Add a line to extract labeled month
baker_dates_cast <- baker_dates %>% 
  mutate(last_date_appeared_us = dmy(last_date_appeared_us),
         last_month_us = month(last_date_appeared_us, label = TRUE))
         
# Make bar chart by last month
ggplot(baker_dates_cast,aes(x = last_month_us)) +
  geom_bar()

There are more functions like year for getting out other date components- you can even extract fiscal quarters or semesters! Look at the documentation to see more ways.

4.3.2 Calculate timespans

The first step to calculating a timespan in lubridate is to make an interval, then use division to convert the units to what you want (like weeks(x) or months(x)). The x refers to the number of time units to be included in the period. In this exercise, you’ll calculate timespans with the baker_time data we just made.

# Add a line to create whole months on air variable
baker_time <- baker_dates  %>% 
  mutate(time_on_air = interval(first_date_appeared_uk, last_date_appeared_uk),
         weeks_on_air = time_on_air / weeks(1),
         months_on_air = time_on_air %/% months(1))
baker_time

Here again, lubridate gives you many options for working with periods from years down to picoseconds. You may also end up needing to work with durations too, which have similar functions that always start with a d from dyears to dpicoseconds.

4.4 Strings

4.4.1 Wrangle a character variable

In this exercise, we’ll wrangle the values in the position_reached variable, which looks like this right now:

messy_baker_results <- read_csv("Bakers/messy_baker_results.csv")
Parsed with column specification:
cols(
  .default = col_character(),
  series = col_double(),
  star_baker = col_double(),
  technical_winner = col_double(),
  technical_top3 = col_double(),
  technical_bottom = col_double(),
  technical_highest = col_double(),
  technical_lowest = col_double(),
  technical_median = col_double(),
  series_winner = col_double(),
  series_runner_up = col_double(),
  total_episodes_appeared = col_double(),
  percent_episodes_appeared = col_double(),
  percent_technical_top3 = col_double(),
  first_date_appeared_uk = col_date(format = ""),
  last_date_appeared_uk = col_date(format = ""),
  first_date_us = col_date(format = ""),
  last_date_us = col_date(format = ""),
  e_1_technical = col_double(),
  e_10_technical = col_double(),
  e_2_technical = col_double()
  # ... with 7 more columns
)
See spec(...) for full column specifications.
messy_baker_results %>% 
  count(position_reached)

Let’s clean these strings up!

library(stringr)
# Add another mutate to replace "THIRD PLACE" with "RUNNER UP"and count
bakers_2 <- messy_baker_results %>% 
  mutate(position_reached = str_to_upper(position_reached),
         position_reached = str_replace(position_reached, "-", " "),
         position_reached = str_replace(position_reached, "THIRD PLACE", "RUNNER UP"))

# Count rows
bakers_2 %>%
  count(position_reached)

Great work! Each of the 8 series had one winner and two runner-ups, so now this variable is much easier to work with!

4.4.2 Detect a string pattern

we’ll wrangle each bakers’ occupation from the bakers data, which is currently a character string like this:

bakers_2 %>% 
  select(baker, occupation)

We’ll create a logical variable (TRUE or FALSE) indicating whether or not each baker is a student.

# Add a line to create new variable called student
bakers_3 <- bakers_2 %>% 
    mutate(occupation = str_to_lower(occupation), 
           student = str_detect(occupation, "student"))

# Find all students and examine occupations
bakers_3 %>%
  filter(student == TRUE) %>%
  select(baker, occupation, student)

Rav’s occupation was student support, but we classified him as a student. This shows a common problem in working with data- it helps to always check your work!

---
title: "Working with Data in the Tidyverse"
output:
  html_notebook:
    toc: true
    toc_float: true
    toc_collapsed: false
    number_sections: true
    
toc_depth: 3
---
# Explore your data

## Import your data

### Read a CSV file

In this exercise, you'll use read_csv() twice. The first time you will only specify the filename, but you'll notice a problem with the imported data. The second time you'll use a new argument called skip to fix the problem. Remember to use ?read_csv to read more about arguments like skip and how to use them.

The data you'll work with is from "The Great British Bake-Off." The file "bakeoff.csv" contains data for each episode of the show, organized by series and baker.

[Tidyverse chearsheet](https://datacamp-community-prod.s3.amazonaws.com/e63a8f6b-2aa3-4006-89e0-badc294b179c)

```{r}
# Load readr
library(readr)

# Create bakeoff from "bakeoff.csv"
bakeoff <- read_csv("bakeoff.csv")

# Print bakeoff
bakeoff

# Create bakeoff but skip first row in case needed
# bakeoff <- read_csv("bakeoff.csv", skip = 1)
```
Our original import had 3 problems: we had 1 extra observation, our first row stored the true variable names, and all column types were characters. Adding skip = 1 worked because the default argument for col_names is TRUE.

### Assign missing values

The read_csv() function also has an na argument, which allows you to specify value(s) that represent missing values in your data. The default values for the na argument are c("", "NA"), so both are recoded as missing (NA) in R. When you read in data, you can add additional values like the string "UNKNOWN" to a vector of missing values using the c() function to combine multiple values into a single vector.

The is.na() function is also helpful for identifying rows with missing values for a variable.

```{r}
# Load dplyr
library(dplyr)

# Filter rows where showstopper is UNKNOWN 
bakeoff %>% 
    filter(showstopper == "UNKNOWN")

# Edit to add list of missing values
bakeoff <- read_csv("bakeoff.csv", na = c("", "NA", "UNKNOWN"))

# Filter rows where showstopper is NA 
bakeoff %>%
  filter(is.na(showstopper))
```
We only had 4 missing values for the showstopper variable to start, but now all 21 are present and accounted for.

## Know your data

### Arrange and glimpse

You can combine glimpse() with other functions in a sequence using the pipe (%>%) operator. For example, you can use other dplyr functions like arrange first, then use glimpse by adding a line after the final pipe (%>%):

    bakers_mini %>% 
      arrange(age) %>% 
      glimpse() # no argument needed here

Take a glimpse of the bakeoff data we imported in the first set of exercises. On which date did the first episode of the show air in the US?
```{r}
# Filter and skim
bakeoff %>%
  filter(!is.na(us_season)) %>% 
  skim()
```
The first episode of the first US season aired on December 28, 2014.

### Summarize your data

You can combine skim() with other functions in a sequence using the pipe (%>%) operator. For example, you could use other dplyr functions like group_by first, then use skim() by adding a line after the final pipe.

    bakers_mini %>% 
      group_by(series) %>% 
      skim() # no argument needed here

This will produce summary statistics for each series. Let's practice this!
```{r}
# Load skimr
library(skimr)
```

```{r}
bakeoff %>% 
  group_by(series) %>% 
  skim() # no argument needed here
```
```{r}
# Edit to filter, group by, and skim
bakeoff %>% 
  filter(!is.na(us_season)) %>% 
  group_by(us_season)  %>% 
  skim()
```
Nice piping & skimming! Producing quick grouped summaries with the skim function is a real time (and sanity) saver.

### Know your variable types

How many variables of each type do we have in the bakeoff data? You may also want to try piping a skimmed object to summary(), also from the skimr package:

    tibble_name %>% 
      skim() %>%  # no argument needed here
      summary() # no argument needed here

```{r}
bakeoff %>% 
  skim() %>%  # no argument needed here
  summary() # no argument needed here
```
## Count with your data

In every episode of "The Great British Bake-Off", bakers complete 3 challenges and the show's judges award the title "Star Baker" to the baker who excelled in that week's challenges (with the exception of the finale). Each baker's result for every episode is stored in bakeoff- result is a character variable, and the value "SB" stands for star baker.


### Distinct and count

```{r}
# View distinct results
bakeoff %>%
  distinct(result)
```
```{r}
# Count whether or not star baker
bakeoff %>% 
  count(result)
```
```{r}
# Count whether or not star baker
bakeoff %>% 
  count(result == "SB")
```
There are 488 results that are not star baker.

### Count episodes

Here's how we can use count() back-to-back to roll up a level of detail across series:

    bakers %>% 
      count(aired_us, series) %>% 
      count(aired_us)
      
Let's practice counting the number of episodes per series.
```{r}
bakeoff %>% 
  count(series, episode) %>% 
  count(series)
```
You can see that by series 3 we were treated to 10 episodes with each new series!

### Count bakers


We'll practice that again, focusing on counting bakers.
```{r}
# Count the number of rows by series and baker
bakers_by_series <- bakeoff %>%
                      count(series, baker)


# Print to view
bakers_by_series

# Count the number of rows by series and baker
bakers_by_series <- bakeoff %>% 
  count(series, baker)
  
# Print to view
bakers_by_series
  
# Count again by series
bakers_by_series %>%
  count(series)

# Count again by baker
bakers_by_series %>%
  count(baker, sort = TRUE)
```
Series 4 had the most bakers with 13 total (a baker's dozen!). Kate is the most popular name- Kates have appeared in 3 different series of the TV show.

### Plot counts

You can learn a lot about your data by counting, but sometimes you can learn even more by plotting counts. This is especially true when you have lots of things to count! With eight series, 74 episodes, and 95 bakers, a plot can be more helpful than a table of numbers. We'll use ggplot2 to visualize the number of bakers across episodes for each series from bakeoff.
```{r}
library(ggplot2)

ggplot(bakeoff, aes(episode)) + 
    geom_bar() + 
    facet_wrap(~series)
```
You made it through the first chapter! We have a lot of work left to do with The Great British Bake Off data, but notice how this plot is a great sanity check: the number of bakers tends to go down with every episode (never up!), with a few exceptions where no one was eliminated. Also, you can see that there are fewer episodes in the first two series. Finally, you can see that there are always three bakers who make it to the series finale.

# Tame your data

## Cast column types

### Cast a column to a date

A good workflow for parsing dates using readr is to, for example:

- Use parse_date("2012-14-08", format = "%Y-%d-%m") first, then
- Use col_date(format = "%Y-%d-%m") within cols() as the col_types argument of read_csv().

In "desserts.csv", the variable uk_airdate is formatted like "17 August 2010". Let's parse, then cast this variable!

Later on in Chapter 4, we'll showcase some functions from the lubridate package to help you extract data from dates, once they are cast properly.
```{r}
# Find format to parse uk_airdate 
parse_date("17 August 2010", format = "%d %B %Y")

# Edit to cast uk_airdate
desserts <- read_csv("desserts.csv", 
                     col_types = cols(
                       uk_airdate = col_date(format = "%d %B %Y")
                     )
                    )

# Arrange by descending uk_airdate
desserts %>% 
	arrange(desc(uk_airdate))
```
Notice that us_airdate didn't need to be cast - this is because the date format %Y-%m-%d is unambiguous, so it is automatically parsed as a date by readr.

### Cast a column to a number

We saw a good workflow for parsing columns using readr:

- Use parse_number() to practice, then
- Use col_number() to cast.

But sometimes you'll need to start with casting, then diagnose parsing problems using a new readr function called problems(). Using problems() on a result of read_csv() will show you the rows and columns where parsing error occurred, what the parser expected to find (for example, a number), and the actual value that caused the parsing error.

Let's practice this with "desserts.csv", which includes data about the judges' ranks for each baker in the technical challenge, for every episode.

```{r}
# Try to cast technical as a number
desserts <- read_csv("desserts.csv", 
                     col_types = cols(
           			   uk_airdate = col_date(format = "%d %B %Y"),
                       technical = col_number()
                     )
                    )

# View parsing problems
desserts %>%
  problems()
```
```{r}
# Edit code to fix the parsing error 
desserts <- read_csv("desserts.csv",
                      col_types = cols(
                        uk_airdate = col_date(format = "%d %B %Y"),
                        technical = col_number()
                      ),
                        na = c("", "NA", "N/A") #solving the problem
                     )

# View parsing problems
problems(desserts)
```
### Cast a column as a factor

Factors are categorical variables, where the possible values are a fixed and known set. For example, take a simple factor like bake below:

```{r}
bake <- c("pie", "cake", "neither") 
parse_factor(bake, levels = NULL) 
```
You can use parse_factor() to parse variables and col_factor() to cast columns as categorical. Both functions have a levels argument that is used to specify the possible values for the factors. When levels is set to NULL, the possible values will be inferred from the unique values in the dataset. Alternatively, you can pass a list of possible values.

This time, you will look at the result column, which contains the outcome of the competition for each baker.
```{r}
# Cast result a factor
desserts <- read_csv("desserts.csv", 
                     na = c("", "NA", "N/A"),
                     col_types = cols(
                       uk_airdate = col_date(format = "%d %B %Y"),
                       technical = col_number(),                       
                       result = col_factor(levels = NULL)
                     )
                    )
                    
# Glimpse to view
glimpse(desserts)
```
While it takes a lot of work up front, casting column types when you import can make your analyses easier to reproduce.

## Recode values


### Recode a character variable

In this exercise, you'll recode() the nut variable in the desserts data. This is a character variable that tells us, for each bake, whether a nut was a key ingredient and if so, what kind of nut!

Remember that the arguments of recode() are the variable that you want to recode, and then an expression of the form old_value = new_value. recode() is most useful when used inside a mutate() to create a new variable or reassign the old one.

Often you want to recode a certain value as missing. In this case, it is important to preserve the data type of the variable. R has built-in words to indicate missing data of various types, for example NA_character_ for missing value for a character variable, or NA_integer_ for missing integer data. The NA value is of the logical type.

Remember that you can use ?function_name_here to read more about arguments and how to use them.

```{r}
# Count rows grouping by nut variable
desserts %>% 
    count(signature_nut, sort = TRUE)
    
# Edit code to recode "no nut" as missing
desserts_2 <- desserts %>% 
  mutate(nut = recode(signature_nut, "filbert" = "hazelnut", 
                           "no nut" = NA_character_))

# Count rows again 
desserts_2 %>% 
    count(nut, sort = TRUE)
```
### Recode a numeric variable

Dummy variables are often used in data analysis to bin a variable into one of two categories to indicate the absence or presence of something. Dummy variables take the value 0 or 1 to stand for, for example, loser or winner. Dummy variables are often factor variables as opposed to numeric - we'll cover more about factors in the last chapter.

In desserts, the technical variable is a number representing the judges' ranks for each baker in the technical challenge, for every episode (remember to run glimpse(desserts) to take a peek). In this exercise, you will recode the numeric technical variable to a new factor dummy variable.

```{r}
## Edit to recode tech_win as factor
desserts <- desserts %>% 
  mutate(tech_win = recode_factor(technical, `1` = 1,
                           .default = 0))

# Count to compare values                      
desserts %>% 
  count(technical == 1, tech_win)
```
## Select variables

### Combine functions with select

The ratings data tells us about how many UK viewers tuned in to watch "The Great British Bake-Off" for each episode of each series. If you explore ratings in console, you'll see that there is a separate variable for the number of viewers for each episode. A key question we'll return to in Chapter 3 is: how many more viewers watched the finale than the premiere episode?

For series with 10 episodes, which showed the most growth in viewers from the premiere to the finale? Which showed the least?
```{r}
ratings <- read_csv("Ratings/02.03_messy_ratings.csv")
```

```{r}
head(ratings)
```
```{r}
ratings <- ratings %>% 
  filter(episodes == 10) %>%
  mutate(viewer_growth = e10_viewers_7day - e1_viewers_7day)
```
```{r}
ratings %>% 
  select(series, viewer_growth)
```

Most: series 5, Least = series 8

### Recode factor to plot

In the last exercise, you may have used mutate() to find the difference between the number of premiere and finale episode viewers for each series. This variable, which I created and called viewer_growth, is now available in ratings, and feel free to check out it.

Now let's practice using select() to check the result of another mutate with recode_factor().

If you have examined ratings, you might have seen that series have aired on three different channels: "Channel 4", "BBC One", "BBC Two".
```{r}
# Recode channel as factor: bbc (1) or not (0)
ratings <- ratings %>% 
  mutate(bbc = recode_factor(channel, 
                             "Channel 4" = 0,
                             .default = 1))
                            
# Select to look at variables to plot next
ratings %>% 
  select(series, channel, bbc, viewer_growth)
  
# Make a filled bar chart
ggplot(ratings, aes(x = series, y = viewer_growth, fill = bbc)) +
  geom_col()
```
Remember to look up the other select helper functions using ?select. What is going on with viewer_growth in series 8?

### Select and reorder variables

As we have seen, selecting a subset of columns to print can help you check that a mutate() worked as expected, and rearranging columns next to each other can help you spot obvious errors in data entry.

The select() helpers are functions that you can use inside select() to allow you to select variables based on their names.

In this exercise, you'll work with the ratings data, and you'll use a new helper function called everything() which can be useful when reordering columns. Use ?everything to read more. You will learn more about using helper functions in the next lesson.

```{r}
# Move channel to first column
ratings %>% 
  select(channel, everything())

# Drop 7- and 28-day episode ratings
ratings %>% 
  select(-ends_with("day"))

# Move channel to front and drop 7-/28-day episode ratings
ratings %>% 
  select(channel, everything(), -ends_with("day"))
```
everything() is a great little helper! If it had been placed at the end, it would have added back in all the columns that end with "day". Placing it before deselecting columns, though, is a real time-saver.

## Tame variable names

### Reformat variables

We saw how to use clean_names() from the janitor package to convert all variable names to snake_case. Here is how:

    library(janitor)
    young_bakers3 <- young_bakers3 %>% 
       clean_names()


The function clean_names() takes an argument case that can be used to convert variable names to other cases, like "upper_camel" or "all_caps".

Let's practice using the messy_ratings tibble. Remember you can use ?clean_names to see the function arguments and how to use them.
```{r}
messy_ratings <- read_csv("Ratings/02.03_messy_ratings.csv")
```
colnames(messy_ratings)

```{r}
names(messy_ratings) <- c("SERIES", "episodes", "premiere", "finale", "winner", "Avg.uk~Viewers", 
"day", "timeslot", "channel", "runner.up~1", "runner.up~2", "season", 
"season.premiere", "season.finale", "Gbbo E1.Viewers~7Day", "Gbbo E1.Viewers~28Day", 
"Gbbo E2.Viewers~7Day", "Gbbo E2.Viewers~28Day", "Gbbo E3.Viewers~7Day", 
"Gbbo E3.Viewers~28Day", "Gbbo E4.Viewers~7Day", "Gbbo E4.Viewers~28Day", 
"Gbbo E5.Viewers~7Day", "Gbbo E5.Viewers~28Day", "Gbbo E6.Viewers~7Day", 
"Gbbo E6.Viewers~28Day", "Gbbo E7.Viewers~7Day", "Gbbo E7.Viewers~28Day", 
"Gbbo E8.Viewers~7Day", "Gbbo E8.Viewers~28Day", "Gbbo E9.Viewers~7Day", 
"Gbbo E9.Viewers~28Day", "Gbbo E10.Viewers~7Day", "Gbbo E10.Viewers~28Day")
```
```{r}
# Glimpse to see variable names
glimpse(messy_ratings)

# Load janitor
library(janitor)

# Reformat to lower camelcase
ratings2 <- messy_ratings %>%
  clean_names("lower_camel")
    
# Glimpse new tibble
glimpse(ratings2)
```
```{r}
# Glimpse to see variable names
glimpse(messy_ratings)

# Load janitor
library(janitor)

# Reformat to snake case
ratings2 <- messy_ratings %>%  
  clean_names()

# Glimpse cleaned names
glimpse(ratings2)
```
You have a ton of case options with the clean_names function- choose wisely!

### Rename and subset variables

In the next chapter, when we reshape data to be tidy, you'll see the importance of having tame variable names.

Now that our variable names in ratings have been converted to snake_case, you'll practice combining renaming with select() helper functions to take advantage of the variable name structure.

To rename a group of variables, you only need to specify a prefix of the new name.

For example, my_tbl %>% select( new_name_ = starts_with("oldname") ) will find all variables in my_tbl whose names start with oldname, enumerate them, then rename each variable as new_name_<N>, where N is a number. If my_tbl has variables oldname, oldname_v1, oldname3, running the statement above will replace these names with new_name_1, new_name_2, new_name_3. Notice how we only specified new_name_ in select(), and the numbers were appended automatically!

Take a second to glimpse() at ratings in your console, and take note of the variables that store data about the number of viewers. Can you see the pattern in their names? In this exercise, you will first select those variables using a helper function, and then rename them!
```{r}
# Select 7-day viewer data by series
viewers_7day <- ratings %>%
	select(series, ends_with("7day"))
	

# Glimpse
glimpse(viewers_7day)
```
```{r}
# Adapt code to also rename 7-day viewer data
viewers_7day <- ratings %>% 
    select(series, viewers_7day_ = ends_with("7day"))

# Glimpse
glimpse(viewers_7day)
```
Using select helper functions is a powerful way to tame your variable names. Tame names make it easier to tidy your data.

### Rename and reorder variables

In the previous exercise, you renamed all the variables in ratings that end with "7day" and kept only a subset of variables. In this exercise, you'll adapt that code to drop some specific columns, while keeping and reordering other columns.
```{r}
# Adapt code to drop 28-day columns; keep 7-day in front
viewers_7day <- ratings %>% 
    select(viewers_7day_ = ends_with("7day"),
        everything(),
        -ends_with("28day"))
        
# Glimpse
glimpse(viewers_7day)
```
```{r}
# Adapt code to keep original order
viewers_7day <- ratings %>% 
    select(everything(),
           viewers_7day_ = ends_with("7day"), 
           -ends_with("28day"))

# Glimpse
glimpse(viewers_7day)
```
# Tidy your data

## Introduction to Tidy Data

### Plot untidy data

We'll start by visualizing the number of show viewers (in millions, cumulative across seven days) across series by episode using the ratings data
```{r}
ratings <- read_csv("Ratings/messy_ratings.csv",
                     col_types = cols(
                    series = col_factor(levels = NULL)
                    ))
```
```{r}
# Plot of episode 1 viewers by series
ggplot(ratings, aes(x = series, y = e1))+
 geom_col()
```
```{r}
# Adapt code to plot episode 2 viewers by series
ggplot(ratings, aes(x = series, y = e2)) +
    geom_col()
```
Series 8 was the first with different hosts on a different channel. Do you notice a pattern? To draw any conclusions, we'd want to plot the viewers for every episode though- aren't you glad you weren't asked to make all 10 plots? Wouldn't it be nice if episode was a variable we could use?

## Gather

### Gather & plot

we'll tidy the data, then use row_number() from the dplyr package to create a continuous episode count variable, such that episode 1 from series 1 will have a value of 1, episode 1 from series 2 will have a value of 7, and so on. This new variable will be mapped onto the x-axis using ggplot2.

```{r}
library(tidyr)
tidy_ratings <- ratings %>%
    # Gather and convert episode to factor
	gather(key = "episode", value = "viewers_7day", -series, 
           factor_key = TRUE, na.rm = TRUE) %>%
	# Sort in ascending order by series and episode
    arrange(series, episode) %>% 
	# Create new variable using row_number()
    mutate(episode_count = row_number())

# Plot viewers by episode and series
ggplot(tidy_ratings, aes(x = episode_count, 
                y = viewers_7day, 
                fill = series)) +
    geom_col()
```
Beautiful plot! In series 8, one of the judges accidentally tweeted out the winner's name 12 hours before the finale aired. Do you think that hurt their finale viewer numbers?

### Gather & plot non-sequential columns

In this exercise, you'll combine tidyr with dplyr::select() to keep and gather non-sequential columns in the ratings2 data. You can review the select helper functions here. The ratings2 data includes per-episode data:

- e*_7day: viewers within a 7-day window, and
- e*_28day: viewers within a 28-day window.

Here, we'll focus on the 7-day viewers.

```{r}
ratings2 <- read_csv("Ratings/messy_ratings2.csv",
                     col_types = cols(
                    series = col_factor(levels = NULL)
                    ))
```
```{r}
week_ratings <- ratings2  %>% 
	# Select 7-day viewer ratings
    select(series, ends_with("7day")) %>% 
	# Gather 7-day viewers by episode
    gather(episode, viewers_7day, ends_with("7day"), na.rm = TRUE, factor_key = TRUE)
```
```{r}
# Plot 7-day viewers by episode and series
ggplot(week_ratings, aes(x = episode, 
                y = viewers_7day, 
                group = series)) +
    geom_line() +
    facet_wrap(~series) 
```
You're done gathering! This plot tells you a lot, but the x-axis labels are a mess! We need to do more tidying to make this plot better.

## Separate

### Separate a column

To create the week_ratings data, we first selected all the columns in ratings2 that contain data about the number of 7-day viewers in millions for each series, then we gathered those 10 columns into two columns.

In this exercise, you'll use the tidyr, dplyr, and readr packages to tidy and plot the week_ratings data so we can read the x-axis, which labels each episode!

```{r}
week_ratings <- ratings2 %>% 
    select(series, ends_with("7day")) %>% 
    gather(episode, viewers_7day, ends_with("7day"), 
           na.rm = TRUE) %>% 
    separate(episode, into = "episode", extra = "drop") %>% 
    # Edit to parse episode number
    mutate(episode = parse_number(episode))

# Print to view
week_ratings
```

```{r}
# Create week_ratings
week_ratings <- ratings2 %>% 
    select(series, ends_with("7day")) %>% 
    gather(episode, viewers_7day, ends_with("7day"), 
           na.rm = TRUE) %>% 
    separate(episode, into = "episode", extra = "drop") %>% 
    mutate(episode = parse_number(episode))
    
# Edit your code to color by series and add a theme
ggplot(week_ratings, aes(x = episode, y = viewers_7day, 
                         group = series, color = series)) +
    geom_line() +
    facet_wrap(~series) +
    guides(color = FALSE) +
    theme_minimal()
```
### Unite columns

In the tidyr package, the opposite of separate() is unite(). Sometimes you need to paste values from two or more columns together to tidy. Here is an example usage for unite():

    data %>%
     unite(new_var, old_var1, old_var2) 

The ratings2 dataset includes these variables:

- series: 1-8
- episode: 1-10
- viewers_millions: whole number of 7-day viewers in millions
- viewers_decimal: additional number of 7-days viewers in millions

If viewers_millions = 2 and viewers_decimal = .6, then the total number of viewers in millions is 2.6. In this exercise, you'll practice uniting these two columns into a single new column.
```{r}
ratings_2 <- readRDS("ratings_2.rds")
```
```{r}
ratings3 <- ratings_2 %>% 
	# Unite viewers in millions and decimals together    
	unite(viewers_7day, viewers_millions, viewers_decimal)

# Print to view
ratings3
```
```{r}
ratings3 <- ratings_2  %>% 
	# Adapt to change the separator
	unite(viewers_7day, viewers_millions, viewers_decimal, sep = "")

# Print to view
ratings3
```
```{r}
ratings3 <- ratings_2  %>% 
	# Unite and change the separator
	unite(viewers_7day, viewers_millions, viewers_decimal, sep = "") %>%
	# Adapt to cast viewers as a number
	mutate(viewers_7day = parse_number(viewers_7day))

# Print to view
ratings3
```

## Spread

### Spread rows into columns

In this exercise, you'll use the tidyr, dplyr, and readr packages to tidy the ratings2 data after using dplyr::count(). The tidy_ratings_all data should have four variables to work with:

- series (integer: 1-8),
- episode (integer: 1-10),
- days (integer: 7 or 28), and
- viewers (numeric).

```{r}
# Create tidy data with 7- and 28-day viewers
tidy_ratings_all <- ratings2 %>%
	gather(key = episode, value = viewers, ends_with("day"), na.rm = TRUE) %>% 
    separate(episode, into = c("episode", "days")) %>%  
    mutate(episode = parse_number(episode),
           days = parse_number(days))
```
Using your new tidy data, count the number of viewers grouping by series and days. Remember: you don't need to use group_by before count, but you do need to use the wt = argument to sum the values in the viewers column.
```{r}
tidy_ratings_all %>%
	# Count viewers by series and days
	count(series, days, wt = viewers)
```
```{r}
tidy_7day <- tidy_ratings_all %>% 
	# Count viewers by series and days
    count(series, days, wt = viewers) %>%
	# Adapt to spread counted values
	  spread(key = days, value = n, sep = "_")
tidy_7day
```
The sep = "_" argument is especially helpful when you are spreading by a numeric variable so that you don't end up with variable names that start with numbers.




## Tidy multiple sets of columns

### Masterclass: Tidy

In the last three exercises, you'll tidy the ratings data to create a scatterplot to see the relationship between the number of premiere and finale UK viewers (7-day only) by series.

```{r}
# Gather viewer columns and remove NA rows
tidy_ratings <- ratings %>%
    gather(episode, viewers, -series, na.rm = TRUE) %>%
    mutate(episode = parse_number(episode)) %>% 
    group_by(series) %>% 
    filter(episode == 1 | episode == max(episode)) %>% 
    ungroup()
tidy_ratings
```
Now we have a tidy version of the ratings data with only premiere and finale episodes. Let's see if that tweet spoiled the Series 8 finale.

In the previous exercise, you gathered, then did a grouped filter to make tidy_ratings. In this exercise, you'll make two charts in ggplot2 with this tidy data to see if the judge's tweet spoiled the Series 8 finale ratings.

The two plots you'll make, a slope chart and a dumbbell chart, are nice ways to show change between two data points combining geom_point() and geom_line().

You'll use these to visualize the "finale bump" in viewers between the premiere and finale episodes across series.
```{r}
first_last <- tidy_ratings %>% 
  mutate(episode = recode(episode, `1` = "first", .default = "last")) 
first_last
```

```{r}
# Fill in to make slope chart
ggplot(first_last, aes(x = episode, y = viewers, color = series)) +
  geom_point() +
  geom_line(aes(group = series))
```

```{r}
# Switch the variables mapping x-axis and color
ggplot(first_last, aes(x = series, y = viewers, color = episode)) +
  geom_point() + # keep
  geom_line(aes(group = series)) + # keep
  coord_flip() # keep
```
Great work- notice how each of these two plots relied on data that is tidy! Series 8 had a pretty small 'finale bump', but we are still looking at raw changes in viewers. Let's visualize percentage changes in viewers instead.

In the previous exercise, you visualized the size of the "finale bump" in viewers for each series. From those charts, we know that finale episodes always garner more viewers than premieres. We also know that the change in viewers for series 8 seemed smaller compared to some previous series.

But the question of whether the series 8 finale viewers were "spoiled" might be best answered by comparing the relative percentage changes in premiere and finale viewers across series. In this exercise, you'll reshape first_last once more to calculate and visualize the relative finale bump.
```{r}
# Calculate relative increase in viewers
bump_by_series <- first_last %>% 
  spread(episode, viewers) %>%   
  mutate(bump = (last - first) / first)
bump_by_series
```
```{r}
# Calculate relative increase in viewers
bump_by_series <- first_last %>% 
  spread(episode, viewers) %>%   
  mutate(bump = (last - first) / first)
  
# Fill in to make bar chart of bumps by series
ggplot(bump_by_series, aes(x = series, y = bump)) +
  geom_col() +
  scale_y_continuous(labels = scales::percent) # converts to %
```
Series 8 definitely had the lowest 'finale bump' so far- only 6% more viewers watched the finale than the premiere. Notice that we had to spread here, because the tidy version of the data depends on the question you want to ask.

# Transform your data

## Complex recoding with case_when

### Combine two variables

In this exercise, you'll use case_when() with the bakers data to create a new variable based on the number of times each baker was crowned as star baker or the technical challenge winner:

- If star baker is greater than technical wins, recode as super_star
- If star baker is less than technical wins, recode as high_tech
- Recode the rest as well_rounded
```{r}
bakers <- read_csv("Bakers/baker_results.csv",
                   col_types = cols(series = col_factor(levels = NULL),
                                    series_winner = col_factor(levels = NULL))
                   )
```
```{r}
head(bakers)
```
```{r}
# Create skill variable with 3 levels
bakers_skill <- bakers %>% 
  mutate(skill = case_when(
    star_baker > technical_winner ~ "super_star",
    star_baker < technical_winner ~ "high_tech",
    TRUE ~ "well_rounded"
  ))
  
# Filter zeroes to examine skill variable
bakers_skill %>% 
  filter(star_baker == 0 & technical_winner == 0) %>% 
  count(skill)
```
Whoops- we may not want bakers who never won star baker or the technical challenge to be called 'well-rounded'. Let's add another level to take care of this!

### Add another bin

In this exercise, you'll edit your previous code so that skill has four levels:

- If both star baker and technical wins are zero, recode as NA_character_
- If star baker is greater than technical wins, recode as super_star
- If star baker is less than technical wins, recode as high_tech
- If star baker is equal to technical wins, recode as well_rounded

```{r}
# Add pipe to drop skill = NA
bakers_skill <- bakers %>% 
  mutate(skill = case_when(
    star_baker > technical_winner ~ "super_star",
    star_baker < technical_winner ~ "high_tech",
    star_baker == 0 & technical_winner == 0 ~ NA_character_,
    star_baker == technical_winner  ~ "well_rounded"
  )) %>% 
  drop_na(skill)
  
# Count bakers by skill
bakers_skill %>%
  count(skill)
```
Very nice- now we only have 15 'well-rounded' bakers, and we dropped the 41 bakers who never won star baker or a technical challenge.

## Factors

### Cast a factor and examine levels

In this exercise, you'll start where you left off in the last series of exercises, working with the bakers data to examine your new categorical variable called skill, which has three levels: well_rounded, super_star, and high_tech.

```{r}
# Cast skill as a factor
bakers <- bakers_skill %>% 
  mutate(skill = as.factor(skill))

# Examine levels
levels(bakers$skill)
```
Sometimes factors have a natural order to them- like beginner, intermediate, and advanced- but sometimes they don't. Know your data, and know your factors!

### Plot factor counts

In this exercise, you'll use ggplot2 to make a bar chart with one per level of the skill factor we just made, where the height of the bar shows the number of bakers in each skill level.

Notice a big difference between the data here and the data we have used before to make a bar chart- here, you want the height of the bars to be represent the number of rows in the bakers data in each skill level. This is compared to bar charts we made where the height represented values in the data (when you used geom_col()).

```{r}
library(forcats)
```

```{r}
# Edit to reverse x-axis order
ggplot(bakers, aes(x = fct_rev(skill), fill = series_winner)) +
  geom_bar()
```
Note that when you use function like fct_rev(), forcats converts the character to a factor for your plot

## Dates

### Cast characters as dates

We can use lubridate to parse and cast a date variable within a mutate(). In this exercise, you'll practice doing this with the baker_dates data, and then go even further to extract data from the dates like the labelled month. Remember to use ?month to read about the month() function and its arguments.

```{r}
baker_dates <- readRDS("baker_dates.rds")
```
```{r}
head(baker_dates)
```
```{r}
library(lubridate)
# Add a line to extract labeled month
baker_dates_cast <- baker_dates %>% 
  mutate(last_date_appeared_us = dmy(last_date_appeared_us),
         last_month_us = month(last_date_appeared_us, label = TRUE))
         
# Make bar chart by last month
ggplot(baker_dates_cast,aes(x = last_month_us)) +
  geom_bar()
```
There are more functions like year for getting out other date components- you can even extract fiscal quarters or semesters! Look at the documentation to see more ways.

### Calculate timespans

The first step to calculating a timespan in lubridate is to make an interval, then use division to convert the units to what you want (like weeks(x) or months(x)). The x refers to the number of time units to be included in the period. In this exercise, you'll calculate timespans with the baker_time data we just made.
```{r}
# Add a line to create whole months on air variable
baker_time <- baker_dates  %>% 
  mutate(time_on_air = interval(first_date_appeared_uk, last_date_appeared_uk),
         weeks_on_air = time_on_air / weeks(1),
         months_on_air = time_on_air %/% months(1))
baker_time
```
Here again, lubridate gives you many options for working with periods from years down to picoseconds. You may also end up needing to work with durations too, which have similar functions that always start with a d from dyears to dpicoseconds.

## Strings

### Wrangle a character variable

In this exercise, we'll wrangle the values in the position_reached variable, which looks like this right now:

```{r}
messy_baker_results <- read_csv("Bakers/messy_baker_results.csv")
messy_baker_results %>% 
  count(position_reached)
```
Let's clean these strings up!

```{r}
library(stringr)
```
```{r}
# Add another mutate to replace "THIRD PLACE" with "RUNNER UP"and count
bakers_2 <- messy_baker_results %>% 
  mutate(position_reached = str_to_upper(position_reached),
         position_reached = str_replace(position_reached, "-", " "),
         position_reached = str_replace(position_reached, "THIRD PLACE", "RUNNER UP"))

# Count rows
bakers_2 %>%
  count(position_reached)
```
Great work! Each of the 8 series had one winner and two runner-ups, so now this variable is much easier to work with!

### Detect a string pattern

we'll wrangle each bakers' occupation from the bakers data, which is currently a character string like this:
```{r}
bakers_2 %>% 
  select(baker, occupation)
```
We'll create a logical variable (TRUE or FALSE) indicating whether or not each baker is a student.
```{r}
# Add a line to create new variable called student
bakers_3 <- bakers_2 %>% 
    mutate(occupation = str_to_lower(occupation), 
           student = str_detect(occupation, "student"))

# Find all students and examine occupations
bakers_3 %>%
  filter(student == TRUE) %>%
  select(baker, occupation, student)
```
Rav's occupation was student support, but we classified him as a student. This shows a common problem in working with data- it helps to always check your work!

