Tidy Data

Prerequisite check
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.2.1     ✓ purrr   0.3.3
## ✓ tibble  2.1.3     ✓ dplyr   0.8.3
## ✓ tidyr   1.0.0     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.4.0
## ── Conflicts ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
 The output above illustrates that I have the necessary package and tools within the R Studio to continue.
Table examples
table1
## # A tibble: 6 x 4
##   country      year  cases population
##   <chr>       <int>  <int>      <int>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3 Brazil       1999  37737  172006362
## 4 Brazil       2000  80488  174504898
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583
table2
## # A tibble: 12 x 4
##    country      year type            count
##    <chr>       <int> <chr>           <int>
##  1 Afghanistan  1999 cases             745
##  2 Afghanistan  1999 population   19987071
##  3 Afghanistan  2000 cases            2666
##  4 Afghanistan  2000 population   20595360
##  5 Brazil       1999 cases           37737
##  6 Brazil       1999 population  172006362
##  7 Brazil       2000 cases           80488
##  8 Brazil       2000 population  174504898
##  9 China        1999 cases          212258
## 10 China        1999 population 1272915272
## 11 China        2000 cases          213766
## 12 China        2000 population 1280428583
table3
## # A tibble: 6 x 3
##   country      year rate             
## * <chr>       <int> <chr>            
## 1 Afghanistan  1999 745/19987071     
## 2 Afghanistan  2000 2666/20595360    
## 3 Brazil       1999 37737/172006362  
## 4 Brazil       2000 80488/174504898  
## 5 China        1999 212258/1272915272
## 6 China        2000 213766/1280428583
Tidy data
 In order to have a tidy data set there are certain rules to follow:
 1. Each variable must have its own column.
 2. Each observation must have its own row.
 3. Each value must have its own cell.

The following figure provides more clarity on these rules 3 Rules Accordingly, we now know by the presentation of these rules that table1 it tidy.

Now that we know what tidy data looks like, we can use different tools in the tidyverse that help organize tidy data. dplr and ggplot2 are two packages that are designed to do such tasks. For example, we can use table1 to work with the tools in the tidyverse:

#Compute rate per 10,000
table1 %>%
  mutate(rate=cases/population*10000)
## # A tibble: 6 x 5
##   country      year  cases population  rate
##   <chr>       <int>  <int>      <int> <dbl>
## 1 Afghanistan  1999    745   19987071 0.373
## 2 Afghanistan  2000   2666   20595360 1.29 
## 3 Brazil       1999  37737  172006362 2.19 
## 4 Brazil       2000  80488  174504898 4.61 
## 5 China        1999 212258 1272915272 1.67 
## 6 China        2000 213766 1280428583 1.67

This above example shows that you can add another variable into the data set. Mutate means that you can manipulate a set a variables, add or subtract, and the %>% sign means that it is a pipeline operation. Essentially, whatever comes after the %>% sign relates to the values before the %>%. For example, you can have sqrt(2) and the answer will be the same if you type 2 %>% sqrt.

#Compute cases per year
table1 %>% 
  count(year, wt=cases)
## # A tibble: 2 x 2
##    year      n
##   <int>  <int>
## 1  1999 250740
## 2  2000 296920

In this case, we are adding up all the values under the cases variable and listing them under the total years, which is only 1999 and 2000. and the count or hashtag will list 1 and 2. 1 for 1999 and 2 for 2000.

#Visualize changes over time
library(ggplot2)
ggplot(table1, aes(year, cases)) +
  geom_line(aes(group = country), color = "grey50") + 
  geom_point(aes(color = country))

EXERCISES 12.2.1
12.2.1 Using prose, describe how the variables and observations are organised in each of the sample tables.

a.) In the first table, the tidy set, each variable, or column, supports numerical values in each row except for the nominal country descriptor variable. Also, the table is organized by 6 rows and 4 columns. This table is also the cleanest in retrospect to “tidiness”.

table1
## # A tibble: 6 x 4
##   country      year  cases population
##   <chr>       <int>  <int>      <int>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3 Brazil       1999  37737  172006362
## 4 Brazil       2000  80488  174504898
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583

b.) Table 2 represents a more categorical set compared to table 1. For instance, two columns represent nominal values, country and type, and the other two columns, year being ordinal, and count. The different categories within the type variable are matched with specific values directly adjacent to the right in the count column. The table is a 12 by 4 set meaning 12 rows and 4 columns.

table2
## # A tibble: 12 x 4
##    country      year type            count
##    <chr>       <int> <chr>           <int>
##  1 Afghanistan  1999 cases             745
##  2 Afghanistan  1999 population   19987071
##  3 Afghanistan  2000 cases            2666
##  4 Afghanistan  2000 population   20595360
##  5 Brazil       1999 cases           37737
##  6 Brazil       1999 population  172006362
##  7 Brazil       2000 cases           80488
##  8 Brazil       2000 population  174504898
##  9 China        1999 cases          212258
## 10 China        1999 population 1272915272
## 11 China        2000 cases          213766
## 12 China        2000 population 1280428583

c.) Table 3 includes three variable columns that represent country, year, and rate. The interesting thing about this table is that the variable “rate” was included as an equation dividing case by population. Numbers in each rate row are shown to be a division argument by the forward slash. In other words, the rate column organizes both cases and population to fit a division equation within one variable.

table3
## # A tibble: 6 x 3
##   country      year rate             
## * <chr>       <int> <chr>            
## 1 Afghanistan  1999 745/19987071     
## 2 Afghanistan  2000 2666/20595360    
## 3 Brazil       1999 37737/172006362  
## 4 Brazil       2000 80488/174504898  
## 5 China        1999 212258/1272915272
## 6 China        2000 213766/1280428583
12.2.2 Compute the rate for table 2, and table4a + table 4b. You will need to perform four operations:

i.) Extract the number of TB cases per country per year. ii.) Extract the matching population per country per year. iii.) Divide cases by population, and multiply by 10000. iv.) Store back in the appropriate place Which representation is easiest to work with? Which is hardest? Why?

In order to divide cases by population, we need to separate the cases and population so that we can work with clarity. Also, when we separate the cases and population into two sections, then it will be easy to move forward with the example problem.

t2_cases <- filter(table2, type == "cases") %>% 
  rename(cases = count) %>% 
  arrange(country, year)
t2_population <- filter(table2, type == "population") %>% 
  rename(population = count) %>%
  arrange(country, year)
 t2_cases
t2_cases
## # A tibble: 6 x 4
##   country      year type   cases
##   <chr>       <int> <chr>  <int>
## 1 Afghanistan  1999 cases    745
## 2 Afghanistan  2000 cases   2666
## 3 Brazil       1999 cases  37737
## 4 Brazil       2000 cases  80488
## 5 China        1999 cases 212258
## 6 China        2000 cases 213766
 t2_population
t2_population
## # A tibble: 6 x 4
##   country      year type       population
##   <chr>       <int> <chr>           <int>
## 1 Afghanistan  1999 population   19987071
## 2 Afghanistan  2000 population   20595360
## 3 Brazil       1999 population  172006362
## 4 Brazil       2000 population  174504898
## 5 China        1999 population 1272915272
## 6 China        2000 population 1280428583

The next R chunk below will show a recreation of table 2 so that the variables cases and population will be in the same table. We do this by using the tibble tool, which creates a new data frame. This shows that cases_per_pop has included the division equation into a new variable.

t2_cases_per_pop <- tibble(
  year = t2_cases$year,
  country = t2_cases$country,
  cases = t2_cases$cases,
  population = t2_population$population
) %>%
  mutate(cases_per_pop = (cases/population) * 10000) %>% 
  select(country, year, cases_per_pop)

This next chunk essentially creates the new variable with its associated numbers within each row. Therefore, this step will then be combined with table2 so that we now have the new variable and new units implemented into a new data frame.

t2_cases_per_pop <- t2_cases_per_pop %>%
  mutate(type = "cases_per_pop") %>%
  rename(count = cases_per_pop)

Below shows the new variable and its values ready to be combined with table2.

t2_cases_per_pop
## # A tibble: 6 x 4
##   country      year count type         
##   <chr>       <int> <dbl> <chr>        
## 1 Afghanistan  1999 0.373 cases_per_pop
## 2 Afghanistan  2000 1.29  cases_per_pop
## 3 Brazil       1999 2.19  cases_per_pop
## 4 Brazil       2000 4.61  cases_per_pop
## 5 China        1999 1.67  cases_per_pop
## 6 China        2000 1.67  cases_per_pop

This next section combines table2 and t2_cases_per_pop so that previous work is now integrated into the entirety of a new dataset.

bind_rows(table2, t2_cases_per_pop) %>%
  arrange(country, year, type, count)
## # A tibble: 18 x 4
##    country      year type            count
##    <chr>       <int> <chr>           <dbl>
##  1 Afghanistan  1999 cases         7.45e+2
##  2 Afghanistan  1999 cases_per_pop 3.73e-1
##  3 Afghanistan  1999 population    2.00e+7
##  4 Afghanistan  2000 cases         2.67e+3
##  5 Afghanistan  2000 cases_per_pop 1.29e+0
##  6 Afghanistan  2000 population    2.06e+7
##  7 Brazil       1999 cases         3.77e+4
##  8 Brazil       1999 cases_per_pop 2.19e+0
##  9 Brazil       1999 population    1.72e+8
## 10 Brazil       2000 cases         8.05e+4
## 11 Brazil       2000 cases_per_pop 4.61e+0
## 12 Brazil       2000 population    1.75e+8
## 13 China        1999 cases         2.12e+5
## 14 China        1999 cases_per_pop 1.67e+0
## 15 China        1999 population    1.27e+9
## 16 China        2000 cases         2.14e+5
## 17 China        2000 cases_per_pop 1.67e+0
## 18 China        2000 population    1.28e+9

I am renaming the final data frame so that it is easy to view.

t2_cases_and_pop_rate <- bind_rows(table2, t2_cases_per_pop) %>%
  arrange(country, year, type, count)

Here is the final product.

t2_cases_and_pop_rate
## # A tibble: 18 x 4
##    country      year type            count
##    <chr>       <int> <chr>           <dbl>
##  1 Afghanistan  1999 cases         7.45e+2
##  2 Afghanistan  1999 cases_per_pop 3.73e-1
##  3 Afghanistan  1999 population    2.00e+7
##  4 Afghanistan  2000 cases         2.67e+3
##  5 Afghanistan  2000 cases_per_pop 1.29e+0
##  6 Afghanistan  2000 population    2.06e+7
##  7 Brazil       1999 cases         3.77e+4
##  8 Brazil       1999 cases_per_pop 2.19e+0
##  9 Brazil       1999 population    1.72e+8
## 10 Brazil       2000 cases         8.05e+4
## 11 Brazil       2000 cases_per_pop 4.61e+0
## 12 Brazil       2000 population    1.75e+8
## 13 China        1999 cases         2.12e+5
## 14 China        1999 cases_per_pop 1.67e+0
## 15 China        1999 population    1.27e+9
## 16 China        2000 cases         2.14e+5
## 17 China        2000 cases_per_pop 1.67e+0
## 18 China        2000 population    1.28e+9

The next R chunk will be the integration of both table4a and table4b shown to represent cases per population like the section before.

#We can label the integration of both tables as table4ab
table4ab <- 
  tibble(
    country = table4a$country,
    '1999' = table4a[["1999"]] / table4b[["1999"]] * 10000,
    '2000' = table4a[["2000"]] / table4b[["2000"]] * 10000
  )
table4ab
## # A tibble: 3 x 3
##   country     `1999` `2000`
##   <chr>        <dbl>  <dbl>
## 1 Afghanistan  0.373   1.29
## 2 Brazil       2.19    4.61
## 3 China        1.67    1.67

In sum, I would say that the first set of examples devoid of table4a and table4b are particularly tedious to work with. My table4ab example was not as tedious but not necessarily less difficult. This was the first time I had to manipulate data like this in R so it was a bit of a learning curve.

12.2.3 Recreate the plot showing change in cases over time using table2 instead of table1. What do you need to do first?
table2 %>%
  filter(type == "cases") %>%
  ggplot(aes(year, count)) + 
  geom_line(aes(group = country), colour = "grey50") + 
  geom_point(aes(colour = country)) + 
  scale_x_continuous(breaks = unique(table2$year)) + 
  ylab("cases")

12.3 Pivoting

There are reasons for untidiness. That is, most people are not familiar with the principles of tidy data, and it’s hard to derive them yourself unless you spend a lot of time working with data. Secondly, data is often organised to facilitate some use other than analysis. For example, data is often organized to make entry as easy as possible.

Most of the time, those who work with data will have to do some tidying. The first step is to figure out what the variables and observations are. The second step is to resolve one of two common problems: One variable might be spread across multiple columns, and one observation might be scattered across multiple rows. Typically a dataset will only suffer from one of these problems.

To fix these problems, you’ll need the two most important functions in tidyr: pivot_longer() and pivot_wider().

12.3.1 Longer

A common problem is a dataset where some of the column names are not names of variables, but values of a variable. Like table4a: the column names 1999 and 2000 represent values of the year variable, the values in the 1999 and 2000 columns represent values of the cases variable, and each row represents two observations, not one.

table4a
## # A tibble: 3 x 3
##   country     `1999` `2000`
## * <chr>        <int>  <int>
## 1 Afghanistan    745   2666
## 2 Brazil       37737  80488
## 3 China       212258 213766

To tidy a dataset like this, we need to pivot the offending columns into a new pair of variable. To describe the operation we need three parameters: • The set of columns whose names are values, not variables. In this example, those are the columns 1999 and 2000. • The name of the variable to move the column names to. Here it is year. • The name of the variable to move the column vales to. Here it’s cases.

Together those parameters generate the call to pivot_longer():

table4a %>%
  pivot_longer(c('1999', '2000'), names_to = "year", values_to = "cases")
## # A tibble: 6 x 3
##   country     year   cases
##   <chr>       <chr>  <int>
## 1 Afghanistan 1999     745
## 2 Afghanistan 2000    2666
## 3 Brazil      1999   37737
## 4 Brazil      2000   80488
## 5 China       1999  212258
## 6 China       2000  213766

pivot_longer() makes datasets longer by increasing the number of rows and decreasing the number of columns.

We can use pivot_longer() to tidy table4b in a similar fashion:

table4b %>%
  pivot_longer(c('1999','2000'), names_to = "year", values_to = "population")
## # A tibble: 6 x 3
##   country     year  population
##   <chr>       <chr>      <int>
## 1 Afghanistan 1999    19987071
## 2 Afghanistan 2000    20595360
## 3 Brazil      1999   172006362
## 4 Brazil      2000   174504898
## 5 China       1999  1272915272
## 6 China       2000  1280428583

To combine the tidied versions of table4a and table4b into a single tibble, we need to use dplyr::left_join().

tidy4A <- table4a %>%
  pivot_longer(c('1999', '2000'), names_to = "year", values_to = "cases")
tidy4B <- table4b %>%
  pivot_longer(c('1999','2000'), names_to = "year", values_to = "population")
left_join(tidy4A, tidy4B)
## Joining, by = c("country", "year")
## # A tibble: 6 x 4
##   country     year   cases population
##   <chr>       <chr>  <int>      <int>
## 1 Afghanistan 1999     745   19987071
## 2 Afghanistan 2000    2666   20595360
## 3 Brazil      1999   37737  172006362
## 4 Brazil      2000   80488  174504898
## 5 China       1999  212258 1272915272
## 6 China       2000  213766 1280428583

This is definitely easier to accomplish than the first exercise I did.

12.3.2 Wider

pivot_wider() is the opposite of pivot_longer(). You use it when an observation is scattered across multiple rows. For example, take table2: an observation is a country in a year, but each observation is spread across two rows.

table2
## # A tibble: 12 x 4
##    country      year type            count
##    <chr>       <int> <chr>           <int>
##  1 Afghanistan  1999 cases             745
##  2 Afghanistan  1999 population   19987071
##  3 Afghanistan  2000 cases            2666
##  4 Afghanistan  2000 population   20595360
##  5 Brazil       1999 cases           37737
##  6 Brazil       1999 population  172006362
##  7 Brazil       2000 cases           80488
##  8 Brazil       2000 population  174504898
##  9 China        1999 cases          212258
## 10 China        1999 population 1272915272
## 11 China        2000 cases          213766
## 12 China        2000 population 1280428583

We only need two parameters this time: • The column to take variable names from. Here, it’s type. • The column to take values form. Here it’s count. Then we can use pivot_wider(), as shown below:

table2 %>%
  pivot_wider(names_from = type, values_from = count)
## # A tibble: 6 x 4
##   country      year  cases population
##   <chr>       <int>  <int>      <int>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3 Brazil       1999  37737  172006362
## 4 Brazil       2000  80488  174504898
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583
EXCERSISES 12.3.3
12.3.1 Why are pivot_longer() and pivot_wider() not perfectly symmetrical?

Carefully consider the following example:

stocks <- tibble(
  year   = c(2015, 2015, 2016, 2016),
  half  = c(   1,    2,     1,    2),
  return = c(1.88, 0.59, 0.92, 0.17)
)
stocks
## # A tibble: 4 x 3
##    year  half return
##   <dbl> <dbl>  <dbl>
## 1  2015     1   1.88
## 2  2015     2   0.59
## 3  2016     1   0.92
## 4  2016     2   0.17

Below is pivot_wider()

stocks %>%
  pivot_wider(names_from = year, values_from = return)
## # A tibble: 2 x 3
##    half `2015` `2016`
##   <dbl>  <dbl>  <dbl>
## 1     1   1.88   0.92
## 2     2   0.59   0.17

We then use pivot_wider() followed by pivot_longer()

stocks %>% 
  pivot_wider(names_from = year, values_from = return) %>% 
  pivot_longer(`2015`:`2016`, names_to = "year", values_to = "return")
## # A tibble: 4 x 3
##    half year  return
##   <dbl> <chr>  <dbl>
## 1     1 2015    1.88
## 2     1 2016    0.92
## 3     2 2015    0.59
## 4     2 2016    0.17

The tibble is in tidy form. However, the tibble is not quite the same as the original tibble. The difference is the change in double to character vector form in the year column/variable. In order to maintain the tibble and have the column in double vector form, we need to use the names_ptype = list(year = double()).

stocks %>% 
  pivot_wider(names_from = year, values_from = return) %>% 
  pivot_longer(`2015`:`2016`, names_to = "year", values_to = "return",
               names_ptypes = list(year = double()))
## # A tibble: 4 x 3
##    half  year return
##   <dbl> <dbl>  <dbl>
## 1     1  2015   1.88
## 2     1  2016   0.92
## 3     2  2015   0.59
## 4     2  2016   0.17

The tibble now have the appropriate vectors.

12.3.1 Why does this code fail?

table4a %>% pivot_longer(c(1999, 2000), names_to = “year”, values_to = “cases”) > Error in inds_combine(.vars, ind_list): Position must be between 0 and n

The reason why the code above fails is because we forgot that 1999 and 2000 are characters in the table and not integers. In other words, even though we might be fooled as to think they are integers (which they are in nature) when we want to refer to columns within a table we need to be careful whether the columns we wish to represent are integers or characters.

In order to change this we add ' to both sides of 1999 and 2000.

table4a %>% 
  pivot_longer(c('1999', '2000'), names_to = "year", values_to = "cases")
## # A tibble: 6 x 3
##   country     year   cases
##   <chr>       <chr>  <int>
## 1 Afghanistan 1999     745
## 2 Afghanistan 2000    2666
## 3 Brazil      1999   37737
## 4 Brazil      2000   80488
## 5 China       1999  212258
## 6 China       2000  213766

Now we can see that the code works.

12.3.3 What would happen if you widen this table? Why? How could you add a new column to uniquely identify each value?
people <- tribble(
  ~name,             ~names,  ~values,
  #-----------------|--------|------
  "Phillip Woods",   "age",       45,
  "Phillip Woods",   "height",   186,
  "Phillip Woods",   "age",       50,
  "Jessica Cordero", "age",       37,
  "Jessica Cordero", "height",   156
)
people
## # A tibble: 5 x 3
##   name            names  values
##   <chr>           <chr>   <dbl>
## 1 Phillip Woods   age        45
## 2 Phillip Woods   height    186
## 3 Phillip Woods   age        50
## 4 Jessica Cordero age        37
## 5 Jessica Cordero height    156

The next intuitive move is to use the pivot_wider() tool to widen the table.

people %>%
  pivot_wider(names_from = names, values_from = values)
## Warning: Values in `values` are not uniquely identified; output will contain list-cols.
## * Use `values_fn = list(values = list)` to suppress this warning.
## * Use `values_fn = list(values = length)` to identify where the duplicates arise
## * Use `values_fn = list(values = summary_fun)` to summarise duplicates
## # A tibble: 2 x 3
##   name                    age      height
##   <chr>           <list<dbl>> <list<dbl>>
## 1 Phillip Woods           [2]         [1]
## 2 Jessica Cordero         [1]         [1]

This warning is important as it gives us the opportunity to see how to work with duplicated data. We can do that or re-create the table in a better fashion so no duplicates occur.

12.3.4 Tidy the simple tibble below. Do you need to make it wider or longer? What are the variables?
preg <- tribble(
  ~pregnant, ~male, ~female,
  "yes",     NA,    10,
  "no",      20,    12
)

In this case we would want to make the table longer:

preg %>%
  pivot_longer(c('male', 'female'), names_to = "sex", values_to = "count")
## # A tibble: 4 x 3
##   pregnant sex    count
##   <chr>    <chr>  <dbl>
## 1 yes      male      NA
## 2 yes      female    10
## 3 no       male      20
## 4 no       female    12
12.4 Separating and uniting

So far we’ve learned how to tidy table2 and both table4a and table4b, but not table3. table3 has a different problem: we have one column (rate) that contains two variables (cases and population). To fix this problem, we’ll need the separate() function. You’ll also learn about the complement of separate(): unite(), which you use if a single variable is spread across multiple columns.

12.4.1 Separate

separate() pulls apart one column into multiple columns, by splitting wherever a separator character appears. Take table3:

table3
## # A tibble: 6 x 3
##   country      year rate             
## * <chr>       <int> <chr>            
## 1 Afghanistan  1999 745/19987071     
## 2 Afghanistan  2000 2666/20595360    
## 3 Brazil       1999 37737/172006362  
## 4 Brazil       2000 80488/174504898  
## 5 China        1999 212258/1272915272
## 6 China        2000 213766/1280428583

The rate column contains both cases and population variable, and we need to split it into two variables. separate() the name of the column to separate, and the names of the columns tot separate into, as shown below:

table3 %>%
  separate(rate, into = c("cases", "population"))
## # A tibble: 6 x 4
##   country      year cases  population
##   <chr>       <int> <chr>  <chr>     
## 1 Afghanistan  1999 745    19987071  
## 2 Afghanistan  2000 2666   20595360  
## 3 Brazil       1999 37737  172006362 
## 4 Brazil       2000 80488  174504898 
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583

By default, this will split values wherever it sees a non-alphanumeric character (i.e. a character that isn’t a number or letter). For example, in the code above, separate() split the values of rate at the forward slash characters. If you wish to sue a specific character to separate a column, you can pass the character to the sep argument of separate(). For example, we could rewrite the code above as:

table3 %>% 
  separate(rate, into = c("cases", "population"), sep = "/")
## # A tibble: 6 x 4
##   country      year cases  population
##   <chr>       <int> <chr>  <chr>     
## 1 Afghanistan  1999 745    19987071  
## 2 Afghanistan  2000 2666   20595360  
## 3 Brazil       1999 37737  172006362 
## 4 Brazil       2000 80488  174504898 
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583

You’ll notice that cases and population are character columns. This is the default behavior in separate(): it leaves the type of the column as is. Here, however, it’s not very useful as those really are numbers. We can ask separate() to try and convert to better types using convert = TRUE:

table3 %>% 
  separate(rate, into = c("cases", "population"), convert = TRUE)
## # A tibble: 6 x 4
##   country      year  cases population
##   <chr>       <int>  <int>      <int>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3 Brazil       1999  37737  172006362
## 4 Brazil       2000  80488  174504898
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583

I can also separate some of the integers within a column if I use sep within the separate() function:

table3 %>% 
  separate(year, into = c("century", "year"), sep = 2)
## # A tibble: 6 x 4
##   country     century year  rate             
##   <chr>       <chr>   <chr> <chr>            
## 1 Afghanistan 19      99    745/19987071     
## 2 Afghanistan 20      00    2666/20595360    
## 3 Brazil      19      99    37737/172006362  
## 4 Brazil      20      00    80488/174504898  
## 5 China       19      99    212258/1272915272
## 6 China       20      00    213766/1280428583
12.4.2 Unite

unite() is the inverse of separate(): it combines multiple columns into a single column. We can use unite() to rejoin the century and year columns that we created in the last example. That data is saved as tidyr::table5. unite() takes a data frame, the name of the new variable to create, and a set of columns to combine, again specified in dplyr::select() style:

table5 %>%
  unite(new, century, year)
## # A tibble: 6 x 3
##   country     new   rate             
##   <chr>       <chr> <chr>            
## 1 Afghanistan 19_99 745/19987071     
## 2 Afghanistan 20_00 2666/20595360    
## 3 Brazil      19_99 37737/172006362  
## 4 Brazil      20_00 80488/174504898  
## 5 China       19_99 212258/1272915272
## 6 China       20_00 213766/1280428583

We do not really want the underscore (_) in the data so what we can do is unite the data but separate the empty space.

table5 %>%
  unite(new, century, year, sep = "")
## # A tibble: 6 x 3
##   country     new   rate             
##   <chr>       <chr> <chr>            
## 1 Afghanistan 1999  745/19987071     
## 2 Afghanistan 2000  2666/20595360    
## 3 Brazil      1999  37737/172006362  
## 4 Brazil      2000  80488/174504898  
## 5 China       1999  212258/1272915272
## 6 China       2000  213766/1280428583
EXERCISES 12.4.3
12.4.1 What do the extra and fill arguments do in separate()? Experiment with the various options for the following two toy datasets.
tibble(x = c("a,b,c", "d,e,f,g", "h,i,j")) %>% 
  separate(x, c("one", "two", "three"))
## Warning: Expected 3 pieces. Additional pieces discarded in 1 rows [2].
## # A tibble: 3 x 3
##   one   two   three
##   <chr> <chr> <chr>
## 1 a     b     c    
## 2 d     e     f    
## 3 h     i     j
tibble(x = c("a,b,c", "d,e", "f,g,i")) %>% 
  separate(x, c("one", "two", "three"))
## Warning: Expected 3 pieces. Missing pieces filled with `NA` in 1 rows [2].
## # A tibble: 3 x 3
##   one   two   three
##   <chr> <chr> <chr>
## 1 a     b     c    
## 2 d     e     <NA> 
## 3 f     g     i

If you run both sets of code, then you receive an error saying that there are missing pieces. Interestingly, separate(), by default, drops extra values with a warning. Adding the argument, extra = "drop", produces the same result as above but without the error or warning.

tibble(x = c("a,b,c", "d,e,f,g", "h,i,j")) %>%
  separate(x, c("one", "two", "three"), extra = "drop")
## # A tibble: 3 x 3
##   one   two   three
##   <chr> <chr> <chr>
## 1 a     b     c    
## 2 d     e     f    
## 3 h     i     j

Then, if you do extra = “merge”, the g is added with the f.

tibble(x = c("a,b,c", "d,e,f,g", "h,i,j")) %>%
  separate(x, c("one", "two", "three"), extra = "merge")
## # A tibble: 3 x 3
##   one   two   three
##   <chr> <chr> <chr>
## 1 a     b     c    
## 2 d     e     f,g  
## 3 h     i     j

In the next example, “d,e” has too few elements.

tibble(x = c("a,b,c", "d,e", "f,g,i")) %>%
  separate(x, c("one", "two", "three"))
## Warning: Expected 3 pieces. Missing pieces filled with `NA` in 1 rows [2].
## # A tibble: 3 x 3
##   one   two   three
##   <chr> <chr> <chr>
## 1 a     b     c    
## 2 d     e     <NA> 
## 3 f     g     i

By default the empty space is filled with NA.

tibble(x = c("a,b,c", "d,e", "f,g,i")) %>%
  separate(x, c("one", "two", "three"), fill = "right")
## # A tibble: 3 x 3
##   one   two   three
##   <chr> <chr> <chr>
## 1 a     b     c    
## 2 d     e     <NA> 
## 3 f     g     i

The same will happen if you type in “left” but the missing value will appear on the left side of the table.

12.4.2 Both unite() and separate() have a remove argument What does it do? Why would you set it to FALSE?
    Essentially, the remove argument does away with the imputed columns in the result data frame, and the FALSE means that you want to create a new variable but keep the old one.
    
12.4.3 Compare and contrast separate() and extract(), Why are there three variations of separation (by position, by separator, and with groups), but only one unite?
    Separate() separates types of values within a column and are very important in terms of tidying up data. Some examples can be seen above in earlier sections. Below are examples:
#Separators
tibble(x = c("A_1", "A_2", "A_3")) %>%
  separate(x, c("Category", "type"))
## # A tibble: 3 x 2
##   Category type 
##   <chr>    <chr>
## 1 A        1    
## 2 A        2    
## 3 A        3
#Position
tibble(x = c("X1", "X2", "Y1", "Y2")) %>%
  separate(x, c("variable", "into"), sep = c(1))
## # A tibble: 4 x 2
##   variable into 
##   <chr>    <chr>
## 1 X        1    
## 2 X        2    
## 3 Y        1    
## 4 Y        2
    The function extract() uses a regular expression to specify groups in a character vector and split that single character vector into multiple columns. Generally, this is more flexible than separate() as you can create parameters that extract the varying values shown within a vector.

Examples:

#Separators
tibble(x = c("A_1","C_3","F_5","T_4", "X_8", "Y_9")) %>% 
  extract(x,c("Category", "Value"), regex = "([A-Z])_([0-9])")
## # A tibble: 6 x 2
##   Category Value
##   <chr>    <chr>
## 1 A        1    
## 2 C        3    
## 3 F        5    
## 4 T        4    
## 5 X        8    
## 6 Y        9
#Position
tibble(x = c("X1", "X2", "Y1", "Y2")) %>%
  extract(x, c("variable", "id"), regex = "([A-Z])([0-9])")
## # A tibble: 4 x 2
##   variable id   
##   <chr>    <chr>
## 1 X        1    
## 2 X        2    
## 3 Y        1    
## 4 Y        2
#Extract specific
tibble(x = c("X1", "X20", "AA11", "AA2")) %>%
  extract(x, c("variable", "id"), regex = "([A-Z]+)([0-9]+)")
## # A tibble: 4 x 2
##   variable id   
##   <chr>    <chr>
## 1 X        1    
## 2 X        20   
## 3 AA       11   
## 4 AA       2

Unite() as opposed to separate() and extract() converts many columns to one. Thanks to the book, “Both separate() and extract() convert a single column to many columns. However, unite() converts many columns to one, with a choice of a separator to include between column values.”

12.5 Missing Values

Values can be missing in two ways: • Explicitly, i.e. flagged with NA. • Implicitly, i.e. simply not present in the data. For example:

stocks <- tibble(
  year   = c(2015, 2015, 2015, 2015, 2016, 2016, 2016),
  qtr    = c(   1,    2,    3,    4,    2,    3,    4),
  return = c(1.88, 0.59, 0.35,   NA, 0.92, 0.17, 2.66)
)

The NA is explicit and there should be a 1st qtr for 2016 which is explicitly missing. One way to make the implicit value explicit is to order or tidy the data by variable “year” so that NA appears.

stocks %>% 
  pivot_wider(names_from = year, values_from = return)
## # A tibble: 4 x 3
##     qtr `2015` `2016`
##   <dbl>  <dbl>  <dbl>
## 1     1   1.88  NA   
## 2     2   0.59   0.92
## 3     3   0.35   0.17
## 4     4  NA      2.66

Now we have two NA. If those data points are not important then we can go ahead and make the values implicit. We do this by adding values_drop_na = TRUE in pivot_longer():

stocks %>% 
  pivot_wider(names_from = year, values_from = return) %>% 
  pivot_longer(
    cols = c(`2015`, `2016`), 
    names_to = "year", 
    values_to = "return", 
    values_drop_na = TRUE
  )
## # A tibble: 6 x 3
##     qtr year  return
##   <dbl> <chr>  <dbl>
## 1     1 2015    1.88
## 2     2 2015    0.59
## 3     2 2016    0.92
## 4     3 2015    0.35
## 5     3 2016    0.17
## 6     4 2016    2.66

Another important tool for making missing values explicit in tidy data is complete():

stocks %>% 
  complete(year, qtr)
## # A tibble: 8 x 3
##    year   qtr return
##   <dbl> <dbl>  <dbl>
## 1  2015     1   1.88
## 2  2015     2   0.59
## 3  2015     3   0.35
## 4  2015     4  NA   
## 5  2016     1  NA   
## 6  2016     2   0.92
## 7  2016     3   0.17
## 8  2016     4   2.66

We can also use fill() to carry forward the previous in this case person.

treatment <- tribble(
  ~ person,           ~ treatment, ~response,
  "Derrick Whitmore", 1,           7,
  NA,                 2,           10,
  NA,                 3,           9,
  "Katherine Burke",  1,           4
)
treatment
## # A tibble: 4 x 3
##   person           treatment response
##   <chr>                <dbl>    <dbl>
## 1 Derrick Whitmore         1        7
## 2 <NA>                     2       10
## 3 <NA>                     3        9
## 4 Katherine Burke          1        4
treatment %>% 
  fill(person)
## # A tibble: 4 x 3
##   person           treatment response
##   <chr>                <dbl>    <dbl>
## 1 Derrick Whitmore         1        7
## 2 Derrick Whitmore         2       10
## 3 Derrick Whitmore         3        9
## 4 Katherine Burke          1        4
EXERCISES 12.5.1
12.5.1. Compare and contrast the fill arguments to pivot_wider() and complete().
   Both spread() and complete() replace implicit and explicit missing values. In spread(), the fill argument *explicitly* sets the value to replace NAs. The fill argument in complete() allows for different values of different variables.
12.5.2 What does the direction argument to fill() do?
    The direction determines whether the missing values should be replaced by values from above or below or "up" and "down".
    
12.6 Case Study

The tidyr::who dataset contains tuberculosis (TB) cases broken down by year, country, age, gender, and diagnosis method. The data comes from the 2014 World Health Organization Global Tuberculosis Report, available at http://www.who.int/tb/country/data/download/en/.

who
## # A tibble: 7,240 x 60
##    country iso2  iso3   year new_sp_m014 new_sp_m1524 new_sp_m2534 new_sp_m3544
##    <chr>   <chr> <chr> <int>       <int>        <int>        <int>        <int>
##  1 Afghan… AF    AFG    1980          NA           NA           NA           NA
##  2 Afghan… AF    AFG    1981          NA           NA           NA           NA
##  3 Afghan… AF    AFG    1982          NA           NA           NA           NA
##  4 Afghan… AF    AFG    1983          NA           NA           NA           NA
##  5 Afghan… AF    AFG    1984          NA           NA           NA           NA
##  6 Afghan… AF    AFG    1985          NA           NA           NA           NA
##  7 Afghan… AF    AFG    1986          NA           NA           NA           NA
##  8 Afghan… AF    AFG    1987          NA           NA           NA           NA
##  9 Afghan… AF    AFG    1988          NA           NA           NA           NA
## 10 Afghan… AF    AFG    1989          NA           NA           NA           NA
## # … with 7,230 more rows, and 52 more variables: new_sp_m4554 <int>,
## #   new_sp_m5564 <int>, new_sp_m65 <int>, new_sp_f014 <int>,
## #   new_sp_f1524 <int>, new_sp_f2534 <int>, new_sp_f3544 <int>,
## #   new_sp_f4554 <int>, new_sp_f5564 <int>, new_sp_f65 <int>,
## #   new_sn_m014 <int>, new_sn_m1524 <int>, new_sn_m2534 <int>,
## #   new_sn_m3544 <int>, new_sn_m4554 <int>, new_sn_m5564 <int>,
## #   new_sn_m65 <int>, new_sn_f014 <int>, new_sn_f1524 <int>,
## #   new_sn_f2534 <int>, new_sn_f3544 <int>, new_sn_f4554 <int>,
## #   new_sn_f5564 <int>, new_sn_f65 <int>, new_ep_m014 <int>,
## #   new_ep_m1524 <int>, new_ep_m2534 <int>, new_ep_m3544 <int>,
## #   new_ep_m4554 <int>, new_ep_m5564 <int>, new_ep_m65 <int>,
## #   new_ep_f014 <int>, new_ep_f1524 <int>, new_ep_f2534 <int>,
## #   new_ep_f3544 <int>, new_ep_f4554 <int>, new_ep_f5564 <int>,
## #   new_ep_f65 <int>, newrel_m014 <int>, newrel_m1524 <int>,
## #   newrel_m2534 <int>, newrel_m3544 <int>, newrel_m4554 <int>,
## #   newrel_m5564 <int>, newrel_m65 <int>, newrel_f014 <int>,
## #   newrel_f1524 <int>, newrel_f2534 <int>, newrel_f3544 <int>,
## #   newrel_f4554 <int>, newrel_f5564 <int>, newrel_f65 <int>

This dataset is really messy so it is important that I use my knowledge of tidyr and later dplyr to help make this dataset clear.

Country, iso2, and iso3 are three variables that repeatedly label the country. Year is also a variable.

new_sp_m014, new_ep_m014, new_ep_f014 are likely to be values, not variables.

“We need to gather together all the columns from new_sp_m014 to newrel_f65. We don’t know what those values represent yet, so we’ll give them the generic name”key“. We know the cells represent the count of cases, so we’ll use the variable cases. There are a lot of missing values in the current representation, so for now we’ll use na.rm just so we can focus on the values that are present.”

who1 <- who %>% 
  pivot_longer(
    cols = new_sp_m014:newrel_f65, 
    names_to = "key", 
    values_to = "cases", 
    values_drop_na = TRUE
  )
who1
## # A tibble: 76,046 x 6
##    country     iso2  iso3   year key          cases
##    <chr>       <chr> <chr> <int> <chr>        <int>
##  1 Afghanistan AF    AFG    1997 new_sp_m014      0
##  2 Afghanistan AF    AFG    1997 new_sp_m1524    10
##  3 Afghanistan AF    AFG    1997 new_sp_m2534     6
##  4 Afghanistan AF    AFG    1997 new_sp_m3544     3
##  5 Afghanistan AF    AFG    1997 new_sp_m4554     5
##  6 Afghanistan AF    AFG    1997 new_sp_m5564     2
##  7 Afghanistan AF    AFG    1997 new_sp_m65       0
##  8 Afghanistan AF    AFG    1997 new_sp_f014      5
##  9 Afghanistan AF    AFG    1997 new_sp_f1524    38
## 10 Afghanistan AF    AFG    1997 new_sp_f2534    36
## # … with 76,036 more rows

We can get some hint of the structure of the values in the new key column by counting them:

who1 %>% 
  count(key)
## # A tibble: 56 x 2
##    key              n
##    <chr>        <int>
##  1 new_ep_f014   1032
##  2 new_ep_f1524  1021
##  3 new_ep_f2534  1021
##  4 new_ep_f3544  1021
##  5 new_ep_f4554  1017
##  6 new_ep_f5564  1017
##  7 new_ep_f65    1014
##  8 new_ep_m014   1038
##  9 new_ep_m1524  1026
## 10 new_ep_m2534  1020
## # … with 46 more rows

The first three letters of each column denote whether the column contains new or old cases of TB. In this dataset, each column contains new cases.

The next two letters describe the type of TB:

rel stands for cases of relapse ep stands for cases of extrapulmonary TB sn stands for cases of pulmonary TB that could not be diagnosed by a pulmonary smear (smear negative) sp stands for cases of pulmonary TB that could be diagnosed be a pulmonary smear (smear positive) The sixth letter gives the sex of TB patients. The dataset groups cases by males (m) and females (f).

The remaining numbers gives the age group. The dataset groups cases into seven age groups:

014 = 0 – 14 years old 1524 = 15 – 24 years old 2534 = 25 – 34 years old 3544 = 35 – 44 years old 4554 = 45 – 54 years old 5564 = 55 – 64 years old 65 = 65 or older The names are slightly inconsistent because instead of new_rel we have newrel. The basic idea is pretty simple: replace the characters “newrel” with “new_rel”. This makes all variable names consistent.

who2 <- who1 %>% 
  mutate(names_from = stringr::str_replace(key, "newrel", "new_rel"))
who2
## # A tibble: 76,046 x 7
##    country     iso2  iso3   year key          cases names_from  
##    <chr>       <chr> <chr> <int> <chr>        <int> <chr>       
##  1 Afghanistan AF    AFG    1997 new_sp_m014      0 new_sp_m014 
##  2 Afghanistan AF    AFG    1997 new_sp_m1524    10 new_sp_m1524
##  3 Afghanistan AF    AFG    1997 new_sp_m2534     6 new_sp_m2534
##  4 Afghanistan AF    AFG    1997 new_sp_m3544     3 new_sp_m3544
##  5 Afghanistan AF    AFG    1997 new_sp_m4554     5 new_sp_m4554
##  6 Afghanistan AF    AFG    1997 new_sp_m5564     2 new_sp_m5564
##  7 Afghanistan AF    AFG    1997 new_sp_m65       0 new_sp_m65  
##  8 Afghanistan AF    AFG    1997 new_sp_f014      5 new_sp_f014 
##  9 Afghanistan AF    AFG    1997 new_sp_f1524    38 new_sp_f1524
## 10 Afghanistan AF    AFG    1997 new_sp_f2534    36 new_sp_f2534
## # … with 76,036 more rows

We can separate the values in each code with two passes of separate(). The first pass will split the codes at each underscore.

who3 <- who2 %>% 
  separate(key, c("new", "type", "sexage"), sep = "_")
## Warning: Expected 3 pieces. Missing pieces filled with `NA` in 2580 rows [243,
## 244, 679, 680, 681, 682, 683, 684, 685, 686, 687, 688, 689, 690, 691, 692, 903,
## 904, 905, 906, ...].
who3
## # A tibble: 76,046 x 9
##    country     iso2  iso3   year new   type  sexage cases names_from  
##    <chr>       <chr> <chr> <int> <chr> <chr> <chr>  <int> <chr>       
##  1 Afghanistan AF    AFG    1997 new   sp    m014       0 new_sp_m014 
##  2 Afghanistan AF    AFG    1997 new   sp    m1524     10 new_sp_m1524
##  3 Afghanistan AF    AFG    1997 new   sp    m2534      6 new_sp_m2534
##  4 Afghanistan AF    AFG    1997 new   sp    m3544      3 new_sp_m3544
##  5 Afghanistan AF    AFG    1997 new   sp    m4554      5 new_sp_m4554
##  6 Afghanistan AF    AFG    1997 new   sp    m5564      2 new_sp_m5564
##  7 Afghanistan AF    AFG    1997 new   sp    m65        0 new_sp_m65  
##  8 Afghanistan AF    AFG    1997 new   sp    f014       5 new_sp_f014 
##  9 Afghanistan AF    AFG    1997 new   sp    f1524     38 new_sp_f1524
## 10 Afghanistan AF    AFG    1997 new   sp    f2534     36 new_sp_f2534
## # … with 76,036 more rows

Then we drop the “new” column because it’s constant in this dataset. We also drop iso2 and iso3 since they’re redundant.

who3 %>% 
  count(new)
## # A tibble: 2 x 2
##   new        n
##   <chr>  <int>
## 1 new    73466
## 2 newrel  2580
who4 <- who3 %>% 
  select(-new, -iso2, -iso3)
who4
## # A tibble: 76,046 x 6
##    country      year type  sexage cases names_from  
##    <chr>       <int> <chr> <chr>  <int> <chr>       
##  1 Afghanistan  1997 sp    m014       0 new_sp_m014 
##  2 Afghanistan  1997 sp    m1524     10 new_sp_m1524
##  3 Afghanistan  1997 sp    m2534      6 new_sp_m2534
##  4 Afghanistan  1997 sp    m3544      3 new_sp_m3544
##  5 Afghanistan  1997 sp    m4554      5 new_sp_m4554
##  6 Afghanistan  1997 sp    m5564      2 new_sp_m5564
##  7 Afghanistan  1997 sp    m65        0 new_sp_m65  
##  8 Afghanistan  1997 sp    f014       5 new_sp_f014 
##  9 Afghanistan  1997 sp    f1524     38 new_sp_f1524
## 10 Afghanistan  1997 sp    f2534     36 new_sp_f2534
## # … with 76,036 more rows

Next we’ll separate sexage into sex and age by splitting after the first character:

who5 <- who4 %>% 
  separate(sexage, c("sex", "age"), sep = 1)
who5
## # A tibble: 76,046 x 7
##    country      year type  sex   age   cases names_from  
##    <chr>       <int> <chr> <chr> <chr> <int> <chr>       
##  1 Afghanistan  1997 sp    m     014       0 new_sp_m014 
##  2 Afghanistan  1997 sp    m     1524     10 new_sp_m1524
##  3 Afghanistan  1997 sp    m     2534      6 new_sp_m2534
##  4 Afghanistan  1997 sp    m     3544      3 new_sp_m3544
##  5 Afghanistan  1997 sp    m     4554      5 new_sp_m4554
##  6 Afghanistan  1997 sp    m     5564      2 new_sp_m5564
##  7 Afghanistan  1997 sp    m     65        0 new_sp_m65  
##  8 Afghanistan  1997 sp    f     014       5 new_sp_f014 
##  9 Afghanistan  1997 sp    f     1524     38 new_sp_f1524
## 10 Afghanistan  1997 sp    f     2534     36 new_sp_f2534
## # … with 76,036 more rows

The who dataset is now tidy! Normally you would build up a really complex pipe:

who %>%
  pivot_longer(
    cols = new_sp_m014:newrel_f65, 
    names_to = "key", 
    values_to = "cases", 
    values_drop_na = TRUE
  ) %>% 
  mutate(
    key = stringr::str_replace(key, "newrel", "new_rel")
  ) %>%
  separate(key, c("new", "var", "sexage")) %>% 
  select(-new, -iso2, -iso3) %>% 
  separate(sexage, c("sex", "age"), sep = 1)
## # A tibble: 76,046 x 6
##    country      year var   sex   age   cases
##    <chr>       <int> <chr> <chr> <chr> <int>
##  1 Afghanistan  1997 sp    m     014       0
##  2 Afghanistan  1997 sp    m     1524     10
##  3 Afghanistan  1997 sp    m     2534      6
##  4 Afghanistan  1997 sp    m     3544      3
##  5 Afghanistan  1997 sp    m     4554      5
##  6 Afghanistan  1997 sp    m     5564      2
##  7 Afghanistan  1997 sp    m     65        0
##  8 Afghanistan  1997 sp    f     014       5
##  9 Afghanistan  1997 sp    f     1524     38
## 10 Afghanistan  1997 sp    f     2534     36
## # … with 76,036 more rows
EXERCISES 12.6.1
12.6.1 In this case study I set values_drop_na = TRUE just to make it easier to check that we had the correct values. Is this reasonable? Think about how missing values are represented in this dataset. Are there implicit missing values? What’s the difference between an NA and zero?
    One thing that is really important is that we want to know whether who did not have any cases are that they did not know or have any data on them for another reason. IF there are no 0 values, then that means there is no data or no cases in the who database. If there are both explicit and implicit missing values, then it suggests that missing values are being used in different ways. NA would mean no cases and implicit would mean no data on the number of cases, presumably.
    We can also check to see if there are any zeros by doing this:
who1 %>%
  filter(cases == 0) %>%
  nrow()
## [1] 11080

Cases of zero TB are explicitly indicated with NA. Next, it is important to check and see whether values for a country or year are missing or whether only some columns are missing.

gather(who, new_sp_m014:newrel_f65, key = "key", value = "cases") %>%
  group_by(country,year) %>%
  mutate(prop_missing = sum(is.na(cases)) / n()) %>%
  filter(prop_missing > 0, prop_missing < 1)
## # A tibble: 195,104 x 7
## # Groups:   country, year [3,484]
##    country     iso2  iso3   year key         cases prop_missing
##    <chr>       <chr> <chr> <int> <chr>       <int>        <dbl>
##  1 Afghanistan AF    AFG    1997 new_sp_m014     0         0.75
##  2 Afghanistan AF    AFG    1998 new_sp_m014    30         0.75
##  3 Afghanistan AF    AFG    1999 new_sp_m014     8         0.75
##  4 Afghanistan AF    AFG    2000 new_sp_m014    52         0.75
##  5 Afghanistan AF    AFG    2001 new_sp_m014   129         0.75
##  6 Afghanistan AF    AFG    2002 new_sp_m014    90         0.75
##  7 Afghanistan AF    AFG    2003 new_sp_m014   127         0.75
##  8 Afghanistan AF    AFG    2004 new_sp_m014   139         0.75
##  9 Afghanistan AF    AFG    2005 new_sp_m014   151         0.75
## 10 Afghanistan AF    AFG    2006 new_sp_m014   193         0.75
## # … with 195,094 more rows
12.6.2 What happens if you neglect the mutate() step? (mutate(names_from = stringr::str_replace(key, “newrel”, “new_rel”)))
    If you neglect this then the separate() function emits the warning "too few values". If we check the rows for keys beginning with "newrel_", we see that sexage is missing, and type = m014.
who3a <- who1 %>%
  separate(key, c("new", "type", "sexage"), sep = "_")
## Warning: Expected 3 pieces. Missing pieces filled with `NA` in 2580 rows [243,
## 244, 679, 680, 681, 682, 683, 684, 685, 686, 687, 688, 689, 690, 691, 692, 903,
## 904, 905, 906, ...].
filter(who3a, new == "newrel") %>% head()
## # A tibble: 6 x 8
##   country     iso2  iso3   year new    type  sexage cases
##   <chr>       <chr> <chr> <int> <chr>  <chr> <chr>  <int>
## 1 Afghanistan AF    AFG    2013 newrel m014  <NA>    1705
## 2 Afghanistan AF    AFG    2013 newrel f014  <NA>    1749
## 3 Albania     AL    ALB    2013 newrel m014  <NA>      14
## 4 Albania     AL    ALB    2013 newrel m1524 <NA>      60
## 5 Albania     AL    ALB    2013 newrel m2534 <NA>      61
## 6 Albania     AL    ALB    2013 newrel m3544 <NA>      32
12.6.3 I claimed that iso2 and iso3 were redundant with country. Confirm this claim.
    iso2 and iso3 are redundant and should be one combination of the two values. The distinct tool helps with this process:
select(who3, country, iso2, iso3) %>%
  distinct() %>%
  group_by(country) %>%
  filter(n() > 1)
## # A tibble: 0 x 3
## # Groups:   country [0]
## # … with 3 variables: country <chr>, iso2 <chr>, iso3 <chr>
    In other words, both iso2 and iso3 contain the 2 and 3 letter abbreviations for the country.
12.6.4 For each country, year, and sex compute the total number of cases of TB. Make an informative visualization of the data.
who5 %>%
  group_by(country, year, sex) %>%
  filter(year > 1995) %>%
  summarise(cases = sum(cases)) %>%
  unite(country_sex, country, sex, remove = FALSE) %>%
  ggplot(aes(x = year, y = cases, group = country_sex, color = sex)) + geom_line()