12 Tidy Data

12.1 Introduction

“Happy families are all alike; every unhappy family is unhappy in its own way.” - Leo Tolstoy

“Tidy datasets are all alike, but every messy dataset is messy in its own way.” - Hadley Wickham

In this cahpter, you will learn a consisten way to organise your data in R, an organisation called tidy data. Getting your data into this format requires some upfront work, but that work pays off in the long term. One you have tidy data and tidy tools provided by packages in the tidyverse, you will spend much less time munging data from one representation to another, allowing you to spend more time on the analytic questions at hand.

12.1.1 Prerequisites

In this chapter we’ll focus on tidyr, a package that provides a bunch of tools to help tidy up your messy datasets. tidy is a member of the core tidyverse.

library(tidyverse)

12.2 Tidy data

There are many ways to represent the same underlying data, but some datasets are easier to use than others.

There are three interrelated rules which make a dataset tidy:

  1. Each variable must have its own column.
  2. Each observation must have its own row.
  3. Each value must have its own cell.

These three rules are interrelated because it’s impossible to only satisfy two of the three. That interrelationship leads to an even simpler set of practical instructions:

  1. Put each dataset in a tibble.
  2. Put each variable in a column.

In this example, only table1 is tidy. It’s the only representation where each column is a variable.

2 main advantages to tidy data.

1.There’s a general advantage to picking one consistent way of storing data. If you have a consistent data structure, it’s easier to learn the tools that work with it because they have a underlying uniformity.

  1. There’s a specific advantage to placing variables in columns because it allows R’s vectorised nature to shine. As you learned in mutate and summary functions, most built-in R functions work with vectors of values. That makes transforming tidy data feel particularly natural.

dplyr, ggplot2, and all the other packages in the tidyverse are designed to work with tidy data.

12.2.1 Exercises

  1. Using prose, describe how the variables and observations are organised in each of the sample tables.

Table1 is organized with four variables across four columns: country, year, cases and population. It is easy to imagine creating many graphics with this format and faceting easily by country and year.

Table2 is similar to table one - country and year are the same - cases and population have been turned into values; essentially combined to create a new column “type”. Count then represents respective numbers of cases and population.

Tables 3 combines the cases and population into a single new column called “rate” - essentially cases/population. While this new data has its own advantages it limits the options available for analysis. It is better to have the raw data and then create a new “rate” variable from the variables in column 1 than to start with the rate.

Tables 4 and 5 are split across two tables and defined by year. The first tibble contains cases data and the second contains population data - making it thus difficult to compare case and poulation data without performing a SQL join.

  1. Compute the rate for table2, and table4a + table4b. You will need to perform four operations:

  2. Extract the number of TB cases per country per year.
  3. Extract the matching population per country per year.
  4. Divide cases by population, and multiply by 10000.
  5. Store back in the appropriate place.

Which representation is easiest to work with? Which is hardest? Why?

table1 <- tibble(
  country = c(
    rep("Afghanistan", 2),
    rep("Brazil", 2),
    rep("China", 2)
  ),
  year = rep(c(1999, 2000), 3),
  cases = c(745, 2666, 37737, 80488, 212258, 213766),
  population = c(19987071, 20595360, 172006362, 174504898, 1272915272, 1280428583)
)

table2 <- tibble(
  country = c(
    rep("Afghanistan", 4),
    rep("Brazil", 4),
    rep("China", 4)
  ),
  year = rep(c(1999, 1999, 2000, 2000), 3),
  type = rep(c("cases", "population", "cases", "population"), 3),
  count = c(745, 19987071, 2666, 20595360, 37737, 172006362, 80488, 174504898, 212258, 1272915272, 213766, 1280428583)
)

table3 <- tibble(
  country = c(
    rep("Afghanistan", 2),
    rep("Brazil", 2),
    rep("China", 2)
  ),
  year = rep(c(1999, 2000), 3),
  rate = c("745/19987071", "2666/20595360", "37737/172006362", "80488/174504898", "212258/1272915272", "213766/1280428583")
)

table4a <- tibble(
  country = c("Afghanistan", "Brazil", "China"),
  '1999' = c(745, 37737, 212258),
  '2000' = c(2666, 80488, 213766)
)

table4b <- tribble(
~country, ~`1999`, ~`2000`,
#---------|-----------|---------
"Afghanistan", 19987071, 20595360,
"Brazil", 172006362, 174504898,
"China", 1272915272, 1280428583
)
table2_mod <- tibble(
  country = table2$country[table2$type == "cases"],
  year = table2$year[table2$type == "cases"],
  cases = table2$count[table2$type == "cases"],
  population = table2$count[table2$type == "population"],
  rate = (cases/population) * 10000
)

table2_mod
table4c <-
  tibble(
    country = table4a$country,
    `1999` = (table4a$`1999`/table4b$`1999`) * 10000,
    `2000` = (table4a$`2000`/table4b$`2000`) * 10000
  )

table4c

Neither were particularly easy to work with, but it was more confusing thinking about cases and population in terms of years.

  1. Recreate the plot showing change in cases over time using table2 instead of table1. What do you need to do first?
# Example
ggplot(table1, aes(year, cases)) +
  geom_line(aes(group = country), color = "grey50") +
  geom_point(aes(color = country))

# Actual
table2 %>%
  filter(type == "cases") %>%
  ggplot(aes(year, count)) +
  geom_line(aes(group = country), color = "grey50") +
  geom_point(aes(color = country))

Just need to filter out “population” data in type column from table.

12.3 Pivoting

Most data that you will encounter will be untidy. There are two main reasons:

  1. Most people aren’t familiar with the principles of tidy data, and it’s hard to derive them yourself unless you spend a lot of time working with data.

  2. Data is often organised to failitate some use other than analysis. For example, data is often organised to make entry as easy as possible.

This means for most real analyses, you’ll need to do some tidying. The first step is always to figure out what the variables and observations are. Sometimes this is easy; other times you’ll need to consult with the people who originall generated the data. The second step is to resolve one of two common problems:

  1. One variable might be spread across multiple columns.

  2. One observation might be scattered across multiple rows.

Typically a dataset will only suffer from one of these problems; it’ll only suffer from both if you’re really unlucky! To fix these problems, you’ll need the two most important functions in tidyr: pivot_longer() and pivot_wider().

12.3.1 Longer

A common problem is a datset where some of the column names are not names of variables, but values of a variable. Take table4a: the column names 1999 and 2000 represent values of the year variable, the values in the 1999 and 2000 columns represent values of the cases variable, and each row represents two observations, not one.

table4a

To tidy a dataset like this, we need to pivot the offending columns into a new pair of variables. To describe that operation we need three paramaters:

  • The set of columns whose names are values, not variables. In this example, those are the columns 1999 and 2000_.
  • The name of the variable to move the column names to. Here it is year.
  • The name of the variable to move the column values to. Here it’s cases.

Together those paramaters generate the call to pivot_longer():

table4a

table4a %>%
  pivot_longer(cols = c(`1999`, `2000`), names_to = "year", values_to = "cases")

The columns to pivot are specified with dplyr::select() style notation. Here there are only two columns, so we list them individually. Note that “1999” and “2000” are non-syntactic names (because they don’t start with a letter) so we have to surround them in backticks. To refresh your memory of the other ways to select columns see, select.

year and cases do not exist in table41 so we put their names in quotes.

In the final result, the pivoted columns are dropped, and we get new year and cases columns. Otherwise, the relationships between the original variables are preserved. Visually, this is shown in Figure 12.2.

pivot_longer() makes datasets longer by increasing the number of rows and decreasing the number of columns. I don’t believe it makes sense to describe a datset as being in “long form”. Length is a relative term, and you can only say (e.g.) that datset A is longer than dataset B.

We can use pivot_longer() to tidy table4b in a similar fashion. The only difference is the variable stored in the cell values:

table4b %>%
  pivot_longer(cols = c(`1999`, `2000`), names_to = "year", values_to = "population")

To combine the tidied versions of table4a and table4b into a single tibble, we need to use dplyr::left_join(), which you’ll learn about in relational data.

tidy4a <- table4a %>%
  pivot_longer(cols = c(`1999`, `2000`), names_to = "year", values_to = "cases")

tidy4b <- table4b %>%
  pivot_longer(cols = c(`1999`, `2000`), names_to = "year", values_to = "population")
left_join(tidy4a, tidy4b, by = c("country", "year"))

12.3.2 Wider

pivot_wider() is the opposite of pivot_longer(). You use it when an observation is scattered across multiple rows. For example, take table2: an observation is a country in a year, but each observation is spread across two rows.

table2 %>%
  pivot_wider(id_cols = c(country, year), names_from = type, values_from = count)

To tidy this up, we first analyse the representation in similar way to pivot_longer(). This time, however, we only need two paramaters:

Once we’ve figured that out, we can use pivot_wider(), as shown programmatically below, and visually in Figure 12.3.

# Above we provided __id_cols__, but they don't appear necessary in this example.
table2 %>%
  pivot_wider(names_from = type, values_from = count)

As you might have guessed from their names, pivot_wider() and pivot_longer() are complements. pivot_longer() makes wide tables narrower and longer; pivot_wider() makes long tables shorter and wider.

12.3.3 Exercises

  1. Why are pivot_longer() and pivot_wider() not perfectly symmetrical? Carefully consider the following example:
stocks <- tibble(
  year = c(2015, 2015, 2016, 2016),
  half = c( 1, 2, 1, 2),
  return = c(1.88, 0.59, 0.92, 0.17)
)
stocks

stocks %>%
  pivot_wider(names_from = year, values_from = return)

stocks %>%
  pivot_wider(names_from = year, values_from = return) %>%
  pivot_longer(`2015`:`2016`, names_to = "year", values_to = "return")

They are not perfectly symmetrical because when you pivot you change the nature of the data (columns, values and observations). For example, when we pivot_wider from our original dataset, each observation contains the half and return by respective year, but the year is provided by the column (the variable). It is a relatively “smaller” dataset, but it is confusing to think of the yearly returns in terms of variables. Were there no description, this data would be somewhat difficult to understand.

pivot_longer() has a names_ptypes argument, e.g. names_ptypes = list(year = double()). What does it do?

stocks %>%
  pivot_wider(names_from = year, values_from = return) %>%
  pivot_longer(`2015`:`2016`, names_to = "year", values_to = "return", names_ptypes = list(year = double()))
Error: Can't convert <character> to <double>.
Run `rlang::last_error()` to see where the error occurred.

It defines the type, class, and attributes of a vector. It should be used to confirm that the created columns are the types that you expect. It is not to be used to change the column type (hence, the above error message). Instead use names_transform or values_transform.

stocks %>%
  pivot_wider(names_from = year, values_from = return) %>%
  pivot_longer(`2015`:`2016`, names_to = "year", values_to = "return", names_transform = list(year = as.double)
               )
  1. Why does this code fail?
table4a %>%
  pivot_longer(c(1999, 2000), names_to = "year", values_to = "cases")
Error: Can't subset columns that don't exist.
x Locations 1999 and 2000 don't exist.
ℹ There are only 3 columns.
Run `rlang::last_error()` to see where the error occurred.

The code fails because we need to wrap our column variables in backticks -> c(1999, 2000). These names do not follow proper R naming conventions (they do not begin with a letter) so we need to specify what they are - in this case column names.

table4a %>%
  pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "cases")
  1. What would happen if you widen this table? Why? How could you add a new column to uniquely identify each value?
people <- tribble(
  ~name,             ~names,    ~values,
  #-----------------|----------|--------
  "Phillip Woods",   "age",     45,
  "Phillip Woods",   "height",  186,
  "Phillip Woods",   "age",     50,
  "Jessica Cordero", "age",     37,
  "Jessica Cordero", "height",   156
)

people

Before running a script to widen the only issue we might have is that we have two ages observations for Phillip Woods, but only one height observation - our new dataset won’t be balanced. We may be able to accomplish this adding a new column denoting the observation number.

people %>%
  group_by(name, names) %>%
  mutate(obs = row_number()) %>%
  pivot_wider(names_from = name, values_from = values)

With this particular type of data however it would be better if age and height were the columns and Philip Woods and Jessica Cordero remained in the name column. Ideally, we would just replicate the value of height for Philip Woods because it is unlikely that it is any different for ages 45 and 50.

  1. Tidy the simple tibble below. Do you need to make it wider or longer? What are the variables?
preg <- tribble(
  ~pregnant, ~male, ~female,
  "yes",     NA,    10,
  "no",      20,    12
)
preg

preg %>%
  pivot_longer(cols = c(male, female), names_to = "gender", values_to = "count")

# Can remove NA value for "pregnant males" using values_drop_na
preg %>%
  pivot_longer(cols = c(male, female), names_to = "gender", values_to = "count", values_drop_na = TRUE)

The data seems to make the most sense if we pivot_longer. In this example the dataset appears to resemble counts of pregnant males and females (obviously there are no pregnant males), but the data is represented by NA.

12.4 Separating and uniting

So far you’ve learned how to tidy table2 and table4, but not table3. table3 has a different problem: we have one column (rate) that contains two variables (cases and population). To fix this problem, we’ll need the separate() function. You’ll also learn about the complement of separate(): unite(), which you use if a single variable is spread across multiple columns.

12.4.1 Separate

separate() pulls apart one column into multiple columns, by splitting wherever a separator character appears. Take table3:

table3 %>%
  separate(col = rate, into = c("cases", "population"), sep = "/")

The rate column contains both cases and population variables, and we need to split it into two variables. separate() takes the name of the column to separate, and the names of the columns to separate into, as shown in Figure 12.4 and the code above.

Above we specified what the separator should be. However, by default, separate() will split values wherever it sees a non-alphanumeric character (i.e. a character that isn’t a number or letter). In this case, we didn’t need to specify “/” as the separator, but in more complicated cases it will likely be necessary.

(Formally, sep is a regular expression, which you’ll learn more about in strings).

Look carefully at the column types: you’ll notice that cases and population are character columns. This is the default behavior in separate(): it leaves the type of the column as is. Here, however, it’s not very useful as those really are numbers. We can ask separate() to try and convert to better types using convert = TRUE:

table3 %>%
  separate(col = rate, into = c("cases", "population"), sep = "/", convert = TRUE)

You can also pass a vector of integers to sep. separate() will interpret the integers as positions to split at. Positive values start at 1 on the far-left of the strings; negative values start at -1 on the far-right of the strings. When using integers to separate strings, the length of sep should be one less than the number of names in into.

You can use this arrangement to separate the last two digits of each year. This make this data less tidy, but is useful in other cases, as you’ll see in a little bit.

table3 %>%
  separate(col = year, into = c("century", "year"), sep = 2)

12.4.2 Unite

unite() is the inverse of separate(): it combines multiple columns into a single column. You’ll need it much less frequrently than separate(), but it’s still a useful tool to have in your back pocket.

We can use unite() to rejoin the century and year columns that we created in the last example. That data is saved as tidyr::table5. unite takes a data frame, the name of the new variable to create, and a set of columns to combine, again specified in dplyr::select() style:

table5 %>%
  unite(col = year, c(century, year), sep = "")

12.4.3 Exercises

  1. What do the extra and fill arguments do in separate()? Experiment with the various options for the following two oy datasets.

extra controls what happens when there are too many pieces when sep is a character vector. There are three valid options:

  • “warn” (the default): emit a warning and drop extra values.
  • “drop”: drop any extra values without a warning.
  • “merge”: only splits at most length(into) times

fill If sep is a character vector, this controls what happens when there are not enough pieces. There are three valid options:

  • “warn” (the default): emit a warning and fill from the right
  • “right”: fill with missing values on the right
  • “left”: fill with missing values on the left
tibble(x = c("a,b,c", "d,e,f,g", "h,i,j")) %>%
  separate(x, c("one", "two", "three"), extra = "warn")
Expected 3 pieces. Additional pieces discarded in 1 rows [2].
tibble(x = c("a,b,c", "d,e,f,g", "h,i,j")) %>%
  separate(x, c("one", "two", "three"), extra = "drop")

tibble(x = c("a,b,c", "d,e,f,g", "h,i,j")) %>%
  separate(x, c("one", "two", "three"), extra = "merge")
# In these examples the command expects "more data".

# Emit a warning and fill from the right
tibble(x = c("a,b,c", "d,e", "f,g,i")) %>% 
  separate(x, c("one", "two", "three"), fill = "warn")
Expected 3 pieces. Missing pieces filled with `NA` in 1 rows [2].
# Fill with missing values on the right
tibble(x = c("a,b,c", "d,e", "f,g,i")) %>% 
  separate(x, c("one", "two", "three"), fill = "right")

# Fill with missing values on the left
tibble(x = c("a,b,c", "d,e", "f,g,i")) %>% 
  separate(x, c("one", "two", "three"), fill = "left")
  1. Both unite() and separate() have a remove argument. What does it do? Why would you set it to FALSE?

remove If TRUE, remove input column from output data frame. You would set it to FALSE if you want to create a new variable, but keep the old one. It defaults to TRUE.

  1. Compare and contrast separate() and extract(). Why are there three variations of separation (by position, by separator, and with groups), but only one unite?

Briefly mentioned earlier, separate() is a type of regular expression or regex. Thus, when using separate() you are telling the built-in regex function how to separate the data, i.e. by position or delimiter. However, extract() allows for a little bit more flexibility because it lets you write the regular expression dictating how to split up the data.

For example, in the example below separate() is unable to properly separate the data because there is no delimiter to separate by and the positions vary (in some cases there are two characters and in some cases there is only one).

# Failed parse
tibble(x = c("X1", "X20", "AA11", "AA2")) %>%
  separate(col = x, into = c("variable", "id"), sep = 1)
# Failed parse
tibble(x = c("X1", "X20", "AA11", "AA2")) %>%
  separate(col = x, into = c("variable", "id"), sep = 2)

# Failed parse
tibble(x = c("X1", "X20", "AA11", "AA2")) %>%
  separate(col = x, into = c("variable", "id"), sep = "")
Expected 2 pieces. Additional pieces discarded in 4 rows [1, 2, 3, 4].
# Using extract
tibble(x = c("X1", "X20", "AA11", "AA2")) %>%
  extract(col = x, into = c("variable", "id"), regex = "([A-Z]+)([0-9]+)")

In the regular expression above, the parentheses indicate what each segment to be read should look like. Thus, for variable we are passing on the condition that it must be alphabetical [A-Z] and can be of any length (+). The second term, id, must be numeric [0-9] and can be of any length (+). The entire expression must be surrounded by quotes.

Thus when separating a single variable into multiple columns there can be multiple criteria by which you would like to separate.

However, as we’ll see in the example below when uniting data from multiple columns there is always only one choice as to how to combine their contents.

# Unite
tibble(x = c("A", "B", "C", "D"),
       y = c("1", "2", "3", "4")) %>%
  unite(col = "XY", c(x, y), sep = "_")

# OR
tibble(x = c("A", "B", "C", "D"),
       y = c("1", "2", "3", "4")) %>%
  unite(col = "XY", c(x, y), sep = "")

12.5 Missing values

Changing the representation of a datset brings up an important subtlety of missing values. Surprisingly, a value can be missing in one of two possible ways:

Let’s illustrate this idea with a very simple data set:

stocks <- tibble(
  year   = c(2015, 2015, 2015, 2015, 2016, 2016, 2016),
  qtr    = c(   1,    2,    3,    4,    2,    3,    4),
  return = c(1.88, 0.59, 0.35,   NA, 0.92, 0.17, 2.66)
)

stocks

There are two missing values in this dataset:

One way to think about the difference is with this Zen-like koan: An explicit missing value is the presence of an absence; an implicit missing value is the absence of a presence.

The way that a dataset is represented can make implicit values explicit. For example, we can make the implicit missing value explicit by putting years in the columns:

stocks %>%
  pivot_wider(names_from = year, values_from = return)

Because these explicit missing values may not be important in other representations of the data, you can set values_drop_na = TRUE in pivot_longer() to turn explicit missing values implicit:

stocks %>%
  pivot_wider(names_from = year, values_from = return) %>%
  pivot_longer(
    cols = c(`2015`, `2016`),
    names_to = "year",
    values_to = "return",
    values_drop_na = TRUE
  )

Another important tool for making missing values explicit in tidy data is complete():

stocks %>%
  complete(year, qtr)

complete() takes a set of columns, and finds all unique combinations. It then ensures the original dataset contains all those values, filling in explicit NAs where necessary. There’s one other important tool that you should know for working with missing values. Sometimes when a data source has primarily been used for data entry, missing values indicate that the previous value should be carried forward:

treatment <- tribble(
  ~ person,           ~ treatment, ~response,
  "Derrick Whitmore", 1,           7,
  NA,                 2,           10,
  NA,                 3,           9,
  "Katherine Burke",  1,           4
)

You can fill in these missing values with fill(). It takes a set of columns where you want missing values to be replaced by the most recent non-missing value (sometimes called last observation carried forward).

treatment %>%
  fill(person, .direction = "down")

12.5.1 Exercises

  1. Compare and contrast the fill arguments to pivot_wider() and complete().

Fill does not make changes to the structure of the data. It does not reorganize columns. Pivot_wider takes the values in a column and creates new columns with them. Complete finds all existing combinations using a set of columns and then ensures the original dataset contains all those values, filling in explicit NAs where necessary.

Most likely you would pivot_wider and then complete which would likely create some NAs, but those NAs could possibly be filled in using fill.

  1. What does the direction argument to fill() do?

.direction indicates in what direction the fill should happen c(“down”, “up”, “downup”, “updown”). Down and up are fairly intuitive, but “downup” means first down and then up and “updown” means first up and then down.

12.6 Case Study

Tidying a messy dataset from the World Health Organization on tuberculosis cases.

Here we move all of the columns that seem to provide case data to a new variable called “key” and move the values from the respective columns into the new column “cases”.

who1 <- who %>%
  pivot_longer(cols = new_sp_m014:newrel_f65,
               names_to = "key",
               values_to = "cases",
               values_drop_na = TRUE)
who1

We can get some hint of the structure of the values in the new key column by counting them:

who1 %>%
  count(key)

According to the data dictionary:

  1. The first three letters of each column denote whether the column contains new or old cases of TB. In this dataset, each column contains new cases.

  2. The next two letters descripe the type of TB:

  1. The sixth letter gives the sex of TB patients. The dataset groups cases by males (m) and females (f).

  2. The remaining numbers gives the age group. The dataset groups cases into seven age groups:

We need to make a minor fix to the format of the column names: unfortunately the names are slightly inconsisten because instead of new_rel we have newrel (it’s hard to spot this here but if you don’t fix it we’ll get errors in subsequent steps). You’ll learn about str_replace() in strings, but the basic idea is pretty simple: replace the characters “newrel” with “new_rel”. This makes all variable names consistent.

who2 <- who1 %>%
  mutate(key = str_replace(key, "newrel", "new_rel"))
who2 %>%
  filter(grepl("new_rel", key, ignore.case = TRUE) == TRUE)

We can separate the values in each code with two passes of separate(). The first pass will split the codes at each underscore.

who3 <- who2 %>%
  separate(col = key, into = c("newold", "type", "sexage"), sep = "_") %>%
  separate(col = sexage, into = c("sex", "age"), sep = 1)

who3

who4 <- subset(who3, select = -c(iso2, iso3, newold))
who4

The who dataset is now tidy!!!

In future cases, rather than performing each operation one code at a time, we would build up a complex pipe to perform all of the operations in a single chunk of code.

who %>%
  pivot_longer(
    cols = new_sp_m014:newrel_f65, 
    names_to = "key", 
    values_to = "cases", 
    values_drop_na = TRUE
  ) %>% 
  mutate(
    key = stringr::str_replace(key, "newrel", "new_rel")
  ) %>%
  separate(key, c("new", "var", "sexage")) %>% 
  select(-new, -iso2, -iso3) %>% 
  separate(sexage, c("sex", "age"), sep = 1)

12.6.1 Exercises

  1. In this case study I set values_drop_na = TRUE just to make it easier to check that we had the correct values. Is this reasonable? Think about how missing values are represented in this dataset. Are there implicit missing values? What’s the difference between an NA and zero?

While it might not be advisable considering we started off knowing very little about the data, in hindsight it appears fine. The data is structured to show the number of tuberculosis cases by country, year, type, sex, and age. This means that for 219 countries represented in the dataset, they should each have 34 years of data (1980 - 2013), on 4 distinct types of tubeculosis and 7 age groups.

who1 %>%
  filter(cases == 0) %>%
  nrow()
[1] 11080

There are 11,080 rows in the data where the number of cases equals zero.

An NA means nothing was explicitly filled in for that value. A zero means a zero was entered.

pivot_longer(who, c(new_sp_m014:newrel_f65), names_to = "key", values_to = "cases") %>%
  group_by(country, year) %>%
  mutate(prop_missing = sum(is.na(cases))/n()) %>%
  filter(prop_missing > 0, prop_missing < 1)

Finally, I will check for implicit missing values. Implicit missing values are (year, country) combinations that do not apear in the data.

nrow(who)
[1] 7240
who %>%
  complete(country, year) %>%
  nrow()
[1] 7446
anti_join(complete(who, country, year), who, by = c("country", "year")) %>%
  select(country, year) %>%
  group_by(country) %>%
  summarise(min_year = min(year), max_year = max(year))

To summarize:

  • 0 is used to represent no cases of TB.
  • Explicit missing values (__NA__s) are used to represent missing data for (country, year) combinations in which the country existed in that year.
  • Implicit missing values are used to represent missing data because a country did not exist in that year.
  1. What happens if you neglect the mutate() step? (__mutate(names_from = stringr::str_replace(key, “newrel”, “new_rel”)))?
who %>%
  pivot_longer(
    cols = new_sp_m014:newrel_f65, 
    names_to = "key", 
    values_to = "cases", 
    values_drop_na = TRUE
  ) %>% 
  mutate(
    key = stringr::str_replace(key, "newrel", "new_rel")
  ) %>%
  separate(key, c("new", "var", "sexage")) %>% 
  select(-new, -iso2, -iso3) %>% 
  separate(sexage, c("sex", "age"), sep = 1)
who %>%
  pivot_longer(
    cols = new_sp_m014:newrel_f65, 
    names_to = "key", 
    values_to = "cases", 
    values_drop_na = TRUE
  ) %>%
  separate(key, c("new", "var", "sexage")) %>% 
  select(-new, -iso2, -iso3) %>% 
  separate(sexage, c("sex", "age"), sep = 1)
Expected 3 pieces. Missing pieces filled with `NA` in 2580 rows [243, 244, 679, 680, 681, 682, 683, 684, 685, 686, 687, 688, 689, 690, 691, 692, 903, 904, 905, 906, ...].

If you neglect the mutate step you get an error message when you try to separate the key variable because when it tries to separate by the "_" delimiter it does not find one and so NAs are created.

who %>%
  select(country, iso2, iso3) %>%
  group_by(country, iso2, iso3) %>%
  distinct() %>%
  extract(col = country, into = "abb3", regex = "([a-zA-Z]{3})", remove = FALSE) %>%
  extract(col = country, into = "abb2", regex = "([a-zA-Z]{2})", remove = FALSE) %>%
  mutate(abb2 = tolower(abb2),
         abb3 = tolower(abb3),
         iso2 = tolower(iso2),
         iso3 = tolower(iso3)) %>%
  filter(abb2 != iso2 & abb3 != iso3)

Unfortunately, the acronyms do not line up 100% evenly, but it seems pretty clear that these are just abbreviations using two and 3 letters.

A clearer way to do this is to just check if there are more than one unique combination for each country.

select(who, country, iso2, iso3) %>%
  distinct() %>%
  group_by(country) %>%
  filter(n() > 1)
who4 %>%
  group_by(country, year, sex) %>%
  summarise(tot_cases = sum(cases))  %>%
  unite(country_sex, c(country, sex), sep = "_", remove = FALSE) %>%
  ggplot(aes(year, tot_cases)) +
  geom_line(aes(group = country_sex, color = sex)) +
  coord_cartesian(xlim = c(1995, 2013))
`summarise()` has grouped output by 'country', 'year'. You can override using the `.groups` argument.

It would probably make most sense to split the country data into quantiles and then examine those lines.

12.7 Non-tidy data

Just because a dataset might be “messy” or not “tidy” does not mean that it is useless. Two main reasons for using other data structures are:

  • Alternative representations may have substantial performance or space advantages.
  • Specialised fields have evolved their own conventions for storing data that may be quite different to the conventions of tidy data.

Tidy data is a nice default, but there are good reasons to use other structures.

---
title: "Chapter 12 - Tidy Data"
output: html_notebook
---

## 12 Tidy Data

### 12.1 Introduction

"Happy families are all alike; every unhappy family is unhappy in its own way." - Leo Tolstoy

"Tidy datasets are all alike, but every messy dataset is messy in its own way." - Hadley Wickham

In this cahpter, you will learn a consisten way to organise your data in R, an organisation called __tidy data__. Getting your data into this format requires some upfront work, but that work pays off in the long term. One you have tidy data and tidy tools provided by packages in the tidyverse, you will spend much less time munging data from one representation to another, allowing you to spend more time on the analytic questions at hand.

### 12.1.1 Prerequisites

In this chapter we'll focus on tidyr, a package that provides a bunch of tools to help tidy up your messy datasets. tidy is a member of the core tidyverse.

```{r}
library(tidyverse)
```

### 12.2 Tidy data

There are many ways to represent the same underlying data, but some datasets are easier to use than others.

There are three interrelated rules which make a dataset tidy:

1. Each variable must have its own column.
2. Each observation must have its own row.
3. Each value must have its own cell.

These three rules are interrelated because it's impossible to only satisfy two of the three. That interrelationship leads to an even simpler set of practical instructions:

1. Put each dataset in a tibble.
2. Put each variable in a column.

In this example, only __table1__ is tidy. It's the only representation where each column is a variable.

2 main advantages to *tidy* data.

1.There's a general advantage to picking one consistent way of storing data. If you have a consistent data structure, it's easier to learn the tools that work with it because they have a underlying uniformity.

2. There's a specific advantage to placing variables in columns because it allows R's vectorised nature to shine. As you learned in __mutate__ and __summary functions__, most built-in R functions work with vectors of values. That makes transforming tidy data feel particularly natural.

dplyr, ggplot2, and all the other packages in the tidyverse are designed to work with *tidy* data.

### 12.2.1 Exercises

1. Using prose, describe how the variables and observations are organised in each of the sample tables.

Table1 is organized with four variables across four columns: country, year, cases and population. It is easy to imagine creating many graphics with this format and faceting easily by country and year.

Table2 is similar to table one - country and year are the same - cases and population have been turned into values; essentially combined to create a new column "type". Count then represents respective numbers of cases and population.

Tables 3 combines the cases and population into a single new column called "rate" - essentially cases/population. While this new data has its own advantages it limits the options available for analysis. It is better to have the raw data and then create a new "rate" variable from the variables in column 1 than to start with the rate.

Tables 4 and 5 are split across two tables and defined by year. The first tibble contains cases data and the second contains population data - making it thus difficult to compare case and poulation data without performing a SQL join.

2. Compute the __rate__ for __table2__, and __table4a__ + __table4b__. You will need to perform four operations:

1. Extract the number of TB cases per country per year.
2. Extract the matching population per country per year.
3. Divide cases by population, and multiply by 10000.
4. Store back in the appropriate place.

Which representation is easiest to work with? Which is hardest? Why?

```{r}
table1 <- tibble(
  country = c(
    rep("Afghanistan", 2),
    rep("Brazil", 2),
    rep("China", 2)
  ),
  year = rep(c(1999, 2000), 3),
  cases = c(745, 2666, 37737, 80488, 212258, 213766),
  population = c(19987071, 20595360, 172006362, 174504898, 1272915272, 1280428583)
)

table2 <- tibble(
  country = c(
    rep("Afghanistan", 4),
    rep("Brazil", 4),
    rep("China", 4)
  ),
  year = rep(c(1999, 1999, 2000, 2000), 3),
  type = rep(c("cases", "population", "cases", "population"), 3),
  count = c(745, 19987071, 2666, 20595360, 37737, 172006362, 80488, 174504898, 212258, 1272915272, 213766, 1280428583)
)

table3 <- tibble(
  country = c(
    rep("Afghanistan", 2),
    rep("Brazil", 2),
    rep("China", 2)
  ),
  year = rep(c(1999, 2000), 3),
  rate = c("745/19987071", "2666/20595360", "37737/172006362", "80488/174504898", "212258/1272915272", "213766/1280428583")
)

table4a <- tibble(
  country = c("Afghanistan", "Brazil", "China"),
  '1999' = c(745, 37737, 212258),
  '2000' = c(2666, 80488, 213766)
)

table4b <- tribble(
~country, ~`1999`, ~`2000`,
#---------|-----------|---------
"Afghanistan", 19987071, 20595360,
"Brazil", 172006362, 174504898,
"China", 1272915272, 1280428583
)
```

```{r}
table2_mod <- tibble(
  country = table2$country[table2$type == "cases"],
  year = table2$year[table2$type == "cases"],
  cases = table2$count[table2$type == "cases"],
  population = table2$count[table2$type == "population"],
  rate = (cases/population) * 10000
)

table2_mod
```

```{r}
table4c <-
  tibble(
    country = table4a$country,
    `1999` = (table4a$`1999`/table4b$`1999`) * 10000,
    `2000` = (table4a$`2000`/table4b$`2000`) * 10000
  )

table4c
```

Neither were particularly easy to work with, but it was more confusing thinking about cases and population in terms of years.

3. Recreate the plot showing change in cases over time using __table2__ instead of __table1__. What do you need to do first?

```{r}
# Example
ggplot(table1, aes(year, cases)) +
  geom_line(aes(group = country), color = "grey50") +
  geom_point(aes(color = country))

# Actual
table2 %>%
  filter(type == "cases") %>%
  ggplot(aes(year, count)) +
  geom_line(aes(group = country), color = "grey50") +
  geom_point(aes(color = country))
```

Just need to filter out "population" data in type column from table.

## 12.3 Pivoting

Most data that you will encounter will be untidy. There are two main reasons:

1. Most people aren't familiar with the principles of tidy data, and it's hard to derive them yourself unless you spend a *lot* of time working with data.

2. Data is often organised to failitate some use other than analysis. For example, data is often organised to make entry as easy as possible.

This means for most real analyses, you'll need to do some tidying. The first step is always to figure out what the variables and observations are. Sometimes this is easy; other times you'll need to consult with the people who originall generated the data. The second step is to resolve one of two common problems:

1. One variable might be spread across multiple columns.

2. One observation might be scattered across multiple rows.

Typically a dataset will only suffer from one of these problems; it'll only suffer from both if you're really unlucky! To fix these problems, you'll need the two most important functions in tidyr: __pivot_longer()__ and __pivot_wider()__.

### 12.3.1 Longer

A common problem is a datset where some of the column names are not names of variables, but *values* of a variable. Take __table4a__: the column names __1999__ and __2000__ represent values of the __year__ variable, the values in the __1999__ and __2000__ columns represent values of the __cases__ variable, and each row represents two observations, not one.

```{r}
table4a
```

To tidy a dataset like this, we need to __pivot__ the offending columns into a new pair of variables. To describe that operation we need three paramaters:

* The set of columns whose names are values, not variables. In this example, those are the columns __1999__ and __2000___.
* The name of the variable to move the column names to. Here it is __year__.
* The name of the variable to move the column values to. Here it's __cases__.

Together those paramaters generate the call to __pivot_longer()__:

```{r}
table4a

table4a %>%
  pivot_longer(cols = c(`1999`, `2000`), names_to = "year", values_to = "cases")
```

The columns to pivot are specified with __dplyr::select()__ style notation. Here there are only two columns, so we list them individually. Note that "1999" and "2000" are non-syntactic names (because they don't start with a letter) so we have to surround them in backticks. To refresh your memory of the other ways to select columns see, __select__.

__year__ and __cases__ do not exist in __table41__ so we put their names in quotes.

In the final result, the pivoted columns are dropped, and we get new __year__ and __cases__ columns. Otherwise, the relationships between the original variables are preserved. Visually, this is shown in __Figure 12.2__.

__pivot_longer()__ makes datasets longer by increasing the number of rows and decreasing the number of columns. I don't believe it makes sense to describe a datset as being in "long form". Length is a relative term, and you can only say (e.g.) that datset A is longer than dataset B.

We can use __pivot_longer()__ to tidy __table4b__ in a similar fashion. The only difference is the variable stored in the cell values:

```{r}
table4b %>%
  pivot_longer(cols = c(`1999`, `2000`), names_to = "year", values_to = "population")
```

To combine the tidied versions of __table4a__ and __table4b__ into a single tibble, we need to use __dplyr::left_join()__, which you'll learn about in relational data.

```{r}
tidy4a <- table4a %>%
  pivot_longer(cols = c(`1999`, `2000`), names_to = "year", values_to = "cases")

tidy4b <- table4b %>%
  pivot_longer(cols = c(`1999`, `2000`), names_to = "year", values_to = "population")
left_join(tidy4a, tidy4b, by = c("country", "year"))
```

## 12.3.2 Wider

__pivot_wider()__ is the opposite of __pivot_longer()__. You use it when an observation is scattered across multiple rows. For example, take __table2__: an observation is a country in a year, but each observation is spread across two rows.

```{r}
table2 %>%
  pivot_wider(id_cols = c(country, year), names_from = type, values_from = count)
```

To tidy this up, we first analyse the representation in similar way to __pivot_longer()__. This time, however, we only need two paramaters:

* The column to take variable names from. Here, it's __type__.
* The column to take values from. Here it's __count__.

Once we've figured that out, we can use __pivot_wider()__, as shown programmatically below, and visually in __Figure 12.3__.

```{r}
# Above we provided __id_cols__, but they don't appear necessary in this example.
table2 %>%
  pivot_wider(names_from = type, values_from = count)
```

As you might have guessed from their names, __pivot_wider()__ and __pivot_longer()__ are complements. __pivot_longer()__ makes wide tables narrower and longer; __pivot_wider()__ makes long tables shorter and wider.

### 12.3.3 Exercises

1. Why are __pivot_longer()__ and __pivot_wider()__ not perfectly symmetrical? Carefully consider the following example:

```{r}
stocks <- tibble(
  year = c(2015, 2015, 2016, 2016),
  half = c( 1, 2, 1, 2),
  return = c(1.88, 0.59, 0.92, 0.17)
)
stocks

stocks %>%
  pivot_wider(names_from = year, values_from = return)

stocks %>%
  pivot_wider(names_from = year, values_from = return) %>%
  pivot_longer(`2015`:`2016`, names_to = "year", values_to = "return")
```

They are not perfectly symmetrical because when you pivot you change the nature of the data (columns, values and observations). For example,
when we pivot_wider from our original dataset, each observation contains the half and return by respective year, but the year is provided by the column (the variable). It is a relatively "smaller" dataset, but it is confusing to think of the yearly returns in terms of variables. Were there no description, this data would be somewhat difficult to understand.

__pivot_longer()__ has a __names_ptypes__ argument, e.g. __names_ptypes = list(year = double())__. What does it do?

```{r}
stocks %>%
  pivot_wider(names_from = year, values_from = return) %>%
  pivot_longer(`2015`:`2016`, names_to = "year", values_to = "return", names_ptypes = list(year = double()))
```

It defines the type, class, and attributes of a vector. It should be used to confirm that the created columns are the types that you expect. It is not to be used to change the column type (hence, the above error message). Instead use __names_transform__ or __values_transform__.

```{r}
stocks %>%
  pivot_wider(names_from = year, values_from = return) %>%
  pivot_longer(`2015`:`2016`, names_to = "year", values_to = "return", names_transform = list(year = as.double)
               )
```

2. Why does this code fail?

```{r}
table4a %>%
  pivot_longer(c(1999, 2000), names_to = "year", values_to = "cases")
```

The code fails because we need to wrap our column variables in backticks -> c(`1999`, `2000`). These names do not follow proper R naming conventions (they do not begin with a letter) so we need to specify what they are - in this case column names.

```{r}
table4a %>%
  pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "cases")
```

3. What would happen if you widen this table? Why? How could you add a new column to uniquely identify each value?

```{r}
people <- tribble(
  ~name,             ~names,    ~values,
  #-----------------|----------|--------
  "Phillip Woods",   "age",     45,
  "Phillip Woods",   "height",  186,
  "Phillip Woods",   "age",     50,
  "Jessica Cordero", "age",     37,
  "Jessica Cordero", "height",   156
)

people
```

Before running a script to widen the only issue we might have is that we have two ages observations for Phillip Woods, but only one height observation - our new dataset won't be balanced. We may be able to accomplish this adding a new column denoting the observation number.

```{r}
people %>%
  group_by(name, names) %>%
  mutate(obs = row_number()) %>%
  pivot_wider(names_from = name, values_from = values)
```

With this particular type of data however it would be better if age and height were the columns and Philip Woods and Jessica Cordero remained in the name column. Ideally, we would just replicate the value of height for Philip Woods because it is unlikely that it is any different for ages 45 and 50.

4. Tidy the simple tibble below. Do you need to make it wider or longer? What are the variables?

```{r}
preg <- tribble(
  ~pregnant, ~male, ~female,
  "yes",     NA,    10,
  "no",      20,    12
)
preg

preg %>%
  pivot_longer(cols = c(male, female), names_to = "gender", values_to = "count")

# Can remove NA value for "pregnant males" using values_drop_na
preg %>%
  pivot_longer(cols = c(male, female), names_to = "gender", values_to = "count", values_drop_na = TRUE)
```

The data seems to make the most sense if we __pivot_longer__. In this example the dataset appears to resemble counts of pregnant males and females (obviously there are no pregnant males), but the data is represented by NA.

## 12.4 Separating and uniting

So far you've learned how to tidy __table2__ and __table4__, but not __table3__. __table3__ has a different problem: we have one column (__rate__) that contains two variables (__cases__ and __population__). To fix this problem, we'll need the __separate()__ function. You'll also learn about the complement of __separate()__: __unite()__, which you use if a single variable is spread across multiple columns.

### 12.4.1 Separate

__separate()__ pulls apart one column into multiple columns, by splitting wherever a separator character appears. Take __table3__:

```{r}
table3 %>%
  separate(col = rate, into = c("cases", "population"), sep = "/")
```

The __rate__ column contains both __cases__ and __population__ variables, and we need to split it into two variables. __separate()__ takes the name of the column to separate, and the names of the columns to separate into, as shown in __Figure 12.4__ and the code above.

Above we specified what the separator should be. However, by default, __separate()__ will split values wherever it sees a non-alphanumeric character (i.e. a character that isn't a number or letter). In this case, we didn't need to specify "/" as the separator, but in more complicated cases it will likely be necessary.

(Formally, __sep__ is a regular expression, which you'll learn more about in __strings__).

Look carefully at the column types: you'll notice that __cases__ and __population__ are character columns. This is the default behavior in __separate()__: it leaves the type of the column as is. Here, however, it's not very useful as those really are numbers. We can ask __separate()__ to try and convert to better types using __convert = TRUE__:

```{r}
table3 %>%
  separate(col = rate, into = c("cases", "population"), sep = "/", convert = TRUE)
```

You can also pass a vector of integers to __sep__. __separate()__ will interpret the integers as positions to split at. Positive values start at 1 on the far-left of the strings; negative values start at -1 on the far-right of the strings. When using integers to separate strings, the length of __sep__ should be one less than the number of names in __into__. 

You can use this arrangement to separate the last two digits of each year. This make this data less tidy, but is useful in other cases, as you'll see in a little bit.

```{r}
table3 %>%
  separate(col = year, into = c("century", "year"), sep = 2)
```

### 12.4.2 Unite

__unite()__ is the inverse of __separate()__: it combines multiple columns into a single column. You'll need it much less frequrently than __separate()__, but it's still a useful tool to have in your back pocket.


We can use __unite()__ to rejoin the *century* and *year* columns that we created in the last example. That data is saved as __tidyr::table5__. __unite__ takes a data frame, the name of the new variable to create, and a set of columns to combine, again specified in __dplyr::select()__ style:

```{r}
table5 %>%
  unite(col = year, c(century, year), sep = "")
```

### 12.4.3 Exercises

1. What do the __extra__ and __fill__ arguments do in __separate()__? Experiment with the various options for the following two oy datasets.

*extra* controls what happens when there are too many pieces when __sep__ is a character vector. There are three valid options:

* "warn" (the default): emit a warning and drop extra values.
* "drop": drop any extra values without a warning.
* "merge": only splits at most length(into) times

*fill* If __sep__ is a character vector, this controls what happens when there are not enough pieces. There are three valid options:

* "warn" (the default): emit a warning and fill from the right
* "right": fill with missing values on the right
* "left": fill with missing values on the left

```{r}
# In these examples we have "too much data".

# Emit a warning and drop extra values.
tibble(x = c("a,b,c", "d,e,f,g", "h,i,j")) %>%
  separate(x, c("one", "two", "three"), extra = "warn")

# Drop any extra values without a warning
tibble(x = c("a,b,c", "d,e,f,g", "h,i,j")) %>%
  separate(x, c("one", "two", "three"), extra = "drop")

# Only splits at most length(into) times
tibble(x = c("a,b,c", "d,e,f,g", "h,i,j")) %>%
  separate(x, c("one", "two", "three"), extra = "merge")
```

```{r}
# In these examples the command expects "more data".

# Emit a warning and fill from the right
tibble(x = c("a,b,c", "d,e", "f,g,i")) %>% 
  separate(x, c("one", "two", "three"), fill = "warn")

# Fill with missing values on the right
tibble(x = c("a,b,c", "d,e", "f,g,i")) %>% 
  separate(x, c("one", "two", "three"), fill = "right")

# Fill with missing values on the left
tibble(x = c("a,b,c", "d,e", "f,g,i")) %>% 
  separate(x, c("one", "two", "three"), fill = "left")
```

2. Both __unite()__ and __separate()__ have a __remove__ argument. What does it do? Why would you set it to __FALSE__?

*remove* If TRUE, remove input column from output data frame. You would set it to __FALSE__ if you want to create a new variable, but keep the old one. It defaults to TRUE.

3. Compare and contrast __separate()__ and __extract()__. Why are there three variations of separation (by position, by separator, and with groups), but only one unite?

Briefly mentioned earlier, __separate()__ is a type of regular expression or *regex*. Thus, when using __separate()__ you are telling the built-in regex function how to separate the data, i.e. by position or delimiter. However, __extract()__ allows for a little bit more flexibility because it lets you write the regular expression dictating how to split up the data.

For example, in the example below __separate()__ is unable to properly separate the data because there is no delimiter to separate by and the positions vary (in some cases there are two characters and in some cases there is only one).

```{r}
# Failed parse
tibble(x = c("X1", "X20", "AA11", "AA2")) %>%
  separate(col = x, into = c("variable", "id"), sep = 1)
# Failed parse
tibble(x = c("X1", "X20", "AA11", "AA2")) %>%
  separate(col = x, into = c("variable", "id"), sep = 2)

# Failed parse
tibble(x = c("X1", "X20", "AA11", "AA2")) %>%
  separate(col = x, into = c("variable", "id"), sep = "")

# Using extract
tibble(x = c("X1", "X20", "AA11", "AA2")) %>%
  extract(col = x, into = c("variable", "id"), regex = "([A-Z]+)([0-9]+)")
```

In the regular expression above, the parentheses indicate what each segment to be read should look like. Thus, for variable we are passing on the condition that it must be alphabetical [A-Z] and can be of any length (+). The second term, id, must be numeric [0-9] and can be of any length (+). The entire expression must be surrounded by quotes.

Thus when separating a single variable into multiple columns there can be multiple criteria by which you would like to separate.

However, as we'll see in the example below when uniting data from multiple columns there is always only one choice as to how to combine their contents.

```{r}
# Unite
tibble(x = c("A", "B", "C", "D"),
       y = c("1", "2", "3", "4")) %>%
  unite(col = "XY", c(x, y), sep = "_")

# OR
tibble(x = c("A", "B", "C", "D"),
       y = c("1", "2", "3", "4")) %>%
  unite(col = "XY", c(x, y), sep = "")
```

## 12.5 Missing values

Changing the representation of a datset brings up an important subtlety of missing values. Surprisingly, a value can be missing in one of two possible ways:

* __Explicily__, i.e. flagged with __NA__.
* __Implicitly__, i.e. simply not present in the data.

Let's illustrate this idea with a very simple data set:

```{r}
stocks <- tibble(
  year   = c(2015, 2015, 2015, 2015, 2016, 2016, 2016),
  qtr    = c(   1,    2,    3,    4,    2,    3,    4),
  return = c(1.88, 0.59, 0.35,   NA, 0.92, 0.17, 2.66)
)

stocks
```

There are two missing values in this dataset:

* The return for the fourth quarter of 2015 is explicitly missing, because the cell where its value should be instead contains __NA__.
* The return for the first quarter of 2016 is implicitly missing, because it simplyl does not appear in the dataset.

One way to think about the difference is with this Zen-like koan: An explicit missing value is the presence of an absence; an implicit missing value is the absence of a presence.

The way that a dataset is represented can make implicit values explicit. For example, we can make the implicit missing value explicit by putting years in the columns:

```{r}
stocks %>%
  pivot_wider(names_from = year, values_from = return)
```

Because these explicit missing values may not be important in other representations of the data, you can set __values_drop_na = TRUE__ in __pivot_longer()__ to turn explicit missing values implicit:

```{r}
stocks %>%
  pivot_wider(names_from = year, values_from = return) %>%
  pivot_longer(
    cols = c(`2015`, `2016`),
    names_to = "year",
    values_to = "return",
    values_drop_na = TRUE
  )
```

Another important tool for making missing values explicit in tidy data is __complete()__:

```{r}
stocks %>%
  complete(year, qtr)
```

__complete()__ takes a set of columns, and finds all unique combinations. It then ensures the original dataset contains all those values, filling in explicit NAs where necessary. There's one other important tool that you should know for working with missing values. Sometimes when a data source has primarily been used for data entry, missing values indicate that the previous value should be carried forward:

```{r}
treatment <- tribble(
  ~ person,           ~ treatment, ~response,
  "Derrick Whitmore", 1,           7,
  NA,                 2,           10,
  NA,                 3,           9,
  "Katherine Burke",  1,           4
)
```

You can fill in these missing values with __fill()__. It takes a set of columns where you want missing values to be replaced by the most recent non-missing value (sometimes called last observation carried forward).

```{r}
treatment %>%
  fill(person, .direction = "down")
```

### 12.5.1 Exercises

1. Compare and contrast the __fill__ arguments to __pivot_wider()__ and __complete()__.

__Fill__ does not make changes to the structure of the data. It does not reorganize columns. Pivot_wider takes the values in a column and creates new columns with them. Complete finds all existing combinations using a set of columns and then ensures the original dataset contains all those values, filling in explicit NAs where necessary.

Most likely you would pivot_wider and then complete which would likely create some NAs, but those NAs could possibly be filled in using fill.

2. What does the direction argument to __fill()__ do?

*.direction* indicates in what direction the fill should happen c("down", "up", "downup", "updown"). Down and up are fairly intuitive, but "downup" means first down and then up and "updown" means first up and then down.

# 12.6 Case Study

Tidying a messy dataset from the World Health Organization on tuberculosis cases.

Here we move all of the columns that seem to provide case data to a new variable called "key" and move the values from the respective columns into the new column "cases".

```{r}
who1 <- who %>%
  pivot_longer(cols = new_sp_m014:newrel_f65,
               names_to = "key",
               values_to = "cases",
               values_drop_na = TRUE)
who1
```

We can get some hint of the structure of the values in the new __key__ column by counting them:

```{r}
who1 %>%
  count(key)
```

According to the data dictionary:

1. The first three letters of each column denote whether the column contains new or old cases of TB. In this dataset, each column contains new cases.

2. The next two letters descripe the type of TB:

* __rel__ stands for cases of relapse
* __ep__ stands for cases of extrapulmonary TB
* __sn__ stands for cases of pulmonary TB that could not be diagnosed by a pulmonary smear (smear negative)
* __sp__ stands for cases of pulmonary TB that could be diagnosed by a pulmonary smear (smear positive)

3. The sixth letter gives the sex of TB patients. The dataset groups cases by males (__m__) and females (__f__).

4. The remaining numbers gives the age group. The dataset groups cases into seven age groups:

* __014__ = 0 - 14 years old
* __1524__ = 15 - 24 years old
* __2534__ = 25 - 34 years old
* __ 3544__ = 35 - 44 years old
* __4544__ = 45 - 54 years old
* __5564__ = 55 - 64 years old
* __65__ = 65 or older

We need to make a minor fix to the format of the column names: unfortunately the names are slightly inconsisten because instead of __new_rel__ we have __newrel__ (it's hard to spot this here but if you don't fix it we'll get errors in subsequent steps). You'll learn about __str_replace()__ in __strings__, but the basic idea is pretty simple: replace the characters "newrel" with "new_rel". This makes all variable names consistent.

```{r}
who2 <- who1 %>%
  mutate(key = str_replace(key, "newrel", "new_rel"))
who2 %>%
  filter(grepl("new_rel", key, ignore.case = TRUE) == TRUE)
```

We can separate the values in each code with two passes of __separate()__. The first pass will split the codes at each underscore.

```{r}
who3 <- who2 %>%
  separate(col = key, into = c("newold", "type", "sexage"), sep = "_") %>%
  separate(col = sexage, into = c("sex", "age"), sep = 1)

who3

who4 <- subset(who3, select = -c(iso2, iso3, newold))
who4
```

The __who__ dataset is now tidy!!!

In future cases, rather than performing each operation one code at a time, we would build up a complex pipe to perform all of the operations in a single chunk of code.

```{r}
who %>%
  pivot_longer(
    cols = new_sp_m014:newrel_f65, 
    names_to = "key", 
    values_to = "cases", 
    values_drop_na = TRUE
  ) %>% 
  mutate(
    key = stringr::str_replace(key, "newrel", "new_rel")
  ) %>%
  separate(key, c("new", "var", "sexage")) %>% 
  select(-new, -iso2, -iso3) %>% 
  separate(sexage, c("sex", "age"), sep = 1)
```

### 12.6.1 Exercises

1. In this case study I set __values_drop_na = TRUE__ just to make it easier to check that we had the correct values. Is this reasonable? Think about how missing values are represented in this dataset. Are there implicit missing values? What's the difference between an __NA__ and zero?

While it might not be advisable considering we started off knowing very little about the data, in hindsight it appears fine. The data is structured to show the number of tuberculosis cases by country, year, type, sex, and age. This means that for 219 countries represented in the dataset, they should each have 34 years of data (1980 - 2013), on 4 distinct types of tubeculosis and 7 age groups.

```{r}
# Check if there are true zeroes in the data (NAs don't mean zero)
who1 %>%
  filter(cases == 0) %>%
  nrow()
```

There are 11,080 rows in the data where the number of cases equals zero.

An __NA__ means nothing was explicitly filled in for that value. A zero means a zero was entered.

```{r}
# Check if country-year group can be missing some data, but not all.
pivot_longer(who, c(new_sp_m014:newrel_f65), names_to = "key", values_to = "cases") %>%
  group_by(country, year) %>%
  mutate(prop_missing = sum(is.na(cases))/n()) %>%
  filter(prop_missing > 0, prop_missing < 1)
```

Finally, I will check for implicit missing values. Implicit missing values are (__year__, __country__) combinations that do not apear in the data.

```{r}
# Check for implicit values.
nrow(who)

who %>%
  complete(country, year) %>%
  nrow()
```

```{r}
anti_join(complete(who, country, year), who, by = c("country", "year")) %>%
  select(country, year) %>%
  group_by(country) %>%
  summarise(min_year = min(year), max_year = max(year))
```

To summarize:

* 0 is used to represent no cases of TB.
* Explicit missing values (__NA__s) are used to represent missing data for (__country__, __year__) combinations in which the country existed in that year.
* Implicit missing values are used to represent missing data because a country did not exist in that year.

2. What happens if you neglect the __mutate()__ step? (__mutate(names_from = stringr::str_replace(key, "newrel", "new_rel")))?

```{r}
who %>%
  pivot_longer(
    cols = new_sp_m014:newrel_f65, 
    names_to = "key", 
    values_to = "cases", 
    values_drop_na = TRUE
  ) %>% 
  mutate(
    key = stringr::str_replace(key, "newrel", "new_rel")
  ) %>%
  separate(key, c("new", "var", "sexage")) %>% 
  select(-new, -iso2, -iso3) %>% 
  separate(sexage, c("sex", "age"), sep = 1)
```

```{r}
who %>%
  pivot_longer(
    cols = new_sp_m014:newrel_f65, 
    names_to = "key", 
    values_to = "cases", 
    values_drop_na = TRUE
  ) %>%
  separate(key, c("new", "var", "sexage")) %>% 
  select(-new, -iso2, -iso3) %>% 
  separate(sexage, c("sex", "age"), sep = 1)
```

If you neglect the mutate step you get an error message when you try to separate the key variable because when it tries to separate by the "_" delimiter it does not find one and so NAs are created.

```{r}
who %>%
  select(country, iso2, iso3) %>%
  group_by(country, iso2, iso3) %>%
  distinct() %>%
  extract(col = country, into = "abb3", regex = "([a-zA-Z]{3})", remove = FALSE) %>%
  extract(col = country, into = "abb2", regex = "([a-zA-Z]{2})", remove = FALSE) %>%
  mutate(abb2 = tolower(abb2),
         abb3 = tolower(abb3),
         iso2 = tolower(iso2),
         iso3 = tolower(iso3)) %>%
  filter(abb2 != iso2 & abb3 != iso3)
```

Unfortunately, the acronyms do not line up 100% evenly, but it seems pretty clear that these are just abbreviations using two and 3 letters.

A clearer way to do this is to just check if there are more than one unique combination for each country.

```{r}
select(who, country, iso2, iso3) %>%
  distinct() %>%
  group_by(country) %>%
  filter(n() > 1)
```

```{r}
who4 %>%
  group_by(country, year, sex) %>%
  summarise(tot_cases = sum(cases))  %>%
  unite(country_sex, c(country, sex), sep = "_", remove = FALSE) %>%
  ggplot(aes(year, tot_cases)) +
  geom_line(aes(group = country_sex, color = sex)) +
  coord_cartesian(xlim = c(1995, 2013))
```

It would probably make most sense to split the country data into quantiles and then examine those lines.

## 12.7 Non-tidy data

Just because a dataset might be "messy" or not "tidy" does not mean that it is useless. Two main reasons for using other data structures are:

* Alternative representations may have substantial performance or space advantages.
* Specialised fields have evolved their own conventions for storing data that may be quite different to the conventions of tidy data.

Tidy data is a nice default, but there are good reasons to use other structures.









