Case Study
To finish off this section, let’s pull together everything you’ve learned to tackle a realistic data tidying problem. The tidyr::who
dataset contains tuberculosis (TB) cases broken down by year, country, age, gender, and diagnosis method. The data comes from the 2014 World Health Organization Global Tuberculosis Report, available at http://www.who.int/tb/country/data/download/en/.
There’s a wealth of epidemiological information in this dataset, but it’s challenging to work with the data in the form that it’s provided:
who
This is a very typical real-life example dataset. It contains redundant columns, odd variable codes, and many missing values. In short, who
is messy, and we’ll need multiple steps to tidy it. Like dplyr, tidyr is designed so that each function does one thing well. That means in real-life situations you’ll usually need to string together multiple verbs into a pipeline.
The best place to start is almost always to gather together the columns that are not variables. Let’s have a look at what we’ve got:
- It looks like
country
, iso2
, and iso3
are three variables that redundantly specify the country.
year
is clearly also a variable.
- We don’t know what all the other columns are yet, but given the structure in the variable names (e.g.
new_sp_m014
, new_ep_m014
, new_ep_f014
) these are likely to be values, not variables.
So we need to pivot together all the columns from new_sp_m014
to newrel_f65
. We don’t know what those values represent yet, so we’ll give them the generic name "key"
. We know the cells represent the count of cases, so we’ll use the variable cases
. There are a lot of missing values in the current representation, so for now we’ll use values_drop_na = TRUE
just so we can focus on the values that are present.
who1 <- who %>%
pivot_longer(
cols = new_sp_m014:newrel_f65,
names_to = "key",
values_to = "cases",
values_drop_na = TRUE
)
who1
We can get some hint of the structure of the values in the new key column by counting them:
who1 %>%
count(key)
You might be able to parse this out by yourself with a little thought and some experimentation, but luckily we have the data dictionary handy. It tells us:
The first three letters of each column denote whether the column contains new or old cases of TB. In this dataset, each column contains new cases.
The next two letters describe the type of TB:
rel
stands for cases of relapse
ep
stands for cases of extrapulmonary TB
sn
stands for cases of pulmonary TB that could not be diagnosed by a pulmonary smear (smear negative)
sp
stands for cases of pulmonary TB that could be diagnosed be a pulmonary smear (smear positive)
The sixth letter gives the sex of TB patients. The dataset groups cases by males (m
) and females (f
).
The remaining numbers gives the age group. The dataset groups cases into seven age groups:
014
= 0 – 14 years old
1524
= 15 – 24 years old
2534
= 25 – 34 years old
3544
= 35 – 44 years old
4554
= 45 – 54 years old
5564
= 55 – 64 years old
65
= 65 or older
We need to make a minor fix to the format of the column names: unfortunately the names are slightly inconsistent because instead of new_rel
we have newrel
(it’s hard to spot this here but if you don’t fix it we’ll get errors in subsequent steps). You’ll learn about str_replace()
in strings, but the basic idea is pretty simple: replace the characters “newrel” with “new_rel”. This makes all variable names consistent.
who2 <- who1 %>%
mutate(names_from = stringr::str_replace(key, "newrel", "new_rel"))
who2
We can separate the values in each code with two passes of separate()
. The first pass will split the codes at each underscore.
who3 <- who2 %>%
separate(key, c("new", "type", "sexage"), sep = "_")
Expected 3 pieces. Missing pieces filled with `NA` in 2580 rows [243, 244, 679, 680, 681, 682, 683, 684, 685, 686, 687, 688, 689, 690, 691, 692, 903, 904, 905, 906, ...].
who3
Then we might as well drop the new
column because it’s constant in this dataset. While we’re dropping columns, let’s also drop iso2
and iso3
since they’re redundant.
who3 %>%
count(new)
who4 <- who3 %>%
select(-new, -iso2, -iso3)
Next we’ll separate sexage
into sex
and age
by splitting after the first character:
who5 <- who4 %>%
separate(sexage, c("sex", "age"), sep = 1)
who5
The who
dataset is now tidy!
I’ve shown you the code a piece at a time, assigning each interim result to a new variable. This typically isn’t how you’d work interactively. Instead, you’d gradually build up a complex pipe:
who %>%
pivot_longer(
cols = new_sp_m014:newrel_f65,
names_to = "key",
values_to = "cases",
values_drop_na = TRUE
) %>%
mutate(
key = stringr::str_replace(key, "newrel", "new_rel")
) %>%
separate(key, c("new", "var", "sexage")) %>%
select(-new, -iso2, -iso3) %>%
separate(sexage, c("sex", "age"), sep = 1)
Exercises
1. In this case study, I set values_drop_na = TRUE
just to make it easier to check that we had the correct values. Is this reasonable? Think about how missing values are represented in this dataset. Are there implicit missing values? What’s the difference between an NA and zero?
The reasonableness of using values_drop_na = TRUE
depends on how missing values are represented in this dataset. The main concern is whether a missing value means that there were no cases of TB or whether it means that the WHO does not have data on the number of TB cases. Here are some things we should look for to help distinguish between these cases.
If there are no 0 values in the data, then missing values may be used to indicate no cases.
If there are both explicit and implicit missing values, then it suggests that missing values are being used differently. In that case, it is likely that explicit missing values would mean no cases, and implicit missing values would mean no data on the number of cases.
First, I’ll check for the presence of zeros in the data.
who1 %>%
filter(cases == 0) %>%
nrow()
[1] 11080
There are zeros in the data, so it appears that cases of zero TB are explicitly indicated, and the value ofNA is used to indicate missing data.
Second, I should check whether all values for a (country, year) are missing or whether it is possible for only some columns to be missing.
pivot_longer(who, new_sp_m014:newrel_f65, names_to = "key", values_to = "cases") %>%
group_by(country, year) %>%
mutate(prop_missing = sum(is.na(cases)) / n()) %>%
filter(prop_missing > 0, prop_missing < 1)
From the results above, it looks like it is possible for a (country, year) row to contain some, but not all, missing values in its columns.
Finally, I will check for implicit missing values. Implicit missing values are (year
, country
) combinations that do not appear in the data.
nrow(who)
[1] 7240
who %>%
complete(country, year) %>%
nrow()
[1] 7446
Since the number of complete cases of (country
, year
) is greater than the number of rows in who, there are some implicit values. But that doesn’t tell us what those implicit missing values are. To do this, I will use the anti_join()
function introduced in the later Relational Data lecture.
anti_join(complete(who, country, year), who, by = c("country", "year")) %>%
select(country, year) %>%
group_by(country) %>%
# so I can make better sense of the years
summarise(min_year = min(year), max_year = max(year))
All of these refer to (country
, year
) combinations for years prior to the existence of the country. For example, Timor-Leste achieved independence in 2002, so years prior to that are not included in the data.
To summarize:
0
is used to represent no cases of TB.
- Explicit missing values (
NA
s) are used to represent missing data for (country
, year
) combinations in which the country existed in that year.
- Implicit missing values are used to represent missing data because a country did not exist in that year.
2. What happens if you neglect the mutate()
step? (mutate(key = str_replace(key, "newrel", "new_rel")
)
The separate()
function emits the warning “too few values”. If we check the rows for keys beginning with "newrel_"
, we see that sexage
is missing, and type = m014
.
who3a <- who1 %>%
separate(key, c("new", "type", "sexage"), sep = "_")
Expected 3 pieces. Missing pieces filled with `NA` in 2580 rows [243, 244, 679, 680, 681, 682, 683, 684, 685, 686, 687, 688, 689, 690, 691, 692, 903, 904, 905, 906, ...].
filter(who3a, new == "newrel") %>% head()
3. I claimed that iso2
and iso3
were redundant with country. Confirm this claim.
If iso2
and iso3
are redundant with country, then, within each country, there should only be one distinct combination of iso2
and iso3
values, which is the case.
select(who3, country, iso2, iso3) %>%
distinct() %>%
group_by(country) %>%
filter(n() > 1)
This makes sense, since iso2
and iso3
contain the 2- and 3-letter country abbreviations for the country. The iso2
variable contains each country’s ISO 3166 alpha-2, and the iso3
variable contains each country’s ISO 3166 alpha-3 abbreviation. You may recognize the ISO 3166-2 abbreviations, since they are almost identical to internet country-code top level domains, such as .uk
(United Kingdom), .ly
(Libya), .tv
(Tuvalu), and .io
(British Indian Ocean Territory).
---
title: "Tidy Data Case Study"
output: 
  html_notebook:
    toc: true
    toc_float: true
---

```{r}
suppressPackageStartupMessages(library("tidyverse"))
```

## Case Study

To finish off this section, let’s pull together everything you’ve learned to tackle a realistic data tidying problem. The `tidyr::who` dataset contains tuberculosis (TB) cases broken down by year, country, age, gender, and diagnosis method. The data comes from the *2014 World Health Organization Global Tuberculosis Report*, available at <http://www.who.int/tb/country/data/download/en/>.

There’s a wealth of epidemiological information in this dataset, but it’s challenging to work with the data in the form that it’s provided:

```{r}
who
```

This is a very typical real-life example dataset. It contains redundant columns, odd variable codes, and many missing values. In short, `who` is messy, and we’ll need multiple steps to tidy it. Like dplyr, tidyr is designed so that each function does one thing well. That means in real-life situations you’ll usually need to string together multiple verbs into a pipeline.

The best place to start is almost always to gather together the columns that are not variables. Let’s have a look at what we’ve got:

 - It looks like `country`, `iso2`, and `iso3` are three variables that redundantly specify the country.
 - `year` is clearly also a variable.
 - We don’t know what all the other columns are yet, but given the structure in the variable names (e.g. `new_sp_m014`, `new_ep_m014`, `new_ep_f014`) these are likely to be values, not variables.

So we need to pivot together all the columns from `new_sp_m014` to `newrel_f65`. We don’t know what those values represent yet, so we’ll give them the generic name `"key"`. We know the cells represent the count of cases, so we’ll use the variable `cases`. There are a lot of missing values in the current representation, so for now we’ll use `values_drop_na = TRUE` just so we can focus on the values that are present.

```{r}
who1 <- who %>% 
  pivot_longer(
    cols = new_sp_m014:newrel_f65, 
    names_to = "key", 
    values_to = "cases", 
    values_drop_na = TRUE
  )
who1
```

We can get some hint of the structure of the values in the new key column by counting them:

```{r}
who1 %>% 
  count(key)
```

You might be able to parse this out by yourself with a little thought and some experimentation, but luckily we have the data dictionary handy. It tells us:

1. The first three letters of each column denote whether the column contains new or old cases of TB. In this dataset, each column contains new cases.

2. The next two letters describe the type of TB:

    - `rel` stands for cases of relapse
    - `ep` stands for cases of extrapulmonary TB
    - `sn` stands for cases of pulmonary TB that could not be diagnosed by a pulmonary smear (smear negative)
    - `sp` stands for cases of pulmonary TB that could be diagnosed be a pulmonary smear (smear positive)

3. The sixth letter gives the sex of TB patients. The dataset groups cases by males (`m`) and females (`f`).

4. The remaining numbers gives the age group. The dataset groups cases into seven age groups:

    - `014` = 0 – 14 years old
    - `1524` = 15 – 24 years old
    - `2534` = 25 – 34 years old
    - `3544` = 35 – 44 years old
    - `4554` = 45 – 54 years old
    - `5564` = 55 – 64 years old
    - `65` = 65 or older

We need to make a minor fix to the format of the column names: unfortunately the names are slightly inconsistent because instead of `new_rel` we have `newrel` (it’s hard to spot this here but if you don’t fix it we’ll get errors in subsequent steps). You’ll learn about `str_replace()` in strings, but the basic idea is pretty simple: replace the characters “newrel” with “new_rel”. This makes all variable names consistent.

```{r}
who2 <- who1 %>% 
  mutate(names_from = stringr::str_replace(key, "newrel", "new_rel"))
who2
```

We can separate the values in each code with two passes of `separate()`. The first pass will split the codes at each underscore.

```{r}
who3 <- who2 %>% 
  separate(key, c("new", "type", "sexage"), sep = "_")
who3
```

Then we might as well drop the `new` column because it’s constant in this dataset. While we’re dropping columns, let’s also drop `iso2` and `iso3` since they’re redundant.

```{r}
who3 %>% 
  count(new)
who4 <- who3 %>% 
  select(-new, -iso2, -iso3)
```

Next we’ll separate `sexage` into `sex` and `age` by splitting after the first character:

```{r}
who5 <- who4 %>% 
  separate(sexage, c("sex", "age"), sep = 1)
who5
```

The `who` dataset is now tidy!

I’ve shown you the code a piece at a time, assigning each interim result to a new variable. This typically isn’t how you’d work interactively. Instead, you’d gradually build up a complex pipe:

```{r}
who %>%
  pivot_longer(
    cols = new_sp_m014:newrel_f65, 
    names_to = "key", 
    values_to = "cases", 
    values_drop_na = TRUE
  ) %>% 
  mutate(
    key = stringr::str_replace(key, "newrel", "new_rel")
  ) %>%
  separate(key, c("new", "var", "sexage")) %>% 
  select(-new, -iso2, -iso3) %>% 
  separate(sexage, c("sex", "age"), sep = 1)
```

## Exercises

### 1. In this case study, I set `values_drop_na = TRUE` just to make it easier to check that we had the correct values. Is this reasonable? Think about how missing values are represented in this dataset. Are there implicit missing values? What’s the difference between an NA and zero?

The reasonableness of using `values_drop_na = TRUE` depends on how missing values are represented in this dataset. The main concern is whether a missing value means that there were no cases of TB or whether it means that the WHO does not have data on the number of TB cases. Here are some things we should look for to help distinguish between these cases.

 - If there are no 0 values in the data, then missing values may be used to indicate no cases.

 - If there are both explicit and implicit missing values, then it suggests that missing values are being used differently. In that case, it is likely that explicit missing values would mean no cases, and implicit missing values would mean no data on the number of cases.

First, I’ll check for the presence of zeros in the data.

```{r}
who1 %>%
  filter(cases == 0) %>%
  nrow()
```

There are zeros in the data, so it appears that cases of zero TB are explicitly indicated, and the value ofNA is used to indicate missing data.

Second, I should check whether all values for a (country, year) are missing or whether it is possible for only some columns to be missing.

```{r}
pivot_longer(who, new_sp_m014:newrel_f65, names_to = "key", values_to = "cases") %>%
  group_by(country, year) %>%
  mutate(prop_missing = sum(is.na(cases)) / n()) %>%
  filter(prop_missing > 0, prop_missing < 1)
```

From the results above, it looks like it is possible for a (country, year) row to contain some, but not all, missing values in its columns.

Finally, I will check for implicit missing values. Implicit missing values are (`year`, `country`) combinations that do not appear in the data.

```{r}
nrow(who)

who %>%
  complete(country, year) %>%
  nrow()
```

Since the number of complete cases of (`country`, `year`) is greater than the number of rows in who, there are some implicit values. But that doesn’t tell us what those implicit missing values are. To do this, I will use the `anti_join()` function introduced in the later Relational Data lecture.

```{r}
anti_join(complete(who, country, year), who, by = c("country", "year")) %>%
  select(country, year) %>%
  group_by(country) %>%
  # so I can make better sense of the years
  summarise(min_year = min(year), max_year = max(year))
```

All of these refer to (`country`, `year`) combinations for years prior to the existence of the country. For example, Timor-Leste achieved independence in 2002, so years prior to that are not included in the data.

To summarize:

 - `0` is used to represent no cases of TB.
 - Explicit missing values (`NA`s) are used to represent missing data for (`country`, `year`) combinations in which the country existed in that year.
 - Implicit missing values are used to represent missing data because a country did not exist in that year.
 
### 2. What happens if you neglect the `mutate()` step? (`mutate(key = str_replace(key, "newrel", "new_rel")`)

The `separate()` function emits the warning “too few values”. If we check the rows for keys beginning with `"newrel_"`, we see that `sexage` is missing, and `type = m014`.

```{r}
who3a <- who1 %>%
  separate(key, c("new", "type", "sexage"), sep = "_")
filter(who3a, new == "newrel") %>% head()
```

### 3. I claimed that `iso2` and `iso3` were redundant with country. Confirm this claim.

If `iso2` and `iso3` are redundant with country, then, within each country, there should only be one distinct combination of `iso2` and `iso3` values, which is the case.

```{r}
select(who3, country, iso2, iso3) %>%
  distinct() %>%
  group_by(country) %>%
  filter(n() > 1)
```

This makes sense, since `iso2` and `iso3` contain the 2- and 3-letter country abbreviations for the country. The `iso2` variable contains each country’s [ISO 3166 alpha-2](https://en.wikipedia.org/wiki/ISO_3166-1_alpha-2), and the `iso3` variable contains each country’s [ISO 3166 alpha-3](https://en.wikipedia.org/wiki/ISO_3166-1_alpha-3) abbreviation. You may recognize the ISO 3166-2 abbreviations, since they are almost identical to internet [country-code top level domains](https://en.wikipedia.org/wiki/Country_code_top-level_domain), such as `.uk` (United Kingdom), `.ly` (Libya), `.tv` (Tuvalu), and `.io` (British Indian Ocean Territory).

### 4. For each country, year, and sex compute the total number of cases of TB. Make an informative visualization of the data.

```{r}
who5 %>%
  group_by(country, year, sex) %>%
  filter(year > 1995) %>%
  summarise(cases = sum(cases)) %>%
  unite(country_sex, country, sex, remove = FALSE) %>%
  ggplot(aes(x = year, y = cases, group = country_sex, colour = sex)) +
  geom_line()
```

A small multiples plot faceting by country is difficult given the number of countries. Focusing on those countries with the largest changes or absolute magnitudes after providing the context above is another option.