L07 Tidy Data

Data Science 1 with R (STAT 301-1)

Author

YOUR NAME

Load Packages

Code

# Loading package(s)
library(tidyverse)
library(dplyr)
data(table1)
data(table2)
data(table4a)
data(table4b)
data(who)

Datasets

All datasets are either defined inline or provided within the core tidyverse packages (table1, table2, table4a, table4a, who).

Exercises

Exercise 1

Follow these four steps to compute the rate per 10,000 once using only table2, and again using table4a + table4b:

Extract the number of TB cases per country per year.
Extract the matching population numbers per country per year.
Divide case numbers by population numbers and multiply by 10000.
Store in a new tibble.

Code

table2 %>% 
  pivot_wider(names_from = type, values_from = count) %>%
  mutate(rate = cases / population * 10000) %>%
  tibble()

# A tibble: 6 × 5
  country      year  cases population  rate
  <chr>       <int>  <int>      <int> <dbl>
1 Afghanistan  1999    745   19987071 0.373
2 Afghanistan  2000   2666   20595360 1.29 
3 Brazil       1999  37737  172006362 2.19 
4 Brazil       2000  80488  174504898 4.61 
5 China        1999 212258 1272915272 1.67 
6 China        2000 213766 1280428583 1.67

Code

tidy4a <- table4a %>% 
  pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "cases")
tidy4b <- table4b %>% 
  pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "population")
left_join(tidy4a, tidy4b) %>%
  mutate(rate = cases / population * 10000) %>%
           tibble()

# A tibble: 6 × 5
  country     year   cases population  rate
  <chr>       <chr>  <int>      <int> <dbl>
1 Afghanistan 1999     745   19987071 0.373
2 Afghanistan 2000    2666   20595360 1.29 
3 Brazil      1999   37737  172006362 2.19 
4 Brazil      2000   80488  174504898 4.61 
5 China       1999  212258 1272915272 1.67 
6 China       2000  213766 1280428583 1.67

Which representation is easiest to work with? Which is hardest? Why?

Solution

It is significantly more challenging to both create and interpret a representation of the data as the steps order using baser as opposed to using tidy data, as it is presented as an organizaed table as opposed to a vector/list of values. With larger/more complicated datasets, this becomes a greater issue.

Exercise 2

Recreate the plot below showing the change in cases over time using table2 instead of table1. What do you need to do first?

Code

# Change over time in number of TB cases by country
ggplot(table1, aes(year, cases)) + 
  geom_line(aes(group = country), colour = "grey50") + 
  geom_point(aes(colour = country))

Code

table2 %>% pivot_wider(names_from = type, values_from = count) %>%
ggplot(aes(year, cases)) + 
  geom_line(aes(group = country), colour = "grey50") + 
  geom_point(aes(colour = country))

Solution

We need to tidy the data by widening it to add an extra column, so we can specifically look at the number of cases per country. Without it, there is no unique coumn for cases, so a plot would not be able to be made.

Exercise 3

Why does the provided code fail? Fix it.

Code

table4a %>% 
  pivot_longer(c(1999, 2000), names_to = "year", values_to = "cases")

Code

table4a %>% 
  pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "cases")

# A tibble: 6 × 3
  country     year   cases
  <chr>       <chr>  <int>
1 Afghanistan 1999     745
2 Afghanistan 2000    2666
3 Brazil      1999   37737
4 Brazil      2000   80488
5 China       1999  212258
6 China       2000  213766

Solution

Without backticks, 1999 and 2000 read as to numbers, as opposed to the column headers they serve as in the initial table4a. Thus, you will get an error message saying that “locations 1999 and 2000 don’t exist.” To fix this, add backticks.

Exercise 4

What happens if you use pivot_wider() on this table so that we have a dataset with 3 columns (responent_name, age, height)?

Fix the issue by adding a new column.

Code

people <- tribble(
  ~respondent_name,  ~key,    ~value,
  #-----------------|--------|------
  "Phillip Woods",   "age",       45,
  "Phillip Woods",   "height",   186,
  "Phillip Woods",   "age",       50,
  "Jessica Cordero", "age",       37,
  "Jessica Cordero", "height",   156
)

Code

people <- tribble(
  ~respondent_name,  ~key,    ~value, ~ID,
  #-----------------|--------|-------|------
  "Phillip Woods",   "age",       45, "pat01",
  "Phillip Woods",   "height",   186, "pat01",
  "Phillip Woods",   "age",       50, "pat02",
  "Jessica Cordero", "age",       37, "pat03",
  "Jessica Cordero", "height",   156, "pat03"
)

Code

people %>%
  pivot_wider(names_from = key, values_from = value)

# A tibble: 3 × 4
  respondent_name ID      age height
  <chr>           <chr> <dbl>  <dbl>
1 Phillip Woods   pat01    45    186
2 Phillip Woods   pat02    50     NA
3 Jessica Cordero pat03    37    156

Solution

Since there are two age observations for Phillip Woods, we need to add a unique “ID” column to uniquely identify one of the Phillip Woods as a separate patient. If not, it will not be able to widen the table, since it will read it as a duplicate value.

Exercise 5

Tidy the simple tibble of M&M data below. Do you need to make it wider or longer? What are the variables in your tidy version?

Code

mm_data <- tribble(
  ~mm_type, ~blue, ~orange, ~green, ~yellow, ~brown, ~red, ~cyan_blue,
  "plain",  6,     18,      12,     6,       7,      7,    NA,
  "peanut", NA,    11,      9,      1,       12,       8,    15
)

mm_data %>% 
  pivot_longer(cols = -mm_type, names_to = "color")

# A tibble: 14 × 3
   mm_type color     value
   <chr>   <chr>     <dbl>
 1 plain   blue          6
 2 plain   orange       18
 3 plain   green        12
 4 plain   yellow        6
 5 plain   brown         7
 6 plain   red           7
 7 plain   cyan_blue    NA
 8 peanut  blue         NA
 9 peanut  orange       11
10 peanut  green         9
11 peanut  yellow        1
12 peanut  brown        12
13 peanut  red           8
14 peanut  cyan_blue    15

Solution

In the tidy version, each color is its own variable, so we can see the counts for every color of M&M faceted by whether they do or do not contain peanuts.

Exercise 6

Both unite() and separate() have a remove argument. What does it do? The default value for remove is TRUE, why would you set it to FALSE?

Solution

The remove argument gets rid of the columns that are being merged or separated. If false, the original column(s) will still be merged/separated, but will be kept in the data frame.

Exercise 7

The case study of data from the 2014 World Health Organization Global Tuberculosis Report produces a lot of useful data (who) — data sub-directory contains the codebook for who. However, the format is difficult to work with, so the authors walk you through the process of tidying the data. The tidying process uses this nice concise code:

Code

# tidying who
who %>%
  pivot_longer(
    cols = c(new_sp_m014:newrel_f65), 
    names_to = "code", 
    values_to = "value",
    values_drop_na = TRUE
  ) %>% 
  mutate(code = stringr::str_replace(code, "newrel", "new_rel")) %>%
  separate(code, c("new", "var", "sexage")) %>% 
  select(-new, -iso2, -iso3) %>% 
  separate(sexage, c("sex", "age"), sep = 1)

Important

Insert a comment following each # in the code below that explains the purpose or objective of the line of code directly below it.

Code

# tidy who
who_tidy <- who %>%
  #Lengthen the data with fewer columns and more rows. The columns have been changed to represent the code for each group, while the value has been changed for the count of new TB cases recorded by group.
  pivot_longer(
    cols = c(new_sp_m014:newrel_f65), 
    names_to = "code", 
    values_to = "value",
    values_drop_na = TRUE
  ) %>% 
  #New column added. The newrel code is replaced with new_rel
  mutate(code = stringr::str_replace(code, "newrel", "new_rel")) %>%
  #The code is separated out into three different columns to show if case is new,  variant and sexage.
  separate(code, c("new", "var", "sexage")) %>% 
  #Select every variable but new, iso2, and iso3. This shows only `country`, `year`, `var`, `sexage`, and count
  select(-new, -iso2, -iso3) %>% 
  #Separate sexage into two different columns: sex and age
  separate(sexage, c("sex", "age"), sep = 1)

who_tidy

Exercise 8

In the WHO case study, the authors set na.rm = TRUE to make it easier to check that they had the correct values. Is this reasonable? Think about how missing values are represented in this dataset. Are there implicit missing values? What’s the difference between NA and zero in this dataset?

Solution

The concern with setting na.rm = TRUE is deciphering between whether the missing values indicate no TB cases, or whether the data was simply missing. Thus, we should see if zero is included in the dataset

As you can see, there are zeros included in the dataset.

Code

who %>%
  pivot_longer(
    cols = c(new_sp_m014:newrel_f65), 
    names_to = "code", 
    values_to = "value",
    values_drop_na = TRUE
  ) %>% 
  mutate(code = stringr::str_replace(code, "newrel", "new_rel")) %>%
  separate(code, c("new", "var", "sexage")) %>% 
  select(-new, -iso2, -iso3) %>% 
  separate(sexage, c("sex", "age"), sep = 1) %>%
filter(value == 0) %>%
nrow()

[1] 11080

Next, we should check for implicit missing values, which would be have the absence of a presence all together:

Code

nrow(who)

[1] 7240

Code

who %>% 
  complete(country, year) %>%
  nrow()

[1] 7446

After using the complete() function, there are more rows than initially in the who dataset, leading us to be able to include there are implicit missing values. Since zeroes are included in the dataset, using na.rm = TRUE is reasonable because zero simply means no cases with NA being used for missing values.

Exercise 9

The authors of the WHO case study claimed that iso2 and iso3 were redundant with country. Confirm this claim.

Code

who %>%
  select(country, iso2, iso3) %>%
  distinct() %>%
  group_by(country) %>%
  filter(n() > 1)

# A tibble: 0 × 3
# Groups:   country [0]
# … with 3 variables: country <chr>, iso2 <chr>, iso3 <chr>

If iso2 and iso3 weren’t redundant with country, then there would be distinct combinations of those two variables with a country. Since that is not present, we can conclude that they are redundant, since there are simply different forms of abbreviations of the full name of the country.

Exercise 10

For each level of country, year, and sex, compute the total number of cases of TB (using the WHO case study data). Construct an informative visualization.

Code

who_tidy <- who %>%
  pivot_longer(
    cols = c(new_sp_m014:newrel_f65), 
    names_to = "code", 
    values_to = "value",
    values_drop_na = TRUE
  ) %>% 
  mutate(code = stringr::str_replace(code, "newrel", "new_rel")) %>%
  separate(code, c("new", "var", "sexage")) %>% 
  select(-new, -iso2, -iso3) %>% 
  separate(sexage, c("sex", "age"), sep = 1)

Code

(graph_who <- who_tidy %>% 
  group_by(country, year, sex) %>%
  summarise(cases = sum(value)) %>%
  ungroup())

# A tibble: 6,921 × 4
   country      year sex   cases
   <chr>       <int> <chr> <int>
 1 Afghanistan  1997 f       102
 2 Afghanistan  1997 m        26
 3 Afghanistan  1998 f      1207
 4 Afghanistan  1998 m       571
 5 Afghanistan  1999 f       517
 6 Afghanistan  1999 m       228
 7 Afghanistan  2000 f      1751
 8 Afghanistan  2000 m       915
 9 Afghanistan  2001 f      3062
10 Afghanistan  2001 m      1577
# … with 6,911 more rows

Code

graph_who %>% filter(year > 1995, cases > 50000) %>% 
  ggplot(aes(x = year, y = cases, color = country)) + geom_line() + facet_grid(~sex)