Overview

Tidy may seem obvious, but it isn’t. Hence the whole workshop on it. Especially at DairyNZ, where data tends to be UN-tidy. The package mainly utilised here is ‘tidyr’ in the tidyverse.

Below are the 5 tables downloaded from github

download.file("https://raw.githubusercontent.com/markbneal/R_for_Data_Science_Course/master/table1.csv", "data/table1.csv")
download.file("https://raw.githubusercontent.com/markbneal/R_for_Data_Science_Course/master/table2.csv", "data/table2.csv")
download.file("https://raw.githubusercontent.com/markbneal/R_for_Data_Science_Course/master/table3.csv", "data/table3.csv")
download.file("https://raw.githubusercontent.com/markbneal/R_for_Data_Science_Course/master/table4a.csv", "data/table4a.csv")
download.file("https://raw.githubusercontent.com/markbneal/R_for_Data_Science_Course/master/table4b.csv", "data/table4b.csv")
## # A tibble: 6 x 4
##   country      year  cases population
##   <chr>       <dbl>  <dbl>      <dbl>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3 Brazil       1999  37737  172006362
## 4 Brazil       2000  80488  174504898
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583
## # A tibble: 12 x 4
##    country      year type            count
##    <chr>       <dbl> <chr>           <dbl>
##  1 Afghanistan  1999 cases             745
##  2 Afghanistan  1999 population   19987071
##  3 Afghanistan  2000 cases            2666
##  4 Afghanistan  2000 population   20595360
##  5 Brazil       1999 cases           37737
##  6 Brazil       1999 population  172006362
##  7 Brazil       2000 cases           80488
##  8 Brazil       2000 population  174504898
##  9 China        1999 cases          212258
## 10 China        1999 population 1272915272
## 11 China        2000 cases          213766
## 12 China        2000 population 1280428583
## # A tibble: 6 x 3
##   country      year rate             
##   <chr>       <dbl> <chr>            
## 1 Afghanistan  1999 745/19987071     
## 2 Afghanistan  2000 2666/20595360    
## 3 Brazil       1999 37737/172006362  
## 4 Brazil       2000 80488/174504898  
## 5 China        1999 212258/1272915272
## 6 China        2000 213766/1280428583
## # A tibble: 3 x 3
##   country     `1999` `2000`
##   <chr>        <dbl>  <dbl>
## 1 Afghanistan    745   2666
## 2 Brazil       37737  80488
## 3 China       212258 213766
## # A tibble: 3 x 3
##   country         `1999`     `2000`
##   <chr>            <dbl>      <dbl>
## 1 Afghanistan   19987071   20595360
## 2 Brazil       172006362  174504898
## 3 China       1272915272 1280428583

Workshop content

We prefer tidy data sets, such data sets have the following characteristics: Each variable has a column, Each observation has a row, Each value has its own cell.

Calculating rate

table1 %>% mutate(rate = cases / population * 10000)
## # A tibble: 6 x 5
##   country      year  cases population  rate
##   <chr>       <dbl>  <dbl>      <dbl> <dbl>
## 1 Afghanistan  1999    745   19987071 0.373
## 2 Afghanistan  2000   2666   20595360 1.29 
## 3 Brazil       1999  37737  172006362 2.19 
## 4 Brazil       2000  80488  174504898 4.61 
## 5 China        1999 212258 1272915272 1.67 
## 6 China        2000 213766 1280428583 1.67

Counting cases

table1 %>% count(year, wt = cases)
## # A tibble: 2 x 2
##    year      n
##   <dbl>  <dbl>
## 1  1999 250740
## 2  2000 296920

Where wt refers to frequency weight.

Visualising changes over time

ggplot(table1, aes(x = year, y = cases)) +
  geom_line(aes(group = country), colour = "grey50") +
  geom_point(aes(colour = country))

Longer

There are three parameters to pivot longer; in table4a, set of column whose names are values(1999,2000), variable to move column values to a year, and the name of the variable to put columns values into (cases).

table4a
## # A tibble: 3 x 3
##   country     `1999` `2000`
##   <chr>        <dbl>  <dbl>
## 1 Afghanistan    745   2666
## 2 Brazil       37737  80488
## 3 China       212258 213766
tidy4a <- table4a %>% 
  pivot_longer(c(`1999`, `2000`), 
               names_to = "year",
               values_to = "cases")

This pivot longer table is putting each observation in its own row, a criteria of tidy data. This is done above for table4a above and table4b below. They are then saved in a tidy way in the environment. Pivot longer and wider are the new tools in the tidyverse, previously there were melt, cast, gather, and spread.

I had some issues when knitting, stopped the download commands from being re-evaluated every time. Further, originally uses read.csv not read_csv which was not good. Causes X to appear in column headers.

table4b
## # A tibble: 3 x 3
##   country         `1999`     `2000`
##   <chr>            <dbl>      <dbl>
## 1 Afghanistan   19987071   20595360
## 2 Brazil       172006362  174504898
## 3 China       1272915272 1280428583
tidy4b <- table4b %>% 
  pivot_longer(c(`1999`, `2000`), 
               names_to = "year",
               values_to = "population")

They can then be joined together to make a tidy data frame.

tidy4joined <- left_join(tidy4a, tidy4b)
## Joining, by = c("country", "year")

Wider

Table 2 has observations split over two rows, so we use wider to reformat. Two parametres to make tidy; column to take the variable names from (type), and column to take the variables from (count.)

table2
## # A tibble: 12 x 4
##    country      year type            count
##    <chr>       <dbl> <chr>           <dbl>
##  1 Afghanistan  1999 cases             745
##  2 Afghanistan  1999 population   19987071
##  3 Afghanistan  2000 cases            2666
##  4 Afghanistan  2000 population   20595360
##  5 Brazil       1999 cases           37737
##  6 Brazil       1999 population  172006362
##  7 Brazil       2000 cases           80488
##  8 Brazil       2000 population  174504898
##  9 China        1999 cases          212258
## 10 China        1999 population 1272915272
## 11 China        2000 cases          213766
## 12 China        2000 population 1280428583
tidy2 <- table2 %>% pivot_wider(names_from = type,
                       values_from = count)
tidy2
## # A tibble: 6 x 4
##   country      year  cases population
##   <chr>       <dbl>  <dbl>      <dbl>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3 Brazil       1999  37737  172006362
## 4 Brazil       2000  80488  174504898
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583

Exercises 12.3.3

1. Why are pivot longer/wider not perfectly symmetrical?

stocks <- tibble(
  year   = c(2015, 2015, 2016, 2016),
  half  = c(   1,    2,     1,    2),
  return = c(1.88, 0.59, 0.92, 0.17))

stocks
## # A tibble: 4 x 3
##    year  half return
##   <dbl> <dbl>  <dbl>
## 1  2015     1   1.88
## 2  2015     2   0.59
## 3  2016     1   0.92
## 4  2016     2   0.17
stocks %>% 
  pivot_wider(names_from = year, values_from = return) %>% 
  pivot_longer(`2015`:`2016`, names_to = "year", values_to = "return")
## # A tibble: 4 x 3
##    half year  return
##   <dbl> <chr>  <dbl>
## 1     1 2015    1.88
## 2     1 2016    0.92
## 3     2 2015    0.59
## 4     2 2016    0.17

So they are not symmetrical because it has change from number to character for the year and thus the order of columns is different.

2. Why does this code fail?

There are no backticks in the names, hence the code fails.

3. What would happen if you widen this table? Why? How could you add a new column to uniquely identify each value?

people <- tribble(
  ~name,             ~names,  ~values,
  #-----------------|--------|------
  "Phillip Woods",   "age",       45,
  "Phillip Woods",   "height",   186,
  "Phillip Woods",   "age",       50,
  "Jessica Cordero", "age",       37,
  "Jessica Cordero", "height",   156)

Trying to widen this table will not work as there appears to be two Phillip Woods (two ages). This could be solved with a unique identifier if there are two people, such as a a middle name or number for participants. Or, a time could be added for each observation if it is the same person.

4. Tidy the simple tibble below. Do you need to make it wider or longer? What are the variables?

preg <- tribble(
  ~pregnant, ~male, ~female,
  "yes",     NA,    10,
  "no",      20,    12)
preg
## # A tibble: 2 x 3
##   pregnant  male female
##   <chr>    <dbl>  <dbl>
## 1 yes         NA     10
## 2 no          20     12

Tidy data sets exhibit the following characteristics; each variable has a column, each observation has a row,each value has its own cell.

Each observation doesn’t have its own row. Suggest columns: gender, pregnant, total. Mark attempted it similarly, with non(pregnant) rather than total.

preg_tidy <- preg %>% 
  pivot_longer(cols = c("male", "female"),
                      names_to = "gender",
                      values_to = "values") %>% 
  pivot_wider(names_from = pregnant,
              values_from = values)
preg_tidy
## # A tibble: 2 x 3
##   gender   yes    no
##   <chr>  <dbl> <dbl>
## 1 male      NA    20
## 2 female    10    12

Separate

table3
## # A tibble: 6 x 3
##   country      year rate             
##   <chr>       <dbl> <chr>            
## 1 Afghanistan  1999 745/19987071     
## 2 Afghanistan  2000 2666/20595360    
## 3 Brazil       1999 37737/172006362  
## 4 Brazil       2000 80488/174504898  
## 5 China        1999 212258/1272915272
## 6 China        2000 213766/1280428583

This has two variables in one column, ie each observation split over multiple rows, additionally the rate is a character rather than a number.

table3 %>% separate(col = rate, into = c("cases", "population"))
## # A tibble: 6 x 4
##   country      year cases  population
##   <chr>       <dbl> <chr>  <chr>     
## 1 Afghanistan  1999 745    19987071  
## 2 Afghanistan  2000 2666   20595360  
## 3 Brazil       1999 37737  172006362 
## 4 Brazil       2000 80488  174504898 
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583

So the separate looks for a non-alphanumeric operator, in this case the ‘/’ and is using that as the separator. Be careful using separate when tidying as sometimes we don’t want a period to be separated where in some cases it is a separator, in others it isn’t. You can specify the separator with sep = “x”, where x is the desired separator.

You can separate many things, in the following code we separate the years into the century and the year.

table5 <- table3 %>% separate(col = year,
                    into = c("century", "year"),
                    sep = 2)

table5
## # A tibble: 6 x 4
##   country     century year  rate             
##   <chr>       <chr>   <chr> <chr>            
## 1 Afghanistan 19      99    745/19987071     
## 2 Afghanistan 20      00    2666/20595360    
## 3 Brazil      19      99    37737/172006362  
## 4 Brazil      20      00    80488/174504898  
## 5 China       19      99    212258/1272915272
## 6 China       20      00    213766/1280428583

Unite

Now that we have separated the years into century and years we can use ‘unite’. However, unite has a default separator of "_“, so to get a normal expression of year we specify no separator with”".

table5 %>% unite(col = new, century, year, sep = "")
## # A tibble: 6 x 3
##   country     new   rate             
##   <chr>       <chr> <chr>            
## 1 Afghanistan 1999  745/19987071     
## 2 Afghanistan 2000  2666/20595360    
## 3 Brazil      1999  37737/172006362  
## 4 Brazil      2000  80488/174504898  
## 5 China       1999  212258/1272915272
## 6 China       2000  213766/1280428583

Exercises 12.4.3

1.What do the extra and fill arguments do in separate()? Experiment with the various options for the following two toy datasets.

tibble(x = c("a,b,c", "d,e,f,g", "h,i,j")) %>% 
  separate(x, c("one", "two", "three"))
## Warning: Expected 3 pieces. Additional pieces discarded in 1 rows [2].
## # A tibble: 3 x 3
##   one   two   three
##   <chr> <chr> <chr>
## 1 a     b     c    
## 2 d     e     f    
## 3 h     i     j
tibble(x = c("a,b,c", "d,e", "f,g,i")) %>% 
  separate(x, c("one", "two", "three"))
## Warning: Expected 3 pieces. Missing pieces filled with `NA` in 1 rows [2].
## # A tibble: 3 x 3
##   one   two   three
##   <chr> <chr> <chr>
## 1 a     b     c    
## 2 d     e     <NA> 
## 3 f     g     i

The first tibble has an extra data entry in the second column when separated, so ‘g’ is discarded. This it an extra, or are there missing data in the other two columns?

The second tibble is missing an entry in the second column, so when separated an NA is entered there but how do you know if it is the column 2 or 3 value that is missing and therefore is the NA in the right place?

Missing values

There can be explicit or implicit NAs, that is, the former is present in the original data while the latter arises after a transformation.

stocks <- tibble(
  year   = c(2015, 2015, 2015, 2015, 2016, 2016, 2016),
  qtr    = c(   1,    2,    3,    4,    2,    3,    4),
  return = c(1.88, 0.59, 0.35,   NA, 0.92, 0.17, 2.66))

stocks
## # A tibble: 7 x 3
##    year   qtr return
##   <dbl> <dbl>  <dbl>
## 1  2015     1   1.88
## 2  2015     2   0.59
## 3  2015     3   0.35
## 4  2015     4  NA   
## 5  2016     2   0.92
## 6  2016     3   0.17
## 7  2016     4   2.66

The implicit data is the return in 2015. But after a transformation…

… We see two NAs in the fourth quarter of 2015, and the first quarter of 2016.

Can solve this with xxx_drop_na

But, what if you have a a missing input stemming from data entry where following variables pertain to the last? You can use fill to carry the value forward. There are many options within the fill command to change how it operates.

treatment <- tribble(
  ~ person,           ~ treatment, ~response,
  "Derrick Whitmore", 1,           7,
  NA,                 2,           10,
  NA,                 3,           9,
  "Katherine Burke",  1,           4)

treatment %>% fill(person)
## # A tibble: 4 x 3
##   person           treatment response
##   <chr>                <dbl>    <dbl>
## 1 Derrick Whitmore         1        7
## 2 Derrick Whitmore         2       10
## 3 Derrick Whitmore         3        9
## 4 Katherine Burke          1        4

Case Study

W.H.O. tuberculosis data set

who <- who
who1 <- who %>% 
  pivot_longer(
  cols = new_sp_m014:newrel_f65,
  names_to = "key",
  values_to = "cases",
  values_drop_na = TRUE)

who1 %>% count(key)
## # A tibble: 56 x 2
##    key              n
##    <chr>        <int>
##  1 new_ep_f014   1032
##  2 new_ep_f1524  1021
##  3 new_ep_f2534  1021
##  4 new_ep_f3544  1021
##  5 new_ep_f4554  1017
##  6 new_ep_f5564  1017
##  7 new_ep_f65    1014
##  8 new_ep_m014   1038
##  9 new_ep_m1524  1026
## 10 new_ep_m2534  1020
## # ... with 46 more rows

The key column encodes 4 variables: new/old, type of tb, gender, age group. This is not tidy, but we can fix it!

who2 <- who1 %>% mutate(stringr::str_replace(key, "newrel", "new_rel"))

This fixes the spelling inconsistencies. The following code

who2 <- who1 %>% mutate(stringr::str_replace(key, "newrel", "new_rel"))

who3 <- who2 %>% 
  separate(col = key, 
           into = c("new", "type", "sexage"),
           sep = "_")
## Warning: Expected 3 pieces. Missing pieces filled with `NA` in 2580 rows [243,
## 244, 679, 680, 681, 682, 683, 684, 685, 686, 687, 688, 689, 690, 691, 692, 903,
## 904, 905, 906, ...].
who3 %>% count(new)
## # A tibble: 2 x 2
##   new        n
##   <chr>  <int>
## 1 new    73466
## 2 newrel  2580
who4 <- who3 %>% 
  select(-new, -iso2, -iso3)

who5 <- who4 %>% 
  separate(col = sexage,
           into = c("sex", "age"),
           sep = 1)

Exercises 12.6.1

1. Handling of NAs in this dataset.

This experiment dropped the NAs early on, without access to the researchers we cannot decide whether they are meant to be zero, or if they are non-zero but missing. The later would be problematic if they were treated as the former, hence they are omitted.

2. Neglecting the Mutate step.

The mutate step made the formatting of the consistent. Without this we wouldn’t be able to separate by "_". Hence, was very important. Those not entered correctly would through NAs.

3. Handling of the shortened ISO codes

Here we treated the iso2 and iso3 as unimportant. ISO are standardised abbreviated country names, you could table join to get them back if needed. Or, test via a table of name iso, iso frequency to see that they are consistant and countries don’t multiple codes.

4. Visualisation of the data

Not necessarily easy, this is many years of data, for many countries and multiple demographic considerations. We could filter for the most interesting (highest cases). The graph we are targeting is a faceted cases overtime. First we sort so that we have a count in descending order to find the ten highest case countries.

tb_country_desc <- who5 %>% filter(year == 2013) %>% 
  count(country, wt = cases) %>% 
  arrange(desc(n))

top10 <- tb_country_desc$country[1:10]

who5_top_10 <- who5 %>% 
  filter(country %in% top10) %>% 
  group_by(country, year, sex) %>% 
  summarise(sum_cases = sum(cases, na.rm = TRUE))
## `summarise()` has grouped output by 'country', 'year'. You can override using the `.groups` argument.
ggplot(data = who5_top_10)+
  geom_line(aes(x = year, y = sum_cases, colour = sex))+
  facet_wrap(~country)

## Non-tidy data Sometimes tidy data isn’t the answer, such as ease of access, storage or field-specific data conventions.

Summary of tidy data

Tidy Data:

One observation per row.

One variable per column.

One value per cell.