Tidy may seem obvious, but it isn’t. Hence the whole workshop on it. Especially at DairyNZ, where data tends to be UN-tidy. The package mainly utilised here is ‘tidyr’ in the tidyverse.
Below are the 5 tables downloaded from github
download.file("https://raw.githubusercontent.com/markbneal/R_for_Data_Science_Course/master/table1.csv", "data/table1.csv")
download.file("https://raw.githubusercontent.com/markbneal/R_for_Data_Science_Course/master/table2.csv", "data/table2.csv")
download.file("https://raw.githubusercontent.com/markbneal/R_for_Data_Science_Course/master/table3.csv", "data/table3.csv")
download.file("https://raw.githubusercontent.com/markbneal/R_for_Data_Science_Course/master/table4a.csv", "data/table4a.csv")
download.file("https://raw.githubusercontent.com/markbneal/R_for_Data_Science_Course/master/table4b.csv", "data/table4b.csv")
## # A tibble: 6 x 4
## country year cases population
## <chr> <dbl> <dbl> <dbl>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
## # A tibble: 12 x 4
## country year type count
## <chr> <dbl> <chr> <dbl>
## 1 Afghanistan 1999 cases 745
## 2 Afghanistan 1999 population 19987071
## 3 Afghanistan 2000 cases 2666
## 4 Afghanistan 2000 population 20595360
## 5 Brazil 1999 cases 37737
## 6 Brazil 1999 population 172006362
## 7 Brazil 2000 cases 80488
## 8 Brazil 2000 population 174504898
## 9 China 1999 cases 212258
## 10 China 1999 population 1272915272
## 11 China 2000 cases 213766
## 12 China 2000 population 1280428583
## # A tibble: 6 x 3
## country year rate
## <chr> <dbl> <chr>
## 1 Afghanistan 1999 745/19987071
## 2 Afghanistan 2000 2666/20595360
## 3 Brazil 1999 37737/172006362
## 4 Brazil 2000 80488/174504898
## 5 China 1999 212258/1272915272
## 6 China 2000 213766/1280428583
## # A tibble: 3 x 3
## country `1999` `2000`
## <chr> <dbl> <dbl>
## 1 Afghanistan 745 2666
## 2 Brazil 37737 80488
## 3 China 212258 213766
## # A tibble: 3 x 3
## country `1999` `2000`
## <chr> <dbl> <dbl>
## 1 Afghanistan 19987071 20595360
## 2 Brazil 172006362 174504898
## 3 China 1272915272 1280428583
We prefer tidy data sets, such data sets have the following characteristics: Each variable has a column, Each observation has a row, Each value has its own cell.
table1 %>% mutate(rate = cases / population * 10000)
## # A tibble: 6 x 5
## country year cases population rate
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Afghanistan 1999 745 19987071 0.373
## 2 Afghanistan 2000 2666 20595360 1.29
## 3 Brazil 1999 37737 172006362 2.19
## 4 Brazil 2000 80488 174504898 4.61
## 5 China 1999 212258 1272915272 1.67
## 6 China 2000 213766 1280428583 1.67
table1 %>% count(year, wt = cases)
## # A tibble: 2 x 2
## year n
## <dbl> <dbl>
## 1 1999 250740
## 2 2000 296920
Where wt refers to frequency weight.
ggplot(table1, aes(x = year, y = cases)) +
geom_line(aes(group = country), colour = "grey50") +
geom_point(aes(colour = country))
There are three parameters to pivot longer; in table4a, set of column whose names are values(1999,2000), variable to move column values to a year, and the name of the variable to put columns values into (cases).
table4a
## # A tibble: 3 x 3
## country `1999` `2000`
## <chr> <dbl> <dbl>
## 1 Afghanistan 745 2666
## 2 Brazil 37737 80488
## 3 China 212258 213766
tidy4a <- table4a %>%
pivot_longer(c(`1999`, `2000`),
names_to = "year",
values_to = "cases")
This pivot longer table is putting each observation in its own row, a criteria of tidy data. This is done above for table4a above and table4b below. They are then saved in a tidy way in the environment. Pivot longer and wider are the new tools in the tidyverse, previously there were melt, cast, gather, and spread.
I had some issues when knitting, stopped the download commands from being re-evaluated every time. Further, originally uses read.csv not read_csv which was not good. Causes X to appear in column headers.
table4b
## # A tibble: 3 x 3
## country `1999` `2000`
## <chr> <dbl> <dbl>
## 1 Afghanistan 19987071 20595360
## 2 Brazil 172006362 174504898
## 3 China 1272915272 1280428583
tidy4b <- table4b %>%
pivot_longer(c(`1999`, `2000`),
names_to = "year",
values_to = "population")
They can then be joined together to make a tidy data frame.
tidy4joined <- left_join(tidy4a, tidy4b)
## Joining, by = c("country", "year")
Table 2 has observations split over two rows, so we use wider to reformat. Two parametres to make tidy; column to take the variable names from (type), and column to take the variables from (count.)
table2
## # A tibble: 12 x 4
## country year type count
## <chr> <dbl> <chr> <dbl>
## 1 Afghanistan 1999 cases 745
## 2 Afghanistan 1999 population 19987071
## 3 Afghanistan 2000 cases 2666
## 4 Afghanistan 2000 population 20595360
## 5 Brazil 1999 cases 37737
## 6 Brazil 1999 population 172006362
## 7 Brazil 2000 cases 80488
## 8 Brazil 2000 population 174504898
## 9 China 1999 cases 212258
## 10 China 1999 population 1272915272
## 11 China 2000 cases 213766
## 12 China 2000 population 1280428583
tidy2 <- table2 %>% pivot_wider(names_from = type,
values_from = count)
tidy2
## # A tibble: 6 x 4
## country year cases population
## <chr> <dbl> <dbl> <dbl>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
stocks <- tibble(
year = c(2015, 2015, 2016, 2016),
half = c( 1, 2, 1, 2),
return = c(1.88, 0.59, 0.92, 0.17))
stocks
## # A tibble: 4 x 3
## year half return
## <dbl> <dbl> <dbl>
## 1 2015 1 1.88
## 2 2015 2 0.59
## 3 2016 1 0.92
## 4 2016 2 0.17
stocks %>%
pivot_wider(names_from = year, values_from = return) %>%
pivot_longer(`2015`:`2016`, names_to = "year", values_to = "return")
## # A tibble: 4 x 3
## half year return
## <dbl> <chr> <dbl>
## 1 1 2015 1.88
## 2 1 2016 0.92
## 3 2 2015 0.59
## 4 2 2016 0.17
So they are not symmetrical because it has change from number to character for the year and thus the order of columns is different.
There are no backticks in the names, hence the code fails.
people <- tribble(
~name, ~names, ~values,
#-----------------|--------|------
"Phillip Woods", "age", 45,
"Phillip Woods", "height", 186,
"Phillip Woods", "age", 50,
"Jessica Cordero", "age", 37,
"Jessica Cordero", "height", 156)
Trying to widen this table will not work as there appears to be two Phillip Woods (two ages). This could be solved with a unique identifier if there are two people, such as a a middle name or number for participants. Or, a time could be added for each observation if it is the same person.
preg <- tribble(
~pregnant, ~male, ~female,
"yes", NA, 10,
"no", 20, 12)
preg
## # A tibble: 2 x 3
## pregnant male female
## <chr> <dbl> <dbl>
## 1 yes NA 10
## 2 no 20 12
Tidy data sets exhibit the following characteristics; each variable has a column, each observation has a row,each value has its own cell.
Each observation doesn’t have its own row. Suggest columns: gender, pregnant, total. Mark attempted it similarly, with non(pregnant) rather than total.
preg_tidy <- preg %>%
pivot_longer(cols = c("male", "female"),
names_to = "gender",
values_to = "values") %>%
pivot_wider(names_from = pregnant,
values_from = values)
preg_tidy
## # A tibble: 2 x 3
## gender yes no
## <chr> <dbl> <dbl>
## 1 male NA 20
## 2 female 10 12
table3
## # A tibble: 6 x 3
## country year rate
## <chr> <dbl> <chr>
## 1 Afghanistan 1999 745/19987071
## 2 Afghanistan 2000 2666/20595360
## 3 Brazil 1999 37737/172006362
## 4 Brazil 2000 80488/174504898
## 5 China 1999 212258/1272915272
## 6 China 2000 213766/1280428583
This has two variables in one column, ie each observation split over multiple rows, additionally the rate is a character rather than a number.
table3 %>% separate(col = rate, into = c("cases", "population"))
## # A tibble: 6 x 4
## country year cases population
## <chr> <dbl> <chr> <chr>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
So the separate looks for a non-alphanumeric operator, in this case the ‘/’ and is using that as the separator. Be careful using separate when tidying as sometimes we don’t want a period to be separated where in some cases it is a separator, in others it isn’t. You can specify the separator with sep = “x”, where x is the desired separator.
You can separate many things, in the following code we separate the years into the century and the year.
table5 <- table3 %>% separate(col = year,
into = c("century", "year"),
sep = 2)
table5
## # A tibble: 6 x 4
## country century year rate
## <chr> <chr> <chr> <chr>
## 1 Afghanistan 19 99 745/19987071
## 2 Afghanistan 20 00 2666/20595360
## 3 Brazil 19 99 37737/172006362
## 4 Brazil 20 00 80488/174504898
## 5 China 19 99 212258/1272915272
## 6 China 20 00 213766/1280428583
Now that we have separated the years into century and years we can use ‘unite’. However, unite has a default separator of "_“, so to get a normal expression of year we specify no separator with”".
table5 %>% unite(col = new, century, year, sep = "")
## # A tibble: 6 x 3
## country new rate
## <chr> <chr> <chr>
## 1 Afghanistan 1999 745/19987071
## 2 Afghanistan 2000 2666/20595360
## 3 Brazil 1999 37737/172006362
## 4 Brazil 2000 80488/174504898
## 5 China 1999 212258/1272915272
## 6 China 2000 213766/1280428583
tibble(x = c("a,b,c", "d,e,f,g", "h,i,j")) %>%
separate(x, c("one", "two", "three"))
## Warning: Expected 3 pieces. Additional pieces discarded in 1 rows [2].
## # A tibble: 3 x 3
## one two three
## <chr> <chr> <chr>
## 1 a b c
## 2 d e f
## 3 h i j
tibble(x = c("a,b,c", "d,e", "f,g,i")) %>%
separate(x, c("one", "two", "three"))
## Warning: Expected 3 pieces. Missing pieces filled with `NA` in 1 rows [2].
## # A tibble: 3 x 3
## one two three
## <chr> <chr> <chr>
## 1 a b c
## 2 d e <NA>
## 3 f g i
The first tibble has an extra data entry in the second column when separated, so ‘g’ is discarded. This it an extra, or are there missing data in the other two columns?
The second tibble is missing an entry in the second column, so when separated an NA is entered there but how do you know if it is the column 2 or 3 value that is missing and therefore is the NA in the right place?
There can be explicit or implicit NAs, that is, the former is present in the original data while the latter arises after a transformation.
stocks <- tibble(
year = c(2015, 2015, 2015, 2015, 2016, 2016, 2016),
qtr = c( 1, 2, 3, 4, 2, 3, 4),
return = c(1.88, 0.59, 0.35, NA, 0.92, 0.17, 2.66))
stocks
## # A tibble: 7 x 3
## year qtr return
## <dbl> <dbl> <dbl>
## 1 2015 1 1.88
## 2 2015 2 0.59
## 3 2015 3 0.35
## 4 2015 4 NA
## 5 2016 2 0.92
## 6 2016 3 0.17
## 7 2016 4 2.66
The implicit data is the return in 2015. But after a transformation…
… We see two NAs in the fourth quarter of 2015, and the first quarter of 2016.
Can solve this with xxx_drop_na
But, what if you have a a missing input stemming from data entry where following variables pertain to the last? You can use fill to carry the value forward. There are many options within the fill command to change how it operates.
treatment <- tribble(
~ person, ~ treatment, ~response,
"Derrick Whitmore", 1, 7,
NA, 2, 10,
NA, 3, 9,
"Katherine Burke", 1, 4)
treatment %>% fill(person)
## # A tibble: 4 x 3
## person treatment response
## <chr> <dbl> <dbl>
## 1 Derrick Whitmore 1 7
## 2 Derrick Whitmore 2 10
## 3 Derrick Whitmore 3 9
## 4 Katherine Burke 1 4
W.H.O. tuberculosis data set
who <- who
who1 <- who %>%
pivot_longer(
cols = new_sp_m014:newrel_f65,
names_to = "key",
values_to = "cases",
values_drop_na = TRUE)
who1 %>% count(key)
## # A tibble: 56 x 2
## key n
## <chr> <int>
## 1 new_ep_f014 1032
## 2 new_ep_f1524 1021
## 3 new_ep_f2534 1021
## 4 new_ep_f3544 1021
## 5 new_ep_f4554 1017
## 6 new_ep_f5564 1017
## 7 new_ep_f65 1014
## 8 new_ep_m014 1038
## 9 new_ep_m1524 1026
## 10 new_ep_m2534 1020
## # ... with 46 more rows
The key column encodes 4 variables: new/old, type of tb, gender, age group. This is not tidy, but we can fix it!
who2 <- who1 %>% mutate(stringr::str_replace(key, "newrel", "new_rel"))
This fixes the spelling inconsistencies. The following code
who2 <- who1 %>% mutate(stringr::str_replace(key, "newrel", "new_rel"))
who3 <- who2 %>%
separate(col = key,
into = c("new", "type", "sexage"),
sep = "_")
## Warning: Expected 3 pieces. Missing pieces filled with `NA` in 2580 rows [243,
## 244, 679, 680, 681, 682, 683, 684, 685, 686, 687, 688, 689, 690, 691, 692, 903,
## 904, 905, 906, ...].
who3 %>% count(new)
## # A tibble: 2 x 2
## new n
## <chr> <int>
## 1 new 73466
## 2 newrel 2580
who4 <- who3 %>%
select(-new, -iso2, -iso3)
who5 <- who4 %>%
separate(col = sexage,
into = c("sex", "age"),
sep = 1)
This experiment dropped the NAs early on, without access to the researchers we cannot decide whether they are meant to be zero, or if they are non-zero but missing. The later would be problematic if they were treated as the former, hence they are omitted.
The mutate step made the formatting of the consistent. Without this we wouldn’t be able to separate by "_". Hence, was very important. Those not entered correctly would through NAs.
Here we treated the iso2 and iso3 as unimportant. ISO are standardised abbreviated country names, you could table join to get them back if needed. Or, test via a table of name iso, iso frequency to see that they are consistant and countries don’t multiple codes.
Not necessarily easy, this is many years of data, for many countries and multiple demographic considerations. We could filter for the most interesting (highest cases). The graph we are targeting is a faceted cases overtime. First we sort so that we have a count in descending order to find the ten highest case countries.
tb_country_desc <- who5 %>% filter(year == 2013) %>%
count(country, wt = cases) %>%
arrange(desc(n))
top10 <- tb_country_desc$country[1:10]
who5_top_10 <- who5 %>%
filter(country %in% top10) %>%
group_by(country, year, sex) %>%
summarise(sum_cases = sum(cases, na.rm = TRUE))
## `summarise()` has grouped output by 'country', 'year'. You can override using the `.groups` argument.
ggplot(data = who5_top_10)+
geom_line(aes(x = year, y = sum_cases, colour = sex))+
facet_wrap(~country)
## Non-tidy data Sometimes tidy data isn’t the answer, such as ease of access, storage or field-specific data conventions.
Tidy Data:
One observation per row.
One variable per column.
One value per cell.