Data Science 1 with R (STAT 301-1)

# Loading package(s)
library(tidyverse)
library(dplyr)
data(table1)
data(table2)
data(table4a)
data(table4b)
data(who)

All datasets are either defined inline or provided within the core tidyverse packages (table1, table2, table4a, table4b, who).
Compute the rate per 10,000, once using only table2 and again using table4a + table4b:
table2 %>%
pivot_wider(names_from = type, values_from = count) %>%
mutate(rate = cases / population * 10000)
# A tibble: 6 × 5
country year cases population rate
<chr> <int> <int> <int> <dbl>
1 Afghanistan 1999 745 19987071 0.373
2 Afghanistan 2000 2666 20595360 1.29
3 Brazil 1999 37737 172006362 2.19
4 Brazil 2000 80488 174504898 4.61
5 China 1999 212258 1272915272 1.67
6 China 2000 213766 1280428583 1.67
tidy4a <- table4a %>%
pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "cases")
tidy4b <- table4b %>%
pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "population")
left_join(tidy4a, tidy4b, by = c("country", "year")) %>%
mutate(rate = cases / population * 10000)
# A tibble: 6 × 5
country year cases population rate
<chr> <chr> <int> <int> <dbl>
1 Afghanistan 1999 745 19987071 0.373
2 Afghanistan 2000 2666 20595360 1.29
3 Brazil 1999 37737 172006362 2.19
4 Brazil 2000 80488 174504898 4.61
5 China 1999 212258 1272915272 1.67
6 China 2000 213766 1280428583 1.67
Which representation is easiest to work with? Which is hardest? Why?
It is significantly more challenging to both create and interpret the rate when the calculation has to be assembled step by step with base R, because the intermediate results are bare vectors and lists of values rather than an organized table; with tidy data the whole computation stays in a single table. With larger or more complicated datasets, this becomes an even greater issue.
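For comparison, here is a minimal sketch (using table1, the already-tidy representation) of how little work the tidy form requires; the rate is a single mutate():

# table1 already has one column per variable, so no reshaping is needed
table1 %>%
mutate(rate = cases / population * 10000)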
Recreate the plot below showing the change in cases over time using table2 instead of table1. What do you need to do first?
# Change over time in number of TB cases by country
ggplot(table1, aes(year, cases)) +
geom_line(aes(group = country), colour = "grey50") +
geom_point(aes(colour = country))

table2 %>%
pivot_wider(names_from = type, values_from = count) %>%
ggplot(aes(year, cases)) +
geom_line(aes(group = country), colour = "grey50") +
geom_point(aes(colour = country))

We first need to tidy table2 by widening it with pivot_wider(), so that cases and population each get their own column. Without that step there is no cases column to map to the y-axis, so the plot cannot be recreated.
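Equivalently (just a sketch of the same idea), the widened table can be stored in an intermediate object first, assuming a name such as table2_wide, and then plotted:

# Widen table2 once, then reuse the result for plotting
table2_wide <- table2 %>%
pivot_wider(names_from = type, values_from = count)

ggplot(table2_wide, aes(year, cases)) +
geom_line(aes(group = country), colour = "grey50") +
geom_point(aes(colour = country))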
Why does the provided code fail? Fix it.
table4a %>%
pivot_longer(c(1999, 2000), names_to = "year", values_to = "cases")

table4a %>%
pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "cases")
# A tibble: 6 × 3
country year cases
<chr> <chr> <int>
1 Afghanistan 1999 745
2 Afghanistan 2000 2666
3 Brazil 1999 37737
4 Brazil 2000 80488
5 China 1999 212258
6 China 2000 213766
Without backticks, 1999 and 2000 are read as numeric column positions rather than the non-syntactic column names they are in table4a, so you get an error saying that locations 1999 and 2000 don't exist. Wrapping the names in backticks fixes this.
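Quoting the names works as well; a minimal equivalent sketch:

# Quoted column names are treated as names, not positions
table4a %>%
pivot_longer(c("1999", "2000"), names_to = "year", values_to = "cases")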
What happens if you use pivot_wider() on this table so that we have a dataset with 3 columns (respondent_name, age, height)?
Fix the issue by adding a new column.
people <- tribble(
~respondent_name, ~key, ~value,
#-----------------|--------|------
"Phillip Woods", "age", 45,
"Phillip Woods", "height", 186,
"Phillip Woods", "age", 50,
"Jessica Cordero", "age", 37,
"Jessica Cordero", "height", 156
)

people <- tribble(
~respondent_name, ~key, ~value, ~ID,
#-----------------|--------|-------|------
"Phillip Woods", "age", 45, "pat01",
"Phillip Woods", "height", 186, "pat01",
"Phillip Woods", "age", 50, "pat02",
"Jessica Cordero", "age", 37, "pat03",
"Jessica Cordero", "height", 156, "pat03"
)

people %>%
pivot_wider(names_from = key, values_from = value)
# A tibble: 3 × 4
respondent_name ID age height
<chr> <chr> <dbl> <dbl>
1 Phillip Woods pat01 45 186
2 Phillip Woods pat02 50 NA
3 Jessica Cordero pat03 37 156
Since there are two age observations for Phillip Woods, pivot_wider() cannot uniquely identify each value: it warns that the values are not uniquely identified and packs the duplicates into list-columns instead of producing a clean wide table. Adding a unique ID column, here treating the second Phillip Woods record as a separate respondent, removes the duplication so the table widens cleanly.
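For reference, a sketch of the pivot on the original table without the ID column (people_original below just re-creates the data from the exercise); the result is a warning plus list-columns rather than an error:

# Without a unique identifier, the two "age" values for Phillip Woods collide,
# so pivot_wider() warns and returns list-columns for age and height.
people_original <- tribble(
~respondent_name, ~key, ~value,
"Phillip Woods", "age", 45,
"Phillip Woods", "height", 186,
"Phillip Woods", "age", 50,
"Jessica Cordero", "age", 37,
"Jessica Cordero", "height", 156
)

people_original %>%
pivot_wider(names_from = key, values_from = value)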
Tidy the simple tibble of M&M data below. Do you need to make it wider or longer? What are the variables in your tidy version?
mm_data <- tribble(
~mm_type, ~blue, ~orange, ~green, ~yellow, ~brown, ~red, ~cyan_blue,
"plain", 6, 18, 12, 6, 7, 7, NA,
"peanut", NA, 11, 9, 1, 12, 8, 15
)
mm_data %>%
pivot_longer(cols = -mm_type, names_to = "color")
# A tibble: 14 × 3
mm_type color value
<chr> <chr> <dbl>
1 plain blue 6
2 plain orange 18
3 plain green 12
4 plain yellow 6
5 plain brown 7
6 plain red 7
7 plain cyan_blue NA
8 peanut blue NA
9 peanut orange 11
10 peanut green 9
11 peanut yellow 1
12 peanut brown 12
13 peanut red 8
14 peanut cyan_blue 15
We need to make the data longer. In the tidy version the variables are mm_type, color, and value (the count of M&Ms of each color), which lets us compare color counts between the plain and peanut types.
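A slightly more explicit variant (an optional sketch; naming the count column and dropping the unobserved colors are choices, not requirements of the exercise):

mm_data %>%
pivot_longer(
cols = -mm_type,        # every column except mm_type is a color
names_to = "color",
values_to = "count",    # give the count column a descriptive name
values_drop_na = TRUE   # drop colors not observed for a type
)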
Both unite() and separate() have a remove argument. What does it do? The default value for remove is TRUE, why would you set it to FALSE?
The remove argument controls whether the original input column(s) are dropped from the output after being united or separated. With remove = FALSE the new column(s) are still created, but the original column(s) are kept in the data frame, which is useful for checking that the transformation worked correctly or for keeping both representations.
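A minimal sketch with made-up data (the century and yr columns are purely illustrative):

# remove = FALSE keeps century and yr alongside the new year column
tibble(century = "19", yr = "99") %>%
unite(year, century, yr, sep = "", remove = FALSE)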
The case study of data from the 2014 World Health Organization Global Tuberculosis Report produces a lot of useful data (who); the data sub-directory contains the codebook for who. However, the format is difficult to work with, so the authors walk you through the process of tidying the data. The tidying process uses this nice, concise code:
# tidying who
who %>%
pivot_longer(
cols = c(new_sp_m014:newrel_f65),
names_to = "code",
values_to = "value",
values_drop_na = TRUE
) %>%
mutate(code = stringr::str_replace(code, "newrel", "new_rel")) %>%
separate(code, c("new", "var", "sexage")) %>%
select(-new, -iso2, -iso3) %>%
separate(sexage, c("sex", "age"), sep = 1)Insert a comment following each # in the code below that explains the purpose or objective of the line of code directly below it.
# tidy who
who_tidy <- who %>%
#Lengthen the data: the many new_*/newrel_* columns collapse into a single `code` column, their values go into a `value` column holding the count of new TB cases for each group, and rows with missing counts are dropped
pivot_longer(
cols = c(new_sp_m014:newrel_f65),
names_to = "code",
values_to = "value",
values_drop_na = TRUE
) %>%
#Make the codes consistent: replace "newrel" with "new_rel" so every code follows the new_var_sexage pattern
mutate(code = stringr::str_replace(code, "newrel", "new_rel")) %>%
#Split `code` into three columns: `new` (always "new"), `var` (the type of TB case), and `sexage`
separate(code, c("new", "var", "sexage")) %>%
#Drop the redundant `new`, `iso2`, and `iso3` columns, keeping only `country`, `year`, `var`, `sexage`, and `value`
select(-new, -iso2, -iso3) %>%
#Separate sexage after the first character into two columns: sex and age
separate(sexage, c("sex", "age"), sep = 1)
who_tidy

In the WHO case study, the authors set na.rm = TRUE to make it easier to check that they had the correct values. Is this reasonable? Think about how missing values are represented in this dataset. Are there implicit missing values? What’s the difference between NA and zero in this dataset?
The concern with setting na.rm = TRUE is distinguishing whether a missing value means there were no TB cases or simply that no data were reported. To decide, we should check whether zero appears in the dataset; the count below shows that it does:
who %>%
pivot_longer(
cols = c(new_sp_m014:newrel_f65),
names_to = "code",
values_to = "value",
values_drop_na = TRUE
) %>%
mutate(code = stringr::str_replace(code, "newrel", "new_rel")) %>%
separate(code, c("new", "var", "sexage")) %>%
select(-new, -iso2, -iso3) %>%
separate(sexage, c("sex", "age"), sep = 1) %>%
filter(value == 0) %>%
nrow()
[1] 11080
Next, we should check for implicit missing values: country and year combinations that are absent from the data altogether:
nrow(who)
[1] 7240
who %>%
complete(country, year) %>%
nrow()
[1] 7446
After using complete(), there are more rows than in the original who dataset, so we can conclude that there are implicit missing values. Since zeros do appear in the dataset, using na.rm = TRUE is reasonable: zero means no reported cases, while NA marks a value that is missing altogether.
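If we wanted to see exactly which country and year combinations are implicit, one option (a sketch, not required by the assignment) is an anti_join() against the original data; it should return 7446 - 7240 = 206 rows:

# Rows added by complete() are absent from who, so anti_join() isolates them
who %>%
complete(country, year) %>%
anti_join(who, by = c("country", "year"))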
The authors of the WHO case study claimed that iso2 and iso3 were redundant with country. Confirm this claim.
who %>%
select(country, iso2, iso3) %>%
distinct() %>%
group_by(country) %>%
filter(n() > 1)
# A tibble: 0 × 3
# Groups: country [0]
# … with 3 variables: country <chr>, iso2 <chr>, iso3 <chr>
If iso2 and iso3 were not redundant with country, at least one country would appear with more than one distinct (iso2, iso3) combination. Since the result above is empty, we can conclude that they are redundant: iso2 and iso3 are simply abbreviated forms of the full country name.
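An equivalent check (just a sketch; the distinct()/filter() approach above already settles the question) is to compare the number of distinct countries with the number of distinct country, iso2, iso3 combinations:

# If the two counts match, each country maps to exactly one (iso2, iso3) pair
who %>%
summarise(
n_country = n_distinct(country),
n_combo = n_distinct(country, iso2, iso3)
)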
For each level of country, year, and sex, compute the total number of cases of TB (using the WHO case study data). Construct an informative visualization.
who_tidy <- who %>%
pivot_longer(
cols = c(new_sp_m014:newrel_f65),
names_to = "code",
values_to = "value",
values_drop_na = TRUE
) %>%
mutate(code = stringr::str_replace(code, "newrel", "new_rel")) %>%
separate(code, c("new", "var", "sexage")) %>%
select(-new, -iso2, -iso3) %>%
separate(sexage, c("sex", "age"), sep = 1)(graph_who <- who_tidy %>%
group_by(country, year, sex) %>%
summarise(cases = sum(value)) %>%
ungroup())
# A tibble: 6,921 × 4
country year sex cases
<chr> <int> <chr> <int>
1 Afghanistan 1997 f 102
2 Afghanistan 1997 m 26
3 Afghanistan 1998 f 1207
4 Afghanistan 1998 m 571
5 Afghanistan 1999 f 517
6 Afghanistan 1999 m 228
7 Afghanistan 2000 f 1751
8 Afghanistan 2000 m 915
9 Afghanistan 2001 f 3062
10 Afghanistan 2001 m 1577
# … with 6,911 more rows
graph_who %>%
filter(year > 1995, cases > 50000) %>%
ggplot(aes(x = year, y = cases, color = country)) +
geom_line() +
facet_grid(~sex)
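One optional refinement, sketched below: putting cases on a log-10 scale keeps the largest countries from compressing the remaining lines, without changing the filtering above.

# Same plot with a log scale on the y-axis
graph_who %>%
filter(year > 1995, cases > 50000) %>%
ggplot(aes(x = year, y = cases, color = country)) +
geom_line() +
scale_y_log10() +
facet_grid(~sex)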