library("tidyverse")
Using prose, describe how the variables and observations are organized in each of the sample tables.
In table table1, each row represents a (country, year) combination. The columns cases and population contain the values for those variables.
table1
## # A tibble: 6 x 4
## country year cases population
## <chr> <int> <int> <int>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
In table2, each row represents a (country, year, variable) combination. The column count contains the values of variables cases and population in separate rows.
table2
## # A tibble: 12 x 4
## country year type count
## <chr> <int> <chr> <int>
## 1 Afghanistan 1999 cases 745
## 2 Afghanistan 1999 population 19987071
## 3 Afghanistan 2000 cases 2666
## 4 Afghanistan 2000 population 20595360
## 5 Brazil 1999 cases 37737
## 6 Brazil 1999 population 172006362
## 7 Brazil 2000 cases 80488
## 8 Brazil 2000 population 174504898
## 9 China 1999 cases 212258
## 10 China 1999 population 1272915272
## 11 China 2000 cases 213766
## 12 China 2000 population 1280428583
In table3, each row represents a (country, year) combination. The column rate provides the values of both cases and population in a string formatted like cases / population.
table3
## # A tibble: 6 x 3
## country year rate
## * <chr> <int> <chr>
## 1 Afghanistan 1999 745/19987071
## 2 Afghanistan 2000 2666/20595360
## 3 Brazil 1999 37737/172006362
## 4 Brazil 2000 80488/174504898
## 5 China 1999 212258/1272915272
## 6 China 2000 213766/1280428583
Table 4 is split into two tables, one table for each variable. The table table4a contains the values of cases and table4b contains the values of population. Within each table, each row represents a country, each column represents a year, and the cells are the value of the table’s variable for that country and year.
table4a
## # A tibble: 3 x 3
## country `1999` `2000`
## * <chr> <int> <int>
## 1 Afghanistan 745 2666
## 2 Brazil 37737 80488
## 3 China 212258 213766
table4b
## # A tibble: 3 x 3
## country `1999` `2000`
## * <chr> <int> <int>
## 1 Afghanistan 19987071 20595360
## 2 Brazil 172006362 174504898
## 3 China 1272915272 1280428583
Compute the rate for table2, and table4a + table4b. You will need to perform four operations:
Which representation is easiest to work with? Which is hardest? Why?
To calculate cases per person, we need to divide cases by population for each country and year. This is easiest if the cases and population variables are two columns in a data frame in which rows represent (country, year) combinations.
Table 2: First, create separate tables for cases and population and ensure that they are sorted in the same order.
t2_cases <- filter(table2, type == "cases") %>%
rename(cases = count) %>%
arrange(country, year)
t2_population <- filter(table2, type == "population") %>%
rename(population = count) %>%
arrange(country, year)
Then create a new data frame with the population and cases columns, and calculate the cases per capita in a new column.[^ex-12.2.2]
t2_cases_per_cap <- tibble(
year = t2_cases$year,
country = t2_cases$country,
cases = t2_cases$cases,
population = t2_population$population
) %>%
mutate(cases_per_cap = (cases / population) * 10000) %>%
select(country, year, cases_per_cap)
To store this new variable in the appropriate location, we will add new rows to table2.
t2_cases_per_cap <- t2_cases_per_cap %>%
mutate(type = "cases_per_cap") %>%
rename(count = cases_per_cap)
bind_rows(table2, t2_cases_per_cap) %>%
arrange(country, year, type, count)
## # A tibble: 18 x 4
## country year type count
## <chr> <int> <chr> <dbl>
## 1 Afghanistan 1999 cases 7.45e+2
## 2 Afghanistan 1999 cases_per_cap 3.73e-1
## 3 Afghanistan 1999 population 2.00e+7
## 4 Afghanistan 2000 cases 2.67e+3
## 5 Afghanistan 2000 cases_per_cap 1.29e+0
## 6 Afghanistan 2000 population 2.06e+7
## 7 Brazil 1999 cases 3.77e+4
## 8 Brazil 1999 cases_per_cap 2.19e+0
## 9 Brazil 1999 population 1.72e+8
## 10 Brazil 2000 cases 8.05e+4
## 11 Brazil 2000 cases_per_cap 4.61e+0
## 12 Brazil 2000 population 1.75e+8
## 13 China 1999 cases 2.12e+5
## 14 China 1999 cases_per_cap 1.67e+0
## 15 China 1999 population 1.27e+9
## 16 China 2000 cases 2.14e+5
## 17 China 2000 cases_per_cap 1.67e+0
## 18 China 2000 population 1.28e+9
Note that after adding the cases_per_cap rows, the type of count is coerced to numeric (double) because cases_per_cap is not an integer.
For table4a and table4b, create a new table for cases per capita, which we’ll name table4c, with country rows and year columns.
table4c <-
tibble(
country = table4a$country,
`1999` = table4a[["1999"]] / table4b[["1999"]] * 10000,
`2000` = table4a[["2000"]] / table4b[["2000"]] * 10000
)
table4c
## # A tibble: 3 x 3
## country `1999` `2000`
## <chr> <dbl> <dbl>
## 1 Afghanistan 0.373 1.29
## 2 Brazil 2.19 4.61
## 3 China 1.67 1.67
Neither table is particularly easy to work with. Since table2 has separate rows for cases and population we needed to generate a table with columns for cases and population where we could calculate cases per capita. table4a and table4b split the cases and population variables into different tables which made it easy to divide cases by population. However, we had to repeat this calculation for each row.
The ideal format of a data frame to answer this question is one with columns country, year, cases, and population. Then problem could be answered with a single mutate() call.
Recreate the plot showing change in cases over time using table2 instead of table1. What do you need to do first?
Before creating the plot with change in cases over time, we need to filter table to only include rows representing cases of TB.
table2 %>%
filter(type == "cases") %>%
ggplot(aes(year, count)) +
geom_line(aes(group = country), colour = "grey50") +
geom_point(aes(colour = country)) +
scale_x_continuous(breaks = unique(table2$year)) +
ylab("cases")
Why does spreading this tibble fail? How could you add a new column to fix the problem?
people <- tribble(
~name, ~key, ~value,
#-----------------|--------|------
"Phillip Woods", "age", 45,
"Phillip Woods", "height", 186,
"Phillip Woods", "age", 50,
"Jessica Cordero", "age", 37,
"Jessica Cordero", "height", 156
)
glimpse(people)
## Observations: 5
## Variables: 3
## $ name <chr> "Phillip Woods", "Phillip Woods", "Phillip Woods", "Jessica Cor…
## $ key <chr> "age", "height", "age", "age", "height"
## $ value <dbl> 45, 186, 50, 37, 156
spread(people, key, value)
## Error: Each row of output must be identified by a unique combination of keys.
## Keys are shared for 2 rows:
## * 1, 3
Spreading this data frame fails because the name and key columns do not uniquely identify rows. In particular, there are two rows with values for the age of “Phillip Woods”.
We could solve the problem by adding a row with a distinct observation count for each combination of name and key.
people2 <- people %>%
group_by(name, key) %>%
mutate(obs = row_number())
people2
## # A tibble: 5 x 4
## # Groups: name, key [4]
## name key value obs
## <chr> <chr> <dbl> <int>
## 1 Phillip Woods age 45 1
## 2 Phillip Woods height 186 1
## 3 Phillip Woods age 50 2
## 4 Jessica Cordero age 37 1
## 5 Jessica Cordero height 156 1
We can spread people2 because the combination of name and obs will uniquely identify the spread rows.
spread(people2, key, value)
## # A tibble: 3 x 4
## # Groups: name [2]
## name obs age height
## <chr> <int> <dbl> <dbl>
## 1 Jessica Cordero 1 37 156
## 2 Phillip Woods 1 45 186
## 3 Phillip Woods 2 50 NA
Another way to solve this problem is by keeping only distinct rows of the name and key values, and dropping duplicate rows.
people %>%
distinct(name, key, .keep_all = TRUE) %>%
spread(key, value)
## # A tibble: 2 x 3
## name age height
## <chr> <dbl> <dbl>
## 1 Jessica Cordero 37 156
## 2 Phillip Woods 45 186
However, before doing this you would want to understand why there are duplicates in the data to begin with. This is usually not merely a nuisance, but indicates deeper problems with the data.
What do the extra and fill arguments do in separate()? Experiment with the various options for the following two toy datasets.
tibble(x = c("a,b,c", "d,e,f,g", "h,i,j")) %>%
separate(x, c("one", "two", "three"))
## Warning: Expected 3 pieces. Additional pieces discarded in 1 rows [2].
## # A tibble: 3 x 3
## one two three
## <chr> <chr> <chr>
## 1 a b c
## 2 d e f
## 3 h i j
tibble(x = c("a,b,c", "d,e", "f,g,i")) %>%
separate(x, c("one", "two", "three"))
## Warning: Expected 3 pieces. Missing pieces filled with `NA` in 1 rows [2].
## # A tibble: 3 x 3
## one two three
## <chr> <chr> <chr>
## 1 a b c
## 2 d e <NA>
## 3 f g i
The extra argument tells separate() what to do if there are too many pieces, and the fill argument tells it what to do if there aren’t enough. By default, separate() drops extra values with a warning.
tibble(x = c("a,b,c", "d,e,f,g", "h,i,j")) %>%
separate(x, c("one", "two", "three"))
## Warning: Expected 3 pieces. Additional pieces discarded in 1 rows [2].
## # A tibble: 3 x 3
## one two three
## <chr> <chr> <chr>
## 1 a b c
## 2 d e f
## 3 h i j
Adding the argument, extra = "drop", produces the same result as above but without the warning.
tibble(x = c("a,b,c", "d,e,f,g", "h,i,j")) %>%
separate(x, c("one", "two", "three"), extra = "drop")
## # A tibble: 3 x 3
## one two three
## <chr> <chr> <chr>
## 1 a b c
## 2 d e f
## 3 h i j
Setting extra = "merge", then the extra values are not split, so "f,g" appears in column three.
tibble(x = c("a,b,c", "d,e,f,g", "h,i,j")) %>%
separate(x, c("one", "two", "three"), extra = "merge")
## # A tibble: 3 x 3
## one two three
## <chr> <chr> <chr>
## 1 a b c
## 2 d e f,g
## 3 h i j
In this example, one of the values, "d,e", has too few elements. The default for fill is similar to those in separate(); it fills columns with missing values but emits a warning. In this example, the 2nd row of column three is NA.
tibble(x = c("a,b,c", "d,e", "f,g,i")) %>%
separate(x, c("one", "two", "three"))
## Warning: Expected 3 pieces. Missing pieces filled with `NA` in 1 rows [2].
## # A tibble: 3 x 3
## one two three
## <chr> <chr> <chr>
## 1 a b c
## 2 d e <NA>
## 3 f g i
Alternative options for the fill are "right", to fill with missing values from the right, but without a warning
tibble(x = c("a,b,c", "d,e", "f,g,i")) %>%
separate(x, c("one", "two", "three"), fill = "right")
## # A tibble: 3 x 3
## one two three
## <chr> <chr> <chr>
## 1 a b c
## 2 d e <NA>
## 3 f g i
The option fill = "left" also fills with missing values without emitting a warning, but this time from the left side. Now, the 2nd row of column one will be missing, and the other values in that row are shifted right.
tibble(x = c("a,b,c", "d,e", "f,g,i")) %>%
separate(x, c("one", "two", "three"), fill = "left")
## # A tibble: 3 x 3
## one two three
## <chr> <chr> <chr>
## 1 a b c
## 2 <NA> d e
## 3 f g i
In this case study, I set na.rm = TRUE just to make it easier to check that we had the correct values. Is this reasonable? Think about how missing values are represented in this dataset. Are there implicit missing values? What’s the difference between an NA and zero?
The reasonableness of using na.rm = TRUE depends on how missing values are represented in this dataset. The main concern is whether a missing value means that there were no cases of TB or whether it means that the WHO does not have data on the number of TB cases. Here are some things we should look for to help distinguish between these cases.
If there are no 0 values in the data, then missing values may be used to indicate no cases.
If there are both explicit and implicit missing values, then it suggests that missing values are being used differently. In that case, it is likely that explicit missing values would mean no cases, and implicit missing values would mean no data on the number of cases.
``All of these refer to (country,year`) combinations for years prior to the existence of the country. For example, Timor-Leste achieved independence in 2002, so years prior to that are not included in the data.
To summarize:
0 is used to represent no cases of TB.NAs) are used to represent missing data for (country, year) combinations in which the country existed in that year.