This code is repeated from the Tidy Data Case Study because it is needed by the exercises.

library(tidyverse)
who1 <- who %>%
  gather(new_sp_m014:newrel_f65, key = "key", value = "cases", na.rm = TRUE)
glimpse(who1)
Observations: 76,046
Variables: 6
$ country <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan"...
$ iso2    <chr> "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF"...
$ iso3    <chr> "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG...
$ year    <int> 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011...
$ key     <chr> "new_sp_m014", "new_sp_m014", "new_sp_m014", "new_sp_m014", "new_sp_m014", "new_sp_m014"...
$ cases   <int> 0, 30, 8, 52, 129, 90, 127, 139, 151, 193, 186, 187, 200, 197, 204, 188, 0, 0, 1, 0, 2, ...
who2 <- who1 %>%
  mutate(key = str_replace(key, "newrel", "new_rel"))
who3 <- who2 %>%
  separate(key, c("new", "type", "sexage"), sep = "_")
who3
who3 %>%
  count(new)
who4 <- who3 %>%
  select(-new, -iso2, -iso3)
who5 <- who4 %>%
  separate(sexage, c("sex", "age"), sep = 1)
who5

1 In this case study, I set na.rm = TRUE just to make it easier to check that we had the correct values. Is this reasonable? Think about how missing values are represented in this dataset. Are there implicit missing values? What’s the difference between an NA and zero?

The reasonableness of using na.rm = TRUE depends on how missing values are represented in this dataset. The main concern is whether a missing value means that there were no cases of TB or whether it means that the WHO does not have data on the number of TB cases. Here are some things we should look for to help distinguish between these cases.

  • If there are no 0 values in the data, then missing values may be used to indicate no cases.

  • If there are both explicit and implicit missing values, then it suggests that missing values are being used differently. In that case, it is likely that explicit missing values would mean no cases, and implicit missing values would mean no data on the number of cases.
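The three situations discussed above (a zero, an explicit NA, and an implicit missing value) can be told apart on a small example. This is a minimal sketch on a hypothetical toy table, not on who itself; the name toy and its values are made up for illustration.

```r
library(tibble)
library(tidyr)

# Hypothetical toy data (not taken from `who`):
# country A reports a zero in 2000 and an explicit NA in 2001;
# country B has no row at all for 2000 (an implicit missing value).
toy <- tribble(
  ~country, ~year, ~cases,
  "A",      2000,  0,
  "A",      2001,  NA,
  "B",      2001,  5
)

# Zeros and explicit NAs can be counted from the rows that are present.
n_zero <- sum(toy$cases == 0, na.rm = TRUE)
n_explicit_na <- sum(is.na(toy$cases))

# Implicit missing values only show up after expanding to every
# (country, year) combination; complete() fills the new rows with NA.
toy_complete <- complete(toy, country, year)
n_implicit <- nrow(toy_complete) - nrow(toy)
```

Each of the three counts captures a different kind of "nothing", which is why they are checked separately.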

First, I’ll check for the presence of zeros in the data.

who1 %>%
  filter(cases == 0) %>%
  nrow()
[1] 11080

There are zeros in the data, so it appears that zero cases of TB are recorded explicitly as 0, and NA is used to indicate missing data.

Second, I should check whether all values for a (country, year) are missing or whether it is possible for only some columns to be missing.

gather(who, new_sp_m014:newrel_f65, key = "key", value = "cases") %>%
  group_by(country, year) %>%
  mutate(prop_missing = sum(is.na(cases)) / n()) %>%
  filter(prop_missing > 0, prop_missing < 1)

From the results above, it looks like it is possible for a (country, year) row to contain some, but not all, missing values in its columns.
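The same kind of check can be condensed to one row per (country, year) by using summarise() instead of mutate(). The sketch below runs on a hypothetical toy table (toy_long and its values are invented for illustration) so that the result is easy to verify by eye.

```r
library(dplyr)
library(tibble)

# Hypothetical long-format data: two key columns per (country, year).
toy_long <- tribble(
  ~country, ~year, ~key,   ~cases,
  "A",      2000,  "m014", 3,
  "A",      2000,  "f014", NA,
  "B",      2000,  "m014", NA,
  "B",      2000,  "f014", NA,
  "C",      2000,  "m014", 1,
  "C",      2000,  "f014", 2
)

# mean(is.na(x)) is the proportion of missing values in x.
# Keep only groups that are partly, but not entirely, missing.
partial <- toy_long %>%
  group_by(country, year) %>%
  summarise(prop_missing = mean(is.na(cases)), .groups = "drop") %>%
  filter(prop_missing > 0, prop_missing < 1)
```

Only country A survives the filter: B is entirely missing and C is entirely observed.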

Finally, I will check for implicit missing values. Implicit missing values are (country, year) combinations that do not appear in the data.

nrow(who)
[1] 7240
who %>%
  complete(country, year) %>%
  nrow()
[1] 7446

Since the number of complete (country, year) combinations is greater than the number of rows in who, there are some implicit missing values. But that doesn’t tell us which combinations are missing. To find them, I will use the anti_join() function, which is introduced later in the Relational Data lecture.

anti_join(complete(who, country, year), who, by = c("country", "year")) %>%
  select(country, year) %>%
  group_by(country) %>%
  # so I can make better sense of the years
  summarise(min_year = min(year), max_year = max(year))

All of these refer to (country, year) combinations for years prior to the existence of the country. For example, Timor-Leste achieved independence in 2002, so years prior to that are not included in the data.
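To see why the anti_join() approach isolates the implicit missing values, it can help to run it on a table small enough to inspect. This is a sketch on hypothetical data; the country names "Old" and "New" are placeholders, with "New" standing in for a country that has no row for its first year.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical data: "New" has no row for 2000.
toy <- tribble(
  ~country, ~year, ~cases,
  "Old",    2000,  4,
  "Old",    2001,  6,
  "New",    2001,  2
)

# complete() builds every (country, year) combination;
# anti_join() then keeps only the combinations absent from the original.
missing_combos <- toy %>%
  complete(country, year) %>%
  anti_join(toy, by = c("country", "year")) %>%
  select(country, year)
```

The result contains exactly the rows that complete() had to invent, which are the implicit missing values.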

To summarize:

  • 0 is used to represent no cases of TB.
  • Explicit missing values (NAs) are used to represent missing data for (country, year) combinations in which the country existed in that year.
  • Implicit missing values are used to represent missing data because a country did not exist in that year.

2 What happens if you neglect the mutate() step? (mutate(key = str_replace(key, "newrel", "new_rel")))

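The separate() function emits a warning that some keys have too few pieces (“Expected 3 pieces. Missing pieces filled with `NA`”). Keys beginning with "newrel_" split into only two pieces, so for those rows new = "newrel", type holds the sex and age value (for example, m014), and sexage is NA.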

who3a <- who1 %>%
  separate(key, c("new", "type", "sexage"), sep = "_")
Expected 3 pieces. Missing pieces filled with `NA` in 2580 rows [73467, 73468, 73469, 73470, 73471, 73472, 73473, 73474, 73475, 73476, 73477, 73478, 73479, 73480, 73481, 73482, 73483, 73484, 73485, 73486, ...].
filter(who3a, new == "newrel") %>% head()
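The misalignment is easiest to see on just two keys, one of each form. In this sketch the tibble keys is a made-up example, and fill = "right" (an argument of tidyr's separate()) is used only to pad with NA without raising the warning; it does not fix the underlying problem.

```r
library(tidyr)
library(tibble)

# One well-formed key and one key missing the underscore after "new".
keys <- tibble(key = c("new_sp_m014", "newrel_m014"))

# The second key yields only two pieces, so they land in the first two
# columns and the third column is padded with NA.
split_keys <- separate(keys, key, c("new", "type", "sexage"),
                       sep = "_", fill = "right")
```

Here the second row ends up with "newrel" in new and "m014" in type, which is exactly the shift that the str_replace() step prevents.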

3 I claimed that iso2 and iso3 were redundant with country. Confirm this claim.

If iso2 and iso3 are redundant with country, then, within each country, there should only be one distinct combination of iso2 and iso3 values, which is the case.

select(who3, country, iso2, iso3) %>%
  distinct() %>%
  group_by(country) %>%
  filter(n() > 1)

This makes sense, since iso2 and iso3 contain the 2- and 3-letter abbreviations for the country. The iso2 variable contains each country’s ISO 3166-1 alpha-2 code, and the iso3 variable contains each country’s ISO 3166-1 alpha-3 code. You may recognize the alpha-2 codes, since they are almost identical to internet country-code top level domains, such as .ly (Libya), .tv (Tuvalu), and .io (British Indian Ocean Territory). The United Kingdom is a well-known exception: its domain is .uk, but its ISO 3166-1 alpha-2 code is GB.
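The redundancy check can also be phrased as a count: after removing duplicate rows, every country should map to exactly one (iso2, iso3) pair. The sketch below uses a hypothetical three-row table (toy_codes) rather than who3, and the column name n_codes is chosen here for illustration.

```r
library(dplyr)
library(tibble)

# Hypothetical excerpt with a repeated country, as in the long data.
toy_codes <- tribble(
  ~country,      ~iso2, ~iso3,
  "Afghanistan", "AF",  "AFG",
  "Afghanistan", "AF",  "AFG",
  "Albania",     "AL",  "ALB"
)

# Number of distinct code pairs per country.
codes_per_country <- toy_codes %>%
  distinct(country, iso2, iso3) %>%
  count(country, name = "n_codes")

# TRUE when the codes carry no information beyond `country`.
all_unique <- all(codes_per_country$n_codes == 1)
```

A single TRUE or FALSE can be easier to assert on in a script than an empty filtered table.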

4 For each country, year, and sex compute the total number of cases of TB. Make an informative visualization of the data.

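who5 %>%
  # restrict to years after 1995 before aggregating
  filter(year > 1995) %>%
  group_by(country, year, sex) %>%
  summarise(cases = sum(cases)) %>%
  # one identifier per (country, sex) pair, so each pair gets its own line
  unite(country_sex, country, sex, remove = FALSE) %>%
  ggplot(aes(x = year, y = cases, group = country_sex, colour = sex)) +
  geom_line()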

A small multiples plot faceting by country is difficult given the number of countries. Focusing on those countries with the largest changes or absolute magnitudes after providing the context above is another option.
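One way to focus on the countries with the largest absolute magnitudes is to rank countries by total cases and facet only the top few. This is a sketch under assumptions: toy_totals is hypothetical data shaped like the per-(country, year, sex) totals, and the cutoff n = 2 is arbitrary.

```r
library(dplyr)
library(tibble)
library(ggplot2)

# Hypothetical totals by country, year, and sex.
toy_totals <- tribble(
  ~country, ~year, ~sex, ~cases,
  "A",      2000,  "m",  100,
  "A",      2000,  "f",  80,
  "A",      2001,  "m",  120,
  "A",      2001,  "f",  90,
  "B",      2000,  "m",  10,
  "B",      2000,  "f",  5,
  "B",      2001,  "m",  12,
  "B",      2001,  "f",  6,
  "C",      2000,  "m",  50,
  "C",      2000,  "f",  40,
  "C",      2001,  "m",  55,
  "C",      2001,  "f",  45
)

# Rank countries by total cases and keep the largest two.
top_countries <- toy_totals %>%
  group_by(country) %>%
  summarise(total = sum(cases)) %>%
  slice_max(total, n = 2) %>%
  pull(country)

# With only a few countries left, faceting by country becomes readable.
p <- toy_totals %>%
  filter(country %in% top_countries) %>%
  ggplot(aes(x = year, y = cases, colour = sex)) +
  geom_line() +
  facet_wrap(~country)
```

Ranking by the largest change over time instead of the total would only require swapping the summary inside summarise().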
