Let’s pull together everything you’ve learned to tackle a realistic data tidying problem. The tidyr::who dataset contains tuberculosis (TB) cases broken down by year, country, age, gender, and diagnosis method. The data comes from the 2014 World Health Organization Global Tuberculosis Report, available at http://www.who.int/tb/country/data/download/en/.

There’s a wealth of epidemiological information in this dataset, but it’s challenging to work with the data in the form that it’s provided:

library(tidyverse)
who

This is a very typical real-life example dataset. It contains redundant columns, odd variable codes, and many missing values. In short, who is messy, and we’ll need multiple steps to tidy it. Like dplyr, tidyr is designed so that each function does one thing well. That means in real-life situations you’ll usually need to string together multiple verbs into a pipeline.

The best place to start is almost always to gather together the columns that are not variables. Let’s have a look at what we’ve got:

So we need to gather together all the columns from new_sp_m014 to newrel_f65. We don’t know what those values represent yet, so we’ll give them the generic name "key". We know the cells represent the count of cases, so we’ll use the variable cases. There are a lot of missing values in the current representation, so for now we’ll use na.rm just so we can focus on the values that are present.

who1 <- who %>% 
  pivot_longer(
    cols = new_sp_m014:newrel_f65, 
    names_to = "key", 
    values_to = "cases", 
    values_drop_na = TRUE
  )
who1

We can get some hint of the structure of the values in the new key column by counting them:

who1 %>% 
  count(key)

You might be able to parse this out by yourself with a little thought and some experimentation, but luckily we have the data dictionary handy. It tells us:

1. The first three letters of each column denote whether the column contains new or old cases of TB. In this dataset, each column contains new cases.

2. The next two letters describe the type of TB:

3. The sixth letter gives the sex of TB patients. The dataset groups cases by males (m) and females (f).

4. The remaining numbers gives the age group. The dataset groups cases into seven age groups:

We need to make a minor fix to the format of the column names: unfortunately the names are slightly inconsistent because instead of new_rel we have newrel (it’s hard to spot this here but if you don’t fix it we’ll get errors in subsequent steps). You’ll learn about str_replace() in the strings lecture, but the basic idea is pretty simple: replace the characters “newrel” with “new_rel”. This makes all variable names consistent.

who2 <- who1 %>% 
  mutate(names_from = stringr::str_replace(key, "newrel", "new_rel"))
who2

We can separate the values in each code with two passes of separate(). The first pass will split the codes at each underscore.

who3 <- who2 %>% 
  separate(key, c("new", "type", "sexage"), sep = "_")
Expected 3 pieces. Missing pieces filled with `NA` in 2580 rows [243, 244, 679, 680, 681, 682, 683, 684, 685, 686, 687, 688, 689, 690, 691, 692, 903, 904, 905, 906, ...].
who3

Then we might as well drop the new column because it’s constant in this dataset.

who3 %>% 
  count(new)

While we’re dropping columns, let’s also drop iso2 and iso3 since they’re redundant.

who4 <- who3 %>% 
  select(-new, -iso2, -iso3)
who4

Next we’ll separate sexage into sex and age by splitting after the first character:

who5 <- who4 %>% 
  separate(sexage, c("sex", "age"), sep = 1)
who5

The who dataset is now tidy!

I’ve shown you the code a piece at a time, assigning each interim result to a new variable. This typically isn’t how you’d work interactively. Instead, you’d gradually build up a complex pipe:

who %>%
  pivot_longer(
    cols = new_sp_m014:newrel_f65, 
    names_to = "key", 
    values_to = "cases", 
    values_drop_na = TRUE
  ) %>% 
  mutate(
    key = stringr::str_replace(key, "newrel", "new_rel")
  ) %>%
  separate(key, c("new", "var", "sexage")) %>% 
  select(-new, -iso2, -iso3) %>% 
  separate(sexage, c("sex", "age"), sep = 1)

Head over to Case Study WHO Example Questions to see more.

LS0tDQp0aXRsZTogIlRpZHkgRGF0YSBDYXNlIFN0dWR5Ig0Kb3V0cHV0OiBodG1sX25vdGVib29rDQotLS0NCg0KTGV04oCZcyBwdWxsIHRvZ2V0aGVyIGV2ZXJ5dGhpbmcgeW914oCZdmUgbGVhcm5lZCB0byB0YWNrbGUgYSByZWFsaXN0aWMgZGF0YSB0aWR5aW5nIHByb2JsZW0uIFRoZSBgdGlkeXI6Ondob2AgZGF0YXNldCBjb250YWlucyB0dWJlcmN1bG9zaXMgKFRCKSBjYXNlcyBicm9rZW4gZG93biBieSB5ZWFyLCBjb3VudHJ5LCBhZ2UsIGdlbmRlciwgYW5kIGRpYWdub3NpcyBtZXRob2QuIFRoZSBkYXRhIGNvbWVzIGZyb20gdGhlIDIwMTQgV29ybGQgSGVhbHRoIE9yZ2FuaXphdGlvbiBHbG9iYWwgVHViZXJjdWxvc2lzIFJlcG9ydCwgYXZhaWxhYmxlIGF0IDxodHRwOi8vd3d3Lndoby5pbnQvdGIvY291bnRyeS9kYXRhL2Rvd25sb2FkL2VuLy4+DQoNClRoZXJl4oCZcyBhIHdlYWx0aCBvZiBlcGlkZW1pb2xvZ2ljYWwgaW5mb3JtYXRpb24gaW4gdGhpcyBkYXRhc2V0LCBidXQgaXTigJlzIGNoYWxsZW5naW5nIHRvIHdvcmsgd2l0aCB0aGUgZGF0YSBpbiB0aGUgZm9ybSB0aGF0IGl04oCZcyBwcm92aWRlZDoNCg0KYGBge3IsbWVzc2FnZT1GQUxTRSx3YXJuaW5nPUZBTFNFfQ0KbGlicmFyeSh0aWR5dmVyc2UpDQp3aG8NCmBgYA0KDQoNClRoaXMgaXMgYSB2ZXJ5IHR5cGljYWwgcmVhbC1saWZlIGV4YW1wbGUgZGF0YXNldC4gSXQgY29udGFpbnMgcmVkdW5kYW50IGNvbHVtbnMsIG9kZCB2YXJpYWJsZSBjb2RlcywgYW5kIG1hbnkgbWlzc2luZyB2YWx1ZXMuIEluIHNob3J0LCBgd2hvYCBpcyBtZXNzeSwgYW5kIHdl4oCZbGwgbmVlZCBtdWx0aXBsZSBzdGVwcyB0byB0aWR5IGl0LiBMaWtlIGRwbHlyLCB0aWR5ciBpcyBkZXNpZ25lZCBzbyB0aGF0IGVhY2ggZnVuY3Rpb24gZG9lcyBvbmUgdGhpbmcgd2VsbC4gVGhhdCBtZWFucyBpbiByZWFsLWxpZmUgc2l0dWF0aW9ucyB5b3XigJlsbCB1c3VhbGx5IG5lZWQgdG8gc3RyaW5nIHRvZ2V0aGVyIG11bHRpcGxlIHZlcmJzIGludG8gYSBwaXBlbGluZS4NCg0KVGhlIGJlc3QgcGxhY2UgdG8gc3RhcnQgaXMgYWxtb3N0IGFsd2F5cyB0byBnYXRoZXIgdG9nZXRoZXIgdGhlIGNvbHVtbnMgdGhhdCBhcmUgbm90IHZhcmlhYmxlcy4gTGV04oCZcyBoYXZlIGEgbG9vayBhdCB3aGF0IHdl4oCZdmUgZ290Og0KDQotIEl0IGxvb2tzIGxpa2UgYGNvdW50cnlgLCBgaXNvMmAsIGFuZCBgaXNvM2AgYXJlIHRocmVlIHZhcmlhYmxlcyB0aGF0IHJlZHVuZGFudGx5IHNwZWNpZnkgdGhlIGNvdW50cnkuDQoNCi0gYHllYXJgIGlzIGNsZWFybHkgYWxzbyBhIHZhcmlhYmxlLg0KDQotIFdlIGRvbuKAmXQga25vdyB3aGF0IGFsbCB0aGUgb3RoZXIgY29sdW1ucyBhcmUgeWV0LCBidXQgZ2l2ZW4gdGhlIHN0cnVjdHVyZSBpbiB0aGUgdmFyaWFibGUgbmFtZXMgKGUuZy4gYG5ld19zcF9tMDE0YCwgYG5ld19lcF9tMDE0YCwgYG5ld19lcF9mMDE0YCkgdGhlc2UgYXJlIGxpa2VseSB0byBiZSB2YWx1ZXMsIG5vdCB2YXJpYWJsZXMuDQoNClNvIHdlIG5lZWQgdG8gZ2F0aGVyIHRvZ2V0aGVyIGFsbCB0aGUgY29sdW1ucyBmcm9tIGBuZXdfc3BfbTAxNGAgdG8gYG5ld3JlbF9mNjVgLiBXZSBkb27igJl0IGtub3cgd2hhdCB0aG9zZSB2YWx1ZXMgcmVwcmVzZW50IHlldCwgc28gd2XigJlsbCBnaXZlIHRoZW0gdGhlIGdlbmVyaWMgbmFtZSBgImtleSJgLiBXZSBrbm93IHRoZSBjZWxscyByZXByZXNlbnQgdGhlIGNvdW50IG9mIGNhc2VzLCBzbyB3ZeKAmWxsIHVzZSB0aGUgdmFyaWFibGUgYGNhc2VzYC4gVGhlcmUgYXJlIGEgbG90IG9mIG1pc3NpbmcgdmFsdWVzIGluIHRoZSBjdXJyZW50IHJlcHJlc2VudGF0aW9uLCBzbyBmb3Igbm93IHdl4oCZbGwgdXNlIGBuYS5ybWAganVzdCBzbyB3ZSBjYW4gZm9jdXMgb24gdGhlIHZhbHVlcyB0aGF0IGFyZSBwcmVzZW50Lg0KDQpgYGB7cn0NCndobzEgPC0gd2hvICU+JSANCiAgcGl2b3RfbG9uZ2VyKA0KICAgIGNvbHMgPSBuZXdfc3BfbTAxNDpuZXdyZWxfZjY1LCANCiAgICBuYW1lc190byA9ICJrZXkiLCANCiAgICB2YWx1ZXNfdG8gPSAiY2FzZXMiLCANCiAgICB2YWx1ZXNfZHJvcF9uYSA9IFRSVUUNCiAgKQ0Kd2hvMQ0KYGBgDQoNCldlIGNhbiBnZXQgc29tZSBoaW50IG9mIHRoZSBzdHJ1Y3R1cmUgb2YgdGhlIHZhbHVlcyBpbiB0aGUgbmV3IGBrZXlgIGNvbHVtbiBieSBjb3VudGluZyB0aGVtOg0KDQpgYGB7cn0NCndobzEgJT4lIA0KICBjb3VudChrZXkpDQpgYGANCg0KWW91IG1pZ2h0IGJlIGFibGUgdG8gcGFyc2UgdGhpcyBvdXQgYnkgeW91cnNlbGYgd2l0aCBhIGxpdHRsZSB0aG91Z2h0IGFuZCBzb21lIGV4cGVyaW1lbnRhdGlvbiwgYnV0IGx1Y2tpbHkgd2UgaGF2ZSB0aGUgZGF0YSBkaWN0aW9uYXJ5IGhhbmR5LiBJdCB0ZWxscyB1czoNCg0KKioxLioqIFRoZSBmaXJzdCB0aHJlZSBsZXR0ZXJzIG9mIGVhY2ggY29sdW1uIGRlbm90ZSB3aGV0aGVyIHRoZSBjb2x1bW4gY29udGFpbnMgbmV3IG9yIG9sZCBjYXNlcyBvZiBUQi4gSW4gdGhpcyBkYXRhc2V0LCBlYWNoIGNvbHVtbiBjb250YWlucyBuZXcgY2FzZXMuDQoNCioqMi4qKiBUaGUgbmV4dCB0d28gbGV0dGVycyBkZXNjcmliZSB0aGUgdHlwZSBvZiBUQjoNCg0KICAgLSBgcmVsYCBzdGFuZHMgZm9yIGNhc2VzIG9mIHJlbGFwc2UNCiAgIC0gYGVwYCBzdGFuZHMgZm9yIGNhc2VzIG9mIGV4dHJhcHVsbW9uYXJ5IFRCDQogICAtIGBzbmAgc3RhbmRzIGZvciBjYXNlcyBvZiBwdWxtb25hcnkgVEIgdGhhdCBjb3VsZCBub3QgYmUgZGlhZ25vc2VkIGJ5IGEgcHVsbW9uYXJ5IHNtZWFyIChzbWVhciBuZWdhdGl2ZSkNCiAgIC0gYHNwYCBzdGFuZHMgZm9yIGNhc2VzIG9mIHB1bG1vbmFyeSBUQiB0aGF0IGNvdWxkIGJlIGRpYWdub3NlZCBiZSBhIHB1bG1vbmFyeSBzbWVhciAoc21lYXIgcG9zaXRpdmUpDQoNCioqMy4qKiBUaGUgc2l4dGggbGV0dGVyIGdpdmVzIHRoZSBzZXggb2YgVEIgcGF0aWVudHMuIFRoZSBkYXRhc2V0IGdyb3VwcyBjYXNlcyBieSBtYWxlcyAoYG1gKSBhbmQgZmVtYWxlcyAoYGZgKS4NCg0KKio0LioqIFRoZSByZW1haW5pbmcgbnVtYmVycyBnaXZlcyB0aGUgYWdlIGdyb3VwLiBUaGUgZGF0YXNldCBncm91cHMgY2FzZXMgaW50byBzZXZlbiBhZ2UgZ3JvdXBzOg0KDQogICAtIGAwMTRgID0gMCDigJMgMTQgeWVhcnMgb2xkDQogICAtIGAxNTI0YCA9IDE1IOKAkyAyNCB5ZWFycyBvbGQNCiAgIC0gYDI1MzRgID0gMjUg4oCTIDM0IHllYXJzIG9sZA0KICAgLSBgMzU0NGAgPSAzNSDigJMgNDQgeWVhcnMgb2xkDQogICAtIGA0NTU0YCA9IDQ1IOKAkyA1NCB5ZWFycyBvbGQNCiAgIC0gYDU1NjRgID0gNTUg4oCTIDY0IHllYXJzIG9sZA0KICAgLSBgNjVgID0gNjUgb3Igb2xkZXINCg0KV2UgbmVlZCB0byBtYWtlIGEgbWlub3IgZml4IHRvIHRoZSBmb3JtYXQgb2YgdGhlIGNvbHVtbiBuYW1lczogdW5mb3J0dW5hdGVseSB0aGUgbmFtZXMgYXJlIHNsaWdodGx5IGluY29uc2lzdGVudCBiZWNhdXNlIGluc3RlYWQgb2YgYG5ld19yZWxgIHdlIGhhdmUgYG5ld3JlbGAgKGl04oCZcyBoYXJkIHRvIHNwb3QgdGhpcyBoZXJlIGJ1dCBpZiB5b3UgZG9u4oCZdCBmaXggaXQgd2XigJlsbCBnZXQgZXJyb3JzIGluIHN1YnNlcXVlbnQgc3RlcHMpLiBZb3XigJlsbCBsZWFybiBhYm91dCBgc3RyX3JlcGxhY2UoKWAgaW4gdGhlIHN0cmluZ3MgbGVjdHVyZSwgYnV0IHRoZSBiYXNpYyBpZGVhIGlzIHByZXR0eSBzaW1wbGU6IHJlcGxhY2UgdGhlIGNoYXJhY3RlcnMg4oCcbmV3cmVs4oCdIHdpdGgg4oCcbmV3X3JlbOKAnS4gVGhpcyBtYWtlcyBhbGwgdmFyaWFibGUgbmFtZXMgY29uc2lzdGVudC4NCg0KYGBge3J9DQp3aG8yIDwtIHdobzEgJT4lIA0KICBtdXRhdGUobmFtZXNfZnJvbSA9IHN0cmluZ3I6OnN0cl9yZXBsYWNlKGtleSwgIm5ld3JlbCIsICJuZXdfcmVsIikpDQp3aG8yDQpgYGANCg0KV2UgY2FuIHNlcGFyYXRlIHRoZSB2YWx1ZXMgaW4gZWFjaCBjb2RlIHdpdGggdHdvIHBhc3NlcyBvZiBgc2VwYXJhdGUoKWAuIFRoZSBmaXJzdCBwYXNzIHdpbGwgc3BsaXQgdGhlIGNvZGVzIGF0IGVhY2ggdW5kZXJzY29yZS4NCg0KYGBge3J9DQp3aG8zIDwtIHdobzIgJT4lIA0KICBzZXBhcmF0ZShrZXksIGMoIm5ldyIsICJ0eXBlIiwgInNleGFnZSIpLCBzZXAgPSAiXyIpDQp3aG8zDQpgYGANCg0KVGhlbiB3ZSBtaWdodCBhcyB3ZWxsIGRyb3AgdGhlIGBuZXdgIGNvbHVtbiBiZWNhdXNlIGl04oCZcyBjb25zdGFudCBpbiB0aGlzIGRhdGFzZXQuIA0KDQpgYGB7cn0NCndobzMgJT4lIA0KICBjb3VudChuZXcpDQpgYGANCg0KV2hpbGUgd2XigJlyZSBkcm9wcGluZyBjb2x1bW5zLCBsZXTigJlzIGFsc28gZHJvcCBgaXNvMmAgYW5kIGBpc28zYCBzaW5jZSB0aGV54oCZcmUgcmVkdW5kYW50Lg0KDQpgYGB7cn0NCndobzQgPC0gd2hvMyAlPiUgDQogIHNlbGVjdCgtbmV3LCAtaXNvMiwgLWlzbzMpDQp3aG80DQpgYGANCg0KTmV4dCB3ZeKAmWxsIHNlcGFyYXRlIGBzZXhhZ2VgIGludG8gYHNleGAgYW5kIGBhZ2VgIGJ5IHNwbGl0dGluZyBhZnRlciB0aGUgZmlyc3QgY2hhcmFjdGVyOg0KDQpgYGB7cn0NCndobzUgPC0gd2hvNCAlPiUgDQogIHNlcGFyYXRlKHNleGFnZSwgYygic2V4IiwgImFnZSIpLCBzZXAgPSAxKQ0Kd2hvNQ0KYGBgDQoNClRoZSB3aG8gZGF0YXNldCBpcyBub3cgdGlkeSENCg0KSeKAmXZlIHNob3duIHlvdSB0aGUgY29kZSBhIHBpZWNlIGF0IGEgdGltZSwgYXNzaWduaW5nIGVhY2ggaW50ZXJpbSByZXN1bHQgdG8gYSBuZXcgdmFyaWFibGUuIFRoaXMgdHlwaWNhbGx5IGlzbuKAmXQgaG93IHlvdeKAmWQgd29yayBpbnRlcmFjdGl2ZWx5LiBJbnN0ZWFkLCB5b3XigJlkIGdyYWR1YWxseSBidWlsZCB1cCBhIGNvbXBsZXggcGlwZToNCg0KYGBge3IsZXZhbD1GQUxTRX0NCndobyAlPiUNCiAgcGl2b3RfbG9uZ2VyKA0KICAgIGNvbHMgPSBuZXdfc3BfbTAxNDpuZXdyZWxfZjY1LCANCiAgICBuYW1lc190byA9ICJrZXkiLCANCiAgICB2YWx1ZXNfdG8gPSAiY2FzZXMiLCANCiAgICB2YWx1ZXNfZHJvcF9uYSA9IFRSVUUNCiAgKSAlPiUgDQogIG11dGF0ZSgNCiAgICBrZXkgPSBzdHJpbmdyOjpzdHJfcmVwbGFjZShrZXksICJuZXdyZWwiLCAibmV3X3JlbCIpDQogICkgJT4lDQogIHNlcGFyYXRlKGtleSwgYygibmV3IiwgInZhciIsICJzZXhhZ2UiKSkgJT4lIA0KICBzZWxlY3QoLW5ldywgLWlzbzIsIC1pc28zKSAlPiUgDQogIHNlcGFyYXRlKHNleGFnZSwgYygic2V4IiwgImFnZSIpLCBzZXAgPSAxKQ0KYGBgDQoNCkhlYWQgb3ZlciB0byBbQ2FzZSBTdHVkeSBXSE8gRXhhbXBsZSBRdWVzdGlvbnNdKGh0dHBzOi8vcnB1YnMuY29tL3VreTk5NC81OTYwOTEpIHRvIHNlZSBtb3JlLg==