library(tidyverse)
library(openintro)
Next, we'll learn about data wrangling, the art of getting your data into R in a useful form for visualisation and modelling. The term "wrangling" vividly describes how we may have to "fight" with data before putting them to use.
Data wrangling is very important: without it you can't work with your own data! There are three main parts to data wrangling: importing, tidying, and transforming data.
We have already learned how to transform data. In this part, we will learn:
tibble, the tidyverse variant of R data frames, provided by the tibble package
How to import data of various formats into R
How to tidy data into a standard form that is convenient for data exploration
We will also learn more about data transformation for a few particularly important data types:
Relational data will give you tools for working with multiple interrelated datasets.
Strings will introduce regular expressions, a powerful tool for manipulating strings.
Factors are how R stores categorical data. They are used when a variable has a fixed set of possible values, or when you want to use a non-alphabetical ordering of a string.
Dates and times will give you the key tools for working with dates and date-times.
Traditionally, R uses the data type data.frame to store rectangular (tabular) data. For example, if we check the data type of the built-in data set iris, we get:
class(iris)
## [1] "data.frame"
If we check the class of the data sets that we have studied so far, we will see that the results are different:
class(mpg)
## [1] "tbl_df" "tbl" "data.frame"
The result tbl stands for tibble, which is introduced by the tibble package in the tidyverse. Briefly speaking, a tibble is simply a data frame, but with some particular aspects refined to make data wrangling easier.
There are two main differences in the usage of a tibble vs. a classic data.frame: printing and subsetting.
Tibbles have a refined print method that shows only the first 10 rows, and all the columns that fit on screen. In addition to its name, each column reports its type. This makes it much easier to work with large data.
First, execute the following code in R by yourself. You will see that for a classic data.frame, all rows are printed on screen, with no data types indicated.
print(iris) # Try by yourself
In contrast, printing a tibble gives a much more reasonable preview of the data.
print(mpg)
## # A tibble: 234 × 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 audi a4 1.8 1999 4 auto… f 18 29 p comp…
## 2 audi a4 1.8 1999 4 manu… f 21 29 p comp…
## 3 audi a4 2 2008 4 manu… f 20 31 p comp…
## 4 audi a4 2 2008 4 auto… f 21 30 p comp…
## 5 audi a4 2.8 1999 6 auto… f 16 26 p comp…
## 6 audi a4 2.8 1999 6 manu… f 18 26 p comp…
## 7 audi a4 3.1 2008 6 auto… f 18 27 p comp…
## 8 audi a4 quattro 1.8 1999 4 manu… 4 18 26 p comp…
## 9 audi a4 quattro 1.8 1999 4 auto… 4 16 25 p comp…
## 10 audi a4 quattro 2 2008 4 manu… 4 20 28 p comp…
## # … with 224 more rows
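If you want more rows in the preview, or all columns forced onto the screen, the print() method for tibbles accepts n and width arguments; a quick sketch:
print(mpg, n = 5)          # show only the first 5 rows
print(mpg, width = Inf)    # print all columns, wrapping across lines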
When there are too many columns to be printed on screen, we can use glimpse() to help us.
print(loans_full_schema)
## # A tibble: 10,000 × 55
## emp_t…¹ emp_l…² state homeo…³ annua…⁴ verif…⁵ debt_…⁶ annua…⁷ verif…⁸ debt_…⁹
## <chr> <dbl> <fct> <fct> <dbl> <fct> <dbl> <dbl> <fct> <dbl>
## 1 "globa… 3 NJ MORTGA… 90000 Verifi… 18.0 NA "" NA
## 2 "wareh… 10 HI RENT 40000 Not Ve… 5.04 NA "" NA
## 3 "assem… 3 WI RENT 40000 Source… 21.2 NA "" NA
## 4 "custo… 1 PA RENT 30000 Not Ve… 10.2 NA "" NA
## 5 "secur… 10 CA RENT 35000 Verifi… 58.0 57000 "Verif… 37.7
## 6 "" NA KY OWN 34000 Not Ve… 6.46 NA "" NA
## 7 "hr " 10 MI MORTGA… 35000 Source… 23.7 155000 "Not V… 13.1
## 8 "polic… 10 AZ MORTGA… 110000 Source… 16.2 NA "" NA
## 9 "parts" 10 NV MORTGA… 65000 Source… 36.5 NA "" NA
## 10 "4th p… 3 IL RENT 30000 Not Ve… 18.9 NA "" NA
## # … with 9,990 more rows, 45 more variables: delinq_2y <int>,
## # months_since_last_delinq <int>, earliest_credit_line <dbl>,
## # inquiries_last_12m <int>, total_credit_lines <int>,
## # open_credit_lines <int>, total_credit_limit <int>,
## # total_credit_utilized <int>, num_collections_last_12m <int>,
## # num_historical_failed_to_pay <int>, months_since_90d_late <int>,
## # current_accounts_delinq <int>, total_collection_amount_ever <int>, …
glimpse(loans_full_schema)
## Rows: 10,000
## Columns: 55
## $ emp_title <chr> "global config engineer ", "warehouse…
## $ emp_length <dbl> 3, 10, 3, 1, 10, NA, 10, 10, 10, 3, 1…
## $ state <fct> NJ, HI, WI, PA, CA, KY, MI, AZ, NV, I…
## $ homeownership <fct> MORTGAGE, RENT, RENT, RENT, RENT, OWN…
## $ annual_income <dbl> 90000, 40000, 40000, 30000, 35000, 34…
## $ verified_income <fct> Verified, Not Verified, Source Verifi…
## $ debt_to_income <dbl> 18.01, 5.04, 21.15, 10.16, 57.96, 6.4…
## $ annual_income_joint <dbl> NA, NA, NA, NA, 57000, NA, 155000, NA…
## $ verification_income_joint <fct> , , , , Verified, , Not Verified, , ,…
## $ debt_to_income_joint <dbl> NA, NA, NA, NA, 37.66, NA, 13.12, NA,…
## $ delinq_2y <int> 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0…
## $ months_since_last_delinq <int> 38, NA, 28, NA, NA, 3, NA, 19, 18, NA…
## $ earliest_credit_line <dbl> 2001, 1996, 2006, 2007, 2008, 1990, 2…
## $ inquiries_last_12m <int> 6, 1, 4, 0, 7, 6, 1, 1, 3, 0, 4, 4, 8…
## $ total_credit_lines <int> 28, 30, 31, 4, 22, 32, 12, 30, 35, 9,…
## $ open_credit_lines <int> 10, 14, 10, 4, 16, 12, 10, 15, 21, 6,…
## $ total_credit_limit <int> 70795, 28800, 24193, 25400, 69839, 42…
## $ total_credit_utilized <int> 38767, 4321, 16000, 4997, 52722, 3898…
## $ num_collections_last_12m <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ num_historical_failed_to_pay <int> 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0…
## $ months_since_90d_late <int> 38, NA, 28, NA, NA, 60, NA, 71, 18, N…
## $ current_accounts_delinq <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ total_collection_amount_ever <int> 1250, 0, 432, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ current_installment_accounts <int> 2, 0, 1, 1, 1, 0, 2, 2, 6, 1, 2, 1, 2…
## $ accounts_opened_24m <int> 5, 11, 13, 1, 6, 2, 1, 4, 10, 5, 6, 7…
## $ months_since_last_credit_inquiry <int> 5, 8, 7, 15, 4, 5, 9, 7, 4, 17, 3, 4,…
## $ num_satisfactory_accounts <int> 10, 14, 10, 4, 16, 12, 10, 15, 21, 6,…
## $ num_accounts_120d_past_due <int> 0, 0, 0, 0, 0, 0, 0, NA, 0, 0, 0, 0, …
## $ num_accounts_30d_past_due <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ num_active_debit_accounts <int> 2, 3, 3, 2, 10, 1, 3, 5, 11, 3, 2, 2,…
## $ total_debit_limit <int> 11100, 16500, 4300, 19400, 32700, 272…
## $ num_total_cc_accounts <int> 14, 24, 14, 3, 20, 27, 8, 16, 19, 7, …
## $ num_open_cc_accounts <int> 8, 14, 8, 3, 15, 12, 7, 12, 14, 5, 8,…
## $ num_cc_carrying_balance <int> 6, 4, 6, 2, 13, 5, 6, 10, 14, 3, 5, 3…
## $ num_mort_accounts <int> 1, 0, 0, 0, 0, 3, 2, 7, 2, 0, 2, 3, 3…
## $ account_never_delinq_percent <dbl> 92.9, 100.0, 93.5, 100.0, 100.0, 78.1…
## $ tax_liens <int> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ public_record_bankrupt <int> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0…
## $ loan_purpose <fct> moving, debt_consolidation, other, de…
## $ application_type <fct> individual, individual, individual, i…
## $ loan_amount <int> 28000, 5000, 2000, 21600, 23000, 5000…
## $ term <dbl> 60, 36, 36, 36, 36, 36, 60, 60, 36, 3…
## $ interest_rate <dbl> 14.07, 12.61, 17.09, 6.72, 14.07, 6.7…
## $ installment <dbl> 652.53, 167.54, 71.40, 664.19, 786.87…
## $ grade <fct> C, C, D, A, C, A, C, B, C, A, C, B, C…
## $ sub_grade <fct> C3, C1, D1, A3, C3, A3, C2, B5, C2, A…
## $ issue_month <fct> Mar-2018, Feb-2018, Feb-2018, Jan-201…
## $ loan_status <fct> Current, Current, Current, Current, C…
## $ initial_listing_status <fct> whole, whole, fractional, whole, whol…
## $ disbursement_method <fct> Cash, Cash, Cash, Cash, Cash, Cash, C…
## $ balance <dbl> 27015.86, 4651.37, 1824.63, 18853.26,…
## $ paid_total <dbl> 1999.330, 499.120, 281.800, 3312.890,…
## $ paid_principal <dbl> 984.14, 348.63, 175.37, 2746.74, 1569…
## $ paid_interest <dbl> 1015.19, 150.49, 106.43, 566.15, 754.…
## $ paid_late_fees <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
If you want to pull out a single variable from a data frame or tibble, you need some new tools: $ and [[. [[ can extract by name or position; $ only extracts by name but is a little less typing.
glimpse(mpg)
## Rows: 234
## Columns: 11
## $ manufacturer <chr> "audi", "audi", "audi", "audi", "audi", "audi", "audi", "…
## $ model <chr> "a4", "a4", "a4", "a4", "a4", "a4", "a4", "a4 quattro", "…
## $ displ <dbl> 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, 1.8, 1.8, 2.0, 2.0, 2.…
## $ year <int> 1999, 1999, 2008, 2008, 1999, 1999, 2008, 1999, 1999, 200…
## $ cyl <int> 4, 4, 4, 4, 6, 6, 6, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 8, 8, …
## $ trans <chr> "auto(l5)", "manual(m5)", "manual(m6)", "auto(av)", "auto…
## $ drv <chr> "f", "f", "f", "f", "f", "f", "f", "4", "4", "4", "4", "4…
## $ cty <int> 18, 21, 20, 21, 16, 18, 18, 18, 16, 20, 19, 15, 17, 17, 1…
## $ hwy <int> 29, 29, 31, 30, 26, 26, 27, 26, 25, 28, 27, 25, 25, 25, 2…
## $ fl <chr> "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p…
## $ class <chr> "compact", "compact", "compact", "compact", "compact", "c…
mpg$cyl
## [1] 4 4 4 4 6 6 6 4 4 4 4 6 6 6 6 6 6 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 4 4 6 6 6
## [38] 4 6 6 6 6 6 6 6 6 6 6 6 6 6 6 8 8 8 8 8 6 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8
## [75] 8 8 8 6 6 6 6 8 8 6 6 8 8 8 8 8 6 6 6 6 8 8 8 8 8 4 4 4 4 4 4 4 4 4 4 4 4
## [112] 4 6 6 6 4 4 4 4 6 6 6 6 6 6 8 8 8 8 8 8 8 8 8 8 8 8 6 6 8 8 4 4 4 4 6 6 6
## [149] 6 6 6 6 6 8 6 6 6 6 8 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 6 6 6 8 4 4 4 4 6 6
## [186] 6 4 4 4 4 6 6 6 4 4 4 4 4 8 8 4 4 4 6 6 6 6 4 4 4 4 6 4 4 4 4 4 5 5 6 6 4
## [223] 4 4 4 5 5 4 4 4 4 6 6 6
mpg[["cyl"]]
## [1] 4 4 4 4 6 6 6 4 4 4 4 6 6 6 6 6 6 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 4 4 6 6 6
## [38] 4 6 6 6 6 6 6 6 6 6 6 6 6 6 6 8 8 8 8 8 6 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8
## [75] 8 8 8 6 6 6 6 8 8 6 6 8 8 8 8 8 6 6 6 6 8 8 8 8 8 4 4 4 4 4 4 4 4 4 4 4 4
## [112] 4 6 6 6 4 4 4 4 6 6 6 6 6 6 8 8 8 8 8 8 8 8 8 8 8 8 6 6 8 8 4 4 4 4 6 6 6
## [149] 6 6 6 6 6 8 6 6 6 6 8 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 6 6 6 8 4 4 4 4 6 6
## [186] 6 4 4 4 4 6 6 6 4 4 4 4 4 8 8 4 4 4 6 6 6 6 4 4 4 4 6 4 4 4 4 4 5 5 6 6 4
## [223] 4 4 4 5 5 4 4 4 4 6 6 6
mpg[[5]] # Subset by position
## [1] 4 4 4 4 6 6 6 4 4 4 4 6 6 6 6 6 6 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 4 4 6 6 6
## [38] 4 6 6 6 6 6 6 6 6 6 6 6 6 6 6 8 8 8 8 8 6 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8
## [75] 8 8 8 6 6 6 6 8 8 6 6 8 8 8 8 8 6 6 6 6 8 8 8 8 8 4 4 4 4 4 4 4 4 4 4 4 4
## [112] 4 6 6 6 4 4 4 4 6 6 6 6 6 6 8 8 8 8 8 8 8 8 8 8 8 8 6 6 8 8 4 4 4 4 6 6 6
## [149] 6 6 6 6 6 8 6 6 6 6 8 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 6 6 6 8 4 4 4 4 6 6
## [186] 6 4 4 4 4 6 6 6 4 4 4 4 4 8 8 4 4 4 6 6 6 6 4 4 4 4 6 4 4 4 4 4 5 5 6 6 4
## [223] 4 4 4 5 5 4 4 4 4 6 6 6
A key difference between a tibble and a data.frame is that tibbles don't do partial matching when subsetting. For example, we can do the following with the data set iris:
iris$Sp
This returns iris$Species as the result, since "Sp" is a partial match for the column name "Species". This is rarely the preferable behaviour, so tibbles turn it off:
mpg$cy
## NULL
[ (not recommended for tibbles)
An old-style way to subset a data frame in R is to use [. For example, the code below subsets the second row, the third column, and the cell at the second row and the third column, respectively.
iris[2, ]
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 2 4.9 3 1.4 0.2 setosa
iris[, 3]
## [1] 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 1.5 1.6 1.4 1.1 1.2 1.5 1.3 1.4
## [19] 1.7 1.5 1.7 1.5 1.0 1.7 1.9 1.6 1.6 1.5 1.4 1.6 1.6 1.5 1.5 1.4 1.5 1.2
## [37] 1.3 1.4 1.3 1.5 1.3 1.3 1.3 1.6 1.9 1.4 1.6 1.4 1.5 1.4 4.7 4.5 4.9 4.0
## [55] 4.6 4.5 4.7 3.3 4.6 3.9 3.5 4.2 4.0 4.7 3.6 4.4 4.5 4.1 4.5 3.9 4.8 4.0
## [73] 4.9 4.7 4.3 4.4 4.8 5.0 4.5 3.5 3.8 3.7 3.9 5.1 4.5 4.5 4.7 4.4 4.1 4.0
## [91] 4.4 4.6 4.0 3.3 4.2 4.2 4.2 4.3 3.0 4.1 6.0 5.1 5.9 5.6 5.8 6.6 4.5 6.3
## [109] 5.8 6.1 5.1 5.3 5.5 5.0 5.1 5.3 5.5 6.7 6.9 5.0 5.7 4.9 6.7 4.9 5.7 6.0
## [127] 4.8 4.9 5.6 5.8 6.1 6.4 5.6 5.1 5.6 6.1 5.6 5.5 4.8 5.4 5.6 5.1 5.1 5.9
## [145] 5.7 5.2 5.0 5.2 5.4 5.1
iris[2,3]
## [1] 1.4
This is not recommended for tibbles, since we use filter() to select rows and select() to select columns; that makes code much clearer to write and understand. However, [ still has some useful applications, logical subsetting being one example. For instance, if we hope to choose the rows where the value of Species is setosa, we can do the following:
iris[iris$Species == "setosa", ]
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
## 7 4.6 3.4 1.4 0.3 setosa
## 8 5.0 3.4 1.5 0.2 setosa
## 9 4.4 2.9 1.4 0.2 setosa
## 10 4.9 3.1 1.5 0.1 setosa
## 11 5.4 3.7 1.5 0.2 setosa
## 12 4.8 3.4 1.6 0.2 setosa
## 13 4.8 3.0 1.4 0.1 setosa
## 14 4.3 3.0 1.1 0.1 setosa
## 15 5.8 4.0 1.2 0.2 setosa
## 16 5.7 4.4 1.5 0.4 setosa
## 17 5.4 3.9 1.3 0.4 setosa
## 18 5.1 3.5 1.4 0.3 setosa
## 19 5.7 3.8 1.7 0.3 setosa
## 20 5.1 3.8 1.5 0.3 setosa
## 21 5.4 3.4 1.7 0.2 setosa
## 22 5.1 3.7 1.5 0.4 setosa
## 23 4.6 3.6 1.0 0.2 setosa
## 24 5.1 3.3 1.7 0.5 setosa
## 25 4.8 3.4 1.9 0.2 setosa
## 26 5.0 3.0 1.6 0.2 setosa
## 27 5.0 3.4 1.6 0.4 setosa
## 28 5.2 3.5 1.5 0.2 setosa
## 29 5.2 3.4 1.4 0.2 setosa
## 30 4.7 3.2 1.6 0.2 setosa
## 31 4.8 3.1 1.6 0.2 setosa
## 32 5.4 3.4 1.5 0.4 setosa
## 33 5.2 4.1 1.5 0.1 setosa
## 34 5.5 4.2 1.4 0.2 setosa
## 35 4.9 3.1 1.5 0.2 setosa
## 36 5.0 3.2 1.2 0.2 setosa
## 37 5.5 3.5 1.3 0.2 setosa
## 38 4.9 3.6 1.4 0.1 setosa
## 39 4.4 3.0 1.3 0.2 setosa
## 40 5.1 3.4 1.5 0.2 setosa
## 41 5.0 3.5 1.3 0.3 setosa
## 42 4.5 2.3 1.3 0.3 setosa
## 43 4.4 3.2 1.3 0.2 setosa
## 44 5.0 3.5 1.6 0.6 setosa
## 45 5.1 3.8 1.9 0.4 setosa
## 46 4.8 3.0 1.4 0.3 setosa
## 47 5.1 3.8 1.6 0.2 setosa
## 48 4.6 3.2 1.4 0.2 setosa
## 49 5.3 3.7 1.5 0.2 setosa
## 50 5.0 3.3 1.4 0.2 setosa
For data frames, we can do the same by converting them into a tibble and using the filter() function:
iris_tib <- as_tibble(iris) %>%
filter(Species == "setosa") %>%
print()
## # A tibble: 50 × 5
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## <dbl> <dbl> <dbl> <dbl> <fct>
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
## 7 4.6 3.4 1.4 0.3 setosa
## 8 5 3.4 1.5 0.2 setosa
## 9 4.4 2.9 1.4 0.2 setosa
## 10 4.9 3.1 1.5 0.1 setosa
## # … with 40 more rows
However, logical subsetting with [ also works for vectors and lists in R, which can be very useful in some situations.
rand_num <- runif(100, 0, 3)
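For example, with the rand_num vector created above, logical subsetting keeps only the elements for which a condition is TRUE (a small sketch):
rand_num[rand_num > 2]    # keep only the values greater than 2
sum(rand_num > 2)         # count how many values exceed 2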
To convert a data frame into a tibble, we can simply do that with the function as_tibble():
as_tibble(iris)
## # A tibble: 150 × 5
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## <dbl> <dbl> <dbl> <dbl> <fct>
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
## 7 4.6 3.4 1.4 0.3 setosa
## 8 5 3.4 1.5 0.2 setosa
## 9 4.4 2.9 1.4 0.2 setosa
## 10 4.9 3.1 1.5 0.1 setosa
## # … with 140 more rows
To create a new tibble (usually for testing and learning), we can use
the tibble()
function:
tibble(
x = 1:5,
y = 1,
z = x ^ 2 + y
)
## # A tibble: 5 × 3
## x y z
## <int> <dbl> <dbl>
## 1 1 1 2
## 2 2 1 5
## 3 3 1 10
## 4 4 1 17
## 5 5 1 26
Here, x, y, and z are columns of the same length. For y, tibble() automatically recycles inputs of length 1, and it allows you to refer to variables that you just created, as shown above.
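Note that tibble() only recycles inputs of length 1; a vector of any other non-matching length raises an error. A quick sketch:
tibble(x = 1:4, y = 1)       # works: y is recycled to length 4
# tibble(x = 1:4, y = 1:2)   # error: y must have length 1 or 4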
It’s possible for a tibble to have column names that are not valid R variable names, aka non-syntactic names. For example, they might not start with a letter, or they might contain unusual characters like a space. Those names may appear in raw data sets, which might not be originally created in R.
To refer to these variables, you need to surround them with backticks:
tb <- tibble(
`:)` = "smile",
` ` = "space",
`2000` = "number"
)
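To refer to these columns later, the backticks are needed again; a small sketch using the tb tibble defined above:
tb$`2000`        # extract a non-syntactic column with $ and backticks
tb[["2000"]]     # or by name as a string with [[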
Another way to create a tibble is with tribble(), short for transposed tibble. tribble() is customised for data entry in code: column headings are defined by formulas (i.e. they start with ~), and entries are separated by commas. This makes it possible to lay out small amounts of data in an easy-to-read form:
tribble(
~x, ~y, ~z,
#--|--|----
"a", 2, 3.6,
"b", 1, 8.5
)
## # A tibble: 2 × 3
## x y z
## <chr> <dbl> <dbl>
## 1 a 2 3.6
## 2 b 1 8.5
It is recommended to add a comment (the line starting with #) to make it really clear where the header is.
For example, tribble() makes it easy to enter a small table like the one below:
Student Name | Test 1 Score | Test 2 Score | Test 3 Score |
---|---|---|---|
Alice | 80 | 90 | 85 |
Bob | 70 | 75 | 80 |
Charlie | 85 | 88 | 91 |
Denise | 60 | 65 | 70 |
In this chapter, you will learn a consistent way to organise your data in R, an organisation called tidy data. So far we have been handling data sets that can be put directly into analysis such as visualisation and transformation. That is because those data sets are already tidy.
But in real-world data analysis, raw data are rarely tidy. We have to do some hands-on "dirty" work to tidy them. In this introductory lecture, we will learn the important concepts of tidy data, and the ways in which data can be untidy.
We will learn how to use tools in the tidyr package, part of the tidyverse.
There are three interrelated rules which make a data set tidy: each variable has its own column, each observation has its own row, and each value has its own cell. Let's look at a few examples to understand them.
In all the tables below, “cases” refers to the number of Tuberculosis (TB) cases.
table1
## # A tibble: 6 × 4
## country year cases population
## <chr> <dbl> <dbl> <dbl>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
table2
## # A tibble: 12 × 4
## country year type count
## <chr> <dbl> <chr> <dbl>
## 1 Afghanistan 1999 cases 745
## 2 Afghanistan 1999 population 19987071
## 3 Afghanistan 2000 cases 2666
## 4 Afghanistan 2000 population 20595360
## 5 Brazil 1999 cases 37737
## 6 Brazil 1999 population 172006362
## 7 Brazil 2000 cases 80488
## 8 Brazil 2000 population 174504898
## 9 China 1999 cases 212258
## 10 China 1999 population 1272915272
## 11 China 2000 cases 213766
## 12 China 2000 population 1280428583
table3
## # A tibble: 6 × 3
## country year rate
## <chr> <dbl> <chr>
## 1 Afghanistan 1999 745/19987071
## 2 Afghanistan 2000 2666/20595360
## 3 Brazil 1999 37737/172006362
## 4 Brazil 2000 80488/174504898
## 5 China 1999 212258/1272915272
## 6 China 2000 213766/1280428583
table4a
## # A tibble: 3 × 3
## country `1999` `2000`
## <chr> <dbl> <dbl>
## 1 Afghanistan 745 2666
## 2 Brazil 37737 80488
## 3 China 212258 213766
table4b
## # A tibble: 3 × 3
## country `1999` `2000`
## <chr> <dbl> <dbl>
## 1 Afghanistan 19987071 20595360
## 2 Brazil 172006362 174504898
## 3 China 1272915272 1280428583
To understand why table2-table4 are not tidy, we may refer to the five most frequently seen symptoms of untidy data:
Column headers are values, not variable names.
Multiple variables are stored in one column.
Variables are stored in both rows and columns.
Multiple types of observational units are stored in the same table.
A single observational unit is stored in multiple tables.
Why ensure that your data is tidy? There are two main advantages:
There’s a general advantage to picking one consistent way of storing data. If you have a consistent data structure, it’s easier to learn the tools that work with it because they have an underlying uniformity.
There’s a specific advantage to placing variables in columns because it allows R’s vectorised nature to shine. As you learned in mutate and summary functions, most built-in R functions work with vectors of values. That makes transforming tidy data feel particularly natural.
In short, a consistent data structure makes it easier to work with the same tools: dplyr, ggplot2, and all the other packages in the tidyverse are designed to work with tidy data.
With Table 1, it is straightforward to perform the following data exploration.
# Compute rate per 10,000
table1 %>%
mutate(rate = cases / population * 10000)
## # A tibble: 6 × 5
## country year cases population rate
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Afghanistan 1999 745 19987071 0.373
## 2 Afghanistan 2000 2666 20595360 1.29
## 3 Brazil 1999 37737 172006362 2.19
## 4 Brazil 2000 80488 174504898 4.61
## 5 China 1999 212258 1272915272 1.67
## 6 China 2000 213766 1280428583 1.67
# Compute cases per year
table1 %>%
count(year, wt = cases) # "wt" gives the weight to use when counting
## # A tibble: 2 × 2
## year n
## <dbl> <dbl>
## 1 1999 250740
## 2 2000 296920
# Visualise changes over time
table1 %>%
mutate(rate = cases / population * 10000) %>%
ggplot(aes(x = as.factor(year), y = rate)) +
geom_line(aes(group = country), colour = "grey50") +
geom_point(aes(colour = country)) +
labs(title = "TB rate", x = "Year", y = "Cases per 10,000") +
ylim(0, 5) +
theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)),
axis.title = element_text(size = rel(1.4)),
axis.title.x = element_text(margin = margin(10,5,5,5)),
axis.title.y = element_text(margin = margin(5,10,5,5)),
axis.text = element_text(size = rel(1.4)))
However, if we try to create the same results or graphs from table2-table4, it would be much less convenient. Therefore, we should tidy our data before the actual data exploration.
Untidy data arise naturally in the real world. Here is another simple example. Suppose an instructor hopes to record student test scores in a class, where each row represents a student and each column represents a test. The data set may look like this:
Student Name | Test 1 Score | Test 2 Score | Test 3 Score |
---|---|---|---|
Alice | 80 | 90 | 85 |
Bob | 70 | 75 | 80 |
Charlie | 85 | 88 | 91 |
Denise | 60 | 65 | 70 |
However, this is untidy data, since the three score column headers are actually different values of another variable, Test. A tidy data set would look like this:
Student Name | Test Number | Score |
---|---|---|
Alice | 1 | 80 |
Alice | 2 | 90 |
Alice | 3 | 85 |
Bob | 1 | 70 |
Bob | 2 | 75 |
Bob | 3 | 80 |
Charlie | 1 | 85 |
Charlie | 2 | 88 |
Charlie | 3 | 91 |
Denise | 1 | 60 |
Denise | 2 | 65 |
Denise | 3 | 70 |
The principles of tidy data seem so obvious that you might wonder if you’ll ever encounter a dataset that isn’t tidy. Unfortunately, however, most data that you will encounter will be untidy. There are two main reasons:
Most people aren’t familiar with the principles of tidy data, and it’s hard to derive them yourself unless you spend a lot of time working with data.
Data is often organised to facilitate some use other than analysis. For example, data is often organised to make entry as easy as possible.
This means for most real analyses, you’ll need to do some tidying. The first step is always to figure out what the variables and observations are. Sometimes this is easy; other times you’ll need to consult with the people who originally generated the data.
The second step is to resolve one of two common problems:
One variable might be spread across multiple columns.
One observation might be scattered across multiple rows.
Typically a dataset will only suffer from one of these problems. To fix these problems, you'll need the two most important functions in tidyr: pivot_longer() and pivot_wider().
pivot_longer()
A common problem is a dataset where some of the column names are not names of variables, but values of a variable. For example,
table4a
## # A tibble: 3 × 3
## country `1999` `2000`
## <chr> <dbl> <dbl>
## 1 Afghanistan 745 2666
## 2 Brazil 37737 80488
## 3 China 212258 213766
Here the column names 1999 and 2000 represent values of the year variable, the values in the 1999 and 2000 columns represent values of the cases variable, and each row represents two observations, not one.
To tidy a dataset like this, we need to pivot the offending columns into a new pair of variables. To describe that operation we need three parameters:
The set of columns whose names are values, not variables. In this example, those are the columns 1999 and 2000.
The name of the variable to move the column names to. Here it is year.
The value of the variable to move the column values to. Here it’s cases.
Together those parameters generate the call to pivot_longer():
table4a %>%
pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "cases")
## # A tibble: 6 × 3
## country year cases
## <chr> <chr> <dbl>
## 1 Afghanistan 1999 745
## 2 Afghanistan 2000 2666
## 3 Brazil 1999 37737
## 4 Brazil 2000 80488
## 5 China 1999 212258
## 6 China 2000 213766
The columns to pivot are specified with dplyr::select() style notation. Here there are only two columns, so we list them individually. Note that “1999” and “2000” are non-syntactic names (because they don’t start with a letter) so we have to surround them in backticks.
We can also do the following (this can be useful when you have multiple columns to handle):
table4a %>%
pivot_longer(`1999`:`2000`, names_to = "year", values_to = "cases")
## # A tibble: 6 × 3
## country year cases
## <chr> <chr> <dbl>
## 1 Afghanistan 1999 745
## 2 Afghanistan 2000 2666
## 3 Brazil 1999 37737
## 4 Brazil 2000 80488
## 5 China 1999 212258
## 6 China 2000 213766
Year and cases do not exist in table4a so we put their names in quotes.
In the final result, the pivoted columns are dropped, and we get new year and cases columns. Otherwise, the relationships between the original variables are preserved.
pivot_longer() makes data sets longer by increasing the number of rows and decreasing the number of columns (when there are more than two columns to combine).
We can use pivot_longer() to tidy table4b in a similar fashion. The only difference is the variable stored in the cell values:
table4b %>%
pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "population")
## # A tibble: 6 × 3
## country year population
## <chr> <chr> <dbl>
## 1 Afghanistan 1999 19987071
## 2 Afghanistan 2000 20595360
## 3 Brazil 1999 172006362
## 4 Brazil 2000 174504898
## 5 China 1999 1272915272
## 6 China 2000 1280428583
To combine the tidied versions of table4a and table4b into a single tibble, we need to use dplyr::left_join(), which you'll learn about in relational data.
tidy4a <- table4a %>%
pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "cases")
tidy4b <- table4b %>%
pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "population")
left_join(tidy4a, tidy4b)
## # A tibble: 6 × 4
## country year cases population
## <chr> <chr> <dbl> <dbl>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
Use pivot_longer() to tidy the following data set:
Student Name | Test 1 Score | Test 2 Score | Test 3 Score |
---|---|---|---|
Alice | 80 | 90 | 85 |
Bob | 70 | 75 | 80 |
Charlie | 85 | 88 | 91 |
Denise | 60 | 65 | 70 |
Use the following code to create the tibble:
pivot_ex1 <- tribble(
~Student_Name, ~Test1, ~Test2, ~Test3,
#-------------------------------------
"Alice", 80, 90, 85,
"Bob", 70, 75, 80,
"Charlie", 85, 88, 91,
"Denise", 60, 65, 70
)
pivot_wider()
pivot_wider() is the opposite of pivot_longer(). You use it when an observation is scattered across multiple rows. For example, take table2: an observation is a country in a year, but each observation is spread across two rows.
table2
## # A tibble: 12 × 4
## country year type count
## <chr> <dbl> <chr> <dbl>
## 1 Afghanistan 1999 cases 745
## 2 Afghanistan 1999 population 19987071
## 3 Afghanistan 2000 cases 2666
## 4 Afghanistan 2000 population 20595360
## 5 Brazil 1999 cases 37737
## 6 Brazil 1999 population 172006362
## 7 Brazil 2000 cases 80488
## 8 Brazil 2000 population 174504898
## 9 China 1999 cases 212258
## 10 China 1999 population 1272915272
## 11 China 2000 cases 213766
## 12 China 2000 population 1280428583
The problem with this table is that two variables, cases and population, are placed in the same column named type. To tidy this up, we need to break the column type into two columns (if there were more variables, it would be broken into more columns), with the values for each variable taken from the column count. The function pivot_wider() takes exactly these two column names as input and then tidies things for us:
table2 %>%
pivot_wider(names_from = type, values_from = count)
## # A tibble: 6 × 4
## country year cases population
## <chr> <dbl> <dbl> <dbl>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
Here the argument names_from takes the name of the column that holds the variable names, and the argument values_from takes the name of the column that holds the values.
Comments: As their names indicate, pivot_longer() usually makes wide tables narrower and longer; pivot_wider() usually makes long tables shorter and wider.
preg <- tribble(
~pregnant, ~male, ~female,
"yes", NA, 10,
"no", 20, 12
)
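The preg tibble above is a small practice case (its original prompt is not shown here). One possible way to tidy it, treating male and female as values of a sex variable, is a pivot_longer() call such as:
preg %>%
  pivot_longer(c(male, female), names_to = "sex", values_to = "count", values_drop_na = TRUE)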
Now let's look at table3, which has a different problem from table2.
table3
## # A tibble: 6 × 3
## country year rate
## <chr> <dbl> <chr>
## 1 Afghanistan 1999 745/19987071
## 2 Afghanistan 2000 2666/20595360
## 3 Brazil 1999 37737/172006362
## 4 Brazil 2000 80488/174504898
## 5 China 1999 212258/1272915272
## 6 China 2000 213766/1280428583
The problem here is that we have one column (rate) that contains two variables (cases and population). To tidy this up, we'll need the separate() function.
table3 %>%
separate(rate, into = c("cases", "population"))
## # A tibble: 6 × 4
## country year cases population
## <chr> <dbl> <chr> <chr>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
By default, separate() will split values wherever it sees a non-alphanumeric character (i.e. a character that isn't a number or letter). For example, in the code above, separate() split the values of rate at the forward slash / character.
If you wish to use a specific character to separate a column, you can pass that character to the sep argument of separate(). For example, we could rewrite the code above as:
table3 %>%
separate(rate, into = c("cases", "population"), sep = "/")
## # A tibble: 6 × 4
## country year cases population
## <chr> <dbl> <chr> <chr>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
There is one more issue that needs to be handled. In the result above, both cases and population are of type character. This is because, by default, separate() leaves the type of the new columns the same as the original one. To convert them to a better type, we can use convert = TRUE in the separate() call:
table3 %>%
separate(rate, into = c("cases", "population"), convert = TRUE)
## # A tibble: 6 × 4
## country year cases population
## <chr> <dbl> <int> <int>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
You can also pass a vector of integers to sep. For example,
table3 %>%
separate(year, into = c("century", "year"), sep = 2)
## # A tibble: 6 × 4
## country century year rate
## <chr> <chr> <chr> <chr>
## 1 Afghanistan 19 99 745/19987071
## 2 Afghanistan 20 00 2666/20595360
## 3 Brazil 19 99 37737/172006362
## 4 Brazil 20 00 80488/174504898
## 5 China 19 99 212258/1272915272
## 6 China 20 00 213766/1280428583
Here separate() splits the year number into two parts: the first two digits go into a new column century, and the last two digits go into a new column year.
Here sep = 2 refers to the position to split at. Positive values start at 1 on the far left of the strings; negative values start at -1 on the far right.
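Since the year values here have four characters, splitting two characters from the right would give the same result. A sketch using a negative position:
table3 %>%
  separate(year, into = c("century", "year"), sep = -2)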
unite()
The complement of separate() is the unite() function, which combines multiple columns into a single column. unite() is used much less often than separate(), but it can still be useful sometimes.
table5 %>%
unite(new, century, year)
## # A tibble: 6 × 3
## country new rate
## <chr> <chr> <chr>
## 1 Afghanistan 19_99 745/19987071
## 2 Afghanistan 20_00 2666/20595360
## 3 Brazil 19_99 37737/172006362
## 4 Brazil 20_00 80488/174504898
## 5 China 19_99 212258/1272915272
## 6 China 20_00 213766/1280428583
Here the unite() function combines century and year into a new column called new. By default, the values are separated by an underscore _. This can be modified with the sep argument.
table5 %>%
unite(new, century, year, sep = "") # No separation character
## # A tibble: 6 × 3
## country new rate
## <chr> <chr> <chr>
## 1 Afghanistan 1999 745/19987071
## 2 Afghanistan 2000 2666/20595360
## 3 Brazil 1999 37737/172006362
## 4 Brazil 2000 80488/174504898
## 5 China 1999 212258/1272915272
## 6 China 2000 213766/1280428583
Sometimes we need to handle NA values carefully when we do data tidying. In a data frame, a value can be missing in one of two possible ways:
Explicitly, i.e. flagged with NA in the data set.
Implicitly, i.e. simply not present in the data.
To illustrate the idea, let's look at the following simple data set:
stocks <- tibble(
year = c(2015, 2015, 2015, 2015, 2016, 2016, 2016),
qtr = c( 1, 2, 3, 4, 2, 3, 4),
return = c(1.88, 0.59, 0.35, NA, 0.92, 0.17, 2.66)
)
There are two missing values in this dataset:
The return for the fourth quarter of 2015 is explicitly missing, because the cell where its value should be instead contains NA.
The return for the first quarter of 2016 is implicitly missing, because it simply does not appear in the dataset.
If we just want to remove NA values, we can simply use the filter() function.
stocks %>%
filter(!is.na(return))
## # A tibble: 6 × 3
## year qtr return
## <dbl> <dbl> <dbl>
## 1 2015 1 1.88
## 2 2015 2 0.59
## 3 2015 3 0.35
## 4 2016 2 0.92
## 5 2016 3 0.17
## 6 2016 4 2.66
We can do something similar when using pivot_longer() or pivot_wider(). For example, if we make the data set above into an untidy form:
stocks %>%
pivot_wider(names_from = year, values_from = return) -> stocks_wide
stocks_wide
## # A tibble: 4 × 3
## qtr `2015` `2016`
## <dbl> <dbl> <dbl>
## 1 1 1.88 NA
## 2 2 0.59 0.92
## 3 3 0.35 0.17
## 4 4 NA 2.66
Then, when we tidy this data set, we can choose to drop NA values by setting values_drop_na = TRUE:
stocks_wide %>%
pivot_longer(
cols = c(`2015`, `2016`),
names_to = "year",
values_to = "return",
values_drop_na = TRUE
)
## # A tibble: 6 × 3
## qtr year return
## <dbl> <chr> <dbl>
## 1 1 2015 1.88
## 2 2 2015 0.59
## 3 2 2016 0.92
## 4 3 2015 0.35
## 5 3 2016 0.17
## 6 4 2016 2.66
Sometimes we hope to make it clear that some data are missing. To do this, we can use the complete() function:
stocks %>%
complete(year, qtr)
## # A tibble: 8 × 3
## year qtr return
## <dbl> <dbl> <dbl>
## 1 2015 1 1.88
## 2 2015 2 0.59
## 3 2015 3 0.35
## 4 2015 4 NA
## 5 2016 1 NA
## 6 2016 2 0.92
## 7 2016 3 0.17
## 8 2016 4 2.66
The complete() function takes a set of columns (usually categorical or discrete variables) as its argument and finds all unique combinations. It then ensures the original data set contains all those values, filling in explicit NAs where necessary.
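If a value never appears anywhere in a column (say a quarter missing from every year), complete() also accepts explicit value sets. A sketch, assuming we want all four quarters for every year:
stocks %>%
  complete(year, qtr = 1:4)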
fill()
Sometimes when a data source has primarily been used for data entry, missing values indicate that the previous value should be carried forward:
treatment <- tribble(
~ person, ~ treatment, ~response,
"Derrick Whitmore", 1, 7,
NA, 2, 10,
NA, 3, 9,
"Katherine Burke", 1, 4
)
In this case, we can use the fill() function to fill the NA values with the most recent non-NA value above in that column.
treatment %>%
fill(person)
## # A tibble: 4 × 3
## person treatment response
## <chr> <dbl> <dbl>
## 1 Derrick Whitmore 1 7
## 2 Derrick Whitmore 2 10
## 3 Derrick Whitmore 3 9
## 4 Katherine Burke 1 4
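By default, fill() carries the last non-missing value downward. The .direction argument can change this; the sketch below is just to illustrate the argument (for this data, the default downward fill is the right choice):
treatment %>%
  fill(person, .direction = "up")   # fill upwards instead of downwards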
Let's finish this module with a real data set example. The who data set from the tidyr package contains tuberculosis (TB) cases broken down by year, country, age, gender, and diagnosis method. The data comes from the 2014 World Health Organization Global Tuberculosis Report, available at http://www.who.int/tb/country/data/download/en/.
The information in this data set is very rich. But it is difficult to explore the data set in its original form.
glimpse(who)
## Rows: 7,240
## Columns: 60
## $ country <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan…
## $ iso2 <chr> "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF", "AF…
## $ iso3 <chr> "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "AFG", "…
## $ year <dbl> 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987, 1988, 198…
## $ new_sp_m014 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ new_sp_m1524 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ new_sp_m2534 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ new_sp_m3544 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ new_sp_m4554 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ new_sp_m5564 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ new_sp_m65 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ new_sp_f014 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ new_sp_f1524 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ new_sp_f2534 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ new_sp_f3544 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ new_sp_f4554 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ new_sp_f5564 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ new_sp_f65 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ new_sn_m014 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ new_sn_m1524 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ new_sn_m2534 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ new_sn_m3544 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ new_sn_m4554 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ new_sn_m5564 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ new_sn_m65 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ new_sn_f014 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ new_sn_f1524 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ new_sn_f2534 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ new_sn_f3544 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ new_sn_f4554 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ new_sn_f5564 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ new_sn_f65 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ new_ep_m014 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ new_ep_m1524 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ new_ep_m2534 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ new_ep_m3544 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ new_ep_m4554 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ new_ep_m5564 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ new_ep_m65 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ new_ep_f014 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ new_ep_f1524 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ new_ep_f2534 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ new_ep_f3544 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ new_ep_f4554 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ new_ep_f5564 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ new_ep_f65 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ newrel_m014 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ newrel_m1524 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ newrel_m2534 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ newrel_m3544 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ newrel_m4554 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ newrel_m5564 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ newrel_m65 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ newrel_f014 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ newrel_f1524 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ newrel_f2534 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ newrel_f3544 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ newrel_f4554 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ newrel_f5564 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ newrel_f65 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
This is a very typical real-life example dataset. It contains redundant columns, odd variable codes, and many missing values. In short, who is messy, and we'll need multiple steps to tidy it.
Let’s tidy the data set step by step. It is usually good to first handle columns that appear to be values of another variable. After inspecting the columns, we may have some preliminary sense:
It looks like country, iso2, and iso3 are three variables that redundantly specify the country.
year is clearly also a variable.
We don't know what all the other columns are yet, but given the structure in the variable names (e.g. new_sp_m014, new_ep_m014, new_ep_f014) these are likely to be values, not variables.
So we need to gather together all the columns from new_sp_m014 to newrel_f65. We don't know what those values represent yet, so we'll give them the generic name key. We know the cells represent the count of cases, so we'll use the variable name cases. There are a lot of missing values in the current representation, so for now we'll use values_drop_na = TRUE, just so we can focus on the values that are present.
who1 <- who %>%
pivot_longer(
cols = new_sp_m014:newrel_f65,
names_to = "key",
values_to = "cases",
values_drop_na = TRUE
)
who1
## # A tibble: 76,046 × 6
## country iso2 iso3 year key cases
## <chr> <chr> <chr> <dbl> <chr> <dbl>
## 1 Afghanistan AF AFG 1997 new_sp_m014 0
## 2 Afghanistan AF AFG 1997 new_sp_m1524 10
## 3 Afghanistan AF AFG 1997 new_sp_m2534 6
## 4 Afghanistan AF AFG 1997 new_sp_m3544 3
## 5 Afghanistan AF AFG 1997 new_sp_m4554 5
## 6 Afghanistan AF AFG 1997 new_sp_m5564 2
## 7 Afghanistan AF AFG 1997 new_sp_m65 0
## 8 Afghanistan AF AFG 1997 new_sp_f014 5
## 9 Afghanistan AF AFG 1997 new_sp_f1524 38
## 10 Afghanistan AF AFG 1997 new_sp_f2534 36
## # … with 76,036 more rows
Luckily, we have the help documentation for the data set to tell us the meaning of those odd-looking keys.
help(who)
The first three letters of each column denote whether the column contains new or old cases of TB. In this dataset, each column contains new cases.
The next two or three letters describe the type of TB:
rel stands for cases of relapse
ep stands for cases of extrapulmonary TB
sn stands for cases of pulmonary TB that could not be diagnosed by a pulmonary smear (smear negative)
sp stands for cases of pulmonary TB that could be diagnosed by a pulmonary smear (smear positive)
The sixth letter gives the sex of TB patients. The dataset groups cases by males (m) and females (f).
The remaining numbers give the age group. The dataset groups cases into seven age groups:
014 = 0 – 14 years old
1524 = 15 – 24 years old
2534 = 25 – 34 years old
3544 = 35 – 44 years old
4554 = 45 – 54 years old
5564 = 55 – 64 years old
65 = 65 or older
We need to make a minor fix to the format of the column names:
unfortunately, the names are slightly inconsistent because instead of new_rel we have newrel (it's hard to spot this here, but if you don't fix it we'll get errors in subsequent steps). You'll learn about str_replace() in strings, but the basic idea is pretty simple: replace the characters newrel with new_rel. This makes all variable names consistent.
who2 <- who1 %>%
mutate(key = str_replace(key, "newrel", "new_rel"))
who2
## # A tibble: 76,046 × 6
## country iso2 iso3 year key cases
## <chr> <chr> <chr> <dbl> <chr> <dbl>
## 1 Afghanistan AF AFG 1997 new_sp_m014 0
## 2 Afghanistan AF AFG 1997 new_sp_m1524 10
## 3 Afghanistan AF AFG 1997 new_sp_m2534 6
## 4 Afghanistan AF AFG 1997 new_sp_m3544 3
## 5 Afghanistan AF AFG 1997 new_sp_m4554 5
## 6 Afghanistan AF AFG 1997 new_sp_m5564 2
## 7 Afghanistan AF AFG 1997 new_sp_m65 0
## 8 Afghanistan AF AFG 1997 new_sp_f014 5
## 9 Afghanistan AF AFG 1997 new_sp_f1524 38
## 10 Afghanistan AF AFG 1997 new_sp_f2534 36
## # … with 76,036 more rows
Now we can use the separate() function to separate the values in each key code.
who3 <- who2 %>%
separate(key, c("new", "type", "sexage"), sep = "_")
who3
## # A tibble: 76,046 × 8
## country iso2 iso3 year new type sexage cases
## <chr> <chr> <chr> <dbl> <chr> <chr> <chr> <dbl>
## 1 Afghanistan AF AFG 1997 new sp m014 0
## 2 Afghanistan AF AFG 1997 new sp m1524 10
## 3 Afghanistan AF AFG 1997 new sp m2534 6
## 4 Afghanistan AF AFG 1997 new sp m3544 3
## 5 Afghanistan AF AFG 1997 new sp m4554 5
## 6 Afghanistan AF AFG 1997 new sp m5564 2
## 7 Afghanistan AF AFG 1997 new sp m65 0
## 8 Afghanistan AF AFG 1997 new sp f014 5
## 9 Afghanistan AF AFG 1997 new sp f1524 38
## 10 Afghanistan AF AFG 1997 new sp f2534 36
## # … with 76,036 more rows
Then we might as well drop the new column because it's constant in this dataset. While we're dropping columns, let's also drop iso2 and iso3 since they're redundant.
who4 <- who3 %>%
select(-new, -iso2, -iso3)
Next we’ll separate sexage into sex and age by splitting after the first character:
who5 <- who4 %>%
separate(sexage, c("sex", "age"), sep = 1)
who5
## # A tibble: 76,046 × 6
## country year type sex age cases
## <chr> <dbl> <chr> <chr> <chr> <dbl>
## 1 Afghanistan 1997 sp m 014 0
## 2 Afghanistan 1997 sp m 1524 10
## 3 Afghanistan 1997 sp m 2534 6
## 4 Afghanistan 1997 sp m 3544 3
## 5 Afghanistan 1997 sp m 4554 5
## 6 Afghanistan 1997 sp m 5564 2
## 7 Afghanistan 1997 sp m 65 0
## 8 Afghanistan 1997 sp f 014 5
## 9 Afghanistan 1997 sp f 1524 38
## 10 Afghanistan 1997 sp f 2534 36
## # … with 76,036 more rows
The who data set is now tidy and ready for exploration and analysis! Below is one example of what we can do with the data. That's the reward of data tidying!
who5 %>%
filter(year == 1997) %>%
ggplot() +
stat_summary(aes(x = sex, y = cases, fill = age), fun = 'sum', geom = 'bar', position = 'dodge') +
labs(title = "TB cases in the world in 1997", x = "Gender", y = "Case Counts") +
theme(plot.title = element_text(hjust = 0.5, size = rel(1.2), margin = margin(0,0,15,0)), axis.title.x = element_text(size = rel(1.0), margin = margin(10,0,0,0)), axis.title.y = element_text(size = rel(1.0), margin = margin(0,10,0,0)), axis.text = element_text(size = rel(1.0)), plot.margin = margin(1,1,1,1,"cm")) +
scale_fill_discrete(labels = c("0-14 yrs", "15-24 yrs", "25-34 yrs", "35-44 yrs", "45-54 yrs", "55-64 yrs", "> 65 yrs"))
Starting from the next module, we will learn how to tidy data to resolve different symptoms.