Load Libraries

library(tidyverse)
library(openintro)

0. Introduction to Data Wrangling

In the next, we’ll learn about data wrangling, the art of getting your data into R in a useful form for visualisation and modelling. The term wrangling vividly describes how we may have to “fight with data” before putting them to use.

Data wrangling is very important: without it you can’t work with your own data! There are three main parts to data wrangling:

We have learned how to transform data. In this part, we will learn

The variant of R data frames - tibble from tidyverse
How to import data of various formats into R
How to tidy data into a standard form for the convenience of data exploration

We will also learn more about data transformation for a few specifically important data types:

Relational data will give you tools for working with multiple interrelated datasets.
Strings will introduce regular expressions, a powerful tool for manipulating strings.
Factors are how R stores categorical data. They are used when a variable has a fixed set of possible values, or when you want to use a non-alphabetical ordering of a string.
Dates and times will give you the key tools for working with dates and date-times.

1. Tibbles

Traditionally, R uses the data type data.frame to store rectangular (tabular) data. For example, if we check the data type of a built-in data set iris, we get:

class(iris)

## [1] "data.frame"

If we check the class of the data sets that we have been studied so far, we will see that the results are different:

class(mpg)

## [1] "tbl_df"     "tbl"        "data.frame"

The result tbl stands for tibble, which is introduced by the package tibble in tidyverse. Briefly speaking, a tibble is simply a data frame, but with some particular aspects optimized for data wrangling and optimization.

Tibbles vs `data.frame`

There are two main differences in the usage of a tibble vs. a classic data.frame: printing and subsetting.

Printing

Tibbles have a refined print method that shows only the first 10 rows, and all the columns that fit on screen. In addition to its name, each column reports its type. This makes it much easier to work with large data.

First, execute the following code in R by yourself. You will see that for a classic data.frame, all data are printed on screen, with no data type indicated.

print(iris) # Try by yourself

In Contrast, printing a tibble gives a much more reasonable preview of data.

print(mpg)

## # A tibble: 234 × 11
##    manufacturer model      displ  year   cyl trans drv     cty   hwy fl    class
##    <chr>        <chr>      <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
##  1 audi         a4           1.8  1999     4 auto… f        18    29 p     comp…
##  2 audi         a4           1.8  1999     4 manu… f        21    29 p     comp…
##  3 audi         a4           2    2008     4 manu… f        20    31 p     comp…
##  4 audi         a4           2    2008     4 auto… f        21    30 p     comp…
##  5 audi         a4           2.8  1999     6 auto… f        16    26 p     comp…
##  6 audi         a4           2.8  1999     6 manu… f        18    26 p     comp…
##  7 audi         a4           3.1  2008     6 auto… f        18    27 p     comp…
##  8 audi         a4 quattro   1.8  1999     4 manu… 4        18    26 p     comp…
##  9 audi         a4 quattro   1.8  1999     4 auto… 4        16    25 p     comp…
## 10 audi         a4 quattro   2    2008     4 manu… 4        20    28 p     comp…
## # … with 224 more rows

When there are too many columns to be printed on screen, we can use glimpse() to help us.

print(loans_full_schema)

## # A tibble: 10,000 × 55
##    emp_t…¹ emp_l…² state homeo…³ annua…⁴ verif…⁵ debt_…⁶ annua…⁷ verif…⁸ debt_…⁹
##    <chr>     <dbl> <fct> <fct>     <dbl> <fct>     <dbl>   <dbl> <fct>     <dbl>
##  1 "globa…       3 NJ    MORTGA…   90000 Verifi…   18.0       NA ""         NA  
##  2 "wareh…      10 HI    RENT      40000 Not Ve…    5.04      NA ""         NA  
##  3 "assem…       3 WI    RENT      40000 Source…   21.2       NA ""         NA  
##  4 "custo…       1 PA    RENT      30000 Not Ve…   10.2       NA ""         NA  
##  5 "secur…      10 CA    RENT      35000 Verifi…   58.0    57000 "Verif…    37.7
##  6 ""           NA KY    OWN       34000 Not Ve…    6.46      NA ""         NA  
##  7 "hr "        10 MI    MORTGA…   35000 Source…   23.7   155000 "Not V…    13.1
##  8 "polic…      10 AZ    MORTGA…  110000 Source…   16.2       NA ""         NA  
##  9 "parts"      10 NV    MORTGA…   65000 Source…   36.5       NA ""         NA  
## 10 "4th p…       3 IL    RENT      30000 Not Ve…   18.9       NA ""         NA  
## # … with 9,990 more rows, 45 more variables: delinq_2y <int>,
## #   months_since_last_delinq <int>, earliest_credit_line <dbl>,
## #   inquiries_last_12m <int>, total_credit_lines <int>,
## #   open_credit_lines <int>, total_credit_limit <int>,
## #   total_credit_utilized <int>, num_collections_last_12m <int>,
## #   num_historical_failed_to_pay <int>, months_since_90d_late <int>,
## #   current_accounts_delinq <int>, total_collection_amount_ever <int>, …

glimpse(loans_full_schema)

## Rows: 10,000
## Columns: 55
## $ emp_title                        <chr> "global config engineer ", "warehouse…
## $ emp_length                       <dbl> 3, 10, 3, 1, 10, NA, 10, 10, 10, 3, 1…
## $ state                            <fct> NJ, HI, WI, PA, CA, KY, MI, AZ, NV, I…
## $ homeownership                    <fct> MORTGAGE, RENT, RENT, RENT, RENT, OWN…
## $ annual_income                    <dbl> 90000, 40000, 40000, 30000, 35000, 34…
## $ verified_income                  <fct> Verified, Not Verified, Source Verifi…
## $ debt_to_income                   <dbl> 18.01, 5.04, 21.15, 10.16, 57.96, 6.4…
## $ annual_income_joint              <dbl> NA, NA, NA, NA, 57000, NA, 155000, NA…
## $ verification_income_joint        <fct> , , , , Verified, , Not Verified, , ,…
## $ debt_to_income_joint             <dbl> NA, NA, NA, NA, 37.66, NA, 13.12, NA,…
## $ delinq_2y                        <int> 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0…
## $ months_since_last_delinq         <int> 38, NA, 28, NA, NA, 3, NA, 19, 18, NA…
## $ earliest_credit_line             <dbl> 2001, 1996, 2006, 2007, 2008, 1990, 2…
## $ inquiries_last_12m               <int> 6, 1, 4, 0, 7, 6, 1, 1, 3, 0, 4, 4, 8…
## $ total_credit_lines               <int> 28, 30, 31, 4, 22, 32, 12, 30, 35, 9,…
## $ open_credit_lines                <int> 10, 14, 10, 4, 16, 12, 10, 15, 21, 6,…
## $ total_credit_limit               <int> 70795, 28800, 24193, 25400, 69839, 42…
## $ total_credit_utilized            <int> 38767, 4321, 16000, 4997, 52722, 3898…
## $ num_collections_last_12m         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ num_historical_failed_to_pay     <int> 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0…
## $ months_since_90d_late            <int> 38, NA, 28, NA, NA, 60, NA, 71, 18, N…
## $ current_accounts_delinq          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ total_collection_amount_ever     <int> 1250, 0, 432, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ current_installment_accounts     <int> 2, 0, 1, 1, 1, 0, 2, 2, 6, 1, 2, 1, 2…
## $ accounts_opened_24m              <int> 5, 11, 13, 1, 6, 2, 1, 4, 10, 5, 6, 7…
## $ months_since_last_credit_inquiry <int> 5, 8, 7, 15, 4, 5, 9, 7, 4, 17, 3, 4,…
## $ num_satisfactory_accounts        <int> 10, 14, 10, 4, 16, 12, 10, 15, 21, 6,…
## $ num_accounts_120d_past_due       <int> 0, 0, 0, 0, 0, 0, 0, NA, 0, 0, 0, 0, …
## $ num_accounts_30d_past_due        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ num_active_debit_accounts        <int> 2, 3, 3, 2, 10, 1, 3, 5, 11, 3, 2, 2,…
## $ total_debit_limit                <int> 11100, 16500, 4300, 19400, 32700, 272…
## $ num_total_cc_accounts            <int> 14, 24, 14, 3, 20, 27, 8, 16, 19, 7, …
## $ num_open_cc_accounts             <int> 8, 14, 8, 3, 15, 12, 7, 12, 14, 5, 8,…
## $ num_cc_carrying_balance          <int> 6, 4, 6, 2, 13, 5, 6, 10, 14, 3, 5, 3…
## $ num_mort_accounts                <int> 1, 0, 0, 0, 0, 3, 2, 7, 2, 0, 2, 3, 3…
## $ account_never_delinq_percent     <dbl> 92.9, 100.0, 93.5, 100.0, 100.0, 78.1…
## $ tax_liens                        <int> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ public_record_bankrupt           <int> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0…
## $ loan_purpose                     <fct> moving, debt_consolidation, other, de…
## $ application_type                 <fct> individual, individual, individual, i…
## $ loan_amount                      <int> 28000, 5000, 2000, 21600, 23000, 5000…
## $ term                             <dbl> 60, 36, 36, 36, 36, 36, 60, 60, 36, 3…
## $ interest_rate                    <dbl> 14.07, 12.61, 17.09, 6.72, 14.07, 6.7…
## $ installment                      <dbl> 652.53, 167.54, 71.40, 664.19, 786.87…
## $ grade                            <fct> C, C, D, A, C, A, C, B, C, A, C, B, C…
## $ sub_grade                        <fct> C3, C1, D1, A3, C3, A3, C2, B5, C2, A…
## $ issue_month                      <fct> Mar-2018, Feb-2018, Feb-2018, Jan-201…
## $ loan_status                      <fct> Current, Current, Current, Current, C…
## $ initial_listing_status           <fct> whole, whole, fractional, whole, whol…
## $ disbursement_method              <fct> Cash, Cash, Cash, Cash, Cash, Cash, C…
## $ balance                          <dbl> 27015.86, 4651.37, 1824.63, 18853.26,…
## $ paid_total                       <dbl> 1999.330, 499.120, 281.800, 3312.890,…
## $ paid_principal                   <dbl> 984.14, 348.63, 175.37, 2746.74, 1569…
## $ paid_interest                    <dbl> 1015.19, 150.49, 106.43, 566.15, 754.…
## $ paid_late_fees                   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…

Subsetting

If you want to pull out a single variable from , you need some new tools, $ and [[. [[ can extract by name or position; $ only extracts by name but is a little less typing.

glimpse(mpg)

## Rows: 234
## Columns: 11
## $ manufacturer <chr> "audi", "audi", "audi", "audi", "audi", "audi", "audi", "…
## $ model        <chr> "a4", "a4", "a4", "a4", "a4", "a4", "a4", "a4 quattro", "…
## $ displ        <dbl> 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, 1.8, 1.8, 2.0, 2.0, 2.…
## $ year         <int> 1999, 1999, 2008, 2008, 1999, 1999, 2008, 1999, 1999, 200…
## $ cyl          <int> 4, 4, 4, 4, 6, 6, 6, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 8, 8, …
## $ trans        <chr> "auto(l5)", "manual(m5)", "manual(m6)", "auto(av)", "auto…
## $ drv          <chr> "f", "f", "f", "f", "f", "f", "f", "4", "4", "4", "4", "4…
## $ cty          <int> 18, 21, 20, 21, 16, 18, 18, 18, 16, 20, 19, 15, 17, 17, 1…
## $ hwy          <int> 29, 29, 31, 30, 26, 26, 27, 26, 25, 28, 27, 25, 25, 25, 2…
## $ fl           <chr> "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p…
## $ class        <chr> "compact", "compact", "compact", "compact", "compact", "c…

mpg$cyl

##   [1] 4 4 4 4 6 6 6 4 4 4 4 6 6 6 6 6 6 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 4 4 6 6 6
##  [38] 4 6 6 6 6 6 6 6 6 6 6 6 6 6 6 8 8 8 8 8 6 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8
##  [75] 8 8 8 6 6 6 6 8 8 6 6 8 8 8 8 8 6 6 6 6 8 8 8 8 8 4 4 4 4 4 4 4 4 4 4 4 4
## [112] 4 6 6 6 4 4 4 4 6 6 6 6 6 6 8 8 8 8 8 8 8 8 8 8 8 8 6 6 8 8 4 4 4 4 6 6 6
## [149] 6 6 6 6 6 8 6 6 6 6 8 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 6 6 6 8 4 4 4 4 6 6
## [186] 6 4 4 4 4 6 6 6 4 4 4 4 4 8 8 4 4 4 6 6 6 6 4 4 4 4 6 4 4 4 4 4 5 5 6 6 4
## [223] 4 4 4 5 5 4 4 4 4 6 6 6

mpg[["cyl"]]

##   [1] 4 4 4 4 6 6 6 4 4 4 4 6 6 6 6 6 6 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 4 4 6 6 6
##  [38] 4 6 6 6 6 6 6 6 6 6 6 6 6 6 6 8 8 8 8 8 6 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8
##  [75] 8 8 8 6 6 6 6 8 8 6 6 8 8 8 8 8 6 6 6 6 8 8 8 8 8 4 4 4 4 4 4 4 4 4 4 4 4
## [112] 4 6 6 6 4 4 4 4 6 6 6 6 6 6 8 8 8 8 8 8 8 8 8 8 8 8 6 6 8 8 4 4 4 4 6 6 6
## [149] 6 6 6 6 6 8 6 6 6 6 8 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 6 6 6 8 4 4 4 4 6 6
## [186] 6 4 4 4 4 6 6 6 4 4 4 4 4 8 8 4 4 4 6 6 6 6 4 4 4 4 6 4 4 4 4 4 5 5 6 6 4
## [223] 4 4 4 5 5 4 4 4 4 6 6 6

mpg[[5]]      # Subset by position

##   [1] 4 4 4 4 6 6 6 4 4 4 4 6 6 6 6 6 6 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 4 4 6 6 6
##  [38] 4 6 6 6 6 6 6 6 6 6 6 6 6 6 6 8 8 8 8 8 6 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8
##  [75] 8 8 8 6 6 6 6 8 8 6 6 8 8 8 8 8 6 6 6 6 8 8 8 8 8 4 4 4 4 4 4 4 4 4 4 4 4
## [112] 4 6 6 6 4 4 4 4 6 6 6 6 6 6 8 8 8 8 8 8 8 8 8 8 8 8 6 6 8 8 4 4 4 4 6 6 6
## [149] 6 6 6 6 6 8 6 6 6 6 8 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 6 6 6 8 4 4 4 4 6 6
## [186] 6 4 4 4 4 6 6 6 4 4 4 4 4 8 8 4 4 4 6 6 6 6 4 4 4 4 6 4 4 4 4 4 5 5 6 6 4
## [223] 4 4 4 5 5 4 4 4 4 6 6 6

A key difference between tibble and data.frame is that, tibbles don’t do partial matching for subsetting. For example, we can do the following with the data set iris:

iris$Sp

This returns iris$Species as the result since “Sp” is a partial matching for the column name “Species”. This is not a preferrable action in most cases and tibbles turn off this function.

mpg$cy

## NULL

Subsetting with `[` (not recommended for tibbles)

An old-style way to subset a data frame in R is to use the [. For example, the codes below subset the second row, the third column, the cell at the second row and the third column respectively.

iris[2, ]

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 2          4.9           3          1.4         0.2  setosa

iris[, 3]

##   [1] 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 1.5 1.6 1.4 1.1 1.2 1.5 1.3 1.4
##  [19] 1.7 1.5 1.7 1.5 1.0 1.7 1.9 1.6 1.6 1.5 1.4 1.6 1.6 1.5 1.5 1.4 1.5 1.2
##  [37] 1.3 1.4 1.3 1.5 1.3 1.3 1.3 1.6 1.9 1.4 1.6 1.4 1.5 1.4 4.7 4.5 4.9 4.0
##  [55] 4.6 4.5 4.7 3.3 4.6 3.9 3.5 4.2 4.0 4.7 3.6 4.4 4.5 4.1 4.5 3.9 4.8 4.0
##  [73] 4.9 4.7 4.3 4.4 4.8 5.0 4.5 3.5 3.8 3.7 3.9 5.1 4.5 4.5 4.7 4.4 4.1 4.0
##  [91] 4.4 4.6 4.0 3.3 4.2 4.2 4.2 4.3 3.0 4.1 6.0 5.1 5.9 5.6 5.8 6.6 4.5 6.3
## [109] 5.8 6.1 5.1 5.3 5.5 5.0 5.1 5.3 5.5 6.7 6.9 5.0 5.7 4.9 6.7 4.9 5.7 6.0
## [127] 4.8 4.9 5.6 5.8 6.1 6.4 5.6 5.1 5.6 6.1 5.6 5.5 4.8 5.4 5.6 5.1 5.1 5.9
## [145] 5.7 5.2 5.0 5.2 5.4 5.1

iris[2,3]

## [1] 1.4

This is not recommended for tibbles since we use filter() to select rows and select() to select columns. That makes codes much more clear to write and understand.

However, there is still some useful things in using [. Logical subsetting would be one example. For example, if we hope to choose rows where the value of Species is setosa, we can do the following:

iris[iris$Species == "setosa", ]

##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1           5.1         3.5          1.4         0.2  setosa
## 2           4.9         3.0          1.4         0.2  setosa
## 3           4.7         3.2          1.3         0.2  setosa
## 4           4.6         3.1          1.5         0.2  setosa
## 5           5.0         3.6          1.4         0.2  setosa
## 6           5.4         3.9          1.7         0.4  setosa
## 7           4.6         3.4          1.4         0.3  setosa
## 8           5.0         3.4          1.5         0.2  setosa
## 9           4.4         2.9          1.4         0.2  setosa
## 10          4.9         3.1          1.5         0.1  setosa
## 11          5.4         3.7          1.5         0.2  setosa
## 12          4.8         3.4          1.6         0.2  setosa
## 13          4.8         3.0          1.4         0.1  setosa
## 14          4.3         3.0          1.1         0.1  setosa
## 15          5.8         4.0          1.2         0.2  setosa
## 16          5.7         4.4          1.5         0.4  setosa
## 17          5.4         3.9          1.3         0.4  setosa
## 18          5.1         3.5          1.4         0.3  setosa
## 19          5.7         3.8          1.7         0.3  setosa
## 20          5.1         3.8          1.5         0.3  setosa
## 21          5.4         3.4          1.7         0.2  setosa
## 22          5.1         3.7          1.5         0.4  setosa
## 23          4.6         3.6          1.0         0.2  setosa
## 24          5.1         3.3          1.7         0.5  setosa
## 25          4.8         3.4          1.9         0.2  setosa
## 26          5.0         3.0          1.6         0.2  setosa
## 27          5.0         3.4          1.6         0.4  setosa
## 28          5.2         3.5          1.5         0.2  setosa
## 29          5.2         3.4          1.4         0.2  setosa
## 30          4.7         3.2          1.6         0.2  setosa
## 31          4.8         3.1          1.6         0.2  setosa
## 32          5.4         3.4          1.5         0.4  setosa
## 33          5.2         4.1          1.5         0.1  setosa
## 34          5.5         4.2          1.4         0.2  setosa
## 35          4.9         3.1          1.5         0.2  setosa
## 36          5.0         3.2          1.2         0.2  setosa
## 37          5.5         3.5          1.3         0.2  setosa
## 38          4.9         3.6          1.4         0.1  setosa
## 39          4.4         3.0          1.3         0.2  setosa
## 40          5.1         3.4          1.5         0.2  setosa
## 41          5.0         3.5          1.3         0.3  setosa
## 42          4.5         2.3          1.3         0.3  setosa
## 43          4.4         3.2          1.3         0.2  setosa
## 44          5.0         3.5          1.6         0.6  setosa
## 45          5.1         3.8          1.9         0.4  setosa
## 46          4.8         3.0          1.4         0.3  setosa
## 47          5.1         3.8          1.6         0.2  setosa
## 48          4.6         3.2          1.4         0.2  setosa
## 49          5.3         3.7          1.5         0.2  setosa
## 50          5.0         3.3          1.4         0.2  setosa

For data frames we can do the same by converting it into a tibble and use the filter() function:

iris_tib <- as.tibble(iris) %>%
  filter(Species == "setosa") %>%
  print()

## # A tibble: 50 × 5
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##           <dbl>       <dbl>        <dbl>       <dbl> <fct>  
##  1          5.1         3.5          1.4         0.2 setosa 
##  2          4.9         3            1.4         0.2 setosa 
##  3          4.7         3.2          1.3         0.2 setosa 
##  4          4.6         3.1          1.5         0.2 setosa 
##  5          5           3.6          1.4         0.2 setosa 
##  6          5.4         3.9          1.7         0.4 setosa 
##  7          4.6         3.4          1.4         0.3 setosa 
##  8          5           3.4          1.5         0.2 setosa 
##  9          4.4         2.9          1.4         0.2 setosa 
## 10          4.9         3.1          1.5         0.1 setosa 
## # … with 40 more rows

However, the [ logical subsetting also works for vectors and lists in R, which can be very useful in some situations.

Lab Exercise: Run the following code to create a random number list. Then use logical subsetting to find all numbers that are greater than one and less than two.

rand_num <- runif(100, 0, 3)

Convert a data frame into a tibble

To convert a data frame into a tibble, we can simply do that with the function as_tibble():

as.tibble(iris)

## # A tibble: 150 × 5
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##           <dbl>       <dbl>        <dbl>       <dbl> <fct>  
##  1          5.1         3.5          1.4         0.2 setosa 
##  2          4.9         3            1.4         0.2 setosa 
##  3          4.7         3.2          1.3         0.2 setosa 
##  4          4.6         3.1          1.5         0.2 setosa 
##  5          5           3.6          1.4         0.2 setosa 
##  6          5.4         3.9          1.7         0.4 setosa 
##  7          4.6         3.4          1.4         0.3 setosa 
##  8          5           3.4          1.5         0.2 setosa 
##  9          4.4         2.9          1.4         0.2 setosa 
## 10          4.9         3.1          1.5         0.1 setosa 
## # … with 140 more rows

Create a new tibble

To create a new tibble (usually for testing and learning), we can use the tibble() function:

tibble(
  x = 1:5, 
  y = 1, 
  z = x ^ 2 + y
)

## # A tibble: 5 × 3
##       x     y     z
##   <int> <dbl> <dbl>
## 1     1     1     2
## 2     2     1     5
## 3     3     1    10
## 4     4     1    17
## 5     5     1    26

Here, “x”, “y”, “z” are the column names of the same lenth. For “y”, tibble() will automatically recycle inputs of length 1, and allows you to refer to variables that you just created, as shown above.

It’s possible for a tibble to have column names that are not valid R variable names, aka non-syntactic names. For example, they might not start with a letter, or they might contain unusual characters like a space. Those names may appear in raw data sets, which might not be originally created in R.

To refer to these variables, you need to surround them with backticks:

tb <- tibble(
  `:)` = "smile", 
  ` ` = "space",
  `2000` = "number"
)

Another way to create a tibble is with tribble(), short for transposed tibble. tribble() is customised for data entry in code: column headings are defined by formulas (i.e. they start with ~), and entries are separated by commas. This makes it possible to lay out small amounts of data in easy to read form:

tribble(
  ~x, ~y, ~z,
  #--|--|----
  "a", 2, 3.6,
  "b", 1, 8.5
)

## # A tibble: 2 × 3
##   x         y     z
##   <chr> <dbl> <dbl>
## 1 a         2   3.6
## 2 b         1   8.5

You are recommended to add a comment (the line starting with #), to make it really clear where the header is.

Lab Exercise: Create a tibble in R for the following data set:

Student Name	Test 1 Score	Test 2 Score	Test 3 Score
Alice	80	90	85
Bob	70	75	80
Charlie	85	88	91
Denise	60	65	70

2. Introduction to Tidy Data

In this chapter, you will learn a consistent way to organise your data in R, an organisation called tidy data. So far we have been handling data sets that can be directly put into analysis such as visualization, transformation. That is because those data sets are already tidy.

But in real-world data analysis, raw data are rarely tidy. We have to do some hands-on “dirty” work to tidy them. In this introdutory lecture, we will learn the important concepts for tidy data, and how they can be untidy.

We will learn how to use tools in the tidyr package, part of tidyverse.

What is Tidy data?

There are three interrelated rules which make a data set tidy:

Each variable must have its own column.
Each observation must have its own row.
Each value must have its own cell.

Following three rules makes a data set tidy: variables are in columns, observations are in rows, and values are in cells. Let’s look at a few examples to understand it.

In all the tables below, “cases” refers to the number of Tuberculosis (TB) cases.

Example 1:table1 is tidy

table1

## # A tibble: 6 × 4
##   country      year  cases population
##   <chr>       <dbl>  <dbl>      <dbl>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3 Brazil       1999  37737  172006362
## 4 Brazil       2000  80488  174504898
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583

Example 2: table2 is not tidy

table2

## # A tibble: 12 × 4
##    country      year type            count
##    <chr>       <dbl> <chr>           <dbl>
##  1 Afghanistan  1999 cases             745
##  2 Afghanistan  1999 population   19987071
##  3 Afghanistan  2000 cases            2666
##  4 Afghanistan  2000 population   20595360
##  5 Brazil       1999 cases           37737
##  6 Brazil       1999 population  172006362
##  7 Brazil       2000 cases           80488
##  8 Brazil       2000 population  174504898
##  9 China        1999 cases          212258
## 10 China        1999 population 1272915272
## 11 China        2000 cases          213766
## 12 China        2000 population 1280428583

Example 3: table3 is not tidy.

table3

## # A tibble: 6 × 3
##   country      year rate             
##   <chr>       <dbl> <chr>            
## 1 Afghanistan  1999 745/19987071     
## 2 Afghanistan  2000 2666/20595360    
## 3 Brazil       1999 37737/172006362  
## 4 Brazil       2000 80488/174504898  
## 5 China        1999 212258/1272915272
## 6 China        2000 213766/1280428583

Example 4a: table4a is not tidy.

table4a

## # A tibble: 3 × 3
##   country     `1999` `2000`
##   <chr>        <dbl>  <dbl>
## 1 Afghanistan    745   2666
## 2 Brazil       37737  80488
## 3 China       212258 213766

Example 4b:? table4b is not tidy.

table4b

## # A tibble: 3 × 3
##   country         `1999`     `2000`
##   <chr>            <dbl>      <dbl>
## 1 Afghanistan   19987071   20595360
## 2 Brazil       172006362  174504898
## 3 China       1272915272 1280428583

Five Symptoms of Untidy Data

To understand why table2-4 are not tidy, we may refer to the 5 most frequently seen symptoms of untidy data:

Column headers are values, not variable names.
Multiple variables are stored in one column.
Variables are stored in both rows and columns.
Multiple types of observational units are stored in the same table.
A single observational unit is stored in multiple tables.

Lab Exercise: Tell for each table of Table 2-4, what symptoms of untidiness are there?

Why do we need tidy data?

Why ensure that your data is tidy? There are two main advantages:

There’s a general advantage to picking one consistent way of storing data. If you have a consistent data structure, it’s easier to learn the tools that work with it because they have an underlying uniformity.
There’s a specific advantage to placing variables in columns because it allows R’s vectorised nature to shine. As you learned in mutate and summary functions, most built-in R functions work with vectors of values. That makes transforming tidy data feel particularly natural.

In short, A consistent data structure makes it easier to work with the same tools. dplyr, ggplot2, and all the other packages in the tidyverse are designed to work with tidy data.

Example

With Table 1, it is straightforward to perform the following data exploration.

# Compute rate per 10,000
table1 %>% 
  mutate(rate = cases / population * 10000)

## # A tibble: 6 × 5
##   country      year  cases population  rate
##   <chr>       <dbl>  <dbl>      <dbl> <dbl>
## 1 Afghanistan  1999    745   19987071 0.373
## 2 Afghanistan  2000   2666   20595360 1.29 
## 3 Brazil       1999  37737  172006362 2.19 
## 4 Brazil       2000  80488  174504898 4.61 
## 5 China        1999 212258 1272915272 1.67 
## 6 China        2000 213766 1280428583 1.67

# Compute cases per year
table1 %>% 
  count(year, wt = cases) # "wt" refers to the weight when do the counting

## # A tibble: 2 × 2
##    year      n
##   <dbl>  <dbl>
## 1  1999 250740
## 2  2000 296920

# Visualise changes over time
table1 %>% 
  mutate(rate = cases / population * 10000) %>%
  ggplot(aes(x = as.factor(year), y = rate)) + 
  geom_line(aes(group = country), colour = "grey50") + 
  geom_point(aes(colour = country)) + 
  labs(title = "TB rate", x = "Year", y = "Cases per 10,000") +
  ylim(0, 5) + 
  theme(plot.title = element_text(hjust = 0.5, size = rel(1.5), margin = margin(15,15,15,15)), 
        axis.title = element_text(size = rel(1.4)), 
        axis.title.x = element_text(margin = margin(10,5,5,5)), 
        axis.title.y = element_text(margin = margin(5,10,5,5)), 
        axis.text = element_text(size = rel(1.4)))

However, if we try to create the same results or graph from Table 2-4, there would be much less convenience. Therefore, we should tidy our data before actual data exploration.

Lab Exercise: Compute the rates per 10,000 in population for each year and country using table2 instead of table1. What is the difficulty?

Another Example

Untidy data naturally manifests in real world. Here is another simple example. Suppose that an instructor hopes to record student test scores in a class, where each row represents a student and each column represents a test. The data set may look like this:

Student Name	Test 1 Score	Test 2 Score	Test 3 Score
Alice	80	90	85
Bob	70	75	80
Charlie	85	88	91
Denise	60	65	70

However, this is untidy data since the three column headers are actually different values of another variable Test. A tidy data set would look like this:

Student Name	Test Number	Score
Alice	1	80
Alice	2	90
Alice	3	85
Bob	1	70
Bob	2	75
Bob	3	80
Charlie	1	85
Charlie	2	88
Charlie	3	91
Denise	1	60
Denise	2	65
Denise	3	70

Starting the next module, we will learn how to tidy data to resolve different symptoms.

Lecture 11 - Tidy Data Part 1

Miao Yu

2023-02-27

Load Libraries

0. Introduction to Data Wrangling

1. Tibbles

Tibbles vs `data.frame`

Printing

Subsetting

Subsetting with `[` (not recommended for tibbles)

Lab Exercise: Run the following code to create a random number list. Then use logical subsetting to find all numbers that are greater than one and less than two.

Convert a data frame into a tibble

Create a new tibble

Lab Exercise: Create a tibble in R for the following data set:

2. Introduction to Tidy Data

What is Tidy data?

Example 1:table1 is tidy

Example 2: table2 is not tidy

Example 3: table3 is not tidy.

Example 4a: table4a is not tidy.

Example 4b:? table4b is not tidy.

Five Symptoms of Untidy Data

Lab Exercise: Tell for each table of Table 2-4, what symptoms of untidiness are there?

Why do we need tidy data?

Example

Lab Exercise: Compute the rates per 10,000 in population for each year and country using table2 instead of table1. What is the difficulty?

Another Example

Lecture 11 - Tidy Data Part 1

Miao Yu

2023-02-27

Load Libraries

0. Introduction to Data Wrangling

1. Tibbles

Tibbles vs data.frame

Printing

Subsetting

Subsetting with [ (not recommended for tibbles)

Lab Exercise: Run the following code to create a random number list. Then use logical subsetting to find all numbers that are greater than one and less than two.

Convert a data frame into a tibble

Create a new tibble

Lab Exercise: Create a tibble in R for the following data set:

2. Introduction to Tidy Data

What is Tidy data?

Example 1:table1 is tidy

Example 2: table2 is not tidy

Example 3: table3 is not tidy.

Example 4a: table4a is not tidy.

Example 4b:? table4b is not tidy.

Five Symptoms of Untidy Data

Lab Exercise: Tell for each table of Table 2-4, what symptoms of untidiness are there?

Why do we need tidy data?

Example

Lab Exercise: Compute the rates per 10,000 in population for each year and country using table2 instead of table1. What is the difficulty?

Another Example

Tibbles vs `data.frame`

Subsetting with `[` (not recommended for tibbles)