Importing and Tidying Data using
the tidyverse in R


Chester Ismay
Twitter: @old_man_chester
email: cismay@reed.edu

Tuesday, November 1, 2016

Slides available at http://bit.ly/reed-imp-tidy
Messy files to download at http://bit.ly/reed-messy-data

Packages to install

I highly recommend you download and use RStudio Desktop if you aren’t already.

Please copy the following commands into an R script and run in your RStudio Console:

dl_repo <- "http://cran.rstudio.org"
install.packages("tidyverse", repos = dl_repo)
install.packages("googlesheets", repos = dl_repo)

Note that you don’t need to do this if you are using the Reed College RStudio Server as these are already installed there.



Slides available at http://bit.ly/reed-imp-tidy

What the tidyverse contains

For our work today, it includes the following R packages:

IMPORT

  • haven for SPSS, SAS, and Stata data files
  • jsonlite for JSON files
  • readxl for XLS and XLSX files
  • readr for CSV and TSV files (and R specific RDS files)

TIDYING

  • tidyr for converting “messy” into “tidy” data frames

Slides available at http://bit.ly/reed-imp-tidy

Why am I here…?

Hadley’s Workflow

Tidy Data

Slides available at http://bit.ly/reed-imp-tidy

Basics of Importing

Downloading the files

We will begin by downloading and importing a variety of different “messy” data sets. You can download all of them in a ZIP file at http://bit.ly/reed-messy-data. The links below go to the original sources. I’ve converted these original sources into different formats.

Slides available at http://bit.ly/reed-imp-tidy

Demonstration

Slides available at http://bit.ly/reed-imp-tidy

Practice

Slides available at http://bit.ly/reed-imp-tidy

Tidying

The Life Expectancy by year data

library(readr)
life_exp_df <- read_csv("le_mess.csv")
View(life_exp_df)

Slides available at http://bit.ly/reed-imp-tidy

Using the tidyr package

  • Highly recommend downloading the RStudio Cheatsheets here and specifically the Data Wrangling one here

Slides available at http://bit.ly/reed-imp-tidy

Doing the “tidying”/reshaping

library(tidyr)
library(dplyr)
(life_exp_tidy <- life_exp_df %>% 
    gather(key = "year", value = "life_exp", -country))
## # A tibble: 13,332 x 3
##                country  year life_exp
##                  <chr> <chr>    <dbl>
## 1          Afghanistan  1951    27.13
## 2              Albania  1951    54.72
## 3              Algeria  1951    43.03
## 4               Angola  1951    31.05
## 5  Antigua and Barbuda  1951    58.26
## 6            Argentina  1951    61.93
## 7              Armenia  1951    62.67
## 8                Aruba  1951    58.96
## 9            Australia  1951    68.71
## 10             Austria  1951    65.24
## # ... with 13,322 more rows

Slides available at http://bit.ly/reed-imp-tidy

Tidying

World Health Organization TB data (Stata DTA)

library(haven)
who_df <- read_dta("who.dta")
View(who_df)

Slides available at http://bit.ly/reed-imp-tidy

WHO ugly…

  • This data set contains strange variable names that will require us to look up their meaning in the corresponding data dictionary.
  • What do we know?
    • country, iso2, and iso3 are redundant and identify the country
    • year is a variable and it looks like it corresponds to each country
  • But what in the world is new_sp_m014? And the other columns?…

Slides available at http://bit.ly/reed-imp-tidy

First step

Like before, we can gather these non-country and non-year variables together:

who_tidy <- who_df %>% 
  gather(new_sp_m014:newrel_f65, key = "key", value = "value")

We can now see what this has done to the data frame:

View(who_tidy)

Slides available at http://bit.ly/reed-imp-tidy

Data dictionary saves us some…

  1. The first three letters of entries in the key column correspond to new or old cases of TB.
  2. The next two letters (after the _) correspond to the type of TB:
    • rel for relapse, ep for extrapulmonary TB
    • sn for smear negative, sp for smear positive
  3. The next letter after the second _ corresponds to the sex of the TB patient.
  4. The remaining numbers correspond to age group:
    • 014 for 0 to 14 years
    • 65 for 65 or older

Slides available at http://bit.ly/reed-imp-tidy

  • Looking over the entries of key in who_tidy, do you see anything else that doesn’t match the pattern?

  • It may be easier to remember that the entries of key correspond to column names in who_df:

names(who_df)
##  [1] "country"      "iso2"         "iso3"         "year"        
##  [5] "new_sp_m014"  "new_sp_m1524" "new_sp_m2534" "new_sp_m3544"
##  [9] "new_sp_m4554" "new_sp_m5564" "new_sp_m65"   "new_sp_f014" 
## [13] "new_sp_f1524" "new_sp_f2534" "new_sp_f3544" "new_sp_f4554"
## [17] "new_sp_f5564" "new_sp_f65"   "new_sn_m014"  "new_sn_m1524"
## [21] "new_sn_m2534" "new_sn_m3544" "new_sn_m4554" "new_sn_m5564"
## [25] "new_sn_m65"   "new_sn_f014"  "new_sn_f1524" "new_sn_f2534"
## [29] "new_sn_f3544" "new_sn_f4554" "new_sn_f5564" "new_sn_f65"  
## [33] "new_ep_m014"  "new_ep_m1524" "new_ep_m2534" "new_ep_m3544"
## [37] "new_ep_m4554" "new_ep_m5564" "new_ep_m65"   "new_ep_f014" 
## [41] "new_ep_f1524" "new_ep_f2534" "new_ep_f3544" "new_ep_f4554"
## [45] "new_ep_f5564" "new_ep_f65"   "newrel_m014"  "newrel_m1524"
## [49] "newrel_m2534" "newrel_m3544" "newrel_m4554" "newrel_m5564"
## [53] "newrel_m65"   "newrel_f014"  "newrel_f1524" "newrel_f2534"
## [57] "newrel_f3544" "newrel_f4554" "newrel_f5564" "newrel_f65"

Slides available at http://bit.ly/reed-imp-tidy

A fix using stringr

You can replace all of the entries in key that contained newrel with new_rel via

library(stringr)
library(dplyr)
who_tidy <- who_tidy %>% 
  mutate(key = str_replace(string = key, 
    pattern = "newrel", 
    replacement = "new_rel"))

Slides available at http://bit.ly/reed-imp-tidy

Pulling apart variables

The entry new_sp_m014 is actually four different variables. Use the separate function to pull them apart:

who_tidy <- who_tidy %>% 
  separate(col = key, into = c("new", "type", "sexage"), sep = "_")
who_tidy
## # A tibble: 405,440 x 8
##        country  iso2  iso3  year   new  type sexage value
## *        <chr> <chr> <chr> <dbl> <chr> <chr>  <chr> <dbl>
## 1  Afghanistan    AF   AFG  1980   new    sp   m014   NaN
## 2  Afghanistan    AF   AFG  1981   new    sp   m014   NaN
## 3  Afghanistan    AF   AFG  1982   new    sp   m014   NaN
## 4  Afghanistan    AF   AFG  1983   new    sp   m014   NaN
## 5  Afghanistan    AF   AFG  1984   new    sp   m014   NaN
## 6  Afghanistan    AF   AFG  1985   new    sp   m014   NaN
## 7  Afghanistan    AF   AFG  1986   new    sp   m014   NaN
## 8  Afghanistan    AF   AFG  1987   new    sp   m014   NaN
## 9  Afghanistan    AF   AFG  1988   new    sp   m014   NaN
## 10 Afghanistan    AF   AFG  1989   new    sp   m014   NaN
## # ... with 405,430 more rows

Slides available at http://bit.ly/reed-imp-tidy

One more pull apart

  • We need to pull sexage into two different variables.
  • If you use ?separate, you’ll see that the following is an option:
who_tidy <- who_tidy %>% 
  separate(col = sexage, into = c("sex", "age"), sep = 1)
who_tidy
## # A tibble: 405,440 x 9
##        country  iso2  iso3  year   new  type   sex   age value
## *        <chr> <chr> <chr> <dbl> <chr> <chr> <chr> <chr> <dbl>
## 1  Afghanistan    AF   AFG  1980   new    sp     m   014   NaN
## 2  Afghanistan    AF   AFG  1981   new    sp     m   014   NaN
## 3  Afghanistan    AF   AFG  1982   new    sp     m   014   NaN
## 4  Afghanistan    AF   AFG  1983   new    sp     m   014   NaN
## 5  Afghanistan    AF   AFG  1984   new    sp     m   014   NaN
## 6  Afghanistan    AF   AFG  1985   new    sp     m   014   NaN
## 7  Afghanistan    AF   AFG  1986   new    sp     m   014   NaN
## 8  Afghanistan    AF   AFG  1987   new    sp     m   014   NaN
## 9  Afghanistan    AF   AFG  1988   new    sp     m   014   NaN
## 10 Afghanistan    AF   AFG  1989   new    sp     m   014   NaN
## # ... with 405,430 more rows

Slides available at http://bit.ly/reed-imp-tidy

Final cleaning

  • iso2 and iso3 are redundant if we have country
  • new is constant since this only has new cases of TB
  • Let’s just ignore missing values here
  • We also know that value is actually cases so we can rename that column.
who_tidy <- who_tidy %>% select(-iso2, -iso3, -new) %>% 
  na.omit() %>% rename("cases" = value)
head(who_tidy, 7)
## # A tibble: 7 x 6
##       country  year  type   sex   age cases
##         <chr> <dbl> <chr> <chr> <chr> <dbl>
## 1 Afghanistan  1997    sp     m   014     0
## 2 Afghanistan  1998    sp     m   014    30
## 3 Afghanistan  1999    sp     m   014     8
## 4 Afghanistan  2000    sp     m   014    52
## 5 Afghanistan  2001    sp     m   014   129
## 6 Afghanistan  2002    sp     m   014    90
## 7 Afghanistan  2003    sp     m   014   127

Slides available at http://bit.ly/reed-imp-tidy

The power of the %>% (pipe)

Everything we did above can be chained into one sequence:

who_tidy <- who_df %>% 
  gather(new_sp_m014:newrel_f65, key = "key", value = "value") %>%  
  mutate(key = str_replace(key, pattern = "newrel", 
                           replacement = "new_rel")) %>% 
  separate(col = key, into = c("new", "type", "sexage"), sep = "_") %>% 
  separate(col = sexage, into = c("sex", "age"), sep = 1) %>% 
  select(-iso2, -iso3, -new) %>% 
  na.omit() %>% 
  rename("cases" = value)

Slides available at http://bit.ly/reed-imp-tidy

BONUS

A Gapminder tidy dataset read in from a Google Sheet!

  • I’ve updated the gapminder data set available in the gapminder R data package by Jenny Bryan here. Jenny provides instructions for recreating the data here.

  • You can download the updated data from Google Sheets by running the following in R:

# Link is https://docs.google.com/spreadsheets/d/18L5ZiXd1CQ97XWSqb04x1YQ4ncsEOdR7eHzgX0ZIuPA/edit?usp=sharing
updated_gap <- gs_key("18L5ZiXd1CQ97XWSqb04x1YQ4ncsEOdR7eHzgX0ZIuPA") %>%
  gs_read()
  • A script going through the steps to take the “messy” original Gapminder.org files and turn them into this tidy dataset is available here.

Slides available at http://bit.ly/reed-imp-tidy

More practice

Try to tidy the other three data sets you downloaded and imported.

Slides available at http://bit.ly/reed-imp-tidy

Additional resources

Slides available at http://bit.ly/reed-imp-tidy

Thanks!

  • All workshop materials available here
  • Slides available here
sessionInfo()
## R version 3.3.1 (2016-06-21)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.12.1 (Sierra)
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] stringr_1.1.0 dplyr_0.5.0   tidyr_0.6.0   needs_0.0.3  
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.6        png_0.1-7          revealjs_0.7      
##  [4] digest_0.6.10      assertthat_0.1     R6_2.1.3          
##  [7] DBI_0.5            formatR_1.4        magrittr_1.5      
## [10] evaluate_0.9       stringi_1.1.1      lazyeval_0.2.0    
## [13] rmarkdown_1.0.9013 tools_3.3.1        readr_1.0.0       
## [16] yaml_2.1.13        htmltools_0.3.5    haven_1.0.0       
## [19] knitr_1.14         tibble_1.1