Importing and Tidying Data using
the `tidyverse` in R

Chester Ismay
Twitter: @old_man_chester
email: cismay@reed.edu

Tuesday, November 1, 2016

Slides available at http://bit.ly/reed-imp-tidy
Messy files to download at http://bit.ly/reed-messy-data

Packages to install

I highly recommend you download and use RStudio Desktop if you aren’t already.

Please copy the following commands into an R script and run in your RStudio Console:

dl_repo <- "http://cran.rstudio.org"
install.packages("tidyverse", repos = dl_repo)
install.packages("googlesheets", repos = dl_repo)

Note that you don’t need to do this if you are using the Reed College RStudio Server as these are already installed there.

Slides available at http://bit.ly/reed-imp-tidy

What the `tidyverse` contains

For our work today, it includes the following R packages:

IMPORT

haven for SPSS, SAS, and Stata data files
jsonlite for JSON files
readxl for XLS and XLSX files
readr for CSV and TSV files (and R specific RDS files)

TIDYING

tidyr for converting “messy” into “tidy” data frames

Slides available at http://bit.ly/reed-imp-tidy

Why am I here…?

Hadley’s Workflow

Tidy Data

Slides available at http://bit.ly/reed-imp-tidy

Basics of Importing

Downloading the files

We will begin by downloading and importing a variety of different “messy” data sets. You can download all of them in a ZIP file at http://bit.ly/reed-messy-data. The links below go to the original sources. I’ve converted these original sources into different formats.

Slides available at http://bit.ly/reed-imp-tidy

Demonstration

Slides available at http://bit.ly/reed-imp-tidy

Practice

Download the four other files
Import them into R using the appropriate package
- You may need to check out the help pages for the different packages
- Give them the following names: who_df, county_pop_df, edu_county_df, and potus12_df

Slides available at http://bit.ly/reed-imp-tidy

Tidying

The Life Expectancy by year data

library(readr)
life_exp_df <- read_csv("le_mess.csv")
View(life_exp_df)

Slides available at http://bit.ly/reed-imp-tidy

Using the `tidyr` package

Highly recommend downloading the RStudio Cheatsheets here and specifically the Data Wrangling one here

Slides available at http://bit.ly/reed-imp-tidy

Doing the “tidying”/reshaping

library(tidyr)
library(dplyr)
(life_exp_tidy <- life_exp_df %>% 
    gather(key = "year", value = "life_exp", -country))

## # A tibble: 13,332 x 3
##                country  year life_exp
##                  <chr> <chr>    <dbl>
## 1          Afghanistan  1951    27.13
## 2              Albania  1951    54.72
## 3              Algeria  1951    43.03
## 4               Angola  1951    31.05
## 5  Antigua and Barbuda  1951    58.26
## 6            Argentina  1951    61.93
## 7              Armenia  1951    62.67
## 8                Aruba  1951    58.96
## 9            Australia  1951    68.71
## 10             Austria  1951    65.24
## # ... with 13,322 more rows

Slides available at http://bit.ly/reed-imp-tidy

Tidying

World Health Organization TB data (Stata DTA)

library(haven)
who_df <- read_dta("who.dta")
View(who_df)

Slides available at http://bit.ly/reed-imp-tidy

WHO ugly…

This data set contains strange variable names that will require us to look up their meaning in the corresponding data dictionary.
What do we know?
- country, iso2, and iso3 are redundant and identify the country
- year is a variable and it looks like it corresponds to each country
But what in the world is new_sp_m014? And the other columns?…

Slides available at http://bit.ly/reed-imp-tidy

First step

Like before, we can gather these non-country and non-year variables together:

who_tidy <- who_df %>% 
  gather(new_sp_m014:newrel_f65, key = "key", value = "value")

We can now see what this has done to the data frame:

View(who_tidy)

Slides available at http://bit.ly/reed-imp-tidy

Data dictionary saves us some…

The first three letters of entries in the key column correspond to new or old cases of TB.
The next two letters (after the _) correspond to the type of TB:
- rel for relapse, ep for extrapulmonary TB
- sn for smear negative, sp for smear positive
The next letter after the second _ corresponds to the sex of the TB patient.
The remaining numbers correspond to age group:
- 014 for 0 to 14 years
- …
- 65 for 65 or older

Slides available at http://bit.ly/reed-imp-tidy

Looking over the entries of key in who_tidy, do you see anything else that doesn’t match the pattern?
It may be easier to remember that the entries of key correspond to column names in who_df:

names(who_df)

##  [1] "country"      "iso2"         "iso3"         "year"        
##  [5] "new_sp_m014"  "new_sp_m1524" "new_sp_m2534" "new_sp_m3544"
##  [9] "new_sp_m4554" "new_sp_m5564" "new_sp_m65"   "new_sp_f014" 
## [13] "new_sp_f1524" "new_sp_f2534" "new_sp_f3544" "new_sp_f4554"
## [17] "new_sp_f5564" "new_sp_f65"   "new_sn_m014"  "new_sn_m1524"
## [21] "new_sn_m2534" "new_sn_m3544" "new_sn_m4554" "new_sn_m5564"
## [25] "new_sn_m65"   "new_sn_f014"  "new_sn_f1524" "new_sn_f2534"
## [29] "new_sn_f3544" "new_sn_f4554" "new_sn_f5564" "new_sn_f65"  
## [33] "new_ep_m014"  "new_ep_m1524" "new_ep_m2534" "new_ep_m3544"
## [37] "new_ep_m4554" "new_ep_m5564" "new_ep_m65"   "new_ep_f014" 
## [41] "new_ep_f1524" "new_ep_f2534" "new_ep_f3544" "new_ep_f4554"
## [45] "new_ep_f5564" "new_ep_f65"   "newrel_m014"  "newrel_m1524"
## [49] "newrel_m2534" "newrel_m3544" "newrel_m4554" "newrel_m5564"
## [53] "newrel_m65"   "newrel_f014"  "newrel_f1524" "newrel_f2534"
## [57] "newrel_f3544" "newrel_f4554" "newrel_f5564" "newrel_f65"

Slides available at http://bit.ly/reed-imp-tidy

A fix using `stringr`

You can replace all of the entries in key that contained newrel with new_rel via

library(stringr)
library(dplyr)
who_tidy <- who_tidy %>% 
  mutate(key = str_replace(string = key, 
    pattern = "newrel", 
    replacement = "new_rel"))

Slides available at http://bit.ly/reed-imp-tidy

Pulling apart variables

The entry new_sp_m014 is actually four different variables. Use the separate function to pull them apart:

who_tidy <- who_tidy %>% 
  separate(col = key, into = c("new", "type", "sexage"), sep = "_")
who_tidy

## # A tibble: 405,440 x 8
##        country  iso2  iso3  year   new  type sexage value
## *        <chr> <chr> <chr> <dbl> <chr> <chr>  <chr> <dbl>
## 1  Afghanistan    AF   AFG  1980   new    sp   m014   NaN
## 2  Afghanistan    AF   AFG  1981   new    sp   m014   NaN
## 3  Afghanistan    AF   AFG  1982   new    sp   m014   NaN
## 4  Afghanistan    AF   AFG  1983   new    sp   m014   NaN
## 5  Afghanistan    AF   AFG  1984   new    sp   m014   NaN
## 6  Afghanistan    AF   AFG  1985   new    sp   m014   NaN
## 7  Afghanistan    AF   AFG  1986   new    sp   m014   NaN
## 8  Afghanistan    AF   AFG  1987   new    sp   m014   NaN
## 9  Afghanistan    AF   AFG  1988   new    sp   m014   NaN
## 10 Afghanistan    AF   AFG  1989   new    sp   m014   NaN
## # ... with 405,430 more rows

Slides available at http://bit.ly/reed-imp-tidy

One more pull apart

We need to pull sexage into two different variables.
If you use ?separate, you’ll see that the following is an option:

who_tidy <- who_tidy %>% 
  separate(col = sexage, into = c("sex", "age"), sep = 1)
who_tidy

## # A tibble: 405,440 x 9
##        country  iso2  iso3  year   new  type   sex   age value
## *        <chr> <chr> <chr> <dbl> <chr> <chr> <chr> <chr> <dbl>
## 1  Afghanistan    AF   AFG  1980   new    sp     m   014   NaN
## 2  Afghanistan    AF   AFG  1981   new    sp     m   014   NaN
## 3  Afghanistan    AF   AFG  1982   new    sp     m   014   NaN
## 4  Afghanistan    AF   AFG  1983   new    sp     m   014   NaN
## 5  Afghanistan    AF   AFG  1984   new    sp     m   014   NaN
## 6  Afghanistan    AF   AFG  1985   new    sp     m   014   NaN
## 7  Afghanistan    AF   AFG  1986   new    sp     m   014   NaN
## 8  Afghanistan    AF   AFG  1987   new    sp     m   014   NaN
## 9  Afghanistan    AF   AFG  1988   new    sp     m   014   NaN
## 10 Afghanistan    AF   AFG  1989   new    sp     m   014   NaN
## # ... with 405,430 more rows

Slides available at http://bit.ly/reed-imp-tidy

Final cleaning

iso2 and iso3 are redundant if we have country
new is constant since this only has new cases of TB
Let’s just ignore missing values here
We also know that value is actually cases so we can rename that column.

who_tidy <- who_tidy %>% select(-iso2, -iso3, -new) %>% 
  na.omit() %>% rename("cases" = value)
head(who_tidy, 7)

## # A tibble: 7 x 6
##       country  year  type   sex   age cases
##         <chr> <dbl> <chr> <chr> <chr> <dbl>
## 1 Afghanistan  1997    sp     m   014     0
## 2 Afghanistan  1998    sp     m   014    30
## 3 Afghanistan  1999    sp     m   014     8
## 4 Afghanistan  2000    sp     m   014    52
## 5 Afghanistan  2001    sp     m   014   129
## 6 Afghanistan  2002    sp     m   014    90
## 7 Afghanistan  2003    sp     m   014   127

Slides available at http://bit.ly/reed-imp-tidy

The power of the `%>%` (pipe)

Everything we did above can be chained into one sequence:

who_tidy <- who_df %>% 
  gather(new_sp_m014:newrel_f65, key = "key", value = "value") %>%  
  mutate(key = str_replace(key, pattern = "newrel", 
                           replacement = "new_rel")) %>% 
  separate(col = key, into = c("new", "type", "sexage"), sep = "_") %>% 
  separate(col = sexage, into = c("sex", "age"), sep = 1) %>% 
  select(-iso2, -iso3, -new) %>% 
  na.omit() %>% 
  rename("cases" = value)

Slides available at http://bit.ly/reed-imp-tidy

BONUS

A Gapminder tidy dataset read in from a Google Sheet!

I’ve updated the gapminder data set available in the gapminder R data package by Jenny Bryan here. Jenny provides instructions for recreating the data here.
You can download the updated data from Google Sheets by running the following in R:

# Link is https://docs.google.com/spreadsheets/d/18L5ZiXd1CQ97XWSqb04x1YQ4ncsEOdR7eHzgX0ZIuPA/edit?usp=sharing
updated_gap <- gs_key("18L5ZiXd1CQ97XWSqb04x1YQ4ncsEOdR7eHzgX0ZIuPA") %>%
  gs_read()

A script going through the steps to take the “messy” original Gapminder.org files and turn them into this tidy dataset is available here.

Slides available at http://bit.ly/reed-imp-tidy

More practice

Try to tidy the other three data sets you downloaded and imported.

Slides available at http://bit.ly/reed-imp-tidy

Additional resources

rio R package
DataCamp Importing Tutorials
- This R Data Import Tutorial Is Everything You Need
- Importing Data Into R - Part Two
R for Data Science
- Chapter 11 on Importing Data
- Chapter 12 on Tidying Data

Slides available at http://bit.ly/reed-imp-tidy

Thanks!

All workshop materials available here
Slides available here

sessionInfo()

## R version 3.3.1 (2016-06-21)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.12.1 (Sierra)
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] stringr_1.1.0 dplyr_0.5.0   tidyr_0.6.0   needs_0.0.3  
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.6        png_0.1-7          revealjs_0.7      
##  [4] digest_0.6.10      assertthat_0.1     R6_2.1.3          
##  [7] DBI_0.5            formatR_1.4        magrittr_1.5      
## [10] evaluate_0.9       stringi_1.1.1      lazyeval_0.2.0    
## [13] rmarkdown_1.0.9013 tools_3.3.1        readr_1.0.0       
## [16] yaml_2.1.13        htmltools_0.3.5    haven_1.0.0       
## [19] knitr_1.14         tibble_1.1

Importing and Tidying Data using the tidyverse in R

Chester Ismay Twitter: @old_man_chester email: cismay@reed.edu

Tuesday, November 1, 2016 Slides available at http://bit.ly/reed-imp-tidy Messy files to download at http://bit.ly/reed-messy-data

Packages to install

What the tidyverse contains

Why am I here…?

Basics of Importing

Downloading the files

Demonstration

Practice

Tidying

The Life Expectancy by year data

Using the tidyr package

Doing the “tidying”/reshaping

Tidying

World Health Organization TB data (Stata DTA)

WHO ugly…

First step

Data dictionary saves us some…

A fix using stringr

Pulling apart variables

One more pull apart

Final cleaning

The power of the %>% (pipe)

BONUS

A Gapminder tidy dataset read in from a Google Sheet!

More practice

Additional resources

Thanks!

Importing and Tidying Data using
the `tidyverse` in R

Chester Ismay
Twitter: @old_man_chester
email: cismay@reed.edu

Tuesday, November 1, 2016

Slides available at http://bit.ly/reed-imp-tidy
Messy files to download at http://bit.ly/reed-messy-data

What the `tidyverse` contains

Using the `tidyr` package

A fix using `stringr`

The power of the `%>%` (pipe)