R-Ladies RTP Getting Data into R workshop

November 14, 2016

Welcome to R-Ladies RTP!

Welcome to TransLoc!

Outline

Data sources
Controlling the source
Data validation and cleanup
Understanding read_csv
R and databases
Problem solving

Materials

All source code at https://github.com/rladies/rtp-data-import-11-13-16

(Some) Data Sources

Source	Package
flat files (csv, txt)	readr
databases	DBI/dplyr/DB-specifics
JSON/XML	jsonlite/xml2
APIs	httr
Web scraping	rvest
Domain-specific	from ROpenSci/Bioconductor
Stats formats (SAS, SPSS, Stata)	haven
Matlab	R.matlab
Excel	readxl/jailbreakr
Google Sheets	googlesheets
PDFs	tabulizer (Tabula)

Controlling the Source

People wrangling over data wrangling!

Suggestions

HAVE REQUIREMENTS
naming conventions
require tidy format
data validation tools (spreadsheets)
data testing upon import

Professionalizing Data Import

Meet the Data Engineer…

Data Validation and Cleanup

NEVER ASSUME

Suggestions

Inspect all variable types
Inspect contents - variable by variable and cross-tabs (table, glimpse, etc.)
Check number of rows
Look for missing values
Build checks into the code
If the data can change, consider testing! (testthat)

R and Databases

R can connect to many databases - and with dplyr, no SQL!!

https://cran.r-project.org/web/packages/dplyr/vignettes/databases.html

readr (flat files)

readr handles much more than csv

readr vignettes

https://github.com/tidyverse/readr/blob/master/vignettes/readr.Rmd

https://cran.r-project.org/web/packages/readr/vignettes/column-types.html

Loading readr

library(readr)
library(tidyverse)

## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr

## Conflicts with tidy packages ----------------------------------------------

## filter(): dplyr, stats
## lag():    dplyr, stats

Digging into read_csv

R for Data Science

http://r4ds.had.co.nz/data-import.html

read_csv

flat file > tibble

But what about read.csv?

faster
better defaults
reproducible across OS

read_csv basics

In the ideal case:

library(readr) # or library(tidyverse)
my_df <- read_csv('my_file.csv')

But…

What read_csv does

Makes a rectangle of strings
Determines column types
Parses the columns into types

Guessing

read_csv tries to guess what the column types are using the first 1000 rows

If it has trouble, it will tell you

Parsing columns (vectors)

parse_logical() (simple)
parse_integer() (simple)
parse_double() (strict)
parse_number() (flexible)
parse_character() (encodings)
parse_factor() (hi statisticians)
parse_datetime() (ugh)
parse_date() (ugh)
parse_time() (ugh)

Let's try it!

mtcars_df <- read_csv(readr_example('mtcars.csv'))

## Parsed with column specification:
## cols(
##   mpg = col_double(),
##   cyl = col_integer(),
##   disp = col_double(),
##   hp = col_integer(),
##   drat = col_double(),
##   wt = col_double(),
##   qsec = col_double(),
##   vs = col_integer(),
##   am = col_integer(),
##   gear = col_integer(),
##   carb = col_integer()
## )

col_types

Specify column types:

with a string of single characters: col_types = "dc__d"
with cols(), set by column name
Or, read_csv guesses

Best practice

mtcars_df <- read_csv(readr_example('mtcars.csv'),
                      col_types = cols(
                        mpg = col_double(),
                        cyl = col_integer(),
                        disp = col_double(),
                        hp = col_integer(),
                        drat = col_double(),
                        wt = col_double(),
                        qsec = col_double(),
                        vs = col_integer(),
                        am = col_integer(),
                        gear = col_integer(),
                        carb = col_integer()
                      ))

Let's try something tougher

challenge <- read_csv(readr_example("challenge.csv"))

## Parsed with column specification:
## cols(
##   x = col_integer(),
##   y = col_character()
## )

## Warning: 1000 parsing failures.
##  row col               expected             actual
## 1001   x no trailing characters .23837975086644292
## 1002   x no trailing characters .41167997173033655
## 1003   x no trailing characters .7460716762579978 
## 1004   x no trailing characters .723450553836301  
## 1005   x no trailing characters .614524137461558  
## .... ... ...................... ..................
## See problems(...) for more details.

Using problems()

problems(challenge)

## # A tibble: 1,000 × 4
##      row   col               expected             actual
##    <int> <chr>                  <chr>              <chr>
## 1   1001     x no trailing characters .23837975086644292
## 2   1002     x no trailing characters .41167997173033655
## 3   1003     x no trailing characters  .7460716762579978
## 4   1004     x no trailing characters   .723450553836301
## 5   1005     x no trailing characters   .614524137461558
## 6   1006     x no trailing characters   .473980569280684
## 7   1007     x no trailing characters  .5784610391128808
## 8   1008     x no trailing characters  .2415937229525298
## 9   1009     x no trailing characters .11437866208143532
## 10  1010     x no trailing characters  .2983446326106787
## # ... with 990 more rows

Work through problems one column at a time…

challenge <- read_csv(
  readr_example("challenge.csv"), 
  col_types = cols(
    x = col_integer(),
    y = col_character()
  )
)

## Warning: 1000 parsing failures.
##  row col               expected             actual
## 1001   x no trailing characters .23837975086644292
## 1002   x no trailing characters .41167997173033655
## 1003   x no trailing characters .7460716762579978 
## 1004   x no trailing characters .723450553836301  
## 1005   x no trailing characters .614524137461558  
## .... ... ...................... ..................
## See problems(...) for more details.

Manually change col_types

challenge <- read_csv(
  readr_example("challenge.csv"), 
  col_types = cols(
    x = col_double(),
    y = col_character()
  )
)

Now how are we doing?

head(challenge)

## # A tibble: 6 × 2
##       x     y
##   <dbl> <chr>
## 1   404  <NA>
## 2  4172  <NA>
## 3  3004  <NA>
## 4   787  <NA>
## 5    37  <NA>
## 6  2332  <NA>

Hmmmm….

Now how are we doing?

tail(challenge)

## # A tibble: 6 × 2
##           x          y
##       <dbl>      <chr>
## 1 0.8052743 2019-11-21
## 2 0.1635163 2018-03-29
## 3 0.4719390 2014-08-04
## 4 0.7183186 2015-08-16
## 5 0.2698786 2020-02-04
## 6 0.6082372 2019-01-06

Set date type

challenge <- read_csv(
  readr_example("challenge.csv"), 
  col_types = cols(
    x = col_double(),
    y = col_date()
  )
)

Success :)

tail(challenge)

## # A tibble: 6 × 2
##           x          y
##       <dbl>     <date>
## 1 0.8052743 2019-11-21
## 2 0.1635163 2018-03-29
## 3 0.4719390 2014-08-04
## 4 0.7183186 2015-08-16
## 5 0.2698786 2020-02-04
## 6 0.6082372 2019-01-06

read_csv Arguments

read_csv(file, 
         col_names = TRUE, 
         col_types = NULL,
         locale = default_locale(), 
         na = c("", "NA"), 
         quoted_na = TRUE,
         comment = "", 
         trim_ws = TRUE, 
         skip = 0, 
         n_max = Inf,
         guess_max = min(1000, n_max), 
         progress = interactive()
         )

Problem Solving

What are your data import challenges?

Vote!

2017 meetup topics?

Format suggestions?

Come present!!

Next Meetup

Tuesday, December 13th
Same location - TransLoc
RMarkdown!

Networking

Serena's 5311 S Miami Blvd

Welcome to R-Ladies RTP!

Welcome to TransLoc!

Outline

Materials

(Some) Data Sources

Controlling the Source

Suggestions

Professionalizing Data Import

Data Validation and Cleanup

Suggestions

R and Databases

readr (flat files)

Loading readr

Digging into read_csv

read_csv

But what about read.csv?

read_csv basics

What read_csv does

Guessing

Parsing columns (vectors)

Let's try it!

col_types

Best practice

Let's try something tougher

Using problems()

Work through problems one column at a time…

Manually change col_types

Now how are we doing?

Now how are we doing?

Set date type

Success :)

Other strategies for solving problems

read_csv Arguments

Problem Solving

Vote!

Next Meetup

Networking