November 14, 2016

Welcome to R-Ladies RTP!

Welcome to TransLoc!

Outline

  • Data sources
  • Controlling the source
  • Data validation and cleanup
  • Understanding read_csv
  • R and databases
  • Problem solving

Materials

(Some) Data Sources

Source Package
flat files (csv, txt) readr
databases DBI/dplyr/DB-specifics
JSON/XML jsonlite/xml2
APIs httr
Web scraping rvest
Domain-specific from ROpenSci/Bioconductor
Stats formats (SAS, SPSS, Stata) haven
Matlab R.matlab
Excel readxl/jailbreakr
Google Sheets googlesheets
PDFs tabulizer (Tabula)

Controlling the Source

People wrangling over data wrangling!

Suggestions

  • HAVE REQUIREMENTS
  • naming conventions
  • require tidy format
  • data validation tools (spreadsheets)
  • data testing upon import

Professionalizing Data Import

Meet the Data Engineer…

Data Validation and Cleanup

NEVER ASSUME

Suggestions

  • Inspect all variable types
  • Inspect contents - variable by variable and cross-tabs (table, glimpse, etc.)
  • Check number of rows
  • Look for missing values
  • Build checks into the code
  • If the data can change, consider testing! (testthat)

R and Databases

readr (flat files)

Loading readr

library(readr)
library(tidyverse)
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag():    dplyr, stats

Digging into read_csv

read_csv

flat file > tibble

But what about read.csv?

  • faster
  • better defaults
  • reproducible across OS

read_csv basics

In the ideal case:

library(readr) # or library(tidyverse)
my_df <- read_csv('my_file.csv')

But…

What read_csv does

  1. Makes a rectangle of strings
  2. Determines column types
  3. Parses the columns into types

Guessing

read_csv tries to guess what the column types are using the first 1000 rows

If it has trouble, it will tell you

Parsing columns (vectors)

  • parse_logical() (simple)
  • parse_integer() (simple)
  • parse_double() (strict)
  • parse_number() (flexible)
  • parse_character() (encodings)
  • parse_factor() (hi statisticians)
  • parse_datetime() (ugh)
  • parse_date() (ugh)
  • parse_time() (ugh)

Let's try it!

mtcars_df <- read_csv(readr_example('mtcars.csv'))
## Parsed with column specification:
## cols(
##   mpg = col_double(),
##   cyl = col_integer(),
##   disp = col_double(),
##   hp = col_integer(),
##   drat = col_double(),
##   wt = col_double(),
##   qsec = col_double(),
##   vs = col_integer(),
##   am = col_integer(),
##   gear = col_integer(),
##   carb = col_integer()
## )

col_types

Specify column types:

  • with a string of single characters: col_types = "dc__d"
  • with cols(), set by column name

  • Or, read_csv guesses

Best practice

mtcars_df <- read_csv(readr_example('mtcars.csv'),
                      col_types = cols(
                        mpg = col_double(),
                        cyl = col_integer(),
                        disp = col_double(),
                        hp = col_integer(),
                        drat = col_double(),
                        wt = col_double(),
                        qsec = col_double(),
                        vs = col_integer(),
                        am = col_integer(),
                        gear = col_integer(),
                        carb = col_integer()
                      ))

Let's try something tougher

challenge <- read_csv(readr_example("challenge.csv"))
## Parsed with column specification:
## cols(
##   x = col_integer(),
##   y = col_character()
## )
## Warning: 1000 parsing failures.
##  row col               expected             actual
## 1001   x no trailing characters .23837975086644292
## 1002   x no trailing characters .41167997173033655
## 1003   x no trailing characters .7460716762579978 
## 1004   x no trailing characters .723450553836301  
## 1005   x no trailing characters .614524137461558  
## .... ... ...................... ..................
## See problems(...) for more details.

Using problems()

problems(challenge)
## # A tibble: 1,000 × 4
##      row   col               expected             actual
##    <int> <chr>                  <chr>              <chr>
## 1   1001     x no trailing characters .23837975086644292
## 2   1002     x no trailing characters .41167997173033655
## 3   1003     x no trailing characters  .7460716762579978
## 4   1004     x no trailing characters   .723450553836301
## 5   1005     x no trailing characters   .614524137461558
## 6   1006     x no trailing characters   .473980569280684
## 7   1007     x no trailing characters  .5784610391128808
## 8   1008     x no trailing characters  .2415937229525298
## 9   1009     x no trailing characters .11437866208143532
## 10  1010     x no trailing characters  .2983446326106787
## # ... with 990 more rows

Work through problems one column at a time…

challenge <- read_csv(
  readr_example("challenge.csv"), 
  col_types = cols(
    x = col_integer(),
    y = col_character()
  )
)
## Warning: 1000 parsing failures.
##  row col               expected             actual
## 1001   x no trailing characters .23837975086644292
## 1002   x no trailing characters .41167997173033655
## 1003   x no trailing characters .7460716762579978 
## 1004   x no trailing characters .723450553836301  
## 1005   x no trailing characters .614524137461558  
## .... ... ...................... ..................
## See problems(...) for more details.

Manually change col_types

challenge <- read_csv(
  readr_example("challenge.csv"), 
  col_types = cols(
    x = col_double(),
    y = col_character()
  )
)

Now how are we doing?

head(challenge)
## # A tibble: 6 × 2
##       x     y
##   <dbl> <chr>
## 1   404  <NA>
## 2  4172  <NA>
## 3  3004  <NA>
## 4   787  <NA>
## 5    37  <NA>
## 6  2332  <NA>

Hmmmm….

Now how are we doing?

tail(challenge)
## # A tibble: 6 × 2
##           x          y
##       <dbl>      <chr>
## 1 0.8052743 2019-11-21
## 2 0.1635163 2018-03-29
## 3 0.4719390 2014-08-04
## 4 0.7183186 2015-08-16
## 5 0.2698786 2020-02-04
## 6 0.6082372 2019-01-06

Set date type

challenge <- read_csv(
  readr_example("challenge.csv"), 
  col_types = cols(
    x = col_double(),
    y = col_date()
  )
)

Success :)

tail(challenge)
## # A tibble: 6 × 2
##           x          y
##       <dbl>     <date>
## 1 0.8052743 2019-11-21
## 2 0.1635163 2018-03-29
## 3 0.4719390 2014-08-04
## 4 0.7183186 2015-08-16
## 5 0.2698786 2020-02-04
## 6 0.6082372 2019-01-06

Other strategies for solving problems

Increase guess information:

challenge2 <- read_csv(readr_example("challenge.csv"), guess_max = 1001)

Read everything in as character and work from there:

challenge2 <- read_csv(readr_example("challenge.csv"), 
                          col_types = cols(.default = col_character())
                      ) 

read_csv Arguments

read_csv(file, 
         col_names = TRUE, 
         col_types = NULL,
         locale = default_locale(), 
         na = c("", "NA"), 
         quoted_na = TRUE,
         comment = "", 
         trim_ws = TRUE, 
         skip = 0, 
         n_max = Inf,
         guess_max = min(1000, n_max), 
         progress = interactive()
         )

Problem Solving

What are your data import challenges?

Vote!

2017 meetup topics?

Format suggestions?

Come present!!

Next Meetup

  • Tuesday, December 13th
  • Same location - TransLoc
  • RMarkdown!

Networking

Serena's 5311 S Miami Blvd