November 14, 2016
| Source | Package |
|---|---|
| flat files (csv, txt) | readr |
| databases | DBI/dplyr/DB-specifics |
| JSON/XML | jsonlite/xml2 |
| APIs | httr |
| Web scraping | rvest |
| Domain-specific | from ROpenSci/Bioconductor |
| Stats formats (SAS, SPSS, Stata) | haven |
| Matlab | R.matlab |
| Excel | readxl/jailbreakr |
| Google Sheets | googlesheets |
| PDFs | tabulizer (Tabula) |
People wrangling over data wrangling!
Meet the Data Engineer…
NEVER ASSUME
R can connect to many databases - and with dplyr, no SQL!!
https://cran.r-project.org/web/packages/dplyr/vignettes/databases.html
readr handles much more than csv
readr vignettes
https://github.com/tidyverse/readr/blob/master/vignettes/readr.Rmd
https://cran.r-project.org/web/packages/readr/vignettes/column-types.html
library(readr) library(tidyverse)
## Loading tidyverse: ggplot2 ## Loading tidyverse: tibble ## Loading tidyverse: tidyr ## Loading tidyverse: purrr ## Loading tidyverse: dplyr
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats ## lag(): dplyr, stats
R for Data Science
flat file > tibble
In the ideal case:
library(readr) # or library(tidyverse)
my_df <- read_csv('my_file.csv')
But…
read_csv tries to guess what the column types are using the first 1000 rows
If it has trouble, it will tell you
parse_logical() (simple)parse_integer() (simple)parse_double() (strict)parse_number() (flexible)parse_character() (encodings)parse_factor() (hi statisticians)parse_datetime() (ugh)parse_date() (ugh)parse_time() (ugh)mtcars_df <- read_csv(readr_example('mtcars.csv'))
## Parsed with column specification: ## cols( ## mpg = col_double(), ## cyl = col_integer(), ## disp = col_double(), ## hp = col_integer(), ## drat = col_double(), ## wt = col_double(), ## qsec = col_double(), ## vs = col_integer(), ## am = col_integer(), ## gear = col_integer(), ## carb = col_integer() ## )
Specify column types:
col_types = "dc__d"with cols(), set by column name
Or, read_csv guesses
mtcars_df <- read_csv(readr_example('mtcars.csv'),
col_types = cols(
mpg = col_double(),
cyl = col_integer(),
disp = col_double(),
hp = col_integer(),
drat = col_double(),
wt = col_double(),
qsec = col_double(),
vs = col_integer(),
am = col_integer(),
gear = col_integer(),
carb = col_integer()
))
challenge <- read_csv(readr_example("challenge.csv"))
## Parsed with column specification: ## cols( ## x = col_integer(), ## y = col_character() ## )
## Warning: 1000 parsing failures. ## row col expected actual ## 1001 x no trailing characters .23837975086644292 ## 1002 x no trailing characters .41167997173033655 ## 1003 x no trailing characters .7460716762579978 ## 1004 x no trailing characters .723450553836301 ## 1005 x no trailing characters .614524137461558 ## .... ... ...................... .................. ## See problems(...) for more details.
problems(challenge)
## # A tibble: 1,000 × 4 ## row col expected actual ## <int> <chr> <chr> <chr> ## 1 1001 x no trailing characters .23837975086644292 ## 2 1002 x no trailing characters .41167997173033655 ## 3 1003 x no trailing characters .7460716762579978 ## 4 1004 x no trailing characters .723450553836301 ## 5 1005 x no trailing characters .614524137461558 ## 6 1006 x no trailing characters .473980569280684 ## 7 1007 x no trailing characters .5784610391128808 ## 8 1008 x no trailing characters .2415937229525298 ## 9 1009 x no trailing characters .11437866208143532 ## 10 1010 x no trailing characters .2983446326106787 ## # ... with 990 more rows
challenge <- read_csv(
readr_example("challenge.csv"),
col_types = cols(
x = col_integer(),
y = col_character()
)
)
## Warning: 1000 parsing failures. ## row col expected actual ## 1001 x no trailing characters .23837975086644292 ## 1002 x no trailing characters .41167997173033655 ## 1003 x no trailing characters .7460716762579978 ## 1004 x no trailing characters .723450553836301 ## 1005 x no trailing characters .614524137461558 ## .... ... ...................... .................. ## See problems(...) for more details.
challenge <- read_csv(
readr_example("challenge.csv"),
col_types = cols(
x = col_double(),
y = col_character()
)
)
head(challenge)
## # A tibble: 6 × 2 ## x y ## <dbl> <chr> ## 1 404 <NA> ## 2 4172 <NA> ## 3 3004 <NA> ## 4 787 <NA> ## 5 37 <NA> ## 6 2332 <NA>
Hmmmm….
tail(challenge)
## # A tibble: 6 × 2 ## x y ## <dbl> <chr> ## 1 0.8052743 2019-11-21 ## 2 0.1635163 2018-03-29 ## 3 0.4719390 2014-08-04 ## 4 0.7183186 2015-08-16 ## 5 0.2698786 2020-02-04 ## 6 0.6082372 2019-01-06
challenge <- read_csv(
readr_example("challenge.csv"),
col_types = cols(
x = col_double(),
y = col_date()
)
)
tail(challenge)
## # A tibble: 6 × 2 ## x y ## <dbl> <date> ## 1 0.8052743 2019-11-21 ## 2 0.1635163 2018-03-29 ## 3 0.4719390 2014-08-04 ## 4 0.7183186 2015-08-16 ## 5 0.2698786 2020-02-04 ## 6 0.6082372 2019-01-06
Increase guess information:
challenge2 <- read_csv(readr_example("challenge.csv"), guess_max = 1001)
Read everything in as character and work from there:
challenge2 <- read_csv(readr_example("challenge.csv"),
col_types = cols(.default = col_character())
)
read_csv(file,
col_names = TRUE,
col_types = NULL,
locale = default_locale(),
na = c("", "NA"),
quoted_na = TRUE,
comment = "",
trim_ws = TRUE,
skip = 0,
n_max = Inf,
guess_max = min(1000, n_max),
progress = interactive()
)
What are your data import challenges?
2017 meetup topics?
Format suggestions?
Come present!!
Serena's 5311 S Miami Blvd