Data Wrangling with tidyverse

Jesse Yang
February 23, 2017

The Tidy Universe (tidyverse)

The tidyverse is a collection of R packages that share common philosophies and are designed to work together.

Core packages include
- ggplot2*
- tibble
- tidyr*
- dplyr*
- readr*
- purrr

library(tidyverse)  # load above all

readr

base R

dat <- read.csv("AirlineDelays.txt")

readr

dat <- read_csv("AirlineDelays.txt")

perks of readr
- 10x faster
- Progress bar
- forget about row names
- stringsAsFactors = FALSE
- consistent parameter names

dplyr

Base R

subset(dat, CARRIER == "AA",
       select = c("DEP_DELAY", "ARR_DELAY"))

dat[dat$CARRIER == "AA",
    c("DEP_DELAY", "ARR_DELAY")]

dplyr

filter(dat, CARRIER == "AA") %>%
  select(DEP_DELAY, ARR_DELAY)

which is equivalent to

select(filter(dat, CARRIER == "AA"),
       DEP_DELAY, ARR_DELAY)

dplyr - contigency table

Base R

x <- table(dat$ORIGIN, dat$DEST)
x <- as.data.frame(x)
colnames(x) <-
  c("ORIGIN", "DEST", "n")
x[order(x$n, decreasing = TRUE), ]

dplyr

dat %>%
  count(ORIGIN, DEST, order = TRUE)

to count at each combination of factor levels

dplyr - contigency table

dat %>%
  count(ORIGIN, DEST, sort = TRUE) %>%
  head() %>% knitr::kable()

ORIGIN	DEST	n
SFO	LAX	1109
LAX	SFO	1092
LAX	LAS	1033
LAS	LAX	1032
JFK	LAX	921
LAX	JFK	919

dplyr - aggregate vs summarise

Base R

aggregate(DEP_DELAY ~ CARRIER, dat, mean)

dplyr

dat %>%
  group_by(CARRIER) %>%
  summarise(
    dep = mean(DEP_DELAY, na.rm = TRUE))

dplyr is also much faster

dplyr - get samples

base R

dat[sample(1:nrow(dat), 10), ]

dplyr

sample_n(dat, 10)

dplyr - arrange and select

arrange orders of rows by the value of columns;
select columns

dat %>%
  select(CARRIER, ORIGIN, DEST, ARR_DELAY) %>%
  # `desc` means descending order
  arrange(desc(ARR_DELAY), CARRIER) %>%
  head(4)

# A tibble: 4 × 4
  CARRIER ORIGIN  DEST ARR_DELAY
    <chr>  <chr> <chr>     <int>
1      AA    BNA   DFW      1530
2      AA    ONT   DFW      1504
3      AA    CMH   LAX      1473
4      AA    IAD   DFW      1448

dplyr - a comprehensive example

Top 5 Most Unpunctual Routes

dat %>%
  group_by(ORIGIN, DEST) %>%
  select(ARR_DELAY, DEP_DELAY) %>%
  summarise(
    arr = mean(ARR_DELAY, na.rm = TRUE),
    dep = mean(DEP_DELAY, na.rm = TRUE)
  ) %>%
  filter(arr > dep) %>%
  arrange(desc(arr), desc(dep), ORIGIN) %>%
  head(5)

(that didn't catch up en route)

Source: local data frame [5 x 4]
Groups: ORIGIN [5]

  ORIGIN  DEST   arr      dep
   <chr> <chr> <dbl>    <dbl>
1    ORD   PNS 164.0 150.3333
2    CAK   DTW 150.0 146.0000
3    IND   LGA 142.5  59.5000
4    PNS   ORD 140.5 132.2500
5    HDN   EWR 132.0 126.0000

stringr - everything you need for strings

Basic Operations
- str_length | nchar
- str_c | paste
- str_sub | substr
- str_split | strsplit
- str_trim
- str_pad
Matching and Replacing
- str_detect | grep
- str_subset
  grep(.., value = T)
- str_replace | sub
- str_replace_all
  gsub
- str_locate
- str_match
- str_match_all
- str_count

stringr - str_split

x <- c("Jan to Mar", "Jan 1 to Sep 30", "June",
       "Aug 1, 2016 to Sep 1, 2016")
tmp <- str_split(x, " to ", simplify = TRUE)
tmp

     [,1]          [,2]         
[1,] "Jan"         "Mar"        
[2,] "Jan 1"       "Sep 30"     
[3,] "June"        ""           
[4,] "Aug 1, 2016" "Sep 1, 2016"

begin.date <- tmp[, 1]
end.date <- tmp[, 2]

stringr - match and complete

CompleteDate <- function(dates, day = 1, year = 2016) {
  idx <- -which(str_detect(dates, "\\b[0-9]{1,2}\\b"))
  dates[idx] <- sprintf("%s %s, %s", dates[idx], day, year)
  idx <- -which(str_detect(dates, "\\b[0-9]{4}\\s*$"))
  dates[idx] <- str_c(dates[idx], year, sep = ", ")
  dates
}
begin.date <- CompleteDate(begin.date)
end.date <- CompleteDate(end.date, day = 28)

begin.date

[1] "Jan 1, 2016"  "Jan 1, 2016"  "June 1, 2016" "Aug 1, 2016"

end.date

[1] "Mar 28, 2016" "Sep 30, 2016" " 28, 2016"    "Sep 1, 2016"

stringr

library(lubridate)

data.frame(begin = mdy(begin.date), end = mdy(end.date)) %>%
  mutate(duration = end - begin)

       begin        end duration
1 2016-01-01 2016-03-28  87 days
2 2016-01-01 2016-09-30 273 days
3 2016-06-01       <NA>  NA days
4 2016-08-01 2016-09-01  31 days

More about Regular Expressions

Seek character or character classes from left to right
Metacharacters have special meanings
^ $ * + . ? [ ] { } ( ) | \
Escape sequences: \' \" \n \t \b
Quantifiers: ? * + {n} {n,} {n,m}
Greedy & 'Ungreedy' mode: add ? after quantifiers
Backreferences: \1 to \9

Quantifier

x <- c("ababa", "aabb", "bbaa")
str_match(x, "(a*)(b*)")

     [,1]   [,2] [,3]
[1,] "ab"   "a"  "b" 
[2,] "aabb" "aa" "bb"
[3,] "bb"   ""   "bb"

str_match(x, "(a+)(b+)")

     [,1]   [,2] [,3]
[1,] "ab"   "a"  "b" 
[2,] "aabb" "aa" "bb"
[3,] NA     NA   NA

Quantifier (Continue)

x <- c("ababa", "aabb", "bbaa", "aaab")
str_match(x, "a{2}")

     [,1]
[1,] NA  
[2,] "aa"
[3,] "aa"
[4,] "aa"

str_match(x, "a{1,2}")

     [,1]
[1,] "a" 
[2,] "aa"
[3,] "aa"
[4,] "aa"

Greedy vs Ungreedy

x <- "Hello (2010)  2010"
str_match(x, "(.*)([0-9]{4})")

     [,1]                 [,2]             [,3]  
[1,] "Hello (2010)  2010" "Hello (2010)  " "2010"

str_match(x, "(.*?)([0-9]{4})")

     [,1]          [,2]      [,3]  
[1,] "Hello (2010" "Hello (" "2010"

str_match(x, "(.*?) +(?:\\([0-9]{4}\\))? +([0-9]{4})")

     [,1]                 [,2]    [,3]  
[1,] "Hello (2010)  2010" "Hello" "2010"

Match and Reshape

pattern <- "^ *([A-Z][a-z]+) +(\\d{1,2}), +(\\d{4})"
dates <- c("Jan 1, 2017", " Sep 2, 2011", "Dec 1, 95")
matched <- str_match(dates, pattern)
matched

     [,1]           [,2]  [,3] [,4]  
[1,] "Jan 1, 2017"  "Jan" "1"  "2017"
[2,] " Sep 2, 2011" "Sep" "2"  "2011"
[3,] NA             NA    NA   NA

matched %>% .[, 2:4] %>% as_tibble() %>%
  rename(month = V1, day = V2, year = V3)

# A tibble: 3 × 3
  month   day  year
  <chr> <chr> <chr>
1   Jan     1  2017
2   Sep     2  2011
3  <NA>  <NA>  <NA>

Data Wrangling with tidyverse

The Tidy Universe (tidyverse)

readr

dplyr

Base R

dplyr

dplyr - contigency table

Base R

dplyr

dplyr - contigency table

dplyr - aggregate vs summarise

Base R

dplyr

dplyr - get samples

base R

dplyr

dplyr - arrange and select

dplyr - a comprehensive example

Top 5 Most Unpunctual Routes

(that didn't catch up en route)

stringr - everything you need for strings

stringr - str_split

stringr - match and complete

stringr

More about Regular Expressions

Quantifier

Quantifier (Continue)

Greedy vs Ungreedy

Match and Reshape

Thank you!