Data Wrangling in R

October 1, 2015

Agenda

Introduction
Data Wrangling
- Why?
- Data & packages
- tidyr
- dplyr
Additional Resources

Introduction

Air Force
- HQ Air Force Materiel Command, Studies and Analyses Division
- Operations Research Analyst
Air Force Institute of Technology
- Ph.D. in 2015
- Adjunct Professor
Social Media
- Blog
- Twitter
- LinkedIn
- GitHub

GR&A

R packages used…

install.packages("package")
library(package)
tidyr

dplyr

data used…

install.packages("EDAWR")
library(EDAWR)

Data sets: cases, storms, tb, iris, a, b

%>% operator…

learn it, love it, leverage it

magrittr → Stefan Milton Bache
not required but extremely convenient

filter(data, variable == numeric_value)
or
data %>% filter(variable == numeric_value)

%>% operator…

learn it, love it, leverage it

arrange(
        summarize(
                filter(data, variable == "numeric_value"),
                Total = sum(variable)
        ),
        desc(Total)
)

a <- filter(data, variable == "numeric_value")
b <- summarise(a, Total = sum(variable))
c <- arrange(b, desc(Total))

data %>%
        filter(variable == "value") %>%
        summarise(Total = sum(variable)) %>%
        arrange(desc(Total))

Same results but the %>% operator is more efficient and legible

Data Wrangling

because…

Jenny Bryan

because…

Jenny Bryan

because…

Jenny Bryan

because…

Jenny Bryan

because…

Jenny Bryan

because…

Jenny Bryan

cf. Wickham, 2014 and Dasu and Johnson, 2003

because…

Jenny Bryan

cf. Wickham, 2014 and Dasu and Johnson, 2003

tidyr

a package that reshapes the layout of dataframes

tidyr

Primary functions

gather(): transforms data from wide to long

spread(): transforms data from long to wide

separate(): splits a single column into multiple columns

unite(): combines multiple columns into a single column

tidyr

`gather()`

Transform data from wide to long:

Function:       gather(data, key, value, ..., na.rm = FALSE, convert = FALSE)
Same as:        data %>% gather(key, value, ..., na.rm = FALSE, convert = FALSE)

Arguments:
        data:           data frame
        key:            column name representing new variable
        value:          column name representing variable values
        ...:            names of columns to gather (or not gather)
        na.rm:          option to remove observations with missing values (represented by NAs)
        convert:        if TRUE will automatically convert values to logical, integer, numeric, complex or 
                        factor as appropriate

tidyr

`gather()`

tidyr

`gather()`

tidyr

`gather()`

tidyr

`gather()`

tidyr

`gather()`

tidyr

`gather()`

tidyr

`gather()`

Code alternatives:

These all produce the same results:
        cases %>% gather(Year, n, `2011`:`2013`)
        cases %>% gather(Year, n, `2011`, `2012`, `2013`)
        cases %>% gather(Year, n, 2:4)
        cases %>% gather(Year, n, -Country)

Also note that if you do not supply arguments for na.rm or convert values then the defaults are used

tidyr

`spread()`

Transform data from long to wide:

Function:       spread(data, key, value, fill = NA, convert = FALSE)
Same as:        data %>% spread(key, value, fill = NA, convert = FALSE)

Arguments:
        data:           data frame
        key:            column values to convert to multiple columns
        value:          single column values to convert to multiple columns' values 
        fill:           If there isn't a value for every combination of the other variables and the key 
                        column, this value will be substituted
        convert:        if TRUE will automatically convert values to logical, integer, numeric, complex or 
                        factor as appropriate

tidyr

`spread()`

tidyr

`spread()`

tidyr

`spread()`

tidyr

`spread()`

tidyr

`spread()`

tidyr

`separate()`

Splits a single variable (column) into multiple variables (columns):

Function:       separate(data, col, into, sep = " ", remove = TRUE, convert = FALSE)
Same as:        data %>% separate(col, into, sep = " ", remove = TRUE, convert = FALSE)

Arguments:
        data:           data frame
        col:            column name representing current variable
        into:           names of variables representing new variables
        sep:            how to separate current variable (char, num, or symbol)
        remove:         if TRUE, remove input column from output data frame
        convert:        if TRUE will automatically convert values to logical, integer, numeric, complex or 
                        factor as appropriate

tidyr

`separate()`

tidyr

`separate()`

tidyr

`separate()`

tidyr

`separate()`

tidyr

`separate()`

tidyr

`separate()`

tidyr

`separate()`

Code alternatives:

These produce the same results:
        storms %>% separate(date, c("year", "month", "day"))
        storms %>% separate(date, c("year", "month", "day"), sep = "-")
        
By default, if no separator is specified, will separate by any regular expression that matches any sequence 
of non-alphanumeric values

tidyr

`unite()`

Combines multiple variables (columns) into a single variable (column):

Function:       unite(data, col, ..., sep = " ", remove = TRUE)
Same as:        data %>% unite(col, ..., sep = " ", remove = TRUE)

Arguments:
        data:           data frame
        col:            column name of new "merged" column
        ...:            names of columns to merge
        sep:            separator to use between merged values
        remove:         if TRUE, remove input column from output data frame

tidyr

`unite()`

tidyr

`unite()`

tidyr

`unite()`

tidyr

`unite()`

tidyr

`unite()`

tidyr

`unite()`

tidyr

`unite()`

Code alternatives:

These produce the same results:
        storms %>% unite(date, year, month, day, sep = "_")
        storms %>% unite(date, year, month, day)

If no spearator is identified, "_" will automatically be used

dplyr

a package that transforms data

dplyr

Primary functions

select(): select variables of concern

filter(): filter values based on conditions

group_by(): group data by categorical levels

summarise(): change unit of analysis

dplyr

Primary functions

arrange(): order the data

mutate(): create new variables

join(): combine separate data sets

dplyr

`select()`

Select variables of concern:

Function:       select(data, ...)
Same as:        data %>% select(...)

Arguments:
        data:           data frame
        ...:            call variables by name or by function

Special functions:
        starts_with(x, ignore.case = TRUE): names starts with x
        ends_with(x, ignore.case = TRUE):   names ends in x
        contains(x, ignore.case = TRUE):    selects all variables whose name contains x
        matches(x, ignore.case = TRUE):     selects all variables whose name matches the regular expression x

dplyr

`select()`

dplyr

`select()`

dplyr

`select()`

dplyr

`select()`

dplyr

`select()`

Code alternatives:

These produce the same results:
        storms %>% select(wind, pressure, date)
        storms %>% select(wind:pressure)
        storms %>% select(-storm)

dplyr

`filter()`

filter values based on conditions:

Function:       filter(data, ...)
Same as:        data %>% filter(...)

Arguments:
        data:           data frame
        ...:            conditions to be met

dplyr

`filter()`

dplyr

`filter()`

Can filter for multiple conditions

dplyr

`filter()`

We can apply multiple logic rules in the filter() function such as:

<	Less than	!=	Not equal to
>	Greater than	%in%	Group membership
==	Equal to	is.na	is NA
!=	Not equal to	!is.na	is not NA
<=	Less than or equal to	&,\|,!	Boolean operators
>=	Greater than or equal to

dplyr

`filter()`

Additional subsetting functions:

	storms %>% `distinct()`	Remove duplicate rows
	storms %>% `sample_frac(0.5, replace = TRUE)`	Randomly select fraction of rows
	storms %>% `sample_n(10, replace = TRUE)`	Randomly select n rows
	storms %>% `slice(3:5)`	Select rows by position
	storms %>% `top_n(storms, 2, date)`	Select and order top n entries

dplyr

`group_by()`

Group data by categorical levels:

Function:       group_by(data, ...)
Same as:        data %>% group_by(...)

Arguments:
        data:           data frame
        ...:            variables to group_by

*Use ungroup(x) to remove groups

dplyr

`group_by()`

dplyr

`group_by()`

dplyr

`group_by()`

dplyr

`group_by()`

dplyr

`group_by()`

group_by() function is a silent function in which no observable manipulation of the data occurs
Alters underlying attributes → note Groups: city [3] in output below
Becomes useful when combined with other functions

pollution %>% group_by(city)

## Source: local data frame [6 x 3]
## Groups: city [3]
## 
##       city  size amount
##      (chr) (chr)  (dbl)
## 1 New York large     23
## 2 New York small     14
## 3   London large     22
## 4   London small     16
## 5  Beijing large    121
## 6  Beijing small     56

dplyr

`summarise()`

Perform summary statistics on variables:

Function:       summarise(data, ...)
Same as:        data %>% summarise(...)

Arguments:
        data:           data frame
        ...:            Name-value pairs of summary functions like min(), mean(), max() etc.

*Developer is from New Zealand...can use "summarise(x)" or "summarize(x)"

dplyr

`summarise()`

dplyr

`summarise()`

dplyr

`summarise()`

dplyr

`summarise()`

dplyr

`summarise()`

dplyr

`summarise()`

dplyr

`summarise()`

dplyr

`summarise()`

Summary functions:

`first()`	First value of a vector	`min()`	Min value in vector
`last()`	Last value of a vector	`max()`	Max value in vector
`nth()`	Nth value of a vector	`mean()`	Mean value of vector
`n()`	# of values in a vector	`median()`	Median value of vector
`n_distinct()`	# of distinct values	`var()`	Variance of vector
`IQR()`	IQR of a vector	`sd()`	St. dev. of vector

dplyr

`arrange()`

Order variable values:

Function:       arrange(data, ...)
Same as:        data %>% arrange(...)

Arguments:
        data:           data frame
        ...:            Variable(s) to order

*use desc(x) to sort variable(s) in descending order

dplyr

`arrange()`

dplyr

`arrange()`

Use desc(x) to sort variable(s) in descending order

dplyr

`mutate()`

Create new variables:

Function:       
                mutate(data, ...)
Same as:        data %>% mutate(...)                

Arguments:
        data:           data frame
        ...:            Expression(s)

dplyr

`mutate()`

dplyr

`mutate()`

dplyr

`mutate()`

dplyr

`mutate()`

dplyr

`mutate()`

Window functions that can be applied within mutate_each(funs()):

`lead()`	`ntile()`	`cumsum()`
`lag()`	`between()`	`cummax()`
`dense_rank()`	`cume_dist()`	`cummin()`
`min_rank()`	`cumall()`	`cumprod()`
`percent_rank()`	`cumany()`	`pmax()`
`row_number()`	`cumean()`	`pmin()`

Can also create own functions to pass through mutate_each(funs())
Be aware that some functions will cause errors when applied to non-numeric variables

dplyr

`join()`

Combining data sets:

Functions:       
                left_join(x, y, by = NULL)
                right_join(x, y, by = NULL)
                inner_join(x, y, by = NULL)
                full_join(x, y, by = NULL)
                semi_join(x, y, by = NULL)
                anti_join(x, y, by = NULL)

Arguments:
        x, y:          data frames to join
        by:            a character vector of variables to join by. If NULL, the default, join will do a 
                       natural join, using all variables with common names across the two tables.

dplyr

`join()`

dplyr

`join()`

Join matching rows from b to a

dplyr

`join()`

Join matching rows from b to a

dplyr

`join()`

Join matching rows from b to a

dplyr

`join()`

Join matching rows from a to b

dplyr

`join()`

Join data. Retain only matching rows in both sets.

dplyr

`join()`

Join data. Retain all values, all rows.

dplyr

`join()`

Keep all rows in a that have a match in b.

dplyr

`join()`

Keep all rows in a that do not have a match in b.

dplyr

`join()`

Additional dplyr functions for merging data sets:

	`intersect(y, z)`	Rows that appear in both y and z
	`union(y, z)`	Rows that appear in either or both y and z
	`setdiff(y, z)`	Rows that appear in y but not z
	`bind_rows(y, z)`	Append z to y as new rows
	`bind_cols(y, z)`	Append z to y as new columns

Use extreme caution when applying bind_ functions.

Wrap up

Additional Resources

This tutorial simply touches on the basics that these two packages can do. There are several other resources you can check out to learn more. In addition, much of what I have learned and, therefore, much of the content in this tutorial is simply a modified regurgitation of the wonderful resources provided by R Studio, Hadley Wickham, and Garrett Grolemund.

R Studio's Data wrangling with R and RStudio webinar
R Studio's Data wrangling GitHub repository
R Studio's Data wrangling cheat sheet
Hadley Wickham’s dplyr tutorial at useR! 2014, Part 1
Hadley Wickham’s dplyr tutorial at useR! 2014, Part 2
Hadley Wickham's paper on Tidy Data