October 1, 2015

Agenda

Introduction

  • Air Force
    • HQ Air Force Materiel Command, Studies and Analyses Division
    • Operations Research Analyst
  • Air Force Institute of Technology
    • Ph.D. in 2015
    • Adjunct Professor
  • Social Media

GR&A

R packages used…

install.packages("package")

library(package)

tidyr


dplyr

data used…



install.packages("EDAWR")

library(EDAWR)


Data sets: cases, storms, tb, iris, a, b

%>% operator…

learn it, love it, leverage it



filter(data, variable == numeric_value)

or

data %>% filter(variable == numeric_value)

%>% operator…

learn it, love it, leverage it

arrange(
        summarize(
                filter(data, variable == "numeric_value"),
                Total = sum(variable)
        ),
        desc(Total)
)
a <- filter(data, variable == "numeric_value")
b <- summarise(a, Total = sum(variable))
c <- arrange(b, desc(Total))
data %>%
        filter(variable == "value") %>%
        summarise(Total = sum(variable)) %>%
        arrange(desc(Total))
Same results but the %>% operator is more efficient and legible

Data Wrangling


because…

"Classroom data are like teddy bears and real data are like a grizzley bear with salmon blood dripping out its mouth."

Jenny Bryan                   

because…

"Classroom data are like teddy bears and real data are like a grizzley bear with salmon blood dripping out its mouth."

Jenny Bryan                   

because…

"Classroom data are like teddy bears and real data are like a grizzley bear with salmon blood dripping out its mouth."

Jenny Bryan                   

because…

"Classroom data are like teddy bears and real data are like a grizzley bear with salmon blood dripping out its mouth."

Jenny Bryan                   

because…

"Classroom data are like teddy bears and real data are like a grizzley bear with salmon blood dripping out its mouth."

Jenny Bryan                   


because…

"Classroom data are like teddy bears and real data are like a grizzley bear with salmon blood dripping out its mouth."

Jenny Bryan                   



Up to 80% of data analysis is spent on the process of cleaning and preparing data.

cf. Wickham, 2014 and Dasu and Johnson, 2003

because…

"Classroom data are like teddy bears and real data are like a grizzley bear with salmon blood dripping out its mouth."

Jenny Bryan                   



Up to 80% of data analysis is spent on the process of cleaning and preparing data.

cf. Wickham, 2014 and Dasu and Johnson, 2003



tidyr & dplyr make 95% of your data wrangling tasks much easier!

tidyr

a package that reshapes the layout of dataframes

tidyr

Primary functions

gather(): transforms data from wide to long


spread(): transforms data from long to wide


separate(): splits a single column into multiple columns


unite(): combines multiple columns into a single column

tidyr

gather()

Transform data from wide to long:

Function:       gather(data, key, value, ..., na.rm = FALSE, convert = FALSE)
Same as:        data %>% gather(key, value, ..., na.rm = FALSE, convert = FALSE)

Arguments:
        data:           data frame
        key:            column name representing new variable
        value:          column name representing variable values
        ...:            names of columns to gather (or not gather)
        na.rm:          option to remove observations with missing values (represented by NAs)
        convert:        if TRUE will automatically convert values to logical, integer, numeric, complex or 
                        factor as appropriate

tidyr

gather()

tidyr

gather()

tidyr

gather()

tidyr

gather()

tidyr

gather()

tidyr

gather()

tidyr

gather()

Code alternatives:

These all produce the same results:
        cases %>% gather(Year, n, `2011`:`2013`)
        cases %>% gather(Year, n, `2011`, `2012`, `2013`)
        cases %>% gather(Year, n, 2:4)
        cases %>% gather(Year, n, -Country)

Also note that if you do not supply arguments for na.rm or convert values then the defaults are used

tidyr

spread()

Transform data from long to wide:

Function:       spread(data, key, value, fill = NA, convert = FALSE)
Same as:        data %>% spread(key, value, fill = NA, convert = FALSE)

Arguments:
        data:           data frame
        key:            column values to convert to multiple columns
        value:          single column values to convert to multiple columns' values 
        fill:           If there isn't a value for every combination of the other variables and the key 
                        column, this value will be substituted
        convert:        if TRUE will automatically convert values to logical, integer, numeric, complex or 
                        factor as appropriate

tidyr

spread()

tidyr

spread()

tidyr

spread()

tidyr

spread()

tidyr

spread()

tidyr

separate()

Splits a single variable (column) into multiple variables (columns):

Function:       separate(data, col, into, sep = " ", remove = TRUE, convert = FALSE)
Same as:        data %>% separate(col, into, sep = " ", remove = TRUE, convert = FALSE)

Arguments:
        data:           data frame
        col:            column name representing current variable
        into:           names of variables representing new variables
        sep:            how to separate current variable (char, num, or symbol)
        remove:         if TRUE, remove input column from output data frame
        convert:        if TRUE will automatically convert values to logical, integer, numeric, complex or 
                        factor as appropriate

tidyr

separate()

tidyr

separate()

tidyr

separate()

tidyr

separate()

tidyr

separate()

tidyr

separate()

tidyr

separate()

Code alternatives:

These produce the same results:
        storms %>% separate(date, c("year", "month", "day"))
        storms %>% separate(date, c("year", "month", "day"), sep = "-")
        
By default, if no separator is specified, will separate by any regular expression that matches any sequence 
of non-alphanumeric values

tidyr

unite()

Combines multiple variables (columns) into a single variable (column):

Function:       unite(data, col, ..., sep = " ", remove = TRUE)
Same as:        data %>% unite(col, ..., sep = " ", remove = TRUE)

Arguments:
        data:           data frame
        col:            column name of new "merged" column
        ...:            names of columns to merge
        sep:            separator to use between merged values
        remove:         if TRUE, remove input column from output data frame

tidyr

unite()

tidyr

unite()

tidyr

unite()

tidyr

unite()

tidyr

unite()

tidyr

unite()

tidyr

unite()

Code alternatives:

These produce the same results:
        storms %>% unite(date, year, month, day, sep = "_")
        storms %>% unite(date, year, month, day)

If no spearator is identified, "_" will automatically be used

dplyr

a package that transforms data

dplyr

Primary functions

dplyr

Primary functions

dplyr

select()

Select variables of concern:

Function:       select(data, ...)
Same as:        data %>% select(...)

Arguments:
        data:           data frame
        ...:            call variables by name or by function

Special functions:
        starts_with(x, ignore.case = TRUE): names starts with x
        ends_with(x, ignore.case = TRUE):   names ends in x
        contains(x, ignore.case = TRUE):    selects all variables whose name contains x
        matches(x, ignore.case = TRUE):     selects all variables whose name matches the regular expression x

dplyr

select()

dplyr

select()

dplyr

select()

dplyr

select()

dplyr

select()

Code alternatives:

These produce the same results:
        storms %>% select(wind, pressure, date)
        storms %>% select(wind:pressure)
        storms %>% select(-storm)

dplyr

filter()

filter values based on conditions:

Function:       filter(data, ...)
Same as:        data %>% filter(...)

Arguments:
        data:           data frame
        ...:            conditions to be met

dplyr

filter()

dplyr

filter()

Can filter for multiple conditions

dplyr

filter()

We can apply multiple logic rules in the filter() function such as:

< Less than != Not equal to
> Greater than %in% Group membership
== Equal to is.na is NA
!= Not equal to !is.na is not NA
<= Less than or equal to &,|,! Boolean operators
>= Greater than or equal to

dplyr

filter()

Additional subsetting functions:

storms %>% distinct() Remove duplicate rows
storms %>% sample_frac(0.5, replace = TRUE) Randomly select fraction of rows
storms %>% sample_n(10, replace = TRUE) Randomly select n rows
storms %>% slice(3:5) Select rows by position
storms %>% top_n(storms, 2, date) Select and order top n entries

dplyr

group_by()

Group data by categorical levels:

Function:       group_by(data, ...)
Same as:        data %>% group_by(...)

Arguments:
        data:           data frame
        ...:            variables to group_by

*Use ungroup(x) to remove groups

dplyr

group_by()

dplyr

group_by()

dplyr

group_by()

dplyr

group_by()

dplyr

group_by()

  • group_by() function is a silent function in which no observable manipulation of the data occurs
  • Alters underlying attributes → note Groups: city [3] in output below
  • Becomes useful when combined with other functions
pollution %>% group_by(city)        
## Source: local data frame [6 x 3]
## Groups: city [3]
## 
##       city  size amount
##      (chr) (chr)  (dbl)
## 1 New York large     23
## 2 New York small     14
## 3   London large     22
## 4   London small     16
## 5  Beijing large    121
## 6  Beijing small     56

dplyr

summarise()

Perform summary statistics on variables:

Function:       summarise(data, ...)
Same as:        data %>% summarise(...)

Arguments:
        data:           data frame
        ...:            Name-value pairs of summary functions like min(), mean(), max() etc.

*Developer is from New Zealand...can use "summarise(x)" or "summarize(x)"

dplyr

summarise()

dplyr

summarise()

dplyr

summarise()

dplyr

summarise()

dplyr

summarise()

dplyr

summarise()

dplyr

summarise()

dplyr

summarise()

Summary functions:

first() First value of a vector min() Min value in vector
last() Last value of a vector max() Max value in vector
nth() Nth value of a vector mean() Mean value of vector
n() # of values in a vector median() Median value of vector
n_distinct() # of distinct values var() Variance of vector
IQR() IQR of a vector sd() St. dev. of vector

dplyr

arrange()

Order variable values:

Function:       arrange(data, ...)
Same as:        data %>% arrange(...)

Arguments:
        data:           data frame
        ...:            Variable(s) to order

*use desc(x) to sort variable(s) in descending order

dplyr

arrange()

dplyr

arrange()

Use desc(x) to sort variable(s) in descending order

dplyr

mutate()

Create new variables:

Function:       
                mutate(data, ...)
Same as:        data %>% mutate(...)                

Arguments:
        data:           data frame
        ...:            Expression(s)

dplyr

mutate()

dplyr

mutate()

dplyr

mutate()

dplyr

mutate()

dplyr

mutate()

Window functions that can be applied within mutate_each(funs()):

lead() ntile() cumsum()
lag() between() cummax()
dense_rank() cume_dist() cummin()
min_rank() cumall() cumprod()
percent_rank() cumany() pmax()
row_number() cumean() pmin()


  • Can also create own functions to pass through mutate_each(funs())
  • Be aware that some functions will cause errors when applied to non-numeric variables

dplyr

join()

Combining data sets:

Functions:       
                left_join(x, y, by = NULL)
                right_join(x, y, by = NULL)
                inner_join(x, y, by = NULL)
                full_join(x, y, by = NULL)
                semi_join(x, y, by = NULL)
                anti_join(x, y, by = NULL)

Arguments:
        x, y:          data frames to join
        by:            a character vector of variables to join by. If NULL, the default, join will do a 
                       natural join, using all variables with common names across the two tables.

dplyr

join()

dplyr

join()

Join matching rows from b to a

dplyr

join()

Join matching rows from b to a

dplyr

join()

Join matching rows from b to a

dplyr

join()

Join matching rows from a to b

dplyr

join()

Join data. Retain only matching rows in both sets.

dplyr

join()

Join data. Retain all values, all rows.

dplyr

join()

Keep all rows in a that have a match in b.

dplyr

join()

Keep all rows in a that do not have a match in b.

dplyr

join()

Additional dplyr functions for merging data sets:

intersect(y, z) Rows that appear in both y and z
union(y, z) Rows that appear in either or both y and z
setdiff(y, z) Rows that appear in y but not z
bind_rows(y, z) Append z to y as new rows
bind_cols(y, z) Append z to y as new columns


Use extreme caution when applying bind_ functions.

Wrap up

Additional Resources

This tutorial simply touches on the basics that these two packages can do. There are several other resources you can check out to learn more. In addition, much of what I have learned and, therefore, much of the content in this tutorial is simply a modified regurgitation of the wonderful resources provided by R Studio, Hadley Wickham, and Garrett Grolemund.

Thank you