I decided not to use any dataset examples in my summary because I wanted to keep the functions as general as possible, so that I am not tied to one example's application when I reuse them in the future. As a result, the format looks a little unusual: the function listings below are written as plain text rather than executable chunks, since there was no data for the functions to operate on.

Data Processing with dplyr & tidyr

Brad Boehmke

These are the main functions that the tidyr and dplyr packages provide for manipulating data.

tidyr:

gather(), spread(), separate(), unite()

dplyr:

select(), filter(), group_by(), summarise(), arrange(), join(), mutate()

The %>% operator allows us to chain multiple function calls together, passing the result of one step to the next, without having to nest calls or create intermediate objects. This makes it easier to see what the code is actually doing, and also makes it easier to write, because nested calls get confusing once they are three or so levels deep.
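Although this summary avoids dataset-specific examples, a tiny made-up data frame is enough to sketch the difference between nested and piped code. The `scores` data frame and its columns are hypothetical and exist only for this illustration.

```r
library(dplyr)

# Hypothetical toy data, made up purely for illustration
scores <- data.frame(
  student = c("A", "B", "C", "D"),
  points  = c(90, 72, 85, 60)
)

# Nested form: has to be read from the inside out
nested <- head(arrange(filter(scores, points > 65), desc(points)), 2)

# Piped form: the same three steps, read top to bottom
piped <- scores %>%
  filter(points > 65) %>%
  arrange(desc(points)) %>%
  head(2)

# Both forms run the same calls, so the results match
identical(nested, piped)
```

Both versions keep the rows with more than 65 points, sort them from highest to lowest, and return the top two; only the readability differs.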

Tidyr Operations

There are four fundamental functions of data tidying:

gather() ## takes multiple columns and gathers them into key-value pairs; it makes "wide" data longer (unpivots the data into fewer columns than there were initially)
spread() ## takes two columns (key and value) and spreads them into multiple columns; it makes "long" data wider (pivots the data into more columns than there were initially); the complement of gather()
separate() ## splits a single column into multiple columns
unite() ## combines multiple columns into a single column
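The four tidyr functions above can be sketched on small made-up data frames. Everything here (`wide`, `dates`, and the column names) is hypothetical and chosen only to show the shape of each call.

```r
library(tidyr)

# Hypothetical wide data: one column per quarter
wide <- data.frame(
  group = c("a", "b"),
  Q1 = c(10, 20),
  Q2 = c(15, 25)
)

# gather(): wide -> long; Q1 and Q2 become key-value pairs
long <- gather(wide, key = "quarter", value = "revenue", Q1, Q2)

# spread(): long -> wide, reversing the gather() above
wide_again <- spread(long, key = quarter, value = revenue)

# separate(): split one column into two at the underscore
dates <- data.frame(period = c("2015_Q1", "2015_Q2"))
split_up <- separate(dates, period, into = c("year", "quarter"), sep = "_")

# unite(): paste the two columns back into one
rejoined <- unite(split_up, "period", year, quarter, sep = "_")
```

Note that in current tidyr releases gather() and spread() are superseded by pivot_longer() and pivot_wider(), although the older functions still work.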

Dplyr Operations

There are seven fundamental functions of data transformation:

select() ## selects specific columns (variables) from the data
filter() ## keeps the rows that meet a condition; most logical and comparison expressions are accepted
group_by() ## groups data by the levels of a categorical variable
summarise() ## summarises data with functions of your choice (use with group_by() for powerful statistical summaries)
arrange() ## orders the rows
join() ## joins separate data frames (a family of functions rather than a single function)
mutate() ## creates new variables
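As a rough sketch of how these verbs combine, the pipeline below runs six of them on a made-up data frame (joins are left out here). The `sales` data and its column names are hypothetical.

```r
library(dplyr)

# Hypothetical toy data
sales <- data.frame(
  region = c("east", "east", "west", "west"),
  units  = c(5, 8, 3, 10),
  price  = c(2, 2, 4, 4)
)

result <- sales %>%
  mutate(revenue = units * price) %>%   # create a new variable
  filter(units > 3) %>%                 # keep rows meeting a condition
  select(region, revenue) %>%           # keep only these columns
  group_by(region) %>%                  # group by a categorical variable
  summarise(total = sum(revenue)) %>%   # one summary row per group
  arrange(desc(total))                  # order rows, largest total first

result
```

The result has one row per region, with west (total 40) listed before east (total 26).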

There are also many versions of join(), namely left_join(), inner_join(), anti_join(), etc.
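The difference between the join variants is easiest to see side by side. This is a minimal sketch on two made-up tables; `customers`, `orders`, and their columns are hypothetical.

```r
library(dplyr)

# Hypothetical toy tables sharing an "id" column
customers <- data.frame(id = c(1, 2, 3), name = c("Ann", "Bob", "Cy"))
orders    <- data.frame(id = c(1, 1, 3, 4), amount = c(20, 35, 15, 50))

# left_join(): every customer, with NA where there is no matching order
left <- left_join(customers, orders, by = "id")

# inner_join(): only the ids that appear in both tables
inner <- inner_join(customers, orders, by = "id")

# anti_join(): customers that have no matching order at all
anti <- anti_join(customers, orders, by = "id")
```

Here `left` has four rows (customer 1 appears twice, customer 2 has an NA amount), `inner` has three rows, and `anti` has the single customer without an order.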

Data Science with R by Garrett Grolemund

Tidy data is arranged so that each variable has its own column and each observation has its own row across all of the variables. R stores each column of the data frame as a vector. Each value then sits in exactly one cell, at the intersection of one row and one column, and no two values share the same coordinates. The article then went on to demonstrate some uses of the tidyr and dplyr functions that we learned about in the prior article.

Introduction to dplyr for Faster Data Manipulation in R

This article gave roughly the same information as the previous two, plus a few extra handy functions:

sample_n() ## randomly samples a fixed number of rows, without replacement by default
sample_frac() ## randomly samples a fraction of rows; without replacement by default, or with replacement when replace = TRUE
str() ## base R approach to view the structure of an object
glimpse() ## dplyr approach: better formatting, and adapts to your screen width
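A short sketch of these helpers on a made-up data frame follows. The `df` object is hypothetical, and the seed is set only so that repeated runs draw the same rows.

```r
library(dplyr)

# Hypothetical toy data
df <- data.frame(id = 1:10, value = (1:10)^2)

set.seed(1)  # make the random draws repeatable

fixed <- sample_n(df, 3)                    # 3 rows, without replacement
half  <- sample_frac(df, 0.5)               # half the rows, without replacement
boot  <- sample_frac(df, 1, replace = TRUE) # same size as df, rows may repeat

str(df)      # base R view of the structure
glimpse(df)  # dplyr view of the structure
```

Sampling with `replace = TRUE` is what allows the same row to be drawn more than once, which is why it has to be requested explicitly.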

dplyr is also compatible with SQL databases: once you are connected to one, you can work with its tables much like local data frames and issue SQL commands from inside R.
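To give a feel for the SQL side without needing a real database, the sketch below assumes the dbplyr package is installed and uses its `lazy_frame()` and `simulate_sqlite()` helpers to stand in for a remote table. The table contents and column names are hypothetical, and this is only one way the translation can be inspected.

```r
library(dplyr)
library(dbplyr)

# Assumption: a simulated SQLite backend stands in for a real connection,
# so nothing is actually sent to a database here
remote <- lazy_frame(region = "east", units = 5, con = simulate_sqlite())

# Ordinary dplyr verbs, written exactly as for a local data frame
query <- remote %>%
  filter(units > 3) %>%
  select(region, units)

# Print the SQL that the pipeline translates to
show_query(query)
```

With a real connection, the same pipeline would be run by the database and only the result brought back into R.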