October 1, 2015
arrange(
summarize(
filter(data, variable == "numeric_value"),
Total = sum(variable)
),
desc(Total)
)
a <- filter(data, variable == "numeric_value") b <- summarise(a, Total = sum(variable)) c <- arrange(b, desc(Total))
data %>%
filter(variable == "value") %>%
summarise(Total = sum(variable)) %>%
arrange(desc(Total))
"Classroom data are like teddy bears and real data are like a grizzley bear with salmon blood dripping out its mouth."
Jenny Bryan                  Â
"Classroom data are like teddy bears and real data are like a grizzley bear with salmon blood dripping out its mouth."
Jenny Bryan                  Â
"Classroom data are like teddy bears and real data are like a grizzley bear with salmon blood dripping out its mouth."
Jenny Bryan                  Â
"Classroom data are like teddy bears and real data are like a grizzley bear with salmon blood dripping out its mouth."
Jenny Bryan                  Â
"Classroom data are like teddy bears and real data are like a grizzley bear with salmon blood dripping out its mouth."
Jenny Bryan                  Â
"Classroom data are like teddy bears and real data are like a grizzley bear with salmon blood dripping out its mouth."
Jenny Bryan                  Â
Up to 80% of data analysis is spent on the process of cleaning and preparing data.
cf. Wickham, 2014 and Dasu and Johnson, 2003
"Classroom data are like teddy bears and real data are like a grizzley bear with salmon blood dripping out its mouth."
Jenny Bryan                  Â
Up to 80% of data analysis is spent on the process of cleaning and preparing data.
cf. Wickham, 2014 and Dasu and Johnson, 2003
tidyr & dplyr make 95% of your data wrangling tasks much easier!
gather(): transforms data from wide to long
spread(): transforms data from long to wide
separate(): splits a single column into multiple columns
unite(): combines multiple columns into a single column
gather()Transform data from wide to long:
Function: gather(data, key, value, ..., na.rm = FALSE, convert = FALSE)
Same as: data %>% gather(key, value, ..., na.rm = FALSE, convert = FALSE)
Arguments:
data: data frame
key: column name representing new variable
value: column name representing variable values
...: names of columns to gather (or not gather)
na.rm: option to remove observations with missing values (represented by NAs)
convert: if TRUE will automatically convert values to logical, integer, numeric, complex or
factor as appropriate
gather()gather()gather()gather()gather()gather()gather()Code alternatives:
These all produce the same results:
cases %>% gather(Year, n, `2011`:`2013`)
cases %>% gather(Year, n, `2011`, `2012`, `2013`)
cases %>% gather(Year, n, 2:4)
cases %>% gather(Year, n, -Country)
Also note that if you do not supply arguments for na.rm or convert values then the defaults are used
spread()Transform data from long to wide:
Function: spread(data, key, value, fill = NA, convert = FALSE)
Same as: data %>% spread(key, value, fill = NA, convert = FALSE)
Arguments:
data: data frame
key: column values to convert to multiple columns
value: single column values to convert to multiple columns' values
fill: If there isn't a value for every combination of the other variables and the key
column, this value will be substituted
convert: if TRUE will automatically convert values to logical, integer, numeric, complex or
factor as appropriate
spread()spread()spread()spread()spread()separate()Splits a single variable (column) into multiple variables (columns):
Function: separate(data, col, into, sep = " ", remove = TRUE, convert = FALSE)
Same as: data %>% separate(col, into, sep = " ", remove = TRUE, convert = FALSE)
Arguments:
data: data frame
col: column name representing current variable
into: names of variables representing new variables
sep: how to separate current variable (char, num, or symbol)
remove: if TRUE, remove input column from output data frame
convert: if TRUE will automatically convert values to logical, integer, numeric, complex or
factor as appropriate
separate()separate()separate()separate()separate()separate()separate()Code alternatives:
These produce the same results:
storms %>% separate(date, c("year", "month", "day"))
storms %>% separate(date, c("year", "month", "day"), sep = "-")
By default, if no separator is specified, will separate by any regular expression that matches any sequence
of non-alphanumeric values
unite()Combines multiple variables (columns) into a single variable (column):
Function: unite(data, col, ..., sep = " ", remove = TRUE)
Same as: data %>% unite(col, ..., sep = " ", remove = TRUE)
Arguments:
data: data frame
col: column name of new "merged" column
...: names of columns to merge
sep: separator to use between merged values
remove: if TRUE, remove input column from output data frame
unite()unite()unite()unite()unite()unite()unite()Code alternatives:
These produce the same results:
storms %>% unite(date, year, month, day, sep = "_")
storms %>% unite(date, year, month, day)
If no spearator is identified, "_" will automatically be used
select(): select variables of concern
filter(): filter values based on conditions
group_by(): group data by categorical levels
summarise(): change unit of analysis
arrange(): order the data
mutate(): create new variables
join(): combine separate data sets
select()Select variables of concern:
Function: select(data, ...)
Same as: data %>% select(...)
Arguments:
data: data frame
...: call variables by name or by function
Special functions:
starts_with(x, ignore.case = TRUE): names starts with x
ends_with(x, ignore.case = TRUE): names ends in x
contains(x, ignore.case = TRUE): selects all variables whose name contains x
matches(x, ignore.case = TRUE): selects all variables whose name matches the regular expression x
select()select()select()select()select()Code alternatives:
These produce the same results:
storms %>% select(wind, pressure, date)
storms %>% select(wind:pressure)
storms %>% select(-storm)
filter()filter values based on conditions:
Function: filter(data, ...)
Same as: data %>% filter(...)
Arguments:
data: data frame
...: conditions to be met
filter()filter()filter()We can apply multiple logic rules in the filter() function such as:
| < | Less than | != | Not equal to |
| > | Greater than | %in% | Group membership |
| == | Equal to | is.na | is NA |
| != | Not equal to | !is.na | is not NA |
| <= | Less than or equal to | &,|,! | Boolean operators |
| >= | Greater than or equal to |
filter()Additional subsetting functions:
storms %>% distinct() |
Remove duplicate rows | |
storms %>% sample_frac(0.5, replace = TRUE) |
Randomly select fraction of rows | |
storms %>% sample_n(10, replace = TRUE) |
Randomly select n rows | |
storms %>% slice(3:5) |
Select rows by position | |
storms %>% top_n(storms, 2, date) |
Select and order top n entries |
group_by()Group data by categorical levels:
Function: group_by(data, ...)
Same as: data %>% group_by(...)
Arguments:
data: data frame
...: variables to group_by
*Use ungroup(x) to remove groups
group_by()group_by()group_by()group_by()group_by()group_by() function is a silent function in which no observable manipulation of the data occurspollution %>% group_by(city)
## Source: local data frame [6 x 3] ## Groups: city [3] ## ## city size amount ## (chr) (chr) (dbl) ## 1 New York large 23 ## 2 New York small 14 ## 3 London large 22 ## 4 London small 16 ## 5 Beijing large 121 ## 6 Beijing small 56
summarise()Perform summary statistics on variables:
Function: summarise(data, ...)
Same as: data %>% summarise(...)
Arguments:
data: data frame
...: Name-value pairs of summary functions like min(), mean(), max() etc.
*Developer is from New Zealand...can use "summarise(x)" or "summarize(x)"
summarise()summarise()summarise()summarise()summarise()summarise()summarise()summarise()Summary functions:
first() |
First value of a vector | min() |
Min value in vector | |
last() |
Last value of a vector | max() |
Max value in vector | |
nth() |
Nth value of a vector | mean() |
Mean value of vector | |
n() |
# of values in a vector | median() |
Median value of vector | |
n_distinct() |
# of distinct values | var() |
Variance of vector | |
IQR() |
IQR of a vector | sd() |
St. dev. of vector |
arrange()Order variable values:
Function: arrange(data, ...)
Same as: data %>% arrange(...)
Arguments:
data: data frame
...: Variable(s) to order
*use desc(x) to sort variable(s) in descending order
arrange()arrange()desc(x) to sort variable(s) in descending order
mutate()Create new variables:
Function:
mutate(data, ...)
Same as: data %>% mutate(...)
Arguments:
data: data frame
...: Expression(s)
mutate()mutate()mutate()mutate()mutate()Window functions that can be applied within mutate_each(funs()):
lead() |
ntile() |
cumsum() |
|
lag() |
between() |
cummax() |
|
dense_rank() |
cume_dist() |
cummin() |
|
min_rank() |
cumall() |
cumprod() |
|
percent_rank() |
cumany() |
pmax() |
|
row_number() |
cumean() |
pmin() |
mutate_each(funs())join()Combining data sets:
Functions:
left_join(x, y, by = NULL)
right_join(x, y, by = NULL)
inner_join(x, y, by = NULL)
full_join(x, y, by = NULL)
semi_join(x, y, by = NULL)
anti_join(x, y, by = NULL)
Arguments:
x, y: data frames to join
by: a character vector of variables to join by. If NULL, the default, join will do a
natural join, using all variables with common names across the two tables.
join()join()b to a
join()b to a
join()b to a
join()a to b
join()join()join()a that have a match in b.
join()a that do not have a match in b.
join()Additional dplyr functions for merging data sets:
intersect(y, z) |
Rows that appear in both y and z | |
union(y, z) |
Rows that appear in either or both y and z | |
setdiff(y, z) |
Rows that appear in y but not z | |
bind_rows(y, z) |
Append z to y as new rows | |
bind_cols(y, z) |
Append z to y as new columns |
bind_ functions.
This tutorial simply touches on the basics that these two packages can do. There are several other resources you can check out to learn more. In addition, much of what I have learned and, therefore, much of the content in this tutorial is simply a modified regurgitation of the wonderful resources provided by R Studio, Hadley Wickham, and Garrett Grolemund.