Reducing code with dplyr

Why is dplyr (http://cran.r-project.org/web/packages/dplyr/dplyr.pdf) so great? It simplifies much of the code you write for managing dataframes.

Let’s say you were using the dataframe “mtcars” and wanted to calculate the mean, median, and standard deviation of miles per gallon (mpg), grouped by the number of cylinders (cyl).

Here’s one way you could do that using the aggregate function in base R (though I admit there may be simpler ways…):

Means <- aggregate(mtcars$mpg, by = list(mtcars$cyl), FUN = mean)   ## Calculate aggregated means
Medians <- aggregate(mtcars$mpg, by = list(mtcars$cyl), FUN = median) ## Repeat with medians
SDs <- aggregate(mtcars$mpg, by = list(mtcars$cyl), FUN = sd)  ## (sigh) again with sd
Final <- cbind(Means, Medians[, 2], SDs[, 2])  ## Combine (keeping only one copy of the grouping column)
names(Final) <- c("cyl", "mpg.mean", "mpg.median", "mpg.sd")  ## Rename
Final

##   cyl mpg.mean mpg.median mpg.sd
## 1   4    26.66       26.0  4.510
## 2   6    19.74       19.7  1.454
## 3   8    15.10       15.2  2.560

Note that with this approach you have to run the aggregate command once per function and then combine the results into one dataframe afterwards. Kind of a pain.
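
For what it’s worth, here is a sketch of one of those simpler base R ways: aggregate can take a function that returns several values at once. The names Agg and Final_base are my own, and the do.call step is just one way to flatten the result, so treat this as an illustration rather than the recommended approach.

```r
## One aggregate call with a function that returns all three statistics
Agg <- aggregate(mpg ~ cyl, data = mtcars,
                 FUN = function(x) c(mean = mean(x), median = median(x), sd = sd(x)))

## Agg$mpg is a 3-column matrix, so flatten it into ordinary columns
Final_base <- do.call(data.frame, Agg)
names(Final_base)  ## cyl, mpg.mean, mpg.median, mpg.sd
```

It works, but you still have to know that aggregate hands back a matrix column here and unpack it yourself.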

Here’s the much, much simpler version in dplyr:

library(dplyr)  ## Load dplyr (provides %>%, group_by, and summarise_each)

Final <- mtcars %>%  ## Define the dataframe once
  group_by(cyl) %>%  ## Define one (or more!) grouping variables
  summarise_each(funs(mean, median, sd), mpg)  ## Define the aggregation functions
Final

## Source: local data frame [3 x 4]
## 
##   cyl  mean median    sd
## 1   4 26.66   26.0 4.510
## 2   6 19.74   19.7 1.454
## 3   8 15.10   15.2 2.560
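
A note for readers on newer versions of dplyr: summarise_each() and funs() have since been deprecated. Below is a minimal sketch of the same summary using plain summarise() with named expressions, plus a variant with two grouping variables. The names Final2 and ByCylAm are mine, and this is one possible spelling rather than the only one.

```r
library(dplyr)

## Same summary as above, written with named expressions in summarise()
Final2 <- mtcars %>%
  group_by(cyl) %>%
  summarise(mean = mean(mpg), median = median(mpg), sd = sd(mpg))

## Grouping by more than one variable only changes the group_by() line
ByCylAm <- mtcars %>%
  group_by(cyl, am) %>%
  summarise(mean = mean(mpg), median = median(mpg), sd = sd(mpg))
```

Final2 has one row per value of cyl, and ByCylAm has one row per combination of cyl and am. Recent dplyr versions may also print a message about how the result is grouped.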

Isn’t that so much nicer?!

Nathaniel D Phillips

7 Oct 2014