Grouped Aggregation with tapply and dplyr

How can we quickly and easily calculate summary statistics over groups of variables in R? For example, if you have a dataset of calorie counts for different items at McDonalds, how can you calculate the mean number of calories separately for desserts and sandwiches?

Let’s walk through how to do this using the ChickWeight dataset. Here’s how the dataset looks:

head(ChickWeight)

##   weight Time Chick Diet
## 1     42    0     1    1
## 2     51    2     1    1
## 3     59    4     1    1
## 4     64    6     1    1
## 5     76    8     1    1
## 6     93   10     1    1

Let’s say we want to calculate the mean weight across all chicks separately for each time point. Here, our critical DV is weight and our grouping variable is time.

tapply()

One way to do this is to use the tapply() function. The tapply() function has three arguments

X = the critical variable you want to aggregate
INDEX = the grouping variable
FUN = the function you want to apply to X for each level of INDEX

Mean.Weights <- with(ChickWeight, tapply(X = weight,  # X is the critical DV
                         INDEX = Time, # INDEX is the grouping variable
                         FUN = mean # FUN is the aggregation function
                         ))

Mean.Weights

##      0      2      4      6      8     10     12     14     16     18 
##  41.06  49.22  59.96  74.31  91.24 107.84 129.24 143.81 168.09 190.19 
##     20     21 
## 209.72 218.69

As you can see, the output of the tapply() function is an array of mean weights aggregated at each time period. When we plot the data, we can clearly see that (unsurprisingly) the mean weight of chicks increases over time.

You can put ANY function you want into the FUN argument as long as it takes a single vector as an argument and returns a single number. This function can either be a function pre-defined by R (like mean, median, sd, etc.) or a custom function that you define.

tapply is great for quickly looking at the result of a single function applied to groups on a single variable, but what if you want to apply several functions to several different variables in a dataframe? For example, say you want to calculate both the mean weight and the standard deviation of weights and the mean of another variable such as the chicken’s height.

You could do this using several tapply() functions, but this would be cumbersome. Instead, we can utilize the almightly dplyr package!

dplyr and pipelining (%>%)

Pipelining with dplyr allows you to simultaneously calculate several summary variables and save the results in a dataframe.

First, let me add a new column titled Height to ChickWeight that gives the height of the chicks

ChickWeight$Height <- ChickWeight$weight / 3 + rnorm(nrow(ChickWeight), 2, .5)

Ok, now let’s create a new dataframe called “Summary.df” that aggregates the following data over time (note, you need to install and load the “dplyr” library first!):

Observations = The number of datapoints at each time point
Mean.Weight = The mean weight…
SD.Weight = The standard deviation of weights…
Mean.Height = The mean height…
SD.Height = THe standard deviation of heights…

# Make sure to install and load the "dplyr" package before running this code!

Summary.df <- ChickWeight %>% # Start by defining the original dataframe, AND THEN...
                group_by(Time) %>% # Define the grouping variable, AND THEN...
                summarise( # Now you define your summary variables with a name and a function...
                          Observations = n(),  ## The function n() in dlpyr gives you the number of observations
                          Mean.Weight = mean(weight),
                          SD.Weight = sd(weight),
                          Mean.Height = mean(Height),
                          SD.Height = sd(Height)
)

Summary.df

## Source: local data frame [12 x 6]
## 
##    Time Observations Mean.Weight SD.Weight Mean.Height SD.Height
## 1     0           50       41.06     1.132       15.77    0.6295
## 2     2           50       49.22     3.688       18.33    1.3366
## 3     4           49       59.96     4.495       22.02    1.7512
## 4     6           49       74.31     9.012       26.68    2.9927
## 5     8           49       91.24    16.240       32.40    5.4900
## 6    10           49      107.84    23.987       38.07    7.9248
## 7    12           49      129.24    34.120       45.19   11.3184
## 8    14           48      143.81    38.300       49.78   12.6195
## 9    16           47      168.09    46.904       58.13   15.5520
## 10   18           47      190.19    57.395       65.52   19.0256
## 11   20           46      209.72    66.512       71.86   22.2392
## 12   21           45      218.69    71.510       74.78   24.0051

Let’s briefly look through the main points of the dplyr pipelining code:

Begin by defining the dataframe the contains the original dataset.
The symbole %>% is called a ‘pipe’ - you can think about it as “and then”. So we begin by defining the data frame, “and then”…
group_by tells dplyr which grouping variable (the same as the INDEX variable in tapply) you want to aggregate over. In this case, the grouping variable is Time.
summarise tells dplyr that you want to calculate summary variables. These will correspond to the new variables in your final dataframe. In this example. For each summary variable, give it a name, then define a function. Because we already defined a grouping variable, R will calculate these summary statistics separately for each group.

dplyr is an incredibly usefull package that can allow you to complete many tasks quickly and easily. While there are many other functions to use in dplyr, one I want to point out is filter which allows you to aggregate data over groups only for specific rows that you specify. For example, let’s rerun the previous analysis, but only include data for data where the Chicken’s Height is greater than 50 (maybe we only care about the weight of chicks when they are taller than 50). We do this by adding a filter(Height > 50) command before the group_by command.

Summary.HGT50.df <- ChickWeight %>% # Start by defining the original dataframe, AND THEN...
                filter(Height > 50) %>% # Only include cases where Height > 50 AND THEN...
                group_by(Time) %>% # Define the grouping variable, AND THEN...
                summarise( # Now you define your summary variables with a name and a function...
                          Observations = n(),  ## The function n() in dlpyr gives you the number of observations
                          Mean.Weight = mean(weight),
                          SD.Weight = sd(weight),
                          Mean.Height = mean(Height),
                          SD.Height = sd(Height)
)

Summary.HGT50.df

## Source: local data frame [7 x 6]
## 
##   Time Observations Mean.Weight SD.Weight Mean.Height SD.Height
## 1   10            3       159.3     3.215       55.21     1.305
## 2   12           19       161.1    19.731       55.83     6.335
## 3   14           26       171.0    24.305       58.77     7.891
## 4   16           34       189.1    34.395       65.13    11.364
## 5   18           39       207.4    45.962       71.25    15.189
## 6   20           38       230.1    53.247       78.66    17.830
## 7   21           39       235.6    60.275       80.45    20.253

You’ll notice that we have many fewer rows in this dataframe, that’s because at many time points, there were no Chicks with a height greater than 50.

Grouped Aggregation with tapply and dplyr

Nathaniel Phillips

3 Dec 2014

tapply()

dplyr and pipelining (%>%)