How can we quickly and easily calculate summary statistics over groups of variables in R? For example, if you have a dataset of calorie counts for different items at McDonalds, how can you calculate the mean number of calories separately for desserts and sandwiches?
Let’s walk through how to do this using the ChickWeight dataset. Here’s how the dataset looks:
head(ChickWeight)
## weight Time Chick Diet
## 1 42 0 1 1
## 2 51 2 1 1
## 3 59 4 1 1
## 4 64 6 1 1
## 5 76 8 1 1
## 6 93 10 1 1
Let’s say we want to calculate the mean weight across all chicks separately for each time point. Here, our critical DV is weight and our grouping variable is time.
One way to do this is to use the tapply() function. The tapply() function has three arguments
Mean.Weights <- with(ChickWeight, tapply(X = weight, # X is the critical DV
INDEX = Time, # INDEX is the grouping variable
FUN = mean # FUN is the aggregation function
))
Mean.Weights
## 0 2 4 6 8 10 12 14 16 18
## 41.06 49.22 59.96 74.31 91.24 107.84 129.24 143.81 168.09 190.19
## 20 21
## 209.72 218.69
As you can see, the output of the tapply() function is an array of mean weights aggregated at each time period. When we plot the data, we can clearly see that (unsurprisingly) the mean weight of chicks increases over time.
You can put ANY function you want into the FUN argument as long as it takes a single vector as an argument and returns a single number. This function can either be a function pre-defined by R (like mean, median, sd, etc.) or a custom function that you define.
tapply is great for quickly looking at the result of a single function applied to groups on a single variable, but what if you want to apply several functions to several different variables in a dataframe? For example, say you want to calculate both the mean weight and the standard deviation of weights and the mean of another variable such as the chicken’s height.
You could do this using several tapply() functions, but this would be cumbersome. Instead, we can utilize the almightly dplyr package!
Pipelining with dplyr allows you to simultaneously calculate several summary variables and save the results in a dataframe.
First, let me add a new column titled Height to ChickWeight that gives the height of the chicks
ChickWeight$Height <- ChickWeight$weight / 3 + rnorm(nrow(ChickWeight), 2, .5)
Ok, now let’s create a new dataframe called “Summary.df” that aggregates the following data over time (note, you need to install and load the “dplyr” library first!):
# Make sure to install and load the "dplyr" package before running this code!
Summary.df <- ChickWeight %>% # Start by defining the original dataframe, AND THEN...
group_by(Time) %>% # Define the grouping variable, AND THEN...
summarise( # Now you define your summary variables with a name and a function...
Observations = n(), ## The function n() in dlpyr gives you the number of observations
Mean.Weight = mean(weight),
SD.Weight = sd(weight),
Mean.Height = mean(Height),
SD.Height = sd(Height)
)
Summary.df
## Source: local data frame [12 x 6]
##
## Time Observations Mean.Weight SD.Weight Mean.Height SD.Height
## 1 0 50 41.06 1.132 15.77 0.6295
## 2 2 50 49.22 3.688 18.33 1.3366
## 3 4 49 59.96 4.495 22.02 1.7512
## 4 6 49 74.31 9.012 26.68 2.9927
## 5 8 49 91.24 16.240 32.40 5.4900
## 6 10 49 107.84 23.987 38.07 7.9248
## 7 12 49 129.24 34.120 45.19 11.3184
## 8 14 48 143.81 38.300 49.78 12.6195
## 9 16 47 168.09 46.904 58.13 15.5520
## 10 18 47 190.19 57.395 65.52 19.0256
## 11 20 46 209.72 66.512 71.86 22.2392
## 12 21 45 218.69 71.510 74.78 24.0051
Let’s briefly look through the main points of the dplyr pipelining code:
dplyr is an incredibly usefull package that can allow you to complete many tasks quickly and easily. While there are many other functions to use in dplyr, one I want to point out is filter which allows you to aggregate data over groups only for specific rows that you specify. For example, let’s rerun the previous analysis, but only include data for data where the Chicken’s Height is greater than 50 (maybe we only care about the weight of chicks when they are taller than 50). We do this by adding a filter(Height > 50) command before the group_by command.
Summary.HGT50.df <- ChickWeight %>% # Start by defining the original dataframe, AND THEN...
filter(Height > 50) %>% # Only include cases where Height > 50 AND THEN...
group_by(Time) %>% # Define the grouping variable, AND THEN...
summarise( # Now you define your summary variables with a name and a function...
Observations = n(), ## The function n() in dlpyr gives you the number of observations
Mean.Weight = mean(weight),
SD.Weight = sd(weight),
Mean.Height = mean(Height),
SD.Height = sd(Height)
)
Summary.HGT50.df
## Source: local data frame [7 x 6]
##
## Time Observations Mean.Weight SD.Weight Mean.Height SD.Height
## 1 10 3 159.3 3.215 55.21 1.305
## 2 12 19 161.1 19.731 55.83 6.335
## 3 14 26 171.0 24.305 58.77 7.891
## 4 16 34 189.1 34.395 65.13 11.364
## 5 18 39 207.4 45.962 71.25 15.189
## 6 20 38 230.1 53.247 78.66 17.830
## 7 21 39 235.6 60.275 80.45 20.253
You’ll notice that we have many fewer rows in this dataframe, that’s because at many time points, there were no Chicks with a height greater than 50.