Introduction

This post discusses techniques in R for computing summary functions (such as mean, max or any user-defined function) of any number of variables in a data frame, split according to any number of groups. One can also specify which function to evaluate for which variable, as will be shown in the following.

Finding the mean (or any other function) of a variable split by groups

Suppose each observation in a dataset consists of several variables, of which only two are relevant for us. One is a grouping variable that assigns each observation to exactly one group, and the other is a numeric variable. We want to find the mean of the numeric variable within each group of the grouping variable. For example, the first variable could be the country a person belongs to, and the second could be the weight of the person. So from a dataset containing thousands of such observations collected in Europe, we want to find the mean weight of people from France, Italy, Germany etc.
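To make the setting concrete, here is a minimal sketch on a made-up data frame. The `people` data, its column names and its values are purely hypothetical; the aggregate command used here comes with base R and is discussed in the next paragraphs.

```r
# Hypothetical toy data: country and weight (in kg) of five people
people <- data.frame(
  country = c("France", "France", "Italy", "Italy", "Germany"),
  weight  = c(70, 80, 60, 66, 90)
)

# Mean weight within each country
mean_weight <- aggregate(weight ~ country, data = people, FUN = mean)
print(mean_weight)
```

With these made-up numbers the group means are 75 for France, 63 for Italy and 90 for Germany.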

The simplest way to solve this problem is to use the aggregate command which comes with base R. To illustrate this, consider the mtcars dataset (see https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/mtcars.html for a description of this dataset), which consists of observations for 32 car models with 11 numeric variables. The model names are stored as row names rather than as a separate column, and the numeric variables are car and engine parameters such as number of cylinders, displacement etc. Using the head command, we can look at the first six observations:

head(mtcars)

Let us say we want to find the average mpg (miles per gallon) split according to cyl (number of cylinders). Just for information, the table command shows how many models there are as a function of the number of cylinders:

table(mtcars$cyl)
## 
##  4  6  8 
## 11  7 14

showing that there are 11, 7, and 14 models with 4, 6 and 8 cylinders, respectively.

Method 1(a): using the aggregate command, part-1

The simplest way to use the aggregate command is to type

aggregate(mpg~cyl,data=mtcars,mean)

which shows that the mean mpg falls as the number of cylinders increases, that is, more fuel is consumed per mile.
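The FUN argument of aggregate is not restricted to mean. As a small sketch (the object names `mean_mpg` and `n_models` are only illustrative), passing length instead counts the observations per group, which should reproduce the counts obtained with table above:

```r
# Mean mpg for each number of cylinders
mean_mpg <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)

# Number of models for each number of cylinders (same counts as table(mtcars$cyl))
n_models <- aggregate(mpg ~ cyl, data = mtcars, FUN = length)

print(mean_mpg)
print(n_models)
```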

Method 1(b): using the aggregate command, part-2

Now suppose we want to split the means according to two variables instead of a single variable. For example, in the above example, we want to find the mean of mpg split according to both cyl and gear. For each car, the number of cylinders is either 4, 6, or 8, and the number of gears is either 3, 4, or 5. Finding the number of cars for each combination of cylinders and gears is easy: we simply use the table function.

table(mtcars$cyl,mtcars$gear)
##    
##      3  4  5
##   4  1  8  2
##   6  2  4  1
##   8 12  0  2

The above says that there are, for example, 4 cars with 6 cylinders and 4 gears, and none with 8 cylinders and 4 gears. We now wish to find the mean of the mpg for each of these combinations. This is done as follows.

aggregate(mtcars$mpg,by=list(mtcars$cyl,mtcars$gear),FUN=mean)
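With an unnamed list in by, the grouping columns of the result get generic names (Group.1, Group.2) and the summary column is called x. The following sketch shows two ways that should give more readable column names: naming the list elements, or using the formula interface with two grouping variables. The object names `by_list` and `by_formula` are only illustrative.

```r
# Naming the list elements names the grouping columns in the result
by_list <- aggregate(mtcars$mpg,
                     by = list(cyl = mtcars$cyl, gear = mtcars$gear),
                     FUN = mean)

# The formula interface keeps the original variable names
by_formula <- aggregate(mpg ~ cyl + gear, data = mtcars, FUN = mean)

print(names(by_list))
print(names(by_formula))
```

Only combinations that actually occur in the data appear in the result, so the empty combination of 8 cylinders and 4 gears produces no row.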

Method 2(a): using the dplyr package, part-1

This method uses the dplyr package, which can be used to do slightly more advanced things, as will be shown below. (Note that dplyr may not behave as expected if the plyr package is loaded as well, since both packages define functions with the same names, such as summarize. In that case plyr should be detached before using dplyr.) This method uses two commands, group_by and summarize.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
by_cyl <- group_by(mtcars,cyl)
summarize(by_cyl,mean(mpg))

The result is a tibble rather than a plain data frame, but it can easily be converted into one using the as.data.frame command.
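The same computation can also be written as a pipeline with a named summary column, which makes the result easier to inspect. This is a sketch only; the column name `mean_mpg` and the object names are illustrative choices.

```r
library(dplyr)

# Group by cyl, then compute a named summary column
mean_by_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarize(mean_mpg = mean(mpg))

# Inspect the class of the summarize() result
print(class(mean_by_cyl))

# Convert to a plain data frame
plain <- as.data.frame(mean_by_cyl)
print(class(plain))
```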

Splitting the means according to more than one variable

As in method 1(b), suppose we now want the mean of mpg split according to both cyl and gear, whose combinations were tabulated above. With dplyr, this is done as follows.

by_cyl_gear <- group_by(mtcars,cyl,gear)
as.data.frame(summarize(by_cyl_gear,mean(mpg)))
## `summarise()` has grouped output by 'cyl'. You can override using the `.groups`
## argument.
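The message printed above refers to the .groups argument of summarize. As a hedged sketch, setting .groups = "drop" should return an ungrouped result and suppress the message; the column name `mean_mpg` is an illustrative choice.

```r
library(dplyr)

# .groups = "drop" removes all grouping from the result
res <- mtcars %>%
  group_by(cyl, gear) %>%
  summarize(mean_mpg = mean(mpg), .groups = "drop")

print(as.data.frame(res))
```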

The by_cyl_gear object can even be used to find the mean of more than one variable split across the different cylinder and gear combinations. In addition, we can use different functions for different variables. For example, suppose we want the means of mpg and drat and the sum of wt. We just include these in the above command:

as.data.frame(summarize(by_cyl_gear,mean(mpg),mean(drat),sum(wt)))
## `summarise()` has grouped output by 'cyl'. You can override using the `.groups`
## argument.

Method 2(b): using the dplyr package, part-2

Now suppose there are 100 variables in a data set whose means we want to find, split across the different combinations of some other variables. Typing each of the variables in the above command is tedious and in some cases simply not practical. This is where the summarize_each command comes in handy: it applies a function to every variable except the grouping variables. (As the warnings in the output indicate, summarize_each is deprecated in current versions of dplyr, although it still works.) If we do this in the above example, it will return the means of mpg, disp, hp, drat, wt etc. split by the cylinder and gear number.

as.data.frame(summarize_each(by_cyl_gear,funs(mean)))
## Warning: `summarise_each()` was deprecated in dplyr 0.7.0.
## ℹ Please use `across()` instead.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: `funs()` was deprecated in dplyr 0.8.0.
## ℹ Please use a list of either functions or lambdas:
## 
## # Simple named list: list(mean = mean, median = median)
## 
## # Auto named with `tibble::lst()`: tibble::lst(mean, median)
## 
## # Using lambdas list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
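The deprecation warning suggests across() as the replacement. A minimal sketch of the equivalent call, assuming dplyr 1.0.0 or later, could look like this; everything() selects all columns except the grouping variables.

```r
library(dplyr)

# Mean of every non-grouping column for each cyl/gear combination
all_means <- mtcars %>%
  group_by(cyl, gear) %>%
  summarize(across(everything(), mean), .groups = "drop")

print(as.data.frame(all_means))
```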

Now suppose that along with the mean, we also want the sum of these variables. This can be achieved simply by including it in the argument to funs above.

as.data.frame(summarize_each(by_cyl_gear,funs(mean,sum)))

Note that character variables cannot be summarized this way: mean returns NA (with a warning) for a character vector, and sum stops with an error, so such columns should be dropped or excluded first.
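One possible way to handle this, sketched here with across() from dplyr 1.0.0 or later, is to restrict the summary to numeric columns with where(is.numeric) and to pass the functions as a named list. The `cars` data frame and its `model` column are constructed only for this illustration.

```r
library(dplyr)

# Illustrative data: mtcars plus a character column holding the model name
cars <- mtcars
cars$model <- rownames(mtcars)

# Mean and sum of every numeric non-grouping column; 'model' is skipped
mean_and_sum <- cars %>%
  group_by(cyl, gear) %>%
  summarize(across(where(is.numeric), list(mean = mean, sum = sum)),
            .groups = "drop")

print(names(mean_and_sum))
```

The result columns are named after the variable and the list entry, for example mpg_mean and mpg_sum. Any function, including a user-defined one, can be placed in the list.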

User-defined functions

In the above we just used pre-defined functions such as mean and sum to illustrate the idea. The nice thing is that we can also have user-defined functions. For purposes of illustration let us define a simple function meanp2 which just adds 2 to the mean.

meanp2 <- function(z) {
  mean(z) + 2.0
}

For method 1(a), where we used the aggregate function to find the mean of a variable split according to the groups of another variable, we specify the user-defined function as

aggregate(mpg~cyl,data=mtcars,FUN=meanp2)
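If the user-defined function takes additional arguments, they can be supplied after FUN, since aggregate forwards further arguments to the function. The sketch below uses a hypothetical variant `meanp` with a `shift` parameter; neither name appears elsewhere in this post.

```r
# Hypothetical generalisation of meanp2: add an arbitrary shift to the mean
meanp <- function(z, shift = 2) {
  mean(z) + shift
}

# 'shift = 5' is forwarded by aggregate to meanp
plus5 <- aggregate(mpg ~ cyl, data = mtcars, FUN = meanp, shift = 5)

# Plain means for comparison
plain_means <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)

print(plus5)
```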

For method 1(b), which also used the aggregate function but could return the mean of a variable split by two or more other variables, we can pass the user-defined function (instead of the mean) as

aggregate(mtcars$mpg,by=list(mtcars$cyl,mtcars$gear),FUN=meanp2)

In method 2(a), where we used the dplyr package, we can pass the user-defined function with the command

as.data.frame(summarize(by_cyl,meanp2(mpg)))

In method 2(a), we also discussed how to find the means or sums (or any other pre-defined functions in R) of more than one variable, split by groups of two or more other variables. The way to pass the user-defined function is simply

as.data.frame(summarize(by_cyl_gear,mean(mpg),meanp2(mpg),mean(drat),sum(wt)))
## `summarise()` has grouped output by 'cyl'. You can override using the `.groups`
## argument.

where we are calculating the mean of mpg and drat, the sum of wt, and meanp2, which is a user-defined function, of mpg.
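When several summaries are computed at once, naming them makes the output easier to read. The following sketch repeats the call above with explicit column names; the names `mean_mpg`, `meanp2_mpg`, `mean_drat` and `sum_wt` are arbitrary choices for illustration.

```r
library(dplyr)

# User-defined function from the text: the mean plus 2
meanp2 <- function(z) {
  mean(z) + 2.0
}

# Different functions for different variables, each with a readable name
res <- mtcars %>%
  group_by(cyl, gear) %>%
  summarize(mean_mpg   = mean(mpg),
            meanp2_mpg = meanp2(mpg),
            mean_drat  = mean(drat),
            sum_wt     = sum(wt),
            .groups = "drop")

print(as.data.frame(res))
```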

In method 2(b), we used the summarize_each function to apply the mean and/or any other pre-defined function to all remaining variables, split by the groups of the grouping variables. To do the same for a user-defined function, we simply pass this function in the argument to funs as below, in this case the user-defined function being meanp2:

as.data.frame(summarize_each(by_cyl_gear,funs(mean,meanp2)))

In conclusion, we see that using R we can very neatly and quickly compute any pre-defined or user-defined functions of any number of variables, split according to the groups of certain other variables. We also have the freedom to specify which function should be computed for which variable. For example, in method 2(a), we found the mean of the variables mpg and drat, the sum of wt, and meanp2 of mpg, split according to the number of gears and number of cylinders.