Programming in R - tutorial: tapply() function in R

Author: Abhinav Agrawal

tapply() applies a function or operation to subsets of a vector, where the subsets are defined by a given factor variable.

To understand this, imagine we have the ages of 20 people (males and females) and need to know the average age of the males and of the females in this sample. We can first group the ages by gender, giving the ages of 12 males and the ages of 8 females, and then calculate the average age within each group.

Technically, this example has a quantitative variable, age, and a factor variable, gender. We split the quantitative variable into subsets by gender, which gave the ages of 12 males and the ages of 8 females. The averaging operation was then performed on each subset separately.

This is exactly what tapply() does!
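The age example can be sketched in a few lines of R. The ages below are invented purely for illustration (no actual values are given above), and the names age, gender and avg_age are hypothetical.

```r
# Made-up ages for 20 people: 12 males followed by 8 females (illustrative values only)
age <- c(23, 31, 45, 27, 36, 52, 29, 41, 33, 38, 26, 47,   # 12 males
         22, 34, 28, 40, 55, 30, 25, 37)                   # 8 females
gender <- factor(rep(c("male", "female"), times = c(12, 8)))

# Group sizes: how many ages fall into each level of the factor
table(gender)

# Mean age within each gender
avg_age <- tapply(age, gender, mean)
avg_age
```

The result has one entry per level of gender, each holding the mean of the ages in that group.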

Syntax of tapply: tapply(X, INDEX, FUN, …)

X = a vector; INDEX = a factor, or a list of one or more factors, each the same length as X; FUN = the function or operation to apply to each subset; … = optional arguments that are passed on to FUN
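As a small illustration of the optional arguments, the sketch below passes na.rm = TRUE through tapply() to mean(). The vectors score and group are hypothetical toy data, not part of any dataset used in this tutorial.

```r
# Toy data with one missing value (hypothetical, for illustration only)
score <- c(10, 20, NA, 40, 50, 60)
group <- factor(c("a", "a", "a", "b", "b", "b"))

# Without na.rm, the group that contains the NA has a mean of NA
with_na <- tapply(score, group, mean)

# Arguments given after FUN are handed on to FUN, here mean(x, na.rm = TRUE)
without_na <- tapply(score, group, mean, na.rm = TRUE)

with_na
without_na
```

In the first call group a comes out as NA; in the second, the missing value is dropped and group a averages to 15, while group b is 50 in both.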

We will use the built-in iris dataset for this example. First, load it and look at its structure.

data(iris)  # Load the dataset iris
str(iris)  # Structure of the dataset
## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

Let us calculate the mean of the sepal length over the whole dataset:

mean(iris$Sepal.Length)
## [1] 5.843

Now we want the mean sepal length broken down by species, so we use the tapply() function:

tapply(iris$Sepal.Length, iris$Species, mean)
##     setosa versicolor  virginica 
##      5.006      5.936      6.588
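To make the "subset, then apply" idea concrete, the same numbers can be reproduced in two explicit steps with split() and sapply(). This is only a sketch of an equivalent computation, not a description of how tapply() is implemented; the names by_species, manual and direct are hypothetical.

```r
# Step 1: break Sepal.Length into one numeric vector per species
by_species <- split(iris$Sepal.Length, iris$Species)

# Step 2: apply mean() to each piece
manual <- sapply(by_species, mean)

# tapply() does both steps in one call
direct <- tapply(iris$Sepal.Length, iris$Species, mean)

# Same group names and the same values
identical(names(manual), names(direct))
all.equal(as.vector(manual), as.vector(direct))
```

Both comparisons return TRUE. The two results differ only in their container: sapply() gives a named vector here, while tapply() gives a one-dimensional array.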

Now let us look at another example, this time using another built-in R dataset, mtcars.

data(mtcars)
str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

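We want the average mpg for each combination of transmission type (am, where 0 = automatic and 1 = manual) and number of cylinders (cyl). In other words, we want the average mpg grouped by both variables. Passing a list of two grouping vectors as INDEX gives a two-way table, with cylinder counts as rows and transmission codes as columns.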

tapply(mtcars$mpg, list(mtcars$cyl, mtcars$am), mean)
##       0     1
## 4 22.90 28.07
## 6 19.12 20.57
## 8 15.05 15.40
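The two-way result is an ordinary matrix, so it can be labelled and indexed like one. The sketch below names the elements of the INDEX list so that the dimensions carry labels, then looks up a single cell; the name mpg_by_group is hypothetical.

```r
# Naming the list elements labels the dimensions of the result
mpg_by_group <- tapply(mtcars$mpg, list(cyl = mtcars$cyl, am = mtcars$am), mean)

# Rows are cylinder counts, columns are am codes
dim(mpg_by_group)
dimnames(mpg_by_group)

# Look up a single cell by its labels: 8 cylinders, am = 1
mpg_by_group["8", "1"]
```

The row and column labels are character strings, which is why the lookup uses "8" and "1" rather than the numbers 8 and 1.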