1 Goal


The goal of this tutorial is to understand how to use the aggregate function. Through clear examples we will show the correct use of this function and some useful configurations. We will use the Iris dataset and aggregate the values obtained using the different species.


2 Data loading


# In this example we will use the open repository of plants classification Iris. 
data("iris")
str(iris)
## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

3 Problems using the aggregate function

3.1 Error example: sum function


# If we try to use the aggregate function stright away we will find some error due to non numerical columns in the dataset

# > aggregate(iris, by = list(iris$Species), FUN = sum)

Error in Summary.factor(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, : ‘sum’ not meaningful for factors


3.2 Warning example: mean function


# Using the mean function a warning will appear because non numerical attributes are inside of the dataset. 
# However the function is able to handle this issue and returns an NA value to non numerical attributes. 
aggregate(iris, by = list(iris$Species), FUN = mean)
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA

## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA

## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
##      Group.1 Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1     setosa        5.006       3.428        1.462       0.246      NA
## 2 versicolor        5.936       2.770        4.260       1.326      NA
## 3  virginica        6.588       2.974        5.552       2.026      NA
# The first parameter is the table we want to aggregate.
# The second parameter is a list we will use to aggregate. In this case the column Species of the original dataset.
# The FUN parameter is the function parameter used to aggregate. Some examples are: sum, mean, sd. 

4 Easy solution: Avoid non numerical variables


# We can create a new table with just the numerical attributes and handle the issue like this
iris_num <- iris[-5]
str(iris_num)
## 'data.frame':    150 obs. of  4 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
# Using the sum function
aggregate(iris_num, by = list(iris$Species), FUN = sum)
##      Group.1 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1     setosa        250.3       171.4         73.1        12.3
## 2 versicolor        296.8       138.5        213.0        66.3
## 3  virginica        329.4       148.7        277.6       101.3
# Using the mean function
aggregate(iris_num, by = list(iris$Species), FUN = mean)
##      Group.1 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1     setosa        5.006       3.428        1.462       0.246
## 2 versicolor        5.936       2.770        4.260       1.326
## 3  virginica        6.588       2.974        5.552       2.026

5 Fancy solution: Name correctly the first column and detect automatically the numerical attributes

5.1 Create function to detect automatically numerical columns


# Function returns dataset without non numerical variables
numeric_dataset <- function(Dataset){
  nums <- sapply(Dataset, is.numeric)
  return(Dataset[ , nums])
}

5.2 Apply function


# Only numerical attributes are kept
iris_num_fancy <- numeric_dataset(iris)
str(iris_num_fancy)
## 'data.frame':    150 obs. of  4 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...

6 Examples of different functions

6.1 Aggregate with sum


# Create dataset with the sum of the attributes for each species
iris_aggregate_sum <- aggregate(iris_num_fancy, by = list(iris$Species), FUN = sum)
colnames(iris_aggregate_sum)[1] <- "Species"
iris_aggregate_sum
##      Species Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1     setosa        250.3       171.4         73.1        12.3
## 2 versicolor        296.8       138.5        213.0        66.3
## 3  virginica        329.4       148.7        277.6       101.3

6.2 Aggregate with mean


# Create dataset with the mean of the attributes for each species
iris_aggregate_mean <- aggregate(iris_num_fancy, by = list(iris$Species), FUN = mean)
colnames(iris_aggregate_mean)[1] <- "Species"
iris_aggregate_mean
##      Species Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1     setosa        5.006       3.428        1.462       0.246
## 2 versicolor        5.936       2.770        4.260       1.326
## 3  virginica        6.588       2.974        5.552       2.026

6.3 Aggregate with mean if there are NAs in the dataframe


# Create dataset with the mean of the attributes for each species
iris_aggregate_mean <- aggregate(iris_num_fancy, by = list(iris$Species), FUN = function(x) mean(x, na.rm = TRUE))
colnames(iris_aggregate_mean)[1] <- "Species"
iris_aggregate_mean
##      Species Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1     setosa        5.006       3.428        1.462       0.246
## 2 versicolor        5.936       2.770        4.260       1.326
## 3  virginica        6.588       2.974        5.552       2.026

6.4 Aggregate with standard deviation


# Create dataset with the standard deviation of the attributes for each species
iris_aggregate_sd <- aggregate(iris_num_fancy, by = list(iris$Species), FUN = sd)
colnames(iris_aggregate_sd)[1] <- "Species"
iris_aggregate_sd
##      Species Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1     setosa    0.3524897   0.3790644    0.1736640   0.1053856
## 2 versicolor    0.5161711   0.3137983    0.4699110   0.1977527
## 3  virginica    0.6358796   0.3224966    0.5518947   0.2746501

6.5 Aggregate with max


# Create dataset with the max of the attributes for each species
iris_aggregate_max <- aggregate(iris_num_fancy, by = list(iris$Species), FUN = max)
colnames(iris_aggregate_max)[1] <- "Species"
iris_aggregate_max
##      Species Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1     setosa          5.8         4.4          1.9         0.6
## 2 versicolor          7.0         3.4          5.1         1.8
## 3  virginica          7.9         3.8          6.9         2.5

6.6 Aggregate with min


# Create dataset with the min of the attributes for each species
iris_aggregate_min <- aggregate(iris_num_fancy, by = list(iris$Species), FUN = min)
colnames(iris_aggregate_min)[1] <- "Species"
iris_aggregate_min
##      Species Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1     setosa          4.3         2.3          1.0         0.1
## 2 versicolor          4.9         2.0          3.0         1.0
## 3  virginica          4.9         2.2          4.5         1.4

7 Conclusion


In this tutorial we have shown how to use and the limitations of the aggregate function. The different examples can be useful for different parts of a typical analysis. The last part shows a more automatic approach that can be important if the number of variables both numerical and non-numerical is quite high to do it by hand.

The main idea to keep in mind is that only numerical attributes are allowed by aggregate.