The goal of this tutorial is to understand how to use the aggregate function. Through clear examples we will show the correct use of this function and some useful configurations. We will use the Iris dataset and aggregate the values obtained using the different species.
# In this example we will use the open repository of plants classification Iris.
data("iris")
str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
# If we try to use the aggregate function stright away we will find some error due to non numerical columns in the dataset
# > aggregate(iris, by = list(iris$Species), FUN = sum)
Error in Summary.factor(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, : ‘sum’ not meaningful for factors
# Using the mean function a warning will appear because non numerical attributes are inside of the dataset.
# However the function is able to handle this issue and returns an NA value to non numerical attributes.
aggregate(iris, by = list(iris$Species), FUN = mean)
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
## Group.1 Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 setosa 5.006 3.428 1.462 0.246 NA
## 2 versicolor 5.936 2.770 4.260 1.326 NA
## 3 virginica 6.588 2.974 5.552 2.026 NA
# The first parameter is the table we want to aggregate.
# The second parameter is a list we will use to aggregate. In this case the column Species of the original dataset.
# The FUN parameter is the function parameter used to aggregate. Some examples are: sum, mean, sd.
# We can create a new table with just the numerical attributes and handle the issue like this
iris_num <- iris[-5]
str(iris_num)
## 'data.frame': 150 obs. of 4 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
# Using the sum function
aggregate(iris_num, by = list(iris$Species), FUN = sum)
## Group.1 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 setosa 250.3 171.4 73.1 12.3
## 2 versicolor 296.8 138.5 213.0 66.3
## 3 virginica 329.4 148.7 277.6 101.3
# Using the mean function
aggregate(iris_num, by = list(iris$Species), FUN = mean)
## Group.1 Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 setosa 5.006 3.428 1.462 0.246
## 2 versicolor 5.936 2.770 4.260 1.326
## 3 virginica 6.588 2.974 5.552 2.026
# Function returns dataset without non numerical variables
numeric_dataset <- function(Dataset){
nums <- sapply(Dataset, is.numeric)
return(Dataset[ , nums])
}
# Only numerical attributes are kept
iris_num_fancy <- numeric_dataset(iris)
str(iris_num_fancy)
## 'data.frame': 150 obs. of 4 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
# Create dataset with the sum of the attributes for each species
iris_aggregate_sum <- aggregate(iris_num_fancy, by = list(iris$Species), FUN = sum)
colnames(iris_aggregate_sum)[1] <- "Species"
iris_aggregate_sum
## Species Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 setosa 250.3 171.4 73.1 12.3
## 2 versicolor 296.8 138.5 213.0 66.3
## 3 virginica 329.4 148.7 277.6 101.3
# Create dataset with the mean of the attributes for each species
iris_aggregate_mean <- aggregate(iris_num_fancy, by = list(iris$Species), FUN = mean)
colnames(iris_aggregate_mean)[1] <- "Species"
iris_aggregate_mean
## Species Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 setosa 5.006 3.428 1.462 0.246
## 2 versicolor 5.936 2.770 4.260 1.326
## 3 virginica 6.588 2.974 5.552 2.026
# Create dataset with the mean of the attributes for each species
iris_aggregate_mean <- aggregate(iris_num_fancy, by = list(iris$Species), FUN = function(x) mean(x, na.rm = TRUE))
colnames(iris_aggregate_mean)[1] <- "Species"
iris_aggregate_mean
## Species Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 setosa 5.006 3.428 1.462 0.246
## 2 versicolor 5.936 2.770 4.260 1.326
## 3 virginica 6.588 2.974 5.552 2.026
# Create dataset with the standard deviation of the attributes for each species
iris_aggregate_sd <- aggregate(iris_num_fancy, by = list(iris$Species), FUN = sd)
colnames(iris_aggregate_sd)[1] <- "Species"
iris_aggregate_sd
## Species Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 setosa 0.3524897 0.3790644 0.1736640 0.1053856
## 2 versicolor 0.5161711 0.3137983 0.4699110 0.1977527
## 3 virginica 0.6358796 0.3224966 0.5518947 0.2746501
# Create dataset with the max of the attributes for each species
iris_aggregate_max <- aggregate(iris_num_fancy, by = list(iris$Species), FUN = max)
colnames(iris_aggregate_max)[1] <- "Species"
iris_aggregate_max
## Species Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 setosa 5.8 4.4 1.9 0.6
## 2 versicolor 7.0 3.4 5.1 1.8
## 3 virginica 7.9 3.8 6.9 2.5
# Create dataset with the min of the attributes for each species
iris_aggregate_min <- aggregate(iris_num_fancy, by = list(iris$Species), FUN = min)
colnames(iris_aggregate_min)[1] <- "Species"
iris_aggregate_min
## Species Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 setosa 4.3 2.3 1.0 0.1
## 2 versicolor 4.9 2.0 3.0 1.0
## 3 virginica 4.9 2.2 4.5 1.4
In this tutorial we have shown how to use and the limitations of the aggregate function. The different examples can be useful for different parts of a typical analysis. The last part shows a more automatic approach that can be important if the number of variables both numerical and non-numerical is quite high to do it by hand.
The main idea to keep in mind is that only numerical attributes are allowed by aggregate.