Collapsing Data Frames

Prologue

The combination of entires for each row across the different columns may be unique.
However, elements within a single column often repeat themselve, thus forming groups.
You may be interested to compute some statistics, e.g. mean, sum, etc. for a variable for each group from the same column, i.e. by-group analysis/tabulation.
In this session, we will learn how to calculate simple statistics stratified by groups and then we will learn to select a single row, among many rows within the same group based on a statistic of interest, e.g. the row with the highest mean.
First, let’s create a data frame for us to use in this session.

# Set seed for reproduciblity
set.seed(888)

# Define vectors
col1 <- c(1:100)
col2 <- sample(c("Dog", "Cat", "Pikachu"), size=100, replace=TRUE)
col3 <- sample(c("Pink", "Blue", "Yellow"), size=100, replace=TRUE)
col4 <- rnorm(n=100, mean=5, sd=1)
col5 <- rnorm(n=100, mean=10, sd=2)
col6 <- runif(n=100, min=1, max=10)
col7 <- runif(n=100, min=1, max=10)
col8 <- runif(n=100, min=1, max=10)

# Create data frame
df <- data.frame("ID"=col1, "Animal"=col2, "Colour"=col3, "Weight"=col4, "Age"=col5, "Popularity_1"=col6, "Popularity_2"=col7, "Popularity_3"=col8)
head(df)

##   ID  Animal Colour   Weight       Age Popularity_1 Popularity_2
## 1  1     Dog Yellow 5.729918 10.055249     3.475518     8.039746
## 2  2     Cat   Pink 4.154880 12.855135     9.084591     8.183140
## 3  3     Dog   Blue 3.792285 13.096655     1.850557     2.558149
## 4  4 Pikachu Yellow 6.760088 12.471876     7.714097     9.637706
## 5  5 Pikachu   Pink 3.671745  8.126629     6.591289     4.228270
## 6  6     Dog Yellow 4.352879  9.523793     7.704855     6.027007
##   Popularity_3
## 1     8.181624
## 2     9.984992
## 3     8.841783
## 4     3.884761
## 5     4.876320
## 6     8.366354

Collapsing by 1 column

The by() function is suitable for the purpose of collapsing a data frame by a single column.
Let’s compute the mean age stratified by animal type.

by(df, df$Animal, function(x) {
  mean.age <- mean(x$Age)
})

## df$Animal: Cat
## [1] 10.07007
## -------------------------------------------------------- 
## df$Animal: Dog
## [1] 10.39474
## -------------------------------------------------------- 
## df$Animal: Pikachu
## [1] 9.675325

Alternatively, you may use the tapply() function.
Instead of taking in a data frame like the by() function, the tapply() function takes in a vector, specifically the vector whose statistic you are interested in.
The tapply() is also less verbose compared to the by() function for simple objectives.

tapply(df$Age, df$Animal, mean)

##       Cat       Dog   Pikachu 
## 10.070069 10.394743  9.675325

Collapsing by >1 column

The by() and tapply() function is a quick way to compute the by-group statistic based on the groupings of one column.
The aggregate() function is one way to compute the by-group statistic based on the grouping of more than 1 column.
Also, the aggregarte() function allows for computing the by-group statistic for several variables, e.g. Age, Weight etc. vs. just Age in this example.

# Compute statistic for 1 variable based on 1 column
aggregate(Age ~ Animal, data=df, mean)

##    Animal       Age
## 1     Cat 10.070069
## 2     Dog 10.394743
## 3 Pikachu  9.675325

# Compute statistic for 1 variable based on >1 column
aggregate(Age ~ Animal + Colour, data=df, mean)

##    Animal Colour       Age
## 1     Cat   Blue 10.364872
## 2     Dog   Blue 10.875155
## 3 Pikachu   Blue  9.702642
## 4     Cat   Pink  9.611790
## 5     Dog   Pink 10.010738
## 6 Pikachu   Pink  8.609669
## 7     Cat Yellow 10.293623
## 8     Dog Yellow 10.293161
## 9 Pikachu Yellow 10.394175

# Compute statistic for >1 variable based on 1 column
aggregate(cbind(Age, Weight, Popularity_1) ~ Animal, data=df, mean)

##    Animal       Age   Weight Popularity_1
## 1     Cat 10.070069 5.012885     6.186497
## 2     Dog 10.394743 5.238304     5.100087
## 3 Pikachu  9.675325 5.212584     5.658100

# Compute statistic for >1 variable based on >1 column
aggregate(cbind(Age, Weight, Popularity_1) ~ Animal + Colour, data=df, mean)

##    Animal Colour       Age   Weight Popularity_1
## 1     Cat   Blue 10.364872 5.280345     6.102481
## 2     Dog   Blue 10.875155 5.393524     4.621162
## 3 Pikachu   Blue  9.702642 5.290922     4.720847
## 4     Cat   Pink  9.611790 5.038958     6.105811
## 5     Dog   Pink 10.010738 5.243971     6.169592
## 6 Pikachu   Pink  8.609669 4.847720     6.746450
## 7     Cat Yellow 10.293623 4.810678     6.316468
## 8     Dog Yellow 10.293161 5.140614     4.858012
## 9 Pikachu Yellow 10.394175 5.410949     5.553495

Selecting representative rows

You may be interested in only retaining the one specific row among all the other rows within the same group, e.g. you may be interested in picking just one dog (from all the other dogs).
You may want do this selection by using some statistics, e.g. the row with the highest mean.

# Calculate mean for all 3 Popularity scores
df$PopularityMean <- rowMeans(df[ ,c("Popularity_1", "Popularity_2", "Popularity_3")])
head(df)

##   ID  Animal Colour   Weight       Age Popularity_1 Popularity_2
## 1  1     Dog Yellow 5.729918 10.055249     3.475518     8.039746
## 2  2     Cat   Pink 4.154880 12.855135     9.084591     8.183140
## 3  3     Dog   Blue 3.792285 13.096655     1.850557     2.558149
## 4  4 Pikachu Yellow 6.760088 12.471876     7.714097     9.637706
## 5  5 Pikachu   Pink 3.671745  8.126629     6.591289     4.228270
## 6  6     Dog Yellow 4.352879  9.523793     7.704855     6.027007
##   Popularity_3 PopularityMean
## 1     8.181624       6.565630
## 2     9.984992       9.084241
## 3     8.841783       4.416830
## 4     3.884761       7.078854
## 5     4.876320       5.231959
## 6     8.366354       7.366072

# Select a representative for each animal type. In this case, that one animal from each group with the highest mean Popularity score
collapsed <- aggregate(PopularityMean ~ Animal, data=df, max)
collapsed

##    Animal PopularityMean
## 1     Cat       9.197044
## 2     Dog       8.856893
## 3 Pikachu       8.834103

# Susbet the represenative rows in the original data frame using the mean Popularity score
df_rep <- df[grep(paste("^", collapsed$PopularityMean, "$" , sep="", collapse="|"), df$PopularityMean), ]
df_rep

##    ID  Animal Colour   Weight       Age Popularity_1 Popularity_2
## 8   8     Dog Yellow 4.919433  6.870818     7.297155     9.369084
## 30 30 Pikachu   Pink 4.275036 11.362936     9.801482     8.473748
## 69 69     Cat   Pink 4.225605 10.858256     9.319453     8.346067
##    Popularity_3 PopularityMean
## 8      9.904439       8.856893
## 30     8.227080       8.834103
## 69     9.925611       9.197044

For more info on subsetting data frames, refer to Subsetting Data Frames.