Collapsing Data Frames
Prologue
- The combination of entires for each row across the different columns may be unique.
- However, elements within a single column often repeat themselve, thus forming groups.
- You may be interested to compute some statistics, e.g. mean, sum, etc. for a variable for each group from the same column, i.e. by-group analysis/tabulation.
- In this session, we will learn how to calculate simple statistics stratified by groups and then we will learn to select a single row, among many rows within the same group based on a statistic of interest, e.g. the row with the highest mean.
- First, let’s create a data frame for us to use in this session.
# Set seed for reproduciblity
set.seed(888)
# Define vectors
col1 <- c(1:100)
col2 <- sample(c("Dog", "Cat", "Pikachu"), size=100, replace=TRUE)
col3 <- sample(c("Pink", "Blue", "Yellow"), size=100, replace=TRUE)
col4 <- rnorm(n=100, mean=5, sd=1)
col5 <- rnorm(n=100, mean=10, sd=2)
col6 <- runif(n=100, min=1, max=10)
col7 <- runif(n=100, min=1, max=10)
col8 <- runif(n=100, min=1, max=10)
# Create data frame
df <- data.frame("ID"=col1, "Animal"=col2, "Colour"=col3, "Weight"=col4, "Age"=col5, "Popularity_1"=col6, "Popularity_2"=col7, "Popularity_3"=col8)
head(df)
## ID Animal Colour Weight Age Popularity_1 Popularity_2
## 1 1 Dog Yellow 5.729918 10.055249 3.475518 8.039746
## 2 2 Cat Pink 4.154880 12.855135 9.084591 8.183140
## 3 3 Dog Blue 3.792285 13.096655 1.850557 2.558149
## 4 4 Pikachu Yellow 6.760088 12.471876 7.714097 9.637706
## 5 5 Pikachu Pink 3.671745 8.126629 6.591289 4.228270
## 6 6 Dog Yellow 4.352879 9.523793 7.704855 6.027007
## Popularity_3
## 1 8.181624
## 2 9.984992
## 3 8.841783
## 4 3.884761
## 5 4.876320
## 6 8.366354
Collapsing by 1 column
- The by() function is suitable for the purpose of collapsing a data frame by a single column.
- Let’s compute the mean age stratified by animal type.
by(df, df$Animal, function(x) {
mean.age <- mean(x$Age)
})
## df$Animal: Cat
## [1] 10.07007
## --------------------------------------------------------
## df$Animal: Dog
## [1] 10.39474
## --------------------------------------------------------
## df$Animal: Pikachu
## [1] 9.675325
- Alternatively, you may use the tapply() function.
- Instead of taking in a data frame like the by() function, the tapply() function takes in a vector, specifically the vector whose statistic you are interested in.
- The tapply() is also less verbose compared to the by() function for simple objectives.
tapply(df$Age, df$Animal, mean)
## Cat Dog Pikachu
## 10.070069 10.394743 9.675325
Collapsing by >1 column
- The by() and tapply() function is a quick way to compute the by-group statistic based on the groupings of one column.
- The aggregate() function is one way to compute the by-group statistic based on the grouping of more than 1 column.
- Also, the aggregarte() function allows for computing the by-group statistic for several variables, e.g. Age, Weight etc. vs. just Age in this example.
# Compute statistic for 1 variable based on 1 column
aggregate(Age ~ Animal, data=df, mean)
## Animal Age
## 1 Cat 10.070069
## 2 Dog 10.394743
## 3 Pikachu 9.675325
# Compute statistic for 1 variable based on >1 column
aggregate(Age ~ Animal + Colour, data=df, mean)
## Animal Colour Age
## 1 Cat Blue 10.364872
## 2 Dog Blue 10.875155
## 3 Pikachu Blue 9.702642
## 4 Cat Pink 9.611790
## 5 Dog Pink 10.010738
## 6 Pikachu Pink 8.609669
## 7 Cat Yellow 10.293623
## 8 Dog Yellow 10.293161
## 9 Pikachu Yellow 10.394175
# Compute statistic for >1 variable based on 1 column
aggregate(cbind(Age, Weight, Popularity_1) ~ Animal, data=df, mean)
## Animal Age Weight Popularity_1
## 1 Cat 10.070069 5.012885 6.186497
## 2 Dog 10.394743 5.238304 5.100087
## 3 Pikachu 9.675325 5.212584 5.658100
# Compute statistic for >1 variable based on >1 column
aggregate(cbind(Age, Weight, Popularity_1) ~ Animal + Colour, data=df, mean)
## Animal Colour Age Weight Popularity_1
## 1 Cat Blue 10.364872 5.280345 6.102481
## 2 Dog Blue 10.875155 5.393524 4.621162
## 3 Pikachu Blue 9.702642 5.290922 4.720847
## 4 Cat Pink 9.611790 5.038958 6.105811
## 5 Dog Pink 10.010738 5.243971 6.169592
## 6 Pikachu Pink 8.609669 4.847720 6.746450
## 7 Cat Yellow 10.293623 4.810678 6.316468
## 8 Dog Yellow 10.293161 5.140614 4.858012
## 9 Pikachu Yellow 10.394175 5.410949 5.553495
Selecting representative rows
- You may be interested in only retaining the one specific row among all the other rows within the same group, e.g. you may be interested in picking just one dog (from all the other dogs).
- You may want do this selection by using some statistics, e.g. the row with the highest mean.
# Calculate mean for all 3 Popularity scores
df$PopularityMean <- rowMeans(df[ ,c("Popularity_1", "Popularity_2", "Popularity_3")])
head(df)
## ID Animal Colour Weight Age Popularity_1 Popularity_2
## 1 1 Dog Yellow 5.729918 10.055249 3.475518 8.039746
## 2 2 Cat Pink 4.154880 12.855135 9.084591 8.183140
## 3 3 Dog Blue 3.792285 13.096655 1.850557 2.558149
## 4 4 Pikachu Yellow 6.760088 12.471876 7.714097 9.637706
## 5 5 Pikachu Pink 3.671745 8.126629 6.591289 4.228270
## 6 6 Dog Yellow 4.352879 9.523793 7.704855 6.027007
## Popularity_3 PopularityMean
## 1 8.181624 6.565630
## 2 9.984992 9.084241
## 3 8.841783 4.416830
## 4 3.884761 7.078854
## 5 4.876320 5.231959
## 6 8.366354 7.366072
# Select a representative for each animal type. In this case, that one animal from each group with the highest mean Popularity score
collapsed <- aggregate(PopularityMean ~ Animal, data=df, max)
collapsed
## Animal PopularityMean
## 1 Cat 9.197044
## 2 Dog 8.856893
## 3 Pikachu 8.834103
# Susbet the represenative rows in the original data frame using the mean Popularity score
df_rep <- df[grep(paste("^", collapsed$PopularityMean, "$" , sep="", collapse="|"), df$PopularityMean), ]
df_rep
## ID Animal Colour Weight Age Popularity_1 Popularity_2
## 8 8 Dog Yellow 4.919433 6.870818 7.297155 9.369084
## 30 30 Pikachu Pink 4.275036 11.362936 9.801482 8.473748
## 69 69 Cat Pink 4.225605 10.858256 9.319453 8.346067
## Popularity_3 PopularityMean
## 8 9.904439 8.856893
## 30 8.227080 8.834103
## 69 9.925611 9.197044