The bigvis package by Hadley Wickham can rapidly generate binned summaries of very large vectors, but it cannot directly generate grouped binned summaries from a data.frame.
The dplyr package, also by Hadley Wickham, groups large data.frames much more quickly than his plyr package or manual subsetting.
So let's combine the two packages to compute grouped binned summaries of big data!
bin_by is a helper function that bins a grouped data frame.
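library(ggplot2)
library(bigvis)
library(dplyr)
library(data.table)

# Run condense(bin(...)) within each group of .data, then combine the
# per-group results into a single data frame labelled by group.
bin_by <- function(.data, ...) {
  plyr:::list_to_dataframe(
    do(.data, function(.data, ...) {
      # Evaluate the binning call with the columns of .data in scope
      eval(substitute(condense(bin(...))), envir = .data, enclos = parent.frame())
    }, ...),
    labels = attr(.data, 'labels'))
}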
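To make concrete what a grouped binned summary contains, here is a minimal base-R sketch that uses neither bigvis nor dplyr. The function name grouped_bin_counts and its output column names are hypothetical, and it assumes bins are anchored at 0 with each value assigned to the left edge of its bin, which may not match how bigvis chooses its bin origin or treats missing values.

```r
# Hypothetical helper, not part of bigvis or dplyr: count observations per
# (group, bin) pair using only base R.
grouped_bin_counts <- function(x, group, binwidth) {
  # Left edge of the bin each value falls into (assumed bin origin of 0)
  bin_left <- floor(x / binwidth) * binwidth
  # Cross-tabulate bin by group and reshape to one row per combination
  counts <- as.data.frame(table(x = bin_left, name = group),
                          responseName = "count",
                          stringsAsFactors = FALSE)
  # table() stores the bin edges as labels, so convert them back to numbers
  counts$x <- as.numeric(counts$x)
  # Keep only non-empty (group, bin) combinations
  counts[counts$count > 0, ]
}

# Small illustrative data set, far smaller than the one used in this post
set.seed(1)
small <- data.frame(x = c(rnorm(1000), rnorm(1000, mean = 3)),
                    name = rep(c("normal", "right"), each = 1000))
small_bins <- grouped_bin_counts(small$x, small$name, binwidth = 1)
head(small_bins)
```

It is meant only to show the shape of the result (one row per non-empty group and bin), not to compete with bigvis on speed.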
num <- 10e6
binwidth <- 1
obs <- rbind(data.frame(x=rnorm(num), name='normal'),
             data.frame(x=rnorm(num, sd=6), name='wide'),
             data.frame(x=rnorm(num, mean=3), name='right'))
bins <- bin_by(obs, x, binwidth)
ggplot(data=bins, aes(x=x, y=.count)) +
  geom_line() +
  geom_area(alpha=1/10, position="identity")
## Warning: Removed 1 rows containing missing values (geom_path).
obs_g <- group_by(obs, name)
bins_g <- bin_by(obs_g, x, binwidth)
ggplot(data=bins_g, aes(x=x, y=.count, color=name, fill=name)) +
  geom_line() +
  geom_area(alpha=1/10, position="identity")
## Warning: Removed 3 rows containing missing values (geom_path).
With this dataset on our machines, binning first with these packages lets us plot grouped big data roughly 8x faster than the naive approach of having ggplot2 bin the raw data (about 18 s versus 135 s elapsed).
system.time({
  obs_g <- group_by(obs, name)
  bins_g <- bin_by(obs_g, x, binwidth)
  print(ggplot(data=bins_g, aes(x=x, y=.count, color=name, fill=name)) +
        geom_line() +
        geom_area(alpha=1/10, position="identity"))
})
## Warning: Removed 3 rows containing missing values (geom_path).
##    user  system elapsed
##  13.585   3.906  17.544
system.time({
  print(ggplot(obs, aes(x=x, color=name, fill=name)) +
        geom_area(stat='bin', binwidth=binwidth, alpha=1/10, position='identity'))
})
##    user  system elapsed
##  108.03   26.82  134.91
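The speedup implied by the two timings above can be computed directly from the reported elapsed values. The variable names here are illustrative, and the numbers are simply copied from the output shown; they will differ on other hardware.

```r
# Elapsed seconds copied from the two system.time() outputs above
elapsed_binned <- 17.544  # bigvis + dplyr summary, then plot
elapsed_naive  <- 134.91  # ggplot2 stat='bin' on the raw data

# Ratio of naive to pre-binned elapsed time
speedup <- elapsed_naive / elapsed_binned
round(speedup, 1)
```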
Author: Jim Hester
Created: 2013 Jul 31 03:00:49 PM
Last Modified: 2013 Aug 08 11:32:46 AM