Big Data Binning

The bigvis package by Hadley Wickham can rapidly generate binned summaries of very large vectors. However, it cannot directly generate grouped binned summaries from a data.frame.

The dplyr package, also by Hadley, can group large data.frames much more quickly than his plyr package or manual subsetting.

So let's combine the two packages to compute grouped binned summaries of big data!

Functions

bin_by: a helper function that bins a grouped data frame.

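library(ggplot2)
library(bigvis)
library(dplyr)
library(data.table)

# Bin and condense the columns named in ... within each group of .data.
bin_by <- function(.data, ...){
  plyr:::list_to_dataframe(do(.data,
                              function(.data, ...){
                                # Evaluate condense(bin(...)) with the group's
                                # columns in scope, so `x` resolves to .data$x.
                                eval(substitute(condense(bin(...))),
                                     envir=.data,
                                     enclos=parent.frame())
                              }
                              , ...)
  ,labels=attr(.data, 'labels'))
}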
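To make clear what a grouped binned summary contains, here is a minimal base R sketch of the same idea: counts per fixed-width bin within each group. The helper name bin_counts_by and its arguments are hypothetical and not part of bigvis or dplyr. It labels bins by their left edge, which may differ from how bigvis chooses bin origins and labels, and it makes no attempt to match bigvis's speed.

```r
# Hypothetical helper (not from bigvis or dplyr): count observations per
# fixed-width bin within each group, using base R only.
bin_counts_by <- function(x, group, binwidth) {
  # Label each value with the left edge of the bin it falls into.
  bin_left <- floor(x / binwidth) * binwidth
  # Cross-tabulate group against bin and flatten to one row per pair.
  counts <- as.data.frame(table(group = group, x = bin_left),
                          stringsAsFactors = FALSE)
  names(counts)[names(counts) == "Freq"] <- ".count"
  counts$x <- as.numeric(counts$x)
  # Drop group/bin combinations that contain no observations.
  counts[counts$.count > 0, ]
}

# Small illustrative data set, far smaller than the one used in this post.
set.seed(1)
small <- data.frame(x = c(rnorm(1000), rnorm(1000, mean = 3)),
                    name = rep(c("normal", "right"), each = 1000))
head(bin_counts_by(small$x, small$name, binwidth = 1))
```

The result has one row per non-empty group and bin, which is the shape the plots in this post rely on: a bin position, a count, and the grouping column.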

Generate example data

num <- 10e6  # 10 million draws per distribution, 30 million rows in total
binwidth <- 1
obs <- rbind(data.frame(x=rnorm(num), name='normal'),
             data.frame(x=rnorm(num, sd=6), name='wide'),
             data.frame(x=rnorm(num, mean=3), name='right'))

Compute counts

All

bins <- bin_by(obs, x, binwidth)
ggplot(data=bins, aes(x=x, y=.count)) + geom_line() + geom_area(alpha=1/10, position="identity")
## Warning: Removed 1 rows containing missing values (geom_path).

plot of chunk big_plot_all
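The "Removed 1 rows" warning means one row of the binned result has a missing value in a plotted column. A plausible cause, stated here as an assumption rather than verified behaviour, is that the condensed output includes a bin for missing values whose x is NA. A minimal sketch of dropping such rows before plotting, using a toy data frame (bins_toy is a made-up stand-in, not real output):

```r
# Toy stand-in for a condensed result; the first row mimics a bin with x = NA.
bins_toy <- data.frame(x = c(NA, -1, 0, 1), .count = c(0, 3, 5, 2))

# Keep only rows with a usable bin position so nothing is dropped at plot time.
bins_clean <- bins_toy[!is.na(bins_toy$x), ]
bins_clean
```

Applying the same filter to bins before calling ggplot should silence the warning if the assumption holds. It is worth inspecting the affected rows first to confirm they carry no counts you care about.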

Grouped

obs_g <- group_by(obs, name)
bins_g <- bin_by(obs_g, x, binwidth)
ggplot(data=bins_g, aes(x=x, y=.count, color=name, fill=name)) +
  geom_line() + geom_area(alpha=1/10, position="identity")
## Warning: Removed 3 rows containing missing values (geom_path).

plot of chunk big_plot_group

Timing

On this dataset and our machines, combining these packages plots grouped big data roughly 8x faster than the naive approach of letting ggplot2 bin the raw data (about 17.5 s versus 135 s elapsed).

system.time({
  obs_g <- group_by(obs, name)
  bins_g <- bin_by(obs_g, x, binwidth)
  print(ggplot(data=bins_g, aes(x=x, y=.count, color=name, fill=name)) +
          geom_line() + geom_area(alpha=1/10, position="identity"))
})
## Warning: Removed 3 rows containing missing values (geom_path).

plot of chunk big_plot_group2

##    user  system elapsed 
##  13.585   3.906  17.544
system.time({
  print(ggplot(obs, aes(x=x, color=name, fill=name)) +
          geom_area(stat='bin', binwidth=binwidth, alpha=1/10, position='identity'))
})

plot of chunk big_plot_group2

##    user  system elapsed 
##  108.03   26.82  134.91
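As a quick check on the speedup, the ratio can be computed from the two elapsed times reported above. The variable names are only illustrative, and the figures come from a single run, so treat the result as a rough indication.

```r
# Elapsed seconds copied from the two system.time() outputs above.
binned_elapsed <- 17.544   # group_by + bin_by + plot
naive_elapsed  <- 134.91   # ggplot2 binning the raw data itself

# Ratio of naive to binned elapsed time.
speedup <- naive_elapsed / binned_elapsed
round(speedup, 1)  # 7.7
```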

Author: Jim Hester Created: 2013 Jul 31 03:00:49 PM Last Modified: 2013 Aug 08 11:32:46 AM