Correlation Story with Aster R and toaster Examples

Executive Summary

Calculating correlation coefficients is often starting point before more advanced steps take place. In big data world often both long (big data) and wide (many features) data is analyzed across complex structure using in-group correlations (calculations across multiple subsets). Aster R big data analytics offers various methods to calculate correlations at scale in big data settings. They differ in functionality, performance, etc. that could be implementation and could be summarized in the table below:

Correlation Methods with Aster R
Method / Properties	Feature Permutations	Group Calculations	SQL-MR	In-database R
Aster R ta.cor	N	N	Y	N
Aster R ta.tapply (in-database)	N	Y	N	Y
toaster computeCorrelations	Y	Y	Y	N

Connecting to Aster

Load Aster R (which in turn loads RODBC) and make ODBC connection to Aster.
This connection is saved into conn object for use with toaster package.

library(TeradataAsterR)

## Loading required package: RODBC

library(toaster)
library(ggplot2)
library(ggthemes)
library(reshape2)

conn = ta.connect("PreSalesCluster1-dallas",  database ='dallas')

options(warn=-1)

Dataset

Lahman Baseball Database

Lahman baseball database (Aster script to create it can be found here) has the table batting with batting stats per player per year per single stint with a team.

Battings Stats

Having established connection to the database above we can load data into Aster R virtual data frame without actually moving any data between R and Aster yet:

batting_all = ta.data.frame("batting")

ta.nrow(batting_all)

## [1] 101332

ta.head(batting_all, n=10)

##     playerid yearid stint teamid lgid  g  ab  r  h x2b x3b hr rbi sb cs bb
## 1   bassjo01   1871     1    CL1   NA 22  89 18 27   1  10  3  18  0  1  3
## 2  beaveed01   1871     1    TRO   NA  3  15  7  6   0   0  0   5  2  0  0
## 3  bechtge01   1871     1    PH1   NA 20  94 24 33   9   1  1  21  4  0  2
## 4  biermch01   1871     1    FW1   NA  1   2  0  0   0   0  0   0  0  0  1
## 5  carleji01   1871     1    CL1   NA 29 127 31 32   8   1  0  18  2  1  8
## 6  fishech01   1871     1    RC1   NA 25 123 24 28   3   3  1  22  1  2  3
## 7  flynncl01   1871     1    TRO   NA 29 142 43 48   6   1  0  27  3  3  4
## 8  fulmech01   1871     1    RC1   NA 16  63 11 17   1   3  0   3  0  0  5
## 9  mcbridi01   1871     1    PH1   NA 25 132 36 31   3   0  0  17  4  0  7
## 10 minched01   1871     1    FW1   NA  9  36  4  8   0   0  0   5  1  0  0
##    so ibb hbp sh sf gidp
## 1   4  NA  NA NA NA   NA
## 2   0  NA  NA NA NA   NA
## 3   2  NA  NA NA NA   NA
## 4   0  NA  NA NA NA   NA
## 5   3  NA  NA NA NA   NA
## 6   1  NA  NA NA NA   NA
## 7   2  NA  NA NA NA   NA
## 8   1  NA  NA NA NA   NA
## 9   1  NA  NA NA NA   NA
## 10  0  NA  NA NA NA   NA

Subsetting Data

Lahman database stores complete histoy of baseball statistics beginning with XIX-th century, but we restrict analysis to last 35 years or so:

ta.colMins(batting_all$yearid)

## yearid 
##   1871

bat = ta.subset(batting_all[,1:17], yearid >= 1980)

ta.nrow(bat)

## [1] 43971

ta.head(bat, n=10)

##     playerid yearid stint teamid lgid   g  ab  r   h x2b x3b hr rbi sb cs
## 1  arroyfe01   1980     1    MIN   AL  21  NA NA  NA  NA  NA NA  NA NA NA
## 2  bandosa01   1980     1    ML4   AL  78 254 28  50  12   1  5  31  5  3
## 3  bannifl01   1980     1    SEA   AL  32  NA NA  NA  NA  NA NA  NA NA NA
## 4   barrji01   1980     1    CAL   AL  24  NA NA  NA  NA  NA NA  NA NA NA
## 5   bassra01   1980     1    SDN   NL  19  49  5  14   0   1  3   8  0  0
## 6  berenbr01   1980     1    CIN   NL   6   7  1   0   0   0  0   0  0  0
## 7  bergmda01   1980     1    HOU   NL  90  78 12  20   6   1  0   3  1  0
## 8  carbobe01   1980     1    SLN   NL  14  11  0   2   0   0  0   0  0  0
## 9  carbobe01   1980     2    PIT   NL   7   6  0   2   0   0  0   1  0  0
## 10 dauerri01   1980     1    BAL   AL 152 557 71 158  32   0  2  63  3  2
##    bb so
## 1  NA NA
## 2  29 35
## 3  NA NA
## 4  NA NA
## 5   7  7
## 6   0  2
## 7  10 10
## 8   1  0
## 9   1  1
## 10 46 19

Note ta.subset constrained data both ways: horizontally by filtering in time yearid >= 1980 and vertically by selecting only leading 17 columns [,1:17] .

Basic correlation with Aster R

Aster R ta.cor does correlation computation over Aster table defined with virtual data frame: Function ta.cor returns a list that contains just one element - virtual data frame defined as a query that computes correlations requred in Aster. What important here is to realize that ta.cor has not computed correlations in Aster yet. It only parsed its input to construct an Aster query utilizing SQL/MR function that computes correlations. In R this is called lazy execution, but it’s not obvious that Aster R does it that way. The execution is triggered when we access results by calling function head on virtual data frame cor.asterr[[1]] and it takes place in Aster:

t = proc.time()
cor.asterr = ta.cor(bat, c('g:ab','g:r','r:h','sb:h'))
tmp = head(cor.asterr[[1]])
basicCorTime = proc.time() - t

tmp

##   corr    value
## 1  g:r 0.889582
## 2 sb:h 0.604084
## 3 g:ab 0.927144
## 4  r:h 0.974202

Visualization

As the list of correlation coefficients is short and far from exhaustive we can use simple bar plot to visualize results:

df = as.data.frame(cor.asterr[[1]])
ggplot(data = df, mapping = aes(corr, value)) +
  geom_bar(aes(fill=corr), stat="identity") +
  labs(title = "Correlation with Aster R ta.cor", x = "features", y = "correlation Value") +
  theme_tufte(ticks=FALSE, base_size = 16) +
  theme(legend.position = "none")

Criticism of ta.cor

Lazy Execution with Aster R (not really a criticism)

For one, it is not obvious how it works unitl one really does play with this function for a bit. But its behavior is very much in line with general Aster R way of returning results and executing code, so this point is somewhat subjective.

Attribute Pairs with Wide Tables

Next, ta.cor makes it very hard to work with wide tables that contain many numeric attributes. Since it lets parameter column.pairs accept only explicitly defined pairs the more attributes your table has the less convinient it becomes to analyze correlations. You may even run R like this that gets you all pair combinations of the attributes:

attributes = names(bat)[6:length(names(bat))]
tmp = expand.grid(attributes, attributes, stringsAsFactors = FALSE)[1:(length(attributes)^2/2),]
pairs.to.correlate = apply(tmp[tmp$Var1!=tmp$Var2,], 1, paste, collapse=":")
pairs.to.correlate

##         2         3         4         5         6         7         8 
##    "ab:g"     "r:g"     "h:g"   "x2b:g"   "x3b:g"    "hr:g"   "rbi:g" 
##         9        10        11        12        13        15        16 
##    "sb:g"    "cs:g"    "bb:g"    "so:g"    "g:ab"    "r:ab"    "h:ab" 
##        17        18        19        20        21        22        23 
##  "x2b:ab"  "x3b:ab"   "hr:ab"  "rbi:ab"   "sb:ab"   "cs:ab"   "bb:ab" 
##        24        25        26        28        29        30        31 
##   "so:ab"     "g:r"    "ab:r"     "h:r"   "x2b:r"   "x3b:r"    "hr:r" 
##        32        33        34        35        36        37        38 
##   "rbi:r"    "sb:r"    "cs:r"    "bb:r"    "so:r"     "g:h"    "ab:h" 
##        39        41        42        43        44        45        46 
##     "r:h"   "x2b:h"   "x3b:h"    "hr:h"   "rbi:h"    "sb:h"    "cs:h" 
##        47        48        49        50        51        52        54 
##    "bb:h"    "so:h"   "g:x2b"  "ab:x2b"   "r:x2b"   "h:x2b" "x3b:x2b" 
##        55        56        57        58        59        60        61 
##  "hr:x2b" "rbi:x2b"  "sb:x2b"  "cs:x2b"  "bb:x2b"  "so:x2b"   "g:x3b" 
##        62        63        64        65        67        68        69 
##  "ab:x3b"   "r:x3b"   "h:x3b" "x2b:x3b"  "hr:x3b" "rbi:x3b"  "sb:x3b" 
##        70        71        72 
##  "cs:x3b"  "bb:x3b"  "so:x3b"

It does get the job done but it looks really heavy. If you can get this accomplished any simplier would be great, but having ta.cor taking care of this is the best solution. Meanwhile, we incorporate this into correlations:

t = proc.time()
cor.asterr.auto = ta.cor(bat, pairs.to.correlate)
tmp = head(cor.asterr.auto[[1]])
basicAutoCorTime = proc.time() - t

tmp

##       corr    value
## 1   x2b:ab 0.949511
## 2    g:x2b 0.875669
## 3    rbi:h 0.941514
## 4    x3b:h 0.689285
## 5   ab:x3b 0.682406
## 6    x2b:r 0.940411
## 7     bb:r 0.915290
## 8   bb:x2b 0.857618
## 9     cs:r 0.682049
## 10  cs:x2b 0.600180
## 11    hr:r 0.846858
## 12    so:r 0.879058
## 13    ab:r 0.964957
## 14  ab:x2b 0.949511
## 15    sb:r 0.637612
## 16 x3b:x2b 0.621563
## 17    bb:g 0.833575
## 18    cs:g 0.626081
## 19    hr:g 0.735561
## 20  sb:x2b 0.526301
## 21    so:g 0.842468
## 22    h:ab 0.990564
## 23     h:g 0.913908
## 24   h:x3b 0.689285
## 25    r:ab 0.964957
## 26     r:g 0.889582
## 27   r:x3b 0.689643
## 28     g:r 0.889582
## 29  hr:x3b 0.431390
## 30  so:x3b 0.581723
## 31   x2b:g 0.875669
## 32  bb:x3b 0.565886
## 33  cs:x3b 0.658777
## 34   rbi:r 0.940341
## 35 rbi:x2b 0.927981
## 36 x2b:x3b 0.621563
## 37   x3b:r 0.689643
## 38    ab:h 0.990564
## 39   bb:ab 0.885078
## 40   cs:ab 0.676318
## 41   hr:ab 0.798933
## 42    sb:h 0.604084
## 43   so:ab 0.902529
## 44    g:ab 0.927144
## 45   g:x3b 0.628107
## 46     g:h 0.913908
## 47     h:r 0.974202
## 48   rbi:g 0.863682
## 49  sb:x3b 0.653189
## 50   x3b:g 0.628107
## 51    bb:h 0.882800
## 52    cs:h 0.676540
## 53    hr:h 0.805181
## 54   sb:ab 0.596866
## 55    so:h 0.872335
## 56  rbi:ab 0.934018
## 57  x3b:ab 0.682406
## 58     r:h 0.974202
## 59 rbi:x3b 0.573167
## 60   h:x2b 0.959213
## 61   r:x2b 0.940411
## 62   x2b:h 0.959213
## 63    ab:g 0.927144
## 64  hr:x2b 0.803838
## 65    sb:g 0.540930
## 66  so:x2b 0.855860

df = as.data.frame(cor.asterr.auto[[1]])
ggplot(data = df, mapping = aes(corr, value)) +
  geom_bar(aes(fill=corr), stat="identity") +
  labs(title = "Correlation with Aster R ta.cor", x = "features", y = "correlation Value") +
  theme_tufte(ticks=FALSE, base_size = 16) +
  theme(legend.position = "none",
        axis.text.x = element_text(angle = 90, hjust = 1, vjust=0))

Correlations within Groups

Finally, it is common for large tables to divide rows into groups to have analysis including correlations performed for each group. In fact, Aster SQL/MR functions usually have such option built-in and its correlation functions are no exceptions. But ta.cor does not offer this feature and thus we can not easily produce groupwise correlation in batting table.

In-Database Correlation with Groups

In baseball such group-wise correlations would be within each league or team. To answer this question we can utilize in-database R:

t = proc.time()
cor.asterr.indb = ta.tapply(bat, INDEX=bat$teamid, 
                          FUN = function(x) {c(cor(x$g, x$ab, use="complete.obs"),
                                               cor(x$g, x$r, use="complete.obs"),
                                               cor(x$g, x$h, use="complete.obs")
                                               )})
inDbCorTime = proc.time() - t

str(cor.asterr.indb)

## List of 35
##  $ KCA: num [1:3] 0.945 0.908 0.93
##  $ MIA: num [1:3] 0.897 0.854 0.884
##  $ DET: num [1:3] 0.939 0.908 0.924
##  $ SFN: num [1:3] 0.913 0.867 0.902
##  $ SLN: num [1:3] 0.916 0.87 0.904
##  $ PHI: num [1:3] 0.917 0.877 0.907
##  $ CHN: num [1:3] 0.912 0.876 0.9
##  $ OAK: num [1:3] 0.935 0.902 0.924
##  $ WAS: num [1:3] 0.899 0.865 0.887
##  $ BAL: num [1:3] 0.938 0.908 0.924
##  $ TEX: num [1:3] 0.941 0.908 0.927
##  $ TOR: num [1:3] 0.941 0.908 0.93
##  $ ARI: num [1:3] 0.919 0.892 0.911
##  $ CAL: num [1:3] 0.968 0.912 0.946
##  $ SEA: num [1:3] 0.939 0.898 0.923
##  $ TBA: num [1:3] 0.923 0.895 0.91
##  $ ANA: num [1:3] 0.935 0.908 0.926
##  $ CHA: num [1:3] 0.939 0.904 0.926
##  $ COL: num [1:3] 0.904 0.875 0.892
##  $ HOU: num [1:3] 0.921 0.877 0.908
##  $ MIL: num [1:3] 0.913 0.881 0.902
##  $ CIN: num [1:3] 0.914 0.876 0.902
##  $ CLE: num [1:3] 0.942 0.904 0.93
##  $ MIN: num [1:3] 0.929 0.903 0.915
##  $ MON: num [1:3] 0.92 0.874 0.903
##  $ BOS: num [1:3] 0.946 0.915 0.932
##  $ LAA: num [1:3] 0.926 0.901 0.916
##  $ NYA: num [1:3] 0.942 0.905 0.925
##  $ ATL: num [1:3] 0.915 0.874 0.903
##  $ FLO: num [1:3] 0.917 0.881 0.906
##  $ ML4: num [1:3] 0.959 0.911 0.94
##  $ LAN: num [1:3] 0.916 0.877 0.902
##  $ NYN: num [1:3] 0.916 0.873 0.903
##  $ PIT: num [1:3] 0.92 0.884 0.908
##  $ SDN: num [1:3] 0.922 0.882 0.905

head(cor.asterr.indb, n=10)

## $KCA
## [1] 0.9445238 0.9078436 0.9302269
## 
## $MIA
## [1] 0.8972361 0.8540480 0.8843923
## 
## $DET
## [1] 0.9385825 0.9084073 0.9243898
## 
## $SFN
## [1] 0.9132162 0.8674165 0.9022762
## 
## $SLN
## [1] 0.9157595 0.8696552 0.9038536
## 
## $PHI
## [1] 0.9167811 0.8766890 0.9070337
## 
## $CHN
## [1] 0.9117595 0.8760679 0.8995387
## 
## $OAK
## [1] 0.9346988 0.9024481 0.9244367
## 
## $WAS
## [1] 0.8992372 0.8652082 0.8872238
## 
## $BAL
## [1] 0.9384184 0.9084005 0.9244904

Visualization

In-database R returns rather specialized data structure that needs serious refactoring before using it:

df = as.data.frame(t(sapply(cor.asterr.indb, function(x) x)))
names(df) = c("g:ab", "g:r", "g:h")
df = cbind(team = names(cor.asterr.indb), df)
df = melt(df, id.vars = 'team', measure.vars = c("g:ab", "g:r", "g:h"))

ggplot(data = df, mapping = aes(team, value)) +
  geom_bar(aes(fill=team), stat="identity") +
  facet_wrap(~variable, ncol=3) +
  labs(title="Correlation with In-database R", x="teams", y="correlation value") +
  theme_tufte(ticks = FALSE, base_size = 16) +
  theme(legend.position = "none") +
  coord_flip()

In-Database Execution = In-Database R

The most important point about code above is that correlations are no longer computed with Aster SQL/MR functions. Instead, ta.tapply sends anonymous function defined within its body to Aster to execute inside Aster R sandboxes on each of its workers. Still, all distribution and parallel features of Aster take care of data and execution but R cor function did actual computations across the table and its groups.

Still No Love for Attribute Combinations

Again, as with ta.cor we either have all attribute pairs listed explicitly or have additional R code to produce them.

Correlation with toaster

We would finish by showing how the same correlations are computed with open-source package toaster and how its function computeCorrelations addresses problems discussed above:

t = proc.time()
cor.toaster = computeCorrelations(conn, "batting", 
                                  include=c('g','ab','r','h','sb','cs','bb','so'),
                                  by='teamid', 
                                  where='yearid >= 1980')
toasterCorTime = proc.time() - t

dim(cor.toaster)

## [1] 1960    6

head(cor.toaster)

##   teamid  corr    value metric1 metric2 sign
## 1    ANA ab:cs 0.752329      ab      cs    1
## 2    ARI ab:cs 0.741179      ab      cs    1
## 3    ATL ab:cs 0.664995      ab      cs    1
## 4    BAL ab:cs 0.620531      ab      cs    1
## 5    BOS ab:cs 0.646773      ab      cs    1
## 6    CAL ab:cs 0.597666      ab      cs    1

It turns out that computeCorrelations offers both grouping of correlations by arbitrary attributes (using argument by) and produces all possible combinations for the attributes given with argument include. At the same time it’s important to remember that toaster functions that start with compute always run all calculations within Aster by utilizing its SQL/MR functions and SQL.

Visualization

It turns out that toaster also offers way to effectively and quickly visualize correlations even given many features and groups with its function createBubblechart. For clarity we included in the plot just 6 teams out of 35 available:

createBubblechart(cor.toaster[cor.toaster$teamid %in% c('DET','PHI','NYN','TEX','MIA','MIL'),], 
                  "metric1", "metric2", "value", label = NULL, fill = "value",
                  facet = "teamid", ncol = 2,
                  title="Correlation with toaster", xlab = NULL, ylab = NULL,
                  baseSize = 16) +
  scale_fill_gradient(low="cadetblue1", high="blue")

Execution Times

Between 3 methods used we observed differences in how they use features to correlate on and their ability to group a data set. But we also measured performance for each method to compare how they performed on relatively large (still small in Aster world) data set of baseball statistics:

Correlation Story with Aster R and toaster Examples

Gregory Kanevsky

December 8, 2016

Executive Summary

Connecting to Aster

Dataset

Lahman Baseball Database

Battings Stats

Subsetting Data

Basic correlation with Aster R

Visualization

Criticism of ta.cor

Lazy Execution with Aster R (not really a criticism)

Attribute Pairs with Wide Tables

Correlations within Groups

In-Database Correlation with Groups

Visualization

In-Database Execution = In-Database R

Still No Love for Attribute Combinations

Correlation with toaster

Visualization

Execution Times