Calculating correlation coefficients is often starting point before more advanced steps take place. In big data world often both long (big data) and wide (many features) data is analyzed across complex structure using in-group correlations (calculations across multiple subsets). Aster R big data analytics offers various methods to calculate correlations at scale in big data settings. They differ in functionality, performance, etc. that could be implementation and could be summarized in the table below:
| Method / Properties | Feature Permutations | Group Calculations | SQL-MR | In-database R |
|---|---|---|---|---|
| Aster R ta.cor | N | N | Y | N |
| Aster R ta.tapply (in-database) | N | Y | N | Y |
| toaster computeCorrelations | Y | Y | Y | N |
Load Aster R (which in turn loads RODBC) and make ODBC connection to Aster.
This connection is saved into conn object for use with toaster package.
library(TeradataAsterR)
## Loading required package: RODBC
library(toaster)
library(ggplot2)
library(ggthemes)
library(reshape2)
conn = ta.connect("PreSalesCluster1-dallas", database ='dallas')
options(warn=-1)
Lahman baseball database (Aster script to create it can be found here) has the table batting with batting stats per player per year per single stint with a team.
Having established connection to the database above we can load data into Aster R virtual data frame without actually moving any data between R and Aster yet:
batting_all = ta.data.frame("batting")
ta.nrow(batting_all)
## [1] 101332
ta.head(batting_all, n=10)
## playerid yearid stint teamid lgid g ab r h x2b x3b hr rbi sb cs bb
## 1 bassjo01 1871 1 CL1 NA 22 89 18 27 1 10 3 18 0 1 3
## 2 beaveed01 1871 1 TRO NA 3 15 7 6 0 0 0 5 2 0 0
## 3 bechtge01 1871 1 PH1 NA 20 94 24 33 9 1 1 21 4 0 2
## 4 biermch01 1871 1 FW1 NA 1 2 0 0 0 0 0 0 0 0 1
## 5 carleji01 1871 1 CL1 NA 29 127 31 32 8 1 0 18 2 1 8
## 6 fishech01 1871 1 RC1 NA 25 123 24 28 3 3 1 22 1 2 3
## 7 flynncl01 1871 1 TRO NA 29 142 43 48 6 1 0 27 3 3 4
## 8 fulmech01 1871 1 RC1 NA 16 63 11 17 1 3 0 3 0 0 5
## 9 mcbridi01 1871 1 PH1 NA 25 132 36 31 3 0 0 17 4 0 7
## 10 minched01 1871 1 FW1 NA 9 36 4 8 0 0 0 5 1 0 0
## so ibb hbp sh sf gidp
## 1 4 NA NA NA NA NA
## 2 0 NA NA NA NA NA
## 3 2 NA NA NA NA NA
## 4 0 NA NA NA NA NA
## 5 3 NA NA NA NA NA
## 6 1 NA NA NA NA NA
## 7 2 NA NA NA NA NA
## 8 1 NA NA NA NA NA
## 9 1 NA NA NA NA NA
## 10 0 NA NA NA NA NA
Lahman database stores complete histoy of baseball statistics beginning with XIX-th century, but we restrict analysis to last 35 years or so:
ta.colMins(batting_all$yearid)
## yearid
## 1871
bat = ta.subset(batting_all[,1:17], yearid >= 1980)
ta.nrow(bat)
## [1] 43971
ta.head(bat, n=10)
## playerid yearid stint teamid lgid g ab r h x2b x3b hr rbi sb cs
## 1 arroyfe01 1980 1 MIN AL 21 NA NA NA NA NA NA NA NA NA
## 2 bandosa01 1980 1 ML4 AL 78 254 28 50 12 1 5 31 5 3
## 3 bannifl01 1980 1 SEA AL 32 NA NA NA NA NA NA NA NA NA
## 4 barrji01 1980 1 CAL AL 24 NA NA NA NA NA NA NA NA NA
## 5 bassra01 1980 1 SDN NL 19 49 5 14 0 1 3 8 0 0
## 6 berenbr01 1980 1 CIN NL 6 7 1 0 0 0 0 0 0 0
## 7 bergmda01 1980 1 HOU NL 90 78 12 20 6 1 0 3 1 0
## 8 carbobe01 1980 1 SLN NL 14 11 0 2 0 0 0 0 0 0
## 9 carbobe01 1980 2 PIT NL 7 6 0 2 0 0 0 1 0 0
## 10 dauerri01 1980 1 BAL AL 152 557 71 158 32 0 2 63 3 2
## bb so
## 1 NA NA
## 2 29 35
## 3 NA NA
## 4 NA NA
## 5 7 7
## 6 0 2
## 7 10 10
## 8 1 0
## 9 1 1
## 10 46 19
Note ta.subset constrained data both ways: horizontally by filtering in time yearid >= 1980 and vertically by selecting only leading 17 columns [,1:17] .
Aster R ta.cor does correlation computation over Aster table defined with virtual data frame: Function ta.cor returns a list that contains just one element - virtual data frame defined as a query that computes correlations requred in Aster. What important here is to realize that ta.cor has not computed correlations in Aster yet. It only parsed its input to construct an Aster query utilizing SQL/MR function that computes correlations. In R this is called lazy execution, but it’s not obvious that Aster R does it that way. The execution is triggered when we access results by calling function head on virtual data frame cor.asterr[[1]] and it takes place in Aster:
t = proc.time()
cor.asterr = ta.cor(bat, c('g:ab','g:r','r:h','sb:h'))
tmp = head(cor.asterr[[1]])
basicCorTime = proc.time() - t
tmp
## corr value
## 1 g:r 0.889582
## 2 sb:h 0.604084
## 3 g:ab 0.927144
## 4 r:h 0.974202
As the list of correlation coefficients is short and far from exhaustive we can use simple bar plot to visualize results:
df = as.data.frame(cor.asterr[[1]])
ggplot(data = df, mapping = aes(corr, value)) +
geom_bar(aes(fill=corr), stat="identity") +
labs(title = "Correlation with Aster R ta.cor", x = "features", y = "correlation Value") +
theme_tufte(ticks=FALSE, base_size = 16) +
theme(legend.position = "none")
For one, it is not obvious how it works unitl one really does play with this function for a bit. But its behavior is very much in line with general Aster R way of returning results and executing code, so this point is somewhat subjective.
Next, ta.cor makes it very hard to work with wide tables that contain many numeric attributes. Since it lets parameter column.pairs accept only explicitly defined pairs the more attributes your table has the less convinient it becomes to analyze correlations. You may even run R like this that gets you all pair combinations of the attributes:
attributes = names(bat)[6:length(names(bat))]
tmp = expand.grid(attributes, attributes, stringsAsFactors = FALSE)[1:(length(attributes)^2/2),]
pairs.to.correlate = apply(tmp[tmp$Var1!=tmp$Var2,], 1, paste, collapse=":")
pairs.to.correlate
## 2 3 4 5 6 7 8
## "ab:g" "r:g" "h:g" "x2b:g" "x3b:g" "hr:g" "rbi:g"
## 9 10 11 12 13 15 16
## "sb:g" "cs:g" "bb:g" "so:g" "g:ab" "r:ab" "h:ab"
## 17 18 19 20 21 22 23
## "x2b:ab" "x3b:ab" "hr:ab" "rbi:ab" "sb:ab" "cs:ab" "bb:ab"
## 24 25 26 28 29 30 31
## "so:ab" "g:r" "ab:r" "h:r" "x2b:r" "x3b:r" "hr:r"
## 32 33 34 35 36 37 38
## "rbi:r" "sb:r" "cs:r" "bb:r" "so:r" "g:h" "ab:h"
## 39 41 42 43 44 45 46
## "r:h" "x2b:h" "x3b:h" "hr:h" "rbi:h" "sb:h" "cs:h"
## 47 48 49 50 51 52 54
## "bb:h" "so:h" "g:x2b" "ab:x2b" "r:x2b" "h:x2b" "x3b:x2b"
## 55 56 57 58 59 60 61
## "hr:x2b" "rbi:x2b" "sb:x2b" "cs:x2b" "bb:x2b" "so:x2b" "g:x3b"
## 62 63 64 65 67 68 69
## "ab:x3b" "r:x3b" "h:x3b" "x2b:x3b" "hr:x3b" "rbi:x3b" "sb:x3b"
## 70 71 72
## "cs:x3b" "bb:x3b" "so:x3b"
It does get the job done but it looks really heavy. If you can get this accomplished any simplier would be great, but having ta.cor taking care of this is the best solution. Meanwhile, we incorporate this into correlations:
t = proc.time()
cor.asterr.auto = ta.cor(bat, pairs.to.correlate)
tmp = head(cor.asterr.auto[[1]])
basicAutoCorTime = proc.time() - t
tmp
## corr value
## 1 x2b:ab 0.949511
## 2 g:x2b 0.875669
## 3 rbi:h 0.941514
## 4 x3b:h 0.689285
## 5 ab:x3b 0.682406
## 6 x2b:r 0.940411
## 7 bb:r 0.915290
## 8 bb:x2b 0.857618
## 9 cs:r 0.682049
## 10 cs:x2b 0.600180
## 11 hr:r 0.846858
## 12 so:r 0.879058
## 13 ab:r 0.964957
## 14 ab:x2b 0.949511
## 15 sb:r 0.637612
## 16 x3b:x2b 0.621563
## 17 bb:g 0.833575
## 18 cs:g 0.626081
## 19 hr:g 0.735561
## 20 sb:x2b 0.526301
## 21 so:g 0.842468
## 22 h:ab 0.990564
## 23 h:g 0.913908
## 24 h:x3b 0.689285
## 25 r:ab 0.964957
## 26 r:g 0.889582
## 27 r:x3b 0.689643
## 28 g:r 0.889582
## 29 hr:x3b 0.431390
## 30 so:x3b 0.581723
## 31 x2b:g 0.875669
## 32 bb:x3b 0.565886
## 33 cs:x3b 0.658777
## 34 rbi:r 0.940341
## 35 rbi:x2b 0.927981
## 36 x2b:x3b 0.621563
## 37 x3b:r 0.689643
## 38 ab:h 0.990564
## 39 bb:ab 0.885078
## 40 cs:ab 0.676318
## 41 hr:ab 0.798933
## 42 sb:h 0.604084
## 43 so:ab 0.902529
## 44 g:ab 0.927144
## 45 g:x3b 0.628107
## 46 g:h 0.913908
## 47 h:r 0.974202
## 48 rbi:g 0.863682
## 49 sb:x3b 0.653189
## 50 x3b:g 0.628107
## 51 bb:h 0.882800
## 52 cs:h 0.676540
## 53 hr:h 0.805181
## 54 sb:ab 0.596866
## 55 so:h 0.872335
## 56 rbi:ab 0.934018
## 57 x3b:ab 0.682406
## 58 r:h 0.974202
## 59 rbi:x3b 0.573167
## 60 h:x2b 0.959213
## 61 r:x2b 0.940411
## 62 x2b:h 0.959213
## 63 ab:g 0.927144
## 64 hr:x2b 0.803838
## 65 sb:g 0.540930
## 66 so:x2b 0.855860
df = as.data.frame(cor.asterr.auto[[1]])
ggplot(data = df, mapping = aes(corr, value)) +
geom_bar(aes(fill=corr), stat="identity") +
labs(title = "Correlation with Aster R ta.cor", x = "features", y = "correlation Value") +
theme_tufte(ticks=FALSE, base_size = 16) +
theme(legend.position = "none",
axis.text.x = element_text(angle = 90, hjust = 1, vjust=0))
Finally, it is common for large tables to divide rows into groups to have analysis including correlations performed for each group. In fact, Aster SQL/MR functions usually have such option built-in and its correlation functions are no exceptions. But ta.cor does not offer this feature and thus we can not easily produce groupwise correlation in batting table.
In baseball such group-wise correlations would be within each league or team. To answer this question we can utilize in-database R:
t = proc.time()
cor.asterr.indb = ta.tapply(bat, INDEX=bat$teamid,
FUN = function(x) {c(cor(x$g, x$ab, use="complete.obs"),
cor(x$g, x$r, use="complete.obs"),
cor(x$g, x$h, use="complete.obs")
)})
inDbCorTime = proc.time() - t
str(cor.asterr.indb)
## List of 35
## $ KCA: num [1:3] 0.945 0.908 0.93
## $ MIA: num [1:3] 0.897 0.854 0.884
## $ DET: num [1:3] 0.939 0.908 0.924
## $ SFN: num [1:3] 0.913 0.867 0.902
## $ SLN: num [1:3] 0.916 0.87 0.904
## $ PHI: num [1:3] 0.917 0.877 0.907
## $ CHN: num [1:3] 0.912 0.876 0.9
## $ OAK: num [1:3] 0.935 0.902 0.924
## $ WAS: num [1:3] 0.899 0.865 0.887
## $ BAL: num [1:3] 0.938 0.908 0.924
## $ TEX: num [1:3] 0.941 0.908 0.927
## $ TOR: num [1:3] 0.941 0.908 0.93
## $ ARI: num [1:3] 0.919 0.892 0.911
## $ CAL: num [1:3] 0.968 0.912 0.946
## $ SEA: num [1:3] 0.939 0.898 0.923
## $ TBA: num [1:3] 0.923 0.895 0.91
## $ ANA: num [1:3] 0.935 0.908 0.926
## $ CHA: num [1:3] 0.939 0.904 0.926
## $ COL: num [1:3] 0.904 0.875 0.892
## $ HOU: num [1:3] 0.921 0.877 0.908
## $ MIL: num [1:3] 0.913 0.881 0.902
## $ CIN: num [1:3] 0.914 0.876 0.902
## $ CLE: num [1:3] 0.942 0.904 0.93
## $ MIN: num [1:3] 0.929 0.903 0.915
## $ MON: num [1:3] 0.92 0.874 0.903
## $ BOS: num [1:3] 0.946 0.915 0.932
## $ LAA: num [1:3] 0.926 0.901 0.916
## $ NYA: num [1:3] 0.942 0.905 0.925
## $ ATL: num [1:3] 0.915 0.874 0.903
## $ FLO: num [1:3] 0.917 0.881 0.906
## $ ML4: num [1:3] 0.959 0.911 0.94
## $ LAN: num [1:3] 0.916 0.877 0.902
## $ NYN: num [1:3] 0.916 0.873 0.903
## $ PIT: num [1:3] 0.92 0.884 0.908
## $ SDN: num [1:3] 0.922 0.882 0.905
head(cor.asterr.indb, n=10)
## $KCA
## [1] 0.9445238 0.9078436 0.9302269
##
## $MIA
## [1] 0.8972361 0.8540480 0.8843923
##
## $DET
## [1] 0.9385825 0.9084073 0.9243898
##
## $SFN
## [1] 0.9132162 0.8674165 0.9022762
##
## $SLN
## [1] 0.9157595 0.8696552 0.9038536
##
## $PHI
## [1] 0.9167811 0.8766890 0.9070337
##
## $CHN
## [1] 0.9117595 0.8760679 0.8995387
##
## $OAK
## [1] 0.9346988 0.9024481 0.9244367
##
## $WAS
## [1] 0.8992372 0.8652082 0.8872238
##
## $BAL
## [1] 0.9384184 0.9084005 0.9244904
In-database R returns rather specialized data structure that needs serious refactoring before using it:
df = as.data.frame(t(sapply(cor.asterr.indb, function(x) x)))
names(df) = c("g:ab", "g:r", "g:h")
df = cbind(team = names(cor.asterr.indb), df)
df = melt(df, id.vars = 'team', measure.vars = c("g:ab", "g:r", "g:h"))
ggplot(data = df, mapping = aes(team, value)) +
geom_bar(aes(fill=team), stat="identity") +
facet_wrap(~variable, ncol=3) +
labs(title="Correlation with In-database R", x="teams", y="correlation value") +
theme_tufte(ticks = FALSE, base_size = 16) +
theme(legend.position = "none") +
coord_flip()
The most important point about code above is that correlations are no longer computed with Aster SQL/MR functions. Instead, ta.tapply sends anonymous function defined within its body to Aster to execute inside Aster R sandboxes on each of its workers. Still, all distribution and parallel features of Aster take care of data and execution but R cor function did actual computations across the table and its groups.
Again, as with ta.cor we either have all attribute pairs listed explicitly or have additional R code to produce them.
We would finish by showing how the same correlations are computed with open-source package toaster and how its function computeCorrelations addresses problems discussed above:
t = proc.time()
cor.toaster = computeCorrelations(conn, "batting",
include=c('g','ab','r','h','sb','cs','bb','so'),
by='teamid',
where='yearid >= 1980')
toasterCorTime = proc.time() - t
dim(cor.toaster)
## [1] 1960 6
head(cor.toaster)
## teamid corr value metric1 metric2 sign
## 1 ANA ab:cs 0.752329 ab cs 1
## 2 ARI ab:cs 0.741179 ab cs 1
## 3 ATL ab:cs 0.664995 ab cs 1
## 4 BAL ab:cs 0.620531 ab cs 1
## 5 BOS ab:cs 0.646773 ab cs 1
## 6 CAL ab:cs 0.597666 ab cs 1
It turns out that computeCorrelations offers both grouping of correlations by arbitrary attributes (using argument by) and produces all possible combinations for the attributes given with argument include. At the same time it’s important to remember that toaster functions that start with compute always run all calculations within Aster by utilizing its SQL/MR functions and SQL.
It turns out that toaster also offers way to effectively and quickly visualize correlations even given many features and groups with its function createBubblechart. For clarity we included in the plot just 6 teams out of 35 available:
createBubblechart(cor.toaster[cor.toaster$teamid %in% c('DET','PHI','NYN','TEX','MIA','MIL'),],
"metric1", "metric2", "value", label = NULL, fill = "value",
facet = "teamid", ncol = 2,
title="Correlation with toaster", xlab = NULL, ylab = NULL,
baseSize = 16) +
scale_fill_gradient(low="cadetblue1", high="blue")
Between 3 methods used we observed differences in how they use features to correlate on and their ability to group a data set. But we also measured performance for each method to compare how they performed on relatively large (still small in Aster world) data set of baseball statistics: