WARNING: WORK IN PROGRESS

This document is still work in progress, so expect its content enhanced, edited, and expanded with more materials in the future. It was created with the latest development version of toaster planned as 0.5.2 in near future. While 0.5.2 contains bug fixes and enhancements most of the examples should work > with the latest release 0.5.1 available from CRAN.

Intro to Intro

Teradata Aster is big data analytical platform that combines the power of SQL database with a rich set of analytical and graph functions while enabling distributed and parallel execution. With introduction R package toaster for analyzing large datasets stored in Aster database number of Aster features became available to R developers. We start with simple exploratory functions and gradually proceed to more complex techniques.

Datasets

The examples below will feature two datasets: * Lahman baseball database * City of Dallas Open Data: police records and building permits Both are readly available for download including scripts to load them into Aster database here.

Connecting to Aster

To connect to Aster toaster uses package RODBC to initialize ODBC connection to a database. Below, we connect to Aster using pre-configured ODBC DSN:

library(toaster)
#> Loading required package: RODBC
library(ggplot2)
library(ggthemes)
library(RColorBrewer)

# using ODBC data source name
if (exists("conn") && isOpen(conn)) 
  close(conn)

conn = odbcConnect("PreSalesCluster1-dallas")

or ad-hoc without pre-configured ODBC DSN:

conn = odbcDriverConnect(connection="driver={Aster ODBC Driver};
                                     server=xx.xx.xx.xx;
                                     port=2406;
                                     database=beehive;
                                     uid=beehive;
                                     pwd=beehive",
                         interpretDot=TRUE)

In both cases, object conn now hods initialized and live connection to the Aster instance. It was only necessary to have Aster ODBC Driver installed, RODBC package available (always installed and loaded with toaster), and ODBC DSN configured (when using it).

Data Exploration Made Fast

Baseball dataset has tables ptiching_enh and batting_enh with pitchers and batters stats. Our analysis will focus on these two tables.

Counting NULL Values

Function getNullCounts returns NULL counts or percentages for all (or selected) columns of the table. This illustrates how to return percentages of NULLs for all columns in batting_enh:

battingNullCounts = getNullCounts(conn, 'batting_enh', percent = TRUE)
head(battingNullCounts)
#>   variable    nullcount
#> 1     lgid 1.106097e-05
#> 2   teamid 1.106097e-05
#> 3 playerid 1.106097e-05
#> 4   yearid 1.106097e-05
#> 5 decadeid 1.106097e-05
#> 6        g 1.106097e-05

Then show NULL percentages with ggplot2 graphics (function displayPercentOfNotNullCounts)

displayPercentOfNotNullCounts <- function(df, name) {
  df$notnullcount = as.integer(100 * (1. - df$nullcount))
  df$variable = factor(df$variable, levels=df$variable[order(df$notnullcount)], ordered=TRUE)

  ggplot(df) +
    geom_bar(aes(variable, notnullcount, fill=variable), stat='identity') +
   labs(title=paste("Percent of NOT NULL values in", name), x="Column", y=NULL) +
  theme_tufte(ticks = FALSE) +
  # scale_fill_brewer(palette = "Set1", guide=FALSE) +
  scale_fill_manual(values = colorRampPalette(brewer.pal(8, "Set1"))(nrow(df)), guide=FALSE) + 
  theme(axis.text.x = element_text(angle = 315))
}

displayPercentOfNotNullCounts(battingNullCounts, 'batting_enh')

The same performed for the table pitching_enh:

pitchingNullCounts = getNullCounts(conn, 'pitching_enh', percent = TRUE) 

displayPercentOfNotNullCounts(pitchingNullCounts, 'pitching_enh')

Summary statistics with getTableSummary

Summary statistics is a way to start with data exploration. Function getTableSummary performs various computations on table columns. The following statistics go into its summary:

Capitalizing on collected NULL counts battingNullCounts and pitchingNullCounts we run getTableSummary only on the columns with no more 60% of NULL values:

pitchingColumns = pitchingNullCounts[pitchingNullCounts$nullcount < 0.6, 'variable']
pitchingInfo = getTableSummary(conn, 'pitching_enh', include=pitchingColumns, 
                               except = c('lgid','teamid','playerid','yearid','decadeid'))

battingColumns = battingNullCounts[battingNullCounts$nullcount < 0.6, 'variable']
battingInfo = getTableSummary(conn, tableName = 'batting_enh', include=battingColumns,
                              except = c('lgid','teamid','playerid','yearid','decadeid','g_old'))

Note, that getTableSummary also counts NULLs but by using getNullCounts first we avoid overhead of all other summary statistics computed in getTableSummary. Summary stats results are saved in the data frames pitchingInfo and battingInfo (not all of the columns are printed):

#>  COLUMN_NAME        TYPE_NAME distinct_count null_count minimum maximum     average
#>            g           bigint            165          0       1     165  54.1102459
#>    g_batting           bigint            166       1151       0     165  52.2877342
#>           ab           bigint            698       6413       1     716 167.7862943
#>            r           bigint            167       6413       0     192  22.2864728
#>            h           bigint            250       6413       0     262  43.9586042
#>          x2b           bigint             64       6413       0      67   7.4047789
#>          x3b           bigint             32       6413       0      36   1.5513013
#>           hr           bigint             67       6413       0      73   3.2689121
#>          rbi           bigint            175       6836       0     191  20.1242776
#>           sb           bigint            114       7711       0     138   3.5589871
#>           cs           bigint             36      29640       0      42   1.5489822
#>           bb           bigint            154       6413       0     232  15.4788318
#>           so           bigint            199      14148       0     223  24.0753485
#>          ibb           bigint             45      42605       0     120   1.4726162
#>          hbp           bigint             41       9219       0      51   1.2401217
#>           sh           bigint             54      12730       0      67   2.8075235
#>           sf           bigint             20      42085       0      19   1.3795580
#>         gidp           bigint             35      32287       0      36   3.7441844
#>           tb           bigint            416       6413       0     457  64.2727219
#>          tob           bigint            344       9219       0     379  61.1543701
#>          xbh           bigint            107       6413       0     119  12.2249923
#>           ba double precision          12555       6413       0       1   0.2066777
#>          slg double precision          18021       6413       0       4   0.2873143
#>           ta double precision          14089      37168       0      11   0.4977085
#>          obp double precision          10987      42086       0       1   0.2550415

Fast Visualizations with showData

Now, armed with the results of getTableSummary function showData either just visualizes certain statistics for a table (per its argument format) or may run additional computations in Aster to augment data returned by getTableSummary.

Overview Format

Same information visualized with function showData (using format='overview'):

showData(NULL, "batting_enh", tableInfo=battingInfo, type="numeric", format="overview",
         title="Pitching Stats Overview")

Boxplot Format

And again same information as boxplots (format='boxplot'):

showData(NULL, tableName='batting_enh', tableInfo=battingInfo, format='boxplot', 
         title="Pitching Stats Boxplots")

Facets with Boxplot Format

Or the same boxplots in more compact and independently scaled:

showData(NULL, tableName='batting_enh', format='boxplot', tableInfo=battingInfo, 
         facet=TRUE, useIQR=TRUE, title="Pitching Stats Boxplots")

Note that this time we called showData with connection NULL, which means that it never called database - instead it used argument tableInfo as it contains all necessary information to represent boxplots.

Correlation Matrix Format

showData(conn, tableName = "batting_enh", format = 'corr', tableInfo=battingInfo, 
         include = c('ba','bb','cs','h','hbp','hr','ibb','obp','r','rbi','sb',
                     'slg','so','ta','tb','tob','x2b','x3b','xbh'),
         title = "Correlation Matrix of Batting Stats")

Scatterplot Format

Strike-outs vs. Walks By Leagues and Decades

showData(conn, 'batting_enh', format='scatterplot', 
         include=c('r', 'so'), facetName=c("lgid","decadeid"), pointColour="lgid", 
         sampleSize=10000, regressionLine=TRUE,
         title="SO vs BB by League 1980-2000", 
         where='yearid >= 1970')
#> Warning: Removed 1494 rows containing non-finite values (stat_smooth).
#> Warning: Removed 1494 rows containing missing values (geom_point).

Sampling from Aster for Analysis in R

For quick modeling and prototyping in R computeSample lets randomly select data from Aster using one fo the following stratum models:

battingSample = computeSample(conn, "batting_enh", sampleSize = rep(4000,6),
                     conditionColumn = 'decadeid', conditionValues = seq(1960, 2010, 10),
                     include=c('lgid','yearid','slg','obp','ta','ops','ba','r'),
                     where="ab >= 20")
table(battingSample$decadeid)
#> 
#> 1960 1970 1980 1990 2000 2010 
#> 4011 3947 3930 3918 3928 3998

This loads sampled data into R data frame which is available for any manipulations and statistical analysis with R:

battingSample$decadeid = factor(battingSample$decadeid, levels=sort(unique(battingSample$decadeid)), ordered = TRUE)
battingSample$lgid = factor(battingSample$lgid, ordered = TRUE)

ms = aggregate(battingSample[,c('slg','r','obp','ta','ops','ba')], battingSample[,c('lgid','decadeid')], median)

library(reshape2)
data = melt(ms, id.vars=c('decadeid','lgid'), measure.vars=setdiff(names(battingSample),c('decadeid','lgid','yearid')),
            variable.name='stat')
data$stat = factor(data$stat, levels=sort(levels(unique(data$stat))), ordered=TRUE)

ggplot(data, aes(decadeid, value, fill=stat, group=1)) +
  geom_line(color='black', size=1.0) + geom_point(shape=21, size=6) + 
  facet_grid(stat ~ lgid, scales = "free_y") +
  # scale_fill_brewer(palette="Set1", guide=FALSE) +
  scale_fill_manual(values = (getDiscretePaletteFactory("Set1"))(length(unique(data[,'stat']))), guide = FALSE) +
  labs(title="Batting Stats Median Trends by Decade and League (1960 - 2015)", x='Decade', y=NULL) +
  theme_tufte(ticks = FALSE) +
  theme(strip.text.y=element_text(angle=0))

We can validate the trends visualized above by running similar distributions in Aster using boxplots:

percentiles = computePercentiles(conn, 'batting_enh', 
                                 columns=c('slg','r','obp','ta','ops','ba'), 
                                 by=c('lgid','decadeid'), where='yearid >= 1960 and ab >= 20')

percentiles$column = factor(percentiles$column, levels=sort(unique(percentiles$column)), ordered=TRUE)
createBoxplot(percentiles, x='lgid', fill='column', facet=c('column','decadeid'), useIQR = TRUE,
              facetScales="free", palette = "Set1", # paletteValues = c('firebrick3','blue4'), 
              title="Batting Stats Boxplots by League and Decade", xlab ="League", 
              fillGuide = FALSE, defaultTheme = theme_tufte(ticks=FALSE),
              themeExtra = theme(axis.text.x=element_text(angle=0),
                                 strip.text.y=element_text(angle=0)))

Linear Model with Aster


lmModel = computeLm(conn, "batting_enh", r ~ ba + slg + obp, where='yearid >= 1950')
summary(lmModel)
#> 
#> Call:
#> computeLm(channel = conn, tableName = "batting_enh", formula = r ~ 
#>     ba + slg + obp, where = "yearid >= 1950")
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -76.750 -13.176  -4.464   5.351  93.837 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)   -5.050      1.574  -3.209 0.001374 ** 
#> ba           -69.908     20.192  -3.462 0.000558 ***
#> slg           90.560     10.610   8.536  < 2e-16 ***
#> obp           56.769     13.666   4.154 3.54e-05 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 22.49 on 1016 degrees of freedom
#> Multiple R-squared:  0.2975, Adjusted R-squared:  0.2954 
#> F-statistic: 143.4 on 3 and 1016 DF,  p-value: < 2.2e-16

References