WARNING: WORK IN PROGRESS
This document is still a work in progress, so expect its content to be enhanced, edited, and expanded with more material in the future. It was created with the latest development version of toaster, planned for release as 0.5.2 in the near future. While 0.5.2 contains bug fixes and enhancements, most of the examples should work with the latest release 0.5.1 available from CRAN.
Teradata Aster is a big data analytics platform that combines the power of a SQL database with a rich set of analytical and graph functions while enabling distributed and parallel execution. With the introduction of the R package toaster for analyzing large datasets stored in an Aster database, a number of Aster features became available to R developers. We start with simple exploratory functions and gradually proceed to more complex techniques.
The examples below feature two datasets:
* Lahman baseball database
* City of Dallas Open Data: police records and building permits
Both are readily available for download, including scripts to load them into an Aster database, here.
To connect to Aster, toaster uses the RODBC package to initialize an ODBC connection to a database. Below, we connect to Aster using a pre-configured ODBC DSN:
library(toaster)
#> Loading required package: RODBC
library(ggplot2)
library(ggthemes)
library(RColorBrewer)
# using ODBC data source name
# close a previously opened connection, if any
if (exists("conn"))
  try(close(conn), silent = TRUE)
conn = odbcConnect("PreSalesCluster1-dallas")
or ad hoc, without a pre-configured ODBC DSN:
conn = odbcDriverConnect(connection="driver={Aster ODBC Driver};
server=xx.xx.xx.xx;
port=2406;
database=beehive;
uid=beehive;
pwd=beehive",
interpretDot=TRUE)
In both cases, the object conn now holds an initialized, live connection to the Aster instance. It was only necessary to have the Aster ODBC driver installed, the RODBC package available (always installed and loaded with toaster), and an ODBC DSN configured (when using one).
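A quick, optional check that the connection is live (not part of the original walkthrough) is to ask RODBC for the driver and database details:
# returns driver, DBMS name and version, data source name, etc. for an open channel
odbcGetInfo(conn)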
The baseball dataset has the tables pitching_enh and batting_enh with pitcher and batter stats. Our analysis will focus on these two tables.
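As a simple illustration (plain RODBC, not a toaster function; the query text here is just an assumption), a few raw rows can be previewed directly, with a LIMIT clause keeping the result small:
# preview a handful of rows from each table
head(sqlQuery(conn, "SELECT * FROM batting_enh LIMIT 10"))
head(sqlQuery(conn, "SELECT * FROM pitching_enh LIMIT 10"))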
NULL Values
Function getNullCounts returns NULL counts or percentages for all (or selected) columns of a table. This illustrates how to return percentages of NULLs for all columns in batting_enh:
battingNullCounts = getNullCounts(conn, 'batting_enh', percent = TRUE)
head(battingNullCounts)
#> variable nullcount
#> 1 lgid 1.106097e-05
#> 2 teamid 1.106097e-05
#> 3 playerid 1.106097e-05
#> 4 yearid 1.106097e-05
#> 5 decadeid 1.106097e-05
#> 6 g 1.106097e-05
Next, we visualize the percentage of NOT NULL values with ggplot2 graphics (function displayPercentOfNotNullCounts):
displayPercentOfNotNullCounts <- function(df, name) {
df$notnullcount = as.integer(100 * (1. - df$nullcount))
df$variable = factor(df$variable, levels=df$variable[order(df$notnullcount)], ordered=TRUE)
ggplot(df) +
geom_bar(aes(variable, notnullcount, fill=variable), stat='identity') +
labs(title=paste("Percent of NOT NULL values in", name), x="Column", y=NULL) +
theme_tufte(ticks = FALSE) +
# scale_fill_brewer(palette = "Set1", guide=FALSE) +
scale_fill_manual(values = colorRampPalette(brewer.pal(8, "Set1"))(nrow(df)), guide=FALSE) +
theme(axis.text.x = element_text(angle = 315))
}
displayPercentOfNotNullCounts(battingNullCounts, 'batting_enh')
The same is performed for the table pitching_enh:
pitchingNullCounts = getNullCounts(conn, 'pitching_enh', percent = TRUE)
displayPercentOfNotNullCounts(pitchingNullCounts, 'pitching_enh')
getTableSummary
Summary statistics are a good way to start data exploration. Function getTableSummary performs various computations on table columns. The following statistics go into its summary:
* total_count: total row count (this value is the same for each table column)
* distinct_count: distinct values count
* not_null_count: not null count
* minimum: minimum value (numerical data types only)
* maximum: maximum value (numerical data types only)
* average: mean (numerical data types only)
* deviation: standard deviation (numerical data types only)
* IQR: interquartile range, i.e. the 1st quartile subtracted from the 3rd quartile
* minimum_str: minimum string value (character data types only)
* maximum_str: maximum string value (character data types only)
* mode: mode value (only when modeValue=TRUE)
* mode_count: mode count (only when modeValue=TRUE)
Capitalizing on the collected NULL counts battingNullCounts and pitchingNullCounts, we run getTableSummary only on the columns with no more than 60% of NULL values:
pitchingColumns = pitchingNullCounts[pitchingNullCounts$nullcount < 0.6, 'variable']
pitchingInfo = getTableSummary(conn, 'pitching_enh', include=pitchingColumns,
except = c('lgid','teamid','playerid','yearid','decadeid'))
battingColumns = battingNullCounts[battingNullCounts$nullcount < 0.6, 'variable']
battingInfo = getTableSummary(conn, tableName = 'batting_enh', include=battingColumns,
except = c('lgid','teamid','playerid','yearid','decadeid','g_old'))
Note that getTableSummary also counts NULLs, but by using getNullCounts first we avoid the overhead of all the other summary statistics computed by getTableSummary. The summary results are saved in the data frames pitchingInfo and battingInfo (not all of the columns are printed below).
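One way to print such a subset (the exact column selection here is an illustration, not the original code) is to index the returned data frame directly:
# show a few of the summary columns computed for the batting table
battingInfo[, c('COLUMN_NAME','TYPE_NAME','distinct_count','null_count',
                'minimum','maximum','average')]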
#> COLUMN_NAME TYPE_NAME distinct_count null_count minimum maximum average
#> g bigint 165 0 1 165 54.1102459
#> g_batting bigint 166 1151 0 165 52.2877342
#> ab bigint 698 6413 1 716 167.7862943
#> r bigint 167 6413 0 192 22.2864728
#> h bigint 250 6413 0 262 43.9586042
#> x2b bigint 64 6413 0 67 7.4047789
#> x3b bigint 32 6413 0 36 1.5513013
#> hr bigint 67 6413 0 73 3.2689121
#> rbi bigint 175 6836 0 191 20.1242776
#> sb bigint 114 7711 0 138 3.5589871
#> cs bigint 36 29640 0 42 1.5489822
#> bb bigint 154 6413 0 232 15.4788318
#> so bigint 199 14148 0 223 24.0753485
#> ibb bigint 45 42605 0 120 1.4726162
#> hbp bigint 41 9219 0 51 1.2401217
#> sh bigint 54 12730 0 67 2.8075235
#> sf bigint 20 42085 0 19 1.3795580
#> gidp bigint 35 32287 0 36 3.7441844
#> tb bigint 416 6413 0 457 64.2727219
#> tob bigint 344 9219 0 379 61.1543701
#> xbh bigint 107 6413 0 119 12.2249923
#> ba double precision 12555 6413 0 1 0.2066777
#> slg double precision 18021 6413 0 4 0.2873143
#> ta double precision 14089 37168 0 11 0.4977085
#> obp double precision 10987 42086 0 1 0.2550415
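The mode and mode_count statistics listed earlier are only computed on request. A minimal sketch (the column choice is arbitrary and for illustration only):
# request mode statistics for a couple of columns by setting modeValue=TRUE
modeInfo = getTableSummary(conn, 'batting_enh', include = c('hr','so'), modeValue = TRUE)
modeInfo[, c('COLUMN_NAME','mode','mode_count')]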
showData
Now, armed with the results of getTableSummary, function showData either just visualizes certain statistics for a table (per its argument format) or may run additional computations in Aster to augment the data returned by getTableSummary.
The same information is visualized with function showData (using format='overview'):
showData(NULL, "batting_enh", tableInfo=battingInfo, type="numeric", format="overview",
title="Pitching Stats Overview")And again same information as boxplots (format='boxplot'):
showData(NULL, tableName='batting_enh', tableInfo=battingInfo, format='boxplot',
title="Pitching Stats Boxplots")Or the same boxplots in more compact and independently scaled:
showData(NULL, tableName='batting_enh', format='boxplot', tableInfo=battingInfo,
facet=TRUE, useIQR=TRUE, title="Batting Stats Boxplots")
Note that each of these showData calls passed NULL for the connection, which means the database was never queried; instead the argument tableInfo was used, since it contains all the information necessary to render the boxplots.
Computing a correlation matrix requires crunching the data itself, so this time we pass the live connection conn and use format='corr':
showData(conn, tableName = "batting_enh", format = 'corr', tableInfo=battingInfo,
include = c('ba','bb','cs','h','hbp','hr','ibb','obp','r','rbi','sb',
'slg','so','ta','tb','tob','x2b','x3b','xbh'),
title = "Correlation Matrix of Batting Stats")Strike-outs vs. Walks By Leagues and Decades
showData(conn, 'batting_enh', format='scatterplot',
include=c('r', 'so'), facetName=c("lgid","decadeid"), pointColour="lgid",
sampleSize=10000, regressionLine=TRUE,
title="SO vs BB by League 1980-2000",
where='yearid >= 1970')
#> Warning: Removed 1494 rows containing non-finite values (stat_smooth).
#> Warning: Removed 1494 rows containing missing values (geom_point).
For quick modeling and prototyping in R, computeSample lets us randomly select data from Aster using one of the following stratum models:
* a single sampleFraction: this one fraction is targeted throughout the whole population or across all the strata defined by the sample conditions.
* a list of sampleFraction values: each of them is used for sampling a particular stratum defined by the conditionColumn or conditionStratum arguments.
* a single sampleSize for the total sample size of the entire population. If in addition you specify the conditionColumn or conditionStratum arguments, the function proportionally generates sample units for each stratum.
* a list of sampleSize values, so that each size corresponds to a stratum defined by conditionColumn or conditionStratum. The sample function generates sample units directly for each stratum based on the supplied sizes.
Here we sample roughly 4,000 batting records (with at least 20 at-bats) from each decade since 1960:
battingSample = computeSample(conn, "batting_enh", sampleSize = rep(4000,6),
conditionColumn = 'decadeid', conditionValues = seq(1960, 2010, 10),
include=c('lgid','yearid','slg','obp','ta','ops','ba','r'),
where="ab >= 20")
table(battingSample$decadeid)
#>
#> 1960 1970 1980 1990 2000 2010
#> 4011 3947 3930 3918 3928 3998
This loads the sampled data into an R data frame, available for any manipulation and statistical analysis in R:
battingSample$decadeid = factor(battingSample$decadeid, levels=sort(unique(battingSample$decadeid)), ordered = TRUE)
battingSample$lgid = factor(battingSample$lgid, ordered = TRUE)
ms = aggregate(battingSample[,c('slg','r','obp','ta','ops','ba')], battingSample[,c('lgid','decadeid')], median)
library(reshape2)
data = melt(ms, id.vars=c('decadeid','lgid'), measure.vars=setdiff(names(battingSample),c('decadeid','lgid','yearid')),
variable.name='stat')
data$stat = factor(data$stat, levels=sort(levels(unique(data$stat))), ordered=TRUE)
ggplot(data, aes(decadeid, value, fill=stat, group=1)) +
geom_line(color='black', size=1.0) + geom_point(shape=21, size=6) +
facet_grid(stat ~ lgid, scales = "free_y") +
# scale_fill_brewer(palette="Set1", guide=FALSE) +
scale_fill_manual(values = (getDiscretePaletteFactory("Set1"))(length(unique(data[,'stat']))), guide = FALSE) +
labs(title="Batting Stats Median Trends by Decade and League (1960 - 2015)", x='Decade', y=NULL) +
theme_tufte(ticks = FALSE) +
theme(strip.text.y=element_text(angle=0))
We can validate the trends visualized above by computing similar distributions in Aster and displaying them as boxplots:
percentiles = computePercentiles(conn, 'batting_enh',
columns=c('slg','r','obp','ta','ops','ba'),
by=c('lgid','decadeid'), where='yearid >= 1960 and ab >= 20')
percentiles$column = factor(percentiles$column, levels=sort(unique(percentiles$column)), ordered=TRUE)
createBoxplot(percentiles, x='lgid', fill='column', facet=c('column','decadeid'), useIQR = TRUE,
facetScales="free", palette = "Set1", # paletteValues = c('firebrick3','blue4'),
title="Batting Stats Boxplots by League and Decade", xlab ="League",
fillGuide = FALSE, defaultTheme = theme_tufte(ticks=FALSE),
themeExtra = theme(axis.text.x=element_text(angle=0),
strip.text.y=element_text(angle=0)))
Finally, computeLm fits a linear regression model in the database; here we model runs (r) as a function of batting average, slugging, and on-base percentage for seasons since 1950:
lmModel = computeLm(conn, "batting_enh", r ~ ba + slg + obp, where='yearid >= 1950')
summary(lmModel)
#>
#> Call:
#> computeLm(channel = conn, tableName = "batting_enh", formula = r ~
#> ba + slg + obp, where = "yearid >= 1950")
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -76.750 -13.176 -4.464 5.351 93.837
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -5.050 1.574 -3.209 0.001374 **
#> ba -69.908 20.192 -3.462 0.000558 ***
#> slg 90.560 10.610 8.536 < 2e-16 ***
#> obp 56.769 13.666 4.154 3.54e-05 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 22.49 on 1016 degrees of freedom
#> Multiple R-squared: 0.2975, Adjusted R-squared: 0.2954
#> F-statistic: 143.4 on 3 and 1016 DF, p-value: < 2.2e-16
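The printed coefficients can be plugged straight into the fitted equation for a quick back-of-the-envelope prediction in R (the ba, slg, and obp values below are arbitrary, chosen only to illustrate the arithmetic):
# predicted runs from the fitted equation r = -5.050 - 69.908*ba + 90.560*slg + 56.769*obp
ba <- 0.300; slg <- 0.450; obp <- 0.360
-5.050 - 69.908*ba + 90.560*slg + 56.769*obp
#> [1] 35.16644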