K-means clustering is powerful unsupervised learning technique available out of the box with Aster database. But it still requires certain amount of data preparation, handling, and custom coding to acheive the same results as with core R functions kmeans, scale, etc. toaster streamlines and simplifies kmeans clustering with Aster with its family of kmeans clustering functions for computing and vizualization. This results in kmeans object compatible with R (but richer) and choice of visualizations to describe cluster centroids, cluster properties and more. toaster kmeans function family also offers scaling, cluster sampling, cluster aggregated properties, silhouette technique.
toaster demo data sets baseball and dallas should be imported into Aster database. Aster database should be availble via ODBC. For details please see here.
Clustering MLB batters based on their performance using baseball dataset. Table batting_enh contains batters’ stats among which we chose *ba, slg, ta to cluster by.
conn = odbcConnect("PreSalesCluster1-dallas")
km.model = computeKmeans(conn, "batting_enh", centers=4, include=c('ba','slg','ta'), scale = FALSE,
aggregates = c("COUNT(*) cnt", "AVG(g_batting) avg_g",
"AVG(r) avg_r", "AVG(h) avg_h","AVG(ab) avg_ab"),
id="playerid || '-' || teamid || '-' || yearid", idAlias = "id",
scaledTableName='kmeans_vignette_scaled',
centroidTableName='kmeans_vignette_centroids', schema='public',
where="yearid > 2000 and g_batting >= 30", iterMax = 25)
Explaining computeKmeans parameters:
include = c('ba','slg','ta') specifies which variables to use. If none specified then all available variables (numeric) used.centers = 4 specifies number of clusters. This is arguably the most important parameter in kmeans and it needs more analysis before finalizing its value (see The Elbow Method below)scale = FALSE tells not to scale data (just eliminate empty values). It is a standard practice in kmeans to scale its variables first. In this case we choose to omit this step since all variables are fractions between 0 and 1.aggregates = c("COUNT(*) cnt", "AVG(g_batting) avg_g", "AVG(r) avg_r", "AVG(h) avg_h","AVG(ab) avg_ab") tells Aster compute certain aggregates on clusters which could be used to describe and explain clustersid="playerid || '-' || teamid || '-' || yearid", idAlias = "id" defines id and its aliasscaledTableName='kmeans_vignette_scaled', centroidTableName='kmeans_vignette_centroids' explicitly define tables to store intermidiate data. toaster will produce random names if these are omitted (found in kmeans object)where="yearid > 2000 and g_batting >= 30" limits batter stats to year 2000 or later and those players with 30 or greater games batted per team per seasonThe result is object compatible with kmeans in R:
km.model
#> K-means clustering with 4 clusters of sizes 2782, 1008, 651, 2647
#>
#> Cluster means:
#> ba slg ta
#> 0 0.23700410 0.34190796 0.5501221
#> 1 0.06208777 0.07422343 0.1195800
#> 2 0.31256702 0.55735999 1.0201456
#> 3 0.27190869 0.44207918 0.7521625
#>
#> Clustering vector:
#> integer(0)
#>
#> Within cluster sum of squares by cluster:
#> [1] 30.04443 26.91839 60.49188 21.93678
#> (between_SS / total_SS = 81.1 %)
#>
#> Available components:
#>
#> [1] "cluster" "centers" "totss"
#> [4] "withinss" "tot.withinss" "betweenss"
#> [7] "size" "iter" "ifault"
#> [10] "scale" "aggregates" "tableName"
#> [13] "columns" "scaledTableName" "centroidTableName"
#> [16] "id" "idAlias" "whereClause"
#> [19] "time"
Multiple options available to visualize resulting clusters and their properties. Lineplot of centroids by clusters and by variables:
createCentroidPlot(km.model, format="line", coordFlip = FALSE)
createCentroidPlot(km.model, format="line", groupByCluster=FALSE, coordFlip = FALSE)
Barplot of centroids by clusters and by variables:
createCentroidPlot(km.model, format="bar", coordFlip = FALSE)
createCentroidPlot(km.model, format="bar", groupByCluster=FALSE, coordFlip = FALSE)
Dodged barplot of centroids by clusters and by variables:
createCentroidPlot(km.model, format="bar_dodge", coordFlip = FALSE)
createCentroidPlot(km.model, format="bar_dodge", groupByCluster=FALSE, coordFlip = FALSE)
Heatmaps of centroids:
createCentroidPlot(km.model, format="heatmap", coordFlip = FALSE)
createCentroidPlot(km.model, format="heatmap", coordFlip = TRUE)
Cluster properties in 2 color schemes:
createClusterPlot(km.model)
createClusterPlot(km.model, colorByCluster = FALSE)
Using proportional sampling we can visualize cluster data:
km.model = computeClusterSample(conn, km.model, '0.5')
createClusterPairsPlot(km.model)
Silhouette refers to a method of interpretation and validation of consistency within clusters of data.The technique provides a succinct graphical representation of how well each object lies within its cluster. Following toaster convention of never consuming potentially large datasets in R its silhouette profile function constructs histograms of silhouette values for each cluster instead of each point’s silhouette value:
km.model = computeSilhouette(conn, km.model, scaled=FALSE,
silhouetteTableName = "public.silhouette_vignette_data")
createSilhouetteProfile(km.model)
kmeans always computes within cluster sum of squares (withinss) for each cluster and their total sum (tot.withinss). Using ‘elbow method’ by tracing latter metric vs. number of clusters suggests optimal number of clusters:
ggplot(data) +
geom_line(aes(K, totwithinss)) + geom_point(aes(K, totwithinss)) +
labs(title="Within groups Sums of Squares vs. the Number of Clusters",
x="Number of Clusters", y="Within groups Sums of Squares") +
scale_x_continuous(breaks=seq(0,N,2)) + ggthemes::theme_tufte()
Another method is looking at total silhouette value vs. number of clusters:
ggplot(data) +
geom_line(aes(K, silvalue), color="black") + geom_point(aes(K, silvalue), color="black") +
labs(title="Total Silhouette vs. the Number of Clusters",
x="Number of Clusters", y="Total Silhouette Value") +
scale_x_continuous(breaks=seq(0,N,2)) + ggthemes::theme_tufte()
We can also track Aster performance when running kmeans:
ggplot(data) +
geom_line(aes(K, time), color="lightblue") + geom_point(aes(K, time), color="lightblue") +
labs(title="Execution Time vs. the Number of Clusters",
x="Number of Clusters", y="Time (s)") +
scale_x_continuous(breaks=seq(0,N,2)) + ggthemes::theme_tufte()
We can cluster by location (longitude and latitude) and use createMap function to visualize results. In this example we use dallas dataset and its building permits table dallasbuildingpermits (again, scaling is not necessary when using location):
km.dallas = computeKmeans(conn, "dallasbuildingpermits", 5, id="permitno",
include=c("lat","lon"), scale=FALSE, iterMax = 50,
scaledTableName = "dallasbuildingpermits_scaled",
centroidTableName = "dallasbuildingpermits_centroids",
aggregates=c("avg(value) value", "avg(area) area",
"avg(value/case when area > 0 then area else null end) value_sqft"),
where = "landuse = 'SINGLE FAMILY DWELLING'")
Before mapping clusters on the map we can see their properties:
createClusterPlot(km.dallas)
The resulting clusters map also using cluster properties assigned to the shapes - in this case it’s number of elements in the cluster and average area of construction (sq.ft.):
getMapFun = memoise::memoise(ggmap::get_map)
geocodeFun = memoise::memoise(ggmap::geocode)
dataForMap = cbind(data.frame(km.dallas$centers), km.dallas$aggregates)
createMap(dataForMap, source="google", locator="center", zoom = 11,
lonName="lon", latName="lat",
metrics=c("value","area"), shape=21, shapeColour = "purple", shapeAlpha = 0.6,
scaleRange=c(4,20), geocodeFun = geocodeFun, getmapFun=getMapFun,
metricGuides = list(guide_legend("Average value"), guide_colorbar("Average area")))