Ask A Data Scientist: How does Clustering in R vs. Tableau work?

I write a blog for Arkatechture, and in the last installment of Ask a Data Scientist (AADS), about the Top Algorithms Used by Data Scientists, I closed with a teaser:

In the next installment of Ask a Data Scientist we will compare two approaches: Tableau, the “new kid” on the clustering block, and R, a tried and true clustering stalwart. So, stay tuned, and thanks for reading!

Ben Richey has been working with me on this. We decided that the best way to understand the inner workings of Tableau’s clustering would be to use the classic Iris dataset, which you can get within R or from the Machine Learning Repository at the University of California, Irvine:

library(datasets)
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

We use this to understand how the implementation of k-means differs between R and Tableau by trying to match the actual species of irises using clustering on the petal and sepal lengths and widths. The species separate clearly when plotted by petal length vs. petal width, and this will be our comparison point:

if(require("ggplot2")==FALSE){install.packages("ggplot2")}
## Loading required package: ggplot2
library(ggplot2)
if(require("colorspace")==FALSE){install.packages("colorspace")}
## Loading required package: colorspace
library(colorspace)
ggplot(iris, aes(Petal.Length, Petal.Width, color = Species)) + geom_point()
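
To put a number on that visual separation, here is a minimal sketch that summarizes the petal measurements per species with base R's aggregate(). Using range() as the summary is just an illustrative choice, not part of the original analysis.

```r
library(datasets)

# Minimum and maximum of each petal measurement, per species
petal_ranges <- aggregate(
  cbind(Petal.Length, Petal.Width) ~ Species,
  data = iris,
  FUN = range
)
print(petal_ranges)

# Is setosa's longest petal still shorter than versicolor's shortest?
setosa_max <- max(iris$Petal.Length[iris$Species == "setosa"])
versicolor_min <- min(iris$Petal.Length[iris$Species == "versicolor"])
print(setosa_max < versicolor_min)
```

Setosa stands apart completely, while the versicolor and virginica ranges overlap, which is where a clustering algorithm is most likely to make its mistakes.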

Boulder Insights published a nice YouTube tutorial on getting clustering going on the Iris data. There are four features to cluster on in the Iris data set: Petal Length, Petal Width, Sepal Length, and Sepal Width. The basic Petal Length vs. Petal Width clustering is the same in both tools (R on the left, Tableau on the right):

# First IRIS Clustering 
set.seed(888)
kc1<-kmeans(iris[,c(3:4)],3)
iris$cluster1 <- as.factor(kc1$cluster)
ggplot(iris, aes(Petal.Length, Petal.Width, color = cluster1)) + geom_point() + ggtitle("2 Variables")

What’s really interesting is that when you add a third feature (Sepal Width), the clusters from the two tools start to diverge:

# Second IRIS Clustering 
set.seed(888)
kc2<-kmeans(iris[,c(2:4)],3)
iris$cluster2 <- as.factor(kc2$cluster)
ggplot(iris, aes(Petal.Length, Petal.Width, color = cluster2)) + geom_point() + ggtitle("3 Variables")

Adding the last variable (Sepal Length) doesn’t change the R clusters at all; changing the random seed, however, changes them significantly:

# Third IRIS Clustering 
set.seed(888)
kc3<-kmeans(iris[,c(1:4)],3)
iris$cluster3 <- as.factor(kc3$cluster)
ggplot(iris, aes(Petal.Length, Petal.Width, color = cluster3)) + geom_point() + ggtitle("4 Variables & Seed 888")

# Third IRIS Clustering, Different Seed
set.seed(20)
kc4<-kmeans(iris[,c(1:4)],3)
iris$cluster4 <- as.factor(kc4$cluster)
ggplot(iris, aes(Petal.Length, Petal.Width, color = cluster4)) + geom_point() + ggtitle("4 Variables & Seed 20")
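
One way to see how much the random start matters is to run kmeans() under several seeds and compare the total within-cluster sum of squares (tot.withinss) each run ends on. The sketch below also uses the nstart argument of kmeans(), which tries several random starts and keeps the best one. The extra seed values and nstart = 25 are arbitrary choices for illustration.

```r
library(datasets)

features <- iris[, 1:4]
seeds <- c(888, 20, 1, 2, 3)  # arbitrary seeds, for illustration only

# One random start per seed: the result can differ from seed to seed
single_wss <- sapply(seeds, function(s) {
  set.seed(s)
  kmeans(features, centers = 3)$tot.withinss
})
print(data.frame(seed = seeds, tot.withinss = single_wss))

# Many random starts in one call: kmeans() keeps the run with the lowest tot.withinss
set.seed(888)
multi_fit <- kmeans(features, centers = 3, nstart = 25)
print(multi_fit$tot.withinss)
```

A lower tot.withinss means tighter clusters, so any run that ends on a noticeably higher value has settled into a worse local optimum.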

The drift between the two R versions has now become very noticeable, while Tableau is quite stable; however, its second cluster has actually grown slightly with the addition of a fourth feature to the model. This difference between the two methods occurs because of several differences in the technical implementation of k-means:

Distance Measurement
  Tableau: Euclidean only.
  R (kmeans {stats}): Euclidean only, but an alternative implementation (Kmeans {amap}) also offers Maximum, Manhattan, Canberra, Binary, Pearson, Correlation, Spearman, and Kendall distances.

Centroid Initialization
  Tableau: Uses the Howard-Harris method to divide the original data into 2 parts, then repeats on the part with the highest distance variance until k is reached. Bora Beran from Tableau does a great job explaining this.
  R (kmeans {stats}): Picks k points to define the initial clusters, either randomly (reproducible with set.seed) or deterministically (by passing centers).

Categorical Variable Use
  Tableau: Built-in transformation using Multiple Correspondence Analysis (MCA) to convert the category to a distance.
  R (kmeans {stats}): Needs either a separate function for categorical data (kmodes), which uses the mode rather than the mean as its measure, or pre-processing to convert categories into numbers. FactoMineR is the most popular package for this.
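
To make the initialization difference concrete, here is a rough sketch of a divisive start in the spirit of the Tableau description above. It splits the data at the mean of its highest-variance variable, refines the split with k-means, then splits the cluster with the largest within-cluster sum of squares again until k is reached. This is only one reading of that description, not Tableau's actual code: divisive_kmeans() is a hypothetical helper, and both the split rule and the choice of which cluster to split are assumptions.

```r
library(datasets)

# Hypothetical helper: a deterministic, divisive way to seed k-means.
# An illustration of the idea only, not Tableau's implementation.
divisive_kmeans <- function(x, k) {
  stopifnot(k >= 2)
  x <- as.matrix(x)
  cluster <- rep(1L, nrow(x))
  fit <- NULL
  while (length(unique(cluster)) < k) {
    # Within-cluster sum of squares of every current cluster
    wss <- tapply(seq_len(nrow(x)), cluster, function(idx) {
      sum(scale(x[idx, , drop = FALSE], scale = FALSE)^2)
    })
    widest <- as.integer(names(wss)[which.max(wss)])
    idx <- which(cluster == widest)
    part <- x[idx, , drop = FALSE]

    # Split that cluster at the mean of its highest-variance variable
    v <- which.max(apply(part, 2, var))
    upper <- part[, v] > mean(part[, v])
    cluster[idx[upper]] <- max(cluster) + 1L

    # Refine with k-means, using the current group means as fixed starting centers
    centers <- apply(x, 2, function(col) tapply(col, cluster, mean))
    fit <- kmeans(x, centers = centers)
    cluster <- fit$cluster
  }
  fit
}

fit_a <- divisive_kmeans(iris[, 1:4], k = 3)
fit_b <- divisive_kmeans(iris[, 1:4], k = 3)
print(identical(fit_a$cluster, fit_b$cluster))  # no random start is involved
print(table(iris$Species, fit_a$cluster))
```

Because the starting centers are computed from the data instead of drawn at random, repeated runs give the same clusters without any call to set.seed.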

Tableau does a much better job matching the actual Iris species, regardless of the features clustered. Here are the misclassification counts for the four R clusterings above:

iris$realcluster1<- ifelse(iris$Species=='setosa', 3,ifelse(iris$Species == 'versicolor', 1, 2))
iris$cluster1match<-ifelse(iris$cluster1==iris$realcluster1,T,F)
summary(iris$cluster1match)
##    Mode   FALSE    TRUE    NA's 
## logical       6     144       0
iris$realcluster2<- ifelse(iris$Species=='setosa', 1,ifelse(iris$Species == 'versicolor', 3, 2))
iris$cluster2match<-ifelse(iris$cluster2==iris$realcluster2,T,F)
summary(iris$cluster2match)
##    Mode   FALSE    TRUE    NA's 
## logical      68      82       0
iris$realcluster3<- ifelse(iris$Species=='setosa', 3,ifelse(iris$Species == 'versicolor', 1, 2))
iris$cluster3match<-ifelse(iris$cluster3==iris$realcluster3,T,F)
summary(iris$cluster3match)
##    Mode   FALSE    TRUE    NA's 
## logical      63      87       0
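iris$realcluster4<- ifelse(iris$Species=='setosa', 3,ifelse(iris$Species == 'versicolor', 1, 2))
# Compare against cluster4 (the seed-20 run). Cluster numbers can change with the seed,
# so check table(iris$Species, iris$cluster4) before relying on this mapping.
iris$cluster4match<-ifelse(iris$cluster4==iris$realcluster4,T,F)
summary(iris$cluster4match)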
bp1<-barplot(table(iris$cluster1match),main="2 Variable Clustering Misclassification",ylab="Count",col=c("orangered","light blue"))

bp2<-barplot(table(iris$cluster2match),main="3 Variable Clustering Misclassification",ylab="Count",col=c("orangered","light blue"))

bp3<-barplot(table(iris$cluster3match),main="4 Variable Seed 888 Clustering Misclassification",ylab="Count",col=c("orangered","light blue"))

bp4<-barplot(table(iris$cluster4match),main="4 Variable Seed 20 Clustering Misclassification",ylab="Count",col=c("orangered","light blue"))
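
Because kmeans() numbers its clusters arbitrarily, the hard-coded species-to-cluster mappings above have to be re-derived for every run. Here is a sketch of one alternative, assuming it is acceptable to label each cluster with its majority species (an illustrative choice, not what the comparison above did): cross-tabulate species against cluster, then count every flower that falls outside its cluster's majority.

```r
library(datasets)

set.seed(888)
fit <- kmeans(iris[, 1:4], centers = 3)

# Rows are the true species, columns are the cluster numbers
tab <- table(Species = iris$Species, Cluster = fit$cluster)
print(tab)

# Label each cluster with its most common species
majority <- rownames(tab)[apply(tab, 2, which.max)]
predicted <- majority[fit$cluster]

# Flowers whose species differs from their cluster's majority label
misclassified <- sum(predicted != iris$Species)
print(misclassified)
```

If two clusters end up with the same majority species, one species is left without a cluster of its own and all of its flowers count as misclassified, so the cross-tabulation is worth reading before the single number.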

Because Tableau’s clusters are stable out of the box and need no seed tuning, it can really help speed up your analytics!