I write a blog for Arkatechture, and in the last installment of AADS, on the Top Algorithms Used by Data Scientists, I closed with a teaser:
> In the next installment of Ask a Data Scientist we will compare two approaches: Tableau, the “new kid” on the clustering block, and R, a tried-and-true clustering stalwart. So, stay tuned, and thanks for reading!
Ben Richey has been working with me on this. We decided that the best way to understand the inner workings of Tableau’s clustering would be to use the classic Iris dataset (you can get it within R or from the UC Irvine Machine Learning Repository):
library(datasets)
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
We’ll use it to see how the k-means implementations in R and Tableau differ, by trying to recover the actual iris species from clusters built on the petal and sepal lengths and widths. The species separate cleanly when petal length is plotted against petal width, so that plot will be our comparison point:
if(require("ggplot2")==FALSE){install.packages("ggplot2")}
## Loading required package: ggplot2
library(ggplot2)
if(require("colorspace")==FALSE){install.packages("colorspace")}
## Loading required package: colorspace
library(colorspace)
ggplot(iris, aes(Petal.Length, Petal.Width, color = Species)) + geom_point()
Boulder Insights published a nice YouTube tutorial on getting clustering up and running on the Iris data in Tableau. There are four features to cluster on in the Iris data set: Petal Length, Petal Width, Sepal Length, and Sepal Width. The basic Petal Length vs. Petal Width clustering is the same in both tools (R on the left, Tableau on the right):
# First iris clustering: petal length & width only
set.seed(888)
kc1 <- kmeans(iris[, 3:4], centers = 3)
iris$cluster1 <- as.factor(kc1$cluster)
ggplot(iris, aes(Petal.Length, Petal.Width, color = cluster1)) + geom_point() + ggtitle("2 Variables")
What’s really interesting is that when you add a third feature (Sepal Width), the clusters start to diverge:
# Second iris clustering: add sepal width
set.seed(888)
kc2 <- kmeans(iris[, 2:4], centers = 3)
iris$cluster2 <- as.factor(kc2$cluster)
ggplot(iris, aes(Petal.Length, Petal.Width, color = cluster2)) + geom_point() + ggtitle("3 Variables")
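You can quantify that divergence by cross-tabulating the two assignment vectors. Since the cluster numbers themselves are arbitrary labels, look for rows that split across columns:
# How the 2-variable and 3-variable assignments line up; a row that
# spreads over several columns means those points changed clusters
table(cluster1 = iris$cluster1, cluster2 = iris$cluster2)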
Adding the last variable (Sepal Length) doesn’t change the R clusters at all; changing the random seed, however, changes them significantly:
# Third iris clustering: all four variables
set.seed(888)
kc3 <- kmeans(iris[, 1:4], centers = 3)
iris$cluster3 <- as.factor(kc3$cluster)
ggplot(iris, aes(Petal.Length, Petal.Width, color = cluster3)) + geom_point() + ggtitle("4 Variables & Seed 888")
# Fourth iris clustering: all four variables, different seed
set.seed(20)
kc4 <- kmeans(iris[, 1:4], centers = 3)
iris$cluster4 <- as.factor(kc4$cluster)
ggplot(iris, aes(Petal.Length, Petal.Width, color = cluster4)) + geom_point() + ggtitle("4 Variables & Seed 20")
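A quick aside before we compare the plots: if this seed sensitivity is a concern, the usual remedy in R is kmeans’s nstart argument, which runs several random initializations and keeps the one with the lowest total within-cluster sum of squares. A minimal sketch, not part of the original comparison:
# With 25 random starts, the best solution is kept automatically, so
# the result no longer hinges on a single lucky (or unlucky) seed
set.seed(20)
kc_stable <- kmeans(iris[, 1:4], centers = 3, nstart = 25)
kc_stable$tot.withinss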
Returning to the comparison: the drift between the two R runs is now very noticeable, while Tableau is quite stable; its second cluster has, however, grown slightly with the addition of the fourth feature to the model. The gap between the two tools comes down to several differences in the technical implementation of k-means:
| | Tableau | R (kmeans {stats}) |
|---|---|---|
| Distance Measurement | Euclidean only | Euclidean only in kmeans {stats}; the alternative Kmeans {amap} implementation also offers Maximum, Manhattan, Canberra, Binary, Pearson, Correlation, Spearman, and Kendall distances (see the sketch after this table) |
| Centroid Initialization | Uses the Howard-Harris method: divide the original data into 2 parts, then repeatedly re-split the part with the highest distance variance until k clusters are reached. Bora Beran from Tableau does a great job explaining this. | Picks the k points that define the initial clusters either randomly (reproducible via set.seed) or deterministically (by passing explicit centers) |
| Categorical Variable Use | Built-in transformation using Multiple Correspondence Analysis (MCA) to convert categories into distances | Needs either a separate function for categorical data (kmodes, which uses the mode rather than the mean) or pre-processing to convert categories into numbers; FactoMineR is the most popular MCA option |
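To make the first row concrete, here’s a minimal sketch of swapping in an alternative distance with Kmeans from the amap package (assumed installed; its interface otherwise mirrors stats::kmeans):
library(amap)
# Same three-cluster fit on all four variables, but with Manhattan
# distance instead of Euclidean
set.seed(888)
kc_manhattan <- Kmeans(iris[, 1:4], centers = 3, method = "manhattan")
table(kc_manhattan$cluster)  # cluster sizes under the new metric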
Tableau does a much better job of matching the actual iris species, regardless of which features are clustered. Here is a comparison of the misclassification rates:
iris$realcluster1<- ifelse(iris$Species=='setosa', 3,ifelse(iris$Species == 'versicolor', 1, 2))
iris$cluster1match<-ifelse(iris$cluster1==iris$realcluster1,T,F)
summary(iris$cluster1match)
## Mode FALSE TRUE NA's
## logical 6 144 0
iris$realcluster2<- ifelse(iris$Species=='setosa', 1,ifelse(iris$Species == 'versicolor', 3, 2))
iris$cluster2match<-ifelse(iris$cluster2==iris$realcluster2,T,F)
summary(iris$cluster2match)
## Mode FALSE TRUE NA's
## logical 68 82 0
iris$realcluster3<- ifelse(iris$Species=='setosa', 3,ifelse(iris$Species == 'versicolor', 1, 2))
iris$cluster3match<-ifelse(iris$cluster3==iris$realcluster3,T,F)
summary(iris$cluster3match)
## Mode FALSE TRUE NA's
## logical 63 87 0
iris$realcluster4<- ifelse(iris$Species=='setosa', 3,ifelse(iris$Species == 'versicolor', 1, 2))
iris$cluster4match<-ifelse(iris$cluster3==iris$realcluster4,T,F)
summary(iris$cluster4match)
## Mode FALSE TRUE NA's
## logical 63 87 0
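As a side note, a confusion matrix gives the same picture without hand-picking a label mapping; for example, for the first clustering:
# Rows are the true species, columns are the kmeans labels; counts
# off each species' dominant column are its misclassifications
table(iris$Species, kc1$cluster)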
bp1<-barplot(table(iris$cluster1match),main="2 Variable Clustering Misclassification",ylab="Count",col=c("orangered","light blue"))
bp2<-barplot(table(iris$cluster2match),main="3 Variable Clustering Misclassification",ylab="Count",col=c("orangered","light blue"))
bp3<-barplot(table(iris$cluster3match),main="4 Variable Seed 888 Clustering Misclassification",ylab="Count",col=c("orangered","light blue"))
bp4<-barplot(table(iris$cluster4match),main="4 Variable Seed 20 Clustering Misclassification",ylab="Count",col=c("orangered","light blue"))
The bottom line: Tableau’s clustering is stable and accurate out of the box, so it can really help speed up your analytics!