Data Processing:

Let's first load the file and look at the descriptive statistics.

setwd("C:/Swapnil/Docs/Data Science/Talentica/Clustering/")
data=read.csv("EastWestAirlinesCluster.csv",header = T,stringsAsFactors = FALSE)
str(data)
## 'data.frame':    3999 obs. of  12 variables:
##  $ ID               : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Balance          : int  28143 19244 41354 14776 97752 16420 84914 20856 443003 104860 ...
##  $ Qual_miles       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ cc1_miles        : int  1 1 1 1 4 1 3 1 3 3 ...
##  $ cc2_miles        : int  1 1 1 1 1 1 1 1 2 1 ...
##  $ cc3_miles        : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Bonus_miles      : int  174 215 4123 500 43300 0 27482 5250 1753 28426 ...
##  $ Bonus_trans      : int  1 2 4 1 26 0 25 4 43 28 ...
##  $ Flight_miles_12mo: int  0 0 0 0 2077 0 0 250 3850 1150 ...
##  $ Flight_trans_12  : int  0 0 0 0 4 0 0 1 12 3 ...
##  $ Days_since_enroll: int  7000 6968 7034 6952 6935 6942 6994 6938 6948 6931 ...
##  $ Award            : int  0 0 0 0 1 0 0 1 1 1 ...

Here we have a data frame of 3999 observations and 12 variables.

What we see at once is that there are some categorical variables.

We could convert them to factor variables, but for hierarchical and k-means clustering we need numeric variables. Other techniques, like two-step clustering, are designed to deal with categorical variables. So I will keep the variables as numeric only.

The miles variables are interval based, so it is fine to use them as they are, and Award is a categorical variable. We have to standardize (basically min-max normalize) the numeric variables in order to use Euclidean distance.

## Remove unwanted fields
library(dplyr)          # for the pipe and mutate()/across() used below
data<-as_tibble(data)
invalid_cols<-c("ID")   # ID is just an identifier, not a clustering feature
valid_cols<-names(data)[!names(data) %in% invalid_cols]

# Function to standardize (min-max normalize) a variable to the [0, 1] range
standardize<-function(x){ 
    x<-(x-min(x))/(max(x)-min(x))
    return(x)
}
data.toUse<-data[,names(data) %in% valid_cols]
data.sd<-data.toUse%>%mutate(across(all_of(valid_cols),standardize))

If we don’t standardize the variables, then the variables measured on larger scales will contribute more to the Euclidean distance, and we will end up with a biased set of clusters.
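
As a quick illustration (with made-up numbers), compare the distance matrix of a tiny data frame before and after applying the standardize() function defined above:

# Toy example (made-up values): Balance is in the tens of thousands, Award is 0/1
toy <- data.frame(Balance = c(28000, 29000, 95000),
                  Award   = c(0, 1, 0))
dist(toy)                          # distances are driven almost entirely by Balance
dist(apply(toy, 2, standardize))   # after scaling, Award contributes comparably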

Hierarchical Clustering:

Hierarchical methods use a distance matrix as an input for the clustering algorithm. The choice of an appropriate metric will influence the shape of the clusters, as some elements may be close to one another according to one distance and farther away according to another.
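
For instance (toy coordinates), a point can have a different nearest neighbour depending on the metric:

# B is A's nearest neighbour under Euclidean distance,
# but C is nearer under Manhattan distance
pts <- rbind(A = c(0, 0), B = c(3, 3), C = c(4.5, 0))
dist(pts, method = "euclidean")   # d(A,B) ~ 4.24 < d(A,C) = 4.5
dist(pts, method = "manhattan")   # d(A,B) = 6.0  > d(A,C) = 4.5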

data.distMat<-dist(data.sd, method = "euclidean") # Euclidean distance matrix

We use the Euclidean distance as an input for the clustering algorithm (Ward’s minimum variance criterion minimizes the total within-cluster variance):

hc <- hclust(data.distMat,method="ward.D")

The clustering output can be displayed as a dendrogram:

plot(hc,labels=FALSE,hang=-1) # display dendrogram

# draw dendrogram with red borders around the 2 clusters
rect.hclust(hc, k=2, border="red")

We can clearly see here that two clusters are formed.

Comparison of the centroids

noOfCluster<-2
data.toUse$label<-cutree(hc, noOfCluster)
cent <- NULL
for(i in 1:noOfCluster){
    cent<- rbind(cent,colMeans(data.toUse[data.toUse$label==i,]))
    }
cent
##       Balance Qual_miles cc1_miles cc2_miles cc3_miles Bonus_miles
## [1,] 59807.84   87.60246  1.705322   1.01390  1.008737    10227.69
## [2,] 97053.05  240.19649  2.661715   1.01553  1.018231    28905.41
##      Bonus_trans Flight_miles_12mo Flight_trans_12 Days_since_enroll Award
## [1,]    9.142971          226.2923        0.652502          3820.652     0
## [2,]   15.782579          857.5010        2.599595          4625.062     1
##      label
## [1,]     1
## [2,]     2

We can clearly see that there are two sets of people: ones who are frequent flyers and others who are not. The ones marked as 2 have higher values across all the attributes. They have a high number of flight transactions and flight miles, and they also have higher bonus miles and bonus transactions from non-flight activity. So we can clearly mark cluster 1 as non-frequent flyers and cluster 2 as frequent flyers.

If we randomly remove 5% of the data and re-run the clustering, the dendrogram doesn’t change much:

set.seed(1234)
row.number<- sample(1:nrow(data), size=0.05*nrow(data))
data.toUse2<-data[-row.number,]
data.toUse2<-data.toUse2[,names(data) %in% valid_cols]
nrow(data.toUse2)
## [1] 3800
data.sd2<-data.toUse2%>%mutate(across(all_of(valid_cols),standardize))
data.distMat<-dist(data.sd2)
hc <- hclust(data.distMat,method="ward.D")
plot(hc,labels=FALSE,hang=-1)
rect.hclust(hc, k=2, border="red")
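
As a rough stability check (a sketch; note that the cluster numbering may be swapped between the two fits), we can cross-tabulate the labels of the rows that are present in both fits:

labels.subset <- cutree(hc, k = 2)              # hc was refit on the 95% subsample above
labels.full   <- data.toUse$label[-row.number]  # labels from the full-data fit, same rows
table(labels.full, labels.subset)               # a (near-)diagonal table suggests stable clusters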

K-Means Clustering:

K-Means Clustering is an unsupervised learning algorithm that tries to cluster data based on their similarity. In K-Means clustering, we have to specify the number of clusters we want the data to be grouped into. The algorithm randomly assigns each observation to a cluster and finds the centroid of each cluster. Then, the algorithm iterates through two steps:

  • Reassign each observation to the cluster whose centroid is closest.
  • Recompute each centroid as the mean of the observations currently assigned to it.

These two steps are repeated until the within-cluster variation cannot be reduced any further. The within-cluster variation is calculated as the sum of the Euclidean distances between the data points and their respective cluster centroids.
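
Just to make the two steps concrete, here is a minimal sketch of the iteration in plain R (an illustration only: the real kmeans() function uses the more refined Hartigan-Wong algorithm by default and handles details such as empty clusters, which this sketch ignores):

simple_kmeans <- function(x, k, iters = 10) {
  x <- as.matrix(x)
  centers <- x[sample(nrow(x), k), , drop = FALSE]        # random initial centroids
  for (it in 1:iters) {
    # Step 1: assign each observation to its nearest centroid (Euclidean distance)
    d <- as.matrix(dist(rbind(centers, x)))[-(1:k), 1:k]
    labels <- apply(d, 1, which.min)
    # Step 2: recompute each centroid as the mean of its assigned observations
    for (j in 1:k) centers[j, ] <- colMeans(x[labels == j, , drop = FALSE])
  }
  list(cluster = labels, centers = centers)
}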

data.toUse3<-data[,names(data) %in% valid_cols]
data.sd<-data.toUse3%>%mutate(across(all_of(valid_cols),standardize)) # Standardize the variables to [0, 1]

# K-Means
set.seed(1234)
noOfCluster.kmeans<-2
cl <- kmeans(data.sd,noOfCluster.kmeans,iter.max = 10,nstart = 123)

The object cl contains all the elements of the k-means output:

attributes(cl)
## $names
## [1] "cluster"      "centers"      "totss"        "withinss"    
## [5] "tot.withinss" "betweenss"    "size"         "iter"        
## [9] "ifault"      
## 
## $class
## [1] "kmeans"
# Centroids:
cl$centers
##      Balance  Qual_miles cc1_miles   cc2_miles   cc3_miles Bonus_miles
## 1 0.03508124 0.007858133 0.1763304 0.006949960 0.002184273  0.03878753
## 2 0.05692802 0.021546151 0.4154288 0.007765024 0.004557731  0.10962100
##   Bonus_trans Flight_miles_12mo Flight_trans_12 Days_since_enroll Award
## 1   0.1063136       0.007343099      0.01231136         0.4604114     0
## 2   0.1835184       0.027825584      0.04904896         0.5573984     1
# Cluster size:
cl$size
## [1] 2518 1481
data.toUse3$label<-cl$cluster
table(cl$cluster)
## 
##    1    2 
## 2518 1481
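
Since the fit also stores the sums of squares (totss, betweenss, withinss, as seen in the attributes above), we can check what share of the total variance the two-cluster solution explains:

# Proportion of total variance explained by the 2-cluster solution
cl$betweenss / cl$totss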

A fundamental question is how to determine the value of the parameter k. One approach is to look at the percentage of variance explained as a function of the number of clusters: we should choose a number of clusters such that adding another cluster doesn’t give much better modelling of the data. More precisely, if we plot the percentage of variance explained by the clusters against the number of clusters, the first clusters will add a lot of information (explain a lot of variance), but at some point the marginal gain drops, giving an angle in the graph. The number of clusters is chosen at this point, hence the “elbow criterion”.

wssplot <- function(data, nc=15, seed=1234){
  # For k = 1 the within-group sum of squares equals the total sum of squares
  wss <- (nrow(data)-1)*sum(apply(data,2,var))
  for (i in 2:nc){
    set.seed(seed)
    wss[i] <- sum(kmeans(data, centers=i)$withinss)}
  plot(1:nc, wss, type="b", xlab="Number of Clusters",
       ylab="Within groups sum of squares")}

wssplot(data.sd, nc=8)

From the plot, we observe that the elbow forms at K=2, which means we should use 2 clusters for this data.
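
The same elbow can also be read off the proportion of variance explained. A rough sketch, refitting k-means for k = 1 to 8:

set.seed(1234)
var.explained <- sapply(1:8, function(k) {
  fit <- kmeans(data.sd, centers = k, nstart = 25)
  fit$betweenss / fit$totss          # between-SS as a share of total-SS
})
plot(1:8, var.explained, type = "b",
     xlab = "Number of Clusters", ylab = "Proportion of variance explained")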

Examining the centroids of both models

  • Centroids from hierarchical clustering
cent <- NULL
for(i in 1:noOfCluster){
    cent<- rbind(cent,colMeans(data.toUse[data.toUse$label==i,]))
    }
cent
##       Balance Qual_miles cc1_miles cc2_miles cc3_miles Bonus_miles
## [1,] 59807.84   87.60246  1.705322   1.01390  1.008737    10227.69
## [2,] 97053.05  240.19649  2.661715   1.01553  1.018231    28905.41
##      Bonus_trans Flight_miles_12mo Flight_trans_12 Days_since_enroll Award
## [1,]    9.142971          226.2923        0.652502          3820.652     0
## [2,]   15.782579          857.5010        2.599595          4625.062     1
##      label
## [1,]     1
## [2,]     2
  • Centroids from k-means clustering
cent <- NULL
for(i in 1:noOfCluster.kmeans){
    cent<- rbind(cent,colMeans(data.toUse3[cl$cluster==i,]))
    }
cent
##       Balance Qual_miles cc1_miles cc2_miles cc3_miles Bonus_miles
## [1,] 59807.84   87.60246  1.705322   1.01390  1.008737    10227.69
## [2,] 97053.05  240.19649  2.661715   1.01553  1.018231    28905.41
##      Bonus_trans Flight_miles_12mo Flight_trans_12 Days_since_enroll Award
## [1,]    9.142971          226.2923        0.652502          3820.652     0
## [2,]   15.782579          857.5010        2.599595          4625.062     1
##      label
## [1,]     1
## [2,]     2

Here also we see two distinct sets of people, ones who fly frequently and others who don’t.
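
We can also check directly how closely the two solutions agree by cross-tabulating the labels (a quick sketch; the cluster numbering need not match between the two methods):

table(hierarchical = data.toUse$label, kmeans = cl$cluster)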

We should target the frequently flying customers. We can create box plots of all the variables for each cluster.

library(reshape2)   # for melt()
library(ggplot2)
data.toUse$label<-as.factor(data.toUse$label)
# ID was already removed from data.toUse, so we melt all remaining variables against the label
data.new<-melt(as.data.frame(data.toUse),id.var="label")
data.new<-as_tibble(data.new)
ggplot(data=data.new,aes(x=label,y=value))+geom_boxplot(aes(fill=label))+facet_wrap( ~ variable, scales="free")

We can see that the frequent flyers are the ones in group 2, so we would want to focus on that group. The people in this group have high values of credit-card miles (cc1_miles), bonus transactions, and days since enrollment. We can provide them with promotional offers, for example free upgrades to a higher class once they cross a certain level of miles.
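
As a final step (a sketch, assuming the rows of data.toUse are still in the original order, which nothing above changes), we could pull the IDs of the customers in cluster 2 for the campaign:

# IDs of the customers in the frequent-flyer cluster (label 2)
target.ids <- data$ID[data.toUse$label == "2"]
length(target.ids)   # how many customers we would target
head(target.ids)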