The data used for this analysis contain information on roughly 4,000 passengers (3,999 records in the file) who belong to an airline’s frequent flier program. For each passenger, the data include information on their mileage history and on the different ways they accrued or spent miles in the last year.
http://dataminingbook.com/ (EastWestAirlinesCluster.xls)
For the analysis, I am using the “cluster” package, along with dendextend, factoextra, fpc and NbClust.
library(cluster)
suppressPackageStartupMessages(library(dendextend))
library(factoextra)
## Loading required package: ggplot2
library(fpc)
library(NbClust)
For the sake of simplicity, the Excel file was converted to .csv before being read into R.
# Reading the .csv file as a data frame
AirLine_DF = read.csv("EastWestAirlinesCluster.csv")
# Reading the structure of the data
str(AirLine_DF)
## 'data.frame': 3999 obs. of 12 variables:
## $ ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Balance : int 28143 19244 41354 14776 97752 16420 84914 20856 443003 104860 ...
## $ Qual_miles : int 0 0 0 0 0 0 0 0 0 0 ...
## $ cc1_miles : int 1 1 1 1 4 1 3 1 3 3 ...
## $ cc2_miles : int 1 1 1 1 1 1 1 1 2 1 ...
## $ cc3_miles : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Bonus_miles : int 174 215 4123 500 43300 0 27482 5250 1753 28426 ...
## $ Bonus_trans : int 1 2 4 1 26 0 25 4 43 28 ...
## $ Flight_miles_12mo: int 0 0 0 0 2077 0 0 250 3850 1150 ...
## $ Flight_trans_12 : int 0 0 0 0 4 0 0 1 12 3 ...
## $ Days_since_enroll: int 7000 6968 7034 6952 6935 6942 6994 6938 6948 6931 ...
## $ Award : int 0 0 0 0 1 0 0 1 1 1 ...
As per the data dictionary provided, the variables cc1_miles, cc2_miles, cc3_miles and Award are categorical. Because Euclidean distance is not appropriate for categorical values, cc1_miles, cc2_miles and cc3_miles are converted to numeric by taking the average of the mileage range that each category represents.
We shall analyze the data using Euclidean distance with Single Linkage, Ward and Complete Linkage.
Note: The ID and Award variables are ignored for the clustering.
Convert cc1_miles, cc2_miles and cc3_miles to numeric by taking the average of each range:
AirLine_DF$cc1_miles = ifelse(AirLine_DF$cc1_miles==1,2500,
ifelse(AirLine_DF$cc1_miles==2,7500,
ifelse(AirLine_DF$cc1_miles==3,17500,
ifelse(AirLine_DF$cc1_miles==4,32500,
ifelse(AirLine_DF$cc1_miles==5,50000,0)))))
AirLine_DF$cc2_miles = ifelse(AirLine_DF$cc2_miles==1,2500,
ifelse(AirLine_DF$cc2_miles==2,7500,
ifelse(AirLine_DF$cc2_miles==3,17500,
ifelse(AirLine_DF$cc2_miles==4,32500,
ifelse(AirLine_DF$cc2_miles==5,50000,0)))))
AirLine_DF$cc3_miles = ifelse(AirLine_DF$cc3_miles==1,2500,
ifelse(AirLine_DF$cc3_miles==2,7500,
ifelse(AirLine_DF$cc3_miles==3,17500,
ifelse(AirLine_DF$cc3_miles==4,32500,
ifelse(AirLine_DF$cc3_miles==5,50000,0)))))
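The three nested ifelse chains above apply the same mapping to each column. As a minimal sketch (the helper name `recode_miles` and the vector `range_midpoints` are hypothetical, not part of the original analysis), the same recoding could be written once as a lookup:

```r
# Hypothetical helper: map category codes 1..5 to the values used above.
range_midpoints <- c(2500, 7500, 17500, 32500, 50000)

recode_miles <- function(code) {
  out <- range_midpoints[match(code, 1:5)]  # NA for any code outside 1..5
  out[is.na(out)] <- 0                      # fallback of 0, as in the nested ifelse
  out                                       # note: NA inputs also become 0 here
}

recode_miles(c(1, 4, 3, 5, 9))
```

It could then be applied to each of the three columns in turn, for example `AirLine_DF$cc1_miles = recode_miles(AirLine_DF$cc1_miles)`.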
The variables are on very different scales, so the data are standardized to mean 0 and standard deviation 1.
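To illustrate what this standardization does, here is a small sketch on a hypothetical two-column data frame (`toy` is made up for illustration). It checks that `scale()` matches a z-score computed by hand, so that a column measured in thousands and a column measured in single digits end up on the same footing:

```r
# Toy data (assumption): two variables on very different scales.
toy <- data.frame(balance = c(1000, 5000, 9000), trans = c(1, 2, 3))

# Manual z-score: subtract the column mean, divide by the column SD.
z_manual <- sapply(toy, function(v) (v - mean(v)) / sd(v))

# Same result from scale(), which centers and scales each column.
z_scale <- scale(toy)

all.equal(unname(z_manual), unname(z_scale), check.attributes = FALSE)
```

Without this step, Euclidean distance would be dominated by whichever variable has the largest raw values.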
data = scale(AirLine_DF)
d <- dist(data[,2:11], method = "euclidean")
fit <- hclust(d, method="ward.D2")
fit <- as.dendrogram(fit)
cd = color_branches(fit,k=3) #Coloured dendrogram branches
plot(cd)
For the Ward method, 3 clusters are formed at height 85, and at height 75 we see 5 clusters.
Because there is a large number of observations, interpreting the dendrogram is cumbersome. Below height 20 the tree splits into so many clusters that it becomes very difficult to read.
The tree is therefore divided into 3 clusters, and each observation is assigned to its cluster.
groups <- cutree(fit, k=3) # cut tree into 3 clusters; height 85.
# Number of observations in each cluster
table(groups)
## groups
## 1 2 3
## 2850 405 744
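Cutting by number of clusters and cutting by height are two views of the same operation. A minimal sketch on synthetic data (the matrix `toy` and all values below are assumptions for illustration, not the airline data) shows that a height chosen between the last two relevant merges gives the same partition as `k = 3`:

```r
set.seed(1)
# Synthetic data (assumption): three well-separated groups of 10 points.
toy <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
             matrix(rnorm(20, mean = 5), ncol = 2),
             matrix(rnorm(20, mean = 10), ncol = 2))

hc <- hclust(dist(toy), method = "ward.D2")

# Cut by number of clusters.
by_k <- stats::cutree(hc, k = 3)

# Cut by height: pick a height between the merge that leaves 3 clusters
# (merge n - 3) and the merge that leaves 2 clusters (merge n - 2).
n <- nrow(toy)
h_cut <- mean(hc$height[c(n - 3, n - 2)])
by_h <- stats::cutree(hc, h = h_cut)

all(by_k == by_h)
```

Specifying `k` is usually more convenient than reading a height off a crowded dendrogram, which is why the analysis above cuts with `k=3`.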
g1 = aggregate(AirLine_DF[,2:11],list(groups),median)
data.frame(Cluster=g1[,1],Freq=as.vector(table(groups)),g1[,-1])
## Cluster Freq Balance Qual_miles cc1_miles cc2_miles cc3_miles
## 1 1 2850 31419.0 0 2500 2500 2500
## 2 2 405 79333.0 0 2500 2500 2500
## 3 3 744 97990.5 0 32500 2500 2500
## Bonus_miles Bonus_trans Flight_miles_12mo Flight_trans_12
## 1 3340.5 7 0 0
## 2 16314.0 19 2400 7
## 3 45359.0 17 0 0
## Days_since_enroll
## 1 3739
## 2 4190
## 3 5048
From the above table it seems that the cluster solution separates the groups by customer tenure (Days_since_enroll), bonus miles and balance.
Cluster 1 (Non-frequent Travellers): Has the highest number of observations (2,850). This cluster contains customers who are relatively new, with low balance and bonus miles.
Cluster 3 (Frequent Flyers): Has far fewer observations (744). This cluster contains the longest-enrolled customers, with both high balance and high bonus miles.
Cluster 2 (Middle level customers): Falls between clusters 1 and 3 on balance, bonus miles and days since enrollment. It is the smallest group (405) and the only one with a non-zero median for Flight_miles_12mo.
# function to find the centroid (column means) of cluster i
centroid = function(i, dat, groups)
{
ind = (groups == i)
colMeans(dat[ind,])
}
sapply(unique(groups), centroid, AirLine_DF[,2:11], groups)
## [,1] [,2] [,3]
## Balance 4.774137e+04 1.144233e+05 1.504400e+05
## Qual_miles 6.834491e+01 7.659630e+02 9.585484e+01
## cc1_miles 5.852632e+03 1.425926e+04 3.707997e+04
## cc2_miles 2.545614e+03 3.080247e+03 2.500000e+03
## cc3_miles 2.501754e+03 2.500000e+03 3.155242e+03
## Bonus_miles 6.854256e+03 2.554787e+04 5.199022e+04
## Bonus_trans 8.415789e+00 2.077037e+01 1.881586e+01
## Flight_miles_12mo 1.217372e+02 3.203721e+03 2.625067e+02
## Flight_trans_12 4.017544e-01 9.150617e+00 8.629032e-01
## Days_since_enroll 3.882286e+03 4.205486e+03 4.976321e+03
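The helper above returns column means, which is a centroid; a medoid, by contrast, is an actual observation, namely the one with the smallest total distance to the other members of its cluster. A minimal sketch of the difference on hypothetical toy points (not the airline data):

```r
# Toy points (assumption): three close together and one far away.
pts <- matrix(c(0, 0,
                2, 0,
                0, 1,
                10, 10), ncol = 2, byrow = TRUE)

# Centroid: the mean of each coordinate; need not be an observed point.
centroid <- colMeans(pts)

# Medoid: the observed point minimizing the sum of distances to all others.
d <- as.matrix(dist(pts))
medoid <- pts[which.min(rowSums(d)), ]

centroid
medoid
```

The far-away point pulls the centroid toward it, while the medoid stays on one of the three nearby points. This is also why the medians and the means reported above can differ so much for skewed variables such as Balance.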
With two random samples of 95% of the data (sample1 and sample2), the dendrograms differ, so further analysis/clustering would be needed to draw stable insights.
SampleSize = as.integer(0.95 * nrow(data))
sample1 = data[sample(1:nrow(data), SampleSize,replace=FALSE),]
dist.1 = dist(sample1[,2:11], method = "euclidean")
fit.1 <- hclust(dist.1, method="ward.D2")
fit.1 <- as.dendrogram(fit.1)
cd = color_branches(fit.1,k=3) #Coloured dendrogram branches
plot(cd)
sample2 = data[sample(1:nrow(data), SampleSize,replace=FALSE),]
dist.2 = dist(sample2[,2:11], method = "euclidean")
fit.2 <- hclust(dist.2, method="ward.D2")
fit.2 <- as.dendrogram(fit.2)
cd = color_branches(fit.2,k=3) #Coloured dendrogram branches
plot(cd)
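Comparing two dendrograms by eye is hard with this many observations. One possible way to quantify the comparison, sketched here on synthetic data with a fixed seed (the function `cluster_sample` and the matrix `toy` are hypothetical), is to cross-tabulate the cluster labels of the observations that appear in both samples:

```r
set.seed(42)  # assumption: fixed seed so the subsamples are reproducible
# Synthetic data (assumption): three well-separated groups of 30 points.
toy <- rbind(matrix(rnorm(60, mean = 0), ncol = 2),
             matrix(rnorm(60, mean = 6), ncol = 2),
             matrix(rnorm(60, mean = 12), ncol = 2))
rownames(toy) <- seq_len(nrow(toy))

# Hypothetical helper: cluster a random subsample and return labels
# named by the original row names.
cluster_sample <- function(x, frac = 0.95, k = 3) {
  idx <- sort(sample(nrow(x), as.integer(frac * nrow(x))))
  hc <- hclust(dist(x[idx, ]), method = "ward.D2")
  labels <- stats::cutree(hc, k = k)
  names(labels) <- rownames(x)[idx]
  labels
}

lab1 <- cluster_sample(toy)
lab2 <- cluster_sample(toy)

# Compare labels only on the rows present in both subsamples.
common <- intersect(names(lab1), names(lab2))
agreement <- table(sample1 = lab1[common], sample2 = lab2[common])
agreement
```

If the clustering is stable, each row of the table has a single non-zero cell; counts spread across several cells indicate that the two samples partition the shared observations differently.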
# K-means with 3 clusters
km.3 <- eclust(data[,2:11], "kmeans", k = 3, nstart = 25, graph = TRUE)
# K-means with 5 clusters
km.5 <- eclust(data[,2:11], "kmeans", k = 5, nstart = 25, graph = TRUE)
As observed from the above K-means plots, with 5 clusters the data for clusters 3 and 2 overlap heavily with the others. Hence we can consider using 3 clusters. There are also outliers in clusters 1 and 3, as seen in the first plot above.
fviz_silhouette(km.3)
## cluster size ave.sil.width
## 1 1 924 0.18
## 2 2 2914 0.46
## 3 3 161 0.04
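The silhouette width of an observation is (b - a) / max(a, b), where a is its mean distance to the other members of its own cluster and b is its mean distance to the nearest other cluster. Values near 1 indicate a well-placed point, and values near 0 (such as the 0.04 average for cluster 3 above) indicate points lying between clusters. A minimal sketch using `cluster::silhouette` on synthetic data (the matrix `toy` is an assumption for illustration):

```r
library(cluster)

set.seed(7)
# Synthetic data (assumption): two well-separated groups of 20 points.
toy <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
             matrix(rnorm(40, mean = 5), ncol = 2))

km <- kmeans(toy, centers = 2, nstart = 10)

# Silhouette values from the cluster labels and the pairwise distances.
sil <- silhouette(km$cluster, dist(toy))

summary(sil)$clus.avg.widths  # average width per cluster
summary(sil)$avg.width        # overall average width
```

The per-cluster averages are the same quantity reported in the ave.sil.width column printed by `fviz_silhouette`.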
# PAM with 3 clusters
pam.3 <- eclust(data[,2:11], "pam", k = 3, graph = TRUE)
# PAM with 5 clusters
pam.5 <- eclust(data[,2:11], "pam", k = 5, graph = TRUE)
fviz_silhouette(pam.3)
## cluster size ave.sil.width
## 1 1 1372 0.21
## 2 2 1071 0.00
## 3 3 1556 0.30
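One way to compare candidate values of k more systematically is to compute the overall average silhouette width for each k and pick the largest. The following is a sketch on synthetic data under stated assumptions (the matrix `toy` and the range 2 to 5 are chosen for illustration), using `cluster::pam` directly rather than `eclust`:

```r
library(cluster)

set.seed(11)
# Synthetic data (assumption): three well-separated groups of 20 points.
toy <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
             matrix(rnorm(40, mean = 6), ncol = 2),
             matrix(rnorm(40, mean = 12), ncol = 2))

# Overall average silhouette width of a PAM solution for each k.
ks <- 2:5
avg_width <- sapply(ks, function(k) pam(toy, k)$silinfo$avg.width)
names(avg_width) <- ks

avg_width
best_k <- as.integer(names(which.max(avg_width)))
best_k
```

On the airline data the same loop could be run over `data[,2:11]`, although with low widths such as those reported above the maximum should be read as a rough guide rather than a definitive answer.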