Team Members
Context:
The file EastWestAirlinesCluster.xls is available on the textbook website https://www.dataminingbook.com/book/r-edition which contains information on 4000 passengers who belong to an airline’s frequent flyer program.
For each passenger the data includes information on their mileage history and on different ways they accrued or spent miles in the last year.
Objective:
We are trying to learn more about EastWest Airlines’ customers based on their flying patterns, earning and use of frequent flyer rewards, and use of the airline credit card.
The primary objective is to identify customer segments via clustering and design targeted marketing campaigns for each segment.
knitr::opts_chunk$set(echo = TRUE)
library("dummies")
library("dendextend")
library("gridExtra")
library("cluster")
library("factoextra")
library("MASS")
library("fpc")
library("clValid")
library("optCluster")
airlines_1 <- read.csv("E:/Vinni_USA/MSIS coursework docs/Spring-20/4th Flex/Data Mining for BI/Unsupervised learning project/Final project/EastWestAirlinesCluster.csv", header = TRUE)
head(airlines_1)
## ï..ID. Balance Qual_miles cc1_miles cc2_miles cc3_miles Bonus_miles
## 1 1 28143 0 1 1 1 174
## 2 2 19244 0 1 1 1 215
## 3 3 41354 0 1 1 1 4123
## 4 4 14776 0 1 1 1 500
## 5 5 97752 0 4 1 1 43300
## 6 6 16420 0 1 1 1 0
## Bonus_trans Flight_miles_12mo Flight_trans_12 Days_since_enroll Award.
## 1 1 0 0 7000 0
## 2 2 0 0 6968 0
## 3 4 0 0 7034 0
## 4 1 0 0 6952 0
## 5 26 2077 4 6935 1
## 6 0 0 0 6942 0
# Removing the ID# column from dataset
airlines_2 <- airlines_1[c(-1,-12)]
head(airlines_2)
## Balance Qual_miles cc1_miles cc2_miles cc3_miles Bonus_miles Bonus_trans
## 1 28143 0 1 1 1 174 1
## 2 19244 0 1 1 1 215 2
## 3 41354 0 1 1 1 4123 4
## 4 14776 0 1 1 1 500 1
## 5 97752 0 4 1 1 43300 26
## 6 16420 0 1 1 1 0 0
## Flight_miles_12mo Flight_trans_12 Days_since_enroll
## 1 0 0 7000
## 2 0 0 6968
## 3 0 0 7034
## 4 0 0 6952
## 5 2077 4 6935
## 6 0 0 6942
str(airlines_2)
## 'data.frame': 3999 obs. of 10 variables:
## $ Balance : int 28143 19244 41354 14776 97752 16420 84914 20856 443003 104860 ...
## $ Qual_miles : int 0 0 0 0 0 0 0 0 0 0 ...
## $ cc1_miles : int 1 1 1 1 4 1 3 1 3 3 ...
## $ cc2_miles : int 1 1 1 1 1 1 1 1 2 1 ...
## $ cc3_miles : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Bonus_miles : int 174 215 4123 500 43300 0 27482 5250 1753 28426 ...
## $ Bonus_trans : int 1 2 4 1 26 0 25 4 43 28 ...
## $ Flight_miles_12mo: int 0 0 0 0 2077 0 0 250 3850 1150 ...
## $ Flight_trans_12 : int 0 0 0 0 4 0 0 1 12 3 ...
## $ Days_since_enroll: int 7000 6968 7034 6952 6935 6942 6994 6938 6948 6931 ...
summary(airlines_2)
## Balance Qual_miles cc1_miles cc2_miles
## Min. : 0 Min. : 0.0 Min. :1.00 Min. :1.000
## 1st Qu.: 18528 1st Qu.: 0.0 1st Qu.:1.00 1st Qu.:1.000
## Median : 43097 Median : 0.0 Median :1.00 Median :1.000
## Mean : 73601 Mean : 144.1 Mean :2.06 Mean :1.015
## 3rd Qu.: 92404 3rd Qu.: 0.0 3rd Qu.:3.00 3rd Qu.:1.000
## Max. :1704838 Max. :11148.0 Max. :5.00 Max. :3.000
## cc3_miles Bonus_miles Bonus_trans Flight_miles_12mo
## Min. :1.000 Min. : 0 Min. : 0.0 Min. : 0.0
## 1st Qu.:1.000 1st Qu.: 1250 1st Qu.: 3.0 1st Qu.: 0.0
## Median :1.000 Median : 7171 Median :12.0 Median : 0.0
## Mean :1.012 Mean : 17145 Mean :11.6 Mean : 460.1
## 3rd Qu.:1.000 3rd Qu.: 23801 3rd Qu.:17.0 3rd Qu.: 311.0
## Max. :5.000 Max. :263685 Max. :86.0 Max. :30817.0
## Flight_trans_12 Days_since_enroll
## Min. : 0.000 Min. : 2
## 1st Qu.: 0.000 1st Qu.:2330
## Median : 0.000 Median :4096
## Mean : 1.374 Mean :4119
## 3rd Qu.: 1.000 3rd Qu.:5790
## Max. :53.000 Max. :8296
# Converting categorical miles data into numerical data with taking the average of the range
airlines_2$cc1_miles = ifelse(airlines_2$cc1_miles==1,2500,
ifelse(airlines_2$cc1_miles==2,7500,
ifelse(airlines_2$cc1_miles==3,17500,
ifelse(airlines_2$cc1_miles==4,32500,
ifelse(airlines_2$cc1_miles==5,50000,0)))))
airlines_2$cc2_miles = ifelse(airlines_2$cc2_miles==1,2500,
ifelse(airlines_2$cc2_miles==2,7500,
ifelse(airlines_2$cc2_miles==3,17500,
ifelse(airlines_2$cc2_miles==4,32500,
ifelse(airlines_2$cc2_miles==5,50000,0)))))
airlines_2$cc3_miles = ifelse(airlines_2$cc3_miles==1,2500,
ifelse(airlines_2$cc3_miles==2,7500,
ifelse(airlines_2$cc3_miles==3,17500,
ifelse(airlines_2$cc3_miles==4,32500,
ifelse(airlines_2$cc3_miles==5,50000,0)))))
head(airlines_2)
## Balance Qual_miles cc1_miles cc2_miles cc3_miles Bonus_miles Bonus_trans
## 1 28143 0 2500 2500 2500 174 1
## 2 19244 0 2500 2500 2500 215 2
## 3 41354 0 2500 2500 2500 4123 4
## 4 14776 0 2500 2500 2500 500 1
## 5 97752 0 32500 2500 2500 43300 26
## 6 16420 0 2500 2500 2500 0 0
## Flight_miles_12mo Flight_trans_12 Days_since_enroll
## 1 0 0 7000
## 2 0 0 6968
## 3 0 0 7034
## 4 0 0 6952
## 5 2077 4 6935
## 6 0 0 6942
# The scale of the variables in the data is varying; hence normalizing the data with Mean=0 and SD=1
airlines_3 <- scale (airlines_2)
head(airlines_3)
## Balance Qual_miles cc1_miles cc2_miles cc3_miles Bonus_miles
## [1,] -0.4510844 -0.1862754 -0.6802649 -0.09077821 -0.05870488 -0.7026984
## [2,] -0.5393894 -0.1862754 -0.6802649 -0.09077821 -0.05870488 -0.7010007
## [3,] -0.3199912 -0.1862754 -0.6802649 -0.09077821 -0.05870488 -0.5391853
## [4,] -0.5837255 -0.1862754 -0.6802649 -0.09077821 -0.05870488 -0.6892000
## [5,] 0.2396479 -0.1862754 1.3577268 -0.09077821 -0.05870488 1.0829857
## [6,] -0.5674121 -0.1862754 -0.6802649 -0.09077821 -0.05870488 -0.7099031
## Bonus_trans Flight_miles_12mo Flight_trans_12 Days_since_enroll
## [1,] -1.1039265 -0.3285622 -0.3621226 1.395280
## [2,] -0.9998011 -0.3285622 -0.3621226 1.379784
## [3,] -0.7915505 -0.3285622 -0.3621226 1.411744
## [4,] -1.1039265 -0.3285622 -0.3621226 1.372037
## [5,] 1.4992070 1.1547876 0.6924037 1.363805
## [6,] -1.2080518 -0.3285622 -0.3621226 1.367195
If Normalization of data is not done, then the following issues will influence the model -
Distance measured will be wrongly calculated, if all the variables are not with equal weight
The variable with the largest scale will dominate the measure
k.max <- 10
wss <- sapply(1:k.max, function(k){ kmeans(airlines_3[,-1], k)$tot.withinss})
plot(1:k.max, wss,
type="b", pch = 19, frame = FALSE,
xlab="Number of clusters K",
ylab="Total within-clusters sum of squares")
Interpretation:
As seen from the elbow graph, the slope changes at k=2. However, since spltting the dataset into 2 groups would not be very beneficial for segmenting frequent flyers, we further evaluate clusters for higher values of k.
# K-means clustering with 3 clusters
k_mean_cluster_3 <- eclust(airlines_3, "kmeans", k = 3, nstart = 25, graph = TRUE)
# K-means clustering with 4 clusters
k_mean_cluster_4 <- eclust(airlines_3, "kmeans", k = 4, nstart = 25, graph = TRUE)
# K-means clustering with 5 clusters
k_mean_cluster_5 <- eclust(airlines_3, "kmeans", k = 5, nstart = 25, graph = TRUE)
Interpretation:
As we see in the K-means clustering graph, for values of k > 3, the number of overlapping observations with neighboring clusters increase. This implies that dividing the dataset into 3 clusters would distinctly or unambiguously segment the frequent flyers.
#Checking cluster stability with K means Silhouette
fviz_silhouette(k_mean_cluster_3)
## cluster size ave.sil.width
## 1 1 161 0.04
## 2 2 2914 0.46
## 3 3 924 0.18
fviz_silhouette(k_mean_cluster_4)
## cluster size ave.sil.width
## 1 1 167 0.02
## 2 2 11 0.60
## 3 3 922 0.23
## 4 4 2899 0.44
fviz_silhouette(k_mean_cluster_5)
## cluster size ave.sil.width
## 1 1 11 0.60
## 2 2 158 0.02
## 3 3 1357 0.17
## 4 4 821 0.22
## 5 5 1652 0.34
Interpretation:
As observed from the silhouette coefficients, the value of average.silhouette.width is maximum (closest to +1) for k=3 compared to other k values indicating that dividing the data set into 3 clusters is stable.
Further, when k=3, it can be observed from the cluster size that the second cluster contains major chunk (72.86%) of the data followed by the third cluster containing 23.1% and then the first cluster with only 4.04% of the data.
# Calculating Distance
# Dissimilarity matrix
airlines_euclidian_dist <- dist(airlines_3, method = "euclidean")
str(airlines_euclidian_dist)
## 'dist' num [1:7994001] 0.137 0.377 0.135 4.232 0.159 ...
## - attr(*, "Size")= int 3999
## - attr(*, "Diag")= logi FALSE
## - attr(*, "Upper")= logi FALSE
## - attr(*, "method")= chr "euclidean"
## - attr(*, "call")= language dist(x = airlines_3, method = "euclidean")
# Hierarchical clustering using Ward's method
airline_fit <- hclust(airlines_euclidian_dist, method="ward.D2")
airline_fit <- as.dendrogram(airline_fit)
plot(airline_fit, cex = 0.6)
abline(h=85,col="red",lty=2)
Interpretation:
From the Ward method, we see that as the height increases the clusters get grouped together.
As there are larger numbers of observations, if we dont limit the number of clusters, interpreting the results will become cumbersome. Ex: At height <20, there are multiple clusters created and it is very difficult to comprehend the data.
Hence, we decided to cut the tree at height 85 to obtain 3 clusters and then assigned each cluster with its respective observations.
#cutting the tree into 3 clusters
hierarchical_cluster <- cutree(airline_fit, k=3)
cd = color_branches(airline_fit,k=3) #Coloured dendrogram branches
plot(cd)
# Number of observations in each cluster
table(hierarchical_cluster)
## hierarchical_cluster
## 1 2 3
## 2850 405 744
Interpretation:
When k=3, it can be observed from the cluster size that the first cluster contains major chunk (71.27%) of the data followed by the third cluster containing 18.6% and then the second cluster with only 10.13% of the data.`
#Checking for optimal k-value
intern <- clValid(airlines_3, 2:6, clMethods=c("hierarchical"),
validation="internal", maxitems=nrow(airlines_3))
## Warning in clValid(airlines_3, 2:6, clMethods = c("hierarchical"), validation =
## "internal", : rownames for data not specified, using 1:nrow(data)
summary (intern)
##
## Clustering Methods:
## hierarchical
##
## Cluster sizes:
## 2 3 4 5 6
##
## Validation Measures:
## 2 3 4 5 6
##
## hierarchical Connectivity 5.3825 9.6401 9.6401 17.5944 22.6671
## Dunn 0.2058 0.2162 0.2203 0.1863 0.1863
## Silhouette 0.8275 0.7907 0.7837 0.7461 0.7350
##
## Optimal Scores:
##
## Score Method Clusters
## Connectivity 5.3825 hierarchical 2
## Dunn 0.2203 hierarchical 4
## Silhouette 0.8275 hierarchical 2
Interpretation:
Internal measures -
Connectivity corresponds to what extent items are placed in the same cluster as their nearest neighbors in the data space. The connectivity has a value between 0 and infinity and should be minimized.
The Dunn index is another internal clustering validation measure which is defined as the ratio of inter-cluster separation to intracluster compactness. The Dunn index should be maximized.
The silhouette coefficients measure how well an observation is clustered and it estimates the average distance between clusters. Observations with a large Si (almost 1) are very well clustered.
As observed from the output, the above combination of values are closer to the ideal for k=3 than the other inputs.This indicates that the optimal number of clusters is 3.
stab <- matrix(0,nrow=ncol(airlines_3),ncol=4)
colnames(stab) <- c("APN","AD","ADM","FOM")
## Need loop over all removed samples
for (del in 1:ncol(airlines_3)) {
matDel <- airlines_3[,-del]
DistDel <- dist(matDel,method="euclidean")
clusterObjDel <- hclust(DistDel, method="ward.D2")
clusterDel <- cutree(clusterObjDel,3)
stab[del,] <- stability(airlines_3, airlines_euclidian_dist, del, hierarchical_cluster, clusterDel)
}
colMeans(stab)
## APN AD ADM FOM
## 0.08004969 2.99112627 0.57682079 0.93819405
Interpretation:
Stability measures -
The APN, AD, and ADM are all based on the cross-classification table of the original clustering on the full data with the clustering based on the removal of one column.
The APN measures the average proportion of observations not placed in the same cluster by clustering based on the full data and clustering based on the data with a single column removed.
The AD measures the average distance between observations placed in the same cluster under both cases (full data set and removal of one column).
The ADM measures the average distance between cluster centers for observations placed in the same cluster under both cases.
The FOM measures the average intra-cluster variance of the deleted column, where the clustering is based on the remaining (undeleted) columns.
Upon comparing the above values for k=3 with the industry standards, we can conclude that the classification is stable.
#Finding the mean of all the values
df1 <- data.frame(airlines_3, k_mean_cluster_3$cluster) # append cluster membership
Cluster_mean1 <- aggregate(df1, by=list(k_mean_cluster_3$cluster), FUN=mean)
Cluster_mean1
## Group.1 Balance Qual_miles cc1_miles cc2_miles cc3_miles
## 1 1 1.1543624 8.449235e-01 0.1414719 0.15632265 -0.05870488
## 2 2 -0.2623636 -4.665237e-02 -0.4864202 0.01673483 -0.05788699
## 3 3 0.6262719 -9.487725e-05 1.5093631 -0.08001434 0.19278589
## Bonus_miles Bonus_trans Flight_miles_12mo Flight_trans_12 Days_since_enroll
## 1 0.5954161 1.6408433 3.56067928 3.85598254 0.2775388
## 2 -0.4447171 -0.3650082 -0.17881692 -0.19357570 -0.1572599
## 3 1.2987486 0.8652144 -0.05649011 -0.06139999 0.4475884
## k_mean_cluster_3.cluster
## 1 1
## 2 2
## 3 3
#To find the size of clusters
Cluster_sum <- aggregate(df1, by=list(k_mean_cluster_3$cluster), FUN=sum)
d <- transform(Cluster_sum, clusterSize = k_mean_cluster_3.cluster / Group.1)
d <- transform(d, k_mean_cluster_3.cluster= k_mean_cluster_3.cluster/ clusterSize)
d
## Group.1 Balance Qual_miles cc1_miles cc2_miles cc3_miles Bonus_miles
## 1 1 185.8524 136.03268021 22.77698 25.16795 -9.451486 95.86199
## 2 2 -764.5276 -135.94501363 -1417.42851 48.76531 -168.682675 -1295.90566
## 3 3 578.6753 -0.08766658 1394.65153 -73.93325 178.134161 1200.04367
## Bonus_trans Flight_miles_12mo Flight_trans_12 Days_since_enroll
## 1 264.1758 573.26936 620.81319 44.68374
## 2 -1063.6339 -521.07250 -564.07960 -458.25540
## 3 799.4581 -52.19686 -56.73359 413.57166
## k_mean_cluster_3.cluster clusterSize
## 1 1 161
## 2 2 2914
## 3 3 924
#Appending the size of clusters with the mean of all other values
Cluster_mean1$clusterSize <- d$clusterSize
Cluster_mean1$clusterPCT <- (d$clusterSize*100)/3999
# transpose to change from horizontal to vertical
temp1 <- t(Cluster_mean1)
round_df <- function(x, digits) {
# round all numeric variables
# x: data frame
# digits: number of digits to round
numeric_columns <- sapply(x, class) == 'numeric'
x[numeric_columns] <- round(x[numeric_columns], digits)
x
}
temp2 <- round_df (temp1,2)
temp2
## [,1] [,2] [,3]
## Group.1 1.00 2.00 3.00
## Balance 1.15 -0.26 0.63
## Qual_miles 0.84 -0.05 0.00
## cc1_miles 0.14 -0.49 1.51
## cc2_miles 0.16 0.02 -0.08
## cc3_miles -0.06 -0.06 0.19
## Bonus_miles 0.60 -0.44 1.30
## Bonus_trans 1.64 -0.37 0.87
## Flight_miles_12mo 3.56 -0.18 -0.06
## Flight_trans_12 3.86 -0.19 -0.06
## Days_since_enroll 0.28 -0.16 0.45
## k_mean_cluster_3.cluster 1.00 2.00 3.00
## clusterSize 161.00 2914.00 924.00
## clusterPCT 4.03 72.87 23.11
Interpretation:
From the k-means table, we can develop a profile for customers belonging to each cluster as follows -
Cluster1 (High Frequency Travellers): This cluster has the lowest number of customers (4.03%). These customers travel very frequently and have covered the maximum flight miles in the past 12 months.Hence, they have accrued the maximum balance.They have also made a significant number of non-flight transactions in the past 12 months. They also have the maximum number of miles counted as qualifying for Top Flight Status.
Cluster2 (Not-so-frequent Travellers): This cluster has the highest number of customers (72.87%). They are new customers who don’t travel as frequently compared to the other two segments and have the least flight miles eligible for award travel.
Cluster3 (Loyal Customers): This cluster consists 23.11% of the customers. They are the oldest customers who have been travelling consistently since the time of enrollment. They use the frequent flyer credit card very often and have earned the maximum number of miles from non-flight bonus transactions in the past 12 months.
#Finding the mean of all the values
Cluster_mean2 <- aggregate(airlines_3,list(hierarchical_cluster),mean)
#Appending the size of clusters with the mean of all other values
df2 <- data.frame(Cluster=Cluster_mean2[,1],Cluster_sum=as.vector(table(hierarchical_cluster)),Cluster_mean2[,-1])
# transpose to change from horizontal to vertical
temp3 <- t(df2)
round_df <- function(x, digits) {
# round all numeric variables
# x: data frame
# digits: number of digits to round
numeric_columns <- sapply(x, class) == 'numeric'
x[numeric_columns] <- round(x[numeric_columns], digits)
x
}
temp4 <- round_df(temp3, 2)
temp4
## [,1] [,2] [,3]
## Cluster 1.00 2.00 3.00
## Cluster_sum 2850.00 405.00 744.00
## Balance -0.26 0.41 0.76
## Qual_miles -0.10 0.80 -0.06
## cc1_miles -0.45 0.12 1.67
## cc2_miles -0.05 0.49 -0.09
## cc3_miles -0.06 -0.06 0.25
## Bonus_miles -0.43 0.35 1.44
## Bonus_trans -0.33 0.95 0.75
## Flight_miles_12mo -0.24 1.96 -0.14
## Flight_trans_12 -0.26 2.05 -0.13
## Days_since_enroll -0.11 0.04 0.42
Interpretation:
From the hierarchical table, we can develop a profile for customers belonging to each cluster as follows -
Cluster1 (Not-so-frequent Travellers): This cluster has the highest number of customers (10.12%). They are new customers who don’t travel as frequently compared to the other two segments and have the least flight miles eligible for award travel.
Cluster2 (High Frequency Travellers): This cluster has the lowest number of customers (71.26%) and these customers travel very frequently. When compared to the other two segments, they have covered the maximum number flight miles making the most number of non-flight transactions in the past 12 months. They also have the maximum number of miles counted as qualifying for Top Flight Status.
Cluster3 (Loyal Customers): This cluster consists 18.6% of the customers. They are the oldest customers who have the highest number of flight miles eligible for award travel. They use the frequent flyer credit card very often and have earned the maximum number of miles from non-flight bonus transactions in the past 12 months.
Although the outputs of both the alogirthms are very similar, there exist the following differences -
High Frequency Travellers have more “balance” in K-means clustering whereas Loyal customers have more “balance” in hierarchical clustering.
Hierarchical clustering was able to explain that the High Frequency Travellers tend to use their rewards credit card more than their frequent flyer credit card. The K-means algorithm failed to provide this evidence.
We see that the values of Flight_miles_12mo and Flight_trans_12, are extreme in case of k-means clustering and more balanced in case of hierarchical clustering.
Hierarchical clustering also seems to be more insightful when comparing Bonus_miles to Bonus_trans as it provides substantial evidence that High-Frequency Fliers utilize more number of bonus miles than Loyal customers.
Upon qualitative evaluation of the above differences, we believe that Hierachical clustering describes our dataset more accurately than k-means. This can be attributed to the categorical nature of the dataset.
Further, although k-means provides faster results, since our dataset is not very big and not dynamic, creating new trees for analysis is not very time consuming.
Hence, we can conclude that Hierarchical clustering is a better option than k-means for segmenting the frequent flyers.
Based on our observations from the clustering analysis, we propose the following marketing campaigns to target specific segments of frequent flyers. Adopting these recommendations could increase the bottomline of EastWest Airlines.
Business Proposal for Not-so-frequent Travelers:
Since most of the customers (~72%) fall into this cluster, there is a potential business growth opportunity if these customers are converted into High Frequency Travelers.
Potential Offers:
If frequent flyer credit card is used then the reward points per travel can be increased.
Extra bonus points can be awarded to the customer if his/her travel check-ins increases.
Business Proposal for Loyal Customers:
This cluster is the second largest with ~19% of customers.
Potential Offer:
Business Proposal for High-Frequency Customers:
Since, the high-frequency travellers are relatively less in number and since their travel characteristics are desirable, we are not proposing any specific marketing plan to this segment of customers at this point.