Classification of frequent flyers for airline marketing

Chapter 1: Introduction

Team Members

Ruthvik Chidambaram
Harsh Dave
Neha Ganshani
Tanmay Jog
Vinishruth Mocherla

Context:

The file EastWestAirlinesCluster.xls is available on the textbook website https://www.dataminingbook.com/book/r-edition which contains information on 4000 passengers who belong to an airline’s frequent flyer program.

For each passenger the data includes information on their mileage history and on different ways they accrued or spent miles in the last year.

Objective:

We are trying to learn more about EastWest Airlines’ customers based on their flying patterns, earning and use of frequent flyer rewards, and use of the airline credit card.

The primary objective is to identify customer segments via clustering and design targeted marketing campaigns for each segment.

Chapter 2: Exploratory Data Analysis

Installing Libraries

knitr::opts_chunk$set(echo = TRUE)
library("dummies")
library("dendextend")
library("gridExtra")
library("cluster")
library("factoextra")
library("MASS")
library("fpc")
library("clValid")
library("optCluster")

Importing the data

airlines_1 <- read.csv("E:/Vinni_USA/MSIS coursework docs/Spring-20/4th Flex/Data Mining for BI/Unsupervised learning project/Final project/EastWestAirlinesCluster.csv", header = TRUE)
head(airlines_1)

##   ï..ID. Balance Qual_miles cc1_miles cc2_miles cc3_miles Bonus_miles
## 1      1   28143          0         1         1         1         174
## 2      2   19244          0         1         1         1         215
## 3      3   41354          0         1         1         1        4123
## 4      4   14776          0         1         1         1         500
## 5      5   97752          0         4         1         1       43300
## 6      6   16420          0         1         1         1           0
##   Bonus_trans Flight_miles_12mo Flight_trans_12 Days_since_enroll Award.
## 1           1                 0               0              7000      0
## 2           2                 0               0              6968      0
## 3           4                 0               0              7034      0
## 4           1                 0               0              6952      0
## 5          26              2077               4              6935      1
## 6           0                 0               0              6942      0

Data Cleaning

# Removing the ID# column from dataset
airlines_2 <- airlines_1[c(-1,-12)] 
head(airlines_2)

##   Balance Qual_miles cc1_miles cc2_miles cc3_miles Bonus_miles Bonus_trans
## 1   28143          0         1         1         1         174           1
## 2   19244          0         1         1         1         215           2
## 3   41354          0         1         1         1        4123           4
## 4   14776          0         1         1         1         500           1
## 5   97752          0         4         1         1       43300          26
## 6   16420          0         1         1         1           0           0
##   Flight_miles_12mo Flight_trans_12 Days_since_enroll
## 1                 0               0              7000
## 2                 0               0              6968
## 3                 0               0              7034
## 4                 0               0              6952
## 5              2077               4              6935
## 6                 0               0              6942

str(airlines_2)

## 'data.frame':    3999 obs. of  10 variables:
##  $ Balance          : int  28143 19244 41354 14776 97752 16420 84914 20856 443003 104860 ...
##  $ Qual_miles       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ cc1_miles        : int  1 1 1 1 4 1 3 1 3 3 ...
##  $ cc2_miles        : int  1 1 1 1 1 1 1 1 2 1 ...
##  $ cc3_miles        : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Bonus_miles      : int  174 215 4123 500 43300 0 27482 5250 1753 28426 ...
##  $ Bonus_trans      : int  1 2 4 1 26 0 25 4 43 28 ...
##  $ Flight_miles_12mo: int  0 0 0 0 2077 0 0 250 3850 1150 ...
##  $ Flight_trans_12  : int  0 0 0 0 4 0 0 1 12 3 ...
##  $ Days_since_enroll: int  7000 6968 7034 6952 6935 6942 6994 6938 6948 6931 ...

summary(airlines_2)

##     Balance          Qual_miles        cc1_miles      cc2_miles    
##  Min.   :      0   Min.   :    0.0   Min.   :1.00   Min.   :1.000  
##  1st Qu.:  18528   1st Qu.:    0.0   1st Qu.:1.00   1st Qu.:1.000  
##  Median :  43097   Median :    0.0   Median :1.00   Median :1.000  
##  Mean   :  73601   Mean   :  144.1   Mean   :2.06   Mean   :1.015  
##  3rd Qu.:  92404   3rd Qu.:    0.0   3rd Qu.:3.00   3rd Qu.:1.000  
##  Max.   :1704838   Max.   :11148.0   Max.   :5.00   Max.   :3.000  
##    cc3_miles      Bonus_miles      Bonus_trans   Flight_miles_12mo
##  Min.   :1.000   Min.   :     0   Min.   : 0.0   Min.   :    0.0  
##  1st Qu.:1.000   1st Qu.:  1250   1st Qu.: 3.0   1st Qu.:    0.0  
##  Median :1.000   Median :  7171   Median :12.0   Median :    0.0  
##  Mean   :1.012   Mean   : 17145   Mean   :11.6   Mean   :  460.1  
##  3rd Qu.:1.000   3rd Qu.: 23801   3rd Qu.:17.0   3rd Qu.:  311.0  
##  Max.   :5.000   Max.   :263685   Max.   :86.0   Max.   :30817.0  
##  Flight_trans_12  Days_since_enroll
##  Min.   : 0.000   Min.   :   2     
##  1st Qu.: 0.000   1st Qu.:2330     
##  Median : 0.000   Median :4096     
##  Mean   : 1.374   Mean   :4119     
##  3rd Qu.: 1.000   3rd Qu.:5790     
##  Max.   :53.000   Max.   :8296

# Converting categorical miles data into numerical data with taking the average of the range

airlines_2$cc1_miles = ifelse(airlines_2$cc1_miles==1,2500,
                              ifelse(airlines_2$cc1_miles==2,7500,
                                     ifelse(airlines_2$cc1_miles==3,17500,
                                            ifelse(airlines_2$cc1_miles==4,32500,
                                                ifelse(airlines_2$cc1_miles==5,50000,0)))))

airlines_2$cc2_miles = ifelse(airlines_2$cc2_miles==1,2500,
                              ifelse(airlines_2$cc2_miles==2,7500,
                                     ifelse(airlines_2$cc2_miles==3,17500,
                                            ifelse(airlines_2$cc2_miles==4,32500,
                                               ifelse(airlines_2$cc2_miles==5,50000,0)))))

airlines_2$cc3_miles = ifelse(airlines_2$cc3_miles==1,2500,
                              ifelse(airlines_2$cc3_miles==2,7500,
                                     ifelse(airlines_2$cc3_miles==3,17500,
                                            ifelse(airlines_2$cc3_miles==4,32500,
                                                ifelse(airlines_2$cc3_miles==5,50000,0)))))

head(airlines_2)

##   Balance Qual_miles cc1_miles cc2_miles cc3_miles Bonus_miles Bonus_trans
## 1   28143          0      2500      2500      2500         174           1
## 2   19244          0      2500      2500      2500         215           2
## 3   41354          0      2500      2500      2500        4123           4
## 4   14776          0      2500      2500      2500         500           1
## 5   97752          0     32500      2500      2500       43300          26
## 6   16420          0      2500      2500      2500           0           0
##   Flight_miles_12mo Flight_trans_12 Days_since_enroll
## 1                 0               0              7000
## 2                 0               0              6968
## 3                 0               0              7034
## 4                 0               0              6952
## 5              2077               4              6935
## 6                 0               0              6942

Normalization

# The scale of the variables in the data is varying; hence normalizing the data with Mean=0 and SD=1

airlines_3 <- scale (airlines_2)
head(airlines_3)

##         Balance Qual_miles  cc1_miles   cc2_miles   cc3_miles Bonus_miles
## [1,] -0.4510844 -0.1862754 -0.6802649 -0.09077821 -0.05870488  -0.7026984
## [2,] -0.5393894 -0.1862754 -0.6802649 -0.09077821 -0.05870488  -0.7010007
## [3,] -0.3199912 -0.1862754 -0.6802649 -0.09077821 -0.05870488  -0.5391853
## [4,] -0.5837255 -0.1862754 -0.6802649 -0.09077821 -0.05870488  -0.6892000
## [5,]  0.2396479 -0.1862754  1.3577268 -0.09077821 -0.05870488   1.0829857
## [6,] -0.5674121 -0.1862754 -0.6802649 -0.09077821 -0.05870488  -0.7099031
##      Bonus_trans Flight_miles_12mo Flight_trans_12 Days_since_enroll
## [1,]  -1.1039265        -0.3285622      -0.3621226          1.395280
## [2,]  -0.9998011        -0.3285622      -0.3621226          1.379784
## [3,]  -0.7915505        -0.3285622      -0.3621226          1.411744
## [4,]  -1.1039265        -0.3285622      -0.3621226          1.372037
## [5,]   1.4992070         1.1547876       0.6924037          1.363805
## [6,]  -1.2080518        -0.3285622      -0.3621226          1.367195

If Normalization of data is not done, then the following issues will influence the model -

Distance measured will be wrongly calculated, if all the variables are not with equal weight
The variable with the largest scale will dominate the measure

Chapter 3: K-means Clustering

Finding Optimal k-value

k.max <- 10
wss <- sapply(1:k.max, function(k){ kmeans(airlines_3[,-1], k)$tot.withinss})
plot(1:k.max, wss,
     type="b", pch = 19, frame = FALSE, 
     xlab="Number of clusters K",
     ylab="Total within-clusters sum of squares")

Interpretation:

As seen from the elbow graph, the slope changes at k=2. However, since spltting the dataset into 2 groups would not be very beneficial for segmenting frequent flyers, we further evaluate clusters for higher values of k.

Clustering

# K-means clustering with 3 clusters
k_mean_cluster_3 <- eclust(airlines_3, "kmeans", k = 3, nstart = 25, graph = TRUE)

# K-means clustering with 4 clusters
k_mean_cluster_4 <- eclust(airlines_3, "kmeans", k = 4, nstart = 25, graph = TRUE)

# K-means clustering with 5 clusters
k_mean_cluster_5 <- eclust(airlines_3, "kmeans", k = 5, nstart = 25, graph = TRUE)

Interpretation:

As we see in the K-means clustering graph, for values of k > 3, the number of overlapping observations with neighboring clusters increase. This implies that dividing the dataset into 3 clusters would distinctly or unambiguously segment the frequent flyers.

Cluster Stability

#Checking cluster stability with K means Silhouette

fviz_silhouette(k_mean_cluster_3)

##   cluster size ave.sil.width
## 1       1  161          0.04
## 2       2 2914          0.46
## 3       3  924          0.18

fviz_silhouette(k_mean_cluster_4)

##   cluster size ave.sil.width
## 1       1  167          0.02
## 2       2   11          0.60
## 3       3  922          0.23
## 4       4 2899          0.44

fviz_silhouette(k_mean_cluster_5)

##   cluster size ave.sil.width
## 1       1   11          0.60
## 2       2  158          0.02
## 3       3 1357          0.17
## 4       4  821          0.22
## 5       5 1652          0.34

Interpretation:

As observed from the silhouette coefficients, the value of average.silhouette.width is maximum (closest to +1) for k=3 compared to other k values indicating that dividing the data set into 3 clusters is stable.

Further, when k=3, it can be observed from the cluster size that the second cluster contains major chunk (72.86%) of the data followed by the third cluster containing 23.1% and then the first cluster with only 4.04% of the data.

Chapter 4: Agglomerative Hierarchical Clustering

Clustering

# Calculating Distance
# Dissimilarity matrix
airlines_euclidian_dist <- dist(airlines_3, method = "euclidean")
str(airlines_euclidian_dist)

##  'dist' num [1:7994001] 0.137 0.377 0.135 4.232 0.159 ...
##  - attr(*, "Size")= int 3999
##  - attr(*, "Diag")= logi FALSE
##  - attr(*, "Upper")= logi FALSE
##  - attr(*, "method")= chr "euclidean"
##  - attr(*, "call")= language dist(x = airlines_3, method = "euclidean")

# Hierarchical clustering using Ward's method
airline_fit <- hclust(airlines_euclidian_dist, method="ward.D2")
airline_fit <- as.dendrogram(airline_fit)
plot(airline_fit, cex = 0.6)
abline(h=85,col="red",lty=2)

Interpretation:

From the Ward method, we see that as the height increases the clusters get grouped together.

As there are larger numbers of observations, if we dont limit the number of clusters, interpreting the results will become cumbersome. Ex: At height <20, there are multiple clusters created and it is very difficult to comprehend the data.

Hence, we decided to cut the tree at height 85 to obtain 3 clusters and then assigned each cluster with its respective observations.

Cutting the tree

#cutting the tree into 3 clusters
hierarchical_cluster <- cutree(airline_fit, k=3)
cd = color_branches(airline_fit,k=3) #Coloured dendrogram branches
plot(cd)

# Number of observations in each cluster
table(hierarchical_cluster)

## hierarchical_cluster
##    1    2    3 
## 2850  405  744

Interpretation:

When k=3, it can be observed from the cluster size that the first cluster contains major chunk (71.27%) of the data followed by the third cluster containing 18.6% and then the second cluster with only 10.13% of the data.`

Finding optimal k-value

#Checking for optimal k-value
intern <- clValid(airlines_3, 2:6, clMethods=c("hierarchical"),
validation="internal", maxitems=nrow(airlines_3))

## Warning in clValid(airlines_3, 2:6, clMethods = c("hierarchical"), validation =
## "internal", : rownames for data not specified, using 1:nrow(data)

summary (intern)

## 
## Clustering Methods:
##  hierarchical 
## 
## Cluster sizes:
##  2 3 4 5 6 
## 
## Validation Measures:
##                                  2       3       4       5       6
##                                                                   
## hierarchical Connectivity   5.3825  9.6401  9.6401 17.5944 22.6671
##              Dunn           0.2058  0.2162  0.2203  0.1863  0.1863
##              Silhouette     0.8275  0.7907  0.7837  0.7461  0.7350
## 
## Optimal Scores:
## 
##              Score  Method       Clusters
## Connectivity 5.3825 hierarchical 2       
## Dunn         0.2203 hierarchical 4       
## Silhouette   0.8275 hierarchical 2

Interpretation:

Internal measures -

Connectivity corresponds to what extent items are placed in the same cluster as their nearest neighbors in the data space. The connectivity has a value between 0 and infinity and should be minimized.
The Dunn index is another internal clustering validation measure which is defined as the ratio of inter-cluster separation to intracluster compactness. The Dunn index should be maximized.
The silhouette coefficients measure how well an observation is clustered and it estimates the average distance between clusters. Observations with a large Si (almost 1) are very well clustered.

As observed from the output, the above combination of values are closer to the ideal for k=3 than the other inputs.This indicates that the optimal number of clusters is 3.

Checking for stability

stab <- matrix(0,nrow=ncol(airlines_3),ncol=4)
colnames(stab) <- c("APN","AD","ADM","FOM")

## Need loop over all removed samples
for (del in 1:ncol(airlines_3)) {
  matDel <- airlines_3[,-del]               
  DistDel <- dist(matDel,method="euclidean")
  clusterObjDel <- hclust(DistDel, method="ward.D2")
  clusterDel <- cutree(clusterObjDel,3)
  stab[del,] <- stability(airlines_3, airlines_euclidian_dist, del, hierarchical_cluster, clusterDel)
}
colMeans(stab)

##        APN         AD        ADM        FOM 
## 0.08004969 2.99112627 0.57682079 0.93819405

Interpretation:

Stability measures -

The APN, AD, and ADM are all based on the cross-classification table of the original clustering on the full data with the clustering based on the removal of one column.

The APN measures the average proportion of observations not placed in the same cluster by clustering based on the full data and clustering based on the data with a single column removed.
The AD measures the average distance between observations placed in the same cluster under both cases (full data set and removal of one column).
The ADM measures the average distance between cluster centers for observations placed in the same cluster under both cases.
The FOM measures the average intra-cluster variance of the deleted column, where the clustering is based on the remaining (undeleted) columns.

Upon comparing the above values for k=3 with the industry standards, we can conclude that the classification is stable.

Chapter 5: Comparision of clustering techniques

K-means clustering

#Finding the mean of all the values
df1 <- data.frame(airlines_3, k_mean_cluster_3$cluster) # append cluster membership
Cluster_mean1 <- aggregate(df1, by=list(k_mean_cluster_3$cluster), FUN=mean)
Cluster_mean1

##   Group.1    Balance    Qual_miles  cc1_miles   cc2_miles   cc3_miles
## 1       1  1.1543624  8.449235e-01  0.1414719  0.15632265 -0.05870488
## 2       2 -0.2623636 -4.665237e-02 -0.4864202  0.01673483 -0.05788699
## 3       3  0.6262719 -9.487725e-05  1.5093631 -0.08001434  0.19278589
##   Bonus_miles Bonus_trans Flight_miles_12mo Flight_trans_12 Days_since_enroll
## 1   0.5954161   1.6408433        3.56067928      3.85598254         0.2775388
## 2  -0.4447171  -0.3650082       -0.17881692     -0.19357570        -0.1572599
## 3   1.2987486   0.8652144       -0.05649011     -0.06139999         0.4475884
##   k_mean_cluster_3.cluster
## 1                        1
## 2                        2
## 3                        3

#To find the size of clusters 
Cluster_sum <- aggregate(df1, by=list(k_mean_cluster_3$cluster), FUN=sum) 
d <- transform(Cluster_sum, clusterSize = k_mean_cluster_3.cluster / Group.1)
d <- transform(d, k_mean_cluster_3.cluster= k_mean_cluster_3.cluster/ clusterSize)
d

##   Group.1   Balance    Qual_miles   cc1_miles cc2_miles   cc3_miles Bonus_miles
## 1       1  185.8524  136.03268021    22.77698  25.16795   -9.451486    95.86199
## 2       2 -764.5276 -135.94501363 -1417.42851  48.76531 -168.682675 -1295.90566
## 3       3  578.6753   -0.08766658  1394.65153 -73.93325  178.134161  1200.04367
##   Bonus_trans Flight_miles_12mo Flight_trans_12 Days_since_enroll
## 1    264.1758         573.26936       620.81319          44.68374
## 2  -1063.6339        -521.07250      -564.07960        -458.25540
## 3    799.4581         -52.19686       -56.73359         413.57166
##   k_mean_cluster_3.cluster clusterSize
## 1                        1         161
## 2                        2        2914
## 3                        3         924

#Appending the size of clusters with the mean of all other values
Cluster_mean1$clusterSize <- d$clusterSize
Cluster_mean1$clusterPCT <- (d$clusterSize*100)/3999

# transpose to change from horizontal to vertical
temp1 <- t(Cluster_mean1)

round_df <- function(x, digits) {
  # round all numeric variables
  # x: data frame 
  # digits: number of digits to round
  numeric_columns    <- sapply(x, class) == 'numeric'
  x[numeric_columns] <-  round(x[numeric_columns], digits)
  x
}

temp2 <- round_df (temp1,2)
temp2

##                            [,1]    [,2]   [,3]
## Group.1                    1.00    2.00   3.00
## Balance                    1.15   -0.26   0.63
## Qual_miles                 0.84   -0.05   0.00
## cc1_miles                  0.14   -0.49   1.51
## cc2_miles                  0.16    0.02  -0.08
## cc3_miles                 -0.06   -0.06   0.19
## Bonus_miles                0.60   -0.44   1.30
## Bonus_trans                1.64   -0.37   0.87
## Flight_miles_12mo          3.56   -0.18  -0.06
## Flight_trans_12            3.86   -0.19  -0.06
## Days_since_enroll          0.28   -0.16   0.45
## k_mean_cluster_3.cluster   1.00    2.00   3.00
## clusterSize              161.00 2914.00 924.00
## clusterPCT                 4.03   72.87  23.11

Interpretation:

From the k-means table, we can develop a profile for customers belonging to each cluster as follows -

Cluster1 (High Frequency Travellers): This cluster has the lowest number of customers (4.03%). These customers travel very frequently and have covered the maximum flight miles in the past 12 months.Hence, they have accrued the maximum balance.They have also made a significant number of non-flight transactions in the past 12 months. They also have the maximum number of miles counted as qualifying for Top Flight Status.

Cluster2 (Not-so-frequent Travellers): This cluster has the highest number of customers (72.87%). They are new customers who don’t travel as frequently compared to the other two segments and have the least flight miles eligible for award travel.

Cluster3 (Loyal Customers): This cluster consists 23.11% of the customers. They are the oldest customers who have been travelling consistently since the time of enrollment. They use the frequent flyer credit card very often and have earned the maximum number of miles from non-flight bonus transactions in the past 12 months.

Hierarchical clustering

#Finding the mean of all the values
Cluster_mean2 <- aggregate(airlines_3,list(hierarchical_cluster),mean)

#Appending the size of clusters with the mean of all other values
df2 <- data.frame(Cluster=Cluster_mean2[,1],Cluster_sum=as.vector(table(hierarchical_cluster)),Cluster_mean2[,-1])

# transpose to change from horizontal to vertical
temp3 <- t(df2)

round_df <- function(x, digits) {
  # round all numeric variables
  # x: data frame 
  # digits: number of digits to round
  numeric_columns    <- sapply(x, class) == 'numeric'
  x[numeric_columns] <-  round(x[numeric_columns], digits)
  x
}

temp4 <- round_df(temp3, 2)
temp4

##                      [,1]   [,2]   [,3]
## Cluster              1.00   2.00   3.00
## Cluster_sum       2850.00 405.00 744.00
## Balance             -0.26   0.41   0.76
## Qual_miles          -0.10   0.80  -0.06
## cc1_miles           -0.45   0.12   1.67
## cc2_miles           -0.05   0.49  -0.09
## cc3_miles           -0.06  -0.06   0.25
## Bonus_miles         -0.43   0.35   1.44
## Bonus_trans         -0.33   0.95   0.75
## Flight_miles_12mo   -0.24   1.96  -0.14
## Flight_trans_12     -0.26   2.05  -0.13
## Days_since_enroll   -0.11   0.04   0.42

Interpretation:

From the hierarchical table, we can develop a profile for customers belonging to each cluster as follows -

Cluster1 (Not-so-frequent Travellers): This cluster has the highest number of customers (10.12%). They are new customers who don’t travel as frequently compared to the other two segments and have the least flight miles eligible for award travel.

Cluster2 (High Frequency Travellers): This cluster has the lowest number of customers (71.26%) and these customers travel very frequently. When compared to the other two segments, they have covered the maximum number flight miles making the most number of non-flight transactions in the past 12 months. They also have the maximum number of miles counted as qualifying for Top Flight Status.

Cluster3 (Loyal Customers): This cluster consists 18.6% of the customers. They are the oldest customers who have the highest number of flight miles eligible for award travel. They use the frequent flyer credit card very often and have earned the maximum number of miles from non-flight bonus transactions in the past 12 months.

Comparison

Although the outputs of both the alogirthms are very similar, there exist the following differences -

High Frequency Travellers have more “balance” in K-means clustering whereas Loyal customers have more “balance” in hierarchical clustering.
Hierarchical clustering was able to explain that the High Frequency Travellers tend to use their rewards credit card more than their frequent flyer credit card. The K-means algorithm failed to provide this evidence.
We see that the values of Flight_miles_12mo and Flight_trans_12, are extreme in case of k-means clustering and more balanced in case of hierarchical clustering.
Hierarchical clustering also seems to be more insightful when comparing Bonus_miles to Bonus_trans as it provides substantial evidence that High-Frequency Fliers utilize more number of bonus miles than Loyal customers.

Result

Upon qualitative evaluation of the above differences, we believe that Hierachical clustering describes our dataset more accurately than k-means. This can be attributed to the categorical nature of the dataset.

Further, although k-means provides faster results, since our dataset is not very big and not dynamic, creating new trees for analysis is not very time consuming.

Hence, we can conclude that Hierarchical clustering is a better option than k-means for segmenting the frequent flyers.

Chapter 6: Conclusion

Based on our observations from the clustering analysis, we propose the following marketing campaigns to target specific segments of frequent flyers. Adopting these recommendations could increase the bottomline of EastWest Airlines.

Business Proposal for Not-so-frequent Travelers:

Since most of the customers (~72%) fall into this cluster, there is a potential business growth opportunity if these customers are converted into High Frequency Travelers.

Potential Offers:

If frequent flyer credit card is used then the reward points per travel can be increased.
Extra bonus points can be awarded to the customer if his/her travel check-ins increases.

Business Proposal for Loyal Customers:

This cluster is the second largest with ~19% of customers.

Potential Offer:

If Bonus miles are used, then the customers can be awarded with discounts on the use of frequent flyer credit card.

Business Proposal for High-Frequency Customers:

Since, the high-frequency travellers are relatively less in number and since their travel characteristics are desirable, we are not proposing any specific marketing plan to this segment of customers at this point.