EastWest Airlines - Marketing to Frequent Fliers

Data

The data used for this analysis contain information on roughly 4,000 passengers (3,999 records) who belong to an airline’s frequent flier program. For each passenger, the data include their mileage history and the different ways they accrued or spent miles in the last year.

Source of Data

http://dataminingbook.com/ (EastWestAirlinesCluster.xls)

For the analysis, I am using the “cluster” package, along with dendextend, factoextra, fpc and NbClust.

library(cluster)
suppressPackageStartupMessages(library(dendextend))
library(factoextra)

library(fpc)
library(NbClust)

For the sake of simplicity, the Excel file was converted to .csv so that it could be read into R.

# Reading the .csv file as a data frame
AirLine_DF = read.csv("EastWestAirlinesCluster.csv")

# Reading the structure of the data
str(AirLine_DF)

## 'data.frame':    3999 obs. of  12 variables:
##  $ ID               : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Balance          : int  28143 19244 41354 14776 97752 16420 84914 20856 443003 104860 ...
##  $ Qual_miles       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ cc1_miles        : int  1 1 1 1 4 1 3 1 3 3 ...
##  $ cc2_miles        : int  1 1 1 1 1 1 1 1 2 1 ...
##  $ cc3_miles        : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Bonus_miles      : int  174 215 4123 500 43300 0 27482 5250 1753 28426 ...
##  $ Bonus_trans      : int  1 2 4 1 26 0 25 4 43 28 ...
##  $ Flight_miles_12mo: int  0 0 0 0 2077 0 0 250 3850 1150 ...
##  $ Flight_trans_12  : int  0 0 0 0 4 0 0 1 12 3 ...
##  $ Days_since_enroll: int  7000 6968 7034 6952 6935 6942 6994 6938 6948 6931 ...
##  $ Award            : int  0 0 0 0 1 0 0 1 1 1 ...

As per the data dictionary provided, the variables cc1_miles, cc2_miles, cc3_miles and Award are categorical. Because Euclidean distance is not appropriate for categorical codes, cc1_miles, cc2_miles and cc3_miles were converted to numeric values by replacing each code with the average of the mileage range it represents.

We shall analyze the data using Euclidean distance with single linkage, Ward's method and complete linkage.

Note: The ID and Award variables are excluded from the clustering.

Translating Categorical to Numeric

Convert cc1_miles, cc2_miles and cc3_miles to numeric by taking the average of each range

AirLine_DF$cc1_miles = ifelse(AirLine_DF$cc1_miles==1,2500,
                              ifelse(AirLine_DF$cc1_miles==2,7500,
                                     ifelse(AirLine_DF$cc1_miles==3,17500,
                                            ifelse(AirLine_DF$cc1_miles==4,32500,
                                                   ifelse(AirLine_DF$cc1_miles==5,50000,0)))))

AirLine_DF$cc2_miles = ifelse(AirLine_DF$cc2_miles==1,2500,
                              ifelse(AirLine_DF$cc2_miles==2,7500,
                                     ifelse(AirLine_DF$cc2_miles==3,17500,
                                            ifelse(AirLine_DF$cc2_miles==4,32500,
                                                   ifelse(AirLine_DF$cc2_miles==5,50000,0)))))

AirLine_DF$cc3_miles = ifelse(AirLine_DF$cc3_miles==1,2500,
                              ifelse(AirLine_DF$cc3_miles==2,7500,
                                     ifelse(AirLine_DF$cc3_miles==3,17500,
                                            ifelse(AirLine_DF$cc3_miles==4,32500,
                                                   ifelse(AirLine_DF$cc3_miles==5,50000,0)))))
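The three nested ifelse() calls above apply the same mapping to each column. A minimal sketch of the same recoding with a single lookup table is shown below. The names miles_lookup and recode_miles and the toy data frame are illustrative assumptions, not part of the original analysis; the mapped values are copied from the code above.

```r
# Hypothetical lookup table: category code -> numeric miles value
# (values copied from the nested ifelse() calls above)
miles_lookup <- c(`1` = 2500, `2` = 7500, `3` = 17500, `4` = 32500, `5` = 50000)

# Hypothetical helper: recode a vector of category codes
recode_miles <- function(x) {
  out <- unname(miles_lookup[as.character(x)])
  out[is.na(out)] <- 0  # same fallback as the final ifelse() branch
  out
}

# Toy data standing in for the three credit card columns
toy <- data.frame(cc1_miles = c(1, 4, 3),
                  cc2_miles = c(1, 1, 2),
                  cc3_miles = c(1, 5, 9))  # 9 is an unknown code

cc_cols <- c("cc1_miles", "cc2_miles", "cc3_miles")
toy[cc_cols] <- lapply(toy[cc_cols], recode_miles)
print(toy)
```

Keeping the mapping in one place means a change to a range value only has to be made once.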

Normalize the data

The variables are measured on very different scales, so the data are standardized to mean 0 and standard deviation 1.

data = scale(AirLine_DF)
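As a small illustration of what scale() does to each column, the sketch below standardizes a toy vector by hand and compares the result with scale(). The vector x is made up for the example.

```r
# Toy column with an arbitrary scale
x <- c(10, 20, 30, 40)

# Manual standardization: subtract the mean, divide by the sample SD
z_manual <- (x - mean(x)) / sd(x)

# scale() returns a one-column matrix; as.numeric() drops the attributes
z_scale <- as.numeric(scale(x))

print(z_manual)
print(all.equal(z_manual, z_scale))
```

After standardization every variable contributes on a comparable scale to the Euclidean distance, so Balance (tens of thousands) no longer dominates Bonus_trans (tens).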

Build the distance matrix with the Euclidean method

d <- dist(data[,2:11], method = "euclidean")
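For reference, the Euclidean distance between two observations is the square root of the sum of squared differences across variables. A minimal check on two made-up points:

```r
# Two toy observations with two variables each
pts <- rbind(a = c(0, 0), b = c(3, 4))

# Distance computed by hand
d_manual <- sqrt(sum((pts["a", ] - pts["b", ])^2))

# Same distance from dist()
d_builtin <- as.numeric(dist(pts, method = "euclidean"))

print(c(manual = d_manual, builtin = d_builtin))
```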

Euclidean distance, Ward

fit <- hclust(d, method="ward.D2")
fit <- as.dendrogram(fit)
cd = color_branches(fit,k=3) #Coloured dendrogram branches
plot(cd)

Observations

For the Ward method, 3 clusters are formed at height 85, and 5 clusters at height 75.

Because there are so many observations, the lower part of the dendrogram is hard to interpret: below height 20 the tree splits into many small clusters and becomes very difficult to read.

The tree was therefore cut into 3 clusters and each observation was assigned to its cluster.

groups <- cutree(fit, k=3) # cut tree into 3 clusters; height 85.

# Number of observations in each cluster
table(groups)

## groups
##    1    2    3 
## 2850  405  744
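Cutting a tree by number of clusters (k) and cutting it by height (h) are two views of the same operation. The sketch below shows this on simulated data with base R's hclust() and cutree(). The toy data and the name h_cut are assumptions for illustration only, not the airline data.

```r
set.seed(1)
# Toy data: three well-separated groups of 10 points in 2 dimensions
toy <- rbind(matrix(rnorm(20, 0, 0.3), ncol = 2),
             matrix(rnorm(20, 5, 0.3), ncol = 2),
             matrix(rnorm(20, 10, 0.3), ncol = 2))

hc <- hclust(dist(toy), method = "ward.D2")

# Cut by number of clusters
by_k <- cutree(hc, k = 3)

# Cut by height: any height between the merge that leaves 3 clusters
# and the merge that leaves 2 clusters gives the same partition
h_cut <- mean(rev(hc$height)[2:3])
by_h <- cutree(hc, h = h_cut)

print(table(by_k, by_h))
```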

g1 = aggregate(AirLine_DF[,2:11],list(groups),median)
data.frame(Cluster=g1[,1],Freq=as.vector(table(groups)),g1[,-1])

##   Cluster Freq Balance Qual_miles cc1_miles cc2_miles cc3_miles
## 1       1 2850 31419.0          0      2500      2500      2500
## 2       2  405 79333.0          0      2500      2500      2500
## 3       3  744 97990.5          0     32500      2500      2500
##   Bonus_miles Bonus_trans Flight_miles_12mo Flight_trans_12
## 1      3340.5           7                 0               0
## 2     16314.0          19              2400               7
## 3     45359.0          17                 0               0
##   Days_since_enroll
## 1              3739
## 2              4190
## 3              5048

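From the above table it seems that the cluster solution separates the groups mainly by customer tenure (Days_since_enroll), bonus miles and balance.

Cluster1 (Non-frequent Travellers) : Has the highest number of observations (2,850). This cluster contains customers who are relatively new, with low balance and bonus miles.

Cluster3 (Frequent Flyers): Has a relatively low number of observations (744). This cluster contains the longest-enrolled customers, with both high balance and high bonus miles. The median cc1_miles of 32,500 against median flight miles of 0 in the last 12 months suggests these miles come mostly from credit card use rather than recent flights.

Cluster2 (Middle level customers) : Has the fewest observations (405) and falls between clusters 1 and 3 on balance, bonus miles and tenure. It is the only cluster whose medians show recent flight activity (2,400 flight miles and 7 flight transactions in the last 12 months).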

Cluster Centroid

# function to find the centroid (column means) of cluster i
centroid = function(i, dat, groups) 
{
  ind = (groups == i)
  colMeans(dat[ind,])
}

sapply(unique(groups), centroid, AirLine_DF[,2:11], groups)

##                           [,1]         [,2]         [,3]
## Balance           4.774137e+04 1.144233e+05 1.504400e+05
## Qual_miles        6.834491e+01 7.659630e+02 9.585484e+01
## cc1_miles         5.852632e+03 1.425926e+04 3.707997e+04
## cc2_miles         2.545614e+03 3.080247e+03 2.500000e+03
## cc3_miles         2.501754e+03 2.500000e+03 3.155242e+03
## Bonus_miles       6.854256e+03 2.554787e+04 5.199022e+04
## Bonus_trans       8.415789e+00 2.077037e+01 1.881586e+01
## Flight_miles_12mo 1.217372e+02 3.203721e+03 2.625067e+02
## Flight_trans_12   4.017544e-01 9.150617e+00 8.629032e-01
## Days_since_enroll 3.882286e+03 4.205486e+03 4.976321e+03
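A centroid (the vector of column means) is not the same as a medoid (the actual observation with the smallest total distance to the others). The toy example below, with made-up points that include one outlier, shows how the two can differ.

```r
# Toy cluster: four points close together and one outlier
pts <- matrix(c(0, 0,
                1, 0,
                0, 1,
                1, 1,
                10, 10), ncol = 2, byrow = TRUE)

# Centroid: mean of each column, pulled towards the outlier
centroid_pt <- colMeans(pts)

# Medoid: the row whose summed distance to all other rows is smallest
D <- as.matrix(dist(pts))
medoid_pt <- pts[which.min(rowSums(D)), ]

print(centroid_pt)
print(medoid_pt)
```

The means in the table above are noticeably larger than the medians reported earlier for variables such as Balance and Bonus_miles, which is consistent with a few large values pulling the centroid upwards.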

Cluster stability with 95% of the data. Ward's Method

With two random samples (sample1 and sample2), each containing 95% of the data, the resulting dendrograms differ from each other. The cluster solution is therefore not fully stable, and further analysis would be needed to draw firm insights.

Sample 1

SampleSize = as.integer(0.95 * nrow(data))
sample1 = data[sample(1:nrow(data), SampleSize,replace=FALSE),]

dist.1 = dist(sample1[,2:11], method = "euclidean") 
fit.1 <- hclust(dist.1, method="ward.D2")
fit.1 <- as.dendrogram(fit.1)
cd = color_branches(fit.1,k=3) #Coloured dendrogram branches
plot(cd)

Sample 2

sample2 = data[sample(1:nrow(data), SampleSize,replace=FALSE),]

dist.2 = dist(sample2[,2:11], method = "euclidean") 
fit.2 <- hclust(dist.2, method="ward.D2")
fit.2 <- as.dendrogram(fit.2)
cd = color_branches(fit.2,k=3) #Coloured dendrogram branches
plot(cd)
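Comparing dendrograms by eye is difficult, so one possible way to quantify stability is to cross-tabulate the cluster labels from the full data against the labels obtained from a subsample, for the same observations. The sketch below does this on simulated data; the toy data, the seed and the object names are assumptions, and the airline data may behave very differently.

```r
set.seed(42)  # fixed seed so the subsample is reproducible
# Toy data: three well-separated groups of 30 points
toy <- rbind(matrix(rnorm(60, 0, 0.3), ncol = 2),
             matrix(rnorm(60, 5, 0.3), ncol = 2),
             matrix(rnorm(60, 10, 0.3), ncol = 2))

# Clusters from the full data
full_groups <- cutree(hclust(dist(toy), method = "ward.D2"), k = 3)

# Clusters from a 95% subsample
idx <- sample(nrow(toy), as.integer(0.95 * nrow(toy)))
sub_groups <- cutree(hclust(dist(toy[idx, ]), method = "ward.D2"), k = 3)

# Rows: full-data label; columns: subsample label.
# Cluster numbers may be permuted, so look for one dominant cell per row.
agreement <- table(full = full_groups[idx], subsample = sub_groups)
print(agreement)
```

If each row has a single dominant cell, the subsample reproduces the full-data clusters; counts spread across several cells indicate instability.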

K-Means Clustering

# 3 Clusters
km.3 <- eclust(data[,2:11], "kmeans", k = 3, nstart = 25, graph = TRUE)

# 5 Clusters
km.5 <- eclust(data[,2:11], "kmeans", k = 5, nstart = 25, graph = TRUE)

As observed from the above k-means plots, with 5 clusters the points in clusters 3 and 2 overlap heavily with the other clusters. Hence we can consider using 3 clusters. There are also outliers in clusters 1 and 3, as seen in the first plot above.
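The choice between 3 and 5 clusters above is made visually. A complementary check, sketched here on simulated data with base R's kmeans(), is to compare the total within-cluster sum of squares across several values of k and look for the point where the decrease flattens (the "elbow"). The toy data and the range 1 to 6 are assumptions for illustration.

```r
set.seed(123)
# Toy data: three well-separated groups of 30 points
toy <- rbind(matrix(rnorm(60, 0, 0.3), ncol = 2),
             matrix(rnorm(60, 5, 0.3), ncol = 2),
             matrix(rnorm(60, 10, 0.3), ncol = 2))

# Total within-cluster sum of squares for k = 1..6
wss <- sapply(1:6, function(k) {
  kmeans(toy, centers = k, nstart = 25)$tot.withinss
})

print(data.frame(k = 1:6, tot_withinss = round(wss, 1)))
```

On this toy data the drop is steep up to k = 3 and small afterwards; on the airline data the same loop would use data[,2:11] in place of toy.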

Cluster Stability - K-means - Silhouette

fviz_silhouette(km.3)

##   cluster size ave.sil.width
## 1       1  924          0.18
## 2       2 2914          0.46
## 3       3  161          0.04
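The silhouette width of an observation compares a, its average distance to the other members of its own cluster, with b, its average distance to the members of the nearest other cluster: s = (b - a) / max(a, b). Values near 1 indicate a well-placed observation, and values near 0 indicate an observation lying between clusters. The hand-written function below is only a sketch of that definition on a toy example; silhouette_widths is a hypothetical helper, not a function from the cluster or factoextra packages.

```r
# Hypothetical helper: silhouette width per observation from a
# distance matrix D and a vector of cluster labels cl
silhouette_widths <- function(D, cl) {
  sapply(seq_along(cl), function(i) {
    own <- cl == cl[i]
    own[i] <- FALSE
    if (!any(own)) return(0)  # convention for single-member clusters
    a <- mean(D[i, own])      # mean distance within own cluster
    others <- setdiff(unique(cl), cl[i])
    b <- min(sapply(others, function(k) mean(D[i, cl == k])))
    (b - a) / max(a, b)
  })
}

# Toy example: two tight groups on a line
x <- c(0, 1, 10, 11)
cl <- c(1, 1, 2, 2)
D <- as.matrix(dist(x))

s <- silhouette_widths(D, cl)
print(round(s, 3))
print(tapply(s, cl, mean))  # average width per cluster
```

Read this way, the average width of 0.04 for cluster 3 in the table above means its members are on average almost as close to another cluster as to their own.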

K-Medoids Clustering - PAM (Partitioning Around Medoids)

# 3 Clusters
km.3 <- eclust(data[,2:11], "pam", k = 3, graph = TRUE)

# 5 Clusters
km.5 <- eclust(data[,2:11], "pam", k = 5, graph = TRUE)

Cluster Stability - PAM - Silhouette

fviz_silhouette(km.3)

##   cluster size ave.sil.width
## 1       1 1372          0.21
## 2       2 1071          0.00
## 3       3 1556          0.30
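PAM differs from k-means in that each cluster is represented by a medoid, which is an actual observation, rather than by a mean. The sketch below uses pam() from the cluster package and kmeans() from base R on simulated data to show the difference; the toy data and object names are assumptions for illustration.

```r
library(cluster)

set.seed(7)
# Toy data: two well-separated groups of 20 points
toy <- rbind(matrix(rnorm(40, 0, 0.3), ncol = 2),
             matrix(rnorm(40, 5, 0.3), ncol = 2))

# PAM: medoids are rows of the data
pam_fit <- pam(toy, k = 2)
print(pam_fit$id.med)   # row indices of the medoids
print(pam_fit$medoids)  # coordinates of the medoids

# k-means: centers are means and need not coincide with any row
km_fit <- kmeans(toy, centers = 2, nstart = 25)
print(km_fit$centers)
```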

Offerings

The non-frequent flyers are the largest group, and most customers in this cluster did not fly in the last 12 months. Promotions such as more miles per flight or discounted air fares can be offered to this group to encourage them to fly more often.