Clustering for market segmentation

Market segmentation is a strategy that divides a broad target market of customers into smaller, more similar groups, and then designs a marketing strategy specifically for each group. Clustering is a common technique for this kind of problem, since it automatically finds groups of similar observations in a data set. In this problem, we’ll see how clustering can be applied to a real dataset from an airline company. The data comes from the book “Data Mining for Business Intelligence,” by Galit Shmueli, Nitin R. Patel, and Peter C. Bruce.

The file AirlinesCluster contains information on 3,999 members of the frequent flyer program. There are seven different variables in the dataset, described below:

Balance = number of miles eligible for award travel
QualMiles = number of miles qualifying for TopFlight status
BonusMiles = number of miles earned from non-flight bonus transactions in the past 12 months
BonusTrans = number of non-flight bonus transactions in the past 12 months
FlightMiles = number of flight miles in the past 12 months
FlightTrans = number of flight transactions in the past 12 months
DaysSinceEnroll = number of days since enrolled in the frequent flyer program

We will use hierarchical clustering in this approach.

airlines<-read.csv("AirlinesCluster.csv")
str(airlines)

## 'data.frame':    3999 obs. of  7 variables:
##  $ Balance        : int  28143 19244 41354 14776 97752 16420 84914 20856 443003 104860 ...
##  $ QualMiles      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ BonusMiles     : int  174 215 4123 500 43300 0 27482 5250 1753 28426 ...
##  $ BonusTrans     : int  1 2 4 1 26 0 25 4 43 28 ...
##  $ FlightMiles    : int  0 0 0 0 2077 0 0 250 3850 1150 ...
##  $ FlightTrans    : int  0 0 0 0 4 0 0 1 12 3 ...
##  $ DaysSinceEnroll: int  7000 6968 7034 6952 6935 6942 6994 6938 6948 6931 ...

summary(airlines)

##     Balance          QualMiles         BonusMiles       BonusTrans  
##  Min.   :      0   Min.   :    0.0   Min.   :     0   Min.   : 0.0  
##  1st Qu.:  18528   1st Qu.:    0.0   1st Qu.:  1250   1st Qu.: 3.0  
##  Median :  43097   Median :    0.0   Median :  7171   Median :12.0  
##  Mean   :  73601   Mean   :  144.1   Mean   : 17145   Mean   :11.6  
##  3rd Qu.:  92404   3rd Qu.:    0.0   3rd Qu.: 23800   3rd Qu.:17.0  
##  Max.   :1704838   Max.   :11148.0   Max.   :263685   Max.   :86.0  
##   FlightMiles       FlightTrans     DaysSinceEnroll
##  Min.   :    0.0   Min.   : 0.000   Min.   :   2   
##  1st Qu.:    0.0   1st Qu.: 0.000   1st Qu.:2330   
##  Median :    0.0   Median : 0.000   Median :4096   
##  Mean   :  460.1   Mean   : 1.374   Mean   :4119   
##  3rd Qu.:  311.0   3rd Qu.: 1.000   3rd Qu.:5790   
##  Max.   :30817.0   Max.   :53.000   Max.   :8296

The summary statistics show that the variables are on very different scales: Balance reaches 1,704,838, while FlightTrans never exceeds 53. Clustering on the raw values would therefore be dominated by the variables with the larger scale, so we normalize the data first and then apply the clustering algorithm.
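To illustrate why scale matters, here is a minimal sketch with three made-up customers (the values and the names toy, rawDist and normDist are illustrative, not taken from AirlinesCluster.csv). It uses base R's scale(), which standardizes each column to mean 0 and standard deviation 1, the same centering and scaling that caret's preProcess() applies by default.

```r
# Three made-up customers (illustrative values, not from AirlinesCluster.csv)
toy <- data.frame(
  Balance     = c(20000, 21000, 90000),
  FlightTrans = c(0, 12, 1)
)

# Raw Euclidean distances: Balance is in the tens of thousands,
# so it decides almost entirely which customers look close
rawDist <- as.matrix(dist(toy))

# Standardize each column to mean 0 and standard deviation 1
toyNorm  <- as.data.frame(scale(toy))
normDist <- as.matrix(dist(toyNorm))

# Customers 1 and 2 differ by 12 flight transactions but only 1,000 miles
rawDist[1, 2]   # tiny compared with rawDist[1, 3]
normDist[1, 2]  # comparable in size to normDist[1, 3]
```

On the raw values, customers 1 and 2 look almost identical despite very different flight activity; after standardizing, the difference in FlightTrans carries as much weight as the difference in Balance.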

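library(caret)
# Center and scale every variable (the preProcess defaults): mean 0, standard deviation 1
prepro<-preProcess(airlines)
airnorm<-predict(prepro,airlines)
# Pairwise Euclidean distances between customers, then Ward hierarchical clustering
distair<-dist(airnorm,method="euclidean")
airclust<-hclust(distair,method="ward.D")
# Draw the dendrogram
plot(airclust)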

Looking at the dendrogram, cutting the tree at five clusters seems reasonable for segmenting the customer base.

clusterGroups<-cutree(airclust,k=5)
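A quick sanity check after cutting the tree is to count how many observations fall into each cluster. The sketch below uses R's built-in USArrests data as a stand-in so that it runs without the airline file (standIn, tree and groups are illustrative names); with the data above, the equivalent call would be table(clusterGroups).

```r
# Stand-in data: built-in USArrests (50 rows), standardized with base R's scale()
standIn <- scale(USArrests)

# Same steps as above: Euclidean distances, Ward linkage, cut into k clusters
tree   <- hclust(dist(standIn, method = "euclidean"), method = "ward.D")
groups <- cutree(tree, k = 5)

# Number of observations per cluster
table(groups)
```

A cluster with only a handful of members may reflect a few unusual observations rather than a segment worth a marketing strategy of its own.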

Next we explore the average value of each variable in the 5 clusters (the centroids of the clusters). We compute the averages on the unnormalized data so that they are easier to interpret.

tapply(airlines$Balance,clusterGroups,mean)

##         1         2         3         4         5 
##  57866.90 110669.27 198191.57  52335.91  36255.91

tapply(airlines$QualMiles,clusterGroups,mean)

##            1            2            3            4            5 
##    0.6443299 1065.9826590   30.3461538    4.8479263    2.5111773

tapply(airlines$BonusMiles,clusterGroups,mean)

##         1         2         3         4         5 
## 10360.124 22881.763 55795.860 20788.766  2264.788

tapply(airlines$BonusTrans,clusterGroups,mean)

##         1         2         3         4         5 
## 10.823454 18.229287 19.663968 17.087558  2.973174

tapply(airlines$FlightMiles,clusterGroups,mean)

##          1          2          3          4          5 
##   83.18428 2613.41811  327.67611  111.57373  119.32191

tapply(airlines$FlightTrans,clusterGroups,mean)

##         1         2         3         4         5 
## 0.3028351 7.4026975 1.0688259 0.3444700 0.4388972

tapply(airlines$DaysSinceEnroll,clusterGroups,mean)

##        1        2        3        4        5 
## 6235.365 4402.414 5615.709 2840.823 3060.081
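The seven tapply() calls can also be condensed into a single expression. A minimal sketch, using a made-up data frame and cluster vector (toy and toyGroups are illustrative stand-ins, not objects from the analysis); with the real data one would pass airlines and clusterGroups instead.

```r
# Made-up data standing in for airlines and clusterGroups
toy <- data.frame(
  Balance     = c(100, 200, 300, 400),
  FlightTrans = c(1, 3, 5, 7)
)
toyGroups <- c(1, 1, 2, 2)

# Split the rows by cluster, then take the column means within each piece;
# the result has one row per variable and one column per cluster
centroids <- sapply(split(toy, toyGroups), colMeans)
centroids
```

Having all centroids in one matrix makes it easier to compare clusters side by side than scrolling through separate outputs.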

Looking at the output, we can examine which variables have their largest mean in which cluster.

Cluster 1 has the largest mean for “DaysSinceEnroll” and very little flight activity, so its customers can be described as infrequent but loyal.
Cluster 2 has the largest means for “QualMiles”, “FlightMiles” and “FlightTrans”; its customers have accumulated a large number of miles and make the most flight transactions.
Cluster 3 has the largest means for “Balance”, “BonusMiles” and “BonusTrans”; its customers have accumulated a large number of miles, mostly through non-flight transactions.
Cluster 4 has no variable with its largest mean, but it has the smallest “DaysSinceEnroll” together with fairly high “BonusMiles” and “BonusTrans”, so we can infer that these are relatively new customers who are accumulating miles, mostly through non-flight transactions.
Cluster 5 has no variable with its largest mean either, and the smallest means for “Balance”, “BonusMiles” and “BonusTrans”; its customers can be described as relatively new customers who don’t use the airline very often.
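This reading of the centroids can be cross-checked mechanically. The sketch below copies the cluster means from the tapply() output above (rounded to two decimals) into a matrix and asks, for each variable, which cluster has the largest and the smallest mean; the names centroids, largest and smallest are illustrative.

```r
# Cluster means copied from the tapply() output above (rounded to 2 decimals)
centroids <- rbind(
  Balance         = c(57866.90, 110669.27, 198191.57, 52335.91, 36255.91),
  QualMiles       = c(0.64, 1065.98, 30.35, 4.85, 2.51),
  BonusMiles      = c(10360.12, 22881.76, 55795.86, 20788.77, 2264.79),
  BonusTrans      = c(10.82, 18.23, 19.66, 17.09, 2.97),
  FlightMiles     = c(83.18, 2613.42, 327.68, 111.57, 119.32),
  FlightTrans     = c(0.30, 7.40, 1.07, 0.34, 0.44),
  DaysSinceEnroll = c(6235.37, 4402.41, 5615.71, 2840.82, 3060.08)
)
colnames(centroids) <- 1:5

# For each variable (row), the cluster with the largest and the smallest mean
largest  <- apply(centroids, 1, which.max)
smallest <- apply(centroids, 1, which.min)
largest
smallest
```

The largest means fall in cluster 2 for the flight-related variables, in cluster 3 for Balance and the bonus variables, and in cluster 1 for DaysSinceEnroll; clusters 4 and 5 never hold a maximum.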

Christos Chalitsios

23 May 2016