Market segmentation is a strategy that divides a broad target market of customers into smaller, more homogeneous groups and then designs a marketing strategy specifically for each group. Clustering is a common technique for this kind of problem because it automatically finds groups of similar observations in a data set. In this problem, we’ll see how clustering can be applied to a real dataset from an airline company. The data comes from the book “Data Mining for Business Intelligence,” by Galit Shmueli, Nitin R. Patel, and Peter C. Bruce.
The file AirlinesCluster contains information on 3,999 members of the frequent flyer program. There are seven variables in the dataset; their names and types appear in the output of str() below.
We will use hierarchical clustering in this approach.
airlines<-read.csv("AirlinesCluster.csv")
str(airlines)
## 'data.frame': 3999 obs. of 7 variables:
## $ Balance : int 28143 19244 41354 14776 97752 16420 84914 20856 443003 104860 ...
## $ QualMiles : int 0 0 0 0 0 0 0 0 0 0 ...
## $ BonusMiles : int 174 215 4123 500 43300 0 27482 5250 1753 28426 ...
## $ BonusTrans : int 1 2 4 1 26 0 25 4 43 28 ...
## $ FlightMiles : int 0 0 0 0 2077 0 0 250 3850 1150 ...
## $ FlightTrans : int 0 0 0 0 4 0 0 1 12 3 ...
## $ DaysSinceEnroll: int 7000 6968 7034 6952 6935 6942 6994 6938 6948 6931 ...
summary(airlines)
##     Balance          QualMiles         BonusMiles       BonusTrans
##  Min.   :      0   Min.   :    0.0   Min.   :     0   Min.   : 0.0
##  1st Qu.:  18528   1st Qu.:    0.0   1st Qu.:  1250   1st Qu.: 3.0
##  Median :  43097   Median :    0.0   Median :  7171   Median :12.0
##  Mean   :  73601   Mean   :  144.1   Mean   : 17145   Mean   :11.6
##  3rd Qu.:  92404   3rd Qu.:    0.0   3rd Qu.: 23800   3rd Qu.:17.0
##  Max.   :1704838   Max.   :11148.0   Max.   :263685   Max.   :86.0
##   FlightMiles       FlightTrans     DaysSinceEnroll
##  Min.   :    0.0   Min.   : 0.000   Min.   :   2
##  1st Qu.:    0.0   1st Qu.: 0.000   1st Qu.:2330
##  Median :    0.0   Median : 0.000   Median :4096
##  Mean   :  460.1   Mean   : 1.374   Mean   :4119
##  3rd Qu.:  311.0   3rd Qu.: 1.000   3rd Qu.:5790
##  Max.   :30817.0   Max.   :53.000   Max.   :8296
The descriptive statistics show that the variables are on very different scales: for example, Balance has a mean of about 73,601, while FlightTrans has a mean of about 1.374. Because hierarchical clustering works on distances between observations, the variables with the largest scales would dominate the result. So we normalize the data first and then apply the clustering algorithm.
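To see why the scale matters, here is a minimal sketch on a made-up four-row data frame (the values are hypothetical, not taken from AirlinesCluster.csv). It uses base R's `scale()` to center each column to mean 0 and scale it to standard deviation 1, which is the same kind of transformation that caret's `preProcess` applies with its default `method = c("center", "scale")`.

```r
# Toy data (hypothetical values): one large-scale and one small-scale variable.
toy <- data.frame(
  Balance     = c(20000, 21000, 90000, 95000),
  FlightTrans = c(0, 12, 1, 11)
)

# Raw Euclidean distances are driven almost entirely by Balance.
round(dist(toy), 1)

# Center each column to mean 0 and scale it to standard deviation 1.
toy_norm <- as.data.frame(scale(toy))

# After normalization both variables contribute on a comparable scale.
round(dist(toy_norm), 2)

# Check: column means are (numerically) 0 and standard deviations are 1.
round(colMeans(toy_norm), 10)
apply(toy_norm, 2, sd)
```

In the raw distances, rows 1 and 2 look close even though their FlightTrans values differ by 12. After scaling, that difference is no longer swamped by Balance.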
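library(caret)
# Center and scale every variable (the default method of preProcess)
prepro <- preProcess(airlines)
airnorm <- predict(prepro, airlines)
# Pairwise Euclidean distances between the normalized observations
distair <- dist(airnorm, method = "euclidean")
# Hierarchical clustering with the "ward.D" agglomeration method
airclust <- hclust(distair, method = "ward.D")
# Draw the dendrogram
plot(airclust)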
Looking at the dendrogram, cutting the tree at five clusters seems reasonable for segmenting the client base.
clusterGroups<-cutree(airclust,k=5)
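A quick sanity check after cutting the tree is to count how many observations fall in each cluster. The sketch below runs the same `dist`, `hclust`, and `cutree` steps on synthetic data (three well-separated groups generated with `rnorm`, an assumption made purely for illustration) and tabulates the cluster sizes.

```r
set.seed(1)

# Synthetic data (assumption): three well-separated groups of 20 points in 2-D.
pts <- rbind(
  matrix(rnorm(40, mean = 0,  sd = 0.3), ncol = 2),
  matrix(rnorm(40, mean = 5,  sd = 0.3), ncol = 2),
  matrix(rnorm(40, mean = 10, sd = 0.3), ncol = 2)
)

# Same steps as in the text: distances, hierarchical clustering, cut at k.
tree   <- hclust(dist(pts, method = "euclidean"), method = "ward.D")
groups <- cutree(tree, k = 3)

# Number of observations assigned to each cluster.
table(groups)

# Optional: draw the cut on the dendrogram.
# plot(tree); rect.hclust(tree, k = 3)
```

On the airline data the equivalent call would be `table(clusterGroups)`. Very small clusters can be a hint that a different number of clusters is worth trying.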
Next we will explore the average value of each variable in the 5 clusters (the centroids of the clusters). We compute the averages on the unnormalized data so that they are easier to interpret.
tapply(airlines$Balance,clusterGroups,mean)
## 1 2 3 4 5
## 57866.90 110669.27 198191.57 52335.91 36255.91
tapply(airlines$QualMiles,clusterGroups,mean)
## 1 2 3 4 5
## 0.6443299 1065.9826590 30.3461538 4.8479263 2.5111773
tapply(airlines$BonusMiles,clusterGroups,mean)
## 1 2 3 4 5
## 10360.124 22881.763 55795.860 20788.766 2264.788
tapply(airlines$BonusTrans,clusterGroups,mean)
## 1 2 3 4 5
## 10.823454 18.229287 19.663968 17.087558 2.973174
tapply(airlines$FlightMiles,clusterGroups,mean)
## 1 2 3 4 5
## 83.18428 2613.41811 327.67611 111.57373 119.32191
tapply(airlines$FlightTrans,clusterGroups,mean)
## 1 2 3 4 5
## 0.3028351 7.4026975 1.0688259 0.3444700 0.4388972
tapply(airlines$DaysSinceEnroll,clusterGroups,mean)
## 1 2 3 4 5
## 6235.365 4402.414 5615.709 2840.823 3060.081
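The seven `tapply` calls above can also be collapsed into a single expression. This is a minimal sketch on a small hypothetical data frame, where `toy` and `groups` stand in for `airlines` and `clusterGroups`: it splits the rows by cluster and takes the column means of each piece.

```r
# Hypothetical stand-ins for `airlines` and `clusterGroups`.
toy <- data.frame(
  Balance     = c(100, 200, 300, 400, 500, 600),
  FlightMiles = c(0, 10, 0, 30, 50, 70)
)
groups <- c(1, 1, 2, 2, 3, 3)

# Split the rows by cluster, then average every column within each cluster.
# Result: one row per variable, one column per cluster (same layout as tapply).
centroids <- sapply(split(toy, groups), colMeans)
centroids
```

On the real data, `sapply(split(airlines, clusterGroups), colMeans)` should give the same numbers as the individual `tapply` calls, arranged as one table.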
Looking at the output, we can examine which cluster has the largest mean for each variable and use that to describe the customers in each cluster.
Cluster 1 has the largest average value of “DaysSinceEnroll” and the lowest flight activity, so we can describe its customers as infrequent but loyal.
Cluster 2 has the largest averages of “QualMiles”, “FlightMiles”, and “FlightTrans”. Its customers have accumulated a large amount of miles and have the largest number of flight transactions.
Cluster 3 has the largest averages of “Balance”, “BonusMiles”, and “BonusTrans”. Its customers have accumulated a large amount of miles, mostly through non-flight transactions.
Cluster 4 does not have the largest average of any variable, but it has the lowest “DaysSinceEnroll” together with fairly high “BonusMiles” and “BonusTrans”. We can infer that these are relatively new customers who seem to be accumulating miles, mostly through non-flight transactions.
Cluster 5 also does not have the largest average of any variable. It has the lowest “Balance”, “BonusMiles”, and “BonusTrans” and a low “DaysSinceEnroll”, so its customers can be described as relatively new customers who don’t use the airline very often.
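Reading seven separate outputs by eye is error-prone, so the comparison can be checked programmatically. The sketch below copies the cluster means from the `tapply` output above into a matrix (the name `centroids` is introduced here for illustration) and asks, for each variable, which cluster has the largest mean.

```r
# Cluster means copied from the tapply output above (rows = variables,
# columns = clusters 1 to 5).
centroids <- rbind(
  Balance         = c(57866.90, 110669.27, 198191.57, 52335.91, 36255.91),
  QualMiles       = c(0.6443299, 1065.9826590, 30.3461538, 4.8479263, 2.5111773),
  BonusMiles      = c(10360.124, 22881.763, 55795.860, 20788.766, 2264.788),
  BonusTrans      = c(10.823454, 18.229287, 19.663968, 17.087558, 2.973174),
  FlightMiles     = c(83.18428, 2613.41811, 327.67611, 111.57373, 119.32191),
  FlightTrans     = c(0.3028351, 7.4026975, 1.0688259, 0.3444700, 0.4388972),
  DaysSinceEnroll = c(6235.365, 4402.414, 5615.709, 2840.823, 3060.081)
)
colnames(centroids) <- 1:5

# For each variable (row), the index of the cluster with the largest mean.
largest <- apply(centroids, 1, which.max)
largest
```

This returns cluster 3 for Balance, BonusMiles, and BonusTrans, cluster 2 for QualMiles, FlightMiles, and FlightTrans, and cluster 1 for DaysSinceEnroll. Clusters 4 and 5 do not appear, meaning neither has the largest mean of any variable.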