Market segmentation is a strategy that divides a broad target market of customers into smaller, more similar groups, and then designs a marketing strategy specifically for each group. Clustering is a common technique for market segmentation since it automatically finds similar groups given a data set.

In this problem, we’ll see how clustering can be used to find similar groups of customers who belong to an airline’s frequent flyer program. The airline is trying to learn more about its customers so that it can target different customer segments with different types of mileage offers.

The file AirlinesCluster.csv contains information on 3,999 members of the frequent flyer program. This data comes from the textbook “Data Mining for Business Intelligence,” by Galit Shmueli, Nitin R. Patel, and Peter C. Bruce. For more information, see the website for the book.

There are seven different variables in the dataset, described below:

Balance = number of miles eligible for award travel

QualMiles = number of miles qualifying for TopFlight status

BonusMiles = number of miles earned from non-flight bonus transactions in the past 12 months

BonusTrans = number of non-flight bonus transactions in the past 12 months

FlightMiles = number of flight miles in the past 12 months

FlightTrans = number of flight transactions in the past 12 months

DaysSinceEnroll = number of days since enrolled in the frequent flyer program

Normalizing the Data

Read the dataset AirlinesCluster.csv into R and call it “airlines”.

airlines <- read.csv("AirlinesCluster.csv")
summary(airlines)
##     Balance          QualMiles         BonusMiles       BonusTrans  
##  Min.   :      0   Min.   :    0.0   Min.   :     0   Min.   : 0.0  
##  1st Qu.:  18528   1st Qu.:    0.0   1st Qu.:  1250   1st Qu.: 3.0  
##  Median :  43097   Median :    0.0   Median :  7171   Median :12.0  
##  Mean   :  73601   Mean   :  144.1   Mean   : 17145   Mean   :11.6  
##  3rd Qu.:  92404   3rd Qu.:    0.0   3rd Qu.: 23800   3rd Qu.:17.0  
##  Max.   :1704838   Max.   :11148.0   Max.   :263685   Max.   :86.0  
##   FlightMiles       FlightTrans     DaysSinceEnroll
##  Min.   :    0.0   Min.   : 0.000   Min.   :   2   
##  1st Qu.:    0.0   1st Qu.: 0.000   1st Qu.:2330   
##  Median :    0.0   Median : 0.000   Median :4096   
##  Mean   :  460.1   Mean   : 1.374   Mean   :4119   
##  3rd Qu.:  311.0   3rd Qu.: 1.000   3rd Qu.:5790   
##  Max.   :30817.0   Max.   :53.000   Max.   :8296
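Note that the variables are on very different scales: Balance ranges up to about 1.7 million miles, while FlightTrans never exceeds 53. A quick way to confirm this (an optional check, not part of the original assignment) is to look at each variable's standard deviation:

# Standard deviation of each variable, showing how different the scales are
sapply(airlines, sd)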

Pre-process data

Because of these scale differences, we normalize the data before clustering so that variables with large values, such as Balance, do not dominate the distance calculations.

library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
# preProcess() with its default settings centers and scales every variable
preproc <- preProcess(airlines)

airlinesNorm <- predict(preproc, airlines)

summary(airlinesNorm)
##     Balance          QualMiles         BonusMiles        BonusTrans      
##  Min.   :-0.7303   Min.   :-0.1863   Min.   :-0.7099   Min.   :-1.20805  
##  1st Qu.:-0.5465   1st Qu.:-0.1863   1st Qu.:-0.6581   1st Qu.:-0.89568  
##  Median :-0.3027   Median :-0.1863   Median :-0.4130   Median : 0.04145  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.00000  
##  3rd Qu.: 0.1866   3rd Qu.:-0.1863   3rd Qu.: 0.2756   3rd Qu.: 0.56208  
##  Max.   :16.1868   Max.   :14.2231   Max.   :10.2083   Max.   : 7.74673  
##   FlightMiles       FlightTrans       DaysSinceEnroll   
##  Min.   :-0.3286   Min.   :-0.36212   Min.   :-1.99336  
##  1st Qu.:-0.3286   1st Qu.:-0.36212   1st Qu.:-0.86607  
##  Median :-0.3286   Median :-0.36212   Median :-0.01092  
##  Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.00000  
##  3rd Qu.:-0.1065   3rd Qu.:-0.09849   3rd Qu.: 0.80960  
##  Max.   :21.6803   Max.   :13.61035   Max.   : 2.02284
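If you prefer not to load caret, the same centering and scaling can be reproduced with base R's scale() function. This is just a sketch of an equivalent approach; it assumes all columns are numeric, which is the case here.

# Equivalent normalization without caret: subtract each column's mean and
# divide by its standard deviation
airlinesNorm2 <- as.data.frame(scale(airlines))
summary(airlinesNorm2)  # should match the summary of airlinesNorm above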

Hierarchical Clustering

Compute the distances between the data points (using Euclidean distance) and then run the hierarchical clustering algorithm (using method = "ward.D") on the normalized data. The commands may take a few minutes to finish, since the dataset has a relatively large number of observations for hierarchical clustering.

distances <- dist(airlinesNorm, method = "euclidean")

# Hierarchical clustering
airlineClust <- hclust(distances, method = "ward.D") 

Then, plot the dendrogram of the hierarchical clustering process. Suppose the airline is looking for somewhere between 2 and 10 clusters.

plot(airlineClust)

Suppose that after looking at the dendrogram and discussing with the marketing department, the airline decides to proceed with 5 clusters.

clusterGroups <- cutree(airlineClust, k = 5)
airlineClusters <- split(airlinesNorm, clusterGroups)
table(clusterGroups)
## clusterGroups
##    1    2    3    4    5 
##  776  519  494  868 1342
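To see where this 5-cluster cut falls on the dendrogram, you can redraw it and outline the clusters with rect.hclust() (purely a visual aid):

# Outline the chosen 5-cluster solution on the dendrogram
plot(airlineClust)
rect.hclust(airlineClust, k = 5, border = "red")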

Now, use tapply to compare the average values in each of the variables for the 5 clusters (the centroids of the clusters).

tapply(airlines$Balance, clusterGroups, mean)
##         1         2         3         4         5 
##  57866.90 110669.27 198191.57  52335.91  36255.91
tapply(airlines$QualMiles, clusterGroups, mean)
##            1            2            3            4            5 
##    0.6443299 1065.9826590   30.3461538    4.8479263    2.5111773
tapply(airlines$BonusMiles, clusterGroups, mean)
##         1         2         3         4         5 
## 10360.124 22881.763 55795.860 20788.766  2264.788
tapply(airlines$BonusTrans, clusterGroups, mean)
##         1         2         3         4         5 
## 10.823454 18.229287 19.663968 17.087558  2.973174
tapply(airlines$FlightMiles, clusterGroups, mean)
##          1          2          3          4          5 
##   83.18428 2613.41811  327.67611  111.57373  119.32191
tapply(airlines$FlightTrans, clusterGroups, mean)
##         1         2         3         4         5 
## 0.3028351 7.4026975 1.0688259 0.3444700 0.4388972
tapply(airlines$DaysSinceEnroll, clusterGroups, mean)
##        1        2        3        4        5 
## 6235.365 4402.414 5615.709 2840.823 3060.081
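Rather than calling tapply once per variable, the same centroids can be computed in a single step (a compact alternative, not part of the original assignment):

# Mean of every variable within each hierarchical cluster, in the original units
round(sapply(split(airlines, clusterGroups), colMeans), 2)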

Cluster 1 has the largest average value of DaysSinceEnroll. These are infrequent but loyal customers: they mostly have few miles, but they have been with the airline the longest.

Cluster 2 has the largest average values of QualMiles, FlightMiles, and FlightTrans. These customers have accumulated a large number of miles, and they have the largest number of flight transactions.

Cluster 3 has the largest average values of Balance, BonusMiles, and BonusTrans. These customers have accumulated a large number of miles, mostly through non-flight transactions.

Cluster 4 does not have the largest average value of any variable. These customers have the smallest value of DaysSinceEnroll but are already accumulating a reasonable number of miles: relatively new customers who seem to earn miles mostly through non-flight transactions.

Cluster 5 does not have the largest average value of any variable either. Its customers have lower-than-average values across all the variables: relatively new customers who do not use the airline very often.

K-Means Clustering

Now run the k-means clustering algorithm on the normalized data, again creating 5 clusters.

# Run k-means
set.seed(88)

kmeansClust <- kmeans(airlinesNorm, centers = 5, iter.max = 1000)

table(kmeansClust$cluster)
## 
##    1    2    3    4    5 
##  408  141  993 1182 1275

Now, compare the cluster centroids to each other, either by dividing the data points into groups and then using tapply, or by looking at the output of kmeansClust$centers, where “kmeansClust” is the name of the output of the kmeans function. Below, the centroids of the hierarchical clusters (on the normalized data) are printed first, followed by the k-means centers, so the two solutions can be compared side by side.

# Centroids of the hierarchical clusters on the normalized data
# (airlineClusters was created with split() above)
round(sapply(airlineClusters, colMeans), 4)
##                       1      2       3       4       5
## Balance         -0.1561 0.3678  1.2363 -0.2110 -0.3706
## QualMiles       -0.1854 1.1916 -0.1471 -0.1800 -0.1830
## BonusMiles      -0.2809 0.2375  1.6004  0.1509 -0.6161
## BonusTrans      -0.0811 0.6901  0.8395  0.5712 -0.8985
## FlightMiles     -0.2692 1.5379 -0.0945 -0.2489 -0.2433
## FlightTrans     -0.2823 1.5895 -0.0803 -0.2713 -0.2464
## DaysSinceEnroll  1.0250 0.1375  0.7250 -0.6187 -0.5125
kmeansClust$centers
##       Balance   QualMiles BonusMiles BonusTrans FlightMiles FlightTrans
## 1  1.44439706  0.51115730  1.8769284  1.0331951   0.1169945   0.1444636
## 2  1.00054098  0.68382234  0.6144780  1.7214887   3.8559798   4.1196141
## 3 -0.05580605 -0.14104391  0.3041358  0.7108744  -0.1218278  -0.1287569
## 4 -0.13331742 -0.11491607 -0.3492669 -0.3373455  -0.1833989  -0.1961819
## 5 -0.40579897 -0.02281076 -0.5816482 -0.7619054  -0.1989602  -0.2196582
##   DaysSinceEnroll
## 1       0.7198040
## 2       0.2742394
## 3      -0.3398209
## 4       0.9640923
## 5      -0.8897747
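The k-means centers above are expressed in normalized units. To interpret them on the original scale, you can average the raw variables within each k-means cluster (the same split-and-colMeans idea as before, just using the k-means assignments):

# k-means centroids in the original (unnormalized) units
round(sapply(split(airlines, kmeansClust$cluster), colMeans), 2)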

Cluster ordering is not meaningful in either k-means clustering or hierarchical clustering, so Cluster 1 of the k-means output does not necessarily correspond to Cluster 1 of the hierarchical output.
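A cross-tabulation of the two sets of assignments makes this concrete: if the two clusterings labelled the groups the same way, most counts would fall on the diagonal of this table (a quick check; the exact counts depend on the random seed set above).

# Compare hierarchical and k-means assignments observation by observation
table(hierarchical = clusterGroups, kmeans = kmeansClust$cluster)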