Market segmentation is a strategy that divides a broad target market of customers into smaller, more similar groups, and then designs a marketing strategy specifically for each group. Clustering is a common technique for market segmentation since it automatically finds similar groups given a data set.
In this problem, we’ll see how clustering can be used to find similar groups of customers who belong to an airline’s frequent flyer program. The airline is trying to learn more about its customers so that it can target different customer segments with different types of mileage offers.
The file AirlinesCluster.csv contains information on 3,999 members of the frequent flyer program. This data comes from the textbook “Data Mining for Business Intelligence,” by Galit Shmueli, Nitin R. Patel, and Peter C. Bruce. For more information, see the website for the book.
There are seven different variables in the dataset, described below:
Balance = number of miles eligible for award travel
QualMiles = number of miles qualifying for TopFlight status
BonusMiles = number of miles earned from non-flight bonus transactions in the past 12 months
BonusTrans = number of non-flight bonus transactions in the past 12 months
FlightMiles = number of flight miles in the past 12 months
FlightTrans = number of flight transactions in the past 12 months
DaysSinceEnroll = number of days since enrolled in the frequent flyer program
Read the dataset AirlinesCluster.csv into R and call it “airlines”.
airlines <- read.csv("AirlinesCluster.csv")
summary(airlines)
## Balance QualMiles BonusMiles BonusTrans
## Min. : 0 Min. : 0.0 Min. : 0 Min. : 0.0
## 1st Qu.: 18528 1st Qu.: 0.0 1st Qu.: 1250 1st Qu.: 3.0
## Median : 43097 Median : 0.0 Median : 7171 Median :12.0
## Mean : 73601 Mean : 144.1 Mean : 17145 Mean :11.6
## 3rd Qu.: 92404 3rd Qu.: 0.0 3rd Qu.: 23800 3rd Qu.:17.0
## Max. :1704838 Max. :11148.0 Max. :263685 Max. :86.0
## FlightMiles FlightTrans DaysSinceEnroll
## Min. : 0.0 Min. : 0.000 Min. : 2
## 1st Qu.: 0.0 1st Qu.: 0.000 1st Qu.:2330
## Median : 0.0 Median : 0.000 Median :4096
## Mean : 460.1 Mean : 1.374 Mean :4119
## 3rd Qu.: 311.0 3rd Qu.: 1.000 3rd Qu.:5790
## Max. :30817.0 Max. :53.000 Max. :8296
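Pre-process the data by normalizing it. By default, preProcess from the caret package centers and scales each variable, so every normalized variable has mean 0 and standard deviation 1.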
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
preproc <- preProcess(airlines)
airlinesNorm <- predict(preproc, airlines)
summary(airlinesNorm)
## Balance QualMiles BonusMiles BonusTrans
## Min. :-0.7303 Min. :-0.1863 Min. :-0.7099 Min. :-1.20805
## 1st Qu.:-0.5465 1st Qu.:-0.1863 1st Qu.:-0.6581 1st Qu.:-0.89568
## Median :-0.3027 Median :-0.1863 Median :-0.4130 Median : 0.04145
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.00000
## 3rd Qu.: 0.1866 3rd Qu.:-0.1863 3rd Qu.: 0.2756 3rd Qu.: 0.56208
## Max. :16.1868 Max. :14.2231 Max. :10.2083 Max. : 7.74673
## FlightMiles FlightTrans DaysSinceEnroll
## Min. :-0.3286 Min. :-0.36212 Min. :-1.99336
## 1st Qu.:-0.3286 1st Qu.:-0.36212 1st Qu.:-0.86607
## Median :-0.3286 Median :-0.36212 Median :-0.01092
## Mean : 0.0000 Mean : 0.00000 Mean : 0.00000
## 3rd Qu.:-0.1065 3rd Qu.:-0.09849 3rd Qu.: 0.80960
## Max. :21.6803 Max. :13.61035 Max. : 2.02284
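To make the normalization step concrete, here is a minimal base R sketch of the same centering and scaling, without caret. The data frame `toy` and its values are made up for illustration and are not taken from AirlinesCluster.csv.

```r
# Hypothetical toy data frame standing in for the airline data (values are made up).
toy <- data.frame(
  Balance    = c(1000, 25000, 60000, 150000),
  BonusTrans = c(0, 3, 12, 20)
)

# scale() subtracts each column's mean and divides by its standard deviation.
toyNorm <- as.data.frame(scale(toy))

round(colMeans(toyNorm), 10)   # each column mean is 0
apply(toyNorm, 2, sd)          # each column standard deviation is 1
```

Normalizing matters here because Euclidean distance would otherwise be dominated by the variables with the largest raw values, such as Balance.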
Compute the distances between data points (using Euclidean distance) and then run the hierarchical clustering algorithm (using method = “ward.D”) on the normalized data. The commands may take a few minutes to finish, because hierarchical clustering needs the distance between every pair of observations and this dataset has several thousand of them.
distances <- dist(airlinesNorm, method = "euclidean")
# Hierarchical clustering
airlineClust <- hclust(distances, method = "ward.D")
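A quick back-of-the-envelope calculation shows why this step is slow. The distance object stores one value per pair of observations; the memory figure below assumes 8 bytes per stored distance and ignores any overhead.

```r
n <- 3999                      # observations in the airline data
nPairs <- n * (n - 1) / 2      # number of pairwise distances
nPairs                         # 7994001
nPairs * 8 / 1024^2            # approximate size in MiB, roughly 61
```

The number of pairs grows quadratically with the number of observations, which is why hierarchical clustering becomes impractical on much larger datasets.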
Then, plot the dendrogram of the hierarchical clustering process. Suppose the airline is looking for somewhere between 2 and 10 clusters.
plot(airlineClust)
Suppose that after looking at the dendrogram and discussing with the marketing department, the airline decides to proceed with 5 clusters.
clusterGroups <- cutree(airlineClust, k = 5)
airlineClusters <- split(airlinesNorm, clusterGroups)
table(clusterGroups)
## clusterGroups
## 1 2 3 4 5
## 776 519 494 868 1342
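When several cluster counts are under consideration, it can help to look at the cluster sizes for each candidate before settling on one. The sketch below does this on made-up toy data (the matrix `toy` is hypothetical); on the real data the same idea would apply to `airlineClust`.

```r
set.seed(1)
# Hypothetical toy data: three well-separated groups in two dimensions.
toy <- rbind(
  matrix(rnorm(40, mean = 0), ncol = 2),
  matrix(rnorm(40, mean = 5), ncol = 2),
  matrix(rnorm(40, mean = 10), ncol = 2)
)
toyClust <- hclust(dist(toy, method = "euclidean"), method = "ward.D")

# cutree() accepts a vector of k and returns one column of labels per value.
cuts <- cutree(toyClust, k = 2:5)

# Cluster sizes for each candidate number of clusters.
apply(cuts, 2, table)
```

A candidate that produces one or more very small clusters may be less useful for marketing purposes than one with more balanced sizes.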
Now, use tapply to compare the average values in each of the variables for the 5 clusters (the centroids of the clusters).
tapply(airlines$Balance, clusterGroups, mean)
## 1 2 3 4 5
## 57866.90 110669.27 198191.57 52335.91 36255.91
tapply(airlines$QualMiles, clusterGroups, mean)
## 1 2 3 4 5
## 0.6443299 1065.9826590 30.3461538 4.8479263 2.5111773
tapply(airlines$BonusMiles, clusterGroups, mean)
## 1 2 3 4 5
## 10360.124 22881.763 55795.860 20788.766 2264.788
tapply(airlines$BonusTrans, clusterGroups, mean)
## 1 2 3 4 5
## 10.823454 18.229287 19.663968 17.087558 2.973174
tapply(airlines$FlightMiles, clusterGroups, mean)
## 1 2 3 4 5
## 83.18428 2613.41811 327.67611 111.57373 119.32191
tapply(airlines$FlightTrans, clusterGroups, mean)
## 1 2 3 4 5
## 0.3028351 7.4026975 1.0688259 0.3444700 0.4388972
tapply(airlines$DaysSinceEnroll, clusterGroups, mean)
## 1 2 3 4 5
## 6235.365 4402.414 5615.709 2840.823 3060.081
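The seven tapply calls above can be condensed into a single expression that returns all the centroids at once. This is a sketch on made-up data: `toy` and `toyGroups` are hypothetical stand-ins for `airlines` and `clusterGroups`.

```r
# Hypothetical stand-ins for `airlines` and `clusterGroups` (made-up values).
toy <- data.frame(
  Balance     = c(100, 200, 300, 400, 500, 600),
  FlightMiles = c(0, 10, 0, 50, 20, 40)
)
toyGroups <- c(1, 1, 2, 2, 3, 3)

# split() makes one data frame per cluster; colMeans() averages every variable.
# The result has one row per variable and one column per cluster.
centroids <- sapply(split(toy, toyGroups), colMeans)
centroids
```

On the real data, the equivalent call would be sapply(split(airlines, clusterGroups), colMeans).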
Cluster 1 has the largest average value of DaysSinceEnroll.
Cluster 1 customers are infrequent but loyal: they have relatively few miles but have been with the airline the longest.
Cluster 2 has the largest average values of QualMiles, FlightMiles, and FlightTrans.
Cluster 2 customers have accumulated a large number of miles and have the largest number of flight transactions.
Cluster 3 has the largest average values of Balance, BonusMiles, and BonusTrans. These customers have accumulated a large number of miles, mostly through non-flight transactions.
Cluster 4 does not have the largest average value of any variable. Its customers have the smallest average DaysSinceEnroll but are already accumulating a reasonable number of miles. They are relatively new customers who seem to be accumulating miles mostly through non-flight transactions.
Cluster 5 does not have the largest average value of any variable. These are relatively new customers who don’t use the airline very often, and their averages are below the overall mean for every variable.
Now run the k-means clustering algorithm on the normalized data, again creating 5 clusters.
# Run k-means
set.seed(88)
kmeansClust <- kmeans(airlinesNorm, centers = 5, iter.max = 1000)
table(kmeansClust$cluster)
##
## 1 2 3 4 5
## 408 141 993 1182 1275
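K-means starts from randomly chosen centers, which is why the seed is set before the call. The sketch below illustrates this on made-up toy data (the matrix `toy` is hypothetical). It also shows the `nstart` argument of kmeans, which the analysis above does not use, as one possible way to reduce dependence on a single random start.

```r
set.seed(1)
# Hypothetical toy data: two groups in two dimensions.
toy <- rbind(
  matrix(rnorm(60, mean = 0), ncol = 2),
  matrix(rnorm(60, mean = 4), ncol = 2)
)

# Same seed, same starting centers, same result.
set.seed(88)
runA <- kmeans(toy, centers = 2, iter.max = 1000)
set.seed(88)
runB <- kmeans(toy, centers = 2, iter.max = 1000)
identical(runA$cluster, runB$cluster)   # TRUE

# nstart = 25 tries 25 random starts and keeps the run with the
# lowest total within-cluster sum of squares.
runC <- kmeans(toy, centers = 2, iter.max = 1000, nstart = 25)
runC$tot.withinss
```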
Now, compare the cluster centroids to each other either by dividing the data points into groups and then using tapply, or by looking at the output of kmeansClust$centers, where “kmeansClust” is the name of the output of the kmeans function.
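# Centroids of the hierarchical clusters in normalized units, for comparison with the k-means centers below
hierClustersNorm <- split(airlinesNorm, clusterGroups)
round(sapply(hierClustersNorm, colMeans), 4)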
## 1 2 3 4 5
## Balance -0.1561 0.3678 1.2363 -0.2110 -0.3706
## QualMiles -0.1854 1.1916 -0.1471 -0.1800 -0.1830
## BonusMiles -0.2809 0.2375 1.6004 0.1509 -0.6161
## BonusTrans -0.0811 0.6901 0.8395 0.5712 -0.8985
## FlightMiles -0.2692 1.5379 -0.0945 -0.2489 -0.2433
## FlightTrans -0.2823 1.5895 -0.0803 -0.2713 -0.2464
## DaysSinceEnroll 1.0250 0.1375 0.7250 -0.6187 -0.5125
kmeansClust$centers
## Balance QualMiles BonusMiles BonusTrans FlightMiles FlightTrans
## 1 1.44439706 0.51115730 1.8769284 1.0331951 0.1169945 0.1444636
## 2 1.00054098 0.68382234 0.6144780 1.7214887 3.8559798 4.1196141
## 3 -0.05580605 -0.14104391 0.3041358 0.7108744 -0.1218278 -0.1287569
## 4 -0.13331742 -0.11491607 -0.3492669 -0.3373455 -0.1833989 -0.1961819
## 5 -0.40579897 -0.02281076 -0.5816482 -0.7619054 -0.1989602 -0.2196582
## DaysSinceEnroll
## 1 0.7198040
## 2 0.2742394
## 3 -0.3398209
## 4 0.9640923
## 5 -0.8897747
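Cluster ordering is not meaningful in either k-means clustering or hierarchical clustering. The cluster numbers are arbitrary labels, so cluster 1 from one method need not correspond to cluster 1 from the other; match clusters across methods by comparing their centroids.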
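Because the labels are arbitrary, one way to see which clusters correspond across the two methods is to cross-tabulate the two sets of assignments. The sketch below uses made-up toy data (`toy`, `hierLabels`, and `kmLabels` are hypothetical names).

```r
set.seed(1)
# Hypothetical toy data: two well-separated groups in two dimensions.
toy <- rbind(
  matrix(rnorm(40, mean = 0), ncol = 2),
  matrix(rnorm(40, mean = 6), ncol = 2)
)

hierLabels <- cutree(hclust(dist(toy), method = "ward.D"), k = 2)
set.seed(88)
kmLabels <- kmeans(toy, centers = 2)$cluster

# Rows are hierarchical labels, columns are k-means labels.
# A large count in a cell means those two clusters largely overlap.
table(hierLabels, kmLabels)
```

On the real data, the equivalent call would be table(clusterGroups, kmeansClust$cluster).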