Market segmentation is a strategy that divides a broad target market of customers into smaller, more similar groups, and then designs a marketing strategy specifically for each group. Clustering is a common technique for market segmentation since it automatically finds similar groups given a data set.
In this problem, we’ll see how clustering can be used to find similar groups of customers who belong to an airline’s frequent flyer program. The airline is trying to learn more about its customers so that it can target different customer segments with different types of mileage offers.
The file AirlinesCluster.csv contains information on 3,999 members of the frequent flyer program. This data comes from the textbook “Data Mining for Business Intelligence,” by Galit Shmueli, Nitin R. Patel, and Peter C. Bruce. For more information, see the website for the book.
There are seven variables in the dataset: Balance, QualMiles, BonusMiles, BonusTrans, FlightMiles, FlightTrans, and DaysSinceEnroll.
Read the dataset AirlinesCluster.csv into R and call it “airlines”.
# Load the dataset
airlines = read.csv("AirlinesCluster.csv")
# Obtain a summary of the dataset
library(knitr)  # provides kable()
z = summary(airlines)
kable(z)

| Balance | QualMiles | BonusMiles | BonusTrans | FlightMiles | FlightTrans | DaysSinceEnroll |
|---|---|---|---|---|---|---|
| Min. : 0 | Min. : 0.0 | Min. : 0 | Min. : 0.0 | Min. : 0.0 | Min. : 0.000 | Min. : 2 |
| 1st Qu.: 18528 | 1st Qu.: 0.0 | 1st Qu.: 1250 | 1st Qu.: 3.0 | 1st Qu.: 0.0 | 1st Qu.: 0.000 | 1st Qu.:2330 |
| Median : 43097 | Median : 0.0 | Median : 7171 | Median :12.0 | Median : 0.0 | Median : 0.000 | Median :4096 |
| Mean : 73601 | Mean : 144.1 | Mean : 17145 | Mean :11.6 | Mean : 460.1 | Mean : 1.374 | Mean :4119 |
| 3rd Qu.: 92404 | 3rd Qu.: 0.0 | 3rd Qu.: 23801 | 3rd Qu.:17.0 | 3rd Qu.: 311.0 | 3rd Qu.: 1.000 | 3rd Qu.:5790 |
| Max. :1704838 | Max. :11148.0 | Max. :263685 | Max. :86.0 | Max. :30817.0 | Max. :53.000 | Max. :8296 |

On average, BonusTrans and FlightTrans have the smallest values: their means are 11.6 and 1.374, and neither exceeds 100, whereas every other variable ranges into the thousands or more.
Balance and BonusMiles have the largest values on average, with means in the tens of thousands (73,601 and 17,145).
If we don’t normalize the data, the clustering will be dominated by the variables that are on a larger scale.
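To see why scale matters, here is a minimal sketch on a made-up three-customer data frame (the `toy` values are hypothetical and are not taken from AirlinesCluster.csv). It compares Euclidean distances before and after centering and scaling with base R's `scale()`.

```r
# Hypothetical toy data: A and B differ mainly in BonusTrans,
# A and C differ mainly in Balance
toy <- data.frame(
  Balance    = c(10000, 10500, 90000),
  BonusTrans = c(2, 40, 3)
)
rownames(toy) <- c("A", "B", "C")

# Distances on the raw data: the Balance column dominates
rawDist <- as.matrix(dist(toy, method = "euclidean"))

# Distances after centering and scaling each column
normDist <- as.matrix(dist(scale(toy), method = "euclidean"))

print(round(rawDist, 1))
print(round(normDist, 3))
```

On the raw data, A looks far closer to B than to C simply because the Balance gap is measured in tens of thousands. After scaling, the large BonusTrans gap between A and B counts about as much as the large Balance gap between A and C.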
Let’s go ahead and normalize our data. You can normalize the variables in a data frame by using the preProcess function in the “caret” package. You should already have this package installed from Week 4, but if not, install it with install.packages("caret"). Then load the package with library(caret).
# Load the caret package
library(caret)
Now, create a normalized data frame called “airlinesNorm” by running the following commands:
# Pre-process the data
preproc = preProcess(airlines)
airlinesNorm = predict(preproc, airlines)
The first command computes the centering and scaling parameters for each variable, and the second command applies them to perform the normalization. If you look at the summary of airlinesNorm, you should see that all of the variables now have mean zero. You can also check that each variable has standard deviation 1 by applying the sd() function to each column.
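The check on means and standard deviations can be sketched as follows. This example is an illustration under assumptions: it uses a small synthetic data frame (`demo`, hypothetical) and base R's `scale()` as a stand-in for the center-and-scale step, so that it runs without the caret package or the CSV file. Note that sd() works on a single vector, so it is applied column by column with sapply().

```r
# Hypothetical stand-in for the airlines data frame
set.seed(1)
demo <- data.frame(
  Balance    = rexp(100, rate = 1 / 70000),
  BonusTrans = rpois(100, lambda = 12)
)

# Center each column to mean 0 and scale it to standard deviation 1
demoNorm <- as.data.frame(scale(demo))

# sd() expects a vector, so apply it (and mean) to each column in turn
normMeans <- sapply(demoNorm, mean)
normSds   <- sapply(demoNorm, sd)

print(round(normMeans, 10))
print(normSds)
```

With the real data, the same check would be sapply(airlinesNorm, mean) and sapply(airlinesNorm, sd).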
# Obtain a summary of the normalized data
z = summary(airlinesNorm)
kable(z)

| Balance | QualMiles | BonusMiles | BonusTrans | FlightMiles | FlightTrans | DaysSinceEnroll |
|---|---|---|---|---|---|---|
| Min. :-0.7303 | Min. :-0.1863 | Min. :-0.7099 | Min. :-1.20805 | Min. :-0.3286 | Min. :-0.36212 | Min. :-1.99336 |
| 1st Qu.:-0.5465 | 1st Qu.:-0.1863 | 1st Qu.:-0.6581 | 1st Qu.:-0.89568 | 1st Qu.:-0.3286 | 1st Qu.:-0.36212 | 1st Qu.:-0.86607 |
| Median :-0.3027 | Median :-0.1863 | Median :-0.4130 | Median : 0.04145 | Median :-0.3286 | Median :-0.36212 | Median :-0.01092 |
| Mean : 0.0000 | Mean : 0.0000 | Mean : 0.0000 | Mean : 0.00000 | Mean : 0.0000 | Mean : 0.00000 | Mean : 0.00000 |
| 3rd Qu.: 0.1866 | 3rd Qu.:-0.1863 | 3rd Qu.: 0.2756 | 3rd Qu.: 0.56208 | 3rd Qu.:-0.1065 | 3rd Qu.:-0.09849 | 3rd Qu.: 0.80960 |
| Max. :16.1868 | Max. :14.2231 | Max. :10.2083 | Max. : 7.74673 | Max. :21.6803 | Max. :13.61035 | Max. : 2.02284 |

After normalization, FlightMiles has the largest maximum value (21.6803) and DaysSinceEnroll has the smallest minimum value (-1.99336).
Compute the distances between data points (using Euclidean distance) and then run the hierarchical clustering algorithm (using method="ward.D") on the normalized data. The commands may take a few minutes to finish, since 3,999 observations is a large dataset for hierarchical clustering.
# Hierarchical clustering algorithm
airlinesNormDist = dist(airlinesNorm, method="euclidean")
airlinesNormHierClust = hclust(airlinesNormDist, method="ward.D")
plot(airlinesNormHierClust)
If you move a horizontal line down the dendrogram, it crosses 2, 3, or 7 vertical lines over a long range of heights, so each of those is a reasonable number of clusters. However, it is hard to find a height at which the line crosses exactly 6 vertical lines. This means that 6 clusters is probably not a good choice.
# Plot the dendrogram and outline 5 clusters
plot(airlinesNormHierClust)
rect.hclust(airlinesNormHierClust, k = 5, border = "red")
# Assign each observation to one of the 5 clusters
hierGroups = cutree(airlinesNormHierClust, k = 5)
# Subset the normalized data into the 5 clusters
HierCluster1 = subset(airlinesNorm, hierGroups == 1)
HierCluster2 = subset(airlinesNorm, hierGroups == 2)
HierCluster3 = subset(airlinesNorm, hierGroups == 3)
HierCluster4 = subset(airlinesNorm, hierGroups == 4)
HierCluster5 = subset(airlinesNorm, hierGroups == 5)
# Output the number of rows in Cluster 1
nrow(HierCluster1)
## [1] 776
Cluster 1 has 776 data points.
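If you only need the cluster sizes, you can tabulate the vector returned by cutree instead of building one subset per cluster. The sketch below illustrates this on synthetic two-dimensional points (`toy`, hypothetical data made of two well-separated blobs) rather than on the airline data.

```r
# Hypothetical data: 20 points near (0, 0) and 10 points near (8, 8)
set.seed(1)
toy <- rbind(
  matrix(rnorm(40, mean = 0), ncol = 2),
  matrix(rnorm(20, mean = 8), ncol = 2)
)

# Same pipeline as above: distances, Ward clustering, cut into k groups
toyClust  <- hclust(dist(toy, method = "euclidean"), method = "ward.D")
toyGroups <- cutree(toyClust, k = 2)

# Count the observations assigned to each cluster in one call
toySizes <- table(toyGroups)
print(toySizes)
```

For the airline data, the equivalent call would be table(hierGroups), which reports the sizes of all 5 clusters at once.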
Now, use tapply to compare the average values in each of the variables for the 5 clusters (the centroids of the clusters). You may want to compute the average values of the unnormalized data so that it is easier to interpret. You can do this for the variable “Balance” with the following command:
# Compute the mean Balance within each hierarchical cluster
tapply(airlines$Balance, hierGroups, mean)
##         1         2         3         4         5
##  57866.90 110669.27 198191.57  52335.91  36255.91
z = tapply(airlines$QualMiles, hierGroups, mean)
kable(z)

| Cluster | Mean QualMiles |
|---|---|
| 1 | 0.6443299 |
| 2 | 1065.9826590 |
| 3 | 30.3461538 |
| 4 | 4.8479263 |
| 5 | 2.5111773 |

z = tapply(airlines$BonusMiles, hierGroups, mean)
kable(z)

| Cluster | Mean BonusMiles |
|---|---|
| 1 | 10360.124 |
| 2 | 22881.763 |
| 3 | 55795.860 |
| 4 | 20788.766 |
| 5 | 2264.788 |

z = tapply(airlines$BonusTrans, hierGroups, mean)
kable(z)

| Cluster | Mean BonusTrans |
|---|---|
| 1 | 10.823454 |
| 2 | 18.229287 |
| 3 | 19.663968 |
| 4 | 17.087558 |
| 5 | 2.973174 |

z = tapply(airlines$FlightMiles, hierGroups, mean)
kable(z)

| Cluster | Mean FlightMiles |
|---|---|
| 1 | 83.18428 |
| 2 | 2613.41811 |
| 3 | 327.67611 |
| 4 | 111.57373 |
| 5 | 119.32191 |

z = tapply(airlines$FlightTrans, hierGroups, mean)
kable(z)

| Cluster | Mean FlightTrans |
|---|---|
| 1 | 0.3028351 |
| 2 | 7.4026975 |
| 3 | 1.0688259 |
| 4 | 0.3444700 |
| 5 | 0.4388972 |

z = tapply(airlines$DaysSinceEnroll, hierGroups, mean)
kable(z)

| Cluster | Mean DaysSinceEnroll |
|---|---|
| 1 | 6235.365 |
| 2 | 4402.414 |
| 3 | 5615.709 |
| 4 | 2840.823 |
| 5 | 3060.081 |

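The per-variable tapply calls can also be combined into a single centroid table. The following is one possible sketch, shown on a tiny hand-made data frame (`mini` and `miniGroups` are hypothetical) so that the result is easy to verify by eye.

```r
# Hypothetical miniature of the airlines data with a known cluster assignment
mini <- data.frame(
  Balance     = c(100, 300, 1000, 3000),
  FlightTrans = c(0, 2, 10, 30)
)
miniGroups <- c(1, 1, 2, 2)

# Apply tapply to every column: rows are clusters, columns are variables
centroids <- sapply(mini, function(col) tapply(col, miniGroups, mean))
print(centroids)
```

Applied to the real data, sapply(airlines, function(col) tapply(col, hierGroups, mean)) would return a 5-by-7 matrix with one row per cluster, which makes it easier to compare clusters across variables.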
Cluster 1 has the largest average value in DaysSinceEnroll.
Customers in Cluster 1 are infrequent but loyal customers.
Cluster 2 has the largest average values in the variables QualMiles, FlightMiles and FlightTrans. This cluster also has relatively large values in BonusTrans and Balance.
Cluster 2 contains customers with a large amount of miles, mostly accumulated through flight transactions.
Cluster 3 has the largest values in Balance, BonusMiles, and BonusTrans. While it also has relatively large values in other variables, these are the three for which it has the largest values.
Cluster 3 mostly contains customers with a lot of miles, and who have earned the miles mostly through bonus transactions.
Cluster 4 does not have the largest values in any of the variables.
Cluster 4 customers have the smallest value in DaysSinceEnroll, but they are already accumulating a reasonable number of miles.
Cluster 5 does not have the largest values in any of the variables.
Cluster 5 customers have lower than average values in all variables.
Now run the k-means clustering algorithm on the normalized data, again creating 5 clusters. Set the seed to 88 right before running the clustering algorithm, and set the argument iter.max to 1000.
# k-means algorithm
set.seed(88)
kmc = kmeans(airlinesNorm, centers=5, iter.max = 1000)
# Subset the normalized data into the 5 k-means clusters
KmeansCluster1 = subset(airlinesNorm, kmc$cluster == 1)
KmeansCluster2 = subset(airlinesNorm, kmc$cluster == 2)
KmeansCluster3 = subset(airlinesNorm, kmc$cluster == 3)
KmeansCluster4 = subset(airlinesNorm, kmc$cluster == 4)
KmeansCluster5 = subset(airlinesNorm, kmc$cluster == 5)
# Count the number of rows in each cluster
nrow(KmeansCluster1)
## [1] 408
nrow(KmeansCluster2)
## [1] 141
nrow(KmeansCluster3)
## [1] 993
nrow(KmeansCluster4)
## [1] 1182
nrow(KmeansCluster5)
## [1] 1275
There are two clusters with more than 1,000 observations: Cluster 4 (1,182) and Cluster 5 (1,275).
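The object returned by kmeans already stores the cluster sizes, so the five subsets are not strictly needed for counting. Here is a small sketch on synthetic points (`toy` is hypothetical data, not the airline data) showing that the `size` component agrees with a tabulation of the `cluster` component.

```r
# Hypothetical data: two blobs of 20 and 10 points
set.seed(88)
toy <- rbind(
  matrix(rnorm(40, mean = 0), ncol = 2),
  matrix(rnorm(20, mean = 8), ncol = 2)
)

toyKmc <- kmeans(toy, centers = 2, iter.max = 1000)

# Cluster sizes, indexed by cluster number
print(toyKmc$size)

# The same counts, derived from the per-observation assignments
print(table(toyKmc$cluster))
```

For the model above, kmc$size would list the five cluster sizes in cluster order.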
# Compute the mean of each variable within each k-means cluster
z = tapply(airlines$Balance, kmc$cluster, mean)
kable(z)

| Cluster | Mean Balance |
|---|---|
| 1 | 219161.40 |
| 2 | 174431.51 |
| 3 | 67977.44 |
| 4 | 60166.18 |
| 5 | 32706.67 |

z = tapply(airlines$QualMiles, kmc$cluster, mean)
kable(z)

| Cluster | Mean QualMiles |
|---|---|
| 1 | 539.57843 |
| 2 | 673.16312 |
| 3 | 34.99396 |
| 4 | 55.20812 |
| 5 | 126.46667 |

z = tapply(airlines$BonusMiles, kmc$cluster, mean)
kable(z)

| Cluster | Mean BonusMiles |
|---|---|
| 1 | 62474.483 |
| 2 | 31985.085 |
| 3 | 24490.019 |
| 4 | 8709.712 |
| 5 | 3097.478 |

z = tapply(airlines$BonusTrans, kmc$cluster, mean)
kable(z)

| Cluster | Mean BonusTrans |
|---|---|
| 1 | 21.524510 |
| 2 | 28.134752 |
| 3 | 18.429003 |
| 4 | 8.362098 |
| 5 | 4.284706 |

z = tapply(airlines$FlightMiles, kmc$cluster, mean)
kable(z)

| Cluster | Mean FlightMiles |
|---|---|
| 1 | 623.8725 |
| 2 | 5859.2340 |
| 3 | 289.4713 |
| 4 | 203.2589 |
| 5 | 181.4698 |

z = tapply(airlines$FlightTrans, kmc$cluster, mean)
kable(z)

| Cluster | Mean FlightTrans |
|---|---|
| 1 | 1.9215686 |
| 2 | 17.0000000 |
| 3 | 0.8851964 |
| 4 | 0.6294416 |
| 5 | 0.5403922 |

z = tapply(airlines$DaysSinceEnroll, kmc$cluster, mean)
kable(z)

| Cluster | Mean DaysSinceEnroll |
|---|---|
| 1 | 5605.051 |
| 2 | 4684.901 |
| 3 | 3416.783 |
| 4 | 6109.540 |
| 5 | 2281.055 |

The clusters are not displayed in a meaningful order, so while there may be a cluster produced by the k-means algorithm that is similar to Cluster 1 produced by the Hierarchical method, it will not necessarily be shown first.
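One way to match clusters across the two methods is to cross-tabulate the two assignment vectors: a large count in a cell suggests that the corresponding hierarchical and k-means clusters contain mostly the same observations. The sketch below uses synthetic points (`toy`, hypothetical) and is an illustration only, not part of the original analysis.

```r
# Hypothetical data: two blobs of 20 and 10 points
set.seed(88)
toy <- rbind(
  matrix(rnorm(40, mean = 0), ncol = 2),
  matrix(rnorm(20, mean = 8), ncol = 2)
)

# Cluster the same points with both methods
hierLabels   <- cutree(hclust(dist(toy), method = "ward.D"), k = 2)
kmeansLabels <- kmeans(toy, centers = 2, iter.max = 1000)$cluster

# Rows are hierarchical clusters, columns are k-means clusters
crossTab <- table(Hierarchical = hierLabels, KMeans = kmeansLabels)
print(crossTab)
```

With the airline data, table(hierGroups, kmc$cluster) would give the corresponding 5-by-5 table.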