Background Information on the Dataset

Market segmentation is a strategy that divides a broad target market of customers into smaller, more similar groups, and then designs a marketing strategy specifically for each group. Clustering is a common technique for market segmentation since it automatically finds similar groups given a data set.

In this problem, we’ll see how clustering can be used to find similar groups of customers who belong to an airline’s frequent flyer program. The airline is trying to learn more about its customers so that it can target different customer segments with different types of mileage offers.

The file AirlinesCluster.csv contains information on 3,999 members of the frequent flyer program. This data comes from the textbook “Data Mining for Business Intelligence,” by Galit Shmueli, Nitin R. Patel, and Peter C. Bruce. For more information, see the website for the book.

There are seven variables in the dataset: Balance, QualMiles, BonusMiles, BonusTrans, FlightMiles, FlightTrans, and DaysSinceEnroll.

R Exercises

Normalizing the Data

Read the dataset AirlinesCluster.csv into R and call it “airlines”.

# Load in the dataset
airlines = read.csv("AirlinesCluster.csv")

Looking at the summary of airlines, which TWO variables have (on average) the smallest values?

# Load knitr for kable(), then obtain a summary of the dataset
library(knitr)
z = summary(airlines)
kable(z)
Statistic   Balance  QualMiles  BonusMiles  BonusTrans  FlightMiles  FlightTrans  DaysSinceEnroll
Min.              0        0.0           0         0.0          0.0        0.000                2
1st Qu.       18528        0.0        1250         3.0          0.0        0.000             2330
Median        43097        0.0        7171        12.0          0.0        0.000             4096
Mean          73601      144.1       17145        11.6        460.1        1.374             4119
3rd Qu.       92404        0.0       23801        17.0        311.0        1.000             5790
Max.        1704838    11148.0      263685        86.0      30817.0       53.000             8296

BonusTrans and FlightTrans have the smallest values: their means are 11.6 and 1.374, whereas every other variable has a mean in the hundreds or higher.

Which TWO variables have (on average) the largest values?

# Obtain a summary of the dataset
summary(airlines)

For the largest values, Balance and BonusMiles have average values in the tens of thousands.

In this problem, we will normalize our data before we run the clustering algorithms. Why is it important to normalize the data before clustering?

If we don’t normalize the data, the clustering will be dominated by the variables that are on a larger scale.
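The effect of scale on distances can be seen in a small sketch. The three rows below are made-up values rather than records from AirlinesCluster.csv, and base R's scale() is used as an assumed stand-in for the normalization step.

```r
# Three hypothetical customers; the numbers are invented for illustration.
toy <- data.frame(
  Balance    = c(20000, 21000, 90000),
  BonusTrans = c(2, 40, 3)
)

# Raw Euclidean distances: Balance (tens of thousands) swamps BonusTrans (tens).
rawDist <- as.matrix(dist(toy))

# Distances after centering and scaling each column to mean 0 and sd 1.
normDist <- as.matrix(dist(scale(toy)))

print(round(rawDist, 1))
print(round(normDist, 2))
```

In the raw distances, customers 1 and 2 look close only because their balances are similar, even though their transaction counts differ widely. After scaling, both variables contribute on a comparable footing.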

Caret

Let’s go ahead and normalize our data. You can normalize the variables in a data frame by using the preProcess function in the “caret” package. You should already have this package installed from Week 4; if not, install it with install.packages("caret"). Then load the package with library(caret).

# Loading the caret package
library(caret)

Now, create a normalized data frame called “airlinesNorm” by running the following commands:

# Preprocess the data
preproc = preProcess(airlines)

airlinesNorm = predict(preproc, airlines)

The first command estimates the normalization parameters (the mean and standard deviation of each variable), and the second command applies them to perform the normalization. If you look at the summary of airlinesNorm, you should see that all of the variables now have mean zero. You can also check that each variable has standard deviation 1 by applying the sd() function to each column, for example with sapply(airlinesNorm, sd).
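A minimal sketch of that check follows. It runs on a hypothetical data frame (toy) instead of the airlines data, and it uses base R's scale() on the assumption that this matches preProcess() with its default centering and scaling.

```r
# Hypothetical stand-in data; the real exercise uses the airlines data frame.
set.seed(1)
toy <- data.frame(
  Balance    = rexp(200, rate = 1 / 70000),
  BonusTrans = rpois(200, lambda = 12)
)

# scale() centers each column and divides it by its standard deviation.
toyNorm <- as.data.frame(scale(toy))

# mean() and sd() work on one vector at a time, so apply them per column.
colMeansNorm <- sapply(toyNorm, mean)
colSdsNorm   <- sapply(toyNorm, sd)

print(round(colMeansNorm, 10))
print(colSdsNorm)
```

The same sapply() calls should work unchanged on airlinesNorm once it has been created.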

In the normalized data, which variable has the largest maximum value?

# Obtain a summary of the data
z = summary(airlinesNorm)
kable(z)
Statistic   Balance  QualMiles  BonusMiles  BonusTrans  FlightMiles  FlightTrans  DaysSinceEnroll
Min.        -0.7303    -0.1863     -0.7099    -1.20805      -0.3286     -0.36212         -1.99336
1st Qu.     -0.5465    -0.1863     -0.6581    -0.89568      -0.3286     -0.36212         -0.86607
Median      -0.3027    -0.1863     -0.4130     0.04145      -0.3286     -0.36212         -0.01092
Mean         0.0000     0.0000      0.0000     0.00000       0.0000      0.00000          0.00000
3rd Qu.      0.1866    -0.1863      0.2756     0.56208      -0.1065     -0.09849          0.80960
Max.        16.1868    14.2231     10.2083     7.74673      21.6803     13.61035          2.02284

FlightMiles now has the largest maximum value.

In the normalized data, which variable has the smallest minimum value?

# Obtain a summary of the data
z = summary(airlinesNorm)
kable(z)

DaysSinceEnroll now has the smallest minimum value.

Hierarchical Clustering

Compute the distances between data points (using Euclidean distance) and then run the hierarchical clustering algorithm (using method="ward.D") on the normalized data. It may take a few minutes for the commands to finish, since the dataset has a large number of observations for hierarchical clustering.

# Hierarchical clustering algorithm
airlinesNormDist = dist(airlinesNorm, method="euclidean")
airlinesNormHierClust = hclust(airlinesNormDist, method="ward.D")
plot(airlinesNormHierClust)

According to the dendrogram, which of the following is NOT a good choice for the number of clusters?

If you slide a horizontal line down the dendrogram, there is a long vertical stretch over which the line crosses 2 clusters, 3 clusters, or 7 clusters. However, it is hard to place the horizontal line so that it crosses exactly 6 clusters, because the range of heights that yields 6 clusters is very narrow. This means that 6 clusters is probably not a good choice.
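The visual check on the dendrogram can also be approximated numerically by looking at the gaps between successive merge heights. The sketch below uses hypothetical two-dimensional data with three well-separated groups, not the airlines data, and the gap heuristic is one possible reading of the dendrogram rather than a definitive rule.

```r
set.seed(42)
# Hypothetical data: 30 points around each of three made-up centers.
centers <- rbind(c(0, 0), c(10, 0), c(5, 9))
toy <- do.call(rbind, lapply(1:3, function(i) {
  cbind(rnorm(30, mean = centers[i, 1]), rnorm(30, mean = centers[i, 2]))
}))

hc <- hclust(dist(toy, method = "euclidean"), method = "ward.D")

# Sort the n - 1 merge heights. A cut that yields k clusters lies between
# sorted heights n - k and n - k + 1, so that difference is the vertical
# room a horizontal line has while crossing exactly k branches.
h <- sort(hc$height)
n <- nrow(toy)
k <- 2:8
gap <- h[n - k + 1] - h[n - k]
names(gap) <- k

print(round(gap, 2))
bestK <- as.integer(names(which.max(gap)))
print(bestK)
```

A large gap for a given k corresponds to a long stretch of the dendrogram over which the line crosses k clusters; a gap near zero corresponds to a choice that is hard to hit, as with 6 clusters in the exercise.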

Suppose that after looking at the dendrogram and discussing with the marketing department, the airline decides to proceed with 5 clusters. Divide the data points into 5 clusters by using the cutree function.

# Plot a dendrogram and divide it into 5 clusters
plot(airlinesNormHierClust)
rect.hclust(airlinesNormHierClust, k = 5, border = "red")

hierGroups = cutree(airlinesNormHierClust, k = 5)
# Subset the clusters into 5 different groups
HierCluster1 = subset(airlinesNorm, hierGroups == 1)
HierCluster2 = subset(airlinesNorm, hierGroups == 2)
HierCluster3 = subset(airlinesNorm, hierGroups == 3)
HierCluster4 = subset(airlinesNorm, hierGroups == 4)
HierCluster5 = subset(airlinesNorm, hierGroups == 5)

How many data points are in Cluster 1?

# Output the number of rows
nrow(HierCluster1)
## [1] 776

Cluster 1 has 776 data points.
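As a possible shortcut, the size of every cluster can be read in one step by tabulating the vector returned by cutree, without building a subset per cluster. The sketch below uses a small hypothetical matrix (toy) in place of airlinesNorm.

```r
set.seed(7)
toy <- matrix(rnorm(100), ncol = 2)   # hypothetical data, 50 rows
hc <- hclust(dist(toy), method = "ward.D")
groups <- cutree(hc, k = 5)

# table() counts how many observations carry each cluster label.
clusterSizes <- table(groups)
print(clusterSizes)
```

Applied to the exercise, table(hierGroups) would be the equivalent call.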

Compare the average values in each of the variables

Now, use tapply to compare the average values in each of the variables for the 5 clusters (the centroids of the clusters). You may want to compute the average values of the unnormalized data so that it is easier to interpret. You can do this for the variable “Balance” with the following command:

# Mean Balance within each hierarchical cluster (unnormalized data)
tapply(airlines$Balance, hierGroups, mean)
##         1         2         3         4         5 
##  57866.90 110669.27 198191.57  52335.91  36255.91

# The same means for each variable in turn, displayed with kable()
z = tapply(airlines$Balance, hierGroups, mean)
kable(z)
Cluster          x
1         57866.90
2        110669.27
3        198191.57
4         52335.91
5         36255.91

z = tapply(airlines$QualMiles, hierGroups, mean)
kable(z)
Cluster             x
1           0.6443299
2        1065.9826590
3          30.3461538
4           4.8479263
5           2.5111773

z = tapply(airlines$BonusMiles, hierGroups, mean)
kable(z)
Cluster          x
1        10360.124
2        22881.763
3        55795.860
4        20788.766
5         2264.788

z = tapply(airlines$BonusTrans, hierGroups, mean)
kable(z)
Cluster          x
1        10.823454
2        18.229287
3        19.663968
4        17.087558
5         2.973174

z = tapply(airlines$FlightMiles, hierGroups, mean)
kable(z)
Cluster          x
1         83.18428
2       2613.41811
3        327.67611
4        111.57373
5        119.32191

z = tapply(airlines$FlightTrans, hierGroups, mean)
kable(z)
Cluster          x
1        0.3028351
2        7.4026975
3        1.0688259
4        0.3444700
5        0.4388972

z = tapply(airlines$DaysSinceEnroll, hierGroups, mean)
kable(z)
Cluster          x
1         6235.365
2         4402.414
3         5615.709
4         2840.823
5         3060.081

Compared to the other clusters, Cluster 1 has the largest average values in which variables (if any)? Select all that apply.

Cluster 1 has the largest average value in DaysSinceEnroll.

How would you describe the customers in Cluster 1?

Customers in Cluster 1 are infrequent but loyal customers.

Compared to the other clusters, Cluster 2 has the largest average values in which variables (if any)? Select all that apply.

Cluster 2 has the largest average values in the variables QualMiles, FlightMiles and FlightTrans. This cluster also has relatively large values in BonusTrans and Balance.

How would you describe the customers in Cluster 2?

Cluster 2 contains customers with a large amount of miles, mostly accumulated through flight transactions.

Compared to the other clusters, Cluster 3 has the largest average values in which variables (if any)? Select all that apply.

Cluster 3 has the largest values in Balance, BonusMiles, and BonusTrans. While it also has relatively large values in other variables, these are the three for which it has the largest values.

How would you describe the customers in Cluster 3?

Cluster 3 mostly contains customers who have a lot of miles and who have earned them mostly through bonus transactions.

Compared to the other clusters, Cluster 4 has the largest average values in which variables (if any)? Select all that apply.

Cluster 4 does not have the largest values in any of the variables.

How would you describe the customers in Cluster 4?

Cluster 4 customers have the smallest average value in DaysSinceEnroll, so they are relatively new customers, but they are already accumulating a reasonable number of miles.

Compared to the other clusters, Cluster 5 has the largest average values in which variables (if any)? Select all that apply.

Cluster 5 does not have the largest values in any of the variables.

How would you describe the customers in Cluster 5?

Cluster 5 customers have lower than average values in all variables.
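The comparison in this section relies on one tapply call per variable. As a sketch of an alternative, aggregate() can compute all cluster centroids in one call, and which.max() can then report which cluster has the largest mean for each variable. The data frame toy and its two groups are hypothetical and stand in for airlines and hierGroups.

```r
set.seed(3)
# Hypothetical unnormalized data with two clearly separated groups.
toy <- data.frame(
  Balance    = c(rnorm(20, 50000, 5000), rnorm(20, 150000, 5000)),
  BonusTrans = c(rnorm(20, 5, 1), rnorm(20, 20, 1))
)
groups <- cutree(hclust(dist(scale(toy)), method = "ward.D"), k = 2)

# One row per cluster, one column per variable, on the original scale.
centroids <- aggregate(toy, by = list(Cluster = groups), FUN = mean)
print(centroids)

# For each variable, the cluster label with the largest mean.
largest <- sapply(centroids[-1], function(col) centroids$Cluster[which.max(col)])
print(largest)
```

For the exercise, the analogous call would be aggregate(airlines, by = list(Cluster = hierGroups), FUN = mean), which should reproduce the per-variable tapply results as columns of a single table.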

K-Means Clustering

Now run the k-means clustering algorithm on the normalized data, again creating 5 clusters. Set the seed to 88 right before running the clustering algorithm, and set the argument iter.max to 1000.

# k-means algorithm
set.seed(88)
kmc = kmeans(airlinesNorm, centers=5, iter.max = 1000)
# Subset into 5 different cluster datasets
KmeansCluster1 = subset(airlinesNorm, kmc$cluster == 1)
KmeansCluster2 = subset(airlinesNorm, kmc$cluster == 2)
KmeansCluster3 = subset(airlinesNorm, kmc$cluster == 3)
KmeansCluster4 = subset(airlinesNorm, kmc$cluster == 4)
KmeansCluster5 = subset(airlinesNorm, kmc$cluster == 5)

How many clusters have more than 1,000 observations?

# Calculates the number of rows in each cluster dataset
nrow(KmeansCluster1)
## [1] 408
nrow(KmeansCluster2)
## [1] 141
nrow(KmeansCluster3)
## [1] 993
nrow(KmeansCluster4)
## [1] 1182
nrow(KmeansCluster5)
## [1] 1275

There are two clusters with more than 1,000 observations: Cluster 4 (1,182) and Cluster 5 (1,275).
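The object returned by kmeans() also stores the cluster sizes directly in its size component, so the counts can be read without subsetting. The sketch below uses hypothetical normalized data (toy), and the threshold of 40 is chosen only to suit that toy data.

```r
set.seed(88)
toy <- scale(matrix(rnorm(400), ncol = 2))   # hypothetical data, 200 rows
km <- kmeans(toy, centers = 5, iter.max = 1000)

# km$size[i] is the number of observations assigned to cluster i.
print(km$size)

# Count the clusters above a size threshold (40 is arbitrary here).
bigClusters <- sum(km$size > 40)
print(bigClusters)
```

In the exercise, kmc$size and sum(kmc$size > 1000) would play the same roles.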

Do you expect Cluster 1 of the K-Means clustering output to necessarily be similar to Cluster 1 of the Hierarchical clustering output?

# Mean of each variable within each k-means cluster (unnormalized data)
z = tapply(airlines$Balance, kmc$cluster, mean)
kable(z)
Cluster          x
1        219161.40
2        174431.51
3         67977.44
4         60166.18
5         32706.67

z = tapply(airlines$QualMiles, kmc$cluster, mean)
kable(z)
Cluster          x
1        539.57843
2        673.16312
3         34.99396
4         55.20812
5        126.46667

z = tapply(airlines$BonusMiles, kmc$cluster, mean)
kable(z)
Cluster          x
1        62474.483
2        31985.085
3        24490.019
4         8709.712
5         3097.478

z = tapply(airlines$BonusTrans, kmc$cluster, mean)
kable(z)
Cluster          x
1        21.524510
2        28.134752
3        18.429003
4         8.362098
5         4.284706

z = tapply(airlines$FlightMiles, kmc$cluster, mean)
kable(z)
Cluster          x
1         623.8725
2        5859.2340
3         289.4713
4         203.2589
5         181.4698

z = tapply(airlines$FlightTrans, kmc$cluster, mean)
kable(z)
Cluster          x
1        1.9215686
2       17.0000000
3        0.8851964
4        0.6294416
5        0.5403922

z = tapply(airlines$DaysSinceEnroll, kmc$cluster, mean)
kable(z)
Cluster          x
1         5605.051
2         4684.901
3         3416.783
4         6109.540
5         2281.055

The clusters are not displayed in a meaningful order, so while there may be a cluster produced by the k-means algorithm that is similar to Cluster 1 produced by the Hierarchical method, it will not necessarily be shown first.
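One way to see how the two sets of labels line up is to cross-tabulate them. The sketch below does this on hypothetical data with three well-separated groups; the nstart argument is an addition that is not part of the exercise.

```r
set.seed(88)
# Hypothetical data: 30 points around each of three made-up centers.
centers <- rbind(c(0, 0), c(10, 0), c(5, 9))
toy <- do.call(rbind, lapply(1:3, function(i) {
  cbind(rnorm(30, mean = centers[i, 1]), rnorm(30, mean = centers[i, 2]))
}))

hierLabels <- cutree(hclust(dist(toy), method = "ward.D"), k = 3)
kmLabels   <- kmeans(toy, centers = 3, nstart = 20)$cluster

# Rows are hierarchical labels, columns are k-means labels.
crossTab <- table(Hierarchical = hierLabels, KMeans = kmLabels)
print(crossTab)
```

Each row with a single large cell identifies a pair of matching clusters, whatever numbers the two algorithms happened to assign. In the exercise, table(hierGroups, kmc$cluster) would give the corresponding view.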