Analytics Edge: Unit 6 - Market Segmentation for Airlines

R Exercises

Normalizing the Data

Read the dataset AirlinesCluster.csv into R and call it “airlines”.

# Load in the dataset
airlines = read.csv("AirlinesCluster.csv")

Looking at the summary of airlines, which TWO variables have (on average) the smallest values?

# Obtain a summary of the dataset (kable() comes from the knitr package)
library(knitr)
z = summary(airlines)
kable(z)

Balance	QualMiles	BonusMiles	BonusTrans	FlightMiles	FlightTrans	DaysSinceEnroll
Min. : 0	Min. : 0.0	Min. : 0	Min. : 0.0	Min. : 0.0	Min. : 0.000	Min. : 2
1st Qu.: 18528	1st Qu.: 0.0	1st Qu.: 1250	1st Qu.: 3.0	1st Qu.: 0.0	1st Qu.: 0.000	1st Qu.:2330
Median : 43097	Median : 0.0	Median : 7171	Median :12.0	Median : 0.0	Median : 0.000	Median :4096
Mean : 73601	Mean : 144.1	Mean : 17145	Mean :11.6	Mean : 460.1	Mean : 1.374	Mean :4119
3rd Qu.: 92404	3rd Qu.: 0.0	3rd Qu.: 23801	3rd Qu.:17.0	3rd Qu.: 311.0	3rd Qu.: 1.000	3rd Qu.:5790
Max. :1704838	Max. :11148.0	Max. :263685	Max. :86.0	Max. :30817.0	Max. :53.000	Max. :8296

BonusTrans (mean 11.6) and FlightTrans (mean 1.374) have the smallest average values. Every other variable has an average in the hundreds or higher.

Which TWO variables have (on average) the largest values?

# Obtain a summary of the dataset
summary(airlines)

For the largest values, Balance and BonusMiles have average values in the tens of thousands.

In this problem, we will normalize our data before we run the clustering algorithms. Why is it important to normalize the data before clustering?

If we don’t normalize the data, the clustering will be dominated by the variables that are on a larger scale.
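
As a rough illustration of this point, the sketch below uses three made-up customers (the values are hypothetical, not taken from AirlinesCluster.csv). On the raw scale, a difference of 1000 in Balance outweighs a difference of 50 flight transactions. After centering and scaling, both differences contribute equally.

```r
# Three hypothetical customers; the values are invented for illustration.
customers <- data.frame(
  Balance     = c(20000, 21000, 20000),
  FlightTrans = c(0, 0, 50)
)
rownames(customers) <- c("A", "B", "C")

# Raw Euclidean distances: A-B differ by 1000 in Balance,
# A-C differ by 50 in FlightTrans.
raw_dist <- as.matrix(dist(customers))

# Distances after centering and scaling each column.
norm_dist <- as.matrix(dist(scale(customers)))

print(round(raw_dist, 2))
print(round(norm_dist, 2))
```

On the raw data, A looks far closer to C than to B only because Balance is measured in much larger units.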

Caret

Let’s go ahead and normalize our data. You can normalize the variables in a data frame by using the preProcess function in the “caret” package. You should already have this package installed from Week 4, but if not, go ahead and install it with install.packages("caret"). Then load the package with library(caret).

# Loading the caret package
library(caret)

Now, create a normalized data frame called “airlinesNorm” by running the following commands:

# Preprocess the data
preproc = preProcess(airlines)

airlinesNorm = predict(preproc, airlines)

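The first command computes the mean and standard deviation of each variable (by default, preProcess centers and scales), and the second command applies the normalization by subtracting the mean and dividing by the standard deviation. If you look at the summary of airlinesNorm, you should see that all of the variables now have mean zero. You can also confirm that each variable has standard deviation 1 by applying the sd() function to each column.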
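
The check described above can be sketched without the real dataset. The example below uses a made-up data frame named toy and base R's scale() as a stand-in for the center-and-scale step. It is not the caret code itself. Since sd() expects a vector, sapply is used to apply it column by column.

```r
# Hypothetical stand-in for the airlines data frame (made-up values).
set.seed(1)
toy <- data.frame(
  Balance    = runif(100, 0, 100000),
  BonusTrans = rpois(100, 12)
)

# Center and scale each column: (x - mean(x)) / sd(x).
toyNorm <- as.data.frame(scale(toy))

# Column means should be numerically zero and standard deviations one.
print(round(sapply(toyNorm, mean), 10))
print(sapply(toyNorm, sd))
```

With the objects from this exercise, the equivalent check would be sapply(airlinesNorm, sd).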

In the normalized data, which variable has the largest maximum value?

# Obtain a summary of the data
z = summary(airlinesNorm)
kable(z)

Balance	QualMiles	BonusMiles	BonusTrans	FlightMiles	FlightTrans	DaysSinceEnroll
Min. :-0.7303	Min. :-0.1863	Min. :-0.7099	Min. :-1.20805	Min. :-0.3286	Min. :-0.36212	Min. :-1.99336
1st Qu.:-0.5465	1st Qu.:-0.1863	1st Qu.:-0.6581	1st Qu.:-0.89568	1st Qu.:-0.3286	1st Qu.:-0.36212	1st Qu.:-0.86607
Median :-0.3027	Median :-0.1863	Median :-0.4130	Median : 0.04145	Median :-0.3286	Median :-0.36212	Median :-0.01092
Mean : 0.0000	Mean : 0.0000	Mean : 0.0000	Mean : 0.00000	Mean : 0.0000	Mean : 0.00000	Mean : 0.00000
3rd Qu.: 0.1866	3rd Qu.:-0.1863	3rd Qu.: 0.2756	3rd Qu.: 0.56208	3rd Qu.:-0.1065	3rd Qu.:-0.09849	3rd Qu.: 0.80960
Max. :16.1868	Max. :14.2231	Max. :10.2083	Max. : 7.74673	Max. :21.6803	Max. :13.61035	Max. : 2.02284

FlightMiles now has the largest maximum value.

In the normalized data, which variable has the smallest minimum value?

# Obtain a summary of the data
z = summary(airlinesNorm)
kable(z)

DaysSinceEnroll now has the smallest minimum value.

Hierarchical Clustering

Compute the distances between data points (using Euclidean distance) and then run the hierarchical clustering algorithm (using method = "ward.D") on the normalized data. The commands may take a few minutes to finish, because the dataset has a large number of observations for hierarchical clustering.

# Hierarchical clustering algorithm
airlinesNormDist = dist(airlinesNorm, method="euclidean")
airlinesNormHierClust = hclust(airlinesNormDist, method="ward.D")
plot(airlinesNormHierClust)
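
One reason these commands are slow is that dist() stores a value for every pair of observations, so the work grows quadratically with the number of rows. The sketch below shows this on a small made-up matrix. The names toy and n are illustrative only.

```r
# Small made-up matrix: n observations of 7 variables.
set.seed(1)
n <- 200
toy <- matrix(rnorm(n * 7), nrow = n)

d <- dist(toy, method = "euclidean")

# A dist object holds one value per unordered pair of observations.
n_pairs <- n * (n - 1) / 2
print(length(d))
print(n_pairs)
```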

According to the dendrogram, which of the following is NOT a good choice for the number of clusters?

If you move a horizontal line down the dendrogram, there is a wide vertical range over which the line crosses 2 clusters, 3 clusters, or 7 clusters. However, it is hard to find a height at which the line crosses exactly 6 clusters. This means that 6 clusters is probably not a good choice.

Suppose that after looking at the dendrogram and discussing with the marketing department, the airline decides to proceed with 5 clusters. Divide the data points into 5 clusters by using the cutree function.

# Plot a dendrogram and divide it into 5 clusters
plot(airlinesNormHierClust)
rect.hclust(airlinesNormHierClust, k = 5, border = "red")

hierGroups = cutree(airlinesNormHierClust, k = 5)
# Subset the normalized data into one data frame per cluster
HierCluster1 = subset(airlinesNorm, hierGroups == 1)
HierCluster2 = subset(airlinesNorm, hierGroups == 2)
HierCluster3 = subset(airlinesNorm, hierGroups == 3)
HierCluster4 = subset(airlinesNorm, hierGroups == 4)
HierCluster5 = subset(airlinesNorm, hierGroups == 5)

How many data points are in Cluster 1?

# Output the number of rows
nrow(HierCluster1)
## [1] 776

Cluster 1 has 776 data points.

Compare the average values in each of the variables

Now, use tapply to compare the average values in each of the variables for the 5 clusters (the centroids of the clusters). You may want to compute the average values of the unnormalized data so that it is easier to interpret. You can do this for the variable “Balance” with the following command:

# Compute the mean of Balance within each hierarchical cluster
tapply(airlines$Balance, hierGroups, mean)
##         1         2         3         4         5 
##  57866.90 110669.27 198191.57  52335.91  36255.91

# Compute the mean of each variable within each hierarchical cluster
z = tapply(airlines$Balance, hierGroups, mean)
kable(z)

Cluster	Mean Balance
1	57866.90
2	110669.27
3	198191.57
4	52335.91
5	36255.91

z = tapply(airlines$QualMiles, hierGroups, mean)
kable(z)

Cluster	Mean QualMiles
1	0.6443299
2	1065.9826590
3	30.3461538
4	4.8479263
5	2.5111773

z = tapply(airlines$BonusMiles, hierGroups, mean)
kable(z)

Cluster	Mean BonusMiles
1	10360.124
2	22881.763
3	55795.860
4	20788.766
5	2264.788

z = tapply(airlines$BonusTrans, hierGroups, mean)
kable(z)

Cluster	Mean BonusTrans
1	10.823454
2	18.229287
3	19.663968
4	17.087558
5	2.973174

z = tapply(airlines$FlightMiles, hierGroups, mean)
kable(z)

Cluster	Mean FlightMiles
1	83.18428
2	2613.41811
3	327.67611
4	111.57373
5	119.32191

z = tapply(airlines$FlightTrans, hierGroups, mean)
kable(z)

Cluster	Mean FlightTrans
1	0.3028351
2	7.4026975
3	1.0688259
4	0.3444700
5	0.4388972

z = tapply(airlines$DaysSinceEnroll, hierGroups, mean)
kable(z)

Cluster	Mean DaysSinceEnroll
1	6235.365
2	4402.414
3	5615.709
4	2840.823
5	3060.081
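
The seven tapply calls above can also be collapsed into a single table of centroids. The sketch below shows one possible way to do that, using sapply over the columns of a made-up data frame. The names toy and toyGroups are hypothetical stand-ins for airlines and hierGroups.

```r
# Made-up stand-ins for the airlines data frame and the cluster labels.
set.seed(1)
toy <- data.frame(
  Balance     = runif(30, 0, 100000),
  FlightTrans = rpois(30, 2)
)
toyGroups <- rep(1:3, each = 10)

# One tapply per column, collected into a clusters-by-variables matrix.
centroids <- sapply(toy, function(col) tapply(col, toyGroups, mean))
print(centroids)
```

With the objects from this exercise, the same pattern would read sapply(airlines, function(col) tapply(col, hierGroups, mean)), giving one row per cluster and one column per variable.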

Compared to the other clusters, Cluster 1 has the largest average values in which variables (if any)? Select all that apply.

Cluster 1 has the largest average value in DaysSinceEnroll.

How would you describe the customers in Cluster 1?

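Customers in Cluster 1 are infrequent but loyal: they have been enrolled the longest on average (about 6,235 days), yet they have the lowest average FlightMiles and FlightTrans.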

Compared to the other clusters, Cluster 2 has the largest average values in which variables (if any)? Select all that apply.

Cluster 2 has the largest average values in the variables QualMiles, FlightMiles and FlightTrans. This cluster also has relatively large values in BonusTrans and Balance.

How would you describe the customers in Cluster 2?

Cluster 2 contains customers with a large amount of miles, mostly accumulated through flight transactions.

Compared to the other clusters, Cluster 3 has the largest average values in which variables (if any)? Select all that apply.

Cluster 3 has the largest values in Balance, BonusMiles, and BonusTrans. While it also has relatively large values in other variables, these are the three for which it has the largest values.

How would you describe the customers in Cluster 3?

Cluster 3 mostly contains customers who have a lot of miles and who have earned them mostly through bonus transactions.

Compared to the other clusters, Cluster 4 has the largest average values in which variables (if any)? Select all that apply.

Cluster 4 does not have the largest values in any of the variables.

How would you describe the customers in Cluster 4?

Cluster 4 customers have the smallest value in DaysSinceEnroll, but they are already accumulating a reasonable number of miles.

Compared to the other clusters, Cluster 5 has the largest average values in which variables (if any)? Select all that apply.

Cluster 5 does not have the largest values in any of the variables.

How would you describe the customers in Cluster 5?

Cluster 5 customers have lower than average values in all variables.

K-Means Clustering

Now run the k-means clustering algorithm on the normalized data, again creating 5 clusters. Set the seed to 88 right before running the clustering algorithm, and set the argument iter.max to 1000.

# k-means algorithm
set.seed(88)
kmc = kmeans(airlinesNorm, centers=5, iter.max = 1000)
# Subset into 5 different cluster datasets
KmeansCluster1 = subset(airlinesNorm, kmc$cluster == 1)

KmeansCluster2 = subset(airlinesNorm, kmc$cluster == 2)

KmeansCluster3 = subset(airlinesNorm, kmc$cluster == 3)

KmeansCluster4 = subset(airlinesNorm, kmc$cluster == 4)

KmeansCluster5 = subset(airlinesNorm, kmc$cluster == 5)

How many clusters have more than 1,000 observations?

# Calculates the number of rows in each cluster dataset
nrow(KmeansCluster1)
## [1] 408
nrow(KmeansCluster2)
## [1] 141
nrow(KmeansCluster3)
## [1] 993
nrow(KmeansCluster4)
## [1] 1182
nrow(KmeansCluster5)
## [1] 1275

Two clusters have more than 1,000 observations: Cluster 4 (1,182) and Cluster 5 (1,275).
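
The cluster sizes can also be read off without creating five subsets. The sketch below runs kmeans on made-up data and counts the labels with table(). The object returned by kmeans also carries a size component with the same counts. The names toy and toyKmc, and the threshold of 100, are illustrative only.

```r
# Made-up data: 300 observations of 2 variables.
set.seed(1)
toy <- matrix(rnorm(300 * 2), ncol = 2)

set.seed(88)
toyKmc <- kmeans(toy, centers = 3, iter.max = 1000)

# Count observations per cluster in two equivalent ways.
sizes <- table(toyKmc$cluster)
print(sizes)
print(toyKmc$size)

# Number of clusters above a chosen size threshold.
n_big <- sum(sizes > 100)
print(n_big)
```

For this exercise, the analogous call would be sum(table(kmc$cluster) > 1000).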

Do you expect Cluster 1 of the K-Means clustering output to necessarily be similar to Cluster 1 of the Hierarchical clustering output?

# Compute the mean of each variable within each k-means cluster
z = tapply(airlines$Balance, kmc$cluster, mean)
kable(z)

Cluster	Mean Balance
1	219161.40
2	174431.51
3	67977.44
4	60166.18
5	32706.67

z = tapply(airlines$QualMiles, kmc$cluster, mean)
kable(z)

Cluster	Mean QualMiles
1	539.57843
2	673.16312
3	34.99396
4	55.20812
5	126.46667

z = tapply(airlines$BonusMiles, kmc$cluster, mean)
kable(z)

Cluster	Mean BonusMiles
1	62474.483
2	31985.085
3	24490.019
4	8709.712
5	3097.478

z = tapply(airlines$BonusTrans, kmc$cluster, mean)
kable(z)

Cluster	Mean BonusTrans
1	21.524510
2	28.134752
3	18.429003
4	8.362098
5	4.284706

z = tapply(airlines$FlightMiles, kmc$cluster, mean)
kable(z)

Cluster	Mean FlightMiles
1	623.8725
2	5859.2340
3	289.4713
4	203.2589
5	181.4698

z = tapply(airlines$FlightTrans, kmc$cluster, mean)
kable(z)

Cluster	Mean FlightTrans
1	1.9215686
2	17.0000000
3	0.8851964
4	0.6294416
5	0.5403922

z = tapply(airlines$DaysSinceEnroll, kmc$cluster, mean)
kable(z)

Cluster	Mean DaysSinceEnroll
1	5605.051
2	4684.901
3	3416.783
4	6109.540
5	2281.055

The clusters are not displayed in a meaningful order, so while there may be a cluster produced by the k-means algorithm that is similar to Cluster 1 produced by the Hierarchical method, it will not necessarily be shown first.
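
One way to see which k-means cluster corresponds to which hierarchical cluster is to cross-tabulate the two sets of labels. The sketch below does this on made-up, well-separated data. The names toy, toyHier, and toyKm are hypothetical, and the label numbering it produces is arbitrary.

```r
# Three made-up groups of 20 points each, centered near 0, 10, and 20.
set.seed(1)
toy <- rbind(
  matrix(rnorm(40, mean = 0), ncol = 2),
  matrix(rnorm(40, mean = 10), ncol = 2),
  matrix(rnorm(40, mean = 20), ncol = 2)
)

# Hierarchical labels, using the same linkage method as above.
toyHier <- cutree(hclust(dist(toy), method = "ward.D"), k = 3)

# K-means labels.
set.seed(88)
toyKm <- kmeans(toy, centers = 3, iter.max = 1000)$cluster

# Rows are hierarchical clusters, columns are k-means clusters.
crosstab <- table(Hierarchical = toyHier, KMeans = toyKm)
print(crosstab)
```

A large count in a cell indicates that the two methods grouped those observations together, whatever numbers the clusters happen to carry. For this exercise, the analogous call would be table(hierGroups, kmc$cluster).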