Source: Analytics Edge Unit 6 Clustering Homework

Techniques: Normalization, colMeans

Market segmentation is a strategy that divides a broad target market of customers into smaller, more similar groups, and then designs a marketing strategy specifically for each group. Clustering is a common technique for market segmentation since it automatically finds similar groups given a data set.

In this problem, we’ll see how clustering can be used to find similar groups of customers who belong to an airline’s frequent flyer program. The airline is trying to learn more about its customers so that it can target different customer segments with different types of mileage offers.

The file AirlinesCluster.csv contains information on 3,999 members of the frequent flyer program. This data comes from the textbook “Data Mining for Business Intelligence,” by Galit Shmueli, Nitin R. Patel, and Peter C. Bruce. For more information, see the website for the book.

There are seven different variables in the dataset, described below:

Balance = number of miles eligible for award travel

QualMiles = number of miles qualifying for TopFlight status

BonusMiles = number of miles earned from non-flight bonus transactions in the past 12 months

BonusTrans = number of non-flight bonus transactions in the past 12 months

FlightMiles = number of flight miles in the past 12 months

FlightTrans = number of flight transactions in the past 12 months

DaysSinceEnroll = number of days since enrolled in the frequent flyer program

Load the data

setwd("C:/Users/jzchen/Documents/Courses/Analytics Edge/Unit_6_Clustering")
airlines <- read.csv("AirlinesCluster.csv")
str(airlines)
## 'data.frame':    3999 obs. of  7 variables:
##  $ Balance        : int  28143 19244 41354 14776 97752 16420 84914 20856 443003 104860 ...
##  $ QualMiles      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ BonusMiles     : int  174 215 4123 500 43300 0 27482 5250 1753 28426 ...
##  $ BonusTrans     : int  1 2 4 1 26 0 25 4 43 28 ...
##  $ FlightMiles    : int  0 0 0 0 2077 0 0 250 3850 1150 ...
##  $ FlightTrans    : int  0 0 0 0 4 0 0 1 12 3 ...
##  $ DaysSinceEnroll: int  7000 6968 7034 6952 6935 6942 6994 6938 6948 6931 ...

NORMALIZING THE DATA

We can normalize the variables in a data frame by using the preProcess function in the “caret” package.

library(caret)
preproc <- preProcess(airlines)
airlinesNorm <- predict(preproc, airlines)
summary(airlinesNorm)
##     Balance          QualMiles         BonusMiles        BonusTrans      
##  Min.   :-0.7303   Min.   :-0.1863   Min.   :-0.7099   Min.   :-1.20805  
##  1st Qu.:-0.5465   1st Qu.:-0.1863   1st Qu.:-0.6581   1st Qu.:-0.89568  
##  Median :-0.3027   Median :-0.1863   Median :-0.4130   Median : 0.04145  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.00000  
##  3rd Qu.: 0.1866   3rd Qu.:-0.1863   3rd Qu.: 0.2756   3rd Qu.: 0.56208  
##  Max.   :16.1868   Max.   :14.2231   Max.   :10.2083   Max.   : 7.74673  
##   FlightMiles       FlightTrans       DaysSinceEnroll   
##  Min.   :-0.3286   Min.   :-0.36212   Min.   :-1.99336  
##  1st Qu.:-0.3286   1st Qu.:-0.36212   1st Qu.:-0.86607  
##  Median :-0.3286   Median :-0.36212   Median :-0.01092  
##  Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.00000  
##  3rd Qu.:-0.1065   3rd Qu.:-0.09849   3rd Qu.: 0.80960  
##  Max.   :21.6803   Max.   :13.61035   Max.   : 2.02284
lapply(airlinesNorm, sd)
## $Balance
## [1] 1
## 
## $QualMiles
## [1] 1
## 
## $BonusMiles
## [1] 1
## 
## $BonusTrans
## [1] 1
## 
## $FlightMiles
## [1] 1
## 
## $FlightTrans
## [1] 1
## 
## $DaysSinceEnroll
## [1] 1
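
The summary and sd output above show every variable with mean 0 and standard deviation 1, which is what centering and scaling produce: subtract each column's mean, then divide by its standard deviation. A minimal base-R sketch of the same transformation follows. The data frame toy is made up for illustration and is not the airline data, and scale() stands in for preProcess so the example runs without caret.

```r
# Hypothetical toy data standing in for two of the airline variables
toy <- data.frame(Balance = c(100, 200, 300, 400),
                  FlightTrans = c(0, 1, 1, 6))

# z-score each column by hand: (x - mean) / sd
toyNorm <- as.data.frame(lapply(toy, function(x) (x - mean(x)) / sd(x)))

# scale() performs the same centering and scaling in one call
toyScaled <- as.data.frame(scale(toy))

colMeans(toyNorm)     # approximately 0 for every column
sapply(toyNorm, sd)   # 1 for every column

# The two approaches agree
all.equal(toyNorm, toyScaled, check.attributes = FALSE)
```

Normalizing matters here because Balance is measured in tens of thousands of miles while FlightTrans is a small count, so unscaled Euclidean distances would be dominated by Balance.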

HIERARCHICAL CLUSTERING

distance <- dist(airlinesNorm, method = "euclidean")
airlineHClust <- hclust(distance, method = "ward.D")
plot(airlineHClust)

Suppose that after looking at the dendrogram and discussing with the marketing department, the airline decides to proceed with 5 clusters.

airlineClusters <- cutree(airlineHClust, k = 5)
HClust1 <- subset(airlinesNorm, airlineClusters == 1 )
HClust2 <- subset(airlinesNorm, airlineClusters == 2 )
HClust3 <- subset(airlinesNorm, airlineClusters == 3 )
HClust4 <- subset(airlinesNorm, airlineClusters == 4 )
HClust5 <- subset(airlinesNorm, airlineClusters == 5 )
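
The subset calls above work because cutree returns one integer label per row, in the same order as the rows of the data. The sketch below runs the same dist, hclust, cutree sequence on USArrests, a data set that ships with base R and is used here only as stand-in data, and then counts the cluster sizes with table. The object names and the choice of k = 3 are illustrative assumptions.

```r
# USArrests ships with base R; it is stand-in data, not the airline file
arrestsNorm <- scale(USArrests)

# Same pipeline as above: distances, Ward clustering, cut into k groups
arrestsDist <- dist(arrestsNorm, method = "euclidean")
arrestsHClust <- hclust(arrestsDist, method = "ward.D")
arrestsClusters <- cutree(arrestsHClust, k = 3)

# One label per row, aligned with the original row order
length(arrestsClusters) == nrow(USArrests)

# Number of observations assigned to each cluster
table(arrestsClusters)
```

For the airline data, table(airlineClusters) would give the number of customers in each of the 5 clusters in the same way.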

Compare the average value of each variable across the 5 clusters. We compute the averages on the unnormalized data so that they are easier to interpret.

tapply(airlines$Balance, airlineClusters, mean)
##         1         2         3         4         5 
##  57866.90 110669.27 198191.57  52335.91  36255.91
tapply(airlines$QualMiles, airlineClusters, mean)
##            1            2            3            4            5 
##    0.6443299 1065.9826590   30.3461538    4.8479263    2.5111773
tapply(airlines$BonusMiles, airlineClusters, mean)
##         1         2         3         4         5 
## 10360.124 22881.763 55795.860 20788.766  2264.788
tapply(airlines$BonusTrans, airlineClusters, mean)
##         1         2         3         4         5 
## 10.823454 18.229287 19.663968 17.087558  2.973174
tapply(airlines$FlightMiles, airlineClusters, mean)
##          1          2          3          4          5 
##   83.18428 2613.41811  327.67611  111.57373  119.32191
tapply(airlines$FlightTrans, airlineClusters, mean)
##         1         2         3         4         5 
## 0.3028351 7.4026975 1.0688259 0.3444700 0.4388972
tapply(airlines$DaysSinceEnroll, airlineClusters, mean)
##        1        2        3        4        5 
## 6235.365 4402.414 5615.709 2840.823 3060.081

Instead of using tapply, we could have used colMeans and subset, as follows (the grouping vector is airlineClusters, the output of cutree above):

colMeans(subset(airlines, airlineClusters == 1))
colMeans(subset(airlines, airlineClusters == 2))
colMeans(subset(airlines, airlineClusters == 3))
colMeans(subset(airlines, airlineClusters == 4))
colMeans(subset(airlines, airlineClusters == 5))

Each call returns the seven variable means for one cluster, the same values that the tapply calls above report one variable at a time. This requires only 5 lines of code instead of the 7 above. An even more compact way of finding the centroids is to use the function “split” to first split the data into clusters, and then the function “lapply” to apply the function “colMeans” to each of the clusters:

lapply(split(airlines, airlineClusters), colMeans)
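
To see what the split and colMeans combination returns, here is a small sketch on made-up data. The data frame toy and the label vector toyClusters are hypothetical. Using sapply instead of lapply is an optional variation that collects the centroids into a single matrix, with one column per cluster.

```r
# Hypothetical toy data and cluster labels, for illustration only
toy <- data.frame(Balance = c(10, 20, 30, 40, 50, 60),
                  FlightTrans = c(0, 2, 4, 1, 3, 5))
toyClusters <- c(1, 1, 2, 2, 2, 3)

# List with one named vector of column means per cluster
centroidList <- lapply(split(toy, toyClusters), colMeans)

# Same centroids as a matrix: rows are variables, columns are clusters
centroidMatrix <- sapply(split(toy, toyClusters), colMeans)
centroidMatrix

# Each row of the matrix matches the tapply result for that variable
tapply(toy$Balance, toyClusters, mean)
```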

  • How would you describe the customers in Cluster 1? Infrequent but loyal customers.

  • How would you describe the customers in Cluster 3? Customers who have accumulated a large number of miles, mostly through non-flight transactions.

K-MEANS CLUSTERING

set.seed(88)
airlinesKMC <- kmeans(airlinesNorm, centers = 5)
str(airlinesKMC)
## List of 9
##  $ cluster     : int [1:3999] 4 4 4 4 1 4 3 4 2 3 ...
##  $ centers     : num [1:5, 1:7] 1.4444 1.0005 -0.0558 -0.1333 -0.4058 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:5] "1" "2" "3" "4" ...
##   .. ..$ : chr [1:7] "Balance" "QualMiles" "BonusMiles" "BonusTrans" ...
##  $ totss       : num 27986
##  $ withinss    : num [1:5] 4948 3624 2054 2040 2321
##  $ tot.withinss: num 14987
##  $ betweenss   : num 12999
##  $ size        : int [1:5] 408 141 993 1182 1275
##  $ iter        : int 4
##  $ ifault      : int 0
##  - attr(*, "class")= chr "kmeans"

Subset the data to 5 clusters

airlinesKcluster1 <- subset(airlinesNorm, airlinesKMC$cluster == 1)
airlinesKcluster2 <- subset(airlinesNorm, airlinesKMC$cluster == 2)
airlinesKcluster3 <- subset(airlinesNorm, airlinesKMC$cluster == 3)
airlinesKcluster4 <- subset(airlinesNorm, airlinesKMC$cluster == 4)
airlinesKcluster5 <- subset(airlinesNorm, airlinesKMC$cluster == 5)

Now, compare the cluster centroids to each other, either by dividing the data points into groups and then using tapply, or by looking at the output of kmeansClust$centers, where "kmeansClust" is the name of the output of the kmeans function (airlinesKMC in the code above). Note that kmeansClust$centers reports the centroids of the normalized data. To look at the average values for the unnormalized data, use tapply as we did for hierarchical clustering.
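
Because normalization is a linear transformation, the normalized centroids can also be converted back to the original units by multiplying by each variable's standard deviation and adding its mean. The sketch below illustrates this on USArrests (built-in stand-in data, not the airline file) and checks the result against group means computed on the raw data. It assumes the normalization was done with scale(), which stores the means and standard deviations as attributes; the object names, the seed, and centers = 3 are illustrative choices.

```r
set.seed(1)

# Normalize the stand-in data and run k-means on it
arrestsNorm <- scale(USArrests)
arrestsKMC <- kmeans(arrestsNorm, centers = 3)

# Means and standard deviations that scale() used for each column
colMean <- attr(arrestsNorm, "scaled:center")
colSd <- attr(arrestsNorm, "scaled:scale")

# Undo the normalization: multiply each column by its sd, then add its mean
centersOrig <- sweep(sweep(arrestsKMC$centers, 2, colSd, "*"), 2, colMean, "+")
centersOrig

# Same centroids computed directly as group means of the raw data
centersCheck <- sapply(USArrests,
                       function(x) tapply(x, arrestsKMC$cluster, mean))
all.equal(centersOrig, centersCheck, check.attributes = FALSE)
```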

Do you expect Cluster 1 of the K-Means clustering output to necessarily be similar to Cluster 1 of the Hierarchical clustering output?

tapply(airlines$Balance, airlinesKMC$cluster, mean)
##         1         2         3         4         5 
## 219161.40 174431.51  67977.44  60166.18  32706.67
tapply(airlines$QualMiles, airlinesKMC$cluster, mean)
##         1         2         3         4         5 
## 539.57843 673.16312  34.99396  55.20812 126.46667
tapply(airlines$BonusMiles, airlinesKMC$cluster, mean)
##         1         2         3         4         5 
## 62474.483 31985.085 24490.019  8709.712  3097.478
tapply(airlines$BonusTrans, airlinesKMC$cluster, mean)
##         1         2         3         4         5 
## 21.524510 28.134752 18.429003  8.362098  4.284706
tapply(airlines$FlightMiles, airlinesKMC$cluster, mean)
##         1         2         3         4         5 
##  623.8725 5859.2340  289.4713  203.2589  181.4698
tapply(airlines$FlightTrans, airlinesKMC$cluster, mean)
##          1          2          3          4          5 
##  1.9215686 17.0000000  0.8851964  0.6294416  0.5403922
tapply(airlines$DaysSinceEnroll, airlinesKMC$cluster, mean)
##        1        2        3        4        5 
## 5605.051 4684.901 3416.783 6109.540 2281.055

No. The cluster numbers are arbitrary labels assigned by each algorithm, so while k-means may produce a cluster similar to Cluster 1 from the hierarchical method, that cluster will not necessarily be numbered 1.
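
One way to see how the two sets of labels correspond is to cross-tabulate them. The sketch below does this on USArrests (built-in stand-in data); the object names, the seed, and the choice of 3 clusters are assumptions for illustration. A large count in a cell means that the hierarchical cluster in that row and the k-means cluster in that column share many observations.

```r
set.seed(1)
arrestsNorm <- scale(USArrests)

# Labels from each method on the same normalized data
hcLabels <- cutree(hclust(dist(arrestsNorm), method = "ward.D"), k = 3)
kmLabels <- kmeans(arrestsNorm, centers = 3)$cluster

# Rows: hierarchical clusters; columns: k-means clusters
crossTab <- table(Hierarchical = hcLabels, KMeans = kmLabels)
crossTab

# For each hierarchical cluster, the k-means cluster sharing the most members
bestMatch <- apply(crossTab, 1, which.max)
bestMatch
```

For the airline data, the equivalent call would be table(airlineClusters, airlinesKMC$cluster).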