Problem 4 - Airline Frequent Fliers

A. Apply hierarchical clustering with Euclidean distance and Ward’s method. Make sure to normalize the data first. How many clusters appear?

Read the file

input <- read.csv("G:/My Drive/MSDA/DATA 610 Big Data Analytics/Rdata/Rdata/EastWestAirlinesCluster.csv", header = TRUE)

Exclude unwanted columns

mydata <- input[1:3999, 2:12]

Normalize the data

normalized_data<- scale(mydata)
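As a quick check of what scale() does, the sketch below applies it to a small made-up data frame. The values and the names toy and toy_scaled are illustrative assumptions, not taken from the airline file.

```r
# Toy stand-in for mydata (hypothetical values, not from EastWestAirlinesCluster.csv)
toy <- data.frame(
  Balance     = c(28143, 19244, 41354, 14776, 97752),
  Bonus_trans = c(1, 2, 4, 1, 26)
)

# scale() subtracts each column's mean and divides by its standard deviation
toy_scaled <- scale(toy)

# After scaling, every column has mean 0 and standard deviation 1
print(round(colMeans(toy_scaled), 10))
print(apply(toy_scaled, 2, sd))
```

Running the same two checks on normalized_data should confirm that all eleven variables are on a common scale before distances are computed.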

Calculate the Euclidean distance matrix, then cluster hierarchically using Ward’s method

d <- dist(normalized_data, method = "euclidean") 
fit <- hclust(d, method="ward.D2")

Display the dendrogram and outline the three clusters in blue

plot(fit)
rect.hclust(fit, k=3, border="blue")

Three clusters appear, as can be seen in the dendrogram above.
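With nearly 4,000 leaves the dendrogram can be hard to read, so it may help to back up the visual impression with numbers. The sketch below uses simulated data (an assumption, as are the names sim, sim_fit and sim_groups) to show how cutree() and the merge heights can support the choice of k.

```r
set.seed(1)  # arbitrary seed so the simulated data are reproducible

# Simulated stand-in: three well-separated groups of 20 points in two dimensions
sim <- rbind(
  matrix(rnorm(40, mean = 0,  sd = 0.5), ncol = 2),
  matrix(rnorm(40, mean = 5,  sd = 0.5), ncol = 2),
  matrix(rnorm(40, mean = 10, sd = 0.5), ncol = 2)
)
sim_fit <- hclust(dist(scale(sim)), method = "ward.D2")

# Cut the tree into k = 3 groups and count the members of each
sim_groups <- cutree(sim_fit, k = 3)
print(table(sim_groups))

# The last few merge heights: a large jump suggests a natural place to cut
print(tail(sim_fit$height, 5))
```

The same two calls on the airline fit, table(cutree(fit, k = 3)) and tail(fit$height), would show the cluster sizes and how far apart the final merges are.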

B. What would happen if the data were not normalized?

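If the data weren’t normalized, variables measured on large scales (such as Balance and Bonus_miles, which run into the tens of thousands) would dominate the Euclidean distances, while small-scale variables (such as cc1_miles or Flight_trans_12) would contribute almost nothing. The clusters would then reflect mostly the large-scale variables.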
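A small made-up example illustrates the effect. The three customers A, B and C and their values are hypothetical, chosen only to make the scale difference visible.

```r
# Three hypothetical customers: A and B differ mainly in Bonus_trans,
# A and C differ mainly in Balance
toy <- data.frame(
  Balance     = c(50000, 51000, 90000),
  Bonus_trans = c(1, 40, 2)
)
rownames(toy) <- c("A", "B", "C")

# Raw distances: the A-B gap is essentially the 1,000-mile Balance difference,
# and the 39-transaction difference barely registers
print(dist(toy))

# Scaled distances: both columns are in standard-deviation units,
# so the Bonus_trans difference between A and B now counts
print(dist(scale(toy)))
```

On the raw scale A and B look like near neighbours compared with C; after scaling, the two gaps are of similar size.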

C. Compare the cluster centroid to characterize the different clusters, and try to give each cluster a label.

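# Distance matrix on the normalized data
d <- dist(normalized_data, method = "euclidean")
# Note: this step uses complete linkage rather than the Ward linkage from part A,
# so the centroids below describe the complete-linkage solution
fit <- hclust(d, method = "complete")
options(scipen = 99999999)  # print numbers in fixed rather than scientific notation

# Assign each record to one of three clusters
groups <- cutree(fit, k = 3)

# Mean of every variable for the rows assigned to cluster i
clust.centroid <- function(i, dat, groups) {
  ind <- (groups == i)
  colMeans(dat[ind, ])
}
# One column per cluster, reported in the original (unnormalized) units
sapply(unique(groups), clust.centroid, mydata, groups)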
##                            [,1]           [,2]      [,3]
## Balance           73299.6959799 138061.4000000 131999.50
## Qual_miles          144.1567839     78.8000000    347.00
## cc1_miles             2.0537688      3.4666667      2.50
## cc2_miles             1.0145729      1.0000000      1.00
## cc3_miles             1.0007538      4.0666667      1.00
## Bonus_miles       16806.7298995  93927.8666667  65634.25
## Bonus_trans          11.4819095     28.0666667     69.25
## Flight_miles_12mo   440.2821608    506.6666667  19960.00
## Flight_trans_12       1.3246231      1.6000000     49.25
## Days_since_enroll  4118.6206030   4613.8666667   2200.25
## Award.                0.3690955      0.5333333      1.00

1. Cluster 1 – Lowest Bonus_miles and fewest non-flight bonus transactions (Bonus_trans).
2. Cluster 2 – Highest miles earned from non-flight bonus transactions (Bonus_miles) and highest number of miles eligible for award travel (Balance).
3. Cluster 3 – Frequent fliers who enrolled more recently than the first two clusters (highest Flight_trans_12 and Flight_miles_12mo, lowest Days_since_enroll).
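The centroid table can also be produced with aggregate(), which returns one row per cluster. The sketch below is only an illustration on made-up data; toy, toy_groups and centroids are hypothetical names.

```r
# Hypothetical stand-ins for mydata and groups
toy <- data.frame(
  Balance     = c(100, 200, 300, 1000, 1200),
  Bonus_trans = c(1, 3, 2, 10, 14)
)
toy_groups <- c(1, 1, 1, 2, 2)

# Mean of every column within each cluster, one row per cluster
centroids <- aggregate(toy, by = list(cluster = toy_groups), FUN = mean)
print(centroids)

# Cluster sizes help judge how much weight each centroid deserves
print(table(toy_groups))
```

Printing the cluster sizes next to the centroids is worthwhile because a centroid based on a handful of records is much less reliable as a customer profile than one based on thousands.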

D. To check the stability of clusters, remove a random 5% of the data (by taking a random sample of 95% of the records), and repeat the analysis. Does the same picture emerge?

Outline the three clusters in blue

set.seed(1)  # arbitrary seed so the 95% sample can be reproduced
# Keep the sampled row indices so the centroids are computed on the same rows
idx <- sample(nrow(mydata), floor(nrow(mydata) * 0.95))
sample_95 <- mydata[idx, ]
normalized_data_95 <- scale(sample_95)
d <- dist(normalized_data_95, method = "euclidean")
fit <- hclust(d, method = "ward.D2")
plot(fit)
rect.hclust(fit, k = 3, border = "blue")

# Re-cut the new tree; reusing groups from part C would only reproduce the part C centroids
groups_95 <- cutree(fit, k = 3)
clust.centroid <- function(i, dat, groups) {
  ind <- (groups == i)
  colMeans(dat[ind, ])
}
sapply(unique(groups_95), clust.centroid, sample_95, groups_95)

Using a random sample of 95% of the records changes the clusters, as can be seen above. Because the sample is random, the exact result varies from run to run unless a seed is set.
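One way to make the stability comparison concrete is to cross-tabulate the cluster labels from the full data against those from the 95% sample for the same rows. The sketch below does this on simulated data. The data, the seed, and the names sim, idx, full_groups, sub_groups and agreement are all assumptions for illustration, and k = 2 is used only to keep the example small.

```r
set.seed(42)  # arbitrary seed, assumed here only for reproducibility

# Simulated stand-in for mydata: two well-separated groups of 50 rows
sim <- rbind(
  matrix(rnorm(100, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(100, mean = 6, sd = 0.5), ncol = 2)
)

# Cluster the full data
full_groups <- cutree(hclust(dist(scale(sim)), method = "ward.D2"), k = 2)

# Keep the sampled row indices so the two solutions can be matched row by row
idx <- sample(nrow(sim), floor(0.95 * nrow(sim)))
sub_groups <- cutree(hclust(dist(scale(sim[idx, ])), method = "ward.D2"), k = 2)

# Cross-tabulate: in a stable solution each full-data cluster maps to one sample cluster
agreement <- table(full = full_groups[idx], sample = sub_groups)
print(agreement)
```

Cluster numbers may be permuted between the two runs, so what matters is whether each row of the table concentrates in a single column, not whether the labels match.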

E. Use k-means clustering with the number of clusters that you found above. Does the same picture emerge?

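# k-means is run here on the raw, unnormalized data,
# so Balance (the largest-scale variable) dominates the distances
fit <- kmeans(mydata, centers = 3, iter.max = 10)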
t(fit$centers)
##                                1             2            3
## Balance           563233.2222222 36759.9743425 167615.27250
## Qual_miles           451.9259259   104.6905067    266.60375
## cc1_miles              3.2222222     1.7854394      3.01000
## cc2_miles              1.0370370     1.0153945      1.00875
## cc3_miles              1.0370370     1.0076972      1.02750
## Bonus_miles        53100.8518519 11901.0041693  33942.17500
## Bonus_trans           20.4444444     9.9268762     17.23500
## Flight_miles_12mo   1758.1234568   309.5686337    915.15000
## Flight_trans_12        5.7037037     0.9278384      2.67250
## Days_since_enroll   6203.7530864  3854.8710712   4935.15875
## Award.                 0.7901235     0.3316228      0.47875

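The same picture does not emerge with k-means clustering. The k-means call above uses the unnormalized mydata, so the three clusters separate mainly by Balance (about 36,760, 167,615, and 563,233 miles), whereas the hierarchical clusters in part C differ mostly in flight and bonus activity.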
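For a comparison on equal footing with the hierarchical result, k-means could also be run on standardized data and the centers converted back to the original units. The sketch below is not the analysis above. It uses simulated data, an arbitrary seed, k = 2, and hypothetical names (sim, sim_scaled, km, centers_orig) purely to show the mechanics.

```r
set.seed(123)  # arbitrary seed; k-means starts from random centers

# Simulated stand-in for mydata with columns on very different scales
sim <- data.frame(
  Balance     = c(rnorm(30, 20000, 2000), rnorm(30, 100000, 5000)),
  Bonus_trans = c(rnorm(30, 5, 1),        rnorm(30, 25, 2))
)

sim_scaled <- scale(sim)
# nstart reruns the algorithm from several random starts and keeps the best fit
km <- kmeans(sim_scaled, centers = 2, nstart = 25)

# Convert the standardized centers back to the original units for interpretation
centers_orig <- sweep(km$centers, 2, attr(sim_scaled, "scaled:scale"), "*")
centers_orig <- sweep(centers_orig, 2, attr(sim_scaled, "scaled:center"), "+")
print(t(centers_orig))
print(km$size)
```

Setting a seed and using nstart matters because a single k-means run from random starting centers can settle in a poor local optimum and give different clusters on each run.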

F. Which clusters would you target for offers, and what types of offers would you target to customers in that cluster?

I would target k-means cluster 1. Its members have the largest balance (about 563,233 miles), have been enrolled the longest (about 6,204 days), and have the most flight transactions in the past 12 months (about 5.7), so they are the most likely to be frequent fliers.