Problem 4 - Airline Frequent Fliers
A. Apply hierarchical clustering with Euclidean distance and Ward’s method. Make sure to normalize the data first. How many clusters appear?
Read the file
input<- read.csv("G:/My Drive/MSDA/DATA 610 Big Data Analytics/Rdata/Rdata/EastWestAirlinesCluster.csv",header=TRUE)
Exclude unwanted columns
mydata<- input[1:3999,2:12]
Normalize the data
normalized_data<- scale(mydata)
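As a rough illustration of what scale() does in the step above, the sketch below standardizes a small made-up data frame and checks that each column ends up with mean 0 and standard deviation 1. The toy values are hypothetical and are not taken from the airline file.

```r
# Hypothetical stand-in for two columns measured on very different scales
toy <- data.frame(Balance = c(20000, 50000, 120000, 300000),
                  Flight_trans_12 = c(0, 1, 4, 12))

# scale() centers each column on its mean and divides by its standard deviation
toy_scaled <- scale(toy)

# Equivalent manual computation for one column
manual <- (toy$Balance - mean(toy$Balance)) / sd(toy$Balance)

round(colMeans(toy_scaled), 10)   # column means are numerically zero
apply(toy_scaled, 2, sd)          # column standard deviations are one
all.equal(as.numeric(toy_scaled[, "Balance"]), manual)
```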
Calculate the Euclidean distances between records, then cluster hierarchically using Ward’s method
d <- dist(normalized_data, method = "euclidean")
fit <- hclust(d, method="ward.D2")
Display the dendrogram and outline the three clusters in blue
plot(fit)
rect.hclust(fit, k=3, border="blue")
Three clusters appear, as can be seen in the dendrogram above.
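Besides reading the dendrogram by eye, one possible check is to look at the heights of the last few merges and at the sizes of the clusters after cutting the tree. The following is a minimal sketch on simulated data; the toy matrix, fit_toy and groups_toy are hypothetical names, and the three simulated groups are deliberately well separated.

```r
set.seed(1)
# Hypothetical data: three well-separated groups of 20 points in two dimensions
toy <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
             matrix(rnorm(40, mean = 10), ncol = 2),
             matrix(rnorm(40, mean = 20), ncol = 2))

fit_toy <- hclust(dist(scale(toy)), method = "ward.D2")

# Heights of the last five merges, largest first;
# a large jump between consecutive heights suggests where to cut the tree
rev(tail(fit_toy$height, 5))

# Cut into three clusters and count the members of each
groups_toy <- cutree(fit_toy, k = 3)
table(groups_toy)
```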
B. What would happen if the data were not normalized?
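If the data were not normalized, the variables measured on the largest scales (such as Balance and Bonus_miles, which run into the tens of thousands) would dominate the Euclidean distances and therefore the clusters, while small-scale variables (such as cc1_miles or Award) would contribute almost nothing.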
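The effect can be seen on a tiny made-up example. In the sketch below, customers A, B and C are hypothetical: A and B differ only in flight activity, while A and C differ only by a modest amount of balance.

```r
# Hypothetical customers (not from the airline file)
toy <- data.frame(Balance = c(50000, 50000, 52000),
                  Flight_trans_12 = c(0, 20, 0),
                  row.names = c("A", "B", "C"))

d_raw <- as.matrix(dist(toy))          # Euclidean distances on the raw scale
d_std <- as.matrix(dist(scale(toy)))   # distances after standardizing

# Raw scale: the 2,000-mile balance gap (A to C) swamps the 20-flight gap (A to B)
d_raw["A", "B"]   # 20
d_raw["A", "C"]   # 2000

# Standardized scale: both gaps now carry comparable weight
d_std["A", "B"]
d_std["A", "C"]
```

On the raw scale A looks a hundred times closer to B than to C, purely because balance is recorded in larger units; after standardizing, neither variable dominates.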
C. Compare the cluster centroid to characterize the different clusters, and try to give each cluster a label.
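Recluster the normalized data (note that this run uses complete linkage, not the Ward’s method used in part A), cut the tree into three groups, and compute each cluster’s centroid on the original scale
d <- dist(normalized_data, method = "euclidean")
fit <- hclust(d, method="complete")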
options(scipen=99999999)
groups <- cutree(fit, k=3)
# Mean of every column over the records assigned to cluster i
clust.centroid <- function(i, dat, groups) {
  ind <- (groups == i)
  colMeans(dat[ind, ])
}
sapply(unique(groups), clust.centroid, mydata, groups)
##                            [,1]           [,2]      [,3]
## Balance           73299.6959799 138061.4000000 131999.50
## Qual_miles          144.1567839     78.8000000    347.00
## cc1_miles             2.0537688      3.4666667      2.50
## cc2_miles             1.0145729      1.0000000      1.00
## cc3_miles             1.0007538      4.0666667      1.00
## Bonus_miles       16806.7298995  93927.8666667  65634.25
## Bonus_trans          11.4819095     28.0666667     69.25
## Flight_miles_12mo   440.2821608    506.6666667  19960.00
## Flight_trans_12       1.3246231      1.6000000     49.25
## Days_since_enroll  4118.6206030   4613.8666667   2200.25
## Award.                0.3690955      0.5333333      1.00
1. Cluster 1 – Lowest bonus miles (Bonus_miles) and fewest non-flight bonus transactions (Bonus_trans)
2. Cluster 2 – Highest bonus miles (Bonus_miles) and highest balance of miles eligible for award travel (Balance)
3. Cluster 3 – Frequent fliers who enrolled more recently than the first two clusters (highest Flight_trans_12, lowest Days_since_enroll)
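When comparing centroids it can help to look at cluster sizes alongside them, since a centroid averaged over only a handful of records deserves less weight. The sketch below shows one way to get both, using aggregate() as an alternative to a hand-written centroid function. The toy data frame and the names groups_toy and centroids_toy are hypothetical.

```r
set.seed(2)
# Hypothetical stand-in for mydata: a large low-activity group and a small high-activity group
toy <- data.frame(Balance = c(rnorm(30, 50000, 5000), rnorm(10, 150000, 5000)),
                  Flight_trans_12 = c(rpois(30, 1), rpois(10, 15)))

groups_toy <- cutree(hclust(dist(scale(toy)), method = "ward.D2"), k = 2)

# Number of records in each cluster
table(groups_toy)

# Centroids on the original scale, one row per cluster
centroids_toy <- aggregate(toy, by = list(cluster = groups_toy), FUN = mean)
centroids_toy
```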
D. To check the stability of clusters, remove a random 5% of the data (by taking a random sample of 95% of the records), and repeat the analysis. Does the same picture emerge?
Draw a random 95% sample, normalize it, recluster with Ward’s method, and outline the three clusters in blue
normalized_data_95 <- scale(mydata[sample(nrow(mydata), nrow(mydata)*.95), ])
d <- dist(normalized_data_95, method = "euclidean")
fit <- hclust(d, method="ward.D2")
plot(fit)
rect.hclust(fit, k=3, border="blue")
# Mean of every column over the records assigned to cluster i
clust.centroid <- function(i, dat, groups) {
  ind <- (groups == i)
  colMeans(dat[ind, ])
}
sapply(unique(groups), clust.centroid, mydata, groups)
##                            [,1]           [,2]      [,3]
## Balance           73299.6959799 138061.4000000 131999.50
## Qual_miles          144.1567839     78.8000000    347.00
## cc1_miles             2.0537688      3.4666667      2.50
## cc2_miles             1.0145729      1.0000000      1.00
## cc3_miles             1.0007538      4.0666667      1.00
## Bonus_miles       16806.7298995  93927.8666667  65634.25
## Bonus_trans          11.4819095     28.0666667     69.25
## Flight_miles_12mo   440.2821608    506.6666667  19960.00
## Flight_trans_12       1.3246231      1.6000000     49.25
## Days_since_enroll  4118.6206030   4613.8666667   2200.25
## Award.                0.3690955      0.5333333      1.00
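Using a random sample of 95% of the records changes the clusters, as the dendrogram shows. The centroid table above, however, is identical to the one in part C: groups was not recomputed from the new fit and the centroids were taken over the full mydata, so the table does not reflect the 95% sample.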
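For the centroids to reflect the sample, the sampled row indices need to be kept and the cluster assignment recomputed from the new fit. The following is a minimal sketch of that idea on simulated data, not a rerun of the airline analysis; toy, idx, toy_95, groups_95 and agreement are hypothetical names.

```r
set.seed(3)
# Hypothetical stand-in for mydata: three well-separated groups of 50 records
toy <- data.frame(x = c(rnorm(50, 0), rnorm(50, 10), rnorm(50, 20)),
                  y = c(rnorm(50, 0), rnorm(50, 10), rnorm(50, 20)))

clust.centroid <- function(i, dat, groups) {
  colMeans(dat[groups == i, , drop = FALSE])
}

# Clustering of the full data, for comparison
groups_full <- cutree(hclust(dist(scale(toy)), method = "ward.D2"), k = 3)

# Keep the sampled row indices so the same rows are used for the centroids
idx <- sample(nrow(toy), floor(nrow(toy) * 0.95))
toy_95 <- toy[idx, ]
fit_95 <- hclust(dist(scale(toy_95)), method = "ward.D2")
groups_95 <- cutree(fit_95, k = 3)   # assignment recomputed from the new fit

# Centroids of the sample, using the sample's own groups
sapply(sort(unique(groups_95)), clust.centroid, toy_95, groups_95)

# Cross-tabulate old and new labels for the retained rows;
# a stable solution has one dominant cell per row (labels may be permuted)
agreement <- table(full = groups_full[idx], sample = groups_95)
agreement
```

Because cluster numbers are arbitrary, the cross-tabulation is a more reliable stability check than comparing centroid columns position by position.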
E. Use k-means clustering with the number of clusters that you found above. Does the same picture emerge?
fit <- kmeans(mydata, centers=3, iter.max=10)
t(fit$centers)
##                                1             2            3
## Balance           563233.2222222 36759.9743425 167615.27250
## Qual_miles           451.9259259   104.6905067    266.60375
## cc1_miles              3.2222222     1.7854394      3.01000
## cc2_miles              1.0370370     1.0153945      1.00875
## cc3_miles              1.0370370     1.0076972      1.02750
## Bonus_miles        53100.8518519 11901.0041693  33942.17500
## Bonus_trans           20.4444444     9.9268762     17.23500
## Flight_miles_12mo   1758.1234568   309.5686337    915.15000
## Flight_trans_12        5.7037037     0.9278384      2.67250
## Days_since_enroll   6203.7530864  3854.8710712   4935.15875
## Award.                 0.7901235     0.3316228      0.47875
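The same picture does not emerge using k-means clustering; the k-means centers can be seen above. Note that kmeans was run on the unnormalized mydata, so, as described in part B, the clusters are driven largely by Balance, the variable with the largest scale, rather than by the flight and bonus activity that distinguished the hierarchical clusters.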
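For a like-for-like comparison with the hierarchical result, k-means could be run on the normalized data and the centers converted back to the original units for interpretation. The sketch below illustrates this on simulated data; toy, km, centers_orig and check are hypothetical names, and the choices nstart = 25 and iter.max = 50 are assumptions rather than settings from the original analysis.

```r
set.seed(4)
# Hypothetical stand-in for mydata
toy <- data.frame(
  Balance = c(rnorm(40, 40000, 4000), rnorm(40, 160000, 4000), rnorm(40, 500000, 4000)),
  Flight_trans_12 = c(rpois(40, 1), rpois(40, 3), rpois(40, 6))
)
toy_scaled <- scale(toy)

# nstart > 1 reruns k-means from several random starts and keeps the best solution
km <- kmeans(toy_scaled, centers = 3, nstart = 25, iter.max = 50)

# Convert the centers back to the original units: multiply by sd, then add the mean
centers_orig <- sweep(km$centers, 2, attr(toy_scaled, "scaled:scale"), "*")
centers_orig <- sweep(centers_orig, 2, attr(toy_scaled, "scaled:center"), "+")
t(centers_orig)

# The same centroids computed directly from the raw data, as a check
check <- aggregate(toy, by = list(cluster = km$cluster), FUN = mean)
check
```

Setting a seed makes the random starts reproducible, which matters when the cluster numbers are referred to later in the text.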
F. Which clusters would you target for offers, and what types of offers would you target to customers in that cluster?
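I would target k-means cluster 1. Its members have the largest average balance (about 563,000 miles), have been enrolled in the program the longest (about 6,200 days), and have the most flight transactions in the past 12 months (about 5.7), so they are the most likely to be frequent fliers.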