Instruction
Market segmentation is a strategy that divides a broad target market of customers into smaller, more similar groups, and then designs a marketing strategy specifically for each group. Clustering is a common technique for market segmentation since it automatically finds similar groups given a data set.
In this problem, we’ll see how clustering can be used to find similar groups of customers who belong to an airline’s frequent flyer program. The airline is trying to learn more about its customers so that it can target different customer segments with different types of mileage offers.
The file AirlinesCluster.csv contains information on 3,999 members of the frequent flyer program. This data comes from the textbook “Data Mining for Business Intelligence,” by Galit Shmueli, Nitin R. Patel, and Peter C. Bruce. For more information, see the website for the book.
There are seven different variables in the dataset, described below:
- Balance = number of miles eligible for award travel
- QualMiles = number of miles qualifying for TopFlight status
- BonusMiles = number of miles earned from non-flight bonus transactions in the past 12 months
- BonusTrans = number of non-flight bonus transactions in the past 12 months
- FlightMiles = number of flight miles in the past 12 months
- FlightTrans = number of flight transactions in the past 12 months
- DaysSinceEnroll = number of days since enrolled in the frequent flyer program
Load Libraries
library(caret)
library(caTools)
library(flexclust)
library(readr)
library(dplyr)
Glimpse Dataset
airlines = read_csv("AirlinesCluster.csv")
knitr::kable(head(airlines))
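| Balance | QualMiles | BonusMiles | BonusTrans | FlightMiles | FlightTrans | DaysSinceEnroll |
|---|---|---|---|---|---|---|
| 28143 | 0 | 174 | 1 | 0 | 0 | 7000 |
| 19244 | 0 | 215 | 2 | 0 | 0 | 6968 |
| 41354 | 0 | 4123 | 4 | 0 | 0 | 7034 |
| 14776 | 0 | 500 | 1 | 0 | 0 | 6952 |
| 97752 | 0 | 43300 | 26 | 2077 | 4 | 6935 |
| 16420 | 0 | 0 | 0 | 0 | 0 | 6942 |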
knitr::kable(summary(airlines))
| Statistic | Balance | QualMiles | BonusMiles | BonusTrans | FlightMiles | FlightTrans | DaysSinceEnroll |
|---|---|---|---|---|---|---|---|
| Min. | 0 | 0.0 | 0 | 0.0 | 0.0 | 0.000 | 2 |
| 1st Qu. | 18528 | 0.0 | 1250 | 3.0 | 0.0 | 0.000 | 2330 |
| Median | 43097 | 0.0 | 7171 | 12.0 | 0.0 | 0.000 | 4096 |
| Mean | 73601 | 144.1 | 17145 | 11.6 | 460.1 | 1.374 | 4119 |
| 3rd Qu. | 92404 | 0.0 | 23801 | 17.0 | 311.0 | 1.000 | 5790 |
| Max. | 1704838 | 11148.0 | 263685 | 86.0 | 30817.0 | 53.000 | 8296 |
Normalizing the Data
preproc = preProcess(airlines)
airlinesNorm = predict(preproc, airlines)
knitr::kable(summary(airlinesNorm))
| Statistic | Balance | QualMiles | BonusMiles | BonusTrans | FlightMiles | FlightTrans | DaysSinceEnroll |
|---|---|---|---|---|---|---|---|
| Min. | -0.7303 | -0.1863 | -0.7099 | -1.20805 | -0.3286 | -0.36212 | -1.99336 |
| 1st Qu. | -0.5465 | -0.1863 | -0.6581 | -0.89568 | -0.3286 | -0.36212 | -0.86607 |
| Median | -0.3027 | -0.1863 | -0.4130 | 0.04145 | -0.3286 | -0.36212 | -0.01092 |
| Mean | 0.0000 | 0.0000 | 0.0000 | 0.00000 | 0.0000 | 0.00000 | 0.00000 |
| 3rd Qu. | 0.1866 | -0.1863 | 0.2756 | 0.56208 | -0.1065 | -0.09849 | 0.80960 |
| Max. | 16.1868 | 14.2231 | 10.2083 | 7.74673 | 21.6803 | 13.61035 | 2.02284 |
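The normalized summary shows every variable with mean 0, which is what centering and scaling produce: each value has its column mean subtracted and is then divided by the column standard deviation. As a minimal sketch of that arithmetic, the block below uses base R's scale() as a stand-in for caret's preProcess(), on a toy data frame built from two columns of the first five rows printed earlier. The names toy and toyNorm are illustrative, and the equivalence with preProcess() assumes its default centering and scaling.

```r
# Toy data: Balance and DaysSinceEnroll from the first five rows printed earlier.
toy = data.frame(
  Balance = c(28143, 19244, 41354, 14776, 97752),
  DaysSinceEnroll = c(7000, 6968, 7034, 6952, 6935)
)

# z-score each column: subtract the column mean, divide by the column standard deviation.
toyNorm = as.data.frame(scale(toy, center = TRUE, scale = TRUE))

# After normalization each column should have mean ~0 and standard deviation 1.
print(colMeans(toyNorm))
print(sapply(toyNorm, sd))
```

Putting the variables on a common scale matters here because the raw ranges differ by orders of magnitude (Balance reaches 1,704,838 while FlightTrans tops out at 53), so unscaled Euclidean distances would be dominated by Balance.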
Hierarchical Clustering
airDist = dist(airlinesNorm, method = "euclidean")
airHclust = hclust(airDist, method = "ward.D")
clusterGroups = cutree(airHclust, k = 5)
# Mean of each original (unnormalized) variable within each hierarchical cluster
summary_hc = airlines %>% mutate(clusterGroups) %>% group_by(clusterGroups) %>%
summarise(across(everything(), mean))
knitr::kable(summary_hc)
| clusterGroups | Balance | QualMiles | BonusMiles | BonusTrans | FlightMiles | FlightTrans | DaysSinceEnroll |
|---|---|---|---|---|---|---|---|
| 1 | 57866.90 | 0.6443299 | 10360.124 | 10.823454 | 83.18428 | 0.3028351 | 6235.365 |
| 2 | 110669.27 | 1065.9826590 | 22881.763 | 18.229287 | 2613.41811 | 7.4026975 | 4402.414 |
| 3 | 198191.57 | 30.3461538 | 55795.860 | 19.663968 | 327.67611 | 1.0688259 | 5615.709 |
| 4 | 52335.91 | 4.8479263 | 20788.766 | 17.087558 | 111.57373 | 0.3444700 | 2840.823 |
| 5 | 36255.91 | 2.5111773 | 2264.788 | 2.973174 | 119.32191 | 0.4388972 | 3060.081 |
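Cluster means alone do not say how many customers fall into each segment, and that count affects how much an offer aimed at a segment is worth. A minimal sketch of the same dist / hclust / cutree pipeline with a size count added, run on synthetic two-group data rather than the airline file (the names toy, toyGroups, clusterSizes and clusterMeans are illustrative):

```r
# Synthetic data: two well-separated groups of 20 points each (not the airline data).
set.seed(1)
toy = data.frame(
  x = c(rnorm(20, mean = 0), rnorm(20, mean = 10)),
  y = c(rnorm(20, mean = 0), rnorm(20, mean = 10))
)

# Normalize, compute Euclidean distances, and build the tree with Ward's method.
toyNorm = as.data.frame(scale(toy))
toyHclust = hclust(dist(toyNorm, method = "euclidean"), method = "ward.D")

# Cut the tree into two clusters and count the members of each.
toyGroups = cutree(toyHclust, k = 2)
clusterSizes = table(toyGroups)
print(clusterSizes)

# Per-cluster means on the original (unnormalized) scale, as in the table above.
clusterMeans = aggregate(toy, by = list(cluster = toyGroups), FUN = mean)
print(clusterMeans)
```

On the airline data, the analogous call would be table(clusterGroups).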
K-Means Clustering
set.seed(88)
airKmeans = kmeans(airlinesNorm, 5, iter.max = 1000)
airKclust = airKmeans$cluster
# Mean of each original (unnormalized) variable within each k-means cluster
summary_km = airlines %>% mutate(airKclust) %>% group_by(airKclust) %>% summarise(across(everything(), mean))
knitr::kable(summary_km)
| airKclust | Balance | QualMiles | BonusMiles | BonusTrans | FlightMiles | FlightTrans | DaysSinceEnroll |
|---|---|---|---|---|---|---|---|
| 1 | 219161.40 | 539.57843 | 62474.483 | 21.524510 | 623.8725 | 1.9215686 | 5605.051 |
| 2 | 174431.51 | 673.16312 | 31985.085 | 28.134752 | 5859.2340 | 17.0000000 | 4684.901 |
| 3 | 67977.44 | 34.99396 | 24490.019 | 18.429003 | 289.4713 | 0.8851964 | 3416.783 |
| 4 | 60166.18 | 55.20812 | 8709.712 | 8.362098 | 203.2589 | 0.6294416 | 6109.540 |
| 5 | 32706.67 | 126.46667 | 3097.478 | 4.284706 | 181.4698 | 0.5403922 | 2281.055 |
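Cluster numbers are arbitrary labels, so cluster 1 from the hierarchical method need not correspond to cluster 1 from k-means. One way to see which clusters line up is to cross-tabulate the two assignments. The sketch below does this on synthetic data, not the airline file. The names hcGroups, kmGroups and crossTab are illustrative, and nstart = 10 is an added assumption that is not used in the analysis above.

```r
# Synthetic data: two well-separated groups of 20 points each (not the airline data).
set.seed(88)
toy = data.frame(
  x = c(rnorm(20, mean = 0), rnorm(20, mean = 10)),
  y = c(rnorm(20, mean = 0), rnorm(20, mean = 10))
)
toyNorm = scale(toy)

# Hierarchical assignment (Ward's method) and k-means assignment for the same rows.
hcGroups = cutree(hclust(dist(toyNorm), method = "ward.D"), k = 2)
kmGroups = kmeans(toyNorm, centers = 2, nstart = 10)$cluster

# Rows: hierarchical clusters; columns: k-means clusters.
# A large count in a cell means those two clusters mostly contain the same points.
crossTab = table(hierarchical = hcGroups, kmeans = kmGroups)
print(crossTab)
```

For the airline data the analogous call would be table(clusterGroups, airKclust).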