rm(list=ls(all.names=TRUE))
Sys.setlocale("LC_ALL","C")
[1] "C"
options(digits=4, scipen=12)
library(dplyr)
library(caret)
Read the dataset AirlinesCluster.csv into R and call it "airlines" (the code below uses the shorter name A). Look at the summary of airlines to answer the questions that follow.
A = read.csv('data/AirlinesCluster.csv')
summary(A)
Balance QualMiles BonusMiles BonusTrans FlightMiles FlightTrans
Min. : 0 Min. : 0 Min. : 0 Min. : 0.0 Min. : 0 Min. : 0.00
1st Qu.: 18528 1st Qu.: 0 1st Qu.: 1250 1st Qu.: 3.0 1st Qu.: 0 1st Qu.: 0.00
Median : 43097 Median : 0 Median : 7171 Median :12.0 Median : 0 Median : 0.00
Mean : 73601 Mean : 144 Mean : 17145 Mean :11.6 Mean : 460 Mean : 1.37
3rd Qu.: 92404 3rd Qu.: 0 3rd Qu.: 23800 3rd Qu.:17.0 3rd Qu.: 311 3rd Qu.: 1.00
Max. :1704838 Max. :11148 Max. :263685 Max. :86.0 Max. :30817 Max. :53.00
DaysSinceEnroll
Min. : 2
1st Qu.:2330
Median :4096
Mean :4119
3rd Qu.:5790
Max. :8296
colMeans(A) %>% sort
FlightTrans BonusTrans QualMiles FlightMiles DaysSinceEnroll BonusMiles
1.374 11.602 144.115 460.056 4118.559 17144.846
Balance
73601.328
Which TWO variables have (on average) the smallest values?
Which TWO variables have (on average) the largest values?
In this problem, we will normalize our data before we run the clustering algorithms.
Why is it important to normalize the data before clustering?
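One way to see why normalization matters: Euclidean distance sums squared differences across variables, so a variable measured in tens of thousands (such as Balance) can swamp one measured in single digits (such as FlightTrans). The sketch below is only an illustration under stated assumptions: the three customers and their values are made up, and base R's scale() stands in for caret.

```r
# Three hypothetical customers (illustrative values, not taken from the dataset)
toy <- data.frame(
  Balance     = c(20000, 21000, 90000),
  FlightTrans = c(0, 30, 1)
)
rownames(toy) <- c("a", "b", "c")

# Raw distances: Balance dominates, so a and b look far closer than a and c
# even though their FlightTrans values differ a lot
d_raw <- as.matrix(dist(toy))

# After centering and scaling each column, both variables contribute
d_norm <- as.matrix(dist(scale(toy)))

round(d_raw, 1)
round(d_norm, 2)
```

On the raw scale, the 30-transaction difference between a and b barely registers next to the Balance differences. After scaling, it counts about as much as the Balance gap between a and c.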
Let's go ahead and normalize our data. You can normalize the variables in a data frame by using the preProcess function in the "caret" package. You should already have this package installed from Week 4, but if not, install it with install.packages("caret"). Then load the package with library(caret).
Now, create a normalized data frame called “airlinesNorm” by running the following commands:
preproc = preProcess(airlines)
airlinesNorm = predict(preproc, airlines)
The first command pre-processes the data, and the second command performs the normalization. If you look at the summary of airlinesNorm, you should see that all of the variables now have mean zero. You can also see that each of the variables has standard deviation 1 by using the sd() function.
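By default, caret's preProcess centers and scales each column, so the result can be cross-checked with base R's scale(). The sketch below does this on a small simulated data frame. The column names are borrowed from the dataset, but the values are random and the name toyNorm is made up for this example.

```r
set.seed(1)
# Simulated stand-in for two of the dataset's columns (random values)
toy <- data.frame(
  Balance    = rexp(200, rate = 1 / 70000),
  BonusTrans = rpois(200, lambda = 12)
)

# scale() subtracts each column's mean and divides by its standard deviation
toyNorm <- as.data.frame(scale(toy))

round(colMeans(toyNorm), 3)     # every mean is 0 up to rounding
round(sapply(toyNorm, sd), 3)   # every standard deviation is 1
```

sd() works on a single vector, which is why the check applies it column by column with sapply().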
library(caret)
preproc = preProcess(A)
AN = predict(preproc, A)
summary(AN)
Balance QualMiles BonusMiles BonusTrans FlightMiles FlightTrans
Min. :-0.730 Min. :-0.186 Min. :-0.710 Min. :-1.208 Min. :-0.329 Min. :-0.362
1st Qu.:-0.546 1st Qu.:-0.186 1st Qu.:-0.658 1st Qu.:-0.896 1st Qu.:-0.329 1st Qu.:-0.362
Median :-0.303 Median :-0.186 Median :-0.413 Median : 0.041 Median :-0.329 Median :-0.362
Mean : 0.000 Mean : 0.000 Mean : 0.000 Mean : 0.000 Mean : 0.000 Mean : 0.000
3rd Qu.: 0.187 3rd Qu.:-0.186 3rd Qu.: 0.276 3rd Qu.: 0.562 3rd Qu.:-0.106 3rd Qu.:-0.098
Max. :16.187 Max. :14.223 Max. :10.208 Max. : 7.747 Max. :21.680 Max. :13.610
DaysSinceEnroll
Min. :-1.9934
1st Qu.:-0.8661
Median :-0.0109
Mean : 0.0000
3rd Qu.: 0.8096
Max. : 2.0228
apply(AN, 2, mean) %>% round(3)
Balance QualMiles BonusMiles BonusTrans FlightMiles FlightTrans
0 0 0 0 0 0
DaysSinceEnroll
0
apply(AN, 2, sd) %>% round(3)
Balance QualMiles BonusMiles BonusTrans FlightMiles FlightTrans
1 1 1 1 1 1
DaysSinceEnroll
1
apply(AN, 2, max) %>% sort
DaysSinceEnroll BonusTrans BonusMiles FlightTrans QualMiles Balance
2.023 7.747 10.208 13.610 14.223 16.187
FlightMiles
21.680
In the normalized data, which variable has the largest maximum value?
apply(AN, 2, min) %>% sort
DaysSinceEnroll BonusTrans Balance BonusMiles FlightTrans FlightMiles
-1.9934 -1.2081 -0.7303 -0.7099 -0.3621 -0.3286
QualMiles
-0.1863
In the normalized data, which variable has the smallest minimum value?
Compute the distances between data points (using Euclidean distance) and then run the hierarchical clustering algorithm (using method="ward.D") on the normalized data. The commands may take a few minutes to finish, because the dataset has a large number of observations for hierarchical clustering.
Then, plot the dendrogram of the hierarchical clustering process. Suppose the airline is looking for somewhere between 2 and 10 clusters.
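As a numeric companion to reading the dendrogram by eye, the merge heights stored in the hclust result can be inspected directly: a wide gap between consecutive heights means a horizontal cut at that level has plenty of room. The sketch below uses simulated points with three obvious groups, not the airline data, and the names pts, top and gaps are made up for illustration.

```r
set.seed(42)
# Simulated data: three well-separated groups of 30 points in 2 dimensions
pts <- rbind(
  matrix(rnorm(60, mean = 0),  ncol = 2),
  matrix(rnorm(60, mean = 6),  ncol = 2),
  matrix(rnorm(60, mean = 12), ncol = 2)
)

d  <- dist(pts, method = "euclidean")
hc <- hclust(d, method = "ward.D")

# hc$height lists the merge heights in increasing order.
# top[j] is the height of the merge that goes from j + 1 clusters to j clusters.
top <- rev(tail(hc$height, 5))

# Vertical room available for a cut that yields k clusters, for k = 2..5
gaps <- -diff(top)
names(gaps) <- paste0("k=", 2:5)
round(gaps, 2)
```

A small gap for some k means the tree is cut in a crowded region of the dendrogram, which is usually a sign that k is a poor choice.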
d = dist(AN,method="euclidean")
hc = hclust(d, method='ward.D')
plot(hc)
According to the dendrogram, which of the following is NOT a good choice for the number of clusters?
Suppose that after looking at the dendrogram and discussing with the marketing department, the airline decides to proceed with 5 clusters. Divide the data points into 5 clusters by using the cutree function.
kg = cutree(hc, k=5)
table(kg)
kg
1 2 3 4 5
776 519 494 868 1342
How many data points are in Cluster 1?
Now, use tapply to compare the average values in each of the variables for the 5 clusters (the centroids of the clusters). You may want to compute the average values of the unnormalized data so that it is easier to interpret. You can do this for the variable “Balance” with the following command:
tapply(airlines$Balance, clusterGroups, mean)
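The tapply call handles one variable at a time. The sketch below shows the same idea on a tiny made-up data frame (the values and the groups vector are illustrative only), together with aggregate() as one way to get every variable's cluster mean in a single call.

```r
# Small made-up data frame and cluster assignment (illustrative values only)
toy <- data.frame(
  Balance    = c(100, 200, 300, 400, 500, 600),
  BonusMiles = c(10, 20, 30, 40, 50, 60)
)
groups <- c(1, 1, 2, 2, 2, 3)

# One variable at a time: mean Balance within each cluster
tapply(toy$Balance, groups, mean)

# All variables at once: one row per cluster, one column per variable
centroids <- aggregate(toy, by = list(cluster = groups), FUN = mean)
centroids
```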
sapply(split(A,kg), colMeans) %>% round(2)
1 2 3 4 5
Balance 57866.90 110669.27 198191.57 52335.91 36255.91
QualMiles 0.64 1065.98 30.35 4.85 2.51
BonusMiles 10360.12 22881.76 55795.86 20788.77 2264.79
BonusTrans 10.82 18.23 19.66 17.09 2.97
FlightMiles 83.18 2613.42 327.68 111.57 119.32
FlightTrans 0.30 7.40 1.07 0.34 0.44
DaysSinceEnroll 6235.36 4402.41 5615.71 2840.82 3060.08
Compared to the other clusters, Cluster 1 has the largest average values in which variables (if any)? Select all that apply.
How would you describe the customers in Cluster 1?
split(AN,kg) %>% sapply(colMeans) %>% round(2)
1 2 3 4 5
Balance -0.16 0.37 1.24 -0.21 -0.37
QualMiles -0.19 1.19 -0.15 -0.18 -0.18
BonusMiles -0.28 0.24 1.60 0.15 -0.62
BonusTrans -0.08 0.69 0.84 0.57 -0.90
FlightMiles -0.27 1.54 -0.09 -0.25 -0.24
FlightTrans -0.28 1.59 -0.08 -0.27 -0.25
DaysSinceEnroll 1.03 0.14 0.72 -0.62 -0.51
par(cex=0.8)
split(AN,kg) %>% sapply(colMeans) %>% barplot(beside=T,col=rainbow(7))
legend('topright',legend=colnames(A),fill=rainbow(7))
Compared to the other clusters, Cluster 2 has the largest average values in which variables (if any)? Select all that apply.
How would you describe the customers in Cluster 2?
Compared to the other clusters, Cluster 3 has the largest average values in which variables (if any)? Select all that apply.
How would you describe the customers in Cluster 3?
Compared to the other clusters, Cluster 4 has the largest average values in which variables (if any)? Select all that apply.
How would you describe the customers in Cluster 4?
Compared to the other clusters, Cluster 5 has the largest average values in which variables (if any)? Select all that apply.
How would you describe the customers in Cluster 5?
Now run the k-means clustering algorithm on the normalized data, again creating 5 clusters. Set the seed to 88 right before running the clustering algorithm, and set the argument iter.max to 1000.
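The seed matters because kmeans() starts from randomly chosen centers, so two runs can give different assignments unless the random number generator is reset first. The sketch below illustrates this on simulated data (random values, not the airline data). The names X, km1 and km2 are made up for the example.

```r
set.seed(7)
# Simulated, already-normalized data: 100 rows, 4 columns
X <- scale(matrix(rnorm(400), ncol = 4))

# Reset the seed immediately before each run
set.seed(88)
km1 <- kmeans(X, centers = 5, iter.max = 1000)

set.seed(88)
km2 <- kmeans(X, centers = 5, iter.max = 1000)

# Same seed, same starting centers, same assignment
identical(km1$cluster, km2$cluster)

# iter.max is only an upper bound; km1$iter reports the iterations actually used
km1$iter
```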
set.seed(88)
km = kmeans(AN, 5, iter.max = 1000)
kg2 = km$cluster
table(kg2)
kg2
1 2 3 4 5
408 141 993 1182 1275
How many clusters have more than 1,000 observations?
par(cex=0.8)
km$centers %>% round(2) %>% t %>% barplot(beside=T,col=rainbow(7))
legend('topright',legend=colnames(A),fill=rainbow(7))
Now, compare the cluster centroids to each other either by dividing the data points into groups and then using tapply, or by looking at the output of kmeansClust$centers, where "kmeansClust" is the name of the output of the kmeans function. (Note that the output of kmeansClust$centers will be for the normalized data. If you want to look at the average values for the unnormalized data, you need to use tapply like we did for hierarchical clustering.)
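To make the normalized versus unnormalized distinction concrete, the sketch below clusters simulated data (random values, not the airline data) and obtains the centroids in original units in two ways: by averaging the raw rows within each cluster, and by undoing the scaling of the centers by hand. It assumes the data were normalized with base R's scale(), whose attributes store the column means and standard deviations. The names raw, norm, rawCenters and backTransformed are made up for the example.

```r
set.seed(3)
# Simulated raw data on very different scales (random values)
raw <- data.frame(
  Balance    = rexp(300, rate = 1 / 70000),
  BonusTrans = rpois(300, lambda = 12)
)
norm <- scale(raw)

set.seed(88)
km <- kmeans(norm, centers = 3, iter.max = 1000)

# Centroids in normalized units, as returned by kmeans()
round(km$centers, 2)

# Centroids in original units: average the raw rows within each cluster
rawCenters <- sapply(split(raw, km$cluster), colMeans)

# Cross-check: multiply by each column's sd and add back its mean
mu    <- attr(norm, "scaled:center")
sigma <- attr(norm, "scaled:scale")
backTransformed <- t(km$centers) * sigma + mu

round(rawCenters, 2)
```

Both routes should agree, since centering and scaling is a linear transformation and each k-means center is the mean of the points assigned to it.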
table(Hierarchical=kg, KMeans=kg2)
            KMeans
Hierarchical    1    2    3    4    5
           1    4    0   98  673    1
           2   92  137  105   92   93
           3  300    4  132   58    0
           4   12    0  653   30  173
           5    0    0    5  329 1008
Do you expect Cluster 1 of the K-Means clustering output to necessarily be similar to Cluster 1 of the Hierarchical clustering output?
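Cluster numbers are only labels: each algorithm numbers its groups in whatever order it happens to find them. The sketch below, on simulated data with three clear groups (random values, not the airline data), cross-tabulates the two labelings and looks up, for each hierarchical cluster, the k-means cluster it overlaps most. The names hcGroups, kmGroups and bestMatch are made up for the example.

```r
set.seed(5)
# Simulated data with three clear groups of 20 points each
X <- rbind(
  matrix(rnorm(40, mean = 0),  ncol = 2),
  matrix(rnorm(40, mean = 8),  ncol = 2),
  matrix(rnorm(40, mean = 16), ncol = 2)
)

hcGroups <- cutree(hclust(dist(X), method = "ward.D"), k = 3)

set.seed(88)
kmGroups <- kmeans(X, centers = 3, iter.max = 1000)$cluster

# Rows: hierarchical labels; columns: k-means labels
tab <- table(Hierarchical = hcGroups, KMeans = kmGroups)
tab

# For each hierarchical cluster, the k-means label with the largest overlap.
# The labels are arbitrary, so cluster 1 need not match cluster 1.
bestMatch <- apply(tab, 1, which.max)
bestMatch
```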
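Give each of these five clusters a name.
Design a marketing strategy for each of the five clusters.
1 Dozing customers: These customers enrolled a long time ago and score poorly on flight miles, yet they have accumulated a lot on the bonus side, which suggests they are still quite likely to use our service in the future. Marketing to them should aim at raising their flight miles, for example by launching travel packages that increase their willingness to fly with us.
2 Core customers: These customers enrolled fairly recently and form a growing segment that performs well in every category. For this group we should launch a loyalty program that offers them discounts and services through accumulated mileage, so that we retain these growing customers.
3 Potential customers: This target group has taken up our airline's bonus offers, which suggests they may be willing to fly with us later. We can offer them a special first-flight service or a first-flight discount to attract them to our service.
4 Dormant customers: This target group enrolled a long time ago but performs poorly on everything else, so we suspect they may have switched to other airlines. We can first offer bonus miles for completing a survey, hoping they will tell us why they stopped using our service, and then launch a "welcome back, old friends" campaign to win them back.
5 Non-target customers: These customers may be loyal to other airlines, and their switching costs may be high, so we will not market to them specifically and will reserve the budget for the groups above.
Is the statistically best clustering also the best clustering in practice?
Besides within-cluster and between-cluster distances, what other factors usually need to be considered when clustering in practice?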