1. Data Normalization
1.1 Data Summary
Read the dataset AirlinesCluster.csv into R and call it "airlines". Then look at the summary of airlines.
getwd()
[1] "C:/MIT summer 2018/Unit6"
airlines = read.csv("data/AirlinesCluster.csv")
summary(airlines)
    Balance          QualMiles       BonusMiles       BonusTrans    FlightMiles     FlightTrans
 Min.   :      0   Min.   :    0   Min.   :     0   Min.   : 0.0   Min.   :    0   Min.   : 0.00
 1st Qu.:  18528   1st Qu.:    0   1st Qu.:  1250   1st Qu.: 3.0   1st Qu.:    0   1st Qu.: 0.00
 Median :  43097   Median :    0   Median :  7171   Median :12.0   Median :    0   Median : 0.00
 Mean   :  73601   Mean   :  144   Mean   : 17145   Mean   :11.6   Mean   :  460   Mean   : 1.37
 3rd Qu.:  92404   3rd Qu.:    0   3rd Qu.: 23800   3rd Qu.:17.0   3rd Qu.:  311   3rd Qu.: 1.00
 Max.   :1704838   Max.   :11148   Max.   :263685   Max.   :86.0   Max.   :30817   Max.   :53.00
 DaysSinceEnroll
 Min.   :   2
 1st Qu.:2330
 Median :4096
 Mean   :4119
 3rd Qu.:5790
 Max.   :8296
Which TWO variables have (on average) the smallest values?
Which TWO variables have (on average) the largest values?
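One way to read the answer off programmatically, rather than scanning the summary by eye, is to sort the column means. The sketch below uses a small made-up data frame called toy (its name and values are illustrative assumptions, not taken from AirlinesCluster.csv); the same call should work on airlines.

```r
# Hypothetical toy data frame; values are illustrative only
toy <- data.frame(
  Balance     = c(1000, 50000, 120000),
  BonusTrans  = c(3, 12, 17),
  FlightTrans = c(0, 1, 4)
)

# Column means sorted from smallest to largest
avg <- sort(colMeans(toy))
print(avg)

# Names of the two smallest and the two largest averages
smallest_two <- names(head(avg, 2))
largest_two  <- names(tail(avg, 2))
```

On the real data the equivalent would be sort(colMeans(airlines)).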
1.2 Why Normalize the Data
In this problem, we will normalize our data before we run the clustering algorithms.
Why is it important to normalize the data before clustering?
- If we don’t normalize the data, the clustering will be dominated by the variables that are on a larger scale.
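The effect of scale on distances can be sketched with three made-up customers. The data frame toy and its values are assumptions for illustration only. In the raw data, A appears far closer to B than to C because the Balance gap swamps everything else. After centering and scaling, the FlightTrans gap between A and B counts about as much as the Balance gap between A and C.

```r
# Three hypothetical customers; values are illustrative only
toy <- data.frame(
  Balance     = c(20000, 21000, 90000),
  FlightTrans = c(0, 30, 1)
)
rownames(toy) <- c("A", "B", "C")

# Raw Euclidean distances: Balance dominates because of its larger scale
d_raw <- as.matrix(dist(toy))

# Distances after each column is centered to mean 0 and scaled to sd 1
d_norm <- as.matrix(dist(scale(toy)))

print(round(d_raw, 1))
print(round(d_norm, 2))
```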
1.3 Normalizing the Data with the caret Package
Let's go ahead and normalize our data. You can normalize the variables in a data frame by using the preProcess function in the "caret" package. You should already have this package installed from Week 4, but if not, go ahead and install it with install.packages("caret"). Then load the package with library(caret).
Now, create a normalized data frame called “airlinesNorm” by running the following commands:
preproc = preProcess(airlines)
airlinesNorm = predict(preproc, airlines)
The first command pre-processes the data, and the second command performs the normalization. If you look at the summary of airlinesNorm, you should see that all of the variables now have mean zero. You can also see that each of the variables has standard deviation 1 by using the sd() function.
library(caret)
preproc = preProcess(airlines)
airlinesNorm = predict(preproc, airlines)
summary(airlinesNorm)
    Balance         QualMiles        BonusMiles       BonusTrans      FlightMiles
 Min.   :-0.730   Min.   :-0.186   Min.   :-0.710   Min.   :-1.208   Min.   :-0.329
 1st Qu.:-0.546   1st Qu.:-0.186   1st Qu.:-0.658   1st Qu.:-0.896   1st Qu.:-0.329
 Median :-0.303   Median :-0.186   Median :-0.413   Median : 0.041   Median :-0.329
 Mean   : 0.000   Mean   : 0.000   Mean   : 0.000   Mean   : 0.000   Mean   : 0.000
 3rd Qu.: 0.187   3rd Qu.:-0.186   3rd Qu.: 0.276   3rd Qu.: 0.562   3rd Qu.:-0.106
 Max.   :16.187   Max.   :14.223   Max.   :10.208   Max.   : 7.747   Max.   :21.680
  FlightTrans     DaysSinceEnroll
 Min.   :-0.362   Min.   :-1.9934
 1st Qu.:-0.362   1st Qu.:-0.8661
 Median :-0.362   Median :-0.0109
 Mean   : 0.000   Mean   : 0.0000
 3rd Qu.:-0.098   3rd Qu.: 0.8096
 Max.   :13.610   Max.   : 2.0228
In the normalized data, which variable has the largest maximum value?
In the normalized data, which variable has the smallest minimum value?
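With its default settings, preProcess centers and scales each numeric column, which is the same transformation base R's scale() applies. The sketch below uses scale() on made-up data so that it runs without caret, then checks the mean and standard deviation of every column with sapply. The names toy and toyNorm are assumptions for illustration.

```r
set.seed(1)
# Hypothetical data on two very different scales
toy <- data.frame(x = rnorm(50, mean = 100, sd = 20),
                  y = runif(50, min = 0, max = 5))

# Center each column to mean 0 and scale it to standard deviation 1
toyNorm <- as.data.frame(scale(toy))

# Check every column at once instead of calling mean() and sd() one by one
col_means <- sapply(toyNorm, mean)
col_sds   <- sapply(toyNorm, sd)
print(round(col_means, 10))
print(col_sds)
```

The same check on the real data would be sapply(airlinesNorm, sd).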
2. Hierarchical Clustering
2.1 Choosing the Number of Clusters from the Dendrogram and Application Needs
Compute the distances between data points (using Euclidean distance) and then run the hierarchical clustering algorithm (using method = "ward.D") on the normalized data. It may take a few minutes for the commands to finish since the dataset has a large number of observations for hierarchical clustering.
Then, plot the dendrogram of the hierarchical clustering process. Suppose the airline is looking for somewhere between 2 and 10 clusters.
distance = dist(airlinesNorm, method = "euclidean")
clusterairlines = hclust(distance, method = "ward.D")
plot(clusterairlines)

According to the dendrogram, which of the following is NOT a good choice for the number of clusters?
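To make the dendrogram easier to read when judging a candidate number of clusters, one option is to box the branches with rect.hclust and to inspect the largest merge heights. This is a minimal sketch on made-up data (toy, d and hc are illustrative names), not the airline dataset.

```r
set.seed(42)
# Hypothetical 2-D data: three loose groups of 20 points each
toy <- rbind(
  matrix(rnorm(40, mean = 0),  ncol = 2),
  matrix(rnorm(40, mean = 6),  ncol = 2),
  matrix(rnorm(40, mean = 12), ncol = 2)
)

d  <- dist(toy, method = "euclidean")
hc <- hclust(d, method = "ward.D")

# Draw the dendrogram and box the branches produced by a 3-cluster cut
plot(hc, labels = FALSE)
rect.hclust(hc, k = 3, border = "red")

# Merge heights, largest first; a big gap between consecutive heights
# suggests a number of clusters that is stable over a wide range of cuts
top_heights <- rev(hc$height)[1:5]
print(top_heights)
```

A number of clusters is a poor choice when the horizontal line that produces it has very little vertical room to move before crossing another merge.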
2.2 Cutting the Tree into Clusters
Suppose that after looking at the dendrogram and discussing with the marketing department, the airline decides to proceed with 5 clusters. Divide the data points into 5 clusters by using the cutree function.
hierGroups = cutree(clusterairlines, k = 5)
table(hierGroups)
hierGroups
   1    2    3    4    5
 776  519  494  868 1342
How many data points are in Cluster 1?
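The cutree step returns one cluster label per row, so the cluster sizes and the members of any one cluster can be pulled out directly. A small sketch on made-up data follows; toy, groups and cluster1_rows are illustrative names.

```r
set.seed(7)
toy <- matrix(rnorm(60), ncol = 2)           # 30 hypothetical points
hc  <- hclust(dist(toy), method = "ward.D")

groups <- cutree(hc, k = 5)                  # one label (1 to 5) per row
sizes  <- table(groups)
print(sizes)

# Row indices of the points assigned to cluster 1, for inspection
cluster1_rows <- which(groups == 1)
```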
2.3 Inferring Cluster Characteristics from the Means of the Segmentation Variables
Now, use tapply to compare the average values in each of the variables for the 5 clusters (the centroids of the clusters). You may want to compute the average values of the unnormalized data so that it is easier to interpret. You can do this for the variable "Balance" with the following command:
tapply(airlines$Balance, hierGroups, mean)
names(airlines)
[1] "Balance" "QualMiles" "BonusMiles" "BonusTrans" "FlightMiles"
[6] "FlightTrans" "DaysSinceEnroll"
airlines$hierGroups = hierGroups
sapply(split(airlines[,1:7], hierGroups), colMeans)
                         1          2          3          4          5
Balance         57866.9046 110669.266 198191.575 52335.9136 36255.9098
QualMiles           0.6443   1065.983     30.346     4.8479     2.5112
BonusMiles      10360.1237  22881.763  55795.860 20788.7661  2264.7876
BonusTrans         10.8235     18.229     19.664    17.0876     2.9732
FlightMiles        83.1843   2613.418    327.676   111.5737   119.3219
FlightTrans         0.3028      7.403      1.069     0.3445     0.4389
DaysSinceEnroll  6235.3647   4402.414   5615.709  2840.8226  3060.0812
Compared to the other clusters, Cluster 1 has the largest average values in which variables (if any)? Select all that apply.
How would you describe the customers in Cluster 1?
- Infrequent but loyal customers.
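The "largest average in which variables" questions can also be answered mechanically from the centroid table. The sketch below builds the centroids both ways (tapply for one variable, split plus colMeans for all of them) and then picks, for each variable, the cluster with the highest mean. The data frame toy and the label vector groups are made up for illustration.

```r
# Hypothetical unnormalized data and cluster labels; values are illustrative
toy <- data.frame(
  Balance     = c(100, 300, 1000, 3000, 500),
  FlightTrans = c(0, 2, 10, 6, 1)
)
groups <- c(1, 1, 2, 2, 3)

# One variable at a time
balance_means <- tapply(toy$Balance, groups, mean)

# All variables at once: rows are variables, columns are clusters
centroids <- sapply(split(toy, groups), colMeans)
print(centroids)

# For each variable, the label of the cluster with the largest average
largest_in <- apply(centroids, 1,
                    function(v) colnames(centroids)[which.max(v)])
print(largest_in)
```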
2.4 Cluster 2
Compared to the other clusters, Cluster 2 has the largest average values in which variables (if any)? Select all that apply.
- QualMiles
- FlightMiles
- FlightTrans
How would you describe the customers in Cluster 2?
- Customers who have accumulated a large amount of miles, and the ones with the largest number of flight transactions.
2.5 Cluster 3
Compared to the other clusters, Cluster 3 has the largest average values in which variables (if any)? Select all that apply.
- Balance
- BonusMiles
- BonusTrans
How would you describe the customers in Cluster 3?
- Customers who have accumulated a large amount of miles, mostly through non-flight transactions.
2.6 Cluster 4
Compared to the other clusters, Cluster 4 has the largest average values in which variables (if any)? Select all that apply.
How would you describe the customers in Cluster 4?
- Relatively new customers who seem to be accumulating miles, mostly through non-flight transactions.
2.7 Cluster 5
Compared to the other clusters, Cluster 5 has the largest average values in which variables (if any)? Select all that apply.
How would you describe the customers in Cluster 5?
- Relatively new customers who don’t use the airline very often.
3. K-Means Clustering
3.1 K-Means Clustering
Now run the k-means clustering algorithm on the normalized data, again creating 5 clusters. Set the seed to 88 right before running the clustering algorithm, and set the argument iter.max to 1000.
set.seed(88)
KMC = kmeans(airlinesNorm, centers = 5, iter.max = 1000)
table(KMC$cluster)
   1    2    3    4    5
 408  141  993 1182 1275
How many clusters have more than 1,000 observations?
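The seed is set right before kmeans because the algorithm starts from randomly chosen centers. A minimal sketch on made-up data shows that repeating the seed reproduces the assignment, and counts the clusters above a size threshold. The names toy, km1, km2 and the threshold of 20 are assumptions for illustration.

```r
set.seed(1)
# Hypothetical data: two well-separated groups of 25 points each
toy <- rbind(matrix(rnorm(50, mean = 0), ncol = 2),
             matrix(rnorm(50, mean = 8), ncol = 2))

# Fix the seed immediately before each call so the random start is the same
set.seed(88)
km1 <- kmeans(toy, centers = 2, iter.max = 1000)

set.seed(88)
km2 <- kmeans(toy, centers = 2, iter.max = 1000)

# Same seed, same starting centers, same assignment
same <- identical(km1$cluster, km2$cluster)

sizes <- table(km1$cluster)
print(sizes)

# Number of clusters with more than 20 observations
n_large <- sum(sizes > 20)
```

On the airline output the analogous count would be sum(table(KMC$cluster) > 1000).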
3.2 Correspondence Between Hierarchical and K-Means Clusters
Now, compare the cluster centroids to each other either by dividing the data points into groups and then using tapply, or by looking at the output of kmeansClust$centers, where "kmeansClust" is the name of the output of the kmeans function (KMC in the code above). (Note that the output of kmeansClust$centers will be for the normalized data. If you want to look at the average values for the unnormalized data, you need to use tapply like we did for hierarchical clustering.)
table(hierGroups, KMC$cluster)
hierGroups    1    2    3    4    5
         1    4    0   98  673    1
         2   92  137  105   92   93
         3  300    4  132   58    0
         4   12    0  653   30  173
         5    0    0    5  329 1008
sapply(split(airlines[, 1:7], airlines$hierGroups), colMeans)
                         1          2          3          4          5
Balance         57866.9046 110669.266 198191.575 52335.9136 36255.9098
QualMiles           0.6443   1065.983     30.346     4.8479     2.5112
BonusMiles      10360.1237  22881.763  55795.860 20788.7661  2264.7876
BonusTrans         10.8235     18.229     19.664    17.0876     2.9732
FlightMiles        83.1843   2613.418    327.676   111.5737   119.3219
FlightTrans         0.3028      7.403      1.069     0.3445     0.4389
DaysSinceEnroll  6235.3647   4402.414   5615.709  2840.8226  3060.0812
Do you expect Cluster 1 of the K-Means clustering output to necessarily be similar to Cluster 1 of the Hierarchical clustering output?
- No, because cluster ordering is not meaningful in either k-means clustering or hierarchical clustering.
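Because the label numbers are arbitrary, matching clusters across the two methods has to go through the cross-tabulation rather than the label values. The sketch below uses made-up, well-separated data so that both methods recover the same three groups. The argument nstart = 50 is an addition here (not used in the original commands) to make k-means less sensitive to its random start, and toy, hier, km and best_match are illustrative names.

```r
set.seed(3)
# Hypothetical data: three well-separated groups of 20 points each
toy <- rbind(matrix(rnorm(40, mean = 0,  sd = 0.5), ncol = 2),
             matrix(rnorm(40, mean = 10, sd = 0.5), ncol = 2),
             matrix(rnorm(40, mean = 20, sd = 0.5), ncol = 2))

hier <- cutree(hclust(dist(toy), method = "ward.D"), k = 3)
km   <- kmeans(toy, centers = 3, nstart = 50)$cluster

# Rows: hierarchical labels; columns: k-means labels
tab <- table(hier, km)
print(tab)

# For each hierarchical cluster, the k-means label it overlaps with most
best_match <- apply(tab, 1, which.max)
print(best_match)
```

On the airline data the cross-tabulation is much less clean, so the largest cell in each row only indicates the closest k-means counterpart, not an exact match.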
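[Discussion Questions]
Give each of the five clusters a name
- 1. Regular (Select) members
  - Members who enrolled long ago but rarely use the service. They may be highly loyal to the company, or they may have been drawn in by sign-up rewards, but their demand for flights is low.
- 2. Premier Gold (Diamond) members
  - VIP customers who buy the company's core service. Although this is one of the smallest clusters, it is likely the company's main source of profit.
- 3. Veteran free riders
  - Customers who soak up the extra benefits the company offers while barely purchasing its core service. They are not just riding for free; they are practically driving the free ride.
- 4. Adolescent members
  - Members who joined very recently and, like adolescents, have not yet settled into a pattern. They may grow into Premier Gold members, but they may also become veteran free riders.
- 5. Lurker members
  - Members who joined fairly recently and are still below the surface, with no clear purchasing behavior visible yet.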
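Design a marketing strategy for each of the five clusters
Provide customized services through an app
- Regular (Select) members
  - Goal: keep the brand top of mind and maintain the customer relationship
  - Send electronic coupons on a regular schedule to remind customers of current offers. This helps the company come to mind first when a need arises, and because electronic coupons carry no menu cost, the company loses little if a coupon expires unused.
- Premier Gold (Diamond) members
  - Goal: keep service seamless and offer exclusive privileges
  - Because switching costs in the airline industry are high, customers tend to show consumer inertia. Marketing for this segment should therefore not add further promotional discounts; it should keep the service process smooth and offer a package of privileges. Examples: a members-only online customer service channel that guarantees a reply within 20 minutes, 24 hours a day; several VIP seats held on every flight; and a complimentary round-trip international ticket for the member (plus one companion) during the member's birthday month.
- Veteran free riders
  - Goal: reduce the bonuses they consume in order to lower the company's opportunity cost
  - Offer bundled packages that combine some points with a reduced price to motivate them to purchase.
- Adolescent members
  - Goal: keep them from becoming veteran free riders
  - Adjust the bonus program so that BonusMiles and BonusTrans are not too easy to accumulate, and run occasional "flash sale" fares on flights to help them build up FlightMiles and FlightTrans while developing brand loyalty and preference.
- Lurker members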
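Is the statistically best clustering also the best clustering in practice?
+ Not necessarily. It still depends on the analyst's judgment, using domain knowledge to pick a clustering that is reasonable and interpretable (one that tells a story).
Besides within-cluster and between-cluster distances, what other factors usually need to be considered when clustering in practice?
- Density of the data distribution
- Shape of the data distribution
- Demographic data: gender, age, income, occupation, and even marital status and household size.
- Other purchasing habits: number of purchases and amount spent over the past 3 years, 1 year, and 6 months, as well as frequently visited countries.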
