Main topic: market segmentation based on customer attributes

Learning objectives:

rm(list=ls(all=TRUE))
Sys.setlocale("LC_ALL","C")
## [1] "C"
options(digits=4, scipen=12)
library(dplyr)
library(caret)



1. Data Normalization

1.1 Data Summary

Read the dataset AirlinesCluster.csv into R (the assignment calls it “airlines”; the code below names it D) and look at its summary.

D= read.csv("data/AirlinesCluster.csv")
summary(D)
##     Balance          QualMiles       BonusMiles       BonusTrans  
##  Min.   :      0   Min.   :    0   Min.   :     0   Min.   : 0.0  
##  1st Qu.:  18528   1st Qu.:    0   1st Qu.:  1250   1st Qu.: 3.0  
##  Median :  43097   Median :    0   Median :  7171   Median :12.0  
##  Mean   :  73601   Mean   :  144   Mean   : 17145   Mean   :11.6  
##  3rd Qu.:  92404   3rd Qu.:    0   3rd Qu.: 23800   3rd Qu.:17.0  
##  Max.   :1704838   Max.   :11148   Max.   :263685   Max.   :86.0  
##   FlightMiles     FlightTrans    DaysSinceEnroll
##  Min.   :    0   Min.   : 0.00   Min.   :   2   
##  1st Qu.:    0   1st Qu.: 0.00   1st Qu.:2330   
##  Median :    0   Median : 0.00   Median :4096   
##  Mean   :  460   Mean   : 1.37   Mean   :4119   
##  3rd Qu.:  311   3rd Qu.: 1.00   3rd Qu.:5790   
##  Max.   :30817   Max.   :53.00   Max.   :8296
colMeans(D) %>% sort
##     FlightTrans      BonusTrans       QualMiles     FlightMiles 
##           1.374          11.602         144.115         460.056 
## DaysSinceEnroll      BonusMiles         Balance 
##        4118.559       17144.846       73601.328

Looking at the summary of airlines, which TWO variables have (on average) the smallest values?

  • BonusTrans
  • FlightTrans

Which TWO variables have (on average) the largest values?

  • Balance
  • BonusMiles

1.2 Why Normalize the Data

In this problem, we will normalize our data before we run the clustering algorithms. Why is it important to normalize the data before clustering?

  • If we don’t normalize the data, the clustering will be dominated by the variables that are on a larger scale.
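
A minimal sketch of this effect, using three made-up customers (the values are hypothetical and not taken from AirlinesCluster.csv). Before scaling, the Euclidean distance is driven almost entirely by Balance; after scaling, FlightTrans contributes as well.

```r
# Toy data (hypothetical values): one large-scale and one small-scale variable
toy <- data.frame(
  Balance     = c(10000, 10500, 90000),
  FlightTrans = c(0, 20, 1)
)

# Raw distances: Balance dominates, so customers 1 and 2 look nearly identical
# even though their FlightTrans values differ greatly
d_raw <- as.matrix(dist(toy))

# After scaling each column to mean 0 and sd 1, FlightTrans matters too
d_scaled <- as.matrix(dist(scale(toy)))

print(round(d_raw, 1))
print(round(d_scaled, 2))
```

In the raw matrix, customers 1 and 2 are far closer to each other than either is to customer 3; in the scaled matrix the two distances are of similar size.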

1.3 Normalizing the Data with the caret Package

Let’s go ahead and normalize our data. You can normalize the variables in a data frame by using the preProcess function in the “caret” package. You should already have this package installed from Week 4, but if not, install it with install.packages("caret"). Then load the package with library(caret).

Now, create a normalized data frame (called “airlinesNorm” in the assignment, ND in the code below) by running the following commands:

preproc = preProcess(D)
ND = predict(preproc, D)
apply(ND,2,sd)
##         Balance       QualMiles      BonusMiles      BonusTrans 
##               1               1               1               1 
##     FlightMiles     FlightTrans DaysSinceEnroll 
##               1               1               1
summary(ND)
##     Balance         QualMiles        BonusMiles       BonusTrans    
##  Min.   :-0.730   Min.   :-0.186   Min.   :-0.710   Min.   :-1.208  
##  1st Qu.:-0.546   1st Qu.:-0.186   1st Qu.:-0.658   1st Qu.:-0.896  
##  Median :-0.303   Median :-0.186   Median :-0.413   Median : 0.041  
##  Mean   : 0.000   Mean   : 0.000   Mean   : 0.000   Mean   : 0.000  
##  3rd Qu.: 0.187   3rd Qu.:-0.186   3rd Qu.: 0.276   3rd Qu.: 0.562  
##  Max.   :16.187   Max.   :14.223   Max.   :10.208   Max.   : 7.747  
##   FlightMiles      FlightTrans     DaysSinceEnroll  
##  Min.   :-0.329   Min.   :-0.362   Min.   :-1.9934  
##  1st Qu.:-0.329   1st Qu.:-0.362   1st Qu.:-0.8661  
##  Median :-0.329   Median :-0.362   Median :-0.0109  
##  Mean   : 0.000   Mean   : 0.000   Mean   : 0.0000  
##  3rd Qu.:-0.106   3rd Qu.:-0.098   3rd Qu.: 0.8096  
##  Max.   :21.680   Max.   :13.610   Max.   : 2.0228
apply(ND, 2, max) %>% sort
## DaysSinceEnroll      BonusTrans      BonusMiles     FlightTrans 
##           2.023           7.747          10.208          13.610 
##       QualMiles         Balance     FlightMiles 
##          14.223          16.187          21.680

The first command computes the pre-processing parameters, and the second command applies the normalization (the assignment calls the result “airlinesNorm”; the code above names it ND). The summary of ND shows that every variable now has mean zero, and apply(ND, 2, sd) confirms that every variable has standard deviation 1.

In the normalized data, which variable has the largest maximum value?

  • FlightMiles

In the normalized data, which variable has the smallest minimum value?

  • DaysSinceEnroll
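
With its default settings, preProcess() centers and scales each column, which is the usual z-score transformation. The following is a minimal base-R sketch of the same idea, assuming a made-up data frame `toy` (hypothetical values, not the airline data) and using only `scale()`.

```r
# Hypothetical toy data; not the airline dataset
toy <- data.frame(
  Balance    = c(1000, 5000, 20000, 74000),
  BonusTrans = c(0, 3, 12, 17)
)

# z-score: subtract the column mean, divide by the column standard deviation
toy_norm <- as.data.frame(scale(toy))

print(colMeans(toy_norm))      # approximately 0 for every column
print(apply(toy_norm, 2, sd))  # 1 for every column

# The normalized minimum of a column is (min - mean) / sd, so a column whose
# minimum lies far below its mean, relative to its sd, has the most negative value
print(sapply(toy_norm, min))
```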



2. Hierarchical Clustering

2.1 Choosing the Number of Clusters from the Dendrogram and Application Needs

Compute the distances between data points (using euclidean distance) and then run the Hierarchical clustering algorithm (using method="ward.D") on the normalized data. It may take a few minutes for the commands to finish since the dataset has a large number of observations for hierarchical clustering.

Then, plot the dendrogram of the hierarchical clustering process. Suppose the airline is looking for somewhere between 2 and 10 clusters.

Ddis=dist(ND, method="euclidean" )
hcl=hclust(Ddis, method = "ward.D")
plot(hcl) 
rect.hclust(hcl, k=6, border="purple")
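
One way to support a visual reading of the dendrogram is to inspect the merge heights stored in the hclust object: a large drop in height at a given number of clusters suggests that cutting there separates well-distinguished groups. The sketch below uses the built-in USArrests data instead of the airline data so that it runs on its own; the "look for a large gap" rule is a heuristic assumed here, not part of the original assignment.

```r
# Built-in data, used only for illustration
x  <- scale(USArrests)
hc <- hclust(dist(x, method = "euclidean"), method = "ward.D")

# hc$height holds the n-1 merge heights; reverse so the final merge comes first.
# h[k] is the height of the merge that reduces k+1 clusters to k clusters.
h <- rev(hc$height)

# Cutting into k clusters means cutting between h[k-1] and h[k];
# a wide gap means the k-cluster solution is stable over a large height range.
k   <- 2:10
gap <- h[k - 1] - h[k]
names(gap) <- k
print(round(gap, 2))
```

Values of k with a very small gap are weak candidates, because a slightly different cut height would give a different number of clusters.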

According to the dendrogram, which of the following is NOT a good choice for the number of clusters?

  • 6

2.2 Cutting the Tree into Clusters

Suppose that after looking at the dendrogram and discussing with the marketing department, the airline decides to proceed with 5 clusters. Divide the data points into 5 clusters by using the cutree function.

chcl= cutree(hcl, k=5)
table(chcl)
## chcl
##    1    2    3    4    5 
##  776  519  494  868 1342

How many data points are in Cluster 1?

  • 776

2.3 Inferring Cluster Characteristics from the Means of the Segmentation Variables

Now, use tapply to compare the average values in each of the variables for the 5 clusters (the centroids of the clusters). You may want to compute the average values of the unnormalized data so that it is easier to interpret. You can do this for the variable “Balance” with the following command:

tapply(D$Balance, chcl, mean)

sapply(split(D, chcl), colMeans) %>% round(2) # round() sets how many decimal places to keep
##                        1         2         3        4        5
## Balance         57866.90 110669.27 198191.57 52335.91 36255.91
## QualMiles           0.64   1065.98     30.35     4.85     2.51
## BonusMiles      10360.12  22881.76  55795.86 20788.77  2264.79
## BonusTrans         10.82     18.23     19.66    17.09     2.97
## FlightMiles        83.18   2613.42    327.68   111.57   119.32
## FlightTrans         0.30      7.40      1.07     0.34     0.44
## DaysSinceEnroll  6235.36   4402.41   5615.71  2840.82  3060.08
par(cex=1)
split(ND, chcl) %>% sapply(colMeans) %>% barplot(beside=T,col=rainbow(7))
legend('topright',legend=colnames(D),fill=rainbow(7))
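
An alternative to the split/sapply idiom is aggregate(), which returns the cluster means as a data frame and makes it easy to ask which cluster has the largest mean for each variable. This is a sketch on the built-in iris data (an assumption made so that it runs without the airline file); the names `grp`, `centers`, and `top` are illustrative.

```r
# Built-in data, used only for illustration
x   <- iris[, 1:4]
hc  <- hclust(dist(scale(x)), method = "ward.D")
grp <- cutree(hc, k = 3)

# Mean of every variable within each cluster, on the original (unnormalized) scale
centers <- aggregate(x, by = list(cluster = grp), FUN = mean)
print(centers)

# For each variable, the cluster with the largest mean
top <- sapply(centers[, -1], function(v) centers$cluster[which.max(v)])
print(top)
```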

Compared to the other clusters, Cluster 1 has the largest average values in which variables (if any)? Select all that apply.

  • DaysSinceEnroll

How would you describe the customers in Cluster 1?

  • Infrequent but loyal customers

2.4 Cluster 2

Compared to the other clusters, Cluster 2 has the largest average values in which variables (if any)? Select all that apply.

  • QualMiles
  • FlightMiles
  • FlightTrans

How would you describe the customers in Cluster 2?

  • Customers who have accumulated a large amount of miles, and the ones with the largest number of flight transactions.

2.5 Cluster 3

Compared to the other clusters, Cluster 3 has the largest average values in which variables (if any)? Select all that apply.

  • Balance
  • BonusMiles
  • BonusTrans

How would you describe the customers in Cluster 3?

  • Customers who have accumulated a large amount of miles, mostly through non-flight transactions.

2.6 Cluster 4

Compared to the other clusters, Cluster 4 has the largest average values in which variables (if any)? Select all that apply.

  • None

How would you describe the customers in Cluster 4?

  • Relatively new customers who seem to be accumulating miles, mostly through non-flight transactions.

2.7 Cluster 5

Compared to the other clusters, Cluster 5 has the largest average values in which variables (if any)? Select all that apply.

  • None

How would you describe the customers in Cluster 5?

  • Relatively new customers who don’t use the airline very often.

3. K-Means Clustering

3.1 K-Means Clustering

Now run the k-means clustering algorithm on the normalized data, again creating 5 clusters. Set the seed to 88 right before running the clustering algorithm, and set the argument iter.max to 1000.

set.seed(88)
km=kmeans(ND, 5, iter.max=1000)
kmc = km$cluster
table(kmc)
## kmc
##    1    2    3    4    5 
##  408  141  993 1182 1275

How many clusters have more than 1,000 observations?

  • Two: Cluster 4 (1,182 observations) and Cluster 5 (1,275 observations)

par(cex=1)
km$centers %>% round(2) %>% t %>% barplot(beside=T,col=rainbow(7))
legend('topright',legend=colnames(D),fill=rainbow(7))
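
Because k-means starts from randomly chosen centers, the result depends on the seed, which is why the exercise fixes it with set.seed(88). A common safeguard is the nstart argument of kmeans(), which tries several random starts and keeps the solution with the lowest total within-cluster sum of squares. The sketch below uses the built-in USArrests data rather than the airline data, and the choice of 25 starts is an arbitrary assumption.

```r
# Built-in data, used only for illustration
x <- scale(USArrests)

set.seed(88)
km_single <- kmeans(x, centers = 5, iter.max = 1000)               # one random start

set.seed(88)
km_multi  <- kmeans(x, centers = 5, iter.max = 1000, nstart = 25)  # best of 25 starts

# Total within-cluster sum of squares: lower means tighter clusters
print(c(single = km_single$tot.withinss, multi = km_multi$tot.withinss))

# Cluster sizes may differ between the two runs
print(km_single$size)
print(km_multi$size)
```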

3.2 Correspondence Between Hierarchical and K-Means Clusters

Now, compare the cluster centroids to each other either by dividing the data points into groups and then using tapply, or by looking at the output of kmeansClust$centers, where “kmeansClust” is the name of the output of the kmeans function (km in this document). Note that kmeansClust$centers is reported for the normalized data. If you want the average values for the unnormalized data, you need to use tapply as we did for hierarchical clustering.

table(Hierarchical=chcl, KMeans=kmc)
##             KMeans
## Hierarchical    1    2    3    4    5
##            1    4    0   98  673    1
##            2   92  137  105   92   93
##            3  300    4  132   58    0
##            4   12    0  653   30  173
##            5    0    0    5  329 1008

Do you expect Cluster 1 of the K-Means clustering output to necessarily be similar to Cluster 1 of the Hierarchical clustering output?

  • No, because cluster ordering is not meaningful in either k-means clustering or hierarchical clustering.
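
Since the labels themselves carry no meaning, the correspondence has to be read from the cross-tabulation. The sketch below re-enters the table printed above as a matrix and, for each hierarchical cluster, finds the k-means cluster that shares the most members; the names `tab`, `best_match`, and `share` are illustrative.

```r
# Cross-tabulation copied from the output above
# (rows: hierarchical clusters, columns: k-means clusters)
tab <- matrix(
  c(  4,   0,  98, 673,    1,
     92, 137, 105,  92,   93,
    300,   4, 132,  58,    0,
     12,   0, 653,  30,  173,
      0,   0,   5, 329, 1008),
  nrow = 5, byrow = TRUE,
  dimnames = list(Hierarchical = 1:5, KMeans = 1:5)
)

# For each hierarchical cluster, the k-means cluster with the largest overlap
best_match <- apply(tab, 1, which.max)
print(best_match)   # hierarchical 1..5 map to k-means 4, 2, 1, 3, 5

# Fraction of each hierarchical cluster that lands in its best-matching k-means cluster
share <- round(apply(tab, 1, max) / rowSums(tab), 2)
print(share)
```

Hierarchical Cluster 1 corresponds mainly to K-Means Cluster 4, and Hierarchical Cluster 3 to K-Means Cluster 1, which illustrates that equal labels do not imply similar clusters. Hierarchical Cluster 2 has the weakest match (about 26%), because its members are spread across all five k-means clusters.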


# Hierarchical clustering
split(ND, chcl) %>% sapply(colMeans) %>% barplot(beside=T,col=rainbow(7))
legend('topright',legend=colnames(D),fill=rainbow(7))

# K-means
km$centers %>% round(2) %>% t %>% barplot(beside=T,col=rainbow(7))
legend('topright',legend=colnames(D),fill=rainbow(7))

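[Discussion Questions]

Give each of the five clusters a name

  • 1. Select (Regular) Members
    • Members who enrolled long ago but rarely use the service. They may be highly loyal to the company, or they may have been attracted by the sign-up rewards, but their demand for the service is low.
  • 2. Premier Gold (Diamond) Members
    • VIP customers who buy the company's core service. Although this is one of the smallest groups, it may be the company's main source of profit.
  • 3. Veteran Free Riders
    • Customers who absorb the extra benefits the company offers but hardly ever buy its core service. They are not merely riding for free; they are practically driving the free ride.
  • 4. Adolescent Members
    • Members who joined very recently and, like adolescents, have not yet settled into a pattern. They may grow into Premier Gold Members, but they may also become Veteran Free Riders.
  • 5. Lurker Members
    • Members who joined fairly recently and are still below the surface, with no clear purchasing behavior yet.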

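Design a marketing strategy for each of the five clusters

Offer customized service through a mobile app

  • Regular (Select) Members
    • Goal: keep the brand top of mind and maintain the customer relationship
    • Send electronic coupons on a regular schedule to remind customers of current offers. This makes the company the first one customers think of when a need arises, and because electronic coupons carry no menu cost, the company loses little if a coupon expires unused.
  • Premier Gold (Diamond) Members
    • Goal: keep the service seamless and offer exclusive privileges
    • Switching costs are high in the airline industry, so customers tend to show consumer inertia. The marketing plan for this group is therefore not additional promotions but a smooth service process and a package of exclusive privileges. Examples: an online customer service channel reserved for them that guarantees a reply within 20 minutes, 24 hours a day; a few VIP seats held on every flight to ensure they can book a flight at any time; and a round-trip international ticket for the member (plus one companion) in the member's birthday month.
  • Veteran Free Riders
    • Goal: reduce the bonuses they consume in order to lower the company's opportunity cost
    • Offer them bundled packages that let them buy designated flights at a lower price by redeeming part of their points, in order to stimulate actual purchases.
  • Adolescent Members
    • Goal: keep them from becoming Veteran Free Riders
    • Adjust the bonus program so that BonusMiles and BonusTrans are not too easy to accumulate, and run occasional “flash sale” fares to help them build up FlightMiles and FlightTrans while developing brand loyalty and preference.
  • Lurker Members
    • Apart from regular company communications, no marketing strategy for now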

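Is the statistically best clustering also the best clustering in practice?

  • Not necessarily.
  • In this example, the variables look like consumption patterns produced by underlying demographic variables; they are “results” rather than “causes”. A segmentation should also take those causes into account.
  • It still depends on the analyst's judgment: domain knowledge is needed to choose a clustering that is reasonable and interpretable (one that tells a story).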

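Besides within-cluster and between-cluster distances, what other factors usually need to be considered when clustering in practice?

  • Density of the data distribution
  • Shape of the data distribution
  • Demographic data: gender, age, income, occupation, and even marital status and household size.
  • Other consumption habits: number and “amount” of purchases over the past 3 years, 1 year, and 6 months, as well as frequently visited countries.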