Main topic: market segmentation by customer attributes

Learning objectives:

rm(list=ls(all=T))
Sys.setlocale("LC_ALL","C")
[1] "C"
options(digits=4, scipen=12)
library(dplyr)
library(caret)



1. Data Normalization

1.1 Data Summary

Read the dataset AirlinesCluster.csv into R and call it “airlines”, then look at the summary of airlines:

getwd()
[1] "C:/MIT summer 2018/Unit6"
airlines = read.csv("data/AirlinesCluster.csv")
summary(airlines)
    Balance          QualMiles       BonusMiles       BonusTrans    FlightMiles     FlightTrans   
 Min.   :      0   Min.   :    0   Min.   :     0   Min.   : 0.0   Min.   :    0   Min.   : 0.00  
 1st Qu.:  18528   1st Qu.:    0   1st Qu.:  1250   1st Qu.: 3.0   1st Qu.:    0   1st Qu.: 0.00  
 Median :  43097   Median :    0   Median :  7171   Median :12.0   Median :    0   Median : 0.00  
 Mean   :  73601   Mean   :  144   Mean   : 17145   Mean   :11.6   Mean   :  460   Mean   : 1.37  
 3rd Qu.:  92404   3rd Qu.:    0   3rd Qu.: 23800   3rd Qu.:17.0   3rd Qu.:  311   3rd Qu.: 1.00  
 Max.   :1704838   Max.   :11148   Max.   :263685   Max.   :86.0   Max.   :30817   Max.   :53.00  
 DaysSinceEnroll
 Min.   :   2   
 1st Qu.:2330   
 Median :4096   
 Mean   :4119   
 3rd Qu.:5790   
 Max.   :8296   

Which TWO variables have (on average) the smallest values?

  • BonusTrans
  • FlightTrans

Which TWO variables have (on average) the largest values?

  • Balance
  • BonusMiles
1.2 Why Normalize the Data

In this problem, we will normalize our data before we run the clustering algorithms.

Why is it important to normalize the data before clustering?

  • If we don’t normalize the data, the clustering will be dominated by the variables that are on a larger scale.
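
The effect can be seen on a tiny example. The sketch below uses four made-up customers (the values are hypothetical, not taken from AirlinesCluster.csv) and base R's scale() as a stand-in for the normalization step. Before scaling, a difference of 1000 in Balance outweighs a difference of 50 flight transactions; after scaling, the ordering flips.

```r
# Hypothetical customers: A and B differ slightly in Balance,
# A and C differ a lot in FlightTrans. D only widens the Balance range.
toy <- data.frame(
  Balance     = c(20000, 21000, 20000, 90000),
  FlightTrans = c(0, 0, 50, 5)
)
rownames(toy) <- c("A", "B", "C", "D")

# Euclidean distances on the raw data and on the centered/scaled data.
raw_dist  <- as.matrix(dist(toy))
norm_dist <- as.matrix(dist(scale(toy)))

# Raw scale: A looks farther from B than from C, purely because of units.
raw_dist["A", "B"] > raw_dist["A", "C"]     # TRUE

# Normalized scale: A is now much closer to B than to C.
norm_dist["A", "B"] < norm_dist["A", "C"]   # TRUE
```
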
1.3 Normalizing Data with the caret Package

Let’s go ahead and normalize our data. You can normalize the variables in a data frame by using the preProcess function in the “caret” package. You should already have this package installed from Week 4, but if not, go ahead and install it with install.packages("caret"). Then load the package with library(caret).

Now, create a normalized data frame called “airlinesNorm” by running the following commands:

preproc = preProcess(airlines)

airlinesNorm = predict(preproc, airlines)

The first command pre-processes the data, and the second command performs the normalization. If you look at the summary of airlinesNorm, you should see that all of the variables now have mean zero. You can also see that each of the variables has standard deviation 1 by using the sd() function.
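
As a rough illustration of what the normalization does, the sketch below centers and scales a small made-up data frame by hand and checks the mean and sd() of each column. It assumes preProcess is left at its default method (centering and scaling) and uses base R only so that it runs without caret; the toy values are hypothetical.

```r
# Hypothetical data on two very different scales.
toy <- data.frame(
  Balance    = c(15000, 40000, 95000, 250000),
  BonusTrans = c(3, 12, 17, 30)
)

# Center and scale by hand: subtract the column mean, divide by the column sd.
manual <- as.data.frame(lapply(toy, function(x) (x - mean(x)) / sd(x)))

sapply(manual, mean)   # each value is zero up to rounding error
sapply(manual, sd)     # each value is 1

# Base R's scale() performs the same transformation.
max_diff <- max(abs(as.matrix(manual) - scale(toy)))
max_diff < 1e-12       # TRUE
```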

library(caret)
preproc = preProcess(airlines)
airlinesNorm = predict(preproc, airlines)
summary(airlinesNorm)
    Balance         QualMiles        BonusMiles       BonusTrans      FlightMiles    
 Min.   :-0.730   Min.   :-0.186   Min.   :-0.710   Min.   :-1.208   Min.   :-0.329  
 1st Qu.:-0.546   1st Qu.:-0.186   1st Qu.:-0.658   1st Qu.:-0.896   1st Qu.:-0.329  
 Median :-0.303   Median :-0.186   Median :-0.413   Median : 0.041   Median :-0.329  
 Mean   : 0.000   Mean   : 0.000   Mean   : 0.000   Mean   : 0.000   Mean   : 0.000  
 3rd Qu.: 0.187   3rd Qu.:-0.186   3rd Qu.: 0.276   3rd Qu.: 0.562   3rd Qu.:-0.106  
 Max.   :16.187   Max.   :14.223   Max.   :10.208   Max.   : 7.747   Max.   :21.680  
  FlightTrans     DaysSinceEnroll  
 Min.   :-0.362   Min.   :-1.9934  
 1st Qu.:-0.362   1st Qu.:-0.8661  
 Median :-0.362   Median :-0.0109  
 Mean   : 0.000   Mean   : 0.0000  
 3rd Qu.:-0.098   3rd Qu.: 0.8096  
 Max.   :13.610   Max.   : 2.0228  

In the normalized data, which variable has the largest maximum value?

  • FlightMiles

In the normalized data, which variable has the smallest minimum value?

  • DaysSinceEnroll



2. Hierarchical Clustering

2.1 Choosing the Number of Clusters from the Dendrogram and Application Needs

Compute the distances between data points (using Euclidean distance) and then run the hierarchical clustering algorithm (using method = "ward.D") on the normalized data. It may take a few minutes for the commands to finish since the dataset has a large number of observations for hierarchical clustering.

Then, plot the dendrogram of the hierarchical clustering process. Suppose the airline is looking for somewhere between 2 and 10 clusters.

distance = dist(airlinesNorm, method = "euclidean")
clusterairlines = hclust(distance, method = "ward.D")
plot(clusterairlines)

According to the dendrogram, which of the following is NOT a good choice for the number of clusters?

  • 6
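
The dendrogram can also be read numerically. The sketch below is one possible heuristic, not the method used in the exercise: it runs hclust on hypothetical toy data and, for each k from 2 to 10, measures the vertical gap between the merge that leaves k clusters and the next merge. A wide gap means a horizontal cut giving k clusters has plenty of room; a very narrow gap marks a less natural choice. The names toy, hc, and gap are assumptions for illustration.

```r
set.seed(1)
# Hypothetical toy data: three well-separated groups of 20 points in 2-D.
toy <- rbind(
  matrix(rnorm(40, mean = 0,  sd = 0.3), ncol = 2),
  matrix(rnorm(40, mean = 5,  sd = 0.3), ncol = 2),
  matrix(rnorm(40, mean = 10, sd = 0.3), ncol = 2)
)
hc <- hclust(dist(toy, method = "euclidean"), method = "ward.D")

# hc$height holds the n - 1 merge heights in increasing order.
# After merge number (n - k) there are exactly k clusters, so a cut that
# yields k clusters lies between height[n - k] and height[n - k + 1].
n   <- nrow(toy)
k   <- 2:10
gap <- hc$height[n - k + 1] - hc$height[n - k]
names(gap) <- k

round(gap, 2)   # larger values = more room to cut at that k
```

On the airline data the same lines would be applied to clusterairlines instead of hc.
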
2.2 Cutting the Tree into Clusters

Suppose that after looking at the dendrogram and discussing with the marketing department, the airline decides to proceed with 5 clusters. Divide the data points into 5 clusters by using the cutree function.

hierGroups = cutree(clusterairlines, k = 5)
table(hierGroups)
hierGroups
   1    2    3    4    5 
 776  519  494  868 1342 

How many data points are in Cluster 1?

  • 776
2.3 Inferring Cluster Characteristics from the Segmentation Variable Means

Now, use tapply to compare the average values in each of the variables for the 5 clusters (the centroids of the clusters). You may want to compute the average values of the unnormalized data so that it is easier to interpret. You can do this for the variable “Balance” with the following command:

tapply(airlines$Balance, hierGroups, mean)

names(airlines)
[1] "Balance"         "QualMiles"       "BonusMiles"      "BonusTrans"      "FlightMiles"    
[6] "FlightTrans"     "DaysSinceEnroll"
airlines$hierGroups = hierGroups
sapply(split(airlines[,1:7], hierGroups), colMeans)
                         1          2          3          4          5
Balance         57866.9046 110669.266 198191.575 52335.9136 36255.9098
QualMiles           0.6443   1065.983     30.346     4.8479     2.5112
BonusMiles      10360.1237  22881.763  55795.860 20788.7661  2264.7876
BonusTrans         10.8235     18.229     19.664    17.0876     2.9732
FlightMiles        83.1843   2613.418    327.676   111.5737   119.3219
FlightTrans         0.3028      7.403      1.069     0.3445     0.4389
DaysSinceEnroll  6235.3647   4402.414   5615.709  2840.8226  3060.0812
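
To read off which cluster has the largest average for each variable without scanning the table by eye, a small sketch like the following can help. It re-enters the centroid values printed above into a matrix (the name centroids is an assumption; in a live session the sapply result could be stored and used directly) and applies which.max row by row.

```r
# Centroid table copied from the output above: rows = variables, columns = clusters.
centroids <- rbind(
  Balance         = c(57866.9046, 110669.266, 198191.575, 52335.9136, 36255.9098),
  QualMiles       = c(0.6443, 1065.983, 30.346, 4.8479, 2.5112),
  BonusMiles      = c(10360.1237, 22881.763, 55795.860, 20788.7661, 2264.7876),
  BonusTrans      = c(10.8235, 18.229, 19.664, 17.0876, 2.9732),
  FlightMiles     = c(83.1843, 2613.418, 327.676, 111.5737, 119.3219),
  FlightTrans     = c(0.3028, 7.403, 1.069, 0.3445, 0.4389),
  DaysSinceEnroll = c(6235.3647, 4402.414, 5615.709, 2840.8226, 3060.0812)
)
colnames(centroids) <- as.character(1:5)

# For each variable, the index of the cluster with the largest mean.
largest <- apply(centroids, 1, which.max)

# Group the variable names by winning cluster; clusters that never win
# appear as empty entries.
byCluster <- split(names(largest), factor(largest, levels = 1:5))
byCluster
```

Clusters 4 and 5 come back empty, which corresponds to the answer "None" for those clusters.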

Compared to the other clusters, Cluster 1 has the largest average values in which variables (if any)? Select all that apply.

  • DaysSinceEnroll

How would you describe the customers in Cluster 1?

  • Infrequent but loyal customers.
2.4 Cluster 2

Compared to the other clusters, Cluster 2 has the largest average values in which variables (if any)? Select all that apply.

  • QualMiles
  • FlightMiles
  • FlightTrans

How would you describe the customers in Cluster 2?

  • Customers who have accumulated a large amount of miles, and the ones with the largest number of flight transactions.
2.5 Cluster 3

Compared to the other clusters, Cluster 3 has the largest average values in which variables (if any)? Select all that apply.

  • Balance
  • BonusMiles
  • BonusTrans

How would you describe the customers in Cluster 3?

  • Customers who have accumulated a large amount of miles, mostly through non-flight transactions.
2.6 Cluster 4

Compared to the other clusters, Cluster 4 has the largest average values in which variables (if any)? Select all that apply.

  • None

How would you describe the customers in Cluster 4?

  • Relatively new customers who seem to be accumulating miles, mostly through non-flight transactions.
2.7 Cluster 5

Compared to the other clusters, Cluster 5 has the largest average values in which variables (if any)? Select all that apply.

  • None

How would you describe the customers in Cluster 5?

  • Relatively new customers who don’t use the airline very often.

3. K-Means Clustering

3.1 K-Means Clustering

Now run the k-means clustering algorithm on the normalized data, again creating 5 clusters. Set the seed to 88 right before running the clustering algorithm, and set the argument iter.max to 1000.

set.seed(88)
KMC = kmeans(airlinesNorm, centers = 5, iter.max = 1000)
table(KMC$cluster)

   1    2    3    4    5 
 408  141  993 1182 1275 

How many clusters have more than 1,000 observations?

  • 2
3.2 Correspondence between Hierarchical and K-Means Clusters

Now, compare the cluster centroids to each other either by dividing the data points into groups and then using tapply, or by looking at the output of kmeansClust$centers, where "kmeansClust" is the name of the output of the kmeans function (KMC in this notebook). (Note that the output of kmeansClust$centers will be for the normalized data. If you want to look at the average values for the unnormalized data, you need to use tapply like we did for hierarchical clustering.)
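
The relationship between the normalized centers and the unnormalized averages can be sketched on toy data. The example below is only an illustration under stated assumptions: the data are made up, and base R's scale() stands in for caret's preProcess so that the column means and standard deviations are easy to retrieve. Multiplying each center by the column sd and adding the column mean gives the same numbers as the tapply approach.

```r
set.seed(88)
# Hypothetical toy data with two groups on very different scales.
toy <- data.frame(
  Balance    = c(rnorm(30, 20000, 2000), rnorm(30, 90000, 5000)),
  BonusTrans = c(rnorm(30, 3, 1),        rnorm(30, 18, 2))
)
toyNorm <- scale(toy)   # stores the column means and sds as attributes
km <- kmeans(toyNorm, centers = 2, iter.max = 1000)

# Undo the normalization on the centers: multiply by sd, then add the mean.
mu    <- attr(toyNorm, "scaled:center")
sigma <- attr(toyNorm, "scaled:scale")
centersRaw <- sweep(sweep(km$centers, 2, sigma, "*"), 2, mu, "+")

# The tapply route: average the unnormalized data within each cluster.
viaTapply <- sapply(toy, function(x) tapply(x, km$cluster, mean))

all.equal(unname(centersRaw), unname(viaTapply))   # TRUE
```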

table(hierGroups,KMC$cluster)
          
hierGroups    1    2    3    4    5
         1    4    0   98  673    1
         2   92  137  105   92   93
         3  300    4  132   58    0
         4   12    0  653   30  173
         5    0    0    5  329 1008
sapply(split(airlines[,1:7],airlines$hierGroups),colMeans)
                         1          2          3          4          5
Balance         57866.9046 110669.266 198191.575 52335.9136 36255.9098
QualMiles           0.6443   1065.983     30.346     4.8479     2.5112
BonusMiles      10360.1237  22881.763  55795.860 20788.7661  2264.7876
BonusTrans         10.8235     18.229     19.664    17.0876     2.9732
FlightMiles        83.1843   2613.418    327.676   111.5737   119.3219
FlightTrans         0.3028      7.403      1.069     0.3445     0.4389
DaysSinceEnroll  6235.3647   4402.414   5615.709  2840.8226  3060.0812

Do you expect Cluster 1 of the K-Means clustering output to necessarily be similar to Cluster 1 of the Hierarchical clustering output?

  • No, because cluster ordering is not meaningful in either k-means clustering or hierarchical clustering.
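
Since the labels themselves carry no meaning, the correspondence has to be read from the cross-table. The sketch below re-enters the table(hierGroups, KMC$cluster) counts shown above into a matrix (the names xtab, bestMatch, and share are assumptions for illustration) and finds, for each hierarchical cluster, the k-means cluster that holds most of its members.

```r
# Counts copied from the cross-table above: rows = hierarchical, columns = k-means.
xtab <- matrix(
  c(  4,   0,  98, 673,    1,
     92, 137, 105,  92,   93,
    300,   4, 132,  58,    0,
     12,   0, 653,  30,  173,
      0,   0,   5, 329, 1008),
  nrow = 5, byrow = TRUE,
  dimnames = list(hierGroups = as.character(1:5), kmeans = as.character(1:5))
)

# For each hierarchical cluster, the k-means cluster with the largest overlap.
bestMatch <- apply(xtab, 1, which.max)
bestMatch

# Share of each hierarchical cluster that falls into its best match.
share <- apply(xtab, 1, max) / rowSums(xtab)
round(share, 2)
```

Hierarchical cluster 1 lines up mostly with k-means cluster 4, not with k-means cluster 1, while hierarchical cluster 2 is spread across all five k-means clusters, so the two solutions are not simple relabelings of each other.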


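【Discussion Questions】

Give each of the five clusters a name

  • 1. Regular (Select) members
    • Members who enrolled long ago but rarely use the service. They may be highly loyal to the company, or they may have been drawn in by the sign-up rewards, but their need to fly is low.
  • 2. Premier Gold (Diamond) members
    • VIP customers who buy the company's core service. Although this is one of the smallest groups (519 members), it may be the company's main source of profit.
  • 3. Veteran free riders
    • Customers who soak up the extra benefits the company offers yet almost never buy its core service. They are not merely riding for free; they are practically driving the free ride.
  • 4. Adolescent members
    • Members who joined only recently and, like adolescents, have not yet settled into a pattern. They may grow into Premier Gold members, but they may also become veteran free riders.
  • 5. Lurker members
    • Members who joined fairly recently and are still below the surface, with no clear purchasing behavior yet.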

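Design a marketing strategy for each of the five clusters

Provide customized service through an app

  • Regular (Select) members
    • Goal: keep the brand top of mind and maintain the customer relationship
    • Send e-coupons on a regular schedule to remind customers of current offers. This ensures the company is the first one customers think of when a need arises, and because e-coupons carry no menu cost, the company loses little if a coupon expires unused.
  • Premier Gold (Diamond) members
    • Goal: keep service seamless and offer exclusive privileges
    • Because switching costs in the airline industry are high, customers tend to show consumer inertia. The marketing plan for this group is therefore not extra promotions but a smooth service process plus a package of exclusive privileges. Examples: 24-hour online customer service reserved for them with a guaranteed reply within 20 minutes, several VIP seats held on every flight, and a complimentary round-trip international ticket for the member (plus one companion) during their birthday month.
  • Veteran free riders
    • Goal: reduce the bonuses they consume to lower the company's opportunity cost
    • Offer them bundled packages that combine partial points redemption with a lower fare, to motivate them to purchase.
  • Adolescent members
    • Goal: keep them from becoming veteran free riders
    • Adjust the bonus program so that BonusMiles and BonusTrans are not too easy to accumulate, and run occasional "flash sale" fare promotions to help them build up FlightMiles and FlightTrans while cultivating brand loyalty and preference.
  • Lurker members
    • Apart from regular company announcements, no dedicated marketing strategy for now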

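Is the statistically best clustering also the best clustering in practice?

  • Not necessarily. It still depends on the analyst's own judgment: they should use their domain knowledge to choose a clustering that is reasonable and interpretable (one that tells a story).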

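Besides within-cluster and between-cluster distances, what other factors usually need to be considered when clustering in practice?

  • Density of the data distribution
  • Shape of the data distribution
  • Demographics: e.g. gender, age, income, and occupation, and even marital status and household size.
  • Other purchasing habits: e.g. number of purchases and amount spent over the last 3 years, 1 year, and 6 months, as well as frequently visited countries.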






