Main topic: market segmentation by customer attributes

Learning points:

Group discussion:

rm(list=ls(all=T))
Sys.setlocale("LC_ALL","C")
[1] "C"
options(digits=4, scipen=12)
library(dplyr)
library(caret)



1. Data normalization

1.1 Data summary

Read the dataset AirlinesCluster.csv into R and call it “airlines” (this notebook stores it under the shorter name A). Looking at the summary of airlines,

A = read.csv('data/AirlinesCluster.csv')
summary(A)
    Balance          QualMiles       BonusMiles    
 Min.   :      0   Min.   :    0   Min.   :     0  
 1st Qu.:  18528   1st Qu.:    0   1st Qu.:  1250  
 Median :  43097   Median :    0   Median :  7171  
 Mean   :  73601   Mean   :  144   Mean   : 17145  
 3rd Qu.:  92404   3rd Qu.:    0   3rd Qu.: 23800  
 Max.   :1704838   Max.   :11148   Max.   :263685  
   BonusTrans    FlightMiles     FlightTrans   
 Min.   : 0.0   Min.   :    0   Min.   : 0.00  
 1st Qu.: 3.0   1st Qu.:    0   1st Qu.: 0.00  
 Median :12.0   Median :    0   Median : 0.00  
 Mean   :11.6   Mean   :  460   Mean   : 1.37  
 3rd Qu.:17.0   3rd Qu.:  311   3rd Qu.: 1.00  
 Max.   :86.0   Max.   :30817   Max.   :53.00  
 DaysSinceEnroll
 Min.   :   2   
 1st Qu.:2330   
 Median :4096   
 Mean   :4119   
 3rd Qu.:5790   
 Max.   :8296   
colMeans(A) %>% sort
    FlightTrans      BonusTrans       QualMiles 
          1.374          11.602         144.115 
    FlightMiles DaysSinceEnroll      BonusMiles 
        460.056        4118.559       17144.846 
        Balance 
      73601.328 

which TWO variables have (on average) the smallest values?

  • FlightTrans
  • BonusTrans

Which TWO variables have (on average) the largest values?

  • BonusMiles
  • Balance
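1.1 Group discussion
  • summary shows the quartile distribution of each variable.
  • colMeans combined with sort makes it easy to rank the columns by their means.
1.2 Why normalize the data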

In this problem, we will normalize our data before we run the clustering algorithms.

Why is it important to normalize the data before clustering?

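  • If the variables are on different scales, for example blood pressure and salary, the distance calculation is dominated by the variable with the larger scale (salary).

  • Normalizing subtracts each variable's mean and divides by its standard deviation, so every value is expressed as a number of standard deviations from the mean. The variables then have comparable scales, and no single one dominates the distances.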
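The effect described above can be seen on a tiny example. This is only a sketch: the three people and their values are made up for illustration, and base R's scale() stands in for the normalization step.

```r
# Toy data (hypothetical values): three people, two variables on very different scales.
people <- data.frame(
  blood_pressure = c(110, 150, 112),
  salary         = c(30000, 30500, 42000)
)
rownames(people) <- c("a", "b", "c")

# Raw Euclidean distances: salary differences swamp blood-pressure differences.
d_raw <- as.matrix(dist(people))
round(d_raw)

# z-score each column: subtract the column mean, divide by the column sd.
people_z <- scale(people)
d_z <- as.matrix(dist(people_z))
round(d_z, 2)

# On the raw data, a looks far closer to b than to c, purely because of salary.
d_raw["a", "b"] / d_raw["a", "c"]
# After standardizing, the large blood-pressure gap between a and b counts as well.
d_z["a", "b"] / d_z["a", "c"]
```

Comparing the two ratios shows how much of the raw distance was driven by the unit of measurement rather than by the customers themselves.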

1.3 Normalizing the data with the caret package

Let’s go ahead and normalize our data. You can normalize the variables in a data frame by using the preProcess function in the “caret” package. You should already have this package installed from Week 4, but if not, go ahead and install it with install.packages("caret"). Then load the package with library(caret).

Now, create a normalized data frame called “airlinesNorm” by running the following commands:

preproc = preProcess(airlines)

airlinesNorm = predict(preproc, airlines)

The first command pre-processes the data, and the second command performs the normalization. If you look at the summary of airlinesNorm, you should see that all of the variables now have mean zero. You can also see that each of the variables has standard deviation 1 by using the sd() function.
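The two-step pattern (learn the statistics, then apply them) can be imitated in base R. The sketch below is not caret's implementation; it assumes the default centering and scaling and uses a made-up data frame named toy.

```r
# Toy data standing in for the airline data (values are made up).
toy <- as.matrix(data.frame(Balance     = c(100, 2000, 50000, 7000),
                            FlightTrans = c(0, 1, 12, 3)))

# Step 1 (analogous to preProcess): learn the per-column mean and sd.
mu    <- colMeans(toy)
sigma <- apply(toy, 2, sd)

# Step 2 (analogous to predict): apply (x - mean) / sd column by column.
toy_norm <- sweep(sweep(toy, 2, mu, "-"), 2, sigma, "/")

# Every column now has mean 0 and standard deviation 1.
round(colMeans(toy_norm), 10)
apply(toy_norm, 2, sd)
```

Keeping the learned mean and sd separate from the data is what allows the same transformation to be applied later to new rows.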

library(caret)
preproc = preProcess(A)
AN = predict(preproc, A)  
summary(AN)
    Balance         QualMiles        BonusMiles    
 Min.   :-0.730   Min.   :-0.186   Min.   :-0.710  
 1st Qu.:-0.546   1st Qu.:-0.186   1st Qu.:-0.658  
 Median :-0.303   Median :-0.186   Median :-0.413  
 Mean   : 0.000   Mean   : 0.000   Mean   : 0.000  
 3rd Qu.: 0.187   3rd Qu.:-0.186   3rd Qu.: 0.276  
 Max.   :16.187   Max.   :14.223   Max.   :10.208  
   BonusTrans      FlightMiles      FlightTrans    
 Min.   :-1.208   Min.   :-0.329   Min.   :-0.362  
 1st Qu.:-0.896   1st Qu.:-0.329   1st Qu.:-0.362  
 Median : 0.041   Median :-0.329   Median :-0.362  
 Mean   : 0.000   Mean   : 0.000   Mean   : 0.000  
 3rd Qu.: 0.562   3rd Qu.:-0.106   3rd Qu.:-0.098  
 Max.   : 7.747   Max.   :21.680   Max.   :13.610  
 DaysSinceEnroll  
 Min.   :-1.9934  
 1st Qu.:-0.8661  
 Median :-0.0109  
 Mean   : 0.0000  
 3rd Qu.: 0.8096  
 Max.   : 2.0228  
apply(AN, 2, mean) %>% round(3)
        Balance       QualMiles      BonusMiles 
              0               0               0 
     BonusTrans     FlightMiles     FlightTrans 
              0               0               0 
DaysSinceEnroll 
              0 
apply(AN, 2, sd) %>% round(3)
        Balance       QualMiles      BonusMiles 
              1               1               1 
     BonusTrans     FlightMiles     FlightTrans 
              1               1               1 
DaysSinceEnroll 
              1 
apply(AN, 2, max) %>% sort
DaysSinceEnroll      BonusTrans      BonusMiles 
          2.023           7.747          10.208 
    FlightTrans       QualMiles         Balance 
         13.610          14.223          16.187 
    FlightMiles 
         21.680 

In the normalized data, which variable has the largest maximum value?

  • The table above shows that FlightMiles has the largest maximum of all the variables (21.680).
apply(AN, 2, min) %>% sort
DaysSinceEnroll      BonusTrans         Balance      BonusMiles     FlightTrans 
        -1.9934         -1.2081         -0.7303         -0.7099         -0.3621 
    FlightMiles       QualMiles 
        -0.3286         -0.1863 

In the normalized data, which variable has the smallest minimum value?

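  • The table above shows that DaysSinceEnroll has the smallest minimum of all the variables (-1.9934).
1.3 Group discussion
  • Check the scale of the variables before clustering, so that large-scale variables do not dominate the distances.
  • Learn the scale function or the caret package.
  • The apply family combined with sort makes it easy to find the largest or smallest values.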



2. Hierarchical clustering

2.1 Choosing the number of clusters from the dendrogram and the application's needs

Compute the distances between data points (using euclidean distance) and then run the Hierarchical clustering algorithm (using method="ward.D") on the normalized data. It may take a few minutes for the commands to finish since the dataset has a large number of observations for hierarchical clustering.

Then, plot the dendrogram of the hierarchical clustering process. Suppose the airline is looking for somewhere between 2 and 10 clusters.
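Why the distance step is slow can be estimated with a little arithmetic. The sketch assumes the data has 3999 rows, which is the sum of the cluster sizes that table(kg) reports in section 2.2.

```r
# Assumed row count: 3999, the sum of the cluster sizes printed by table(kg).
n <- 3999

# dist() stores one value per unordered pair of rows: n * (n - 1) / 2.
n_pairs <- choose(n, 2)
n_pairs

# Each distance is a double (8 bytes); approximate size in megabytes.
n_pairs * 8 / 1024^2

# Doubling the number of rows roughly quadruples the number of pairs.
choose(2 * n, 2) / n_pairs
```

This quadratic growth is the usual reason hierarchical clustering is reserved for moderately sized data.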

d = dist(AN,method="euclidean")
hc = hclust(d, method='ward.D')
plot(hc)
# Outline 6 clusters on the dendrogram
rect.hclust(hc, k = 6, border = "red")

According to the dendrogram, which of the following is NOT a good choice for the number of clusters?

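  • As the plot above shows, the tree stays at 6 clusters over only a short range of heights, so those clusters are hard to tell apart.
  • So 6 is the less suitable choice (among the options from 2 to 7 clusters).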
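The visual judgment above can be backed by numbers taken from the merge heights. The following is a sketch on synthetic data (three made-up blobs), not on the airline data; hc_toy, h and gap are illustrative names.

```r
set.seed(1)
# Synthetic stand-in for the normalized data: three well-separated 2-D blobs.
x <- rbind(matrix(rnorm(40, mean = 0,  sd = 0.3), ncol = 2),
           matrix(rnorm(40, mean = 5,  sd = 0.3), ncol = 2),
           matrix(rnorm(40, mean = 10, sd = 0.3), ncol = 2))

hc_toy <- hclust(dist(x), method = "ward.D")

# hc_toy$height lists the merge heights in increasing order. Reversed,
# element k - 1 is the height at which k clusters are merged into k - 1.
h <- rev(hc_toy$height)

# Height range over which exactly k clusters exist: h[k - 1] - h[k].
k   <- 2:10
gap <- h[k - 1] - h[k]
data.frame(k = k, gap = round(gap, 2))
```

A value of k with a small gap corresponds to a cut that is hard to place on the dendrogram, which is the situation described for 6 clusters.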
2.2 Cutting the tree into groups

Suppose that after looking at the dendrogram and discussing with the marketing department, the airline decides to proceed with 5 clusters. Divide the data points into 5 clusters by using the cutree function.

kg = cutree(hc, k=5)
table(kg)
kg
   1    2    3    4    5 
 776  519  494  868 1342 

How many data points are in Cluster 1?

  • The table function shows that Cluster 1 contains 776 data points.
2.3 Inferring group characteristics from the means of the segmentation variables

Now, use tapply to compare the average values in each of the variables for the 5 clusters (the centroids of the clusters). You may want to compute the average values of the unnormalized data so that it is easier to interpret. You can do this for the variable “Balance” with the following command:

tapply(airlines$Balance, clusterGroups, mean)

sapply(split(A,kg), colMeans) %>% round(2) 
                       1         2         3        4        5
Balance         57866.90 110669.27 198191.57 52335.91 36255.91
QualMiles           0.64   1065.98     30.35     4.85     2.51
BonusMiles      10360.12  22881.76  55795.86 20788.77  2264.79
BonusTrans         10.82     18.23     19.66    17.09     2.97
FlightMiles        83.18   2613.42    327.68   111.57   119.32
FlightTrans         0.30      7.40      1.07     0.34     0.44
DaysSinceEnroll  6235.36   4402.41   5615.71  2840.82  3060.08
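The relationship between the one-variable tapply call and the split/colMeans idiom can be checked on a small example. The data frame toy and the assignment vector grp below are made up for illustration.

```r
# Toy data and a made-up cluster assignment (both hypothetical).
toy <- data.frame(Balance     = c(10, 20, 30, 40, 50, 60),
                  FlightMiles = c(0, 0, 100, 300, 0, 200))
grp <- c(1, 1, 2, 2, 3, 3)

# One variable at a time, as in the assignment's hint.
tapply(toy$Balance, grp, mean)

# All variables at once: one column per cluster, one row per variable.
centroids <- sapply(split(toy, grp), colMeans)
centroids

# For each variable, the cluster with the largest mean.
apply(centroids, 1, which.max)
```

The last line answers the "largest average value" questions directly, one variable per entry.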

Compared to the other clusters, Cluster 1 has the largest average values in which variables (if any)? Select all that apply.

  • DaysSinceEnroll

How would you describe the customers in Cluster 1?

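  • Cluster 1 has been enrolled the longest, but its flight activity (QualMiles, FlightMiles, FlightTrans) is the lowest of all the clusters.
  • So these are infrequent but loyal customers.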
2.4 Cluster 2
split(AN,kg) %>% sapply(colMeans) %>% round(2)
                    1    2     3     4     5
Balance         -0.16 0.37  1.24 -0.21 -0.37
QualMiles       -0.19 1.19 -0.15 -0.18 -0.18
BonusMiles      -0.28 0.24  1.60  0.15 -0.62
BonusTrans      -0.08 0.69  0.84  0.57 -0.90
FlightMiles     -0.27 1.54 -0.09 -0.25 -0.24
FlightTrans     -0.28 1.59 -0.08 -0.27 -0.25
DaysSinceEnroll  1.03 0.14  0.72 -0.62 -0.51
par(cex=0.8)
split(AN,kg) %>% sapply(colMeans) %>% barplot(beside=T,col=rainbow(7))
legend('topright',legend=colnames(A),fill=rainbow(7))

Compared to the other clusters, Cluster 2 has the largest average values in which variables (if any)? Select all that apply.

  • QualMiles
  • FlightMiles
  • FlightTrans

How would you describe the customers in Cluster 2?

  • Cluster 2 has accumulated a lot of miles and has the most flight transactions.
  • So these are customers who have accumulated a large amount of miles, and the ones with the largest number of flight transactions.
2.5 Cluster 3

Compared to the other clusters, Cluster 3 has the largest average values in which variables (if any)? Select all that apply.

  • Balance
  • BonusMiles
  • BonusTrans

How would you describe the customers in Cluster 3?

  • Cluster 3 customers have accumulated a lot of miles, mainly through non-flight transactions.
  • So these are customers who have accumulated a large amount of miles, mostly through non-flight transactions.
2.6 Cluster 4

Compared to the other clusters, Cluster 4 has the largest average values in which variables (if any)? Select all that apply.

  • Cluster 4 does not have the largest average in any variable.

How would you describe the customers in Cluster 4?

  • Cluster 4 has the shortest time since enrollment, so these are newer customers, and they have accumulated some miles through non-flight transactions.
  • So these are relatively new customers who seem to be accumulating miles, mostly through non-flight transactions.
2.7 Cluster 5

Compared to the other clusters, Cluster 5 has the largest average values in which variables (if any)? Select all that apply.

  • Cluster 5 does not have the largest average in any variable.

How would you describe the customers in Cluster 5?

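  • Cluster 5 customers are new and have accumulated hardly any miles.
  • So these are relatively new customers who don’t use the airline very often.
Group discussion 2
  • This dataset let us practice using clustering to separate customers into groups and then analyzing each group's characteristics.

  • We started with no particular idea of how to analyze the data as a whole. After clustering, we found that different groups have different characteristics, so we can tailor a different offer to each group's needs.

  • We learned how clustering can help retain core customers, reactivate dormant ones, and so on. Analyzing each group's characteristics first and making decisions second makes the work much more efficient.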


3. K-Means clustering

3.1 K-Means clustering

Now run the k-means clustering algorithm on the normalized data, again creating 5 clusters. Set the seed to 88 right before running the clustering algorithm, and set the argument iter.max to 1000.

set.seed(88)
km = kmeans(AN, 5, iter.max = 1000)
kg2 = km$cluster
table(kg2)
kg2
   1    2    3    4    5 
 408  141  993 1182 1275 

How many clusters have more than 1,000 observations?

  • Cluster 4 has 1182 observations and Cluster 5 has 1275.
  • So two clusters have more than 1,000 observations.
par(cex=0.8)
km$centers %>% round(2) %>% t %>% barplot(beside=T,col=rainbow(7))
legend('topright',legend=colnames(A),fill=rainbow(7))

3.2 Correspondence between the hierarchical and K-Means clusters

Now, compare the cluster centroids to each other either by dividing the data points into groups and then using tapply, or by looking at the output of kmeansClust$centers, where "kmeansClust" is the name of the output of the kmeans function. (Note that the output of kmeansClust$centers will be for the normalized data. If you want to look at the average values for the unnormalized data, you need to use tapply like we did for hierarchical clustering.)
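The note about normalized centers can be made concrete. The sketch below uses synthetic data and base R's scale() as a stand-in for the caret normalization; raw, z and km_toy are illustrative names. It converts the centers back to the original units and checks them against the raw-data group means.

```r
set.seed(88)
# Synthetic stand-in data: two variables on very different scales.
raw <- data.frame(miles = c(rnorm(30, 1000, 100), rnorm(30, 20000, 100)),
                  trans = c(rnorm(30, 2, 0.5),    rnorm(30, 15, 0.5)))

z      <- scale(raw)   # normalized copy; keeps the mean and sd as attributes
km_toy <- kmeans(z, centers = 2, nstart = 10, iter.max = 1000)

# The centers are reported in normalized units.
km_toy$centers

# Undo the scaling: x = z * sd + mean.
mu    <- attr(z, "scaled:center")
sigma <- attr(z, "scaled:scale")
centers_raw <- sweep(sweep(km_toy$centers, 2, sigma, "*"), 2, mu, "+")
centers_raw

# Same numbers as averaging the raw data within each cluster.
means_raw <- t(sapply(split(raw, km_toy$cluster), colMeans))
all.equal(unname(centers_raw), unname(means_raw))
```

Either route gives centroids in the original units, which are easier to interpret when describing the clusters.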

table(Hierarchical=kg, KMeans=kg2)
            KMeans
Hierarchical    1    2    3    4    5
           1    4    0   98  673    1
           2   92  137  105   92   93
           3  300    4  132   58    0
           4   12    0  653   30  173
           5    0    0    5  329 1008

Do you expect Cluster 1 of the K-Means clustering output to necessarily be similar to Cluster 1 of the Hierarchical clustering output?

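  • K-Means computes each cluster's centroid and iteratively reassigns every data point to the cluster whose centroid is nearest.
  • No. The cluster numbers are arbitrary labels in both methods, and the K-Means result also depends on its random starting centroids, so K-Means Cluster 1 need not correspond to hierarchical Cluster 1. In the table above, hierarchical Cluster 1 overlaps mostly with K-Means Cluster 4 (673 points).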
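A cross-tabulation can also be used to match the labels of the two methods. This is a sketch on synthetic, well-separated blobs, where both methods should recover the same groups under possibly different numbers; hc_lab, km_lab and best_match are illustrative names.

```r
set.seed(1)
# Synthetic data: three well-separated 2-D blobs.
x <- rbind(matrix(rnorm(40, mean = 0,  sd = 0.3), ncol = 2),
           matrix(rnorm(40, mean = 5,  sd = 0.3), ncol = 2),
           matrix(rnorm(40, mean = 10, sd = 0.3), ncol = 2))

hc_lab <- cutree(hclust(dist(x), method = "ward.D"), k = 3)
km_lab <- kmeans(x, centers = 3, nstart = 25)$cluster

# Rows: hierarchical labels; columns: K-Means labels.
tab <- table(Hierarchical = hc_lab, KMeans = km_lab)
tab

# For each hierarchical cluster, the K-Means label that overlaps it most.
best_match <- apply(tab, 1, which.max)
best_match
```

On real data such as the airline table, the overlap is partial, so the best match is a guide rather than a one-to-one mapping.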


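【Discussion questions】

Give each of the five groups a name (see the chart in 2.4 Cluster 2)

  • Group 1: dormant customers; the longest-standing customers, but all of their mileage figures are below average
  • Group 2: core customers
  • Group 3: indirect customers (members brought in through partner companies)
  • Group 4: potential customers
  • Group 5: new customers

Design a marketing strategy for each of the five groups

  • Group 1: offer occasional deals to encourage them to fly more often
  • Group 2: a points accumulation program
  • Group 3: joint promotions (flight plus accommodation deals)
  • Group 4: provide discount coupons to regain the attention of existing customers
  • Group 5: recommend bundled offers suited to new customers' needs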

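Is the statistically best clustering also the best clustering in practice?

  • Not necessarily. Statistics lets us make assumptions about how the data are distributed, but we still need to look at the actual distribution and visualize what the resulting clusters look like.
  • Statistically, we can usually evaluate how well a clustering separates the groups. In practice, because we tailor a different offer to each group, we should judge for ourselves whether to use more or fewer clusters.
  • In short, statistics helps us assess the quality of a clustering, but people make the decisions. Statistics makes it easier to find each group's characteristics, while real-world circumstances may bring other considerations.

Besides within-cluster and between-cluster distances, what other factors usually need to be considered when clustering in practice?

  • The size of the data: with a lot of data we use K-Means; with little data, hierarchical clustering.
  • Whether the data contain noise, which should be removed because it distorts the clusters.
  • Within-cluster variance is also commonly used to judge the quality of a clustering.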
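Within-cluster variance can be tracked across different numbers of clusters. The sketch below runs on synthetic blobs rather than the airline data and reads the tot.withinss and betweenss components that kmeans returns; k_values, wss and explained are illustrative names.

```r
set.seed(1)
# Synthetic data: three well-separated 2-D blobs.
x <- rbind(matrix(rnorm(40, mean = 0,  sd = 0.3), ncol = 2),
           matrix(rnorm(40, mean = 5,  sd = 0.3), ncol = 2),
           matrix(rnorm(40, mean = 10, sd = 0.3), ncol = 2))

k_values <- 2:6
fits <- lapply(k_values, function(k) kmeans(x, centers = k, nstart = 20))

# Total within-cluster sum of squares for each k (smaller means tighter clusters).
wss <- sapply(fits, function(f) f$tot.withinss)

# Share of the total variation explained by the split into clusters.
explained <- sapply(fits, function(f) f$betweenss / f$totss)

data.frame(k = k_values, wss = round(wss, 1), explained = round(explained, 3))
```

A sharp drop in wss followed by a plateau suggests a statistically reasonable number of clusters, which can then be weighed against the practical considerations listed above.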
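Group summary
  • In this example we used clustering to carry out STP (segmentation, targeting, positioning) in marketing.
  • When we had no particular idea of how to analyze the data as a whole, we used clustering to separate the customers into groups and find each group's characteristics.
  • Before segmenting a market, we first need to know which segment each customer belongs to, which is why we cluster. Once we see that different groups have different characteristics, we can cater to their preferences with tailored marketing strategies. Matching the right strategy to the right people improves overall efficiency.
  • When we segment a market in the future, we should use clustering to analyze each group's characteristics before making decisions.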






