1. Data Normalization
1.1 Data Summary
Read the dataset AirlinesCluster.csv into R and call it “airlines”. Looking at the summary of airlines,
A = read.csv("AirlinesCluster.csv")
summary(A)
Balance QualMiles BonusMiles BonusTrans FlightMiles
Min. : 0 Min. : 0 Min. : 0 Min. : 0.0 Min. : 0
1st Qu.: 18528 1st Qu.: 0 1st Qu.: 1250 1st Qu.: 3.0 1st Qu.: 0
Median : 43097 Median : 0 Median : 7171 Median :12.0 Median : 0
Mean : 73601 Mean : 144 Mean : 17145 Mean :11.6 Mean : 460
3rd Qu.: 92404 3rd Qu.: 0 3rd Qu.: 23800 3rd Qu.:17.0 3rd Qu.: 311
Max. :1704838 Max. :11148 Max. :263685 Max. :86.0 Max. :30817
FlightTrans DaysSinceEnroll
Min. : 0.00 Min. : 2
1st Qu.: 0.00 1st Qu.:2330
Median : 0.00 Median :4096
Mean : 1.37 Mean :4119
3rd Qu.: 1.00 3rd Qu.:5790
Max. :53.00 Max. :8296
colMeans(A) %>% sort
FlightTrans BonusTrans QualMiles FlightMiles DaysSinceEnroll
1.374 11.602 144.115 460.056 4118.559
BonusMiles Balance
17144.846 73601.328
colMeans() computes the mean of each column.
Which TWO variables have (on average) the smallest values?
Which TWO variables have (on average) the largest values?
1.2 Why Normalize the Data
In this problem, we will normalize our data before we run the clustering algorithms.
Why is it important to normalize the data before clustering?
- If we don’t normalize the data, the clustering will be dominated by the variables that are on a larger scale.
1.3 Normalizing the Data with the caret Package
Let’s go ahead and normalize our data. You can normalize the variables in a data frame by using the preProcess function in the “caret” package. You should already have this package installed from Week 4, but if not, go ahead and install it with install.packages(“caret”). Then load the package with library(caret).
Now, create a normalized data frame called “airlinesNorm” by running the following commands:
preproc = preProcess(airlines)
airlinesNorm = predict(preproc, airlines)
The first command pre-processes the data, and the second command performs the normalization. If you look at the summary of airlinesNorm, you should see that all of the variables now have mean zero. You can also see that each of the variables has standard deviation 1 by using the sd() function.
library(caret)
preproc = preProcess(A)
AN = predict(preproc, A)
summary(AN)
Balance QualMiles BonusMiles BonusTrans FlightMiles
Min. :-0.730 Min. :-0.186 Min. :-0.710 Min. :-1.208 Min. :-0.329
1st Qu.:-0.546 1st Qu.:-0.186 1st Qu.:-0.658 1st Qu.:-0.896 1st Qu.:-0.329
Median :-0.303 Median :-0.186 Median :-0.413 Median : 0.041 Median :-0.329
Mean : 0.000 Mean : 0.000 Mean : 0.000 Mean : 0.000 Mean : 0.000
3rd Qu.: 0.187 3rd Qu.:-0.186 3rd Qu.: 0.276 3rd Qu.: 0.562 3rd Qu.:-0.106
Max. :16.187 Max. :14.223 Max. :10.208 Max. : 7.747 Max. :21.680
FlightTrans DaysSinceEnroll
Min. :-0.362 Min. :-1.9934
1st Qu.:-0.362 1st Qu.:-0.8661
Median :-0.362 Median :-0.0109
Mean : 0.000 Mean : 0.0000
3rd Qu.:-0.098 3rd Qu.: 0.8096
Max. :13.610 Max. : 2.0228
apply(AN, 2 ,mean) %>% round(3)
Balance QualMiles BonusMiles BonusTrans FlightMiles
0 0 0 0 0
FlightTrans DaysSinceEnroll
0 0
apply(AN, 2 ,sd) %>% round(3)
Balance QualMiles BonusMiles BonusTrans FlightMiles
1 1 1 1 1
FlightTrans DaysSinceEnroll
1 1
apply() applies a function over the rows or the columns of a matrix or data frame (MARGIN = 1 for rows, MARGIN = 2 for columns).
apply(AN, 2, max) %>% sort
DaysSinceEnroll BonusTrans BonusMiles FlightTrans QualMiles
2.023 7.747 10.208 13.610 14.223
Balance FlightMiles
16.187 21.680
In the normalized data, which variable has the largest maximum value?
apply(AN, 2, min) %>% sort
DaysSinceEnroll BonusTrans Balance BonusMiles FlightTrans
-1.9934 -1.2081 -0.7303 -0.7099 -0.3621
FlightMiles QualMiles
-0.3286 -0.1863
In the normalized data, which variable has the smallest minimum value?
2. Hierarchical Clustering
2.1 Choosing the Number of Clusters from the Dendrogram and the Business Need
Compute the distances between data points (using euclidean distance) and then run the Hierarchical clustering algorithm (using method=“ward.D”) on the normalized data. It may take a few minutes for the commands to finish since the dataset has a large number of observations for hierarchical clustering.
Then, plot the dendrogram of the hierarchical clustering process. Suppose the airline is looking for somewhere between 2 and 10 clusters.
d = dist(AN,method="euclidean")
hc = hclust(d, method='ward.D')
plot(hc)

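Notes: dist() computes the pairwise distances and returns a distance matrix; method=“euclidean” selects Euclidean distance. hclust() performs hierarchical clustering; method=“ward.D” selects Ward's method.
Height in the dendrogram reflects distance: the taller a merge, the farther the points at the bottom are from the center of the resulting cluster.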
According to the dendrogram, which of the following is NOT a good choice for the number of clusters?
2.2 Cutting the Tree into Groups
Suppose that after looking at the dendrogram and discussing with the marketing department, the airline decides to proceed with 5 clusters. Divide the data points into 5 clusters by using the cutree function. The groups are then examined through the segmentation variables.
CG = cutree(hc, k=5)
table(CG)
CG
1 2 3 4 5
776 519 494 868 1342
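Clustering places similar observations in the same group so that within-group differences are small, which makes the groups easier to manage. Splitting into groups is a matter of examining the segmentation variables. cutree() cuts the tree into groups; k is the number of clusters requested.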
How many data points are in Cluster 1?
2.3 Inferring Cluster Characteristics from the Means of the Segmentation Variables
Now, use tapply to compare the average values in each of the variables for the 5 clusters (the centroids of the clusters). You may want to compute the average values of the unnormalized data so that it is easier to interpret. You can do this for the variable “Balance” with the following command:
tapply(airlines$Balance, clusterGroups, mean)
sapply(split(A,CG), colMeans) %>% round(2)
1 2 3 4 5
Balance 57866.90 110669.27 198191.57 52335.91 36255.91
QualMiles 0.64 1065.98 30.35 4.85 2.51
BonusMiles 10360.12 22881.76 55795.86 20788.77 2264.79
BonusTrans 10.82 18.23 19.66 17.09 2.97
FlightMiles 83.18 2613.42 327.68 111.57 119.32
FlightTrans 0.30 7.40 1.07 0.34 0.44
DaysSinceEnroll 6235.36 4402.41 5615.71 2840.82 3060.08
tapply(A$Balance , CG ,mean)
1 2 3 4 5
57867 110669 198192 52336 36256
tapply(A$QualMiles , CG ,mean)
1 2 3 4 5
0.6443 1065.9827 30.3462 4.8479 2.5112
tapply(A$BonusMiles , CG ,mean)
1 2 3 4 5
10360 22882 55796 20789 2265
tapply(A$BonusTrans , CG ,mean)
1 2 3 4 5
10.823 18.229 19.664 17.088 2.973
tapply(A$FlightMiles , CG ,mean)
1 2 3 4 5
83.18 2613.42 327.68 111.57 119.32
tapply(A$FlightTrans , CG ,mean)
1 2 3 4 5
0.3028 7.4027 1.0688 0.3445 0.4389
tapply(A$DaysSinceEnroll , CG ,mean)
1 2 3 4 5
6235 4402 5616 2841 3060
Compared to the other clusters, Cluster 1 has the largest average values in which variables (if any)? Select all that apply.
How would you describe the customers in Cluster 1?
2.4 Cluster 2
split(AN,CG) %>% sapply(colMeans) %>% round(2)
1 2 3 4 5
Balance -0.16 0.37 1.24 -0.21 -0.37
QualMiles -0.19 1.19 -0.15 -0.18 -0.18
BonusMiles -0.28 0.24 1.60 0.15 -0.62
BonusTrans -0.08 0.69 0.84 0.57 -0.90
FlightMiles -0.27 1.54 -0.09 -0.25 -0.24
FlightTrans -0.28 1.59 -0.08 -0.27 -0.25
DaysSinceEnroll 1.03 0.14 0.72 -0.62 -0.51
split: divide the data into pieces; apply: operate on each piece independently; combine: put the pieces back together. sapply() (short for simplified [l]apply) arranges the result as a vector, matrix, or list.
par(cex=0.8)
split(AN,CG) %>% sapply(colMeans) %>% barplot(beside=T,col=rainbow(7))
legend('topright',legend=colnames(A),fill=rainbow(7))

barplot() draws a bar chart. legend() adds a legend. par() sets or queries graphical parameters, either as par(name = value) or as par(list of parameter assignments).
Compared to the other clusters, Cluster 2 has the largest average values in which variables (if any)? Select all that apply.
- QualMiles
- FlightMiles
- FlightTrans
How would you describe the customers in Cluster 2?
2.5 Cluster 3
Compared to the other clusters, Cluster 3 has the largest average values in which variables (if any)? Select all that apply.
- Balance
- BonusMiles
- BonusTrans
How would you describe the customers in Cluster 3?
2.6 Cluster 4
Compared to the other clusters, Cluster 4 has the largest average values in which variables (if any)? Select all that apply.
How would you describe the customers in Cluster 4?
2.7 Cluster 5
Compared to the other clusters, Cluster 5 has the largest average values in which variables (if any)? Select all that apply.
How would you describe the customers in Cluster 5?
3. K-Means Clustering
3.1 K-Means Clustering
Now run the k-means clustering algorithm on the normalized data, again creating 5 clusters. Set the seed to 88 right before running the clustering algorithm, and set the argument iter.max to 1000.
set.seed(88)
km = kmeans(AN, 5, iter.max = 1000)
kg2 = km$cluster
table(kg2)
kg2
1 2 3 4 5
408 141 993 1182 1275
set.seed() sets the random number seed. kmeans() performs k-means clustering.
How many clusters have more than 1,000 observations?
par(cex=0.8)
km$centers %>% round(2) %>% t %>% barplot(beside=T,col=rainbow(7))
legend('topright',legend=colnames(A),fill=rainbow(7))

3.2 Correspondence Between the Hierarchical and K-Means Clusters
Now, compare the cluster centroids to each other either by dividing the data points into groups and then using tapply, or by looking at the output of kmeansClust$centers, where "kmeansClust" is the name of the output of the kmeans function. (Note that the output of kmeansClust$centers will be for the normalized data. If you want to look at the average values for the unnormalized data, you need to use tapply like we did for hierarchical clustering.)
table(Hierarchical=CG, KMeans=kg2)
KMeans
Hierarchical 1 2 3 4 5
1 4 0 98 673 1
2 92 137 105 92 93
3 300 4 132 58 0
4 12 0 653 30 173
5 0 0 5 329 1008
The output of $centers is for the normalized data. "Hierarchical" refers to the hierarchical clustering.
Do you expect Cluster 1 of the K-Means clustering output to necessarily be similar to Cluster 1 of the Hierarchical clustering output?
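- No, because the cluster numbering carries no meaning in either k-means or hierarchical clustering.
[Discussion Questions]
Give each of the five clusters a name.
- Cluster 1: Brand supporters
- Cluster 2: Business travelers who fly frequently
- Cluster 3: Value-conscious customers
- Cluster 4: Customers referred by partner companies
- Cluster 5: New customers who have not yet flown
Design a marketing strategy for each of the five clusters.
- Cluster 1: Develop branded merchandise to strengthen brand loyalty
- Cluster 2: Offer premium service and a refined flying experience
- Cluster 3: Offer good-value flight packages
- Cluster 4: Offer promotional fares for independent travel or red-eye budget flights
- Cluster 5: Run brand-image advertising to make a stronger impression on customers
Is the statistically best clustering also the best clustering in practice?
- It depends on the business context. A practical segmentation has to reflect market needs, so the statistically optimal clustering is not necessarily the best one in practice.
Besides within-cluster and between-cluster distances, what other factors does a practical segmentation usually need to consider?
- Cluster size and within-cluster density. Cluster size bears on market selection: the larger a cluster, the more attention it warrants. Within-cluster density bears on targeting precision: a denser cluster allows a more precisely tailored strategy and therefore a more effective campaign. Clustering provides the segmentation, after which the target market is selected and the positioning is set according to that market's characteristics (STP analysis).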
---
title: "AS6-2 Market Segmentation for Airlines"
author: "GROUP5——施采彣、陳怡安、楊凱倫、唐思琪、凌偉誠"
output: html_notebook
---

<br>

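**Main topic: market segmentation based on customer attributes**

**Learning points:**

+ Using cluster analysis for market segmentation
+ Data normalization
+ Data visualization
+ Cluster characteristics and marketing strategy
+ Marketing tools vs. marketing targets

Cluster analysis differs substantially from predictive analysis: there is no outcome variable to predict, and observations are grouped purely by how similar they are.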

```{r echo=T, message=F, cache=F, warning=F}
rm(list=ls(all=T))
Sys.setlocale("LC_ALL","C")
options(digits=4, scipen=12)
library(dplyr)
library(caret)
```
<br>

- - -

### 1. Data Normalization

##### 1.1 Data Summary
Read the dataset AirlinesCluster.csv into R and call it "airlines". Looking at the summary of airlines, 

```{r}
A = read.csv("AirlinesCluster.csv")
summary(A)
```


```{r}
colMeans(A) %>% sort
```
`colMeans()` computes the mean of each column.

_Which TWO variables have (on average) the smallest values?_

+ BonusTrans
+ FlightTrans  

_Which TWO variables have (on average) the largest values?_

+ Balance
+ BonusMiles


##### 1.2 Why Normalize the Data
In this problem, we will normalize our data before we run the clustering algorithms. 

_Why is it important to normalize the data before clustering?_

+ If we don't normalize the data, the clustering will be dominated by the variables that are on a larger scale.
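
The effect of scale on distance can be seen on a tiny made-up example. The sketch below uses three hypothetical customers (the values are invented, not taken from AirlinesCluster.csv) and base R's `scale()` for the z-score; it is only an illustration of the point above.

```{r}
# Hypothetical customers: "b" differs from "a" only slightly in Balance,
# "c" differs from "a" a lot in BonusTrans.
toy <- data.frame(Balance    = c(50000, 51000, 50000),
                  BonusTrans = c(2, 2, 40))
rownames(toy) <- c("a", "b", "c")

# Raw Euclidean distances: Balance dominates because its units are larger.
d_raw <- as.matrix(dist(toy))

# After z-scoring each column, both variables contribute comparably.
d_std <- as.matrix(dist(scale(toy)))

round(d_raw, 1)
round(d_std, 2)
```

In raw units, "a" looks much closer to "c" than to "b", purely because Balance is measured in the tens of thousands. After standardizing, the two differences carry equal weight.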

##### 1.3 Normalizing the Data with the `caret` Package
Let's go ahead and normalize our data. You can normalize the variables in a data frame by using the preProcess function in the "caret" package. You should already have this package installed from Week 4, but if not, go ahead and install it with install.packages("caret"). Then load the package with library(caret).

Now, create a normalized data frame called "airlinesNorm" by running the following commands:

`preproc = preProcess(airlines)`

`airlinesNorm = predict(preproc, airlines)`

The first command pre-processes the data, and the second command performs the normalization. If you look at the summary of airlinesNorm, you should see that all of the variables now have mean zero. You can also see that each of the variables has standard deviation 1 by using the sd() function.
```{r}
library(caret)
preproc = preProcess(A)
AN = predict(preproc, A)
summary(AN)
apply(AN, 2 ,mean) %>% round(3)
apply(AN, 2 ,sd) %>% round(3)
```
`apply()` applies a function over the rows or the columns of a matrix or data frame (`MARGIN = 1` for rows, `MARGIN = 2` for columns).
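
With its default settings, `preProcess()` centers and scales each column, which is the usual z-score. The sketch below reproduces that transformation by hand and with base R's `scale()` on a small hypothetical data frame (`toy_df` and its values are invented for illustration), so the mean-0, sd-1 result reported above can be checked without `caret`.

```{r}
# Hypothetical small data frame standing in for the airline data.
toy_df <- data.frame(x = c(1, 2, 3, 4, 10),
                     y = c(100, 300, 200, 500, 400))

# z-score by hand: subtract the column mean, divide by the column sd.
z_manual <- as.data.frame(lapply(toy_df, function(v) (v - mean(v)) / sd(v)))

# Base R's scale() performs the same centering and scaling.
z_scale <- as.data.frame(scale(toy_df))

all.equal(z_manual, z_scale, check.attributes = FALSE)
round(sapply(z_manual, mean), 3)   # column means after normalization
sapply(z_manual, sd)               # column standard deviations
```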

```{r}
apply(AN, 2, max) %>% sort
```

In the normalized data, _which variable has the largest maximum value?_

+ FlightMiles 

```{r}
apply(AN, 2, min) %>% sort
```

In the normalized data, _which variable has the smallest minimum value?_

+ DaysSinceEnroll

<br>

- - -

### 2. Hierarchical Clustering

##### 2.1 Choosing the Number of Clusters from the Dendrogram and the Business Need
Compute the distances between data points (using euclidean distance) and then run the Hierarchical clustering algorithm (using method="ward.D") on the normalized data. It may take a few minutes for the commands to finish since the dataset has a large number of observations for hierarchical clustering.

Then, plot the dendrogram of the hierarchical clustering process. Suppose the airline is looking for somewhere between 2 and 10 clusters. 

```{r}
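# Pairwise Euclidean distances between all rows of the normalized data
d = dist(AN,method="euclidean")
# Agglomerative hierarchical clustering with Ward's method
hc = hclust(d, method='ward.D')
# Draw the dendrogram
plot(hc)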
```
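Notes: `dist()` computes the pairwise distances and returns a distance matrix; `method="euclidean"` selects Euclidean distance. `hclust()` performs hierarchical clustering; `method='ward.D'` selects Ward's method.

Height in the dendrogram reflects distance: the taller a merge, the farther the points at the bottom are from the center of the resulting cluster.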
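
To see how merge heights relate to the structure of the data, here is a minimal sketch on six hypothetical 2-D points (`toy_pts` is invented for illustration) that form two obvious groups. It uses the same `dist()` and `hclust()` calls as the chunk above.

```{r}
# Six hypothetical points: three near (0, 0) and three near (10, 10).
toy_pts <- matrix(c( 0,  0,
                     0,  1,
                     1,  0,
                    10, 10,
                    10, 11,
                    11, 10),
                  ncol = 2, byrow = TRUE,
                  dimnames = list(paste0("p", 1:6), c("x", "y")))

toy_d  <- dist(toy_pts, method = "euclidean")  # pairwise distance object
toy_hc <- hclust(toy_d, method = "ward.D")     # agglomerative clustering

# One height per merge (n - 1 in total); the last merge joins the two groups.
toy_hc$height
plot(toy_hc)
```

The final merge is far taller than the earlier ones, and a large jump of that kind is what suggests a natural place to cut the tree.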


According to the dendrogram, _which of the following is NOT a good choice for the number of clusters?_

+ 6

##### 2.2 Cutting the Tree into Groups
Suppose that after looking at the dendrogram and discussing with the marketing department, the airline decides to proceed with 5 clusters. Divide the data points into 5 clusters by using the cutree function. 
The groups are then examined through the segmentation variables.

```{r}
CG = cutree(hc, k=5)
table(CG)
```
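Clustering places similar observations in the same group so that within-group differences are small, which makes the groups easier to manage.
Splitting into groups is a matter of examining the segmentation variables.
`cutree()` cuts the tree into groups; `k` is the number of clusters requested.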
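
Because `k` is a choice rather than something the tree dictates, it can help to compare several values side by side. The sketch below is an illustration on `USArrests`, a data set that ships with R and stands in for the airline data here; `cutree()` accepts a vector of `k` values and returns one column of labels per value.

```{r}
# USArrests stands in for the airline data; standardize before clustering.
us_hc <- hclust(dist(scale(USArrests)), method = "ward.D")

# One cluster label per row for a single choice of k.
us_k5 <- cutree(us_hc, k = 5)
table(us_k5)

# Several k values at once: a matrix with one column per k.
us_multi <- cutree(us_hc, k = 2:4)
apply(us_multi, 2, function(g) length(unique(g)))
```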

_How many data points are in Cluster 1?_

+ 776

##### 2.3 Inferring Cluster Characteristics from the Means of the Segmentation Variables
Now, use tapply to compare the average values in each of the variables for the 5 clusters (the centroids of the clusters). You may want to compute the average values of the unnormalized data so that it is easier to interpret. You can do this for the variable "Balance" with the following command:

`tapply(airlines$Balance, clusterGroups, mean)`

```{r}
sapply(split(A,CG), colMeans) %>% round(2) 
```

```{r}
tapply(A$Balance , CG ,mean)
tapply(A$QualMiles , CG ,mean)
tapply(A$BonusMiles , CG ,mean)
tapply(A$BonusTrans , CG ,mean)
tapply(A$FlightMiles , CG ,mean)
tapply(A$FlightTrans , CG ,mean)
tapply(A$DaysSinceEnroll , CG ,mean)
```

Compared to the other clusters, _Cluster 1 has the largest average values in which variables (if any)? Select all that apply._

+ DaysSinceEnroll 

_How would you describe the customers in Cluster 1?_

+ Customers who fly infrequently but are loyal

##### 2.4 Cluster 2
```{r}
split(AN,CG) %>% sapply(colMeans) %>% round(2)
```
split: divide the data into pieces;
apply: operate on each piece independently;
combine: put the pieces back together.
`sapply()` (short for simplified [l]apply) arranges the result as a vector, matrix, or list.
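
The split-apply-combine idea can be written in several equivalent ways in base R. The sketch below uses a small hypothetical data frame (`demo` and `grp` are invented names and values) to show that `tapply()`, `sapply(split(...))`, and `aggregate()` produce the same group means, differing only in the shape of the output.

```{r}
# Hypothetical values and group labels for illustration.
demo <- data.frame(miles = c(100, 200, 300, 400, 500, 600),
                   trans = c(1, 2, 3, 4, 5, 6))
grp  <- c(1, 1, 2, 2, 3, 3)

# One variable at a time.
by_tapply <- tapply(demo$miles, grp, mean)

# All variables at once: split rows by group, take column means,
# and let sapply() bind the results as columns of a matrix.
by_split <- sapply(split(demo, grp), colMeans)

# aggregate() gives the same numbers with the groups in rows.
by_agg <- aggregate(demo, by = list(cluster = grp), FUN = mean)

by_tapply
by_split
by_agg
```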

```{r}
par(cex=0.8)
split(AN,CG) %>% sapply(colMeans) %>% barplot(beside=T,col=rainbow(7))
legend('topright',legend=colnames(A),fill=rainbow(7))
```
`barplot()` draws a bar chart.
`legend()` adds a legend.
`par()` sets or queries graphical parameters, either as `par(name = value)` or as `par(list of parameter assignments)`.

Compared to the other clusters, _Cluster 2 has the largest average values in which variables (if any)? Select all that apply._

+ QualMiles   
+ FlightMiles
+ FlightTrans

_How would you describe the customers in Cluster 2?_

+ Customers who fly frequently and have accumulated a large number of miles

##### 2.5 Cluster 3
Compared to the other clusters, _Cluster 3 has the largest average values in which variables (if any)? Select all that apply._

+ Balance
+ BonusMiles
+ BonusTrans

_How would you describe the customers in Cluster 3?_

+ Customers who do not fly often but have earned a large number of miles

##### 2.6 Cluster 4
Compared to the other clusters, _Cluster 4 has the largest average values in which variables (if any)? Select all that apply._

+ None

_How would you describe the customers in Cluster 4?_

+ Relatively new customers who do not fly often but have been accumulating miles

##### 2.7 Cluster 5
Compared to the other clusters, _Cluster 5 has the largest average values in which variables (if any)? Select all that apply._

+ None

_How would you describe the customers in Cluster 5?_

+ New customers who seldom fly

- - -

### 3. K-Means Clustering

##### 3.1 K-Means Clustering
Now run the k-means clustering algorithm on the normalized data, again creating 5 clusters. Set the seed to 88 right before running the clustering algorithm, and set the argument iter.max to 1000.

```{r}
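set.seed(88)                         # fix the random starting centers
km = kmeans(AN, 5, iter.max = 1000)  # 5 clusters on the normalized data
kg2 = km$cluster                     # cluster label for each observation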
table(kg2)
```
`set.seed()` sets the random number seed.
`kmeans()` performs k-means clustering.
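
The seed matters because `kmeans()` starts from randomly chosen centers and can settle on different solutions. One common safeguard, not used in this notebook, is the `nstart` argument, which tries several random starts and keeps the best one. The sketch below illustrates it on `USArrests` (a data set that ships with R, standing in for the airline data); the seed value and the choice of 4 clusters are arbitrary.

```{r}
# USArrests stands in for the airline data; standardize first.
us_z <- scale(USArrests)

set.seed(1)   # arbitrary seed, only for reproducibility
fit_1  <- kmeans(us_z, centers = 4, nstart = 1,  iter.max = 100)
set.seed(1)
fit_25 <- kmeans(us_z, centers = 4, nstart = 25, iter.max = 100)

# Total within-cluster sum of squares: lower means tighter clusters.
c(one_start = fit_1$tot.withinss, many_starts = fit_25$tot.withinss)
```

With the same seed, the multi-start fit can only match or improve on the single-start fit, since the first start is shared and the best of all starts is kept.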

_How many clusters have more than 1,000 observations?_

+ 2

```{r}
par(cex=0.8)
km$centers %>% round(2) %>% t %>% barplot(beside=T,col=rainbow(7))
legend('topright',legend=colnames(A),fill=rainbow(7))
```

##### 3.2 Correspondence Between the Hierarchical and K-Means Clusters
Now, compare the cluster centroids to each other either by dividing the data points into groups and then using tapply, or by looking at the output of `kmeansClust$centers`, where "kmeansClust" is the name of the output of the kmeans function. (Note that the output of `kmeansClust$centers` will be for the normalized data. If you want to look at the average values for the unnormalized data, you need to use tapply like we did for hierarchical clustering.)

```{r}
table(Hierarchical=CG, KMeans=kg2)
```
The output of `km$centers` is for the normalized data.
"Hierarchical" refers to the hierarchical clustering.
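
Since the k-means centers are expressed in normalized units, there are two ways to read them in the original units: undo the z-score on the centers, or average the raw data within each cluster. The sketch below shows on `USArrests` (shipped with R, standing in for the airline data) that both routes give the same numbers; the seed and the choice of 3 clusters are arbitrary.

```{r}
us_means <- colMeans(USArrests)
us_sds   <- apply(USArrests, 2, sd)
us_z     <- scale(USArrests)

set.seed(1)   # arbitrary seed for reproducibility
us_km <- kmeans(us_z, centers = 3, iter.max = 100)

# Route 1: undo the z-score on the centers (z * sd + mean, column by column).
back <- sweep(sweep(us_km$centers, 2, us_sds, "*"), 2, us_means, "+")

# Route 2: average the raw data within each cluster.
direct <- t(sapply(split(USArrests, us_km$cluster), colMeans))

all.equal(back, direct, check.attributes = FALSE)
round(back, 1)
```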

_Do you expect Cluster 1 of the K-Means clustering output to necessarily be similar to Cluster 1 of the Hierarchical clustering output?_

+ No, because the cluster numbering carries no meaning in either k-means or hierarchical clustering.
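
Because the numbering is arbitrary, two clusterings are compared through their cross-tabulation rather than by label. The sketch below uses two hypothetical label vectors (invented for illustration) that describe the same three groups under different numbers, and matches each cluster of the first to its largest overlap in the second. This row-wise matching is a simple heuristic and does not guarantee a one-to-one pairing in general.

```{r}
# Hypothetical labels: the same three groups, numbered differently.
labels_a <- c(1, 1, 2, 2, 3, 3, 3)
labels_b <- c(3, 3, 1, 1, 2, 2, 2)

tab <- table(method_a = labels_a, method_b = labels_b)
tab

# For each cluster of the first method, the best-matching cluster of the second.
best_match <- apply(tab, 1, which.max)
best_match

# Share of points that agree once labels are matched this way.
agreement <- sum(apply(tab, 1, max)) / length(labels_a)
agreement
```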


<br>

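##### [Discussion Questions]

Give each of the five clusters a name.

+ Cluster 1: Brand supporters
+ Cluster 2: Business travelers who fly frequently
+ Cluster 3: Value-conscious customers
+ Cluster 4: Customers referred by partner companies
+ Cluster 5: New customers who have not yet flown

Design a marketing strategy for each of the five clusters.

+ Cluster 1: Develop branded merchandise to strengthen brand loyalty
+ Cluster 2: Offer premium service and a refined flying experience
+ Cluster 3: Offer good-value flight packages
+ Cluster 4: Offer promotional fares for independent travel or red-eye budget flights
+ Cluster 5: Run brand-image advertising to make a stronger impression on customers

Is the statistically best clustering also the best clustering in practice?

+ It depends on the business context. A practical segmentation has to reflect market needs, so the statistically optimal clustering is not necessarily the best one in practice.

Besides within-cluster and between-cluster distances, what other factors does a practical segmentation usually need to consider?

+ Cluster size and within-cluster density. Cluster size bears on market selection: the larger a cluster, the more attention it warrants. Within-cluster density bears on targeting precision: a denser cluster allows a more precisely tailored strategy and therefore a more effective campaign. Clustering provides the segmentation, after which the target market is selected and the positioning is set according to that market's characteristics (STP analysis).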
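
Cluster size and within-cluster density can both be read off a fitted `kmeans` object. The sketch below is one possible way to tabulate them, again on `USArrests` as a stand-in for the airline data; the column names in `seg`, the seed, and the use of average within-cluster sum of squares as a density proxy are assumptions made for illustration.

```{r}
us_z <- scale(USArrests)
set.seed(1)   # arbitrary seed for reproducibility
us_km <- kmeans(us_z, centers = 4, nstart = 10, iter.max = 100)

seg <- data.frame(
  cluster = seq_along(us_km$size),
  size    = us_km$size,                               # members per cluster
  share   = round(us_km$size / sum(us_km$size), 2),   # fraction of all rows
  avg_wss = round(us_km$withinss / us_km$size, 2)     # lower = denser cluster
)

# Largest segments first.
seg[order(-seg$size), ]
```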


- - -

<br><br><br><br><br>

<style>
.caption {
  color: #777;
  margin-top: 10px;
}
p code {
  white-space: inherit;
}
pre {
  word-break: normal;
  word-wrap: normal;
  line-height: 1;
}
pre code {
  white-space: inherit;
}
p,li {
  font-family: "Trebuchet MS", "微軟正黑體", "Microsoft JhengHei";
}

.r{
  line-height: 1.2;
}

title{
  color: #cc0000;
  font-family: "Trebuchet MS", "微軟正黑體", "Microsoft JhengHei";
}

body{
  font-family: "Trebuchet MS", "微軟正黑體", "Microsoft JhengHei";
}

h1,h2,h3,h4,h5{
  color: #008800;
  font-family: "Trebuchet MS", "微軟正黑體", "Microsoft JhengHei";
}

h3{
  color: #b36b00;
  background: #ffe0b3;
  line-height: 2;
  font-weight: bold;
}

h5{
  color: #006000;
  background: #ffffe0;
  line-height: 2;
  font-weight: bold;
}

em{
  color: #0000c0;
  background: #f0f0f0;
  }
</style>

