---
title: "AS6-2 Airline Market Segmentation"
author: "GROUP3, 2018/07/29"
output: html_notebook
---

<br>

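**Main topic: market segmentation by customer attributes**

**Learning objectives:**

+ Market segmentation with cluster analysis
+ Data normalization
+ Data visualization
+ Segment characteristics and marketing strategy
+ Marketing tools vs. marketing targets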


```{r echo=T, message=F, cache=F, warning=F}
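# Start from a clean workspace
rm(list=ls(all=TRUE))
# Use the C locale for this session
Sys.setlocale("LC_ALL","C")
# Print 4 significant digits and discourage scientific notation
options(digits=4, scipen=12)
library(dplyr)   # provides the %>% pipe
library(caret)   # provides preProcess()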
```
<br>

- - -

### 1. Data Normalization

##### 1.1 Data Summary
Read the dataset AirlinesCluster.csv into R. The assignment calls it "airlines"; this notebook names it `A`. Then look at the summary and the sorted column means of the data.
```{r}
A = read.csv('data/AirlinesCluster.csv')
summary(A)
```
```{r}
colMeans(A) %>% sort
```

_Which TWO variables have (on average) the smallest values?_

+ BonusTrans and FlightTrans

_Which TWO variables have (on average) the largest values?_

+ Balance and BonusMiles



##### 1.2 Why Normalize the Data
In this problem, we will normalize our data before we run the clustering algorithms. 

_Why is it important to normalize the data before clustering?_

+ If we don't normalize the data, the variables that are on a larger scale will contribute much more to the distance calculation, and thus will dominate the clustering.
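
The effect of scale on distance can be seen in a minimal sketch. The three customers and their values below are made up for illustration and are not taken from AirlinesCluster.csv; base R's `scale()` stands in for the normalization step.

```r
# Toy data (hypothetical values): three customers, two variables on very
# different scales
toy <- data.frame(
  Balance    = c(20000, 60000, 21000),
  BonusTrans = c(2, 3, 30)
)
rownames(toy) <- c("a", "b", "c")

# Raw Euclidean distances: Balance dominates the calculation
d_raw <- as.matrix(dist(toy))

# Distances after standardizing each column to mean 0 and sd 1
d_std <- as.matrix(dist(scale(toy)))

round(d_raw, 1)
round(d_std, 2)
```

On the raw values, customer "a" appears about 40 times closer to "c" than to "b" purely because of Balance; after standardizing, the two distances are of similar size.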


##### 1.3 Normalizing the Data with the `caret` Package
Let's go ahead and normalize our data. You can normalize the variables in a data frame by using the preProcess function in the "caret" package. You should already have this package installed from Week 4, but if not, go ahead and install it with install.packages("caret"). Then load the package with library(caret).

Now, create a normalized data frame called "airlinesNorm" by running the following commands:

`preproc = preProcess(airlines)`

`airlinesNorm = predict(preproc, airlines)`

The first command estimates each variable's mean and standard deviation, and the second command applies the normalization. If you look at the summary of airlinesNorm, you should see that all of the variables now have mean zero. You can also see that each of the variables has standard deviation 1 by using the sd() function. (This notebook names the data frames `A` and `AN` rather than `airlines` and `airlinesNorm`.)

```{r}
library(caret)
preproc = preProcess(A)
AN = predict(preproc, A)  
summary(AN)
apply(AN, 2, mean) %>% round(3)
apply(AN, 2, sd) %>% round(3)
```

```{r}
apply(AN, 2, max) %>% sort
```

In the normalized data, _which variable has the largest maximum value?_

+ FlightMiles


```{r}
apply(AN, 2, min) %>% sort
```

In the normalized data, _which variable has the smallest minimum value?_

+  DaysSinceEnroll
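
As a rough cross-check of what the normalization does, the sketch below standardizes a made-up column by hand and with base R's `scale()`. It assumes `preProcess()` was called with its default method (centering and scaling), as in this notebook; the vector `x` is hypothetical.

```r
# Toy column (hypothetical values)
x <- c(10, 20, 30, 40, 50)

# Manual z-score: subtract the mean, divide by the sample standard deviation
z_manual <- (x - mean(x)) / sd(x)

# Base R equivalent; scale() returns a one-column matrix, so flatten it
z_scale <- as.numeric(scale(x))

mean(z_manual)                 # 0, up to floating-point error
sd(z_manual)                   # 1
all.equal(z_manual, z_scale)   # TRUE
```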


<br>

- - -

### 2. Hierarchical Clustering

##### 2.1 Choosing the Number of Clusters from the Dendrogram and Application Needs
Compute the distances between data points (using Euclidean distance) and then run the hierarchical clustering algorithm (using method="ward.D") on the normalized data. The commands may take a few minutes to finish, since the dataset has a large number of observations for hierarchical clustering.

Then, plot the dendrogram of the hierarchical clustering process. Suppose the airline is looking for somewhere between 2 and 10 clusters. 
```{r}
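# Pairwise Euclidean distances on the normalized data
d = dist(AN,method="euclidean")
# Hierarchical clustering with the ward.D linkage
hc = hclust(d, method='ward.D')
plot(hc)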
```
According to the dendrogram, _which of the following is NOT a good choice for the number of clusters?_

+ If you slide a horizontal line down the dendrogram, it crosses 2, 3, or 7 branches over a long vertical stretch. However, there is only a very narrow range of heights at which the line crosses 6 branches, so 6 clusters is probably not a good choice.
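
One way to make the "long stretch" idea concrete, sketched here on synthetic data rather than the airline data, is to compare consecutive merge heights stored in the `hclust` object. The toy points and the names `hc_toy` and `gap` are assumptions for illustration.

```r
set.seed(1)
# Toy data (hypothetical): three well-separated groups of 20 points each
toy <- rbind(
  matrix(rnorm(40, mean = 0,  sd = 0.3), ncol = 2),
  matrix(rnorm(40, mean = 5,  sd = 0.3), ncol = 2),
  matrix(rnorm(40, mean = 10, sd = 0.3), ncol = 2)
)
hc_toy <- hclust(dist(toy), method = "ward.D")

# The n-1 merge heights, in increasing order
h <- sort(hc_toy$height)
n <- nrow(toy)

# A horizontal cut placed between h[n-k] and h[n-k+1] yields k clusters,
# so this difference is the vertical range over which the line crosses
# exactly k branches
k <- 2:10
gap <- h[n - k + 1] - h[n - k]
names(gap) <- k
round(gap, 2)
```

A large gap for a given k means the horizontal line crosses k branches over a long vertical range; a gap near zero means that k is hard to hit with a horizontal cut.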


##### 2.2 Cutting the Tree into Clusters
Suppose that after looking at the dendrogram and discussing with the marketing department, the airline decides to proceed with 5 clusters. Divide the data points into 5 clusters by using the cutree function. 
```{r}
kg = cutree(hc, k=5)
table(kg)
```
_How many data points are in Cluster 1?_

+ 776

##### 2.3 Inferring Segment Characteristics from Variable Means
Now, use tapply to compare the average values in each of the variables for the 5 clusters (the centroids of the clusters). You may want to compute the average values of the unnormalized data so that it is easier to interpret. You can do this for the variable "Balance" with the following command:

`tapply(airlines$Balance, clusterGroups, mean)`

The chunk that follows computes the same per-cluster means for every variable at once, using this notebook's names `A` and `kg`.
```{r}
sapply(split(A,kg), colMeans) %>% round(2) 
```
Compared to the other clusters, _Cluster 1 has the largest average values in which variables (if any)? Select all that apply._

+ The only variable for which Cluster 1 has the largest average value is DaysSinceEnroll.


_How would you describe the customers in Cluster 1?_

+ Cluster 1 mostly contains customers with few miles, but who have been with the airline the longest.
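
The two ways of computing cluster centroids mentioned in this section, `tapply` on one variable and `split` plus `sapply` on all variables, can be compared on a tiny made-up data frame. The values and the label vector `grp` are hypothetical.

```r
# Toy data (hypothetical values) with a made-up cluster label per row
toy <- data.frame(
  Balance         = c(100, 300, 500, 700),
  DaysSinceEnroll = c(10, 30, 20, 40)
)
grp <- c(1, 1, 2, 2)

# One variable at a time: mean Balance per cluster (200 and 600)
bal_means <- tapply(toy$Balance, grp, mean)
bal_means

# All variables at once: one row per variable, one column per cluster
centers <- sapply(split(toy, grp), colMeans)
centers
```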


##### 2.4 Cluster 2
```{r}
split(AN,kg) %>% sapply(colMeans) %>% round(2)
```

```{r}
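par(cex=0.8)
# Grouped bar chart of normalized cluster means: one group of bars per cluster, one bar per variable
split(AN,kg) %>% sapply(colMeans) %>% barplot(beside=T,col=rainbow(7))
legend('topright',legend=colnames(A),fill=rainbow(7))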
```

Compared to the other clusters, _Cluster 2 has the largest average values in which variables (if any)? Select all that apply._

+ Cluster 2 has the largest average values in the variables QualMiles, FlightMiles and FlightTrans. This cluster also has relatively large values in BonusTrans and Balance.


_How would you describe the customers in Cluster 2?_

+ Cluster 2 contains customers with a large number of miles, mostly accumulated through flight transactions.

##### 2.5 Cluster 3
Compared to the other clusters, _Cluster 3 has the largest average values in which variables (if any)? Select all that apply._

+ Cluster 3 has the largest values in Balance, BonusMiles, and BonusTrans. While it also has relatively large values in other variables, these are the three for which it has the largest values.


_How would you describe the customers in Cluster 3?_

+ Cluster 3 mostly contains customers with a lot of miles, and who have earned the miles mostly through bonus transactions.


##### 2.6 Cluster 4
Compared to the other clusters, _Cluster 4 has the largest average values in which variables (if any)? Select all that apply._

+ Cluster 4 does not have the largest values in any of the variables.

_How would you describe the customers in Cluster 4?_

+ Cluster 4 customers have the smallest value in DaysSinceEnroll, but they are already accumulating a reasonable number of miles.


##### 2.7 Cluster 5
Compared to the other clusters, _Cluster 5 has the largest average values in which variables (if any)? Select all that apply._

+ Cluster 5 does not have the largest values in any of the variables.

_How would you describe the customers in Cluster 5?_

+ Relatively new customers who don't use the airline very often. 

- - -

### 3. K-Means Clustering

##### 3.1 Running K-Means
Now run the k-means clustering algorithm on the normalized data, again creating 5 clusters. Set the seed to 88 right before running the clustering algorithm, and set the argument iter.max to 1000.

```{r}
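set.seed(88)  # fix the random starting centers so the result is reproducible
km = kmeans(AN, 5, iter.max = 1000)
kg2 = km$cluster  # cluster label for each observation
table(kg2)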
```
_How many clusters have more than 1,000 observations?_

+ There are two clusters with more than 1,000 observations.

```{r}
par(cex=0.8)
km$centers %>% round(2) %>% t %>% barplot(beside=T,col=rainbow(7))
legend('topright',legend=colnames(A),fill=rainbow(7))
```
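
Because k-means starts from randomly chosen centers, the seed matters. The sketch below, on synthetic two-group data (an assumption, not the airline data), shows that resetting the same seed before each call reproduces the same assignment.

```r
set.seed(3)
# Toy data (hypothetical): two groups of 20 points each
toy <- rbind(
  matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(40, mean = 3, sd = 0.3), ncol = 2)
)

# Same seed before each call, so the same starting centers are drawn
set.seed(88)
km_a <- kmeans(toy, centers = 2, iter.max = 1000)
set.seed(88)
km_b <- kmeans(toy, centers = 2, iter.max = 1000)

# The two runs assign every point to the same cluster
same <- identical(km_a$cluster, km_b$cluster)
same

# Cluster sizes, in the same form as table(kg2) in this notebook
sizes <- table(km_a$cluster)
sizes
```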

##### 3.2 Correspondence Between Hierarchical and K-Means Clusters
Now, compare the cluster centroids to each other either by dividing the data points into groups and then using tapply, or by looking at the output of `kmeansClust$centers`, where "kmeansClust" is the name of the output of the kmeans function (`km` in this notebook). Note that the output of `kmeansClust$centers` will be for the normalized data. If you want to look at the average values for the unnormalized data, you need to use tapply like we did for hierarchical clustering.
```{r}
table(Hierarchical=kg, KMeans=kg2)
```

_Do you expect Cluster 1 of the K-Means clustering output to necessarily be similar to Cluster 1 of the Hierarchical clustering output?_

+ No, because cluster ordering is not meaningful in either k-means clustering or hierarchical clustering.
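
Since label numbers are arbitrary, a cross-tabulation like the one above is the practical way to line clusters up. A minimal sketch with made-up labels (both vectors are hypothetical) picks, for each hierarchical cluster, the k-means cluster it overlaps with most:

```r
# Made-up labels for six observations from two methods.
# The groupings are identical; only the label numbers differ.
hier       <- c(1, 1, 2, 2, 3, 3)
kmeans_lab <- c(3, 3, 1, 1, 2, 2)

tab <- table(Hierarchical = hier, KMeans = kmeans_lab)
tab

# For each hierarchical cluster (row), the k-means label (column)
# with the largest overlap
best_match <- colnames(tab)[apply(tab, 1, which.max)]
names(best_match) <- rownames(tab)
best_match
```

Here hierarchical cluster 1 corresponds to k-means cluster 3, cluster 2 to cluster 1, and cluster 3 to cluster 2, even though the two groupings are the same.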


<br>

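##### Discussion Questions

Give each of the five segments a name.

+ Money losers
+ Business customers (cash cows)
+ Potential customers
+ Dormant customers
+ Zombies

Design a marketing strategy for each of the five segments.

+ Money losers: the company does not need to spend resources on these customers.

+ Business customers: the company should put its resources here first and market to these customers one-to-one, for example with tailored discounts, to raise their loyalty and satisfaction and to keep their spending high for as long as possible.

+ Potential customers: remind members who are close to, but have not yet reached, their first award-ticket redemption, so that they reach the redemption threshold and become active members.

+ Dormant customers: the airline should actively investigate what is unusual about these customers and carry out a competitive analysis. They are members yet have not flown for a long time, possibly because another airline has a more attractive marketing strategy. We should therefore study the marketing tactics of other airlines and then respond with targeted campaigns to wake these dormant customers up.

+ Zombies: the company does not need to spend resources on these customers.


Is the statistically best clustering also the best clustering in practice?

+ No. The best clustering on statistically selected variables is not necessarily applicable in practice, and it does not necessarily contribute anything to practice.

Besides within-cluster and between-cluster distances, what other factors usually need to be considered when clustering in practice?

+ The number of groups is sometimes a factor as well, because the resulting groups can be hard to interpret. In that case, try changing the number of groups; a different split may reveal a different perspective.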




- - -

<br><br><br><br><br>

<style>
.caption {
  color: #777;
  margin-top: 10px;
}
p code {
  white-space: inherit;
}
pre {
  word-break: normal;
  word-wrap: normal;
  line-height: 1;
}
pre code {
  white-space: inherit;
}
p,li {
  font-family: "Trebuchet MS", "微軟正黑體", "Microsoft JhengHei";
}

.r{
  line-height: 1.2;
}

title{
  color: #cc0000;
  font-family: "Trebuchet MS", "微軟正黑體", "Microsoft JhengHei";
}

body{
  font-family: "Trebuchet MS", "微軟正黑體", "Microsoft JhengHei";
}

h1,h2,h3,h4,h5{
  color: #008800;
  font-family: "Trebuchet MS", "微軟正黑體", "Microsoft JhengHei";
}

h3{
  color: #b36b00;
  background: #ffe0b3;
  line-height: 2;
  font-weight: bold;
}

h5{
  color: #006000;
  background: #ffffe0;
  line-height: 2;
  font-weight: bold;
}

em{
  color: #0000c0;
  background: #f0f0f0;
  }
</style>

