Main topic: market segmentation by customer attributes
Learning points:
rm(list=ls(all=T))
Sys.setlocale("LC_ALL","C")
options(digits=4, scipen=12)
library(dplyr)
library(caret)
Read the dataset AirlinesCluster.csv into R and call it “airlines” (the code in this notebook uses the shorter name A). Then look at the summary of the data:
A = read.csv('data/AirlinesCluster.csv')
summary(A)
colMeans(A) %>% sort
Which TWO variables have (on average) the smallest values?
Which TWO variables have (on average) the largest values?
In this problem, we will normalize our data before we run the clustering algorithms.
Why is it important to normalize the data before clustering?
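One way to see why scaling matters is a small made-up example. The sketch below assumes two illustrative columns (Balance, as in the dataset, and a hypothetical Flights count) with invented values; it is not taken from AirlinesCluster.csv. It compares Euclidean distances before and after standardizing with base R's scale().

```r
# Made-up data: two variables on very different scales (values are invented)
toy_demo <- data.frame(
  Balance = c(10000, 10500, 60000),  # large numbers
  Flights = c(1, 20, 2)              # small numbers (hypothetical column)
)

# Euclidean distances on the raw values: differences in Balance dominate
d_raw <- as.matrix(dist(toy_demo, method = "euclidean"))

# Euclidean distances after giving every column mean 0 and sd 1
d_std <- as.matrix(dist(scale(toy_demo), method = "euclidean"))

round(d_raw, 1)
round(d_std, 2)
```

On the raw values, rows 1 and 2 look almost identical compared with row 3, because the gap in Balance swamps the gap in Flights. After standardizing, the difference in Flights between rows 1 and 2 counts about as much as the difference in Balance between rows 1 and 3.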
Normalizing the data with the caret package
Let’s go ahead and normalize our data. You can normalize the variables in a data frame by using the preProcess function in the “caret” package. You should already have this package installed from Week 4, but if not, go ahead and install it with install.packages("caret"). Then load the package with library(caret).
Now, create a normalized data frame called “airlinesNorm” by running the following commands:
preproc = preProcess(airlines)
airlinesNorm = predict(preproc, airlines)
The first command pre-processes the data, and the second command performs the normalization. If you look at the summary of airlinesNorm, you should see that all of the variables now have mean zero. You can also see that each of the variables has standard deviation 1 by using the sd() function.
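With its default settings, preProcess centers and scales each column, which is the usual z-score transformation. The following base-R sketch, on a made-up vector, shows the same arithmetic without caret; the variable names are illustrative only.

```r
# Made-up numeric vector standing in for one column of the data
x_demo <- c(2, 4, 4, 4, 5, 5, 7, 9)

# z-score by hand: subtract the mean, divide by the sample standard deviation
z_demo <- (x_demo - mean(x_demo)) / sd(x_demo)

mean(z_demo)  # zero, up to floating-point error
sd(z_demo)    # one, up to floating-point error

# scale() performs the same centering and scaling
all.equal(as.numeric(scale(x_demo)), z_demo)
```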
library(caret)
preproc = preProcess(A)
AN = predict(preproc, A)
summary(AN)
apply(AN, 2, mean) %>% round(3)
apply(AN, 2, sd) %>% round(3)
apply(AN, 2, max) %>% sort
In the normalized data, which variable has the largest maximum value?
apply(AN, 2, min) %>% sort
In the normalized data, which variable has the smallest minimum value?
Compute the distances between data points (using Euclidean distance) and then run the hierarchical clustering algorithm (using method="ward.D") on the normalized data. It may take a few minutes for the commands to finish, since the dataset has a large number of observations for hierarchical clustering.
Then, plot the dendrogram of the hierarchical clustering process. Suppose the airline is looking for somewhere between 2 and 10 clusters.
d = dist(AN,method="euclidean")
hc = hclust(d, method='ward.D')
plot(hc)
According to the dendrogram, which of the following is NOT a good choice for the number of clusters?
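Reading a dendrogram by eye can be supplemented with the merge heights stored in the hclust object. The sketch below is one possible heuristic, not part of the original exercise: it uses R's built-in USArrests data as a stand-in and computes, for each k from 2 to 10, the vertical range over which a horizontal cut yields exactly k clusters.

```r
# Stand-in data: the built-in USArrests data set, standardized
x_demo  <- scale(USArrests)
hc_demo <- hclust(dist(x_demo, method = "euclidean"), method = "ward.D")

# Merge heights from the top of the tree down:
# h_top[j] is the height of the merge that goes from j + 1 clusters to j
h_top <- rev(hc_demo$height)[1:10]

# A cut gives exactly k clusters for heights between h_top[k] and h_top[k - 1]
gap_demo <- h_top[1:9] - h_top[2:10]
names(gap_demo) <- 2:10

round(gap_demo, 2)
```

A small gap for some k means the tree must be cut within a narrow band to obtain exactly k clusters, which usually makes that k a less stable choice.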
Suppose that after looking at the dendrogram and discussing with the marketing department, the airline decides to proceed with 5 clusters. Divide the data points into 5 clusters by using the cutree function.
kg = cutree(hc, k=5)
table(kg)
How many data points are in Cluster 1?
Now, use tapply to compare the average values in each of the variables for the 5 clusters (the centroids of the clusters). You may want to compute the average values of the unnormalized data so that it is easier to interpret. You can do this for the variable “Balance” with the following command:
tapply(airlines$Balance, clusterGroups, mean)
sapply(split(A,kg), colMeans) %>% round(2)
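The tapply call and the split/colMeans idiom compute the same per-cluster averages; the first handles one variable, the second all variables at once. A minimal sketch on invented data (the Flights column and the group labels are hypothetical):

```r
# Invented data and cluster labels, for illustration only
toy_demo <- data.frame(
  Balance = c(100, 200, 300, 400, 500, 600),
  Flights = c(1, 3, 2, 4, 6, 10)
)
groups_demo <- c(1, 1, 2, 2, 2, 3)

# One variable at a time: mean of Balance within each group
by_tapply <- tapply(toy_demo$Balance, groups_demo, mean)

# All variables at once: one column per group, one row per variable
centroids_demo <- sapply(split(toy_demo, groups_demo), colMeans)

by_tapply
centroids_demo
```

The Balance row of centroids_demo matches the tapply result, so either form can be used to read off cluster centroids on the unnormalized data.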
Compared to the other clusters, Cluster 1 has the largest average values in which variables (if any)? Select all that apply.
How would you describe the customers in Cluster 1?
split(AN,kg) %>% sapply(colMeans) %>% round(2)
par(cex=0.8)
split(AN,kg) %>% sapply(colMeans) %>% barplot(beside=T,col=rainbow(7))
legend('topright',legend=colnames(A),fill=rainbow(7))
Compared to the other clusters, Cluster 2 has the largest average values in which variables (if any)? Select all that apply.
How would you describe the customers in Cluster 2?
Compared to the other clusters, Cluster 3 has the largest average values in which variables (if any)? Select all that apply.
How would you describe the customers in Cluster 3?
Compared to the other clusters, Cluster 4 has the largest average values in which variables (if any)? Select all that apply.
How would you describe the customers in Cluster 4?
Compared to the other clusters, Cluster 5 has the largest average values in which variables (if any)? Select all that apply.
How would you describe the customers in Cluster 5?
Now run the k-means clustering algorithm on the normalized data, again creating 5 clusters. Set the seed to 88 right before running the clustering algorithm, and set the argument iter.max to 1000.
set.seed(88)
km = kmeans(AN, 5, iter.max = 1000)
kg2 = km$cluster
table(kg2)
How many clusters have more than 1,000 observations?
There are two clusters with more than 1,000 observations.
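Because kmeans starts from randomly chosen centers, the seed determines which solution is found, and the cluster sizes can be counted directly from the table instead of by eye. A small sketch on the built-in iris measurements (a stand-in data set; the threshold of 40 is arbitrary):

```r
x_demo <- scale(iris[, 1:4])   # stand-in data, standardized

# Same seed immediately before each call gives the same starting centers
set.seed(88)
km1_demo <- kmeans(x_demo, centers = 3, iter.max = 1000)
set.seed(88)
km2_demo <- kmeans(x_demo, centers = 3, iter.max = 1000)
identical(km1_demo$cluster, km2_demo$cluster)

# Cluster sizes, and how many clusters exceed a size threshold
sizes_demo <- table(km1_demo$cluster)
sizes_demo
sum(sizes_demo > 40)
```

On the airline data the analogous count would be sum(table(kg2) > 1000).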
par(cex=0.8)
km$centers %>% round(2) %>% t %>% barplot(beside=T,col=rainbow(7))
legend('topright',legend=colnames(A),fill=rainbow(7))
Now, compare the cluster centroids to each other either by dividing the data points into groups and then using tapply, or by looking at the output of kmeansClust$centers, where "kmeansClust" is the name of the output of the kmeans function. (Note that the output of kmeansClust$centers will be for the normalized data. If you want to look at the average values for the unnormalized data, you need to use tapply like we did for hierarchical clustering.)
table(Hierarchical=kg, KMeans=kg2)
Do you expect Cluster 1 of the K-Means clustering output to necessarily be similar to Cluster 1 of the Hierarchical clustering output?
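Cluster numbers are arbitrary labels, so the same group of points can be numbered differently by the two methods. The sketch below, again on the built-in USArrests data as a stand-in, cross-tabulates the two labelings and looks up, for each hierarchical cluster, the k-means cluster it overlaps with most.

```r
x_demo <- scale(USArrests)   # stand-in data, standardized

# Hierarchical labels (Ward) and k-means labels, three clusters each
hc_lab <- cutree(hclust(dist(x_demo), method = "ward.D"), k = 3)
set.seed(88)
km_lab <- kmeans(x_demo, centers = 3, iter.max = 1000)$cluster

# Rows: hierarchical clusters; columns: k-means clusters
tab_demo <- table(Hierarchical = hc_lab, KMeans = km_lab)
tab_demo

# For each hierarchical cluster, the k-means label with the largest overlap
apply(tab_demo, 1, which.max)
```

If the large counts fall off the diagonal, the two methods found similar groups but numbered them differently.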
Give each of the five clusters a name.
Design a marketing strategy for each of the five clusters.
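The company does not need to spend resources on these customers.
Business customers. The company should direct resources to them first and market to them one-to-one, for example by offering targeted discounts, to raise their loyalty and satisfaction and keep their high spending going for as long as possible.
For members who are close to, but have not yet reached, their first award-ticket redemption, send reminders so that they reach the first-redemption threshold and join the membership.
The airline should actively watch for unusual behavior among these customers and carry out competitive analysis. These customers are members but have not flown for a long time, possibly because other airlines offer more attractive promotions. We should therefore study what other airlines are doing and respond with targeted marketing to wake these dormant customers up.
The company does not need to spend resources on these customers.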
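Is the statistically best clustering also the best clustering in practice?
Besides within-cluster and between-cluster distances, what other factors usually need to be considered when clustering in practice?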