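# 1 E-commerce data
I am a business analyst in e-commerce, and I want to analyze the behavior of customers who visit our company's website. The goal is to combine PCA and k-means clustering to observe and analyze online shoppers' purchasing intention.
Dataset information:
The dataset consists of feature vectors belonging to 12,330 sessions. It was formed so that each session belongs to a different user over a 1-year period, to avoid any tendency toward a specific campaign, special day, user profile, or period.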
https://archive.ics.uci.edu/ml/datasets/Online+Shoppers+Purchasing+Intention+Dataset
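The dataset consists of 10 numerical and 8 categorical attributes:
The "Revenue" attribute can be used as the class label.
"Administrative", "Administrative Duration", "Informational", "Informational Duration", "Product Related" and "Product Related Duration" represent the number of pages of each type that the visitor viewed in the session and the total time spent in each of these page categories.
The values of these features are derived from the URL information of the pages visited by the user and are updated in real time when the user takes an action, for example moving from one page to another.
The "Bounce Rate", "Exit Rate" and "Page Value" features represent metrics measured by Google Analytics for each page of the e-commerce site.
The "Bounce Rate" of a web page is the percentage of visitors who enter the site from that page and then leave ("bounce") without triggering any other request to the analytics server during that session.
The "Exit Rate" of a specific web page is calculated over all page views of that page, as the percentage that were the last view in the session.
The "Page Value" feature represents the average value of a web page that a user visited before completing an e-commerce transaction.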
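To make the bounce-rate and exit-rate definitions concrete, here is a minimal sketch on a made-up page-view log. The `views` data frame, its columns, and its values are assumptions for illustration only; the dataset itself ships these metrics precomputed, and Google Analytics may apply further rules that are not modeled here.

```r
# Hypothetical page-view log: one row per view, already ordered within each session
views <- data.frame(
  session = c(1, 1, 1, 2, 3, 3, 4),
  page    = c("home", "product", "cart", "home", "product", "home", "product"),
  stringsAsFactors = FALSE
)

# Position of each view within its session, and the length of that session
views$pos <- ave(seq_along(views$session), views$session, FUN = seq_along)
views$len <- ave(views$pos, views$session, FUN = max)

# Exit rate: share of a page's views that were the last view of a session
exit_rate <- tapply(views$pos == views$len, views$page, mean)

# Bounce rate: among sessions that entered on a page, share with a single view
entries <- views[views$pos == 1, ]
bounce_rate <- tapply(entries$len == 1, entries$page, mean)

exit_rate
bounce_rate
```

In this toy log, "cart" has no bounce rate because no session starts there, which is why the two metrics are computed over different denominators.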
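The "Special Day" feature indicates how close the visit is to a specific special day (for example Mother's Day or Valentine's Day) on which sessions are more likely to end in a transaction. Its value is determined by considering e-commerce dynamics such as the time between the order date and the delivery date. For example, for Valentine's Day this value is nonzero between February 2 and February 12, zero before and after that range unless the date is close to another special day, and reaches its maximum of 1 on February 8.
The dataset also includes the operating system, browser, region, traffic type, visitor type (returning or new visitor), a Boolean indicating whether the visit took place on a weekend, and the month of the year.
I only check the clusters against 2 categorical attributes, namely Revenue and VisitorType.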
# load the library
library(tidyverse)
## ── Attaching packages ───────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.1.0 ✔ purrr 0.3.2
## ✔ tibble 2.1.1 ✔ dplyr 0.8.1
## ✔ tidyr 0.8.3 ✔ stringr 1.4.0
## ✔ readr 1.3.1 ✔ forcats 0.4.0
## ── Conflicts ──────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(FactoMineR)
library(factoextra)
## Welcome! Related Books: `Practical Guide To Cluster Analysis in R` at https://goo.gl/13EFCZ
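# stringsAsFactors = TRUE keeps Month and VisitorType as factors (the default before R 4.0)
online <- read.csv("/Users/milin/Downloads/online_shoppers_intention.csv", stringsAsFactors = TRUE)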
head(online)
## Administrative Administrative_Duration Informational
## 1 0 0 0
## 2 0 0 0
## 3 0 0 0
## 4 0 0 0
## 5 0 0 0
## 6 0 0 0
## Informational_Duration ProductRelated ProductRelated_Duration
## 1 0 1 0.000000
## 2 0 2 64.000000
## 3 0 1 0.000000
## 4 0 2 2.666667
## 5 0 10 627.500000
## 6 0 19 154.216667
## BounceRates ExitRates PageValues SpecialDay Month OperatingSystems
## 1 0.20000000 0.2000000 0 0 Feb 1
## 2 0.00000000 0.1000000 0 0 Feb 2
## 3 0.20000000 0.2000000 0 0 Feb 4
## 4 0.05000000 0.1400000 0 0 Feb 3
## 5 0.02000000 0.0500000 0 0 Feb 3
## 6 0.01578947 0.0245614 0 0 Feb 2
## Browser Region TrafficType VisitorType Weekend Revenue
## 1 1 1 1 Returning_Visitor FALSE FALSE
## 2 2 1 2 Returning_Visitor FALSE FALSE
## 3 1 9 3 Returning_Visitor FALSE FALSE
## 4 2 2 4 Returning_Visitor FALSE FALSE
## 5 3 1 4 Returning_Visitor TRUE FALSE
## 6 2 1 3 Returning_Visitor FALSE FALSE
str(online)
## 'data.frame': 12330 obs. of 18 variables:
## $ Administrative : int 0 0 0 0 0 0 0 1 0 0 ...
## $ Administrative_Duration: num 0 0 0 0 0 0 0 0 0 0 ...
## $ Informational : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Informational_Duration : num 0 0 0 0 0 0 0 0 0 0 ...
## $ ProductRelated : int 1 2 1 2 10 19 1 0 2 3 ...
## $ ProductRelated_Duration: num 0 64 0 2.67 627.5 ...
## $ BounceRates : num 0.2 0 0.2 0.05 0.02 ...
## $ ExitRates : num 0.2 0.1 0.2 0.14 0.05 ...
## $ PageValues : num 0 0 0 0 0 0 0 0 0 0 ...
## $ SpecialDay : num 0 0 0 0 0 0 0.4 0 0.8 0.4 ...
## $ Month : Factor w/ 10 levels "Aug","Dec","Feb",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ OperatingSystems : int 1 2 4 3 3 2 2 1 2 2 ...
## $ Browser : int 1 2 1 2 3 2 4 2 2 4 ...
## $ Region : int 1 1 9 2 1 1 3 1 2 1 ...
## $ TrafficType : int 1 2 3 4 4 3 3 5 3 2 ...
## $ VisitorType : Factor w/ 3 levels "New_Visitor",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ Weekend : logi FALSE FALSE FALSE FALSE TRUE FALSE ...
## $ Revenue : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
summary(online)
## Administrative Administrative_Duration Informational
## Min. : 0.000 Min. : 0.00 Min. : 0.0000
## 1st Qu.: 0.000 1st Qu.: 0.00 1st Qu.: 0.0000
## Median : 1.000 Median : 7.50 Median : 0.0000
## Mean : 2.315 Mean : 80.82 Mean : 0.5036
## 3rd Qu.: 4.000 3rd Qu.: 93.26 3rd Qu.: 0.0000
## Max. :27.000 Max. :3398.75 Max. :24.0000
##
## Informational_Duration ProductRelated ProductRelated_Duration
## Min. : 0.00 Min. : 0.00 Min. : 0.0
## 1st Qu.: 0.00 1st Qu.: 7.00 1st Qu.: 184.1
## Median : 0.00 Median : 18.00 Median : 598.9
## Mean : 34.47 Mean : 31.73 Mean : 1194.8
## 3rd Qu.: 0.00 3rd Qu.: 38.00 3rd Qu.: 1464.2
## Max. :2549.38 Max. :705.00 Max. :63973.5
##
## BounceRates ExitRates PageValues SpecialDay
## Min. :0.000000 Min. :0.00000 Min. : 0.000 Min. :0.00000
## 1st Qu.:0.000000 1st Qu.:0.01429 1st Qu.: 0.000 1st Qu.:0.00000
## Median :0.003112 Median :0.02516 Median : 0.000 Median :0.00000
## Mean :0.022191 Mean :0.04307 Mean : 5.889 Mean :0.06143
## 3rd Qu.:0.016813 3rd Qu.:0.05000 3rd Qu.: 0.000 3rd Qu.:0.00000
## Max. :0.200000 Max. :0.20000 Max. :361.764 Max. :1.00000
##
## Month OperatingSystems Browser Region
## May :3364 Min. :1.000 Min. : 1.000 Min. :1.000
## Nov :2998 1st Qu.:2.000 1st Qu.: 2.000 1st Qu.:1.000
## Mar :1907 Median :2.000 Median : 2.000 Median :3.000
## Dec :1727 Mean :2.124 Mean : 2.357 Mean :3.147
## Oct : 549 3rd Qu.:3.000 3rd Qu.: 2.000 3rd Qu.:4.000
## Sep : 448 Max. :8.000 Max. :13.000 Max. :9.000
## (Other):1337
## TrafficType VisitorType Weekend Revenue
## Min. : 1.00 New_Visitor : 1694 Mode :logical Mode :logical
## 1st Qu.: 2.00 Other : 85 FALSE:9462 FALSE:10422
## Median : 2.00 Returning_Visitor:10551 TRUE :2868 TRUE :1908
## Mean : 4.07
## 3rd Qu.: 4.00
## Max. :20.00
##
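If we extracted principal components directly from the matrix above, the result would not be useful. This becomes clearer when we think of PCA as a variance-maximizing exercise: if we ran PCA on the unscaled data, the variance explained by the principal components would be dominated by the variables with the widest range (here, the duration columns). The biplot below therefore standardizes the variables first.
online_small <- online[1:100, 1:10]
biplot(prcomp(online_small, scale. = TRUE), cex = 0.8)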
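The effect of scaling can be sketched on synthetic data. The `toy` data frame below is an assumption made up for illustration (two independent variables whose spreads differ by a factor of 1000) and is not taken from the shoppers dataset; `var_share` is a hypothetical helper.

```r
set.seed(1)
# Synthetic data: two independent variables on very different scales
toy <- data.frame(
  small = rnorm(200, sd = 1),
  large = rnorm(200, sd = 1000)
)

# Share of total variance carried by each principal component
var_share <- function(p) p$sdev^2 / sum(p$sdev^2)

unscaled <- var_share(prcomp(toy, scale. = FALSE))
scaled   <- var_share(prcomp(toy, scale. = TRUE))

round(unscaled, 4)  # PC1 is essentially the wide-range variable
round(scaled, 4)    # after standardizing, the two PCs share the variance
```

Without scaling, PC1 mostly reproduces the `large` column; with `scale. = TRUE` neither variable dominates by virtue of its units.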
Inspect a few selected sessions:
data.frame(online[c(30,58,67,77),])
## Administrative Administrative_Duration Informational
## 30 1 6.000 1
## 58 4 56.000 2
## 67 4 44.000 0
## 77 10 1005.667 0
## Informational_Duration ProductRelated ProductRelated_Duration
## 30 0 45 1582.7500
## 58 120 36 998.7417
## 67 0 90 6951.9722
## 77 0 36 2111.3417
## BounceRates ExitRates PageValues SpecialDay Month OperatingSystems
## 30 0.043478261 0.05082126 54.17976 0.4 Feb 3
## 58 0.000000000 0.01473647 19.44708 0.2 Feb 2
## 67 0.002150538 0.01501303 0.00000 0.0 Feb 4
## 77 0.004347826 0.01449275 11.43941 0.0 Feb 2
## Browser Region TrafficType VisitorType Weekend Revenue
## 30 2 1 1 Returning_Visitor FALSE FALSE
## 58 2 4 1 Returning_Visitor FALSE FALSE
## 67 1 1 3 Returning_Visitor FALSE FALSE
## 77 6 1 2 Returning_Visitor FALSE TRUE
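Based on the biplot, we can conclude that session 58 has large Informational_Duration and Informational values, and session 30 lies close to session 58. Session 67 has a large ProductRelated_Duration; similarly, session 77 has large Administrative_Duration and ProductRelated_Duration.
So far we have used only a small subset of `online`; next we use all the data.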
onlineNum <- online[,1:10]
onlineZ <- scale(onlineNum, center = T, scale = T)
pr <- prcomp(onlineZ)
summary(pr)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.844 1.2943 1.0350 1.0054 0.97009 0.96287 0.6496
## Proportion of Variance 0.340 0.1675 0.1071 0.1011 0.09411 0.09271 0.0422
## Cumulative Proportion 0.340 0.5076 0.6147 0.7158 0.80987 0.90258 0.9448
## PC8 PC9 PC10
## Standard deviation 0.59301 0.35055 0.27858
## Proportion of Variance 0.03517 0.01229 0.00776
## Cumulative Proportion 0.97995 0.99224 1.00000
plot(pr,type = "l")
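Based on the summary and the elbow method, the number of PCs (and, as a starting point, the number of clusters) that reflects the data well:
- Up to PC5 the cumulative proportion is quite good: 0.80987.
- Using the elbow method, there is no significant change after PC3.
Considering the variance and the cumulative proportion, we will test every candidate k = 3 to 5.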
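The cumulative proportion can be recomputed by hand from the standard deviations printed by `summary(pr)`. The sketch below copies those rounded values from the output above; the 0.80 threshold and the name `n_keep` are assumptions chosen for illustration.

```r
# Standard deviations copied (rounded) from the summary(pr) output above
sdev <- c(1.844, 1.2943, 1.0350, 1.0054, 0.97009,
          0.96287, 0.6496, 0.59301, 0.35055, 0.27858)

# Each PC's variance is its squared standard deviation
prop_var <- sdev^2 / sum(sdev^2)
cum_var  <- cumsum(prop_var)

# First PC at which the cumulative proportion reaches the (assumed) 80% threshold
n_keep <- which(cum_var >= 0.80)[1]

round(cum_var, 4)
n_keep
```

With an 80% threshold this picks 5 components, matching the 0.80987 reported for PC5.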
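# 2.2 Using the PCA function
Before we move on to k-means clustering, we want to show the PCA with another function. To use the `PCA` function from FactoMineR, the qualitative variables must first be defined as factors.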
online <- online %>%
mutate(
Weekend = as.factor(Weekend),
Revenue = as.factor(Revenue),
OperatingSystems = as.factor(OperatingSystems),
Browser = as.factor(Browser),
Region = as.factor(Region),
TrafficType = as.factor(TrafficType)
)
prOnlineFacto <- PCA(online, quali.sup= c(11:18) ,scale.unit = T, graph = F)
plot(prOnlineFacto)
Plot the results:
plot.PCA(prOnlineFacto, choix = "var")
plot.PCA(prOnlineFacto, choix = "ind",habillage = 18, select = "contrib 10", invisible = "quali")
online_pca <- PCA(online, quali.sup = c(11:18), graph=F, scale.unit = T)
plot.PCA(online_pca, choix = "var")
plot.PCA(online_pca, choix = "ind",habillage = 18, select = "contrib 5", invisible = "quali")
data.frame(online[c(5153,10641),])
## Administrative Administrative_Duration Informational
## 5153 17 2629.254 24
## 10641 22 1153.682 3
## Informational_Duration ProductRelated ProductRelated_Duration
## 5153 2050.433 705 43171.233
## 10641 108.000 205 4295.305
## BounceRates ExitRates PageValues SpecialDay Month OperatingSystems
## 5153 0.004851285 0.015431438 0.763829 0 May 2
## 10641 0.001746725 0.008801049 177.528825 0 Nov 2
## Browser Region TrafficType VisitorType Weekend Revenue
## 5153 2 1 14 Returning_Visitor TRUE FALSE
## 10641 5 3 3 Returning_Visitor TRUE FALSE
Session 5153 has large Informational_Duration, ProductRelated_Duration and Administrative_Duration values. Session 10641 has lower informational values.
As mentioned above, we now look for the best k.
set.seed(100)
# k-means with 3 clusters
online_km <- kmeans(onlineZ, 3) # compare with the elbow method
online$clust <- as.factor(online_km$cluster)
head(online)
## Administrative Administrative_Duration Informational
## 1 0 0 0
## 2 0 0 0
## 3 0 0 0
## 4 0 0 0
## 5 0 0 0
## 6 0 0 0
## Informational_Duration ProductRelated ProductRelated_Duration
## 1 0 1 0.000000
## 2 0 2 64.000000
## 3 0 1 0.000000
## 4 0 2 2.666667
## 5 0 10 627.500000
## 6 0 19 154.216667
## BounceRates ExitRates PageValues SpecialDay Month OperatingSystems
## 1 0.20000000 0.2000000 0 0 Feb 1
## 2 0.00000000 0.1000000 0 0 Feb 2
## 3 0.20000000 0.2000000 0 0 Feb 4
## 4 0.05000000 0.1400000 0 0 Feb 3
## 5 0.02000000 0.0500000 0 0 Feb 3
## 6 0.01578947 0.0245614 0 0 Feb 2
## Browser Region TrafficType VisitorType Weekend Revenue clust
## 1 1 1 1 Returning_Visitor FALSE FALSE 1
## 2 2 1 2 Returning_Visitor FALSE FALSE 1
## 3 1 9 3 Returning_Visitor FALSE FALSE 1
## 4 2 2 4 Returning_Visitor FALSE FALSE 1
## 5 3 1 4 Returning_Visitor TRUE FALSE 1
## 6 2 1 3 Returning_Visitor FALSE FALSE 1
online_km$centers
## Administrative Administrative_Duration Informational
## 1 -0.2332921 -0.1986104 -0.2461845
## 2 -0.4091124 -0.3068585 -0.2469736
## 3 1.4958138 1.2493741 1.4711601
## Informational_Duration ProductRelated ProductRelated_Duration
## 1 -0.1939953 -0.2395940 -0.2202607
## 2 -0.1851529 -0.1616858 -0.1940331
## 3 1.1537882 1.3860745 1.3005984
## BounceRates ExitRates PageValues SpecialDay
## 1 0.03171517 0.05064445 -0.01870258 -0.2916209
## 2 0.27828636 0.38150678 -0.21784532 3.0961874
## 3 -0.33269469 -0.49474113 0.22740737 -0.2257802
online_km$iter
## [1] 3
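Because k-means ran on the standardized matrix `onlineZ`, the centers above are in z-score units. A minimal sketch of mapping centers back to original units follows, using the attributes that `scale()` stores on its result. The `raw` data frame is a synthetic stand-in (its column names and values are assumptions), not the shoppers data.

```r
set.seed(100)
# Synthetic stand-in for two numeric columns (hypothetical values)
raw <- data.frame(
  pages    = c(rpois(50, 5), rpois(50, 40)),
  duration = c(rexp(50, 1 / 100), rexp(50, 1 / 2000))
)

z  <- scale(raw, center = TRUE, scale = TRUE)
km <- kmeans(z, centers = 2, nstart = 10)

# Undo the standardization column by column: original = z * sd + mean
centers_raw <- sweep(km$centers, 2, attr(z, "scaled:scale"), `*`)
centers_raw <- sweep(centers_raw, 2, attr(z, "scaled:center"), `+`)

centers_raw
```

The back-transformed centers equal the per-cluster means of the raw columns, which is usually easier to read than z-scores when describing a segment.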
plot.PCA(online_pca, choix=c("ind"), label="none", col.ind= online$clust) #choix = individual
legend("topright", levels(online$clust), pch = 19, col = seq_along(levels(online$clust)))
PCA result for k = 3
Check the elbow using the wss function:
wss <- function(data, maxCluster = 10) {
  # Within-group sum of squares for each k; for k = 1 it is the total sum of squares
  SSw <- numeric(maxCluster)
  SSw[1] <- (nrow(data) - 1) * sum(apply(data, 2, var))
  for (i in 2:maxCluster) {
    SSw[i] <- sum(kmeans(data, centers = i)$withinss)
  }
  plot(1:maxCluster, SSw, type = "o", xlab = "Number of Clusters", ylab = "Within groups sum of squares", pch = 18)
}
wss(onlineZ) # method wss
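The elbow can also be read from the numbers rather than from the plot. The sketch below is an illustration on synthetic data: `blobs` is a made-up matrix of three well-separated groups, and `nstart = 10` is an assumption added here to reduce the effect of random starts (the analysis above uses the default single start).

```r
set.seed(100)
# Synthetic data: three well-separated 2-D groups of 50 points each
blobs <- rbind(
  matrix(rnorm(100, mean = 0),  ncol = 2),
  matrix(rnorm(100, mean = 6),  ncol = 2),
  matrix(rnorm(100, mean = 12), ncol = 2)
)

# k = 1: the within-group sum of squares is the total sum of squares
total_ss <- (nrow(blobs) - 1) * sum(apply(blobs, 2, var))

# k = 2..6: total within-cluster sum of squares from k-means
wss_values <- c(
  total_ss,
  vapply(2:6, function(k) kmeans(blobs, centers = k, nstart = 10)$tot.withinss,
         numeric(1))
)
names(wss_values) <- 1:6

# Reduction gained by each extra cluster; the elbow is where this collapses
drop <- -diff(wss_values)

round(wss_values, 1)
round(drop, 1)
```

Here the reduction is large up to k = 3 and small afterwards, which is the pattern the elbow plot is meant to reveal.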
There are other ways to check the best k:
fviz_nbclust(onlineZ, kmeans, method = "silhouette") # method silhouette
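For readers who want to see what the silhouette criterion measures, here is a base-R sketch of the average silhouette width. It is not how `fviz_nbclust` is implemented; `avg_silhouette` is a hypothetical helper written for this illustration, and `blobs` is synthetic data, not the shoppers dataset.

```r
# Average silhouette width: for each point, a = mean distance to its own cluster,
# b = smallest mean distance to any other cluster, s = (b - a) / max(a, b)
avg_silhouette <- function(x, cluster) {
  d   <- as.matrix(dist(x))
  ids <- sort(unique(cluster))
  s <- vapply(seq_len(nrow(d)), function(i) {
    own <- cluster == cluster[i]
    if (sum(own) == 1) return(0)            # singleton clusters score 0 by convention
    a <- sum(d[i, own]) / (sum(own) - 1)    # d[i, i] is 0, so exclude i from the count
    b <- min(vapply(setdiff(ids, cluster[i]),
                    function(k) mean(d[i, cluster == k]), numeric(1)))
    (b - a) / max(a, b)
  }, numeric(1))
  mean(s)
}

set.seed(100)
# Synthetic data: three well-separated 2-D groups
blobs <- rbind(
  matrix(rnorm(100, mean = 0),  ncol = 2),
  matrix(rnorm(100, mean = 6),  ncol = 2),
  matrix(rnorm(100, mean = 12), ncol = 2)
)

ks  <- 2:5
sil <- vapply(ks, function(k) {
  avg_silhouette(blobs, kmeans(blobs, centers = k, nstart = 10)$cluster)
}, numeric(1))

best_k <- ks[which.max(sil)]
best_k
```

The k with the highest average silhouette is the one where points sit closest to their own cluster relative to the nearest other cluster.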
Following the result of the wss function, we go on to check k = 5:
online_km5 <- kmeans(onlineZ, 5)
online_km5$clust <- as.factor(online_km5$cluster)
plot.PCA(online_pca, choix=c("ind"), label="none", col.ind=online_km5$clust) #choix = individual
legend("topright", levels(online_km5$clust), pch = 19, col = seq_along(levels(online_km5$clust)))
PCA result for k = 5
online_km5$centers
## Administrative Administrative_Duration Informational
## 1 1.310908346 0.99926824 0.37985074
## 2 -0.392526858 -0.30361589 -0.25154885
## 3 1.418098310 1.04250027 2.85570065
## 4 -0.008206934 -0.02623911 -0.09066038
## 5 -0.687222206 -0.45074395 -0.38881809
## Informational_Duration ProductRelated ProductRelated_Duration
## 1 0.05981531 0.55813884 0.47070281
## 2 -0.19196798 -0.24510112 -0.22695383
## 3 3.10802386 2.40576960 2.39539497
## 4 -0.11178168 -0.03912234 -0.01991354
## 5 -0.24492057 -0.65422899 -0.60036955
## BounceRates ExitRates PageValues SpecialDay
## 1 -0.3270197 -0.4801483 0.004715737 -0.15658862
## 2 -0.2318675 -0.1314659 -0.224938955 0.05376042
## 3 -0.3168492 -0.4735970 0.045631040 -0.16623220
## 4 -0.4010185 -0.5847576 3.498575152 -0.24543338
## 5 3.2443481 2.9667157 -0.317164982 0.17710109
online_km5$iter
## [1] 6
The values above suggest that k = 4 may be better than 5, so we try it next:
online_km4 <- kmeans(onlineZ, 4)
online_km4$clust <- as.factor(online_km4$cluster)
plot.PCA(online_pca, choix=c("ind"), label="none", col.ind=online_km4$clust) #choix = individual
legend("topright", levels(online_km4$clust), pch=19, col=1:4)
PCA result for k = 4
online_km4$centers
## Administrative Administrative_Duration Informational
## 1 1.4274242 1.0307384 2.7527566
## 2 -0.3908723 -0.3023091 -0.2520712
## 3 1.1999391 0.9131627 0.3178291
## 4 -0.6832144 -0.4490826 -0.3842200
## Informational_Duration ProductRelated ProductRelated_Duration
## 1 2.88843399 2.3721430 2.3532959
## 2 -0.19271784 -0.2376723 -0.2187825
## 3 0.03761746 0.4683853 0.3914884
## 4 -0.24429087 -0.6477109 -0.5973185
## BounceRates ExitRates PageValues SpecialDay
## 1 -0.3142597 -0.4705710 0.08422691 -0.15438352
## 2 -0.2524249 -0.1683578 -0.13622633 0.04090515
## 3 -0.3395200 -0.5021188 0.54799035 -0.18223570
## 4 3.0227322 2.8468297 -0.31716498 0.21394340
online_km4$iter
## [1] 5
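Looking at `$iter`, we see that k-means with k = 4 needed only 5 iterations to converge (the k = 3 run needed 3 and the k = 5 run needed 6). It stopped there because it had identified 4 sufficiently distinct clusters and further iterations would not improve them.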
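Because k-means starts from random centers, both the iteration count and the final clusters can change between runs. The sketch below shows, on synthetic data, how the `nstart` and `iter.max` arguments of `kmeans()` can be used to make a run more stable. The `blobs` matrix and the chosen values (10 starts, 50 iterations) are assumptions for illustration, not settings used in the analysis above.

```r
set.seed(100)
# Synthetic data: three well-separated 2-D groups of 50 points each
blobs <- rbind(
  matrix(rnorm(100, mean = 0),  ncol = 2),
  matrix(rnorm(100, mean = 6),  ncol = 2),
  matrix(rnorm(100, mean = 12), ncol = 2)
)

# nstart = 10 tries ten random initializations and keeps the best one;
# iter.max caps the number of iterations per start
fit <- kmeans(blobs, centers = 3, nstart = 10, iter.max = 50)

fit$iter          # iterations used by the best start
fit$size          # number of points per cluster
fit$tot.withinss  # objective that k-means minimizes
```

A small `iter` only says that the assignments stopped changing quickly; comparing `tot.withinss` across values of k or across starts says more about cluster quality.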
# 4.1 Combining PCA and k-means clustering with the factoextra package
fviz_screeplot(online_pca, addlabels = TRUE, ylim = c(0, 50))
var_pca <- get_pca_var(online_pca)
var_pca
## Principal Component Analysis Results for variables
## ===================================================
## Name Description
## 1 "$coord" "Coordinates for the variables"
## 2 "$cor" "Correlations between variables and dimensions"
## 3 "$cos2" "Cos2 for the variables"
## 4 "$contrib" "contributions of the variables"
head(var_pca$coord)
## Dim.1 Dim.2 Dim.3 Dim.4
## Administrative 0.7050389 0.06793628 0.2645375 0.3110102
## Administrative_Duration 0.6069612 0.13871617 0.3323191 0.3665404
## Informational 0.6410573 0.36478844 0.1564255 -0.4746090
## Informational_Duration 0.5454962 0.39458007 0.1462845 -0.6015172
## ProductRelated 0.7588367 0.19252869 -0.4088971 0.2489076
## ProductRelated_Duration 0.7624212 0.24605368 -0.3753875 0.2159479
## Dim.5
## Administrative -0.287321899
## Administrative_Duration -0.378159791
## Informational -0.027365516
## Informational_Duration 0.002172933
## ProductRelated 0.272972429
## ProductRelated_Duration 0.269586543
head(var_pca$contrib)
## Dim.1 Dim.2 Dim.3 Dim.4
## Administrative 14.618345 0.2755127 6.532303 9.569755
## Administrative_Duration 10.834125 1.1486620 10.308663 13.292155
## Informational 12.085531 7.9436515 2.284058 22.285560
## Informational_Duration 8.750957 9.2941218 1.997507 35.797086
## ProductRelated 16.934356 2.2127327 15.607020 6.129542
## ProductRelated_Duration 17.094719 3.6140803 13.153804 4.613704
## Dim.5
## Administrative 8.772263e+00
## Administrative_Duration 1.519585e+01
## Informational 7.957589e-02
## Informational_Duration 5.017262e-04
## ProductRelated 7.917932e+00
## ProductRelated_Duration 7.722726e+00
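The `$coord` and `$contrib` tables are linked by a simple formula. Assuming the usual definitions for a standardized PCA, a variable's coordinate on a dimension is its correlation with that dimension, and its contribution is its squared coordinate as a percentage of the column total. For example, for Administrative on Dim.1, 0.7050389² / 1.844² is about 0.146, in line with the 14.6 shown above. The sketch below checks the relationship with base R's `prcomp` on the built-in `USArrests` data, used here only as a stand-in.

```r
# Standardized PCA on a built-in dataset (stand-in for the shoppers data)
p <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Variable coordinates: loadings scaled by each component's standard deviation
coord <- sweep(p$rotation, 2, p$sdev, `*`)

# Contributions: squared coordinates as a percentage of each column's total
contrib <- 100 * sweep(coord^2, 2, colSums(coord^2), `/`)

round(coord, 3)
round(contrib, 2)
colSums(contrib)  # each dimension's contributions add up to 100
```

A variable with a contribution above 100 / (number of variables) contributes more than average to that dimension, which is the reference line that contribution bar charts typically draw.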
# Graph of variables: default plot
fviz_pca_var(online_pca, col.var = "black")
# Control variable colors using their contributions
fviz_pca_var(online_pca, col.var="contrib",
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE # Avoid text overlapping
)
# Contributions of variables to PC1
fviz_contrib(online_pca, choice = "var", axes = 1, top = 10)
# Contributions of variables to PC2
fviz_contrib(online_pca, choice = "var", axes = 2, top = 10)
# Results for individuals (sessions)
ind_pca <- get_pca_ind(online_pca)
ind_pca
head(ind_pca$coord)
head(ind_pca$contrib)
# Graph of individuals
# 1. Use repel = TRUE to avoid overplotting
# 2. Control automatically the color of individuals using the cos2
# cos2 = the quality of the individuals on the factor map
# Use points only
# 3. Use gradient color
fviz_pca_ind(online_pca, col.ind = "cos2",
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE # Avoid text overlapping (slow if many points)
)
fviz result
Groups by Revenue:
fviz_pca_ind(online_pca,
label = "none", # hide individual labels
habillage = online$Revenue, # color by groups
palette = c("#00AFBB", "#E7B800", "#FC4E07"),
addEllipses = TRUE # Concentration ellipses
)
Groups by visitor type:
fviz_pca_ind(online_pca,
label = "none", # hide individual labels
habillage = online$VisitorType, # color by groups
palette = c("#00AFBB", "#E7B800", "#FC4E07"),
addEllipses = TRUE # Concentration ellipses
)
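Besides coloring the PCA map, cluster labels can be compared with a categorical attribute in a cross-tabulation. The sketch below uses made-up vectors: `cluster` and `revenue` are hypothetical stand-ins for `online$clust` and `online$Revenue`, so the numbers say nothing about the real data.

```r
# Hypothetical stand-ins: cluster labels and a purchase flag for 12 sessions
cluster <- factor(c(1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3))
revenue <- c(FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE,
             TRUE, TRUE, FALSE, TRUE, TRUE)

# Counts of sessions per cluster and purchase outcome
counts <- table(cluster, revenue)

# Row-wise shares: purchase rate within each cluster
share <- prop.table(counts, margin = 1)

counts
round(share, 2)
```

Applied to the real columns, the same two calls would show whether any cluster concentrates purchasing sessions or a particular visitor type.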
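Based on the methods above, we can conclude:
- The best k is 4.
- PCA and k-means can be used together to explore this data.
- With factoextra, the groupings by Revenue and VisitorType in the examples above can be seen clearly.
- Detailed explanations can be found above, in the PCA and k-means subsections.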