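# 1 E-commerce data

I am a business analyst in e-commerce and want to analyze the behavior of customers who visit our company's website. The goal is to use PCA combined with k-means clustering to observe and analyze online shoppers' purchasing intention.

Dataset information:

The dataset consists of feature vectors belonging to 12,330 sessions. It was formed so that each session belongs to a different user within a 1-year period, to avoid any tendency toward a specific campaign, special day, user profile, or period.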

https://archive.ics.uci.edu/ml/datasets/Online+Shoppers+Purchasing+Intention+Dataset

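The dataset consists of 10 numerical and 8 categorical attributes:

The "Revenue" attribute can be used as the class label.

"Administrative", "Administrative Duration", "Informational", "Informational Duration", "Product Related" and "Product Related Duration" represent the number of pages of each type visited by the visitor in that session and the total time spent in each of these page categories.

The values of these features are derived from the URL information of the pages visited by the user and are updated in real time when the user takes an action, e.g. moving from one page to another.

The "Bounce Rate", "Exit Rate" and "Page Value" features represent the metrics measured by Google Analytics for each page of the e-commerce site.

The "Bounce Rate" of a web page is the percentage of visitors who enter the site from that page and then leave ("bounce") without triggering any other request to the analytics server during that session.

The "Exit Rate" of a specific web page is calculated over all pageviews of that page, as the percentage that were the last in the session.

The "Page Value" feature represents the average value of a web page that a user visited before completing an e-commerce transaction.

The "Special Day" feature indicates the closeness of the site visiting time to a specific special day (e.g. Mother's Day, Valentine's Day) on which sessions are more likely to end in a transaction. The value of this attribute is determined by considering the dynamics of e-commerce, such as the duration between the order date and the delivery date. For example, for Valentine's Day this value is nonzero between February 2 and February 12, zero before and after these dates unless it is close to another special day, and reaches its maximum value of 1 on February 8.

The dataset also includes the operating system, browser, region, traffic type, visitor type (returning or new visitor), a Boolean value indicating whether the date of the visit is a weekend, and the month of the year.

I will only check the clusters against 2 categories: Revenue and VisitorType.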

# load the library
library(tidyverse)
## ── Attaching packages ───────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.1.0     ✔ purrr   0.3.2
## ✔ tibble  2.1.1     ✔ dplyr   0.8.1
## ✔ tidyr   0.8.3     ✔ stringr 1.4.0
## ✔ readr   1.3.1     ✔ forcats 0.4.0
## ── Conflicts ──────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(FactoMineR)
library(factoextra)
## Welcome! Related Books: `Practical Guide To Cluster Analysis in R` at https://goo.gl/13EFCZ
online <- read.csv("/Users/milin/Downloads/online_shoppers_intention.csv")
head(online)
##   Administrative Administrative_Duration Informational
## 1              0                       0             0
## 2              0                       0             0
## 3              0                       0             0
## 4              0                       0             0
## 5              0                       0             0
## 6              0                       0             0
##   Informational_Duration ProductRelated ProductRelated_Duration
## 1                      0              1                0.000000
## 2                      0              2               64.000000
## 3                      0              1                0.000000
## 4                      0              2                2.666667
## 5                      0             10              627.500000
## 6                      0             19              154.216667
##   BounceRates ExitRates PageValues SpecialDay Month OperatingSystems
## 1  0.20000000 0.2000000          0          0   Feb                1
## 2  0.00000000 0.1000000          0          0   Feb                2
## 3  0.20000000 0.2000000          0          0   Feb                4
## 4  0.05000000 0.1400000          0          0   Feb                3
## 5  0.02000000 0.0500000          0          0   Feb                3
## 6  0.01578947 0.0245614          0          0   Feb                2
##   Browser Region TrafficType       VisitorType Weekend Revenue
## 1       1      1           1 Returning_Visitor   FALSE   FALSE
## 2       2      1           2 Returning_Visitor   FALSE   FALSE
## 3       1      9           3 Returning_Visitor   FALSE   FALSE
## 4       2      2           4 Returning_Visitor   FALSE   FALSE
## 5       3      1           4 Returning_Visitor    TRUE   FALSE
## 6       2      1           3 Returning_Visitor   FALSE   FALSE
str(online)
## 'data.frame':    12330 obs. of  18 variables:
##  $ Administrative         : int  0 0 0 0 0 0 0 1 0 0 ...
##  $ Administrative_Duration: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Informational          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Informational_Duration : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ ProductRelated         : int  1 2 1 2 10 19 1 0 2 3 ...
##  $ ProductRelated_Duration: num  0 64 0 2.67 627.5 ...
##  $ BounceRates            : num  0.2 0 0.2 0.05 0.02 ...
##  $ ExitRates              : num  0.2 0.1 0.2 0.14 0.05 ...
##  $ PageValues             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SpecialDay             : num  0 0 0 0 0 0 0.4 0 0.8 0.4 ...
##  $ Month                  : Factor w/ 10 levels "Aug","Dec","Feb",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ OperatingSystems       : int  1 2 4 3 3 2 2 1 2 2 ...
##  $ Browser                : int  1 2 1 2 3 2 4 2 2 4 ...
##  $ Region                 : int  1 1 9 2 1 1 3 1 2 1 ...
##  $ TrafficType            : int  1 2 3 4 4 3 3 5 3 2 ...
##  $ VisitorType            : Factor w/ 3 levels "New_Visitor",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ Weekend                : logi  FALSE FALSE FALSE FALSE TRUE FALSE ...
##  $ Revenue                : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
summary(online)
##  Administrative   Administrative_Duration Informational    
##  Min.   : 0.000   Min.   :   0.00         Min.   : 0.0000  
##  1st Qu.: 0.000   1st Qu.:   0.00         1st Qu.: 0.0000  
##  Median : 1.000   Median :   7.50         Median : 0.0000  
##  Mean   : 2.315   Mean   :  80.82         Mean   : 0.5036  
##  3rd Qu.: 4.000   3rd Qu.:  93.26         3rd Qu.: 0.0000  
##  Max.   :27.000   Max.   :3398.75         Max.   :24.0000  
##                                                            
##  Informational_Duration ProductRelated   ProductRelated_Duration
##  Min.   :   0.00        Min.   :  0.00   Min.   :    0.0        
##  1st Qu.:   0.00        1st Qu.:  7.00   1st Qu.:  184.1        
##  Median :   0.00        Median : 18.00   Median :  598.9        
##  Mean   :  34.47        Mean   : 31.73   Mean   : 1194.8        
##  3rd Qu.:   0.00        3rd Qu.: 38.00   3rd Qu.: 1464.2        
##  Max.   :2549.38        Max.   :705.00   Max.   :63973.5        
##                                                                 
##   BounceRates         ExitRates         PageValues        SpecialDay     
##  Min.   :0.000000   Min.   :0.00000   Min.   :  0.000   Min.   :0.00000  
##  1st Qu.:0.000000   1st Qu.:0.01429   1st Qu.:  0.000   1st Qu.:0.00000  
##  Median :0.003112   Median :0.02516   Median :  0.000   Median :0.00000  
##  Mean   :0.022191   Mean   :0.04307   Mean   :  5.889   Mean   :0.06143  
##  3rd Qu.:0.016813   3rd Qu.:0.05000   3rd Qu.:  0.000   3rd Qu.:0.00000  
##  Max.   :0.200000   Max.   :0.20000   Max.   :361.764   Max.   :1.00000  
##                                                                          
##      Month      OperatingSystems    Browser           Region     
##  May    :3364   Min.   :1.000    Min.   : 1.000   Min.   :1.000  
##  Nov    :2998   1st Qu.:2.000    1st Qu.: 2.000   1st Qu.:1.000  
##  Mar    :1907   Median :2.000    Median : 2.000   Median :3.000  
##  Dec    :1727   Mean   :2.124    Mean   : 2.357   Mean   :3.147  
##  Oct    : 549   3rd Qu.:3.000    3rd Qu.: 2.000   3rd Qu.:4.000  
##  Sep    : 448   Max.   :8.000    Max.   :13.000   Max.   :9.000  
##  (Other):1337                                                    
##   TrafficType               VisitorType     Weekend         Revenue       
##  Min.   : 1.00   New_Visitor      : 1694   Mode :logical   Mode :logical  
##  1st Qu.: 2.00   Other            :   85   FALSE:9462      FALSE:10422    
##  Median : 2.00   Returning_Visitor:10551   TRUE :2868      TRUE :1908     
##  Mean   : 4.07                                                            
##  3rd Qu.: 4.00                                                            
##  Max.   :20.00                                                            
## 

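If we extracted the principal components directly from the matrix above, the result would not be useful. This becomes clearer when we think of PCA as a variance-maximizing exercise: when PCA is run on the unscaled data above, the amount of variance explained by the different principal components is dominated by the variables with the larger ranges.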
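The effect of scaling can be seen on a small synthetic example. The sketch below is not part of the original analysis: the data frame `toy` and its two columns are made up for illustration, with one variable on a much larger scale than the other.

```r
# Synthetic data (hypothetical): two independent variables on very different scales
set.seed(1)
toy <- data.frame(
  pages    = rnorm(200, mean = 10,   sd = 2),    # small range
  duration = rnorm(200, mean = 1000, sd = 500)   # large range
)

pca_raw    <- prcomp(toy, scale. = FALSE)  # PCA on the raw values
pca_scaled <- prcomp(toy, scale. = TRUE)   # PCA on standardized values

# Share of total variance captured by each component
prop_raw    <- pca_raw$sdev^2 / sum(pca_raw$sdev^2)
prop_scaled <- pca_scaled$sdev^2 / sum(pca_scaled$sdev^2)

round(prop_raw, 3)
round(prop_scaled, 3)
```

Without scaling, the first component is essentially the large-range variable; with `scale. = TRUE` both variables enter on an equal footing.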

Principal component analysis

online_small <- online[1:100,1:10]
biplot(prcomp(online_small, scale. = TRUE), cex = 0.8)

Selecting a few observations

data.frame(online[c(30,58,67,77),])
##    Administrative Administrative_Duration Informational
## 30              1                   6.000             1
## 58              4                  56.000             2
## 67              4                  44.000             0
## 77             10                1005.667             0
##    Informational_Duration ProductRelated ProductRelated_Duration
## 30                      0             45               1582.7500
## 58                    120             36                998.7417
## 67                      0             90               6951.9722
## 77                      0             36               2111.3417
##    BounceRates  ExitRates PageValues SpecialDay Month OperatingSystems
## 30 0.043478261 0.05082126   54.17976        0.4   Feb                3
## 58 0.000000000 0.01473647   19.44708        0.2   Feb                2
## 67 0.002150538 0.01501303    0.00000        0.0   Feb                4
## 77 0.004347826 0.01449275   11.43941        0.0   Feb                2
##    Browser Region TrafficType       VisitorType Weekend Revenue
## 30       2      1           1 Returning_Visitor   FALSE   FALSE
## 58       2      4           1 Returning_Visitor   FALSE   FALSE
## 67       1      1           3 Returning_Visitor   FALSE   FALSE
## 77       6      1           2 Returning_Visitor   FALSE    TRUE

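Based on the biplot, we can conclude that observation 58 has large Informational_Duration and Informational values, and that observation 30 is fairly similar to it. Observation 67 has a large ProductRelated_Duration; similarly, observation 77 has large Administrative_Duration and ProductRelated_Duration values.

So far we have used only a small subset of `online`; next we will use all of the data.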

onlineNum <- online[,1:10]
onlineZ <- scale(onlineNum, center = T, scale = T)

pr <- prcomp(onlineZ)
summary(pr)
## Importance of components:
##                          PC1    PC2    PC3    PC4     PC5     PC6    PC7
## Standard deviation     1.844 1.2943 1.0350 1.0054 0.97009 0.96287 0.6496
## Proportion of Variance 0.340 0.1675 0.1071 0.1011 0.09411 0.09271 0.0422
## Cumulative Proportion  0.340 0.5076 0.6147 0.7158 0.80987 0.90258 0.9448
##                            PC8     PC9    PC10
## Standard deviation     0.59301 0.35055 0.27858
## Proportion of Variance 0.03517 0.01229 0.00776
## Cumulative Proportion  0.97995 0.99224 1.00000
plot(pr,type = "l")

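Based on the summary and the elbow method, we pick the number of PCs that reflects the data and use it as a guide for the number of clusters:

- Up to PC5 the cumulative proportion is fairly good: 0.80987.
- With the elbow method, there is no significant change after PC3.

Considering the variance and the cumulative proportion, we will test all the possibilities k = 3 to 5.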
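The cumulative proportion reported by `summary()` can also be computed directly from the `sdev` element of a `prcomp` object. The following sketch uses the built-in `USArrests` data as a stand-in for `onlineZ`, and the 80% threshold is an assumption chosen to mirror the reasoning above.

```r
# Stand-in data: USArrests ships with R; replace with onlineZ in the real analysis
pr_demo <- prcomp(USArrests, scale. = TRUE)

# Proportion of variance per component and its running total
var_explained <- pr_demo$sdev^2 / sum(pr_demo$sdev^2)
cum_var <- cumsum(var_explained)

# Smallest number of components reaching the (assumed) 80% threshold
n_pc <- which(cum_var >= 0.80)[1]

round(rbind(var_explained, cum_var), 4)
n_pc
```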

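# 2.2 Using the PCA function

Before moving on to k-means clustering, we want to display the PCA with another function. To use the `PCA` function, the qualitative variables need to be defined as factors.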

online <- online %>% 
  mutate(
    Weekend = as.factor(Weekend),
    Revenue = as.factor(Revenue),
    OperatingSystems = as.factor(OperatingSystems),
    Browser = as.factor(Browser),
    Region = as.factor(Region),
    TrafficType = as.factor(TrafficType)
  )

prOnlineFacto <- PCA(online, quali.sup= c(11:18) ,scale.unit = T, graph = F)
plot(prOnlineFacto)

Plotting the graphs

plot.PCA(prOnlineFacto, choix = "var")

plot.PCA(prOnlineFacto, choix = "ind",habillage = 18, select = "contrib 10", invisible = "quali")

online_pca <- PCA(online, quali.sup = c(11:18), graph=F, scale.unit = T)
plot.PCA(online_pca, choix = "var")

plot.PCA(online_pca, choix = "ind",habillage = 18, select = "contrib 5", invisible = "quali")

data.frame(online[c(5153,10641),])
##       Administrative Administrative_Duration Informational
## 5153              17                2629.254            24
## 10641             22                1153.682             3
##       Informational_Duration ProductRelated ProductRelated_Duration
## 5153                2050.433            705               43171.233
## 10641                108.000            205                4295.305
##       BounceRates   ExitRates PageValues SpecialDay Month OperatingSystems
## 5153  0.004851285 0.015431438   0.763829          0   May                2
## 10641 0.001746725 0.008801049 177.528825          0   Nov                2
##       Browser Region TrafficType       VisitorType Weekend Revenue
## 5153        2      1          14 Returning_Visitor    TRUE   FALSE
## 10641       5      3           3 Returning_Visitor    TRUE   FALSE

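Observation 5153 has large Informational_Duration, ProductRelated_Duration and Administrative_Duration values. Observation 10641 has lower informational values.

# 3 k-means clustering

As discussed above, we will now look for the optimal k.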

set.seed(100)
# k-means with 3 clusters
online_km <- kmeans(onlineZ, 3) # compare with the elbow method
online$clust <- as.factor(online_km$cluster)
head(online)
##   Administrative Administrative_Duration Informational
## 1              0                       0             0
## 2              0                       0             0
## 3              0                       0             0
## 4              0                       0             0
## 5              0                       0             0
## 6              0                       0             0
##   Informational_Duration ProductRelated ProductRelated_Duration
## 1                      0              1                0.000000
## 2                      0              2               64.000000
## 3                      0              1                0.000000
## 4                      0              2                2.666667
## 5                      0             10              627.500000
## 6                      0             19              154.216667
##   BounceRates ExitRates PageValues SpecialDay Month OperatingSystems
## 1  0.20000000 0.2000000          0          0   Feb                1
## 2  0.00000000 0.1000000          0          0   Feb                2
## 3  0.20000000 0.2000000          0          0   Feb                4
## 4  0.05000000 0.1400000          0          0   Feb                3
## 5  0.02000000 0.0500000          0          0   Feb                3
## 6  0.01578947 0.0245614          0          0   Feb                2
##   Browser Region TrafficType       VisitorType Weekend Revenue clust
## 1       1      1           1 Returning_Visitor   FALSE   FALSE     1
## 2       2      1           2 Returning_Visitor   FALSE   FALSE     1
## 3       1      9           3 Returning_Visitor   FALSE   FALSE     1
## 4       2      2           4 Returning_Visitor   FALSE   FALSE     1
## 5       3      1           4 Returning_Visitor    TRUE   FALSE     1
## 6       2      1           3 Returning_Visitor   FALSE   FALSE     1
online_km$centers
##   Administrative Administrative_Duration Informational
## 1     -0.2332921              -0.1986104    -0.2461845
## 2     -0.4091124              -0.3068585    -0.2469736
## 3      1.4958138               1.2493741     1.4711601
##   Informational_Duration ProductRelated ProductRelated_Duration
## 1             -0.1939953     -0.2395940              -0.2202607
## 2             -0.1851529     -0.1616858              -0.1940331
## 3              1.1537882      1.3860745               1.3005984
##   BounceRates   ExitRates  PageValues SpecialDay
## 1  0.03171517  0.05064445 -0.01870258 -0.2916209
## 2  0.27828636  0.38150678 -0.21784532  3.0961874
## 3 -0.33269469 -0.49474113  0.22740737 -0.2257802
online_km$iter
## [1] 3
plot.PCA(online_pca, choix=c("ind"), label="none", col.ind= online$clust) #choix = individual
legend("topright", levels(online$clust), pch=19, col=seq_along(levels(online$clust)))

PCA result for k = 3
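A plot shows how the clusters separate, but a cross-tabulation makes it easier to check clusters against a label such as `Revenue` or `VisitorType`. The sketch below uses the built-in `iris` data as a stand-in (the names `z`, `km` and `tab` are illustrative); with the data in this post, the analogous call would be `table(online$clust, online$Revenue)`.

```r
# Stand-in data: cluster the scaled iris measurements into 3 groups
set.seed(100)
z  <- scale(iris[, 1:4])
km <- kmeans(z, centers = 3, nstart = 25)

# Rows are clusters, columns are the known label
tab <- table(cluster = km$cluster, label = iris$Species)
tab

# Share of each label within a cluster (rows sum to 1)
round(prop.table(tab, margin = 1), 2)
```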

Checking the elbow with a wss function:

wss <- function(data, maxCluster = 10) {
    # Within sum of squares for k = 1 is the total sum of squares
    SSw <- numeric(maxCluster)
    SSw[1] <- (nrow(data) - 1) * sum(apply(data, 2, var))
    # Within sum of squares for k = 2, ..., maxCluster
    for (i in 2:maxCluster) {
        SSw[i] <- sum(kmeans(data, centers = i)$withinss)
    }
    plot(1:maxCluster, SSw, type = "o", xlab = "Number of Clusters", ylab = "Within groups sum of squares", pch=18)
}
wss(onlineZ) # method wss

There are other ways to check the optimal k:

fviz_nbclust(onlineZ, kmeans, method = "silhouette") # method silhouette
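For reference, the average silhouette width that this method relies on can be computed by hand with `silhouette()` from the cluster package. The sketch below runs on synthetic, well-separated blobs (the helper `make_blob` and all object names are hypothetical), not on the shopper data; `nstart = 10` is an arbitrary choice.

```r
library(cluster)

# Synthetic data (hypothetical): three compact, well-separated groups
set.seed(42)
make_blob <- function(n, cx, cy) cbind(rnorm(n, cx, 0.3), rnorm(n, cy, 0.3))
blobs <- rbind(make_blob(50, 0, 0), make_blob(50, 5, 5), make_blob(50, 0, 5))
d <- dist(blobs)

# Average silhouette width for each candidate k
ks <- 2:6
avg_sil <- sapply(ks, function(k) {
  km <- kmeans(blobs, centers = k, nstart = 10)
  mean(silhouette(km$cluster, d)[, "sil_width"])
})

# The k with the largest average silhouette width
best_k <- ks[which.max(avg_sil)]
round(avg_sil, 3)
best_k
```

Note that `dist()` builds the full pairwise distance matrix, so this by-hand version becomes memory-hungry on data with many thousands of rows.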

Following the result of the wss function, we go on to check k = 5:

online_km5 <- kmeans(onlineZ, 5)
online_km5$clust <- as.factor(online_km5$cluster)
plot.PCA(online_pca, choix=c("ind"), label="none", col.ind=online_km5$clust) #choix = individual
legend("topright", levels(online_km5$clust), pch=19, col=seq_along(levels(online_km5$clust)))

PCA result for k = 5

online_km5$centers
##   Administrative Administrative_Duration Informational
## 1    1.310908346              0.99926824    0.37985074
## 2   -0.392526858             -0.30361589   -0.25154885
## 3    1.418098310              1.04250027    2.85570065
## 4   -0.008206934             -0.02623911   -0.09066038
## 5   -0.687222206             -0.45074395   -0.38881809
##   Informational_Duration ProductRelated ProductRelated_Duration
## 1             0.05981531     0.55813884              0.47070281
## 2            -0.19196798    -0.24510112             -0.22695383
## 3             3.10802386     2.40576960              2.39539497
## 4            -0.11178168    -0.03912234             -0.01991354
## 5            -0.24492057    -0.65422899             -0.60036955
##   BounceRates  ExitRates   PageValues  SpecialDay
## 1  -0.3270197 -0.4801483  0.004715737 -0.15658862
## 2  -0.2318675 -0.1314659 -0.224938955  0.05376042
## 3  -0.3168492 -0.4735970  0.045631040 -0.16623220
## 4  -0.4010185 -0.5847576  3.498575152 -0.24543338
## 5   3.2443481  2.9667157 -0.317164982  0.17710109
online_km5$iter
## [1] 6

The values above suggest that k = 4 may be better than 5, so we try it next:

online_km4 <- kmeans(onlineZ, 4)
online_km4$clust <- as.factor(online_km4$cluster)
plot.PCA(online_pca, choix=c("ind"), label="none", col.ind=online_km4$clust) #choix = individual
legend("topright", levels(online_km4$clust), pch=19, col=1:4)

PCA result for k = 4

online_km4$centers
##   Administrative Administrative_Duration Informational
## 1      1.4274242               1.0307384     2.7527566
## 2     -0.3908723              -0.3023091    -0.2520712
## 3      1.1999391               0.9131627     0.3178291
## 4     -0.6832144              -0.4490826    -0.3842200
##   Informational_Duration ProductRelated ProductRelated_Duration
## 1             2.88843399      2.3721430               2.3532959
## 2            -0.19271784     -0.2376723              -0.2187825
## 3             0.03761746      0.4683853               0.3914884
## 4            -0.24429087     -0.6477109              -0.5973185
##   BounceRates  ExitRates  PageValues  SpecialDay
## 1  -0.3142597 -0.4705710  0.08422691 -0.15438352
## 2  -0.2524249 -0.1683578 -0.13622633  0.04090515
## 3  -0.3395200 -0.5021188  0.54799035 -0.18223570
## 4   3.0227322  2.8468297 -0.31716498  0.21394340
online_km4$iter
## [1] 5

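Looking at `$iter`, we see that k-means needed only 5 iterations to converge for k = 4 (compared with 3 for k = 3 and 6 for k = 5): it has identified 4 sufficiently distinct clusters, and further iterations would not improve the result.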
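Since `kmeans()` starts from random centers, a single run per k can be misleading. A hedged sketch of a more systematic comparison follows, using the built-in `iris` data as a stand-in for `onlineZ` and `nstart = 25` as an arbitrary choice; the object names are illustrative.

```r
# Stand-in data: scaled iris measurements; replace with onlineZ in the real analysis
set.seed(100)
z <- scale(iris[, 1:4])

# Fit k-means for each candidate k, keeping the best of 25 random starts
ks <- 3:5
fits <- lapply(ks, function(k) kmeans(z, centers = k, nstart = 25))

# Collect iterations, total within-cluster SS and the between/total SS ratio
comparison <- data.frame(
  k             = ks,
  iter          = sapply(fits, function(f) f$iter),
  tot_withinss  = sapply(fits, function(f) f$tot.withinss),
  between_ratio = sapply(fits, function(f) f$betweenss / f$totss)
)
comparison
```

The within-cluster sum of squares generally keeps falling as k grows, so the table is meant to be read for diminishing returns rather than for a minimum.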

# 4 Additional

# 4.1 Combining PCA and k-means clustering with the factoextra package

fviz_screeplot(online_pca, addlabels = TRUE, ylim = c(0, 50))

var_pca <- get_pca_var(online_pca)
var_pca
## Principal Component Analysis Results for variables
##  ===================================================
##   Name       Description                                    
## 1 "$coord"   "Coordinates for the variables"                
## 2 "$cor"     "Correlations between variables and dimensions"
## 3 "$cos2"    "Cos2 for the variables"                       
## 4 "$contrib" "contributions of the variables"
head(var_pca$coord)
##                             Dim.1      Dim.2      Dim.3      Dim.4
## Administrative          0.7050389 0.06793628  0.2645375  0.3110102
## Administrative_Duration 0.6069612 0.13871617  0.3323191  0.3665404
## Informational           0.6410573 0.36478844  0.1564255 -0.4746090
## Informational_Duration  0.5454962 0.39458007  0.1462845 -0.6015172
## ProductRelated          0.7588367 0.19252869 -0.4088971  0.2489076
## ProductRelated_Duration 0.7624212 0.24605368 -0.3753875  0.2159479
##                                Dim.5
## Administrative          -0.287321899
## Administrative_Duration -0.378159791
## Informational           -0.027365516
## Informational_Duration   0.002172933
## ProductRelated           0.272972429
## ProductRelated_Duration  0.269586543
head(var_pca$contrib)
##                             Dim.1     Dim.2     Dim.3     Dim.4
## Administrative          14.618345 0.2755127  6.532303  9.569755
## Administrative_Duration 10.834125 1.1486620 10.308663 13.292155
## Informational           12.085531 7.9436515  2.284058 22.285560
## Informational_Duration   8.750957 9.2941218  1.997507 35.797086
## ProductRelated          16.934356 2.2127327 15.607020  6.129542
## ProductRelated_Duration 17.094719 3.6140803 13.153804  4.613704
##                                Dim.5
## Administrative          8.772263e+00
## Administrative_Duration 1.519585e+01
## Informational           7.957589e-02
## Informational_Duration  5.017262e-04
## ProductRelated          7.917932e+00
## ProductRelated_Duration 7.722726e+00
# Graph of variables: default plot
fviz_pca_var(online_pca, col.var = "black")

# Control variable colors using their contributions
fviz_pca_var(online_pca, col.var="contrib",
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel = TRUE # Avoid text overlapping
             )

# Contributions of variables to PC1
fviz_contrib(online_pca, choice = "var", axes = 1, top = 10)

# Contributions of variables to PC2
fviz_contrib(online_pca, choice = "var", axes = 2, top = 10)
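The contributions plotted by `fviz_contrib()` are the squared coordinates of each variable on a component, expressed as a percentage of that component's total. For example, `Administrative` has coordinate 0.705 on Dim.1 and PC1 has standard deviation 1.844, giving 0.705^2 / 1.844^2 ≈ 14.6%, in line with the `$contrib` output above. A minimal sketch of the same calculation in base R, using the built-in `USArrests` data as a stand-in:

```r
# Stand-in data: USArrests ships with R
pr_demo <- prcomp(USArrests, scale. = TRUE)

# Variable coordinates: loadings scaled by the component standard deviations
coord <- sweep(pr_demo$rotation, 2, pr_demo$sdev, "*")

# cos2 is the squared coordinate; the contribution is cos2 as a share of the column total
cos2    <- coord^2
contrib <- sweep(cos2, 2, colSums(cos2), "/") * 100

round(contrib[, 1:2], 2)
colSums(contrib)  # the contributions on each component add up to 100
```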

# Results for individuals
ind_pca <- get_pca_ind(online_pca)
ind_pca
head(ind_pca$coord)
head(ind_pca$contrib)
# Graph of individuals
# 1. Use repel = TRUE to avoid overplotting
# 2. Control automatically the color of individuals using the cos2
    # cos2 = the quality of the individuals on the factor map
    # Use points only
# 3. Use gradient color
fviz_pca_ind(online_pca, col.ind = "cos2",
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel = TRUE # Avoid text overlapping (slow if many points)
             )

fviz result

Groups by Revenue:

fviz_pca_ind(online_pca,
             label = "none", # hide individual labels
             habillage = online$Revenue, # color by groups
             palette = c("#00AFBB", "#E7B800", "#FC4E07"),
             addEllipses = TRUE # Concentration ellipses
             )

Groups by VisitorType:

fviz_pca_ind(online_pca,
             label = "none", # hide individual labels
             habillage = online$VisitorType, # color by groups
             palette = c("#00AFBB", "#E7B800", "#FC4E07"),
             addEllipses = TRUE # Concentration ellipses
             )

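# 5 Conclusion

Based on the methods above, we can conclude:

- The optimal k is 4.
- This data can be examined with PCA and k-means.
- With factoextra, we can clearly see how the data group in the examples above, by Revenue and by VisitorType.
- Detailed explanations can be found above, in the PCA and k-means subsections.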