1 E-Commerce Data

I am a business analyst at an e-commerce company, and I would like to analyze the behaviour of customers who visit our company website. The objective is to observe and analyze online shoppers' purchase intention by combining PCA and k-means clustering.

1.1 Data Preparation

Data Set Information:

The dataset consists of feature vectors belonging to 12,330 sessions. The dataset was formed so that each session would belong to a different user in a 1-year period to avoid any tendency to a specific campaign, special day, user profile, or period.

https://archive.ics.uci.edu/ml/datasets/Online+Shoppers+Purchasing+Intention+Dataset

1.2 Attribute Information:

The dataset consists of 10 numerical and 8 categorical attributes:

The ‘Revenue’ attribute can be used as the class label.
“Administrative”, “Administrative Duration”, “Informational”, “Informational Duration”, “Product Related” and “Product Related Duration” represent the number of different types of pages visited by the visitor in that session and total time spent in each of these page categories.
The values of these features are derived from the URL information of the pages visited by the user and updated in real time when a user takes an action, e.g. moving from one page to another.
“Bounce Rate”, “Exit Rate” and “Page Value” features represent the metrics measured by “Google Analytics” for each page in the e-commerce site.
The value of “Bounce Rate” feature for a web page refers to the percentage of visitors who enter the site from that page and then leave (“bounce”) without triggering any other requests to the analytics server during that session.
The value of the “Exit Rate” feature for a specific web page is the percentage of all pageviews of that page that were the last in the session.
The “Page Value” feature represents the average value for a web page that a user visited before completing an e-commerce transaction.
The “Special Day” feature indicates the closeness of the site visiting time to a specific special day (e.g. Mother’s Day, Valentine’s Day) in which the sessions are more likely to be finalized with transaction. The value of this attribute is determined by considering the dynamics of e-commerce such as the duration between the order date and delivery date. For example, for Valentine’s day, this value takes a nonzero value between February 2 and February 12, zero before and after this date unless it is close to another special day, and its maximum value of 1 on February 8.
The dataset also includes operating system, browser, region, traffic type, visitor type as returning or new visitor, a Boolean value indicating whether the date of the visit is weekend, and month of the year.
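
To make the Bounce Rate and Exit Rate definitions concrete, here is a small sketch on a made-up page-view log. The `views` data frame and its column names are hypothetical and are not part of the UCI dataset; Google Analytics computes these metrics internally, so this only illustrates the arithmetic.

```r
# Hypothetical page-view log: one row per page view, ordered within each session
views <- data.frame(
  session = c(1, 2, 2, 3, 3, 3, 4),
  page    = c("home", "home", "cart", "product", "home", "cart", "home"),
  stringsAsFactors = FALSE
)

# Position of each view inside its session, and the length of that session
views$position <- ave(seq_along(views$session), views$session, FUN = seq_along)
views$n_views  <- ave(views$position, views$session, FUN = length)

is_entry  <- views$position == 1               # first page of the session
is_exit   <- views$position == views$n_views   # last page of the session
is_bounce <- views$n_views == 1                # single-page session

home <- views$page == "home"

# Bounce rate of "home": bounced sessions / sessions that entered on "home"
bounce_rate_home <- sum(home & is_entry & is_bounce) / sum(home & is_entry)

# Exit rate of "home": views of "home" that ended a session / all views of "home"
exit_rate_home <- sum(home & is_exit) / sum(home)

bounce_rate_home
exit_rate_home
```

Here three sessions enter on "home" and two of them leave immediately, while two of the four "home" pageviews are the last in their session.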

I only examine the grouping for two categorical attributes: Revenue and Visitor Type.

1.3 Read the data & analyze:

# load the library
library(tidyverse)

## -- Attaching packages ------------------------------------------------------------------------- tidyverse 1.2.1 --

## v ggplot2 3.1.1       v purrr   0.3.2  
## v tibble  2.1.1       v dplyr   0.8.0.1
## v tidyr   0.8.3       v stringr 1.4.0  
## v readr   1.3.1       v forcats 0.4.0

## -- Conflicts ---------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(FactoMineR)
library(factoextra)

## Welcome! Related Books: `Practical Guide To Cluster Analysis in R` at https://goo.gl/13EFCZ

online <- read.csv("data/online_shoppers_intention.csv")
head(online)

##   Administrative Administrative_Duration Informational
## 1              0                       0             0
## 2              0                       0             0
## 3              0                       0             0
## 4              0                       0             0
## 5              0                       0             0
## 6              0                       0             0
##   Informational_Duration ProductRelated ProductRelated_Duration
## 1                      0              1                0.000000
## 2                      0              2               64.000000
## 3                      0              1                0.000000
## 4                      0              2                2.666667
## 5                      0             10              627.500000
## 6                      0             19              154.216667
##   BounceRates ExitRates PageValues SpecialDay Month OperatingSystems
## 1  0.20000000 0.2000000          0          0   Feb                1
## 2  0.00000000 0.1000000          0          0   Feb                2
## 3  0.20000000 0.2000000          0          0   Feb                4
## 4  0.05000000 0.1400000          0          0   Feb                3
## 5  0.02000000 0.0500000          0          0   Feb                3
## 6  0.01578947 0.0245614          0          0   Feb                2
##   Browser Region TrafficType       VisitorType Weekend Revenue
## 1       1      1           1 Returning_Visitor   FALSE   FALSE
## 2       2      1           2 Returning_Visitor   FALSE   FALSE
## 3       1      9           3 Returning_Visitor   FALSE   FALSE
## 4       2      2           4 Returning_Visitor   FALSE   FALSE
## 5       3      1           4 Returning_Visitor    TRUE   FALSE
## 6       2      1           3 Returning_Visitor   FALSE   FALSE

str(online)

## 'data.frame':    12330 obs. of  18 variables:
##  $ Administrative         : int  0 0 0 0 0 0 0 1 0 0 ...
##  $ Administrative_Duration: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Informational          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Informational_Duration : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ ProductRelated         : int  1 2 1 2 10 19 1 0 2 3 ...
##  $ ProductRelated_Duration: num  0 64 0 2.67 627.5 ...
##  $ BounceRates            : num  0.2 0 0.2 0.05 0.02 ...
##  $ ExitRates              : num  0.2 0.1 0.2 0.14 0.05 ...
##  $ PageValues             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SpecialDay             : num  0 0 0 0 0 0 0.4 0 0.8 0.4 ...
##  $ Month                  : Factor w/ 10 levels "Aug","Dec","Feb",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ OperatingSystems       : int  1 2 4 3 3 2 2 1 2 2 ...
##  $ Browser                : int  1 2 1 2 3 2 4 2 2 4 ...
##  $ Region                 : int  1 1 9 2 1 1 3 1 2 1 ...
##  $ TrafficType            : int  1 2 3 4 4 3 3 5 3 2 ...
##  $ VisitorType            : Factor w/ 3 levels "New_Visitor",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ Weekend                : logi  FALSE FALSE FALSE FALSE TRUE FALSE ...
##  $ Revenue                : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...

summary(online)

##  Administrative   Administrative_Duration Informational    
##  Min.   : 0.000   Min.   :   0.00         Min.   : 0.0000  
##  1st Qu.: 0.000   1st Qu.:   0.00         1st Qu.: 0.0000  
##  Median : 1.000   Median :   7.50         Median : 0.0000  
##  Mean   : 2.315   Mean   :  80.82         Mean   : 0.5036  
##  3rd Qu.: 4.000   3rd Qu.:  93.26         3rd Qu.: 0.0000  
##  Max.   :27.000   Max.   :3398.75         Max.   :24.0000  
##                                                            
##  Informational_Duration ProductRelated   ProductRelated_Duration
##  Min.   :   0.00        Min.   :  0.00   Min.   :    0.0        
##  1st Qu.:   0.00        1st Qu.:  7.00   1st Qu.:  184.1        
##  Median :   0.00        Median : 18.00   Median :  598.9        
##  Mean   :  34.47        Mean   : 31.73   Mean   : 1194.8        
##  3rd Qu.:   0.00        3rd Qu.: 38.00   3rd Qu.: 1464.2        
##  Max.   :2549.38        Max.   :705.00   Max.   :63973.5        
##                                                                 
##   BounceRates         ExitRates         PageValues        SpecialDay     
##  Min.   :0.000000   Min.   :0.00000   Min.   :  0.000   Min.   :0.00000  
##  1st Qu.:0.000000   1st Qu.:0.01429   1st Qu.:  0.000   1st Qu.:0.00000  
##  Median :0.003112   Median :0.02516   Median :  0.000   Median :0.00000  
##  Mean   :0.022191   Mean   :0.04307   Mean   :  5.889   Mean   :0.06143  
##  3rd Qu.:0.016813   3rd Qu.:0.05000   3rd Qu.:  0.000   3rd Qu.:0.00000  
##  Max.   :0.200000   Max.   :0.20000   Max.   :361.764   Max.   :1.00000  
##                                                                          
##      Month      OperatingSystems    Browser           Region     
##  May    :3364   Min.   :1.000    Min.   : 1.000   Min.   :1.000  
##  Nov    :2998   1st Qu.:2.000    1st Qu.: 2.000   1st Qu.:1.000  
##  Mar    :1907   Median :2.000    Median : 2.000   Median :3.000  
##  Dec    :1727   Mean   :2.124    Mean   : 2.357   Mean   :3.147  
##  Oct    : 549   3rd Qu.:3.000    3rd Qu.: 2.000   3rd Qu.:4.000  
##  Sep    : 448   Max.   :8.000    Max.   :13.000   Max.   :9.000  
##  (Other):1337                                                    
##   TrafficType               VisitorType     Weekend         Revenue       
##  Min.   : 1.00   New_Visitor      : 1694   Mode :logical   Mode :logical  
##  1st Qu.: 2.00   Other            :   85   FALSE:9462      FALSE:10422    
##  Median : 2.00   Returning_Visitor:10551   TRUE :2868      TRUE :1908     
##  Mean   : 4.07                                                            
##  3rd Qu.: 4.00                                                            
##  Max.   :20.00                                                            
##

If we extract the principal components from the raw data above, the result is not going to be useful. This becomes clearer when we think of PCA as a variance-maximizing exercise: when we run PCA on the un-scaled data, the variance explained by the principal components is dominated by the variables with the largest ranges.
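
The effect of scaling can be seen on a tiny synthetic example. This is a minimal sketch, assuming two made-up independent variables (`duration_sec` and `rate`) that mimic the very different ranges of the duration and rate columns; it does not use the real data.

```r
set.seed(1)
# Two independent synthetic variables on very different scales
toy <- data.frame(
  duration_sec = rnorm(200, mean = 1000, sd = 500),
  rate         = rnorm(200, mean = 0.05, sd = 0.02)
)

# Proportion of variance per component = sdev^2 / sum(sdev^2)
var_share <- function(p) p$sdev^2 / sum(p$sdev^2)

unscaled <- var_share(prcomp(toy, scale. = FALSE))
scaled   <- var_share(prcomp(toy, scale. = TRUE))

round(unscaled, 4)  # PC1 takes almost everything: it is essentially duration_sec
round(scaled, 4)    # closer to an even split once both variables have unit variance
```

Without scaling, PC1 simply follows the variable with the largest variance; with `scale. = TRUE` each variable contributes on an equal footing.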

2 Principal Components Analysis

2.1 Using ‘prcomp’

online_small <- online[1:100,1:10]
biplot(prcomp(online_small,scale = T), cex = 0.8)

The same biplot with another theme, using the fancy_biplot function:

source("biplot.R")
fancy_biplot(prcomp(online_small,scale = T))

# Look at four observations in detail
data.frame(online[c(30,58,67,77),])

##    Administrative Administrative_Duration Informational
## 30              1                   6.000             1
## 58              4                  56.000             2
## 67              4                  44.000             0
## 77             10                1005.667             0
##    Informational_Duration ProductRelated ProductRelated_Duration
## 30                      0             45               1582.7500
## 58                    120             36                998.7417
## 67                      0             90               6951.9722
## 77                      0             36               2111.3417
##    BounceRates  ExitRates PageValues SpecialDay Month OperatingSystems
## 30 0.043478261 0.05082126   54.17976        0.4   Feb                3
## 58 0.000000000 0.01473647   19.44708        0.2   Feb                2
## 67 0.002150538 0.01501303    0.00000        0.0   Feb                4
## 77 0.004347826 0.01449275   11.43941        0.0   Feb                2
##    Browser Region TrafficType       VisitorType Weekend Revenue
## 30       2      1           1 Returning_Visitor   FALSE   FALSE
## 58       2      4           1 Returning_Visitor   FALSE   FALSE
## 67       1      1           3 Returning_Visitor   FALSE   FALSE
## 77       6      1           2 Returning_Visitor   FALSE    TRUE

Based on the biplot, we can conclude the following. Observation 58 has high Informational and Informational_Duration values, and observation 30 is similar to it. Observation 67 has a high ProductRelated_Duration; similarly, observation 77 has high Administrative_Duration and ProductRelated_Duration values.

So far we have used only a small subset of ‘online’; next we use the full data.

onlineNum <- online[,1:10]
onlineZ <- scale(onlineNum, center = T, scale = T)

pr <- prcomp(onlineZ)
summary(pr)

## Importance of components:
##                          PC1    PC2    PC3    PC4     PC5     PC6    PC7
## Standard deviation     1.844 1.2943 1.0350 1.0054 0.97009 0.96287 0.6496
## Proportion of Variance 0.340 0.1675 0.1071 0.1011 0.09411 0.09271 0.0422
## Cumulative Proportion  0.340 0.5076 0.6147 0.7158 0.80987 0.90258 0.9448
##                            PC8     PC9    PC10
## Standard deviation     0.59301 0.35055 0.27858
## Proportion of Variance 0.03517 0.01229 0.00776
## Cumulative Proportion  0.97995 0.99224 1.00000

plot(pr,type = "l")

Based on the summary and the elbow method, we can judge how many PCs are needed to represent the data, and use that as a guide for the number of clusters:
- up to PC5, the cumulative proportion is reasonably good: 0.80987
- using the elbow method, there is no significant change after PC3; considering both the variance and the cumulative proportion, we will test all the possibilities k = 3 to 5
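
The cumulative proportion can be recomputed directly from the component standard deviations. The sketch below copies the (rounded) values printed by summary(pr) and finds the smallest number of PCs that reaches an 80% threshold. The 80% cut-off and the eigenvalue-above-1 (Kaiser) rule are common heuristics assumed here for illustration; they are not part of the original analysis.

```r
# Standard deviations of the 10 components, copied from the summary(pr) output
sdev <- c(1.844, 1.2943, 1.0350, 1.0054, 0.97009, 0.96287,
          0.6496, 0.59301, 0.35055, 0.27858)

prop_var <- sdev^2 / sum(sdev^2)   # proportion of variance per PC
cum_var  <- cumsum(prop_var)       # cumulative proportion

# Smallest number of PCs whose cumulative proportion reaches 80%
n_pc_80 <- which(cum_var >= 0.80)[1]

# Kaiser rule on standardized data: keep PCs with eigenvalue (sdev^2) above 1
n_pc_kaiser <- sum(sdev^2 > 1)

n_pc_80
n_pc_kaiser
```

With these values the 80% threshold is reached at PC5, while the eigenvalue rule keeps four components, which is consistent with testing the range 3 to 5.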

2.2 Using PCA function

Before moving on to k-means clustering, we show the PCA with another function. To use the PCA function from FactoMineR, we need to define the qualitative variables as factors.

online <- online %>% 
  mutate(
    Weekend = as.factor(Weekend),
    Revenue = as.factor(Revenue),
    OperatingSystems = as.factor(OperatingSystems),
    Browser = as.factor(Browser),
    Region = as.factor(Region),
    TrafficType = as.factor(TrafficType)
  )

prOnlineFacto <- PCA(online, quali.sup= c(11:18) ,scale.unit = T, graph = F)
plot(prOnlineFacto)

PCA using quali sup
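
Instead of hard-coding the column positions passed to quali.sup, they can be derived from the data. A minimal base-R sketch on a toy data frame; `toy`, `cat_cols` and `quali_idx` are hypothetical names used only for illustration.

```r
# Toy data frame standing in for `online` (hypothetical values)
toy <- data.frame(
  Browser    = c(1L, 2L, 2L),
  Weekend    = c(TRUE, FALSE, TRUE),
  PageValues = c(0, 5.2, 0)
)

# Convert the categorical columns to factors in one step
cat_cols <- c("Browser", "Weekend")
toy[cat_cols] <- lapply(toy[cat_cols], as.factor)

# Column positions of all factor variables, usable as quali.sup in PCA()
quali_idx <- which(sapply(toy, is.factor))
quali_idx
```

This keeps the supplementary-variable indices correct if columns are later added or reordered.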

plot.PCA(prOnlineFacto, choix = "var")
plot.PCA(prOnlineFacto, choix = "ind",habillage = 18, select = "contrib 10", invisible = "quali")

PCA using quali sup

online_pca <- PCA(online, quali.sup = c(11:18), graph=F, scale.unit = T)

plot.PCA(online_pca, choix = "var")
plot.PCA(online_pca, choix = "ind",habillage = 18, select = "contrib 5", invisible = "quali")

PCA using quali sup

data.frame(online[c(5153,10641),])

##       Administrative Administrative_Duration Informational
## 5153              17                2629.254            24
## 10641             22                1153.682             3
##       Informational_Duration ProductRelated ProductRelated_Duration
## 5153                2050.433            705               43171.233
## 10641                108.000            205                4295.305
##       BounceRates   ExitRates PageValues SpecialDay Month OperatingSystems
## 5153  0.004851285 0.015431438   0.763829          0   May                2
## 10641 0.001746725 0.008801049 177.528825          0   Nov                2
##       Browser Region TrafficType       VisitorType Weekend Revenue
## 5153        2      1          14 Returning_Visitor    TRUE   FALSE
## 10641       5      3           3 Returning_Visitor    TRUE   FALSE

Observation 5153 has high Informational_Duration, ProductRelated_Duration and Administrative_Duration values. Observation 10641 has a low Informational value by comparison.

3 k-means Clustering

As stated above, we now look for the best k.

set.seed(100)
# k-means with 3 clusters
online_km <- kmeans(onlineZ, 3) # compare with the elbow method
online$clust <- as.factor(online_km$cluster)
head(online)

##   Administrative Administrative_Duration Informational
## 1              0                       0             0
## 2              0                       0             0
## 3              0                       0             0
## 4              0                       0             0
## 5              0                       0             0
## 6              0                       0             0
##   Informational_Duration ProductRelated ProductRelated_Duration
## 1                      0              1                0.000000
## 2                      0              2               64.000000
## 3                      0              1                0.000000
## 4                      0              2                2.666667
## 5                      0             10              627.500000
## 6                      0             19              154.216667
##   BounceRates ExitRates PageValues SpecialDay Month OperatingSystems
## 1  0.20000000 0.2000000          0          0   Feb                1
## 2  0.00000000 0.1000000          0          0   Feb                2
## 3  0.20000000 0.2000000          0          0   Feb                4
## 4  0.05000000 0.1400000          0          0   Feb                3
## 5  0.02000000 0.0500000          0          0   Feb                3
## 6  0.01578947 0.0245614          0          0   Feb                2
##   Browser Region TrafficType       VisitorType Weekend Revenue clust
## 1       1      1           1 Returning_Visitor   FALSE   FALSE     1
## 2       2      1           2 Returning_Visitor   FALSE   FALSE     1
## 3       1      9           3 Returning_Visitor   FALSE   FALSE     1
## 4       2      2           4 Returning_Visitor   FALSE   FALSE     1
## 5       3      1           4 Returning_Visitor    TRUE   FALSE     1
## 6       2      1           3 Returning_Visitor   FALSE   FALSE     1

online_km$centers

##   Administrative Administrative_Duration Informational
## 1     -0.2332921              -0.1986104    -0.2461845
## 2     -0.4091124              -0.3068585    -0.2469736
## 3      1.4958138               1.2493741     1.4711601
##   Informational_Duration ProductRelated ProductRelated_Duration
## 1             -0.1939953     -0.2395940              -0.2202607
## 2             -0.1851529     -0.1616858              -0.1940331
## 3              1.1537882      1.3860745               1.3005984
##   BounceRates   ExitRates  PageValues SpecialDay
## 1  0.03171517  0.05064445 -0.01870258 -0.2916209
## 2  0.27828636  0.38150678 -0.21784532  3.0961874
## 3 -0.33269469 -0.49474113  0.22740737 -0.2257802

online_km$iter

## [1] 3

plot.PCA(online_pca, choix=c("ind"), label="none", col.ind= online$clust) #choix = individual
legend("topright", levels(online$clust), pch=19, col=1:3)

PCA result for k = 3

Check the Elbow using wss function:

wss <- function(data, maxCluster = 10) {
    # SSw[k] holds the total within-cluster sum of squares for k clusters
    SSw <- numeric(maxCluster)
    # k = 1: the total sum of squares of the data
    SSw[1] <- (nrow(data) - 1) * sum(apply(data, 2, var))
    for (i in 2:maxCluster) {
        SSw[i] <- sum(kmeans(data, centers = i)$withinss)
    }
    plot(1:maxCluster, SSw, type = "o", xlab = "Number of Clusters", ylab = "Within groups sum of squares", pch=18)
}
wss(onlineZ) # method wss
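
Because k-means starts from random centers, a single run per k can make the elbow curve noisy. The sketch below is one possible variation, assuming synthetic data with three well-separated groups: it uses `nstart` to keep the best of several starts and reads the total directly from `tot.withinss`. The `toy` matrix and the range k = 1..6 are assumptions for illustration.

```r
set.seed(100)
# Synthetic data with three well-separated groups (illustration only)
toy <- rbind(
  matrix(rnorm(100, mean = 0,  sd = 0.5), ncol = 2),
  matrix(rnorm(100, mean = 5,  sd = 0.5), ncol = 2),
  matrix(rnorm(100, mean = 10, sd = 0.5), ncol = 2)
)

# Total within-cluster sum of squares for k = 1..6; nstart = 10 keeps the
# best of 10 random initialisations for each k
ks  <- 1:6
tot <- sapply(ks, function(k) kmeans(toy, centers = k, nstart = 10)$tot.withinss)

# Relative drop in within-cluster sum of squares when moving from k-1 to k
rel_drop <- -diff(tot) / head(tot, -1)
round(rel_drop, 3)
```

The elbow is the k after which the relative drop becomes small; on this toy data that happens after k = 3.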

There is another method to check the best k:

fviz_nbclust(onlineZ, kmeans, method = "silhouette") # method silhouette
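
The silhouette method compares candidate values of k by their average silhouette width. A minimal sketch of that computation with the cluster package on synthetic data; the `toy` matrix and the range k = 2..6 are assumptions for illustration, and this is not the internal code of fviz_nbclust.

```r
library(cluster)  # provides silhouette()

set.seed(100)
# Synthetic data with three well-separated groups (illustration only)
toy <- rbind(
  matrix(rnorm(100, mean = 0,  sd = 0.5), ncol = 2),
  matrix(rnorm(100, mean = 5,  sd = 0.5), ncol = 2),
  matrix(rnorm(100, mean = 10, sd = 0.5), ncol = 2)
)
d <- dist(toy)  # pairwise Euclidean distances

# Average silhouette width for each candidate k
ks <- 2:6
avg_sil <- sapply(ks, function(k) {
  km <- kmeans(toy, centers = k, nstart = 10)
  mean(silhouette(km$cluster, d)[, "sil_width"])
})
names(avg_sil) <- ks

# The k with the highest average silhouette width
best_k <- ks[which.max(avg_sil)]
round(avg_sil, 3)
best_k
```

On the full 12,330-session data the distance matrix is large, so a computation like this may be slow and memory-hungry.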

Given the elbow from the wss function, we next check the result for k = 5:

online_km5 <- kmeans(onlineZ, 5)
online_km5$clust <- as.factor(online_km5$cluster)

plot.PCA(online_pca, choix=c("ind"), label="none", col.ind=online_km5$clust) #choix = individual
legend("topright", levels(online_km5$clust), pch=19, col=1:5)

PCA result for k = 5

online_km5$centers

##   Administrative Administrative_Duration Informational
## 1    1.310908346              0.99926824    0.37985074
## 2   -0.392526858             -0.30361589   -0.25154885
## 3    1.418098310              1.04250027    2.85570065
## 4   -0.008206934             -0.02623911   -0.09066038
## 5   -0.687222206             -0.45074395   -0.38881809
##   Informational_Duration ProductRelated ProductRelated_Duration
## 1             0.05981531     0.55813884              0.47070281
## 2            -0.19196798    -0.24510112             -0.22695383
## 3             3.10802386     2.40576960              2.39539497
## 4            -0.11178168    -0.03912234             -0.01991354
## 5            -0.24492057    -0.65422899             -0.60036955
##   BounceRates  ExitRates   PageValues  SpecialDay
## 1  -0.3270197 -0.4801483  0.004715737 -0.15658862
## 2  -0.2318675 -0.1314659 -0.224938955  0.05376042
## 3  -0.3168492 -0.4735970  0.045631040 -0.16623220
## 4  -0.4010185 -0.5847576  3.498575152 -0.24543338
## 5   3.2443481  2.9667157 -0.317164982  0.17710109

online_km5$iter

## [1] 6

The centers above suggest that k = 4 may be better than k = 5, so we try it below:

online_km4 <- kmeans(onlineZ, 4)
online_km4$clust <- as.factor(online_km4$cluster)

plot.PCA(online_pca, choix=c("ind"), label="none", col.ind=online_km4$clust) #choix = individual
legend("topright", levels(online_km4$clust), pch=19, col=1:4)

PCA result for k = 4

online_km4$centers

##   Administrative Administrative_Duration Informational
## 1      1.4274242               1.0307384     2.7527566
## 2     -0.3908723              -0.3023091    -0.2520712
## 3      1.1999391               0.9131627     0.3178291
## 4     -0.6832144              -0.4490826    -0.3842200
##   Informational_Duration ProductRelated ProductRelated_Duration
## 1             2.88843399      2.3721430               2.3532959
## 2            -0.19271784     -0.2376723              -0.2187825
## 3             0.03761746      0.4683853               0.3914884
## 4            -0.24429087     -0.6477109              -0.5973185
##   BounceRates  ExitRates  PageValues  SpecialDay
## 1  -0.3142597 -0.4705710  0.08422691 -0.15438352
## 2  -0.2524249 -0.1683578 -0.13622633  0.04090515
## 3  -0.3395200 -0.5021188  0.54799035 -0.18223570
## 4   3.0227322  2.8468297 -0.31716498  0.21394340

online_km4$iter

## [1] 5

Checking $iter shows that k-means converges quickly: 3 iterations for k = 3, 6 for k = 5 and 5 for k = 4. With k = 4 the algorithm stopped at the fifth iteration: it had already identified 4 sufficiently distinct clusters, and further iterations would not improve the result.
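
The iteration count can be read together with the other diagnostics that kmeans returns. A minimal sketch on synthetic standardized data; the `toy` matrix and the settings nstart = 25 and iter.max = 50 are assumptions for illustration, not the settings used above.

```r
set.seed(100)
toy <- scale(matrix(rnorm(600), ncol = 3))  # synthetic standardized data

# nstart = 25 keeps the best of 25 random starts; iter.max caps the iterations
km <- kmeans(toy, centers = 4, nstart = 25, iter.max = 50)

km$iter                               # iterations used by the returned solution
km$ifault                             # 0 indicates no problem was reported
explained <- km$betweenss / km$totss  # share of total variance between clusters
explained
```

A low iteration count only says that the algorithm stopped changing assignments; the between-cluster share gives a rough sense of how well separated the resulting clusters are.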

4 Additional

4.1 Combining PCA and k-means Clustering using FactoExtra Package

fviz_screeplot(online_pca, addlabels = TRUE, ylim = c(0, 50))

var_pca <- get_pca_var(online_pca)
var_pca

## Principal Component Analysis Results for variables
##  ===================================================
##   Name       Description                                    
## 1 "$coord"   "Coordinates for the variables"                
## 2 "$cor"     "Correlations between variables and dimensions"
## 3 "$cos2"    "Cos2 for the variables"                       
## 4 "$contrib" "contributions of the variables"

head(var_pca$coord)

##                             Dim.1      Dim.2      Dim.3      Dim.4
## Administrative          0.7050389 0.06793628  0.2645375  0.3110102
## Administrative_Duration 0.6069612 0.13871617  0.3323191  0.3665404
## Informational           0.6410573 0.36478844  0.1564255 -0.4746090
## Informational_Duration  0.5454962 0.39458007  0.1462845 -0.6015172
## ProductRelated          0.7588367 0.19252869 -0.4088971  0.2489076
## ProductRelated_Duration 0.7624212 0.24605368 -0.3753875  0.2159479
##                                Dim.5
## Administrative          -0.287321899
## Administrative_Duration -0.378159791
## Informational           -0.027365516
## Informational_Duration   0.002172933
## ProductRelated           0.272972429
## ProductRelated_Duration  0.269586543

head(var_pca$contrib)

##                             Dim.1     Dim.2     Dim.3     Dim.4
## Administrative          14.618345 0.2755127  6.532303  9.569755
## Administrative_Duration 10.834125 1.1486620 10.308663 13.292155
## Informational           12.085531 7.9436515  2.284058 22.285560
## Informational_Duration   8.750957 9.2941218  1.997507 35.797086
## ProductRelated          16.934356 2.2127327 15.607020  6.129542
## ProductRelated_Duration 17.094719 3.6140803 13.153804  4.613704
##                                Dim.5
## Administrative          8.772263e+00
## Administrative_Duration 1.519585e+01
## Informational           7.957589e-02
## Informational_Duration  5.017262e-04
## ProductRelated          7.917932e+00
## ProductRelated_Duration 7.722726e+00

# Graph of variables: default plot
fviz_pca_var(online_pca, col.var = "black")

# Control variable colors using their contributions
fviz_pca_var(online_pca, col.var="contrib",
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel = TRUE # Avoid text overlapping
             )

# Contributions of variables to PC1
fviz_contrib(online_pca, choice = "var", axes = 1, top = 10)

# Contributions of variables to PC2
fviz_contrib(online_pca, choice = "var", axes = 2, top = 10)

# get_pca_ind() returns the results for individuals (sessions), not variables
ind_pca <- get_pca_ind(online_pca)
ind_pca

head(ind_pca$coord)

head(ind_pca$contrib)

# Graph of individuals
# 1. Use repel = TRUE to avoid overplotting
# 2. Control automatically the color of individuals using the cos2
    # cos2 = the quality of the individuals on the factor map
    # Use points only
# 3. Use gradient color
fviz_pca_ind(online_pca, col.ind = "cos2",
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel = TRUE # Avoid text overlapping (slow if many points)
             )

Fviz result

Grouping of individuals by Revenue:

fviz_pca_ind(online_pca,
             label = "none", # hide individual labels
             habillage = online$Revenue, # color by groups
             palette = c("#00AFBB", "#E7B800", "#FC4E07"),
             addEllipses = TRUE # Concentration ellipses
             )

Grouping of individuals by Visitor Type:

fviz_pca_ind(online_pca,
             label = "none", # hide individual labels
             habillage = online$VisitorType, # color by groups
             palette = c("#00AFBB", "#E7B800", "#FC4E07"),
             addEllipses = TRUE # Concentration ellipses
             )

5 Conclusion

Based on the methods above, we can conclude:

- The best k is 4.
- This data can be explored using PCA and k-means.
- Using factoextra, we can clearly see how the data group in the examples above, for Revenue and Visitor Type. The detailed explanation can be found above (in the PCA and k-means subsections).

Learn-By-Building: Unsupervised Learning

Analysis of Online Shoppers Purchasing Intention

Meilinie

Jun 11, 2019
