1 Intro

This article explores how customers behave when browsing an e-commerce site or marketplace. It relies on several metrics that Google Analytics uses to segment visitor behavior on a website, such as bounce rates, exit rates and page values, all of which are explained later in this article. Using K-means clustering and Principal Component Analysis (PCA), we hope to find distinct differences in the way people browse a marketplace.

The data is the Online Shoppers Purchasing Intention dataset, obtained from the UCI Machine Learning Repository (archive.ics.uci.edu/ml/index.php). We will run a clustering analysis with the K-means method to suggest possible customer segments. We would also like to see whether we can reduce the dimensionality of the data using Principal Component Analysis (PCA).

1.1 Import Data

library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.4     v dplyr   1.0.7
## v tidyr   1.1.3     v stringr 1.4.0
## v readr   2.0.1     v forcats 0.5.1
## Warning: package 'tibble' was built under R version 4.1.1
## Warning: package 'readr' was built under R version 4.1.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(factoextra)
## Warning: package 'factoextra' was built under R version 4.1.1
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(FactoMineR)
## Warning: package 'FactoMineR' was built under R version 4.1.1
library(animation)
## Warning: package 'animation' was built under R version 4.1.1
library(knitr)
library(reactable)
## Warning: package 'reactable' was built under R version 4.1.1

1.2 Reading The Dataset

data <- read.csv("data_input/online_shoppers_intention.csv", stringsAsFactors = T)
str(data)
## 'data.frame':    12330 obs. of  18 variables:
##  $ Administrative         : int  0 0 0 0 0 0 0 1 0 0 ...
##  $ Administrative_Duration: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Informational          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Informational_Duration : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ ProductRelated         : int  1 2 1 2 10 19 1 0 2 3 ...
##  $ ProductRelated_Duration: num  0 64 0 2.67 627.5 ...
##  $ BounceRates            : num  0.2 0 0.2 0.05 0.02 ...
##  $ ExitRates              : num  0.2 0.1 0.2 0.14 0.05 ...
##  $ PageValues             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SpecialDay             : num  0 0 0 0 0 0 0.4 0 0.8 0.4 ...
##  $ Month                  : Factor w/ 10 levels "Aug","Dec","Feb",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ OperatingSystems       : int  1 2 4 3 3 2 2 1 2 2 ...
##  $ Browser                : int  1 2 1 2 3 2 4 2 2 4 ...
##  $ Region                 : int  1 1 9 2 1 1 3 1 2 1 ...
##  $ TrafficType            : int  1 2 3 4 4 3 3 5 3 2 ...
##  $ VisitorType            : Factor w/ 3 levels "New_Visitor",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ Weekend                : logi  FALSE FALSE FALSE FALSE TRUE FALSE ...
##  $ Revenue                : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...

This dataset consists of 12,330 sessions of customers browsing a marketplace over a one-year period, from an undisclosed company. The data contains 18 variables that we are going to explore later.

summary(data)
##  Administrative   Administrative_Duration Informational    
##  Min.   : 0.000   Min.   :   0.00         Min.   : 0.0000  
##  1st Qu.: 0.000   1st Qu.:   0.00         1st Qu.: 0.0000  
##  Median : 1.000   Median :   7.50         Median : 0.0000  
##  Mean   : 2.315   Mean   :  80.82         Mean   : 0.5036  
##  3rd Qu.: 4.000   3rd Qu.:  93.26         3rd Qu.: 0.0000  
##  Max.   :27.000   Max.   :3398.75         Max.   :24.0000  
##                                                            
##  Informational_Duration ProductRelated   ProductRelated_Duration
##  Min.   :   0.00        Min.   :  0.00   Min.   :    0.0        
##  1st Qu.:   0.00        1st Qu.:  7.00   1st Qu.:  184.1        
##  Median :   0.00        Median : 18.00   Median :  598.9        
##  Mean   :  34.47        Mean   : 31.73   Mean   : 1194.8        
##  3rd Qu.:   0.00        3rd Qu.: 38.00   3rd Qu.: 1464.2        
##  Max.   :2549.38        Max.   :705.00   Max.   :63973.5        
##                                                                 
##   BounceRates         ExitRates         PageValues        SpecialDay     
##  Min.   :0.000000   Min.   :0.00000   Min.   :  0.000   Min.   :0.00000  
##  1st Qu.:0.000000   1st Qu.:0.01429   1st Qu.:  0.000   1st Qu.:0.00000  
##  Median :0.003112   Median :0.02516   Median :  0.000   Median :0.00000  
##  Mean   :0.022191   Mean   :0.04307   Mean   :  5.889   Mean   :0.06143  
##  3rd Qu.:0.016813   3rd Qu.:0.05000   3rd Qu.:  0.000   3rd Qu.:0.00000  
##  Max.   :0.200000   Max.   :0.20000   Max.   :361.764   Max.   :1.00000  
##                                                                          
##      Month      OperatingSystems    Browser           Region     
##  May    :3364   Min.   :1.000    Min.   : 1.000   Min.   :1.000  
##  Nov    :2998   1st Qu.:2.000    1st Qu.: 2.000   1st Qu.:1.000  
##  Mar    :1907   Median :2.000    Median : 2.000   Median :3.000  
##  Dec    :1727   Mean   :2.124    Mean   : 2.357   Mean   :3.147  
##  Oct    : 549   3rd Qu.:3.000    3rd Qu.: 2.000   3rd Qu.:4.000  
##  Sep    : 448   Max.   :8.000    Max.   :13.000   Max.   :9.000  
##  (Other):1337                                                    
##   TrafficType               VisitorType     Weekend         Revenue       
##  Min.   : 1.00   New_Visitor      : 1694   Mode :logical   Mode :logical  
##  1st Qu.: 2.00   Other            :   85   FALSE:9462      FALSE:10422    
##  Median : 2.00   Returning_Visitor:10551   TRUE :2868      TRUE :1908     
##  Mean   : 4.07                                                            
##  3rd Qu.: 4.00                                                            
##  Max.   :20.00                                                            
## 

Variables:

  1. Administrative : Number of administrative pages the customer visited
  2. Administrative_Duration : Total time spent on administrative pages
  3. Informational : Number of informational pages the customer visited
  4. Informational_Duration : Total time spent on informational pages
  5. ProductRelated : Number of product-related pages the customer visited
  6. ProductRelated_Duration : Total time spent on product-related pages
  7. BounceRates : The value of the “Bounce Rate” feature for a web page is the percentage of visitors who enter the site from that page and then leave (“bounce”) without triggering any other requests to the analytics server during that session. (EV ~ 0)
  8. ExitRates : The value of the “Exit Rate” feature for a specific web page is, out of all page views of that page, the percentage that were the last in the session.

Below is an illustration of bounce and exit rates, which Google Analytics describes as key elements of website browsing for telling groups of people apart.

  9. PageValues : The average value of a web page that a user visited before completing an e-commerce transaction.
  10. SpecialDay : The “Special Day” feature indicates how close the visit is to a specific special day (e.g. Mother’s Day, Valentine’s Day). For example, for Valentine’s Day, this value is nonzero between February 2 and February 14, zero before and after this period unless it is close to another special day, and reaches its maximum value of 1 on February 8.
  11. Month : Month of the year in which the session took place.
  12. VisitorType : Whether the visitor is new, returning, or of another type.
  13. OperatingSystems : Type of operating system
  14. Browser : Type of browser
  15. Region : Region
  16. TrafficType : Type of incoming traffic, such as a direct URL, a Google search, etc.
  17. Revenue : Whether the customer made a purchase. T/F logical/factor
  18. Weekend : Whether the session took place on a weekend. T/F logical/factor
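
To make the bounce rate and exit rate definitions (items 7 and 8) concrete, here is a minimal sketch on made-up data. The `sessions` list and the helper names `bounce_rate()` and `exit_rate()` are hypothetical and are not part of the dataset or of Google Analytics. A bounce is approximated here as a session with a single page view.

```r
# Made-up sessions: each element is the ordered list of pages in one visit
sessions <- list(
  c("A"),
  c("A", "B"),
  c("B", "A"),
  c("A", "B", "C")
)

# Share of sessions entering on `page` that viewed nothing else
bounce_rate <- function(page, sessions) {
  entered <- Filter(function(s) s[1] == page, sessions)
  if (length(entered) == 0) return(NA_real_)
  mean(vapply(entered, function(s) length(s) == 1, logical(1)))
}

# Share of all views of `page` that were the last view in their session
exit_rate <- function(page, sessions) {
  views <- sum(vapply(sessions, function(s) sum(s == page), integer(1)))
  exits <- sum(vapply(sessions, function(s) s[length(s)] == page, logical(1)))
  if (views == 0) return(NA_real_)
  exits / views
}

bounce_rate("A", sessions)  # 1 of the 3 sessions that start on A is a bounce
exit_rate("A", sessions)    # 2 of the 4 views of A are the last in the session
```

The two numbers differ because the bounce rate only counts sessions that start on the page, while the exit rate counts every view of the page.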

2 PCA and Dimension Reduction

2.1 Principal Component Analysis

library(FactoMineR)
quantivar <- 1:10   # numeric columns, used as active variables
qualivar <- 11:18   # remaining columns, passed as supplementary qualitative variables
data_pca <- PCA(X = data, scale.unit = TRUE, ncp = 10, quali.sup = qualivar,
                graph = FALSE)
plot.PCA(x = data_pca,
         choix = "ind", # individual factor map
         invisible = "quali",
         select = "contrib 5", # label the 5 individuals that contribute most
         habillage = 16 # color by column 16 (VisitorType)
         )

plot.PCA(x = data_pca,
         choix = "var")

This plot shows how the variables differ and how they correlate with each other. Take Exit Rates, Bounce Rates and Page Values. Exit Rates and Bounce Rates have a high positive correlation, which means that the higher the Bounce Rate, the higher the Exit Rate, and vice versa. Page Values point the opposite way: as Page Values increase, the bounce and exit rates get lower.

Duration and number of pages are also correlated. For instance, the more informational pages are visited, the longer the time spent on them, and the same holds for the other page types.
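
The variable factor map can be read this way because, for a PCA on standardized variables, the coordinates of each arrow are the correlations between that variable and the components. The sketch below checks this identity with base R's `prcomp()` on the built-in `USArrests` data, which is only a stand-in for the shopper data; `FactoMineR::PCA()` is not used here.

```r
# Stand-in data: USArrests ships with base R (datasets package)
x <- scale(USArrests)                          # center and standardize each column
p <- prcomp(x, center = FALSE, scale. = FALSE) # PCA on the standardized data

# Coordinates of the variables on the factor map: loading * component sd
var_coord <- sweep(p$rotation, 2, p$sdev, "*")

# The same numbers obtained directly as correlations with the component scores
var_cor <- cor(x, p$x)

all.equal(var_coord, var_cor, check.attributes = FALSE)
```

Two well-represented variables whose arrows point in nearly the same direction are therefore positively correlated, and arrows pointing in opposite directions indicate a negative correlation.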

2.2 Dimension Reduction

dim <- dimdesc(data_pca)
as.data.frame(dim$Dim.1$quanti) %>% reactable()
as.data.frame(dim$Dim.2$quanti) %>% reactable()
as.data.frame(dim$Dim.3$quanti) %>% reactable()
data_pca$eig 
##         eigenvalue percentage of variance cumulative percentage of variance
## comp 1  3.40038402             34.0038402                          34.00384
## comp 2  1.67518179             16.7518179                          50.75566
## comp 3  1.07129276             10.7129276                          61.46859
## comp 4  1.01076078             10.1076078                          71.57619
## comp 5  0.94107838              9.4107838                          80.98698
## comp 6  0.92712013              9.2712013                          90.25818
## comp 7  0.42202120              4.2202120                          94.47839
## comp 8  0.35166643              3.5166643                          97.99505
## comp 9  0.12288660              1.2288660                          99.22392
## comp 10 0.07760792              0.7760792                         100.00000

From the eigenvalues above, if we want to reduce the dimensionality of our shopping data while still retaining about 80% of the original variance, we can keep the first 5 components, which together account for 80.99% of the variance.

data_keep <- as.data.frame(data_pca$ind$coord[,c(1:5)])
data_keep %>% head() %>% reactable()
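
The choice of five components can be reproduced from the eigenvalues printed above. This is a small sketch: the 80% threshold is the one used in the text, the vector `eig` is simply copied from the `data_pca$eig` output, and the names `cum_pct` and `n_keep` are chosen for this example.

```r
# Eigenvalues copied from the data_pca$eig output above
eig <- c(3.40038402, 1.67518179, 1.07129276, 1.01076078, 0.94107838,
         0.92712013, 0.42202120, 0.35166643, 0.12288660, 0.07760792)

cum_pct <- cumsum(eig) / sum(eig) * 100   # cumulative percentage of variance
n_keep <- which(cum_pct >= 80)[1]         # first component that reaches 80%
n_keep
## [1] 5
```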

3 Clustering

Clustering aims to group data into distinguishable clusters, where observations within one cluster share similar characteristics and observations in different clusters have different characteristics.

3.1 K-Means Clustering

The algorithm we are going to use is K-means, a centroid-based clustering algorithm in which each cluster is defined by its centroid and every observation belongs to the cluster with the nearest centroid.
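
As a rough illustration of what "centroid-based" means, the sketch below runs the two alternating steps of k-means by hand on simulated two-dimensional data: assign each point to its nearest centroid, then move each centroid to the mean of its points. The data, the starting centroids and all variable names are assumptions made for this example. The analysis itself relies on R's built-in `kmeans()`, which uses a more refined algorithm by default.

```r
set.seed(1)
# Simulated data: 20 points around (0, 0) and 20 points around (5, 5)
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 5), ncol = 2))

k <- 2
centroids <- x[c(1, 21), , drop = FALSE]   # arbitrary starting centroids

for (iter in 1:20) {
  # Assignment step: distance from every point to every centroid
  d <- as.matrix(dist(rbind(centroids, x)))[-(1:k), 1:k]
  cluster <- apply(d, 1, which.min)

  # Update step: each centroid becomes the mean of its assigned points
  new_centroids <- centroids
  for (j in seq_len(k)) {
    new_centroids[j, ] <- colMeans(x[cluster == j, , drop = FALSE])
  }

  # Stop once the centroids no longer move
  if (max(abs(new_centroids - centroids)) < 1e-8) break
  centroids <- new_centroids
}

table(cluster)   # number of points assigned to each centroid
```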

3.1.1 Business Question: Different Type of Website Surfers

glimpse(data)
## Rows: 12,330
## Columns: 18
## $ Administrative          <int> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 2~
## $ Administrative_Duration <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5~
## $ Informational           <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ Informational_Duration  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ ProductRelated          <int> 1, 2, 1, 2, 10, 19, 1, 0, 2, 3, 3, 16, 7, 6, 2~
## $ ProductRelated_Duration <dbl> 0.000000, 64.000000, 0.000000, 2.666667, 627.5~
## $ BounceRates             <dbl> 0.200000000, 0.000000000, 0.200000000, 0.05000~
## $ ExitRates               <dbl> 0.200000000, 0.100000000, 0.200000000, 0.14000~
## $ PageValues              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ SpecialDay              <dbl> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.4, 0.0, 0.8, 0~
## $ Month                   <fct> Feb, Feb, Feb, Feb, Feb, Feb, Feb, Feb, Feb, F~
## $ OperatingSystems        <int> 1, 2, 4, 3, 3, 2, 2, 1, 2, 2, 1, 1, 1, 2, 3, 1~
## $ Browser                 <int> 1, 2, 1, 2, 3, 2, 4, 2, 2, 4, 1, 1, 1, 5, 2, 1~
## $ Region                  <int> 1, 1, 9, 2, 1, 1, 3, 1, 2, 1, 3, 4, 1, 1, 3, 9~
## $ TrafficType             <int> 1, 2, 3, 4, 4, 3, 3, 5, 3, 2, 3, 3, 3, 3, 3, 3~
## $ VisitorType             <fct> Returning_Visitor, Returning_Visitor, Returnin~
## $ Weekend                 <lgl> FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE~
## $ Revenue                 <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALS~

Our model focuses on differentiating the ways people browse the website. We do not yet know what the segments will be, but the aim is to better understand what visitors want. With those segments, the website's developers can roll out a program tailored to each group, hopefully generating more purchases and maximizing the company's revenue.

3.1.2 Separating Data Types

First of all, we separate the data into two parts, categorical and numerical, because k-means can only process numerical data.

data_cat <- data %>%
  select(Month:Revenue)

data_num <- data %>% 
  select(Administrative:SpecialDay)

Then we have to scale the numerical variables so that they are on comparable scales. For example, Administrative is a count variable with no upper bound. Bounce rates, on the other hand, are percentage-based, and the maximum value they can reach is 1.0, or 100%. Without scaling, this difference in magnitude would influence our distance measurements and bias the model.

data_num_z <- scale(data_num)
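
To see what `scale()` does, consider this sketch on a made-up data frame with one count-like column and one rate-like column (the values are invented for illustration). After scaling, every column has mean 0 and standard deviation 1, so no single variable dominates the Euclidean distances that k-means relies on.

```r
# Made-up values: a page count and a rate bounded by [0, 1]
toy <- data.frame(ProductRelated = c(1, 10, 40, 200, 705),
                  BounceRates    = c(0.20, 0.05, 0.02, 0.01, 0.00))

toy_z <- scale(toy)   # (value - column mean) / column standard deviation

round(colMeans(toy_z), 10)   # both columns are centered at 0
apply(toy_z, 2, sd)          # both columns have standard deviation 1
```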

After scaling, we can determine the number of clusters: enough that the groups are still distinguishable from one another, but not so many that the grouping becomes pointless because observations with the same characteristics are split into separate groups. The factoextra package provides fviz_nbclust() for this; here we use a small custom function, kmeansTunning(), that plots the total within-cluster sum of squares for each number of clusters.

RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
kmeansTunning <- function(data, maxK) {
  withinall <- NULL
  total_k <- NULL
  for (i in 1:maxK) {
    set.seed(101)
    temp <- kmeans(data,i)$tot.withinss
    withinall <- append(withinall, temp)
    total_k <- append(total_k,i)
  }
  plot(x = total_k, y = withinall, type = "o", xlab = "Number of Cluster", 
       ylab = "Total within")
}

# kmeansTunning(your_data, maxK = 5)
kmeansTunning(data = data_num_z, maxK = 15) # with scaling

kmeansTunning(data = data_num, maxK = 15) # w/o scaling
## Warning: did not converge in 10 iterations

From this plot we have to locate the elbow, hence the name “elbow method”: the point where the curve bends and adding more clusters no longer reduces the total within-cluster sum of squares by much. Choosing it is a subjective judgment. I found that three and five clusters would be suitable.

  • k=3
  • k=5
  • k=8
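
The elbow is easier to recognise on data where the true number of groups is known. The sketch below applies the same idea as `kmeansTunning()` to three simulated, well-separated groups; the data and the helper name `wss_by_k()` are assumptions for this example. It also sets `nstart` and `iter.max`, two arguments of `kmeans()` that can help when a single run does not converge within the default 10 iterations.

```r
set.seed(101)
# Simulated data: three groups of 50 points each in two dimensions
centers <- rbind(c(0, 0), c(10, 10), c(-10, 10))
sim <- do.call(rbind, lapply(1:3, function(i) {
  cbind(rnorm(50, centers[i, 1]), rnorm(50, centers[i, 2]))
}))

# Total within-cluster sum of squares for k = 1..max_k
wss_by_k <- function(x, max_k) {
  sapply(seq_len(max_k), function(k) {
    kmeans(x, centers = k, nstart = 25, iter.max = 50)$tot.withinss
  })
}

wss <- wss_by_k(sim, max_k = 6)
plot(seq_along(wss), wss, type = "o",
     xlab = "Number of clusters", ylab = "Total within-cluster SS")
```

On this simulated data the curve drops steeply up to k = 3 and flattens afterwards, which is the bend the elbow method looks for. Real data, such as the shopper sessions, rarely shows such a clean bend.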

3.1.3 Three Clusters

Clustering process

# k-means with 3 clusters
RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(100)

data_num_km <- kmeans(data_num_z, centers = 3)

Use the mode to summarise the categorical data

mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}
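
A quick check of how this helper behaves, on made-up vectors. The function is repeated here under the hypothetical name `stat_mode()` so that the example runs on its own and does not mask base R's `mode()`. Note that ties are resolved in favour of the value that appears first.

```r
stat_mode <- function(x) {
  ux <- unique(x)                          # distinct values, in order of appearance
  ux[which.max(tabulate(match(x, ux)))]    # value with the highest count
}

visitors <- factor(c("Returning_Visitor", "New_Visitor",
                     "Returning_Visitor", "Other"))
stat_mode(visitors)                 # Returning_Visitor appears twice
stat_mode(c("b", "a", "a", "b"))    # tie: returns "b", the first value seen
```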

Combine the numerical and categorical data that were separated earlier

summary_num <- cbind(data_cat, data_num) %>% 
  mutate(cluster = as.factor(data_num_km$cluster)) %>%
  mutate_at(.vars = vars(Administrative, Informational, ProductRelated), 
            .funs = as.numeric) %>% 
  mutate_if(is.integer, as.factor) %>% 
  mutate_if(is.logical, as.factor) %>% 
  group_by(cluster) %>% 
  summarise_if(is.numeric, mean) 
cbind(data_cat, data_num) %>% 
  mutate(cluster = as.factor(data_num_km$cluster)) %>%
  mutate_at(.vars = vars(Administrative, Informational, ProductRelated),
            .funs = as.numeric) %>% 
  mutate_if(is.integer, as.factor) %>% 
  mutate_if(is.logical, as.factor) %>% 
  group_by(cluster) %>% 
  summarise_if(is.factor, mode) %>% 
  left_join(summary_num) %>% 
  reactable()
## Joining, by = "cluster"
library(factoextra)
fviz_cluster(object = data_num_km, # kmeans object
             data = data_num) # numerical variables

As we can see, the groups differ from one another, but clusters 1 and 2 overlap in the plot, so we also check a five-cluster solution.

3.1.4 Five Clusters

Clustering process

# k-means with 5 clusters
RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(100)

data_num_km2 <- kmeans(data_num_z, centers = 5)
summary_num <- cbind(data_cat, data_num) %>% 
  mutate(cluster = as.factor(data_num_km2$cluster)) %>%
  mutate_at(.vars = vars(Administrative, Informational, ProductRelated), 
            .funs = as.numeric) %>% 
  mutate_if(is.integer, as.factor) %>% 
  mutate_if(is.logical, as.factor) %>% 
  group_by(cluster) %>% 
  summarise_if(is.numeric, mean) 
summary <- cbind(data_cat, data_num) %>% 
  mutate(cluster = as.factor(data_num_km2$cluster)) %>%
  mutate_at(.vars = vars(Administrative, Informational, ProductRelated), 
            .funs = as.numeric) %>% 
  mutate_if(is.integer, as.factor) %>% 
  mutate_if(is.logical, as.factor) %>% 
  group_by(cluster) %>% 
  summarise_if(is.factor, mode) %>% 
  left_join(summary_num)
## Joining, by = "cluster"
summary %>% reactable()
library(factoextra)
fviz_cluster(object = data_num_km2, # kmeans object
             data = data_num) # numerical variables

Compared with the three-cluster model, this is the better of the two models we built.
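
A comparison like this can be supported with a number. One option, sketched below on simulated data, is the share of the total sum of squares that lies between clusters, `betweenss / totss`, both of which are components of the object returned by `kmeans()`. The data and the helper name `explained_share()` are assumptions for this example. The ratio tends to rise with k, so it should be read alongside the elbow plot rather than on its own.

```r
set.seed(100)
# Simulated data: three groups of 50 points each in two dimensions
centers <- rbind(c(0, 0), c(10, 10), c(-10, 10))
sim <- do.call(rbind, lapply(1:3, function(i) {
  cbind(rnorm(50, centers[i, 1]), rnorm(50, centers[i, 2]))
}))

# Share of the total sum of squares that lies between clusters
explained_share <- function(x, k) {
  km <- kmeans(x, centers = k, nstart = 25)
  km$betweenss / km$totss
}

share_3 <- explained_share(sim, 3)
share_5 <- explained_share(sim, 5)
c(k3 = share_3, k5 = share_5)
```

For the models in this article, the same quantities would be available as `data_num_km$betweenss / data_num_km$totss` and the equivalent for `data_num_km2`.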

3.1.5 Profiling

This is the number of observations in each cluster

data_num %>% mutate(cluster = data_num_km2$cluster) %>% group_by(cluster) %>%
  summarise(count = n()) %>% reactable()
summary %>% reactable()

When profiling, we first look at the numerical data, because the k-means clustering was performed on the numerical variables only. After that, we can look at the categorical data and interpret it where it says something meaningful about a group.

Why do we not rely on the categorical data? Because we summarise each cluster's categorical variables by their mode. This is not always informative, because the categories are far from equally frequent. Take a look at the table below.

data %>% group_by(VisitorType) %>% summarise(count = n()) %>% reactable()

From this table we can infer that Returning_Visitor will almost certainly be the mode of every cluster, simply because it is by far the most frequent category.
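
The imbalance can be quantified from the `summary(data)` output earlier in the article. The counts below are copied from that output; the variable names are chosen for this sketch.

```r
# VisitorType counts copied from the summary(data) output
visitor_counts <- c(New_Visitor = 1694, Other = 85, Returning_Visitor = 10551)

visitor_share <- visitor_counts / sum(visitor_counts)
round(visitor_share, 3)           # share of sessions per visitor type
names(which.max(visitor_share))
## [1] "Returning_Visitor"
```

With roughly 86% of all sessions coming from returning visitors, a cluster would need a very unusual composition for any other level to become its mode.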

We start the profiling by giving meaning to the values of the variables in each cluster.

Cluster 1:

  • Has the lowest number of administrative, informational and product-related pages visited, and the lowest duration on them
    • Less than 2 seconds on administrative and informational pages
    • 42 seconds on product-related pages
  • High exit rates
  • Visited 0 pages (meaning they only saw the page they entered on and then left)
  • Relatively high Special Day value

We can suggest that this is a group of people who enter the marketplace through promo deals close to a special day. They usually just check the product that the promo suggested, because they are only interested in product-related pages.

Cluster 2:

  • Relatively low administrative duration
  • Average informational and product-related duration
  • Relatively high bounce and exit rates
  • On average visited 1-2 pages
  • High Special Day value

These people are the special day promo hunters, as the high Special Day value suggests. They usually show no interest in administrative pages, hence the low duration there, and they usually hop between different promo pages, hence the high bounce and exit rates with relatively low page values.

Cluster 3:

Before we dive deeper into profiling the third cluster, keep in mind that the third and fourth clusters may seem contradictory, but they are actually two very distinct profiles that can both occur.

  • The highest duration on administrative, informational and product-related pages
    • more than 5 minutes on administrative pages
    • 4 minutes on informational pages
    • more than 1 hour on product-related pages
  • Relatively average bounce and exit rates
  • On average visited 7 pages
  • Average Special Day value

Cluster 4:

  • Relatively high duration on administrative, informational and product-related pages
    • more than 1.5 minutes on administrative pages
    • more than 20 seconds on informational pages
    • 20 minutes on product-related pages
  • Lowest bounce and exit rates
  • On average visited 70 pages
  • Low Special Day value

Clusters 3 and 4 are interesting. One might expect a longer duration to come with a higher page value, since our PCA showed the two to be positively correlated, yet cluster 4 has a higher page value than cluster 3.

From that information we can propose a profile for cluster 4. The high page value suggests customers with specific needs: they already know how to navigate the marketplace and simply move from one page to another to find the right purchase. This type of customer usually also has a high chance of actually buying the product.

Cluster 3 is more likely to consist of brand new customers. They have not yet familiarized themselves with the site, hence the high informational duration; they have never registered, hence the high administrative duration; and it takes time for them to learn how to operate and navigate the site, hence the high product-related duration and the low page values, because it takes them a long time to get from page to page.

Cluster 5:

  • Relatively average duration on administrative, informational and product-related pages
  • Average bounce and exit rates
  • On average visited 2-3 pages
  • Low Special Day value

This is the average daily user, a frequent checker of the site. They do not need to spend much time on each page, and visiting only 2-3 pages is enough for them. This is supported by the high number of observations in this group.