This article explores the different patterns in customers' behavior when browsing an e-commerce site or marketplace. It draws on a few key metrics defined by Google Analytics to segment people's browsing behavior, such as bounce rates, exit rates, and page values, which we will explain later in this article. Our goal is to find distinct differences in the way people browse a marketplace using K-means clustering and Principal Component Analysis (PCA).
The data on online shoppers' purchasing intention was obtained from the UCI Machine Learning Repository (archive.ics.uci.edu/ml/index.php). We will perform a clustering analysis with the K-means method to suggest customer segments, and we will also see whether we can reduce the dimensionality of the data using Principal Component Analysis (PCA).
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.4     v dplyr   1.0.7
## v tidyr   1.1.3     v stringr 1.4.0
## v readr   2.0.1     v forcats 0.5.1
## Warning: package 'tibble' was built under R version 4.1.1
## Warning: package 'readr' was built under R version 4.1.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(factoextra)
## Warning: package 'factoextra' was built under R version 4.1.1
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(FactoMineR)
## Warning: package 'FactoMineR' was built under R version 4.1.1
library(animation)
## Warning: package 'animation' was built under R version 4.1.1
library(knitr)
library(reactable)
## Warning: package 'reactable' was built under R version 4.1.1
data <- read.csv("data_input/online_shoppers_intention.csv", stringsAsFactors = T)
str(data)
## 'data.frame': 12330 obs. of 18 variables:
## $ Administrative : int 0 0 0 0 0 0 0 1 0 0 ...
## $ Administrative_Duration: num 0 0 0 0 0 0 0 0 0 0 ...
## $ Informational : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Informational_Duration : num 0 0 0 0 0 0 0 0 0 0 ...
## $ ProductRelated : int 1 2 1 2 10 19 1 0 2 3 ...
## $ ProductRelated_Duration: num 0 64 0 2.67 627.5 ...
## $ BounceRates : num 0.2 0 0.2 0.05 0.02 ...
## $ ExitRates : num 0.2 0.1 0.2 0.14 0.05 ...
## $ PageValues : num 0 0 0 0 0 0 0 0 0 0 ...
## $ SpecialDay : num 0 0 0 0 0 0 0.4 0 0.8 0.4 ...
## $ Month : Factor w/ 10 levels "Aug","Dec","Feb",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ OperatingSystems : int 1 2 4 3 3 2 2 1 2 2 ...
## $ Browser : int 1 2 1 2 3 2 4 2 2 4 ...
## $ Region : int 1 1 9 2 1 1 3 1 2 1 ...
## $ TrafficType : int 1 2 3 4 4 3 3 5 3 2 ...
## $ VisitorType : Factor w/ 3 levels "New_Visitor",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ Weekend : logi FALSE FALSE FALSE FALSE TRUE FALSE ...
## $ Revenue : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
This dataset consists of 12,330 sessions of customers browsing a marketplace over a one-year period, collected from an undisclosed company. The data contains 18 different variables that we are going to explore below.
summary(data)
## Administrative Administrative_Duration Informational
## Min. : 0.000 Min. : 0.00 Min. : 0.0000
## 1st Qu.: 0.000 1st Qu.: 0.00 1st Qu.: 0.0000
## Median : 1.000 Median : 7.50 Median : 0.0000
## Mean : 2.315 Mean : 80.82 Mean : 0.5036
## 3rd Qu.: 4.000 3rd Qu.: 93.26 3rd Qu.: 0.0000
## Max. :27.000 Max. :3398.75 Max. :24.0000
##
## Informational_Duration ProductRelated ProductRelated_Duration
## Min. : 0.00 Min. : 0.00 Min. : 0.0
## 1st Qu.: 0.00 1st Qu.: 7.00 1st Qu.: 184.1
## Median : 0.00 Median : 18.00 Median : 598.9
## Mean : 34.47 Mean : 31.73 Mean : 1194.8
## 3rd Qu.: 0.00 3rd Qu.: 38.00 3rd Qu.: 1464.2
## Max. :2549.38 Max. :705.00 Max. :63973.5
##
## BounceRates ExitRates PageValues SpecialDay
## Min. :0.000000 Min. :0.00000 Min. : 0.000 Min. :0.00000
## 1st Qu.:0.000000 1st Qu.:0.01429 1st Qu.: 0.000 1st Qu.:0.00000
## Median :0.003112 Median :0.02516 Median : 0.000 Median :0.00000
## Mean :0.022191 Mean :0.04307 Mean : 5.889 Mean :0.06143
## 3rd Qu.:0.016813 3rd Qu.:0.05000 3rd Qu.: 0.000 3rd Qu.:0.00000
## Max. :0.200000 Max. :0.20000 Max. :361.764 Max. :1.00000
##
## Month OperatingSystems Browser Region
## May :3364 Min. :1.000 Min. : 1.000 Min. :1.000
## Nov :2998 1st Qu.:2.000 1st Qu.: 2.000 1st Qu.:1.000
## Mar :1907 Median :2.000 Median : 2.000 Median :3.000
## Dec :1727 Mean :2.124 Mean : 2.357 Mean :3.147
## Oct : 549 3rd Qu.:3.000 3rd Qu.: 2.000 3rd Qu.:4.000
## Sep : 448 Max. :8.000 Max. :13.000 Max. :9.000
## (Other):1337
## TrafficType VisitorType Weekend Revenue
## Min. : 1.00 New_Visitor : 1694 Mode :logical Mode :logical
## 1st Qu.: 2.00 Other : 85 FALSE:9462 FALSE:10422
## Median : 2.00 Returning_Visitor:10551 TRUE :2868 TRUE :1908
## Mean : 4.07
## 3rd Qu.: 4.00
## Max. :20.00
##
Variables:
Below is an illustration of bounce and exit rates, which Google Analytics defines as key metrics for differentiating groups of people browsing a website.
library(FactoMineR)
quantivar <- 1:10
qualivar <- 11:18

data_pca <- PCA(X = data, scale.unit = T, ncp = 10, quali.sup = qualivar,
                graph = F)

plot.PCA(x = data_pca,
         choix = "ind",        # individual factor map
         invisible = "quali",
         select = "contrib 5", # label the 5 highest-contributing outliers
         habillage = 16        # color by VisitorType (column 16)
)

plot.PCA(x = data_pca,
         choix = "var")
From this plot we can point out the differences between the variables as well as the correlations among them. Take a look at Exit Rates, Bounce Rates, and Page Values: Bounce Rates and Exit Rates have a high positive correlation, meaning the higher the Bounce Rate, the higher the Exit Rate, and vice versa. Page Values moves in the opposite direction: as Page Values increase, the bounce and exit rates decrease.
Another correlated pair is the number of pages visited and the time spent on them. For instance, the more Informational pages a visitor browses, the longer the duration spent on them, and the same holds for the other page types.
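To back up the visual reading of the correlation circle with numbers, we can inspect the correlation of each variable with the principal components directly. This is a small illustrative check using fields the PCA object already carries, not part of the original analysis:
# correlation of each quantitative variable with the first two components
round(data_pca$var$cor[, 1:2], 2)
Variables that load strongly with the same sign on a component (such as BounceRates and ExitRates) point in the same direction in the correlation circle.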
dim <- dimdesc(data_pca)
as.data.frame(dim$Dim.1$quanti) %>% reactable()
as.data.frame(dim$Dim.2$quanti) %>% reactable()
as.data.frame(dim$Dim.3$quanti) %>% reactable()
data_pca$eig
## eigenvalue percentage of variance cumulative percentage of variance
## comp 1 3.40038402 34.0038402 34.00384
## comp 2 1.67518179 16.7518179 50.75566
## comp 3 1.07129276 10.7129276 61.46859
## comp 4 1.01076078 10.1076078 71.57619
## comp 5 0.94107838 9.4107838 80.98698
## comp 6 0.92712013 9.2712013 90.25818
## comp 7 0.42202120 4.2202120 94.47839
## comp 8 0.35166643 3.5166643 97.99505
## comp 9 0.12288660 1.2288660 99.22392
## comp 10 0.07760792 0.7760792 100.00000
From the eigenvalues above, if we want to reduce the dimensionality of our shopping data while still retaining about 80% of the original variance, we can reduce the dimensions to 5 components (cumulative variance of roughly 81%).
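As a quick check on that claim, a one-liner on the eigenvalue table FactoMineR returned above (a small illustrative sketch, not part of the original analysis):
# smallest number of components whose cumulative variance reaches 80%
cum_var <- data_pca$eig[, "cumulative percentage of variance"]
which(cum_var >= 80)[1]
## [1] 5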
data_keep <- as.data.frame(data_pca$ind$coord[, c(1:5)])
data_keep %>% head() %>% reactable()
Clustering aims to classify data into separate, distinguishable clusters with different characteristics, where observations within a cluster are similar to one another and observations in different clusters are dissimilar.
The algorithm we are going to use is K-means, a centroid-based clustering algorithm in which clusters are separated by their centroids.
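To make the centroid idea concrete, here is a minimal sketch of the K-means assignment step on hypothetical toy data (not the shopper data): each observation is assigned to the centroid nearest to it in Euclidean distance, after which K-means would recompute each centroid as the mean of its assigned points and repeat until assignments stop changing.
set.seed(101)
toy       <- matrix(rnorm(20), ncol = 2)  # 10 hypothetical points, 2 features
centroids <- toy[sample(nrow(toy), 3), ]  # 3 randomly chosen initial centroids

# Euclidean distance of every point to every centroid
d <- sapply(1:nrow(centroids), function(k) {
  sqrt(rowSums((toy - matrix(centroids[k, ], nrow(toy), 2, byrow = TRUE))^2))
})
cluster <- apply(d, 1, which.min) # nearest-centroid assignment
cluster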
glimpse(data)
## Rows: 12,330
## Columns: 18
## $ Administrative <int> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 2~
## $ Administrative_Duration <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5~
## $ Informational <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ Informational_Duration <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ ProductRelated <int> 1, 2, 1, 2, 10, 19, 1, 0, 2, 3, 3, 16, 7, 6, 2~
## $ ProductRelated_Duration <dbl> 0.000000, 64.000000, 0.000000, 2.666667, 627.5~
## $ BounceRates <dbl> 0.200000000, 0.000000000, 0.200000000, 0.05000~
## $ ExitRates <dbl> 0.200000000, 0.100000000, 0.200000000, 0.14000~
## $ PageValues <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ SpecialDay <dbl> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.4, 0.0, 0.8, 0~
## $ Month <fct> Feb, Feb, Feb, Feb, Feb, Feb, Feb, Feb, Feb, F~
## $ OperatingSystems <int> 1, 2, 4, 3, 3, 2, 2, 1, 2, 2, 1, 1, 1, 2, 3, 1~
## $ Browser <int> 1, 2, 1, 2, 3, 2, 4, 2, 2, 4, 1, 1, 1, 5, 2, 1~
## $ Region <int> 1, 1, 9, 2, 1, 1, 3, 1, 2, 1, 3, 4, 1, 1, 3, 9~
## $ TrafficType <int> 1, 2, 3, 4, 4, 3, 3, 5, 3, 2, 3, 3, 3, 3, 3, 3~
## $ VisitorType <fct> Returning_Visitor, Returning_Visitor, Returnin~
## $ Weekend <lgl> FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE~
## $ Revenue <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALS~
Our model focuses on differentiating the ways people browse the website. We do not yet know what the segments will be, but the aim is to better understand what people want, so that through those segments the website's developers can target a program at each group, hopefully creating more purchases and maximizing the company's revenue.
First of all, we separate the data in two, categorical and numerical, because K-means can only process numerical data.
data_cat <- data %>%
  select(Month:Revenue)

data_num <- data %>%
  select(Administrative:SpecialDay)
Then we have to scale the numerical variables to avoid differences in magnitude. For example, variables such as Administrative are counts with no practical upper bound, while Bounce Rates are proportions whose maximum value is 1.0, or 100%. Between these two variables, the difference in magnitude would dominate the distance measure and bias our model.
data_num_z <- scale(data_num)
After scaling, we can determine the number of clusters: we want enough clusters that the groups remain distinguishable from one another, but not so many that the grouping becomes pointless because groups with the same characteristics get split in two. To do this we apply the elbow method with the custom kmeansTunning() function below.
RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
kmeansTunning <- function(data, maxK) {
  withinall <- NULL
  total_k <- NULL
  for (i in 1:maxK) {
    set.seed(101)
    # total within-cluster sum of squares for k = i clusters
    temp <- kmeans(data, i)$tot.withinss
    withinall <- append(withinall, temp)
    total_k <- append(total_k, i)
  }
  plot(x = total_k, y = withinall, type = "o", xlab = "Number of Cluster",
       ylab = "Total within")
}
# kmeansTunning(your_data, maxK = 5)
kmeansTunning(data = data_num_z, maxK = 15) # with scaling
kmeansTunning(data = data_num, maxK = 15)   # w/o scaling
## Warning: did not converge in 10 iterations
From these plots we determine the elbow, hence the name "elbow method": we look for the point where the curve bends sharply, which is partly a subjective call. I found that both three and five clusters look suitable.
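As an alternative to the custom loop, factoextra (already loaded above) ships a helper that draws the same total-within-sum-of-squares elbow plot; a sketch of equivalent usage:
# elbow plot for k = 1..15 on the scaled data
fviz_nbclust(data_num_z, kmeans, method = "wss", k.max = 15)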
Clustering Process
# k-means with 3 clusters
RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(100)
data_num_km <- kmeans(data_num_z, centers = 3)
We use the mode to inspect the categorical data:
mode <- function(x) {
  ux <- unique(x)
  # return the most frequent value in x
  ux[which.max(tabulate(match(x, ux)))]
}
We then combine the numerical and categorical data that were separated earlier and summarize each cluster:
summary_num <- cbind(data_cat, data_num) %>%
  mutate(cluster = as.factor(data_num_km$cluster)) %>%
  mutate_at(.vars = vars(Administrative, Informational, ProductRelated),
            .funs = as.numeric) %>%
  mutate_if(is.integer, as.factor) %>%
  mutate_if(is.logical, as.factor) %>%
  group_by(cluster) %>%
  summarise_if(is.numeric, mean)

cbind(data_cat, data_num) %>%
  mutate(cluster = as.factor(data_num_km$cluster)) %>%
  mutate_at(.vars = vars(Administrative, Informational, ProductRelated),
            .funs = as.numeric) %>%
  mutate_if(is.integer, as.factor) %>%
  mutate_if(is.logical, as.factor) %>%
  group_by(cluster) %>%
  summarise_if(is.factor, mode) %>%
  left_join(summary_num) %>%
  reactable()
## Joining, by = "cluster"
library(factoextra)
fviz_cluster(object = data_num_km, # kmeans object
             data = data_num)      # numerical variables
As we can see, the groups differ from one another, but clusters 1 and 2 pile up on top of each other in the plot, so we also check 5 clusters.
Clustering Process
# k-means with 5 clusters
RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(100)
data_num_km2 <- kmeans(data_num_z, centers = 5)

summary_num <- cbind(data_cat, data_num) %>%
mutate(cluster = as.factor(data_num_km2$cluster)) %>%
mutate_at(.vars = vars(Administrative, Informational, ProductRelated),
.funs = as.numeric) %>%
mutate_if(is.integer, as.factor) %>%
mutate_if(is.logical, as.factor) %>%
group_by(cluster) %>%
summarise_if(is.numeric, mean)

summary <- cbind(data_cat, data_num) %>%
mutate(cluster = as.factor(data_num_km2$cluster)) %>%
mutate_at(.vars = vars(Administrative, Informational, ProductRelated),
.funs = as.numeric) %>%
mutate_if(is.integer, as.factor) %>%
mutate_if(is.logical, as.factor) %>%
group_by(cluster) %>%
summarise_if(is.factor, mode) %>%
left_join(summary_num)
## Joining, by = "cluster"
summary %>% reactable()

library(factoextra)
fviz_cluster(object = data_num_km2, # kmeans object
             data = data_num)       # numerical variables
Compared to the other model we built, this one separates the clusters better.
This is the number of observations in each cluster:
data_num %>% mutate(cluster = data_num_km2$cluster) %>% group_by(cluster) %>%
  summarise(count = n()) %>% reactable()

summary %>% reactable()
When profiling, we first look at the numerical variables, because the K-means clustering itself was performed on numerical data only. Afterwards we can bring in the categorical variables where they add meaning to a group.
Why don't we use the categorical data in the clustering itself? Because we summarize each cluster's categorical variables with the mode, and that is not always informative, since the weight of each category differs. Take a look at the table below.
data %>% group_by(VisitorType) %>% summarise(count = n()) %>% reactable()
From this table we can infer that Returning_Visitor will almost certainly be chosen as the mode of every cluster, simply because returning visitors vastly outnumber the other visitor types.
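The imbalance is even clearer as proportions (an illustrative check derived from the counts above):
# share of each visitor type across all sessions
round(prop.table(table(data$VisitorType)), 3)
## New_Visitor             Other Returning_Visitor
##       0.137             0.007             0.856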
We start the profiling by giving meaning to the values of our variables in each cluster.
Cluster 1:
We can reasonably suggest that this group of people enters the marketplace through promo deals close to a special day, usually just to check out which products the promo covers, since they are only interested in product-related pages.
Cluster 2:
These people are the special-day promo hunters, as the high SpecialDay value suggests. They usually show no interest in administrative pages, hence the low values there, and tend to hop between different promo pages, hence the high bounce and exit rates with relatively low page values.
Cluster 3:
Before we dive deeper into profiling the third cluster, keep in mind that the third and fourth clusters may seem contradictory, as if they could never co-exist, but they are actually two very distinct profiles that can both occur.
Cluster 4:
Clusters 3 and 4 are interesting: one might expect that the longer the duration, the higher the page value, and our PCA indeed showed these two are positively correlated, BUT cluster 4 has a higher page value than cluster 3.
From that information we can propose a profile for cluster 4. The high page values suggest customers with specific needs: they already know how to navigate the marketplace and simply hop from one page to another to find the right purchase. As additional context, this type of customer usually has a high chance of actually buying the product.
Cluster 3 is more likely to be brand-new customers: they have not yet familiarized themselves with the site, hence the high informational duration; they have never registered, hence the high administrative duration; and it takes them time to learn how to operate and navigate the site, hence the high product-related duration and the low page values, because moving from page to page is slow for them.
Cluster 5:
This is the average, daily user and frequent visitor of the site. They don't need much time browsing each page; two or three pages are enough for them. This is supported by the high number of observations in this group.