What are the characteristics of Kira Plastinina's customer groups?
The Kira Plastinina company is a Russian fashion brand that previously belonged to the Russian fashion designer Kira Plastinina. The brand was sold through a now-defunct chain of eponymous retail stores in Russia, Ukraine, Kazakhstan, Belarus, China, the Philippines and Armenia.
The brand's sales and marketing team requested an in-depth analysis of their customer groups, and provided data collected over the past year for that purpose. The main aim of the project was to satisfy the client's request and to develop a clustering model for grouping the various customers, in order to derive useful insights that can improve the brand's marketing strategy and boost sales.
For the data to be relevant, it should have meaningful insights that can be used to identify the characteristics of the brand’s customer groups. The dataset should also have relevant information that allows for efficient and relevant clustering of customer groups.
# Suppressing warnings
defaultW <- getOption("warn")
options(warn = -1)
# Libraries
library(data.table) # Data Table library
library(plyr) # Data manipulation library
library(ggplot2) # Plotting Library
library(moments) # Measure of distribution library
library(ggcorrplot) # Correlation plotting library
library(mice) # Missing values
##
## Attaching package: 'mice'
## The following object is masked from 'package:stats':
##
## filter
## The following objects are masked from 'package:base':
##
## cbind, rbind
library(stats) # KMeans
library(factoextra) # Cluster visualization
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(ggpubr) # Multiple graph visualization
##
## Attaching package: 'ggpubr'
## The following object is masked from 'package:plyr':
##
## mutate
library(cluster) # Gap-Statistic
library(caret) # Preprocessing (min-max scaling)
## Loading required package: lattice
library(lattice) # Used with caret
# Loading the data
cust <- fread('online_shoppers_intention.csv')
# Number of rows and columns of the customer data.
cat('Number of rows = ', nrow(cust), 'and the number of columns = ', ncol(cust),'.')
## Number of rows = 12330 and the number of columns = 18 .
head(cust, 5)
## Administrative Administrative_Duration Informational Informational_Duration
## 1: 0 0 0 0
## 2: 0 0 0 0
## 3: 0 -1 0 -1
## 4: 0 0 0 0
## 5: 0 0 0 0
## ProductRelated ProductRelated_Duration BounceRates ExitRates PageValues
## 1: 1 0.000000 0.20 0.20 0
## 2: 2 64.000000 0.00 0.10 0
## 3: 1 -1.000000 0.20 0.20 0
## 4: 2 2.666667 0.05 0.14 0
## 5: 10 627.500000 0.02 0.05 0
## SpecialDay Month OperatingSystems Browser Region TrafficType
## 1: 0 Feb 1 1 1 1
## 2: 0 Feb 2 2 1 2
## 3: 0 Feb 4 1 9 3
## 4: 0 Feb 3 2 2 4
## 5: 0 Feb 3 3 1 4
## VisitorType Weekend Revenue
## 1: Returning_Visitor FALSE FALSE
## 2: Returning_Visitor FALSE FALSE
## 3: Returning_Visitor FALSE FALSE
## 4: Returning_Visitor FALSE FALSE
## 5: Returning_Visitor TRUE FALSE
tail(cust, 5)
## Administrative Administrative_Duration Informational Informational_Duration
## 1: 3 145 0 0
## 2: 0 0 0 0
## 3: 0 0 0 0
## 4: 4 75 0 0
## 5: 0 0 0 0
## ProductRelated ProductRelated_Duration BounceRates ExitRates PageValues
## 1: 53 1783.792 0.007142857 0.02903061 12.24172
## 2: 5 465.750 0.000000000 0.02133333 0.00000
## 3: 6 184.250 0.083333333 0.08666667 0.00000
## 4: 15 346.000 0.000000000 0.02105263 0.00000
## 5: 3 21.250 0.000000000 0.06666667 0.00000
## SpecialDay Month OperatingSystems Browser Region TrafficType
## 1: 0 Dec 4 6 1 1
## 2: 0 Nov 3 2 1 8
## 3: 0 Nov 3 2 1 13
## 4: 0 Nov 2 2 3 11
## 5: 0 Nov 3 2 1 2
## VisitorType Weekend Revenue
## 1: Returning_Visitor TRUE FALSE
## 2: Returning_Visitor TRUE FALSE
## 3: Returning_Visitor TRUE FALSE
## 4: Returning_Visitor FALSE FALSE
## 5: New_Visitor TRUE FALSE
At first glance, the data set looks largely plausible, although negative duration values (-1) are already visible in the head and will be investigated below.
str(cust)
## Classes 'data.table' and 'data.frame': 12330 obs. of 18 variables:
## $ Administrative : int 0 0 0 0 0 0 0 1 0 0 ...
## $ Administrative_Duration: num 0 0 -1 0 0 0 -1 -1 0 0 ...
## $ Informational : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Informational_Duration : num 0 0 -1 0 0 0 -1 -1 0 0 ...
## $ ProductRelated : int 1 2 1 2 10 19 1 1 2 3 ...
## $ ProductRelated_Duration: num 0 64 -1 2.67 627.5 ...
## $ BounceRates : num 0.2 0 0.2 0.05 0.02 ...
## $ ExitRates : num 0.2 0.1 0.2 0.14 0.05 ...
## $ PageValues : num 0 0 0 0 0 0 0 0 0 0 ...
## $ SpecialDay : num 0 0 0 0 0 0 0.4 0 0.8 0.4 ...
## $ Month : chr "Feb" "Feb" "Feb" "Feb" ...
## $ OperatingSystems : int 1 2 4 3 3 2 2 1 2 2 ...
## $ Browser : int 1 2 1 2 3 2 4 2 2 4 ...
## $ Region : int 1 1 9 2 1 1 3 1 2 1 ...
## $ TrafficType : int 1 2 3 4 4 3 3 5 3 2 ...
## $ VisitorType : chr "Returning_Visitor" "Returning_Visitor" "Returning_Visitor" "Returning_Visitor" ...
## $ Weekend : logi FALSE FALSE FALSE FALSE TRUE FALSE ...
## $ Revenue : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## - attr(*, ".internal.selfref")=<externalptr>
All columns have workable data types; however, the categorical columns will be converted to factors.
# Converting categorical columns to factor types
cust[,11:18] <- lapply(cust[,11:18], factor)
# Checking changes
head(cust, 5)
## Administrative Administrative_Duration Informational Informational_Duration
## 1: 0 0 0 0
## 2: 0 0 0 0
## 3: 0 -1 0 -1
## 4: 0 0 0 0
## 5: 0 0 0 0
## ProductRelated ProductRelated_Duration BounceRates ExitRates PageValues
## 1: 1 0.000000 0.20 0.20 0
## 2: 2 64.000000 0.00 0.10 0
## 3: 1 -1.000000 0.20 0.20 0
## 4: 2 2.666667 0.05 0.14 0
## 5: 10 627.500000 0.02 0.05 0
## SpecialDay Month OperatingSystems Browser Region TrafficType
## 1: 0 Feb 1 1 1 1
## 2: 0 Feb 2 2 1 2
## 3: 0 Feb 4 1 9 3
## 4: 0 Feb 3 2 2 4
## 5: 0 Feb 3 3 1 4
## VisitorType Weekend Revenue
## 1: Returning_Visitor FALSE FALSE
## 2: Returning_Visitor FALSE FALSE
## 3: Returning_Visitor FALSE FALSE
## 4: Returning_Visitor FALSE FALSE
## 5: Returning_Visitor TRUE FALSE
The data was provided by the client and was collected over the past year; therefore, there is no need for external dataset validation.
Column Validation
colnames(cust)
## [1] "Administrative" "Administrative_Duration"
## [3] "Informational" "Informational_Duration"
## [5] "ProductRelated" "ProductRelated_Duration"
## [7] "BounceRates" "ExitRates"
## [9] "PageValues" "SpecialDay"
## [11] "Month" "OperatingSystems"
## [13] "Browser" "Region"
## [15] "TrafficType" "VisitorType"
## [17] "Weekend" "Revenue"
All columns are valid.
Checking for invalid values
# Checking for anomalies
# Data set summary
summary(cust)
## Administrative Administrative_Duration Informational
## Min. : 0.000 Min. : -1.00 Min. : 0.000
## 1st Qu.: 0.000 1st Qu.: 0.00 1st Qu.: 0.000
## Median : 1.000 Median : 8.00 Median : 0.000
## Mean : 2.318 Mean : 80.91 Mean : 0.504
## 3rd Qu.: 4.000 3rd Qu.: 93.50 3rd Qu.: 0.000
## Max. :27.000 Max. :3398.75 Max. :24.000
## NA's :14 NA's :14 NA's :14
## Informational_Duration ProductRelated ProductRelated_Duration
## Min. : -1.00 Min. : 0.00 Min. : -1.0
## 1st Qu.: 0.00 1st Qu.: 7.00 1st Qu.: 185.0
## Median : 0.00 Median : 18.00 Median : 599.8
## Mean : 34.51 Mean : 31.76 Mean : 1196.0
## 3rd Qu.: 0.00 3rd Qu.: 38.00 3rd Qu.: 1466.5
## Max. :2549.38 Max. :705.00 Max. :63973.5
## NA's :14 NA's :14 NA's :14
## BounceRates ExitRates PageValues SpecialDay
## Min. :0.000000 Min. :0.00000 Min. : 0.000 Min. :0.00000
## 1st Qu.:0.000000 1st Qu.:0.01429 1st Qu.: 0.000 1st Qu.:0.00000
## Median :0.003119 Median :0.02512 Median : 0.000 Median :0.00000
## Mean :0.022152 Mean :0.04300 Mean : 5.889 Mean :0.06143
## 3rd Qu.:0.016684 3rd Qu.:0.05000 3rd Qu.: 0.000 3rd Qu.:0.00000
## Max. :0.200000 Max. :0.20000 Max. :361.764 Max. :1.00000
## NA's :14 NA's :14
## Month OperatingSystems Browser Region TrafficType
## May :3364 2 :6601 2 :7961 1 :4780 2 :3913
## Nov :2998 1 :2585 1 :2462 3 :2403 1 :2451
## Mar :1907 3 :2555 4 : 736 4 :1182 3 :2052
## Dec :1727 4 : 478 5 : 467 2 :1136 4 :1069
## Oct : 549 8 : 79 6 : 174 6 : 805 13 : 738
## Sep : 448 6 : 19 10 : 163 7 : 761 10 : 450
## (Other):1337 (Other): 13 (Other): 367 (Other):1263 (Other):1657
## VisitorType Weekend Revenue
## New_Visitor : 1694 FALSE:9462 FALSE:10422
## Other : 85 TRUE :2868 TRUE : 1908
## Returning_Visitor:10551
##
##
##
##
The duration columns have negative minimum values. They will be converted to positive values as durations cannot be negative.
# Changing the negative values to positive values
cust[,c(2,4,6)][(cust[,c(2,4,6)] == -1) & !is.na(cust[,c(2,4,6)])] <- 1
# Confirming changes
summary(cust[,c(2,4,6)])
## Administrative_Duration Informational_Duration ProductRelated_Duration
## Min. : 0.00 Min. : 0.00 Min. : 0.0
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 185.0
## Median : 8.00 Median : 0.00 Median : 599.8
## Mean : 80.91 Mean : 34.51 Mean : 1196.0
## 3rd Qu.: 93.50 3rd Qu.: 0.00 3rd Qu.: 1466.5
## Max. :3398.75 Max. :2549.38 Max. :63973.5
## NA's :14 NA's :14 NA's :14
# Checking for missing values
colSums(is.na(cust))
## Administrative Administrative_Duration Informational
## 14 14 14
## Informational_Duration ProductRelated ProductRelated_Duration
## 14 14 14
## BounceRates ExitRates PageValues
## 14 14 0
## SpecialDay Month OperatingSystems
## 0 0 0
## Browser Region TrafficType
## 0 0 0
## VisitorType Weekend Revenue
## 0 0 0
# Percentage of missing values (NA cells relative to the number of rows)
perc <- (sum(is.na(cust))/nrow(cust))*100
percentage.missing <- paste0('Missing values = ', round(perc, 2), '%')
percentage.missing
## [1] "Missing values = 0.91%"
# Imputing the numeric columns using the mean as the missing values are minimal.
cust <- data.frame(cust)
for(col in 1:8) {
cust[ , col][is.na(cust[ , col])] <- mean(cust[ , col], na.rm = TRUE)
}
# Confirming changes
colSums(is.na(cust))
## Administrative Administrative_Duration Informational
## 0 0 0
## Informational_Duration ProductRelated ProductRelated_Duration
## 0 0 0
## BounceRates ExitRates PageValues
## 0 0 0
## SpecialDay Month OperatingSystems
## 0 0 0
## Browser Region TrafficType
## 0 0 0
## VisitorType Weekend Revenue
## 0 0 0
All missing values have been imputed. The mean was used because the percentage of missing values was minuscule; using MICE would have been computationally expensive for such a simple imputation task.
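For reference, a MICE-based alternative would have looked roughly like the following sketch (left unrun here for the reason above; the mice package is already loaded):
# imp <- mice(cust, m = 5, method = 'pmm', seed = 0) # 5 multiply-imputed copies
# cust <- complete(imp)                              # extract the first completed dataset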
# Checking for duplicates
sum(duplicated(cust))
## [1] 119
# Removing identified duplicates
cust <- cust[!duplicated(cust),]
# Confirming changes
sum(duplicated(cust))
## [1] 0
Duplicates have been removed.
# Checking the uniformity of column names
colnames(cust)
## [1] "Administrative" "Administrative_Duration"
## [3] "Informational" "Informational_Duration"
## [5] "ProductRelated" "ProductRelated_Duration"
## [7] "BounceRates" "ExitRates"
## [9] "PageValues" "SpecialDay"
## [11] "Month" "OperatingSystems"
## [13] "Browser" "Region"
## [15] "TrafficType" "VisitorType"
## [17] "Weekend" "Revenue"
The column names do not have a uniform format; underscores will be introduced throughout.
# Introducing underscores
colnames(cust) <- c("Aministrative", "Administrative_Duration", "Informational",
"Informational_Duration", "Product_Related", "Product_Related_Duration",
"Bounce_Rates", "Exit_Rates", "Page_Values", "Special_Day", "Month", "Operating_Systems", "Browser", "Region", "Traffic_Type", "Visitor_Type", "Weekend", "Revenue")
# Confirming changes
colnames(cust)
## [1] "Aministrative" "Administrative_Duration"
## [3] "Informational" "Informational_Duration"
## [5] "Product_Related" "Product_Related_Duration"
## [7] "Bounce_Rates" "Exit_Rates"
## [9] "Page_Values" "Special_Day"
## [11] "Month" "Operating_Systems"
## [13] "Browser" "Region"
## [15] "Traffic_Type" "Visitor_Type"
## [17] "Weekend" "Revenue"
The Administrative column will be renamed as it has been misspelled.
# Renaming the column name
colnames(cust)[1] <- c('Administrative')
# Confirming changes
colnames(cust)[1]
## [1] "Administrative"
# Selecting numerical columns
num <- unlist(lapply(cust, is.numeric))
# Numeric data frame
num_df <- cust[, num]
# Number of numeric columns
length(names(num_df))
## [1] 10
# Boxplots
par(mfrow = c(5,2), mar=c(1,5,2,2))
for (i in 1:length(num_df)){
boxplot(num_df[ , i], main = paste('Boxplot of', names(num_df)[i]),
ylab = 'Count')
}
All columns have outliers.
# Outliers summary function
outliers.summary <- function(data){
total <- c(0) # Placeholder for total number of outliers
cat('Percentage outliers per column\n')
cat(rep('-', 16), '\n')
for (col in names(data)){
# IQR
Q1 <- quantile(data[,col], probs=.25)
Q3 <- quantile(data[,col], probs=.75)
IQR = Q3-Q1
# Limits
upper.limit = Q3 + (IQR*1.5)
lower.limit = Q1 - (IQR*1.5)
# Outliers per column
out <- sum(data[,col] > upper.limit | data[,col] < lower.limit)
total <- total + out
perc <- (out/nrow(data))*100
print(paste0(col, ' = ', round(perc,2), '%'))
cat('\n')
}
# Total percentage of outliers
cat('\nTotal percentage of outliers\n')
cat(rep('-', 16), '\n')
cells <- nrow(data)*ncol(data)
perc <- (total/cells)*100
print(paste0('Total = ', round(perc,2), '%'))
}
# Outliers summary
outliers.summary(num_df)
## Percentage outliers per column
## - - - - - - - - - - - - - - - -
## [1] "Administrative = 3.31%"
##
## [1] "Administrative_Duration = 9.42%"
##
## [1] "Informational = 21.64%"
##
## [1] "Informational_Duration = 20.06%"
##
## [1] "Product_Related = 8.25%"
##
## [1] "Product_Related_Duration = 7.81%"
##
## [1] "Bounce_Rates = 11.72%"
##
## [1] "Exit_Rates = 10.87%"
##
## [1] "Page_Values = 22.36%"
##
## [1] "Special_Day = 10.23%"
##
##
## Total percentage of outliers
## - - - - - - - - - - - - - - - -
## [1] "Total = 12.56%"
Outliers will be retained for analysis, then scaling or removal will be applied during modelling.
# Removing outliers
# Function to obtain logical vector of outliers
outliers.vector <- function(data) {
# IQR
Q1 <- quantile(data, probs=.25)
Q3 <- quantile(data, probs=.75)
IQR <- Q3-Q1
# Limits
upper.limit = Q3 + (IQR*1.5)
lower.limit = Q1 - (IQR*1.5)
# Logical vector
data > upper.limit | data < lower.limit
}
remove.outliers <- function(data, cols = names(data)) {
for (col in cols) {
data <- data[!outliers.vector(data[[col]]),]
}
data
}
# Copy of the dataset.
cust.no <- data.frame(cust)
# Function that removes rows with outliers
fill.func <- function(data){
for(col in 1:10){
no <- outliers.vector(data[,col])
data <- data[!no,]
}
data
}
# Function call
cust.no <- fill.func(cust.no)
# Percentage of rows remaining after outlier removal
perc <- ((nrow(cust.no))/nrow(cust))*100
print(paste0('Remaining rows = ', round((perc),2), '%'))
## [1] "Remaining rows = 40.7%"
print(nrow(cust.no))
## [1] 4970
The majority of the rows (~59%) have been dropped, a significant loss of data.
# Count plot and normal bar plot function.
bar.plt <- function(data, col1, title, legend = NULL, colors = NULL, method, col2 = NULL){
if (method == 'count'){
ggplot(data, aes(x = {{col1}}, fill = {{col1}})) + geom_bar() +
ggtitle(paste(title, 'Frequency Plot')) +
theme(plot.title = element_text(hjust = 0.5))+
scale_fill_manual(legend, values = colors)}
else if (method == 'bar'){
ggplot(data, aes(x = {{col1}}, y = {{col2}}, fill = {{col1}})) + geom_bar(stat = 'identity') + ggtitle(paste(title, 'Bar Plot')) +
theme(plot.title = element_text(hjust = 0.5))+
scale_fill_manual(legend, values = colors)}
}
# Categorical columns (the selector is named cat.cols to avoid shadowing base::cat)
cat.cols <- unlist(lapply(cust, is.factor))
cat.df <- cust[,cat.cols]
Month
# Unique month values
print(unique(cat.df$Month))
## [1] Feb Mar May Oct June Jul Aug Nov Sep Dec
## Levels: Aug Dec Feb Jul June Mar May Nov Oct Sep
length(unique(cat.df$Month))
## [1] 10
The dataset does not have the months of January and April.
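As a quick verification, the absent months can be identified programmatically. Note that this dataset spells June out as 'June' rather than 'Jun', so the built-in month.abb vector cannot be used directly:
# Calendar months in the dataset's spelling
all.months <- c('Jan', 'Feb', 'Mar', 'Apr', 'May', 'June', 'Jul', 'Aug',
                'Sep', 'Oct', 'Nov', 'Dec')
# Given the unique values shown above, this returns "Jan" and "Apr"
missing.months <- setdiff(all.months, unique(as.character(cat.df$Month)))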
# Ordering Factors
cust$Month <- factor(cust$Month, levels = c('Feb', 'Mar', 'May', 'June', 'Jul',
'Aug', 'Sep', 'Oct', 'Nov', 'Dec'))
cat.cols <- unlist(lapply(cust, is.factor))
cat.df <- cust[,cat.cols]
print(unique(cat.df$Month))
## [1] Feb Mar May Oct June Jul Aug Nov Sep Dec
## Levels: Feb Mar May June Jul Aug Sep Oct Nov Dec
# Months column bar plot
bar.plt (cat.df, Month, title = 'Month Column', method = 'count', legend =
'Month', colors = c('Feb' = '#03a9f4', 'Mar' = '#0398dc',
'May' = '#0287c3', 'June' = '#0276AB',
'Jul' = '#026592', 'Aug' = '#02557A',
'Sep' = '#014462', 'Oct' = '#013349',
'Nov' = '#012231', 'Dec' = '#001118'),
col2 = NULL)
The months of May, November and March have the highest frequencies.
Operating Systems
# Unique values
unique(cat.df$Operating_Systems)
## [1] 1 2 4 3 7 6 8 5
## Levels: 1 2 3 4 5 6 7 8
# Operating Systems column bar plot
bar.plt (cat.df, Operating_Systems, title = 'Operating System Column', method = 'count', legend =
'Operating System', colors = c('1' = '#03a9f4', '2' = '#0398dc',
'3' = '#0287c3', '4' = '#0276AB',
'5' = '#026592', '6' = '#02557A',
'7' = '#014462', '8' = '#013349'),
col2 = NULL)
The second operating system has the highest frequency, while 5 and 7 have the least number of records.
Browser
# Unique browser values
unique(cat.df$Browser)
## [1] 1 2 3 4 5 6 7 10 8 9 12 13 11
## Levels: 1 2 3 4 5 6 7 8 9 10 11 12 13
# Browser column bar plot
bar.plt (cat.df, Browser, title = 'Browser Column', method = 'count', legend =
'Browser', colors = c('1' = '#d9f1ff', '2' = '#bfe6ff',
'3' = '#59bfff','4' = '#03a9f4',
'5' = '#0398dc', '6' = '#0287c3',
'7' = '#0276AB', '8' = '#026592',
'9' = '#02557A', '10' = '#014462',
'11' = '#013349', '12' = '#012231',
'13' = '#001118'),
col2 = NULL)
The second browser has the highest frequency.
Region
# Unique region values
unique(cat.df$Region)
## [1] 1 9 2 3 4 5 6 7 8
## Levels: 1 2 3 4 5 6 7 8 9
# Region column bar plot
bar.plt (cat.df, Region, title = 'Region Column', method = 'count', legend =
'Region', colors = c('1' = '#0398dc', '2' = '#0287c3',
'3' = '#0276AB', '4' = '#026592',
'5' = '#02557A', '6' = '#014462',
'7' = '#013349', '8' = '#012231',
'9' = '#001118'),
col2 = NULL)
The first region has the highest frequency.
Traffic Type
# Unique traffic type values
unique(cat.df$Traffic_Type)
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 18 19 16 17 20
## Levels: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
# Traffic type column bar plot
ggplot(cat.df, aes(x = Traffic_Type, fill = Traffic_Type)) + geom_bar() +
ggtitle(paste('Traffic Type Frequency Plot')) +
theme(plot.title = element_text(hjust = 0.5))
The second traffic type has the highest frequency.
Visitor Type
# Unique visitor type values
unique(cat.df$Visitor_Type)
## [1] Returning_Visitor New_Visitor Other
## Levels: New_Visitor Other Returning_Visitor
# Visitor Type column bar plot
bar.plt (cat.df, Visitor_Type, title = 'Visitor Type Column', method = 'count',
legend = 'Visitor Type', colors = c('New_Visitor' = '#0398dc',
'Returning_Visitor' = '#026592',
'Other' = '#001118'),
col2 = NULL)
The returning visitor class has the highest frequency.
Weekend
# Unique weekend values
unique(cat.df$Weekend)
## [1] FALSE TRUE
## Levels: FALSE TRUE
# Weekend column bar plot
bar.plt (cat.df, Weekend, title = 'Weekend Column', method = 'count',
legend = 'Weekend', colors = c('FALSE' = '#0398dc',
'TRUE' = '#001118'),
col2 = NULL)
Most activity/traffic did not take place on the weekend.
Revenue
# Unique revenue values
unique(cat.df$Revenue)
## [1] FALSE TRUE
## Levels: FALSE TRUE
# Revenue column bar plot
bar.plt (cat.df, Revenue, title = 'Revenue Column', method = 'count',
legend = 'Revenue', colors = c('FALSE' = '#0398dc',
'TRUE' = '#001118'),
col2 = NULL)
A lack of revenue had a much higher frequency.
# Numerical columns
num.df <- cust[,!cat.cols]
# Mode function
mode <- function(col, data) {
unique.value <- unique(data[, col])
unique.value[which.max(tabulate(match(data[,col], unique.value)))]
}
central.tendency <- function(col, data){
cat('Measures of Central Tendency \n')
# Mean
cat('Mean = ', mean(data[, col]), '\n')
# Median
cat('Median = ', median(data[,col]), '\n')
# Mode
cat('Mode = ', mode(col, data), '\n')
}
dispersion <- function(col, data){
# Range
cat('Range = ', min(data[ ,col]), '-', max(data[ ,col]), '\n')
# IQR
cat('IQR = ', IQR(data[ ,col]), '\n')
# Variance
cat('Variance = ', var(data[ ,col]), '\n')
# Standard Deviation
cat('Standard Deviation = ', sd(data[ ,col]), '\n')
# Skewness
cat('Skewness = ', skewness(data[ ,col]), '\n')
# Kurtosis
cat('Kurtosis = ', kurtosis(data[ ,col]), '\n')
}
Administrative
# Measures of central tendency
central.tendency(names(num.df)[1], num.df)
## Measures of Central Tendency
## Mean = 2.340006
## Median = 1
## Mode = 0
The average number of administrative pages visited was ~2.
# Measures of dispersion
dispersion(names(num.df)[1], num.df)
## Range = 0 - 27
## IQR = 4
## Variance = 11.08367
## Standard Deviation = 3.329214
## Skewness = 1.947225
## Kurtosis = 7.643668
# Administrative Histogram
hist(num.df$Administrative,
main = 'Histogram of Administrative Pages',
xlab = 'Administrative')
The administrative pages were not visited most of the time.
Administrative Duration
# Measures of central tendency
central.tendency(names(num.df)[2], num.df)
## Measures of Central Tendency
## Mean = 81.68679
## Median = 9
## Mode = 0
The average administrative page visit duration was ~82. The duration is likely measured in seconds, given the page-count observation above.
# Measures of dispersion
dispersion(names(num.df)[2], num.df)
## Range = 0 - 3398.75
## IQR = 94.6
## Variance = 31484.39
## Standard Deviation = 177.4384
## Skewness = 5.593135
## Kurtosis = 53.14849
# Administrative Duration Histogram
hist(num.df$Administrative_Duration,
main = 'Histogram of Administrative Page Duration',
xlab = 'Administrative Duration')
Most customers did not visit the administrative pages.
Informational
# Measures of central tendency
central.tendency(names(num.df)[3], num.df)
## Measures of Central Tendency
## Mean = 0.5088074
## Median = 0
## Mode = 0
The average number of informational pages visited was ~0.5; most sessions visited none.
# Measures of dispersion
dispersion(names(num.df)[3], num.df)
## Range = 0 - 24
## IQR = 0
## Variance = 1.62611
## Standard Deviation = 1.27519
## Skewness = 4.015436
## Kurtosis = 29.67176
# Informational Histogram
hist(num.df$Informational,
main = 'Histogram of Informational Pages',
xlab = 'Informational')
The informational pages were rarely visited.
Informational Duration
# Measures of central tendency
central.tendency(names(num.df)[4], num.df)
## Measures of Central Tendency
## Mean = 34.84242
## Median = 0
## Mode = 0
The average informational page visit duration was ~35 seconds.
# Measures of dispersion
dispersion(names(num.df)[4], num.df)
## Range = 0 - 2549.375
## IQR = 0
## Variance = 19990.46
## Standard Deviation = 141.3876
## Skewness = 7.541253
## Kurtosis = 78.54314
# Informational Duration Histogram
hist(num.df$Informational_Duration,
main = 'Histogram of Informational Duration',
xlab = 'Informational Duration')
The informational pages were not visited in most sessions.
Product Related
# Measures of central tendency
central.tendency(names(num.df)[5], num.df)
## Measures of Central Tendency
## Mean = 32.05816
## Median = 18
## Mode = 1
The average number of product related pages visited was ~32.
# Measures of dispersion
dispersion(names(num.df)[5], num.df)
## Range = 0 - 705
## IQR = 30
## Variance = 1987.286
## Standard Deviation = 44.57899
## Skewness = 4.334283
## Kurtosis = 34.08263
# Product Related Histogram
hist(num.df$Product_Related,
main = 'Histogram of Product Related Page',
xlab = 'Product Related')
The distribution is heavily right-skewed; most sessions visited only a few product related pages.
Product Related Duration
# Measures of central tendency
central.tendency(names(num.df)[6], num.df)
## Measures of Central Tendency
## Mean = 1207.502
## Median = 611
## Mode = 0
The average product related page visit duration was ~1208 seconds (~20 minutes), the highest average duration seen so far.
# Measures of dispersion
dispersion(names(num.df)[6], num.df)
## Range = 0 - 63973.52
## IQR = 1282.4
## Variance = 3682486
## Standard Deviation = 1918.98
## Skewness = 7.25502
## Kurtosis = 139.7292
# Product Related Duration Histogram
hist(num.df$Product_Related_Duration,
main = 'Histogram of Product Related Duration',
xlab = 'Product Related Duration')
As with the other duration columns, the product related duration is heavily right-skewed.
Bounce Rates
# Measures of central tendency
central.tendency(names(num.df)[7], num.df)
## Measures of Central Tendency
## Mean = 0.02044841
## Median = 0.002941176
## Mode = 0
The average bounce rate was close to zero.
# Measures of dispersion
dispersion(names(num.df)[7], num.df)
## Range = 0 - 0.2
## IQR = 0.01666667
## Variance = 0.002059364
## Standard Deviation = 0.04538022
## Skewness = 3.154307
## Kurtosis = 12.26662
# Bounce Rate Histogram
hist(num.df$Bounce_Rates,
main = 'Histogram of the Bounce Rate',
xlab = 'Bounce Rate')
The highest frequency for the bounce rate was zero.
Exit Rate
# Measures of central tendency
central.tendency(names(num.df)[8], num.df)
## Measures of Central Tendency
## Mean = 0.04149826
## Median = 0.025
## Mode = 0.2
The average exit rate was close to zero, although the mode was 0.2.
# Measures of dispersion
dispersion(names(num.df)[8], num.df)
## Range = 0 - 0.2
## IQR = 0.03421079
## Variance = 0.0021367
## Standard Deviation = 0.04622445
## Skewness = 2.234124
## Kurtosis = 7.63145
# Exit Rate Histogram
hist(num.df$Exit_Rates,
main = 'Histogram of the Exit Rate',
xlab = 'Exit Rate')
A low exit rate of 0.01 had the highest frequency.
Page Value
# Measures of central tendency
central.tendency(names(num.df)[9], num.df)
## Measures of Central Tendency
## Mean = 5.946651
## Median = 0
## Mode = 0
The average page value is 5.95.
# Measures of dispersion
dispersion(names(num.df)[9], num.df)
## Range = 0 - 361.7637
## IQR = 0
## Variance = 347.8058
## Standard Deviation = 18.64955
## Skewness = 6.351741
## Kurtosis = 68.00151
# Page Value Histogram
hist(num.df$Page_Values,
main = 'Histogram of Page Value',
xlab = 'Page Value')
A page value of zero had the highest frequency.
Special Day
# Measures of central tendency
central.tendency(names(num.df)[10], num.df)
## Measures of Central Tendency
## Mean = 0.06191139
## Median = 0
## Mode = 0
The average value was close to zero; therefore, most visits did not occur near special days.
# Measures of dispersion
dispersion(names(num.df)[10], num.df)
## Range = 0 - 1
## IQR = 0
## Variance = 0.03984889
## Standard Deviation = 0.1996219
## Skewness = 3.286515
## Kurtosis = 12.80023
# Special Day Histogram
hist(num.df$Special_Day,
main = 'Histogram of Proximity to a Special Day',
xlab = 'Special Day')
The majority of visits were not close to a special day.
Summary
The univariate section provided an overview of the frequency distributions of the numeric columns and of the classes within the categorical variables. For a more in-depth analysis of the characteristics of the different customer groups, bivariate analysis will be done.
Revenue Vs Month
# Target columns
Month <- cat.df$Month
Revenue <- cat.df$Revenue
# Contingency table
contingency.table <- table(Revenue, Month)
contingency.table
## Month
## Revenue Feb Mar May June Jul Aug Sep Oct Nov Dec
## FALSE 179 1672 2964 256 366 357 362 434 2223 1490
## TRUE 3 192 365 29 66 76 86 115 760 216
# Mosaic plot of contingency table
mosaicplot(contingency.table, xlab='Month', ylab='Revenue',
main='Revenue Vs Month', color = '#1338BE', las = 1)
The months of November, May and December had the highest revenue frequency counts. Furthermore, May, November, March and December had the highest frequencies for a lack of revenue.
Revenue Vs Operating Systems
# Target columns
Operating_Systems <- cat.df$Operating_Systems
# Contingency table
contingency.table <- table(Revenue, Operating_Systems)
contingency.table
## Operating_Systems
## Revenue 1 2 3 4 5 6 7 8
## FALSE 2172 5387 2265 393 5 17 6 58
## TRUE 379 1155 268 85 1 2 1 17
# Mosaic plot of contingency table
mosaicplot(contingency.table, xlab='Operating Systems', ylab='Revenue',
main='Revenue Vs Operating Systems', color = '#1338BE', las = 1)
The second operating system has the highest revenue count. However, it also records the highest frequency for a lack of revenue, followed by the third, then the first.
Revenue Vs Browser
# Target columns
Browser <- cat.df$Browser
# Contingency table
contingency.table <- table(Revenue, Browser)
contingency.table
## Browser
## Revenue 1 2 3 4 5 6 7 8 9 10 11 12 13
## FALSE 2064 6663 100 601 380 154 43 114 1 131 5 7 40
## TRUE 365 1223 5 130 86 20 6 21 0 32 1 3 16
# Mosaic plot of contingency table
mosaicplot(contingency.table, xlab='Browser', ylab='Revenue',
main='Revenue Vs Browser', color = '#1338BE', las = 1)
The second browser has the highest revenue frequency, but this value is overshadowed by the lack of revenue on the same browser, which is also the highest across browsers; the first browser has the second-highest frequency for a lack of revenue.
Revenue Vs Region
# Target columns
Region <- cat.df$Region
# Contingency table
contingency.table <- table(Revenue, Region)
contingency.table
## Region
## Revenue 1 2 3 4 5 6 7 8 9
## FALSE 3944 941 2034 996 266 689 639 375 419
## TRUE 771 188 349 175 52 112 119 56 86
# Mosaic plot of contingency table
mosaicplot(contingency.table, xlab='Region', ylab='Revenue',
main='Revenue Vs Region', color = '#1338BE', las = 1)
The first and third regions had the highest revenue frequencies. They also had the highest frequencies for no revenue. A contributing factor could be the populations of the regions.
Revenue Vs Traffic Type
# Target columns
Traffic_Type <- cat.df$Traffic_Type
# Contingency table
contingency.table <- table(Revenue, Traffic_Type)
contingency.table
## Traffic_Type
## Revenue 1 2 3 4 5 6 7 8 9 10 11 12 13 14
## FALSE 2128 3062 1838 901 204 390 28 248 37 360 200 1 686 11
## TRUE 262 847 180 165 56 53 12 95 4 90 47 0 43 2
## Traffic_Type
## Revenue 15 16 17 18 19 20
## FALSE 37 2 1 10 16 143
## TRUE 0 1 0 0 1 50
# Mosaic plot of contingency table
mosaicplot(contingency.table, xlab='Traffic_Type', ylab='Revenue',
main='Revenue Vs Traffic Type', color = '#1338BE', las = 1)
The second traffic type had the highest revenue frequency by a large margin. It also has the highest frequency for a lack of revenue, followed by the first and third types. Most traffic types recorded at least some revenue.
Revenue Vs Visitor Type
# Target columns
Visitor_Type <- cat.df$Visitor_Type
# Contingency table
contingency.table <- table(Revenue, Visitor_Type)
contingency.table
## Visitor_Type
## Revenue New_Visitor Other Returning_Visitor
## FALSE 1271 65 8967
## TRUE 422 16 1470
# Mosaic plot of contingency table
mosaicplot(contingency.table, xlab='Visitor_Type', ylab='Revenue',
main='Revenue Vs Visitor Type', color = '#1338BE', las = 1)
Returning visitors produced the highest revenue frequency counts. However, this total is lower than the number of returning-visitor sessions without revenue.
Revenue Vs Weekend
# Target columns
Weekend <- cat.df$Weekend
# Contingency table
contingency.table <- table(Revenue, Weekend)
contingency.table
## Weekend
## Revenue FALSE TRUE
## FALSE 7943 2360
## TRUE 1409 499
# Mosaic plot of contingency table
mosaicplot(contingency.table, xlab='Weekend', ylab='Revenue',
main='Revenue Vs Weekend', color = '#1338BE', las = 1)
Weekdays had the highest revenue frequencies, as well as the highest frequencies for a lack of revenue.
Function Definition
# Correlation
correlation <- function(cor1, cor2, data){
correlation <- cor(data[,c(cor1)], data[, c(cor2)])
print(paste0('Correlation = ', round(correlation, 2), '.'))
}
# Scatter plot and correlation function
scatter.plt <- function(col1, col2, data, title){
# Scatter plot
ggplot(data, aes(x = {{col1}}, y= {{col2}})) +
geom_point(color = '#281E5D') + ggtitle(paste(title, 'Scatter Plot')) + theme(plot.title = element_text(hjust = 0.5))
}
Bounce Rates Vs Product Related
# Scatter plot and correlation
scatter.plt(col1 = Bounce_Rates, col2 = Product_Related, data = num.df,
            title = 'Bounce Rates Vs Product Related')
# Correlation
correlation('Product_Related', 'Bounce_Rates', num.df)
## [1] "Correlation = -0.19."
The correlation between the bounce rates and the product related pages is very weak and negative.
Exit Rates Vs Product Related
# Scatter plot and correlation
scatter.plt(col1 = Exit_Rates, col2 = Product_Related, data = num.df,
            title = 'Exit Rates Vs Product Related')
# Correlation
correlation('Product_Related', 'Exit_Rates', num.df)
## [1] "Correlation = -0.29."
The correlation between the exit rates and the product related pages is also very weak and negative, but slightly higher than that of the previous observation.
Product Related Vs Page Values
# Scatter plot and correlation
scatter.plt(col1 = Product_Related, col2 = Page_Values, data = num.df,
            title = 'Product Related Vs Page Values')
# Correlation
correlation('Product_Related', 'Page_Values', num.df)
## [1] "Correlation = 0.05."
The correlation between the product related page and the page values is very weak, but positive.
All of the columns in this section have very weak correlations with each other; therefore, a correlation matrix in the multivariate section will be used to analyze the correlations of all the numeric columns.
Function Definition
# Bar plot for averaged y axis
bar.plt.summary <- function(data, col1, col2, title, legend, colors){
ggplot(data, mapping=aes(x= {{col1}}, y= {{col2}}, fill = {{col1}})) +
stat_summary(fun=median, geom="bar") + ggtitle(paste(title, 'Bar Plot')) +
theme(plot.title = element_text(hjust = 0.5))+
scale_fill_manual(legend, values = colors)
}
All numeric columns are positively skewed; therefore, the median will be used.
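This claim can be verified in one pass with the skewness function from the already-loaded moments package:
# Skewness of every numeric column; all ten values are positive, matching the
# dispersion summaries above
col.skew <- sapply(num.df, skewness)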
Revenue Vs Administrative
# Bar plot
bar.plt.summary(cust, Revenue, Administrative,'Revenue Vs Administrative Pages',
'Revenue',
c("FALSE" = '#757C88', "TRUE" = "#1338BE"))
The more administrative pages visited, the higher the revenue; the median number of pages for revenue-generating sessions was 2.
Revenue Vs Administrative Duration
# Bar plot
bar.plt.summary(cust, Revenue, Administrative_Duration,
'Revenue Vs Administrative Duration',
'Revenue',
c("FALSE" = '#757C88', "TRUE" = "#1338BE"))
The median time spent on the administrative pages in revenue-generating sessions was ~52 seconds.
Revenue Vs Informational Pages
# Bar plot
bar.plt.summary(cust, Revenue, Informational,'Revenue Vs Informational Pages',
'Revenue',
c("FALSE" = '#757C88', "TRUE" = "#1338BE"))
Visits to the informational pages did not contribute to the brand's revenue.
Revenue Vs Informational Duration
# Bar plot
bar.plt.summary(cust, Revenue, Informational_Duration,
'Revenue Vs Informational Duration',
'Revenue',
c("FALSE" = '#757C88', "TRUE" = "#1338BE"))
There was little to no traffic to the informational pages; therefore, they did not contribute to the brand's revenue.
Product Related
# Bar plot
bar.plt.summary(cust, Revenue, Product_Related,
'Revenue Vs Product Related Pages',
'Revenue',
c("FALSE" = '#757C88', "TRUE" = "#1338BE"))
The product related pages had a higher number of visits than the previous pages, leading to a higher contribution to the revenue; the median number of pages visited was 38.
Revenue Vs Product Related Duration
# Bar plot
bar.plt.summary(cust, Revenue, Product_Related_Duration,
'Revenue Vs Product Related Duration',
'Revenue',
c("FALSE" = '#757C88', "TRUE" = "#1338BE"))
Longer time spent on the product related pages was associated with completed transactions and revenue; the median duration for revenue-generating sessions was ~20 minutes.
Revenue Vs Bounce Rates
# Bar plot
bar.plt.summary(cust, Revenue, Bounce_Rates,'Revenue Vs Bounce Rates',
'Revenue',
c("FALSE" = '#757C88', "TRUE" = "#1338BE"))
Bounce rates did not contribute to the revenue. The median bounce rate was 0.004.
Revenue Vs Exit Rates
# Bar plot
bar.plt.summary(cust, Revenue, Exit_Rates,'Revenue Vs Exit Rates',
'Revenue',
c("FALSE" = '#757C88', "TRUE" = "#1338BE"))
The higher the exit rate, the lower the revenue. The exit rate median of 0.029 was associated with no revenue, while lower rates of ~0.025 contributed to the revenue.
Revenue Vs Page Values
# Bar plot
bar.plt.summary(cust, Revenue, Page_Values,'Revenue Vs Page Values',
'Revenue',
c("FALSE" = '#757C88', "TRUE" = "#1338BE"))
The median page value for sessions that completed an e-commerce transaction, contributing to the revenue, was ~14.
Revenue Vs Special Day
# Bar plot
bar.plt.summary(cust, Revenue, Special_Day,'Revenue Vs Special Day',
'Revenue',
c("FALSE" = '#757C88', "TRUE" = "#1338BE"))
The special day feature had no contribution to the revenue.
Summary
The bivariate analysis provided a more in-depth view of the different customer groups. For an even deeper analysis, multivariate analysis will be done.
Correlation Matrix
# Visualize correlation matrix
# Reducing the column name sizes for efficient utilization of the graphing
# space.
num.corr <- data.frame(num.df)
colnames(num.corr) <- c('Admin', 'Admin.Dur', 'Info', 'Info.Dur', 'Prod.Rel',
'Prod.Rel.Dur', 'Bounce.Rts', 'Exit.Rts', 'Pg.Vals',
'Spec.Day')
ggcorrplot(cor(num.corr), lab = TRUE, hc.order = TRUE,
title = 'Correlation Heatmap',
colors = c('#022D36', 'white', '#48AAAD'))
The exit and bounce rates, and the product related pages and product related duration, have the highest positive correlations.
# Multivariate scatter plot (braces added for clarity; behavior unchanged)
scatter.plt.multi <- function(data, col1, col2, col3, legend, colors){
  ggplot(data, aes(x = {{col1}}, y = {{col2}}, color = {{col3}},
                   shape = {{col3}})) + geom_point() + ggtitle('Scatter Plot') +
    theme(plot.title = element_text(hjust = 0.5)) +
    scale_color_manual(values = colors)
}
Revenue Vs Bounce Rate Vs Exit Rate
# Function call (the full cust data frame is passed so that Revenue is found
# within the data rather than the global environment)
scatter.plt.multi(cust, Bounce_Rates, Exit_Rates, Revenue,
'Revenue', colors = c('black', 'red'))
Lower exit and bounce rates lead to a higher number of transactions.
Revenue Vs Product Related Pages Vs Product Related Page Duration
# Function call (cust passed so Revenue is found in the data)
scatter.plt.multi(cust, Product_Related, Product_Related_Duration, Revenue,
'Revenue', colors = c('black', 'red'))
Most users spend a small amount of time on the product related pages. There is no glaring distinction between customers who do and do not make transactions.
Revenue Vs Product Related Pages Vs Page Values
# Function call (cust passed so Revenue is found in the data)
scatter.plt.multi(cust, Product_Related, Page_Values, Revenue,
'Revenue', colors = c('black', 'red'))
The lower the number of product related pages, as well as the page values, the higher the number of transactions leading to attainment of revenue by the brand.
Summary
This section has provided a more in-depth view of how the various factors are related. To increase the quality of insights, more features should be explored and more data collected, to provide better and more specific insights for each customer group.
# preview of the data
head(cust)
## Administrative Administrative_Duration Informational Informational_Duration
## 1 0 0 0 0
## 2 0 0 0 0
## 3 0 1 0 1
## 4 0 0 0 0
## 5 0 0 0 0
## 6 0 0 0 0
## Product_Related Product_Related_Duration Bounce_Rates Exit_Rates Page_Values
## 1 1 0.000000 0.20000000 0.2000000 0
## 2 2 64.000000 0.00000000 0.1000000 0
## 3 1 1.000000 0.20000000 0.2000000 0
## 4 2 2.666667 0.05000000 0.1400000 0
## 5 10 627.500000 0.02000000 0.0500000 0
## 6 19 154.216667 0.01578947 0.0245614 0
## Special_Day Month Operating_Systems Browser Region Traffic_Type
## 1 0 Feb 1 1 1 1
## 2 0 Feb 2 2 1 2
## 3 0 Feb 4 1 9 3
## 4 0 Feb 3 2 2 4
## 5 0 Feb 3 3 1 4
## 6 0 Feb 2 2 1 3
## Visitor_Type Weekend Revenue
## 1 Returning_Visitor FALSE FALSE
## 2 Returning_Visitor FALSE FALSE
## 3 Returning_Visitor FALSE FALSE
## 4 Returning_Visitor FALSE FALSE
## 5 Returning_Visitor TRUE FALSE
## 6 Returning_Visitor FALSE FALSE
K-Means clustering is an unsupervised non-linear algorithm that clusters data based on similarity. Each cluster is represented by a centroid, which corresponds to the mean of the points assigned to the cluster. The total within-cluster variation is calculated using distance metrics such as the Euclidean distance. Points are assigned to the cluster with the minimum distance to its centroid (datanovia).
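For reference, K-Means minimizes the total within-cluster sum of squares, W = Σ_k Σ_{x ∈ C_k} ||x - μ_k||², where μ_k is the centroid (mean) of cluster C_k.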
These distance metrics only apply to numeric columns, so K-Means isn't directly applicable to categorical columns, which are discrete and lack a natural origin (they have no numerical meaning) (datascience stackexchange), (IBM). KModes is therefore used for categorical columns (Analytics Vidhya).
Gower distance is a metric that measures the dissimilarity of two items with mixed numeric and non-numeric data (link). This metric can therefore be used for clustering mixed datasets.
The kmeans (stats package) function has cluster centers defined by the Euclidean distance, while Kmeans (amap package) supports more distance metrics, though all are numeric distances. Neither supports custom distance metrics. Partitioning around medoids (PAM) supports custom metrics and can be used as an alternative.
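For reference, a minimal sketch of PAM on a Gower dissimilarity, using daisy and pam from the already-loaded cluster package (left unrun here, as PAM is slow on ~12,000 observations and this route is not pursued below; k = 3 is illustrative):
# gower.dist <- daisy(cust, metric = 'gower')     # dissimilarities for mixed columns
# pam.fit <- pam(gower.dist, k = 3, diss = TRUE)  # cluster on the dissimilarity matrix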
As per the observations made above, only the numeric columns will be used for K-Means.
Base Model
# Column ranges
summary(num_df)
## Administrative Administrative_Duration Informational
## Min. : 0.00 Min. : 0.00 Min. : 0.0000
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.0000
## Median : 1.00 Median : 9.00 Median : 0.0000
## Mean : 2.34 Mean : 81.69 Mean : 0.5088
## 3rd Qu.: 4.00 3rd Qu.: 94.60 3rd Qu.: 0.0000
## Max. :27.00 Max. :3398.75 Max. :24.0000
## Informational_Duration Product_Related Product_Related_Duration
## Min. : 0.00 Min. : 0.00 Min. : 0
## 1st Qu.: 0.00 1st Qu.: 8.00 1st Qu.: 194
## Median : 0.00 Median : 18.00 Median : 611
## Mean : 34.84 Mean : 32.06 Mean : 1208
## 3rd Qu.: 0.00 3rd Qu.: 38.00 3rd Qu.: 1476
## Max. :2549.38 Max. :705.00 Max. :63974
## Bounce_Rates Exit_Rates Page_Values Special_Day
## Min. :0.000000 Min. :0.00000 Min. : 0.000 Min. :0.00000
## 1st Qu.:0.000000 1st Qu.:0.01426 1st Qu.: 0.000 1st Qu.:0.00000
## Median :0.002941 Median :0.02500 Median : 0.000 Median :0.00000
## Mean :0.020448 Mean :0.04150 Mean : 5.947 Mean :0.06191
## 3rd Qu.:0.016667 3rd Qu.:0.04847 3rd Qu.: 0.000 3rd Qu.:0.00000
## Max. :0.200000 Max. :0.20000 Max. :361.764 Max. :1.00000
The columns have differing ranges; therefore, standard scaling will be used so that columns with larger ranges do not dominate the distance calculations.
Standard Scaling
# Standard scaling
# Original dataset
cust.num.scaled <- as.data.frame(scale(num_df))
head(cust.num.scaled)
## Administrative Administrative_Duration Informational Informational_Duration
## 1 -0.7028704 -0.4603670 -0.3990051 -0.2464319
## 2 -0.7028704 -0.4603670 -0.3990051 -0.2464319
## 3 -0.7028704 -0.4547312 -0.3990051 -0.2393591
## 4 -0.7028704 -0.4603670 -0.3990051 -0.2464319
## 5 -0.7028704 -0.4603670 -0.3990051 -0.2464319
## 6 -0.7028704 -0.4603670 -0.3990051 -0.2464319
## Product_Related Product_Related_Duration Bounce_Rates Exit_Rates Page_Values
## 1 -0.6966995 -0.6292416 3.956604814 3.4289586 -0.3188629
## 2 -0.6742674 -0.5958906 -0.450601876 1.2656015 -0.3188629
## 3 -0.6966995 -0.6287205 3.956604814 3.4289586 -0.3188629
## 4 -0.6742674 -0.6278520 0.651199796 2.1309443 -0.3188629
## 5 -0.4948106 -0.3022450 -0.009881207 0.1839229 -0.3188629
## 6 -0.2929218 -0.5488778 -0.102664499 -0.3664048 -0.3188629
## Special_Day
## 1 -0.3101433
## 2 -0.3101433
## 3 -0.3101433
## 4 -0.3101433
## 5 -0.3101433
## 6 -0.3101433
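For clarity, scale() standardizes each column to zero mean and unit variance; a manual equivalent for a single column, shown for illustration:
# Identical to cust.num.scaled$Administrative (sd() uses the same n - 1
# denominator as scale())
admin.z <- (num_df$Administrative - mean(num_df$Administrative)) / sd(num_df$Administrative)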
# Replacing columns in the original dataset
cust.scaled <- data.frame(cust)
cust.scaled[,names(cust.num.scaled)] <- cust.num.scaled
head(cust.scaled)
## Administrative Administrative_Duration Informational Informational_Duration
## 1 -0.7028704 -0.4603670 -0.3990051 -0.2464319
## 2 -0.7028704 -0.4603670 -0.3990051 -0.2464319
## 3 -0.7028704 -0.4547312 -0.3990051 -0.2393591
## 4 -0.7028704 -0.4603670 -0.3990051 -0.2464319
## 5 -0.7028704 -0.4603670 -0.3990051 -0.2464319
## 6 -0.7028704 -0.4603670 -0.3990051 -0.2464319
## Product_Related Product_Related_Duration Bounce_Rates Exit_Rates Page_Values
## 1 -0.6966995 -0.6292416 3.956604814 3.4289586 -0.3188629
## 2 -0.6742674 -0.5958906 -0.450601876 1.2656015 -0.3188629
## 3 -0.6966995 -0.6287205 3.956604814 3.4289586 -0.3188629
## 4 -0.6742674 -0.6278520 0.651199796 2.1309443 -0.3188629
## 5 -0.4948106 -0.3022450 -0.009881207 0.1839229 -0.3188629
## 6 -0.2929218 -0.5488778 -0.102664499 -0.3664048 -0.3188629
## Special_Day Month Operating_Systems Browser Region Traffic_Type
## 1 -0.3101433 Feb 1 1 1 1
## 2 -0.3101433 Feb 2 2 1 2
## 3 -0.3101433 Feb 4 1 9 3
## 4 -0.3101433 Feb 3 2 2 4
## 5 -0.3101433 Feb 3 3 1 4
## 6 -0.3101433 Feb 2 2 1 3
## Visitor_Type Weekend Revenue
## 1 Returning_Visitor FALSE FALSE
## 2 Returning_Visitor FALSE FALSE
## 3 Returning_Visitor FALSE FALSE
## 4 Returning_Visitor FALSE FALSE
## 5 Returning_Visitor TRUE FALSE
## 6 Returning_Visitor FALSE FALSE
# Dataset with outliers removed
num.cols2 <- unlist(lapply(cust.no, is.numeric))
num.df2 <- cust.no[,num.cols2]
cust.no.scaled <- as.data.frame(scale(num.df2))
head(cust.no.scaled)
## Administrative Administrative_Duration Informational Informational_Duration
## 2 -0.6447929 -0.6052036 NaN NaN
## 5 -0.6447929 -0.6052036 NaN NaN
## 6 -0.6447929 -0.6052036 NaN NaN
## 11 -0.6447929 -0.6052036 NaN NaN
## 13 -0.6447929 -0.6052036 NaN NaN
## 14 -0.6447929 -0.6052036 NaN NaN
## Product_Related Product_Related_Duration Bounce_Rates Exit_Rates Page_Values
## 2 -1.0985386 -0.9595217 -0.5969676 2.5851188 NaN
## 5 -0.4491690 0.2977482 0.8990293 0.6012199 NaN
## 6 0.2813717 -0.7582321 0.5840826 -0.4081321 NaN
## 11 -1.0173674 -0.2210011 -0.5969676 1.2625196 NaN
## 13 -0.6926826 -0.4764712 -0.5969676 -0.2490224 NaN
## 14 -0.7738538 -0.8836616 -0.5969676 1.2625196 NaN
## Special_Day
## 2 NaN
## 5 NaN
## 6 NaN
## 11 NaN
## 13 NaN
## 14 NaN
Standardizing the informational, informational duration, page values and special day columns produces NaN values because these columns became constant (zero standard deviation) after outlier removal; therefore, a different scaling method will be used.
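The zero-variance cause can be confirmed directly, since a constant column standardizes to 0/0 = NaN:
# Informational, Informational_Duration, Page_Values and Special_Day have sd = 0
col.sds <- sapply(num.df2, sd)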
Min-Max Scaling
# Dataset with outliers removed
process <- preProcess(as.data.frame(num.df2), method = c('range'))
norm.scaled <- predict(process, as.data.frame(num.df2))
head(norm.scaled)
## Administrative Administrative_Duration Informational Informational_Duration
## 2 0 0 0 0
## 5 0 0 0 0
## 6 0 0 0 0
## 11 0 0 0 0
## 13 0 0 0 0
## 14 0 0 0 0
## Product_Related Product_Related_Duration Bounce_Rates Exit_Rates Page_Values
## 2 0.03278689 0.03267251 0.0000000 1.0000000 0
## 5 0.16393443 0.32034374 0.3230769 0.5000000 0
## 6 0.31147541 0.07872884 0.2550607 0.2456140 0
## 11 0.04918033 0.20165064 0.0000000 0.6666667 0
## 13 0.11475410 0.14319748 0.0000000 0.2857143 0
## 14 0.09836066 0.05002978 0.0000000 0.6666667 0
## Special_Day
## 2 0
## 5 0
## 6 0
## 11 0
## 13 0
## 14 0
# Replacing columns in the original dataset
cust.no.scaled <- data.frame(cust.no)
cust.no.scaled[,names(norm.scaled)] <- norm.scaled
head(cust.no.scaled)
## Administrative Administrative_Duration Informational Informational_Duration
## 2 0 0 0 0
## 5 0 0 0 0
## 6 0 0 0 0
## 11 0 0 0 0
## 13 0 0 0 0
## 14 0 0 0 0
## Product_Related Product_Related_Duration Bounce_Rates Exit_Rates Page_Values
## 2 0.03278689 0.03267251 0.0000000 1.0000000 0
## 5 0.16393443 0.32034374 0.3230769 0.5000000 0
## 6 0.31147541 0.07872884 0.2550607 0.2456140 0
## 11 0.04918033 0.20165064 0.0000000 0.6666667 0
## 13 0.11475410 0.14319748 0.0000000 0.2857143 0
## 14 0.09836066 0.05002978 0.0000000 0.6666667 0
## Special_Day Month Operating_Systems Browser Region Traffic_Type
## 2 0 Feb 2 2 1 2
## 5 0 Feb 3 3 1 4
## 6 0 Feb 2 2 1 3
## 11 0 Feb 1 1 3 3
## 13 0 Feb 1 1 1 3
## 14 0 Feb 2 5 1 3
## Visitor_Type Weekend Revenue
## 2 Returning_Visitor FALSE FALSE
## 5 Returning_Visitor TRUE FALSE
## 6 Returning_Visitor FALSE FALSE
## 11 Returning_Visitor FALSE FALSE
## 13 Returning_Visitor FALSE FALSE
## 14 Returning_Visitor FALSE FALSE
# Model before optimization
# Seed for reproducible results
set.seed(0)
# Kmeans models
base.km2 <- kmeans(cust.num.scaled, centers = 2, nstart = 25)
base.km3 <- kmeans(cust.num.scaled, centers = 3, nstart = 25)
base.km4 <- kmeans(cust.num.scaled, centers = 4, nstart = 25)
base.km5 <- kmeans(cust.num.scaled, centers = 5, nstart = 25)
# Visualizations
par(mfrow = c(2,2))
plt1 <- fviz_cluster(base.km2, geom = "point", data = cust.num.scaled) + ggtitle(" K = 2")
plt2 <- fviz_cluster(base.km3, geom = "point", data = cust.num.scaled) + ggtitle(" K = 3")
plt3 <- fviz_cluster(base.km4, geom = "point", data = cust.num.scaled)+ ggtitle(" K = 4")
plt4 <- fviz_cluster(base.km5, geom = "point", data = cust.num.scaled)+ ggtitle(" K = 5")
plot(plt1)
plot(plt2)
plot(plt3)
plot(plt4)
For k = 2, the second cluster contains the majority of the outliers. k = 2 and k = 3 show the clearest cluster distinctions.
Optimal Number of Clusters (standard-scaled data)
1. Elbow Method
# Elbow plot
fviz_nbclust(x = cust.num.scaled, FUNcluster = kmeans, method = 'wss') +
labs(title = 'Elbow Method')
2. Silhouette Method
# Silhouette Method plot
fviz_nbclust(x = cust.num.scaled, FUNcluster = kmeans, method = 'silhouette') +
labs(title = 'Silhouette Method')
3. Gap Statistic Method
# Seed value
set.seed(0)
# Gap statistic
# Setting max iterations
Kmeans.func <- function(x,k) list(cluster=kmeans(x, k, iter.max=50))
# Visualization
fviz_nbclust(cust.num.scaled, FUNcluster=Kmeans.func, method="gap_stat",
verbose = FALSE) +
labs(title = 'Gap Statistic Method')
The optimal k value = 3.
Optimal Number of Clusters (outliers removed, min-max scaled)
1. Elbow Method
# Elbow plot
fviz_nbclust(x = norm.scaled, FUNcluster = kmeans, method = 'wss') +
labs(title = 'Elbow Method')
2. Silhouette Method
# Silhouette Method plot
fviz_nbclust(x = norm.scaled, FUNcluster = kmeans, method = 'silhouette') +
labs(title = 'Silhouette Method')
3. Gap Statistic Method
# Seed value
set.seed(0)
# Gap statistic
# Setting max iterations
Kmeans.func <- function(x,k) list(cluster=kmeans(x, k, iter.max=50))
# Visualization
fviz_nbclust(norm.scaled, FUNcluster=Kmeans.func, method="gap_stat",
verbose = FALSE) +
labs(title = 'Gap Statistic Method')
The optimal k value is 4. This dataset will not be used, as a significant portion of the data (~59% of the rows) was lost when outliers were removed.
Applying PCA to the entire dataset.
Encoding
# Dataset copy
cust.encoded <- data.frame(cust.scaled)
# For PCA all columns need to be numeric
cust.encoded$Month <- as.integer(cust.encoded$Month)
cust.encoded$Visitor_Type <- as.integer(cust.encoded$Visitor_Type)
cust.encoded$Weekend <- as.integer(cust.encoded$Weekend)
cust.encoded$Operating_Systems <- as.integer(cust.encoded$Operating_Systems)
cust.encoded$Region <- as.integer(cust.encoded$Region)
cust.encoded$Traffic_Type <- as.integer(cust.encoded$Traffic_Type)
cust.encoded$Browser <- as.integer(cust.encoded$Browser)
head(cust.encoded)
## Administrative Administrative_Duration Informational Informational_Duration
## 1 -0.7028704 -0.4603670 -0.3990051 -0.2464319
## 2 -0.7028704 -0.4603670 -0.3990051 -0.2464319
## 3 -0.7028704 -0.4547312 -0.3990051 -0.2393591
## 4 -0.7028704 -0.4603670 -0.3990051 -0.2464319
## 5 -0.7028704 -0.4603670 -0.3990051 -0.2464319
## 6 -0.7028704 -0.4603670 -0.3990051 -0.2464319
## Product_Related Product_Related_Duration Bounce_Rates Exit_Rates Page_Values
## 1 -0.6966995 -0.6292416 3.956604814 3.4289586 -0.3188629
## 2 -0.6742674 -0.5958906 -0.450601876 1.2656015 -0.3188629
## 3 -0.6966995 -0.6287205 3.956604814 3.4289586 -0.3188629
## 4 -0.6742674 -0.6278520 0.651199796 2.1309443 -0.3188629
## 5 -0.4948106 -0.3022450 -0.009881207 0.1839229 -0.3188629
## 6 -0.2929218 -0.5488778 -0.102664499 -0.3664048 -0.3188629
## Special_Day Month Operating_Systems Browser Region Traffic_Type Visitor_Type
## 1 -0.3101433 1 1 1 1 1 3
## 2 -0.3101433 1 2 2 1 2 3
## 3 -0.3101433 1 4 1 9 3 3
## 4 -0.3101433 1 3 2 2 4 3
## 5 -0.3101433 1 3 3 1 4 3
## 6 -0.3101433 1 2 2 1 3 3
## Weekend Revenue
## 1 1 FALSE
## 2 1 FALSE
## 3 1 FALSE
## 4 1 FALSE
## 5 2 FALSE
## 6 1 FALSE
# Features
X <- cust.encoded[,-ncol(cust.encoded)]
# Label
y <- cust.encoded[,ncol(cust.encoded)]
# PCA
pca <- prcomp(X)
# Scree Plot
fviz_eig(pca)
# Summary
summary(pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 4.0364 3.1929 2.4097 1.79928 1.69865 1.30384 1.03724
## Proportion of Variance 0.3517 0.2200 0.1253 0.06988 0.06228 0.03669 0.02322
## Cumulative Proportion 0.3517 0.5717 0.6970 0.76692 0.82920 0.86590 0.88912
## PC8 PC9 PC10 PC11 PC12 PC13 PC14
## Standard deviation 1.00382 0.96764 0.92462 0.86525 0.65752 0.64784 0.59312
## Proportion of Variance 0.02175 0.02021 0.01845 0.01616 0.00933 0.00906 0.00759
## Cumulative Proportion 0.91087 0.93108 0.94953 0.96569 0.97503 0.98408 0.99168
## PC15 PC16 PC17
## Standard deviation 0.42214 0.35148 0.28956
## Proportion of Variance 0.00385 0.00267 0.00181
## Cumulative Proportion 0.99552 0.99819 1.00000
# Up to PC11, the cumulative proportion is 0.96569 (>95%); therefore,
# the remaining components can be excluded.
pca.select <- as.data.frame(-pca$x[,1:11])
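The same cutoff can be derived programmatically from the component variances, for illustration:
# Cumulative proportion of variance; which() finds the first number of PCs
# reaching 95% (11 here, matching the summary above)
cum.var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
n.pc <- which(cum.var >= 0.95)[1]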
# Visualization - Elbow method
fviz_nbclust(pca.select, kmeans, method = 'wss')
# Silhouette Method
fviz_nbclust(pca.select, kmeans, method = 'silhouette')
# Seed value
set.seed(0)
# Gap statistic
# Setting max iterations
Kmeans.func <- function(x,k) list(cluster=kmeans(x, k, iter.max=50))
# Visualization
fviz_nbclust(pca.select, FUNcluster=Kmeans.func, method="gap_stat",
verbose = FALSE) +
labs(title = 'Gap Statistic Method')
Using PCA to transform the entire dataset before clustering gives an optimal number of clusters of 3.
K = 2
# Seed for reproducible results
set.seed(0)
# Kmeans model
base.pca2 <- kmeans(pca.select, centers = 2, nstart = 25)
# Visualization
fviz_cluster(base.pca2, data = pca.select)
The clusters overlap. Therefore, the predictions will be compared to the actual label.
# Predicted clusters
y_pred <- base.pca2$cluster
# Frequency
table(y_pred)
## y_pred
## 1 2
## 2099 10112
# Frequency
table(cust$Revenue)
##
## FALSE TRUE
## 10303 1908
From the frequency distributions, cluster 2 appears to represent the FALSE category and cluster 1 the TRUE category. Under that assumption, the number of FP = 191 and FN = 191 (2099 - 1908), which yields the precision and recall computed below.
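A direct cross-tabulation of the predicted clusters against the actual label would test this correspondence without relying on the marginal counts alone:
# Rows = predicted cluster, columns = actual Revenue label
cluster.vs.revenue <- table(Cluster = y_pred, Revenue = cust$Revenue)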
precision <- 1908/(1908+191)
recall <- 1908/(1908+191)
f1.score <- 2*((precision*recall)/(precision+recall))
accuracy <- ((1908 + 10112)/(1908+191+10112+191))*100
print(paste0('Accuracy = ', round(accuracy, 2), '%'))
## [1] "Accuracy = 96.92%"
cat('Precision = ', round(precision,2),'\n')
## Precision = 0.91
cat('Recall = ', round(recall,2), '\n')
## Recall = 0.91
cat('F1 Score = ', round(f1.score,2), '\n')
## F1 Score = 0.91
The clustering achieves high performance scores.
K = 3
# Seed for reproducible results
set.seed(0)
# Kmeans model (named base.pca3 to match k = 3)
base.pca3 <- kmeans(pca.select, centers = 3, nstart = 25)
# Visualization
fviz_cluster(base.pca3, data = pca.select)
Though outliers are present, the clusters do not conform to clearly separated positions.
As per the observed optimal values of k, customers cannot be clustered by revenue alone, as more customer groupings exist. Domain knowledge is required to infer their significance.
Hierarchical clustering has the advantage that custom distance metrics can be used; Gower distance will therefore be applied, allowing the entire dataset to be used as is.
Original Dataset
Choosing the best linkage method
# Features
X2 <- cust.scaled[,-ncol(cust.scaled)]
y <- cust.scaled[,ncol(cust.scaled)]
Computing the agglomerative coefficient for every linkage method is time-consuming; therefore, three linkage methods will be compared visually.
# Gower distance
gower <- daisy(X2, metric = c("gower"))
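For reference, the agglomerative coefficient mentioned above could be computed with agnes from the cluster package (left unrun here because of its cost on ~12,000 rows; values closer to 1 indicate stronger clustering structure):
# ward.ac <- agnes(gower, diss = TRUE, method = 'ward')$ac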
# Linkage = ward.d2
cluster <- hclust(gower, method = 'ward.D2')
# Dendrogram
plot(cluster, main = 'Dendrogram')
# Linkage = complete
cluster <- hclust(gower, method = 'complete')
# Dendrogram
plot(cluster, main = 'Dendrogram')
# Linkage = average
cluster <- hclust(gower, method = 'average')
# Dendrogram
plot(cluster, main = 'Dendrogram')
The Ward linkage method provides the best clustering results.
Pruning
The common methods for determining the optimal k will not be used, as they would introduce NAs by coercion; instead, k will be selected from the dendrogram.
# Selecting k from the dendrogram
cluster <- hclust(gower, method = 'ward.D2')
# Dendrogram
plot(cluster, labels=FALSE)
rect.hclust(cluster, k=8, border="red")
The optimal k = 8.
# Pruning
cluster.cut <- cutree(cluster, 8)
# Appending the column
X2 <- cust.scaled[,-ncol(cust.scaled)]
Clusters <- as.factor(cluster.cut)
X2 <- cbind(X2, Clusters)
head(X2[,c('Clusters')])
## [1] 1 2 1 3 4 2
## Levels: 1 2 3 4 5 6 7 8
# Visualization
ggplot(X2, aes(x=Product_Related, y=Product_Related_Duration, color = Clusters)) + geom_point()
From the hierarchical clustering procedure, more clusters have been derived from the data. This shows that the customer base separates into more groups than just the presence or absence of revenue. Domain knowledge should be used to interpret these groups. Furthermore, there are 8 categorical variables in the dataset and k = 8; domain knowledge is required to determine whether this is a coincidence or whether the number of categorical columns affects the number of clusters.
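To support that interpretation, each cluster can be profiled with summary statistics; an illustrative sketch (the chosen columns are examples, and the values here are in standardized units):
# Median (scaled) page value and product related duration per cluster
cluster.profile <- aggregate(cbind(Page_Values, Product_Related_Duration) ~ Clusters,
                             data = X2, FUN = median)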
From this study, the following observations were made:
Univariate analysis:
a) Categorical
b) Numerical
Bivariate analysis:
a) Categorical-Categorical
b) Numeric-Numeric
All the numeric column pairs tested had very weak correlations with each other; a correlation matrix was therefore plotted in the multivariate section for a more in-depth and efficient analysis.
c) Numeric-Categorical
This shows that customers can be grouped further, without being constrained to users who either do or do not contribute to the brand's revenue. Hierarchical clustering provides more granular clusters, while KMeans provides more generalized clustering. To infer the significance of the groups, the sales and marketing team need to apply their domain knowledge for more accurate insights into each group, in both cluster types.
Various techniques were implemented during the modelling procedure to obtain the optimal number of clusters.
As KMeans is only valid for numerical data, the model was first trained on the numeric columns only, leading to k = 3.
For the dataset with the outliers removed, k = 4. This dataset was excluded, as roughly 60% of the rows were lost when outliers were removed.
PCA was used to transform the entire dataset into a numeric form so that all variables could be used. The result was k = 3. This model was selected because all variables were included in the modelling process.
For a more generalized grouping of customers, the KMeans clusters can be used; for a more in-depth classification, the hierarchical clusters can be used.
In conclusion, the following customer groups contributed to the brand’s revenue:
Furthermore, the product related pages were of the most importance to these users.
Finally, for the modelling section, it was discovered that more customer groups exist, possibly beyond those relating to revenue contribution. The KMeans approach had k = 3, while hierarchical clustering had k = 8, showing that the latter model provides more granular insights into possible customer segmentation.
Yes, as meaningful insights on customer groups were derived from the data. Furthermore, training the KMeans and hierarchical models on the data led to the formation of meaningful customer group clusters.
Yes, more granular (more relevant features) and complete (including all months) data is necessary to derive more insights about the brand's customer groups. The data would also be used to confirm the validity of the clusters attained during the modelling process.
Yes, as the aim of the project is to identify the characteristics of the brand’s customer groups, as per the client’s request.
# Suppressing warnings
options(warn = defaultW)