CUSTOMER SEGMENTATION

1. Defining the Question

a) Specifying the Question

What are the characteristics of Kira Plastinina customer groups?

b) Defining the Metric for Success

  1. Correct identification of the fundamental characteristics of the brand’s customer groups.
  2. Creation of Kmeans and Hierarchical clustering models that best cluster the brand’s customers.

c) Understanding the Context

The Kira Plastinina company is a Russian fashion brand that previously belonged to the Russian fashion designer Kira Plastinina. The brand was sold through a now defunct chain of eponymous retail stores in Russia, Ukraine, Kazakhstan, Belarus, China, the Philippines and Armenia.

The brand’s sales and marketing team requested an in-depth analysis of their customer groups and provided data collected over the past year for the analysis. The main aim of the project was to meet the client’s request and to develop a clustering model for grouping the various customers, in order to derive useful insights that can improve the brand’s marketing strategy and boost sales.

d) Recording the Experimental Design

  1. Data sourcing/loading
  2. Data Understanding
  3. Data Relevance
  4. External Dataset Validation
  5. Data Preparation
  6. Univariate Analysis
  7. Bivariate Analysis
  8. Multivariate Analysis
  9. Modelling
     a) KMeans Clustering
     b) Hierarchical Clustering
  10. Implementing the solution
  11. Challenging the solution
  12. Conclusion
  13. Follow up questions

e) Data Relevance

For the data to be relevant, it should have meaningful insights that can be used to identify the characteristics of the brand’s customer groups. The dataset should also have relevant information that allows for efficient and relevant clustering of customer groups.

2. Data Understanding

a) Importing Required Libraries

# Suppressing warnings
defaultW <- getOption("warn")
options(warn = -1)
# Libraries
library(data.table) # Data Table library
library(plyr)       # Data manipulation library
library(ggplot2)    # Plotting Library
library(moments)    # Measure of distribution library
library(ggcorrplot) # Correlation plotting library
library(mice)       # Missing values  
## 
## Attaching package: 'mice'
## The following object is masked from 'package:stats':
## 
##     filter
## The following objects are masked from 'package:base':
## 
##     cbind, rbind
library(stats)      # KMeans 
library(factoextra) # Cluster visualization
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(ggpubr)     # Multiple graph visualization
## 
## Attaching package: 'ggpubr'
## The following object is masked from 'package:plyr':
## 
##     mutate
library(cluster)    # Gap-Statistic
library(caret)      # Preprocessing (min-max scaling)
## Loading required package: lattice
library(lattice)    # Used with caret

b) Reading the Data

# Loading the data
cust <- fread('online_shoppers_intention.csv')

c) Checking the Data

i) Number of Records

# Number of rows and columns of the customer data.
cat('Number of rows = ', nrow(cust), 'and the number of columns = ', ncol(cust),'.')
## Number of rows =  12330 and the number of columns =  18 .

ii) Top Dataset Preview

head(cust, 5)
##    Administrative Administrative_Duration Informational Informational_Duration
## 1:              0                       0             0                      0
## 2:              0                       0             0                      0
## 3:              0                      -1             0                     -1
## 4:              0                       0             0                      0
## 5:              0                       0             0                      0
##    ProductRelated ProductRelated_Duration BounceRates ExitRates PageValues
## 1:              1                0.000000        0.20      0.20          0
## 2:              2               64.000000        0.00      0.10          0
## 3:              1               -1.000000        0.20      0.20          0
## 4:              2                2.666667        0.05      0.14          0
## 5:             10              627.500000        0.02      0.05          0
##    SpecialDay Month OperatingSystems Browser Region TrafficType
## 1:          0   Feb                1       1      1           1
## 2:          0   Feb                2       2      1           2
## 3:          0   Feb                4       1      9           3
## 4:          0   Feb                3       2      2           4
## 5:          0   Feb                3       3      1           4
##          VisitorType Weekend Revenue
## 1: Returning_Visitor   FALSE   FALSE
## 2: Returning_Visitor   FALSE   FALSE
## 3: Returning_Visitor   FALSE   FALSE
## 4: Returning_Visitor   FALSE   FALSE
## 5: Returning_Visitor    TRUE   FALSE

iii) Bottom Dataset Preview

tail(cust, 5)
##    Administrative Administrative_Duration Informational Informational_Duration
## 1:              3                     145             0                      0
## 2:              0                       0             0                      0
## 3:              0                       0             0                      0
## 4:              4                      75             0                      0
## 5:              0                       0             0                      0
##    ProductRelated ProductRelated_Duration BounceRates  ExitRates PageValues
## 1:             53                1783.792 0.007142857 0.02903061   12.24172
## 2:              5                 465.750 0.000000000 0.02133333    0.00000
## 3:              6                 184.250 0.083333333 0.08666667    0.00000
## 4:             15                 346.000 0.000000000 0.02105263    0.00000
## 5:              3                  21.250 0.000000000 0.06666667    0.00000
##    SpecialDay Month OperatingSystems Browser Region TrafficType
## 1:          0   Dec                4       6      1           1
## 2:          0   Nov                3       2      1           8
## 3:          0   Nov                3       2      1          13
## 4:          0   Nov                2       2      3          11
## 5:          0   Nov                3       2      1           2
##          VisitorType Weekend Revenue
## 1: Returning_Visitor    TRUE   FALSE
## 2: Returning_Visitor    TRUE   FALSE
## 3: Returning_Visitor    TRUE   FALSE
## 4: Returning_Visitor   FALSE   FALSE
## 5:       New_Visitor    TRUE   FALSE

At first glance, the duration columns contain negative values (-1), which will need to be addressed during data preparation; otherwise the data appears well formed.

d) Checking Datatypes

str(cust)
## Classes 'data.table' and 'data.frame':   12330 obs. of  18 variables:
##  $ Administrative         : int  0 0 0 0 0 0 0 1 0 0 ...
##  $ Administrative_Duration: num  0 0 -1 0 0 0 -1 -1 0 0 ...
##  $ Informational          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Informational_Duration : num  0 0 -1 0 0 0 -1 -1 0 0 ...
##  $ ProductRelated         : int  1 2 1 2 10 19 1 1 2 3 ...
##  $ ProductRelated_Duration: num  0 64 -1 2.67 627.5 ...
##  $ BounceRates            : num  0.2 0 0.2 0.05 0.02 ...
##  $ ExitRates              : num  0.2 0.1 0.2 0.14 0.05 ...
##  $ PageValues             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SpecialDay             : num  0 0 0 0 0 0 0.4 0 0.8 0.4 ...
##  $ Month                  : chr  "Feb" "Feb" "Feb" "Feb" ...
##  $ OperatingSystems       : int  1 2 4 3 3 2 2 1 2 2 ...
##  $ Browser                : int  1 2 1 2 3 2 4 2 2 4 ...
##  $ Region                 : int  1 1 9 2 1 1 3 1 2 1 ...
##  $ TrafficType            : int  1 2 3 4 4 3 3 5 3 2 ...
##  $ VisitorType            : chr  "Returning_Visitor" "Returning_Visitor" "Returning_Visitor" "Returning_Visitor" ...
##  $ Weekend                : logi  FALSE FALSE FALSE FALSE TRUE FALSE ...
##  $ Revenue                : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  - attr(*, ".internal.selfref")=<externalptr>

All columns have workable datatypes; however, the categorical columns (columns 11 to 18) will be converted to factors.

# Converting categorical columns to factor types
cust[,11:18] <- lapply(cust[,11:18], factor)
# Checking changes
head(cust, 5)
##    Administrative Administrative_Duration Informational Informational_Duration
## 1:              0                       0             0                      0
## 2:              0                       0             0                      0
## 3:              0                      -1             0                     -1
## 4:              0                       0             0                      0
## 5:              0                       0             0                      0
##    ProductRelated ProductRelated_Duration BounceRates ExitRates PageValues
## 1:              1                0.000000        0.20      0.20          0
## 2:              2               64.000000        0.00      0.10          0
## 3:              1               -1.000000        0.20      0.20          0
## 4:              2                2.666667        0.05      0.14          0
## 5:             10              627.500000        0.02      0.05          0
##    SpecialDay Month OperatingSystems Browser Region TrafficType
## 1:          0   Feb                1       1      1           1
## 2:          0   Feb                2       2      1           2
## 3:          0   Feb                4       1      9           3
## 4:          0   Feb                3       2      2           4
## 5:          0   Feb                3       3      1           4
##          VisitorType Weekend Revenue
## 1: Returning_Visitor   FALSE   FALSE
## 2: Returning_Visitor   FALSE   FALSE
## 3: Returning_Visitor   FALSE   FALSE
## 4: Returning_Visitor   FALSE   FALSE
## 5: Returning_Visitor    TRUE   FALSE

3. External Dataset Validation

The data was provided by the client and is based on data collected over the past year; therefore, there is no need for external dataset validation.

4. Data Preparation

a) Validation

Column Validation

colnames(cust)
##  [1] "Administrative"          "Administrative_Duration"
##  [3] "Informational"           "Informational_Duration" 
##  [5] "ProductRelated"          "ProductRelated_Duration"
##  [7] "BounceRates"             "ExitRates"              
##  [9] "PageValues"              "SpecialDay"             
## [11] "Month"                   "OperatingSystems"       
## [13] "Browser"                 "Region"                 
## [15] "TrafficType"             "VisitorType"            
## [17] "Weekend"                 "Revenue"

All columns are valid.

Checking for invalid values

# Checking for anomalies
# Data set summary
summary(cust)
##  Administrative   Administrative_Duration Informational   
##  Min.   : 0.000   Min.   :  -1.00         Min.   : 0.000  
##  1st Qu.: 0.000   1st Qu.:   0.00         1st Qu.: 0.000  
##  Median : 1.000   Median :   8.00         Median : 0.000  
##  Mean   : 2.318   Mean   :  80.91         Mean   : 0.504  
##  3rd Qu.: 4.000   3rd Qu.:  93.50         3rd Qu.: 0.000  
##  Max.   :27.000   Max.   :3398.75         Max.   :24.000  
##  NA's   :14       NA's   :14              NA's   :14      
##  Informational_Duration ProductRelated   ProductRelated_Duration
##  Min.   :  -1.00        Min.   :  0.00   Min.   :   -1.0        
##  1st Qu.:   0.00        1st Qu.:  7.00   1st Qu.:  185.0        
##  Median :   0.00        Median : 18.00   Median :  599.8        
##  Mean   :  34.51        Mean   : 31.76   Mean   : 1196.0        
##  3rd Qu.:   0.00        3rd Qu.: 38.00   3rd Qu.: 1466.5        
##  Max.   :2549.38        Max.   :705.00   Max.   :63973.5        
##  NA's   :14             NA's   :14       NA's   :14             
##   BounceRates         ExitRates         PageValues        SpecialDay     
##  Min.   :0.000000   Min.   :0.00000   Min.   :  0.000   Min.   :0.00000  
##  1st Qu.:0.000000   1st Qu.:0.01429   1st Qu.:  0.000   1st Qu.:0.00000  
##  Median :0.003119   Median :0.02512   Median :  0.000   Median :0.00000  
##  Mean   :0.022152   Mean   :0.04300   Mean   :  5.889   Mean   :0.06143  
##  3rd Qu.:0.016684   3rd Qu.:0.05000   3rd Qu.:  0.000   3rd Qu.:0.00000  
##  Max.   :0.200000   Max.   :0.20000   Max.   :361.764   Max.   :1.00000  
##  NA's   :14         NA's   :14                                           
##      Month      OperatingSystems    Browser         Region      TrafficType  
##  May    :3364   2      :6601     2      :7961   1      :4780   2      :3913  
##  Nov    :2998   1      :2585     1      :2462   3      :2403   1      :2451  
##  Mar    :1907   3      :2555     4      : 736   4      :1182   3      :2052  
##  Dec    :1727   4      : 478     5      : 467   2      :1136   4      :1069  
##  Oct    : 549   8      :  79     6      : 174   6      : 805   13     : 738  
##  Sep    : 448   6      :  19     10     : 163   7      : 761   10     : 450  
##  (Other):1337   (Other):  13     (Other): 367   (Other):1263   (Other):1657  
##             VisitorType     Weekend      Revenue     
##  New_Visitor      : 1694   FALSE:9462   FALSE:10422  
##  Other            :   85   TRUE :2868   TRUE : 1908  
##  Returning_Visitor:10551                             
##                                                      
##                                                      
##                                                      
## 

The duration columns have negative minimum values. They will be converted to positive values as durations cannot be negative.

# Changing the negative values to positive values
cust[,c(2,4,6)][(cust[,c(2,4,6)] == -1) & !is.na(cust[,c(2,4,6)])] <- 1
# Confirming changes
summary(cust[,c(2,4,6)])
##  Administrative_Duration Informational_Duration ProductRelated_Duration
##  Min.   :   0.00         Min.   :   0.00        Min.   :    0.0        
##  1st Qu.:   0.00         1st Qu.:   0.00        1st Qu.:  185.0        
##  Median :   8.00         Median :   0.00        Median :  599.8        
##  Mean   :  80.91         Mean   :  34.51        Mean   : 1196.0        
##  3rd Qu.:  93.50         3rd Qu.:   0.00        3rd Qu.: 1466.5        
##  Max.   :3398.75         Max.   :2549.38        Max.   :63973.5        
##  NA's   :14              NA's   :14             NA's   :14

b) Consistency

# Checking for missing values
colSums(is.na(cust))
##          Administrative Administrative_Duration           Informational 
##                      14                      14                      14 
##  Informational_Duration          ProductRelated ProductRelated_Duration 
##                      14                      14                      14 
##             BounceRates               ExitRates              PageValues 
##                      14                      14                       0 
##              SpecialDay                   Month        OperatingSystems 
##                       0                       0                       0 
##                 Browser                  Region             TrafficType 
##                       0                       0                       0 
##             VisitorType                 Weekend                 Revenue 
##                       0                       0                       0
# Percentage of missing values
perc <- (sum(is.na(cust))/nrow(cust))*100
percentage.missing <- paste0('Missing values = ', round(perc, 2), '%')
percentage.missing
## [1] "Missing values = 0.91%"
# Imputing the numeric columns using the mean as the missing values are minimal.
cust <- data.frame(cust) 
for(col in 1:8) {
  cust[ , col][is.na(cust[ , col])] <- mean(cust[ , col], na.rm = TRUE)
}
# Confirming changes
colSums(is.na(cust))
##          Administrative Administrative_Duration           Informational 
##                       0                       0                       0 
##  Informational_Duration          ProductRelated ProductRelated_Duration 
##                       0                       0                       0 
##             BounceRates               ExitRates              PageValues 
##                       0                       0                       0 
##              SpecialDay                   Month        OperatingSystems 
##                       0                       0                       0 
##                 Browser                  Region             TrafficType 
##                       0                       0                       0 
##             VisitorType                 Weekend                 Revenue 
##                       0                       0                       0

All missing values have been imputed. The mean was used because the percentage of missing values was minuscule; model-based imputation with MICE would be computationally expensive for such a simple imputation task.
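
For reference, below is a minimal sketch of the model-based alternative that the already-loaded mice package offers; had it been run in place of the mean imputation above, it would fill gaps from a predictive model rather than a single constant. The method ('pmm', predictive mean matching) and the m/seed values are illustrative assumptions, not part of the procedure used here.

# Illustrative sketch (alternative to the mean imputation above): multiple
# imputation by chained equations with predictive mean matching ('pmm').
# m, seed and printFlag are illustrative choices; results go to a copy.
imp <- mice(cust[, 1:8], m = 5, method = 'pmm', seed = 123, printFlag = FALSE)
cust.mice <- complete(imp)  # completed copy of the eight numeric columns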

c) Completeness

# Checking for duplicates
sum(duplicated(cust))
## [1] 119
# Removing identified duplicates
cust <- cust[!duplicated(cust),]
# Confirming changes
sum(duplicated(cust))
## [1] 0

Duplicates have been removed.

d) Uniformity

# Checking the uniformity of column names
colnames(cust)
##  [1] "Administrative"          "Administrative_Duration"
##  [3] "Informational"           "Informational_Duration" 
##  [5] "ProductRelated"          "ProductRelated_Duration"
##  [7] "BounceRates"             "ExitRates"              
##  [9] "PageValues"              "SpecialDay"             
## [11] "Month"                   "OperatingSystems"       
## [13] "Browser"                 "Region"                 
## [15] "TrafficType"             "VisitorType"            
## [17] "Weekend"                 "Revenue"

The column names do not have a uniform format. Underscores will be introduced for consistency.

# Introducing underscores
colnames(cust) <- c("Aministrative", "Administrative_Duration", "Informational",
                    "Informational_Duration", "Product_Related",
                    "Product_Related_Duration", "Bounce_Rates", "Exit_Rates",
                    "Page_Values", "Special_Day", "Month", "Operating_Systems",
                    "Browser", "Region", "Traffic_Type", "Visitor_Type",
                    "Weekend", "Revenue")

# Confirming changes
colnames(cust)
##  [1] "Aministrative"            "Administrative_Duration" 
##  [3] "Informational"            "Informational_Duration"  
##  [5] "Product_Related"          "Product_Related_Duration"
##  [7] "Bounce_Rates"             "Exit_Rates"              
##  [9] "Page_Values"              "Special_Day"             
## [11] "Month"                    "Operating_Systems"       
## [13] "Browser"                  "Region"                  
## [15] "Traffic_Type"             "Visitor_Type"            
## [17] "Weekend"                  "Revenue"

The Aministrative column will be renamed as it has been misspelt.

# Renaming the column name
colnames(cust)[1] <- c('Administrative')

# Confirming changes
colnames(cust)[1]
## [1] "Administrative"

e) Outliers

# Selecting numerical columns
num <- unlist(lapply(cust, is.numeric))
# Numeric data frame
num_df <- cust[, num]
# Number of numeric columns
length(names(num_df))
## [1] 10
# Boxplots
par(mfrow = c(5,2), mar=c(1,5,2,2))
for (i in 1:length(num_df)){
  boxplot(num_df[ , i], main = paste('Boxplot of',  names(num_df)[i]), 
          ylab = 'Value')
}

All columns have outliers.

# Outliers summary function
outliers.summary <- function(data){
  total <- c(0) # Placeholder for total number of outliers
  cat('Percentage outliers per column\n')
  cat(rep('-', 16), '\n')
  for (col in names(data)){
    # IQR
    Q1 <- quantile(data[,col], probs=.25)
    Q3 <- quantile(data[,col], probs=.75)
    IQR = Q3-Q1
    # Limits
    upper.limit = Q3 + (IQR*1.5)
    lower.limit = Q1 - (IQR*1.5)
    # Outliers per column
    out <- sum(data[,col] > upper.limit | data[,col] < lower.limit)
    total <- total + out
    perc <- (out/nrow(data))*100
    print(paste0(col, ' = ', round(perc,2), '%'))
    cat('\n')
  }
  # Total percentage of outliers
  cat('\nTotal percentage of outliers\n')
  cat(rep('-', 16), '\n')
  cells <- nrow(data)*ncol(data)
  perc <- (total/cells)*100
  print(paste0('Total = ', round(perc,2), '%'))
  
}
# Outliers summary
outliers.summary(num_df)
## Percentage outliers per column
## - - - - - - - - - - - - - - - - 
## [1] "Administrative = 3.31%"
## 
## [1] "Administrative_Duration = 9.42%"
## 
## [1] "Informational = 21.64%"
## 
## [1] "Informational_Duration = 20.06%"
## 
## [1] "Product_Related = 8.25%"
## 
## [1] "Product_Related_Duration = 7.81%"
## 
## [1] "Bounce_Rates = 11.72%"
## 
## [1] "Exit_Rates = 10.87%"
## 
## [1] "Page_Values = 22.36%"
## 
## [1] "Special_Day = 10.23%"
## 
## 
## Total percentage of outliers
## - - - - - - - - - - - - - - - - 
## [1] "Total = 12.56%"

Outliers will be retained for analysis, then scaling or removal will be applied during modelling.

# Removing outliers

# Function to obtain logical vector of outliers
outliers.vector <- function(data) {
  # IQR
  Q1 <- quantile(data, probs=.25)
  Q3 <- quantile(data, probs=.75)
  IQR <- Q3-Q1
  # Limits
  upper.limit = Q3 + (IQR*1.5)
  lower.limit = Q1 - (IQR*1.5)
  # Logical vector
  data > upper.limit | data < lower.limit
}

remove.outliers <- function(data, cols = names(data)) {
  for (col in cols) {
    data <- data[!outliers.vector(data[[col]]),]
  }
  data
}
# Copy of the dataset with rows containing outliers removed, reusing
# remove.outliers above on the ten numeric columns.
cust.no <- remove.outliers(data.frame(cust), names(cust)[1:10])
# Percentage of rows remaining after outlier removal
perc <- ((nrow(cust.no))/nrow(cust))*100
print(paste0('Remaining rows = ', round((perc),2), '%'))
## [1] "Remaining rows = 40.7%"
print(nrow(cust.no))
## [1] 4970

The majority of the rows (about 59%) have been dropped, leading to a significant loss of data.
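
Given this loss, a less destructive alternative (not used in this analysis) would be capping (winsorizing) extreme values at the same IQR fences instead of dropping whole rows. A minimal sketch, with cap.col as a hypothetical helper:

# Illustrative sketch: cap (winsorize) values at the IQR fences instead of
# dropping rows; cap.col is a hypothetical helper, not used in this analysis.
cap.col <- function(x) {
  Q1 <- quantile(x, probs = .25)
  Q3 <- quantile(x, probs = .75)
  iqr <- Q3 - Q1
  pmin(pmax(x, Q1 - 1.5 * iqr), Q3 + 1.5 * iqr) # clamp to the IQR fences
}
cust.capped <- data.frame(cust)
cust.capped[, 1:10] <- lapply(cust.capped[, 1:10], cap.col)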

5. Descriptive Analysis

a) Univariate Analysis

i) Categorical

# Count plot and normal bar plot function.
bar.plt <- function(data, col1, title, legend = NULL, colors = NULL, method, col2 = NULL){
  if (method == 'count'){
    ggplot(data, aes(x = {{col1}}, fill = {{col1}})) + geom_bar() +
    ggtitle(paste(title, 'Frequency Plot')) + 
    theme(plot.title = element_text(hjust = 0.5))+
    scale_fill_manual(legend, values = colors)}
  else if (method == 'bar'){
    ggplot(data, aes(x = {{col1}}, y = {{col2}}, fill = {{col1}})) + geom_bar(stat = 'identity') + ggtitle(paste(title, 'Bar Plot')) + 
    theme(plot.title = element_text(hjust = 0.5))+
    scale_fill_manual(legend, values = colors)}
}
# Categorical columns
cat.cols <- unlist(lapply(cust, is.factor)) # cat.cols avoids masking base::cat
cat.df <- cust[,cat.cols]

Month

# Unique month values
print(unique(cat.df$Month))
##  [1] Feb  Mar  May  Oct  June Jul  Aug  Nov  Sep  Dec 
## Levels: Aug Dec Feb Jul June Mar May Nov Oct Sep
length(unique(cat.df$Month))
## [1] 10

The dataset does not have the months of January and April.

# Ordering Factors
cust$Month <- factor(cust$Month, levels = c('Feb', 'Mar', 'May', 'June', 'Jul',
                                            'Aug', 'Sep', 'Oct', 'Nov', 'Dec'))
cat.cols <- unlist(lapply(cust, is.factor))
cat.df <- cust[,cat.cols]
print(unique(cat.df$Month))
##  [1] Feb  Mar  May  Oct  June Jul  Aug  Nov  Sep  Dec 
## Levels: Feb Mar May June Jul Aug Sep Oct Nov Dec
# Months column bar plot
bar.plt (cat.df, Month, title = 'Month Column', method = 'count', legend = 
           'Month', colors = c('Feb' = '#03a9f4', 'Mar' = '#0398dc',
                               'May' = '#0287c3', 'June' = '#0276AB',
                               'Jul' = '#026592', 'Aug' = '#02557A',
                               'Sep' = '#014462', 'Oct' = '#013349',
                               'Nov' = '#012231', 'Dec' = '#001118'),
         col2 = NULL)

The months of May, November and March have the highest frequencies.

Operating Systems

# Unique values
unique(cat.df$Operating_Systems)
## [1] 1 2 4 3 7 6 8 5
## Levels: 1 2 3 4 5 6 7 8
# Operating Systems column bar plot
bar.plt (cat.df, Operating_Systems, title = 'Operating System Column', method = 'count', legend = 
           'Operating System', colors = c('1' = '#03a9f4', '2' = '#0398dc',
                               '3' = '#0287c3', '4' = '#0276AB',
                               '5' = '#026592', '6' = '#02557A',
                               '7' = '#014462', '8' = '#013349'),
         col2 = NULL)

The second operating system has the highest frequency, while 5 and 7 have the least number of records.

Browser

# Unique browser values
unique(cat.df$Browser)
##  [1] 1  2  3  4  5  6  7  10 8  9  12 13 11
## Levels: 1 2 3 4 5 6 7 8 9 10 11 12 13
# Browser column bar plot
bar.plt (cat.df, Browser, title = 'Browser Column', method = 'count', legend = 
           'Browser', colors = c('1' = '#d9f1ff', '2' = '#bfe6ff',
                                 '3' = '#59bfff','4' = '#03a9f4',
                                 '5' = '#0398dc', '6' = '#0287c3',
                                 '7' = '#0276AB', '8' = '#026592',
                                 '9' = '#02557A', '10' = '#014462',
                                 '11' = '#013349', '12' = '#012231',
                                 '13' = '#001118'),
         col2 = NULL)

The second browser has the highest frequency.

Region

# Unique region values
unique(cat.df$Region)
## [1] 1 9 2 3 4 5 6 7 8
## Levels: 1 2 3 4 5 6 7 8 9
# Region column bar plot
bar.plt (cat.df, Region, title = 'Region Column', method = 'count', legend = 
           'Region', colors = c('1' = '#0398dc', '2' = '#0287c3',
                                 '3' = '#0276AB', '4' = '#026592',
                                 '5' = '#02557A', '6' = '#014462',
                                 '7' = '#013349', '8' = '#012231',
                                 '9' = '#001118'),
         col2 = NULL)

The first region has the highest frequency.

Traffic Type

# Unique traffic type values
unique(cat.df$Traffic_Type)
##  [1] 1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 18 19 16 17 20
## Levels: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
# Traffic type column bar plot
ggplot(cat.df, aes(x = Traffic_Type, fill = Traffic_Type)) + geom_bar() +
    ggtitle(paste('Traffic Type Frequency Plot')) + 
    theme(plot.title = element_text(hjust = 0.5))

The second type has the highest frequency.

Visitor Type

# Unique visitor type values
unique(cat.df$Visitor_Type)
## [1] Returning_Visitor New_Visitor       Other            
## Levels: New_Visitor Other Returning_Visitor
# Visitor Type column bar plot
bar.plt (cat.df, Visitor_Type, title = 'Visitor Type Column', method = 'count',
         legend =  'Visitor Type', colors = c('New_Visitor' = '#0398dc',
                                'Returning_Visitor' = '#026592',
                                 'Other' = '#001118'),
         col2 = NULL)

The returning visitor class has the highest frequency.

Weekend

# Unique weekend values
unique(cat.df$Weekend)
## [1] FALSE TRUE 
## Levels: FALSE TRUE
# Weekend column bar plot
bar.plt (cat.df, Weekend, title = 'Weekend Column', method = 'count',
         legend =  'Weekend', colors = c('FALSE' = '#0398dc',
                                 'TRUE' = '#001118'),
         col2 = NULL)

Most activity/traffic did not take place on the weekend.

Revenue

# Unique revenue values
unique(cat.df$Revenue)
## [1] FALSE TRUE 
## Levels: FALSE TRUE
# Revenue column bar plot
bar.plt (cat.df, Revenue, title = 'Revenue Column', method = 'count',
         legend =  'Revenue', colors = c('FALSE' = '#0398dc',
                                 'TRUE' = '#001118'),
         col2 = NULL)

Lack of revenue had a higher frequency.

ii) Numerical

# Numerical columns
num.df <- cust[,!cat.cols]
# Mode function
mode <- function(col, data) {
   unique.value <- unique(data[, col])
   unique.value[which.max(tabulate(match(data[,col], unique.value)))]
}

central.tendency <- function(col, data){
  cat('Measures of Central Tendency \n')
  # Mean
  cat('Mean = ', mean(data[, col]), '\n')
  # Median
  cat('Median = ', median(data[,col]), '\n')
  # Mode
  cat('Mode = ', mode(col, data), '\n')
  
}

dispersion <- function(col, data){
  # Range
  cat('Range = ', min(data[ ,col]), '-', max(data[ ,col]), '\n')
  # IQR
  cat('IQR = ', IQR(data[ ,col]), '\n')
  # Variance
  cat('Variance = ', var(data[ ,col]), '\n')
  # Standard Deviation
  cat('Standard Deviation = ', sd(data[ ,col]), '\n')
  # Skewness
  cat('Skewness = ', skewness(data[ ,col]), '\n')
  # Kurtosis
  cat('Kurtosis = ', kurtosis(data[ ,col]), '\n')
}

Administrative

# Measures of central tendency
central.tendency(names(num.df)[1], num.df)
## Measures of Central Tendency 
## Mean =  2.340006 
## Median =  1 
## Mode =  0

The average number of administrative pages visited was ~2.

# Measures of dispersion
dispersion(names(num.df)[1], num.df)
## Range =  0 - 27 
## IQR =  4 
## Variance =  11.08367 
## Standard Deviation =  3.329214 
## Skewness =  1.947225 
## Kurtosis =  7.643668
  • The skewness value is positive, therefore lower values had higher frequencies.
  • The distribution is also leptokurtic (kurtosis > 3).
# Administrative Histogram
hist(num.df$Administrative,
     main = 'Histogram of Administrative Pages', 
     xlab = 'Administrative')

The administrative pages were not visited most of the time.

Administrative Duration

# Measures of central tendency
central.tendency(names(num.df)[2], num.df)
## Measures of Central Tendency 
## Mean =  81.68679 
## Median =  9 
## Mode =  0

The average administrative page visit duration was ~82, most likely in seconds given the scale of the values.

# Measures of dispersion
dispersion(names(num.df)[2], num.df)
## Range =  0 - 3398.75 
## IQR =  94.6 
## Variance =  31484.39 
## Standard Deviation =  177.4384 
## Skewness =  5.593135 
## Kurtosis =  53.14849
  • The skewness value is positive, therefore lower values had higher frequencies.
  • The distribution is also leptokurtic (kurtosis > 3).
# Administrative Duration Histogram
hist(num.df$Administrative_Duration,
     main = 'Histogram of Administrative Page Duration', 
     xlab = 'Administrative Duration')

Most customers did not visit the administrative pages.

Informational

# Measures of central tendency
central.tendency(names(num.df)[3], num.df)
## Measures of Central Tendency 
## Mean =  0.5088074 
## Median =  0 
## Mode =  0

The average number of informational pages visited was ~0.5, meaning most sessions visited none.

# Measures of dispersion
dispersion(names(num.df)[3], num.df)
## Range =  0 - 24 
## IQR =  0 
## Variance =  1.62611 
## Standard Deviation =  1.27519 
## Skewness =  4.015436 
## Kurtosis =  29.67176
  • The skewness value is positive, therefore lower values had higher frequencies.
  • The distribution is also leptokurtic (kurtosis > 3).
# Informational Histogram
hist(num.df$Informational,
     main = 'Histogram of Informational Pages', 
     xlab = 'Informational')

The informational page was rarely visited.

Informational Duration

# Measures of central tendency
central.tendency(names(num.df)[4], num.df)
## Measures of Central Tendency 
## Mean =  34.84242 
## Median =  0 
## Mode =  0

The average informational page visit duration was ~35 seconds.

# Measures of dispersion
dispersion(names(num.df)[4], num.df)
## Range =  0 - 2549.375 
## IQR =  0 
## Variance =  19990.46 
## Standard Deviation =  141.3876 
## Skewness =  7.541253 
## Kurtosis =  78.54314
  • The skewness value is positive, therefore lower values had higher frequencies.
  • The distribution is also leptokurtic (kurtosis > 3).
# Informational Duration Histogram
hist(num.df$Informational_Duration,
     main = 'Histogram of Informational Duration', 
     xlab = 'Informational Duration')

The informational pages were not visited most of the time.

Product Related

# Measures of central tendency
central.tendency(names(num.df)[5], num.df)
## Measures of Central Tendency 
## Mean =  32.05816 
## Median =  18 
## Mode =  1

The average number of product related pages visited was 32.

# Measures of dispersion
dispersion(names(num.df)[5], num.df)
## Range =  0 - 705 
## IQR =  30 
## Variance =  1987.286 
## Standard Deviation =  44.57899 
## Skewness =  4.334283 
## Kurtosis =  34.08263
  • The skewness value is positive, therefore lower values had higher frequencies.
  • The distribution is also leptokurtic (kurtosis > 3).
# Product Related Histogram
hist(num.df$Product_Related,
     main = 'Histogram of Product Related Page', 
     xlab = 'Product Related')

Most sessions visited relatively few product related pages.

Product Related Duration

# Measures of central tendency
central.tendency(names(num.df)[6], num.df)
## Measures of Central Tendency 
## Mean =  1207.502 
## Median =  611 
## Mode =  0

The average product related page visit duration was ~20 minutes, the highest page duration seen so far.

# Measures of dispersion
dispersion(names(num.df)[6], num.df)
## Range =  0 - 63973.52 
## IQR =  1282.4 
## Variance =  3682486 
## Standard Deviation =  1918.98 
## Skewness =  7.25502 
## Kurtosis =  139.7292
  • The skewness value is positive, therefore lower values had higher frequencies.
  • The distribution is also heavily leptokurtic (kurtosis well above 3).
# Product Related Duration Histogram
hist(num.df$Product_Related_Duration,
     main = 'Histogram of Product Related Duration', 
     xlab = 'Product Related Duration')

As with the other duration columns, most sessions spent relatively little time on the product related pages.

Bounce Rates

# Measures of central tendency
central.tendency(names(num.df)[7], num.df)
## Measures of Central Tendency 
## Mean =  0.02044841 
## Median =  0.002941176 
## Mode =  0

The average bounce rate was close to zero.

# Measures of dispersion
dispersion(names(num.df)[7], num.df)
## Range =  0 - 0.2 
## IQR =  0.01666667 
## Variance =  0.002059364 
## Standard Deviation =  0.04538022 
## Skewness =  3.154307 
## Kurtosis =  12.26662
  • The skewness value is positive, therefore lower values had higher frequencies.
  • The distribution is also leptokurtic (kurtosis > 3).
# Bounce Rate Histogram
hist(num.df$Bounce_Rates,
     main = 'Histogram of the Bounce Rate', 
     xlab = 'Bounce Rate')

The highest frequency for the bounce rate was zero.

Exit Rate

# Measures of central tendency
central.tendency(names(num.df)[8], num.df)
## Measures of Central Tendency 
## Mean =  0.04149826 
## Median =  0.025 
## Mode =  0.2

The average exit rate was close to zero.

# Measures of dispersion
dispersion(names(num.df)[8], num.df)
## Range =  0 - 0.2 
## IQR =  0.03421079 
## Variance =  0.0021367 
## Standard Deviation =  0.04622445 
## Skewness =  2.234124 
## Kurtosis =  7.63145
  • The skewness value is positive, therefore lower values had higher frequencies.
  • The distribution is also leptokurtic (kurtosis > 3).
# Exit Rate Histogram
hist(num.df$Exit_Rates,
     main = 'Histogram of the Exit Rate', 
     xlab = 'Exit Rate')

A low exit rate of 0.01 had the highest frequency.

Page Value

# Measures of central tendency
central.tendency(names(num.df)[9], num.df)
## Measures of Central Tendency 
## Mean =  5.946651 
## Median =  0 
## Mode =  0

The average page value is 5.95.

# Measures of dispersion
dispersion(names(num.df)[9], num.df)
## Range =  0 - 361.7637 
## IQR =  0 
## Variance =  347.8058 
## Standard Deviation =  18.64955 
## Skewness =  6.351741 
## Kurtosis =  68.00151
  • The skewness value is positive, therefore lower values had higher frequencies.
  • The distribution is also leptokurtic (kurtosis > 3).
# Page Value Histogram
hist(num.df$Page_Values,
     main = 'Histogram of Page Value', 
     xlab = 'Page Value')

A page value of zero had the highest frequency.

Special Day

# Measures of central tendency
central.tendency(names(num.df)[10], num.df)
## Measures of Central Tendency 
## Mean =  0.06191139 
## Median =  0 
## Mode =  0

The average value was close to zero; therefore, most site visits did not take place close to a special day.

# Measures of dispersion
dispersion(names(num.df)[10], num.df)
## Range =  0 - 1 
## IQR =  0 
## Variance =  0.03984889 
## Standard Deviation =  0.1996219 
## Skewness =  3.286515 
## Kurtosis =  12.80023
  • The skewness value is positive, therefore lower values had higher frequencies.
  • The distribution is also leptokurtic (kurtosis > 3).
# Special Day Histogram
hist(num.df$Special_Day,
     main = 'Histogram of Proximity to a Special Day', 
     xlab = 'Special Day')

The majority of site visits were not close to a special day.

Summary

The univariate section provided an overview of the frequency distribution of the numeric columns and of the class frequencies within the categorical variables. For a more in-depth analysis of the characteristics of different customer groups, bivariate analysis will be done.

b) Bivariate Analysis

i) Categorical-Categorical

Revenue Vs Month

# Target columns
Month <- cat.df$Month
Revenue <- cat.df$Revenue

# Contingency table
contingency.table <- table(Revenue, Month)
contingency.table
##        Month
## Revenue  Feb  Mar  May June  Jul  Aug  Sep  Oct  Nov  Dec
##   FALSE  179 1672 2964  256  366  357  362  434 2223 1490
##   TRUE     3  192  365   29   66   76   86  115  760  216
# Mosaic plot of contingency table
mosaicplot(contingency.table, xlab='Month', ylab='Revenue',
           main='Revenue Vs Month', color = '#1338BE', las = 1)

The months of November, May and December had the highest revenue frequency counts. Furthermore, May, November, March and December had the highest frequencies for a lack of revenue.

Revenue Vs Operating Systems

# Target columns
Operating_Systems <- cat.df$Operating_Systems

# Contingency table
contingency.table <- table(Revenue, Operating_Systems)
contingency.table
##        Operating_Systems
## Revenue    1    2    3    4    5    6    7    8
##   FALSE 2172 5387 2265  393    5   17    6   58
##   TRUE   379 1155  268   85    1    2    1   17
# Mosaic plot of contingency table
mosaicplot(contingency.table, xlab='Operating Systems', ylab='Revenue',
           main='Revenue Vs Operating Systems', color = '#1338BE', las = 1)

The second operating system has the highest revenue count. However, it also records the highest frequency for a lack of revenue, followed by the third, then the first.

Revenue Vs Browser

# Target columns
Browser <- cat.df$Browser

# Contingency table
contingency.table <- table(Revenue, Browser)
contingency.table
##        Browser
## Revenue    1    2    3    4    5    6    7    8    9   10   11   12   13
##   FALSE 2064 6663  100  601  380  154   43  114    1  131    5    7   40
##   TRUE   365 1223    5  130   86   20    6   21    0   32    1    3   16
# Mosaic plot of contingency table
mosaicplot(contingency.table, xlab='Browser', ylab='Revenue',
           main='Revenue Vs Browser', color = '#1338BE', las = 1)

The second browser has the highest revenue frequency, but this is overshadowed by the no-revenue count on the same browser, which is also the highest; the first browser has the second-highest no-revenue count.

Revenue Vs Region

# Target columns
Region <- cat.df$Region

# Contingency table
contingency.table <- table(Revenue, Region)
contingency.table
##        Region
## Revenue    1    2    3    4    5    6    7    8    9
##   FALSE 3944  941 2034  996  266  689  639  375  419
##   TRUE   771  188  349  175   52  112  119   56   86
# Mosaic plot of contingency table
mosaicplot(contingency.table, xlab='Region', ylab='Revenue',
           main='Revenue Vs Region', color = '#1338BE', las = 1)

The first and third regions had the highest revenue frequency. They also had the highest frequency for no revenue. A factor could be the population of the regions.

Revenue Vs Traffic Type

# Target columns
Traffic_Type <- cat.df$Traffic_Type

# Contingency table
contingency.table <- table(Revenue, Traffic_Type)
contingency.table
##        Traffic_Type
## Revenue    1    2    3    4    5    6    7    8    9   10   11   12   13   14
##   FALSE 2128 3062 1838  901  204  390   28  248   37  360  200    1  686   11
##   TRUE   262  847  180  165   56   53   12   95    4   90   47    0   43    2
##        Traffic_Type
## Revenue   15   16   17   18   19   20
##   FALSE   37    2    1   10   16  143
##   TRUE     0    1    0    0    1   50
# Mosaic plot of contingency table
mosaicplot(contingency.table, xlab='Traffic_Type', ylab='Revenue',
           main='Revenue Vs Traffic Type', color = '#1338BE', las = 1)

The second traffic type had the highest revenue frequency by a large margin. It also has the highest frequency for a lack of revenue, followed by the first and third types. Most traffic types recorded at least some revenue.

Revenue Vs Visitor Type

# Target columns
Visitor_Type <- cat.df$Visitor_Type

# Contingency table
contingency.table <- table(Revenue, Visitor_Type)
contingency.table
##        Visitor_Type
## Revenue New_Visitor Other Returning_Visitor
##   FALSE        1271    65              8967
##   TRUE          422    16              1470
# Mosaic plot of contingency table
mosaicplot(contingency.table, xlab='Visitor_Type', ylab='Revenue',
           main='Revenue Vs Visitor Type', color = '#1338BE', las = 1)

Returning visitors account for the highest revenue frequency counts. However, this count is still lower than the no-revenue count for the same group.

Revenue Vs Weekend

# Target columns
Weekend <- cat.df$Weekend

# Contingency table
contingency.table <- table(Revenue, Weekend)
contingency.table
##        Weekend
## Revenue FALSE TRUE
##   FALSE  7943 2360
##   TRUE   1409  499
# Mosaic plot of contingency table
mosaicplot(contingency.table, xlab='Weekend', ylab='Revenue',
           main='Revenue Vs Weekend', color = '#1338BE', las = 1)

Weekdays had the highest revenue frequencies, as well as a lack of it.

ii) Numerical-Numerical

Function Definition

# Correlation
correlation <- function(cor1, cor2, data){
  correlation <- cor(data[,c(cor1)], data[, c(cor2)])
  print(paste0('Correlation = ', round(correlation, 2), '.'))
}
# Scatter plot function
scatter.plt <- function(col1, col2, data, title){
  ggplot(data, aes(x = {{col1}}, y = {{col2}})) +
    geom_point(color = '#281E5D') +
    ggtitle(paste(title, 'Scatter Plot')) +
    theme(plot.title = element_text(hjust = 0.5))
}

Bounce Rates Vs Product Related

# Scatter plot and correlation
scatter.plt(col1 = Bounce_Rates, col2 = Product_Related, data = num.df,
            title = 'Bounce Rates Vs Product Related')

# Correlation
correlation('Product_Related', 'Bounce_Rates', num.df)
## [1] "Correlation = -0.19."

The correlation between the bounce rates and the product related pages is very weak and negative.

Exit Rates Vs Product Related

# Scatter plot and correlation
scatter.plt(col1 = Exit_Rates, col2 = Product_Related, data = num.df,
            title = 'Exit Rates Vs Product Related')

# Correlation
correlation('Product_Related', 'Exit_Rates', num.df)
## [1] "Correlation = -0.29."

The correlation between the exit rates and the product related pages is also very weak and negative, but slightly higher than that of the previous observation.

Product Related Vs Page Values

# Scatter plot and correlation
scatter.plt(col1 = Product_Related, col2 = Page_Values, data = num.df,
            title = 'Product Related Vs Page Values')

# Correlation
correlation('Product_Related', 'Page_Values', num.df)
## [1] "Correlation = 0.05."

The correlation between the product related page and the page values is very weak, but positive.

All of the columns in this section have very weak correlations with each other; therefore, a correlation matrix in the multivariate section will be used to analyze the correlations of all the numeric columns.

iii) Numerical-Categorical

Function Definition

# Bar plot for averaged y axis
bar.plt.summary <- function(data, col1, col2, title, legend, colors){
  ggplot(data, mapping=aes(x= {{col1}}, y= {{col2}}, fill = {{col1}})) + 
  stat_summary(fun=median, geom="bar") + ggtitle(paste(title, 'Bar Plot')) + 
    theme(plot.title = element_text(hjust = 0.5))+
    scale_fill_manual(legend, values = colors)
}

All numeric columns are positively skewed; therefore, the median will be used as the summary statistic.

Revenue Vs Administrative

# Bar plot
bar.plt.summary(cust, Revenue, Administrative,'Revenue Vs Administrative Pages',
                'Revenue',
                c("FALSE" = '#757C88', "TRUE" = "#1338BE"))

The more administrative pages a session visited, the more likely it generated revenue. The median number of pages for revenue-generating sessions was 2.

Revenue Vs Administrative Duration

# Bar plot
bar.plt.summary(cust, Revenue, Administrative_Duration,
                'Revenue Vs Administrative Duration',
                'Revenue', 
                c("FALSE" = '#757C88', "TRUE" = "#1338BE"))

The median time spent on administrative pages in revenue-generating sessions was 52 seconds.

Revenue Vs Informational Pages

# Bar plot
bar.plt.summary(cust, Revenue, Informational,'Revenue Vs Informational Pages',
                'Revenue',
                c("FALSE" = '#757C88', "TRUE" = "#1338BE"))

Visits to the informational pages did not contribute to the brand’s revenue.

Revenue Vs Informational Duration

# Bar plot
bar.plt.summary(cust, Revenue, Informational_Duration,
                'Revenue Vs Informational Duration',
                'Revenue',
                c("FALSE" = '#757C88', "TRUE" = "#1338BE"))

There was little to no traffic on the informational pages; therefore, they did not contribute to the brand’s revenue.

Product Related

# Bar plot
bar.plt.summary(cust, Revenue, Product_Related,
                'Revenue Vs Product Related Pages',
                'Revenue',
                c("FALSE" = '#757C88', "TRUE" = "#1338BE"))

The product related pages had a higher number of visits than the previous pages, leading to a higher contribution to revenue; the median number of pages visited for revenue-generating sessions was 38.

Revenue Vs Product Related Duration

# Bar plot
bar.plt.summary(cust, Revenue, Product_Related_Duration,
                'Revenue Vs Product Related Duration',
                'Revenue',
                c("FALSE" = '#757C88', "TRUE" = "#1338BE"))

Most visitors to the product related pages carried out transactions, leading to higher instances of revenue. The median duration was ~20 minutes.

Revenue Vs Bounce Rates

# Bar plot
bar.plt.summary(cust, Revenue, Bounce_Rates,'Revenue Vs Bounce Rates',
                'Revenue',
                c("FALSE" = '#757C88', "TRUE" = "#1338BE"))

Bounce rates did not contribute to the revenue. The median bounce rate was 0.004.

Revenue Vs Exit Rates

# Bar plot
bar.plt.summary(cust, Revenue, Exit_Rates,'Revenue Vs Exit Rates',
                'Revenue',
                c("FALSE" = '#757C88', "TRUE" = "#1338BE"))

The higher the exit rate, the lower the revenue. The median exit rate for sessions without revenue was 0.029, while the lower median rate of 0.025 was associated with revenue.

Revenue Vs Page Values

# Bar plot
bar.plt.summary(cust, Revenue, Page_Values,'Revenue Vs Page Values',
                'Revenue',
                c("FALSE" = '#757C88', "TRUE" = "#1338BE"))

The median page value for sessions that completed an e-commerce transaction, contributing to revenue, was 14.

Revenue Vs Special Day

# Bar plot
bar.plt.summary(cust, Revenue, Special_Day,'Revenue Vs Special Day',
                'Revenue',
                c("FALSE" = '#757C88', "TRUE" = "#1338BE"))

The special day feature had no contribution to the revenue.

Summary

The bivariate analysis provided a more detailed view of the different customer groups. For an even more in-depth analysis, multivariate analysis will be done.

c) Multivariate Analysis

Correlation Matrix

# Visualize correlation matrix
# Reducing the column name sizes for efficient utilization of the graphing 
# space.
num.corr <- data.frame(num.df)
colnames(num.corr) <- c('Admin', 'Admin.Dur', 'Info', 'Info.Dur', 'Prod.Rel',
                        'Prod.Rel.Dur', 'Bounce.Rts', 'Exit.Rts', 'Pg.Vals',
                        'Spec.Day') 
ggcorrplot(cor(num.corr), lab = TRUE, hc.order = TRUE,
           title = 'Correlation Heatmap',
           colors = c('#022D36', 'white', '#48AAAD'))

The exit and bounce rates, and the product related pages and their duration, show the highest positive correlations.

# Multivariate scatter plot function
scatter.plt.multi <- function(data, col1, col2, col3, legend, colors){
  ggplot(data, aes(x = {{col1}}, y = {{col2}}, color = {{col3}},
                   shape = {{col3}})) + geom_point() +
    ggtitle('Scatter Plot') +
    theme(plot.title = element_text(hjust = 0.5)) +
    scale_color_manual(legend, values = colors) # use the legend title argument
}

Revenue Vs Bounce Rate Vs Exit Rate

# Function call
scatter.plt.multi(num.df, Bounce_Rates, Exit_Rates, Revenue,
                  'Revenue', colors = c('black', 'red'))

Lower exit and bounce rates are associated with a higher number of transactions.

Revenue Vs Product Related Pages Vs Product Related Page Duration

# Function call
scatter.plt.multi(num.df, Product_Related, Product_Related_Duration, Revenue,
                  'Revenue', colors = c('black', 'red'))

Most users spend a small amount of time on the product related pages. There is no glaring distinction between customers who do and do not make transactions.

Revenue Vs Product Related Pages Vs Page Values

# Function call
scatter.plt.multi(num.df, Product_Related, Page_Values, Revenue,
                  'Revenue', colors = c('black', 'red'))

The lower the number of product related pages visited and the page values, the higher the number of transactions generating revenue for the brand.

Summary

This section has provided a more in-depth view of how the various factors are related. To increase the quality of insights, more features should be explored and more data collected, so as to provide better and more specific insights for each customer group.

6. Modelling

a) Selecting the best implementation method

# preview of the data
head(cust)
##   Administrative Administrative_Duration Informational Informational_Duration
## 1              0                       0             0                      0
## 2              0                       0             0                      0
## 3              0                       1             0                      1
## 4              0                       0             0                      0
## 5              0                       0             0                      0
## 6              0                       0             0                      0
##   Product_Related Product_Related_Duration Bounce_Rates Exit_Rates Page_Values
## 1               1                 0.000000   0.20000000  0.2000000           0
## 2               2                64.000000   0.00000000  0.1000000           0
## 3               1                 1.000000   0.20000000  0.2000000           0
## 4               2                 2.666667   0.05000000  0.1400000           0
## 5              10               627.500000   0.02000000  0.0500000           0
## 6              19               154.216667   0.01578947  0.0245614           0
##   Special_Day Month Operating_Systems Browser Region Traffic_Type
## 1           0   Feb                 1       1      1            1
## 2           0   Feb                 2       2      1            2
## 3           0   Feb                 4       1      9            3
## 4           0   Feb                 3       2      2            4
## 5           0   Feb                 3       3      1            4
## 6           0   Feb                 2       2      1            3
##        Visitor_Type Weekend Revenue
## 1 Returning_Visitor   FALSE   FALSE
## 2 Returning_Visitor   FALSE   FALSE
## 3 Returning_Visitor   FALSE   FALSE
## 4 Returning_Visitor   FALSE   FALSE
## 5 Returning_Visitor    TRUE   FALSE
## 6 Returning_Visitor   FALSE   FALSE

K-Means clustering is an unsupervised non-linear algorithm that clusters data based on similarity. Each cluster is represented by a centroid, which corresponds to the mean of the points assigned to the cluster. The total intra-cluster variation is calculated using distance metrics such as the Euclidean distance, and each point is assigned to the cluster whose centroid is at the minimum distance (datanovia).

These distance metrics only apply to numeric columns; K-Means therefore isn’t directly applicable to categorical columns, as they are discrete and lack a natural origin (they have no numerical meaning) (datascience stackexchange), (IBM). K-Modes is used instead for categorical columns (Analytics Vidhya).
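
For illustration, below is a minimal sketch of k-modes on the factor columns. It assumes the klaR package is installed (it is not among the libraries loaded above), and the choice of three modes is arbitrary.

# Illustrative sketch: k-modes on the categorical columns via klaR
# (assumed installed; not loaded above). Three modes is an arbitrary choice.
library(klaR)
set.seed(42)
km.modes <- kmodes(cat.df, modes = 3, iter.max = 10)
km.modes$size  # cluster sizes
km.modes$modes # the categorical 'centroids' (per-cluster modes)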

Gower distance is a metric that measures the dissimilarity of two items with mixed numeric and non-numeric data (link). Therefore, this metric can be used for clustering mixed datasets.

The kmeans (stats package) function defines cluster centers by the Euclidean distance, while Kmeans (amap package) supports more distance metrics, all of them numeric. Neither supports custom distance metrics. Partitioning Around Medoids (PAM) supports custom metrics and can be used as an alternative.
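
Below is a minimal sketch of that alternative: Gower dissimilarities from cluster::daisy fed into pam(). The subsample and k = 3 are arbitrary illustrative choices; daisy materializes the full pairwise dissimilarity matrix, which is memory-heavy for all ~12,000 rows.

# Illustrative sketch: PAM on Gower dissimilarities for the mixed dataset.
set.seed(42)
samp <- cust[sample(nrow(cust), 2000), setdiff(names(cust), 'Revenue')] # label left out
gower.d <- daisy(samp, metric = 'gower')    # handles numeric and factor columns
pam.fit <- pam(gower.d, k = 3, diss = TRUE) # k = 3 is an arbitrary choice
table(pam.fit$clustering)  # cluster sizes
samp[pam.fit$id.med, ]     # representative (medoid) rows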

b) KMeans Clustering

As per the observations made above, only the numeric columns will be used.

Base Model

# Column ranges
summary(num_df)
##  Administrative  Administrative_Duration Informational    
##  Min.   : 0.00   Min.   :   0.00         Min.   : 0.0000  
##  1st Qu.: 0.00   1st Qu.:   0.00         1st Qu.: 0.0000  
##  Median : 1.00   Median :   9.00         Median : 0.0000  
##  Mean   : 2.34   Mean   :  81.69         Mean   : 0.5088  
##  3rd Qu.: 4.00   3rd Qu.:  94.60         3rd Qu.: 0.0000  
##  Max.   :27.00   Max.   :3398.75         Max.   :24.0000  
##  Informational_Duration Product_Related  Product_Related_Duration
##  Min.   :   0.00        Min.   :  0.00   Min.   :    0           
##  1st Qu.:   0.00        1st Qu.:  8.00   1st Qu.:  194           
##  Median :   0.00        Median : 18.00   Median :  611           
##  Mean   :  34.84        Mean   : 32.06   Mean   : 1208           
##  3rd Qu.:   0.00        3rd Qu.: 38.00   3rd Qu.: 1476           
##  Max.   :2549.38        Max.   :705.00   Max.   :63974           
##   Bounce_Rates        Exit_Rates       Page_Values       Special_Day     
##  Min.   :0.000000   Min.   :0.00000   Min.   :  0.000   Min.   :0.00000  
##  1st Qu.:0.000000   1st Qu.:0.01426   1st Qu.:  0.000   1st Qu.:0.00000  
##  Median :0.002941   Median :0.02500   Median :  0.000   Median :0.00000  
##  Mean   :0.020448   Mean   :0.04150   Mean   :  5.947   Mean   :0.06191  
##  3rd Qu.:0.016667   3rd Qu.:0.04847   3rd Qu.:  0.000   3rd Qu.:0.00000  
##  Max.   :0.200000   Max.   :0.20000   Max.   :361.764   Max.   :1.00000

The columns have differing ranges; therefore, standard scaling will be used so that columns with larger ranges do not dominate the distance calculations.

Standard Scaling

# Standard scaling

# Original dataset
cust.num.scaled <- as.data.frame(scale(num_df))
head(cust.num.scaled)
##   Administrative Administrative_Duration Informational Informational_Duration
## 1     -0.7028704              -0.4603670    -0.3990051             -0.2464319
## 2     -0.7028704              -0.4603670    -0.3990051             -0.2464319
## 3     -0.7028704              -0.4547312    -0.3990051             -0.2393591
## 4     -0.7028704              -0.4603670    -0.3990051             -0.2464319
## 5     -0.7028704              -0.4603670    -0.3990051             -0.2464319
## 6     -0.7028704              -0.4603670    -0.3990051             -0.2464319
##   Product_Related Product_Related_Duration Bounce_Rates Exit_Rates Page_Values
## 1      -0.6966995               -0.6292416  3.956604814  3.4289586  -0.3188629
## 2      -0.6742674               -0.5958906 -0.450601876  1.2656015  -0.3188629
## 3      -0.6966995               -0.6287205  3.956604814  3.4289586  -0.3188629
## 4      -0.6742674               -0.6278520  0.651199796  2.1309443  -0.3188629
## 5      -0.4948106               -0.3022450 -0.009881207  0.1839229  -0.3188629
## 6      -0.2929218               -0.5488778 -0.102664499 -0.3664048  -0.3188629
##   Special_Day
## 1  -0.3101433
## 2  -0.3101433
## 3  -0.3101433
## 4  -0.3101433
## 5  -0.3101433
## 6  -0.3101433
# Replacing columns in the original dataset
cust.scaled <- data.frame(cust)
cust.scaled[,names(cust.num.scaled)] <- cust.num.scaled
head(cust.scaled)
##   Administrative Administrative_Duration Informational Informational_Duration
## 1     -0.7028704              -0.4603670    -0.3990051             -0.2464319
## 2     -0.7028704              -0.4603670    -0.3990051             -0.2464319
## 3     -0.7028704              -0.4547312    -0.3990051             -0.2393591
## 4     -0.7028704              -0.4603670    -0.3990051             -0.2464319
## 5     -0.7028704              -0.4603670    -0.3990051             -0.2464319
## 6     -0.7028704              -0.4603670    -0.3990051             -0.2464319
##   Product_Related Product_Related_Duration Bounce_Rates Exit_Rates Page_Values
## 1      -0.6966995               -0.6292416  3.956604814  3.4289586  -0.3188629
## 2      -0.6742674               -0.5958906 -0.450601876  1.2656015  -0.3188629
## 3      -0.6966995               -0.6287205  3.956604814  3.4289586  -0.3188629
## 4      -0.6742674               -0.6278520  0.651199796  2.1309443  -0.3188629
## 5      -0.4948106               -0.3022450 -0.009881207  0.1839229  -0.3188629
## 6      -0.2929218               -0.5488778 -0.102664499 -0.3664048  -0.3188629
##   Special_Day Month Operating_Systems Browser Region Traffic_Type
## 1  -0.3101433   Feb                 1       1      1            1
## 2  -0.3101433   Feb                 2       2      1            2
## 3  -0.3101433   Feb                 4       1      9            3
## 4  -0.3101433   Feb                 3       2      2            4
## 5  -0.3101433   Feb                 3       3      1            4
## 6  -0.3101433   Feb                 2       2      1            3
##        Visitor_Type Weekend Revenue
## 1 Returning_Visitor   FALSE   FALSE
## 2 Returning_Visitor   FALSE   FALSE
## 3 Returning_Visitor   FALSE   FALSE
## 4 Returning_Visitor   FALSE   FALSE
## 5 Returning_Visitor    TRUE   FALSE
## 6 Returning_Visitor   FALSE   FALSE
# Dataset with outliers removed
num.cols2 <- unlist(lapply(cust.no, is.numeric))
num.df2 <- cust.no[,num.cols2]
cust.no.scaled <- as.data.frame(scale(num.df2))
head(cust.no.scaled)
##    Administrative Administrative_Duration Informational Informational_Duration
## 2      -0.6447929              -0.6052036           NaN                    NaN
## 5      -0.6447929              -0.6052036           NaN                    NaN
## 6      -0.6447929              -0.6052036           NaN                    NaN
## 11     -0.6447929              -0.6052036           NaN                    NaN
## 13     -0.6447929              -0.6052036           NaN                    NaN
## 14     -0.6447929              -0.6052036           NaN                    NaN
##    Product_Related Product_Related_Duration Bounce_Rates Exit_Rates Page_Values
## 2       -1.0985386               -0.9595217   -0.5969676  2.5851188         NaN
## 5       -0.4491690                0.2977482    0.8990293  0.6012199         NaN
## 6        0.2813717               -0.7582321    0.5840826 -0.4081321         NaN
## 11      -1.0173674               -0.2210011   -0.5969676  1.2625196         NaN
## 13      -0.6926826               -0.4764712   -0.5969676 -0.2490224         NaN
## 14      -0.7738538               -0.8836616   -0.5969676  1.2625196         NaN
##    Special_Day
## 2          NaN
## 5          NaN
## 6          NaN
## 11         NaN
## 13         NaN
## 14         NaN

Standardizing the informational, informational duration, page values and special day columns produces NaN values because, after outlier removal, these columns are constant (zero variance), so dividing by a standard deviation of zero yields NaN. A different scaling method will therefore be used.
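A minimal illustration of the cause: scale() centres each column by its mean and divides by its standard deviation, and a constant column has a standard deviation of zero.

# Constant column: mean = 0, sd = 0, so every value becomes 0/0 = NaN
scale(c(0, 0, 0, 0))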

Min-Max Scaling

# Dataset with outliers removed
process <- preProcess(as.data.frame(num.df2), method = c('range'))
norm.scaled <- predict(process, as.data.frame(num.df2))
head(norm.scaled)
##    Administrative Administrative_Duration Informational Informational_Duration
## 2               0                       0             0                      0
## 5               0                       0             0                      0
## 6               0                       0             0                      0
## 11              0                       0             0                      0
## 13              0                       0             0                      0
## 14              0                       0             0                      0
##    Product_Related Product_Related_Duration Bounce_Rates Exit_Rates Page_Values
## 2       0.03278689               0.03267251    0.0000000  1.0000000           0
## 5       0.16393443               0.32034374    0.3230769  0.5000000           0
## 6       0.31147541               0.07872884    0.2550607  0.2456140           0
## 11      0.04918033               0.20165064    0.0000000  0.6666667           0
## 13      0.11475410               0.14319748    0.0000000  0.2857143           0
## 14      0.09836066               0.05002978    0.0000000  0.6666667           0
##    Special_Day
## 2            0
## 5            0
## 6            0
## 11           0
## 13           0
## 14           0
# Replacing columns in the original dataset
cust.no.scaled <- data.frame(cust.no)
cust.no.scaled[,names(norm.scaled)] <- norm.scaled
head(cust.no.scaled)
##    Administrative Administrative_Duration Informational Informational_Duration
## 2               0                       0             0                      0
## 5               0                       0             0                      0
## 6               0                       0             0                      0
## 11              0                       0             0                      0
## 13              0                       0             0                      0
## 14              0                       0             0                      0
##    Product_Related Product_Related_Duration Bounce_Rates Exit_Rates Page_Values
## 2       0.03278689               0.03267251    0.0000000  1.0000000           0
## 5       0.16393443               0.32034374    0.3230769  0.5000000           0
## 6       0.31147541               0.07872884    0.2550607  0.2456140           0
## 11      0.04918033               0.20165064    0.0000000  0.6666667           0
## 13      0.11475410               0.14319748    0.0000000  0.2857143           0
## 14      0.09836066               0.05002978    0.0000000  0.6666667           0
##    Special_Day Month Operating_Systems Browser Region Traffic_Type
## 2            0   Feb                 2       2      1            2
## 5            0   Feb                 3       3      1            4
## 6            0   Feb                 2       2      1            3
## 11           0   Feb                 1       1      3            3
## 13           0   Feb                 1       1      1            3
## 14           0   Feb                 2       5      1            3
##         Visitor_Type Weekend Revenue
## 2  Returning_Visitor   FALSE   FALSE
## 5  Returning_Visitor    TRUE   FALSE
## 6  Returning_Visitor   FALSE   FALSE
## 11 Returning_Visitor   FALSE   FALSE
## 13 Returning_Visitor   FALSE   FALSE
## 14 Returning_Visitor   FALSE   FALSE
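For reference, the transformation applied by preProcess with method = 'range' can be written by hand; a minimal sketch (the minmax helper is illustrative, not part of the original pipeline):

# Manual min-max scaling of a single column to [0, 1]:
# x' = (x - min(x)) / (max(x) - min(x))
minmax <- function(x) (x - min(x)) / (max(x) - min(x))
# e.g. minmax(num.df2$Exit_Rates) should reproduce norm.scaled$Exit_Rates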

KMEANS CLUSTERING

i) Original dataset
# Model before optimization

# Seed for reproducible results
set.seed(0)
# Kmeans models
base.km2 <- kmeans(cust.num.scaled, centers = 2, nstart = 25)
base.km3 <- kmeans(cust.num.scaled, centers = 3, nstart = 25)
base.km4 <- kmeans(cust.num.scaled, centers = 4, nstart = 25)
base.km5 <- kmeans(cust.num.scaled, centers = 5, nstart = 25)
# Visualizations
# Note: par(mfrow) has no effect on ggplot objects; ggarrange (ggpubr) is
# used to arrange the four cluster plots in a 2x2 grid instead.
plt1 <- fviz_cluster(base.km2, geom = "point", data = cust.num.scaled) + ggtitle("K = 2")
plt2 <- fviz_cluster(base.km3, geom = "point", data = cust.num.scaled) + ggtitle("K = 3")
plt3 <- fviz_cluster(base.km4, geom = "point", data = cust.num.scaled) + ggtitle("K = 4")
plt4 <- fviz_cluster(base.km5, geom = "point", data = cust.num.scaled) + ggtitle("K = 5")
ggarrange(plt1, plt2, plt3, plt4, ncol = 2, nrow = 2)

For k = 2, the second cluster contains the majority of the outliers. k = 2 and k = 3 show the clearest cluster distinctions.

Optimal Number of Clusters

1. Elbow Method

# Elbow plot
fviz_nbclust(x = cust.num.scaled, FUNcluster = kmeans, method = 'wss') + 
  labs(title = 'Elbow Method')
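The quantity the elbow plot summarizes can also be computed directly; a minimal sketch of the underlying computation:

# Total within-cluster sum of squares (WSS) for each candidate k;
# the 'elbow' is the k after which WSS stops dropping sharply.
set.seed(0)
wss <- sapply(1:10, function(k)
  kmeans(cust.num.scaled, centers = k, nstart = 10)$tot.withinss)
plot(1:10, wss, type = 'b', xlab = 'Number of clusters k', ylab = 'Total WSS')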

2. Silhouette Method

# Silhouette Method plot
fviz_nbclust(x = cust.num.scaled, FUNcluster = kmeans, method = 'silhouette') + 
  labs(title = 'Silhouette Method')
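For a single candidate k, the average silhouette width that this plot reports can be computed directly with cluster::silhouette; a sketch on a random subsample, since the full pairwise distance matrix would be memory-heavy:

# Average silhouette width for k = 3 on a 2000-row subsample
set.seed(0)
idx <- sample(nrow(cust.num.scaled), 2000)
km3 <- kmeans(cust.num.scaled[idx, ], centers = 3, nstart = 10)
sil <- silhouette(km3$cluster, dist(cust.num.scaled[idx, ]))
mean(sil[, 'sil_width'])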

3. Gap Statistic Method

# Seed value
set.seed(0)
# Gap statistic
  # Setting max iterations
Kmeans.func <- function(x,k) list(cluster=kmeans(x, k, iter.max=50))
# Visualization
fviz_nbclust(cust.num.scaled, FUNcluster=Kmeans.func, method="gap_stat", 
             verbose = FALSE) +
  labs(title = 'Gap Statistic Method')
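The same statistic can also be obtained with cluster::clusGap, which returns the per-k gap values and standard errors directly; a sketch, with B = 50 bootstrap reference sets to keep the runtime moderate:

# Gap statistic via clusGap, reusing the Kmeans.func wrapper defined above
set.seed(0)
gap <- clusGap(cust.num.scaled, FUNcluster = Kmeans.func, K.max = 10, B = 50)
fviz_gap_stat(gap)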

The optimal k value is 3.

ii) Dataset with outliers removed

Optimal Number of Clusters

1. Elbow Method

# Elbow plot
fviz_nbclust(x = norm.scaled, FUNcluster = kmeans, method = 'wss') + 
  labs(title = 'Elbow Method')

2. Silhouette Method

# Silhouette Method plot
fviz_nbclust(x = norm.scaled, FUNcluster = kmeans, method = 'silhouette') + 
  labs(title = 'Silhouette Method')

3. Gap Statistic Method

# Seed value
set.seed(0)
# Gap statistic
  # Setting max iterations
Kmeans.func <- function(x,k) list(cluster=kmeans(x, k, iter.max=50))
# Visualization
fviz_nbclust(norm.scaled, FUNcluster=Kmeans.func, method="gap_stat", 
             verbose = FALSE) +
  labs(title = 'Gap Statistic Method')

The optimal k value is 4. However, this dataset will not be used, as a significant portion of the records (~40%) was lost when the outliers were removed.

iii) PCA

Applying PCA on the entire dataset.

Encoding

# Dataset copy
cust.encoded <- data.frame(cust.scaled)
# For PCA all columns need to be numeric
cust.encoded$Month <- as.integer(cust.encoded$Month)
cust.encoded$Visitor_Type <- as.integer(cust.encoded$Visitor_Type)
cust.encoded$Weekend <- as.integer(cust.encoded$Weekend)
cust.encoded$Operating_Systems <- as.integer(cust.encoded$Operating_Systems)
cust.encoded$Region <- as.integer(cust.encoded$Region)
cust.encoded$Traffic_Type <- as.integer(cust.encoded$Traffic_Type)
cust.encoded$Browser <- as.integer(cust.encoded$Browser)

head(cust.encoded)
##   Administrative Administrative_Duration Informational Informational_Duration
## 1     -0.7028704              -0.4603670    -0.3990051             -0.2464319
## 2     -0.7028704              -0.4603670    -0.3990051             -0.2464319
## 3     -0.7028704              -0.4547312    -0.3990051             -0.2393591
## 4     -0.7028704              -0.4603670    -0.3990051             -0.2464319
## 5     -0.7028704              -0.4603670    -0.3990051             -0.2464319
## 6     -0.7028704              -0.4603670    -0.3990051             -0.2464319
##   Product_Related Product_Related_Duration Bounce_Rates Exit_Rates Page_Values
## 1      -0.6966995               -0.6292416  3.956604814  3.4289586  -0.3188629
## 2      -0.6742674               -0.5958906 -0.450601876  1.2656015  -0.3188629
## 3      -0.6966995               -0.6287205  3.956604814  3.4289586  -0.3188629
## 4      -0.6742674               -0.6278520  0.651199796  2.1309443  -0.3188629
## 5      -0.4948106               -0.3022450 -0.009881207  0.1839229  -0.3188629
## 6      -0.2929218               -0.5488778 -0.102664499 -0.3664048  -0.3188629
##   Special_Day Month Operating_Systems Browser Region Traffic_Type Visitor_Type
## 1  -0.3101433     1                 1       1      1            1            3
## 2  -0.3101433     1                 2       2      1            2            3
## 3  -0.3101433     1                 4       1      9            3            3
## 4  -0.3101433     1                 3       2      2            4            3
## 5  -0.3101433     1                 3       3      1            4            3
## 6  -0.3101433     1                 2       2      1            3            3
##   Weekend Revenue
## 1       1   FALSE
## 2       1   FALSE
## 3       1   FALSE
## 4       1   FALSE
## 5       2   FALSE
## 6       1   FALSE
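The same encoding can be written more compactly; a minimal sketch, assuming the categorical columns were converted to factors during data preparation:

# Convert every factor column to its integer codes in one pass
cat.cols <- names(cust.encoded)[sapply(cust.encoded, is.factor)]
cust.encoded[cat.cols] <- lapply(cust.encoded[cat.cols], as.integer)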
#Features
X <- cust.encoded[,-ncol(cust.encoded)]
# Label
y <- cust.encoded[,ncol(cust.encoded)]
# PCA
pca <- prcomp(X)
# Scree Plot
fviz_eig(pca)

# Summary
summary(pca)
## Importance of components:
##                           PC1    PC2    PC3     PC4     PC5     PC6     PC7
## Standard deviation     4.0364 3.1929 2.4097 1.79928 1.69865 1.30384 1.03724
## Proportion of Variance 0.3517 0.2200 0.1253 0.06988 0.06228 0.03669 0.02322
## Cumulative Proportion  0.3517 0.5717 0.6970 0.76692 0.82920 0.86590 0.88912
##                            PC8     PC9    PC10    PC11    PC12    PC13    PC14
## Standard deviation     1.00382 0.96764 0.92462 0.86525 0.65752 0.64784 0.59312
## Proportion of Variance 0.02175 0.02021 0.01845 0.01616 0.00933 0.00906 0.00759
## Cumulative Proportion  0.91087 0.93108 0.94953 0.96569 0.97503 0.98408 0.99168
##                           PC15    PC16    PC17
## Standard deviation     0.42214 0.35148 0.28956
## Proportion of Variance 0.00385 0.00267 0.00181
## Cumulative Proportion  0.99552 0.99819 1.00000
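The cumulative proportions reported above come directly from the component standard deviations; a quick check:

# Proportion of variance = sdev^2 / total variance; PC11 cumulative ~0.9657
var.explained <- pca$sdev^2 / sum(pca$sdev^2)
cumsum(var.explained)[11]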
# Up to PC11, the cumulative proportion is 0.96569 (>95%), therefore,
# the remaining components can be excluded.
pca.select <- as.data.frame(-pca$x[,1:11])

# Visualization - Elbow method
fviz_nbclust(pca.select, kmeans, method = 'wss')

# Silhouette Method
fviz_nbclust(pca.select, kmeans, method = 'silhouette')

# Seed value
set.seed(0)
# Gap statistic
  # Setting max iterations
Kmeans.func <- function(x,k) list(cluster=kmeans(x, k, iter.max=50))
# Visualization
fviz_nbclust(pca.select, FUNcluster=Kmeans.func, method="gap_stat", 
             verbose = FALSE) +
  labs(title = 'Gap Statistic Method')

Using PCA to transform the data before clustering reduces the optimal number of clusters back to 3 (from 4 for the outlier-removed dataset).

K = 2

# Seed for reproducible results
set.seed(0)
# Kmeans model
base.pca2 <- kmeans(pca.select, centers = 2, nstart = 25)
# Visualization
fviz_cluster(base.pca2, data = pca.select)

The clusters overlap; the cluster assignments will therefore be compared to the actual Revenue label.

# Predicted clusters
y_pred <- base.pca2$cluster
# Frequency
table(y_pred)
## y_pred
##     1     2 
##  2099 10112
# Frequency
table(cust$Revenue)
## 
## FALSE  TRUE 
## 10303  1908

From the frequency distributions, cluster 2 appears to correspond to the FALSE category (10112 vs. 10303) and cluster 1 to the TRUE category (2099 vs. 1908). Treating the difference of 191 observations in each margin as misassignments gives approximate counts of FP = 191 and FN = 191, from which approximate performance metrics follow.
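These counts are approximated from the marginal frequencies; a direct cross-tabulation would give the exact confusion counts. A minimal sketch, assuming the rows of cust still align with pca.select:

# Exact confusion counts: rows = assigned cluster, columns = actual label
table(Cluster = y_pred, Revenue = cust$Revenue)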

precision <- 1908/(1908+191)
recall <- 1908/(1908+191)
f1.score <- 2*((precision*recall)/(precision+recall))
accuracy <- ((1908 + 10112)/(1908+191+10112+191))*100

print(paste0('Accuracy = ', round(accuracy, 2), '%'))
## [1] "Accuracy = 96.92%"
cat('Precision = ', round(precision,2),'\n')
## Precision =  0.91
cat('Recall = ', round(recall,2), '\n')
## Recall =  0.91
cat('F1 Score = ', round(f1.score,2), '\n')
## F1 Score =  0.91

The clustering achieves high performance scores.

K = 3

# Seed for reproducible results
set.seed(0)
# Kmeans model
base.pca3 <- kmeans(pca.select, centers = 3, nstart = 25)
# Visualization
fviz_cluster(base.pca3, data = pca.select)

Though outliers are present, the clusters do not show clear separation in their positions.

The observed optimal values of k suggest that customers cannot be clustered by revenue alone; additional customer groupings exist. Domain knowledge is required to infer their significance.

Hierarchical Clustering

Hierarchical clustering has the advantage that custom distance metrics can be implemented. Gower distance handles mixed variable types, so the entire dataset can be used as is.
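As a quick illustration of how Gower distance mixes variable types: each numeric column contributes |x - y| divided by its range, each categorical column contributes 0 if equal and 1 otherwise, and the distance is the average over columns. A toy sketch on hypothetical data:

# Two rows, one numeric and one categorical column
toy <- data.frame(num = c(1, 3), cat = factor(c('a', 'b')))
# Numeric part: |1 - 3| / range 2 = 1; categorical part: unequal = 1
# Gower distance = (1 + 1) / 2 = 1
daisy(toy, metric = 'gower')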

Original Dataset

Choosing the best linkage method

# Features
X2 <- cust.scaled[,-ncol(cust.scaled)]
y <- cust.scaled[,ncol(cust.scaled)]

Computing the agglomerative coefficient for each linkage is time consuming; instead, three linkage methods will be compared visually using their dendrograms (a quantitative sketch follows the three plots).

# Gower distance
gower <- daisy(X2, metric = c("gower"))
# Linkage = ward.d2
cluster <- hclust(gower, method = 'ward.D2')
# Dendrogram
plot(cluster, main = 'Dendrogram')

# Linkage = complete
cluster <- hclust(gower, method = 'complete')
# Dendrogram
plot(cluster, main = 'Dendrogram')

# Linkage = average
cluster <- hclust(gower, method = 'average')
# Dendrogram
plot(cluster, main = 'Dendrogram')
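If runtime allowed, the agglomerative coefficient (closer to 1 indicates stronger clustering structure) could be compared across linkages from the same Gower dissimilarity via cluster::agnes; a sketch, noting it is slow on ~12k observations:

# Agglomerative coefficient per linkage method
sapply(c('average', 'complete', 'ward'), function(m)
  agnes(gower, diss = TRUE, method = m)$ac)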

The Ward linkage method provides the best clustering results.

Pruning

The common methods for determining the optimal k (elbow, silhouette and gap statistic) will not be used here, as they require fully numeric input and coercing the mixed-type data would introduce NAs; k is instead selected visually from the dendrogram.

# Selecting k from the dendrogram
cluster <- hclust(gower, method = 'ward.D2')
# Dendrogram
plot(cluster, labels=FALSE)
rect.hclust(cluster, k=8, border="red")

Based on the dendrogram, the optimal k = 8.
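As a cross-check that avoids numeric coercion entirely, the average silhouette width can be computed on the Gower dissimilarity itself for several candidate cuts of the tree; a minimal sketch:

# Average silhouette width for k = 2..10 cuts of the Ward dendrogram
avg.sil <- sapply(2:10, function(k)
  mean(silhouette(cutree(cluster, k), gower)[, 'sil_width']))
plot(2:10, avg.sil, type = 'b', xlab = 'k', ylab = 'Average silhouette width')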

# Pruning
cluster.cut <- cutree(cluster, 8)
# Appending the column
X2 <- cust.scaled[,-ncol(cust.scaled)]
Clusters <- as.factor(cluster.cut)
X2 <- cbind(X2, Clusters)
head(X2[,c('Clusters')])
## [1] 1 2 1 3 4 2
## Levels: 1 2 3 4 5 6 7 8
# Visualization
ggplot(X2, aes(x=Product_Related, y=Product_Related_Duration, color = Clusters)) + geom_point()
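To help relate the eight groups back to the label, the share of revenue-generating sessions in each cluster can be inspected; a minimal sketch, assuming the rows of cust still align with X2:

# Proportion of Revenue = TRUE sessions within each hierarchical cluster
tapply(cust$Revenue, Clusters, mean)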

From the hierarchical clustering procedure, more clusters have been derived from the data. This suggests that revenue varies in more ways than simply its presence or absence; domain knowledge should be used to interpret these variations. Furthermore, there are 8 categorical variables in the dataset and k = 8; domain knowledge is required to determine whether this is a coincidence or whether the number of categorical columns affects the number of clusters.

Comparison of Models

From this study, the following observations were made:

  1. After KMeans clustering, the optimal number of clusters is 3, whereas hierarchical clustering produced more clusters. This could mean that there are more factors defining customer groups apart from revenue, or that revenue varies in more ways than a simple increase or decrease. The latter clustering method therefore provides more granular insights.
  2. Hierarchical clustering is more flexible in that custom distance metrics can be used, allowing the clustering of mixed-type datasets. KMeans only works with numerical columns, so PCA had to be used to transform the data into a form suitable for KMeans clustering.
  3. Selecting the optimal number of clusters was more precise for KMeans, as pruning the hierarchical clustering was done visually.

8. Implementing the Solution

Analysis Summary

Univariate Analysis

a) Categorical

  1. The months of May, November and March have the highest frequencies.
  2. The second operating system has the highest frequency, while 5 and 7 have the least number of records.
  3. The second browser has the highest frequency.
  4. The first region has the highest frequency.
  5. The second traffic type has the highest frequency.
  6. The returning visitor class has the highest frequency.
  7. Most activity/traffic did not take place on the weekend.
  8. Lack of revenue had a higher frequency.

b) Numerical

  1. The administrative pages were not visited most of the time, but had an average visit duration of ~82 seconds.
  2. The informational pages were rarely visited and had an average visit duration of ~35 seconds.
  3. The product related pages had an average visit duration of ~20 minutes, the highest page duration observed.
  4. The most frequent bounce rate was zero.
  5. A low exit rate of 0.01 had the highest frequency.
  6. The majority of site visits did not take place close to a special day.

Bivariate Summary

a) Categorical-Categorical

  1. The months of November, May and December had the highest revenue frequency counts. Furthermore, May, November, March and December had the highest frequencies for a lack of revenue.
  2. The second operating system has the highest revenue count. However, it also records the highest frequency for a lack of revenue, followed by the third, then the first.
  3. The second browser has the highest revenue frequency, but this value is overshadowed by a lack of revenue on the same browser. The first browser also has a high frequency for a lack of revenue. This could be attributed to the popularity of their use.
  4. The first and third regions had the highest revenue frequency. They also had the highest frequency for no revenue. A factor could be the population of the regions.
  5. The second traffic type had the highest revenue frequency, by a large margin. It also has the highest frequency for a lack of revenue. All traffic types recorded some revenue.
  6. Returning visitors led to the highest revenue frequency counts. However, this count is lower than their count of sessions without revenue.
  7. Weekdays had the highest frequencies both for revenue and for a lack of it.

b) Numeric-Numeric

All the columns tested had very weak correlations with each other. A correlation matrix was plotted in the multivariate section for a more in-depth and efficient analysis.

c) Numeric-Categorical

  1. The more administrative pages visited, the higher the revenue. The median number of pages was 2.
  2. The median time spent on the administrative pages leading to revenue was 52 seconds.
  3. Visits to the informational pages did not contribute to the brand’s revenue.
  4. There was little to no traffic to the informational pages, which explains their lack of contribution to the brand’s revenue.
  5. The product related pages had a higher number of visits than the previous page types, leading to a higher contribution to revenue; the median number of pages visited was 38.
  6. Most visitors to the product related pages carried out transactions, leading to more instances of revenue.
  7. The median duration on the product related pages was 20 minutes.
  8. Bounce rates did not contribute to the revenue. The median bounce rate was 0.004.
  9. The higher the exit rate, the lower the revenue. Sessions with a median exit rate of 0.029 produced no revenue, while lower rates around 0.025 were associated with revenue.
  10. The median page value before a completed e-commerce transaction contributing to revenue was 14.
  11. The special day feature made no contribution to the revenue.

Multivariate Analysis

  1. Lower exit and bounce rates lead to a higher number of transactions.
  2. Most users spend a small amount of time on the product related pages. There is no clear distinction between customers who do and do not make transactions.
  3. The lower the number of product related pages visited, as well as the page values, the higher the number of transactions leading to revenue for the brand.

Modelling

  1. KMeans clustering -> clusters = 3.
  2. Hierarchical Clustering -> clusters = 8.

This shows that customers can be grouped further, beyond whether or not they contribute to the brand’s revenue. Hierarchical clustering provides more granular clusters, while KMeans provides a more generalized clustering. To infer the significance of the groups, the sales and marketing team needs to apply its domain knowledge for more accurate insights into each group, in both cluster types.

9. Challenging the Solution

Various techniques were implemented during the modelling procedure to obtain the optimal number of clusters.

  1. KMeans and hierarchical clustering methods were employed and their performances compared. KMeans provided a more general view of customer groups, while hierarchical clustering provided a more granular approach, grouping customers into 8 clusters.
    1. As KMeans is only valid for numerical data, the numeric columns were used to train the model first, giving k = 3.
    2. For the dataset with outliers removed, k = 4. This dataset was excluded, as ~40% of the records were lost when the outliers were removed.
    3. PCA was used to transform the entire dataset into numeric form so that all variables could be included. The result was k = 3. This model was selected because all variables were included in the modelling process.
  2. Hierarchical clustering was flexible, as it allowed the use of custom distance metrics. Gower distance measures the similarities and differences in mixed-variable datasets; using it led to a larger number of clusters (8), indicating a larger number of unique customer groups.

For a more generalized grouping of customers, the KMeans clusters can be used; for a more in-depth classification, the hierarchical model clusters can be used.

10. Conclusion

In conclusion, the following customer groups contributed to the brand’s revenue:

  1. customers who purchased goods in the months of November, May and December carried out the most transactions; fewer customers made transactions outside these months,
  2. customers in the first and third regions carried out more transactions compared to customers in other regions,
  3. returning customers carried out the majority of the transactions,
  4. users on the second operating system, browser and traffic type also made the most transactions.

Furthermore, the product related pages were of the most importance to these users.

Finally, the modelling section showed that there exist more customer groups, possibly beyond those relating to revenue contribution. The KMeans approach gave k = 3, while hierarchical clustering gave k = 8, showing that the latter model provides more granular insights into possible customer segments.

11. Recommendations

  1. Effort should be placed in retaining current customers, as well as attracting new ones.
  2. The months of November, May and December had the highest revenue frequency counts. Marketing strategies can therefore be designed to maximize sales during these months. An in-depth study can also be done to uncover the reason behind the dip in performance in the remaining months.
  3. The second operating system has the highest revenue count, so more attention should be focused on it. Further investigation should be done into the lagging performance of the other systems and, if necessary, the brand could focus only on the most lucrative operating systems to save time and money.
  4. The first and third regions had the highest revenue frequency. Marketing should therefore be boosted in order to attract more users in these regions.
  5. Returning visitors led to the highest revenue frequency counts, although most of their sessions still ended without revenue; converting returning visitors into buyers should therefore be a priority.
  6. Weekdays had the highest revenue frequencies; the site should therefore run reliably on weekdays for customer satisfaction, and major site maintenance can be scheduled for weekends.
  7. Customers spend the most time on the product related pages; these pages should therefore be structured to encourage users to complete transactions as efficiently as possible.
  8. Business domain knowledge is required to determine whether the k = 8 clusters are related to the 8 categorical variables, or whether this is a coincidence.
  9. The hierarchical clusters should be used if more granular customer groupings are required, and the KMeans clusters if a more general grouping suffices.
  10. More data, with relevant features, should be collected to provide better insights into the characteristics of the various customer groups.

12. Follow Up Questions

a) Did we have the right data?

Yes, as meaningful insights on customer groups were derived from the data. Furthermore, training the KMeans and hierarchical models on the data led to the formation of meaningful customer group clusters.

b) Do we need other data to answer our question?

Yes, more granular data (with more relevant features) and more complete data (covering all months) are necessary to derive further insights about the brand’s customer groups. The additional data would also be used to confirm the validity of the clusters obtained during the modelling process.

c) Did we have the right question?

Yes, as the aim of the project is to identify the characteristics of the brand’s customer groups, as per the client’s request.

# Restoring the default warning behaviour
options(warn = defaultW)