Preliminaries: setting up the libraries and loading the dataset

## Libraries used

library(ggplot2)
## Registered S3 methods overwritten by 'ggplot2':
##   method         from 
##   [.quosures     rlang
##   c.quosures     rlang
##   print.quosures rlang
library(tidyverse)
## Registered S3 method overwritten by 'rvest':
##   method            from
##   read_xml.response xml2
## -- Attaching packages ----------------------------------- tidyverse 1.2.1 --
## v tibble  2.1.1       v purrr   0.3.2  
## v tidyr   0.8.3       v dplyr   0.8.0.1
## v readr   1.3.1       v stringr 1.4.0  
## v tibble  2.1.1       v forcats 0.4.0
## Warning: package 'stringr' was built under R version 3.6.1
## -- Conflicts -------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(gmodels)
## Warning: package 'gmodels' was built under R version 3.6.1
library(ggmosaic)
## Warning: package 'ggmosaic' was built under R version 3.6.1
library(corrplot)
## corrplot 0.84 loaded
library(caret)
## Warning: package 'caret' was built under R version 3.6.1
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
library(rpart)
library(rpart.plot)
## Warning: package 'rpart.plot' was built under R version 3.6.1
library(fpc)
library(data.table)
## 
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
## 
##     between, first, last
## The following object is masked from 'package:purrr':
## 
##     transpose
library(cluster)

## Loading the dataset
shopper_data <- read.csv("C:/Users/qasim/OneDrive/Desktop/York U/2 - Basic Methods of Data Analytics/Final Project/online_shoppers_intention.csv")

Online Shoppers Intention

EDA + Clustering Algorithms + Classification Algorithms

Abstract

We are using the “Online Shoppers Purchasing Intention” UCI dataset, which describes customers’ visits to an online store and whether each visit ended in the purchase of a product. The goal of the analysis is to use clustering and classification algorithms to build predictive models of shoppers’ intentions.

1. Introduction

The dataset used is the “Online Shoppers Purchasing Intention” UCI dataset (detailed description at: https://archive.ics.uci.edu/ml/datasets/Online+Shoppers+Purchasing+Intention+Dataset). It is almost identical to the one used in [Sakar, C.O., Polat, S.O., Katircioglu, M. et al. Neural Comput & Applic (2018)], the difference being that attributes that could be used as “personally identifiable information” have been removed due to privacy concerns.

The data describes shoppers’ behavior as they visit a website. Each “session” is measured along 17 dimensions, and we note whether the visitor generated revenue (FALSE or TRUE).

2. Objective

The objective of this analysis is to provide a reliable and feasible algorithm to predict shopper behavior. The target is the binary ‘FALSE’ or ‘TRUE’ value indicating whether a website visitor decided to buy. The plan is to use clustering and, where appropriate, classification techniques to make predictions about shoppers’ intentions.

3. Dataset

Here is a brief description of the variables.

The dataset consists of 10 numerical and 8 categorical attributes.

Administrative, Administrative Duration, Informational, Informational Duration, Product Related and Product Related Duration represent the number of different types of pages visited by the visitor in that session and total time spent in each of these page categories. The values of these features are derived from the URL information of the pages visited by the user and updated in real time when a user takes an action, e.g. moving from one page to another.

The Bounce Rate, Exit Rate and Page Value features represent the metrics measured by “Google Analytics” for each page in the e-commerce site.

Bounce Rate - for a web page, the percentage of visitors who enter the site from that page and then leave (“bounce”) without triggering any other requests to the analytics server during that session. In other words, it captures single-page visits to the website.

Exit Rate - for a specific web page, the percentage of all pageviews of that page that were the last pageview of the session. For example, if a page received 100 pageviews and 20 of them were the last page viewed in their sessions, its exit rate is 20%.

Page Value - the average value of a web page that a user visited before completing an e-commerce transaction. It tells you which specific pages of the site offer the most value. For instance, a product page on an e-commerce site will usually have a higher page value than a resource page.

Special Day - indicates the closeness of the site visiting time to a specific special day (e.g. Mother’s Day, Valentine’s Day) on which sessions are more likely to be finalized with a transaction. The value of this attribute is determined by considering the dynamics of e-commerce such as the duration between the order date and the delivery date. For example, for Valentine’s Day, this value is nonzero between February 2 and February 12, zero before and after this period unless it is close to another special day, and reaches its maximum value of 1 on February 8.
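
To see which discrete values SpecialDay actually takes, a quick tabulation such as the following could be run (a sketch):

## Quick check (sketch): tabulate the discrete SpecialDay values
table(shopper_data$SpecialDay)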

The dataset also includes the operating system, browser, region, traffic type, visitor type (returning or new visitor), a Boolean value indicating whether the date of the visit is a weekend, and the month of the year.

Class Label (desired target):

Revenue - has the client purchased a product on the website? (binary: ‘TRUE’, ‘FALSE’)

Initial exploration of the dataset:

Checking number of rows and columns.

ncol(shopper_data)
## [1] 18
nrow(shopper_data)
## [1] 12330

Quickly preview the data structure with head(), str(), and summary().

head(shopper_data,5)
##   Administrative Administrative_Duration Informational
## 1              0                       0             0
## 2              0                       0             0
## 3              0                       0             0
## 4              0                       0             0
## 5              0                       0             0
##   Informational_Duration ProductRelated ProductRelated_Duration
## 1                      0              1                0.000000
## 2                      0              2               64.000000
## 3                      0              1                0.000000
## 4                      0              2                2.666667
## 5                      0             10              627.500000
##   BounceRates ExitRates PageValues SpecialDay Month OperatingSystems
## 1        0.20      0.20          0          0   Feb                1
## 2        0.00      0.10          0          0   Feb                2
## 3        0.20      0.20          0          0   Feb                4
## 4        0.05      0.14          0          0   Feb                3
## 5        0.02      0.05          0          0   Feb                3
##   Browser Region TrafficType       VisitorType Weekend Revenue
## 1       1      1           1 Returning_Visitor   FALSE   FALSE
## 2       2      1           2 Returning_Visitor   FALSE   FALSE
## 3       1      9           3 Returning_Visitor   FALSE   FALSE
## 4       2      2           4 Returning_Visitor   FALSE   FALSE
## 5       3      1           4 Returning_Visitor    TRUE   FALSE
str(shopper_data)
## 'data.frame':    12330 obs. of  18 variables:
##  $ Administrative         : int  0 0 0 0 0 0 0 1 0 0 ...
##  $ Administrative_Duration: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Informational          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Informational_Duration : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ ProductRelated         : int  1 2 1 2 10 19 1 0 2 3 ...
##  $ ProductRelated_Duration: num  0 64 0 2.67 627.5 ...
##  $ BounceRates            : num  0.2 0 0.2 0.05 0.02 ...
##  $ ExitRates              : num  0.2 0.1 0.2 0.14 0.05 ...
##  $ PageValues             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SpecialDay             : num  0 0 0 0 0 0 0.4 0 0.8 0.4 ...
##  $ Month                  : Factor w/ 10 levels "Aug","Dec","Feb",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ OperatingSystems       : int  1 2 4 3 3 2 2 1 2 2 ...
##  $ Browser                : int  1 2 1 2 3 2 4 2 2 4 ...
##  $ Region                 : int  1 1 9 2 1 1 3 1 2 1 ...
##  $ TrafficType            : int  1 2 3 4 4 3 3 5 3 2 ...
##  $ VisitorType            : Factor w/ 3 levels "New_Visitor",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ Weekend                : logi  FALSE FALSE FALSE FALSE TRUE FALSE ...
##  $ Revenue                : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
summary(shopper_data)
##  Administrative   Administrative_Duration Informational    
##  Min.   : 0.000   Min.   :   0.00         Min.   : 0.0000  
##  1st Qu.: 0.000   1st Qu.:   0.00         1st Qu.: 0.0000  
##  Median : 1.000   Median :   7.50         Median : 0.0000  
##  Mean   : 2.315   Mean   :  80.82         Mean   : 0.5036  
##  3rd Qu.: 4.000   3rd Qu.:  93.26         3rd Qu.: 0.0000  
##  Max.   :27.000   Max.   :3398.75         Max.   :24.0000  
##                                                            
##  Informational_Duration ProductRelated   ProductRelated_Duration
##  Min.   :   0.00        Min.   :  0.00   Min.   :    0.0        
##  1st Qu.:   0.00        1st Qu.:  7.00   1st Qu.:  184.1        
##  Median :   0.00        Median : 18.00   Median :  598.9        
##  Mean   :  34.47        Mean   : 31.73   Mean   : 1194.8        
##  3rd Qu.:   0.00        3rd Qu.: 38.00   3rd Qu.: 1464.2        
##  Max.   :2549.38        Max.   :705.00   Max.   :63973.5        
##                                                                 
##   BounceRates         ExitRates         PageValues        SpecialDay     
##  Min.   :0.000000   Min.   :0.00000   Min.   :  0.000   Min.   :0.00000  
##  1st Qu.:0.000000   1st Qu.:0.01429   1st Qu.:  0.000   1st Qu.:0.00000  
##  Median :0.003112   Median :0.02516   Median :  0.000   Median :0.00000  
##  Mean   :0.022191   Mean   :0.04307   Mean   :  5.889   Mean   :0.06143  
##  3rd Qu.:0.016813   3rd Qu.:0.05000   3rd Qu.:  0.000   3rd Qu.:0.00000  
##  Max.   :0.200000   Max.   :0.20000   Max.   :361.764   Max.   :1.00000  
##                                                                          
##      Month      OperatingSystems    Browser           Region     
##  May    :3364   Min.   :1.000    Min.   : 1.000   Min.   :1.000  
##  Nov    :2998   1st Qu.:2.000    1st Qu.: 2.000   1st Qu.:1.000  
##  Mar    :1907   Median :2.000    Median : 2.000   Median :3.000  
##  Dec    :1727   Mean   :2.124    Mean   : 2.357   Mean   :3.147  
##  Oct    : 549   3rd Qu.:3.000    3rd Qu.: 2.000   3rd Qu.:4.000  
##  Sep    : 448   Max.   :8.000    Max.   :13.000   Max.   :9.000  
##  (Other):1337                                                    
##   TrafficType               VisitorType     Weekend         Revenue       
##  Min.   : 1.00   New_Visitor      : 1694   Mode :logical   Mode :logical  
##  1st Qu.: 2.00   Other            :   85   FALSE:9462      FALSE:10422    
##  Median : 2.00   Returning_Visitor:10551   TRUE :2868      TRUE :1908     
##  Mean   : 4.07                                                            
##  3rd Qu.: 4.00                                                            
##  Max.   :20.00                                                            
## 

Check distribution of target variable.

summary(shopper_data$Revenue)
##    Mode   FALSE    TRUE 
## logical   10422    1908
CrossTable(shopper_data$Revenue)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  12330 
## 
##  
##           |     FALSE |      TRUE | 
##           |-----------|-----------|
##           |     10422 |      1908 | 
##           |     0.845 |     0.155 | 
##           |-----------|-----------|
## 
## 
## 
## 

Creating a binary dependent variable for potential regression models.

shopper_data <- shopper_data %>%
  mutate(Revenue_binary = ifelse(Revenue == "FALSE",0,1))
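
Since Revenue is stored as a logical, an equivalent and slightly more direct way to build the same 0/1 column would be the following (shown only as an alternative sketch):

## Alternative sketch: coerce the logical Revenue column directly to 0/1
shopper_data$Revenue_binary <- as.integer(shopper_data$Revenue)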

Checking the distribution of the new binary variable.

hist(shopper_data$Revenue_binary)

summary(shopper_data$Revenue_binary)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1547  0.0000  1.0000

Getting an idea of missing values:

colSums(is.na(shopper_data))
##          Administrative Administrative_Duration           Informational 
##                       0                       0                       0 
##  Informational_Duration          ProductRelated ProductRelated_Duration 
##                       0                       0                       0 
##             BounceRates               ExitRates              PageValues 
##                       0                       0                       0 
##              SpecialDay                   Month        OperatingSystems 
##                       0                       0                       0 
##                 Browser                  Region             TrafficType 
##                       0                       0                       0 
##             VisitorType                 Weekend                 Revenue 
##                       0                       0                       0 
##          Revenue_binary 
##                       0
colSums(shopper_data == "")
##          Administrative Administrative_Duration           Informational 
##                       0                       0                       0 
##  Informational_Duration          ProductRelated ProductRelated_Duration 
##                       0                       0                       0 
##             BounceRates               ExitRates              PageValues 
##                       0                       0                       0 
##              SpecialDay                   Month        OperatingSystems 
##                       0                       0                       0 
##                 Browser                  Region             TrafficType 
##                       0                       0                       0 
##             VisitorType                 Weekend                 Revenue 
##                       0                       0                       0 
##          Revenue_binary 
##                       0

As this is data from website visits, we see that there are no missing values. Every data point is generated by a person interacting with the website in some way, and a lack of interaction is recorded as a ‘0’ rather than as a missing value.

## default theme for ggplot
theme_set(theme_bw())

## setting default parameters for mosaic plots
mosaic_theme = theme(axis.text.x = element_text(angle = 90,
                                                hjust = 1,
                                                vjust = 0.5),
                     axis.text.y = element_blank(),
                     axis.ticks.y = element_blank())

4. Exploratory Data Analysis

4.1 Numerical Univariate Analysis

We visually analyze the tracking data to see whether there are any noticeable differences between shoppers and non-shoppers.

Administrative

summary(shopper_data$Administrative)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   1.000   2.315   4.000  27.000
shopper_data %>% 
  ggplot() +
  aes(x = Administrative) +
  geom_bar() +
  facet_grid(Revenue ~ .,
             scales = "free_y")

Administrative Duration

summary(shopper_data$Administrative_Duration)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    0.00    7.50   80.82   93.26 3398.75
shopper_data %>% 
  ggplot() +
  aes(x = Administrative_Duration) +
  geom_histogram(bins = 50) +
  facet_grid(Revenue ~ .,
             scales = "free_y")

Informational

summary(shopper_data$Informational)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.5036  0.0000 24.0000
shopper_data %>% 
  ggplot() +
  aes(x = Informational) +
  geom_bar() +
  facet_grid(Revenue ~ .,
             scales = "free_y")

Informational Duration

summary(shopper_data$Informational_Duration)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    0.00    0.00   34.47    0.00 2549.38
shopper_data %>% 
  ggplot() +
  aes(x = Informational_Duration) +
  geom_histogram(bins = 50) +
  facet_grid(Revenue ~ .,
             scales = "free_y")

Bounce Rates

summary(shopper_data$BounceRates)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.000000 0.000000 0.003112 0.022191 0.016813 0.200000
shopper_data %>% 
  ggplot() +
  aes(x = BounceRates) +
  geom_histogram(bins = 100) +
  facet_grid(Revenue ~ .,
             scales = "free_y")

Exit Rates

summary(shopper_data$ExitRates)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.01429 0.02516 0.04307 0.05000 0.20000
shopper_data %>% 
  ggplot() +
  aes(x = ExitRates) +
  geom_histogram(bins = 100) +
  facet_grid(Revenue ~ .,
             scales = "free_y")

Page Value

summary(shopper_data$PageValues)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   5.889   0.000 361.764
shopper_data %>% 
  ggplot() +
  aes(x = PageValues) +
  geom_histogram(bins = 50) +
  facet_grid(Revenue ~ .,
             scales = "free_y")

Special Day

summary(shopper_data$SpecialDay)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.06143 0.00000 1.00000
shopper_data %>% 
  ggplot() +
  aes(x = SpecialDay) +
  geom_bar() +
  facet_grid(Revenue ~ .,
             scales = "free_y") +
  scale_x_continuous(breaks = seq(0, 1, 0.1))

4.2 Categorical Univariate Analysis

Month

Does month make a difference?

Cross-tab with our dependent variable:

CrossTable(shopper_data$Month, shopper_data$Revenue)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  12330 
## 
##  
##                    | shopper_data$Revenue 
## shopper_data$Month |     FALSE |      TRUE | Row Total | 
## -------------------|-----------|-----------|-----------|
##                Aug |       357 |        76 |       433 | 
##                    |     0.221 |     1.208 |           | 
##                    |     0.824 |     0.176 |     0.035 | 
##                    |     0.034 |     0.040 |           | 
##                    |     0.029 |     0.006 |           | 
## -------------------|-----------|-----------|-----------|
##                Dec |      1511 |       216 |      1727 | 
##                    |     1.799 |     9.826 |           | 
##                    |     0.875 |     0.125 |     0.140 | 
##                    |     0.145 |     0.113 |           | 
##                    |     0.123 |     0.018 |           | 
## -------------------|-----------|-----------|-----------|
##                Feb |       181 |         3 |       184 | 
##                    |     4.172 |    22.789 |           | 
##                    |     0.984 |     0.016 |     0.015 | 
##                    |     0.017 |     0.002 |           | 
##                    |     0.015 |     0.000 |           | 
## -------------------|-----------|-----------|-----------|
##                Jul |       366 |        66 |       432 | 
##                    |     0.002 |     0.011 |           | 
##                    |     0.847 |     0.153 |     0.035 | 
##                    |     0.035 |     0.035 |           | 
##                    |     0.030 |     0.005 |           | 
## -------------------|-----------|-----------|-----------|
##               June |       259 |        29 |       288 | 
##                    |     0.995 |     5.437 |           | 
##                    |     0.899 |     0.101 |     0.023 | 
##                    |     0.025 |     0.015 |           | 
##                    |     0.021 |     0.002 |           | 
## -------------------|-----------|-----------|-----------|
##                Mar |      1715 |       192 |      1907 | 
##                    |     6.594 |    36.019 |           | 
##                    |     0.899 |     0.101 |     0.155 | 
##                    |     0.165 |     0.101 |           | 
##                    |     0.139 |     0.016 |           | 
## -------------------|-----------|-----------|-----------|
##                May |      2999 |       365 |      3364 | 
##                    |     8.511 |    46.487 |           | 
##                    |     0.891 |     0.109 |     0.273 | 
##                    |     0.288 |     0.191 |           | 
##                    |     0.243 |     0.030 |           | 
## -------------------|-----------|-----------|-----------|
##                Nov |      2238 |       760 |      2998 | 
##                    |    34.593 |   188.955 |           | 
##                    |     0.746 |     0.254 |     0.243 | 
##                    |     0.215 |     0.398 |           | 
##                    |     0.182 |     0.062 |           | 
## -------------------|-----------|-----------|-----------|
##                Oct |       434 |       115 |       549 | 
##                    |     1.945 |    10.626 |           | 
##                    |     0.791 |     0.209 |     0.045 | 
##                    |     0.042 |     0.060 |           | 
##                    |     0.035 |     0.009 |           | 
## -------------------|-----------|-----------|-----------|
##                Sep |       362 |        86 |       448 | 
##                    |     0.734 |     4.011 |           | 
##                    |     0.808 |     0.192 |     0.036 | 
##                    |     0.035 |     0.045 |           | 
##                    |     0.029 |     0.007 |           | 
## -------------------|-----------|-----------|-----------|
##       Column Total |     10422 |      1908 |     12330 | 
##                    |     0.845 |     0.155 |           | 
## -------------------|-----------|-----------|-----------|
## 
## 
shopper_data %>% 
  ggplot() +
  aes(x = Month, y = ..count../nrow(shopper_data), fill = Revenue) +
  geom_bar() +
  ylab("relative frequency")

month_table <- table(shopper_data$Month, shopper_data$Revenue)
month_tab <- as.data.frame(prop.table(month_table, 2))
colnames(month_tab) <-  c("Month", "Revenue", "perc")

ggplot(data = month_tab, aes(x = Month, y = perc, fill = Revenue)) + 
  geom_bar(stat = 'identity', position = 'dodge', alpha = 2/3) + 
  xlab("Month")+
  ylab("Percent")

We see high conversion rates in September, October, and November, months that typically correspond to the ‘shopping season’ in North America. Also of note is the month of May, which has the largest number of visits to the website.
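
To check more formally whether these month-to-month differences are statistically meaningful, a chi-square test of independence could be run (a sketch; the CrossTable output above already lists the per-cell chi-square contributions):

## Sketch: chi-square test of independence between Month and Revenue
chisq.test(table(shopper_data$Month, shopper_data$Revenue))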

Operating System

Plotting the frequency of OS type:

CrossTable(shopper_data$OperatingSystems, shopper_data$Revenue)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  12330 
## 
##  
##                               | shopper_data$Revenue 
## shopper_data$OperatingSystems |     FALSE |      TRUE | Row Total | 
## ------------------------------|-----------|-----------|-----------|
##                             1 |      2206 |       379 |      2585 | 
##                               |     0.202 |     1.104 |           | 
##                               |     0.853 |     0.147 |     0.210 | 
##                               |     0.212 |     0.199 |           | 
##                               |     0.179 |     0.031 |           | 
## ------------------------------|-----------|-----------|-----------|
##                             2 |      5446 |      1155 |      6601 | 
##                               |     3.196 |    17.456 |           | 
##                               |     0.825 |     0.175 |     0.535 | 
##                               |     0.523 |     0.605 |           | 
##                               |     0.442 |     0.094 |           | 
## ------------------------------|-----------|-----------|-----------|
##                             3 |      2287 |       268 |      2555 | 
##                               |     7.512 |    41.034 |           | 
##                               |     0.895 |     0.105 |     0.207 | 
##                               |     0.219 |     0.140 |           | 
##                               |     0.185 |     0.022 |           | 
## ------------------------------|-----------|-----------|-----------|
##                             4 |       393 |        85 |       478 | 
##                               |     0.301 |     1.645 |           | 
##                               |     0.822 |     0.178 |     0.039 | 
##                               |     0.038 |     0.045 |           | 
##                               |     0.032 |     0.007 |           | 
## ------------------------------|-----------|-----------|-----------|
##                             5 |         5 |         1 |         6 | 
##                               |     0.001 |     0.006 |           | 
##                               |     0.833 |     0.167 |     0.000 | 
##                               |     0.000 |     0.001 |           | 
##                               |     0.000 |     0.000 |           | 
## ------------------------------|-----------|-----------|-----------|
##                             6 |        17 |         2 |        19 | 
##                               |     0.055 |     0.301 |           | 
##                               |     0.895 |     0.105 |     0.002 | 
##                               |     0.002 |     0.001 |           | 
##                               |     0.001 |     0.000 |           | 
## ------------------------------|-----------|-----------|-----------|
##                             7 |         6 |         1 |         7 | 
##                               |     0.001 |     0.006 |           | 
##                               |     0.857 |     0.143 |     0.001 | 
##                               |     0.001 |     0.001 |           | 
##                               |     0.000 |     0.000 |           | 
## ------------------------------|-----------|-----------|-----------|
##                             8 |        62 |        17 |        79 | 
##                               |     0.341 |     1.865 |           | 
##                               |     0.785 |     0.215 |     0.006 | 
##                               |     0.006 |     0.009 |           | 
##                               |     0.005 |     0.001 |           | 
## ------------------------------|-----------|-----------|-----------|
##                  Column Total |     10422 |      1908 |     12330 | 
##                               |     0.845 |     0.155 |           | 
## ------------------------------|-----------|-----------|-----------|
## 
## 
shopper_data %>% 
  ggplot() +
  geom_mosaic(aes(x = product(Revenue, OperatingSystems), fill = Revenue)) +
  mosaic_theme +
  xlab("OS Types") +
  ylab(NULL)

OS type 8 stands out, with 21.5% of its visitors buying (but only 79 visitors use it). The lowest conversion rates are for OS types 3 and 6, at 10.5% each. Also of note is that the majority of visitors use OS types 1, 2, and 3.

Browser Type

Does a particular type of browser affect the users’ experience?

CrossTable(shopper_data$Browser, shopper_data$Revenue)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  12330 
## 
##  
##                      | shopper_data$Revenue 
## shopper_data$Browser |     FALSE |      TRUE | Row Total | 
## ---------------------|-----------|-----------|-----------|
##                    1 |      2097 |       365 |      2462 | 
##                      |     0.123 |     0.670 |           | 
##                      |     0.852 |     0.148 |     0.200 | 
##                      |     0.201 |     0.191 |           | 
##                      |     0.170 |     0.030 |           | 
## ---------------------|-----------|-----------|-----------|
##                    2 |      6738 |      1223 |      7961 | 
##                      |     0.012 |     0.065 |           | 
##                      |     0.846 |     0.154 |     0.646 | 
##                      |     0.647 |     0.641 |           | 
##                      |     0.546 |     0.099 |           | 
## ---------------------|-----------|-----------|-----------|
##                    3 |       100 |         5 |       105 | 
##                      |     1.426 |     7.787 |           | 
##                      |     0.952 |     0.048 |     0.009 | 
##                      |     0.010 |     0.003 |           | 
##                      |     0.008 |     0.000 |           | 
## ---------------------|-----------|-----------|-----------|
##                    4 |       606 |       130 |       736 | 
##                      |     0.417 |     2.278 |           | 
##                      |     0.823 |     0.177 |     0.060 | 
##                      |     0.058 |     0.068 |           | 
##                      |     0.049 |     0.011 |           | 
## ---------------------|-----------|-----------|-----------|
##                    5 |       381 |        86 |       467 | 
##                      |     0.478 |     2.610 |           | 
##                      |     0.816 |     0.184 |     0.038 | 
##                      |     0.037 |     0.045 |           | 
##                      |     0.031 |     0.007 |           | 
## ---------------------|-----------|-----------|-----------|
##                    6 |       154 |        20 |       174 | 
##                      |     0.326 |     1.781 |           | 
##                      |     0.885 |     0.115 |     0.014 | 
##                      |     0.015 |     0.010 |           | 
##                      |     0.012 |     0.002 |           | 
## ---------------------|-----------|-----------|-----------|
##                    7 |        43 |         6 |        49 | 
##                      |     0.060 |     0.330 |           | 
##                      |     0.878 |     0.122 |     0.004 | 
##                      |     0.004 |     0.003 |           | 
##                      |     0.003 |     0.000 |           | 
## ---------------------|-----------|-----------|-----------|
##                    8 |       114 |        21 |       135 | 
##                      |     0.000 |     0.001 |           | 
##                      |     0.844 |     0.156 |     0.011 | 
##                      |     0.011 |     0.011 |           | 
##                      |     0.009 |     0.002 |           | 
## ---------------------|-----------|-----------|-----------|
##                    9 |         1 |         0 |         1 | 
##                      |     0.028 |     0.155 |           | 
##                      |     1.000 |     0.000 |     0.000 | 
##                      |     0.000 |     0.000 |           | 
##                      |     0.000 |     0.000 |           | 
## ---------------------|-----------|-----------|-----------|
##                   10 |       131 |        32 |       163 | 
##                      |     0.333 |     1.821 |           | 
##                      |     0.804 |     0.196 |     0.013 | 
##                      |     0.013 |     0.017 |           | 
##                      |     0.011 |     0.003 |           | 
## ---------------------|-----------|-----------|-----------|
##                   11 |         5 |         1 |         6 | 
##                      |     0.001 |     0.006 |           | 
##                      |     0.833 |     0.167 |     0.000 | 
##                      |     0.000 |     0.001 |           | 
##                      |     0.000 |     0.000 |           | 
## ---------------------|-----------|-----------|-----------|
##                   12 |         7 |         3 |        10 | 
##                      |     0.250 |     1.363 |           | 
##                      |     0.700 |     0.300 |     0.001 | 
##                      |     0.001 |     0.002 |           | 
##                      |     0.001 |     0.000 |           | 
## ---------------------|-----------|-----------|-----------|
##                   13 |        45 |        16 |        61 | 
##                      |     0.835 |     4.560 |           | 
##                      |     0.738 |     0.262 |     0.005 | 
##                      |     0.004 |     0.008 |           | 
##                      |     0.004 |     0.001 |           | 
## ---------------------|-----------|-----------|-----------|
##         Column Total |     10422 |      1908 |     12330 | 
##                      |     0.845 |     0.155 |           | 
## ---------------------|-----------|-----------|-----------|
## 
## 
shopper_data %>% 
  ggplot() +
  geom_mosaic(aes(x = product(Revenue, Browser), fill = Revenue)) +
  mosaic_theme +
  xlab("Broswer Types") +
  ylab(NULL)

Browser type 3 stands out as having very few conversions (only 4.8%). Most other browsers perform broadly similarly (roughly 11% to 20%). Browsers 12 and 13 show higher conversion rates (30% and 26.2% respectively) but have very few users (10 and 61 respectively).

Region

Does the users’ region influence revenue?

CrossTable(shopper_data$Region, shopper_data$Revenue)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  12330 
## 
##  
##                     | shopper_data$Revenue 
## shopper_data$Region |     FALSE |      TRUE | Row Total | 
## --------------------|-----------|-----------|-----------|
##                   1 |      4009 |       771 |      4780 | 
##                     |     0.243 |     1.326 |           | 
##                     |     0.839 |     0.161 |     0.388 | 
##                     |     0.385 |     0.404 |           | 
##                     |     0.325 |     0.063 |           | 
## --------------------|-----------|-----------|-----------|
##                   2 |       948 |       188 |      1136 | 
##                     |     0.155 |     0.848 |           | 
##                     |     0.835 |     0.165 |     0.092 | 
##                     |     0.091 |     0.099 |           | 
##                     |     0.077 |     0.015 |           | 
## --------------------|-----------|-----------|-----------|
##                   3 |      2054 |       349 |      2403 | 
##                     |     0.257 |     1.404 |           | 
##                     |     0.855 |     0.145 |     0.195 | 
##                     |     0.197 |     0.183 |           | 
##                     |     0.167 |     0.028 |           | 
## --------------------|-----------|-----------|-----------|
##                   4 |      1007 |       175 |      1182 | 
##                     |     0.063 |     0.342 |           | 
##                     |     0.852 |     0.148 |     0.096 | 
##                     |     0.097 |     0.092 |           | 
##                     |     0.082 |     0.014 |           | 
## --------------------|-----------|-----------|-----------|
##                   5 |       266 |        52 |       318 | 
##                     |     0.029 |     0.158 |           | 
##                     |     0.836 |     0.164 |     0.026 | 
##                     |     0.026 |     0.027 |           | 
##                     |     0.022 |     0.004 |           | 
## --------------------|-----------|-----------|-----------|
##                   6 |       693 |       112 |       805 | 
##                     |     0.232 |     1.268 |           | 
##                     |     0.861 |     0.139 |     0.065 | 
##                     |     0.066 |     0.059 |           | 
##                     |     0.056 |     0.009 |           | 
## --------------------|-----------|-----------|-----------|
##                   7 |       642 |       119 |       761 | 
##                     |     0.002 |     0.013 |           | 
##                     |     0.844 |     0.156 |     0.062 | 
##                     |     0.062 |     0.062 |           | 
##                     |     0.052 |     0.010 |           | 
## --------------------|-----------|-----------|-----------|
##                   8 |       378 |        56 |       434 | 
##                     |     0.339 |     1.854 |           | 
##                     |     0.871 |     0.129 |     0.035 | 
##                     |     0.036 |     0.029 |           | 
##                     |     0.031 |     0.005 |           | 
## --------------------|-----------|-----------|-----------|
##                   9 |       425 |        86 |       511 | 
##                     |     0.111 |     0.607 |           | 
##                     |     0.832 |     0.168 |     0.041 | 
##                     |     0.041 |     0.045 |           | 
##                     |     0.034 |     0.007 |           | 
## --------------------|-----------|-----------|-----------|
##        Column Total |     10422 |      1908 |     12330 | 
##                     |     0.845 |     0.155 |           | 
## --------------------|-----------|-----------|-----------|
## 
## 
shopper_data %>% 
  ggplot() +
  geom_mosaic(aes(x = product(Revenue, Region), fill = Revenue)) +
  mosaic_theme +
  xlab("Regions") +
  ylab(NULL)

There is very little variation by region, with conversion rates ranging from 12.9% to 16.8%.

Traffic Type

Does the type of traffic matter?

CrossTable(shopper_data$TrafficType, shopper_data$Revenue)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  12330 
## 
##  
##                          | shopper_data$Revenue 
## shopper_data$TrafficType |     FALSE |      TRUE | Row Total | 
## -------------------------|-----------|-----------|-----------|
##                        1 |      2189 |       262 |      2451 | 
##                          |     6.639 |    36.264 |           | 
##                          |     0.893 |     0.107 |     0.199 | 
##                          |     0.210 |     0.137 |           | 
##                          |     0.178 |     0.021 |           | 
## -------------------------|-----------|-----------|-----------|
##                        2 |      3066 |       847 |      3913 | 
##                          |    17.631 |    96.306 |           | 
##                          |     0.784 |     0.216 |     0.317 | 
##                          |     0.294 |     0.444 |           | 
##                          |     0.249 |     0.069 |           | 
## -------------------------|-----------|-----------|-----------|
##                        3 |      1872 |       180 |      2052 | 
##                          |    10.906 |    59.572 |           | 
##                          |     0.912 |     0.088 |     0.166 | 
##                          |     0.180 |     0.094 |           | 
##                          |     0.152 |     0.015 |           | 
## -------------------------|-----------|-----------|-----------|
##                        4 |       904 |       165 |      1069 | 
##                          |     0.000 |     0.001 |           | 
##                          |     0.846 |     0.154 |     0.087 | 
##                          |     0.087 |     0.086 |           | 
##                          |     0.073 |     0.013 |           | 
## -------------------------|-----------|-----------|-----------|
##                        5 |       204 |        56 |       260 | 
##                          |     1.131 |     6.178 |           | 
##                          |     0.785 |     0.215 |     0.021 | 
##                          |     0.020 |     0.029 |           | 
##                          |     0.017 |     0.005 |           | 
## -------------------------|-----------|-----------|-----------|
##                        6 |       391 |        53 |       444 | 
##                          |     0.657 |     3.591 |           | 
##                          |     0.881 |     0.119 |     0.036 | 
##                          |     0.038 |     0.028 |           | 
##                          |     0.032 |     0.004 |           | 
## -------------------------|-----------|-----------|-----------|
##                        7 |        28 |        12 |        40 | 
##                          |     0.998 |     5.454 |           | 
##                          |     0.700 |     0.300 |     0.003 | 
##                          |     0.003 |     0.006 |           | 
##                          |     0.002 |     0.001 |           | 
## -------------------------|-----------|-----------|-----------|
##                        8 |       248 |        95 |       343 | 
##                          |     6.062 |    33.112 |           | 
##                          |     0.723 |     0.277 |     0.028 | 
##                          |     0.024 |     0.050 |           | 
##                          |     0.020 |     0.008 |           | 
## -------------------------|-----------|-----------|-----------|
##                        9 |        38 |         4 |        42 | 
##                          |     0.176 |     0.961 |           | 
##                          |     0.905 |     0.095 |     0.003 | 
##                          |     0.004 |     0.002 |           | 
##                          |     0.003 |     0.000 |           | 
## -------------------------|-----------|-----------|-----------|
##                       10 |       360 |        90 |       450 | 
##                          |     1.090 |     5.956 |           | 
##                          |     0.800 |     0.200 |     0.036 | 
##                          |     0.035 |     0.047 |           | 
##                          |     0.029 |     0.007 |           | 
## -------------------------|-----------|-----------|-----------|
##                       11 |       200 |        47 |       247 | 
##                          |     0.369 |     2.016 |           | 
##                          |     0.810 |     0.190 |     0.020 | 
##                          |     0.019 |     0.025 |           | 
##                          |     0.016 |     0.004 |           | 
## -------------------------|-----------|-----------|-----------|
##                       12 |         1 |         0 |         1 | 
##                          |     0.028 |     0.155 |           | 
##                          |     1.000 |     0.000 |     0.000 | 
##                          |     0.000 |     0.000 |           | 
##                          |     0.000 |     0.000 |           | 
## -------------------------|-----------|-----------|-----------|
##                       13 |       695 |        43 |       738 | 
##                          |     8.127 |    44.392 |           | 
##                          |     0.942 |     0.058 |     0.060 | 
##                          |     0.067 |     0.023 |           | 
##                          |     0.056 |     0.003 |           | 
## -------------------------|-----------|-----------|-----------|
##                       14 |        11 |         2 |        13 | 
##                          |     0.000 |     0.000 |           | 
##                          |     0.846 |     0.154 |     0.001 | 
##                          |     0.001 |     0.001 |           | 
##                          |     0.001 |     0.000 |           | 
## -------------------------|-----------|-----------|-----------|
##                       15 |        38 |         0 |        38 | 
##                          |     1.077 |     5.880 |           | 
##                          |     1.000 |     0.000 |     0.003 | 
##                          |     0.004 |     0.000 |           | 
##                          |     0.003 |     0.000 |           | 
## -------------------------|-----------|-----------|-----------|
##                       16 |         2 |         1 |         3 | 
##                          |     0.113 |     0.618 |           | 
##                          |     0.667 |     0.333 |     0.000 | 
##                          |     0.000 |     0.001 |           | 
##                          |     0.000 |     0.000 |           | 
## -------------------------|-----------|-----------|-----------|
##                       17 |         1 |         0 |         1 | 
##                          |     0.028 |     0.155 |           | 
##                          |     1.000 |     0.000 |     0.000 | 
##                          |     0.000 |     0.000 |           | 
##                          |     0.000 |     0.000 |           | 
## -------------------------|-----------|-----------|-----------|
##                       18 |        10 |         0 |        10 | 
##                          |     0.283 |     1.547 |           | 
##                          |     1.000 |     0.000 |     0.001 | 
##                          |     0.001 |     0.000 |           | 
##                          |     0.001 |     0.000 |           | 
## -------------------------|-----------|-----------|-----------|
##                       19 |        16 |         1 |        17 | 
##                          |     0.185 |     1.011 |           | 
##                          |     0.941 |     0.059 |     0.001 | 
##                          |     0.002 |     0.001 |           | 
##                          |     0.001 |     0.000 |           | 
## -------------------------|-----------|-----------|-----------|
##                       20 |       148 |        50 |       198 | 
##                          |     2.240 |    12.234 |           | 
##                          |     0.747 |     0.253 |     0.016 | 
##                          |     0.014 |     0.026 |           | 
##                          |     0.012 |     0.004 |           | 
## -------------------------|-----------|-----------|-----------|
##             Column Total |     10422 |      1908 |     12330 | 
##                          |     0.845 |     0.155 |           | 
## -------------------------|-----------|-----------|-----------|
## 
## 
shopper_data %>% 
  ggplot() +
  geom_mosaic(aes(x = product(Revenue, TrafficType), fill = Revenue)) +
  mosaic_theme +
  xlab("Traffic Type") +
  ylab(NULL)

There is considerable variation in conversion rate across traffic types: among the larger groups, for example, type 2 converts at 21.6% while type 13 converts at only 5.8%.

Visitor Type

Does the type of visitor matter?

CrossTable(shopper_data$VisitorType, shopper_data$Revenue)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  12330 
## 
##  
##                          | shopper_data$Revenue 
## shopper_data$VisitorType |     FALSE |      TRUE | Row Total | 
## -------------------------|-----------|-----------|-----------|
##              New_Visitor |      1272 |       422 |      1694 | 
##                          |    17.848 |    97.491 |           | 
##                          |     0.751 |     0.249 |     0.137 | 
##                          |     0.122 |     0.221 |           | 
##                          |     0.103 |     0.034 |           | 
## -------------------------|-----------|-----------|-----------|
##                    Other |        69 |        16 |        85 | 
##                          |     0.113 |     0.616 |           | 
##                          |     0.812 |     0.188 |     0.007 | 
##                          |     0.007 |     0.008 |           | 
##                          |     0.006 |     0.001 |           | 
## -------------------------|-----------|-----------|-----------|
##        Returning_Visitor |      9081 |      1470 |     10551 | 
##                          |     2.969 |    16.215 |           | 
##                          |     0.861 |     0.139 |     0.856 | 
##                          |     0.871 |     0.770 |           | 
##                          |     0.736 |     0.119 |           | 
## -------------------------|-----------|-----------|-----------|
##             Column Total |     10422 |      1908 |     12330 | 
##                          |     0.845 |     0.155 |           | 
## -------------------------|-----------|-----------|-----------|
## 
## 
shopper_data %>% 
  ggplot() +
  geom_mosaic(aes(x = product(Revenue, VisitorType), fill = Revenue)) +
  mosaic_theme +
  xlab("Visitor Type") +
  ylab(NULL)

Some interesting results here. Most of our visitors are returning visitors (85.6%), compared to only 13.7% who are new visitors. In contrast, newcomers have a higher probability of buying a product (24.9%), compared to only 13.9% of returning visitors generating revenue.

Weekend

Does it matter if people visit the website on a weekend or a weekday?

CrossTable(shopper_data$Weekend, shopper_data$Revenue)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  12330 
## 
##  
##                      | shopper_data$Revenue 
## shopper_data$Weekend |     FALSE |      TRUE | Row Total | 
## ---------------------|-----------|-----------|-----------|
##                FALSE |      8053 |      1409 |      9462 | 
##                      |     0.381 |     2.080 |           | 
##                      |     0.851 |     0.149 |     0.767 | 
##                      |     0.773 |     0.738 |           | 
##                      |     0.653 |     0.114 |           | 
## ---------------------|-----------|-----------|-----------|
##                 TRUE |      2369 |       499 |      2868 | 
##                      |     1.257 |     6.864 |           | 
##                      |     0.826 |     0.174 |     0.233 | 
##                      |     0.227 |     0.262 |           | 
##                      |     0.192 |     0.040 |           | 
## ---------------------|-----------|-----------|-----------|
##         Column Total |     10422 |      1908 |     12330 | 
##                      |     0.845 |     0.155 |           | 
## ---------------------|-----------|-----------|-----------|
## 
## 
shopper_data %>% 
  ggplot() +
  geom_mosaic(aes(x = product(Revenue, Weekend), fill = Revenue)) +
  mosaic_theme +
  xlab("Weekend") +
  ylab(NULL)

We see that 76.7% of visits occur on weekdays, a five-day period, with a 14.9% chance of a purchase, while 23.3% of visits occur on weekends, a two-day period, with a 17.4% chance of a purchase.

4.3 EDA Summary

We see relatively little variation across our numerical variables and moderate variation across our categorical variables. Based on this, we are going to attempt classification via a decision tree algorithm and clustering via a k-means algorithm.

5. Data Preparation for Analysis

5.1 Converting our Categorical Variables to Ordinal Factors

Converting our variables to factors with ordered levels (ordinal variables) for use with various algorithms:

shopper_data$OperatingSystems <- factor(shopper_data$OperatingSystems, order = TRUE, levels = c(6,3,7,1,5,2,4,8))
shopper_data$Browser <- factor(shopper_data$Browser, order = TRUE, levels = c(9,3,6,7,1,2,8,11,4,5,10,13,12))
shopper_data$Region <- factor(shopper_data$Region, order = TRUE, levels = c(8,6,3,4,7,1,5,2,9))
shopper_data$TrafficType <- factor(shopper_data$TrafficType, order = TRUE, levels = c(12,15,17,18,13,19,3,9,1,6,4,14,11,10,5,2,20,8,7,16))
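
The hard-coded level orderings above appear to follow each level’s observed conversion rate, from lowest to highest. A sketch of how such an ordering could be derived programmatically (using the Revenue_binary column created earlier) rather than typed by hand:

## Sketch: order OperatingSystems levels by their observed conversion rate
os_rates <- tapply(shopper_data$Revenue_binary, shopper_data$OperatingSystems, mean)
names(sort(os_rates))  ## levels from lowest to highest conversion rate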

Changing Month and Visitor Type to ordinal variables and assigning numbers to the levels for clustering.

library(plyr)
## -------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## -------------------------------------------------------------------------
## 
## Attaching package: 'plyr'
## The following objects are masked from 'package:dplyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize
## The following object is masked from 'package:purrr':
## 
##     compact
shopper_data$Month <- factor(shopper_data$Month, order = TRUE, levels =c('Feb', 'Mar', 'May', 'June','Jul', 'Aug', 'Sep','Oct', 'Nov','Dec'))
shopper_data$Month_Numeric <-mapvalues(shopper_data$Month, from = c('Feb', 'Mar', 'May', 'June','Jul', 'Aug', 'Sep','Oct', 'Nov','Dec'), to = c(1,2,3,4,5,6,7,8,9,10))


shopper_data$VisitorType <- factor(shopper_data$VisitorType, order = TRUE, levels = c('Returning_Visitor', 'Other', 'New_Visitor'))
shopper_data$VisitorType_Numeric <-mapvalues(shopper_data$VisitorType, from = c("Returning_Visitor", "Other", "New_Visitor"), to = c(1,2,3))

library(dplyr)

5.2 Creating Appropriate Dummy Variables

We convert the variable Weekend to a dummy, with a weekend being ‘1’ and a weekday being ‘0’.

shopper_data <- shopper_data %>%
  mutate(Weekend_binary = ifelse(Weekend == "FALSE",0,1))

5.3 Normalizing Numerical Data

Theory

Mathematically, “normalization” here means transforming the values of a variable to lie in the range between 0 and 1 (min-max scaling).

Why do we do it?

Certain machine learning algorithms (such as SVM and K-means) are more sensitive to the scale of data than others since the distance between the data points is very important.

In order to avoid this problem, we bring the dataset to a common scale (between 0 and 1) while keeping the distributions of variables the same.

To start, we write a function to normalize our numerical data.

As mentioned earlier, we rescale the data points of our numeric variables to lie between 0 and 1 (0 ≤ x ≤ 1), according to the following formula:

z_i = (x_i - min(x))/(max(x) - min(x))

Where z_i is our normalized observation and x_i is the original observation.

Applying this formula to a column takes each observation, subtracts the smallest value in the column, and divides that difference by the difference between the largest and smallest values, which scales the result to the range [0, 1].

Logically, the rescaled value of the smallest data point will be 0 and the rescaled value of the largest data point will be 1.

Writing this code:

normalize <- function(x) {
  return ((x - min(x)) / (max(x) - min(x)))
}
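
A quick worked example: normalize(c(2, 5, 10)) returns 0.000, 0.375, 1.000, since the minimum maps to 0, the maximum maps to 1, and 5 lies 3/8 of the way between them.

## Quick check of the function on a toy vector
normalize(c(2, 5, 10))
## expected: 0.000 0.375 1.000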

Then we normalize our numerical data:

## Creating a copy of the original data.
shopper_data_norm <- shopper_data

## Normalizing our 10 variables.
shopper_data_norm$Administrative <- normalize(shopper_data$Administrative)
shopper_data_norm$Administrative_Duration <- normalize(shopper_data$Administrative_Duration)
shopper_data_norm$Informational <- normalize(shopper_data$Informational)
shopper_data_norm$Informational_Duration <- normalize(shopper_data$Informational_Duration)
shopper_data_norm$ProductRelated <- normalize(shopper_data$ProductRelated)
shopper_data_norm$ProductRelated_Duration <- normalize(shopper_data$ProductRelated_Duration)
shopper_data_norm$BounceRates <- normalize(shopper_data$BounceRates)
shopper_data_norm$ExitRates <- normalize(shopper_data$ExitRates)
shopper_data_norm$PageValues <- normalize(shopper_data$PageValues)
shopper_data_norm$SpecialDay <- normalize(shopper_data$SpecialDay)
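
An equivalent, more compact way to normalize the same ten columns in one pass (a sketch that should produce the same result as the column-by-column assignments above):

## Sketch: normalize all ten numeric columns at once
num_cols <- c("Administrative", "Administrative_Duration", "Informational",
              "Informational_Duration", "ProductRelated", "ProductRelated_Duration",
              "BounceRates", "ExitRates", "PageValues", "SpecialDay")
shopper_data_norm[num_cols] <- lapply(shopper_data[num_cols], normalize)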

Finalizing our normalized dataframe for clustering models:

## Dropping the columns not used for clustering: Month, VisitorType and
## Weekend (their numeric versions are kept instead) and the Revenue /
## Revenue_binary target columns.
shopper_data_clust <- shopper_data_norm[-c(11,16:19)]

5.4 Creating Test and Train Data

Splitting the data into training and test datasets (80-20 split) for classification:

## Dropping the helper columns created earlier (Revenue_binary, Month_Numeric,
## VisitorType_Numeric, Weekend_binary), keeping the original 18 variables
## for classification.
shopper_data_class <- shopper_data[-c(19:22)]

set.seed(1984)
training <- createDataPartition(shopper_data_class$Revenue, p = 0.8, list=FALSE)

train_data <- shopper_data_class[training,]
test_data <- shopper_data_class[-training,]
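
Since createDataPartition() samples within the levels of the outcome, the class balance of Revenue should be roughly the same in both splits; a quick check (sketch):

## Sketch: verify that the Revenue proportions are similar in both splits
prop.table(table(train_data$Revenue))
prop.table(table(test_data$Revenue))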

We now have two specific data sets:

  • shopper_data_class for our classification algorithms
  • shopper_data_clust for our clustering algorithms

6. Clustering

Our data visualization suggests that there are no clear distribution patterns among our variables; hence, clustering might be a good approach for our needs. It will look at the data and try to find natural groupings.

6.1 K-Means Clustering

Data we are feeding to our clustering models:

summary(shopper_data_clust)
##  Administrative    Administrative_Duration Informational    
##  Min.   :0.00000   Min.   :0.000000        Min.   :0.00000  
##  1st Qu.:0.00000   1st Qu.:0.000000        1st Qu.:0.00000  
##  Median :0.03704   Median :0.002207        Median :0.00000  
##  Mean   :0.08575   Mean   :0.023779        Mean   :0.01352  
##  3rd Qu.:0.14815   3rd Qu.:0.027438        3rd Qu.:0.00000  
##  Max.   :1.00000   Max.   :1.000000        Max.   :1.00000  
##                                                             
##  Informational_Duration ProductRelated     ProductRelated_Duration
##  Min.   :0.00000        Min.   :0.000000   Min.   :0.000000       
##  1st Qu.:0.00000        1st Qu.:0.009929   1st Qu.:0.002878       
##  Median :0.03704        Median :0.025532   Median :0.009362       
##  Mean   :0.08575        Mean   :0.045009   Mean   :0.018676       
##  3rd Qu.:0.14815        3rd Qu.:0.053901   3rd Qu.:0.022887       
##  Max.   :1.00000        Max.   :1.000000   Max.   :1.000000       
##                                                                   
##   BounceRates        ExitRates         PageValues        SpecialDay     
##  Min.   :0.00000   Min.   :0.00000   Min.   :0.00000   Min.   :0.00000  
##  1st Qu.:0.00000   1st Qu.:0.07143   1st Qu.:0.00000   1st Qu.:0.00000  
##  Median :0.01556   Median :0.12578   Median :0.00000   Median :0.00000  
##  Mean   :0.11096   Mean   :0.21536   Mean   :0.01628   Mean   :0.06143  
##  3rd Qu.:0.08406   3rd Qu.:0.25000   3rd Qu.:0.00000   3rd Qu.:0.00000  
##  Max.   :1.00000   Max.   :1.00000   Max.   :1.00000   Max.   :1.00000  
##                                                                         
##  OperatingSystems    Browser         Region      TrafficType  
##  2      :6601     2      :7961   1      :4780   2      :3913  
##  1      :2585     1      :2462   3      :2403   1      :2451  
##  3      :2555     4      : 736   4      :1182   3      :2052  
##  4      : 478     5      : 467   2      :1136   4      :1069  
##  8      :  79     6      : 174   6      : 805   13     : 738  
##  6      :  19     10     : 163   7      : 761   10     : 450  
##  (Other):  13     (Other): 367   (Other):1263   (Other):1657  
##  Month_Numeric  VisitorType_Numeric Weekend_binary  
##  3      :3364   1:10551             Min.   :0.0000  
##  9      :2998   2:   85             1st Qu.:0.0000  
##  2      :1907   3: 1694             Median :0.0000  
##  10     :1727                       Mean   :0.2326  
##  8      : 549                       3rd Qu.:0.0000  
##  7      : 448                       Max.   :1.0000  
##  (Other):1337
str(shopper_data_clust)
## 'data.frame':    12330 obs. of  17 variables:
##  $ Administrative         : num  0 0 0 0 0 ...
##  $ Administrative_Duration: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Informational          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Informational_Duration : num  0 0 0 0 0 ...
##  $ ProductRelated         : num  0.00142 0.00284 0.00142 0.00284 0.01418 ...
##  $ ProductRelated_Duration: num  0.00 1.00e-03 0.00 4.17e-05 9.81e-03 ...
##  $ BounceRates            : num  1 0 1 0.25 0.1 ...
##  $ ExitRates              : num  1 0.5 1 0.7 0.25 ...
##  $ PageValues             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SpecialDay             : num  0 0 0 0 0 0 0.4 0 0.8 0.4 ...
##  $ OperatingSystems       : Ord.factor w/ 8 levels "6"<"3"<"7"<"1"<..: 4 6 7 2 2 6 6 4 6 6 ...
##  $ Browser                : Ord.factor w/ 13 levels "9"<"3"<"6"<"7"<..: 5 6 5 6 2 6 9 6 6 9 ...
##  $ Region                 : Ord.factor w/ 9 levels "8"<"6"<"3"<"4"<..: 6 6 9 8 6 6 3 6 8 6 ...
##  $ TrafficType            : Ord.factor w/ 20 levels "12"<"15"<"17"<..: 9 16 7 11 11 7 7 15 7 16 ...
##  $ Month_Numeric          : Ord.factor w/ 10 levels "1"<"2"<"3"<"4"<..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ VisitorType_Numeric    : Ord.factor w/ 3 levels "1"<"2"<"3": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Weekend_binary         : num  0 0 0 0 1 0 0 1 0 0 ...

Running the K-Means model:

We are asking the model to group our data into two groups (or centers) to be able to predict ‘TRUE’ and ‘FALSE’ Revenue.

k_mean_clust <- kmeans(shopper_data_clust, centers = 2, iter.max = 100)
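
Note that kmeans() starts from random centers, so results can vary between runs. For a reproducible and more stable fit one could set a seed and use several random starts; a sketch only (not what was run above, and k_mean_clust_repro is a hypothetical name):

## Fix the RNG and keep the best of 25 random starts
set.seed(1984)
k_mean_clust_repro <- kmeans(shopper_data_clust, centers = 2, iter.max = 100, nstart = 25)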

Our findings:

## Size of our clusters
k_mean_clust$size
## [1] 5971 6359
## Our cluster centers (means)
k_mean_clust$centers
##   Administrative Administrative_Duration Informational
## 1     0.07532084              0.02079144    0.01139788
## 2     0.09553680              0.02658412    0.01551632
##   Informational_Duration ProductRelated ProductRelated_Duration
## 1             0.07532084      0.0360468              0.01481076
## 2             0.09553680      0.0534247              0.02230470
##   BounceRates ExitRates PageValues SpecialDay OperatingSystems  Browser
## 1   0.1196009 0.2307260 0.01364934 0.10812259         2.071010 2.360241
## 2   0.1028403 0.2009393 0.01874878 0.01758138         2.173769 2.354144
##     Region TrafficType Month_Numeric VisitorType_Numeric Weekend_binary
## 1 3.103500    3.070173      2.928823            1.221236      0.2286049
## 2 3.188552    5.008020      8.537820            1.338418      0.2363579
## Between cluster sum of squares
k_mean_clust$betweenss
## [1] 108576.2
## Total cluster sum of squares
k_mean_clust$totss
## [1] 451859.9
## Ratio of between-cluster to total sum of squares
k_mean_clust$betweenss / k_mean_clust$totss
## [1] 0.2402874

The between-cluster sum of squares accounts for only about 24.0 % of the total (between_SS / total_SS = 0.24).

This suggests the clusters capture little of the variation in the data, so the model is unlikely to predict Revenue well.

Let's look at our K-Means Confusion Matrix:

t1 <- table(k_mean_clust$cluster, shopper_data_norm$Revenue)
t1
##    
##     FALSE TRUE
##   1  5295  676
##   2  5127 1232

We see that this iteration of the model captured most of the 'TRUE' sessions in cluster 2 (1232 vs. 676), but split the 'FALSE' sessions almost evenly between the two clusters, so the classes are not separated cleanly.

Visualizing our K-means Clusters

Performing a PCA

We are going to create components from our existing data:

pca_cluster_data <- prcomp(shopper_data_clust[c(1:10)], scale. = TRUE)
plot(pca_cluster_data, main = "Principal Components")
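
To quantify how much of the variance the leading components capture, one can inspect the prcomp summary (a quick check using the pca_cluster_data object above):

## Proportion of variance explained by the first two principal components
summary(pca_cluster_data)$importance["Proportion of Variance", 1:2]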

Picking out and plotting the first two components against each other:

shopper_components_data <- as.data.frame(pca_cluster_data$x)

## Show the first two PCs for our shoppers
head(shopper_components_data[1:2], 5)
##         PC1        PC2
## 1 -3.277670  3.7312465
## 2 -1.535840 -0.1360421
## 3 -3.277670  3.7312465
## 4 -2.073731  1.0127275
## 5 -1.149687 -0.4122382
## Plotting
plot(PC1~PC2, data=shopper_components_data,
     cex = .1, lty = "solid")
text(PC1~PC2, data=shopper_components_data, 
     labels=rownames(shopper_data_clust[c(1:10)]),
     cex=.8)

Finally, we look at how our derived clusters map onto these components:

plot(PC1~PC2, data=shopper_components_data, 
     main= "Online Shopper Intent: PC1 vs PC2 - K-Means Clusters",
     cex = .1, lty = "solid", col=k_mean_clust$cluster)
text(PC1~PC2, data=shopper_components_data, 
     labels=rownames(shopper_data_clust[c(1:10)]),
     cex=.8, col=k_mean_clust$cluster)

We can see there is a lot of overlap between our two clusters (the red overlaps the black), indicating that our model might not be very accurate.

Next, we check the accuracy mathematically.

Precision, Recall, and F1 Score

Precision attempts to answer the following question:

What proportion of positive identifications was actually correct?

While Recall attempts to answer the following question:

What proportion of actual positives was identified correctly?

To fully evaluate the effectiveness of a model, you must examine both precision and recall. Unfortunately, precision and recall are often in tension: improving precision typically reduces recall, and vice versa. Various metrics have been developed that rely on both precision and recall; the F1 score is one such metric.

In statistical analysis of binary classification, the F1 score (also F-score or F-measure) is a measure of a test’s accuracy. It considers both the precision and the recall of the test to compute the score.
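
In terms of confusion-matrix counts (TP = true positives, FP = false positives, FN = false negatives):

Precision = TP / (TP + FP)

Recall = TP / (TP + FN)

F1 = 2 * (Precision * Recall) / (Precision + Recall)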

Prediction

Let’s look at the predictive power of this model.

Below are our accuracy measures:

## Here cluster 1 is treated as predicting the majority class 'FALSE'
precision_kmeans <- t1[1,1]/(sum(t1[1,]))
recall_kmeans <- t1[1,1]/(sum(t1[,1]))
## Precision
precision_kmeans
## [1] 0.8867861
## Recall
recall_kmeans
## [1] 0.5080599

And our K-Means F-Score:

F1_kmeans <- 2*precision_kmeans*recall_kmeans/(precision_kmeans+recall_kmeans)
F1_kmeans
## [1] 0.6460074

6.2 K-Medoids Clustering

Since our data has a lot of '0's, it might make sense to run a medoid-based clustering model, which is less sensitive to extreme values than K-means.

Running the K-Medoids model:

We are asking the model to group our data into two groups (or centers) to be able to predict ‘TRUE’ and ‘FALSE’ Revenue.

k_med_clust <- pam(x = shopper_data_clust, k = 2)
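
PAM builds a full dissimilarity matrix, which can be slow for roughly 12,000 rows. The cluster package's clara(), which runs PAM on repeated random subsamples, is a cheaper alternative; a sketch only (not what was run here, and k_clara_clust is a hypothetical name), converting the factor-coded columns to their integer codes since clara() expects numeric input:

## CLARA: PAM on random subsamples, better suited to larger datasets
k_clara_clust <- clara(data.matrix(shopper_data_clust), k = 2, samples = 50)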

Our findings:

## Row indices of the medoid observations
k_med_clust$id.med
## [1] 6545 6961
## Centers of our clusters (the medoid observations)
k_med_clust$medoids
## Objective Function
k_med_clust$objective
##    build     swap 
## 5.038088 4.649363
## Summary of our clusters
k_med_clust$clusinfo
##      size max_diss  av_diss diameter separation
## [1,] 7113 10.14957 4.785288 16.76444   1.414214
## [2,] 5217  9.64590 4.464040 16.09200   1.414214

Let’s look at our K-Medoids Confusion Matrix:

t1b <- table(k_med_clust$clustering, shopper_data_norm$Revenue)
t1b
##    
##     FALSE TRUE
##   1  6376  737
##   2  4046 1171

We do not see any clear separation of the two classes here either.

Perhaps visualizing the clusters will give us a better picture.

Visualizing our clusters

Performing a PCA

Again, visualizing against our principal components (derived in section 6.1):

plot(PC1~PC2, data=shopper_components_data, 
     main= "Online Shopper Intent: PC1 vs PC2 - K-Medoids Clusters",
     cex = .1, lty = "solid", col=k_med_clust$clustering)
text(PC1~PC2, data=shopper_components_data, 
     labels=rownames(shopper_data_clust[c(1:10)]),
     cex=.8, col=k_med_clust$clustering)

As with our K-means, we see a lot of overlap between our two clusters (the red overlaps the black), although there seems to be better separation along the second principal component (along the x-axis, where PC1 is '0' or below). This suggests that the model might not be very accurate, but it may be somewhat better than our K-means clustering.

We move to check the accuracy mathematically.

Prediction

Let’s look at the predictive power of this model.

Below are our accuracy measures:

precision_kmed <- t1b[1,1]/(sum(t1b[1,]))
recall_kmed <- t1b[1,1]/(sum(t1b[,1]))
## Precision
precision_kmed
## [1] 0.8963869
## Recall
recall_kmed
## [1] 0.6117828

And our K-Medoids F-Score:

F1_kmed <- 2*precision_kmed*recall_kmed/(precision_kmed+recall_kmed)
F1_kmed
## [1] 0.7272313

Summary

We see that clustering by K-means gives us a fairly precise model (0.89) but one with poor recall (0.51), resulting in a low F-score of about 0.65.

In contrast, clustering by K-medoids gives us a similarly precise model (0.89) with a somewhat better, though still poor, recall of 0.61 (compared with the K-means recall of 0.51), resulting in an F-score of about 0.72, which is better than the K-means F-score of 0.65.

We conclude that, given our imbalanced data with only about 12,000 observations (10,422 FALSE against 1,908 TRUE), we would need more data to get a better F-score from clustering algorithms.

7. Classification

7.1 Decision Tree

Data we are feeding to our classification models:

summary(shopper_data_class)
##  Administrative   Administrative_Duration Informational    
##  Min.   : 0.000   Min.   :   0.00         Min.   : 0.0000  
##  1st Qu.: 0.000   1st Qu.:   0.00         1st Qu.: 0.0000  
##  Median : 1.000   Median :   7.50         Median : 0.0000  
##  Mean   : 2.315   Mean   :  80.82         Mean   : 0.5036  
##  3rd Qu.: 4.000   3rd Qu.:  93.26         3rd Qu.: 0.0000  
##  Max.   :27.000   Max.   :3398.75         Max.   :24.0000  
##                                                            
##  Informational_Duration ProductRelated   ProductRelated_Duration
##  Min.   :   0.00        Min.   :  0.00   Min.   :    0.0        
##  1st Qu.:   0.00        1st Qu.:  7.00   1st Qu.:  184.1        
##  Median :   0.00        Median : 18.00   Median :  598.9        
##  Mean   :  34.47        Mean   : 31.73   Mean   : 1194.8        
##  3rd Qu.:   0.00        3rd Qu.: 38.00   3rd Qu.: 1464.2        
##  Max.   :2549.38        Max.   :705.00   Max.   :63973.5        
##                                                                 
##   BounceRates         ExitRates         PageValues        SpecialDay     
##  Min.   :0.000000   Min.   :0.00000   Min.   :  0.000   Min.   :0.00000  
##  1st Qu.:0.000000   1st Qu.:0.01429   1st Qu.:  0.000   1st Qu.:0.00000  
##  Median :0.003112   Median :0.02516   Median :  0.000   Median :0.00000  
##  Mean   :0.022191   Mean   :0.04307   Mean   :  5.889   Mean   :0.06143  
##  3rd Qu.:0.016813   3rd Qu.:0.05000   3rd Qu.:  0.000   3rd Qu.:0.00000  
##  Max.   :0.200000   Max.   :0.20000   Max.   :361.764   Max.   :1.00000  
##                                                                          
##      Month      OperatingSystems    Browser         Region    
##  May    :3364   2      :6601     2      :7961   1      :4780  
##  Nov    :2998   1      :2585     1      :2462   3      :2403  
##  Mar    :1907   3      :2555     4      : 736   4      :1182  
##  Dec    :1727   4      : 478     5      : 467   2      :1136  
##  Oct    : 549   8      :  79     6      : 174   6      : 805  
##  Sep    : 448   6      :  19     10     : 163   7      : 761  
##  (Other):1337   (Other):  13     (Other): 367   (Other):1263  
##   TrafficType              VisitorType     Weekend         Revenue       
##  2      :3913   Returning_Visitor:10551   Mode :logical   Mode :logical  
##  1      :2451   Other            :   85   FALSE:9462      FALSE:10422    
##  3      :2052   New_Visitor      : 1694   TRUE :2868      TRUE :1908     
##  4      :1069                                                            
##  13     : 738                                                            
##  10     : 450                                                            
##  (Other):1657
str(shopper_data_class)
## 'data.frame':    12330 obs. of  18 variables:
##  $ Administrative         : int  0 0 0 0 0 0 0 1 0 0 ...
##  $ Administrative_Duration: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Informational          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Informational_Duration : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ ProductRelated         : int  1 2 1 2 10 19 1 0 2 3 ...
##  $ ProductRelated_Duration: num  0 64 0 2.67 627.5 ...
##  $ BounceRates            : num  0.2 0 0.2 0.05 0.02 ...
##  $ ExitRates              : num  0.2 0.1 0.2 0.14 0.05 ...
##  $ PageValues             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SpecialDay             : num  0 0 0 0 0 0 0.4 0 0.8 0.4 ...
##  $ Month                  : Ord.factor w/ 10 levels "Feb"<"Mar"<"May"<..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ OperatingSystems       : Ord.factor w/ 8 levels "6"<"3"<"7"<"1"<..: 4 6 7 2 2 6 6 4 6 6 ...
##  $ Browser                : Ord.factor w/ 13 levels "9"<"3"<"6"<"7"<..: 5 6 5 6 2 6 9 6 6 9 ...
##  $ Region                 : Ord.factor w/ 9 levels "8"<"6"<"3"<"4"<..: 6 6 9 8 6 6 3 6 8 6 ...
##  $ TrafficType            : Ord.factor w/ 20 levels "12"<"15"<"17"<..: 9 16 7 11 11 7 7 15 7 16 ...
##  $ VisitorType            : Ord.factor w/ 3 levels "Returning_Visitor"<..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Weekend                : logi  FALSE FALSE FALSE FALSE TRUE FALSE ...
##  $ Revenue                : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...

Running the decision tree algorithm from the “rpart” library:

model_dt<- rpart(Revenue ~ . , data = train_data, method="class")
rpart.plot(model_dt)
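
Alongside the plot, the fitted rpart object also stores variable importance scores, which can help confirm which variables drive the splits (a quick check using the model above):

## Variable importance scores from the fitted tree (larger = more influential)
model_dt$variable.importance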

Our predictive model suggests that a Page Value greater than 0.99 leads to a TRUE 57% of the time. On top of this, an effective Bounce Rate above 0 improves our TRUE rate to 75%, and, additionally, pages of Administrative type '5' or below (0, 1, 2, 3, 4, 5) result in a TRUE 83% of the time.

The months of October and November are good months for shoppers’ conversions.

As web developers and designers, we should focus on three metrics: increasing the page value, decreasing the bounce rate, and paying attention to the types of products listed on pages of administrative type '5' and below.

Marketing can either ‘double-down’ on October and November to drive revenue or it can focus on other months to try to bring them up.

Prediction

Let’s look at the prediction of this model on the test dataset (test_data):

pred.train.dt <- predict(model_dt,test_data,type = "class")
mean(pred.train.dt==test_data$Revenue)
## [1] 0.8953347
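
For context, accuracy alone can flatter a model on data this imbalanced: a trivial classifier that always predicts FALSE would already score about 10422 / 12330 ≈ 0.845 on the full dataset. A quick baseline check on the test set (a sketch, not part of the original run):

## Accuracy of always predicting FALSE on the test set
mean(test_data$Revenue == FALSE)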

Our Decision Tree Confusion Matrix:

t2<-table(pred.train.dt,test_data$Revenue)
t2
##              
## pred.train.dt FALSE TRUE
##         FALSE  1989  163
##         TRUE     95  218

Our accuracy measures:

precision_dt <- t2[1,1]/(sum(t2[1,]))
recall_dt <- t2[1,1]/(sum(t2[,1]))
## Precision
precision_dt
## [1] 0.9242565
## Recall
recall_dt
## [1] 0.9544146

and F-Score:

F1_dt <- 2*precision_dt*recall_dt/(precision_dt+recall_dt)
F1_dt
## [1] 0.9390935
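
As a cross-check, caret (already loaded) can produce the same table along with accuracy, sensitivity, and specificity in one call; a sketch, treating 'FALSE' as the positive class to match the calculations above:

## caret's confusionMatrix expects factors with matching levels
confusionMatrix(pred.train.dt, factor(test_data$Revenue), positive = "FALSE")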

Summary

We see that classifying by decision tree gave us a very precise model (0.92) that also has good recall (0.95). The resulting F-score of 0.94 suggests high predictive power for our decision tree model.

8. Conclusion

For our particular set of variables, we found that the decision tree was better able to predict shoppers' purchasing intent than the clustering models, given the limitations of our dataset.

With the limited number of observations and variables, the decision tree had a higher F-score of 0.94, whereas the best clustering model had an F-score of 0.72. With the decision tree, we were able to predict that a consumer is more likely to make a purchase during October and November. We can also increase the chance of a sale by focusing on three metrics: increasing the page value, decreasing the bounce rate, and paying attention to the types of products listed on pages of administrative type '5' and below.

In the future, to make better use of clustering models we would need more variables and observations. Variables such as socio-economic or demographic information would have enabled us to create more meaningful clusters, and more observations would have allowed us to train our clustering models further.

Therefore, we recommend adding additional variables and collecting more observations so that we are better able to analyze and predict shoppers' intentions.