library(ggplot2)
library(gridExtra)
library(GGally)
library(caret)
library(data.table)
library(ggpubr)
library(ROSE)
library(class)
library(tree)
library(dtree)
library(randomForest)
library(mltools)
library(rsample)
library(e1071)
library(pheatmap)
library(keras)
library(dummies)
library(mlbench)
library(reticulate)
library(dplyr)
library(infotheo)
library(praznik)
set.seed(2019)

Abstract

It is the final project of course EECS6690 Statistical Learning of Columbia University, aimed at predicting purchasing decisions of online shoppers by 18 features collected through their browsers and website information. The dataset is from UCI Machine Learning Repository, consisting of 12,330 samples with 10 numerical attributes and 8 categorical attributes. After trying classical machine learning methods like tree-based algorithm and support vector machine, and deep learning method like multilayer perceptron, we chose the best-performed one to predict the purchasing intention and suggested few possible ways to attract more customers to finish with purchasing.

Getting and Cleaning Data

setwd("~/Desktop/Online_Shoppers_Purchasing_Intention")
osi <- read.csv(file = "online_shoppers_intention.csv")
str(osi)

## 'data.frame':    12330 obs. of  18 variables:
##  $ Administrative         : int  0 0 0 0 0 0 0 1 0 0 ...
##  $ Administrative_Duration: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Informational          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Informational_Duration : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ ProductRelated         : int  1 2 1 2 10 19 1 0 2 3 ...
##  $ ProductRelated_Duration: num  0 64 0 2.67 627.5 ...
##  $ BounceRates            : num  0.2 0 0.2 0.05 0.02 ...
##  $ ExitRates              : num  0.2 0.1 0.2 0.14 0.05 ...
##  $ PageValues             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SpecialDay             : num  0 0 0 0 0 0 0.4 0 0.8 0.4 ...
##  $ Month                  : Factor w/ 10 levels "Aug","Dec","Feb",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ OperatingSystems       : int  1 2 4 3 3 2 2 1 2 2 ...
##  $ Browser                : int  1 2 1 2 3 2 4 2 2 4 ...
##  $ Region                 : int  1 1 9 2 1 1 3 1 2 1 ...
##  $ TrafficType            : int  1 2 3 4 4 3 3 5 3 2 ...
##  $ VisitorType            : Factor w/ 3 levels "New_Visitor",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ Weekend                : logi  FALSE FALSE FALSE FALSE TRUE FALSE ...
##  $ Revenue                : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...

There are duplicate records in the dataset, after removing 125 duplicates, 12205 records remain and there is no missing value.

number_duplicate <- nrow(osi[duplicated(osi),])
osi <- osi[!duplicated(osi),]
str(osi)

## 'data.frame':    12205 obs. of  18 variables:
##  $ Administrative         : int  0 0 0 0 0 0 0 1 0 0 ...
##  $ Administrative_Duration: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Informational          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Informational_Duration : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ ProductRelated         : int  1 2 1 2 10 19 1 0 2 3 ...
##  $ ProductRelated_Duration: num  0 64 0 2.67 627.5 ...
##  $ BounceRates            : num  0.2 0 0.2 0.05 0.02 ...
##  $ ExitRates              : num  0.2 0.1 0.2 0.14 0.05 ...
##  $ PageValues             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SpecialDay             : num  0 0 0 0 0 0 0.4 0 0.8 0.4 ...
##  $ Month                  : Factor w/ 10 levels "Aug","Dec","Feb",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ OperatingSystems       : int  1 2 4 3 3 2 2 1 2 2 ...
##  $ Browser                : int  1 2 1 2 3 2 4 2 2 4 ...
##  $ Region                 : int  1 1 9 2 1 1 3 1 2 1 ...
##  $ TrafficType            : int  1 2 3 4 4 3 3 5 3 2 ...
##  $ VisitorType            : Factor w/ 3 levels "New_Visitor",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ Weekend                : logi  FALSE FALSE FALSE FALSE TRUE FALSE ...
##  $ Revenue                : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...

which(is.na(osi))

## integer(0)

The attribute “Month” includes 10 months except January and April, and we change “June” to “Jun” for convenience of plotting barplot in order.

osi$Month <- as.character(osi$Month)
osi$Month[osi$Month == "June"] <- "Jun"
osi$Month <- as.factor(osi$Month)
osi$Month = factor(osi$Month, levels = month.abb)

Exploratory Data Analysis

The first 6 attributes represent number of pages visited of different types and time spent, of which the medians of numbers are 1, 0 and 18 and the medians of time are 9, 0, 608.9 respectively. It illustrates that only a small portion of visitors choose to dig in information about one product, but the probability to explore more about related products is relatively much higher.

plot1 <- ggplot(osi, aes(x=1, y=Administrative)) + geom_violin() + geom_violin(trim=FALSE, fill='#E69F00', color='gray') + coord_flip() + labs(x = " ") + labs(y = "Number of Administrative pages visited") + theme(axis.text.y = element_blank(), axis.ticks = element_blank())
plot2 <- ggplot(osi, aes(x=1, y=Administrative_Duration)) + geom_violin() + geom_violin(trim=FALSE, fill='#E69F00', color='gray') + coord_flip() + labs(x = " ") + labs(y = "Total time spent in Administrative pages") + theme(axis.text.y = element_blank(), axis.ticks = element_blank())
plot3 <- ggplot(osi, aes(x=1, y=Informational)) + geom_violin() + geom_violin(trim=FALSE, fill='#56B4E9', color='gray') + coord_flip() + labs(x = " ") + labs(y = "Number of Informational pages visited") + theme(axis.text.y = element_blank(), axis.ticks = element_blank())
plot4 <- ggplot(osi, aes(x=1, y=Informational_Duration)) + geom_violin() +  geom_violin(trim=FALSE, fill='#56B4E9', color='gray') + coord_flip() + labs(x = " ") + labs(y = "Total time spent in Informational pages") + theme(axis.text.y = element_blank(), axis.ticks = element_blank())
plot5 <- ggplot(osi, aes(x=1, y=ProductRelated)) + geom_violin() + geom_violin(trim=FALSE, fill='#FF9999', color='gray') + coord_flip() + labs(x = " ") + labs(y = "Number of ProductRelated pages visited") + theme(axis.text.y = element_blank(), axis.ticks = element_blank())
plot6 <- ggplot(osi, aes(x=1, y=ProductRelated_Duration)) + geom_violin() + geom_violin(trim=FALSE, fill='#FF9999', color='gray') + coord_flip() + labs(x = " ") + labs(y = "Total time spent in ProductRelated pages") + theme(axis.text.y = element_blank(), axis.ticks = element_blank())
grid.arrange(plot1, plot2, plot3, plot4, plot5, plot6, nrow = 3, ncol = 2)

The customers who completed transaction tended to browse more and spend more time on Administrative and ProductRelated pages while it seems that they spended less time on Informational pages, which is a bit surprising since it means the majority are loyal customers who added items to the cart and click check-out.

plot1 <- ggplot(osi, aes(x=Revenue, y=Administrative)) + geom_violin() + geom_violin(trim=FALSE, fill='#E69F00', color='gray') + labs(x = "Administrative") + labs(y = " ") + theme(axis.text.y = element_blank(), axis.ticks = element_blank())
plot4 <- ggplot(osi, aes(x=Revenue, y=Administrative_Duration)) + geom_violin() + geom_violin(trim=FALSE, fill='#E69F00', color='gray') + labs(x = "Administrative_Duration") + labs(y = " ") + theme(axis.text.y = element_blank(), axis.ticks = element_blank())
plot2 <- ggplot(osi, aes(x=Revenue, y=Informational)) + geom_violin() + geom_violin(trim=FALSE, fill='#56B4E9', color='gray') + labs(x = "Informational") + labs(y = " ") + theme(axis.text.y = element_blank(), axis.ticks = element_blank())
plot5 <- ggplot(osi, aes(x=Revenue, y=Informational_Duration)) + geom_violin() +  geom_violin(trim=FALSE, fill='#56B4E9', color='gray') + labs(x = "Informational_Duration") + labs(y = " ") + theme(axis.text.y = element_blank(), axis.ticks = element_blank())
plot3 <- ggplot(osi, aes(x=Revenue, y=ProductRelated)) + geom_violin() + geom_violin(trim=FALSE, fill='#FF9999', color='gray') + labs(x = "ProductRelated") + labs(y = " ") + theme(axis.text.y = element_blank(), axis.ticks = element_blank())
plot6 <- ggplot(osi, aes(x=Revenue, y=ProductRelated_Duration)) + geom_violin() + geom_violin(trim=FALSE, fill='#FF9999', color='gray') + labs(x = "ProductRelated_Duration") + labs(y = " ") + theme(axis.text.y = element_blank(), axis.ticks = element_blank())
grid.arrange(plot1, plot2, plot3, plot4, plot5, plot6, nrow = 2, ncol = 3)

The “Bounce Rates”, “Exit Rates” and “Page Values” features represent the metrics measured by “Google Analytics” for each page in the e-commerce site. The value of “Bounce Rates” feature for a web page refers to the percentage of visitors who enter the site from that page and then leave without triggering any other requests to the analytics server during that session. The value of “Exit Rates” feature for a specific web page is calculated as for all pageviewers to the page, the percentage that was the last in the session. The “Page Values” feature represents the average value for a web page that a user visited before completing an e-commerce site.

plot1 <- ggdensity(osi, x = "BounceRates", fill = "thistle2", color = "thistle2", add = "median", rug = TRUE) + labs(y = " ")
plot2 <- ggdensity(osi, x = "ExitRates", fill = "skyblue1", color = "skyblue1", add = "median", rug = TRUE) + labs(y = " ")
plot3 <- ggdensity(osi, x = "PageValues", fill = "sienna3", color = "sienna3", add = "median", rug = TRUE) + labs(y = " ")
grid.arrange(plot1, plot2, plot3, nrow = 3)

It shows no significant difference of BounceRates between the two customer categories, the ExitRates of T-customers is in general lower than that of F-customers, because they stayed on the pages with higher probability, the Pagevalues of F-customers is way less than that of T-customers because they spent less time on related pages.

plot1 <- ggplot(osi, aes(x=BounceRates, fill=Revenue)) + geom_density(alpha=0.4) + labs(y = " ")
plot2 <- ggplot(osi, aes(x=ExitRates, fill=Revenue)) + geom_density(alpha=0.4) + labs(y = " ")
plot3 <- ggplot(osi, aes(x=PageValues, fill=Revenue)) + geom_density(alpha=0.4) + labs(y = " ")
grid.arrange(plot1, plot2, plot3, nrow = 3)

The “Special Day” feature indicates the closeness of the site visiting time to a specific special day in which the sessions are more likely to be finalized with transaction. The majority of transaction are done close to none of the special days, since there are merely 1 or 2 holidays but shopping is around 365 days. The maximum value of this feature is 1 on exactly the date of the special day and the minimum is 0 if the date is too far from any of the special days, other nonzero values represent the influence of the closest special day. T-customers were more likely to purchase on non-special days, which is kinda consistent with our observation that the majority of customers are loyal ones so their decisions are less affected by whether it is near special days.

plot1 <- ggplot(osi, aes(x = factor(1), y = SpecialDay)) + geom_boxplot(width = 0.4, fill = "white") + geom_jitter(color = "deepskyblue4", width = 0.1, size = 1, alpha=0.4) + labs(x = "Special Day") + labs(y = "Closeness") + theme(axis.text.x = element_blank(), axis.ticks = element_blank())
plot2 <- ggplot(osi, aes(x = Revenue, y = SpecialDay)) + geom_boxplot(width = 0.4, fill = "white") + geom_jitter(color = "deepskyblue4", width = 0.2, size = 1, alpha=0.4) + labs(x = "Special Day") + labs(y = " ") + theme(axis.ticks = element_blank())
grid.arrange(plot1, plot2, ncol = 2)

The “Month”, “OperatingSystems”, “Browser”, “Region”, “TrafficType” and “Weekend” are straightforward features. The “VisitorType” feature indicates whether the visitor is a new visitor or returning.

The majority of purchasing happened in March, May, November and December, in other words, head and tail of a year. It reasonable since May is the time swithing from winter to spring, more families and individuals buy goods for the new season, not to say, November and December are good time to prepare for Christmas, the most important day in a year, therefore the deals witnessed such a significant rise. There shows no significant difference between the trends of two customer categories, since the reasons like seasons apply to everyone.

plot <- ggplot(data.frame(osi), aes(Month, fill=Revenue)) + geom_bar() + labs(x = "Month") + labs(y = " ")
plot

There are different types of Operating Systems, Browser, Region and Traffic Type, although not given details on what each category means, it might be helpful to classify customers. It shows that number of deals happening in weekdays are approximately 3.5 times that in weekends, there is no big difference between them. Return customers are over 5 times new customers, the reason could be that the majority of customers explore deeper and then complete the transaction.

plot1 <- ggplot(data.frame(osi), aes(OperatingSystems, fill=Revenue)) + geom_bar() + labs(x = "Operating Systems") + labs(y = " ") + scale_x_continuous(breaks = 1:8)
plot2 <- ggplot(data.frame(osi), aes(Browser, fill=Revenue)) + geom_bar() + labs(x = "Browser") + labs(y = " ") + scale_x_continuous(breaks = 1:13)
plot3 <- ggplot(data.frame(osi), aes(Region, fill=Revenue)) + geom_bar() + labs(x = "Region") + labs(y = " ") + scale_x_continuous(breaks = 1:9)
plot4 <- ggplot(data.frame(osi), aes(TrafficType, fill=Revenue)) + geom_bar() + labs(x = "Traffic Type") + labs(y = " ")
plot5 <- ggplot(data.frame(osi), aes(Weekend, fill=Revenue)) + geom_bar() + labs(x = "Weekend") + labs(y = " ")
plot6 <- ggplot(data.frame(osi), aes(VisitorType, fill=Revenue)) + geom_bar() + labs(x = "Visitor Type") + labs(y = " ") + scale_x_discrete(labels = c("New_Visitor" = "New", "Other" = "Other", "Returning_Visitor" = "Return"))
grid.arrange(plot1, plot2, plot3, plot4, plot5, plot6, nrow = 3, ncol = 2)

The target “Revenue” demonstrates that the majority of customers failed to complete the purchasing, which means that the dataset is extremely imbalanced.

plot <- ggplot(data.frame(osi$Revenue), aes(x=osi$Revenue)) + geom_bar() + labs(x = "Target Feature Distribution")
plot

There appears very high-correlated pairs like BounceRates&ExitRates and ProductRelated&ProductRelated_Duration, one of each pairs might be dropped considering their importance to accuracy of our model.

corr_map <- ggcorr(osi[, 1:10], method=c("everything", "pearson"), label=TRUE, hjust = .90, size = 3, layout.exp = 2)
corr_map

Data Preprocessing

Transform categorical attributes into factor types and do one-hot encoding, save a copy of the original data since some methods may not perform better without encoding.

osi <- osi %>% 
  mutate(OperatingSystems = as.factor(OperatingSystems),
         Browser = as.factor(Browser),
         Region = as.factor(Region),
         TrafficType = as.factor(TrafficType),
         VisitorType = as.factor(VisitorType),
         Weekend = as.integer(Weekend),
         Revenue = as.integer(Revenue)
         )

ori <- osi
ori$Revenue <- as.factor(ori$Revenue)
print("Original dataset")

## [1] "Original dataset"

print(str(ori))

## 'data.frame':    12205 obs. of  18 variables:
##  $ Administrative         : int  0 0 0 0 0 0 0 1 0 0 ...
##  $ Administrative_Duration: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Informational          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Informational_Duration : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ ProductRelated         : int  1 2 1 2 10 19 1 0 2 3 ...
##  $ ProductRelated_Duration: num  0 64 0 2.67 627.5 ...
##  $ BounceRates            : num  0.2 0 0.2 0.05 0.02 ...
##  $ ExitRates              : num  0.2 0.1 0.2 0.14 0.05 ...
##  $ PageValues             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SpecialDay             : num  0 0 0 0 0 0 0.4 0 0.8 0.4 ...
##  $ Month                  : Factor w/ 12 levels "Jan","Feb","Mar",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ OperatingSystems       : Factor w/ 8 levels "1","2","3","4",..: 1 2 4 3 3 2 2 1 2 2 ...
##  $ Browser                : Factor w/ 13 levels "1","2","3","4",..: 1 2 1 2 3 2 4 2 2 4 ...
##  $ Region                 : Factor w/ 9 levels "1","2","3","4",..: 1 1 9 2 1 1 3 1 2 1 ...
##  $ TrafficType            : Factor w/ 20 levels "1","2","3","4",..: 1 2 3 4 4 3 3 5 3 2 ...
##  $ VisitorType            : Factor w/ 3 levels "New_Visitor",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ Weekend                : int  0 0 0 0 1 0 0 1 0 0 ...
##  $ Revenue                : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## NULL

osi <- one_hot(as.data.table(osi))
osi$Revenue <- as.factor(osi$Revenue)
print("After one-hot encoding")

## [1] "After one-hot encoding"

str(osi)

## Classes 'data.table' and 'data.frame':   12205 obs. of  77 variables:
##  $ Administrative               : int  0 0 0 0 0 0 0 1 0 0 ...
##  $ Administrative_Duration      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Informational                : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Informational_Duration       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ ProductRelated               : int  1 2 1 2 10 19 1 0 2 3 ...
##  $ ProductRelated_Duration      : num  0 64 0 2.67 627.5 ...
##  $ BounceRates                  : num  0.2 0 0.2 0.05 0.02 ...
##  $ ExitRates                    : num  0.2 0.1 0.2 0.14 0.05 ...
##  $ PageValues                   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ SpecialDay                   : num  0 0 0 0 0 0 0.4 0 0.8 0.4 ...
##  $ Month_Jan                    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Month_Feb                    : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Month_Mar                    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Month_Apr                    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Month_May                    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Month_Jun                    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Month_Jul                    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Month_Aug                    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Month_Sep                    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Month_Oct                    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Month_Nov                    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Month_Dec                    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ OperatingSystems_1           : int  1 0 0 0 0 0 0 1 0 0 ...
##  $ OperatingSystems_2           : int  0 1 0 0 0 1 1 0 1 1 ...
##  $ OperatingSystems_3           : int  0 0 0 1 1 0 0 0 0 0 ...
##  $ OperatingSystems_4           : int  0 0 1 0 0 0 0 0 0 0 ...
##  $ OperatingSystems_5           : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ OperatingSystems_6           : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ OperatingSystems_7           : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ OperatingSystems_8           : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Browser_1                    : int  1 0 1 0 0 0 0 0 0 0 ...
##  $ Browser_2                    : int  0 1 0 1 0 1 0 1 1 0 ...
##  $ Browser_3                    : int  0 0 0 0 1 0 0 0 0 0 ...
##  $ Browser_4                    : int  0 0 0 0 0 0 1 0 0 1 ...
##  $ Browser_5                    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Browser_6                    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Browser_7                    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Browser_8                    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Browser_9                    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Browser_10                   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Browser_11                   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Browser_12                   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Browser_13                   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Region_1                     : int  1 1 0 0 1 1 0 1 0 1 ...
##  $ Region_2                     : int  0 0 0 1 0 0 0 0 1 0 ...
##  $ Region_3                     : int  0 0 0 0 0 0 1 0 0 0 ...
##  $ Region_4                     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Region_5                     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Region_6                     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Region_7                     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Region_8                     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Region_9                     : int  0 0 1 0 0 0 0 0 0 0 ...
##  $ TrafficType_1                : int  1 0 0 0 0 0 0 0 0 0 ...
##  $ TrafficType_2                : int  0 1 0 0 0 0 0 0 0 1 ...
##  $ TrafficType_3                : int  0 0 1 0 0 1 1 0 1 0 ...
##  $ TrafficType_4                : int  0 0 0 1 1 0 0 0 0 0 ...
##  $ TrafficType_5                : int  0 0 0 0 0 0 0 1 0 0 ...
##  $ TrafficType_6                : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ TrafficType_7                : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ TrafficType_8                : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ TrafficType_9                : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ TrafficType_10               : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ TrafficType_11               : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ TrafficType_12               : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ TrafficType_13               : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ TrafficType_14               : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ TrafficType_15               : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ TrafficType_16               : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ TrafficType_17               : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ TrafficType_18               : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ TrafficType_19               : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ TrafficType_20               : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ VisitorType_New_Visitor      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ VisitorType_Other            : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ VisitorType_Returning_Visitor: int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Weekend                      : int  0 0 0 0 1 0 0 1 0 0 ...
##  $ Revenue                      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  - attr(*, ".internal.selfref")=<externalptr>

Split training and test data.

split_ori <- initial_split(ori, prop = .7, strata = "Revenue")
train_ori <- training(split_ori)
test_ori <- testing(split_ori)

print("Original dataset")

## [1] "Original dataset"

table(train_ori$Revenue) %>% prop.table()

## 
##        0        1 
## 0.843633 0.156367

table(test_ori$Revenue) %>% prop.table()

## 
##         0         1 
## 0.8437585 0.1562415

split <- initial_split(osi, prop = .7, strata = "Revenue")
train_data <- training(split)
test_data <- testing(split)

print("After one-hot encoding")

## [1] "After one-hot encoding"

table(train_data$Revenue) %>% prop.table()

## 
##        0        1 
## 0.843633 0.156367

table(test_data$Revenue) %>% prop.table()

## 
##         0         1 
## 0.8437585 0.1562415

Preprocess the continuous attributes by splitting from categorical ones and bind later.

train_numerical <- train_data[,1:10] 
train_categorical <- train_data[,11:77]
test_numerical <- test_data[,1:10] 
test_categorical = test_data[,11:77]

train_scaled = scale(train_numerical)
test_scaled = scale(test_numerical, center=attr(train_scaled, "scaled:center"), scale=attr(train_scaled, "scaled:scale"))

train_data <- cbind(train_scaled, train_categorical)
test_data <- cbind(test_scaled, test_categorical)

Due to the imbalance of dataset, the classifier will always try to predict the target as False since it achieves higher accuracy, to solve this problem, we oversample the training data to be balanced while leaving test data unchanged.

N_ori = 2*length(which(train_ori$Revenue == 0))
ori_over <- ovun.sample(Revenue~.,data = train_ori, method= 'over', N = N_ori, seed = 2020)$data

N = 2*length(which(train_data$Revenue == 0))
osi_over <- ovun.sample(Revenue~.,data = train_data, method= 'over', N = N, seed = 2020)$data

Split features and target for further use.

features_ori <- setdiff(names(train_ori), "Revenue")
features <- setdiff(names(train_data), "Revenue")

Prediction

Naive Bayes Classifier

x_ori <- train_ori[, features_ori]
y_ori <- train_ori$Revenue
x <- train_data[, ..features]
y <- train_data$Revenue

train_control <- trainControl(
  method = "cv",
  number = 10
)

nb.ml_ori <- caret::train(
  x = x_ori,
  y = y_ori,
  method = "nb",
  trControl = train_control
)

nb.ml <- caret::train(
  x = x,
  y = y,
  method = "nb",
  trControl = train_control
)

print("Without one-hot encoding")

## [1] "Without one-hot encoding"

print(confusionMatrix(nb.ml_ori))

## Cross-Validated (10 fold) Confusion Matrix 
## 
## (entries are percentual average cell counts across resamples)
##  
##           Reference
## Prediction    0    1
##          0 72.1  4.7
##          1 12.2 11.0
##                            
##  Accuracy (average) : 0.831

print("With one-hot encoding")

## [1] "With one-hot encoding"

print(confusionMatrix(nb.ml))

## Cross-Validated (10 fold) Confusion Matrix 
## 
## (entries are percentual average cell counts across resamples)
##  
##           Reference
## Prediction    0    1
##          0 84.4 15.6
##          1  0.0  0.0
##                             
##  Accuracy (average) : 0.8435

First we try to predict using NBC, which is fast and scalable. NBC classifies a new customer by conditional probabilities of all the features, picks the class with highest probability: \(P(H|Multiple Evidences) = P(E1|H)*P(E2|H)*...*P(En|H)*P(H)/P(Multiple Evidences)\), The accuracy is 83.0995795%, even worse than guessing all are negative, it fails to predict any positive case right if using one-hot encoding, but a good start to explore other methods since it is relatively a simple model.

k-Nearest Neighbour

#Train the model and predict
k_nn <- knn(osi_over[, 1:76], test_data[, 1:76], osi_over$Revenue)

#Confusion Matrix and Metrics
print("Default k-NN")

## [1] "Default k-NN"

CM_knn_default <- confusionMatrix(k_nn, factor(test_data$Revenue))
print(CM_knn_default)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 2824  306
##          1  265  266
##                                           
##                Accuracy : 0.844           
##                  95% CI : (0.8319, 0.8556)
##     No Information Rate : 0.8438          
##     P-Value [Acc > NIR] : 0.49300         
##                                           
##                   Kappa : 0.3907          
##                                           
##  Mcnemar's Test P-Value : 0.09414         
##                                           
##             Sensitivity : 0.9142          
##             Specificity : 0.4650          
##          Pos Pred Value : 0.9022          
##          Neg Pred Value : 0.5009          
##              Prevalence : 0.8438          
##          Detection Rate : 0.7714          
##    Detection Prevalence : 0.8550          
##       Balanced Accuracy : 0.6896          
##                                           
##        'Positive' Class : 0               
##

#Visualize accuracies of different k
k_nn <- NULL
errors <- NULL

for (i in 1:30) {
  k_nn <- knn(osi_over[, 1:76], test_data[, 1:76], osi_over$Revenue, k = i)
  errors[i] <- mean(k_nn != test_data$Revenue)
}

knn.error <- as.data.frame(cbind(k=1:30, errors))

ggplot(knn.error, aes(k, errors)) +
  geom_point() +
  geom_line() +
  scale_x_continuous(breaks = 1:30) +
  theme_bw() +
  xlab("Value of K") +
  ylab("Error")

#Result of the best-performance model
k_nn <- knn(osi_over[, 1:76], test_data[, 1:76], osi_over$Revenue, k=1)
print("Best-performance k-NN")

## [1] "Best-performance k-NN"

CM_knn_best <- confusionMatrix(k_nn, factor(test_data$Revenue))
print(CM_knn_best)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 2824  306
##          1  265  266
##                                           
##                Accuracy : 0.844           
##                  95% CI : (0.8319, 0.8556)
##     No Information Rate : 0.8438          
##     P-Value [Acc > NIR] : 0.49300         
##                                           
##                   Kappa : 0.3907          
##                                           
##  Mcnemar's Test P-Value : 0.09414         
##                                           
##             Sensitivity : 0.9142          
##             Specificity : 0.4650          
##          Pos Pred Value : 0.9022          
##          Neg Pred Value : 0.5009          
##              Prevalence : 0.8438          
##          Detection Rate : 0.7714          
##    Detection Prevalence : 0.8550          
##       Balanced Accuracy : 0.6896          
##                                           
##        'Positive' Class : 0               
##

Then we try k-NN, an unsupervised clustering algorithm, by default it achieves accuracy of 84.4031685%, we further try visualizing model errors of different k, which shows that k=1 is the one that achieves the highest accuracy. Its result is approximately identical to NBC, but it is still reasonable since there are continuous and categorical attributes and k-nn is an expert on cluster, not on classification.

Tree-based methods

#Train the model
my_ctree<-ctree(as.factor(ori_over$Revenue)~.,data=ori_over)

n<-length(names(ori_over)) 
m = ceiling(log2(n))
rf_train<-randomForest(as.factor(ori_over$Revenue)~.,data=ori_over,mtry=m ,ntree=100,importance=TRUE,proximity=TRUE)

#Predict
pred<-predict(my_ctree,newdata=test_ori)
ptree=prune.rpart(my_ctree,cp=my_ctree$Revenue[which.min(my_ctree$Revenue[,"x error"]),"CP"])
pred<-predict(ptree,newdata=test_ori)

pred_2<-predict(rf_train,newdata=test_ori)

#Confusion Matrix and Metrics
print("C4.5")

## [1] "C4.5"

CM_C45 <- confusionMatrix(pred, factor(test_ori$Revenue))
print(CM_C45)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 2617  114
##          1  472  458
##                                           
##                Accuracy : 0.8399          
##                  95% CI : (0.8277, 0.8517)
##     No Information Rate : 0.8438          
##     P-Value [Acc > NIR] : 0.7463          
##                                           
##                   Kappa : 0.5163          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.8472          
##             Specificity : 0.8007          
##          Pos Pred Value : 0.9583          
##          Neg Pred Value : 0.4925          
##              Prevalence : 0.8438          
##          Detection Rate : 0.7148          
##    Detection Prevalence : 0.7460          
##       Balanced Accuracy : 0.8239          
##                                           
##        'Positive' Class : 0               
##

print("Random Forest")

## [1] "Random Forest"

CM_RF <- confusionMatrix(pred_2, factor(test_ori$Revenue))
print(CM_RF)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 2899  176
##          1  190  396
##                                           
##                Accuracy : 0.9             
##                  95% CI : (0.8899, 0.9096)
##     No Information Rate : 0.8438          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.6246          
##                                           
##  Mcnemar's Test P-Value : 0.4968          
##                                           
##             Sensitivity : 0.9385          
##             Specificity : 0.6923          
##          Pos Pred Value : 0.9428          
##          Neg Pred Value : 0.6758          
##              Prevalence : 0.8438          
##          Detection Rate : 0.7919          
##    Detection Prevalence : 0.8399          
##       Balanced Accuracy : 0.8154          
##                                           
##        'Positive' Class : 0               
##

Next, we try some tree-based methods, first we apply C4.5 to build a decision tree and prune it, which achieves accuracy of 83.9934444%, not as good as expected. However tree-based model is effective for classification, we then try Random Forest to build a bagging of 100 different trees, number of variables for each tree is set to \(log_2M\), where M is the number of variables. The accuracy achieves 90.0027315%, it gives us momentumn to try more other methods.

#Train the model
svmfit = svm(as.factor(Revenue)~., data=osi_over, kernel = "linear", scale = FALSE)

#Predict
pred <- predict(svmfit, newdata = test_data)

#Confusion Matrix and Metrics of Linear SVM
print("Linear SVM")

## [1] "Linear SVM"

CM_LSVM <- confusionMatrix(pred, factor(test_data$Revenue))
print(CM_LSVM)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 2784  150
##          1  305  422
##                                           
##                Accuracy : 0.8757          
##                  95% CI : (0.8646, 0.8862)
##     No Information Rate : 0.8438          
##     P-Value [Acc > NIR] : 2.388e-08       
##                                           
##                   Kappa : 0.5755          
##                                           
##  Mcnemar's Test P-Value : 5.213e-13       
##                                           
##             Sensitivity : 0.9013          
##             Specificity : 0.7378          
##          Pos Pred Value : 0.9489          
##          Neg Pred Value : 0.5805          
##              Prevalence : 0.8438          
##          Detection Rate : 0.7604          
##    Detection Prevalence : 0.8014          
##       Balanced Accuracy : 0.8195          
##                                           
##        'Positive' Class : 0               
##

#Tune the model, too slow please never run it:(
#linear.tune = tune.svm(as.factor(Revenue)~., data=osi_over, kernel="linear", cost=c(0.001, 0.01, 0.1, 1, 5, 10, 20), scale = FALSE)

#Predict with best model
#best.linear = linear.tune$best.model
#pred = predict(best.linear, newdata=test_data1)

#Confusion Matrix and Metrics of tuned Linear SVM
#print("Best-performance Linear SVM")
#print(confusionMatrix(pred, factor(test_data1$Revenue)))

Then we try Linear Support Vector Machine by default, which achieves accuracy of 87.5717017%, higher than NBC, and not significant improve by tuning.

#Train the model
svmfit = svm(as.factor(Revenue)~., data=osi_over, kernel = "radial", scale = FALSE)

#Predict
pred <- predict(svmfit, newdata = test_data)

#Confusion Matrix and Metrics of RBF SVM
print("RBF SVM")

## [1] "RBF SVM"

CM_RBFSVM <- confusionMatrix(pred, factor(test_data$Revenue))
print(CM_RBFSVM)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 2773  143
##          1  316  429
##                                           
##                Accuracy : 0.8746          
##                  95% CI : (0.8635, 0.8852)
##     No Information Rate : 0.8438          
##     P-Value [Acc > NIR] : 6.976e-08       
##                                           
##                   Kappa : 0.5766          
##                                           
##  Mcnemar's Test P-Value : 9.885e-16       
##                                           
##             Sensitivity : 0.8977          
##             Specificity : 0.7500          
##          Pos Pred Value : 0.9510          
##          Neg Pred Value : 0.5758          
##              Prevalence : 0.8438          
##          Detection Rate : 0.7574          
##    Detection Prevalence : 0.7965          
##       Balanced Accuracy : 0.8239          
##                                           
##        'Positive' Class : 0               
##

#Tune the model, too slow please never run it:(
#rbf.tune = tune.svm(as.factor(Revenue)~., data=osi_over, kernel="radial", gamma = c(0.01, 0.1, 0.5, 1, 2, 3, 4, 5, 10, 20), cost = c(0.001, 0.01, 0.1, 1, 5, 10, 20), scale = FALSE)

#Predict with best model
#best.rbf = rbf.tune$best.model
#pred = predict(best.rbf, newdata=test_data1)

#Confusion Matrix and Metrics of tuned Linear SVM
#print("Best-performance RBF SVM")
#print(confusionMatrix(pred, factor(test_data1$Revenue)))

Further, Radial-basis-function SVM is applied, by default it achieves accuracy of 87.462442% which is approximately the same as that of Linear SVM, while after tuning it achieves surprisingly little lower accuracy, its specificity increases a bit, since to detect positive ones is more important, the tuned model is said to have better performance.

Finally, we try to predict with a simple Multilayer Perceptron, of which the input layer consists of 76 neurons, the hidden layer consists of 10 neurons and the output layer consists of 2 neurons.

train_x <- data.matrix(osi_over[, features])
train_y <- data.matrix(dummy.data.frame(as.data.frame(osi_over$Revenue)))
test_x <- data.matrix(test_data[, ..features])
test_y <- data.matrix(dummy.data.frame(as.data.frame(test_data$Revenue)))

model <- keras_model_sequential()
model %>%
  layer_dense(units = 10, activation = "relu", input_shape = 76, name = "hidden_layer")%>%
  layer_dropout(rate = 0.25)%>%
  layer_dense(units = 2, activation = "softmax")

print(summary(model))

model%>%compile(
  loss = "categorical_crossentropy",
  optimizer = optimizer_adam(),
  metrics = "accuracy"
)

model_history <- model%>% fit(train_x, train_y, epochs = 100, batch_size = 1000, validation_split = 0.3, verbose = 0)

plot(model_history) + theme_bw()

model_evaluate <- model %>% evaluate(test_x, test_y)
print("Accuracy of best performance MLP: ")

model_accuracy <- model_evaluate$accuracy
print(model_accuracy)

model_weight <- as.matrix(model$weights[[1]])
pheatmap(t(model_weight), cluster_rows = F, cluster_cols = F, labels_row = 1:10, labels_col = 1:76, main = "hidden layer weight", cellwidth = 5.5, cellheight = 20, fontsize = 12, fontsize_col = 6)

It achieves accuracy of 88.9374495%. Hence the Random Forest model performs the best among all the algorithms we have implemented, we further explore whether we could even improve the performance more.

Optimization

#Train the model
n<-length(names(ori_over)) 
m = ceiling(log2(n))
rf_train<-randomForest(as.factor(ori_over$Revenue)~.,data=ori_over,mtry=m ,ntree=100,importance=TRUE,proximity=TRUE)

#Predict
pred_2<-predict(rf_train,newdata=test_ori)

#Confusion Matrix and Metrics
print("Before feature selection")

## [1] "Before feature selection"

CM_RF <- confusionMatrix(pred_2, factor(test_ori$Revenue))
print(CM_RF)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 2900  168
##          1  189  404
##                                           
##                Accuracy : 0.9025          
##                  95% CI : (0.8924, 0.9119)
##     No Information Rate : 0.8438          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.6356          
##                                           
##  Mcnemar's Test P-Value : 0.2898          
##                                           
##             Sensitivity : 0.9388          
##             Specificity : 0.7063          
##          Pos Pred Value : 0.9452          
##          Neg Pred Value : 0.6813          
##              Prevalence : 0.8438          
##          Detection Rate : 0.7921          
##    Detection Prevalence : 0.8380          
##       Balanced Accuracy : 0.8226          
##                                           
##        'Positive' Class : 0               
##

First, we try Recursive Feature Elimination to build Random Forest model with all possible subsets of features. The result shows that PageValues is the most important attribute, Month, ExitRates and ProductRelated/ProductRelated_Duration are among the most important ones, consistent with the result of varImPlot. Besides, BounceRates is important as well, but since it is highly correlated with ExitRates, it is reasonable to leave it out.

#Too slow, please never run it:(
#control <- rfeControl(functions = rfFuncs, method = "cv", number = 10)
#results <- rfe(ori_over[, 1:17], ori_over[, 18], sizes = c(1:17), rfeControl = control)

#A copy of the result
RFE_Accuracy = c(0.8676473, 0.8463514, 0.8475305, 0.8997643, 0.9289016, 0.9537330, 0.9589354, 0.9612942, 0.9630973, 0.9652478, 0.9652478, 0.9651092, 0.9659416, 0.9648317, 0.9655947, 0.9648317, 0.9658723)
RFE_Rank = c("PageValues", "Month", "Region", "TrafficType", "ProductRelated", "ExitRates", "ProductRelated_Duration", "Browser", "BounceRates", "Administrative_Duration", "Administrative", "Informational_Duration", "OperatingSystems")
plot(x = c(1:17), y = RFE_Accuracy, col="deepskyblue4", xlab = "Accuracy", ylab = "Number of features", pch=19)

Function “varImpPlot” measures the importance of features by 2 metrics: MeanDecreaseAccuracy and MeanDecreaseGini, MDA means the decrease of accuracy after exclusion or permutation of a single variable, MDG means the decrease of node impurity.

varImpPlot(rf_train, sort = TRUE, n.var = 17, main = 'Features Importance by RF')

The result of varImpPlot shows that PageValues is definitely the most important one, attributes like Month, ExitRates and ProductRelated/ProductRelated_Duration are among the most important ones as well. Both measurements show that “SpecialDay” is not an important feature, which is consistent with our result of exploratory analysis.

Next, we bin the continuous variables for use of Mutual Information and minimum Redundancy Maximum Relevance Feature Selection.

continous_cols <- ori_over %>%
  select(Administrative_Duration, Informational_Duration, ProductRelated_Duration, BounceRates, ExitRates, PageValues, SpecialDay)
standadized_cols = as.data.frame(scale(continous_cols))

col_names = names(standadized_cols)
binned_cols <- standadized_cols %>% mutate(
  Administrative_Duration=cut(Administrative_Duration, breaks = c(-Inf,-3.5,-2.5,-1.5,-0.5,0.5,1.5,2.5,3.5,Inf), labels = c(-4,-3,-2,-1,0,1,2,3,4)),
  Informational_Duration=cut(Informational_Duration, breaks = c(-Inf,-3.5,-2.5,-1.5,-0.5,0.5,1.5,2.5,3.5,Inf), labels = c(-4,-3,-2,-1,0,1,2,3,4)),
  ProductRelated_Duration=cut(ProductRelated_Duration,breaks = c(-Inf,-3.5,-2.5,-1.5,-0.5,0.5,1.5,2.5,3.5,Inf), labels = c(-4,-3,-2,-1,0,1,2,3,4)),
  BounceRates=cut(BounceRates,breaks = c(-Inf,-3.5,-2.5,-1.5,-0.5,0.5,1.5,2.5,3.5,Inf), labels = c(-4,-3,-2,-1,0,1,2,3,4)),
  ExitRates=cut(ExitRates,breaks = c(-Inf,-3.5,-2.5,-1.5,-0.5,0.5,1.5,2.5,3.5,Inf), labels = c(-4,-3,-2,-1,0,1,2,3,4)),
 PageValues=cut(PageValues,breaks = c(-Inf,-3.5,-2.5,-1.5,-0.5,0.5,1.5,2.5,3.5,Inf), labels = c(-4,-3,-2,-1,0,1,2,3,4)),
  SpecialDay=cut(SpecialDay,breaks = c(-Inf,-3.5,-2.5,-1.5,-0.5,0.5,1.5,2.5,3.5,Inf), labels = c(-4,-3,-2,-1,0,1,2,3,4))
  )

binned <- ori_over %>% mutate(
  Administrative_Duration = binned_cols$Administrative_Duration,
  Informational_Duration = binned_cols$Informational_Duration,
  ProductRelated_Duration = binned_cols$ProductRelated_Duration,
  BounceRates = binned_cols$BounceRates,
  ExitRates = binned_cols$ExitRates,
  PageValues = binned_cols$PageValues,
  SpecialDay = binned_cols$SpecialDay
)

Mutual Information measures how much information the presence/absence a term contributes to making the correct classification on, similar to MDA of varImpPlot.

#MI filter
MI = vector()
for (i in 1:17){
  MI <- c(MI, mutinformation(binned$Revenue, binned[,i]))
}
MI = data.frame(names(binned)[1:17],MI)
 MI <- MI[with(MI, order(-MI)), ]

MI

##        names.binned..1.17.          MI
## 9               PageValues 0.270988845
## 5           ProductRelated 0.078224402
## 8                ExitRates 0.068088292
## 6  ProductRelated_Duration 0.047300727
## 7              BounceRates 0.038667529
## 1           Administrative 0.033709839
## 15             TrafficType 0.032184685
## 2  Administrative_Duration 0.029938867
## 11                   Month 0.028704828
## 3            Informational 0.011165937
## 10              SpecialDay 0.010694771
## 16             VisitorType 0.009919429
## 4   Informational_Duration 0.006601009
## 12        OperatingSystems 0.005996146
## 13                 Browser 0.003155251
## 14                  Region 0.001893571
## 17                 Weekend 0.001059959

The results show that PageValues, ProductRelated/ProductRelated_Duration and ExitRates are still among the most important. The is one potential problem with the method that it ignores the relevance between variables, therefore features like ProductRelated and ProductRelated_Duration are ranked top in the same time, we further use Minimum redundancy feature selection to gain more insights, which takes into account the relevance between features and rank the other relevant ones on the tail.

#mMRM filter
score = MRMR(binned[1:17],binned$Revenue,17)$score
score = as.data.frame(score)
score

##                                 score
## PageValues               0.2709888447
## Month                    0.0099297753
## BounceRates              0.0122496071
## Weekend                 -0.0007819900
## Administrative_Duration -0.0018089058
## VisitorType             -0.0006757124
## ProductRelated_Duration -0.0006680025
## Region                  -0.0069390413
## Informational_Duration  -0.0084860245
## ExitRates               -0.0013032523
## SpecialDay              -0.0077167437
## OperatingSystems        -0.0076168180
## TrafficType             -0.0199032680
## Informational           -0.0255886539
## Browser                 -0.0483869793
## Administrative          -0.0572081062
## ProductRelated          -0.1239455787

The result shows that PageValues, Month, ExitRates and ProductRelated_Duration are still among the most important ones, it is different than the former results that features like ProductRelated have been listed on the tail, after selecting BounceRates, the ExitRates is ranked way behind than its former position. Hence after consideration of the above results, we decide to select the following 12 features to re-train our model.

features_selected = c("Administrative_Duration", "Informational_Duration", "ProductRelated_Duration", "ExitRates", "PageValues", "Month", "OperatingSystems", "Browser", "Region", "TrafficType", "VisitorType", "Weekend")

ori_over <- ori_over[, c(features_selected, "Revenue")]
test_ori <- test_ori[, c(features_selected, "Revenue")]

n<-length(names(ori_over)) 
m = ceiling(log2(n))
rf_train<-randomForest(as.factor(ori_over$Revenue)~.,data=ori_over,mtry=m ,ntree=100,importance=TRUE,proximity=TRUE)

#Predict
pred_2<-predict(rf_train,newdata=test_ori)

#Confusion Matrix and Metrics
print("After feature selection")

## [1] "After feature selection"

CM_RF <- confusionMatrix(pred_2, factor(test_ori$Revenue))
print(CM_RF)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 2872  175
##          1  217  397
##                                           
##                Accuracy : 0.8929          
##                  95% CI : (0.8825, 0.9028)
##     No Information Rate : 0.8438          
##     P-Value [Acc > NIR] : < 2e-16         
##                                           
##                   Kappa : 0.6057          
##                                           
##  Mcnemar's Test P-Value : 0.03838         
##                                           
##             Sensitivity : 0.9298          
##             Specificity : 0.6941          
##          Pos Pred Value : 0.9426          
##          Neg Pred Value : 0.6466          
##              Prevalence : 0.8438          
##          Detection Rate : 0.7845          
##    Detection Prevalence : 0.8323          
##       Balanced Accuracy : 0.8119          
##                                           
##        'Positive' Class : 0               
##

Although the result has not changed much, it is still worth selecting subset of features as it simplifies the model and requires less time to train.

Conclusion

In this project, we explore methods to predict purchasing decision of consumers based on the 17 attributes, it is a well-structured real-world dataset and inspires us a lot to analyze the behavior of customers from both commercial insights and technical perspectives. The lesson worth taken away is that sometimes inference from our daily life experience could be inaccurate, for instance, we would generally believe “SpecialDay” has strong influence on the decision, since it is in weekend we have more free time to browse the online shopping mart, nontheless, the result of our Random Forest model and exploratory data analysis demonstrate that there is no difference on this feature, probably due to the fact the people are more used to spending their time in fration time so whether it is on weekend would not affect much. We may also provide some suggestions based on our results, fon instance, the website owners could re-distribute budget on advertisements, allocate more on months when customers are more likely to finalize their transactions so that it is more likely to lift the return rates, also they could re-design the informational pages to best cater to customers for their longer stay, which further increases the probability leading to purchasing.

Online Shoppers Purchasing Intention

Yuming Liu