# Load package(s) ----
library(tidymodels)
library(tidyverse)
library(stacks)
library(patchwork)
# Handle common conflicts
tidymodels_prefer()
# Set seed
set.seed(3013)
# Load candidate model info ----
load("results/rf_tune.rda")
load("results/glmnet_res.rda")
load("results/knn_res.rda")
load("results/earth_res.rda")
# Load split data object & get testing data
load("data/general_setup.rda")

Long Form

Introduction

The main goal of this project is to use a predictive classification model to predict the purchase intention of online shoppers. We fit four different model classes (random forest, KNN, elastic net, MARS) to the training set, and finally use an ensemble model which includes random forest, elastic net, and MARS. In this report, we describe our data collection, EDA, model setup, feature engineering, model fitting, and final selection processes.

When navigating our repository, data and setup information can be found in the data folder, model-related code in the model_docs folder, and model result files in the results folder.

Data Collection

We obtained the data from the University of California, Irvine (UCI) Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Online+Shoppers+Purchasing+Intention+Dataset). The dataset was collected by a faculty researcher and an information technology specialist in Turkey from an online bookstore built on an osCommerce platform. It records browsing-session information from which we can infer the purchasing intentions of online shoppers, so past visitor behavior and history can be used to predict purchase intention. This is valuable information for sellers, who want to know what leads customers to purchase goods.

Exploratory Data Analysis

1. Initial overview & Quality Check & Missingness

We first read in the dataset and standardized naming conventions using the janitor package. Using naniar and skimr, we find that the dataset contains 12330 observations, 18 variables, and no missingness.

shopper_dat <-
 read_csv("data/unprocessed/online_shoppers_intention.csv") %>%
 janitor::clean_names()

skimr::skim(shopper_dat)

Data summary
Name	shopper_dat
Number of rows	12330
Number of columns	18
_______________________
Column type frequency:
character	2
logical	2
numeric	14
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
month	0	1	3	4	0	10	0
visitor_type	0	1	5	17	0	3	0

Variable type: logical

skim_variable	n_missing	complete_rate	mean	count
weekend	0	1	0.23	FAL: 9462, TRU: 2868
revenue	0	1	0.15	FAL: 10422, TRU: 1908

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
administrative	1	2.32	3.32	0	0.00	1.00	4.00	27.00	▇▁▁▁▁
administrative_duration	1	80.82	176.78	0	0.00	7.50	93.26	3398.75	▇▁▁▁▁
informational	1	0.50	1.27	0	0.00	0.00	0.00	24.00	▇▁▁▁▁
informational_duration	1	34.47	140.75	0	0.00	0.00	0.00	2549.38	▇▁▁▁▁
product_related	1	31.73	44.48	0	7.00	18.00	38.00	705.00	▇▁▁▁▁
product_related_duration	1	1194.75	1913.67	0	184.14	598.94	1464.16	63973.52	▇▁▁▁▁
bounce_rates	1	0.02	0.05	0	0.00	0.00	0.02	0.20	▇▁▁▁▁
exit_rates	1	0.04	0.05	0	0.01	0.03	0.05	0.20	▇▂▁▁▁
page_values	1	5.89	18.57	0	0.00	0.00	0.00	361.76	▇▁▁▁▁
special_day	1	0.06	0.20	0	0.00	0.00	0.00	1.00	▇▁▁▁▁
operating_systems	1	2.12	0.91	1	2.00	2.00	3.00	8.00	▇▂▁▁▁
browser	1	2.36	1.72	1	2.00	2.00	2.00	13.00	▇▁▁▁▁
region	1	3.15	2.40	1	1.00	3.00	4.00	9.00	▇▅▁▂▁
traffic_type	1	4.07	4.03	1	2.00	2.00	4.00	20.00	▇▁▁▁▁

naniar::miss_var_summary(shopper_dat) %>% 
  gt::gt()

variable	n_miss	pct_miss
administrative	0	0
administrative_duration	0	0
informational	0	0
informational_duration	0	0
product_related	0	0
product_related_duration	0	0
bounce_rates	0	0
exit_rates	0	0
page_values	0	0
special_day	0	0
month	0	0
operating_systems	0	0
browser	0	0
region	0	0
traffic_type	0	0
visitor_type	0	0
weekend	0	0
revenue	0	0

2. Outcome variable: Revenue

In the variable revenue, TRUE means that the visitor made a purchase and FALSE means that no purchase was made. The two classes are heavily imbalanced: only 1908 of the 12330 sessions (about 15%) ended in a purchase. We expect to deal with this problem with resampling and feature engineering.

ggplot(shopper_dat, aes(revenue)) +
 geom_bar()
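
To put a number on the imbalance, here is a minimal sketch that rebuilds the revenue column from the counts reported by skimr above (10422 FALSE, 1908 TRUE) and tabulates the class shares. The object toy_revenue is a stand-in for shopper_dat, used only so that the snippet runs on its own.

```r
library(dplyr)

# Stand-in for shopper_dat, rebuilt from the skimr counts shown above
toy_revenue <- data.frame(
  revenue = rep(c(FALSE, TRUE), times = c(10422, 1908))
)

# Count each class and convert the counts to shares of all sessions
class_balance <- toy_revenue %>%
  count(revenue) %>%
  mutate(prop = n / sum(n))

class_balance
```

On the real data, the same pipeline can be run on shopper_dat directly. The FALSE share (roughly 0.845) is also the accuracy a model would reach by always predicting no purchase, which is a useful baseline when reading accuracy values.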

3. Univariate investigation of important predictor variables

The first four plots show substantial imbalances within visitor_type, month, weekend, and special_day that will require feature engineering. The last three histograms (bounce_rates, exit_rates, and page_values) are all positively skewed and will therefore also require feature engineering steps.

ggplot(shopper_dat, aes(visitor_type)) +
 geom_bar()

ggplot(shopper_dat, aes(month)) +
 geom_bar()

ggplot(shopper_dat, aes(weekend)) +
 geom_bar()

ggplot(shopper_dat, aes(special_day)) +
 geom_bar()

ggplot(shopper_dat, aes(bounce_rates)) +
 geom_histogram()

ggplot(shopper_dat, aes(exit_rates)) +
 geom_histogram()

ggplot(shopper_dat, aes(page_values)) +
 geom_histogram()
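
One common way to handle positive skew is a log-type transformation. The sketch below uses simulated right-skewed data (not the project data) and a hand-written skewness helper to show the effect of log1p(), which stays defined at zero. The recipes in this report do not include such a step, so treat this as an illustration of one option rather than what was fitted.

```r
set.seed(3013)

# Simulated right-skewed, non-negative variable (illustrative only)
x <- rexp(1000, rate = 0.2)

# Sample skewness: third central moment divided by the cubed standard deviation
skewness <- function(v) mean((v - mean(v))^3) / sd(v)^3

skew_raw <- skewness(x)
skew_log <- skewness(log1p(x))

# Compare skewness before and after the transformation
c(raw = skew_raw, log1p = skew_log)
```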

3. Relationships between Revenue & predictor variables

These plots show the impact of loyal customers, especially over the weekend: they indicate how well the store retains customers and how many purchases each group accounts for. A steady stream of new customers generally means higher revenue. According to the plots, more purchases were made on weekdays than on weekends, led by new customers rather than returning ones.

p1 <- ggplot(data = shopper_dat, mapping = aes(x = revenue)) +
 geom_bar(mapping = aes(fill = visitor_type)) +
 ggtitle("Revenue on visitor type") +
 xlab("Revenue") +
 ylab("Visitors") +
 theme(legend.position = "bottom")
 
p2 <- ggplot(data = shopper_dat, mapping = aes(x = revenue)) +
 geom_bar(mapping = aes(fill = weekend)) +
 ggtitle("Revenue on weekend status") +
 xlab("Revenue") + ylab("Visitors") +
 theme(legend.position = "bottom")
 
(p1 + p2)

4. Relationships among predictor variables

In this graph, we examined the relationship between time spent on informational pages and time spent on product-related pages. This relationship was hard to predict, since not every customer is looking to buy an item related to the product they originally intended to purchase. The smoothed trend is rather unstable: although it ends in a positive relationship, there are small dips, especially before 1000 seconds of informational_duration.

# informational duration & product related duration
shopper_dat %>%
 ggplot(aes(x = informational_duration, y = product_related_duration)) +
 geom_point(alpha = 0.5) +
 geom_smooth()

The bounce rate of a page is the percentage of visitors who enter the site from that page and leave without triggering any additional requests to the analytics server. The exit rate of a page is the percentage of its page views that were the last in a session. The plot shows a positive relationship: as bounce rates increase, exit rates increase. A high bounce rate may mean that user satisfaction was low, whether because of site errors or because the site was very slow. A high exit rate may point to poorly performing sections of the site, where customers leave and do not come back. It is highly likely that these two are correlated.

# exit rate & bounce rate
ggplot(data = shopper_dat, mapping = aes(x = bounce_rates, y = exit_rates)) +
 geom_point(mapping = aes(color = revenue)) + geom_smooth(se = TRUE, alpha = 0.5) +
 theme_light() +
 ggtitle("Relationship between Exit Rates and Bounce Rates") +
 xlab("Bounce Rates") +
 ylab("Exit Rates")
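
To make the two definitions concrete, the sketch below computes per-page bounce and exit rates from a small, entirely hypothetical page-view log. It assumes the usual web-analytics convention: bounce rate is single-page sessions divided by sessions that entered on the page, and exit rate is exits from the page divided by views of the page. The column names (session, step, page) are made up for illustration and do not appear in the dataset.

```r
library(dplyr)

# Hypothetical page-view log: one row per page view, ordered within session
pageviews <- data.frame(
  session = c(1, 2, 2, 3, 3, 4, 4, 4),
  step    = c(1, 1, 2, 1, 2, 1, 2, 3),
  page    = c("A", "A", "B", "B", "A", "A", "B", "C")
)

page_metrics <- pageviews %>%
  group_by(session) %>%
  mutate(
    is_entrance = step == min(step),  # first page of the session
    is_exit     = step == max(step),  # last page of the session
    is_bounce   = n() == 1            # session with a single page view
  ) %>%
  group_by(page) %>%
  summarise(
    views       = n(),
    bounce_rate = sum(is_bounce) / sum(is_entrance),  # NaN if never an entry page
    exit_rate   = sum(is_exit) / views
  )

page_metrics
```

A bounce is always also an exit, so pages with many bounces tend to have high exit rates as well, which is consistent with the positive relationship in the plot.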

5. Corrplot

shopper_dat %>%
  select_if(is.numeric) %>%
  cor() %>%
  corrplot::corrplot()

Data Cleaning, Splitting, Resampling & Feature Engineering

Data Cleaning

The dataset contains 10 numeric variables and 8 categorical variables, but four of the categorical variables (operating_systems, browser, region, and traffic_type) are coded as numbers in the dataset. We decided to leave them as numeric, since converting them into factors would create too many factor levels. We converted the other four categorical variables, which were read in as character (month, visitor_type) or logical (weekend, revenue), into factors for model fitting.

shopper_dat <- shopper_dat %>% 
  mutate(
    month = as.factor(month),
    weekend = as.factor(weekend),
    visitor_type = as.factor(visitor_type),
    revenue = as.factor(revenue)
  )

Data Splitting & Resampling

We split the data 70/30 into training and testing sets, stratified by revenue, the outcome variable. Stratifying keeps the class balance of both sets as close as possible to that of the original dataset.

We then resampled the training data using V-fold cross-validation with 5 folds and 3 repeats: the training data is split into 5 folds, the split is repeated three times, and performance metrics are averaged over the resulting 15 resamples.

shopper_split <- initial_split(shopper_dat, prop = 0.7, strata = revenue)
 
shopper_train <- training(shopper_split)
shopper_test <- testing(shopper_split)
 
# Resample the training set only, so the testing set stays out of model tuning
shopper_folds <- vfold_cv(shopper_train, v = 5, repeats = 3, strata = revenue)
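
As a quick check of what this setup produces, the sketch below repeats the split and resampling on simulated data with roughly the same class balance as revenue. The objects toy_dat, toy_split, toy_train, toy_test, and toy_folds are hypothetical stand-ins, not project objects.

```r
library(tidymodels)

set.seed(3013)

# Toy data with about 15% positives, mimicking revenue (illustrative only)
toy_dat <- tibble(
  revenue = factor(rep(c("FALSE", "TRUE"), times = c(850, 150))),
  x = rnorm(1000)
)

# 70/30 split, stratified by the outcome
toy_split <- initial_split(toy_dat, prop = 0.7, strata = revenue)
toy_train <- training(toy_split)
toy_test  <- testing(toy_split)

# Resample the training set only: 5 folds x 3 repeats = 15 resamples
toy_folds <- vfold_cv(toy_train, v = 5, repeats = 3, strata = revenue)

nrow(toy_folds)                    # number of resamples
mean(toy_train$revenue == "TRUE")  # share of purchases in the training set
mean(toy_test$revenue == "TRUE")   # share of purchases in the testing set
```

Because of the stratification, the share of purchases should be nearly identical in the training and testing sets.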

Feature Engineering

The recipe for each model differed based on the respective requirements. We used the following recipe for random forest and k-nearest neighbors:

shopper_recipe <- recipe(revenue ~ ., data = shopper_train) %>%  
  step_novel(all_nominal_predictors()) %>%
  step_nzv(all_predictors()) %>%
  step_normalize(all_numeric_predictors())

However, we modified it slightly for elastic net regression and MARS, which require all predictors to be numeric.

shopper_recipe <- 
  recipe(formula = revenue ~ ., data = shopper_train) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_nzv(all_predictors()) %>%
  step_normalize(all_numeric_predictors())
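
To see what the dummy-encoding recipe does to the predictors, the sketch below preps and bakes the same three steps on a tiny hand-made training set. The data frame toy_train and its values are invented for illustration; only the column names and recipe steps mirror the project.

```r
library(tidymodels)

# Tiny invented training set with one nominal and two numeric predictors
toy_train <- data.frame(
  revenue = factor(c("TRUE", "FALSE", "FALSE", "TRUE", "FALSE", "FALSE")),
  visitor_type = factor(c(
    "New_Visitor", "Returning_Visitor", "Returning_Visitor",
    "New_Visitor", "Other", "Returning_Visitor"
  )),
  page_values = c(12.5, 0, 0, 30.1, 0, 4.2),
  exit_rates = c(0.01, 0.05, 0.2, 0.02, 0.1, 0.03)
)

toy_recipe <- recipe(revenue ~ ., data = toy_train) %>%
  step_dummy(all_nominal_predictors()) %>%   # factors -> 0/1 indicator columns
  step_nzv(all_predictors()) %>%             # drop near-zero-variance columns
  step_normalize(all_numeric_predictors())   # center and scale

# Estimate the steps on the training data and return the processed data
baked <- toy_recipe %>%
  prep() %>%
  bake(new_data = NULL)

glimpse(baked)
```

After baking, every predictor is numeric and centered at zero, which is the form that glmnet and earth expect.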

Through trial and error of including different subsets of the predictors, we found that changing the composition of predictors did not make any significant difference in the accuracy metrics.

Model Fitting

The four models we chose are an elastic net model, a random forest model, a MARS (earth) model, and a k-nearest neighbors model. We updated the tuning parameters for each of these models: we set specific ranges (upper and lower limits) for the parameters that needed fine adjustment, then built a tuning grid of candidate combinations of parameter values. We created a workflow for each model, ran the tuning in each model's own R script, and saved the results to files that are loaded into this long-form output.
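
The general pattern of updating a parameter range and building a grid can be sketched as follows for the random forest. The range for mtry, the choice to tune min_n, and the number of grid levels are assumptions made for illustration; the values actually used live in the model scripts and are not reproduced here.

```r
library(tidymodels)

# Random forest specification with two parameters marked for tuning
rf_model <- rand_forest(mtry = tune(), min_n = tune()) %>%
  set_engine("ranger") %>%
  set_mode("classification")

# Set explicit lower and upper limits for mtry (hypothetical range)
rf_params <- extract_parameter_set_dials(rf_model) %>%
  update(mtry = mtry(range = c(2, 10)))

# Regular grid: every combination of 3 values per parameter
rf_grid <- grid_regular(rf_params, levels = 3)

rf_grid
```

In the project, a specification like this would be combined with the recipe in a workflow() and passed to tune_grid() together with shopper_folds.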

Tuning Parameters

Results Comparison

rbind(
  # elastic net
  glmnet_res %>% 
    show_best(metric = "accuracy") %>% 
    slice(1:1) %>% 
    mutate(model = "elastic_net") %>% 
    select(mean, model), 
  # knn
  knn_res %>% 
    show_best(metric = "accuracy") %>% 
    slice(1:1) %>% 
    mutate(model = "knn") %>% 
    select(mean, model), 
  # rf
  rf_tune %>% 
    show_best(metric = "accuracy") %>% 
    slice(1:1) %>% 
    mutate(model = "rf") %>% 
    select(mean, model),
  # mars
  earth_res %>% 
    show_best(metric = "accuracy") %>% 
    slice(1:1) %>% 
    mutate(model = "mars") %>% 
    select(mean, model) 
) %>% 
  select(model, mean) %>% 
  rename(accuracy = mean) %>% 
  arrange(desc(accuracy)) %>% 
  gt::gt()

model	accuracy
rf	0.9038112
mars	0.9000265
elastic_net	0.8845357
knn	0.8794536

Conclusion

This project explored a dataset from the world of e-commerce, which gives customers access to more products online through multiple platforms. A few of these platforms, such as Amazon, eBay, and Etsy, surround consumers with advertisements, marketing strategies, and hyperbole to win their attention. To understand the e-commerce market and how people think before they commit to a purchase, we looked into this dataset to see what drives people’s intentions. The outcome variable is revenue, which shows whether or not the visitor made a purchase online.

After an initial exploratory data analysis, we built a recipe and identified crucial predictor variables such as page_values, bounce_rates, exit_rates, and visitor_type. Using these predictors and the cross-validation folds, we tuned four models: elastic net, random forest, MARS, and k-nearest neighbors. All of these models returned a respectable accuracy, but random forest with an mtry value of 10 was the winning model, with a cross-validated accuracy of 0.904. When applied to the testing set, its accuracy was 0.903, which suggests that we neither overfitted nor underfitted. This gave us confidence in our data cleaning/manipulation and recipe.

Final Report

Evelyn Long, Heejun Park, Joe Omatoi

2022-06-06