# Load package(s) ----
library(tidymodels)
library(tidyverse)
library(stacks)
library(patchwork)
# Handle common conflicts
tidymodels_prefer()
# Set seed
set.seed(3013)
# Load candidate model info ----
load("results/rf_tune.rda")
load("results/glmnet_res.rda")
load("results/knn_res.rda")
load("results/earth_res.rda")
# Load split data object & get testing data
load("data/general_setup.rda")
The main goal of this project is to build a classification model that predicts the purchase intention of online shoppers. We fit four model classes (random forest, k-nearest neighbors, elastic net, and MARS) to the training set, then build an ensemble model that combines random forest, elastic net, and MARS. In this report, we describe our data collection, EDA, model setup, feature engineering, model fitting, and final selection processes.
When navigating our repository, data and setup information can be
found in the data folder. Model-related code can be found
in the model_docs folder. Model result files can be found
in the results folder.
We obtained the data from the University of California, Irvine (UCI) Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Online+Shoppers+Purchasing+Intention+Dataset). The dataset was collected by a faculty researcher and an information technology specialist in Turkey from an online bookstore built on an osCommerce platform. It records browsing-session information from which we can predict the purchase intention of online shoppers based on their past behavior and history. This information is valuable to sellers, who want to know what leads customers to purchase goods.

## Exploratory Data Analysis

### 1. Initial Overview, Quality Check, and Missingness

We first read in the dataset and standardized the naming conventions using the janitor package. Using naniar and skimr, we find that the dataset contains 12330 observations and 18 variables, with no missingness.
shopper_dat <-
read_csv("data/unprocessed/online_shoppers_intention.csv") %>%
janitor::clean_names()
skimr::skim(shopper_dat)
| Name | shopper_dat |
| Number of rows | 12330 |
| Number of columns | 18 |
| _______________________ | |
| Column type frequency: | |
| character | 2 |
| logical | 2 |
| numeric | 14 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| month | 0 | 1 | 3 | 4 | 0 | 10 | 0 |
| visitor_type | 0 | 1 | 5 | 17 | 0 | 3 | 0 |
Variable type: logical
| skim_variable | n_missing | complete_rate | mean | count |
|---|---|---|---|---|
| weekend | 0 | 1 | 0.23 | FAL: 9462, TRU: 2868 |
| revenue | 0 | 1 | 0.15 | FAL: 10422, TRU: 1908 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| administrative | 0 | 1 | 2.32 | 3.32 | 0 | 0.00 | 1.00 | 4.00 | 27.00 | ▇▁▁▁▁ |
| administrative_duration | 0 | 1 | 80.82 | 176.78 | 0 | 0.00 | 7.50 | 93.26 | 3398.75 | ▇▁▁▁▁ |
| informational | 0 | 1 | 0.50 | 1.27 | 0 | 0.00 | 0.00 | 0.00 | 24.00 | ▇▁▁▁▁ |
| informational_duration | 0 | 1 | 34.47 | 140.75 | 0 | 0.00 | 0.00 | 0.00 | 2549.38 | ▇▁▁▁▁ |
| product_related | 0 | 1 | 31.73 | 44.48 | 0 | 7.00 | 18.00 | 38.00 | 705.00 | ▇▁▁▁▁ |
| product_related_duration | 0 | 1 | 1194.75 | 1913.67 | 0 | 184.14 | 598.94 | 1464.16 | 63973.52 | ▇▁▁▁▁ |
| bounce_rates | 0 | 1 | 0.02 | 0.05 | 0 | 0.00 | 0.00 | 0.02 | 0.20 | ▇▁▁▁▁ |
| exit_rates | 0 | 1 | 0.04 | 0.05 | 0 | 0.01 | 0.03 | 0.05 | 0.20 | ▇▂▁▁▁ |
| page_values | 0 | 1 | 5.89 | 18.57 | 0 | 0.00 | 0.00 | 0.00 | 361.76 | ▇▁▁▁▁ |
| special_day | 0 | 1 | 0.06 | 0.20 | 0 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| operating_systems | 0 | 1 | 2.12 | 0.91 | 1 | 2.00 | 2.00 | 3.00 | 8.00 | ▇▂▁▁▁ |
| browser | 0 | 1 | 2.36 | 1.72 | 1 | 2.00 | 2.00 | 2.00 | 13.00 | ▇▁▁▁▁ |
| region | 0 | 1 | 3.15 | 2.40 | 1 | 1.00 | 3.00 | 4.00 | 9.00 | ▇▅▁▂▁ |
| traffic_type | 0 | 1 | 4.07 | 4.03 | 1 | 2.00 | 2.00 | 4.00 | 20.00 | ▇▁▁▁▁ |
naniar::miss_var_summary(shopper_dat) %>%
gt::gt()
| variable | n_miss | pct_miss |
|---|---|---|
| administrative | 0 | 0 |
| administrative_duration | 0 | 0 |
| informational | 0 | 0 |
| informational_duration | 0 | 0 |
| product_related | 0 | 0 |
| product_related_duration | 0 | 0 |
| bounce_rates | 0 | 0 |
| exit_rates | 0 | 0 |
| page_values | 0 | 0 |
| special_day | 0 | 0 |
| month | 0 | 0 |
| operating_systems | 0 | 0 |
| browser | 0 | 0 |
| region | 0 | 0 |
| traffic_type | 0 | 0 |
| visitor_type | 0 | 0 |
| weekend | 0 | 0 |
| revenue | 0 | 0 |
In the variable revenue, TRUE means that the visitor
made a purchase and FALSE indicates that no purchase was made. The target
variable is heavily imbalanced: only 1908 of the 12330 sessions (about
15%) ended in a purchase. We expect to deal with this problem through
resampling and feature engineering.
ggplot(shopper_dat, aes(revenue)) +
geom_bar()
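To put a number on the imbalance, the class shares can be computed directly from the counts reported in the skim output above. This is a minimal sketch that hard-codes those counts instead of reading the CSV; with the data loaded, the same shares would come from counting `revenue` in `shopper_dat`.

```r
# Counts of the outcome taken from the skim summary (FAL: 10422, TRU: 1908)
revenue_counts <- c(`FALSE` = 10422, `TRUE` = 1908)

# Share of sessions in each class
revenue_props <- prop.table(revenue_counts)
round(revenue_props, 3)

# Accuracy of a naive classifier that always predicts the majority class
baseline_accuracy <- max(revenue_props)
round(baseline_accuracy, 3)
```

About 15.5% of sessions end in a purchase, so a model that always predicts FALSE would already reach roughly 84.5% accuracy. That figure is a useful reference point when reading the accuracy values reported later.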
The first four plots show substantial imbalance across the levels of visitor_type, month, weekend, and special_day, which will require feature engineering. The last three histograms are all positively skewed and will therefore also require feature engineering steps.
ggplot(shopper_dat, aes(visitor_type)) +
geom_bar()
ggplot(shopper_dat, aes(month)) +
geom_bar()
ggplot(shopper_dat, aes(weekend)) +
geom_bar()
ggplot(shopper_dat, aes(special_day)) +
geom_bar()
ggplot(shopper_dat, aes(bounce_rates)) +
geom_histogram()
ggplot(shopper_dat, aes(exit_rates)) +
geom_histogram()
ggplot(shopper_dat, aes(page_values)) +
geom_histogram()
These plots show how purchases break down by visitor type and by weekend status. They indicate how well the store retains returning customers and how much each group contributes to purchases. A steady flow of new customers can mean higher revenue. According to the plots, more purchases were made on weekdays than on weekends, led by new customers rather than returning ones.
p1 <- ggplot(data = shopper_dat, mapping = aes(x = revenue)) +
geom_bar(mapping = aes(fill = visitor_type)) +
ggtitle("Revenue on visitor type") +
xlab("Revenue") +
ylab("Visitors") +
theme(legend.position = "bottom")
p2 <- ggplot(data = shopper_dat, mapping = aes(x = revenue)) +
geom_bar(mapping = aes(fill = weekend)) +
ggtitle("Revenue on weekend status") +
xlab("Revenue") + ylab("Visitors") +
theme(legend.position = "bottom")
(p1 + p2)
In this graph, we examined the relationship between time spent on
informational pages and time spent on product-related pages. This
relationship was hard to predict, as not every customer is looking to
buy an item related to the product they originally intended to
purchase. The smoothed trend is rather unstable: although it ends on a
positive relationship, there are small dips, especially before 1000
seconds of informational_duration.
# informational duration & product related duration
shopper_dat %>%
ggplot(aes(x = informational_duration, y = product_related_duration)) +
geom_point(alpha = 0.5) +
geom_smooth()
The bounce rate of a page is the percentage of visitors who enter the site from that page and leave without triggering any additional requests to the analytics server. The exit rate of a page is the percentage of its pageviews that were the last in the session. The plot shows a positive relationship: as bounce rates increase, exit rates increase. A high bounce rate may indicate low user satisfaction, whether due to site errors or slow loading. A high exit rate may point to lower-performing pages, where customers leave and do not come back. It is highly likely that these two variables are correlated.
# exit rate & bounce rate
ggplot(data = shopper_dat, mapping = aes(x = bounce_rates, y = exit_rates)) +
geom_point(mapping = aes(color = revenue)) +
geom_smooth(se = TRUE, alpha = 0.5) +
theme_light() +
ggtitle("Relationship between Exit Rates and Bounce Rates") +
# axis labels match the aesthetics: bounce_rates on x, exit_rates on y
xlab("Bounce Rates") +
ylab("Exit Rates")
shopper_dat %>%
select_if(is.numeric) %>%
cor() %>%
corrplot::corrplot()
The dataset contains 10 numeric variables and 8 categorical variables, but four of the categorical variables (operating_systems, browser, region, and traffic_type) are coded as numbers in the dataset. We decided to leave them as numerics, since converting them into factors can result in too many factor levels. We converted the other four categorical variables, which were read in as characters (month, visitor_type) or logicals (weekend, revenue), into factors for model fitting.
shopper_dat <- shopper_dat %>%
mutate(
month = as.factor(month),
weekend = as.factor(weekend),
visitor_type = as.factor(visitor_type),
revenue = as.factor(revenue)
)
We split the data into 70% training and 30% testing sets, stratified by
revenue, the outcome variable. This was done so that the
training and testing sets resemble the original dataset as closely as
possible.
We then used V-fold cross-validation with 5 folds and 3 repeats to resample the training data. Each repeat splits the training data into 5 folds, and performance metrics are averaged over the resulting 15 resamples.
shopper_split <- initial_split(shopper_dat, prop = 0.7, strata = revenue)
shopper_train <- training(shopper_split)
shopper_test <- testing(shopper_split)
# Resample the training set only, so the testing set stays untouched
shopper_folds <- vfold_cv(shopper_train, v = 5, repeats = 3, strata = revenue)
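The split and resampling scheme described above can be checked on a small stand-in dataset. The sketch below is illustrative only: `toy_dat` and its 15% positive share are assumptions chosen to mimic the imbalance in `revenue`, not the project data.

```r
library(tidymodels)

set.seed(3013)

# Toy stand-in for shopper_dat: 1000 rows, 15% positive class (assumed values)
toy_dat <- tibble(
  x = rnorm(1000),
  revenue = factor(rep(c("FALSE", "TRUE"), times = c(850, 150)))
)

# 70/30 split, stratified on the outcome
toy_split <- initial_split(toy_dat, prop = 0.7, strata = revenue)
toy_train <- training(toy_split)
toy_test <- testing(toy_split)

# 5-fold cross-validation repeated 3 times, built from the training set only
toy_folds <- vfold_cv(toy_train, v = 5, repeats = 3, strata = revenue)

# Stratification keeps the positive share close to 15% in both pieces
mean(toy_train$revenue == "TRUE")
mean(toy_test$revenue == "TRUE")

# One row per resample: 5 folds x 3 repeats
nrow(toy_folds)
```

Building the folds from the training set keeps the testing set unseen until the final evaluation, which is what makes the test accuracy an honest estimate.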
The recipe for each model differed based on the respective requirements. We used the following recipe for random forest and k-nearest neighbors:
shopper_recipe <- recipe(revenue ~ ., data = shopper_train) %>%
step_novel(all_nominal_predictors()) %>%
step_nzv(all_predictors()) %>%
step_normalize(all_numeric_predictors())
However, we modified it slightly for elastic net regression and MARS, which require all predictors to be numeric.
shopper_recipe <-
recipe(formula = revenue ~ ., data = shopper_train) %>%
step_dummy(all_nominal_predictors()) %>%
step_nzv(all_predictors()) %>%
step_normalize(all_numeric_predictors())
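To see what `step_dummy()` does to a nominal predictor, the following sketch preps and bakes a recipe on a tiny made-up data frame. The values in `toy_dat` are hypothetical and only borrow column names from the project.

```r
library(tidymodels)

# Toy data with one nominal and one numeric predictor (values are made up)
toy_dat <- tibble(
  revenue = factor(c("TRUE", "FALSE", "FALSE", "TRUE")),
  month = factor(c("Feb", "Mar", "May", "Feb")),
  page_values = c(0, 5.2, 0, 12.1)
)

toy_baked <- recipe(revenue ~ ., data = toy_dat) %>%
  step_dummy(all_nominal_predictors()) %>%
  prep() %>%
  bake(new_data = NULL)

# month is replaced by indicator columns; the first level (Feb) is the reference
names(toy_baked)
```

After baking, every predictor is numeric, which is the form that elastic net and MARS expect.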
Through trial and error with different subsets of the predictors, we found that changing the composition of predictors did not make a significant difference in accuracy.
The four models we chose are an elastic net model, a random forest model, a MARS model (fit with the earth engine), and a k-nearest neighbors model. We updated the tuning parameters for each of these models: we set specific ranges (upper and lower limits) for the parameters that needed fine adjustment, then built a tuning grid of candidate parameter combinations. We created workflows for each model in their respective R scripts and saved the results to files, which we load into this long-form report.
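As an illustration of setting parameter ranges and expanding them into a grid, the sketch below builds a regular grid for two random forest parameters. The limits shown are hypothetical; the ranges actually used are defined in the model scripts.

```r
library(tidymodels)

# Hypothetical limits for two random forest parameters
mtry_param <- mtry(range = c(2, 15))
min_n_param <- min_n(range = c(2, 40))

# Regular grid: every combination of 5 values per parameter
rf_grid <- grid_regular(mtry_param, min_n_param, levels = 5)

nrow(rf_grid)
range(rf_grid$mtry)
```

Each row of the grid is one candidate combination that would be evaluated across all cross-validation resamples, so the number of model fits grows quickly with `levels`.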
rbind(
# elastic net
glmnet_res %>%
show_best(metric = "accuracy") %>%
slice(1:1) %>%
mutate(model = "elastic_net") %>%
select(mean, model),
# knn
knn_res %>%
show_best(metric = "accuracy") %>%
slice(1:1) %>%
mutate(model = "knn") %>%
select(mean, model),
# rf
rf_tune %>%
show_best(metric = "accuracy") %>%
slice(1:1) %>%
mutate(model = "rf") %>%
select(mean, model),
# mars
earth_res %>%
show_best(metric = "accuracy") %>%
slice(1:1) %>%
mutate(model = "mars") %>%
select(mean, model)
) %>%
select(model, mean) %>%
rename(accuracy = mean) %>%
arrange(desc(accuracy)) %>%
gt::gt()
| model | accuracy |
|---|---|
| rf | 0.9038112 |
| mars | 0.9000265 |
| elastic_net | 0.8845357 |
| knn | 0.8794536 |
This project explored a dataset from the world of e-commerce, which gives
customers access to products online through multiple platforms. A few
of these platforms are Amazon, eBay, and Etsy, where consumers are shown
a great deal of advertising, marketing strategies, and hyperbole
designed to gain their attention. To understand the
e-commerce market and how people think before they commit to a purchase,
we looked into this dataset to see what drives people’s
intentions. The outcome variable is revenue, which shows
whether or not the visitor made a purchase online.
Following an initial exploratory data analysis and the creation of a
recipe, we identified crucial predictor variables such as
page_values, bounce_rates,
exit_rates, and visitor_type. Using these
predictor variables and the cross-validation folds, we fit four models:
elastic net, random forest, MARS, and k-nearest neighbors. All of these
models returned a respectable accuracy, but random forest with an
mtry value of 10 produced the winning model, with a
cross-validated accuracy of 0.904. When applied to the testing set,
the accuracy came out to 0.903, which suggests that the model neither
overfit nor underfit. This gave us confidence in our data
cleaning/manipulation and recipe.
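The test-set check described above amounts to comparing predicted classes with the true outcome. The sketch below shows that comparison with yardstick on made-up predictions; `toy_preds` is hypothetical, and in the project the predictions would come from applying the finalized model to the testing data.

```r
library(tidymodels)

# Made-up test-set outcomes and predictions (not the project's results)
toy_preds <- tibble(
  revenue = factor(c("FALSE", "FALSE", "TRUE", "TRUE", "FALSE"),
                   levels = c("FALSE", "TRUE")),
  .pred_class = factor(c("FALSE", "FALSE", "TRUE", "FALSE", "FALSE"),
                       levels = c("FALSE", "TRUE"))
)

# Share of predictions that match the true class
toy_accuracy <- accuracy(toy_preds, truth = revenue, estimate = .pred_class)
toy_accuracy

# Confusion matrix shows where the errors fall
conf_mat(toy_preds, truth = revenue, estimate = .pred_class)
```

Because the outcome is imbalanced, the confusion matrix is worth reading alongside accuracy: it shows whether errors are concentrated in the minority (purchase) class.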