Conversion Rate

Synopsis

The data revolution has a lot to do with the fact that we are now able to collect all sorts of data about people who buy something on our site as well as people who don’t. This gives us a tremendous opportunity to understand what’s working well (and potentially scale it even further) and what’s not working well (and fix it).

In this report we analyse and predict the conversion rate of users who hit our website: whether they converted or not as well as some of their characteristics such as their country, the marketing channel, their age, whether they are repeat users and the number of pages visited during that session.

Before we dive into the data, let’s take a step back and think about what we’re about to do. Effectively, let’s form some intuition about what we may see in the data so we don’t merely report what the data is telling us. The intuition we develop now will act as a guide as we navigate the data so that we don’t fall into the data mining trap.

So first, let’s begin with the end in mind:

  1. What is the purpose of this analysis/prediction model? It is a data science take-home challenge that assesses a candidate’s competency on a data analysis assignment.

Second, now let’s begin with the beginning:

  1. What problem are we trying to solve? A classification problem: predict which users will convert on our website.

### Below are some additional questions we should/could ask after reading the assignment problem statement.

  1. How was the data captured? The answer usually tells us about potential biases embedded in the data, which in turn guides feature engineering and the choice of a model that best captures the phenomenon we want to model, e.g., a classification model such as Logistic Regression or Random Forest.
  2. What was happening at the time the data was captured?
  3. Does each record capture a unique user?
  1. Which company/industry is the data from?
  2. Customers don’t typically supply age information upon sign-in, and when they do they could provide an incorrect age.
  3. What is the conversion rate?
  4. What day of the week was the data captured? Weekday visitors may look different from weekend visitors.
  5. Was the data sampled? If so, there may be biases in the sampling.

Concluding/Working hypothesis after preliminary research:

  1. Treat any prediction or analysis based on this dataset with caution. The dataset doesn’t seem to be grounded in many cogent business practices.
  2. I could see major issues with this data: 2a. A user session lasts about 30 minutes on average. 2b. What if a user visits the site 2 or 3 times in a day and those visits are captured as several sessions?
  3. An individual session, even one that ends in a conversion, may not provide enough information about the user.
  4. What about the platform the user converted on? Does mobile tend to have a higher conversion rate than desktop?
  5. At some ecommerce companies, people browse on desktop and convert on mobile. I suspect it depends on how convenient the product is to shop for.
  6. There also seems to be leakage in the data. If the prediction model were used for real-time prediction on the website, the variable total_pages_visited would only be available after the user has ended their session. So even if it turns out to be a great predictor, it won’t be available until the user leaves the site. A workaround could be to use the distribution of pages visited over a time period instead.

Now that we have a working hypothesis, let’s take a look at the data.

  1. Let’s take a look at the data and see if our initial hypothesis is wrong. OR better yet, reveal something to us that we just don’t know about. And, it should be fun.

Data Processing

Load the required packages:

library(ggplot2)
library(reshape2)
library(plyr)
library(knitr)
library(caret)
library(QuantPsyc)
library(ROCR)
library(e1071)
library(randomForest)
library(rpart)
library(dplyr)

source("https://bioconductor.org/biocLite.R")

biocLite("preprocessCore")

We first read the conversion rate data into R:

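# stringsAsFactors = TRUE keeps country and source as factors on R >= 4.0,
# which matches the str() output shown further down
conversion <- read.table("conversion_data.csv", header = TRUE, sep = ",", stringsAsFactors = TRUE)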

The data look something like this:

head(conversion)
##   country age new_user source total_pages_visited converted       random
## 1      UK  25        1    Ads                   1         0 7.793772e-01
## 2      US  23        1    Ads                   4         0 6.787150e-06
## 3      US  19        0    Seo                   1         0 1.476440e-05
## 4      US  17        1    Ads                   2         0 2.072850e-05
## 5      US  23        1    Ads                   3         0 2.101780e-05
## 6      US  30        1    Seo                   6         0 2.663340e-05

Now that we have our data loaded into R, let’s do some data munging. Before we do so, however, let’s take a look at the structure.

str(conversion)
## 'data.frame':    316198 obs. of  7 variables:
##  $ country            : Factor w/ 4 levels "China","Germany",..: 3 4 4 4 4 4 2 4 4 1 ...
##  $ age                : int  25 23 19 17 23 30 25 43 47 31 ...
##  $ new_user           : int  1 1 0 1 1 1 1 1 1 0 ...
##  $ source             : Factor w/ 3 levels "Ads","Direct",..: 1 1 3 1 1 3 2 2 3 3 ...
##  $ total_pages_visited: int  1 4 1 2 3 6 2 2 3 5 ...
##  $ converted          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ random             : num  7.79e-01 6.79e-06 1.48e-05 2.07e-05 2.10e-05 ...

Looking at the structure of the data, we see that the variables converted and new_user are integers. These should be categorical variables. Let’s use R’s factor function to convert them.

conversion$new_user <- as.factor(conversion$new_user)
conversion$converted <- as.factor(conversion$converted)
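
One thing to keep in mind after this conversion: calling as.numeric() directly on a factor returns the internal level codes (1, 2) rather than the original labels (0, 1), which shifts any mean computed from it by one. A minimal sketch on a made-up toy vector (not the real data) shows the difference:

```r
# Toy factor standing in for conversion$converted (values are made up)
converted <- factor(c(0, 1, 0, 0, 1))

# Direct conversion gives the internal level codes: 1 2 1 1 2
codes <- as.numeric(converted)

# Going through as.character() recovers the original values: 0 1 0 0 1
labels <- as.numeric(as.character(converted))

mean(codes)   # shifted up by one relative to the true rate
mean(labels)  # share of converted rows in the toy vector
```

Going through as.character() first is the usual way to get the 0/1 values back when a rate needs to be computed.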

Now, let’s inspect the data to look for weird behavior/wrong data. Data is never perfect in real life and needs to be cleaned. Take-home challenges often contain wrong data that has been put there on purpose. Identifying the wrong data and dealing with it is part of the challenge.

Data Exploration

R’s summary function is usually the best place to start:

summary(conversion)
##     country            age        new_user      source      
##  China  : 76602   Min.   :17.00   0: 99454   Ads   : 88739  
##  Germany: 13055   1st Qu.:24.00   1:216744   Direct: 72420  
##  UK     : 48449   Median :30.00              Seo   :155039  
##  US     :178092   Mean   :30.57                             
##                   3rd Qu.:36.00                             
##                   Max.   :79.00                             
##  total_pages_visited converted         random         
##  Min.   : 1.000      0   :305999   Min.   :0.0000068  
##  1st Qu.: 2.000      1   : 10198   1st Qu.:0.2511177  
##  Median : 4.000      NA's:     1   Median :0.5012310  
##  Mean   : 4.873                    Mean   :0.5008166  
##  3rd Qu.: 7.000                    3rd Qu.:0.7510471  
##  Max.   :29.000                    Max.   :0.9999984

A few quick observations:

  1. The site is probably a US site, although it does have a large Chinese user base as well. The user base is pretty young.
  2. A conversion rate of around 3% is industry standard. It makes sense.
  3. The summary reveals a user with an age of 123. Did this user purchase? Age is self-disclosed, so it could be wrong. Do we drop it or impute it?
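
The roughly 3% figure can be checked directly from the converted counts in the summary output, which also shows one missing outcome. A small sketch, where the toy data frame is made up purely to illustrate counting missing values per column:

```r
# Counts copied from the converted column of the summary() output
n_not_converted <- 305999
n_converted     <- 10198

# Base conversion rate among rows with a known outcome
base_rate <- n_converted / (n_converted + n_not_converted)
round(100 * base_rate, 2)  # conversion rate as a percentage

# Toy data frame (made-up values) showing how to count NAs per column
toy <- data.frame(age = c(25, 23, NA), converted = factor(c(0, NA, 1)))
na_counts <- colSums(is.na(toy))
na_counts
```

Running colSums(is.na(conversion)) on the real data would be the equivalent check for every column at once.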

Let’s take a closer look at age.

sort(unique(conversion$age), decreasing=TRUE)
##  [1] 79 77 73 72 70 69 68 67 66 65 64 63 62 61 60 59 58 57 56 55 54 53 52
## [24] 51 50 49 48 47 46 45 44 43 42 41 40 39 38 37 36 35 34 33 32 31 30 29
## [47] 28 27 26 25 24 23 22 21 20 19 18 17

Distribution of Visitor Age

qplot(conversion$age,
      geom="histogram",
      binwidth = 5,  
      main = "Distribution of Visitor Age", 
      xlab = "Age",  
      fill=I("blue"), 
      col=I("blue"), 
      alpha=I(.2))

Those 123 and 111 values seem unrealistic. How many users are we talking about?

subset(conversion, age>79)
## [1] country             age                 new_user           
## [4] source              total_pages_visited converted          
## [7] random             
## <0 rows> (or 0-length row.names)

It is just 2 users! In this case we can remove them; it won’t have much effect on our analysis/prediction. In general, depending on the problem, we can: 1. remove the entire row, saying we don’t trust those data points; 2. treat those values as NAs; 3. if there is a pattern, try to figure out what went wrong. When in doubt, always go with removing the row. It is the safest choice.

We probably also want to emphasize in the text that wrong data is worrisome and can be an indicator of a bug in the logging code. This is exactly in line with some of the intuition we formed before even looking at the data. Therefore, you’d like to talk to the software engineer who implemented the code to see if, perhaps, there are bugs that affect the data significantly. You can use this as an opportunity to follow up on other questions you may have. Anyway, here it is probably just users who entered wrong data. So let’s remove them:

conversion <- subset(conversion, age<80)
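
For comparison, the two simplest options mentioned above (dropping the rows versus keeping them with the age marked as missing) can be sketched on a made-up toy data frame. The cutoff of 80 mirrors the subset() call; everything else here is hypothetical:

```r
# Toy data (made-up values) with two implausible ages
toy <- data.frame(age       = c(25, 123, 31, 111, 47),
                  converted = c(0, 1, 0, 0, 1))

# Option 1: drop the suspicious rows entirely
toy_dropped <- subset(toy, age < 80)

# Option 2: keep the rows but mark the age as missing
toy_na <- toy
toy_na$age[toy_na$age >= 80] <- NA

nrow(toy_dropped)        # rows left after dropping
sum(is.na(toy_na$age))   # ages flagged as missing
```

Option 2 keeps the other columns of those rows available, at the cost of having to handle NA ages in every later step.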

Now, let’s quickly investigate the variables and how their distribution differs for the two classes. This will help us understand whether there is any information in our data in the first place and get a sense of the data.

Never start by blindly building a machine learning/predictive model.

Always first form a working hypothesis and then get a sense of the data. Let’s just pick a couple of variables of interest as an example, but you should do it with all:

Here it clearly looks like users from China convert at a much lower rate than users from other countries!

# as.numeric() on a factor returns level codes (1/2), so go through
# as.character() to recover the original 0/1 values
conversion$converted <- as.numeric(as.character(conversion$converted))
conversion_country <- conversion %>%
  group_by(country) %>%
  # na.rm = TRUE skips the single row with a missing outcome
  summarise(conversion_rate = mean(converted, na.rm = TRUE))
ggplot(data = conversion_country, aes(x = country, y = conversion_rate)) +
  geom_bar(stat = "identity", aes(fill = country))

Visiting more pages on the site is clearly associated with a higher probability of conversion!

conversion_pages <- conversion %>%
  group_by(total_pages_visited) %>%
  # na.rm = TRUE skips the single row with a missing outcome
  summarise(conversion_rate = mean(converted, na.rm = TRUE))
qplot(total_pages_visited, conversion_rate, data = conversion_pages, geom = "line")

Leakage occurs when the data used to train a machine learning algorithm contains the information you are trying to predict. Total pages visited is a strong predictor of conversion, but we won’t know its value until the user leaves the site. Depending on the application, we have to be careful. For our case, let’s flag this in the write-up for the product and/or marketing team.
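
One rough way to see how much a model leans on an end-of-session variable is to fit it with and without that variable and compare the fits. The sketch below uses synthetic data generated with made-up coefficients, so the numbers say nothing about the real dataset; only the column names are borrowed from it:

```r
set.seed(42)  # arbitrary seed for reproducibility

# Synthetic data (not the real dataset): conversion depends strongly on pages visited
n <- 5000
toy <- data.frame(
  new_user            = rbinom(n, 1, 0.7),
  total_pages_visited = rpois(n, 4) + 1
)
log_odds <- -8 + 0.8 * toy$total_pages_visited - 0.5 * toy$new_user
toy$converted <- rbinom(n, 1, plogis(log_odds))

# Model restricted to what is known at the start of a session
fit_without <- glm(converted ~ new_user, data = toy, family = binomial)

# Model that also uses the end-of-session page count
fit_with <- glm(converted ~ new_user + total_pages_visited,
                data = toy, family = binomial)

AIC(fit_without)
AIC(fit_with)  # lower AIC here reflects reliance on the page count
```

If the gap between the two fits is large, a real-time model without the page count will perform noticeably worse than offline results suggest.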

Data Modeling

Let’s now build a model to predict conversion. The outcome is binary, and we care about insights that give the product and marketing teams some ideas. We should probably choose from among the following models:

  1. Logistic regression
  2. Decision Trees
  3. RuleFit (this is often your best choice)
  4. Random Forest in combination with partial dependence plots

I am going to pick Logistic Regression…

First, “converted” should really be a factor, aka a categorical variable, here. So let’s change it:

conversion$converted = as.factor(conversion$converted)

Create test/training set with a 50% split.
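
The code for the split and the model fit is not shown here, so the following is only a minimal sketch of what a 50% split and a single-predictor logistic regression could look like. It runs on synthetic data, the toy_* names are hypothetical, and using total_pages_visited as the one predictor is an assumption:

```r
set.seed(123)  # arbitrary seed so the split is reproducible

# Synthetic stand-in for the conversion data (values are made up)
n <- 2000
toy <- data.frame(total_pages_visited = rpois(n, 4) + 1)
toy$converted <- factor(rbinom(n, 1, plogis(-6 + 0.7 * toy$total_pages_visited)))

# 50/50 split by sampling row indices without replacement
train_idx <- sample(seq_len(n), size = n / 2)
toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]

# Logistic regression with a single predictor, fitted on the training half
toy_model <- glm(converted ~ total_pages_visited,
                 data = toy_train, family = binomial)

# Predicted conversion probabilities on the held-out half
toy_test$converted_prob <- predict(toy_model, newdata = toy_test, type = "response")
head(toy_test)
```

In the report's own naming, toy_test and toy_model would correspond to testing_1v and conversion_logistic_model.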

testing_1v$converted_prob <- predict(conversion_logistic_model, newdata = testing_1v, type = "response")

1 Independent Variable
