The data revolution owes a lot to the fact that we can now collect all sorts of data about people who buy something on our site as well as people who don't. This gives us a tremendous opportunity to understand what's working well (and potentially scale it even further) and what's not working well (and fix it).
In this report we analyse and predict the conversion rate of users who hit our website. For each session we know whether the user converted, as well as some of their characteristics: their country, the marketing channel, their age, whether they are a repeat user, and the number of pages visited during that session.
Before we dive into the data, let's take a step back and think about what we're about to do. Effectively, let's form some intuition about what we may see in the data so we don't merely report what the data is telling us. The intuition we develop now will act as a guide as we navigate the data, so that we don't fall into the data-mining trap.
So first, let’s begin with the end in mind:
Second, now let’s begin with the beginning:
### Below are some additional questions we should/could ask after reading the assignment problem statement.
library(ggplot2)
library(reshape2)
library(plyr)
library(knitr)
library(caret)
library(QuantPsyc)
library(ROCR)
library(e1071)
library(randomForest)
library(rpart)
library(dplyr)
We first read the conversion rate data into R:
conversion <- read.table("conversion_data.csv", header=TRUE, sep = ",")
The data look something like this:
head(conversion)
## country age new_user source total_pages_visited converted random
## 1 UK 25 1 Ads 1 0 7.793772e-01
## 2 US 23 1 Ads 4 0 6.787150e-06
## 3 US 19 0 Seo 1 0 1.476440e-05
## 4 US 17 1 Ads 2 0 2.072850e-05
## 5 US 23 1 Ads 3 0 2.101780e-05
## 6 US 30 1 Seo 6 0 2.663340e-05
str(conversion)
## 'data.frame': 316198 obs. of 7 variables:
## $ country : Factor w/ 4 levels "China","Germany",..: 3 4 4 4 4 4 2 4 4 1 ...
## $ age : int 25 23 19 17 23 30 25 43 47 31 ...
## $ new_user : int 1 1 0 1 1 1 1 1 1 0 ...
## $ source : Factor w/ 3 levels "Ads","Direct",..: 1 1 3 1 1 3 2 2 3 3 ...
## $ total_pages_visited: int 1 4 1 2 3 6 2 2 3 5 ...
## $ converted : int 0 0 0 0 0 0 0 0 0 0 ...
## $ random : num 7.79e-01 6.79e-06 1.48e-05 2.07e-05 2.10e-05 ...
Looking at the structure of the data, we see that the variables converted and new_user are integers. These should be categorical variables. Let's use R's factor() function to convert them:
conversion$new_user <- as.factor(conversion$new_user)
conversion$converted <- as.factor(conversion$converted)
Now, let's inspect the data to look for weird behavior/wrong data. Data is never perfect in real life and needs to be cleaned. Take-home challenges often contain wrong data that has been put there on purpose; identifying the wrong data and dealing with it is part of the challenge.
R's summary() function is usually the best place to start:
summary(conversion)
## country age new_user source
## China : 76602 Min. :17.00 0: 99454 Ads : 88739
## Germany: 13055 1st Qu.:24.00 1:216744 Direct: 72420
## UK : 48449 Median :30.00 Seo :155039
## US :178092 Mean :30.57
## 3rd Qu.:36.00
## Max. :79.00
## total_pages_visited converted random
## Min. : 1.000 0 :305999 Min. :0.0000068
## 1st Qu.: 2.000 1 : 10198 1st Qu.:0.2511177
## Median : 4.000 NA's: 1 Median :0.5012310
## Mean : 4.873 Mean :0.5008166
## 3rd Qu.: 7.000 3rd Qu.:0.7510471
## Max. :29.000 Max. :0.9999984
Let's take a closer look at age.
sort(unique(conversion$age), decreasing=TRUE)
## [1] 79 77 73 72 70 69 68 67 66 65 64 63 62 61 60 59 58 57 56 55 54 53 52
## [24] 51 50 49 48 47 46 45 44 43 42 41 40 39 38 37 36 35 34 33 32 31 30 29
## [47] 28 27 26 25 24 23 22 21 20 19 18 17
Distribution of Visitor Age
qplot(conversion$age,
geom="histogram",
binwidth = 5,
main = "Distribution of Visitor Age",
xlab = "Age",
fill=I("blue"),
col=I("blue"),
alpha=I(.2))
The ages of 123 and 111 that appear in the raw data seem unrealistic. How many users are we talking about?
subset(conversion, age>79)
## [1] country age new_user
## [4] source total_pages_visited converted
## [7] random
## <0 rows> (or 0-length row.names)
It is just 2 users! In this case we can remove them; it won't have much effect on our analysis/prediction. In general, depending on the problem, we can:

1. remove the entire row, saying we don't trust those data points, and treat those values as NAs, or
2. if there is a pattern, try to figure out what went wrong.

When in doubt, always go with removing the row; it is the safest choice. A quick sketch of option 1 follows below.
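For reference, option 1 could look like the minimal sketch below (conversion_na is just an illustrative copy; we stick with option 2, removing the rows, in this report):

# Sketch of option 1: keep the rows but treat implausible ages as missing
conversion_na <- conversion
conversion_na$age[conversion_na$age > 79] <- NA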
We should also emphasize in our write-up that wrong data is worrisome and can be an indicator of a bug in the logging code. This is exactly in line with some of the intuition we formed before even looking at the data. Therefore, we would like to talk to the software engineer who implemented the tracking code to see if, perhaps, there are bugs which affect the data significantly. We can use this as an opportunity to follow up on other questions we may have. Here, though, it is probably just users who entered wrong data. So let's remove them:
conversion <- subset(conversion, age<80)
Now, let's quickly investigate the variables and how their distributions differ between the two classes. This will help us understand whether there is any signal in our data in the first place and give us a feel for the data.
Always form a working hypothesis first and then check it against the data. Let's pick just a couple of variables of interest as examples, but you should do this with all of them:
Here it clearly looks like Chinese users convert at a much lower rate than users from other countries!
# Factor levels are "0"/"1"; convert via character so we get 0/1, not 1/2
conversion$converted <- as.numeric(as.character(conversion$converted))
conversion_country <- conversion %>%
  group_by(country) %>%
  summarise(conversion_rate = mean(converted, na.rm = TRUE))
ggplot(data = conversion_country, aes(x = country, y = conversion_rate)) +
  geom_bar(stat = "identity", aes(fill = country))
Spending more time on the site clearly goes together with a higher probability of conversion!
conversion_pages <- conversion %>%
  group_by(total_pages_visited) %>%
  summarise(conversion_rate = mean(converted, na.rm = TRUE))
qplot(total_pages_visited, conversion_rate, data=conversion_pages, geom="line")
Leakage: when the data you are using to train a machine learning algorithm happens to contain the information you are trying to predict. Total pages visited is a strong predictor of conversion, but we won't know its final value until the user leaves the site. Depending on the application, we have to be careful with it. For our case, let's flag this in our note to the product and/or marketing team.
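If we later need a model that can score a visitor mid-session, one option is simply to leave the leaky feature out of the model. A minimal sketch under that assumption (no_leak_model is an illustrative name; converted here is the 0/1 outcome from above):

# Sketch: a model usable before the session ends, excluding total_pages_visited
no_leak_model <- glm(converted ~ country + age + new_user + source,
                     data = conversion, family = binomial)
summary(no_leak_model)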
Let's now build a model to predict conversion rate. The outcome is binary and we care about insights to give the product and marketing teams some ideas, so an interpretable classifier is a natural choice.
I am going to pick logistic regression…
First, converted should really be a factor, a.k.a. a categorical variable, here. So let's change it back:
conversion$converted <- as.factor(conversion$converted)
Create training and test sets with a 50% split.
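A minimal sketch of that split and of the logistic fit, assuming caret's createDataPartition for a stratified split (the names training_1v, testing_1v and conversion_logistic_model match the prediction call below):

# Drop the single row with a missing outcome flagged by summary() earlier
conversion <- conversion[!is.na(conversion$converted), ]

# Stratified 50/50 split on the outcome
set.seed(42)
train_index <- createDataPartition(conversion$converted, p = 0.5, list = FALSE)
training_1v <- conversion[train_index, ]
testing_1v <- conversion[-train_index, ]

# Logistic regression on the user features (the synthetic 'random' column is excluded)
conversion_logistic_model <- glm(converted ~ country + age + new_user +
                                   source + total_pages_visited,
                                 data = training_1v, family = binomial)
summary(conversion_logistic_model)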
testing_1v$converted_prob <- predict(conversion_logistic_model, newdata = testing_1v, type = "response")
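From here, a quick sanity check is the test-set AUC, using the ROCR package loaded earlier; a minimal sketch:

# ROC and AUC on the test set (labels are the true converted values)
pred_obj <- prediction(testing_1v$converted_prob, testing_1v$converted)
performance(pred_obj, measure = "auc")@y.values[[1]]
plot(performance(pred_obj, "tpr", "fpr")) # ROC curve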