Conversion Rate

Goal

The goal of this challenge is to build a model that predicts conversion rate and, based on the model, to come up with ideas to improve revenue. There are no dates, no tables to join, and no feature engineering required; the problem is really straightforward.

Challenge Description

We have data about users who hit our site: whether they converted or not, as well as some of their characteristics such as their country, the marketing channel, their age, whether they are repeat users, and the number of pages visited during that session (as a proxy for site activity/time spent on site). The tasks are to:

* Predict conversion rate
* Come up with recommendations for the product team and the marketing team to improve conversion rate

Data Columns

* country: the user's country
* age: the user's age
* new_user: whether the user is new to the site (1) or a repeat user (0)
* source: the marketing channel that brought the user to the site (Ads, Seo, or Direct)
* total_pages_visited: the number of pages visited during the session (a proxy for site activity/time spent on site)
* converted: the label; 1 if the user converted, 0 otherwise

Environment Setup and Read Data

#libraries needed 
library(dplyr)  
library(rpart) 
library(ggplot2) 
library(randomForest)
#Let's read the dataset into R.
data = read.csv('C:/DSLA/conversion_rate/conversion_data.csv')
head(data)
##   country age new_user source total_pages_visited converted
## 1      UK  25        1    Ads                   1         0
## 2      US  23        1    Seo                   5         0
## 3      US  28        1    Seo                   4         0
## 4   China  39        1    Seo                   5         0
## 5      US  30        1    Seo                   6         0
## 6      US  31        0    Seo                   1         0
#Let's check the structure of the data:
str(data)
## 'data.frame':    316200 obs. of  6 variables:
##  $ country            : Factor w/ 4 levels "China","Germany",..: 3 4 4 1 4 4 1 4 3 4 ...
##  $ age                : int  25 23 28 39 30 31 27 23 29 25 ...
##  $ new_user           : int  1 1 1 1 1 0 1 0 0 0 ...
##  $ source             : Factor w/ 3 levels "Ads","Direct",..: 1 3 3 3 3 3 3 1 2 1 ...
##  $ total_pages_visited: int  1 5 4 5 6 1 4 4 4 2 ...
##  $ converted          : int  0 0 0 0 0 0 0 0 0 0 ...
summary(data)
##     country            age            new_user         source      
##  China  : 76602   Min.   : 17.00   Min.   :0.0000   Ads   : 88740  
##  Germany: 13056   1st Qu.: 24.00   1st Qu.:0.0000   Direct: 72420  
##  UK     : 48450   Median : 30.00   Median :1.0000   Seo   :155040  
##  US     :178092   Mean   : 30.57   Mean   :0.6855                  
##                   3rd Qu.: 36.00   3rd Qu.:1.0000                  
##                   Max.   :123.00   Max.   :1.0000                  
##  total_pages_visited   converted      
##  Min.   : 1.000      Min.   :0.00000  
##  1st Qu.: 2.000      1st Qu.:0.00000  
##  Median : 4.000      Median :0.00000  
##  Mean   : 4.873      Mean   :0.03226  
##  3rd Qu.: 7.000      3rd Qu.:0.00000  
##  Max.   :29.000      Max.   :1.00000
# A few quick observations:
# - The site is probably a US site, although it also has a large Chinese user base.
# - The user base is pretty young.
# - A conversion rate of around 3% is in line with industry standards.
# Everything seems to make sense here except for the max age of 123 yrs! Let's investigate.

sort(unique(data$age), decreasing=TRUE)
##  [1] 123 111  79  77  73  72  70  69  68  67  66  65  64  63  62  61  60
## [18]  59  58  57  56  55  54  53  52  51  50  49  48  47  46  45  44  43
## [35]  42  41  40  39  38  37  36  35  34  33  32  31  30  29  28  27  26
## [52]  25  24  23  22  21  20  19  18  17
#Those 123 and 111 values seem unrealistic. How many users are we talking about:
subset(data, age>79)
##        country age new_user source total_pages_visited converted
## 90929  Germany 123        0    Seo                  15         1
## 295582      UK 111        0    Ads                  10         1
# It is just 2 users! In this case, we can remove them; nothing will change.
# In general, depending on the problem, you can:
# - remove the entire row, saying you don't trust the data
# - treat those values as NAs
# - if there is a pattern, try to figure out what went wrong
# When in doubt, always go with removing the row. It is the safest choice.
# Anyway, these are probably just users who entered wrong data. So let's remove them:
data = subset(data, age<80)

To get a sense of the data, let's just pick a couple of variables as an example:

data_country = data %>%
  group_by(country) %>%
  summarise(conversion_rate = mean(converted))
ggplot(data = data_country, aes(x = country, y = conversion_rate)) +
  geom_bar(stat = "identity", aes(fill = country))

# Here it clearly looks like Chinese users convert at a much lower rate than users from other countries!

data_pages = data %>%  
  group_by(total_pages_visited) %>%    
  summarise(conversion_rate = mean(converted)) 
qplot(total_pages_visited, conversion_rate, data=data_pages, geom="line")

# Clearly, spending more time on the site implies a higher probability of conversion.
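As a quick numeric sanity check (a sketch; splitting at the median is an arbitrary choice), we can compare conversion above and below the overall median page count:

# Conversion rate above vs. below the median number of pages visited
data %>%
  mutate(many_pages = total_pages_visited > median(total_pages_visited)) %>%
  group_by(many_pages) %>%
  summarise(conversion_rate = mean(converted), n = n())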

Implementing Machine Learning

# Let's now build a model to predict conversion rate. The outcome is binary, and we care about insights to give the product and marketing teams some ideas. We will probably choose among the following models:

#Logistic regression 
#Decision Trees 
#RuleFit (this is often your best choice)
#Random Forest in combination with partial dependence plots
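As a quick interpretable baseline before committing to the forest, here is a minimal logistic regression sketch (not part of the model comparison above; note that converted is still a 0/1 integer at this point, which glm with family = binomial accepts):

# Interpretable baseline sketch: logistic regression on all predictors
logit = glm(converted ~ ., data = data, family = binomial)
summary(logit)  # coefficient signs give a first read on each variable's direction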

# First, converted should really be a factor here, as should new_user. So let's change them:

data$converted = as.factor(data$converted)  # let's make the class a factor 
data$new_user = as.factor(data$new_user) #also this a factor
levels(data$country)[levels(data$country)=="Germany"]="DE" # Shorter name, easier to plot.
# Create a test/training set with a standard 66% split (if the data were too small, I would
# cross-validate), and then build the forest with standard values for the 3 most important
# parameters: 100 trees, trees as large as possible, and 3 random variables selected at each split.

train_sample = sample(nrow(data), size = nrow(data)*0.66) 
train_data = data[train_sample,] 
test_data = data[-train_sample,] 
rf = randomForest(y = train_data$converted, x = train_data[, -ncol(train_data)],
                  ytest = test_data$converted, xtest = test_data[, -ncol(test_data)],
                  ntree = 100, mtry = 3, keep.forest = TRUE)
rf
## 
## Call:
##  randomForest(x = train_data[, -ncol(train_data)], y = train_data$converted,      xtest = test_data[, -ncol(test_data)], ytest = test_data$converted,      ntree = 100, mtry = 3, keep.forest = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 100
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 1.46%
## Confusion matrix:
##        0    1 class.error
## 0 201083  918 0.004544532
## 1   2124 4565 0.317536254
##                 Test set error rate: 1.46%
## Confusion matrix:
##        0    1 class.error
## 0 103518  481 0.004625044
## 1   1086 2423 0.309489883
# OOB error and test error are pretty similar: both about 1.5%. We are confident we are not
# overfitting. The error is pretty low. However, we started from ~97% accuracy (what we would
# get by classifying everything as "non converted"). So 98.5% is good, but nothing shocking.
# Indeed, about 30% of actual conversions are predicted as non-conversions.
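To make that baseline explicit, here is the one-liner behind the ~97% figure (computed on the training set):

# Accuracy of the trivial model that predicts "non converted" for everyone
mean(train_data$converted == "0")  # ~0.97, since only ~3% of sessions convert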

# Let's start checking variable importance:

varImpPlot(rf,type=2)

# Total pages visited is the most important variable.
# Unfortunately, it is probably the least "actionable". People visit many pages because they
# already want to buy. Also, in order to buy you have to click on multiple pages.
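If you want the numbers behind the plot, the same information is available as a table:

importance(rf, type = 2)  # MeanDecreaseGini for each predictor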

Rebuilding the Random Forest

# Let's rebuild the RF without that variable. Since the classes are heavily unbalanced and we
# no longer have that very powerful variable, let's change the class weights a bit, just to
# make sure we get something classified as 1.

rf = randomForest(y = train_data$converted, x = train_data[, -c(5, ncol(train_data))],
                  ytest = test_data$converted, xtest = test_data[, -c(5, ncol(train_data))],
                  ntree = 100, mtry = 3, keep.forest = TRUE, classwt = c(0.7, 0.3))
rf
## 
## Call:
##  randomForest(x = train_data[, -c(5, ncol(train_data))], y = train_data$converted,      xtest = test_data[, -c(5, ncol(train_data))], ytest = test_data$converted,      ntree = 100, mtry = 3, classwt = c(0.7, 0.3), keep.forest = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 100
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 14.05%
## Confusion matrix:
##        0     1 class.error
## 0 175697 26304   0.1302172
## 1   3009  3680   0.4498430
##                 Test set error rate: 14.39%
## Confusion matrix:
##       0     1 class.error
## 0 90138 13861   0.1332801
## 1  1605  1904   0.4573953
#Accuracy went down, but that's fine. The model is still good enough to give us insights. 
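With unbalanced classes, accuracy alone is misleading; what we really care about is how many actual conversions we catch. A quick sketch that reads class-1 recall off the test confusion matrix stored in the fitted object:

# Recall on the converted class, from the test confusion matrix shown above
conf = rf$test$confusion
conf["1", "1"] / (conf["1", "0"] + conf["1", "1"])  # ~0.54 for the run above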

#Let's recheck variable importance:
varImpPlot(rf,type=2)

#Interesting! New user is the most important one. Source doesn't seem to matter at all. 

# Let's check partial dependence plots for the 4 variables:
op <- par(mfrow = c(2, 2))
partialPlot(rf, train_data, country, "1")
partialPlot(rf, train_data, age, "1")
partialPlot(rf, train_data, new_user, "1")
partialPlot(rf, train_data, source, "1")
par(op)  # restore the single-plot layout

# In partial dependence plots, we just care about the trend, not the actual y value. So this shows that:
# - Users with an old account are much better than new users
# - China is really bad; all the other countries are similar, with Germany being the best
# - The site works very well for young people and badly for older users (>30 yrs old)
# - Source is irrelevant
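We can sanity-check the age trend directly in the raw data (a quick sketch; the bucket boundaries are an arbitrary choice):

# Conversion rate by age bucket, to confirm the partial dependence trend
data %>%
  mutate(age_bucket = cut(age, breaks = c(16, 24, 30, 40, 80))) %>%
  group_by(age_bucket) %>%
  summarise(conversion_rate = mean(converted == "1"), n = n())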

Let’s now build a simple decision tree and check the 2 or 3 most important segments:

tree = rpart(converted ~ ., data = data[, -5],
             control = rpart.control(maxdepth = 3),
             parms = list(prior = c(0.7, 0.3)))
tree
## n= 316198 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 316198 94859.4000 0 (0.70000000 0.30000000)  
##    2) new_user=1 216744 28268.0600 0 (0.84540048 0.15459952) *
##    3) new_user=0 99454 66591.3400 0 (0.50063101 0.49936899)  
##      6) country=China 23094   613.9165 0 (0.96445336 0.03554664) *
##      7) country=DE,UK,US 76360 50102.8100 1 (0.43162227 0.56837773)  
##       14) age>=29.5 38341 19589.5200 0 (0.57227507 0.42772493) *
##       15) age< 29.5 38019 23893.0000 1 (0.33996429 0.66003571) *
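A base-graphics plot of the same tree can make the segments easier to read (optional; it uses nothing beyond what is already loaded):

plot(tree, margin = 0.1)  # draw the rpart tree
text(tree, use.n = TRUE)  # label the splits and show node counts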
# Some conclusions and suggestions:
# 1. The site is working very well for young users. Definitely tell marketing to advertise on channels that are more likely to reach young people.
# 2. The site is working very well for Germany in terms of conversion, but the summary showed that few Germans come to the site: way fewer than from the UK, despite Germany's larger population. Again, marketing should get more Germans. Big opportunity.
# 3. Users with old accounts do much better. Targeted emails with offers to bring them back to the site could be a good idea to try.
# 4. Something is wrong with the Chinese version of the site. It is either poorly translated, doesn't fit the local culture, has some payment issue, or maybe it is just in English! Given how many users are based in China, fixing this should be a top priority. Huge opportunity.
# 5. Maybe go through the UI and figure out why older users perform so poorly? From 30 y/o, conversion clearly starts dropping.
# 6. If I know someone has visited many pages but hasn't converted, she almost surely has high purchase intent. I could email her targeted offers or send her reminders. Overall, these are probably the easiest users to make convert (see the sketch below).
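To act on that last point, here is a hypothetical pull of the high-intent segment (the 10-page cutoff is an illustrative assumption, not something derived above):

# Hypothetical audience for a re-engagement email: many pages visited, no conversion
high_intent = subset(data, total_pages_visited >= 10 & converted == "0")
nrow(high_intent)  # size of the target audience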

# As we can see, conclusions usually end up being about:
# 1. telling marketing to get more of the well-performing user segments
# 2. telling product to fix the experience for the badly performing ones