The goal of this challenge is to build a model that predicts conversion rate and, based on the model, to come up with ideas to improve revenue. There are no dates, no tables to join, and no feature engineering required; the problem is really straightforward.
We have data about users who hit our site: whether they converted or not, as well as some of their characteristics such as their country, the marketing channel, their age, whether they are repeat users, and the number of pages visited during that session (as a proxy for site activity/time spent on site). The tasks are:
* Predict conversion rate
* Come up with recommendations for the product team and the marketing team to improve conversion rate
#libraries needed
library(dplyr)
library(rpart)
library(ggplot2)
library(randomForest)
#Let's read the dataset into R.
data = read.csv('C:/DSLA/conversion_rate/conversion_data.csv')
head(data)
## country age new_user source total_pages_visited converted
## 1 UK 25 1 Ads 1 0
## 2 US 23 1 Seo 5 0
## 3 US 28 1 Seo 4 0
## 4 China 39 1 Seo 5 0
## 5 US 30 1 Seo 6 0
## 6 US 31 0 Seo 1 0
#Let's check the structure of the data:
str(data)
## 'data.frame': 316200 obs. of 6 variables:
## $ country : Factor w/ 4 levels "China","Germany",..: 3 4 4 1 4 4 1 4 3 4 ...
## $ age : int 25 23 28 39 30 31 27 23 29 25 ...
## $ new_user : int 1 1 1 1 1 0 1 0 0 0 ...
## $ source : Factor w/ 3 levels "Ads","Direct",..: 1 3 3 3 3 3 3 1 2 1 ...
## $ total_pages_visited: int 1 5 4 5 6 1 4 4 4 2 ...
## $ converted : int 0 0 0 0 0 0 0 0 0 0 ...
summary(data)
## country age new_user source
## China : 76602 Min. : 17.00 Min. :0.0000 Ads : 88740
## Germany: 13056 1st Qu.: 24.00 1st Qu.:0.0000 Direct: 72420
## UK : 48450 Median : 30.00 Median :1.0000 Seo :155040
## US :178092 Mean : 30.57 Mean :0.6855
## 3rd Qu.: 36.00 3rd Qu.:1.0000
## Max. :123.00 Max. :1.0000
## total_pages_visited converted
## Min. : 1.000 Min. :0.00000
## 1st Qu.: 2.000 1st Qu.:0.00000
## Median : 4.000 Median :0.00000
## Mean : 4.873 Mean :0.03226
## 3rd Qu.: 7.000 3rd Qu.:0.00000
## Max. :29.000 Max. :1.00000
#A few quick observations:
#The site is probably a US site, although it also has a large Chinese user base
#The user base is pretty young
#A conversion rate of around 3% is in line with industry standards
#Everything seems to make sense here, except for the max age of 123 yrs! Let's investigate:
sort(unique(data$age), decreasing=TRUE)
## [1] 123 111 79 77 73 72 70 69 68 67 66 65 64 63 62 61 60
## [18] 59 58 57 56 55 54 53 52 51 50 49 48 47 46 45 44 43
## [35] 42 41 40 39 38 37 36 35 34 33 32 31 30 29 28 27 26
## [52] 25 24 23 22 21 20 19 18 17
#Those 123 and 111 values seem unrealistic. How many users are we talking about?
subset(data, age>79)
## country age new_user source total_pages_visited converted
## 90929 Germany 123 0 Seo 15 1
## 295582 UK 111 0 Ads 10 1
#It is just 2 users! In this case, we can remove them; nothing will change.
#In general, depending on the problem, you can: remove the entire row (i.e. say you don't trust the data),
#treat those values as NAs, or, if there is a pattern, try to figure out what went wrong.
#When in doubt, go with removing the row: it is the safest choice.
#Here, these are probably just users who entered wrong data. So let's remove them:
data = subset(data, age<80)
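# For reference, the "treat as NAs" alternative mentioned above would have looked like this (sketch only):
# data$age[data$age > 79] = NA  # keep the rows, blank out the implausible ages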
data_country = data %>%
  group_by(country) %>%
  summarise(conversion_rate = mean(converted))
ggplot(data = data_country, aes(x = country, y = conversion_rate)) +
  geom_bar(stat = "identity", aes(fill = country))
# Here it clearly looks like Chinese users convert at a much lower rate than users from other countries!
data_pages = data %>%
  group_by(total_pages_visited) %>%
  summarise(conversion_rate = mean(converted))
qplot(total_pages_visited, conversion_rate, data=data_pages, geom="line")
# Definitely, spending more time on the site (more pages visited) goes with a much higher probability of conversion
# Let's now build a model to predict conversion rate. The outcome is binary and we care about insights
# to give the product and marketing teams some ideas, so we would probably choose among the following
# models (a quick baseline sketch follows the list):
#Logistic regression
#Decision Trees
#RuleFit (this is often your best choice)
#Random Forest in combination with partial dependence plots
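# Before committing to the random forest, a quick logistic-regression baseline is a cheap sanity check.
# This is just a sketch, not part of the main workflow; the 0/1 converted column works directly as a
# binomial response here, since we haven't turned it into a factor yet:
logit = glm(converted ~ country + age + new_user + source + total_pages_visited,
            data = data, family = binomial)
summary(logit)  # coefficient signs should echo the plots above (e.g. China negative, pages positive)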
# First, converted should really be a factor here, as should new_user. So let's change them:
data$converted = as.factor(data$converted) # let's make the class a factor
data$new_user = as.factor(data$new_user) # also a factor
levels(data$country)[levels(data$country)=="Germany"]="DE" # Shorter name, easier to plot.
# Create test/training set with a standard 66% split (if the data were too small, I would cross-validate),
# then build the forest with standard values for the 3 most important parameters:
# 100 trees, trees as large as possible, 3 random variables selected at each split.
set.seed(4321)  # hypothetical seed, just to make the split reproducible
train_sample = sample(nrow(data), size = nrow(data)*0.66)
train_data = data[train_sample,]
test_data = data[-train_sample,]
rf = randomForest(y = train_data$converted, x = train_data[, -ncol(train_data)],
                  ytest = test_data$converted, xtest = test_data[, -ncol(test_data)],
                  ntree = 100, mtry = 3, keep.forest = TRUE)
rf
##
## Call:
## randomForest(x = train_data[, -ncol(train_data)], y = train_data$converted, xtest = test_data[, -ncol(test_data)], ytest = test_data$converted, ntree = 100, mtry = 3, keep.forest = TRUE)
## Type of random forest: classification
## Number of trees: 100
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 1.46%
## Confusion matrix:
## 0 1 class.error
## 0 201083 918 0.004544532
## 1 2124 4565 0.317536254
## Test set error rate: 1.46%
## Confusion matrix:
## 0 1 class.error
## 0 103518 481 0.004625044
## 1 1086 2423 0.309489883
#OOB error and test error are pretty similar, both around 1.5%, so we are confident we are not overfitting.
#Error is pretty low. However, we started from ~97% accuracy (what we would get by classifying everyone
#as "non converted"). So 98.5% is good, but nothing shocking. Indeed, about 30% of actual conversions
#are predicted as "non conversion".
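# A quick check of those two claims (sketch only, not part of the original analysis):
mean(train_data$converted == 0)  # baseline accuracy of always predicting "non converted", ~0.97
rf$confusion[2, 2] / sum(rf$confusion[2, 1:2])  # OOB recall on the converted class, ~0.68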
# Let's start checking variable importance:
varImpPlot(rf,type=2)
# Total pages visited is the most important variable.
# Unfortunately, it is probably the least "actionable": people visit many pages because they already
# want to buy, and in order to buy you have to click through multiple pages anyway.
#Let's rebuild the RF without that variable. Since classes are heavily unbalanced and we no longer have
#that very powerful variable, let's also adjust the class weights a bit, just to make sure we get
#something classified as 1.
rf = randomForest(y = train_data$converted, x = train_data[, -c(5, ncol(train_data))],
                  ytest = test_data$converted, xtest = test_data[, -c(5, ncol(train_data))],
                  ntree = 100, mtry = 3, keep.forest = TRUE, classwt = c(0.7, 0.3))
rf
##
## Call:
## randomForest(x = train_data[, -c(5, ncol(train_data))], y = train_data$converted, xtest = test_data[, -c(5, ncol(train_data))], ytest = test_data$converted, ntree = 100, mtry = 3, classwt = c(0.7, 0.3), keep.forest = TRUE)
## Type of random forest: classification
## Number of trees: 100
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 14.05%
## Confusion matrix:
## 0 1 class.error
## 0 175697 26304 0.1302172
## 1 3009 3680 0.4498430
## Test set error rate: 14.39%
## Confusion matrix:
## 0 1 class.error
## 0 90138 13861 0.1332801
## 1 1605 1904 0.4573953
#Accuracy went down, but that's fine. The model is still good enough to give us insights.
#Let's recheck variable importance:
varImpPlot(rf,type=2)
#Interesting! New user is the most important one. Source doesn't seem to matter at all.
#Let's check partial dependence plots for the 4 vars:
op <- par(mfrow = c(2, 2))  # 2x2 grid, one panel per variable
partialPlot(rf, train_data, country, 1)
partialPlot(rf, train_data, age, 1)
partialPlot(rf, train_data, new_user, 1)
partialPlot(rf, train_data, source, 1)
par(op)  # restore the previous plotting parameters
#In partial dependence plots, we just care about the trend, not the actual y value. These show that:
#Users with an old account are much better than new users
#China converts really badly; all other countries are similar, with Germany being the best
#The site works very well for young people and poorly for older users (>30 yrs old)
#Source is irrelevant
# Let's also build a simple decision tree to double-check those segments, using the same priors as the class weights above:
tree = rpart(converted ~ ., data = data[, -5],  # drop total_pages_visited, as in the second forest
             control = rpart.control(maxdepth = 3),
             parms = list(prior = c(0.7, 0.3)))
tree
## n= 316198
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 316198 94859.4000 0 (0.70000000 0.30000000)
## 2) new_user=1 216744 28268.0600 0 (0.84540048 0.15459952) *
## 3) new_user=0 99454 66591.3400 0 (0.50063101 0.49936899)
## 6) country=China 23094 613.9165 0 (0.96445336 0.03554664) *
## 7) country=DE,UK,US 76360 50102.8100 1 (0.43162227 0.56837773)
## 14) age>=29.5 38341 19589.5200 0 (0.57227507 0.42772493) *
## 15) age< 29.5 38019 23893.0000 1 (0.33996429 0.66003571) *
# The tree confirms the forest's story: the key splits are new_user, country == China, and age around 30.
# Some conclusions and suggestions:
#1. The site is working very well for young users. Let's tell marketing to advertise via channels that are more likely to reach young people.
#2. The site is working very well for Germany in terms of conversion, but the summary showed that few Germans come to the site: far fewer than from the UK, despite Germany's larger population. Again, marketing should get more Germans. Big opportunity.
#3. Users with old accounts do much better. Targeted emails with offers to bring them back to the site could be a good idea to try.
#4. Something is wrong with the Chinese version of the site. It is either poorly translated, doesn't fit the local culture, has payment issues, or maybe it is just in English! Given how many users are based in China, fixing this should be a top priority. Huge opportunity.
#5. Maybe go through the UI and figure out why older users perform so poorly? From 30 y/o onward, conversion clearly starts dropping.
#6. If I know someone has visited many pages but hasn't converted, she almost surely has high purchase intent. I could email her targeted offers or send her reminders (see the sketch below). Overall, these are probably the easiest users to get to convert.
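# A minimal sketch of point 6, assuming a hypothetical cutoff of 8 pages for "high intent"
# (the right threshold would come from the conversion-rate-by-pages plot above):
high_intent = subset(data, total_pages_visited >= 8 & converted == 0)
nrow(high_intent)  # candidate users for reminder emails / targeted offers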
# As we can see, conclusions usually end up being about:
# 1. telling marketing to get more of the well-performing user segments
# 2. telling product to fix the experience for the poorly performing ones