We have data about users who hit our site: whether they converted or not as well as some of their characteristics such as their country, the marketing channel, their age, whether they are repeat users and the number of pages visited during that session (as a proxy for site activity/time spent on site).
The task is to:
Columns:
#libraries needed
library(dplyr)
library(rpart)
library(ggplot2)
library(randomForest)
Let’s check the structure of the data
head(data)
## country age new_user source total_pages_visited converted
## 1 UK 25 1 Ads 1 0
## 2 US 23 1 Seo 5 0
## 3 US 28 1 Seo 4 0
## 4 China 39 1 Seo 5 0
## 5 US 30 1 Seo 6 0
## 6 US 31 0 Seo 1 0
str(data)
## 'data.frame': 316200 obs. of 6 variables:
## $ country : Factor w/ 4 levels "China","Germany",..: 3 4 4 1 4 4 1 4 3 4 ...
## $ age : int 25 23 28 39 30 31 27 23 29 25 ...
## $ new_user : int 1 1 1 1 1 0 1 0 0 0 ...
## $ source : Factor w/ 3 levels "Ads","Direct",..: 1 3 3 3 3 3 3 1 2 1 ...
## $ total_pages_visited: int 1 5 4 5 6 1 4 4 4 2 ...
## $ converted : int 0 0 0 0 0 0 0 0 0 0 ...
summary(data)
## country age new_user source
## China : 76602 Min. : 17.00 Min. :0.0000 Ads : 88740
## Germany: 13056 1st Qu.: 24.00 1st Qu.:0.0000 Direct: 72420
## UK : 48450 Median : 30.00 Median :1.0000 Seo :155040
## US :178092 Mean : 30.57 Mean :0.6855
## 3rd Qu.: 36.00 3rd Qu.:1.0000
## Max. :123.00 Max. :1.0000
## total_pages_visited converted
## Min. : 1.000 Min. :0.00000
## 1st Qu.: 2.000 1st Qu.:0.00000
## Median : 4.000 Median :0.00000
## Mean : 4.873 Mean :0.03226
## 3rd Qu.: 7.000 3rd Qu.:0.00000
## Max. :29.000 Max. :1.00000
Some initial observations:
Let’s investigate the maximum age, which is 123:
sort(unique(data$age), decreasing = TRUE)
## [1] 123 111 79 77 73 72 70 69 68 67 66 65 64 63 62 61 60
## [18] 59 58 57 56 55 54 53 52 51 50 49 48 47 46 45 44 43
## [35] 42 41 40 39 38 37 36 35 34 33 32 31 30 29 28 27 26
## [52] 25 24 23 22 21 20 19 18 17
Let’s find out how many users have an age of 123 or 111:
subset(data, age > 79)
## country age new_user source total_pages_visited converted
## 90929 Germany 123 0 Seo 15 1
## 295582 UK 111 0 Ads 10 1
Since there are only 2 such records, let’s remove them from our analysis.
data <- subset(data, age <80)
Now let’s explore the variables:
data_country = data %>%
group_by(country) %>%
summarise(conversion_rate = mean(converted))
ggplot(data = data_country, aes(x=country, y=conversion_rate)) +
geom_bar(stat = "identity", aes(fill = country))
We observe that China has the lowest conversion rate. It will be interesting to explore the reasons for this further.
data_pages = data %>%
group_by(total_pages_visited) %>%
summarise(conversion_rate = mean(converted))
ggplot(data = data_pages, aes(x=total_pages_visited,y=conversion_rate))+
geom_line()
This graph clearly shows that the more pages a user visits in a session, the higher the chance of conversion.
Let’s see average user age by country
data_user = data %>%
group_by(country) %>%
summarize(age_by_country = mean(age))
ggplot(data = data_user, aes(x= country, y = age_by_country ))+
geom_bar(stat = "identity", aes(fill= country) )
This graph shows that users are, on average, about 30 years old in every country.
Let’s build a model to predict the conversion. The output is binary and we care about insights to give the product and marketing teams.
I am going to pick a random forest to predict conversion. I used this algorithm because:
I will use the random forest to predict conversion, then use its partial dependence plots and variable importance to get insights about how it uses the variables. I will also build a simple tree to find the most obvious user segments and see whether they agree with the RF partial dependence plots. First, “converted” should really be a factor here, as should new_user. So let’s change them:
data$converted = as.factor(data$converted) # let's make the class a factor
data$new_user = as.factor(data$new_user) #also this a factor
levels(data$country)[levels(data$country)=="Germany"]="DE" # Shorter name, easier to plot.
Create test/training set with a standard 66% split (if the data were too small, I would cross-validate) and then build the forest with standard values for the 3 most important parameters (100 trees, trees as large as possible, 3 random variables selected at each split).
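Because sample() draws rows at random, the split (and every number reported afterwards) changes from run to run unless a seed is fixed. A minimal sketch of a reproducible 66% split, assuming a toy data frame in place of the real data; the frame `toy`, its `id` column and the seed value 42 are hypothetical and not part of the original analysis:

```r
# Toy stand-in for the real data frame (hypothetical values).
toy <- data.frame(id = 1:100, converted = rep(c(0, 1), times = c(90, 10)))

set.seed(42)  # arbitrary seed; any fixed integer makes the draw repeatable
train_idx <- sample(nrow(toy), size = floor(nrow(toy) * 0.66))
train_toy <- toy[train_idx, ]
test_toy  <- toy[-train_idx, ]

# Sanity checks: sizes of the two sets and overlap between them.
nrow(train_toy)
nrow(test_toy)
length(intersect(train_toy$id, test_toy$id))
```

The same pattern would apply to the real data by calling set.seed() once before the sample() call.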
train_sample = sample(nrow(data), size = nrow(data)*0.66)
train_data = data[train_sample,]
test_data = data[-train_sample,]
rf = randomForest(y=train_data$converted, x = train_data[, -ncol(train_data)],
ytest = test_data$converted, xtest = test_data[, -ncol(test_data)],
ntree = 100, mtry = 3, keep.forest = TRUE)
rf
##
## Call:
## randomForest(x = train_data[, -ncol(train_data)], y = train_data$converted, xtest = test_data[, -ncol(test_data)], ytest = test_data$converted, ntree = 100, mtry = 3, keep.forest = TRUE)
## Type of random forest: classification
## Number of trees: 100
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 1.44%
## Confusion matrix:
## 0 1 class.error
## 0 201232 852 0.004216069
## 1 2149 4457 0.325310324
## Test set error rate: 1.48%
## Confusion matrix:
## 0 1 class.error
## 0 103492 424 0.004080219
## 1 1163 2429 0.323775056
So, OOB error and test error are pretty similar: 1.4% and 1.5%. We are confident we are not overfitting. Error is pretty low. However, we started from roughly 97% accuracy (that’s what we would get by classifying everything as “non converted”). So, 98.5% is good, but nothing shocking. Indeed, about 32% of conversions are predicted as “non conversion”. If we cared about the very best possible accuracy or specifically about minimizing false positives or false negatives, we would also use ROCR to find the best cut-off point. Since that doesn’t appear to be particularly relevant in this case, we are fine with the default 0.5 cutoff value used internally by the random forest to make the prediction.
Let’s check the variable importance:
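The figures discussed here can be recomputed directly from the test-set confusion matrix printed above. A small sketch, using only the four counts from that output; the variable names are my own:

```r
# Test-set confusion matrix copied from the randomForest output
# (rows = actual class, columns = predicted class).
cm <- matrix(c(103492, 424,
               1163,   2429),
             nrow = 2, byrow = TRUE,
             dimnames = list(actual = c("0", "1"), predicted = c("0", "1")))

n         <- sum(cm)
accuracy  <- sum(diag(cm)) / n
baseline  <- sum(cm["0", ]) / n             # accuracy of always predicting "0"
recall    <- cm["1", "1"] / sum(cm["1", ])  # share of real conversions that were caught
precision <- cm["1", "1"] / sum(cm[, "1"])  # share of predicted conversions that were real

round(c(accuracy = accuracy, baseline = baseline,
        recall = recall, precision = precision), 3)
```

This gives an accuracy of about 98.5% against a baseline of about 96.7%, with recall near 68% and precision near 85%, which is why the accuracy gain alone looks modest.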
varImpPlot(rf,type=2)
Total pages visited is the most important one, by far. Unfortunately, it is probably the least “actionable”. People visit many pages because they already want to buy. Also, in order to buy you have to click through multiple pages. Let’s rebuild the RF without that variable. Since classes are heavily unbalanced and we no longer have that very powerful variable, let’s change the class weights a bit, just to make sure something gets classified as 1.
rf = randomForest(y=train_data$converted, x = train_data[, -c(5, ncol(train_data))],
ytest = test_data$converted, xtest = test_data[, -c(5, ncol(train_data))],
ntree = 100, mtry = 3, keep.forest = TRUE, classwt = c(0.7,0.3))
rf
##
## Call:
## randomForest(x = train_data[, -c(5, ncol(train_data))], y = train_data$converted, xtest = test_data[, -c(5, ncol(train_data))], ytest = test_data$converted, ntree = 100, mtry = 3, classwt = c(0.7, 0.3), keep.forest = TRUE)
## Type of random forest: classification
## Number of trees: 100
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 14.34%
## Confusion matrix:
## 0 1 class.error
## 0 175205 26879 0.1330090
## 1 3047 3559 0.4612474
## Test set error rate: 14.31%
## Confusion matrix:
## 0 1 class.error
## 0 90149 13767 0.1324820
## 1 1620 1972 0.4510022
Our accuracy went down (test error rose from about 1.5% to about 14.3%), but the model is still good enough to give us insights.
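Class weights are one way to trade false negatives for false positives; moving the cutoff on the predicted score, as mentioned earlier, is another. The sketch below sweeps cutoffs over synthetic scores to show how recall and false positive rate move together. Everything in it is an assumption for illustration: the scores are simulated from beta distributions, not taken from the forest. For the real model, the analogous scores would presumably be the class "1" vote fractions stored in `rf$test$votes`.

```r
set.seed(1)  # arbitrary seed for the synthetic example
n      <- 10000
actual <- rbinom(n, 1, 0.03)  # roughly 3% positives, similar to the real data
# Hypothetical model scores: positives tend to score higher than negatives.
score  <- ifelse(actual == 1, rbeta(n, 4, 3), rbeta(n, 1, 6))

cutoffs <- seq(0.1, 0.9, by = 0.1)
sweep_cutoff <- function(cut) {
  pred <- as.integer(score >= cut)
  c(cutoff = cut,
    recall = sum(pred == 1 & actual == 1) / sum(actual == 1),
    fpr    = sum(pred == 1 & actual == 0) / sum(actual == 0))
}
# One row per cutoff, with columns cutoff, recall and fpr.
results <- as.data.frame(t(sapply(cutoffs, sweep_cutoff)))
results
```

Lower cutoffs catch more conversions at the cost of more false alarms; which point to pick depends on what each kind of error costs the business.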
Let’s recheck the variable importance:
varImpPlot(rf,type = 2)
Interesting! New user is the most important one. Source doesn’t seem to matter at all. Let’s check partial dependence plots for the 4 vars:
op <- par(mfrow=c(2, 2))
partialPlot(rf, train_data, country, 1)
partialPlot(rf, train_data, age, 1)
partialPlot(rf, train_data, new_user, 1)
partialPlot(rf, train_data, source, 1)
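The idea behind these plots can be sketched by hand: fix one variable at a given value for every row, predict, and average the predictions; repeat over a grid of values. The example below is a rough illustration on synthetic data, using logistic regression (base R `glm`) as a stand-in for the forest so it runs without extra packages. The data-generating process, the names `toy`, `fit` and `pd_age`, and the seed are all hypothetical. As far as I understand, partialPlot reports classification results on a logit-like scale rather than as raw probabilities, so only the shape of the curve is comparable.

```r
set.seed(7)  # arbitrary seed for the synthetic example
n   <- 5000
toy <- data.frame(
  age      = sample(17:60, n, replace = TRUE),
  new_user = factor(sample(c(0, 1), n, replace = TRUE))
)
# Hypothetical data-generating process: younger and returning users convert more.
p <- plogis(-1 - 0.05 * (toy$age - 30) - 1.2 * (toy$new_user == "1"))
toy$converted <- rbinom(n, 1, p)

# Stand-in model (logistic regression instead of a random forest).
fit <- glm(converted ~ age + new_user, data = toy, family = binomial)

# Partial dependence on age: force every row to one age, predict, average.
age_grid <- seq(20, 60, by = 10)
pd_age <- sapply(age_grid, function(a) {
  forced <- toy
  forced$age <- a
  mean(predict(fit, newdata = forced, type = "response"))
})
data.frame(age = age_grid, avg_predicted_conversion = pd_age)
```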
In partial dependence plots, we just care about the trend, not the actual y value. So this shows that:
Users with an old account are much better than new users.
China is really bad; all other countries are similar, with Germany being the best.
The site works very well for young people and badly for less young people (>30 yrs old).
Source is irrelevant.
Let’s now build a simple decision tree and check the 2 or 3 most important segments:
tree = rpart(data$converted ~ ., data[, -c(5,ncol(data))],
control = rpart.control(maxdepth = 3),
parms = list(prior = c(0.7, 0.3)))
tree
## n= 316198
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 316198 94859.4000 0 (0.70000000 0.30000000)
## 2) new_user=1 216744 28268.0600 0 (0.84540048 0.15459952) *
## 3) new_user=0 99454 66591.3400 0 (0.50063101 0.49936899)
## 6) country=China 23094 613.9165 0 (0.96445336 0.03554664) *
## 7) country=DE,UK,US 76360 50102.8100 1 (0.43162227 0.56837773)
## 14) age>=29.5 38341 19589.5200 0 (0.57227507 0.42772493) *
## 15) age< 29.5 38019 23893.0000 1 (0.33996429 0.66003571) *
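Note that the probabilities printed in each node are computed under the 0.7/0.3 prior, so they are not observed conversion rates. Assuming rpart weights each class by prior divided by class frequency (my understanding of how `parms = list(prior = ...)` behaves), the observed rate in each leaf can be approximately recovered as below. The base rate is taken from summary(data), which was computed before the two outlier rows were removed, so treat the results as approximate; the leaf labels are my own.

```r
prior     <- c("0" = 0.7, "1" = 0.3)  # priors passed to rpart
base_rate <- 0.03226                  # mean(converted) from summary(data); approximate
base_odds <- base_rate / (1 - base_rate)

# Prior-adjusted P(converted = 1) per terminal node, copied from the tree printout.
leaf_adj <- c(new_user_1         = 0.15459952,
              returning_china    = 0.03554664,
              returning_age_30up = 0.42772493,
              returning_under_30 = 0.66003571)

# Undo the prior: observed odds = adjusted odds * (prior0 / prior1) * base odds.
adj_odds <- leaf_adj / (1 - leaf_adj)
raw_odds <- adj_odds * (prior[["0"]] / prior[["1"]]) * base_odds
raw_rate <- raw_odds / (1 + raw_odds)
round(raw_rate, 4)
```

Under these assumptions, returning users under 30 outside China convert at roughly 13%, returning users aged 30 and over at roughly 5.5%, new users at roughly 1.4%, and returning users in China at roughly 0.3%.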
Some conclusions and suggestions:
The site is working very well for young users. Let’s tell marketing to advertise on channels that are more likely to reach young people.
The site is working very well for Germany in terms of conversion. But the summary showed that few Germans come to the site: far fewer than from the UK, despite Germany’s larger population. Again, marketing should bring in more German users. Big opportunity.
Users with old accounts do much better. Targeted emails with offers to bring them back to the site could be a good idea to try.
Something is wrong with the Chinese version of the site. Perhaps it is poorly translated, doesn’t fit the local culture, has a payment issue, or is simply in English! Given how many users are based in China, fixing this should be a top priority. Huge opportunity.
Maybe go through the UI and figure out why older users perform so poorly? From about 30 years of age, conversion clearly starts dropping.
If I know someone has visited many pages but hasn’t converted, she almost surely has high purchase intent. I could email her targeted offers or send her reminders. Overall, these are probably the easiest users to convert.