R

Challenge Description:

We have data about users who hit our site: whether they converted or not as well as some of their characteristics such as their country, the marketing channel, their age, whether they are repeat users and the number of pages visited during that session (as a proxy for site activity/time spent on site).

The task is to:

Data

Columns:

  • country : user country based on the IP address
  • age : user age, self-reported at the sign-in step
  • new_user : whether the user created the account during this session or already had an account and simply came back to the site
  • source : marketing channel source
      ◦ Ads: came to the site by clicking on an advertisement
      ◦ Seo: came to the site by clicking on search results
      ◦ Direct: came to the site by typing the URL directly into the browser
  • total_pages_visited: number of total pages visited during the session. This is a proxy for time spent on site and engagement during the session.
  • converted: this is our label. 1 means they converted within the session, 0 means they left without buying anything. The company goal is to increase conversion rate: # conversions / total sessions.

# libraries needed
library(dplyr)
library(rpart)
library(ggplot2)
library(randomForest)

Let’s check the structure of the data

head(data)
##   country age new_user source total_pages_visited converted
## 1      UK  25        1    Ads                   1         0
## 2      US  23        1    Seo                   5         0
## 3      US  28        1    Seo                   4         0
## 4   China  39        1    Seo                   5         0
## 5      US  30        1    Seo                   6         0
## 6      US  31        0    Seo                   1         0
str(data)
## 'data.frame':    316200 obs. of  6 variables:
##  $ country            : Factor w/ 4 levels "China","Germany",..: 3 4 4 1 4 4 1 4 3 4 ...
##  $ age                : int  25 23 28 39 30 31 27 23 29 25 ...
##  $ new_user           : int  1 1 1 1 1 0 1 0 0 0 ...
##  $ source             : Factor w/ 3 levels "Ads","Direct",..: 1 3 3 3 3 3 3 1 2 1 ...
##  $ total_pages_visited: int  1 5 4 5 6 1 4 4 4 2 ...
##  $ converted          : int  0 0 0 0 0 0 0 0 0 0 ...
summary(data)
##     country            age            new_user         source      
##  China  : 76602   Min.   : 17.00   Min.   :0.0000   Ads   : 88740  
##  Germany: 13056   1st Qu.: 24.00   1st Qu.:0.0000   Direct: 72420  
##  UK     : 48450   Median : 30.00   Median :1.0000   Seo   :155040  
##  US     :178092   Mean   : 30.57   Mean   :0.6855                  
##                   3rd Qu.: 36.00   3rd Qu.:1.0000                  
##                   Max.   :123.00   Max.   :1.0000                  
##  total_pages_visited   converted      
##  Min.   : 1.000      Min.   :0.00000  
##  1st Qu.: 2.000      1st Qu.:0.00000  
##  Median : 4.000      Median :0.00000  
##  Mean   : 4.873      Mean   :0.03226  
##  3rd Qu.: 7.000      3rd Qu.:0.00000  
##  Max.   :29.000      Max.   :1.00000

Some initial observations:

  • Large Chinese user base for a US-based company
  • Median age of users around 30 years
  • Conversion rate is around 3%

Let’s investigate the maximum age, which is 123:

sort(unique(data$age), decreasing = TRUE)
##  [1] 123 111  79  77  73  72  70  69  68  67  66  65  64  63  62  61  60
## [18]  59  58  57  56  55  54  53  52  51  50  49  48  47  46  45  44  43
## [35]  42  41  40  39  38  37  36  35  34  33  32  31  30  29  28  27  26
## [52]  25  24  23  22  21  20  19  18  17

Let’s find out how many users are aged 123 or 111:

subset(data, age > 79)
##        country age new_user source total_pages_visited converted
## 90929  Germany 123        0    Seo                  15         1
## 295582      UK 111        0    Ads                  10         1

Since there are only 2 such records, let’s remove them from our analysis.

data <- subset(data, age <80)
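
Before dropping rows, it can help to confirm how many rows a filter would remove. The following is a minimal sketch on a made-up data frame: the `toy` values and the name `max_plausible_age` are illustrative assumptions, not part of the original analysis.

```r
# Toy data standing in for the real `data`; all values are made up.
toy <- data.frame(
  age       = c(25, 31, 123, 42, 111, 19),
  converted = c(0, 0, 1, 0, 1, 0)
)

max_plausible_age <- 79                  # assumption: same threshold as in the text
is_outlier <- toy$age > max_plausible_age

n_dropped <- sum(is_outlier)             # number of rows the filter would remove
toy_clean <- toy[!is_outlier, ]          # keep only rows with plausible ages
```

Checking `n_dropped` against the total row count makes it explicit that the filter removes a negligible share of the data.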

Now let’s explore the variables:

data_country = data %>%
  group_by(country) %>%
  summarise(conversion_rate = mean(converted))

ggplot(data = data_country, aes(x=country, y=conversion_rate)) +
  geom_bar(stat = "identity", aes(fill = country))

We observe that China has the lowest conversion rate. It will be interesting to explore the reasons for this further.
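
Group sizes differ a lot between countries (the summary shows 13,056 sessions for Germany against 178,092 for the US), so it may be worth looking at each rate next to its sample size and a rough standard error. The sketch below uses base R on a made-up data frame; `toy` and `by_country` are hypothetical names, and the binomial standard error is an added assumption rather than something the original analysis computes.

```r
# Toy data; values are made up for illustration only.
toy <- data.frame(
  country   = c("US", "US", "US", "US", "China", "China", "China", "China", "UK", "UK"),
  converted = c(1, 0, 0, 1, 0, 0, 0, 0, 1, 0)
)

rate <- tapply(toy$converted, toy$country, mean)     # conversion rate per country
n    <- tapply(toy$converted, toy$country, length)   # sessions per country
se   <- sqrt(rate * (1 - rate) / n)                  # binomial standard error of the rate

by_country <- data.frame(
  country         = names(rate),
  n               = as.vector(n),
  conversion_rate = as.vector(rate),
  se              = as.vector(se)
)
by_country
```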

data_pages = data %>%
  group_by(total_pages_visited) %>%
  summarise(conversion_rate = mean(converted))

ggplot(data = data_pages, aes(x=total_pages_visited,y=conversion_rate))+
  geom_line()

This graph shows that the more pages a user visits during a session, the higher the chance of conversion.

Let’s see average user age by country

data_user = data %>%
  group_by(country) %>%
  summarize(age_by_country = mean(age))


ggplot(data = data_user, aes(x= country, y = age_by_country ))+
  geom_bar(stat = "identity", aes(fill= country) )

This graph shows that the average user age is about 30 in every country.

Machine Learning

Let’s build a model to predict the conversion. The output is binary and we care about insights to give the product and marketing teams.

I am going to pick a random forest to predict conversion. I chose this algorithm because:

  • It usually requires very little time to optimize (its default params are often close to the best ones)
  • It is robust to outliers and irrelevant variables, and it handles both continuous and discrete variables

I will use the random forest to predict conversion, then use its partial dependence plots and variable importance to get insights into how it extracts information from the variables. I will also build a simple tree to find the most obvious user segments and see if they agree with the RF partial dependence plots.

First, converted should really be a factor, as should new_user. So let’s change them:

data$converted = as.factor(data$converted) # let's make the class a factor
data$new_user = as.factor(data$new_user) #also this a factor
levels(data$country)[levels(data$country)=="Germany"]="DE" # Shorter name, easier to plot.

Create test/training set with a standard 66% split (if the data were too small, I would cross-validate) and then build the forest with standard values for the 3 most important parameters (100 trees, trees as large as possible, 3 random variables selected at each split).

train_sample = sample(nrow(data), size = nrow(data)*0.66)
train_data = data[train_sample,]
test_data = data[-train_sample,]

rf = randomForest(y=train_data$converted, x = train_data[, -ncol(train_data)],
                  ytest = test_data$converted, xtest = test_data[, -ncol(test_data)],
                  ntree = 100, mtry = 3, keep.forest = TRUE)
rf
## 
## Call:
##  randomForest(x = train_data[, -ncol(train_data)], y = train_data$converted,      xtest = test_data[, -ncol(test_data)], ytest = test_data$converted,      ntree = 100, mtry = 3, keep.forest = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 100
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 1.44%
## Confusion matrix:
##        0    1 class.error
## 0 201232  852 0.004216069
## 1   2149 4457 0.325310324
##                 Test set error rate: 1.48%
## Confusion matrix:
##        0    1 class.error
## 0 103492  424 0.004080219
## 1   1163 2429 0.323775056
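
The random split above is not seeded, so rerunning it gives slightly different train and test sets. A minimal sketch of a reproducible split on made-up data follows; the seed value 42 and the `toy` data frame are arbitrary assumptions for illustration.

```r
set.seed(42)   # arbitrary seed (an assumption); any fixed value makes the split repeatable

# Toy data standing in for the real `data`.
toy <- data.frame(id = 1:100, converted = rbinom(100, 1, 0.1))

train_idx <- sample(nrow(toy), size = floor(nrow(toy) * 0.66))  # 66% of the row indices
train_toy <- toy[train_idx, ]
test_toy  <- toy[-train_idx, ]                                  # all remaining rows
```

Using `floor()` keeps the sample size an integer, and negative indexing guarantees that the two sets do not overlap.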

So, the OOB error and the test error are very similar: 1.4% and 1.5%. We are fairly confident we are not overfitting. The error is low, but the baseline matters: classifying every session as “non converted” already gives about 97% accuracy. So 98.5% is good, but nothing shocking. Indeed, about 32% of actual conversions are predicted as “non conversion”. If we cared about the very best possible accuracy, or specifically about minimizing false positives or false negatives, we would also use ROCR to find the best cut-off point. Since in this case that doesn’t appear to be particularly relevant, we are fine with the default 0.5 cutoff value used internally by the random forest to make the prediction.
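
The figures quoted in this paragraph can be recomputed from the test-set confusion matrix printed above. The sketch below copies those four counts by hand; the variable names are illustrative.

```r
# Test-set confusion matrix copied from the randomForest output
# (rows = actual class, columns = predicted class).
cm <- matrix(c(103492,  424,
                 1163, 2429),
             nrow = 2, byrow = TRUE,
             dimnames = list(actual = c("0", "1"), predicted = c("0", "1")))

accuracy  <- sum(diag(cm)) / sum(cm)         # share of sessions classified correctly
baseline  <- sum(cm["0", ]) / sum(cm)        # accuracy of always predicting "0"
recall    <- cm["1", "1"] / sum(cm["1", ])   # share of real conversions that were caught
precision <- cm["1", "1"] / sum(cm[, "1"])   # share of predicted conversions that were real

round(c(accuracy = accuracy, baseline = baseline,
        recall = recall, precision = precision), 3)
```

With these counts, accuracy is about 0.985 against a baseline of about 0.967, recall is about 0.68 and precision about 0.85, which is why the accuracy gain over the trivial classifier looks modest.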

Let’s check the variable importance:

varImpPlot(rf,type=2)

Total pages visited is the most important variable, by far. Unfortunately, it is probably the least “actionable”. People visit many pages because they already want to buy. Also, in order to buy you have to click through multiple pages. Let’s rebuild the RF without that variable. Since the classes are heavily unbalanced and we no longer have that very powerful variable, let’s change the class weights a bit, just to make sure we get something classified as 1.

# column 5 is total_pages_visited; the last column is the label
rf = randomForest(y=train_data$converted, x = train_data[, -c(5, ncol(train_data))],
                  ytest = test_data$converted, xtest = test_data[, -c(5, ncol(train_data))],
                  ntree = 100, mtry = 3, keep.forest = TRUE, classwt = c(0.7,0.3))
rf
## 
## Call:
##  randomForest(x = train_data[, -c(5, ncol(train_data))], y = train_data$converted,      xtest = test_data[, -c(5, ncol(train_data))], ytest = test_data$converted,      ntree = 100, mtry = 3, classwt = c(0.7, 0.3), keep.forest = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 100
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 14.34%
## Confusion matrix:
##        0     1 class.error
## 0 175205 26879   0.1330090
## 1   3047  3559   0.4612474
##                 Test set error rate: 14.31%
## Confusion matrix:
##       0     1 class.error
## 0 90149 13767   0.1324820
## 1  1620  1972   0.4510022

Our accuracy went down (the error rate rose from about 1.5% to about 14%), but the model is still good enough to give us insights.

Let’s recheck the variable importance:

varImpPlot(rf,type = 2)

Interesting! New user is the most important one. Source doesn’t seem to matter at all. Let’s check partial dependence plots for the 4 vars:

op <- par(mfrow=c(2, 2)) # 2x2 plotting grid; keep the old settings in op
partialPlot(rf, train_data, country, 1)
partialPlot(rf, train_data, age, 1)
partialPlot(rf, train_data, new_user, 1)
partialPlot(rf, train_data, source, 1)
par(op) # restore the previous plotting settings
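
For intuition about what a partial dependence curve represents, here is a minimal sketch of the general idea: fix one variable at a value for every row, predict, and average. It is not the `partialPlot` implementation. It uses simulated data and a logistic regression from base R so that it runs without extra packages, and it averages predicted probabilities, whereas `partialPlot` reports classification results on a logit-type scale. The names `toy`, `fit` and `partial_dependence` are hypothetical.

```r
set.seed(1)
n <- 2000

# Simulated data (an assumption): conversion falls with age and is lower for new users.
toy <- data.frame(
  age      = sample(17:70, n, replace = TRUE),
  new_user = factor(rbinom(n, 1, 0.7))
)
p <- plogis(-1 - 0.05 * (toy$age - 30) - 1.5 * (toy$new_user == "1"))
toy$converted <- rbinom(n, 1, p)

fit <- glm(converted ~ age + new_user, data = toy, family = binomial)

# Average prediction when `var` is forced to each grid value for all rows.
partial_dependence <- function(model, data, var, grid) {
  sapply(grid, function(v) {
    tmp <- data
    tmp[[var]] <- v                      # overwrite the variable for every row
    mean(predict(model, newdata = tmp, type = "response"))
  })
}

age_grid <- seq(20, 60, by = 10)
pd_age   <- partial_dependence(fit, toy, "age", age_grid)
```

Plotting `pd_age` against `age_grid` gives the trend for age with the other variables averaged out, which is the part of the plot worth reading.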

In partial dependence plots, we care about the trend, not the actual y value. The plots show that:

  • Users with an old account convert much better than new users
  • China is really bad; all other countries are similar, with Germany being the best
  • The site works very well for young people and poorly for less young people (>30 years old)
  • Source is irrelevant

Let’s now build a simple decision tree and check the 2 or 3 most important segments:

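# drop total_pages_visited (column 5) and the label; keep the tree shallow;
# use class priors that match the classwt given to the forest
tree = rpart(data$converted ~ ., data[, -c(5,ncol(data))],
             control = rpart.control(maxdepth = 3),
             parms = list(prior = c(0.7, 0.3)))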
tree
## n= 316198 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 316198 94859.4000 0 (0.70000000 0.30000000)  
##    2) new_user=1 216744 28268.0600 0 (0.84540048 0.15459952) *
##    3) new_user=0 99454 66591.3400 0 (0.50063101 0.49936899)  
##      6) country=China 23094   613.9165 0 (0.96445336 0.03554664) *
##      7) country=DE,UK,US 76360 50102.8100 1 (0.43162227 0.56837773)  
##       14) age>=29.5 38341 19589.5200 0 (0.57227507 0.42772493) *
##       15) age< 29.5 38019 23893.0000 1 (0.33996429 0.66003571) *
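
The probabilities in this output are computed under the 0.7/0.3 prior, not the observed class balance (the root node shows exactly 0.7 and 0.3), so they are not conversion rates. A rough way to map them back is to rescale the odds by the ratio of the observed base-rate odds to the prior odds. The sketch below does this for the four leaves; `unprior` is a hypothetical helper, and the base rate of 0.03226 is taken from the earlier summary and is approximate.

```r
# Convert a class-1 probability reported under an altered prior back to an
# approximate rate under the observed class balance.
unprior <- function(p_adj, prior1, base1) {
  odds_adj <- p_adj / (1 - p_adj)
  # Undo the prior's odds multiplier, then apply the observed base-rate odds.
  odds <- odds_adj * (base1 / (1 - base1)) / (prior1 / (1 - prior1))
  odds / (1 + odds)
}

base_rate <- 0.03226   # overall conversion rate from summary(data), approximate

# Class-1 probabilities of the four terminal nodes, copied from the tree output.
node_probs <- c(new_user      = 0.15459952,   # node 2
                returning_cn  = 0.03554664,   # node 6
                returning_30p = 0.42772493,   # node 14
                returning_u30 = 0.66003571)   # node 15

rates <- unprior(node_probs, prior1 = 0.3, base1 = base_rate)
round(rates, 3)
```

Under these assumptions the segments correspond to roughly 1.4% conversion for new users, 0.3% for returning users in China, 5.5% for returning users aged 30 and over elsewhere, and 13% for returning users under 30 elsewhere.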

Some conclusions and suggestions:

  1. The site is working very well for young users. Let’s tell marketing to advertise through channels that are more likely to reach young people.

  2. The site is working very well for Germany in terms of conversion. But the summary showed that few Germans come to the site: far fewer than from the UK, despite Germany’s larger population. Again, marketing should get more Germans. Big opportunity.

  3. Users with old accounts do much better. Targeted emails with offers to bring them back to the site could be a good idea to try.

  4. Something is wrong with the Chinese version of the site. It may be poorly translated, not fit the local culture, have a payment issue, or simply be in English! Given how many users are based in China, fixing this should be a top priority. Huge opportunity.

  5. Maybe go through the UI and figure out why older users perform so poorly? Conversion clearly starts dropping from about 30 years of age.

  6. If I know someone has visited many pages but hasn’t converted, she almost surely has high purchase intent. I could email her targeted offers or send her reminders. Overall, these are probably the easiest users to convert.