Challenge Description

We have data about a website's user base: whether each user converted or not, along with characteristics such as country, marketing channel, age, whether they are repeat users, and the number of pages visited during the session (as a proxy for time spent on the site).

The project is to predict conversion and to use what the model learns to recommend ways to improve the conversion rate.

df <- read.csv("./data/conversion_data.csv")
head(df)
##   country age new_user source total_pages_visited converted
## 1      UK  25        1    Ads                   1         0
## 2      US  23        1    Seo                   5         0
## 3      US  28        1    Seo                   4         0
## 4   China  39        1    Seo                   5         0
## 5      US  30        1    Seo                   6         0
## 6      US  31        0    Seo                   1         0
##     country            age            new_user         source      
##  China  : 76602   Min.   : 17.00   Min.   :0.0000   Ads   : 88740  
##  Germany: 13056   1st Qu.: 24.00   1st Qu.:0.0000   Direct: 72420  
##  UK     : 48450   Median : 30.00   Median :1.0000   Seo   :155040  
##  US     :178092   Mean   : 30.57   Mean   :0.6855                  
##                   3rd Qu.: 36.00   3rd Qu.:1.0000                  
##                   Max.   :123.00   Max.   :1.0000                  
##  total_pages_visited   converted      
##  Min.   : 1.000      Min.   :0.00000  
##  1st Qu.: 2.000      1st Qu.:0.00000  
##  Median : 4.000      Median :0.00000  
##  Mean   : 4.873      Mean   :0.03226  
##  3rd Qu.: 7.000      3rd Qu.:0.00000  
##  Max.   :29.000      Max.   :1.00000
##  [1] 123 111  79  77  73  72  70  69  68  67  66  65  64  63  62  61  60
## [18]  59  58  57  56  55  54  53  52  51  50  49  48  47  46  45  44  43
## [35]  42  41  40  39  38  37  36  35  34  33  32  31  30  29  28  27  26
## [52]  25  24  23  22  21  20  19  18  17
## [1] 316200
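
The chunks that produced the summary, the sorted unique ages, and the row count are not shown; a minimal sketch that would reproduce those outputs:

summary(df)
sort(unique(df$age), decreasing = TRUE)
nrow(df)

Note the two implausible ages (123 and 111) at the top of the sorted list. The per-source session counts later in the analysis sum to 316,198 rather than 316,200, so those two rows appear to have been dropped before modeling.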

Distribution by Country
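
The chart itself is not reproduced here. A sketch of the kind of plot this heading implies, assuming ggplot2:

library(ggplot2)
ggplot(df, aes(x = country)) + geom_bar()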

Distribution by Age
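
Again assuming ggplot2, a histogram along these lines:

ggplot(df, aes(x = age)) + geom_histogram(binwidth = 1)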

Distribution by New User
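
A dplyr aggregation along these lines would produce the table below (conversion_rate is in percent):

library(dplyr)
df %>%
  group_by(new_user) %>%
  summarise(n = n(),
            conversions = sum(converted),
            conversion_rate = round(100 * conversions / n, 1))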

## # A tibble: 2 x 4
##   new_user      n conversions conversion_rate
##      <int>  <int>       <int>           <dbl>
## 1        0  99454        7159             7.2
## 2        1 216744        3039             1.4

Distribution of Total Pages Visited (Converted vs. Not Converted)
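
The table below compares the mean and standard deviation of pages visited per session across the two classes; a sketch of the grouping code, assuming converted has been turned into a factor by this point:

df %>%
  group_by(converted) %>%
  summarise(avg = mean(total_pages_visited),
            std = sd(total_pages_visited))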

## # A tibble: 2 x 3
##   converted       avg      std
##      <fctr>     <dbl>    <dbl>
## 1         0  4.550281 2.789910
## 2         1 14.553932 3.963522

Conversion Rate by Marketing Channel (Source)
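
A sketch of the per-channel aggregation (sum(converted == 1) works whether converted is numeric or a factor):

df %>%
  group_by(source) %>%
  summarise(total_sessions = n(),
            total_converted = sum(converted == 1),
            conversion_rate = total_converted / total_sessions) %>%
  arrange(conversion_rate)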

## # A tibble: 3 x 4
##   source total_sessions total_converted  conversion_rate
##   <fctr>          <int>           <int>            <dbl>
## 1 Direct          72420            2040       0.02816901
## 2    Seo         155039            5099       0.03288850
## 3    Ads          88739            3059       0.03447188

Predicting Conversion: Training
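
The split and modeling code are not shown. The degrees of freedom in the glm output imply 237,148 training rows out of 316,198, i.e. roughly a 75/25 split; a hypothetical reconstruction, with the seed, the split ratio, and the 0.5 decision threshold all assumptions:

set.seed(1)                                 # hypothetical seed
idx <- sample(nrow(df), size = round(0.75 * nrow(df)))
train <- df[idx, ]
test  <- df[-idx, ]

print("Baseline?")
prop.table(table(train$converted))          # majority-class share to beat

fit <- glm(converted ~ age + source + country + total_pages_visited + new_user,
           family = binomial, data = train)
summary(fit)

probs <- predict(fit, type = "response")
tapply(probs, train$converted, mean)        # mean predicted probability per actual class

print("Confusion Matrix")
predictions <- ifelse(probs > 0.5, 1, 0)    # assumed 0.5 threshold
cm <- table(train$converted, predictions)
cm

print("Accuracy")
mean(predictions == train$converted)
print("Sensitivity")
cm[2, 2] / sum(cm[2, ])                     # TP / (TP + FN)
print("Specificity")
cm[1, 1] / sum(cm[1, ])                     # TN / (TN + FP)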

## [1] "Baseline?"
## 
##          0          1 
## 0.96774806 0.03225194
## 
## Call:
## glm(formula = converted ~ age + source + country + total_pages_visited + 
##     new_user, family = binomial, data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.1379  -0.0632  -0.0242  -0.0097   4.4345  
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         -10.379518   0.175136 -59.266  < 2e-16 ***
## age                  -0.072921   0.002736 -26.649  < 2e-16 ***
## sourceDirect         -0.210490   0.056355  -3.735 0.000188 ***
## sourceSeo            -0.025429   0.045923  -0.554 0.579768    
## countryGermany        3.867213   0.154768  24.987  < 2e-16 ***
## countryUK             3.633325   0.141088  25.752  < 2e-16 ***
## countryUS             3.277430   0.137017  23.920  < 2e-16 ***
## total_pages_visited   0.756791   0.007139 106.001  < 2e-16 ***
## new_user             -1.741296   0.041110 -42.357  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 67577  on 237147  degrees of freedom
## Residual deviance: 19282  on 237139  degrees of freedom
## AIC: 19300
## 
## Number of Fisher Scoring iterations: 10
##          0          1 
## 0.01120418 0.66378676
## [1] "Confusion Matrix"
##    predictions
##          0      1
##   0 228178   1322
##   1   2014   5634
## [1] "Average error"
## [1] 0.9859328
## [1] "Sensitivity"
## [1] 0.7366632
## [1] "Specificity"
## [1] 0.9942397

Predicting on the Test Set
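
Applying the same fitted model to the held-out set follows the same pattern:

probs_test <- predict(fit, newdata = test, type = "response")
tapply(probs_test, test$converted, mean)

pred_test <- ifelse(probs_test > 0.5, 1, 0)
cm_test <- table(test$converted, pred_test)
print("Accuracy")
mean(pred_test == test$converted)
print("Sensitivity")
cm_test[2, 2] / sum(cm_test[2, ])
print("Specificity")
cm_test[1, 1] / sum(cm_test[1, ])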

##          0          1 
## 0.01125351 0.66031558
## [1] "average error"
## [1] 0.9855914
## [1] "Sensitivity"
## [1] 0.7266667
## [1] "Specificity"
## [1] 0.9942222

Let’s try a random forest now.
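
The call appears in the output below. For classification mode, randomForest needs the response as a factor, so converted is assumed to have been recoded first:

library(randomForest)
train$converted <- as.factor(train$converted)
test$converted  <- as.factor(test$converted)
rf <- randomForest(x = train[, -ncol(train)], y = train$converted,
                   xtest = test[, -ncol(test)], ytest = test$converted,
                   ntree = 100, mtry = 3, keep.forest = TRUE)
rf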

## 
## Call:
##  randomForest(x = train[, -ncol(train)], y = train$converted,      xtest = test[, -ncol(test)], ytest = test$converted, ntree = 100,      mtry = 3, keep.forest = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 100
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 1.43%
## Confusion matrix:
##        0    1 class.error
## 0 228452 1048 0.004566449
## 1   2354 5294 0.307792887
##                 Test set error rate: 1.47%
## Confusion matrix:
##       0    1 class.error
## 0 76149  351 0.004588235
## 1   813 1737 0.318823529

Variable Importance from RF
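
The plot itself is not reproduced here; it would come straight from the fitted forest:

varImpPlot(rf)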

Using a Simple Decision Tree

Let’s shift the class priors a bit (70% / 30%), just to make sure something gets classified as 1.
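
A sketch of the fit, assuming rpart with 70/30 class priors; none of the splits in the output use total_pages_visited, so that variable was presumably left out of this tree:

library(rpart)
tree <- rpart(converted ~ new_user + country + age + source,
              data = train,
              parms = list(prior = c(0.7, 0.3)))
tree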

## n= 237148 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 237148 71144.400 0 (0.70000000 0.30000000)  
##    2) new_user>=0.5 162501 21320.990 0 (0.84460429 0.15539571) *
##    3) new_user< 0.5 74647 49823.410 0 (0.50148415 0.49851585)  
##      6) country=China 17303   446.513 0 (0.96546029 0.03453971) *
##      7) country=Germany,UK,US 57344 37639.060 1 (0.43255353 0.56744647)  
##       14) age>=29.5 28864 14893.070 0 (0.56972789 0.43027211) *
##       15) age< 29.5 28480 17918.990 1 (0.34194703 0.65805297) *

Recommendations

Recommendations are usually two-fold: