Challenge Description

We have data about a website's user base: whether each user converted or not, along with characteristics such as country, marketing channel, age, whether they are repeat users, and the number of pages visited during the session (as a proxy for time spent on the site).

The project is to predict conversion and to use what the model learns to recommend ways to improve the conversion rate.

df <- read.csv("./data/conversion_data.csv")
head(df)
##   country age new_user source total_pages_visited converted
## 1      UK  25        1    Ads                   1         0
## 2      US  23        1    Seo                   5         0
## 3      US  28        1    Seo                   4         0
## 4   China  39        1    Seo                   5         0
## 5      US  30        1    Seo                   6         0
## 6      US  31        0    Seo                   1         0
##     country            age            new_user         source      
##  China  : 76602   Min.   : 17.00   Min.   :0.0000   Ads   : 88740  
##  Germany: 13056   1st Qu.: 24.00   1st Qu.:0.0000   Direct: 72420  
##  UK     : 48450   Median : 30.00   Median :1.0000   Seo   :155040  
##  US     :178092   Mean   : 30.57   Mean   :0.6855                  
##                   3rd Qu.: 36.00   3rd Qu.:1.0000                  
##                   Max.   :123.00   Max.   :1.0000                  
##  total_pages_visited   converted      
##  Min.   : 1.000      Min.   :0.00000  
##  1st Qu.: 2.000      1st Qu.:0.00000  
##  Median : 4.000      Median :0.00000  
##  Mean   : 4.873      Mean   :0.03226  
##  3rd Qu.: 7.000      3rd Qu.:0.00000  
##  Max.   :29.000      Max.   :1.00000
##  [1] 123 111  79  77  73  72  70  69  68  67  66  65  64  63  62  61  60
## [18]  59  58  57  56  55  54  53  52  51  50  49  48  47  46  45  44  43
## [35]  42  41  40  39  38  37  36  35  34  33  32  31  30  29  28  27  26
## [52]  25  24  23  22  21  20  19  18  17
## [1] 316200
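
The chunks that produced the summary, the sorted unique ages, and the row count are not shown; a minimal sketch that would reproduce those outputs:

summary(df)
sort(unique(df$age), decreasing = TRUE)
nrow(df)

Note the two implausible ages (123 and 111) at the top of the sorted list. The per-source session counts later in the analysis sum to 316,198 rather than 316,200, so those two rows appear to have been dropped before modeling.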

Distribution by Country
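
The chart itself is not reproduced here. A sketch of the kind of plot this heading implies, assuming ggplot2:

library(ggplot2)
ggplot(df, aes(x = country)) + geom_bar()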

Distribution by Age
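
Again assuming ggplot2, a histogram along these lines:

ggplot(df, aes(x = age)) + geom_histogram(binwidth = 1)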

Distribution by New User
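
A dplyr aggregation along these lines would produce the table below (conversion_rate is in percent):

library(dplyr)
df %>%
  group_by(new_user) %>%
  summarise(n = n(),
            conversions = sum(converted),
            conversion_rate = round(100 * conversions / n, 1))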

## # A tibble: 2 x 4
##   new_user      n conversions conversion_rate
##      <int>  <int>       <int>           <dbl>
## 1        0  99454        7159             7.2
## 2        1 216744        3039             1.4

Distribution of Total Pages Visited (Converted vs. Not Converted)
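
The table below compares the mean and standard deviation of pages visited per session across the two classes; a sketch of the grouping code, assuming converted has been turned into a factor by this point:

df %>%
  group_by(converted) %>%
  summarise(avg = mean(total_pages_visited),
            std = sd(total_pages_visited))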

## # A tibble: 2 x 3
##   converted       avg      std
##      <fctr>     <dbl>    <dbl>
## 1         0  4.550281 2.789910
## 2         1 14.553932 3.963522

Conversion Rate by Marketing Channel (Source)
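
A sketch of the per-channel aggregation (sum(converted == 1) works whether converted is numeric or a factor):

df %>%
  group_by(source) %>%
  summarise(total_sessions = n(),
            total_converted = sum(converted == 1),
            conversion_rate = total_converted / total_sessions) %>%
  arrange(conversion_rate)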

## # A tibble: 3 x 4
##   source total_sessions total_converted  conversion_rate
##   <fctr>          <int>           <int>            <dbl>
## 1 Direct          72420            2040       0.02816901
## 2    Seo         155039            5099       0.03288850
## 3    Ads          88739            3059       0.03447188

Predicting Conversion: Training
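
The split and modeling code are not shown. The degrees of freedom in the glm output imply 237,148 training rows out of 316,198, i.e. roughly a 75/25 split; a hypothetical reconstruction, with the seed, the split ratio, and the 0.5 decision threshold all assumptions:

set.seed(1)                                 # hypothetical seed
idx <- sample(nrow(df), size = round(0.75 * nrow(df)))
train <- df[idx, ]
test  <- df[-idx, ]

print("Baseline?")
prop.table(table(train$converted))          # majority-class share to beat

fit <- glm(converted ~ age + source + country + total_pages_visited + new_user,
           family = binomial, data = train)
summary(fit)

probs <- predict(fit, type = "response")
tapply(probs, train$converted, mean)        # mean predicted probability per actual class

print("Confusion Matrix")
predictions <- ifelse(probs > 0.5, 1, 0)    # assumed 0.5 threshold
cm <- table(train$converted, predictions)
cm

print("Accuracy")
mean(predictions == train$converted)
print("Sensitivity")
cm[2, 2] / sum(cm[2, ])                     # TP / (TP + FN)
print("Specificity")
cm[1, 1] / sum(cm[1, ])                     # TN / (TN + FP)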

## [1] "Baseline?"
## 
##          0          1 
## 0.96774806 0.03225194
## 
## Call:
## glm(formula = converted ~ age + source + country + total_pages_visited + 
##     new_user, family = binomial, data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.1379  -0.0632  -0.0242  -0.0097   4.4345  
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         -10.379518   0.175136 -59.266  < 2e-16 ***
## age                  -0.072921   0.002736 -26.649  < 2e-16 ***
## sourceDirect         -0.210490   0.056355  -3.735 0.000188 ***
## sourceSeo            -0.025429   0.045923  -0.554 0.579768    
## countryGermany        3.867213   0.154768  24.987  < 2e-16 ***
## countryUK             3.633325   0.141088  25.752  < 2e-16 ***
## countryUS             3.277430   0.137017  23.920  < 2e-16 ***
## total_pages_visited   0.756791   0.007139 106.001  < 2e-16 ***
## new_user             -1.741296   0.041110 -42.357  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 67577  on 237147  degrees of freedom
## Residual deviance: 19282  on 237139  degrees of freedom
## AIC: 19300
## 
## Number of Fisher Scoring iterations: 10
##          0          1 
## 0.01120418 0.66378676
## [1] "Confusion Matrix"
##    predictions
##          0      1
##   0 228178   1322
##   1   2014   5634
## [1] "Average error"
## [1] 0.9859328
## [1] "Sensitivity"
## [1] 0.7366632
## [1] "Specificity"
## [1] 0.9942397

Predicting on the Test Set
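
Applying the same fitted model to the held-out set follows the same pattern:

probs_test <- predict(fit, newdata = test, type = "response")
tapply(probs_test, test$converted, mean)

pred_test <- ifelse(probs_test > 0.5, 1, 0)
cm_test <- table(test$converted, pred_test)
print("Accuracy")
mean(pred_test == test$converted)
print("Sensitivity")
cm_test[2, 2] / sum(cm_test[2, ])
print("Specificity")
cm_test[1, 1] / sum(cm_test[1, ])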

##          0          1 
## 0.01125351 0.66031558
## [1] "average error"
## [1] 0.9855914
## [1] "Sensitivity"
## [1] 0.7266667
## [1] "Specificity"
## [1] 0.9942222

Let’s try a random forest now.
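
The call appears in the output below. For classification mode, randomForest needs the response as a factor, so converted is assumed to have been recoded first:

library(randomForest)
train$converted <- as.factor(train$converted)
test$converted  <- as.factor(test$converted)
rf <- randomForest(x = train[, -ncol(train)], y = train$converted,
                   xtest = test[, -ncol(test)], ytest = test$converted,
                   ntree = 100, mtry = 3, keep.forest = TRUE)
rf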

## 
## Call:
##  randomForest(x = train[, -ncol(train)], y = train$converted,      xtest = test[, -ncol(test)], ytest = test$converted, ntree = 100,      mtry = 3, keep.forest = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 100
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 1.43%
## Confusion matrix:
##        0    1 class.error
## 0 228452 1048 0.004566449
## 1   2354 5294 0.307792887
##                 Test set error rate: 1.47%
## Confusion matrix:
##       0    1 class.error
## 0 76149  351 0.004588235
## 1   813 1737 0.318823529

Variable Importance from RF
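
The plot itself is not reproduced here; it would come straight from the fitted forest:

varImpPlot(rf)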

Using a Simple Decision Tree

Let’s shift the class priors a bit (70% / 30%), just to make sure something gets classified as 1.
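
A sketch of the fit, assuming rpart with 70/30 class priors; none of the splits in the output use total_pages_visited, so that variable was presumably left out of this tree:

library(rpart)
tree <- rpart(converted ~ new_user + country + age + source,
              data = train,
              parms = list(prior = c(0.7, 0.3)))
tree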

## n= 237148 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 237148 71144.400 0 (0.70000000 0.30000000)  
##    2) new_user>=0.5 162501 21320.990 0 (0.84460429 0.15539571) *
##    3) new_user< 0.5 74647 49823.410 0 (0.50148415 0.49851585)  
##      6) country=China 17303   446.513 0 (0.96546029 0.03453971) *
##      7) country=Germany,UK,US 57344 37639.060 1 (0.43255353 0.56744647)  
##       14) age>=29.5 28864 14893.070 0 (0.56972789 0.43027211) *
##       15) age< 29.5 28480 17918.990 1 (0.34194703 0.65805297) *

Recommendations

Recommendations are usually two-fold: