We have data about the user base of a website: whether each user converted, and characteristics such as country, marketing channel, age, whether they are a repeat user, and the number of pages visited during the session (a proxy for time spent on the site).
The goal of the project is to come up with recommendations for the marketing team to improve the conversion rate.
The conversion rate is #conversions / total sessions.
df <- read.csv("./data/conversion_data.csv")
head(df)
##   country age new_user source total_pages_visited converted
## 1      UK  25        1    Ads                   1         0
## 2      US  23        1    Seo                   5         0
## 3      US  28        1    Seo                   4         0
## 4   China  39        1    Seo                   5         0
## 5      US  30        1    Seo                   6         0
## 6      US  31        0    Seo                   1         0
##     country            age            new_user         source
##  China  : 76602   Min.   : 17.00   Min.   :0.0000   Ads   : 88740
##  Germany: 13056   1st Qu.: 24.00   1st Qu.:0.0000   Direct: 72420
##  UK     : 48450   Median : 30.00   Median :1.0000   Seo   :155040
##  US     :178092   Mean   : 30.57   Mean   :0.6855
##                   3rd Qu.: 36.00   3rd Qu.:1.0000
##                   Max.   :123.00   Max.   :1.0000
##  total_pages_visited   converted
##  Min.   : 1.000      Min.   :0.00000
##  1st Qu.: 2.000      1st Qu.:0.00000
##  Median : 4.000      Median :0.00000
##  Mean   : 4.873      Mean   :0.03226
##  3rd Qu.: 7.000      3rd Qu.:0.00000
##  Max.   :29.000      Max.   :1.00000
## [1] 123 111 79 77 73 72 70 69 68 67 66 65 64 63 62 61 60
## [18] 59 58 57 56 55 54 53 52 51 50 49 48 47 46 45 44 43
## [35] 42 41 40 39 38 37 36 35 34 33 32 31 30 29 28 27 26
## [52] 25 24 23 22 21 20 19 18 17
## [1] 316200
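The ages 123 and 111 at the top of the sorted list are clearly implausible. The tables further down total 316,198 sessions, two fewer than the 316,200 rows reported here, which is consistent with those two sessions having been dropped. A minimal sketch of such a filter follows; the toy data frame and the cutoff of 100 are assumptions for illustration, not the original code.

```r
# Toy data standing in for the real sessions (values are illustrative only)
toy <- data.frame(
  age       = c(25, 23, 123, 39, 111, 31),
  converted = c(0, 0, 1, 0, 1, 0)
)

# Inspect the largest ages first, as in the output above
sort(unique(toy$age), decreasing = TRUE)

# Keep only plausible ages; the cutoff of 100 is an assumed threshold
max_age <- 100
clean <- toy[toy$age < max_age, ]

# Number of rows dropped by the filter
dropped <- nrow(toy) - nrow(clean)
dropped
```

With only two affected rows out of more than 300,000, dropping them is simpler than imputing an age and should not change any of the conclusions.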
## # A tibble: 2 x 4
##   new_user      n convertions conversion_rate
##      <int>  <int>       <int>           <dbl>
## 1        0  99454        7159             7.2
## 2        1 216744        3039             1.4
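A table like the one above can be built by counting sessions and conversions per group and dividing. The sketch below uses base R only; `rate_by` is a hypothetical helper and the toy sessions are made up. The last lines recompute the rates from the counts reported in the table.

```r
# Hypothetical helper: conversion rate (%) per level of a grouping column
rate_by <- function(data, group) {
  n    <- tapply(data$converted, data[[group]], length)
  conv <- tapply(data$converted, data[[group]], sum)
  data.frame(
    group           = names(n),
    n               = as.vector(n),
    conversions     = as.vector(conv),
    conversion_rate = round(100 * as.vector(conv) / as.vector(n), 1)
  )
}

# Toy sessions, illustrative only
toy <- data.frame(
  new_user  = c(0, 0, 0, 0, 1, 1, 1, 1, 1, 1),
  converted = c(1, 0, 1, 0, 0, 0, 0, 1, 0, 0)
)
res <- rate_by(toy, "new_user")
res

# Recompute the reported rates from the counts in the table above
reported <- round(100 * c(7159, 3039) / c(99454, 216744), 1)
reported
```

Returning users convert at roughly five times the rate of new users, even though they account for less than a third of the sessions.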
## # A tibble: 2 x 3
##   converted       avg      std
##      <fctr>     <dbl>    <dbl>
## 1         0  4.550281 2.789910
## 2         1 14.553932 3.963522
## # A tibble: 3 x 4
##   source total_sessions total_converted convergence_rate
##   <fctr>          <int>           <int>            <dbl>
## 1 Direct          72420            2040       0.02816901
## 2    Seo         155039            5099       0.03288850
## 3    Ads          88739            3059       0.03447188
## [1] "Baseline?"
##
## 0 1
## 0.96774806 0.03225194
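The baseline matters because the classes are very unbalanced: a model that always predicts "not converted" is already right about 96.8% of the time, so accuracy alone says little. The sketch below rebuilds the class shares from the counts reported earlier (10,198 conversions out of 316,198 sessions); the vector `converted` is reconstructed for illustration and is not the original data.

```r
# Rebuild the outcome vector from the reported counts:
# 316198 sessions in total, 10198 of them converted
converted <- rep(c(0, 1), times = c(306000, 10198))

# Share of each class
shares <- prop.table(table(converted))
shares

# Accuracy of a trivial model that always predicts the majority class
baseline_accuracy <- as.numeric(max(shares))
baseline_accuracy
```

Any model below has to be judged against this number, and against how many of the rare converted sessions it actually catches.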
##
## Call:
## glm(formula = converted ~ age + source + country + total_pages_visited +
## new_user, family = binomial, data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.1379 -0.0632 -0.0242 -0.0097 4.4345
##
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)
## (Intercept)         -10.379518   0.175136 -59.266  < 2e-16 ***
## age                  -0.072921   0.002736 -26.649  < 2e-16 ***
## sourceDirect         -0.210490   0.056355  -3.735 0.000188 ***
## sourceSeo            -0.025429   0.045923  -0.554 0.579768
## countryGermany        3.867213   0.154768  24.987  < 2e-16 ***
## countryUK             3.633325   0.141088  25.752  < 2e-16 ***
## countryUS             3.277430   0.137017  23.920  < 2e-16 ***
## total_pages_visited   0.756791   0.007139 106.001  < 2e-16 ***
## new_user             -1.741296   0.041110 -42.357  < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 67577 on 237147 degrees of freedom
## Residual deviance: 19282 on 237139 degrees of freedom
## AIC: 19300
##
## Number of Fisher Scoring iterations: 10
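The coefficients above are on the log-odds scale, so they are easier to read after exponentiating them into odds ratios. The sketch below copies the estimates from the summary by hand; with the fitted model object at hand, `exp(coef(model))` would give the same thing, where `model` is an assumed name for the `glm` fit. Ads and China have no coefficient of their own, so they appear to be the reference levels for `source` and `country`.

```r
# Estimates copied from the summary above (intercept left out)
coefs <- c(
  age                 = -0.072921,
  sourceDirect        = -0.210490,
  sourceSeo           = -0.025429,
  countryGermany      =  3.867213,
  countryUK           =  3.633325,
  countryUS           =  3.277430,
  total_pages_visited =  0.756791,
  new_user            = -1.741296
)

# Odds ratio = exp(coefficient): the multiplicative change in the odds of
# converting for a one-unit increase, holding the other variables fixed
odds_ratios <- exp(coefs)
round(odds_ratios, 3)
```

Read this way, each additional page visited roughly doubles the odds of converting, while being a new user cuts them to well under a fifth.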
## 0 1
## 0.01120418 0.66378676
## [1] "Confusion Matrix"
##    predictions
##          0      1
##   0 228178   1322
##   1   2014   5634
## [1] "Average error"
## [1] 0.9859328
## [1] "Sensitivity"
## [1] 0.7366632
## [1] "Specificity"
## [1] 0.9942397
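The three numbers above follow directly from the confusion matrix. The sketch below recomputes them, assuming the rows are the actual classes and the columns the predictions, which is the reading that reproduces the reported sensitivity and specificity. Note that the value printed under "Average error" is the share of correct predictions, that is, the accuracy; the error rate is one minus it.

```r
# Confusion matrix from the training set: rows = actual, columns = predicted
cm <- matrix(
  c(228178, 1322,
      2014, 5634),
  nrow = 2, byrow = TRUE,
  dimnames = list(actual = c("0", "1"), predicted = c("0", "1"))
)

# Share of all sessions classified correctly
accuracy <- sum(diag(cm)) / sum(cm)

# Share of converted sessions that the model catches
sensitivity <- cm["1", "1"] / sum(cm["1", ])

# Share of non-converted sessions correctly left alone
specificity <- cm["0", "0"] / sum(cm["0", ])

round(c(accuracy = accuracy, error = 1 - accuracy,
        sensitivity = sensitivity, specificity = specificity), 7)
```

Sensitivity is the figure to watch here: the model misses about a quarter of the converted sessions, while its accuracy sits only about two points above the 96.8% baseline.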
## 0 1
## 0.01125351 0.66031558
## [1] "average error"
## [1] 0.9855914
## [1] "Sensitivity"
## [1] 0.7266667
## [1] "Specificity"
## [1] 0.9942222
##
## Call:
## randomForest(x = train[, -ncol(train)], y = train$converted, xtest = test[, -ncol(test)], ytest = test$converted, ntree = 100, mtry = 3, keep.forest = TRUE)
## Type of random forest: classification
## Number of trees: 100
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 1.43%
## Confusion matrix:
##        0    1 class.error
## 0 228452 1048 0.004566449
## 1   2354 5294 0.307792887
## Test set error rate: 1.47%
## Confusion matrix:
##       0    1 class.error
## 0 76149  351 0.004588235
## 1   813 1737 0.318823529
## n= 237148
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 237148 71144.400 0 (0.70000000 0.30000000)
##   2) new_user>=0.5 162501 21320.990 0 (0.84460429 0.15539571) *
##   3) new_user< 0.5 74647 49823.410 0 (0.50148415 0.49851585)
##     6) country=China 17303 446.513 0 (0.96546029 0.03453971) *
##     7) country=Germany,UK,US 57344 37639.060 1 (0.43255353 0.56744647)
##       14) age>=29.5 28864 14893.070 0 (0.56972789 0.43027211) *
##       15) age< 29.5 28480 17918.990 1 (0.34194703 0.65805297) *
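The root node shows class probabilities of (0.7, 0.3) rather than the observed (0.968, 0.032), which suggests the tree was fit with reweighted class priors so that the rare converted class can influence the splits. A minimal sketch of that idea with `rpart` follows. The synthetic data, the formula and the depth limit are assumptions for illustration; only the 0.7/0.3 prior and the three splitting variables come from the printed tree.

```r
library(rpart)

# Synthetic sessions, illustrative only: conversion is made less likely
# for new users, for China and for older users
set.seed(42)
n <- 2000
toy <- data.frame(
  country  = factor(sample(c("China", "Germany", "UK", "US"), n, replace = TRUE)),
  age      = sample(17:60, n, replace = TRUE),
  new_user = rbinom(n, 1, 0.7)
)
p <- plogis(-2 - 1.5 * toy$new_user - 3 * (toy$country == "China") -
              0.05 * (toy$age - 30))
toy$converted <- factor(rbinom(n, 1, p), levels = c(0, 1))

# Classification tree with class priors set to 0.7 / 0.3, which
# up-weights the rare converted class when choosing splits
fit <- rpart(
  converted ~ country + age + new_user,
  data    = toy,
  method  = "class",
  parms   = list(prior = c(0.7, 0.3)),
  control = rpart.control(maxdepth = 3)
)
print(fit)
```

A shallow tree like this is not meant to compete with the other models on accuracy; its value is that the first few splits show which user segments differ most in conversion.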
The site is doing very well with young users. Prioritize marketing channels that are better at reaching a young audience.
The site is doing well with Germans in terms of conversion rate, although far fewer Germans visit the site than UK users. This is a big opportunity to market in Germany, since the quality of these acquisitions is very good.
Users with old accounts are doing much better than new users. Targeted emails with offers to bring them back to the site could be a good idea.
There is probably something wrong with the Chinese site, given the high number of visits and such a low conversion rate. It could be a problem with the translation, a UI that doesn't fit the local culture, payment issues, or maybe the site is just not translated at all. There is a huge opportunity to grow here.
The conversion rate of older users (over 30) drops significantly. It may be worth checking the UI to figure out the problem.
If a user has visited many pages, there is a high probability that she intends to purchase. Reminding users who browsed many pages but did not convert to complete their purchase could therefore be a good strategy, and an easy way to improve conversion.
Recommendations are usually two-fold: