1 Business question

The task is to give the Sales and Marketing teams insights into user conversion, based on each user's country, age, new/old status, traffic source (ads/direct/SEO) and total number of pages visited.

2 Loading libraries and data

rm(list=ls())
library(readr) # Efficient reading of CSV data.
library(dplyr) # Data wrangling
library(magrittr) # Pipes %>% , %<>% 
library(ggplot2) # Visualise data
library(caret) # Build models
library(pROC) # Calculate AUC
library(lubridate) # Dates and time.
library(gridExtra) # Combine plots
library(forcats) # Factor reverse fct_rev()
library(rattle) # tree plot
library(tidyr) # Gather/ spread data

setwd("E:/DS Project/data science take home test/TakeHomeDataChallenges-master/01.ConversionRate_data")
ds <- read_csv("conversion_data.csv")

3 Data processing

3.1 Data structure

Now, let’s take a first look at the dataset. There are 316,200 observations and 6 variables.

glimpse(ds)

## Observations: 316,200
## Variables: 6
## $ country             <chr> "UK", "US", "US", "China", "US", "US", "Ch...
## $ age                 <int> 25, 23, 28, 39, 30, 31, 27, 23, 29, 25, 38...
## $ new_user            <int> 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, ...
## $ source              <chr> "Ads", "Seo", "Seo", "Seo", "Seo", "Seo", ...
## $ total_pages_visited <int> 1, 5, 4, 5, 6, 1, 4, 4, 4, 2, 1, 8, 6, 7, ...
## $ converted           <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...

3.2 Identify factors

A few variables (country, new_user, source, converted) are categorical but were read as character or integer, so I convert them into factors.

# Convert few independent variables into factors
ds$country  %<>% as.factor()
ds$source   %<>%  as.factor()
ds$new_user <- factor(ds$new_user,levels =c("1","0"),labels = c("New","Old"))

# Convert target variable into factor. 
ds$converted <- factor(ds$converted, levels =c("0","1"),labels = c("No","Yes"))
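
As a small illustration of the recoding above, the sketch below applies the same kind of factor() call to a toy vector. The vector flag and its values are made up for the example and are not taken from the dataset. factor() compares the values with levels as strings, so integer 0/1 codes can be relabelled directly.

```r
# Toy 0/1 vector standing in for a column such as new_user (values are made up)
flag <- c(1L, 0L, 1L, 1L, 0L)

# levels fixes the order of the categories, labels renames them
flag_f <- factor(flag, levels = c("1", "0"), labels = c("New", "Old"))

levels(flag_f)   # category order: "New" first, then "Old"
table(flag_f)    # counts per category
```

Putting "1" first in levels is what makes "New" the first level; the order matters later because model coefficients are reported relative to the first level.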

3.3 Missing values

Fortunately, the dataset has no missing values.

sapply(ds,function(x) sum(is.na(x)))

##             country                 age            new_user 
##                   0                   0                   0 
##              source total_pages_visited           converted 
##                   0                   0                   0

Now, I will summarise the dataset to get the feel for the shape of the data.

summary(ds)

##     country            age         new_user        source      
##  China  : 76602   Min.   : 17.00   New:216744   Ads   : 88740  
##  Germany: 13056   1st Qu.: 24.00   Old: 99456   Direct: 72420  
##  UK     : 48450   Median : 30.00                Seo   :155040  
##  US     :178092   Mean   : 30.57                               
##                   3rd Qu.: 36.00                               
##                   Max.   :123.00                               
##  total_pages_visited converted   
##  Min.   : 1.000      No :306000  
##  1st Qu.: 2.000      Yes: 10200  
##  Median : 4.000                  
##  Mean   : 4.873                  
##  3rd Qu.: 7.000                  
##  Max.   :29.000

3.4 Outliers

length(which(ds$age>100))

## [1] 2

ds %<>% filter(age < 100)

The summary points to outliers in the age variable, because an age of 123 is not plausible. In fact, only 2 users are recorded as older than 100. That is negligible next to the 316,200 observations in the dataset, so I remove them, which leaves 316,198 observations.

3.5 Imbalance dataset

round(100*table(ds$converted)/nrow(ds),2)

## 
##    No   Yes 
## 96.77  3.23

The number of conversions is low compared with the total number of users. With only 3.23% of users converting, the dataset is imbalanced, which suggests resampling it to balance the classes.
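
The resampling idea can be sketched as follows. This is a minimal base R illustration of down-sampling on made-up data (the data frame toy and its 970/30 split are assumptions for the example); it is not the procedure used in this report, which models the original data.

```r
set.seed(1)  # reproducible sampling

# Made-up imbalanced data: 970 "No" rows and 30 "Yes" rows
toy <- data.frame(
  id        = 1:1000,
  converted = factor(rep(c("No", "Yes"), times = c(970, 30)))
)

# Down-sampling: keep every minority row and an equally sized random
# sample of majority rows
yes_rows <- which(toy$converted == "Yes")
no_rows  <- which(toy$converted == "No")
keep     <- c(yes_rows, sample(no_rows, length(yes_rows)))
balanced <- toy[keep, ]

table(balanced$converted)  # both classes now have 30 rows
```

If resampling is used, it should be applied to the training data only, so that the test set keeps the real class proportions. caret can also do this inside cross validation through the sampling argument of trainControl().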

4 Data visualization

4.1 The country distribution and its relationship with conversion

g1 <- ds %>% 
   group_by(country, converted) %>% 
   summarize(count = n()) %>% 
   ggplot(aes(country, count, fill = converted)) +
   geom_bar(stat = "identity", position = "dodge") +
   labs(title = "Country distribution and its relationship with conversion",
        subtitle = "Unit: Count",
        x = NULL,
        y = NULL,
        fill = "Converted") +
   theme(legend.position = "bottom",
         legend.direction = "horizontal",
         plot.title = element_text(size = 10, face = "bold")) 

g2 <- ds %>% 
   group_by(country, converted) %>% 
   summarize(count = n()) %>% 
   ggplot(aes(country, count, fill = converted)) +
   geom_bar(stat = "identity", position = "fill") +
   labs(title = "Country distribution and its relationship with conversion",
        subtitle = "Unit: %",
        x = NULL,
        y = NULL,
        fill = "Converted") +
   theme(legend.position = "bottom",
         legend.direction = "horizontal",
         plot.title = element_text(size = 10, face = "bold"))
grid.arrange(g1,g2,ncol=2)

From above charts, it is clear that:

The majority of users come from the United States, which also has the highest number of conversions. However, the second chart shows that its conversion rate is not correspondingly high.
There are many users from China, but their conversion rate is low.
Germany has the lowest number of users, but its conversion rate is high.
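
The rates that the second chart displays can also be computed as numbers. The sketch below does this on a few made-up rows (the data frame toy is hypothetical); with the real data the same call would take ds$country and ds$converted.

```r
# Made-up rows; the real analysis would use ds$country and ds$converted
toy <- data.frame(
  country   = c("US", "US", "US", "US", "China", "China", "Germany", "Germany"),
  converted = c("Yes", "No", "No", "No", "No", "No", "Yes", "No")
)

# Row-wise proportions: share of "No" and "Yes" within each country
rates <- prop.table(table(toy$country, toy$converted), margin = 1)
round(100 * rates, 1)
```

margin = 1 normalises within each row (country), which is the tabular counterpart of position = "fill" in the bar chart.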

4.2 The age distribution and its relationship with conversion

Now, let’s take a look at the age distribution after cleaning and its relationship with conversion.

ds %>% 
   ggplot(aes(y=age,converted,fill = converted)) +
   geom_boxplot() +
   labs(title = "Age distribution and its relation with conversion",
        x = NULL,
        y = "Age",
        fill = NULL) +
  theme(legend.position = "none")

After cleaning, the age variable ranges from 17 to nearly 80, which makes sense. The graph suggests that younger users are more inclined to buy and that users aged 60 and over hardly buy at all.

4.3 The old/new users distribution and its relationship with conversion

 ds %>% 
   group_by(new_user, converted) %>% 
   summarize(count = n()) %>% 
   ggplot(aes(new_user, count, fill = converted)) +
   geom_bar(stat = "identity", position = "dodge") +
   labs(title = "New/old users distribution and its relationship with conversion",
        x = NULL,
        y = NULL,
        fill = "Converted") +
   theme(legend.position = "bottom",
         legend.direction = "horizontal")

The chart shows that most users are new users and that they have a lower conversion rate than old users.

4.4 The source distribution and its relationship with conversion

g3 <- ds %>% 
   group_by(source, converted) %>% 
   summarize(count = n()) %>% 
   ggplot(aes(source, count, fill = fct_rev(converted))) +
   geom_bar(stat = "identity", position = "dodge") +
   labs(title = "Source distribution and its relationship with conversion",
        subtitle = "Unit: Count",
        x = NULL,
        y = NULL,
        fill = "Converted") +
   theme(legend.position = "bottom",
         legend.direction = "horizontal",
         plot.title = element_text(size = 10, face = "bold")) 

g4 <- ds %>% 
   group_by(source, converted) %>% 
   summarize(count = n()) %>% 
   ggplot(aes(source, count, fill = fct_rev(converted))) +
   geom_bar(stat = "identity", position = "fill") +
   labs(title = "Source distribution and its relationship with conversion",
        subtitle = "Unit: %",
        x = NULL,
        y = NULL,
        fill = "Converted") +
   theme(legend.position = "bottom",
         legend.direction = "horizontal",
         plot.title = element_text(size = 10, face = "bold"))
grid.arrange(g3,g4,ncol=2)

The source distribution graphs show that:

A high proportion of users come from SEO.
Together with direct traffic, this means the company attracts a lot of users through unpaid sources, which is a good sign.
Ads also bring many users to the website. Looking at the percentage of converted users within each source, users who arrive through ads are more likely to buy than users from unpaid sources.

4.5 The total pages visited distribution and its relationship with conversion

ds %>% 
   ggplot(aes(y=total_pages_visited,converted,fill = converted)) +
   geom_boxplot() +
   labs(title = "Total pages visited distribution and its relation with conversion",
        x = NULL,
        y = "Total number of pages visited",
        fill = NULL) +
  theme(legend.position = "none")

The chart shows that the more pages a user visits, the more likely the user is to buy. Users who visit more than 10 pages tend to convert, while users who visit around 7 pages or fewer generally do not.

5 Modelling

5.1 KPIs:

In model building, I will first ignore the class imbalance noted in section 3.5 and use logistic regression and a decision tree to predict whether a user converts, since both are easy to interpret. The results will then be compared on:

Accuracy: how often the model labels users correctly ((TP+TN)/(TN+TP+FN+FP))
Sensitivity: the proportion of truly converted users that the model labels as converted (TP/(TP+FN))
Specificity: the proportion of truly non-converted users that the model labels as not converted (TN/(TN+FP))
Precision: the proportion of predicted conversions that are true conversions (TP/(TP+FP))
AUC: area under the ROC curve (ROC: receiver operating characteristic, which plots 1 - Specificity on the x-axis against Sensitivity on the y-axis).
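
The four count-based KPIs above can be written as a small helper. This is a sketch for illustration only: the function name kpis and the counts passed to it are made up, and the report itself takes these values from caret's confusionMatrix().

```r
# Compute the KPIs from the four cells of a confusion matrix
kpis <- function(tp, fp, fn, tn) {
  c(
    accuracy    = (tp + tn) / (tp + tn + fp + fn),
    sensitivity = tp / (tp + fn),
    specificity = tn / (tn + fp),
    precision   = tp / (tp + fp)
  )
}

# Hypothetical counts, chosen only to make the arithmetic easy to follow
round(kpis(tp = 70, fp = 10, fn = 30, tn = 890), 3)
```

With these counts the accuracy is 0.96 even though 30 of the 100 real conversions are missed, which is why accuracy alone is not enough on an imbalanced dataset.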

The baseline accuracy is 96.77% when we simply predict all users as not-converted users.

round(100*table(ds$converted)/nrow(ds),2)

## 
##    No   Yes 
## 96.77  3.23

5.2 Splitting data and setting parameters of k-fold cross validation

The dataset is separated into 2 parts: a training set with 80% of the observations and a testing set with the remaining 20%. The models are built with 5-fold cross validation, repeated 5 times.

set.seed(123)
indexTrain <- createDataPartition(y=ds$converted,p=0.8,list=FALSE)
training <- ds[indexTrain,]
testing <- ds [-indexTrain,]

# k-folds cross validation: 5 folds, 5 repeats
ctrl <- trainControl(method= "repeatedcv", number = 5, repeats = 5, classProbs = TRUE)

5.3 Logit model

set.seed(234)
logit_mod <- train(converted~., data=training, method = "glm", family = "binomial",
               trControl = ctrl)
pred_logit <- predict(logit_mod,newdata = testing)
cm_logit <- confusionMatrix(data=pred_logit,testing$converted,positive="Yes")
auc_logit <- auc(testing$converted,predict(logit_mod,newdata = testing,type = "prob")[,2])[1]

5.4 Decision tree

set.seed(234)
tree_mod <- train(converted ~ ., data=training, method = "rpart2", 
                  trControl = ctrl)

## Loading required package: rpart

pred_tree <- predict(tree_mod,newdata=testing)
# Predictions go first (data) and the true labels second (reference), as for cm_logit
cm_tree <- confusionMatrix(data=pred_tree,testing$converted,positive = "Yes")
auc_tree <- auc(testing$converted,predict(tree_mod,newdata = testing,type = "prob")[,2])[1]

5.5 Results comparison:

Now, let’s compare results of all the models.

models_list <- list(logit = logit_mod,
                    tree = tree_mod)

comparision <- data.frame(model = names(models_list ),
                         Accuracy = rep(NA, length(models_list )),
                         Sensitivity = rep(NA, length(models_list)),
                         Specificity = rep(NA, length(models_list)),
                         Precision = rep(NA, length(models_list)),
                         AUC = rep(NA, length(models_list)))


for (name in names(models_list)) {
  model <- get(paste0("cm_", name))
  comparision[comparision$model == name,2] <- model$overall["Accuracy"]
  comparision[comparision$model == name,3] <- model$byClass["Sensitivity"]
  comparision[comparision$model == name,4] <- model$byClass["Specificity"]
  comparision[comparision$model == name,5] <- model$byClass["Pos Pred Value"]
  comparision[comparision$model == name,6] <- get(paste0("auc_", name))
}

comparision %>% 
    gather(Type,Value,Accuracy : AUC) %>% 
    ggplot(aes(Type,Value,col=model)) +
    geom_jitter(width = 0.2, alpha = 0.7, size = 3)+
    labs (title = "Results comparison",
          x = NULL,
          y= NULL)+
    scale_x_discrete(limits=c("Accuracy", "Sensitivity", "Specificity","Precision", "AUC"))+
    scale_color_discrete(breaks=c("logit", "tree"), labels=c("Logit", "Tree"))+
    scale_y_continuous(limits = c(0,1))+
    geom_hline(yintercept = 0.96774576, col = "red",size=0.5 )

Note that the red line is the baseline accuracy, obtained by simply classifying all users as not converted.

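Accuracy: Both models have high accuracy (~98.5%), higher than the baseline accuracy. The logit model's accuracy is slightly higher.
Sensitivity: The logistic model is reported as worse at identifying converted users (nearly 70%, against 85% for the tree model).
Specificity: Both models, fitted on the original data, identify non-converted users almost perfectly.
Precision: The logit model performs better with 85%, meaning that 85% of the cases the model predicts as conversions are true conversions.
AUC: The logistic model has the higher AUC (0.986) and the tree model the lower one (0.84).

One caveat on the tree figures: caret's confusionMatrix() expects the predictions as its first argument (data) and the true labels as its second (reference). If the two are swapped, the value reported as Sensitivity is really the precision and the value reported as Pos Pred Value is really the sensitivity. The tree's sensitivity and precision are therefore comparable with the logit model's only when cm_tree is built in the same argument order as cm_logit.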

In short, higher sensitivity comes at a price: precision and specificity decrease.

Overall, the logistic model is the best, since it performs best in terms of accuracy, AUC and specificity.
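
Because the comparison relies on sensitivity and precision, it is worth seeing how much the argument order of a confusion matrix matters. The sketch below uses made-up labels and a hypothetical helper, recall_of; it mimics the definition "share of positives in reference that are also positive in data" rather than calling caret.

```r
# Made-up labels: 100 actual conversions, 80 predicted conversions
actual    <- factor(rep(c("Yes", "No"), times = c(100, 900)), levels = c("No", "Yes"))
predicted <- factor(c(rep("Yes", 70), rep("No", 30),    # actual Yes: 70 found, 30 missed
                      rep("Yes", 10), rep("No", 890)),  # actual No: 10 false alarms
                    levels = c("No", "Yes"))

# Share of positives in `reference` that are also positive in `data`
recall_of <- function(data, reference) {
  sum(data == "Yes" & reference == "Yes") / sum(reference == "Yes")
}

recall_of(data = predicted, reference = actual)  # sensitivity: 70 / 100
recall_of(data = actual, reference = predicted)  # swapped: 70 / 80, i.e. precision
```

The same data give 0.7 in one order and 0.875 in the other, so two models should only be compared when their confusion matrices are built with the same argument order.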

6 More information from models

6.1 From logistic model

summary(logit_mod)

## 
## Call:
## NULL
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.1349  -0.0632  -0.0240  -0.0096   4.4125  
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         -12.182505   0.177940 -68.464  < 2e-16 ***
## countryGermany        3.976645   0.153622  25.886  < 2e-16 ***
## countryUK             3.702823   0.140624  26.331  < 2e-16 ***
## countryUS             3.366112   0.136789  24.608  < 2e-16 ***
## age                  -0.073724   0.002654 -27.777  < 2e-16 ***
## new_userOld           1.718527   0.039636  43.358  < 2e-16 ***
## sourceDirect         -0.182688   0.054234  -3.369 0.000756 ***
## sourceSeo            -0.025064   0.044479  -0.564 0.573094    
## total_pages_visited   0.757021   0.006921 109.381  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 72090  on 252958  degrees of freedom
## Residual deviance: 20590  on 252950  degrees of freedom
## AIC: 20608
## 
## Number of Fisher Scoring iterations: 10

Country: the coefficients are positive and statistically significant. Users from Germany, the UK and the US are therefore more likely to convert than users from China, the reference level.
Age: the coefficient is negative and statistically significant. The older a user is, the less likely he/she is to convert.
New/old user: the coefficient is positive and statistically significant. Old users are more likely to convert than new users.
Source: the coefficient for direct traffic is negative and statistically significant, so users from direct traffic are less likely to convert than users from Ads, the reference level. The coefficient for SEO is not statistically significant (p = 0.57).
Total pages visited: the coefficient is positive and statistically significant. The more pages a user visits, the more likely the user is to convert.
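
The coefficients are on the log-odds scale, so their size is easier to read after exponentiation. The sketch below copies four estimates from the summary output above and converts them into odds ratios; the vector names coefs and odds_ratios are just for this example.

```r
# Estimates copied from the summary output above
coefs <- c(
  age                 = -0.073724,
  new_userOld         =  1.718527,
  sourceDirect        = -0.182688,
  total_pages_visited =  0.757021
)

# exp() turns a log-odds coefficient into an odds ratio
odds_ratios <- exp(coefs)
round(odds_ratios, 2)
```

For example, each additional page visited multiplies the odds of conversion by about 2.1, holding the other variables fixed. With the fitted model at hand, the same numbers should be obtainable from exp(coef(logit_mod$finalModel)).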

6.2 From tree model

First, I plot the feature importance.

ggVarImp(tree_mod$finalModel)

The plot shows that the most important feature is total_pages_visited, followed by new/old user. Age, country and source seem to be much less important.

Further insights can be seen in the following plot. It shows that users who visit more than 14 pages, and old users who visit more than 12 pages, are predicted to buy.

fancyRpartPlot(tree_mod$finalModel)

7 Suggestions

Given China's huge population and the large volume of traffic from this market, the company can try to improve Chinese users’ conversion rate by providing a better Chinese translation of the website or by designing it to fit Chinese culture. The data visualization part shows that Germany has the highest conversion rate, so the company may want to attract more German traffic, for example by increasing ads shown to German users.

The company should find out why older people are less likely to convert on the website. One reason may be that the conversion funnel is too complicated for them.

The company should keep in touch with old (returning) users, for example by sending promotional emails or coupons.

Regarding the source, the company may consider attracting more users through advertising, because users from the ads source have a higher conversion rate. However, this comes at a higher cost.

Total pages visited is the most important feature. This suggests that the company should improve the links between pages on its site to hold users’ interest, and should email users who browse a lot but have not bought anything.

Conversion rate

Vu Huong

30 December 2017