The task is to give the Sales and Marketing teams insights into users' conversions, based on each user's country, age, new/old status, source (ads/direct/SEO), and total pages visited.
rm(list=ls())
library(readr) # Efficient reading of CSV data.
library(dplyr) # Data wrangling
library(magrittr) # Pipes %>% , %<>%
library(ggplot2) # Visualise data
library(caret) # Build models
library(pROC) # Calculate AUC
library(lubridate) # Dates and time.
library(gridExtra) # Combine plots
library(forcats) # Factor reverse fct_rev()
library(rattle) # tree plot
library(tidyr) # Gather/ spread data
setwd("E:/DS Project/data science take home test/TakeHomeDataChallenges-master/01.ConversionRate_data")
ds <- read_csv("conversion_data.csv")
Now, let's take a first look at the dataset. There are 316200 observations and 6 variables.
glimpse(ds)
## Observations: 316,200
## Variables: 6
## $ country <chr> "UK", "US", "US", "China", "US", "US", "Ch...
## $ age <int> 25, 23, 28, 39, 30, 31, 27, 23, 29, 25, 38...
## $ new_user <int> 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, ...
## $ source <chr> "Ads", "Seo", "Seo", "Seo", "Seo", "Seo", ...
## $ total_pages_visited <int> 1, 5, 4, 5, 6, 1, 4, 4, 4, 2, 1, 8, 6, 7, ...
## $ converted <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
A few variables (country, new_user, source, converted) have been read with the wrong type (character or integer), so I will convert them into factors.
# Convert few independent variables into factors
ds$country %<>% as.factor()
ds$source %<>% as.factor()
ds$new_user <- factor(ds$new_user,levels =c("1","0"),labels = c("New","Old"))
# Convert target variable into factor.
ds$converted <- factor(ds$converted, levels =c("0","1"),labels = c("No","Yes"))
Fortunately, the dataset has no missing values!
sapply(ds,function(x) sum(is.na(x)))
## country age new_user
## 0 0 0
## source total_pages_visited converted
## 0 0 0
Now, I will summarise the dataset to get the feel for the shape of the data.
summary(ds)
## country age new_user source
## China : 76602 Min. : 17.00 New:216744 Ads : 88740
## Germany: 13056 1st Qu.: 24.00 Old: 99456 Direct: 72420
## UK : 48450 Median : 30.00 Seo :155040
## US :178092 Mean : 30.57
## 3rd Qu.: 36.00
## Max. :123.00
## total_pages_visited converted
## Min. : 1.000 No :306000
## 1st Qu.: 2.000 Yes: 10200
## Median : 4.000
## Mean : 4.873
## 3rd Qu.: 7.000
## Max. :29.000
length(which(ds$age>100))
## [1] 2
ds %<>% filter(age < 100)
From the summary, I detect that there are outliers in the age variable, because it is not reasonable to have a user aged 123. In fact, only 2 users are older than 100. That is negligible compared with the number of observations in the dataset (316198 remain after filtering), so I decided to remove them.
round(100*table(ds$converted)/nrow(ds),2)
##
## No Yes
## 96.77 3.23
We see that the number of conversions is low compared with the total number of users. With only 3.23% of users converting, the dataset is imbalanced, which suggests resampling the dataset to balance it.
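One common way to rebalance is to down-sample the majority class. The sketch below does this in base R on a small made-up data frame; `toy`, `balanced`, and the values in them are illustrative only and are not part of the original analysis. caret's `trainControl()` also has a `sampling` argument that can apply this kind of step inside resampling.

```r
# Toy data: 95 non-converted and 5 converted users (made-up values)
set.seed(1)
toy <- data.frame(
  converted = factor(c(rep("No", 95), rep("Yes", 5)), levels = c("No", "Yes")),
  total_pages_visited = c(rpois(95, 4), rpois(5, 14))
)

# Size of the minority class
n_min <- min(table(toy$converted))

# Row indices per class, then keep n_min randomly chosen rows from each class
idx_by_class <- split(seq_len(nrow(toy)), toy$converted)
keep <- unlist(lapply(idx_by_class,
                      function(idx) idx[sample.int(length(idx), n_min)]))
balanced <- toy[keep, ]

# Both classes now have the same number of rows
table(balanced$converted)
```

Down-sampling discards information from the majority class, so it is usually applied to the training set only, never to the test set.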
g1 <- ds %>%
group_by(country, converted) %>%
summarize(count = n()) %>%
ggplot(aes(country, count, fill = converted)) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = "Country distribution and its relationship with conversion",
subtitle = "Unit: Count",
x = NULL,
y = NULL,
fill = "Converted") +
theme(legend.position = "bottom",
legend.direction = "horizontal",
plot.title = element_text(size = 10, face = "bold"))
g2 <- ds %>%
group_by(country, converted) %>%
summarize(count = n()) %>%
ggplot(aes(country, count, fill = converted)) +
geom_bar(stat = "identity", position = "fill") +
labs(title = "Country distribution and its relationship with conversion",
subtitle = "Unit: %",
x = NULL,
y = NULL,
fill = "Converted") +
theme(legend.position = "bottom",
legend.direction = "horizontal",
plot.title = element_text(size = 10, face = "bold"))
grid.arrange(g1,g2,ncol=2)
From above charts, it is clear that:
The majority of users come from the United States, which also has the highest number of conversions. However, the second chart shows that its conversion rate is not proportionate to its high number of users.
There are many users from China, but its conversion rate is low.
Germany has the lowest number of users, but its conversion rate is high.
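The "Unit: %" panel is effectively showing the conversion rate within each country. A minimal sketch of that calculation, using a made-up data frame (`toy` and its counts are illustrative, not taken from `ds`):

```r
# Made-up counts: 50 users from China, 10 from Germany, 40 from the US
toy <- data.frame(
  country   = c(rep("China", 50), rep("Germany", 10), rep("US", 40)),
  converted = c(rep(c("No", "Yes"), times = c(49, 1)),
                rep(c("No", "Yes"), times = c(9, 1)),
                rep(c("No", "Yes"), times = c(38, 2)))
)

# Cross-tabulate, then turn each row (country) into percentages
counts <- table(toy$country, toy$converted)
rates  <- round(100 * prop.table(counts, margin = 1), 1)
rates
```

On the real data the same idea would be `prop.table(table(ds$country, ds$converted), margin = 1)`, which gives the numbers behind the stacked bars.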
Now, let’s take a look at the age distribution after cleaning and its relationship with conversion.
ds %>%
ggplot(aes(x = converted, y = age, fill = converted)) +
geom_boxplot() +
labs(title = "Age distribution and its relation with conversion",
x = NULL,
y = "Age",
fill = NULL) +
theme(legend.position = "none")
After cleaning, the age variable ranges from 17 to nearly 80, which makes sense. The graph above shows that younger users seem to have a higher tendency to buy, and users aged 60 and over seem not to buy at all.
ds %>%
group_by(new_user, converted) %>%
summarize(count = n()) %>%
ggplot(aes(new_user, count, fill = converted)) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = "New/old users distribution and its relationship with conversion",
x = NULL,
y = NULL,
fill = "Converted") +
theme(legend.position = "bottom",
legend.direction = "horizontal")
The above chart shows that most users are new users and that they have a lower conversion rate than old users.
g3 <- ds %>%
group_by(source, converted) %>%
summarize(count = n()) %>%
ggplot(aes(source, count, fill = fct_rev(converted))) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = "Source distribution and its relationship with conversion",
subtitle = "Unit: Count",
x = NULL,
y = NULL,
fill = "Converted") +
theme(legend.position = "bottom",
legend.direction = "horizontal",
plot.title = element_text(size = 10, face = "bold"))
g4 <- ds %>%
group_by(source, converted) %>%
summarize(count = n()) %>%
ggplot(aes(source, count, fill = fct_rev(converted))) +
geom_bar(stat = "identity", position = "fill") +
labs(title = "Source distribution and its relationship with conversion",
subtitle = "Unit: %",
x = NULL,
y = NULL,
fill = "Converted") +
theme(legend.position = "bottom",
legend.direction = "horizontal",
plot.title = element_text(size = 10, face = "bold"))
grid.arrange(g3,g4,ncol=2)
The source distribution graphs show that:
A high proportion of users come from SEO.
Together with direct traffic, it is clear that the company attracts a lot of users through unpaid sources, which is a good sign.
Ads also bring many users to the website. Looking at the percentage of converted users within each source, users from ads have a higher tendency to buy than users from unpaid sources.
ds %>%
ggplot(aes(x = converted, y = total_pages_visited, fill = converted)) +
geom_boxplot() +
labs(title = "Total pages visited distribution and its relation with conversion",
x = NULL,
y = "Total number of pages visited",
fill = NULL) +
theme(legend.position = "none")
The above chart shows that the more pages a user visits, the more likely the user is to buy. Users who visit more than 10 pages tend to convert, while users who visit around 7 pages tend not to.
In model building, I will first ignore the class imbalance problem and use logistic regression and a decision tree to predict whether a user converts, since the task calls for interpretable models. The results will then be compared based on:
Accuracy: how accurately the model labels users overall ((TP+TN)/(TN+TP+FN+FP))
Sensitivity: the proportion of truly converted users that the model labels as converted (TP/(TP+FN))
Specificity: the proportion of truly non-converted users that the model labels as not converted (TN/(TN+FP))
Precision: the proportion of predicted converted users who truly converted (TP/(TP+FP))
AUC: area under the ROC curve (ROC: receiver operating characteristic, which plots (1-Specificity) on the x-axis against Sensitivity on the y-axis).
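To make the definitions concrete, the sketch below computes the four threshold-based metrics from confusion-matrix counts. The counts `TP`, `FN`, `TN`, `FP` are made-up numbers for illustration, not results from the models in this analysis.

```r
# Made-up confusion-matrix counts
TP <- 70;  FN <- 30   # converted users: correctly / incorrectly predicted
TN <- 880; FP <- 20   # non-converted users: correctly / incorrectly predicted

accuracy    <- (TP + TN) / (TP + TN + FP + FN)
sensitivity <- TP / (TP + FN)
specificity <- TN / (TN + FP)
precision   <- TP / (TP + FP)

round(c(Accuracy = accuracy, Sensitivity = sensitivity,
        Specificity = specificity, Precision = precision), 3)
```

With an imbalanced target, accuracy and specificity are dominated by the large "No" class, which is why sensitivity and precision are reported alongside them.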
The baseline accuracy is 96.77% when we simply predict all users as not-converted users.
round(100*table(ds$converted)/nrow(ds),2)
##
## No Yes
## 96.77 3.23
The dataset will be separated into two parts: a training set with 80% of the observations and a testing set with the remaining 20%. The models will be built with 5-fold cross-validation repeated 5 times.
set.seed(123)
indexTrain <- createDataPartition(y=ds$converted,p=0.8,list=FALSE)
training <- ds[indexTrain,]
testing <- ds [-indexTrain,]
# k-folds cross validation: 5 folds, 5 repeats
ctrl <- trainControl(method= "repeatedcv", number = 5, repeats = 5, classProbs = TRUE)
set.seed(234)
logit_mod <- train(converted~., data=training, method = "glm", family = "binomial",
trControl = ctrl)
pred_logit <- predict(logit_mod,newdata = testing)
cm_logit <- confusionMatrix(data=pred_logit,testing$converted,positive="Yes")
auc_logit <- auc(testing$converted,predict(logit_mod,newdata = testing,type = "prob")[,2])[1]
set.seed(234)
tree_mod <- train(converted ~ ., data=training, method = "rpart2",
trControl = ctrl)
## Loading required package: rpart
pred_tree <- predict(tree_mod,newdata=testing)
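# Predictions go in `data`, true labels in `reference` (same order as for the logit model);
# swapping them would exchange Sensitivity with Precision in the output
cm_tree <- confusionMatrix(data=pred_tree,testing$converted,positive = "Yes")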
auc_tree <- auc(testing$converted,predict(tree_mod,newdata = testing,type = "prob")[,2])[1]
Now, let’s compare results of all the models.
models_list <- list(logit = logit_mod,
tree = tree_mod)
comparision <- data.frame(model = names(models_list ),
Accuracy = rep(NA, length(models_list )),
Sensitivity = rep(NA, length(models_list)),
Specificity = rep(NA, length(models_list)),
Precision = rep(NA, length(models_list)),
AUC = rep(NA, length(models_list)))
for (name in names(models_list)) {
model <- get(paste0("cm_", name))
comparision[comparision$model == name,2] <- model$overall["Accuracy"]
comparision[comparision$model == name,3] <- model$byClass["Sensitivity"]
comparision[comparision$model == name,4] <- model$byClass["Specificity"]
comparision[comparision$model == name,5] <- model$byClass["Pos Pred Value"]
comparision[comparision$model == name,6] <- get(paste0("auc_", name))
}
comparision %>%
gather(Type,Value,Accuracy : AUC) %>%
ggplot(aes(Type,Value,col=model)) +
geom_jitter(width = 0.2, alpha = 0.7, size = 3)+
labs (title = "Results comparision",
x = NULL,
y= NULL)+
scale_x_discrete(limits=c("Accuracy", "Sensitivity", "Specificity","Precision", "AUC"))+
# model is mapped to colour (not fill), and its values are "logit" and "tree"
scale_colour_discrete(breaks=c("logit", "tree"), labels=c("Logit", "Tree"))+
scale_y_continuous(limits = c(0,1))+
geom_hline(yintercept = 0.96774576, col = "red",size=0.5 )
It is worth mentioning that the red line is the baseline accuracy, obtained by simply classifying all users as not converted.
Accuracy: Both models have high accuracy (~98.5%), higher than the baseline accuracy. The logit model's accuracy is slightly higher.
Sensitivity: The logistic model performs worse at identifying converted users (nearly 70%), compared with 85% for the tree model.
Specificity: Both models, trained on the original (unbalanced) data, identify non-converted users almost perfectly.
Precision: The logit model performs better, with 85%. This means that 85% of the cases the model predicts as conversions are true conversions.
AUC: The logistic model has the higher AUC (98.6%) and the tree model the lower (84%).
In short, there is a trade-off: an increase in sensitivity comes at the price of lower precision and specificity.
Overall, the logistic model is the best, since it performs best in terms of accuracy, AUC, and specificity.
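AUC can be read as the probability that a randomly chosen converted user receives a higher predicted score than a randomly chosen non-converted user. A small base-R sketch of that interpretation, using made-up labels and scores (`labels`, `scores`, and `auc_by_pairs` are illustrative names, not part of the original code):

```r
# Made-up true labels and predicted probabilities of "Yes"
labels <- c("No", "No", "No", "No", "Yes", "Yes")
scores <- c(0.10, 0.40, 0.35, 0.80, 0.70, 0.90)

auc_by_pairs <- function(labels, scores, positive = "Yes") {
  pos <- scores[labels == positive]
  neg <- scores[labels != positive]
  # Compare every positive score with every negative score; ties count half
  cmp <- outer(pos, neg, function(p, n) (p > n) + 0.5 * (p == n))
  mean(cmp)
}

auc_value <- auc_by_pairs(labels, scores)
auc_value
```

This pairwise version is quadratic in the number of rows, so it is only meant to explain the metric; for the full dataset a dedicated function such as `pROC::auc()` is the practical choice.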
summary(logit_mod)
##
## Call:
## NULL
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.1349 -0.0632 -0.0240 -0.0096 4.4125
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -12.182505 0.177940 -68.464 < 2e-16 ***
## countryGermany 3.976645 0.153622 25.886 < 2e-16 ***
## countryUK 3.702823 0.140624 26.331 < 2e-16 ***
## countryUS 3.366112 0.136789 24.608 < 2e-16 ***
## age -0.073724 0.002654 -27.777 < 2e-16 ***
## new_userOld 1.718527 0.039636 43.358 < 2e-16 ***
## sourceDirect -0.182688 0.054234 -3.369 0.000756 ***
## sourceSeo -0.025064 0.044479 -0.564 0.573094
## total_pages_visited 0.757021 0.006921 109.381 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 72090 on 252958 degrees of freedom
## Residual deviance: 20590 on 252950 degrees of freedom
## AIC: 20608
##
## Number of Fisher Scoring iterations: 10
Country: its coefficients are positive and statistically significant. This means that users from Germany, the UK, and the US have a higher tendency to convert than users from China (the reference level).
Age: its coefficient is negative and statistically significant. This means that the older a user is, the less likely he/she is to convert.
New/old user: its coefficient is positive and statistically significant. This means that old users are more likely to convert than new users.
Source: the coefficient for direct traffic is negative and statistically significant, which means that users from direct traffic are less likely to convert than users from ads (the reference level). The coefficient for SEO is not statistically significant.
Total pages visited: its coefficient is positive and statistically significant, which means that the more pages a user visits, the more likely the user is to convert.
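Because these are log-odds coefficients, exponentiating them gives odds ratios, which are often easier to communicate to a non-technical audience. The sketch below copies a few estimates from the summary output above; the vector name `coefs` is just for illustration.

```r
# Estimates copied from the summary(logit_mod) output above
coefs <- c(countryGermany      =  3.976645,
           age                 = -0.073724,
           new_userOld         =  1.718527,
           sourceDirect        = -0.182688,
           total_pages_visited =  0.757021)

# Odds ratio = exp(log-odds coefficient)
odds_ratios <- round(exp(coefs), 2)
odds_ratios
```

An odds ratio above 1 means higher odds of conversion and below 1 means lower odds, holding the other variables fixed. For example, each additional page visited multiplies the odds of conversion by exp(0.757021), a bit more than a factor of 2. On the fitted model, the full set could be obtained with `exp(coef(logit_mod$finalModel))`.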
First, I plot the feature importance.
ggVarImp(tree_mod$finalModel)
The plot shows that the most important feature is total_pages_visited, followed by new/old user. Age, country, and source seem to be much less important.
Further insights can be seen from the following plot. It shows that users who visited more than 14 pages, and old users who visited more than 12 pages, are predicted to buy.
fancyRpartPlot(tree_mod$finalModel)
Given China's large population and the large amount of traffic from this market, the company could improve Chinese users' conversion rate by providing a better Chinese translation of the website or by designing it in a way that fits Chinese culture. The data visualization shows that Germany has the highest conversion rate, so the company could try to attract more German traffic, for example by increasing ads targeted at German users.
The company should find out why older people do not like the web pages. One reason may be that the conversion funnel is too complicated for older people.
The company should keep in touch with old (returning) users, for example by sending promotional emails or coupons.
Regarding the source, the company may consider attracting more users through more advertising, because users from ads have a higher conversion rate. However, this comes at a higher cost.
As the most important feature, total pages visited suggests that the company could improve the links between pages on its site to increase users' interest, and also send emails to people who browse a lot but have not bought anything.