The data set contains information about users who visited a site and whether or not they have converted. The goal is to predict conversion rate and come up with recommendations for the product team and the marketing team to improve conversion rate.
First, let’s read in the data, and look at what’s in it.
#read data
dt <- fread('conversion_data.csv')
head(dt)
## country age new_user source total_pages_visited converted
## 1: UK 25 1 Ads 1 0
## 2: US 23 1 Seo 5 0
## 3: US 28 1 Seo 4 0
## 4: China 39 1 Seo 5 0
## 5: US 30 1 Seo 6 0
## 6: US 31 0 Seo 1 0
There are six columns to the data. For each user, we have the user’s country, age, marketing channel source, and total page visited. We are trying to predict the “converted” variable.
For each predicting variable, we can segment conversion rate by that variable, and see what % of users are in each segment. This would tell us if conversion rate is higher in certain segments. Also, if these segments are currently a small share of users, these could be potential growth areas that the product and marketing teams can focus on.
Here are the plots:
#visualize conversion rate by different variables
pl = list()
for (col in 1:5) {
dt_conversion <- dt[, sum(converted) / length(converted), by = eval(colnames(dt)[col])]
dt_conversion <- dt_conversion[order(dt_conversion[,1, with = FALSE]), ]
g_conv <- ggplot(dt_conversion, aes_string(x=colnames(dt_conversion)[1], y = "V1"))
g_conv <- g_conv + geom_bar(stat = "identity") + scale_y_continuous(labels=scales::percent) + ylab("Conversion Rate")
pl[[col]] = g_conv
dt_count <- dt[, .N/nrow(dt), by = eval(colnames(dt)[col])]
dt_count <- dt_count[order(dt_count[,1, with = FALSE]), ]
g_count <- ggplot(dt_count, aes_string(x=colnames(dt_count)[1], y = "V1"))
g_count <- g_count + geom_bar(stat = "identity") + scale_y_continuous(labels=scales::percent) + ylab("% Share of Users")
pl[[col+5]] = g_count
}
do.call(grid.arrange, c(pl, nrow = 2))
Hmm. Looks like there are some outliers in the age variable – two ages over 100 have 100% conversion rate! Looking at the distribution of ages, there aren’t actually a lot of users with age over 100. Likely, there were maybe two users with age over 100, and they both converted. Looking into the data, this is indeed the case.
dt[age>100,]
## country age new_user source total_pages_visited converted
## 1: Germany 123 0 Seo 15 1
## 2: UK 111 0 Ads 10 1
Age over 100 seem a bit fishy, and may come from errors in the data. It would be best to delete these two data points.
dt <- dt[age<100,]
These two outliers are also dwarfing the conversion rate bars of other ages, making the age conversion graph not very informative. Let’s re-draw the graph with these two outliers taken out:
#Looks like the age conversion graph has some outliers - two ages over 100 have 100% conversion rate, this is probably because the sample size was very small (i.e. there were two people over age 100 and they both converted)
#regenerate the age graph removing outliers defined by conversion rate less than the 95% quantile, and rerunning the plots:
age_conversion <- dt[, sum(converted)/length(converted), by = age]
age_conversion <- age_conversion[order(age_conversion$age),]
age_conversion <- age_conversion[V1 < quantile(V1, probs = 0.95)]
age_conv <- ggplot(age_conversion, aes(x = age, y = V1)) + geom_bar(stat = "identity") + scale_y_continuous(labels=scales::percent) + ylab("Conversion Rate")
pl[[2]] <- age_conv
do.call(grid.arrange, c(pl, nrow = 2))
From this graph, we can already make some very interesting observations.
Next, let’s try to predict conversion with an XGBoost Model. We will split the data set into train and test, and estimate the out of sample error rate from the test set. Additionally, we will rank the variable importance based on the model.
train_idx <- createDataPartition(dt$converted, p = .8, list = FALSE, times = 1)
train <- dt[train_idx, ]
test <- dt[-train_idx, ]
#converting all variable classes into integers starting with 0:
train$country <- as.integer(as.factor(train$country)) - 1
train$age <- train$age - 1
train$source <- as.integer(as.factor(train$source)) - 1
train$total_pages_visited <- train$total_pages_visited - 1
test$country <- as.integer(as.factor(test$country)) - 1
test$age <- test$age - 1
test$source <- as.integer(as.factor(test$source)) - 1
test$total_pages_visited <- test$total_pages_visited - 1
#taking out the predicted variable
y_train <- train$converted
train[, converted:=NULL]
y_test <- test$converted
test[, converted:=NULL]
#converting into matrix
trainMatrix <- train[,lapply(.SD,as.numeric)] %>% as.matrix
testMatrix <- test[,lapply(.SD,as.numeric)] %>% as.matrix
#XGBoost
#cv to get error rate:
numberOfClasses <- max(y_train) + 1
cv.nround <- 3
cv.nfold <- 3
bst.cv = xgb.cv(data = trainMatrix, label = y_train,
nfold = cv.nfold, nrounds = cv.nround)
#train the xgboost model:
nround = 50
bst = xgboost(data = trainMatrix, label = y_train
, objective = "binary:logistic", nrounds=nround,
max.depth = 2)
#predicting the testing set:
y_pred <- predict(bst, testMatrix)
err <- mean(as.numeric(y_pred > 0.5) != y_test)
print(paste("test-error=", err))
## [1] "test-error= 0.0134726988092791"
We get a test error rate of approximately 1.35%, which is in line with the train error rate of 1.39%.
Next, let’s see the importance of variables:
#get feature importance
names <- dimnames(trainMatrix)[[2]]
importance_matrix <- xgb.importance(names, model = bst)
xgb.plot.importance(importance_matrix)
Not surprisingly, total pages visited is by far the most important predictor. Source is insignificant.
From the data visualizations and machine learning model, we can draw these conclusions: