Introduction

The data set contains information about users who visited a site and whether or not they have converted. The goal is to predict conversion rate and come up with recommendations for the product team and the marketing team to improve conversion rate.

Data Exploration

First, let’s read in the data, and look at what’s in it.

#read data
dt <- fread('conversion_data.csv')
head(dt)
##    country age new_user source total_pages_visited converted
## 1:      UK  25        1    Ads                   1         0
## 2:      US  23        1    Seo                   5         0
## 3:      US  28        1    Seo                   4         0
## 4:   China  39        1    Seo                   5         0
## 5:      US  30        1    Seo                   6         0
## 6:      US  31        0    Seo                   1         0

There are six columns to the data. For each user, we have the user’s country, age, marketing channel source, and total page visited. We are trying to predict the “converted” variable.

For each predicting variable, we can segment conversion rate by that variable, and see what % of users are in each segment. This would tell us if conversion rate is higher in certain segments. Also, if these segments are currently a small share of users, these could be potential growth areas that the product and marketing teams can focus on.

Here are the plots:

#visualize conversion rate by different variables
pl = list()
for (col in 1:5) {
  dt_conversion <- dt[, sum(converted) / length(converted), by = eval(colnames(dt)[col])]
  dt_conversion <- dt_conversion[order(dt_conversion[,1, with = FALSE]), ]
  g_conv <- ggplot(dt_conversion, aes_string(x=colnames(dt_conversion)[1], y = "V1"))
  g_conv <- g_conv + geom_bar(stat = "identity") + scale_y_continuous(labels=scales::percent) + ylab("Conversion Rate")
  pl[[col]] = g_conv
  
  dt_count <- dt[, .N/nrow(dt), by = eval(colnames(dt)[col])]
  dt_count <- dt_count[order(dt_count[,1, with = FALSE]), ]
  g_count <- ggplot(dt_count, aes_string(x=colnames(dt_count)[1], y = "V1")) 
  g_count <- g_count + geom_bar(stat = "identity") + scale_y_continuous(labels=scales::percent) + ylab("% Share of Users")
  pl[[col+5]] = g_count
}
do.call(grid.arrange, c(pl, nrow = 2))

Hmm. Looks like there are some outliers in the age variable – two ages over 100 have 100% conversion rate! Looking at the distribution of ages, there aren’t actually a lot of users with age over 100. Likely, there were maybe two users with age over 100, and they both converted. Looking into the data, this is indeed the case.

dt[age>100,]
##    country age new_user source total_pages_visited converted
## 1: Germany 123        0    Seo                  15         1
## 2:      UK 111        0    Ads                  10         1

Age over 100 seem a bit fishy, and may come from errors in the data. It would be best to delete these two data points.

dt <- dt[age<100,]

These two outliers are also dwarfing the conversion rate bars of other ages, making the age conversion graph not very informative. Let’s re-draw the graph with these two outliers taken out:

#Looks like the age conversion graph has some outliers - two ages over 100 have 100% conversion rate, this is probably because the sample size was very small (i.e. there were two people over age 100 and they both converted)
#regenerate the age graph removing outliers defined by conversion rate less than the 95% quantile, and rerunning the plots:
age_conversion <- dt[, sum(converted)/length(converted), by = age]
age_conversion <- age_conversion[order(age_conversion$age),]
age_conversion <- age_conversion[V1 < quantile(V1, probs = 0.95)]
age_conv <- ggplot(age_conversion, aes(x = age, y = V1)) + geom_bar(stat = "identity") + scale_y_continuous(labels=scales::percent) + ylab("Conversion Rate")
pl[[2]] <- age_conv
do.call(grid.arrange, c(pl, nrow = 2))

From this graph, we can already make some very interesting observations.

  1. Country: The majority of traffic comes from the US, but there is also signficant traffic from China. However, conversion rate from China is very low compared to other countries. Conversion rate from Germany is the highest while Germany also has the least % share of users. Perhaps the site can consider expanding more in Germany and less in China, to drive up conversion rate.
  2. Age: Conversion rate is the highest among the youngest users - in fact conversion rate shows a strong negative correlation with age, from ages 15 - 55, declining steadily as age goes up. Conversion picks up again for ages over 55. Looking at the distribution of age, we see that the peak distribution of users fall between 25 and 30 years of age. To drive conversion, the site con consider growing the user base for age < 25, and age > 55.
  3. New User: It looks like existing users have a higher conversion rate than new users, while more traffic comes from first time users. There are two great business opportunities here - 1) the site can work on converting new users into returning users, and 2) the site can follow up on existing users (through emails or push notifications on mobile), to get them to come to the site more often.
  4. Source: Conversion rate is roughly even between the three sources, with ads being slightly higher.
  5. Total Pages Visited: It looks like pages visited show a very strong relationship with conversion. In fact, when pages visited are over 20, conversion rate is consistently near 100%. Looking at the % Share of users, however, we see that most users visit 0-10 pages. However, it’s the small percentage of users that visit over 10 pages that drive a lot of conversion. Again, there are two exciting opportunities here - 1) the site can work on converting users that visit only a few pages to users that visit more pages, and 2) the site can follow up on users who visit over 20 pages who did not convert, in hopes of turning them into converters.

Machine Learning Model

Next, let’s try to predict conversion with an XGBoost Model. We will split the data set into train and test, and estimate the out of sample error rate from the test set. Additionally, we will rank the variable importance based on the model.

train_idx <- createDataPartition(dt$converted, p = .8, list = FALSE, times = 1)

train <- dt[train_idx, ]
test <- dt[-train_idx, ]

#converting all variable classes into integers starting with 0:
train$country <- as.integer(as.factor(train$country)) - 1
train$age <- train$age - 1
train$source <- as.integer(as.factor(train$source)) - 1
train$total_pages_visited <- train$total_pages_visited - 1

test$country <- as.integer(as.factor(test$country)) - 1
test$age <- test$age - 1
test$source <- as.integer(as.factor(test$source)) - 1
test$total_pages_visited <- test$total_pages_visited - 1

#taking out the predicted variable
y_train <- train$converted 
train[, converted:=NULL]
y_test <- test$converted
test[, converted:=NULL]

#converting into matrix
trainMatrix <- train[,lapply(.SD,as.numeric)] %>% as.matrix
testMatrix <- test[,lapply(.SD,as.numeric)] %>% as.matrix

#XGBoost
#cv to get error rate:
numberOfClasses <- max(y_train) + 1
cv.nround <- 3
cv.nfold <- 3
bst.cv = xgb.cv(data = trainMatrix, label = y_train, 
                nfold = cv.nfold, nrounds = cv.nround)

#train the xgboost model:
nround = 50
bst = xgboost(data = trainMatrix, label = y_train
              , objective = "binary:logistic", nrounds=nround,
              max.depth = 2)
#predicting the testing set:
y_pred <- predict(bst, testMatrix)
err <- mean(as.numeric(y_pred > 0.5) != y_test)
print(paste("test-error=", err))
## [1] "test-error= 0.0134726988092791"

We get a test error rate of approximately 1.35%, which is in line with the train error rate of 1.39%.

Next, let’s see the importance of variables:

#get feature importance 
names <- dimnames(trainMatrix)[[2]]
importance_matrix <- xgb.importance(names, model = bst)
xgb.plot.importance(importance_matrix)

Not surprisingly, total pages visited is by far the most important predictor. Source is insignificant.

Conclusion

From the data visualizations and machine learning model, we can draw these conclusions:

  1. Young users, especially those under 25, have very high conversion rates. Marketing should attract more young users.
  2. German users have almost double the average conversion rate. Marketing should build up the user base in Germany.
  3. Conversion rate in China is very low, despite a sizable user base. Product should look into the Chinese version of the site and see what isn’t working. Perhaps it is poorly translated, or does not fit the local culture?
  4. Users who have visited many pages have high purchase intent. If a user has visited many pages but have not purchased, perhaps marketing can send follow up emails or push notifications to the user to drive conversion.