This dataset is taken from the ‘Collection of Data Science Take-home Challenges’ book. This is a really simple problem: to analyze the conversion rate of a website, i.e. to predict which users will buy something from the website. The variables are quite straightforward: the country in which the user is browsing, total pages visited, the source that directed them to the website, whether the user is new or returning, and age. The output is given as ‘converted’ in the form of 0 or 1.
I am exploring this exercise mainly to illustrate how to approach a simple classification problem, and to explain the hyperparameters and associated plots and metrics of the random forest approach.
The libraries I am using for this analysis are randomForest, caret, and ROCR. I will explain more when I get to the actual usage of the commands from these libraries. The first step is to load the CSV file and analyze its contents.
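Assuming these packages are already installed, they can be attached up front:
# Attach the packages used throughout this analysis
library(randomForest)  # random forest fitting, variable importance, partial dependence
library(caret)         # createDataPartition for the train/test split
library(ROCR)          # ROC curve and AUC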
conv <- read.csv('conversion_data.csv')
head(conv)
## country age new_user source total_pages_visited converted
## 1 UK 25 1 Ads 1 0
## 2 US 23 1 Seo 5 0
## 3 US 28 1 Seo 4 0
## 4 China 39 1 Seo 5 0
## 5 US 30 1 Seo 6 0
## 6 US 31 0 Seo 1 0
This gives the total number of variables we are dealing with in this problem. Right away we can identify the ‘converted’ column as the target variable. But this alone does not explain everything about the data. We also need to know the type and distribution of the explanatory variables.
summary(conv)
## country age new_user source
## China : 76602 Min. : 17.00 Min. :0.0000 Ads : 88740
## Germany: 13056 1st Qu.: 24.00 1st Qu.:0.0000 Direct: 72420
## UK : 48450 Median : 30.00 Median :1.0000 Seo :155040
## US :178092 Mean : 30.57 Mean :0.6855
## 3rd Qu.: 36.00 3rd Qu.:1.0000
## Max. :123.00 Max. :1.0000
## total_pages_visited converted
## Min. : 1.000 Min. :0.00000
## 1st Qu.: 2.000 1st Qu.:0.00000
## Median : 4.000 Median :0.00000
## Mean : 4.873 Mean :0.03226
## 3rd Qu.: 7.000 3rd Qu.:0.00000
## Max. :29.000 Max. :1.00000
We can see that, even though the ‘converted’ and ‘new_user’ variables should be factors, they are stored as integers in the dataset, which is why the summary command generates a quantile distribution for them. So, we need to convert them to the appropriate type.
conv$converted <- as.factor(conv$converted)
conv$new_user <- as.factor(conv$new_user)
Also, we can see that the ‘age’ variable has a maximum value of 123! This could be a mistake or a troll. Either way, let us limit the maximum age to 100 to avoid any outliers or unexplainable values.
conv <- subset(conv, age <= 100)
Next, we could plot the variables to see their distribution.
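A minimal sketch of the plotting code for the full dataset (analogous to what is used later for the converted subset) might look like this:
# Distribution of each explanatory variable in the full dataset
par(mfrow=c(2,2))
plot(conv$country, col="blue", main="Country")
plot(conv$source, col="blue", main="Sources directing to the website")
hist(conv$age, col="blue", main="Age", xlab="")
hist(conv$total_pages_visited, col="blue", main="Total pages visited", xlab="")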
Users mainly originate from the United States, followed by China and the other countries. We can keep these graphs in mind as we explore the results from the prediction later.
Second, we need to see if the data is balanced. By that, I mean whether the 0 and 1 classes in the converted column are equally distributed.
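A quick way to check this (a small sketch on the conv data frame loaded above):
# Raw counts and proportions of the two classes in 'converted'
table(conv$converted)
prop.table(table(conv$converted))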
It is clear that this is an imbalanced dataset, which makes sense because not everyone who is browsing the website is buying from it; there are many “lurkers”. There are two main ways to address this: (a) making the number of non-converted rows the same as the number of converted rows, or (b) changing the hyperparameters of the random forest algorithm.
The conversion rate is the percentage of people who bought something out of the total number of people who browsed the site.
I am following the first approach to ‘balance’ the dataset. I am splitting the dataset into two parts: converted and not converted. From these, I can estimate the conversion rate.
Then, I will be “sampling”, or randomly choosing, rows from the not-converted dataset to match the number of rows in the converted dataset. Finally, these two datasets will be joined and shuffled.
convert <- subset(conv, converted == 1)
no_convert <- subset(conv, converted == 0)
conversion_rate <- nrow(convert) / nrow(conv)
print(conversion_rate)
## [1] 0.03225194
Here, the conversion rate falls within the traditional rule-of-thumb range (i.e. between 2% and 5%), which is a decent figure for a website.
# Downsample the non-converted rows to match the number of converted rows
final_no_convert <- no_convert[sample(nrow(no_convert), nrow(convert)),]
# Combine the two subsets and shuffle the rows
final_convdata <- rbind(convert, final_no_convert)
final_convdata <- final_convdata[sample(nrow(final_convdata)),]
We can also plot the variables from the converted dataset to see how it differs from the overall dataset.
par(mfrow=c(2,2))
plot(convert$country, col="green", main="Country")
plot(convert$source, col="green", main="Sources directing to the website")
hist(convert$age, col="green", main="Age", xlab="")
hist(convert$total_pages_visited, col="green", main="Total pages visited", xlab="")
Here, we see an interesting difference in the Country plot. Even though a lot of lurkers are from China, among the converted users their share is really low!
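A quick way to quantify this (a small sketch using the conv and convert data frames from above) is to compare the country proportions in the two datasets:
# Share of each country among all users vs. among converted users
round(prop.table(table(conv$country)), 3)
round(prop.table(table(convert$country)), 3)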
In order to fit the random forest model, the first step is to divide the dataset into train and test sets. createDataPartition from the caret package is used for splitting the data into a training set (80%) and a test set (20%).
set.seed(1)
train_index <- createDataPartition(y = final_convdata$converted, p = 0.8, list=FALSE)
train_set <- final_convdata[train_index,]
test_set <- final_convdata[-train_index,]
We fit the random forest model using its default parameters.
fit <- randomForest(converted~., train_set)
print(fit)
##
## Call:
## randomForest(formula = converted ~ ., data = train_set)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 6.31%
## Confusion matrix:
## 0 1 class.error
## 0 7655 504 0.06177228
## 1 525 7634 0.06434612
We can see there are two kinds of errors in the summary. “class.error” is the classification error for each class in the model. The more interesting metric is the OOB error, which is the out-of-bag error. For each tree, the random forest algorithm samples rows randomly with replacement to fit that tree (let’s call this a “bag” of sample data). Each tree is then tested on the rows it never saw (the “out of bag” data), and these out-of-bag predictions are aggregated across all the trees to estimate the error rate. Because this built-in test happens for every tree, we do not need a separate validation set for the random forest approach.
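As a small sketch using the fit object above, we can also inspect how the OOB and per-class error rates stabilize as trees are added to the forest:
# Error rates (OOB and per class) stored tree by tree in the fitted object
head(fit$err.rate)
# Plot error rates against the number of trees
plot(fit, main="Error rate vs. number of trees")
legend("topright", legend=colnames(fit$err.rate), lty=1:3, col=1:3)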
Next, we see which variables have been most important in ‘classifying’ the conversions. The following graph shows the “MeanDecreaseGini” of each explanatory variable. The ‘Gini index’ (or ‘Gini impurity’) indicates how heterogeneous or homogeneous the data in a node is: the higher the Gini impurity, the more mixed the classes are, i.e. the harder they are to separate. The graph shows the mean DECREASE in Gini attributable to each variable, so the variable with the highest decrease is the one whose splits separate the classes most cleanly. In other words, it is the most informative attribute of the conversion data.
varImpPlot(fit)
In this case, we see ‘total pages visited’ as the most informative attribute for this dataset.
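If we want the numbers behind the plot, the importance() function from the randomForest package returns the MeanDecreaseGini values directly:
# Numeric variable importance behind varImpPlot
importance(fit)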
predicted <- predict(fit, test_set[,-6])
The cross-classification table (confusion matrix) shows the true and false positives and negatives on the test dataset.
table(predicted, test_set$converted)
##
## predicted 0 1
## 0 1930 123
## 1 109 1916
The “AUC”, or area-under-the-curve value, summarizes the ROC curve, which plots the true positive rate against the false positive rate. In general, the higher the area under the curve, the better the performance of the model. The ‘ROCR’ package is used to plot this curve.
predictions <- as.numeric(predict(fit, test_set[,-6], type="response"))
pred <- prediction(predictions, test_set$converted)
perf <- performance(pred, measure = "tpr", x.measure = "fpr")
plot(perf, col=rainbow(10))
auc<- performance(pred,"auc")
print(auc)
## An object of class "performance"
## Slot "x.name":
## [1] "None"
##
## Slot "y.name":
## [1] "Area under the ROC curve"
##
## Slot "alpha.name":
## [1] "none"
##
## Slot "x.values":
## list()
##
## Slot "y.values":
## [[1]]
## [1] 0.9431094
##
##
## Slot "alpha.values":
## list()
Here, the AUC value is 0.94, which is pretty decent for a simple model.
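One thing to note is that type="response" returns hard class labels, so the ROC curve above is built from 0/1 predictions rather than probabilities. A minimal sketch of an alternative using the predicted class probabilities (assuming the same fit and test_set objects) might look like this:
# Use the predicted probability of class "1" instead of the hard class label
prob <- predict(fit, test_set[,-6], type="prob")[,2]
pred_prob <- prediction(prob, test_set$converted)
perf_prob <- performance(pred_prob, measure = "tpr", x.measure = "fpr")
plot(perf_prob, col="blue")
# Extract the AUC as a plain number
unlist(performance(pred_prob, "auc")@y.values)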
Next, we look at the partial dependence plots to see how the different values of each explanatory variable relate to the conversion rate. These are plotted with no interaction from other variables. The way to interpret them is by observing the relative differences of the y-values in these plots: the higher the value, the more favorable to conversion, in this case.
par(mfrow=c(2,2))
partialPlot(fit, train_set, country, 1)
partialPlot(fit, train_set, age, 1)
partialPlot(fit, train_set, new_user, 1)
partialPlot(fit, train_set, source, 1)
These plots help us gain more insight on the conversion and based on these we can arrive at some suggestions to improve the conversion rate.
So, these suggestions are based on a simple random forest model with default parameters. There are different ways of approaching this problem, and sometimes they might change the explanatory interpretations that we obtained. So, we should always take the suggestions with a grain of salt unless they are echoed across all of the modeling approaches.
Since this is an imbalanced dataset, the “sampling” was done externally prior to fitting the data in the algorithm, as the number of rows is large enough to handle this. There are other options for doing this. randomForest has a hyperparameter called sampsize, where we can specify the size of the sample to be drawn from each class. Another way to address it is through the hyperparameter classwt, where we can give the prior class weights so that the model adjusts the fit accordingly.
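As a rough sketch of that route (fit on the full, imbalanced data; the per-class sample sizes and class weights below are illustrative numbers, not values taken from this write-up):
# Downsample inside the algorithm: draw a fixed number of rows per class for each tree
fit_sampsize <- randomForest(converted ~ ., data = conv,
                             strata = conv$converted,
                             sampsize = c(5000, 5000))  # illustrative per-class sample sizes
# Alternatively, supply prior class weights so the minority class carries more weight
fit_classwt <- randomForest(converted ~ ., data = conv,
                            classwt = c(1, 10))  # illustrative weights, upweighting class 1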