This dataset is taken from the ‘Collection of Data Science Take-home Challenges’ book. This is a really simple problem: to analyze the conversion rate of a website, i.e. to predict which users will buy something from the website. The variables are quite straightforward: the country in which the user is browsing, total pages visited, the source that directed them to the website, whether the user is new or returning, and age. The output is given as ‘converted’ in the form of 0 or 1.
I am exploring this exercise mainly to illustrate how to approach a simple classification problem, and to explain the hyperparameters and associated plots and metrics of the random forest approach.
The libraries I am using for this analysis are randomForest, caret, and ROCR. I will explain more when I get to the actual usage of the commands from these libraries. The first step is to load the CSV file and analyze its contents.
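Assuming these packages are already installed, they can be attached up front:
# Attach the packages used throughout this analysis
library(randomForest)  # random forest fitting, variable importance, partial dependence
library(caret)         # createDataPartition for the train/test split
library(ROCR)          # ROC curve and AUC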
conv <- read.csv('conversion_data.csv')
head(conv)
## country age new_user source total_pages_visited converted
## 1 UK 25 1 Ads 1 0
## 2 US 23 1 Seo 5 0
## 3 US 28 1 Seo 4 0
## 4 China 39 1 Seo 5 0
## 5 US 30 1 Seo 6 0
## 6 US 31 0 Seo 1 0
This gives the total number of variables we are dealing with in this problem. Right away we can identify the ‘converted’ column as the target variable. But this alone does not explain everything about the data. We also need to know the type and distribution of the explanatory variables.
summary(conv)
## country age new_user source
## China : 76602 Min. : 17.00 Min. :0.0000 Ads : 88740
## Germany: 13056 1st Qu.: 24.00 1st Qu.:0.0000 Direct: 72420
## UK : 48450 Median : 30.00 Median :1.0000 Seo :155040
## US :178092 Mean : 30.57 Mean :0.6855
## 3rd Qu.: 36.00 3rd Qu.:1.0000
## Max. :123.00 Max. :1.0000
## total_pages_visited converted
## Min. : 1.000 Min. :0.00000
## 1st Qu.: 2.000 1st Qu.:0.00000
## Median : 4.000 Median :0.00000
## Mean : 4.873 Mean :0.03226
## 3rd Qu.: 7.000 3rd Qu.:0.00000
## Max. :29.000 Max. :1.00000
We can see that, even though the ‘converted’ and ‘new_user’ variables should be factors, they are stored as integers in the dataset, which is why the summary command generates a quantile distribution for them. So, we need to convert them to the appropriate type.
conv$converted <- as.factor(conv$converted)
conv$new_user <- as.factor(conv$new_user)
Also, we can see that the ‘age’ variable has a maximum value of 123! This could be a mistake or a troll. Either way, let us limit the maximum age to 100 to avoid any outliers or unexplainable values.
conv <- subset(conv, age <= 100)
Next, we could plot the variables to see their distribution.
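A minimal sketch of the plotting code for the full dataset (analogous to what is used later for the converted subset) might look like this:
# Distribution of each explanatory variable in the full dataset
par(mfrow=c(2,2))
plot(conv$country, col="blue", main="Country")
plot(conv$source, col="blue", main="Sources directing to the website")
hist(conv$age, col="blue", main="Age", xlab="")
hist(conv$total_pages_visited, col="blue", main="Total pages visited", xlab="")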
Users mainly originate from the United States, followed by China and the other countries. We can keep these graphs in mind as we explore the results from the prediction later.
Second, we need to see if the data is balanced. By that, I mean whether the 0 and 1 classes in the converted column are equally distributed.
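A quick way to check this (a small sketch on the conv data frame loaded above):
# Raw counts and proportions of the two classes in 'converted'
table(conv$converted)
prop.table(table(conv$converted))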
It is clear that this is an imbalanced dataset, which makes sense because not everyone who is browsing the website is buying from it; there are many “lurkers”. There are two main ways to address this: (a) making the number of non-converted rows the same as the number of converted rows, or (b) changing the hyperparameters of the random forest algorithm.
The conversion rate is the percentage of people who bought something out of the total number of people who browsed the site.
I am following the first approach to ‘balance’ the dataset. I am splitting the dataset into two parts: converted and not converted. From these, I can estimate the conversion rate.
Then, I will be “sampling”, or randomly choosing, rows from the not-converted dataset to match the number of rows in the converted dataset. Finally, these two datasets will be joined and shuffled.
convert <- subset(conv, converted == 1)
no_convert <- subset(conv, converted == 0)
conversion_rate <- nrow(convert) / nrow(conv)
print(conversion_rate)
## [1] 0.03225194
Here, the conversion rate falls within the traditional rule-of-thumb range (i.e. between 2% and 5%), which is a decent figure for a website.
# Downsample the non-converted rows to match the number of converted rows
final_no_convert <- no_convert[sample(nrow(no_convert), nrow(convert)),]
# Combine the two subsets and shuffle the rows
final_convdata <- rbind(convert, final_no_convert)
final_convdata <- final_convdata[sample(nrow(final_convdata)),]
We can also plot the variables from the converted dataset to see how it differs from the overall dataset.
par(mfrow=c(2,2))
plot(convert$country, col="green", main="Country")
plot(convert$source, col="green", main="Sources directing to the website")
hist(convert$age, col="green", main="Age", xlab="")
hist(convert$total_pages_visited, col="green", main="Total pages visited", xlab="")
Here, we see an interesting difference in the Country plot. Even though a lot of lurkers are from China, among the converted users their share is really low!
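A quick way to quantify this (a small sketch using the conv and convert data frames from above) is to compare the country proportions in the two datasets:
# Share of each country among all users vs. among converted users
round(prop.table(table(conv$country)), 3)
round(prop.table(table(convert$country)), 3)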
In order to fit the random forest model, the first step is to divide the dataset into train and test sets. createDataPartition from the caret package is used for splitting the data into a training set (80%) and a test set (20%).
set.seed(1)
train_index <- createDataPartition(y = final_convdata$converted, p = 0.8, list=FALSE)
train_set <- final_convdata[train_index,]
test_set <- final_convdata[-train_index,]
We fit the random forest model using its default parameters.
fit <- randomForest(converted~., train_set)
print(fit)
##
## Call:
## randomForest(formula = converted ~ ., data = train_set)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 6.31%
## Confusion matrix:
## 0 1 class.error
## 0 7655 504 0.06177228
## 1 525 7634 0.06434612
We can see there are two kinds of errors in the summary. “class.error” is the classification error for each class in the model. The more interesting metric is the OOB error, which is the out-of-bag error. For each tree, the random forest algorithm samples rows randomly with replacement to fit that tree (let’s call this a “bag” of sample data). Each tree is then tested on the rows it never saw (the “out of bag” data), and these out-of-bag predictions are aggregated across all the trees to estimate the error rate. Because this built-in test happens for every tree, we do not need a separate validation set for the random forest approach.
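As a small sketch using the fit object above, we can also inspect how the OOB and per-class error rates stabilize as trees are added to the forest:
# Error rates (OOB and per class) stored tree by tree in the fitted object
head(fit$err.rate)
# Plot error rates against the number of trees
plot(fit, main="Error rate vs. number of trees")
legend("topright", legend=colnames(fit$err.rate), lty=1:3, col=1:3)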
Next, we see which variables have been most important in ‘classifying’ the conversions. The following graph shows the “MeanDecreaseGini” of each explanatory variable. The ‘Gini index’ (or ‘Gini impurity’) indicates how heterogeneous or homogeneous the data in a node is: the higher the Gini impurity, the more mixed the classes are, i.e. the harder they are to separate. The graph shows the mean DECREASE in Gini attributable to each variable, so the variable with the highest decrease is the one whose splits separate the classes most cleanly. In other words, it is the most informative attribute of the conversion data.
varImpPlot(fit)
In this case, we see ‘total pages visited’ as the most informative attribute for this dataset.
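If we want the numbers behind the plot, the importance() function from the randomForest package returns the MeanDecreaseGini values directly:
# Numeric variable importance behind varImpPlot
importance(fit)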
predicted <- predict(fit, test_set[,-6])
The cross-classification table (confusion matrix) shows the true and false positives and negatives on the test dataset.
table(predicted, test_set$converted)
##
## predicted 0 1
## 0 1930 123
## 1 109 1916
The “AUC”, or area-under-the-curve value, summarizes the ROC curve, which plots the true positive rate against the false positive rate. In general, the higher the area under the curve, the better the performance of the model. The ‘ROCR’ package is used to plot this curve.
predictions <- as.numeric(predict(fit, test_set[,-6], type="response"))
pred <- prediction(predictions, test_set$converted)
perf <- performance(pred, measure = "tpr", x.measure = "fpr")
plot(perf, col=rainbow(10))
auc<- performance(pred,"auc")
print(auc)
## An object of class "performance"
## Slot "x.name":
## [1] "None"
##
## Slot "y.name":
## [1] "Area under the ROC curve"
##
## Slot "alpha.name":
## [1] "none"
##
## Slot "x.values":
## list()
##
## Slot "y.values":
## [[1]]
## [1] 0.9431094
##
##
## Slot "alpha.values":
## list()
Here, the AUC value is 0.94, which is pretty decent for a simple model.
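One thing to note is that type="response" returns hard class labels, so the ROC curve above is built from 0/1 predictions rather than probabilities. A minimal sketch of an alternative using the predicted class probabilities (assuming the same fit and test_set objects) might look like this:
# Use the predicted probability of class "1" instead of the hard class label
prob <- predict(fit, test_set[,-6], type="prob")[,2]
pred_prob <- prediction(prob, test_set$converted)
perf_prob <- performance(pred_prob, measure = "tpr", x.measure = "fpr")
plot(perf_prob, col="blue")
# Extract the AUC as a plain number
unlist(performance(pred_prob, "auc")@y.values)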
Next, we look at the partial dependence plots to see how the different values of each explanatory variable relate to the conversion rate. These are plotted with no interaction from other variables. The way to interpret them is by observing the relative differences of the y-values in these plots: the higher the value, the more favorable to conversion, in this case.
par(mfrow=c(2,2))
partialPlot(fit, train_set, country, 1)
partialPlot(fit, train_set, age, 1)
partialPlot(fit, train_set, new_user, 1)
partialPlot(fit, train_set, source, 1)
These plots help us gain more insight on the conversion and based on these we can arrive at some suggestions to improve the conversion rate.
So, these suggestions are based on a simple random forest model with default parameters. There are different ways of approaching this problem, and sometimes they might change the explanatory interpretations that we obtained. So, we should always take the suggestions with a grain of salt unless they are echoed across all of the modeling approaches.
Since this is an imbalanced dataset, the “sampling” was done externally prior to fitting the data in the algorithm, as the number of rows is large enough to handle this. There are other options for doing this. randomForest has a hyperparameter called sampsize, where we can specify the size of the sample to be drawn from each class. Another way to address it is through the hyperparameter classwt, where we can give the prior class weights so that the model adjusts the fit accordingly.
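As a rough sketch of that route (fit on the full, imbalanced data; the per-class sample sizes and class weights below are illustrative numbers, not values taken from this write-up):
# Downsample inside the algorithm: draw a fixed number of rows per class for each tree
fit_sampsize <- randomForest(converted ~ ., data = conv,
                             strata = conv$converted,
                             sampsize = c(5000, 5000))  # illustrative per-class sample sizes
# Alternatively, supply prior class weights so the minority class carries more weight
fit_classwt <- randomForest(converted ~ ., data = conv,
                            classwt = c(1, 10))  # illustrative weights, upweighting class 1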