Introduction

Following the conclusions from the Exploratory Analysis of the data set, this document outlines a step-by-step process that can be followed to research, build and test various ML models, with the aim of maximizing performance in predicting the conversion rate for each possible age/gender/interest segment.

This document is designed for demonstration purposes only and does not include production-grade code. Its main purpose is to present one possible approach to the entry stage of Machine Learning model selection. Neither research into data pre-processing nor low-level parameter tuning of the proposed algorithms has been included at this stage.

Loading the dataset

The data set provided for this analysis has been downloaded and placed on a local drive within the R working directory.

df<- read.csv(".//Datasets/Facebook/conversion_data.csv", header = TRUE, sep=",")

Loading the Libraries

## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
## Loading required package: lattice
## Loading required package: ggplot2
## 
## Attaching package: 'caretEnsemble'
## The following object is masked from 'package:ggplot2':
## 
##     autoplot
## Loading required package: foreach
## Loading required package: iterators
## Loading required package: parallel
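The library() calls themselves are not echoed in this output. Judging from the attach messages above and from the functions used later (createDataPartition, caretList, registerDoParallel), the set of packages is presumably the one below. This is an inferred sketch, not the original chunk, and the guard around the calls is an addition.

```r
# Packages inferred from the attach messages; treat the list as an assumption.
pkgs <- c("dplyr", "caret", "caretEnsemble", "doParallel")

# Report packages that are not installed instead of failing half-way through.
not_installed <- pkgs[!vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)]

if (length(not_installed) > 0) {
  message("Install first: ", paste(not_installed, collapse = ", "))
} else {
  # Attach the packages in the order suggested by the messages above.
  invisible(lapply(pkgs, library, character.only = TRUE))
}
```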

Feature selection

Let’s see the structure of our data set

str(df)
## 'data.frame':    1143 obs. of  11 variables:
##  $ ad_id              : int  708746 708749 708771 708815 708818 708820 708889 708895 708953 708958 ...
##  $ xyz_campaign_id    : int  916 916 916 916 916 916 916 916 916 916 ...
##  $ fb_campaign_id     : int  103916 103917 103920 103928 103928 103929 103940 103941 103951 103952 ...
##  $ age                : Factor w/ 4 levels "30-34","35-39",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ gender             : Factor w/ 2 levels "F","M": 2 2 2 2 2 2 2 2 2 2 ...
##  $ interest           : int  15 16 20 28 28 29 15 16 27 28 ...
##  $ Impressions        : int  7350 17861 693 4259 4133 1915 15615 10951 2355 9502 ...
##  $ Clicks             : int  1 2 0 1 1 0 3 1 1 3 ...
##  $ Spent              : num  1.43 1.82 0 1.25 1.29 ...
##  $ Total_Conversion   : int  2 2 1 1 1 1 1 1 1 1 ...
##  $ Approved_Conversion: int  1 0 0 0 1 1 0 1 0 0 ...

We have been asked to design a model which would predict the conversion rates. Here, the conversion rate is defined as the number of users who clicked the ad, expressed as a percentage of the ad’s overall number of impressions.

What sort of predicting challenge are we facing?

Since we are expected to build a model which predicts the value of a conversion rate, we are dealing with a regression problem.

This gives us a first, clear indication of which kind of ML algorithms we should focus our research on, as well as which metrics we will use to evaluate the performance of the predictive algorithms.

Let’s calculate the Conversion Ratio and add it as a feature to our data set.

df <- dplyr::mutate(df, Conversion_Ratio=100*Clicks/Impressions)
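As a quick sanity check of the formula, the sketch below applies it to the first five Impressions/Clicks pairs shown in the str(df) output above. The `toy` data frame and the zero-impressions guard are illustrative additions, not part of the original pipeline.

```r
# First five rows of Impressions and Clicks, copied from the str(df) output.
toy <- data.frame(
  Impressions = c(7350, 17861, 693, 4259, 4133),
  Clicks      = c(1, 2, 0, 1, 1)
)

# Same formula as above; rows with zero impressions (an assumed edge case)
# get NA instead of NaN or Inf.
toy$Conversion_Ratio <- ifelse(
  toy$Impressions > 0,
  100 * toy$Clicks / toy$Impressions,
  NA_real_
)

print(round(toy$Conversion_Ratio, 4))
```

The rounded values agree with the first entries of Conversion_Ratio displayed later by str(df.mod).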

Factor Attributes

Now, we obviously want the three factor attributes (age, gender and interest) to contribute to our predictive model, so let’s include them in our final training data set.

On the other hand, xyz_campaign_id, which has not been mentioned in the requirement definition, will likely not contribute much to the model performance and may introduce an unnecessary level of noise. Let’s remove it from our training set.

Numeric Attributes

We also have a set of numeric attributes which will be vital to our model. Let’s group them together and check the correlations between them.

Building the numeric attributes correlation matrix

df_numeric <- dplyr::select(df, Impressions, Conversion_Ratio, Clicks, Spent, Total_Conversion)
correlations <- cor(df_numeric)
print(correlations)
##                  Impressions Conversion_Ratio    Clicks     Spent
## Impressions       1.00000000       0.07591386 0.9485141 0.9703862
## Conversion_Ratio  0.07591386       1.00000000 0.1592898 0.1409520
## Clicks            0.94851414       0.15928979 1.0000000 0.9929063
## Spent             0.97038617       0.14095198 0.9929063 1.0000000
## Total_Conversion  0.81283760       0.01072070 0.6946324 0.7253794
##                  Total_Conversion
## Impressions             0.8128376
## Conversion_Ratio        0.0107207
## Clicks                  0.6946324
## Spent                   0.7253794
## Total_Conversion        1.0000000

What we can observe from the correlation matrix is that:

  1. Conversion_Ratio’s top correlates are Clicks (0.159), Spent (0.1409) and Impressions (0.0759). This is perfectly understandable given the way the Conversion Ratio has been constructed (100*Clicks/Impressions).
  2. Impressions and Clicks are highly correlated (0.9485).
  3. Spent is highly correlated with both Impressions and Clicks (0.97 and 0.99). It is a good candidate to exclude from the feature list.
  4. Total_Conversion is highly correlated with Impressions (0.81) and a bit less with Clicks (0.69).

When there are multiple (linearly) correlated features (as is the case with our data set), the model becomes unstable, meaning that small changes in the data can cause large changes in the model (i.e. coefficient values), making model interpretation very difficult.
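One way to make this check systematic is to list every feature pair whose absolute correlation exceeds a threshold. The sketch below re-enters the rounded values from the matrix printed above (leaving out the target, Conversion_Ratio). The 0.9 cut-off and the variable names are assumptions chosen for illustration.

```r
features <- c("Impressions", "Clicks", "Spent", "Total_Conversion")

# Rounded correlations copied from the printed matrix.
corr <- matrix(
  c(1.0000, 0.9485, 0.9704, 0.8128,
    0.9485, 1.0000, 0.9929, 0.6946,
    0.9704, 0.9929, 1.0000, 0.7254,
    0.8128, 0.6946, 0.7254, 1.0000),
  nrow = 4, byrow = TRUE, dimnames = list(features, features)
)

threshold <- 0.9  # assumed cut-off

# Look at the upper triangle only so that each pair is reported once.
hits <- which(abs(corr) > threshold & upper.tri(corr), arr.ind = TRUE)
high_pairs <- data.frame(
  feature_1   = rownames(corr)[hits[, "row"]],
  feature_2   = colnames(corr)[hits[, "col"]],
  correlation = corr[hits]
)
print(high_pairs)
```

With this cut-off, the flagged pairs all involve Impressions, Clicks and Spent, which is consistent with keeping only one of the three as a predictor.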

Moving forward, let’s include Impressions as the main numeric feature in the models we are going to train to predict the Conversion Ratio.

The rationale is as follows: when both the number of impressions and the number of clicks are known, getting the Conversion Ratio is a matter of a simple mathematical transformation (100*Clicks/Impressions), and there is no need for a predictive model. The challenge is to build a model which will predict the Conversion Ratio based on the factor attributes and the number of impressions an individual ad gets.

df$interest <- as.factor(df$interest)
df.mod <- dplyr::select(df, age, gender, Impressions, Conversion_Ratio)

Our final data set, on which we are going to train our models, includes the following columns (interest is not among them):

names(df.mod)
## [1] "age"              "gender"           "Impressions"     
## [4] "Conversion_Ratio"

The structure of the data set is displayed below:

str(df.mod)
## 'data.frame':    1143 obs. of  4 variables:
##  $ age             : Factor w/ 4 levels "30-34","35-39",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ gender          : Factor w/ 2 levels "F","M": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Impressions     : int  7350 17861 693 4259 4133 1915 15615 10951 2355 9502 ...
##  $ Conversion_Ratio: num  0.0136 0.0112 0 0.0235 0.0242 ...

Generating the test and training data sets from our input data set

set.seed(234)
index <- createDataPartition(df.mod$Conversion_Ratio, p = .7,list = FALSE)
df.train <- df.mod[index,]
df.test <- df.mod[-index,]
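For a numeric outcome, createDataPartition() samples within groups defined by percentiles of the outcome, so that both subsets cover a similar range of values. A minimal base-R sketch of that idea follows. It runs on synthetic data and is not caret's actual implementation; `y`, the five quantile groups and the index names are assumptions.

```r
set.seed(234)
y <- runif(100)  # hypothetical numeric outcome

# Bin the outcome into (up to) five quantile groups.
breaks <- unique(quantile(y, probs = seq(0, 1, length.out = 6)))
groups <- cut(y, breaks = breaks, include.lowest = TRUE)

# Draw about 70% of the row indices from every group.
train_idx <- unlist(lapply(split(seq_along(y), groups), function(idx) {
  idx[sample.int(length(idx), size = round(0.7 * length(idx)))]
}), use.names = FALSE)
test_idx <- setdiff(seq_along(y), train_idx)

c(train = length(train_idx), test = length(test_idx))
```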

Regression Model Candidates selection

For this exercise, I selected the Linear Regression, Support Vector Machine, Random Forest and K-Nearest Neighbors.

Resampling Methods To Estimate Model Accuracy

5-fold Cross validation with 3 repeats will be used to estimate the model accuracy from the training set.

The following command sets up the parameters for the caret train() function to perform this process:

trainControl <- trainControl(method = "repeatedcv", number = 5, repeats = 3)
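To make the resampling scheme concrete: 5 folds repeated 3 times produce 15 train/hold-out splits, and each model is fitted once per split. The base-R sketch below builds such splits by plain random assignment for 803 rows (the training-set size reported later in the model summaries). It only illustrates the idea; caret builds its folds internally and balances them on the outcome, so its fold sizes can differ.

```r
set.seed(234)
n <- 803; k <- 5; repeats <- 3

# One list element per resample, holding the row indices kept for training.
splits <- list()
for (r in seq_len(repeats)) {
  # Shuffle a balanced vector of fold labels for this repeat.
  fold_of_row <- sample(rep(seq_len(k), length.out = n))
  for (f in seq_len(k)) {
    splits[[sprintf("Fold%d.Rep%d", f, r)]] <- which(fold_of_row != f)
  }
}

length(splits)          # number of resamples
range(lengths(splits))  # training rows per resample
```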

Model performance evaluation metrics

RMSE and Rsquared will be used to evaluate the performance of the created models.

Rsquared (also called the coefficient of determination) provides a goodness-of-fit measure for the predictions to the observations. It takes a value between 0 and 1, for no fit and perfect fit respectively.

RMSE (root mean squared error) is the square root of the average squared deviation of the predictions from the observations. It is useful for getting a rough idea of how well (or not) an algorithm is doing, in the units of the output variable.
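Both metrics are easy to compute by hand, which helps when checking reported numbers. The sketch below uses made-up vectors `obs` and `pred` (assumptions for illustration). It also shows two common Rsquared definitions, the squared correlation and 1 - SSE/SST, which generally differ. To my knowledge caret's default resampling summary is based on the former, while the test-set calculation later in this document uses the latter.

```r
obs  <- c(1, 2, 3, 4)      # hypothetical observed values
pred <- c(1.5, 2, 2.5, 4)  # hypothetical predictions

# Root mean squared error, in the units of the outcome.
rmse_value <- sqrt(mean((obs - pred)^2))

# R-squared as the squared Pearson correlation of predictions and observations.
r2_cor <- cor(pred, obs)^2

# R-squared as 1 - SSE/SST (can turn negative for a very poor model).
r2_sse <- 1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)

round(c(RMSE = rmse_value, R2_cor = r2_cor, R2_sse = r2_sse), 4)
```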

Package selection

The R packages dplyr, caret and caretEnsemble will be used to prepare the data and train the models. doParallel will be used for multi-core processing of the models.

Train the models

cl <- detectCores()
registerDoParallel(cl)
set.seed(234)
algo_set <- caretList(Conversion_Ratio ~., data=df.train, methodList = c("svmRadial", "rf", "knn", "lm" ), trControl = trainControl)

Collecting the resampling results of the trained models.

set.seed(234)
results_caretList <- resamples(algo_set)

Visualising the models performance

Box-and-whisker plot with RMSE and Rsquared for each trained model.

# x and y are separate components of scales; nesting y inside x makes lattice
# warn about "Invalid or ambiguous component names: y".
scales <- list(x=list(relation="free"), y=list(relation="free"))
bwplot(results_caretList, scales=scales)

We can clearly see that the Random Forest algorithm outperforms the other algorithms on both the RMSE and R^2 performance metrics.

Dot-plot with RMSE and Rsquared for each trained model.

dotplot(results_caretList, scales=scales)

The parallel plot shows the RMSE of each resample from the 5-fold cross-validation, repeated 3 times, for each algorithm.

parallelplot(results_caretList)

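Lines that move together across resamples indicate algorithms whose errors are correlated; lines that move in opposite directions point to algorithms that may complement each other in a potential stacking process.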

Scatter plot matrix of the RMSE.

splom(results_caretList)

This plot is invaluable when working with different models and considering whether the results of two algorithms are correlated. Weak correlations indicate good candidates for stacking, that is, for being combined in an ensemble prediction. SVM and Random Forest look like a better pair to try together in a stacking process than the SVM and linear regression (lm) pair.

Final Model Selection

Summarizing all the trained models:

print(algo_set)
## $svmRadial
## Support Vector Machines with Radial Basis Function Kernel 
## 
## 803 samples
##   3 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times) 
## Summary of sample sizes: 643, 643, 643, 640, 643, 643, ... 
## Resampling results across tuning parameters:
## 
##   C     RMSE        Rsquared 
##   0.25  0.01043154  0.1893258
##   0.50  0.01046821  0.1887852
##   1.00  0.01051069  0.1875694
## 
## Tuning parameter 'sigma' was held constant at a value of 0.2761751
## RMSE was used to select the optimal model using  the smallest value.
## The final values used for the model were sigma = 0.2761751 and C = 0.25. 
## 
## $rf
## Random Forest 
## 
## 803 samples
##   3 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times) 
## Summary of sample sizes: 643, 643, 643, 640, 643, 643, ... 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE         Rsquared 
##   2     0.008951066  0.4038149
##   3     0.008900894  0.4061091
##   5     0.009539474  0.3488432
## 
## RMSE was used to select the optimal model using  the smallest value.
## The final value used for the model was mtry = 3. 
## 
## $knn
## k-Nearest Neighbors 
## 
## 803 samples
##   3 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times) 
## Summary of sample sizes: 643, 643, 643, 640, 643, 643, ... 
## Resampling results across tuning parameters:
## 
##   k  RMSE        Rsquared 
##   5  0.01081044  0.1706232
##   7  0.01052146  0.1951113
##   9  0.01042213  0.2029517
## 
## RMSE was used to select the optimal model using  the smallest value.
## The final value used for the model was k = 9. 
## 
## $lm
## Linear Regression 
## 
## 803 samples
##   3 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times) 
## Summary of sample sizes: 643, 643, 643, 640, 643, 643, ... 
## Resampling results:
## 
##   RMSE        Rsquared 
##   0.01045033  0.1794489
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE
##  
## 
## attr(,"class")
## [1] "caretList"

The Random Forest algorithm offers the highest predictive accuracy, with

  • RMSE = 0.008900894
  • Rsquared = 0.4061091

Let’s move on with the RF model to predict and illustrate the predictions on the test data set.

set.seed(234)
predictions.rf <- predict(algo_set$rf, newdata = df.test[,1:4])
## Loading required package: randomForest
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
## The following object is masked from 'package:dplyr':
## 
##     combine
compare_predictions <- data.frame(
  Test.Data=df.test$Conversion_Ratio,
  Predictions=predictions.rf)

Conversion Ratio values from the test data set vs the predicted values by the Random Forest Algorithm.

g <- ggplot(data=compare_predictions, aes(x=Test.Data, y=Predictions))
g+ geom_smooth() + 
  geom_point(colour="Lightblue") +
  ggtitle("Conversion Rates Test vs Predictions")

Calculating the Rsquared value for the test and predicted values.

# Rsquared computed as 1 - SSE/SST on the test set
SSE <- sum((compare_predictions$Test.Data - compare_predictions$Predictions) ^ 2)
SST <- sum((compare_predictions$Test.Data - mean(compare_predictions$Test.Data)) ^ 2)
Rsquared <- 1 - SSE/SST
paste("Rsquared = ",Rsquared,sep = "")
## [1] "Rsquared = 0.345979766506341"

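Calculating the RMSE value for the test and predicted values.

# Root mean squared error of a vector of prediction errors
rmse <- function(error)
{
    sqrt(mean(error^2))
}

error <- compare_predictions$Test.Data - compare_predictions$Predictions

# Store the result under a separate name so the rmse() function is not overwritten
rmse_value <- rmse(error)
paste("RMSE = ", rmse_value, sep = "")
## [1] "RMSE = 0.00927762185203161"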

Conclusions

The finally selected algorithm delivers key evaluation metrics that are average at best. The reported RMSE of 0.00927 is large relative to the mean Conversion Ratio in the test data set (0.0164), which indicates a substantial typical deviation of the predictions from the observed values. The Rsquared value of 0.3459 gives a slightly more promising picture of the goodness of fit of the predictions to the observations, on a scale from 0 (no fit) to 1 (perfect fit).
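To put the RMSE in proportion, it can be expressed relative to the mean of the outcome. The sketch below simply re-uses the two figures quoted above (the test-set RMSE and the mean Conversion Ratio of 0.0164); the variable names are illustrative.

```r
rmse_test  <- 0.00927762185203161  # RMSE reported for the test set
mean_ratio <- 0.0164               # mean Conversion Ratio quoted in the text

# RMSE as a share of the mean outcome, in percent.
relative_rmse <- rmse_test / mean_ratio
round(100 * relative_rmse, 1)
```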

The process above was designed to present a first-stage selection procedure. Finalizing the model is far from finished. There are many potential ways to research and improve the predictive performance of the model. Some of the areas which should be considered in the next research stages are:

  1. Data Pre-Processing (e.g. data normalization, data standardization, Box-Cox or Yeo-Johnson transforms)
  2. Parameter tuning.
  3. Ensemble and stacking techniques.
  4. Considering multi-model structures with dedicated models built for different classes of observations.
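As an illustration of item 1, the sketch below applies min-max normalization and z-score standardization in base R to the first ten Impressions values from the str(df) output. It is only a hand-rolled sketch; the variable names are assumptions, and in the actual pipeline caret's own pre-processing options would be the natural place for this step.

```r
# First ten Impressions values, copied from the str(df) output.
impressions <- c(7350, 17861, 693, 4259, 4133, 1915, 15615, 10951, 2355, 9502)

# Normalization: rescale to the [0, 1] interval.
normalized <- (impressions - min(impressions)) /
  (max(impressions) - min(impressions))

# Standardization: centre on 0 with unit standard deviation.
standardized <- (impressions - mean(impressions)) / sd(impressions)

round(range(normalized), 2)
round(c(mean = mean(standardized), sd = sd(standardized)), 2)
```

In practice the scaling parameters would be estimated on the training set only and then applied unchanged to the test set.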

Multi-model structures can be built using the group_by() and do() functions from the dplyr package. They make it possible to group the data set by, e.g., age group and to generate a predictive model for each individual age group.

The construction of the group_by() and do() calls is presented below:

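# detectCores() needs no argument here; register one worker per core
cl <- detectCores()
registerDoParallel(cl)
# train() expects the resampling settings in its trControl argument
models_by_age_rf <- df.train %>%
  group_by(age) %>%
  do(mod=train(Conversion_Ratio ~., data = ., trControl=trainControl, metric="RMSE", method="rf"))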
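The same split-apply pattern can be sketched without dplyr or caret, which makes the structure easier to see. The example below fits one plain linear model per age group on synthetic data. The data frame `toy` and its values are invented for illustration, and lm() merely stands in for the Random Forest used above.

```r
set.seed(234)

# Synthetic stand-in for df.train (values are made up).
toy <- data.frame(
  age              = rep(c("30-34", "35-39"), each = 20),
  Impressions      = round(runif(40, min = 500, max = 20000)),
  Conversion_Ratio = runif(40, min = 0, max = 0.05)
)

# Split by age group and fit one model per group.
models_by_age <- lapply(split(toy, toy$age), function(group_data) {
  lm(Conversion_Ratio ~ Impressions, data = group_data)
})

names(models_by_age)

# Predict for a new ad in one age group using that group's model.
predict(models_by_age[["30-34"]], newdata = data.frame(Impressions = 10000))
```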