Following the conclusions of the Exploratory Analysis of the data set, this document outlines a step-by-step process for researching, building and testing various ML models, with the goal of maximizing performance in predicting the conversion rate for each possible age/gender/interest segment.
This document is for demonstration purposes only and does not include production-grade code. Its main purpose is to present one possible approach to the entry stage of Machine Learning model selection. Neither data pre-processing research nor low-level parameter tuning of the proposed algorithms is included at this stage.
The data set provided for this analysis has been downloaded and placed on a local drive within the R working directory.
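# Load the data set from the R working directory
df <- read.csv(".//Datasets/Facebook/conversion_data.csv", header = TRUE, sep = ",")
# Packages used throughout this document
library(dplyr)
library(caret)
library(caretEnsemble)
library(doParallel)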
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
## Loading required package: lattice
## Loading required package: ggplot2
##
## Attaching package: 'caretEnsemble'
## The following object is masked from 'package:ggplot2':
##
## autoplot
## Loading required package: foreach
## Loading required package: iterators
## Loading required package: parallel
Let’s see the structure of our data set:
str(df)
## 'data.frame': 1143 obs. of 11 variables:
## $ ad_id : int 708746 708749 708771 708815 708818 708820 708889 708895 708953 708958 ...
## $ xyz_campaign_id : int 916 916 916 916 916 916 916 916 916 916 ...
## $ fb_campaign_id : int 103916 103917 103920 103928 103928 103929 103940 103941 103951 103952 ...
## $ age : Factor w/ 4 levels "30-34","35-39",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ gender : Factor w/ 2 levels "F","M": 2 2 2 2 2 2 2 2 2 2 ...
## $ interest : int 15 16 20 28 28 29 15 16 27 28 ...
## $ Impressions : int 7350 17861 693 4259 4133 1915 15615 10951 2355 9502 ...
## $ Clicks : int 1 2 0 1 1 0 3 1 1 3 ...
## $ Spent : num 1.43 1.82 0 1.25 1.29 ...
## $ Total_Conversion : int 2 2 1 1 1 1 1 1 1 1 ...
## $ Approved_Conversion: int 1 0 0 0 1 1 0 1 0 0 ...
We have been asked to design a model that predicts conversion rates. In this document, the conversion rate is defined as the number of users who clicked the ad, expressed as a percentage of the ad’s total number of impressions.
Since the model is expected to predict a numeric conversion rate value, we are dealing with a regression problem.
This gives us a first, clear indication of which kind of ML algorithms we should focus our research on, as well as which metrics we will use to evaluate the performance of the predictive algorithms.
Let’s calculate the Conversion_Ratio feature and add it to our data set.
df <- dplyr::mutate(df, Conversion_Ratio=100*Clicks/Impressions)
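As a quick sanity check of the formula, the sketch below recomputes the ratio for the first five rows shown in the str(df) output above. The helper name conversion_ratio and its zero-impression guard are assumptions added for illustration; they are not part of the original analysis.

```r
# First five rows of Impressions and Clicks, copied from the str(df) output
impressions <- c(7350, 17861, 693, 4259, 4133)
clicks      <- c(1, 2, 0, 1, 1)

# Hypothetical helper: same formula as the mutate() call, plus a guard that
# returns NA instead of NaN/Inf when an ad has zero impressions
conversion_ratio <- function(clicks, impressions) {
  ifelse(impressions > 0, 100 * clicks / impressions, NA_real_)
}

ratio <- conversion_ratio(clicks, impressions)
print(round(ratio, 4))
```

The rounded values (0.0136, 0.0112, 0, 0.0235, 0.0242) agree with the Conversion_Ratio column printed by str(df.mod) further down.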
Now, we obviously want the three factor attributes (age, gender and interest) to contribute to our predictive model so let’s include them in our final training data set.
On the other hand, xyz_campaign_id was not mentioned in the requirement definition; it will likely contribute little to model performance and may introduce an unnecessary level of noise. Let’s remove it from our training set.
We also have a set of numeric attributes which will be vital to our model. Let’s group them together and check the correlations between them.
df_numeric <- dplyr::select(df, Impressions, Conversion_Ratio, Clicks, Spent, Total_Conversion)
correlations <- cor(df_numeric)
print(correlations)
## Impressions Conversion_Ratio Clicks Spent
## Impressions 1.00000000 0.07591386 0.9485141 0.9703862
## Conversion_Ratio 0.07591386 1.00000000 0.1592898 0.1409520
## Clicks 0.94851414 0.15928979 1.0000000 0.9929063
## Spent 0.97038617 0.14095198 0.9929063 1.0000000
## Total_Conversion 0.81283760 0.01072070 0.6946324 0.7253794
## Total_Conversion
## Impressions 0.8128376
## Conversion_Ratio 0.0107207
## Clicks 0.6946324
## Spent 0.7253794
## Total_Conversion 1.0000000
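The correlation matrix shows that Impressions, Clicks and Spent are very strongly correlated with one another (0.95 to 0.99), that Total_Conversion is also strongly correlated with these three (0.69 to 0.81), and that Conversion_Ratio is only weakly correlated with every other numeric attribute (0.01 to 0.16).
When there are multiple (linearly) correlated features, as is the case with our data set, the model becomes unstable: small changes in the data can cause large changes in the model (i.e. in its coefficient values), which makes model interpretation very difficult.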
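One way to make the multicollinearity concrete is to list every feature pair whose absolute correlation exceeds a threshold. The sketch below uses base R only and rebuilds the matrix from the values printed above (rounded to four decimals); the 0.9 cutoff is an assumption chosen for illustration.

```r
# Correlation values copied from the matrix printed above (rounded)
vars <- c("Impressions", "Conversion_Ratio", "Clicks", "Spent", "Total_Conversion")
correlations <- matrix(c(
  1.0000, 0.0759, 0.9485, 0.9704, 0.8128,
  0.0759, 1.0000, 0.1593, 0.1410, 0.0107,
  0.9485, 0.1593, 1.0000, 0.9929, 0.6946,
  0.9704, 0.1410, 0.9929, 1.0000, 0.7254,
  0.8128, 0.0107, 0.6946, 0.7254, 1.0000
), nrow = 5, byrow = TRUE, dimnames = list(vars, vars))

cutoff <- 0.9  # assumed threshold for "highly correlated"

# Look only at the upper triangle so that each pair is reported once
high <- which(abs(correlations) > cutoff & upper.tri(correlations), arr.ind = TRUE)
high_pairs <- data.frame(
  feature_1   = rownames(correlations)[high[, "row"]],
  feature_2   = colnames(correlations)[high[, "col"]],
  correlation = correlations[high]
)
print(high_pairs)
```

With this cutoff the three flagged pairs all involve Impressions, Clicks and Spent, which supports keeping only one of them as a predictor.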
Moving forward, let’s include Impressions as the main numeric feature in the models we are going to train to predict Conversion Ratio.
The rationale is as follows: once both the number of impressions and the number of clicks are known, getting the Conversion Ratio is a simple arithmetic transformation (100*Clicks/Impressions), and there is no need for a predictive model. The challenge is to build a model that predicts the Conversion Ratio from the factor attributes and the number of Impressions an individual ad gets.
df$interest <- as.factor(df$interest)
df.mod <- dplyr::select(df, age, gender, Impressions, Conversion_Ratio)
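The final data set on which we are going to train our models includes the columns listed below. Note that interest, although converted to a factor above, is not part of the selection, so the models are trained on age, gender and Impressions only.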
names(df.mod)
## [1] "age" "gender" "Impressions"
## [4] "Conversion_Ratio"
The structure of the data set is displayed below:
str(df.mod)
## 'data.frame': 1143 obs. of 4 variables:
## $ age : Factor w/ 4 levels "30-34","35-39",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ gender : Factor w/ 2 levels "F","M": 2 2 2 2 2 2 2 2 2 2 ...
## $ Impressions : int 7350 17861 693 4259 4133 1915 15615 10951 2355 9502 ...
## $ Conversion_Ratio: num 0.0136 0.0112 0 0.0235 0.0242 ...
set.seed(234)
index <- createDataPartition(df.mod$Conversion_Ratio, p = .7,list = FALSE)
df.train <- df.mod[index,]
df.test <- df.mod[-index,]
For this exercise, I selected Linear Regression, Support Vector Machine, Random Forest and K-Nearest Neighbors.
5-fold cross-validation with 3 repeats will be used to estimate model accuracy on the training set.
The following command sets up the parameters for the caret train() function to perform this process:
trainControl <- trainControl(method = "repeatedcv", number = 5, repeats = 3)
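To illustrate what 5-fold cross-validation with 3 repeats amounts to, here is a minimal base R sketch that builds the training indices by hand. It is an approximation for illustration only: caret additionally stratifies the folds on the outcome, so its fold sizes can differ slightly from the ones produced here. The names n_obs, k_folds and resamples_idx are assumptions of this sketch.

```r
set.seed(234)

n_obs   <- 803  # size of the training set, as reported in the model summaries
k_folds <- 5
repeats <- 3

# For every repeat, shuffle the fold labels and hold each fold out once;
# the remaining rows form the training part of that resample
resamples_idx <- list()
for (r in seq_len(repeats)) {
  fold_id <- sample(rep(seq_len(k_folds), length.out = n_obs))
  for (k in seq_len(k_folds)) {
    name <- sprintf("Fold%d.Rep%d", k, r)
    resamples_idx[[name]] <- which(fold_id != k)  # rows used for training
  }
}

length(resamples_idx)          # number of resamples per model
range(lengths(resamples_idx))  # training-set size per resample
```

Each model is therefore fitted and evaluated 15 times, and the reported RMSE and Rsquared are averages over those resamples.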
RMSE and Rsquared will be used to evaluate the performance of the trained models.
Rsquared (also called the coefficient of determination) provides a goodness-of-fit measure for the predictions to the observations. This is a value between 0 and 1 for no fit and perfect fit respectively.
RMSE (root mean squared error) is the square root of the average squared deviation of the predictions from the observations. It is useful for getting a rough idea of how well (or not) an algorithm is doing, in the units of the output variable.
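The two metrics can be written down in a few lines of base R. The sketch below uses made-up toy vectors and hypothetical function names (rmse_fn, r2_corr, r2_trad). It shows two common definitions of Rsquared: to the best of my knowledge, caret's default resampling summary reports the squared correlation between predictions and observations, whereas the test-set evaluation later in this document uses 1 - SSE/SST.

```r
# Toy observations and predictions, made up for illustration only
obs  <- c(0.010, 0.020, 0.000, 0.030, 0.015)
pred <- c(0.012, 0.018, 0.005, 0.022, 0.016)

# RMSE: square root of the mean squared prediction error
rmse_fn <- function(obs, pred) sqrt(mean((obs - pred)^2))

# Rsquared as the squared Pearson correlation, always between 0 and 1
r2_corr <- function(obs, pred) cor(obs, pred)^2

# Rsquared as 1 - SSE/SST; this version can drop below 0 for poor predictions
r2_trad <- function(obs, pred) {
  1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)
}

c(RMSE    = rmse_fn(obs, pred),
  R2_corr = r2_corr(obs, pred),
  R2_trad = r2_trad(obs, pred))
```

The two Rsquared variants coincide only when the predictions are unbiased and correctly scaled, so cross-validated and test-set Rsquared values computed with different definitions are not directly comparable.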
The R packages dplyr, caret and caretEnsemble will be used to prepare the data and train the models. doParallel will be used for multi-core processing of the models.
cl <- detectCores()
registerDoParallel(cl)
set.seed(234)
algo_set <- caretList(Conversion_Ratio ~., data=df.train, methodList = c("svmRadial", "rf", "knn", "lm" ), trControl = trainControl)
Re-sampling the models.
set.seed(234)
results_caretList <- resamples(algo_set)
Box-and-whisker plot with RMSE and Rsquared for each trained model.
scales <- list(x = list(relation = "free"), y = list(relation = "free"))
bwplot(results_caretList, scales = scales)
We can clearly see that the Random Forest algorithm outperforms the other algorithms on both the RMSE and Rsquared performance metrics.
Dot-plot with RMSE and Rsquared for each trained model.
dotplot(results_caretList, scales=scales)
The parallel plot shows the RMSE of each resampled model from the 5-fold cross-validation repeated 3 times, for each algorithm.
parallelplot(results_caretList)
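Lines that move together across resamples indicate correlated models, while lines that diverge point to models whose errors differ, which is the better set-up for a potential stacking process.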
Scatter plot matrix of the RMSE.
splom(results_caretList)
This is invaluable when working with different models and considering whether the predictions from two different algorithms are correlated. Weak correlations indicate that the models are good candidates for stacking, that is, for being combined in an ensemble prediction. It looks like SVM and Random Forest are good candidates to try together in a stacking process, as opposed to the SVM and Linear Regression pair.
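The same idea can be expressed numerically by correlating the per-resample RMSE values of the models. The sketch below uses made-up RMSE values for three hypothetical models (model_a, model_b, model_c) rather than the models trained in this document, and picks the pair with the weakest absolute correlation.

```r
set.seed(234)

# Hypothetical RMSE values for 15 resamples (5 folds x 3 repeats); made up
# for illustration, not taken from the models trained in this document
model_a <- rnorm(15, mean = 0.0100, sd = 0.0010)
model_b <- model_a + rnorm(15, mean = 0, sd = 0.0001)  # tracks model_a closely
model_c <- rnorm(15, mean = 0.0095, sd = 0.0010)       # varies independently

rmse_by_model <- data.frame(model_a, model_b, model_c)

# Pairwise correlation of the per-resample RMSE values
rmse_cor <- cor(rmse_by_model)
print(round(rmse_cor, 2))

# Pair with the weakest absolute correlation: the most promising one to stack
pair_idx <- which(upper.tri(rmse_cor), arr.ind = TRUE)
weakest  <- pair_idx[which.min(abs(rmse_cor[pair_idx])), ]
rownames(rmse_cor)[weakest]
```

In a live session, caret's modelCor() applied to the resamples object should give the corresponding correlation matrix for the trained models directly.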
Summarizing all the trained models:
print(algo_set)
## $svmRadial
## Support Vector Machines with Radial Basis Function Kernel
##
## 803 samples
## 3 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times)
## Summary of sample sizes: 643, 643, 643, 640, 643, 643, ...
## Resampling results across tuning parameters:
##
## C RMSE Rsquared
## 0.25 0.01043154 0.1893258
## 0.50 0.01046821 0.1887852
## 1.00 0.01051069 0.1875694
##
## Tuning parameter 'sigma' was held constant at a value of 0.2761751
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.2761751 and C = 0.25.
##
## $rf
## Random Forest
##
## 803 samples
## 3 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times)
## Summary of sample sizes: 643, 643, 643, 640, 643, 643, ...
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared
## 2 0.008951066 0.4038149
## 3 0.008900894 0.4061091
## 5 0.009539474 0.3488432
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 3.
##
## $knn
## k-Nearest Neighbors
##
## 803 samples
## 3 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times)
## Summary of sample sizes: 643, 643, 643, 640, 643, 643, ...
## Resampling results across tuning parameters:
##
## k RMSE Rsquared
## 5 0.01081044 0.1706232
## 7 0.01052146 0.1951113
## 9 0.01042213 0.2029517
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 9.
##
## $lm
## Linear Regression
##
## 803 samples
## 3 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times)
## Summary of sample sizes: 643, 643, 643, 640, 643, 643, ...
## Resampling results:
##
## RMSE Rsquared
## 0.01045033 0.1794489
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
##
##
## attr(,"class")
## [1] "caretList"
Random Forest offers the highest level of predictive accuracy in cross-validation, with RMSE = 0.0089 and Rsquared = 0.406 at mtry = 3. Let’s evaluate it on the test set.
set.seed(234)
predictions.rf <- predict(algo_set$rf, newdata = df.test[,1:4])
## Loading required package: randomForest
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
## The following object is masked from 'package:dplyr':
##
## combine
compare_predictions <- data.frame(
Test.Data=df.test$Conversion_Ratio,
Predictions=predictions.rf)
Conversion Ratio values from the test data set vs the values predicted by the Random Forest algorithm.
g <- ggplot(data = compare_predictions, aes(x = Test.Data, y = Predictions))
g + geom_smooth() +
  geom_point(colour = "Lightblue") +
  ggtitle("Conversion Rates Test vs Predictions")
SSE <- sum((compare_predictions$Test.Data - compare_predictions$Predictions) ^ 2)
SST <- sum((compare_predictions$Test.Data - mean(compare_predictions$Test.Data)) ^ 2)
Rsquared <- 1 - SSE/SST
paste("Rsquared = ", Rsquared, sep = "")
## [1] "Rsquared = 0.345979766506341"
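# Root mean squared error of a vector of prediction errors
rmse <- function(error) {
  sqrt(mean(error^2))
}
error <- compare_predictions$Test.Data - compare_predictions$Predictions
# Store the result under a new name so the rmse() function is not overwritten
rmse_value <- rmse(error)
paste("RMSE = ", rmse_value, sep = "")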
## [1] "RMSE = 0.00927762185203161"
The selected algorithm delivers key evaluation metrics that are average at best. The reported RMSE of 0.0093 indicates a substantial average deviation of the predictions from the observed values, given that the mean Conversion Ratio in the test data set is only 0.0164. The Rsquared value of 0.346 is slightly more promising as a goodness-of-fit measure for the predictions to the observations, on a scale where 0 means no fit and 1 a perfect fit.
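To put these two numbers in context, they can be related to a naive baseline that always predicts the test-set mean. Because Rsquared was computed above as 1 - SSE/SST on the same test set, the baseline RMSE follows directly from the reported values. The sketch below is plain arithmetic on those figures; the variable names are assumptions of this sketch.

```r
# Values reported above for the Random Forest model on the test set
rmse_model <- 0.00927762185203161
r_squared  <- 0.345979766506341
mean_ratio <- 0.0164  # mean Conversion Ratio of the test set, as quoted above

# RMSE^2 = SSE/n and baseline MSE = SST/n, so with Rsquared = 1 - SSE/SST
# the baseline RMSE equals RMSE / sqrt(1 - Rsquared)
rmse_baseline <- rmse_model / sqrt(1 - r_squared)

# RMSE expressed as a share of the mean Conversion Ratio
rmse_share <- rmse_model / mean_ratio

round(c(baseline_RMSE = rmse_baseline, RMSE_share_of_mean = rmse_share), 4)
```

In other words, the model reduces the error of the mean-only baseline from roughly 0.0115 to 0.0093, while the remaining error is still a bit more than half of the mean Conversion Ratio.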
The process above was designed to present the first-stage selection procedure. The process of finalizing the model is far from finished. There are many potential ways to research and improve the predictive performance of the model. Some of the areas that should be considered in the next research stages are:
Multi-model structures can be built using the group_by() and do() functions from the dplyr package. This makes it possible to group the data set by, for example, age group and generate a separate predictive model for each group.
The construction of the group_by() and do() calls is presented below:
cl <- detectCores()
registerDoParallel(cl)
models_by_age_rf <- df.train %>%
  group_by(age) %>%
  do(mod = train(Conversion_Ratio ~ ., data = ., trControl = trainControl, metric = "RMSE", method = "rf"))
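The same one-model-per-group idea can be sketched in base R with split() and lapply(), which may be useful to know because do() is marked as superseded in recent dplyr versions. The example below is self-contained: it uses synthetic data as a stand-in for df.train and a plain lm() fit in place of the Random Forest, purely to keep it small. The names toy, models_by_age and new_ad are assumptions of this sketch.

```r
set.seed(234)

# Synthetic stand-in for df.train; all values are made up for illustration
toy <- data.frame(
  age         = factor(rep(c("30-34", "35-39"), each = 50)),
  gender      = factor(sample(c("F", "M"), 100, replace = TRUE)),
  Impressions = sample(500:20000, 100)
)
toy$Conversion_Ratio <- runif(100, min = 0, max = 0.05)

# One model per age group; age is left out of the formula because it is
# constant within each group
models_by_age <- lapply(split(toy, toy$age), function(d) {
  lm(Conversion_Ratio ~ gender + Impressions, data = d)
})

# Predict for a new ad by picking the model that matches its age group
new_ad <- data.frame(
  age         = "35-39",
  gender      = factor("F", levels = c("F", "M")),
  Impressions = 10000
)
prediction <- predict(models_by_age[[as.character(new_ad$age)]], newdata = new_ad)
print(prediction)
```

The result is a named list with one fitted model per age group, so prediction reduces to looking up the model by group name.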