Experiment Design

The goals of the analysis include (but are not limited to):
  • Identify which factors have a direct impact on ad campaign performance and which ones the management team should focus on
  • Compare how the various ad campaigns perform across companies
  • Quantify how conversion rate varies with age, gender, and interest
  • Identify segments with high and low cost per acquisition, and segment the audience based on click-through and conversion rates
  • Predict conversion rates for each possible age/gender/interest segment
The dataset used for the analysis is the csv file “conversion_data.csv” stored in a local folder.
The analysis includes the following steps:
1. Data Exploration
  • Data upload into R dataframe
  • Data cleaning
    • Removing ID columns, constants
    • Handling the missing values
    • Re-arranging columns
  • Identifying column types and data type conversion
  • Developing new attributes or data features that can be useful for visualizations and for building the machine learning model
  • Visualization of final dataframe
2. Data Analysis
  • Identification of segments with the best performance in terms of clicks and money spent
  • Analysis of ad campaign success by company
  • Identification of segments based on cost, click-through, and conversion rates
  • Test prediction models for conversion rate across segments
3. Advanced analysis - next steps

1. Data exploration

As the first step, we load the required R libraries and the business data from the file:
library(dplyr)
library(tidyr)
library(ggplot2)
library(readr)
library(pander)
library(gRain)  # also attaches gRbase, graph, and BiocGenerics
## (package startup messages and masking warnings omitted)
ds <- read_csv("C:/ThoughtWorks/conversion_data.csv", 
               col_types = cols(gender = col_factor(levels = c("M","F"))))
ds <- tbl_df(ds)
glimpse(ds)
## Observations: 1,143
## Variables: 11
## $ ad_id               <int> 708746, 708749, 708771, 708815, 708818, 70...
## $ xyz_campaign_id     <int> 916, 916, 916, 916, 916, 916, 916, 916, 91...
## $ fb_campaign_id      <int> 103916, 103917, 103920, 103928, 103928, 10...
## $ age                 <chr> "30-34", "30-34", "30-34", "30-34", "30-34...
## $ gender              <fctr> M, M, M, M, M, M, M, M, M, M, M, M, M, M,...
## $ interest            <int> 15, 16, 20, 28, 28, 29, 15, 16, 27, 28, 31...
## $ Impressions         <int> 7350, 17861, 693, 4259, 4133, 1915, 15615,...
## $ Clicks              <int> 1, 2, 0, 1, 1, 0, 3, 1, 1, 3, 0, 0, 0, 0, ...
## $ Spent               <dbl> 1.43, 1.82, 0.00, 1.25, 1.29, 0.00, 4.77, ...
## $ Total_Conversion    <int> 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ Approved_Conversion <int> 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, ...
Data cleansing
a) Identifiers to be excluded
(ids <- which(sapply(ds, function(x) length(unique(x))) == nrow(ds)))
## ad_id 
##     1
ignore<-names(ids)
b) Missing values
vars<-names(ds)
mvc <- sapply(ds[vars], function(x) sum(is.na(x)))
(mvn <- names(which(mvc == nrow(ds))))
## character(0)
##### Ignore missing values
ignore <- union(ignore,mvn)
c) Constants
(constants <- names(which(sapply(ds[vars], function(x) all(x == x[1L])))))
## character(0)
ignore <- union(ignore, constants)
###### Removing columns 
ds<-ds[ , !(names(ds) %in% ignore)]
New columns to be added
a) Click performance - how many clicks we get per 10,000 impressions
ds <- ds %>% 
  mutate(click_rate = as.factor(ifelse(Impressions != 0, round(Clicks / Impressions * 10000), 0)))
b) Conversion rate
ds <- ds %>% 
  mutate(conv_rate = ifelse(Total_Conversion != 0, round(Approved_Conversion / Total_Conversion * 100), 0))
Conversion rate as a factor
ds$conv_rate <-cut(ds$conv_rate, seq(0,100,10), right=TRUE, labels=c("0-10%","10-20%","20-30%","30-40%","40-50%","50-60%","60-70%","70-80%","80-90%","90-100%"))
ds$conv_rate[is.na(ds$conv_rate)] <- "0-10%"
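A note on the NA handling above: cut() with right = TRUE builds left-open intervals, so a rate of exactly 0 falls outside (0,10] and becomes NA, which is why it is back-filled to “0-10%”. A quick illustration:
# an exact zero falls outside (0,10] and yields NA
cut(c(0, 5, 100), seq(0, 100, 10), right = TRUE)   # NA, (0,10], (90,100]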
c) Evaluate campaign performance based on money spent and approved conversions. If we spent money and got zero conversions, all of that money is a loss. The formula for the calculation:
loss_amount = Spent * (1 - Approved_Conversion / Clicks)
For example, $2 spent on 2 clicks with 1 approved conversion gives loss_amount = 2 * (1 - 1/2) = $1.
ds <- ds %>% 
  mutate(loss_amount = ifelse(Clicks != 0, Spent * (1 - Approved_Conversion / Clicks), 0))
d) Loss rate - the ratio between the loss_amount and Spent columns
ds <- ds %>% 
  mutate(loss_rate = ifelse(Spent > 0, round(loss_amount / Spent * 100), 0))
ds$loss_rate <-cut(ds$loss_rate, seq(0,100,10), right=TRUE, labels=c("0-10%","10-20%","20-30%","30-40%","40-50%","50-60%","60-70%","70-80%","80-90%","90-100%"))
ds$loss_rate[is.na(ds$loss_rate)] <- "0-10%"

Writing the final dataframe to a file, ready for analysis:
write.table(ds, file = "C:/ThoughtWorks/ds_final.csv", col.names = TRUE, row.names = FALSE, sep = ",")

2. Data analysis

There are 3 companies in the dataset
table(ds$xyz_campaign_id)
## 
##  916  936 1178 
##   54  464  625
Since the companies are not connected to each other, the analysis will be done separately for each company.
Let’s start with company #916

Company #916

ds_916<-ds %>% filter(xyz_campaign_id %in% c("916"))
As the first step, we perform a “click rate” analysis. “Click rate” shows how many clicks the company gets per 10,000 impressions (all graphs below are interactive).
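The interactive chart itself is not embedded in this text; a minimal ggplot2 sketch of the kind of view described (assuming a bar chart of ad counts per age group, colored by click_rate and faceted by gender) could look like:
# Sketch: click rate distribution per age group, split by gender
ggplot(ds_916, aes(x = age, fill = click_rate)) +
  geom_bar() +
  facet_wrap(~ gender) +
  labs(x = "Age group", y = "Number of ads", fill = "Clicks per 10,000 impressions")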
A few conclusions can be drawn from the visualization:
  • For the “W” (women) segment the most promising group is “30-34”, and the recommendation is to continue working with this group with the goal of reducing the zero click rate (blue). As for the other age groups, it is hard to draw any conclusions since the dataset does not have enough evidence to support any hypothesis
  • For the “M” (men) segment the “30-34” group looks attractive; for the others there is not enough data to make any inference
The second step is a “cost benefit” analysis using the “loss rate” column. It shows how much of the money spent a company gets back as approved conversions. If the rate is 0, the return equals the spend; if it is 100%, all spend is a loss.
All gender and age groups look similar in terms of return vs. spend, approximately fifty-fifty, so no conclusion can be drawn.
To evaluate how the various fb_campaigns perform for company #916, let’s start with a frequency table - how many times each fb_campaign has been run.
ds_916 %>%
  group_by(fb_campaign_id) %>%
  summarize(frequency = n()) %>%
  arrange(desc(frequency)) %>%
  slice(1:10) %>%
  pander(split.table = 120)
fb_campaign_id frequency
104013 3
103928 2
103965 2
104012 2
104085 2
104205 2
103916 1
103917 1
103920 1
103929 1
Only six fb_campaigns have been executed more than once. To get more detail, we can filter the dataset for fb_campaign_id = 104013 (TOP 1 in the list above).
ds_916 %>% 
  filter(fb_campaign_id %in% c("104013")) %>% 
  select(age, gender, interest,Impressions, Clicks,Total_Conversion,Approved_Conversion, click_rate) %>% 
  pander(split.table = 120)
age gender interest Impressions Clicks Total_Conversion Approved_Conversion click_rate
35-39 M 16 12190 2 1 0 2
35-39 M 16 12193 2 1 1 2
35-39 M 16 3332 0 1 1 0
The audience for the 104013 campaign is men in the 35-39 age group sharing the same interest (#16). In all three cases the approved conversion count is low (0 or 1).
To summarize - since most Facebook campaigns have low frequency (were executed only once), it does not seem possible to draw any reliable conclusions about their performance.
The next step is to perform a similar analysis for the “interest” feature to identify possible factors impacting performance with respect to the different segments (age/gender).
ds_916 %>%
  group_by(interest) %>%
  summarize(frequency = n()) %>%
  arrange(desc(frequency)) %>%
  slice(1:15) %>%
  pander(split.table = 120)
interest frequency
16 16
15 9
28 4
20 3
27 3
29 3
7 2
10 2
31 2
63 2
18 1
19 1
21 1
24 1
25 1
Two interests dominate for company #916 - #16 and #15.
Let’s evaluate #16 from the conversion rate perspective (conversion rate = Approved_Conversion / Total_Conversion).
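The underlying chart is not embedded here; one way to sketch it (assuming a proportional stacked bar of conv_rate by age, faceted by gender, for interest #16 only) is:
# Sketch: conversion rate distribution for interest #16
ds_916 %>%
  filter(interest == 16) %>%
  ggplot(aes(x = age, fill = conv_rate)) +
  geom_bar(position = "fill") +
  facet_wrap(~ gender) +
  labs(x = "Age group", y = "Share of ads")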
From the chart above it seems plausible to conclude that interest #16 is relevant to the “M” group across two age segments - “35-39” and “40-44”.
A similar analysis for interest #15:
A reasonable inference from the chart above is that interest #15 is more relevant for men between 35 and 44 years old and not relevant for women at all.
A quick analysis of how conversion rate varies with age and gender.
A few conclusions can be drawn from the chart:
  • The overall conversion rate is pretty low and does not depend on gender or age
  • Conversion rate is relatively high (more than 50%) for men in two age groups - “30-34” and “35-39”
  • Conversion rate is relatively high (more than 50%) for women in two age groups - “30-34” and “45-49”
Three segments can be identified for the women and men groups with respect to conversion rate - “0-10%”, “40-50%”, “90-100%”. The first and the third segments include all age groups for both women and men.
What segments does company #916 have with high and low cost per acquisition (based on the “loss_rate” feature; “0-10%” means low cost, “90-100%” high cost)?

  • For men, most of the “high cost” (“90-100%”) bar consists of the “30-34” and “35-39” age segments. In the “low cost” (“0-10%”) bar the relative share of the “30-34” group is higher than in the “high cost” bar
  • For women, all age segments are distributed approximately equally between high and low cost. The “40-44” group appears only in the low-cost group, but since it is just one record, no reliable conclusion can be drawn about it

Any additional ad-hoc analysis can be done via the interactive pivot table below:
rpivotTable::rpivotTable(ds_916,rows = "gender",cols="age", vals = "Freq",width="800", height="600px")

Company #936

ds_936<-ds %>% filter(xyz_campaign_id %in% c("936"))
As the first step - a “click rate” analysis. “Click rate” shows how many clicks the company gets per 10,000 impressions.
A few conclusions can be drawn from the visualization:
  • For both genders two click rates are prevalent - “0” and “2”
  • For the “M” segment the structures for “35-39” and “45-49” look very similar with respect to click rate
  • For the “W” segment the structures for “35-39” and “40-49” look very similar with respect to click rate
The second step is a “cost benefit” analysis using the “loss rate” column. It shows how much of the money spent a company gets back as approved conversions. If the rate is 0, the return equals the spend; if it is 100%, all spend is a loss.
  • For the “M” segment, three groups (“35-39”, “40-44”, “45-49”) have an approximately similar ratio between “low” and “high” cost. The “30-34” group has three times more “low” cost records than “high” ones; this group looks the best from the cost-benefit perspective
  • For the “W” segment the “30-34” and “35-39” groups have a fifty-fifty ratio between “low” and “high” costs. The two other age groups have more “high” cost records than “low” ones
To evaluate how the various fb_campaigns perform for company #936, let’s start with a frequency table - how many times each fb_campaign has been run.
ds_936 %>%
  group_by(fb_campaign_id) %>%
  summarize(frequency = n()) %>%
  arrange(desc(frequency)) %>%
  slice(1:20) %>%
  pander(split.table = 120)
fb_campaign_id frequency
110982 4
111006 4
115510 4
115541 4
115547 4
115485 3
115487 3
115523 3
115545 3
115631 3
116501 3
123457 3
123556 3
123601 3
123671 3
123736 3
108654 2
108672 2
108752 2
108753 2
To get more details we can filter the dataset for the TOP 5 fb_campaign_ids:
ds_936_fb_campaign <- ds_936 %>% 
  filter(fb_campaign_id %in% c("110982","111006","115510","115541","115547"))
write.table(ds_936_fb_campaign, file = "C:/ThoughtWorks/ds_936_fb_campaign.csv", col.names = TRUE, row.names = FALSE, sep = ",")
Bar chart for the data set
  • Two campaigns had only women as a target group, the other three only men
  • The women segment includes two age groups - “40-44” and “45-49”; for men all three age segments were covered
  • The two campaigns for women have a higher average click rate in comparison with the three campaigns for men
To get more information about campaign performance, we can run a cost/benefit analysis evaluating the average loss rate (a sketch of the computation follows the bullets below):
  • The average loss rate for the women group is two times higher than for the men group
  • The three campaigns for the men group have approximately similar loss rates (25-50%)
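Since loss_rate was bucketed into a factor earlier, the average has to be recomputed numerically; a minimal sketch, assuming the mean of loss_amount relative to Spent per gender and age:
# Sketch: average loss rate (%) per gender/age for the TOP 5 campaigns
ds_936_fb_campaign %>%
  filter(Spent > 0) %>%
  group_by(gender, age) %>%
  summarize(avg_loss_rate = mean(loss_amount / Spent * 100))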
The next step is to perform a similar analysis for the “interest” feature to identify possible factors impacting performance with respect to the different segments (age/gender).
ds_936 %>%
  group_by(interest) %>%
  summarize(frequency = n()) %>%
  arrange(desc(frequency)) %>%
  slice(1:10) %>% 
  pander(split.table = 120)
interest frequency
16 86
10 48
29 39
27 31
64 24
26 23
15 22
18 22
20 22
28 21
For the analysis we can take the TOP 10 interests and analyze their performance:
ds_936_interest <- ds_936 %>% 
  filter(interest %in% c("16","10","29","27","64","26","15","18","20","28"))
write.table(ds_936_interest, file = "C:/ThoughtWorks/ds_936_interest.csv", col.names = TRUE, row.names = FALSE, sep = ",")
As the first step, we can analyze interest frequencies across gender and age groups:
  • The most popular interest for both gender groups is #16. The next two are #10 and #29
  • Interest #16 has a similar age-group ratio for men and women
The second step is to evaluate performance across interests; as the first metric we can use the average click rate (see the sketch after these bullets):
  • For the women group the average click rates across all interests are pretty close to each other
  • For the men group interest #20 has a higher average click rate than the other interests, in particular for the “40-44” age group
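Because click_rate was converted to a factor, the average click rate also has to be recomputed from the raw counts; a minimal sketch, assuming clicks per 10,000 impressions averaged per gender and interest:
# Sketch: average click rate per 10,000 impressions, by gender and interest
ds_936_interest %>%
  group_by(gender, interest) %>%
  summarize(avg_click_rate = mean(ifelse(Impressions > 0, Clicks / Impressions * 10000, 0)))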
The second metric is average loss rate
  • The average loss rate for women is higher than for men across all interests
  • Two age groups for women contribute most to the average loss rate - “40-44” and “45-49”
  • Across both gender groups one age segment is common to all interests - “30-34”
A quick analysis of how conversion rate varies with age and gender:
  • For both gender groups there is one age group that has the highest conversion rate - “30-34”
What segments does company #936 have with high and low cost per acquisition (based on the “loss_rate” feature; “0-10%” means low cost, “90-100%” high cost)?
  • For both gender groups the “30-34” age segment has the best ratio between low and high costs
  • Across all age segments the men group has a better ratio between low and high costs than women
Any additional ad-hoc analysis can be done via the interactive pivot table below:
rpivotTable::rpivotTable(ds_936,rows = "gender",cols="age", vals = "Freq",width="800", height="600px")

Company #1178

ds_1178<-ds %>% filter(xyz_campaign_id %in% c("1178"))
As the first step - a “click rate” analysis. “Click rate” shows how many clicks the company gets per 10,000 impressions.
A few conclusions can be drawn from the visualization:
  • For the women group, across all age segments, the most common click rate is 2
  • For the men group between 30 and 39 years old the most common click rate is 1; between 40 and 49 years old it is 2
  • The men group does not have any records with a click rate of 4
The second step is a “cost benefit” analysis using the “loss rate” column. It shows how much of the money spent a company gets back as approved conversions. If the rate is 0, the return equals the spend; if it is 100%, all spend is a loss.
  • Across all segments (gender, age) the ratio of low cost to high cost records is very low
  • Only one age group (30-34) has some records with low cost
To evaluate how the various fb_campaigns perform for company #1178, let’s start with a frequency table - how many times each fb_campaign has been run.
ds_1178 %>%
  group_by(fb_campaign_id) %>%
  summarize(frequency = n()) %>%
  arrange(desc(frequency)) %>%
  slice(1:20) %>%
  pander(split.table = 120)
fb_campaign_id frequency
144536 6
144562 6
144599 6
144611 6
144636 6
144674 6
144531 5
144532 5
144537 5
144549 5
144554 5
144565 5
144572 5
144580 5
144585 5
144586 5
144592 5
144597 5
144602 5
144622 5
To get more details we can filter the dataset for the TOP 6 fb_campaign_ids, which have the highest frequency:
ds_1178_fb_campaign <- ds_1178 %>% 
  filter(fb_campaign_id %in% c("144536","144562","144599","144611","144636","144674"))
write.table(ds_1178_fb_campaign, file = "C:/ThoughtWorks/ds_1178_fb_campaign.csv", col.names = TRUE, row.names = FALSE, sep = ",")
Bar chart for the data set
  • All six campaigns have a pretty narrow target audience
  • Most campaigns targeted the men segment; only one campaign was executed for women
  • For the women group the average click rate is higher than for men
To get more information about campaign performance, we can run a cost/benefit analysis evaluating the average loss rate:
  • All campaigns have a 100% loss rate for all gender and age groups
The next step is to perform a similar analysis for the “interest” feature to identify possible factors impacting performance with respect to the different segments (age/gender).
ds_1178 %>%
  group_by(interest) %>%
  summarize(frequency = n()) %>%
  arrange(desc(frequency)) %>%
  slice(1:10) %>% 
  pander(split.table = 120)
interest frequency
16 38
10 35
29 35
27 26
28 26
63 26
20 24
64 24
19 23
21 22
For the analysis we can take the TOP 10 interests and analyze their performance:
ds_1178_interest <- ds_1178 %>% 
  filter(interest %in% c("16","10","29","27","28","63","20","64","19","21"))
write.table(ds_1178_interest, file = "C:/ThoughtWorks/ds_1178_interest.csv", col.names = TRUE, row.names = FALSE, sep = ",")
As the first step, we can analyze interest frequencies across gender and age groups:
  • The most popular interest for women is #16 and for men is #29; both interests have a fairly even ratio across age groups
  • Interest #16 has a similar age-group ratio for men and women

The second step is to evaluate performance across interests; as the first metric we can use the average click rate:
  • For both gender groups all interests have the same structure with regard to age groups
  • The average click rate for women is higher than for men across all interests
The second metric is average loss rate
  • For almost all interests across age groups the average loss rate is 100%
  • A few exceptions - for the “30-34” group three interests (21, 29, 63) have an 80% loss rate
A quick analysis of how conversion rate varies with age and gender:
  • Across all gender and age segments the structure of the conversion rate is similar; no anomalies are visible
What segments does company #1178 have with high and low cost per acquisition (based on the “loss_rate” feature; “0-10%” means low cost, “90-100%” high cost)?
  • For both gender groups and almost all age groups the “90-100%” loss rate prevails
  • Only the “30-34” group has a few cases with a low loss rate (“0-10%”)
Any additional ad-hoc analysis can be done via the interactive pivot table below:
rpivotTable::rpivotTable(ds_1178,rows = "gender",cols="age", vals = "Freq",width="800", height="600px")

Predictive modeling

The objective of this chapter is to create a model that predicts the conversion rate for company #1178. The response variable (conv_rate) is transformed into a binary response feature - if conv_rate equals “0-10%” (low rate), the response is 0, otherwise it is 1 (high rate). This reduces our task to training and testing binary classification models.
library(caret)
## Warning: package 'caret' was built under R version 3.3.3
## Loading required package: lattice
ds_1178<-ds %>% filter(xyz_campaign_id %in% c("1178"))
ds_1178 <- ds_1178 %>% 
  mutate(response = ifelse(conv_rate == "0-10%", 0, 1))
Since our model is predictive, we can remove all “id” columns from the dataset:
### remove xyz_campagn_id, fb_campaign_id 
ds_1178_predict<-ds_1178 %>% select( age, gender,interest, Impressions,Clicks,Spent,response)
prop.table(table(ds_1178_predict$response))
## 
##      0      1 
## 0.3872 0.6128
Not all machine learning algorithms can handle factor features, so we need to transform them into dummy variables:
#### create dummy variables
set.seed(1234)
ds_1178_predict_Dummy <- dummyVars("~.",data=ds_1178_predict, fullRank=T)
ds_1178_predict_final <- as.data.frame(predict(ds_1178_predict_Dummy,ds_1178_predict))
print(names(ds_1178_predict_final))
## [1] "age35-39"    "age40-44"    "age45-49"    "gender.F"    "interest"   
## [6] "Impressions" "Clicks"      "Spent"       "response"
ds_1178_predict_final$response <- ifelse(ds_1178_predict_final$response==1,'Success','Failure')

outcomeName <- 'response'
predictorsNames <- names(ds_1178_predict_final)[names(ds_1178_predict_final) != outcomeName]
As with most modeling projects, we need to split our data into two portions: a training and a testing portion. By doing this, we can use one portion to teach the model how to recognize success or failure and the other portion to evaluate the model.
set.seed(1234)
splitIndex <- createDataPartition(ds_1178_predict_final[,outcomeName], p = .75, list = FALSE, times = 1)
trainDF <- ds_1178_predict_final[ splitIndex,]
testDF  <- ds_1178_predict_final[-splitIndex,]
As the first model we can use a decision tree (rpart library):
set.seed(42)

library(rpart)
library(rpart.plot)
## Warning: package 'rpart.plot' was built under R version 3.3.3
fit <- rpart(response ~ .,
             data = trainDF,
             method = "class",
             control = rpart.control(xval = 12, 
                                     minbucket = 10, 
                                     cp = 0), 
             parms = list(split = "information"))

rpart.plot(fit,tweak=1.5)

The graphical representation of the decision tree is very clear, and it allows us to draw a few important conclusions about which features matter for getting “Success” as an outcome.
To generalize the previous model for training and testing, we can build a so-called “random forest” - an ensemble technique that can be used for both regression and classification.
set.seed(42)

model_rf <- caret::train(response ~ .,
                         data = trainDF,
                         method = "rf",
                         preProcess = c("scale", "center"),
                         trControl = trainControl(method = "repeatedcv", 
                                                  number = 5, 
                                                  repeats = 5, 
                                                  savePredictions = TRUE, 
                                                  verboseIter = FALSE))
## Loading required package: randomForest
## Warning: package 'randomForest' was built under R version 3.3.2
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:BiocGenerics':
## 
##     combine
## The following object is masked from 'package:ggplot2':
## 
##     margin
## The following object is masked from 'package:dplyr':
## 
##     combine
model_rf$finalModel$confusion
##         Failure Success class.error
## Failure      68     114   0.6263736
## Success      68     220   0.2361111
Feature importance
imp <- model_rf$finalModel$importance
imp[order(imp, decreasing = TRUE), ]
## Impressions       Spent      Clicks    interest    gender.F  `age45-49` 
##   49.009356   44.910180   36.447911   36.164429    4.467851    4.245633 
##  `age40-44`  `age35-39` 
##    4.100636    3.803721
# estimate variable importance
importance <- varImp(model_rf, scale = TRUE)
plot(importance)

##### Predicting test data

confusionMatrix(predict(model_rf, testDF), testDF$response)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Failure Success
##    Failure      22      28
##    Success      38      67
##                                           
##                Accuracy : 0.5742          
##                  95% CI : (0.4923, 0.6532)
##     No Information Rate : 0.6129          
##     P-Value [Acc > NIR] : 0.8579          
##                                           
##                   Kappa : 0.0742          
##  Mcnemar's Test P-Value : 0.2679          
##                                           
##             Sensitivity : 0.3667          
##             Specificity : 0.7053          
##          Pos Pred Value : 0.4400          
##          Neg Pred Value : 0.6381          
##              Prevalence : 0.3871          
##          Detection Rate : 0.1419          
##    Detection Prevalence : 0.3226          
##       Balanced Accuracy : 0.5360          
##                                           
##        'Positive' Class : Failure         
## 
Another important question for predictive modeling is feature selection. There are a few different ways to do it; for our dataset I selected the “correlation” approach. We need to calculate the correlation matrix for all features in the dataset (besides “response”) and remove the features with high correlation.
library(corrplot)
## Warning: package 'corrplot' was built under R version 3.3.2
# calculate correlation matrix
corMatMy <- cor(trainDF[, -9])
corrplot(corMatMy, order = "hclust")

There are three highly correlated features - Impressions, Clicks and Spent. We can filter out the features with a correlation higher than 0.7, keeping the feature with the lower mean

#Apply correlation filter at 0.70,
highlyCor <- colnames(testDF[, -9])[findCorrelation(corMatMy, cutoff = 0.7, verbose = TRUE)]
## Compare row 7  and column  8 with corr  0.99 
##   Means:  0.381 vs 0.193 so flagging column 7 
## Compare row 8  and column  6 with corr  0.962 
##   Means:  0.27 vs 0.149 so flagging column 8 
## All correlations <= 0.7
highlyCor
## [1] "Clicks" "Spent"

Based on our analysis we can remove the “Clicks” and “Spent” features and keep everything else.
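A one-line sketch of that removal, dropping the flagged columns from both splits:
# Drop the highly correlated features identified above
trainDF_reduced <- trainDF[, !(names(trainDF) %in% highlyCor)]
testDF_reduced  <- testDF[, !(names(testDF) %in% highlyCor)]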

3. Advanced analysis - next steps

One possible way to improve prediction accuracy is to use probabilistic graphical models based on Bayes’ rule (Bayesian networks). The main goal is to learn the joint probability distribution for “Conversion rate” by factorizing it into conditional probabilities (see below).
The first step is to create the graphical model and calculate the CPTs (conditional probability tables) from the dataset.
The second step is to learn the parameters of the probability distribution and run an exact inference algorithm.
The third step is to query the distribution with new data to get the probabilities of the conversion rate.
Step #1.
FB <- dag(~Int:Age+Int:Gender+Clicks:Int+Clicks:Impres+Clicks:Fb_camp+Conv_rate:Clicks+Conv_rate:Total_conv)
plot( FB )

Probabilistic model :
P(Conv_rate, Total_conv, Clicks, Impres, Fb_camp, Int, Age, Gender) =
P(Age) x P(Gender) x P(Fb_camp) x P(Impres) x P(Total_conv) x P(Int | Age, Gender) x P(Clicks | Int, Fb_camp, Impres) x P(Conv_rate | Clicks, Total_conv)
The CPTs can be estimated from frequency tables over the dataset. For example, the prior probabilities for age and gender (company #916) can be calculated as:
addmargins(prop.table(table(ds_916$age)))
## 
##     30-34     35-39     40-44     45-49       Sum 
## 0.5370370 0.2222222 0.1111111 0.1296296 1.0000000
addmargins(prop.table(table(ds_916$gender)))
## 
##         M         F       Sum 
## 0.6481481 0.3518519 1.0000000

The conditional probability table for interest, according to our graphical model, is:

prop.table((ftable(ds_916$interest, ds_916$gender,ds_916$age)))
##            30-34      35-39      40-44      45-49
##                                                  
## 7  M  0.01851852 0.01851852 0.00000000 0.00000000
##    F  0.00000000 0.00000000 0.00000000 0.00000000
## 10 M  0.00000000 0.00000000 0.00000000 0.00000000
##    F  0.00000000 0.00000000 0.00000000 0.03703704
## 15 M  0.03703704 0.05555556 0.01851852 0.00000000
##    F  0.03703704 0.00000000 0.01851852 0.00000000
## 16 M  0.07407407 0.07407407 0.07407407 0.01851852
##    F  0.05555556 0.00000000 0.00000000 0.00000000
## 18 M  0.00000000 0.00000000 0.00000000 0.00000000
##    F  0.01851852 0.00000000 0.00000000 0.00000000
## 19 M  0.00000000 0.00000000 0.00000000 0.00000000
##    F  0.00000000 0.00000000 0.00000000 0.01851852
## 20 M  0.03703704 0.00000000 0.00000000 0.00000000
##    F  0.00000000 0.01851852 0.00000000 0.00000000
## 21 M  0.00000000 0.00000000 0.00000000 0.01851852
##    F  0.00000000 0.00000000 0.00000000 0.00000000
## 24 M  0.00000000 0.00000000 0.00000000 0.01851852
##    F  0.00000000 0.00000000 0.00000000 0.00000000
## 25 M  0.00000000 0.00000000 0.00000000 0.00000000
##    F  0.00000000 0.01851852 0.00000000 0.00000000
## 27 M  0.01851852 0.00000000 0.00000000 0.00000000
##    F  0.01851852 0.01851852 0.00000000 0.00000000
## 28 M  0.07407407 0.00000000 0.00000000 0.00000000
##    F  0.00000000 0.00000000 0.00000000 0.00000000
## 29 M  0.01851852 0.01851852 0.00000000 0.00000000
##    F  0.01851852 0.00000000 0.00000000 0.00000000
## 30 M  0.01851852 0.00000000 0.00000000 0.00000000
##    F  0.00000000 0.00000000 0.00000000 0.00000000
## 31 M  0.03703704 0.00000000 0.00000000 0.00000000
##    F  0.00000000 0.00000000 0.00000000 0.00000000
## 32 M  0.00000000 0.00000000 0.00000000 0.00000000
##    F  0.01851852 0.00000000 0.00000000 0.00000000
## 63 M  0.00000000 0.00000000 0.00000000 0.00000000
##    F  0.01851852 0.00000000 0.00000000 0.01851852
## 65 M  0.00000000 0.00000000 0.00000000 0.00000000
##    F  0.01851852 0.00000000 0.00000000 0.00000000
Step #2. To learn the distribution parameters we can use maximum likelihood estimation; for inference, an exact algorithm such as variable elimination.
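As an illustration of what such a gRain workflow could look like, here is a deliberately reduced sketch with a hypothetical Age -> Int -> Conv_rate chain and made-up CPT values (the real model would use the full DAG above and CPTs estimated from the frequency tables):
library(gRain)
# Hypothetical reduced network with invented probabilities - for illustration only
a   <- cptable(~Age, values = c(54, 46), levels = c("30-34", "other"))
i   <- cptable(~Int | Age, values = c(30, 70, 10, 90), levels = c("16", "other"))
cr  <- cptable(~Conv_rate | Int, values = c(20, 80, 5, 95), levels = c("high", "low"))
net <- grain(compileCPT(list(a, i, cr)))
# Exact inference: P(Conv_rate) for ads targeting the 30-34 age group
querygrain(setEvidence(net, nodes = "Age", states = "30-34"), nodes = "Conv_rate")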
Another option is to use Scala and “Figaro” for probabilistic programming.