Facebook ad campaign - data analysis

Experiment Design

The goals of the analysis include (but not limited to):

Identify what the factors have the direct impact on the ad campaign and management team should focus on
Compare how the various ad campaigns are performing with respect to different companies
Quantify how conversion rate varies with age, gender or interest.
Identify segments with high and low cost per acquisition and how segment the audience based on click through rates conversion rates
Predict conversion rates for each possible age/gender/interest segment.

The dataset used for analysis is the csv file called “conversion_data.csv” stored on the local folder.

The analysis includes the following steps:

1. Data Exploration

Data upload into R dataframe
Data cleaning
- Removing ID columns, constants
- Handling the missing values
- Re-arranging columns
Identifying column types and data type conversion
Developing new attributes or data features that can be usefull for visualizations and building the machine learning model
Visualization of final dataframe

2. Data Analysis

Identification of segments which have the best performance in the terms of clicks and spend money
Analysis of ad campaigns success ny companies
Identification os segents based on cost , click through and conversion rates
Test prediction models for conversion rate across segments

3. Advanced analysis - next steps

1. Data exploration

As the first step we need to upload required R libraries and business data from the file:

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

## Warning: package 'tidyr' was built under R version 3.3.2

## Warning: package 'ggplot2' was built under R version 3.3.2

## Warning: package 'readr' was built under R version 3.3.3

## Warning: package 'pander' was built under R version 3.3.3

## Warning: package 'gRain' was built under R version 3.3.3

## Loading required package: gRbase

## Warning: package 'gRbase' was built under R version 3.3.3

## Loading required package: graph

## Loading required package: BiocGenerics

## Loading required package: parallel

## 
## Attaching package: 'BiocGenerics'

## The following objects are masked from 'package:parallel':
## 
##     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
##     clusterExport, clusterMap, parApply, parCapply, parLapply,
##     parLapplyLB, parRapply, parSapply, parSapplyLB

## The following objects are masked from 'package:dplyr':
## 
##     combine, intersect, setdiff, union

## The following objects are masked from 'package:stats':
## 
##     IQR, mad, xtabs

## The following objects are masked from 'package:base':
## 
##     anyDuplicated, append, as.data.frame, cbind, colnames,
##     do.call, duplicated, eval, evalq, Filter, Find, get, grep,
##     grepl, intersect, is.unsorted, lapply, lengths, Map, mapply,
##     match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
##     Position, rank, rbind, Reduce, rownames, sapply, setdiff,
##     sort, table, tapply, union, unique, unsplit, which, which.max,
##     which.min

## Loading required package: grid

ds <- read_csv("C:/ThoughtWorks/conversion_data.csv", 
               col_types = cols(gender = col_factor(levels = c("M","F"))))
glimpse(tbl_df(ds))

## Observations: 1,143
## Variables: 11
## $ ad_id               <int> 708746, 708749, 708771, 708815, 708818, 70...
## $ xyz_campaign_id     <int> 916, 916, 916, 916, 916, 916, 916, 916, 91...
## $ fb_campaign_id      <int> 103916, 103917, 103920, 103928, 103928, 10...
## $ age                 <chr> "30-34", "30-34", "30-34", "30-34", "30-34...
## $ gender              <fctr> M, M, M, M, M, M, M, M, M, M, M, M, M, M,...
## $ interest            <int> 15, 16, 20, 28, 28, 29, 15, 16, 27, 28, 31...
## $ Impressions         <int> 7350, 17861, 693, 4259, 4133, 1915, 15615,...
## $ Clicks              <int> 1, 2, 0, 1, 1, 0, 3, 1, 1, 3, 0, 0, 0, 0, ...
## $ Spent               <dbl> 1.43, 1.82, 0.00, 1.25, 1.29, 0.00, 4.77, ...
## $ Total_Conversion    <int> 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ Approved_Conversion <int> 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, ...

Data cleansing

a) Identificators to be exculded

(ids <- which(sapply(ds, function(x) length(unique(x))) == nrow(ds)))

## ad_id 
##     1

ignore<-names(ids)

b) Missing values

vars<-names(ds)
mvc <- sapply(ds[vars], function(x) sum(is.na(x)))
(mvn <- names(which(mvc == nrow(ds))))

## character(0)

##### Ignore missing values
ignore <- union(ignore,mvn)

c) Constants

(constants <- names(which(sapply(ds[vars], function(x) all(x == x[1L])))))

## character(0)

ignore <- union(ignore, constants)

###### Removing columns 
ds<-ds[ , !(names(ds) %in% ignore)]

New columns to be added

a) Click performance - how many clicks we have per 10 0000 imperssions

ds<- ds %>% 
  mutate(click_rate = as.factor(ifelse(ds$Impressions != 0 , round(ds$Clicks/ds$Impressions*10000) , 0)))

b) Conversion rate

ds<- ds %>% 
  mutate(conv_rate = ifelse(ds$Total_Conversion != 0 , round(ds$Approved_Conversion/ds$Total_Conversion*100) , 0))

Conversion rate as a factor

ds$conv_rate <-cut(ds$conv_rate, seq(0,100,10), right=TRUE, labels=c("0-10%","10-20%","20-30%","30-40%","40-50%","50-60%","60-70%","70-80%","80-90%","90-100%"))
ds$conv_rate[is.na(ds$conv_rate)] <- "0-10%"

c) Evaluate campaign performance based on money spent and approved conversion. If we spent money and got 0 conversion, all money is our loss. Formula for calucaltion:

loss_amount= spent * (1 - approved_conversion / clicks)

ds<- ds %>% 
  mutate(loss_amount = ifelse(ds$Clicks !=0, ds$Spent*(1 - ds$Approved_Conversion / ds$Clicks),0))

d) Loss rate as ratio between loss_amount and spend columns

ds<- ds %>% 
  mutate(loss_rate =  ifelse(ds$Spent>0, round(ds$loss_amount/ds$Spent)*100,0)) 
ds$loss_rate <-cut(ds$loss_rate, seq(0,100,10), right=TRUE, labels=c("0-10%","10-20%","20-30%","30-40%","40-50%","50-60%","60-70%","70-80%","80-90%","90-100%"))
ds$loss_rate[is.na(ds$loss_rate)] <- "0-10%"

Writing final dataframe to the file ready for analysis

write.table(ds, file = "C:/ThoughtWorks/ds_final.csv", col.names = TRUE, row.names = FALSE, sep = ",")

2. Data analysis

There are 3 companies in the dataset

table(ds$xyz_campaign_id)

## 
##  916  936 1178 
##   54  464  625

Since companies are not connected to each other, analysis will be done separetly for every company

Let’s start with company #916

Company #916

ds_916<-ds %>% filter(xyz_campaign_id %in% c("916"))

As the first step “click rate”" analysis. “Click rate”" shows how many clicks the company has per 10 000 impressions (all graphs below are interactive)

Few conclusions can be made from the visualization:

For “W” segment the most promising is “30-34” group and the recommendation is to continue to work with this group with the goal of reducing zero click rate (blue). As to other age groups - it’s hard to make any conclusions since dataset doe not have enough evidence to support any hypothesis
For “M” segment “30-34” group looks attractive, as to others - not enough data to make any inference

The second step is “cost benefit” analysis using “loss rate” column. It shows how many approved conversion a company gets as a return to spent money. If it’s 0 , the return equals to spend. If it’s 100%, all spend is a loss.

All geneder and age groups look similar in the terms of return / spend, approximately fifty-fifty. No any conclusion can be made.

To evaluate how various fb_campaigns performs for company #916, let’s start with frequency table - how many times a specific fb_campaign has been performed

ds_916 %>%
  group_by(fb_campaign_id) %>%
  summarize(frequency = n()) %>%
  arrange(desc(frequency)) %>%
slice(1:10) %>%
pander(split.table = 120)

fb_campaign_id	frequency
104013	3
103928	2
103965	2
104012	2
104085	2
104205	2
103916	1
103917	1
103920	1
103929	1

Only six fb_campaigns have been executed more than once, to get more details we can filter dataset for fb_campaign_id=104013 (TOP 1 in the list above)

ds_916 %>% 
  filter(fb_campaign_id %in% c("104013")) %>% 
  select(age, gender, interest,Impressions, Clicks,Total_Conversion,Approved_Conversion, click_rate) %>% 
  pander(split.table = 120)

age	gender	interest	Impressions	Clicks	Total_Conversion	Approved_Conversion	click_rate
35-39	M	16	12190	2	1	0	2
35-39	M	16	12193	2	1	1	2
35-39	M	16	3332	0	1	1	0

The audience for 104013 campaign is men within 35-39 age group having the same interest (#16). For all three cases conversion rate is low (0 or 1).

To summarize - since most part of the facebook campaign has low frequency (was executed only once), it does not seem possible to make any reliable conclusion as to its performance

The next step is to perform the similar type of analysis for “interest” feature to identify possible factors impacting the performance with respect to different segments (age/gender)

ds_916 %>%
  group_by(interest) %>%
  summarize(frequency = n()) %>%
  arrange(desc(frequency)) %>%
  slice(1:15) %>%
  pander(split.table = 120)

interest	frequency
16	16
15	9
28	4
20	3
27	3
29	3
7	2
10	2
31	2
63	2
18	1
19	1
21	1
24	1
25	1

There are two interests dominating for 916 company - #16 and #15.

Let’s evaluate #16 from conversion rate perspective (conversion rate = Approved conversion / Total_Conversion )

From chart above it seems plausible to conclude that #16 interest is relevant to “M” group acros two “age” segments - “35-39” and “40-44”

The simialr analysis for #15 interest

Resonable inference from the chart above is that #15 interest is more relevant for men between 35 and 44 years old and not relevant for women at all

Quick analysis of how how conversion rate varies with age, gender

Few conclusion cab be drawn from the chart:

Overall conversion rate is pretyy low and it does not depend on gender or age
Conversion rate is relatively high (more than 50%) for men with two age group - “30-34” and “35-39”
Conversion rate is relatively high (more than 50%) for women with two agr group - “30-34” and “45-49” ##### Three segments can be identified for women and men groups with respect to conversion rate - “0-10”, “40-50”, “90-100%”. The first and the third segments include all age group for women and men. ##### ##### What segments company #916 has with high and low cost per acquisition (based on “loss_rate”" feature, “0-10%” means low cost, “90-100%” - high cost“)
For men most part of “high cost” bar (“90-100%”) includes 30-34 and 34-35 age segments. For “Low cost” (“0-10%”) bar relative part of “30-34” group is higher than for “high cost” bar
For women all age segments are distributed approximately equally between high and low cost. “40-44” group exists only in low cost group , but since it’s just one record, no any relaible conclusion can be done about it

Any additional ad-hoc analysis can be done via interactive pivot table below

rpivotTable::rpivotTable(ds_916,rows = "gender",cols="age", vals = "Freq",width="800", height="600px")

Company #936

ds_936<-ds %>% filter(xyz_campaign_id %in% c("936"))

As the first step - “click rate”" analysis. “Click rate”" shows how many clicks the company has per 10 000 impressions

Few conclusions can be made from the visualization:

For both genders two click rates are prevalent - “0” and “2”
For “M” segment the structures for “35-39” and “45-49” looks very similar with respect to click rate
For “W” segment the structures for “35-39” and “40-49” looks very similar with respect to click rate

The second step is “cost benefit” analysis using “loss rate” column. It shows how many approved conversion a company gets as a return to spent money. If it’s 0 , the return equals to spend. If it’s 100%, all spend is a loss.

For “M” segment, three groups (“35-39”,“40-44”,“45-49”) ahs approximately similar ratio between “low” and “high” cost. “30-34” group has three time more “low” cost records than “high” ones, this group looks “the best” from cost-benefit analysis
For “W” segment “30-34” and “35-39” groups has fifty-fifty ratio between “low” and “high” costs, Two other age groups has more “high” cost records then “low” ones

To evaluate how various fb_campaigns performs for company #916, let’s start with frequency table - how many times a specific fb_campaign has been performed

ds_936 %>%
  group_by(fb_campaign_id) %>%
  summarize(frequency = n()) %>%
  arrange(desc(frequency)) %>%
slice(1:20) %>%
pander(split.table = 120)

fb_campaign_id	frequency
110982	4
111006	4
115510	4
115541	4
115547	4
115485	3
115487	3
115523	3
115545	3
115631	3
116501	3
123457	3
123556	3
123601	3
123671	3
123736	3
108654	2
108672	2
108752	2
108753	2

To get more details we can filter dataset for TOP 5 fb_campaign_id

ds_936_fb_campaign <- ds_936 %>% 
  filter(fb_campaign_id %in% c("110982","111006","115510","115541","115547"))
write.table(ds_936_fb_campaign, file = "C:/ThoughtWorks/ds_936_fb_campaign.csv", col.names = TRUE, row.names = FALSE, sep = ",")

Bar chart for the data set

Two campaigns had only women as a target group, other three only men
Women segment includes two ages groups - “40-44” and “44-49”, for men all three age segments were covered
Two campaigns for women have a higher average click rate in comaprison with three campaigns for men

To get more information about campaign performance, we can run cost / benefit analysis evaluating the average loss rate

Average loss rate for women group is two times higher then for men one
Three campaign for men group have approximately similar loss rate (25-50%)

The next step is to perform the similar type of analysis for “interest” feature to identify possible factors impacting the performance with respect to different segments (age/gender)

ds_936 %>%
  group_by(interest) %>%
  summarize(frequency = n()) %>%
  arrange(desc(frequency)) %>%
  slice(1:10) %>% 
  pander(split.table = 120)

interest	frequency
16	86
10	48
29	39
27	31
64	24
26	23
15	22
18	22
20	22
28	21

For analysis we can take TOP 10 interest and analyse its performance

ds_936_interest <- ds_936 %>% 
  filter(interest %in% c("16","10","29","27","64","26","15","18","20","28"))
write.table(ds_936_interest, file = "C:/ThoughtWorks/ds_936_interest.csv", col.names = TRUE, row.names = FALSE, sep = ",")

As the frist step, we can analyse ineterst frequencies across gender and age groups

The most popular interest for all gender groups is #16. Next two ones are #10 and #29
Interest #16 has simialr ration for age groups for men and women

The second step is to evaluate performance across interests, as the first metric we can use average click rate

For women group across all interests the average click rates are pretty close to each other
For men group #20 interest has higher average click rate than other interests, in particular for “40-44” age group

The second metric is average loss rate

Average loss rate for women is hight then for man across all interests
Two age groups for women make the most contribution into the average lost rate - “40-44”, “45-49”
Across all gender groups the one age segment are common for all interests - “30-34”

Quick analysis of how how conversion rate varies with age, gender

For both gender groups there is one age group tht have the highest conversion rate - “30-34”

What segments company #936 has with high and low cost per acquisition (based on “loss_rate”" feature, “0-10%” means low cost, “90-100%” - high cost“)

For both gender groups “30-34” age segment has the best ratio between low and high costs
Across all age segments the men group has better ratio between low and high costs then women

Any additional ad-hoc analysis can be done via interactive pivot table below

rpivotTable::rpivotTable(ds_936,rows = "gender",cols="age", vals = "Freq",width="800", height="600px")

Company 1178

ds_1178<-ds %>% filter(xyz_campaign_id %in% c("1178"))

As the first step - “click rate”" analysis. “Click rate”" shows how many clicks the company has per 10 000 impressions

Few conclusions can be made from the visualization:

For women group, across all age segments, the most common click rate is 2
For men group between 30 and 39 age the most commom click rate is 1, between 40 and 49 years old is 2
Men group does not have any records with click rate equal 4

The second step is “cost benefit” analysis using “loss rate” column. It shows how many approved conversion a company gets as a return to spent money. If it’s 0 , the return equals to spend. If it’s 100%, all spend is a loss.

Across all segments (gender, age) the ration between low cost and high cost is very slow
Only one age group (30-34) has some records with low cost

To evaluate how various fb_campaigns performs for company #916, let’s start with frequency table - how many times a specific fb_campaign has been performed

ds_1178 %>%
  group_by(fb_campaign_id) %>%
  summarize(frequency = n()) %>%
  arrange(desc(frequency)) %>%
slice(1:20) %>%
pander(split.table = 120)

fb_campaign_id	frequency
144536	6
144562	6
144599	6
144611	6
144636	6
144674	6
144531	5
144532	5
144537	5
144549	5
144554	5
144565	5
144572	5
144580	5
144585	5
144586	5
144592	5
144597	5
144602	5
144622	5

To get more details we can filter dataset for TOP 6 fb_campaign_id whichhave the highest frequency

ds_1178_fb_campaign <- ds_1178 %>% 
  filter(fb_campaign_id %in% c("144536","144562","144599","144611","144636","144674"))
write.table(ds_1178_fb_campaign, file = "C:/ThoughtWorks/ds_1178_fb_campaign.csv", col.names = TRUE, row.names = FALSE, sep = ",")

Bar chart for the data set

All six campaign has a pretty narow target audience
Most part of campaigns were performed using men as a gender segemnt, only one campaign were executed for women
For women group average click rate is higher than for men

To get more information about campaign performance, we can run cost / benefit analysis evaluating the average loss rate

All campaigns have 100% loss for all gender and age groups
Three campaign for men group have approximately similar loss rate (25-50%)

The next step is to perform the similar type of analysis for “interest” feature to identify possible factors impacting the performance with respect to different segments (age/gender)

ds_1178 %>%
  group_by(interest) %>%
  summarize(frequency = n()) %>%
  arrange(desc(frequency)) %>%
  slice(1:10) %>% 
  pander(split.table = 120)

interest	frequency
16	38
10	35
29	35
27	26
28	26
63	26
20	24
64	24
19	23
21	22

For analysis we can take TOP 10 interest and analyse its performance

ds_1178_interest <- ds_1178 %>% 
  filter(interest %in% c("16","10","29","27","28","63","20","64","19","21"))
write.table(ds_1178_interest, file = "C:/ThoughtWorks/ds_1178_interest.csv", col.names = TRUE, row.names = FALSE, sep = ",")

As the frist step, we can analyse ineterst frequencies across gender and age groups

The most popular interest for womens is #16 and for men is #29, both interests have pretty equsl ratio for age groups
16 has simialr ration for age groups for men and women

The second step is to evaluate performance across interests, as the first metric we can use average click rate

For both gender group across all interets have the same structure with regards to age group
Average click rate for women is higher than for men across all interests

The second metric is average loss rate

For almost all interest across age groups the average loss rate is 100%
Few exceptions - for “30-34” group thre interests (21,29,63) have 80% as loss rate

Quick analysis of how how conversion rate varies with age, gender

Across all gender and age segements the structure of conversion rate is simialr, no any anomalies is visible

What segments company #1178 has with high and low cost per acquisition (based on “loss_rate”" feature, “0-10%” means low cost, “90-100%” - high cost“) + For both gender groups and almost all age groups”90-100%" loss rate is prevailing

Only “30-34” group has few cases with los loss rate (“0-10%”“)

Any additional ad-hoc analysis can be done via interactive pivot table below

rpivotTable::rpivotTable(ds_1178,rows = "gender",cols="age", vals = "Freq",width="800", height="600px")

#### Predictive modeling

The objective of the chapter is to create a predictive model that can be used to predict conversion rate for Company #1178. The response varibale (conv_rate) is transformed to binary response feature - if “con_rate” is equal “0-10%” (low rate), response is 0, otherwise is 1 (high rate). It allows us to reduce out task to training and testing the binary classification models.

library(caret)

## Warning: package 'caret' was built under R version 3.3.3

## Loading required package: lattice

ds_1178<-ds %>% filter(xyz_campaign_id %in% c("1178"))
ds_1178<- ds_1178 %>% 
  mutate(response = ifelse(ds_1178$conv_rate == "0-10%" , 0,1))

since our model is predictive, we can remove all “id” columns from the dataser

### remove xyz_campagn_id, fb_campaign_id 
ds_1178_predict<-ds_1178 %>% select( age, gender,interest, Impressions,Clicks,Spent,response)
prop.table(table(ds_1178_predict$response))

## 
##      0      1 
## 0.3872 0.6128

Not all machine learning algorithms can handle factor features, we need to transform it to dummy variables:

#### split to training and testing
set.seed(1234)
ds_1178_predict_Dummy <- dummyVars("~.",data=ds_1178_predict, fullRank=T)
ds_1178_predict_final <- as.data.frame(predict(ds_1178_predict_Dummy,ds_1178_predict))
print(names(ds_1178_predict_final))

## [1] "age35-39"    "age40-44"    "age45-49"    "gender.F"    "interest"   
## [6] "Impressions" "Clicks"      "Spent"       "response"

ds_1178_predict_final$response <- ifelse(ds_1178_predict_final$response==1,'Success','Failure')

outcomeName <- 'response'
predictorsNames <- names(ds_1178_predict_final)[names(ds_1178_predict_final) != outcomeName]

As with most modeling projects, we need to split our data into two portions: a training and a testing portion. By doing this, we can use one portion to teach the model how to recognize success or failure and the other portion to evaluate the model.

set.seed(1234)
splitIndex <- createDataPartition(ds_1178_predict_final[,outcomeName], p = .75, list = FALSE, times = 1)
trainDF <- ds_1178_predict_final[ splitIndex,]
testDF  <- ds_1178_predict_final[-splitIndex,]

As the first model we can use decision tree (rpart library)

set.seed(42)

library(rpart)
library(rpart.plot)

## Warning: package 'rpart.plot' was built under R version 3.3.3

fit <- rpart(response ~ .,
             data = trainDF,
             method = "class",
             control = rpart.control(xval = 12, 
                                     minbucket = 10, 
                                     cp = 0), 
             parms = list(split = "information"))

rpart.plot(fit,tweak=1.5)

The graphical representation of the decision tree is very vivid and it allows to make few important conclusions about what features are of importance to get “Success” as an outcome. In our case it’s “Impressions” and “Interest”

Let’s create another model using C5.0 algorithm. It’s always advisible to test few algorithms and compare the results.

library(C50)

## Warning: package 'C50' was built under R version 3.3.2

outcome_name<-"response"

predictorsNames <- names(ds_1178_predict)[names(ds_1178_predict) != outcomeName]

ds_1178_predict$response<-as.factor(ds_1178_predict$response)

c50_model <- C5.0(x=ds_1178_predict[,-c(1,2,7)], y=ds_1178_predict$response)
summary(c50_model)

## 
## Call:
## C5.0.default(x = ds_1178_predict[, -c(1, 2, 7)], y
##  = ds_1178_predict$response)
## 
## 
## C5.0 [Release 2.07 GPL Edition]      Tue May 23 12:30:26 2017
## -------------------------------
## 
## Class specified by attribute `outcome'
## 
## Read 625 cases (5 attributes) from undefined.data
## 
## Decision tree:
## 
## Impressions > 410310: 1 (166/25)
## Impressions <= 410310:
## :...Impressions <= 73222: 0 (133/48)
##     Impressions > 73222:
##     :...interest <= 21: 1 (104/28)
##         interest > 21:
##         :...interest <= 28: 0 (84/33)
##             interest > 28: 1 (138/53)
## 
## 
## Evaluation on training data (625 cases):
## 
##      Decision Tree   
##    ----------------  
##    Size      Errors  
## 
##       5  187(29.9%)   <<
## 
## 
##     (a)   (b)    <-classified as
##    ----  ----
##     136   106    (a): class 0
##      81   302    (b): class 1
## 
## 
##  Attribute usage:
## 
##  100.00% Impressions
##   52.16% interest
## 
## 
## Time: 0.0 secs

As you can see, C5.0 came to the same conclusion as “rpart” algorithm - “Impressions” and “Interest” exaplin most part of the decision rules.

As an example of model traning and testing we can generalize the previous model and create so called “random forest” - an ensemble technique that can be used for regression and classification.

set.seed(42)

model_rf <- caret::train(response ~ .,
                         data = trainDF,
                         method = "rf",
                         preProcess = c("scale", "center"),
                         trControl = trainControl(method = "repeatedcv", 
                                                  number = 5, 
                                                  repeats = 5, 
                                                  savePredictions = TRUE, 
                                                  verboseIter = FALSE))

## Loading required package: randomForest

## Warning: package 'randomForest' was built under R version 3.3.2

## randomForest 4.6-12

## Type rfNews() to see new features/changes/bug fixes.

## 
## Attaching package: 'randomForest'

## The following object is masked from 'package:BiocGenerics':
## 
##     combine

## The following object is masked from 'package:ggplot2':
## 
##     margin

## The following object is masked from 'package:dplyr':
## 
##     combine

model_rf$finalModel$confusion

##         Failure Success class.error
## Failure      68     114   0.6263736
## Success      68     220   0.2361111

Feature importance

imp <- model_rf$finalModel$importance
imp[order(imp, decreasing = TRUE), ]

## Impressions       Spent      Clicks    interest    gender.F  `age45-49` 
##   49.009356   44.910180   36.447911   36.164429    4.467851    4.245633 
##  `age40-44`  `age35-39` 
##    4.100636    3.803721

# estimate variable importance
importance <- varImp(model_rf, scale = TRUE)
plot(importance)

##### Predicting test data

confusionMatrix(predict(model_rf, testDF), testDF$response)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Failure Success
##    Failure      22      28
##    Success      38      67
##                                           
##                Accuracy : 0.5742          
##                  95% CI : (0.4923, 0.6532)
##     No Information Rate : 0.6129          
##     P-Value [Acc > NIR] : 0.8579          
##                                           
##                   Kappa : 0.0742          
##  Mcnemar's Test P-Value : 0.2679          
##                                           
##             Sensitivity : 0.3667          
##             Specificity : 0.7053          
##          Pos Pred Value : 0.4400          
##          Neg Pred Value : 0.6381          
##              Prevalence : 0.3871          
##          Detection Rate : 0.1419          
##    Detection Prevalence : 0.3226          
##       Balanced Accuracy : 0.5360          
##                                           
##        'Positive' Class : Failure         
##

Model performance can be improved by testing the different model configurations via automatic and manual gris search

Another importante question for predictive modeling is a fearture selection. There are few different ways to do it, for our dataset I selected “correlation” approach. We need to calcualte correlation matrix for all features in the data set (besides “response”) and remove the features with high correlation.

library(corrplot)

## Warning: package 'corrplot' was built under R version 3.3.2

# calculate correlation matrix
corMatMy <- cor(trainDF[, -9])
corrplot(corMatMy, order = "hclust")

There are three highly correlated features - Impressions, Clicks and Spent. We can filter out the features with a correlation higher than 0.7, keeping the feature with the lower mean

#Apply correlation filter at 0.70,
highlyCor <- colnames(testDF[, -9])[findCorrelation(corMatMy, cutoff = 0.7, verbose = TRUE)]

## Compare row 7  and column  8 with corr  0.99 
##   Means:  0.381 vs 0.193 so flagging column 7 
## Compare row 8  and column  6 with corr  0.962 
##   Means:  0.27 vs 0.149 so flagging column 8 
## All correlations <= 0.7

highlyCor

## [1] "Clicks" "Spent"

Based on our analysis we can remove “Clicks” and “Spent” features and keep everything else,

3. Advanced analysis - next steps

One of the possible way to improve the prediction accuracy is to use probabilistic graphical models based on Bayesian rule (Bayesian networks). The main goal is to learn joint probability distribution for “Conversion rate” by factorization it into the conditional probabilities (see below)

The first step is to cretae a graphical model and caluclate CPT (conditional probabilities table) using the dataset

The second step is to run the exact inference algorithm to learn parameters of probability distribution

Use the new data against distribution to get the probabiltiies of conversion rate

Step #1.

FB <- dag(~Int:Age+Int:Gender+Clicks:Int+Clicks:Impres+Clicks:Fb_camp+Conv_rate:Clicks+Conv_rate:Total_conv)
plot( FB )

Probabilistic model :

P(Conv_rate,Total_conv,Clicks,Impress,Fb_camp,Int,Age,Gender)=

P(Age) x P(Gender) x P(Int|Age,Gender) x P(Clicks|Int, Fb_camp,Impres) x P(Total_conv)

CPT tables can be identified from frequency tables using the dataset. For example, apriori probability for age and gender (company #916) can be calculated as:

addmargins(prop.table(table(ds_916$age)))

## 
##     30-34     35-39     40-44     45-49       Sum 
## 0.5370370 0.2222222 0.1111111 0.1296296 1.0000000

addmargins(prop.table(table(ds_916$gender)))

## 
##         M         F       Sum 
## 0.6481481 0.3518519 1.0000000

Condition probability for interest according to our graphical model is:

prop.table((ftable(ds_916$interest, ds_916$gender,ds_916$age)))

##            30-34      35-39      40-44      45-49
##                                                  
## 7  M  0.01851852 0.01851852 0.00000000 0.00000000
##    F  0.00000000 0.00000000 0.00000000 0.00000000
## 10 M  0.00000000 0.00000000 0.00000000 0.00000000
##    F  0.00000000 0.00000000 0.00000000 0.03703704
## 15 M  0.03703704 0.05555556 0.01851852 0.00000000
##    F  0.03703704 0.00000000 0.01851852 0.00000000
## 16 M  0.07407407 0.07407407 0.07407407 0.01851852
##    F  0.05555556 0.00000000 0.00000000 0.00000000
## 18 M  0.00000000 0.00000000 0.00000000 0.00000000
##    F  0.01851852 0.00000000 0.00000000 0.00000000
## 19 M  0.00000000 0.00000000 0.00000000 0.00000000
##    F  0.00000000 0.00000000 0.00000000 0.01851852
## 20 M  0.03703704 0.00000000 0.00000000 0.00000000
##    F  0.00000000 0.01851852 0.00000000 0.00000000
## 21 M  0.00000000 0.00000000 0.00000000 0.01851852
##    F  0.00000000 0.00000000 0.00000000 0.00000000
## 24 M  0.00000000 0.00000000 0.00000000 0.01851852
##    F  0.00000000 0.00000000 0.00000000 0.00000000
## 25 M  0.00000000 0.00000000 0.00000000 0.00000000
##    F  0.00000000 0.01851852 0.00000000 0.00000000
## 27 M  0.01851852 0.00000000 0.00000000 0.00000000
##    F  0.01851852 0.01851852 0.00000000 0.00000000
## 28 M  0.07407407 0.00000000 0.00000000 0.00000000
##    F  0.00000000 0.00000000 0.00000000 0.00000000
## 29 M  0.01851852 0.01851852 0.00000000 0.00000000
##    F  0.01851852 0.00000000 0.00000000 0.00000000
## 30 M  0.01851852 0.00000000 0.00000000 0.00000000
##    F  0.00000000 0.00000000 0.00000000 0.00000000
## 31 M  0.03703704 0.00000000 0.00000000 0.00000000
##    F  0.00000000 0.00000000 0.00000000 0.00000000
## 32 M  0.00000000 0.00000000 0.00000000 0.00000000
##    F  0.01851852 0.00000000 0.00000000 0.00000000
## 63 M  0.00000000 0.00000000 0.00000000 0.00000000
##    F  0.01851852 0.00000000 0.00000000 0.01851852
## 65 M  0.00000000 0.00000000 0.00000000 0.00000000
##    F  0.01851852 0.00000000 0.00000000 0.00000000

Facebook ad campaign - data analysis

Oleg Baydakov

17 May 2017

Experiment Design

The goals of the analysis include (but not limited to):

The dataset used for analysis is the csv file called “conversion_data.csv” stored on the local folder.

The analysis includes the following steps:

1. Data Exploration

2. Data Analysis

3. Advanced analysis - next steps

1. Data exploration

As the first step we need to upload required R libraries and business data from the file:

Data cleansing

a) Identificators to be exculded

b) Missing values

c) Constants

New columns to be added

a) Click performance - how many clicks we have per 10 0000 imperssions

b) Conversion rate

Conversion rate as a factor

c) Evaluate campaign performance based on money spent and approved conversion. If we spent money and got 0 conversion, all money is our loss. Formula for calucaltion:

loss_amount= spent * (1 - approved_conversion / clicks)

d) Loss rate as ratio between loss_amount and spend columns

Writing final dataframe to the file ready for analysis

2. Data analysis

There are 3 companies in the dataset

Since companies are not connected to each other, analysis will be done separetly for every company

Let’s start with company #916

Company #916

As the first step “click rate”" analysis. “Click rate”" shows how many clicks the company has per 10 000 impressions (all graphs below are interactive)

Few conclusions can be made from the visualization:

The second step is “cost benefit” analysis using “loss rate” column. It shows how many approved conversion a company gets as a return to spent money. If it’s 0 , the return equals to spend. If it’s 100%, all spend is a loss.

All geneder and age groups look similar in the terms of return / spend, approximately fifty-fifty. No any conclusion can be made.

To evaluate how various fb_campaigns performs for company #916, let’s start with frequency table - how many times a specific fb_campaign has been performed

Only six fb_campaigns have been executed more than once, to get more details we can filter dataset for fb_campaign_id=104013 (TOP 1 in the list above)

The audience for 104013 campaign is men within 35-39 age group having the same interest (#16). For all three cases conversion rate is low (0 or 1).

To summarize - since most part of the facebook campaign has low frequency (was executed only once), it does not seem possible to make any reliable conclusion as to its performance

The next step is to perform the similar type of analysis for “interest” feature to identify possible factors impacting the performance with respect to different segments (age/gender)

There are two interests dominating for 916 company - #16 and #15.

Let’s evaluate #16 from conversion rate perspective (conversion rate = Approved conversion / Total_Conversion )

From chart above it seems plausible to conclude that #16 interest is relevant to “M” group acros two “age” segments - “35-39” and “40-44”

The simialr analysis for #15 interest

Resonable inference from the chart above is that #15 interest is more relevant for men between 35 and 44 years old and not relevant for women at all

Quick analysis of how how conversion rate varies with age, gender

Few conclusion cab be drawn from the chart:

Any additional ad-hoc analysis can be done via interactive pivot table below

Company #936

As the first step - “click rate”" analysis. “Click rate”" shows how many clicks the company has per 10 000 impressions

Few conclusions can be made from the visualization:

The second step is “cost benefit” analysis using “loss rate” column. It shows how many approved conversion a company gets as a return to spent money. If it’s 0 , the return equals to spend. If it’s 100%, all spend is a loss.

To evaluate how various fb_campaigns performs for company #916, let’s start with frequency table - how many times a specific fb_campaign has been performed

To get more details we can filter dataset for TOP 5 fb_campaign_id

Bar chart for the data set

To get more information about campaign performance, we can run cost / benefit analysis evaluating the average loss rate

The next step is to perform the similar type of analysis for “interest” feature to identify possible factors impacting the performance with respect to different segments (age/gender)

For analysis we can take TOP 10 interest and analyse its performance

As the frist step, we can analyse ineterst frequencies across gender and age groups

The second step is to evaluate performance across interests, as the first metric we can use average click rate

The second metric is average loss rate

Quick analysis of how how conversion rate varies with age, gender

What segments company #936 has with high and low cost per acquisition (based on “loss_rate”" feature, “0-10%” means low cost, “90-100%” - high cost“)

Any additional ad-hoc analysis can be done via interactive pivot table below

Company 1178

As the first step - “click rate”" analysis. “Click rate”" shows how many clicks the company has per 10 000 impressions

Few conclusions can be made from the visualization:

The second step is “cost benefit” analysis using “loss rate” column. It shows how many approved conversion a company gets as a return to spent money. If it’s 0 , the return equals to spend. If it’s 100%, all spend is a loss.

To evaluate how various fb_campaigns performs for company #916, let’s start with frequency table - how many times a specific fb_campaign has been performed

To get more details we can filter dataset for TOP 6 fb_campaign_id whichhave the highest frequency

Bar chart for the data set

To get more information about campaign performance, we can run cost / benefit analysis evaluating the average loss rate

The next step is to perform the similar type of analysis for “interest” feature to identify possible factors impacting the performance with respect to different segments (age/gender)

For analysis we can take TOP 10 interest and analyse its performance

As the frist step, we can analyse ineterst frequencies across gender and age groups

16 has simialr ration for age groups for men and women

The second step is to evaluate performance across interests, as the first metric we can use average click rate

The second metric is average loss rate

Quick analysis of how how conversion rate varies with age, gender

What segments company #1178 has with high and low cost per acquisition (based on “loss_rate”" feature, “0-10%” means low cost, “90-100%” - high cost“) + For both gender groups and almost all age groups”90-100%" loss rate is prevailing

Any additional ad-hoc analysis can be done via interactive pivot table below

since our model is predictive, we can remove all “id” columns from the dataser