Read.me
This notebook shows you how to analyze the advertising field
experiment data at Star Digital. To use the notebook, please download
and load data Ad_Experiment.Rdata from Canvas.
The data include the follow variables from the field experiment:
- Purchase: A dummy variable indicating whether the consumer eventually subscribed at Star Digital or not (= 0 if there was no purchase; = 1 if there was a purchase).
- Imp_1 - Imp_6: The number of impressions for each consumer at Website 1 to Website 6
- Test: a factor with 2 levels “Control” and “Treatment”, telling you which users belong to the treatment vs. control group.
## ID Purchase Test Imp_1 Imp_2 Imp_3 Imp_4 Imp_5 Imp_6
## 1 27 0 Treatment 0 0 0 0 0 1
## 2 126 0 Treatment 0 1 0 0 0 0
## 3 167 1 Treatment 0 0 0 1 0 0
## 4 195 1 Treatment 0 1 0 0 0 0
## 5 323 1 Control 0 10 0 0 0 0
## 6 365 0 Treatment 0 0 0 0 0 1
A Bootstrapping Procedure for ATE estimation
Next, we are going to estimate the ATE and its confidence interval with a bootstrapping procedure. A bootstrapping procedure is a subsampling procure where we repeatedly take a sample (with replacement) from the original data. With the bootstrapped samples, we infer the statistics (here the ATE and its confidence intervals). We are able to infer the sampling variability from the repeated sampling.
A sketch of the bootstrapping procedure for the field experiment data is as follows:
- Set the number of bootstrapping repetitions \(S\).
- At a particular repition \(s\):
- obtain a sample data of the same size as the original data with replacement.
- calculate ATE and store it.
- With all the ATEs stored across \(S\), calculate the average ATE and its confidence intervals.
To run the bootstrapping procedure, you are provided with a function
called boot_experiment, shared in the .RData.
This function takes the following inputs:
x: a character specifying the treatment assignment variable in the data frame (the variable must be a factor with two levels “Treatment” and “Control”).y: a character specifying the outcome variable in the data frame.data: the data frame to use for the bootstrapping analysis.S: the no. of bootstrapping repetitions. The default value is \(1000\).nseed: the seed of the R random number generator; used as the input of theset.seed. The default value is \(123456789\). Used as an input of the function for the replication purpose.
You can see the details of the function below. See the specific comments for different lines.
## function (x, y, data, S = 1000, nseed = 123456789) {
##
## #-------------------------------------------------------------------------------
## # 1. Setting some values for the bootstrapping
## #-------------------------------------------------------------------------------
##
## # the size of treatment vs. control group
## # N[1]: Control; N[2]: Treatment
## N <- table(data[,x])
##
## # two index vectors of treated people and control people
## control <- which(data$Test == "Control")
## treat <- which(data$Test == "Treatment")
##
## #-------------------------------------------------------------------------------
## # 2. Start the bootstrapping procedure
## #-------------------------------------------------------------------------------
## # for the experiment, the sizes of the control vs. treatment group are unequal.
## # we use a "blocked" bootstrapping to consider this variability in sample sizes.
## # We sample the same no. of treated vs. control people as in the original data.
## # We must sample the treated and the control people separately.
##
## # set the seeds for replication; you may delete this line.
## set.seed(nseed)
##
## # a vector to store bootstrapped ATE estimates
## # we call it "lift" as per the industry tradition
## lift <- rep(0,S)
##
## for (i in 1:S) {
##
## # To produce bootstrapped samples:
## # A random sample with the same sample size, sampled with replacement.
## # Using the two index variables "Treatment" and "Control".
## # Sampling two vectors of the same length as "Treatment" and "Control" with replacement.
## # we use the sample function. for more details, check ?sample.
## boot_control <- sample(control, N[1], replace = T)
## boot_treatment <- sample(treat, N[2], replace = T)
##
## # To calculate the ATE = mean(Purchase_treat) - mean(Purchase_control)
## lift[i] <- mean(unlist(data[boot_treatment,y]))-
## mean(unlist(data[boot_control,y]))
## }
##
## return(lift)
##
## }
## <bytecode: 0x00000152e216a368>
We can use the boot_experiment function to obtain
bootstrapped values of ATEs. As a tradition in the industry, we call the
ATE a “lift.” We will use the default value for the no. of repetitions
and the seed. Or, we will set S = 1000 and
nseed = 12346789. You may change the values if you
want.
With the lift vector, we can calculate the mean and
confidence intervals of ATEs.
#Get the ATE and bootstrapped CIs
ATE <- mean(lift)
# The 95% confidence interval
CI_95 <- quantile(lift,c(.025,.975))
# The 90% confidence interval
CI_90 <- quantile(lift,c(.05,.95))
#We can also plot "Lift" and the 90% and 95% CI's
hist(lift,breaks = 100)
abline(v=c(CI_90,CI_95),
col=c("blue","blue","red","red"),
lty=rep(2,4),
lwd=rep(2,4))From the histogram plot, we can see that the ATE is significant at the confidence level of \(0.1\), as the \(90\%\) confidence interval does not contain zero.
Calcuating ROI of Online Display Ads
If we believe that the ATE of \(0.0194\) is credible, we can then use the ATE (average lift) to calculate the ROI for further reporting. The ROI calculation requires two values:
- the profits from the conversions;
- the costs of online ads. For this, we use consumers in the treatment group to calculate the costs. As these are the costs (impressions) that will occur if we use the ads in the treatment condition.
Note that the online advertising use a CPM model (costs per thousand impressions). We must divide the overall impressions by \(1,000\) when calculating the costs.
For the returns, the overall expected profits should be: \[Profits=ATE \times N_{Treatment} \times €60\] Where we have €60 as the life time value (CLV) of one subscription (see page 3 of the case).
For the costs, we use the CPM model and do this: \[Costs=\dfrac{Impressions_{1-5}}{1000} \times €25 + \dfrac{Impressions_{6}}{1000} \times €20\] Note that the CPM for Website 1-5 is €25 and Website 6 is €20 (see page 3 of the case).
# sum up the impressions for treatment group consumers
# obtaining a data frame with only treatment group and impressions
Treatment <- subset(AdExp_Data, Test == "Treatment")[,-c(1:3)]
Impressions <- colSums(Treatment)
# different CPMs as shown above: Website 1-5 - 25 and Website 6 - 20
Costs <- sum(Impressions/1000*c(rep(25,5),20))
#The return is the number of people in treatment * the ATE (Lift) * 60 euros
Profits <- ATE*dim(Treatment)[1]*60
# Calculate ROI: Return/Investment = (Profits-Costs)/Costs
ROI <- (Profits-Costs)/Costs
# Show final results
Results <- c(Costs,Profits,ROI)
names(Results) <- c("Costs","Profits","ROI")
Results## Costs Profits ROI
## 4254.390000 26426.110866 5.211492
Overall, with an ROI of \(521.14\%\), the online advertisement is a good strategy for Star Digital. The unconventionally high ROI is largely due to the low costs of online ads, considering the lift of the online ads is relatively small at \(1.94\%\).