This markdown presents a case study of CausalImpact, a [Google package (for more information)](https://google.github.io/CausalImpact/CausalImpact.html) for R, applied to a particular product at a large financial institution. The goal is to validate how well the package works with Google Analytics data in scenarios where a traditional A/B test, the preferred approach, is not possible, and whether its output can serve as a performance indicator.
To model the post-period, we trained on roughly 35 days of daily data (the code below uses the first 37 observations) exported from Google Analytics for the goal page, which is where the hiring flow of the product of interest begins. The file is in the repository at files/analytics.xlsx and has been altered to preserve the source. We chose the HoltWinters function together with the forecast package. From the resulting model objects we extract the fitted values (the dashed pre-period line, as we will see) and the post-period prediction of what would have happened without the campaign, and we pass this series to CausalImpact as the control. Using an exponential smoothing model as the control helps the inference model considerably.
Most of the institution's products attract customers through several channels. For digital products, several campaigns and media feed the top of the funnel, and all of them (the homepage, the product page, AdWords, Instagram or any other medium) end up on the analyzed page.
From time to time, a product is promoted through a dedicated homepage campaign. This is what we explore here: we examine the increase in traffic on the funnel start page and estimate the improvement attributable to the start of this campaign.
The Google Analytics report is usually exported as .csv or .xlsx; in this case we have an .xlsx file. We often automate the report download.
library(readxl)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
#Reading GA Data
#Data comes from GA manual report for chosen period
campanha <- read_excel("files/analytics.xlsx")
head(campanha)
## # A tibble: 6 x 9
## Página Data `Visual~ `Visuali~ `Tempo ~ Entr~ `Taxa~ `Porce~ `Valo~
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 9fb22be3~ 20180~ 0 1456 0 0 0 0 0
## 2 9fb22be3~ 20180~ 0 1392 0 0 0 0 0
## 3 9fb22be3~ 20180~ 0 1281 0 0 0 0 0
## 4 9fb22be3~ 20180~ 0 658 0 0 0 0 0
## 5 9fb22be3~ 20180~ 0 616 0 0 0 0 0
## 6 9fb22be3~ 20180~ 0 1691 0 0 0 0 0
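# Data treatment: keep only the date and the unique page views
dados <- dplyr::select(campanha, Data, "Visualizações de páginas únicas")
# Drop the last row of the export (assumed here to be a non-daily row, such as a total)
nlinhas <- nrow(dados) - 1
dados <- dados[1:nlinhas, ]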
This model needs a pre-period, that is, the data up to the start of the campaign.
# Create the pre-campaign data for our model
pre.periodo <- data.frame(dados$`Visualizações de páginas únicas`[1:37])
This lets us see how customers behave on the funnel start page. Note the irregularity in the second week, which was caused by a tagging configuration error during that period.
# Convert to a time series with a weekly cycle (frequency = 7)
nts <- ts(pre.periodo, frequency = 7)
plot(nts)
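To make the role of `frequency = 7` concrete, here is a minimal sketch on made-up numbers (the `weekly_pattern` values are hypothetical, not the article's data). It uses `cycle()` to recover each observation's position within the week and averages by position, which is one simple way to look at the weekly profile of a daily series.

```r
# Synthetic daily series (hypothetical values): five identical weeks
weekly_pattern <- c(1400, 1450, 1380, 1300, 1250, 650, 600)
x <- ts(rep(weekly_pattern, times = 5), frequency = 7)

# cycle() gives the position of each observation within its week (1 to 7)
position <- as.integer(cycle(x))

# Averaging by position exposes the weekly profile of the series
profile <- tapply(as.numeric(x), position, mean)
print(profile)
```

On real data the profile will be noisier, and a week affected by a tracking error will pull the averages for its positions up or down.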
Next we use the HoltWinters function (from the stats package) to model our short series. The fit is strongly influenced by the second-week period.
# Keep the model as simple as possible: alpha and gamma could be set explicitly,
# but the default smoothing is good enough here
fit <- HoltWinters(nts)
plot(fit)
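One property of a seasonal `HoltWinters` fit matters for the later steps: the fitted values start one full cycle after the first observation. The sketch below illustrates this on synthetic data (hypothetical values). As an assumption for reproducibility, the smoothing parameters are fixed to arbitrary values, whereas the article lets `HoltWinters` estimate them.

```r
set.seed(1)
# Synthetic daily series (hypothetical values): 37 days with a weekly pattern plus noise
weekly_pattern <- c(1400, 1450, 1380, 1300, 1250, 650, 600)
x <- ts(rep(weekly_pattern, length.out = 37) + rnorm(37, sd = 30), frequency = 7)

# Arbitrary fixed smoothing parameters (an assumption, chosen only for this sketch)
hw <- HoltWinters(x, alpha = 0.3, beta = 0.1, gamma = 0.3)

n_obs <- length(x)            # number of observations
n_fitted <- nrow(hw$fitted)   # number of fitted values

# The first week has no fitted values, so the fitted series is 7 rows shorter
cat("observations:", n_obs, "fitted values:", n_fitted, "\n")
print(start(hw$fitted))       # fitted values begin at week 2, day 1
```

With 37 observations and a 7-day cycle, this leaves 30 fitted values.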
Next we apply the forecast package to our HoltWinters model. We set the forecast horizon to the length of the campaign period, which lasted 23 days.
## Create Model ##
library(forecast)
# Forecast 23 days ahead from the fitted model
modelo <- forecast(fit, h = 23)
plot(modelo)
Then we use the forecast data to complete our time series.
# Same model with a 16-day horizon (not used in the steps below)
dados.forecast <- forecast(fit, h = 16)
# Fitted values (xhat) of the Holt-Winters model
plot(fit$fitted[,1])
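We combine the model's fitted values for the pre-period with the forecast:
# Join fitted values and forecast into x1.
# Note: with a weekly seasonal model, fit$fitted has no values for the first 7 days,
# so it holds 30 values here and x1[1] corresponds to day 8, not day 1.
# 30 fitted values + 23 forecast values give the 53 rows needed to match y.
x1 <- append(fit$fitted[,1], modelo$mean)
plot(ts(x1))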
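Because the fitted values of a seasonal Holt-Winters model begin one week after the first observation, joining them to a forecast by position shifts the control series relative to the observed one. The sketch below shows one possible way to keep both series aligned by date: drop the first week of the response and forecast only the post-period. It is an illustration on synthetic data, not the procedure used in this article, and applying it would change the results reported here. All values and smoothing parameters are hypothetical, and `predict()` from the stats package is used instead of the forecast package to keep the example self-contained.

```r
set.seed(1)
weekly_pattern <- c(1400, 1450, 1380, 1300, 1250, 650, 600)
n_pre <- 37   # pre-campaign days
n_post <- 16  # campaign days
n_all <- n_pre + n_post

# Synthetic observed series (hypothetical values)
y <- rep(weekly_pattern, length.out = n_all) + rnorm(n_all, sd = 30)

# Fit on the pre-period only; parameters fixed arbitrarily for reproducibility
pre <- ts(y[1:n_pre], frequency = 7)
hw <- HoltWinters(pre, alpha = 0.3, beta = 0.1, gamma = 0.3)

fitted_vals <- as.numeric(hw$fitted[, "xhat"])              # days 8 to 37
forecast_vals <- as.numeric(predict(hw, n.ahead = n_post))  # days 38 to 53

# Number of leading days without a fitted value (one full week)
skip <- n_pre - length(fitted_vals)

# Drop those days from the response so each row refers to the same date
x1 <- c(fitted_vals, forecast_vals)
y_aligned <- y[(skip + 1):n_all]
aligned <- cbind(y = y_aligned, x1 = x1)

# The pre- and post-period indices shift accordingly
pre_period <- c(1, n_pre - skip)
post_period <- c(n_pre - skip + 1, nrow(aligned))
print(pre_period)
print(post_period)
```

The cost of this alignment is a shorter pre-period (30 rows instead of 37), which may matter for a series this short.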
We also have the actual data for the period, which shows a noticeable increase in traffic on the start page of the sales funnel. We then convert the combined data to a matrix, the format we pass to the model.
y <- dados$`Visualizações de páginas únicas`
plot(ts(y))
# Consolidate the two series: response y in the first column, control x1 in the second
dados_final <- data.frame(y, x1)
dados_final <- data.matrix(dados_final)
This way we have the data ready to verify the causal effect of the campaign.
### Causal Impact ###
library(CausalImpact)
pre.period <- c(1, 37)
post.period <- c(38, 53)
impact <- CausalImpact(dados_final, pre.period, post.period)
plot(impact)
As the plot above shows, our model (dashed line) has less error on the dates close to the start of the campaign, because they are farther from the period affected by the measurement error.
The package also lets us see the metrics explicitly and can even produce a written report of the results. What matters most to us is the relative effect.
summary(impact)
## Posterior inference {CausalImpact}
##
## Average Cumulative
## Actual 1452 23233
## Prediction (s.d.) 1055 (50) 16881 (808)
## 95% CI [960, 1151] [15367, 18416]
##
## Absolute effect (s.d.) 397 (50) 6352 (808)
## 95% CI [301, 492] [4817, 7866]
##
## Relative effect (s.d.) 38% (4.8%) 38% (4.8%)
## 95% CI [29%, 47%] [29%, 47%]
##
## Posterior tail-area probability p: 0.001
## Posterior prob. of a causal effect: 99.9%
##
## For more details, type: summary(impact, "report")
As mentioned:
summary(impact, "report")
## Analysis report {CausalImpact}
##
##
## During the post-intervention period, the response variable had an average value of approx. 1.45K. By contrast, in the absence of an intervention, we would have expected an average response of 1.06K. The 95% interval of this counterfactual prediction is [0.96K, 1.15K]. Subtracting this prediction from the observed response yields an estimate of the causal effect the intervention had on the response variable. This effect is 0.40K with a 95% interval of [0.30K, 0.49K]. For a discussion of the significance of this effect, see below.
##
## Summing up the individual data points during the post-intervention period (which can only sometimes be meaningfully interpreted), the response variable had an overall value of 23.23K. By contrast, had the intervention not taken place, we would have expected a sum of 16.88K. The 95% interval of this prediction is [15.37K, 18.42K].
##
## The above results are given in terms of absolute numbers. In relative terms, the response variable showed an increase of +38%. The 95% interval of this percentage is [+29%, +47%].
##
## This means that the positive effect observed during the intervention period is statistically significant and unlikely to be due to random fluctuations. It should be noted, however, that the question of whether this increase also bears substantive significance can only be answered by comparing the absolute effect (0.40K) to the original goal of the underlying intervention.
##
## The probability of obtaining this effect by chance is very small (Bayesian one-sided tail-area probability p = 0.001). This means the causal effect can be considered statistically significant.
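The headline figures can be re-derived by hand from the rounded values printed in the summary above. The short check below is only arithmetic on those printed numbers; the 16-day post-period length comes from `post.period <- c(38, 53)`. Because the inputs are rounded, the results match the summary only approximately.

```r
# Rounded values copied from the summary output above
actual_avg <- 1452
pred_avg <- 1055
actual_cum <- 23233
pred_cum <- 16881
n_days <- 53 - 38 + 1  # length of the post-period

# Absolute effect: observed minus counterfactual prediction
abs_effect_avg <- actual_avg - pred_avg
abs_effect_cum <- actual_cum - pred_cum

# Relative effect: absolute effect as a share of the prediction
rel_effect <- abs_effect_avg / pred_avg

cat("absolute effect (average):", abs_effect_avg, "\n")
cat("absolute effect (cumulative):", abs_effect_cum, "\n")
cat("relative effect (%):", round(100 * rel_effect), "\n")
cat("cumulative actual / days:", round(actual_cum / n_days), "\n")
```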
The generated report complements the analysis well and states the result precisely. We can also compare it with the results officially released for the product we studied, which come from the advertising campaign page: exactly 406 leads were attributed to starting the hiring flow.
Since our model has no lead data from the campaign page, we can use this figure as a benchmark for validation. By that measure, this approach performed very well using only the forecast mean.
You will notice that we created a large number of variables in this article. They are not strictly necessary, but they make the walkthrough more didactic.