rm(list = ls())
library(readxl)
library(stats)
library(car)
library(dplyr)
library(tidyr)
library(infer)
library(ggplot2)
library(ggpubr)
library(plotly)
library(timetk)
library(skimr)
library(hrbrthemes)
library(lubridate)
For plotting purposes and to display decimals in standard form instead of scientific notation.
options(scipen = 15)
adspend_main <- read_excel("Ad Spend AB Testing.xlsx",)
adspend1 <- adspend_main %>% filter(period == "1")
The data is collected from a hotel booking site during an Ad marketing campaign to boost hotel bookings. The data was recorded for the period before the experiment(0),the period during(1) and after the experiment(NA).
period: 0: Pre Experiment, 1: During Experiment. NA: Post Experiment.
assignment: control” part of the experiment group and “treatment” part of the experiment group
Treatment is 0 for units not exposed to the ad campaign and 1 for units exposed to the ad campaign.
Bookings is my target feature and represents the hotel bookings sales that we want to measure the effect of Adspend on
Cost is the expenditure on a particular day on ads for the treatment group.
Before and after the experiment(marketing campaign), there was no marketing expenditure on ads to boost hotel bookings. I assume the company relied more on natural interest or traffic from recommendations.
skim(adspend_main)
| Name | adspend_main |
| Number of rows | 9225 |
| Number of columns | 8 |
| _______________________ | |
| Column type frequency: | |
| character | 1 |
| numeric | 6 |
| POSIXct | 1 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| assignment | 0 | 1 | 7 | 9 | 0 | 2 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| geo | 0 | 1.00 | 50.14 | 28.70 | 1.0 | 25.00 | 50.00 | 75.00 | 100.00 | ▇▇▇▇▇ |
| geo_group | 0 | 1.00 | 1.50 | 0.50 | 1.0 | 1.00 | 2.00 | 2.00 | 2.00 | ▇▁▁▁▇ |
| period | 2292 | 0.75 | 0.40 | 0.49 | 0.0 | 0.00 | 0.00 | 1.00 | 1.00 | ▇▁▁▁▆ |
| treatment | 0 | 1.00 | 0.15 | 0.36 | 0.0 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▂ |
| bookings | 0 | 1.00 | 706.62 | 1245.18 | 104.8 | 166.73 | 272.29 | 596.17 | 14655.46 | ▇▁▁▁▁ |
| cost | 0 | 1.00 | 5.42 | 30.74 | 0.0 | 0.00 | 0.00 | 0.00 | 550.50 | ▇▁▁▁▁ |
Variable type: POSIXct
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| date | 0 | 1 | 2015-01-05 | 2015-04-07 | 2015-02-20 | 93 |
I have realized that there are null values for the “period” variable representing the period after the intervention.Here I anticipate the effects of the ad marketing campaign to have been observed and hence I will not need the data for my exploratory or inferential analysis.
I have gone on to exclude the missing data.
adspend <- adspend_main %>% filter(!is.na(adspend_main$period))
I would like to do an exploratory analysis to get a picture of how bookings performed for the different periods.
For this demonstration, I am going to utilize ggplot2 package to visualize ad spend for the period when the experiment was active. I will only be able to make an inference from what I observe after performing the statistical test.
head(adspend[adspend$period==0,])
## # A tibble: 6 × 8
## date geo geo_group period assignment treatment bookings cost
## <dttm> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl>
## 1 2015-01-05 00:00:00 1 2 0 treatment 0 7327. 0
## 2 2015-01-05 00:00:00 10 1 0 control 0 1927. 0
## 3 2015-01-05 00:00:00 100 1 0 control 0 124. 0
## 4 2015-01-05 00:00:00 11 1 0 control 0 1601. 0
## 5 2015-01-05 00:00:00 12 2 0 treatment 0 1472. 0
## 6 2015-01-05 00:00:00 13 1 0 control 0 1467. 0
tail(adspend[adspend$period==0,])
## # A tibble: 6 × 8
## date geo geo_group period assignment treatment bookings cost
## <dttm> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl>
## 1 2015-02-15 00:00:00 93 2 0 treatment 0 112. 0
## 2 2015-02-15 00:00:00 94 1 0 control 0 106. 0
## 3 2015-02-15 00:00:00 95 1 0 control 0 112. 0
## 4 2015-02-15 00:00:00 96 2 0 treatment 0 129. 0
## 5 2015-02-15 00:00:00 98 2 0 treatment 0 106. 0
## 6 2015-02-15 00:00:00 99 2 0 treatment 0 106. 0
Starting date for period before the marketing campaign is 2015-01-05.
End date for the period before the marketing campaign is 2015-02-15.
head(adspend[adspend$period==1,])
## # A tibble: 6 × 8
## date geo geo_group period assignment treatment bookings cost
## <dttm> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl>
## 1 2015-02-16 00:00:00 1 2 1 treatment 1 9267. 462.
## 2 2015-02-16 00:00:00 10 1 1 control 0 2392. 0
## 3 2015-02-16 00:00:00 100 1 1 control 0 112. 0
## 4 2015-02-16 00:00:00 11 1 1 control 0 1847. 0
## 5 2015-02-16 00:00:00 12 2 1 treatment 1 1890. 89.0
## 6 2015-02-16 00:00:00 13 1 1 control 0 1545. 0
tail(adspend[adspend$period==1,])
## # A tibble: 6 × 8
## date geo geo_group period assignment treatment bookings cost
## <dttm> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl>
## 1 2015-03-15 00:00:00 93 2 1 treatment 1 116. 0.76
## 2 2015-03-15 00:00:00 94 1 1 control 0 132. 0
## 3 2015-03-15 00:00:00 95 1 1 control 0 113. 0
## 4 2015-03-15 00:00:00 97 1 1 control 0 145. 0
## 5 2015-03-15 00:00:00 98 2 1 treatment 1 131. 1.53
## 6 2015-03-15 00:00:00 99 2 1 treatment 1 108. 0.38
Starting date for period during marketing campaign is 2015-02-16.
End date for the period during the marketing campaign is 2015-03-15.
daily_bookings_assignment <- adspend %>% group_by(assignment) %>%
summarise_by_time(date,
total_bookings = sum(bookings),
total_cost = sum(cost),
.by = "day") %>%
ungroup()
The ungroup() function helps us to retain the dimensions of our data after aggregation since we used the summarize function.
Using ggplot2 to visualize the difference between the two assignment groups for the period before and during the ad marketing campaign
daily_bookings_assignment %>%
ggplot(aes(date,total_bookings,
color = assignment)) +
geom_line(mapping = aes(linetype = assignment)) +
labs(x = "Date",
y = "Total Bookings",
title = "Adspend Before and During Marketing Campaign",
guides("Assignment")) +
scale_y_continuous(labels = scales::comma_format()) + theme_light()
From the graph there is a noticeable improvement during the campaign when comparing the treatment and control groups.
I need to investigate this further using a statistical test to find out if the difference is significant.
There are 1391 entries for our control group and 1393 entries for our treatment representing 0.49964 and 0.50035 in proportion respectively. Hence the proportions for our test group are appropriate to proceed.
My variable of interest is bookings.
To determine the appropriate test I first consider if my metric is discrete of which it is continuous.
My sample is large enough but I will still test whether it meets the assumption of normality.
ggqqplot(adspend1$bookings, xlab = "Theoretical Quantiles", ylab =
"Sample Quantiles", title = "QQ-Plot of the Booking Amounts")
The qqplot shows evidence of the data deviating from a normal distribution
shapiro.test(adspend1$bookings)
##
## Shapiro-Wilk normality test
##
## data: adspend1$bookings
## W = 0.50101, p-value < 0.00000000000000022
The shapiro test confirms this with a significant p-value, meaning we reject the null hypothesis that our sample comes from a normal distribution.
I do not know the population variance. I will have to perform a levene test to determine the whether the assumption of homogeneity of variances is met.
It is important to note that there was no marketing cost before the campaign.
daily_bookings_assignment %>% group_by(assignment) %>%
summarize(bookings_var = var(total_bookings), cost_var =
var(total_cost))
## # A tibble: 2 × 3
## assignment bookings_var cost_var
## <chr> <dbl> <dbl>
## 1 control 87718122. 0
## 2 treatment 124813973. 902912.
Hence the variance to be investigated will be the bookings.
leveneTest(bookings ~ assignment, data = adspend1)
## Warning in leveneTest.default(y = y, group = group, ...): group coerced to
## factor.
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 1 2.952 0.08588 .
## 2782
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
From the levene test, I have obtained a non significant result.
Hence I do not have sufficient evidence to reject the null hypothesis that the population variances are equal.
My data has failed the test for normality but is consistent with regard to homogeneity of variances.
However, my sample is large enough and I opt to disregard the outcome of the shapiro wilk test. Hence, I will use the 2 sample students t-test to make an inference from my A/B test data.
H0: There is no difference in mean hotel bookings for periods before and after the marketing campaign.
H1: There is a significant difference in mean hotel bookings for periods before and after the marketing campaign.
ab_adspend <- adspend1 %>% t_test(bookings ~ assignment,
order = c("treatment","control"),
alternative = "two-sided")
ab_adspend
## # A tibble: 1 × 7
## statistic t_df p_value alternative estimate lower_ci upper_ci
## <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl>
## 1 1.92 2703. 0.0545 two.sided 96.2 -1.87 194.
At an alpha level of 0.05 and considering the p-value obtained of 0.0545, we infer that there is not enough evidence to reject the null hypothesis.
However, at an alpha level of 0.1 we can reject the null hypothesis and conclude that there was a significant difference in mean bookings after the adspend was initiated during the marketing campaign. I can ascertain that there was a significant effect from ad spend by looking at my confidence interval of [-1.87,194.31] which has a larger range on the positive spectrum.
avg_te <- ab_adspend$estimate
avg_te
## [1] 96.22111
The test estimate shows that on average the difference in bookings between the control and treatment period was 96.2. This represents an improvement in bookings during the ad spend campaign. The value of 96.2 represents the average treatment effect(ATE) which in our case confirms that the marketing campaign was a success.
sum(adspend1$cost)
## [1] 50000
The total cost on Marketing was Ksh.50000.
adspend1 %>% filter(assignment == "treatment") %>% count()
## # A tibble: 1 × 1
## n
## <int>
## 1 1393
There are 1393 observations for our treatment group.
(Total Observations * Average Treatment effect)/Total Booking Cost
(1393*avg_te)/sum(adspend1$cost)
## [1] 2.68072
Our marketing campaign to ramp up bookings through advertisements yielded an increase in bookings by Kshs. 2.68 for every shilling spent.
adspend %>% group_by(period) %>%
summarize(total_bookings = sum(bookings))
## # A tibble: 2 × 2
## period total_bookings
## <dbl> <dbl>
## 1 0 2412592.
## 2 1 2131682.
adspend %>% group_by(period) %>% count()
## # A tibble: 2 × 2
## # Groups: period [2]
## period n
## <dbl> <int>
## 1 0 4149
## 2 1 2784
Total bookings in the period before the marketing campaign was Kshs 2,412,592 with a total of 4149 bookings made.
The average per booking will be:
2412592/4149
## [1] 581.4876
The average booking before the marketing campaign was Kshs.581.49.
Total bookings in the period during the marketing campaign was Kshs 2,131,682 with 2784 bookings made.
The average booking will be:
2131682/2784
## [1] 765.6904
The average booking during the marketing campaign was 765.69.
Calculating the uplift rate for the average booking:
((765.69/581.49)*100) - 100
## [1] 31.67724
The average booking improved by 31.68% following the ad spend campaign.
The ad marketing campaign was successful and resulted in a return of Kshs. 2.67 for every shilling spent on the marketing campaign.
The average amount spent on a single booking improved by 31.68% following the marketing campaign.