Clearing object environment

rm(list = ls())

Loading required packages

library(readxl)
library(stats)
library(car)
library(dplyr)
library(tidyr)
library(infer)
library(ggplot2)
library(ggpubr)
library(plotly)
library(timetk)
library(skimr)
library(hrbrthemes)
library(lubridate)

For plotting purposes and to display decimals in standard form instead of scientific notation.

options(scipen = 15)

Loading data set and filtering to remain with observations for the period when the marketing campaign was done.

adspend_main <- read_excel("Ad Spend AB Testing.xlsx",)

adspend1 <- adspend_main %>% filter(period == "1")

DATA DESCRIPTION AND EXPLORATION

The data is collected from a hotel booking site during an Ad marketing campaign to boost hotel bookings. The data was recorded for the period before the experiment(0),the period during(1) and after the experiment(NA).

Description of variables

period: 0: Pre Experiment, 1: During Experiment. NA: Post Experiment.

assignment: control” part of the experiment group and “treatment” part of the experiment group

Treatment is 0 for units not exposed to the ad campaign and 1 for units exposed to the ad campaign.

Bookings is my target feature and represents the hotel bookings sales that we want to measure the effect of Adspend on

Cost is the expenditure on a particular day on ads for the treatment group.

Before and after the experiment(marketing campaign), there was no marketing expenditure on ads to boost hotel bookings. I assume the company relied more on natural interest or traffic from recommendations.

DATA QUALITY CHECK

Data quality check using skimr package.

skim(adspend_main)
Data summary
Name adspend_main
Number of rows 9225
Number of columns 8
_______________________
Column type frequency:
character 1
numeric 6
POSIXct 1
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
assignment 0 1 7 9 0 2 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
geo 0 1.00 50.14 28.70 1.0 25.00 50.00 75.00 100.00 ▇▇▇▇▇
geo_group 0 1.00 1.50 0.50 1.0 1.00 2.00 2.00 2.00 ▇▁▁▁▇
period 2292 0.75 0.40 0.49 0.0 0.00 0.00 1.00 1.00 ▇▁▁▁▆
treatment 0 1.00 0.15 0.36 0.0 0.00 0.00 0.00 1.00 ▇▁▁▁▂
bookings 0 1.00 706.62 1245.18 104.8 166.73 272.29 596.17 14655.46 ▇▁▁▁▁
cost 0 1.00 5.42 30.74 0.0 0.00 0.00 0.00 550.50 ▇▁▁▁▁

Variable type: POSIXct

skim_variable n_missing complete_rate min max median n_unique
date 0 1 2015-01-05 2015-04-07 2015-02-20 93

I have realized that there are null values for the “period” variable representing the period after the intervention.Here I anticipate the effects of the ad marketing campaign to have been observed and hence I will not need the data for my exploratory or inferential analysis.

I have gone on to exclude the missing data.

adspend <- adspend_main %>% filter(!is.na(adspend_main$period))

DATA EXPLORATION

I would like to do an exploratory analysis to get a picture of how bookings performed for the different periods.

For this demonstration, I am going to utilize ggplot2 package to visualize ad spend for the period when the experiment was active. I will only be able to make an inference from what I observe after performing the statistical test.

Identifying the start and end dates for these periods using R.

Period Before
head(adspend[adspend$period==0,])
## # A tibble: 6 × 8
##   date                  geo geo_group period assignment treatment bookings  cost
##   <dttm>              <dbl>     <dbl>  <dbl> <chr>          <dbl>    <dbl> <dbl>
## 1 2015-01-05 00:00:00     1         2      0 treatment          0    7327.     0
## 2 2015-01-05 00:00:00    10         1      0 control            0    1927.     0
## 3 2015-01-05 00:00:00   100         1      0 control            0     124.     0
## 4 2015-01-05 00:00:00    11         1      0 control            0    1601.     0
## 5 2015-01-05 00:00:00    12         2      0 treatment          0    1472.     0
## 6 2015-01-05 00:00:00    13         1      0 control            0    1467.     0
tail(adspend[adspend$period==0,])
## # A tibble: 6 × 8
##   date                  geo geo_group period assignment treatment bookings  cost
##   <dttm>              <dbl>     <dbl>  <dbl> <chr>          <dbl>    <dbl> <dbl>
## 1 2015-02-15 00:00:00    93         2      0 treatment          0     112.     0
## 2 2015-02-15 00:00:00    94         1      0 control            0     106.     0
## 3 2015-02-15 00:00:00    95         1      0 control            0     112.     0
## 4 2015-02-15 00:00:00    96         2      0 treatment          0     129.     0
## 5 2015-02-15 00:00:00    98         2      0 treatment          0     106.     0
## 6 2015-02-15 00:00:00    99         2      0 treatment          0     106.     0

Starting date for period before the marketing campaign is 2015-01-05.

End date for the period before the marketing campaign is 2015-02-15.

Period During the campaign
head(adspend[adspend$period==1,])
## # A tibble: 6 × 8
##   date                  geo geo_group period assignment treatment bookings  cost
##   <dttm>              <dbl>     <dbl>  <dbl> <chr>          <dbl>    <dbl> <dbl>
## 1 2015-02-16 00:00:00     1         2      1 treatment          1    9267. 462. 
## 2 2015-02-16 00:00:00    10         1      1 control            0    2392.   0  
## 3 2015-02-16 00:00:00   100         1      1 control            0     112.   0  
## 4 2015-02-16 00:00:00    11         1      1 control            0    1847.   0  
## 5 2015-02-16 00:00:00    12         2      1 treatment          1    1890.  89.0
## 6 2015-02-16 00:00:00    13         1      1 control            0    1545.   0
tail(adspend[adspend$period==1,])
## # A tibble: 6 × 8
##   date                  geo geo_group period assignment treatment bookings  cost
##   <dttm>              <dbl>     <dbl>  <dbl> <chr>          <dbl>    <dbl> <dbl>
## 1 2015-03-15 00:00:00    93         2      1 treatment          1     116.  0.76
## 2 2015-03-15 00:00:00    94         1      1 control            0     132.  0   
## 3 2015-03-15 00:00:00    95         1      1 control            0     113.  0   
## 4 2015-03-15 00:00:00    97         1      1 control            0     145.  0   
## 5 2015-03-15 00:00:00    98         2      1 treatment          1     131.  1.53
## 6 2015-03-15 00:00:00    99         2      1 treatment          1     108.  0.38

Starting date for period during marketing campaign is 2015-02-16.

End date for the period during the marketing campaign is 2015-03-15.

Bookings by assignment over time aggregated for each day.

daily_bookings_assignment <- adspend %>% group_by(assignment) %>%
                             summarise_by_time(date, 
                                               total_bookings = sum(bookings),
                                               total_cost = sum(cost),
                                               .by = "day") %>% 
                             ungroup()

The ungroup() function helps us to retain the dimensions of our data after aggregation since we used the summarize function.

Using ggplot2 to visualize the difference between the two assignment groups for the period before and during the ad marketing campaign

daily_bookings_assignment %>% 
                          ggplot(aes(date,total_bookings, 
                          color = assignment)) + 
                          geom_line(mapping = aes(linetype = assignment)) + 
                          labs(x = "Date",
                               y = "Total Bookings", 
            title = "Adspend Before and During Marketing Campaign", 
                          guides("Assignment")) + 
            scale_y_continuous(labels = scales::comma_format()) + theme_light()

From the graph there is a noticeable improvement during the campaign when comparing the treatment and control groups.

I need to investigate this further using a statistical test to find out if the difference is significant.

TESTING ASSUMPTIONS TO DETERMINE APPROPRIATE TEST

There are 1391 entries for our control group and 1393 entries for our treatment representing 0.49964 and 0.50035 in proportion respectively. Hence the proportions for our test group are appropriate to proceed.

My variable of interest is bookings.

To determine the appropriate test I first consider if my metric is discrete of which it is continuous.

Test for normality

My sample is large enough but I will still test whether it meets the assumption of normality.

QQ Plot
ggqqplot(adspend1$bookings, xlab = "Theoretical Quantiles", ylab =
"Sample Quantiles", title = "QQ-Plot of the Booking Amounts")

The qqplot shows evidence of the data deviating from a normal distribution

Parametric Test
shapiro.test(adspend1$bookings)
## 
##  Shapiro-Wilk normality test
## 
## data:  adspend1$bookings
## W = 0.50101, p-value < 0.00000000000000022

The shapiro test confirms this with a significant p-value, meaning we reject the null hypothesis that our sample comes from a normal distribution.

Test for homoscedasticity

I do not know the population variance. I will have to perform a levene test to determine the whether the assumption of homogeneity of variances is met.

It is important to note that there was no marketing cost before the campaign.

daily_bookings_assignment %>% group_by(assignment) %>%
summarize(bookings_var = var(total_bookings), cost_var =
var(total_cost))
## # A tibble: 2 × 3
##   assignment bookings_var cost_var
##   <chr>             <dbl>    <dbl>
## 1 control       87718122.       0 
## 2 treatment    124813973.  902912.

Hence the variance to be investigated will be the bookings.

leveneTest(bookings ~ assignment, data = adspend1)
## Warning in leveneTest.default(y = y, group = group, ...): group coerced to
## factor.
## Levene's Test for Homogeneity of Variance (center = median)
##         Df F value  Pr(>F)  
## group    1   2.952 0.08588 .
##       2782                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

From the levene test, I have obtained a non significant result.

Hence I do not have sufficient evidence to reject the null hypothesis that the population variances are equal.

My data has failed the test for normality but is consistent with regard to homogeneity of variances.

However, my sample is large enough and I opt to disregard the outcome of the shapiro wilk test. Hence, I will use the 2 sample students t-test to make an inference from my A/B test data.

PERFORMING THE 2 SAMPLE T-TEST

Hypothesis

H0: There is no difference in mean hotel bookings for periods before and after the marketing campaign.

H1: There is a significant difference in mean hotel bookings for periods before and after the marketing campaign.

Calculating test statistics

ab_adspend <- adspend1 %>% t_test(bookings ~ assignment,
                                  order = c("treatment","control"),
                                  alternative = "two-sided")
ab_adspend
## # A tibble: 1 × 7
##   statistic  t_df p_value alternative estimate lower_ci upper_ci
##       <dbl> <dbl>   <dbl> <chr>          <dbl>    <dbl>    <dbl>
## 1      1.92 2703.  0.0545 two.sided       96.2    -1.87     194.

At an alpha level of 0.05 and considering the p-value obtained of 0.0545, we infer that there is not enough evidence to reject the null hypothesis.

However, at an alpha level of 0.1 we can reject the null hypothesis and conclude that there was a significant difference in mean bookings after the adspend was initiated during the marketing campaign. I can ascertain that there was a significant effect from ad spend by looking at my confidence interval of [-1.87,194.31] which has a larger range on the positive spectrum.

INTERPRETATION OF RESULTS

What is the mean difference in bookings between the two periods? This is our test estimate.

avg_te <- ab_adspend$estimate
avg_te
## [1] 96.22111

The test estimate shows that on average the difference in bookings between the control and treatment period was 96.2. This represents an improvement in bookings during the ad spend campaign. The value of 96.2 represents the average treatment effect(ATE) which in our case confirms that the marketing campaign was a success.

Return on Adspend (ROA) per unit of money(say Kshs) used.

sum(adspend1$cost)
## [1] 50000

The total cost on Marketing was Ksh.50000.

adspend1 %>% filter(assignment == "treatment") %>% count()
## # A tibble: 1 × 1
##       n
##   <int>
## 1  1393

There are 1393 observations for our treatment group.

Return on adspend per dollar

(Total Observations * Average Treatment effect)/Total Booking Cost

(1393*avg_te)/sum(adspend1$cost)
## [1] 2.68072

Our marketing campaign to ramp up bookings through advertisements yielded an increase in bookings by Kshs. 2.68 for every shilling spent.

adspend %>% group_by(period) %>% 
  summarize(total_bookings = sum(bookings))
## # A tibble: 2 × 2
##   period total_bookings
##    <dbl>          <dbl>
## 1      0       2412592.
## 2      1       2131682.
adspend %>% group_by(period) %>% count()
## # A tibble: 2 × 2
## # Groups:   period [2]
##   period     n
##    <dbl> <int>
## 1      0  4149
## 2      1  2784

Total bookings in the period before the marketing campaign was Kshs 2,412,592 with a total of 4149 bookings made.

The average per booking will be:

2412592/4149
## [1] 581.4876

The average booking before the marketing campaign was Kshs.581.49.

Total bookings in the period during the marketing campaign was Kshs 2,131,682 with 2784 bookings made.

The average booking will be:

2131682/2784
## [1] 765.6904

The average booking during the marketing campaign was 765.69.

What is the Uplift per booking following the Marketing Campaign?

Calculating the uplift rate for the average booking:

((765.69/581.49)*100) - 100
## [1] 31.67724

The average booking improved by 31.68% following the ad spend campaign.

CONCLUSION

The ad marketing campaign was successful and resulted in a return of Kshs. 2.67 for every shilling spent on the marketing campaign.

The average amount spent on a single booking improved by 31.68% following the marketing campaign.