Introduction

Description

Marketing companies strive to execute effectively targeted campaigns in a complex market environment, where predicting audience preferences can be challenging. To tackle this issue, A/B testing is commonly employed. A/B testing is a randomized experimentation method where two or more variations of a variable (such as a webpage, page element, or advertisement) are shown simultaneously to different individuals from various segments. This approach helps identify which version has the most significant impact and attracts more customers.

Source: https://www.kaggle.com/datasets/faviovaz/marketing-ab-testing/data

Business Question

The primary goals of this project are:

  1. Campaign Effectiveness Evaluation: How can we determine if the campaign is successful?
  2. Success Attribution Measurement: If the campaign is successful, what portion of that success can be attributed to the ads?

To achieve these objectives, we will conduct a comprehensive analysis of the provided dataset, focusing on how ad exposure influences user conversions. This analysis will help us understand the effectiveness of the campaign and measure the contribution of the ads to its overall success.

Variable Description

1. Data Preparation

1.1 Prerequisites

1.2 Importing Libraries

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(lubridate)
library(plotly)
## 
## Attaching package: 'plotly'
## 
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following object is masked from 'package:graphics':
## 
##     layout
library(scales)
## 
## Attaching package: 'scales'
## 
## The following object is masked from 'package:purrr':
## 
##     discard
## 
## The following object is masked from 'package:readr':
## 
##     col_factor
library(glue)
library(ggplot2)
library(gridExtra)
## 
## Attaching package: 'gridExtra'
## 
## The following object is masked from 'package:dplyr':
## 
##     combine
library(viridis)
## Loading required package: viridisLite
## 
## Attaching package: 'viridis'
## 
## The following object is masked from 'package:scales':
## 
##     viridis_pal
library(patchwork)

1.3 Importing Data

data <- read.csv("marketing_AB.csv")
head(data)
##   no user.id test.group converted total.ads most.ads.day most.ads.hour
## 1  0 1069124         ad     FALSE       130       Monday            20
## 2  1 1119715         ad     FALSE        93      Tuesday            22
## 3  2 1144181         ad     FALSE        21      Tuesday            18
## 4  3 1435133         ad     FALSE       355      Tuesday            10
## 5  4 1015700         ad     FALSE       276       Friday            14
## 6  5 1137664         ad     FALSE       734     Saturday            10

1.4 Data Inspection

str(data)
## 'data.frame':    588101 obs. of  7 variables:
##  $ no           : int  0 1 2 3 4 5 6 7 8 9 ...
##  $ user.id      : int  1069124 1119715 1144181 1435133 1015700 1137664 1116205 1496843 1448851 1446284 ...
##  $ test.group   : chr  "ad" "ad" "ad" "ad" ...
##  $ converted    : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ total.ads    : int  130 93 21 355 276 734 264 17 21 142 ...
##  $ most.ads.day : chr  "Monday" "Tuesday" "Tuesday" "Tuesday" ...
##  $ most.ads.hour: int  20 22 18 10 14 10 13 18 19 14 ...

2. Exploratory Data Analysis

data1 <- data %>%
  select(-no, -user.id) %>%
  mutate(converted = as.integer(converted))
head(data1)
##   test.group converted total.ads most.ads.day most.ads.hour
## 1         ad         0       130       Monday            20
## 2         ad         0        93      Tuesday            22
## 3         ad         0        21      Tuesday            18
## 4         ad         0       355      Tuesday            10
## 5         ad         0       276       Friday            14
## 6         ad         0       734     Saturday            10
str(data1)
## 'data.frame':    588101 obs. of  5 variables:
##  $ test.group   : chr  "ad" "ad" "ad" "ad" ...
##  $ converted    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ total.ads    : int  130 93 21 355 276 734 264 17 21 142 ...
##  $ most.ads.day : chr  "Monday" "Tuesday" "Tuesday" "Tuesday" ...
##  $ most.ads.hour: int  20 22 18 10 14 10 13 18 19 14 ...
unique(data1$converted)
## [1] 0 1
unique(data1$test.group )
## [1] "ad"  "psa"
table(data1$converted)
## 
##      0      1 
## 573258  14843

2.1 Missing Value

colSums(is.na(data1))
##    test.group     converted     total.ads  most.ads.day most.ads.hour 
##             0             0             0             0             0

3. Data Wrangling

3.1 Exploring Categorical Variables

categorical_columns <- c('test.group', 'converted', 'most.ads.day', 'most.ads.hour')


levels <- lapply(categorical_columns, function(col) {
  unique(data1[[col]])
})


names(levels) <- categorical_columns


for (col in categorical_columns) {
  cat(paste(col, ":", paste(levels[[col]], collapse = ", ")), "\n")
}
## test.group : ad, psa 
## converted : 0, 1 
## most.ads.day : Monday, Tuesday, Friday, Saturday, Wednesday, Sunday, Thursday 
## most.ads.hour : 20, 22, 18, 10, 14, 13, 19, 11, 12, 16, 21, 3, 23, 4, 8, 0, 2, 15, 1, 6, 17, 7, 9, 5

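The code above processes data1, a data frame with 588,101 observations and 5 variables, in four steps: it lists the columns to be treated as categorical, collects the unique values of each column with lapply() and unique(), names the results by column, and prints them one column per line.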

Overall, this code aims to identify and display the unique values of the categorical columns in a data frame, which can be useful for understanding the distribution or variation of data within each categorical column.

3.2 Bootstrap Analysis

3.2.1 Conversion Rate by Group

# # Define sampling size
#  sampling <- 5000
# 
#  # Initialize vectors to store results
#  control_group1 <- NULL
#  test_group1 <- NULL
# 
#  # Set seed for reproducibility
#  set.seed(100)
# 
#  # Bootstrap sampling for control group
#  for(i in 1:sampling) {
#     control_group2 <- length(control_group$converted)
#     control_group3 <- control_group[sample(1:control_group2, replace=TRUE),]
#     control_group4 <- sum(control_group3$converted == 1) / control_group2
# 
#     control_group1 <- c(control_group1, control_group4)
#  }
# 
#  # Bootstrap sampling for test group
#  for(i in 1:sampling) {
#     test_group2 <- length(test_group$converted)
#     test_group3 <- test_group[sample(1:test_group2, replace=TRUE),]
#     test_group4 <- sum(test_group3$converted == 1) / test_group2
# 
#     test_group1 <- c(test_group1, test_group4)
#  }

# Save the results
  # saveRDS(test_group1, "data_test_group.rds")
  # saveRDS(control_group1, "data_control_group.rds")

# Read the saved RDS files
data_test_group1 <- readRDS("data_test_group.rds")
data_control_group1 <- readRDS("data_control_group.rds")

# Combine the data for plotting
data_converted <- data.frame(
  Distribution = c(rep("PSA", length(data_control_group1)), rep("AD", length(data_test_group1))),
  value = c(data_control_group1, data_test_group1))

The commented-out code uses the bootstrap to estimate the sampling distribution of the conversion rate (converted) for the two groups, control_group (psa) and test_group (ad), which are defined in Section 3.3.1. Each of the 5,000 bootstrap replicates resamples a group's rows with replacement and records the share of converted users. Because the resampling is slow, the results were saved once as RDS files and are read back here. They are then combined into a single data frame for visualization and for comparing conversion rates between the control group and the test group.
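Since converted is a 0/1 indicator, the resampling loops above can be sketched more compactly. The snippet below is a minimal illustration, not the code that produced the saved RDS files: bootstrap_rate is a hypothetical helper, and the two simulated vectors stand in for test_group$converted and control_group$converted with assumed sizes and rates.

```r
# Minimal sketch: bootstrap the conversion rate of one group.
# `bootstrap_rate` is a hypothetical helper, not part of the original analysis.
bootstrap_rate <- function(converted, n_boot = 5000) {
  n <- length(converted)
  # Each replicate resamples n users with replacement and takes the mean
  replicate(n_boot, mean(sample(converted, size = n, replace = TRUE)))
}

set.seed(100)
# Simulated stand-ins for test_group$converted and control_group$converted
# (sizes and rates are illustrative assumptions, not the real data)
sim_test    <- rbinom(2000, size = 1, prob = 0.025)
sim_control <- rbinom(2000, size = 1, prob = 0.018)

boot_test    <- bootstrap_rate(sim_test, n_boot = 1000)
boot_control <- bootstrap_rate(sim_control, n_boot = 1000)

# The bootstrap mean should sit close to the observed sample rate
print(c(observed = mean(sim_test), bootstrap = mean(boot_test)))
```

Using replicate() avoids growing a vector inside a loop, which is the main cost of the original approach on a data frame of this size.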

data_converted %>%
  ggplot(aes(x = value, fill = Distribution)) + 
  geom_area(stat = "density", alpha = 0.7, position = "identity") +
  xlab("Conversion Rate") +
  ggtitle("Conversion Rate Distribution by Group") +
  labs(fill = NULL ) +
  theme_minimal() +
  scale_fill_manual(values = c("PSA" = "steelblue", "AD" = "tomato"))

The density plot shows the distribution of conversion rates for two groups, labeled AD (experimental group) and PSA (control group). The “AD” group has a higher conversion rate overall, with a narrower and taller peak around 0.024. The “PSA” group has a lower conversion rate, with a wider and shorter peak around 0.018. This suggests that the “AD” group is more effective in converting users compared to the “PSA” group.

3.2.2 Conversion Rate All Distribution

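The primary goal of this analysis is to quantify the disparity in conversion rates between the test and control groups. To do so, we calculate the absolute difference between the paired bootstrap estimates of the two groups, expressed in percentage points, and plot its distribution.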

# Absolute difference between paired bootstrap estimates, in percentage points
data_difference1 <- as.data.frame(cbind(data_test_group1, data_control_group1)) %>% 
  mutate(diff1 = round(abs(data_test_group1 - data_control_group1) * 100, 2))

# Fill is a fixed colour, so no fill scale or fill legend is needed
data_difference1 %>%
  ggplot(aes(x = diff1)) +
  geom_density(color = "#e9ecef", fill = "steelblue", alpha = 0.7) +
  xlab("Conversion Rate Diff") +
  ggtitle("Conversion Rate Distribution") +
  theme_minimal()
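Because diff1 is an absolute value, it hides which group is ahead. A signed difference with a percentile interval is one way to check whether zero is a plausible value. The sketch below uses simulated vectors in place of data_test_group1 and data_control_group1; their means and spreads are assumptions chosen only for illustration.

```r
set.seed(42)
# Simulated stand-ins for data_test_group1 and data_control_group1
# (means and spreads are illustrative assumptions)
sim_boot_test    <- rnorm(5000, mean = 0.025, sd = 0.0002)
sim_boot_control <- rnorm(5000, mean = 0.018, sd = 0.0009)

# Signed difference in percentage points (test minus control)
diff_pp <- (sim_boot_test - sim_boot_control) * 100

# 95% percentile interval of the difference
ci <- quantile(diff_pp, probs = c(0.025, 0.975))

# Share of replicates in which the test group beats the control group
share_positive <- mean(diff_pp > 0)

print(ci)
print(share_positive)
```

If the whole interval lies above zero, the bootstrap supports a higher conversion rate in the test group.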

3.3 Univariate Analysis

3.3.1 T-Test

control_group <- data1 %>% filter(test.group == "psa")
test_group <- data1 %>% filter(test.group == "ad")


t_test_result <- t.test(test_group$converted, control_group$converted)


cat("Test Group Analysis - T-statistic:", t_test_result$statistic, ", P-value:", t_test_result$p.value, "\n")
## Test Group Analysis - T-statistic: 8.657162 , P-value: 5.107608e-18
p_val <- t_test_result$p.value


if (p_val < 0.05) {
  cat("Reject the null hypothesis: There is a significant difference in conversion rates between the 'ad' and 'psa' groups.\n")
} else {
  cat("Fail to reject the null hypothesis: There is no significant difference in conversion rates between the 'ad' and 'psa' groups.\n")
}
## Reject the null hypothesis: There is a significant difference in conversion rates between the 'ad' and 'psa' groups.

This conclusion is based on a t-statistic of 8.657162 and a p-value of 5.107608e-18. Such a small p-value indicates that a difference this large would be very unlikely if the two groups truly converted at the same rate. We therefore reject the null hypothesis of no difference between the groups. Because the t-statistic is positive and the “ad” group is the first argument to t.test(), the “ad” group has the significantly higher conversion rate.
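Since converted is binary, a two-proportion test is a common alternative to the t-test and also returns a confidence interval for the difference in rates. The sketch below uses prop.test() from base R with hypothetical counts; in the analysis they would be the number of conversions and the number of users in each group.

```r
# Hypothetical counts for illustration only (not taken from the dataset)
conversions <- c(ad = 250, psa = 180)
users       <- c(ad = 10000, psa = 10000)

# Two-sided test of equal proportions with a 95% CI for the difference
prop_result <- prop.test(x = conversions, n = users)

# Difference in conversion rates (ad minus psa)
rate_diff <- unname(prop_result$estimate[1] - prop_result$estimate[2])

cat("Difference in rates:", rate_diff, "\n")
cat("P-value:", prop_result$p.value, "\n")
cat("95% CI:", prop_result$conf.int[1], "to", prop_result$conf.int[2], "\n")
```

With samples this large the two tests usually agree closely, but the interval makes the size of the effect easier to read than the test statistic alone.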

3.3.2 Analysis of Variance (Anova)

anova_result_day <- aov(converted ~ `most.ads.day`, data = data1)


anova_summary <- summary(anova_result_day)


f_statistic <- anova_summary[[1]]["most.ads.day", "F value"]
p_value <- anova_summary[[1]]["most.ads.day", "Pr(>F)"]

cat("Most Ads Day Analysis - F-statistic:", f_statistic, ", P-value:", p_value, "\n")
## Most Ads Day Analysis - F-statistic: 68.38818 , P-value: 1.803201e-85
if (p_value < 0.05) {
  cat("Reject the null hypothesis: The day with the most ads seen significantly affects the conversion rate.\n")
} else {
  cat("Fail to reject the null hypothesis: The day with the most ads seen does not significantly affect conversion rates.\n")
}
## Reject the null hypothesis: The day with the most ads seen significantly affects the conversion rate.

The statistical analysis indicates that there is a significant difference in conversion rates based on the day with the most ads seen. This conclusion is supported by an F-statistic of 68.38818 and a p-value of 1.803201e-85. The low p-value suggests that the observed difference is highly unlikely to be due to chance. Therefore, we can reject the null hypothesis (which states that there is no difference in conversion rates based on the day with the most ads seen) and conclude that the day with the most ads significantly affects the conversion rate.

anova_result_hour <- aov(converted ~ as.factor(most.ads.hour), data = data1)


anova_summary_hour <- summary(anova_result_hour)


f_statistic_hour <- anova_summary_hour[[1]]$`F value`[1]
p_value_hour <- anova_summary_hour[[1]]$`Pr(>F)`[1]


cat("Most Ads Hour Analysis - F-statistic:", f_statistic_hour, ", P-value:", p_value_hour, "\n")
## Most Ads Hour Analysis - F-statistic: 18.74204 , P-value: 7.482025e-77
if (p_value_hour < 0.05) {
  cat("Reject the null hypothesis: The hour with the most ads seen significantly affects the conversion rate.\n")
} else {
  cat("Fail to reject the null hypothesis: The hour with the most ads seen does not significantly affect conversion rates.\n")
}
## Reject the null hypothesis: The hour with the most ads seen significantly affects the conversion rate.

The statistical analysis shows that there is a significant difference in conversion rates based on the hour with the most ads seen. This conclusion is supported by an F-statistic of 18.74204 and a p-value of 7.482025e-77. The low p-value suggests that the observed difference is highly unlikely to be due to chance. Therefore, we can reject the null hypothesis (which states that there is no difference in conversion rates based on the hour with the most ads seen) and conclude that the hour with the most ads significantly affects the conversion rate.

df_subset <- data1 %>% filter(total.ads < 50)


# The last bin is labelled 41-49 because df_subset only keeps total.ads < 50
df_subset <- df_subset %>%
  mutate(total_ads_bin = cut(total.ads, 
                             breaks = c(-1, 1, 5, 10, 20, 30, 40, 50), 
                             labels = c('0-1', '2-5', '6-10', '11-20', '21-30', '31-40', '41-49')))


anova_result_bin <- aov(converted ~ total_ads_bin, data = df_subset)


anova_summary_bin <- summary(anova_result_bin)


f_statistic_bin <- anova_summary_bin[[1]]$`F value`[1]
p_value_bin <- anova_summary_bin[[1]]$`Pr(>F)`[1]


cat("Total Ads (Binned) Analysis - F-statistic:", f_statistic_bin, ", P-value:", p_value_bin, "\n")
## Total Ads (Binned) Analysis - F-statistic: 1245.669 , P-value: 0
if (p_value_bin < 0.05) {
  cat("Reject the null hypothesis: There is a significant difference in conversion rates among different levels of total ads seen (binned).\n")
} else {
  cat("Fail to reject the null hypothesis: The number of ads seen (binned) does not significantly affect conversion rates.\n")
}
## Reject the null hypothesis: There is a significant difference in conversion rates among different levels of total ads seen (binned).

The statistical analysis demonstrates that there is a significant difference in conversion rates across the binned levels of total ads seen, supported by an F-statistic of 1245.669. The p-value is printed as 0 because it is smaller than the smallest number R can represent in double precision, not because it is exactly zero. We can therefore reject the null hypothesis (which states that there is no difference in conversion rates based on the total ads seen) and conclude that conversion rates differ across levels of total ads seen. Because the number of ads seen was not randomized, this is an association rather than proof that more ads cause more conversions, and it applies only to users who saw fewer than 50 ads.

contingency_table_day <- table(data1$most.ads.day, data1$converted)


chi2_test <- chisq.test(contingency_table_day)


p_val <- chi2_test$p.value


alpha <- 0.05


if (p_val < alpha) {
    print("The p-value is less than 0.05, indicating a significant relationship between 'most ads day' and 'converted'.")
} else {
    print("The p-value is greater than 0.05, indicating no significant relationship between 'most ads day' and 'converted'.")
}
## [1] "The p-value is less than 0.05, indicating a significant relationship between 'most ads day' and 'converted'."

The chi-square test of independence gives a p-value below 0.05, so the relationship between “most ads day” and “converted” is statistically significant: an association this strong would be unlikely if the two variables were independent. This agrees with the ANOVA result for most.ads.day above.

However, it’s important to note that statistical significance does not necessarily imply practical significance. It’s essential to consider the effect size and context of the analysis to determine the real-world importance of the finding.
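One common effect size for a chi-square test is Cramér's V, defined as sqrt(chi-squared / (n * (min(rows, cols) - 1))). The sketch below computes it on a hypothetical 3 x 2 table; the counts are illustrative and not taken from the dataset.

```r
# Hypothetical 3 x 2 table of day by conversion status (illustrative counts)
tab <- matrix(c(9700, 300,
                9750, 250,
                9800, 200),
              ncol = 2, byrow = TRUE,
              dimnames = list(day = c("Mon", "Tue", "Wed"),
                              converted = c("0", "1")))

chi <- chisq.test(tab)

# Cramer's V: sqrt(chi-squared / (n * (min(rows, cols) - 1)))
n <- sum(tab)
cramers_v <- sqrt(unname(chi$statistic) / (n * (min(dim(tab)) - 1)))

cat("P-value:", chi$p.value, " Cramer's V:", cramers_v, "\n")
```

In this toy table the p-value is far below 0.05 while V stays well under 0.1, which illustrates how a large sample can make a very weak association statistically significant.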

categorical_columns <- c('test.group', 'converted', 'most.ads.day', 'most.ads.hour')
conversion_rates <- data1 %>%
  group_by(test.group) %>%
  summarise(Conversion_Rate = mean(converted, na.rm = TRUE)) %>%
  rename('Test Group' = test.group, 'Conversion Rate' = Conversion_Rate)


ggplot(conversion_rates, aes(x = `Test Group`, y = `Conversion Rate`, fill = `Test Group`)) +
  geom_bar(stat = "identity") +
  scale_fill_manual(values = c("ad" = "tomato", "psa" = "steelblue")) +
  theme_minimal() +
  labs(title = "Conversion Rates for Control and Test Groups",
       x = "Test Group",
       y = "Conversion Rate") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

The bar chart shows that the “ad” group (tomato) has a higher conversion rate than the “psa” group (steelblue). The exact rates, and therefore the size of the gap, can be read from the conversion_rates table computed above.
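The gap between the two rates is what the second business question asks about. A minimal sketch of the arithmetic follows, assuming the psa rate is what the ad group would have achieved without ads. The two rates are hypothetical placeholders; in the analysis they would come from conversion_rates.

```r
# Hypothetical conversion rates; in the analysis these would come from
# the `conversion_rates` table computed above
rate_ad  <- 0.025
rate_psa <- 0.018

# Absolute uplift in percentage points
uplift_pp <- (rate_ad - rate_psa) * 100

# Relative lift over the control baseline
relative_lift <- (rate_ad - rate_psa) / rate_psa

# Share of ad-group conversions attributable to the ads,
# assuming the psa rate is the no-ad baseline for the ad group
attributable_share <- (rate_ad - rate_psa) / rate_ad

cat("Uplift (pp):", uplift_pp, "\n")
cat("Relative lift:", relative_lift, "\n")
cat("Attributable share:", attributable_share, "\n")
```

The attributable share is only as reliable as the randomization between the two groups, so it should be read together with the significance tests above.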

day_conversion_rate <- data1 %>%
  group_by(`most.ads.day`) %>%
  summarise(`Conversion Rate` = mean(converted, na.rm = TRUE)) %>%
  rename('Day of the Week' = `most.ads.day`) %>%
  mutate(`Day of the Week` = factor(`Day of the Week`, levels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"))) %>%
  arrange(`Day of the Week`)


ggplot(day_conversion_rate, aes(x = `Conversion Rate`, y = reorder(`Day of the Week`, `Conversion Rate`), fill = `Conversion Rate`)) +
  geom_bar(stat = "identity") +
  scale_fill_gradient(low = "lightcoral", high = "darkred") +  
  theme_minimal() +
  labs(title = "Conversion Rate by Day with Most Ads Seen",
       x = "Conversion Rate",
       y = "Day of the Week") +
  theme(axis.text.y = element_text(angle = 0, hjust = 0.5)) +
  guides(fill = "none")  

The horizontal bar chart displays the average conversion rate for each day of the week, with days sorted from highest to lowest conversion rate. The color of the bars follows the conversion rate on a gradient from light coral (lowest rate) to dark red (highest rate).

hour_conversion_rate <- data1 %>%
  group_by(most.ads.hour) %>%
  summarise(Conversion_Rate = mean(converted, na.rm = TRUE)) %>%
  rename('Hour of the Day' = most.ads.hour, 'Conversion Rate' = Conversion_Rate) %>%
  arrange(`Hour of the Day`)


# Hours in the data run from 0 to 23, so the factor levels must match
hour_conversion_rate <- hour_conversion_rate %>%
  mutate(`Hour of the Day` = factor(`Hour of the Day`, levels = 0:23))


ggplot(hour_conversion_rate, aes(x = `Hour of the Day`, y = `Conversion Rate`, fill = `Conversion Rate`)) +
  geom_bar(stat = "identity") +
  scale_fill_gradient(low = "lightcoral", high = "darkred") +
  scale_x_discrete(breaks = 0:23) +  
  theme_minimal() +
  labs(title = "Conversion Rate by Hour with Most Ads Seen",
       x = "Hour of the Day",
       y = "Conversion Rate") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  guides(fill = "none")

Based on the data analysis, the highest conversion rate is observed for users whose peak ad hour was 4:00 PM, with an average of 3%, while the lowest occurs at 2:00 AM, with an average of 0.5%. This suggests that the campaign is most effective in the afternoon and least effective during the early morning hours. Since most.ads.hour records when each user saw the most ads rather than a randomized schedule, the pattern is an association. As a starting point for improving campaign performance, ad frequency could be increased from 4:00 PM to 7:00 PM and the budget reduced outside that window.

ads_conversion_rate_bin <- df_subset %>%
  group_by(total_ads_bin) %>%
  summarise(Conversion_Rate = mean(converted, na.rm = TRUE)) %>%
  rename('Total Ads Seen (Binned)' = total_ads_bin, 'Conversion Rate' = Conversion_Rate)

ggplot(ads_conversion_rate_bin, aes(x = reorder(`Total Ads Seen (Binned)`, `Conversion Rate`), y = `Conversion Rate`, fill = `Conversion Rate`)) +
  geom_bar(stat = "identity") +
  scale_fill_gradient(low = "lightcoral", high = "darkred") +  
  coord_flip() +  
  theme_minimal() +
  labs(title = "Conversion Rate by Total Ads Seen (Binned) - Total Ads < 50",
       x = "Total Ads Seen (Binned)",
       y = "Conversion Rate") +
  theme(axis.text.y = element_text(angle = 0, hjust = 0.5, size = 10),
        axis.text.x = element_text(size = 10),
        plot.title = element_text(size = 14, face = "bold"),
        legend.position = "none")

Conversion rates differ across the bins of total ads seen, with some bins showing a higher average conversion rate than others. The bins are arranged from highest to lowest conversion rate, so the bins with the highest (dark red bars) and lowest (light coral bars) conversion rates are easy to identify. Note that the analysis is limited to users who saw fewer than 50 ads. A similar analysis for higher ad counts would give a more complete picture of the relationship between total ads seen and conversion rate.

test_group_colors <- c('#4e79a7', '#f28e2b')
converted_colors <- c('#76b7b2', '#e15759')
most_ads_day_colors <- c('#edc948', '#f28e2b', '#4e79a7', '#e15759', '#76b7b2', '#59a14f', '#b07aa1')


most_ads_hour_colors <- scales::viridis_pal()(length(unique(data1$most.ads.hour)))


bar_plot_1 <- ggplot(data1, aes(x = test.group, fill = test.group)) +
  geom_bar(color = 'black') +
  scale_fill_manual(values = test_group_colors) +
  labs(title = 'Test Group', x = '', y = 'Number of Users') +
  theme_minimal() +
  ylim(0, max(table(data1$test.group)) * 1.1) +  
  geom_text(stat = 'count', aes(label = after_stat(count)), vjust = -0.5)  

print(bar_plot_1)

pie_data_1 <- data1 %>%
  count(test.group) %>%
  mutate(percentage = n / sum(n) * 100)

pie_chart_1 <- ggplot(pie_data_1, aes(x = "", y = n, fill = test.group)) +
  geom_bar(stat = 'identity', width = 1, color = "white") +
  coord_polar(theta = 'y') +
  scale_fill_manual(values = test_group_colors) +
  labs(title = 'Test Group', x = '', y = '') +
  theme_minimal() +
  theme(axis.text.x = element_blank(), axis.ticks = element_blank(), panel.grid = element_blank())

print(pie_chart_1)

Test Group Analysis:

The bar and pie charts for the ‘Test Group’ variable show how users are split between the two test groups, “ad” and “psa”. The groups are not equal in size, so they should be compared by conversion rate rather than by raw number of conversions.

bar_plot_2 <- ggplot(data1, aes(x = factor(converted), fill = factor(converted))) +
  geom_bar(color = 'black') +
  scale_fill_manual(values = converted_colors) +
  labs(title = 'Conversion Rate', x = '', y = 'Number of Users') +
  theme_minimal() +
  ylim(0, max(table(data1$converted)) * 1.1) +
  geom_text(stat = 'count', aes(label = after_stat(count)), vjust = -0.5)

print(bar_plot_2)

pie_data_2 <- data1 %>%
  count(converted) %>%
  mutate(percentage = n / sum(n) * 100)

pie_chart_2 <- ggplot(pie_data_2, aes(x = "", y = n, fill = factor(converted))) +
  geom_bar(stat = 'identity', width = 1, color = "white") +
  coord_polar(theta = 'y') +
  scale_fill_manual(values = converted_colors) +
  labs(title = 'Conversion Rate', x = '', y = '') +
  theme_minimal() +
  theme(axis.text.x = element_blank(), axis.ticks = element_blank(), panel.grid = element_blank())

print(pie_chart_2)

Conversion Rate

The visualizations for ‘Conversion Rate’ show a strong imbalance between converted and non-converted users: 14,843 of the 588,101 users converted (about 2.5%), while 573,258 did not. The bar chart gives the counts of each group, and the pie chart shows the same split as proportions. This low baseline is the reference point for any effort to increase conversion rates.

bar_plot_3 <- ggplot(data1, aes(x = most.ads.day, fill = most.ads.day)) +
  geom_bar(color = 'black') +
  scale_fill_manual(values = most_ads_day_colors) +
  labs(title = 'Most Ads Viewed by Day', x = '', y = 'Number of Users') +
  theme_minimal() +
  ylim(0, max(table(data1$most.ads.day)) * 1.1) +
  geom_text(stat = 'count', aes(label = after_stat(count)), vjust = -0.5)

print(bar_plot_3)

pie_data_3 <- data1 %>%
  count(most.ads.day) %>%
  mutate(percentage = n / sum(n) * 100)

pie_chart_3 <- ggplot(pie_data_3, aes(x = "", y = n, fill = most.ads.day)) +
  geom_bar(stat = 'identity', width = 1, color = "white") +
  coord_polar(theta = 'y') +
  scale_fill_manual(values = most_ads_day_colors) +
  labs(title = 'Most Ads Viewed by Day', x = '', y = '') +
  theme_minimal() +
  theme(axis.text.x = element_blank(), axis.ticks = element_blank(), panel.grid = element_blank()) +
  geom_text(aes(label = paste0(round(percentage, 1), "%")), 
            position = position_stack(vjust = 0.5), 
            color = "white")

print(pie_chart_3)

Most Ads Viewed by Day

The bar and pie charts for ‘Most Ads Viewed by Day’ show how many users saw most of their ads on each day of the week. Friday is the most common day (92,608 users) and Tuesday the least common (77,479), as tabulated in Section 3.5.1. Knowing which days carry the most ad exposure can help in scheduling ads to maximize exposure and conversion rates.

bar_plot_4 <- ggplot(data1, aes(x = factor(most.ads.hour), fill = factor(most.ads.hour))) +
  geom_bar(color = 'black') +
  scale_fill_manual(values = most_ads_hour_colors) +
  labs(title = 'Most Ads Viewed by Hour', x = '', y = 'Number of Users') +
  theme_minimal() +
  ylim(0, max(table(data1$most.ads.hour)) * 1.1) +
  geom_text(stat = 'count', aes(label = after_stat(count)), vjust = -0.5)

print(bar_plot_4)

pie_data_4 <- data1 %>%
  count(most.ads.hour) %>%
  mutate(percentage = n / sum(n) * 100)

pie_chart_4 <- ggplot(pie_data_4, aes(x = "", y = n, fill = factor(most.ads.hour))) +
  geom_bar(stat = 'identity', width = 1, color = "white") +
  coord_polar(theta = 'y') +
  scale_fill_manual(values = most_ads_hour_colors) +
  labs(title = 'Most Ads Viewed by Hour', x = '', y = '') +
  theme_minimal() +
  theme(axis.text.x = element_blank(), 
        axis.ticks = element_blank(), 
        panel.grid = element_blank()) +
  geom_text(aes(label = paste0(round(percentage, 1), "%")), 
            position = position_stack(vjust = 0.5), 
            color = "white")

print(pie_chart_4)

Most Ads Viewed by Hour

The analysis of ‘Most Ads Viewed by Hour’ through bar and pie charts shows the distribution of ad views throughout different hours of the day. Certain hours, particularly in the late afternoon and early evening, have higher engagement. This information is crucial for timing ad placements to align with peak user activity times, thereby increasing the likelihood of conversions.

percentile_95 <- quantile(data1$total.ads, 0.95)


filtered_data <- data1 %>% filter(total.ads <= percentile_95)


ggplot(filtered_data, aes(x = total.ads)) +
  geom_histogram(bins = 50, fill = '#4e79a7', color = 'black', alpha = 0.7) +
  geom_density(aes(y = after_stat(count)), color = '#4e79a7', linewidth = 1) +
  labs(title = 'Total Ads Viewed', x = '', y = 'Frequency') +
  theme_minimal()

The histogram displays the distribution of total ads seen for users at or below the 95th percentile of total.ads, which removes the long right tail. The x-axis represents the number of total ads seen, and the y-axis represents the frequency (count) of observations in each bin. The density curve helps visualize the overall shape of the distribution.

3.4 Bivariate Analysis

Bivariate analysis will be conducted to uncover relationships between pairs of variables. This analysis aims to reveal patterns and insights through visual exploration of how different variables interact. The specific areas of focus include:

  1. Most Ads Day vs. Converted: The variation in conversion rates across different days of the week will be examined.
  2. Most Ads Hour vs. Converted: The correlation between the time of day when users viewed the most ads and their conversion rates will be explored.
  3. Total Ads vs. Converted: The relationship between the number of ads viewed and the likelihood of conversion will be analyzed.

Through these analyses, trends and correlations will be identified to inform deeper investigation and understanding.

3.4.1 Most Ads Day and Conversion Group

conversion_counts <- data1 %>%
  group_by(most.ads.day, converted) %>%
  summarise(count = n(), .groups = 'drop') %>%
  tidyr::spread(key = converted, value = count, fill = 0)
day_order <- c('Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday')
conversion_counts <- conversion_counts %>%
  mutate(most.ads.day = factor(most.ads.day, levels = day_order)) %>%
  arrange(most.ads.day)
conversion_percentages <- conversion_counts %>%
  mutate(Total = `0` + `1`) %>%
  mutate(`0_percent` = (`0` / Total) * 100,
         `1_percent` = (`1` / Total) * 100)
ggplot(conversion_percentages, aes(x = most.ads.day)) +
  geom_bar(aes(y = `0_percent`, fill = 'Not Converted'), stat = 'identity', color = 'black', position = 'stack') +
  geom_bar(aes(y = `1_percent`, fill = 'Converted'), stat = 'identity', color = 'black', position = 'stack') +
  geom_text(aes(y = `0_percent` / 2, label = sprintf('%.1f%%', `0_percent`)), color = 'white', size = 3, vjust = 3) +
  geom_text(aes(y = `0_percent` + `1_percent` / 2, label = sprintf('%.1f%%', `1_percent`)), color = 'white', size = 3, vjust = 35) +
  scale_fill_manual(values = c('Not Converted' = '#4e79a7', 'Converted' = '#f28e2b')) +
  labs(title = 'Most Ads Day and Conversion', x = 'Most Ads Day', y = 'Share of Users (%)', fill = 'Converted') +
  theme_minimal() +
  theme(legend.position = 'bottom')

Relatively Stable Conversion Rates: The percentages from 96.7% to 97.9% shown in the graph are the shares of users who did not convert. The conversion rates are the complements, roughly 2.1% to 3.3% depending on the day, so conversion is low on every day of the week.

Small but Significant Differences by Day: The day-to-day differences are small in absolute terms and hard to see in the chart, but the ANOVA and chi-square tests in Section 3.3.2 show that they are statistically significant. The day of the week is therefore associated with conversion, although the effect is modest.

3.4.2 Most Ads Hour and Conversion Group

ggplot(data1, aes(x = factor(converted), y = most.ads.hour, fill = factor(converted))) +
  geom_boxplot(outlier.colour = "red", outlier.size = 1) +  
  scale_fill_manual(values = c('0' = '#4e79a7', '1' = '#f28e2b')) + 
  labs(title = 'Most Ads Hour vs. Conversion Status', 
       x = 'Converted', 
       y = 'Most Ads Hour', 
       fill = 'Converted') + 
  theme_minimal() +  
  theme(legend.position = 'none')  

The boxplot shows the distribution of the “most ads hour” for converted and non-converted users. The boxes represent the interquartile range (IQR), highlighting the middle 50% of the data, with the median indicated by a line inside the box. By comparing the boxes and medians, we can observe differences in “most ads hour” between the two groups. Red dots mark potential outliers, that is, values lying more than 1.5 × IQR beyond the quartiles.

3.4.3 Total Ads Viewed vs. Conversion Status

percentile_95_total_ads <- quantile(data1$total.ads, 0.95)


filtered_data <- subset(data1, total.ads <= percentile_95_total_ads)


ggplot(filtered_data, aes(x = factor(converted), y = total.ads, fill = factor(converted))) +
  geom_boxplot(outlier.colour = "red", outlier.size = 1) +  
  scale_fill_manual(values = c('0' = '#4e79a7', '1' = '#f28e2b')) + 
  labs(title = 'Total Ads Viewed vs. Conversion Status', 
       x = 'Converted', 
       y = 'Total Ads Viewed', 
       fill = 'Converted') +  
  theme_minimal() +  
  theme(legend.position = 'none')  

The boxplot shows the distribution of total ads viewed by converted and non-converted users, restricted to users at or below the 95th percentile of total ads so that extreme values do not compress the boxes. Each box spans the interquartile range (IQR), and the line inside the box marks the median. Red dots are outliers: points more than 1.5 × IQR below the first quartile or above the third quartile. Comparing the boxes and medians shows how ad viewing differs between the two groups.
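The outlier rule behind the red dots can be reproduced by hand. This is a small sketch on made-up values (not taken from the dataset) that computes the quartiles, the IQR, and the 1.5 × IQR fences; it is meant to illustrate the convention, and the exact quartile method used by a plotting function may differ slightly.

```r
# Made-up values for illustration only.
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 40)

q1  <- quantile(x, 0.25, names = FALSE)
q3  <- quantile(x, 0.75, names = FALSE)
iqr <- q3 - q1

# Fences used by the boxplot convention: 1.5 * IQR beyond the quartiles.
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Points outside the fences are the ones drawn as separate dots.
outliers <- x[x < lower_fence | x > upper_fence]
print(outliers)
```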

3.5 Statistical Testing

3.5.1 Total Users by Day with Most Ads Seen

contingency_table_day <- table(data1$most.ads.day, data1$converted)


contingency_table_df <- as.data.frame(contingency_table_day)


colnames(contingency_table_df) <- c('MostAdsDay', 'Converted', 'Count')


sorted_table <- contingency_table_df %>%
  group_by(MostAdsDay) %>%
  summarise(Count = sum(Count)) %>%
  arrange(desc(Count))


print(sorted_table)
## # A tibble: 7 × 2
##   MostAdsDay Count
##   <fct>      <int>
## 1 Friday     92608
## 2 Monday     87073
## 3 Sunday     85391
## 4 Thursday   82982
## 5 Saturday   81660
## 6 Wednesday  80908
## 7 Tuesday    77479
ggplot(sorted_table, aes(x = reorder(MostAdsDay, Count), y = Count, fill = MostAdsDay)) +
  geom_bar(stat = "identity") +
  labs(title = "Total Users by Day with Most Ads Seen",
       x = "Day with Most Ads Seen",
       y = "Number of Users") +
  theme_minimal() +
  coord_flip() 

Most Ads Seen: The table and chart show that Friday is the day on which the largest number of users saw most of their ads (92,608 users), followed by Monday (87,073) and Sunday (85,391). Tuesday has the fewest (77,479).

Total Conversions: Because Count is summed over both conversion statuses, the table and the chart report the number of users per day, not the number of conversions. Friday has the most users, so it may also have the most conversions in absolute terms, but that has to be read from the converted column of the contingency table rather than from this chart.
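User counts and conversion counts per day are different quantities and can rank the days differently. The sketch below separates the two on made-up rows, assuming the column names most.ads.day and converted used in this report; the toy data frame is hypothetical.

```r
# Toy stand-in for data1 (made-up rows, NOT the real dataset).
toy <- data.frame(
  most.ads.day = c("Friday", "Friday", "Friday", "Monday", "Monday", "Sunday"),
  converted    = c(0, 0, 1, 1, 1, 0)
)

tab <- table(toy$most.ads.day, toy$converted)

users_per_day       <- rowSums(tab)  # all users, converted or not
conversions_per_day <- tab[, "1"]    # only the converted column

print(sort(users_per_day, decreasing = TRUE))
print(sort(conversions_per_day, decreasing = TRUE))
```

In this toy example Friday has the most users while Monday has the most conversions, which is why the two counts should be reported separately.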

3.5.2 Chi-Square Test: Conversion Status by Most Ads Day

contingency_table_day <- table(data1$most.ads.day, data1$converted)


chi2_test <- chisq.test(contingency_table_day)


expected_frequencies_day <- chi2_test$expected


expected_frequencies_df <- as.data.frame(expected_frequencies_day)
colnames(expected_frequencies_df) <- c('Not_Converted', 'Converted')


rownames(expected_frequencies_df) <- rownames(expected_frequencies_day)


expected_frequencies_df <- expected_frequencies_df %>%
  mutate(across(everything(), \(x) round(x, 2)))
sorted_expected_frequencies <- expected_frequencies_df %>%
  arrange(desc(Converted))


print(sorted_expected_frequencies)
##           Not_Converted Converted
## Friday         90270.68   2337.32
## Monday         84875.38   2197.62
## Sunday         83235.83   2155.17
## Thursday       80887.63   2094.37
## Saturday       79598.99   2061.01
## Wednesday      78865.97   2042.03
## Tuesday        75523.52   1955.48
contingency_df <- as.data.frame(contingency_table_day)

colnames(contingency_df) <- c('Most Ads Day', 'Converted', 'Count')


ggplot(contingency_df, aes(x = `Converted`, y = `Most Ads Day`, fill = Count)) +
  geom_tile(color = 'black') +  
  geom_text(aes(label = Count), color = 'black', size = 3) + 
  scale_fill_gradient2(low = "#ffcccc", mid = "#ffff99", high = "#0066cc", midpoint = median(contingency_df$Count)) +
  labs(title = 'Heatmap of Conversion Status by Most Ads Day', x = 'Conversion Status', y = 'Most Ads Day') +
  theme_minimal()

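Earlier analysis showed that Friday is the most common day for users to see most of their ads. The heatmap plots the observed count in each cell of the contingency table, so its color scale is dominated by the much larger non-converted column. To judge whether Friday (or any other day) has a higher proportion of converted users, compare the observed counts with the expected frequencies above, or look at the conversion rate within each day.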
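The object returned by chisq.test() holds the test statistic, degrees of freedom, and p-value, and row-wise proportions make days comparable despite their different user counts. The sketch below uses a made-up 3 × 2 table; the counts are illustrative and are not the dataset's.

```r
# Made-up counts for illustration only (rows: day, columns: converted 0/1).
toy_table <- matrix(
  c(900, 100,
    950,  50,
    920,  80),
  nrow = 3, byrow = TRUE,
  dimnames = list(day = c("Monday", "Saturday", "Friday"),
                  converted = c("0", "1"))
)

test <- chisq.test(toy_table)

# Test statistic, degrees of freedom, and p-value.
print(test$statistic)
print(test$parameter)
print(test$p.value)

# Row-wise proportions: share of non-converted vs. converted within each day.
row_props <- prop.table(toy_table, margin = 1)
print(round(row_props, 3))

# Standardized residuals show which cells deviate most from independence.
print(round(test$stdres, 2))
```

A small p-value indicates that conversion status and day are not independent; the standardized residuals then point to the days that contribute most to that result.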

4. Conclusion

Overall Insights

1. Ad Exposure and Effectiveness:

Ads vs. PSAs The data reveal that advertisements (ads) drove more conversions than public service announcements (PSAs), which suggests that ads had a stronger influence on purchases. However, most users were exposed to ads, so the two groups are very unequal in size and the difference should be interpreted with caution.

Impact of Ad Exposure Users who saw more ads were generally more likely to convert, indicating that repeated exposure plays a key role in driving purchases.

Optimal Ad Exposure Displaying between 250 and 749 ads appears to be the ideal range for maximizing conversions without overwhelming users.
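An exposure range such as this one is typically found by binning total ads and comparing conversion rates across bins. The sketch below shows that idea on made-up rows; the bin edges and labels are hypothetical, chosen only so that the 250 to 749 range falls into its own bins, and are not necessarily the ones used in this analysis.

```r
# Made-up rows for illustration only (not the real dataset).
toy <- data.frame(
  total.ads = c(10, 120, 260, 300, 510, 740, 800, 1200),
  converted = c(0,  0,   0,   1,   1,   1,   0,   1)
)

# Hypothetical bin edges and labels.
breaks <- c(0, 250, 500, 750, Inf)
labels <- c("0-249", "250-499", "500-749", "750+")

# right = FALSE makes each bin closed on the left: [0, 250), [250, 500), ...
toy$ads_bin <- cut(toy$total.ads, breaks = breaks, labels = labels, right = FALSE)

# Conversion rate (%) within each bin.
rate_by_bin <- tapply(toy$converted, toy$ads_bin, mean) * 100
print(rate_by_bin)
```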

2. Optimal Campaign Timing:

Best Day for Campaigns Mondays consistently showed the highest conversion rates, suggesting users are more responsive at the start of the week, possibly due to a refreshed mindset after the weekend.

Best Hour for Campaigns The hour of 16:00 emerged as particularly effective for conversions. However, other late afternoon to early evening hours (14:00-20:00) also performed well, allowing for some flexibility in campaign scheduling.

Day-Hour Interaction While Monday at 16:00 is an optimal time, Saturday at 05:00 also showed unexpectedly high conversion rates. Nevertheless, no specific day-hour combination significantly outperformed others.

Addressing Key Questions

Will the campaign be successful?

The analysis indicates that the campaign is likely to succeed if carefully executed. Targeting ads at the optimal times and ensuring appropriate ad exposure should lead to significant results.

How much of the success can be attributed to the ads?

Ads are shown to be a major factor in the campaign’s success. They were more effective than PSAs in driving purchases, especially when presented at the right times and in appropriate quantities.
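One way to put a number on attribution is to compare the conversion rate of the ad group with the PSA group, which serves as the baseline. The sketch below uses made-up counts (not the real results) to compute the absolute difference, the relative lift, and the share of ad-group conversions above the baseline, and then runs a two-sample test of proportions with prop.test(). The group names ad and psa are assumed labels.

```r
# Made-up counts for illustration only (not the real results).
conversions <- c(ad = 260, psa = 180)      # converted users per group
users       <- c(ad = 10000, psa = 10000)  # total users per group

rates <- conversions / users

# Absolute difference in conversion rate and relative lift over the PSA baseline.
abs_diff <- rates[["ad"]] - rates[["psa"]]
rel_lift <- abs_diff / rates[["psa"]]

# Share of the ad group's conversions beyond what the PSA rate would predict.
attributable_share <- abs_diff / rates[["ad"]]

# Two-sample test of equal proportions.
test <- prop.test(x = conversions, n = users)

print(c(abs_diff = abs_diff, rel_lift = rel_lift,
        attributable_share = attributable_share))
print(test$p.value)
```

The attributable share is only meaningful if the two groups were assigned at random, since it treats the PSA rate as what the ad group would have done without ads.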

Actionable Recommendations

Prioritize Mondays Allocate a significant portion of the ad budget to Mondays, especially during late afternoon hours, as this day has demonstrated the highest conversion rates.

Focus on Late Afternoon and Early Evening Run ads between 14:00 and 20:00, with a particular focus on 16:00, which consistently showed high conversion rates.

Optimize Ad Exposure Target users with 250 to 749 ads to maximize conversions while avoiding ad fatigue. Monitor users who receive more than this to prevent diminishing returns.

Tailor to Day-Specific Timing Adjust ad campaigns to align with the most effective times for each day to maximize engagement and conversions.

5. Dataset