Marketing companies strive to execute effectively targeted campaigns in a complex market environment, where predicting audience preferences can be challenging. To tackle this issue, A/B testing is commonly employed. A/B testing is a randomized experimentation method where two or more variations of a variable (such as a webpage, page element, or advertisement) are shown simultaneously to different individuals from various segments. This approach helps identify which version has the most significant impact and attracts more customers.
Source: https://www.kaggle.com/datasets/faviovaz/marketing-ab-testing/data
The primary goals of this project are:
To achieve these objectives, we will conduct a comprehensive analysis of the provided dataset, focusing on how ad exposure influences user conversions. This analysis will help us understand the effectiveness of the campaign and measure the contribution of the ads to its overall success.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(lubridate)
library(plotly)
##
## Attaching package: 'plotly'
##
## The following object is masked from 'package:ggplot2':
##
## last_plot
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following object is masked from 'package:graphics':
##
## layout
library(scales)
##
## Attaching package: 'scales'
##
## The following object is masked from 'package:purrr':
##
## discard
##
## The following object is masked from 'package:readr':
##
## col_factor
library(glue)
library(ggplot2)
library(gridExtra)
##
## Attaching package: 'gridExtra'
##
## The following object is masked from 'package:dplyr':
##
## combine
library(viridis)
## Loading required package: viridisLite
##
## Attaching package: 'viridis'
##
## The following object is masked from 'package:scales':
##
## viridis_pal
library(patchwork)
data <- read.csv("marketing_AB.csv")
head(data)
## no user.id test.group converted total.ads most.ads.day most.ads.hour
## 1 0 1069124 ad FALSE 130 Monday 20
## 2 1 1119715 ad FALSE 93 Tuesday 22
## 3 2 1144181 ad FALSE 21 Tuesday 18
## 4 3 1435133 ad FALSE 355 Tuesday 10
## 5 4 1015700 ad FALSE 276 Friday 14
## 6 5 1137664 ad FALSE 734 Saturday 10
str(data)
## 'data.frame': 588101 obs. of 7 variables:
## $ no : int 0 1 2 3 4 5 6 7 8 9 ...
## $ user.id : int 1069124 1119715 1144181 1435133 1015700 1137664 1116205 1496843 1448851 1446284 ...
## $ test.group : chr "ad" "ad" "ad" "ad" ...
## $ converted : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ total.ads : int 130 93 21 355 276 734 264 17 21 142 ...
## $ most.ads.day : chr "Monday" "Tuesday" "Tuesday" "Tuesday" ...
## $ most.ads.hour: int 20 22 18 10 14 10 13 18 19 14 ...
data1 <- data %>%
select(-no, -user.id) %>%
mutate(converted = as.integer(converted))
head(data1)
## test.group converted total.ads most.ads.day most.ads.hour
## 1 ad 0 130 Monday 20
## 2 ad 0 93 Tuesday 22
## 3 ad 0 21 Tuesday 18
## 4 ad 0 355 Tuesday 10
## 5 ad 0 276 Friday 14
## 6 ad 0 734 Saturday 10
str(data1)
## 'data.frame': 588101 obs. of 5 variables:
## $ test.group : chr "ad" "ad" "ad" "ad" ...
## $ converted : int 0 0 0 0 0 0 0 0 0 0 ...
## $ total.ads : int 130 93 21 355 276 734 264 17 21 142 ...
## $ most.ads.day : chr "Monday" "Tuesday" "Tuesday" "Tuesday" ...
## $ most.ads.hour: int 20 22 18 10 14 10 13 18 19 14 ...
unique(data1$converted)
## [1] 0 1
unique(data1$test.group )
## [1] "ad" "psa"
table(data1$converted)
##
## 0 1
## 573258 14843
colSums(is.na(data1))
## test.group converted total.ads most.ads.day most.ads.hour
## 0 0 0 0 0
categorical_columns <- c('test.group', 'converted', 'most.ads.day', 'most.ads.hour')
levels <- lapply(categorical_columns, function(col) {
unique(data1[[col]])
})
names(levels) <- categorical_columns
for (col in categorical_columns) {
cat(paste(col, ":", paste(levels[[col]], collapse = ", ")), "\n")
}
## test.group : ad, psa
## converted : 0, 1
## most.ads.day : Monday, Tuesday, Friday, Saturday, Wednesday, Sunday, Thursday
## most.ads.hour : 20, 22, 18, 10, 14, 13, 19, 11, 12, 16, 21, 3, 23, 4, 8, 0, 2, 15, 1, 6, 17, 7, 9, 5
The code above works on data1, a data frame with 588,101 observations and 5 variables. It collects the unique values of each categorical column with lapply(), names the resulting list after the columns, and prints one line per column.
Overall, this code identifies and displays the unique values of the categorical columns, which is useful for checking which levels each column contains before any further analysis.
# # Define sampling size
# sampling <- 5000
#
# # Initialize vectors to store results
# control_group1 <- NULL
# test_group1 <- NULL
#
# # Set seed for reproducibility
# set.seed(100)
#
# # Bootstrap sampling for control group
# for(i in 1:sampling) {
# control_group2 <- length(control_group$converted)
# control_group3 <- control_group[sample(1:control_group2, replace=TRUE),]
# control_group4 <- sum(control_group3$converted == 1) / control_group2
#
# control_group1 <- c(control_group1, control_group4)
# }
#
# # Bootstrap sampling for test group
# for(i in 1:sampling) {
# test_group2 <- length(test_group$converted)
# test_group3 <- test_group[sample(1:test_group2, replace=TRUE),]
# test_group4 <- sum(test_group3$converted == 1) / test_group2
#
# test_group1 <- c(test_group1, test_group4)
# }
# Save the results
# saveRDS(test_group1, "data_test_group.rds")
# saveRDS(control_group1, "data_control_group.rds")
# Read the saved RDS files
data_test_group1 <- readRDS("data_test_group.rds")
data_control_group1 <- readRDS("data_control_group.rds")
# Combine the data for plotting
data_converted <- data.frame(
Distribution = c(rep("PSA", length(data_control_group1)), rep("AD", length(data_test_group1))),
value = c(data_control_group1, data_test_group1))
This code uses the bootstrap technique to estimate the distribution of conversion rates (converted) for two groups (control_group and test_group). By generating 5,000 bootstrap samples for each group, we can obtain a smoother distribution and better understand the variation in conversion rates within this population. The results are then ready for further analysis or visualization to determine if there is a significant difference in conversion rates between the control group and the test group.
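The resampling step described above can be sketched more compactly. This is a minimal illustration in base R on simulated data, not the code used to produce the saved .rds files: the helper name boot_rate, the simulated vector conv, its 2.5% rate, and the sample sizes are all assumptions made for the example.

```r
# Hypothetical helper: bootstrap distribution of a conversion rate.
# `converted` is a 0/1 vector; `n_boot` is the number of resamples.
boot_rate <- function(converted, n_boot = 1000) {
  n <- length(converted)
  # Each replicate resamples n users with replacement and takes the mean.
  replicate(n_boot, mean(sample(converted, size = n, replace = TRUE)))
}

set.seed(100)
# Simulated stand-in for one group's `converted` column (assumed 2.5% rate).
conv <- rbinom(2000, size = 1, prob = 0.025)
rates <- boot_rate(conv, n_boot = 1000)

# Percentile interval summarising the bootstrap distribution.
ci <- quantile(rates, probs = c(0.025, 0.975))
print(ci)
```

Using replicate() avoids growing a result vector inside a for loop, which becomes slow when the number of resamples is large.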
data_converted %>%
ggplot(aes(x = value, fill = Distribution)) +
geom_area(stat = "Density", alpha = 0.7, position = "identity") +
xlab("Conversion Rate") +
ggtitle("Conversion Rate Distribution by Group") +
labs(fill = NULL ) +
theme_minimal() +
scale_fill_manual(values = c("PSA" = "steelblue", "AD" = "tomato"))
The density plot shows the distribution of conversion rates for two groups, labeled AD (experimental group) and PSA (control group). The “AD” group has a higher conversion rate overall, with a narrower and taller peak around 0.024. The “PSA” group has a lower conversion rate, with a wider and shorter peak around 0.018. This suggests that the “AD” group is more effective in converting users compared to the “PSA” group.
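The primary goal of this analysis is to quantify the disparity in conversion rates between the test and control groups. By calculating the absolute difference, in percentage points, between the bootstrapped conversion rates of the two groups, we obtain a distribution of the size of the gap.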
data_difference1 <- as.data.frame(cbind(data_test_group1, data_control_group1)) %>%
mutate(diff1 = round(abs(data_test_group1 - data_control_group1)*100,2))
data_difference1 %>%
ggplot(aes(x = diff1)) +
# fill is a fixed colour, so no fill scale or fill legend is needed
geom_density(color = "#e9ecef", fill = "steelblue", alpha = 0.7) +
xlab("Conversion Rate Difference (percentage points)") +
ggtitle("Distribution of the Conversion Rate Difference") +
theme_minimal()
control_group <- data1 %>% filter(test.group == "psa")
test_group <- data1 %>% filter(test.group == "ad")
t_test_result <- t.test(test_group$converted, control_group$converted)
cat("Test Group Analysis - T-statistic:", t_test_result$statistic, ", P-value:", t_test_result$p.value, "\n")
## Test Group Analysis - T-statistic: 8.657162 , P-value: 5.107608e-18
# Reuse the t-test result computed above
p_val <- t_test_result$p.value
if (p_val < 0.05) {
cat("Reject the null hypothesis: There is a significant difference in conversion rates between the 'ad' and 'psa' groups.\n")
} else {
cat("Fail to reject the null hypothesis: There is no significant difference in conversion rates between the 'ad' and 'psa' groups.\n")
}
## Reject the null hypothesis: There is a significant difference in conversion rates between the 'ad' and 'psa' groups.
The test rejects the null hypothesis of no difference between the groups, with a t-statistic of 8.657162 and a p-value of 5.107608e-18. A p-value this small means that a difference of this size would be very unlikely if the two groups truly converted at the same rate. Because the t-statistic is positive and the “ad” group is the first argument to t.test(), we conclude that the “ad” group has a significantly higher conversion rate than the “psa” group.
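Because converted is a binary outcome, the comparison can also be cross-checked with a two-sample test of proportions. The sketch below uses base R's prop.test(); the counts are made up for illustration and are not taken from the dataset.

```r
# Hypothetical counts, for illustration only (not taken from the dataset).
conversions <- c(ad = 260, psa = 180)       # converted users per group
users       <- c(ad = 10000, psa = 10000)   # total users per group

# Two-sample test for equality of proportions.
res <- prop.test(x = conversions, n = users)

rates <- conversions / users
cat("Rates:", rates, "\n")
cat("Difference (ad - psa):", rates[["ad"]] - rates[["psa"]], "\n")
cat("P-value:", res$p.value, "\n")
# The interval is for the first proportion minus the second (ad - psa).
cat("95% CI for the difference:", res$conf.int, "\n")
```

The confidence interval reports the size of the gap directly, which is often more useful for a campaign decision than the p-value alone.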
anova_result_day <- aov(converted ~ `most.ads.day`, data = data1)
anova_summary <- summary(anova_result_day)
f_statistic <- anova_summary[[1]]["most.ads.day", "F value"]
p_value <- anova_summary[[1]]["most.ads.day", "Pr(>F)"]
cat("Most Ads Day Analysis - F-statistic:", f_statistic, ", P-value:", p_value, "\n")
## Most Ads Day Analysis - F-statistic: 68.38818 , P-value: 1.803201e-85
# f_statistic and p_value were already extracted from anova_summary above
if (p_value < 0.05) {
cat("Reject the null hypothesis: The day with the most ads seen significantly affects the conversion rate.\n")
} else {
cat("Fail to reject the null hypothesis: The day with the most ads seen does not significantly affect conversion rates.\n")
}
## Reject the null hypothesis: The day with the most ads seen significantly affects the conversion rate.
The statistical analysis indicates that there is a significant difference in conversion rates based on the day with the most ads seen. This conclusion is supported by an F-statistic of 68.38818 and a p-value of 1.803201e-85. The low p-value suggests that the observed difference is highly unlikely to be due to chance. Therefore, we can reject the null hypothesis (which states that there is no difference in conversion rates based on the day with the most ads seen) and conclude that the day with the most ads significantly affects the conversion rate.
anova_result_hour <- aov(converted ~ as.factor(most.ads.hour), data = data1)
anova_summary_hour <- summary(anova_result_hour)
f_statistic_hour <- anova_summary_hour[[1]]$`F value`[1]
p_value_hour <- anova_summary_hour[[1]]$`Pr(>F)`[1]
cat("Most Ads Hour Analysis - F-statistic:", f_statistic_hour, ", P-value:", p_value_hour, "\n")
## Most Ads Hour Analysis - F-statistic: 18.74204 , P-value: 7.482025e-77
if (p_value_hour < 0.05) {
cat("Reject the null hypothesis: The hour with the most ads seen significantly affects the conversion rate.\n")
} else {
cat("Fail to reject the null hypothesis: The hour with the most ads seen does not significantly affect conversion rates.\n")
}
## Reject the null hypothesis: The hour with the most ads seen significantly affects the conversion rate.
The statistical analysis shows that there is a significant difference in conversion rates based on the hour with the most ads seen. This conclusion is supported by an F-statistic of 18.74204 and a p-value of 7.482025e-77. The low p-value suggests that the observed difference is highly unlikely to be due to chance. Therefore, we can reject the null hypothesis (which states that there is no difference in conversion rates based on the hour with the most ads seen) and conclude that the hour with the most ads significantly affects the conversion rate.
df_subset <- data1 %>% filter(total.ads < 50)
df_subset <- df_subset %>%
mutate(total_ads_bin = cut(total.ads,
breaks = c(-1, 1, 5, 10, 20, 30, 40, 50),
labels = c('0-1', '2-5', '6-10', '11-20', '21-30', '31-40', '41-50')))
anova_result_bin <- aov(converted ~ total_ads_bin, data = df_subset)
anova_summary_bin <- summary(anova_result_bin)
f_statistic_bin <- anova_summary_bin[[1]]$`F value`[1]
p_value_bin <- anova_summary_bin[[1]]$`Pr(>F)`[1]
cat("Total Ads (Binned) Analysis - F-statistic:", f_statistic_bin, ", P-value:", p_value_bin, "\n")
## Total Ads (Binned) Analysis - F-statistic: 1245.669 , P-value: 0
if (p_value_bin < 0.05) {
cat("Reject the null hypothesis: There is a significant difference in conversion rates among different levels of total ads seen (binned).\n")
} else {
cat("Fail to reject the null hypothesis: The number of ads seen (binned) does not significantly affect conversion rates.\n")
}
## Reject the null hypothesis: There is a significant difference in conversion rates among different levels of total ads seen (binned).
The statistical analysis demonstrates that there is a significant difference in conversion rates across various levels of total ads seen (binned). This conclusion is supported by an F-statistic of 1245.669 and a p-value that R prints as 0. The true p-value is not exactly zero; it is smaller than the smallest positive number R can represent in double precision, so it underflows to 0. Either way, the observed difference is highly unlikely to be due to chance. Therefore, we can reject the null hypothesis (which states that there is no difference in conversion rates based on the total ads seen) and conclude that the total ads seen significantly affects the conversion rate.
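When a p-value underflows to 0, its magnitude can still be reported on a log scale. The sketch below uses the log.p argument of base R's pf(). The F-statistic is the one reported above and df1 follows from the seven bins, but the residual degrees of freedom (df2) is a placeholder, since the size of df_subset is not shown here.

```r
f_stat <- 1245.669   # F-statistic reported above
df1 <- 6             # 7 bins - 1
df2 <- 5e5           # ASSUMED residual degrees of freedom (placeholder)

# Upper-tail probability on the natural-log scale, which avoids underflow.
log_p <- pf(f_stat, df1, df2, lower.tail = FALSE, log.p = TRUE)

# Convert to base 10 so the result reads like the exponent of the p-value.
log10_p <- log_p / log(10)
cat("log10(p-value):", log10_p, "\n")
```

With the real residual degrees of freedom substituted for df2, the same call gives the order of magnitude of the p-value for the binned ANOVA.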
contingency_table_day <- table(data1$most.ads.day, data1$converted)
chi2_test <- chisq.test(contingency_table_day)
p_val <- chi2_test$p.value
alpha <- 0.05
if (p_val < alpha) {
print("The p-value is less than 0.05, indicating a significant relationship between 'most ads day' and 'converted'.")
} else {
print("The p-value is greater than 0.05, indicating no significant relationship between 'most ads day' and 'converted'.")
}
## [1] "The p-value is less than 0.05, indicating a significant relationship between 'most ads day' and 'converted'."
A p-value below 0.05 indicates that the association between “most ads day” and “converted” is statistically significant: an association this strong would be unlikely to appear if the two variables were independent.
However, it’s important to note that statistical significance does not necessarily imply practical significance. It’s essential to consider the effect size and context of the analysis to determine the real-world importance of the finding.
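One common effect-size measure for a chi-squared test is Cramér's V. The sketch below computes it in base R for a small hypothetical table; the counts and day labels are invented for illustration and do not come from the dataset.

```r
# Hypothetical 3 x 2 table of counts (rows: day, columns: converted 0/1).
tab <- matrix(c(9700, 300,
                9750, 250,
                9800, 200),
              nrow = 3, byrow = TRUE,
              dimnames = list(day = c("Mon", "Tue", "Wed"),
                              converted = c("0", "1")))

test <- chisq.test(tab)
n <- sum(tab)              # total number of observations
k <- min(dim(tab)) - 1     # smaller table dimension minus one

# Cramer's V: 0 means no association, 1 means perfect association.
cramers_v <- sqrt(unname(test$statistic) / (n * k))
cat("Chi-squared:", test$statistic, " P-value:", test$p.value, "\n")
cat("Cramer's V:", cramers_v, "\n")
```

In this toy table the test is significant while Cramér's V stays small, which is the typical pattern with large samples: a reliable but weak association.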
categorical_columns <- c('test.group', 'converted', 'most.ads.day', 'most.ads.hour')
conversion_rates <- data1 %>%
group_by(test.group) %>%
summarise(Conversion_Rate = mean(converted, na.rm = TRUE)) %>%
rename('Test Group' = test.group, 'Conversion Rate' = Conversion_Rate)
ggplot(conversion_rates, aes(x = `Test Group`, y = `Conversion Rate`, fill = `Test Group`)) +
geom_bar(stat = "identity") +
scale_fill_manual(values = c("ad" = "tomato", "psa" = "steelblue")) +
theme_minimal() +
labs(title = "Conversion Rates for Control and Test Groups",
x = "Test Group",
y = "Conversion Rate") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
The “ad” group (tomato) has a higher conversion rate than the “psa” group (steelblue). The bar chart alone does not give the exact rates; the values stored in conversion_rates, or the difference between them, are needed to quantify the gap.
day_conversion_rate <- data1 %>%
group_by(`most.ads.day`) %>%
summarise(`Conversion Rate` = mean(converted, na.rm = TRUE)) %>%
rename('Day of the Week' = `most.ads.day`) %>%
mutate(`Day of the Week` = factor(`Day of the Week`, levels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"))) %>%
arrange(`Day of the Week`)
ggplot(day_conversion_rate, aes(x = `Conversion Rate`, y = reorder(`Day of the Week`, `Conversion Rate`), fill = `Conversion Rate`)) +
geom_bar(stat = "identity") +
scale_fill_gradient(low = "lightcoral", high = "darkred") +
theme_minimal() +
labs(title = "Conversion Rate by Day with Most Ads Seen",
x = "Conversion Rate",
y = "Day of the Week") +
theme(axis.text.y = element_text(angle = 0, hjust = 0.5)) +
guides(fill = "none")
The horizontal bar chart displays the average conversion rate for each day of the week, with days sorted from highest to lowest conversion rate. Bar color follows the conversion rate on a gradient from light coral (lowest rate) to dark red (highest rate).
hour_conversion_rate <- data1 %>%
group_by(most.ads.hour) %>%
summarise(Conversion_Rate = mean(converted, na.rm = TRUE)) %>%
rename('Hour of the Day' = most.ads.hour, 'Conversion Rate' = Conversion_Rate) %>%
arrange(`Hour of the Day`)
hour_conversion_rate <- hour_conversion_rate %>%
# Hours in the data run from 0 to 23, so the factor levels must match
mutate(`Hour of the Day` = factor(`Hour of the Day`, levels = 0:23))
ggplot(hour_conversion_rate, aes(x = `Hour of the Day`, y = `Conversion Rate`, fill = `Conversion Rate`)) +
geom_bar(stat = "identity") +
scale_fill_gradient(low = "lightcoral", high = "darkred") +
scale_x_discrete(breaks = 0:23) +
theme_minimal() +
labs(title = "Conversion Rate by Hour with Most Ads Seen",
x = "Hour of the Day",
y = "Conversion Rate") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
guides(fill = "none")
Based on the data analysis, the highest conversion rate is achieved at 4:00 PM with an average of 3%, while the lowest conversion rate occurs at 2:00 AM with an average of 0.5%. This suggests that our advertising campaigns are most effective in the afternoon and less effective during the early morning hours. To improve campaign performance, it is recommended to increase ad frequency during the period from 4:00 PM to 7:00 PM and reduce the budget for hours outside of this time range.
ads_conversion_rate_bin <- df_subset %>%
group_by(total_ads_bin) %>%
summarise(Conversion_Rate = mean(converted, na.rm = TRUE)) %>%
rename('Total Ads Seen (Binned)' = total_ads_bin, 'Conversion Rate' = Conversion_Rate)
ggplot(ads_conversion_rate_bin, aes(x = reorder(`Total Ads Seen (Binned)`, `Conversion Rate`), y = `Conversion Rate`, fill = `Conversion Rate`)) +
geom_bar(stat = "identity") +
scale_fill_gradient(low = "lightcoral", high = "darkred") +
coord_flip() +
theme_minimal() +
labs(title = "Conversion Rate by Total Ads Seen (Binned) - Total Ads < 50",
x = "Total Ads Seen (Binned)",
y = "Conversion Rate") +
theme(axis.text.y = element_text(angle = 0, hjust = 0.5, size = 10),
axis.text.x = element_text(size = 10),
plot.title = element_text(size = 14, face = "bold"),
legend.position = "none")
There appears to be a difference in conversion rates across different bins of total ads seen, with some bins showing a higher average conversion rate than others. The bins are arranged from highest to lowest conversion rate, allowing for easy identification of the bins with the highest (dark red bars) and lowest (light coral bars) conversion rates. It’s important to note that the analysis is limited to total ads seen less than 50. A similar analysis could be conducted for higher total ads seen bins to get a more complete picture of the relationship between total ads seen and conversion rate.
test_group_colors <- c('#4e79a7', '#f28e2b')
converted_colors <- c('#76b7b2', '#e15759')
most_ads_day_colors <- c('#edc948', '#f28e2b', '#4e79a7', '#e15759', '#76b7b2', '#59a14f', '#b07aa1')
most_ads_hour_colors <- scales::viridis_pal()(length(unique(data1$most.ads.hour)))
bar_plot_1 <- ggplot(data1, aes(x = test.group, fill = test.group)) +
geom_bar(color = 'black') +
scale_fill_manual(values = test_group_colors) +
labs(title = 'Test Group', x = '', y = 'Number of Users') +
theme_minimal() +
ylim(0, max(table(data1$test.group)) * 1.1) +
geom_text(stat = 'count', aes(label = after_stat(count)), vjust = -0.5)
print(bar_plot_1)
pie_data_1 <- data1 %>%
count(test.group) %>%
mutate(percentage = n / sum(n) * 100)
pie_chart_1 <- ggplot(pie_data_1, aes(x = "", y = n, fill = test.group)) +
geom_bar(stat = 'identity', width = 1, color = "white") +
coord_polar(theta = 'y') +
scale_fill_manual(values = test_group_colors) +
labs(title = 'Test Group', x = '', y = '') +
theme_minimal() +
theme(axis.text.x = element_blank(), axis.ticks = element_blank(), panel.grid = element_blank())
print(pie_chart_1)
Test Group Analysis:
The bar and pie charts for the ‘Test Group’ variable show how users are split between the two test groups, “ad” and “psa”. The two groups are not the same size, which should be kept in mind when comparing their conversion rates.
bar_plot_2 <- ggplot(data1, aes(x = factor(converted), fill = factor(converted))) +
geom_bar(color = 'black') +
scale_fill_manual(values = converted_colors) +
labs(title = 'Conversion Rate', x = '', y = 'Number of Users') +
theme_minimal() +
ylim(0, max(table(data1$converted)) * 1.1) +
geom_text(stat = 'count', aes(label = after_stat(count)), vjust = -0.5)
print(bar_plot_2)
pie_data_2 <- data1 %>%
count(converted) %>%
mutate(percentage = n / sum(n) * 100)
pie_chart_2 <- ggplot(pie_data_2, aes(x = "", y = n, fill = factor(converted))) +
geom_bar(stat = 'identity', width = 1, color = "white") +
coord_polar(theta = 'y') +
scale_fill_manual(values = converted_colors) +
labs(title = 'Conversion Rate', x = '', y = '') +
theme_minimal() +
theme(axis.text.x = element_blank(), axis.ticks = element_blank(), panel.grid = element_blank())
print(pie_chart_2)
Conversion Rate
The visualizations for ‘Conversion Rate’ show how few users converted. Of the 588,101 users, 14,843 converted and 573,258 did not, a conversion rate of about 2.5%. The bar chart shows the counts of each group, while the pie chart shows the same split as proportions. This baseline is useful context when judging how much a campaign change moves the conversion rate.
bar_plot_3 <- ggplot(data1, aes(x = most.ads.day, fill = most.ads.day)) +
geom_bar(color = 'black') +
scale_fill_manual(values = most_ads_day_colors) +
labs(title = 'Most Ads Viewed by Day', x = '', y = 'Number of Users') +
theme_minimal() +
ylim(0, max(table(data1$most.ads.day)) * 1.1) +
geom_text(stat = 'count', aes(label = after_stat(count)), vjust = -0.5)
print(bar_plot_3)
pie_data_3 <- data1 %>%
count(most.ads.day) %>%
mutate(percentage = n / sum(n) * 100)
pie_chart_3 <- ggplot(pie_data_3, aes(x = "", y = n, fill = most.ads.day)) +
geom_bar(stat = 'identity', width = 1, color = "white") +
coord_polar(theta = 'y') +
scale_fill_manual(values = most_ads_day_colors) +
labs(title = 'Most Ads Viewed by Day', x = '', y = '') +
theme_minimal() +
theme(axis.text.x = element_blank(), axis.ticks = element_blank(), panel.grid = element_blank()) +
geom_text(aes(label = paste0(round(percentage, 1), "%")),
position = position_stack(vjust = 0.5),
color = "white")
print(pie_chart_3)
Most Ads Viewed by Day
The bar and pie charts for ‘Most Ads Viewed by Day’ provide insights into user engagement by day. Certain days have higher ad views, suggesting that users are more active or engaged on these days. Understanding which days perform better in terms of ad views can help in scheduling ads more effectively to maximize exposure and conversion rates.
bar_plot_4 <- ggplot(data1, aes(x = factor(most.ads.hour), fill = factor(most.ads.hour))) +
geom_bar(color = 'black') +
scale_fill_manual(values = most_ads_hour_colors) +
labs(title = 'Most Ads Viewed by Hour', x = '', y = 'Number of Users') +
theme_minimal() +
ylim(0, max(table(data1$most.ads.hour)) * 1.1) +
geom_text(stat = 'count', aes(label = after_stat(count)), vjust = -0.5)
print(bar_plot_4)
pie_data_4 <- data1 %>%
count(most.ads.hour) %>%
mutate(percentage = n / sum(n) * 100)
pie_chart_4 <- ggplot(pie_data_4, aes(x = "", y = n, fill = factor(most.ads.hour))) +
geom_bar(stat = 'identity', width = 1, color = "white") +
coord_polar(theta = 'y') +
scale_fill_manual(values = most_ads_hour_colors) +
labs(title = 'Most Ads Viewed by Hour', x = '', y = '') +
theme_minimal() +
theme(axis.text.x = element_blank(),
axis.ticks = element_blank(),
panel.grid = element_blank()) +
geom_text(aes(label = paste0(round(percentage, 1), "%")),
position = position_stack(vjust = 0.5),
color = "white")
print(pie_chart_4)
Most Ads Viewed by Hour
The analysis of ‘Most Ads Viewed by Hour’ through bar and pie charts shows the distribution of ad views throughout different hours of the day. Certain hours, particularly in the late afternoon and early evening, have higher engagement. This information is crucial for timing ad placements to align with peak user activity times, thereby increasing the likelihood of conversions.
percentile_95 <- quantile(data1$total.ads, 0.95)
filtered_data <- data1 %>% filter(total.ads <= percentile_95)
ggplot(filtered_data, aes(x = total.ads)) +
geom_histogram(bins = 50, fill = '#4e79a7', color = 'black', alpha = 0.7) +
geom_density(aes(y = after_stat(count)), color = '#4e79a7', linewidth = 1) +
labs(title = 'Total Ads Viewed', x = '', y = 'Frequency') +
theme_minimal()
The histogram displays the distribution of the number of total ads seen for users whose total is at or below the 95th percentile. The x-axis represents the number of total ads seen, and the y-axis represents the frequency (count) of observations for each bin. The density curve helps visualize the overall shape of the distribution.
Bivariate analysis will be conducted to uncover relationships between pairs of variables. This analysis aims to reveal patterns and insights through visual exploration of how different variables interact. The specific areas of focus include:
Most Ads Day vs. Converted: The variation in conversion rates across different days of the week will be examined.
Most Ads Hour vs. Converted: The correlation between the time of day when users viewed the most ads and their conversion rates will be explored.
Total Ads vs. Converted: The relationship between the number of ads viewed and the likelihood of conversion will be analyzed.
Through these analyses, trends and correlations will be identified to inform deeper investigation and understanding.
conversion_counts <- data1 %>%
group_by(most.ads.day, converted) %>%
summarise(count = n(), .groups = 'drop') %>%
tidyr::spread(key = converted, value = count, fill = 0)
day_order <- c('Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday')
conversion_counts <- conversion_counts %>%
mutate(most.ads.day = factor(most.ads.day, levels = day_order)) %>%
arrange(most.ads.day)
conversion_percentages <- conversion_counts %>%
mutate(Total = `0` + `1`) %>%
mutate(`0_percent` = (`0` / Total) * 100,
`1_percent` = (`1` / Total) * 100)
ggplot(conversion_percentages, aes(x = most.ads.day)) +
geom_bar(aes(y = `0_percent`, fill = 'Not Converted'), stat = 'identity', color = 'black', position = 'stack') +
geom_bar(aes(y = `1_percent`, fill = 'Converted'), stat = 'identity', color = 'black', position = 'stack') +
geom_text(aes(y = `0_percent` / 2, label = sprintf('%.1f%%', `0_percent`)), color = 'white', size = 3, vjust = 3) +
geom_text(aes(y = `0_percent` + `1_percent` / 2, label = sprintf('%.1f%%', `1_percent`)), color = 'white', size = 3, vjust = 35) +
scale_fill_manual(values = c('Not Converted' = '#4e79a7', 'Converted' = '#f28e2b')) +
labs(title = 'Most Ads Day and Conversion', x = 'Most Ads Day', y = 'Percentage of Users', fill = 'Converted') +
theme_minimal() +
theme(legend.position = 'bottom')
Relatively Stable Conversion Rates: The graph shows that the share of non-converted users is between 96.7% and 97.9% on every day of the week, which means the conversion rate ranges from roughly 2.1% to 3.3%. Conversions are therefore rare on every day, regardless of the day on which a user saw the most ads.
Small Differences Between Days: The day-to-day differences amount to about one percentage point and are hard to see in the chart. They are nevertheless statistically significant according to the ANOVA and chi-squared tests on most.ads.day, so the day of the week has a measurable but modest association with conversion.
ggplot(data1, aes(x = factor(converted), y = most.ads.hour, fill = factor(converted))) +
geom_boxplot(outlier.colour = "red", outlier.size = 1) +
scale_fill_manual(values = c('0' = '#4e79a7', '1' = '#f28e2b')) +
labs(title = 'Most Ads Hour vs. Conversion Status',
x = 'Converted',
y = 'Most Ads Hour',
fill = 'Converted') +
theme_minimal() +
theme(legend.position = 'none')
The boxplot shows the distribution of the “most ads hour” for converted and non-converted users. The boxes represent the interquartile range (IQR), highlighting the middle 50% of the data, with the median indicated by a line inside the box. By comparing the boxes and medians, we can observe differences in “most ads hour” between the two groups. Red dots beyond the whiskers mark potential outliers, that is, values more than 1.5 times the IQR below the first quartile or above the third quartile.
percentile_95_total_ads <- quantile(data1$total.ads, 0.95)
filtered_data <- subset(data1, total.ads <= percentile_95_total_ads)
ggplot(filtered_data, aes(x = factor(converted), y = total.ads, fill = factor(converted))) +
geom_boxplot(outlier.colour = "red", outlier.size = 1) +
scale_fill_manual(values = c('0' = '#4e79a7', '1' = '#f28e2b')) +
labs(title = 'Total Ads Viewed vs. Conversion Status',
x = 'Converted',
y = 'Total Ads Viewed',
fill = 'Converted') +
theme_minimal() +
theme(legend.position = 'none')
The boxplot shows the distribution of total ads viewed by converted and non-converted users. The boxes represent the interquartile range (IQR), with the median indicated by a line inside the box. Red dots highlight outliers, which are values more than 1.5 times the IQR below the first quartile or above the third quartile. Comparing the boxes and medians reveals differences in ad viewing between the two groups, focusing on users whose total ads viewed is at or below the 95th percentile.
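The 1.5 × IQR rule behind the red outlier points can be reproduced by hand. The sketch below applies it in base R to a short hypothetical vector of ad counts; the values are invented for illustration.

```r
# Hypothetical ad counts for a handful of users.
total_ads <- c(1, 2, 3, 5, 8, 10, 12, 15, 20, 95)

# First and third quartiles and the interquartile range.
q1 <- quantile(total_ads, 0.25, names = FALSE)
q3 <- quantile(total_ads, 0.75, names = FALSE)
iqr <- q3 - q1

# Fences: points beyond these are drawn as outliers.
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

outliers <- total_ads[total_ads < lower_fence | total_ads > upper_fence]
print(outliers)
```

Only the value 95 falls outside the fences here, so it is the only point that would be drawn as an outlier.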
contingency_table_day <- table(data1$most.ads.day, data1$converted)
contingency_table_df <- as.data.frame(contingency_table_day)
colnames(contingency_table_df) <- c('MostAdsDay', 'Converted', 'Count')
sorted_table <- contingency_table_df %>%
group_by(MostAdsDay) %>%
summarise(Count = sum(Count)) %>%
arrange(desc(Count))
print(sorted_table)
## # A tibble: 7 × 2
## MostAdsDay Count
## <fct> <int>
## 1 Friday 92608
## 2 Monday 87073
## 3 Sunday 85391
## 4 Thursday 82982
## 5 Saturday 81660
## 6 Wednesday 80908
## 7 Tuesday 77479
ggplot(sorted_table, aes(x = reorder(MostAdsDay, Count), y = Count, fill = MostAdsDay)) +
geom_bar(stat = "identity") +
labs(title = "Total Users by Day with Most Ads Seen",
x = "Day with Most Ads Seen",
y = "Number of Users") +
theme_minimal() +
coord_flip()
Most Ads Seen: The table and chart indicate that Friday was the most common day for users to see their largest number of ads (92,608 users), followed by Monday (87,073) and Sunday (85,391).
Total Conversions: Because Count is summed over both converted and non-converted users, the table and chart show the total number of users per day rather than the number of conversions. To compare conversions by day, the rows with Converted equal to 1 have to be selected before summing.
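The difference between users per day and conversions per day can be sketched in base R. The data frame below is a hypothetical stand-in shaped like contingency_table_df, with invented counts for three days.

```r
# Hypothetical long-format counts, shaped like contingency_table_df.
counts <- data.frame(
  MostAdsDay = rep(c("Friday", "Monday", "Tuesday"), times = 2),
  Converted  = rep(c("0", "1"), each = 3),
  Count      = c(900, 850, 760, 20, 28, 23)
)

# Users per day: sum over both conversion statuses.
users_per_day <- tapply(counts$Count, counts$MostAdsDay, sum)

# Conversions per day: keep only the Converted == "1" rows first.
conv <- counts[counts$Converted == "1", ]
conversions_per_day <- tapply(conv$Count, conv$MostAdsDay, sum)

# Conversion rate per day, aligning the two vectors by day name.
rate_per_day <- conversions_per_day / users_per_day[names(conversions_per_day)]
print(round(rate_per_day, 4))
```

In this toy example Friday has the most users but Monday has the highest conversion rate, which is why the two quantities should be reported separately.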
contingency_table_day <- table(data1$most.ads.day, data1$converted)
chi2_test <- chisq.test(contingency_table_day)
expected_frequencies_day <- chi2_test$expected
expected_frequencies_df <- as.data.frame(expected_frequencies_day)
colnames(expected_frequencies_df) <- c('Not_Converted', 'Converted')
rownames(expected_frequencies_df) <- rownames(expected_frequencies_day)
expected_frequencies_df <- expected_frequencies_df %>%
mutate(across(everything(), \(x) round(x, 2)))
sorted_expected_frequencies <- expected_frequencies_df %>%
arrange(desc(Converted))
print(sorted_expected_frequencies)
## Not_Converted Converted
## Friday 90270.68 2337.32
## Monday 84875.38 2197.62
## Sunday 83235.83 2155.17
## Thursday 80887.63 2094.37
## Saturday 79598.99 2061.01
## Wednesday 78865.97 2042.03
## Tuesday 75523.52 1955.48
contingency_df <- as.data.frame(contingency_table_day)
colnames(contingency_df) <- c('Most Ads Day', 'Converted', 'Count')
ggplot(contingency_df, aes(x = `Converted`, y = `Most Ads Day`, fill = Count)) +
geom_tile(color = 'black') +
geom_text(aes(label = Count), color = 'black', size = 3) +
scale_fill_gradient2(low = "#ffcccc", mid = "#ffff99", high = "#0066cc", midpoint = median(contingency_df$Count)) +
labs(title = 'Heatmap of Conversion Status by Most Ads Day', x = 'Conversion Status', y = 'Most Ads Day') +
theme_minimal()
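Earlier analysis showed that Friday is the most common day for most ads seen. The heatmap displays the raw count in each day-by-conversion cell, which makes absolute numbers easy to compare. Non-converted users far outnumber converted users on every day, however, so judging whether Friday (or any other day) has a higher proportion of converted users requires dividing each count by its row total.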
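Row-wise proportions and the chi-square test result can be read off a contingency table directly. The sketch below uses a small made-up table in place of `contingency_table_day`; the counts and the names `toy_table`, `row_props`, and `toy_test` are illustrative assumptions, not results from the dataset.

```r
# Made-up counts standing in for table(data1$most.ads.day, data1$converted)
toy_table <- matrix(
  c(970, 30,
    960, 40,
    985, 15),
  nrow = 3, byrow = TRUE,
  dimnames = list(day = c("Friday", "Monday", "Tuesday"),
                  converted = c("FALSE", "TRUE"))
)

# Share of non-converted and converted users within each day (rows sum to 1)
row_props <- prop.table(toy_table, margin = 1)
print(round(row_props, 3))

# Chi-square test of independence between day and conversion status
toy_test <- chisq.test(toy_table)
cat("X-squared =", round(toy_test$statistic, 2),
    ", df =", toy_test$parameter,
    ", p-value =", signif(toy_test$p.value, 3), "\n")
```

Applied to the real table, the p-value indicates whether conversion status is plausibly independent of the day, while the row proportions show which days drive any difference.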
Ads vs. PSAs: The data show that advertisements (ads) drove conversions more successfully than public service announcements (PSAs), which suggests that ads had a stronger influence on purchases. However, most users were exposed to ads, so the two groups are very unequal in size and the difference should be interpreted with caution.
Impact of Ad Exposure: Users who saw more ads were generally more likely to convert, indicating that repeated exposure plays a key role in driving purchases.
Optimal Ad Exposure: Displaying between 250 and 749 ads appears to be the ideal range for maximizing conversions without overwhelming users.
Best Day for Campaigns: Mondays consistently showed the highest conversion rates, suggesting users are more responsive at the start of the week, possibly due to a refreshed mindset after the weekend.
Best Hour for Campaigns: The hour of 16:00 emerged as particularly effective for conversions. However, other late afternoon to early evening hours (14:00-20:00) also performed well, allowing for some flexibility in campaign scheduling.
Day-Hour Interaction: While Monday at 16:00 is an optimal time, Saturday at 05:00 also showed unexpectedly high conversion rates. Nevertheless, no specific day-hour combination significantly outperformed others.
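The caution about the ad and PSA groups being unequal in size can be made concrete with a two-proportion test, which accounts for each group's sample size. This is a sketch under stated assumptions: the counts below are made up for illustration and are not the dataset's figures.

```r
# Made-up counts, for illustration only: conversions and group sizes
conversions <- c(ad = 255, psa = 18)
group_sizes <- c(ad = 10000, psa = 1000)

# Two-sample test of equal proportions; the interval reflects both group sizes
res <- prop.test(x = conversions, n = group_sizes)

print(res$estimate)  # conversion rate in each group
print(res$conf.int)  # confidence interval for the difference in rates
print(res$p.value)
```

A small PSA group widens the confidence interval for the difference, which is why a gap in raw conversion rates alone is not enough to attribute the effect to the ads.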
Will the campaign be successful?
The analysis indicates that the campaign is likely to succeed if carefully executed. Targeting ads at the optimal times and ensuring appropriate ad exposure should lead to significant results.
How much of the success can be attributed to the ads?
Ads are shown to be a major factor in the campaign’s success. They were more effective than PSAs in driving purchases, especially when presented at the right times and in appropriate quantities.
Prioritize Mondays: Allocate a significant portion of the ad budget to Mondays, especially during late afternoon hours, as this day has demonstrated the highest conversion rates.
Focus on Late Afternoon and Early Evening: Run ads between 14:00 and 20:00, with a particular focus on 16:00, which consistently showed high conversion rates.
Optimize Ad Exposure: Target users with 250 to 749 ads to maximize conversions while avoiding ad fatigue. Monitor users who receive more than this to prevent diminishing returns.
Tailor to Day-Specific Timing: Adjust ad campaigns to align with the most effective times for each day to maximize engagement and conversions.