library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggthemes)
library(ggrepel)
library(effsize)
library(pwrss)
## 
## Attaching package: 'pwrss'
## 
## The following object is masked from 'package:stats':
## 
##     power.t.test
library(ggplot2)

Group members - Iqbal, Julia, Teresa, Dhruv

Model Critique - notes_07

url_ <- "https://raw.githubusercontent.com/leontoddjohnson/i590/main/data/marketing/marketing.csv"
marketing <- read_delim(url_, delim = ",")
## Rows: 40 Columns: 8
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (8): spend, clicks, impressions, display, transactions, revenue, ctr, co...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Dataset

We have a marketing dataset categorized by Display (0 for non-display ads, 1 for display ads). It encompasses metrics like clicks, impressions, spending, transactions, and revenue per ad. However, crucial aspects of the experiment and data collection remain undisclosed, necessitating several assumptions. These undisclosed elements lead to potential issues:

  1. Influence from Previous Ads: Unaccounted influence from prior ads a subject might have encountered could impact their response. For instance, exposure to a non-display Toyota ad days earlier might influence their interaction with a subsequent display ad from the same brand.

  2. Ambiguity in Advertisement Display: Lack of clarity regarding where and how ads were displayed poses uncertainties. Factors like the medium (personal devices, TV, streets) and circumstances of ad exposure (individual viewing or grouped viewing) could affect attention spans and overall interest in ads, potentially confounding the observed effects.

  3. Variability in Ads and Timing: The variability in products advertised and the timing of ad displays introduce complexities. Advertising multiple products, each targeting distinct customer segments, might skew analysis results if the distribution of customer groups remains unknown. Furthermore, external factors like seasonal events or days of the week could influence advertisement effectiveness.

  4. Impact of Advertised Product Price: Transaction amounts and product prices vary across ads, so revenue differences might be misattributed to the type of advertisement when they are actually driven by the price of the advertised product. This matters because the dataset’s limited transaction count means a few transactions carry significant weight in revenue.

marketing |>
  ggplot() +
  geom_boxplot(mapping = aes(x = revenue, 
                             y = factor(display, levels = c(0, 1), 
                                        labels = c("Normal","With Display")))) +
  labs(title = "Advertisement Effect on Revenue",
       x = "Revenue (in dollars)",
       y = "Advertisement Variation") +
  theme_minimal()

avg_revenues <- marketing |>
  group_by(display) |>
  summarize(avg_revenue = mean(revenue)) |>
  arrange(display)

avg_revenues
## # A tibble: 2 × 2
##   display avg_revenue
##     <dbl>       <dbl>
## 1       0        222.
## 2       1        315.
observed_diff <- (avg_revenues$avg_revenue[2] -
                  avg_revenues$avg_revenue[1])
paste("Observed Difference: ", observed_diff)
## [1] "Observed Difference:  93.59"

Revenue vs. Ad Type Graphs

  • By focusing on the disparity between average revenues with and without display, we face assumptions about pricing and the potential existence of multiple products. A more insightful approach might involve analyzing clicks and impressions to gain a clearer perspective.

  • To address this data collection issue, a potential solution could involve assigning one display and one non-display advertisement for each distinct product in the dataset. However, discerning identical products or those advertising similar collections remains ambiguous within the current dataset structure. Enhancing data clarity by identifying and labeling identical or similar products for display and non-display ads could resolve this uncertainty.
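As a rough sketch of the clicks-and-impressions comparison suggested above, the helper below summarizes engagement by ad type. The function name `engagement_by_display` and the two pooled rates are our own illustrative choices, not part of the original analysis. It assumes only the columns `display`, `clicks`, `impressions`, and `transactions` listed in the column specification.

```r
library(dplyr)

# Summarize engagement by ad type (0 = non-display, 1 = display).
# Hypothetical helper: assumes columns `display`, `clicks`,
# `impressions`, and `transactions` exist in `data`.
engagement_by_display <- function(data) {
  data |>
    group_by(display) |>
    summarize(
      n_ads = n(),
      # pooled rates: total events divided by total opportunities
      click_rate = sum(clicks) / sum(impressions),
      transactions_per_click = sum(transactions) / sum(clicks)
    ) |>
    arrange(display)
}
```

Calling `engagement_by_display(marketing)` would give one row per ad type. Because these rates do not depend on product price, they sidestep part of the pricing assumption, although they still cannot separate different products.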

Null Hypothesis

We could formulate a null hypothesis based on the available data (that display and non-display ads generate the same average revenue), but the lack of comprehensive information could skew our conclusions. For instance, suppose a company primarily targets different genders with specific product types on display and non-display ads. Without this insight, rejecting the null hypothesis and concluding that display ads generate higher revenue might seem accurate. However, changes in customer demographics could alter this outcome. If the customer base shifts significantly, continuing the same advertising strategy might lead to higher revenue from non-display ads, causing potential budgetary challenges if a substantial marketing budget is allocated to display ads.

The issue arises from assuming that revenue dependence is solely based on the type of display, which might not align with the actual influencing factors. This mismatch between the assumptions of the null hypothesis and the intricate dynamics of revenue generation could affect the accuracy of our conclusions.
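To make the null hypothesis concrete, here is a minimal sketch of a permutation test for the difference in mean revenue. Under the null of no difference, the display labels are exchangeable, so shuffling them shows how large a difference could arise by chance alone. The function name `perm_test_diff` and its arguments are hypothetical, and it assumes a 0/1 group indicator such as `display`.

```r
# Permutation test for a difference in group means.
# `values` is a numeric vector; `groups` is a 0/1 vector of equal length.
perm_test_diff <- function(values, groups, n_iter = 10^4) {
  stopifnot(length(values) == length(groups))

  # difference = mean of group 1 minus mean of group 0
  observed <- mean(values[groups == 1]) - mean(values[groups == 0])

  # under the null, labels are exchangeable, so shuffle them repeatedly
  perm_diffs <- replicate(n_iter, {
    shuffled <- sample(groups)
    mean(values[shuffled == 1]) - mean(values[shuffled == 0])
  })

  # two-sided p-value: share of shuffles at least as extreme as observed
  p_value <- mean(abs(perm_diffs) >= abs(observed))

  list(observed = observed, p_value = p_value)
}
```

With the data loaded above, the call would be `perm_test_diff(marketing$revenue, marketing$display)`. The test inherits the concern raised in this section: it attributes any difference to ad type alone, so unobserved factors such as product mix or customer demographics could still drive the result.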

Sample Distribution

The sampling distribution for the marketing data may be a normal distribution. However, if we were able to group the transactions by customer and split the customers into groups, the distribution within each group might not be normal. The same goes for product groups.

# the same bootstrapping function from lab_06
bootstrap <- function (x, func=mean, n_iter=10^4) {
  # empty vector to be filled with values from each iteration
  func_values <- c(NULL)
  
  # we simulate sampling `n_iter` times
  for (i in 1:n_iter) {
    # resample rows with replacement; nrow(x) keeps the original sample
    # size (length(x) would give the number of columns of a data frame)
    x_sample <- sample_n(x, size = nrow(x), replace = TRUE)
    
    # add on this iteration's value to the collection
    func_values <- c(func_values, func(x_sample))
  }
  
  return(func_values)
}
diff_in_avg <- function (x_data) {
  avg_revenues <- x_data |>
    group_by(display) |>
    summarize(avg_revenue = mean(revenue)) |>
    arrange(display)
  
  # difference = revenue_with - revenue_without
  diff <- (avg_revenues$avg_revenue[2] - 
           avg_revenues$avg_revenue[1])
  
  return(diff)
}

diffs_in_avgs <- bootstrap(marketing, diff_in_avg, n_iter = 100)
ggplot() +
  geom_function(xlim = c(-300, 300), 
                fun = function(x) dnorm(x, mean = 0, 
                                        sd = sd(diffs_in_avgs))) +
  geom_vline(mapping = aes(xintercept = observed_diff,
                           color = paste("observed: ",
                                         round(observed_diff)))) +
  labs(title = "Bootstrapped Sampling Distribution of Revenue Differences",
       x = "Difference in Revenue Calculated",
       y = "Probability Density",
       color = "") +
  scale_x_continuous(breaks = seq(-300, 300, 100)) +
  theme_minimal()
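The bootstrapped differences can also be summarized as a percentile confidence interval. The sketch below is one simple way to do that. The function name `percentile_ci` is our own, and the 90% default is an assumption chosen to match the `alpha = 0.1` used in the power analysis later in this report.

```r
# Percentile bootstrap interval from a vector of bootstrapped statistics.
# Hypothetical helper; `level` is the two-sided confidence level.
percentile_ci <- function(boot_values, level = 0.90) {
  tail_prob <- (1 - level) / 2
  # na.rm drops any iteration where the statistic could not be computed
  # (e.g., a resample that happened to contain only one ad type)
  quantile(boot_values,
           probs = c(tail_prob, 1 - tail_prob),
           names = FALSE,
           na.rm = TRUE)
}
```

For this report the call would be `percentile_ci(diffs_in_avgs)`. With only 100 bootstrap iterations the tail quantiles are noisy, so a larger `n_iter` would likely give a more stable interval.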

critical_value <- 2
delta <- 1.5

f_0 <- function(x) dnorm(x, mean = 0)
f_a <- function(x) dnorm(x, mean = delta)

ggplot() +
  stat_function(mapping = aes(fill = 'power'),
                fun = f_a, 
                xlim = c(critical_value, 4),
                geom = "area") +
    stat_function(mapping = aes(fill = 'alpha'),
                fun = f_0, 
                xlim = c(critical_value, 4),
                geom = "area") +
  geom_function(mapping = aes(color = 'Null Hypothesis'),
                xlim = c(-4, 4), fun = f_0) +
  geom_function(mapping = aes(color = 'Alternative Hypothesis'),
                xlim = c(-4, 4), fun = f_a) +
  geom_vline(mapping = aes(xintercept = critical_value,
                           color = "Critical Value")) +
  geom_vline(mapping = aes(xintercept = delta,
                           color = "Delta")) +
  geom_vline(mapping = aes(xintercept = 0),
             color = 'gray', linetype=2) +
  labs(title = "One-Tailed Test Illustration",
       subtitle = "(Mirror the right side for two-tailed tests.)",
       x = "Test Statistic",
       y = "Probability Density",
       color = "",
       fill = "") +
  scale_x_continuous(breaks = seq(-4, 4, 1)) +
  scale_fill_manual(values = c('lightblue', 'pink')) +
  scale_color_manual(values = c('darkred', 'darkorange', 'darkblue', 
                                'darkgreen')) +
  theme_minimal()
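The two shaded areas in the illustration can be computed directly from the normal distribution. This short sketch reuses the same `critical_value` and `delta` as the plot and assumes, as the plot does, unit standard deviation under both hypotheses.

```r
critical_value <- 2
delta <- 1.5

# alpha: area under the null density to the right of the critical value
alpha <- pnorm(critical_value, mean = 0, lower.tail = FALSE)

# power: area under the alternative density to the right of the critical value
power <- pnorm(critical_value, mean = delta, lower.tail = FALSE)

round(c(alpha = alpha, power = power), 4)
# alpha is about 0.0228 and power is about 0.3085
```

With these illustrative values the test would rarely reject a true null, but it would also detect a true effect of size `delta` less than a third of the time.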

marketing |>
  group_by(display) |>
  summarize(sd = sd(revenue),
            mean = mean(revenue))
## # A tibble: 2 × 3
##   display    sd  mean
##     <dbl> <dbl> <dbl>
## 1       0  138.  222.
## 2       1  152.  315.
test <- pwrss.t.2means(mu1 = 100, 
                       sd1 = sd(pluck(marketing, "revenue")),
                       kappa = 1,
                       power = .85, alpha = 0.1, 
                       alternative = "not equal")
##  Difference between Two means 
##  (Independent Samples t Test) 
##  H0: mu1 = mu2 
##  HA: mu1 != mu2 
##  ------------------------------ 
##   Statistical power = 0.85 
##   n1 = 34 
##   n2 = 34 
##  ------------------------------ 
##  Alternative = "not equal" 
##  Degrees of freedom = 66 
##  Non-centrality parameter = 2.733 
##  Type I error rate = 0.1 
##  Type II error rate = 0.15
plot(test)
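As a cross-check on the sample size, the same calculation can be sketched with base R's `power.t.test`. Because `pwrss` masks that function (see the attach message at the top of this report), the sketch calls `stats::power.t.test` explicitly. The wrapper name `required_n` is hypothetical, and its defaults mirror the `power = .85` and `alpha = 0.1` used above.

```r
# Per-group sample size for a two-sided, two-sample t test.
# Hypothetical wrapper around stats::power.t.test; called with the
# stats:: prefix because pwrss masks power.t.test when attached.
required_n <- function(delta, sd, power = 0.85, sig_level = 0.1) {
  result <- stats::power.t.test(delta = delta,
                                sd = sd,
                                sig.level = sig_level,
                                power = power,
                                type = "two.sample",
                                alternative = "two.sided")
  # round up, since a fraction of an observation cannot be collected
  ceiling(result$n)
}
```

Calling `required_n(100, sd(marketing$revenue))` should land at or near the `n1 = n2 = 34` reported by `pwrss.t.2means`, assuming both functions use the same noncentral t calculation. Comparing that figure with `table(marketing$display)` would show whether the 40 ads in this dataset are enough to detect a $100 difference.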

Checking for Linearity

We are not explicitly checking for the assumptions of a linear model. It’s essential to examine the relationship between the response variable (revenue) and each predictor using scatter plots or correlation matrices. If the relationships are not linear, transformations might be necessary. The following code checks for linearity, homoscedasticity, and normality:

# Check for linearity

marketing |>
  ggplot() +
  geom_jitter(mapping = aes(x = spend, y = revenue)) +
  labs(title = "Scatter Plot of Spend vs. Revenue",
       x = "Spend (in dollars)",
       y = "Revenue (in dollars)") +
  theme_minimal()

marketing |>
  ggplot() +
  geom_jitter(mapping = aes(x = impressions, y = revenue)) +
  labs(title = "Scatter Plot of Impressions vs. Revenue",
       x = "Impressions",
       y = "Revenue (in dollars)") +
  theme_minimal()
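Scatter plots can be hard to judge by eye with 40 points, so a rough numeric complement is to add a squared term and test whether it improves the fit. This is only a sketch: the function name `curvature_test` is hypothetical, and a simple one-predictor model such as `revenue ~ spend` is our assumption, since the model under critique is not shown here.

```r
# Compare a straight-line fit against a fit with an added squared term.
# Hypothetical helper; `x_var` and `y_var` are column names in `data`.
curvature_test <- function(data, x_var, y_var = "revenue") {
  linear <- lm(reformulate(x_var, response = y_var), data = data)

  quadratic_terms <- c(x_var, sprintf("I(%s^2)", x_var))
  quadratic <- lm(reformulate(quadratic_terms, response = y_var),
                  data = data)

  # F test: a small p-value suggests curvature the linear fit misses
  anova(linear, quadratic)
}
```

For example, `curvature_test(marketing, "impressions")` would test for curvature in the impressions relationship. A non-significant result does not prove linearity, particularly with a sample this small.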

# Check for homoscedasticity

marketing |>
  ggplot() +
  geom_point(mapping = aes(x = revenue, y = ctr)) +
  labs(title = "Scatter Plot of Revenue vs. CTR",
       x = "Revenue (in dollars)",
       y = "Click-Through Rate") +
  theme_minimal()

marketing |>
  ggplot() +
  geom_point(mapping = aes(x = revenue, y = con_rate)) +
  labs(title = "Scatter Plot of Revenue vs. Conversion Rate",
       x = "Revenue (in dollars)",
       y = "Conversion Rate") +
  theme_minimal()
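Homoscedasticity is usually judged from the residuals of a fitted model rather than from the raw variables. The sketch below fits a linear model and plots residuals against fitted values. The helper names `residual_data` and `residual_plot` are hypothetical, and any formula passed in (for example `revenue ~ spend`) is an assumption about the model being critiqued.

```r
library(ggplot2)

# Fit a linear model and collect its fitted values and residuals
residual_data <- function(data, formula) {
  model <- lm(formula, data = data)
  data.frame(fitted = fitted(model), residual = resid(model))
}

# Residuals vs. fitted values; a roughly even vertical spread around
# zero is consistent with constant variance
residual_plot <- function(data, formula) {
  residual_data(data, formula) |>
    ggplot() +
    geom_point(mapping = aes(x = fitted, y = residual)) +
    geom_hline(yintercept = 0, color = "gray", linetype = 2) +
    labs(title = "Residuals vs. Fitted Values",
         x = "Fitted Value",
         y = "Residual") +
    theme_minimal()
}
```

A call such as `residual_plot(marketing, revenue ~ spend)` would show whether the spread of residuals widens or narrows as fitted revenue grows, which would point to heteroscedasticity.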

# Check for normality

hist(marketing$revenue)

shapiro.test(marketing$revenue)
## 
##  Shapiro-Wilk normality test
## 
## data:  marketing$revenue
## W = 0.96698, p-value = 0.2877
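The normality assumption of a linear model concerns the residuals rather than the raw response, so the same test can be applied to the residuals of a fitted model. This is a sketch under assumptions: `residual_normality` is a hypothetical helper, and the formula is whatever model is being critiqued (for example `revenue ~ spend`).

```r
# Q-Q plot and Shapiro-Wilk test on the residuals of a linear model
residual_normality <- function(data, formula) {
  model <- lm(formula, data = data)
  res <- resid(model)

  # points close to the reference line are consistent with normality
  qqnorm(res, main = "Normal Q-Q Plot of Residuals")
  qqline(res)

  shapiro.test(res)
}
```

For instance, `residual_normality(marketing, revenue ~ spend)` draws the Q-Q plot and returns the test result. With 40 observations the test has limited power, so the plot is worth reading alongside the p-value.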

The scatter plots suggest that the linearity assumption may be met for the relationship between spend and revenue, but may be violated for the relationship between impressions and revenue. The Shapiro-Wilk test (W = 0.967, p = 0.288) gives no evidence against normality of revenue at conventional significance levels. If the model is trained using the impressions variable, it is important to check how this violation affects the model’s performance.