Multiple Linear Regression

Author

Avery Holloman

Advertising Effectiveness at BBQ2GO: A Multiple Linear Regression Analysis

Abstract

As the owner of BBQ2GO, I want to evaluate the effectiveness of advertising expenditures across three channels: social media, direct mail, and newspapers. Using Multiple Linear Regression (MLR), I aim to quantify the impact of each advertising medium on sales, identify areas for budget optimization, and address potential issues like multicollinearity. By leveraging R for data analysis, I develop a reproducible framework that integrates data cleaning, model fitting, and relative importance metrics. This analysis provides actionable insights into advertising strategies while highlighting the utility of MLR in business decision-making.

Introduction

Advertising is a crucial driver of revenue in the competitive food industry. To understand how advertising expenditures influence sales, I employ MLR, a robust statistical method that evaluates the simultaneous impact of multiple predictors on a response variable. Unlike simple linear regression, MLR allows me to isolate the effect of each advertising channel while holding others constant. By using RStudio and Quarto, I ensure that my analysis is transparent, reproducible, and comprehensive.
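For reference, the model behind this analysis can be written out as below. The symbols $\beta_0, \dots, \beta_3$ and $\varepsilon$ are the usual textbook notation and are assumed here rather than taken from the report's code.

```latex
$$
Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \varepsilon
$$
```

Here $Y$ is sales, $X_1, X_2, X_3$ are the expenditures on social media, direct mail, and newspapers, and $\varepsilon$ is the error term. Each $\beta_j$ is the expected change in sales for a one-unit increase in $X_j$ with the other two predictors held fixed, which is what "holding others constant" means in practice.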

Methodology and Code

Data Preparation

The first step in any analysis is ensuring data quality. I simulated a dataset to mimic BBQ2GO’s advertising expenditures and sales. This approach allows me to control the complexity of the data. Each variable is drawn independently from a normal distribution, so the simulated sales have no built-in relationship to the advertising variables.

# Load the required package
if (!requireNamespace("tibble", quietly = TRUE)) {
  install.packages("tibble")
}
library(tibble)

# Step 1: Simulate the data
set.seed(42)  # I set the seed to ensure reproducibility of results

# I create variables to simulate advertising data
raw_data <- tibble(
  sales = rnorm(100, mean = 500, sd = 100),  # Sales (dependent variable)
  social_media = rnorm(100, mean = 200, sd = 50),  # Social media advertising
  direct_mail = rnorm(100, mean = 150, sd = 40),  # Direct mail advertising
  newspaper = rnorm(100, mean = 80, sd = 20)  # Newspaper advertising
)

# View the simulated data
head(raw_data)

# A tibble: 6 × 4
  sales social_media direct_mail newspaper
  <dbl>        <dbl>       <dbl>     <dbl>
1  637.         260.        70.0      79.9
2  444.         252.       163.       95.2
3  536.         150.       197.       80.8
4  563.         292.       232.       94.7
5  540.         167.        94.9      77.1
6  489.         205.       104.       78.8

Because the data are simulated with rnorm(), they contain no missing values, so no imputation was needed. Next, I focused on identifying and handling outliers.
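Before moving on, the absence of missing values can be confirmed with a quick check. The sketch below is an illustration, not part of the original workflow: it re-creates the simulated data with base R (a data.frame rather than a tibble, so it runs without extra packages), and `missing_counts` is a name introduced here.

```r
# Re-create the simulated data with base R (data.frame instead of tibble)
set.seed(42)
raw_data <- data.frame(
  sales = rnorm(100, mean = 500, sd = 100),
  social_media = rnorm(100, mean = 200, sd = 50),
  direct_mail = rnorm(100, mean = 150, sd = 40),
  newspaper = rnorm(100, mean = 80, sd = 20)
)

# Count missing values in each column
missing_counts <- colSums(is.na(raw_data))
print(missing_counts)

# Stop early if any column contains a missing value
stopifnot(all(missing_counts == 0))
```

With real sales records the same check would show which columns need attention before modeling.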

Data Cleaning

Outliers can bias the model coefficients, leading to misleading interpretations. I used the Interquartile Range (IQR) rule to detect and mitigate extreme values.

# Load the required package
if (!requireNamespace("dplyr", quietly = TRUE)) {
  install.packages("dplyr")
}
library(dplyr)


# Step 3: Handle outliers using the IQR rule
outlier_limits <- function(x) {
  q <- quantile(x, probs = c(0.25, 0.75), names = FALSE)  # 25th and 75th percentiles
  iqr <- q[2] - q[1]  # Interquartile range (IQR)
  c(lower = q[1] - 1.5 * iqr, upper = q[2] + 1.5 * iqr)  # Limits for outliers
}

# I winsorize a vector by pulling values outside the limits back to the nearest limit
winsorize <- function(x) {
  limits <- outlier_limits(x)
  pmin(pmax(x, limits[["lower"]]), limits[["upper"]])
}

# I apply the IQR rule to all predictors and winsorize outliers
cleaned_data <- raw_data %>%
  mutate(across(c(social_media, direct_mail, newspaper), winsorize))

# View the cleaned data
head(cleaned_data)

# A tibble: 6 × 4
  sales social_media direct_mail newspaper
  <dbl>        <dbl>       <dbl>     <dbl>
1  637.         260.        70.0      79.9
2  444.         252.       163.       95.2
3  536.         150.       197.       80.8
4  563.         292.       232.       94.7
5  540.         167.        94.9      77.1
6  489.         205.       104.       78.8

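I chose winsorization because it keeps every observation in the dataset while reducing the influence of extreme values, which makes the coefficient estimates more stable. The first six rows shown above are unchanged because none of them fall outside the limits.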
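As a small worked example of the IQR rule and winsorization described above, the sketch below uses a made-up vector with one obvious extreme value. The numbers are hypothetical and are not BBQ2GO data.

```r
# Hypothetical spending values with one obvious extreme (100)
x <- c(10, 12, 13, 14, 15, 16, 18, 100)

# Quartiles (R's default quantile type) and the IQR-based limits
q <- quantile(x, probs = c(0.25, 0.75), names = FALSE)
iqr <- q[2] - q[1]
lower <- q[1] - 1.5 * iqr
upper <- q[2] + 1.5 * iqr

# Winsorize: values outside [lower, upper] are replaced by the nearest limit
winsorized <- pmin(pmax(x, lower), upper)

print(c(lower = lower, upper = upper))
print(winsorized)
```

Here the quartiles are 12.75 and 16.5, so the limits are 7.125 and 22.125. Only the value 100 lies outside them, and it is replaced by 22.125 while the other seven values are left untouched.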

Model Fitting

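After cleaning the data, I standardized the predictors (mean 0, standard deviation 1). Each coefficient then represents the change in sales associated with a one-standard-deviation increase in that channel's spending, which makes the coefficients directly comparable.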

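# Step 4: Scale predictors to standardize units
# scale() returns a one-column matrix, so as.numeric() converts it back to a plain vector
scaled_data <- cleaned_data %>%
  mutate(across(c(social_media, direct_mail, newspaper), ~ as.numeric(scale(.x))))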

# Step 5: Fit the Multiple Linear Regression model
mlr_model <- lm(sales ~ social_media + direct_mail + newspaper, data = scaled_data)

# Summarize the model
model_summary <- summary(mlr_model)
model_summary


Call:
lm(formula = sales ~ social_media + direct_mail + newspaper, 
    data = scaled_data)

Residuals:
     Min       1Q   Median       3Q      Max 
-311.880  -53.385    6.958   67.895  260.783 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)   503.251     10.420  48.297   <2e-16 ***
social_media    6.229     10.493   0.594    0.554    
direct_mail   -15.140     10.504  -1.441    0.153    
newspaper       6.987     10.484   0.666    0.507    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 104.2 on 96 degrees of freedom
Multiple R-squared:  0.0291,    Adjusted R-squared:  -0.001239 
F-statistic: 0.9592 on 3 and 96 DF,  p-value: 0.4154

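The model output gives the following fitted regression equation:

$$\hat{Y} = 503.25 + 6.23X_1 - 15.14X_2 + 6.99X_3$$

Where $\hat{Y}$ is predicted sales, and $X_1, X_2, X_3$ represent standardized expenditures on social media, direct mail, and newspapers, respectively.

The coefficients indicated that:

Social media advertising ($\beta_1 = 6.23$, p = 0.554): a one-standard-deviation increase in spending was associated with an average sales increase of about 6.23, but the effect was not statistically significant.
Direct mail ($\beta_2 = -15.14$, p = 0.153) had the largest coefficient in magnitude, but it was negative and not statistically significant.
Newspaper advertising ($\beta_3 = 6.99$, p = 0.507) was also not statistically significant.

Overall, the model explained only 2.91% of the variance in sales (F = 0.9592, p = 0.4154), so it provides no evidence that any channel affects sales. This is expected here, because the simulated sales values were drawn independently of the advertising variables.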
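The abstract names multicollinearity as a potential issue, so a variance inflation factor (VIF) check fits naturally after the model fit. The sketch below is an assumption-laden illustration rather than the original workflow: it uses base R only, re-simulates the data without the winsorization step (so values may differ slightly from the report's), and the names `dat`, `predictors`, and `vif` are introduced here.

```r
# Re-create the simulated data with base R
set.seed(42)
dat <- data.frame(
  sales = rnorm(100, mean = 500, sd = 100),
  social_media = rnorm(100, mean = 200, sd = 50),
  direct_mail = rnorm(100, mean = 150, sd = 40),
  newspaper = rnorm(100, mean = 80, sd = 20)
)
predictors <- c("social_media", "direct_mail", "newspaper")

# VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing
# predictor j on the remaining predictors
vif <- sapply(predictors, function(p) {
  others <- setdiff(predictors, p)
  r2 <- summary(lm(reformulate(others, response = p), data = dat))$r.squared
  1 / (1 - r2)
})

print(round(vif, 3))
```

A VIF of 1 means a predictor is uncorrelated with the others. A common rule of thumb treats values above 5 or 10 as a sign of problematic multicollinearity. Because the predictors here are simulated independently, values close to 1 are to be expected.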

Relative Weights

Relative importance metrics helped me partition the explained variance ($R^2$) among the predictors. I used the LMG metric from the relaimpo package.

# Step 6: Calculate relative weights
library(relaimpo)

relative_weights <- calc.relimp(mlr_model, type = "lmg")
boot_ci <- boot.relimp(mlr_model, b = 1000, type = "lmg")
relative_weights

Response variable: sales 
Total response variance: 10844.24 
Analysis based on 100 observations 

3 Regressors: 
social_media direct_mail newspaper 
Proportion of variance explained by model: 2.91%
Metrics are not normalized (rela=FALSE). 

Relative importance metrics: 

                     lmg
social_media 0.003105917
direct_mail  0.020989846
newspaper    0.005005779

Average coefficients for different model sizes: 

                     1X        2Xs        3Xs
social_media   5.349060   5.796505   6.229268
direct_mail  -15.076489 -15.108789 -15.139555
newspaper      7.730754   7.371230   6.986660

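Results (LMG shares, which sum to the model's $R^2$ of 2.91%):

Direct mail: 0.0210, about 72% of the variance explained,
Newspaper: 0.0050, about 17%,
Social media: 0.0031, about 11%.

Direct mail accounted for the largest share of the explained variance. However, because the model explains less than 3% of the variance in sales, these shares do not show that any channel drives sales.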
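To make the LMG metric less of a black box, the sketch below computes it by hand: for every ordering of the predictors, it records how much $R^2$ increases when each predictor enters the model, then averages those increases per predictor. This is a base-R illustration of the idea, not the relaimpo implementation. It re-simulates the data without winsorization, so the numbers may differ slightly from the output above, and the names `dat`, `r2`, `perms`, and `lmg` are introduced here.

```r
# Re-create the simulated data with base R
set.seed(42)
dat <- data.frame(
  sales = rnorm(100, mean = 500, sd = 100),
  social_media = rnorm(100, mean = 200, sd = 50),
  direct_mail = rnorm(100, mean = 150, sd = 40),
  newspaper = rnorm(100, mean = 80, sd = 20)
)
predictors <- c("social_media", "direct_mail", "newspaper")

# R^2 of the model using a given subset of predictors (0 for the empty model)
r2 <- function(vars) {
  if (length(vars) == 0) return(0)
  summary(lm(reformulate(vars, response = "sales"), data = dat))$r.squared
}

# All 3! = 6 orderings of the three predictors
perms <- list(c(1, 2, 3), c(1, 3, 2), c(2, 1, 3),
              c(2, 3, 1), c(3, 1, 2), c(3, 2, 1))

# Sum each predictor's sequential R^2 gain over all orderings, then average
lmg <- setNames(numeric(length(predictors)), predictors)
for (ord in perms) {
  for (k in seq_along(ord)) {
    before <- predictors[ord[seq_len(k - 1)]]
    after <- predictors[ord[seq_len(k)]]
    current <- predictors[ord[k]]
    lmg[current] <- lmg[current] + r2(after) - r2(before)
  }
}
lmg <- lmg / length(perms)

print(lmg)            # share of R^2 attributed to each predictor
print(lmg / sum(lmg)) # the same shares normalized to sum to 1
```

The averaged contributions add up to the full model's $R^2$, which is why the LMG values in the relaimpo output sum to the proportion of variance explained by the model.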

# Ensure ggplot2 is installed and loaded
if (!requireNamespace("ggplot2", quietly = TRUE)) install.packages("ggplot2")
library(ggplot2)

# Step 7: Visualize relative weights
relative_weights_df <- tibble(
  Predictor = names(relative_weights$lmg),
  Weight = relative_weights$lmg
)

# Create the bar plot
ggplot(relative_weights_df, aes(x = reorder(Predictor, Weight), y = Weight, fill = Predictor)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(
    title = "Relative Importance of Advertising Channels",
    x = "Predictor",
    y = "Relative Weight"
  ) +
  theme_minimal()

The bar plot shows direct mail with the largest relative weight, although all three weights are small in absolute terms.

Conclusion

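This analysis illustrates how MLR can guide advertising strategies. By quantifying the impact of each channel on the simulated data, I determined that:

No channel showed a statistically significant effect on sales, and the model explained only 2.91% of the variance.
Direct mail had the largest relative weight (about 72% of the explained variance), but its coefficient was negative.
Social media and newspaper advertising had small, positive, non-significant coefficients.

Because the sales figures were simulated independently of the advertising variables, these results describe the simulation rather than BBQ2GO's real channels. The same workflow would need to be applied to actual sales records before any budget is reallocated.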