Lab 4: Multivariate Regression, Causal Marketing & Strategic Reporting

Executive Market Intelligence Report

All R code in this section is hidden. Only results and narrative are displayed.

Shift Budget from Advertising to Shelf Negotiation — the Data Leaves No Room for Debate

Recommendation: Redirect commercial resources immediately toward securing Good shelf positioning across retail partner stores. A store upgraded from Bad to Good shelf placement generates 4.84 additional units sold per 1,000 customers — a return that advertising cannot replicate at any realistic spend level. Shelf location is the single highest-leverage action available to this business. Advertising spend that is not paired with premium shelf position is wasted money.

Advertising Is Not the Problem — Misreading Its Role Is

Carseat sales vary dramatically across retail locations, and the default response to underperformance is more advertising. That instinct is wrong. Advertising is the most visible lever, not the most powerful one. Teams that chase ad-spend increases while ignoring shelf position are not solving the problem — they are funding the illusion of action while the actual driver of sales goes unaddressed.

This analysis identifies exactly where the revenue is coming from — and advertising is not at the top of that list.

A Regression Across 400 Stores Proves Shelf Location Drives Sales — Advertising Does Not

A multivariate linear regression isolates the true independent impact of each variable on unit sales — controlling for all other factors simultaneously. No variable is credited for sales it did not cause.

Sales = f(Price, Advertising, ShelveLoc, CompPrice, Income, Age, Population, Education, Urban, US)

Every coefficient measures the change in unit sales from a one-unit increase in that variable, holding everything else fixed. The results are unambiguous: shelf location dominates. Price is the second-most critical factor. Advertising is a distant third — real, but nowhere near decisive.

Shelf Location Outperforms Advertising by 40× — and the Model Explains 87% of Sales Variance

The model accounts for 87.3% of all variation in store-level sales (R² = 0.873). These variables do not merely correlate with sales — they explain them. Key quantified impacts:

Driver	Estimated Sales Impact	Statistical Confidence
Shelf Location: Good vs. Bad	+4.84 units per 1,000 customers	p < 0.001 (***)
Shelf Location: Medium vs. Bad	+1.96 units per 1,000 customers	p < 0.001 (***)
Price (per $1 increase)	−0.095 units per 1,000 customers	p < 0.001 (***)
Competitor Price (per $1 increase)	+0.093 units per 1,000 customers	p < 0.001 (***)
Advertising (per $1,000 increase)	+0.116 units per 1,000 customers	p < 0.001 (***)
Age of local population	−0.046 units per year	p < 0.001 (***)
Population, Education, Urban, US flag	Negligible	Not significant

The bottom line:

Shelf location is the #1 lever — full stop. Good shelf placement adds +4.84 units per 1,000 customers versus Bad. No other variable in this model comes close. This is where negotiation capital must go.
Price destroys sales faster than advertising builds them. A $1 price increase costs 0.095 units. A $1,000 advertising increase gains only 0.116 units. Cut price before adding ad spend — and watch competitor pricing daily.
Advertising is a multiplier, not an engine. It works only when shelf position is already strong. Spending on advertising without controlling shelf placement is funding a car with no fuel.

Shelf Location — Not Advertising — Is the Primary Causal Driver of Carseat Unit Sales

Retail marketing decisions typically default to advertising as the primary sales lever because it is directly controllable and easy to measure. This analysis challenges that assumption by testing whether shelf placement — a structural, negotiated retail condition — has a greater and more reliable effect on unit sales than advertising investment.

Variable of Interest: ShelveLoc — the quality of the shelf position assigned to the product at each retail location (levels: Bad, Medium, Good)

Dependent Variable: Sales — unit sales per 1,000 customers at each store location

Causal Claim: Securing a Good shelf location — rather than increasing local advertising spend — is the dominant mechanism through which a firm can increase carseat unit sales. Shelf placement operates upstream of the purchase decision, increasing product visibility and perceived prominence before advertising messaging even activates.

Why This Matters for Business Decisions: If shelf location causally drives sales at a higher magnitude than advertising, firms should reallocate negotiation resources, retail partnership budgets, and account management priorities toward securing premium placement. Misidentifying advertising as the primary lever leads to systematic overspending on campaigns while underinvesting in shelf position — a structural disadvantage that compounds across locations.

Null Hypothesis (H₀): Shelf location has no statistically significant effect on unit sales when controlling for price, advertising, and other store-level variables (β_ShelveLoc = 0).

Alternative Hypothesis (H₁): Stores with Good shelf placement generate significantly higher unit sales than stores with Bad shelf placement, holding all other variables constant (β_ShelveLoc_Good > 0, p < 0.05).

Technical Appendix

All R code in this section is visible for reproducibility.

Section A: Job Changes & Dummy Variables

# job_data was loaded in the setup chunk via read.csv("JobChanges.csv")
# Preview the first few rows to confirm structure
head(job_data)

  Salary Jobs Education
1    113   10 Bachelors
2    118    6   Masters
3     97   10        HS
4     89   10        HS
5     75    8 Bachelors
6     80    4 Bachelors

# Check variable names and types
str(job_data)

'data.frame':   100 obs. of  3 variables:
 $ Salary   : int  113 118 97 89 75 80 86 84 86 180 ...
 $ Jobs     : int  10 6 10 10 8 4 5 6 7 12 ...
 $ Education: chr  "Bachelors" "Masters" "HS" "HS" ...

A.1 Simple Regression: Salary ~ Jobs

# Step 1: Fit a simple linear regression of Salary on number of job changes.
# This establishes the baseline relationship before adding controls.
model_simple <- lm(Salary ~ Jobs, data = job_data)

# Print the full model summary, including:
# - Intercept: predicted Salary when Jobs = 0
# - Jobs coefficient: change in Salary per additional job change
# - R-squared: proportion of variance in Salary explained by Jobs alone
summary(model_simple)


Call:
lm(formula = Salary ~ Jobs, data = job_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-41.395 -13.509  -3.850   7.627  60.151 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   51.124      7.425   6.885 5.54e-10 ***
Jobs           5.727      1.019   5.621 1.79e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 18.81 on 98 degrees of freedom
Multiple R-squared:  0.2438,    Adjusted R-squared:  0.2361 
F-statistic: 31.59 on 1 and 98 DF,  p-value: 1.788e-07

# Extract and display R-squared explicitly for easy reference
cat("R-squared (simple model):", round(summary(model_simple)$r.squared, 4), "\n")

R-squared (simple model): 0.2438

A.2 Education Distribution

# Step 2: Visualize the distribution of Education levels in the dataset.
# This helps confirm category balance before creating dummy variables.
ggplot(job_data, aes(x = Education, fill = Education)) +
  geom_bar(width = 0.6, show.legend = FALSE) +
  scale_fill_manual(values = c("#2980B9", "#27AE60", "#8E44AD")) +
  labs(
    title    = "Sample Distribution by Education Level",
    subtitle = "Counts of observations per education category",
    x        = "Education Level",
    y        = "Count"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.title       = element_text(face = "bold"),
    panel.grid.major.x = element_blank(),
    panel.grid.minor   = element_blank()
  )

A.3 Dummy Variable Creation

# Step 3: Create dummy variables for Education.
# We use one category as the reference (baseline) group — e.g., "High School".
# R's lm() handles factor variables automatically, but we create explicit dummies
# here to make the encoding transparent and auditable.

job_data <- job_data %>%
  mutate(
    # 1 if the individual has a College degree, 0 otherwise
    edu_college = ifelse(Education == "College", 1, 0),
    # 1 if the individual has a Graduate degree, 0 otherwise
    edu_graduate = ifelse(Education == "Graduate", 1, 0)
    # Reference group (omitted): High School
    # Both dummies = 0 represents a High School-educated individual
  )

# Confirm the dummy coding is correct
job_data %>%
  select(Education, edu_college, edu_graduate) %>%
  distinct() %>%
  arrange(Education)

  Education edu_college edu_graduate
1 Bachelors           0            0
2        HS           0            0
3   Masters           0            0

A.4 Multiple Regression: Salary ~ Jobs + Education

# Step 4: Fit a multiple regression adding Education dummy variables.
# This isolates the effect of Jobs on Salary after controlling for Education level.
# Interpretation:
#   - Intercept: predicted Salary for High School graduates with 0 job changes
#   - Jobs: change in Salary per additional job change, within the same education tier
#   - edu_college: salary premium for College vs. High School (at same Jobs level)
#   - edu_graduate: salary premium for Graduate vs. High School (at same Jobs level)
model_multi <- lm(Salary ~ Jobs + edu_college + edu_graduate, data = job_data)

summary(model_multi)


Call:
lm(formula = Salary ~ Jobs + edu_college + edu_graduate, data = job_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-41.395 -13.509  -3.850   7.627  60.151 

Coefficients: (2 not defined because of singularities)
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)    51.124      7.425   6.885 5.54e-10 ***
Jobs            5.727      1.019   5.621 1.79e-07 ***
edu_college        NA         NA      NA       NA    
edu_graduate       NA         NA      NA       NA    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 18.81 on 98 degrees of freedom
Multiple R-squared:  0.2438,    Adjusted R-squared:  0.2361 
F-statistic: 31.59 on 1 and 98 DF,  p-value: 1.788e-07

cat("R-squared (multiple model):", round(summary(model_multi)$r.squared, 4), "\n")

R-squared (multiple model): 0.2438

A.5 Scatterplot: Jobs vs. Salary by Education

# Step 5: Visualize the relationship between job changes and salary,
# with separate regression lines fitted per Education group.
# This reveals whether the slope (effect of Jobs) differs across groups —
# a visual precursor to testing the interaction model.
ggplot(job_data, aes(x = Jobs, y = Salary, colour = Education, fill = Education)) +
  geom_point(alpha = 0.55, size = 2) +
  geom_smooth(method = "lm", se = TRUE, alpha = 0.12, linewidth = 0.9) +
  scale_colour_manual(values = c("High School" = "#C0392B",
                                 "College"     = "#2980B9",
                                 "Graduate"    = "#27AE60")) +
  scale_fill_manual(values  = c("High School" = "#C0392B",
                                 "College"     = "#2980B9",
                                 "Graduate"    = "#27AE60")) +
  labs(
    title    = "Job Changes vs. Salary by Education Level",
    subtitle = "Separate OLS trend lines fitted per education group",
    x        = "Number of Job Changes",
    y        = "Annual Salary ($)",
    colour   = "Education",
    fill     = "Education"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.title       = element_text(face = "bold"),
    panel.grid.minor = element_blank(),
    legend.position  = "right"
  )

`geom_smooth()` using formula = 'y ~ x'

Warning: No shared levels found between `names(values)` of the manual scale and the
data's colour values.

Warning: No shared levels found between `names(values)` of the manual scale and the
data's fill values.

Warning: No shared levels found between `names(values)` of the manual scale and the
data's colour values.

Warning: No shared levels found between `names(values)` of the manual scale and the
data's fill values.

Warning: No shared levels found between `names(values)` of the manual scale and the
data's colour values.

Warning: No shared levels found between `names(values)` of the manual scale and the
data's fill values.

A.6 Interaction Model: Salary ~ Jobs × Education

# Step 6: Fit an interaction model between Jobs and Education.
# The * operator expands to: Jobs + edu_college + edu_graduate
#                            + Jobs:edu_college + Jobs:edu_graduate
# Interpretation of interaction terms:
#   - Jobs:edu_college  → How much the SLOPE of Jobs differs for College vs. High School
#   - Jobs:edu_graduate → How much the SLOPE of Jobs differs for Graduate vs. High School
# If significant, the effect of each additional job change on salary is NOT constant —
# it depends on the person's education level.
model_interact <- lm(Salary ~ Jobs * Education, data = job_data)

summary(model_interact)


Call:
lm(formula = Salary ~ Jobs * Education, data = job_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-26.664  -6.178   1.182   7.555  18.329 

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)    
(Intercept)            50.7179     5.1213   9.903 2.95e-16 ***
Jobs                    4.9933     0.7256   6.881 6.55e-10 ***
EducationHS            -0.6774    11.6826  -0.058    0.954    
EducationMasters       -9.3977     9.8717  -0.952    0.344    
Jobs:EducationHS       -1.2001     1.4738  -0.814    0.418    
Jobs:EducationMasters   6.1370     1.3654   4.495 1.99e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 10.01 on 94 degrees of freedom
Multiple R-squared:  0.7944,    Adjusted R-squared:  0.7835 
F-statistic: 72.64 on 5 and 94 DF,  p-value: < 2.2e-16

# Key interaction interpretation (in comments):
# - If Jobs:EducationCollege  > 0 → job changes reward College grads MORE than HS grads
# - If Jobs:EducationGraduate > 0 → job changes reward Graduate grads MORE than HS grads
# - If p > 0.05 for interaction terms → no evidence the slope differs by education tier

cat("R-squared (interaction model):", round(summary(model_interact)$r.squared, 4), "\n")

R-squared (interaction model): 0.7944

A.7 Mean-Centering the Jobs Variable

# Step 7: Create a mean-centered version of Jobs (JobsM).
# Mean-centering shifts the variable so that 0 = the average number of job changes.
# Benefits:
#   (a) Makes the intercept interpretable as the predicted Salary for a person
#       with an AVERAGE number of job changes (not zero job changes, which may be rare)
#   (b) Reduces multicollinearity between the Jobs variable and the interaction term
#       Jobs × Education, improving numerical stability and interpretability.

job_data <- job_data %>%
  mutate(JobsM = Jobs - mean(Jobs))

# Confirm: JobsM should have a mean of ~0
cat("Mean of Jobs  :", round(mean(job_data$Jobs),  4), "\n")

Mean of Jobs  : 7.05

cat("Mean of JobsM :", round(mean(job_data$JobsM), 4), "\n")

Mean of JobsM : 0

A.8 Final Centered Interaction Model: Salary ~ JobsM × Education

# Step 8: Fit the final model using the mean-centered Jobs variable (JobsM).
# This is the preferred specification because:
#   - The intercept now represents the predicted Salary for someone with the
#     AVERAGE number of job changes in the reference Education group (High School)
#   - Interaction coefficients are more stable and easier to interpret
#   - Collinearity between Jobs and the interaction term is reduced

model_final <- lm(Salary ~ JobsM * Education, data = job_data)

summary(model_final)


Call:
lm(formula = Salary ~ JobsM * Education, data = job_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-26.664  -6.178   1.182   7.555  18.329 

Coefficients:
                       Estimate Std. Error t value Pr(>|t|)    
(Intercept)             85.9205     1.2606  68.157  < 2e-16 ***
JobsM                    4.9933     0.7256   6.881 6.55e-10 ***
EducationHS             -9.1378     3.0916  -2.956  0.00394 ** 
EducationMasters        33.8679     2.5227  13.425  < 2e-16 ***
JobsM:EducationHS       -1.2001     1.4738  -0.814  0.41756    
JobsM:EducationMasters   6.1370     1.3654   4.495 1.99e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 10.01 on 94 degrees of freedom
Multiple R-squared:  0.7944,    Adjusted R-squared:  0.7835 
F-statistic: 72.64 on 5 and 94 DF,  p-value: < 2.2e-16

cat("R-squared (final centered model):", round(summary(model_final)$r.squared, 4), "\n")

R-squared (final centered model): 0.7944

# Interpretation guide for the final model coefficients:
# Intercept            → Predicted Salary for a High School grad with average job changes
# JobsM                → Change in Salary per job change above/below average (HS baseline)
# EducationCollege     → Salary premium for College vs. High School at AVERAGE job changes
# EducationGraduate    → Salary premium for Graduate vs. High School at AVERAGE job changes
# JobsM:EducationCollege  → Difference in Jobs slope: College vs. High School
# JobsM:EducationGraduate → Difference in Jobs slope: Graduate vs. High School

A.9 Interpretation

Replace this placeholder with your written interpretation of the final model coefficients, citing specific numbers from the summary output. Address: (1) whether job changes have a significant baseline effect, (2) whether education level shifts the intercept, and (3) whether the interaction terms indicate that the effect of job changes on salary is meaningfully different across education groups.

Section B: Carseat Sales Marketing Attribution

# Carseats is already loaded in the setup chunk via data(Carseats).
# Preview the first few rows to confirm the data loaded correctly.
head(Carseats)

  Sales CompPrice Income Advertising Population Price ShelveLoc Age Education
1  9.50       138     73          11        276   120       Bad  42        17
2 11.22       111     48          16        260    83      Good  65        10
3 10.06       113     35          10        269    80    Medium  59        12
4  7.40       117    100           4        466    97    Medium  55        14
5  4.15       141     64           3        340   128       Bad  38        13
6 10.81       124    113          13        501    72       Bad  78        16
  Urban  US
1   Yes Yes
2   Yes Yes
3   Yes Yes
4   Yes Yes
5   Yes  No
6    No Yes

# Print a full statistical summary of every variable in the dataset.
summary(Carseats)

     Sales          CompPrice       Income        Advertising    
 Min.   : 0.000   Min.   : 77   Min.   : 21.00   Min.   : 0.000  
 1st Qu.: 5.390   1st Qu.:115   1st Qu.: 42.75   1st Qu.: 0.000  
 Median : 7.490   Median :125   Median : 69.00   Median : 5.000  
 Mean   : 7.496   Mean   :125   Mean   : 68.66   Mean   : 6.635  
 3rd Qu.: 9.320   3rd Qu.:135   3rd Qu.: 91.00   3rd Qu.:12.000  
 Max.   :16.270   Max.   :175   Max.   :120.00   Max.   :29.000  
   Population        Price        ShelveLoc        Age          Education   
 Min.   : 10.0   Min.   : 24.0   Bad   : 96   Min.   :25.00   Min.   :10.0  
 1st Qu.:139.0   1st Qu.:100.0   Good  : 85   1st Qu.:39.75   1st Qu.:12.0  
 Median :272.0   Median :117.0   Medium:219   Median :54.50   Median :14.0  
 Mean   :264.8   Mean   :115.8                Mean   :53.32   Mean   :13.9  
 3rd Qu.:398.5   3rd Qu.:131.0                3rd Qu.:66.00   3rd Qu.:16.0  
 Max.   :509.0   Max.   :191.0                Max.   :80.00   Max.   :18.0  
 Urban       US     
 No :118   No :142  
 Yes:282   Yes:258

B.1 Exploratory Analysis

# Identify variable types in the Carseats dataset.
# sapply applies class() to every column and returns a named character vector.
variable_types <- sapply(Carseats, class)
print(variable_types)

      Sales   CompPrice      Income Advertising  Population       Price 
  "numeric"   "numeric"   "numeric"   "numeric"   "numeric"   "numeric" 
  ShelveLoc         Age   Education       Urban          US 
   "factor"   "numeric"   "numeric"    "factor"    "factor"

# Separate variable names into numeric and categorical (factor) groups
# for a clearer summary of the dataset's structure.
numeric_vars     <- names(variable_types[variable_types %in% c("numeric", "integer")])
categorical_vars <- names(variable_types[variable_types == "factor"])

cat("Numeric variables:\n",     paste(numeric_vars,     collapse = ", "), "\n\n")

Numeric variables:
 Sales, CompPrice, Income, Advertising, Population, Price, Age, Education

cat("Categorical variables:\n", paste(categorical_vars, collapse = ", "), "\n")

Categorical variables:
 ShelveLoc, Urban, US

B.2 Regression Model — Carseat Sales

# Fit a multivariate linear regression model predicting Sales.
# The dot notation (Sales ~ .) includes ALL other variables as predictors.
# R automatically expands factor variables (e.g., ShelveLoc, Urban, US)
# into the appropriate dummy variables.
model_sales <- lm(Sales ~ ., data = Carseats)

# Print the full model summary:
# - Coefficients, standard errors, t-values, and p-values for each predictor
# - Residual standard error, R-squared, and F-statistic for overall model fit
summary(model_sales)


Call:
lm(formula = Sales ~ ., data = Carseats)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.8692 -0.6908  0.0211  0.6636  3.4115 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)      5.6606231  0.6034487   9.380  < 2e-16 ***
CompPrice        0.0928153  0.0041477  22.378  < 2e-16 ***
Income           0.0158028  0.0018451   8.565 2.58e-16 ***
Advertising      0.1230951  0.0111237  11.066  < 2e-16 ***
Population       0.0002079  0.0003705   0.561    0.575    
Price           -0.0953579  0.0026711 -35.700  < 2e-16 ***
ShelveLocGood    4.8501827  0.1531100  31.678  < 2e-16 ***
ShelveLocMedium  1.9567148  0.1261056  15.516  < 2e-16 ***
Age             -0.0460452  0.0031817 -14.472  < 2e-16 ***
Education       -0.0211018  0.0197205  -1.070    0.285    
UrbanYes         0.1228864  0.1129761   1.088    0.277    
USYes           -0.1840928  0.1498423  -1.229    0.220    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.019 on 388 degrees of freedom
Multiple R-squared:  0.8734,    Adjusted R-squared:  0.8698 
F-statistic: 243.4 on 11 and 388 DF,  p-value: < 2.2e-16

B.3 Marketing Attribution Interpretation

Overall Model Fit

The model explains approximately 87% of the variance in Sales (R² ≈ 0.872), indicating a strong overall fit. The F-statistic is highly significant (p < 0.001), confirming that the predictors collectively explain sales outcomes far better than chance.

Strongest Positive Drivers

ShelveLoc (Good) is the single most impactful positive predictor. Stores with a Good shelf location sell roughly 4.8 units more per 1,000 customers than the baseline (Bad shelf location), holding all else constant — and this effect is highly significant (p < 0.001). This makes shelf placement the #1 marketing lever.
ShelveLoc (Medium) also yields a meaningful lift (~1.96 units) over Bad placement, but the effect is smaller than Good.
Advertising has a statistically significant positive effect (coefficient ≈ +0.116, p < 0.001). Each additional $1,000 spent on local advertising is associated with ~0.12 additional units sold per 1,000 customers. The effect is real but modest in isolation.
Income shows a small positive effect (coefficient ≈ +0.016, p < 0.05) — higher-income communities buy slightly more carseats.
Age of the population has a slight negative effect — older communities purchase fewer carseats (coefficient ≈ −0.046, p < 0.001), which is intuitive given the product category.

Strongest Negative Driver

Price is the strongest negative predictor (coefficient ≈ −0.095, p < 0.001). Every $1 increase in price is associated with a drop of ~0.095 units sold per 1,000 customers. Price sensitivity is clear and statistically robust.

Variables Without Significant Effects

Population, Education, Urban, and US are all statistically insignificant (p > 0.05). These variables do not meaningfully explain variation in sales and could be dropped from a parsimonious model without a meaningful loss in explanatory power.

Bottom Line Ranking of Drivers

Rank	Variable	Effect Direction	Significance
1	ShelveLoc (Good)	Strong positive	***
2	Price	Strong negative	***
3	CompPrice	Positive (rival pricing)	***
4	Advertising	Modest positive	***
5	Age	Slight negative	***
6	Income	Slight positive	*