library(tidyverse) # dplyr + ggplot2
library(lmtest) # bptest()
options(scipen = 99)
Mid-term Helper - Dummy Encoding: Correlations & Regression
Causal Analytics for Business 2026 · companion to exam_helper_newspaper_ads.qmd
Why this file exists. The main notebook (exam_helper_newspaper_ads.qmd) treats day and section as factors, so a raw cor() matrix is not available (only inquiries is numeric). This companion takes the other route used in class: one-hot encode the factors into 0/1 dummy variables, which makes correlations computable and lets us run regression on the dummies explicitly. Everything downstream is interpreted in light of that encoding.

This is a class-sanctioned technique: the Week 1 in-class script created a 0/1 promoted dummy with ifelse() and fed it straight into cor(... %>% select_if(is.numeric)) (“correlation matrix for numeric and binary dummy coded variables”).

⚠️ The data spells Thursday as Thrusday. That label is literal; don’t “fix” it.
0 · Setup & data
df <- read.csv("newspaper_ads.csv", stringsAsFactors = FALSE)
df$day <- factor(df$day) # keep "Thrusday" literal
df$section <- factor(df$section)
str(df)
'data.frame': 60 obs. of 3 variables:
$ day : Factor w/ 5 levels "Friday","Monday",..: 2 2 2 2 4 4 4 4 5 5 ...
$ section : Factor w/ 3 levels "Business","News",..: 3 3 3 3 3 3 3 3 3 3 ...
$ inquiries: int 4 3 5 6 5 8 6 7 5 9 ...
1 · Create the dummy variables
Two equivalent, class-consistent ways to build 0/1 dummies.
## (a) MANUAL with ifelse() — the in-class style (one column per level).
## section has 3 levels -> 3 dummies; day has 5 -> 5 dummies.
dd <- df
dd$secBusiness <- ifelse(df$section == "Business", 1, 0)
dd$secNews <- ifelse(df$section == "News", 1, 0)
dd$secSports <- ifelse(df$section == "Sports", 1, 0)
dd$dayMonday <- ifelse(df$day == "Monday", 1, 0)
dd$dayTuesday <- ifelse(df$day == "Tuesday", 1, 0)
dd$dayWednesday <- ifelse(df$day == "Wednesday", 1, 0)
dd$dayThrusday <- ifelse(df$day == "Thrusday", 1, 0) # literal spelling
dd$dayFriday <- ifelse(df$day == "Friday", 1, 0)
head(dd)
## (b) SHORTCUT with model.matrix(): base R, all levels at once.
## "- 1" drops the intercept so we get a full one-hot block per factor.
dummies <- cbind(model.matrix(~ section - 1, df),
model.matrix(~ day - 1, df))
colnames(dummies)
[1] "sectionBusiness" "sectionNews" "sectionSports" "dayFriday"
[5] "dayMonday" "dayThrusday" "dayTuesday" "dayWednesday"
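Interpretation. A k-level factor becomes k 0/1 columns (one-hot). Each dummy reads “is this row at that level? 1/0.” This is just an explicit version of what factor() does internally; the difference is that we can now put these numeric columns into cor() and into a regression by name. The two routes differ only in column names: the manual ifelse() columns are prefixed sec* (secBusiness, ...), the model.matrix() columns section* (sectionBusiness, ...). §2 uses the model.matrix() names and §3 the manual ones. Note we keep the Thrusday typo as the column label so it still matches the data.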
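The equivalence of the two routes can be checked directly. The sketch below is a minimal illustration on a made-up six-row factor (the `toy` data frame is hypothetical, not the newspaper_ads.csv data): it builds the section dummies both ways and confirms that the values match and that every row has exactly one 1.

```r
# Toy data, hypothetical: not the newspaper_ads.csv values.
toy <- data.frame(
  section = factor(c("Business", "News", "Sports", "News", "Business", "Sports"))
)

# (a) manual dummies, one column per level
manual <- cbind(
  sectionBusiness = ifelse(toy$section == "Business", 1, 0),
  sectionNews     = ifelse(toy$section == "News", 1, 0),
  sectionSports   = ifelse(toy$section == "Sports", 1, 0)
)

# (b) model.matrix() without an intercept gives the same one-hot block
auto <- model.matrix(~ section - 1, toy)

same_block <- all(manual == auto)   # compares the 0/1 values cell by cell
row_sums   <- rowSums(auto)         # one-hot: each row has exactly one 1

same_block
unname(row_sums)
```

If the levels were listed in a different order in (a), the values would still be correct but the column order would no longer line up with model.matrix(), which always follows the factor's level order.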
2 · Correlations (now possible)
## Full correlation matrix: outcome + all dummies.
cor_dummies <- round(cor(cbind(inquiries = df$inquiries, dummies)), 2)
cor_dummies
inquiries sectionBusiness sectionNews sectionSports dayFriday
inquiries 1.00 0.21 0.15 -0.36 0.49
sectionBusiness 0.21 1.00 -0.50 -0.50 0.00
sectionNews 0.15 -0.50 1.00 -0.50 0.00
sectionSports -0.36 -0.50 -0.50 1.00 0.00
dayFriday 0.49 0.00 0.00 0.00 1.00
dayMonday -0.05 0.00 0.00 0.00 -0.25
dayThrusday -0.44 0.00 0.00 0.00 -0.25
dayTuesday 0.03 0.00 0.00 0.00 -0.25
dayWednesday -0.03 0.00 0.00 0.00 -0.25
dayMonday dayThrusday dayTuesday dayWednesday
inquiries -0.05 -0.44 0.03 -0.03
sectionBusiness 0.00 0.00 0.00 0.00
sectionNews 0.00 0.00 0.00 0.00
sectionSports 0.00 0.00 0.00 0.00
dayFriday -0.25 -0.25 -0.25 -0.25
dayMonday 1.00 -0.25 -0.25 -0.25
dayThrusday -0.25 1.00 -0.25 -0.25
dayTuesday -0.25 -0.25 1.00 -0.25
dayWednesday -0.25 -0.25 -0.25 1.00
## The line an MCQ usually wants: each level's correlation WITH inquiries, sorted.
sort(cor_dummies["inquiries", -1], decreasing = TRUE)
dayFriday sectionBusiness sectionNews dayTuesday dayWednesday
0.49 0.21 0.15 0.03 -0.03
dayMonday sectionSports dayThrusday
-0.05 -0.36 -0.44
Interpretation: what these correlations mean. A correlation between a 0/1 dummy and a numeric variable is a point-biserial correlation; mathematically it is the gap between that level’s mean and the mean of all other rows, rescaled to [−1, 1]. Reading the sorted vector:
- dayFriday ≈ +0.49: the strongest positive association; Friday rows have the highest inquiries.
- dayThrusday ≈ −0.44 and sectionSports ≈ −0.36: the strongest negative associations; Thursday and the Sports section have the fewest inquiries.
- sectionBusiness ≈ +0.21, sectionNews ≈ +0.15: mild positive.

This is the same story as the group means / ANOVA in the main notebook, viewed through a Week 1 correlation lens. Sanity check: cor(sectionSports, inquiries) = −0.36, and indeed Sports averages 7.0 vs 9.0 for the rest; the dummy correlation just re-expresses that gap.

Two caveats (important).
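- Within-factor dummies are mechanically negatively correlated. Any two section dummies correlate at −0.50 and any two day dummies at −0.25 (they’re mutually exclusive: being Business means not-News). In a balanced design with k levels the value is always −1/(k−1), so it is an artefact of one-hot encoding, not a finding; don’t interpret it. The same structure, taken across all k dummies of a factor, produces the dummy-variable trap: the k dummies sum to 1 in every row, which is perfectly collinear with the intercept, and that is why a regression must drop one dummy per factor (§3).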
- Each value is “this level vs all others pooled,” which is coarser than the pairwise contrasts (Tukey HSD) in the main notebook. Correlation here ranks levels; it does not test specific level-vs-level differences.
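Both points can be verified numerically. The sketch below uses made-up numbers (`y`, `d` and the factor `f` are hypothetical, not the newspaper data). It shows that cor() on a 0/1 dummy equals the mean gap divided by the population SD and scaled by sqrt(p·q), and that two dummies from a balanced 3-level factor correlate at −0.5 regardless of the outcome.

```r
# Toy data, hypothetical: not the newspaper_ads.csv values.
y <- c(4, 6, 5, 9, 8, 10, 7, 3, 6)
d <- c(1, 1, 1, 0, 0, 0, 0, 1, 0)   # 0/1 dummy for one level

r_pearson <- cor(y, d)

# Point-biserial form: (mean at the level - mean of the rest) / population SD,
# scaled by sqrt(p * q), where p is the share of rows at the level.
p      <- mean(d)
m1     <- mean(y[d == 1])
m0     <- mean(y[d == 0])
sd_pop <- sqrt(mean((y - mean(y))^2))   # divides by n, not n - 1
r_pb   <- (m1 - m0) / sd_pop * sqrt(p * (1 - p))

all.equal(r_pearson, r_pb)

# Balanced 3-level factor: any two of its dummies correlate at -1/(k - 1) = -0.5.
f <- factor(rep(c("a", "b", "c"), each = 4))
D <- model.matrix(~ f - 1)
r_within <- cor(D)[1, 2]
r_within
```

The second part involves no outcome variable at all, which is the reason the within-factor correlations carry no information about inquiries.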
3 · Regression on the dummies
3.1 The dummy-variable trap
## If you include ALL k dummies of a factor PLUS the intercept, you get perfect
## collinearity -> R drops one (shows NA). This is the dummy-variable trap.
lm_all <- lm(inquiries ~ secBusiness + secNews + secSports, data = dd)
summary(lm_all) # one coefficient comes back NA (aliased)
Call:
lm(formula = inquiries ~ secBusiness + secNews + secSports, data = dd)
Residuals:
Min 1Q Median 3Q Max
-5.9 -1.1 -0.1 1.9 5.1
Coefficients: (1 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.0000 0.5632 12.429 <0.0000000000000002 ***
secBusiness 2.1000 0.7965 2.637 0.0108 *
secNews 1.9000 0.7965 2.385 0.0204 *
secSports NA NA NA NA
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.519 on 57 degrees of freedom
Multiple R-squared: 0.1294, Adjusted R-squared: 0.09883
F-statistic: 4.235 on 2 and 57 DF, p-value: 0.01928
Interpretation. The three section dummies sum to 1 in every row, which equals the intercept column, so the four columns are perfectly collinear. R protects you by dropping one (the NA / “aliased” coefficient). Here it dropped secSports, so the intercept (7.0) is the Sports mean and the other two coefficients are differences from Sports. The fix is to omit one dummy per factor: the omitted level becomes the reference/baseline, exactly what factor() does automatically.
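The collinearity can also be seen in the design matrix itself, without fitting a model. A minimal sketch on a made-up balanced factor (the `section` vector here is hypothetical) compares the column rank of "intercept plus all three dummies" with "intercept plus two dummies":

```r
# Toy factor, hypothetical counts: 4 rows per level.
section <- factor(rep(c("Business", "News", "Sports"), times = 4))

# Intercept + ALL three dummies (the trap) vs intercept + two dummies (Business = baseline).
X_all  <- cbind(Intercept = 1, model.matrix(~ section - 1))
X_drop <- model.matrix(~ section)

# The dummies add up to the intercept column in every row ...
sum_equals_intercept <- all(rowSums(X_all[, -1]) == X_all[, "Intercept"])

# ... so X_all has 4 columns but only 3 independent ones.
rank_all  <- qr(X_all)$rank
rank_drop <- qr(X_drop)$rank

c(cols_all = ncol(X_all), rank_all = rank_all,
  cols_drop = ncol(X_drop), rank_drop = rank_drop)
```

A rank lower than the number of columns is what lm() reports as "not defined because of singularities".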
3.2 Dummy regression = factor regression
## Drop one dummy per factor (Business and Friday = baselines) -> no trap.
lm_dummies <- lm(inquiries ~ secNews + secSports +
dayMonday + dayTuesday + dayWednesday + dayThrusday,
data = dd)
summary(lm_dummies)
Call:
lm(formula = inquiries ~ secNews + secSports + dayMonday + dayTuesday +
dayWednesday + dayThrusday, data = dd)
Residuals:
Min 1Q Median 3Q Max
-3.7500 -1.5667 0.1167 1.3542 4.1500
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 11.6833 0.6876 16.992 < 0.0000000000000002 ***
secNews -0.2000 0.6366 -0.314 0.75461
secSports -2.1000 0.6366 -3.299 0.00174 **
dayMonday -2.8333 0.8218 -3.448 0.00112 **
dayTuesday -2.4167 0.8218 -2.941 0.00485 **
dayWednesday -2.7500 0.8218 -3.346 0.00151 **
dayThrusday -4.9167 0.8218 -5.983 0.000000193 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.013 on 53 degrees of freedom
Multiple R-squared: 0.4829, Adjusted R-squared: 0.4244
F-statistic: 8.249 on 6 and 53 DF, p-value: 0.000002536
## Identical model written with factors (baseline = first level alphabetically):
lm_factor <- lm(inquiries ~ section + day, data = df)
## Same coefficients, R^2, everything. (factor() orders day coefs by factor level, so we
## compare by NAME, not position: rename section* -> sec* and sort both sets.)
cf <- coef(lm_factor); names(cf) <- gsub("^section", "sec", names(cf))
cd <- coef(lm_dummies)
all.equal(cd[sort(names(cd))], cf[sort(names(cf))]) # TRUE
[1] TRUE
Interpretation. Manually dropping secBusiness / dayFriday reproduces the factor model exactly: same coefficients, same R² (≈ 0.48; ≈ 0.42 adjusted), same p-values. Each dummy coefficient is the mean difference of that level vs the baseline (e.g. secSports ≈ −2.1 = Sports minus Business, holding day fixed). The lesson: factor() and explicit dummies are the same regression; dummies just make the reference level a manual choice. (Whether a coefficient is “significant” is a t-test vs the baseline: change the baseline and those p-values change, even though the model fit does not.)
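The baseline point can be illustrated with relevel(). This is a minimal sketch on made-up numbers (the `toy` data frame and the column name `section2` are hypothetical): the same one-factor model is fitted with Business and then with Sports as the reference level.

```r
# Toy data, hypothetical: not the newspaper_ads.csv values.
toy <- data.frame(
  section   = factor(rep(c("Business", "News", "Sports"), each = 4)),
  inquiries = c(9, 10, 8, 9, 9, 8, 10, 9, 6, 7, 8, 7)
)

fit_business <- lm(inquiries ~ section, data = toy)    # baseline = Business

toy$section2 <- relevel(toy$section, ref = "Sports")   # baseline = Sports
fit_sports   <- lm(inquiries ~ section2, data = toy)

coef(fit_business)   # intercept = Business mean; others = differences from Business
coef(fit_sports)     # intercept = Sports mean; others = differences from Sports

# The fit itself is unchanged: same R^2 and same fitted values.
same_r2     <- all.equal(summary(fit_business)$r.squared,
                         summary(fit_sports)$r.squared)
same_fitted <- all.equal(unname(fitted(fit_business)),
                         unname(fitted(fit_sports)))
```

Only the contrasts being tested change, so the coefficient table (estimates, t-values, p-values) differs between the two fits while the overall F-test does not.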
3.3 Multicollinearity view (connects to Week 2)
## Manual VIF = 1/(1-R^2) from regressing each predictor on the others (Week 2 tutorial).
## With one-hot dummies the within-factor VIFs are elevated BY CONSTRUCTION.
preds <- c("secNews","secSports","dayMonday","dayTuesday","dayWednesday","dayThrusday")
sapply(preds, function(p){
others <- setdiff(preds, p)
1 / (1 - summary(lm(reformulate(others, response = p), data = dd))$r.squared)
}) |> round(2)
secNews secSports dayMonday dayTuesday dayWednesday dayThrusday
1.33 1.33 1.60 1.60 1.60 1.60
Interpretation. The dummies show mild VIF inflation purely because levels of the same factor are structurally related (Week 2’s point: collinearity costs precision, not bias). Here it’s harmless and expected — it’s not a sign of a modelling problem, just a property of one-hot encoding a balanced design. (Because the design is orthogonal across the two factors, section dummies are uncorrelated with day dummies — the inflation is only within a factor.)
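That these VIFs come from the design and not from the inquiries values can be checked on a synthetic layout. The sketch below assumes a balanced 3 × 5 design with 4 replicates per cell (60 rows, hypothetical day labels, no outcome variable at all) and applies the same 1/(1 − R²) recipe to the treatment-coded dummy columns.

```r
# Balanced 3 x 5 design, 4 replicates per cell (60 rows); labels are hypothetical.
design <- expand.grid(section = c("Business", "News", "Sports"),
                      day     = c("Mon", "Tue", "Wed", "Thu", "Fri"),
                      rep     = 1:4)

# Treatment-coded dummies (first level of each factor dropped), intercept removed.
X <- model.matrix(~ section + day, design)[, -1]

# VIF_j = 1 / (1 - R^2_j), with R^2_j from regressing column j on all other columns.
vif_manual <- sapply(colnames(X), function(j) {
  r2 <- summary(lm(X[, j] ~ X[, colnames(X) != j]))$r.squared
  1 / (1 - r2)
})

round(vif_manual, 2)
```

Under this assumed design the section dummies come out at 4/3 and the day dummies at 1.6, matching the rounded 1.33 and 1.60 above.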
4 · When to prefer which encoding
| Want to… | Use | Where |
|---|---|---|
| Rank levels by association with the outcome | dummy correlations (this file) | §2 |
| Estimate mean differences vs a baseline | factor or dummy regression (identical) | §3.2 / main |
| Test specific level-vs-level differences (corrected) | ANOVA + Tukey HSD (factors) | main §3.6 |
| Per-factor F-test / effect size | aov + eta_squared (factors) | main §3.4–3.5 |
Bottom line. Dummy encoding makes correlations computable and is fully class-legitimate, but it answers a ranking question, not a testing question. For the actual hypothesis tests (which sections/days differ, interaction, effect sizes) stay with the factor-based ANOVA in exam_helper_newspaper_ads.qmd. Use this file when a question is phrased in terms of correlation with inquiries.