Mid-term Helper — Dummy Encoding: Correlations & Regression

Causal Analytics for Business 2026 · companion to exam_helper_newspaper_ads.qmd

Author: Shaam

Published: June 23, 2026

Why this file exists. The main notebook (exam_helper_newspaper_ads.qmd) treats day and section as factors, so cor() cannot build a correlation matrix from the raw data: it only accepts numeric columns, and inquiries is the only one. This companion takes the other route used in class: one-hot encode the factors into 0/1 dummy variables, which makes correlations computable and lets us run regression on the dummies explicitly. Everything downstream is interpreted in light of that encoding.

This is a class-sanctioned technique: the Week 1 in-class script created a 0/1 promoted dummy with ifelse() and fed it straight into cor(... %>% select_if(is.numeric)) (“correlation matrix for numeric and binary dummy coded variables”).
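To make the contrast concrete, here is a minimal sketch on made-up data (the toy data frame and its columns are hypothetical stand-ins, not the class dataset). It shows cor() refusing a factor column and then working once the factor is recoded as a 0/1 dummy with ifelse(). Base R's sapply(is.numeric) is used in place of select_if() so the sketch needs no packages.

```r
# Hypothetical toy data, not the class dataset.
toy <- data.frame(
  group = factor(c("promo", "none", "promo", "none", "promo", "none")),
  sales = c(12, 7, 11, 8, 13, 9)
)

# cor() on a data frame that contains a factor stops with an error.
raw_attempt <- try(cor(toy), silent = TRUE)
cor_fails   <- inherits(raw_attempt, "try-error")

# Recode the factor as a 0/1 dummy, then keep only the numeric columns.
toy$promoted <- ifelse(toy$group == "promo", 1, 0)
num_cols     <- sapply(toy, is.numeric)
cor_toy      <- cor(toy[, num_cols])

cor_fails          # TRUE: the raw attempt failed
round(cor_toy, 2)  # 2 x 2 matrix: sales and promoted
```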

⚠️ The data spells Thursday as Thrusday — that label is literal; don’t “fix” it.

0 · Setup & data

library(tidyverse)   # dplyr + ggplot2
library(lmtest)      # bptest()
options(scipen = 99)

df <- read.csv("newspaper_ads.csv", stringsAsFactors = FALSE)
df$day     <- factor(df$day)       # keep "Thrusday" literal
df$section <- factor(df$section)
str(df)

'data.frame':   60 obs. of  3 variables:
 $ day      : Factor w/ 5 levels "Friday","Monday",..: 2 2 2 2 4 4 4 4 5 5 ...
 $ section  : Factor w/ 3 levels "Business","News",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ inquiries: int  4 3 5 6 5 8 6 7 5 9 ...

1 · Create the dummy variables

Two equivalent, class-consistent ways to build 0/1 dummies.

## (a) MANUAL with ifelse() — the in-class style (one column per level).
## section has 3 levels -> 3 dummies; day has 5 -> 5 dummies.
dd <- df
dd$secBusiness <- ifelse(df$section == "Business", 1, 0)
dd$secNews     <- ifelse(df$section == "News",     1, 0)
dd$secSports   <- ifelse(df$section == "Sports",   1, 0)

dd$dayMonday    <- ifelse(df$day == "Monday",    1, 0)
dd$dayTuesday   <- ifelse(df$day == "Tuesday",   1, 0)
dd$dayWednesday <- ifelse(df$day == "Wednesday", 1, 0)
dd$dayThrusday  <- ifelse(df$day == "Thrusday",  1, 0)   # literal spelling
dd$dayFriday    <- ifelse(df$day == "Friday",    1, 0)

head(dd)

## (b) SHORTCUT with model.matrix() — base R, all levels at once.
## "- 1" drops the intercept so each call returns a full one-hot block.
## Two separate calls on purpose: ~ section + day - 1 would one-hot only section.
dummies <- cbind(model.matrix(~ section - 1, df),
                 model.matrix(~ day - 1, df))
colnames(dummies)

[1] "sectionBusiness" "sectionNews"     "sectionSports"   "dayFriday"      
[5] "dayMonday"       "dayThrusday"     "dayTuesday"      "dayWednesday"

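Interpretation. A k-level factor becomes k 0/1 columns (one-hot). Each dummy reads “is this row at that level? 1/0.” This is an explicit version of the expansion lm() performs behind the scenes when it is given a factor (there it builds k − 1 columns and treats the remaining level as the baseline); the difference is that we can now put these numeric columns into cor() and into a regression by name. The two routes name their columns differently: the manual dummies in dd use the sec prefix (secNews), while model.matrix() uses the full factor name (sectionNews). Both keep the Thrusday typo in the column label so it still matches the data.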
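As a quick check of the equivalence claimed above, the sketch below builds both kinds of dummies on a small made-up grid (the toy data frame is an assumption, not the class data) and confirms they match cell for cell. It also shows why route (b) uses two separate model.matrix() calls: in one combined formula, only the first factor keeps all of its levels.

```r
# Hypothetical toy grid reusing this file's factor names and level labels.
toy <- expand.grid(
  section = c("Business", "News", "Sports"),
  day     = c("Monday", "Tuesday", "Wednesday", "Thrusday", "Friday")
)

# Route (a): manual ifelse() dummies for section.
manual <- cbind(
  sectionBusiness = ifelse(toy$section == "Business", 1, 0),
  sectionNews     = ifelse(toy$section == "News",     1, 0),
  sectionSports   = ifelse(toy$section == "Sports",   1, 0)
)

# Route (b): model.matrix() without an intercept.
shortcut <- model.matrix(~ section - 1, toy)

same_dummies    <- all(manual == shortcut)      # elementwise comparison
rows_sum_to_one <- all(rowSums(shortcut) == 1)  # exactly one level per row

# One combined call: section keeps 3 columns, day is cut to 4.
combined <- model.matrix(~ section + day - 1, toy)
ncol(combined)  # 7, not 8
```

The 7-column result is the reason for cbind()-ing two one-factor matrices when a full one-hot block per factor is wanted.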

2 · Correlations (now possible)

## Full correlation matrix: outcome + all dummies.
cor_dummies <- round(cor(cbind(inquiries = df$inquiries, dummies)), 2)
cor_dummies

                inquiries sectionBusiness sectionNews sectionSports dayFriday
inquiries            1.00            0.21        0.15         -0.36      0.49
sectionBusiness      0.21            1.00       -0.50         -0.50      0.00
sectionNews          0.15           -0.50        1.00         -0.50      0.00
sectionSports       -0.36           -0.50       -0.50          1.00      0.00
dayFriday            0.49            0.00        0.00          0.00      1.00
dayMonday           -0.05            0.00        0.00          0.00     -0.25
dayThrusday         -0.44            0.00        0.00          0.00     -0.25
dayTuesday           0.03            0.00        0.00          0.00     -0.25
dayWednesday        -0.03            0.00        0.00          0.00     -0.25
                dayMonday dayThrusday dayTuesday dayWednesday
inquiries           -0.05       -0.44       0.03        -0.03
sectionBusiness      0.00        0.00       0.00         0.00
sectionNews          0.00        0.00       0.00         0.00
sectionSports        0.00        0.00       0.00         0.00
dayFriday           -0.25       -0.25      -0.25        -0.25
dayMonday            1.00       -0.25      -0.25        -0.25
dayThrusday         -0.25        1.00      -0.25        -0.25
dayTuesday          -0.25       -0.25       1.00        -0.25
dayWednesday        -0.25       -0.25      -0.25         1.00

## The line an MCQ usually wants: each level's correlation WITH inquiries, sorted.
sort(cor_dummies["inquiries", -1], decreasing = TRUE)

      dayFriday sectionBusiness     sectionNews      dayTuesday    dayWednesday 
           0.49            0.21            0.15            0.03           -0.03 
      dayMonday   sectionSports     dayThrusday 
          -0.05           -0.36           -0.44

Interpretation — what these correlations mean. A correlation between a 0/1 dummy and a numeric variable is a point-biserial correlation; mathematically it equals that level’s mean-vs-the-rest, rescaled to [−1, 1]. Reading the sorted vector:

dayFriday ≈ +0.49 — the strongest positive association: Friday rows have the highest inquiries.

dayThrusday ≈ −0.44 and sectionSports ≈ −0.36 — the strongest negative associations: Thursday and the Sports section have the fewest inquiries.

sectionBusiness ≈ +0.21, sectionNews ≈ +0.15: mild positive.

This is the same story as the group means / ANOVA in the main notebook, viewed through a Week 1 correlation lens. Sanity check: cor(secSports, inquiries) = −0.36, and indeed Sports averages 7.0 vs 9.0 for the rest — the dummy correlation just re-expresses that gap.
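The “mean gap, rescaled” reading can be verified directly. The sketch below uses made-up numbers (not the class data) and assumes the standard point-biserial identity r = (M1 − M0) · sqrt(p(1 − p)) / SD, where M1 and M0 are the outcome means inside and outside the level, p is the share of rows at the level, and SD is the standard deviation of the outcome computed with divisor n.

```r
# Made-up values: 3 rows at the level of interest, 6 rows elsewhere.
dummy <- c(1, 1, 1, 0, 0, 0, 0, 0, 0)
y     <- c(6, 7, 8, 9, 8, 10, 9, 11, 10)

p      <- mean(dummy)                                # share of rows at the level
gap    <- mean(y[dummy == 1]) - mean(y[dummy == 0])  # level mean minus the rest
sd_pop <- sqrt(mean((y - mean(y))^2))                # SD with divisor n, not n - 1

r_from_gap <- gap * sqrt(p * (1 - p)) / sd_pop       # rescaled mean gap
r_from_cor <- cor(dummy, y)                          # ordinary Pearson correlation

c(r_from_gap = r_from_gap, r_from_cor = r_from_cor)
all.equal(r_from_gap, r_from_cor)  # TRUE
```

Plugging in this file's Sports numbers (gap 7.0 − 9.0 = −2.0, p = 1/3) with an overall SD of roughly 2.63 gives about −0.36. Note that the 2.63 is backed out from the regression output in §3.1 rather than printed anywhere in this file, so treat it as an approximation.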

Two caveats (important).

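Within-factor dummies are mechanically negatively correlated. Any two section dummies correlate at −0.50 and any two day dummies at −0.25, because the levels are mutually exclusive (being Business means not-News); in a balanced k-level factor the value is −1/(k − 1). These numbers are artefacts of one-hot encoding, not findings, so don’t interpret them. The related dummy-variable trap is stronger than any pairwise correlation: all k dummies of a factor sum to 1 in every row, which is perfect (structural) collinearity with the intercept, and it’s why a regression must drop one dummy per factor (§3).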

Each value is “this level vs all others pooled,” which is coarser than the pairwise contrasts (Tukey HSD) in the main notebook. Correlation here ranks levels; it does not test specific level-vs-level differences.
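The first caveat can be reproduced without any outcome data. The sketch below builds a balanced factor with k levels (the helper dummy_cor() and its reps argument are hypothetical names introduced here) and returns the correlation between its first two dummies, which should come out at −1/(k − 1) whenever every level has the same number of rows.

```r
# Correlation between two dummies of a balanced k-level factor.
# dummy_cor() is an illustrative helper, not part of the class scripts.
dummy_cor <- function(k, reps = 4) {
  f <- factor(rep(seq_len(k), each = reps))  # k levels, reps rows each
  cor(model.matrix(~ f - 1))[1, 2]           # first vs second dummy
}

dummy_cor(3)  # -0.5,  matches the section block in the matrix above
dummy_cor(5)  # -0.25, matches the day block
```

Because the value depends only on the number of levels, it carries no information about inquiries.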

3 · Regression on the dummies

3.1 The dummy-variable trap

## If you include ALL k dummies of a factor PLUS the intercept, you get perfect
## collinearity -> R drops one (shows NA). This is the dummy-variable trap.
lm_all <- lm(inquiries ~ secBusiness + secNews + secSports, data = dd)
summary(lm_all)   # one coefficient comes back NA (aliased)


Call:
lm(formula = inquiries ~ secBusiness + secNews + secSports, data = dd)

Residuals:
   Min     1Q Median     3Q    Max 
  -5.9   -1.1   -0.1    1.9    5.1 

Coefficients: (1 not defined because of singularities)
            Estimate Std. Error t value            Pr(>|t|)    
(Intercept)   7.0000     0.5632  12.429 <0.0000000000000002 ***
secBusiness   2.1000     0.7965   2.637              0.0108 *  
secNews       1.9000     0.7965   2.385              0.0204 *  
secSports         NA         NA      NA                  NA    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.519 on 57 degrees of freedom
Multiple R-squared:  0.1294,    Adjusted R-squared:  0.09883 
F-statistic: 4.235 on 2 and 57 DF,  p-value: 0.01928

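Interpretation. The three section dummies sum to 1 in every row, which equals the intercept column, so the four columns are perfectly collinear. R protects you by dropping one (the NA/“aliased” coefficient); it drops the last redundant term, here secSports, so Sports silently becomes the baseline and the intercept (7.0) is the Sports mean. The fix is to omit one dummy per factor yourself: the omitted level becomes the reference/baseline, exactly what lm() does automatically when given a factor.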
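The collinearity can be seen in the design matrix itself. This is a minimal sketch on made-up outcome values (not the class data): it checks that the dummies add up to the intercept column, that the four-column matrix only has rank 3, and that lm() responds by returning NA for the last dummy.

```r
# Made-up data: 3 sections, 4 rows each.
f <- factor(rep(c("Business", "News", "Sports"), each = 4))
y <- c(9, 10, 8, 9,  9, 8, 10, 9,  7, 6, 8, 7)

D <- model.matrix(~ f - 1)
colnames(D) <- c("secBusiness", "secNews", "secSports")
X <- cbind(intercept = 1, D)   # intercept plus ALL three dummies

sums_match <- all(rowSums(D) == X[, "intercept"])  # dummies sum to the intercept
x_rank     <- qr(X)$rank                           # 3, although X has 4 columns

fit <- lm(y ~ secBusiness + secNews + secSports, data = data.frame(y, D))
coef(fit)  # secSports is NA; the intercept is the Sports mean (7)
```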

3.2 Dummy regression = factor regression

## Drop one dummy per factor (Business and Friday = baselines) -> no trap.
lm_dummies <- lm(inquiries ~ secNews + secSports +
                   dayMonday + dayTuesday + dayWednesday + dayThrusday,
                 data = dd)
summary(lm_dummies)


Call:
lm(formula = inquiries ~ secNews + secSports + dayMonday + dayTuesday + 
    dayWednesday + dayThrusday, data = dd)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.7500 -1.5667  0.1167  1.3542  4.1500 

Coefficients:
             Estimate Std. Error t value             Pr(>|t|)    
(Intercept)   11.6833     0.6876  16.992 < 0.0000000000000002 ***
secNews       -0.2000     0.6366  -0.314              0.75461    
secSports     -2.1000     0.6366  -3.299              0.00174 ** 
dayMonday     -2.8333     0.8218  -3.448              0.00112 ** 
dayTuesday    -2.4167     0.8218  -2.941              0.00485 ** 
dayWednesday  -2.7500     0.8218  -3.346              0.00151 ** 
dayThrusday   -4.9167     0.8218  -5.983          0.000000193 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.013 on 53 degrees of freedom
Multiple R-squared:  0.4829,    Adjusted R-squared:  0.4244 
F-statistic: 8.249 on 6 and 53 DF,  p-value: 0.000002536

## Identical model written with factors (baseline = first level alphabetically):
lm_factor <- lm(inquiries ~ section + day, data = df)

## Same coefficients, R^2, everything. (The factor model lists day coefs in alphabetical
## level order, so we compare by NAME, not position: rename section* -> sec* and sort both sets.)
cf <- coef(lm_factor); names(cf) <- gsub("^section", "sec", names(cf))
cd <- coef(lm_dummies)
all.equal(cd[sort(names(cd))], cf[sort(names(cf))])   # TRUE

[1] TRUE

Interpretation. Manually dropping secBusiness/dayFriday reproduces the factor model exactly: same coefficients, same R² (≈ 0.48; adjusted ≈ 0.42), same p-values. Each dummy coefficient is the mean difference of that level vs the baseline (e.g. secSports ≈ −2.1 = Sports minus Business, holding day fixed), and the intercept (≈ 11.68) is the predicted inquiries for the baseline combination, Business on Friday. The lesson: a factor in lm() and explicit dummies are the same regression; dummies just make the reference level a manual choice. (Whether a coefficient is “significant” is a t-test vs the baseline, so changing the baseline changes those p-values even though the model fit does not.)
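The point about baselines can be illustrated with relevel(). The sketch below uses made-up values (not the class data) and a single factor for brevity: switching the reference level from Business to Sports changes every coefficient, while the fitted values and R² stay identical.

```r
# Made-up data: Business and News average 9, Sports averages 7.
toy <- data.frame(
  section   = factor(rep(c("Business", "News", "Sports"), each = 4)),
  inquiries = c(9, 10, 8, 9,  9, 8, 10, 9,  7, 6, 8, 7)
)

fit_business <- lm(inquiries ~ section, data = toy)      # baseline = Business

toy$section2 <- relevel(toy$section, ref = "Sports")     # baseline = Sports
fit_sports   <- lm(inquiries ~ section2, data = toy)

same_fit <- isTRUE(all.equal(fitted(fit_business), fitted(fit_sports)))
same_r2  <- isTRUE(all.equal(summary(fit_business)$r.squared,
                             summary(fit_sports)$r.squared))

coef(fit_business)  # intercept 9; News about 0, Sports -2 (each vs Business)
coef(fit_sports)    # intercept 7; Business +2, News +2 (each vs Sports)
c(same_fit = same_fit, same_r2 = same_r2)
```

In this toy example News looks like “no effect” against Business but like a clear +2 against Sports, which is the sense in which p-values belong to the contrast, not to the level.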

3.3 Multicollinearity view (connects to Week 2)

## Manual VIF = 1/(1-R^2) from regressing each predictor on the others (Week 2 tutorial).
## With one-hot dummies the within-factor VIFs are elevated BY CONSTRUCTION.
preds <- c("secNews","secSports","dayMonday","dayTuesday","dayWednesday","dayThrusday")
sapply(preds, function(p){
  others <- setdiff(preds, p)
  1 / (1 - summary(lm(reformulate(others, response = p), data = dd))$r.squared)
}) |> round(2)

     secNews    secSports    dayMonday   dayTuesday dayWednesday  dayThrusday 
        1.33         1.33         1.60         1.60         1.60         1.60

Interpretation. The dummies show mild VIF inflation purely because levels of the same factor are structurally related (Week 2’s point: collinearity costs precision, not bias). Here it’s harmless and expected — it’s not a sign of a modelling problem, just a property of one-hot encoding a balanced design. (Because the design is orthogonal across the two factors, section dummies are uncorrelated with day dummies — the inflation is only within a factor.)
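Since VIFs depend only on the predictors, the values above can be reproduced from the design alone. The sketch below assumes a fully crossed, balanced layout of 3 sections × 5 days × 4 replicates (60 rows); that layout is an assumption consistent with the correlation matrix in §2, not something read from the data file. It also compares the result with 2(k − 1)/k, a closed form that holds for a balanced k-level factor with one level dropped.

```r
# Assumed design: 3 sections x 5 days x 4 replicates = 60 rows, fully crossed.
design <- expand.grid(
  section = c("Business", "News", "Sports"),
  day     = c("Friday", "Monday", "Thrusday", "Tuesday", "Wednesday"),
  rep     = 1:4
)

# Treatment-coded dummies (first level of each factor dropped); no outcome needed.
X <- model.matrix(~ section + day, design)[, -1]

# Manual VIF: regress each dummy on all the others.
vif_manual <- sapply(colnames(X), function(p) {
  others <- X[, setdiff(colnames(X), p), drop = FALSE]
  1 / (1 - summary(lm(X[, p] ~ others))$r.squared)
})
round(vif_manual, 2)

# Closed form for a balanced k-level factor with one level dropped.
vif_closed <- c(section = 2 * (3 - 1) / 3, day = 2 * (5 - 1) / 5)
round(vif_closed, 2)  # 1.33 and 1.60
```

If the real data were unbalanced, the manual VIFs would drift away from the closed form, so the formula is a check on balance rather than a general rule.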

4 · When to prefer which encoding

Want to…	Use	Where
Rank levels by association with the outcome	dummy correlations (this file)	§2
Estimate mean differences vs a baseline	factor or dummy regression (identical)	§3.2 / main
Test specific level-vs-level differences (corrected)	ANOVA + Tukey HSD (factors)	main §3.6
Per-factor F-test / effect size	`aov` + `eta_squared` (factors)	main §3.4–3.5

Bottom line. Dummy encoding makes correlations computable and is fully class-legitimate, but the dummy correlations in §2 answer a ranking question, not a testing question. For the actual hypothesis tests (which sections/days differ, interaction, effect sizes) stay with the factor-based ANOVA in exam_helper_newspaper_ads.qmd. Use this file when a question is phrased in terms of correlation with inquiries.