Mid-term Exam Helper — All Weeks (1–4)

Causal Analytics for Business 2026 · applied to newspaper_ads.csv

Author

Shaam

Published

June 23, 2026

What this document is. A week-by-week compendium of every R technique from Weeks 1–4 (in-class and tutorials), with interpretations and connections to the weekly content after each result — the same style as the class/tutorial notebooks. Open it with newspaper_ads.csv in the same folder and Render, or run chunks with ▶.

Two kinds of section:

  • APPLIED (runs) — computed on newspaper_ads.csv, prints real numbers, and is interpreted below the chunk. All of Week 1 and Week 3 are applied (this dataset is a two-factor experiment, the natural Week 3 setting).
  • TEMPLATE (eval: false) — needs a data structure this dataset lacks (Week 2 needs a causal system of variables; Week 4 needs a treatment column + covariates). These hold the exact class code patterns + interpretation of what they would show.

Only methods/libraries used in class appear here. Conceptual/DAG/formula MCQs → MIDTERM_CHEATSHEET.md.

⚠️ The data spells Thursday as Thrusday — that label is literal; do not “fix” it.

0 · Setup

## Packages used across the course. APPLIED code (Weeks 1 & 3) needs only the first four;
## the rest are for the Week 4 TEMPLATE chunks (not run on render).
# install.packages(c('tidyverse','lmtest','effectsize','MASS',
#                    'PSweight','boot','magrittr','lubridate'))

library(MASS)        # stepAIC()  (load BEFORE tidyverse so dplyr::select wins)
library(tidyverse)   # dplyr/ggplot2/readr
library(lmtest)      # bptest()
library(effectsize)  # eta_squared()

options(scipen = 99) # readable p-values, no scientific notation
df <- read.csv("newspaper_ads.csv", stringsAsFactors = FALSE)
df$day     <- factor(df$day)     # keep "Thrusday" exactly as in the data
df$section <- factor(df$section)

str(df)
'data.frame':   60 obs. of  3 variables:
 $ day      : Factor w/ 5 levels "Friday","Monday",..: 2 2 2 2 4 4 4 4 5 5 ...
 $ section  : Factor w/ 3 levels "Business","News",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ inquiries: int  4 3 5 6 5 8 6 7 5 9 ...
summary(df)
        day         section     inquiries     
 Friday   :12   Business:20   Min.   : 3.000  
 Monday   :12   News    :20   1st Qu.: 6.000  
 Thrusday :12   Sports  :20   Median : 8.000  
 Tuesday  :12                 Mean   : 8.333  
 Wednesday:12                 3rd Qu.:10.000  
                              Max.   :14.000  
table(df$day, df$section)         # 5 days x 3 sections x 4 reps = 60
           
            Business News Sports
  Friday           4    4      4
  Monday           4    4      4
  Thrusday         4    4      4
  Tuesday          4    4      4
  Wednesday        4    4      4

Interpretation. The cross-tab is 4 in every cell → a perfectly balanced design (equal n per treatment combination) and, because each section appears equally on each day, an orthogonal one. These two design properties (Week 3) are what make the analysis clean: balance minimises the standard errors, and orthogonality means each factor’s effect can be estimated independently of the other (we verify this numerically in §3.5). Note the alphabetical baseline ordering — Business and Friday come first, so they become the reference levels in every regression below.


WEEK 1 — Regression, model fit & diagnostics · APPLIED

Week 1 toolkit: correlation, outcome transformation, simple vs multiple OLS, R²/adj-R²/AIC, stepAIC, polynomial terms, standardised coefficients, residual normality, Breusch-Pagan, and prediction (MAE on a hold-out). Class datasets: football managers (in-class), HKRE apartments (tutorial). Same machinery, applied to inquiries.

1.1 Correlation & outcome distribution

df %>% select_if(is.numeric) %>% summary()   # 'inquiries' is the only numeric column
   inquiries     
 Min.   : 3.000  
 1st Qu.: 6.000  
 Median : 8.000  
 Mean   : 8.333  
 3rd Qu.:10.000  
 Max.   :14.000  
par(mfrow = c(2, 2))
hist(df$inquiries,      main = "inquiries (raw)", xlab = "inquiries")
hist(log(df$inquiries), main = "log(inquiries)",  xlab = "log(inquiries)")
qqnorm(df$inquiries, main = "QQ raw");      qqline(df$inquiries)
qqnorm(log(df$inquiries), main = "QQ log"); qqline(log(df$inquiries))

par(mfrow = c(1, 1))

Interpretation. A correlation matrix is N/A — cor() needs ≥2 numeric variables and inquiries is the only one (the predictors are categorical). On the distribution: raw inquiries is already fairly symmetric (mean 8.33 ≈ median 8, range 3–14), and the raw QQ points sit close to the line. Logging actually makes the small counts more skewed, so there is no case for a log transform here — the opposite of the HKRE price example, where the raw outcome was strongly right-skewed and log(price) was clearly better. As in class, remember the decision rule: we transform to get normal residuals, not because of the raw outcome per se (we re-check residuals in §1.5).

1.2 Simple vs multiple regression, R² / adjusted R² / AIC

fit_section <- lm(inquiries ~ section, data = df)
fit_day     <- lm(inquiries ~ day,     data = df)
summary(fit_section)

Call:
lm(formula = inquiries ~ section, data = df)

Residuals:
   Min     1Q Median     3Q    Max 
  -5.9   -1.1   -0.1    1.9    5.1 

Coefficients:
              Estimate Std. Error t value            Pr(>|t|)    
(Intercept)     9.1000     0.5632  16.158 <0.0000000000000002 ***
sectionNews    -0.2000     0.7965  -0.251              0.8026    
sectionSports  -2.1000     0.7965  -2.637              0.0108 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.519 on 57 degrees of freedom
Multiple R-squared:  0.1294,    Adjusted R-squared:  0.09883 
F-statistic: 4.235 on 2 and 57 DF,  p-value: 0.01928
summary(fit_day)

Call:
lm(formula = inquiries ~ day, data = df)

Residuals:
   Min     1Q Median     3Q    Max 
-5.083 -1.500  0.000  1.500  4.917 

Coefficients:
             Estimate Std. Error t value             Pr(>|t|)    
(Intercept)   10.9167     0.6378  17.116 < 0.0000000000000002 ***
dayMonday     -2.8333     0.9020  -3.141              0.00271 ** 
dayThrusday   -4.9167     0.9020  -5.451           0.00000122 ***
dayTuesday    -2.4167     0.9020  -2.679              0.00972 ** 
dayWednesday  -2.7500     0.9020  -3.049              0.00353 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.209 on 55 degrees of freedom
Multiple R-squared:  0.3535,    Adjusted R-squared:  0.3065 
F-statistic: 7.519 on 4 and 55 DF,  p-value: 0.00006612
fit_both <- lm(inquiries ~ section + day, data = df)
summary(fit_both)

Call:
lm(formula = inquiries ~ section + day, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.7500 -1.5667  0.1167  1.3542  4.1500 

Coefficients:
              Estimate Std. Error t value             Pr(>|t|)    
(Intercept)    11.6833     0.6876  16.992 < 0.0000000000000002 ***
sectionNews    -0.2000     0.6366  -0.314              0.75461    
sectionSports  -2.1000     0.6366  -3.299              0.00174 ** 
dayMonday      -2.8333     0.8218  -3.448              0.00112 ** 
dayThrusday    -4.9167     0.8218  -5.983          0.000000193 ***
dayTuesday     -2.4167     0.8218  -2.941              0.00485 ** 
dayWednesday   -2.7500     0.8218  -3.346              0.00151 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.013 on 53 degrees of freedom
Multiple R-squared:  0.4829,    Adjusted R-squared:  0.4244 
F-statistic: 8.249 on 6 and 53 DF,  p-value: 0.000002536
AIC(fit_section, fit_day, fit_both)

Interpretation. Each coefficient is a mean difference vs the baseline (the Week 1 reading of dummy-coded factors), e.g. in fit_section the sectionSports coefficient ≈ −2.1 says Sports averages ~2 fewer inquiries than Business. Going from a single predictor to both, Adjusted R² rises from ~0.13 (section alone) to ~0.42 (section + day) and AIC falls — day carries a lot of independent signal that section alone misses. This is the multiple-vs-simple-regression theme from Week 1: adding a relevant variable improves fit and can change how much “credit” each predictor gets. AIC is used here exactly as in class — for model comparison — and it is valid because all three models share the same outcome (inquiries).

1.3 Stepwise AIC selection (MASS::stepAIC)

full_mod <- lm(inquiries ~ section * day, data = df)
selected <- stepAIC(full_mod, direction = "both", trace = FALSE)
summary(selected)

Call:
lm(formula = inquiries ~ section * day, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.2500 -0.8125 -0.1250  0.7500  2.7500 

Coefficients:
                           Estimate Std. Error t value             Pr(>|t|)    
(Intercept)                  9.0000     0.6625  13.585 < 0.0000000000000002 ***
sectionNews                  3.5000     0.9369   3.736             0.000525 ***
sectionSports                2.2500     0.9369   2.402             0.020518 *  
dayMonday                    2.5000     0.9369   2.668             0.010560 *  
dayThrusday                 -1.2500     0.9369  -1.334             0.188854    
dayTuesday                  -0.2500     0.9369  -0.267             0.790813    
dayWednesday                -0.5000     0.9369  -0.534             0.596192    
sectionNews:dayMonday       -6.7500     1.3250  -5.094         0.0000067127 ***
sectionSports:dayMonday     -9.2500     1.3250  -6.981         0.0000000109 ***
sectionNews:dayThrusday     -7.0000     1.3250  -5.283         0.0000035643 ***
sectionSports:dayThrusday   -4.0000     1.3250  -3.019             0.004168 ** 
sectionNews:dayTuesday      -2.0000     1.3250  -1.509             0.138170    
sectionSports:dayTuesday    -4.5000     1.3250  -3.396             0.001437 ** 
sectionNews:dayWednesday    -2.7500     1.3250  -2.076             0.043684 *  
sectionSports:dayWednesday  -4.0000     1.3250  -3.019             0.004168 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.325 on 45 degrees of freedom
Multiple R-squared:  0.8098,    Adjusted R-squared:  0.7506 
F-statistic: 13.68 on 14 and 45 DF,  p-value: 0.000000000007697
AIC(selected)
[1] 218.7788

Interpretation. stepAIC keeps the full section * day interaction model (AIC ≈ 218.8) — it does not drop the interaction. That is a strong, automatic signal that the two factors do not act additively here: the effect of one depends on the other. This contrasts with the football-manager and salary exercises, where stepwise selection often pruned terms. Caveat from Week 1: stepAIC optimises fit, so treat its output as a guide and sanity-check against theory — here it happens to agree with the formal interaction test in §3.5.

1.4 Polynomial / squared terms & standardised coefficients — why N/A here

## Both require a NUMERIC predictor; section & day are categorical. In class these were used
## on numeric predictors (area, age, exp):
##   lm(log(price) ~ poly(area, 2, raw = TRUE) + ..., data = hkre)        # squared terms
##   lm(salary ~ male + factor(education) + exp + I(exp^2), data = df)    # quadratic
##   lm(damage ~ scale(attended) + scale(area) + scale(grade), data = d) # standardised
cat("Polynomial & standardised-coefficient methods need a numeric predictor — N/A here.\n")
Polynomial & standardised-coefficient methods need a numeric predictor — N/A here.

Interpretation. Squared/poly() terms capture curvature in a numeric predictor (diminishing returns to area, exp² for salary), and scale() puts numeric coefficients on a common SD scale so you can rank “which variable matters most” (the firemen/HKRE standardised bar charts). Neither is meaningful for section/day: a factor has no slope to bend and no units to standardise. The right Week 1 reading for categorical predictors is simply the per-level mean differences already shown in §1.2.

1.5 Residual diagnostics: normality + Breusch-Pagan

par(mfrow = c(1, 2))
hist(fit_both$residuals, main = "Residuals", xlab = "residuals")
qqnorm(fit_both$residuals); qqline(fit_both$residuals)

par(mfrow = c(1, 1))
bptest(fit_both)   # H0: constant residual variance. p<0.05 => heteroscedasticity

    studentized Breusch-Pagan test

data:  fit_both
BP = 10.818, df = 6, p-value = 0.09418

Interpretation. The residual histogram is roughly bell-shaped and the QQ points track the line, so the normality assumption looks acceptable. The Breusch-Pagan test gives p ≈ 0.094 > 0.05, so we fail to reject homoscedasticity — the residual variance is effectively constant and the standard errors / p-values can be trusted (the same “can we trust our SEs?” check that closes the firemen and salary exercises). No remedy (robust SEs, transformation) is needed here.

1.6 Prediction & MAE on a hold-out

set.seed(1)
train_idx <- sample(seq_len(nrow(df)), size = floor(0.7 * nrow(df)))
train <- df[train_idx, ]; test <- df[-train_idx, ]

fit_tr <- lm(inquiries ~ section, data = train)   # section only: 3 levels always in split
pred   <- predict(fit_tr, newdata = test)
mae    <- mean(abs(test$inquiries - pred))
cat("Hold-out MAE (inquiries ~ section):", round(mae, 3), "\n")
Hold-out MAE (inquiries ~ section): 2.35 

Interpretation. This is the Week 1 prediction mindset (Shmueli’s explain vs predict): fit on a training split, score on unseen data with MAE = mean(|actual − predicted|). A section-only model is off by ≈ 2.35 inquiries on average — not great, because (as §1.2–§1.3 showed) day and the interaction carry most of the signal. Adding them would lower the MAE. The key Week 1 distinction: for prediction you keep any input that helps out-of-sample accuracy; for explanation you’d instead let the causal diagram decide what belongs in the model.

Week 1 function map: cor() · hist()/qqnorm()/qqline() · lm() · summary() (R²/adj-R²/coeffs) · AIC() (same outcome only) · MASS::stepAIC() · I(x^2)/poly() & scale() (numeric only) · bptest() · predict() + MAE.


WEEK 2 — Causal diagrams & system of regressions · TEMPLATE

Week 2 = observational data: draw a causal diagram, decide what to control for (confounder / mediator / collider), run a system of regressions for path coefficients, compute total effects (product along a path, sum across paths) and omitted-variable bias. Class datasets: salary (in-class), firemen (tutorial).

Why template, not applied — and the connection. newspaper_ads.csv is a designed experiment: day and section were set by the design and are orthogonal, so there is no confounding structure to model — no back-door path, no mediator, no collider among the three columns. In fact the design guarantees the Week 2 story can’t appear: because the factors are orthogonal, controlling for one does not change the other’s coefficient (we demonstrate this in §3.5) — the exact opposite of omitted-variable bias. A Week 2 MCQ on this exam is therefore most likely a pen-&-paper DAG question (MIDTERM_CHEATSHEET.md). The code below is the class pattern, ready for a real variable system.

2.1 Control rules

Role of Z (between X and Y) Structure Control it?
Confounder Z → X and Z → Y YES (closes a back-door)
Mediator X → Z → Y NO for the total effect (it blocks the path)
Collider X → Z ← Y NO (controlling creates spurious association)

Total effect of X→Y = sum over directed paths; each path’s effect = product of its coefficients. To estimate a total effect in one regression: control variables that block all back-door paths but are not mediators/colliders (the answer can be None).

2.2 System of regressions + total effect via path products

## Salary in-class pattern: each node with incoming arrows gets its own regression.
modelA <- lm(salary ~ male + education + exp, data = df)   # exp & education -> salary
model1 <- lm(education ~ age, data = df)                   # age -> education  (path a)
model2 <- lm(exp ~ age,       data = df)                   # age -> exp        (path b)

indirect_via_education <- coef(model1)["age"] * coef(modelA)["education"]
indirect_via_exp       <- coef(model2)["age"] * coef(modelA)["exp"]
indirect_via_education + indirect_via_exp                  # total effect of age on salary

lm(salary ~ age + male, data = df)   # cross-check: control confounder, NOT mediators

Interpretation (what it would show). The two routes agree: summing the path products equals the age coefficient from the single regression that controls the confounder but omits the mediators. That is the Week 2 rule made concrete — to get a total effect you leave mediators out (the path stays open); to get a direct effect you control them.

2.3 Omitted-variable bias (OVB)

## Firemen pattern: omitting confounder 'area' biases 'attended'.
correct <- lm(damage ~ attended + area, data = df)   # area -> damage (direct)
reverse <- lm(area ~ attended,          data = df)   # attended -> area (open backdoor leg)
OVB    <- coef(reverse)["attended"] * coef(correct)["area"]
biased <- lm(damage ~ attended, data = df)           # omit 'area'

coef(biased)["attended"]                  # = correct + OVB
coef(correct)["attended"] + OVB

Interpretation (what it would show). OVB = (effect of omitted var on outcome) × (effect of treatment on omitted var), and biased = correct + OVB exactly. In the firemen case this is why the mayor was wrong: the spurious +24.5 “firemen cause damage” decomposes into the true protective effect (−12.4) plus the area-confounding bias (+36.9).

2.4 Multicollinearity check (manual VIF)

## VIF = 1/(1-R^2) from regressing each predictor on the others. >5 a concern, >10 severe.
predictors <- c("attended", "area", "trucks", "grade")
sapply(predictors, function(p) {
  others <- setdiff(predictors, p)
  1 / (1 - summary(lm(reformulate(others, response = p), data = df))$r.squared)
})

Interpretation (what it would show). High VIFs (firemen: ≈28 for attended/area) flag that predictors carry little independent variation — Week 2’s lesson that collinearity costs precision (wide CIs, unstable coefficients), not bias; the remedy is the causal diagram, not deleting a variable.

Week 2 function map: lm() per node · coef() for path coefficients · products/sums for total effects · OVB = product of the two omitted-path coefficients · manual VIF via reformulate() + 1/(1-R²).


WEEK 3 — Experiments & ANOVA · APPLIED (core of this dataset)

newspaper_ads.csv is a balanced full-factorial design: section (3 levels) × day (5 levels), 4 replicates per cell — exactly the in-class paper-towels / store-design setting.

3.1 Balance & orthogonality

table(df$section, df$day)
          
           Friday Monday Thrusday Tuesday Wednesday
  Business      4      4        4       4         4
  News          4      4        4       4         4
  Sports        4      4        4       4         4
replications(inquiries ~ section * day, data = df)
    section         day section:day 
         20          12           4 

Interpretation. replications returns equal counts (section 20, day 12, each cell 4), confirming the design is balanced and orthogonal (Week 3). The payoff, just like the paper-towels exercise: balance gives the smallest possible standard errors, and orthogonality means adding the second factor will not disturb the first factor’s Sum Sq — only shrink the error term (shown next in §3.5).

3.2 Boxplots (distribution by group)

par(mfrow = c(1, 2))
boxplot(inquiries ~ section, data = df, main = "by section")
boxplot(inquiries ~ day,     data = df, las = 2, main = "by day")

inquiries by section, by day, and per cell.
par(mfrow = c(1, 1))
boxplot(inquiries ~ section + day, data = df, las = 2, cex.axis = 0.7, main = "by cell")

inquiries by section, by day, and per cell.

Interpretation. The Week 3 “compare the distributions first” step. By section, Business (~9.1) and News (~8.9) sit above Sports (~7.0). By day, Friday (~10.9) is clearly highest and Thrusday (~6.0) lowest. The per-cell boxplot already hints at an interaction — the section ranking is not the same on every day (e.g. Sports is strong on Friday but weakest on Monday), which the formal test in §3.5 confirms.

3.3 Two-group t-tests (Welch, var.equal = FALSE)

ttest_pair <- function(data, var, group, lvl1, lvl2) {
  sub <- data[data[[group]] %in% c(lvl1, lvl2), ]
  sub[[group]] <- factor(sub[[group]])
  cat("\n----", lvl1, "vs", lvl2, "----\n")
  print(t.test(sub[[var]] ~ sub[[group]], var.equal = FALSE))
}
for (p in combn(levels(df$section), 2, simplify = FALSE)) {
  ttest_pair(df, "inquiries", "section", p[1], p[2])
}

---- Business vs News ----

    Welch Two Sample t-test

data:  sub[[var]] by sub[[group]]
t = 0.254, df = 30.172, p-value = 0.8012
alternative hypothesis: true difference in means between group Business and group News is not equal to 0
95 percent confidence interval:
 -1.407702  1.807702
sample estimates:
mean in group Business     mean in group News 
                   9.1                    8.9 


---- Business vs Sports ----

    Welch Two Sample t-test

data:  sub[[var]] by sub[[group]]
t = 3.0195, df = 33.401, p-value = 0.004824
alternative hypothesis: true difference in means between group Business and group Sports is not equal to 0
95 percent confidence interval:
 0.6856928 3.5143072
sample estimates:
mean in group Business   mean in group Sports 
                   9.1                    7.0 


---- News vs Sports ----

    Welch Two Sample t-test

data:  sub[[var]] by sub[[group]]
t = 2.125, df = 36.926, p-value = 0.04035
alternative hypothesis: true difference in means between group News and group Sports is not equal to 0
95 percent confidence interval:
 0.0881928 3.7118072
sample estimates:
  mean in group News mean in group Sports 
                 8.9                  7.0 
for (p in combn(levels(df$day), 2, simplify = FALSE)) ttest_pair(df, "inquiries", "day", p[1], p[2])

Interpretation. Welch t-tests on the section pairs: Business vs Sports is significant (t ≈ 3.02, p ≈ 0.005), News vs Sports is borderline (t ≈ 2.13, p ≈ 0.040), and Business vs News is not (p ≈ 0.80). Recall the Week 3 equivalence: for a 2-group comparison the t-test = lm = one-way ANOVA give the same p-value. But note these are uncorrected pairwise tests — running several inflates the false-positive rate, which is exactly why we move to the omnibus ANOVA + Tukey correction (§3.4, §3.6).

3.4 One-way ANOVA (each factor) + effect size

aov_section <- aov(inquiries ~ section, data = df); summary(aov_section)
            Df Sum Sq Mean Sq F value Pr(>F)  
section      2   53.7  26.867   4.235 0.0193 *
Residuals   57  361.6   6.344                 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
eta_squared(aov_section)
aov_day <- aov(inquiries ~ day, data = df); summary(aov_day)
            Df Sum Sq Mean Sq F value    Pr(>F)    
day          4  146.8   36.71   7.519 0.0000661 ***
Residuals   55  268.5    4.88                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
eta_squared(aov_day)

Interpretation. The F-test is between-group variance ÷ within-group variance (signal/noise). section: F(2,57) ≈ 4.24, p ≈ 0.019 (η² ≈ 0.13) and day: F(4,55) ≈ 7.52, p ≈ 6.6e-5 (η² ≈ 0.35) — both factors matter, and day is the stronger driver (over twice the variance explained). η² is the Week 3 effect-size analogue of R²: it tells you about magnitude, which the p-value alone does not.

3.5 Two-way ANOVA: main effects, then interaction

m_main <- aov(inquiries ~ section + day, data = df)
summary(m_main); AIC(m_main); eta_squared(m_main)
            Df Sum Sq Mean Sq F value    Pr(>F)    
section      2  53.73   26.87   6.630   0.00269 ** 
day          4 146.83   36.71   9.059 0.0000119 ***
Residuals   53 214.77    4.05                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
[1] 262.7851
m_int <- aov(inquiries ~ section * day, data = df)
summary(m_int);  AIC(m_int);  eta_squared(m_int)
            Df Sum Sq Mean Sq F value         Pr(>F)    
section      2  53.73   26.87  15.304 0.000008502752 ***
day          4 146.83   36.71  20.910 0.000000000852 ***
section:day  8 135.77   16.97   9.667 0.000000112469 ***
Residuals   45  79.00    1.76                           
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
[1] 218.7788

Interpretation — orthogonality in action. Compare the Sum Sq with §3.4: in the two-way model section still has Sum Sq ≈ 53.7 and day still ≈ 146.8identical to their one-way values (this is orthogonality: no shared variance to fight over). What changes is the error term, which shrinks from 361.6 (one-way section) to 214.8, so the F-values rise (section 4.24 → 6.63, day 7.52 → 9.06) and p-values drop. Each factor “soaks up noise” for the other → more precision. This is the precise lesson of the in-class paper-towels / store-design exercises.

Interpretation — the interaction (the big contrast with class). The section:day interaction is highly significant: F(8,45) ≈ 9.67, p ≈ 1.1e-7, with a large partial η² ≈ 0.63, and AIC drops sharply (262.8 → 218.8) when it is added. This is the opposite of the paper-towels and promotion exercises, where the interaction was not significant and the additive model won. Here the additive model is not adequate: the effect of section genuinely depends on day (Sports is strong on Friday but worst on Monday; News peaks on Friday and bottoms out on Thrusday). Practical consequence (Week 3): when the interaction is significant you should read the cell means, not the main effects in isolation — averaging over the other factor can mislead.

interaction.plot(df$day, df$section, df$inquiries,
                 xlab = "day", ylab = "mean inquiries", trace.label = "section")

Non-parallel/crossing lines = interaction.

Interpretation. Always plot the interaction (Week 3) — the interaction is the whole pattern, no single number shows it. Here the three section lines cross rather than run parallel, the visual signature of the significant interaction from §3.5. Parallel lines would have meant “no interaction” (the class examples); crossing lines mean the best section changes from day to day.

3.6 Post-hoc — Tukey HSD

TukeyHSD(aov_section)
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = inquiries ~ section, data = df)

$section
                diff       lwr         upr     p adj
News-Business   -0.2 -2.116672  1.71667235 0.9658594
Sports-Business -2.1 -4.016672 -0.18332765 0.0286002
Sports-News     -1.9 -3.816672  0.01667235 0.0525040
TukeyHSD(aov_day)
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = inquiries ~ day, data = df)

$day
                          diff         lwr        upr     p adj
Monday-Friday      -2.83333333 -5.37731936 -0.2893473 0.0218355
Thrusday-Friday    -4.91666667 -7.46065269 -2.3726806 0.0000118
Tuesday-Friday     -2.41666667 -4.96065269  0.1273194 0.0702094
Wednesday-Friday   -2.75000000 -5.29398603 -0.2060140 0.0279337
Thrusday-Monday    -2.08333333 -4.62731936  0.4606527 0.1573669
Tuesday-Monday      0.41666667 -2.12731936  2.9606527 0.9903820
Wednesday-Monday    0.08333333 -2.46065269  2.6273194 0.9999830
Tuesday-Thrusday    2.50000000 -0.04398603  5.0439860 0.0563243
Wednesday-Thrusday  2.16666667 -0.37731936  4.7106527 0.1301276
Wednesday-Tuesday  -0.33333333 -2.87731936  2.2106527 0.9959087
TukeyHSD(m_main)
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = inquiries ~ section + day, data = df)

$section
                diff       lwr        upr     p adj
News-Business   -0.2 -1.734936  1.3349362 0.9470957
Sports-Business -2.1 -3.634936 -0.5650638 0.0048701
Sports-News     -1.9 -3.434936 -0.3650638 0.0117434

$day
                          diff        lwr         upr     p adj
Monday-Friday      -2.83333333 -5.1540343 -0.51263240 0.0094385
Thrusday-Friday    -4.91666667 -7.2373676 -2.59596574 0.0000019
Tuesday-Friday     -2.41666667 -4.7373676 -0.09596574 0.0373566
Wednesday-Friday   -2.75000000 -5.0707009 -0.42929907 0.0125928
Thrusday-Monday    -2.08333333 -4.4040343  0.23736760 0.0981402
Tuesday-Monday      0.41666667 -1.9040343  2.73736760 0.9863219
Wednesday-Monday    0.08333333 -2.2373676  2.40403426 0.9999753
Tuesday-Thrusday    2.50000000  0.1792991  4.82070093 0.0287660
Wednesday-Thrusday  2.16666667 -0.1540343  4.48736760 0.0780657
Wednesday-Tuesday  -0.33333333 -2.6540343  1.98736760 0.9941367

Interpretation. Tukey HSD gives family-wise-corrected pairwise comparisons (the fix for the multiple-testing problem flagged in §3.3). For section, only Sports−Business survives (diff −2.1, p ≈ 0.029); Sports−News is now p ≈ 0.053 — just misses, even though its uncorrected t-test in §3.3 was p ≈ 0.040. That gap is the exact Week 3 point: one planned contrast vs many unplanned ones — correction makes the borderline result non-significant. For day, the big gaps from Friday are significant (e.g. Thrusday−Friday ≈ −4.9, p ≈ 1e-5). (Caveat: with a significant interaction, these main-effect contrasts are “averaged over” the other factor — interpret with the cell means in mind.)

3.7 Regression (lm) view + diagnostics

lm_main <- lm(inquiries ~ section + day, data = df); summary(lm_main)

Call:
lm(formula = inquiries ~ section + day, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.7500 -1.5667  0.1167  1.3542  4.1500 

Coefficients:
              Estimate Std. Error t value             Pr(>|t|)    
(Intercept)    11.6833     0.6876  16.992 < 0.0000000000000002 ***
sectionNews    -0.2000     0.6366  -0.314              0.75461    
sectionSports  -2.1000     0.6366  -3.299              0.00174 ** 
dayMonday      -2.8333     0.8218  -3.448              0.00112 ** 
dayThrusday    -4.9167     0.8218  -5.983          0.000000193 ***
dayTuesday     -2.4167     0.8218  -2.941              0.00485 ** 
dayWednesday   -2.7500     0.8218  -3.346              0.00151 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.013 on 53 degrees of freedom
Multiple R-squared:  0.4829,    Adjusted R-squared:  0.4244 
F-statistic: 8.249 on 6 and 53 DF,  p-value: 0.000002536
lm_int  <- lm(inquiries ~ section * day, data = df); summary(lm_int)

Call:
lm(formula = inquiries ~ section * day, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.2500 -0.8125 -0.1250  0.7500  2.7500 

Coefficients:
                           Estimate Std. Error t value             Pr(>|t|)    
(Intercept)                  9.0000     0.6625  13.585 < 0.0000000000000002 ***
sectionNews                  3.5000     0.9369   3.736             0.000525 ***
sectionSports                2.2500     0.9369   2.402             0.020518 *  
dayMonday                    2.5000     0.9369   2.668             0.010560 *  
dayThrusday                 -1.2500     0.9369  -1.334             0.188854    
dayTuesday                  -0.2500     0.9369  -0.267             0.790813    
dayWednesday                -0.5000     0.9369  -0.534             0.596192    
sectionNews:dayMonday       -6.7500     1.3250  -5.094         0.0000067127 ***
sectionSports:dayMonday     -9.2500     1.3250  -6.981         0.0000000109 ***
sectionNews:dayThrusday     -7.0000     1.3250  -5.283         0.0000035643 ***
sectionSports:dayThrusday   -4.0000     1.3250  -3.019             0.004168 ** 
sectionNews:dayTuesday      -2.0000     1.3250  -1.509             0.138170    
sectionSports:dayTuesday    -4.5000     1.3250  -3.396             0.001437 ** 
sectionNews:dayWednesday    -2.7500     1.3250  -2.076             0.043684 *  
sectionSports:dayWednesday  -4.0000     1.3250  -3.019             0.004168 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.325 on 45 degrees of freedom
Multiple R-squared:  0.8098,    Adjusted R-squared:  0.7506 
F-statistic: 13.68 on 14 and 45 DF,  p-value: 0.000000000007697
bptest(lm_main); bptest(lm_int)

    studentized Breusch-Pagan test

data:  lm_main
BP = 10.818, df = 6, p-value = 0.09418

    studentized Breusch-Pagan test

data:  lm_int
BP = 11.831, df = 14, p-value = 0.6199

Interpretation. aov() and lm() are the same model: ANOVA reports one F-test per factor, OLS one t-test per level vs the baseline (Business on Friday, intercept ≈ 11.7). The additive lm has Adjusted R² ≈ 0.42; adding the interaction lifts it to ≈ 0.75 — consistent with §3.5 and with stepAIC keeping the interaction. Breusch-Pagan is non-significant for both models (main p ≈ 0.094, interaction p ≈ 0.62), so homoscedasticity holds and the SEs are trustworthy — the Week 3 “can we trust the inference?” check passes.

3.8 Log-transformed outcome (optional remedy)

lm_log <- lm(log(inquiries) ~ section + day, data = df)
summary(lm_log); bptest(lm_log)

Call:
lm(formula = log(inquiries) ~ section + day, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.7204 -0.1916  0.0371  0.2023  0.4383 

Coefficients:
              Estimate Std. Error t value             Pr(>|t|)    
(Intercept)    2.50574    0.09410  26.627 < 0.0000000000000002 ***
sectionNews   -0.07610    0.08712  -0.874             0.386320    
sectionSports -0.30771    0.08712  -3.532             0.000864 ***
dayMonday     -0.37905    0.11248  -3.370             0.001408 ** 
dayThrusday   -0.62859    0.11248  -5.589          0.000000811 ***
dayTuesday    -0.26660    0.11248  -2.370             0.021444 *  
dayWednesday  -0.29936    0.11248  -2.662             0.010274 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2755 on 53 degrees of freedom
Multiple R-squared:  0.4637,    Adjusted R-squared:  0.403 
F-statistic: 7.638 on 6 and 53 DF,  p-value: 0.000006185

    studentized Breusch-Pagan test

data:  lm_log
BP = 7.4916, df = 6, p-value = 0.2778

Interpretation. Logging the outcome is a general Week 3 remedy for skew/ heteroscedasticity — but here it is unnecessary (residuals were already well-behaved, §1.1/§1.5). Two class caveats apply: coefficients now read as approximate % effects, and the AIC of a log-outcome model is NOT comparable to the level model (different outcome scale) — judge transforms on residual diagnostics, not AIC.

Week 3 function map: table()/replications() → balance & orthogonality · boxplot() → distributions · t.test() → AB test · aov() → F per factor · lm() → t per level · interaction.plot() → interactions · TukeyHSD() → corrected pairwise · effectsize::eta_squared() → effect size · bptest()/qqnorm() → assumptions · AIC() → model comparison (same outcome only).

Week 3 takeaway for this dataset. Both factors significantly affect inquiries, day more than section, and there is a strong section×day interaction — so the design’s story is in the cell means: the best combinations are News on Friday (≈12.5) and Business on Monday (≈11.5); the worst are News on Thrusday (≈4.25) and Sports on Monday (≈4.5). Balance + orthogonality kept the estimates clean, and the assumption checks mean we can trust the inference. The methodological headline vs the class examples: here the interaction matters, so don’t stop at the main effects.


WEEK 4 — Matching, propensity scores & IPWE · TEMPLATE

Week 4 estimates causal effects from observational data via stratification, exact matching, and propensity-score weighting (IPWE). Class datasets: Titanic (in-class), job training RCT vs CPS (in-class), catholic schools (tutorial).

Why template, and the connection. Every Week 4 method needs a binary treatment D and covariatesnewspaper_ads.csv has neither, so there is literally nothing to build a propensity score on (ps = P(D=1|X) requires a D). Conceptually this dataset sits at the top of the method pyramid (a clean designed experiment), whereas Week 4 lives at the bottom (messy observational data). The code below is the exact class workflow; paste in a real treat/outcome/covariates to use it.

Estimands & IPWE weights (non-normalised):

  • τ_ATE = π·τ_ATT + (1−π)·τ_ATU; SMD = ATE + pre-treatment het. + treatment-effect het.
Estimand Treated (D=1) Control (D=0)
ATE 1/ps 1/(1-ps)
ATT 1 ps/(1-ps)

4.1 Naïve SMD vs regression adjustment

naive <- lm(outcome ~ D, data = df); summary(naive)               # simple mean difference
adj   <- lm(outcome ~ D + x1 + x2 + x3, data = df); summary(adj)  # regression adjustment

Interpretation (what it would show). A large gap between naive and adj is the signature of confounding (Titanic: 0.354 → 0.224 once male is added; job training: −$8497 → +$1066). If the two barely differ — as in the Week 3 WallMonkeys experiment — that itself is evidence randomisation worked (controls change precision, not the estimate).

4.2 Balance & common support (the testable assumption)

library(magrittr)
table(df$D)                                                      # treatment-level balance
df %>% group_by(D, x_categorical) %>%                           # covariate balance (proportions)
  summarise(cnt = n()) %>% mutate(prop = round(cnt / sum(cnt), 3))
df %>% mutate(grp = ifelse(D == 1, "Treatment", "Control")) %>% # common support (overlap)
  ggplot(aes(x = x_continuous, group = grp, colour = grp, fill = grp)) +
  geom_density(alpha = 0.2) + theme(legend.position = "bottom")

Interpretation (what it would show). Common support / overlap is the one assumption we can test (ignorability we cannot). Where the two densities overlap is the only region with “doppelgangers” to match; little overlap (job training re75) → positivity is threatened and we must trim.

4.3 Propensity score (logistic nuisance) + stepAIC

glm.ps <- glm(D ~ x1 + x2 + x3, data = df, family = binomial(link = "logit"))
selected.ps <- stepAIC(glm.ps, direction = "both")   # guide only — keep true confounders
df$ps <- predict(selected.ps, type = "response")
df.trim <- df %>% filter(ps >= 0.1 & ps <= 0.9)      # trim extreme PS (explosive weights)

Interpretation (what it would show). The PS collapses all covariates into one number (curse-of-dimensionality fix). Goal is overlap, not predictive accuracy — don’t drop a genuine confounder just because it predicts treatment poorly. Trimming matters because extreme PS create explosive inverse weights (the job-training −348 blow-up).

4.4 Manual IPWE — ATE and ATT (non-normalised)

library(magrittr)
df.trim %>% mutate(y1 = D*outcome/ps, y0 = (1-D)*outcome/(1-ps), ate = y1 - y0) %>%
  pull(ate) %>% mean()                                           # ATE: 1/ps , 1/(1-ps)

n_treated <- sum(df.trim$D)
df.trim %>% mutate(y1 = D*outcome, y0 = (1-D)*outcome*ps/(1-ps), att = y1 - y0) %>%
  pull(att) %>% sum() / n_treated                                # ATT: 1 , ps/(1-ps)

Interpretation (what it would show). Everything is a weighted mean: each unit is weighted by 1/(prob of the status it actually got). ATT often differs from ATE (treatment-effect heterogeneity) — Titanic ATT 0.237 > ATE 0.190; job training ATT ≈ $1943 > ATE ≈ $1637.

4.5 Normalised IPWE weights (Hirano–Imbens)

N <- nrow(df.trim)
df.trim %<>% mutate(d1 = D/ps, d0 = (1-D)/(1-ps))
adj1 <- sum(df.trim$d1)/N; adj0 <- sum(df.trim$d0)/N
df.trim %>% mutate(y1 = (D*outcome/ps)/adj1, y0 = ((1-D)*outcome/(1-ps))/adj0,
                   ate.norm = y1 - y0) %>% pull(ate.norm) %>% mean()

Interpretation (what it would show). Raw inverse weights don’t sum to N; rescaling each group by its adjustment factor = (Σ inverse weights)/N tames extreme weights and stabilises the estimate (the Week 4 normalisation step; PSweight’s default).

4.6 PSweight — IPWE + love plots + bootstrap SE

library(PSweight)
ps.formula <- D ~ x1 + x2 + x3
bal <- SumStat(ps.formula = ps.formula, data = df.trim, weight = c("IPW"))
plot(bal, type = "balance"); plot(bal, type = "hist")           # love plot: aim |d| < 0.1
summary(PSweight(ps.formula, yname = "outcome", data = df.trim, weight = "IPW"))      # ATE
summary(PSweight(ps.formula, yname = "outcome", data = df.trim, weight = "treated"))  # ATT
summary(PSweight(ps.formula, yname = "outcome", bootstrap = TRUE, R = 100,
                 data = df.trim, weight = "IPW"))               # bootstrap SE

Interpretation (what it would show). The love plot is how you choose a PS model — the spec that balances all covariates (|standardised mean diff| < 0.1) is the one to trust, not the one closest to any benchmark (you never see the RCT in real life). Use the bootstrap SE when the analytical SE warns.

4.7 Stratification + bootstrap; exact-match feasibility

library(magrittr); library(boot)
df %<>% mutate(subclass = case_when(x == 1 ~ 1, x == 0 ~ 2, TRUE ~ 0))
strata.ate <- function(d) {
  ey11 <- d %>% filter(subclass==1 & D==1) %$% mean(outcome)
  ey10 <- d %>% filter(subclass==1 & D==0) %$% mean(outcome)
  ey21 <- d %>% filter(subclass==2 & D==1) %$% mean(outcome)
  ey20 <- d %>% filter(subclass==2 & D==0) %$% mean(outcome)
  wt1  <- d %>% filter(subclass==1) %$% (nrow(.)/nrow(d))       # ATE weight = stratum share
  wt2  <- d %>% filter(subclass==2) %$% (nrow(.)/nrow(d))
  (ey11-ey10)*wt1 + (ey21-ey20)*wt2
}
strata.ate(df)
boot(df, function(data, i) strata.ate(data[i, ]), R = 100)      # bootstrap SE

chk <- df %>% group_by(x1, x2, x3) %>% summarise(n = n(), .groups = "keep") %>%
  ungroup() %>% complete(x1, x2, x3) %>% as.data.frame()
any(is.na(chk))                                                  # TRUE => exact match infeasible

Interpretation (what it would show). Stratification = post-hoc blocking: compare within confounder strata, weight by stratum share (ATE) or treated share (ATT). any(is.na()) TRUE means some covariate combination has no counterfactual → exact matching fails (curse of dimensionality) → use the propensity score instead. Hand-built estimators have no closed-form SE, so we bootstrap.

Week 4 function map: lm() (SMD/adjustment) · group_by + summarise(prop) & geom_density() (balance/overlap) · glm(..., family=binomial) + stepAIC() (PS) · manual IPWE with the weight table · normalising adjustment factor · PSweight(weight="IPW"/"treated") + SumStat() love plots · case_when()/%$% (stratification) · boot() (SE) · complete() + any(is.na()) (exact-match feasibility).


Conceptual / DAG / formula MCQs → MIDTERM_CHEATSHEET.md. The companion .R script prints the Week-1 & Week-3 numbers under searchable banners for fast console lookup.