This lab session focuses on detecting and modeling nonlinear relationships in regression analysis using the California Test Score dataset. We’ll explore two main groups of methods:
1. When the effect of X₁ on Y depends on the value of X₁ itself.
2. When the effect of X₁ on Y depends on another variable X₂.
We begin with the first group: nonlinear relationships where the effect of a predictor depends on its own value.
Let’s begin by examining the relationship between test scores (Y) and district average income (X) with a simple linear regression:
View → Graph Specified Vars → X-Y Scatter
Observation: The linear fit doesn’t capture the apparent curvature in the data - most points are below the line at very low and very high incomes, but above the line in the middle range.
To model this curvature, we’ll estimate a quadratic regression: TestScoreᵢ = β₀ + β₁Incomeᵢ + β₂Incomeᵢ² + uᵢ
avginc → Add → square of selected variables
Model → ordinary least squares → testscr avginc avginc_sq
Estimated equation: TestScore = 607.3 + 3.85 Income - 0.0423 Income²
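For readers who want to see what the menu steps do under the hood, here is a minimal Python sketch of the same idea: build a squared regressor and run OLS. It does not use the real dataset. The data are synthetic, generated around the estimated equation above with made-up noise, and the helper `ols` is a hypothetical name introduced only for illustration.

```python
import random


def ols(X, y):
    """Solve the normal equations (X'X) b = X'y by Gauss-Jordan elimination."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [
        [sum(row[i] * row[j] for row in X) for j in range(k)]
        + [sum(row[i] * yi for row, yi in zip(X, y))]
        for i in range(k)
    ]
    for col in range(k):
        # Partial pivoting for numerical stability.
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]


# Synthetic stand-in for the real data (assumption): income in $1000s,
# scores scattered around the estimated quadratic with unit-variance noise.
random.seed(0)
income = [5 + 50 * i / 199 for i in range(200)]
testscr = [607.3 + 3.85 * x - 0.0423 * x**2 + random.gauss(0, 1) for x in income]

# Regressors: constant, income, and the squared term (avginc_sq in the lab).
X = [[1.0, x, x**2] for x in income]
b0, b1, b2 = ols(X, testscr)
print(f"const={b0:.2f}  avginc={b1:.3f}  avginc_sq={b2:.4f}")
```

Because the data are simulated from the estimated equation, the recovered coefficients should land close to 607.3, 3.85, and -0.0423; with the real data, the estimates come from gretl as shown above.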
We can test whether the quadratic term is needed:
H₀: β₂ = 0 against H₁: β₂ ≠ 0
The large t-statistic (-8.81) and small p-value (<0.01%) lead us to reject the null hypothesis, so the quadratic term significantly improves on the linear model.
The effect of a change in income depends on the initial income level:
Increase from $10K to $11K: ΔTestScore = [607.3 + 3.85×11 - 0.0423×11²] - [607.3 + 3.85×10 - 0.0423×10²] = 2.96 points
Increase from $40K to $41K: ΔTestScore = [607.3 + 3.85×41 - 0.0423×41²] - [607.3 + 3.85×40 - 0.0423×40²] = 0.42 points
Key Insight: A $1,000 increase in district income has a larger predicted effect on test scores in poorer districts than in wealthy ones.
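The two calculations above can be reproduced with a short Python sketch. The coefficients are the estimates reported in this lab; the function names are hypothetical and chosen only for illustration.

```python
def predicted_score(income, b0=607.3, b1=3.85, b2=-0.0423):
    """Predicted test score from the estimated quadratic; income in $1000s."""
    return b0 + b1 * income + b2 * income**2


def effect_of_change(start, end):
    """Predicted change in test score when income moves from start to end."""
    return predicted_score(end) - predicted_score(start)


low = effect_of_change(10, 11)   # poorer district
high = effect_of_change(40, 41)  # wealthier district
print(f"10 -> 11: {low:.2f} points")   # 2.96
print(f"40 -> 41: {high:.2f} points")  # 0.42
```

Note that the intercept cancels in the difference, so the predicted change depends only on the income coefficients and the starting income level.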
For more flexibility, we can estimate a cubic model: TestScoreᵢ = β₀ + β₁Incomeᵢ + β₂Incomeᵢ² + β₃Incomeᵢ³ + uᵢ
Add → Define new variable → avginc_cube = avginc^3
Estimated equation: TestScore = 600.1 + 5.02 Income - 0.096 Income² + 0.00069 Income³
H₀: β₃ = 0 against H₁: β₃ ≠ 0
Logarithmic transformations offer another way to model nonlinearity. There are three cases:
Linear-log: Y = β1 + β2ln(X) + u
Interpretation: a 1% increase in X → a 0.01β2 unit change in Y
Log-linear: ln(Y) = β1 + β2X + u
Interpretation: a 1-unit increase in X → a 100β2% change in Y
Log-log: ln(Y) = β1 + β2ln(X) + u
Interpretation: a 1% increase in X → a β2% change in Y (elasticity)
Case | Specification | Interpretation of β2 |
---|---|---|
Linear-log | Y = β1 + β2ln(X) | 1% ΔX → 0.01β2 ΔY |
Log-linear | ln(Y) = β1 + β2X | 1-unit ΔX → 100β2% ΔY |
Log-log | ln(Y) = β1 + β2ln(X) | 1% ΔX → β2% ΔY (elasticity) |
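
The interpretations in the table are approximations that work well for small changes. The Python sketch below compares each rule of thumb with the exact change implied by the model. The slope value 0.05 is a hypothetical number chosen for illustration, not an estimate from the dataset, and the function names are likewise illustrative.

```python
import math


def linear_log_change(beta2, pct_change_x):
    """Exact change in Y for Y = b1 + b2*ln(X) when X rises by pct_change_x %."""
    return beta2 * math.log(1 + pct_change_x / 100)


def log_linear_pct_change(beta2, delta_x=1.0):
    """Exact % change in Y for ln(Y) = b1 + b2*X when X rises by delta_x units."""
    return 100 * (math.exp(beta2 * delta_x) - 1)


def log_log_pct_change(beta2, pct_change_x):
    """Exact % change in Y for ln(Y) = b1 + b2*ln(X) when X rises by pct_change_x %."""
    return 100 * ((1 + pct_change_x / 100) ** beta2 - 1)


beta2 = 0.05  # hypothetical slope, used for all three cases

print("linear-log:", linear_log_change(beta2, 1), "vs rule", 0.01 * beta2)
print("log-linear:", log_linear_pct_change(beta2), "vs rule", 100 * beta2)
print("log-log:   ", log_log_pct_change(beta2, 1), "vs rule", beta2)
```

The gap between the exact value and the rule of thumb grows with the size of the change, which is why the table's interpretations are usually stated for a 1% or 1-unit change.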
We’ve explored several approaches to modeling when the effect of X₁ on Y depends on X₁’s value:
- Polynomial regressions (quadratic, cubic)
- Logarithmic transformations

All methods showed that the test score-income relationship is indeed nonlinear, with diminishing returns to higher income.
We now turn to the second group of methods for modeling nonlinear relationships: cases where the effect of one independent variable (X₁) depends on the value of another variable (X₂). These are called interaction effects and are crucial for understanding how relationships change across different subgroups or conditions.
Let’s create binary variables for our dataset:
- DSTR: 1 if student-teacher ratio ≥ 20, 0 otherwise
- DEL: 1 if % English learners ≥ 10%, 0 otherwise
Estimated equation: TestScore = 664.1 - 1.9 DSTR - 18.2 DEL - 3.5(DSTR × DEL)
The effect of high STR depends on English learner status:
- For low EL districts (DEL=0): Effect = -1.9 points
- For high EL districts (DEL=1): Effect = -1.9 - 3.5 = -5.4 points
The t-statistic (-3.5/3.1 ≈ -1.13) suggests the interaction may not be statistically significant at conventional levels (p ≈ 0.26).
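The binary interaction results can be checked with a short Python sketch that tabulates the predicted score for each of the four groups and recomputes the t-statistic. The coefficients and the standard error (3.1) are the values reported above. The p-value uses a large-sample normal approximation, which is an assumption of this sketch, and the names are hypothetical.

```python
import math


def predicted_score(d_str, d_el, b0=664.1, b_str=-1.9, b_el=-18.2, b_int=-3.5):
    """Predicted test score for a district with dummies DSTR and DEL."""
    return b0 + b_str * d_str + b_el * d_el + b_int * d_str * d_el


# Predicted score for each (DSTR, DEL) combination.
groups = {(s, e): predicted_score(s, e) for s in (0, 1) for e in (0, 1)}
for (s, e), score in groups.items():
    print(f"DSTR={s}, DEL={e}: {score:.1f}")

# Effect of a high student-teacher ratio within each EL group.
effect_low_el = groups[(1, 0)] - groups[(0, 0)]
effect_high_el = groups[(1, 1)] - groups[(0, 1)]

# t-statistic on the interaction and a two-sided normal-approximation p-value.
t_stat = -3.5 / 3.1
p_value = math.erfc(abs(t_stat) / math.sqrt(2))
print(f"t = {t_stat:.2f}, p = {p_value:.2f}")
```

The difference between the two effects equals the interaction coefficient, which is exactly what the t-test on the interaction term evaluates.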
When both variables are continuous, we interact STR with PctEL directly:
TestScore = β1 + β2STR + β3PctEL + β4(STR × PctEL) + u
Estimated equation: TestScore = 686.3 - 1.12 STR - 0.67 PctEL + 0.0012(STR × PctEL)
The marginal effect of STR is: ∂TestScore/∂STR = -1.12 + 0.0012 PctEL
At the median PctEL (8.85): Effect = -1.12 + 0.0012×8.85 ≈ -1.11
At the 75th percentile of PctEL (23.0): Effect = -1.12 + 0.0012×23.0 ≈ -1.09
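The marginal effect formula can be evaluated at any value of PctEL. The Python sketch below uses the estimated coefficients reported above; the function name is hypothetical.

```python
def marginal_effect_of_str(pct_el, b_str=-1.12, b_int=0.0012):
    """Change in predicted test score per one-unit increase in STR."""
    return b_str + b_int * pct_el


at_median = marginal_effect_of_str(8.85)
at_p75 = marginal_effect_of_str(23.0)
print(f"median PctEL: {at_median:.2f}")      # -1.11
print(f"75th percentile PctEL: {at_p75:.2f}")  # -1.09

# PctEL at which the estimated effect of STR would reach zero.
zero_crossing = 1.12 / 0.0012
print(f"effect reaches zero at PctEL = {zero_crossing:.0f}")
```

Since PctEL is a percentage and cannot exceed 100, the zero crossing lies far outside the feasible range, so the estimated effect of STR stays negative for every district and varies only slightly with PctEL.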
Interaction effects allow us to model how relationships change across different contexts. Key takeaways:
- Binary × binary interactions create four distinct groups
- Continuous × binary interactions allow different slopes
- Continuous × continuous interactions make effects depend on values
- Always include main effects when adding interactions
- Visualizations are crucial for interpretation
Next Steps: Students should practice with different interaction specifications and test whether the effects they find are statistically and substantively significant.