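Nonlinear Regression, Logistic Regression, Machine-Learning Thinking, Loops and Functions, DID and RDD
Today's slides set out to answer four questions:
Today we are really looking at three capabilities:
Very often, the relationship between variables is not a straight line.
For example:
So we need more flexible tools: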
model_linear <- lm(y ~ x, data = df_nl)
model_quad <- lm(y ~ x + I(x^2), data = df_nl)
pred_grid <- tibble(
x = seq(min(df_nl$x), max(df_nl$x), length.out = 200)
)
pred_grid$linear <- predict(model_linear, newdata = pred_grid)
pred_grid$quadratic <- predict(model_quad, newdata = pred_grid)
p1 <- ggplot(df_nl, aes(x, y)) +
geom_point(alpha = 0.5) +
geom_line(data = pred_grid, aes(y = linear), linewidth = 1.2) +
labs(
title = "Linear Regression",
subtitle = "A linear model can only fit a straight line",
x = "X",
y = "Y"
)
p2 <- ggplot(df_nl, aes(x, y)) +
geom_point(alpha = 0.5) +
geom_line(data = pred_grid, aes(y = quadratic), linewidth = 1.2) +
labs(
title = "Nonlinear Regression with a Quadratic Term",
subtitle = "The quadratic term allows the marginal effect to change with X",
x = "X",
y = "Y"
)
p1 + p2
The model can be written as:
\[ Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \varepsilon \]
Call:
lm(formula = y ~ x + I(x^2), data = df_nl)
Residuals:
Min 1Q Median 3Q Max
-5.0826 -0.9292 -0.0615 0.8473 4.3134
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.28805 0.25587 5.034 8.34e-07 ***
x 2.09242 0.11821 17.701 < 2e-16 ***
I(x^2) -0.20188 0.01144 -17.642 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.487 on 297 degrees of freedom
Multiple R-squared: 0.5166, Adjusted R-squared: 0.5134
F-statistic: 158.7 on 2 and 297 DF, p-value: < 2.2e-16
Its key implication is:
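With a quadratic term, the marginal effect of X is no longer a single number: differentiating the model gives dY/dX = β1 + 2·β2·X. The sketch below plugs in the two coefficients reported in the output above. The x values at which it is evaluated are illustrative choices, not part of the original slides.

```r
# Coefficients copied from the regression output above
b1 <- 2.09242   # coefficient on x
b2 <- -0.20188  # coefficient on I(x^2)

# Marginal effect of x under the quadratic model: dY/dX = b1 + 2 * b2 * x
marginal_effect <- function(x) b1 + 2 * b2 * x

# Evaluate at a few illustrative x values (chosen here, not from the slides)
x_vals <- c(0, 2, 5, 8)
data.frame(x = x_vals, marginal_effect = marginal_effect(x_vals))

# The fitted curve turns where the marginal effect is zero
turning_point <- -b1 / (2 * b2)
turning_point
```

Because the coefficient on the squared term is negative, the fitted curve rises, flattens, and then falls. With these estimates the turning point is at roughly x = 5.18.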
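When the outcome variable has only two categories, for example:
Ordinary linear regression runs into two problems:
So logistic regression is usually the better fit for this kind of problem.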
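One well-known difficulty with fitting a straight line to a 0/1 outcome is that the fitted values are not constrained to lie between 0 and 1. The sketch below shows this on a small toy data set. The data are hypothetical and constructed only for illustration.

```r
# Hypothetical toy data: a binary outcome that switches from 0 to 1 at x = 0
x <- seq(-3, 3, length.out = 200)
y <- as.integer(x > 0)

# A straight-line fit to a 0/1 outcome (a linear probability model)
lpm <- lm(y ~ x)

# Fitted "probabilities" at the ends of the x range fall outside [0, 1]
range(fitted(lpm))
```

A fitted value below 0 or above 1 cannot be read as a probability, which is one reason to use a model whose predictions are bounded by construction.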
logit_model <- glm(y ~ x, data = df_logit, family = binomial)
grid_logit <- tibble(
x = seq(min(df_logit$x), max(df_logit$x), length.out = 300)
)
grid_logit$prob_hat <- predict(logit_model, newdata = grid_logit, type = "response")
ggplot(df_logit, aes(x, y)) +
geom_jitter(height = 0.05, width = 0, alpha = 0.25) +
geom_line(data = grid_logit, aes(y = prob_hat), linewidth = 1.5) +
labs(
title = "Logistic Regression",
subtitle = "The fitted curve gives predicted probabilities between 0 and 1",
x = "X",
y = "Observed Y and Predicted Probability"
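)
\[ P(Y=1 \mid X) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X)}} \]
One way to read this:
So logistic regression is essentially:
a linear predictor + a nonlinear probability mapping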
Call:
glm(formula = y ~ x, family = binomial, data = df_logit)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.8290 0.1471 -5.634 1.76e-08 ***
x 2.2866 0.2323 9.844 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 526.09 on 399 degrees of freedom
Residual deviance: 323.65 on 398 degrees of freedom
AIC: 327.65
Number of Fisher Scoring iterations: 5
The most common way to write it is:
If a variable's coefficient is positive, it usually means:
But note:
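To make the interpretation concrete, the sketch below takes the two coefficients from the logistic regression output above and converts them into an odds ratio and into predicted probabilities. The x values 0, 1, and 2 are illustrative choices. `plogis()` is base R's logistic function.

```r
# Coefficients copied from the logistic regression output above
b0 <- -0.8290  # intercept
b1 <- 2.2866   # coefficient on x

# exp(b1) is the odds ratio: the multiplicative change in the odds of Y = 1
# for a one-unit increase in x
odds_ratio <- exp(b1)

# plogis() is the logistic function 1 / (1 + exp(-z)); it maps the linear
# predictor to a probability
p_at_0 <- plogis(b0 + b1 * 0)
p_at_1 <- plogis(b0 + b1 * 1)
p_at_2 <- plogis(b0 + b1 * 2)

# The change in probability depends on the starting point
c(odds_ratio  = odds_ratio,
  p_at_0      = p_at_0,
  p_at_1      = p_at_1,
  diff_0_to_1 = p_at_1 - p_at_0,
  diff_1_to_2 = p_at_2 - p_at_1)
```

A positive coefficient means the predicted probability rises with x, but the coefficient itself is a change in log-odds, not a change in probability. The same one-unit step in x moves the probability by different amounts depending on where it starts.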
Machine learning can be summed up in one sentence to start with:
Let the model learn patterns from data, then use them to predict.
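"Learn from data, then predict" is usually checked by holding some data back: fit on a training set and measure the error on a test set the model has not seen. The sketch below is a minimal illustration on simulated data. The data-generating process, the 140/60 split, the seed, and the names `sim`, `fit_lin`, `fit_poly`, and `mse` are all assumptions made here, not taken from the slides.

```r
set.seed(42)  # arbitrary seed, chosen only for reproducibility

# Hypothetical data: a sine curve plus noise
n <- 200
sim <- data.frame(x = runif(n, 0, 10))
sim$y <- sin(sim$x) + rnorm(n, 0, 0.25)

# Hold out 60 of the 200 rows (30%) as a test set
test_idx <- sample(n, size = 60)
train <- sim[-test_idx, ]
test  <- sim[test_idx, ]

# Learn from the training set only
fit_lin  <- lm(y ~ x, data = train)
fit_poly <- lm(y ~ poly(x, 10), data = train)

# Mean squared error of a fitted model on a given data set
mse <- function(fit, data) mean((data$y - predict(fit, newdata = data))^2)

# Compare in-sample and out-of-sample error
data.frame(
  model     = c("linear", "degree-10 polynomial"),
  train_mse = c(mse(fit_lin, train), mse(fit_poly, train)),
  test_mse  = c(mse(fit_lin, test),  mse(fit_poly, test))
)
```

The more flexible model can never have a higher training error than the linear one, because the linear fit is a special case of it. The test error is what shows whether the extra flexibility actually helps on new data.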
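A few of the most basic concepts:
Traditional regression usually cares more about:
Machine learning usually cares more about:
The idea behind a decision tree is intuitive:
Its advantages are: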
set.seed(1)
df_tree <- tibble(
x1 = rnorm(300),
x2 = rnorm(300)
) |>
mutate(y = ifelse(x1 + 0.8 * x2 + rnorm(300, 0, 0.6) > 0, 1, 0))
tree_model <- rpart(as.factor(y) ~ x1 + x2, data = df_tree, method = "class")
rpart.plot(tree_model, type = 2, extra = 104, under = TRUE, fallen.leaves = TRUE)
title("Decision Tree for Classification")
x_over <- seq(0, 10, length.out = 60)
y_over <- sin(x_over) + rnorm(60, 0, 0.25)
df_over <- tibble(x = x_over, y = y_over)
fit_simple <- lm(y ~ x, data = df_over)
fit_flex <- lm(y ~ poly(x, 10), data = df_over)
grid_over <- tibble(x = seq(0, 10, length.out = 300))
grid_over$simple <- predict(fit_simple, newdata = grid_over)
grid_over$flexible <- predict(fit_flex, newdata = grid_over)
p_over1 <- ggplot(df_over, aes(x, y)) +
geom_point() +
geom_line(data = grid_over, aes(y = simple), linewidth = 1.2) +
labs(
title = "A Simpler Model",
subtitle = "It may underfit, but it is often more stable",
x = "X",
y = "Y"
)
p_over2 <- ggplot(df_over, aes(x, y)) +
geom_point() +
geom_line(data = grid_over, aes(y = flexible), linewidth = 1.2) +
labs(
title = "An Overly Flexible Model",
subtitle = "It may learn noise as if it were signal: overfitting",
x = "X",
y = "Y"
)
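p_over1 + p_over2
In R, many tasks are not done just once; they have to be repeated many times.
For example:
Writing out every step by hand would be very inefficient.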
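A minimal `for` loop, sketched here as one way to produce one printed line per iteration. The loop variable `i` and the range `1:5` are illustrative choices.

```r
# Repeat the same action for i = 1, 2, ..., 5
for (i in 1:5) {
  print(paste("Current i =", i))
}
```

The body between the braces is written once. R runs it once for each value in `1:5`.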
[1] "Current i = 1"
[1] "Current i = 2"
[1] "Current i = 3"
[1] "Current i = 4"
[1] "Current i = 5"
The core idea of a loop is:
run a repeated operation automatically, many times
set.seed(123)
df_loop <- tibble(
y = rnorm(200),
x1 = rnorm(200),
x2 = rnorm(200),
x3 = rnorm(200)
)
vars <- c("x1", "x2", "x3")
results <- c()
for (v in vars) {
formula_now <- as.formula(paste("y ~", v))
model_now <- lm(formula_now, data = df_loop)
results[v] <- coef(model_now)[2]
}
results
x1 x2 x3
-0.02623691 -0.02955428 -0.04271895
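Loops only solve the problem of "repeated execution"; functions go a step further and let you:
In one sentence:
A function is a tool you build yourself.
A function generally has three parts: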
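As a sketch of what such a tool might look like, the function below wraps the "regress y on one variable and return the slope" step so that it can be reused. The names `get_slope` and `df_demo` are hypothetical, and the data are simulated here for illustration.

```r
set.seed(123)  # simulated data: one outcome and three unrelated predictors
df_demo <- data.frame(
  y  = rnorm(200),
  x1 = rnorm(200),
  x2 = rnorm(200),
  x3 = rnorm(200)
)

# Here the function has a name (get_slope), arguments (var_name, data),
# and a body whose last expression is the return value
get_slope <- function(var_name, data) {
  formula_now <- as.formula(paste("y ~", var_name))  # build y ~ x1, etc.
  model_now <- lm(formula_now, data = data)
  unname(coef(model_now)[2])                         # slope on the predictor
}

# Call it once ...
get_slope("x1", df_demo)

# ... or apply it to every variable name; sapply() names the result by variable
slopes <- sapply(c("x1", "x2", "x3"), get_slope, data = df_demo)
slopes
```

Once the logic lives in a function, the loop body shrinks to a single call, and the same function can be reused on another data set by changing the `data` argument.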
x1 x2 x3
-0.02623691 -0.02955428 -0.04271895
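It can also be written more concisely:
You can already:
This skill also matters a great deal later on, in empirical research and in machine learning.
The models so far help us:
But a harder question is:
Does a change in X really cause a change in Y?
This takes us into the field of causal inference.
DID = Difference-in-Differences
Typical setting:
The core idea is:
If, absent the policy, the two groups would have followed similar trends (the parallel-trends assumption),
then the extra difference that appears after the policy can be interpreted as the policy effect.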
df_did <- expand_grid(
treat = c(0, 1),
post = c(0, 1),
id = 1:100
) |>
mutate(
y = 50 +
3 * treat +
2 * post +
6 * (treat * post) +
rnorm(n(), 0, 3)
)
summary_did <- df_did |>
group_by(treat, post) |>
summarise(mean_y = mean(y), .groups = "drop") |>
mutate(
group = ifelse(treat == 1, "Treatment", "Control"),
time = ifelse(post == 1, "After", "Before")
)
ggplot(summary_did, aes(x = time, y = mean_y, group = group, color = group)) +
geom_line(linewidth = 1.5) +
geom_point(size = 3) +
labs(
title = "Difference-in-Differences",
subtitle = "The extra increase in the treatment group after the policy is the DID effect",
x = NULL,
y = "Average Outcome",
color = NULL
) +
theme(legend.position = "top")
Call:
lm(formula = y ~ treat * post, data = df_did)
Residuals:
Min 1Q Median 3Q Max
-8.0079 -2.1259 -0.0806 1.7191 9.6097
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 50.2808 0.3085 162.971 < 2e-16 ***
treat 3.0791 0.4363 7.057 7.67e-12 ***
post 1.6617 0.4363 3.808 0.000162 ***
treat:post 5.9210 0.6171 9.596 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.085 on 396 degrees of freedom
Multiple R-squared: 0.6386, Adjusted R-squared: 0.6359
F-statistic: 233.2 on 3 and 396 DF, p-value: < 2.2e-16
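Focus on three terms:
- treat: the pre-existing difference between the treatment and control groups
- post: the overall change after the policy, shared by both groups
- treat:post: the DID estimate, which is the key policy effect
\[ Y = \beta_0 + \beta_1 Treat + \beta_2 Post + \beta_3 (Treat \times Post) + \varepsilon \]
where:
\(\beta_3\) is the core causal-effect estimate in DID.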
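The interaction coefficient is the "difference of two differences" in the four group means. The sketch below checks this on simulated data. It mirrors the 50 / 3 / 2 / 6 structure used earlier but draws its own sample. The seed and the names `sim_did`, `m`, `did_by_hand`, and `did_by_reg` are assumptions introduced here.

```r
set.seed(2024)  # arbitrary seed; simulated data, not the slides' df_did

# Hypothetical 2x2 design: 100 observations per treat/post cell
sim_did <- expand.grid(id = 1:100, treat = c(0, 1), post = c(0, 1))
sim_did$y <- 50 + 3 * sim_did$treat + 2 * sim_did$post +
  6 * sim_did$treat * sim_did$post + rnorm(nrow(sim_did), 0, 3)

# Cell means: rows are treat ("0", "1"), columns are post ("0", "1")
m <- tapply(sim_did$y, list(sim_did$treat, sim_did$post), mean)

# Difference-in-differences computed by hand:
# (treated after - treated before) - (control after - control before)
did_by_hand <- (m["1", "1"] - m["1", "0"]) - (m["0", "1"] - m["0", "0"])

# The same number from the interaction term of the regression
did_model <- lm(y ~ treat * post, data = sim_did)
did_by_reg <- coef(did_model)[["treat:post"]]

c(by_hand = did_by_hand, by_regression = did_by_reg)
```

In this fully interacted 2x2 setup the two numbers agree up to floating-point error. The regression is a convenient way to get the same quantity together with a standard error.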
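RDD = Regression Discontinuity Design
Typical setting:
The core idea is:
Near the cutoff, units just to the left and just to the right are usually very similar;
so if the outcome jumps abruptly at the cutoff, the jump is likely to come from the rule or policy.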
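A minimal simulated sketch of this idea follows. The cutoff of 60, the true jump of 5, the sample size, the seed, and the names `sim_rdd`, `cut_point`, `rdd_fit`, and `raw_fit` are all hypothetical choices made for illustration. They are not the data behind the slides.

```r
set.seed(7)  # arbitrary seed; simulated data, not the slides' df_rdd

cut_point <- 60   # hypothetical cutoff
n <- 500
sim_rdd <- data.frame(score = runif(n, 40, 80))
sim_rdd$treat <- as.integer(sim_rdd$score >= cut_point)

# True jump of 5 at the cutoff, same slope on both sides
sim_rdd$outcome <- 20 + 0.5 * sim_rdd$score + 5 * sim_rdd$treat +
  rnorm(n, 0, 2)

# Center the running variable so that treat measures the gap AT the cutoff
sim_rdd$score_c <- sim_rdd$score - cut_point
rdd_fit <- lm(outcome ~ score_c + treat + score_c:treat, data = sim_rdd)
jump_centered <- coef(rdd_fit)[["treat"]]

# Without centering, treat is the gap at score = 0; the jump at the cutoff
# is then treat + (score:treat) * cutoff
raw_fit <- lm(outcome ~ score + treat + score:treat, data = sim_rdd)
jump_raw <- coef(raw_fit)[["treat"]] +
  coef(raw_fit)[["score:treat"]] * cut_point

c(jump_centered = jump_centered, jump_raw = jump_raw)
```

Both calculations give the same estimated jump. Centering the running variable at the cutoff simply lets the coefficient on `treat` be read directly as the discontinuity.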
ggplot(df_rdd, aes(score, outcome)) +
geom_point(alpha = 0.5) +
geom_vline(xintercept = cutoff, linetype = 2, linewidth = 1) +
geom_smooth(
data = df_rdd |> filter(score < cutoff),
method = "lm", se = FALSE, linewidth = 1.3
) +
geom_smooth(
data = df_rdd |> filter(score >= cutoff),
method = "lm", se = FALSE, linewidth = 1.3
) +
labs(
title = "Regression Discontinuity Design",
subtitle = "A visible jump at the cutoff suggests a treatment effect",
x = "Running Variable",
y = "Outcome"
)
Call:
lm(formula = outcome ~ score + treat + score:treat, data = df_rdd)
Residuals:
Min 1Q Median 3Q Max
-6.0245 -1.0347 -0.0902 1.1865 5.4794
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 21.30358 1.50232 14.180 < 2e-16 ***
score 0.57327 0.02987 19.192 < 2e-16 ***
treat 6.75869 2.58146 2.618 0.00939 **
score:treat 0.02565 0.04224 0.607 0.54434
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.936 on 246 degrees of freedom
Multiple R-squared: 0.9682, Adjusted R-squared: 0.9678
F-statistic: 2495 on 3 and 246 DF, p-value: < 2.2e-16
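The model can be read as follows:
- score: the trend to the left of the cutoff
- treat: whether the outcome shifts up as a whole to the right of the cutoff
- score:treat: whether the slope changes to the right of the cutoff
The most important of these is:
the "jump" effect captured by treat
Strictly, treat is the gap between the two fitted lines at score = 0, so it equals the jump at the cutoff only when score is measured relative to the cutoff (centered).

| Method | Core identification logic | Typical settings |
|---|---|---|
| DID | Compares how the treatment and control groups change from before to after the policy | Before/after a policy, regional comparisons |
| RDD | Compares units just left and just right of the cutoff for a jump in the outcome | Score thresholds, eligibility cutoffs |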
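Today we tied together four kinds of problems:
When real-world relationships are not straight lines, we need more flexible models.
When the outcome is categorical, we need probability models and a prediction mindset.
When a task has to be repeated, we need programmatic tools to work efficiently.
When we want to get closer to a causal interpretation, we can exploit policy changes or threshold designs.
Linear regression helps us understand simple relationships,
machine learning helps us improve predictive power,
and DID and RDD help us get closer to causal identification.
Questions and discussion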