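R Basics: Two-Dimensional Data Visualization and an Introduction to Methods

Nonlinear regression, logistic regression, machine-learning thinking, loops and functions, DID and RDD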

尹俊贺

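Course structure

Today's slides set out to answer four questions:

Why are real-world relationships so often nonlinear?
Why are logistic regression and machine learning better suited to classification problems?
Why learn loops and user-defined functions?
Why do DID and RDD bring us closer to causal inference?

One main thread

Today we are really looking at three skills:

Modeling: when a linear model is not enough, how do we handle more complex relationships?
Computing: when a task has to be repeated, how do we automate it with code?
Identification: when we want to talk about causality, how do we get past the limitation of "seeing only correlation"?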

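Why not teach only linear regression?

Very often the relationship between variables is not a straight line.

For example:

Study time and grades: fast gains early on, diminishing marginal returns later
Advertising spend and sales: usually rising at first, then leveling off
Age and income: typically rising, then slowing or even falling

So we need more flexible tools:

Nonlinear regression: handles curved relationships
Logistic regression: handles binary outcomes
Machine-learning methods: approximate complex patterns more flexibly

First, generate some demo data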

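# Packages used by the code in these slides
library(tidyverse)   # tibble, dplyr, tidyr, ggplot2
library(patchwork)   # combines ggplot objects with `+`
library(rpart)       # decision trees
library(rpart.plot)  # decision-tree plots

# Continuous outcome: clearly nonlinear
n <- 300
x <- seq(0, 10, length.out = n)
y <- 2 + 1.8 * x - 0.18 * x^2 + rnorm(n, 0, 1.5)
df_nl <- tibble(x = x, y = y)

# Binary outcome: suited to logistic regression
x2 <- rnorm(400)
p_true <- 1 / (1 + exp(-(-0.6 + 1.8 * x2)))
y_bin <- rbinom(400, 1, p_true)
df_logit <- tibble(x = x2, y = y_bin, p_true = p_true)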

Nonlinear regression: linear model vs. quadratic model

model_linear <- lm(y ~ x, data = df_nl)
model_quad   <- lm(y ~ x + I(x^2), data = df_nl)

pred_grid <- tibble(
  x = seq(min(df_nl$x), max(df_nl$x), length.out = 200)
)
pred_grid$linear <- predict(model_linear, newdata = pred_grid)
pred_grid$quadratic <- predict(model_quad, newdata = pred_grid)

p1 <- ggplot(df_nl, aes(x, y)) +
  geom_point(alpha = 0.5) +
  geom_line(data = pred_grid, aes(y = linear), linewidth = 1.2) +
  labs(
    title = "Linear Regression",
    subtitle = "A linear model can only fit a straight line",
    x = "X",
    y = "Y"
  )

p2 <- ggplot(df_nl, aes(x, y)) +
  geom_point(alpha = 0.5) +
  geom_line(data = pred_grid, aes(y = quadratic), linewidth = 1.2) +
  labs(
    title = "Nonlinear Regression with a Quadratic Term",
    subtitle = "The quadratic term allows the marginal effect to change with X",
    x = "X",
    y = "Y"
  )

p1 + p2
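
Beyond comparing the two plots by eye, the fits can also be compared formally. The following is a minimal sketch that re-simulates data of the same form as the demo data; the seed and the names `df_demo`, `fit_lin`, and `fit_quad` are assumptions made for this illustration and are not taken from the slides. It uses a nested-model F-test and AIC.

```r
# Illustrative re-simulation; the seed is an assumption, not the one used for the slides
set.seed(2024)
n <- 300
x <- seq(0, 10, length.out = n)
y <- 2 + 1.8 * x - 0.18 * x^2 + rnorm(n, 0, 1.5)
df_demo <- data.frame(x = x, y = y)

fit_lin  <- lm(y ~ x, data = df_demo)
fit_quad <- lm(y ~ x + I(x^2), data = df_demo)

# Nested-model F-test: does adding x^2 reduce the residual sum of squares?
cmp <- anova(fit_lin, fit_quad)
print(cmp)

# Lower AIC indicates a better trade-off between fit and complexity
aic_values <- AIC(fit_lin, fit_quad)
print(aic_values)
```

A small p-value in the F-test and a lower AIC for the quadratic model would both point toward keeping the squared term.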

The key idea behind nonlinear regression

The model can be written as:

\[ Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \varepsilon \]

summary(model_quad)


Call:
lm(formula = y ~ x + I(x^2), data = df_nl)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.0826 -0.9292 -0.0615  0.8473  4.3134 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.28805    0.25587   5.034 8.34e-07 ***
x            2.09242    0.11821  17.701  < 2e-16 ***
I(x^2)      -0.20188    0.01144 -17.642  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.487 on 297 degrees of freedom
Multiple R-squared:  0.5166,    Adjusted R-squared:  0.5134 
F-statistic: 158.7 on 2 and 297 DF,  p-value: < 2.2e-16

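Its key implications:

When \(\beta_2 \neq 0\), the effect of \(X\) on \(Y\) is no longer constant
In other words, the marginal effect \(\beta_1 + 2\beta_2 X\) changes with \(X\)
This is closer to the real world than assuming "a straight line forever"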
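
To make the changing marginal effect concrete, the sketch below plugs the estimates printed by `summary(model_quad)` into the derivative of the quadratic. The helper `marginal_effect()` and the table `me_table` are illustrative names introduced here, not part of the original slides.

```r
# Coefficients copied from the summary(model_quad) output above
b1 <- 2.09242   # coefficient on x
b2 <- -0.20188  # coefficient on I(x^2)

# Marginal effect of x: derivative of b1 * x + b2 * x^2 with respect to x
marginal_effect <- function(x) b1 + 2 * b2 * x

# Slope of the fitted curve at a few values of x
me_table <- data.frame(x = c(0, 2.5, 5, 7.5, 10))
me_table$marginal_effect <- marginal_effect(me_table$x)
print(me_table)

# The fitted curve peaks where the marginal effect equals zero
turning_point <- -b1 / (2 * b2)
print(turning_point)
```

With these estimates the slope is positive for small values of x, negative for large ones, and the fitted curve turns at roughly x = 5.2.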

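Why does logistic regression matter?

When the dependent variable has only two categories, for example:

Admitted or not (0/1)
Defaulted or not (0/1)
Voted or not (0/1)

ordinary linear regression runs into two problems:

Predicted values can fall below 0 or above 1
The probability interpretation is unnatural

So logistic regression is usually a better fit for this kind of problem.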
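
The first problem is easy to see on simulated data. The sketch below fits a linear probability model and a logistic regression to the same binary outcome and predicts at a few values of x, including extreme ones. The seed and the object names (`df_demo`, `lpm`, `logit`, `new_x`) are assumptions made for this illustration.

```r
set.seed(42)  # arbitrary seed for this illustration
x <- rnorm(400)
y <- rbinom(400, 1, plogis(-0.6 + 1.8 * x))
df_demo <- data.frame(x = x, y = y)

lpm   <- lm(y ~ x, data = df_demo)                      # linear probability model
logit <- glm(y ~ x, data = df_demo, family = binomial)  # logistic regression

# Predict at moderate and extreme values of x
new_x <- data.frame(x = c(-4, -2, 0, 2, 4))
new_x$lpm   <- predict(lpm, newdata = new_x)
new_x$logit <- predict(logit, newdata = new_x, type = "response")
print(new_x)
```

The straight line keeps going past 0 and 1 at the extremes, while the logistic predictions stay inside the unit interval.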

The S-shaped curve of logistic regression

logit_model <- glm(y ~ x, data = df_logit, family = binomial)

grid_logit <- tibble(
  x = seq(min(df_logit$x), max(df_logit$x), length.out = 300)
)
grid_logit$prob_hat <- predict(logit_model, newdata = grid_logit, type = "response")

ggplot(df_logit, aes(x, y)) +
  geom_jitter(height = 0.05, width = 0, alpha = 0.25) +
  geom_line(data = grid_logit, aes(y = prob_hat), linewidth = 1.5) +
  labs(
    title = "Logistic Regression",
    subtitle = "The fitted curve gives predicted probabilities between 0 and 1",
    x = "X",
    y = "Observed Y and Predicted Probability"
  )

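Intuition for the logistic regression formula

\[ P(Y=1 \mid X) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X)}} \]

One way to read it:

First build a linear combination \(\beta_0 + \beta_1 X\)
Then pass it through a nonlinear function that squeezes the result into the range 0 to 1

So logistic regression is essentially:

a linear predictor + a nonlinear probability mapping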
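
The two steps can be traced by hand. This small sketch uses the coefficient values from the simulation code in these slides (-0.6 and 1.8) and checks the manual formula against base R's `plogis()`; the vector of x values is chosen arbitrarily for illustration.

```r
b0 <- -0.6  # intercept used to simulate the demo data
b1 <- 1.8   # slope used to simulate the demo data
x <- c(-2, -1, 0, 1, 2)

eta <- b0 + b1 * x               # step 1: linear predictor, can be any real number
p_manual <- 1 / (1 + exp(-eta))  # step 2: logistic function, maps into (0, 1)
p_builtin <- plogis(eta)         # base R's implementation of the same function

print(data.frame(x, eta, p_manual, p_builtin))
```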

Logistic regression code

summary(logit_model)


Call:
glm(formula = y ~ x, family = binomial, data = df_logit)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -0.8290     0.1471  -5.634 1.76e-08 ***
x             2.2866     0.2323   9.844  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 526.09  on 399  degrees of freedom
Residual deviance: 323.65  on 398  degrees of freedom
AIC: 327.65

Number of Fisher Scoring iterations: 5

The most common form is:

glm(y ~ x1 + x2, data = your_data, family = binomial)

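If a variable's coefficient is positive, it usually means:

As the variable increases (holding the others fixed), the probability of the event rises

But note:

The change in probability is not constant or linear
The marginal effect differs depending on where you are on the curve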
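
One way to quantify both points is to compute the odds ratio and the marginal effect on the probability, \(\beta_1 p (1 - p)\), at several values of x. The sketch below uses the estimates printed by `summary(logit_model)`; the evaluation points and the name `x_half` are choices made for this illustration.

```r
# Estimates copied from the summary(logit_model) output above
b0 <- -0.8290
b1 <- 2.2866

# Odds ratio: multiplicative change in the odds for a one-unit increase in x
odds_ratio <- exp(b1)
print(odds_ratio)

# Value of x at which the predicted probability is exactly 0.5
x_half <- -b0 / b1

# Marginal effect on the probability: dP/dx = b1 * p * (1 - p)
x <- c(-2, 0, x_half, 2)
p <- plogis(b0 + b1 * x)
marginal_effect <- b1 * p * (1 - p)
print(data.frame(x, p, marginal_effect))
```

The marginal effect is largest where the predicted probability is 0.5 and shrinks toward zero in both tails, which is why a single "effect on the probability" cannot be read directly off the coefficient.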

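From logistic regression to machine learning

Machine learning can be summed up in one sentence:

Let the model learn patterns from data and use them to predict.

A few basic concepts:

Features: the information used to predict
Label: the outcome we want to predict
Training set / test set: one to learn from, one to check against
Prediction: the focus is on how accurately the model predicts

How machine learning and traditional regression think differently

Traditional regression usually cares more about:

How large is the coefficient?
Is its sign as expected, and is it statistically significant?
How should the relationships between variables be interpreted?

Machine learning usually cares more about:

How accurately are new samples predicted?
How does the model perform on the test set?
Is there overfitting?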
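
A minimal sketch of the train/test workflow, using logistic regression as the model. The 70/30 split, the 0.5 classification threshold, the seed, and all object names are assumptions chosen for illustration.

```r
set.seed(7)  # arbitrary seed for this illustration
n <- 400
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-0.6 + 1.8 * x))
df_demo <- data.frame(x = x, y = y)

# Hold out 30% of the rows as a test set
test_idx <- sample(n, size = round(0.3 * n))
train <- df_demo[-test_idx, ]
test  <- df_demo[test_idx, ]

# Learn on the training set only
fit <- glm(y ~ x, data = train, family = binomial)

# Classify test observations with a 0.5 probability threshold
prob_test <- predict(fit, newdata = test, type = "response")
pred_test <- as.integer(prob_test > 0.5)

# Share of test observations classified correctly
accuracy <- mean(pred_test == test$y)
print(accuracy)
```

The model never sees the test rows while fitting, so the accuracy on them is an honest estimate of how it would do on new samples.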

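A basic machine-learning example: the decision tree

The idea behind a decision tree is intuitive:

Ask a question
Split the sample into groups according to the answer
Keep asking questions within each group
Finally, output a prediction

Its strengths:

Very intuitive
Handles nonlinear and piecewise relationships automatically

Visualizing a decision tree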

set.seed(1)
df_tree <- tibble(
  x1 = rnorm(300),
  x2 = rnorm(300)
) |>
  mutate(y = ifelse(x1 + 0.8 * x2 + rnorm(300, 0, 0.6) > 0, 1, 0))

tree_model <- rpart(as.factor(y) ~ x1 + x2, data = df_tree, method = "class")
rpart.plot(tree_model, type = 2, extra = 104, under = TRUE, fallen.leaves = TRUE)
title("Decision Tree for Classification")
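
A plot shows what the tree learned, but not how well it predicts. The sketch below, under the assumption of a simple 210/90 train/test split on data simulated in the same way, evaluates a tree with a confusion matrix and an accuracy score. The object names and the split size are illustrative choices.

```r
library(rpart)  # recommended package, normally installed together with R

set.seed(1)
n <- 300
df_demo <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
df_demo$y <- factor(ifelse(df_demo$x1 + 0.8 * df_demo$x2 + rnorm(n, 0, 0.6) > 0, 1, 0))

# Split once into training and test rows
test_idx <- sample(n, size = 90)
train <- df_demo[-test_idx, ]
test  <- df_demo[test_idx, ]

tree_fit <- rpart(y ~ x1 + x2, data = train, method = "class")

# Predicted classes on unseen data, compared with the true labels
pred <- predict(tree_fit, newdata = test, type = "class")
confusion <- table(predicted = pred, actual = test$y)
print(confusion)

accuracy <- mean(pred == test$y)
print(accuracy)
```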

A common warning in machine learning: overfitting

x_over <- seq(0, 10, length.out = 60)
y_over <- sin(x_over) + rnorm(60, 0, 0.25)
df_over <- tibble(x = x_over, y = y_over)

fit_simple <- lm(y ~ x, data = df_over)
fit_flex   <- lm(y ~ poly(x, 10), data = df_over)

grid_over <- tibble(x = seq(0, 10, length.out = 300))
grid_over$simple <- predict(fit_simple, newdata = grid_over)
grid_over$flexible <- predict(fit_flex, newdata = grid_over)

p_over1 <- ggplot(df_over, aes(x, y)) +
  geom_point() +
  geom_line(data = grid_over, aes(y = simple), linewidth = 1.2) +
  labs(
    title = "A Simpler Model",
    subtitle = "It may underfit, but it is often more stable",
    x = "X",
    y = "Y"
  )

p_over2 <- ggplot(df_over, aes(x, y)) +
  geom_point() +
  geom_line(data = grid_over, aes(y = flexible), linewidth = 1.2) +
  labs(
    title = "An Overly Flexible Model",
    subtitle = "It may learn noise as if it were signal: overfitting",
    x = "X",
    y = "Y"
  )

p_over1 + p_over2
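
Plots suggest overfitting; training and test errors measure it. The following sketch fits polynomials of increasing degree to one noisy sample and scores each on a fresh sample from the same process. The seed, the list of degrees, and the helper names `make_data()` and `mse()` are assumptions made for this illustration.

```r
set.seed(11)  # arbitrary seed for this illustration

# Same data-generating process as above: a sine curve plus noise
make_data <- function(x) data.frame(x = x, y = sin(x) + rnorm(length(x), 0, 0.25))
train <- make_data(seq(0, 10, length.out = 60))
test  <- make_data(seq(0, 10, length.out = 500))

# Mean squared prediction error of a fitted model on a data set
mse <- function(fit, data) mean((data$y - predict(fit, newdata = data))^2)

degrees <- c(1, 3, 5, 10, 15)
results <- data.frame(degree = degrees, train_mse = NA_real_, test_mse = NA_real_)

for (i in seq_along(degrees)) {
  fit <- lm(y ~ poly(x, degrees[i]), data = train)
  results$train_mse[i] <- mse(fit, train)
  results$test_mse[i]  <- mse(fit, test)
}

print(results)
```

Training error can only fall as the degree rises, because each model nests the previous one. Test error carries no such guarantee, and a gap opening between the two columns is the usual symptom of overfitting.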

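What is most worth remembering from the machine-learning part?

Regression / econometric methods tend to emphasize interpretation
Machine-learning methods tend to emphasize prediction
Nonlinear terms, classification models, and trees all serve the same goal: describing a complex reality better

Why learn loops?

In R, many tasks are not done once; they have to be repeated many times.

For example:

Running a regression for each of several variables
Drawing a plot for each of several years
Repeating a calculation over many simulated samples

Writing every step by hand would be very inefficient.

The simplest for loop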

for (i in 1:5) {
  print(paste("Current i =", i))
}

[1] "Current i = 1"
[1] "Current i = 2"
[1] "Current i = 3"
[1] "Current i = 4"
[1] "Current i = 5"

The core idea of a loop is:

Run a repeated operation automatically, many times over
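
A loop usually has to store what it computes. The sketch below shows one common pattern: preallocate a result vector, then fill it by position. The `scores` list is hypothetical data invented for this illustration.

```r
# Hypothetical data: exam scores grouped by year
scores <- list(
  "2021" = c(70, 82, 91),
  "2022" = c(65, 75, 85, 95),
  "2023" = c(88, 92)
)

# Preallocate the result vector instead of growing it inside the loop
avg <- numeric(length(scores))
names(avg) <- names(scores)

# seq_along() is safer than 1:length(), which misbehaves on empty input
for (i in seq_along(scores)) {
  avg[i] <- mean(scores[[i]])
}

print(avg)
```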

Loops: fitting several models in turn

set.seed(123)
df_loop <- tibble(
  y  = rnorm(200),
  x1 = rnorm(200),
  x2 = rnorm(200),
  x3 = rnorm(200)
)

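vars <- c("x1", "x2", "x3")
# Named numeric vector with one slot per predictor
results <- setNames(numeric(length(vars)), vars)

for (v in vars) {
  # Build the formula y ~ x1, y ~ x2, ... from the variable name
  formula_now <- as.formula(paste("y ~", v))
  model_now <- lm(formula_now, data = df_loop)
  # The second coefficient is the slope; [[ ]] drops its name so only `v` labels it
  results[v] <- coef(model_now)[[2]]
}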

results

         x1          x2          x3 
-0.02623691 -0.02955428 -0.04271895

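Why write your own functions as well?

Loops only solve the "run it again" problem. Functions go further:

They package analysis steps
They make code clearer
They can be reused across different situations

In one sentence:

A function is a tool you build yourself.

The simplest user-defined function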

square_num <- function(x) {
  return(x^2)
}

square_num(5)

[1] 25

A function generally has three parts:

Input
Processing
Output (return)
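
The three parts are easier to see in a slightly larger function. The helper `standardize()` below is a hypothetical example written for this illustration; it adds a default argument and an input check to the basic pattern.

```r
# Hypothetical helper: rescale a numeric vector to mean 0 and standard deviation 1
standardize <- function(x, na.rm = TRUE) {
  # Input: check the argument before using it
  stopifnot(is.numeric(x))

  # Process: center, then scale
  centered <- x - mean(x, na.rm = na.rm)
  scaled <- centered / sd(x, na.rm = na.rm)

  # Output: the value handed back to the caller
  return(scaled)
}

z <- standardize(c(1, 2, 3))
print(z)
```

Calling `standardize("a")` stops with an error instead of returning a meaningless result, which is the point of checking inputs early.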

Wrapping the regression steps in a function

get_slope <- function(data, varname) {
  formula_now <- as.formula(paste("y ~", varname))
  model_now <- lm(formula_now, data = data)
  return(coef(model_now)[2])
}

get_slope(df_loop, "x1")

         x1 
-0.02623691

Functions + loops: automating the analysis

slopes <- c()

for (v in vars) {
  slopes[v] <- get_slope(df_loop, v)
}

slopes

         x1          x2          x3 
-0.02623691 -0.02955428 -0.04271895

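It can also be written more compactly (sapply() joins each element's name with the name of the returned coefficient, hence labels such as x1.x1):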

sapply(vars, function(v) get_slope(df_loop, v))

      x1.x1       x2.x2       x3.x3 
-0.02623691 -0.02955428 -0.04271895
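
If cleaner labels are wanted, one option is `vapply()` with an unnamed return value. This is a sketch of an alternative, not the approach used in the slides; it rebuilds the data with base `data.frame()` so that it runs on its own, and `df_demo` plus the rewritten `get_slope()` are illustrative.

```r
set.seed(123)
df_demo <- data.frame(
  y  = rnorm(200),
  x1 = rnorm(200),
  x2 = rnorm(200),
  x3 = rnorm(200)
)
vars <- c("x1", "x2", "x3")

get_slope <- function(data, varname) {
  # reformulate() builds the formula y ~ <varname> without pasting strings
  fit <- lm(reformulate(varname, response = "y"), data = data)
  unname(coef(fit)[2])  # drop the coefficient's own name
}

# vapply() requires every result to be a single number and fails loudly otherwise
slopes <- vapply(vars, function(v) get_slope(df_demo, v), numeric(1))
print(slopes)
```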

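So far, what has programming solved?

You can now:

Automate repeated tasks
Package analysis steps
Handle more problems with less code

These skills matter a great deal later on, in empirical research and in machine learning.

From correlation to causation: why do we need DID and RDD?

The models so far help us:

See relationships
Make predictions
Find patterns

But a harder question is:

Did the change in X really cause the change in Y?

This takes us into causal inference.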

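The intuition behind DID: comparing "differences in changes"

DID = Difference-in-Differences

Typical setting:

One region implements a policy (treatment group)
Another region does not (control group)
We observe both before and after the policy

The core idea:

Had there been no policy, the two groups would have followed similar trends (the parallel-trends assumption);
the extra difference that appears after the policy can then be read as the policy effect.

Generating DID demo data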

df_did <- expand_grid(
  treat = c(0, 1),
  post = c(0, 1),
  id = 1:100
) |>
  mutate(
    y = 50 +
      3 * treat +
      2 * post +
      6 * (treat * post) +
      rnorm(n(), 0, 3)
  )

summary_did <- df_did |>
  group_by(treat, post) |>
  summarise(mean_y = mean(y), .groups = "drop") |>
  mutate(
    group = ifelse(treat == 1, "Treatment", "Control"),
    time  = ifelse(post == 1, "After", "Before")
  )

DID visualization: the "difference in differences" at a glance

ggplot(summary_did, aes(x = time, y = mean_y, group = group, color = group)) +
  geom_line(linewidth = 1.5) +
  geom_point(size = 3) +
  labs(
    title = "Difference-in-Differences",
    subtitle = "The extra increase in the treatment group after the policy is the DID effect",
    x = NULL,
    y = "Average Outcome",
    color = NULL
  ) +
  theme(legend.position = "top")

DID regression code

did_model <- lm(y ~ treat * post, data = df_did)
summary(did_model)


Call:
lm(formula = y ~ treat * post, data = df_did)

Residuals:
    Min      1Q  Median      3Q     Max 
-8.0079 -2.1259 -0.0806  1.7191  9.6097 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  50.2808     0.3085 162.971  < 2e-16 ***
treat         3.0791     0.4363   7.057 7.67e-12 ***
post          1.6617     0.4363   3.808 0.000162 ***
treat:post    5.9210     0.6171   9.596  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.085 on 396 degrees of freedom
Multiple R-squared:  0.6386,    Adjusted R-squared:  0.6359 
F-statistic: 233.2 on 3 and 396 DF,  p-value: < 2.2e-16

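Focus on three terms:

treat: the gap between the treatment and control groups before the policy
post: the before-to-after change in the control group, i.e. the common time trend
treat:post: the DID estimate, the policy effect we care most about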
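
In this two-group, two-period setup the interaction coefficient is exactly the difference between the two groups' before/after changes in mean outcome. The sketch below checks this on freshly simulated data of the same form; the seed and the helper `cell_mean()` are assumptions made for this illustration.

```r
set.seed(2025)  # arbitrary seed for this illustration
df_demo <- expand.grid(treat = c(0, 1), post = c(0, 1), id = 1:100)
df_demo$y <- 50 + 3 * df_demo$treat + 2 * df_demo$post +
  6 * df_demo$treat * df_demo$post + rnorm(nrow(df_demo), 0, 3)

# Mean outcome in each of the four treat x post cells
cell_mean <- function(t, p) mean(df_demo$y[df_demo$treat == t & df_demo$post == p])

change_treated <- cell_mean(1, 1) - cell_mean(1, 0)  # before/after change, treatment group
change_control <- cell_mean(0, 1) - cell_mean(0, 0)  # before/after change, control group
did_by_hand <- change_treated - change_control

# The same quantity from the regression with an interaction term
did_by_lm <- coef(lm(y ~ treat * post, data = df_demo))[["treat:post"]]

print(c(by_hand = did_by_hand, by_lm = did_by_lm))
```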

DID in one line

\[ Y = \beta_0 + \beta_1 Treat + \beta_2 Post + \beta_3 (Treat \times Post) + \varepsilon \]

where:

\(\beta_3\) is the central causal-effect estimand in DID.
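
Why the interaction coefficient is the "difference in differences" follows from the four group means implied by the regression equation. The derivation below assumes the error term has mean zero within each group and period.

```latex
\begin{aligned}
E[Y \mid Treat=0,\ Post=0] &= \beta_0 \\
E[Y \mid Treat=0,\ Post=1] &= \beta_0 + \beta_2 \\
E[Y \mid Treat=1,\ Post=0] &= \beta_0 + \beta_1 \\
E[Y \mid Treat=1,\ Post=1] &= \beta_0 + \beta_1 + \beta_2 + \beta_3 \\[4pt]
\underbrace{(\beta_2 + \beta_3)}_{\text{change in treatment group}}
 - \underbrace{\beta_2}_{\text{change in control group}} &= \beta_3
\end{aligned}
```

The control group's change \(\beta_2\) stands in for what would have happened to the treatment group without the policy, which is where the parallel-trends assumption enters.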

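The intuition behind RDD: is there a "jump" at the threshold?

RDD = Regression Discontinuity Design

Typical setting:

Only scores above a cutoff qualify for a subsidy
Only incomes below a threshold qualify for eligibility

The core idea:

Near the cutoff, individuals on either side are usually very similar;
if the outcome jumps abruptly at the cutoff, that jump may well come from the rule or policy.

Generating RDD demo data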

set.seed(100)
score <- seq(40, 80, length.out = 250)
cutoff <- 60
treat_rdd <- ifelse(score >= cutoff, 1, 0)
outcome <- 20 + 0.6 * score + 8 * treat_rdd + rnorm(250, 0, 2)

df_rdd <- tibble(score = score, treat = treat_rdd, outcome = outcome)

RDD visualization: the key is the jump at the cutoff

ggplot(df_rdd, aes(score, outcome)) +
  geom_point(alpha = 0.5) +
  geom_vline(xintercept = cutoff, linetype = 2, linewidth = 1) +
  geom_smooth(
    data = df_rdd |> filter(score < cutoff),
    method = "lm", se = FALSE, linewidth = 1.3
  ) +
  geom_smooth(
    data = df_rdd |> filter(score >= cutoff),
    method = "lm", se = FALSE, linewidth = 1.3
  ) +
  labs(
    title = "Regression Discontinuity Design",
    subtitle = "A visible jump at the cutoff suggests a treatment effect",
    x = "Running Variable",
    y = "Outcome"
  )

RDD regression code

rdd_model <- lm(outcome ~ score + treat + score:treat, data = df_rdd)
summary(rdd_model)


Call:
lm(formula = outcome ~ score + treat + score:treat, data = df_rdd)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.0245 -1.0347 -0.0902  1.1865  5.4794 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 21.30358    1.50232  14.180  < 2e-16 ***
score        0.57327    0.02987  19.192  < 2e-16 ***
treat        6.75869    2.58146   2.618  0.00939 ** 
score:treat  0.02565    0.04224   0.607  0.54434    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.936 on 246 degrees of freedom
Multiple R-squared:  0.9682,    Adjusted R-squared:  0.9678 
F-statistic:  2495 on 3 and 246 DF,  p-value: < 2.2e-16

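This model can be read as follows:

score: the slope to the left of the threshold
treat: the gap between the right-hand and left-hand fitted lines, measured at score = 0 because score is not centered
score:treat: how much the slope changes to the right of the threshold

What matters most is:

the jump at the cutoff, which here equals treat + 60 × score:treat ≈ 6.76 + 60 × 0.026 ≈ 8.30, close to the simulated effect of 8. Centering the running variable (score - cutoff) would make the treat coefficient report this jump directly.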
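
A minimal sketch of the centered specification, rebuilt with base `data.frame()` so that it runs on its own. The names `df_demo`, `score_c`, `fit_raw`, `fit_centered`, and `fit_local`, and the 10-point bandwidth in the last step, are assumptions made for this illustration.

```r
set.seed(100)
score <- seq(40, 80, length.out = 250)
cutoff <- 60
treat <- ifelse(score >= cutoff, 1, 0)
outcome <- 20 + 0.6 * score + 8 * treat + rnorm(250, 0, 2)
df_demo <- data.frame(score, treat, outcome)

# Center the running variable so that 0 corresponds to the cutoff
df_demo$score_c <- df_demo$score - cutoff

fit_raw      <- lm(outcome ~ score * treat, data = df_demo)
fit_centered <- lm(outcome ~ score_c * treat, data = df_demo)

# Jump at the cutoff implied by the uncentered model
jump_raw <- coef(fit_raw)[["treat"]] + cutoff * coef(fit_raw)[["score:treat"]]
# In the centered model the treat coefficient is the jump itself
jump_centered <- coef(fit_centered)[["treat"]]
print(c(raw = jump_raw, centered = jump_centered))

# Optional: local fit using only observations within 10 points of the cutoff
fit_local <- lm(outcome ~ score_c * treat, data = subset(df_demo, abs(score_c) <= 10))
print(coef(fit_local)[["treat"]])
```

The two versions are the same model written in different coordinates, so they imply the same jump. Restricting the sample to a window around the cutoff leans less on the linear form far from the threshold, at the cost of using fewer observations.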

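How do DID and RDD differ?

Method	Core identification logic	Typical settings
DID	Compares how outcomes change before and after a policy in the treatment group versus the control group	Before/after policy, regional comparisons
RDD	Checks whether the outcome jumps between the two sides of a threshold	Score cutoffs, eligibility thresholds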

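Summary

Today we tied together four kinds of problems:

1. Nonlinear regression

When the real relationship is not a straight line, we need more flexible models.

2. Logistic regression and machine learning

When the outcome is a classification problem, we need probability models and a prediction mindset.

3. Loops and functions

When a task has to be repeated, we need programmatic tools to work efficiently.

4. DID and RDD

When we want to get closer to a causal interpretation, we can exploit policy changes or threshold designs.

The one thing most worth remembering

Linear regression helps us understand simple relationships,
machine learning helps us predict better,
DID and RDD bring us closer to causal identification.

Thank you

Questions and discussion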