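Regression problems: predicting height, GDP, tourist counts, and so on; the observed variable \(t\) is a real number.
Classification problems: predicting sex, whether to grant a loan, or whether a test result is positive.
Goal: given a feature vector \(x\), assign it to one of \(K\) discrete classes \(C_k\) (\(k = 1, \cdots, K\)). For \(K = 2\), we usually take \(y \in \{0, 1\}\).
Methods: OLS or Ridge.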
Update on 2025-04-07
| Total population | Actual positive | Actual negative | Accuracy (ACC) = \(\frac{TP + TN}{\text{Total}}\) | |
|---|---|---|---|---|
| Predicted positive | True positive (TP) | False positive (FP), Type I error | Positive predictive value (PPV), precision = \(\frac{TP}{TP + FP}\) | False discovery rate (FDR) = \(\frac{FP}{TP + FP}\) |
| Predicted negative | False negative (FN), Type II error | True negative (TN) | False omission rate (FOR) = \(\frac{FN}{FN + TN}\) | Negative predictive value (NPV) = \(\frac{TN}{FN + TN}\) |
| | True positive rate (TPR), recall, power = \(\frac{TP}{TP + FN}\) | False positive rate (FPR) = \(\frac{FP}{FP + TN}\) | F1 score = \(\frac{2}{\frac{1}{\rm TPR} + \frac{1}{\rm PPV}}\) | |
| | False negative rate (FNR) = \(\frac{FN}{TP + FN}\) | True negative rate (TNR) = \(\frac{TN}{FP + TN}\) | | |
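As a quick check on the definitions in the confusion matrix, the following minimal sketch computes each rate from the four cell counts. The function name `confusion_metrics` and the counts are hypothetical, chosen only for illustration, and the sketch assumes every denominator is nonzero.

```python
def confusion_metrics(tp, fp, fn, tn):
    """Return the rates defined in the confusion-matrix table as a dict."""
    total = tp + fp + fn + tn
    ppv = tp / (tp + fp)  # precision
    tpr = tp / (tp + fn)  # recall, power
    return {
        "ACC": (tp + tn) / total,
        "PPV": ppv,
        "FDR": fp / (tp + fp),
        "FOR": fn / (fn + tn),
        "NPV": tn / (fn + tn),
        "TPR": tpr,
        "FPR": fp / (fp + tn),
        "FNR": fn / (tp + fn),
        "TNR": tn / (fp + tn),
        "F1": 2 / (1 / tpr + 1 / ppv),  # harmonic mean of TPR and PPV
    }


# Hypothetical counts, not taken from any real data set.
metrics = confusion_metrics(tp=40, fp=10, fn=20, tn=30)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Note that PPV + FDR, FOR + NPV, TPR + FNR, and FPR + TNR each sum to 1, since each pair shares a denominator.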
Suppose a nucleic acid test has been taken!

| Probability | Truly positive | Truly negative |
|---|---|---|
| Test positive | \(\alpha\) | \(1 - \beta\) |
| Test negative | \(1 - \alpha\) | \(\beta\) |
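Both \(\alpha\) and \(\beta\) are close to 1. Let event \(A\) be "truly positive" and event \(B\) be "test positive". The true infection rate is \({\rm P} (A) = r\). We assume that every asymptomatic case is in fact the result of a Type I error (a false positive). Then \({\rm P} (A|B)\) is the probability of being symptomatic, and \({\rm P} (\bar A|B)\) is the probability of being asymptomatic.
\[ {\rm P} (A|B) = \frac{{\rm P}(B|A){\rm P}(A)}{{\rm P}(B|A){\rm P}(A) + {\rm P}(B| \bar A){\rm P}(\bar A)} = \frac{\alpha r}{\alpha r + (1 - \beta)(1 - r)}, \] Likewise, the probability of being asymptomatic is: \[ {\rm P} (\bar A|B) = 1 - {\rm P} (A|B). \]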
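To get a feel for the Bayes formula above, here is a minimal sketch that evaluates it numerically. The function name `posterior_infected` and the values of \(\alpha\), \(\beta\), and \(r\) are assumptions made for illustration, not figures from the text.

```python
def posterior_infected(alpha, beta, r):
    """P(A|B): probability of being truly positive given a positive test.

    alpha = P(test positive | truly positive)
    beta  = P(test negative | truly negative)
    r     = P(truly positive), the true infection rate
    """
    numerator = alpha * r
    return numerator / (numerator + (1 - beta) * (1 - r))


# Hypothetical values: a test that is right 99% of the time, 1% infection rate.
p_infected = posterior_infected(alpha=0.99, beta=0.99, r=0.01)
p_not_infected = 1 - p_infected
print(p_infected, p_not_infected)
```

With these assumed values the two terms in the denominator are equal, so a positive test implies a true infection only about half the time, even though \(\alpha\) and \(\beta\) are both large.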
\[ \hat {y}_i = f(\mathbf x_i \boldsymbol \beta) = \frac{1}{1 + {\rm exp}(-\mathbf x_i \boldsymbol \beta)}, \] where \(f(\cdot)\) is an activation function and \(\sigma(a) = \frac{1}{1 + {\rm exp}(-a)}\) is the logistic sigmoid, which satisfies: \[ \frac{d\sigma}{da} = \sigma (1 - \sigma). \]
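The derivative identity for the logistic sigmoid can be checked numerically. The sketch below compares \(\sigma(1 - \sigma)\) with a central finite difference; the step size `h` and the test points are arbitrary choices.

```python
import math


def sigmoid(a):
    """Logistic sigmoid 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + math.exp(-a))


def sigmoid_grad(a):
    """Analytic derivative, using d(sigma)/da = sigma * (1 - sigma)."""
    s = sigmoid(a)
    return s * (1.0 - s)


h = 1e-6  # finite-difference step, an arbitrary small value
for a in (-2.0, 0.0, 0.5, 3.0):
    numeric = (sigmoid(a + h) - sigmoid(a - h)) / (2 * h)
    print(a, sigmoid_grad(a), numeric)
```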
OLS: let \(\hat{\mathbf y} = \mathbf X \boldsymbol \beta\); then \[ \hat{\boldsymbol \beta} = \underset{\boldsymbol \beta}{\mbox{argmin}}\sum_{i = 1}^n(y_i - \hat{y}_i(\boldsymbol \beta))^2. \] Add an assumption: \[ \mathbb E (y_i | \mathbf x_i) = \hat{y}_i. \]
By the definition of the OLS model, \[ y_i = \mathbf x_i \boldsymbol \beta + e_i, \] with \[ e_i \sim \mathcal N(0, \sigma^2). \]
It follows that \[ \mbox{Pr}(y_i) = \mathcal N(\mathbf x_i \boldsymbol \beta, \sigma^2) = \mathcal N(\hat{y}_i, \sigma^2). \]
Maximum likelihood: \[ E(\boldsymbol \beta) = \sum_{i = 1}^n\mbox{log}\ \mbox{Pr}(y_i) = -\frac{n}{2} \mbox{ln}\sigma^{2} - \frac{n}{2} \mbox{ln}(2 \pi) - \frac{1}{2\sigma^2}\sum_{i = 1}^n(y_i - \mathbf x_i \boldsymbol \beta)^2. \]
Taking the gradient and setting it to zero (the positive factor \(1/\sigma^2\) does not affect the solution): \[ \begin{aligned} \frac{\partial E(\boldsymbol \beta)}{\partial \boldsymbol \beta} =& \frac{1}{\sigma^2}\sum_{i = 1}^n (y_i - \hat{y}_i) \mathbf x_i = \boldsymbol 0. \end{aligned} \]
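The first-order condition \(\sum_i (y_i - \hat{y}_i)\mathbf x_i = \boldsymbol 0\) can be verified on a toy example. The sketch below assumes a single regressor plus an intercept, so that \(\mathbf x_i = (1, x_i)\), and uses the textbook closed-form solution; the data are made up for illustration.

```python
# Hypothetical toy data: one regressor, roughly linear response.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(xs)
x_mean = sum(xs) / n
y_mean = sum(ys) / n

# Closed-form OLS estimates for y = intercept + slope * x.
slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / sum(
    (x - x_mean) ** 2 for x in xs
)
intercept = y_mean - slope * x_mean

# Residuals y_i - y_hat_i at the fitted coefficients.
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]

# Both components of sum_i (y_i - y_hat_i) * x_i, with x_i = (1, x_i).
grad_intercept = sum(residuals)
grad_slope = sum(e * x for e, x in zip(residuals, xs))
print(slope, intercept, grad_intercept, grad_slope)
```

Both gradient components should vanish up to floating-point rounding, which is the same statement as the residuals being orthogonal to each column of \(\mathbf X\).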
\[ y_i \sim u^{y_i} (1 - u)^{1 - y_i}, \] and we want: \[ \hat{y}_i = f(\mathbf x_i \boldsymbol \beta) = f(a_i) = \mathbb E(y_i) = u, \] where \(a_i = \mathbf x_i \boldsymbol \beta\). Maximum likelihood estimation: \[ E(\boldsymbol \beta) = \sum_{i = 1}^n\mbox{log}\ \mbox{Pr}(y_i) = \sum_{i = 1}^n \left[ y_i\ \mbox{log}\ \hat{y}_i + (1 - y_i)\ \mbox{log}\ (1 - \hat{y}_i) \right]. \] Taking the gradient: \[ \frac{\partial E(\boldsymbol \beta)}{\partial \boldsymbol \beta} = \sum_{i = 1}^n \left[ y_i \frac{1}{\hat{y}_i} \frac{\partial f}{\partial a_i} \mathbf x_i - (1 - y_i) \frac{1}{1 - \hat{y}_i} \frac{\partial f}{\partial a_i} \mathbf x_i \right]. \]
If \(f(\cdot) = \sigma(\cdot)\), then:
\[ \begin{aligned} \frac{\partial E(\boldsymbol \beta)}{\partial \boldsymbol \beta} =& \sum_{i = 1}^n \left[ y_i \frac{1}{\hat{y}_i} \hat{y}_i( 1 - \hat{y}_i) \mathbf x_i - (1 - y_i) \frac{1}{1 - \hat{y}_i} \hat{y}_i( 1 - \hat{y}_i) \mathbf x_i \right]\\ =& \sum_{i = 1}^n (y_i - y_i \hat{y}_i - \hat{y}_i + y_i \hat{y}_i) \mathbf x_i \\ =& \sum_{i = 1}^n (y_i - \hat{y}_i) \mathbf x_i = \boldsymbol 0. \end{aligned} \] This is the same first-order condition as in OLS: the residuals \(y_i - \hat{y}_i\) are orthogonal to the regressors.
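Unlike OLS, the condition \(\sum_i (y_i - \hat{y}_i)\mathbf x_i = \boldsymbol 0\) has no closed-form solution when \(\hat{y}_i = \sigma(\mathbf x_i \boldsymbol \beta)\), so it is usually solved iteratively. The sketch below uses plain gradient ascent on the log-likelihood, which is one simple choice and not necessarily the method intended in these notes. The data, learning rate, and iteration count are all assumptions made for illustration, and \(\mathbf x_i = (1, x_i)\) with a single regressor.

```python
import math


def sigmoid(a):
    """Logistic sigmoid 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + math.exp(-a))


# Hypothetical, deliberately non-separable toy data (so the MLE is finite).
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [0, 0, 1, 0, 1, 0, 1, 1]


def gradient(b0, b1):
    """Return sum_i (y_i - y_hat_i) * x_i for x_i = (1, x_i)."""
    g0 = g1 = 0.0
    for x, y in zip(xs, ys):
        resid = y - sigmoid(b0 + b1 * x)
        g0 += resid
        g1 += resid * x
    return g0, g1


b0, b1 = 0.0, 0.0
learning_rate = 0.1  # assumed step size
n = len(xs)
for _ in range(20000):
    g0, g1 = gradient(b0, b1)
    # Ascent step: move along the gradient of the mean log-likelihood.
    b0 += learning_rate * g0 / n
    b1 += learning_rate * g1 / n

g0, g1 = gradient(b0, b1)
print(b0, b1, g0, g1)
```

At convergence both gradient components are numerically zero, so the fitted residuals are orthogonal to the intercept column and to \(x\), as the derivation states.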
| Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
|---|---|---|---|---|
| 4.9 | 3.1 | 1.5 | 0.1 | setosa |
| 5.0 | 3.2 | 1.2 | 0.2 | setosa |
| 4.5 | 2.3 | 1.3 | 0.3 | setosa |
| 5.8 | 2.7 | 3.9 | 1.2 | versicolor |
| 7.7 | 2.8 | 6.7 | 2.0 | virginica |
| 7.2 | 3.2 | 6.0 | 1.8 | virginica |
| 6.3 | 2.7 | 4.9 | 1.8 | virginica |
| 6.4 | 2.7 | 5.3 | 1.9 | virginica |
| 6.0 | 2.2 | 5.0 | 1.5 | virginica |