OLS and Statistical Learning

Update on 2024-09-24

需要用到的python包

Python

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.linear_model import lasso_path
from itertools import cycle
import time
import sys

糖尿病患者病情程度分析

age	sex	bmi	bp	s1	s2	s3	s4	s5	s6	target
59	2	32.1	101	157	93.2	38	4.00	4.8598	87	151
48	1	21.6	87	183	103.2	70	3.00	3.8918	69	75
72	2	30.5	93	156	93.6	41	4.00	4.6728	85	141
24	1	25.3	84	198	131.4	40	5.00	4.8903	89	206
50	1	23.0	101	192	125.4	52	4.00	4.2905	80	135
23	1	22.6	89	139	64.8	61	2.00	4.1897	68	97
36	2	22.0	90	160	99.6	50	3.00	3.9512	82	138
66	2	26.2	114	255	185.0	56	4.55	4.2485	92	63
60	2	32.1	83	179	119.4	42	4.00	4.4773	94	110

OLS模型设置:

建立线性模型: \[ \begin{aligned} \hat{y}_1 =& \beta_0 + x_{11} \beta_1 + x_{12} \beta_2 + \cdots + x_{1j} \beta_j + \cdots x_{1p} \beta_p,\\ \cdots,\\ \hat{y}_i =& \beta_0 + x_{i1} \beta_1 + x_{i2} \beta_2 + \cdots + x_{ij} \beta_j + \cdots x_{ip} \beta_p,\\ \cdots,\\ \hat{y}_n =& \beta_0 + x_{n1} \beta_1 + x_{n2} \beta_2 + \cdots + x_{nj} \beta_j + \cdots x_{np} \beta_p. \end{aligned} \] 矩阵形式: \[ \underbrace{\left[ \begin{array}{c} \hat{y}_1 \\ \hat{y}_2 \\ \cdots \\ \hat{y}_n \end{array} \right]}_{预测向量\\(\rm{predict\ vector})} = \left[ \begin{array}{c} 1 \\ 1 \\ \cdots \\ 1 \end{array} \right] \times \beta_0 + \underbrace{\left[ \begin{array}{ccc} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \cdots \\ x_{n1} & x_{n2} & \cdots & x_{np}\end{array} \right]}_{设计矩阵\\(\rm{design\ matrix})} \times \left[ \begin{array}{c} \beta_1 \\ \beta_2\\ \cdots \\ \beta_p \end{array} \right]. \]

模型求解

\[ \begin{aligned} \hat{\mathbf y} =& \beta_0 * \mathbf 1 + \sum_{j = 1}^{p} \mathbf x_j \beta_j = \mathbf X \boldsymbol \beta. \end{aligned} \] 残差平方和最小(minimize the residual sum of squares): \[ \begin{aligned} RSS(\beta) =& \sum_{i = 1}^{N} (y_i - \hat{y}_i)^2 = (\mathbf y - \hat{\mathbf y})^T(\mathbf y - \hat{\mathbf y}) = (\mathbf y - \mathbf X \boldsymbol \beta)^T(\mathbf y - \mathbf X \boldsymbol \beta) \end{aligned} \]

对于有\(p + 1\)个参数的二次型方程(convexity),对\(\boldsymbol \beta\)求导并令其等于\(\mathbf 0\): \[ \frac{RSS}{\partial \boldsymbol \beta} = -2 \mathbf X^T(\mathbf y - \mathbf X \boldsymbol \beta) = \mathbf 0, \]

可以得到: \[ \hat{\boldsymbol \beta} = (\mathbf X^T \mathbf X)^{-1} \mathbf X^T \mathbf y. \]

OLS回归的性质

由样本估计的模型: \[ \hat{\mathbf y} = \mathbf X \hat{\boldsymbol \beta}, \] 定义残差(error)\(\epsilon\): \[ \mathbf y -\hat{\mathbf y} = \boldsymbol \epsilon, \] 由于 \[ \mathbf 1^T(\mathbf y - \mathbf X \hat{\boldsymbol \beta}) = 0, \] 因此 \[ \sum \epsilon = 0. \] 所以 \[ \mathbf X^T \underbrace{(\mathbf y - \mathbf X \hat{\boldsymbol \beta})}_{\boldsymbol \epsilon} = \mathbf 0 \ -> \hat{\boldsymbol \beta}^T \mathbf X^T \boldsymbol \epsilon = \hat{\mathbf y}^T \boldsymbol \epsilon = 0. \]

判定系数(Coefficient of determination)

对于任意的\(i\), \[ y_i = \hat{y}_i + \epsilon_i, \] 那么 \[ \begin{aligned} \mbox{var} (y_i) =& \mbox{var} (\hat{y}_i + \epsilon_i)\\ =& \mbox{var} (\hat{y}_i) + \mbox{var} (\epsilon_i)\\ =& \mbox{var} (\hat{y}_i) + \mbox{mse}, \end{aligned} \] 由此可以得到判定系数\(R^2\): \[ \begin{aligned} 1 =& \underbrace{\frac{\mbox{var} (\hat{y}_i)}{\mbox{var} (y_i)}}_{R^2} + \frac{\mbox{mse}}{\mbox{var} (y_i)}. \end{aligned} \]

真实模型(True model)

我们观测不到的,具有确定关系的模型(真实模型): \[ \mathbf t = \mathbf X \boldsymbol \beta, \] 我们可以观测到的,带有不确定性的样本(观测值): \[ \mathbf y = \mathbf X \boldsymbol \beta + \mathbf e = \mathbf t + \mathbf e, \] 其中误差(noise)满足: \(\mathbb E(e) = 0\), 即零均值假设, 同时\(\mathbf X^T \mathbf e = 0\), 即独立性假设 (least square assumption). 观测值与真实值之间具有如下关系: \[ t_i = \mathbb E[y_i | \mathbf x_i]. \]

一个例子:

\[ \left[ \begin{array} {cccccc} \mathbf x : & 18 & 19 & 20 & 21 & 22\\ \mathbf t : & 1800 & 1900 & 2000 & 2100 & 2200\\ \mathbf y : & 1750 & 1950 & 2050 & 2200 & 2300\\ \hat{\mathbf y} : & 1780 & 1915 & 2050 & 2185 & 2320\\ \end{array} \right] \] 估计模型:\(\hat{\mathbf y} = -650 + 135 \mathbf x.\)

残差的秘密

对于任意的\(\mathbf x_i\), 需要预测的观测值\(y_{ie}\)(满足\(\mathbb E_e[y_{ie}] = t_i\) 和 \(\mathbb E_e[y_{ie}] = \hat{y}_i\))与预测值\(\hat{y}_i\)之间的差异由什么决定? 对于任意的训练集\(d\)(对于\(K\)-fold-cv, \(d = 1, \cdots,\ K\)), 测试集样本对应的残差平方\((\hat{y}_{d} - y_{de})^2\)关于\(d\)的期望满足:

\[ \begin{aligned} \mathbb E_d \big[\mathbb E_{ie} [(\hat{y}_{di} - y_{die})^2| \mathbf x_{di}]\big] =& \mathbb E_d \big[\mathbb E_{ie} [(\hat{y}_{di} - t_{di} + t_{di} - y_{die})^2]\big]\\ =& \mathbb E_d \big[\mathbb E_{i} [(\hat{y}_{di} - t_{di})^2] + \mathbb E_{ie} [(t_{di} - y_{die})^2] + 2 \underbrace{\mathbb E_{ie} [(\hat{y}_{di} - t_{di})(t_{di} - y_{die})]}_0\big] \\ =& \mathbb E_{di} [(\hat{y}_{di} - \mathbb E_{di} [\hat{y}_{di}| \mathbf x_{di}] + \mathbb E_{di} [\hat{y}_{di} | \mathbf x_{di}] - t_{di})^2] + \mathbb E_{die}[(t_{di} - y_{die})^2]\\ =& \mathbb E_{di} [(\hat{y}_{di} - \mathbb E_{di} [\hat{y}_{di} | \mathbf x_{di}])^2] + \mathbb E_{di} [(\mathbb E_{di} [\hat{y}_{di} | \mathbf x_{di}] - t_{di})^2] + \mathbb E_{die} [(t_{di} - y_{die})^2] \\ &+ 2 \underbrace{ \mathbb E_{di} [\big(\hat{y}_{di} - \mathbb E_{di} [\hat{y}_{di}| \mathbf x_{di}]\big) \big( \mathbb E_{di} [\hat{y}_{di} | \mathbf x_{di}] - t_{di} \big)]}_{0}. \end{aligned} \]

\[ \begin{aligned} \underbrace{\mathbb E_{die} [(\hat{y}_{di} - y_{die})^2| \mathbf x_{di}]}_{测试集均方误差\rm{MSE}} =& \underbrace{\mathbb E_{di} \big[(\hat{y}_{di} - \mathbb E_{di} [\hat{y}_{di} | \mathbf x_{di}])^2 \big]}_{方差(\rm variance)} + \underbrace{\mathbb E_{di} \big[(\mathbb E_{di} [\hat{y}_{di} | \mathbf x_{di}] - t_{di})^2 \big]}_{偏差(\rm bias)} \big] \\ &+ \underbrace{\mathbb E_{die}[(t_{di} - y_{die})^2 | \mathbf x_{di}]}_{误差(\rm noise)}. \end{aligned} \]

OLS的统计性质

对于OLS, 其偏差: \[ \begin{aligned} &(\mathbb E[\hat{\mathbf y} | \mathbf X] - \mathbf t)^2\\ =& (\mathbb E[\mathbf X \hat{\boldsymbol \beta}] - \mathbf t)^2\\ =& (\mathbb E[\mathbf X (\mathbf X^T \mathbf X)^{-1} \mathbf X^T \mathbf y] - \mathbf t)^2\\ =& (\mathbb E[\mathbf X (\mathbf X^T \mathbf X)^{-1} \mathbf X^T (\mathbf t + \mathbf e)] - \mathbf t)^2\\ =& (\mathbf X \boldsymbol \beta - \mathbf t)^2 = \mathbf 0. \end{aligned} \]

事实上, OLS具有:1、无偏性; 2、最小方差; 3、一致性; 4、有效性!

对于模型方差的理解, 可以看作不同训练集训练出的算法, 预测值的方差:

\[ \mathbb E_{di}[(\hat{y}_{di} - \mathbb E_{di}[\hat{y}_{di}])^2 | \mathbf x_{di}], \]

模型方差

OLS的局限: 高维数据时代

对于 \[ \hat{\boldsymbol \beta} = (\mathbf X^T \mathbf X)^{-1} \mathbf X^T \mathbf y. \] 其中\(\mathbf X \in \mathbb R^{n \times p}\), \(\mathbf X\)的秩的最大值为\(\mbox{min}\{n, p\}\). 由于\(\mathbf X^T \mathbf X\) 是\(p\)维矩阵, 当\(p > n\)的时候, \(\mathbf X^T \mathbf X\)不再是满秩的(low rank).

Python

a = np.random.standard_normal([10000, 10000])
print("The ram size of a matrix with 10000 dimension is %.2f MB" %(sys.getsizeof(a)/1000/1000))

## The ram size of a matrix with 10000 dimension is 800.00 MB

高维数据困境

Python

start = time.time()
b = np.dot(a, a.T)
end = time.time()
time_all = end - start
print("The time of matrix multiplication: %.2f s" %time_all)

## The time of matrix multiplication: 1.71 s

##
start = time.time()
inv = np.linalg.inv(b)
end = time.time()
time_all = end - start
print("The time of Matrix inversion: %.2f s" %time_all)

## The time of Matrix inversion: 12.36 s