线性回归

线性模型简单意义上来说就是假定因变量\(y\)和自变量\(x\)之间的关系可以用线性关系来近似。一般模型的形式为 \[ \begin{equation*} y = \beta_0 + \beta_1 x + \epsilon . \end{equation*} \]

自变量为多个数量变量的情况

如果由\(p\)个数量的自变量\(x_1,x_2,...,x_p\)，则线性模型的一般表达式为 \[ y=\beta_0+\beta_1x_1+\beta_2x_2+...+\beta_px_p+\epsilon. \]

如果数据有\(n\)个观测值，为\((y_1,x_{11},...,x_{1p}),(y_2,x_{21},...,x_{2p}),...,(y_n,x_{n1},...,x_{np})\)，那么对应的线性模型为 \[ y_i = \beta_0 + \beta_1x_{i1} + \beta_2x_{i2} + ...+ \beta_px_{ip} + \epsilon_i,\quad i=1,2,...,n. \] 用\(\bf X\)表示矩阵 \[ \bf X = \begin{pmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1p} \\ 1 & x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \vdots & \ddots & \vdots\\ 1 & x_{n1} & x_{n2} & \cdots & x_{np}\\ \end{pmatrix} \]

记\(\bf\beta = (\beta_0,\beta_1,...,\beta_p)^{T}\)，\(\bf y\)的均值向量为\(\bf\mu =(\mu_1,\mu_2,...,\mu_n)^{T}\)，则相应的线性模型可以写成 \[ \bf y =\bf\mu + \bf\epsilon = \bf X\bf\beta+\bf\epsilon. \] 那么均值向量为 \[ \bf\mu = E(\bf y) = \bf X\bf\beta \] ### 最小二乘法估计线性模型假定一个自变量模型为 \[ y_i = \beta_0 + \beta_1x_i + \epsilon_i, \quad i = 1,2,...,n. \] 我们的目的就是运用已知的n个数据来估计未知系数的值，也就是试图找到\(\hat y = \beta_o + \beta_1\hat x\)，使得因变量\(y_i\)与该直线的残差最小。在这里我们不能简单的求和，因为残差有正有负，可能会抵消。所以我们在这里选用成为损失函数的凸函数\(\rho()\)，并且使得下面式子最小的参数\(\beta_0\)和\(\beta_1\)： \[ S = \sum^{n}_{i=1}\rho(y_i-(\beta_0+\beta_1x_i)) \] 最经典的是\(\rho()\)为二次函数，这时上式变为 \[ S = \sum^{n}_{i=1}(y_i-(\beta_0+\beta_1x_i))^2 \] 为了估计\(\beta_0\)和\(\beta_1\)，我们需要对\(S\)关于\(\beta_0\)和\(\beta_1\)求偏导数，得到normal equation \[ \frac{\partial S}{\partial \beta_0} = -2\sum^{n}_{i=1}(y_i-\beta_0-\beta_1x_i) = 0\\ \frac{\partial S}{\partial \beta_1} = -2\sum^{n}_{i=1}(y_i-\beta_0-\beta_1x_i)x_i = 0 \]

解得 \[ \hat\beta_1 = \frac{\sum_{i=1}^{n}(x_i-\bar x)(y_i-\bar y)}{\sum_{i=1}^{n}(x_i-\bar x)^2},\quad \hat\beta_0 = \bar y-\hat\beta_1\bar x. \] 得到\(\hat y_i = \hat\beta_0+\hat\beta_0x_i\)。残差为\(\epsilon_i = y_i-\hat y_i\)。例如

library(ggplot2)

## Warning: 程辑包'ggplot2'是用R版本4.1.3 来建造的

library(cowplot)

## Warning: 程辑包'cowplot'是用R版本4.1.3 来建造的

library(MASS)
library(data.table)

## Warning: 程辑包'data.table'是用R版本4.1.3 来建造的

par(mfrow=c(1,2))
plot(dist~speed, cars)
plot(
  dist^.4~speed, cars,
  ylab = expression(dist^.4)
)

a = lm(dist^.4~speed, cars)
a$coefficients

## (Intercept)       speed 
##   1.4823236   0.1822545

从图中我们可以发现右图比左图更加线性。

n = nrow(cars)
a = lm(dist^.4~speed, cars)
plot(
  dist^.4~speed, cars, ylab = expression(dist^0.4), cex=1.5
)
abline(a)
for (i in 1:n)
  segments(cars[i,1], cars[i,2]^.4, cars[i,1], a$fitted[i])

anova(a)

summary(a)

## 
## Call:
## lm(formula = dist^0.4 ~ speed, data = cars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.1274 -0.3846 -0.1140  0.2936  1.7369 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.48232    0.27135   5.463 1.64e-06 ***
## speed        0.18225    0.01668  10.925 1.29e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6175 on 48 degrees of freedom
## Multiple R-squared:  0.7132, Adjusted R-squared:  0.7072 
## F-statistic: 119.4 on 1 and 48 DF,  p-value: 1.294e-14

多个自变量的情况

模型为 \[ \bf y = \bf\mu +\bf\epsilon = X \bf\beta + \bf\epsilon \] 在这里最小二乘法只需要最小化\((\bf y - \bf X\bf\beta)^T(\bf y - \bf X\bf\beta)\)，得 \[ \min_\bf\beta \{(\bf y - \bf X\bf\beta)^T(\bf y - \bf X\bf\beta)\}=\min_\bf\beta\{ \bf y^Ty - 2y^TX\beta +\beta^TX^TX\beta\}, \] normal equation 是 \[ \frac{\partial(\bf y^Ty - 2y^TX\beta +\beta^TX^TX\beta)}{\partial \beta} = -2X^T\bf y +2X^TX\beta = 0 \] 从而得到 \[ \bf{\hat \beta = (X^TX)^{-1}X^Ty}\\ \bf{\hat y = X\hat\beta = X(X^TX)^{-1}X^Ty} \]

plot(rock)

用\(y\)表示perm，用\(x_1,x_2,x_3\)来表示area，peri，shape，则可以得到线性回归模型为 \[ y_i = \beta_0+\beta_1x_{i1}+\beta_3x_{i3} + \epsilon_i,\quad i=1,2,...,48 \]

a = lm(perm~.,rock)
summary(a)

## 
## Call:
## lm(formula = perm ~ ., data = rock)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -750.26  -59.57   10.66  100.25  620.91 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 485.61797  158.40826   3.066 0.003705 ** 
## area          0.09133    0.02499   3.654 0.000684 ***
## peri         -0.34402    0.05111  -6.731 2.84e-08 ***
## shape       899.06926  506.95098   1.773 0.083070 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 246 on 44 degrees of freedom
## Multiple R-squared:  0.7044, Adjusted R-squared:  0.6843 
## F-statistic: 34.95 on 3 and 44 DF,  p-value: 1.033e-11

多个模型相比较得到相对较好的模型

孤立看每个模型，从各种显著性检验来看，似乎都是合格的模型，这里除了p值以外，我们还需要计算AIC值来确定模型的合适程度。 \[ AIC = 2k - 2log(\mathcal{L}(\bf\hat\theta)) \] k是参数个数是一个惩罚项，\(\mathcal{L}(\bf\hat\theta)\)是最大似然值，AIC越小越好，但是使用AIC的重要条件是知道或假定数据的分布，否则无法计算。

a1 = lm(rate~conc, Puromycin)
a2 = lm(rate ~ ., Puromycin)
a3 = lm(rate ~conc * state, Puromycin)
a4 = lm(rate ~ log(conc), Puromycin)
a5 = lm(rate ~ log(conc) + state, Puromycin)
a6 = lm(rate~log(conc) * state, Puromycin)
shapiro.test(a1$res)

## 
##  Shapiro-Wilk normality test
## 
## data:  a1$res
## W = 0.96279, p-value = 0.522

shapiro.test(a2$res)

## 
##  Shapiro-Wilk normality test
## 
## data:  a2$res
## W = 0.94964, p-value = 0.2876

shapiro.test(a3$res)

## 
##  Shapiro-Wilk normality test
## 
## data:  a3$res
## W = 0.9469, p-value = 0.2521

shapiro.test(a4$res)

## 
##  Shapiro-Wilk normality test
## 
## data:  a4$res
## W = 0.97193, p-value = 0.7351

shapiro.test(a5$res)

## 
##  Shapiro-Wilk normality test
## 
## data:  a5$res
## W = 0.96604, p-value = 0.5951

shapiro.test(a6$res)

## 
##  Shapiro-Wilk normality test
## 
## data:  a6$res
## W = 0.95557, p-value = 0.38

AIC(a1,a2,a3,a4,a5,a6)

多重线性回归

我们知道，最小二乘法回归时估计参数的公式为 \[ \bf{\hat\beta = (X^TX)^{}-1X^TY} \] 但是当矩阵\(\bf X\)代表自变量的各个列向量线性相关时，\((\bf{X^TX})^{-1}\)不存在。有一些关于多重共线性的度量，其中之一是容忍度或者方差膨胀因子（VIF），而另一个是条件数，常用\(\kappa\)表示。 \[ tolerance = 1-R^{2}_j,\qquad VIF_j = \frac{1}{1-R^{2}_j} \] 容忍度小，或者VIF大时，被认为有多重共线性问题。条件数的公式为： \[ \kappa = \sqrt{\frac{\lambda_{max}}{\lambda_{min}}} \] 式子中，\(\lambda\)是\(\bf{X^TX}\)的特征值。自变量矩阵正交时，条件数\(\kappa\)为1，有些学者认为，当\(\kappa>15\)时，存在共线性问题。

广义线性模型

广义线性模型由三部分组成

（1）：随机部分，即变量所属的指数分布族。

（2）：线性部分，即\(\eta = \bf{x'\beta}\)。

（3）：连接函数。

\[ \begin{matrix} 分布 & 连接函数在 \mathrm{R} 中的名字 & 连接函数 g(\mu) & 均值函数 m(\eta) \\ 正㚐 (高斯) & identity & x^{\top} \beta=\mu & \mu=\boldsymbol{x}^{\top} \boldsymbol{\beta} \\ 指数 & inverse & x^{\top} \beta=-\mu^{-1} & \mu=-\left(\boldsymbol{x}^{\top} \boldsymbol{\beta}\right)^{-1}\\ Gamma & inverse & x^{\top} \beta=-\mu^{-1} & \mu=-\left(\boldsymbol{x}^{\top} \boldsymbol{\beta}\right)^{-1}\\ 逆高斯 & 1 / \mathrm{mu}^2 & x^{\top} \beta=-\mu^{-2} & \mu=\left(-\boldsymbol{x}^{\top} \boldsymbol{\beta}\right)^{-1 / 2}\\ Poisson & log & x^{\top} \beta=\log (\mu) & \mu=\exp \left(\boldsymbol{x}^{\top} \boldsymbol{\beta}\right)\\ 二项 & logit & x^{\top} \beta=\log \left(\frac{\mu}{1-\mu}\right) & \mu=\frac{\exp \left(\boldsymbol{x}^{\top} \boldsymbol{\beta}\right)}{1+\exp \left(\boldsymbol{x}^{\top} \boldsymbol{\beta}\right)} \\ \end{matrix} \] ## 方差分析方差分析时GLM模型的具体应用，适用于正态连续分布因变量和离散或者分类型解释变量。

library(data.table)
library(ggplot2)
library(ez)

## Warning: 程辑包'ez'是用R版本4.1.3 来建造的

set.seed(12345)
example <- data.table(
  y = rnorm(9),
  Condition = factor(rep(c("A","B","Control"), each = 3),
                     levels = c("Control", "A", "B"))
)
coef(lm(y~ Condition, data=example))

## (Intercept)  ConditionA  ConditionB 
##  0.02325157  0.37197894 -0.57844013

example[,.(M = mean(y)), by = Condition]

model.matrix(~ 0 + Condition, data = example)

##   ConditionControl ConditionA ConditionB
## 1                0          1          0
## 2                0          1          0
## 3                0          1          0
## 4                0          0          1
## 5                0          0          1
## 6                0          0          1
## 7                1          0          0
## 8                1          0          0
## 9                1          0          0
## attr(,"assign")
## [1] 1 1 1
## attr(,"contrasts")
## attr(,"contrasts")$Condition
## [1] "contr.treatment"

coef(lm(y~ 0 + Condition, data = example))

## ConditionControl       ConditionA       ConditionB 
##       0.02325157       0.39523051      -0.55518856

pf(.72, df1 = 1, df2 = 6, lower.tail = FALSE)

## [1] 0.4286915

在下面的代码中，我们用一个很小的样本数据集测试治疗方案的总体表现，并输出统计检验Levene检验。

example[, ID := factor(1:.N)]

print(
  ezANOVA(
    data = example,
    dv = y,
    wid = ID, between = Condition,
    type = 3,
    detailed = TRUE
  )
)

## Coefficient covariances computed by hccm()

## $ANOVA
##        Effect DFn DFd        SSn      SSd        F         p p<.05         ges
## 1 (Intercept)   1   6 0.01868866 3.894959 0.028789 0.8708434       0.004775255
## 2   Condition   2   6 1.37625772 3.894959 1.060030 0.4034373       0.261089210
## 
## $`Levene's Test for Homogeneity of Variance`
##   DFn DFd       SSn      SSd         F         p p<.05
## 1   2   6 0.5404698 1.843185 0.8796782 0.4623562

下面我们使用mtcars数据集及其交互作用来说明ANOVA模型，我们只需要增加个体交互作用项，其余不变。

library(JWileymisc)

## Warning: 程辑包'JWileymisc'是用R版本4.1.3 来建造的

## Warning in .recacheSubclasses(def@className, def, env): undefined subclass
## "packedMatrix" of class "replValueSp"; definition not updated

## Warning in .recacheSubclasses(def@className, def, env): undefined subclass
## "packedMatrix" of class "mMatrix"; definition not updated

mtcars <- as.data.table(mtcars)
mtcars[, ID := factor(1:.N)]
mtcars[, vs := factor(vs)]
mtcars[, am := factor(am)]
print(ezANOVA(
  data = mtcars,
  dv = mpg,
  wid = ID,
  between = vs*am,
  type = 3,
  detailed = TRUE)
)

## Warning: Data is unbalanced (unequal N per group). Make sure you specified a
## well-considered value for the type argument to ezANOVA().

## Coefficient covariances computed by hccm()

## $ANOVA
##        Effect DFn DFd         SSn      SSd          F            p p<.05
## 1 (Intercept)   1  28 13144.33371 337.4764 1090.56904 5.735238e-24     *
## 2          vs   1  28   382.47771 337.4764   31.73370 4.931462e-06     *
## 3          am   1  28   283.72152 337.4764   23.54002 4.158925e-05     *
## 4       vs:am   1  28    16.00952 337.4764    1.32829 2.588547e-01      
##          ges
## 1 0.97496802
## 2 0.53125288
## 3 0.45673287
## 4 0.04529041
## 
## $`Levene's Test for Homogeneity of Variance`
##   DFn DFd      SSn      SSd         F         p p<.05
## 1   3  28 14.68171 155.5505 0.8809314 0.4628611

mtcars[, Cells := factor(sprintf("vs=%s, am=%s", vs, am))]
TukeyHSDgg("Cells", "hp", mtcars) + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1)) + xlab("")

Untitled