线性模型简单意义上来说就是假定因变量\(y\)和自变量\(x\)之间的关系可以用线性关系来近似。一般模型的形式为 \[ \begin{equation*} y = \beta_0 + \beta_1 x + \epsilon . \end{equation*} \]
如果由\(p\)个数量的自变量\(x_1,x_2,...,x_p\),则线性模型的一般表达式为 \[ y=\beta_0+\beta_1x_1+\beta_2x_2+...+\beta_px_p+\epsilon. \]
如果数据有\(n\)个观测值,为\((y_1,x_{11},...,x_{1p}),(y_2,x_{21},...,x_{2p}),...,(y_n,x_{n1},...,x_{np})\),那么对应的线性模型为 \[ y_i = \beta_0 + \beta_1x_{i1} + \beta_2x_{i2} + ...+ \beta_px_{ip} + \epsilon_i,\quad i=1,2,...,n. \] 用\(\bf X\)表示矩阵 \[ \bf X = \begin{pmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1p} \\ 1 & x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \vdots & \ddots & \vdots\\ 1 & x_{n1} & x_{n2} & \cdots & x_{np}\\ \end{pmatrix} \]
记\(\bf\beta = (\beta_0,\beta_1,...,\beta_p)^{T}\),\(\bf y\)的均值向量为\(\bf\mu =(\mu_1,\mu_2,...,\mu_n)^{T}\),则相应的线性模型可以写成 \[ \bf y =\bf\mu + \bf\epsilon = \bf X\bf\beta+\bf\epsilon. \] 那么均值向量为 \[ \bf\mu = E(\bf y) = \bf X\bf\beta \] ### 最小二乘法估计线性模型 假定一个自变量模型为 \[ y_i = \beta_0 + \beta_1x_i + \epsilon_i, \quad i = 1,2,...,n. \] 我们的目的就是运用已知的n个数据来估计未知系数的值,也就是试图找到\(\hat y = \beta_o + \beta_1\hat x\),使得因变量\(y_i\)与该直线的残差最小。在这里我们不能简单的求和,因为残差有正有负,可能会抵消。所以我们在这里选用成为损失函数的凸函数\(\rho()\),并且使得下面式子最小的参数\(\beta_0\)和\(\beta_1\): \[ S = \sum^{n}_{i=1}\rho(y_i-(\beta_0+\beta_1x_i)) \] 最经典的是\(\rho()\)为二次函数,这时上式变为 \[ S = \sum^{n}_{i=1}(y_i-(\beta_0+\beta_1x_i))^2 \] 为了估计\(\beta_0\)和\(\beta_1\),我们需要对\(S\)关于\(\beta_0\)和\(\beta_1\)求偏导数,得到normal equation \[ \frac{\partial S}{\partial \beta_0} = -2\sum^{n}_{i=1}(y_i-\beta_0-\beta_1x_i) = 0\\ \frac{\partial S}{\partial \beta_1} = -2\sum^{n}_{i=1}(y_i-\beta_0-\beta_1x_i)x_i = 0 \]
解得 \[ \hat\beta_1 = \frac{\sum_{i=1}^{n}(x_i-\bar x)(y_i-\bar y)}{\sum_{i=1}^{n}(x_i-\bar x)^2},\quad \hat\beta_0 = \bar y-\hat\beta_1\bar x. \] 得到\(\hat y_i = \hat\beta_0+\hat\beta_0x_i\)。残差为\(\epsilon_i = y_i-\hat y_i\)。 例如
library(ggplot2)
## Warning: 程辑包'ggplot2'是用R版本4.1.3 来建造的
library(cowplot)
## Warning: 程辑包'cowplot'是用R版本4.1.3 来建造的
library(MASS)
library(data.table)
## Warning: 程辑包'data.table'是用R版本4.1.3 来建造的
par(mfrow=c(1,2))
plot(dist~speed, cars)
plot(
dist^.4~speed, cars,
ylab = expression(dist^.4)
)
a = lm(dist^.4~speed, cars)
a$coefficients
## (Intercept) speed
## 1.4823236 0.1822545
从图中我们可以发现右图比左图更加线性。
n = nrow(cars)
a = lm(dist^.4~speed, cars)
plot(
dist^.4~speed, cars, ylab = expression(dist^0.4), cex=1.5
)
abline(a)
for (i in 1:n)
segments(cars[i,1], cars[i,2]^.4, cars[i,1], a$fitted[i])
anova(a)
summary(a)
##
## Call:
## lm(formula = dist^0.4 ~ speed, data = cars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.1274 -0.3846 -0.1140 0.2936 1.7369
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.48232 0.27135 5.463 1.64e-06 ***
## speed 0.18225 0.01668 10.925 1.29e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6175 on 48 degrees of freedom
## Multiple R-squared: 0.7132, Adjusted R-squared: 0.7072
## F-statistic: 119.4 on 1 and 48 DF, p-value: 1.294e-14
模型为 \[ \bf y = \bf\mu +\bf\epsilon = X \bf\beta + \bf\epsilon \] 在这里最小二乘法只需要最小化\((\bf y - \bf X\bf\beta)^T(\bf y - \bf X\bf\beta)\),得 \[ \min_\bf\beta \{(\bf y - \bf X\bf\beta)^T(\bf y - \bf X\bf\beta)\}=\min_\bf\beta\{ \bf y^Ty - 2y^TX\beta +\beta^TX^TX\beta\}, \] normal equation 是 \[ \frac{\partial(\bf y^Ty - 2y^TX\beta +\beta^TX^TX\beta)}{\partial \beta} = -2X^T\bf y +2X^TX\beta = 0 \] 从而得到 \[ \bf{\hat \beta = (X^TX)^{-1}X^Ty}\\ \bf{\hat y = X\hat\beta = X(X^TX)^{-1}X^Ty} \]
plot(rock)
用\(y\)表示perm,用\(x_1,x_2,x_3\)来表示area,peri,shape,则可以得到线性回归模型为 \[ y_i = \beta_0+\beta_1x_{i1}+\beta_3x_{i3} + \epsilon_i,\quad i=1,2,...,48 \]
a = lm(perm~.,rock)
summary(a)
##
## Call:
## lm(formula = perm ~ ., data = rock)
##
## Residuals:
## Min 1Q Median 3Q Max
## -750.26 -59.57 10.66 100.25 620.91
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 485.61797 158.40826 3.066 0.003705 **
## area 0.09133 0.02499 3.654 0.000684 ***
## peri -0.34402 0.05111 -6.731 2.84e-08 ***
## shape 899.06926 506.95098 1.773 0.083070 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 246 on 44 degrees of freedom
## Multiple R-squared: 0.7044, Adjusted R-squared: 0.6843
## F-statistic: 34.95 on 3 and 44 DF, p-value: 1.033e-11
孤立看每个模型,从各种显著性检验来看,似乎都是合格的模型,这里除了p值以外,我们还需要计算AIC值来确定模型的合适程度。 \[ AIC = 2k - 2log(\mathcal{L}(\bf\hat\theta)) \] k是参数个数是一个惩罚项,\(\mathcal{L}(\bf\hat\theta)\)是最大似然值,AIC越小越好,但是使用AIC的重要条件是知道或假定数据的分布,否则无法计算。
a1 = lm(rate~conc, Puromycin)
a2 = lm(rate ~ ., Puromycin)
a3 = lm(rate ~conc * state, Puromycin)
a4 = lm(rate ~ log(conc), Puromycin)
a5 = lm(rate ~ log(conc) + state, Puromycin)
a6 = lm(rate~log(conc) * state, Puromycin)
shapiro.test(a1$res)
##
## Shapiro-Wilk normality test
##
## data: a1$res
## W = 0.96279, p-value = 0.522
shapiro.test(a2$res)
##
## Shapiro-Wilk normality test
##
## data: a2$res
## W = 0.94964, p-value = 0.2876
shapiro.test(a3$res)
##
## Shapiro-Wilk normality test
##
## data: a3$res
## W = 0.9469, p-value = 0.2521
shapiro.test(a4$res)
##
## Shapiro-Wilk normality test
##
## data: a4$res
## W = 0.97193, p-value = 0.7351
shapiro.test(a5$res)
##
## Shapiro-Wilk normality test
##
## data: a5$res
## W = 0.96604, p-value = 0.5951
shapiro.test(a6$res)
##
## Shapiro-Wilk normality test
##
## data: a6$res
## W = 0.95557, p-value = 0.38
AIC(a1,a2,a3,a4,a5,a6)
我们知道,最小二乘法回归时估计参数的公式为 \[ \bf{\hat\beta = (X^TX)^{}-1X^TY} \] 但是当矩阵\(\bf X\)代表自变量的各个列向量线性相关时,\((\bf{X^TX})^{-1}\)不存在。有一些关于多重共线性的度量,其中之一是容忍度或者方差膨胀因子(VIF),而另一个是条件数,常用\(\kappa\)表示。 \[ tolerance = 1-R^{2}_j,\qquad VIF_j = \frac{1}{1-R^{2}_j} \] 容忍度小,或者VIF大时,被认为有多重共线性问题。条件数的公式为: \[ \kappa = \sqrt{\frac{\lambda_{max}}{\lambda_{min}}} \] 式子中,\(\lambda\)是\(\bf{X^TX}\)的特征值。自变量矩阵正交时,条件数\(\kappa\)为1,有些学者认为,当\(\kappa>15\)时,存在共线性问题。
广义线性模型由三部分组成
(1):随机部分,即变量所属的指数分布族。
(2):线性部分,即\(\eta = \bf{x'\beta}\)。
(3):连接函数。
\[ \begin{matrix} 分布 & 连接函数在 \mathrm{R} 中的名字 & 连接函数 g(\mu) & 均值函数 m(\eta) \\ 正㚐 (高斯) & identity & x^{\top} \beta=\mu & \mu=\boldsymbol{x}^{\top} \boldsymbol{\beta} \\ 指数 & inverse & x^{\top} \beta=-\mu^{-1} & \mu=-\left(\boldsymbol{x}^{\top} \boldsymbol{\beta}\right)^{-1}\\ Gamma & inverse & x^{\top} \beta=-\mu^{-1} & \mu=-\left(\boldsymbol{x}^{\top} \boldsymbol{\beta}\right)^{-1}\\ 逆高斯 & 1 / \mathrm{mu}^2 & x^{\top} \beta=-\mu^{-2} & \mu=\left(-\boldsymbol{x}^{\top} \boldsymbol{\beta}\right)^{-1 / 2}\\ Poisson & log & x^{\top} \beta=\log (\mu) & \mu=\exp \left(\boldsymbol{x}^{\top} \boldsymbol{\beta}\right)\\ 二项 & logit & x^{\top} \beta=\log \left(\frac{\mu}{1-\mu}\right) & \mu=\frac{\exp \left(\boldsymbol{x}^{\top} \boldsymbol{\beta}\right)}{1+\exp \left(\boldsymbol{x}^{\top} \boldsymbol{\beta}\right)} \\ \end{matrix} \] ## 方差分析 方差分析时GLM模型的具体应用,适用于正态连续分布因变量和离散或者分类型解释变量。
library(data.table)
library(ggplot2)
library(ez)
## Warning: 程辑包'ez'是用R版本4.1.3 来建造的
set.seed(12345)
example <- data.table(
y = rnorm(9),
Condition = factor(rep(c("A","B","Control"), each = 3),
levels = c("Control", "A", "B"))
)
coef(lm(y~ Condition, data=example))
## (Intercept) ConditionA ConditionB
## 0.02325157 0.37197894 -0.57844013
example[,.(M = mean(y)), by = Condition]
model.matrix(~ 0 + Condition, data = example)
## ConditionControl ConditionA ConditionB
## 1 0 1 0
## 2 0 1 0
## 3 0 1 0
## 4 0 0 1
## 5 0 0 1
## 6 0 0 1
## 7 1 0 0
## 8 1 0 0
## 9 1 0 0
## attr(,"assign")
## [1] 1 1 1
## attr(,"contrasts")
## attr(,"contrasts")$Condition
## [1] "contr.treatment"
coef(lm(y~ 0 + Condition, data = example))
## ConditionControl ConditionA ConditionB
## 0.02325157 0.39523051 -0.55518856
pf(.72, df1 = 1, df2 = 6, lower.tail = FALSE)
## [1] 0.4286915
在下面的代码中,我们用一个很小的样本数据集测试治疗方案的总体表现,并输出统计检验Levene检验。
example[, ID := factor(1:.N)]
print(
ezANOVA(
data = example,
dv = y,
wid = ID, between = Condition,
type = 3,
detailed = TRUE
)
)
## Coefficient covariances computed by hccm()
## $ANOVA
## Effect DFn DFd SSn SSd F p p<.05 ges
## 1 (Intercept) 1 6 0.01868866 3.894959 0.028789 0.8708434 0.004775255
## 2 Condition 2 6 1.37625772 3.894959 1.060030 0.4034373 0.261089210
##
## $`Levene's Test for Homogeneity of Variance`
## DFn DFd SSn SSd F p p<.05
## 1 2 6 0.5404698 1.843185 0.8796782 0.4623562
下面我们使用mtcars数据集及其交互作用来说明ANOVA模型,我们只需要增加个体交互作用项,其余不变。
library(JWileymisc)
## Warning: 程辑包'JWileymisc'是用R版本4.1.3 来建造的
## Warning in .recacheSubclasses(def@className, def, env): undefined subclass
## "packedMatrix" of class "replValueSp"; definition not updated
## Warning in .recacheSubclasses(def@className, def, env): undefined subclass
## "packedMatrix" of class "mMatrix"; definition not updated
mtcars <- as.data.table(mtcars)
mtcars[, ID := factor(1:.N)]
mtcars[, vs := factor(vs)]
mtcars[, am := factor(am)]
print(ezANOVA(
data = mtcars,
dv = mpg,
wid = ID,
between = vs*am,
type = 3,
detailed = TRUE)
)
## Warning: Data is unbalanced (unequal N per group). Make sure you specified a
## well-considered value for the type argument to ezANOVA().
## Coefficient covariances computed by hccm()
## $ANOVA
## Effect DFn DFd SSn SSd F p p<.05
## 1 (Intercept) 1 28 13144.33371 337.4764 1090.56904 5.735238e-24 *
## 2 vs 1 28 382.47771 337.4764 31.73370 4.931462e-06 *
## 3 am 1 28 283.72152 337.4764 23.54002 4.158925e-05 *
## 4 vs:am 1 28 16.00952 337.4764 1.32829 2.588547e-01
## ges
## 1 0.97496802
## 2 0.53125288
## 3 0.45673287
## 4 0.04529041
##
## $`Levene's Test for Homogeneity of Variance`
## DFn DFd SSn SSd F p p<.05
## 1 3 28 14.68171 155.5505 0.8809314 0.4628611
mtcars[, Cells := factor(sprintf("vs=%s, am=%s", vs, am))]
TukeyHSDgg("Cells", "hp", mtcars) +
theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1)) + xlab("")