本文档描述建立Linear Regression模型的方法.
# 载入所需包
library(data.table)
library(corrplot)
library(car)
library(gvlma)
# 读入数据
# 使用数据集state.x77
data <- as.data.frame(state.x77[, c("Murder", "Population", "Illiteracy", "Income", "Frost")])
数据概览:
因变量
Murder: 谋杀率
自变量
Population: 人口
Illiteracy: 文盲率
Income: 平均收入
Frost: 结霜天数
head(data)
Murder Population Illiteracy Income Frost
Alabama 15.1 3615 2.1 3624 20
Alaska 11.3 365 1.5 6315 152
Arizona 7.8 2212 1.8 4530 15
Arkansas 10.1 2110 1.9 3378 65
California 10.3 21198 1.1 5114 20
Colorado 6.8 2541 0.7 4884 166
corrplot.mixed(cor(data))
fit <- lm(Murder ~ Population + Illiteracy + Income + Frost, data = data)
summary(fit)
Call:
lm(formula = Murder ~ Population + Illiteracy + Income + Frost,
data = data)
Residuals:
Min 1Q Median 3Q Max
-4.7960 -1.6495 -0.0811 1.4815 7.6210
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.23456341 3.86611474 0.319 0.7510
Population 0.00022368 0.00009052 2.471 0.0173 *
Illiteracy 4.14283659 0.87435319 4.738 0.0000219 ***
Income 0.00006442 0.00068370 0.094 0.9253
Frost 0.00058131 0.01005366 0.058 0.9541
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.535 on 45 degrees of freedom
Multiple R-squared: 0.567, Adjusted R-squared: 0.5285
F-statistic: 14.73 on 4 and 45 DF, p-value: 0.00000009133
# 置信度为95%的置信区间
confint(fit, level = 0.95)
2.5 % 97.5 %
(Intercept) -6.55219139253 9.0213182149
Population 0.00004136397 0.0004059867
Illiteracy 2.38179886128 5.9038743192
Income -0.00131261059 0.0014414600
Frost -0.01966780591 0.0208304170
截距,Icome和Frost的置信区间都包含0,表示在置信度为95%情况下,无法拒绝这些变量为0的原假设.
# 图形诊断
par(mfrow = c(2, 2))
plot(fit)
左上:残差图,检验方差残差是否与因变量有关.
右上:Q-Q图,检验残差是否为标准正态分布.
左下:位置尺度图,检验方差是否齐性.
右下:残差-杠杆图,鉴别离群点,高杠杆值点和强影响点.
# 使用car包
# 检验正态性
qqPlot(fit, labels = row.names(data), simulate = TRUE, main = "Q-Q图")
# 检验方差独立性
durbinWatsonTest(fit)
lag Autocorrelation D-W Statistic p-value
1 -0.2006929 2.317691 0.302
Alternative hypothesis: rho != 0
# 检验方差齐性
ncvTest(fit)
Non-constant Variance Score Test
Variance formula: ~ fitted.values
Chisquare = 1.746514 Df = 1 p = 0.1863156
# 检验多重共线性
vif(fit)
Population Illiteracy Income Frost
1.245282 2.165848 1.345822 2.082547
vif(fit) > 4 # 方差膨胀因子的平方大于4表明该变量存在多重共线性.
Population Illiteracy Income Frost
FALSE FALSE FALSE FALSE
# 使用gvlma包
# 综合检验模型假设
g <- gvlma(fit)
print(g)
Call:
lm(formula = Murder ~ Population + Illiteracy + Income + Frost,
data = data)
Coefficients:
(Intercept) Population Illiteracy Income Frost
1.23456341 0.00022368 4.14283659 0.00006442 0.00058131
ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
Level of Significance = 0.05
Call:
gvlma(x = fit)
Value p-value Decision
Global Stat 2.7728 0.5965 Assumptions acceptable.
Skewness 1.5374 0.2150 Assumptions acceptable.
Kurtosis 0.6376 0.4246 Assumptions acceptable.
Link Function 0.1154 0.7341 Assumptions acceptable.
Heteroscedasticity 0.4824 0.4873 Assumptions acceptable.
离群点: 指那些模型预测效果不佳的样本点,它们通常有很大的残差绝对值.
高杠杆值点: 指那些在预测变量空间上与其他所有样本点相距较远的样本点.
强影响点: 指那些对模型参数估计值影响较大的样本点.
# 离群点
outlierTest(fit)
rstudent unadjusted p-value Bonferonni p
Nevada 3.542929 0.00095088 0.047544
Nevada被判定为离群点.
注:该函数只是根据单个最大残差值的绝对值的显著性来判断是否有离群点.若不显著,则说明数据集中没有离群;若显著,则必须删除该离群点,然后再检验是否还有其他离群点存在.
# 高杠杆值点
hat.plot <- function(fit){
p <- length(coefficients(fit)) # 变量数
n <- length(fitted(fit)) # 样本数
dotchart(sort(hatvalues(fit)), labels = names(sort(hatvalues(fit))), cex = 0.7, main = "杠杆值")
abline(v = c(2,3)*p/n, col = "red", lty = 2) # 红色竖直线标注帽子均值2倍和3倍的位置
}
hat.plot(fit)
# 强影响点
cutoff <- 4/(nrow(data) - length(coefficients(fit)) - 1)
plot(fit, which = 4, cook.levels = cutoff, main = "鉴别强影响点的Cook'D图")
abline(h = cutoff, lty = 2, col = "red")