Linear Regression

2. 建立Linear Regression模型

2.1 相关性

corrplot.mixed(cor(data))

2.2 模型拟合

fit <- lm(Murder ~ Population + Illiteracy + Income + Frost, data = data)
summary(fit)


Call:
lm(formula = Murder ~ Population + Illiteracy + Income + Frost, 
    data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.7960 -1.6495 -0.0811  1.4815  7.6210 

Coefficients:
              Estimate Std. Error t value  Pr(>|t|)    
(Intercept) 1.23456341 3.86611474   0.319    0.7510    
Population  0.00022368 0.00009052   2.471    0.0173 *  
Illiteracy  4.14283659 0.87435319   4.738 0.0000219 ***
Income      0.00006442 0.00068370   0.094    0.9253    
Frost       0.00058131 0.01005366   0.058    0.9541    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.535 on 45 degrees of freedom
Multiple R-squared:  0.567, Adjusted R-squared:  0.5285 
F-statistic: 14.73 on 4 and 45 DF,  p-value: 0.00000009133

2.3 模型诊断

# 置信度为95%的置信区间
confint(fit, level = 0.95)

                     2.5 %       97.5 %
(Intercept) -6.55219139253 9.0213182149
Population   0.00004136397 0.0004059867
Illiteracy   2.38179886128 5.9038743192
Income      -0.00131261059 0.0014414600
Frost       -0.01966780591 0.0208304170

截距,Icome和Frost的置信区间都包含0,表示在置信度为95%情况下,无法拒绝这些变量为0的原假设.

# 图形诊断
par(mfrow = c(2, 2))
plot(fit)

左上：残差图,检验方差残差是否与因变量有关.
右上：Q-Q图,检验残差是否为标准正态分布.
左下：位置尺度图,检验方差是否齐性.
右下：残差-杠杆图,鉴别离群点,高杠杆值点和强影响点.

# 使用car包
# 检验正态性
qqPlot(fit, labels = row.names(data), simulate = TRUE, main = "Q-Q图")

# 检验方差独立性
durbinWatsonTest(fit)

 lag Autocorrelation D-W Statistic p-value
   1      -0.2006929      2.317691   0.302
 Alternative hypothesis: rho != 0

# 检验方差齐性
ncvTest(fit)

Non-constant Variance Score Test 
Variance formula: ~ fitted.values 
Chisquare = 1.746514    Df = 1     p = 0.1863156

# 检验多重共线性
vif(fit)

Population Illiteracy     Income      Frost 
  1.245282   2.165848   1.345822   2.082547

vif(fit) > 4  # 方差膨胀因子的平方大于4表明该变量存在多重共线性.

Population Illiteracy     Income      Frost 
     FALSE      FALSE      FALSE      FALSE

# 使用gvlma包
# 综合检验模型假设
g <- gvlma(fit)
print(g)


Call:
lm(formula = Murder ~ Population + Illiteracy + Income + Frost, 
    data = data)

Coefficients:
(Intercept)   Population   Illiteracy       Income        Frost  
 1.23456341   0.00022368   4.14283659   0.00006442   0.00058131  


ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
Level of Significance =  0.05 

Call:
 gvlma(x = fit) 

                    Value p-value                Decision
Global Stat        2.7728  0.5965 Assumptions acceptable.
Skewness           1.5374  0.2150 Assumptions acceptable.
Kurtosis           0.6376  0.4246 Assumptions acceptable.
Link Function      0.1154  0.7341 Assumptions acceptable.
Heteroscedasticity 0.4824  0.4873 Assumptions acceptable.

2.4 异常观测值

离群点:       指那些模型预测效果不佳的样本点,它们通常有很大的残差绝对值.
高杠杆值点:   指那些在预测变量空间上与其他所有样本点相距较远的样本点.
强影响点:     指那些对模型参数估计值影响较大的样本点.

# 离群点
outlierTest(fit)

       rstudent unadjusted p-value Bonferonni p
Nevada 3.542929         0.00095088     0.047544

Nevada被判定为离群点.
注:该函数只是根据单个最大残差值的绝对值的显著性来判断是否有离群点.若不显著,则说明数据集中没有离群;若显著,则必须删除该离群点,然后再检验是否还有其他离群点存在.

# 高杠杆值点
hat.plot <- function(fit){
  p <- length(coefficients(fit))  # 变量数
  n <- length(fitted(fit))        # 样本数
  dotchart(sort(hatvalues(fit)), labels = names(sort(hatvalues(fit))), cex = 0.7, main = "杠杆值")
  abline(v = c(2,3)*p/n, col = "red", lty = 2)  # 红色竖直线标注帽子均值2倍和3倍的位置
}
hat.plot(fit)

# 强影响点
cutoff <- 4/(nrow(data) - length(coefficients(fit)) - 1)
plot(fit, which = 4, cook.levels = cutoff, main = "鉴别强影响点的Cook'D图")
abline(h = cutoff, lty = 2, col = "red")

Linear Regression

Tang Mengjie

Saturday, January 31, 2015

1. 数据导入与数据处理

2. 建立Linear Regression模型

2.1 相关性

2.2 模型拟合

2.3 模型诊断

2.4 异常观测值