导入数据(先把wd设置)。数据来源http://www.statsci.org/data/general/cofreewy.html
dat1 <- read.table("cofreewy.txt",header=T) #header参数默认是F
head(dat1,3)
## Hour CO Traffic Wind
## 1 1 2.4 50 -0.2
## 2 2 1.7 26 0.0
## 3 3 1.4 16 0.0
多元线性回归fit1,用其他三项拟合CO
fit1 <- lm(CO~.,dat1) #.表示剩下所有
summary(fit1) #展示结果
##
## Call:
## lm(formula = CO ~ ., data = dat1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.75030 -0.33275 -0.09021 0.22653 1.25112
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.318967 0.242522 5.439 2.53e-05 ***
## Hour -0.005689 0.017066 -0.333 0.74233
## Traffic 0.018402 0.001413 13.026 3.15e-11 ***
## Wind 0.179189 0.059517 3.011 0.00691 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5096 on 20 degrees of freedom
## Multiple R-squared: 0.9498, Adjusted R-squared: 0.9423
## F-statistic: 126.1 on 3 and 20 DF, p-value: 3.682e-13
通过AIC,对fit1逐步回归,去除多重共线性的影响(多重共线性:线性回归模型中的解释变量之间由于存在精确相关关系或高度相关关系而使模型估计失真或难以估计准确)
fit2 <- step(fit1,direction="backward")
## Start: AIC=-28.73
## CO ~ Hour + Traffic + Wind
##
## Df Sum of Sq RSS AIC
## - Hour 1 0.029 5.224 -30.597
## <none> 5.195 -28.730
## - Wind 1 2.354 7.549 -21.759
## - Traffic 1 44.070 49.265 23.260
##
## Step: AIC=-30.6
## CO ~ Traffic + Wind
##
## Df Sum of Sq RSS AIC
## <none> 5.224 -30.597
## - Wind 1 2.357 7.581 -23.659
## - Traffic 1 46.117 51.341 22.250
summary(fit2)
##
## Call:
## lm(formula = CO ~ Traffic + Wind, data = dat1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.72858 -0.31710 -0.09629 0.22409 1.26554
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.274461 0.198137 6.432 2.25e-06 ***
## Traffic 0.018290 0.001343 13.616 6.85e-12 ***
## Wind 0.174747 0.056765 3.078 0.0057 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4987 on 21 degrees of freedom
## Multiple R-squared: 0.9495, Adjusted R-squared: 0.9447
## F-statistic: 197.5 on 2 and 21 DF, p-value: 2.419e-14
最终舍去了Hour变量。下面做残差的正态性检验,绘制qq图
shapiro.test(fit2$res)
##
## Shapiro-Wilk normality test
##
## data: fit2$res
## W = 0.9192, p-value = 0.05601
qqnorm(fit2$res)
qqline(fit2$res) #添加拟合线
fit2的调整后R方0.9447,残差检验p值0.05601(略)大于0.05显著性水平,不能拒绝(接收)原假设:残差正态。
qq图显示其实没那么正态,于是进一步探索数据。把各变量用散点图绘出。
plot(dat1) #很简单不是吗?
从图中看到了什么?CO 和traffic线性,CO跟hour应该有类似正弦余弦的关系,这里按照书中的经验(或者叫策略),先计算一下CO和各个预测变量的值、二次值、三次值的相关系数。
library(dplyr)
##
## Attaching package: 'dplyr'
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
dat1.1 <- mutate(dat1,Tsq=Traffic^2,Tcub=Traffic^3,Hsq=Hour^2,Hcub=Hour^3,Wsq=Wind^2,Wcub=Wind^3)
cor(dat1.1)
## Hour CO Traffic Wind Tsq Tcub
## Hour 1.0000000 0.42835141 0.4285228 0.4226381 0.30074905 0.23510436
## CO 0.4283514 1.00000000 0.9626655 0.7097486 0.90169496 0.82803458
## Traffic 0.4285228 0.96266546 1.0000000 0.6134436 0.96541035 0.90993482
## Wind 0.4226381 0.70974865 0.6134436 1.0000000 0.56243700 0.50209675
## Tsq 0.3007490 0.90169496 0.9654104 0.5624370 1.00000000 0.98552762
## Tcub 0.2351044 0.82803458 0.9099348 0.5020968 0.98552762 1.00000000
## Hsq 0.9708219 0.23892898 0.2524923 0.2646975 0.13840485 0.09301792
## Hcub 0.9210687 0.09728411 0.1215328 0.1221678 0.01765746 -0.01425327
## Wsq 0.3269404 0.62979640 0.5581630 0.9705228 0.53314946 0.49118222
## Wcub 0.2908334 0.57477944 0.5167288 0.9302517 0.50130050 0.46689157
## Hsq Hcub Wsq Wcub
## Hour 0.97082193 0.92106873 0.32694039 0.29083338
## CO 0.23892898 0.09728411 0.62979640 0.57477944
## Traffic 0.25249226 0.12153280 0.55816302 0.51672883
## Wind 0.26469755 0.12216779 0.97052283 0.93025168
## Tsq 0.13840485 0.01765746 0.53314946 0.50130050
## Tcub 0.09301792 -0.01425327 0.49118222 0.46689157
## Hsq 1.00000000 0.98638529 0.18208867 0.15647259
## Hcub 0.98638529 1.00000000 0.05121417 0.03468917
## Wsq 0.18208867 0.05121417 1.00000000 0.98980496
## Wcub 0.15647259 0.03468917 0.98980496 1.00000000
主要关注CO那一行,发现似乎wind的二三次项很有必要作为预测变量(相比较原值cor下降的速度)。因此按照书中的策略,先设计一个较大的多元模型,预测变量有原来的三个(不要Hour),再加上wind的二三次项,再加上hour经过谐波分析(傅里叶分解)后的前两个正余弦项。再进行逐步回归,去掉多重共线性问题。
fit3 <- lm(CO~Traffic+Wind+I(Wind^2)+I(Wind^3)+sin((2*pi/24)*Hour)+cos((2*pi/24)*Hour)+sin((4*pi/24)*Hour)+cos((4*pi/24)*Hour),dat1) #wind^3前门要加I(),sin()就不用了,参见帮助In function formula. There it is used to inhibit the interpretation of operators such as "+", "-", "*" and "^" as formula operators, so they are used as arithmetical operators.
fit4 <- step(fit3)
## Start: AIC=-64.06
## CO ~ Traffic + Wind + I(Wind^2) + I(Wind^3) + sin((2 * pi/24) *
## Hour) + cos((2 * pi/24) * Hour) + sin((4 * pi/24) * Hour) +
## cos((4 * pi/24) * Hour)
##
## Df Sum of Sq RSS AIC
## - sin((2 * pi/24) * Hour) 1 0.0000 0.7857 -66.061
## - cos((2 * pi/24) * Hour) 1 0.0108 0.7965 -65.734
## - sin((4 * pi/24) * Hour) 1 0.0142 0.7999 -65.631
## - I(Wind^3) 1 0.0594 0.8451 -64.312
## <none> 0.7857 -64.062
## - Wind 1 0.1040 0.8897 -63.078
## - I(Wind^2) 1 0.1465 0.9322 -61.959
## - cos((4 * pi/24) * Hour) 1 0.3081 1.0938 -58.122
## - Traffic 1 8.7013 9.4870 -6.275
##
## Step: AIC=-66.06
## CO ~ Traffic + Wind + I(Wind^2) + I(Wind^3) + cos((2 * pi/24) *
## Hour) + sin((4 * pi/24) * Hour) + cos((4 * pi/24) * Hour)
##
## Df Sum of Sq RSS AIC
## <none> 0.7857 -66.061
## - I(Wind^3) 1 0.0774 0.8631 -65.807
## - sin((4 * pi/24) * Hour) 1 0.0833 0.8690 -65.643
## - I(Wind^2) 1 0.1768 0.9625 -63.191
## - Wind 1 0.5036 1.2893 -56.175
## - cos((2 * pi/24) * Hour) 1 0.8084 1.5941 -51.081
## - cos((4 * pi/24) * Hour) 1 1.2853 2.0711 -44.800
## - Traffic 1 9.9464 10.7322 -5.315
summary(fit4)
##
## Call:
## lm(formula = CO ~ Traffic + Wind + I(Wind^2) + I(Wind^3) + cos((2 *
## pi/24) * Hour) + sin((4 * pi/24) * Hour) + cos((4 * pi/24) *
## Hour), data = dat1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.33393 -0.10924 -0.02143 0.11008 0.44462
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.236009 0.165331 7.476 1.32e-06 ***
## Traffic 0.018611 0.001308 14.232 1.68e-10 ***
## Wind 0.780056 0.243592 3.202 0.005551 **
## I(Wind^2) -0.203563 0.107293 -1.897 0.075992 .
## I(Wind^3) 0.013957 0.011121 1.255 0.227478
## cos((2 * pi/24) * Hour) -0.354207 0.087300 -4.057 0.000915 ***
## sin((4 * pi/24) * Hour) 0.241455 0.185397 1.302 0.211227
## cos((4 * pi/24) * Hour) 0.363625 0.071076 5.116 0.000104 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2216 on 16 degrees of freedom
## Multiple R-squared: 0.9924, Adjusted R-squared: 0.9891
## F-statistic: 298.7 on 7 and 16 DF, p-value: 9.664e-16
对模型进行方差分析(其实参数项的显著性检验本质就是方差分析),然后残差正态性检验,绘制qq图。
anova(fit4)
## Analysis of Variance Table
##
## Response: CO
## Df Sum Sq Mean Sq F value Pr(>F)
## Traffic 1 95.875 95.875 1952.3546 < 2.2e-16 ***
## Wind 1 2.357 2.357 48.0009 3.399e-06 ***
## I(Wind^2) 1 1.409 1.409 28.6906 6.426e-05 ***
## I(Wind^3) 1 0.052 0.052 1.0500 0.3207490
## cos((2 * pi/24) * Hour) 1 1.474 1.474 30.0185 5.050e-05 ***
## sin((4 * pi/24) * Hour) 1 0.218 0.218 4.4369 0.0513091 .
## cos((4 * pi/24) * Hour) 1 1.285 1.285 26.1737 0.0001036 ***
## Residuals 16 0.786 0.049
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
shapiro.test(fit4$res)
##
## Shapiro-Wilk normality test
##
## data: fit4$res
## W = 0.9784, p-value = 0.8649
qqnorm(fit4$res)
qqline(fit4$res)
相比之前的模型fit2,qq图效果更好,正态性p值超过0.8,r方也大于0.98,应该是有所提高。