In OLS regression, a quantitative dependent variable is predicted from a weighted sum of predictor variables, where the weights are parameters estimated from the data.
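In symbols (standard textbook notation, not tied to any particular R function), the fitted model for observation $i$ with $k$ predictors is

$$\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_{1i} + \hat{\beta}_2 X_{2i} + \cdots + \hat{\beta}_k X_{ki},$$

where the estimates $\hat{\beta}_0, \ldots, \hat{\beta}_k$ are chosen to minimize the residual sum of squares $\sum_i (Y_i - \hat{Y}_i)^2$.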
To properly interpret the coefficients of the OLS model, you must satisfy a number of statistical assumptions:
Normality—For fixed values of the independent variables, the dependent variable is normally distributed.
Independence—The Yi values are independent of each other.
Linearity—The dependent variable is linearly related to the independent variables.
Homoscedasticity—The variance of the dependent variable doesn't vary with the levels of the independent variables. (I could call this constant variance, but saying homoscedasticity makes me feel smarter.)
The main function used is lm(); the general form of the call is:
myfit <- lm(formula, data)
# fit
fit <- lm(weight~height, data=women)
# summary information for the model
summary(fit)
##
## Call:
## lm(formula = weight ~ height, data = women)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.7333 -1.1333 -0.3833 0.7417 3.1167
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -87.51667 5.93694 -14.74 1.71e-09 ***
## height 3.45000 0.09114 37.85 1.09e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.525 on 13 degrees of freedom
## Multiple R-squared: 0.991, Adjusted R-squared: 0.9903
## F-statistic: 1433 on 1 and 13 DF, p-value: 1.091e-14
# the summary above isn't exhaustive; a few additional statistics follow
# Durbin-Watson statistics[car]
library(car)
print("D-W test result is: ")
## [1] "D-W test result is: "
durbinWatsonTest(fit)
## lag Autocorrelation D-W Statistic p-value
## 1 0.585079 0.3153804 0
## Alternative hypothesis: rho != 0
# correlation
cor_coef <- cor(women$height, women$weight)
print(paste("correlation coef is: ", as.character(cor_coef)))
## [1] "correlation coef is: 0.995494767784216"
# other detailed information from the model
#women$weight
#fitted(fit)
#residuals(fit)
# plot of the data with the fitted line
plot(women$height, women$weight,
xlab="Height (in inches)",
ylab="Weight (in pounds)")
abline(fit)
# the car package also provides a nicer visualization
scatterplot(weight~height, data=women,
spread=FALSE, smoother.args=list(lty=2), pch=19,
main="Women Age 30-39",
xlab="Height (inches)",
ylab="Weight (lbs.)")
# prediction with the model
print("Predicting...")
## [1] "Predicting..."
new.height <- c(70, 75)
# Note: new data for prediction must be supplied as a data.frame whose column names match the predictors
# a 95% confidence interval can be requested at the same time
predict(fit, data.frame(height=new.height), level=0.95, interval="confidence")
## fit lwr upr
## 1 153.9833 152.6823 155.2844
## 2 171.2333 169.0885 173.3781
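Note that interval="confidence" gives an interval for the mean weight at each new height. If you want an interval for a single new individual instead, predict() also accepts interval="prediction" (a minimal sketch; the resulting interval is wider):
# prediction interval for an individual new observation
predict(fit, data.frame(height=new.height), level=0.95, interval="prediction")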
fit2 <- lm(weight~height+I(height^2), data=women)
# the model's information can be inspected just as above
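For example, a minimal sketch: summary(fit2) reports the quadratic fit, and lines() traces its fitted curve over the raw data (this works here because height is already sorted in the women data frame):
summary(fit2)
plot(women$height, women$weight,
     xlab="Height (in inches)",
     ylab="Weight (in pounds)")
lines(women$height, fitted(fit2))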
states <- as.data.frame(state.x77[, c("Murder", "Population", "Illiteracy", "Income", "Frost")])
# correlation matrix
#cor(states)
# scatterplot matrix[car]
scatterplotMatrix(states, spread=FALSE, smoother.args = list(lty=2), main="Scatter Plot Matrix")
# fit
fit <- lm(Murder~Population+Illiteracy+Income+Frost, data=states)
summary(fit)
##
## Call:
## lm(formula = Murder ~ Population + Illiteracy + Income + Frost,
## data = states)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.7960 -1.6495 -0.0811 1.4815 7.6210
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.235e+00 3.866e+00 0.319 0.7510
## Population 2.237e-04 9.052e-05 2.471 0.0173 *
## Illiteracy 4.143e+00 8.744e-01 4.738 2.19e-05 ***
## Income 6.442e-05 6.837e-04 0.094 0.9253
## Frost 5.813e-04 1.005e-02 0.058 0.9541
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.535 on 45 degrees of freedom
## Multiple R-squared: 0.567, Adjusted R-squared: 0.5285
## F-statistic: 14.73 on 4 and 45 DF, p-value: 9.133e-08
Model diagnostics are covered here from four angles: R's classic built-in diagnostic plots, targeted diagnostics for individual assumptions, global validation, and multicollinearity diagnostics.
fit <- lm(weight~height, data=women)
old.par <- par(no.readonly=TRUE)  # save the current graphical parameters so they can be restored with par(old.par)
par(mfrow=c(2,2))
plot(fit)
To understand these graphs, consider the assumptions of OLS regression:
Normality [Normal Q-Q plot]—If the dependent variable is normally distributed for a fixed set of predictor values, then the residual values should be normally distributed with a mean of 0. The Normal Q-Q plot (upper right) is a probability plot of the standardized residuals against the values that would be expected under normality. If you've met the normality assumption, the points on this graph should fall on the straight 45-degree line. Because they don't, you've clearly violated the normality assumption.
Independence [not shown in these plots]—You can't tell if the dependent variable values are independent from these plots. You have to use your understanding of how the data was collected. There's no a priori reason to believe that one woman's weight influences another woman's weight. If you found out that the data were sampled from families, you might have to adjust your assumption of independence.
Linearity [Residuals vs. Fitted plot]—If the dependent variable is linearly related to the independent variables, there should be no systematic relationship between the residuals and the predicted (that is, fitted) values. In other words, the model should capture all the systematic variance present in the data, leaving nothing but random noise. In the Residuals vs. Fitted graph (upper left), you see clear evidence of a curved relationship, which suggests that you may want to add a quadratic term to the regression.
Homoscedasticity [Scale-Location plot]—If you've met the constant variance assumption, the points in the Scale-Location graph (bottom left) should be a random band around a horizontal line. You seem to meet this assumption.
Finally, the Residuals vs. Leverage graph (bottom right) provides information about individual observations that you may wish to attend to. The graph identifies outliers, high-leverage points, and influential observations:
Outlier: an observation that isn't predicted well by the fitted regression model (that is, one with a large positive or negative residual).
High-leverage point: an observation with an unusual combination of predictor values; that is, an outlier in the predictor space. The dependent-variable value isn't used to calculate an observation's leverage.
Influential observation: an observation that has a disproportionate impact on the determination of the model parameters. Influential observations are identified using a statistic called Cook's distance, or Cook's D.
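These quantities can also be pulled out numerically; for instance, base R's cooks.distance() returns the Cook's D value for every observation in the current fit (here the simple height-weight model):
# Cook's distance for each of the 15 observations; large values flag influential points
round(cooks.distance(fit), 3)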
The plots above flag observations 13 and 15 as having relatively large Cook's distances, which means that including or excluding them noticeably changes the fitted model. Below we refit without these two points; since the residual plot also suggests a missing quadratic term, we add one as well.
newfit <- lm(weight~height+I(height^2), data=women[-c(13,15), ])
par(mfrow=c(2,2))
plot(newfit)
An enhanced Q-Q plot from the car package:
library(car)
states <- as.data.frame(state.x77[,c("Murder", "Population",
"Illiteracy", "Income", "Frost")])
fit <- lm(Murder ~ Population + Illiteracy + Income + Frost, data=states)
# Note: id.method="identify" lets you click points on the plot interactively to label them
qqPlot(fit, labels=row.names(states), id.method="identify",
simulate=TRUE, main="Q-Q Plot")
An enhanced view of the distribution of errors (a histogram of studentized residuals with normal and kernel density curves):
residplot <- function(fit, nbreaks=10) {
  z <- rstudent(fit)
  hist(z, breaks=nbreaks, freq=FALSE,
       xlab="Studentized Residual",
       main="Distribution of Errors")
  rug(jitter(z), col="brown")
  curve(dnorm(x, mean=mean(z), sd=sd(z)),
        add=TRUE, col="blue", lwd=2)
  lines(density(z)$x, density(z)$y,
        col="red", lwd=2, lty=2)
  legend("topright",
         legend = c("Normal Curve", "Kernel Density Curve"),
         lty=1:2, col=c("blue", "red"), cex=.7)
}
residplot(fit)
durbinWatsonTest(fit)
## lag Autocorrelation D-W Statistic p-value
## 1 -0.2006929 2.317691 0.216
## Alternative hypothesis: rho != 0
The nonsignificant p-value (0.216 in the output above) suggests a lack of autocorrelation and, conversely, an independence of errors. The lag value (1 in this case) indicates that each observation is compared with the one next to it in the dataset.
Note that the durbinWatsonTest() function uses bootstrapping (see chapter 12 of R in Action) to derive p-values. Unless you add the option simulate=FALSE, you'll get a slightly different value each time you run the test.
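If you need the result to be stable across runs, two options are sketched below: fix the random seed before the bootstrap, or pass simulate=FALSE (the option mentioned above) to skip the bootstrap entirely.
set.seed(1234)                         # any fixed seed makes the resampled p-value repeatable
durbinWatsonTest(fit)
durbinWatsonTest(fit, simulate=FALSE)  # reports the D-W statistic without a bootstrapped p-value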
Linearity can also be checked with component plus residual plots (also known as partial residual plots).
Nonlinearity in any of these plots suggests that you may not have adequately modeled the functional form of that predictor in the regression. If so, you may need to add curvilinear components such as polynomial terms, transform one or more variables (for example, use log(X) instead of X), or abandon linear regression in favor of some other regression variant; a hypothetical remedy is sketched after the crPlots() call below.
crPlots(fit)
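If a component plus residual plot did show nonlinearity, the remedies mentioned above might look like the following (purely hypothetical for the states data; fit.poly and fit.log are illustrative names):
# hypothetical: add a quadratic term for Illiteracy
fit.poly <- lm(Murder ~ Population + Illiteracy + I(Illiteracy^2) + Income + Frost, data=states)
# hypothetical: replace a skewed predictor with its log
fit.log <- lm(Murder ~ log(Population) + Illiteracy + Income + Frost, data=states)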
Approach 1: The ncvTest() function produces a score test of the hypothesis of constant error variance against the alternative that the error variance changes with the level of the fitted values. A significant result suggests heteroscedasticity (nonconstant error variance).
ncvTest(fit)
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 1.746514 Df = 1 p = 0.1863156
The score test is nonsignificant (p = 0.19), suggesting that you’ve met the constant variance assumption.
Approach 2: The spreadLevelPlot() function creates a scatter plot of the absolute standardized residuals versus the fitted values and superimposes a line of best fit.
spreadLevelPlot(fit)
##
## Suggested power transformation: 1.209626
The suggested power transformation is the power p such that transforming the response to Y^p would stabilize a nonconstant error variance.
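For example, if the suggested power had been close to 0.5, you might refit with a square-root-transformed response (a hypothetical sketch only, since the score test above was nonsignificant):
# hypothetical: stabilize the variance by modeling Y^0.5
fit.sqrt <- lm(sqrt(Murder) ~ Population + Illiteracy + Income + Frost, data=states)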
The gvlma() function from the gvlma package performs a global validation of linear model assumptions as well as separate evaluations of skewness, kurtosis, and heteroscedasticity.
library(gvlma)
gvmodel <- gvlma(fit)
summary(gvmodel)
##
## Call:
## lm(formula = Murder ~ Population + Illiteracy + Income + Frost,
## data = states)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.7960 -1.6495 -0.0811 1.4815 7.6210
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.235e+00 3.866e+00 0.319 0.7510
## Population 2.237e-04 9.052e-05 2.471 0.0173 *
## Illiteracy 4.143e+00 8.744e-01 4.738 2.19e-05 ***
## Income 6.442e-05 6.837e-04 0.094 0.9253
## Frost 5.813e-04 1.005e-02 0.058 0.9541
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.535 on 45 degrees of freedom
## Multiple R-squared: 0.567, Adjusted R-squared: 0.5285
## F-statistic: 14.73 on 4 and 45 DF, p-value: 9.133e-08
##
##
## ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
## USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
## Level of Significance = 0.05
##
## Call:
## gvlma(x = fit)
##
## Value p-value Decision
## Global Stat 2.7728 0.5965 Assumptions acceptable.
## Skewness 1.5374 0.2150 Assumptions acceptable.
## Kurtosis 0.6376 0.4246 Assumptions acceptable.
## Link Function 0.1154 0.7341 Assumptions acceptable.
## Heteroscedasticity 0.4824 0.4873 Assumptions acceptable.
Multicollinearity can be detected using a statistic called the variance inflation factor (VIF). As a general rule, sqrt(vif) > 2 indicates a multicollinearity problem, which is what the code below checks.
vif(fit)
## Population Illiteracy Income Frost
## 1.245282 2.165848 1.345822 2.082547
sqrt(vif(fit))>2
## Population Illiteracy Income Frost
## FALSE FALSE FALSE FALSE
Detecting outliers, high-leverage points, and influential observations.
outlierTest(fit)
## rstudent unadjusted p-value Bonferonni p
## Nevada 3.542929 0.00095088 0.047544
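Nevada is flagged with a significant Bonferroni-adjusted p-value. One quick way to see why is to compare its observed murder rate with the value the model predicts (base R indexing by row name):
states["Nevada", "Murder"]   # observed murder rate for Nevada
fitted(fit)["Nevada"]        # murder rate predicted by the model
residuals(fit)["Nevada"]     # the large positive residual behind the test result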
hat.plot <- function(fit) {
  p <- length(coefficients(fit))
  n <- length(fitted(fit))
  plot(hatvalues(fit), main="Index Plot of Hat Values")
  abline(h=c(2,3)*p/n, col="red", lty=2)
  identify(1:n, hatvalues(fit), names(hatvalues(fit)))
}
# Note: in an interactive session you can click on the plot to select specific points; this may not display properly in a knitted Rmd
hat.plot(fit)
## integer(0)
cutoff <- 4/(nrow(states)-length(fit$coefficients)-2)
plot(fit, which=4, cook.levels=cutoff)
abline(h=cutoff, lty=2, col="red")
> Note that although it’s useful to cast a wide net when searching for influential observations, I tend to find a cutoff of 1 more generally useful than 4/(n – k – 1). Given a criterion of D=1, none of the observations in the dataset would appear to be influential.
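To apply the D = 1 criterion directly, you can check the Cook's distances themselves (a one-line check; per the note above it should return no observations here):
# names of any observations with Cook's D greater than 1
names(which(cooks.distance(fit) > 1))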
Added-variable plots offer another view: for each predictor Xk, plot the residuals from regressing the response variable on the other k – 1 predictors against the residuals from regressing Xk on the other k – 1 predictors.
avPlots(fit, ask=FALSE, id.method="identify")
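To make the construction concrete, here is how the added-variable plot for Illiteracy could be built by hand (a sketch; avPlots() above does this for every predictor automatically):
# residuals of the response after regressing out the other predictors
ry <- residuals(lm(Murder ~ Population + Income + Frost, data=states))
# residuals of Illiteracy after regressing out the same predictors
rx <- residuals(lm(Illiteracy ~ Population + Income + Frost, data=states))
plot(rx, ry, xlab="Illiteracy | others", ylab="Murder | others")
abline(lm(ry ~ rx))   # this slope equals the Illiteracy coefficient in the full model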
You can combine the information from outlier, leverage, and influence plots into one highly informative plot using the influencePlot() function from the car package:
influencePlot(fit, id.method="identify", main="Influence Plot",
              sub="Circle size is proportional to Cook's distance")
#### Corrective measures
This is where the problems identified above get corrected. Because the remedies are flexible and hard to summarize briefly, I'll leave them to R in Action. :-)
Model selection by comparing competing models.
Comparing nested models using the anova() function
Requirement: the models must be nested (a nested model is one whose terms are completely included in the other model).
states <- as.data.frame(state.x77[,c("Murder", "Population", "Illiteracy", "Income", "Frost")])
fit1 <- lm(Murder ~ Population + Illiteracy + Income + Frost,data=states)
fit2 <- lm(Murder ~ Population + Illiteracy, data=states)
anova(fit2, fit1)
## Analysis of Variance Table
##
## Model 1: Murder ~ Population + Illiteracy
## Model 2: Murder ~ Population + Illiteracy + Income + Frost
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 47 289.25
## 2 45 289.17 2 0.078505 0.0061 0.9939
Because the test is nonsignificant (p = .994), you conclude that Income and Frost don't add to the linear prediction beyond Population and Illiteracy, and you're justified in dropping them from the model.
Comparing models with the AIC
Models with smaller AIC values—indicating adequate fit with fewer parameters—are preferred.
Note that although the ANOVA approach requires nested models, the AIC approach doesn’t.
AIC(fit1, fit2)
## df AIC
## fit1 6 241.6429
## fit2 4 237.6565
STEPWISE REGRESSION
> In stepwise selection, variables are added to or deleted from a model one at a time, until some stopping criterion is reached.
library(MASS)
states <- as.data.frame(state.x77[,c("Murder", "Population", "Illiteracy", "Income", "Frost")])
fit <- lm(Murder ~ Population + Illiteracy + Income + Frost,data=states)
stepAIC(fit, direction="backward")
## Start: AIC=97.75
## Murder ~ Population + Illiteracy + Income + Frost
##
## Df Sum of Sq RSS AIC
## - Frost 1 0.021 289.19 95.753
## - Income 1 0.057 289.22 95.759
## <none> 289.17 97.749
## - Population 1 39.238 328.41 102.111
## - Illiteracy 1 144.264 433.43 115.986
##
## Step: AIC=95.75
## Murder ~ Population + Illiteracy + Income
##
## Df Sum of Sq RSS AIC
## - Income 1 0.057 289.25 93.763
## <none> 289.19 95.753
## - Population 1 43.658 332.85 100.783
## - Illiteracy 1 236.196 525.38 123.605
##
## Step: AIC=93.76
## Murder ~ Population + Illiteracy
##
## Df Sum of Sq RSS AIC
## <none> 289.25 93.763
## - Population 1 48.517 337.76 99.516
## - Illiteracy 1 299.646 588.89 127.311
##
## Call:
## lm(formula = Murder ~ Population + Illiteracy, data = states)
##
## Coefficients:
## (Intercept) Population Illiteracy
## 1.6515497 0.0002242 4.0807366
Stepwise regression is controversial. Although it may find a good model, there’s no guarantee that it will find the “best” model. This is because not every possible model is evaluated. An approach that attempts to overcome this limitation is all subsets regression.
ALL SUBSETS REGRESSION
Note: when the number of predictors is large relative to the sample size, evaluating every possible subset can lead to significant overfitting.
library(leaps)
states <- as.data.frame(state.x77[,c("Murder", "Population",
"Illiteracy", "Income", "Frost")])
leaps <- regsubsets(Murder ~ Population + Illiteracy + Income + Frost,
                    data=states, nbest=4)
plot(leaps, scale="adjr2")
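The same information can also be extracted numerically; summary() on a regsubsets object returns, among other components, the adjusted R-squared for each candidate model (a sketch):
subs <- summary(leaps)
subs$adjr2   # adjusted R-squared for each of the candidate models shown in the plot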
# shrinkage(): estimate how much R-squared shrinks under k-fold cross-validation
# (uses crossval() from the bootstrap package)
shrinkage <- function(fit, k=10){
  require(bootstrap)
  # fitting and prediction functions passed to crossval()
  theta.fit <- function(x,y){lsfit(x,y)}
  theta.predict <- function(fit,x){cbind(1,x)%*%fit$coef}
  x <- fit$model[,2:ncol(fit$model)]
  y <- fit$model[,1]
  results <- crossval(x, y, theta.fit, theta.predict, ngroup=k)
  r2 <- cor(y, fit$fitted.values)^2   # R-squared on the full data
  r2cv <- cor(y, results$cv.fit)^2    # cross-validated R-squared
  cat("Original R-square =", r2, "\n")
  cat(k, "Fold Cross-Validated R-square =", r2cv, "\n")
  cat("Change =", r2-r2cv, "\n")
}
library(bootstrap)
shrinkage(fit)
## Original R-square = 0.5669502
## 10 Fold Cross-Validated R-square = 0.4656007
## Change = 0.1013496
fit2 <- lm(Murder ~ Population + Illiteracy,data=states)
shrinkage(fit2)
## Original R-square = 0.5668327
## 10 Fold Cross-Validated R-square = 0.5135513
## Change = 0.05328143
# standardized regression coefficients: scale every variable to mean 0 and sd 1, then refit
states <- as.data.frame(state.x77[,c("Murder", "Population", "Illiteracy", "Income", "Frost")])
zstates <- as.data.frame(scale(states))
zfit <- lm(Murder~Population + Income + Illiteracy + Frost, data=zstates)
coef(zfit)
## (Intercept) Population Income Illiteracy Frost
## -8.272891e-17 2.705095e-01 1.072372e-02 6.840496e-01 8.185407e-03
The relweights() function below calculates the relative importance of each predictor (relative weights):
relweights <- function(fit, ...){
  R <- cor(fit$model)
  nvar <- ncol(R)
  rxx <- R[2:nvar, 2:nvar]
  rxy <- R[2:nvar, 1]
  svd <- eigen(rxx)
  evec <- svd$vectors
  ev <- svd$values
  delta <- diag(sqrt(ev))
  lambda <- evec %*% delta %*% t(evec)
  lambdasq <- lambda ^ 2
  beta <- solve(lambda) %*% rxy
  rsquare <- colSums(beta ^ 2)
  rawwgt <- lambdasq %*% beta ^ 2
  import <- (rawwgt / rsquare) * 100
  import <- as.data.frame(import)
  row.names(import) <- names(fit$model[2:nvar])
  names(import) <- "Weights"
  import <- import[order(import), 1, drop=FALSE]
  dotchart(import$Weights, labels=row.names(import),
           xlab="% of R-Square", pch=19,
           main="Relative Importance of Predictor Variables",
           sub=paste("Total R-Square=", round(rsquare, digits=3)),
           ...)
  return(import)
}
states <- as.data.frame(state.x77[,c("Murder", "Population", "Illiteracy", "Income", "Frost")])
fit <- lm(Murder ~ Population + Illiteracy + Income + Frost, data=states)
relweights(fit, col="blue")
## Weights
## Income 5.488962
## Population 14.723401
## Frost 20.787442
## Illiteracy 59.000195