Scenarios for using OLS regression

In OLS regression, a quantitative dependent variable is predicted from a weighted sum of predictor variables, where the weights are parameters estimated from the data.
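
Concretely, for predictors $X_1, \dots, X_k$ the fitted model has the form

$$\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \hat{\beta}_2 X_2 + \cdots + \hat{\beta}_k X_k,$$

where the estimates $\hat{\beta}_0, \dots, \hat{\beta}_k$ are chosen to minimize the residual sum of squares $\sum_i (Y_i - \hat{Y}_i)^2$.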

OLS regression theory

To properly interpret the coefficients of the OLS model, you must satisfy a number of statistical assumptions:

Normality—For fixed values of the independent variables, the dependent variable is normally distributed.

Independence—The Yi values are independent of each other.

Linearity—The dependent variable is linearly related to the independent variables.

Homoscedasticity—The variance of the dependent variable doesn’t vary with the levels of the independent variables. (I could call this constant variance, but saying homoscedasticity makes me feel smarter.)

Fitting an OLS model in R

The main function is lm(); the general form of a call is: myfit <- lm(formula, data)

Simple linear regression

# fit
fit <- lm(weight~height, data=women)

# Model summary
summary(fit)
## 
## Call:
## lm(formula = weight ~ height, data = women)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.7333 -1.1333 -0.3833  0.7417  3.1167 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -87.51667    5.93694  -14.74 1.71e-09 ***
## height        3.45000    0.09114   37.85 1.09e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.525 on 13 degrees of freedom
## Multiple R-squared:  0.991,  Adjusted R-squared:  0.9903 
## F-statistic:  1433 on 1 and 13 DF,  p-value: 1.091e-14
# The summary above is not exhaustive; some additional diagnostics follow
# Durbin-Watson statistics[car]
library(car)
print("D-W test result is: ")
## [1] "D-W test result is: "
durbinWatsonTest(fit)
##  lag Autocorrelation D-W Statistic p-value
##    1        0.585079     0.3153804       0
##  Alternative hypothesis: rho != 0
# correlation
cor_coef <- cor(women$height, women$weight)
print(paste("correlation coef is: ", as.character(cor_coef)))
## [1] "correlation coef is:  0.995494767784216"
# Other details of the fitted model
#women$weight
#fitted(fit)
#residuals(fit)


# Plot of the data with the fitted regression line
plot(women$height, women$weight, 
     xlab="Height (in inches)",
     ylab="Weight (in pounds)")
abline(fit)

# The car package provides an enhanced scatter plot
scatterplot(weight~height, data=women,
            spread=FALSE, smoother.args=list(lty=2), pch=19,
            main="Women Age 30-39",
            xlab="Height (inches)",
            ylab="Weight (lbs.)")

# Prediction with the fitted model
print("Predicting...")
## [1] "Predicting..."
new.height <- c(70, 75)
# Note: the new data must be supplied as a data.frame with column names matching the predictors
# A confidence interval (here 95%) can be requested at the same time
predict(fit, data.frame(height=new.height), level=0.95, interval="confidence")
##        fit      lwr      upr
## 1 153.9833 152.6823 155.2844
## 2 171.2333 169.0885 173.3781
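
For predictions about individual observations rather than the mean response at a given height, a prediction interval (which is wider) can be requested instead; a quick sketch:

predict(fit, data.frame(height=new.height), level=0.95, interval="prediction")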

Polynomial regression

fit2 <- lm(weight~height+I(height^2), data=women)
# The model can be inspected with summary(), plot(), etc., as above
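
For a quick check, a sketch of inspecting the quadratic model and overlaying its fitted curve on the data (same approach as for the simple fit above):

summary(fit2)
plot(women$height, women$weight,
     xlab="Height (in inches)", ylab="Weight (in pounds)")
lines(women$height, fitted(fit2), col="blue")   # fitted quadratic curve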

Multiple linear regression

states <- as.data.frame(state.x77[, c("Murder", "Population", "Illiteracy", "Income", "Frost")])
# Correlation matrix among the variables
#cor(states)

# scatterplot matrix[car]
scatterplotMatrix(states, spread=FALSE, smoother.args = list(lty=2), main="Scatter Plot Matrix")

# fit
fit <- lm(Murder~Population+Illiteracy+Income+Frost, data=states)
summary(fit)
## 
## Call:
## lm(formula = Murder ~ Population + Illiteracy + Income + Frost, 
##     data = states)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.7960 -1.6495 -0.0811  1.4815  7.6210 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1.235e+00  3.866e+00   0.319   0.7510    
## Population  2.237e-04  9.052e-05   2.471   0.0173 *  
## Illiteracy  4.143e+00  8.744e-01   4.738 2.19e-05 ***
## Income      6.442e-05  6.837e-04   0.094   0.9253    
## Frost       5.813e-04  1.005e-02   0.058   0.9541    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.535 on 45 degrees of freedom
## Multiple R-squared:  0.567,  Adjusted R-squared:  0.5285 
## F-statistic: 14.73 on 4 and 45 DF,  p-value: 9.133e-08

Regression diagnostics

Model diagnostics are covered here from four angles: R’s classic built-in diagnostic plots, targeted checks of individual assumptions, global validation, and collinearity diagnostics.

A typical approach

fit <- lm(weight~height, data=women)
old.par <- par(no.readonly=TRUE)   # save the settable plotting parameters
par(mfrow=c(2,2))
plot(fit)
par(old.par)                       # restore the original parameters

To understand these graphs, consider the assumptions of OLS regression:

Normality [Normal Q-Q plot]—If the dependent variable is normally distributed for a fixed set of predictor values, then the residual values should be normally distributed with a mean of 0. The Normal Q-Q plot (upper right) is a probability plot of the standardized residuals against the values that would be expected under normality. If you’ve met the normality assumption, the points on this graph should fall on the straight 45-degree line. Because they don’t, you’ve clearly violated the normality assumption.

Independence [not shown in these plots]—You can’t tell if the dependent variable values are independent from these plots. You have to use your understanding of how the data was collected. There’s no a priori reason to believe that one woman’s weight influences another woman’s weight. If you found out that the data were sampled from families, you might have to adjust your assumption of independence.

Linearity [Residuals vs. Fitted plot]—If the dependent variable is linearly related to the independent variables, there should be no systematic relationship between the residuals and the predicted (that is, fitted) values. In other words, the model should capture all the systematic variance present in the data, leaving nothing but random noise. In the Residuals vs. Fitted graph (upper left), you see clear evidence of a curved relationship, which suggests that you may want to add a quadratic term to the regression.

Homoscedasticity [Scale-Location plot]—If you’ve met the constant variance assumption, the points in the Scale-Location graph (bottom left) should be a random band around a horizontal line. You seem to meet this assumption.

Finally, the Residuals vs. Leverage graph (bottom right) provides information about individual observations that you may wish to attend to. The graph identifies outliers, high-leverage points, and influential observations:

Outlier—An observation that isn’t predicted well by the fitted regression model (that is, it has a large positive or negative residual).

High-leverage point—An observation with an unusual combination of predictor values; that is, it’s an outlier in the predictor space. The dependent variable value isn’t used to calculate an observation’s leverage.

Influential observation—An observation that has a disproportionate impact on the determination of the model parameters. Influential observations are identified using a statistic called Cook’s distance, or Cook’s D.

A more detailed explanation of the plot(lm.result) diagnostic graphs can be found here.

Above, we found that observations 13 and 15 have relatively large Cook’s distances, meaning that the fitted model changes noticeably depending on whether these points are included. Below we remove the two points and refit. The residual plot also suggests a missing quadratic term, so we add one to the model as well.

newfit <- lm(weight~height+I(height^2), data=women[-c(13,15), ])
par(mfrow=c(2,2))
plot(newfit)

An enhanced approach

NORMALITY [Q-Q plot]

An enhanced Q-Q plot from the car package:

library(car)
states <- as.data.frame(state.x77[,c("Murder", "Population",
                                     "Illiteracy", "Income", "Frost")])
fit <- lm(Murder ~ Population + Illiteracy + Income + Frost, data=states)
# Note: id.method="identify" lets you click on points interactively to label them
qqPlot(fit, labels=row.names(states), id.method="identify",
       simulate=TRUE, main="Q-Q Plot")

An enhanced plot of the distribution of errors (studentized residuals):

residplot <- function(fit, nbreaks=10) {
  z <- rstudent(fit)                        # studentized residuals
  hist(z, breaks=nbreaks, freq=FALSE,
       xlab="Studentized Residual",
       main="Distribution of Errors")
  rug(jitter(z), col="brown")
  curve(dnorm(x, mean=mean(z), sd=sd(z)),   # normal curve with matching mean and sd
        add=TRUE, col="blue", lwd=2)
  lines(density(z)$x, density(z)$y,         # kernel density estimate
        col="red", lwd=2, lty=2)
  legend("topright",
         legend = c("Normal Curve", "Kernel Density Curve"),
         lty=1:2, col=c("blue","red"), cex=.7)
}
residplot(fit)

INDEPENDENCE OF ERRORS

durbinWatsonTest(fit)
##  lag Autocorrelation D-W Statistic p-value
##    1      -0.2006929      2.317691   0.216
##  Alternative hypothesis: rho != 0

The nonsignificant p-value (p = 0.216 in this run) suggests a lack of autocorrelation and, conversely, an independence of errors. The lag value (1 in this case) indicates that each observation is compared with the one next to it in the dataset.

Note that the durbinWatsonTest() function uses bootstrapping (see chapter 12 of R in Action) to derive p-values. Unless you add the option simulate=FALSE, you’ll get a slightly different value each time you run the test.
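
A small sketch of making the result reproducible, either by fixing the random seed or by skipping the bootstrap entirely:

set.seed(1234)
durbinWatsonTest(fit)                  # bootstrapped p-value, reproducible with a fixed seed
durbinWatsonTest(fit, simulate=FALSE)  # analytic p-value, no bootstrapping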

LINEARITY

Check linearity with component plus residual plots (also known as partial residual plots).

Nonlinearity in any of these plots suggests that you may not have adequately modeled the functional form of that predictor in the regression. If so, you may need to add curvilinear components such as polynomial terms, transform one or more variables (for example, use log(X) instead of X), or abandon linear regression in favor of some other regression variant.

crPlots(fit)

HOMOSCEDASTICITY [constant variance]

Way 1. The ncvTest() function produces a score test of the hypothesis of constant error variance against the alternative that the error variance changes with the level of the fitted values. A significant result suggests heteroscedasticity (nonconstant error variance).

ncvTest(fit)
## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 1.746514    Df = 1     p = 0.1863156

The score test is nonsignificant (p = 0.19), suggesting that you’ve met the constant variance assumption.

Way 2. The spreadLevelPlot() function creates a scatter plot of the absolute standardized residuals versus the fitted values and superimposes a line of best fit.

spreadLevelPlot(fit)

## 
## Suggested power transformation:  1.209626

The suggested power transformation is the power p such that fitting Y^p would stabilize the nonconstant error variance.
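
For example (a hypothetical illustration only; here the suggested power is about 1.2, close to 1, so no transformation is really needed): if the suggested power were 0.5, you would refit the model with a square-root-transformed response.

fit_sqrt <- lm(sqrt(Murder) ~ Population + Illiteracy + Income + Frost, data=states)  # hypothetical refit
spreadLevelPlot(fit_sqrt)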

Global validation of linear model assumptions

The gvlma() function (from the gvlma package) performs a global validation of linear model assumptions as well as separate evaluations of skewness, kurtosis, and heteroscedasticity.

library(gvlma)
gvmodel <- gvlma(fit)
summary(gvmodel)
## 
## Call:
## lm(formula = Murder ~ Population + Illiteracy + Income + Frost, 
##     data = states)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.7960 -1.6495 -0.0811  1.4815  7.6210 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1.235e+00  3.866e+00   0.319   0.7510    
## Population  2.237e-04  9.052e-05   2.471   0.0173 *  
## Illiteracy  4.143e+00  8.744e-01   4.738 2.19e-05 ***
## Income      6.442e-05  6.837e-04   0.094   0.9253    
## Frost       5.813e-04  1.005e-02   0.058   0.9541    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.535 on 45 degrees of freedom
## Multiple R-squared:  0.567,  Adjusted R-squared:  0.5285 
## F-statistic: 14.73 on 4 and 45 DF,  p-value: 9.133e-08
## 
## 
## ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
## USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
## Level of Significance =  0.05 
## 
## Call:
##  gvlma(x = fit) 
## 
##                     Value p-value                Decision
## Global Stat        2.7728  0.5965 Assumptions acceptable.
## Skewness           1.5374  0.2150 Assumptions acceptable.
## Kurtosis           0.6376  0.4246 Assumptions acceptable.
## Link Function      0.1154  0.7341 Assumptions acceptable.
## Heteroscedasticity 0.4824  0.4873 Assumptions acceptable.

Multicollinearity

Multicollinearity can be detected using a statistic called the variance inflation factor (VIF). As a general rule, sqrt(vif) > 2 indicates a multicollinearity problem; this is the criterion checked in the code below.

vif(fit)
## Population Illiteracy     Income      Frost 
##   1.245282   2.165848   1.345822   2.082547
sqrt(vif(fit))>2
## Population Illiteracy     Income      Frost 
##      FALSE      FALSE      FALSE      FALSE

Unusual observations

Detecting outliers, high-leverage points, and influential observations.

Outliers

outlierTest(fit)
##        rstudent unadjusted p-value Bonferonni p
## Nevada 3.542929         0.00095088     0.047544
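
How to handle Nevada depends on context; if it were judged to be a data error or fundamentally atypical, one option (a sketch, not part of the original analysis) is to drop it, refit, and rerun the test:

fit_no_nv <- lm(Murder ~ Population + Illiteracy + Income + Frost,
                data=states[row.names(states) != "Nevada", ])
outlierTest(fit_no_nv)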

High-leverage points

hat.plot <- function(fit) {
  p <- length(coefficients(fit))
  n <- length(fitted(fit))
  plot(hatvalues(fit), main="Index Plot of Hat Values")
  abline(h=c(2,3)*p/n, col="red", lty=2)
  identify(1:n, hatvalues(fit), names(hatvalues(fit)))
}
# Note: in an interactive session you can click to select individual points;
# this may not render properly in an R Markdown document
hat.plot(fit) 

## integer(0)

Influential observations

cutoff <- 4/(nrow(states)-length(fit$coefficients)-2)
plot(fit, which=4, cook.levels=cutoff)
abline(h=cutoff, lty=2, col="red")

> Note that although it’s useful to cast a wide net when searching for influential observations, I tend to find a cutoff of 1 more generally useful than 4/(n – k – 1). Given a criterion of D=1, none of the observations in the dataset would appear to be influential.
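
A quick sketch of applying that stricter criterion:

plot(fit, which=4, cook.levels=1)
abline(h=1, lty=2, col="red")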

Added-variable plots give another way to see how individual observations affect the coefficient estimates: for each predictor Xk, plot the residuals from regressing the response variable on the other k – 1 predictors against the residuals from regressing Xk on the other k – 1 predictors.

avPlots(fit, ask=FALSE, id.method="identify")

You can combine the information from outlier, leverage, and influence plots into one highly informative plot using the influencePlot() function from the car package.

influencePlot(fit, id.method="identify", main="Influence Plot",sub="Circle size is proportional to Cook's distance")

Corrective measures

This is where you correct the problems uncovered above. Because the appropriate fix is situation-specific and hard to summarize briefly, see R in Action for the details :-)

Selecting the “best” regression model

Select a model by comparing candidate models.

Comparing models

Comparing nested models using the anova() function

Requirement: the models must be nested (a nested model is one whose terms are completely included in the other model).

states <- as.data.frame(state.x77[,c("Murder", "Population", "Illiteracy", "Income", "Frost")])

fit1 <- lm(Murder ~ Population + Illiteracy + Income +  Frost,data=states)

fit2 <- lm(Murder ~ Population + Illiteracy, data=states)

anova(fit2, fit1)
## Analysis of Variance Table
## 
## Model 1: Murder ~ Population + Illiteracy
## Model 2: Murder ~ Population + Illiteracy + Income + Frost
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1     47 289.25                           
## 2     45 289.17  2  0.078505 0.0061 0.9939

Because the test is nonsignificant (p = .994), you conclude that Income and Frost don’t add to the linear prediction beyond Population and Illiteracy, and you’re justified in dropping them from the model.

Comparing models with the AIC

Models with smaller AIC values—indicating adequate fit with fewer parameters—are preferred.

Note that although the ANOVA approach requires nested models, the AIC approach doesn’t.

AIC(fit1, fit2)
##      df      AIC
## fit1  6 241.6429
## fit2  4 237.6565

Variable selection

STEPWISE REGRESSION

> In stepwise selection, variables are added to or deleted from a model one at a time, until some stopping criterion is reached.

library(MASS)

states <- as.data.frame(state.x77[,c("Murder", "Population", "Illiteracy", "Income", "Frost")])

fit <- lm(Murder ~ Population + Illiteracy + Income + Frost,data=states)

stepAIC(fit, direction="backward")
## Start:  AIC=97.75
## Murder ~ Population + Illiteracy + Income + Frost
## 
##              Df Sum of Sq    RSS     AIC
## - Frost       1     0.021 289.19  95.753
## - Income      1     0.057 289.22  95.759
## <none>                    289.17  97.749
## - Population  1    39.238 328.41 102.111
## - Illiteracy  1   144.264 433.43 115.986
## 
## Step:  AIC=95.75
## Murder ~ Population + Illiteracy + Income
## 
##              Df Sum of Sq    RSS     AIC
## - Income      1     0.057 289.25  93.763
## <none>                    289.19  95.753
## - Population  1    43.658 332.85 100.783
## - Illiteracy  1   236.196 525.38 123.605
## 
## Step:  AIC=93.76
## Murder ~ Population + Illiteracy
## 
##              Df Sum of Sq    RSS     AIC
## <none>                    289.25  93.763
## - Population  1    48.517 337.76  99.516
## - Illiteracy  1   299.646 588.89 127.311
## 
## Call:
## lm(formula = Murder ~ Population + Illiteracy, data = states)
## 
## Coefficients:
## (Intercept)   Population   Illiteracy  
##   1.6515497    0.0002242    4.0807366

Stepwise regression is controversial. Although it may find a good model, there’s no guarantee that it will find the “best” model. This is because not every possible model is evaluated. An approach that attempts to overcome this limitation is all subsets regression.

ALL SUBSETS REGRESSION

Note: when the number of predictors is large compared to the sample size, searching over all subsets can lead to significant overfitting.

library(leaps)
states <- as.data.frame(state.x77[,c("Murder", "Population",
                                     "Illiteracy", "Income", "Frost")])
leaps <- regsubsets(Murder ~ Population + Illiteracy + Income + Frost,
                    data=states, nbest=4)
plot(leaps, scale="adjr2")
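
A complementary view (a sketch using the subsets() function from the car package) ranks the same models by Mallows' Cp; better models fall close to the reference line with intercept 1 and slope 1:

library(car)
subsets(leaps, statistic="cp", main="Cp Plot for All Subsets Regression")
abline(1, 1, lty=2, col="red")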

Cross-validation

The shrinkage() function below uses crossval() from the bootstrap package to estimate how much the model's R-square shrinks under k-fold cross-validation:

shrinkage <- function(fit, k=10){
  require(bootstrap)
  # model-fitting and prediction functions required by crossval()
  theta.fit <- function(x,y){lsfit(x,y)}
  theta.predict <- function(fit,x){cbind(1,x)%*%fit$coef}
  x <- fit$model[,2:ncol(fit$model)]     # predictor columns
  y <- fit$model[,1]                     # response column
  results <- crossval(x, y, theta.fit, theta.predict, ngroup=k)
  r2 <- cor(y, fit$fitted.values)^2      # R-square on the full sample
  r2cv <- cor(y, results$cv.fit)^2       # cross-validated R-square
  cat("Original R-square =", r2, "\n")
  cat(k, "Fold Cross-Validated R-square =", r2cv, "\n")
  cat("Change =", r2-r2cv, "\n")
}

library(bootstrap)
shrinkage(fit)
## Original R-square = 0.5669502 
## 10 Fold Cross-Validated R-square = 0.4656007 
## Change = 0.1013496
fit2 <- lm(Murder ~ Population + Illiteracy,data=states)
shrinkage(fit2)
## Original R-square = 0.5668327 
## 10 Fold Cross-Validated R-square = 0.5135513 
## Change = 0.05328143

Relative importance

One simple way to gauge the relative importance of the predictors is to standardize all variables with scale() and compare the resulting standardized regression coefficients:

states <- as.data.frame(state.x77[,c("Murder", "Population", "Illiteracy", "Income", "Frost")])

zstates <- as.data.frame(scale(states))

zfit <- lm(Murder~Population + Income + Illiteracy + Frost, data=zstates)

coef(zfit)
##   (Intercept)    Population        Income    Illiteracy         Frost 
## -8.272891e-17  2.705095e-01  1.072372e-02  6.840496e-01  8.185407e-03
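
As a minimal cross-check (assuming the unstandardized four-predictor fit from earlier is still in scope), the same standardized coefficients can be recovered by rescaling the raw coefficients by sd(x)/sd(y):

coef(fit)[-1] * sapply(states[, c("Population", "Illiteracy", "Income", "Frost")], sd) / sd(states$Murder)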

The relweights() function below calculates the relative importance of each predictor, expressed as a percentage of R-square.

relweights <- function(fit, ...){
  R <- cor(fit$model)
  nvar <- ncol(R)
  rxx <- R[2:nvar, 2:nvar]               # correlations among the predictors
  rxy <- R[2:nvar, 1]                    # correlations of predictors with the response
  svd <- eigen(rxx)
  evec <- svd$vectors
  ev <- svd$values
  delta <- diag(sqrt(ev))
  lambda <- evec %*% delta %*% t(evec)   # correlations between original and orthogonalized predictors
  lambdasq <- lambda ^ 2
  beta <- solve(lambda) %*% rxy          # coefficients on the orthogonalized predictors
  rsquare <- colSums(beta ^ 2)
  rawwgt <- lambdasq %*% beta ^ 2
  import <- (rawwgt / rsquare) * 100     # relative weights as a percentage of R-square
  import <- as.data.frame(import)
  row.names(import) <- names(fit$model[2:nvar])
  names(import) <- "Weights"
  import <- import[order(import), 1, drop=FALSE]
  dotchart(import$Weights, labels=row.names(import),
           xlab="% of R-Square", pch=19,
           main="Relative Importance of Predictor Variables",
           sub=paste("Total R-Square=", round(rsquare, digits=3)),
           ...)
  return(import)
}

states <- as.data.frame(state.x77[,c("Murder", "Population", "Illiteracy", "Income", "Frost")])

fit <- lm(Murder ~ Population + Illiteracy + Income + Frost, data=states)

relweights(fit, col="blue")

##              Weights
## Income      5.488962
## Population 14.723401
## Frost      20.787442
## Illiteracy 59.000195