Updated on Tue Sep 03 17:00:36 2013.
We use regression to estimate the unknown effect of changing one variable on another (Stock and Watson, 2003, ch. 4). When running a regression we make two assumptions: 1) there is a linear relationship between the variables (i.e. X and Y) and 2) this relationship is additive (i.e. Y = x1 + x2 + … + xN). Technically, linear regression estimates how much Y changes when X changes by one unit.
Before running a regression it is recommended to have a clear idea of what you are trying to estimate (i.e. which are your outcome and predictor variables). A regression makes sense only if there is a sound theory behind it.
The expected model is:
\[ Y = β_{0} + β_{1}X \]
where \( β_{0} \) is the theoretical y-intercept and \( β_{1} \) is the theoretical slope. The goal of a linear regression is to find the best estimates for \( β_{0} \) and \( β_{1} \) by minimizing the residual error between the experimental and predicted signal.
The final model is:
\[ Y = b_{0} + b_{1}X + e \]
where \( b_{0} \) and \( b_{1} \) are the estimates for \( β_{0} \) and \( β_{1} \) and e is the residual error.
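As a quick illustration (simulated data, not part of the Prestige example used below), lm() recovers the estimates b0 and b1 from data generated with known β0 and β1:
set.seed(42)
x.sim <- runif(100, 0, 10)              # predictor
y.sim <- 2 + 0.5 * x.sim + rnorm(100)   # true beta0 = 2, beta1 = 0.5, plus residual error
fit.sim <- lm(y.sim ~ x.sim)
coef(fit.sim)                           # b0 and b1, close to 2 and 0.5
sum(resid(fit.sim))                     # residuals sum to (essentially) zero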
Let’s assume that the dependent variable being modeled is Y and that A, B and C are independent variables that might affect Y. The general format for a linear model is
response ~ op1 term1 op2 term2 op3 term3 …
where term is an object or a sequence of objects and op is an operator, such as a + or a −, that indicates how the term that follows is to be included in the model. Note that the mathematical symbols used to define models do not have their normal meanings!
~ read "is modeled as"
+ to combine elementary terms, as in A+B
: for interactions, as in A:B
* for both main effects and interactions, so A*B = A + B + A:B (see the sketch just below)
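A quick check with toy data (hypothetical factors, assumed only for illustration) confirms that A*B expands to the main effects plus the interaction; the model.matrix() examples later in this document show the same expansion at the design-matrix level.
set.seed(1)
toy <- data.frame(A = gl(2, 10), B = gl(2, 5, 20), Y = rnorm(20))
m.star <- lm(Y ~ A * B, data = toy)
m.sum  <- lm(Y ~ A + B + A:B, data = toy)
all.equal(coef(m.star), coef(m.sum))   # TRUE: the two formulas fit the same model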
When discussing models, the term ‘linear’ does not mean a straight line. Instead, a linear model contains additive terms, each with a single multiplicative parameter; thus, the equations \( y = β_{0} + β_{1}x \), \( y = β_{0} + β_{1}x_{1} + β_{2}x_{2} \), \( y = β_{0} + β_{11}x^2 \), and \( y = β_{0} + β_{1}x_{1} + β_{2}\log(x_{2}) \) are all linear models. The equation \( y = αx^β \), however, is not a linear model.
The command lm() fits the model and, when printed, shows the model’s coefficients but no further statistical information. The command summary() provides a more complete statistical summary of the model. The command coef() gives the model’s coefficients, resid() gives the residual errors in Y, and fitted() gives the predicted values for Y.
library(car)
Loading required package: MASS
Loading required package: nnet
# ?Prestige # access the codebook
reg1 <- lm(prestige ~ education + log2(income) + women, data = Prestige)
summary(reg1)
Call:
lm(formula = prestige ~ education + log2(income) + women, data = Prestige)
Residuals:
Min 1Q Median 3Q Max
-17.364 -4.429 -0.101 4.316 19.179
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -110.9658 14.8429 -7.48 3.3e-11 ***
education 3.7305 0.3544 10.53 < 2e-16 ***
log2(income) 9.3147 1.3265 7.02 2.9e-10 ***
women 0.0469 0.0299 1.57 0.12
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 7.09 on 98 degrees of freedom
Multiple R-squared: 0.835, Adjusted R-squared: 0.83
F-statistic: 165 on 3 and 98 DF, p-value: <2e-16
# prestige = -110.97 + 3.73*education + 9.31*log2(income) + 0.05*women
# The residuals are the difference between the actual values of the variable you're predicting and predicted values from your regression (y - ŷ). For most regressions you want your residuals to look like a normal distribution when plotted. If our residuals are normally distributed, this indicates the mean of the difference between our predictions and the actual values is close to 0.
# The t-values test the null hypothesis that the coefficient is equal to 0 (against the alternative that it is different from 0). You can get the t-values by dividing the coefficient by its standard error. Larger absolute t-values indicate a variable contributes more to the model.
# Two-tail p-values test the null hypothesis that each coefficient is 0. To reject this null, the p-value has to be lower than 0.05.
# R-squared shows the proportion of the variance of Y explained by X. Higher is better, with 1 being the best.
# Adjusted R-squared shows the same as R2 but adjusted by the # of cases and # of variables. When the # of variables is small and the # of cases is very large then Adjusted R-squared is closer to R-squared. This provides a more honest association between X and Y.
# Residual standard error: the standard deviation of the residuals
# The Degrees of Freedom is the difference between the number of observations included in the training sample and the number of variables used in the model (intercept counts as a variable).
# F-statistic & p-value: the p-value of the model. It tests whether R-squared is different from 0. Usually we need a p-value lower than 0.05 to show a statistically significant relationship between X and Y; it indicates how reliably X predicts Y. The F-test compares your model with a model that has fewer parameters (the intercept-only model). In theory the model with more parameters should fit better; if your model does not perform better than the simpler model, the F-test will have a high p-value, and if it does perform better, the p-value will be lower.
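As a check on the interpretation above, the main quantities reported by summary() can be reproduced from the fitted object; a minimal sketch using reg1:
coef(reg1)                                  # the estimated coefficients
ctab <- summary(reg1)$coefficients          # Estimate, Std. Error, t value, Pr(>|t|)
ctab[, "Estimate"] / ctab[, "Std. Error"]   # reproduces the t values
cor(fitted(reg1), Prestige$prestige)^2      # reproduces Multiple R-squared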
Before accepting the result of a linear regression it is important to evaluate its suitability for explaining the data. One of the many ways to do this is to visually examine the residuals. If the model is appropriate, then the residual errors should be random and normally distributed. In addition, removing any one case should not significantly change the model’s fit. R provides four graphical approaches for evaluating a model using the plot() command.
The plot in the upper left shows the residual errors plotted versus their fitted values. The residuals should be randomly distributed around the horizontal line representing a residual error of zero; that is, there should not be a distinct trend in the distribution of points.
The plot in the lower left is a standard Q-Q plot, which should suggest that the residual errors are normally distributed.
The scale-location plot in the upper right shows the square root of the standardized residuals (sort of a square root of relative error) as a function of the fitted values. Again, there should be no obvious trend in this plot.
Finally, the plot in the lower right shows each point’s leverage, which is a measure of its importance in determining the regression result. Superimposed on the plot are contour lines for Cook’s distance, which is another measure of the importance of each observation to the regression. Smaller distances mean that removing the observation would have little effect on the regression results. Distances larger than 1 are suspicious and suggest the presence of a possible outlier or a poor model.
layout(matrix(1:4, 2, 2))
plot(reg1)
reg1.robust <- rlm(prestige ~ education + log2(income) + women, data = Prestige)
summary(reg1.robust)
Call: rlm(formula = prestige ~ education + log2(income) + women, data = Prestige)
Residuals:
Min 1Q Median 3Q Max
-17.167 -4.415 0.156 4.342 19.513
Coefficients:
Value Std. Error t value
(Intercept) -117.223 15.219 -7.703
education 3.676 0.363 10.117
log2(income) 9.846 1.360 7.239
women 0.051 0.031 1.659
Residual standard error: 6.54 on 98 degrees of freedom
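Note that summary() for an rlm fit reports t values but no p-values. If approximate p-values are wanted, one rough sketch (assuming a t reference distribution with the 98 residual degrees of freedom is adequate for the robust fit) is:
rob <- summary(reg1.robust)$coefficients                     # Value, Std. Error, t value
2 * pt(abs(rob[, "t value"]), df = 98, lower.tail = FALSE)   # approximate two-sided p-values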
prestige_hat <- fitted(reg1) # predicted values
# as.data.frame(prestige_hat) # not run
prestige_resid <- residuals(reg1) # residuals
# as.data.frame(prestige_resid) # not run
reg2 <- lm(prestige ~ education + log2(income) + type, data = Prestige)
summary(reg2)
Call:
lm(formula = prestige ~ education + log2(income) + type, data = Prestige)
Residuals:
Min 1Q Median 3Q Max
-13.51 -3.75 1.01 4.36 18.44
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -81.202 13.743 -5.91 5.6e-08 ***
education 3.284 0.608 5.40 5.1e-07 ***
log2(income) 7.269 1.190 6.11 2.3e-08 ***
typeprof 6.751 3.618 1.87 0.065 .
typewc -1.439 2.378 -0.61 0.546
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6.64 on 93 degrees of freedom
(4 observations deleted due to missingness)
Multiple R-squared: 0.855, Adjusted R-squared: 0.849
F-statistic: 138 on 4 and 93 DF, p-value: <2e-16
Prestige$type <- with(Prestige, factor(type, levels = c("bc", "wc", "prof")))
reg3 <- lm(prestige ~ type * (education + log2(income)), data = Prestige)
summary(reg3)
Call:
lm(formula = prestige ~ type * (education + log2(income)), data = Prestige)
Residuals:
Min 1Q Median 3Q Max
-13.97 -4.12 1.21 3.83 18.06
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -120.046 20.158 -5.96 5.1e-08 ***
typewc 30.241 37.979 0.80 0.4280
typeprof 85.160 31.181 2.73 0.0076 **
education 2.336 0.928 2.52 0.0136 *
log2(income) 11.078 1.806 6.13 2.3e-08 ***
typewc:education 3.640 1.759 2.07 0.0414 *
typeprof:education 0.697 1.290 0.54 0.5900
typewc:log2(income) -5.653 3.052 -1.85 0.0673 .
typeprof:log2(income) -6.536 2.617 -2.50 0.0143 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6.41 on 89 degrees of freedom
(4 observations deleted due to missingness)
Multiple R-squared: 0.871, Adjusted R-squared: 0.859
F-statistic: 75.1 on 8 and 89 DF, p-value: <2e-16
reg3a <- lm(prestige ~ education + log2(income) + type + log2(income):type +
education:type, data = Prestige)
reg3b <- lm(prestige ~ education * type + log2(income) * type, data = Prestige)
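reg3a and reg3b are just alternative ways of writing reg3, which can be verified directly. To judge whether the interaction terms are worth keeping, an incremental F-test against the additive model reg2 is one option (a sketch; both fits drop the same four rows with missing type, so they are comparable):
all.equal(fitted(reg3), fitted(reg3a))   # TRUE: same model, different formula syntax
all.equal(fitted(reg3), fitted(reg3b))   # TRUE
anova(reg2, reg3)                        # F-test for the four interaction terms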
What to look for: no patterns, no problems. All p-values should be non-significant. The model is OK if the residuals have mean = 0 and variance = 1 (Fox, 316). Tukey test null hypothesis: the model is additive.
Using 'income' as is. Variable 'income' shows some patterns.
reg1 <- lm(prestige ~ education + income + type, data = Prestige)
residualPlots(reg1)
Test stat Pr(>|t|)
education -0.684 0.496
income -2.886 0.005
type NA NA
Tukey test -2.610 0.009
Using 'log2(income)'. Model looks ok.
reg1a <- lm(prestige ~ education + log2(income) + type, data = Prestige)
residualPlots(reg1a)
Test stat Pr(>|t|)
education -0.237 0.813
log2(income) -1.044 0.299
type NA NA
Tukey test -1.446 0.148
residualPlots(reg1, ~1, fitted = TRUE) #Residuals vs fitted only
Test stat Pr(>|t|)
Tukey test -2.61 0.009
residualPlots(reg1, ~education, fitted = FALSE) # Residuals vs education only
Test stat Pr(>|t|)
education -0.684 0.496
avPlots(reg1, id.n = 2, id.cex = 0.7)
id.n – identify the n most influential observations
id.cex – font size for the point labels
Graphs the outcome vs each predictor variable holding the rest constant (also called partial-regression plots)
Helps identify the effect (or influence) of an observation on the regression coefficient of that predictor variable
qqPlot(reg1, id.n = 3)
service.station.attendant        electronic.workers       medical.technicians
                        1                        97                        98
id.n – identify the n observations with the largest residuals
outlierTest(reg1)
No Studentized residuals with Bonferonni p < 0.05
Largest |rstudent|:
rstudent unadjusted p-value Bonferonni p
medical.technicians 2.821 0.005863 0.5746
The null hypothesis for the Bonferroni-adjusted outlier test is that the observation is not an outlier. Here ‘medical.technicians’ has the largest studentized residual, but its Bonferroni p-value (0.57) is well above 0.05, so it is not flagged as an outlier.
influenceIndexPlot(reg1, id.n = 3)
Warning: NAs introduced by coercion
Cook's distance measures how much an observation influences the overall model or predicted values
Studentized residuals are the residuals divided by their estimated standard deviation, as a way to standardize them
Bonferroni test to identify outliers
Hat-values identify high-leverage observations (cases with unusual predictor values that can strongly influence the fit)
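The same diagnostics can also be extracted numerically; a sketch (the 2*mean(hat) cutoff for leverage is a common rule of thumb, not a fixed threshold):
cd   <- cooks.distance(reg1)                 # influence of each case on the fitted model
rs   <- rstudent(reg1)                       # studentized residuals
hats <- hatvalues(reg1)                      # leverage (hat-values)
head(sort(cd, decreasing = TRUE), 3)         # the three most influential cases
names(hats)[hats > 2 * mean(hats)]           # high-leverage cases (rule of thumb)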
Note: if an observation is both an outlier and influential (high leverage), it can change the fit of the linear model; in that case it is advisable to remove it. To remove a case (or cases), type
reg1a <- update(reg1, subset = rownames(Prestige) != "general.managers")
reg1b <- update(reg1, subset = !(rownames(Prestige) %in% c("general.managers", "medical.technicians")))
influencePlot(reg1, id.n = 3)
StudRes Hat CookD
general.managers -1.31346 0.33504 0.41534
lawyers -0.08277 0.10360 0.01265
physicians -0.39532 0.22420 0.09548
medical.technicians 2.82109 0.06859 0.33023
service.station.attendant -2.21884 0.05971 0.24494
electronic.workers 2.22519 0.02701 0.16240
Creates a bubble plot combining the display of studentized residuals, hat-values, and Cook's distance (represented by the area of the circles).
qqPlot(reg1)
Look for the tails, points should be close to the line or within the confidence intervals.
Quantile-comparison plots compare the studentized residuals against a t-distribution
Other normality tests: shapiro.test() (in base R) and mshapiro.test() in the mvnormtest package
ncvTest(reg1)
Non-constant Variance Score Test
Variance formula: ~ fitted.values
Chisquare = 0.0983 Df = 1 p = 0.7539
Breusch-Pagan / Cook-Weisberg score test for non-constant error variance. The null hypothesis is constant variance
See also residualPlots(reg1).
vif(reg1)
GVIF Df GVIF^(1/(2*Df))
education 5.974 1 2.444
income 1.681 1 1.297
type 6.102 2 1.572
A GVIF > 4 suggests collinearity.
When there are strong linear relationships among the predictors in a regression analysis, the precision of the estimated regression coefficients declines compared to what it would have been were the predictors uncorrelated with each other.
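For a single-df term the (G)VIF can be reproduced by hand as 1/(1 − R²), where R² comes from regressing that predictor on the remaining ones; a sketch for education (it should be close to the GVIF reported above, since lm() drops the same rows with missing type):
r2.edu <- summary(lm(education ~ income + type, data = Prestige))$r.squared
1 / (1 - r2.edu)   # compare with the GVIF for education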
reg1 <- rlm(prestige ~ education + log2(income) + women, data = Prestige)
set.seed(9)
edu <- rnorm(10, mean = 10.74, sd = 2.73)
set.seed(1)
income <- rnorm(10, mean = 6798, sd = 4246)
set.seed(123456)
women <- rnorm(10, mean = 29, sd = 31.7)
newdata = cbind(education = edu, income = income, women = women)
newdata = data.frame(newdata)
newdata
education income women
1 8.647 4138 55.43
2 8.511 7578 20.25
3 10.354 3250 17.75
4 9.982 13572 31.77
5 11.931 8197 100.40
6 7.500 3314 55.45
7 13.994 8868 70.60
8 10.690 9933 108.33
9 10.063 9243 66.03
10 9.749 5501 15.49
pred = predict(reg1, newdata, level = 0.9, interval = "confidence")
pred
fit lwr upr
1 35.67 33.92 37.43
2 41.98 40.04 43.92
3 36.60 34.42 38.79
4 56.25 53.20 59.31
5 59.75 55.93 63.56
6 28.31 26.35 30.27
7 66.93 64.43 69.43
8 58.32 53.39 63.25
9 52.83 49.62 56.04
10 41.74 40.51 42.97
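The intervals above are confidence intervals for the mean response. For intervals that cover individual future observations, use interval = "prediction", which gives wider limits; a sketch using an ordinary least-squares fit of the same formula (reg1.ols is a name introduced here for illustration):
reg1.ols <- lm(prestige ~ education + log2(income) + women, data = Prestige)
predict(reg1.ols, newdata, level = 0.9, interval = "prediction")   # wider than "confidence"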
model.matrix() shows how the formula is interpreted to build the model matrix for the linear model.
set.seed(123)
x <- rnorm(10, 5, 0.5)
y <- rnorm(10, 5, 0.5)
z <- rnorm(10, 5, 0.5)
model.matrix(y ~ x * z)
(Intercept) x z x:z
1 1 4.720 4.466 21.08
2 1 4.885 4.891 23.89
3 1 5.779 4.487 25.93
4 1 5.035 4.636 23.34
5 1 5.065 4.687 23.74
6 1 5.858 4.157 24.35
7 1 5.230 5.419 28.34
8 1 4.367 5.077 22.17
9 1 4.657 4.431 20.63
10 1 4.777 5.627 26.88
attr(,"assign")
[1] 0 1 2 3
model.matrix(y ~ x + z)
(Intercept) x z
1 1 4.720 4.466
2 1 4.885 4.891
3 1 5.779 4.487
4 1 5.035 4.636
5 1 5.065 4.687
6 1 5.858 4.157
7 1 5.230 5.419
8 1 4.367 5.077
9 1 4.657 4.431
10 1 4.777 5.627
attr(,"assign")
[1] 0 1 2
model.matrix(y ~ x:z)
(Intercept) x:z
1 1 21.08
2 1 23.89
3 1 25.93
4 1 23.34
5 1 23.74
6 1 24.35
7 1 28.34
8 1 22.17
9 1 20.63
10 1 26.88
attr(,"assign")
[1] 0 1
The simplest linear model fits a constant, the mean, to a single variable. This is useful if you want to obtain an estimate of the mean with a standard error and confidence interval, or want to test the null hypothesis that the mean is zero.
confint(lm(y ~ 1))
2.5 % 97.5 %
(Intercept) 4.733 5.476
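As a check of the claim above, the fitted constant is just the sample mean, and the interval agrees with a one-sample t interval (sketch):
coef(lm(y ~ 1))      # equals mean(y)
mean(y)
t.test(y)$conf.int   # same 95% interval as confint(lm(y ~ 1))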
The most familiar linear model is the linear regression of y on x.
lm.0 <- lm(y ~ x)
Add the regression line to a scatter plot.
# plot(x, y)
plot(y ~ x) # formula method
abline(lm.0)
Add a regression line to a scatter plot without extending it beyond the range of the data points.
plot(y ~ x)
xnew <- range(x)
yhat <- predict(lm.0, newdata = data.frame(x = xnew))
lines(yhat ~ xnew)
Add confidence bands (run after the scatter plot above has been drawn).
xnew <- seq(min(x), max(x), length.out = 100)
ynew <- data.frame(predict(lm.0, newdata = data.frame(x = xnew), interval = "confidence",
level = 0.95))
lines(ynew$lwr ~ xnew, lty = 2, col = "red")
lines(ynew$upr ~ xnew, lty = 2, col = "red")
Adding prediction intervals (again, run after the scatter plot has been drawn).
xnew <- seq(min(x), max(x), length.out = 100)
ynew <- data.frame(predict(lm.0, newdata = data.frame(x = xnew), interval = "prediction",
level = 0.95))
lines(ynew$lwr ~ xnew, lty = 2, col = "blue")
lines(ynew$upr ~ xnew, lty = 2, col = "blue")
Fit the model to data. y is the numeric response variable and A is the categorical explanatory variable (character or factor).
A <- c("a", "b", "c", "c", "d", "b", "a", "d", "a", "c")
A <- factor(A)
lm.1 <- lm(y ~ A)
Visualize the fitted model in a graph that includes the data points
stripchart(y ~ A, vertical = TRUE, method = "jitter", pch = 16, col = "red")
yhat <- tapply(fitted(lm.1), A, mean)
for (i in 1:length(yhat)) {
lines(c(i - 0.2, i + 0.2), rep(yhat[i], 2))
}
Fit the model to data. Generally, fitting the model without an interaction term is preferred, if you can assume that any interaction is weak or absent.
lm.2 <- lm(y ~ x + A) # no interaction term included, or
lm.3 <- lm(y ~ x + A + x:A) # interaction term present
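Whether the interaction can safely be dropped can be checked with an incremental F-test, since lm.2 is nested in lm.3 (a sketch):
anova(lm.2, lm.3)   # a non-significant F suggests the additive model lm.2 is adequate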
Visualize the fitted model in a graph that includes the data points.
plot(x, y, pch = as.numeric(A))
legend("bottomright", as.character(levels(A)), pch = 1:length(levels(A)))
groups <- levels(A) # stores group names
for (i in 1:length(groups)) {
xi <- x[A == groups[i]] # grabs x-values for group i
yhati <- fitted(lm.2)[A == groups[i]] # grabs yhat's for group i
lines(xi[order(xi)], yhati[order(xi)]) # connects the dots
}
library(MASS)
lm0 <- lm(loss ~ hard + tens, data = Rubber)
summary(lm0)
##
## Call:
## lm(formula = loss ~ hard + tens, data = Rubber)
##
## Residuals:
## Min 1Q Median 3Q Max
## -79.38 -14.61 3.82 19.75 65.98
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 885.161 61.752 14.33 3.8e-14 ***
## hard -6.571 0.583 -11.27 1.0e-11 ***
## tens -1.374 0.194 -7.07 1.3e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 36.5 on 27 degrees of freedom
## Multiple R-squared: 0.84, Adjusted R-squared: 0.828
## F-statistic: 71 on 2 and 27 DF, p-value: 1.77e-11
par(mfrow = c(1, 2))
termplot(lm0, partial = TRUE, smooth = panel.smooth)
par(mfrow = c(1, 1))
The term plots show the contribution of each of the two terms in the model, holding the other term at the mean of its contributions. In each panel a smooth curve has been fitted through the partial residuals. There is a clear suggestion that, at the upper end of the range, the response is not linear in tensile strength.
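One way to follow up on the apparent curvature (a sketch, not part of the original analysis) is to add a quadratic term in tens and test whether it improves the fit:
lm0.quad <- lm(loss ~ hard + poly(tens, 2), data = Rubber)
anova(lm0, lm0.quad)   # a significant F supports the curvature seen in the term plot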
References