Linear Regression using R

Updated on Tue Sep 03 17:00:36 2013.

Introduction

We use regression to estimate the unknown effect of changing one variable over another (Stock and Watson, 2003, ch. 4). When running a regression we are making two assumptions: 1) there is a linear relationship between two variables (i.e., X and Y) and 2) this relationship is additive (i.e., Y = x1 + x2 + … + xN). Technically, linear regression estimates how much Y changes when X changes by one unit.

Before running a regression it is recommended to have a clear idea of what you are trying to estimate (i.e. which are your outcome and predictor variables). A regression makes sense only if there is a sound theory behind it.

The expected model is:
\[ Y = β_{0} + β_{1}X \] where \( β_{0} \) is the theoretical y-intercept and \( β_{1} \) is the theoretical slope. The goal of a linear regression is to find the best estimates for \( β_{0} \) and \( β_{1} \) by minimizing the residual error between the experimental and predicted signal.

The final model is:
\[ Y = b_{0} + b_{1}X + e \] where \( b_{0} \) and \( b_{1} \) are the estimates for \( β_{0} \) and \( β_{1} \) and e is the residual error.
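
As a minimal sketch (the variable names and simulated values here are purely illustrative), lm() recovers these estimates from data generated with known coefficients:

set.seed(1)
x <- runif(50, 0, 10)               # simulated predictor
y <- 2 + 3 * x + rnorm(50, sd = 1)  # true beta0 = 2, beta1 = 3, plus noise
coef(lm(y ~ x))                     # b0 and b1 should be close to 2 and 3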

Defining Models

Let’s assume that the dependent variable being modeled is Y and that A, B and C are independent variables that might affect Y. The general format for a linear model is

                      response ~ op1 term1 op2 term2 op3 term3 …

where term is an object or a sequence of objects and op is an operator, such as a + or a −, that indicates how the term that follows is to be included in the model. Note that the mathematical symbols used to define models do not have their normal meanings!

  ~           read "is modeled as"   
  +           to combine elementary terms, as in A+B   
  :           for interactions, as in A:B  
  *           for both main effects and interactions, so A*B = A+B+A:B
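
For instance, with two hypothetical factors a and b (toy data chosen only for illustration), the expansion of a*b into main effects plus an interaction can be checked directly:

a <- factor(c("a1", "a1", "a2", "a2"))
b <- factor(c("b1", "b2", "b1", "b2"))
yy <- rnorm(4)
colnames(model.matrix(yy ~ a + b))  # intercept and main-effect columns
colnames(model.matrix(yy ~ a * b))  # adds the a:b interaction column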


When discussing models, the term ‘linear’ does not mean a straight line. Instead, a linear model contains additive terms, each containing a single multiplicative parameter; thus, the equations \( y = β_{0} + β_{1}x \), \( y = β_{0} + β_{1}x_{1} + β_{2}x_{2} \), \( y = β_{0} + β_{11}x^2 \), and \( y = β_{0} + β_{1}x_{1} + β_{2}\log(x_{2}) \) are all linear models. The equation \( y = αx^β \), however, is not a linear model.

Useful commands

The command lm() provides the model’s coefficients but no further statistical information. The command summary() provides a more complete statistical summary of the model. The command coef() gives the model’s coefficients. The command resid() gives the residual errors in Y. The command fitted() gives the predicted values for Y.
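
A quick sketch of these commands on a built-in dataset (cars, chosen only for illustration):

fit <- lm(dist ~ speed, data = cars)  # stopping distance modeled by speed
summary(fit)        # full statistical summary
coef(fit)           # estimated coefficients
head(resid(fit))    # residual errors in Y
head(fitted(fit))   # predicted values for Y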

Using dataset “Prestige” in the following regression models

library(car)
Loading required package: MASS
Loading required package: nnet
# ?Prestige # access the codebook

Linear regression

reg1 <- lm(prestige ~ education + log2(income) + women, data = Prestige)
summary(reg1)

Call:
lm(formula = prestige ~ education + log2(income) + women, data = Prestige)

Residuals:
    Min      1Q  Median      3Q     Max 
-17.364  -4.429  -0.101   4.316  19.179 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -110.9658    14.8429   -7.48  3.3e-11 ***
education       3.7305     0.3544   10.53  < 2e-16 ***
log2(income)    9.3147     1.3265    7.02  2.9e-10 ***
women           0.0469     0.0299    1.57     0.12    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7.09 on 98 degrees of freedom
Multiple R-squared:  0.835, Adjusted R-squared:  0.83 
F-statistic:  165 on 3 and 98 DF,  p-value: <2e-16
# prestige = -110.97 + 3.73*education + 9.31*log2(income) + 0.05*women

# The residuals are the difference between the actual values of the variable you're predicting and predicted values from your regression (y - ŷ). For most regressions you want your residuals to look like a normal distribution when plotted. If our residuals are normally distributed, this indicates the mean of the difference between our predictions and the actual values is close to 0.

# The t-values test the hypothesis that the coefficient is different from 0. You can get the t-values by dividing the coefficient by its standard error. The t-values also show the importance of a variable in the model.

# Two-tail p-values test the hypothesis that each coefficient is different from 0. To reject this, the p-value has to be lower than 0.05

# R-squared shows the amount of variance of Y explained by X. Higher is better with 1 being the best.

# Adjusted R-squared shows the same as R2 but adjusted by the # of cases and # of variables. When the # of variables is small and the # of cases is very large then Adjusted R-squared is closer to R-squared. This provides a more honest association between X and Y.

# Residual standard error: the standard deviation of the residuals

# The Degrees of Freedom is the difference between the number of observations included in the training sample and the number of variables used in the model (intercept counts as a variable).

# F-statistic & p-value: the overall p-value of the model. It tests whether R-squared is different from 0. Usually we need a p-value lower than 0.05 to show a statistically significant relationship between X and Y; it indicates the reliability of X to predict Y. The F-test compares your model to a model with fewer parameters (for this overall test, the intercept-only model). In theory the model with more parameters should fit better. If the model with more parameters (your model) doesn't perform better than the model with fewer parameters, the F-test will have a high p-value; if it does perform better, you will get a low p-value.
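
These quantities can also be pulled out of the fitted object programmatically; a small sketch using the reg1 object fitted above:

s <- summary(reg1)
s$coefficients     # estimates, standard errors, t-values, p-values
s$r.squared        # R-squared
s$adj.r.squared    # adjusted R-squared
s$sigma            # residual standard error
s$fstatistic       # F-statistic and its degrees of freedom
df.residual(reg1)  # residual degrees of freedom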

Evaluate the model

Before accepting the result of a linear regression it is important to evaluate its suitability for explaining the data. One of the many ways to do this is to visually examine the residuals. If the model is appropriate, then the residual errors should be random and normally distributed. In addition, removing one case should not significantly impact the model’s suitability. R provides four graphical approaches for evaluating a model using the plot() command.

The plot in the upper left shows the residual errors plotted versus their fitted values. The residuals should be randomly distributed around the horizontal line representing a residual error of zero; that is, there should not be a distinct trend in the distribution of points.

The plot in the lower left is a standard Q-Q plot, which should suggest that the residual errors are normally distributed.

The scale-location plot in the upper right shows the square root of the standardized residuals (sort of a square root of relative error) as a function of the fitted values. Again, there should be no obvious trend in this plot.

Finally, the plot in the lower right shows each point’s leverage, which is a measure of its importance in determining the regression result. Superimposed on the plot are contour lines for Cook’s distance, which is another measure of the importance of each observation to the regression. Smaller distances mean that removing the observation would have little effect on the regression results. Distances larger than 1 are suspicious and suggest the presence of a possible outlier or a poor model.

layout(matrix(1:4, 2, 2))
plot(reg1)

[Plot: the four regression diagnostic plots for reg1]

Robust linear regression, controlling for heteroskedasticity

reg1.robust <- rlm(prestige ~ education + log2(income) + women, data = Prestige)
summary(reg1.robust)

Call: rlm(formula = prestige ~ education + log2(income) + women, data = Prestige)
Residuals:
    Min      1Q  Median      3Q     Max 
-17.167  -4.415   0.156   4.342  19.513 

Coefficients:
             Value    Std. Error t value 
(Intercept)  -117.223   15.219     -7.703
education       3.676    0.363     10.117
log2(income)    9.846    1.360      7.239
women           0.051    0.031      1.659

Residual standard error: 6.54 on 98 degrees of freedom
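
A quick way to see how much the robust fit changes the estimates is to line up the two sets of coefficients:

cbind(OLS = coef(reg1), robust = coef(reg1.robust))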

Predicted values/Residuals

prestige_hat <- fitted(reg1)  # predicted values
# as.data.frame(prestige_hat) # not run

prestige_resid <- residuals(reg1)  # residuals
# as.data.frame(prestige_resid) # not run

Dummy regression with no interactions (analysis of covariance, fixed effects)

reg2 <- lm(prestige ~ education + log2(income) + type, data = Prestige)
summary(reg2)

Call:
lm(formula = prestige ~ education + log2(income) + type, data = Prestige)

Residuals:
   Min     1Q Median     3Q    Max 
-13.51  -3.75   1.01   4.36  18.44 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)   -81.202     13.743   -5.91  5.6e-08 ***
education       3.284      0.608    5.40  5.1e-07 ***
log2(income)    7.269      1.190    6.11  2.3e-08 ***
typeprof        6.751      3.618    1.87    0.065 .  
typewc         -1.439      2.378   -0.61    0.546    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.64 on 93 degrees of freedom
  (4 observations deleted due to missingness)
Multiple R-squared:  0.855, Adjusted R-squared:  0.849 
F-statistic:  138 on 4 and 93 DF,  p-value: <2e-16

Reordering factor variables

Prestige$type <- with(Prestige, factor(type, levels = c("bc", "wc", "prof")))
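
To confirm the new ordering and the dummy coding that lm() will use (with 'bc' as the baseline):

levels(Prestige$type)
contrasts(Prestige$type)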

Dummy regression with interactions

reg3 <- lm(prestige ~ type * (education + log2(income)), data = Prestige)
summary(reg3)

Call:
lm(formula = prestige ~ type * (education + log2(income)), data = Prestige)

Residuals:
   Min     1Q Median     3Q    Max 
-13.97  -4.12   1.21   3.83  18.06 

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)    
(Intercept)           -120.046     20.158   -5.96  5.1e-08 ***
typewc                  30.241     37.979    0.80   0.4280    
typeprof                85.160     31.181    2.73   0.0076 ** 
education                2.336      0.928    2.52   0.0136 *  
log2(income)            11.078      1.806    6.13  2.3e-08 ***
typewc:education         3.640      1.759    2.07   0.0414 *  
typeprof:education       0.697      1.290    0.54   0.5900    
typewc:log2(income)     -5.653      3.052   -1.85   0.0673 .  
typeprof:log2(income)   -6.536      2.617   -2.50   0.0143 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.41 on 89 degrees of freedom
  (4 observations deleted due to missingness)
Multiple R-squared:  0.871, Adjusted R-squared:  0.859 
F-statistic: 75.1 on 8 and 89 DF,  p-value: <2e-16

Other ways to run the same model

reg3a <- lm(prestige ~ education + log2(income) + type + log2(income):type + 
    education:type, data = Prestige)
reg3b <- lm(prestige ~ education * type + log2(income) * type, data = Prestige)
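
These formulas specify the same model; one quick check is to compare their fitted values:

all.equal(fitted(reg3), fitted(reg3a))
all.equal(fitted(reg3), fitted(reg3b))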

Diagnostics for linear regression (residual plots)

What to look for: no patterns, no problems. All p-values should be non-significant. The model is OK if the residuals have mean = 0 and variance = 1 (Fox, 316). The Tukey test's null hypothesis is that the model is additive.

Using 'income' as is. Variable 'income' shows some patterns.

reg1 <- lm(prestige ~ education + income + type, data = Prestige)
residualPlots(reg1)

[Plot: residual plots for reg1 (income entered untransformed)]

           Test stat Pr(>|t|)
education     -0.684    0.496
income        -2.886    0.005
type              NA       NA
Tukey test    -2.610    0.009

Using 'log2(income)'. Model looks ok.

reg1a <- lm(prestige ~ education + log2(income) + type, data = Prestige)
residualPlots(reg1a)

[Plot: residual plots for reg1a (using log2(income))]

             Test stat Pr(>|t|)
education       -0.237    0.813
log2(income)    -1.044    0.299
type                NA       NA
Tukey test      -1.446    0.148

Other options

residualPlots(reg1, ~1, fitted = TRUE)  #Residuals vs fitted only

[Plot: residuals versus fitted values for reg1]

           Test stat Pr(>|t|)
Tukey test     -2.61    0.009
residualPlots(reg1, ~education, fitted = FALSE)  # Residuals vs education only

[Plot: residuals versus education for reg1]

          Test stat Pr(>|t|)
education    -0.684    0.496

Influential variables – Added-variable plots

avPlots(reg1, id.n = 2, id.cex = 0.7)

[Plot: added-variable plots for reg1]

id.n – labels the id.n most influential observations
id.cex – font size for the labels
Added-variable plots graph the outcome vs. each predictor, holding the rest constant (also called partial-regression plots)
They help identify the effect (or influence) of an observation on the regression coefficient of the predictor variable

Outliers – QQ-Plots

qqPlot(reg1, id.n = 3)

[Plot: quantile-comparison (Q-Q) plot of studentized residuals for reg1]

service.station.attendant        electronic.workers 
                        1                        97 
      medical.technicians 
                       98 

id.n – labels the id.n observations with the largest residuals

Outliers – Bonferroni test

outlierTest(reg1)

No Studentized residuals with Bonferonni p < 0.05
Largest |rstudent|:
                    rstudent unadjusted p-value Bonferonni p
medical.technicians    2.821           0.005863       0.5746

The null hypothesis for the Bonferroni-adjusted outlier test is that the observation is not an outlier. Here ‘medical.technicians’ has the largest studentized residual, but its Bonferroni p-value (0.57) is well above 0.05, so it is not flagged as a statistically significant outlier.

High leverage (hat) points

influenceIndexPlot(reg1, id.n = 3)
Warning: NAs introduced by coercion

[Plot: influence index plots (Cook's distance, studentized residuals, Bonferroni p-values, hat-values) for reg1]

Cook's distance measures how much an observation influences the overall model or predicted values

Studentized residuals are the residuals divided by their estimated standard deviations, as a way to standardize them

Bonferroni test to identify outliers

Hat values identify high-leverage observations (those with extreme values on the predictor variables)

Note: if an observation is both an outlier and influential (high leverage), then it can change the fit of the linear model, and it is advisable to remove it. To remove one or more cases, type

reg1a <- update(reg1, subset = rownames(Prestige) != "general.managers")
reg1b <- update(reg1, subset = !(rownames(Prestige) %in% c("general.managers", "medical.technicians")))

Influence Plots

influencePlot(reg1, id.n = 3)

[Plot: influence bubble plot (studentized residuals vs hat-values, circle area proportional to Cook's distance) for reg1]

                           StudRes     Hat   CookD
general.managers          -1.31346 0.33504 0.41534
lawyers                   -0.08277 0.10360 0.01265
physicians                -0.39532 0.22420 0.09548
medical.technicians        2.82109 0.06859 0.33023
service.station.attendant -2.21884 0.05971 0.24494
electronic.workers         2.22519 0.02701 0.16240

Creates a bubble plot combining the display of Studentized residuals, hat-values, and Cook's distances (represented by the areas of the circles).

Testing for normality

qqPlot(reg1)

[Plot: Q-Q plot of studentized residuals for reg1]

Look at the tails: points should be close to the line or within the confidence envelope.
Quantile plots compare the Studentized residuals to a t-distribution.
Other tests: shapiro.test(), and mshapiro.test() in library(mvnormtest).

Testing for heteroskedasticity

ncvTest(reg1)
Non-constant Variance Score Test 
Variance formula: ~ fitted.values 
Chisquare = 0.0983    Df = 1     p = 0.7539 

Breusch-Pagan / Cook-Weisberg score test for non-constant error variance. The null hypothesis is constant variance.

See also residualPlots(reg1).
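
ncvTest() can also test the error variance against particular predictors rather than the fitted values; for example (the choice of variance formula here is illustrative):

ncvTest(reg1, ~ income)              # variance as a function of income
ncvTest(reg1, ~ education + income)  # or of several predictors at once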

Testing for multicollinearity

vif(reg1)
           GVIF Df GVIF^(1/(2*Df))
education 5.974  1           2.444
income    1.681  1           1.297
type      6.102  2           1.572

A GVIF > 4 suggests collinearity.

When there are strong linear relationships among the predictors in a regression analysis, the precision of the estimated regression coefficients declines compared to what it would have been had the predictors been uncorrelated with each other.
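
A quick complementary check is the correlation matrix of the numeric predictors (a small sketch; the selection of columns is illustrative):

cor(Prestige[, c("education", "income", "women")])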

Using the Results of a Regression to Make Predictions

reg1 <- rlm(prestige ~ education + log2(income) + women, data = Prestige)
set.seed(9)
edu <- rnorm(10, mean = 10.74, sd = 2.73)
set.seed(1)
income <- rnorm(10, mean = 6798, sd = 4246)
set.seed(123456)
women <- rnorm(10, mean = 29, sd = 31.7)
newdata = cbind(education = edu, income = income, women = women)
newdata = data.frame(newdata)
newdata
   education income  women
1      8.647   4138  55.43
2      8.511   7578  20.25
3     10.354   3250  17.75
4      9.982  13572  31.77
5     11.931   8197 100.40
6      7.500   3314  55.45
7     13.994   8868  70.60
8     10.690   9933 108.33
9     10.063   9243  66.03
10     9.749   5501  15.49
pred = predict(reg1, newdata, level = 0.9, interval = "confidence")
pred
     fit   lwr   upr
1  35.67 33.92 37.43
2  41.98 40.04 43.92
3  36.60 34.42 38.79
4  56.25 53.20 59.31
5  59.75 55.93 63.56
6  28.31 26.35 30.27
7  66.93 64.43 69.43
8  58.32 53.39 63.25
9  52.83 49.62 56.04
10 41.74 40.51 42.97

model.matrix()

Checks how the formula is interpreted to build the model matrix for the linear model.

set.seed(123)
x <- rnorm(10, 5, 0.5)
y <- rnorm(10, 5, 0.5)
z <- rnorm(10, 5, 0.5)
model.matrix(y ~ x * z)
   (Intercept)     x     z   x:z
1            1 4.720 4.466 21.08
2            1 4.885 4.891 23.89
3            1 5.779 4.487 25.93
4            1 5.035 4.636 23.34
5            1 5.065 4.687 23.74
6            1 5.858 4.157 24.35
7            1 5.230 5.419 28.34
8            1 4.367 5.077 22.17
9            1 4.657 4.431 20.63
10           1 4.777 5.627 26.88
attr(,"assign")
[1] 0 1 2 3
model.matrix(y ~ x + z)
   (Intercept)     x     z
1            1 4.720 4.466
2            1 4.885 4.891
3            1 5.779 4.487
4            1 5.035 4.636
5            1 5.065 4.687
6            1 5.858 4.157
7            1 5.230 5.419
8            1 4.367 5.077
9            1 4.657 4.431
10           1 4.777 5.627
attr(,"assign")
[1] 0 1 2
model.matrix(y ~ x:z)
   (Intercept)   x:z
1            1 21.08
2            1 23.89
3            1 25.93
4            1 23.34
5            1 23.74
6            1 24.35
7            1 28.34
8            1 22.17
9            1 20.63
10           1 26.88
attr(,"assign")
[1] 0 1

Basic types of linear models

The simplest linear model fits a constant, the mean, to a single variable. This is useful if you want to obtain an estimate of the mean with a standard error and confidence interval, or to test the null hypothesis that the mean is zero.

confint(lm(y ~ 1))
            2.5 % 97.5 %
(Intercept) 4.733  5.476
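
The same interval can be obtained from a one-sample t-test, which the intercept-only model is equivalent to:

t.test(y)  # the 95% confidence interval matches confint(lm(y ~ 1))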

The most familiar linear model is the linear regression of y on x.

lm.0 <- lm(y ~ x)

Add the regression line to a scatter plot.

# plot(x, y)
plot(y ~ x)  # formula method
abline(lm.0)

[Plot: scatter plot of y versus x with the fitted regression line]

Add a regression line to the scatter plot that does not extend beyond the data points.

plot(y ~ x)
xnew <- range(x)
yhat <- predict(lm.0, newdata = data.frame(x = xnew))
lines(yhat ~ xnew)

[Plot: scatter plot of y versus x with the regression line restricted to the range of the data]

Add confidence bands

plot(y ~ x)  # redraw the scatter plot and fitted line before adding the bands
abline(lm.0)
xnew <- seq(min(x), max(x), length.out = 100)
ynew <- data.frame(predict(lm.0, newdata = data.frame(x = xnew), interval = "confidence", 
    level = 0.95))
lines(ynew$lwr ~ xnew, lty = 2, col = "red")
lines(ynew$upr ~ xnew, lty = 2, col = "red")

Adding prediction intervals

plot(y ~ x)  # redraw the scatter plot and fitted line before adding the intervals
abline(lm.0)
xnew <- seq(min(x), max(x), length.out = 100)
ynew <- data.frame(predict(lm.0, newdata = data.frame(x = xnew), interval = "prediction", 
    level = 0.95))
lines(ynew$lwr ~ xnew, lty = 2, col = "blue")
lines(ynew$upr ~ xnew, lty = 2, col = "blue")

Fit the model to data. y is the numeric response variable and A is the categorical explanatory variable (character or factor).

A <- c("a", "b", "c", "c", "d", "b", "a", "d", "a", "c")
A <- factor(A)
lm.1 <- lm(y ~ A)

Visualize the fitted model in a graph that includes the data points

stripchart(y ~ A, vertical = TRUE, method = "jitter", pch = 16, col = "red")
yhat <- tapply(fitted(lm.1), A, mean)
for (i in 1:length(yhat)) {
    lines(c(i - 0.2, i + 0.2), rep(yhat[i], 2))
}

[Plot: strip chart of y by level of A with fitted group means]

Fit the model to data. Generally, fitting the model without an interaction term is preferred if you can assume that any interaction is weak or absent.

lm.2 <- lm(y ~ x + A)  # no interaction term included, or
lm.3 <- lm(y ~ x + A + x:A)  # interaction term present
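
One way to check whether the interaction term is needed is to compare the two fits with an F-test:

anova(lm.2, lm.3)  # a non-significant p-value supports dropping the interaction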

Visualize the fitted model in a graph that includes the data points.

plot(x, y, pch = as.numeric(A))
legend("bottomright", as.character(levels(A)), pch = 1:length(levels(A)))
groups <- levels(A)  # stores group names
for (i in 1:length(groups)) {
    xi <- x[A == groups[i]]  # grabs x-values for group i
    yhati <- fitted(lm.2)[A == groups[i]]  # grabs yhat's for group i
    lines(xi[order(xi)], yhati[order(xi)])  # connects the dots
}

[Plot: scatter plot of y versus x by level of A with a fitted line for each group]

termplot()

library(MASS)
lm0 <- lm(loss ~ hard + tens, data = Rubber)
summary(lm0)
## 
## Call:
## lm(formula = loss ~ hard + tens, data = Rubber)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -79.38 -14.61   3.82  19.75  65.98 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  885.161     61.752   14.33  3.8e-14 ***
## hard          -6.571      0.583  -11.27  1.0e-11 ***
## tens          -1.374      0.194   -7.07  1.3e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 36.5 on 27 degrees of freedom
## Multiple R-squared:  0.84,   Adjusted R-squared:  0.828 
## F-statistic:   71 on 2 and 27 DF,  p-value: 1.77e-11
par(mfrow = c(1, 2))
termplot(lm0, partial = TRUE, smooth = panel.smooth)

[Plot: termplot partial-residual plots for hard and tens]

par(mfrow = c(1, 1))

The plots show the contribution of each of the two terms in the model, at the mean of the contributions of the other term. In each panel a smooth curve has been fitted through the partial residuals. There is a clear suggestion that, at the upper end of the range, the response is not linear in tensile strength.
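
One possible follow-up (a sketch, not part of the original analysis) is to allow a curved relationship with tensile strength, which is still a linear model in the sense defined earlier:

lm0.quad <- lm(loss ~ hard + poly(tens, 2), data = Rubber)  # quadratic term for tens
par(mfrow = c(1, 2))
termplot(lm0.quad, partial = TRUE, smooth = panel.smooth)
par(mfrow = c(1, 1))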
