A Simple Regression Example

We will use the data in the built-in data frame mtcars.

First, let’s examine the relationship between engine displacement (the explanatory variable) and mpg (the response variable) graphically. We expect greater displacement to be associated with lower mpg, so a scatterplot should show points farther to the right sitting lower. The correlation coefficient should be negative and well away from zero, probably closer to -1 than to 0.

plot(mtcars$disp,mtcars$mpg)

cor(mtcars$disp,mtcars$mpg)
## [1] -0.8475514

We can take this a step further and create a model of the relationship between engine displacement and gas mileage using linear regression.

The idea is to assume that there is a linear relationship of the form

\[mpg = m \cdot disp + b\]

You should recognize this as the slope-intercept form of a straight line. We can use the function lm() in R to derive estimates of the parameters m and b from the existing data. The slope, m, is the more important of these two parameters. It tells us how much, and in which direction, gas mileage changes when engine displacement increases by one unit (here, one cubic inch). We expect it to have a negative value in this case.

lm1 <- lm(mpg~disp,data = mtcars)
summary(lm1)
## 
## Call:
## lm(formula = mpg ~ disp, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.8922 -2.2022 -0.9631  1.6272  7.2305 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 29.599855   1.229720  24.070  < 2e-16 ***
## disp        -0.041215   0.004712  -8.747 9.38e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.251 on 30 degrees of freedom
## Multiple R-squared:  0.7183, Adjusted R-squared:  0.709 
## F-statistic: 76.51 on 1 and 30 DF,  p-value: 9.38e-10
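
The estimates of m and b shown in the Coefficients table can also be read from the model object directly, which avoids retyping numbers from the printed summary. The following is a minimal sketch; it refits lm1 so that it runs on its own, and the names est, b_hat and m_hat are introduced here only for illustration.

```r
# Refit the model so this block is self-contained
lm1 <- lm(mpg ~ disp, data = mtcars)

# coef() returns a named numeric vector of the estimated parameters
est <- coef(lm1)
b_hat <- est["(Intercept)"]  # estimate of b, the intercept
m_hat <- est["disp"]         # estimate of m, the slope

est
```

The value stored in m_hat should match the Estimate for disp in the summary, including its negative sign.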

We can re-create the scatterplot and add the line fitted by the model lm1.

plot(mtcars$disp,mtcars$mpg)
abline(lm1)

Exercise

Use lm1 to predict the mpg for a vehicle with a 200 cubic inch engine.

Exercise

Use lm1 to predict the mpg for a vehicle with a 300 cubic inch engine. Calculate the difference between this result and that for the 200 cubic inch engine. How could you have predicted this difference from the results in the summary?

Exercise

Repeat the steps above to examine the relationship between the weight of the vehicle and mpg.

lm2 <- lm(mpg~wt,data = mtcars)
summary(lm2)
## 
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5432 -2.3647 -0.1252  1.4096  6.8727 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
## wt           -5.3445     0.5591  -9.559 1.29e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.046 on 30 degrees of freedom
## Multiple R-squared:  0.7528, Adjusted R-squared:  0.7446 
## F-statistic: 91.38 on 1 and 30 DF,  p-value: 1.294e-10

plot(mtcars$wt,mtcars$mpg)
abline(lm2)

Question

Which of these models does a better job of predicting mpg?

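On two criteria, the model using weight is a bit better: its residual standard error is lower (3.046 versus 3.251) and its adjusted R-squared is higher (0.7446 versus 0.709). Compare the “Residual standard error” and “Adjusted R-squared” values in the two summaries.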
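
The two criteria can also be pulled out of the model summaries and placed side by side rather than read off the printed output. This is only a sketch: it refits both models so that it runs on its own, and the data frame name comparison and its column names are made up for this illustration.

```r
# Refit both single-variable models so this block is self-contained
lm1 <- lm(mpg ~ disp, data = mtcars)
lm2 <- lm(mpg ~ wt, data = mtcars)

s1 <- summary(lm1)
s2 <- summary(lm2)

# sigma is the residual standard error; adj.r.squared is the adjusted R-squared
comparison <- data.frame(
  model         = c("mpg ~ disp", "mpg ~ wt"),
  residual_se   = c(s1$sigma, s2$sigma),
  adj_r_squared = c(s1$adj.r.squared, s2$adj.r.squared)
)
comparison
```

A lower residual standard error and a higher adjusted R-squared both point toward the better-fitting model.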

A Two-Variable Model

We can include more than one explanatory variable on the right-hand side of the model formula.

lm3 <- lm(mpg~disp + wt,data = mtcars)
summary(lm3)
## 
## Call:
## lm(formula = mpg ~ disp + wt, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4087 -2.3243 -0.7683  1.7721  6.3484 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 34.96055    2.16454  16.151 4.91e-16 ***
## disp        -0.01773    0.00919  -1.929  0.06362 .  
## wt          -3.35082    1.16413  -2.878  0.00743 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.917 on 29 degrees of freedom
## Multiple R-squared:  0.7809, Adjusted R-squared:  0.7658 
## F-statistic: 51.69 on 2 and 29 DF,  p-value: 2.744e-10
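
Written in the same form as the single-variable model, lm3 assumes a relationship like the one below. The symbols \(m_1\) and \(m_2\) are notation introduced here for the two slopes; they do not appear in the R output.

\[mpg = m_1 \cdot disp + m_2 \cdot wt + b\]

Substituting the estimates from the Coefficients table above gives, approximately,

\[mpg \approx -0.01773 \cdot disp - 3.35082 \cdot wt + 34.96055\]

Each slope describes the change in mpg associated with a one-unit increase in its variable while the other variable is held fixed.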

Exercise

Compare the coefficients of disp and wt, where possible, among these three models. Try to explain what you see.

Exercise

How are wt and disp related?

lm4 <- lm(disp ~ wt,data = mtcars)
summary(lm4)
## 
## Call:
## lm(formula = disp ~ wt, data = mtcars)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -88.18 -33.62 -10.05  35.15 125.59 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -131.15      35.72  -3.672 0.000933 ***
## wt            112.48      10.64  10.576 1.22e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 57.94 on 30 degrees of freedom
## Multiple R-squared:  0.7885, Adjusted R-squared:  0.7815 
## F-statistic: 111.8 on 1 and 30 DF,  p-value: 1.222e-11

plot(mtcars$wt,mtcars$disp)
abline(lm4)
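
A quicker, complementary check on how strongly wt and disp move together is their correlation coefficient, computed the same way as for disp and mpg earlier. In a regression with a single explanatory variable, the squared correlation should equal the Multiple R-squared reported by the summary. The sketch below refits lm4 so that it runs on its own; the names r and r_squared are chosen here for illustration.

```r
# Correlation between vehicle weight and engine displacement
r <- cor(mtcars$wt, mtcars$disp)

# Refit the model so this block is self-contained
lm4 <- lm(disp ~ wt, data = mtcars)
r_squared <- summary(lm4)$r.squared

# The last two entries should agree
c(correlation = r, correlation_squared = r^2, r_squared = r_squared)
```

A strong positive correlation between the two explanatory variables is worth keeping in mind when comparing their coefficients across the models above.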

Exercise

How do you interpret the coefficient 112.48?

Using predict

The function predict accepts a model and a data frame of new observations containing the required explanatory variables. For a model fitted with lm, it returns a vector of predicted values, one for each row of the data frame. This eliminates the need to copy and paste coefficients from the summary output to make predictions.

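Here is an example showing how to compute the difference between predicted mpg values when the weight of a vehicle is increased by 100 pounds. In mtcars, wt is measured in thousands of pounds, so an increase of 100 pounds is an increase of 0.1 in wt.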

new1 <- data.frame(disp=200,wt=3)
new2 <- data.frame(disp=200,wt=3.1)
Pred1 <- predict(lm3,new1)
Pred2 <- predict(lm3,new2)
Pred2 - Pred1
##          1 
## -0.3350825
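
The same comparison can be made with a single call to predict by putting both vehicles in one data frame. This is a sketch of one possible approach, not the only one; it refits lm3 so that it runs on its own, and the names new_cars and preds are made up for this illustration.

```r
# Refit the two-variable model so this block is self-contained
lm3 <- lm(mpg ~ disp + wt, data = mtcars)

# Two hypothetical vehicles: same displacement, weights of 3,000 and 3,100 pounds
new_cars <- data.frame(disp = c(200, 200), wt = c(3.0, 3.1))

# predict() returns one predicted mpg per row of new_cars
preds <- predict(lm3, newdata = new_cars)

# diff() gives the second prediction minus the first
diff(preds)
```

Because the model is linear, this difference should equal 0.1 times the wt coefficient of lm3.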

Exercise

Use predict to show the impact of increasing the displacement by 100 cubic inches using lm1.