We will use the data in the built-in data frame mtcars.
First, let’s examine the relationship between engine displacement (explanatory) and mpg (response) graphically. We expect greater displacement to be associated with reduced mpg, so a scatterplot should show points farther to the right sitting lower. The correlation coefficient should be negative and well away from zero; how close it comes to -1 depends on how tightly the points follow a straight line.
plot(mtcars$disp,mtcars$mpg)
cor(mtcars$disp,mtcars$mpg)
## [1] -0.8475514
We can take this a step further and create a model of the relationship between engine displacement and gas mileage using linear regression.
The idea is to assume that there is a linear relationship of the form
\[mpg = m \cdot disp + b\]
You should recognize this as the slope-intercept form of a straight line. We can use the function lm() in R to derive estimates of the parameters m and b from the existing data. The slope, m, is the more important of these two parameters. It tells us how much, and in which direction, the predicted gas mileage changes when engine displacement increases by one cubic inch. We expect it to be negative in this case.
lm1 <- lm(mpg~disp,data = mtcars)
summary(lm1)
##
## Call:
## lm(formula = mpg ~ disp, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.8922 -2.2022 -0.9631 1.6272 7.2305
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 29.599855 1.229720 24.070 < 2e-16 ***
## disp -0.041215 0.004712 -8.747 9.38e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.251 on 30 degrees of freedom
## Multiple R-squared: 0.7183, Adjusted R-squared: 0.709
## F-statistic: 76.51 on 1 and 30 DF, p-value: 9.38e-10
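The estimates can also be pulled out of the fitted model directly instead of being read off the printed summary. The following is a minimal sketch using the base R extractor coef(); the names b, m, and mpg_250, and the 250 cubic inch displacement, are illustrative choices and not part of the original exercise.

```r
# Refit the model from the text so this block runs on its own.
lm1 <- lm(mpg ~ disp, data = mtcars)

# coef() returns a named numeric vector with entries "(Intercept)" and "disp".
b <- coef(lm1)[["(Intercept)"]]  # intercept: predicted mpg at disp = 0
m <- coef(lm1)[["disp"]]         # slope: change in predicted mpg per cubic inch

# Apply mpg = m * disp + b by hand for an illustrative 250 cubic inch engine.
mpg_250 <- m * 250 + b
mpg_250
```

Working from coef() keeps the full precision of the estimates, whereas the summary table shows rounded values.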
We can re-create the scatterplot and add the line fitted by the model lm1.
plot(mtcars$disp,mtcars$mpg)
abline(lm1)
Use lm1 to predict the mpg for a vehicle with a 200 cubic inch engine.
Use lm1 to predict the mpg for a vehicle with a 300 cubic inch engine. Calculate the difference between this result and that for the 200 cubic inch engine. How could you have predicted this from the results in the summary?
Repeat the steps above to examine the relationship between the weight of the vehicle and mpg.
lm2 <- lm(mpg~wt,data = mtcars)
summary(lm2)
##
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.5432 -2.3647 -0.1252 1.4096 6.8727
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.2851 1.8776 19.858 < 2e-16 ***
## wt -5.3445 0.5591 -9.559 1.29e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.046 on 30 degrees of freedom
## Multiple R-squared: 0.7528, Adjusted R-squared: 0.7446
## F-statistic: 91.38 on 1 and 30 DF, p-value: 1.294e-10
plot(mtcars$wt,mtcars$mpg)
abline(lm2)
Which of these models does a better job of predicting mpg?
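On two criteria, the model using weight is a bit better. Compare the “Residual standard error” values (3.046 for lm2 versus 3.251 for lm1; lower is better) and the “Adjusted R-squared” values (0.7446 versus 0.709; higher is better).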
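The two criteria can also be collected programmatically, which avoids scanning two separate summaries. This is a sketch only, assuming the base R helpers sigma() (residual standard error) and the adj.r.squared component of summary(); the data frame name comparison and its column names are made up for illustration.

```r
# Refit both single-variable models so this block runs on its own.
lm1 <- lm(mpg ~ disp, data = mtcars)
lm2 <- lm(mpg ~ wt, data = mtcars)

# One row per model: lower residual_se and higher adj_r_squared are better.
comparison <- data.frame(
  model = c("lm1 (disp)", "lm2 (wt)"),
  residual_se = c(sigma(lm1), sigma(lm2)),
  adj_r_squared = c(summary(lm1)$adj.r.squared, summary(lm2)$adj.r.squared)
)
comparison
```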
We can include more than one explanatory variable on the right-hand side of the formula.
lm3 <- lm(mpg ~ disp + wt, data = mtcars)
summary(lm3)
##
## Call:
## lm(formula = mpg ~ disp + wt, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4087 -2.3243 -0.7683 1.7721 6.3484
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34.96055 2.16454 16.151 4.91e-16 ***
## disp -0.01773 0.00919 -1.929 0.06362 .
## wt -3.35082 1.16413 -2.878 0.00743 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.917 on 29 degrees of freedom
## Multiple R-squared: 0.7809, Adjusted R-squared: 0.7658
## F-statistic: 51.69 on 2 and 29 DF, p-value: 2.744e-10
Compare the coefficients of disp and wt, where possible, among these three models. Try to explain what you see.
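To make that comparison easier, the coefficients can be laid side by side. The sketch below is one possible arrangement, not the only one; the names models, terms_of_interest, and coef_table are hypothetical, and a term that a model does not contain shows up as NA.

```r
# Refit the three models from the text so this block runs on its own.
models <- list(
  lm1 = lm(mpg ~ disp, data = mtcars),
  lm2 = lm(mpg ~ wt, data = mtcars),
  lm3 = lm(mpg ~ disp + wt, data = mtcars)
)

terms_of_interest <- c("disp", "wt")

# Index each coefficient vector by name; a term absent from a model yields NA.
coef_table <- sapply(models, function(fit) coef(fit)[terms_of_interest])
rownames(coef_table) <- terms_of_interest

round(coef_table, 4)
```

The table only displays the estimates; explaining why they shrink in lm3 is still the point of the exercise.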
How are wt and disp related?
lm4 <- lm(disp ~ wt, data = mtcars)
summary(lm4)
##
## Call:
## lm(formula = disp ~ wt, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -88.18 -33.62 -10.05 35.15 125.59
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -131.15 35.72 -3.672 0.000933 ***
## wt 112.48 10.64 10.576 1.22e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 57.94 on 30 degrees of freedom
## Multiple R-squared: 0.7885, Adjusted R-squared: 0.7815
## F-statistic: 111.8 on 1 and 30 DF, p-value: 1.222e-11
plot(mtcars$wt,mtcars$disp)
abline(lm4)
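The strength of this relationship can also be summarized with a correlation coefficient, as was done for disp and mpg at the start. A short sketch follows; the variable name r is an illustrative choice. For a regression with a single explanatory variable, the squared correlation equals the Multiple R-squared reported in the summary.

```r
# Refit the model from the text so this block runs on its own.
lm4 <- lm(disp ~ wt, data = mtcars)

# Correlation between weight and displacement (positive: heavier cars
# tend to have larger engines).
r <- cor(mtcars$wt, mtcars$disp)
r

# With one explanatory variable, r squared matches the model's R-squared.
r^2
summary(lm4)$r.squared
```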
How do you interpret the coefficient 112.48?
The function predict() accepts a model and a data frame of observations containing the required explanatory variables. For a model fitted with lm(), it returns a named numeric vector of predicted values by default. This eliminates the need to copy and paste from the summary output to make predictions.
Here is an example showing how to compute the difference between predicted mpg values when the weight of a vehicle is increased by 100 pounds. In mtcars, wt is recorded in thousands of pounds, so 100 pounds corresponds to an increase of 0.1 in wt.
new1 <- data.frame(disp = 200, wt = 3)
new2 <- data.frame(disp = 200, wt = 3.1)
Pred1 <- predict(lm3, new1)
Pred2 <- predict(lm3, new2)
Pred2 - Pred1
## 1
## -0.3350825
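Because predict() works row by row, both vehicles can be passed in a single data frame. The following is a sketch of that variant, with new_cars and preds as illustrative names. It also checks the result against the wt coefficient: holding disp fixed, a 0.1 increase in wt changes the prediction by 0.1 times that coefficient.

```r
# Refit the two-variable model from the text so this block runs on its own.
lm3 <- lm(mpg ~ disp + wt, data = mtcars)

# Two hypothetical vehicles: same displacement, weights 100 pounds apart.
new_cars <- data.frame(disp = c(200, 200), wt = c(3.0, 3.1))

# One prediction per row of new_cars.
preds <- predict(lm3, newdata = new_cars)

# Difference between the second and first prediction.
diff(preds)

# The same quantity computed from the fitted wt coefficient.
0.1 * coef(lm3)[["wt"]]
```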
Use predict to show the impact of increasing the displacement by 100 cubic inches using lm1.