Harold Nelson
3/31/2021
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.3 ✓ purrr 0.3.4
## ✓ tibble 3.0.6 ✓ dplyr 1.0.4
## ✓ tidyr 1.1.2 ✓ stringr 1.4.0
## ✓ readr 1.4.0 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
## Rows: 234
## Columns: 11
## $ manufacturer <chr> "audi", "audi", "audi", "audi", "audi", "audi", "audi", …
## $ model <chr> "a4", "a4", "a4", "a4", "a4", "a4", "a4", "a4 quattro", …
## $ displ <dbl> 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, 1.8, 1.8, 2.0, 2.0, 2…
## $ year <int> 1999, 1999, 2008, 2008, 1999, 1999, 2008, 1999, 1999, 20…
## $ cyl <int> 4, 4, 4, 4, 6, 6, 6, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 8, 8,…
## $ trans <chr> "auto(l5)", "manual(m5)", "manual(m6)", "auto(av)", "aut…
## $ drv <chr> "f", "f", "f", "f", "f", "f", "f", "4", "4", "4", "4", "…
## $ cty <int> 18, 21, 20, 21, 16, 18, 18, 18, 16, 20, 19, 15, 17, 17, …
## $ hwy <int> 29, 29, 31, 30, 26, 26, 27, 26, 25, 28, 27, 25, 25, 25, …
## $ fl <chr> "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "…
## $ class <chr> "compact", "compact", "compact", "compact", "compact", "…
Get tables of class, trans, fl and year.
##
## 2seater compact midsize minivan pickup subcompact suv
## 5 47 41 11 33 35 62
##
## auto(av) auto(l3) auto(l4) auto(l5) auto(l6) auto(s4) auto(s5)
## 5 2 83 39 6 3 3
## auto(s6) manual(m5) manual(m6)
## 16 58 19
##
## c d e p r
## 1 5 8 52 168
##
## 1999 2008
## 117 117
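The code that produced these four tables is not shown. A minimal sketch of one way to get them, assuming `mpg` is the data set shipped with ggplot2 (the name `tabs` is only illustrative):

```r
library(ggplot2)  # provides the mpg data set

# One frequency table per variable, in the order the text lists them
tabs <- lapply(mpg[c("class", "trans", "fl", "year")], table)
tabs
```

Calling `table(mpg$class)`, `table(mpg$trans)` and so on one at a time should give the same counts.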
Build a model to predict hwy (highway mpg)
Our first model was an equation of the form
\[\text{hwy} = \text{intercept} + \text{slope} \times \text{displ}\]
The analysis of residuals showed that a quadratic model might be more appropriate than a linear model.
What are some other features that might influence hwy?
What about vehicle class, drive type, or fuel type?
Look at class.
Three values deserve attention: suv, pickup and minivan.
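The plot behind this observation is not reproduced here. As a hedged sketch, side-by-side boxplots are one way to look at hwy by class; the choice of a boxplot and the names `p` and `class_medians` are assumptions, not necessarily what was used originally.

```r
library(ggplot2)

# Boxplots of highway mpg by vehicle class, ordered by median hwy
p <- ggplot(mpg, aes(x = reorder(class, hwy, FUN = median), y = hwy)) +
  geom_boxplot() +
  labs(x = "class", y = "hwy")
p

# Median hwy per class, lowest first, as a numeric companion to the plot
class_medians <- sort(tapply(mpg$hwy, mpg$class, median))
class_medians
```

Classes that sit well below the rest are candidates for indicator variables in the model.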
What about fuel type?
## Answer
What about trans?
There is nothing clear.
What about year?
## Answer
There is nothing clear.
What about drv?
The three values of drv (4, f and r) show clearly different levels of hwy.
Create some new variables.
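# Derive the extra predictors used in the models below
mpg2 <- mpg %>%
  mutate(hog = class == "suv" | class == "pickup",  # TRUE for SUVs and pickups
         minivan = class == "minivan",              # TRUE for minivans
         drvf = factor(drv),                        # drive type as a factor
         displ2 = displ^2)                          # squared displacement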
Rerun the basic model and note the residual standard error.
##
## Call:
## lm(formula = hwy ~ displ, data = mpg2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.1039 -2.1646 -0.2242 2.0589 15.0105
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 35.6977 0.7204 49.55 <2e-16 ***
## displ -3.5306 0.1945 -18.15 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.836 on 232 degrees of freedom
## Multiple R-squared: 0.5868, Adjusted R-squared: 0.585
## F-statistic: 329.5 on 1 and 232 DF, p-value: < 2.2e-16
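A short sketch of how this fit can be reproduced and how the residual standard error can be pulled out directly. It uses `mpg` rather than `mpg2`, since hwy and displ are the same in both; `fit_lin` is an illustrative name.

```r
library(ggplot2)  # provides the mpg data set

# Basic model: highway mpg as a linear function of displacement
fit_lin <- lm(hwy ~ displ, data = mpg)
summary(fit_lin)

# Residual standard error as a single number
sigma(fit_lin)

# Residuals against displ; curvature here suggests a missing quadratic term
plot(mpg$displ, resid(fit_lin), xlab = "displ", ylab = "residual")
abline(h = 0, lty = 2)
```

`sigma()` returns the same value that `summary()` prints as "Residual standard error", which makes it easier to compare models side by side.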
Now add displ2 and note the change in the residual standard error. Also note the significance of the new coefficient.
##
## Call:
## lm(formula = hwy ~ displ + displ2, data = mpg2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.6258 -2.1700 -0.7099 2.1768 13.1449
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 49.2450 1.8576 26.510 < 2e-16 ***
## displ -11.7602 1.0729 -10.961 < 2e-16 ***
## displ2 1.0954 0.1409 7.773 2.51e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.423 on 231 degrees of freedom
## Multiple R-squared: 0.6725, Adjusted R-squared: 0.6696
## F-statistic: 237.1 on 2 and 231 DF, p-value: < 2.2e-16
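The quadratic fit can be sketched as follows. As an aside, writing the squared term inline with `I(displ^2)` should give the same fit without creating a new column; the names `fit_quad` and `fit_inline` are illustrative.

```r
library(ggplot2)
library(dplyr)

# Add the squared displacement column, as in the text
mpg2 <- mpg %>% mutate(displ2 = displ^2)

# Quadratic model using the precomputed column
fit_quad <- lm(hwy ~ displ + displ2, data = mpg2)
summary(fit_quad)

# Same model with the square written inline in the formula
fit_inline <- lm(hwy ~ displ + I(displ^2), data = mpg)

# The two residual standard errors agree
all.equal(sigma(fit_quad), sigma(fit_inline))
```

The `I()` wrapper is needed because `^` has a special meaning inside a model formula.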
Instead of displ2, add the two class variables.
##
## Call:
## lm(formula = hwy ~ displ + minivan + hog, data = mpg2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.5368 -1.5390 -0.2406 1.4632 14.5689
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33.0508 0.5671 58.284 < 2e-16 ***
## displ -1.9051 0.1846 -10.318 < 2e-16 ***
## minivanTRUE -4.2271 0.8950 -4.723 4.04e-06 ***
## hogTRUE -6.8914 0.4930 -13.978 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.824 on 230 degrees of freedom
## Multiple R-squared: 0.778, Adjusted R-squared: 0.7751
## F-statistic: 268.6 on 3 and 230 DF, p-value: < 2.2e-16
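A sketch of this fit, to show why the coefficients are labelled `minivanTRUE` and `hogTRUE`: both predictors are logical columns, so `lm()` treats FALSE as the baseline and estimates a shift for TRUE. `fit_class` is an illustrative name.

```r
library(ggplot2)
library(dplyr)

# Logical indicators for the vehicle classes singled out earlier
mpg2 <- mpg %>%
  mutate(hog = class == "suv" | class == "pickup",
         minivan = class == "minivan")

# displ plus the two class indicators
fit_class <- lm(hwy ~ displ + minivan + hog, data = mpg2)
summary(fit_class)

# Coefficient names show the TRUE level of each logical predictor
names(coef(fit_class))
sigma(fit_class)
```

Each indicator coefficient is the estimated difference in hwy for that group relative to all other classes, at the same displacement.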
Now try drvf as the add-on variable.
##
## Call:
## lm(formula = hwy ~ displ + drvf, data = mpg2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.9996 -1.9066 -0.3937 1.5778 13.9207
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 30.8254 0.9239 33.364 < 2e-16 ***
## displ -2.9141 0.2183 -13.352 < 2e-16 ***
## drvff 4.7906 0.5296 9.045 < 2e-16 ***
## drvfr 5.2579 0.7336 7.167 1.03e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.082 on 230 degrees of freedom
## Multiple R-squared: 0.7356, Adjusted R-squared: 0.7322
## F-statistic: 213.3 on 3 and 230 DF, p-value: < 2.2e-16
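A sketch of this fit that also shows where `drvff` and `drvfr` come from. With the default treatment contrasts, the first factor level is the reference, so both coefficients are differences from four-wheel drive ("4"). `fit_drv` is an illustrative name.

```r
library(ggplot2)
library(dplyr)

# Drive type as a factor; levels are sorted, so "4" comes first
mpg2 <- mpg %>% mutate(drvf = factor(drv))
levels(mpg2$drvf)

# displ plus drive type
fit_drv <- lm(hwy ~ displ + drvf, data = mpg2)
summary(fit_drv)

# The reference level "4" has no coefficient of its own
names(coef(fit_drv))
sigma(fit_drv)
```

The coefficient name is the variable name followed by the level, which is why front-wheel drive appears as `drvff`.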
What is the best model so far?
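One hedged way to answer this is to collect the residual standard error and adjusted R-squared of the four models fitted so far into a single table. The list and column names below are illustrative.

```r
library(ggplot2)
library(dplyr)

mpg2 <- mpg %>%
  mutate(hog = class == "suv" | class == "pickup",
         minivan = class == "minivan",
         drvf = factor(drv),
         displ2 = displ^2)

# The four models fitted so far
models <- list(
  basic     = lm(hwy ~ displ, data = mpg2),
  quadratic = lm(hwy ~ displ + displ2, data = mpg2),
  class     = lm(hwy ~ displ + minivan + hog, data = mpg2),
  drive     = lm(hwy ~ displ + drvf, data = mpg2)
)

# One row per model: residual standard error and adjusted R-squared
comparison <- data.frame(
  model  = names(models),
  sigma  = sapply(models, sigma),
  adj_r2 = sapply(models, function(m) summary(m)$adj.r.squared),
  row.names = NULL
)

# Smallest residual standard error first
comparison[order(comparison$sigma), ]
```

In the summaries printed above, the model with the two class variables has the smallest residual standard error (2.824) and the largest adjusted R-squared (0.7751).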
Try all of the variables at once.
##
## Call:
## lm(formula = hwy ~ displ + displ2 + minivan + hog + +drvf, data = mpg2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.4845 -1.4389 -0.2472 1.4005 13.1170
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.4246 1.6507 23.883 < 2e-16 ***
## displ -6.9468 0.8808 -7.887 1.29e-13 ***
## displ2 0.6427 0.1115 5.762 2.70e-08 ***
## minivanTRUE -3.3515 0.8548 -3.921 0.000117 ***
## hogTRUE -4.3825 0.6124 -7.156 1.14e-11 ***
## drvff 2.3372 0.5812 4.021 7.88e-05 ***
## drvfr 1.8448 0.6995 2.637 0.008935 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.526 on 227 degrees of freedom
## Multiple R-squared: 0.8247, Adjusted R-squared: 0.82
## F-statistic: 177.9 on 6 and 227 DF, p-value: < 2.2e-16
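A sketch of the combined fit, followed by an F-test against the class-only model, which is nested in it. The `anova()` comparison is an addition for illustration and is not part of the original output; `fit_all` and `fit_class` are illustrative names.

```r
library(ggplot2)
library(dplyr)

mpg2 <- mpg %>%
  mutate(hog = class == "suv" | class == "pickup",
         minivan = class == "minivan",
         drvf = factor(drv),
         displ2 = displ^2)

# All candidate predictors together
fit_all <- lm(hwy ~ displ + displ2 + minivan + hog + drvf, data = mpg2)
summary(fit_all)
sigma(fit_all)

# Nested comparison: do displ2 and drvf add anything beyond the class model?
fit_class <- lm(hwy ~ displ + minivan + hog, data = mpg2)
anova(fit_class, fit_all)
```

A small p-value in the `anova()` table would indicate that the extra terms improve the fit beyond what the class indicators alone provide.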