Harold Nelson
4/3/2022
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
## ✓ tibble 3.1.5 ✓ dplyr 1.0.7
## ✓ tidyr 1.1.4 ✓ stringr 1.4.0
## ✓ readr 2.0.2 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
Use glimpse() to look at mpg.
## Rows: 234
## Columns: 11
## $ manufacturer <chr> "audi", "audi", "audi", "audi", "audi", "audi", "audi", "…
## $ model <chr> "a4", "a4", "a4", "a4", "a4", "a4", "a4", "a4 quattro", "…
## $ displ <dbl> 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, 1.8, 1.8, 2.0, 2.0, 2.…
## $ year <int> 1999, 1999, 2008, 2008, 1999, 1999, 2008, 1999, 1999, 200…
## $ cyl <int> 4, 4, 4, 4, 6, 6, 6, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 8, 8, …
## $ trans <chr> "auto(l5)", "manual(m5)", "manual(m6)", "auto(av)", "auto…
## $ drv <chr> "f", "f", "f", "f", "f", "f", "f", "4", "4", "4", "4", "4…
## $ cty <int> 18, 21, 20, 21, 16, 18, 18, 18, 16, 20, 19, 15, 17, 17, 1…
## $ hwy <int> 29, 29, 31, 30, 26, 26, 27, 26, 25, 28, 27, 25, 25, 25, 2…
## $ fl <chr> "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p…
## $ class <chr> "compact", "compact", "compact", "compact", "compact", "c…
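The code that produced this output is not shown. A minimal sketch that should give the same view, assuming the ggplot2 and dplyr packages (both attached by the tidyverse) are installed:

```r
library(ggplot2)  # provides the mpg data set
library(dplyr)    # provides glimpse()

# Print one line per column: name, type, and the first few values
glimpse(mpg)
```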
There are several categorical variables that might be useful in predicting highway gas mileage. Get tables of class, trans, fl, and year.
##
## 2seater compact midsize minivan pickup subcompact suv
## 5 47 41 11 33 35 62
##
## auto(av) auto(l3) auto(l4) auto(l5) auto(l6) auto(s4) auto(s5)
## 5 2 83 39 6 3 3
## auto(s6) manual(m5) manual(m6)
## 16 58 19
##
## c d e p r
## 1 5 8 52 168
##
## 1999 2008
## 117 117
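One way to get these counts, sketched with base R's table() on each column (the original code is not shown, so this is an assumption about how the tables were produced):

```r
library(ggplot2)  # provides the mpg data set

# Count the observations in each level of the four categorical variables
table(mpg$class)
table(mpg$trans)
table(mpg$fl)
table(mpg$year)
```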
Build a model to predict hwy (highway mpg)
Our first model was an equation of the form
\[\text{hwy} = \text{intercept} + \text{slope} \times \text{displ}\]
The graphical analysis of residuals we did in our last meeting showed that a quadratic model might be more appropriate than a linear model. Create the variable displ2, which is the square of displ, and add it to the dataframe. Then re-run the original model and fit a second model that includes the quadratic term. Compare the two using the residual standard error reported by summary(), which is essentially the RMSE computed with the residual degrees of freedom as the divisor.
##
## Call:
## lm(formula = hwy ~ displ, data = mpg)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.1039 -2.1646 -0.2242 2.0589 15.0105
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 35.6977 0.7204 49.55 <2e-16 ***
## displ -3.5306 0.1945 -18.15 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.836 on 232 degrees of freedom
## Multiple R-squared: 0.5868, Adjusted R-squared: 0.585
## F-statistic: 329.5 on 1 and 232 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = hwy ~ displ + displ2, data = mpg)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.6258 -2.1700 -0.7099 2.1768 13.1449
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 49.2450 1.8576 26.510 < 2e-16 ***
## displ -11.7602 1.0729 -10.961 < 2e-16 ***
## displ2 1.0954 0.1409 7.773 2.51e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.423 on 231 degrees of freedom
## Multiple R-squared: 0.6725, Adjusted R-squared: 0.6696
## F-statistic: 237.1 on 2 and 231 DF, p-value: < 2.2e-16
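A sketch of the steps behind the two summaries above. The formulas are taken from the printed Call lines; the object names mod_linear and mod_quadratic are made up for illustration.

```r
library(ggplot2)  # provides the mpg data set
library(dplyr)    # provides mutate() and %>%

# Add the squared displacement as a new column
mpg <- mpg %>% mutate(displ2 = displ^2)

# Fit the linear model and the quadratic model
mod_linear    <- lm(hwy ~ displ, data = mpg)
mod_quadratic <- lm(hwy ~ displ + displ2, data = mpg)

summary(mod_linear)
summary(mod_quadratic)

# sigma() extracts the residual standard error from each fit
c(linear = sigma(mod_linear), quadratic = sigma(mod_quadratic))
```

The quadratic model has the smaller residual standard error (3.423 against 3.836 in the output above), so it fits these data better.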
What are some other features that might influence hwy?
What about vehicle class, drive type, or fuel type?
Use geom_jitter to see how class affects hwy.
What do you see?
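A possible version of the jitter plot. The width and height settings are illustrative choices, not values from the original.

```r
library(ggplot2)  # provides the mpg data set and geom_jitter()

# Jitter the points horizontally only, so overlapping observations
# become visible while the hwy values stay exact
p <- ggplot(mpg, aes(x = class, y = hwy)) +
  geom_jitter(width = 0.2, height = 0)
print(p)
```

The same plot works for the other variables by swapping the x aesthetic, for example x = fl or x = trans. Since year is stored as an integer, x = factor(year) treats it as a category.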
What about fuel type? Repeat the exercise with fl.
Repeat the exercise for trans.
What about year?
What about drv?
The three values of drv (4, f, and r) show noticeably different hwy distributions.
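A quick numerical check to go with the plot: group summaries of hwy for each drive type. This is a sketch; the name drv_summary is hypothetical.

```r
library(ggplot2)  # provides the mpg data set
library(dplyr)    # provides group_by() and summarise()

# Count, mean, and median of hwy within each drive type
drv_summary <- mpg %>%
  group_by(drv) %>%
  summarise(n = n(), mean_hwy = mean(hwy), median_hwy = median(hwy))
drv_summary
```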
Create some new variables: the logical indicators hog and minivan. Also convert drv to a factor.
Do a glimpse of mpg now and look at the new data.
## Rows: 234
## Columns: 14
## $ manufacturer <chr> "audi", "audi", "audi", "audi", "audi", "audi", "audi", "…
## $ model <chr> "a4", "a4", "a4", "a4", "a4", "a4", "a4", "a4 quattro", "…
## $ displ <dbl> 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, 1.8, 1.8, 2.0, 2.0, 2.…
## $ year <int> 1999, 1999, 2008, 2008, 1999, 1999, 2008, 1999, 1999, 200…
## $ cyl <int> 4, 4, 4, 4, 6, 6, 6, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 8, 8, …
## $ trans <chr> "auto(l5)", "manual(m5)", "manual(m6)", "auto(av)", "auto…
## $ drv <fct> f, f, f, f, f, f, f, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, r, …
## $ cty <int> 18, 21, 20, 21, 16, 18, 18, 18, 16, 20, 19, 15, 17, 17, 1…
## $ hwy <int> 29, 29, 31, 30, 26, 26, 27, 26, 25, 28, 27, 25, 25, 25, 2…
## $ fl <chr> "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p…
## $ class <chr> "compact", "compact", "compact", "compact", "compact", "c…
## $ displ2 <dbl> 3.24, 3.24, 4.00, 4.00, 7.84, 7.84, 9.61, 3.24, 3.24, 4.0…
## $ hog <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, F…
## $ minivan <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, F…
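The code that created these columns is not shown, so the sketch below rests on assumptions. The definition of minivan is inferred from its name, and the definition of hog is a guess (pickups and SUVs) used only for illustration. Coefficients from a model fitted with this hog will match the printed output only if the guess matches the original definition.

```r
library(ggplot2)  # provides the mpg data set
library(dplyr)    # provides mutate(), glimpse(), and %>%

mpg <- mpg %>%
  mutate(
    displ2  = displ^2,             # squared displacement
    drv     = factor(drv),         # shown as <fct> in the glimpse output
    minivan = class == "minivan",  # ASSUMPTION: inferred from the name
    hog     = class %in% c("pickup", "suv")  # ASSUMPTION: hypothetical definition
  )

glimpse(mpg)
```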
##
## Call:
## lm(formula = hwy ~ displ + displ2 + minivan + hog, data = mpg)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.0333 -1.5806 -0.2028 1.3872 13.4478
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 42.1376 1.5400 27.362 < 2e-16 ***
## displ -7.4552 0.9007 -8.278 1.04e-14 ***
## displ2 0.7145 0.1138 6.276 1.72e-09 ***
## minivanTRUE -2.8430 0.8573 -3.316 0.00106 **
## hogTRUE -6.1109 0.4730 -12.918 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.614 on 229 degrees of freedom
## Multiple R-squared: 0.8106, Adjusted R-squared: 0.8073
## F-statistic: 245 on 4 and 229 DF, p-value: < 2.2e-16
Now try drv as the add-on variable to the original linear model.
##
## Call:
## lm(formula = hwy ~ displ + drv, data = mpg)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.9996 -1.9066 -0.3937 1.5778 13.9207
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 30.8254 0.9239 33.364 < 2e-16 ***
## displ -2.9141 0.2183 -13.352 < 2e-16 ***
## drvf 4.7906 0.5296 9.045 < 2e-16 ***
## drvr 5.2579 0.7336 7.167 1.03e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.082 on 230 degrees of freedom
## Multiple R-squared: 0.7356, Adjusted R-squared: 0.7322
## F-statistic: 213.3 on 3 and 230 DF, p-value: < 2.2e-16
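A sketch of the call behind this summary, using the formula from the printed Call line. The object name mod_drv is made up for illustration.

```r
library(ggplot2)  # provides the mpg data set

# drv is categorical, so lm() expands it into indicator columns
mod_drv <- lm(hwy ~ displ + drv, data = mpg)
summary(mod_drv)
```

The first level of drv, "4", is the baseline. The coefficients drvf and drvr are the estimated differences in hwy for front-wheel and rear-wheel drive relative to four-wheel drive at the same displ.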
What is the best model so far?
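One way to answer this is to put the residual standard error and adjusted R-squared of each fit side by side. The sketch below covers only the three models whose variables are fully defined in this document. The model with minivan and hog is left out because the definition of hog is not shown; its summary above reports a residual standard error of 2.614. The names mpg_sq, formulas, and comparison are hypothetical.

```r
library(ggplot2)  # provides the mpg data set

# Work on a copy that includes the squared displacement
mpg_sq <- mpg
mpg_sq$displ2 <- mpg_sq$displ^2

formulas <- list(
  linear    = hwy ~ displ,
  quadratic = hwy ~ displ + displ2,
  with_drv  = hwy ~ displ + drv
)

# Fit each model and collect two fit statistics per model
comparison <- sapply(formulas, function(f) {
  fit <- lm(f, data = mpg_sq)
  c(sigma = sigma(fit), adj_r2 = summary(fit)$adj.r.squared)
})
round(comparison, 3)
```

A smaller sigma and a larger adjusted R-squared both point to a better fit. On the numbers printed in this document, the model with minivan and hog is the best of the four so far.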
Try all of the variables at once.
##
## Call:
## lm(formula = hwy ~ displ + displ2 + minivan + hog + drv, data = mpg)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.4845 -1.4389 -0.2472 1.4005 13.1170
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.4246 1.6507 23.883 < 2e-16 ***
## displ -6.9468 0.8808 -7.887 1.29e-13 ***
## displ2 0.6427 0.1115 5.762 2.70e-08 ***
## minivanTRUE -3.3515 0.8548 -3.921 0.000117 ***
## hogTRUE -4.3825 0.6124 -7.156 1.14e-11 ***
## drvf 2.3372 0.5812 4.021 7.88e-05 ***
## drvr 1.8448 0.6995 2.637 0.008935 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.526 on 227 degrees of freedom
## Multiple R-squared: 0.8247, Adjusted R-squared: 0.82
## F-statistic: 177.9 on 6 and 227 DF, p-value: < 2.2e-16