Harold Nelson
4/3/2022
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
## ✓ tibble 3.1.5 ✓ dplyr 1.0.7
## ✓ tidyr 1.1.4 ✓ stringr 1.4.0
## ✓ readr 2.0.2 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
Use glimpse() to look at mpg.
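A minimal sketch of this step, assuming the tidyverse is installed; `mpg` ships with ggplot2 and `glimpse()` becomes available once the tidyverse is attached.

```r
library(tidyverse)

# Print one line per column: name, type, and the first few values
glimpse(mpg)
```

Unlike `str()`, `glimpse()` fits each column on a single line of the console, which makes wide dataframes easier to scan.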
## Rows: 234
## Columns: 11
## $ manufacturer <chr> "audi", "audi", "audi", "audi", "audi", "audi", "audi", "…
## $ model <chr> "a4", "a4", "a4", "a4", "a4", "a4", "a4", "a4 quattro", "…
## $ displ <dbl> 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, 1.8, 1.8, 2.0, 2.0, 2.…
## $ year <int> 1999, 1999, 2008, 2008, 1999, 1999, 2008, 1999, 1999, 200…
## $ cyl <int> 4, 4, 4, 4, 6, 6, 6, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 8, 8, …
## $ trans <chr> "auto(l5)", "manual(m5)", "manual(m6)", "auto(av)", "auto…
## $ drv <chr> "f", "f", "f", "f", "f", "f", "f", "4", "4", "4", "4", "4…
## $ cty <int> 18, 21, 20, 21, 16, 18, 18, 18, 16, 20, 19, 15, 17, 17, 1…
## $ hwy <int> 29, 29, 31, 30, 26, 26, 27, 26, 25, 28, 27, 25, 25, 25, 2…
## $ fl <chr> "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p…
## $ class <chr> "compact", "compact", "compact", "compact", "compact", "c…
There are several categorical variables which might be useful in predicting highway gas mileage. Get tables of class, trans, fl and year.
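One way to get these counts, sketched with base R's `table()`; the names `class_counts`, `trans_counts`, `fl_counts` and `year_counts` are placeholders chosen for this example.

```r
library(tidyverse)

# Frequency table for each categorical candidate predictor
class_counts = table(mpg$class)
trans_counts = table(mpg$trans)
fl_counts    = table(mpg$fl)
year_counts  = table(mpg$year)

class_counts
trans_counts
fl_counts
year_counts
```

Levels with very few observations deserve caution, since any effect estimated for them rests on only a handful of cars.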
##
## 2seater compact midsize minivan pickup subcompact suv
## 5 47 41 11 33 35 62
##
## auto(av) auto(l3) auto(l4) auto(l5) auto(l6) auto(s4) auto(s5)
## 5 2 83 39 6 3 3
## auto(s6) manual(m5) manual(m6)
## 16 58 19
##
## c d e p r
## 1 5 8 52 168
##
## 1999 2008
## 117 117
Build a model to predict hwy (highway mpg)
Our first model was an equation of the form
\[hwy = intercept + slope \times displ\]

## Run Again
To refresh our minds, run the basic model. Then put the residuals from the model into the mpg dataframe and do a scatterplot of residuals against displ.
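# Fit the basic linear model of highway mpg on engine displacement
model1 = lm(hwy ~ displ, data = mpg)
# Store the residuals in the dataframe so they can be plotted
mpg$residuals = model1$residuals
# Residuals against displ: a curved pattern suggests a missing nonlinear term
mpg %>%
  ggplot(aes(x = displ, y = residuals)) +
  geom_point()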
The residual plot suggests that a quadratic model might be more appropriate than a linear one. Create the variable displ2, the square of displ, and add it to the dataframe. Then fit a second model, model2, that includes the quadratic term. Compare the two models on the basis of the residual standard error (roughly the RMSE).
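A sketch of one way to carry this out. The formula matches the `Call:` line in the second summary, and `sigma()` from base R's stats package extracts the residual standard error that `summary()` prints.

```r
library(tidyverse)

# Baseline linear model
model1 = lm(hwy ~ displ, data = mpg)

# Add the squared displacement as a new column, then fit the quadratic model
mpg$displ2 = mpg$displ^2
model2 = lm(hwy ~ displ + displ2, data = mpg)

summary(model1)
summary(model2)

# Residual standard errors side by side
c(model1 = sigma(model1), model2 = sigma(model2))
```

An alternative that avoids the extra column is `lm(hwy ~ displ + I(displ^2), data = mpg)`, but then the coefficient is labeled `I(displ^2)` rather than `displ2`.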
##
## Call:
## lm(formula = hwy ~ displ, data = mpg)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.1039 -2.1646 -0.2242 2.0589 15.0105
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 35.6977 0.7204 49.55 <2e-16 ***
## displ -3.5306 0.1945 -18.15 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.836 on 232 degrees of freedom
## Multiple R-squared: 0.5868, Adjusted R-squared: 0.585
## F-statistic: 329.5 on 1 and 232 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = hwy ~ displ + displ2, data = mpg)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.6258 -2.1700 -0.7099 2.1768 13.1449
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 49.2450 1.8576 26.510 < 2e-16 ***
## displ -11.7602 1.0729 -10.961 < 2e-16 ***
## displ2 1.0954 0.1409 7.773 2.51e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.423 on 231 degrees of freedom
## Multiple R-squared: 0.6725, Adjusted R-squared: 0.6696
## F-statistic: 237.1 on 2 and 231 DF, p-value: < 2.2e-16
Adding the quadratic term reduces the residual standard error from about 3.8 to about 3.4. This is an improvement, but what does this model imply?
Create a grid of displ values from .5 liters to 10 liters. Use model2 to predict hwy for these values and graph the predictions.
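# Grid of displacements from 0.5 to 10 liters
displ = seq(from = 0.5, to = 10, by = 0.5)
displ2 = displ^2
# Column names must match the predictors used in model2
new = data.frame(displ, displ2)
preds = predict(model2, new)
results = cbind(new, preds)
# Plot the predicted highway mpg against displacement
results %>%
  ggplot(aes(x = displ, y = preds)) +
  geom_point()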
Hmmmm? A 10 liter engine would get 40 mpg??
The quadratic term gave a better fit to the data we had, but the model fails when asked to predict outside the range of displ it was fit on. We should look at other ways to improve our model.
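The reason the predictions turn upward can be read off the coefficients: a parabola with a positive squared term has a minimum at -b1 / (2 * b2). With the estimates printed above, that is about 11.76 / (2 × 1.095) ≈ 5.4 liters, so every prediction beyond that point rises again. A short sketch that computes this from the fitted model; `turning_point` is a name introduced here for illustration.

```r
library(tidyverse)

# Refit the quadratic model so this block runs on its own
mpg$displ2 = mpg$displ^2
model2 = lm(hwy ~ displ + displ2, data = mpg)

# Vertex of the fitted parabola: where predicted hwy stops falling
b = coef(model2)
turning_point = -b[["displ"]] / (2 * b[["displ2"]])
turning_point

# Observed range of displacement, for comparison with the 10 liter grid
range(mpg$displ)
```

Comparing `turning_point` with `range(mpg$displ)` shows how little of the upward branch is supported by actual cars.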
What are some other features that might influence hwy?
What about vehicle class, drive type, or fuel type?
Use geom_jitter to see how class affects hwy.
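A possible version of this plot. The `width` and `height` settings are choices made for this sketch, not part of the exercise: jittering only horizontally keeps each hwy value exact.

```r
library(tidyverse)

# Jittered points of hwy for each vehicle class
p_class = mpg %>%
  ggplot(aes(x = class, y = hwy)) +
  geom_jitter(width = 0.2, height = 0)  # spread points sideways only

p_class
```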
What do you see?
What about fuel type? Repeat the exercise with fl.
Repeat the exercise for trans.
What about year?
What about drv?
The three values of drv (4, f and r) show visibly different distributions of hwy.
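Since the same plot is repeated for fl, trans, year and drv, a small helper can save typing. `jitter_by()` is a hypothetical function written for this sketch; it wraps the column in `factor()` so that the integer `year` is treated as a category rather than a number.

```r
library(tidyverse)

# Hypothetical helper: jitter plot of hwy against any column of mpg,
# with the column name passed as a string
jitter_by = function(var) {
  ggplot(mpg, aes(x = factor(.data[[var]]), y = hwy)) +
    geom_jitter(width = 0.2, height = 0) +
    labs(x = var)
}

jitter_by("fl")
jitter_by("trans")
jitter_by("year")
jitter_by("drv")
```

The `.data[[var]]` pronoun is the tidy evaluation way of looking up a column by its name inside `aes()`.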
How could we see the relationship between a categorical variable and the residuals?
Use color in a scatterplot of residuals against displ in model1.
Do this for drv.
Do the same for class.
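A sketch of both plots, reusing the residuals of model1. Mapping the categorical variable to `color` shows whether one group sits consistently above or below zero, which would suggest the variable carries information that displ alone misses. The names `p_resid_drv` and `p_resid_class` are placeholders.

```r
library(tidyverse)

# Refit the basic model and store its residuals
model1 = lm(hwy ~ displ, data = mpg)
mpg$residuals = model1$residuals

# Residuals against displ, colored by drive type
p_resid_drv = mpg %>%
  ggplot(aes(x = displ, y = residuals, color = drv)) +
  geom_point()
p_resid_drv

# The same plot, colored by vehicle class
p_resid_class = mpg %>%
  ggplot(aes(x = displ, y = residuals, color = class)) +
  geom_point()
p_resid_class
```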