Notes Mar 31

Harold Nelson

3/31/2021

Setup

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.3     ✓ purrr   0.3.4
## ✓ tibble  3.0.6     ✓ dplyr   1.0.4
## ✓ tidyr   1.1.2     ✓ stringr 1.4.0
## ✓ readr   1.4.0     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(skimr)
glimpse(mpg)
## Rows: 234
## Columns: 11
## $ manufacturer <chr> "audi", "audi", "audi", "audi", "audi", "audi", "audi", …
## $ model        <chr> "a4", "a4", "a4", "a4", "a4", "a4", "a4", "a4 quattro", …
## $ displ        <dbl> 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, 1.8, 1.8, 2.0, 2.0, 2…
## $ year         <int> 1999, 1999, 2008, 2008, 1999, 1999, 2008, 1999, 1999, 20…
## $ cyl          <int> 4, 4, 4, 4, 6, 6, 6, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 8, 8,…
## $ trans        <chr> "auto(l5)", "manual(m5)", "manual(m6)", "auto(av)", "aut…
## $ drv          <chr> "f", "f", "f", "f", "f", "f", "f", "4", "4", "4", "4", "…
## $ cty          <int> 18, 21, 20, 21, 16, 18, 18, 18, 16, 20, 19, 15, 17, 17, …
## $ hwy          <int> 29, 29, 31, 30, 26, 26, 27, 26, 25, 28, 27, 25, 25, 25, …
## $ fl           <chr> "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "…
## $ class        <chr> "compact", "compact", "compact", "compact", "compact", "…

Get tables of class, trans, fl and year.

Answer

table(mpg$class)
## 
##    2seater    compact    midsize    minivan     pickup subcompact        suv 
##          5         47         41         11         33         35         62
table(mpg$trans)
## 
##   auto(av)   auto(l3)   auto(l4)   auto(l5)   auto(l6)   auto(s4)   auto(s5) 
##          5          2         83         39          6          3          3 
##   auto(s6) manual(m5) manual(m6) 
##         16         58         19
table(mpg$fl)
## 
##   c   d   e   p   r 
##   1   5   8  52 168
table(mpg$year)
## 
## 1999 2008 
##  117  117

Objective

Build a model to predict hwy (highway mpg).

Our first model was an equation of the form

\[\text{hwy} = \text{intercept} + \text{slope} \times \text{displ}\]

Other possibilities

The analysis of residuals showed that a quadratic model might be more appropriate than a linear model.
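
One way to look at those residuals again is sketched below. It refits the straight-line model and plots its residuals against displ with a loess smoother; the names fit_linear, resid_df and resid_plot are illustrative and do not appear elsewhere in these notes. A U-shaped smoother would suggest a missing squared term.

```r
library(ggplot2)  # provides the mpg data and the plotting functions

# Refit the straight-line model on the mpg data.
fit_linear = lm(hwy ~ displ, data = mpg)

# Pair each residual with its displ value (illustrative helper data frame).
resid_df = data.frame(displ = mpg$displ, resid = resid(fit_linear))

# Residuals against displ, with a zero reference line and a loess smoother.
resid_plot = ggplot(resid_df, aes(x = displ, y = resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  geom_smooth(method = "loess", formula = y ~ x, se = FALSE)

resid_plot
```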

What are some other features that might influence hwy?

What about vehicle class, drive type, or fuel type?

Look at class.

Answer

mpg %>% 
  ggplot(aes(x = class, y = hwy)) + geom_jitter()

Three classes deserve attention: suv, pickup, and minivan. These are the classes singled out in the new variables created below.
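
A numeric companion to the jitter plot can make the low-mileage classes easier to pick out. This is only a sketch; class_summary is an illustrative name, and the median is one of several reasonable summaries.

```r
library(dplyr)
library(ggplot2)  # provides the mpg data

# Count and median highway mpg per class, sorted from lowest to highest median.
class_summary = mpg %>%
  group_by(class) %>%
  summarize(n = n(), median_hwy = median(hwy), .groups = "drop") %>%
  arrange(median_hwy)

class_summary
```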

What about fuel type?

Answer

mpg %>% 
  ggplot(aes(x = fl, y = hwy)) + geom_jitter()

What about trans?

Answer

mpg %>% 
  ggplot(aes(x = trans, y = hwy)) + geom_jitter()

No clear pattern is visible across the transmission types.

What about year?

Answer

mpg %>% 
  ggplot(aes(x = year, y = hwy)) + geom_jitter()

No clear difference between the two model years is visible.

What about drv?

Answer

mpg %>% 
  ggplot(aes(x = drv, y = hwy)) + geom_jitter()

The three drive types (4, f, and r) appear to differ noticeably in hwy.
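
One way to back up the visual impression is a per-group summary followed by a one-way ANOVA. This is a sketch rather than part of the original notes; drv_summary and drv_anova are illustrative names.

```r
library(dplyr)
library(ggplot2)  # provides the mpg data

# Mean and spread of hwy within each drive type.
drv_summary = mpg %>%
  group_by(drv) %>%
  summarize(n = n(), mean_hwy = mean(hwy), sd_hwy = sd(hwy), .groups = "drop")

drv_summary

# One-way ANOVA: does mean hwy differ across the three drive types?
drv_anova = anova(lm(hwy ~ factor(drv), data = mpg))

drv_anova
```

A small p-value in the ANOVA table would support treating drv as a predictor, which is what model4 below does.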

Create some new variables.

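# Derived predictors for the models below
mpg2 = mpg %>% 
  mutate(hog = class == "suv" |
               class == "pickup",      # TRUE for SUVs and pickups
         minivan = class == "minivan", # TRUE for minivans
         drvf = factor(drv),           # drive type as a factor
         displ2 = displ^2)             # squared displacement for the quadratic term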
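
A quick sanity check on the new variables, sketched under the assumption that mpg2 is built exactly as above: cross-tabulate class against the two indicators and look at the factor levels of drvf. The name flag_check is illustrative.

```r
library(dplyr)
library(ggplot2)  # provides the mpg data

# Rebuild the derived variables as in the notes.
mpg2 = mpg %>%
  mutate(hog = class == "suv" | class == "pickup",
         minivan = class == "minivan",
         drvf = factor(drv),
         displ2 = displ^2)

# Each class should map to exactly one combination of hog and minivan.
flag_check = mpg2 %>% count(class, hog, minivan)

flag_check

# The first level listed is the baseline that lm() absorbs into the intercept.
levels(mpg2$drvf)
```

From the class table earlier, hog should be TRUE for 62 + 33 = 95 rows and minivan for 11 rows.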

Rerun the basic model and note the residual standard error.

Answer

model1 = lm(hwy ~ displ, data = mpg2)
summary(model1)
## 
## Call:
## lm(formula = hwy ~ displ, data = mpg2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.1039 -2.1646 -0.2242  2.0589 15.0105 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  35.6977     0.7204   49.55   <2e-16 ***
## displ        -3.5306     0.1945  -18.15   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.836 on 232 degrees of freedom
## Multiple R-squared:  0.5868, Adjusted R-squared:  0.585 
## F-statistic: 329.5 on 1 and 232 DF,  p-value: < 2.2e-16

Now add displ2 and note the change in the residual standard error. Also note the significance of the new coefficient.

Answer

model2 = lm(hwy ~ displ + displ2, data = mpg2)
summary(model2)
## 
## Call:
## lm(formula = hwy ~ displ + displ2, data = mpg2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.6258 -2.1700 -0.7099  2.1768 13.1449 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  49.2450     1.8576  26.510  < 2e-16 ***
## displ       -11.7602     1.0729 -10.961  < 2e-16 ***
## displ2        1.0954     0.1409   7.773 2.51e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.423 on 231 degrees of freedom
## Multiple R-squared:  0.6725, Adjusted R-squared:  0.6696 
## F-statistic: 237.1 on 2 and 231 DF,  p-value: < 2.2e-16
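
As a rough worked reading of model2, using only the rounded coefficients printed above, the fitted curve and the displacement at which it reaches its minimum are approximately:

\[\widehat{\text{hwy}} = 49.2450 - 11.7602\,\text{displ} + 1.0954\,\text{displ}^2\]

\[\text{displ}_{\min} = \frac{11.7602}{2 \times 1.0954} \approx 5.37\]

Beyond roughly 5.4 liters the fitted parabola turns upward. That is a property of the quadratic form and should be read with caution for the largest engines.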

Instead of displ2, add the two class variables, minivan and hog.

Answer

model3 = lm(hwy ~ displ + minivan + hog, data = mpg2)
summary(model3)
## 
## Call:
## lm(formula = hwy ~ displ + minivan + hog, data = mpg2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.5368 -1.5390 -0.2406  1.4632 14.5689 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  33.0508     0.5671  58.284  < 2e-16 ***
## displ        -1.9051     0.1846 -10.318  < 2e-16 ***
## minivanTRUE  -4.2271     0.8950  -4.723 4.04e-06 ***
## hogTRUE      -6.8914     0.4930 -13.978  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.824 on 230 degrees of freedom
## Multiple R-squared:  0.778,  Adjusted R-squared:  0.7751 
## F-statistic: 268.6 on 3 and 230 DF,  p-value: < 2.2e-16
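
To see what the two indicator coefficients mean, here is a worked example from the printed model3 estimates. The value displ = 3.0 is an arbitrary choice for illustration.

\[\widehat{\text{hwy}} = 33.0508 - 1.9051\,\text{displ} - 4.2271\,\text{minivan} - 6.8914\,\text{hog}\]

\[\begin{aligned}
\text{neither (displ = 3.0):} \quad & 33.0508 - 1.9051 \times 3.0 \approx 27.34 \\
\text{minivan (displ = 3.0):} \quad & 27.34 - 4.2271 \approx 23.11 \\
\text{hog (displ = 3.0):} \quad & 27.34 - 6.8914 \approx 20.44
\end{aligned}\]

Each indicator shifts the line down by a fixed amount without changing the slope on displ.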

Now try drvf as the add-on variable.

Answer

model4 = lm(hwy ~ displ + drvf, data = mpg2)
summary(model4)
## 
## Call:
## lm(formula = hwy ~ displ + drvf, data = mpg2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.9996 -1.9066 -0.3937  1.5778 13.9207 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  30.8254     0.9239  33.364  < 2e-16 ***
## displ        -2.9141     0.2183 -13.352  < 2e-16 ***
## drvff         4.7906     0.5296   9.045  < 2e-16 ***
## drvfr         5.2579     0.7336   7.167 1.03e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.082 on 230 degrees of freedom
## Multiple R-squared:  0.7356, Adjusted R-squared:  0.7322 
## F-statistic: 213.3 on 3 and 230 DF,  p-value: < 2.2e-16
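
Because drvf is a factor, lm() treats its first level, 4, as the baseline and reports f and r as offsets from it. Read from the printed model4 estimates, the model amounts to three parallel lines:

\[\begin{aligned}
\text{drv = 4:} \quad & \widehat{\text{hwy}} = 30.8254 - 2.9141\,\text{displ} \\
\text{drv = f:} \quad & \widehat{\text{hwy}} = (30.8254 + 4.7906) - 2.9141\,\text{displ} = 35.6160 - 2.9141\,\text{displ} \\
\text{drv = r:} \quad & \widehat{\text{hwy}} = (30.8254 + 5.2579) - 2.9141\,\text{displ} = 36.0833 - 2.9141\,\text{displ}
\end{aligned}\]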

What is the best model so far?
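
One way to answer this is to line up the residual standard error and adjusted R-squared of the four models in a single table. The sketch below refits them so that it runs on its own; the names models and comparison are illustrative.

```r
library(dplyr)
library(ggplot2)  # provides the mpg data

# Rebuild the derived variables as in the notes.
mpg2 = mpg %>%
  mutate(hog = class == "suv" | class == "pickup",
         minivan = class == "minivan",
         drvf = factor(drv),
         displ2 = displ^2)

# The four candidate models fitted so far.
models = list(
  model1 = lm(hwy ~ displ, data = mpg2),
  model2 = lm(hwy ~ displ + displ2, data = mpg2),
  model3 = lm(hwy ~ displ + minivan + hog, data = mpg2),
  model4 = lm(hwy ~ displ + drvf, data = mpg2)
)

# One row per model: residual standard error and adjusted R-squared.
comparison = data.frame(
  model = names(models),
  rse = sapply(models, sigma),
  adj_r2 = sapply(models, function(m) summary(m)$adj.r.squared)
)

# Smallest residual standard error first.
comparison[order(comparison$rse), ]
```

In the summaries above, model3 has the smallest residual standard error (2.824) and the largest adjusted R-squared (0.7751) of the four.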

Try all of the variables at once.

Answer

model5 = lm(hwy ~ displ + displ2 +
              minivan +
              hog +
              drvf,
            data = mpg2)
summary(model5)
## 
## Call:
## lm(formula = hwy ~ displ + displ2 + minivan + hog + drvf, data = mpg2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.4845 -1.4389 -0.2472  1.4005 13.1170 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  39.4246     1.6507  23.883  < 2e-16 ***
## displ        -6.9468     0.8808  -7.887 1.29e-13 ***
## displ2        0.6427     0.1115   5.762 2.70e-08 ***
## minivanTRUE  -3.3515     0.8548  -3.921 0.000117 ***
## hogTRUE      -4.3825     0.6124  -7.156 1.14e-11 ***
## drvff         2.3372     0.5812   4.021 7.88e-05 ***
## drvfr         1.8448     0.6995   2.637 0.008935 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.526 on 227 degrees of freedom
## Multiple R-squared:  0.8247, Adjusted R-squared:   0.82 
## F-statistic: 177.9 on 6 and 227 DF,  p-value: < 2.2e-16
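
As an illustration of how the combined model is used, here are two worked predictions from the printed model5 estimates. Both vehicles are hypothetical: a front-wheel-drive car that is neither a hog nor a minivan with displ = 2.0, and a four-wheel-drive SUV with displ = 4.0.

\[\begin{aligned}
\text{f, displ = 2.0:} \quad & 39.4246 - 6.9468 \times 2.0 + 0.6427 \times 2.0^2 + 2.3372 \approx 30.44 \\
\text{4, hog, displ = 4.0:} \quad & 39.4246 - 6.9468 \times 4.0 + 0.6427 \times 4.0^2 - 4.3825 \approx 17.54
\end{aligned}\]

The four-wheel-drive case adds no drvf term because 4 is the baseline level.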