Regression 2

Harold Nelson

4/3/2022

Setup

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.5     ✓ dplyr   1.0.7
## ✓ tidyr   1.1.4     ✓ stringr 1.4.0
## ✓ readr   2.0.2     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

Look

Use glimpse() to look at mpg.

Solution

glimpse(mpg)
## Rows: 234
## Columns: 11
## $ manufacturer <chr> "audi", "audi", "audi", "audi", "audi", "audi", "audi", "…
## $ model        <chr> "a4", "a4", "a4", "a4", "a4", "a4", "a4", "a4 quattro", "…
## $ displ        <dbl> 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, 1.8, 1.8, 2.0, 2.0, 2.…
## $ year         <int> 1999, 1999, 2008, 2008, 1999, 1999, 2008, 1999, 1999, 200…
## $ cyl          <int> 4, 4, 4, 4, 6, 6, 6, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 8, 8, …
## $ trans        <chr> "auto(l5)", "manual(m5)", "manual(m6)", "auto(av)", "auto…
## $ drv          <chr> "f", "f", "f", "f", "f", "f", "f", "4", "4", "4", "4", "4…
## $ cty          <int> 18, 21, 20, 21, 16, 18, 18, 18, 16, 20, 19, 15, 17, 17, 1…
## $ hwy          <int> 29, 29, 31, 30, 26, 26, 27, 26, 25, 28, 27, 25, 25, 25, 2…
## $ fl           <chr> "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p…
## $ class        <chr> "compact", "compact", "compact", "compact", "compact", "c…

Tables

There are several categorical variables which might be useful in predicting highway gas mileage. Get tables of class, trans, fl and year.

Answer

table(mpg$class)
## 
##    2seater    compact    midsize    minivan     pickup subcompact        suv 
##          5         47         41         11         33         35         62
table(mpg$trans)
## 
##   auto(av)   auto(l3)   auto(l4)   auto(l5)   auto(l6)   auto(s4)   auto(s5) 
##          5          2         83         39          6          3          3 
##   auto(s6) manual(m5) manual(m6) 
##         16         58         19
table(mpg$fl)
## 
##   c   d   e   p   r 
##   1   5   8  52 168
table(mpg$year)
## 
## 1999 2008 
##  117  117
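
The four tables can also be produced in a single call. This is a minimal sketch, assuming mpg is the data set shipped with ggplot2; the names cat_vars and tables are made up for illustration.

```r
library(ggplot2)  # provides the mpg data set

# Hypothetical helper vector: the categorical columns to tabulate
cat_vars <- c("class", "trans", "fl", "year")

# lapply() applies table() to each selected column and
# returns a named list with one frequency table per variable
tables <- lapply(mpg[cat_vars], table)
tables
```

Each element of the list can be pulled out by name, for example tables$fl.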

Objective

Build a model to predict hwy (highway mpg).

Our first model was an equation of the form

\[hwy = \text{intercept} + \text{slope} \times displ\]
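
To connect the equation to R, the sketch below fits the straight-line model and evaluates it for one engine size, once by hand from the coefficients and once with predict(). The 2.0-litre value and the names fit, b, pred_manual and pred_fn are illustrative assumptions, not part of the original exercise.

```r
library(ggplot2)  # provides the mpg data set

# Fit hwy = intercept + slope * displ
fit <- lm(hwy ~ displ, data = mpg)
b <- coef(fit)

# Evaluate the equation by hand for a hypothetical 2.0-litre engine
pred_manual <- b[["(Intercept)"]] + b[["displ"]] * 2.0

# The same prediction via predict()
pred_fn <- predict(fit, newdata = data.frame(displ = 2.0))

c(manual = pred_manual, predict = unname(pred_fn))
```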

Other Possibilities

Non-Linearity

The graphical analysis of residuals we did in our last meeting suggested that a quadratic model might be more appropriate than a linear model. Create the variable displ2, the square of displ, and add it to the data frame. Then re-run the original model and fit a second model that includes the quadratic term. Compare the two on the basis of the residual standard error reported by summary() (an RMSE adjusted for degrees of freedom).

Solution

mpg = mpg %>% 
  mutate(displ2 = displ^2)

model1 = lm(hwy~displ,data = mpg)
summary(model1)
## 
## Call:
## lm(formula = hwy ~ displ, data = mpg)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.1039 -2.1646 -0.2242  2.0589 15.0105 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  35.6977     0.7204   49.55   <2e-16 ***
## displ        -3.5306     0.1945  -18.15   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.836 on 232 degrees of freedom
## Multiple R-squared:  0.5868, Adjusted R-squared:  0.585 
## F-statistic: 329.5 on 1 and 232 DF,  p-value: < 2.2e-16
model2 = lm(hwy~displ + displ2,data = mpg)
summary(model2)
## 
## Call:
## lm(formula = hwy ~ displ + displ2, data = mpg)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.6258 -2.1700 -0.7099  2.1768 13.1449 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  49.2450     1.8576  26.510  < 2e-16 ***
## displ       -11.7602     1.0729 -10.961  < 2e-16 ***
## displ2        1.0954     0.1409   7.773 2.51e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.423 on 231 degrees of freedom
## Multiple R-squared:  0.6725, Adjusted R-squared:  0.6696 
## F-statistic: 237.1 on 2 and 231 DF,  p-value: < 2.2e-16
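
Rather than reading the residual standard error off each summary, it can be extracted with sigma(). The sketch below also computes the plain RMSE (the root of the mean squared residual) to show that the two differ only in the divisor. The object names mpg_sq, m_lin and m_quad are assumptions chosen so the example does not overwrite anything above.

```r
library(ggplot2)  # provides the mpg data set

# Work on a copy so the original mpg is left untouched
mpg_sq <- as.data.frame(mpg)
mpg_sq$displ2 <- mpg_sq$displ^2

m_lin  <- lm(hwy ~ displ, data = mpg_sq)
m_quad <- lm(hwy ~ displ + displ2, data = mpg_sq)

# Residual standard error: sqrt(RSS / residual degrees of freedom)
rse <- c(linear = sigma(m_lin), quadratic = sigma(m_quad))

# Plain RMSE: sqrt(RSS / n), always slightly smaller than the RSE
rmse <- c(linear    = sqrt(mean(residuals(m_lin)^2)),
          quadratic = sqrt(mean(residuals(m_quad)^2)))

rbind(rse, rmse)
```

With 234 rows and only two or three coefficients, the two measures are nearly identical, so either one ranks the models the same way here.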

Categorical Variables

What are some other features that might influence hwy?

What about vehicle class, drive type, or fuel type?

Use geom_jitter to see how class affects hwy.

Solution

mpg %>% 
  ggplot(aes(x = class, y = hwy)) + geom_jitter()

What do you see?
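
A numeric summary can back up what the jitter plot suggests. This is a small sketch using base R's aggregate(); the choice of the median and the name class_summary are assumptions made for illustration.

```r
library(ggplot2)  # provides the mpg data set

# Median highway mileage for each vehicle class
class_summary <- aggregate(hwy ~ class, data = mpg, FUN = median)

# Sort from lowest to highest median to make the ranking easy to read
class_summary[order(class_summary$hwy), ]
```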

What about fuel type? Repeat the exercise with fl.

Answer

mpg %>% 
  ggplot(aes(x = fl, y = hwy)) + geom_jitter()

What do you see?

Transmission

Repeat the exercise for trans.

Solution

mpg %>% 
  ggplot(aes(x = trans, y = hwy)) + geom_jitter()

Year

What about year?

Solution

mpg %>% 
  ggplot(aes(x = year, y = hwy)) + geom_jitter()

What do you see?
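
Because year is stored as an integer, ggplot treats the x-axis as continuous and spreads the two years along a numeric scale. One possible variation, sketched below, wraps year in factor() so it is drawn as two discrete groups. The jitter settings and the name p_year are illustrative assumptions.

```r
library(ggplot2)  # provides the mpg data set

# Treat year as a category; jitter only horizontally so the
# hwy values on the y-axis are not perturbed
p_year <- ggplot(mpg, aes(x = factor(year), y = hwy)) +
  geom_jitter(width = 0.2, height = 0) +
  labs(x = "year")

p_year
```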

Drivetrain

What about drv?

Solution

mpg %>% 
  ggplot(aes(x = drv, y = hwy)) + geom_jitter()

The three drive types show visibly different hwy distributions, although the plot alone does not establish statistical significance.
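
One way to check whether the differences seen in the plot hold up formally is a set of pairwise t-tests. The sketch below is an assumption about how one might do this, not part of the original exercise; it uses pairwise.t.test() from the stats package with the Holm adjustment for multiple comparisons.

```r
library(ggplot2)  # provides the mpg data set

# Compare mean hwy between every pair of drive types,
# adjusting the p-values for multiple comparisons
pw <- pairwise.t.test(mpg$hwy, mpg$drv, p.adjust.method = "holm")

# Lower-triangular matrix of adjusted p-values, one per pair
pw$p.value
```

Small p-values indicate pairs of drive types whose mean highway mileage differs by more than chance would explain.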

Create some new variables.

mpg = mpg %>% 
  mutate(hog = class == "suv" |
               class == "pickup",
         minivan = class == "minivan",
         drv = factor(drv))
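
It is worth confirming that the new indicator variables pick out the intended rows. The sketch below rebuilds them on a copy of the data and cross-tabulates against class; the name mpg_flags and the use of %in% instead of the | expression are assumptions made for illustration.

```r
library(ggplot2)  # provides the mpg data set

# Rebuild the indicators on a copy so nothing above is overwritten
mpg_flags <- as.data.frame(mpg)
mpg_flags$hog     <- mpg_flags$class %in% c("suv", "pickup")
mpg_flags$minivan <- mpg_flags$class == "minivan"

# Every suv and pickup should fall in the TRUE column, all others in FALSE
table(mpg_flags$class, mpg_flags$hog)

# Counts of flagged vehicles
c(hog = sum(mpg_flags$hog), minivan = sum(mpg_flags$minivan))
```

The counts should match the class table shown earlier: 62 SUVs plus 33 pickups, and 11 minivans.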

Examine Data

Do a glimpse of mpg now and look at the new data.

Solution

glimpse(mpg)
## Rows: 234
## Columns: 14
## $ manufacturer <chr> "audi", "audi", "audi", "audi", "audi", "audi", "audi", "…
## $ model        <chr> "a4", "a4", "a4", "a4", "a4", "a4", "a4", "a4 quattro", "…
## $ displ        <dbl> 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, 1.8, 1.8, 2.0, 2.0, 2.…
## $ year         <int> 1999, 1999, 2008, 2008, 1999, 1999, 2008, 1999, 1999, 200…
## $ cyl          <int> 4, 4, 4, 4, 6, 6, 6, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 8, 8, …
## $ trans        <chr> "auto(l5)", "manual(m5)", "manual(m6)", "auto(av)", "auto…
## $ drv          <fct> f, f, f, f, f, f, f, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, r, …
## $ cty          <int> 18, 21, 20, 21, 16, 18, 18, 18, 16, 20, 19, 15, 17, 17, 1…
## $ hwy          <int> 29, 29, 31, 30, 26, 26, 27, 26, 25, 28, 27, 25, 25, 25, 2…
## $ fl           <chr> "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p…
## $ class        <chr> "compact", "compact", "compact", "compact", "compact", "c…
## $ displ2       <dbl> 3.24, 3.24, 4.00, 4.00, 7.84, 7.84, 9.61, 3.24, 3.24, 4.0…
## $ hog          <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, F…
## $ minivan      <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, F…

Add minivan and hog to the quadratic model.

Answer

model3 = lm(hwy ~ displ + displ2 + minivan + hog, data = mpg)
summary(model3)
## 
## Call:
## lm(formula = hwy ~ displ + displ2 + minivan + hog, data = mpg)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.0333 -1.5806 -0.2028  1.3872 13.4478 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  42.1376     1.5400  27.362  < 2e-16 ***
## displ        -7.4552     0.9007  -8.278 1.04e-14 ***
## displ2        0.7145     0.1138   6.276 1.72e-09 ***
## minivanTRUE  -2.8430     0.8573  -3.316  0.00106 ** 
## hogTRUE      -6.1109     0.4730 -12.918  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.614 on 229 degrees of freedom
## Multiple R-squared:  0.8106, Adjusted R-squared:  0.8073 
## F-statistic:   245 on 4 and 229 DF,  p-value: < 2.2e-16

Now try drv as the add-on variable.

Answer

model4 = lm(hwy ~ displ + drv, data = mpg)
summary(model4)
## 
## Call:
## lm(formula = hwy ~ displ + drv, data = mpg)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.9996 -1.9066 -0.3937  1.5778 13.9207 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  30.8254     0.9239  33.364  < 2e-16 ***
## displ        -2.9141     0.2183 -13.352  < 2e-16 ***
## drvf          4.7906     0.5296   9.045  < 2e-16 ***
## drvr          5.2579     0.7336   7.167 1.03e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.082 on 230 degrees of freedom
## Multiple R-squared:  0.7356, Adjusted R-squared:  0.7322 
## F-statistic: 213.3 on 3 and 230 DF,  p-value: < 2.2e-16

What is the best model so far?

Try all of the variables at once.

Answer

model5 = lm(hwy ~ displ + 
              displ2 +
              minivan +
              hog +
              drv,
            data = mpg)
summary(model5)
## 
## Call:
## lm(formula = hwy ~ displ + displ2 + minivan + hog + drv, data = mpg)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.4845 -1.4389 -0.2472  1.4005 13.1170 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  39.4246     1.6507  23.883  < 2e-16 ***
## displ        -6.9468     0.8808  -7.887 1.29e-13 ***
## displ2        0.6427     0.1115   5.762 2.70e-08 ***
## minivanTRUE  -3.3515     0.8548  -3.921 0.000117 ***
## hogTRUE      -4.3825     0.6124  -7.156 1.14e-11 ***
## drvf          2.3372     0.5812   4.021 7.88e-05 ***
## drvr          1.8448     0.6995   2.637 0.008935 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.526 on 227 degrees of freedom
## Multiple R-squared:  0.8247, Adjusted R-squared:   0.82 
## F-statistic: 177.9 on 6 and 227 DF,  p-value: < 2.2e-16
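
To compare the five models side by side, the residual standard error and adjusted R-squared can be collected into one table. This is a sketch under the assumption that the models are refit on a copy of the data named d; the names models and comparison are likewise illustrative.

```r
library(ggplot2)  # provides the mpg data set

# Rebuild the derived variables on a copy of mpg
d <- as.data.frame(mpg)
d$displ2  <- d$displ^2
d$hog     <- d$class %in% c("suv", "pickup")
d$minivan <- d$class == "minivan"
d$drv     <- factor(d$drv)

# The five models fitted in this session
models <- list(
  model1 = lm(hwy ~ displ, data = d),
  model2 = lm(hwy ~ displ + displ2, data = d),
  model3 = lm(hwy ~ displ + displ2 + minivan + hog, data = d),
  model4 = lm(hwy ~ displ + drv, data = d),
  model5 = lm(hwy ~ displ + displ2 + minivan + hog + drv, data = d)
)

# One row per model: residual standard error and adjusted R-squared
comparison <- data.frame(
  model  = names(models),
  rse    = sapply(models, sigma),
  adj_r2 = sapply(models, function(m) summary(m)$adj.r.squared),
  row.names = NULL
)

# Smallest residual standard error first
comparison[order(comparison$rse), ]
```

Adjusted R-squared is included because it penalizes extra predictors, which matters when the models differ in the number of terms.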