Setup

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.0     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Look

Use glimpse() to look at mpg.

Solution

glimpse(mpg)

## Rows: 234
## Columns: 11
## $ manufacturer <chr> "audi", "audi", "audi", "audi", "audi", "audi", "audi", "…
## $ model        <chr> "a4", "a4", "a4", "a4", "a4", "a4", "a4", "a4 quattro", "…
## $ displ        <dbl> 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, 1.8, 1.8, 2.0, 2.0, 2.…
## $ year         <int> 1999, 1999, 2008, 2008, 1999, 1999, 2008, 1999, 1999, 200…
## $ cyl          <int> 4, 4, 4, 4, 6, 6, 6, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 8, 8, …
## $ trans        <chr> "auto(l5)", "manual(m5)", "manual(m6)", "auto(av)", "auto…
## $ drv          <chr> "f", "f", "f", "f", "f", "f", "f", "4", "4", "4", "4", "4…
## $ cty          <int> 18, 21, 20, 21, 16, 18, 18, 18, 16, 20, 19, 15, 17, 17, 1…
## $ hwy          <int> 29, 29, 31, 30, 26, 26, 27, 26, 25, 28, 27, 25, 25, 25, 2…
## $ fl           <chr> "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p…
## $ class        <chr> "compact", "compact", "compact", "compact", "compact", "c…

Tables

There are several categorical variables which might be useful in predicting highway gas mileage. Get tables of class, trans, fl and year.

Answer

table(mpg$class)

## 
##    2seater    compact    midsize    minivan     pickup subcompact        suv 
##          5         47         41         11         33         35         62

table(mpg$trans)

## 
##   auto(av)   auto(l3)   auto(l4)   auto(l5)   auto(l6)   auto(s4)   auto(s5) 
##          5          2         83         39          6          3          3 
##   auto(s6) manual(m5) manual(m6) 
##         16         58         19

table(mpg$fl)

## 
##   c   d   e   p   r 
##   1   5   8  52 168

table(mpg$year)

## 
## 1999 2008 
##  117  117

Objective

Build a model to predict hwy (highway mpg)

Our first model was an equation of the form

\[hwy = intercept + slope * displ\] ## Run Again

To refresh our minds, run the basic model. Then put the residuals from the model into the mpg dataframe and do a scatterplot of residuals against displ.

Solution

model1 = lm(hwy~displ,data = mpg)
mpg$residuals = model1$residuals
mpg %>% 
  ggplot(aes(x = displ,y=residuals)) +
  geom_point()

Non-Linearity

The graphical analysis of residuals shows that a quadratic model might be more appropriate than a linear model. Create the variable displ2, which is the square of displ. Add this variable to the dataframe. Then run a second model with the quadratic term. Compare the two on the basis of the standard error of the residuals (the RMSE).

Solution

mpg = mpg %>% 
  mutate(displ2 = displ^2)


model2 = lm(hwy~displ + displ2,data = mpg)

summary(model1)

## 
## Call:
## lm(formula = hwy ~ displ, data = mpg)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.1039 -2.1646 -0.2242  2.0589 15.0105 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  35.6977     0.7204   49.55   <2e-16 ***
## displ        -3.5306     0.1945  -18.15   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.836 on 232 degrees of freedom
## Multiple R-squared:  0.5868, Adjusted R-squared:  0.585 
## F-statistic: 329.5 on 1 and 232 DF,  p-value: < 2.2e-16

summary(model2)

## 
## Call:
## lm(formula = hwy ~ displ + displ2, data = mpg)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.6258 -2.1700 -0.7099  2.1768 13.1449 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  49.2450     1.8576  26.510  < 2e-16 ***
## displ       -11.7602     1.0729 -10.961  < 2e-16 ***
## displ2        1.0954     0.1409   7.773 2.51e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.423 on 231 degrees of freedom
## Multiple R-squared:  0.6725, Adjusted R-squared:  0.6696 
## F-statistic: 237.1 on 2 and 231 DF,  p-value: < 2.2e-16

Adding the quadratic term reduces the RMSE from about 3.8 to about 3.4. This is good, but what does this model imply?

Predictions

Create predicted values of hwy for values of displ from .5 liters to 10 liters. Create predictions from these values using model2 and graph them.

Solution

displ = seq(from = .5, to = 10, by = .5)
displ2 = displ^2
new = data.frame(displ,displ2)
preds = predict(model2,new)
results = cbind(new,preds)
results %>% 
  ggplot(aes(x = displ,y = preds)) +
  geom_point()

Hmmmm? A 10 liter engine would get 40 mpg??

We got a better fit to the data we had, but with new data outside the range we had, the model fails. We should look at other ways to improve our model.

Categorical Variables

What are some other features that might influence hwy?

What about vehicle class, drive type, or fuel type?

Use geom_jitter to see how class effects hwy.

Solution

mpg %>% 
  ggplot(aes(x = class, y = hwy)) + geom_jitter()

What do you see?

What about fuel type? Repeat the exercise with fl.

Answer

mpg %>% 
  ggplot(aes(x = fl, y = hwy)) + geom_jitter()

What do you see?

Transmission

Repeat the exercise for trans.

Solution

mpg %>% 
  ggplot(aes(x = trans, y = hwy)) + geom_jitter()

Year

What about year?

Solution

mpg %>% 
  ggplot(aes(x = year, y = hwy)) + geom_jitter()

What do you see?

Drivetrain

What about drv?

Solution

mpg %>% 
  ggplot(aes(x = drv, y = hwy)) + geom_jitter()

The three values are significantly different.

Categorical Variables and Residuals

How could we see the relationship between a categorical variable and the residuals?

Use color in a scatterplot of residuals against displ in model1.

Do this for drv.

Solution

mpg %>% 
  ggplot(aes(x = displ,y=residuals, color = drv)) +
  geom_point()

Class

Do the same for class.

Solution

mpg %>% 
  ggplot(aes(x = displ,y=residuals, color = class)) +
  geom_point()

Alternative Models

Instead of the quadratic term, add the categorical variable class to the basic regression model. What happens to the residual standard error.

model_class = lm(hwy~displ + class,data = mpg)
summary(model_class)

## 
## Call:
## lm(formula = hwy ~ displ + class, data = mpg)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -5.572 -1.569 -0.245  1.355 14.724 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      38.9533     1.7976  21.669  < 2e-16 ***
## displ            -2.2976     0.2132 -10.778  < 2e-16 ***
## classcompact     -5.3122     1.5283  -3.476 0.000610 ***
## classmidsize     -4.9471     1.4722  -3.360 0.000914 ***
## classminivan     -8.7986     1.5939  -5.520 9.26e-08 ***
## classpickup     -11.9232     1.3687  -8.711 6.46e-16 ***
## classsubcompact  -4.6988     1.5097  -3.112 0.002095 ** 
## classsuv        -10.5851     1.3268  -7.978 7.43e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.745 on 226 degrees of freedom
## Multiple R-squared:  0.7939, Adjusted R-squared:  0.7875 
## F-statistic: 124.3 on 7 and 226 DF,  p-value: < 2.2e-16

summary(model1)

## 
## Call:
## lm(formula = hwy ~ displ, data = mpg)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.1039 -2.1646 -0.2242  2.0589 15.0105 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  35.6977     0.7204   49.55   <2e-16 ***
## displ        -3.5306     0.1945  -18.15   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.836 on 232 degrees of freedom
## Multiple R-squared:  0.5868, Adjusted R-squared:  0.585 
## F-statistic: 329.5 on 1 and 232 DF,  p-value: < 2.2e-16

Regression 2

Setup

Look

Solution

Tables

Answer

Objective

Solution

Other Possibilities

Non-Linearity

Solution

Predictions

Solution

Categorical Variables

Solution

Answer

Transmission

Solution

Year

Solution

Drivetrain

Solution

Categorical Variables and Residuals

Solution

Class

Solution

Alternative Models