Setup

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Examine the dataframe mpg (from ggplot2).

Solution

glimpse(mpg)

## Rows: 234
## Columns: 11
## $ manufacturer <chr> "audi", "audi", "audi", "audi", "audi", "audi", "audi", "…
## $ model        <chr> "a4", "a4", "a4", "a4", "a4", "a4", "a4", "a4 quattro", "…
## $ displ        <dbl> 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, 1.8, 1.8, 2.0, 2.0, 2.…
## $ year         <int> 1999, 1999, 2008, 2008, 1999, 1999, 2008, 1999, 1999, 200…
## $ cyl          <int> 4, 4, 4, 4, 6, 6, 6, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 8, 8, …
## $ trans        <chr> "auto(l5)", "manual(m5)", "manual(m6)", "auto(av)", "auto…
## $ drv          <chr> "f", "f", "f", "f", "f", "f", "f", "4", "4", "4", "4", "4…
## $ cty          <int> 18, 21, 20, 21, 16, 18, 18, 18, 16, 20, 19, 15, 17, 17, 1…
## $ hwy          <int> 29, 29, 31, 30, 26, 26, 27, 26, 25, 28, 27, 25, 25, 25, 2…
## $ fl           <chr> "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p…
## $ class        <chr> "compact", "compact", "compact", "compact", "compact", "c…

Objective

Build a model to predict hwy (highway mpg)

Our first model is an equation of the form

\[hwy = intercept + slope * displ\] Use lm() to create model1.

Solution

model1 = lm(hwy ~ displ, data = mpg)

Now use summary to look at model1.

Solution

summary(model1)

## 
## Call:
## lm(formula = hwy ~ displ, data = mpg)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.1039 -2.1646 -0.2242  2.0589 15.0105 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  35.6977     0.7204   49.55   <2e-16 ***
## displ        -3.5306     0.1945  -18.15   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.836 on 232 degrees of freedom
## Multiple R-squared:  0.5868, Adjusted R-squared:  0.585 
## F-statistic: 329.5 on 1 and 232 DF,  p-value: < 2.2e-16

Interpret

What are the values of intercept and slope?

Solution

intercept = 35.6977

slope = -3.5306

Equation

Write down the equation

Solution

hwy = 35.6977 - 3.5306 * displ

Use this equation to calculate the value of hwy when displ = 2. call this hwy2.

Solution

hwy2 =  35.6977 - 3.5306 * 2
hwy2

## [1] 28.6365

Do the same for an engine with 3 liters displacement.

Solution

hwy3 =  35.6977 - 3.5306 * 3
hwy3

## [1] 25.1059

Subtract hwy2 from hwy3.

Solution

hwy3 - hwy2

## [1] -3.5306

Note the interpretation.

Increasing the displ by 1 “adds” the value of slope to hwy

Basic Framework

The data does not lie exactly on the line we have estimated.

Show this graphically using a scatterplot of hwy and displ. Add a smoother with method = “lm”.

Solution

mpg %>% 
  ggplot(aes(x = displ, y = hwy)) +
  geom_point(color = "blue") +
  geom_smooth(method = "lm", color = "red")

## `geom_smooth()` using formula = 'y ~ x'

We summarize the relationship in general.

Actual = Predicted + Residual

Look at the contents of model1 using str().

Solution

str(model1)

## List of 12
##  $ coefficients : Named num [1:2] 35.7 -3.53
##   ..- attr(*, "names")= chr [1:2] "(Intercept)" "displ"
##  $ residuals    : Named num [1:234] -0.343 -0.343 2.364 1.364 0.188 ...
##   ..- attr(*, "names")= chr [1:234] "1" "2" "3" "4" ...
##  $ effects      : Named num [1:234] -358.566 -69.626 2.405 1.405 0.218 ...
##   ..- attr(*, "names")= chr [1:234] "(Intercept)" "displ" "" "" ...
##  $ rank         : int 2
##  $ fitted.values: Named num [1:234] 29.3 29.3 28.6 28.6 25.8 ...
##   ..- attr(*, "names")= chr [1:234] "1" "2" "3" "4" ...
##  $ assign       : int [1:2] 0 1
##  $ qr           :List of 5
##   ..$ qr   : num [1:234, 1:2] -15.2971 0.0654 0.0654 0.0654 0.0654 ...
##   .. ..- attr(*, "dimnames")=List of 2
##   .. .. ..$ : chr [1:234] "1" "2" "3" "4" ...
##   .. .. ..$ : chr [1:2] "(Intercept)" "displ"
##   .. ..- attr(*, "assign")= int [1:2] 0 1
##   ..$ qraux: num [1:2] 1.07 1.08
##   ..$ pivot: int [1:2] 1 2
##   ..$ tol  : num 1e-07
##   ..$ rank : int 2
##   ..- attr(*, "class")= chr "qr"
##  $ df.residual  : int 232
##  $ xlevels      : Named list()
##  $ call         : language lm(formula = hwy ~ displ, data = mpg)
##  $ terms        :Classes 'terms', 'formula'  language hwy ~ displ
##   .. ..- attr(*, "variables")= language list(hwy, displ)
##   .. ..- attr(*, "factors")= int [1:2, 1] 0 1
##   .. .. ..- attr(*, "dimnames")=List of 2
##   .. .. .. ..$ : chr [1:2] "hwy" "displ"
##   .. .. .. ..$ : chr "displ"
##   .. ..- attr(*, "term.labels")= chr "displ"
##   .. ..- attr(*, "order")= int 1
##   .. ..- attr(*, "intercept")= int 1
##   .. ..- attr(*, "response")= int 1
##   .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
##   .. ..- attr(*, "predvars")= language list(hwy, displ)
##   .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
##   .. .. ..- attr(*, "names")= chr [1:2] "hwy" "displ"
##  $ model        :'data.frame':   234 obs. of  2 variables:
##   ..$ hwy  : int [1:234] 29 29 31 30 26 26 27 26 25 28 ...
##   ..$ displ: num [1:234] 1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
##   ..- attr(*, "terms")=Classes 'terms', 'formula'  language hwy ~ displ
##   .. .. ..- attr(*, "variables")= language list(hwy, displ)
##   .. .. ..- attr(*, "factors")= int [1:2, 1] 0 1
##   .. .. .. ..- attr(*, "dimnames")=List of 2
##   .. .. .. .. ..$ : chr [1:2] "hwy" "displ"
##   .. .. .. .. ..$ : chr "displ"
##   .. .. ..- attr(*, "term.labels")= chr "displ"
##   .. .. ..- attr(*, "order")= int 1
##   .. .. ..- attr(*, "intercept")= int 1
##   .. .. ..- attr(*, "response")= int 1
##   .. .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
##   .. .. ..- attr(*, "predvars")= language list(hwy, displ)
##   .. .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
##   .. .. .. ..- attr(*, "names")= chr [1:2] "hwy" "displ"
##  - attr(*, "class")= chr "lm"

Extract the item residuals from the model and see what it is using str().

Solution

residuals = model1$residuals
str(residuals)

##  Named num [1:234] -0.343 -0.343 2.364 1.364 0.188 ...
##  - attr(*, "names")= chr [1:234] "1" "2" "3" "4" ...

Create mpg_augment by adding the vector of residuals to mpg using cbind.

Solution

mpg_augment = cbind(mpg,residuals)

Now create a scatterplot of displ and residuals.

Solution

mpg_augment %>% 
  ggplot(aes(x = displ, y = residuals)) +
  geom_point(color = "blue") +
  geom_smooth(method = "lm", color = "red")

## `geom_smooth()` using formula = 'y ~ x'

Predict

We did something pretty crude when we copied numbers manually from the output of summary to build an equation to make predictions. There is a function predict() which will take a model and a dataframe of new values of the independent variable(s) to make predictions.

Use predict to create estimates of hwy for values of displ running from 1 to 11.Note that the values of displ must be in a dataframe.

Create a dataframe containing the values of displ and the predictions of hwy.

Solution

displ = c(1:11)
newdata = data.frame(displ)
predictions = predict(model1,newdata)

results = data.frame(displ,predictions)

See Predictions…

together with real data.

Solution

mpg %>% 
  ggplot(aes(displ,hwy)) + 
  geom_point() +
  geom_point(data = results,
             color = "red",
             aes(displ,predictions))

This shows nonsense caused by extrapolation. Make the engine big enough and you get a negative value of mileage. Going backwards??

A Single Categorical Variable

Create a model which predicts hwy based on drv. Use the factor version of drv. Produce a table of the values of drv.

Solution

model2 = lm(hwy~factor(drv),data = mpg)

summary(model2)

## 
## Call:
## lm(formula = hwy ~ factor(drv), data = mpg)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -11.160  -2.175  -1.000   1.960  15.840 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   19.1748     0.4037  47.501   <2e-16 ***
## factor(drv)f   8.9856     0.5668  15.852   <2e-16 ***
## factor(drv)r   1.8252     0.9134   1.998   0.0469 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.097 on 231 degrees of freedom
## Multiple R-squared:  0.5307, Adjusted R-squared:  0.5266 
## F-statistic: 130.6 on 2 and 231 DF,  p-value: < 2.2e-16

table(mpg$drv)

## 
##   4   f   r 
## 103 106  25

Note that the value 4 does not have a coefficient. How do you interpret the coefficients including the constant?

Regression 1

Setup

Solution

Objective

Solution

Solution

Interpret

Solution

Equation

Solution

Solution

Solution

Solution

Basic Framework

Solution

Solution

Solution

Solution

Solution

Predict

Solution

See Predictions…

Solution

A Single Categorical Variable

Solution

Solution