Using Augment

library(tidyverse)

## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr

## Conflicts with tidy packages ----------------------------------------------

## filter(): dplyr, stats
## lag():    dplyr, stats

library(broom)

augment() is a function in broom that takes a fitted model, such as a linear model, and adds per-observation results (fitted values, residuals, and diagnostics) as new columns on a data frame.

I’ll use mtcars to demonstrate how it’s used.

lm1 = lm(mpg~disp, data = mtcars)
alm1 = augment(lm1,data=mtcars)

## Warning: Deprecated: please use `purrr::possibly()` instead

str(alm1)

## 'data.frame':    32 obs. of  19 variables:
##  $ .rownames : chr  "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" ...
##  $ mpg       : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl       : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp      : num  160 160 108 258 360 ...
##  $ hp        : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat      : num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt        : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec      : num  16.5 17 18.6 19.4 17 ...
##  $ vs        : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am        : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear      : num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb      : num  4 4 1 1 2 1 4 2 2 4 ...
##  $ .fitted   : num  23 23 25.1 19 14.8 ...
##  $ .se.fit   : num  0.664 0.664 0.815 0.589 0.838 ...
##  $ .resid    : num  -2.01 -2.01 -2.35 2.43 3.94 ...
##  $ .hat      : num  0.0418 0.0418 0.0629 0.0328 0.0663 ...
##  $ .sigma    : num  3.29 3.29 3.28 3.27 3.22 ...
##  $ .cooksd   : num  0.00865 0.00865 0.01868 0.00983 0.05581 ...
##  $ .std.resid: num  -0.63 -0.63 -0.746 0.761 1.253 ...

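Note that the result is a data frame with one row per observation. It contains both the original mtcars columns and the model output, which is stored in columns whose names start with a dot (.fitted, .resid, .hat and so on). This makes it easy to compare fitted and actual values.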
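To see exactly which columns augment() contributed, one option is to compare the column names before and after. This is a minimal sketch; the variable name added is made up for illustration, and the exact set of columns can differ between broom versions.

```r
library(broom)

lm1 <- lm(mpg ~ disp, data = mtcars)
alm1 <- augment(lm1, data = mtcars)

# Columns present in the augmented data but not in the original mtcars
added <- setdiff(names(alm1), names(mtcars))
added

# augment() keeps one row per observation
nrow(alm1) == nrow(mtcars)
```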

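Here’s an example that lists each car’s actual mpg, fitted value and residual, sorted from the most overpredicted car to the most underpredicted.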

alm1 %>% select(.rownames,mpg,.fitted,.resid) %>% arrange(.resid)

##              .rownames  mpg  .fitted     .resid
## 1            Merc 280C 17.8 22.69220 -4.8922007
## 2         Ferrari Dino 19.7 23.62366 -3.9236624
## 3             Merc 280 19.2 22.69220 -3.4922007
## 4           Volvo 142E 21.4 24.61283 -3.2128252
## 5        Toyota Corona 21.5 24.64992 -3.1499188
## 6          Merc 450SLC 15.2 18.23272 -3.0327247
## 7           Datsun 710 22.8 25.14862 -2.3486218
## 8              Valiant 18.1 20.32645 -2.2264528
## 9        Maserati Bora 15.0 17.19410 -2.1941036
## 10           Mazda RX4 21.0 23.00544 -2.0054356
## 11       Mazda RX4 Wag 21.0 23.00544 -2.0054356
## 12          Camaro Z28 13.3 15.17456 -1.8745628
## 13         AMC Javelin 15.2 17.07046 -1.8704583
## 14          Merc 450SE 16.4 18.23272 -1.8327247
## 15            Merc 230 22.8 23.79677 -0.9967659
## 16    Dodge Challenger 15.5 16.49345 -0.9934466
## 17          Merc 450SL 17.3 18.23272 -0.9327247
## 18          Duster 360 14.3 14.76241 -0.4624116
## 19 Lincoln Continental 10.4 10.64090 -0.2408996
## 20  Cadillac Fleetwood 10.4 10.14632  0.2536819
## 21      Ford Pantera L 15.8 15.13335  0.6666524
## 22           Merc 240D 24.4 23.55360  0.8464033
## 23           Fiat X1-9 27.3 26.34386  0.9561397
## 24       Porsche 914-2 26.0 24.64168  1.3583242
## 25      Hornet 4 Drive 21.4 18.96635  2.4336462
## 26   Chrysler Imperial 14.7 11.46520  3.2347980
## 27         Honda Civic 30.4 26.47987  3.9201298
## 28   Hornet Sportabout 18.7 14.76241  3.9375884
## 29        Lotus Europa 30.4 25.68030  4.7197032
## 30            Fiat 128 32.4 26.35622  6.0437752
## 31    Pontiac Firebird 19.2 13.11381  6.0861932
## 32      Toyota Corolla 33.9 26.66946  7.2305403
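The same comparison can be made visually. The sketch below is one possible approach, not part of the original analysis: it plots actual mpg against the fitted values with ggplot2 and adds a dashed reference line where the two would be equal. It also checks that each residual is the actual value minus the fitted value. The object name p is an assumption for illustration.

```r
library(broom)
library(ggplot2)

lm1 <- lm(mpg ~ disp, data = mtcars)
alm1 <- augment(lm1, data = mtcars)

# Actual mpg against fitted mpg; points above the dashed line were underpredicted
p <- ggplot(alm1, aes(x = .fitted, y = mpg)) +
  geom_point() +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed")
p

# Sanity check: residual = actual - fitted (as.numeric() drops any names)
all.equal(as.numeric(alm1$mpg - alm1$.fitted), as.numeric(alm1$.resid))
```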

The augmented data frame is also useful for judging whether other variables might improve the model. Look for variables that are correlated with the residuals from the first model.

cor(alm1$.resid,alm1$wt)

## [1] -0.2167851

cor(alm1$.resid,alm1$hp)

## [1] -0.1993521

cor(alm1$.resid,alm1$drat)

## [1] 0.149288

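Of these three, wt has the largest absolute correlation with the residuals (about -0.22), so it looks like the best candidate for improving the model, although none of the correlations is strong.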
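A natural follow-up, sketched here as an assumption rather than something the original analysis did, is to add wt to the model and see whether the fit improves. The name lm2 is hypothetical. anova() compares the two nested models, and glance() from broom returns a one-row summary of each model.

```r
library(broom)

lm1 <- lm(mpg ~ disp, data = mtcars)
# Hypothetical second model that adds wt as a predictor
lm2 <- lm(mpg ~ disp + wt, data = mtcars)

# F-test comparing the nested models
anova(lm1, lm2)

# Adjusted R-squared for each model, taken from broom's one-row summaries
c(lm1 = glance(lm1)$adj.r.squared, lm2 = glance(lm2)$adj.r.squared)
```

Adjusted R-squared is used here because plain R-squared never decreases when a predictor is added.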