Models and Visualization

Harold Nelson

2022-10-08

Models and Visualization

The material in Healy’s Chapter 6 requires more background in statistical models than most of you have. Ss an alternative, I want you to read Chapter 5 of the ModernDive text. https://moderndive.com/5-regression.html

After this, I’ll get into a simplified version of the use of visualization in building and improving models.

We will need some new packages.

Setup

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## âś” ggplot2 3.3.6      âś” purrr   0.3.4 
## âś” tibble  3.1.8      âś” dplyr   1.0.10
## âś” tidyr   1.2.1      âś” stringr 1.4.1 
## âś” readr   2.1.2      âś” forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## âś– dplyr::filter() masks stats::filter()
## âś– dplyr::lag()    masks stats::lag()
library(moderndive)
library(skimr)
library(gapminder)
library(broom)

Please read/work through Chapter 5 before you proceed. It’s designed for students in an introductory statistics course, which I assume all of you have had. You should be familiar with the basic one-variable linear regression model. This reading will review that topic and show you haw such models are handled in R.

The mpg Dataframe

This is included in the ggplot2 package. To begin run an str() on it to understand the contents.

Solution

str(mpg)
## tibble [234 Ă— 11] (S3: tbl_df/tbl/data.frame)
##  $ manufacturer: chr [1:234] "audi" "audi" "audi" "audi" ...
##  $ model       : chr [1:234] "a4" "a4" "a4" "a4" ...
##  $ displ       : num [1:234] 1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
##  $ year        : int [1:234] 1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
##  $ cyl         : int [1:234] 4 4 4 4 6 6 6 4 4 4 ...
##  $ trans       : chr [1:234] "auto(l5)" "manual(m5)" "manual(m6)" "auto(av)" ...
##  $ drv         : chr [1:234] "f" "f" "f" "f" ...
##  $ cty         : int [1:234] 18 21 20 21 16 18 18 18 16 20 ...
##  $ hwy         : int [1:234] 29 29 31 30 26 26 27 26 25 28 ...
##  $ fl          : chr [1:234] "p" "p" "p" "p" ...
##  $ class       : chr [1:234] "compact" "compact" "compact" "compact" ...

Regression Demo

Create model_a using mpg. Use displ to predict cty. Get the coefficients using get_regression_table(). Then use str() to examine the structure of model_a.

Solution

model_a = lm(cty~displ,data = mpg)
get_regression_table(model_a)
## # A tibble: 2 Ă— 7
##   term      estimate std_error statistic p_value lower_ci upper_ci
##   <chr>        <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
## 1 intercept    26.0      0.482      53.9       0    25.0     26.9 
## 2 displ        -2.63     0.13      -20.2       0    -2.89    -2.37
str(model_a)
## List of 12
##  $ coefficients : Named num [1:2] 25.99 -2.63
##   ..- attr(*, "names")= chr [1:2] "(Intercept)" "displ"
##  $ residuals    : Named num [1:234] -3.257 -0.257 -0.731 0.269 -2.626 ...
##   ..- attr(*, "names")= chr [1:234] "1" "2" "3" "4" ...
##  $ effects      : Named num [1:234] -257.893 -51.875 -0.527 0.473 -2.425 ...
##   ..- attr(*, "names")= chr [1:234] "(Intercept)" "displ" "" "" ...
##  $ rank         : int 2
##  $ fitted.values: Named num [1:234] 21.3 21.3 20.7 20.7 18.6 ...
##   ..- attr(*, "names")= chr [1:234] "1" "2" "3" "4" ...
##  $ assign       : int [1:2] 0 1
##  $ qr           :List of 5
##   ..$ qr   : num [1:234, 1:2] -15.2971 0.0654 0.0654 0.0654 0.0654 ...
##   .. ..- attr(*, "dimnames")=List of 2
##   .. .. ..$ : chr [1:234] "1" "2" "3" "4" ...
##   .. .. ..$ : chr [1:2] "(Intercept)" "displ"
##   .. ..- attr(*, "assign")= int [1:2] 0 1
##   ..$ qraux: num [1:2] 1.07 1.08
##   ..$ pivot: int [1:2] 1 2
##   ..$ tol  : num 1e-07
##   ..$ rank : int 2
##   ..- attr(*, "class")= chr "qr"
##  $ df.residual  : int 232
##  $ xlevels      : Named list()
##  $ call         : language lm(formula = cty ~ displ, data = mpg)
##  $ terms        :Classes 'terms', 'formula'  language cty ~ displ
##   .. ..- attr(*, "variables")= language list(cty, displ)
##   .. ..- attr(*, "factors")= int [1:2, 1] 0 1
##   .. .. ..- attr(*, "dimnames")=List of 2
##   .. .. .. ..$ : chr [1:2] "cty" "displ"
##   .. .. .. ..$ : chr "displ"
##   .. ..- attr(*, "term.labels")= chr "displ"
##   .. ..- attr(*, "order")= int 1
##   .. ..- attr(*, "intercept")= int 1
##   .. ..- attr(*, "response")= int 1
##   .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
##   .. ..- attr(*, "predvars")= language list(cty, displ)
##   .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
##   .. .. ..- attr(*, "names")= chr [1:2] "cty" "displ"
##  $ model        :'data.frame':   234 obs. of  2 variables:
##   ..$ cty  : int [1:234] 18 21 20 21 16 18 18 18 16 20 ...
##   ..$ displ: num [1:234] 1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
##   ..- attr(*, "terms")=Classes 'terms', 'formula'  language cty ~ displ
##   .. .. ..- attr(*, "variables")= language list(cty, displ)
##   .. .. ..- attr(*, "factors")= int [1:2, 1] 0 1
##   .. .. .. ..- attr(*, "dimnames")=List of 2
##   .. .. .. .. ..$ : chr [1:2] "cty" "displ"
##   .. .. .. .. ..$ : chr "displ"
##   .. .. ..- attr(*, "term.labels")= chr "displ"
##   .. .. ..- attr(*, "order")= int 1
##   .. .. ..- attr(*, "intercept")= int 1
##   .. .. ..- attr(*, "response")= int 1
##   .. .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
##   .. .. ..- attr(*, "predvars")= language list(cty, displ)
##   .. .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
##   .. .. .. ..- attr(*, "names")= chr [1:2] "cty" "displ"
##  - attr(*, "class")= chr "lm"

Add to the Dataframe

Get the residuals and fitted values from the model and put them in the dataframe mpg as res_a and fit_a.

Solution

mpg$res_a = model_a$residuals
mpg$fit_a = model_a$fitted.values
glimpse(mpg)
## Rows: 234
## Columns: 13
## $ manufacturer <chr> "audi", "audi", "audi", "audi", "audi", "audi", "audi", "…
## $ model        <chr> "a4", "a4", "a4", "a4", "a4", "a4", "a4", "a4 quattro", "…
## $ displ        <dbl> 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, 1.8, 1.8, 2.0, 2.0, 2.…
## $ year         <int> 1999, 1999, 2008, 2008, 1999, 1999, 2008, 1999, 1999, 200…
## $ cyl          <int> 4, 4, 4, 4, 6, 6, 6, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 8, 8, …
## $ trans        <chr> "auto(l5)", "manual(m5)", "manual(m6)", "auto(av)", "auto…
## $ drv          <chr> "f", "f", "f", "f", "f", "f", "f", "4", "4", "4", "4", "4…
## $ cty          <int> 18, 21, 20, 21, 16, 18, 18, 18, 16, 20, 19, 15, 17, 17, 1…
## $ hwy          <int> 29, 29, 31, 30, 26, 26, 27, 26, 25, 28, 27, 25, 25, 25, 2…
## $ fl           <chr> "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p…
## $ class        <chr> "compact", "compact", "compact", "compact", "compact", "c…
## $ res_a        <dbl> -3.2566002, -0.2566002, -0.7305039, 0.2694961, -2.6261185…
## $ fit_a        <dbl> 21.256600, 21.256600, 20.730504, 20.730504, 18.626118, 18…

Look at the correlation between the fitted values and the residuals.

Solution

cor(mpg$fit_a,mpg$res_a)
## [1] 1.174038e-16

Are we Done?

That’s about as close to zero as you could hope for. So, you might say yes.

But you really need to visualize the relationship.

Relationships

Create a scatterplot of the actuals (cty) on the y-axis and the explanatory variable (displ) on the x-axis.

mpg %>% ggplot(aes(x=displ,y=cty)) + geom_point()

Fitted Values

Add the fitted values to this graph with a second geom_point() using the color red.

Solution

mpg %>% ggplot(aes(x=displ,y=cty)) + 
  geom_point() + 
  geom_point(aes(y = fit_a),color = "red")

The Residuals

Create a scatterplot with displ on the x-axis and res_a on the y_axis.

Solution

mpg %>% ggplot(aes(x = displ, y = res_a)) + geom_point()

Note that where the actuals are obove the fitted values, the residuals are positive. There is always a relationship you should remember.

\[Actual = Fitted + Resudual\]

We would like to try to improve on this model. The graphs suggest that there are bends in the actual relationship that can’t be followed by a simple linear model.

Let’s create model_b using a second degree polynomial in displ instead of the simple linear model. Use poly(displ,2) instead of displ in the model. Then repeat the process of adding fit_b and res_b to the existing dataframe.

Solution

model_b = lm(cty~poly(displ,2),data = mpg)

mpg$fit_b = model_b$fitted.values
mpg$res_b = model_b$residuals

get_regression_table(model_b)
## # A tibble: 3 Ă— 7
##   term            estimate std_error statistic p_value lower_ci upper_ci
##   <chr>              <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
## 1 intercept           16.9     0.148    114.         0     16.6     17.2
## 2 poly(displ, 2)1    -51.9     2.26     -22.9        0    -56.3    -47.4
## 3 poly(displ, 2)2     18.7     2.26       8.25       0     14.2     23.1

Repeat the graphics for model_b

Solution

mpg %>% ggplot(aes(x=displ,y=cty)) + 
  geom_point() + 
  geom_point(aes(y = fit_b),color = "red") +
  ggtitle("Model b Actuals and Fitted Values")

The non-linear model fixed the issues with the high-displacement vehicles, bu there are clearly problems with the very low-displacement ones. We need to look at some of the categorical variables for a solution.

We can look at relationships among the residuals and the categorical variables.

Look at year

Solution

mpg %>% 
  ggplot(aes(x = factor(year),y = res_a)) + 
  geom_jitter() +
  geom_hline(yintercept = 0,color = "red")

Look at drv

Solution

mpg %>% 
  ggplot(aes(x = factor(drv),y = res_a)) + 
  geom_jitter() +
  geom_hline(yintercept = 0,color = "red")

Look at class

Solution

mpg %>% 
  ggplot(aes(x = factor(class),y = res_a)) + 
  geom_jitter() +
  geom_violin() +
  geom_hline(yintercept = 0,color = "red")

Add class

Add it to model_b.

Solution

model_c = lm(cty~poly(displ,2) + class,data = mpg)

mpg$fit_c = model_c$fitted.values
mpg$res_c = model_c$residuals

get_regression_table(model_c)
## # A tibble: 9 Ă— 7
##   term              estimate std_error statistic p_value lower_ci upper_ci
##   <chr>                <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
## 1 intercept           18.7        1.12    16.7     0        16.5    21.0  
## 2 poly(displ, 2)1    -43.0        3.22   -13.4     0       -49.4   -36.7  
## 3 poly(displ, 2)2     14.4        2.48     5.79    0         9.47   19.2  
## 4 class: compact      -1.42       1.23    -1.15    0.251    -3.84    1.01 
## 5 class: midsize      -0.867      1.23    -0.706   0.481    -3.29    1.55 
## 6 class: minivan      -2.26       1.35    -1.68    0.095    -4.92    0.399
## 7 class: pickup       -3.30       1.16    -2.84    0.005    -5.59   -1.01 
## 8 class: subcompact   -0.523      1.21    -0.431   0.667    -2.91    1.87 
## 9 class: suv          -3.02       1.11    -2.72    0.007    -5.20   -0.829

Graphics

mpg %>% 
  ggplot(aes(x = displ, y = cty)) +
  geom_point() +
  geom_point(aes(y = fit_c),color = "red")

That didn’t solve our problem. Try to find something that works.