Harold Nelson
2022-10-08
The material in Healy’s Chapter 6 requires more background in statistical models than most of you have. Ss an alternative, I want you to read Chapter 5 of the ModernDive text. https://moderndive.com/5-regression.html
After this, I’ll get into a simplified version of the use of visualization in building and improving models.
We will need some new packages.
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## âś” ggplot2 3.3.6 âś” purrr 0.3.4
## âś” tibble 3.1.8 âś” dplyr 1.0.10
## âś” tidyr 1.2.1 âś” stringr 1.4.1
## âś” readr 2.1.2 âś” forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## âś– dplyr::filter() masks stats::filter()
## âś– dplyr::lag() masks stats::lag()
Please read/work through Chapter 5 before you proceed. It’s designed for students in an introductory statistics course, which I assume all of you have had. You should be familiar with the basic one-variable linear regression model. This reading will review that topic and show you haw such models are handled in R.
This is included in the ggplot2 package. To begin run an str() on it to understand the contents.
## tibble [234 Ă— 11] (S3: tbl_df/tbl/data.frame)
## $ manufacturer: chr [1:234] "audi" "audi" "audi" "audi" ...
## $ model : chr [1:234] "a4" "a4" "a4" "a4" ...
## $ displ : num [1:234] 1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
## $ year : int [1:234] 1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
## $ cyl : int [1:234] 4 4 4 4 6 6 6 4 4 4 ...
## $ trans : chr [1:234] "auto(l5)" "manual(m5)" "manual(m6)" "auto(av)" ...
## $ drv : chr [1:234] "f" "f" "f" "f" ...
## $ cty : int [1:234] 18 21 20 21 16 18 18 18 16 20 ...
## $ hwy : int [1:234] 29 29 31 30 26 26 27 26 25 28 ...
## $ fl : chr [1:234] "p" "p" "p" "p" ...
## $ class : chr [1:234] "compact" "compact" "compact" "compact" ...
Create model_a using mpg. Use displ to predict cty. Get the coefficients using get_regression_table(). Then use str() to examine the structure of model_a.
## # A tibble: 2 Ă— 7
## term estimate std_error statistic p_value lower_ci upper_ci
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 intercept 26.0 0.482 53.9 0 25.0 26.9
## 2 displ -2.63 0.13 -20.2 0 -2.89 -2.37
## List of 12
## $ coefficients : Named num [1:2] 25.99 -2.63
## ..- attr(*, "names")= chr [1:2] "(Intercept)" "displ"
## $ residuals : Named num [1:234] -3.257 -0.257 -0.731 0.269 -2.626 ...
## ..- attr(*, "names")= chr [1:234] "1" "2" "3" "4" ...
## $ effects : Named num [1:234] -257.893 -51.875 -0.527 0.473 -2.425 ...
## ..- attr(*, "names")= chr [1:234] "(Intercept)" "displ" "" "" ...
## $ rank : int 2
## $ fitted.values: Named num [1:234] 21.3 21.3 20.7 20.7 18.6 ...
## ..- attr(*, "names")= chr [1:234] "1" "2" "3" "4" ...
## $ assign : int [1:2] 0 1
## $ qr :List of 5
## ..$ qr : num [1:234, 1:2] -15.2971 0.0654 0.0654 0.0654 0.0654 ...
## .. ..- attr(*, "dimnames")=List of 2
## .. .. ..$ : chr [1:234] "1" "2" "3" "4" ...
## .. .. ..$ : chr [1:2] "(Intercept)" "displ"
## .. ..- attr(*, "assign")= int [1:2] 0 1
## ..$ qraux: num [1:2] 1.07 1.08
## ..$ pivot: int [1:2] 1 2
## ..$ tol : num 1e-07
## ..$ rank : int 2
## ..- attr(*, "class")= chr "qr"
## $ df.residual : int 232
## $ xlevels : Named list()
## $ call : language lm(formula = cty ~ displ, data = mpg)
## $ terms :Classes 'terms', 'formula' language cty ~ displ
## .. ..- attr(*, "variables")= language list(cty, displ)
## .. ..- attr(*, "factors")= int [1:2, 1] 0 1
## .. .. ..- attr(*, "dimnames")=List of 2
## .. .. .. ..$ : chr [1:2] "cty" "displ"
## .. .. .. ..$ : chr "displ"
## .. ..- attr(*, "term.labels")= chr "displ"
## .. ..- attr(*, "order")= int 1
## .. ..- attr(*, "intercept")= int 1
## .. ..- attr(*, "response")= int 1
## .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
## .. ..- attr(*, "predvars")= language list(cty, displ)
## .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
## .. .. ..- attr(*, "names")= chr [1:2] "cty" "displ"
## $ model :'data.frame': 234 obs. of 2 variables:
## ..$ cty : int [1:234] 18 21 20 21 16 18 18 18 16 20 ...
## ..$ displ: num [1:234] 1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
## ..- attr(*, "terms")=Classes 'terms', 'formula' language cty ~ displ
## .. .. ..- attr(*, "variables")= language list(cty, displ)
## .. .. ..- attr(*, "factors")= int [1:2, 1] 0 1
## .. .. .. ..- attr(*, "dimnames")=List of 2
## .. .. .. .. ..$ : chr [1:2] "cty" "displ"
## .. .. .. .. ..$ : chr "displ"
## .. .. ..- attr(*, "term.labels")= chr "displ"
## .. .. ..- attr(*, "order")= int 1
## .. .. ..- attr(*, "intercept")= int 1
## .. .. ..- attr(*, "response")= int 1
## .. .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
## .. .. ..- attr(*, "predvars")= language list(cty, displ)
## .. .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
## .. .. .. ..- attr(*, "names")= chr [1:2] "cty" "displ"
## - attr(*, "class")= chr "lm"
Get the residuals and fitted values from the model and put them in the dataframe mpg as res_a and fit_a.
## Rows: 234
## Columns: 13
## $ manufacturer <chr> "audi", "audi", "audi", "audi", "audi", "audi", "audi", "…
## $ model <chr> "a4", "a4", "a4", "a4", "a4", "a4", "a4", "a4 quattro", "…
## $ displ <dbl> 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, 1.8, 1.8, 2.0, 2.0, 2.…
## $ year <int> 1999, 1999, 2008, 2008, 1999, 1999, 2008, 1999, 1999, 200…
## $ cyl <int> 4, 4, 4, 4, 6, 6, 6, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 8, 8, …
## $ trans <chr> "auto(l5)", "manual(m5)", "manual(m6)", "auto(av)", "auto…
## $ drv <chr> "f", "f", "f", "f", "f", "f", "f", "4", "4", "4", "4", "4…
## $ cty <int> 18, 21, 20, 21, 16, 18, 18, 18, 16, 20, 19, 15, 17, 17, 1…
## $ hwy <int> 29, 29, 31, 30, 26, 26, 27, 26, 25, 28, 27, 25, 25, 25, 2…
## $ fl <chr> "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p…
## $ class <chr> "compact", "compact", "compact", "compact", "compact", "c…
## $ res_a <dbl> -3.2566002, -0.2566002, -0.7305039, 0.2694961, -2.6261185…
## $ fit_a <dbl> 21.256600, 21.256600, 20.730504, 20.730504, 18.626118, 18…
Look at the correlation between the fitted values and the residuals.
That’s about as close to zero as you could hope for. So, you might say yes.
But you really need to visualize the relationship.
Create a scatterplot of the actuals (cty) on the y-axis and the explanatory variable (displ) on the x-axis.
Add the fitted values to this graph with a second geom_point() using the color red.
Create a scatterplot with displ on the x-axis and res_a on the y_axis.
Note that where the actuals are obove the fitted values, the residuals are positive. There is always a relationship you should remember.
\[Actual = Fitted + Resudual\]
We would like to try to improve on this model. The graphs suggest that there are bends in the actual relationship that can’t be followed by a simple linear model.
Let’s create model_b using a second degree polynomial in displ instead of the simple linear model. Use poly(displ,2) instead of displ in the model. Then repeat the process of adding fit_b and res_b to the existing dataframe.
model_b = lm(cty~poly(displ,2),data = mpg)
mpg$fit_b = model_b$fitted.values
mpg$res_b = model_b$residuals
get_regression_table(model_b)
## # A tibble: 3 Ă— 7
## term estimate std_error statistic p_value lower_ci upper_ci
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 intercept 16.9 0.148 114. 0 16.6 17.2
## 2 poly(displ, 2)1 -51.9 2.26 -22.9 0 -56.3 -47.4
## 3 poly(displ, 2)2 18.7 2.26 8.25 0 14.2 23.1
Repeat the graphics for model_b
mpg %>% ggplot(aes(x=displ,y=cty)) +
geom_point() +
geom_point(aes(y = fit_b),color = "red") +
ggtitle("Model b Actuals and Fitted Values")
The non-linear model fixed the issues with the high-displacement vehicles, bu there are clearly problems with the very low-displacement ones. We need to look at some of the categorical variables for a solution.
We can look at relationships among the residuals and the categorical variables.
mpg %>%
ggplot(aes(x = factor(year),y = res_a)) +
geom_jitter() +
geom_hline(yintercept = 0,color = "red")
mpg %>%
ggplot(aes(x = factor(drv),y = res_a)) +
geom_jitter() +
geom_hline(yintercept = 0,color = "red")
mpg %>%
ggplot(aes(x = factor(class),y = res_a)) +
geom_jitter() +
geom_violin() +
geom_hline(yintercept = 0,color = "red")
Add it to model_b.
model_c = lm(cty~poly(displ,2) + class,data = mpg)
mpg$fit_c = model_c$fitted.values
mpg$res_c = model_c$residuals
get_regression_table(model_c)
## # A tibble: 9 Ă— 7
## term estimate std_error statistic p_value lower_ci upper_ci
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 intercept 18.7 1.12 16.7 0 16.5 21.0
## 2 poly(displ, 2)1 -43.0 3.22 -13.4 0 -49.4 -36.7
## 3 poly(displ, 2)2 14.4 2.48 5.79 0 9.47 19.2
## 4 class: compact -1.42 1.23 -1.15 0.251 -3.84 1.01
## 5 class: midsize -0.867 1.23 -0.706 0.481 -3.29 1.55
## 6 class: minivan -2.26 1.35 -1.68 0.095 -4.92 0.399
## 7 class: pickup -3.30 1.16 -2.84 0.005 -5.59 -1.01
## 8 class: subcompact -0.523 1.21 -0.431 0.667 -2.91 1.87
## 9 class: suv -3.02 1.11 -2.72 0.007 -5.20 -0.829