Identify Outliers

Harold Nelson

2022-10-11

This is a follow-up to “Models and Visualization”.

Setup

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(moderndive)
## Warning: package 'moderndive' was built under R version 4.4.1
library(plotly)
## 
## Attaching package: 'plotly'
## 
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following object is masked from 'package:graphics':
## 
##     layout

Review Model_b

Recreate model_b and the graphic showing the outliers.

Solution

model_b = lm(cty~poly(displ,2),data = mpg)

mpg$fit_b = model_b$fitted.values
mpg$res_b = model_b$residuals

get_regression_table(model_b)
## # A tibble: 3 × 7
##   term            estimate std_error statistic p_value lower_ci upper_ci
##   <chr>              <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
## 1 intercept           16.9     0.148    114.         0     16.6     17.2
## 2 poly(displ, 2)1    -51.9     2.26     -22.9        0    -56.3    -47.4
## 3 poly(displ, 2)2     18.7     2.26       8.25       0     14.2     23.1
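
Before turning to graphics, the residuals stored above can also be used to list candidate outliers directly. The following is a minimal sketch rather than part of the original solution: the name outliers_b and the choice of five rows are illustrative assumptions, and the model is refit so that the block runs on its own.

```r
library(dplyr)
library(ggplot2)  # provides the mpg data set

# Refit model_b so this sketch is self-contained
model_b = lm(cty ~ poly(displ, 2), data = mpg)

# Attach the residuals and keep the five rows with the largest
# absolute residual (outliers_b and n = 5 are illustrative choices)
outliers_b = mpg %>%
  mutate(res_b = residuals(model_b)) %>%
  slice_max(abs(res_b), n = 5, with_ties = FALSE) %>%
  select(manufacturer, model, displ, trans, cty, res_b)

outliers_b
```

Setting with_ties = FALSE guarantees exactly five rows; the default keeps every row tied with the fifth largest value.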

Graphics

mpg %>% ggplot(aes(x=displ,y=cty)) + 
  geom_point() + 
  geom_point(aes(y = fit_b),color = "red") +
  ggtitle("Model_b Actuals and Fitted Values")

Plotly

We can use plotly to make the plot above interactive. The basic pattern is to save the ggplot object and then pass it to ggplotly().

p = mpg %>% ggplot(aes(x=displ,y=cty)) + 
  geom_point() + 
  geom_point(aes(y = fit_b),color = "red") +
  ggtitle("Model_b Actuals and Fitted Values")

ggplotly(p)

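This is not very useful yet, because the default tooltip shows only the plotted values. We can add information by mapping extra variables to additional aesthetics (here label1 through label4) in aes(). geom_point() does not use these aesthetics, but ggplotly() includes every mapped aesthetic in the tooltip.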

p = mpg %>% ggplot(aes(x=displ,
                       y=cty,
                       label1 = trans,
                       label2 = drv,
                       label3 = manufacturer,
                       label4 = class)) + 
  geom_point() + 
  geom_point(aes(y = fit_b),color = "red") +
  ggtitle("Model_b Actuals and Fitted Values")

ggplotly(p)
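
An alternative to the extra label aesthetics is plotly's unofficial text aesthetic together with the tooltip argument of ggplotly(), which gives control over the tooltip's wording and layout. This is a sketch under assumptions, not the approach used in this document: the names mpg_b, p_text and fig are hypothetical, and the model is refit so that the block runs on its own.

```r
library(ggplot2)
library(plotly)

# Refit model_b and store the fitted values in a copy of mpg
model_b = lm(cty ~ poly(displ, 2), data = mpg)
mpg_b = mpg
mpg_b$fit_b = fitted(model_b)

# Build one tooltip string per car; <br> starts a new line in the tooltip
p_text = ggplot(mpg_b, aes(x = displ,
                           y = cty,
                           text = paste0(manufacturer, " ", model,
                                         "<br>trans: ", trans,
                                         "<br>drv: ", drv,
                                         "<br>class: ", class))) +
  geom_point() +
  geom_point(aes(y = fit_b), color = "red") +
  ggtitle("Model_b Actuals and Fitted Values")

# tooltip = "text" shows only the custom string
fig = ggplotly(p_text, tooltip = "text")
```

Printing fig displays the interactive plot. To show the plotted values as well, tooltip = c("x", "y", "text") could be used instead.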

Model C

Add an indicator variable, econobox, for cars with an engine displacement below 2 liters and a five-speed manual transmission, then refit the quadratic model with this extra term.

mpg = mpg %>% 
  mutate(econobox = displ < 2 & trans == "manual(m5)")
model_c = lm(cty~poly(displ,2) + econobox ,data = mpg)

mpg$fit_c = model_c$fitted.values
mpg$res_c = model_c$residuals

get_regression_table(model_c)
## # A tibble: 4 × 7
##   term            estimate std_error statistic p_value lower_ci upper_ci
##   <chr>              <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
## 1 intercept          16.7      0.146    115.         0    16.4     17.0 
## 2 poly(displ, 2)1   -48.5      2.27     -21.3        0   -53.0    -44.0 
## 3 poly(displ, 2)2    15.1      2.29       6.59       0    10.6     19.6 
## 4 econoboxTRUE        3.53     0.739      4.78       0     2.08     4.99
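
Because model_c is model_b plus one extra term, the two fits can be compared directly. The sketch below is one possible check, not part of the original solution; the names mpg_c, comparison and fit_summary are hypothetical, and both models are refit so that the block runs on its own.

```r
library(dplyr)
library(ggplot2)  # provides the mpg data set

# Recreate the indicator and both models
mpg_c = mpg %>%
  mutate(econobox = displ < 2 & trans == "manual(m5)")
model_b = lm(cty ~ poly(displ, 2), data = mpg_c)
model_c = lm(cty ~ poly(displ, 2) + econobox, data = mpg_c)

# F-test for the nested models: does econobox improve the fit?
comparison = anova(model_b, model_c)
comparison

# Side-by-side fit statistics
fit_summary = data.frame(
  model = c("model_b", "model_c"),
  adj_r_squared = c(summary(model_b)$adj.r.squared,
                    summary(model_c)$adj.r.squared),
  resid_sd = c(sigma(model_b), sigma(model_c))
)
fit_summary
```

A smaller residual standard deviation and a larger adjusted R-squared for model_c would be consistent with the significant econoboxTRUE coefficient in the table above.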

Now Look

Solution

p = mpg %>% ggplot(aes(x=displ,
                       y=cty,
                       label1 = trans,
                       label2 = drv,
                       label3 = manufacturer,
                       label4 = class)) + 
  geom_point() + 
  geom_point(aes(y = fit_c),color = "red") +
  ggtitle("Model_c Actuals and Fitted Values")

ggplotly(p)