Week 9 | Data Dive — Regression Diagnostics

Due: Mon Mar 18, 2024 11:59pm

Task(s)

The purpose of this week’s data dive is for you to think critically about interpreting and diagnosing regression models.

Your RMarkdown notebook for this data dive should contain the following:

  • Refer to the simple linear regression model you built last week. Include 1-3 more variables in your regression model.

    • Try out either an interaction term or a binary term to start.

    • Consider adding other integer or continuous variables.

    • For each new variable you try, explain why you should include it, or not. E.g., are there any issues with multicollinearity?

    • Your model for this data dive should have 2-4 terms.

  • Evaluate this model.

    • At the very least, use the 5 diagnostic plots discussed in class to identify any issues with your model.

    • For each plot, point out any indications of issues with the model. Otherwise, explain how the plot supports the claim that an assumption is met.

    • Try to measure the severity of any issues as well as the level of confidence you have in an assumption being met.

For each of the above tasks, you must explain to the reader what insight was gathered, its significance, and any further questions you have which might need to be further investigated.


Data

In this data dive I will be using NFL standings data, which comes from Pro Football Reference team standings.


standings <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-04/standings.csv')
## Rows: 638 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (4): team, team_name, playoffs, sb_winner
## dbl (11): year, wins, loss, points_for, points_against, points_differential,...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
standings
## # A tibble: 638 × 15
##    team         team_name  year  wins  loss points_for points_against
##    <chr>        <chr>     <dbl> <dbl> <dbl>      <dbl>          <dbl>
##  1 Miami        Dolphins   2000    11     5        323            226
##  2 Indianapolis Colts      2000    10     6        429            326
##  3 New York     Jets       2000     9     7        321            321
##  4 Buffalo      Bills      2000     8     8        315            350
##  5 New England  Patriots   2000     5    11        276            338
##  6 Tennessee    Titans     2000    13     3        346            191
##  7 Baltimore    Ravens     2000    12     4        333            165
##  8 Pittsburgh   Steelers   2000     9     7        321            255
##  9 Jacksonville Jaguars    2000     7     9        367            327
## 10 Cincinnati   Bengals    2000     4    12        185            359
## # ℹ 628 more rows
## # ℹ 8 more variables: points_differential <dbl>, margin_of_victory <dbl>,
## #   strength_of_schedule <dbl>, simple_rating <dbl>, offensive_ranking <dbl>,
## #   defensive_ranking <dbl>, playoffs <chr>, sb_winner <chr>

Previous Data Dive Model

In the previous week's data dive I created the following simple linear regression model.

model_1 <- lm(wins ~ points_for, standings)
model_1$coefficients
## (Intercept)  points_for 
## -3.06141069  0.03153369
library(ggplot2)
standings |>
  ggplot(mapping = aes(x = points_for, y = wins)) +
  geom_point(size = 2) +
  geom_smooth(method = 'lm', se = FALSE, color = 'darkblue')
## `geom_smooth()` using formula = 'y ~ x'
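To make the fitted slope concrete: each additional 100 points scored corresponds to roughly 3.2 additional wins (100 × 0.0315). As a quick illustration (the specific point totals below are arbitrary choices of mine):

# predicted wins for hypothetical teams scoring 250, 350, and 450 points
predict(model_1, newdata = data.frame(points_for = c(250, 350, 450)))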

Null Hypothesis

In order to determine whether an additional variable is worth having in my model, we need a null hypothesis to compare against. For this data dive I will be using the following null hypothesis:

The coefficient of the variable in the model is zero.

Confidence Levels

In this data dive I will be using a significance level of 0.05. In other words, a p-value needs to be less than 0.05 for a term to be considered significant.

Addition of a Binary Variable

The previous model used the ‘points_for’ variable to explain the number of ‘wins’ a team had. With the goal of explaining the ‘wins’ column more accurately, I will add a binary variable, ‘playoffs’. This column describes whether a team made the playoffs that season. Whether a team makes the playoffs is not based solely on its wins, but also on the records of the other teams in its division and conference. Generally speaking, however, teams with many wins make the playoffs and teams with few wins do not.
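Before fitting, it is worth confirming how the two playoff categories are balanced; a quick check:

# count of team-seasons in each 'playoffs' category
table(standings$playoffs)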

library(broom)
model_2 <- lm(wins ~ points_for + playoffs, standings)
tidy(model_2, conf.int = TRUE)
## # A tibble: 3 × 7
##   term             estimate std.error statistic  p.value conf.low conf.high
##   <chr>               <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
## 1 (Intercept)        0.234    0.352       0.664 5.07e- 1  -0.458     0.925 
## 2 points_for         0.0184   0.00107    17.2   1.52e-54   0.0163    0.0206
## 3 playoffsPlayoffs   3.43     0.158      21.6   2.90e-78   3.12      3.74
model_2 |> 
  summary()
## 
## Call:
## lm(formula = wins ~ points_for + playoffs, data = standings)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.1776 -1.0785  0.0549  1.1408  4.4778 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      0.233876   0.352179   0.664    0.507    
## points_for       0.018447   0.001075  17.167   <2e-16 ***
## playoffsPlayoffs 3.426101   0.158256  21.649   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.601 on 635 degrees of freedom
## Multiple R-squared:  0.7313, Adjusted R-squared:  0.7305 
## F-statistic: 864.3 on 2 and 635 DF,  p-value: < 2.2e-16

The ‘playoffs’ variable appears to be a good addition to the model. With a p-value of 2.90e-78, we can be extremely confident that the coefficient of the ‘playoffs’ variable is not zero. From the summary() call we get an adjusted R-squared value of 0.73.
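Because model_1 is nested inside model_2, the improvement can also be tested directly with a nested-model F-test; a minimal sketch:

# F-test of H0: the 'playoffs' term adds no explanatory power over model_1
anova(model_1, model_2)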

Addition of Another Continuous Variable

The ‘points_for’ variable captures the number of points a team scores, and potentially the strength of its offense. However, there is another side of the game that is just as important: defense. The ‘points_against’ column gives the number of points scored against a team, and potentially captures the strength of its defense. I hypothesize that adding the ‘points_against’ variable to the model will help explain the ‘wins’ column further.

model_3 <- lm(wins ~ points_for + points_against + playoffs, standings)
tidy(model_3, conf.int = TRUE)
## # A tibble: 4 × 7
##   term             estimate std.error statistic   p.value conf.low conf.high
##   <chr>               <dbl>     <dbl>     <dbl>     <dbl>    <dbl>     <dbl>
## 1 (Intercept)        7.92    0.379         20.9 2.84e- 74   7.18      8.67  
## 2 points_for         0.0217  0.000751      28.8 1.58e-117   0.0202    0.0232
## 3 points_against    -0.0233  0.000879     -26.5 1.72e-104  -0.0250   -0.0215
## 4 playoffsPlayoffs   1.64    0.128         12.8 2.13e- 33   1.39      1.89
model_3 |>
  summary()
## 
## Call:
## lm(formula = wins ~ points_for + points_against + playoffs, data = standings)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4576 -0.7315  0.0374  0.7394  3.2123 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       7.9242977  0.3788673   20.92   <2e-16 ***
## points_for        0.0216763  0.0007514   28.85   <2e-16 ***
## points_against   -0.0232656  0.0008794  -26.46   <2e-16 ***
## playoffsPlayoffs  1.6395311  0.1283841   12.77   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.105 on 634 degrees of freedom
## Multiple R-squared:  0.8723, Adjusted R-squared:  0.8717 
## F-statistic:  1444 on 3 and 634 DF,  p-value: < 2.2e-16

The ‘points_against’ variable appears to be a good addition to the model. With a p-value of 1.72e-104, we can be extremely confident that the coefficient of the ‘points_against’ variable is not zero. From the summary() call we get an adjusted R-squared value of 0.87, up from 0.73 for model_2, indicating an improvement in the model.
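The assignment's multicollinearity question can be checked directly with variance inflation factors; a sketch, assuming the car package is installed (VIF values above roughly 5-10 are commonly read as a concern):

# VIF near 1 means little collinearity between a term and the other
# predictors; large values mean inflated standard errors
library(car)
vif(model_3)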

Evaluation of Models

Residuals vs. Fitted Values

The first plot we will look at is the residuals vs. the fitted values:

library(lindia)
gg_resfitted(model_3) +
  geom_smooth(se=FALSE)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

I believe this plot validates one of our assumptions:

  • ERRORS HAVE CONSTANT VARIANCE ACROSS ALL PREDICTIONS

While the blue line in the plot is not perfectly on top of the red dotted line, it closely resembles a flat line with little deviation from it. A flat trend suggests the errors average out near zero, and the constant-variance assumption is supported so long as the vertical spread of the residuals stays similar across the fitted values, which it appears to here.

The blue line does suggest we might slightly underestimate low and high win totals while overestimating middle win totals; however, the smoothed line deviates by less than one win.
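The visual check can be supplemented with a formal test of constant error variance; a minimal sketch, assuming the lmtest package is installed:

# Breusch-Pagan test, H0: the error variance is constant (homoscedasticity);
# a small p-value would point to heteroscedasticity
library(lmtest)
bptest(model_3)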

Residuals vs X Values

plots <- gg_resX(model_3, plot.all = FALSE)

# for each variable of interest ...
plots$points_for + geom_smooth(se = F)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

plots$points_against + geom_smooth(se = F)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

plots$playoffs + geom_smooth(se = F)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

What we want is constant variance of residuals across the values of each \(X\) variable.

While there is some slight deviation from the red dotted line, the blue line stays fairly flat across \(X\) values; this appears to be the case for both ‘points_for’ and ‘points_against’.

The residuals for ‘playoffs’ are displayed as two box plots, one for ‘No Playoffs’ and one for ‘Playoffs’. The spread of the residuals appears to be roughly the same for both groups.
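That visual impression for the binary term can be checked numerically by comparing the residual spread within each group; a quick sketch in base R:

# standard deviation of residuals within each 'playoffs' group;
# similar values support constant variance across the binary term
tapply(resid(model_3), standings$playoffs, sd)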

Residual Histogram

gg_reshist(model_3)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

When looking at a histogram of residuals we are looking for a roughly normal distribution, and this histogram does appear roughly normal.

There are a few questionable spots on the plot; for example, there is a sizable dip in the count near the center of the histogram, and there appears to be a slight tail. These are not severe, but they indicate that some aspect of the phenomenon is missing from the model (i.e., we may need more or different explanatory variables).
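The bin-count warning above can be addressed by setting a binwidth deliberately; one way is to build the histogram by hand with ggplot2 (the half-win binwidth is my arbitrary choice):

# residual histogram with an explicit binwidth of half a win
ggplot(data.frame(res = resid(model_3)), aes(x = res)) +
  geom_histogram(binwidth = 0.5)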

QQ-Plots

# the normal QQ plot
gg_qqplot(model_3)

When looking at a QQ-plot we are hoping for the points to fall along the orange line, which would indicate that our residuals are approximately normal. Our residuals do appear to fall along this line, supporting the normality assumption.
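A formal normality check can supplement the QQ-plot; a minimal sketch using base R:

# Shapiro-Wilk test, H0: the residuals come from a normal distribution;
# applicable here since n = 638 is within the test's 5000-observation limit
shapiro.test(resid(model_3))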

Cook’s D Observation

gg_cooksd(model_3, threshold = 'convention')

Looking at Cook’s D, it is clear that there are many high-influence, and potentially troubling, values in the data. Having this many data points with high Cook’s D values will make it more difficult to build an accurate model. For future models, it is important that these data points are investigated by plotting various variables against the response variable.

The following plots show where these high Cook’s D values are in the data.

standings$label <- paste(standings$team_name, standings$year, sep = ", ")
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
sliced_data <- slice(standings,c(16,47,62,95,126,131,136,154,206,226,
                      267,268,272,277,278,288,292,340,366,
                      371,383,392,436,518,524,550,551,610,
                      624,636))
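Hard-coding these 30 row indices is fragile; they can instead be recovered programmatically. A sketch, assuming the ‘convention’ threshold drawn by gg_cooksd() is 4/n:

# flag observations whose Cook's distance exceeds the 4/n convention;
# this should reproduce the manually listed rows above
cooks <- cooks.distance(model_3)
high_influence <- which(cooks > 4 / nrow(standings))
sliced_data <- slice(standings, high_influence)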
library(ggrepel)
ggplot(data = sliced_data) +
  geom_point(data = standings,
             mapping = aes(x = points_for, y = wins)) +
  geom_point(mapping = aes(x = points_for, y = wins),
             color = 'blue') + 
  geom_text_repel(mapping = aes(x = points_for,
                                y = wins,
                                label = label),
             color = 'blue', 
             size = 2) +
  labs(title="Investigating High Influence 'points_for' Values ",
       subtitle="Label = Team Name and Year")

ggplot(data = sliced_data) +
  geom_point(data = standings,
             mapping = aes(x = points_against, y = wins)) +
  geom_point(mapping = aes(x = points_against, y = wins),
             color = 'blue')  + 
  geom_text_repel(mapping = aes(x = points_against,
                                y = wins,
                                label = label),
             color = 'blue', 
             size = 2) +
  labs(title="Investigating High Influence 'points_against' Values ",
       subtitle="Label = Team Name and Year")

ggplot(data = sliced_data) +
  geom_point(data = standings,
             mapping = aes(x = playoffs, y = wins)) +
  geom_point(mapping = aes(x = playoffs, y = wins),
             color = 'blue')  + 
  geom_text_repel(mapping = aes(x = playoffs,
                                y = wins,
                                label = label),
             color = 'blue', 
             size = 2) +
  labs(title="Investigating High Influence Values ",
       subtitle="Label = Team Name and Year")

Again, for future models it is important that these data points are investigated, and removed if they are found to be highly influential across variables.
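As a starting point for that investigation, the model can be refit without the flagged rows to see how sensitive the fit is; a sketch, reusing the high_influence indices computed above:

# refit model_3 without the high Cook's D rows; a large shift in the
# coefficients or adjusted R-squared would confirm these points drive the fit
model_3_trimmed <- lm(wins ~ points_for + points_against + playoffs,
                      data = slice(standings, -high_influence))
summary(model_3_trimmed)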