Due: Mon Mar 18, 2024 11:59pm
The purpose of this week’s data dive is for you to think critically about interpreting and diagnosing regression models.
Your RMarkdown notebook for this data dive should contain the following:
Refer to the simple linear regression model you built last week. Add 1-3 more variables to your regression model.
Try out either an interaction term or a binary term to start.
Consider adding other integer or continuous variables.
For each new variable you try, explain why you should include it, or not. E.g., are there any issues with multicollinearity?
Your model for this data dive should have 2-4 terms.
Evaluate this model.
At the very least, use the 5 diagnostic plots discussed in class to identify any issues with your model.
For each plot, point out any indications of issues with the model. Otherwise, explain how the plot supports the claim that an assumption is met.
Try to measure the severity of any issues as well as the level of confidence you have in an assumption being met.
For each of the above tasks, you must explain to the reader what insight was gathered, its significance, and any further questions that might need to be investigated.
In this data dive I will be using NFL standings data, which comes from the Pro Football Reference team standings (via the TidyTuesday project).
standings <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-04/standings.csv')
## Rows: 638 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): team, team_name, playoffs, sb_winner
## dbl (11): year, wins, loss, points_for, points_against, points_differential,...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
standings
## # A tibble: 638 × 15
## team team_name year wins loss points_for points_against
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Miami Dolphins 2000 11 5 323 226
## 2 Indianapolis Colts 2000 10 6 429 326
## 3 New York Jets 2000 9 7 321 321
## 4 Buffalo Bills 2000 8 8 315 350
## 5 New England Patriots 2000 5 11 276 338
## 6 Tennessee Titans 2000 13 3 346 191
## 7 Baltimore Ravens 2000 12 4 333 165
## 8 Pittsburgh Steelers 2000 9 7 321 255
## 9 Jacksonville Jaguars 2000 7 9 367 327
## 10 Cincinnati Bengals 2000 4 12 185 359
## # ℹ 628 more rows
## # ℹ 8 more variables: points_differential <dbl>, margin_of_victory <dbl>,
## # strength_of_schedule <dbl>, simple_rating <dbl>, offensive_ranking <dbl>,
## # defensive_ranking <dbl>, playoffs <chr>, sb_winner <chr>
In the previous week's data dive, I created the following simple linear regression model.
model_1 <- lm(wins ~ points_for, standings)
model_1$coefficients
## (Intercept) points_for
## -3.06141069 0.03153369
library(ggplot2)
standings |>
ggplot(mapping = aes(x = points_for, y = wins)) +
geom_point(size = 2) +
geom_smooth(method = 'lm', se = FALSE, color = 'darkblue')
## `geom_smooth()` using formula = 'y ~ x'
In order to determine whether any additional variables are worth including in the model, we need a null hypothesis to compare against. For this data dive I will be using the following null hypothesis:
The coefficient of each added variable in the model is zero.
I will use a significance level of 0.05. In other words, a variable's p-value needs to be less than 0.05 for it to be considered significant.
The previous model used the ‘points_for’ variable to explain the number of ‘wins’ a team had. With the goal of explaining the ‘wins’ column more accurately, I will add a binary variable, ‘playoffs’. This column describes whether a team made the playoffs that season. Whether a team makes the playoffs is not based solely on wins, but also on the records of other teams in the same conference and division. Generally speaking, however, teams with many wins make the playoffs and teams with few wins do not.
library(broom)
model_2 <- lm(wins ~ points_for + playoffs, standings)
tidy(model_2, conf.int = TRUE)
## # A tibble: 3 × 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 0.234 0.352 0.664 5.07e- 1 -0.458 0.925
## 2 points_for 0.0184 0.00107 17.2 1.52e-54 0.0163 0.0206
## 3 playoffsPlayoffs 3.43 0.158 21.6 2.90e-78 3.12 3.74
model_2 |>
summary()
##
## Call:
## lm(formula = wins ~ points_for + playoffs, data = standings)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.1776 -1.0785 0.0549 1.1408 4.4778
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.233876 0.352179 0.664 0.507
## points_for 0.018447 0.001075 17.167 <2e-16 ***
## playoffsPlayoffs 3.426101 0.158256 21.649 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.601 on 635 degrees of freedom
## Multiple R-squared: 0.7313, Adjusted R-squared: 0.7305
## F-statistic: 864.3 on 2 and 635 DF, p-value: < 2.2e-16
The ‘playoffs’ variable appears to be a worthwhile addition to the model. With a p-value of 2.90e-78, we can be extremely confident that the coefficient of the ‘playoffs’ variable is not zero. The summary call reports an adjusted R-squared of 0.73.
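To complement the p-value on the new term, a nested-model F-test can compare the two fits directly. This is a quick sketch using the two models already defined above:
# Nested F-test: does adding 'playoffs' significantly improve on model_1?
# A small p-value indicates the larger model explains significantly more variance.
anova(model_1, model_2)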
The ‘points_for’ variable captures the number of points a team scores, and therefore something about the strength of its offense. There is another side of the game that is just as important: defense. The ‘points_against’ column gives the number of points scored against a team, and so reflects the strength of its defense. I hypothesize that adding the ‘points_against’ variable to the model will explain the ‘wins’ column further.
model_3 <- lm(wins ~ points_for + points_against + playoffs, standings)
tidy(model_3, conf.int = TRUE)
## # A tibble: 4 × 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 7.92 0.379 20.9 2.84e- 74 7.18 8.67
## 2 points_for 0.0217 0.000751 28.8 1.58e-117 0.0202 0.0232
## 3 points_against -0.0233 0.000879 -26.5 1.72e-104 -0.0250 -0.0215
## 4 playoffsPlayoffs 1.64 0.128 12.8 2.13e- 33 1.39 1.89
model_3 |>
summary()
##
## Call:
## lm(formula = wins ~ points_for + points_against + playoffs, data = standings)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4576 -0.7315 0.0374 0.7394 3.2123
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.9242977 0.3788673 20.92 <2e-16 ***
## points_for 0.0216763 0.0007514 28.85 <2e-16 ***
## points_against -0.0232656 0.0008794 -26.46 <2e-16 ***
## playoffsPlayoffs 1.6395311 0.1283841 12.77 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.105 on 634 degrees of freedom
## Multiple R-squared: 0.8723, Adjusted R-squared: 0.8717
## F-statistic: 1444 on 3 and 634 DF, p-value: < 2.2e-16
The ‘points_against’ variable also appears to be a worthwhile addition. With a p-value of 1.72e-104, we can be extremely confident that its coefficient is not zero. The summary call reports an adjusted R-squared of 0.87, up from 0.73 for model_2, indicating an improvement in the model.
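The assignment also asks about multicollinearity. Since ‘points_for’ and ‘playoffs’ are likely correlated with each other (and with ‘points_against’), variance inflation factors are worth checking. This is a sketch assuming the car package is installed:
# Variance inflation factors for model_3; values well above ~5 would
# suggest problematic multicollinearity among the predictors.
car::vif(model_3)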
The first diagnostic plot we will look at is the residuals vs. fitted values plot:
library(lindia)
gg_resfitted(model_3) +
geom_smooth(se=FALSE)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
I believe this plot supports the linearity assumption (residuals centered around zero across the range of fitted values):
While the smoothed blue line is not perfectly on top of the red dotted line at zero, it closely resembles a straight line with little deviation from it.
The blue line does suggest we may be slightly underestimating low and high win totals while overestimating middle win totals, but the deviation is within about one win.
plots <- gg_resX(model_3, plot.all = FALSE)
# for each variable of interest ...
plots$points_for + geom_smooth(se = F)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
plots$points_against + geom_smooth(se = F)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
plots$playoffs + geom_smooth(se = F)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
What we want is constant variance of residuals (homoscedasticity) across values of the \(X\) variables.
While there is some slight deviation from the red dotted line, the blue smoothed line stays fairly flat across \(X\) values. This appears to be the case for ‘points_for’ and ‘points_against’.
The residuals for ‘playoffs’ are shown as two box plots, one for ‘No Playoffs’ and one for ‘Playoffs’. The spread of the residuals appears similar for both groups.
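As a numeric complement to these plots, a Breusch-Pagan test can check the constant-variance assumption. This is a sketch assuming the lmtest package is installed; a small p-value would indicate heteroscedasticity:
library(lmtest)
# Breusch-Pagan test for non-constant residual variance in model_3
bptest(model_3)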
gg_reshist(model_3)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
When looking at a histogram of the residuals, we are hoping to see a roughly normal distribution, and this histogram does appear roughly normal.
There are a few questionable spots, for example a sizable dip in the count near the center of the histogram and a very slight tail. These are not severe, but they suggest that some aspect of the phenomenon is missing from the model (i.e., we may need more or different explanatory variables).
# the normal QQ plot
gg_qqplot(model_3)
When looking at a QQ-plot, we hope the points fall along the orange line, indicating that our residuals are approximately normal. Our residuals do appear to fall on this line, so the normality assumption looks reasonable.
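A Shapiro-Wilk test on the residuals gives a numeric check to go with the QQ-plot (note that with 638 observations the test can flag even small, practically unimportant departures from normality):
# Shapiro-Wilk test of residual normality (base R)
shapiro.test(residuals(model_3))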
gg_cooksd(model_3, threshold = 'convention')
Looking at Cook’s D, it is clear that there are many high-influence, and potentially troubling, observations in the data. Having this many points with high Cook’s D values makes it more difficult to fit an accurate model. For future models, it is important that these data points be investigated by plotting the relevant variables against the response variable.
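Rather than reading the row numbers off the plot, the flagged observations can be pulled programmatically. The sketch below uses the conventional 4/n cutoff, which is an assumption and should match the threshold drawn by gg_cooksd:
# Identify rows whose Cook's distance exceeds the conventional 4/n threshold
cooks <- cooks.distance(model_3)
high_influence <- which(cooks > 4 / nrow(standings))
high_influence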
The following plots show where these high Cook’s D values fall in the data.
standings$label <- paste(standings$team_name, standings$year, sep = ", ")
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
sliced_data <- slice(standings,c(16,47,62,95,126,131,136,154,206,226,
267,268,272,277,278,288,292,340,366,
371,383,392,436,518,524,550,551,610,
624,636))
library(ggrepel)
ggplot(data = sliced_data) +
geom_point(data = standings,
mapping = aes(x = points_for, y = wins)) +
geom_point(mapping = aes(x = points_for, y = wins),
color = 'blue') +
geom_text_repel(mapping = aes(x = points_for,
y = wins,
label = label),
color = 'blue',
size = 2) +
labs(title="Investigating High Influence 'points_for' Values ",
subtitle="Label = Team Name and Year")
ggplot(data = sliced_data) +
geom_point(data = standings,
mapping = aes(x = points_against, y = wins)) +
geom_point(mapping = aes(x = points_against, y = wins),
color = 'blue') +
geom_text_repel(mapping = aes(x = points_against,
y = wins,
label = label),
color = 'blue',
size = 2) +
labs(title="Investigating High Influence 'points_against' Values ",
subtitle="Label = Team Name and Year")
ggplot(data = sliced_data) +
geom_point(data = standings,
mapping = aes(x = playoffs, y = wins)) +
geom_point(mapping = aes(x = playoffs, y = wins),
color = 'blue') +
geom_text_repel(mapping = aes(x = playoffs,
y = wins,
label = label),
color = 'blue',
size = 2) +
labs(title="Investigating High Influence Values ",
subtitle="Label = Team Name and Year")
Again, for future models it is important that these data points be investigated and, if they prove to be highly influential across variables, considered for removal.
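One way to gauge how much these points matter is to refit the model without them and compare coefficients. This is a sketch; high_influence here refers to the index vector computed from Cook’s distance above, and the 4/n cutoff remains an assumption:
# Refit model_3 excluding the flagged high-influence rows and compare
# the coefficients and intervals with the full-data fit.
model_3_trimmed <- lm(wins ~ points_for + points_against + playoffs,
                      data = standings[-high_influence, ])
tidy(model_3_trimmed, conf.int = TRUE)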