Due: Mon Apr 1, 2024 11:59pm
The purpose of this week’s data dive is for you to expand your knowledge of linear models and generalized linear models.
Your RMarkdown notebook for this data dive should contain the following:
Build a linear (or generalized linear) model as you like
Use the tools from previous weeks to diagnose the model
Interpret at least one of the coefficients
For each of the above tasks, you must explain to the reader what insight was gathered, its significance, and any further questions you have which might need to be further investigated.
In this data dive I will be using NFL standings data, which comes from Pro Football Reference team standings (via the TidyTuesday project).
standings <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-04/standings.csv')
## Rows: 638 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): team, team_name, playoffs, sb_winner
## dbl (11): year, wins, loss, points_for, points_against, points_differential,...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
standings
## # A tibble: 638 × 15
## team team_name year wins loss points_for points_against
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Miami Dolphins 2000 11 5 323 226
## 2 Indianapolis Colts 2000 10 6 429 326
## 3 New York Jets 2000 9 7 321 321
## 4 Buffalo Bills 2000 8 8 315 350
## 5 New England Patriots 2000 5 11 276 338
## 6 Tennessee Titans 2000 13 3 346 191
## 7 Baltimore Ravens 2000 12 4 333 165
## 8 Pittsburgh Steelers 2000 9 7 321 255
## 9 Jacksonville Jaguars 2000 7 9 367 327
## 10 Cincinnati Bengals 2000 4 12 185 359
## # ℹ 628 more rows
## # ℹ 8 more variables: points_differential <dbl>, margin_of_victory <dbl>,
## # strength_of_schedule <dbl>, simple_rating <dbl>, offensive_ranking <dbl>,
## # defensive_ranking <dbl>, playoffs <chr>, sb_winner <chr>
For this data dive, my response variable will be sb_winner. This variable has the values 'No Superbowl' and 'Won Superbowl', which will be converted to 0 and 1, respectively.
The initial explanatory variables will be:
points_for
points_against
margin_of_victory
strength_of_schedule
offensive_ranking
defensive_ranking
Before building the model, it is important to analyze the variables that will go into it.
First, we will visualize the distributions of the explanatory variables.
library(ggplot2)
standings |>
ggplot(mapping = aes(x = points_for)) +
geom_histogram(color = 'white') +
labs("Points for by NFL Teams per Season")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
standings |>
ggplot(mapping = aes(x = points_against)) +
geom_histogram(color = 'white') +
labs("Points against by NFL Teams per Season")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
These distributions are right-skewed counts that can never be zero or less, resembling a Poisson distribution. We can apply a log transformation to these variables before modeling to improve the model.
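As a quick sanity check, we can preview what the log transformation does to one of these distributions (a sketch reusing the same plotting pattern as above):
# Sketch: preview the effect of a log transformation on points_for
standings |>
ggplot(mapping = aes(x = log(points_for))) +
geom_histogram(color = 'white') +
labs(title = "Log of points for by NFL Teams per Season")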
Let's continue by visualizing the distributions of the other explanatory variables.
standings |>
ggplot(mapping = aes(x = margin_of_victory)) +
geom_histogram(color = 'white') +
labs("Points against by NFL Teams per Season")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
standings |>
ggplot(mapping = aes(x = strength_of_schedule)) +
geom_histogram(color = 'white') +
labs("Points against by NFL Teams per Season")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
standings |>
ggplot(mapping = aes(x = offensive_ranking)) +
geom_histogram(color = 'white') +
labs("Points against by NFL Teams per Season")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
standings |>
ggplot(mapping = aes(x = defensive_ranking)) +
geom_histogram(color = 'white') +
labs("Points against by NFL Teams per Season")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
All of these distributions are roughly normal and centered at zero, so they do not need to be transformed before modeling.
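If we want to double-check that rough normality, a quick Q-Q plot is one option (a sketch using base R; points falling near the line support the assumption):
# Sketch: Q-Q plot to double-check that margin_of_victory is roughly normal
qqnorm(standings$margin_of_victory)
qqline(standings$margin_of_victory)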
First, sb_winner needs its values converted into zeros and ones.
standings$sb_winner <- ifelse(standings$sb_winner == 'Won Superbowl', 1, 0)
head(standings$sb_winner)
## [1] 0 0 0 0 0 0
Next, we build a generalized linear model that uses logistic regression.
model1 <- glm(sb_winner ~ log(points_for) + log(points_against) + margin_of_victory + strength_of_schedule + offensive_ranking + defensive_ranking,
data = standings,
family = binomial(link = 'logit'))
model1$coefficients
## (Intercept) log(points_for) log(points_against)
## -50.8928484 31.6230962 -23.7894654
## margin_of_victory strength_of_schedule offensive_ranking
## -1.7511172 -0.4579508 0.7814223
## defensive_ranking
## 0.8383004
The first thing we can do is visualize the residuals.
library(arm)
## Warning: package 'arm' was built under R version 4.3.3
## Loading required package: MASS
## Loading required package: Matrix
## Loading required package: lme4
##
## arm (Version 1.13-1, built: 2022-8-25)
## Working directory is C:/Users/ellio/gradSchool/i510/Data Dives
binnedplot(predict(model1, type = "response"),
residuals(model1, type = "response"))
Many of the binned residuals fall outside of the grey ±2 standard error bounds, suggesting that our model doesn't accurately model the response variable.
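To make the plot more concrete, here is a rough sketch of the calculation behind it: the fitted probabilities are cut into bins, and each bin's mean residual is compared to an approximate ±2 standard error bound (an illustration of the idea, not arm's exact implementation):
# Sketch: approximate the binned residual calculation by hand —
# cut the fitted probabilities into (up to) ten quantile bins, then
# compare each bin's mean residual to an approximate +/- 2 SE bound
p_hat <- predict(model1, type = "response")
r_raw <- residuals(model1, type = "response")
bin <- cut(p_hat, breaks = unique(quantile(p_hat, probs = seq(0, 1, 0.1))), include.lowest = TRUE)
data.frame(mean_pred = tapply(p_hat, bin, mean),
mean_resid = tapply(r_raw, bin, mean),
se_bound = tapply(r_raw, bin, function(r) 2 * sd(r) / sqrt(length(r))))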
Let's analyze the Cook's distance values.
res <- data.frame(Index = 1:length(cooks.distance(model1)),
CooksDistance = cooks.distance(model1))
# Plot index against the Cook's distance to find
# influential points:
ggplot(res, aes(Index, CooksDistance)) +
geom_point() +
geom_text(aes(label = ifelse(CooksDistance > (mean(res$CooksDistance)*3),
rownames(res), "")),
hjust = 1.1)
standings[res$Index[res$CooksDistance > (mean(res$CooksDistance)*3)],]
## # A tibble: 27 × 15
## team team_name year wins loss points_for points_against
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Tennessee Titans 2000 13 3 346 191
## 2 Baltimore Ravens 2000 12 4 333 165
## 3 New England Patriots 2001 11 5 371 272
## 4 Chicago Bears 2001 13 3 338 203
## 5 Tampa Bay Buccaneers 2002 12 4 346 196
## 6 New England Patriots 2003 14 2 348 238
## 7 New England Patriots 2004 14 2 437 260
## 8 Pittsburgh Steelers 2005 11 5 389 258
## 9 Baltimore Ravens 2006 13 3 353 201
## 10 Indianapolis Colts 2006 12 4 427 360
## # ℹ 17 more rows
## # ℹ 8 more variables: points_differential <dbl>, margin_of_victory <dbl>,
## # strength_of_schedule <dbl>, simple_rating <dbl>, offensive_ranking <dbl>,
## # defensive_ranking <dbl>, playoffs <chr>, sb_winner <dbl>
Many of the rows with a Cook's distance above the cutoff are Super Bowl winners, suggesting that the model is inaccurately modeling sb_winner when it is equal to 1. This makes sense when we consider that the dataset has far more rows where sb_winner equals 0 than 1, causing the model to be biased towards 0.
Of the high Cook's distance rows that had sb_winner equal to 0, all of them had low points_against, high points_for, or a high defensive_ranking. I believe this is because many Super Bowl-winning teams excel at either defense, offense, or both, and if a team ranks highly in these areas it is likely to win the Super Bowl.
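A quick tabulation confirms how lopsided the response variable is (a sketch):
# Sketch: check the class imbalance in the response variable
table(standings$sb_winner)
prop.table(table(standings$sb_winner))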
Moving on, we can use the broom package to analyze the coefficients of the model.
library(broom)
tidy(model1, conf.int = TRUE)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## # A tibble: 7 × 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -50.9 39.1 -1.30 0.193 -131. 22.1
## 2 log(points_for) 31.6 13.3 2.37 0.0176 6.98 59.1
## 3 log(points_against) -23.8 8.58 -2.77 0.00556 -41.2 -7.41
## 4 margin_of_victory -1.75 4.22 -0.415 0.678 -10.1 6.55
## 5 strength_of_schedule -0.458 4.20 -0.109 0.913 -8.75 7.84
## 6 offensive_ranking 0.781 4.21 0.186 0.853 -7.52 9.09
## 7 defensive_ranking 0.838 4.20 0.199 0.842 -7.45 9.14
Most of the p-values are high, suggesting that many of the variables are not significant predictors of sb_winner. The repeated "fitted probabilities numerically 0 or 1" warnings and the very wide confidence intervals are another hint that something, such as collinearity among the predictors, is making this model unstable.
We can also use the summary function to analyze the model.
summary(model1)
##
## Call:
## glm(formula = sb_winner ~ log(points_for) + log(points_against) +
## margin_of_victory + strength_of_schedule + offensive_ranking +
## defensive_ranking, family = binomial(link = "logit"), data = standings)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -50.8928 39.0678 -1.303 0.19268
## log(points_for) 31.6231 13.3227 2.374 0.01761 *
## log(points_against) -23.7895 8.5795 -2.773 0.00556 **
## margin_of_victory -1.7511 4.2228 -0.415 0.67837
## strength_of_schedule -0.4580 4.2046 -0.109 0.91327
## offensive_ranking 0.7814 4.2106 0.186 0.85277
## defensive_ranking 0.8383 4.2035 0.199 0.84193
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 177.87 on 637 degrees of freedom
## Residual deviance: 129.04 on 631 degrees of freedom
## AIC: 143.04
##
## Number of Fisher Scoring iterations: 8
To understand which variables are significant for understanding sb_winner, we can look at the correlations between our explanatory variables. The model assumes that the explanatory variables are independent of one another, so if the correlations are high we know that some variables are redundant.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ lubridate 1.9.3 ✔ tibble 3.2.1
## ✔ purrr 1.0.2 ✔ tidyr 1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ tidyr::expand() masks Matrix::expand()
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ✖ tidyr::pack() masks Matrix::pack()
## ✖ dplyr::select() masks MASS::select()
## ✖ tidyr::unpack() masks Matrix::unpack()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
standings |>
select(points_for, points_against, margin_of_victory, strength_of_schedule, offensive_ranking, defensive_ranking) |>
cor()
## points_for points_against margin_of_victory
## points_for 1.0000000 -0.1852656 0.8153530
## points_against -0.1852656 1.0000000 -0.7199803
## margin_of_victory 0.8153530 -0.7199803 1.0000000
## strength_of_schedule -0.1956613 0.1055722 -0.2002940
## offensive_ranking 0.9517868 -0.2459601 0.8170490
## defensive_ranking 0.1974664 -0.9286826 0.6866334
## strength_of_schedule offensive_ranking defensive_ranking
## points_for -0.19566126 0.95178679 0.1974664
## points_against 0.10557222 -0.24596006 -0.9286826
## margin_of_victory -0.20029397 0.81704898 0.6866334
## strength_of_schedule 1.00000000 -0.01589955 0.1225779
## offensive_ranking -0.01589955 1.00000000 0.2245290
## defensive_ranking 0.12257793 0.22452903 1.0000000
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
## The following object is masked from 'package:purrr':
##
## some
## The following object is masked from 'package:arm':
##
## logit
vif(model1)
## log(points_for) log(points_against) margin_of_victory
## 49.01700 53.89719 5253.59495
## strength_of_schedule offensive_ranking defensive_ranking
## 680.57639 3226.59845 3697.65343
Our variance inflation factors are significantly higher than desired (values above 10 are generally considered problematic). To reduce the VIFs, some of the variables that are highly correlated with one another need to be removed.
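To see what the VIF is measuring, here is a sketch that computes it by hand for margin_of_victory: regress that predictor on the other predictors and take 1 / (1 − R²). (car::vif() works from the fitted model's coefficient covariances, so the numbers will not match exactly, but the idea is the same.)
# Sketch: the VIF idea by hand — how well do the other predictors
# explain margin_of_victory?
aux <- lm(margin_of_victory ~ log(points_for) + log(points_against) +
strength_of_schedule + offensive_ranking + defensive_ranking,
data = standings)
1 / (1 - summary(aux)$r.squared)  # large values signal heavy collinearity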
Variables like points_for, margin_of_victory, and offensive_ranking are all highly correlated. When we think about the relationships between these variables, points_for is the driving force while the others are dependent on it. For our next model, we will remove both margin_of_victory and offensive_ranking.
The same story is true for points_against: it is the driving force behind margin_of_victory and defensive_ranking. In our next model, defensive_ranking will also be removed.
model2 <- glm(sb_winner ~ log(points_for) + log(points_against) + strength_of_schedule,
data = standings,
family = binomial(link = 'logit'))
vif(model2)
## log(points_for) log(points_against) strength_of_schedule
## 1.134751 1.147889 1.050698
As a rule of thumb, VIF values should be below 10; model2 meets this criterion.
Another way to analyze which variables are best for the model is to use a step-wise selection process.
step(model1, direction = "backward")
## Start: AIC=143.04
## sb_winner ~ log(points_for) + log(points_against) + margin_of_victory +
## strength_of_schedule + offensive_ranking + defensive_ranking
##
## Df Deviance AIC
## - strength_of_schedule 1 129.05 141.05
## - offensive_ranking 1 129.07 141.07
## - defensive_ranking 1 129.08 141.08
## - margin_of_victory 1 129.21 141.21
## <none> 129.04 143.04
## - log(points_for) 1 135.63 147.63
## - log(points_against) 1 137.33 149.33
##
## Step: AIC=141.05
## sb_winner ~ log(points_for) + log(points_against) + margin_of_victory +
## offensive_ranking + defensive_ranking
##
## Df Deviance AIC
## <none> 129.05 141.05
## - offensive_ranking 1 131.30 141.30
## - defensive_ranking 1 133.21 143.21
## - log(points_for) 1 135.66 145.66
## - margin_of_victory 1 135.86 145.86
## - log(points_against) 1 137.34 147.34
##
## Call: glm(formula = sb_winner ~ log(points_for) + log(points_against) +
## margin_of_victory + offensive_ranking + defensive_ranking,
## family = binomial(link = "logit"), data = standings)
##
## Coefficients:
## (Intercept) log(points_for) log(points_against)
## -51.1011 31.6686 -23.7995
## margin_of_victory offensive_ranking defensive_ranking
## -1.2947 0.3235 0.3810
##
## Degrees of Freedom: 637 Total (i.e. Null); 632 Residual
## Null Deviance: 177.9
## Residual Deviance: 129 AIC: 141
Step-wise selection keeps whichever model gives the lowest AIC value at each step.
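To use the selected model directly rather than retyping its formula, the result of step() can be stored (a sketch; trace = 0 suppresses the log shown above):
# Sketch: capture the step-wise selected model in one call
model_step <- step(model1, direction = "backward", trace = 0)
formula(model_step)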
Let's compare the models we have created so far: our original model (model1), the low-VIF model (model2), and a trimmed version of the step-wise selection result that also drops offensive_ranking, since it had both a high p-value and a high VIF (model3).
model2 <- glm(sb_winner ~ log(points_for) + log(points_against) + strength_of_schedule,
data = standings,
family = binomial(link = 'logit'))
model3 <- glm(formula = sb_winner ~ log(points_for) + log(points_against) + margin_of_victory + defensive_ranking,
data = standings,
family = binomial(link = "logit"))
binnedplot(predict(model2, type = "response"),
residuals(model2, type = "response"))
res <- data.frame(Index = 1:length(cooks.distance(model2)),
CooksDistance = cooks.distance(model2))
# Plot index against the Cook's distance to find
# influential points:
ggplot(res, aes(Index, CooksDistance)) +
geom_point() +
geom_text(aes(label = ifelse(CooksDistance > (mean(res$CooksDistance)*3),
rownames(res), "")),
hjust = 1.1)
standings[res$Index[res$CooksDistance > (mean(res$CooksDistance)*3)],]
## # A tibble: 26 × 15
## team team_name year wins loss points_for points_against
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Tennessee Titans 2000 13 3 346 191
## 2 Baltimore Ravens 2000 12 4 333 165
## 3 New England Patriots 2001 11 5 371 272
## 4 Tampa Bay Buccaneers 2002 12 4 346 196
## 5 New England Patriots 2003 14 2 348 238
## 6 New England Patriots 2004 14 2 437 260
## 7 Pittsburgh Steelers 2005 11 5 389 258
## 8 Baltimore Ravens 2006 13 3 353 201
## 9 Indianapolis Colts 2006 12 4 427 360
## 10 New England Patriots 2007 16 0 589 274
## # ℹ 16 more rows
## # ℹ 8 more variables: points_differential <dbl>, margin_of_victory <dbl>,
## # strength_of_schedule <dbl>, simple_rating <dbl>, offensive_ranking <dbl>,
## # defensive_ranking <dbl>, playoffs <chr>, sb_winner <dbl>
Model2's residuals and influential points appear similar to model1's.
binnedplot(predict(model3, type = "response"),
residuals(model3, type = "response"))
More of the binned residuals fall outside of the grey lines compared to model1 and model2, suggesting that model3 is doing a worse job of modeling the response.
res <- data.frame(Index = 1:length(cooks.distance(model3)),
CooksDistance = cooks.distance(model3))
# Plot index against the Cook's distance to find
# influential points:
ggplot(res, aes(Index, CooksDistance)) +
geom_point() +
geom_text(aes(label = ifelse(CooksDistance > (mean(res$CooksDistance)*3),
rownames(res), "")),
hjust = 1.1)
standings[res$Index[res$CooksDistance > (mean(res$CooksDistance)*3)],]
## # A tibble: 26 × 15
## team team_name year wins loss points_for points_against
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Tennessee Titans 2000 13 3 346 191
## 2 Baltimore Ravens 2000 12 4 333 165
## 3 New England Patriots 2001 11 5 371 272
## 4 Chicago Bears 2001 13 3 338 203
## 5 Tampa Bay Buccaneers 2002 12 4 346 196
## 6 New England Patriots 2003 14 2 348 238
## 7 New England Patriots 2004 14 2 437 260
## 8 Pittsburgh Steelers 2005 11 5 389 258
## 9 Baltimore Ravens 2006 13 3 353 201
## 10 Indianapolis Colts 2006 12 4 427 360
## # ℹ 16 more rows
## # ℹ 8 more variables: points_differential <dbl>, margin_of_victory <dbl>,
## # strength_of_schedule <dbl>, simple_rating <dbl>, offensive_ranking <dbl>,
## # defensive_ranking <dbl>, playoffs <chr>, sb_winner <dbl>
Model3 also has a higher maximum residual than model1 and model2.
Moving on, we can use the summary function to analyze the coefficients of each model.
summary(model1)
##
## Call:
## glm(formula = sb_winner ~ log(points_for) + log(points_against) +
## margin_of_victory + strength_of_schedule + offensive_ranking +
## defensive_ranking, family = binomial(link = "logit"), data = standings)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -50.8928 39.0678 -1.303 0.19268
## log(points_for) 31.6231 13.3227 2.374 0.01761 *
## log(points_against) -23.7895 8.5795 -2.773 0.00556 **
## margin_of_victory -1.7511 4.2228 -0.415 0.67837
## strength_of_schedule -0.4580 4.2046 -0.109 0.91327
## offensive_ranking 0.7814 4.2106 0.186 0.85277
## defensive_ranking 0.8383 4.2035 0.199 0.84193
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 177.87 on 637 degrees of freedom
## Residual deviance: 129.04 on 631 degrees of freedom
## AIC: 143.04
##
## Number of Fisher Scoring iterations: 8
summary(model2)
##
## Call:
## glm(formula = sb_winner ~ log(points_for) + log(points_against) +
## strength_of_schedule, family = binomial(link = "logit"),
## data = standings)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.3643 10.5870 0.318 0.750651
## log(points_for) 5.7585 1.6804 3.427 0.000611 ***
## log(points_against) -7.1515 1.4145 -5.056 4.28e-07 ***
## strength_of_schedule 0.3072 0.1588 1.934 0.053068 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 177.87 on 637 degrees of freedom
## Residual deviance: 133.52 on 634 degrees of freedom
## AIC: 141.52
##
## Number of Fisher Scoring iterations: 8
summary(model3)
##
## Call:
## glm(formula = sb_winner ~ log(points_for) + log(points_against) +
## margin_of_victory + defensive_ranking, family = binomial(link = "logit"),
## data = standings)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -74.4258 36.0386 -2.065 0.0389 *
## log(points_for) 33.4283 13.5534 2.466 0.0136 *
## log(points_against) -21.5641 8.6272 -2.500 0.0124 *
## margin_of_victory -1.0639 0.4935 -2.156 0.0311 *
## defensive_ranking 0.2870 0.1781 1.611 0.1071
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 177.87 on 637 degrees of freedom
## Residual deviance: 131.30 on 633 degrees of freedom
## AIC: 141.3
##
## Number of Fisher Scoring iterations: 8
The coefficients of model2 all appear to be significant (or very nearly so), while some of the coefficients in model1 and model3 do not. This suggests that model2 is doing the best job of modeling the response variable.
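To interpret one of model2's coefficients, we can move from the log-odds scale to odds ratios by exponentiating (a sketch). Because points_for enters the model as a logarithm, a 10% increase in a team's points scored multiplies its odds of winning the Super Bowl by roughly 1.1^5.76 ≈ 1.7, holding the other variables constant:
# Sketch: odds ratios for model2's coefficients
exp(coef(model2))
# Odds ratio for a 10% increase in points_for
exp(coef(model2)["log(points_for)"] * log(1.1))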
Model1 has the lowest residual deviance, with a value of 129.04.
Model2 and model3 have similarly low Akaike Information Criterion (AIC) values, meaning they are more likely to minimize information loss.
paste("Model 1 BIC", round(BIC(model1), 2))
## [1] "Model 1 BIC 174.24"
paste("Model 2 BIC", round(BIC(model2), 2))
## [1] "Model 2 BIC 159.36"
paste("Model 3 BIC", round(BIC(model3), 2))
## [1] "Model 3 BIC 163.59"
Model2 has the lowest BIC value. The BIC penalizes each parameter more heavily as the number of rows of data grows: its penalty is k·log(n) rather than AIC's 2k.
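To make that penalty concrete, here is a sketch computing model2's BIC by hand as −2·logLik + k·log(n) and comparing it to the AIC formula:
# Sketch: compute model2's BIC (and AIC) by hand
ll <- logLik(model2)
k <- attr(ll, "df")   # number of estimated parameters
n <- nobs(model2)     # number of rows used to fit the model
-2 * as.numeric(ll) + k * log(n)  # should match BIC(model2)
-2 * as.numeric(ll) + 2 * k       # should match AIC(model2)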
We can perform an analysis of deviance (ANOVA) on the models to understand whether the differences in deviance between them are significant. Low p-values would suggest that the difference in deviance is significant and the models are meaningfully different.
anova(model1, model2, model3, test = "Chisq")
## Analysis of Deviance Table
##
## Model 1: sb_winner ~ log(points_for) + log(points_against) + margin_of_victory +
## strength_of_schedule + offensive_ranking + defensive_ranking
## Model 2: sb_winner ~ log(points_for) + log(points_against) + strength_of_schedule
## Model 3: sb_winner ~ log(points_for) + log(points_against) + margin_of_victory +
## defensive_ranking
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 631 129.04
## 2 634 133.52 -3 -4.4873 0.2134
## 3 633 131.30 1 2.2214 0.1361
Since the p-values are high (>0.05), the differences in deviance between the models are not significant. (One caveat: the chi-squared test is only strictly valid for nested models, and model2 and model3 are not nested within one another, so that particular comparison is suggestive at best.)
All else being equal, the simpler model is generally preferred. Considering that the differences in deviance are not significant, it is best to choose the model with the fewest variables, which is model2.
Based on all of this analysis, model2 appears to be the best model so far for the response variable sb_winner.
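As one last sanity check (a sketch, not a formal diagnostic), we can look at which team-seasons model2 assigns the highest fitted probability of winning the Super Bowl:
# Sketch: rank team-seasons by model2's fitted probability of a Super Bowl win
standings |>
mutate(p_win = predict(model2, type = "response")) |>
arrange(desc(p_win)) |>
select(team, team_name, year, wins, p_win) |>
head(10)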