Due: Mon Apr 1, 2024 11:59pm
The purpose of this week’s data dive is for you to expand your knowledge on linear modeling and generalized linear models.
Your RMarkdown notebook for this data dive should contain the following:
Build a linear (or generalized linear) model as you like
Use the tools from previous weeks to diagnose the model
Interpret at least one of the coefficients
For each of the above tasks, you must explain to the reader what insight was gathered, its significance, and any further questions you have which might need to be further investigated.
In this data dive I will be using NFL Standings data which comes from Pro Football Reference team standings.
standings <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-04/standings.csv')## Rows: 638 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (4): team, team_name, playoffs, sb_winner
## dbl (11): year, wins, loss, points_for, points_against, points_differential,...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.standings## # A tibble: 638 × 15
##    team         team_name  year  wins  loss points_for points_against
##    <chr>        <chr>     <dbl> <dbl> <dbl>      <dbl>          <dbl>
##  1 Miami        Dolphins   2000    11     5        323            226
##  2 Indianapolis Colts      2000    10     6        429            326
##  3 New York     Jets       2000     9     7        321            321
##  4 Buffalo      Bills      2000     8     8        315            350
##  5 New England  Patriots   2000     5    11        276            338
##  6 Tennessee    Titans     2000    13     3        346            191
##  7 Baltimore    Ravens     2000    12     4        333            165
##  8 Pittsburgh   Steelers   2000     9     7        321            255
##  9 Jacksonville Jaguars    2000     7     9        367            327
## 10 Cincinnati   Bengals    2000     4    12        185            359
## # ℹ 628 more rows
## # ℹ 8 more variables: points_differential <dbl>, margin_of_victory <dbl>,
## #   strength_of_schedule <dbl>, simple_rating <dbl>, offensive_ranking <dbl>,
## #   defensive_ranking <dbl>, playoffs <chr>, sb_winner <chr>For this data dive my response variable will be sb_winner. This variable has values ‘No Superbowl’ and ‘Won Superbowl’ which will be converted into 0 and 1 respectively.
The initial explanatory variables will be:
points_for
points_against
margin_of_victory
strength_of_schedule
offensive_ranking
defensive_ranking
Before building the model it is important to analyze the variables that make up the model.
First we will visualize the distributions of the explanatory variables.
library(ggplot2)
standings |>
  ggplot(mapping = aes(x = points_for)) +
  geom_histogram(color = 'white') +
  labs("Points for by NFL Teams per Season")## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.standings |>
  ggplot(mapping = aes(x = points_against)) +
  geom_histogram(color = 'white') +
  labs("Points against by NFL Teams per Season")## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.These distributions were all Poisson distribution because the values will never be zero or less. We can perform a log transformation to these variable before modelling to improve the model.
Lets continue visualizing the distributions of the other explanatory variables
standings |>
  ggplot(mapping = aes(x = margin_of_victory)) +
  geom_histogram(color = 'white') +
  labs("Points against by NFL Teams per Season")## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.standings |>
  ggplot(mapping = aes(x = strength_of_schedule)) +
  geom_histogram(color = 'white') +
  labs("Points against by NFL Teams per Season")## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.standings |>
  ggplot(mapping = aes(x = offensive_ranking)) +
  geom_histogram(color = 'white') +
  labs("Points against by NFL Teams per Season")## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.standings |>
  ggplot(mapping = aes(x = defensive_ranking)) +
  geom_histogram(color = 'white') +
  labs("Points against by NFL Teams per Season")## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.All of these distributions follow a roughly normal distribution, centered at zero. These distributions do not need to be transformed to make our model correct.
First, sb_winner needs its values converted into zeros and ones.
standings$sb_winner <- ifelse(standings$sb_winner == 'Won Superbowl', 1, 0)
head(standings$sb_winner)## [1] 0 0 0 0 0 0Next we build the model using a generalized linear model that uses logistic regression.
model1 <- glm(sb_winner ~ log(points_for) + log(points_against) + margin_of_victory + strength_of_schedule + offensive_ranking + defensive_ranking,
             data = standings,
             family = binomial(link = 'logit'))
model1$coefficients##          (Intercept)      log(points_for)  log(points_against) 
##          -50.8928484           31.6230962          -23.7894654 
##    margin_of_victory strength_of_schedule    offensive_ranking 
##           -1.7511172           -0.4579508            0.7814223 
##    defensive_ranking 
##            0.8383004The first thing we can do is visualize the residuals.
library(arm)## Warning: package 'arm' was built under R version 4.3.3## Loading required package: MASS## Loading required package: Matrix## Loading required package: lme4## 
## arm (Version 1.13-1, built: 2022-8-25)## Working directory is C:/Users/ellio/gradSchool/i510/Data Divesbinnedplot(predict(model1, type = "response"),
           residuals(model1, type = "response"))Many of our variables fall outside of the grey lines, suggesting that our model doesn’t accurately model the response variable.
Lets analyze the Cook’s D value.
res <- data.frame(Index = 1:length(cooks.distance(model1)),
                  CooksDistance = cooks.distance(model1))
# Plot index against the Cook's distance to find
# influential points:
ggplot(res, aes(Index, CooksDistance)) +
      geom_point() +
      geom_text(aes(label = ifelse(CooksDistance > (mean(res$CooksDistance)*3),
                                   rownames(res), "")),
                hjust = 1.1)standings[res$Index[res$CooksDistance > (mean(res$CooksDistance)*3)],]## # A tibble: 27 × 15
##    team         team_name   year  wins  loss points_for points_against
##    <chr>        <chr>      <dbl> <dbl> <dbl>      <dbl>          <dbl>
##  1 Tennessee    Titans      2000    13     3        346            191
##  2 Baltimore    Ravens      2000    12     4        333            165
##  3 New England  Patriots    2001    11     5        371            272
##  4 Chicago      Bears       2001    13     3        338            203
##  5 Tampa Bay    Buccaneers  2002    12     4        346            196
##  6 New England  Patriots    2003    14     2        348            238
##  7 New England  Patriots    2004    14     2        437            260
##  8 Pittsburgh   Steelers    2005    11     5        389            258
##  9 Baltimore    Ravens      2006    13     3        353            201
## 10 Indianapolis Colts       2006    12     4        427            360
## # ℹ 17 more rows
## # ℹ 8 more variables: points_differential <dbl>, margin_of_victory <dbl>,
## #   strength_of_schedule <dbl>, simple_rating <dbl>, offensive_ranking <dbl>,
## #   defensive_ranking <dbl>, playoffs <chr>, sb_winner <dbl>Many of these rows which are above the optimal cooks d value are sb_winner, suggesting that the model is innacurately modeling sb_winner when it is equal to 1. This makes sense when we consider that the dataset has far more rows which has sb_winner equaling 0 than 1, causing the model to be biased towards 0.
Of the high cooks d rows that had sb_winner equal to 0, all of them had low points_against, high points_for, or high defensive_ranking. I believe this is because many superbowl winning teams excel at at either defence, offense, or both, and if a team ranks high in these areas they are likely to win the superbowl.
Moving on, we can use the Broom library to analyze the coefficients of the model
library(broom)
tidy(model1, conf.int = TRUE)## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred## # A tibble: 7 × 7
##   term                 estimate std.error statistic p.value conf.low conf.high
##   <chr>                   <dbl>     <dbl>     <dbl>   <dbl>    <dbl>     <dbl>
## 1 (Intercept)           -50.9       39.1     -1.30  0.193    -131.       22.1 
## 2 log(points_for)        31.6       13.3      2.37  0.0176      6.98     59.1 
## 3 log(points_against)   -23.8        8.58    -2.77  0.00556   -41.2      -7.41
## 4 margin_of_victory      -1.75       4.22    -0.415 0.678     -10.1       6.55
## 5 strength_of_schedule   -0.458      4.20    -0.109 0.913      -8.75      7.84
## 6 offensive_ranking       0.781      4.21     0.186 0.853      -7.52      9.09
## 7 defensive_ranking       0.838      4.20     0.199 0.842      -7.45      9.14Some of the p-values are high, suggesting that many of the variables are not highly significant to sb_winner.
We can also use the summary function to analyze the model.
summary(model1)## 
## Call:
## glm(formula = sb_winner ~ log(points_for) + log(points_against) + 
##     margin_of_victory + strength_of_schedule + offensive_ranking + 
##     defensive_ranking, family = binomial(link = "logit"), data = standings)
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)   
## (Intercept)          -50.8928    39.0678  -1.303  0.19268   
## log(points_for)       31.6231    13.3227   2.374  0.01761 * 
## log(points_against)  -23.7895     8.5795  -2.773  0.00556 **
## margin_of_victory     -1.7511     4.2228  -0.415  0.67837   
## strength_of_schedule  -0.4580     4.2046  -0.109  0.91327   
## offensive_ranking      0.7814     4.2106   0.186  0.85277   
## defensive_ranking      0.8383     4.2035   0.199  0.84193   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 177.87  on 637  degrees of freedom
## Residual deviance: 129.04  on 631  degrees of freedom
## AIC: 143.04
## 
## Number of Fisher Scoring iterations: 8To understand which variables are significant for understanding sb_winner we can look at the correlation between our explanatory variables. An assumption of our model is that the variables are independent from one another, and therefore if the correlations are high we know that some variables are redundant.
library(tidyverse)## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ lubridate 1.9.3     ✔ tibble    3.2.1
## ✔ purrr     1.0.2     ✔ tidyr     1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ tidyr::expand() masks Matrix::expand()
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ✖ tidyr::pack()   masks Matrix::pack()
## ✖ dplyr::select() masks MASS::select()
## ✖ tidyr::unpack() masks Matrix::unpack()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errorsstandings |>
  select(points_for, points_against, margin_of_victory, strength_of_schedule, offensive_ranking, defensive_ranking) |>
  cor()##                      points_for points_against margin_of_victory
## points_for            1.0000000     -0.1852656         0.8153530
## points_against       -0.1852656      1.0000000        -0.7199803
## margin_of_victory     0.8153530     -0.7199803         1.0000000
## strength_of_schedule -0.1956613      0.1055722        -0.2002940
## offensive_ranking     0.9517868     -0.2459601         0.8170490
## defensive_ranking     0.1974664     -0.9286826         0.6866334
##                      strength_of_schedule offensive_ranking defensive_ranking
## points_for                    -0.19566126        0.95178679         0.1974664
## points_against                 0.10557222       -0.24596006        -0.9286826
## margin_of_victory             -0.20029397        0.81704898         0.6866334
## strength_of_schedule           1.00000000       -0.01589955         0.1225779
## offensive_ranking             -0.01589955        1.00000000         0.2245290
## defensive_ranking              0.12257793        0.22452903         1.0000000library(car)## Loading required package: carData## 
## Attaching package: 'car'## The following object is masked from 'package:dplyr':
## 
##     recode## The following object is masked from 'package:purrr':
## 
##     some## The following object is masked from 'package:arm':
## 
##     logitvif(model1)##      log(points_for)  log(points_against)    margin_of_victory 
##             49.01700             53.89719           5253.59495 
## strength_of_schedule    offensive_ranking    defensive_ranking 
##            680.57639           3226.59845           3697.65343Our Variance Inflation Factors are significantly higher than desired (>10). To reduce the VIF some of the variables that are highly correlated to one another need to be removed.
Variables like points_for, margin_of_victory, and offensive_ranking are all highly correlated. When we think about the relationship between these variables, points_for is the driving force while the others are dependent on it. For our next model, we will remove both margin_of_victory and offensing_ranking.
The same story is true for points_against. It is the driving force of variables margin_of_victory and defensive_ranking. In our next model, defensive_ranking will be removed.
model2 <- glm(sb_winner ~ log(points_for) + log(points_against) + strength_of_schedule,
             data = standings,
             family = binomial(link = 'logit'))
vif(model2)##      log(points_for)  log(points_against) strength_of_schedule 
##             1.134751             1.147889             1.050698By rule of thumb, VIF values should be below 10. Model2 meets this criteria.
Another way we can analyze which variables are best for the model is to use step-wise selection process.
step(model1, direction = "backward")## Start:  AIC=143.04
## sb_winner ~ log(points_for) + log(points_against) + margin_of_victory + 
##     strength_of_schedule + offensive_ranking + defensive_ranking
## 
##                        Df Deviance    AIC
## - strength_of_schedule  1   129.05 141.05
## - offensive_ranking     1   129.07 141.07
## - defensive_ranking     1   129.08 141.08
## - margin_of_victory     1   129.21 141.21
## <none>                      129.04 143.04
## - log(points_for)       1   135.63 147.63
## - log(points_against)   1   137.33 149.33
## 
## Step:  AIC=141.05
## sb_winner ~ log(points_for) + log(points_against) + margin_of_victory + 
##     offensive_ranking + defensive_ranking
## 
##                       Df Deviance    AIC
## <none>                     129.05 141.05
## - offensive_ranking    1   131.30 141.30
## - defensive_ranking    1   133.21 143.21
## - log(points_for)      1   135.66 145.66
## - margin_of_victory    1   135.86 145.86
## - log(points_against)  1   137.34 147.34## 
## Call:  glm(formula = sb_winner ~ log(points_for) + log(points_against) + 
##     margin_of_victory + offensive_ranking + defensive_ranking, 
##     family = binomial(link = "logit"), data = standings)
## 
## Coefficients:
##         (Intercept)      log(points_for)  log(points_against)  
##            -51.1011              31.6686             -23.7995  
##   margin_of_victory    offensive_ranking    defensive_ranking  
##             -1.2947               0.3235               0.3810  
## 
## Degrees of Freedom: 637 Total (i.e. Null);  632 Residual
## Null Deviance:       177.9 
## Residual Deviance: 129   AIC: 141Using step-wise selection we select based on which model gives the lowest AIC value.
Lets compare the models we have created so far, that is our original model, the step-wise selection model, and low VIF model.
model2 <- glm(sb_winner ~ log(points_for) + log(points_against) + strength_of_schedule, 
                        data = standings, 
                        family = binomial(link = 'logit'))
model3 <- glm(formula = sb_winner ~ log(points_for) + log(points_against) + margin_of_victory + defensive_ranking, 
              data = standings, 
              family = binomial(link = "logit"))binnedplot(predict(model2, type = "response"),
           residuals(model2, type = "response"))res <- data.frame(Index = 1:length(cooks.distance(model2)),
                  CooksDistance = cooks.distance(model2))
# Plot index against the Cook's distance to find
# influential points:
ggplot(res, aes(Index, CooksDistance)) +
      geom_point() +
      geom_text(aes(label = ifelse(CooksDistance > (mean(res$CooksDistance)*3),
                                   rownames(res), "")),
                hjust = 1.1)standings[res$Index[res$CooksDistance > (mean(res$CooksDistance)*3)],]## # A tibble: 26 × 15
##    team         team_name   year  wins  loss points_for points_against
##    <chr>        <chr>      <dbl> <dbl> <dbl>      <dbl>          <dbl>
##  1 Tennessee    Titans      2000    13     3        346            191
##  2 Baltimore    Ravens      2000    12     4        333            165
##  3 New England  Patriots    2001    11     5        371            272
##  4 Tampa Bay    Buccaneers  2002    12     4        346            196
##  5 New England  Patriots    2003    14     2        348            238
##  6 New England  Patriots    2004    14     2        437            260
##  7 Pittsburgh   Steelers    2005    11     5        389            258
##  8 Baltimore    Ravens      2006    13     3        353            201
##  9 Indianapolis Colts       2006    12     4        427            360
## 10 New England  Patriots    2007    16     0        589            274
## # ℹ 16 more rows
## # ℹ 8 more variables: points_differential <dbl>, margin_of_victory <dbl>,
## #   strength_of_schedule <dbl>, simple_rating <dbl>, offensive_ranking <dbl>,
## #   defensive_ranking <dbl>, playoffs <chr>, sb_winner <dbl>Model2 residuals appear similar to model1 residuals
binnedplot(predict(model3, type = "response"),
           residuals(model3, type = "response"))More of the data points fall outside of the grey line compared to model1 and model2, suggesting that model3 is doing a worse job at modelling.
res <- data.frame(Index = 1:length(cooks.distance(model3)),
                  CooksDistance = cooks.distance(model3))
# Plot index against the Cook's distance to find
# influential points:
ggplot(res, aes(Index, CooksDistance)) +
      geom_point() +
      geom_text(aes(label = ifelse(CooksDistance > (mean(res$CooksDistance)*3),
                                   rownames(res), "")),
                hjust = 1.1)standings[res$Index[res$CooksDistance > (mean(res$CooksDistance)*3)],]## # A tibble: 26 × 15
##    team         team_name   year  wins  loss points_for points_against
##    <chr>        <chr>      <dbl> <dbl> <dbl>      <dbl>          <dbl>
##  1 Tennessee    Titans      2000    13     3        346            191
##  2 Baltimore    Ravens      2000    12     4        333            165
##  3 New England  Patriots    2001    11     5        371            272
##  4 Chicago      Bears       2001    13     3        338            203
##  5 Tampa Bay    Buccaneers  2002    12     4        346            196
##  6 New England  Patriots    2003    14     2        348            238
##  7 New England  Patriots    2004    14     2        437            260
##  8 Pittsburgh   Steelers    2005    11     5        389            258
##  9 Baltimore    Ravens      2006    13     3        353            201
## 10 Indianapolis Colts       2006    12     4        427            360
## # ℹ 16 more rows
## # ℹ 8 more variables: points_differential <dbl>, margin_of_victory <dbl>,
## #   strength_of_schedule <dbl>, simple_rating <dbl>, offensive_ranking <dbl>,
## #   defensive_ranking <dbl>, playoffs <chr>, sb_winner <dbl>Model3 has a higher max residual than model1 and model2
Moving on we can look at the summary function to analyze the coefficients of each model.
summary(model1)## 
## Call:
## glm(formula = sb_winner ~ log(points_for) + log(points_against) + 
##     margin_of_victory + strength_of_schedule + offensive_ranking + 
##     defensive_ranking, family = binomial(link = "logit"), data = standings)
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)   
## (Intercept)          -50.8928    39.0678  -1.303  0.19268   
## log(points_for)       31.6231    13.3227   2.374  0.01761 * 
## log(points_against)  -23.7895     8.5795  -2.773  0.00556 **
## margin_of_victory     -1.7511     4.2228  -0.415  0.67837   
## strength_of_schedule  -0.4580     4.2046  -0.109  0.91327   
## offensive_ranking      0.7814     4.2106   0.186  0.85277   
## defensive_ranking      0.8383     4.2035   0.199  0.84193   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 177.87  on 637  degrees of freedom
## Residual deviance: 129.04  on 631  degrees of freedom
## AIC: 143.04
## 
## Number of Fisher Scoring iterations: 8summary(model2)## 
## Call:
## glm(formula = sb_winner ~ log(points_for) + log(points_against) + 
##     strength_of_schedule, family = binomial(link = "logit"), 
##     data = standings)
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)            3.3643    10.5870   0.318 0.750651    
## log(points_for)        5.7585     1.6804   3.427 0.000611 ***
## log(points_against)   -7.1515     1.4145  -5.056 4.28e-07 ***
## strength_of_schedule   0.3072     0.1588   1.934 0.053068 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 177.87  on 637  degrees of freedom
## Residual deviance: 133.52  on 634  degrees of freedom
## AIC: 141.52
## 
## Number of Fisher Scoring iterations: 8summary(model3)## 
## Call:
## glm(formula = sb_winner ~ log(points_for) + log(points_against) + 
##     margin_of_victory + defensive_ranking, family = binomial(link = "logit"), 
##     data = standings)
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)  
## (Intercept)         -74.4258    36.0386  -2.065   0.0389 *
## log(points_for)      33.4283    13.5534   2.466   0.0136 *
## log(points_against) -21.5641     8.6272  -2.500   0.0124 *
## margin_of_victory    -1.0639     0.4935  -2.156   0.0311 *
## defensive_ranking     0.2870     0.1781   1.611   0.1071  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 177.87  on 637  degrees of freedom
## Residual deviance: 131.30  on 633  degrees of freedom
## AIC: 141.3
## 
## Number of Fisher Scoring iterations: 8The coefficients of model2 all appear to be significant. Some of the coefficients in model1 and model3 appear to be not be significant. This suggests that model2 is doing the best at modelling the response variable.
Model1 has the lowest residual deviance with a values of 129.04.
Model2 and Model3 have similarly low Akaike Information Criterion (AIC) value, meaning they are more likely to minimize information loss.
paste("Model 1 BIC", round(BIC(model1), 2))## [1] "Model 1 BIC 174.24"paste("Model 2 BIC", round(BIC(model2), 2))## [1] "Model 2 BIC 159.36"paste("Model 3 BIC", round(BIC(model3), 2))## [1] "Model 3 BIC 163.59"Model2 has the lowest BIC value. The BIC further penalizes parameters based on the number of rows of data.
We can perform ANOVA on the models to understand the significance of the deviance between models. Low p-values suggest that the deviance is significant and the models are different.
anova(model1, model2, model3, test = "Chisq")## Analysis of Deviance Table
## 
## Model 1: sb_winner ~ log(points_for) + log(points_against) + margin_of_victory + 
##     strength_of_schedule + offensive_ranking + defensive_ranking
## Model 2: sb_winner ~ log(points_for) + log(points_against) + strength_of_schedule
## Model 3: sb_winner ~ log(points_for) + log(points_against) + margin_of_victory + 
##     defensive_ranking
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1       631     129.04                     
## 2       634     133.52 -3  -4.4873   0.2134
## 3       633     131.30  1   2.2214   0.1361Since the p-values are high (>0.05), the deviance between the models is not significant.
It is often assumed that the simpler the model, the better. Considering the deviance between models is not significant, it is best to choose the model with the least variables, which would be model2.
Based on all of the analysis, model2 appears to be the best so far for modeling the response variable sb_winner.