Due: Mon Mar 25, 2024 11:59pm
The purpose of this week’s data dive is for you to attempt building a generalized linear model, or to make adjustments to variables represented in a model.
Your RMarkdown notebook for this data dive should contain the following:
Select an interesting binary column of data, or one which can be reasonably converted into a binary variable
Build a logistic regression model for this variable, using between 1-4 explanatory variables
Interpret the coefficients, and explain what they mean in your notebook
Using the Standard Error for at least one coefficient, build a C.I. for that coefficient, and translate its meaning
Consider a transformation for any explanatory variable, and illustrate why you need the transformation (or why you do not)
For each of the above tasks, you must explain to the reader what insight was gathered, its significance, and any further questions you have which might need to be further investigated.
In this data dive I will be using NFL Standings data which comes from Pro Football Reference team standings.
standings <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-04/standings.csv')## Rows: 638 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (4): team, team_name, playoffs, sb_winner
## dbl (11): year, wins, loss, points_for, points_against, points_differential,...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.standings## # A tibble: 638 × 15
##    team         team_name  year  wins  loss points_for points_against
##    <chr>        <chr>     <dbl> <dbl> <dbl>      <dbl>          <dbl>
##  1 Miami        Dolphins   2000    11     5        323            226
##  2 Indianapolis Colts      2000    10     6        429            326
##  3 New York     Jets       2000     9     7        321            321
##  4 Buffalo      Bills      2000     8     8        315            350
##  5 New England  Patriots   2000     5    11        276            338
##  6 Tennessee    Titans     2000    13     3        346            191
##  7 Baltimore    Ravens     2000    12     4        333            165
##  8 Pittsburgh   Steelers   2000     9     7        321            255
##  9 Jacksonville Jaguars    2000     7     9        367            327
## 10 Cincinnati   Bengals    2000     4    12        185            359
## # ℹ 628 more rows
## # ℹ 8 more variables: points_differential <dbl>, margin_of_victory <dbl>,
## #   strength_of_schedule <dbl>, simple_rating <dbl>, offensive_ranking <dbl>,
## #   defensive_ranking <dbl>, playoffs <chr>, sb_winner <chr>The binary column of data I will be modeling is ‘sb_winner’. The values of this column are ‘No Superbowl’ and ‘Won Superbowl’ which will be converted to 0 and 1.
standings$sb_winner <- ifelse(standings$sb_winner == 'Won Superbowl', 1, 0)
standings$sb_winner##   [1] 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
##  [38] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
##  [75] 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [112] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [149] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [186] 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [223] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
## [260] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [297] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [334] 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
## [371] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [408] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
## [445] 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [482] 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0
## [519] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [556] 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [593] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
## [630] 0 0 0 0 0 0 0 0 0My explanatory variables will be points_for, points_against, and margin_of_victory.
library(ggplot2)
library(patchwork)## Warning: package 'patchwork' was built under R version 4.3.3p1 <- standings |>
  ggplot(mapping = aes(x = points_for, y = sb_winner)) +
  geom_jitter(width = 0, height = 0.1, shape = 'O', size = 3) +
  labs(title = "Modeling a Binary ") +
  theme_minimal()
p2 <- standings |>
  ggplot(mapping = aes(x = points_against, y = sb_winner)) +
  geom_jitter(width = 0, height = 0.1, shape = 'O', size = 3) +
  labs(title = "Response with OLS") 
p3<- standings |>
  ggplot(mapping = aes(x = margin_of_victory, y = sb_winner)) +
  geom_jitter(width = 0, height = 0.1, shape = 'O', size = 3) 
p1 + p2 + p3The visualization gives us some indication that modeling sb_winner using the explanatory variables could be possible.
When creating the model we use the a general linear model and use a logistic link function to transform our explanatory variables to our response variable sb_winner.
model <- glm(sb_winner ~ points_for + points_against + margin_of_victory,
             data = standings,
             family = binomial(link = 'logit'))
model$coefficients##       (Intercept)        points_for    points_against margin_of_victory 
##        -0.6438372        -0.2775911         0.2662348         4.6267392The coefficients in our model are multiplied by the explanatory variable to map to sb_winner. The full formula is as follows:
\(P(\mathbf\ \texttt{sb_winner}) = p\)
\[ p = \frac{1}{1 + e^{-(-0.6438371 + (-0.2775911\times\texttt{points_for}) + (0.2662348\times\texttt{points_against}) + (4.6267392\times\texttt{margin_of_victory}))}} \]
We can use the broom library to analyze the coefficients of the model
library(broom)
tidy(model, conf.int = TRUE)## # A tibble: 4 × 7
##   term              estimate std.error statistic p.value conf.low conf.high
##   <chr>                <dbl>     <dbl>     <dbl>   <dbl>    <dbl>     <dbl>
## 1 (Intercept)         -0.644     1.81     -0.356   0.722   -4.20      2.93 
## 2 points_for          -0.278     0.502    -0.553   0.580   -1.28      0.712
## 3 points_against       0.266     0.502     0.530   0.596   -0.723     1.26 
## 4 margin_of_victory    4.63      8.03      0.576   0.564  -11.2      20.6Our model does not appear to be a good model. All of the p-values are high, suggesting there is no evidence the coefficients are not zero.
To build a confidence interval we need to take the standard error of each coefficient and multiply it be a Z value to get a margin of error. Z is the z score value that gives us our confidence level of 95%, which is 1.96. We then subtract and add the margin of error from the estimated coefficient to get the upper and lower intervals.
tidy_data <- tidy(model)
Z <- 1.96 # 95% confidence
margin_of_error <- Z * tidy_data$std.error
CI_low = tidy_data$estimate - margin_of_error
CI_high = tidy_data$estimate + margin_of_error
tidy(model, conf.int = TRUE) |>
  ggplot(mapping = aes(x = estimate, y = term)) +
  geom_point() +
  geom_vline(xintercept = 0, linetype = 'dotted', color = 'gray') +
  geom_errorbarh(mapping = aes(xmin = CI_low, 
                               xmax = CI_high, 
                               height = 0.5)) +
  labs(title = "Model Coefficient C.I.")If the confidence interval for a coefficient includes zero, it suggest there is not enough evidence to conclude that there is a significant relationship between the response variable and the explanatory variable at the confidence level.
The distance of the interval gives an indication of the strength of the relationship.
To understand if a transformation is necessary, we need to analyze the explanatory variables we are using for our model.
Lets begin by first looking at points_for
standings |>
  ggplot(mapping = aes(x = points_for)) +
  geom_histogram(color = 'white') +
  labs("Points Scored by NFL Teams per Season")## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.Notice how none of the values fall below zero. This makes sense considering it is impossible for a team to score negative points. This observation tells us that the data follows a Poisson distribution. We can perform a log transformation on Poisson distributions to help linearize the data.
p1 <- standings |>
  ggplot(mapping = aes(x = log(points_for), y = sb_winner)) +
  geom_jitter(width = 0, height = 0.1, shape = 'O', size = 3) +
  labs(title = "Modeling a Binary ") +
  theme_minimal()
p1The visualization shows us the X values have been transformed.
Moving on to analyzing points_against.
standings |>
  ggplot(mapping = aes(x = points_against)) +
  geom_histogram(color = 'white') +
  labs("Points Against by NFL Teams per Season")## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.The same observation as before can be made about the data following a Poisson distribution. We can expect there to never be a team that concedes negative points.
p2 <- standings |>
  ggplot(mapping = aes(x = log(points_against), y = sb_winner)) +
  geom_jitter(width = 0, height = 0.1, shape = 'O', size = 3) +
  labs(title = "Modeling a Binary ") +
  theme_minimal()
p2Again, looking at the visualization we can verify the X values have been transformed
Finally, lets analyze margin_of_victory
standings |>
  ggplot(mapping = aes(x = margin_of_victory)) +
  geom_histogram(color = 'white') +
  labs("Points Against by NFL Teams per Season")## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.margin_of_victory isn’t count data, like our other variables, so its not a Poisson distribution. This time the X values contain negative numbers. margin_of_victory is the average margin of points a team has with their opponents across a season. If a team has more highly negative margin games they will have a negative margin_of_victory. The visualization shows a roughly normal distribution. We do not need to perform transformations on this variable.
p3 <- standings |>
  ggplot(mapping = aes(x = margin_of_victory, y = sb_winner)) +
  geom_jitter(width = 0, height = 0.1, shape = 'O', size = 3) +
  labs(title = "Modeling a Binary ") +
  theme_minimal()
p3Combining our transformations together we can create a new model
model2 <- glm(sb_winner ~ log(points_for) + log(points_against) + margin_of_victory,
             data = standings,
             family = binomial(link = 'logit'))
model2$coefficients##         (Intercept)     log(points_for) log(points_against)   margin_of_victory 
##         -36.6326167          26.5933900         -21.1551091          -0.8012517tidy(model2, conf.int = TRUE)## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred## # A tibble: 4 × 7
##   term                estimate std.error statistic p.value conf.low conf.high
##   <chr>                  <dbl>     <dbl>     <dbl>   <dbl>    <dbl>     <dbl>
## 1 (Intercept)          -36.6      24.6       -1.49  0.137    -85.8    10.5   
## 2 log(points_for)       26.6      12.0        2.22  0.0264     4.00   50.7   
## 3 log(points_against)  -21.2       8.21      -2.58  0.0100   -37.8    -5.77  
## 4 margin_of_victory     -0.801     0.435     -1.84  0.0655    -1.66    0.0422Our p-values have decreased, suggesting we can be more confidence that the coefficients are not zero, and that the model is doing a better job of fitting the data.
tidy_data <- tidy(model2)
Z <- 1.96 # 95% confidence
margin_of_error <- Z * tidy_data$std.error
CI_low = tidy_data$estimate - margin_of_error
CI_high = tidy_data$estimate + margin_of_error
tidy(model2, conf.int = TRUE) |>
  ggplot(mapping = aes(x = estimate, y = term)) +
  geom_point() +
  geom_vline(xintercept = 0, linetype = 'dotted', color = 'gray') +
  geom_errorbarh(mapping = aes(xmin = CI_low, 
                               xmax = CI_high, 
                               height = 0.5)) +
  labs(title = "Model Coefficient C.I.")## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurredAfter performing our transformations we can see we have evidence to reject the hypothesis that coefficients for log(points_for), log(points_against) and margin_of_victory are zero.
We can further summarize the models to analyze their performances
summary(model)## 
## Call:
## glm(formula = sb_winner ~ points_for + points_against + margin_of_victory, 
##     family = binomial(link = "logit"), data = standings)
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)
## (Intercept)        -0.6438     1.8091  -0.356    0.722
## points_for         -0.2776     0.5019  -0.553    0.580
## points_against      0.2662     0.5020   0.530    0.596
## margin_of_victory   4.6267     8.0265   0.576    0.564
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 177.87  on 637  degrees of freedom
## Residual deviance: 139.89  on 634  degrees of freedom
## AIC: 147.89
## 
## Number of Fisher Scoring iterations: 8summary(model2)## 
## Call:
## glm(formula = sb_winner ~ log(points_for) + log(points_against) + 
##     margin_of_victory, family = binomial(link = "logit"), data = standings)
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)  
## (Intercept)         -36.6326    24.6277  -1.487   0.1369  
## log(points_for)      26.5934    11.9755   2.221   0.0264 *
## log(points_against) -21.1551     8.2149  -2.575   0.0100 *
## margin_of_victory    -0.8013     0.4351  -1.842   0.0655 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 177.87  on 637  degrees of freedom
## Residual deviance: 133.98  on 634  degrees of freedom
## AIC: 141.98
## 
## Number of Fisher Scoring iterations: 8We have evidence to suggest that model 2 is a better model. Model 2’s residual deviance is 133.98, compared to model 1’s residual deviance of 139.89. The residual difference is a measure of how well the model fits the data. The lower the better.
Additionally, like I mentioned before, In Model 2, the coefficients for log(points_for) and log(points_against) are statistically significant (p-values < 0.05), indicating that these explanatory variables have a significant relationship with the response variable sb_winner . In contrast, none of the coefficients in Model 1 are statistically significant.