Due: Mon Mar 25, 2024 11:59pm
The purpose of this week’s data dive is for you to attempt building a generalized linear model, or to make adjustments to variables represented in a model.
Your RMarkdown notebook for this data dive should contain the following:
Select an interesting binary column of data, or one which can be reasonably converted into a binary variable
Build a logistic regression model for this variable, using 1 to 4 explanatory variables
Interpret the coefficients, and explain what they mean in your notebook
Using the Standard Error for at least one coefficient, build a C.I. for that coefficient, and translate its meaning
Consider a transformation for any explanatory variable, and illustrate why you need the transformation (or why you do not)
For each of the above tasks, you must explain to the reader what insight was gathered, its significance, and any further questions that might warrant investigation.
In this data dive I will be using NFL standings data from Pro Football Reference, distributed through the TidyTuesday repository.
standings <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-02-04/standings.csv')
## Rows: 638 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): team, team_name, playoffs, sb_winner
## dbl (11): year, wins, loss, points_for, points_against, points_differential,...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
standings
## # A tibble: 638 × 15
## team team_name year wins loss points_for points_against
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Miami Dolphins 2000 11 5 323 226
## 2 Indianapolis Colts 2000 10 6 429 326
## 3 New York Jets 2000 9 7 321 321
## 4 Buffalo Bills 2000 8 8 315 350
## 5 New England Patriots 2000 5 11 276 338
## 6 Tennessee Titans 2000 13 3 346 191
## 7 Baltimore Ravens 2000 12 4 333 165
## 8 Pittsburgh Steelers 2000 9 7 321 255
## 9 Jacksonville Jaguars 2000 7 9 367 327
## 10 Cincinnati Bengals 2000 4 12 185 359
## # ℹ 628 more rows
## # ℹ 8 more variables: points_differential <dbl>, margin_of_victory <dbl>,
## # strength_of_schedule <dbl>, simple_rating <dbl>, offensive_ranking <dbl>,
## # defensive_ranking <dbl>, playoffs <chr>, sb_winner <chr>
The binary column of data I will be modeling is ‘sb_winner’. The values of this column are ‘No Superbowl’ and ‘Won Superbowl’ which will be converted to 0 and 1.
standings$sb_winner <- ifelse(standings$sb_winner == 'Won Superbowl', 1, 0)
standings$sb_winner
## [1] 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
## [38] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [75] 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [112] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [149] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [186] 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [223] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
## [260] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [297] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [334] 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
## [371] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [408] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
## [445] 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [482] 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0
## [519] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [556] 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [593] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
## [630] 0 0 0 0 0 0 0 0 0
My explanatory variables will be points_for, points_against, and margin_of_victory.
library(ggplot2)
library(patchwork)
## Warning: package 'patchwork' was built under R version 4.3.3
p1 <- standings |>
  ggplot(mapping = aes(x = points_for, y = sb_winner)) +
  geom_jitter(width = 0, height = 0.1, shape = 'O', size = 3) +
  labs(title = "Modeling a Binary Response") +
  theme_minimal()
p2 <- standings |>
  ggplot(mapping = aes(x = points_against, y = sb_winner)) +
  geom_jitter(width = 0, height = 0.1, shape = 'O', size = 3)
p3 <- standings |>
  ggplot(mapping = aes(x = margin_of_victory, y = sb_winner)) +
  geom_jitter(width = 0, height = 0.1, shape = 'O', size = 3)
p1 + p2 + p3
The jittered scatter plots give some indication that these explanatory variables separate Super Bowl winners from the rest of the teams, so modeling sb_winner with them looks plausible.
To create the model we use a generalized linear model with a logit link function, which maps the linear combination of our explanatory variables to the probability that sb_winner equals 1.
model <- glm(sb_winner ~ points_for + points_against + margin_of_victory,
             data = standings,
             family = binomial(link = 'logit'))
model$coefficients
## (Intercept) points_for points_against margin_of_victory
## -0.6438372 -0.2775911 0.2662348 4.6267392
Each coefficient is multiplied by its explanatory variable inside the logistic function to produce the modeled probability that a team wins the Super Bowl. Writing \(P(\texttt{sb_winner} = 1) = p\), the full formula is:
\[ p = \frac{1}{1 + e^{-(-0.6438372 + (-0.2775911\times\texttt{points_for}) + (0.2662348\times\texttt{points_against}) + (4.6267392\times\texttt{margin_of_victory}))}} \]
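As a sanity check on this formula, the snippet below (a sketch that assumes the model object fit above) plugs the first row of standings into the equation by hand and compares the result with predict().
b <- coef(model)
x <- standings[1, ]
# linear predictor (log-odds) for the first team, built from the coefficients
log_odds <- b[1] + b[2] * x$points_for + b[3] * x$points_against +
  b[4] * x$margin_of_victory
1 / (1 + exp(-log_odds))                        # manual logistic transform
predict(model, newdata = x, type = "response")  # should agree with the above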
We can use the broom library to analyze the coefficients of the model.
library(broom)
tidy(model, conf.int = TRUE)
## # A tibble: 4 × 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -0.644 1.81 -0.356 0.722 -4.20 2.93
## 2 points_for -0.278 0.502 -0.553 0.580 -1.28 0.712
## 3 points_against 0.266 0.502 0.530 0.596 -0.723 1.26
## 4 margin_of_victory 4.63 8.03 0.576 0.564 -11.2 20.6
This does not appear to be a good model. All of the p-values are large, so there is no real evidence that any of the coefficients differ from zero. One likely reason is that margin_of_victory is essentially the average of points_for minus points_against, so the three explanatory variables are highly collinear, which inflates the standard errors.
To build a confidence interval, we take the standard error of a coefficient and multiply it by a Z value to get a margin of error. Z is the z-score corresponding to our 95% confidence level, which is 1.96. We then subtract and add the margin of error to the estimated coefficient to get the lower and upper bounds of the interval.
tidy_data <- tidy(model)
Z <- 1.96 # 95% confidence
margin_of_error <- Z * tidy_data$std.error
CI_low <- tidy_data$estimate - margin_of_error
CI_high <- tidy_data$estimate + margin_of_error
tidy(model, conf.int = TRUE) |>
  ggplot(mapping = aes(x = estimate, y = term)) +
  geom_point() +
  geom_vline(xintercept = 0, linetype = 'dotted', color = 'gray') +
  geom_errorbarh(mapping = aes(xmin = CI_low, xmax = CI_high),
                 height = 0.5) +
  labs(title = "Model Coefficient C.I.")
If the confidence interval for a coefficient includes zero, it suggests there is not enough evidence, at that confidence level, to conclude that the explanatory variable has a relationship with the response variable. The width of the interval indicates how precisely the coefficient is estimated: a wide interval means the estimate is very uncertain. For example, the 95% interval for margin_of_victory stretches from roughly -11 to 21, so a coefficient of zero (or even a negative one) cannot be ruled out.
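Another way to translate a coefficient and its interval is to exponentiate them, which moves from the log-odds scale to the odds-ratio scale. Below is a minimal sketch that reuses the tidy_data, CI_low, and CI_high objects computed above.
# A sketch: express the margin_of_victory estimate and its Wald C.I. as
# odds ratios; an interval containing 1 on this scale corresponds to an
# interval containing 0 on the log-odds scale.
i <- which(tidy_data$term == "margin_of_victory")
exp(c(estimate = tidy_data$estimate[i], low = CI_low[i], high = CI_high[i]))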
To understand whether a transformation is necessary, we need to analyze the explanatory variables used in the model.
Let's begin by looking at points_for.
standings |>
  ggplot(mapping = aes(x = points_for)) +
  geom_histogram(color = 'white') +
  labs(title = "Points Scored by NFL Teams per Season")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Notice how none of the values fall below zero. This makes sense considering it is impossible for a team to score negative points. Season point totals are count data, the kind of data a Poisson distribution describes, and a log transformation is a common way to compress the scale of such a variable and help linearize its relationship with the log-odds of the response.
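To see what the transformation does to the distribution itself, here is a short sketch (assuming ggplot2 and patchwork are already loaded) that puts the raw and log-scale histograms side by side.
# A sketch: histograms of points_for on the raw and log scales;
# the log scale compresses the upper tail of the distribution.
h_raw <- standings |>
  ggplot(mapping = aes(x = points_for)) +
  geom_histogram(bins = 30, color = 'white') +
  labs(title = "points_for (raw)")
h_log <- standings |>
  ggplot(mapping = aes(x = log(points_for))) +
  geom_histogram(bins = 30, color = 'white') +
  labs(title = "points_for (log)")
h_raw + h_log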
p1 <- standings |>
  ggplot(mapping = aes(x = log(points_for), y = sb_winner)) +
  geom_jitter(width = 0, height = 0.1, shape = 'O', size = 3) +
  labs(title = "log(points_for) vs. sb_winner") +
  theme_minimal()
p1
The visualization shows the x values now sit on the log scale, which pulls the larger values of points_for closer together.
Moving on to analyzing points_against.
standings |>
  ggplot(mapping = aes(x = points_against)) +
  geom_histogram(color = 'white') +
  labs(title = "Points Against by NFL Teams per Season")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The same observation applies here: points_against is also count data, since a team can never concede negative points, so the same log transformation is reasonable.
p2 <- standings |>
  ggplot(mapping = aes(x = log(points_against), y = sb_winner)) +
  geom_jitter(width = 0, height = 0.1, shape = 'O', size = 3) +
  labs(title = "log(points_against) vs. sb_winner") +
  theme_minimal()
p2
Again, the visualization confirms the x values have been transformed to the log scale.
Finally, let's analyze margin_of_victory.
standings |>
  ggplot(mapping = aes(x = margin_of_victory)) +
  geom_histogram(color = 'white') +
  labs(title = "Margin of Victory for NFL Teams per Season")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
margin_of_victory isn't count data like our other variables, so it isn't Poisson-distributed, and its values can be negative. It is a team's average point margin against its opponents across a season, so a team that loses by more than it wins by ends the season with a negative margin_of_victory. The histogram shows a roughly normal distribution centered near zero, so we do not need to transform this variable.
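A quick numeric check of that symmetry, as a sketch using a base R summary: if the mean and median are close and the quartiles sit roughly symmetrically around them, a transformation is unlikely to help.
# A sketch: summary statistics for margin_of_victory as a symmetry check.
summary(standings$margin_of_victory)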
p3 <- standings |>
  ggplot(mapping = aes(x = margin_of_victory, y = sb_winner)) +
  geom_jitter(width = 0, height = 0.1, shape = 'O', size = 3) +
  labs(title = "margin_of_victory vs. sb_winner") +
  theme_minimal()
p3
Combining the transformations, we can fit a new model.
model2 <- glm(sb_winner ~ log(points_for) + log(points_against) + margin_of_victory,
              data = standings,
              family = binomial(link = 'logit'))
model2$coefficients
## (Intercept) log(points_for) log(points_against) margin_of_victory
## -36.6326167 26.5933900 -21.1551091 -0.8012517
tidy(model2, conf.int = TRUE)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## # A tibble: 4 × 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -36.6 24.6 -1.49 0.137 -85.8 10.5
## 2 log(points_for) 26.6 12.0 2.22 0.0264 4.00 50.7
## 3 log(points_against) -21.2 8.21 -2.58 0.0100 -37.8 -5.77
## 4 margin_of_victory -0.801 0.435 -1.84 0.0655 -1.66 0.0422
Our p-values have decreased, suggesting we can be more confident that the coefficients are not zero and that the model fits the data better. The repeated warning that fitted probabilities of numerically 0 or 1 occurred comes from the profile-likelihood confidence interval computation; it hints that some fitted probabilities are being pushed toward the extremes, so the intervals should be read with a little caution.
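One way to gauge how extreme the fit actually gets is to inspect the fitted probabilities directly; the sketch below assumes model2 is still in memory.
# A sketch: how close do model2's fitted probabilities get to 0 or 1?
# Values essentially at the boundaries would hint at quasi-separation.
range(fitted(model2))
sum(fitted(model2) > 0.99 | fitted(model2) < 0.01)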
tidy_data <- tidy(model2)
Z <- 1.96 # 95% confidence
margin_of_error <- Z * tidy_data$std.error
CI_low <- tidy_data$estimate - margin_of_error
CI_high <- tidy_data$estimate + margin_of_error
tidy(model2, conf.int = TRUE) |>
  ggplot(mapping = aes(x = estimate, y = term)) +
  geom_point() +
  geom_vline(xintercept = 0, linetype = 'dotted', color = 'gray') +
  geom_errorbarh(mapping = aes(xmin = CI_low, xmax = CI_high),
                 height = 0.5) +
  labs(title = "Model Coefficient C.I.")
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
After performing the transformations, we have evidence to reject the hypothesis that the coefficients for log(points_for) and log(points_against) are zero. The coefficient for margin_of_victory is only marginally significant (p ≈ 0.066), and its confidence interval still includes zero.
We can further summarize the two models to compare their performance.
summary(model)
##
## Call:
## glm(formula = sb_winner ~ points_for + points_against + margin_of_victory,
## family = binomial(link = "logit"), data = standings)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.6438 1.8091 -0.356 0.722
## points_for -0.2776 0.5019 -0.553 0.580
## points_against 0.2662 0.5020 0.530 0.596
## margin_of_victory 4.6267 8.0265 0.576 0.564
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 177.87 on 637 degrees of freedom
## Residual deviance: 139.89 on 634 degrees of freedom
## AIC: 147.89
##
## Number of Fisher Scoring iterations: 8
summary(model2)
##
## Call:
## glm(formula = sb_winner ~ log(points_for) + log(points_against) +
## margin_of_victory, family = binomial(link = "logit"), data = standings)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -36.6326 24.6277 -1.487 0.1369
## log(points_for) 26.5934 11.9755 2.221 0.0264 *
## log(points_against) -21.1551 8.2149 -2.575 0.0100 *
## margin_of_victory -0.8013 0.4351 -1.842 0.0655 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 177.87 on 637 degrees of freedom
## Residual deviance: 133.98 on 634 degrees of freedom
## AIC: 141.98
##
## Number of Fisher Scoring iterations: 8
We have evidence to suggest that model 2 is the better model. Model 2's residual deviance is 133.98, compared to model 1's residual deviance of 139.89; residual deviance measures how well a model fits the data, and lower is better. Model 2's AIC is also lower (141.98 vs. 147.89).
Additionally, as noted above, in model 2 the coefficients for log(points_for) and log(points_against) are statistically significant (p-values < 0.05), indicating that these explanatory variables have a significant relationship with the response variable sb_winner. In contrast, none of the coefficients in model 1 are statistically significant. A further question worth investigating is whether dropping margin_of_victory, which is nearly determined by points_for and points_against, would simplify the model without hurting the fit.
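To line the two models up in one place, here is a sketch using base R's AIC() (both model objects are assumed to still be in memory); since the models are not nested, AIC is a fairer comparison than a likelihood ratio test.
# A sketch: compare the two fitted models' AIC directly; lower is better.
AIC(model, model2)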