Week 10 Data Dive - GLMs

library(conflicted)   
library(tidyverse) 
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2
conflict_prefer("filter", "dplyr") 
## [conflicted] Will prefer dplyr::filter over any other package.
conflict_prefer("lag", "dplyr")
## [conflicted] Will prefer dplyr::lag over any other package.
# load ncaa file I cleaned 
ncaa <- read.csv("./ncaa_clean.csv", header = TRUE)

Selecting Binary Column of Data

None of my data is binary, but we can reasonably convert some into binary. In this case, I’ll create a binary is/is not a FBS school (given its D1). To supplement this, I’ll also add a binary football column as a explanatory variable for the model.

# creates binary for if a football team exists or not
ncaa$football <- ifelse(ncaa$sports == "Football", 1, 0)
# creates binary for if team is FBS
ncaa$fbs <- ifelse(ncaa$classification_code == 1, 1, 0)
binary <- ncaa |>
  filter(year == 2019) |>
  filter(classification_code %in% c(1,2,3)) |>
  group_by(institution_name) |>
  summarise(revenue = sum(rev_men, na.rm = TRUE) +
              sum(rev_women, na.rm = TRUE),
            fbs = max(fbs),
            expense = sum(exp_men, na.rm = TRUE) + 
              sum(exp_women, na.rm = TRUE),
            division = min(classification_name),
            students = sum(max(ef_male_count, na.rm = TRUE) +
                             max(ef_female_count, na.rm = TRUE)),
            athletes = sum(sum_partic_men, na.rm = TRUE) +
                        sum(sum_partic_women, na.rm = TRUE),
            football = max(football)
            )

Building a Logistic Regression Model

You can see how the data looks when swapping out different response variables for “x” in ggplot()

binary |>
  ggplot(mapping = aes(x = revenue, y = fbs)) +
  geom_jitter(width = 0, height = 0.1, shape = 'O', size = 3) +
  labs(title = "Plotting fbs variable based on different response variables") +
  scale_y_continuous(breaks = c(0, 1)) +
  theme_minimal()

Determining Response Variables for the Model

cor(binary[c('revenue', 'expense', 'students', 'athletes', 'football')])
##            revenue   expense  students  athletes  football
## revenue  1.0000000 0.9638226 0.6226520 0.5355185 0.3329468
## expense  0.9638226 1.0000000 0.6295991 0.5585497 0.3617943
## students 0.6226520 0.6295991 1.0000000 0.3946601 0.2194238
## athletes 0.5355185 0.5585497 0.3946601 1.0000000 0.4308708
## football 0.3329468 0.3617943 0.2194238 0.4308708 1.0000000

As expected, revenue and expenses are highly correlated, so I’ll probably opt for sticking with revenue instead of adding expenses, but I think students, athletes, and football could be good variables to include in the model.

I’ll just use revenue for this first model.

# using just revenue for the model
model <- glm(fbs ~ revenue, data = binary,
             family = binomial(link = 'logit'))

model$coefficients
##   (Intercept)       revenue 
## -4.971467e+00  1.929728e-07
# these coefficients come from the model
sigmoid <- \(x) 1 / (1 + exp(-(-4.9715 + 0.0000001929728 * x)))

binary |>
  ggplot(mapping = aes(x = revenue, y = fbs)) +
  geom_jitter(width = 0, height = 0.1, shape = 'O', size = 3) +
  geom_function(fun = sigmoid, color = 'blue', linewidth = 1) +
  labs(title = "Modeling a Binary Response with Sigmoid") +
  scale_y_continuous(breaks = c(0, 0.5, 1)) +
  theme_minimal()

Evaluating Coefficients

We have two coefficients for this model: the intercept and revenue coefficient.

Intercept: log-odds when explanatory variables go to 0, so there is a 50/50 chance that the binary variable is above or below the given threshold. We can find this threshold by dividing the intercept by the other coefficient.

4.9714668113316 / 0.0000001929728 
## [1] 25762526

We can see this threshold is about $25,762,526. Revenues are greater? We’d start to expect its a FBS school. Revenues less? Probably not a FBS.

For the revenue coefficient, a single dollar is extremely small compared to the multi-millions we’re working with, so I’ll try adjusting this to reflect the base unit to be a million dollars instead of just one.

0.0000001929728 * 1000000
## [1] 0.1929728

This would be the new coefficient if we were working with millions of dollars as the base unit. We can raise e to this exponent, roughly 0.193, and get 1.2129. This translates into about a 21.3% increase in the likelihood that a school is FBS as revenues increase by $1m.

Confidence Interval

library(boot)
summary(model)
## 
## Call:
## glm(formula = fbs ~ revenue, family = binomial(link = "logit"), 
##     data = binary)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -4.971e+00  5.477e-01  -9.077  < 2e-16 ***
## revenue      1.930e-07  2.535e-08   7.613 2.67e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 458.53  on 349  degrees of freedom
## Residual deviance: 204.16  on 348  degrees of freedom
## AIC: 208.16
## 
## Number of Fisher Scoring iterations: 7
# finds critical value of t for 95% confidence interval of revenue
qt(p = .975, df = 348)
## [1] 1.966804
# MOE = t * SE(coef.)
MOE <- 1.966804 * 0.00000002535
# CI = coef. +/- MOE
CI_low <- 0.00000019297 - MOE
CI_high <- 0.00000019297 + MOE
CI_low
## [1] 1.431115e-07
CI_high
## [1] 2.428285e-07

This is hard to read on its own, so lets multiply by 1m

CI_high * 1000000
## [1] 0.2428285
CI_low * 1000000
## [1] 0.1431115

Our confidence interval for the coefficient of revenue is between (.143 - 0.243) x 10^-6. To reference back to 21.3% increase in likelihood that a school would be classified as FBS for every increase in a million dollars, if we apply the same logic to the numbers we found above, it means about 95% of the samples we’d take would have its estimated increase in likelihood be between e raised to those two powers, or between 15.4% and 27.5%.