library(conflicted)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
conflict_prefer("filter", "dplyr")
## [conflicted] Will prefer dplyr::filter over any other package.
conflict_prefer("lag", "dplyr")
## [conflicted] Will prefer dplyr::lag over any other package.
# load ncaa file I cleaned
ncaa <- read.csv("./ncaa_clean.csv", header = TRUE)
None of my data is binary, but some of it can reasonably be converted. In this case, I’ll create a binary indicator for whether a school is or is not an FBS school (given that it’s D1). To supplement this, I’ll also add a binary football column as an explanatory variable for the model.
# creates binary for if a football team exists or not
ncaa$football <- ifelse(ncaa$sports == "Football", 1, 0)
# creates binary for if team is FBS
ncaa$fbs <- ifelse(ncaa$classification_code == 1, 1, 0)
binary <- ncaa |>
  filter(year == 2019) |>
  filter(classification_code %in% c(1, 2, 3)) |>
  group_by(institution_name) |>
  summarise(revenue = sum(rev_men, na.rm = TRUE) +
              sum(rev_women, na.rm = TRUE),
            fbs = max(fbs),
            expense = sum(exp_men, na.rm = TRUE) +
              sum(exp_women, na.rm = TRUE),
            division = min(classification_name),
            students = max(ef_male_count, na.rm = TRUE) +
              max(ef_female_count, na.rm = TRUE),
            athletes = sum(sum_partic_men, na.rm = TRUE) +
              sum(sum_partic_women, na.rm = TRUE),
            football = max(football)
            )
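As a quick sanity check on the summarised data, it helps to tally how the schools break down by division and FBS status. A small sketch using dplyr’s count() on the binary tibble built above:
# tally schools by division and FBS status
binary |>
  count(division, fbs)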
You can see how the data looks when swapping out different explanatory variables for “x” in ggplot()
binary |>
  ggplot(mapping = aes(x = revenue, y = fbs)) +
  geom_jitter(width = 0, height = 0.1, shape = 'O', size = 3) +
  labs(title = "Plotting fbs against different explanatory variables") +
  scale_y_continuous(breaks = c(0, 1)) +
  theme_minimal()
cor(binary[c('revenue', 'expense', 'students', 'athletes', 'football')])
## revenue expense students athletes football
## revenue 1.0000000 0.9638226 0.6226520 0.5355185 0.3329468
## expense 0.9638226 1.0000000 0.6295991 0.5585497 0.3617943
## students 0.6226520 0.6295991 1.0000000 0.3946601 0.2194238
## athletes 0.5355185 0.5585497 0.3946601 1.0000000 0.4308708
## football 0.3329468 0.3617943 0.2194238 0.4308708 1.0000000
As expected, revenue and expenses are highly correlated, so I’ll stick with revenue rather than adding expenses on top of it. Students, athletes, and football, on the other hand, could be good variables to include in the model.
I’ll just use revenue for this first model.
# using just revenue for the model
model <- glm(fbs ~ revenue, data = binary,
family = binomial(link = 'logit'))
model$coefficients
## (Intercept) revenue
## -4.971467e+00 1.929728e-07
# these coefficients come from the model
sigmoid <- \(x) 1 / (1 + exp(-(-4.9715 + 0.0000001929728 * x)))
binary |>
  ggplot(mapping = aes(x = revenue, y = fbs)) +
  geom_jitter(width = 0, height = 0.1, shape = 'O', size = 3) +
  geom_function(fun = sigmoid, color = 'blue', linewidth = 1) +
  labs(title = "Modeling a Binary Response with Sigmoid") +
  scale_y_continuous(breaks = c(0, 0.5, 1)) +
  theme_minimal()
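As an aside, the same curve can be drawn without hand-typing the rounded coefficients by pulling them from the fitted model; plogis() is base R’s logistic function, so this is just a sketch of an alternative to the hard-coded sigmoid above:
# build the sigmoid directly from the fitted coefficients instead of hard-coding them
sigmoid_fit <- \(x) plogis(coef(model)[["(Intercept)"]] + coef(model)[["revenue"]] * x)
# drop-in replacement for the curve above, e.g.
# geom_function(fun = sigmoid_fit, color = 'blue', linewidth = 1)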
We have two coefficients for this model: the intercept and the revenue coefficient.
Intercept: the log-odds of being FBS when the explanatory variable is 0, i.e. for a school with no revenue. The model predicts a 50/50 chance of being FBS at the revenue value where the whole linear predictor equals zero, and we can find that threshold by dividing the absolute value of the intercept by the revenue coefficient.
4.9714668113316 / 0.0000001929728
## [1] 25762526
We can see this threshold is about $25,762,526. If revenues are greater, we’d start to expect it’s an FBS school; if revenues are less, probably not.
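The same threshold can be computed straight from the model object instead of retyping rounded coefficients; a quick sketch:
# revenue at which the predicted probability of being FBS crosses 0.5
-coef(model)[["(Intercept)"]] / coef(model)[["revenue"]]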
For the revenue coefficient, a single dollar is extremely small compared to the multi-millions we’re working with, so I’ll rescale it so that the base unit is a million dollars instead of one dollar.
0.0000001929728 * 1000000
## [1] 0.1929728
This would be the new coefficient if we were working with millions of dollars as the base unit. Raising e to this exponent, roughly 0.193, gives about 1.2129, which translates into roughly a 21.3% increase in the odds that a school is FBS for every additional $1M in revenue.
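That figure can be checked in code; a quick sketch using the fitted model, with 1e6 as the same millions-of-dollars rescaling:
# odds ratio for a $1M increase in revenue (about 1.213)
exp(coef(model)[["revenue"]] * 1e6)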
library(boot)
summary(model)
##
## Call:
## glm(formula = fbs ~ revenue, family = binomial(link = "logit"),
## data = binary)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.971e+00 5.477e-01 -9.077 < 2e-16 ***
## revenue 1.930e-07 2.535e-08 7.613 2.67e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 458.53 on 349 degrees of freedom
## Residual deviance: 204.16 on 348 degrees of freedom
## AIC: 208.16
##
## Number of Fisher Scoring iterations: 7
# finds critical value of t for 95% confidence interval of revenue
qt(p = .975, df = 348)
## [1] 1.966804
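Since the binomial GLM summary reports z values, the normal critical value is the more conventional choice for a Wald interval; with 348 degrees of freedom the two are nearly identical, so it makes no practical difference here:
# normal critical value for a 95% Wald interval, roughly 1.96
qnorm(p = .975)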
# MOE = t * SE(coef.)
MOE <- 1.966804 * 0.00000002535
# CI = coef. +/- MOE
CI_low <- 0.00000019297 - MOE
CI_high <- 0.00000019297 + MOE
CI_low
## [1] 1.431115e-07
CI_high
## [1] 2.428285e-07
This is hard to read on its own, so let’s scale it to millions of dollars as well.
CI_high * 1000000
## [1] 0.2428285
CI_low * 1000000
## [1] 0.1431115
Our 95% confidence interval for the revenue coefficient is (0.143 to 0.243) × 10^-6, or 0.143 to 0.243 once we scale to millions of dollars. Referring back to the 21.3% increase in the odds that a school is classified as FBS for every additional $1M in revenue, applying the same logic to these endpoints (raising e to each of them) gives a 95% confidence interval for that increase in the odds of roughly 15.4% to 27.5%.
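For reference, the same Wald interval (and its odds-ratio version) can be pulled from the model in a couple of lines. This is just a sketch using confint.default(), which gives the normal-approximation interval; plain confint() on a glm would give a profile-likelihood interval instead.
# Wald 95% CI for the revenue coefficient, rescaled to $1M units
ci <- confint.default(model, parm = "revenue", level = 0.95) * 1e6
ci
# implied percent increase in the odds of being FBS per additional $1M in revenue
(exp(ci) - 1) * 100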