library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggthemes)
library(patchwork)
data <- read_delim("./sports.csv", delim = ',')
## Rows: 2936 Columns: 28
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): institution_name, city_txt, state_cd, classification_name, classif...
## dbl (21): year, unitid, zip_text, classification_code, ef_male_count, ef_fem...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
We need to create binary column for our dataset because we don’t have one initially.
data$profit = data$total_rev_menwomen - data$total_exp_menwomen
data$binary_profit = ifelse(data$profit>0,1,0) # 1 if profit, 0 if no profit
This newly created binary profit variable is going to be our response variable. Our goal is to determine if our explanatory variables lead to profit or not. In other words, what’s the probability of getting into profit (success).
Explanatory variable I decide to use is total count of students.
So, let’s model using logistic regression.
logistic_model <- glm(binary_profit ~ ef_total_count,
data = data,family = binomial(link = 'logit'))
logistic_model$coefficients
## (Intercept) ef_total_count
## -2.805029e-01 -7.666357e-05
When we raise e to the power of the coefficient we get, the result is
0.99. That means every unit increase in total count, the odds of getting
a profit goes down very so slightly.
sigmoid <- \(x) 1 / (1 + exp(-(-0.2805 + (-0.0000766) * x)))
data |>
ggplot(mapping = aes(x = ef_total_count, y = binary_profit)) +
geom_jitter(width = 0, height = 0.1, shape = 'O', size = 3) +
geom_function(fun = sigmoid, color = 'blue', linewidth = 1) +
labs(title = "Modeling a Binary Response with Sigmoid") +
scale_y_continuous(breaks = c(0, 1)) +
theme_minimal()
## Warning: Removed 958 rows containing missing values or values outside the scale range
## (`geom_point()`).
As we increase the number of students, log odds of making a profit goes
slightly down.
summary(logistic_model)
##
## Call:
## glm(formula = binary_profit ~ ef_total_count, family = binomial(link = "logit"),
## data = data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.805e-01 5.899e-02 -4.755 1.98e-06 ***
## ef_total_count -7.666e-05 9.839e-06 -7.792 6.61e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2567.1 on 1977 degrees of freedom
## Residual deviance: 2476.0 on 1976 degrees of freedom
## (958 observations deleted due to missingness)
## AIC: 2480
##
## Number of Fisher Scoring iterations: 4
# let's analyze the standard error and build a C.I
confint(logistic_model)
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) -3.960481e-01 -1.647175e-01
## ef_total_count -9.695503e-05 -5.828965e-05
From our 95% C.I. we can see that it does not include zero for total count variable. It confirms that log odds of making a profit goes down slightly with each unit increase in total count of students.