Generalized Linear Models Data Dive Week 10

Load Data and Libraries

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(ggthemes)
library(patchwork)

data <- read_delim("./sports.csv", delim = ',')

## Rows: 2936 Columns: 28
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): institution_name, city_txt, state_cd, classification_name, classif...
## dbl (21): year, unitid, zip_text, classification_code, ef_male_count, ef_fem...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

We need to create binary column for our dataset because we don’t have one initially.

data$profit = data$total_rev_menwomen - data$total_exp_menwomen
data$binary_profit = ifelse(data$profit>0,1,0) # 1 if profit, 0 if no profit

This newly created binary profit variable is going to be our response variable. Our goal is to determine if our explanatory variables lead to profit or not. In other words, what’s the probability of getting into profit (success).

Explanatory variable I decide to use is total count of students.

So, let’s model using logistic regression.

Logistic Regression Model

logistic_model <- glm(binary_profit ~ ef_total_count, 
                      data = data,family = binomial(link = 'logit'))
logistic_model$coefficients

##    (Intercept) ef_total_count 
##  -2.805029e-01  -7.666357e-05

When we raise e to the power of the coefficient we get, the result is 0.99. That means every unit increase in total count, the odds of getting a profit goes down very so slightly.

sigmoid <- \(x) 1 / (1 + exp(-(-0.2805 + (-0.0000766) * x)))

data |>
  ggplot(mapping = aes(x = ef_total_count, y = binary_profit)) +
  geom_jitter(width = 0, height = 0.1, shape = 'O', size = 3) +
  geom_function(fun = sigmoid, color = 'blue', linewidth = 1) +
  labs(title = "Modeling a Binary Response with Sigmoid") +
  scale_y_continuous(breaks = c(0, 1)) +
  theme_minimal()

## Warning: Removed 958 rows containing missing values or values outside the scale range
## (`geom_point()`).

As we increase the number of students, log odds of making a profit goes slightly down.

summary(logistic_model)

## 
## Call:
## glm(formula = binary_profit ~ ef_total_count, family = binomial(link = "logit"), 
##     data = data)
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    -2.805e-01  5.899e-02  -4.755 1.98e-06 ***
## ef_total_count -7.666e-05  9.839e-06  -7.792 6.61e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2567.1  on 1977  degrees of freedom
## Residual deviance: 2476.0  on 1976  degrees of freedom
##   (958 observations deleted due to missingness)
## AIC: 2480
## 
## Number of Fisher Scoring iterations: 4

Confidence Interval

# let's analyze the standard error and build a C.I
confint(logistic_model)

## Waiting for profiling to be done...

##                        2.5 %        97.5 %
## (Intercept)    -3.960481e-01 -1.647175e-01
## ef_total_count -9.695503e-05 -5.828965e-05

From our 95% C.I. we can see that it does not include zero for total count variable. It confirms that log odds of making a profit goes down slightly with each unit increase in total count of students.

Week10 GLM Part One

Arsh Singh

2024-11-25

Generalized Linear Models Data Dive Week 10

Load Data and Libraries

Logistic Regression Model

Confidence Interval