Week 8 Data Dive - Regression Modeling

library(conflicted)  
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2

conflict_prefer("filter", "dplyr")

## [conflicted] Will prefer dplyr::filter over any other package.

conflict_prefer("lag", "dplyr")

## [conflicted] Will prefer dplyr::lag over any other package.

# load ncaa file I cleaned
ncaa <- read.csv("./ncaa_clean.csv", header = TRUE)

Select a continuous (or ordered integer) column of data that seems most “valuable” given the context of your data, and call this your response variable.

Response variable: expenses

Select a categorical column of data (explanatory variable) that you expect might influence the response variable

Explanatory variable: division

Anova Test

Null Hypothesis: Mean expenses are the same across the three Division I subdivisions (FBS, FCS, and Division I without football)

# total 2019 revenue and expenses (men + women) for each D1 institution
schools <- ncaa |>
  filter(year == 2019) |>
  filter(classification_code %in% c(1,2,3)) |>
  group_by(institution_name) |>
  summarise(revenue = sum(rev_men, na.rm = TRUE) + 
              sum(rev_women, na.rm = TRUE),
            expense = sum(exp_men, na.rm = TRUE) + 
              sum(exp_women, na.rm = TRUE),
            division = min(classification_name))

m <- aov(expense ~ division, data = schools)
summary(m)

##              Df    Sum Sq   Mean Sq F value Pr(>F)    
## division      2 1.079e+17 5.393e+16   193.1 <2e-16 ***
## Residuals   347 9.693e+16 2.793e+14                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
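
As a check on the table above, the F value and its p-value can be recomputed from the printed sums of squares and degrees of freedom. This is a minimal sketch that uses only the rounded numbers shown in the output, so it should agree with the table to rounding; the variable names are illustrative and not part of the original analysis.

```r
# Values copied from the summary(m) output above (rounded as printed)
ss_division <- 1.079e17
df_division <- 2
ss_residual <- 9.693e16
df_residual <- 347

# Mean square = sum of squares / degrees of freedom
ms_division <- ss_division / df_division
ms_residual <- ss_residual / df_residual

# F is the ratio of the between-group to the within-group mean square
f_value <- ms_division / ms_residual
f_value

# Upper-tail probability of an F(2, 347) distribution at the observed F
p_value <- pf(f_value, df_division, df_residual, lower.tail = FALSE)
p_value < 2e-16
```

The ratio comes out at about 193, in line with the F value reported by `summary(m)`.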

The F value of 193.1 is very large and the p-value is below 2e-16, so we reject the null hypothesis: if mean expenses were equal across these subsets of Division 1, differences this large would almost never occur by chance. The gap is also obvious visually:

schools |>
  ggplot() +
  geom_boxplot(mapping = aes(y = expense, x = division)) +
  labs(x = "Division Category",
       y = "Expenses",
       title = "Expenses among D1 institutions")

It may be obvious that the top D1 programs in the country stand out. The highly successful football and basketball programs likely have significantly higher expenses from travel, hosting more fans, extra perks for the athletic department, and other support for athletes. What is important, however, is that having a football program might not by itself carry much higher expenses than not having one. We can test this with another ANOVA test that excludes the FBS schools.

# same as above, but excluding FBS
schools_2 <- ncaa |>
  filter(year == 2019) |>
  filter(classification_code %in% c(2,3)) |>
  group_by(institution_name) |>
  summarise(revenue = sum(rev_men, na.rm = TRUE) + 
              sum(rev_women, na.rm = TRUE),
            expense = sum(exp_men, na.rm = TRUE) + 
              sum(exp_women, na.rm = TRUE),
            division = min(classification_name))

m <- aov(expense ~ division, data = schools_2)
summary(m)

##              Df    Sum Sq   Mean Sq F value Pr(>F)  
## division      1 1.261e+14 1.261e+14   3.468 0.0639 .
## Residuals   221 8.037e+15 3.637e+13                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The F value is now much smaller (3.468, down from 193.1), and the p-value of 0.0639 is above the conventional 0.05 threshold. We therefore cannot say with great confidence that there is a difference between FCS and non-football D1 program expenses, although the evidence is borderline. This nuance can be very important for different stakeholders in college athletics.
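
With only two groups, a one-way ANOVA is equivalent to a pooled-variance two-sample t-test: the F value is the square of the t statistic and both give the same p-value. The sketch below shows this using only the F value and residual degrees of freedom printed above; the variable names are illustrative.

```r
# Values copied from the second summary(m) output
f_value <- 3.468
df_residual <- 221

# p-value from the F(1, 221) distribution
p_from_f <- pf(f_value, 1, df_residual, lower.tail = FALSE)

# Equivalent two-sided p-value from a t distribution with 221 df
t_value <- sqrt(f_value)
p_from_t <- 2 * pt(t_value, df_residual, lower.tail = FALSE)

c(F_test = p_from_f, t_test = p_from_t)

# Is the difference significant at the 5% level?
p_from_f < 0.05
```

Both routes land on a p-value of roughly 0.064, which is why the result is borderline rather than clearly significant.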

Recruiting: if you’re looking for institutions that spend a lot on athletes and other supporting features, there might not be much of a difference between an FCS and a non-football program. Because of this, if spending is an important criterion for you, there are better factors to look at than a simple litmus test of whether or not a school has a football team. (As a current student athlete who has gone through the recruiting process and works every year recruiting athletes, I know virtually no recruit is aware of this.)

Changing conferences: the first test suggests that if you’re able to change conferences, specifically to get into the FBS subdivision, you will likely see a substantial increase in expenses to meet the requirements of being in that conference. Likewise, moving down might mean cutting back on expenses, which could come from anything from accommodating fewer fans at games to fewer perks for athletes and support staff.

Adding a football team: football teams have the largest rosters by far and are known for some of the highest expenses, from the massive number of scholarships to coaching salaries and travel. Despite this, total expenses at FCS institutions are not significantly different from those at D1 institutions without football. This might suggest that if a school is contemplating removing or adding a football team, especially at the expense of other men’s programs, expenditures might not change much (at least in the long run).

Linear Regression Model

Find a single continuous (or ordered integer, non-binary) column of data that might influence the response variable

Revenue

This choice is supported by evidence from a prior data dive.

model <- lm(expense ~ revenue, schools)
model$coefficients

##  (Intercept)      revenue 
## 4.950527e+06 7.466576e-01

schools |>
  ggplot() +
  geom_point(mapping = aes(x = revenue, y = expense)) +
  labs(title = "Expense to Revenue for D1 Institutions (2019)",
       x = "Revenues", y = "Expenses") +
  geom_abline(intercept = 4950527, slope = 0.7466576, color = "red") +
  geom_abline(intercept = 0, slope = 1, color = "orange") +
  theme_classic()

The fit looks reasonable, but right away you can tell something is missing. Many schools fall close to a straight line, emphasized by the orange y = x line I added, where expenses equal revenues. Beyond that cluster, as revenues keep rising, expenses begin to taper off. The red line is the least-squares line for the entire data set, but I think the fit could be improved by fitting a separate linear regression for each division type or by using another functional form, perhaps a logarithmic one.

Our slope of approximately 0.75 means that each additional dollar of revenue is associated with about 75 cents of additional expenses, on top of a baseline (intercept) of roughly $4.95 million. However, I will break our data down further because I don’t think I can give a fair representation while aggregating all divisions together.
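
To make the coefficients concrete, here is a small sketch that plugs the printed intercept and slope into the fitted line. `predict_expense` and `break_even` are hypothetical helper names, and the $10 million revenue figure is just an example input, not a value from the data.

```r
# Coefficients copied from model$coefficients above
intercept <- 4950527
slope <- 0.7466576

# Predicted expenses for a given revenue under the pooled model
predict_expense <- function(revenue) intercept + slope * revenue

# Prediction for a school with $10 million in revenue
predict_expense(10e6)

# One more million in revenue adds slope * 1e6 to predicted expenses
predict_expense(11e6) - predict_expense(10e6)

# Revenue at which predicted expenses equal revenue, i.e. where the
# red regression line crosses the orange y = x line
break_even <- intercept / (1 - slope)
break_even
```

Under this pooled fit the break-even point is roughly $19.5 million: below it the model predicts expenses above revenue, and above it expenses below revenue.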

FBS <- schools |> 
  filter(division == 'NCAA Division I-FBS')
FCS <- schools |> 
  filter(division == 'NCAA Division I-FCS')
NFB <- schools |> 
  filter(division == 'NCAA Division I without football')

lm(expense ~ revenue, FBS)

## 
## Call:
## lm(formula = expense ~ revenue, data = FBS)
## 
## Coefficients:
## (Intercept)      revenue  
##   1.224e+07    6.622e-01

lm(expense ~ revenue, FCS)

## 
## Call:
## lm(formula = expense ~ revenue, data = FCS)
## 
## Coefficients:
## (Intercept)      revenue  
##   1.283e+06    9.150e-01

lm(expense ~ revenue, NFB)

## 
## Call:
## lm(formula = expense ~ revenue, data = NFB)
## 
## Coefficients:
## (Intercept)      revenue  
##   1.310e+06    9.027e-01

schools |>
  ggplot() +
  geom_point(mapping = aes(x = revenue, y = expense, color = as.factor(division))) +
  labs(title = "Expense to Revenue for D1 Institutions (2019)",
       x = "Revenues", y = "Expenses") +
  geom_abline(intercept = 4950527, slope = 0.7466576, color = "black") +
  geom_abline(intercept = 12240000, slope = 0.6622, color = "red") +
  geom_abline(intercept = 1283000, slope = 0.915, color = "blue") +
  geom_abline(intercept = 1310000, slope = 0.9027, color = "green") +
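  # name each color so the points match their division's regression line
  scale_color_manual(values = c("NCAA Division I-FBS" = "red",
                                "NCAA Division I-FCS" = "blue",
                                "NCAA Division I without football" = "green")) +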
  labs(color = 'division') + 
  theme_classic()

After separating the institutions by division, the linear regressions appear to fit much better. Visually, each line follows its own division’s points more closely than the pooled line did, and we can see how our initial regression model was heavily influenced by FBS schools.
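
One way to back up the visual impression is to fit all three lines in a single model with a revenue-by-division interaction and compare it to the pooled line with an F-test. The sketch below runs on synthetic data that is made up purely for illustration (it is not the NCAA data, and the slopes in it are invented), so only the technique carries over, not the numbers.

```r
# Synthetic data for illustration only; this is NOT the NCAA data
set.seed(42)
toy <- data.frame(
  division = rep(c("FBS", "FCS", "NFB"), each = 30),
  revenue  = runif(90, min = 1e6, max = 5e7)
)
made_up_slopes <- c(FBS = 0.66, FCS = 0.92, NFB = 0.90)  # invented values
toy$expense <- 1e6 + unname(made_up_slopes[toy$division]) * toy$revenue +
  rnorm(90, sd = 1e6)

# One line for everyone vs. a separate intercept and slope per division
pooled      <- lm(expense ~ revenue, data = toy)
by_division <- lm(expense ~ revenue * division, data = toy)

# The interaction model reproduces the slope of a separate per-division fit
fcs_only <- lm(expense ~ revenue, data = subset(toy, division == "FCS"))
b <- coef(by_division)
fcs_slope_interaction <- unname(b["revenue"] + b["revenue:divisionFCS"])
all.equal(fcs_slope_interaction, unname(coef(fcs_only)["revenue"]))

# F-test: do the per-division lines improve on the single pooled line?
comparison <- anova(pooled, by_division)
comparison
```

On the real data the same comparison would be `anova(lm(expense ~ revenue, data = schools), lm(expense ~ revenue * division, data = schools))`; a small p-value there would support the claim that the per-division lines fit better.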

Using the new slopes, each additional dollar of revenue at an FBS school is associated with only about 66 cents of additional spending, while at all other D1 schools it is associated with over 90 cents. This might suggest that there are diminishing returns to being at a high-revenue school. Maybe there are only so many benefits an athletic program can use, or maybe there are only so many benefits athletes are permitted to have under NCAA rules. Non-FBS programs, by contrast, spend almost every additional dollar they bring in. This would suggest that these schools haven’t yet reached this “threshold” of maximum benefits for their athletes, so going to a higher-revenue school would likely mean more spending that directly benefits your program.

This may also suggest that an organization like the NCAA, if it is interested in maximizing student-athlete welfare, could “tax” high-income schools through more profit sharing distributed to other institutions. This might make funds going toward student athletes more efficient, which is especially important given the NCAA’s non-profit classification.

Week 8 Data Dive

Thomas Reedy

2024-10-20
