I’m going to fit the linear model I plan to use for my final project.
library(conflicted)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
conflict_prefer("filter", "dplyr")
## [conflicted] Will prefer dplyr::filter over any other package.
conflict_prefer("lag", "dplyr")
## [conflicted] Will prefer dplyr::lag over any other package.
ncaa <- read.csv("./ncaa_clean.csv", header = TRUE)
# add an indicator column: 1 if the team has any male participants, 0 otherwise
ncaa$men <- ifelse(ncaa$sum_partic_men > 0, 1, 0)
# add profit: total revenue minus total expenses across men's and women's teams
ncaa <- ncaa |>
  mutate(profit = rowSums(across(c(rev_men, rev_women)), na.rm = TRUE) -
           rowSums(across(c(exp_men, exp_women)), na.rm = TRUE))
compare <- ncaa |>
  filter(classification_code %in% c(1, 2, 3)) |>
  arrange(institution_name, men, sports, year) |>
  mutate(revenue_change_men = (rev_men - lag(rev_men)) / lag(rev_men),
         revenue_change_women = (rev_women - lag(rev_women)) / lag(rev_women),
         expense_change_men = (exp_men - lag(exp_men)) / lag(exp_men),
         expense_change_women = (exp_women - lag(exp_women)) / lag(exp_women),
         partic_men_change = (sum_partic_men - lag(sum_partic_men)) /
           lag(sum_partic_men),
         partic_women_change = (sum_partic_women - lag(sum_partic_women)) /
           lag(sum_partic_women),
         student_change = (ef_total_count - lag(ef_total_count)) /
           lag(ef_total_count),
         profit_change = (profit - lag(profit))
         # this one is a number, not a percentage
  )
compare_change <- compare |>
  filter(year > 2015) |>
  mutate(rev_change = rowSums(across(c(revenue_change_men,
                                       revenue_change_women)), na.rm = TRUE),
         exp_change = rowSums(across(c(expense_change_men,
                                       expense_change_women)), na.rm = TRUE),
         athlete_change = rowSums(across(c(partic_men_change,
                                           partic_women_change)), na.rm = TRUE)
  ) |>
  reframe(year, institution_name, classification_code, sports, profit,
          rev_change, exp_change, athlete_change, student_change, profit_change)
compare_exp <- compare_change |>
  group_by(year, institution_name) |>
  summarise(avg_revsport = mean(exp_change[sports %in% c("Football")]),
            avg_other_sports = mean(exp_change[!sports %in%
                                                 c("Football", "Basketball")]),
            avg_rev = mean(rev_change[!sports %in%
                                        c("Football", "Basketball")])
  )
## `summarise()` has grouped output by 'year'. You can override using the
## `.groups` argument.
Above is the background code that builds the data we want to analyze. Below are the visualization and the model relating the explanatory variable, revenue from football teams, to the response variable, average growth in expenses of non-revenue sports.
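Before fitting anything, a quick structural check of the summarised data can confirm the grouping worked as intended (a minimal sketch; glimpse() comes from the already-attached tidyverse):
# inspect the per-year, per-institution summary used for modeling
glimpse(compare_exp)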
# data the model is based on
# removes some NA values for analysis; doesn't change the model results
compare_exp1 <- compare_exp |>
  filter(!is.na(avg_revsport))
compare_exp1 |>
  ggplot() +
  geom_point(mapping = aes(x = avg_revsport, y = avg_other_sports))
model <- lm(avg_other_sports ~ avg_revsport, compare_exp1)
summary(model)
##
## Call:
## lm(formula = avg_other_sports ~ avg_revsport, data = compare_exp1)
##
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.42087 -0.06190 -0.00769  0.04366  1.78702 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.030745   0.004072   7.550 9.77e-14 ***
## avg_revsport 0.201731   0.021059   9.579  < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1218 on 1001 degrees of freedom
## Multiple R-squared: 0.08397, Adjusted R-squared: 0.08306
## F-statistic: 91.76 on 1 and 1001 DF, p-value: < 2.2e-16
library(boot)
library(broom)
library(lindia)
The adjusted R-squared we get is pretty small, only 0.083. This means not much of the change in non-revenue sports' expense growth is explained by football, but some of it certainly is. I tried adding other explanatory variables, such as change in revenues, but all they did was lower the adjusted R-squared.
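The adjusted R-squared can also be pulled out programmatically with broom, which is loaded above, and the kind of multi-variable model mentioned can be sketched as follows (using avg_rev as the extra explanatory variable is my assumption about which variable was tried):
# model-level statistics, including adjusted R-squared, as a one-row tibble
glance(model)
# sketch of a two-predictor model: does adding avg_rev improve the fit?
model2 <- lm(avg_other_sports ~ avg_revsport + avg_rev, data = compare_exp1)
summary(model2)$adj.r.squared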
gg_resfitted(model) +
geom_smooth(se=FALSE)
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
Here we can see that the variance across the fitted values is alright. There might be a little heteroscedasticity, but for the most part the data looks like it's being fitted correctly.
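If we wanted a formal check on that heteroscedasticity, a Breusch-Pagan test is one option (a sketch; it assumes the lmtest package is installed, which isn't loaded elsewhere here):
# Breusch-Pagan test: a small p-value would suggest non-constant residual variance
lmtest::bptest(model)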
gg_reshist(model)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
This looks pretty good; the residuals roughly follow a normal distribution. There appears to be a slightly longer tail on the right side, but this isn't terribly surprising considering how low our R-squared is. There are “missing” explanatory variables here.
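A Q-Q plot is another quick way to judge that right tail (a base-R sketch; if available, lindia's gg_qqplot() would give a ggplot equivalent):
# residual Q-Q plot: points rising above the line on the right indicate the longer right tail
qqnorm(residuals(model))
qqline(residuals(model))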
Overall, there doesn’t seem to be much of an issue with the model. Since this is a single-variable model, some of this analysis is straightforward. No other diagnostic I tried provided further insight into the model, and, as mentioned before, adding extra variables didn’t reveal any additional information. The only minor issues are the “missing” explanatory variables and the fact that most of the data is clumped together at small annual revenue changes, but both of these are expected, and neither appears to significantly impact the model.
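As one example of the kind of additional diagnostic that could be checked, Cook's distance flags unusually influential observations (a base-R sketch):
# Cook's distance for each observation; a few very large values would flag influential points
plot(model, which = 4)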
The model predicts that when football revenue doesn’t change from the prior year, non-revenue sports’ expenses should increase by about 3%; that is the intercept. The slope of 0.2 predicts that for every one-percentage-point increase in football revenue, non-revenue sports’ expenses will increase by about 0.2 percentage points. So if football revenue increases by 10%, non-revenue sports are predicted to increase by 3% + 0.2(10%) = 5%.
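The same prediction can be read straight off the fitted model (a minimal sketch; the 10% football revenue change is just an example value):
# predicted non-revenue expense growth when football revenue grows 10% (0.10)
predict(model, newdata = data.frame(avg_revsport = 0.10))
# roughly 0.0307 + 0.2017 * 0.10, i.e. about 0.051, or a 5% increase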
This is neat, as it shows that non-revenue sports are much more consistent when it comes to annual spending, or how much funding athletes on those teams get. It also shows that these teams are growing their benefits fairly consistently. Revenue for football teams, on the other hand, has significantly more variation, but we don’t know how their expenses vary.