library(conflicted)  
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2
conflict_prefer("filter", "dplyr")
## [conflicted] Will prefer dplyr::filter over any other package.
conflict_prefer("lag", "dplyr")
## [conflicted] Will prefer dplyr::lag over any other package.

Interactions Between Members of the NCAA

Background Information

Link to original data: https://github.com/rfordatascience/tidytuesday/blob/master/data/2022/2022-03-29/readme.md

This data contains athletic program information for every postsecondary institution for the years 2015-2019. Additional documentation can be found via the link as well.

When choosing this data, I was curious how different members of the NCAA impact each other. Being a Division I athlete myself, it’s common to hear ideas such as revenue sharing, where revenue sports like basketball and football share excess revenues with non-revenue sports. This is the story of how most NCAA programs get their funding. It’s also common to root for the success of your basketball and football teams, because the more revenue they make from doing well in games, the more money they have to share with your sport. This gave me a few questions I wanted to find an answer to…

My Questions

  1. How many additional opportunities do athletes get from schools with more successful revenue sports?
  2. How much do athletes and teams in non-revenue sports benefit from the increased success of revenue sports?

In other words, I am generally trying to figure out how much non-revenue sports are impacted by the success of their institution’s revenue sports.

Analysis

EDA

We can start by taking a look at our data. To meet requirements, I limited the CSV file to include only sports from the NCAA. Here’s what this data looks like:

ncaa_original <- read.csv('ncaa_sports_2.csv', header=TRUE)
colnames(ncaa_original)
##  [1] "year"                 "unitid"               "institution_name"    
##  [4] "city_txt"             "state_cd"             "zip_text"            
##  [7] "classification_code"  "classification_name"  "classification_other"
## [10] "ef_male_count"        "ef_female_count"      "ef_total_count"      
## [13] "sector_cd"            "sector_name"          "sportscode"          
## [16] "partic_men"           "partic_women"         "partic_coed_men"     
## [19] "partic_coed_women"    "sum_partic_men"       "sum_partic_women"    
## [22] "rev_men"              "rev_women"            "total_rev_menwomen"  
## [25] "exp_men"              "exp_women"            "total_exp_menwomen"  
## [28] "sports"

When looking at the documentation, not much additional information can be gleaned. Most columns are self-explanatory: partic_men is how many participating men are on the team, rev_women is how much revenue the women’s team made, and sports is the sport the row describes. Some slightly less obvious columns are the ef_ counts, which record how many students attend the university, and unitid, which is the university’s unique ID. A few questions arise the deeper you look into these, and some of those concerns will be mentioned soon.
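
As a quick sanity check, a short sketch like this (using dplyr’s glimpse(), with the file loaded as above) shows each column’s type alongside a few sample values:

# peek at the types and sample values of a handful of columns;
# glimpse() is attached with the tidyverse above
ncaa_original |>
  select(year, unitid, institution_name, classification_name,
         sports, sum_partic_men, rev_men, exp_men) |>
  glimpse()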

Cleaning the data

However, there is a problem with the data as it’s currently formatted: men and women are included in the same row. This is misleading because even though the “sports” share a name, take Basketball as an example, they aren’t the same sport. Men’s Basketball and Women’s Basketball have different rules and program structures, so even though the data puts them together in the same row, this is an inappropriate join.

There were also a lot of empty rows: rows showing sports with zero athletes, revenues, and expenses. These had to be removed as well.

I created a cleaning file to get rid of these problems, which you can find here: https://rpubs.com/tbreedy/1222442.

This removed the empty rows and split men’s and women’s sports, with their accompanying revenues, expenses, etc., into separate rows. We now have a data frame we can work with.
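
For reference, here is a minimal sketch of the reshaping idea; the linked cleaning file is the authoritative version, and this condenses it to the two steps described above:

# minimal sketch of the cleaning approach: reshape the paired men/women
# columns into one row per gender, then drop rows with no athletes,
# revenue, or expenses
ncaa_sketch <- ncaa_original |>
  pivot_longer(
    cols = c(sum_partic_men, sum_partic_women,
             rev_men, rev_women, exp_men, exp_women),
    names_to = c(".value", "gender"),
    names_pattern = "(sum_partic|rev|exp)_(men|women)"
  ) |>
  filter(coalesce(sum_partic, 0) > 0 |
           coalesce(rev, 0) > 0 |
           coalesce(exp, 0) > 0)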

ncaa <- read.csv("./ncaa_clean.csv", header = TRUE)

Important Insights

New Programs

Additions of new programs can make our data a bit interesting when doing year-over-year comparisons. This code shows how often institutions created or cut certain programs, or created or cut their entire athletic departments.

program_count <- ncaa |>
  filter(classification_code %in% c(1,2,3)) |>
  group_by(institution_name, sports) |>
  summarise(count = n())
## `summarise()` has grouped output by 'institution_name'. You can override using
## the `.groups` argument.
program_change <- program_count |>
  filter(count %% 5 != 0)

program_change |>
  group_by(sports) |>
  summarise(change = n()) |>
  ggplot() +
  geom_point(mapping = aes(x = sports, y=change)) + 
  geom_text(mapping = aes(x = sports, y = change, label=sports), 
            size = 3, nudge_y = 2, check_overlap = TRUE) +
  theme(axis.text.x = element_blank()) +
  labs(x = "sports",
       y = "Frequency a sport wasn't offered every year")

We can see that virtually every sport has some sort of change during this time period, likely reflecting the fact that institutions either adopted an athletic program or upgraded from a different division. This also poses an interesting question: how do we filter for year over year comparisons? This will be explained in detail later.
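
As a preview, one simple option (sketched here with the program_count table from above) is to restrict year-over-year comparisons to programs that appear in every year of the data:

# sketch: keep only institution-sport pairs present in all five years,
# mirroring the modulo check used above (counts are multiples of 5 when
# a program was reported every year)
complete_programs <- program_count |>
  filter(count %% 5 == 0)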

Incorrect Number of Athletes

Some of the data I wanted to analyze involves metrics on a “per athlete” basis. The concern here is that this question is more complicated than you might first think. How many athletes are on a program? The number changes throughout the season. Additions are uncommon: people transferring mid-season, students beginning attendance in the spring semester, or an existing student being recruited onto the team. Removals, though, are almost inevitable; athletes quit, get cut, and transfer relatively frequently. Further inquiry into the documentation shows that athlete counts are supposed to represent the number of athletes on the team as of the first day of competition, but we’ll see below that this isn’t what the data reflects.

# filters for all D1 football schools and gets the "total number of men" each year
ncaa_football_d1 <- ncaa |>
  filter(sports == 'Football') |>
  filter(classification_code == 1 | classification_code == 2 | 
           classification_code == 3) |>
  group_by(institution_name, year) |>
  summarise(fb_players = sum(sum_partic_men))
## `summarise()` has grouped output by 'institution_name'. You can override using
## the `.groups` argument.
ncaa_football_d1 |>
  ggplot() +
  geom_histogram(mapping = aes(x = fb_players)) +
  labs(title = "Total Number of Male Athletes on D1 Football Teams",
       x = "Total Number of Male Athletes per Year",
       y = "Frequency") +
  geom_vline(xintercept = 120, color='red') +
  theme_classic()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

During this time frame, the number of athletes allowed on a Division I football team was 120; this is where the red line is drawn. We’d expect the histogram to peak around this number or slightly below, which happens to some extent, but much less than expected. We also see occurrences to the right of this line, which should be impossible.

It’s implausible that teams are systematically exceeding the roster limit. Division I football teams are highly regulated; violations like these don’t just happen, let alone affect over 30% of teams. I contacted the Office of Postsecondary Education, the government organization responsible for these figures, and inquired about the reasoning for this error. They confirmed to me that these numbers should be accurate, and gave me additional documentation to look at.

This leaves us in a pickle. The data contains values over 120, so although I was told these numbers should be correct, that isn’t the case. My guess is that these errors crept into the data in one of two ways:

  1. These datapoints were collected in a “survey.” There was no elaboration on what this survey entailed, so there’s nothing I can point my finger at and blame as the culprit, but there could be errors in how the data was extracted.
  2. Answering the wrong question. Although this data is supposed to show the number of athletes as of the first day of competition, a different number could have been entered, such as the number of athletes at the start of the academic year. It’s common for sports like football to cut athletes before the first day of competition to stay in compliance with the NCAA, so these counts might have been recorded before cuts happened.

Either way, this suggests a possible systematic problem in our data. Worse yet, it isn’t one we can exactly correct for. We can cap reported counts at the NCAA limit, but that wouldn’t fix institutions whose reported counts fall below the threshold yet are still higher than their actual rosters. If certain universities are more likely to have over-reported, this could skew comparisons between groups. And there’s nothing to suggest this problem isn’t prevalent in other sports.
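
To make the partial fix concrete, a capping sketch would look like the following, though as noted it only trims the visible over-reports:

# sketch of the partial fix: cap reported football rosters at the
# 120-athlete limit discussed above; counts that are inflated but still
# under the cap are left untouched, which is exactly the problem
ncaa_football_capped <- ncaa_football_d1 |>
  mutate(fb_players_capped = pmin(fb_players, 120))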

To make matters worse, finding these errors is a near-impossible task. Football is pretty much the only sport with a roster cap; other sports just have scholarship caps. So while a program like basketball had a limited number of scholarships to give, it had an unlimited roster that could be filled with walk-ons. Is a basketball team larger than it should be, or does the school just have a lot of walk-ons? That’s not something we can find with our data, nor something we can meaningfully correct for.

I don’t think “per athlete” statistics are useless; instead, they need to be taken with a grain of salt. P-values for these statistics will need to be significantly smaller than they otherwise would, and this shouldn’t be used to prove anything by itself. Including additional context would be critical in presenting results.
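
For illustration, a per-athlete expense metric would be computed like this sketch, with all of the caveats above attached:

# sketch of a per-athlete metric: expenses per reported athlete, using
# the possibly over-reported participation counts
ncaa |>
  filter(classification_code %in% c(1, 2, 3)) |>
  mutate(athletes = coalesce(sum_partic_men, 0) +
                    coalesce(sum_partic_women, 0),
         expense  = coalesce(exp_men, 0) + coalesce(exp_women, 0)) |>
  filter(athletes > 0) |>
  mutate(exp_per_athlete = expense / athletes) |>
  select(year, institution_name, sports, exp_per_athlete) |>
  head()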

Relationships Between Revenue, Sport, Expenses, and Division Classification

ncaa |>
  filter(classification_code %in% c(1,2,3)) |>
  group_by(sports) |>
  summarise(revenue = sum(rev_men, na.rm=TRUE)
            + sum(rev_women, na.rm=TRUE)) |>
  ggplot() +
  geom_point(mapping = aes(x = sports, y = revenue / 5)) + 
  geom_text(mapping = aes(x = sports, y = revenue / 5 + 150000000, label=sports), 
            size = 3, nudge_y = 0, check_overlap = TRUE) +
  theme(axis.text.x = element_blank()) +
  labs(x = "Sport",
       y = "Total Revenue per Year (Dollars)",
       title = "Revenue by Sport") +
  theme(
    panel.background = element_rect(fill = "white", color = NA),
    panel.grid.minor = element_blank(),
    panel.grid.major = element_blank(),
  )

This is why Football and Basketball are considered “revenue sports”: they make up the vast majority of all of the NCAA’s revenue. Most sports make virtually nothing; these are the non-revenue sports, and most of them traditionally rely on external funding from revenue sharing, endowments, fundraising, and other financing options to cover their expenses.

ncaa |>
  filter(classification_code %in% c(1,2,3)) |>
  group_by(sports) |>
  summarise(athletes = sum(sum_partic_men, na.rm=TRUE)
            + sum(sum_partic_women, na.rm=TRUE)) |>
  ggplot() +
  geom_point(mapping = aes(x = sports, y = athletes / 5)) + 
  geom_text(mapping = aes(x = sports, y = athletes / 5 + 2000, label=sports), 
            size = 3, nudge_y = 0, check_overlap = TRUE) +
  theme(axis.text.x = element_blank()) +
  labs(x = "sports",
       y = "Number of Athletesr",
       title = "Number of Athletes per Sport per Year") +
  theme(
    panel.background = element_rect(fill = "white", color = NA),
    panel.grid.minor = element_blank(),
    panel.grid.major = element_blank(),
  )

The number of athletes also varies widely, and it isn’t necessarily correlated with revenue. Football is still high on the list, but track and field has by far the most athletes. Although most sports are still relatively small, there is much less inequality here than there was in revenue.

# get total revenue, expenses, and athlete counts per school (2019)
schools <- ncaa |>
  filter(year == 2019) |>
  filter(classification_code %in% c(1,2,3)) |>
  group_by(institution_name) |>
  summarise(revenue = sum(rev_men, na.rm = TRUE) + 
              sum(rev_women, na.rm = TRUE),
            expense = sum(exp_men, na.rm = TRUE) + 
              sum(exp_women, na.rm = TRUE),
            athletes = sum(sum_partic_men, na.rm=TRUE) + 
                        sum(sum_partic_women, na.rm=TRUE),
            division = min(classification_name))
schools |>
  ggplot() +
  geom_boxplot(mapping = aes(y = expense, x = division)) +
  labs(x = "Division Category",
       y = "Expenses",
       title = "Expenses among D1 institutions")

schools |>
  ggplot() +
  geom_boxplot(mapping = aes(y = revenue, x = division)) +
  labs(x = "Division Category",
       y = "Revenue",
       title = "Revenue among D1 institutions")

schools |>
  ggplot() +
  geom_boxplot(mapping = aes(y = athletes, x = division)) +
  labs(x = "Division Category",
       y = "Athletes",
       title = "Athletes among D1 institutions")

Division Crash Course: FBS schools are your biggest D1 football institutions, FCS schools are your smaller D1 football institutions, and the remaining category is D1 schools without football. These are represented as 1, 2, and 3 in classification_code. We see pretty clearly that having a bigger football program comes with significantly more revenue and expenses, but not many more athletes. Conversely, having a less successful football program doesn’t bring much more revenue than having no football program at all, but those schools do have more athletes.
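
We can also confirm the code-to-name mapping directly from the data:

# check how classification codes map to division names
ncaa |>
  filter(classification_code %in% c(1, 2, 3)) |>
  distinct(classification_code, classification_name) |>
  arrange(classification_code)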

Non-Revenue Sports

nonrev <-  ncaa |>
  filter(year == 2019) |>
  filter(classification_code %in% c(1,2,3)) |>
  filter(!(sports %in% c('Football', 'Basketball'))) |>
  mutate(revenue = rowSums(across(c(rev_men, rev_women)), na.rm = TRUE),
         expense = rowSums(across(c(exp_men, exp_women)), na.rm = TRUE),
         profit = revenue - expense)
nonrev |>
  ggplot() +
  geom_point(mapping = aes(x = revenue, y = expense)) +
  labs(title = "Revenues and Expenses for Non-Revenue D1 Sports (2019)",
       x = "Revenues (Dollars)", y = "Expenses (Dollars)") +
  geom_abline(intercept = 0, slope = 1, color = "red") +
  theme_classic()

Hypothesis Testing

Does the success of revenue sports benefit non-revenue sports?

As I mentioned before, it’s common to hear that non-revenue sporting programs are kept afloat by the success of revenue sports. I want to test whether this is actually the case. If a revenue sport increases its revenues, and that sport does “share” excess revenues with other sporting programs, we might expect to see an increase in the revenue of non-revenue programs during the season of success, or in the season after.

How this revenue sharing is reflected in our data isn’t the clearest. It’s certain that some of the money coming to non-revenue sports comes, at least in part, from shared pots. I assume the expenses of revenue sports will include some of this sharing, but they might not. Because of this, I ended up testing both the revenue and the profit of revenue sports.

Null Hypothesis:

  • The change in monetary benefit for an institution’s non-revenue sports is the same whether the institution’s revenue sports exceed or fall short of prior benchmarks.

I’ll test this null hypothesis by comparing schools whose football programs had large increases in revenue against those with smaller increases or decreases, then checking whether the revenues of both segments’ non-revenue sports changed in the same way.

Note: an error was made in a variable when reformatting this document, so not all manually inputted numbers are up to date. The overall conclusions did not change, but I will need to fix this.

Preparing the df for hypothesis testing

# adds a column flagging whether the row is a men's (1) or women's (0) team
ncaa$men <- ifelse(ncaa$sum_partic_men > 0, 1, 0)
# adds profit
ncaa <- ncaa |>
  mutate(profit = rowSums(across(c(rev_men, rev_women)), na.rm = TRUE) -
                  rowSums(across(c(exp_men, exp_women)), na.rm = TRUE) )
compare <- ncaa |>
  filter(classification_code %in% c(1,2,3)) |>
  arrange(institution_name, men, sports, year) |>
  mutate(revenue_change_men = (rev_men - lag(rev_men))/lag(rev_men),
         revenue_change_women = (rev_women - lag(rev_women))/lag(rev_women),
         expense_change_men = (exp_men - lag(exp_men))/lag(exp_men),
         expense_change_women = (exp_women - lag(exp_women))/lag(exp_women),
         partic_men_change = (sum_partic_men - lag(sum_partic_men))/
                              lag(sum_partic_men),
         partic_women_change = (sum_partic_women - lag(sum_partic_women))/
                              lag(sum_partic_women),
         student_change = (ef_total_count - lag(ef_total_count)) /
                            lag(ef_total_count),
         profit_change = (profit - lag(profit))
         # this one is a number, not percentage
  )
compare |>
  filter(institution_name == 'The University of Alabama') |>
  filter(sports == 'Basketball') |>
  reframe(exp_women, expense_change_women)
##    exp_women expense_change_women
## 1         NA                   NA
## 2         NA                   NA
## 3         NA                   NA
## 4         NA                   NA
## 5         NA                   NA
## 6    3550880           0.45494298
## 7    4042959           0.13857945
## 8    3737069          -0.07565993
## 9    3857300           0.03217254
## 10   3805848          -0.01333886

We can see that this code works; the only issue is that each series’ 2015 row computes its change from the previous series’ 2019 entry, since lag() isn’t grouped. When we do our analysis, we’ll just need to filter out the 2015 data. As well, I’ll create new columns that drop the men-vs-women split for easier comparison.
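
The wrap-around happens because lag() above runs over the whole arranged data frame rather than within each series. A grouped version, sketched below for one column, would turn each series’ first year into NA automatically:

# sketch: grouping before lag() keeps each change within a single
# institution-gender-sport series, so 2015 becomes NA instead of
# borrowing the previous series' 2019 value
compare_grouped <- ncaa |>
  filter(classification_code %in% c(1, 2, 3)) |>
  arrange(institution_name, men, sports, year) |>
  group_by(institution_name, men, sports) |>
  mutate(revenue_change_men = (rev_men - lag(rev_men)) / lag(rev_men)) |>
  ungroup()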

I also might want to build something that compares everything to the mean, but we may get to that later.

compare_change <- compare |>
  filter(year > 2015) |>
  mutate(rev_change = rowSums(across(c(revenue_change_men, 
                                       revenue_change_women)), na.rm = TRUE),
         exp_change = rowSums(across(c(expense_change_men, 
                                       expense_change_women)), na.rm = TRUE),
         athlete_change = rowSums(across(c(partic_men_change, 
                                        partic_women_change)), na.rm = TRUE)
         ) |>
  reframe(year, institution_name, classification_code, sports, profit,
          rev_change, exp_change, athlete_change, student_change, profit_change)
compare_rev <- compare_change |>
  group_by(year, institution_name) |>
  summarise(avg_revsport = mean(rev_change[sports %in% 
                                        c("Football")]),
            avg_other_sports = mean(rev_change[!sports %in% 
                                        c("Football", "Basketball")])
            )
## `summarise()` has grouped output by 'year'. You can override using the
## `.groups` argument.

If in 2016 there was an x% change in football revenue, I want to see what the change in revenue for other sports was in 2016 and again in 2017. This means I can use four years of data for the same-year comparison and three years of data for the trailing comparison.

compare_rev |>
  filter(year > 2015) |>
  ggplot() +
  geom_point(mapping = aes(x = avg_revsport, y=avg_other_sports))
## Warning: Removed 390 rows containing missing values or values outside the scale range
## (`geom_point()`).

compare_rev2 <- compare_rev |>
  arrange(institution_name, year) |>
  group_by(institution_name) |>
  mutate(avg_other = lead(avg_other_sports)) |>
  filter(year < 2019)
  

compare_rev2 |>
  filter(year > 2016) |>
  ggplot() +
  geom_point(mapping = aes(x = avg_revsport, y=avg_other))
## Warning: Removed 198 rows containing missing values or values outside the scale range
## (`geom_point()`).

I’m now going to split the data into two groups: higher-than-average growth and lower-than-average growth. My null hypothesis says these should look the same.

avg_growth <- mean(compare_rev$avg_revsport, na.rm = TRUE)
compare_rev$above_avg <- ifelse(compare_rev$avg_revsport >= avg_growth, 1, 0)
compare_rev$updown <- ifelse(compare_rev$avg_revsport > 0, 1, 0)
# average, no lag
compare_rev |>
  filter(year > 2016) |>
  group_by(above_avg) |>
  summarize(sd = sd(avg_other_sports),
            mean = mean(avg_other_sports))
## # A tibble: 3 × 3
##   above_avg    sd   mean
##       <dbl> <dbl>  <dbl>
## 1         0 0.588 0.0805
## 2         1 4.57  0.413 
## 3        NA 1.11  0.132
nrow(subset(compare_rev, above_avg == 0))
## [1] 594
nrow(subset(compare_rev, above_avg == 1))
## [1] 409
# updown, no lag
compare_rev |>
  filter(year > 2016) |>
  group_by(updown) |>
  summarize(sd = sd(avg_other_sports),
            mean = mean(avg_other_sports))
## # A tibble: 3 × 3
##   updown    sd   mean
##    <dbl> <dbl>  <dbl>
## 1      0 0.746 0.0907
## 2      1 3.50  0.270 
## 3     NA 1.11  0.132
# for year lag
avg_growth2 <- mean(compare_rev2$avg_revsport, na.rm = TRUE)
compare_rev2$above_avg <- ifelse(compare_rev2$avg_revsport >= avg_growth2, 1, 0)
compare_rev2$updown <- ifelse(compare_rev2$avg_revsport > 0, 1, 0)
# average, lag
compare_rev2 |>
  group_by(above_avg) |>
  summarize(sd = sd(avg_other, na.rm = TRUE),
            mean = mean(avg_other, na.rm = TRUE))
## # A tibble: 3 × 3
##   above_avg    sd   mean
##       <dbl> <dbl>  <dbl>
## 1         0 3.66  0.291 
## 2         1 0.298 0.0626
## 3        NA 1.12  0.131
nrow(subset(compare_rev2, above_avg == 0)) / nrow(subset(compare_rev2, above_avg == 1))
## [1] 1.659574
# updown, lag
compare_rev2 |>
  group_by(updown) |>
  summarize(sd = sd(avg_other, na.rm = TRUE),
            mean = mean(avg_other, na.rm = TRUE))
## # A tibble: 3 × 3
##   updown    sd  mean
##    <dbl> <dbl> <dbl>
## 1      0 0.814 0.147
## 2      1 3.39  0.228
## 3     NA 1.12  0.131

We do see a change in means, and some of the differences are rather large. Since we are only testing football, NA values correspond to schools without a football team, so they can be disregarded for now. Unfortunately, the standard deviation for the above-average groups is also massive, so although the difference may seem large, it may not be very meaningful.

library(pwrss)
library(effsize)
cohen.d(d = filter(compare_rev, above_avg == 1) |> pluck("avg_other_sports"),
        f = filter(compare_rev, above_avg == 0) |> pluck("avg_other_sports"))
## 
## Cohen's d
## 
## d estimate: 0.1141633 (negligible)
## 95 percent confidence interval:
##       lower       upper 
## -0.01202254  0.24034912
cohen.d(d = filter(compare_rev, updown == 1) |> pluck("avg_other_sports"),
        f = filter(compare_rev, updown == 0) |> pluck("avg_other_sports"))
## 
## Cohen's d
## 
## d estimate: 0.07153171 (negligible)
## 95 percent confidence interval:
##       lower       upper 
## -0.06133009  0.20439352
cohen.d(d = na.omit(filter(compare_rev2, above_avg == 1) |> pluck("avg_other")),
        f = na.omit(filter(compare_rev2, above_avg == 0) |> pluck("avg_other")))
## 
## Cohen's d
## 
## d estimate: -0.07897672 (negligible)
## 95 percent confidence interval:
##       lower       upper 
## -0.22720210  0.06924866
cohen.d(d = na.omit(filter(compare_rev2, updown == 1) |> pluck("avg_other")),
        f = na.omit(filter(compare_rev2, updown == 0) |> pluck("avg_other")))
## 
## Cohen's d
## 
## d estimate: 0.0280054 (negligible)
## 95 percent confidence interval:
##      lower      upper 
## -0.1306508  0.1866616

This is not a great start. All four effect sizes are negligible, which in and of itself isn’t terrible, but it does mean that even a statistically significant result would be very small. To make matters worse, zero lies within every confidence interval, meaning there’s a reasonable chance that a football team’s success has no relationship with the revenue of the rest of the athletic department.

Do we have enough data?

I think a meaningful difference between groups would be 10%. The “cost” of getting this data wrong isn’t very high, so I want to keep a generous margin of potential error by setting alpha to .2 and beta to .2 as well.

Also of note, the histogram below suggests this data is roughly normally distributed.

compare_rev |>
  ggplot() +
  geom_histogram(mapping = aes(x=avg_revsport))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 390 rows containing non-finite outside the scale range
## (`stat_bin()`).
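
A complementary check, sketched here, is a normal Q-Q plot of the same variable; points hugging the reference line would support the normality claim:

# sketch: Q-Q plot as a second look at normality
compare_rev |>
  filter(!is.na(avg_revsport)) |>
  ggplot(aes(sample = avg_revsport)) +
  geom_qq() +
  geom_qq_line(color = "red")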

# no lag, above below average
# (note: per the group means above, mu1 should read 0.0805; this is one
#  of the stale manually entered numbers flagged in the earlier note)
pwrss.t.2means(mu1 = 0.805, 
               mu2 = 0.413,
               sd1 = 0.588,
               sd2 = 4.57,
               kappa = 1.45,
               power = .8, alpha = 0.2, 
               alternative = "not equal")
##  Difference between Two means 
##  (Independent Samples t Test) 
##  H0: mu1 = mu2 
##  HA: mu1 != mu2 
##  ------------------------------ 
##   Statistical power = 0.8 
##   n1 = 900 
##   n2 = 621 
##  ------------------------------ 
##  Alternative = "not equal" 
##  Degrees of freedom = 1519 
##  Non-centrality parameter = 2.125 
##  Type I error rate = 0.2 
##  Type II error rate = 0.2
# no lag, above below 0 change
pwrss.t.2means(mu1 = 0.091, 
               mu2 = 0.27,
               sd1 = 0.746,
               sd2 = 3.5,
               kappa = 1,
               power = .8, alpha = 0.2, 
               alternative = "not equal")
##  Difference between Two means 
##  (Independent Samples t Test) 
##  H0: mu1 = mu2 
##  HA: mu1 != mu2 
##  ------------------------------ 
##   Statistical power = 0.8 
##   n1 = 1803 
##   n2 = 1803 
##  ------------------------------ 
##  Alternative = "not equal" 
##  Degrees of freedom = 3604 
##  Non-centrality parameter = -2.124 
##  Type I error rate = 0.2 
##  Type II error rate = 0.2
# lag, above and below average
pwrss.t.2means(mu1 = 0.291,
               mu2 = 0.0625,
               sd1 = 3.661,
               sd2 = 0.298,
               kappa = 1,
               power = .8, alpha = 0.2, 
               alternative = "not equal")
##  Difference between Two means 
##  (Independent Samples t Test) 
##  H0: mu1 = mu2 
##  HA: mu1 != mu2 
##  ------------------------------ 
##   Statistical power = 0.8 
##   n1 = 1166 
##   n2 = 1166 
##  ------------------------------ 
##  Alternative = "not equal" 
##  Degrees of freedom = 2330 
##  Non-centrality parameter = 2.124 
##  Type I error rate = 0.2 
##  Type II error rate = 0.2
# lag, above and below 0 change
pwrss.t.2means(mu1 = 0.147,
               mu2 = 0.228,
               sd1 = 0.814,
               sd2 = 3.392,
               kappa = 1,
               power = .8, alpha = 0.2, 
               alternative = "not equal")
##  Difference between Two means 
##  (Independent Samples t Test) 
##  H0: mu1 = mu2 
##  HA: mu1 != mu2 
##  ------------------------------ 
##   Statistical power = 0.8 
##   n1 = 8362 
##   n2 = 8362 
##  ------------------------------ 
##  Alternative = "not equal" 
##  Degrees of freedom = 16722 
##  Non-centrality parameter = -2.123 
##  Type I error rate = 0.2 
##  Type II error rate = 0.2

To make matters even worse, we don’t have enough data to support our hypotheses at the desired strength. And even if we do run the t-tests anyway…

t.test(filter(compare_rev, above_avg == 1) |> pluck("avg_other_sports"),
        filter(compare_rev, above_avg == 0) |> pluck("avg_other_sports"))
## 
##  Welch Two Sample t-test
## 
## data:  pluck(filter(compare_rev, above_avg == 1), "avg_other_sports") and pluck(filter(compare_rev, above_avg == 0), "avg_other_sports")
## t = 1.4779, df = 411.94, p-value = 0.1402
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.1983851  1.4005319
## sample estimates:
## mean of x mean of y 
## 0.7048690 0.1037956
t.test(filter(compare_rev, updown == 1) |> pluck("avg_other_sports"),
        filter(compare_rev, updown == 0) |> pluck("avg_other_sports"))
## 
##  Welch Two Sample t-test
## 
## data:  pluck(filter(compare_rev, updown == 1), "avg_other_sports") and pluck(filter(compare_rev, updown == 0), "avg_other_sports")
## t = 1.5262, df = 714.23, p-value = 0.1274
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.1079612  0.8619611
## sample estimates:
##  mean of x  mean of y 
## 0.46955431 0.09255438
t.test(na.omit(filter(compare_rev2, above_avg == 1) |> pluck("avg_other")),
        na.omit(filter(compare_rev2, above_avg == 0) |> pluck("avg_other")))
## 
##  Welch Two Sample t-test
## 
## data:  na.omit(pluck(filter(compare_rev2, above_avg == 1), "avg_other")) and na.omit(pluck(filter(compare_rev2, above_avg == 0), "avg_other"))
## t = -1.3394, df = 474.12, p-value = 0.1811
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.5640150  0.1067764
## sample estimates:
##  mean of x  mean of y 
## 0.06259144 0.29121074
t.test(na.omit(filter(compare_rev2, updown == 1) |> pluck("avg_other")),
        na.omit(filter(compare_rev2, updown == 0) |> pluck("avg_other")))
## 
##  Welch Two Sample t-test
## 
## data:  na.omit(pluck(filter(compare_rev2, updown == 1), "avg_other")) and na.omit(pluck(filter(compare_rev2, updown == 0), "avg_other"))
## t = 0.51606, df = 659.96, p-value = 0.606
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.2275406  0.3897849
## sample estimates:
## mean of x mean of y 
## 0.2282531 0.1471309

We can see that our p-values are pretty high as well, ranging from about 0.13 to 0.61 across the four tests.

The culmination of evidence is not sufficient to reject the null hypothesis. This doesn’t mean it’s impossible for revenue sports to positively impact non-revenue sports, but it looks unlikely that a relationship exists, and even if one did, it appears minimal.

Expenses more important?

I’ll rerun some of the prior tests using similar conditions and assumptions, but with the expenses of non-revenue sports as the response variable instead.

compare_exp <- compare_change |>
  group_by(year, institution_name) |>
  summarise(avg_revsport = mean(rev_change[sports %in% 
                                        c("Football")]),
            avg_other_sports = mean(exp_change[!sports %in% 
                                        c("Football", "Basketball")])
            )
## `summarise()` has grouped output by 'year'. You can override using the
## `.groups` argument.
compare_exp |>
  filter(year > 2015) |>
  ggplot() +
  geom_point(mapping = aes(x = avg_revsport, y=avg_other_sports)) +
  labs(x = "% Change in Football Revenue",
       y = "% Change in Non-Revenue Expenses",
       title = "Institutions' Change in Football Revenue and Non-Revenue Sport Expenses") +
  # intercept and slope for this line come from the regression model
  # fit in the Regression Models section below
  geom_abline(intercept = 0.0365, slope = 0.12372, color = "red") +
  theme(
    panel.background = element_rect(fill = "white", color = NA),
    panel.grid.minor = element_blank(),
    panel.grid.major = element_blank(),
  )
## Warning: Removed 390 rows containing missing values or values outside the scale range
## (`geom_point()`).

compare_exp2 <- compare_exp |>
  arrange(institution_name, year) |>
  group_by(institution_name) |>
  mutate(avg_other = lead(avg_other_sports)) |>
  filter(year < 2019)
  

compare_exp2 |>
  filter(year > 2016) |>
  ggplot() +
  geom_point(mapping = aes(x = avg_revsport, y=avg_other))
## Warning: Removed 198 rows containing missing values or values outside the scale range
## (`geom_point()`).

avg_growth <- mean(compare_exp$avg_revsport, na.rm = TRUE)
compare_exp$above_avg <- ifelse(compare_exp$avg_revsport >= avg_growth, 1, 0)
compare_exp$updown <- ifelse(compare_exp$avg_revsport > 0, 1, 0)
compare_exp |>
  filter(year > 2016) |>
  group_by(above_avg) |>
  summarize(sd = sd(avg_other_sports, na.rm = TRUE),
            mean = mean(avg_other_sports, na.rm = TRUE))
## # A tibble: 3 × 3
##   above_avg    sd   mean
##       <dbl> <dbl>  <dbl>
## 1         0 0.107 0.0253
## 2         1 0.170 0.0555
## 3        NA 0.183 0.0547
compare_exp |>
  filter(year > 2016) |>
  group_by(updown) |>
  summarize(sd = sd(avg_other_sports),
            mean = mean(avg_other_sports))
## # A tibble: 3 × 3
##   updown    sd   mean
##    <dbl> <dbl>  <dbl>
## 1      0 0.118 0.0162
## 2      1 0.143 0.0476
## 3     NA 0.183 0.0547
# for year lag
avg_growth2 <- mean(compare_exp2$avg_revsport, na.rm = TRUE)
compare_exp2$above_avg <- ifelse(compare_exp2$avg_revsport >= avg_growth2, 1, 0)
compare_exp2$updown <- ifelse(compare_exp2$avg_revsport > 0, 1, 0)
compare_exp2 |>
  filter(year > 2016) |>
  group_by(above_avg) |>
  summarize(sd = sd(avg_other, na.rm = TRUE),
            mean = mean(avg_other, na.rm = TRUE))
## # A tibble: 3 × 3
##   above_avg    sd   mean
##       <dbl> <dbl>  <dbl>
## 1         0 0.110 0.0150
## 2         1 0.113 0.0183
## 3        NA 0.184 0.0414
compare_exp2 |>
  filter(year > 2016) |>
  group_by(updown) |>
  summarize(sd = sd(avg_other, na.rm = TRUE),
            mean = mean(avg_other, na.rm = TRUE))
## # A tibble: 3 × 3
##   updown    sd   mean
##    <dbl> <dbl>  <dbl>
## 1      0 0.110 0.0192
## 2      1 0.111 0.0149
## 3     NA 0.184 0.0414
cohen.d(d = filter(compare_exp, above_avg == 1) |> pluck("avg_other_sports"),
        f = filter(compare_exp, above_avg == 0) |> pluck("avg_other_sports"))
## 
## Cohen's d
## 
## d estimate: 0.2345709 (small)
## 95 percent confidence interval:
##     lower     upper 
## 0.1080661 0.3610757
cohen.d(d = filter(compare_exp, updown == 1) |> pluck("avg_other_sports"),
        f = filter(compare_exp, updown == 0) |> pluck("avg_other_sports"))
## 
## Cohen's d
## 
## d estimate: 0.2448789 (small)
## 95 percent confidence interval:
##     lower     upper 
## 0.1116215 0.3781364
cohen.d(d = na.omit(filter(compare_exp2, above_avg == 1) |> pluck("avg_other")),
        f = na.omit(filter(compare_exp2, above_avg == 0) |> pluck("avg_other")))
## 
## Cohen's d
## 
## d estimate: 0.08553731 (negligible)
## 95 percent confidence interval:
##       lower       upper 
## -0.06269746  0.23377209
cohen.d(d = na.omit(filter(compare_exp2, updown == 1) |> pluck("avg_other")),
        f = na.omit(filter(compare_exp2, updown == 0) |> pluck("avg_other")))
## 
## Cohen's d
## 
## d estimate: 0.07073128 (negligible)
## 95 percent confidence interval:
##       lower       upper 
## -0.08795916  0.22942173
# no lag, above and below average
pwrss.t.2means(mu1 = 0.0253,
               mu2 = 0.0555,
               sd1 = 0.1073,
               sd2 = 0.1702,
               kappa = 1,
               power = .8, alpha = 0.2, 
               alternative = "not equal")
##  Difference between Two means 
##  (Independent Samples t Test) 
##  H0: mu1 = mu2 
##  HA: mu1 != mu2 
##  ------------------------------ 
##   Statistical power = 0.8 
##   n1 = 201 
##   n2 = 201 
##  ------------------------------ 
##  Alternative = "not equal" 
##  Degrees of freedom = 400 
##  Non-centrality parameter = -2.128 
##  Type I error rate = 0.2 
##  Type II error rate = 0.2
# no lag, above and below 0 change
pwrss.t.2means(mu1 = 0.0162,
               mu2 = 0.0476,
               sd1 = 0.1177,
               sd2 = 0.1425,
               kappa = 1,
               power = .8, alpha = 0.2, 
               alternative = "not equal")
##  Difference between Two means 
##  (Independent Samples t Test) 
##  H0: mu1 = mu2 
##  HA: mu1 != mu2 
##  ------------------------------ 
##   Statistical power = 0.8 
##   n1 = 157 
##   n2 = 157 
##  ------------------------------ 
##  Alternative = "not equal" 
##  Degrees of freedom = 312 
##  Non-centrality parameter = -2.129 
##  Type I error rate = 0.2 
##  Type II error rate = 0.2
# lag, above and below average
# (note: per the group means above, these should read mu1 = 0.0150 and
#  mu2 = 0.0183; stale manually entered numbers flagged in the earlier note)
pwrss.t.2means(mu1 = 0.01884,
               mu2 = 0.01256,
               sd1 = 0.1097,
               sd2 = 0.1130,
               kappa = 1,
               power = .8, alpha = 0.2, 
               alternative = "not equal")
##  Difference between Two means 
##  (Independent Samples t Test) 
##  H0: mu1 = mu2 
##  HA: mu1 != mu2 
##  ------------------------------ 
##   Statistical power = 0.8 
##   n1 = 2836 
##   n2 = 2836 
##  ------------------------------ 
##  Alternative = "not equal" 
##  Degrees of freedom = 5670 
##  Non-centrality parameter = 2.124 
##  Type I error rate = 0.2 
##  Type II error rate = 0.2
# lag, above and below 0 change
pwrss.t.2means(mu1 = 0.01918,
               mu2 = 0.01490,
               sd1 = 0.1103,
               sd2 = 0.1115,
               kappa = 1,
               power = .8, alpha = 0.2, 
               alternative = "not equal")
##  Difference between Two means 
##  (Independent Samples t Test) 
##  H0: mu1 = mu2 
##  HA: mu1 != mu2 
##  ------------------------------ 
##   Statistical power = 0.8 
##   n1 = 6054 
##   n2 = 6054 
##  ------------------------------ 
##  Alternative = "not equal" 
##  Degrees of freedom = 12106 
##  Non-centrality parameter = 2.123 
##  Type I error rate = 0.2 
##  Type II error rate = 0.2
t.test(filter(compare_exp, above_avg == 1) |> pluck("avg_other_sports"),
        filter(compare_exp, above_avg == 0) |> pluck("avg_other_sports"))
## 
##  Welch Two Sample t-test
## 
## data:  pluck(filter(compare_exp, above_avg == 1), "avg_other_sports") and pluck(filter(compare_exp, above_avg == 0), "avg_other_sports")
## t = 3.407, df = 659.74, p-value = 0.0006966
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.01256505 0.04674958
## sample estimates:
##  mean of x  mean of y 
## 0.06111880 0.03146148
t.test(filter(compare_exp, updown == 1) |> pluck("avg_other_sports"),
        filter(compare_exp, updown == 0) |> pluck("avg_other_sports"))
## 
##  Welch Two Sample t-test
## 
## data:  pluck(filter(compare_exp, updown == 1), "avg_other_sports") and pluck(filter(compare_exp, updown == 0), "avg_other_sports")
## t = 3.8007, df = 711.24, p-value = 0.0001566
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.01496922 0.04695923
## sample estimates:
##  mean of x  mean of y 
## 0.05346483 0.02250061
t.test(filter(compare_exp2, above_avg == 1) |> pluck("avg_other"),
        filter(compare_exp2, above_avg == 0) |> pluck("avg_other"))
## 
##  Welch Two Sample t-test
## 
## data:  pluck(filter(compare_exp2, above_avg == 1), "avg_other") and pluck(filter(compare_exp2, above_avg == 0), "avg_other")
## t = 1.1296, df = 586.98, p-value = 0.2591
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.006594145  0.024449763
## sample estimates:
##  mean of x  mean of y 
## 0.03643210 0.02750429
t.test(filter(compare_exp2, updown == 1) |> pluck("avg_other"),
        filter(compare_exp2, updown == 0) |> pluck("avg_other"))
## 
##  Welch Two Sample t-test
## 
## data:  pluck(filter(compare_exp2, updown == 1), "avg_other") and pluck(filter(compare_exp2, updown == 0), "avg_other")
## t = 0.8883, df = 408.72, p-value = 0.3749
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.00895789  0.02372793
## sample estimates:
##  mean of x  mean of y 
## 0.03300017 0.02561515

Using the same logic as before, there is sufficient evidence to reject the null hypothesis that spending on non-revenue sports is unaffected by revenue sports’ success, at least within the same year. However, there is no evidence that this benefit persists into the following year.

Using profit instead of revenue

Repeating the same tests as before, but using profit changes instead of revenue changes.

Null Hypothesis: the change in revenue for non-revenue sports is the same for institutions whose football team’s profit increased as for those whose profit decreased.

compare_profit <- compare_change |>
  group_by(year, institution_name) |>
  summarise(avg_revsport = mean(profit_change[sports %in% 
                                        c("Football")]),
            avg_other_sports = mean(rev_change[!sports %in% 
                                        c("Football", "Basketball")])
            )
## `summarise()` has grouped output by 'year'. You can override using the
## `.groups` argument.
compare_profit |>
  filter(year > 2015) |>
  ggplot() +
  geom_point(mapping = aes(x = avg_revsport, y=avg_other_sports))
## Warning: Removed 390 rows containing missing values or values outside the scale range
## (`geom_point()`).

compare_profit2 <- compare_profit |>
  arrange(institution_name, year) |>
  group_by(institution_name) |>
  mutate(avg_other = lead(avg_other_sports)) |>
  filter(year < 2019)
  

compare_profit2 |>
  filter(year > 2016) |>
  ggplot() +
  geom_point(mapping = aes(x = avg_revsport, y=avg_other))
## Warning: Removed 198 rows containing missing values or values outside the scale range
## (`geom_point()`).

Interestingly, there’s a much larger spread on the x-axis. Let’s see what we find.

compare_profit$updown <- ifelse(compare_profit$avg_revsport > 0, 1, 0)
compare_profit2$updown <- ifelse(compare_profit2$avg_revsport > 0, 1, 0)
compare_profit |>
  filter(year > 2016) |>
  group_by(updown) |>
  summarize(sd = sd(avg_other_sports),
            mean = mean(avg_other_sports))
## # A tibble: 3 × 3
##   updown    sd  mean
##    <dbl> <dbl> <dbl>
## 1      0 3.29  0.221
## 2      1 0.671 0.173
## 3     NA 1.11  0.132
compare_profit2 |>
  group_by(updown) |>
  summarize(sd = sd(avg_other, na.rm = TRUE),
            mean = mean(avg_other, na.rm = TRUE))
## # A tibble: 3 × 3
##   updown    sd  mean
##    <dbl> <dbl> <dbl>
## 1      0 3.34  0.234
## 2      1 0.601 0.120
## 3     NA 1.12  0.131

I can’t say this is looking too good.

cohen.d(d = filter(compare_profit, updown == 1) |> pluck("avg_other_sports"),
        f = filter(compare_profit, updown == 0) |> pluck("avg_other_sports"))
## 
## Cohen's d
## 
## d estimate: 0.1099549 (negligible)
## 95 percent confidence interval:
##       lower       upper 
## -0.03317298  0.25308286
cohen.d(d = na.omit(filter(compare_profit2, updown == 1) |> pluck("avg_other")),
        f = na.omit(filter(compare_profit2, updown == 0) |> pluck("avg_other")))
## 
## Cohen's d
## 
## d estimate: -0.03945665 (negligible)
## 95 percent confidence interval:
##      lower      upper 
## -0.2041184  0.1252051
t.test(filter(compare_profit, updown == 1) |> pluck("avg_other_sports"),
        filter(compare_profit, updown == 0) |> pluck("avg_other_sports"))
## 
##  Welch Two Sample t-test
## 
## data:  pluck(filter(compare_profit, updown == 1), "avg_other_sports") and pluck(filter(compare_profit, updown == 0), "avg_other_sports")
## t = 0.97421, df = 266.45, p-value = 0.3308
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.5913513  1.7496926
## sample estimates:
## mean of x mean of y 
## 0.7831329 0.2039623
t.test(filter(compare_profit2, updown == 1) |> pluck("avg_other"),
        filter(compare_profit2, updown == 0) |> pluck("avg_other"))
## 
##  Welch Two Sample t-test
## 
## data:  pluck(filter(compare_profit2, updown == 1), "avg_other") and pluck(filter(compare_profit2, updown == 0), "avg_other")
## t = -0.77186, df = 647.85, p-value = 0.4405
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.4050278  0.1764583
## sample estimates:
## mean of x mean of y 
## 0.1198413 0.2341261

For almost the exact reasons outlined in the prior sections, there isn’t enough evidence to reject the null hypothesis.

With Expenses

compare_pexp <- compare_change |>
  group_by(year, institution_name) |>
  summarise(avg_revsport = mean(profit_change[sports %in% 
                                        c("Football")]),
            avg_other_sports = mean(exp_change[!sports %in% 
                                        c("Football", "Basketball")])
            )
## `summarise()` has grouped output by 'year'. You can override using the
## `.groups` argument.
compare_pexp |>
  filter(year > 2015) |>
  ggplot() +
  geom_point(mapping = aes(x = avg_revsport, y=avg_other_sports))
## Warning: Removed 390 rows containing missing values or values outside the scale range
## (`geom_point()`).

compare_pexp2 <- compare_pexp |>
  arrange(institution_name, year) |>
  group_by(institution_name) |>
  mutate(avg_other = lead(avg_other_sports)) |>
  filter(year < 2019)
  

compare_pexp2 |>
  filter(year > 2016) |>
  ggplot() +
  geom_point(mapping = aes(x = avg_revsport, y=avg_other))
## Warning: Removed 198 rows containing missing values or values outside the scale range
## (`geom_point()`).

compare_pexp$updown <- ifelse(compare_pexp$avg_revsport > 0, 1, 0)
compare_pexp2$updown <- ifelse(compare_pexp2$avg_revsport > 0, 1, 0)
compare_pexp |>
  filter(year > 2016) |>
  group_by(updown) |>
  summarize(sd = sd(avg_other_sports),
            mean = mean(avg_other_sports))
## # A tibble: 3 × 3
##   updown    sd   mean
##    <dbl> <dbl>  <dbl>
## 1      0 0.142 0.0387
## 2      1 0.113 0.0316
## 3     NA 0.183 0.0547
compare_pexp2 |>
  group_by(updown) |>
  summarize(sd = sd(avg_other, na.rm = TRUE),
            mean = mean(avg_other, na.rm = TRUE))
## # A tibble: 3 × 3
##   updown    sd   mean
##    <dbl> <dbl>  <dbl>
## 1      0 0.105 0.0313
## 2      1 0.104 0.0298
## 3     NA 0.180 0.0522

I’m not even going to bother testing the rest; these are basically identical.

Do schools with more revenue-sport revenue/profit have more funding for non-revenue sports?

I now want to confirm that schools with higher revenues or profits in revenue sports have higher expenses for non-revenue sports.

Null Hypothesis

  • Expenses for non-revenue sports are the same at FBS and non-FBS institutions.
d1 <- ncaa |>
  filter(year == 2019) |>
  filter(classification_code %in% c(1,2,3))

I’m going to look at four different sports: Women’s Swimming and Diving, Men’s Golf, Baseball, and Softball. I’m going to keep the alpha and power levels the same as in the prior null hypothesis, for the same reasons.

compare_sports <- d1 |>
  group_by(institution_name) |>
  summarise(
    revenue_rev = sum(rev_men[sports %in% c("Football")], na.rm = TRUE) +
                  sum(rev_men[sports %in% c("Basketball")], na.rm = TRUE),
    revenue_profit = sum(profit[sports %in% c("Football")], na.rm = TRUE)+
                  sum(rev_men[sports %in% c("Basketball")], na.rm = TRUE),
    
    swd_exp = mean(exp_women[sports %in% c("Swimming and Diving")], na.rm = TRUE),
    mgolf_exp = mean(exp_men[sports %in% c("Golf")], na.rm = TRUE),
    baseball_exp = mean(exp_men[sports %in% c("Baseball")], na.rm = TRUE),
    softball_exp = mean(exp_women[sports %in% c("Softball")], na.rm = TRUE),
    division = min(classification_code)
  )
compare_sports$class <- ifelse(compare_sports$division == 1, 1, 0)

Women’s Swim and Dive

# womens swim and dive
compare_sports |>
  filter(!is.na(swd_exp)) |>
  group_by(class) |>
  summarize(sd = sd(swd_exp),
            mean = mean(swd_exp))
## # A tibble: 2 × 3
##   class      sd     mean
##   <dbl>   <dbl>    <dbl>
## 1     0 264229.  547937.
## 2     1 472962. 1313916.
cohen.d(d = na.omit(filter(compare_sports, class == 1) |> pluck("swd_exp")),
        f = na.omit(filter(compare_sports, class == 0) |> pluck("swd_exp")))
## 
## Cohen's d
## 
## d estimate: 2.029838 (large)
## 95 percent confidence interval:
##    lower    upper 
## 1.646270 2.413406
one <- compare_sports |>
  filter(!is.na(swd_exp), class == 0) %>%
  summarise(count = n())
two <- compare_sports |>
  filter(!is.na(swd_exp), class == 1) %>%
  summarise(count = n())
c(one, two)
## $count
## [1] 85
## 
## $count
## [1] 76
pwrss.t.2means(mu1 = 547937.3, 
               mu2 = 1313915.8,
               sd1 = 264229.5,
               sd2 = 472962,
               kappa = .9,
               power = .8, 
               alpha = 0.2, 
               alternative = "not equal")
##  Difference between Two means 
##  (Independent Samples t Test) 
##  H0: mu1 = mu2 
##  HA: mu1 != mu2 
##  ------------------------------ 
##   Statistical power = 0.8 
##   n1 = 3 
##   n2 = 4 
##  ------------------------------ 
##  Alternative = "not equal" 
##  Degrees of freedom = 5 
##  Non-centrality parameter = -2.722 
##  Type I error rate = 0.2 
##  Type II error rate = 0.2
t.test(na.omit(filter(compare_sports, class == 1) |> pluck("swd_exp")),
        na.omit(filter(compare_sports, class == 0) |> pluck("swd_exp")))
## 
##  Welch Two Sample t-test
## 
## data:  na.omit(pluck(filter(compare_sports, class == 1), "swd_exp")) and na.omit(pluck(filter(compare_sports, class == 0), "swd_exp"))
## t = 12.484, df = 114.72, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  644438.6 887518.5
## sample estimates:
## mean of x mean of y 
## 1313915.8  547937.3

Men’s Golf

# mens golf
compare_sports |>
  filter(!is.na(mgolf_exp)) |>
  group_by(class) |>
  summarize(sd = sd(mgolf_exp),
            mean = mean(mgolf_exp))
## # A tibble: 2 × 3
##   class      sd    mean
##   <dbl>   <dbl>   <dbl>
## 1     0 153291. 287963.
## 2     1 333871. 700563.
cohen.d(d = na.omit(filter(compare_sports, class == 1) |> pluck("mgolf_exp")),
        f = na.omit(filter(compare_sports, class == 0) |> pluck("mgolf_exp")))
## 
## Cohen's d
## 
## d estimate: 1.706682 (large)
## 95 percent confidence interval:
##    lower    upper 
## 1.434323 1.979040
one <- compare_sports |>
  filter(!is.na(mgolf_exp), class == 0) %>%
  summarise(count = n())
two <- compare_sports |>
  filter(!is.na(mgolf_exp), class == 1) %>%
  summarise(count = n())
c(one, two)
## $count
## [1] 177
## 
## $count
## [1] 117
pwrss.t.2means(mu1 = 287963,
               mu2 = 700562.8,
               sd1 = 153291.4,
               sd2 = 333870.5,
               kappa = .66,
               power = .8, 
               alpha = 0.2, 
               alternative = "not equal")
##  Difference between Two means 
##  (Independent Samples t Test) 
##  H0: mu1 = mu2 
##  HA: mu1 != mu2 
##  ------------------------------ 
##   Statistical power = 0.8 
##   n1 = 4 
##   n2 = 5 
##  ------------------------------ 
##  Alternative = "not equal" 
##  Degrees of freedom = 7 
##  Non-centrality parameter = -2.458 
##  Type I error rate = 0.2 
##  Type II error rate = 0.2
t.test(na.omit(filter(compare_sports, class == 1) |> pluck("mgolf_exp")),
        na.omit(filter(compare_sports, class == 0) |> pluck("mgolf_exp")))
## 
##  Welch Two Sample t-test
## 
## data:  na.omit(pluck(filter(compare_sports, class == 1), "mgolf_exp")) and na.omit(pluck(filter(compare_sports, class == 0), "mgolf_exp"))
## t = 12.523, df = 148.68, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  347495.4 477704.2
## sample estimates:
## mean of x mean of y 
##  700562.8  287963.0

Baseball

# baseball
compare_sports |>
  filter(!is.na(baseball_exp)) |>
  group_by(class) |>
  summarize(sd = sd(baseball_exp),
            mean = mean(baseball_exp))
## # A tibble: 2 × 3
##   class       sd     mean
##   <dbl>    <dbl>    <dbl>
## 1     0  423571.  930220.
## 2     1 1192174. 2124906.
cohen.d(d = na.omit(filter(compare_sports, class == 1) |> pluck("baseball_exp")),
        f = na.omit(filter(compare_sports, class == 0) |> pluck("baseball_exp")))
## 
## Cohen's d
## 
## d estimate: 1.473127 (large)
## 95 percent confidence interval:
##    lower    upper 
## 1.210552 1.735701
one <- compare_sports |>
  filter(!is.na(baseball_exp), class == 0) %>%
  summarise(count = n())
two <- compare_sports |>
  filter(!is.na(baseball_exp), class == 1) %>%
  summarise(count = n())
c(one, two)
## $count
## [1] 183
## 
## $count
## [1] 115
pwrss.t.2means(mu1 = 930220.4,
               mu2 = 2124906.4,
               sd1 = 423571.5,
               sd2 = 1192174.2,
               kappa = .62,
               power = .8, 
               alpha = 0.2, 
               alternative = "not equal")
##  Difference between Two means 
##  (Independent Samples t Test) 
##  H0: mu1 = mu2 
##  HA: mu1 != mu2 
##  ------------------------------ 
##   Statistical power = 0.8 
##   n1 = 4 
##   n2 = 7 
##  ------------------------------ 
##  Alternative = "not equal" 
##  Degrees of freedom = 9 
##  Non-centrality parameter = -2.4 
##  Type I error rate = 0.2 
##  Type II error rate = 0.2
t.test(na.omit(filter(compare_sports, class == 1) |> pluck("baseball_exp")),
        na.omit(filter(compare_sports, class == 0) |> pluck("baseball_exp")))
## 
##  Welch Two Sample t-test
## 
## data:  na.omit(pluck(filter(compare_sports, class == 1), "baseball_exp")) and na.omit(pluck(filter(compare_sports, class == 0), "baseball_exp"))
## t = 10.344, df = 132.28, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   966227.7 1423144.1
## sample estimates:
## mean of x mean of y 
## 2124906.4  930220.4

Softball

# softball
compare_sports |>
  filter(!is.na(softball_exp)) |>
  group_by(class) |>
  summarize(sd = sd(softball_exp),
            mean = mean(softball_exp))
## # A tibble: 2 × 3
##   class      sd     mean
##   <dbl>   <dbl>    <dbl>
## 1     0 274425.  685636.
## 2     1 635182. 1413436.
cohen.d(d = na.omit(filter(compare_sports, class == 1) |> pluck("softball_exp")),
        f = na.omit(filter(compare_sports, class == 0) |> pluck("softball_exp")))
## 
## Cohen's d
## 
## d estimate: 1.646267 (large)
## 95 percent confidence interval:
##    lower    upper 
## 1.374592 1.917943
one <- compare_sports |>
  filter(!is.na(softball_exp), class == 0) %>%
  summarise(count = n())
two <- compare_sports |>
  filter(!is.na(softball_exp), class == 1) %>%
  summarise(count = n())
c(one, two)
## $count
## [1] 188
## 
## $count
## [1] 109
pwrss.t.2means(mu1 = 685636.3,
               mu2 = 1413435.6,
               sd1 = 274424.5,
               sd2 = 635182,
               kappa = .58,
               power = .8, 
               alpha = 0.2, 
               alternative = "not equal")
##  Difference between Two means 
##  (Independent Samples t Test) 
##  H0: mu1 = mu2 
##  HA: mu1 != mu2 
##  ------------------------------ 
##   Statistical power = 0.8 
##   n1 = 4 
##   n2 = 6 
##  ------------------------------ 
##  Alternative = "not equal" 
##  Degrees of freedom = 8 
##  Non-centrality parameter = -2.481 
##  Type I error rate = 0.2 
##  Type II error rate = 0.2
t.test(na.omit(filter(compare_sports, class == 1) |> pluck("softball_exp")),
        na.omit(filter(compare_sports, class == 0) |> pluck("softball_exp")))
## 
##  Welch Two Sample t-test
## 
## data:  na.omit(pluck(filter(compare_sports, class == 1), "softball_exp")) and na.omit(pluck(filter(compare_sports, class == 0), "softball_exp"))
## t = 11.364, df = 131.75, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  601105.9 854492.7
## sample estimates:
## mean of x mean of y 
## 1413435.6  685636.3

Results

It is abundantly clear that we can reject the null hypothesis. For at least these popular non-revenue sports, there is more funding at FBS schools than at non-FBS schools.

Regression Models

library(boot)
library(broom)
library(lindia)

This model will show the relationship between increase in revenue of football teams and increase in non-revenue sport funding.

# data we're basing model off of 
compare_exp |>
  filter(year > 2015) |>
  ggplot() +
  geom_point(mapping = aes(x = avg_revsport, y=avg_other_sports))
## Warning: Removed 390 rows containing missing values or values outside the scale range
## (`geom_point()`).

model <- lm(avg_other_sports ~ avg_revsport, compare_exp)
summary(model)
## 
## Call:
## lm(formula = avg_other_sports ~ avg_revsport, data = compare_exp)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.38795 -0.05866 -0.00502  0.04318  1.86182 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.036588   0.004054   9.026  < 2e-16 ***
## avg_revsport 0.123720   0.017912   6.907 8.78e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1243 on 1001 degrees of freedom
##   (390 observations deleted due to missingness)
## Multiple R-squared:  0.04549,    Adjusted R-squared:  0.04454 
## F-statistic: 47.71 on 1 and 1001 DF,  p-value: 8.78e-12

Using a linear regression, we can estimate how much non-revenue sport expenses will change based upon changes in football revenue.

mean(compare_exp$avg_other_sports)
## [1] 0.04794688
mean(compare_exp$avg_revsport, na.rm = TRUE)
## [1] 0.05631253

The model predicts that when football revenue doesn’t change from the prior year, non-revenue sport funding still increases by about 3.7% (the intercept). The slope of about 0.12 predicts that each unit (here, a percentage point) of additional football revenue growth adds about 0.12 percentage points to non-revenue sport expense growth. So if football revenue increases 10%, non-revenue sport funding is predicted to increase by about 3.7% + 0.12(10%) ≈ 4.9%.
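
We can check that scenario directly against the fitted model (a quick sketch):

# predicted non-revenue expense growth when football revenue grows 10%
predict(model, newdata = tibble(avg_revsport = 0.10))  # ~0.049, about 4.9%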

Referencing the mean avg_revsport and the model’s slope, we’d expect an intercept of roughly mean(y) - slope * mean(x) = 0.048 - 0.124 * 0.056 ≈ 0.041, or about 4.1% for non-revenue sports, only slightly higher than the model’s fitted intercept of 3.7%; the small gap comes from the 390 rows lm() dropped for missingness.
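
The same check in code (a sketch; b is a name introduced here):

# implied intercept from the sample means and the fitted slope; it won't match
# the fitted intercept exactly because lm() dropped 390 rows for missingness
b <- coef(model)[["avg_revsport"]]
mean(compare_exp$avg_other_sports) - b * mean(compare_exp$avg_revsport, na.rm = TRUE)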

Separating model into FBS and FCS

FBS
compare_FBS <- ncaa |>
  filter(classification_code == 1) |>
  arrange(institution_name, men, sports, year) |>
  mutate(revenue_change_men = (rev_men - lag(rev_men))/lag(rev_men),
         revenue_change_women = (rev_women - lag(rev_women))/lag(rev_women),
         expense_change_men = (exp_men - lag(exp_men))/lag(exp_men),
         expense_change_women = (exp_women - lag(exp_women))/lag(exp_women),
         partic_men_change = (sum_partic_men - lag(sum_partic_men))/
                              lag(sum_partic_men),
         partic_women_change = (sum_partic_women - lag(sum_partic_women))/
                              lag(sum_partic_women),
         student_change = (ef_total_count - lag(ef_total_count)) /
                            lag(ef_total_count),
         profit_change = (profit - lag(profit))
         # this one is a number, not percentage
  )

compare_change_FBS <- compare_FBS |>
  filter(year > 2015) |>
  mutate(rev_change = rowSums(across(c(revenue_change_men, 
                                       revenue_change_women)), na.rm = TRUE),
         exp_change = rowSums(across(c(expense_change_men, 
                                       expense_change_women)), na.rm = TRUE),
         athlete_change = rowSums(across(c(partic_men_change, 
                                        partic_women_change)), na.rm = TRUE)
         ) |>
  reframe(year, institution_name, classification_code, sports, profit,
          rev_change, exp_change, athlete_change, student_change, profit_change)

compare_exp_FBS <- compare_change_FBS |>
  group_by(year, institution_name) |>
  summarise(avg_revsport = mean(rev_change[sports %in% 
                                        c("Football")]),
            avg_other_sports = mean(exp_change[!sports %in% 
                                        c("Football", "Basketball")])
            )
## `summarise()` has grouped output by 'year'. You can override using the
## `.groups` argument.
# FBS
compare_exp_FBS |>
  ggplot() +
  geom_point(mapping = aes(x = avg_revsport, y=avg_other_sports))
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).

This graph looks pretty different now: the points are far less tied to football revenue, and, as the model below shows, the relationship is much weaker.

model_FBS_pct <- lm(avg_other_sports ~ avg_revsport, compare_exp_FBS)
summary(model_FBS_pct)
## 
## Call:
## lm(formula = avg_other_sports ~ avg_revsport, data = compare_exp_FBS)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.33220 -0.05698  0.00032  0.04590  0.56641 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.035320   0.004488   7.870 2.19e-14 ***
## avg_revsport 0.044299   0.014167   3.127  0.00187 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0981 on 504 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.01903,    Adjusted R-squared:  0.01708 
## F-statistic: 9.778 on 1 and 504 DF,  p-value: 0.001868

FCS
compare_FCS <- ncaa |>
  filter(classification_code == 2) |>
  arrange(institution_name, men, sports, year) |>
  mutate(revenue_change_men = (rev_men - lag(rev_men))/lag(rev_men),
         revenue_change_women = (rev_women - lag(rev_women))/lag(rev_women),
         expense_change_men = (exp_men - lag(exp_men))/lag(exp_men),
         expense_change_women = (exp_women - lag(exp_women))/lag(exp_women),
         partic_men_change = (sum_partic_men - lag(sum_partic_men))/
                              lag(sum_partic_men),
         partic_women_change = (sum_partic_women - lag(sum_partic_women))/
                              lag(sum_partic_women),
         student_change = (ef_total_count - lag(ef_total_count)) /
                            lag(ef_total_count),
         profit_change = (profit - lag(profit))
         # this one is a number, not percentage
  )

compare_change_FCS <- compare_FCS |>
  filter(year > 2015) |>
  mutate(rev_change = rowSums(across(c(revenue_change_men, 
                                       revenue_change_women)), na.rm = TRUE),
         exp_change = rowSums(across(c(expense_change_men, 
                                       expense_change_women)), na.rm = TRUE),
         athlete_change = rowSums(across(c(partic_men_change, 
                                        partic_women_change)), na.rm = TRUE)
         ) |>
  reframe(year, institution_name, classification_code, sports, profit,
          rev_change, exp_change, athlete_change, student_change, profit_change)

compare_exp_FCS <- compare_change_FCS |>
  group_by(year, institution_name) |>
  summarise(avg_revsport = mean(rev_change[sports %in% 
                                        c("Football")]),
            avg_other_sports = mean(exp_change[!sports %in% 
                                        c("Football", "Basketball")])
            )
## `summarise()` has grouped output by 'year'. You can override using the
## `.groups` argument.
# FCS
compare_exp_FCS |>
  ggplot() +
  geom_point(mapping = aes(x = avg_revsport, y=avg_other_sports))
## Warning: Removed 4 rows containing missing values or values outside the scale range
## (`geom_point()`).

model_FCS_pct <- lm(avg_other_sports ~ avg_revsport, compare_exp_FCS)
summary(model_FCS_pct)
## 
## Call:
## lm(formula = avg_other_sports ~ avg_revsport, data = compare_exp_FCS)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.41099 -0.06008 -0.00941  0.03752  1.84678 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.040196   0.006773   5.935 5.53e-09 ***
## avg_revsport 0.134777   0.025555   5.274 2.00e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1471 on 495 degrees of freedom
##   (4 observations deleted due to missingness)
## Multiple R-squared:  0.0532, Adjusted R-squared:  0.05129 
## F-statistic: 27.81 on 1 and 495 DF,  p-value: 1.999e-07

It looks like most of the combined model’s relationship was actually driven by FCS schools (slope of about 0.135, close to the combined 0.124), while at FBS schools changes in football revenue have much less of an impact (slope of about 0.044).
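
A more direct way to test whether the two slopes differ is a single model with a classification interaction (a sketch; compare_exp_both and division are names introduced here):

# stack the two percentage-based summaries and test the slope difference
compare_exp_both <- bind_rows(FBS = compare_exp_FBS, FCS = compare_exp_FCS,
                              .id = "division")
model_int <- lm(avg_other_sports ~ avg_revsport * division, data = compare_exp_both)
summary(model_int)  # the avg_revsport:divisionFCS term is the slope difference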

Evaluating These Models

Residual Histogram
gg_reshist(model)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

gg_reshist(model_FBS_pct)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

gg_reshist(model_FCS_pct)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Put together, the two classifications produce a reasonably bell-shaped residual distribution. Separated, each model has only small issues with its shape (FBS the most), but nothing screams that something big we aren’t already aware of is missing.
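
As a numeric complement to the histograms, base R’s Shapiro-Wilk test can be run on the pooled model’s residuals (a sketch; it handles samples up to 5,000):

# a tiny p-value would confirm the mild departures from normality in the tails
shapiro.test(residuals(model))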

Residuals vs Fitted Values
gg_resfitted(model) +
  geom_smooth(se=FALSE)
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

gg_resfitted(model_FBS_pct) +
  geom_smooth(se=FALSE)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

gg_resfitted(model_FCS_pct) +
  geom_smooth(se=FALSE)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

FCS and the combined model don’t look terrible. Towards the right, the points are too sparse to tell whether the spread stays consistent. But in FBS we see problems: there looks to be a fanning effect, which would violate the homoscedasticity assumption and make the results unreliable. Luckily, in a way, this actually further supports the conclusion for FBS schools: changes in football revenue have little effect on changes in non-revenue sport funding.
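
The fanning can also be checked formally with a Breusch-Pagan test (a sketch; it assumes the lmtest package is installed, which this document doesn’t otherwise use):

# small p-value = non-constant residual variance, matching the visual fanning
lmtest::bptest(model_FBS_pct)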

Another Approach

While doing additional work on this dataset, I thought it might be better to look at total dollar changes rather than just percentage changes. So I’m going to build the same model, but with total dollar changes.

compare2 <- ncaa |>
  filter(classification_code %in% c(1,2,3)) |>
  arrange(institution_name, men, sports, year) |>
  mutate(revenue_change_men = (rev_men - lag(rev_men)),
         revenue_change_women = (rev_women - lag(rev_women)),
         expense_change_men = (exp_men - lag(exp_men)),
         expense_change_women = (exp_women - lag(exp_women)),
         partic_men_change = (sum_partic_men - lag(sum_partic_men)),
         partic_women_change = (sum_partic_women - lag(sum_partic_women)),
         student_change = (ef_total_count - lag(ef_total_count)),
         profit_change = (profit - lag(profit))
  )
compare_change2 <- compare2 |>
  filter(year > 2015) |>
  mutate(rev_change = rowSums(across(c(revenue_change_men, 
                                       revenue_change_women)), na.rm = TRUE),
         exp_change = rowSums(across(c(expense_change_men, 
                                       expense_change_women)), na.rm = TRUE),
         athlete_change = rowSums(across(c(partic_men_change, 
                                        partic_women_change)), na.rm = TRUE)
         ) |>
  reframe(year, institution_name, classification_code, sports, profit,
          rev_change, exp_change, athlete_change, student_change, profit_change)
compare_exp2 <- compare_change2 |>
  group_by(year, institution_name) |>
  summarise(avg_revsport = mean(rev_change[sports %in% 
                                        c("Football")]),
            avg_other_sports = mean(exp_change[!sports %in% 
                                        c("Football", "Basketball")])
            )
## `summarise()` has grouped output by 'year'. You can override using the
## `.groups` argument.
# data we're basing model off of 
compare_exp2 |>
  filter(year > 2015) |>
  ggplot() +
  geom_point(mapping = aes(x = avg_revsport, y=avg_other_sports))
## Warning: Removed 390 rows containing missing values or values outside the scale range
## (`geom_point()`).

model_new <- lm(avg_other_sports ~ avg_revsport, compare_exp2)
summary(model_new)
## 
## Call:
## lm(formula = avg_other_sports ~ avg_revsport, data = compare_exp2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -624505  -32739    3685   40289  564915 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.303e+04  3.008e+03   4.333 1.62e-05 ***
## avg_revsport 1.540e-03  7.748e-04   1.987   0.0472 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 92940 on 1001 degrees of freedom
##   (390 observations deleted due to missingness)
## Multiple R-squared:  0.003929,   Adjusted R-squared:  0.002934 
## F-statistic: 3.948 on 1 and 1001 DF,  p-value: 0.04719

Interestingly, what I thought was a blunder on my end for not thinking of it sooner turns out to explain even less of the variation than the initial percentage-based approach (R-squared of roughly 0.004 versus 0.045). The slope is also economically tiny: about $1.54 of additional non-revenue expense per $1,000 of additional football revenue.

Separating FBS and FCS

FBS
# data we're basing model off of 
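# note: this overwrites the percentage-based compare_FBS / compare_exp_FBS
# objects from earlier; model_FBS_pct was already fit, so nothing above breaks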
compare_FBS <- ncaa |>
  filter(classification_code == 1) |>
  arrange(institution_name, men, sports, year) |>
  mutate(revenue_change_men = (rev_men - lag(rev_men)),
         revenue_change_women = (rev_women - lag(rev_women)),
         expense_change_men = (exp_men - lag(exp_men)),
         expense_change_women = (exp_women - lag(exp_women)),
         partic_men_change = (sum_partic_men - lag(sum_partic_men)),
         partic_women_change = (sum_partic_women - lag(sum_partic_women)),
         student_change = (ef_total_count - lag(ef_total_count)),
         profit_change = (profit - lag(profit))
  )
  
compare_change_FBS <- compare_FBS |>
  filter(year > 2015) |>
  mutate(rev_change = rowSums(across(c(revenue_change_men, 
                                       revenue_change_women)), na.rm = TRUE),
         exp_change = rowSums(across(c(expense_change_men, 
                                       expense_change_women)), na.rm = TRUE),
         athlete_change = rowSums(across(c(partic_men_change, 
                                        partic_women_change)), na.rm = TRUE)
         ) |>
  reframe(year, institution_name, classification_code, sports, profit,
          rev_change, exp_change, athlete_change, student_change, profit_change)

compare_exp_FBS <- compare_change_FBS |>
  group_by(year, institution_name) |>
  summarise(avg_revsport = mean(rev_change[sports %in% 
                                        c("Football")]),
            avg_other_sports = mean(exp_change[!sports %in% 
                                        c("Football", "Basketball")])
            )
## `summarise()` has grouped output by 'year'. You can override using the
## `.groups` argument.
# FBS
compare_exp_FBS |>
  ggplot() +
  geom_point(mapping = aes(x = avg_revsport, y=avg_other_sports))
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).

model_FBS <- lm(avg_other_sports ~ avg_revsport, compare_exp_FBS)
summary(model_FBS)
## 
## Call:
## lm(formula = avg_other_sports ~ avg_revsport, data = compare_exp_FBS)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1142549   -48578    12383    63919   548694 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   1.940e+04  6.158e+03   3.151  0.00172 **
## avg_revsport -1.212e-04  1.131e-03  -0.107  0.91472   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 132500 on 504 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  2.278e-05,  Adjusted R-squared:  -0.001961 
## F-statistic: 0.01148 on 1 and 504 DF,  p-value: 0.9147

FCS
# data we're basing model off of 
compare_FCS <- ncaa |>
  filter(classification_code == 2) |>
  arrange(institution_name, men, sports, year) |>
  mutate(revenue_change_men = (rev_men - lag(rev_men)),
         revenue_change_women = (rev_women - lag(rev_women)),
         expense_change_men = (exp_men - lag(exp_men)),
         expense_change_women = (exp_women - lag(exp_women)),
         partic_men_change = (sum_partic_men - lag(sum_partic_men)),
         partic_women_change = (sum_partic_women - lag(sum_partic_women)),
         student_change = (ef_total_count - lag(ef_total_count)),
         profit_change = (profit - lag(profit))
  )
  
compare_change_FCS <- compare_FCS |>
  filter(year > 2015) |>
  mutate(rev_change = rowSums(across(c(revenue_change_men, 
                                       revenue_change_women)), na.rm = TRUE),
         exp_change = rowSums(across(c(expense_change_men, 
                                       expense_change_women)), na.rm = TRUE),
         athlete_change = rowSums(across(c(partic_men_change, 
                                        partic_women_change)), na.rm = TRUE)
         ) |>
  reframe(year, institution_name, classification_code, sports, profit,
          rev_change, exp_change, athlete_change, student_change, profit_change)

compare_exp_FCS <- compare_change_FCS |>
  group_by(year, institution_name) |>
  summarise(avg_revsport = mean(rev_change[sports %in% 
                                        c("Football")]),
            avg_other_sports = mean(exp_change[!sports %in% 
                                        c("Football", "Basketball")])
            )
## `summarise()` has grouped output by 'year'. You can override using the
## `.groups` argument.
# FCS
compare_exp_FCS |>
  ggplot() +
  geom_point(mapping = aes(x = avg_revsport, y=avg_other_sports))
## Warning: Removed 4 rows containing missing values or values outside the scale range
## (`geom_point()`).

model_FCS <- lm(avg_other_sports ~ avg_revsport, compare_exp_FCS)
summary(model_FCS)
## 
## Call:
## lm(formula = avg_other_sports ~ avg_revsport, data = compare_exp_FCS)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -571018  -16699    9258   32079  207728 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   4.106e+03  3.167e+03   1.296    0.195    
## avg_revsport -1.613e-02  3.934e-03  -4.099 4.84e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 69380 on 495 degrees of freedom
##   (4 observations deleted due to missingness)
## Multiple R-squared:  0.03283,    Adjusted R-squared:  0.03088 
## F-statistic:  16.8 on 1 and 495 DF,  p-value: 4.842e-05

Unsurprisingly, these aren’t effective. There are several very obvious problems; even visually, you can tell these don’t behave the way we’d expect a linear regression to.
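
Given the heavy tails, one hedge worth trying is a robust fit that downweights extreme dollar changes (a sketch; MASS ships with R but isn’t loaded elsewhere in this document):

# Huber-weighted robust regression shrinks the influence of outlying residuals
rlm_FCS <- MASS::rlm(avg_other_sports ~ avg_revsport, data = compare_exp_FCS)
summary(rlm_FCS)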

Analyzing the models

# initial model used
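# (`model` is the combined percentage-change model fit earlier; it's shown
# again here as a baseline for the dollar-change models)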
gg_resfitted(model) +
  geom_smooth(se=FALSE)
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

The residuals here look fairly well behaved, centered near zero with a small right tail. Let’s look at the other models…

gg_resfitted(model_FCS) +
  geom_smooth(se=FALSE)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

gg_resfitted(model_FBS) +
  geom_smooth(se=FALSE)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

The initial model looks pretty good, FBS isn’t too bad, and FCS clearly has far more residual variation than we’d like.

gg_reshist(model)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

We can see again here that there are some outliers and maybe a small right tail, but the residuals are reasonably close to normally distributed.

gg_reshist(model_FBS)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

gg_reshist(model_FCS)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

We start getting a lot more problems with FBS and FCS, though. There are heavy tails, and with FCS the center of the distribution leans noticeably to the right. It just doesn’t distribute as well as it could. Compared to the initial model, these are inferior. For all of the models, though, it’s clear that we are missing other explanatory variables.

I suspect useful additions would be donations to the university, the performance of athletic funds, and a way to measure “success” in non-revenue sport performance. These, along with perhaps others, could make these models a bit better, but the data we have is insufficient to build a useful model from dollar-amount changes.
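
One candidate we do already have is enrollment change; a sketch of carrying it into the per-school summary and the model (compare_exp2_ext and avg_student_change are names introduced here):

compare_exp2_ext <- compare_change2 |>
  group_by(year, institution_name) |>
  summarise(avg_revsport = mean(rev_change[sports %in% c("Football")]),
            avg_other_sports = mean(exp_change[!sports %in%
                                        c("Football", "Basketball")]),
            avg_student_change = mean(student_change, na.rm = TRUE),
            .groups = "drop")
model_ext <- lm(avg_other_sports ~ avg_revsport + avg_student_change,
                data = compare_exp2_ext)
summary(model_ext)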

Conclusion / Recommendations

Athletes and Recruits

If you are a non-revenue sport athlete, it’s pretty clear that going to an FBS school will likely mean your program gets more funding than at a non-FBS school. There are still unknowns here, such as how those resources are distributed to athletes, but this isn’t likely to cause a big difference. It’s also unknown whether greater expenses translate into the kind of funding athletes prioritize. Additional expenses could mean traveling to competitions by plane instead of bus, but that doesn’t necessarily mean the competitions will be of higher quality. Extra expenses could mean your head coach gets a bonus for winning a conference championship, which is certainly good for the team, but the coaching quality hasn’t necessarily changed during that season.

As well, your program’s funding can vary dramatically more if you go to a smaller school than a bigger FBS school. If your football team performs terribly and you go to an FBS school, you might not see a big dip in funding that year. If you go to a smaller FCS school, though, you may feel the budget cuts compared to the prior year.

Fund Allocators

This applies to donors and to directors or board members of athletic departments. If you are deciding where to allocate funds at an FCS school and want to spread the benefit as broadly as possible, you might default to spending money on revenue programs like football. If you’re in the same situation at an FBS school, you may want to spread resources more evenly.

We can see two additional insights from this:

  1. The higher the ROI (in increased football revenue) on additional funding given to the football program, the more beneficial it is for all members of the athletic department to allocate more funds to football. Similarly, if the ROI on football revenue is low, funds may need to be distributed more equally across sports for all teams to see similar benefits.

  2. There are MANY more factors involved here. Football revenue alone explains only a small fraction of the variation, so the interaction between sports teams isn’t that significant on its own. Although it can be valuable to consider how other sporting programs are impacted by the growth of a football team, other factors need to be considered to make accurate, informed decisions. Examples include changes in endowment size, government budgets, institutional performance, program performance, and likely others.