# This works to get rid of errors
library(conflicted)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
conflict_prefer("filter", "dplyr")
## [conflicted] Will prefer dplyr::filter over any other package.
conflict_prefer("lag", "dplyr")
## [conflicted] Will prefer dplyr::lag over any other package.
library(boot)
# load ncaa file I cleaned
ncaa <- read.csv("./ncaa_clean.csv", header = TRUE)
Explanatory: Total Students, numeric discrete
Response: Revenue, numeric continuous
# groups information by school
schools <- ncaa |>
group_by(institution_name) |>
summarise(men = mean(ef_male_count, na.rm = TRUE),
women = mean(ef_female_count, na.rm = TRUE),
rev_men = sum(rev_men, na.rm = TRUE),
rev_wom = sum(rev_women, na.rm = TRUE))
# creates new columns that add up men and women values
schools$students = schools$men + schools$women
schools$revenue = (schools$rev_men + schools$rev_wom)/5 # /5 to average over 5 years
This should make a table that has average total students and average total revenue for each school. Revenue will be understated for schools that recently created an athletic department, but this should be a minority of data.
schools |>
ggplot() +
geom_point(mapping = aes(x = students, y = revenue)) +
labs(title = "University Students to Athletic Revenue",
x = "Number of Students", y = "Average Annual Athletic Revenue") +
theme_classic()
The y axis looks distorted, so I’m going to break this into two separate categories: one for under $5m and one for over $10m. “Big” and “Small” are based on revenue, not number of students.
# schools with revenue < $5m
small_schools <- schools |>
filter(revenue < 5000000) |>
ggplot() +
geom_point(mapping = aes(x = students, y = revenue)) +
labs(title = "University Students to Athletic Revenue",
x = "Number of Students", y = "Average Annual Athletic Revenue") +
theme_classic()
small_schools
# schools with revenue > $10m
big_schools <- schools |>
filter(revenue > 10000000) |>
ggplot() +
geom_point(mapping = aes(x = students, y = revenue)) +
labs(title = "University Students to Athletic Revenue",
x = "Number of Students", y = "Average Annual Athletic Revenue") +
theme_classic()
big_schools
I’m going to start by looking at big schools:
schools |>
filter(revenue > 10000000) |>
ggplot() +
geom_boxplot(mapping = aes(y = revenue))
print(schools$institution_name[which.max(schools$revenue)])
## [1] "The University of Texas at Austin"
print(schools$institution_name[which.max(schools$students)])
## [1] "Texas A & M University-College Station"
For big schools, there isn’t really any outliers, especially considering some of these values like revenue increase exponentially, and there isn’t one value that is substantially higher than others. For school size, Penn State is likely the only true outlier in a league of its own. However, the interesting part of this is that Penn isn’t considered the largest university in the USA. Far from it. So it’s interesting that there is such a gap, because you wouldn’t expect there to be.
For schools with less than $5m in revenue…
schools |>
filter(revenue < 5000000) |>
ggplot() +
geom_boxplot(mapping = aes(y = students))
We can see there are probably quite a bit of outliers, but then again when you look at the data holistically, there isn’t one or a few schools that really stand out.
small <- subset(schools, revenue < 5000000)
head(small[order(-small$students), ],10)
## # A tibble: 10 × 7
## institution_name men women rev_men rev_wom students revenue
## <chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 San Francisco State University 9360. 12348. 7.70e6 8.32e6 21708. 3.20e6
## 2 California State Polytechnic … 11505. 10035. 1.01e7 8.38e6 21539. 3.70e6
## 3 California State University-L… 8496. 11845. 9.75e6 1.14e7 20341. 4.22e6
## 4 University of California-Sant… 8371. 8361. 4.35e6 4.65e6 16733. 1.80e6
## 5 California State University-S… 6396. 9841. 8.71e6 1.05e7 16237. 3.83e6
## 6 The University of Texas at Da… 8720. 6660. 3.87e6 4.22e6 15380. 1.62e6
## 7 California State University-C… 7104. 7993. 1.01e7 9.30e6 15097. 3.88e6
## 8 Montclair State University 5663. 9093. 6.92e6 6.13e6 14756. 2.61e6
## 9 SUNY at Binghamton 6777 6335 1.06e7 8.80e6 13112 3.88e6
## 10 Rowan University 7226. 5885. 8.73e6 7.69e6 13111. 3.28e6
Interestingly, most of these schools that do stand out are in California. I’m not quite sure why this is. Maybe its because California is a very large state in terms of population with a high amount of people attending colleges, but despite having high attendance many Californians opt to support other, more popular schools in California like Cal or Stanford. This would just be a guess, and this likely can’t be fully explained with the data available in this data set.
round(cor(schools$students, schools$revenue), 2)
## [1] 0.71
small = schools |> filter(revenue < 5000000)
large = schools |> filter(revenue > 10000000)
round(cor(small$students, small$revenue), 2)
## [1] 0.19
round(cor(large$students, large$revenue), 2)
## [1] 0.6
When broken down between small and large, this makes a lot of sense based on the visualizations. The majority of the data seemed to be a vertical line when looking at students and revenue for the smaller revenue schools, whereas there seemed to be a clear, positive correlation with bigger schools. The interesting part I think was the even larger correlation when the data was put together. I thought, for reasons explained above, that this wouldn’t be the case, and that was my desire for breaking the data up into large and small groups.
I’m curious if I broke this down further, such as separating schools between divisions, if the correlation would decrease. I initially thought it would increase, but I also thought the example above would too, so now I think I have substantial doubt if it will be true or not.
schools_mean <- mean(schools$revenue)
schools_sd <- sd(schools$revenue)
schools_mean
## [1] 10890463
schools_sd
## [1] 20449235
# we use a "_" so we don't overwrite the original function
boot_ci <- function (v, func = median, conf = 0.95, n_iter = 1000) {
# the `boot` library needs the function in this format
boot_func <- \(x, i) func(x[i], na.rm=TRUE)
b <- boot(v, boot_func, R = n_iter)
boot.ci(b, conf = conf, type = "perc")
}
boot_ci(schools$revenue, mean, 0.95)
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 1000 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = b, conf = conf, type = "perc")
##
## Intervals :
## Level Percentile
## 95% ( 9709287, 12177113 )
## Calculations and Intervals on Original Scale
se = schools_sd / sqrt(length(schools$revenue))
print(schools_mean + (1.96 * se))
## [1] 12077025
print(schools_mean - (1.96 * se))
## [1] 9703902
Using a bootstrapped approach and Z-scores yielded similar results for a confidence interval of the response variable of revenue. This means that the true population mean for revenue will fall between roughly $9.7m and $12.1m 95% of the time.
Expenses: explanatory
Revenue: response
But also… maybe the opposite
It is important to note that I hypothesis that this is the relationship between these two variables. For some sports I am more confident that this relationship exists than others, and for some schools I think the opposite might actually hold true. Sometimes the more you spend, the more revenue you can get. Sometimes the more revenue you get, the more expenditures you can make, and this is primarily driven by the non-profit nature of many institutions. I think there is a fair argument that both variables might be both response and explanatory variables depending on the institution in question.
# groups information by school
schools2 <- ncaa |>
group_by(institution_name) |>
summarise(exp_men = sum(exp_men, na.rm = TRUE),
exp_women = sum(exp_women, na.rm = TRUE),
rev_men = sum(rev_men, na.rm = TRUE),
rev_wom = sum(rev_women, na.rm = TRUE))
# creates new columns that add up men and women values
# /5 to average over 5 years
schools2$expenses = (schools2$exp_men + schools2$exp_women)/5
schools2$revenue = (schools2$rev_men + schools2$rev_wom)/5
schools2 |>
ggplot() +
geom_point(mapping = aes(x = expenses, y = revenue)) +
labs(title = "Athletic Revenue and Expenses",
x = "Average Annual Athletic Expenses", y = "Average Annual Athletic Revenue") +
theme_classic()
This data looks to be very positively correlated, much more so than I anticipated. The further right we go, the more spread out some of the figures look, and some of these outliers I will touch on below. I think it would look a bit more interesting as we zoom in, especially with smaller programs, but overall nothing stands out of the ordinary.
schools2 |>
ggplot() +
geom_boxplot(mapping = aes(y = revenue))
I like looking at these box plots because it shows that there are a lot of “outlier” schools when looking at the entirety of the NCAA. Unsurprisingly, these tend to be schools you’ve heard of with gigantic, and ultra successful athletic programs. Here’s some examples:
top_schools = schools2[order(-schools2$revenue), ]
head(top_schools$institution_name, 5)
## [1] "The University of Texas at Austin" "University of Michigan-Ann Arbor"
## [3] "Ohio State University-Main Campus" "The University of Alabama"
## [5] "University of Georgia"
Texas, Michigan, Ohio, Alabama, and Georgia. Unsurprisingly, these are some of the top schools consistently in Football, and even their less popular sports churn out Olympians and national champions frequently. In terms of the entire NCAA, these top Division 1 schools are all outliers. However, the only real outlier among these would be Texas at Austin. Their revenue is crazy, even among Division 1 schools.
round(cor(schools2$expenses, schools2$revenue), 2)
## [1] 0.98
That came as little surprise given the visualization we saw earlier. With how highly correlated these two variables are, it makes me think that the response variable is actually expenses.
low_rev = filter(schools2, revenue < 1000000)
round(cor(low_rev$expenses, low_rev$revenue), 4)
## [1] 0.9973
It’s almost too perfect. I know some teams have at least some fixed expenses to carry on their sport, but revenue is not guaranteed. In fact, revenue is always not guaranteed as it can fluctuate dramatically depending on success in different championships. Since this is almost perfectly correlated though, it really makes me think that as schools get revenue, they spend that revenue on their programs. Additionally, I think if programs make excess revenue, they very efficiently share that revenue with other programs. I would love to have additional data to separate revenue and expenses before any sharing programs happened, but until I can scrutinize that data this is my going hypothesis on why this correlation is nearly perfect.
schools2_mean <- mean(schools2$expenses)
schools2_sd <- sd(schools2$expenses)
schools2_mean
## [1] 9955960
schools2_sd
## [1] 16635412
boot_ci(schools2$expenses, mean, 0.95)
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 1000 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = b, conf = conf, type = "perc")
##
## Intervals :
## Level Percentile
## 95% ( 9042634, 10911197 )
## Calculations and Intervals on Original Scale
se = schools2_sd / sqrt(length(schools2$expenses))
print(schools2_mean + (1.96 * se))
## [1] 10921225
print(schools2_mean - (1.96 * se))
## [1] 8990694
Using a bootstrapped approach and Z-scores yielded similar results for a confidence interval of the response variable of revenue. This means that the true population mean for expenses for an institution will fall between roughly $9m and $11m 95% of the time.