library(conflicted)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
conflict_prefer("filter", "dplyr")
## [conflicted] Will prefer dplyr::filter over any other package.
conflict_prefer("lag", "dplyr")
## [conflicted] Will prefer dplyr::lag over any other package.
Link to original data: https://github.com/rfordatascience/tidytuesday/blob/master/data/2022/2022-03-29/readme.md
This data shows every postsecondary institution's athletic program information for the years 2015-2019. Additional documentation can be found via the link as well.
When choosing this data, I was curious how different members of the NCAA impact each other. Being a Division I athlete myself, it's common to hear ideas like revenue sharing, where revenue sports such as basketball and football share excess revenues with non-revenue sports; this is how most NCAA programs are said to get their funding. It's also common to root for the success of your basketball and football teams, because the more revenue they make from doing well in games, the more money they have to share with your sport. This left me with a few questions I wanted to answer…
In other words, I am generally trying to figure out how much non-revenue sports are impacted by the success of their institution’s revenue sports.
We can start by taking a look at our data. To meet requirements, I limited the csv file to only include sports from the NCAA. Here’s what this data looks like:
ncaa_original <- read.csv('ncaa_sports_2.csv', header=TRUE)
colnames(ncaa_original)
## [1] "year" "unitid" "institution_name"
## [4] "city_txt" "state_cd" "zip_text"
## [7] "classification_code" "classification_name" "classification_other"
## [10] "ef_male_count" "ef_female_count" "ef_total_count"
## [13] "sector_cd" "sector_name" "sportscode"
## [16] "partic_men" "partic_women" "partic_coed_men"
## [19] "partic_coed_women" "sum_partic_men" "sum_partic_women"
## [22] "rev_men" "rev_women" "total_rev_menwomen"
## [25] "exp_men" "exp_women" "total_exp_menwomen"
## [28] "sports"
Looking at the documentation, not much additional information can be gleaned. Most columns are pretty self-explanatory: partic_men is how many participating men are on the team, rev_women is how much revenue the women's team made, and sports is the sport the row describes. Slightly less obvious are the ef_ counts, which record how many students attend the university, and unitid, which is the university's unique ID. A few questions arise the deeper you look into these, and some of those concerns will come up soon.
However, there is a problem with the data as it's currently formatted: men and women are included in the same row. This is misleading, because even though the "sports" share a name (take Basketball as an example), that doesn't mean they are the same sport. Men's Basketball and Women's Basketball have different rules and program structures, so even though the data puts them together in the same row, this is an inappropriate join.

There were also a lot of empty rows: sports with zero athletes, zero revenue, and zero expenses. These had to be removed as well.
I created a cleaning file to get rid of these problems, which you can find here: https://rpubs.com/tbreedy/1222442.
This removed the empty rows and split men's and women's sports, with their accompanying revenues, expenses, etc., into separate rows. We now have a data frame we can work with.
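To give a sense of what that cleaning involves, here is a rough sketch of the reshape (simplified; the linked file is authoritative, and the column names come from the list above):
# sketch: keep a men's copy and a women's copy of each row,
# blank out the other gender's figures, drop rows with no activity
men_rows <- ncaa_original |>
  filter(sum_partic_men > 0 | rev_men > 0 | exp_men > 0) |>
  mutate(rev_women = NA_real_, exp_women = NA_real_, sum_partic_women = NA_real_)
women_rows <- ncaa_original |>
  filter(sum_partic_women > 0 | rev_women > 0 | exp_women > 0) |>
  mutate(rev_men = NA_real_, exp_men = NA_real_, sum_partic_men = NA_real_)
ncaa_sketch <- bind_rows(men_rows, women_rows)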
ncaa <- read.csv("./ncaa_clean.csv", header = TRUE)
Additions of new programs can make year-over-year comparisons tricky. This code shows how often institutions created or cut individual programs, or created or cut their entire athletic departments.
program_count <- ncaa |>
filter(classification_code %in% c(1,2,3)) |>
group_by(institution_name, sports) |>
summarise(count = n())
## `summarise()` has grouped output by 'institution_name'. You can override using
## the `.groups` argument.
program_change <- program_count |>
filter(count %% 5 != 0)
program_change |>
group_by(sports) |>
summarise(change = n()) |>
ggplot() +
geom_point(mapping = aes(x = sports, y=change)) +
geom_text(mapping = aes(x = sports, y = change, label=sports),
size = 3, nudge_y = 2, check_overlap = TRUE) +
theme(axis.text.x = element_blank()) +
labs(x = "sports",
y = "Frequency a sport wasn't offered every year")
We can see that virtually every sport shows some change during this time period, likely reflecting institutions adding a program, cutting one, or moving up from a different division. This poses an interesting question: how do we filter for year-over-year comparisons? One simple option is sketched below, and my actual handling will be explained in detail later.
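For instance, a minimal sketch (not the filter used later in this analysis) that keeps only programs observed in every year of the window:
# keep only institution/sport pairs that appear in all five years (2015-2019),
# so year-over-year changes are always within a continuously offered program
stable_programs <- ncaa |>
  filter(classification_code %in% c(1, 2, 3)) |>
  group_by(institution_name, sports) |>
  filter(n_distinct(year) == 5) |>
  ungroup()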
Some of the data I want to analyze involves metrics on a "per athlete" basis. The concern here is that this is much more complicated than you might first think. How many athletes are on a program? The number changes throughout the season. Mid-season additions are uncommon (someone transferring mid-season, a student beginning attendance in the spring semester, or an existing student being recruited), but removals are almost inevitable: athletes quit, get cut, and transfer relatively frequently. Further inquiry into the documentation shows that athlete counts are supposed to represent the number of athletes on the team as of the first day of competition, but we'll see below that this doesn't hold.
# filters for all D1 football schools and gets the "total number of men" each year
ncaa_football_d1 <- ncaa |>
filter(sports == 'Football') |>
filter(classification_code == 1 | classification_code == 2 |
classification_code == 3) |>
group_by(institution_name, year) |>
summarise(fb_players = sum(sum_partic_men))
## `summarise()` has grouped output by 'institution_name'. You can override using
## the `.groups` argument.
ncaa_football_d1 |>
ggplot() +
geom_histogram(mapping = aes(x = fb_players)) +
labs(title = "Total Number of Male Athletes on D1 Football Teams",
x = "Total Number of Male Athletes per Year",
y = "Frequency") +
geom_vline(xintercept = 120, color='red') +
theme_classic()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
During this time frame, the number of athletes allowed on a Division I football team was 120; this is where the red line sits. We'd expect the histogram to peak around this number or slightly below, which sort of happens, but to a much smaller extent than expected. We also see occurrences to the right of the line, which should be impossible.
It's implausible that teams are systematically exceeding the roster threshold: Division I football teams are highly regulated, and violations like this don't slip through, let alone affect over 30% of teams. I contacted the Office of Postsecondary Education, the government organization responsible for these figures, and asked about the discrepancy. They confirmed that these numbers should be accurate and pointed me to additional documentation.

This leaves us in a pickle. The data contains values over 120, so despite what I was told, these numbers can't all be correct. My best guess is that reporting errors crept into the data somewhere in the collection process.
Either way, this might suggest a systematic problem in our data, and not one we can exactly correct for. We could cap athlete counts at the NCAA limit, but that wouldn't fix institutions whose reported counts sit below the threshold yet are still inflated relative to their actual rosters. If certain universities are more likely to have over-reported, this could skew comparisons between groups. And there's nothing to suggest this problem isn't prevalent in other sports.

To make matters worse, finding these errors is a near-impossible task. Football is pretty much the only sport with a roster cap; other sports just have scholarship caps. A program like basketball had a limited number of scholarships to give, but an effectively unlimited roster that could be filled with walk-ons. Is a basketball team larger than it should be, or does the school just have a lot of walk-ons? That's not something we can determine from our data, nor something we can meaningfully correct for.

I don't think "per athlete" statistics are useless, but they need to be taken with a grain of salt. P-values will need to be substantially smaller than they otherwise would be, and these statistics shouldn't be used to prove anything by themselves; additional context will be critical when interpreting results.
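Before moving on, it's easy to quantify how widespread the over-cap reporting is, using the ncaa_football_d1 frame built above (120 being the cap cited earlier):
# share of D1 institution-years reporting more than 120 football players
ncaa_football_d1 |>
  ungroup() |>
  summarise(over_cap_share = mean(fb_players > 120, na.rm = TRUE))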
ncaa |>
filter(classification_code %in% c(1,2,3)) |>
group_by(sports) |>
summarise(revenue = sum(rev_men, na.rm=TRUE)
+ sum(rev_women, na.rm=TRUE)) |>
ggplot() +
geom_point(mapping = aes(x = sports, y = revenue / 5)) +
geom_text(mapping = aes(x = sports, y = revenue / 5 + 150000000, label=sports),
size = 3, nudge_y = 0, check_overlap = TRUE) +
theme(axis.text.x = element_blank()) +
labs(x = "Sport",
y = "Total Revenue per Year (Dollars)",
title = "Revenue by Sport") +
theme(
panel.background = element_rect(fill = "white", color = NA),
panel.grid.minor = element_blank(),
panel.grid.major = element_blank(),
)
This is why football and basketball are considered "revenue sports": they make up the vast majority of all of the NCAA's revenue. Most sports make virtually nothing; these are the non-revenue sports, and most of them traditionally rely on external funding from revenue sharing, endowments, fundraising, and other financing options to cover their expenses.
ncaa |>
filter(classification_code %in% c(1,2,3)) |>
group_by(sports) |>
summarise(athletes = sum(sum_partic_men, na.rm=TRUE)
+ sum(sum_partic_women, na.rm=TRUE)) |>
ggplot() +
geom_point(mapping = aes(x = sports, y = athletes / 5)) +
geom_text(mapping = aes(x = sports, y = athletes / 5 + 2000, label=sports),
size = 3, nudge_y = 0, check_overlap = TRUE) +
theme(axis.text.x = element_blank()) +
labs(x = "Sport",
y = "Number of Athletes",
title = "Number of Athletes per Sport per Year") +
theme(
panel.background = element_rect(fill = "white", color = NA),
panel.grid.minor = element_blank(),
panel.grid.major = element_blank(),
)
The number of athletes varies widely too, and isn't necessarily correlated with revenue. Football is still high on the list, but track and field has by far the most athletes. Although most sports are still relatively small, the inequality is much smaller than it was for revenue.
# get total exp for men and women
schools <- ncaa |>
filter(year == 2019) |>
filter(classification_code %in% c(1,2,3)) |>
group_by(institution_name) |>
summarise(revenue = sum(rev_men, na.rm = TRUE) +
sum(rev_women, na.rm = TRUE),
expense = sum(exp_men, na.rm = TRUE) +
sum(exp_women, na.rm = TRUE),
athletes = sum(sum_partic_men, na.rm=TRUE) +
sum(sum_partic_women, na.rm=TRUE),
division = min(classification_name))
schools |>
ggplot() +
geom_boxplot(mapping = aes(y = expense, x = division)) +
labs(x = "Division Category",
y = "Expenses",
title = "Expenses among D1 institutions")
schools |>
ggplot() +
geom_boxplot(mapping = aes(y = revenue, x = division)) +
labs(x = "Division Category",
y = "Revenue",
title = "Revenue among D1 institutions")
schools |>
ggplot() +
geom_boxplot(mapping = aes(y = athletes, x = division)) +
labs(x = "Division Category",
y = "Athletes",
title = "Athletes among D1 institutions")
Division crash course: FBS schools are your biggest D1 football institutions, FCS schools are your smaller D1 football institutions, and the third category (D1 without football) is self-explanatory. These are represented as 1, 2, and 3 in classification_code. We see pretty clearly that having a bigger football program comes with significantly more revenue and expenses, but not many more athletes. Conversely, having a smaller football program doesn't bring much more revenue than having no football program at all, but it does mean more athletes.
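Since the schools frame already carries expenses and athlete counts, a per-athlete comparison is one line away (keeping in mind the earlier data-quality caveats about athlete counts):
# average expense per athlete within each division category
schools |>
  group_by(division) |>
  summarise(exp_per_athlete = sum(expense) / sum(athletes))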
nonrev <- ncaa |>
filter(year == 2019) |>
filter(classification_code %in% c(1,2,3)) |>
filter(!(sports %in% c('Football', 'Basketball'))) |>
mutate(revenue = rowSums(across(c(rev_men, rev_women)), na.rm = TRUE),
expense = rowSums(across(c(exp_men, exp_women)), na.rm = TRUE),
profit = revenue - expense)
nonrev |>
ggplot() +
geom_point(mapping = aes(x = revenue, y = expense)) +
labs(title = "Revenues and Expenses for Non-Revenue D1 Sports (2019)",
x = "Revenues (Dollars)", y = "Expenses (Dollars)") +
geom_abline(intercept = 0, slope = 1, color = "red") +
theme_classic()
As I mentioned before, it's common to hear that non-revenue programs are held afloat by the success of revenue sports. I want to test whether this is actually the case. If a revenue sport increases its revenues, and that sport does "share" excess revenue with other programs, we might expect revenue at non-revenue programs to increase during the season of success or the season after.

How revenue sharing is reflected in our data isn't entirely clear. It's certain that some of the revenue recorded for non-revenue sports comes, at least in part, from shared pots. I assume the expenses of revenue sports include some of this sharing, but they might not. Because of this, I ended up testing both revenue and profit of revenue sports.

I'll test this null hypothesis by comparing schools whose football programs saw large increases in revenue against those with smaller increases or decreases, then checking whether the two segments' non-revenue sports show the same or different changes in revenue.
Note: an error was made in a variable when reformatting this document, so not all manually inputted numbers are up to date. The overall conclusions did not change, but I will need to fix this.
# adds a column indicating whether the row is a men's team (1) or women's (0)
ncaa$men <- ifelse(ncaa$sum_partic_men > 0, 1, 0)
# adds profit
ncaa <- ncaa |>
mutate(profit = rowSums(across(c(rev_men, rev_women)), na.rm = TRUE) -
rowSums(across(c(exp_men, exp_women)), na.rm = TRUE) )
compare <- ncaa |>
filter(classification_code %in% c(1,2,3)) |>
arrange(institution_name, men, sports, year) |>
mutate(revenue_change_men = (rev_men - lag(rev_men))/lag(rev_men),
revenue_change_women = (rev_women - lag(rev_women))/lag(rev_women),
expense_change_men = (exp_men - lag(exp_men))/lag(exp_men),
expense_change_women = (exp_women - lag(exp_women))/lag(exp_women),
partic_men_change = (sum_partic_men - lag(sum_partic_men))/
lag(sum_partic_men),
partic_women_change = (sum_partic_women - lag(sum_partic_women))/
lag(sum_partic_women),
student_change = (ef_total_count - lag(ef_total_count)) /
lag(ef_total_count),
profit_change = (profit - lag(profit))
# this one is a number, not percentage
)
compare |>
filter(institution_name == 'The University of Alabama') |>
filter(sports == 'Basketball') |>
reframe(exp_women, expense_change_women)
## exp_women expense_change_women
## 1 NA NA
## 2 NA NA
## 3 NA NA
## 4 NA NA
## 5 NA NA
## 6 3550880 0.45494298
## 7 4042959 0.13857945
## 8 3737069 -0.07565993
## 9 3857300 0.03217254
## 10 3805848 -0.01333886
We can see that this code works; the only issue is that each program's 2015 row computes its change from the previous program's 2019 row, since lag() here runs over the whole arranged frame. When we do our analysis, we'll just need to filter out 2015. I'll also create new columns that drop the men-vs-women split for easier comparison.

I might also want to compare everything against the mean, but we can get around to that later.
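As an aside, here is a sketch of an alternative that avoids the cross-program bleed entirely (not the approach used below): grouping before the mutate makes lag() restart at each program, so the first year of every program is NA automatically.
# lag() computed within each institution/gender/sport group
compare_grouped <- ncaa |>
  filter(classification_code %in% c(1, 2, 3)) |>
  group_by(institution_name, men, sports) |>
  arrange(year, .by_group = TRUE) |>
  mutate(revenue_change_men = (rev_men - lag(rev_men)) / lag(rev_men)) |>
  ungroup()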
compare_change <- compare |>
filter(year > 2015) |>
mutate(rev_change = rowSums(across(c(revenue_change_men,
revenue_change_women)), na.rm = TRUE),
exp_change = rowSums(across(c(expense_change_men,
expense_change_women)), na.rm = TRUE),
athlete_change = rowSums(across(c(partic_men_change,
partic_women_change)), na.rm = TRUE)
) |>
reframe(year, institution_name, classification_code, sports, profit,
rev_change, exp_change, athlete_change, student_change, profit_change)
compare_rev <- compare_change |>
group_by(year, institution_name) |>
summarise(avg_revsport = mean(rev_change[sports %in%
c("Football")]),
avg_other_sports = mean(rev_change[!sports %in%
c("Football", "Basketball")])
)
## `summarise()` has grouped output by 'year'. You can override using the
## `.groups` argument.
If football revenue changed by x% in 2016, I want to see how non-revenue sports' revenue changed both in 2016 (the same year) and in 2017 (the trailing year). This means I can use four years of data for the same-year comparison and three years for the trailing one.
compare_rev |>
filter(year > 2015) |>
ggplot() +
geom_point(mapping = aes(x = avg_revsport, y=avg_other_sports))
## Warning: Removed 390 rows containing missing values or values outside the scale range
## (`geom_point()`).
compare_rev2 <- compare_rev |>
arrange(institution_name, year) |>
group_by(institution_name) |>
mutate(avg_other = lead(avg_other_sports)) |>
filter(year < 2019)
compare_rev2 |>
filter(year > 2016) |>
ggplot() +
geom_point(mapping = aes(x = avg_revsport, y=avg_other))
## Warning: Removed 198 rows containing missing values or values outside the scale range
## (`geom_point()`).
I'm now going to split the data into two groups: higher-than-average growth and lower-than-average growth. My null hypothesis is that the two groups are the same.
avg_growth <- mean(compare_rev$avg_revsport, na.rm = TRUE)
compare_rev$above_avg <- ifelse(compare_rev$avg_revsport >= avg_growth, 1, 0)
compare_rev$updown <- ifelse(compare_rev$avg_revsport > 0, 1, 0)
# average, no lag
compare_rev |>
filter(year > 2016) |>
group_by(above_avg) |>
summarize(sd = sd(avg_other_sports),
mean = mean(avg_other_sports))
## # A tibble: 3 × 3
## above_avg sd mean
## <dbl> <dbl> <dbl>
## 1 0 0.588 0.0805
## 2 1 4.57 0.413
## 3 NA 1.11 0.132
nrow(subset(compare_rev, above_avg == 0))
## [1] 594
nrow(subset(compare_rev, above_avg == 1))
## [1] 409
# updown, no lag
compare_rev |>
filter(year > 2016) |>
group_by(updown) |>
summarize(sd = sd(avg_other_sports),
mean = mean(avg_other_sports))
## # A tibble: 3 × 3
## updown sd mean
## <dbl> <dbl> <dbl>
## 1 0 0.746 0.0907
## 2 1 3.50 0.270
## 3 NA 1.11 0.132
# for year lag
avg_growth2 <- mean(compare_rev2$avg_revsport, na.rm = TRUE)
compare_rev2$above_avg <- ifelse(compare_rev2$avg_revsport >= avg_growth2, 1, 0)
compare_rev2$updown <- ifelse(compare_rev2$avg_revsport > 0, 1, 0)
# average, lag
compare_rev2 |>
group_by(above_avg) |>
summarize(sd = sd(avg_other, na.rm = TRUE),
mean = mean(avg_other, na.rm = TRUE))
## # A tibble: 3 × 3
## above_avg sd mean
## <dbl> <dbl> <dbl>
## 1 0 3.66 0.291
## 2 1 0.298 0.0626
## 3 NA 1.12 0.131
nrow(subset(compare_rev2, above_avg == 0)) / nrow(subset(compare_rev2, above_avg == 1))
## [1] 1.659574
# updown, lag
compare_rev2 |>
group_by(updown) |>
summarize(sd = sd(avg_other, na.rm = TRUE),
mean = mean(avg_other, na.rm = TRUE))
## # A tibble: 3 × 3
## updown sd mean
## <dbl> <dbl> <dbl>
## 1 0 0.814 0.147
## 2 1 3.39 0.228
## 3 NA 1.12 0.131
We do see a change in means, and some of the differences are rather large. Since we are only testing football, NA values correspond to schools without a football team, so they can be disregarded for now. Unfortunately, the standard deviation is also massive for the above-average groups, so although the difference may seem large, it may not mean much.
library(pwrss)
library(effsize)
cohen.d(d = filter(compare_rev, above_avg == 1) |> pluck("avg_other_sports"),
f = filter(compare_rev, above_avg == 0) |> pluck("avg_other_sports"))
##
## Cohen's d
##
## d estimate: 0.1141633 (negligible)
## 95 percent confidence interval:
## lower upper
## -0.01202254 0.24034912
cohen.d(d = filter(compare_rev, updown == 1) |> pluck("avg_other_sports"),
f = filter(compare_rev, updown == 0) |> pluck("avg_other_sports"))
##
## Cohen's d
##
## d estimate: 0.07153171 (negligible)
## 95 percent confidence interval:
## lower upper
## -0.06133009 0.20439352
cohen.d(d = na.omit(filter(compare_rev2, above_avg == 1) |> pluck("avg_other")),
f = na.omit(filter(compare_rev2, above_avg == 0) |> pluck("avg_other")))
##
## Cohen's d
##
## d estimate: -0.07897672 (negligible)
## 95 percent confidence interval:
## lower upper
## -0.22720210 0.06924866
cohen.d(d = na.omit(filter(compare_rev2, updown == 1) |> pluck("avg_other")),
f = na.omit(filter(compare_rev2, updown == 0) |> pluck("avg_other")))
##
## Cohen's d
##
## d estimate: 0.0280054 (negligible)
## 95 percent confidence interval:
## lower upper
## -0.1306508 0.1866616
This is not a great start. All of these effect sizes are negligible, which in and of itself isn't terrible, but it does mean that any effect is very small even if it turns out to be statistically significant. To make matters worse, zero lies within every confidence interval, meaning there's a reasonable chance there is no relationship between a football team's success and funding shared with the rest of the athletic department.
Do we have enough data?
I think a meaningful difference between groups would be 10%. The "cost" of getting this wrong isn't very high, so I'll keep a generous margin for error by setting both alpha and beta to 0.2.
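For a rough sense of scale before using the observed group values below, here is what pwrss says that 10-point difference would require, assuming (my assumption, eyeballed from the summary tables above) a spread of about 1.1 in both groups:
# sample size needed to detect a 0.10 difference in mean growth
# at alpha = 0.2 and power = 0.8, with an assumed sd of 1.1
pwrss.t.2means(mu1 = 0.10, mu2 = 0,
sd1 = 1.1, sd2 = 1.1,
kappa = 1, power = .8, alpha = 0.2,
alternative = "not equal")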
As an additional note, we can see that this data is roughly normally distributed.
compare_rev |>
ggplot() +
geom_histogram(mapping = aes(x=avg_revsport))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 390 rows containing non-finite outside the scale range
## (`stat_bin()`).
# no lag, above below average
pwrss.t.2means(mu1 = 0.805,
mu2 = 0.413,
sd1 = 0.588,
sd2 = 4.57,
kappa = 1.45,
power = .8, alpha = 0.2,
alternative = "not equal")
## Difference between Two means
## (Independent Samples t Test)
## H0: mu1 = mu2
## HA: mu1 != mu2
## ------------------------------
## Statistical power = 0.8
## n1 = 900
## n2 = 621
## ------------------------------
## Alternative = "not equal"
## Degrees of freedom = 1519
## Non-centrality parameter = 2.125
## Type I error rate = 0.2
## Type II error rate = 0.2
# no lag, above below 0 change
pwrss.t.2means(mu1 = 0.091,
mu2 = 0.27,
sd1 = 0.746,
sd2 = 3.5,
kappa = 1,
power = .8, alpha = 0.2,
alternative = "not equal")
## Difference between Two means
## (Independent Samples t Test)
## H0: mu1 = mu2
## HA: mu1 != mu2
## ------------------------------
## Statistical power = 0.8
## n1 = 1803
## n2 = 1803
## ------------------------------
## Alternative = "not equal"
## Degrees of freedom = 3604
## Non-centrality parameter = -2.124
## Type I error rate = 0.2
## Type II error rate = 0.2
# lag, above and below average
pwrss.t.2means(mu1 = 0.291,
mu2 = 0.0625,
sd1 = 3.661,
sd2 = 0.298,
kappa = 1,
power = .8, alpha = 0.2,
alternative = "not equal")
## Difference between Two means
## (Independent Samples t Test)
## H0: mu1 = mu2
## HA: mu1 != mu2
## ------------------------------
## Statistical power = 0.8
## n1 = 1166
## n2 = 1166
## ------------------------------
## Alternative = "not equal"
## Degrees of freedom = 2330
## Non-centrality parameter = 2.124
## Type I error rate = 0.2
## Type II error rate = 0.2
# lag, above and below 0 change
pwrss.t.2means(mu1 = 0.147,
mu2 = 0.228,
sd1 = 0.814,
sd2 = 3.392,
kappa = 1,
power = .8, alpha = 0.2,
alternative = "not equal")
## Difference between Two means
## (Independent Samples t Test)
## H0: mu1 = mu2
## HA: mu1 != mu2
## ------------------------------
## Statistical power = 0.8
## n1 = 8362
## n2 = 8362
## ------------------------------
## Alternative = "not equal"
## Degrees of freedom = 16722
## Non-centrality parameter = -2.123
## Type I error rate = 0.2
## Type II error rate = 0.2
To make matters even worse, we don't have enough data to support our hypotheses at the desired strength. And even if we do end up running the t-tests…
t.test(filter(compare_rev, above_avg == 1) |> pluck("avg_other_sports"),
filter(compare_rev, above_avg == 0) |> pluck("avg_other_sports"))
##
## Welch Two Sample t-test
##
## data: pluck(filter(compare_rev, above_avg == 1), "avg_other_sports") and pluck(filter(compare_rev, above_avg == 0), "avg_other_sports")
## t = 1.4779, df = 411.94, p-value = 0.1402
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.1983851 1.4005319
## sample estimates:
## mean of x mean of y
## 0.7048690 0.1037956
t.test(filter(compare_rev, updown == 1) |> pluck("avg_other_sports"),
filter(compare_rev, updown == 0) |> pluck("avg_other_sports"))
##
## Welch Two Sample t-test
##
## data: pluck(filter(compare_rev, updown == 1), "avg_other_sports") and pluck(filter(compare_rev, updown == 0), "avg_other_sports")
## t = 1.5262, df = 714.23, p-value = 0.1274
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.1079612 0.8619611
## sample estimates:
## mean of x mean of y
## 0.46955431 0.09255438
t.test(na.omit(filter(compare_rev2, above_avg == 1) |> pluck("avg_other")),
na.omit(filter(compare_rev2, above_avg == 0) |> pluck("avg_other")))
##
## Welch Two Sample t-test
##
## data: na.omit(pluck(filter(compare_rev2, above_avg == 1), "avg_other")) and na.omit(pluck(filter(compare_rev2, above_avg == 0), "avg_other"))
## t = -1.3394, df = 474.12, p-value = 0.1811
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.5640150 0.1067764
## sample estimates:
## mean of x mean of y
## 0.06259144 0.29121074
t.test(na.omit(filter(compare_rev2, updown == 1) |> pluck("avg_other")),
na.omit(filter(compare_rev2, updown == 0) |> pluck("avg_other")))
##
## Welch Two Sample t-test
##
## data: na.omit(pluck(filter(compare_rev2, updown == 1), "avg_other")) and na.omit(pluck(filter(compare_rev2, updown == 0), "avg_other"))
## t = 0.51606, df = 659.96, p-value = 0.606
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.2275406 0.3897849
## sample estimates:
## mean of x mean of y
## 0.2282531 0.1471309
Our p-values are fairly high as well, ranging from about 0.13 to 0.61 across the four tests.

The culmination of evidence is not sufficient to reject the null hypothesis. This doesn't mean it's impossible for revenue sports to positively impact non-revenue sports, but it looks unlikely that a relationship exists, and even if one did, it appears minimal.
I’ll rerun some of the prior tests using similar conditions and assumptions, but testing if the response variable is expenses of non-revenue sports instead.
compare_exp <- compare_change |>
group_by(year, institution_name) |>
summarise(avg_revsport = mean(rev_change[sports %in%
c("Football")]),
avg_other_sports = mean(exp_change[!sports %in%
c("Football", "Basketball")])
)
## `summarise()` has grouped output by 'year'. You can override using the
## `.groups` argument.
compare_exp |>
filter(year > 2015) |>
ggplot() +
geom_point(mapping = aes(x = avg_revsport, y=avg_other_sports)) +
labs(x = "% Change in Football Revenue",
y = "% Change in Non-Revenue Expenses",
title = "Institutions' Change in Football Revenue and Non-Revenue Sport Expenses") +
# info to make this line found in regressions
geom_abline(intercept = 0.0365, slope = 0.12372, color = "red") +
theme(
panel.background = element_rect(fill = "white", color = NA),
panel.grid.minor = element_blank(),
panel.grid.major = element_blank(),
)
## Warning: Removed 390 rows containing missing values or values outside the scale range
## (`geom_point()`).
compare_exp2 <- compare_exp |>
arrange(institution_name, year) |>
group_by(institution_name) |>
mutate(avg_other = lead(avg_other_sports)) |>
filter(year < 2019)
compare_exp2 |>
filter(year > 2016) |>
ggplot() +
geom_point(mapping = aes(x = avg_revsport, y=avg_other))
## Warning: Removed 198 rows containing missing values or values outside the scale range
## (`geom_point()`).
avg_growth <- mean(compare_exp$avg_revsport, na.rm = TRUE)
compare_exp$above_avg <- ifelse(compare_exp$avg_revsport >= avg_growth, 1, 0)
compare_exp$updown <- ifelse(compare_exp$avg_revsport > 0, 1, 0)
compare_exp |>
filter(year > 2016) |>
group_by(above_avg) |>
summarize(sd = sd(avg_other_sports, na.rm = TRUE),
mean = mean(avg_other_sports, na.rm = TRUE))
## # A tibble: 3 × 3
## above_avg sd mean
## <dbl> <dbl> <dbl>
## 1 0 0.107 0.0253
## 2 1 0.170 0.0555
## 3 NA 0.183 0.0547
compare_exp |>
filter(year > 2016) |>
group_by(updown) |>
summarize(sd = sd(avg_other_sports),
mean = mean(avg_other_sports))
## # A tibble: 3 × 3
## updown sd mean
## <dbl> <dbl> <dbl>
## 1 0 0.118 0.0162
## 2 1 0.143 0.0476
## 3 NA 0.183 0.0547
# for year lag
avg_growth2 <- mean(compare_exp2$avg_revsport, na.rm = TRUE)
compare_exp2$above_avg <- ifelse(compare_exp2$avg_revsport >= avg_growth2, 1, 0)
compare_exp2$updown <- ifelse(compare_exp2$avg_revsport > 0, 1, 0)
compare_exp2 |>
filter(year > 2016) |>
group_by(above_avg) |>
summarize(sd = sd(avg_other, na.rm = TRUE),
mean = mean(avg_other, na.rm = TRUE))
## # A tibble: 3 × 3
## above_avg sd mean
## <dbl> <dbl> <dbl>
## 1 0 0.110 0.0150
## 2 1 0.113 0.0183
## 3 NA 0.184 0.0414
compare_exp2 |>
filter(year > 2016) |>
group_by(updown) |>
summarize(sd = sd(avg_other, na.rm = TRUE),
mean = mean(avg_other, na.rm = TRUE))
## # A tibble: 3 × 3
## updown sd mean
## <dbl> <dbl> <dbl>
## 1 0 0.110 0.0192
## 2 1 0.111 0.0149
## 3 NA 0.184 0.0414
cohen.d(d = filter(compare_exp, above_avg == 1) |> pluck("avg_other_sports"),
f = filter(compare_exp, above_avg == 0) |> pluck("avg_other_sports"))
##
## Cohen's d
##
## d estimate: 0.2345709 (small)
## 95 percent confidence interval:
## lower upper
## 0.1080661 0.3610757
cohen.d(d = filter(compare_exp, updown == 1) |> pluck("avg_other_sports"),
f = filter(compare_exp, updown == 0) |> pluck("avg_other_sports"))
##
## Cohen's d
##
## d estimate: 0.2448789 (small)
## 95 percent confidence interval:
## lower upper
## 0.1116215 0.3781364
cohen.d(d = na.omit(filter(compare_exp2, above_avg == 1) |> pluck("avg_other")),
f = na.omit(filter(compare_exp2, above_avg == 0) |> pluck("avg_other")))
##
## Cohen's d
##
## d estimate: 0.08553731 (negligible)
## 95 percent confidence interval:
## lower upper
## -0.06269746 0.23377209
cohen.d(d = na.omit(filter(compare_exp2, updown == 1) |> pluck("avg_other")),
f = na.omit(filter(compare_exp2, updown == 0) |> pluck("avg_other")))
##
## Cohen's d
##
## d estimate: 0.07073128 (negligible)
## 95 percent confidence interval:
## lower upper
## -0.08795916 0.22942173
# no lag, above and below average
pwrss.t.2means(mu1 = 0.0253,
mu2 = 0.0555,
sd1 = 0.1073,
sd2 = 0.1702,
kappa = 1,
power = .8, alpha = 0.2,
alternative = "not equal")
## Difference between Two means
## (Independent Samples t Test)
## H0: mu1 = mu2
## HA: mu1 != mu2
## ------------------------------
## Statistical power = 0.8
## n1 = 201
## n2 = 201
## ------------------------------
## Alternative = "not equal"
## Degrees of freedom = 400
## Non-centrality parameter = -2.128
## Type I error rate = 0.2
## Type II error rate = 0.2
# no lag, above and below 0 change
pwrss.t.2means(mu1 = 0.0162,
mu2 = 0.0476,
sd1 = 0.1177,
sd2 = 0.1425,
kappa = 1,
power = .8, alpha = 0.2,
alternative = "not equal")
## Difference between Two means
## (Independent Samples t Test)
## H0: mu1 = mu2
## HA: mu1 != mu2
## ------------------------------
## Statistical power = 0.8
## n1 = 157
## n2 = 157
## ------------------------------
## Alternative = "not equal"
## Degrees of freedom = 312
## Non-centrality parameter = -2.129
## Type I error rate = 0.2
## Type II error rate = 0.2
pwrss.t.2means(mu1 = 0.01884,
mu2 = 0.01256,
sd1 = 0.1097,
sd2 = 0.1130,
kappa = 1,
power = .8, alpha = 0.2,
alternative = "not equal")
## Difference between Two means
## (Independent Samples t Test)
## H0: mu1 = mu2
## HA: mu1 != mu2
## ------------------------------
## Statistical power = 0.8
## n1 = 2836
## n2 = 2836
## ------------------------------
## Alternative = "not equal"
## Degrees of freedom = 5670
## Non-centrality parameter = 2.124
## Type I error rate = 0.2
## Type II error rate = 0.2
pwrss.t.2means(mu1 = 0.01918,
mu2 = 0.01490,
sd1 = 0.1103,
sd2 = 0.1115,
kappa = 1,
power = .8, alpha = 0.2,
alternative = "not equal")
## Difference between Two means
## (Independent Samples t Test)
## H0: mu1 = mu2
## HA: mu1 != mu2
## ------------------------------
## Statistical power = 0.8
## n1 = 6054
## n2 = 6054
## ------------------------------
## Alternative = "not equal"
## Degrees of freedom = 12106
## Non-centrality parameter = 2.123
## Type I error rate = 0.2
## Type II error rate = 0.2
t.test(filter(compare_exp, above_avg == 1) |> pluck("avg_other_sports"),
filter(compare_exp, above_avg == 0) |> pluck("avg_other_sports"))
##
## Welch Two Sample t-test
##
## data: pluck(filter(compare_exp, above_avg == 1), "avg_other_sports") and pluck(filter(compare_exp, above_avg == 0), "avg_other_sports")
## t = 3.407, df = 659.74, p-value = 0.0006966
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.01256505 0.04674958
## sample estimates:
## mean of x mean of y
## 0.06111880 0.03146148
t.test(filter(compare_exp, updown == 1) |> pluck("avg_other_sports"),
filter(compare_exp, updown == 0) |> pluck("avg_other_sports"))
##
## Welch Two Sample t-test
##
## data: pluck(filter(compare_exp, updown == 1), "avg_other_sports") and pluck(filter(compare_exp, updown == 0), "avg_other_sports")
## t = 3.8007, df = 711.24, p-value = 0.0001566
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.01496922 0.04695923
## sample estimates:
## mean of x mean of y
## 0.05346483 0.02250061
t.test(filter(compare_exp2, above_avg == 1) |> pluck("avg_other"),
filter(compare_exp2, above_avg == 0) |> pluck("avg_other"))
##
## Welch Two Sample t-test
##
## data: pluck(filter(compare_exp2, above_avg == 1), "avg_other") and pluck(filter(compare_exp2, above_avg == 0), "avg_other")
## t = 1.1296, df = 586.98, p-value = 0.2591
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.006594145 0.024449763
## sample estimates:
## mean of x mean of y
## 0.03643210 0.02750429
t.test(filter(compare_exp2, updown == 1) |> pluck("avg_other"),
filter(compare_exp2, updown == 0) |> pluck("avg_other"))
##
## Welch Two Sample t-test
##
## data: pluck(filter(compare_exp2, updown == 1), "avg_other") and pluck(filter(compare_exp2, updown == 0), "avg_other")
## t = 0.8883, df = 408.72, p-value = 0.3749
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.00895789 0.02372793
## sample estimates:
## mean of x mean of y
## 0.03300017 0.02561515
Using the same logic as before, there is sufficient evidence to reject the null hypothesis that spending on non-revenue sports is unaffected by revenue sports' success. However, there is no evidence that this benefit carries over into the following year.

Next, I'll repeat the same tests, but using profit changes instead of revenue changes.

Null hypothesis: the change in revenue for non-revenue sports is the same for institutions whose football teams run a net profit as for those that run a net loss.
compare_profit <- compare_change |>
group_by(year, institution_name) |>
summarise(avg_revsport = mean(profit_change[sports %in%
c("Football")]),
avg_other_sports = mean(rev_change[!sports %in%
c("Football", "Basketball")])
)
## `summarise()` has grouped output by 'year'. You can override using the
## `.groups` argument.
compare_profit |>
filter(year > 2015) |>
ggplot() +
geom_point(mapping = aes(x = avg_revsport, y=avg_other_sports))
## Warning: Removed 390 rows containing missing values or values outside the scale range
## (`geom_point()`).
compare_profit2 <- compare_profit |>
arrange(institution_name, year) |>
group_by(institution_name) |>
mutate(avg_other = lead(avg_other_sports)) |>
filter(year < 2019)
compare_profit2 |>
filter(year > 2016) |>
ggplot() +
geom_point(mapping = aes(x = avg_revsport, y=avg_other))
## Warning: Removed 198 rows containing missing values or values outside the scale range
## (`geom_point()`).
Interestingly, there's a much larger spread on the x-axis. Let's see what we find.
compare_profit$updown <- ifelse(compare_profit$avg_revsport > 0, 1, 0)
compare_profit2$updown <- ifelse(compare_profit2$avg_revsport > 0, 1, 0)
compare_profit |>
filter(year > 2016) |>
group_by(updown) |>
summarize(sd = sd(avg_other_sports),
mean = mean(avg_other_sports))
## # A tibble: 3 × 3
## updown sd mean
## <dbl> <dbl> <dbl>
## 1 0 3.29 0.221
## 2 1 0.671 0.173
## 3 NA 1.11 0.132
compare_profit2 |>
group_by(updown) |>
summarize(sd = sd(avg_other, na.rm = TRUE),
mean = mean(avg_other, na.rm = TRUE))
## # A tibble: 3 × 3
## updown sd mean
## <dbl> <dbl> <dbl>
## 1 0 3.34 0.234
## 2 1 0.601 0.120
## 3 NA 1.12 0.131
I can’t say this is looking too good.
cohen.d(d = filter(compare_profit, updown == 1) |> pluck("avg_other_sports"),
f = filter(compare_profit, updown == 0) |> pluck("avg_other_sports"))
##
## Cohen's d
##
## d estimate: 0.1099549 (negligible)
## 95 percent confidence interval:
## lower upper
## -0.03317298 0.25308286
cohen.d(d = na.omit(filter(compare_profit2, updown == 1) |> pluck("avg_other")),
f = na.omit(filter(compare_profit2, updown == 0) |> pluck("avg_other")))
##
## Cohen's d
##
## d estimate: -0.03945665 (negligible)
## 95 percent confidence interval:
## lower upper
## -0.2041184 0.1252051
t.test(filter(compare_profit, updown == 1) |> pluck("avg_other_sports"),
filter(compare_profit, updown == 0) |> pluck("avg_other_sports"))
##
## Welch Two Sample t-test
##
## data: pluck(filter(compare_profit, updown == 1), "avg_other_sports") and pluck(filter(compare_profit, updown == 0), "avg_other_sports")
## t = 0.97421, df = 266.45, p-value = 0.3308
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.5913513 1.7496926
## sample estimates:
## mean of x mean of y
## 0.7831329 0.2039623
t.test(filter(compare_profit2, updown == 1) |> pluck("avg_other"),
filter(compare_profit2, updown == 0) |> pluck("avg_other"))
##
## Welch Two Sample t-test
##
## data: pluck(filter(compare_profit2, updown == 1), "avg_other") and pluck(filter(compare_profit2, updown == 0), "avg_other")
## t = -0.77186, df = 647.85, p-value = 0.4405
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.4050278 0.1764583
## sample estimates:
## mean of x mean of y
## 0.1198413 0.2341261
For almost the exact reasons outlined in the prior sections, there isn’t enough evidence to reject the null hypothesis.
compare_pexp <- compare_change |>
group_by(year, institution_name) |>
summarise(avg_revsport = mean(profit_change[sports %in%
c("Football")]),
avg_other_sports = mean(exp_change[!sports %in%
c("Football", "Basketball")])
)
## `summarise()` has grouped output by 'year'. You can override using the
## `.groups` argument.
compare_pexp |>
filter(year > 2015) |>
ggplot() +
geom_point(mapping = aes(x = avg_revsport, y=avg_other_sports))
## Warning: Removed 390 rows containing missing values or values outside the scale range
## (`geom_point()`).
compare_pexp2 <- compare_pexp |>
arrange(institution_name, year) |>
group_by(institution_name) |>
mutate(avg_other = lead(avg_other_sports)) |>
filter(year < 2019)
compare_pexp2 |>
filter(year > 2016) |>
ggplot() +
geom_point(mapping = aes(x = avg_revsport, y=avg_other))
## Warning: Removed 198 rows containing missing values or values outside the scale range
## (`geom_point()`).
compare_pexp$updown <- ifelse(compare_pexp$avg_revsport > 0, 1, 0)
compare_pexp2$updown <- ifelse(compare_pexp2$avg_revsport > 0, 1, 0)
compare_pexp |>
filter(year > 2016) |>
group_by(updown) |>
summarize(sd = sd(avg_other_sports),
mean = mean(avg_other_sports))
## # A tibble: 3 × 3
## updown sd mean
## <dbl> <dbl> <dbl>
## 1 0 0.142 0.0387
## 2 1 0.113 0.0316
## 3 NA 0.183 0.0547
compare_pexp2 |>
group_by(updown) |>
summarize(sd = sd(avg_other, na.rm = TRUE),
mean = mean(avg_other, na.rm = TRUE))
## # A tibble: 3 × 3
## updown sd mean
## <dbl> <dbl> <dbl>
## 1 0 0.105 0.0313
## 2 1 0.104 0.0298
## 3 NA 0.180 0.0522
I'm not even going to bother testing the rest; these groups are basically identical.
I now want to confirm that schools with higher revenues/profits in revenue sports also have higher expenses for non-revenue sports.
d1 <- ncaa |>
filter(year == 2019) |>
filter(classification_code %in% c(1,2,3))
I'm going to look at four sports: Women's Swimming and Diving, Men's Golf, Baseball, and Softball. I'll keep the alpha and power levels the same as in the prior null hypothesis, for the same reasons.
compare_sports <- d1 |>
group_by(institution_name) |>
summarise(
revenue_rev = sum(rev_men[sports %in% c("Football")], na.rm = TRUE) +
sum(rev_men[sports %in% c("Basketball")], na.rm = TRUE),
revenue_profit = sum(profit[sports %in% c("Football")], na.rm = TRUE)+
sum(rev_men[sports %in% c("Basketball")], na.rm = TRUE),
swd_exp = mean(exp_women[sports %in% c("Swimming and Diving")], na.rm = TRUE),
mgolf_exp = mean(exp_men[sports %in% c("Golf")], na.rm = TRUE),
baseball_exp = mean(exp_men[sports %in% c("Baseball")], na.rm = TRUE),
softball_exp = mean(exp_women[sports %in% c("Softball")], na.rm = TRUE),
division = min(classification_code)
)
compare_sports$class <- ifelse(compare_sports$division == 1, 1, 0)
# womens swim and dive
compare_sports |>
filter(!is.na(swd_exp)) |>
group_by(class) |>
summarize(sd = sd(swd_exp),
mean = mean(swd_exp))
## # A tibble: 2 × 3
## class sd mean
## <dbl> <dbl> <dbl>
## 1 0 264229. 547937.
## 2 1 472962. 1313916.
cohen.d(d = na.omit(filter(compare_sports, class == 1) |> pluck("swd_exp")),
f = na.omit(filter(compare_sports, class == 0) |> pluck("swd_exp")))
##
## Cohen's d
##
## d estimate: 2.029838 (large)
## 95 percent confidence interval:
## lower upper
## 1.646270 2.413406
one <- compare_sports |>
filter(!is.na(swd_exp), class == 0) %>%
summarise(count = n())
two <- compare_sports |>
filter(!is.na(swd_exp), class == 1) %>%
summarise(count = n())
c(one, two)
## $count
## [1] 85
##
## $count
## [1] 76
pwrss.t.2means(mu1 = 547937.3,
mu2 = 1313915.8,
sd1 = 264229.5,
sd2 = 472962,
kappa = .9,
power = .8,
alpha = 0.2,
alternative = "not equal")
## Difference between Two means
## (Independent Samples t Test)
## H0: mu1 = mu2
## HA: mu1 != mu2
## ------------------------------
## Statistical power = 0.8
## n1 = 3
## n2 = 4
## ------------------------------
## Alternative = "not equal"
## Degrees of freedom = 5
## Non-centrality parameter = -2.722
## Type I error rate = 0.2
## Type II error rate = 0.2
t.test(na.omit(filter(compare_sports, class == 1) |> pluck("swd_exp")),
na.omit(filter(compare_sports, class == 0) |> pluck("swd_exp")))
##
## Welch Two Sample t-test
##
## data: na.omit(pluck(filter(compare_sports, class == 1), "swd_exp")) and na.omit(pluck(filter(compare_sports, class == 0), "swd_exp"))
## t = 12.484, df = 114.72, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 644438.6 887518.5
## sample estimates:
## mean of x mean of y
## 1313915.8 547937.3
# mens golf
compare_sports |>
filter(!is.na(mgolf_exp)) |>
group_by(class) |>
summarize(sd = sd(mgolf_exp),
mean = mean(mgolf_exp))
## # A tibble: 2 × 3
## class sd mean
## <dbl> <dbl> <dbl>
## 1 0 153291. 287963.
## 2 1 333871. 700563.
cohen.d(d = na.omit(filter(compare_sports, class == 1) |> pluck("mgolf_exp")),
f = na.omit(filter(compare_sports, class == 0) |> pluck("mgolf_exp")))
##
## Cohen's d
##
## d estimate: 1.706682 (large)
## 95 percent confidence interval:
## lower upper
## 1.434323 1.979040
one <- compare_sports |>
filter(!is.na(mgolf_exp), class == 0) %>%
summarise(count = n())
two <- compare_sports |>
filter(!is.na(mgolf_exp), class == 1) %>%
summarise(count = n())
c(one, two)
## $count
## [1] 177
##
## $count
## [1] 117
pwrss.t.2means(mu1 = 287963 ,
mu2 = 700562.8 ,
sd1 = 153291.4 ,
sd2 = 333870.5 ,
kappa = .66,
power = .8,
alpha = 0.2,
alternative = "not equal")
## Difference between Two means
## (Independent Samples t Test)
## H0: mu1 = mu2
## HA: mu1 != mu2
## ------------------------------
## Statistical power = 0.8
## n1 = 4
## n2 = 5
## ------------------------------
## Alternative = "not equal"
## Degrees of freedom = 7
## Non-centrality parameter = -2.458
## Type I error rate = 0.2
## Type II error rate = 0.2
t.test(na.omit(filter(compare_sports, class == 1) |> pluck("mgolf_exp")),
na.omit(filter(compare_sports, class == 0) |> pluck("mgolf_exp")))
##
## Welch Two Sample t-test
##
## data: na.omit(pluck(filter(compare_sports, class == 1), "mgolf_exp")) and na.omit(pluck(filter(compare_sports, class == 0), "mgolf_exp"))
## t = 12.523, df = 148.68, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 347495.4 477704.2
## sample estimates:
## mean of x mean of y
## 700562.8 287963.0
# baseball
compare_sports |>
filter(!is.na(baseball_exp)) |>
group_by(class) |>
summarize(sd = sd(baseball_exp),
mean = mean(baseball_exp))
## # A tibble: 2 × 3
## class sd mean
## <dbl> <dbl> <dbl>
## 1 0 423571. 930220.
## 2 1 1192174. 2124906.
cohen.d(d = na.omit(filter(compare_sports, class == 1) |> pluck("baseball_exp")),
f = na.omit(filter(compare_sports, class == 0) |> pluck("baseball_exp")))
##
## Cohen's d
##
## d estimate: 1.473127 (large)
## 95 percent confidence interval:
## lower upper
## 1.210552 1.735701
one <- compare_sports |>
filter(!is.na(baseball_exp), class == 0) %>%
summarise(count = n())
two <- compare_sports |>
filter(!is.na(baseball_exp), class == 1) %>%
summarise(count = n())
c(one, two)
## $count
## [1] 183
##
## $count
## [1] 115
pwrss.t.2means(mu1 = 930220.4 ,
mu2 = 2124906.4 ,
sd1 = 423571.5 ,
sd2 = 1192174.2 ,
kappa = .62,
power = .8,
alpha = 0.2,
alternative = "not equal")
## Difference between Two means
## (Independent Samples t Test)
## H0: mu1 = mu2
## HA: mu1 != mu2
## ------------------------------
## Statistical power = 0.8
## n1 = 4
## n2 = 7
## ------------------------------
## Alternative = "not equal"
## Degrees of freedom = 9
## Non-centrality parameter = -2.4
## Type I error rate = 0.2
## Type II error rate = 0.2
t.test(na.omit(filter(compare_sports, class == 1) |> pluck("baseball_exp")),
na.omit(filter(compare_sports, class == 0) |> pluck("baseball_exp")))
##
## Welch Two Sample t-test
##
## data: na.omit(pluck(filter(compare_sports, class == 1), "baseball_exp")) and na.omit(pluck(filter(compare_sports, class == 0), "baseball_exp"))
## t = 10.344, df = 132.28, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 966227.7 1423144.1
## sample estimates:
## mean of x mean of y
## 2124906.4 930220.4
# softball
compare_sports |>
filter(!is.na(softball_exp)) |>
group_by(class) |>
summarize(sd = sd(softball_exp),
mean = mean(softball_exp))
## # A tibble: 2 × 3
## class sd mean
## <dbl> <dbl> <dbl>
## 1 0 274425. 685636.
## 2 1 635182. 1413436.
cohen.d(d = na.omit(filter(compare_sports, class == 1) |> pluck("softball_exp")),
f = na.omit(filter(compare_sports, class == 0) |> pluck("softball_exp")))
##
## Cohen's d
##
## d estimate: 1.646267 (large)
## 95 percent confidence interval:
## lower upper
## 1.374592 1.917943
one <- compare_sports |>
filter(!is.na(softball_exp), class == 0) %>%
summarise(count = n())
two <- compare_sports |>
filter(!is.na(softball_exp), class == 1) %>%
summarise(count = n())
c(one, two)
## $count
## [1] 188
##
## $count
## [1] 109
pwrss.t.2means(mu1 = 685636.3 ,
mu2 = 1413435.6 ,
sd1 = 274424.5 ,
sd2 = 635182 ,
kappa = .58,
power = .8,
alpha = 0.2,
alternative = "not equal")
## Difference between Two means
## (Independent Samples t Test)
## H0: mu1 = mu2
## HA: mu1 != mu2
## ------------------------------
## Statistical power = 0.8
## n1 = 4
## n2 = 6
## ------------------------------
## Alternative = "not equal"
## Degrees of freedom = 8
## Non-centrality parameter = -2.481
## Type I error rate = 0.2
## Type II error rate = 0.2
t.test(na.omit(filter(compare_sports, class == 1) |> pluck("softball_exp")),
na.omit(filter(compare_sports, class == 0) |> pluck("softball_exp")))
##
## Welch Two Sample t-test
##
## data: na.omit(pluck(filter(compare_sports, class == 1), "softball_exp")) and na.omit(pluck(filter(compare_sports, class == 0), "softball_exp"))
## t = 11.364, df = 131.75, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 601105.9 854492.7
## sample estimates:
## mean of x mean of y
## 1413435.6 685636.3
It is abundantly clear that we can reject the null hypothesis: for at least these popular non-revenue sports, there is more funding at FBS schools than at non-FBS schools.
library(boot)
library(broom)
library(lindia)
This model will show the relationship between increase in revenue of football teams and increase in non-revenue sport funding.
# data we're basing model off of
compare_exp |>
filter(year > 2015) |>
ggplot() +
geom_point(mapping = aes(x = avg_revsport, y=avg_other_sports))
## Warning: Removed 390 rows containing missing values or values outside the scale range
## (`geom_point()`).
model <- lm(avg_other_sports ~ avg_revsport, compare_exp)
summary(model)
##
## Call:
## lm(formula = avg_other_sports ~ avg_revsport, data = compare_exp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.38795 -0.05866 -0.00502 0.04318 1.86182
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.036588 0.004054 9.026 < 2e-16 ***
## avg_revsport 0.123720 0.017912 6.907 8.78e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1243 on 1001 degrees of freedom
## (390 observations deleted due to missingness)
## Multiple R-squared: 0.04549, Adjusted R-squared: 0.04454
## F-statistic: 47.71 on 1 and 1001 DF, p-value: 8.78e-12
Using a linear regression, we can estimate how much non-revenue sport expenses will change based upon changes in football revenue.
mean(compare_exp$avg_other_sports)
## [1] 0.04794688
mean(compare_exp$avg_revsport, na.rm = TRUE)
## [1] 0.05631253
The model predicts that when football revenue doesn't change from the prior year, non-revenue sport expenses still grow by about 3.7% (the intercept of 0.0366). The slope of about 0.124 means each percentage-point increase in football revenue predicts an additional 0.124 percentage points of non-revenue expense growth. So if football revenue increases 10%, non-revenue sport funding is predicted to increase by about 3.7% + 0.124 × 10% ≈ 4.9%.

As a sanity check, plugging the means above into the fitted slope gives an implied intercept of about 0.048 − 0.124 × 0.056 ≈ 4.1%, only slightly higher than the model's fitted 3.7%.
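The same numbers can be read off the fitted model directly; a quick check using predict() on the model object from above:
# predicted non-revenue expense growth at 0% and 10% football revenue growth
predict(model, newdata = data.frame(avg_revsport = c(0, 0.10)))
# roughly 0.037 and 0.049, i.e. about 3.7% and 4.9%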
compare_FBS <- ncaa |>
filter(classification_code == 1) |>
arrange(institution_name, men, sports, year) |>
mutate(revenue_change_men = (rev_men - lag(rev_men))/lag(rev_men),
revenue_change_women = (rev_women - lag(rev_women))/lag(rev_women),
expense_change_men = (exp_men - lag(exp_men))/lag(exp_men),
expense_change_women = (exp_women - lag(exp_women))/lag(exp_women),
partic_men_change = (sum_partic_men - lag(sum_partic_men))/
lag(sum_partic_men),
partic_women_change = (sum_partic_women - lag(sum_partic_women))/
lag(sum_partic_women),
student_change = (ef_total_count - lag(ef_total_count)) /
lag(ef_total_count),
profit_change = (profit - lag(profit))
# this one is a number, not percentage
)
compare_change_FBS <- compare_FBS |>
filter(year > 2015) |>
mutate(rev_change = rowSums(across(c(revenue_change_men,
revenue_change_women)), na.rm = TRUE),
exp_change = rowSums(across(c(expense_change_men,
expense_change_women)), na.rm = TRUE),
athlete_change = rowSums(across(c(partic_men_change,
partic_women_change)), na.rm = TRUE)
) |>
reframe(year, institution_name, classification_code, sports, profit,
rev_change, exp_change, athlete_change, student_change, profit_change)
compare_exp_FBS <- compare_change_FBS |>
group_by(year, institution_name) |>
summarise(avg_revsport = mean(rev_change[sports %in%
c("Football")]),
avg_other_sports = mean(exp_change[!sports %in%
c("Football", "Basketball")])
)
## `summarise()` has grouped output by 'year'. You can override using the
## `.groups` argument.
# FBS
compare_exp_FBS |>
ggplot() +
geom_point(mapping = aes(x = avg_revsport, y=avg_other_sports))
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).
This graph looks pretty different now…
model_FBS_pct <- lm(avg_other_sports ~ avg_revsport, compare_exp_FBS)
summary(model_FBS_pct)
##
## Call:
## lm(formula = avg_other_sports ~ avg_revsport, data = compare_exp_FBS)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.33220 -0.05698 0.00032 0.04590 0.56641
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.035320 0.004488 7.870 2.19e-14 ***
## avg_revsport 0.044299 0.014167 3.127 0.00187 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0981 on 504 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.01903, Adjusted R-squared: 0.01708
## F-statistic: 9.778 on 1 and 504 DF, p-value: 0.001868
compare_FCS <- ncaa |>
filter(classification_code == 2) |>
arrange(institution_name, men, sports, year) |>
mutate(revenue_change_men = (rev_men - lag(rev_men))/lag(rev_men),
revenue_change_women = (rev_women - lag(rev_women))/lag(rev_women),
expense_change_men = (exp_men - lag(exp_men))/lag(exp_men),
expense_change_women = (exp_women - lag(exp_women))/lag(exp_women),
partic_men_change = (sum_partic_men - lag(sum_partic_men))/
lag(sum_partic_men),
partic_women_change = (sum_partic_women - lag(sum_partic_women))/
lag(sum_partic_women),
student_change = (ef_total_count - lag(ef_total_count)) /
lag(ef_total_count),
profit_change = (profit - lag(profit))
# this one is a number, not percentage
)
compare_change_FCS <- compare_FCS |>
filter(year > 2015) |>
mutate(rev_change = rowSums(across(c(revenue_change_men,
revenue_change_women)), na.rm = TRUE),
exp_change = rowSums(across(c(expense_change_men,
expense_change_women)), na.rm = TRUE),
athlete_change = rowSums(across(c(partic_men_change,
partic_women_change)), na.rm = TRUE)
) |>
reframe(year, institution_name, classification_code, sports, profit,
rev_change, exp_change, athlete_change, student_change, profit_change)
compare_exp_FCS <- compare_change_FCS |>
group_by(year, institution_name) |>
summarise(avg_revsport = mean(rev_change[sports %in%
c("Football")]),
avg_other_sports = mean(exp_change[!sports %in%
c("Football", "Basketball")])
)
## `summarise()` has grouped output by 'year'. You can override using the
## `.groups` argument.
# FCS
compare_exp_FCS |>
ggplot() +
geom_point(mapping = aes(x = avg_revsport, y=avg_other_sports))
## Warning: Removed 4 rows containing missing values or values outside the scale range
## (`geom_point()`).
model_FCS_pct <- lm(avg_other_sports ~ avg_revsport, compare_exp_FCS)
summary(model_FCS_pct)
##
## Call:
## lm(formula = avg_other_sports ~ avg_revsport, data = compare_exp_FCS)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.41099 -0.06008 -0.00941 0.03752 1.84678
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.040196 0.006773 5.935 5.53e-09 ***
## avg_revsport 0.134777 0.025555 5.274 2.00e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1471 on 495 degrees of freedom
## (4 observations deleted due to missingness)
## Multiple R-squared: 0.0532, Adjusted R-squared: 0.05129
## F-statistic: 27.81 on 1 and 495 DF, p-value: 1.999e-07
It looks like most of the combined model's signal was actually coming from FCS schools; within FBS, football revenue had much less of an impact.
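Another way to see this (a sketch, not run above) is a single model with an interaction term, where the avg_revsport:division coefficient tests whether the FBS and FCS slopes differ:
# stack the two frames and let the slope vary by division
combined <- bind_rows(
  mutate(compare_exp_FBS, division = "FBS"),
  mutate(compare_exp_FCS, division = "FCS")
)
summary(lm(avg_other_sports ~ avg_revsport * division, data = combined))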
gg_reshist(model)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
gg_reshist(model_FBS_pct)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
gg_reshist(model_FCS_pct)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Put together, the two classifications produce a reasonably normal-looking residual distribution. Separated, each has only minor issues with its shape (FBS the most), but nothing screams that something big we aren't already aware of is missing.
gg_resfitted(model) +
geom_smooth(se=FALSE)
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
gg_resfitted(model_FBS_pct) +
geom_smooth(se=FALSE)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
gg_resfitted(model_FCS_pct) +
geom_smooth(se=FALSE)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
The FCS and combined plots don't look terrible; toward the right, the values are too sparse to tell whether the spread is consistent. In FBS, though, we see problems: there appears to be a fanning effect, which would violate the assumption of homoscedasticity and make the results unreliable. In a way, this actually further supports the conclusion for FBS schools, which is that changes in football revenue have little effect on changes in non-revenue sport funding.
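The fanning impression can also be checked numerically with a Breusch-Pagan test, which regresses the squared residuals on the fitted values; a small p-value rejects constant variance. A minimal sketch, assuming the lmtest package is installed:
# Sketch: formal check of the fanning seen in the FBS residual plot;
# p < 0.05 would suggest heteroscedastic residuals.
library(lmtest)
bptest(model_FBS_pct)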
When doing additional work on this dataset, I thought it might be better to look at total dollar-amount changes rather than just percentage changes. So I'm going to fit the same models, but with total dollar changes.
compare2 <- ncaa |>
filter(classification_code %in% c(1,2,3)) |>
arrange(institution_name, men, sports, year) |>
mutate(revenue_change_men = (rev_men - lag(rev_men)),
revenue_change_women = (rev_women - lag(rev_women)),
expense_change_men = (exp_men - lag(exp_men)),
expense_change_women = (exp_women - lag(exp_women)),
partic_men_change = (sum_partic_men - lag(sum_partic_men)),
partic_women_change = (sum_partic_women - lag(sum_partic_women)),
student_change = (ef_total_count - lag(ef_total_count)),
profit_change = (profit - lag(profit))
)
compare_change2 <- compare2 |>
filter(year > 2015) |>
mutate(rev_change = rowSums(across(c(revenue_change_men,
revenue_change_women)), na.rm = TRUE),
exp_change = rowSums(across(c(expense_change_men,
expense_change_women)), na.rm = TRUE),
athlete_change = rowSums(across(c(partic_men_change,
partic_women_change)), na.rm = TRUE)
) |>
reframe(year, institution_name, classification_code, sports, profit,
rev_change, exp_change, athlete_change, student_change, profit_change)
compare_exp2 <- compare_change2 |>
group_by(year, institution_name) |>
summarise(avg_revsport = mean(rev_change[sports %in%
c("Football")]),
avg_other_sports = mean(exp_change[!sports %in%
c("Football", "Basketball")])
)
## `summarise()` has grouped output by 'year'. You can override using the
## `.groups` argument.
# data we're basing model off of
compare_exp2 |>
filter(year > 2015) |>
ggplot() +
geom_point(mapping = aes(x = avg_revsport, y=avg_other_sports))
## Warning: Removed 390 rows containing missing values or values outside the scale range
## (`geom_point()`).
model_new <- lm(avg_other_sports ~ avg_revsport, compare_exp2)
summary(model_new)
##
## Call:
## lm(formula = avg_other_sports ~ avg_revsport, data = compare_exp2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -624505 -32739 3685 40289 564915
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.303e+04 3.008e+03 4.333 1.62e-05 ***
## avg_revsport 1.540e-03 7.748e-04 1.987 0.0472 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 92940 on 1001 degrees of freedom
## (390 observations deleted due to missingness)
## Multiple R-squared: 0.003929, Adjusted R-squared: 0.002934
## F-statistic: 3.948 on 1 and 1001 DF, p-value: 0.04719
Interestingly, what I thought was a blunder on my end for not thinking of it sooner turns out to explain even less of the variation than the initial approach of comparing differences as percentages.
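Pulling the fit statistics side by side makes the gap concrete. A minimal sketch using the models already fitted above:
# Sketch: compare explained variation across the two approaches directly.
tibble(approach = c("percentage change", "dollar change"),
       r_squared = c(summary(model)$r.squared, summary(model_new)$r.squared))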
# data we're basing model off of
compare_FBS <- ncaa |>
filter(classification_code == 1) |>
arrange(institution_name, men, sports, year) |>
mutate(revenue_change_men = (rev_men - lag(rev_men)),
revenue_change_women = (rev_women - lag(rev_women)),
expense_change_men = (exp_men - lag(exp_men)),
expense_change_women = (exp_women - lag(exp_women)),
partic_men_change = (sum_partic_men - lag(sum_partic_men)),
partic_women_change = (sum_partic_women - lag(sum_partic_women)),
student_change = (ef_total_count - lag(ef_total_count)),
profit_change = (profit - lag(profit))
)
compare_change_FBS <- compare_FBS |>
filter(year > 2015) |>
mutate(rev_change = rowSums(across(c(revenue_change_men,
revenue_change_women)), na.rm = TRUE),
exp_change = rowSums(across(c(expense_change_men,
expense_change_women)), na.rm = TRUE),
athlete_change = rowSums(across(c(partic_men_change,
partic_women_change)), na.rm = TRUE)
) |>
reframe(year, institution_name, classification_code, sports, profit,
rev_change, exp_change, athlete_change, student_change, profit_change)
compare_exp_FBS <- compare_change_FBS |>
group_by(year, institution_name) |>
summarise(avg_revsport = mean(rev_change[sports %in%
c("Football")]),
avg_other_sports = mean(exp_change[!sports %in%
c("Football", "Basketball")])
)
## `summarise()` has grouped output by 'year'. You can override using the
## `.groups` argument.
# FBS
compare_exp_FBS |>
ggplot() +
geom_point(mapping = aes(x = avg_revsport, y=avg_other_sports))
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).
model_FBS <- lm(avg_other_sports ~ avg_revsport, compare_exp_FBS)
summary(model_FBS)
##
## Call:
## lm(formula = avg_other_sports ~ avg_revsport, data = compare_exp_FBS)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1142549 -48578 12383 63919 548694
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.940e+04 6.158e+03 3.151 0.00172 **
## avg_revsport -1.212e-04 1.131e-03 -0.107 0.91472
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 132500 on 504 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 2.278e-05, Adjusted R-squared: -0.001961
## F-statistic: 0.01148 on 1 and 504 DF, p-value: 0.9147
# data we're basing model off of
compare_FCS <- ncaa |>
filter(classification_code == 2) |>
arrange(institution_name, men, sports, year) |>
mutate(revenue_change_men = (rev_men - lag(rev_men)),
revenue_change_women = (rev_women - lag(rev_women)),
expense_change_men = (exp_men - lag(exp_men)),
expense_change_women = (exp_women - lag(exp_women)),
partic_men_change = (sum_partic_men - lag(sum_partic_men)),
partic_women_change = (sum_partic_women - lag(sum_partic_women)),
student_change = (ef_total_count - lag(ef_total_count)),
profit_change = (profit - lag(profit))
)
compare_change_FCS <- compare_FCS |>
filter(year > 2015) |>
mutate(rev_change = rowSums(across(c(revenue_change_men,
revenue_change_women)), na.rm = TRUE),
exp_change = rowSums(across(c(expense_change_men,
expense_change_women)), na.rm = TRUE),
athlete_change = rowSums(across(c(partic_men_change,
partic_women_change)), na.rm = TRUE)
) |>
reframe(year, institution_name, classification_code, sports, profit,
rev_change, exp_change, athlete_change, student_change, profit_change)
compare_exp_FCS <- compare_change_FCS |>
group_by(year, institution_name) |>
summarise(avg_revsport = mean(rev_change[sports %in%
c("Football")]),
avg_other_sports = mean(exp_change[!sports %in%
c("Football", "Basketball")])
)
## `summarise()` has grouped output by 'year'. You can override using the
## `.groups` argument.
# FCS
compare_exp_FCS |>
ggplot() +
geom_point(mapping = aes(x = avg_revsport, y=avg_other_sports))
## Warning: Removed 4 rows containing missing values or values outside the scale range
## (`geom_point()`).
model_FCS <- lm(avg_other_sports ~ avg_revsport, compare_exp_FCS)
summary(model_FCS)
##
## Call:
## lm(formula = avg_other_sports ~ avg_revsport, data = compare_exp_FCS)
##
## Residuals:
## Min 1Q Median 3Q Max
## -571018 -16699 9258 32079 207728
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.106e+03 3.167e+03 1.296 0.195
## avg_revsport -1.613e-02 3.934e-03 -4.099 4.84e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 69380 on 495 degrees of freedom
## (4 observations deleted due to missingness)
## Multiple R-squared: 0.03283, Adjusted R-squared: 0.03088
## F-statistic: 16.8 on 1 and 495 DF, p-value: 4.842e-05
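As an aside before assessing these fits: the FBS and FCS pipelines above differ only in the classification_code filter, so a small helper would remove the copy-paste. A minimal sketch, assuming the cleaned ncaa data frame used throughout (dollar_change_summary is a hypothetical name; class_code matches the values used above, 1 for FBS and 2 for FCS):
# Sketch: one function for the repeated dollar-change pipeline.
dollar_change_summary <- function(data, class_code) {
  data |>
    filter(classification_code == class_code) |>
    arrange(institution_name, men, sports, year) |>
    mutate(revenue_change_men = rev_men - lag(rev_men),
           revenue_change_women = rev_women - lag(rev_women),
           expense_change_men = exp_men - lag(exp_men),
           expense_change_women = exp_women - lag(exp_women)) |>
    filter(year > 2015) |>
    mutate(rev_change = rowSums(across(c(revenue_change_men,
                                         revenue_change_women)), na.rm = TRUE),
           exp_change = rowSums(across(c(expense_change_men,
                                         expense_change_women)), na.rm = TRUE)) |>
    group_by(year, institution_name) |>
    summarise(avg_revsport = mean(rev_change[sports == "Football"]),
              avg_other_sports = mean(exp_change[!sports %in%
                                                   c("Football", "Basketball")]),
              .groups = "drop")
}
# compare_exp_FBS <- dollar_change_summary(ncaa, 1)
# compare_exp_FCS <- dollar_change_summary(ncaa, 2)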
Back to the models themselves: unsurprisingly, these aren't effective. There are a lot of very obvious problems with them; even visually, the data don't behave the way we'd expect a linear regression to.
# initial model used
gg_resfitted(model) +
geom_smooth(se=FALSE)
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
This looks fairly well behaved, but with a small right tail. Let's look at the other models…
gg_resfitted(model_FCS) +
geom_smooth(se=FALSE)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
gg_resfitted(model_FBS) +
geom_smooth(se=FALSE)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
The initial model looks pretty good, and FBS isn't too bad, but FCS clearly has more unexplained variation in its residuals than we'd like.
gg_reshist(model)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
We can see again here that there are some outliers and maybe a small right tail, but the residuals are fairly normally distributed.
gg_reshist(model_FBS)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
gg_reshist(model_FCS)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
We start getting a lot more problems with FBS and FCS, though. There are heavy tails, and in FCS the center of the distribution leans noticeably to the right. It just doesn't distribute as well as it could; compared to the initial model, these are inferior. For all of the models, though, it's clear that we are missing other explanatory variables.
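A QQ plot gives a sharper read on those tails than the histograms. A minimal sketch, assuming gg_reshist and gg_resfitted come from the lindia package, which also provides gg_qqplot:
# Sketch: QQ plots to inspect the heavy tails directly.
gg_qqplot(model_FBS)
gg_qqplot(model_FCS)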
Variables I suspect would be useful include donations to the university, the performance of athletic funds, and some way to measure “success” in non-revenue sport performance. These, along with perhaps others, could make these models a bit better, but the data we have are insufficient to build a useful model from dollar-amount changes.
If you are a non-revenue sport athlete, it's pretty clear that going to an FBS school will likely mean your program gets more funding than it would at a non-FBS school. There are still unknowns here, such as how those resources are distributed to athletes, though this isn't likely to make a big difference. It's also unknown whether greater expenses translate into funding that athletes actually prioritize. Additional expenses could mean traveling to competitions by plane instead of bus, but that doesn't necessarily mean the competitions will be of higher quality. Extra expenses could also mean your head coach gets a bonus for winning a conference championship, which is certainly good for the team, but the quality of coaching hasn't necessarily changed during that season.
As well, your program's funding can vary dramatically more at a smaller school than at a bigger FBS school. If your football team performs terribly at an FBS school, you might not see a big dip in funding that year. At a smaller FCS school, though, you may feel the budget cuts compared to the prior year.
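That claim about variability can be eyeballed from the data already built above. A minimal sketch using compare_change2, comparing the spread of non-revenue expense changes by classification:
# Sketch: a larger sd for FCS (code 2) than FBS (code 1) would back
# the point about smaller schools feeling budget swings more.
compare_change2 |>
  filter(!sports %in% c("Football", "Basketball")) |>
  group_by(classification_code) |>
  summarise(sd_exp_change = sd(exp_change, na.rm = TRUE))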
This would be applicable to donors, directors, and board members of athletic departments. If you are deciding where to allocate funds at an FCS school and want to provide the broadest benefit possible, you might default to spending money on revenue programs like football. If you're in the same situation at an FBS school, you may want to spread resources more evenly.
We can see two additional insights from this:
The higher the return on additional funding invested in football revenues, the more beneficial it is for every member of the athletic department to allocate more funds to the football program. Similarly, if the return on football revenues is low, funds may need to be distributed more equally across sports for all teams to see similar benefits.
There are MANY more factors involved here. This alone explains only a small fraction of the variation, so the interaction between sports teams isn't that strong. Although it can be valuable to consider how other sporting programs are impacted by the growth of a football team, other factors need to be considered to make accurate, informed decisions. Examples include changes in endowment size, government budgets, institutional performance, program performance, and likely others.