Week 6 Data Dive - Confidence Intervals

Pair 1: Total Students and University Profit

Explanatory: Total Students, numeric discrete
Response: Revenue, numeric continuous

# groups information by school
schools <- ncaa |>
  group_by(institution_name) |>
  summarise(men = mean(ef_male_count, na.rm = TRUE),
            women = mean(ef_female_count, na.rm = TRUE),
            rev_men = sum(rev_men, na.rm = TRUE),
            rev_wom = sum(rev_women, na.rm = TRUE))

# creates new columns that add up men and women values
schools$students = schools$men + schools$women
schools$revenue = (schools$rev_men + schools$rev_wom)/5 # /5 to average over 5 years

This should make a table that has average total students and average total revenue for each school. Revenue will be understated for schools that recently created an athletic department, but this should be a minority of data.

Visualizing the Relationship

schools |>
  ggplot() +
  geom_point(mapping = aes(x = students, y = revenue)) +
  labs(title = "University Students to Athletic Revenue",
       x = "Number of Students", y = "Average Annual Athletic Revenue") +
  theme_classic()

The y axis looks distorted, so I’m going to break this into two separate categories: one for under $5m and one for over $10m. “Big” and “Small” are based on revenue, not number of students.

# schools with revenue < $5m
small_schools <- schools |>
  filter(revenue < 5000000) |>
  ggplot() +
  geom_point(mapping = aes(x = students, y = revenue)) +
  labs(title = "University Students to Athletic Revenue",
       x = "Number of Students", y = "Average Annual Athletic Revenue") +
  theme_classic()

small_schools

# schools with revenue > $10m
big_schools <- schools |>
  filter(revenue > 10000000) |>
  ggplot() +
  geom_point(mapping = aes(x = students, y = revenue)) +
  labs(title = "University Students to Athletic Revenue",
       x = "Number of Students", y = "Average Annual Athletic Revenue") +
  theme_classic()

big_schools

I’m going to start by looking at big schools:

schools |>
  filter(revenue > 10000000) |>
  ggplot() + 
  geom_boxplot(mapping = aes(y = revenue))

print(schools$institution_name[which.max(schools$revenue)])

## [1] "The University of Texas at Austin"

print(schools$institution_name[which.max(schools$students)])

## [1] "Texas A & M University-College Station"

For big schools, there isn’t really any outliers, especially considering some of these values like revenue increase exponentially, and there isn’t one value that is substantially higher than others. For school size, Penn State is likely the only true outlier in a league of its own. However, the interesting part of this is that Penn isn’t considered the largest university in the USA. Far from it. So it’s interesting that there is such a gap, because you wouldn’t expect there to be.

For schools with less than $5m in revenue…

schools |>
  filter(revenue < 5000000) |>
  ggplot() + 
  geom_boxplot(mapping = aes(y = students))

We can see there are probably quite a bit of outliers, but then again when you look at the data holistically, there isn’t one or a few schools that really stand out.

small <- subset(schools, revenue < 5000000)
head(small[order(-small$students), ],10)

## # A tibble: 10 × 7
##    institution_name                  men  women rev_men rev_wom students revenue
##    <chr>                           <dbl>  <dbl>   <int>   <dbl>    <dbl>   <dbl>
##  1 San Francisco State University  9360. 12348.  7.70e6  8.32e6   21708.  3.20e6
##  2 California State Polytechnic … 11505. 10035.  1.01e7  8.38e6   21539.  3.70e6
##  3 California State University-L…  8496. 11845.  9.75e6  1.14e7   20341.  4.22e6
##  4 University of California-Sant…  8371.  8361.  4.35e6  4.65e6   16733.  1.80e6
##  5 California State University-S…  6396.  9841.  8.71e6  1.05e7   16237.  3.83e6
##  6 The University of Texas at Da…  8720.  6660.  3.87e6  4.22e6   15380.  1.62e6
##  7 California State University-C…  7104.  7993.  1.01e7  9.30e6   15097.  3.88e6
##  8 Montclair State University      5663.  9093.  6.92e6  6.13e6   14756.  2.61e6
##  9 SUNY at Binghamton              6777   6335   1.06e7  8.80e6   13112   3.88e6
## 10 Rowan University                7226.  5885.  8.73e6  7.69e6   13111.  3.28e6

Interestingly, most of these schools that do stand out are in California. I’m not quite sure why this is. Maybe its because California is a very large state in terms of population with a high amount of people attending colleges, but despite having high attendance many Californians opt to support other, more popular schools in California like Cal or Stanford. This would just be a guess, and this likely can’t be fully explained with the data available in this data set.

Correlation Coeffecient

round(cor(schools$students, schools$revenue), 2)

## [1] 0.71

small = schools |> filter(revenue < 5000000)
large = schools |> filter(revenue > 10000000)

round(cor(small$students, small$revenue), 2)

## [1] 0.19

round(cor(large$students, large$revenue), 2)

## [1] 0.6

When broken down between small and large, this makes a lot of sense based on the visualizations. The majority of the data seemed to be a vertical line when looking at students and revenue for the smaller revenue schools, whereas there seemed to be a clear, positive correlation with bigger schools. The interesting part I think was the even larger correlation when the data was put together. I thought, for reasons explained above, that this wouldn’t be the case, and that was my desire for breaking the data up into large and small groups.

I’m curious if I broke this down further, such as separating schools between divisions, if the correlation would decrease. I initially thought it would increase, but I also thought the example above would too, so now I think I have substantial doubt if it will be true or not.

Confidence Interval

schools_mean <- mean(schools$revenue)
schools_sd <- sd(schools$revenue)
schools_mean

## [1] 10890463

schools_sd

## [1] 20449235

# we use a "_" so we don't overwrite the original function
boot_ci <- function (v, func = median, conf = 0.95, n_iter = 1000) {
  # the `boot` library needs the function in this format
  boot_func <- \(x, i) func(x[i], na.rm=TRUE)
  
  b <- boot(v, boot_func, R = n_iter)
  
  boot.ci(b, conf = conf, type = "perc")
}

boot_ci(schools$revenue, mean, 0.95)

## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 1000 bootstrap replicates
## 
## CALL : 
## boot.ci(boot.out = b, conf = conf, type = "perc")
## 
## Intervals : 
## Level     Percentile     
## 95%   ( 9709287, 12177113 )  
## Calculations and Intervals on Original Scale

se = schools_sd / sqrt(length(schools$revenue))
print(schools_mean + (1.96 * se))

## [1] 12077025

print(schools_mean - (1.96 * se))

## [1] 9703902

Using a bootstrapped approach and Z-scores yielded similar results for a confidence interval of the response variable of revenue. This means that the true population mean for revenue will fall between roughly $9.7m and $12.1m 95% of the time.

Pair 2: Revenues and Expenses

Expenses: explanatory
Revenue: response
But also… maybe the opposite

It is important to note that I hypothesis that this is the relationship between these two variables. For some sports I am more confident that this relationship exists than others, and for some schools I think the opposite might actually hold true. Sometimes the more you spend, the more revenue you can get. Sometimes the more revenue you get, the more expenditures you can make, and this is primarily driven by the non-profit nature of many institutions. I think there is a fair argument that both variables might be both response and explanatory variables depending on the institution in question.

Visualization of Relationships

# groups information by school
schools2 <- ncaa |>
  group_by(institution_name) |>
  summarise(exp_men = sum(exp_men, na.rm = TRUE),
            exp_women = sum(exp_women, na.rm = TRUE),
            rev_men = sum(rev_men, na.rm = TRUE),
            rev_wom = sum(rev_women, na.rm = TRUE))

# creates new columns that add up men and women values
# /5 to average over 5 years
schools2$expenses = (schools2$exp_men + schools2$exp_women)/5
schools2$revenue = (schools2$rev_men + schools2$rev_wom)/5

schools2 |>
  ggplot() +
  geom_point(mapping = aes(x = expenses, y = revenue)) +
  labs(title = "Athletic Revenue and Expenses",
       x = "Average Annual Athletic Expenses", y = "Average Annual Athletic Revenue") +
  theme_classic()

This data looks to be very positively correlated, much more so than I anticipated. The further right we go, the more spread out some of the figures look, and some of these outliers I will touch on below. I think it would look a bit more interesting as we zoom in, especially with smaller programs, but overall nothing stands out of the ordinary.

schools2 |>
  ggplot() + 
  geom_boxplot(mapping = aes(y = revenue))

I like looking at these box plots because it shows that there are a lot of “outlier” schools when looking at the entirety of the NCAA. Unsurprisingly, these tend to be schools you’ve heard of with gigantic, and ultra successful athletic programs. Here’s some examples:

top_schools = schools2[order(-schools2$revenue), ]
head(top_schools$institution_name, 5)

## [1] "The University of Texas at Austin" "University of Michigan-Ann Arbor" 
## [3] "Ohio State University-Main Campus" "The University of Alabama"        
## [5] "University of Georgia"

Texas, Michigan, Ohio, Alabama, and Georgia. Unsurprisingly, these are some of the top schools consistently in Football, and even their less popular sports churn out Olympians and national champions frequently. In terms of the entire NCAA, these top Division 1 schools are all outliers. However, the only real outlier among these would be Texas at Austin. Their revenue is crazy, even among Division 1 schools.

Correlation Coefficients

round(cor(schools2$expenses, schools2$revenue), 2)

## [1] 0.98

That came as little surprise given the visualization we saw earlier. With how highly correlated these two variables are, it makes me think that the response variable is actually expenses.

low_rev = filter(schools2, revenue < 1000000)
round(cor(low_rev$expenses, low_rev$revenue), 4)

## [1] 0.9973

It’s almost too perfect. I know some teams have at least some fixed expenses to carry on their sport, but revenue is not guaranteed. In fact, revenue is always not guaranteed as it can fluctuate dramatically depending on success in different championships. Since this is almost perfectly correlated though, it really makes me think that as schools get revenue, they spend that revenue on their programs. Additionally, I think if programs make excess revenue, they very efficiently share that revenue with other programs. I would love to have additional data to separate revenue and expenses before any sharing programs happened, but until I can scrutinize that data this is my going hypothesis on why this correlation is nearly perfect.

Confidence Interval

schools2_mean <- mean(schools2$expenses)
schools2_sd <- sd(schools2$expenses)
schools2_mean

## [1] 9955960

schools2_sd

## [1] 16635412

boot_ci(schools2$expenses, mean, 0.95)

## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 1000 bootstrap replicates
## 
## CALL : 
## boot.ci(boot.out = b, conf = conf, type = "perc")
## 
## Intervals : 
## Level     Percentile     
## 95%   ( 9042634, 10911197 )  
## Calculations and Intervals on Original Scale

se = schools2_sd / sqrt(length(schools2$expenses))
print(schools2_mean + (1.96 * se))

## [1] 10921225

print(schools2_mean - (1.96 * se))

## [1] 8990694

Using a bootstrapped approach and Z-scores yielded similar results for a confidence interval of the response variable of revenue. This means that the true population mean for expenses for an institution will fall between roughly $9m and $11m 95% of the time.

Week 6 Data Dive

Thomas Reedy

2024-10-06

Week 6 Data Dive - Confidence Intervals

Pair 1: Total Students and University Profit

Visualizing the Relationship

Correlation Coeffecient

Confidence Interval

Pair 2: Revenues and Expenses

Visualization of Relationships

Correlation Coefficients

Confidence Interval