Fetching dataset and libraries

library(tidyverse)
library(ggthemes)
library(ggrepel)
library(patchwork)
library(broom)
library(lindia)
library(car)
library(boot)
data <- read_delim("./sports.csv", delim = ',')

Filtering for NCAA Division I

# filtering NCAA Division I
ncaa_I <- data |>
  filter(classification_name %in% c('NCAA Division I without football', 
  'NCAA Division I-FBS','NCAA Division I-FCS')) 

Preprocessing

# these are rows of data with no information on athletes and revenue. Basically NA values on both men and women athletes. I'm removing those.
# Also I filled NA with zeros where only one column was NA. This allows us to count the Total Athletes. Also Football has NA for women, so having zero there makes sense. 
filtered_ncaa_I <- ncaa_I |>
  filter(!(is.na(partic_men) & is.na(partic_women))) |>
  mutate(
    partic_men = ifelse(is.na(partic_men), 0, partic_men),
    partic_women = ifelse(is.na(partic_women), 0, partic_women)) |>
  filter(sector_name == 'Public, 4-year or above')
# Adding total athletes column
filtered_ncaa_I <- filtered_ncaa_I |>
  mutate(Total_Athletes = partic_men + partic_women)
# Athletes in each university 
summarized_data <- filtered_ncaa_I |>
  group_by(institution_name) |>
  summarise(Athletes = sum(Total_Athletes),
            Revenue = sum(total_rev_menwomen))

EDA

#plotting the code chunk above
ggplot(summarized_data, aes(x = institution_name, y = Revenue, fill = Athletes)) +
  geom_bar(stat = "identity") +  # Create a bar plot
  scale_fill_gradient(low = "lightblue", high = "darkblue") +  # Color scale for Athletes
  theme_minimal() +  # Minimal theme for clarity
  theme(
  plot.title = element_text(size = 16, hjust = 0.5, face = 'bold'),
  axis.text.x = element_text(size = 10.5,hjust = 1)
  ) +
  scale_x_discrete(labels = c(
    "Ball State University" = "Ball State",
    "Indiana University-Purdue University-Fort Wayne" = "IUPUF",
    "Indiana University-Purdue University-Indianapolis" = "IUPUI",
    "Indiana State University"="ISU",
    "Purdue University-Main Campus" = "Purdue",
    "Indiana University-Bloomington" = "IU Bloom",
    "Purdue University Fort Wayne"= "Purdue Fort Wayne"
  )) +
  labs(
    title = "Revenue by Institution w/ Varying Athlete Count",
    x = "Public Institutions",
    y = "Revenue (in USD)",
    fill = "Athletes"  # Legend title for the fill color
  )

The barplot shows that higher athletes are linked with having more revenue in a university. IU Bloomington and Purdue Main Campus have most athletes and most revenue compared to others. We’ll test this relationship with ANOVA later.

#box plot
ggplot(filtered_ncaa_I, mapping = aes(x=institution_name,y=total_rev_menwomen)) +
  geom_boxplot()+
  scale_x_discrete(labels = c(
    "Ball State University" = "Ball State",
    "Indiana University-Purdue University-Fort Wayne" = "IUPUF",
    "Indiana University-Purdue University-Indianapolis" = "IUPUI",
    "Indiana State University"="ISU",
    "Purdue University-Main Campus" = "Purdue",
    "Indiana University-Bloomington" = "IU Bloom",
    "Purdue University Fort Wayne"= "Purdue Fort Wayne"
  )) +
  scale_y_log10() +
  theme_minimal() +
  labs(
    title = "Distributions of Universities in terms of Revenue Generated",
    x = "Institutions",
    y = "Revenue (log scaled)") + 
  theme(
  plot.title = element_text(hjust = 0.5, face = 'bold'),
  axis.text.x = element_text(,hjust = 1)
  ) 

Boxplot shows that most institutions distribution looks similar with close enough median revenue except IU Bloomington has lower median revenue. However, we can’t say if the means are different among the groups. Again, ANOVA will help more with that.

#scatterplot
#we'll use this for linear regression later
ggplot(summarized_data, mapping = aes(x = Athletes, y = Revenue)) +
  geom_point(color = 'chocolate', size = 2.8) +
  theme_minimal() +
  geom_smooth(method = "lm", color = "red", se = FALSE, linewidth = 0.3) +
  labs(
    title = 'Positive linear relationship between athletes & revenue',
    y= "Revenue (in USD)"
  ) +
  theme(
    plot.title = element_text(hjust = 0.5, face = 'bold')
  )
## `geom_smooth()` using formula = 'y ~ x'

This scatterplot was used to confirm the linear trend between athlete count and institutions. We see a positive relationship here, so we’ll attempt linear regression model. But it should be kept in mind that our aggregated data is still a small sample size so interpretations of the model should be done with caution.

Hypothesis Testing

ANOVA

#assumption check first
filtered_ncaa_I |>
  filter(institution_name == "Purdue University-Main Campus") |>
  ggplot(mapping = aes(x = total_rev_menwomen)) +
  geom_histogram() +
  scale_x_log10()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# variance checks
filtered_ncaa_I |>
  filter(institution_name == "Purdue University Fort Wayne") |>
  select(total_rev_menwomen) |>
  pull() |>
  sd()
## [1] 791696.7

The assumptions mostly look okay, but the variance in IU Bloomington and Purdue is much higher than others, though this could be due to larger sample size. Keeping this in mind, we’ll proceed.

anova_test <- aov(total_rev_menwomen ~ institution_name, data = filtered_ncaa_I)
summary(anova_test)
##                   Df    Sum Sq   Mean Sq F value  Pr(>F)   
## institution_name   6 1.184e+15 1.974e+14    2.93 0.00859 **
## Residuals        301 2.028e+16 6.736e+13                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

F-statistic of 2.93 is greater than one, and p-value is small, so we can say that institutions or different athlete count does impact revenue. This indirectly means that there is a relationship among them. Now let’s see which group(s) is/are actually different from others with pairwise t-test.

Pairwise t-test

pairwise.t.test(filtered_ncaa_I$total_rev_menwomen, filtered_ncaa_I$institution_name, p.adjust.method = "bonferroni")
## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  filtered_ncaa_I$total_rev_menwomen and filtered_ncaa_I$institution_name 
## 
##                                                   Ball State University
## Indiana State University                          1.00                 
## Indiana University-Bloomington                    0.26                 
## Indiana University-Purdue University-Fort Wayne   1.00                 
## Indiana University-Purdue University-Indianapolis 1.00                 
## Purdue University Fort Wayne                      1.00                 
## Purdue University-Main Campus                     0.40                 
##                                                   Indiana State University
## Indiana State University                          -                       
## Indiana University-Bloomington                    0.36                    
## Indiana University-Purdue University-Fort Wayne   1.00                    
## Indiana University-Purdue University-Indianapolis 1.00                    
## Purdue University Fort Wayne                      1.00                    
## Purdue University-Main Campus                     0.50                    
##                                                   Indiana University-Bloomington
## Indiana State University                          -                             
## Indiana University-Bloomington                    -                             
## Indiana University-Purdue University-Fort Wayne   0.24                          
## Indiana University-Purdue University-Indianapolis 0.20                          
## Purdue University Fort Wayne                      1.00                          
## Purdue University-Main Campus                     1.00                          
##                                                   Indiana University-Purdue University-Fort Wayne
## Indiana State University                          -                                              
## Indiana University-Bloomington                    -                                              
## Indiana University-Purdue University-Fort Wayne   -                                              
## Indiana University-Purdue University-Indianapolis 1.00                                           
## Purdue University Fort Wayne                      1.00                                           
## Purdue University-Main Campus                     0.33                                           
##                                                   Indiana University-Purdue University-Indianapolis
## Indiana State University                          -                                                
## Indiana University-Bloomington                    -                                                
## Indiana University-Purdue University-Fort Wayne   -                                                
## Indiana University-Purdue University-Indianapolis -                                                
## Purdue University Fort Wayne                      1.00                                             
## Purdue University-Main Campus                     0.29                                             
##                                                   Purdue University Fort Wayne
## Indiana State University                          -                           
## Indiana University-Bloomington                    -                           
## Indiana University-Purdue University-Fort Wayne   -                           
## Indiana University-Purdue University-Indianapolis -                           
## Purdue University Fort Wayne                      -                           
## Purdue University-Main Campus                     1.00                        
## 
## P value adjustment method: bonferroni

From this test, it seems that IU Bloomington and Purdue Main Campus are distinct from others. This is because they have less correlation with others based on the values we see on the table.

Confidence Intervals

#bootstraping first

boot_ci <- function (v, func = median, conf = 0.95, n_iter = 100) {
  # the function we want to run on each iteration i
  boot_func <- \(x, i) func(x[i])
  
  # the boot instance, running the function on each iteration
  b <- boot(v, boot_func, R = n_iter)
  b <- boot.ci(b, conf = conf, type = "perc")
  
  # return lower and upper values
  return(c("lower" = b$percent[4],
           "upper" = b$percent[5]))
}
df_ci <- filtered_ncaa_I |>
  group_by(institution_name) |>
  summarise(ci_lower = boot_ci(total_rev_menwomen, mean)['lower'],
            mean_price = mean(total_rev_menwomen),
            ci_upper = boot_ci(total_rev_menwomen, mean)['upper'])

df_ci
## # A tibble: 7 × 4
##   institution_name                                  ci_lower mean_price ci_upper
##   <chr>                                                <dbl>      <dbl>    <dbl>
## 1 Ball State University                             1179918.   1611513. 2194769.
## 2 Indiana State University                          1058398.   1447926. 1851907.
## 3 Indiana University-Bloomington                    2106728.   5236822. 8689048.
## 4 Indiana University-Purdue University-Fort Wayne    516769.    792118. 1013645.
## 5 Indiana University-Purdue University-Indianapolis  765340.    993406. 1254484.
## 6 Purdue University Fort Wayne                       518964.    958862. 1434946.
## 7 Purdue University-Main Campus                     2794337.   5218693. 8374213.
#plotting CI
df_ci |>
  ggplot() +
  geom_errorbarh(mapping = aes(y = institution_name, 
                               xmin=ci_lower, xmax=ci_upper,
                               color = '95% C.I.'), height = 0.5) +
  geom_point(mapping = aes(x = mean_price, y = institution_name,
                           color = 'Group Mean'),
             shape = '|',
             size = 5) +
  scale_color_manual(values=c('black', 'red')) +
  labs(title = "Revenue Based on Athlete Count in Institutions",
       x = "Revenue (in $)",
       y = "Institutions",
       color = '') +
  theme(
    plot.title = element_text(hjust = 0.5, face = 'bold')
  )

Confidence intervals confirm previous results by showing that IU B. and Purdue are in fact different than the rest. And they’ve much higher athlete count than others as well. So we can conclude that higher athlete count may lead to higher revenue.

Finally, we do linear regression to see if can model this relationship. Again, our sample size is very small even though we have aggregated values, so the model could over-fit the sample. So interpret keeping that in mind and try not to generalize to other regions that could be different from Indiana.

Linear Regression

linear_model <- lm(Revenue ~ Athletes, summarized_data)
linear_model$coefficients
##  (Intercept)     Athletes 
## -86690260.93     97468.54
summary(linear_model)
## 
## Call:
## lm(formula = Revenue ~ Athletes, data = summarized_data)
## 
## Residuals:
##          1          2          3          4          5          6          7 
##  -55514319 -108282480   59592407   13594827  -48724441   61432618   77901387 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)  
## (Intercept) -86690261   63254741  -1.370   0.2289  
## Athletes        97468      25542   3.816   0.0124 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 78430000 on 5 degrees of freedom
## Multiple R-squared:  0.7444, Adjusted R-squared:  0.6933 
## F-statistic: 14.56 on 1 and 5 DF,  p-value: 0.01242

The coefficients tell us that if we add one athlete, our revenue could go up by $97468 with standard error of 25542.

Overall conclusion:

We come to the conclusion that higher number of athletes in a university in Indiana may lead to higher revenue. This was supported by statistical tests we performed.