Project Meeting 1: Data Discovery

# conflicted turns function-name clashes into explicit errors; preferences are set below
library(conflicted)  

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2
conflict_prefer("filter", "dplyr")
## [conflicted] Will prefer dplyr::filter over any other package.
conflict_prefer("lag", "dplyr")
## [conflicted] Will prefer dplyr::lag over any other package.
# load ncaa file I cleaned
ncaa <- read.csv("./ncaa_clean.csv", header = TRUE)

Summary of Data

About the data set

I initially found this data set on a tidytuesday post. The original data comes from the US Department of Education database, and the tidytuesday post included a few recent years of available data along with a limited number of columns. There’s an accompanying Kaggle post as well with the same subset of data.

Original link: https://github.com/rfordatascience/tidytuesday/blob/master/data/2022/2022-03-29/readme.md

Each row in this data set represents a certain sport that a college or university offers in a given year. Columns include information on the school and sports program, including the division of the school, the number of students attending the school, the number of male and female athletes, total revenue, and total expenses. It’s important to note that almost every sport has distinct men’s and women’s versions, but this data set combines the two as long as the sport is the same (e.g., Men’s Basketball and Women’s Basketball are both recorded as Basketball).

I have slightly altered this data set, though. I split the teams so the men’s and women’s categories appear as their own rows, and I filtered to only include NCAA programs. More information on this cleaning can be found here: https://rpubs.com/tbreedy/1222442.

Furthermore, most of this analysis will be done on Division 1 schools only. Division 1 comes with many benefits for this analysis, such as higher revenues, more funded programs to draw from, and consistent rules for athletic programs. There will be many comparisons across divisions, but the primary research will focus on Division 1.

Goal of this project

I want to discover different relationships between team programs, athletic programs, institutions, and the overall NCAA. These relationships will revolve around revenues, expenses, profits, and counts of programs and available roster spots. So if a team or institution has higher revenue, does that mean it has larger rosters? If an institution has more teams, does it have lower profits? If a team in an institution substantially increases its revenue, do its expenses rise? Do other team programs in the school receive more resources, resulting in higher expenses? Does this vary from division to division?

These questions can be very important. Imagine we are part of a school that has had a lot of success in our football program recently, so our revenues have increased dramatically. We don’t know how we want to use these extra funds yet; do we reinvest in our football program or distribute the extra earnings across the athletic department? If we find that other sports teams become more successful as their football programs become more successful, then we might want to reinvest in creating the best football team and let the rest take care of itself. If we find that our other programs only get better if additional funding goes to them, then we’ll want to distribute a lot of that extra money to our other teams to improve our athletic department as a whole.

Imagine a different scenario: being recruited to compete in a college sport. If we find out that having a successful football program means more money going into other sports, we’d want to go to the school with the best football program so the most resources are spent on us. Maybe this means looking at who has the best football commits and who has the best underclassmen rather than where the team is currently.

By figuring out these relationships, we can provide evidence about how to act in situations similar to those described above.

Visualizations and further investigation

There’s a problem with team sizes:

# filters for all D1 football schools and gets the "total number of men" each year
ncaa_football_d1 <- ncaa |>
  filter(sports == 'Football') |>
  filter(classification_code %in% c(1, 2, 3)) |>
  group_by(institution_name, year) |>
  summarise(fb_players = sum(sum_partic_men))
## `summarise()` has grouped output by 'institution_name'. You can override using
## the `.groups` argument.
ncaa_football_d1 |>
  ggplot() +
  geom_histogram(mapping = aes(x = fb_players)) +
  labs(title = "Total Number of Male Athletes on D1 Football Teams",
       x = "Total Number of Male Athletes per Year",
       y = "Frequency") +
  #geom_vline(mapping = aes(xintercept = '105', color='red') +
  geom_vline(xintercept = 120, color='red') +
  theme_classic()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The problem is that the roster limit for football is 120 men, yet many teams clearly list more men than is allowed. This is potentially problematic because the available documentation doesn’t explain how this is possible. There are a few potential explanations, such as the counts being taken before roster cuts, but one way or another this is an error. I am reaching out to people who know more about the data set to find out why this happens so I can adjust my results accordingly.
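
To gauge how widespread the issue is, one option is to count the team-seasons above the limit. The sketch below is only an illustration: `fb_example` is a made-up stand-in that borrows the column names of `ncaa_football_d1`, and `roster_limit` is simply the 120 figure quoted above.

```r
library(dplyr)

# Hypothetical stand-in for ncaa_football_d1 (all values are made up)
fb_example <- data.frame(
  institution_name = c("School A", "School A", "School B", "School B", "School C"),
  year = c(2018, 2019, 2018, 2019, 2018),
  fb_players = c(110, 125, 131, 118, 98)
)

roster_limit <- 120  # limit assumed in the text above

# Count and share of team-seasons above the limit
over_limit <- fb_example |>
  summarise(n_over = sum(fb_players > roster_limit),
            share_over = mean(fb_players > roster_limit))

# The individual team-seasons that exceed the limit
offenders <- fb_example |>
  dplyr::filter(fb_players > roster_limit)

over_limit
offenders
```

Running the same two steps on `ncaa_football_d1` would show whether the oversized rosters are concentrated in a few schools or spread across the division.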

More filtering may also be needed when comparing programs over all 5 years:

program_count <- ncaa |>
  group_by(institution_name, sports) |>
  summarise(count = n())
## `summarise()` has grouped output by 'institution_name'. You can override using
## the `.groups` argument.
program_change <- program_count |>
  filter(count %% 5 != 0)
program_change |>
  group_by(sports) |>
  summarise(change = n()) |>
  ggplot() +
  geom_point(mapping = aes(x = sports, y=change)) + 
  geom_text(mapping = aes(x = sports, y = change, label=sports), 
            size = 3, nudge_y = 10, check_overlap = TRUE) +
  theme(axis.text.x = element_blank()) +
  labs(x = "sports",
       y = "Frequency a sport wasn't offered every year")

This shows, for each sport, how many institutions did not sponsor it in every year of the data set. This happens for one of two reasons: a university began sponsoring a team, or a university cut a team. The underlying cause could be a program being suspended, a program returning from a suspension, a university creating an athletic department, a university closing, and likely many more. I think it’s likely that, at least for some of my analyses, I’ll need to filter out programs that weren’t offered every year, or even filter out universities that just created or dropped their programs entirely, so fair comparisons across years can be made.
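
The filtering described above could be sketched as follows. This is a minimal example on made-up data, not the final approach: `ncaa_example` is a hypothetical stand-in that reuses the `institution_name`, `sports`, and `year` columns, and it keeps only institution/sport pairs that appear in every year present in the data.

```r
library(dplyr)

# Hypothetical stand-in for the ncaa data (all values are made up)
ncaa_example <- data.frame(
  institution_name = c(rep("School A", 3), rep("School B", 2), rep("School A", 3)),
  sports = c(rep("Men's Soccer", 3), rep("Men's Soccer", 2), rep("Women's Golf", 3)),
  year = c(2017, 2018, 2019, 2017, 2018, 2017, 2018, 2019)
)

# Number of distinct years covered by the data
n_years <- n_distinct(ncaa_example$year)

# Keep only institution/sport pairs that appear in every year
ncaa_every_year <- ncaa_example |>
  group_by(institution_name, sports) |>
  dplyr::filter(n_distinct(year) == n_years) |>
  ungroup()

ncaa_every_year
```

Counting distinct years rather than rows avoids relying on the number of rows per program being a multiple of five.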

Plan moving forward

  1. Contact people to resolve remaining questions about the data documentation. How are counts of men/women or people in the school calculated?
  2. Create different formatting of data that can be more applicable for further analyses.
    1. Create a data frame that contains every school for one year, e.g., ncaa_16, ncaa_17, and so on.
    2. Add new columns such as change in revenue, expenses, and athletes from the prior year.
  3. Continue finding ways to implement statistical techniques used in class to further my research into figuring out relationships.
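
Step 2 of the plan might look something like the sketch below. It runs on made-up data, and the `revenue` and `athletes` column names are assumptions for illustration rather than the names in the cleaned file. A named list stands in for separate objects such as ncaa_16 and ncaa_17.

```r
library(dplyr)

# Hypothetical stand-in data; `revenue` and `athletes` are assumed column names
ncaa_example <- data.frame(
  institution_name = rep(c("School A", "School B"), each = 3),
  sports = "Men's Basketball",
  year = rep(2016:2018, times = 2),
  revenue = c(100, 120, 90, 50, 50, 80),
  athletes = c(15, 16, 14, 13, 15, 15)
)

# One data frame per year, stored in a named list (e.g. ncaa_by_year[["2016"]])
ncaa_by_year <- split(ncaa_example, ncaa_example$year)

# Change from the prior row within each institution/sport pair, ordered by year
ncaa_changes <- ncaa_example |>
  group_by(institution_name, sports) |>
  arrange(year, .by_group = TRUE) |>
  mutate(revenue_change = revenue - dplyr::lag(revenue),
         athletes_change = athletes - dplyr::lag(athletes)) |>
  ungroup()

ncaa_changes
```

The first year of each program has no prior row, so its change columns are `NA`. The lag also assumes consecutive years; a program with a missing year would get a change that spans the gap.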

Initial Findings

Hypothesis 1: All teams benefit when a football team exists

I hypothesize that if a university has a football team, other sports offered by that institution will see benefits in their own program because of it. I will test these “benefits” by looking at figures like revenues, expenses, and roster counts on other teams. I am not trying to prove causation, only correlation. An example can look like the one below:

# creates example data we might collect for this hypothesis
teams_data <- data.frame(
  team = c("A", "B", "C", "D", "E", "F"),
  football_program = c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE),
  athletes = c(150, 180, 100, 200, 90, 120),
  revenues = c(5000000, 4500000, 2000000, 5500000, 1800000, 1750000),
  expenses = c(4000000, 3500000, 1500000, 4800000, 1200000, 1400000)
)
# groups teams together whether or not they have a football program
# finds mean values to compare
teams_group <- teams_data |>
  group_by(football_program) |>
  summarise(athletes = mean(athletes),
            revenues = mean(revenues),
            expenses = mean(expenses))
# visualization to show difference
barplot(teams_group$athletes,
        names.arg = teams_group$football_program,
        xlab = "If Football is Offered", 
        ylab = "Number of Athletes",
        col = c("gray", "red"))
legend("topleft", legend = teams_group$football_program, fill = c("gray", "red"))

It’s not a terrific graph, but it would show that institutions that offer a football program also offer more roster spots for athletes. We can show this same bar plot for revenues and expenses.
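
One possible way to show all three measures at once is to reshape the group means into long format and facet the plot. The sketch below reuses the made-up example data from this section; `teams_long`, `measure`, and `mean_value` are names introduced only for this illustration.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Same made-up example data as in the hypothesis above
teams_data <- data.frame(
  team = c("A", "B", "C", "D", "E", "F"),
  football_program = c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE),
  athletes = c(150, 180, 100, 200, 90, 120),
  revenues = c(5000000, 4500000, 2000000, 5500000, 1800000, 1750000),
  expenses = c(4000000, 3500000, 1500000, 4800000, 1200000, 1400000)
)

# Group means in long format: one row per group and measure
teams_long <- teams_data |>
  group_by(football_program) |>
  summarise(across(c(athletes, revenues, expenses), mean)) |>
  pivot_longer(cols = -football_program,
               names_to = "measure",
               values_to = "mean_value")

# One panel per measure, each with its own y-axis scale
p <- ggplot(teams_long) +
  geom_col(mapping = aes(x = football_program, y = mean_value,
                         fill = football_program)) +
  facet_wrap(~ measure, scales = "free_y") +
  labs(x = "If Football is Offered", y = "Group Mean") +
  theme_classic()

p
```

Free y-axis scales keep the athlete counts readable next to the much larger dollar figures.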

Hypothesis 2: Size of schools has no/little correlation with revenue

I don’t think the size of the school matters much when creating smaller athletic programs. I do think school size correlates with having a large football team, so schools with football teams will need to be compared with each other, and schools without will need to be compared with each other. When that is taken into account, I don’t suspect there will be much of a correlation, if any. Here’s an example:

# creates example data we might collect for this hypothesis
school_rev <- data.frame(
  school = c("A", "B", "C", "D", "E", "F"),
  students = c(10000, 12000, 14000, 8000, 6000, 16000),
  revenues = c(5000000, 4800000, 5200000, 5600000, 4600000, 4900000)
)
plot(x=school_rev$students, y=school_rev$revenues)

Once again, this isn’t a great graph. It shows that there might be a small positive correlation, but it’s pretty flat. So in this example, the number of students would have little to no correlation with athletic revenues.
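
The visual impression can be backed with a number. As a rough sketch on the same made-up example data, `cor()` gives the Pearson correlation and `cor.test()` adds a significance test; with only six invented points this just illustrates the workflow and is not a result.

```r
# Same made-up example data as above
school_rev <- data.frame(
  school = c("A", "B", "C", "D", "E", "F"),
  students = c(10000, 12000, 14000, 8000, 6000, 16000),
  revenues = c(5000000, 4800000, 5200000, 5600000, 4600000, 4900000)
)

# Pearson correlation between enrollment and revenue
r <- cor(school_rev$students, school_rev$revenues)

# Test of the null hypothesis that the true correlation is zero
r_test <- cor.test(school_rev$students, school_rev$revenues)

r
r_test$p.value
```

For this example the correlation comes out very close to zero, which matches the flat pattern in the plot.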