# This works to get rid of errors
library(conflicted)  

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2
conflict_prefer("filter", "dplyr")
## [conflicted] Will prefer dplyr::filter over any other package.
conflict_prefer("lag", "dplyr")
## [conflicted] Will prefer dplyr::lag over any other package.
library(ggthemes)
library(ggrepel)

** Note: the documentation for this data set is pretty poor, although most of the fields are relatively straightforward to understand. When I reference “documentation,” I’m also including the supplemental websites linked from the TidyTuesday post, such as a post on Kaggle, which provide additional context for understanding the data.

Week 4 Data Dive

# load ncaa file I cleaned
ncaa <- read.csv("./ncaa_clean.csv", header = TRUE)
# loads original NCAA file
ncaa_original <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-03-29/sports.csv')
## Rows: 132327 Columns: 28
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (8): institution_name, city_txt, state_cd, zip_text, classification_nam...
## dbl (20): year, unitid, classification_code, ef_male_count, ef_female_count,...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Three columns or values that are unclear before reading documentation

colnames(ncaa)
##  [1] "X"                    "year"                 "unitid"              
##  [4] "institution_name"     "city_txt"             "state_cd"            
##  [7] "zip_text"             "classification_code"  "classification_name" 
## [10] "classification_other" "ef_male_count"        "ef_female_count"     
## [13] "ef_total_count"       "sector_cd"            "sector_name"         
## [16] "sportscode"           "partic_men"           "partic_women"        
## [19] "partic_coed_men"      "partic_coed_women"    "sum_partic_men"      
## [22] "sum_partic_women"     "rev_men"              "rev_women"           
## [25] "total_rev_menwomen"   "exp_men"              "exp_women"           
## [28] "total_exp_menwomen"   "sports"               "pct_men"

Classification Code/Name

Classification Name/Code isn’t very clear from the get go.

unique(ncaa$classification_name)
## [1] "NCAA Division I-FCS"                "NCAA Division I-FBS"               
## [3] "NCAA Division II without football"  "NCAA Division III with football"   
## [5] "NCAA Division II with football"     "NCAA Division I without football"  
## [7] "NCAA Division III without football"
unique(ncaa_original$classification_name)
##  [1] "NCAA Division I-FCS"                "NCAA Division I-FBS"               
##  [3] "NCAA Division II without football"  "NJCAA Division I"                  
##  [5] "NCAA Division III with football"    "USCAA"                             
##  [7] "NAIA Division I"                    "NCCAA Division I"                  
##  [9] "NCAA Division II with football"     "NJCAA Division II"                 
## [11] "NCAA Division I without football"   "Other"                             
## [13] "NCAA Division III without football" "CCCAA"                             
## [15] "NAIA Division II"                   "NJCAA Division III"                
## [17] "NCCAA Division II"                  "Independent"                       
## [19] "NWAC"

Maybe your average person can understand NCAA Division I, but NAIA? I-FCS? NJCAA? NWAC? These can get confusing in a hurry. The documentation shows that these cover all the different associations and division classifications schools can compete in. FBS and FCS, for example, are short for Football Bowl Subdivision and Football Championship Subdivision. These are different “levels” of Division I football, the former being the “highest level.” Its best-known members are the “Power 5” schools, or schools from the top 5 football conferences, although FBS includes other conferences as well. NJCAA is for junior colleges, commonly community colleges, as are other associations like NWAC.

By looking at the documentation to help explain what’s going on here, you can start to get an intuition for what schools and sports in different classifications look like. NJCAA is probably going to have smaller, younger teams, NCAA teams will likely have more recognition than NCCAA teams, and Division I teams are likely to be more competitive and have higher revenues than Division III teams.

If you didn’t read the documentation, there’s a chance you might see NCCAA Division I and assume it was a typo for NCAA. You might also assume that as long as it’s Division I, whether it’s NCAA, NCCAA, NAIA, or NJCAA, the schools are comparable, and you’d be misled. I think this is also why they chose to be very specific about the classification, even though some divisions are very similar. Some divisions don’t even have a formal distinction between “with football” and “without football,” but the two groups have such varied revenue, expenses, and roster spots that each needs its own distinct classification.

For my data set for this assignment, I have simplified this to only include NCAA schools.
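A minimal sketch of that simplification, assuming the NCAA rows can be picked out by a `classification_name` that starts with "NCAA " (the toy data frame and the `ncaa_only` name are illustrative, not my actual cleaning script):

```r
library(dplyr)

# Toy stand-in for the original data; classification values are taken from
# the unique() output above, school names are made up
toy <- data.frame(
  institution_name = c("School A", "School B", "School C", "School D"),
  classification_name = c("NCAA Division I-FBS", "NCCAA Division I",
                          "NJCAA Division II", "NCAA Division III with football")
)

# Keep only rows whose classification starts with "NCAA ".
# "NCCAA ..." does not start with that prefix, so it is not matched by accident.
ncaa_only <- toy |>
  filter(startsWith(classification_name, "NCAA "))

ncaa_only
```

Matching on the prefix rather than on a hand-typed list of the seven NCAA labels avoids the NCAA/NCCAA mix-up described above.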

EF Male and Female Count

What is an EF count? How does this relate to athletics? What does this mean?

Turns out, when looking at the documentation, this has nothing at all to do with sports and everything to do with the size of the school. These columns simply show how many men and women attend the school. It’s unclear if this is just undergraduates, both undergraduate and postgraduate students, etc., but it does help give a feel for the size of the school. Without knowing what these columns meant, I don’t think they’d be of much use. Maybe you would confuse them with the number of athletes sponsored at the school, but a simple filter and count would easily show otherwise.
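Here is a rough sketch of what that “filter and count” check could look like. The numbers are made up for illustration; only the column names come from the data set.

```r
library(dplyr)

# Hypothetical rows: one school, two sports, with the enrollment figure
# repeated on every sport row (made-up numbers)
toy <- data.frame(
  institution_name = c("School A", "School A"),
  sports = c("Football", "Soccer"),
  ef_male_count = c(9000, 9000),
  sum_partic_men = c(110, 28)
)

# If ef_male_count were an athlete count, it would be close to the
# participant total instead of dwarfing it
check <- toy |>
  group_by(institution_name) |>
  summarise(
    enrolled_men = first(ef_male_count),
    athlete_men = sum(sum_partic_men),
    athletes_per_100_enrolled = 100 * athlete_men / enrolled_men
  )

check
```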

I’m not sure where the “ef” comes from, even after some Google searching. I’m sure there’s some purpose. Perhaps the “e” comes from enrollment? Who knows.

Unitid

At first I thought this was a typo. “Unit ID” might make a bit more sense, but what’s a unit? Turns out this is a school ID, unique to each university.

Before reading the extra documentation that describes this, there were some signs that this was related to the university. In the first few rows, you can see it remains the same for each university. However, you might not be certain whether this is an ID given by a conference or division rather than one for the school itself. If a school changes divisions, would the value change? If the school adds football or moves to a different conference, would the value stay the same? You would have to make an assumption about that unless you filtered through schools that changed during this time period (and not all of that information is available in this data set), but even then you’d still be taking a risk.

I think this was done so schools could be compared more easily. It might be nicer to filter by distinct numbers than by names, which can be incredibly similar. And, although it’s unlikely to happen, our own university went through a name change a few months ago. If this happened during the time frame listed and we were searching by the name of the school, we’d be confused about what happened to us and why a new school came into existence. I think it’s some good redundancy to have.
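One way to look for this kind of name change is to count distinct names per `unitid`. This is only a sketch on made-up rows (the school names and IDs are hypothetical); I have not verified what it turns up in the real data.

```r
library(dplyr)

# Hypothetical rows: unitid 1001 is renamed between years, 1002 is not
toy <- data.frame(
  unitid = c(1001, 1001, 1002, 1002),
  year = c(2018, 2019, 2018, 2019),
  institution_name = c("Old Name University", "New Name University",
                       "School B", "School B")
)

# A unitid that maps to more than one name points to a rename
# (or to inconsistent spelling across years)
name_changes <- toy |>
  group_by(unitid) |>
  summarise(n_names = n_distinct(institution_name)) |>
  filter(n_names > 1)

name_changes
```

Swapping the roles (grouping by `institution_name` and counting distinct `unitid` values) would flag the opposite problem: one name shared by several IDs.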

Some things are still unclear

One of those is the participating men and women. This may appear straightforward: it’s how many men and women are on the team. But this isn’t a static number…

  • If athletes join the team during the academic year (this is rare), the number can go up.

  • If athletes quit or get kicked off the team (this happens very frequently), the number can go down.

  • Many programs start the year with more people than their roster caps allow, so many athletes get cut soon after practices start.

  • Some programs host tryouts where not all those who try out make the team.

How do these scenarios impact the overall number of athletes? Is it counted by how many start the season? How many finish the season? How many are on the roster by the first competition? It can’t be a weighted average since all the values are whole numbers, unless it’s rounded to the nearest integer and the documentation didn’t mention it. It’s not really possible to know given the limited documentation we have access to.
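The “whole numbers” claim is easy to check. A small sketch on a made-up vector (on the real data the same expression could be pointed at a column such as `ncaa$sum_partic_men`):

```r
# Hypothetical participant counts, including a missing value
partic <- c(105, 98, NA, 120, 64)

# TRUE when every non-missing value equals its rounded self
all_whole <- all(partic == round(partic), na.rm = TRUE)

all_whole
```

A `TRUE` here cannot tell an exact head count apart from an average that was rounded before it was published, which is exactly the ambiguity described above.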

To make matters worse, some values are likely erroneously entered. Here’s an example:

# filters for all D1 football schools and gets the "total number of men" each year
ncaa_football_d1 <- ncaa |>
  filter(sports == "Football", classification_code %in% 1:3) |>
  group_by(institution_name, year) |>
  summarise(fb_players = sum(sum_partic_men))
## `summarise()` has grouped output by 'institution_name'. You can override using
## the `.groups` argument.
ncaa_football_d1 |>
  ggplot() +
  geom_histogram(mapping = aes(x = fb_players)) +
  labs(title = "Total Number of Male Athletes on D1 Football Teams",
       x = "Total Number of Male Athletes per Year",
       y = "Frequency") +
  geom_vline(xintercept = 120, color='red') +
  theme_classic()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

This distribution looks to be pretty normal. However, it suggests that either there are errors in the entries, or “total number of male athletes” means something different than “total number of male athletes on the roster at any given time during the season,” because the Division 1 roster limit is 120 athletes. You’d expect nothing to the right of the red line. Interestingly, the limit sits at about the mode of the data, which is kind of what you’d expect. Many teams, especially in a large and popular sport like football, have more athletes who want to join than the school is allowed to carry, even when the school itself would like more. You might expect to see a spike around the 110-120 roster size for these reasons, which is a bit of what we do see.

But we see more: roster sizes over 120. And there are a lot of them.

over_limit <- sum(ncaa_football_d1$fb_players > 120)
print(over_limit)
## [1] 317
sprintf("%.1f%%", over_limit / nrow(ncaa_football_d1) * 100)
## [1] "25.3%"

Over 25% of team-seasons were over the maximum roster size. Roster limits aren’t a suggestion in Division 1 football, so I highly doubt this comes from teams actually being larger than they should be. It could be a collection error or, what I believe is the most likely option, the data is “counting” differently than we’d expect. Although it might not fully explain all the anomalies (or, scarily, it just might), this discrepancy could arise if they counted the number of players on the “roster” before the season began and didn’t adjust for teams trimming down to meet roster limits. An example would be a team allowing 140 guys to start pre-season training, then cutting 20 (or keeping them in some sort of “reserves” system, if that’s possible) when the actual season and games begin.
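Because `ncaa_football_d1` has one row per school per year, the 317 counts school-seasons rather than schools. A sketch of how the same check could be split out per school, using made-up rows (the `toy` data and the `roster_limit` name are illustrative):

```r
library(dplyr)

# Hypothetical school-season counts (one row per school per year)
toy <- data.frame(
  institution_name = c("School A", "School A", "School B", "School B"),
  year = c(2018, 2019, 2018, 2019),
  fb_players = c(125, 131, 118, 121)
)

roster_limit <- 120  # the limit assumed in the text above

# For each school: how many seasons it has, how many exceed the limit,
# and by how much at worst
by_school <- toy |>
  group_by(institution_name) |>
  summarise(
    seasons = n(),
    seasons_over = sum(fb_players > roster_limit),
    max_excess = max(fb_players) - roster_limit
  )

by_school
```

If a handful of schools were over the limit every single year, that would point toward a consistent counting rule rather than scattered data-entry mistakes.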

Without any more documentation to explain what’s going on, this could be seriously misleading when building anything that involves the number of athletes. Since we don’t know what’s causing this, there’s a chance that a staggering amount of our data is simply inaccurate. If the data is accurate, but the “totals” just mean something other than you’d expect, then calculations that compare a total amount, maybe revenues or expenses, to the total number of women or men become misleading. We might find that expenses per athlete are greater in sport A than in sport B. But if the recorded number of men for sport B is overstated relative to its in-season roster by much more than sport A’s is, sport B might actually spend more than sport A on a per-person basis. That’s a confusing statement, so let me write it out.

Total Expenses / “Total Athletes” = Expenses per athlete

Sport A: $400,000 / 40 = $10,000 per athlete

Sport B: $700,000 / 80 = $8,750 per athlete

Total Expenses / Total Athletes on the Roster In-Season = Expenses per athlete

Sport A: $400,000 / 40 = $10,000 per athlete

Sport B: $700,000 / 60 = $11,667 per athlete

In this scenario, Sport B’s expenses per athlete were actually about 33% higher than the first calculation suggested! That puts Sport B about 17% above Sport A rather than 12.5% below it. This is significant, and it alters our idea of who spent the most per athlete.
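The same arithmetic in R, using only the hypothetical Sport A and Sport B numbers from the example above (the variable names are mine):

```r
# Hypothetical figures from the worked example
expenses  <- c(A = 400000, B = 700000)
reported  <- c(A = 40, B = 80)   # "total athletes" as recorded
in_season <- c(A = 40, B = 60)   # athletes actually on the in-season roster

per_reported  <- expenses / reported    # A: 10000, B: 8750
per_in_season <- expenses / in_season   # A: 10000, B: about 11667

# How much Sport B's per-athlete figure moves once the roster is corrected
pct_change_b <- (per_in_season[["B"]] / per_reported[["B"]] - 1) * 100

round(per_in_season)
round(pct_change_b, 1)
```

The ranking flips because only Sport B’s denominator changed. Any per-athlete comparison is only as trustworthy as the head counts underneath it.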

What this means for my data

There are a few ways I can get around this.

I might try emailing some people involved in the data collection/entry/analysis processes and see if they know something I’m unaware of. This could help guide further decisions depending on what I find out.

Another option is to simply cap the reported counts for these sports at their respective roster limits. This comes with a lot of problems, though. For example, if the average D1 school cuts 10 people from its roster every year, or in other words all rosters are overstated by an average of 10 athletes, this adjustment would lower rosters near the cap but not those below it, leaving the number of athletes on smaller teams overstated relative to the rest. It would also understate teams that don’t cut athletes, and overstate teams that love cutting a large number of athletes. This also becomes impossible in places like Division 3 sports, where there aren’t any official roster caps. That division can be especially difficult to deal with, as the lines between collegiate sport and club sport are grayer than you might expect, especially compared to Division 1.
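A small sketch of why capping is uneven, assuming (purely for illustration) that every roster is overstated by exactly 10 and the cap is 120:

```r
cap <- 120

# Hypothetical true in-season rosters and their overstated reported counts
true_in_season <- c(80, 110, 120)
reported <- true_in_season + 10        # 90, 120, 130

# Capping only touches values above the cap
capped <- pmin(reported, cap)          # 90, 120, 120

# Remaining error after capping, per team
error_after_cap <- capped - true_in_season

error_after_cap
```

Only the team that was already at the cap gets corrected. The two smaller teams keep their full 10-athlete overstatement, so the adjustment shifts the bias around instead of removing it.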

Another approach is to simply expect that there is going to be an over- or understatement, and factor that into my analysis. As a hypothetical example, if I know that the average expense per football athlete is $200k, but it goes up to $220k when I adjust for roster limits, I might just assign a range and say the average expense per player is around $200k-$220k. If the benchmark I am comparing it to is $210k, I might simply conclude that football expenses are similar to the benchmark. If the benchmark is much lower, maybe $180k, I can say it’s likely that this is a significant difference. If it’s slightly higher, maybe $230k, I could still conclude there’s not enough evidence to say for sure that it’s actually higher, since our high-end estimate is just that, an estimate, though the benchmark might well be higher than our football figure. And clearly, if the benchmark were much higher, like $280k, it would be very likely that it’s higher than our football figure.
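The range idea can be sketched as a small helper that reports how far a benchmark falls outside the estimated range. The function name `range_gap` and all dollar figures are hypothetical, taken from the example above:

```r
# Distance from the benchmark to the nearest edge of the range:
# negative = benchmark below the range, 0 = inside it, positive = above it
range_gap <- function(low, high, benchmark) {
  if (benchmark < low) {
    benchmark - low
  } else if (benchmark > high) {
    benchmark - high
  } else {
    0
  }
}

benchmarks <- c(180000, 210000, 230000, 280000)
gaps <- vapply(benchmarks, range_gap, numeric(1), low = 200000, high = 220000)

gaps
```

The helper only measures the gap. Deciding how large a gap has to be before calling the difference meaningful (the $230k versus $280k cases) is still a judgment call.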