Week 2 | Data Dive - Summaries

Establishes tidyverse and also gets rid of an error.

# This works to get rid of errors
library(conflicted)  

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2
conflict_prefer("filter", "dplyr")
## [conflicted] Will prefer dplyr::filter over any other package.
conflict_prefer("lag", "dplyr")
## [conflicted] Will prefer dplyr::lag over any other package.

Stores CSV file as ‘ncaa’

ncaa <- read.csv("./ncaa_sports_1.csv", header = TRUE)

Numeric Summary of Columns

Sports offered each year

I am looking to see how many programs offer each sport, and how the number of available programs changes over time.

# Uses two columns: year, sports
# Results in counts of unique sports in each year

ncaa |>
  filter(sum_partic_men + sum_partic_women > 0) |>
  group_by(year, sports) |>
  summarise(count = n()) |>
   pivot_wider(names_from = year, values_from = count, values_fill = 0)
## `summarise()` has grouped output by 'year'. You can override using the
## `.groups` argument.
## # A tibble: 37 × 6
##    sports             `2015` `2016` `2017` `2018` `2019`
##    <chr>               <int>  <int>  <int>  <int>  <int>
##  1 All Track Combined    622    623    627    631    630
##  2 Archery                 3      3      4      3      3
##  3 Badminton               1      1      1      1      1
##  4 Baseball              938    938    944    941    932
##  5 Basketball           1088   1081   1084   1079   1078
##  6 Beach Volleyball       55     62     67     75     84
##  7 Bowling                73     83     85     87     92
##  8 Diving                  7      9      6      5     10
##  9 Equestrian             56     56     57     60     59
## 10 Fencing                41     42     43     41     41
## # ℹ 27 more rows

The first insight that was gathered is 2019 has a LOT of rows for sports that organizations don’t actually offer. Initially the above code did not have a filter as I assumed that the sum participating men and women would be non-zero. This looks to be true, except for sports offered in 2019. By commenting out that filter you can see how much the sports count increases that year specifically. I discovered this using the code below, filtering specifically for a sport with low numbers until 2019.

Additionally, I was surprised with how much variation there was from year to year. Some sports were very consistent like Fencing or Basketball with very little variation over time, but there were others like Softball seeing a 3% decline in 2019 and diving. Diving ranged between 5-10 and everything in between, and to make matters worse it’s traditionally coupled up with Swimming, not a stand alone sport. I’ll need to investigate this further on what causes a team to sponsor diving but not swimming.

# count201X returns the total participating men and women of each institution

 count2018 <- ncaa |>
  filter(year == 2018 & sports == "Rodeo") |>
  reframe(num_2018 = sum_partic_men + sum_partic_women)
 
 count2019 <- ncaa |>
  filter(year == 2019 & sports == "Rodeo") |>
  reframe(num_2019 = sum_partic_men + sum_partic_women)
# if the total students participating at a school is equal to zero, in other words
# there isn't really a team, the count will increase

print(sum(count2018==0))
## [1] 0
print(sum(count2019==0))
## [1] 761

This shows that in 2019, many schools were added to the data set despite not having a team. This needs to be filtered out.

Participating Men in Sports

# gives summary of men participating in caa sports

summary(select(ncaa, partic_men))
##    partic_men    
##  Min.   :  1.00  
##  1st Qu.: 13.00  
##  Median : 25.00  
##  Mean   : 34.23  
##  3rd Qu.: 38.00  
##  Max.   :290.00  
##  NA's   :38526

This gives a summary of men competing in Men’s Sports.

One interesting part of this aggregate function is that the minimum is 1. This is essentially a filter to only include sports that offer a men’s program, so all the NA or 0 entries are excluded. As we found out from the previous section, the sum of NA’s are > the sum of sports that are offered to women and not men due to the 2019 errors. Other errors I foresee coming to light come when looking at the Max. I don’t believe it is possible for a sport to have more than 105 roster sports, and that is the cap for NCAA football. To have an extra 185? I need to investigate how far this error potentially spreads, or what exception is being included.

This summary also shows that there is likely a skew in the distribution towards the right, and we can see this as the mean is significantly higher than the medium. I’d be excited to see how much that drops once I figure out why numbers are greater than 105, but I’d still expect the mean to stay higher than the medium.

3 Novel Questions

  • How strongly does size of school impact sports revenue?

  • Do schools that offer the most sports make the most revenue?

  • Do Division I sports sponsor more athletes than other divisions?

# Lists all the classifications
unique(ncaa$classification_name)
## [1] "NCAA Division I-FCS"                "NCAA Division I-FBS"               
## [3] "NCAA Division II without football"  "NCAA Division III with football"   
## [5] "NCAA Division II with football"     "NCAA Division I without football"  
## [7] "NCAA Division III without football"
# Aggregation function to answer 3rd Novel Question
# Prints total athletes per classification over 5 year period

ncaa |>
  filter(sum_partic_men + sum_partic_women > 0) |>
  group_by(classification_name) |>
  summarise(sum = sum(sum_partic_men + sum_partic_women)) # aggregation function
## # A tibble: 7 × 2
##   classification_name                   sum
##   <chr>                               <int>
## 1 NCAA Division I without football   188245
## 2 NCAA Division I-FBS                382572
## 3 NCAA Division I-FCS                336558
## 4 NCAA Division II with football     387112
## 5 NCAA Division II without football  221093
## 6 NCAA Division III with football    664176
## 7 NCAA Division III without football 275980

###Do Division 1 Sports sponsor more athletes than other sports?

Surpringly, no. Division 3 schools actually sponsor the most, and although Division 1 sponsors more than Division 3 when looking at schools with football, Division 3 sponsors plenty more when looking at schools without football. I think what’s very significant isn’t that Division 1 is slightly smaller than Division 3 in athlete count, but that Division 2 fell so far behind both, roughly only 2/3 the size of the other divisions. Although some potential scholarship restrictions come to mind to help explain this when I first saw the result, its still strange that its so much lower than the very comparable D1 and D3.

To figure out some of these differences, I would start by figuring out how much this is caused by the number of schools that offer sports (maybe the number of athletes at each school are roughly the same, but D3 has a bit more schools than D1 and plenty more than D2) and the athletes per program that may varies by division (maybe D3 schools offers the most spots for athletes, but D2 is lagging behind. I suspect it will be some combination of both, and will likely have large variations between football and non-football schools.

Visual Summaries

library(ggthemes)

Bar Chart of Number of Participating Men and Women in NCAA Programs

# produces bar chart showing the count of men in NCAA sports

ncaa |>
  filter(sum_partic_men>0) |>
  ggplot() +
  geom_bar(mapping = aes(x = sum_partic_men), color = "Blue") +
  theme_minimal()

# produces bar chart showing the count of women in NCAA sports

ncaa |>
  filter(sum_partic_women>0) |>
  ggplot() +
  geom_bar(mapping = aes(x = sum_partic_women), color = "Pink") +
  theme_minimal()

Looking at the distributions between men and women, women have a very high peak in what appears to be the teens to low twenties in roster count, where men see the same but along with a spike in what appears to be the thirties and a small but noticeable bump in the hundreds. Although there is plenty to be skeptical about, specifically the fact that programs have more than 105 members among others, this can suggest that men have a sport that tends to peak in the 100s (football). This also suggest that men have a sport that women don’t which creates the second peak we see, or women have more sports that men don’t which helps smooth the peak from the teens-size roster spots to the thirty-ish sized. I would be curious to see how this is broken down by different sports to highlight where sports like basketball, swimming, soccer, etc. play into this.

Bar Chart of how many sporting programs each year came from each Divsion classification

# produces bar chart showing the count of athletic programs by each divsion classification
# during the years of 2015-2019

ncaa |>
  group_by(year) |>
  filter(sum_partic_men + sum_partic_women > 0) |>
  ggplot() +
  geom_bar(mapping = aes(x = year, fill=classification_name)) +
  theme_minimal()

This is a bit of an extension from one of the big questions, but you can see visually how a significant chunk of available sporting programs come from Division 3 schools. This is significant as it 1) helps show that one of the reasons D3 schools might have so many athletes is because they offer more programs to join and 2) since it makes up such a significant chunk of the total programs available in a year, it might suggest, coupled with prior data, that team sizes are smaller on average. Of course, we can run another test to see if that is accurate. I would also like to see if this means that there are more D3 schools offering athletic programs, D3 schools are offering more programs per school, or some combination of the two.

Revenue and Profit by NCAA Classification

# creates a column 'profit' which is revenues less expenses
# shows how each division classification creates revenue and profit in their sports

ncaa |>
  filter(sum_partic_men + sum_partic_women > 0) |>
  mutate(profit = rev_men - exp_men) |>
  ggplot() +
  geom_point(mapping = aes(x = rev_men, y = profit), color = 'blue') +
  facet_wrap(vars(classification_name)) +
  theme_minimal()
## Warning: Removed 17393 rows containing missing values or values outside the scale range
## (`geom_point()`).

Right off the bat, this shows that virtually all revenue and profit made in the NCAA comes from Division 1-FBS schools. These are the some of the highest ranking universities, and subsequently some of the highest ranking college football teams. I expected Division II or III with football to produce higher revenues than Division 1 without football, but clearly they weren’t even close.

I am now curious to see how these distributions are broken down by revenue sports; like basketball, football, and baseball; and by Olympic/non-revenue sports; soccer, swimming, and athletics/track and field. It could be that most of the variation we see comes from football and basketball, but when removing those sports (and potentially baseball), we might see the profits and revenues much more comparable across division classifications.