Establishes tidyverse and also gets rid of an error.
# This works to get rid of errors
library(conflicted)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
conflict_prefer("filter", "dplyr")
## [conflicted] Will prefer dplyr::filter over any other package.
conflict_prefer("lag", "dplyr")
## [conflicted] Will prefer dplyr::lag over any other package.
Stores CSV file as ‘ncaa’
ncaa <- read.csv("./ncaa_sports_1.csv", header = TRUE)
I am looking to see how many programs offer each sport, and how the number of available programs changes over time.
# Uses two columns: year, sports
# Results in counts of unique sports in each year
ncaa |>
filter(sum_partic_men + sum_partic_women > 0) |>
group_by(year, sports) |>
summarise(count = n()) |>
pivot_wider(names_from = year, values_from = count, values_fill = 0)
## `summarise()` has grouped output by 'year'. You can override using the
## `.groups` argument.
## # A tibble: 37 × 6
## sports `2015` `2016` `2017` `2018` `2019`
## <chr> <int> <int> <int> <int> <int>
## 1 All Track Combined 622 623 627 631 630
## 2 Archery 3 3 4 3 3
## 3 Badminton 1 1 1 1 1
## 4 Baseball 938 938 944 941 932
## 5 Basketball 1088 1081 1084 1079 1078
## 6 Beach Volleyball 55 62 67 75 84
## 7 Bowling 73 83 85 87 92
## 8 Diving 7 9 6 5 10
## 9 Equestrian 56 56 57 60 59
## 10 Fencing 41 42 43 41 41
## # ℹ 27 more rows
The first insight that was gathered is 2019 has a LOT of rows for sports that organizations don’t actually offer. Initially the above code did not have a filter as I assumed that the sum participating men and women would be non-zero. This looks to be true, except for sports offered in 2019. By commenting out that filter you can see how much the sports count increases that year specifically. I discovered this using the code below, filtering specifically for a sport with low numbers until 2019.
Additionally, I was surprised with how much variation there was from year to year. Some sports were very consistent like Fencing or Basketball with very little variation over time, but there were others like Softball seeing a 3% decline in 2019 and diving. Diving ranged between 5-10 and everything in between, and to make matters worse it’s traditionally coupled up with Swimming, not a stand alone sport. I’ll need to investigate this further on what causes a team to sponsor diving but not swimming.
# count201X returns the total participating men and women of each institution
count2018 <- ncaa |>
filter(year == 2018 & sports == "Rodeo") |>
reframe(num_2018 = sum_partic_men + sum_partic_women)
count2019 <- ncaa |>
filter(year == 2019 & sports == "Rodeo") |>
reframe(num_2019 = sum_partic_men + sum_partic_women)
# if the total students participating at a school is equal to zero, in other words
# there isn't really a team, the count will increase
print(sum(count2018==0))
## [1] 0
print(sum(count2019==0))
## [1] 761
This shows that in 2019, many schools were added to the data set despite not having a team. This needs to be filtered out.
# gives summary of men participating in caa sports
summary(select(ncaa, partic_men))
## partic_men
## Min. : 1.00
## 1st Qu.: 13.00
## Median : 25.00
## Mean : 34.23
## 3rd Qu.: 38.00
## Max. :290.00
## NA's :38526
This gives a summary of men competing in Men’s Sports.
One interesting part of this aggregate function is that the minimum is 1. This is essentially a filter to only include sports that offer a men’s program, so all the NA or 0 entries are excluded. As we found out from the previous section, the sum of NA’s are > the sum of sports that are offered to women and not men due to the 2019 errors. Other errors I foresee coming to light come when looking at the Max. I don’t believe it is possible for a sport to have more than 105 roster sports, and that is the cap for NCAA football. To have an extra 185? I need to investigate how far this error potentially spreads, or what exception is being included.
This summary also shows that there is likely a skew in the distribution towards the right, and we can see this as the mean is significantly higher than the medium. I’d be excited to see how much that drops once I figure out why numbers are greater than 105, but I’d still expect the mean to stay higher than the medium.
How strongly does size of school impact sports revenue?
Do schools that offer the most sports make the most revenue?
Do Division I sports sponsor more athletes than other divisions?
# Lists all the classifications
unique(ncaa$classification_name)
## [1] "NCAA Division I-FCS" "NCAA Division I-FBS"
## [3] "NCAA Division II without football" "NCAA Division III with football"
## [5] "NCAA Division II with football" "NCAA Division I without football"
## [7] "NCAA Division III without football"
# Aggregation function to answer 3rd Novel Question
# Prints total athletes per classification over 5 year period
ncaa |>
filter(sum_partic_men + sum_partic_women > 0) |>
group_by(classification_name) |>
summarise(sum = sum(sum_partic_men + sum_partic_women)) # aggregation function
## # A tibble: 7 × 2
## classification_name sum
## <chr> <int>
## 1 NCAA Division I without football 188245
## 2 NCAA Division I-FBS 382572
## 3 NCAA Division I-FCS 336558
## 4 NCAA Division II with football 387112
## 5 NCAA Division II without football 221093
## 6 NCAA Division III with football 664176
## 7 NCAA Division III without football 275980
###Do Division 1 Sports sponsor more athletes than other sports?
Surpringly, no. Division 3 schools actually sponsor the most, and although Division 1 sponsors more than Division 3 when looking at schools with football, Division 3 sponsors plenty more when looking at schools without football. I think what’s very significant isn’t that Division 1 is slightly smaller than Division 3 in athlete count, but that Division 2 fell so far behind both, roughly only 2/3 the size of the other divisions. Although some potential scholarship restrictions come to mind to help explain this when I first saw the result, its still strange that its so much lower than the very comparable D1 and D3.
To figure out some of these differences, I would start by figuring out how much this is caused by the number of schools that offer sports (maybe the number of athletes at each school are roughly the same, but D3 has a bit more schools than D1 and plenty more than D2) and the athletes per program that may varies by division (maybe D3 schools offers the most spots for athletes, but D2 is lagging behind. I suspect it will be some combination of both, and will likely have large variations between football and non-football schools.
library(ggthemes)
# produces bar chart showing the count of men in NCAA sports
ncaa |>
filter(sum_partic_men>0) |>
ggplot() +
geom_bar(mapping = aes(x = sum_partic_men), color = "Blue") +
theme_minimal()
# produces bar chart showing the count of women in NCAA sports
ncaa |>
filter(sum_partic_women>0) |>
ggplot() +
geom_bar(mapping = aes(x = sum_partic_women), color = "Pink") +
theme_minimal()
Looking at the distributions between men and women, women have a very high peak in what appears to be the teens to low twenties in roster count, where men see the same but along with a spike in what appears to be the thirties and a small but noticeable bump in the hundreds. Although there is plenty to be skeptical about, specifically the fact that programs have more than 105 members among others, this can suggest that men have a sport that tends to peak in the 100s (football). This also suggest that men have a sport that women don’t which creates the second peak we see, or women have more sports that men don’t which helps smooth the peak from the teens-size roster spots to the thirty-ish sized. I would be curious to see how this is broken down by different sports to highlight where sports like basketball, swimming, soccer, etc. play into this.
# produces bar chart showing the count of athletic programs by each divsion classification
# during the years of 2015-2019
ncaa |>
group_by(year) |>
filter(sum_partic_men + sum_partic_women > 0) |>
ggplot() +
geom_bar(mapping = aes(x = year, fill=classification_name)) +
theme_minimal()
This is a bit of an extension from one of the big questions, but you can see visually how a significant chunk of available sporting programs come from Division 3 schools. This is significant as it 1) helps show that one of the reasons D3 schools might have so many athletes is because they offer more programs to join and 2) since it makes up such a significant chunk of the total programs available in a year, it might suggest, coupled with prior data, that team sizes are smaller on average. Of course, we can run another test to see if that is accurate. I would also like to see if this means that there are more D3 schools offering athletic programs, D3 schools are offering more programs per school, or some combination of the two.
# creates a column 'profit' which is revenues less expenses
# shows how each division classification creates revenue and profit in their sports
ncaa |>
filter(sum_partic_men + sum_partic_women > 0) |>
mutate(profit = rev_men - exp_men) |>
ggplot() +
geom_point(mapping = aes(x = rev_men, y = profit), color = 'blue') +
facet_wrap(vars(classification_name)) +
theme_minimal()
## Warning: Removed 17393 rows containing missing values or values outside the scale range
## (`geom_point()`).
Right off the bat, this shows that virtually all revenue and profit made in the NCAA comes from Division 1-FBS schools. These are the some of the highest ranking universities, and subsequently some of the highest ranking college football teams. I expected Division II or III with football to produce higher revenues than Division 1 without football, but clearly they weren’t even close.
I am now curious to see how these distributions are broken down by revenue sports; like basketball, football, and baseball; and by Olympic/non-revenue sports; soccer, swimming, and athletics/track and field. It could be that most of the variation we see comes from football and basketball, but when removing those sports (and potentially baseball), we might see the profits and revenues much more comparable across division classifications.