Establishes tidyverse and also gets rid of an error.
# This works to get rid of errors
library(conflicted)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
conflict_prefer("filter", "dplyr")
## [conflicted] Will prefer dplyr::filter over any other package.
conflict_prefer("lag", "dplyr")
## [conflicted] Will prefer dplyr::lag over any other package.
Stores CSV file as ‘ncaa’
ncaa <- read.csv("./ncaa_sports_1.csv", header = TRUE)
## Clean the ncaa file
ncaa <- ncaa |>
filter(sum_partic_men + sum_partic_women > 0)
# isolates number of athletes, programs, and division type of
# each institution
program_sum_count <- ncaa |>
group_by(institution_name) |>
summarise(athletes = sum(sum_partic_men + sum_partic_women),
programs = n(), classification = classification_name)
## Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
## dplyr 1.1.0.
## ℹ Please use `reframe()` instead.
## ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
## always returns an ungrouped data frame and adjust accordingly.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## `summarise()` has grouped output by 'institution_name'. You can override using
## the `.groups` argument.
program_sum_count |>
ggplot() +
geom_point(mapping = aes(x = programs/5, y = athletes/5),
color = 'red') +
labs(title = "Annualized Number of Athletes and Sports by Institution",
x = "Number of Sports Offered",
y = "Number of Athletes") +
theme_classic()
# these are based off of NCAA requirements
num_programs <- cut(program_sum_count$programs, breaks=c(0,15,40,60,200),
labels=c("Low","Medium", "High", "Very High"))
table(num_programs)
## num_programs
## Low Medium High Very High
## 449 7736 32926 16351
program_sum_count$prob_program <- num_programs
program_sum_count
## # A tibble: 57,462 × 5
## # Groups: institution_name [1,141]
## institution_name athletes programs classification prob_program
## <chr> <int> <int> <chr> <fct>
## 1 Abilene Christian University 2394 48 NCAA Division I-… High
## 2 Abilene Christian University 2394 48 NCAA Division I-… High
## 3 Abilene Christian University 2394 48 NCAA Division I-… High
## 4 Abilene Christian University 2394 48 NCAA Division I-… High
## 5 Abilene Christian University 2394 48 NCAA Division I-… High
## 6 Abilene Christian University 2394 48 NCAA Division I-… High
## 7 Abilene Christian University 2394 48 NCAA Division I-… High
## 8 Abilene Christian University 2394 48 NCAA Division I-… High
## 9 Abilene Christian University 2394 48 NCAA Division I-… High
## 10 Abilene Christian University 2394 48 NCAA Division I-… High
## # ℹ 57,452 more rows
program_sum_count |>
#mutate(program_size = programs < 15) |>
#filter(program_size == TRUE) |>
filter(prob_program == "Low") |>
ggplot() +
geom_bar(mapping = aes(x = classification)) +
theme(axis.text.x = element_text(angle = -20)) +
labs(title = "Count of Schools with <15 programs from 2015-2019")
The first chart shows the relationship between the number of sports an institution offers and how many athletes comprise of those sports. There is a very big range in how many athletic programs and athletes a school sponsors. Unsurprisingly, there is a direct correlation between the two. It is surprising how many programs have sponsored such few programs over the span of five years though.
After investigating this a bit further in the second graph, it looks like the vast majority of schools with minimal number of athletes come from Division 2 and 3 teams. These divisions were disproportionately more likely to offer the fewest programs. I suspect some of this might be due to Division 3 programs having more leniency on minimum number of programs required, but Divisions 1 and 2 shouldn’t be allowed to offer 3 or fewer programs a year and remain Division 1 or 2 schools. I suspect the reason these schools might be on the list anyways is because they recently created a sporting team (2018-2019) or dropped their sporting teams near the beginning of the recordings (2015-2016).
I hypothesis that if we were to filter out programs that are less than 5 years old but still actively sponsoring teams, we will remove all Division 1 and 2 schools.
# counts the number of times an institution offers a sport
sport_freq <- ncaa |>
group_by(sports = sports) |>
summarise(count = n_distinct(institution_name))
# calculates number of unique teams
teams = n_distinct(ncaa$institution_name)
# uncommon => <25% of teams offer the sport
sport_freq$probability = cut(sport_freq$count, breaks=c(0,teams/4, teams),
labels=c("Uncommon", "Common"))
sport_freq |>
ggplot() +
geom_point(mapping = aes(x = sports, y = count/teams,
color = probability)) +
theme(axis.text.x = element_blank()) +
geom_text(mapping = aes(x = sports, y = count/teams, label=sports),
size = 3, nudge_y = .05, check_overlap = TRUE) +
labs(title = "Frequency of Sports Offered by Institution",
x = "Sports",
y = "% of Institutions that Sponsor")
From the graph above, we can see the probability that a given NCAA school will sponsor a given sport. Some are remarkably high, with nearly every school having a basketball, soccer, and volleyball team, where others are hardly ever offered like weight lifting, table tennis, squash, archery, rifle, sailing, and many more. Most of this is pretty expected for someone familiar with sports, but I am surprised how popular lacrosse is nation-wide.
Looking at sports infrequently offered, my hypothesis is that these sports will on average have much higher expenses than revenues compared to more popular sports, especially when looked at on a per athlete basis. Other factors almost certainly have a substantial impact, but our data is too limited to explore those ideas.
Some of these factors include general popularity (sqash isn’t popular like golf is), geographic location (beach volleyball is likely very popular in southern states but less so in northern states), and legacy (tennis has been played in the NCAA for a very long time, so although Table Tennis may be very similar and may have a growing demand, a school might be more likely to opt for a Tennis team instead to have more existing programs to compete against).
# creates columns used to sum how many times a school offers
# a given sport for women and men
mw_ncaa <- ncaa |>
mutate(has_men = sum_partic_men > 0) |>
mutate(has_women = sum_partic_women > 0) |>
group_by(sports = sports) |>
summarise(men = sum(has_men), women = sum(has_women))
mw_ncaa |>
ggplot() +
geom_point(mapping = aes(x = sports, y = men), color = 'blue') +
geom_text(mapping = aes(x = sports, y = men,label=sports),
size = 3, nudge_y = 200, check_overlap = TRUE) +
geom_point(mapping = aes(x = sports, y = women), color = 'red') +
#geom_text(mapping = aes(x = sports,y = women,label=sports),
# size = 3, nudge_y = 200, check_overlap = TRUE) +
theme(axis.text.x = element_blank()) +
labs(title = "Frequency of Sports Offered by Gender",
x = "Sports",
y = "Number of Institutions that Sponsor")
mw_ncaa$pc_dif = abs(round((mw_ncaa$men-mw_ncaa$women)/mw_ncaa$men, digits=2))
mw_ncaa$probability = cut(mw_ncaa$pc_dif, breaks=c(-1,0.2, max(mw_ncaa$pc_dif)),
labels=c("Common", "Uncommon"))
mw_ncaa$sports[mw_ncaa$probability == 'Uncommon']
## [1] "Archery" "Badminton" "Baseball"
## [4] "Beach Volleyball" "Bowling" "Equestrian"
## [7] "Fencing" "Field Hockey" "Football"
## [10] "Gymnastics" "Ice Hockey" "Lacrosse"
## [13] "Other Sports" "Rifle" "Rowing"
## [16] "Soccer" "Softball" "Swimming and Diving"
## [19] "Synchronized Swimming" "Tennis" "Volleyball"
## [22] "Water Polo" "Wrestling"
Although very similar to the graph used in part II, there are some key differences in the graph above. Now we have been able to make some distinctions between schools that offer more or less of a given sport to one gender or another. In sports like Basketball or Track and Field, there are almost the exact same number of programs available to both men and women. On the complete opposite side of this, some sports like Football or Baseball are only given to men, where Softball is only given to women. There’s also everything in the middle like Volleyball having one of the largest differences, and other sports like Golf, Soccer, and Tennis having modest differences. Its important to create these distinctions when comparing sports to avoid saying that volleyball and soccer have about the same probability of being offered, when in reality only women’s volleyball and soccer have similar probabilities; men’s offerings in these sports are wildly different and much less likely.
I would hypothesise that there are actually more women’s sports than men’s sports if we sum the men and women columns. I think this is the case because, as supported by prior graphs made in this course, men tend to have larger teams like football, and Title IX plays a role in ensuring women have the same access to sporting opportunities as men. Because of this, schools often have to compensate for their men’s football teams by adding multiple, smaller programs for women.
# creates a df of all unique combinations
compare <- unique(data.frame(state = ncaa$state_cd, sport = ncaa$sports))
# counts the number of occurences of each combination
# also provides unique combinations
combos <- ncaa |>
group_by(state = state_cd, sport = sports) |>
summarise(count = n())
## `summarise()` has grouped output by 'state'. You can override using the
## `.groups` argument.
# visualizes a table to better show count of combinations
cool_combo <- combos |>
pivot_wider(names_from = state, values_from = count,
values_fill = 0)
cool_combo
## # A tibble: 37 × 54
## sport AK AL AR AZ CA CO CT DC DE FL GA HI
## <chr> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
## 1 Bask… 10 96 70 20 282 77 95 40 25 130 141 21
## 2 Gymn… 5 10 6 10 35 5 15 5 0 6 5 1
## 3 Ice … 10 5 0 5 0 10 41 0 0 0 0 0
## 4 Rifle 5 10 0 0 0 0 0 0 0 0 10 0
## 5 Skii… 10 0 0 0 0 10 0 0 0 0 0 0
## 6 Swim… 5 1 15 0 42 15 16 4 1 46 14 0
## 7 Trac… 5 13 0 0 37 10 5 9 3 9 10 0
## 8 Trac… 4 22 0 0 133 11 5 13 4 15 41 0
## 9 Trac… 10 35 25 0 144 17 32 14 11 58 71 16
## 10 Voll… 10 95 65 20 282 77 95 35 25 130 118 21
## # ℹ 27 more rows
## # ℹ 41 more variables: IA <int>, ID <int>, IL <int>, IN <int>, KS <int>,
## # KY <int>, LA <int>, MA <int>, MD <int>, ME <int>, MI <int>, MN <int>,
## # MO <int>, MS <int>, MT <int>, NC <int>, ND <int>, NE <int>, NH <int>,
## # NJ <int>, NM <int>, NV <int>, NY <int>, OH <int>, OK <int>, OR <int>,
## # PA <int>, PR <int>, RI <int>, SC <int>, SD <int>, TN <int>, TX <int>,
## # UT <int>, VA <int>, VT <int>, WA <int>, WI <int>, WV <int>, WY <int>, …
# gets the total possible combinations, less actual combinations
# remaining are 'missing' combinations
n_distinct(combos$state) * n_distinct(combos$sport) -
nrow(combos)
## [1] 822
# should return the same as found above
sum(cool_combo == 0)
## [1] 822
# shows the top few combinations of most popular sports in a state
combos <- combos[order(combos$count, decreasing=TRUE),]
head(combos, n=10)
## # A tibble: 10 × 3
## # Groups: state [2]
## state sport count
## <chr> <chr> <int>
## 1 NY Basketball 493
## 2 NY Soccer 491
## 3 PA Basketball 478
## 4 PA Soccer 470
## 5 NY Volleyball 465
## 6 PA Volleyball 460
## 7 PA Softball 456
## 8 NY Softball 428
## 9 PA Tennis 423
## 10 PA Baseball 410
There are plenty of sports with 0 participation. 822 combinations of sports and states fit this criteria, and this can likely be attributed to some states having very few schools to offer a large variety of sports (Alaska is notable here among a few), and some sports likely have geographic limitations (Arizona doesn’t have an ice hockey team… I don’t think that’s a coincidence).
For the most popular combinations, limiting ourselves to the top 10 one theme is very clear: there are a lot of NCAA sports in Pennsylvania and New York. Basketball and soccer weren’t surprising to see with high frequency, but volleyball and softball really surprised me. Volleyball is limited almost entirely to women, and softball is exclusively for women. Basketball meanwhile is almost always offered to men and women, so I expected sports with men and women categories to outrank those without. That’s when I realized that the data combines sports that offer a men’s and women’s version in one row, or one count. So that is a limitation of the data unless I were to break it up into Women’s Sport and Men’s Sport - something I will probably do going forward.
ncaa |>
mutate(Arizona = grepl("AZ", state_cd)) |>
mutate(NorthDakota = grepl("ND", state_cd)) |>
group_by(sports = sports) |>
summarise(arizona = sum(Arizona), nd = sum(NorthDakota)) |>
ggplot() +
geom_point(mapping = aes(x = sports, y = arizona),
color = 'red') +
geom_text(mapping = aes(x = sports, y = arizona, label=sports),
size = 3, nudge_y = 1, check_overlap = TRUE) +
geom_point(mapping = aes(x = sports, y = nd),
color = 'blue') +
geom_text(mapping = aes(x = sports, y = nd, label=sports),
size = 3, nudge_y = 1, check_overlap = TRUE) +
theme(axis.text.x = element_blank()) +
labs(title = "Frequency of Sports Offered in Arizona and North Dakota",
x = "Sports:: Red: AZ; Blue: ND",
y = "Frequency")
You can see how some sports occur more or less frequently in different states. Surprisingly Ice Hockey both occur at a school in each state, but even though Volleyball also happen the same amount at both schools, Arizona also offers Sand Volleyball. Arizona also offers more Water Polo and Swimming and Diving. Since the states are relatively small when it comes to NCAA representation, many smaller sports which are infrequently included are simply not offered such as Table Tennis or Archery.