Week 3 | Data Dive - Group by Probabilities

Establishes tidyverse and also gets rid of an error.

# This works to get rid of errors
library(conflicted)  

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2

conflict_prefer("filter", "dplyr")

## [conflicted] Will prefer dplyr::filter over any other package.

conflict_prefer("lag", "dplyr")

## [conflicted] Will prefer dplyr::lag over any other package.

Stores CSV file as ‘ncaa’

ncaa <- read.csv("./ncaa_sports_1.csv", header = TRUE)

## Clean the ncaa file
ncaa <- ncaa |>
  filter(sum_partic_men + sum_partic_women > 0)

Groupings

Number of Athletes and Programs by Institution

# isolates number of athletes, programs, and division type of
# each institution

program_sum_count <- ncaa |>
        group_by(institution_name) |>
        summarise(athletes = sum(sum_partic_men + sum_partic_women),
                  programs = n(), classification = classification_name)

## Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
## dplyr 1.1.0.
## ℹ Please use `reframe()` instead.
## ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
##   always returns an ungrouped data frame and adjust accordingly.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

## `summarise()` has grouped output by 'institution_name'. You can override using
## the `.groups` argument.

program_sum_count |>
  ggplot() +
  geom_point(mapping = aes(x = programs/5, y = athletes/5),
             color = 'red') +
  labs(title = "Annualized Number of Athletes and Sports by Institution",
       x = "Number of Sports Offered",
       y = "Number of Athletes") +
  theme_classic()

# these are based off of NCAA requirements
num_programs <- cut(program_sum_count$programs, breaks=c(0,15,40,60,200),
                    labels=c("Low","Medium", "High", "Very High"))

table(num_programs)

## num_programs
##       Low    Medium      High Very High 
##       449      7736     32926     16351

program_sum_count$prob_program <- num_programs

program_sum_count

## # A tibble: 57,462 × 5
## # Groups:   institution_name [1,141]
##    institution_name             athletes programs classification    prob_program
##    <chr>                           <int>    <int> <chr>             <fct>       
##  1 Abilene Christian University     2394       48 NCAA Division I-… High        
##  2 Abilene Christian University     2394       48 NCAA Division I-… High        
##  3 Abilene Christian University     2394       48 NCAA Division I-… High        
##  4 Abilene Christian University     2394       48 NCAA Division I-… High        
##  5 Abilene Christian University     2394       48 NCAA Division I-… High        
##  6 Abilene Christian University     2394       48 NCAA Division I-… High        
##  7 Abilene Christian University     2394       48 NCAA Division I-… High        
##  8 Abilene Christian University     2394       48 NCAA Division I-… High        
##  9 Abilene Christian University     2394       48 NCAA Division I-… High        
## 10 Abilene Christian University     2394       48 NCAA Division I-… High        
## # ℹ 57,452 more rows

program_sum_count |>
  #mutate(program_size = programs < 15) |>
  #filter(program_size == TRUE) |>
  filter(prob_program == "Low") |>
  ggplot() +
  geom_bar(mapping = aes(x = classification)) + 
  theme(axis.text.x = element_text(angle = -20)) + 
  labs(title = "Count of Schools with <15 programs from 2015-2019")

Conclusions and Testable Hypothesis I

The first chart shows the relationship between the number of sports an institution offers and how many athletes comprise of those sports. There is a very big range in how many athletic programs and athletes a school sponsors. Unsurprisingly, there is a direct correlation between the two. It is surprising how many programs have sponsored such few programs over the span of five years though.

After investigating this a bit further in the second graph, it looks like the vast majority of schools with minimal number of athletes come from Division 2 and 3 teams. These divisions were disproportionately more likely to offer the fewest programs. I suspect some of this might be due to Division 3 programs having more leniency on minimum number of programs required, but Divisions 1 and 2 shouldn’t be allowed to offer 3 or fewer programs a year and remain Division 1 or 2 schools. I suspect the reason these schools might be on the list anyways is because they recently created a sporting team (2018-2019) or dropped their sporting teams near the beginning of the recordings (2015-2016).

I hypothesis that if we were to filter out programs that are less than 5 years old but still actively sponsoring teams, we will remove all Division 1 and 2 schools.

Frequency which Sports are Offered by an Institution

# counts the number of times an institution offers a sport

sport_freq <- ncaa |>
  group_by(sports = sports) |>
  summarise(count = n_distinct(institution_name))

# calculates number of unique teams
teams = n_distinct(ncaa$institution_name)

# uncommon => <25% of teams offer the sport
sport_freq$probability = cut(sport_freq$count, breaks=c(0,teams/4, teams),
                    labels=c("Uncommon", "Common"))

sport_freq |>
  ggplot() +
  geom_point(mapping = aes(x = sports, y = count/teams,
             color = probability)) +
  theme(axis.text.x = element_blank()) +
  geom_text(mapping = aes(x = sports, y = count/teams, label=sports), 
                      size = 3, nudge_y = .05, check_overlap = TRUE) +
  labs(title = "Frequency of Sports Offered by Institution",
       x = "Sports",
       y = "% of Institutions that Sponsor")

Conclusions and Testable Hypothesis II

From the graph above, we can see the probability that a given NCAA school will sponsor a given sport. Some are remarkably high, with nearly every school having a basketball, soccer, and volleyball team, where others are hardly ever offered like weight lifting, table tennis, squash, archery, rifle, sailing, and many more. Most of this is pretty expected for someone familiar with sports, but I am surprised how popular lacrosse is nation-wide.

Looking at sports infrequently offered, my hypothesis is that these sports will on average have much higher expenses than revenues compared to more popular sports, especially when looked at on a per athlete basis. Other factors almost certainly have a substantial impact, but our data is too limited to explore those ideas.

Some of these factors include general popularity (sqash isn’t popular like golf is), geographic location (beach volleyball is likely very popular in southern states but less so in northern states), and legacy (tennis has been played in the NCAA for a very long time, so although Table Tennis may be very similar and may have a growing demand, a school might be more likely to opt for a Tennis team instead to have more existing programs to compete against).

Frequency Sports are offered to Men vs Women by Institution

# creates columns used to sum how many times a school offers
# a given sport for women and men

mw_ncaa <- ncaa |>
  mutate(has_men = sum_partic_men > 0) |>
  mutate(has_women = sum_partic_women > 0) |>
  group_by(sports = sports) |>
  summarise(men = sum(has_men), women = sum(has_women))

mw_ncaa |>
  ggplot() +
  geom_point(mapping = aes(x = sports, y = men), color = 'blue') +
  geom_text(mapping = aes(x = sports, y = men,label=sports), 
                      size = 3, nudge_y = 200, check_overlap = TRUE) +
  
  geom_point(mapping = aes(x = sports, y = women), color = 'red') +
  #geom_text(mapping = aes(x = sports,y = women,label=sports), 
  #                    size = 3, nudge_y = 200, check_overlap = TRUE) +
  
  theme(axis.text.x = element_blank()) +

  
  labs(title = "Frequency of Sports Offered by Gender",
       x = "Sports",
       y = "Number of Institutions that Sponsor")

mw_ncaa$pc_dif = abs(round((mw_ncaa$men-mw_ncaa$women)/mw_ncaa$men, digits=2))

mw_ncaa$probability = cut(mw_ncaa$pc_dif, breaks=c(-1,0.2, max(mw_ncaa$pc_dif)),
                    labels=c("Common", "Uncommon"))

mw_ncaa$sports[mw_ncaa$probability == 'Uncommon']

##  [1] "Archery"               "Badminton"             "Baseball"             
##  [4] "Beach Volleyball"      "Bowling"               "Equestrian"           
##  [7] "Fencing"               "Field Hockey"          "Football"             
## [10] "Gymnastics"            "Ice Hockey"            "Lacrosse"             
## [13] "Other Sports"          "Rifle"                 "Rowing"               
## [16] "Soccer"                "Softball"              "Swimming and Diving"  
## [19] "Synchronized Swimming" "Tennis"                "Volleyball"           
## [22] "Water Polo"            "Wrestling"

Conclusions and Testable Hypothesis III

Although very similar to the graph used in part II, there are some key differences in the graph above. Now we have been able to make some distinctions between schools that offer more or less of a given sport to one gender or another. In sports like Basketball or Track and Field, there are almost the exact same number of programs available to both men and women. On the complete opposite side of this, some sports like Football or Baseball are only given to men, where Softball is only given to women. There’s also everything in the middle like Volleyball having one of the largest differences, and other sports like Golf, Soccer, and Tennis having modest differences. Its important to create these distinctions when comparing sports to avoid saying that volleyball and soccer have about the same probability of being offered, when in reality only women’s volleyball and soccer have similar probabilities; men’s offerings in these sports are wildly different and much less likely.

I would hypothesise that there are actually more women’s sports than men’s sports if we sum the men and women columns. I think this is the case because, as supported by prior graphs made in this course, men tend to have larger teams like football, and Title IX plays a role in ensuring women have the same access to sporting opportunities as men. Because of this, schools often have to compensate for their men’s football teams by adding multiple, smaller programs for women.

Combinations of Two Categorical Columns

# creates a df of all unique combinations

compare <- unique(data.frame(state = ncaa$state_cd, sport = ncaa$sports))

# counts the number of occurences of each combination
# also provides unique combinations

combos <- ncaa |>
  group_by(state = state_cd, sport = sports) |>
  summarise(count = n())

## `summarise()` has grouped output by 'state'. You can override using the
## `.groups` argument.

# visualizes a table to better show count of combinations

cool_combo <- combos |>
  pivot_wider(names_from = state, values_from = count, 
               values_fill = 0)
cool_combo

## # A tibble: 37 × 54
##    sport    AK    AL    AR    AZ    CA    CO    CT    DC    DE    FL    GA    HI
##    <chr> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
##  1 Bask…    10    96    70    20   282    77    95    40    25   130   141    21
##  2 Gymn…     5    10     6    10    35     5    15     5     0     6     5     1
##  3 Ice …    10     5     0     5     0    10    41     0     0     0     0     0
##  4 Rifle     5    10     0     0     0     0     0     0     0     0    10     0
##  5 Skii…    10     0     0     0     0    10     0     0     0     0     0     0
##  6 Swim…     5     1    15     0    42    15    16     4     1    46    14     0
##  7 Trac…     5    13     0     0    37    10     5     9     3     9    10     0
##  8 Trac…     4    22     0     0   133    11     5    13     4    15    41     0
##  9 Trac…    10    35    25     0   144    17    32    14    11    58    71    16
## 10 Voll…    10    95    65    20   282    77    95    35    25   130   118    21
## # ℹ 27 more rows
## # ℹ 41 more variables: IA <int>, ID <int>, IL <int>, IN <int>, KS <int>,
## #   KY <int>, LA <int>, MA <int>, MD <int>, ME <int>, MI <int>, MN <int>,
## #   MO <int>, MS <int>, MT <int>, NC <int>, ND <int>, NE <int>, NH <int>,
## #   NJ <int>, NM <int>, NV <int>, NY <int>, OH <int>, OK <int>, OR <int>,
## #   PA <int>, PR <int>, RI <int>, SC <int>, SD <int>, TN <int>, TX <int>,
## #   UT <int>, VA <int>, VT <int>, WA <int>, WI <int>, WV <int>, WY <int>, …

# gets the total possible combinations, less actual combinations
# remaining are 'missing' combinations

n_distinct(combos$state) * n_distinct(combos$sport) -
  nrow(combos)

## [1] 822

# should return the same as found above

sum(cool_combo == 0)

## [1] 822

# shows the top few combinations of most popular sports in a state

combos <- combos[order(combos$count, decreasing=TRUE),]
head(combos, n=10)

## # A tibble: 10 × 3
## # Groups:   state [2]
##    state sport      count
##    <chr> <chr>      <int>
##  1 NY    Basketball   493
##  2 NY    Soccer       491
##  3 PA    Basketball   478
##  4 PA    Soccer       470
##  5 NY    Volleyball   465
##  6 PA    Volleyball   460
##  7 PA    Softball     456
##  8 NY    Softball     428
##  9 PA    Tennis       423
## 10 PA    Baseball     410

There are plenty of sports with 0 participation. 822 combinations of sports and states fit this criteria, and this can likely be attributed to some states having very few schools to offer a large variety of sports (Alaska is notable here among a few), and some sports likely have geographic limitations (Arizona doesn’t have an ice hockey team… I don’t think that’s a coincidence).

For the most popular combinations, limiting ourselves to the top 10 one theme is very clear: there are a lot of NCAA sports in Pennsylvania and New York. Basketball and soccer weren’t surprising to see with high frequency, but volleyball and softball really surprised me. Volleyball is limited almost entirely to women, and softball is exclusively for women. Basketball meanwhile is almost always offered to men and women, so I expected sports with men and women categories to outrank those without. That’s when I realized that the data combines sports that offer a men’s and women’s version in one row, or one count. So that is a limitation of the data unless I were to break it up into Women’s Sport and Men’s Sport - something I will probably do going forward.

ncaa |>
  mutate(Arizona = grepl("AZ", state_cd)) |>
  mutate(NorthDakota = grepl("ND", state_cd)) |>
  group_by(sports = sports) |>
  summarise(arizona = sum(Arizona), nd = sum(NorthDakota)) |>
  ggplot() +
  geom_point(mapping = aes(x = sports, y = arizona),
             color = 'red') +
  geom_text(mapping = aes(x = sports, y = arizona, label=sports), 
            size = 3, nudge_y = 1, check_overlap = TRUE) +
  
  geom_point(mapping = aes(x = sports, y = nd),
             color = 'blue') +
  geom_text(mapping = aes(x = sports, y = nd, label=sports), 
            size = 3, nudge_y = 1, check_overlap = TRUE) +
  
  theme(axis.text.x = element_blank()) +
  labs(title = "Frequency of Sports Offered in Arizona and North Dakota",
       x = "Sports:: Red: AZ; Blue: ND",
       y = "Frequency")

You can see how some sports occur more or less frequently in different states. Surprisingly Ice Hockey both occur at a school in each state, but even though Volleyball also happen the same amount at both schools, Arizona also offers Sand Volleyball. Arizona also offers more Water Polo and Swimming and Diving. Since the states are relatively small when it comes to NCAA representation, many smaller sports which are infrequently included are simply not offered such as Table Tennis or Archery.

Week 3 Data Dive

Thomas Reedy

2024-09-11