START

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
setwd("/Users/arsh10/Desktop/DATA SCIENCE/Stats")
data <- read_delim("./sports.csv", delim = ",")
## Rows: 2936 Columns: 28
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): institution_name, city_txt, state_cd, classification_name, classif...
## dbl (21): year, unitid, zip_text, classification_code, ef_male_count, ef_fem...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Group 1

data |>
  group_by(institution_name) |>
  summarise(total_count = sum(ef_total_count)) 
## # A tibble: 46 × 2
##    institution_name                total_count
##    <chr>                                 <dbl>
##  1 Ancilla College                       18573
##  2 Anderson University                  102545
##  3 Ball State University               1272708
##  4 Bethel College-Indiana                43840
##  5 Bethel University                     12216
##  6 Butler University                    350946
##  7 Calumet College of Saint Joseph       31343
##  8 DePauw University                    179784
##  9 Earlham College                       50850
## 10 Franklin College                      78504
## # ℹ 36 more rows
l <- data |>
  group_by(institution_name) |>
  summarise(total_count = sum(ef_total_count)) |>
  arrange(total_count)
l$strength <- cut(l$total_count,breaks = c(0,20000,300000,Inf),labels = c("least","medium","top"))

l
## # A tibble: 46 × 3
##    institution_name                       total_count strength
##    <chr>                                        <dbl> <fct>   
##  1 Saint Josephs College                         8496 least   
##  2 Bethel University                            12216 least   
##  3 Purdue University-North Central Campus       13842 least   
##  4 Ancilla College                              18573 least   
##  5 Holy Cross College                           23548 medium  
##  6 Calumet College of Saint Joseph              31343 medium  
##  7 Oakland City University                      34100 medium  
##  8 Saint Mary-of-the-Woods College              36713 medium  
##  9 Bethel College-Indiana                       43840 medium  
## 10 Goshen College                               50021 medium  
## # ℹ 36 more rows
l |>
  group_by(strength) |>
  summarise(count = n()) |>
  mutate(prob = count/sum(count))
## # A tibble: 3 × 3
##   strength count   prob
##   <fct>    <int>  <dbl>
## 1 least        4 0.0870
## 2 medium      32 0.696 
## 3 top         10 0.217

If we were to randomly select a row from our dataset, there’s about 8% probability that it’s going to be one from the least strength group.

Hypothesis: If a college does not have data for all years, then it is likely it’s going to be in least strength group.

Visualization Group 1

least <- data |>
  group_by(institution_name) |>
  summarise(visualize = sum(ef_total_count, na.rm = TRUE)) |>
  arrange(visualize) |>
  filter(visualize < 14000) |>
  pluck("institution_name")

  
least
## [1] "Saint Josephs College"                 
## [2] "Bethel University"                     
## [3] "Purdue University-North Central Campus"
p <- data |>
  filter(institution_name %in% least) |>
  ggplot(aes(x = institution_name, y = ef_total_count, fill = institution_name)) +
  geom_bar(stat = "identity") +
  theme_light() +
  scale_fill_brewer(palette = "Dark2") +
  scale_x_discrete(labels = c("Bethel", "Purdue North Central", "Saint Joseph")) +
  labs(
    x = "Institution",
    y = "Total no. of athletes",
    title = "Athletes in least strength institutions"
    )
  
  

p

Group 2

data |>
  group_by(city_txt) |>
  summarise(revenue = sum(total_rev_menwomen, na.rm = TRUE))  # might delete
## # A tibble: 34 × 2
##    city_txt      revenue
##    <chr>           <dbl>
##  1 Anderson      9382091
##  2 Angola       14973495
##  3 Bloomington 366577565
##  4 Donaldson     3659049
##  5 Evansville   70294827
##  6 Fort Wayne  125441632
##  7 Franklin      7794814
##  8 Gary          1394873
##  9 Goshen       11843896
## 10 Greencastle  17228474
## # ℹ 24 more rows
m <- data |>
  group_by(city_txt) |>
  summarise(total_rev = sum(total_rev_menwomen, na.rm = TRUE)) |>
  arrange(total_rev)
m$strength <- cut(m$total_rev,breaks = c(0,2000000,5000000,Inf),labels = c("least","medium","top"))

m
## # A tibble: 34 × 3
##    city_txt                total_rev strength
##    <chr>                       <dbl> <fct>   
##  1 Westville                  380393 least   
##  2 Gary                      1394873 least   
##  3 Donaldson                 3659049 medium  
##  4 Rensselaer                3963227 medium  
##  5 South Bend                4281575 medium  
##  6 New Albany                4310946 medium  
##  7 Kokomo                    5075583 top     
##  8 Saint Mary of the Woods   6147401 top     
##  9 North Manchester          7653981 top     
## 10 Franklin                  7794814 top     
## # ℹ 24 more rows
m |>
  group_by(strength) |>
  summarise(count = n()) |>
  mutate(prob = count/sum(count))
## # A tibble: 3 × 3
##   strength count   prob
##   <fct>    <int>  <dbl>
## 1 least        2 0.0588
## 2 medium       4 0.118 
## 3 top         28 0.824

If we randomly picked a row from the dataset, there’s about 6% probability that we pick a row from least strength group.

Hypothesis: If a city’s college participates in less number of sports comparatively, then it’s likely to be in least strength group.

Visualization Group 2

four_lowest_cities <- data |>
  group_by(city_txt) |>
  summarise(visualize = sum(total_rev_menwomen, na.rm = TRUE)) |>
  arrange(visualize) |>
  filter(visualize < 4000000) |>
  pluck("city_txt")


four_lowest_cities
## [1] "Westville"  "Gary"       "Donaldson"  "Rensselaer"
p <- data |>
  filter(city_txt %in% four_lowest_cities) |>
  ggplot(aes(x = city_txt, y = total_rev_menwomen, fill = city_txt)) +
  geom_bar(stat = "identity") +
  theme_light() +
  scale_fill_brewer(palette = "Dark2") 
  
p
## Warning: Removed 59 rows containing missing values or values outside the scale range
## (`geom_bar()`).

Group 3

n <- data |>
  group_by(classification_name) |>
  summarise(revenue = sum(total_rev_menwomen, na.rm = TRUE)) |>
  arrange(revenue)
n$strength <- cut(n$revenue, breaks = c(0,5000000,15000000,Inf), labels = c('least','medium','top'))
n
## # A tibble: 13 × 3
##    classification_name                   revenue strength
##    <chr>                                   <dbl> <fct>   
##  1 NCAA Division III without football    1981112 least   
##  2 NJCAA Division II                     3659049 least   
##  3 USCAA                                 6147401 medium  
##  4 NJCAA Division I                     10628989 medium  
##  5 Other                                31314887 top     
##  6 NCAA Division II without football    37882665 top     
##  7 NCAA Division II with football       64543183 top     
##  8 NAIA Division I                      70988867 top     
##  9 NCAA Division III with football      86537305 top     
## 10 NCAA Division I without football    122716089 top     
## 11 NCAA Division I-FCS                 181139323 top     
## 12 NAIA Division II                    224343796 top     
## 13 NCAA Division I-FBS                1325878584 top
n |>
  group_by(strength) |>
  summarise(count = n()) |>
  mutate(prob = count/sum(count))
## # A tibble: 3 × 3
##   strength count  prob
##   <fct>    <int> <dbl>
## 1 least        2 0.154
## 2 medium       2 0.154
## 3 top          9 0.692

There’s about 15% chance of a row in least strength group being picked.

Hypothesis: Higher tier divisions may correlate with higher revenue generation.

Visualization for Group 3

four_lowest_classification <- data |>
    group_by(classification_name)|>
    summarise(visualize = sum(total_rev_menwomen, na.rm = TRUE)) |>
    arrange(visualize) |>
    filter(visualize < 20000000) |>
    pluck("classification_name")

four_lowest_classification
## [1] "NCAA Division III without football" "NJCAA Division II"                 
## [3] "USCAA"                              "NJCAA Division I"
p <- data |>
  filter(classification_name %in% four_lowest_classification) |>
  ggplot(aes(x = classification_name, y = total_rev_menwomen, fill = classification_name)) +
  geom_bar(stat = "identity") +
  theme_light() +
  scale_fill_brewer(palette = "Dark2") 


p
## Warning: Removed 91 rows containing missing values or values outside the scale range
## (`geom_bar()`).

Combinations

data |>
  group_by(classification_name, sports) |>
  summarise(s_count = n()) |>
  arrange(desc(s_count)) 
## `summarise()` has grouped output by 'classification_name'. You can override
## using the `.groups` argument.
## # A tibble: 430 × 3
## # Groups:   classification_name [13]
##    classification_name             sports                     s_count
##    <chr>                           <chr>                        <int>
##  1 NAIA Division II                Basketball                      71
##  2 NAIA Division II                Volleyball                      67
##  3 NAIA Division II                Golf                            63
##  4 NAIA Division II                Tennis                          58
##  5 NAIA Division II                Baseball                        57
##  6 NAIA Division II                Softball                        54
##  7 NAIA Division II                Soccer                          52
##  8 NAIA Division II                Track and Field, X-Country      43
##  9 NCAA Division III with football Baseball                        39
## 10 NCAA Division III with football Basketball                      39
## # ℹ 420 more rows

We don’t really have rows that are missing any combination. Because each division has their appropriate sport listed depending on the institution. Therefore, no visualization can be made here.

For the frequency of combinations, we find that NAIA Division II and basketball combination is the most common with 71 occurrences. This could be due to the popularity of the sport and also how easy it is to incorporate it into this division (think of player demand, funding, cost, etc). On the other hand, the least common combinations, that appear only once, are mostly NCAA Division II with football and other sports like Rodeo, Table Tennis, etc. it could be because this division is highly based on football, so it’s combination with other sports is uncommon.