START
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
setwd("/Users/arsh10/Desktop/DATA SCIENCE/Stats")
data <- read_delim("./sports.csv", delim = ",")
## Rows: 2936 Columns: 28
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): institution_name, city_txt, state_cd, classification_name, classif...
## dbl (21): year, unitid, zip_text, classification_code, ef_male_count, ef_fem...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Group 1
data |>
group_by(institution_name) |>
summarise(total_count = sum(ef_total_count))
## # A tibble: 46 × 2
## institution_name total_count
## <chr> <dbl>
## 1 Ancilla College 18573
## 2 Anderson University 102545
## 3 Ball State University 1272708
## 4 Bethel College-Indiana 43840
## 5 Bethel University 12216
## 6 Butler University 350946
## 7 Calumet College of Saint Joseph 31343
## 8 DePauw University 179784
## 9 Earlham College 50850
## 10 Franklin College 78504
## # ℹ 36 more rows
l <- data |>
group_by(institution_name) |>
summarise(total_count = sum(ef_total_count)) |>
arrange(total_count)
l$strength <- cut(l$total_count,breaks = c(0,20000,300000,Inf),labels = c("least","medium","top"))
l
## # A tibble: 46 × 3
## institution_name total_count strength
## <chr> <dbl> <fct>
## 1 Saint Josephs College 8496 least
## 2 Bethel University 12216 least
## 3 Purdue University-North Central Campus 13842 least
## 4 Ancilla College 18573 least
## 5 Holy Cross College 23548 medium
## 6 Calumet College of Saint Joseph 31343 medium
## 7 Oakland City University 34100 medium
## 8 Saint Mary-of-the-Woods College 36713 medium
## 9 Bethel College-Indiana 43840 medium
## 10 Goshen College 50021 medium
## # ℹ 36 more rows
l |>
group_by(strength) |>
summarise(count = n()) |>
mutate(prob = count/sum(count))
## # A tibble: 3 × 3
## strength count prob
## <fct> <int> <dbl>
## 1 least 4 0.0870
## 2 medium 32 0.696
## 3 top 10 0.217
If we were to randomly select a row from our dataset, there’s about
8% probability that it’s going to be one from the least strength
group.
Hypothesis: If a college does not have data for all years, then it
is likely it’s going to be in least strength group.
Visualization Group 1
least <- data |>
group_by(institution_name) |>
summarise(visualize = sum(ef_total_count, na.rm = TRUE)) |>
arrange(visualize) |>
filter(visualize < 14000) |>
pluck("institution_name")
least
## [1] "Saint Josephs College"
## [2] "Bethel University"
## [3] "Purdue University-North Central Campus"
p <- data |>
filter(institution_name %in% least) |>
ggplot(aes(x = institution_name, y = ef_total_count, fill = institution_name)) +
geom_bar(stat = "identity") +
theme_light() +
scale_fill_brewer(palette = "Dark2") +
scale_x_discrete(labels = c("Bethel", "Purdue North Central", "Saint Joseph")) +
labs(
x = "Institution",
y = "Total no. of athletes",
title = "Athletes in least strength institutions"
)
p

Group 2
data |>
group_by(city_txt) |>
summarise(revenue = sum(total_rev_menwomen, na.rm = TRUE)) # might delete
## # A tibble: 34 × 2
## city_txt revenue
## <chr> <dbl>
## 1 Anderson 9382091
## 2 Angola 14973495
## 3 Bloomington 366577565
## 4 Donaldson 3659049
## 5 Evansville 70294827
## 6 Fort Wayne 125441632
## 7 Franklin 7794814
## 8 Gary 1394873
## 9 Goshen 11843896
## 10 Greencastle 17228474
## # ℹ 24 more rows
m <- data |>
group_by(city_txt) |>
summarise(total_rev = sum(total_rev_menwomen, na.rm = TRUE)) |>
arrange(total_rev)
m$strength <- cut(m$total_rev,breaks = c(0,2000000,5000000,Inf),labels = c("least","medium","top"))
m
## # A tibble: 34 × 3
## city_txt total_rev strength
## <chr> <dbl> <fct>
## 1 Westville 380393 least
## 2 Gary 1394873 least
## 3 Donaldson 3659049 medium
## 4 Rensselaer 3963227 medium
## 5 South Bend 4281575 medium
## 6 New Albany 4310946 medium
## 7 Kokomo 5075583 top
## 8 Saint Mary of the Woods 6147401 top
## 9 North Manchester 7653981 top
## 10 Franklin 7794814 top
## # ℹ 24 more rows
m |>
group_by(strength) |>
summarise(count = n()) |>
mutate(prob = count/sum(count))
## # A tibble: 3 × 3
## strength count prob
## <fct> <int> <dbl>
## 1 least 2 0.0588
## 2 medium 4 0.118
## 3 top 28 0.824
If we randomly picked a row from the dataset, there’s about 6%
probability that we pick a row from least strength group.
Hypothesis: If a city’s college participates in less number of
sports comparatively, then it’s likely to be in least strength
group.
Visualization Group 2
four_lowest_cities <- data |>
group_by(city_txt) |>
summarise(visualize = sum(total_rev_menwomen, na.rm = TRUE)) |>
arrange(visualize) |>
filter(visualize < 4000000) |>
pluck("city_txt")
four_lowest_cities
## [1] "Westville" "Gary" "Donaldson" "Rensselaer"
p <- data |>
filter(city_txt %in% four_lowest_cities) |>
ggplot(aes(x = city_txt, y = total_rev_menwomen, fill = city_txt)) +
geom_bar(stat = "identity") +
theme_light() +
scale_fill_brewer(palette = "Dark2")
p
## Warning: Removed 59 rows containing missing values or values outside the scale range
## (`geom_bar()`).

Group 3
n <- data |>
group_by(classification_name) |>
summarise(revenue = sum(total_rev_menwomen, na.rm = TRUE)) |>
arrange(revenue)
n$strength <- cut(n$revenue, breaks = c(0,5000000,15000000,Inf), labels = c('least','medium','top'))
n
## # A tibble: 13 × 3
## classification_name revenue strength
## <chr> <dbl> <fct>
## 1 NCAA Division III without football 1981112 least
## 2 NJCAA Division II 3659049 least
## 3 USCAA 6147401 medium
## 4 NJCAA Division I 10628989 medium
## 5 Other 31314887 top
## 6 NCAA Division II without football 37882665 top
## 7 NCAA Division II with football 64543183 top
## 8 NAIA Division I 70988867 top
## 9 NCAA Division III with football 86537305 top
## 10 NCAA Division I without football 122716089 top
## 11 NCAA Division I-FCS 181139323 top
## 12 NAIA Division II 224343796 top
## 13 NCAA Division I-FBS 1325878584 top
n |>
group_by(strength) |>
summarise(count = n()) |>
mutate(prob = count/sum(count))
## # A tibble: 3 × 3
## strength count prob
## <fct> <int> <dbl>
## 1 least 2 0.154
## 2 medium 2 0.154
## 3 top 9 0.692
There’s about 15% chance of a row in least strength group being
picked.
Hypothesis: Higher tier divisions may correlate with higher revenue
generation.
Visualization for Group 3
four_lowest_classification <- data |>
group_by(classification_name)|>
summarise(visualize = sum(total_rev_menwomen, na.rm = TRUE)) |>
arrange(visualize) |>
filter(visualize < 20000000) |>
pluck("classification_name")
four_lowest_classification
## [1] "NCAA Division III without football" "NJCAA Division II"
## [3] "USCAA" "NJCAA Division I"
p <- data |>
filter(classification_name %in% four_lowest_classification) |>
ggplot(aes(x = classification_name, y = total_rev_menwomen, fill = classification_name)) +
geom_bar(stat = "identity") +
theme_light() +
scale_fill_brewer(palette = "Dark2")
p
## Warning: Removed 91 rows containing missing values or values outside the scale range
## (`geom_bar()`).

Combinations
data |>
group_by(classification_name, sports) |>
summarise(s_count = n()) |>
arrange(desc(s_count))
## `summarise()` has grouped output by 'classification_name'. You can override
## using the `.groups` argument.
## # A tibble: 430 × 3
## # Groups: classification_name [13]
## classification_name sports s_count
## <chr> <chr> <int>
## 1 NAIA Division II Basketball 71
## 2 NAIA Division II Volleyball 67
## 3 NAIA Division II Golf 63
## 4 NAIA Division II Tennis 58
## 5 NAIA Division II Baseball 57
## 6 NAIA Division II Softball 54
## 7 NAIA Division II Soccer 52
## 8 NAIA Division II Track and Field, X-Country 43
## 9 NCAA Division III with football Baseball 39
## 10 NCAA Division III with football Basketball 39
## # ℹ 420 more rows
We don’t really have rows that are missing any combination. Because
each division has their appropriate sport listed depending on the
institution. Therefore, no visualization can be made here.
For the frequency of combinations, we find that NAIA Division II and
basketball combination is the most common with 71 occurrences. This
could be due to the popularity of the sport and also how easy it is to
incorporate it into this division (think of player demand, funding,
cost, etc). On the other hand, the least common combinations, that
appear only once, are mostly NCAA Division II with football and other
sports like Rodeo, Table Tennis, etc. it could be because this division
is highly based on football, so it’s combination with other sports is
uncommon.