library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
data <- read_delim("./sports.csv",delim = ',')
## Rows: 2936 Columns: 28
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): institution_name, city_txt, state_cd, classification_name, classif...
## dbl (21): year, unitid, zip_text, classification_code, ef_male_count, ef_fem...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
spec(data) 
## cols(
##   year = col_double(),
##   unitid = col_double(),
##   institution_name = col_character(),
##   city_txt = col_character(),
##   state_cd = col_character(),
##   zip_text = col_double(),
##   classification_code = col_double(),
##   classification_name = col_character(),
##   classification_other = col_character(),
##   ef_male_count = col_double(),
##   ef_female_count = col_double(),
##   ef_total_count = col_double(),
##   sector_cd = col_double(),
##   sector_name = col_character(),
##   sportscode = col_double(),
##   partic_men = col_double(),
##   partic_women = col_double(),
##   partic_coed_men = col_double(),
##   partic_coed_women = col_double(),
##   sum_partic_men = col_double(),
##   sum_partic_women = col_double(),
##   rev_men = col_double(),
##   rev_women = col_double(),
##   total_rev_menwomen = col_double(),
##   exp_men = col_double(),
##   exp_women = col_double(),
##   total_exp_menwomen = col_double(),
##   sports = col_character()
## )

Some confusing elements in data:

first of all, sum_partic_men/women columns were confusing because I don’t know how they differentiate from partic_men/women.

Other columns that I found a bit unclear were ef_male_count and partic_men. Are they talking about total men in university and total participating men in sports? It’s a bit unclear for me.

Next, in the sports column, I was confused if All Track Combined data is same for every institution. In other words, does it include the same track sports for each or different.

Some zip codes are in standard five digit format and some are in full nine digit format. Were they not able to find full nine digit code for all?

After referring to documentation:

After reading the documentation, I figured sum_partic is combining participants with co-ed participants.

ef_male_count represent total men in the institution.

*I could not find which exact sports were part of All Track Combined from the documentation.

For zip codes, I think they just included the ones they could find.

*still unclear

Let’s Visualize…

revenue <- data |>
  filter(sports == "All Track Combined",institution_name %in% c("Anderson University","Purdue University-Main Campus"), year == "2019") |>
  select(institution_name, total_rev_menwomen)
revenue
## # A tibble: 2 × 2
##   institution_name              total_rev_menwomen
##   <chr>                                      <dbl>
## 1 Anderson University                       143031
## 2 Purdue University-Main Campus             482273
  ggplot(data = revenue, aes(x = institution_name, y =total_rev_menwomen, fill = institution_name)) +
  geom_bar(stat = "identity") +
  theme_classic() +
  labs(
     title = "revenue comparison in All Track Combined"
  )

I used this visualization to clarify that how would we know from what track sports exactly is this revenue coming from. In other words, Purdue may participate in different track sports than the Anderson University. So, this was a bit of confusion I wanted to point out.