library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
data <- read_delim("./sports.csv",delim = ',')
## Rows: 2936 Columns: 28
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): institution_name, city_txt, state_cd, classification_name, classif...
## dbl (21): year, unitid, zip_text, classification_code, ef_male_count, ef_fem...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
spec(data)
## cols(
## year = col_double(),
## unitid = col_double(),
## institution_name = col_character(),
## city_txt = col_character(),
## state_cd = col_character(),
## zip_text = col_double(),
## classification_code = col_double(),
## classification_name = col_character(),
## classification_other = col_character(),
## ef_male_count = col_double(),
## ef_female_count = col_double(),
## ef_total_count = col_double(),
## sector_cd = col_double(),
## sector_name = col_character(),
## sportscode = col_double(),
## partic_men = col_double(),
## partic_women = col_double(),
## partic_coed_men = col_double(),
## partic_coed_women = col_double(),
## sum_partic_men = col_double(),
## sum_partic_women = col_double(),
## rev_men = col_double(),
## rev_women = col_double(),
## total_rev_menwomen = col_double(),
## exp_men = col_double(),
## exp_women = col_double(),
## total_exp_menwomen = col_double(),
## sports = col_character()
## )
Some confusing elements in data:
first of all, sum_partic_men/women columns were confusing because I
don’t know how they differentiate from partic_men/women.
Other columns that I found a bit unclear were ef_male_count and
partic_men. Are they talking about total men in university and total
participating men in sports? It’s a bit unclear for me.
Next, in the sports column, I was confused if All Track Combined
data is same for every institution. In other words, does it include the
same track sports for each or different.
Some zip codes are in standard five digit format and some are in
full nine digit format. Were they not able to find full nine digit code
for all?
After referring to documentation:
After reading the documentation, I figured sum_partic is combining
participants with co-ed participants.
ef_male_count represent total men in the institution.
*I could not find which exact sports were part of All Track Combined
from the documentation.
For zip codes, I think they just included the ones they could
find.
*still unclear
Let’s Visualize…
revenue <- data |>
filter(sports == "All Track Combined",institution_name %in% c("Anderson University","Purdue University-Main Campus"), year == "2019") |>
select(institution_name, total_rev_menwomen)
revenue
## # A tibble: 2 × 2
## institution_name total_rev_menwomen
## <chr> <dbl>
## 1 Anderson University 143031
## 2 Purdue University-Main Campus 482273
ggplot(data = revenue, aes(x = institution_name, y =total_rev_menwomen, fill = institution_name)) +
geom_bar(stat = "identity") +
theme_classic() +
labs(
title = "revenue comparison in All Track Combined"
)

I used this visualization to clarify that how would we know from
what track sports exactly is this revenue coming from. In other words,
Purdue may participate in different track sports than the Anderson
University. So, this was a bit of confusion I wanted to point out.