library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggthemes)
data <- read_delim("./sports.csv", delim = ',')
## Rows: 2936 Columns: 28
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): institution_name, city_txt, state_cd, classification_name, classif...
## dbl (21): year, unitid, zip_text, classification_code, ef_male_count, ef_fem...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Summary:

The dataset that I’m using is about collegiate sports in the United States. I’ve focused my data to be only about Indiana. I did this so I can reduce the number of rows to allowed range, but also to focus on the state we live in.

The overall project goal is to run an ANOVA statistical test to find out which sector is performing better in generating profit.

Visualizations: First one = top 5 institutions with most revenue. Because this gives you general idea of big revenue generators, and there’s a good chance that these will be profitable.

top_5 <- data |>
  group_by(institution_name) |>
  summarise(visualize = sum(total_rev_menwomen,na.rm = TRUE)) |>
  arrange(desc(visualize)) |>
  filter(visualize >= 90000000) |>
  pluck("institution_name")

top_5
## [1] "University of Notre Dame"       "Indiana University-Bloomington"
## [3] "Purdue University-Main Campus"  "Ball State University"         
## [5] "Butler University"
p <- data |>
  filter(institution_name %in% top_5) |>
  ggplot(aes(x = institution_name, y = total_rev_menwomen, fill = institution_name)) + 
  geom_bar(stat = "identity") +
  theme_light() +
  scale_fill_brewer(palette = "Dark2") +
  scale_x_discrete(labels = c("Ball State","Butler", "IU B", "Purdue", "Notre Dame")) +
  labs(
    x = "Institutions",
    y = "Total Revenue (men and women) in dollars",
    title = "Top Indiana Universities' Sports Revenue"
    ) 
  
p
## Warning: Removed 99 rows containing missing values or values outside the scale range
## (`geom_bar()`).

Second Visualization= Which sector brings in more revenue. This could be interesting because we can have an idea which type of colleges are high revenue, possibly profit, generators.

p <- data |>
  group_by(sector_name)|>
  summarise(revenue = sum(total_rev_menwomen, na.rm = TRUE))
  
p
## # A tibble: 3 × 2
##   sector_name                           revenue
##   <chr>                                   <dbl>
## 1 Private nonprofit, 2-year             3659049
## 2 Private nonprofit, 4-year or above 1212154047
## 3 Public, 4-year or above             951948154
ggplot(p, mapping = aes(x = sector_name, y = revenue)) +
  geom_col(fill = "lightblue") +
  theme_stata()

My plan moving forward is to work towards the goal of finding profitable institutions in sports. Also, I want to find which sports contribute more than the other.

Initial Findings:

Hypothesis One: If an institution is in top 10 revenue list, then it’s going to be profitable.

Conclusion: I agree with the hypothesis because all top 10 institutions are profitable. I came to the conclusion from my analysis and visualizations. A full hypothesis test wasn’t needed for this assignment.

xy <- data |>
  group_by(institution_name) |>
  summarise(total_rev = sum(total_rev_menwomen, na.rm = TRUE)) |>
  # filter(total_rev > 50000000) |>
  arrange(desc(total_rev)) |>
  slice_head(n = 10)

xy
## # A tibble: 10 × 2
##    institution_name               total_rev
##    <chr>                              <dbl>
##  1 University of Notre Dame       575582082
##  2 Indiana University-Bloomington 366577565
##  3 Purdue University-Main Campus  287028134
##  4 Ball State University           96690803
##  5 Butler University               90932589
##  6 Indiana State University        63708755
##  7 University of Indianapolis      60579956
##  8 Valparaiso University           57812866
##  9 University of Evansville        50920006
## 10 Marian University               48390992
ggplot(xy, mapping = aes(x = fct_reorder(institution_name, desc(total_rev)), y = total_rev)) +
geom_col() +
theme_stata() +
labs(
  x = "institution_name",
  y = "Revenue"
) +
scale_x_discrete(labels = c("Rank1","2","3","4","5","6","7","8","9","10"))

Hypothesis Two: Institutions with higher total count will be more profitable compared to the ones with less total count of students.

Conclusion: Even though we haven’t ran a full hypothesis test according to instructions, my conclusion is that it’s not true that the higher count will lead to more revenue based on general analysis and visualizations.

xyz <- data|>
  group_by(institution_name) |>
  summarise(count = sum(ef_total_count)) |>
  arrange(desc(count)) |>
  slice_head(n=5)


xyz
## # A tibble: 5 × 2
##   institution_name                                    count
##   <chr>                                               <dbl>
## 1 Indiana University-Bloomington                    2997148
## 2 Purdue University-Main Campus                     1711940
## 3 Ball State University                             1272708
## 4 Indiana University-Purdue University-Indianapolis 1219824
## 5 University of Notre Dame                           778715
ggplot(xyz, mapping = aes(x= institution_name, y = count)) +
geom_col() + 
theme_stata() +
labs(
  title = "Student Count in Institutions",
  x = "Institutions",
  y = "Count"
) +
scale_x_discrete(labels=c("Ball State", "IU B", "IUPUI", "Purdue", "Notre Dame"))