library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggthemes)
data <- read_delim("./sports.csv", delim = ',')
## Rows: 2936 Columns: 28
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): institution_name, city_txt, state_cd, classification_name, classif...
## dbl (21): year, unitid, zip_text, classification_code, ef_male_count, ef_fem...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Summary:
The dataset that I’m using is about collegiate sports in the United
States. I’ve focused my data to be only about Indiana. I did this so I
can reduce the number of rows to allowed range, but also to focus on the
state we live in.
The overall project goal is to run an ANOVA statistical test to
find out which sector is performing better in generating
profit.
Visualizations: First one = top 5 institutions with most revenue.
Because this gives you general idea of big revenue generators, and
there’s a good chance that these will be profitable.
top_5 <- data |>
group_by(institution_name) |>
summarise(visualize = sum(total_rev_menwomen,na.rm = TRUE)) |>
arrange(desc(visualize)) |>
filter(visualize >= 90000000) |>
pluck("institution_name")
top_5
## [1] "University of Notre Dame" "Indiana University-Bloomington"
## [3] "Purdue University-Main Campus" "Ball State University"
## [5] "Butler University"
p <- data |>
filter(institution_name %in% top_5) |>
ggplot(aes(x = institution_name, y = total_rev_menwomen, fill = institution_name)) +
geom_bar(stat = "identity") +
theme_light() +
scale_fill_brewer(palette = "Dark2") +
scale_x_discrete(labels = c("Ball State","Butler", "IU B", "Purdue", "Notre Dame")) +
labs(
x = "Institutions",
y = "Total Revenue (men and women) in dollars",
title = "Top Indiana Universities' Sports Revenue"
)
p
## Warning: Removed 99 rows containing missing values or values outside the scale range
## (`geom_bar()`).

Second Visualization= Which sector brings in more revenue. This
could be interesting because we can have an idea which type of colleges
are high revenue, possibly profit, generators.
p <- data |>
group_by(sector_name)|>
summarise(revenue = sum(total_rev_menwomen, na.rm = TRUE))
p
## # A tibble: 3 × 2
## sector_name revenue
## <chr> <dbl>
## 1 Private nonprofit, 2-year 3659049
## 2 Private nonprofit, 4-year or above 1212154047
## 3 Public, 4-year or above 951948154
ggplot(p, mapping = aes(x = sector_name, y = revenue)) +
geom_col(fill = "lightblue") +
theme_stata()

My plan moving forward is to work towards the goal of finding
profitable institutions in sports. Also, I want to find which sports
contribute more than the other.
Initial Findings:
Hypothesis One: If an institution is in top 10 revenue list, then
it’s going to be profitable.
Conclusion: I agree with the hypothesis because all top 10
institutions are profitable. I came to the conclusion from my analysis
and visualizations. A full hypothesis test wasn’t needed for this
assignment.
xy <- data |>
group_by(institution_name) |>
summarise(total_rev = sum(total_rev_menwomen, na.rm = TRUE)) |>
# filter(total_rev > 50000000) |>
arrange(desc(total_rev)) |>
slice_head(n = 10)
xy
## # A tibble: 10 × 2
## institution_name total_rev
## <chr> <dbl>
## 1 University of Notre Dame 575582082
## 2 Indiana University-Bloomington 366577565
## 3 Purdue University-Main Campus 287028134
## 4 Ball State University 96690803
## 5 Butler University 90932589
## 6 Indiana State University 63708755
## 7 University of Indianapolis 60579956
## 8 Valparaiso University 57812866
## 9 University of Evansville 50920006
## 10 Marian University 48390992
ggplot(xy, mapping = aes(x = fct_reorder(institution_name, desc(total_rev)), y = total_rev)) +
geom_col() +
theme_stata() +
labs(
x = "institution_name",
y = "Revenue"
) +
scale_x_discrete(labels = c("Rank1","2","3","4","5","6","7","8","9","10"))

Hypothesis Two: Institutions with higher total count will be more
profitable compared to the ones with less total count of students.
Conclusion: Even though we haven’t ran a full hypothesis test
according to instructions, my conclusion is that it’s not true that the
higher count will lead to more revenue based on general analysis and
visualizations.
xyz <- data|>
group_by(institution_name) |>
summarise(count = sum(ef_total_count)) |>
arrange(desc(count)) |>
slice_head(n=5)
xyz
## # A tibble: 5 × 2
## institution_name count
## <chr> <dbl>
## 1 Indiana University-Bloomington 2997148
## 2 Purdue University-Main Campus 1711940
## 3 Ball State University 1272708
## 4 Indiana University-Purdue University-Indianapolis 1219824
## 5 University of Notre Dame 778715
ggplot(xyz, mapping = aes(x= institution_name, y = count)) +
geom_col() +
theme_stata() +
labs(
title = "Student Count in Institutions",
x = "Institutions",
y = "Count"
) +
scale_x_discrete(labels=c("Ball State", "IU B", "IUPUI", "Purdue", "Notre Dame"))
