ANOVA
Reading the data
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggthemes)
data <- read_delim('./sports.csv', delim = ',')
## Rows: 2936 Columns: 28
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): institution_name, city_txt, state_cd, classification_name, classif...
## dbl (21): year, unitid, zip_text, classification_code, ef_male_count, ef_fem...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Our categorical (predictor) variable = sector_name
Our response variable = total_rev_menwomen
General Visualization depicting distribution of different
sectors.
data |>
ggplot(mapping = aes(x = sector_name, y = total_rev_menwomen)) +
geom_boxplot() +
scale_y_log10(labels = \(x) paste('$', x/1000,'K')) +
theme_minimal() +
labs(
x = 'Sectors',
y = 'Revenue'
)
## Warning: Removed 958 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

Null Hypothesis: There is no difference in sector groups in terms of
revenue.
# checking for assumptions
first_group <- data |>
filter(sector_name == "Public, 4-year or above")
ggplot(first_group, mapping = aes(x = total_rev_menwomen)) +
geom_histogram() +
scale_x_log10() +
theme_minimal()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 389 rows containing non-finite outside the scale range
## (`stat_bin()`).

F-Test
m <- aov(total_rev_menwomen ~ sector_name, data = data)
summary(m)
## Df Sum Sq Mean Sq F value Pr(>F)
## sector_name 2 2.849e+14 1.424e+14 3.692 0.0251 *
## Residuals 1975 7.620e+16 3.858e+13
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 958 observations deleted due to missingness
we’ve got an F-value of 3.692 which tells us that there is likely a
difference between groups in terms of revenue. Also, p-value of 0.02 is
small which tells us that getting this difference or more extreme is
unlikely if the null was true. So we can safely reject the null
hypothesis.
Now, let’s find out which group is different from the other two
using a pairwise t-test.
pairwise.t.test(data$total_rev_menwomen, data$sector_name, p.adjust.method = 'bonferroni')
##
## Pairwise comparisons using t tests with pooled SD
##
## data: data$total_rev_menwomen and data$sector_name
##
## Private nonprofit, 2-year
## Private nonprofit, 4-year or above 1.00
## Public, 4-year or above 0.55
## Private nonprofit, 4-year or above
## Private nonprofit, 4-year or above -
## Public, 4-year or above 0.03
##
## P value adjustment method: bonferroni
It looks like public 4 year is a bit different than the rest because
the p-values are small. Both private sectors are matching with
themselves. But public 4yr and private 4 yr are distinct with 0.03
p-value, and public 4 yr and private 2 yr is also seems different with
p-value of 0.55.
Linear Regression
Exploring dataset to find appropriate explanatory variables
data_lm <- data #copied data for linear regression
data_lm <- data_lm |>
mutate(margin = round((total_rev_menwomen - total_exp_menwomen)/ total_rev_menwomen,digits = 3))
data_lm |>
ggplot(mapping = aes(x=ef_total_count, y =total_rev_menwomen)) +
scale_y_log10() +
scale_x_log10() +
geom_point(size =2, color = 'darkblue')
## Warning: Removed 958 rows containing missing values or values outside the scale range
## (`geom_point()`).

After some adjustments, we find that there is a rough linear
relationship between total students and total revenue of institutions.
So we can try running the linear regression.
So let’s try the model
model <- lm(total_rev_menwomen ~ ef_total_count, data)
model$coefficients
## (Intercept) ef_total_count
## 256671.2982 164.0522
So from the model, for every one unit of ef_total_count(students),
the revenue goes up by $164. My recommendation would be straightforward
that if you want to improve the revenue, have more students.