Regression Modelling

ANOVA

Reading the data

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggthemes)
data <- read_delim('./sports.csv', delim = ',')
## Rows: 2936 Columns: 28
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): institution_name, city_txt, state_cd, classification_name, classif...
## dbl (21): year, unitid, zip_text, classification_code, ef_male_count, ef_fem...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Our categorical (predictor) variable = sector_name

Our response variable = total_rev_menwomen

General Visualization depicting distribution of different sectors.

data |>
  ggplot(mapping = aes(x = sector_name, y = total_rev_menwomen)) +
  geom_boxplot() +
  scale_y_log10(labels = \(x) paste('$', x/1000,'K')) +
  theme_minimal() +
  labs(
    x = 'Sectors',
    y = 'Revenue'
  )
## Warning: Removed 958 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

Null Hypothesis: There is no difference in sector groups in terms of revenue.

# checking for assumptions
first_group <- data |>
  filter(sector_name == "Public, 4-year or above") 

ggplot(first_group, mapping = aes(x = total_rev_menwomen)) +
  geom_histogram() +
  scale_x_log10() +
  theme_minimal()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 389 rows containing non-finite outside the scale range
## (`stat_bin()`).

F-Test

m <- aov(total_rev_menwomen ~ sector_name, data = data)
summary(m)
##               Df    Sum Sq   Mean Sq F value Pr(>F)  
## sector_name    2 2.849e+14 1.424e+14   3.692 0.0251 *
## Residuals   1975 7.620e+16 3.858e+13                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 958 observations deleted due to missingness

we’ve got an F-value of 3.692 which tells us that there is likely a difference between groups in terms of revenue. Also, p-value of 0.02 is small which tells us that getting this difference or more extreme is unlikely if the null was true. So we can safely reject the null hypothesis.

Now, let’s find out which group is different from the other two using a pairwise t-test.

pairwise.t.test(data$total_rev_menwomen, data$sector_name, p.adjust.method = 'bonferroni')
## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  data$total_rev_menwomen and data$sector_name 
## 
##                                    Private nonprofit, 2-year
## Private nonprofit, 4-year or above 1.00                     
## Public, 4-year or above            0.55                     
##                                    Private nonprofit, 4-year or above
## Private nonprofit, 4-year or above -                                 
## Public, 4-year or above            0.03                              
## 
## P value adjustment method: bonferroni

It looks like public 4 year is a bit different than the rest because the p-values are small. Both private sectors are matching with themselves. But public 4yr and private 4 yr are distinct with 0.03 p-value, and public 4 yr and private 2 yr is also seems different with p-value of 0.55.

Linear Regression

Exploring dataset to find appropriate explanatory variables

data_lm <- data #copied data for linear regression
data_lm <- data_lm |>
  mutate(margin = round((total_rev_menwomen - total_exp_menwomen)/ total_rev_menwomen,digits = 3))
data_lm |>
  ggplot(mapping = aes(x=ef_total_count, y =total_rev_menwomen)) +
  scale_y_log10() +
  scale_x_log10() +
  geom_point(size =2, color = 'darkblue')
## Warning: Removed 958 rows containing missing values or values outside the scale range
## (`geom_point()`).

After some adjustments, we find that there is a rough linear relationship between total students and total revenue of institutions. So we can try running the linear regression.

So let’s try the model

model <- lm(total_rev_menwomen ~ ef_total_count, data)
model$coefficients
##    (Intercept) ef_total_count 
##    256671.2982       164.0522

So from the model, for every one unit of ef_total_count(students), the revenue goes up by $164. My recommendation would be straightforward that if you want to improve the revenue, have more students.