library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggthemes)
library(broom)
library(lindia)
data <- read_delim('./sports.csv', delim = ',')
## Rows: 2936 Columns: 28
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): institution_name, city_txt, state_cd, classification_name, classif...
## dbl (21): year, unitid, zip_text, classification_code, ef_male_count, ef_fem...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
model <- lm(total_rev_menwomen ~ ef_total_count, data)
model$coefficients
## (Intercept) ef_total_count
## 256671.2982 164.0522
#testing for multicollinearity
data |>
ggplot(mapping = aes(x= ef_total_count, y = partic_men)) +
geom_point(size = 1) +
scale_y_log10() +
scale_x_log10() +
theme_minimal()
## Warning: Removed 1453 rows containing missing values or values outside the scale range
## (`geom_point()`).
model2 <- lm(total_rev_menwomen ~ ef_total_count + partic_men +
ef_total_count:partic_men, data)
tidy(model2) |>
select(term, estimate) |>
mutate(estimate = round(estimate, 1))
## # A tibble: 4 × 2
## term estimate
## <chr> <dbl>
## 1 (Intercept) 43872.
## 2 ef_total_count -57.8
## 3 partic_men 10800.
## 4 ef_total_count:partic_men 6.7
# I removed the interaction term for diagnostic plots bc it was interfering with some plots
model2 <- lm(total_rev_menwomen ~ ef_total_count + partic_men , data)
gg_resfitted(model2) +
geom_smooth(se=FALSE)
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
plot <- gg_resX(model2, plot.all = FALSE)
# for each variable of interest ...
plot$ef_total_count +
geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
plot$partic_men +
geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
Overall, plots for both total students and participating men looks good enough for the assumptions to be met. The thing I noticed is that in total student plot, the variance is higher in higher counts compared to lower counts. Also in participating men plot around 100 count, we see unusual high residuals. We’ll look at the normality of residuals even more closely with the following plots.
gg_reshist(model2)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
In the histogram, most of the values of residuals are close to zero. We can’t fully confirm the normality here. But it does not seem to be properly normal curve. It’s leaning more towards a poisson curve. Using the QQ plot, we’ll see more clearly the normality of the curve or how much we’re deviating from it.
gg_qqplot(model2)
Our QQ plot confirmed that it’s not completely normal even though this plot is sensitive to small changes. The right leg of the bell curve for residuals is not quite right for our model.
# let's check influential points using Cook's D
gg_cooksd(model2, threshold = 'matlab')
As we can see, there are multiple influential data points for our model.