Linear Regression Part II

Loading the data and libraries

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggthemes)
library(broom)
library(lindia)
data <- read_delim('./sports.csv', delim = ',')
## Rows: 2936 Columns: 28
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): institution_name, city_txt, state_cd, classification_name, classif...
## dbl (21): year, unitid, zip_text, classification_code, ef_male_count, ef_fem...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Let’s review our model from part one.

model <- lm(total_rev_menwomen ~ ef_total_count, data)
model$coefficients
##    (Intercept) ef_total_count 
##    256671.2982       164.0522

We learned that for each student added, our revenue would go up by about $164. But I think there’s at least one other variable whose levels could affect how total count is influencing the revenue. In other words, we’ve an interaction term. I’m going to use participating men variable. Before that, we need to ensure if there’s an issue with multicollinearity.

#testing for multicollinearity
data |>
  ggplot(mapping = aes(x= ef_total_count, y = partic_men)) +
  geom_point(size = 1) +
  scale_y_log10() +
  scale_x_log10() +
  theme_minimal()
## Warning: Removed 1453 rows containing missing values or values outside the scale range
## (`geom_point()`).

So even though we see slight linearity, I’m going to move forward with this variable for purposes of our assignment. Linearity is small and we don’t have any other variable to work with, so we’ll ignore it.

Now, let’s run the model with this new variable and see what do we get.

model2 <- lm(total_rev_menwomen ~ ef_total_count + partic_men +
               ef_total_count:partic_men, data)
tidy(model2) |>
  select(term, estimate) |>
  mutate(estimate = round(estimate, 1))
## # A tibble: 4 × 2
##   term                      estimate
##   <chr>                        <dbl>
## 1 (Intercept)                43872. 
## 2 ef_total_count               -57.8
## 3 partic_men                 10800. 
## 4 ef_total_count:partic_men      6.7

From the model, coefficients for total count and participating men here won’t be realistic because you can’t have zero total count and increase participating men at the same time. But there are some informative points to look for. We can see that revenue is -57.8 if there are (constant) zero partic_men. Revenue goes down. Finally, we got a positive interaction term of 6.7 which means increase in partic_men helps push ef_total_count’s affect on revenue towards positivity by 6.7 units.

Now let’s test our model using the diagnostic plots:

Residuals vs. Fitted Values

# I removed the interaction term for diagnostic plots bc it was interfering with some plots
model2 <- lm(total_rev_menwomen ~ ef_total_count + partic_men , data)
gg_resfitted(model2) +
  geom_smooth(se=FALSE)
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

Our model without the interaction term looks good according to the plot above. The variance in residuals are pretty even throughout and the number of values above and below the zero line looks similar as well.

Residuals vs X Values

plot <- gg_resX(model2, plot.all = FALSE)

# for each variable of interest ...
plot$ef_total_count +
  geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

plot$partic_men +
  geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

Overall, plots for both total students and participating men looks good enough for the assumptions to be met. The thing I noticed is that in total student plot, the variance is higher in higher counts compared to lower counts. Also in participating men plot around 100 count, we see unusual high residuals. We’ll look at the normality of residuals even more closely with the following plots.

Residual Histogram

gg_reshist(model2) 
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

In the histogram, most of the values of residuals are close to zero. We can’t fully confirm the normality here. But it does not seem to be properly normal curve. It’s leaning more towards a poisson curve. Using the QQ plot, we’ll see more clearly the normality of the curve or how much we’re deviating from it.

QQ Plot

gg_qqplot(model2)

Our QQ plot confirmed that it’s not completely normal even though this plot is sensitive to small changes. The right leg of the bell curve for residuals is not quite right for our model.

Cook’s D by Observation

# let's check influential points using Cook's D
gg_cooksd(model2, threshold = 'matlab')

As we can see, there are multiple influential data points for our model.

Overall:

I think our model is violating some assumptions of linear regression. Plots are done on model without interaction term by the way. Residuals do not follow a normal distribution. Also, the variance is not quite constant across the prediction line.