Loading the tidyverse, reading in the data, and setting the seed
# Loading tidyverse
library(tidyverse)
#Loading in Data
nhl_draft <- read_csv("nhldraft.csv")
# Setting seed
set.seed(1)
# Creating is_top25 variable
nhl_draft <- nhl_draft |>
  mutate(is_top25 = case_when(
    overall_pick > 25 ~ 0,
    overall_pick <= 25 ~ 1
  ))
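Before moving on, here is a quick check of how the new flag splits the data (a small sketch using the column created above):
# Counting how many picks fall into each class of is_top25
nhl_draft |>
  count(is_top25)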
For this week’s Data Dive we will be expanding on the model created last week and looking at some of the challenges of modeling with Generalized Linear Models.
First, here is our model from last week:
model1 <- glm(is_top25 ~ goals, data = nhl_draft,
              family = binomial(link = 'logit'))
With this model we used logistic regression to try to predict whether a player was a top 25 pick based solely on their goals.
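In other words, the model estimates \(P(\text{is\_top25} = 1 \mid \text{goals}) = \frac{1}{1 + e^{-(\beta_0 + \beta_1\,\text{goals})}}\), so each additional goal shifts the log-odds of being a top 25 pick by \(\beta_1\).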
This week we’re going to improve the model.
First, let’s look at the model summary to gain a sense of how well the model did.
summary(model1)
##
## Call:
## glm(formula = is_top25 ~ goals, family = binomial(link = "logit"),
## data = nhl_draft)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.5766118 0.0407770 -38.66 <2e-16 ***
## goals 0.0071167 0.0003683 19.32 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 5781.8 on 5245 degrees of freedom
## Residual deviance: 5326.3 on 5244 degrees of freedom
## (7004 observations deleted due to missingness)
## AIC: 5330.3
##
## Number of Fisher Scoring iterations: 4
The first thing I want to point out is that we can’t fully gauge the model’s performance without comparing it to other models.
Still, we can at least see that the goals explanatory variable is statistically significant in the model. This matters as we improve the model to account for other variables, so goals will be kept as we build the model out.
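One quick way to interpret that coefficient is to exponentiate it into an odds ratio; from the summary above, \(e^{0.0071} \approx 1.007\), meaning each additional career goal increases the odds of being a top 25 pick by roughly 0.7%. A minimal sketch using the fitted model:
# Odds ratio for goals: how much one additional career goal
# multiplies the odds of being a top 25 pick
exp(coef(model1)["goals"])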
Next, let’s create a new model with 2 other variables: assists and points.
# Adding assists and points independent variables to the model
model2 <- glm(is_top25 ~ goals + points + assists, data = nhl_draft,
              family = binomial(link = 'logit'))
Now let’s look at its summary:
summary(model2)
##
## Call:
## glm(formula = is_top25 ~ goals + points + assists, family = binomial(link = "logit"),
## data = nhl_draft)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.64163 0.04236 -38.75 <2e-16 ***
## goals -9.92599 196.96766 -0.05 0.96
## points 9.92808 196.96766 0.05 0.96
## assists -9.92443 196.96766 -0.05 0.96
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 5781.8 on 5245 degrees of freedom
## Residual deviance: 5275.5 on 5242 degrees of freedom
## (7004 observations deleted due to missingness)
## AIC: 5283.5
##
## Number of Fisher Scoring iterations: 10
What you’ll notice is that the goals variable is no longer significant to the model! But why?
One of the things that has to be checked when building a model is the relationship of the independent variables with each other.
To do this, we use the variance inflation factor, also known as the “VIF”. The VIF measures how much the variance of a predictor’s coefficient estimate is inflated by its correlation with the other predictors. For example, if the VIF is 25, the standard error for that predictor’s coefficient is \(\sqrt{25} = 5\) times larger than it would be if that predictor had zero correlation with the other predictor variables.
A rule of thumb is to keep each VIF below 10.
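For intuition, here is a minimal sketch of the classical VIF calculation for a single predictor: regress it on the other predictors and apply \(1 / (1 - R^2)\). (This simplified version uses an ordinary linear regression among the predictors, so its numbers won’t exactly match a GLM-based calculation, but it illustrates the idea.)
# Classical VIF for goals: regress goals on the other predictors,
# then compute 1 / (1 - R^2).
# Here the result will be enormous, since points is essentially goals + assists.
r2_goals <- summary(lm(goals ~ points + assists, data = nhl_draft))$r.squared
1 / (1 - r2_goals)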
The “car” package has a function that lets us calculate the VIF. So let’s check the VIF for the model:
library(car)
vif(model2)
## goals points assists
## 2.944610e+11 1.742599e+12 6.637330e+11
Looking at the VIF of the 3 variables, note that the output is in scientific notation: all three values are on the order of \(10^{11}\) to \(10^{12}\), far beyond the rule-of-thumb cutoff of 10. This makes sense, since a player’s points are essentially goals plus assists, so the three predictors are nearly perfectly collinear and all of them are a concern.
Another way to compare the models is by using their deviance, which is a measure of the goodness of fit of a model; we want it to be as close to zero as we can.
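Concretely, the residual deviance compares our model against a saturated model that fits the data perfectly: \(D = 2(\ell_{\text{sat}} - \ell_{\text{model}})\), where \(\ell\) is the log-likelihood, so a smaller deviance means less unexplained variation.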
paste("Model 1 Deviance", round(model1$deviance, 1))
## [1] "Model 1 Deviance 5326.3"
paste("Model 2 Deviance", round(model2$deviance, 1))
## [1] "Model 2 Deviance 5275.5"
We can see that the Model 2 deviance is lower than the Model 1 deviance, so on this measure we would favor model 2 over model 1.
To compare the models further, we can look at what is known as the Akaike Information Criterion, or “AIC”. This criterion estimates how much information the model loses relative to other candidate models. The Bayesian Information Criterion, or “BIC”, is similar to the AIC, but it is more sensitive to the size of the data set and the number of parameters in the model. Either way, the model we want to go with is the one with the lowest AIC or BIC.
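For reference, \(\mathrm{AIC} = 2k - 2\ln\hat{L}\) and \(\mathrm{BIC} = k\ln(n) - 2\ln\hat{L}\), where \(k\) is the number of estimated parameters, \(n\) is the number of observations, and \(\hat{L}\) is the maximized likelihood; both reward fit but penalize extra parameters, with the BIC penalty growing as \(n\) grows.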
R comes with both AIC and BIC functions built in. Let’s compare our models.
AIC(model1, model2)
## df AIC
## model1 2 5330.262
## model2 4 5283.525
BIC(model1, model2)
## df BIC
## model1 2 5343.393
## model2 4 5309.786
We can see that model 2, with the 2 new parameters, actually gives us a lower AIC and BIC than the model without points and assists. If we didn’t know about the multicollinearity we would choose model 2; knowing about it, we have a decision to make about the model.
Either we keep the model with the multicollinearity, accepting that the fit improved when those 2 variables were added, or we remove the problem variable causing the multicollinearity and keep comparing models until we find the best fit.
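As a sketch of that second option (model3 is a name introduced here for illustration, not part of the analysis above), we could drop points, since it is essentially goals plus assists, then re-check the VIFs and the information criteria:
# Dropping the redundant points variable, then re-checking collinearity and fit
model3 <- glm(is_top25 ~ goals + assists, data = nhl_draft,
              family = binomial(link = 'logit'))
vif(model3)
AIC(model1, model2, model3)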
Hopefully this analysis gives you a couple of tools to compare your models. These comparison measures can be applied to any model. Accompanying these tests with plots of your residuals to check the model assumptions is also extremely important when using generalized linear models.
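As one example of such a check (a sketch, not part of the original write-up), we could plot the deviance residuals against the fitted probabilities for model 2:
# Deviance residuals vs. fitted probabilities for model2
plot(fitted(model2), residuals(model2, type = "deviance"),
     xlab = "Fitted probability", ylab = "Deviance residual")
abline(h = 0, lty = 2)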