Loading the tidyverse and the data, and setting the seed
# Loading tidyverse
library(tidyverse)
#Loading in Data
nhl_draft <- read_csv("nhldraft.csv")
# Setting seed
set.seed(1)
For this week’s Data Dive, we will go through a basic overview of generalized linear models, using logistic regression to predict a binary response variable.
To start, let’s create a binary column from one of the columns already provided in the data set. For this analysis I will create a column that flags players taken within the first 25 overall picks. I’ll call this “is_top25”.
# Adding is_top25 binary column
nhl_draft <- nhl_draft |>
  mutate(is_top25 = case_when(
    overall_pick > 25 ~ 0,
    overall_pick <= 25 ~ 1
  ))
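As a quick sanity check on the cutoff, here is a minimal sketch on a made-up tibble (the `toy` data is hypothetical, not the draft data) showing how the same `case_when()` rule treats picks right at the boundary and picks that are missing.

```r
library(dplyr)

# Hypothetical picks, chosen to hit the boundary (25 vs 26) and a missing value
toy <- tibble(overall_pick = c(1, 25, 26, 100, NA))

toy <- toy |>
  mutate(is_top25 = case_when(
    overall_pick > 25 ~ 0,
    overall_pick <= 25 ~ 1
  ))

# Pick 25 is flagged 1, pick 26 is flagged 0, and the missing pick stays NA
toy$is_top25
```

A row with a missing overall_pick matches neither condition, so it is left as NA rather than being counted as a zero.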
We’re now going to see which variables help predict whether a player was a top 25 pick.
Since the response variable “is_top25” only takes the values one and zero, it follows a Bernoulli distribution. The Bernoulli is a special case of the binomial distribution, so we set the model family to binomial and use the logit as the link function.
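Written out, the setup described above looks roughly like this (a sketch in standard GLM notation; the symbols \(Y_i\), \(p_i\), \(x_i\), \(\beta_0\) and \(\beta_1\) are introduced here for illustration):
\[ \begin{aligned} Y_i &\sim \text{Bernoulli}(p_i) \\ \log\left(\frac{p_i}{1 - p_i}\right) &= \beta_0 + \beta_1 x_i \end{aligned} \]
Here \(Y_i\) is the value of is_top25 for player \(i\), \(p_i\) is the probability that it equals one, and \(x_i\) is the explanatory variable.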
Now using logistic regression we can look at the probability that a player is a top 25 player.
Let’s create our model.
For this I will use goals to see how it affects the probability.
model <- glm(is_top25 ~ goals, data = nhl_draft,
             family = binomial(link = 'logit'))
model$coefficients
## (Intercept) goals
## -1.57661176 0.00711671
Looking at the coefficients we can see that
\[ \log\left(\frac{p}{1 - p}\right) = -1.577 + 0.007\times\texttt{goals} \] So, for every additional goal, the odds that the player is in the top 25 are multiplied by \(e^{0.0071} \approx 1.0071\). Because the coefficient is positive, the odds go up by about 0.7% per goal (\(1.0071 - 1 = 0.0071\)).
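A quick way to check the odds multiplier is to exponentiate the slope directly. The sketch below hard-codes the two coefficients from the printed output (the names `b0` and `b1` are just illustrative) so that it runs without the data set.

```r
# Coefficients copied from the printed model output
b0 <- -1.57661176  # intercept: log-odds at zero goals
b1 <- 0.00711671   # slope: change in log-odds per additional goal

# Multiplier on the odds for one extra goal and for ten extra goals
odds_ratio_1  <- exp(b1)
odds_ratio_10 <- exp(10 * b1)

round(c(one_goal = odds_ratio_1, ten_goals = odds_ratio_10), 4)
```

Both multipliers are above 1, which matches the positive sign of the slope. With the fitted object in hand, `exp(coef(model))` should give the same one-goal multiplier.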
The (Intercept) represents the log-odds when all the feature values are equal to zero. Together with the slope, it can be used to find the 50%-probability “decision threshold” for a variable, which is the value at which the log-odds equal zero.
\[ \begin{aligned} 0 &= \log(\text{odds}) \\ \to \quad 0 &= \beta_0 + \beta_1 x_1 \\ &= -1.577 + 0.007 \cdot \texttt{goals} \\ 1.577 &= 0.007 \cdot \texttt{goals} \\ \to \quad \texttt{goals} &= \frac{1.577}{0.007} \approx 225 \end{aligned} \] So when goals are roughly 225 (about 222 if the unrounded coefficients are used), the odds are 50/50, meaning the model gives a probability of 0.5 that the player is in the top 25 of a draft.
Now let’s look at how the sigmoid function can give us insight into the likelihood that the player is in the top 25.
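The same threshold can be computed from the unrounded coefficients. This is a small sketch with the printed values hard-coded (`b0`, `b1` and `threshold` are illustrative names), using base R's `plogis()` as the inverse of the logit.

```r
# Coefficients copied from the printed model output
b0 <- -1.57661176
b1 <- 0.00711671

# Goals at which the log-odds are zero, i.e. the probability is 0.5
threshold <- -b0 / b1
threshold

# Plugging the threshold back in should return a probability of 0.5
plogis(b0 + b1 * threshold)
```

The result lands a little below the hand calculation because the slope is not rounded to 0.007 here.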
sigmoid <- \(x) 1 / (1 + exp(-(-1.577 + 0.007 * x)))
nhl_draft |>
  ggplot(mapping = aes(x = goals, y = is_top25)) +
  geom_jitter(width = 0, height = 0.1, shape = 'O', size = 3) +
  geom_function(fun = sigmoid, color = 'blue', linewidth = 1) +
  labs(title = "Modeling a Binary Response with Sigmoid") +
  scale_y_continuous(breaks = c(0, 0.5, 1)) +
  theme_minimal()
## Warning: Removed 7004 rows containing missing values (`geom_point()`).
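To read specific values off the curve, the sigmoid can be evaluated at a few goal totals. This is a minimal sketch with the printed coefficients hard-coded; the goal totals in `goals` are arbitrary illustrative values, not taken from the data.

```r
# Coefficients copied from the printed model output
b0 <- -1.57661176
b1 <- 0.00711671

# Illustrative goal totals (hypothetical, chosen to span the curve)
goals <- c(0, 100, 300)

# plogis() is the logistic (sigmoid) function from base R
probs <- plogis(b0 + b1 * goals)

data.frame(goals = goals, prob_top25 = round(probs, 3))
```

With the fitted object available, `predict(model, newdata = data.frame(goals = goals), type = "response")` should return the same probabilities.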
Since the model already works on the log-odds of the response variable, I would not suggest applying any further transformations to the goals variable. The link function has already made the coefficients harder to interpret, and transforming the explanatory variable too would only add to that. Also, a binary variable cannot follow a normal distribution, and logistic regression is especially well suited to binary response variables.
I hope you enjoyed this very very brief data dive.