With linear regression we model a quantitative response, \(Y\), as a function of one or more predictors, \(X_i\), using the regression equation \[ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + ... + \beta_p X_p + \epsilon \]
where \(\beta_0\) is the intercept, the \(\beta_i\) are coefficients for the predictors, and \(\epsilon\) represents the error term. The predictors can be quantitative, or they can be categorical factors coded as “dummy variables” that take values 0 or 1 for each level of the factor. More sophisticated models may include interactions between predictors, but we will leave that discussion for another time. The coefficients for the model can be calculated from data both numerically and analytically.
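As a sketch of the analytical route, the least-squares estimates solve the normal equations \((X^\top X)b = X^\top y\); the snippet below (Python with NumPy, on made-up numbers) illustrates this.

```python
import numpy as np

# Made-up data: six observations, two predictors (purely illustrative).
x1 = np.array([3.1, 3.5, 2.8, 3.9, 3.3, 2.5])
x2 = np.array([40, 55, 30, 80, 60, 25])
y  = np.array([1150, 1230, 1010, 1400, 1280, 980])

# Design matrix with a leading column of ones for the intercept.
X = np.column_stack([np.ones_like(x1), x1, x2])

# Analytical solution: solve the normal equations (X'X) b = X'y.
b = np.linalg.solve(X.T @ X, X.T @ y)
print(b)  # estimated coefficients [b0, b1, b2]
```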
The estimated regression model is given by \[\hat{y} = b_0 + b_1 \cdot X_1 + b_2 \cdot X_2 + ... + b_p \cdot X_p\] Consider the example of trying to predict SAT Score. We have the population model:
\[\text{SAT} = \beta_0 + \beta_1 \cdot \text{GPA} + \beta_2 \cdot \text{family income} + \beta_3 \cdot \text{tutor} + \epsilon\] and the estimated model:
\[\widehat{\text{SAT}} = b_0 + b_1 \cdot \text{GPA} + b_2 \cdot \text{family income} + b_3 \cdot \text{tutor}\]
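As a sketch of the numerical route, this estimated model could be fit with ordinary least squares in Python via statsmodels; the data below are fabricated for illustration only, with tutor coded as a 0/1 dummy variable.

```python
import pandas as pd
import statsmodels.api as sm

# Fabricated data for illustration only; tutor is a 0/1 dummy variable.
df = pd.DataFrame({
    "gpa":           [3.1, 3.5, 2.8, 3.9, 3.3, 2.5, 3.7, 3.0],
    "family_income": [40, 55, 30, 80, 60, 25, 70, 45],  # e.g., in $1000s
    "tutor":         [0, 1, 0, 1, 1, 0, 1, 0],
    "sat":           [1150, 1230, 1010, 1400, 1280, 980, 1350, 1100],
})

# Fit SAT ~ GPA + family income + tutor by ordinary least squares.
X = sm.add_constant(df[["gpa", "family_income", "tutor"]])
fit = sm.OLS(df["sat"], X).fit()
print(fit.params)  # estimates b0, b1, b2, b3
```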
If the response to be predicted is not quantitative but rather a categorical variable with only two outcomes, linear regression is no longer an appropriate method. Consider trying to predict whether an individual will graduate from a four-year institution. “Success” and “Failure” could be coded as 1 and 0, respectively, and a linear regression could be estimated for a given set of predictors.
\[\text{graduation} = \beta_0 + \beta_1 \cdot \text{GPA} + \beta_2 \cdot \text{family income} + \beta_3 \cdot \text{SAT} + \epsilon\]
where \(\text{graduation} = 1\) if a student graduates and \(\text{graduation} = 0\) otherwise.
The estimated model is given by:
\[\widehat{\text{graduation}} = b_0 + b_1 \cdot \text{GPA} + b_2 \cdot \text{family income} + b_3 \cdot \text{SAT}\]
It is possible to fit this model using linear regression and classify values greater than 0.5 as predicted graduation success and values less than 0.5 as predicted graduation failure. This model would be flawed in a few ways, however. It would violate the assumption of linear regression that the error terms are normally distributed, and it could produce predictions greater than 1 or less than 0.
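To make the second flaw concrete, here is a small sketch (fabricated data) that fits a least-squares line to a 0/1 outcome and produces predictions outside the \([0, 1]\) range:

```python
import numpy as np

# Fabricated 0/1 outcomes against a single quantitative predictor.
x = np.array([1.8, 2.2, 2.6, 3.0, 3.4, 3.8, 4.0])
y = np.array([0,   0,   0,   1,   1,   1,   1])

# Ordinary least-squares fit of the 0/1 response on x.
b1, b0 = np.polyfit(x, y, 1)  # slope, intercept

print(b0 + b1 * np.array([1.0, 4.5]))  # roughly [-0.56, 1.45]: below 0 and above 1
```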
In order to ensure that all predicted responses are between 0 and 1, we can apply a type of “sigmoid” (s-shaped) function called the logistic function: \[f(x)=\frac{1}{1+e^{-x}} = \frac{e^{x}}{1+e^{x}}\]
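A quick numerical check (a minimal Python sketch) that the logistic function maps any real input into the interval \((0, 1)\):

```python
import numpy as np

def logistic(x):
    """Logistic (sigmoid) function: 1 / (1 + e^(-x))."""
    return 1 / (1 + np.exp(-x))

print(logistic(np.array([-10.0, -1.0, 0.0, 1.0, 10.0])))
# roughly [0.00005, 0.27, 0.50, 0.73, 0.99995] -- always strictly between 0 and 1
```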
If we substitute the right-hand side of the regression equation (here with a single predictor) into this function, we get the logistic regression model:
\[p(x)=\frac{e^{\beta_0 + \beta_1 X_1}}{1+e^{\beta_0 + \beta_1 X_1}}\] This function always produces values between 0 and 1, which we now view as probabilities.
Some algebraic manipulation reveals that this equation can be expressed as \[\frac{p(x)}{1-p(x)}=e^{\beta_0 + \beta_1 X_1}\]
where the quantity \(\frac{p(x)}{1-p(x)}\) is known as the “odds”: the numerator is the probability of success and the denominator is the probability of failure. Taking the log of both sides produces \[\log \Big( \frac{p(x)}{1-p(x)} \Big) = \beta_0 + \beta_1 X_1\] which is known as the “log odds” or the “logit” function. If the probability of success is greater than 0.5, the log odds will be positive; if the probability is less than 0.5, the log odds will be negative; and if it is exactly 0.5, the log odds will be zero.
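For completeness, the algebra behind the odds expression above, writing \(z = \beta_0 + \beta_1 X_1\) for brevity: \[p(x) = \frac{e^{z}}{1+e^{z}} \;\Rightarrow\; 1 - p(x) = \frac{1}{1+e^{z}} \;\Rightarrow\; \frac{p(x)}{1-p(x)} = e^{z} = e^{\beta_0 + \beta_1 X_1}\]
As a quick numerical illustration, a success probability of \(p(x) = 0.75\) corresponds to odds of \(0.75/0.25 = 3\) and log odds of \(\log 3 \approx 1.10\); \(p(x) = 0.25\) gives odds of \(1/3\) and log odds of about \(-1.10\); and \(p(x) = 0.5\) gives odds of \(1\) and log odds of exactly \(0\).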
Logistic regression models are typically fit using maximum likelihood methods. Unfortunately, unlike linear regression, there is no analytical solution to this optimization problem, so numerical methods must be used to fit the model.
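As a sketch of what this numerical fitting involves, the snippet below minimizes the negative log-likelihood of a one-predictor logistic model with scipy; in practice one would typically rely on a packaged routine (for example statsmodels' Logit or scikit-learn's LogisticRegression), and the data here are fabricated.

```python
import numpy as np
from scipy.optimize import minimize

# Fabricated data: one quantitative predictor, 0/1 response.
x = np.array([1.8, 2.2, 2.6, 3.0, 3.4, 3.8, 4.0])
y = np.array([0,   0,   1,   0,   1,   1,   1])

def neg_log_likelihood(beta):
    """Negative log-likelihood of the logistic model with coefficients (b0, b1)."""
    b0, b1 = beta
    p = 1 / (1 + np.exp(-(b0 + b1 * x)))
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# No closed-form solution exists, so minimize numerically from a starting guess.
result = minimize(neg_log_likelihood, x0=np.array([0.0, 0.0]))
print(result.x)  # maximum likelihood estimates [b0, b1]
```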
Once fit, the model can be used to estimate the probability of success using \[\widehat{p}(x)=\frac{e^{b_0 + b_1 X_1}}{1+e^{b_0 + b_1 X_1}}\]
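Continuing the sketch, with estimates \(b_0\) and \(b_1\) in hand (placeholder values below), predicted probabilities follow directly from this equation:

```python
import numpy as np

b0, b1 = -4.0, 1.5  # placeholder estimates from a fitted model
x_new = np.array([1.5, 2.5, 3.5])

p_hat = np.exp(b0 + b1 * x_new) / (1 + np.exp(b0 + b1 * x_new))
print(p_hat)  # roughly [0.15, 0.44, 0.78], each between 0 and 1
```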
Of course, just as with linear regression, the model can be expanded to include multiple predictors, which may be quantitative variables or dummy variables.
\[p(x)=\frac{e^{\beta_0 + \beta_1 X_1 + \beta_2 X_2 + ... + \beta_p X_p}}{1+e^{\beta_0 + \beta_1 X_1 + \beta_2 X_2 + ... + \beta_p X_p}}\]
Typically, observations with estimated probability \(\hat{p}(x)>0.5\) are predicted to be successes, while those with \(\hat{p}(x) \leq 0.5\) are predicted to be failures. This cutoff is somewhat arbitrary, however, and values other than 0.5 may be used.
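A minimal sketch of applying such a cutoff, with the threshold left as a parameter since 0.5 is only a convention:

```python
import numpy as np

def classify(p_hat, cutoff=0.5):
    """Label estimated probabilities as success (1) above the cutoff, failure (0) otherwise."""
    return (np.asarray(p_hat) > cutoff).astype(int)

p_hat = np.array([0.15, 0.44, 0.78])
print(classify(p_hat))               # [0, 0, 1] with the default 0.5 cutoff
print(classify(p_hat, cutoff=0.4))   # [0, 1, 1] with a lower cutoff
```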
Interpreting the coefficients of a logistic regression is not as straightforward as doing so for linear regression. The simplest way to interpret the coefficients is to return to the log odds, where the linear combination of predictors is isolated on one side of the equation.
\[\log \Big( \frac{p(x)}{1-p(x)} \Big) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + ... + \beta_p X_p\]
It is now easier to see that a one-unit increase in \(X_i\), with the other predictors held fixed, predicts a \(\beta_i\) increase in the log odds.
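Equivalently, a one-unit increase in \(X_i\) multiplies the odds by \(e^{\beta_i}\). For instance, if \(\beta_1 = 0.8\) (a made-up value), each additional unit of \(X_1\) multiplies the odds of success by \(e^{0.8} \approx 2.23\).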
The model can be generalized further to a multinomial response, but we will leave this for another time.