Deriving Logistic Regression

Ben Horvath

November 2019

Previously, I derived the least squares solution for a single independent variable, and developed rudimentary functionality to estimate it in R. In this blog, I will briefly introduce the mathematics behind logistic regression for binary response variables, mostly focusing on explicating R’s functionality for this problem.

The Problem

Logistic regression modeling is appropriate when given a response variable \(Y\) that takes either the value ‘success’ or ‘failure’ (coded as 1 and 0, respectively), and a set of \(q\) variables \((x_1, x_2, \ldots, x_q)\) to predict it.

The Solution, Briefly

The first step should remind the reader of linear regression: Constructing a linear predictor associating \(Y\) and \(X\):

\[\eta = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_q x_q\]

We need to transform the linear predictor to accommodate our \(Y\), since we are modeling a probability bounded between 0 and 1. We’ll use the logit link function, under which the probability of success is:

\[P(Y = 1) = \frac{e^\eta}{1 + e^\eta}\]

Now we just need to find the parameters \(\hat{\beta}\) that maximize the log-likelihood, which for the logit link is:

\[l(\beta) = \sum_{i = 1}^n \left( y_i \eta_i - \log(1 + e^{\eta_i}) \right)\]

There are numerous algorithms to solve this, e.g., gradient descent or Newton-Raphson; R’s glm() uses iteratively reweighted least squares (Fisher scoring).
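As an illustration of what those algorithms are doing, here is a minimal sketch that maximizes the log-likelihood above with a general-purpose optimizer (the names neg_loglik and fit_logistic are my own; glm(), used below, is the standard tool):

# Negative log-likelihood for logistic regression: -l(beta)
neg_loglik <- function(beta, X, y) {
  eta <- X %*% beta
  -sum(y * eta - log(1 + exp(eta)))
}

# Estimate beta-hat by numerical optimization (glm() uses Fisher scoring instead)
fit_logistic <- function(X, y) {
  X <- cbind(1, X)                    # prepend an intercept column
  optim(rep(0, ncol(X)), neg_loglik, X = X, y = y, method = "BFGS")$par
}

Applied to the data simulated in the next section, fit_logistic(as.matrix(df[, c("x1", "x2")]), df$y) should agree closely with the coefficients glm() reports.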

Fitting a Logistic Regression Model

To begin, simulate some data, allowing us to precisely evaluate how effective our model is:
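The simulation chunk is not shown in the rendered output; a sketch along these lines reproduces the setup described below (n = 200, \(x_1\) and \(x_2\) roughly standard normal, and the true coefficients 1.48, 4.48, and 1.40 reported later; the seed is my own choice):

set.seed(1804)                        # arbitrary seed; the original is unknown
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
eta <- 1.48 + 4.48 * x1 + 1.40 * x2   # linear predictor with the true coefficients
p   <- exp(eta) / (1 + exp(eta))      # inverse-logit transform to a probability
y   <- rbinom(n, size = 1, prob = p)  # binary response drawn from that probability
df  <- data.frame(x1, x2, y)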

This process is very similar to simulating data for linear regression, with two interesting differences:

  1. There is no ‘noise’ term added
  2. The linear predictor \(\eta\) is a linear function of the variables and parameters, but it gets transformed

Examine our data. First, the response variable:
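The original call is not shown; something like the following produces the proportions below:

with(df, prop.table(table(y)))   # share of failures (0) and successes (1)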

## y
##    0    1 
## 0.41 0.59

The sample is somewhat unbalanced, with almost 60 percent of the observations having \(y = 1\). The rest of the variables:

[Figure: scatterplots of \(y\) against \(x_1\) and against \(x_2\), each with a loess smoother.]

Both independent variables \(x_1\) and \(x_2\) have a clear relationship with \(y\), the former more strongly than the latter. And both are distributed approximately normally (as we already knew).

Now, fit and examine the model:
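The model is fit with glm() and the binomial family (the call below is taken from the summary output):

m <- glm(y ~ x1 + x2, family = binomial, data = df)
summary(m)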

## 
## Call:
## glm(formula = y ~ x1 + x2, family = binomial, data = df)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4626  -0.3017   0.0450   0.3292   2.1857  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   1.0831     0.2851   3.799 0.000145 ***
## x1            4.1693     0.6546   6.369 1.90e-10 ***
## x2            1.7395     0.3555   4.893 9.93e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 270.74  on 199  degrees of freedom
## Residual deviance: 109.52  on 197  degrees of freedom
## AIC: 115.52
## 
## Number of Fisher Scoring iterations: 7

Before diving into interpreting the model, let’s make sure it is a reasonable fit to the data.

Goodness of Fit

First, we can use the residual and null deviances to test the hypotheses:

  • \(H_0\): Neither \(x_1\) nor \(x_2\) is related to \(y\),
  • \(H_A\): At least one of \(x_1\) or \(x_2\) is related to \(y\)

using the difference between the two deviances, and one degree of freedom for each estimated parameter:
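A sketch of that comparison: the drop in deviance is referred to a chi-squared distribution with two degrees of freedom, one per slope.

# Drop in deviance from the null model to the fitted model, on 2 df
1 - pchisq(m$null.deviance - m$deviance, df = m$df.null - m$df.residual)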

## [1] 0

The resulting \(p\)-value is so small it’s very close to 0, so we can reject the null hypothesis that neither variable is related to \(y\).

Conduct a similar test on each of the independent variables:
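The output below is a single-term deletion test, as produced by drop1() with a likelihood-ratio (chi-squared) test:

drop1(m, test = "Chi")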

## Single term deletions
## 
## Model:
## y ~ x1 + x2
##        Df Deviance    AIC     LRT  Pr(>Chi)    
## <none>      109.52 115.52                      
## x1      1   254.48 258.48 144.958 < 2.2e-16 ***
## x2      1   147.82 151.82  38.303 6.057e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Both rows for the independent variables are highly significant, suggesting that each variable, by itself, has a relationship with the response variable (which we know is correct).

Hosmer-Lemeshow Test

A further, more graphical measure of fit is the Hosmer-Lemeshow test. It bins the sample into \(g\) groups, and compares the expected and observed proportion of successes in each bin. For a well-fit model, the expected and observed proportions of success will be about the same, for each bin.

Examine the results of the statistical test itself:
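The test below appears to come from hoslem.test() in the ResourceSelection package, with 10 groups (hence the 8 degrees of freedom):

library(ResourceSelection)
hoslem.test(df$y, fitted(m), g = 10)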

## 
##  Hosmer and Lemeshow goodness of fit (GOF) test
## 
## data:  df$y, fitted(m)
## X-squared = 1.144, df = 8, p-value = 0.9972

This very high \(p\)-value suggests we cannot reject the null hypothesis that the model fits the data well. (If it had been quite small, we would be forced to conclude the model fits poorly.)

Graphically,
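The chunk that builds the binned comparison is not shown; a sketch along these lines (the name hl_df and the decile binning of the fitted probabilities are inferred from the table below) would produce both the table and the plot:

library(ggplot2)

# Bin fitted probabilities into deciles and compare observed vs. expected successes
p_hat <- fitted(m)
bins  <- cut(p_hat, breaks = quantile(p_hat, probs = seq(0, 1, 0.1)),
             include.lowest = TRUE)
hl_df <- data.frame(obs_p = tapply(df$y, bins, sum),
                    exp_p = tapply(p_hat, bins, sum))

ggplot(hl_df, aes(x = exp_p, y = obs_p)) +
  geom_abline(slope = 1, intercept = 0) +   # perfect-fit reference line
  geom_smooth(method = "loess") +
  geom_point()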

bin                  obs_p   exp_p
[1.87e-06,0.00875]       0    0.06
(0.00875,0.0972]         1    0.82
(0.0972,0.221]           3    3.02
(0.221,0.507]            7    7.50
(0.507,0.725]           13   12.25
(0.725,0.878]           16   16.09
(0.878,0.963]           18   18.71
(0.963,0.991]           20   19.63
(0.991,0.999]           20   19.94
(0.999,1]               20   19.99
[Figure: observed versus expected successes per bin, with a black perfect-fit reference line and a blue loess smoother.]

The graph shows the data in hl_df, with a black line demonstrating what a perfect fit would look like, and a blue loess line based on the results of the test. From both the table and the graph, it is clear the model fits almost exactly.

Two Measures of \(R^2\) for Logistic Regression

Faraway’s Extending the Linear Model with R cautions against using pseudo-\(R^2\) measures for logistic regression, but writes that if one is necessary, Nagelkerke’s is the best to use:

## Nagelkerke 
##  0.7461195

However, Sheather (A Modern Approach to Regression with R) recommends one based on deviance:

## [1] 0.5954869
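Both values can be reproduced directly from the null and residual deviances in the model summary; a sketch:

null_dev <- m$null.deviance               # 270.74
res_dev  <- m$deviance                    # 109.52
n        <- nrow(df)                      # 200

# Sheather's deviance-based R^2: proportion of the null deviance explained
1 - res_dev / null_dev                    # ~0.595

# Nagelkerke's R^2: Cox-Snell R^2 rescaled so its maximum is 1
r2_cs <- 1 - exp((res_dev - null_dev) / n)
r2_cs / (1 - exp(-null_dev / n))          # ~0.746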

In either case, both measures indicate the fitted model captures a majority of the variation in the data.

Error Rate

This method compares a ‘dummy’ model (the null model that always predicts the majority class) with the fitted model. If the fitted model has an error rate about the same as the dummy model, we know we are missing something important about the data.

Using a cut-off point of 0.50:
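A sketch of the comparison: classify as a success whenever the fitted probability exceeds the cut-off, then compare error rates.

pred <- ifelse(fitted(m) > 0.5, 1, 0)   # predicted class at a 0.5 cut-off
mean(pred != df$y)                      # error rate of the fitted model
mean(df$y == 0)                         # error rate of always predicting the majority class (y = 1)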

## [1] 0.115
## [1] 0.41

Thus our model’s error rate is only about a quarter of the null model’s, and we can be reasonably assured it is performing well.

Interpreting Model Coefficients

Now that we are sure our model fits the data reasonably well (in this case, very well!), we can proceed to interpreting it. The coefficients:
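The tables below are the coefficient summary and the profile-likelihood confidence intervals from confint(); the ‘Waiting for profiling to be done’ message is printed by that function. A sketch (rounding to four digits is my assumption):

round(summary(m)$coefficients, 4)
round(confint(m), 4)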

              Estimate  Std. Error  z value  Pr(>|z|)
(Intercept)     1.0831      0.2851   3.7987     1e-04
x1              4.1693      0.6546   6.3690     0e+00
x2              1.7395      0.3555   4.8931     0e+00

## Waiting for profiling to be done...

               2.5 %   97.5 %
(Intercept)   0.5579   1.6854
x1            3.0322   5.6176
x2            1.1031   2.5103

Recall that the true coefficients (log odds) are:

\[\beta_0 = 1.48\] \[\beta_1 = 4.48\] \[\beta_2 = 1.40\]

So the true value is within the 95% confidence interval for each coefficient. The estimated coefficients are in the ballpark, although the intercept and \(x_1\) are somewhat underestimated while \(x_2\) is somewhat overestimated.

Interpreting the Intercept \(\hat{\beta_0}\)

\(\hat{\beta_0} = 1.0831\). This number represents the log-odds of \(y = 1\) given that \(x_1 = x_2 = 0\). In direct probability terms this is:
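Applying the inverse logit to the estimated intercept (a short sketch of the conversion):

b0 <- coef(m)["(Intercept)"]
exp(b0) / (1 + exp(b0))      # inverse logit of the intercept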

## (Intercept) 
##   0.7470709

that is, given the other variables are zero, the probability of \(y=1\) is near 75%.

Interpreting \(\hat{\beta_1}\)

\(exp(\hat{\beta_1}) = 64.67259\). An odds ratio this large would be unusual in real applications, but it indicates that a unit increase in \(x_1\) is associated with roughly a 6,367 percent increase in the odds of \(y=1\)!

We can attempt to put this into terms of change in probability \(P(y = 1)\) through two methods (Gelman and Hill, Data Analysis Using Regression and Multilevel/Hierarchical Models). However, it is very important to understand that, unlike linear regression, the change in probability associated with a one-unit difference is not constant across values of \(x_1\)! Hence the need for the following methods:

First, we can use the equation directly to calculate it, holding \(x_2\) at its mean:
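A sketch of this kind of calculation (the exact comparison the original used may differ slightly; here the change is measured from the mean of \(x_1\) to one unit above it, with \(x_2\) fixed at its mean):

invlogit <- function(z) 1 / (1 + exp(-z))
b <- unname(coef(m))
at_mean  <- invlogit(b[1] + b[2] * mean(df$x1)       + b[3] * mean(df$x2))
plus_one <- invlogit(b[1] + b[2] * (mean(df$x1) + 1) + b[3] * mean(df$x2))
plus_one - at_mean   # change in P(y = 1)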

## [1] 0.2456584

Thus: A difference of one unit in \(x_1\) corresponds to a positive difference of about 24.6% in the probability \(y=1\), when the \(x_1\) values are near the mean and \(x_2\) is held at its mean.

Second, we can determine the maximum difference in \(P(y = 1)\) corresponding to a unit difference in \(x_1\) by calculating \(\beta / 4\). Gelman and Hill write, with reference to the derivative of the fitted logistic function:

This upper bound [\(\beta / 4\)] is a reasonable approximation near the midpoint of the logistic curve, where probabilities are close to 0.5.

In the present case, this is:
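That is simply the estimated slope divided by four:

coef(m)["x1"] / 4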

##       x1 
## 1.042334

suggesting a difference of one unit in \(x_1\) corresponds to a maximum difference in \(P(y=1)\) of \(104\) percent, just over double.

Interpreting \(\hat{\beta_2}\)

\(exp(\hat{\beta_2}) = 5.6945\). This means that a unit increase in \(x_2\) corresponds to a 469% increase in the odds of \(y=1\).

Using both methods to convert this to a percentage:

## [1] 0.195284

indicating that, with \(x_1\) held constant at its mean and the values of \(x_2\) near its mean, a difference of one unit in \(x_2\) corresponds to a positive difference of about 20% in the probability \(y=1\).

##        x2 
## 0.4348764

estimates that the maximum change in \(P(y=1)\) for a one-unit increase in \(x_2\) is about 44 percent.

Plotting Estimated Relationships

We want to plot how the \(P(y = 1)\) changes at different values of \(x_1\), at given values of \(x_2\).

First, let’s select the values of \(x_1\) we want to examine. For the purposes of a smooth graph, I select every value between \(min(x_1)\) and \(max(x_1)\) at a granularity of \(0.1\). Second, we have to determine which values of \(x_2\) we want to hold constant. I have selected the quartiles of the distribution:
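A sketch of that setup (the names x1_grid and x2_qts are mine):

x1_grid <- seq(min(df$x1), max(df$x1), by = 0.1)                # fine grid over x1
x2_qts  <- unname(quantile(df$x2, probs = c(0.25, 0.5, 0.75)))  # quartiles of x2
head(x1_grid)
x2_qts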

## [1] -3.36883 -3.26883 -3.16883 -3.06883 -2.96883 -2.86883
## [1] -0.719346216 -0.006649968  0.672890220

Plot:
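A sketch of how such a plot could be built with predict() and ggplot2 (the reference lines at \(P(y=1) = 0.5\) and \(x_1 = 1\), and the colour mapping, are guesses based on the description that follows):

library(ggplot2)

pred_df   <- expand.grid(x1 = x1_grid, x2 = x2_qts)
pred_df$p <- predict(m, newdata = pred_df, type = "response")

ggplot(pred_df, aes(x = x1, y = p, colour = factor(round(x2, 2)))) +
  geom_line() +
  geom_hline(yintercept = 0.5, linetype = "dashed", colour = "gray") +
  geom_vline(xintercept = 1,   linetype = "dashed", colour = "gray") +
  labs(y = "P(y = 1)", colour = "x2 quartile")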

Recall that we had previously estimated that the maximum change in the probability of \(y=1\), where the logistic curve is at its steepest, is \(104\) percent. In this chart, the middle of the curve falls at roughly \(x_1 = 0\), where \(P(y=1) = 0.5\).

The two gray, dashed lines make it easier to see if our estimate makes sense. Examining just the red line for now, we see the dashed line goes through it at \(y=0.5\). The other dashed line at \(x=1\) intersects the red line at \(y=1\). Since 1 is double 0.5, our estimate of an increase of 104% at the steepest part of the curve is correct.

Also note that as \(x_2\) increases, \(P(y=1)\) increases for all values of \(x_1\).

The same procedure for \(x_2\):

Above, it was estimated that a unit increase in \(x_2\) was associated with an increase of 44% in the probability that \(y=1\). Examine the orange line, corresponding to the mean of \(x_1\). By eye, the steepest part of the curve is around \(x_2 = -0.5\), so draw reference lines there and at \(-0.5 + 1 = 0.5\). The left reference line intersects the orange curve at about 0.6, and the right one intersects it at about 0.9.

As 0.9 is about 50% larger than 0.6, our initial estimate is roughly consistent with this graphic.

Marginal Model Plots

The car package provides a convenient way to produce marginal model plots for glm objects:
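A sketch of the likely call, assuming the plots came from car’s marginal-model-plot function:

library(car)
mmps(m)   # marginal model plots for each predictor and the linear predictor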

The reason this package plots straight lines is that it is plotting the log-odds.

Residuals and Diagnostics

A number of different types of residuals are applicable to logistic regression. According to the literature, deviance and standardized deviance residuals are the most helpful, but I will briefly cover the other kinds as well.

Raw residuals are simply \(y_i - \hat{y_i}\) where the latter term is the predicted \(P(y_i = 1)\) from the model:
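In R these are the response-type residuals; a sketch:

head(residuals(m, type = "response"))   # equivalently, head(df$y - fitted(m))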

##            1            2            3            4            5 
## -0.003644618 -0.524107996  0.481356135  0.169500278 -0.282864138 
##            6 
##  0.039743151

Deviance residuals can be calculated easily in R:
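They are the default residual type for glm objects:

head(residuals(m, type = "deviance"))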

##           1           2           3           4           5           6 
## -0.08545496 -1.21865855  1.14589513  0.60947138 -0.81546302  0.28479634

Standardized deviance residuals are the ones most recommended by Sheather’s A Modern Approach to Regression with R; they improve on deviance residuals by accounting for each observation’s leverage:
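These come from rstandard():

head(rstandard(m))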

##           1           2           3           4           5           6 
## -0.08558862 -1.24975950  1.15427588  0.61429687 -0.82400658  0.28642111

Finally, there are Pearson residuals and their standardized variety, which are not recommended.
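A sketch of both:

head(residuals(m, type = "pearson"))    # Pearson residuals
head(rstandard(m, type = "pearson"))    # standardized Pearson residuals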

##           1           2           3           4           5           6 
## -0.06048098 -1.04943656  0.96338223  0.45176800 -0.62804133  0.20344051
##           1           2           3           4           5           6 
## -0.06057558 -1.07621885  0.97042813  0.45534487 -0.63462129  0.20460114

Examine the distribution of each type of residual:

In both plots, we can see the deviance residuals are more compact: most of them fall within \(\pm 2\) on the x-axis, and the peak is much smaller than that of the Pearson residuals.

Outliers

As in linear regression, we can use a half-normal plot to look for unusual observations that could distort the model:
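A sketch of such a plot, using halfnorm() from the faraway package on the standardized deviance residuals (the residual measure the original used is not shown):

library(faraway)
halfnorm(rstandard(m), nlab = 2)   # label the two most extreme observations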

The problematic values:

##           x1        x2 y
## 87  3.016872 1.4058682 1
## 110 2.648849 0.7024264 1

We see this first value is near the max of both variables’ distributions:
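One way such percentiles could be checked is with the empirical CDF (the observation number is taken from the table above; the same calls with row 110 give the second pair of values):

ecdf(df$x1)(df$x1[87])   # percentile of observation 87 on x1
ecdf(df$x2)(df$x2[87])   # percentile of observation 87 on x2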

## [1] 1
## [1] 0.915

and this second value is also very high on \(x_1\), in the 99th percentile:

## [1] 0.99
## [1] 0.765

Binned Residuals

Since the standard residual plots from linear regression aren’t terribly helpful here, we have to come up with an alternative: in this case, binned residuals (via the performance package). For the entire model:
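A sketch of the call (binned_residuals() from the performance package; plotting the returned object draws the graph):

library(performance)
binned_residuals(m)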

## Warning: Probably bad model fit. Only about 57% of the residuals are inside the error bounds.

Like in linear regression, we want to make sure there is no systematic pattern in this graph. The software helpfully informs us how many residual groups fall inside the 95 percent confidence bounds; in this case, only about 57% are inside. Examining the graph more closely, however, the red dots occur at the ends of the distribution and are not far from zero, so I am not especially worried about this.

We can also perform the same procedure with each individual variable:
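The per-variable plots below appear to come from the same function with the term argument (shown here for \(x_1\); the same call with term = "x2" produces the second plot):

binned_residuals(m, term = "x1")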

## Warning: About 93% of the residuals are inside the error bounds (~95% or higher would be good).

For \(x_1\), 93 percent are within the error bounds, and like before, the outlier occurs at the end of the distribution, and is close enough to 0. I do not concern myself with this.

## Ok: About 100% of the residuals are inside the error bounds.

The plot for \(x_2\) has all residuals inside the error bounds. However, there does seem to be some kind of pattern, though it’s not clear whether it’s real. There is a run of four points increasing from around -1, followed by a run of four points decreasing through 0. Were this a real dataset, it would be wise to investigate this phenomenon, verifying it truly is random.

References

  • Faraway, Julian J., Extending the Linear Model with R (CRC Press, 2016).

  • Gelman, Andrew and Jennifer Hill, Data Analysis Using Regression and Multilevel/Hierarchical Models (Cambridge University Press, 2007).

  • Sheather, Simon J., A Modern Approach to Regression with R (Springer, 2009).