Often in studies we encounter outcomes that are not continuous, but instead fall into one of two categories. In these cases we have binary outcomes, variables that can take only two possible values:
Here \(Y\) is the binary response variable, coded as 0 or 1. Rather than predicting these two values directly, we try to model the probabilities that \(Y\) takes one of them.
Let us examine one such example using the bankruptcies data available from 👉 https://tanjakec.github.io/mydata/Bankruptcies.csv.
The Bankruptcies.csv file contains the operating financial ratios of 66 firms which either went bankrupt or remained solvent after a two-year period. The data is taken from the book “Regression Analysis by Example” by Chatterjee and Hadi. The variables are:
> bankrup <- read.csv("https://tanjakec.github.io/mydata/Bankruptcies.csv")
> summary(bankrup)
## Y X1 X2 X3
## Min. :0.0 Min. :-308.90 Min. :-280.000 Min. :0.100
## 1st Qu.:0.0 1st Qu.: -39.05 1st Qu.: -17.675 1st Qu.:1.025
## Median :0.5 Median : 7.85 Median : 4.100 Median :1.550
## Mean :0.5 Mean : -13.63 Mean : -8.226 Mean :1.721
## 3rd Qu.:1.0 3rd Qu.: 35.75 3rd Qu.: 14.400 3rd Qu.:1.975
## Max. :1.0 Max. : 68.60 Max. : 34.100 Max. :6.700
Since the response takes only the values 0 and 1, it would be very attractive to be able to use the same modelling techniques as for linear models. We are going to do just that: fit a simple linear regression model to examine this relationship and see if it works.
\[Y = b_0 + b_1X_3\]
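The plot referred to below is not reproduced here; the following is a minimal sketch (our own code, with lin_mod as a made-up object name) of how the fit and a comparable plot could be produced from the bankrup data loaded above.
> # sketch: least squares fit of the binary response Y on the ratio X3
> lin_mod <- lm(Y ~ X3, data = bankrup)
> plot(bankrup$X3, bankrup$Y, xlab = "X3", ylab = "Y")
> abline(lin_mod, col = "red")
> range(fitted(lin_mod))   # check whether any fitted values fall outside [0, 1]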
Examining this plot we can make two obvious observations:
the higher the ratio RE/TA, the better the chances are for the firm to remain solvent.
by analysing the risk we analyse a chance: the probability of a company being solvent based on the value of RE/TA. Since \(Y\) is limited to the values 0 and 1, rather than predicting these two values we will try to model the probability \(p\) that \(Y\) takes one of them.
Let \(p\) denote the probability that \(Y\) = 1 when \(X = x\). If we use the standard linear model to describe \(p\), then our model for the probability would be
\[p = P(Y = 1 | X = x) = b_0 + b_1x + e\]
Note that since \(p\) is a probability it must lie between 0 and 1. It seems rational to treat \(X\hat{b}\) as an estimate of \(P(Y=1 \mid X=x)\). Nonetheless, the plot flags a big issue with this model: it produces predicted probabilities less than 0.
As we can see, the linear regression model does not work for this type of problem, where predictions must not take off-scale values below zero or above one.
Apart from the fact that the linear function given is unbounded, and hence cannot be used to model probability, the other assumptions of linear regression are also not valid for this type of problem:
A workaround for these issues is to fit a different model, one that is bounded by the minimum and maximum probabilities. It makes better sense to model the probabilities on a transformed scale, and this is what is done in logistic regression analysis. The relationship between the probability \(p\) and \(X\) can be represented by a logistic regression function.
The shape of the S-curve given in the figure above can be reproduced if we model the probabilities as follows
\[p = P(Y = 1 | X = x) = \frac{e^{\beta_0 + \beta_1x}}{1 + e^{\beta_0 + \beta_1x}},\]
where \(e\) is the base of the natural logarithm. The logistic model can be generalized directly to the situation where we have several predictor variables. The probability \(p\) is modelled as
\[p = P(Y = 1 | X_1 = x_1, X_2=x_2, ..., X_q=x_q) = \frac{e^{\beta_0 + \beta_1x_1 + \beta_2x_2 + ... + \beta_qx_q}}{1 + e^{\beta_0 + \beta_1x_1 + \beta_2x_2 + ... + \beta_qx_q}},\] where \(q\) is the number of predictors. The two equations above are called the logistic regression functions. They are nonlinear in the parameters \(\beta_0, \beta_1,... \beta_q\); however, they can be linearised by the logit transformation. Instead of working directly with \(p\) we work with a transformed value of \(p\). If \(p\) is the probability of an event happening, the ratio \(\frac{p}{1-p}\) is called the odds of the event. By moving some terms around
\[1 - p = P(Y = 0 | X_1 = x_1, X_2=x_2, ..., X_q=x_q) = \frac{1}{1 + e^{\beta_0 + \beta_1x_1 + \beta_2x_2 + ... + \beta_qx_q}},\] we get
\[\frac{p}{1-p} = e^{\beta_0 + \beta_1x_1 + \beta_2x_2 + ... + \beta_qx_q}\] Taking the natural logarithm of both sides of the above expression, we obtain
\[g(x_1, x_2, ... x_q) = \log\Big(\frac{p}{1-p}\Big) = \beta_0 + \beta_1x_1 + \beta_2x_2 + ... + \beta_qx_q,\] where the logarithm of the odds is called the logit. We realise that the logit transformation produces a linear function of the parameters \(\beta_0, \beta_1,... \beta_q\). It is important to realise that while the range of values of \(p\) of the binomial response \(Y\) is between 0 and 1, the range of values of \(\log\frac{p}{1-p}\) is between \(-\infty\) and \(+\infty\). This makes the logarithm of the odds, known as the logit, more appropriate for linear regression fitting.
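As a quick illustration of this transformation (a sketch, not part of the original analysis), R's qlogis() and plogis() functions compute the logit and its inverse:
> p <- c(0.1, 0.5, 0.9)
> qlogis(p)                            # the logit: log(p / (1 - p))
> plogis(qlogis(p))                    # the inverse logit recovers 0.1, 0.5, 0.9
> curve(plogis(x), from = -6, to = 6)  # the S-shaped logistic curve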
In logistic regression the response probabilities are modelled by the logistic distribution function. That is, the model does not use the binned version of the predictor, but rather the log odds are modelled as a function of the predictor. The model parameters are estimated by working with logits which produces a model that is linear in the parameters.
The method of model estimation is the maximum likelihood method. Maximum likelihood parameter estimation is a technique that can be used when we can make assumptions about the probability distribution of the data. Based on the theoretical probability distribution and the observed data, the likelihood function is a probability statement that can be made about a particular set of parameter values. In logistic regression modelling the maximum likelihood parameters are obtained numerically using an iterative procedure. This is explained in the book by McCullagh and Nelder in Section 4.4.2.
Although the method of maximum likelihood is used for the model estimation, we ask the same set of questions that are usually considered in linear regression. We cannot use the familiar least squares tools such as \(R^2\) and the \(t\) and \(F\) tests, but that is not to say that we are unable to answer the same questions we answer when assessing a linear regression model with those tools.
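To make the idea of numerical maximum likelihood more concrete, here is a small sketch on simulated data (not the bankruptcy data, and not taken from the cited book) that maximises the binomial log-likelihood directly with optim() and compares the estimates with those from glm():
> set.seed(1)
> x <- rnorm(200)
> y <- rbinom(200, size = 1, prob = plogis(-1 + 2 * x))  # true logit: -1 + 2x
> # negative log-likelihood of the logistic model
> negloglik <- function(beta) {
+   eta <- beta[1] + beta[2] * x
+   -sum(y * eta - log(1 + exp(eta)))
+ }
> fit_optim <- optim(c(0, 0), negloglik)
> fit_glm <- glm(y ~ x, family = binomial(logit))
> rbind(optim = fit_optim$par, glm = coef(fit_glm))  # the estimates agree closely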
John Nelder and Robert Wedderburn formulated a modelling technique for accommodating response variables with non-normal conditional distributions. Logistic regression and ordinary linear regression fall into this larger class of techniques, called Generalised Linear Models (GLMs), which accommodate many different probability distributions and thereby substantially extend the range of application of linear statistical models. Except for the error, the right-hand side of a GLM model equation is basically the same as for a linear model. This is reflected in the syntax of the glm() function in R, which expects the formula specifying the right-hand side of a GLM to be the same as that used for a least squares linear regression model, with the addition of a description of the error distribution.
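For instance, purely to illustrate the parallel syntax (using the bankrup data loaded above, not a call shown in the original text), the two calls below differ only in the family argument:
> lm(Y ~ X1 + X2 + X3, data = bankrup)                                     # least squares fit
> glm(Y ~ X1 + X2 + X3, family = binomial(link = "logit"), data = bankrup) # logistic GLM fit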
In the following examples we will apply logistic regression in R by using the glm() function directly and by using the caret package. We will use two data sets: Bankruptcies.csv, which we introduced earlier, and the CreditCard data set from the AER package. The latter is cross-section data on the credit history of a sample of applicants for a type of credit card, and consists of 1319 observations on 12 variables.
glm() function
Through this example we will fit a logistic regression model using the glm() function in order to learn how such models are fitted and evaluated.
We have already uploaded the Bankruptcies.csv data file, which contains the operating financial ratios of 66 firms that either went bankrupt or remained solvent after a two-year period. Hence, the response variable \(Y\) is defined as 0 if the firm went bankrupt and 1 if it remained solvent.
Here, instead of predicting \(Y\) directly, we will fit the model to predict the logit \(\log\frac{p}{1-p}\), which after back-transformation gives the predicted probabilities for \(Y\).
Let us remind ourselves what the data looks like.
> suppressPackageStartupMessages(library(dplyr))
>
> bankrup <- read.csv("https://tanjakec.github.io/mydata/Bankruptcies.csv")
> summary(bankrup)
## Y X1 X2 X3
## Min. :0.0 Min. :-308.90 Min. :-280.000 Min. :0.100
## 1st Qu.:0.0 1st Qu.: -39.05 1st Qu.: -17.675 1st Qu.:1.025
## Median :0.5 Median : 7.85 Median : 4.100 Median :1.550
## Mean :0.5 Mean : -13.63 Mean : -8.226 Mean :1.721
## 3rd Qu.:1.0 3rd Qu.: 35.75 3rd Qu.: 14.400 3rd Qu.:1.975
## Max. :1.0 Max. : 68.60 Max. : 34.100 Max. :6.700
> glimpse(bankrup)
## Rows: 66
## Columns: 4
## $ Y <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ X1 <dbl> -62.8, 3.3, -120.8, -18.1, -3.8, -61.2, -20.3, -194.5, 20.8, -106.1…
## $ X2 <dbl> -89.5, -3.5, -103.2, -28.8, -50.6, -56.2, -17.4, -25.8, -4.3, -22.9…
## $ X3 <dbl> 1.7, 1.1, 2.5, 1.1, 0.9, 1.7, 1.0, 0.5, 1.0, 1.5, 1.2, 1.3, 0.8, 2.…
Before we start the model fitting procedure we will split the data into training and test sets in the proportion 80:20.
> set.seed(123)
> split_idx = sample(nrow(bankrup), 53)
> bankrup_train = bankrup[split_idx, ]
> bankrup_test = bankrup[-split_idx, ]
Let us have a quick glance at the training data.
> glimpse(bankrup_train)
## Rows: 53
## Columns: 4
## $ Y <int> 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1…
## $ X1 <dbl> -8.8, 31.4, 7.2, -120.8, 68.6, 18.1, 40.6, 37.3, 35.0, 21.5, 59.5, …
## $ X2 <dbl> -9.1, 15.7, -22.6, -103.2, 13.8, 13.5, 5.8, 33.4, 20.8, -14.4, 7.0,…
## $ X3 <dbl> 0.9, 1.9, 2.0, 2.5, 1.6, 4.0, 1.8, 3.5, 1.9, 1.0, 2.0, 2.0, 2.1, 2.…
> summary(as.factor(bankrup_train$Y))
## 0 1
## 28 25
> summary(bankrup_train)
## Y X1 X2 X3
## Min. :0.0000 Min. :-308.90 Min. :-280.00 Min. :0.10
## 1st Qu.:0.0000 1st Qu.: -39.40 1st Qu.: -20.80 1st Qu.:1.10
## Median :0.0000 Median : 3.30 Median : 1.60 Median :1.60
## Mean :0.4717 Mean : -17.83 Mean : -11.93 Mean :1.76
## 3rd Qu.:1.0000 3rd Qu.: 35.00 3rd Qu.: 13.50 3rd Qu.:2.00
## Max. :1.0000 Max. : 68.60 Max. : 33.40 Max. :6.70
We have a nice split in terms of the number of observations for each class of Y. Next, we estimate a logistic regression model using the glm() (generalised linear model) function, which we save as an object.
> model <- glm(Y ~ X1 + X2 + X3, data = bankrup_train, family = binomial(logit))
> model
##
## Call: glm(formula = Y ~ X1 + X2 + X3, family = binomial(logit), data = bankrup_train)
##
## Coefficients:
## (Intercept) X1 X2 X3
## -9.8846 0.3238 0.1779 4.9443
##
## Degrees of Freedom: 52 Total (i.e. Null); 49 Residual
## Null Deviance: 73.3
## Residual Deviance: 5.787 AIC: 13.79
In order to see all the results saved in the model object we use the summary() command.
> summary(model)
##
## Call:
## glm(formula = Y ~ X1 + X2 + X3, family = binomial(logit), data = bankrup_train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -9.8846 10.6564 -0.928 0.354
## X1 0.3238 0.2959 1.094 0.274
## X2 0.1779 0.1079 1.650 0.099 .
## X3 4.9443 5.0231 0.984 0.325
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 73.3037 on 52 degrees of freedom
## Residual deviance: 5.7875 on 49 degrees of freedom
## AIC: 13.787
##
## Number of Fisher Scoring iterations: 12
The key components of R's summary() output for generalised linear models with the binomial family and logit link are:
- Call: just like in the case of fitting the lm() model, this is R reminding us what model we ran, what options we specified, etc.
- the Deviance Residuals are a measure of model fit. This part of the output shows the distribution of the deviance residuals for individual cases used in the model. Below we discuss how to use summaries of the deviance statistic to assess model fit
- the Coefficients, their standard errors, the z-statistic (sometimes called a Wald z-statistic), and the associated p-values. The logistic regression coefficients give the change in the log odds of the outcome for a one unit increase in the predictor variable
At the end there are fit indices, including the null and residual deviances and the AIC, which are used to assess overall model fit.
The output above resembles the standard regression output; although they differ in the type of information they provide, they serve similar functions. Let us interpret it.
The fitted logarithm of the odds ratio, i.e. logit of the probability \(p\) of the firm remaining solvent after two years is modelled as:
\[\hat{g}(X_1, X_2, X_3) = -9.8846 + 0.3238X_1 + 0.1779X_2 + 4.9443X_3\] Remember that here, instead of predicting \(Y\), the model predicts \(\log\frac{p}{1-p}\). Applying the inverse of the logit transformation gives the predicted probabilities of the firm being solvent.
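As a brief sketch of this back-transformation (our own illustration, with g_hat as a made-up name), the fitted logit of a single training firm can be converted into a probability with the inverse-logit function plogis():
> g_hat <- predict(model, newdata = bankrup_train[1, ], type = "link")  # fitted logit
> plogis(g_hat)  # exp(g) / (1 + exp(g)); same as predict(..., type = "response")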
The estimated parameters are expected changes in the logit for unit change in their corresponding variables when the other, remaining variables are held fixed. That is, the logistic regression coefficients give the change in the log odds of the response variable for a unit increase in the explanatory variable:
This is very hard to make sense of. We have predicted log odds, and in order to interpret them in a more sensible fashion we need to anti-log (exponentiate) them: \(e^{\beta_i}\) gives the multiplicative change in the odds for a unit change in variable \(X_i\) while the other variables are held fixed.
> round(exp(coef(model)), 4)
## (Intercept) X1 X2 X3
## 0.0001 1.3823 1.1947 140.3731
Now, we can interpret the coefficients of the ratios as follows:
The odds of a firm being solvent (vs being bankrupt) are multiplied by \(e^{\beta_1} = e^{0.3238} = 1.38\) for a unit change in ratio \(X_1\), all else in the model being fixed, implying that there is an increase of \(38\%\)
The odds of a firm being solvent (vs being bankrupt) are multiplied by \(e^{\beta_2} = e^{0.1779} = 1.19\) for a unit change in ratio \(X_2\), all else in the model being fixed, implying that there is an increase of about \(19\%\)
The odds of a firm being solvent (vs being bankrupt) are multiplied by \(e^{\beta_3} = e^{4.9443} = 140.37\) for a unit change in ratio \(X_3\), all else in the model being fixed, i.e. the odds increase roughly 140-fold
The column headed z value is the ratio of the coefficients (Estimate) to their standard errors (Std. Error), known as the Wald statistic for testing the hypothesis that the corresponding parameter is zero. In standard regression this would be the t-test. Next to the Wald test statistics we have their corresponding \(p\)-values (Pr(>|z|)), which are used to judge the significance of the coefficients. Values smaller than \(0.05\) would lead to the conclusion that the coefficient is significantly different from \(0\) at the \(5\%\) significance level. From the output obtained we see that none of the \(p\)-values is smaller than \(0.05\), implying that none of the variables individually is significant for predicting the logit of the observations.
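As a small check (a sketch, not part of the original text), the z values can be reproduced by dividing the estimates by their standard errors:
> coefs <- summary(model)$coefficients
> coefs[, "Estimate"] / coefs[, "Std. Error"]  # the Wald z values reported above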
We need to make a proper assessment to check if the variables collectively contribute to explaining the logit. That is, we need to examine whether the coefficients \(\beta_1\), \(\beta_2\) and \(\beta_3\) are all zeros. We do this using the \(G\) statistic, which stands for goodness of fit
\[G = -2\big(\text{log-likelihood without the predictors} - \text{log-likelihood with the predictors}\big),\] which is equivalent to the difference in deviances between the two models. \(G\) has a chi-square distribution with degrees of freedom equal to the number of predictors being tested (here 3), so a chi-square test is the test of the fit of the model (note that \(G\) plays a role similar to that of \(R^{2}\) in linear regression). The question we have now is: where do we get this statistic from?
In addition to the above, the summary of the model includes the deviance and degrees of freedom for a model with only an intercept (the null deviance) and the residual deviance and degrees of freedom for the fitted model. We calculate the \(G\) statistic as the difference in deviances between the null model and the fitted model.
Before we calculate the \(G\) statistic we need to specify the hypotheses:
\[H_0: \beta_1 = \beta_2 = \beta_3 = 0 \qquad \text{vs.} \qquad H_1: \text{at least one } \beta_i \neq 0,\]
where \(i = 1, 2, 3\). The decision rule is: if \(G_{calc} > G_{crit}\), the critical value of the chi-square distribution with the corresponding degrees of freedom, we reject \(H_0\) in favour of \(H_1\); otherwise we do not reject \(H_0\).
We can also consider the \(p\)-value: if it is greater than \(0.05\) we conclude that the variables do not collectively influence the logits; if however the \(p\)-value is less than \(0.05\) we conclude that they do collectively influence the logits.
Next we calculate the \(G\) statistic and the degrees of freedom of the corresponding critical value. Knowing what it is, we can calculate it manually or obtain it using the pscl::pR2() function.
> G_calc <- model$null.deviance - model$deviance
> Gdf <- model$df.null - model$df.residual
> pscl::pR2(model)
## fitting null model for pseudo-r2
## llh llhNull G2 McFadden r2ML r2CU
## -2.8937311 -36.6518495 67.5162368 0.9210482 0.7202590 0.9613743
> G_calc
## [1] 67.51624
> qchisq(.95, df = Gdf)
## [1] 7.814728
> 1 - pchisq(G_calc, Gdf)
## [1] 1.454392e-14
As \(G_{calc} = 67.52 > G_{crit} = 7.81 \Rightarrow H_1\), i.e. \(p\text{-value} = 0.00 < 0.05\), we can conclude that this is a statistically valid model and that the variables collectively have explanatory power. The next question we need to ask is: do we need all three variables?
In linear regression we assessed the significance of individual regression coefficients, i.e. the contribution of the individual variables, using the t-test. Here, we use the z scores to conduct the equivalent Wald test. In logistic regression, the ratio of a coefficient to its standard error has an approximately standard normal distribution, as opposed to the t-distribution we have seen in linear regression. Nonetheless, the set of hypotheses we wish to test is the same:
We are going to use a simplistic approach in testing these hypotheses in terms of the adopted decision rule. Decision Rule: if the Wald \(z\) statistic lies between -2 and +2, then the financial ratio \(X_i\) is not needed and can be dropped from the analysis. Otherwise, we will keep the financial ratio. However, there is some scope here for subjective judgement depending upon how near to +/-2 the Wald \(z\) value is. Therefore we may have to do some “manual” searching for an acceptable set of explanatory variables, as the \(z\) values of all three variables in the model lie between +/-2.
In our example none of the variables appears to be important judging by its Wald \(z\) statistic, yet based on the chi-square statistic \(G\) we know that the fitted model is valid and that the selected variables collectively contribute to explaining the logits.
To evaluate the individual contribution of the variables used in a logistic regression model we examine what happens to the change in the chi-square statistic \(G\) when the \(i^{th}\) variable is removed from the model. A large \(p\)-value means that the reduced model explains about the same amount of variation as the full model and, thus, we can probably leave out the variable.
Decision Rule: if \(p\)-value > 0.05 \(\Rightarrow\) the variable can be taken out, otherwise if \(p\)-value < 0.05 keep the variable in the model.
Rather than fitting individual models and doing a manual comparison, we can make use of the anova() function, which compares the nested models using the chi-square test.
> anova(model, test="Chisq")
Based on this comparison we can remove the \(X_3\) variable from the model:
> model_new <- update(model, ~. -X3, data=bankrup_train)
> summary(model_new)
##
## Call:
## glm(formula = Y ~ X1 + X2, family = binomial(logit), data = bankrup_train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.57363 0.95662 -0.600 0.5487
## X1 0.15058 0.07408 2.033 0.0421 *
## X2 0.18163 0.12069 1.505 0.1324
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 73.3037 on 52 degrees of freedom
## Residual deviance: 9.3172 on 50 degrees of freedom
## AIC: 15.317
##
## Number of Fisher Scoring iterations: 10
> anova(model_new, test="Chisq")
To compare the fit of the alternative model we will use the Akaike Information Criterion (AIC), which is an index of fit that takes account of parsimony of the model by penalising for the number of parameters. It is defined as
\[AIC = - 2 \times \text{maximum log-likelihood} + 2 \times \text{number of parameters}\] and thus smaller values are indicative of a better fit to the data. In the context of logit fit, the AIC is simply the residual deviance plus twice the number of regression coefficients.
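A quick sketch (not in the original text) verifying this for the reduced model fitted above:
> # AIC reproduced as residual deviance + 2 * number of estimated coefficients
> model_new$deviance + 2 * length(coef(model_new))
> AIC(model_new)  # returns the same value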
The AIC of the initial model is 13.787 and that of the new model is 15.317. Although the new model has a slightly higher AIC, it is more parsimonious and its variables make a stronger individual contribution to explaining the logits, with \(X_1\) now significant at the \(5\%\) level. So, in the spirit of parsimony we can choose the second model as the better fit.
To obtain the overall accuracy rate we need to find the predicted probabilities of the observations kept aside in the bankrup_test subset. By default predict.glm() uses type = "link", so unless we specify otherwise R returns
\[\hat{\beta}_0 + \hat{\beta}_1X_1 + \hat{\beta}_2X_2\] for each observation, which is not a predicted probability. To obtain the predicted probabilities
\[p = \frac{e^{\hat{\beta}_0 + \hat{\beta}_1x_1 + \hat{\beta}_2x_2}}{1 + e^{\hat{\beta}_0 + \hat{\beta}_1x_1 + \hat{\beta}_2x_2}} = \frac{1}{1 + e^{-(\hat{\beta}_0 + \hat{\beta}_1x_1 + \hat{\beta}_2x_2)}},\] we need to use type = "response" in the predict.glm() function.
> link_pr <- round(predict(model_new, bankrup_test, type = "link"), 2)
> link_pr
## 4 20 22 24 30 38 40 44 46 47 55
## -8.53 -11.11 -4.48 -22.58 -9.16 8.75 8.68 12.51 2.58 11.24 9.43
## 58 63
## 10.31 7.93
> response_pr <- round(predict(model_new, bankrup_test, type = "response"), 2)
>
> t(bankrup_test$Y)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
## [1,] 0 0 0 0 0 1 1 1 1 1 1 1 1
> coefficients(model_new)
## (Intercept) X1 X2
## -0.5736311 0.1505756 0.1816265
Here we illustrate how these predictions are made for the third and twelfth observations in the bankrup_test data subset.
> bankrup_test[c(3, 12),]
> # using the coefficients of model_new shown above
> round(1/(1+exp(-(-0.5736311 + 0.1505756*(-18.1) + 0.1816265*(-6.5)))), 4)
## [1] 0.0112
> round(1/(1+exp(-(-0.5736311 + 0.1505756*54.7 + 0.1816265*14.6))), 4)
## [1] 1
Knowing how the probabilities are estimated for the test data we now need to describe the performance of a classification model.
A common method for describing the performance of a classification model is the confusion matrix: a table showing the number of correct and incorrect predictions on a classification problem when the real values of the test data are known. For a two-class problem it is a 2 × 2 table whose rows correspond to the actual classes and whose columns to the predicted classes.
The TRUE values are the number of correct predictions made, and the values on the diagonal show the number of correctly classified instances for each category.
> how_well <- data.frame(response_pr, bankrup_test$Y) %>%
+ mutate(result = round(response_pr) == bankrup_test$Y)
>
> how_well
> confusion_matrix <- table(bankrup_test$Y, round(response_pr))
> confusion_matrix
##
## 0 1
## 0 5 0
## 1 0 8
The simplest metric for model classification is the overall accuracy rate. To evaluate predicted classes, we are going to create a function that can calculate the accuracy of the fitted model.
> # calculate percentage accuracy
> accuracy <- function(x){
+   # correctly classified cases (the diagonal) over the table total
+   sum(diag(x)) / sum(x) * 100
+ }
We now apply this function to the confusion matrix to obtain the overall accuracy rate for the test data.
> accuracy(confusion_matrix)
## [1] 100
Although the overall accuracy rate might be easy to compute and to interpret, it makes no distinction about the type of errors being made.
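For example (a sketch using the confusion matrix computed above, where rows are the actual classes and columns the predictions), we can look at the class-wise rates that the overall accuracy hides:
> # proportion of solvent firms (Y = 1) correctly identified
> confusion_matrix["1", "1"] / sum(confusion_matrix["1", ])
> # proportion of bankrupt firms (Y = 0) correctly identified
> confusion_matrix["0", "0"] / sum(confusion_matrix["0", ])
> # here both equal 1, since the test set was classified perfectly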
Practise by doing the following exercises:
Access the stroke data set from https://tanjakec.github.io/mydata/stroke.csv. The data dictionary of this data set is given in the word document https://tanjakec.github.io/mydata/Stroke_DataDictionary.docx.
Split the data into a training and a test set. Fit a logistic regression model using the training part of the data set.
Predict the response for the test set.
What is the value of the performance metric?