1 Introduction

This is the last part of the Census Data project, in which we analyze a U.S. census data set taken from the UCI (University of California, Irvine) Machine Learning Repository. The project is divided into four parts. The first part, Cleaning and Preprocessing the Data, covers cleaning and transforming the data; the second part, Exploratory Data Analysis, investigates the variables; and the third part, Predictive Analysis, is devoted to building prediction models. The current, fourth part provides theoretical insights into the statistical methods used in the project.

2 Logistic regression

2.1 Idea of the method

We are interested in predicting the values of the variable “income”, so “income” is our response variable. It assumes only two values: at most 50K a year and more than 50K a year. Let \(Y_i\) be the random variable “income of the i-th subject”, with \(Y_i=1\) if income > 50K and \(Y_i=0\) if income <= 50K. In subsection 2.2 we explain why we choose to model the class " >50K" as the “positive” one and " <=50K" as the “negative” one. Then, \(Y_i\) (\(i=1,2,\dots, n\), where \(n=30162\) is the number of observations in the data frame) follows the Bernoulli distribution, which is the special case of the binomial distribution with a single trial.

Since the response variable is binary, we will use logistic regression for the predictive analysis, which is a type of a generalized linear model. We denote \(\pi_i=P(Y_i = 1\vert \mathbf{X}=\mathbf{x}_i)\), where \(\mathbf{X}=\left(X_1, X_2,\dots, X_k \right)^T\) is the vector of predictors and \(\mathbf{x}_i=(x_{i1}, x_{i2}, \dots, x_{ik})^T\) is the set of predictor values corresponding to the observed value of the response \(Y_i\), which may be 0 or 1. This can be written also as: \[ \small{ P(Y_i = y_i\vert \mathbf{X}=\mathbf{x}_i) = \pi_i^{y_i}(1-\pi_i)^{1-y_i} }, \] where \(y_i=0\) or \(1\). In other words, \(\pi_i\) is the probability that the i-th subject has an income greater than 50K. Then, we have that the probability for a person to have an income less than or equal to 50K is \(1-\pi_i=P(Y_i = 0\vert \mathbf{X}=\mathbf{x}_i)\) and the log-odds are modeled by the following linear model: \[ \ln\left( \dfrac{\pi_i}{1-\pi_i} \right) = \beta_0 + \boldsymbol{\bar{\beta}}^T\mathbf{x}_i, \quad \beta_0,\beta_j\in\mathbb{R}, j=1,\dots,k \] which is equivalent to \[ \ln\left( \dfrac{\pi_i}{1-\pi_i} \right) = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_{k} x_{ik} \] where \(\boldsymbol{\bar{\beta}}^T=\left( \beta_1,\beta_2, \dots, \beta_k \right)\). The function \(\ln\left( \dfrac{\pi_i}{1-\pi_i} \right)\) is called the “logit” function and is denoted by \[ \text{logit}(\pi_i)=\ln\left( \dfrac{\pi_i}{1-\pi_i} \right). \] Note that the logit maps probabilities from the range \((0,1)\) to the entire real line. Then the probabilities \(\pi_i\) are given by: \[ \pi_i = \dfrac{\exp\left(\beta_0 + \boldsymbol{\bar{\beta}}^T\mathbf{x}_i \right)}{1 + \exp\left(\beta_0 + \boldsymbol{\bar{\beta}}^T\mathbf{x}_i \right)} \in (0,1) \] Since \(Y_i\) follows the Bernoulli distribution, we have that \[ E(Y_i) = \mu_i = \pi_i \] and \[ Var(Y_i) = \sigma^2_i = \pi_i(1 - \pi_i) \] Hence, in logistic regression we model the expected value of the random variable \(Y_i\), just as it is done in simple linear regression.

The estimates \(\hat{\beta}_0, \hat{\beta}_1, \dots, \hat{\beta}_k\) of the parameters \(\beta_0, \beta_1, \dots, \beta_k\) are obtained with the maximum likelihood estimation method.
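As an illustration, such a model can be fitted in R with the “glm()” function and a binomial family. The sketch below is only a hedged example: it assumes the cleaned data frame db.adult from the first part of the project, and the chosen predictors (age, education_num, hours_per_week) are placeholders for whichever covariates are actually used.

logit.fit <- glm(income ~ age + education_num + hours_per_week,
                 data = db.adult,
                 family = binomial(link = "logit"))

summary(logit.fit)   # estimated coefficients, standard errors, z values and p-values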

2.2 Choice of the “positive” class

Next, we explain our choice of a “positive” class, i.e., the value of the binary response variable whose probability we model via the logistic regression. Since R codes factor variables as numbers, the binary response variable, which is a categorical (factor) variable, is also coded internally. Therefore, when estimating a logistic regression model we need to know how the binary response variable is coded, i.e. which factor level is treated as the “positive” class and which as the “negative” class. By default R orders the factor levels alphabetically, and the response level modeled in the logistic regression is the second (highest) level. In our case the highest level of the “income” variable is " >50K":

setwd("D:/Data_Science_Projects/Census_DataSet")

db.adult <- read.csv("adult_df.csv")

attributes(db.adult$income)[1]
## $levels
## [1] " <=50K" " >50K"

Hence, the response level modeled is " >50K", i.e. when fitting the logistic regression model, the probability of the income being greater than 50K is calculated. This means that we have to choose " >50K" as the “positive” class.
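If a different factor ordering were ever needed, the reference level could be inspected and reset explicitly. The snippet below is only an illustration and assumes the data frame db.adult loaded above:

levels(db.adult$income)   # " <=50K" " >50K" - the second level is the one modeled by glm()

# Hypothetical: make " <=50K" the reference level explicitly (it already is here)
# db.adult$income <- relevel(db.adult$income, ref = " <=50K")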

2.3 Interpretation of the coefficients

The odds are defined as the probability of success divided by the probability of failure:

\[ \textbf{Odds}_i = \dfrac{P(Y_i = 1\vert \mathbf{X}=\mathbf{x}_i)}{1-P(Y_i = 1\vert \mathbf{X}=\mathbf{x}_i)}= \dfrac{\pi_i}{1-\pi_i} \]

Then we have that the log-odds are modeled by a linear model:

\[ \ln \left( \textbf{Odds}_i \right) = \ln \left( \dfrac{\pi_i}{1-\pi_i} \right) = \beta_0 + \boldsymbol{\bar{\beta}}^T\mathbf{x}_i \]

and from here we obtain that

\[ \textbf{Odds}_i = \dfrac{\pi_i}{1-\pi_i} = \exp\left( \beta_0 + \boldsymbol{\bar{\beta}}^T\mathbf{x}_i \right) \]

2.3.1 Odds ratio

We have that \[ \pi_i = \dfrac{\exp\left(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_{k} x_{ik} \right)}{1 + \exp\left( \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_{k} x_{ik} \right)} \] Let the index \(p\) be such that \(1\leq p \leq k\), and let us see what happens to the probability \(\pi_i\) if we increase \(x_{ip}\) by one and keep all other \(x_{ij}\)’s unchanged (\(j=1,2,\dots, p-1, p+1, \dots, k\)). We denote \[ \pi_i^{(+1)} = \dfrac{\exp\left(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p(x_{ip} + 1) + \dots + \beta_{k} x_{ik} \right)}{1 + \exp\left( \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p(x_{ip} + 1) + \dots + \beta_{k} x_{ik} \right)} = \] \[ = \dfrac{\exp\left(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_{p}x_{ip} + \dots + \beta_{k} x_{ik} + \beta_p \right)}{1 + \exp\left( \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_{p}x_{ip} + \dots + \beta_{k}x_{ik} + \beta_p\right)} = \] \[ = \dfrac{\exp\left(\beta_0 + \boldsymbol{\bar{\beta}}^T\mathbf{x}_i + \beta_p \right)}{1 + \exp\left(\beta_0 + \boldsymbol{\bar{\beta}}^T\mathbf{x}_i + \beta_p \right)} = \dfrac{\exp\left(\beta_0 + \boldsymbol{\bar{\beta}}^T\mathbf{x}_i \right)\exp\left( \beta_p \right)}{1 + \exp\left(\beta_0 + \boldsymbol{\bar{\beta}}^T\mathbf{x}_i \right)\exp\left( \beta_p \right)} \] Hence, the odds after a unit increase in \(x_{ip}\) become: \[ \textbf{Odds} = \dfrac{\pi_i^{(+1)}}{1 - \pi_i^{(+1)}} = \exp\left(\beta_0 + \boldsymbol{\bar{\beta}}^T\mathbf{x}_i \right)\exp\left( \beta_p \right) \] Then the ratio of the new odds \(\dfrac{\pi_i^{(+1)}}{1 - \pi_i^{(+1)}}\) to the original odds \(\dfrac{\pi_i}{1 - \pi_i}\) is: \[ \textbf{Odds ratio} = \dfrac{\dfrac{\pi_i^{(+1)}}{1 - \pi_i^{(+1)}}}{\dfrac{\pi_i}{1 - \pi_i}} = \dfrac{\exp\left(\beta_0 + \boldsymbol{\bar{\beta}}^T\mathbf{x}_i \right)\exp\left( \beta_p \right)}{\exp\left(\beta_0 + \boldsymbol{\bar{\beta}}^T\mathbf{x}_i \right)} = \exp\left( \beta_p \right) \] or, equivalently: \[ \beta_p = \ln\left( \textbf{Odds ratio} \right) = \ln\left( \dfrac{\dfrac{\pi_i^{(+1)}}{1 - \pi_i^{(+1)}}}{\dfrac{\pi_i}{1 - \pi_i}} \right) = \ln\left( \dfrac{\pi_i^{(+1)}}{1 - \pi_i^{(+1)}} \right) - \ln\left(\dfrac{\pi_i}{1 - \pi_i} \right) \] Consequently, the logistic regression coefficients are interpreted in the following way. A logistic regression coefficient \(\beta_p\), \(p=1,2,\dots,k\), is equal to the difference in log odds for a unit increase in the value of the predictor \(X_p\). This interpretation makes sense for numerical variables. Next we will show how to interpret the logistic coefficients in the case of categorical (nominal) variables.
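In R, the estimated odds ratios are simply the exponentiated coefficients of the fitted model. A small hedged illustration, assuming the hypothetical model logit.fit fitted earlier:

exp(coef(logit.fit))   # estimated odds ratio exp(beta_p) for a one-unit increase in each predictor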

2.3.2 Odds ratio for categorical variables

Without loss of generality, let us assume that the variable \(X_1\) is a categorical one with three levels. To each of the three levels we associate an indicator variable, denoted \(X_{11}\), \(X_{12}\) and \(X_{13}\), respectively. In R the categories of \(X_1\) are coded as follows:

  1. When an observation belongs to category \(1\): \(X_{11}=1\) and \(X_{12}=X_{13}=0\)

  2. When an observation belongs to category \(2\): \(X_{12}=1\) and \(X_{11}=X_{13}=0\)

  3. When an observation belongs to category \(3\): \(X_{13}=1\) and \(X_{11}=X_{12}=0\)

R leaves out the level which comes first alphabetically as a reference level and does not include its indicator in the model. This is done because if we know that an observation does not belong to category 2 or 3, then it must belong to category 1. Without loss of generality, let us assume that the reference (base) level is the first level, \(X_{11}\). Let \(\pi := P(Y=1\vert \mathbf{X})\). The variables \(X_{12}\) and \(X_{13}\) are called dummy variables and are included in the logistic regression model in the following way:

\[ \textbf{Odds} = \dfrac{\pi}{1-\pi} = \exp\left(\beta_0 + \beta_{12}X_{12} + \beta_{13}X_{13} + \beta_2 X_2 + \dots + \beta_{k} X_k \right) \] If the observation belongs to category \(2\) of \(X_1\), then \[ O_2 := \textbf{Odds} = \exp\left(\beta_0 + \beta_{12} + \beta_{2}X_2 + \dots + \beta_{k}X_k \right) \] If it belongs to category \(3\), then \[ O_3 := \textbf{Odds} = \exp\left(\beta_0 + \beta_{13} + \beta_{2}X_2 + \dots + \beta_{k}X_k \right) \] And if it belongs to category \(1\) (the reference level), then \[ O_1 := \textbf{Odds} = \exp\left(\beta_0 + \beta_{2}X_2 + \dots + \beta_{k}X_k \right) \]

Therefore, we have that \[ \dfrac{O_2}{O_1} = \dfrac{\exp\left(\beta_0 + \beta_{12} + \beta_{2}X_2 + \dots + \beta_{k}X_k \right)}{\exp\left(\beta_0 + \beta_{2}X_2 + \dots + \beta_{k}X_k \right)} = \dfrac{\exp\left( (\beta_0 + \beta_{2}X_2 + \dots + \beta_{k}X_k) + \beta_{12} \right)}{\exp\left(\beta_0 + \beta_{2}X_2 + \dots + \beta_{k}X_k \right)} = \] \[ = \dfrac{\exp\left(\beta_0 + \beta_{2}X_2 + \dots + \beta_{k}X_k\right) \exp\left(\beta_{12} \right)}{\exp\left(\beta_0 + \beta_{2}X_2 + \dots + \beta_{k}X_k \right)} = \exp\left(\beta_{12} \right). \] By analogy: \[ \dfrac{O_3}{O_1} = \exp\left(\beta_{13} \right). \]

Hence, for categorical variables, the exponential of the logistic regression coefficient for a given category (factor level) is equal to the ratio of the odds of success for that category to the odds of success for the reference category.
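The dummy coding that R applies to a factor can be inspected directly. The toy example below only illustrates the default treatment contrasts and does not use the census data:

x1 <- factor(c("cat1", "cat2", "cat3"))
contrasts(x1)   # the first (reference) level gets no dummy variable
##      cat2 cat3
## cat1    0    0
## cat2    1    0
## cat3    0    1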

2.4 Log-likelihood function

The probability mass function (p.m.f.) of \(Y_i\) is given by: \[ P(Y_i = y_i) = p_i(y_i,\pi_i)=\pi_i^{y_i}(1-\pi_i)^{1-y_i}, \] where \(y_i\in\{0,1\}\). Let us denote \(\mathbf{y}^T=\left( y_1, y_2, \dots, y_n \right)\). We have that \(y_i\in\{0,1\}\) is the outcome of each “trial”, where the i-th “trial” is the event of the i-th subject having an income of more than 50K or not. Then, the likelihood function (joint probability mass function) of \[ \boldsymbol{\beta}^T=(\beta_0, \boldsymbol{\bar{\beta}}^T) \] given the outcomes \(y_1, y_2, \dots, y_n\) and the predictor values \(x_{i1}, x_{i2}, \dots, x_{in}\) is \[ L(\boldsymbol{\beta} \vert \mathbf{x}_i, \mathbf{y}) = \prod\limits_{i=1}^n p_i(y_i,\pi_i) = \prod\limits_{i=1}^n \pi_i^{y_i}\left( 1-\pi_i \right)^{1 - y_i}, \] and the log-likelihood function is \[ l = \ln\left[ L( \boldsymbol{\beta} \vert \mathbf{x}_i, \mathbf{y}) \right]=\ln\left[ \prod\limits_{i=1}^n \pi_i^{y_i}\left( 1-\pi_i \right)^{1 - y_i} \right] = \] \[ = \ln\left[ \prod\limits_{i=1}^n \left( \dfrac{\exp\left(\beta_0 + \boldsymbol{\bar{\beta}}^T\mathbf{x}_i \right)}{1 + \exp\left(\beta_0 + \boldsymbol{\bar{\beta}}^T\mathbf{x}_i \right)}\right)^{y_i} \cdot \left( 1 - \dfrac{\exp\left(\beta_0 + \boldsymbol{\bar{\beta}}^T\mathbf{x}_i \right)}{1 + \exp\left(\beta_0 + \boldsymbol{\bar{\beta}}^T\mathbf{x}_i \right)} \right)^{1-y_i} \right] = \] \[ = \sum\limits_{i=1}^n \ln\left[\left( \dfrac{\exp\left(\beta_0 + \boldsymbol{\bar{\beta}}^T\mathbf{x}_i \right)}{1 + \exp\left(\beta_0 + \boldsymbol{\bar{\beta}}^T\mathbf{x}_i \right)}\right)^{y_i} \cdot \left( \dfrac{1}{1 + \exp\left(\beta_0 + \boldsymbol{\bar{\beta}}^T\mathbf{x}_i \right)} \right)^{1-y_i} \right] = \] \[ = \sum\limits_{i=1}^n \left\{ \ln\left( \dfrac{\exp\left(\beta_0 + \boldsymbol{\bar{\beta}}^T\mathbf{x}_i \right)}{1 + \exp\left(\beta_0 + \boldsymbol{\bar{\beta}}^T\mathbf{x}_i \right)}\right)^{y_i} + \ln\left( \dfrac{1}{1 + \exp\left(\beta_0 + \boldsymbol{\bar{\beta}}^T\mathbf{x}_i \right)} \right)^{1-y_i} \right\} = \] \[ = \sum\limits_{i=1}^n \left\{ y_i\ln\left( \dfrac{\exp\left(\beta_0 + \boldsymbol{\bar{\beta}}^T\mathbf{x}_i \right)}{1 + \exp\left(\beta_0 + \boldsymbol{\bar{\beta}}^T\mathbf{x}_i \right)}\right) + (1-y_i)\ln\left( \dfrac{1}{1 + \exp\left(\beta_0 + \boldsymbol{\bar{\beta}}^T\mathbf{x}_i \right)} \right) \right\} = \] \[ = \sum\limits_{i=1}^n \left\{ y_i\left[\ln\exp\left(\beta_0 + \boldsymbol{\bar{\beta}}^T\mathbf{x}_i \right) - \ln\left(1 + \exp\left(\beta_0 + \boldsymbol{\bar{\beta}}^T\mathbf{x}_i \right) \right) \right] - (1-y_i)\ln\left( 1 + \exp\left(\beta_0 + \boldsymbol{\bar{\beta}}^T\mathbf{x}_i \right) \right) \right\}= \] \[ = \sum\limits_{i=1}^n \left\{ y_i\left(\beta_0 + \boldsymbol{\bar{\beta}}^T\mathbf{x}_i \right) - y_i\ln\left(1 + \exp\left(\beta_0 + \boldsymbol{\bar{\beta}}^T\mathbf{x}_i \right) \right) - \right. \] \[ \left. 
- \ln\left( 1 + \exp\left(\beta_0 + \boldsymbol{\bar{\beta}}^T\mathbf{x}_i \right) \right) + y_i\ln\left( 1 + \exp\left(\beta_0 + \boldsymbol{\bar{\beta}}^T\mathbf{x}_i \right) \right) \right\}= \] \[ = \sum\limits_{i=1}^n \left\{ y_i\left(\beta_0 + \boldsymbol{\bar{\beta}}^T\mathbf{x}_i \right) - \ln\left(1 + \exp\left(\beta_0 + \boldsymbol{\bar{\beta}}^T\mathbf{x}_i \right) \right) \right\}= \] \[ =\sum\limits_{i=1}^n y_i\left(\beta_0 + \boldsymbol{\bar{\beta}}^T\mathbf{x}_i \right) - \sum\limits_{i=1}^n \ln\left(1 + \exp\left(\beta_0 + \boldsymbol{\bar{\beta}}^T\mathbf{x}_i \right) \right) \] Hence, we obtained that the log-likelihood function is: \[ l = \ln\left[ L(\boldsymbol{\beta} \vert \mathbf{x}_i, \mathbf{y}) \right] = \sum\limits_{i=1}^n y_i\left(\beta_0 + \boldsymbol{\bar{\beta}}^T\mathbf{x}_i \right) - \sum\limits_{i=1}^n \ln\left(1 + \exp\left(\beta_0 + \boldsymbol{\bar{\beta}}^T\mathbf{x}_i \right) \right) \]
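This closed form can be checked numerically against the log-likelihood that R reports for a fitted model. A hedged sketch, assuming the hypothetical model logit.fit and the data frame db.adult from above:

eta <- predict(logit.fit, type = "link")        # beta_0 + bar(beta)^T x_i for each observation
y   <- as.numeric(db.adult$income == " >50K")   # response coded as 0/1

sum(y * eta) - sum(log(1 + exp(eta)))   # closed-form log-likelihood
logLik(logit.fit)                       # should agree with the value above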

2.5 Maximum likelihood estimators

Then the maximum likelihood estimator for \(\boldsymbol{\beta}^T=\left( \beta_0, \beta_1,\dots,\beta_k \right)\) is obtained by finding \(\hat{\boldsymbol{\beta}}^T=\left( \hat{\beta}_0, \hat{\beta}_1,\dots,\hat{\beta}_k \right)\) that maximizes the log-likelihood function \(l\) for the given data \(\left(\mathbf{x}_i, \mathbf{y} \right)\). That is, \(\hat{\boldsymbol{\beta}}\) are the values of the model parameters for which the probability of observing the given data is maximized. In order to find \(\hat{\boldsymbol{\beta}}\) we have to find the first derivatives of the log-likelihood function and set them to 0. If the matrix of second derivatives (the Hessian, calculated at the point \(\hat{\boldsymbol{\beta}}\) for which the first derivatives are zero) is negative definite, then the critical point is a maximum. We will not go into further details, and we will assume that the critical points indeed maximize the log-likelihood function. Then, the first derivatives of \(l\) with respect to \(\beta_j\) are \[ f_j(\boldsymbol{\beta})=\dfrac{\partial l}{\partial \beta_j}=\sum\limits_{i=1}^{n} y_i x_{ij} - \sum\limits_{i=1}^n \dfrac{x_{ij}\exp\left(\beta_0 + \boldsymbol{\bar{\beta}}^T\mathbf{x}_i \right)}{1 + \exp\left(\beta_0 + \boldsymbol{\bar{\beta}}^T\mathbf{x}_i \right)} \] where we introduce \(x_{i0}=1\), and \(j=0,1,2,\dots,k\). Consequently, in order to find the critical points, we have to solve the following system of nonlinear equations: \[ f_j(\boldsymbol{\beta})=0, \quad j=0,1,2,\dots, k \] Therefore, the maximum likelihood estimators \(\hat{\boldsymbol{\beta}}^T=\left( \hat{\beta}_0, \hat{\beta}_1,\dots,\hat{\beta}_k \right)\) are obtained by using iterative methods such as Newton-Raphson.

2.6 Iterative methods for solving systems of non-linear equations and the issue of collinearity

2.6.1 Newton-Raphson

Since we will encounter collinearity among the predictors, we briefly introduce the Newton-Raphson method and the Jacobian matrix. We will need these concepts to explain the consequences of linear dependencies between the covariates.

Until convergence, at each iteration of the Newton-Raphson method the following system of linear equations has to be solved \[ \mathbf{J}\left( \boldsymbol{\beta}^{(r-1)} \right)\left( \boldsymbol{\beta}^{(r)} - \boldsymbol{\beta}^{(r-1)} \right)=-\mathbf{f}(\boldsymbol{\beta}^{(r-1)}), \] where \(\mathbf{J}\) is the \(\textbf{Jacobian matrix}\) and the right hand side is given by \[ \mathbf{f} =\left( \begin{array}{c} f_0(\boldsymbol{\beta}) \\ f_1(\boldsymbol{\beta}) \\ f_2(\boldsymbol{\beta}) \\ \dots \\ f_k(\boldsymbol{\beta}) \end{array} \right) \] Furthermore, \(\boldsymbol{\beta}^{(r)}\) is the solution at the current iteration and \(\boldsymbol{\beta}^{(r-1)}\) is the solution from the previous iteration. We start the iterations with an appropriately chosen initial guess. The Jacobian and the right hand side are calculated at \(\boldsymbol{\beta}=\boldsymbol{\beta}^{(r-1)}\). The elements of the Jacobian matrix (which is \((k+1)\times(k+1)\)) are given by: \[ J_{jl}=\dfrac{\partial f_j}{\partial\beta_l} = \sum\limits_{i=1}^n x_{ij}x_{il}\dfrac{(-1)\exp\left(\beta_0 + \boldsymbol{\bar{\beta}}^T\mathbf{x}_i \right)}{\left[ 1 + \exp\left(\beta_0 + \boldsymbol{\bar{\beta}}^T\mathbf{x}_i \right) \right]^2} \] for \(j,l=0,1,2,\dots,k\).

The \(\textbf{design matrix}\) is \(n\times(k+1)\) and has the form: \[ \mathbf{X}=\left( \begin{array}{ccccc} 1 & x_{11} & x_{12} & \dots & x_{1k} \\ 1 & x_{21} & x_{22} & \dots & x_{2k} \\ \dots & \dots & \dots & \dots & \dots \\ 1 & x_{n1} & x_{n2} & \dots & x_{nk} \end{array} \right) \] We will show that if \(X\) has linearly dependent column vectors (which is equivalent to some predictors being collinear), then the Jacobian matrix \(\mathbf{J}=\{ J_{jl} \}_{j,l=0}^{k}\) has linearly dependent rows.

Let us denote \[ w_i = \dfrac{\exp\left(\beta_0 + \boldsymbol{\bar{\beta}}^T\mathbf{x}_i \right)}{\left[ 1 + \exp\left(\beta_0 + \boldsymbol{\bar{\beta}}^T\mathbf{x}_i \right) \right]^2} = \pi_i(1-\pi_i)=Var(Y_i) \] which is a scalar. Then we have that \[ J_{jl} = -\sum\limits_{i=1}^n x_{ij}x_{il}w_i = -\sum\limits_{i=1}^n w_i x_{ij}x_{il} \] From here it follows that the Jacobian can be written as the following sum of matrices: \[ \mathbf{J} = -\sum\limits_{i=1}^n \mathbf{M}_i \] where \[ \mathbf{M}_i = w_i\mathbf{x}_i\mathbf{x}_i^T \] Let us denote \(\mathbf{D}_i=\mathbf{x}_i\mathbf{x}_i^T\), i.e. we have that \[ \mathbf{M}_i = w_iD_i. \] Without loss of generality, let us assume that \[ x_{ik}=\sum\limits_{m=0}^{k-1} a_m x_{im} = a_0x_{i0} + a_1x_{i1} + a_2x_{i2} + \dots + a_{k-1}x_{i,k-1} \] where \(a_m\in\mathbb{R}\). Then the matrix \(\mathbf{D}_i=\mathbf{x}_i\mathbf{x}_i^T\) has the following form \[ \mathbf{D}_i=\left( \begin{array}{ccccc} x_{i0}^2 & x_{i0}x_{i1} & x_{i0}x_{i2} & \dots & x_{i0}\sum\limits_{m=0}^{k-1} a_m x_{im} \\ x_{i1}x_{i0} & x_{i1}^2 & x_{i1}x_{i2} & \dots & x_{i1}\sum\limits_{m=0}^{k-1} a_m x_{im} \\ \dots & \dots & \dots & \dots & \dots \\ \left(\sum\limits_{m=0}^{k-1} a_m x_{im}\right)x_{i0} & \left( \sum\limits_{m=0}^{k-1} a_m x_{im} \right)x_{i1} & \left(\sum\limits_{m=0}^{k-1} a_m x_{im}\right)x_{i2} & \dots & \left(\sum\limits_{m=0}^{k-1} a_m x_{im} \right)\left(\sum\limits_{m=0}^{k-1} a_m x_{im} \right) \end{array} \right) \] For the last row of the matrix \(\mathbf{D}_i\) we have that \[ \left(\sum\limits_{m=0}^{k-1} a_m x_{im}\right)x_{i0} = a_0x_{i0}^2 + a_1x_{i1}x_{i0} + a_2x_{i2}x_{i0} + \dots + a_{k-1}x_{i,k-1}x_{i0} \] \[ \left( \sum\limits_{m=0}^{k-1} a_m x_{im} \right)x_{i1} = a_0x_{i0}x_{i1} + a_1x_{i1}^2 + a_2x_{i2}x_{i1} + \dots + a_{k-1}x_{i,k-1}x_{i1} \] and so on. Therefore, it is obvious that the last row of \(\mathbf{D}_i\) is a linear combination of all other rows. Hence, if we consecutively multiply the first row by \(-a_0\) and add it to the last one, then the second row by \(-a_1\) and add it to the last row, and so on, we will obtain that the last row of \(\mathbf{D}_i\) is composed only of zeros. This means that the rows of the matrix \(\mathbf{D}_i\), and, hence, the rows of the matrices \(\mathbf{M}_i\), are linearly dependent, which will lead to singularities when solving the linear system \[ \mathbf{J}\left( \boldsymbol{\beta}^{(r-1)} \right)\left( \boldsymbol{\beta}^{(r)} - \boldsymbol{\beta}^{(r-1)} \right)=-\mathbf{f}(\boldsymbol{\beta}^{(r-1)}) \] at each iteration of the Newton-Raphson method.

We build a logistic regression model with the help of the R function “glm()”, which handles generalized linear models. The iterative algorithm for calculating the maximum likelihood estimators \(\boldsymbol{\hat{\beta}}\) implemented in “glm()” is Iteratively Reweighted Least Squares (IRLS). However, the Newton-Raphson iterative scheme can be expressed as IRLS, with the Jacobian still playing a role. As we will see below, in the case of the census data, collinearities among the predictors indeed lead to linear dependencies in the Jacobian. If special measures are not taken, the resulting singular linear systems prevent the iterative method from converging. In our analysis R overcomes this issue: it detects the rank deficiency of the weighted design matrix (via a pivoted QR decomposition), drops the aliased column from the fit and reports its coefficient as NA, so that finite estimates are obtained for all remaining model parameters and the algorithm can still converge.
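A quick hedged illustration of this behaviour, using an artificial predictor that is an exact linear combination of another one (the variable names are hypothetical):

db.tmp      <- db.adult
db.tmp$age2 <- 2 * db.tmp$age   # perfectly collinear with age

coef(glm(income ~ age + age2, data = db.tmp, family = binomial))
# the aliased coefficient for age2 is reported as NA; the other estimates remain finite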

2.6.2 Iteratively Reweighted Least Squares (IRLS)

Now we will show how Newton-Raphson can be expressed as IRLS. Let us introduce the \(n\times n\) diagonal matrix \(\mathbf{W}\): \[ \mathbf{W}=\left( \begin{array}{ccccc} w_1 & 0 & 0 & \dots & 0 \\ 0 & w_2 & 0 & \dots & 0 \\ 0 & 0 & w_3 & \dots & 0 \\ \dots & \dots & \dots & \dots & \dots \\ 0 & 0 & 0 & \dots & w_n \end{array} \right), \] which is referred to as the weight matrix in IRLS, where we recall that \[ w_i = \dfrac{\exp\left(\beta_0 + \boldsymbol{\bar{\beta}}^T\mathbf{x}_i \right)}{\left[ 1 + \exp\left(\beta_0 + \boldsymbol{\bar{\beta}}^T\mathbf{x}_i \right) \right]^2} = \pi_i(1-\pi_i) \] Then, the Jacobian can be written as the following matrix product \[ \mathbf{J}= -\mathbf{X}^T\mathbf{W}\mathbf{X}. \] Since we have that \[ \pi_i = \dfrac{\exp\left(\beta_0 + \boldsymbol{\bar{\beta}}^T\mathbf{x}_i \right)}{1 + \exp\left(\beta_0 + \boldsymbol{\bar{\beta}}^T\mathbf{x}_i \right)}, \] let \[ \boldsymbol{\pi}=\left( \pi_1, \pi_2,\dots, \pi_n \right)^T \] be the vector of modelled probabilities for the sampled individuals to earn more than 50K a year. Then, we also have that \[ f_j = \sum\limits_{i=1}^n x_{ij}\left[ y_i - \dfrac{\exp\left(\beta_0 + \boldsymbol{\bar{\beta}}^T\mathbf{x}_i \right)}{1 + \exp\left(\beta_0 + \boldsymbol{\bar{\beta}}^T\mathbf{x}_i \right)} \right] = \sum\limits_{i=1}^n x_{ij}\left[ y_i - \pi_i \right] \] which is equivalent to \[ \mathbf{f} = \mathbf{X}^T\left( \mathbf{y} - \boldsymbol{\pi} \right). \] Now, we can rewrite the Newton-Raphson iteration \[ \mathbf{J}\left( \boldsymbol{\beta}^{(r-1)} \right)\left( \boldsymbol{\beta}^{(r)} - \boldsymbol{\beta}^{(r-1)} \right)=-\mathbf{f}(\boldsymbol{\beta}^{(r-1)}) \] in the following matrix notation \[ -\mathbf{X}^T\mathbf{W}^{(r-1)}\mathbf{X}\left( \boldsymbol{\beta}^{(r)} - \boldsymbol{\beta}^{(r-1)} \right) = -\mathbf{X}^T\left( \mathbf{y} - \boldsymbol{\pi}^{(r-1)} \right), \] where the superscript \((r-1)\) indicates that the weights \(w_i\) and the probabilities \(\boldsymbol{\pi}\) are calculated at \(\boldsymbol{\beta}^{(r-1)}\). The latter matrix equality is equivalent to \[ \boldsymbol{\beta}^{(r)} = \boldsymbol{\beta}^{(r-1)} + \left( \mathbf{X}^T\mathbf{W}^{(r-1)}\mathbf{X} \right)^{-1}\mathbf{X}^T\left( \mathbf{y} - \boldsymbol{\pi}^{(r-1)} \right) \Longleftrightarrow \] \[ \boldsymbol{\beta}^{(r)} = \left( \mathbf{X}^T\mathbf{W}^{(r-1)}\mathbf{X} \right)^{-1}\mathbf{X}^T\mathbf{W}^{(r-1)}\left[ \mathbf{X}\boldsymbol{\beta}^{(r-1)} + \left(\mathbf{W}^{(r-1)}\right)^{-1}\left( \mathbf{y} - \boldsymbol{\pi}^{(r-1)} \right) \right] \Longleftrightarrow \] \[ \boldsymbol{\beta}^{(r)} = \left( \mathbf{X}^T\mathbf{W}^{(r-1)}\mathbf{X} \right)^{-1}\mathbf{X}^T\mathbf{W}^{(r-1)}\mathbf{z}^{(r-1)}, \] where we denote \[ \mathbf{z}^{(r-1)} = \mathbf{X}\boldsymbol{\beta}^{(r-1)} + \left(\mathbf{W}^{(r-1)} \right)^{-1}\left( \mathbf{y} - \boldsymbol{\pi}^{(r-1)} \right) \] It can be shown that \(\boldsymbol{\beta}^{(r)} = \left( \mathbf{X}^T\mathbf{W}^{(r-1)}\mathbf{X} \right)^{-1}\mathbf{X}^T\mathbf{W}^{(r-1)}\mathbf{z}^{(r-1)}\) minimizes the \(\textbf{weighted least squares}\) problem: \[ \min\limits_{\boldsymbol{\beta}}\left( \mathbf{z}^{(r-1)} - \mathbf{X}\boldsymbol{\beta} \right)^T\mathbf{W}^{(r-1)}\left( \mathbf{z}^{(r-1)} - \mathbf{X}\boldsymbol{\beta} \right).
\] But since \(\mathbf{J}=-\mathbf{X}^T\mathbf{W}\mathbf{X}\), it follows that \[ \boldsymbol{\beta}^{(r)} = \left( -\mathbf{J}^{(r-1)} \right)^{-1}\mathbf{X}^T\mathbf{W}^{(r-1)}\mathbf{z}^{(r-1)}. \] This means that if there are linearly dependent rows in the \((k+1)\times(k+1)\) Jacobian matrix \(\mathbf{J}\), its rank will be less than \(k+1\) and the matrix will not be invertible. Therefore, the issue with collinear predictors remains also when applying the iteratively reweighted least squares algorithm for finding the maximum likelihood estimators of the model parameters.
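For concreteness, here is a bare-bones IRLS loop written directly from the formulas above. It is only an illustrative sketch (no handling of aliased columns or of non-convergence) and is not how “glm()” is implemented internally:

# X: design matrix with a leading column of ones; y: 0/1 response vector
irls.logistic <- function(X, y, tol = 1e-8, max.iter = 25) {
  beta <- rep(0, ncol(X))                     # initial guess
  for (r in seq_len(max.iter)) {
    eta <- as.vector(X %*% beta)              # linear predictor
    pi  <- 1 / (1 + exp(-eta))                # current probabilities
    w   <- pi * (1 - pi)                      # IRLS weights
    z   <- eta + (y - pi) / w                 # working response z^(r-1)
    beta.new <- solve(crossprod(X, w * X), crossprod(X, w * z))   # weighted least squares step
    if (max(abs(beta.new - beta)) < tol) { beta <- beta.new; break }
    beta <- beta.new
  }
  as.vector(beta)
}

# Hypothetical usage with the census data:
# X <- model.matrix(~ age + hours_per_week, data = db.adult)
# irls.logistic(X, as.numeric(db.adult$income == " >50K"))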

Since \(\hat{\boldsymbol{\beta}}\) is the solution to the considered minimization problem (the iterative process given above), we denote with \(\hat{\boldsymbol{\pi}}=\left( \hat{\pi}_1, \hat{\pi}_2, \dots, \hat{\pi}_n \right)\) the fitted values (probabilities), where \[ \hat{\pi}_i = \dfrac{\exp\left( \hat{\boldsymbol{\beta}}^T\mathbf{x}_i \right)}{1 + \exp\left(\boldsymbol{\hat{\beta}}^T\mathbf{x}_i \right)}, \] and we adopt the new notation: \[ \mathbf{x}_i = \left( x_{i0}, x_{i1}, x_{i2}, \dots, x_{ik} \right)^T = \left( 1, x_{i1}, x_{i2}, \dots, x_{ik} \right)^T \in \mathbb{R}^{k+1}. \]

3 Assessing the goodness of fit of a model

3.1 Deviance statistic

In logistic regression a commonly used measure of the difference between observed and fitted values is the so-called deviance (also known as residual deviance). The deviance is defined as \[ D= -2\ln\left( \dfrac{\text{likelihood of the fitted model}}{\text{likelihood of the saturated model}} \right). \] A saturated model is one that contains as many parameters as there are data points. This means that the saturated logistic model fits the observed data perfectly: \[ \hat{\mu}^S_i = y_i , \quad \forall i=1,2,\dots, n, \] where \(\hat{\mu}^S_i\) are the fitted values for the saturated model (\(\hat{\mu}^S_i\) is the estimated expected value of \(Y_i\)). The deviance is used as a test statistic for hypotheses regarding the goodness of fit of the considered model. This is due to the fact that under some assumptions it approximately follows the chi-squared distribution. Hence, with the help of the deviance we test the null hypothesis that the fitted model is a good fit to the observed data and explains about the same amount of variation in the response variable as the saturated model. For this purpose, we calculate the p-value associated with the hypothesis test at some significance level \(\alpha\) (\(\alpha=0.01\) or \(0.05\), etc.) We recall that the p-value is the probability, under the assumption that the null hypothesis is true, of obtaining a result equal to or more extreme than the one observed. The smaller the p-value, the stronger the evidence against the null hypothesis, i.e. the more reason to believe that the fitted model does not adequately explain the observations.
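In R, the residual deviance (and the deviance of the intercept-only model, useful as a baseline) can be read directly off a fitted “glm” object. A short sketch, assuming the hypothetical logit.fit from above:

deviance(logit.fit)        # residual deviance of the fitted model
logit.fit$null.deviance    # deviance of the model containing only an intercept
summary(logit.fit)         # reports both, together with the corresponding degrees of freedom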

3.2 Limitations of the deviance as a test statistic for ungrouped data

The deviance is also the standard goodness of fit measure that R reports in its summary when fitting generalized linear models with the function “glm()”. However, we will show that the deviance is not always a meaningful test statistic and cannot be applied, for example, to problems involving ungrouped binary data (also called “sparse” data). Therefore, one cannot blindly report R's default test statistics; in such situations careful investigation of the problem at hand and additional analysis are required. Nevertheless, the deviance can be used for comparing two nested models, both for grouped and for ungrouped binary data.

Our problem is also one with sparse data, meaning that for every pattern of covariate values we have only one observation or a small number of observations. In this case of ungrouped data, which is common in real-world data sets, the deviance and the Pearson statistic no longer have an approximate chi-square distribution under the null hypothesis and cannot be used as valid measures of model fit.

First we will define the deviance for a binomial distribution (grouped data) and then we will consider the special case when we have a Bernoulli distribution (ungrouped, or sparse data). Let \(Z_i\), \(i=1,2,\dots, n\), be a binary response variable, whose value depends on the covariates \(X_j\), \(j=1,2,\dots, k\). Consequently, \(Z_i\) assumes only two values - 1 and 0, to which we refer as “success” and “failure”, respectively. Then, it is clear that the random variables \(Z_i\) follow the Bernoulli distribution. Let us classify the observed response values \(z_i\) into \(m\) groups in such a way that all observations belonging to a specific group have identical values of all covariates. With \(n_i\) we denote the number of observations in group \(i\), \(i=1,2,\dots, m\). Hence, we have that \(\sum\limits_{i=1}^{m} n_i = n\), where \(n\) is the total number of observations. We denote with \(y_i\), \(i=1,2,\dots,m\), the number of “successes” in group \(i\) (\(y_i=0,1,2,\dots, n_i\)). Hence, the number of “failures” in group \(i\) is \(n_i - y_i\). We assume that the observations in each group are independent. Then, if we view \(y_i\) as a realization of the random variable \(Y_i\) (\(i=1,2,\dots, m\)), then \(Y_i\sim B(n_i, \pi_i)\), i.e. \(Y_i\) has a binomial distribution with \(\pi_i\) being the probability of “success” at each Bernoulli trial (we have a total of \(n_i\) trials). Therefore, the probability mass function of \(Y_i\) is \[ f(y_i) = P(Y_i=y_i)=\left( \begin{array}{c} n_i \\ y_i \end{array} \right)\pi_i^{y_i}(1-\pi_i)^{n_i-y_i}, \] for \(y_i=0,1,2,\dots, n_i\). The mean and variance of \(Y_i\) are: \[ E(Y_i) = \mu_i = n_i\pi_i \] \[ Var(Y_i) = \sigma_i^2 = n_i\pi_i(1-\pi_i) \] As before, we use logistic regression to model the expected values of \(Y_i\): \[ \mu_i = n_i\pi_i, \quad \pi_i = \dfrac{\exp\left(\boldsymbol{\beta}^T\mathbf{x}_i \right)}{1 + \exp\left(\boldsymbol{\beta}^T\mathbf{x}_i \right)} \] This grouped data formulation is the most general one. It includes ungrouped, or individual, data when we have \(n\) groups of size \(n_i=1\) for all \(i\), and hence, \(m=n\), which is exactly the case that we study. Then, the log-likelihood function for \(Y_i\sim B(n_i, \pi_i)\), \(i=1,2,\dots, m\), is \[ l = \sum\limits_{i=1}^{m} \ln\left[\left( \begin{array}{c} n_i \\ y_i \end{array} \right)\right] + \sum\limits_{i=1}^{m} \left[ y_i\ln(\pi_i) + (n_i-y_i)\ln(1-\pi_i) \right] \]

We denote with \(l_M\) the maximized log-likelihood of the current fitted model, i.e. calculated at \(\hat{\boldsymbol{\beta}}^M\), where \(\hat{\boldsymbol{\beta}}^M\) is the maximum likelihood estimate of the model parameters \(\boldsymbol{\beta}^M\). By analogy, with \(l_S\) we denote the maximized log-likelihood of the saturated model, calculated at \(\hat{\boldsymbol{\beta}}^S\). Then, the deviance is \[ \text{Deviance} = -2(l_M - l_S) = -2\left\{ \sum\limits_{i=1}^{m} \ln\left[\left( \begin{array}{c} n_i \\ y_i \end{array} \right) \right] + \sum\limits_{i=1}^{m} \left[ y_i\ln(\hat{\pi}_i^M) + (n_i-y_i)\ln(1 - \hat{\pi}_i^M) \right] - \right. \] \[ \left. -\sum\limits_{i=1}^{m} \ln\left[\left( \begin{array}{c} n_i \\ y_i \end{array} \right) \right] - \sum\limits_{i=1}^{m} \left[ y_i\ln(\hat{\pi}_i^S) + (n_i-y_i)\ln(1 - \hat{\pi}_i^S) \right] \right\} = \] \[ = -2\sum\limits_{i=1}^{m} \left\{ y_i\ln{\left( \dfrac{\hat{\pi}_i^M}{\hat{\pi}_i^S} \right)} + (n_i - y_i)\ln{\left( \dfrac{1 - \hat{\pi}_i^M}{1 - \hat{\pi}_i^S} \right)} \right\} = \] \[ = -2\sum\limits_{i=1}^{m} \left\{ y_i\ln{\left( \dfrac{ \dfrac{\hat{\mu}_i^M}{n_i}}{\dfrac{\hat{\mu}_i^S}{n_i}} \right)} + (n_i - y_i)\ln{\left( \dfrac{1 - \dfrac{\hat{\mu}_i^M}{n_i}}{1 - \dfrac{\hat{\mu}_i^S}{n_i}} \right)} \right\} = \] \[ = -2\sum\limits_{i=1}^{m} \left\{ y_i\ln{\left( \dfrac{\hat{\mu}_i^M}{\hat{\mu}_i^S} \right)} + (n_i - y_i)\ln{\left( \dfrac{n_i - \hat{\mu}_i^M}{n_i - \hat{\mu}_i^S} \right)} \right\} = \] \[ = -2\sum\limits_{i=1}^{m} \left\{ y_i\ln{\left( \dfrac{\hat{\mu}_i^M}{y_i} \right)} + (n_i - y_i)\ln{\left( \dfrac{n_i - \hat{\mu}_i^M}{n_i - y_i} \right)} \right\} \] because \(\hat{\mu}_i^S=y_i\) for the saturated model. Finally, we obtain that the deviance is given by \[ D = -2\sum\limits_{i=1}^{m} \left\{ y_i\ln{\left( \dfrac{\hat{\mu}_i^M}{y_i} \right)} + (n_i - y_i)\ln{\left( \dfrac{n_i - \hat{\mu}_i^M}{n_i - y_i} \right)} \right\}. \]

Since in our problem we have ungrouped binary response data, it follows that \(n_i=1\) for all \(i\), \(m=n\), \(\mu_i := E(Y_i)= n_i\pi_i=\pi_i\) (the population mean), and hence \(\hat{\mu}_i=\hat{\pi}_i\) (the estimated mean), and \(y_i=0\) or \(1\). Therefore, we obtain that the deviance for ungrouped data with a binary response variable is equal to: \[ D = -2\sum\limits_{i=1}^{n} \left\{ y_i\ln{\left( \dfrac{\hat{\pi}_i^M}{y_i} \right)} + (1 - y_i)\ln{\left( \dfrac{1 - \hat{\pi}_i^M}{1 - y_i} \right)} \right\} = \] \[ = -2\sum\limits_{i=1}^{n} \left\{ y_i\ln{\left( \hat{\pi}_i^M \right)} - y_i\ln{y_i} + (1-y_i)\ln{\left( 1 - \hat{\pi}_i^M \right)} - (1-y_i)\ln\left(1 - y_i \right) \right\} = \] \[ = -2\sum\limits_{i=1}^{n} \left\{ y_i\ln{\left( \hat{\pi}_i^M \right)} + (1-y_i)\ln{\left( 1 - \hat{\pi}_i^M \right)} \right\}. \] In the latter equality we take \(y_i\ln{y_i}\) to be zero for \(y_i=0\), and \((1-y_i)\ln{\left( 1 - y_i \right)}=0\) for \(y_i=1\). This also means that the log-likelihood of the saturated model for ungrouped data and a binary response variable is equal to 0: \[ l_S(\hat{\boldsymbol{\beta}}^S) = \sum\limits_{i=1}^{n} \left\{ y_i\ln{\hat{\pi}_i^S} + (1-y_i)\ln\left(1- \hat{\pi}_i^S \right) \right\} = \sum\limits_{i=1}^{n} \left\{ y_i\ln{y_i} + (1-y_i)\ln\left(1- y_i \right) \right\} = 0 \] Hence, for the ungrouped data case, we obtained that the deviance is given by: \[ D = -2\ln( \text{likelihood of the fitted model} ) = -2l_M = \] \[ = -2\sum\limits_{i=1}^{n} \left\{ y_i\ln{\left( \hat{\pi}_i^M \right)} + (1-y_i)\ln{\left( 1 - \hat{\pi}_i^M \right)} \right\}. \] From now on we will drop the superscript “M”, and we will denote the fitted values for the current model simply with \(\hat{\pi}_i\), \(i=1,2,\dots, n\). Therefore, the deviance for ungrouped data and a binary response variable becomes: \[ D = -2\sum\limits_{i=1}^{n} \left\{ y_i\ln{\left( \hat{\pi}_i \right)} + (1-y_i)\ln{\left( 1 - \hat{\pi}_i \right)} \right\}. \]

The deviance can also be rewritten as: \[ D = -2\sum\limits_{i=1}^{n} \left\{ y_i\ln\left( \hat{\pi}_i \right) + \ln\left( 1 - \hat{\pi}_i \right) - y_i\ln\left( 1 - \hat{\pi}_i \right) \right\} = \] \[ = -2\sum\limits_{i=1}^{n} \left\{ y_i\ln\left( \dfrac{\hat{\pi}_i}{1 - \hat{\pi}_i} \right) + \ln\left( 1 - \hat{\pi}_i \right) \right\} \] Now we will show that the term \(\sum\limits_{i=1}^{n} y_i\ln\left( \dfrac{\hat{\pi}_i}{1 - \hat{\pi}_i} \right)\) can be replaced by a term which does not depend on \(y_i\). We already showed that the first derivative of the log-likelihood function is: \[ \dfrac{\partial l_M}{\partial \beta_j} = \sum\limits_{i=1}^{n} y_i x_{ij} - \sum\limits_{i=1}^{n} x_{ij}\pi_i = \sum\limits_{i=1}^{n} x_{ij}(y_i - \pi_i) \] for \(j=0,1,2,\dots, k\). Since the first derivatives of \(l_M\) are equal to zero at \(\boldsymbol{\beta}=\hat{\boldsymbol{\beta}}\), i.e. \[ \dfrac{\partial l_M}{\partial \beta_j}\left( \hat{\boldsymbol{\beta}} \right) = 0, \quad \forall j=0,1,2,\dots, k \] it follows that: \[ \sum\limits_{j=0}^{k} \hat{\beta}_j\dfrac{\partial l_M}{\partial \beta_j}\left( \hat{\boldsymbol{\beta}} \right) = 0. \] The latter equality is equivalent to \[ \sum\limits_{j=0}^{k} \hat{\beta}_j \left( \sum\limits_{i=1}^{n} x_{ij}(y_i - \hat{\pi}_i) \right) = 0 \quad \Longleftrightarrow \quad \sum\limits_{j=0}^{k} \sum\limits_{i=1}^{n} \hat{\beta}_j x_{ij}(y_i - \hat{\pi}_i) = 0 \Longleftrightarrow \] \[ \sum\limits_{i=1}^{n} \left( \sum\limits_{j=0}^{k} \hat{\beta}_j x_{ij}(y_i - \hat{\pi}_i) \right) = 0 \quad \Longleftrightarrow \quad \sum\limits_{i=1}^{n} (y_i - \hat{\pi}_i) \left( \sum\limits_{j=0}^{k} \hat{\beta}_j x_{ij} \right) = 0 \Longleftrightarrow \] \[ \sum\limits_{i=1}^{n} (y_i - \hat{\pi}_i) \ln\left( \dfrac{\hat{\pi}_i}{1 - \hat{\pi}_i} \right)= 0 \Longleftrightarrow \] \[ \sum\limits_{i=1}^{n} y_i\ln\left( \dfrac{\hat{\pi}_i}{1 - \hat{\pi}_i} \right) = \sum\limits_{i=1}^{n} \hat{\pi}_i\ln\left( \dfrac{\hat{\pi}_i}{1 - \hat{\pi}_i} \right) \] Consequently, the deviance is equal to: \[ D = -2\sum\limits_{i=1}^{n} \left\{ y_i\ln\left( \dfrac{\hat{\pi}_i}{1 - \hat{\pi}_i} \right) + \ln\left( 1 - \hat{\pi}_i \right) \right\} = -2\sum\limits_{i=1}^{n} y_i\ln\left( \dfrac{\hat{\pi}_i}{1 - \hat{\pi}_i} \right) -2\sum\limits_{i=1}^{n} \ln\left( 1 - \hat{\pi}_i \right) = \] \[ = -2\sum\limits_{i=1}^{n} \hat{\pi}_i\ln\left( \dfrac{\hat{\pi}_i}{1 - \hat{\pi}_i} \right) - 2\sum\limits_{i=1}^{n} \ln\left( 1 - \hat{\pi}_i \right) = \] \[ = -2\sum\limits_{i=1}^{n}\left\{ \hat{\pi}_i\ln\left( \dfrac{\hat{\pi}_i}{1 - \hat{\pi}_i} \right) + \ln\left( 1 - \hat{\pi}_i \right) \right\} \] Therefore, for ungrouped binary data, it turns out that the deviance is not useful for assessing the goodness of fit of the logistic model because it depends only on the fitted values \(\hat{\pi}_i\) and a comparison between fitted and observed values cannot be made.
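The two expressions for the deviance of an ungrouped fit, one written with the observed \(y_i\) and one written only with the fitted \(\hat{\pi}_i\), can be checked numerically. A hedged sketch, again assuming the hypothetical logit.fit:

pi.hat <- fitted(logit.fit)                       # fitted probabilities
y      <- as.numeric(db.adult$income == " >50K")  # observed 0/1 response

D.obs <- -2 * sum(y * log(pi.hat) + (1 - y) * log(1 - pi.hat))
D.fit <- -2 * sum(pi.hat * log(pi.hat / (1 - pi.hat)) + log(1 - pi.hat))

c(D.obs, D.fit, deviance(logit.fit))   # all three values should coincide (up to rounding)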

3.3 Likelihood ratio test statistic

We showed that the deviance for grouped binary data is \[ D = -2\sum\limits_{i=1}^{m} \left\{ y_i\ln{\left( \dfrac{\hat{\mu}_i}{y_i} \right)} + (n_i - y_i)\ln{\left( \dfrac{n_i - \hat{\mu}_i}{n_i - y_i} \right)} \right\}, \] and that for ungrouped data with a binary response \(n_i=1\) and hence the estimated mean is \(\hat{\mu}_i = n_i\hat{\pi}_i=\hat{\pi}_i\).

The deviance is used not only as a measure of goodness of fit but also for comparing two nested models. Let us consider two nested models \(M_0\) and \(M_1\). We assume that \(M_0\) is included in \(M_1\) in the sense that \(M_0\) is obtained from \(M_1\) by setting some of \(M_1\)’s model parameters to zero. Then, the log-likelihood ratio statistic (LR) is defined as: \[ \textbf{LR} = -2\ln\left( \dfrac{L_0(\hat{\boldsymbol{\beta}}_0)}{L_1(\hat{\boldsymbol{\beta}}_1)} \right) = -2\left[ \ln(L_0) - \ln(L_1) \right] = -2[l_0 - l_1] = \] \[ = -2(l_0 - l_S) - (-2(l_1 - l_S)) = D_0 - D_1 \] Consequently, we obtained that \[ \textbf{LR} = D_0 - D_1, \] where \(D_0\) is the deviance of model \(M_0\) and \(D_1\) is the deviance of model \(M_1\). The log-likelihood ratio statistic is used for testing the following null hypothesis

\(\mathbf{H_0}\): the reduced model \(\, M_0 \,\) fits the observed data as well as the larger model \(\, M_1\) (the additional parameters of \(M_1\) are zero)

against the alternative hypothesis

\(\mathbf{H_A}\): the model \(\, M_1 \,\) with more parameters provides a better fit to the observed data.

According to Wilks’ theorem, if the null hypothesis is true, then the log-likelihood ratio test statistic – \(\textbf{LR} = -2\ln\left( \dfrac{L_0}{L_1} \right)\), is approximately chi-squared distributed as the number of groups \(m\rightarrow \infty\) or the size of each group \(n_i\rightarrow \infty\) for all \(i\), provided the number of model parameters stays fixed. The degrees of freedom of the chi-squared distribution equal the number of model parameters of \(M_1\) minus the number of model parameters of \(M_0\). This means that if \(H_0\) is true and if \(\hat{\boldsymbol{\beta}}_0\in\mathbb{R}^{p_0}\), and \(\hat{\boldsymbol{\beta}}_1\in\mathbb{R}^{p_1}\), where \(p_0<p_1\), then \(\textbf{LR}\sim \chi^2_{p_1-p_0}\) as \(n_i\rightarrow\infty\) or \(m\rightarrow \infty\). We have that Wilks’ theorem is valid also for \(m\rightarrow \infty\) which is equivalent to having ungrouped or individual data, i.e. data with very large number of covariate patterns. Hence, the likelihood ratio test can be used as a test statistic both for grouped and ungrouped data with binary response.

In our problem we have that \(n_i=1\) and hence \(m=n\), i.e. the number of groups \(m\) is equal to the number of observations. Since we have a large number of observations, we can assume that Wilks’ theorem holds and consequently the log-likelihood ratio test statistic is approximately chi-squared distributed under the null hypothesis. This means that we can use the likelihood ratio test (LRT) to assess the goodness of fit of nested models.
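In R, the likelihood ratio test for two nested logistic models can be carried out with “anova()”. A hedged sketch; the model formulas below are placeholders:

m0 <- glm(income ~ age,                  data = db.adult, family = binomial)
m1 <- glm(income ~ age + hours_per_week, data = db.adult, family = binomial)

anova(m0, m1, test = "Chisq")   # LR = D_0 - D_1, compared with a chi-squared distribution
                                # whose df equal the difference in the numbers of parameters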

3.4 More facts about the deviance

From the definition of the log-likelihood ratio statistic, it follows that the deviance is the log-likelihood ratio statistic comparing the fitted model \(M_0\) with the saturated model \(M_S\). Therefore, for grouped data, if \(n_i \rightarrow\infty\) for all \(i\) (\(n_i\) is the number of observations in each group), the deviance is approximately \(\chi^2_{m-p}\), where \(m\) is the number of groups and \(p\) is the number of parameters of \(M_0\). The deviance for a logistic model plays the same role as the residual sum of squares in ordinary least squares regression for linear models. A small deviance (hence a large p-value) indicates a good fit of the model \(M_0\). A large deviance (hence a small p-value) means that the fitted model \(M_0\) does not explain the observed data adequately compared with the saturated model.

The deviance can also be written as: \[ D = \sum\limits_{i=1}^{m} d_i^2, \] where \(d_i\), \(i=1,2,\dots, m\), is the deviance residual and is defined as follows: \[ d_i = d\left( y_i, \hat{\mu}_i \right) = \text{sign}(y_i - \hat{\mu}_i)\left\{ 2\left[ y_i\ln{\left( \dfrac{y_i}{\hat{\mu}_i} \right)} + (n_i - y_i)\ln{\left( \dfrac{n_i - y_i}{n_i - \hat{\mu}_i} \right)} \right] \right\}^{1/2} \]

An alternative residual for assessing the discrepancy between fitted and observed values is the Pearson residual, also known as the chi-square residual. It is defined as follows: \[ r_i = r \left( y_i, \hat{\mu}_i \right) = \dfrac{y_i - \hat{\mu}_i}{\sqrt{\widehat{Var}(Y_i)}} = \dfrac{y_i - n_i\hat{\pi}_i}{\sqrt{n_i\hat{\pi}_i(1 - \hat{\pi}_i)}} \] where \(\widehat{Var}(Y_i)\) is the estimated variance of the random variable \(Y_i\). Hence, the Pearson residual is obtained by dividing the raw residual \(\left( y_i - \hat{\mu}_i \right)\) by the estimated binomial standard deviation of \(Y_i\). The Pearson chi-square statistic is a summary test statistic and is defined as the sum of the squared Pearson residuals: \[ X^2 = \sum\limits_{i=1}^{m} r_i^2 \] For large sample sizes, where \(n_i\rightarrow \infty\), the Pearson statistic has the same asymptotic distribution as the deviance, i.e. \(X^2\sim\chi^2_{m-p}\) with \(m\) being the number of groups and \(p\) the number of model parameters. Since in our problem the data is ungrouped, the Pearson test statistic cannot be used as a measure of the goodness of fit of the built logistic model.
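Both kinds of residuals are available from a fitted “glm” object in R. A short sketch, assuming the hypothetical logit.fit:

d <- residuals(logit.fit, type = "deviance")   # deviance residuals d_i
r <- residuals(logit.fit, type = "pearson")    # Pearson residuals r_i

sum(d^2)   # equals deviance(logit.fit)
sum(r^2)   # Pearson chi-square statistic X^2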

3.5 The Hosmer-Lemeshow test

The Hosmer-Lemeshow test is a goodness of fit test for logistic regression models with ungrouped (individual) binary data. As in our problem, grouping the data according to covariate patterns is not always easy or even possible. The idea of the Hosmer-Lemeshow test is to divide the data into subgroups based on the predicted probabilities \(\hat{\pi}_i\) instead of on the values of the explanatory variables. The test consists of the following steps:

  1. First we sort the data set of size \(n\) (we have \(n\) observations) in descending order according to the fitted probabilities \(\hat{\pi}_i\)

  2. Then we divide the data into \(g\) equal-sized groups: the first group contains the first \(\dfrac{n}{g}\) observations with the highest estimated probabilities; the second group contains the \(\dfrac{n}{g}\) observations having the next highest estimated probabilities; etc.

  3. Finally, we construct a Pearson-like test statistic for the observed and expected cell frequencies from a \(2\times g\) table

The Hosmer-Lemeshow statistic is given by:

\[ X^2_{HL} = \sum\limits_{i=1}^{g} \dfrac{\left( O_i - n_i\bar{\pi}_i \right)^2}{n_i\bar{\pi}_i(1 - \bar{\pi}_i)}, \] where \(O_i\) is the total frequency of “success” outcomes in the \(i\)-th group, \(i=1,2,\dots, g\), \(n_i\) is the number of observations in each group (for equal-sized groups we have that \(n_i=\dfrac{n}{g}\) for all \(i=1,2,\dots, g\)), and \(\bar{\pi}_i\) is the average estimated (predicted by the logistic model) probability of a “success” outcome for the \(i\)-th group. This means that \[ O_i = \sum\limits_{j=1}^{n_i} y_{ij}, \] where \(y_{ij}\) is the \(j\)-th observation in the \(i\)-th group (hence \(j=1,2,\dots, n_i\)), and \[ \bar{\pi}_i = \dfrac{1}{n_i}\sum\limits_{j=1}^{n_i} \hat{\pi}_{ij}, \] where \(\hat{\pi}_{ij}\) is the fitted (predicted) probability of “success” for the \(j\)-th (\(j=1,2,\dots, n_i\)) observation in the \(i\)-th group (\(i=1,2,\dots, g\)). Therefore, the Hosmer-Lemeshow statistic is equal to \[ X^2_{HL} = \sum\limits_{i=1}^{g} \dfrac{\left( \sum\limits_{j=1}^{n_i} y_{ij} - \sum\limits_{j=1}^{n_i}\hat{\pi}_{ij} \right)^2 }{\left( \sum\limits_{j=1}^{n_i} \hat{\pi}_{ij} \right)\left( 1 - \sum\limits_{j=1}^{n_i} \dfrac{\hat{\pi}_{ij}}{n_i} \right) } \] Hosmer and Lemeshow showed with simulations (reference) that for large samples the test statistic \(X^2_{HL}\) follows approximately the chi-squared distribution, i.e. \(X^2_{HL}\sim\chi^2_{g-2}\). The Hosmer-Lemeshow goodness of fit test tests the following hypotheses:

\(H_0\): the model is a good fit,

vs.

\(H_1\): the model is not a good fit

Hence, with this hypothesis test we want to fail to reject the null hypothesis, which would mean that the built model is a good fit. Therefore, large values of \(X^2_{HL}\) (and small p-values) indicate a lack of fit of the model, whereas small values of \(X^2_{HL}\) (and large p-values) indicate that the model is a good fit. However, this test has some disadvantages which have to be taken into account when we use it to assess the goodness of fit of the constructed logistic model. For example, Hosmer et al. (1997) showed that the value of \(X^2_{HL}\) depends on the choice of the number of groups and on the calculating algorithm. They fitted the same data set in different statistical software packages and obtained identical values for the estimated model parameters but different p-values of the Hosmer-Lemeshow test, ranging from 0.02 to 0.16. There are no rules on how to choose the number of groups, but in the literature it is often recommended to take \(g=10\). Another deficiency of the Hosmer-Lemeshow test is that, if a significant p-value is obtained, indicating a poor fit, the test does not provide any information about why the model fits poorly. Therefore the test is often criticized as being unable to detect true lack of fit, and alternative tests have been proposed, such as those of Osius and Rojek (1992) and McCullagh (1985) (references).
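In R the test is available in several packages; one commonly used implementation is “hoslem.test()” from the ResourceSelection package. A hedged sketch, assuming the hypothetical logit.fit and keeping in mind that the result depends on the number of groups g:

library(ResourceSelection)

hoslem.test(as.numeric(db.adult$income == " >50K"),  # observed 0/1 response
            fitted(logit.fit),                       # fitted probabilities
            g = 10)                                  # number of groups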

4 Significance of the estimated model parameters

The Wald Test Statistic

If we run a summary on the fitted logistic regression model, we can see, based on the corresponding p-values, which covariates and which levels of the categorical covariates are significant in the log-odds model. These p-values are obtained using the Wald test statistic (reference) to test the following hypotheses:

\(H_0\): \(\beta_j=0\)

vs.

\(H_1\): \(\beta_j\neq 0\)

for each \(j=0,1,2,\dots, k\).

In other words, the Wald test is used to evaluate the statistical significance of each coefficient in the fitted logistic model, i.e. the test checks the hypothesis that each individual coefficient is zero. If the coefficient of some category is not statistically significant, this does not imply that the whole categorical predictor is unimportant and should be removed from the model. The overall effect of the factor variable is tested by performing a likelihood ratio test, as we showed earlier.

The Wald test statistic has the following form:

\[ W = \dfrac{\hat{\beta}_j - b_0}{\widehat{se}\left( \hat{\beta}_j \right)} = \dfrac{\hat{\beta}_j - b_0}{\sqrt{\widehat{Var}\left( \hat{\beta}_j \right)}} \] or, equivalently \[ W^2 = \dfrac{\left(\hat{\beta}_j - b_0\right)^2}{\widehat{Var}\left( \hat{\beta}_j \right)}, \] where \(b_0=0\) for the above hypotheses. In large samples, under the null hypothesis, this test statistic has approximately a standard normal distribution, or equivalently, the square of this statistic has approximately a chi-squared distribution with one degree of freedom: \[ W = \dfrac{\hat{\beta}_j}{\widehat{se}\left( \hat{\beta}_j \right)} \sim N(0,1) \] and \[ W^2 = \dfrac{\hat{\beta}_j^2}{\widehat{Var}\left( \hat{\beta}_j \right)} \sim\chi^2_1. \] The standard error of \(\hat{\beta}_j\) is calculated as follows: \[ \widehat{se}\left( \hat{\beta}_j \right) = \sqrt{diag(\mathbf{H})_j}, \] where \[ \mathbf{H}= Var\left( \hat{\boldsymbol{\beta}} \right) =\left( \mathbf{X}^{T}\mathbf{WX} \right)^{-1} \] with \(\mathbf{X}\) being the design matrix and \(\mathbf{W}\) the weight matrix evaluated at \(\hat{\boldsymbol{\pi}}\) (we defined both matrices in the Logistic Regression section).
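The z values and p-values printed by “summary()” are exactly these Wald statistics. A hedged sketch that recomputes them by hand for the hypothetical logit.fit:

wald <- coef(summary(logit.fit))                    # Estimate, Std. Error, z value, Pr(>|z|)
z    <- wald[, "Estimate"] / wald[, "Std. Error"]   # W = beta_hat_j / se(beta_hat_j)
p    <- 2 * pnorm(-abs(z))                          # two-sided normal p-values

cbind(z, p)   # should reproduce the "z value" and "Pr(>|z|)" columns of summary(logit.fit)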

The Wald test is also used for the calculation of a confidence interval for each population parameter \(\beta_j\), based on the standard errors reported by the “summary()” function. A \((1-\alpha)100\)% confidence interval for the true population parameter \(\beta_j\) is given by: \[ \hat{\beta}_j \pm z_{1-\alpha/2}\sqrt{\widehat{Var}\left( \hat{\beta}_j \right)}, \] where \(z_{1-\alpha/2}\) is the \((1-\alpha/2)\) quantile of the standard normal distribution, i.e. the critical value for a two-sided test with significance level \(\alpha\).
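These Wald intervals can be obtained in R with “confint.default()”; note that “confint()” applied to a “glm” object computes profile-likelihood intervals instead. A hedged sketch for the hypothetical logit.fit:

confint.default(logit.fit, level = 0.95)   # Wald intervals: beta_hat_j +/- z_{0.975} * se(beta_hat_j)
exp(confint.default(logit.fit))            # the same intervals on the odds-ratio scale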