March 29, 2017

Section 4.4: Logistic regression

  • Logistic regression (LR) models the posterior probabilities of \(K\) classes via functions of a linear combination of the features \(x=(x_1,\dots,x_p)^T\). Unlike linear regression, LR ensures that the estimated probabilities lie in \([0,1]\) and sum to one.
  • A multinomial logistic regression model has the form, where \(\beta_k^T=(\beta_{k1},\dots,\beta_{kp})\):

\[ \begin{align*} \log\frac{P(G=1|X=x)}{P(G=K|X=x)} &= \beta_{10} + \beta_1^T x \\ \vdots \\ \log\frac{P(G=K-1|X=x)}{P(G=K|X=x)} &= \beta_{(K-1)0} + \beta_{K-1}^T x \\ \end{align*} \]

Section 4.4: Logistic regression

  • We use class \(K\) as the denominator, but this is arbitrary, and the estimates are equivariant under this choice.

  • Some more (EoSL) notation: \(\theta=\{\beta_{10},\beta_1^T,\dots,\beta_{(K-1)0},\beta_{K-1}^T \}\), and \(P(G=k|X=x)=p_k(x;\theta)\).
  • Solving for each class we get that:

\[ \begin{align*} p_k(x;\theta) &= \frac{\exp(\beta_{k0}+\beta_k^Tx)}{1+\sum_{l=1}^{K-1}\exp(\beta_{l0}+\beta_{l}^Tx)} \hspace{1cm} k=1,\dots,K-1 \\ p_K(x;\theta) &= \frac{1}{1+\sum_{l=1}^{K-1}\exp(\beta_{l0}+\beta_{l}^Tx)} \end{align*} \]

  • Notice that we only need \(K-1\) equations to determine all \(\binom{K}{2}=\frac{K!}{2!(K-2)!}\) pairwise log-odds (a small R sketch of the class probabilities follows).
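To make the class-probability formulas concrete, here is a minimal R sketch; the coefficient values and input are made up purely for illustration.

# p_k(x; theta) for K = 3 classes and p = 2 features
B <- rbind(c( 0.5, 1.0, -0.5),              # beta_10, beta_1^T (illustrative values)
           c(-1.0, 0.3,  0.8))              # beta_20, beta_2^T
x <- c(1, 0.2, -0.4)                        # input with a leading 1 for the intercepts
eta <- as.vector(B %*% x)                   # the K - 1 linear predictors
p <- c(exp(eta), 1) / (1 + sum(exp(eta)))   # p_1, ..., p_{K-1}, p_K; sums to one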

Section 4.4: Logistic regression

  • We say that the log-odds (logit) is a monotone transformation of the discriminant function \(\delta_k(x) = p_k(x;\theta)\). For \(K=2\) the decision boundary is the set \(\{x : \beta_{10} + \beta_1^T x = 0\}\); we classify to class 1 when \(\beta_{10} + \beta_1^T x > 0\).

Section 4.4.1: Fitting logistic regression models

  • The parameters \(\theta\) of a LR model are estimated by maximum likelihood (ML).

  • We begin with the two-class (binary) case, so \(\theta=\beta=\{\beta_{10},\beta_1\}\). We are given a sample of \(N\) independent observations, each with a vector of inputs \(x_i\) (which now includes a leading 1 for the intercept):

\[\log\Bigg(\frac{p(x_i;\beta)}{1-p(x_i;\beta)}\Bigg)=\beta^Tx_i \hspace{1cm} i=1,\dots,N\]

  • To simplify computations, the data are assumed to be aggregated so that each row is a distinct "population", with \(m_i\) observations in population \(i\) and \(y_i\) "successes" (the binomial count for that population). If \(y_i \in \{0,1\} \hspace{2mm} \forall i\) (i.e. \(m_i = 1\)) then no observations are aggregated.

Section 4.4.1: Fitting logistic regression models

  • This leads to a joint (binomial) pmf:

\[f(\boldsymbol y|\beta) = \prod_{i=1}^N \frac{m_i!}{y_i!(m_i-y_i)!} p(x_i;\beta)^{y_i}(1-p(x_i;\beta))^{m_i - y_i} \]

  • And if we assume that \(y_i \in \{0,1\}\) (so each \(m_i = 1\)) this reduces to:

\[f(\boldsymbol y|\beta) = \prod_{i=1}^N p(x_i;\beta)^{y_i}(1-p(x_i;\beta))^{1 - y_i} \]

Section 4.4.1: Fitting logistic regression models

  • Conditioning on the data and taking the log gives the log-likelihood.

\[ \begin{align} l(\beta|\textbf{y}) &= \sum_{i=1}^N \Big\{ y_i \log [p(x_i;\beta)] + (1-y_i) \log [1-p(x_i;\beta)] \Big\} \\ &= \sum_{i=1}^N \Big\{ y_i \log[\text{expit}(\beta^T x_i)] + (1-y_i) \log [1-\text{expit}(\beta^T x_i)] \Big\} \\ &= \sum_{i=1}^N \Big\{ y_i \beta^T x_i - \log [1+\exp\{\beta^Tx_i \}] \Big\} \tag{1} \end{align} \]

  • Here \(\text{expit}(z)=e^z/(1+e^z)\), the inverse of the logit.
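A minimal R version of form (1), assuming a design matrix X whose first column is all ones:

# Binary log-likelihood, equation (1)
loglik <- function(beta, X, y) {
  eta <- as.vector(X %*% beta)      # beta^T x_i for each observation
  sum(y * eta - log(1 + exp(eta)))
}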

Section 4.4.1: Fitting logistic regression models

  • The \(p+1\) score equations are:

\[ \begin{align} \frac{\partial l(\beta|\textbf{y})}{\partial \beta_j} &= \sum_{i=1}^N \Bigg[ \Bigg(y_i - \overbrace{\frac{\exp\{\beta^Tx_i \}}{1+\exp\{\beta^Tx_i \}}}^{p(x_i;\beta)}\Bigg)x_{ij} \Bigg] \tag{2} \end{align} \]

  • The ML estimate is found where \(\partial l(\hat{\beta}|\textbf{y}) / \partial \beta_j =0\) for \(j=0,\dots,p\).

Section 4.4.1: Fitting logistic regression models

  • The \(jk^{th}\) element of the Hessian matrix is:

\[ \begin{align} [H(\beta)]_{jk} &= \frac{\partial^2l(\beta|\textbf{y})}{\partial \beta_j\partial \beta_k} \\ &= - \sum_{i=1}^N \overbrace{\frac{\exp\{\beta^Tx_i \}}{1+\exp\{\beta^Tx_i \}}}^{p(x_i;\beta)} \overbrace{\frac{1}{1+\exp\{\beta^Tx_i \}}}^{1-p(x_i;\beta)} x_{ij}x_{ik} \tag{3} \end{align} \]

Section 4.4.1: Fitting logistic regression models

  • We can use the Newton-Raphson algorithm to find the ML estimate.

  • Consider a Taylor expansion of the score vector (i.e. the gradient) of the log-likelihood about the point \(\beta_0\), set it to zero, and solve for \(\beta\):

\[ \begin{align*} \nabla l(\beta) &\approx \nabla l(\beta_0) + H(\beta_0) (\beta-\beta_0) = 0 \\ \beta &= \beta_0 - H^{-1}(\beta_0)\nabla l(\beta_0) \\ \beta^{\text{new}} &= \beta^{\text{old}} - \Bigg(\frac{\partial ^2 l(\beta^{\text{old}})}{\partial\beta\partial\beta^T} \Bigg)^{-1} \frac{\partial l(\beta^{\text{old}})}{\partial \beta} \end{align*} \]

  • We then re-write equations (2) and (3), the score and Hessian, in matrix notation.

Section 4.4.1: Fitting logistic regression models

  • First, some notation:

\[ \begin{align} \mathbf{X} &= [\mathbf{1} \hspace{2mm} X_1 \hspace{2mm} \dots \hspace{2mm} X_p] \\ \mathbf{y} &= [y_1,\dots,y_N]^T \\ \mathbf{p} &= [p(x_1;\beta),\dots,p(x_N;\beta)]^T \\ \mathbf{W} &= \text{diag}[p(x_1;\beta)(1-p(x_1;\beta)),\dots,p(x_N;\beta)(1-p(x_N;\beta))] \\ \end{align} \]

Section 4.4.1: Fitting logistic regression models

  • For the score vector we have:

\[ \begin{align*} \frac{\partial l(\beta|\textbf{y})}{\partial \beta_j} &= \sum_{i=1}^N \{ [y_i - p(x_i;\beta)]x_{ij} \} \\ \frac{\partial l(\beta|\textbf{y})}{\partial \beta_j} &= \mathbf{X}_j^T[\mathbf{y}-\mathbf{p}] \\ \frac{\partial l(\beta|\textbf{y})}{\partial \beta} &= \mathbf{X}^T[\mathbf{y}-\mathbf{p}] \\ \end{align*} \]

Section 4.4.1: Fitting logistic regression models

  • For the Hessian:

\[ \begin{align*} [H(\beta)]_{jk} &= - \sum_{i=1}^N \Big[ p(x_i;\beta) (1-p(x_i;\beta)) x_{ij}x_{ik} \Big] \\ &= - \begin{bmatrix} x_{1j} & \dots & x_{Nj} \end{bmatrix} \begin{pmatrix} p_1(1-p_1) & & \mathbf{0} \\ & \ddots & \\ \mathbf{0} & & p_N(1-p_N) \end{pmatrix} \begin{bmatrix} x_{1k} \\ \vdots \\ x_{Nk} \end{bmatrix} \\ &= - \mathbf{X}_j^T \mathbf{W} \mathbf{X}_k \\ H(\beta) &= - \mathbf{X}^T \mathbf{W} \mathbf{X} \end{align*} \]
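These matrix forms translate directly into R. A minimal sketch, assuming X includes a column of ones and y is a 0/1 vector:

# Score vector X^T (y - p) and Hessian -X^T W X, evaluated at a given beta
score_and_hessian <- function(beta, X, y) {
  p <- as.vector(1 / (1 + exp(-X %*% beta)))   # p(x_i; beta)
  W <- diag(p * (1 - p))                       # N x N diagonal weight matrix
  list(score   = as.vector(t(X) %*% (y - p)),
       hessian = -t(X) %*% W %*% X)
}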

Section 4.4.1: Fitting logistic regression models

  • Substitute into the Newton-Raphson algorithm (where \(\mathbf{p}\) and \(\mathbf{W}\) are evaluated at \(\beta^{\text{old}}\)) to get the "iteratively reweighted least squares" (IRLS) algorithm (a minimal R sketch follows the exercises below).

\[ \begin{align*} \beta^{\text{new}} &= \beta^{\text{old}} + (\mathbf{X}^T\mathbf{W}\mathbf{X})^{-1}\mathbf{X}^T(\mathbf{y}-\mathbf{p}) \\ &= (\mathbf{X}^T\mathbf{W}\mathbf{X})^{-1}\mathbf{X}^T \mathbf{W}\mathbf{z} \tag{4} \\ \underset{\text{Adj. resp.}}{\mathbf{z}} &= \mathbf{X}\beta^{\text{old}} + \mathbf{W}^{-1}(\mathbf{y}-\mathbf{p}) \end{align*} \]

  • Show that the first line simplifies to (4).

  • Show that (4) is equivalent to \(\underset{\beta}{\text{arg}\min} (\mathbf{z}-\mathbf{X}\beta)^T\mathbf{W} (\mathbf{z}-\mathbf{X}\beta)\).
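Putting update (4) into code, a minimal IRLS sketch (it assumes X has a leading column of ones, y is a 0/1 vector, and no separation or convergence problems):

irls <- function(X, y, maxit = 25, tol = 1e-8) {
  beta <- rep(0, ncol(X))                        # common starting value
  for (it in 1:maxit) {
    p <- as.vector(1 / (1 + exp(-X %*% beta)))   # p evaluated at beta^old
    W <- diag(p * (1 - p))
    z <- X %*% beta + (y - p) / (p * (1 - p))    # adjusted response
    beta_new <- as.vector(solve(t(X) %*% W %*% X, t(X) %*% W %*% z))   # update (4)
    if (max(abs(beta_new - beta)) < tol) return(beta_new)
    beta <- beta_new
  }
  beta
}
# e.g. irls(cbind(1, dat$tobacco, dat$ldl, dat$age), dat$chd) should closely match glm()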

Section 4.4.2: South African Heart Disease

  • Let's now fit a logistic regression to the South African heart disease data set.

  • We have 462 observations on white males (ages 15-64). The binary response chd indicates a form of coronary heart disease (myocardial infarction), and there are nine covariates, including systolic blood pressure (sbp), low-density lipoprotein cholesterol (ldl), and a measure of body fat (adiposity).

##   sbp tobacco  ldl adiposity famhist typea obesity alcohol age chd
## 1 160   12.00 5.73     23.11 Present    49   25.30   97.20  52   1
## 2 144    0.01 4.41     28.61  Absent    55   28.87    2.06  63   1
## 3 118    0.08 3.48     32.28 Present    52   29.14    3.81  46   0
## 4 170    7.50 6.41     38.03 Present    51   31.99   24.26  58   1

Section 4.4.2: South African Heart Disease

Figure 4.12: Red is cases, teal is controls

Section 4.4.2: South African Heart Disease

  • A logistic regression model is easily fit using base R:
glm(chd ~ sbp + tobacco + ldl + famhist + obesity + alcohol + age,
    data = dat, family = binomial(link = "logit"))
Table 4.2
                 Estimate  Std. Error  z value  Pr(>|z|)
(Intercept)        -4.130       0.964   -4.283     0.000
sbp                 0.006       0.006    1.023     0.306
tobacco             0.080       0.026    3.034     0.002
ldl                 0.185       0.057    3.219     0.001
famhistPresent      0.939       0.225    4.177     0.000
obesity            -0.035       0.029   -1.187     0.235
alcohol             0.001       0.004    0.136     0.892
age                 0.043       0.010    4.181     0.000

Section 4.4.2: South African Heart Disease

  • The authors point out that it is surprising that systolic blood pressure is not significant and that the obesity coefficient is negative (obesity appearing to lower risk); this is a consequence of correlation among the predictors. They suggest a backward selection procedure.
mdl.full <- glm(chd~sbp+tobacco+ldl+famhist+obesity+alcohol+age,data=dat,
                family=binomial(link=logit))
mdl.bw <- step(mdl.full,direction = 'backward',trace=0)
Table 4.3
                 Estimate  Std. Error  z value  Pr(>|z|)
(Intercept)        -4.204       0.498   -8.437     0.000
tobacco             0.081       0.026    3.163     0.002
ldl                 0.168       0.054    3.093     0.002
famhistPresent      0.924       0.223    4.141     0.000
age                 0.044       0.010    4.521     0.000
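Table 4.3 is just the coefficient summary of the reduced model, e.g.:
round(summary(mdl.bw)$coefficients, 3)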

Section 4.4.2: South African Heart Disease

  • How can we interpret the coefficients from a logistic regression? Consider \(\hat\beta_{\text{age}}=0.044\), and imagine two input vectors for the model from Table 4.3:

\[ \begin{align} x_a &= [1,\text{tobacco}_0,\text{ldl}_0,\text{famhist}_0,\text{age}_0+1]^T \\ x_b &= [1,\text{tobacco}_0,\text{ldl}_0,\text{famhist}_0,\text{age}_0]^T\\ \underbrace{\hat{\beta}^T(x_a - x_b)}_{\hat{\beta}^T [0,0,0,0,1]^T} &= \log\Bigg(\frac{p(x_a;\hat{\beta})}{1-p(x_a;\hat{\beta})}\Bigg) - \log\Bigg(\frac{p(x_b;\hat{\beta})}{1-p(x_b;\hat{\beta})}\Bigg) \\ \hat\beta_{\text{age}} &= \log\Bigg(\frac{p(x_a;\hat{\beta})/(1-p(x_a;\hat{\beta}))}{p(x_b;\hat{\beta})/(1-p(x_b;\hat{\beta}))} \Bigg) \end{align} \]

Section 4.4.2: South African Heart Disease

  • A one-unit change in feature \(j\) leads to a \(\hat\beta_j\) change in the log-odds. Exponentiating gives an odds ratio:

\[ \begin{align} \exp\{\hat\beta_{\text{age}}\} &= \frac{p(x_a;\hat{\beta})/(1-p(x_a;\hat{\beta}))}{p(x_b;\hat{\beta})/(1-p(x_b;\hat{\beta}))} \hspace{1cm} \text{Odds ratio} \end{align} \]

  • In other words, being one year older multiplies the odds of coronary heart disease by \(\exp(0.044)\approx 1.045\), i.e. a 4.5% increase.
  • Confidence intervals can be constructed easily too:

\[ \begin{align} \exp \Big\{ \hat\beta_{\text{age}} \pm Z_{1-\alpha/2}\sqrt{\hat{\text{Var}}(\hat\beta_{\text{age}})} \Big\} = (1.025,\, 1.065) \end{align} \]
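For reference, a Wald-type interval like this can be computed directly from the mdl.bw object fit above:

# Odds ratio and 95% Wald confidence interval for age
b  <- unname(coef(mdl.bw)["age"])
se <- unname(sqrt(diag(vcov(mdl.bw)))["age"])
exp(c(lower = b - qnorm(0.975) * se, point = b, upper = b + qnorm(0.975) * se))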

Estimation issues

  • Computational bottlenecks
    • When \(p\) grows, inverting (or solving with) \(H\) in Newton-Raphson, or \(X^T W X\) in IRLS, costs \(O(p^3)\) per iteration
    • When \(N\) gets very large, sums over all observations, such as the score \(\sum_{i=1}^N [y_i - p(x_i;\beta)]x_i\), become computationally expensive at every iteration
  • Solutions to the \(p\) problem: gradient descent or coordinate descent

  • Solutions to the \(N\) problem: stochastic gradient descent

Gradient descent

  • With gradient descent, we ignore the Hessian and use some step size \(\eta\) with the update \(\beta^{\text{new}} = \beta^{\text{old}} - \eta \frac{\partial [-l(\beta^{\text{old}})]}{\partial\beta} = \beta^{\text{old}} + \eta \frac{\partial l(\beta^{\text{old}})}{\partial\beta}\), i.e. descent on the negative log-likelihood (equivalently, ascent on \(l\)). The figure below shows gradient descent with an intercept and the (normalized) age variable at various starting points with \(\eta=0.5\).
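A minimal R sketch of this procedure on the heart-disease data; stepping along the gradient of the mean (rather than summed) log-likelihood is an assumption made here so that \(\eta=0.5\) gives stable steps:

# Gradient ascent on the mean log-likelihood (= descent on its negative)
X <- cbind(1, scale(dat$age))                  # intercept + normalized age
y <- dat$chd
eta <- 0.5
beta <- c(0, 0)                                # one possible starting point
for (it in 1:500) {
  p <- as.vector(1 / (1 + exp(-X %*% beta)))
  beta <- beta + eta * as.vector(t(X) %*% (y - p)) / nrow(X)   # mean score step
}
beta   # compare with coef(glm(chd ~ scale(age), data = dat, family = binomial))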

Stochastic gradient descent (SGD)

  • Our log-likelihood is a sum of \(N\) "summand functions", \(\sum_{i=1}^N Q_i\), where \(Q_i=y_i \beta^T x_i - \log [1+\exp\{\beta^Tx_i \}]\). In SGD, only one (or a small batch) of these summand functions is selected at random at each step and used in the gradient update, typically with a decaying learning rate (\(\eta_t=\delta\eta_{t-1}\)).
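Continuing with the same X and y as in the gradient-descent sketch, a minimal SGD sketch with one randomly chosen summand per update (the decay factor delta is an illustrative choice):

set.seed(1)
beta  <- c(0, 0)
eta_t <- 0.5                                      # initial learning rate
delta <- 0.999                                    # decay: eta_t = delta * eta_{t-1}
for (iter in 1:5000) {
  i   <- sample(nrow(X), 1)                       # pick one summand Q_i at random
  p_i <- 1 / (1 + exp(-sum(X[i, ] * beta)))
  beta  <- beta + eta_t * (y[i] - p_i) * X[i, ]   # gradient of Q_i w.r.t. beta
  eta_t <- delta * eta_t
}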

Multinomial logistic regression

  • For the multinomial case, we will show that the gradient takes the form:

\[ \begin{align} \frac{\partial l(\theta|\mathbf{Y})}{\partial \theta} &= \tilde{\mathbf{X}}^T [\tilde{\textbf{t}}-\tilde{\textbf{p}}] \end{align} \]

  • Here \(\tilde{\mathbf{X}}^T\) is of dimension \((p+1)(K-1)\times N(K-1)\) and \(\tilde{\textbf{t}}-\tilde{\textbf{p}}\) is an \(N(K-1)\times 1\) vector, so the score vector is \((p+1)(K-1)\times 1\).

  • The Hessian has a similar form to the binomial case:

\[ \begin{align} \frac{\partial ^2 l(\theta|\mathbf{Y})}{\partial \theta \partial \theta^T} &= - \tilde{\mathbf{X}}^T \tilde{\mathbf{W}} \tilde{\mathbf{X}} \end{align} \]
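A small sketch of the stacked score vector, built class by class as \(\mathbf{X}^T(\mathbf{t}_k-\mathbf{p}_k)\); this is equivalent to \(\tilde{\mathbf{X}}^T[\tilde{\textbf{t}}-\tilde{\textbf{p}}]\) when \(\tilde{\mathbf{X}}\) is taken to be block-diagonal with \(K-1\) copies of \(\mathbf{X}\) (an assumption about the arrangement):

# Stacked multinomial score; the last level of g is the reference class K
multinom_score <- function(B, X, g) {          # B: (p+1) x (K-1), X: N x (p+1), g: factor
  K   <- nlevels(g)
  eta <- X %*% B                               # N x (K-1) linear predictors
  P   <- exp(eta) / (1 + rowSums(exp(eta)))    # p_k(x_i; theta), k = 1, ..., K-1
  Tm  <- sapply(levels(g)[-K], function(l) as.numeric(g == l))   # indicators t_k
  as.vector(t(X) %*% (Tm - P))                 # length (p+1)(K-1) score vector
}
# e.g. multinom_score(matrix(0, 2, 2), cbind(1, iris$Sepal.Length), iris$Species)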

Multinomial logistic regression

  • Using the well-known iris data set, we will estimate a multinomial logistic regression for the three species of flowers: setosa, versicolor, and virginica.
head(iris,1)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
# Fit the model with nnet package
library(nnet)
iris1 <- nnet::multinom(Species~Sepal.Length,
                                data=iris,trace=F)
# Or the VGAM package
library(VGAM)
iris2 <- VGAM::vglm(Species~Sepal.Length,
                   data = iris, multinomial(refLevel = 1))

Multinomial logistic regression

  • Printing our results:
coef(summary(iris1))
##            (Intercept) Sepal.Length
## versicolor   -26.08339     4.816072
## virginica    -38.76786     6.847957
coef(summary(iris2))[,1]
##  (Intercept):1  (Intercept):2 Sepal.Length:1 Sepal.Length:2 
##     -26.081936     -38.759001       4.815691       6.846399