20220321_LogisticRegression

Generalized Models for Binary Outcomes

Rather than predicting \(p(y_i=1)\) directly, we must transform it into an unbounded variable with a link function:

Transform the probability into odds, labelled OR in these notes (strictly, \(p/(1-p)\) is the odds; an odds ratio is the ratio of two odds):

\[ OR = \frac{p}{1-p} \]

#scenario one 
p = 0.9 #probability of YES or 1
q = 1-p #probability of NO or 0

OR_1 = p/q #YES
OR_1

## [1] 9

OR_0 = q/p #NO
OR_0

## [1] 0.1111111

When \(p(y_i = 1) = 0.9\), OR(1) = 9; correspondingly, \(p(y_i = 0) = 0.1\) and OR(0) = 0.11.

#scenario two
p = 0.6 #probability of YES or 1
q = 1-p #probability of NO or 0

OR_1 = p/q #YES
OR_1

## [1] 1.5

OR_0 = q/p #NO
OR_0

## [1] 0.6666667

When \(p(y_i = 1) = 0.6\), OR(1) = 1.5; correspondingly, \(p(y_i = 0) = 0.4\) and OR(0) = 0.67.

#scenario three
p = 0.3 #probability of YES or 1
q = 1-p #probability of NO or 0

OR_1 = p/q #YES
OR_1

## [1] 0.4285714

OR_0 = q/p #NO
OR_0

## [1] 2.333333

When \(p(y_i = 1) = 0.3\), OR(1) = 0.43; correspondingly, \(p(y_i = 0) = 0.7\) and OR(0) = 2.33.

Comments: the odds scale is skewed and asymmetric, and it ranges from 0 to \(+\infty\). It is still bounded below, so it is not helpful as the outcome of a linear model.
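To see the skew directly, the sketch below tabulates the odds over a grid of probabilities. The grid values are arbitrary illustration choices, not taken from the scenarios above.

```r
# Odds for an arbitrary grid of probabilities (illustrative values)
p_grid <- c(0.1, 0.3, 0.5, 0.7, 0.9)
odds   <- p_grid / (1 - p_grid)

# Probabilities below 0.5 land between 0 and 1;
# probabilities above 0.5 spread out over everything larger than 1
round(odds, 2)
```

The grid is symmetric around 0.5, but the resulting odds are not symmetric around 1.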

Take the natural log of the odds (OR), which gives the “logit” link: \(\ln\frac{p}{1-p}\)

\[ e^{\ln x} = x \]

\(\ln\) = natural logarithm
\(e\) = base of the natural logarithm (Euler's number)
\(x\) = positive real number

The natural log is the logarithm to base \(e\) (approximately 2.718) and is the inverse of the exponential function \(e^x\): exponentiating the log of a positive number returns that number, and taking the log of \(e^x\) returns \(x\). Natural logarithms are commonly used in solving time and growth problems.
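A quick check of this inverse relationship in R, where `log()` is the natural log by default. The values of `x` are arbitrary positive numbers chosen for illustration.

```r
# exp() and log() undo each other for positive x
x <- c(0.5, 1, 1.5, 9)        # arbitrary positive values
all.equal(exp(log(x)), x)     # exponentiating a log recovers x
all.equal(log(exp(x)), x)     # logging an exponential recovers x
log(1)                        # odds of 1 (p = 0.5) give a log-odds of 0
```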

#scenario One
p = 0.6 #probability of YES or 1
q = 1-p #probability of NO or 0

OR_1 = p/q #YES
OR_1

## [1] 1.5

OR_0 = q/p #NO
OR_0

## [1] 0.6666667

logit_1 = log(p/(1-p)) #natural log
logit_1

## [1] 0.4054651

logit_0 = log(q/(1-q)) #natural log
logit_0

## [1] -0.4054651

If \(p(y_i = 1) = 0.6\), then Logit(1) = 0.405 and Logit(0) = -0.405.

Comments: the logit scale is now symmetric about 0 and ranges from \(-\infty\) to \(+\infty\).

A Logit link is a nonlinear transformation of probability:

Equal intervals in logits are NOT equal intervals of probability
The logit ranges from \(-\infty\) to \(+\infty\) and is symmetric about prob = .5 (logit = 0)
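Both points can be checked numerically. The sketch below takes equal one-unit steps on the logit scale (an arbitrary grid chosen for illustration) and converts each to a probability.

```r
# Equal one-unit steps on the logit scale (arbitrary grid)
logit_grid <- seq(-3, 3, by = 1)

# Inverse logit: exp(logit) / (1 + exp(logit))
prob_grid <- exp(logit_grid) / (1 + exp(logit_grid))
round(prob_grid, 3)

# Change in probability produced by each one-unit logit step
round(diff(prob_grid), 3)
```

The steps in probability are largest around logit = 0 (prob = .5) and shrink toward either tail, and the probabilities for a logit and its negative add up to 1.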

Now we can use a linear model. The model is linear with respect to the predicted logit, which translates into a nonlinear prediction with respect to probability: the predicted probability (the conditional mean of the outcome) levels off as it approaches 0 or 1, so it never leaves that range.
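As a minimal sketch of such a model, the example below simulates a binary outcome and fits it with base R's `glm()` using the binomial family and logit link. The data are simulated; the seed, sample size, and coefficients (-0.5 and 1.2) are hypothetical values chosen only for illustration.

```r
# Simulated data: seed, sample size, and coefficients are hypothetical
set.seed(1)
n   <- 500
x   <- rnorm(n)
eta <- -0.5 + 1.2 * x                  # linear predictor on the logit scale
p   <- exp(eta) / (1 + exp(eta))       # inverse logit gives probabilities
y   <- rbinom(n, size = 1, prob = p)   # binary outcome

# Logistic regression: binomial family with logit link
fit <- glm(y ~ x, family = binomial(link = "logit"))
coef(fit)                              # estimates are on the logit scale

# Predictions are linear in the logit but bounded in probability
new_x      <- data.frame(x = c(-4, 0, 4))
pred_logit <- predict(fit, newdata = new_x, type = "link")
pred_prob  <- predict(fit, newdata = new_x, type = "response")
cbind(new_x, pred_logit, pred_prob)
```

Even at the extreme values of `x`, the predicted probabilities stay strictly between 0 and 1, while the predicted logits keep changing by the same amount per unit of `x`.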

Predicted Binary Outcomes

General Linear Model

\[ p(y_i = 1) = \beta_0 + \beta_1X_i + \beta_2Z_i + e_i \]

If \(y_i\) is binary, the residual \(e_i = y_i - \hat{y}_i\) can take only 2 values:

if \(y_i = 0\), \(e_i = (0 - \hat{p})\)
if \(y_i = 1\), \(e_i = (1 - \hat{p})\)
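As a small numeric check of these two cases, assume a hypothetical predicted probability of 0.6 (an illustrative value, not the result of a fitted model):

```r
p_hat <- 0.6          # hypothetical predicted probability
y     <- c(0, 1)      # the only two possible outcomes
e     <- y - p_hat    # the only two possible residuals
e
```

For a given prediction the residual can take only these two values, so it cannot follow a normal distribution.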
Variance of a binary variable:

\[ Var(y_i) = p \times (1-p) \]

Logistic Regression

1. Logit: Data to model

The link function \(g(\cdot)\):

\[ \log\frac{p(y_i =1)}{1-p(y_i = 1)} = \beta_0 + \beta_1X_i + \beta_2Z_i + e_i \]

Because the exponential undoes the natural log,

\[ e^{\ln x} = x \]

exponentiating both sides returns the odds:

\[ e^{\log\frac{p(y_i =1)}{1-p(y_i = 1)}} = \frac{p(y_i =1)}{1-p(y_i = 1)} = \exp(\beta_0 + \beta_1X_i + \beta_2Z_i + e_i) \]

#scenario One
p = 0.6 #probability of YES or 1
q = 1-p #probability of NO or 0

logit_1 = log(p/(1-p)) #natural log
logit_1

## [1] 0.4054651

logit_0 = log(q/(1-q)) #natural log
logit_0

## [1] -0.4054651

Comments: symmetric, unbounded, \(-\infty \sim \infty\)

2. Odds (OR):

\[ \frac{p(y_i =1)}{1-p(y_i = 1)} = \exp(\beta_0 + \beta_1X_i + \beta_2Z_i + e_i) \]

#scenario One
p = 0.6 #probability of YES or 1
q = 1-p #probability of NO or 0

OR_1 = p/q #YES
OR_1

## [1] 1.5

OR_0 = q/p #NO
OR_0

## [1] 0.6666667

Comments: non-symmetric, one-side bounded, 0 to \(\infty\).

3. Probability: model to data

The inverse link function \(g^{-1}(\cdot)\):

\[ p(y_i =1) = \frac{\exp(\beta_0 + \beta_1X_i + \beta_2Z_i + e_i)}{1+\exp(\beta_0 + \beta_1X_i + \beta_2Z_i + e_i)} \]
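This expression follows from solving the odds equation for the probability. In the steps below, \(p\) is short for \(p(y_i = 1)\) and \(\eta_i\) is a shorthand introduced here (not used elsewhere in these notes) for the right-hand side \(\beta_0 + \beta_1X_i + \beta_2Z_i + e_i\):

\[
\begin{aligned}
\frac{p}{1-p} &= e^{\eta_i} \\
p &= e^{\eta_i}\,(1-p) \\
p\,(1 + e^{\eta_i}) &= e^{\eta_i} \\
p &= \frac{e^{\eta_i}}{1 + e^{\eta_i}}
\end{aligned}
\]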

#scenario One
p = 0.6 #probability of YES or 1
q = 1-p #probability of NO or 0

logit_1 = log(p/(1-p)) #natural log
logit_1

## [1] 0.4054651

logit_0 = log(q/(1-q)) #natural log
logit_0

## [1] -0.4054651

OR_1 = p/q #YES
OR_1

## [1] 1.5

OR_0 = q/p #NO
OR_0

## [1] 0.6666667

prob_1 = exp(logit_1)/(1+exp(logit_1))
prob_1

## [1] 0.6

prob_0 = exp(logit_0)/(1+exp(logit_0))
prob_0

## [1] 0.4

Comments: bounded, \(0 \sim 1\)
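The same three scales can be reached with base R's built-in logistic functions, `qlogis()` (probability to logit) and `plogis()` (logit to probability). This is a sketch of the round trip for the p = 0.6 scenario, offered as an alternative to the hand-written formulas rather than a replacement for them.

```r
p     <- 0.6
logit <- qlogis(p)       # same as log(p / (1 - p))
odds  <- exp(logit)      # back to the odds scale
prob  <- plogis(logit)   # same as exp(logit) / (1 + exp(logit))
c(logit = logit, odds = odds, prob = prob)
```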

Logit-based Models for C Nominal Categories

Sub-model 1:

\[ \log\frac{p(y_i = 0)}{1-p(y_i = 0)} = \beta_{01}+ \beta_1X_i + \beta_2Z_i + \beta_3X_iZ_i + e_i \]

Sub-model 2:

\[ \log\frac{p(y_i = 1)}{1-p(y_i = 1)} = \beta_{02}+ \beta_1X_i + \beta_2Z_i + \beta_3X_iZ_i + e_i \]

Sub-model 3:

\[ \log\frac{p(y_i = 2)}{1-p(y_i = 2)} = \beta_{03}+ \beta_1X_i + \beta_2Z_i + \beta_3X_iZ_i + e_i \]

Logit-based Models for C Ordinal Categories

Sub-model 1:

\[ \log\frac{p(y_i > 0)}{1-p(y_i > 0)} = \beta_{01}+ \beta_1X_i + \beta_2Z_i + \beta_3X_iZ_i + e_i \]

Sub-model 2:

\[ \log\frac{p(y_i > 1)}{1-p(y_i > 1)} = \beta_{02}+ \beta_1X_i + \beta_2Z_i + \beta_3X_iZ_i + e_i \]

Sub-model 3:

\[ \log\frac{p(y_i > 2)}{1-p(y_i > 2)} = \beta_{03}+ \beta_1X_i + \beta_2Z_i + \beta_3X_iZ_i + e_i \]
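Because the ordinal sub-models describe cumulative probabilities, the probability of each single category comes from differencing them. The sketch below assumes four ordered categories (0 to 3) and uses hypothetical predicted logits for one observation; the numbers are illustrative, not estimates from any model.

```r
# Hypothetical predicted logits for P(y > 0), P(y > 1), P(y > 2);
# they must decrease so that the cumulative probabilities decrease too
logit_gt <- c(1.5, 0.2, -1.0)
p_gt     <- plogis(logit_gt)           # inverse logit of each sub-model

# Category probabilities by differencing:
# P(y = 0) = 1 - P(y > 0), P(y = k) = P(y > k-1) - P(y > k), P(y = 3) = P(y > 2)
p_cat <- -diff(c(1, p_gt, 0))
names(p_cat) <- paste0("y=", 0:3)
round(p_cat, 3)
sum(p_cat)                             # the four probabilities sum to 1
```

Since the slopes are shared across the three sub-models, only the intercepts differ, which is what keeps the predicted logits ordered for every observation.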