The standard notation of the linear regression model has two components:
\(Y_i = X_i\beta + \epsilon_i\) – systematic component
\(\epsilon_i \thicksim N(0, \sigma^2)\) – stochastic component
Alternatively, the following notation can be used:
\(Y_i \thicksim N(\mu_i, \sigma^2)\) – stochastic
\(\mu_i = X_i\beta\) – systematic
These notations can be generalized as follows:
\(Y_i \thicksim f(\theta_i, \alpha)\)
\(\theta_i = g(X_i, \beta)\)
where \(f()\): a probability density function,
\(\theta_i\): a systematic feature of the distribution that varies across observations (e.g. mean, variance),
\(\alpha\): ancillary parameter, held constant,
\(\beta\): effect parameters.
It appears that the GLM builds on this generalized notation. A GLM has three components:
Random distribution (normal, Poisson, binomial, etc.) [stochastic component]
Linear predictor: \(\beta_0 + \beta_1 X_{i1} + \dots + \beta_k X_{ik}\), from which I build a model [systematic component]
Link function \(g\), which connects the mean \(\mu = E[Y]\) to the linear predictor
Note: Here the linearity is in the coefficients (not the variables); this is what defines a linear model. \(\alpha X_{1}^2 + \beta X_2^2\) is still a linear model, but \(\alpha^2 X_1 + \beta X_2\) is not.
In the case of an ordinary linear regression model, where the outcome/response variable has a normal distribution, the link is the identity function: \(g(\mu)=\beta_0+ \beta_1 X_{i1}+ \dots + \beta_k X_{ik}= \sum_{j=0}^{k} \beta_j X_{ij}=\mu\), where \(X_{i0}=1\).
In the case of a binary outcome variable, however, the identity link does not work, primarily because the outcome variable is restricted to 0 or 1 (so its mean is a probability between 0 and 1). In this case, the expected value \(\mu\) equals \(\pi\) (a probability). It turns out that the appropriate link function is the log odds:
\[g(\pi) = \log\left(\frac {\pi}{1-\pi}\right)\] This is called the logit.
\[g(\pi)= \log\left(\frac {\pi}{1-\pi}\right) = \sum_{j=0}^{k} \beta_j X_{ij} \]
Let's simply write \(\sum_{j=0}^{k} \beta_j X_{ij}\) as \(\beta X_i\), with \(X_{i0}=1\) (constant). Then,
\[\pi = \frac {e^{ \beta X_i}}{1+ e^{ \beta X_i}}= \frac {1}{1 + e^{- \beta X_i}}\] Stated differently, \(\pi\) is given by the inverse link function \(g^{-1}\):
\[\pi = g^{-1}(\beta X_i)\]
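As a quick numeric check of this link/inverse-link pair: base R's qlogis() and plogis() implement the logit and the inverse logit, so applying one after the other recovers the original probabilities.
p <- c(0.1, 0.5, 0.9)
eta <- qlogis(p)   # the link: log(p / (1 - p))
plogis(eta)        # the inverse link: exp(eta) / (1 + exp(eta)) recovers p
## [1] 0.1 0.5 0.9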
In the univariate logistic model, the link function is defined as
\[g(X_i) = \log\left(\frac {p}{1-p}\right) = \beta_0 + \beta_1X_i\]
Estimated odds: \(\hat O = \frac{p}{1-p} = e^{\beta_0 + \beta_1X_i}\)
Estimated probability: \(\hat p=\frac{e^{\beta_0 + \beta_1 X_i}}{1+e^{\beta_0 + \beta_1 X_i}}\)
\(\beta_0\): log(odds) when \(X_i=0\)
\(\beta_1\): log(odds ratio), i.e. the change in the log odds for every one-unit increase in \(X_i\)
Proof:
Let \(g(X_i)=\log\left(\frac {p_i}{1-p_i}\right) = \beta_0 + \beta_1 X_i\) and
\(g(X_i +1) = \log\left(\frac {p_j}{1-p_j}\right) = \beta_0 + \beta_1 (X_i + 1)\)
Then,
\(g(X_i +1)-g(X_i) = \log\left(\frac {p_j/(1-p_j)}{p_i/(1-p_i)}\right) = \beta_1\)
Thus,
\(e^{\beta_1}\): the odds ratio corresponding to a one-unit increase in \(X_i\)
\(e^{\beta_1}= \frac{\frac {p_j}{1-p_j}}{\frac {p_i}{1-p_i}}\)
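A quick numeric check of this identity, with made-up values \(\beta_0 = -1\), \(\beta_1 = 0.5\), \(X_i = 2\):
b0 <- -1; b1 <- 0.5; x <- 2
p_i <- plogis(b0 + b1 * x)         # p at X_i = x
p_j <- plogis(b0 + b1 * (x + 1))   # p at X_i = x + 1
(p_j / (1 - p_j)) / (p_i / (1 - p_i))  # the odds ratio between the two
## [1] 1.648721
exp(b1)                                # identical to e^beta_1
## [1] 1.648721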
Why is an odds ratio important? What does it tell me? Suppose \(X_i\) is a female indicator (1 = female, 0 = male). Then:
\(\beta_0\) indicates the log(odds) for males.
\(\beta_1\) indicates the log of the odds ratio between the female and male groups.
\(e^{\beta_1}\) is the odds ratio between the groups (Odds(female)/Odds(male)).
So what?
Odds tell us the likelihood of an event occurring against its not occurring. If the probability of rain is 0.3, the odds of rain are 0.3/0.7 = 3/7. If two teams compete in a game and have an equal chance of winning, the odds for each team are 1, and the odds ratio between the two teams is 1.
If team A's probability of winning is 0.6 and team B's is 0.4, then Odds(A) = 6/4 and Odds(B) = 4/6. The odds ratio (A/B) = \(\frac{6/4}{4/6} = (6/4)^2\) = Odds(A)\(^2\) = 2.25.
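The same arithmetic in R:
pA <- 0.6; pB <- 0.4
oddsA <- pA / (1 - pA)  # 1.5
oddsB <- pB / (1 - pB)  # 0.667
oddsA / oddsB           # odds ratio = Odds(A)^2
## [1] 2.25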
In the case of the logistic model above, the odds ratio between the two groups is equal to \(e^{\beta_1}\). If \(e^{\beta_1}=1\), odds(male) and odds(female) are equal, so there is no difference between the two groups. As a rule of thumb, an odds ratio (\(e^{\beta_1}\)) below 0.5 or above 2 indicates a moderate effect in an experimental study.
Example: predicting the probability of graduating within a year based on a test score.
\(\beta_1\) indicates the change in the log odds when \(X_i\) increases by 1. It is the log of the odds ratio corresponding to a one-unit increase in \(X_i\).
If \(e^{\beta_1}=1.15\), a one-point increase in the test score multiplies the odds of graduating within a year by 1.15, i.e. it increases the odds by 15%. (In general, the percent change in the odds is \((e^{\beta_1}-1)\times 100\%\).)
library(mvtnorm)
## Warning: package 'mvtnorm' was built under R version 4.0.3
library(tidyverse)
library(ggplot2)
set.seed(2020-10-23)
data <- read_csv("data.csv")
data %>% head(2)
## # A tibble: 2 x 4
## Age Education Gender Income
## <dbl> <dbl> <chr> <chr>
## 1 25 7 Male <=50K
## 2 38 9 Male <=50K
## mutate explanatory variables:
## Income (chr) into a numeric variable (1 if over 50K; 0 otherwise)
## Gender: female = 1, male = 0
## if_else() (dplyr) is stricter than ifelse(): faster and more predictable.
data %>%
  mutate(Gender = if_else(Gender == "Female", 1, 0),
         Income = if_else(Income == ">50K", 1, 0)) -> data
head(data) %>% knitr::kable()
| Age | Education | Gender | Income |
|---|---|---|---|
| 25 | 7 | 0 | 0 |
| 38 | 9 | 0 | 0 |
| 28 | 12 | 0 | 1 |
| 44 | 10 | 0 | 1 |
| 18 | 10 | 1 | 0 |
| 34 | 6 | 0 | 0 |
## multivariate model
## prepend a column of 1s for the intercept
X <- as.matrix(cbind(1, data[, c("Age", "Education", "Gender")]))
# inverse logit (inverse link) function: g^-1(X*beta) = p
# (the argument x stands for the linear predictor X*beta)
invlogit <- function(x) { exp(x) / (1 + exp(x)) }
logit_llk <- function(par, y, X) {
  # p_i = invlogit(X_i %*% par), the probability for each observation
  p <- invlogit(X %*% par)
  # log-likelihood of the logit model:
  # llk(beta | y) = sum_i [ y_i*log(p_i) + (1 - y_i)*log(1 - p_i) ]
  return(sum(y * log(p) + (1 - y) * log(1 - p)))
}
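One caveat with the hand-rolled invlogit(): exp(x) overflows for large x, so the ratio returns NaN far from zero. Base R's plogis() computes the same inverse logit in a numerically stable way and could be used instead:
invlogit(1000)  # NaN: exp(1000) overflows to Inf, and Inf/Inf is NaN
## [1] NaN
plogis(1000)    # stable implementation of the same function
## [1] 1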
opt <- optim(par = rep(0.1, 4),
             fn = logit_llk,
             y = data$Income, X = X,
             control = list(fnscale = -1),
             method = "BFGS")
coefs <- opt$par
names(coefs) <- c("Intercept", "Age", "Education", "Gender")
coefs
## Intercept Age Education Gender
## -6.14746316 0.04000189 0.35307870 -1.32727551
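As a sketch of where the standard errors reported by glm() below come from: re-running optim() with hessian = TRUE and inverting the negative Hessian of the log-likelihood (the observed information) gives the variance-covariance matrix of the estimates.
opt_h <- optim(par = rep(0.1, 4),
               fn = logit_llk,
               y = data$Income, X = X,
               control = list(fnscale = -1),
               method = "BFGS",
               hessian = TRUE)
# standard errors = square roots of the diagonal of the inverted information;
# these should be close to the Std. Error column reported by glm() below.
sqrt(diag(solve(-opt_h$hessian)))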
The glm() function simplifies this process:
modFit <- glm(Income~Age + Education + Gender, data=data, family=binomial(link=logit))
# modFit2 <- glm(Income~Age + Education + Gender, data=data, family=binomial) # same.
summary(modFit)$coef
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.1469636 0.354809252 -17.324699 3.062916e-67
## Age 0.0400101 0.004524413 8.843158 9.305036e-19
## Education 0.3530143 0.025341076 13.930517 4.133499e-44
## Gender -1.3281006 0.147996099 -8.973889 2.862246e-19
summary(modFit)$coef[, 1]
## (Intercept) Age Education Gender
## -6.1469636 0.0400101 0.3530143 -1.3281006
summary(modFit)
##
## Call:
## glm(formula = Income ~ Age + Education + Gender, family = binomial(link = logit),
## data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0749 -0.6852 -0.4575 -0.1497 2.7508
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.146964 0.354809 -17.325 <2e-16 ***
## Age 0.040010 0.004524 8.843 <2e-16 ***
## Education 0.353014 0.025341 13.931 <2e-16 ***
## Gender -1.328101 0.147996 -8.974 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2206.6 on 1999 degrees of freedom
## Residual deviance: 1792.7 on 1996 degrees of freedom
## AIC: 1800.7
##
## Number of Fisher Scoring iterations: 5
# modFit$fitted.values[1:6]
# modFit2$fitted.values[1:6]
# fitted values of a logistic model are probabilities between 0 and 1
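As a rough single-number fit summary from the deviances in the output above, one can compute McFadden's pseudo-R²:
# McFadden's pseudo-R^2: 1 - residual deviance / null deviance
# (about 1 - 1792.7/2206.6 = 0.19 given the summary above)
1 - modFit$deviance / modFit$null.deviance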
QOI: the expected value of having high income, \(E[Y]\) (i.e. the probability of earning more than 50K)
Steps for Simulating QOI
1. Simulate coefficients (estimation uncertainty)
2. Simulate Y
3. Calculate the quantities of interest given the simulated Y's
4. Repeat 1-3 many times
Step 1
vcov(modFit)
## (Intercept) Age Education Gender
## (Intercept) 0.1258896054 -9.901074e-04 -7.494513e-03 1.544499e-03
## Age -0.0009901074 2.047031e-05 1.375578e-05 -3.195114e-05
## Education -0.0074945132 1.375578e-05 6.421701e-04 -4.212469e-04
## Gender 0.0015444989 -3.195114e-05 -4.212469e-04 2.190285e-02
#####################################
# draw one set of coefficients from the
# sampling distribution, assuming
# (asymptotic) normality:
# rmvnorm(1, coefficients, vcov)
#####################################
sim_coef <- function(coef, vcov){
  rmvnorm(1, mean = coef, sigma = vcov)
}
sim_coef(modFit$coefficients, vcov(modFit))
## (Intercept) Age Education Gender
## [1,] -6.662965 0.03654244 0.4210437 -1.382391
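As a sanity check that such draws really reflect the estimation uncertainty, the empirical covariance of many draws should be close to vcov(modFit):
sims <- rmvnorm(5000, mean = modFit$coefficients, sigma = vcov(modFit))
# element-wise ratio of empirical to analytic covariance; entries should be near 1
round(cov(sims) / vcov(modFit), 2)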
Steps 2 & 3: simulate Y and calculate the expectation.
Set up for covariates:
-Gender: both levels, 0 (man) and 1 (woman)
-Age: the median (by gender)
-Education: 1 (min) to 16 (max)
# get median age by gender
data %>%
  group_by(Gender) %>%
  summarise(Age = median(Age)) -> res
res
## # A tibble: 2 x 2
## Gender Age
## <dbl> <dbl>
## 1 0 37
## 2 1 36
# set the education values and set up explanatory variables.
educ <- 1:16
tibble(Age = rep(res$Age, length(educ)),
       Education = rep(educ, each = 2),
       Gender = rep(res$Gender, length(educ))) -> tbl
tbl %>% arrange(Gender, Age, Education) -> sim_data
sim_data
## # A tibble: 32 x 3
## Age Education Gender
## <dbl> <int> <dbl>
## 1 37 1 0
## 2 37 2 0
## 3 37 3 0
## 4 37 4 0
## 5 37 5 0
## 6 37 6 0
## 7 37 7 0
## 8 37 8 0
## 9 37 9 0
## 10 37 10 0
## # ... with 22 more rows
Get a simulated coefficient set and obtain expected Y values.
# assign the draw to a new name (sim_beta) so the sim_coef() function is not overwritten
sim_beta <- sim_coef(modFit$coefficients, vcov(modFit))
# Given the simulated explanatory data and simulated coefficients,
# obtain a vector of p (a probability for each observation in the data),
# draw 1000 outcomes from this setup, and
# get the expected value of Y by averaging the outcomes.
sim_EY <- function(sim_data, sim_coef){
  # p = invlogit(beta_0 + x_1*beta_1 + ... + x_3*beta_3)
  # transpose sim_coef so the matrix multiplication conforms
  # outcome: a 32-long vector of probabilities
  p <- invlogit(cbind(1, as.matrix(sim_data)) %*% t(sim_coef))
  # Simulate Y 1000 times.
  # Analogy: tossing 32 coins 1000 times each with success probabilities p;
  # the outcome is the number of heads for each coin.
  Y <- rbinom(n = length(p), size = 1000, prob = p)
  # Average the draws to get the expected value for each of the 32 observations
  QOI <- Y / 1000
  return(QOI)
}
sim_EY(sim_data, sim_beta)
## [1] 0.009 0.021 0.032 0.044 0.047 0.082 0.098 0.140 0.198 0.256 0.321 0.374
## [13] 0.465 0.576 0.668 0.744 0.003 0.005 0.005 0.009 0.011 0.012 0.024 0.039
## [25] 0.051 0.078 0.092 0.132 0.189 0.239 0.320 0.412
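Since \(E[Y_i \mid p_i] = p_i\) for a Bernoulli outcome, averaging the 1000 binomial draws is just a Monte Carlo approximation of \(p\) itself; comparing the two directly:
p_direct <- as.vector(invlogit(cbind(1, as.matrix(sim_data)) %*% t(sim_beta)))
# the simulated expectations should track the direct probabilities closely
head(round(cbind(direct = p_direct, simulated = sim_EY(sim_data, sim_beta)), 3))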
Step 4: Repeat
sim_run <- function(fit, sim_data){
  # step 1: draw simulated coefficients
  sim_beta <- sim_coef(fit$coefficients, vcov(fit))
  # steps 2 & 3: simulate Y and compute the QOI
  QOI <- sim_EY(sim_data, sim_beta)
  return(QOI)
}
sim_results <- replicate(2000, sim_run(modFit, sim_data))
dim(sim_results)
## [1] 32 2000
Visualize
sim_data %>%
  ## add a new column `point` to summarize the simulations.
  mutate(point = rowMeans(sim_results)) %>%
  ## 1. get quantiles for each row: 2 x 32 matrix (apply quantile row-wise)
  ## 2. transpose the matrix
  ## 3. convert to a tibble
  ## 4. efficiently bind to the data frame by columns
  bind_cols(apply(sim_results, 1, quantile, c(0.025, 0.975)) %>% t() %>% as_tibble()) %>%
  ## factorize Gender for visualization
  mutate(Gender = if_else(Gender == 0, "Male", "Female")) -> results
results
## # A tibble: 32 x 6
## Age Education Gender point `2.5%` `97.5%`
## <dbl> <int> <chr> <dbl> <dbl> <dbl>
## 1 37 1 Male 0.0139 0.005 0.025
## 2 37 2 Male 0.0193 0.009 0.033
## 3 37 3 Male 0.0270 0.013 0.043
## 4 37 4 Male 0.0379 0.022 0.058
## 5 37 5 Male 0.0529 0.034 0.076
## 6 37 6 Male 0.0733 0.05 0.1
## 7 37 7 Male 0.101 0.073 0.131
## 8 37 8 Male 0.138 0.108 0.171
## 9 37 9 Male 0.184 0.151 0.220
## 10 37 10 Male 0.243 0.207 0.282
## # ... with 22 more rows
Plotting
g <- ggplot(results, aes(x = Education, y = point,
                         colour = Gender, shape = Gender)) +
  geom_point(size = 2) +
  geom_linerange(aes(ymin = `2.5%`, ymax = `97.5%`)) + # backticks refer to non-syntactic column names
  ylab("E[Y]")
g
Alternative plotting code
my_theme <- function(legend.position = "right") {
  t <- theme_bw() +
    theme(plot.title = element_text(hjust = 0.5),
          panel.grid = element_blank(),
          text = element_text(size = 14),
          legend.position = legend.position)
  return(t)
}
gg <- g +
  guides(colour = guide_legend(reverse = TRUE),
         shape = guide_legend(reverse = TRUE)) +
  my_theme(legend.position = c(0.1, 0.85))
gg
In the example above, Age was held constant at its median (computed by gender). The variable can instead be divided into several subgroups and the outcomes plotted within each subgroup. This shows how the expected probability of high income varies across age groups.
dim(data)
## [1] 2000 4
range(data$Age)
## [1] 17 90
################
# grouping Age
# Hmisc::cut2 with g=4 creates four quantile-based levels:
# [17,28) [28,38) [38,49) [49,90].
# Each age value belongs to one of the four groups.
################
data %>%
  mutate(ageCut = Hmisc::cut2(Age, g = 4)) %>%
  group_by(ageCut) %>%
  summarise(median = median(Age)) -> output
output
## # A tibble: 4 x 2
## ageCut median
## <fct> <dbl>
## 1 [17,28) 23
## 2 [28,38) 33
## 3 [38,49) 43
## 4 [49,90] 57
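A quick check that cut2() indeed split the data into quartile groups of roughly equal size (about 500 each, since n = 2000):
table(Hmisc::cut2(data$Age, g = 4))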
Set up for covariates:
-Gender: 1 (woman) and 0 (man)
-Age: the median age of each group
-Education: 1 (min) to 16 (max)
educ <- 1:16
age <- output$median
## create a data frame consisting of
## 4 age groups x 16 education levels x 2 genders:
## 3 columns (variables) and 4 x 16 x 2 = 128 rows (observations)
tibble(Age = rep(age, each = 16 * 2),
       Education = rep(educ, 2 * 4),
       Gender = rep(c(rep(0, 16), rep(1, 16)), 4)) -> tbl
tbl %>% arrange(Gender, Age, Education) -> sim_data2
head(sim_data2)
## # A tibble: 6 x 3
## Age Education Gender
## <dbl> <int> <dbl>
## 1 23 1 0
## 2 23 2 0
## 3 23 3 0
## 4 23 4 0
## 5 23 5 0
## 6 23 6 0
First, obtain the estimated coefficients from the model fit, using glm().
Second, draw simulated coefficients with rmvnorm(1, mean=fit$coef, sigma=vcov(fit)).
modFit <- glm(Income~Age + Education + Gender, data=data, family=binomial(link=logit))
sim_beta <- rmvnorm(1, mean = modFit$coef, sigma = vcov(modFit))
sim_beta
## (Intercept) Age Education Gender
## [1,] -5.831713 0.03618881 0.3300587 -0.9767984
# Given dataset and coefficients, get the expected income.
QOI <- sim_EY(sim_data2, sim_beta)
QOI[1:6]
## [1] 0.013 0.016 0.022 0.026 0.038 0.046
length(QOI)
## [1] 128
sim_run2 <- function(fit, sim_data){
  # 1. draw simulated coefficients (note: use fit, not the global modFit)
  sim_coef <- rmvnorm(1, mean = fit$coefficients, sigma = vcov(fit))
  # 2 & 3. simulate Y and get the expected outcomes
  QOI <- sim_EY(sim_data, sim_coef)
  return(QOI)
}
sim_results <- replicate(2000, sim_run2(modFit, sim_data2))
dim(sim_results)
## [1] 128 2000
# Prep simulation results for plotting.
# map a median age back to its age-group label
# (look up output$median and return the matching output$ageCut)
getLevel <- function(num){
  i <- match(num, output$median)
  paste(output$ageCut[i], ", median: ", num)
}
sim_data2 %>%
  ## add a new column `Income` to summarize the simulations.
  mutate(Income = rowMeans(sim_results)) %>%
  ## get the 95% CI for each observation and add the limits as new columns.
  bind_cols(apply(sim_results, 1, quantile, c(0.025, 0.975)) %>% t() %>% as_tibble()) %>%
  ## factorize Gender and attach age-group labels for visualization
  mutate(Gender = if_else(Gender == 0, "Male", "Female"),
         AgeGroup = sapply(Age, getLevel)) -> results
head(results)
## # A tibble: 6 x 7
## Age Education Gender Income `2.5%` `97.5%` AgeGroup
## <dbl> <int> <chr> <dbl> <dbl> <dbl> <chr>
## 1 23 1 Male 0.00793 0.002 0.016 [17,28) , median: 23
## 2 23 2 Male 0.0111 0.004 0.021 [17,28) , median: 23
## 3 23 3 Male 0.0156 0.00698 0.028 [17,28) , median: 23
## 4 23 4 Male 0.0221 0.011 0.036 [17,28) , median: 23
## 5 23 5 Male 0.0310 0.017 0.049 [17,28) , median: 23
## 6 23 6 Male 0.0432 0.026 0.063 [17,28) , median: 23
p <- ggplot(data = results, aes(x = Education, y = Income, color = Gender, shape = Gender)) +
  geom_point(size = 2) +
  geom_linerange(aes(ymin = `2.5%`, ymax = `97.5%`)) +
  geom_ribbon(aes(ymin = `2.5%`, ymax = `97.5%`, fill = Gender), alpha = 0.3) +
  ylab("Probability of earning 50K or more") +
  ggtitle("Simulated Probability of High Income by Education across Age Groups") +
  facet_wrap(~AgeGroup)
p
My observations from the plot
On average, the predicted probability of earning more than 50K for college-educated men is above 0.5 even in the youngest group. In contrast, for college-educated women the predicted probability is still below 0.5 in the third age group. This suggests a large income gap between college-educated men and women.
The difference in expected probabilities between high-school-educated and college-educated men tends to decrease with age; visually, in the fourth age group the curve for men appears to reach a plateau.
While the spread of the estimates tends to decrease among men as they age, it tends to increase among women.
Model Coefficients (log of odds ratio)
logOR <- summary(modFit)$coef[, 1] # coef(modFit)
logOR
## (Intercept) Age Education Gender
## -6.1469636 0.0400101 0.3530143 -1.3281006
Odds Ratios (exponentiated coefficients)
exp(logOR)
## (Intercept) Age Education Gender
## 0.00213997 1.04082128 1.42335149 0.26498008
My interpretation of coefficients
1. Age
Log of odds ratio for age = 0.04:
For every one-unit increase in age, the log odds of having high income increase by 0.04.
Odds ratio = 1.04:
For a unit increase in age, the odds of earning high income (vs. low income) increase by a factor of 1.04.
2. Education
Log of odds ratio for education = 0.35:
For every one-unit increase in education, the log odds of having high income increase by 0.35.
Odds ratio = 1.42:
For a unit increase in education, the odds of earning high income (vs. low income) increase by a factor of 1.42.
3. Gender
Odds ratio = 0.26:
For females, the odds of earning high income (vs. low income) are multiplied by a factor of 0.26 relative to males, i.e. they are about 74% lower.
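The same odds ratios can also be reported as percent changes in the odds, which is often easier to communicate:
# (OR - 1) * 100 = percent change in the odds per one-unit increase
round((exp(logOR) - 1) * 100, 1)
## (Intercept)         Age   Education      Gender
##       -99.8         4.1        42.3       -73.5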
Plotting the observed data along with the fitted model
data %>%
  mutate(Gender = if_else(Gender == 0, "Male", "Female")) -> data1
ggplot(data1, aes(x = Education, y = modFit$fitted, color = Age)) +
  geom_point() +
  xlab("Education") +
  ylab("Prob of high income") +
  geom_smooth(aes(x = Education, y = modFit$fitted), color = "magenta") +
  facet_wrap(~Gender)
Confidence Intervals of coefficients
round(confint(modFit), 3) %>% knitr::kable()
|  | 2.5 % | 97.5 % |
|---|---|---|
| (Intercept) | -6.856 | -5.465 |
| Age | 0.031 | 0.049 |
| Education | 0.304 | 0.404 |
| Gender | -1.624 | -1.044 |
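Exponentiating these limits gives confidence intervals on the odds-ratio scale; for example, the Gender interval becomes roughly (0.20, 0.35).
round(exp(confint(modFit)), 3) %>% knitr::kable()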