The standard notation of the linear regression model has two components:
\(Y_i = X_i\beta + \epsilon_i\) – systematic component
\(\epsilon_i \thicksim N(0, \sigma^2)\) – stochastic component
Alternatively, the following notation can be used:
\(Y_i \thicksim N(\mu_i, \sigma^2)\) – stochastic
\(\mu_i = X_i\beta\) – systematic
These notations can be generalized as follows:
\(Y_i \thicksim f(\theta_i, \alpha)\)
\(\theta_i = g(X_i, \beta)\)
where \(f()\): a probability density function,
\(\theta_i\): a systematic feature of the distribution that varies across observations (e.g. mean, variance),
\(\alpha\): ancillary parameter, held constant,
\(\beta\): effect parameters.
It appears that the GLM builds on this generalized notation. A GLM has three components:
Random distribution (normal, Poisson, binomial, etc.) [stochastic component]
Linear predictor: \(\beta_0 + \beta_1 X_{i1} + \dots + \beta_k X_{ik}\), from which I build a model [systematic component]
Link function \(g\), which connects the mean \(\mu = E[Y]\) to the linear predictor
Note: Here the linearity is in the coefficients (not the variables); this is what defines a linear model. \(\alpha X_{1}^2 + \beta X_2^2\) is still a linear model, but \(\alpha^2 X_1 + \beta X_2\) is not.
In the case of an ordinary linear regression model, where the outcome/response variable has a normal distribution, the link is the identity function: \(g(\mu)=\beta_0+ \beta_1 X_{i1}+ \dots + \beta_k X_{ik}= \sum_{j=0}^{k} \beta_j X_{ij}=\mu\), where \(X_{i0}=1\).
In the case of a binary outcome variable, however, the identity link does not work, primarily because the outcome variable is restricted to 0 or 1 (so its mean is a probability between 0 and 1). In this case, the expected value \(\mu\) equals \(\pi\) (a probability). It turns out that the appropriate link function is the log odds:
\[g(\pi) = \log\left(\frac {\pi}{1-\pi}\right)\] This is called the logit.
\[g(\pi)= \log\left(\frac {\pi}{1-\pi}\right) = \sum_{j=0}^{k} \beta_j X_{ij} \]
Let's simply write \(\sum_{j=0}^{k} \beta_j X_{ij}\) as \(\beta X_i\), with \(X_{i0}=1\) (constant). Then,
\[\pi = \frac {e^{ \beta X_i}}{1+ e^{ \beta X_i}}= \frac {1}{1 + e^{- \beta X_i}}\] Stated differently, \(\pi\) is given by the inverse link function \(g^{-1}\):
\[\pi = g^{-1}(\beta X_i)\]
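As a quick numeric check of this link/inverse-link pair: base R's qlogis() and plogis() implement the logit and the inverse logit, so applying one after the other recovers the original probabilities.
p <- c(0.1, 0.5, 0.9)
eta <- qlogis(p)   # the link: log(p / (1 - p))
plogis(eta)        # the inverse link: exp(eta) / (1 + exp(eta)) recovers p
## [1] 0.1 0.5 0.9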
In the univariate logistic model, the link function is defined as
\[g(X_i) = \log\left(\frac {p}{1-p}\right) = \beta_0 + \beta_1X_i\]
Estimated odds: \(\hat O = \frac{p}{1-p} = e^{\beta_0 + \beta_1X_i}\)
Estimated probability: \(\hat p=\frac{e^{\beta_0 + \beta_1 X_i}}{1+e^{\beta_0 + \beta_1 X_i}}\)
\(\beta_0\): log(odds) when \(X_i=0\)
\(\beta_1\): log(odds ratio), i.e. the change in the log odds for every one-unit increase in \(X_i\)
Proof:
Let \(g(X_i)=\log\left(\frac {p_i}{1-p_i}\right) = \beta_0 + \beta_1 X_i\) and
\(g(X_i +1) = \log\left(\frac {p_j}{1-p_j}\right) = \beta_0 + \beta_1 (X_i + 1)\)
Then,
\(g(X_i +1)-g(X_i) = \log\left(\frac {p_j/(1-p_j)}{p_i/(1-p_i)}\right) = \beta_1\)
Thus,
\(e^{\beta_1}\): the odds ratio corresponding to a one-unit increase in \(X_i\)
\(e^{\beta_1}= \frac{\frac {p_j}{1-p_j}}{\frac {p_i}{1-p_i}}\)
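A quick numeric check of this identity, with made-up values \(\beta_0 = -1\), \(\beta_1 = 0.5\), \(X_i = 2\):
b0 <- -1; b1 <- 0.5; x <- 2
p_i <- plogis(b0 + b1 * x)         # p at X_i = x
p_j <- plogis(b0 + b1 * (x + 1))   # p at X_i = x + 1
(p_j / (1 - p_j)) / (p_i / (1 - p_i))  # the odds ratio between the two
## [1] 1.648721
exp(b1)                                # identical to e^beta_1
## [1] 1.648721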
Why is an odds ratio important? What does it tell me? Suppose \(X_i\) is a female indicator (1 = female, 0 = male). Then:
\(\beta_0\) indicates the log(odds) for males.
\(\beta_1\) indicates the log of the odds ratio between the female and male groups.
\(e^{\beta_1}\) is the odds ratio between the groups (Odds(female)/Odds(male)).
So what?
Odds tell us the likelihood of an event occurring against its not occurring. If the probability of rain is 0.3, the odds of rain are 0.3/0.7 = 3/7. If two teams compete in a game and have an equal chance of winning, the odds for each team are 1, and the odds ratio between the two teams is 1.
If team A's probability of winning is 0.6 and team B's is 0.4, then Odds(A) = 6/4 and Odds(B) = 4/6. The odds ratio (A/B) = \(\frac{6/4}{4/6} = (6/4)^2\) = Odds(A)\(^2\) = 2.25.
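The same arithmetic in R:
pA <- 0.6; pB <- 0.4
oddsA <- pA / (1 - pA)  # 1.5
oddsB <- pB / (1 - pB)  # 0.667
oddsA / oddsB           # odds ratio = Odds(A)^2
## [1] 2.25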
In the case of the logistic model above, the odds ratio between the two groups is equal to \(e^{\beta_1}\). If \(e^{\beta_1}=1\), odds(male) and odds(female) are equal, so there is no difference between the two groups. As a rule of thumb, an odds ratio (\(e^{\beta_1}\)) below 0.5 or above 2 indicates a moderate effect in an experimental study.
Example: predicting the probability of graduating within a year based on a test score.
\(\beta_1\) indicates the change in the log odds when \(X_i\) increases by 1. It is the log of the odds ratio corresponding to a one-unit increase in \(X_i\).
If \(e^{\beta_1}=1.15\), a one-point increase in the test score multiplies the odds of graduating within a year by 1.15, i.e. it increases the odds by 15%. (In general, the percent change in the odds is \((e^{\beta_1}-1)\times 100\%\).)
library(mvtnorm)
## Warning: package 'mvtnorm' was built under R version 4.0.3
library(tidyverse)
library(ggplot2)
set.seed(2020-10-23)
data <- read_csv("data.csv")
data %>% head(2)
## # A tibble: 2 x 4
## Age Education Gender Income
## <dbl> <dbl> <chr> <chr>
## 1 25 7 Male <=50K
## 2 38 9 Male <=50K
## mutate explanatory variables:
## Income (chr) into a numeric variable (1 if over 50K; 0 otherwise)
## Gender: female = 1, male = 0
## if_else() (dplyr) is stricter than ifelse(): faster and more predictable.
data %>%
  mutate(Gender = if_else(Gender == "Female", 1, 0),
         Income = if_else(Income == ">50K", 1, 0)) -> data
head(data) %>% knitr::kable()
| Age | Education | Gender | Income |
|---|---|---|---|
| 25 | 7 | 0 | 0 |
| 38 | 9 | 0 | 0 |
| 28 | 12 | 0 | 1 |
| 44 | 10 | 0 | 1 |
| 18 | 10 | 1 | 0 |
| 34 | 6 | 0 | 0 |
## multivariate model
## prepend a column of 1s for the intercept
X <- as.matrix(cbind(1, data[, c("Age", "Education", "Gender")]))
# inverse logit (inverse link) function: g^-1(X*beta) = p
# (the argument x stands for the linear predictor X*beta)
invlogit <- function(x) { exp(x) / (1 + exp(x)) }
logit_llk <- function(par, y, X) {
  # p_i = invlogit(X_i %*% par), the probability for each observation
  p <- invlogit(X %*% par)
  # log-likelihood of the logit model:
  # llk(beta | y) = sum_i [ y_i*log(p_i) + (1 - y_i)*log(1 - p_i) ]
  return(sum(y * log(p) + (1 - y) * log(1 - p)))
}
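One caveat with the hand-rolled invlogit(): exp(x) overflows for large x, so the ratio returns NaN far from zero. Base R's plogis() computes the same inverse logit in a numerically stable way and could be used instead:
invlogit(1000)  # NaN: exp(1000) overflows to Inf, and Inf/Inf is NaN
## [1] NaN
plogis(1000)    # stable implementation of the same function
## [1] 1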
opt <- optim(par = rep(0.1, 4),
             fn = logit_llk,
             y = data$Income, X = X,
             control = list(fnscale = -1),
             method = "BFGS")
coefs <- opt$par
names(coefs) <- c("Intercept", "Age", "Education", "Gender")
coefs
## Intercept Age Education Gender
## -6.14746316 0.04000189 0.35307870 -1.32727551
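As a sketch of where the standard errors reported by glm() below come from: re-running optim() with hessian = TRUE and inverting the negative Hessian of the log-likelihood (the observed information) gives the variance-covariance matrix of the estimates.
opt_h <- optim(par = rep(0.1, 4),
               fn = logit_llk,
               y = data$Income, X = X,
               control = list(fnscale = -1),
               method = "BFGS",
               hessian = TRUE)
# standard errors = square roots of the diagonal of the inverted information;
# these should be close to the Std. Error column reported by glm() below.
sqrt(diag(solve(-opt_h$hessian)))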
The glm() function simplifies this process:
modFit <- glm(Income~Age + Education + Gender, data=data, family=binomial(link=logit))
# modFit2 <- glm(Income~Age + Education + Gender, data=data, family=binomial) # same.
summary(modFit)$coef
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.1469636 0.354809252 -17.324699 3.062916e-67
## Age 0.0400101 0.004524413 8.843158 9.305036e-19
## Education 0.3530143 0.025341076 13.930517 4.133499e-44
## Gender -1.3281006 0.147996099 -8.973889 2.862246e-19
summary(modFit)$coef[, 1]
## (Intercept) Age Education Gender
## -6.1469636 0.0400101 0.3530143 -1.3281006
summary(modFit)
##
## Call:
## glm(formula = Income ~ Age + Education + Gender, family = binomial(link = logit),
## data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0749 -0.6852 -0.4575 -0.1497 2.7508
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.146964 0.354809 -17.325 <2e-16 ***
## Age 0.040010 0.004524 8.843 <2e-16 ***
## Education 0.353014 0.025341 13.931 <2e-16 ***
## Gender -1.328101 0.147996 -8.974 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2206.6 on 1999 degrees of freedom
## Residual deviance: 1792.7 on 1996 degrees of freedom
## AIC: 1800.7
##
## Number of Fisher Scoring iterations: 5
# modFit$fitted.values[1:6]
# modFit2$fitted.values[1:6]
# fitted values of a logistic model are probabilities between 0 and 1
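As a rough single-number fit summary from the deviances in the output above, one can compute McFadden's pseudo-R²:
# McFadden's pseudo-R^2: 1 - residual deviance / null deviance
# (about 1 - 1792.7/2206.6 = 0.19 given the summary above)
1 - modFit$deviance / modFit$null.deviance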
QOI: the expected value of having high income, \(E[Y]\) (i.e. the probability of earning more than 50K)
Steps for Simulating QOI
1. Simulate coefficients (estimation uncertainty)
2. Simulate Y
3. Calculate the quantities of interest given the simulated Y's
4. Repeat 1-3 many times
Step 1
vcov(modFit)
## (Intercept) Age Education Gender
## (Intercept) 0.1258896054 -9.901074e-04 -7.494513e-03 1.544499e-03
## Age -0.0009901074 2.047031e-05 1.375578e-05 -3.195114e-05
## Education -0.0074945132 1.375578e-05 6.421701e-04 -4.212469e-04
## Gender 0.0015444989 -3.195114e-05 -4.212469e-04 2.190285e-02
#####################################
# draw one set of coefficients from the
# sampling distribution, assuming
# (asymptotic) normality:
# rmvnorm(1, coefficients, vcov)
#####################################
sim_coef <- function(coef, vcov){
  rmvnorm(1, mean = coef, sigma = vcov)
}
sim_coef(modFit$coefficients, vcov(modFit))
## (Intercept) Age Education Gender
## [1,] -6.662965 0.03654244 0.4210437 -1.382391
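As a sanity check that such draws really reflect the estimation uncertainty, the empirical covariance of many draws should be close to vcov(modFit):
sims <- rmvnorm(5000, mean = modFit$coefficients, sigma = vcov(modFit))
# element-wise ratio of empirical to analytic covariance; entries should be near 1
round(cov(sims) / vcov(modFit), 2)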
Steps 2 & 3: simulate Y and calculate the expectation.
Set up for covariates:
-Gender: both levels, 0 (man) and 1 (woman)
-Age: the median (by gender)
-Education: 1 (min) to 16 (max)
# get median age by gender
data %>%
  group_by(Gender) %>%
  summarise(Age = median(Age)) -> res
res
## # A tibble: 2 x 2
## Gender Age
## <dbl> <dbl>
## 1 0 37
## 2 1 36
# set the education values and set up explanatory variables.
educ <- 1:16
tibble(Age = rep(res$Age, length(educ)),
       Education = rep(educ, each = 2),
       Gender = rep(res$Gender, length(educ))) -> tbl
tbl %>% arrange(Gender, Age, Education) -> sim_data
sim_data
## # A tibble: 32 x 3
## Age Education Gender
## <dbl> <int> <dbl>
## 1 37 1 0
## 2 37 2 0
## 3 37 3 0
## 4 37 4 0
## 5 37 5 0
## 6 37 6 0
## 7 37 7 0
## 8 37 8 0
## 9 37 9 0
## 10 37 10 0
## # ... with 22 more rows
Get a simulated coefficient set and obtain expected Y values.
# assign the draw to a new name (sim_beta) so the sim_coef() function is not overwritten
sim_beta <- sim_coef(modFit$coefficients, vcov(modFit))
# Given the simulated explanatory data and simulated coefficients,
# obtain a vector of p (a probability for each observation in the data),
# draw 1000 outcomes from this setup, and
# get the expected value of Y by averaging the outcomes.
sim_EY <- function(sim_data, sim_coef){
  # p = invlogit(beta_0 + x_1*beta_1 + ... + x_3*beta_3)
  # transpose sim_coef so the matrix multiplication conforms
  # outcome: a 32-long vector of probabilities
  p <- invlogit(cbind(1, as.matrix(sim_data)) %*% t(sim_coef))
  # Simulate Y 1000 times.
  # Analogy: tossing 32 coins 1000 times each with success probabilities p;
  # the outcome is the number of heads for each coin.
  Y <- rbinom(n = length(p), size = 1000, prob = p)
  # Average the draws to get the expected value for each of the 32 observations
  QOI <- Y / 1000
  return(QOI)
}
sim_EY(sim_data, sim_beta)
## [1] 0.009 0.021 0.032 0.044 0.047 0.082 0.098 0.140 0.198 0.256 0.321 0.374
## [13] 0.465 0.576 0.668 0.744 0.003 0.005 0.005 0.009 0.011 0.012 0.024 0.039
## [25] 0.051 0.078 0.092 0.132 0.189 0.239 0.320 0.412
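Since \(E[Y_i \mid p_i] = p_i\) for a Bernoulli outcome, averaging the 1000 binomial draws is just a Monte Carlo approximation of \(p\) itself; comparing the two directly:
p_direct <- as.vector(invlogit(cbind(1, as.matrix(sim_data)) %*% t(sim_beta)))
# the simulated expectations should track the direct probabilities closely
head(round(cbind(direct = p_direct, simulated = sim_EY(sim_data, sim_beta)), 3))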
Step 4: Repeat
sim_run <- function(fit, sim_data){
  # step 1: draw simulated coefficients
  sim_beta <- sim_coef(fit$coefficients, vcov(fit))
  # steps 2 & 3: simulate Y and compute the QOI
  QOI <- sim_EY(sim_data, sim_beta)
  return(QOI)
}
sim_results <- replicate(2000, sim_run(modFit, sim_data))
dim(sim_results)
## [1] 32 2000
Visualize
sim_data %>%
  ## add a new column `point` to summarize the simulations.
  mutate(point = rowMeans(sim_results)) %>%
  ## 1. get quantiles for each row: 2 x 32 matrix (apply quantile row-wise)
  ## 2. transpose the matrix
  ## 3. convert to a tibble
  ## 4. efficiently bind to the data frame by columns
  bind_cols(apply(sim_results, 1, quantile, c(0.025, 0.975)) %>% t() %>% as_tibble()) %>%
  ## factorize Gender for visualization
  mutate(Gender = if_else(Gender == 0, "Male", "Female")) -> results
results
## # A tibble: 32 x 6
## Age Education Gender point `2.5%` `97.5%`
## <dbl> <int> <chr> <dbl> <dbl> <dbl>
## 1 37 1 Male 0.0139 0.005 0.025
## 2 37 2 Male 0.0193 0.009 0.033
## 3 37 3 Male 0.0270 0.013 0.043
## 4 37 4 Male 0.0379 0.022 0.058
## 5 37 5 Male 0.0529 0.034 0.076
## 6 37 6 Male 0.0733 0.05 0.1
## 7 37 7 Male 0.101 0.073 0.131
## 8 37 8 Male 0.138 0.108 0.171
## 9 37 9 Male 0.184 0.151 0.220
## 10 37 10 Male 0.243 0.207 0.282
## # ... with 22 more rows
Plotting
g <- ggplot(results, aes(x = Education, y = point,
                         colour = Gender, shape = Gender)) +
  geom_point(size = 2) +
  geom_linerange(aes(ymin = `2.5%`, ymax = `97.5%`)) + # backticks refer to non-syntactic column names
  ylab("E[Y]")
g
Alternative plotting code
my_theme <- function(legend.position = "right") {
  t <- theme_bw() +
    theme(plot.title = element_text(hjust = 0.5),
          panel.grid = element_blank(),
          text = element_text(size = 14),
          legend.position = legend.position)
  return(t)
}
gg <- g +
  guides(colour = guide_legend(reverse = TRUE),
         shape = guide_legend(reverse = TRUE)) +
  my_theme(legend.position = c(0.1, 0.85))
gg
In the example above, Age was held constant at its median (computed by gender). The variable can instead be divided into several subgroups and the outcomes plotted within each subgroup. This shows how the expected probability of high income varies across age groups.
dim(data)
## [1] 2000 4
range(data$Age)
## [1] 17 90
################
# grouping Age
# Hmisc::cut2 with g=4 creates four quantile-based levels:
# [17,28) [28,38) [38,49) [49,90].
# Each age value belongs to one of the four groups.
################
data %>%
  mutate(ageCut = Hmisc::cut2(Age, g = 4)) %>%
  group_by(ageCut) %>%
  summarise(median = median(Age)) -> output
output
## # A tibble: 4 x 2
## ageCut median
## <fct> <dbl>
## 1 [17,28) 23
## 2 [28,38) 33
## 3 [38,49) 43
## 4 [49,90] 57
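A quick check that cut2() indeed split the data into quartile groups of roughly equal size (about 500 each, since n = 2000):
table(Hmisc::cut2(data$Age, g = 4))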
Set up for covariates:
-Gender: 1 (woman) and 0 (man)
-Age: the median age of each group
-Education: 1 (min) to 16 (max)
educ <- 1:16
age <- output$median
## create a data frame consisting of
## 4 age groups x 16 education levels x 2 genders:
## 3 columns (variables) and 4 x 16 x 2 = 128 rows (observations)
tibble(Age = rep(age, each = 16 * 2),
       Education = rep(educ, 2 * 4),
       Gender = rep(c(rep(0, 16), rep(1, 16)), 4)) -> tbl
tbl %>% arrange(Gender, Age, Education) -> sim_data2
head(sim_data2)
## # A tibble: 6 x 3
## Age Education Gender
## <dbl> <int> <dbl>
## 1 23 1 0
## 2 23 2 0
## 3 23 3 0
## 4 23 4 0
## 5 23 5 0
## 6 23 6 0
First, obtain the estimated coefficients from the model fit, using glm().
Second, draw simulated coefficients with rmvnorm(1, mean=fit$coef, sigma=vcov(fit)).
modFit <- glm(Income~Age + Education + Gender, data=data, family=binomial(link=logit))
sim_beta <- rmvnorm(1, mean = modFit$coef, sigma = vcov(modFit))
sim_beta
## (Intercept) Age Education Gender
## [1,] -5.831713 0.03618881 0.3300587 -0.9767984
# Given dataset and coefficients, get the expected income.
QOI <- sim_EY(sim_data2, sim_beta)
QOI[1:6]
## [1] 0.013 0.016 0.022 0.026 0.038 0.046
length(QOI)
## [1] 128
sim_run2 <- function(fit, sim_data){
  # 1. draw simulated coefficients (note: use fit, not the global modFit)
  sim_coef <- rmvnorm(1, mean = fit$coefficients, sigma = vcov(fit))
  # 2 & 3. simulate Y and get the expected outcomes
  QOI <- sim_EY(sim_data, sim_coef)
  return(QOI)
}
sim_results <- replicate(2000, sim_run2(modFit, sim_data2))
dim(sim_results)
## [1] 128 2000
# Prep simulation results for plotting.
# map a median age back to its age-group label
# (look up output$median and return the matching output$ageCut)
getLevel <- function(num){
  i <- match(num, output$median)
  paste(output$ageCut[i], ", median: ", num)
}
sim_data2 %>%
  ## add a new column `Income` to summarize the simulations.
  mutate(Income = rowMeans(sim_results)) %>%
  ## get the 95% CI for each observation and add the limits as new columns.
  bind_cols(apply(sim_results, 1, quantile, c(0.025, 0.975)) %>% t() %>% as_tibble()) %>%
  ## factorize Gender and attach age-group labels for visualization
  mutate(Gender = if_else(Gender == 0, "Male", "Female"),
         AgeGroup = sapply(Age, getLevel)) -> results
head(results)
## # A tibble: 6 x 7
## Age Education Gender Income `2.5%` `97.5%` AgeGroup
## <dbl> <int> <chr> <dbl> <dbl> <dbl> <chr>
## 1 23 1 Male 0.00793 0.002 0.016 [17,28) , median: 23
## 2 23 2 Male 0.0111 0.004 0.021 [17,28) , median: 23
## 3 23 3 Male 0.0156 0.00698 0.028 [17,28) , median: 23
## 4 23 4 Male 0.0221 0.011 0.036 [17,28) , median: 23
## 5 23 5 Male 0.0310 0.017 0.049 [17,28) , median: 23
## 6 23 6 Male 0.0432 0.026 0.063 [17,28) , median: 23
p <- ggplot(data = results, aes(x = Education, y = Income, color = Gender, shape = Gender)) +
  geom_point(size = 2) +
  geom_linerange(aes(ymin = `2.5%`, ymax = `97.5%`)) +
  geom_ribbon(aes(ymin = `2.5%`, ymax = `97.5%`, fill = Gender), alpha = 0.3) +
  ylab("Probability of earning 50K or more") +
  ggtitle("Simulated Probability of High Income by Education across Age Groups") +
  facet_wrap(~AgeGroup)
p
My observations from the plot
On average, the predicted probability of earning more than 50K for college-educated men is above 0.5 even in the youngest group. In contrast, for college-educated women the predicted probability is still below 0.5 in the third age group. This suggests a large income gap between college-educated men and women.
The difference in expected probabilities between high-school-educated and college-educated men tends to decrease with age; visually, in the fourth age group the curve for men appears to reach a plateau.
While the spread of the estimates tends to decrease among men as they age, it tends to increase among women.
Model Coefficients (log of odds ratio)
logOR <- summary(modFit)$coef[, 1] # coef(modFit)
logOR
## (Intercept) Age Education Gender
## -6.1469636 0.0400101 0.3530143 -1.3281006
Odds Ratios (exponentiated coefficients)
exp(logOR)
## (Intercept) Age Education Gender
## 0.00213997 1.04082128 1.42335149 0.26498008
My interpretation of coefficients
1. Age
Log of odds ratio for age = 0.04:
For every one-unit increase in age, the log odds of having high income increase by 0.04.
Odds ratio = 1.04:
For a unit increase in age, the odds of earning high income (vs. low income) increase by a factor of 1.04.
2. Education
Log of odds ratio for education = 0.35:
For every one-unit increase in education, the log odds of having high income increase by 0.35.
Odds ratio = 1.42:
For a unit increase in education, the odds of earning high income (vs. low income) increase by a factor of 1.42.
3. Gender
Odds ratio = 0.26:
For females, the odds of earning high income (vs. low income) are multiplied by a factor of 0.26 relative to males, i.e. they are about 74% lower.
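The same odds ratios can also be reported as percent changes in the odds, which is often easier to communicate:
# (OR - 1) * 100 = percent change in the odds per one-unit increase
round((exp(logOR) - 1) * 100, 1)
## (Intercept)         Age   Education      Gender
##       -99.8         4.1        42.3       -73.5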
Plotting the observed data along with the fitted model
data %>%
  mutate(Gender = if_else(Gender == 0, "Male", "Female")) -> data1
ggplot(data1, aes(x = Education, y = modFit$fitted, color = Age)) +
  geom_point() +
  xlab("Education") +
  ylab("Prob of high income") +
  geom_smooth(aes(x = Education, y = modFit$fitted), color = "magenta") +
  facet_wrap(~Gender)
Confidence Intervals of coefficients
round(confint(modFit), 3) %>% knitr::kable()
|  | 2.5 % | 97.5 % |
|---|---|---|
| (Intercept) | -6.856 | -5.465 |
| Age | 0.031 | 0.049 |
| Education | 0.304 | 0.404 |
| Gender | -1.624 | -1.044 |
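Exponentiating these limits gives confidence intervals on the odds-ratio scale; for example, the Gender interval becomes roughly (0.20, 0.35).
round(exp(confint(modFit)), 3) %>% knitr::kable()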