rm(list = ls())
library(tidyverse)
setwd("~/Desktop/Econometrics Method Papers")
dat <- read.csv("sample.csv", stringsAsFactors = FALSE)
The probability of an event or occurence happening is denoted as:
p
The probability of that event not happening is therefore:
1 - p
The odds thereof is defined as:
p / (1 - p)
Suppose there are two such events (e.g. probability of going to the movie for persons A and B), then the odds ratio is:
[p / (1 - p)] / [q / (1 - q)]
And the log of odds ratio or logit is:
log[p / (1 - p)] / [q / (1 - q)]
In the context of logistic regression, the outcome variable by itself does not have a linear relationship with the predicator variables.
However, its logit transformation does have a linear relationship with the predicator variables.
Example. Probability of success is 0.8 Then, the probability of failure is 1 - 0.8 = 0.2
Then, the odds of success is p(success) / p (failure) = 0.8 / 0.2 = 4 In words: “the odds of success are 4 to 1”
Transformation from probability to odds is a monotonic transformation.
p ranges from 0 to 1 odds range from 0 to +ve logit ranges from -ve to +ve ***
The logistic regression models the logit-transformed p as a linear relationship with the predicator variables:
logit(p) := log(p / (1 - p)) = b0 + b1 * x1 + ... bk * xk
To convert back in terms of p:
log(p / (1 - p)) = b0 + b1 * x1 + ... bk * xk
p / (1 - p) = exp(b0 + b1 * x1 + ... bk * xk)
(1 - p) / p = 1 / exp(b0 + b1 * x1 + ... bk * xk)
1 / p = [1 + exp(b0 + b1 * x1 + ... bk * xk)] / exp(b0 + b1 * x1 + ... bk * xk)
Therefore, p = exp(b0 + b1 * x1 + ... bk * xk) / [1 + exp(b0 + b1 * x1 + ... bk * xk)]
p <- seq(from = 0.001, to = 0.999, by = 0.005)
odds = p / (1 - p)
ggplot(data = dat,
aes(x = p, y = odds)) +
geom_line()
This is also a monotonic transformation
logit = log(odds)
ggplot(data = dat,
aes(x = odds, y = logit)) +
geom_line()
rm(list = ls.str(mode = 'numeric'))
p := prob(hon == 1)
Equation 1 is:
logit(p) = b0
eq1 <- glm(hon ~ 1,
data = dat,
family = "binomial")
summary(eq1)
Call:
glm(formula = hon ~ 1, family = "binomial", data = dat)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.7497 -0.7497 -0.7497 -0.7497 1.6772
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.1255 0.1644 -6.845 7.62e-12 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 222.71 on 199 degrees of freedom
Residual deviance: 222.71 on 199 degrees of freedom
AIC: 224.71
Number of Fisher Scoring iterations: 4
paste("This means logit(p) equals", eq1$coefficients[1])
[1] "This means logit(p) equals -1.12545953870146"
paste("hence, p equals", exp(eq1$coefficients[1]) / (1 + exp(eq1$coefficients[1])), "which is the overall probability of being in honours class")
[1] "hence, p equals 0.245000000000525 which is the overall probability of being in honours class"
Frequency table for hon
dat %>%
group_by(hon) %>%
summarise(Freq. = n()) %>%
mutate(Percent = Freq. / nrow(.),
Cum. = cumsum(100 * (Percent / sum(Percent)))) %>%
janitor::adorn_totals(where = "row") %>%
ungroup()
hon Freq. Percent Cum.
0 151 75.5 75.5
1 49 24.5 100.0
Total 200 100.0 175.5
From frequency table
p = 49 / 200
odds = p / (1 - p)
logit = log(odds)
paste("p is", p)
[1] "p is 0.245"
paste("odds is", odds)
[1] "odds is 0.324503311258278"
paste("the logit is", logit)
[1] "the logit is -1.1254595387043"
paste("note the logit equals to the intercept from eq1 which is", eq1$coefficients[1])
[1] "note the logit equals to the intercept from eq1 which is -1.12545953870146"
rm(list = ls.str(mode = 'numeric'))
Equation 2 is:
logit(p) = b0 + b1 * female
eq2 <- glm(hon ~ female,
data = dat,
family = "binomial")
summary(eq2)
Call:
glm(formula = hon ~ female, family = "binomial", data = dat)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.8337 -0.8337 -0.6431 -0.6431 1.8317
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.4709 0.2690 -5.469 4.53e-08 ***
female 0.5928 0.3414 1.736 0.0825 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 222.71 on 199 degrees of freedom
Residual deviance: 219.61 on 198 degrees of freedom
AIC: 223.61
Number of Fisher Scoring iterations: 4
Cross tabulation between hon and female
gmodels::CrossTable(dat$hon, dat$female,
prop.r = FALSE,
prop.c = FALSE,
prop.t = FALSE,
prop.chisq = FALSE)
Cell Contents
|-------------------------|
| N |
|-------------------------|
Total Observations in Table: 200
| dat$female
dat$hon | 0 | 1 | Row Total |
-------------|-----------|-----------|-----------|
0 | 74 | 77 | 151 |
-------------|-----------|-----------|-----------|
1 | 17 | 32 | 49 |
-------------|-----------|-----------|-----------|
Column Total | 91 | 109 | 200 |
-------------|-----------|-----------|-----------|
Using above table
odds_m = (17 / 91) / (74 / 91)
odds_f = (32 / 109) / (77 / 109)
odds_ratio = odds_f / odds_m
paste("the odds of males in honours is", odds_m)
[1] "the odds of males in honours is 0.22972972972973"
paste("the odds of females in honours is", odds_f)
[1] "the odds of females in honours is 0.415584415584416"
paste("the odds ratio of female-to-male is therefore", odds_ratio)
[1] "the odds ratio of female-to-male is therefore 1.80901451489687"
paste("the interpretation of the female-to-male odds ratio is that the ODDS of getting into honours for females are", 100 * (odds_ratio - 1), "per cent higher than the ODDS for males")
[1] "the interpretation of the female-to-male odds ratio is that the ODDS of getting into honours for females are 80.9014514896868 per cent higher than the ODDS for males"
From regression output
paste("the intercept of", eq2$coefficients[1], "is the logit for male since male is the reference group (female == 0)", "This can be confirmed from the table at log(odds_m) equals", log(odds_m))
[1] "the intercept of -1.47085174914793 is the logit for male since male is the reference group (female == 0) This can be confirmed from the table at log(odds_m) equals -1.47085174914795"
paste("the coefficient for female", eq2$coefficients[2], "is the logit BETWEEN female == 1 & female == 0", "This can be confirmed from the table at log(odds_rato) equals", log(odds_ratio))
[1] "the coefficient for female 0.592782230095463 is the logit BETWEEN female == 1 & female == 0 This can be confirmed from the table at log(odds_rato) equals 0.592782230093996"
paste("note that we get back the female-to-male odds ratio from exp(coefficient) at", exp(eq2$coefficients[2]), "which is the same as", odds_ratio)
[1] "note that we get back the female-to-male odds ratio from exp(coefficient) at 1.80901451489952 which is the same as 1.80901451489687"
rm(list = ls.str(mode = 'numeric'))
Equation 3 is:
logit(p) = b0 + b1 * math
eq3 <- glm(hon ~ math,
data = dat,
family = "binomial")
summary(eq3)
Call:
glm(formula = hon ~ math, family = "binomial", data = dat)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.0332 -0.6785 -0.3506 -0.1565 2.6143
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -9.79394 1.48174 -6.610 3.85e-11 ***
math 0.15634 0.02561 6.105 1.03e-09 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 222.71 on 199 degrees of freedom
Residual deviance: 167.07 on 198 degrees of freedom
AIC: 171.07
Number of Fisher Scoring iterations: 5
paste("the interpretation of the intercept", eq3$coefficients[1], "is the logit of someone with a 0 in math score being in honours.", "Therefore, the odds of being in honours with math score 0 is", exp(eq3$coefficients[1]))
[1] "the interpretation of the intercept -9.79394211182874 is the logit of someone with a 0 in math score being in honours. Therefore, the odds of being in honours with math score 0 is 5.57885385598169e-05"
paste("To continue: the probability of being in honours with math score 0 is", exp(eq3$coefficients[1]) / (1 + exp(eq3$coefficients[1])))
[1] "To continue: the probability of being in honours with math score 0 is 5.57854263724066e-05"
At a math score of 55, the logistic regression fitted:
paste("logit(p | math == 55) = ", eq3$coefficients[1], "+", eq3$coefficients[2], "* 55")
[1] "logit(p | math == 55) = -9.79394211182874 + 0.156340355582943 * 55"
Similarly at a math score of 54
paste("logit(p | math == 54) = ", eq3$coefficients[1], "+", eq3$coefficients[2], "* 54")
[1] "logit(p | math == 54) = -9.79394211182874 + 0.156340355582943 * 54"
The difference from math == 54 to math == 55 is:
paste("logit(p | math == 55) - logit(p | math == 54) = ", (eq3$coefficients[1] + eq3$coefficients[2] * 55) - (eq3$coefficients[1] + eq3$coefficients[2] * 54), "which is the estimated COEFFICIENT for math")
[1] "logit(p | math == 55) - logit(p | math == 54) = 0.156340355582943 which is the estimated COEFFICIENT for math"
paste("In other words, b1 is the difference in logits.")
[1] "In other words, b1 is the difference in logits."
paste("For interpretation, we say a 1-unit increase in math leads to an expected change in the logit of", eq3$coefficients[2], "This is equivalent to an expected changes in the odds ratio of", exp(eq3$coefficients[2]), "That is, a 1-unit change in math increases the ODDS of being in honours by", 100 * (exp(eq3$coefficients[2]) - 1), "per cent. Note that this percentage is independent of math score")
[1] "For interpretation, we say a 1-unit increase in math leads to an expected change in the logit of 0.156340355582943 This is equivalent to an expected changes in the odds ratio of 1.1692240873208 That is, a 1-unit change in math increases the ODDS of being in honours by 16.9224087320799 per cent. Note that this percentage is independent of math score"
Equation 4 is:
logit(p) = b0 + b1 * math + b2 * female + b3 * read
eq4 <- glm(hon ~ math + female + read,
data = dat,
family = "binomial")
summary(eq4)
Call:
glm(formula = hon ~ math + female + read, family = "binomial",
data = dat)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.8305 -0.6327 -0.3300 -0.1258 2.3896
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -11.77025 1.71068 -6.880 5.97e-12 ***
math 0.12296 0.03128 3.931 8.44e-05 ***
female 0.97995 0.42163 2.324 0.0201 *
read 0.05906 0.02655 2.224 0.0261 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 222.71 on 199 degrees of freedom
Residual deviance: 156.17 on 196 degrees of freedom
AIC: 164.17
Number of Fisher Scoring iterations: 5
Each estimated coefficient = expected change in the LOGIT of being in honours for a 1-unit increase in the corresponding predicator, holding all other predictors at constant values.
And each exp(estimated coefficient) = expected change in the ODDS RATIO of being in honours for a 1-unit increase in the corresponding predicator, holding all other predictors at constant values.
paste("Fixing values for math and read, the ODDS of being in honours for females over the ODDS of being in honours for males is", exp(eq4$coefficients[3]), "which means the ODDS for females are", 100 * (exp(eq4$coefficients[3]) - 1), "per cent higher than the ODDS for males")
[1] "Fixing values for math and read, the ODDS of being in honours for females over the ODDS of being in honours for males is 2.66431769455757 which means the ODDS for females are 166.431769455757 per cent higher than the ODDS for males"
paste("Fixing values for female and read, the ODDS of being in honours for a 1-unit increase in math is", exp(eq4$coefficients[2]), "which means the ODDS of being in honours increases by", 100 * (exp(eq4$coefficients[2]) - 1), "per cent due to a 1-unit increase in math")
[1] "Fixing values for female and read, the ODDS of being in honours for a 1-unit increase in math is 1.13083791729043 which means the ODDS of being in honours increases by 13.0837917290425 per cent due to a 1-unit increase in math"
Equation 5 is:
logit(p) = b0 + b1 * female + b2 * math + b3 * female * math
eq5 <- glm(hon ~ female + math + female:math,
data = dat,
family = "binomial")
summary(eq5)
Call:
glm(formula = hon ~ female + math + female:math, family = "binomial",
data = dat)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.7623 -0.6725 -0.3421 -0.1450 2.6913
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -8.74584 2.12913 -4.108 4e-05 ***
female -2.89986 3.09418 -0.937 0.348657
math 0.12938 0.03588 3.606 0.000312 ***
female:math 0.06700 0.05346 1.253 0.210139
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 222.71 on 199 degrees of freedom
Residual deviance: 159.77 on 196 degrees of freedom
AIC: 167.77
Number of Fisher Scoring iterations: 5
In the presence of interaction term of female by math, we can no longer talk about the effect of female, holding all other variables at certain values. This is because it no longer makes sense to fix math and female x math at certain value and still allowing female to change from 0 to 1
The ‘equation’ for male (female == 0) is:
logit(p) = b0 + b1 * female + b2 * math + b3 * female * math
= b0 + b2 * math
The ‘equation’ for female (female == 1) is:
logit(p) = b0 + b1 * female + b2 * math + b3 * female * math
= (b0 + b1) + (b2 + b3) * math
paste("for male, a 1-unit increase in math leads to a change in the logit of", eq5$coefficients[3])
[1] "for male, a 1-unit increase in math leads to a change in the logit of 0.129378054581833"
paste("this is a change in the odds ratio of", exp(eq5$coefficients[3]))
[1] "this is a change in the odds ratio of 1.13812031444246"
paste("for female, a 1-unit increase in math leads to a change in the logit of", eq5$coefficients[3] + eq5$coefficients[4])
[1] "for female, a 1-unit increase in math leads to a change in the logit of 0.196373186527463"
paste("this is a change in the odds ratio of", exp(eq5$coefficients[3] + eq5$coefficients[4]))
[1] "this is a change in the odds ratio of 1.21698098150797"
paste("the RATIO of the female-to-male odds ratio for math", exp(eq5$coefficients[3] + eq5$coefficients[4]) / exp(eq5$coefficients[3]), " = exp(coefficient(interaction term)) = ", exp(eq5$coefficients[4]))
[1] "the RATIO of the female-to-male odds ratio for math 1.06929027279875 = exp(coefficient(interaction term)) = 1.06929027279875"