Joel Correa da Rosa
June 7th, 2017
Logistic regression belongs to the family of generalized linear models (GLMs), which extend regression to response variables whose distribution is not normal.
These distributions belong to the exponential family.
Examples of these distributions include the Exponential, Gamma, Poisson and Binomial, among others.
For linear regression, \( Y \) is normally distributed and its mean \( \mu_i \) is the parameter of interest.
\( \mu_i = \beta_0 +\beta_1X_{1i} +\beta_2X_{2i}+... \)
For the logistic regression, \( Y \) is binary (binomial distributed) and \( p_i \) is the parameter of interest, linked to a set of predictors through the so-called link function.
\( F(p_i) = \beta_0 +\beta_1X_{1i} +\beta_2X_{2i}+... \)
In the logistic regression the function that links the parameter \( p_i \) to the set of predictors is the logit.
\( F(p_i) = \operatorname{logit}(p_i) = \ln\left(\frac{p_i}{1-p_i}\right) = \beta_0 +\beta_1X_{1i} +\beta_2X_{2i}+... \)
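As a small illustration of the link-function idea, base R stores the default link of each GLM family inside the family object. The sketch below only inspects those objects; the value `p = 0.75` is an arbitrary example, not taken from the data.

```r
# Default link functions of three common GLM families (base R, stats package)
gaussian()$link   # "identity": linear regression models the mean directly
binomial()$link   # "logit": logistic regression
poisson()$link    # "log"

# The binomial family carries the logit and its inverse as functions
p   <- 0.75                      # arbitrary example probability
eta <- binomial()$linkfun(p)     # logit(p) = log(0.75 / 0.25) = log(3)
binomial()$linkinv(eta)          # maps the linear predictor back to p
```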
The logit is the log of the odds, \( p_i/(1-p_i) \).
Interpretation: the exponential of each slope coefficient, \( e^{\beta_j} \), is the odds ratio associated with a one-unit increase in \( X_j \).
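The reason the exponentiated coefficient is an odds ratio can be sketched in two steps. The derivation below takes the first predictor \( X_1 \) as an example and assumes all other predictors are held fixed, so every other term of the linear predictor cancels in the difference.

```latex
\begin{aligned}
\operatorname{logit}\big(p \mid X_1 = x + 1\big) - \operatorname{logit}\big(p \mid X_1 = x\big)
  &= \beta_1 \\
\frac{\text{odds}(X_1 = x + 1)}{\text{odds}(X_1 = x)}
  &= e^{\beta_1}
\end{aligned}
```

The second line follows from the first because a difference of log odds is the log of a ratio of odds.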
Solving the logit equation for \( p_i \) gives the logistic function:
\( p_i = \frac{1}{1+e^{-(\beta_0+\beta_1X_{1i}+\beta_2X_{2i}+...)}} \)
The function on the right-hand side of the equation is the logistic (sigmoid) function. It is widely used in machine learning, where it is usually called an activation function.
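The relationship between probability, odds, logit and the logistic function can be checked numerically. This is a minimal sketch with made-up probabilities; `qlogis()` and `plogis()` are the base R names for the logit and the logistic function.

```r
# Arbitrary example probabilities (not from the data set)
p <- c(0.10, 0.50, 0.75)

odds  <- p / (1 - p)   # e.g. 0.75 / 0.25 = 3
logit <- log(odds)     # log of the odds

# qlogis() computes the same logit directly
all.equal(logit, qlogis(p))

# plogis() is the logistic function 1 / (1 + exp(-x)); it inverts the logit
p_back <- plogis(logit)
all.equal(p_back, p)
```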
One investigator wishes to identify risk factors for alcohol dependence among a set of predictors: Impulsivity, Ethnicity, Gender and Age.
load('logreg.Rda')
head(dx.reg)
BIS_Total scidOPIOID scidCOCAINE CannabisTOT Ethnicity Gender Age
327 78 0 0 10 C F 23
329 51 0 0 0 AA F 24
333 63 1 0 14 C M 37
334 71 0 1 14 AA F 32
335 55 0 0 14 AA F 46
336 51 0 0 0 AA F 34
alcoholTOT.binary
327 1
329 0
333 1
334 1
335 1
336 0
summary(dx.reg)
BIS_Total scidOPIOID scidCOCAINE CannabisTOT Ethnicity
Min. : 35.00 0:536 0:378 Min. : 0.0 C :162
1st Qu.: 54.00 1:230 1:388 1st Qu.: 0.0 AA :375
Median : 62.00 Median :10.0 H :151
Mean : 62.78 Mean : 7.7 Other: 78
3rd Qu.: 71.00 3rd Qu.:13.0
Max. :107.00 Max. :14.0
Gender Age alcoholTOT.binary
F:323 Min. :18.00 0:415
M:443 1st Qu.:33.00 1:351
Median :42.00
Mean :40.63
3rd Qu.:48.00
Max. :69.00
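library(plyr)  # provides mapvalues()
# Label the outcome levels and make 'Non-dependent' the reference level
dx.reg$alcoholTOT.binary<-mapvalues(dx.reg$alcoholTOT.binary,from=c(0,1),to=c('Non-dependent','Dependent'))
dx.reg$alcoholTOT.binary<-relevel(dx.reg$alcoholTOT.binary,ref='Non-dependent')
# Reference levels for the categorical predictors: 'M' for Gender and 'C' for Ethnicity
dx.reg$Gender<-relevel(dx.reg$Gender,ref='M')
dx.reg$Ethnicity<-relevel(dx.reg$Ethnicity,ref='C')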
To get a sense of the association between the binary outcome and a categorical predictor, we can build a contingency table and then apply a chi-square test of independence.
table(dx.reg$alcoholTOT.binary , dx.reg$Ethnicity)
C AA H Other
Non-dependent 91 194 77 53
Dependent 71 181 74 25
chisq.test(table(dx.reg$alcoholTOT.binary , dx.reg$Ethnicity))
Pearson's Chi-squared test
data: table(dx.reg$alcoholTOT.binary, dx.reg$Ethnicity)
X-squared = 7.7374, df = 3, p-value = 0.05176
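To make the chi-square statistic less of a black box, it can be recomputed by hand from the printed counts. The sketch below re-enters the table manually (the object names `counts` and `expected` are illustrative) and should agree with the `chisq.test()` output up to rounding.

```r
# Counts copied from the outcome-by-ethnicity table printed above
counts <- matrix(c(91, 194, 77, 53,
                   71, 181, 74, 25),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("Non-dependent", "Dependent"),
                                 c("C", "AA", "H", "Other")))

# Expected counts under independence: row total * column total / grand total
expected <- outer(rowSums(counts), colSums(counts)) / sum(counts)

# Pearson statistic, degrees of freedom and upper-tail p-value
x2      <- sum((counts - expected)^2 / expected)
df      <- (nrow(counts) - 1) * (ncol(counts) - 1)
p_value <- pchisq(x2, df = df, lower.tail = FALSE)

round(c(X2 = x2, df = df, p = p_value), 4)
```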
barplot(table(dx.reg$alcoholTOT.binary , dx.reg$Ethnicity),beside=TRUE)
table(dx.reg$alcoholTOT.binary , dx.reg$Gender)
M F
Non-dependent 215 200
Dependent 228 123
chisq.test(table(dx.reg$alcoholTOT.binary , dx.reg$Gender))
Pearson's Chi-squared test with Yates' continuity correction
data: table(dx.reg$alcoholTOT.binary, dx.reg$Gender)
X-squared = 12.951, df = 1, p-value = 0.0003198
barplot(table(dx.reg$alcoholTOT.binary , dx.reg$Gender),beside=TRUE)
[1] 6.877921e-15
[1] 5.966515e-15
tb<-table(dx.reg$Gender,dx.reg$alcoholTOT.binary)
tb[c('F','M'),c('Dependent','Non-dependent')]
Dependent Non-dependent
F 123 200
M 228 215
fisher.test(tb)
Fisher's Exact Test for Count Data
data: tb
p-value = 0.0002444
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
0.4283613 0.7845499
sample estimates:
odds ratio
0.5803393
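For comparison, the plain cross-product odds ratio can be computed directly from the four counts. `fisher.test()` reports a conditional maximum likelihood estimate, so its value differs slightly from this sample odds ratio. The sketch below also adds a large-sample (Woolf) confidence interval, which is an approximation and not the exact interval shown above; variable names are illustrative.

```r
# Counts copied from the 2 x 2 table printed above
dep_F <- 123; non_F <- 200
dep_M <- 228; non_M <- 215

# Sample (cross-product) odds ratio of dependence, females vs. males
or_hat <- (dep_F / non_F) / (dep_M / non_M)

# Large-sample standard error of log(OR), Woolf's formula
se_log_or <- sqrt(1 / dep_F + 1 / non_F + 1 / dep_M + 1 / non_M)

# Approximate 95% confidence interval, built on the log scale
ci <- exp(log(or_hat) + c(-1, 1) * qnorm(0.975) * se_log_or)

round(c(OR = or_hat, lower = ci[1], upper = ci[2]), 3)
```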
fit<-glm(alcoholTOT.binary~Gender,family = 'binomial',data=dx.reg)
summary(fit)
Call:
glm(formula = alcoholTOT.binary ~ Gender, family = "binomial",
data = dx.reg)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.2024 -1.2024 -0.9791 1.1526 1.3896
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.05871 0.09506 0.618 0.536865
GenderF -0.54484 0.14889 -3.659 0.000253 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1056.5 on 765 degrees of freedom
Residual deviance: 1043.0 on 764 degrees of freedom
AIC: 1047
Number of Fisher Scoring iterations: 4
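With a single binary predictor, the fitted coefficients can be reproduced from the gender-by-outcome counts alone. The sketch below assumes the counts printed earlier (228 dependent and 215 non-dependent males, 123 and 200 females) and uses illustrative variable names.

```r
# Odds of dependence within each gender, from the contingency table counts
odds_M <- 228 / 215
odds_F <- 123 / 200

# Intercept = log odds in the reference group (males), about 0.0587
intercept <- log(odds_M)

# GenderF coefficient = log odds ratio, females vs. males, about -0.545
beta_F <- log(odds_F / odds_M)

# Fitted probabilities recover the observed proportions in each group
p_M <- plogis(intercept)            # equals 228 / 443
p_F <- plogis(intercept + beta_F)   # equals 123 / 323

round(c(intercept = intercept, GenderF = beta_F, OR = exp(beta_F),
        p_M = p_M, p_F = p_F), 4)
```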
fit<-glm(alcoholTOT.binary~Age,family = 'binomial',data=dx.reg)
summary(fit)
Call:
glm(formula = alcoholTOT.binary ~ Age, family = "binomial", data = dx.reg)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.7820 -1.0810 -0.7268 1.1545 1.7821
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.392474 0.319268 -7.494 6.70e-14 ***
Age 0.054371 0.007522 7.229 4.88e-13 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1056.55 on 765 degrees of freedom
Residual deviance: 999.03 on 764 degrees of freedom
AIC: 1003
Number of Fisher Scoring iterations: 4
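One way to read the Age coefficient is to convert it into odds ratios and predicted probabilities. The sketch below hard-codes the two estimates from the summary above; the ages 20, 40 and 60 are arbitrary illustration points within the observed range.

```r
# Estimates copied from the Age model summary
b0 <- -2.392474
b1 <-  0.054371

# Odds ratio for one additional year of age, about 1.056
or_1yr <- exp(b1)

# Odds ratio for ten additional years, about 1.72 (equals or_1yr^10)
or_10yr <- exp(10 * b1)

# Predicted probability of dependence at a few illustrative ages
ages  <- c(20, 40, 60)
p_hat <- plogis(b0 + b1 * ages)

round(c(or_1yr = or_1yr, or_10yr = or_10yr), 3)
round(setNames(p_hat, paste0("age_", ages)), 3)
```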
fit<-glm(alcoholTOT.binary~BIS_Total,family = 'binomial',data=dx.reg)
summary(fit)
Call:
glm(formula = alcoholTOT.binary ~ BIS_Total, family = "binomial",
data = dx.reg)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.9364 -1.0283 -0.7743 1.1635 1.8641
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.268669 0.428450 -7.629 2.36e-14 ***
BIS_Total 0.049276 0.006697 7.358 1.87e-13 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1056.55 on 765 degrees of freedom
Residual deviance: 995.82 on 764 degrees of freedom
AIC: 999.82
Number of Fisher Scoring iterations: 4
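The deviance lines of the summary can be used for a likelihood ratio test of the model against the intercept-only model. The sketch below uses the values printed for the BIS_Total model; it also checks that, for a binary outcome, the AIC equals the residual deviance plus twice the number of estimated coefficients.

```r
# Values copied from the BIS_Total model summary
null_dev  <- 1056.55   # on 765 degrees of freedom
resid_dev <- 995.82    # on 764 degrees of freedom

# Likelihood ratio statistic: drop in deviance, here 60.73 on 1 df
lr_stat <- null_dev - resid_dev
lr_df   <- 765 - 764
p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

# AIC = residual deviance + 2 * (number of coefficients), here 2 coefficients
aic <- resid_dev + 2 * 2

c(LR = lr_stat, df = lr_df, p = p_value, AIC = aic)
```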
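# Fitted probability of dependence as a function of age (logistic, i.e. inverse-logit, curve)
# Coefficients are rounded from the Age model above
logitF<-function(age){
1/(1+exp(-(-2.4+0.055*age)))
}
plot(dx.reg$Age,logitF(dx.reg$Age))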
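# Fitted probability of dependence as a function of the BIS total score (logistic curve)
# Coefficients are rounded from the BIS_Total model above
logitF<-function(x){
1/(1+exp(-(-3.27+0.049*x)))
}
plot(dx.reg$BIS_Total,logitF(dx.reg$BIS_Total))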
In multiple logistic regression, each odds ratio is adjusted for the other variables included in the model.
fit.m<-glm(alcoholTOT.binary~Gender+Ethnicity+Age+BIS_Total,family='binomial',data=dx.reg)
summary(fit.m)
Call:
glm(formula = alcoholTOT.binary ~ Gender + Ethnicity + Age +
BIS_Total, family = "binomial", data = dx.reg)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.0088 -1.0175 -0.5084 1.0607 2.3170
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -5.564819 0.617939 -9.005 < 2e-16 ***
GenderF -0.441387 0.161614 -2.731 0.00631 **
EthnicityAA 0.319503 0.205820 1.552 0.12058
EthnicityH 0.472557 0.248272 1.903 0.05699 .
EthnicityOther -0.259539 0.316244 -0.821 0.41182
Age 0.054287 0.008101 6.701 2.07e-11 ***
BIS_Total 0.049449 0.006975 7.089 1.35e-12 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1056.55 on 765 degrees of freedom
Residual deviance: 926.96 on 759 degrees of freedom
AIC: 940.96
Number of Fisher Scoring iterations: 3
In multiple logistic regression, an odds ratio is the multiplicative change in the predicted odds for a one-unit increase in X, with the other variables in the model held constant.
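The coefficients of the multiple regression summary above can be turned into adjusted odds ratios with approximate 95% confidence intervals. The sketch below re-enters the printed estimates and standard errors by hand and builds Wald intervals (estimate plus or minus 1.96 standard errors on the log scale); the data frame `coefs` is an illustrative name. With the fitted object available, `exp(coef(fit.m))` and `exp(confint.default(fit.m))` should give the same quantities.

```r
# Estimates and standard errors copied from summary(fit.m), intercept omitted
coefs <- data.frame(
  term     = c("GenderF", "EthnicityAA", "EthnicityH",
               "EthnicityOther", "Age", "BIS_Total"),
  estimate = c(-0.441387, 0.319503, 0.472557, -0.259539, 0.054287, 0.049449),
  se       = c(0.161614, 0.205820, 0.248272, 0.316244, 0.008101, 0.006975)
)

# Adjusted odds ratios and Wald 95% confidence limits
z <- qnorm(0.975)
coefs$OR    <- exp(coefs$estimate)
coefs$lower <- exp(coefs$estimate - z * coefs$se)
coefs$upper <- exp(coefs$estimate + z * coefs$se)

print(coefs[, c("term", "OR", "lower", "upper")], digits = 3)
```

An interval that excludes 1 corresponds to a coefficient that is significant at the 5% level in the Wald test shown in the summary.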