Lecture 14: Logistic Regression

Joel Correa da Rosa
June 7, 2017

Logistic Regression

Logistic regression belongs to the family of methods called generalized linear models (GLMs). They extend the concept of regression to response distributions other than the normal distribution.

These distributions belong to the exponential family.

Examples include the exponential, gamma, Poisson, and binomial distributions, among others.

Linear vs. Logistic Regression

For linear regression, \( Y \) is normally distributed and the mean \( \mu_i \) is the parameter of interest.

\( \mu_i = \beta_0 +\beta_1X_{1i} +\beta_2X_{2i}+... \)

For logistic regression, \( Y \) is binary (binomially distributed) and \( p_i \), the probability that \( Y_i = 1 \), is the parameter of interest. It is linked to a set of predictors through the so-called link function.

\( F(p_i) = \beta_0 +\beta_1X_{1i} +\beta_2X_{2i}+... \)

Logit Function

In the logistic regression the function that links the parameter \( p_i \) to the set of predictors is the logit.

\( F(p_i) = \mathrm{logit}(p_i) = \ln\left(\frac{p_i}{1-p_i}\right) = \beta_0 +\beta_1X_{1i}+ \beta_2X_{2i}+... \)

The logit is the log of the odds, \( p_i/(1-p_i) \).

Interpretation: the exponential of a slope parameter, \( e^{\beta_j} \), is the odds ratio associated with a one-unit increase in \( X_j \).
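The relationship between the logit, the odds, and the odds ratio can be checked numerically. The sketch below is only illustrative: the helper names `odds` and `logit`, the baseline probability, and the coefficient value are assumptions chosen for the example, not values from the lecture data.

```r
# Illustrative helpers (names are assumptions, not part of the lecture code)
odds  <- function(p) p / (1 - p)
logit <- function(p) log(odds(p))

p0   <- 0.20   # arbitrary baseline probability
beta <- 0.5    # hypothetical slope coefficient

# Adding beta on the logit scale corresponds to a one-unit increase in X.
# plogis() is base R's logistic function, the inverse of the logit.
p1 <- plogis(logit(p0) + beta)

# The ratio of the two odds equals exp(beta)
odds_ratio <- odds(p1) / odds(p0)
print(c(odds_ratio = odds_ratio, exp_beta = exp(beta)))
```

Whatever baseline probability is chosen, the ratio of the odds stays equal to \( e^{\beta} \), which is why the exponentiated coefficient is read as an odds ratio.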

Alternative Formulation

Solving the logit equation for \( p_i \) gives the logistic function:

\( p_i = \frac{1}{1+e^{-(\beta_0+\beta_1X_{1i}+\beta_2X_{2i}+...)}} \)

The function on the right side of the equation is the logistic (sigmoid) function. It is very important in machine learning, where it is commonly used as an activation function.
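As a quick check that the logistic function undoes the logit, the sketch below writes the formula by hand and compares it with base R's `plogis()` and `qlogis()`. The function name `logistic` and the values in `eta` are assumptions made for illustration.

```r
# Hand-written logistic function (name is illustrative)
logistic <- function(eta) 1 / (1 + exp(-eta))

eta <- c(-3, 0, 2)   # arbitrary values of the linear predictor
p   <- logistic(eta)

# plogis() is the built-in logistic function; qlogis() is the built-in logit
print(all.equal(p, plogis(eta)))
print(all.equal(qlogis(p), eta))
```

Both comparisons should agree up to floating-point tolerance, so in practice the built-in functions can replace the hand-written formula.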

Application of the logistic regression (Real Data)

One investigator wishes to identify risk factors for alcohol dependence among a set of predictors: Impulsivity, Ethnicity, Gender and Age.

load('logreg.Rda')
head(dx.reg)
    BIS_Total scidOPIOID scidCOCAINE CannabisTOT Ethnicity Gender Age
327        78          0           0          10         C      F  23
329        51          0           0           0        AA      F  24
333        63          1           0          14         C      M  37
334        71          0           1          14        AA      F  32
335        55          0           0          14        AA      F  46
336        51          0           0           0        AA      F  34
    alcoholTOT.binary
327                 1
329                 0
333                 1
334                 1
335                 1
336                 0

Example (Predictors of Alcohol Dependence)

summary(dx.reg)
   BIS_Total      scidOPIOID scidCOCAINE  CannabisTOT   Ethnicity  
 Min.   : 35.00   0:536      0:378       Min.   : 0.0   C    :162  
 1st Qu.: 54.00   1:230      1:388       1st Qu.: 0.0   AA   :375  
 Median : 62.00                          Median :10.0   H    :151  
 Mean   : 62.78                          Mean   : 7.7   Other: 78  
 3rd Qu.: 71.00                          3rd Qu.:13.0              
 Max.   :107.00                          Max.   :14.0              
 Gender       Age        alcoholTOT.binary
 F:323   Min.   :18.00   0:415            
 M:443   1st Qu.:33.00   1:351            
         Median :42.00                    
         Mean   :40.63                    
         3rd Qu.:48.00                    
         Max.   :69.00                    

Defining Reference Levels

library(plyr)  # provides mapvalues()
dx.reg$alcoholTOT.binary<-mapvalues(dx.reg$alcoholTOT.binary,from=c(0,1),to=c('Non-dependent','Dependent'))
dx.reg$alcoholTOT.binary<-relevel(dx.reg$alcoholTOT.binary,ref='Non-dependent')
dx.reg$Gender<-relevel(dx.reg$Gender,ref='M')
dx.reg$Ethnicity<-relevel(dx.reg$Ethnicity,ref='C')
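The effect of `relevel()` is easiest to see on a small factor. The sketch below uses a toy vector (an assumption for illustration, not the lecture data): the first level of a factor is the reference category against which `glm()` reports the other levels.

```r
# Toy factor; by default the levels are sorted alphabetically
eth <- factor(c("AA", "C", "H", "Other", "C"))
print(levels(eth))

# relevel() moves the chosen level to the front, making it the reference
eth_c <- relevel(eth, ref = "C")
print(levels(eth_c))
```

Only the order of the levels changes; the data values themselves are untouched.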

Univariate Approach

To get a sense of the level of association between the binary outcome and a categorical predictor, we can use a contingency table followed by the chi-square statistic and its significance.

Alcohol Dependence x Ethnicity

table(dx.reg$alcoholTOT.binary , dx.reg$Ethnicity)

                  C  AA   H Other
  Non-dependent  91 194  77    53
  Dependent      71 181  74    25
chisq.test(table(dx.reg$alcoholTOT.binary , dx.reg$Ethnicity))

    Pearson's Chi-squared test

data:  table(dx.reg$alcoholTOT.binary, dx.reg$Ethnicity)
X-squared = 7.7374, df = 3, p-value = 0.05176
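The chi-square statistic above can be rebuilt by hand from the printed counts, which makes clear what `chisq.test()` computes. This is a sketch that re-enters the table manually (the object names are assumptions); it does not need the original data file.

```r
# Counts copied from the contingency table above
tab <- matrix(c(91, 194, 77, 53,
                71, 181, 74, 25),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("Non-dependent", "Dependent"),
                              c("C", "AA", "H", "Other")))

# Expected counts under independence: (row total * column total) / n
expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)

# Pearson statistic, degrees of freedom, and upper-tail p-value
x2      <- sum((tab - expected)^2 / expected)
df      <- (nrow(tab) - 1) * (ncol(tab) - 1)
p_value <- pchisq(x2, df = df, lower.tail = FALSE)
print(c(X2 = x2, df = df, p = p_value))
```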

Alcohol Dependence x Ethnicity

barplot(table(dx.reg$alcoholTOT.binary , dx.reg$Ethnicity),beside=TRUE)

[Figure: bar plot of alcohol dependence counts by ethnicity]

Univariate Approach

Alcohol Dependence x Gender

table(dx.reg$alcoholTOT.binary , dx.reg$Gender)

                  M   F
  Non-dependent 215 200
  Dependent     228 123
chisq.test(table(dx.reg$alcoholTOT.binary , dx.reg$Gender))

    Pearson's Chi-squared test with Yates' continuity correction

data:  table(dx.reg$alcoholTOT.binary, dx.reg$Gender)
X-squared = 12.951, df = 1, p-value = 0.0003198

Alcohol Dependence x Gender

barplot(table(dx.reg$alcoholTOT.binary , dx.reg$Gender),beside=TRUE)

[Figure: bar plot of alcohol dependence counts by gender]

Univariate Approach

Alcohol Dependence vs Age

[Figure: alcohol dependence vs. age]

[1] 6.877921e-15

Univariate Approach

Alcohol Dependence vs Impulsivity Score

[Figure: alcohol dependence vs. impulsivity score]

[1] 5.966515e-15

Odds Ratio through Contingency Table

Alcohol Dependence x Gender

tb<-table(dx.reg$Gender,dx.reg$alcoholTOT.binary)
tb[c('F','M'),c('Dependent','Non-dependent')]

    Dependent Non-dependent
  F       123           200
  M       228           215
fisher.test(tb)

    Fisher's Exact Test for Count Data

data:  tb
p-value = 0.0002444
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
 0.4283613 0.7845499
sample estimates:
odds ratio 
 0.5803393 
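For comparison, the sample (cross-product) odds ratio can be computed directly from the four counts. The sketch below re-enters the table by hand; the object names are assumptions. Note that `fisher.test()` reports the conditional maximum likelihood estimate, which is why its value differs slightly from the cross-product ratio.

```r
# Counts copied from the 2x2 table above
tb2 <- matrix(c(123, 200,
                228, 215),
              nrow = 2, byrow = TRUE,
              dimnames = list(Gender = c("F", "M"),
                              Status = c("Dependent", "Non-dependent")))

# Odds of dependence among women divided by odds of dependence among men
or_sample <- (tb2["F", "Dependent"] * tb2["M", "Non-dependent"]) /
             (tb2["F", "Non-dependent"] * tb2["M", "Dependent"])

# Wald standard error of the log odds ratio and a 95% interval
se_log_or <- sqrt(sum(1 / tb2))
ci_wald   <- exp(log(or_sample) + c(-1, 1) * qnorm(0.975) * se_log_or)
print(c(OR = or_sample, lower = ci_wald[1], upper = ci_wald[2]))
```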

Odds Ratio through Logistic Regression

fit<-glm(alcoholTOT.binary~Gender,family = 'binomial',data=dx.reg)
summary(fit)

Call:
glm(formula = alcoholTOT.binary ~ Gender, family = "binomial", 
    data = dx.reg)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.2024  -1.2024  -0.9791   1.1526   1.3896  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  0.05871    0.09506   0.618 0.536865    
GenderF     -0.54484    0.14889  -3.659 0.000253 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1056.5  on 765  degrees of freedom
Residual deviance: 1043.0  on 764  degrees of freedom
AIC: 1047

Number of Fisher Scoring iterations: 4
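The coefficient for GenderF is a log odds ratio, so exponentiating it gives the odds ratio for women relative to men. Since `logreg.Rda` is not reproduced here, the sketch below refits the same model from the 2x2 counts using the grouped-binomial form of `glm()`; the data frame `counts` and its column names are assumptions introduced for the example.

```r
# Grouped counts taken from the Gender x Alcohol Dependence table
counts <- data.frame(
  Gender       = factor(c("M", "F"), levels = c("M", "F")),  # M is the reference
  dependent    = c(228, 123),
  nondependent = c(215, 200)
)

# cbind(successes, failures) is the grouped-data response for a binomial GLM
fit_grouped <- glm(cbind(dependent, nondependent) ~ Gender,
                   family = binomial, data = counts)

# Odds ratios and Wald 95% confidence intervals on the odds scale
or_hat <- exp(coef(fit_grouped))
or_ci  <- exp(confint.default(fit_grouped))
print(cbind(OR = or_hat, or_ci))
```

The exponentiated GenderF coefficient should match the cross-product odds ratio from the contingency table, which is the point of the comparison between the two approaches.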

Odds Ratio for a Continuous Predictor (Age)

fit<-glm(alcoholTOT.binary~Age,family = 'binomial',data=dx.reg)
summary(fit)

Call:
glm(formula = alcoholTOT.binary ~ Age, family = "binomial", data = dx.reg)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.7820  -1.0810  -0.7268   1.1545   1.7821  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -2.392474   0.319268  -7.494 6.70e-14 ***
Age          0.054371   0.007522   7.229 4.88e-13 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1056.55  on 765  degrees of freedom
Residual deviance:  999.03  on 764  degrees of freedom
AIC: 1003

Number of Fisher Scoring iterations: 4
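For a continuous predictor, the odds ratio refers to a one-unit change, so it is often helpful to rescale it to a more meaningful step such as 10 years. The sketch below uses the Age estimate and standard error printed in the summary; the helper names `or_for_change` and `ci_for_change` are assumptions made for illustration.

```r
# Estimate and standard error of the Age coefficient, copied from the summary
b_age  <- 0.054371
se_age <- 0.007522

# Odds ratio for a k-year increase in age: exp(k * beta)
or_for_change <- function(k) exp(k * b_age)

# Wald 95% confidence interval for the same k-year odds ratio
ci_for_change <- function(k) {
  exp(k * (b_age + c(-1, 1) * qnorm(0.975) * se_age))
}

or_1  <- or_for_change(1)
or_10 <- or_for_change(10)
print(c(per_year = or_1, per_decade = or_10))
print(ci_for_change(10))
```

Because the model is linear on the logit scale, the 10-year odds ratio is simply the one-year odds ratio raised to the 10th power.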

Odds Ratio for a Continuous Predictor (Impulsivity)

fit<-glm(alcoholTOT.binary~BIS_Total,family = 'binomial',data=dx.reg)
summary(fit)

Call:
glm(formula = alcoholTOT.binary ~ BIS_Total, family = "binomial", 
    data = dx.reg)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.9364  -1.0283  -0.7743   1.1635   1.8641  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -3.268669   0.428450  -7.629 2.36e-14 ***
BIS_Total    0.049276   0.006697   7.358 1.87e-13 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1056.55  on 765  degrees of freedom
Residual deviance:  995.82  on 764  degrees of freedom
AIC: 999.82

Number of Fisher Scoring iterations: 4

Logistic Function (Age)

# Predicted probability from the logistic function, using the rounded Age coefficients
logitF<-function(age){
  1/(1+exp(-(-2.4+0.055*age)))
}
plot(dx.reg$Age,logitF(dx.reg$Age))

[Figure: predicted probability of alcohol dependence vs. age]

Logistic Function (Impulsivity)

# Predicted probability from the logistic function, using the rounded BIS_Total coefficients
logitF<-function(x){
  1/(1+exp(-(-3.27+0.049*x)))
}
plot(dx.reg$BIS_Total,logitF(dx.reg$BIS_Total))

[Figure: predicted probability of alcohol dependence vs. impulsivity score]
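The same curve can be drawn without the data set by evaluating the fitted model on a grid of scores. The sketch below uses the unrounded BIS_Total coefficients from the summary and the observed score range (35 to 107); the object names are assumptions. It also marks the score at which the predicted probability crosses 0.5, which is where the linear predictor equals zero.

```r
# Coefficients copied from the BIS_Total model summary
b0 <- -3.268669
b1 <-  0.049276

# Grid covering the observed range of impulsivity scores
bis_grid <- seq(35, 107, by = 1)
p_hat    <- plogis(b0 + b1 * bis_grid)   # logistic function of the linear predictor

# Score at which the predicted probability equals 0.5 (linear predictor = 0)
bis_half <- -b0 / b1

plot(bis_grid, p_hat, type = "l",
     xlab = "BIS_Total", ylab = "Predicted probability of dependence")
abline(h = 0.5, v = bis_half, lty = 2)
```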

Multiple Logistic Regression

In multiple logistic regression, each odds ratio is adjusted for the other variables included in the model.

fit.m<-glm(alcoholTOT.binary~Gender+Ethnicity+Age+BIS_Total,family='binomial',data=dx.reg)
summary(fit.m)

Call:
glm(formula = alcoholTOT.binary ~ Gender + Ethnicity + Age + 
    BIS_Total, family = "binomial", data = dx.reg)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.0088  -1.0175  -0.5084   1.0607   2.3170  

Coefficients:
                Estimate Std. Error z value Pr(>|z|)    
(Intercept)    -5.564819   0.617939  -9.005  < 2e-16 ***
GenderF        -0.441387   0.161614  -2.731  0.00631 ** 
EthnicityAA     0.319503   0.205820   1.552  0.12058    
EthnicityH      0.472557   0.248272   1.903  0.05699 .  
EthnicityOther -0.259539   0.316244  -0.821  0.41182    
Age             0.054287   0.008101   6.701 2.07e-11 ***
BIS_Total       0.049449   0.006975   7.089 1.35e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1056.55  on 765  degrees of freedom
Residual deviance:  926.96  on 759  degrees of freedom
AIC: 940.96

Number of Fisher Scoring iterations: 3

Interpreting Odds Ratios

- Odds ratios in logistic regression can be interpreted as the multiplicative effect of a one-unit change in X on the predicted odds, with the other variables in the model held constant.

  • We should emphasize that each odds ratio is constant: it is adjusted for the other predictors in the model and does not depend on the values they take on.
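The constancy of an adjusted odds ratio can be verified with the coefficients of the multiple regression. The sketch below compares women and men at two very different combinations of age and impulsivity; the function `pred_odds` and the chosen profiles are assumptions for illustration, and ethnicity is fixed at its reference level (C).

```r
# Coefficients copied from the multiple logistic regression summary
b <- c(intercept = -5.564819, GenderF = -0.441387,
       Age = 0.054287, BIS_Total = 0.049449)

# Predicted odds of dependence for a participant of reference ethnicity (C)
pred_odds <- function(female, age, bis) {
  exp(b[["intercept"]] + b[["GenderF"]] * female +
      b[["Age"]] * age + b[["BIS_Total"]] * bis)
}

# Female vs. male odds ratio at two different covariate profiles
or_profile_a <- pred_odds(1, 25, 50) / pred_odds(0, 25, 50)
or_profile_b <- pred_odds(1, 60, 90) / pred_odds(0, 60, 90)

print(c(profile_a = or_profile_a, profile_b = or_profile_b,
        exp_beta = exp(b[["GenderF"]])))
```

The predicted odds themselves differ a great deal between the two profiles, but their female-to-male ratio is the same in both and equals the exponentiated GenderF coefficient.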