Logistic Regression

Key Concepts covered in the lecture include:
1. Rethinking regression for Bernoulli random variables: using the logistic model
2. The link function: logit
3. Parameter Estimation
4. Parameter Interpretation: The Odds Ratio
5. Diagnostics on logistic regression: Wald's Test
6. Diagnostics on logistic regression: Likelihood Ratio Test to compare a full and reduced model
7. AIC: a goodness of fit measure to replace R2 (for logistic models).


The inclass exercise is about using a generalized linear model, specifically the logistic regression.

Task: What is the probability of a birth defect for women <20, from 20-24, and >24 years old?

library(MASS)  #We need the MASS library in order to run some diagnostics on the glm function
bd <- read.csv("E:/Quant/inClassExercises/InClassExerciseData/babybirthdata.csv")

# Create a generalized linear model using logit.
bdlog <- glm(bd$casegrp ~ bd$MAGE, family = binomial("logit"))
summary(bdlog)  #coefficients don't mean anything.... must convert to a odds ratio
## 
## Call:
## glm(formula = bd$casegrp ~ bd$MAGE, family = binomial("logit"))
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -0.567  -0.240  -0.147  -0.089   3.610  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    0.837      0.328    2.56    0.011 *  
## bd$MAGE       -0.199      0.015  -13.22   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2327.3  on 11751  degrees of freedom
## Residual deviance: 2096.3  on 11750  degrees of freedom
## AIC: 2100
## 
## Number of Fisher Scoring iterations: 8

# The result of the model outputs either a 1 or a 0 (yes or no...).  We
# can see this by 'jittering' the data
plot(jitter(casegrp) ~ MAGE, bd)  #meaningless.... adding noise into the Y variable.

plot of chunk unnamed-chunk-1


# We can also plot the likelihood of a 1(occurance of a birth defect) for
# a continuous X variable (age)
plot(bd$MAGE, fitted(glm(casegrp ~ MAGE, binomial, data = bd)), xlab = "Maternal Age", 
    ylab = "P(Birth Defect)", pch = 15)

plot of chunk unnamed-chunk-1



# In order to make sense of the model parameters, we must convert the data
# to an odds ratio. convert to odds ratio... essential inversing the
# natural log link function.
exp(-0.19872)
## [1] 0.8198

########### IMPORTANT!!!############### this is how to interpret the
########### coefficeint.  FOr a 1 year increase in age, there is a ___
########### change in odds of a birth defect
exp(confint(bdlog))  #this is the confidence interval for the odds ratio
## Waiting for profiling to be done...
##              2.5 % 97.5 %
## (Intercept) 1.2214 4.4155
## bd$MAGE     0.7955 0.8438

# In order to answer the question, we must categorize the data into
# groups. Categorize the data into groups:
bd$magecat3 <- ifelse(bd$MAGE > 25, c(1), c(0))
bd$magecat2 <- ifelse(bd$MAGE >= 20 & bd$MAGE <= 25, c(1), c(0))
bd$magecat1 <- ifelse(bd$MAGE < 20, c(1), c(0))

# Fit a model to the categorical dummy variables (using 1 less category
# than is present)
bdlog2 <- glm(casegrp ~ magecat1 + magecat2, binomial, data = bd)
summary(bdlog2)
## 
## Call:
## glm(formula = casegrp ~ magecat1 + magecat2, family = binomial, 
##     data = bd)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -0.378  -0.232  -0.106  -0.106   3.222  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   -5.186      0.167   -31.0  < 2e-16 ***
## magecat1       2.582      0.196    13.2  < 2e-16 ***
## magecat2       1.582      0.195     8.1  5.4e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2327.3  on 11751  degrees of freedom
## Residual deviance: 2113.7  on 11749  degrees of freedom
## AIC: 2120
## 
## Number of Fisher Scoring iterations: 8
# Have an intercept plus 2 variables in output table Reference group is
# the omitted groups (e.g. magegrpCat3 aka persons over 24)

# Fit the model...
yr20 <- exp(-5.1861 + 2.5824 * (1) + 1.5817 * (0))
summary(yr20/(1 + yr20))  #probability of having a birth defect in this group
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0689  0.0689  0.0689  0.0689  0.0689  0.0689

# calculate odds ratios
exp(cbind(OR = coef(bdlog2), confint(bdlog2)))
## Waiting for profiling to be done...
##                    OR    2.5 %    97.5 %
## (Intercept)  0.005594 0.003956  0.007633
## magecat1    13.228246 9.092820 19.682178
## magecat2     4.863064 3.350905  7.219970
# Group 1 is 13x more likely of having a BD than group 3 (the omitted
# group[s]) Group 2 is 4x more likely of having....

bd.log <- glm(casegrp ~ magecat1 + magecat2 + bthparity2 + smoke, binomial, 
    data = bd)
summary(bd.log)
## 
## Call:
## glm(formula = casegrp ~ magecat1 + magecat2 + bthparity2 + smoke, 
##     family = binomial, data = bd)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -0.510  -0.247  -0.121  -0.091   3.315  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   -4.907      0.188  -26.13  < 2e-16 ***
## magecat1       2.257      0.208   10.87  < 2e-16 ***
## magecat2       1.432      0.198    7.25  4.3e-13 ***
## bthparity2    -0.582      0.151   -3.85  0.00012 ***
## smoke          0.676      0.155    4.36  1.3e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2319.1  on 11739  degrees of freedom
## Residual deviance: 2078.1  on 11735  degrees of freedom
##   (12 observations deleted due to missingness)
## AIC: 2088
## 
## Number of Fisher Scoring iterations: 8
exp(cbind(OR = coef(bd.log), CI = confint(bd.log)))
## Waiting for profiling to be done...
##                   OR    2.5 %   97.5 %
## (Intercept) 0.007393 0.005033  0.01052
## magecat1    9.555339 6.419046 14.52035
## magecat2    4.185062 2.869565  6.24008
## bthparity2  0.558605 0.413840  0.74906
## smoke       1.965743 1.440422  2.64938

# likelyhood ratio test
anova(bd.log, test = "LRT")
## Analysis of Deviance Table
## 
## Model: binomial, link: logit
## 
## Response: casegrp
## 
## Terms added sequentially (first to last)
## 
## 
##            Df Deviance Resid. Df Resid. Dev Pr(>Chi)    
## NULL                       11739       2319             
## magecat1    1    134.2     11738       2185  < 2e-16 ***
## magecat2    1     77.1     11737       2108  < 2e-16 ***
## bthparity2  1     12.4     11736       2095  0.00042 ***
## smoke       1     17.2     11735       2078  3.4e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# gives saturated (full) model sequentially adding terms to the model does
# an overall test of the model ORDER REALLY MATTERS HERE!!!!