Key Concepts covered in the lecture include:
1. Rethinking regression for Bernoulli random variables: using the logistic model
2. The link function: logit
3. Parameter Estimation
4. Parameter Interpretation: The Odds Ratio
5. Diagnostics on logistic regression: Wald's Test
6. Diagnostics on logistic regression: Likelihood Ratio Test to compare a full and reduced model
7. AIC: a goodness of fit measure to replace R2 (for logistic models).
The inclass exercise is about using a generalized linear model, specifically the logistic regression.
Task: What is the probability of a birth defect for women <20, from 20-24, and >24 years old?
library(MASS) #We need the MASS library in order to run some diagnostics on the glm function
bd <- read.csv("E:/Quant/inClassExercises/InClassExerciseData/babybirthdata.csv")
# Create a generalized linear model using logit.
bdlog <- glm(bd$casegrp ~ bd$MAGE, family = binomial("logit"))
summary(bdlog) #coefficients don't mean anything.... must convert to a odds ratio
##
## Call:
## glm(formula = bd$casegrp ~ bd$MAGE, family = binomial("logit"))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.567 -0.240 -0.147 -0.089 3.610
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.837 0.328 2.56 0.011 *
## bd$MAGE -0.199 0.015 -13.22 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2327.3 on 11751 degrees of freedom
## Residual deviance: 2096.3 on 11750 degrees of freedom
## AIC: 2100
##
## Number of Fisher Scoring iterations: 8
# The result of the model outputs either a 1 or a 0 (yes or no...). We
# can see this by 'jittering' the data
plot(jitter(casegrp) ~ MAGE, bd) #meaningless.... adding noise into the Y variable.
# We can also plot the likelihood of a 1(occurance of a birth defect) for
# a continuous X variable (age)
plot(bd$MAGE, fitted(glm(casegrp ~ MAGE, binomial, data = bd)), xlab = "Maternal Age",
ylab = "P(Birth Defect)", pch = 15)
# In order to make sense of the model parameters, we must convert the data
# to an odds ratio. convert to odds ratio... essential inversing the
# natural log link function.
exp(-0.19872)
## [1] 0.8198
########### IMPORTANT!!!############### this is how to interpret the
########### coefficeint. FOr a 1 year increase in age, there is a ___
########### change in odds of a birth defect
exp(confint(bdlog)) #this is the confidence interval for the odds ratio
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) 1.2214 4.4155
## bd$MAGE 0.7955 0.8438
# In order to answer the question, we must categorize the data into
# groups. Categorize the data into groups:
bd$magecat3 <- ifelse(bd$MAGE > 25, c(1), c(0))
bd$magecat2 <- ifelse(bd$MAGE >= 20 & bd$MAGE <= 25, c(1), c(0))
bd$magecat1 <- ifelse(bd$MAGE < 20, c(1), c(0))
# Fit a model to the categorical dummy variables (using 1 less category
# than is present)
bdlog2 <- glm(casegrp ~ magecat1 + magecat2, binomial, data = bd)
summary(bdlog2)
##
## Call:
## glm(formula = casegrp ~ magecat1 + magecat2, family = binomial,
## data = bd)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.378 -0.232 -0.106 -0.106 3.222
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.186 0.167 -31.0 < 2e-16 ***
## magecat1 2.582 0.196 13.2 < 2e-16 ***
## magecat2 1.582 0.195 8.1 5.4e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2327.3 on 11751 degrees of freedom
## Residual deviance: 2113.7 on 11749 degrees of freedom
## AIC: 2120
##
## Number of Fisher Scoring iterations: 8
# Have an intercept plus 2 variables in output table Reference group is
# the omitted groups (e.g. magegrpCat3 aka persons over 24)
# Fit the model...
yr20 <- exp(-5.1861 + 2.5824 * (1) + 1.5817 * (0))
summary(yr20/(1 + yr20)) #probability of having a birth defect in this group
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0689 0.0689 0.0689 0.0689 0.0689 0.0689
# calculate odds ratios
exp(cbind(OR = coef(bdlog2), confint(bdlog2)))
## Waiting for profiling to be done...
## OR 2.5 % 97.5 %
## (Intercept) 0.005594 0.003956 0.007633
## magecat1 13.228246 9.092820 19.682178
## magecat2 4.863064 3.350905 7.219970
# Group 1 is 13x more likely of having a BD than group 3 (the omitted
# group[s]) Group 2 is 4x more likely of having....
bd.log <- glm(casegrp ~ magecat1 + magecat2 + bthparity2 + smoke, binomial,
data = bd)
summary(bd.log)
##
## Call:
## glm(formula = casegrp ~ magecat1 + magecat2 + bthparity2 + smoke,
## family = binomial, data = bd)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.510 -0.247 -0.121 -0.091 3.315
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.907 0.188 -26.13 < 2e-16 ***
## magecat1 2.257 0.208 10.87 < 2e-16 ***
## magecat2 1.432 0.198 7.25 4.3e-13 ***
## bthparity2 -0.582 0.151 -3.85 0.00012 ***
## smoke 0.676 0.155 4.36 1.3e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2319.1 on 11739 degrees of freedom
## Residual deviance: 2078.1 on 11735 degrees of freedom
## (12 observations deleted due to missingness)
## AIC: 2088
##
## Number of Fisher Scoring iterations: 8
exp(cbind(OR = coef(bd.log), CI = confint(bd.log)))
## Waiting for profiling to be done...
## OR 2.5 % 97.5 %
## (Intercept) 0.007393 0.005033 0.01052
## magecat1 9.555339 6.419046 14.52035
## magecat2 4.185062 2.869565 6.24008
## bthparity2 0.558605 0.413840 0.74906
## smoke 1.965743 1.440422 2.64938
# likelyhood ratio test
anova(bd.log, test = "LRT")
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: casegrp
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 11739 2319
## magecat1 1 134.2 11738 2185 < 2e-16 ***
## magecat2 1 77.1 11737 2108 < 2e-16 ***
## bthparity2 1 12.4 11736 2095 0.00042 ***
## smoke 1 17.2 11735 2078 3.4e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# gives saturated (full) model sequentially adding terms to the model does
# an overall test of the model ORDER REALLY MATTERS HERE!!!!