Logistic Regression Models

02-11 Lab exercise

See also slides for Logistic Regression

Logistic regression is for non-continuous data (e.g. it doesn't make sense to have an value of 1.2 for a gender or race category) The simple linear and multiple regressions we did last week are for continuous variables, such as measurements of CO2 or time.

import birthdata

bd <- read.csv(file.choose(), header = TRUE)

## Error: file choice cancelled

head(bd)

## Error: object 'bd' not found

end(bd)

## Error: object 'bd' not found

summary(bd)

## Error: object 'bd' not found

Variables: MAGE = mother's age casegrp: born with a defect (1) or not (0) dummy variables for other variables

bdlog <- glm(bd$casegrp ~ bd$MAGE, family = binomial("logit"))

## Error: object 'bd' not found

summary(bdlog)

## Error: object 'bdlog' not found

VERY IMPORTANT: Interpret the coefficients in the data output: these dont' make much sense until we turn them into an odds ratio!!!

plot(casegrp ~ MAGE, bd)

## Error: object 'bd' not found

Plot is not that useful: there is only 2 possible outcomes for every age, so there could be lots of women of 20 years old. Could fix by “jittering” the data:

plot(jitter(casegrp) ~ MAGE, bd)

## Error: object 'bd' not found

This is slightly more useful, but not entirely.

The main point here is that a lot of the tools we used in regular regression are just not that useful in logistic regression.

plot(bd$MAGE, fitted(glm(casegrp ~ MAGE, binomial, data = bd)), xlab = "Maternal Age", 
    ylab = "P(Birth Defect)", pch = 15)

## Error: object 'bd' not found

Extracts the fitted/predicted values of the GLM….?

Plots: Probability of a baby being born with a birth defect as a function of mother's age. This is not really what we'd expect though! Most of the data is probably in the 20-25 age range…

library(MASS)  # load the MASS library to calculate the odds ratio (this gives us the confint command in the next line)
exp(confint(bdlog))

## Error: object 'bdlog' not found

Slide 30 from lecture on Logistic Regression: break mother's age into 3 groups:

bd$magecat3 <- ifelse(bd$MAGE > 25, c(1), c(0))  # oldest group, over 25

## Error: object 'bd' not found

bd$magecat2 <- ifelse(bd$MAGE >= 20 & bd$MAGE <= 25, c(1), c(0))  # middle group, 20-25

## Error: object 'bd' not found

bd$magecat1 <- ifelse(bd$MAGE < 20, c(1), c(0))  # youngest group, under 20

## Error: object 'bd' not found

# making a dummy variable for age category greater than 25. The 'c' in
# this code inserts it as a numeric value rather than a list, but it's not
# that important... code can work without the c. C creates a vector (list)
# with 1 item in it. could also do 'true' and 'false'
head(bd)  # we see we have 3 new categories

## Error: object 'bd' not found

bdlog2 <- glm(bd$casegrp ~ bd$magecat1 + bd$magecat2, binomial)  # fit a new model with these new categories

## Error: object 'bd' not found

summary(bdlog2)

## Error: object 'bdlog2' not found

Interpret the output for the bdlog2 model: when we take these odds ratios, they are in comparison to the group that is NOT in our model. We did not include the oldest group, over age 25. So computations of odds ratios are relative to that group.

CALCULATE PIi: what is the probability of having a birth defect for women under 25 years old? refer to equation for calculating the probability on slide 28 use variables for each age group set up in the model, which are given in the summary(bdlog2)

# set up an equation with all variables. insert 0s

# Group magecat1: under 20
(exp(-5.18 + (1 * 2.58) + (0 * 1.58)))/(1 + exp(-5.18 + (1 * 2.58) + (0 * 1.58)))  #exp is e to the power of....

## [1] 0.06914

# probability of a woman in magecat 1 (under 20) of having a baby with a
# birth defect is .0691 or 6.9 %

# Group magecat2: 20-25
(exp(-5.18 + (0 * 2.58) + (1 * 1.58)))/(1 + exp(-5.18 + (0 * 2.58) + (1 * 1.58)))

## [1] 0.0266

# probably of a woman in magecat 2 (20-25) of having a baby with a birth
# defect is .0266 or 2.7 %

# Group magecat3: over 25
(exp(-5.18 + (0 * 2.58) + (0 * 1.58)))/(1 + exp(-5.18 + (0 * 2.58) + (0 * 1.58)))

## [1] 0.005597

# probability of a woman in magecat 3 (over 25) of having a baby with a
# birth defect is .0056 or 0.56 % We can calculate the probability of
# having a birth defect for the group of women not included in the model
# (magecat3) because if a woman isn't in one of the first 2 groups, she is
# automatically in the 3rd, by default. Just just put 0s next to the
# coefficients for those two categories

What about the odd's ratio? see slide # 34

exp(cbind(OR = coef(bdlog2), confint(bdlog2)))

## Error: object 'bdlog2' not found

Interpretation of odds ratio: Remember that the odds ratio is RELATIVE to the category that is left out, which in our case is the over-25 women Reading the output: women <20 years have a 13 times greater odds of a birth defect than women >24 years women 20-24 years have a 5 times greater odds of a birth defect than women >24 years

Slide 36: contribution of individual variables. Build a more complex model with more variables:

bd.log <- glm(casegrp ~ magecat1 + magecat2 + bthparity2 + smoke, binomial, 
    data = bd)

## Error: object 'bd' not found

summary(bd.log)

## Error: object 'bd.log' not found

exp(cbind(OR = coef(bd.log), CI = confint(bd.log)))

## Error: object 'bd.log' not found

# If we run the odds ratios, we see that the some of the variables in the
# output that we had before have lower numbers. That's because introducing
# additional variables takes away explanatory power from pre-existing
# ones. e.g. older women may be smokers.

# Likelihood ratio test:
anova(bd.log, test = "LRT")

## Error: object 'bd.log' not found

# This is an overall test: whether the model is explaining a sig portion
# of the variance.  Results are that all these variables are significant
# to the model

So what variables can we consider dropping? This is a change in deviance test. Similar to the LRT test above.

anova(bd.log, test = "Chisq")

## Error: object 'bd.log' not found

# results: Small p-values indicate that all variables are needed to
# explain the variation in y

So, when we are doing a logistic model, we are not trying to predict Yi, but PI hat: the probability that for any value of Y, we get a 1/postive outcome.

summary(cars)

##      speed           dist    
##  Min.   : 4.0   Min.   :  2  
##  1st Qu.:12.0   1st Qu.: 26  
##  Median :15.0   Median : 36  
##  Mean   :15.4   Mean   : 43  
##  3rd Qu.:19.0   3rd Qu.: 56  
##  Max.   :25.0   Max.   :120

You can also embed plots, for example:

plot(cars)

plot of chunk unnamed-chunk-13