MSSP 897 Lab 11: Logistic Regression

The slides for Lab 11 session are now available to review here.

options(scipen=999, digits = 2)
good <- read.csv("good.csv")
goodMini <- na.omit(good[,c("CHID","bthwht","HOME97","read02","WICpreg")])

11.1 Check Class bias

Ideally, the proportion of events and non-events in the Y variable should approximately be the same. So, lets first check the proportion of classes in the dependent variable bthwht.

table(goodMini$bthwht)

## 
##    0    1 
## 1567  809

prop.table(table(goodMini$bthwht))

## 
##    0    1 
## 0.66 0.34

First let’s run OLS regression (for demonstration only), remember AVOID DOING THIS WITH A BINARY DV!!

Let’s look at the summary of the predictor variable (WICpreg)

summary(ols <- lm(bthwht ~ WICpreg, data=goodMini))

## 
## Call:
## lm(formula = bthwht ~ WICpreg, data = goodMini)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -0.408 -0.408 -0.288  0.592  0.712 
## 
## Coefficients:
##             Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)   0.2885     0.0128   22.49 < 0.0000000000000002 ***
## WICpreg       0.1199     0.0195    6.16        0.00000000087 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.47 on 2374 degrees of freedom
## Multiple R-squared:  0.0157, Adjusted R-squared:  0.0153 
## F-statistic: 37.9 on 1 and 2374 DF,  p-value: 0.000000000869

#If we were to run a linear model

11.2 Logistic Regression

Simple Logistic Regression

lm1<-glm(bthwht~WICpreg,
         family=binomial(link='logit'),data=goodMini)
summary(lm1) # AIC: 3014.5

## 
## Call:
## glm(formula = bthwht ~ WICpreg, family = binomial(link = "logit"), 
##     data = goodMini)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.024  -1.024  -0.825   1.338   1.577  
## 
## Coefficients:
##             Estimate Std. Error z value             Pr(>|z|)    
## (Intercept)  -0.9028     0.0602  -15.00 < 0.0000000000000002 ***
## WICpreg       0.5320     0.0874    6.09         0.0000000011 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 3047.7  on 2375  degrees of freedom
## Residual deviance: 3010.5  on 2374  degrees of freedom
## AIC: 3015
## 
## Number of Fisher Scoring iterations: 4

Odds Ratio (exponentiating logit coefficient)

OR1<-exp(coef(lm1))
OR1

## (Intercept)     WICpreg 
##        0.41        1.70

Logistic Regression With Two or More Predicotrs

lm2<-glm(bthwht~HOME97 + WICpreg,
         family=binomial(link='logit'),data=goodMini)
summary(lm2)   # AIC = 3006

## 
## Call:
## glm(formula = bthwht ~ HOME97 + WICpreg, family = binomial(link = "logit"), 
##     data = goodMini)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.182  -0.913  -0.816   1.327   1.788  
## 
## Coefficients:
##             Estimate Std. Error z value       Pr(>|z|)    
## (Intercept)  -1.7439     0.2688   -6.49 0.000000000088 ***
## HOME97        0.0417     0.0129    3.23         0.0012 ** 
## WICpreg       0.6311     0.0931    6.78 0.000000000012 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 3047.7  on 2375  degrees of freedom
## Residual deviance: 3000.0  on 2373  degrees of freedom
## AIC: 3006
## 
## Number of Fisher Scoring iterations: 4

**Odds Ratio (exponentiating logit coefficient)

OR2<-exp(coef(lm2))
OR2

## (Intercept)      HOME97     WICpreg 
##        0.17        1.04        1.88

Odds Ratios with 95% confidence intervals

or_ci <- exp(cbind(OR = coef(lm2), confint(lm2)))

## Waiting for profiling to be done...

11.3 Goodness of Fit

Psuedo R2

Cox.Snell & Nagelkerke are the \(R^2s\) of interest.

# install.packages("BaylorEdPsych")
library(BaylorEdPsych)

## Warning: package 'BaylorEdPsych' was built under R version 3.5.2

PseudoR2(lm1)

##         McFadden     Adj.McFadden        Cox.Snell       Nagelkerke 
##            0.012            0.010            0.016            0.022 
## McKelvey.Zavoina           Effron            Count        Adj.Count 
##            0.021            0.016               NA               NA 
##              AIC    Corrected.AIC 
##         3014.528         3014.533

PseudoR2(lm2)

##         McFadden     Adj.McFadden        Cox.Snell       Nagelkerke 
##           0.0157           0.0130           0.0199           0.0275 
## McKelvey.Zavoina           Effron            Count        Adj.Count 
##           0.0268           0.0204           0.6591          -0.0012 
##              AIC    Corrected.AIC 
##        3005.9969        3006.0071

Likelihood Ratio Test

A logistic regression is said to provide a better fit to the data if it demonstrates an improvement over a model with fewer predictors. This is performed using the likelihood ratio test, which compares the likelihood of the data under the full model against the likelihood of the data under a model with fewer predictors. Removing predictor variables from a model will almost always make the model fit less well (i.e. a model will have a lower log likelihood), but it is necessary to test whether the observed difference in model fit is statistically significant.

Given that H0 holds that the reduced model is true, a p-value for the overall model fit statistic that is less than 0.05 would compel us to reject the null hypothesis. It would provide evidence against the reduced model in favor of the current model. https://www.r-bloggers.com/evaluating-logistic-regression-models/

# install.packages("lmtest")
library(lmtest)

## Loading required package: zoo

## 
## Attaching package: 'zoo'

## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

lrtest(lm1, lm2)

## Likelihood ratio test
## 
## Model 1: bthwht ~ WICpreg
## Model 2: bthwht ~ HOME97 + WICpreg
##   #Df LogLik Df Chisq Pr(>Chisq)   
## 1   2  -1505                       
## 2   3  -1500  1  10.5     0.0012 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

11.4 Predicted Probability

To find out the predicted probability, we create a data frame called newdata (WIC or noWIC), in which we include the desired values for our prediction. It can help interpretation to plot the predicted probability that bthwht=1 against each predictor separately.

range(goodMini$HOME97) # 7.0 26.9

## [1]  7 27

# holding WICpreg constant at 1
goodNew_WIC = data.frame(WICpreg = 1, 
                     HOME97 = seq(min(goodMini$HOME97), max(goodMini$HOME97)))

# holding WICpreg constant at 0
goodNew_noWIC = data.frame(WICpreg = 0, 
                         HOME97 = seq(min(goodMini$HOME97), max(goodMini$HOME97)))

Now we use the predict() function to calculate the predicted probability of the model for all of the values of HOME97.

goodNew_WIC$bthwht = predict(lm2, newdata=goodNew_WIC, type="response")
goodNew_noWIC$bthwht = predict(lm2, newdata=goodNew_noWIC, type="response")
View(goodNew_WIC)
View(goodNew_noWIC)
# We include the argument type=”response” in order to get our prediction.

Next, we can plot out the predicted probability of y across values of HOME97 with WIC controlled at 1 and 0 respectively

plot(bthwht~HOME97, data=goodMini, col="grey3")
lines(bthwht~HOME97, goodNew_WIC, col="pink1", lwd=2)
lines(bthwht~HOME97, goodNew_noWIC, col="lightblue3", lwd=2)