options(scipen=999, digits = 2)
good <- read.csv("good.csv")
goodMini <- na.omit(good[,c("CHID","bthwht","HOME97","read02","WICpreg")])
Ideally, the proportion of events and non-events in the Y variable should be approximately the same, so let's first check the proportion of classes in the dependent variable, bthwht.
table(goodMini$bthwht)
##
##    0    1
## 1567  809
prop.table(table(goodMini$bthwht))
##
##    0    1
## 0.66 0.34
First, let's run an OLS regression (for demonstration only). Remember: AVOID DOING THIS WITH A BINARY DV!
Let's look at the summary of the model with WICpreg as the predictor.
summary(ols <- lm(bthwht ~ WICpreg, data=goodMini))
##
## Call:
## lm(formula = bthwht ~ WICpreg, data = goodMini)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -0.408 -0.408 -0.288  0.592  0.712
##
## Coefficients:
##             Estimate Std. Error t value             Pr(>|t|)
## (Intercept)   0.2885     0.0128   22.49 < 0.0000000000000002 ***
## WICpreg       0.1199     0.0195    6.16        0.00000000087 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.47 on 2374 degrees of freedom
## Multiple R-squared: 0.0157, Adjusted R-squared: 0.0153
## F-statistic: 37.9 on 1 and 2374 DF, p-value: 0.000000000869
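As a quick aside (a minimal sketch, not part of the original output), we can inspect the fitted values of the ols object. With a single binary predictor they happen to stay inside [0, 1], but a linear probability model offers no such guarantee once continuous predictors enter, which is one reason OLS is discouraged for a binary DV.

range(fitted(ols))   # only two distinct fitted values here, both between 0 and 1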
That is what we would get if we were to run a linear model on this binary outcome.
Simple Logistic Regression
lm1<-glm(bthwht~WICpreg,
family=binomial(link='logit'),data=goodMini)
summary(lm1) # AIC: 3014.5
##
## Call:
## glm(formula = bthwht ~ WICpreg, family = binomial(link = "logit"),
## data = goodMini)
##
## Deviance Residuals:
##    Min     1Q Median     3Q    Max
## -1.024 -1.024 -0.825  1.338  1.577
##
## Coefficients:
##             Estimate Std. Error z value             Pr(>|z|)
## (Intercept)  -0.9028     0.0602  -15.00 < 0.0000000000000002 ***
## WICpreg       0.5320     0.0874    6.09         0.0000000011 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3047.7 on 2375 degrees of freedom
## Residual deviance: 3010.5 on 2374 degrees of freedom
## AIC: 3015
##
## Number of Fisher Scoring iterations: 4
Odds Ratio (exponentiating logit coefficient)
OR1<-exp(coef(lm1))
OR1
## (Intercept)     WICpreg
##        0.41        1.70
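To connect the two scales (a small sketch using base R's plogis(), the inverse logit): the intercept odds of 0.41 correspond to a probability of about 0.29 for the WICpreg = 0 group, and adding the WICpreg coefficient gives about 0.41 for the WICpreg = 1 group, matching the OLS fitted values above.

plogis(coef(lm1)["(Intercept)"])  # P(bthwht = 1 | WICpreg = 0), about 0.29
plogis(sum(coef(lm1)))            # P(bthwht = 1 | WICpreg = 1), about 0.41
OR1["WICpreg"]                    # odds of bthwht = 1 are about 1.7 times higher when WICpreg = 1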
Logistic Regression With Two or More Predictors
lm2<-glm(bthwht~HOME97 + WICpreg,
family=binomial(link='logit'),data=goodMini)
summary(lm2) # AIC = 3006
##
## Call:
## glm(formula = bthwht ~ HOME97 + WICpreg, family = binomial(link = "logit"),
## data = goodMini)
##
## Deviance Residuals:
##    Min     1Q Median     3Q    Max
## -1.182 -0.913 -0.816  1.327  1.788
##
## Coefficients:
##             Estimate Std. Error z value       Pr(>|z|)
## (Intercept)  -1.7439     0.2688   -6.49 0.000000000088 ***
## HOME97        0.0417     0.0129    3.23         0.0012 **
## WICpreg       0.6311     0.0931    6.78 0.000000000012 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3047.7 on 2375 degrees of freedom
## Residual deviance: 3000.0 on 2373 degrees of freedom
## AIC: 3006
##
## Number of Fisher Scoring iterations: 4
Odds Ratio (exponentiating logit coefficient)
OR2<-exp(coef(lm2))
OR2
## (Intercept)      HOME97     WICpreg
##        0.17        1.04        1.88
Odds Ratios with 95% confidence intervals
or_ci <- exp(cbind(OR = coef(lm2), confint(lm2)))
## Waiting for profiling to be done...
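The or_ci object is created but not printed above. A quick way to display it is sketched below; confint() gives profile-likelihood intervals (hence the profiling message), while confint.default() gives faster Wald-type intervals as a rough check.

round(or_ci, 2)                                             # ORs with profile-likelihood 95% CIs
round(exp(cbind(OR = coef(lm2), confint.default(lm2))), 2)  # Wald-type 95% CIs for comparison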
Cox.Snell & Nagelkerke are the \(R^2\)s of interest.
# install.packages("BaylorEdPsych")
library(BaylorEdPsych)
## Warning: package 'BaylorEdPsych' was built under R version 3.5.2
PseudoR2(lm1)
##         McFadden     Adj.McFadden        Cox.Snell       Nagelkerke
##            0.012            0.010            0.016            0.022
## McKelvey.Zavoina           Effron            Count        Adj.Count
##            0.021            0.016               NA               NA
##              AIC    Corrected.AIC
##         3014.528         3014.533
PseudoR2(lm2)
##         McFadden     Adj.McFadden        Cox.Snell       Nagelkerke
##           0.0157           0.0130           0.0199           0.0275
## McKelvey.Zavoina           Effron            Count        Adj.Count
##           0.0268           0.0204           0.6591          -0.0012
##              AIC    Corrected.AIC
##        3005.9969        3006.0071
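As a cross-check that needs no extra packages (a sketch using the deviances reported in summary(lm2)): Cox & Snell's \(R^2\) is \(1 - \exp\{(D_{res} - D_{null})/n\}\), where \(D_{res}\) and \(D_{null}\) are the residual and null deviances, and Nagelkerke's rescales it so its maximum is 1.

n  <- nrow(goodMini)
cs <- 1 - exp((lm2$deviance - lm2$null.deviance) / n)  # Cox & Snell
nk <- cs / (1 - exp(-lm2$null.deviance / n))           # Nagelkerke
round(c(Cox.Snell = cs, Nagelkerke = nk), 4)           # about 0.0199 and 0.0275, matching PseudoR2(lm2)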
A logistic regression is said to provide a better fit to the data if it demonstrates an improvement over a model with fewer predictors. This is performed using the likelihood ratio test, which compares the likelihood of the data under the full model against the likelihood of the data under a model with fewer predictors. Removing predictor variables from a model will almost always make the model fit less well (i.e. a model will have a lower log likelihood), but it is necessary to test whether the observed difference in model fit is statistically significant.
The null hypothesis (H0) is that the reduced model is true, so a p-value for the likelihood ratio statistic below 0.05 compels us to reject H0: it provides evidence against the reduced model in favor of the fuller model. https://www.r-bloggers.com/evaluating-logistic-regression-models/
# install.packages("lmtest")
library(lmtest)
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
lrtest(lm1, lm2)
## Likelihood ratio test
##
## Model 1: bthwht ~ WICpreg
## Model 2: bthwht ~ HOME97 + WICpreg
##   #Df LogLik Df Chisq Pr(>Chisq)
## 1   2  -1505
## 2   3  -1500  1  10.5     0.0012 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
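The same comparison can be reproduced without lmtest (a sketch using base R): anova() with test = "Chisq" on the nested fits, or by hand, since the LR statistic is twice the difference in log-likelihoods, compared to a chi-square with df equal to the number of added parameters.

anova(lm1, lm2, test = "Chisq")                      # same chi-square comparison via base R
lr <- as.numeric(2 * (logLik(lm2) - logLik(lm1)))    # LR statistic, about 10.5
pchisq(lr, df = 1, lower.tail = FALSE)               # p-value, about 0.0012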
To obtain predicted probabilities, we create new data frames (one holding WICpreg at 1, one at 0) containing the predictor values at which we want predictions. It can help interpretation to plot the predicted probability that bthwht = 1 against each predictor separately.
range(goodMini$HOME97) # 7.0 26.9
## [1]  7 27
# holding WICpreg constant at 1
goodNew_WIC = data.frame(WICpreg = 1,
                         HOME97 = seq(min(goodMini$HOME97), max(goodMini$HOME97)))
# holding WICpreg constant at 0
goodNew_noWIC = data.frame(WICpreg = 0,
                           HOME97 = seq(min(goodMini$HOME97), max(goodMini$HOME97)))
Now we use the predict() function to calculate the predicted probability from the model for all of the values of HOME97.
goodNew_WIC$bthwht = predict(lm2, newdata=goodNew_WIC, type="response")
goodNew_noWIC$bthwht = predict(lm2, newdata=goodNew_noWIC, type="response")
View(goodNew_WIC)
View(goodNew_noWIC)
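A small side note (not part of the original output): View() opens an interactive viewer in RStudio, so head() is handier in scripts, and predict() with its default type = "link" returns log-odds that plogis() maps onto the same probabilities as type = "response".

head(goodNew_WIC, 3)   # first few rows of the prediction frame
# the default type = "link" gives log-odds; plogis() recovers the probabilities
all.equal(unname(plogis(predict(lm2, newdata = goodNew_WIC, type = "link"))),
          unname(goodNew_WIC$bthwht))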
# We include the argument type="response" in order to get predicted probabilities.
Next, we can plot the predicted probability that bthwht = 1 across values of HOME97, with WICpreg held at 1 and at 0 respectively.
plot(bthwht~HOME97, data=goodMini, col="grey3")
lines(bthwht~HOME97, goodNew_WIC, col="pink1", lwd=2)
lines(bthwht~HOME97, goodNew_noWIC, col="lightblue3", lwd=2)
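A legend makes the two predicted-probability curves easier to tell apart (a small addition; the colours match the lines above).

legend("topleft", legend = c("WICpreg = 1", "WICpreg = 0"),
       col = c("pink1", "lightblue3"), lwd = 2, bty = "n")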