This example will further explore the logisitic regression model, including discussing model stratification, the chow test and comparison of regression effects across models.
For this example I am using 2014 CDC Behavioral Risk Factor Surveillance System (BRFSS) SMART county data. Link
#load brfss
library(car)
library(stargazer)
library(survey)
library(sjPlot)
library(ggplot2)
library(pander)
library(knitr)
load("brfss_14.Rdata")
#The names in the data are very ugly, so I make them less ugly
nams<-names(brfss14)
head(nams, n=10)
## [1] "dispcode" "genhlth" "physhlth" "menthlth" "poorhlth" "hlthpln1"
## [7] "persdoc2" "medcost" "checkup1" "exerany2"
#we see some names are lower case, some are upper and some have a little _ in the first position. This is a nightmare.
set.seed(1234)
newnames<-gsub(pattern = "x_",replacement = "",x = nams)
names(brfss14)<-tolower(newnames)
#samps<-sample(1:nrow(brfss14), size = 10000, replace = F)
#brfss14<-brfss14[samps,]
#Poor or fair self rated health
#brfss14$badhealth<-ifelse(brfss14$genhlth %in% c(4,5),1,0)
brfss14$badhealth<-recode(brfss14$genhlth, recodes="4:5=1; 1:3=0; else=NA")
#race/ethnicity
brfss14$black<-recode(brfss14$racegr3, recodes="2=1; 9=NA; else=0")
brfss14$white<-recode(brfss14$racegr3, recodes="1=1; 9=NA; else=0")
brfss14$other<-recode(brfss14$racegr3, recodes="3:4=1; 9=NA; else=0")
brfss14$hispanic<-recode(brfss14$racegr3, recodes="5=1; 9=NA; else=0")
brfss14$race_eth<-recode(brfss14$racegr3, recodes="1='nhwhite'; 2='nh black'; 3='nh other';
4='nh multirace'; 5='hispanic'; else=NA", as.factor.result = T)
brfss14$race_eth<-relevel(brfss14$race_eth, ref = "nhwhite")
#insurance
brfss14$ins<-ifelse(brfss14$hlthpln1==1,1,0)
#income grouping
brfss14$inc<-ifelse(brfss14$incomg==9, NA, brfss14$incomg)
#education level
brfss14$educ<-recode(brfss14$educa, recodes="1:2='0Prim'; 3='1somehs'; 4='2hsgrad';
5='3somecol'; 6='4colgrad';9=NA", as.factor.result=T)
#brfss14$educ<-relevel(brfss14$educ, ref='0Prim')
#employment
brfss14$employ<-recode(brfss14$employ, recodes="1:2='Employed'; 2:6='nilf';
7='retired'; 8='unable'; else=NA", as.factor.result=T)
brfss14$employ<-relevel(brfss14$employ, ref='Employed')
#marital status
brfss14$marst<-recode(brfss14$marital, recodes="1='married'; 2='divorced'; 3='widowed';
4='separated'; 5='nm';6='cohab'; else=NA", as.factor.result=T)
brfss14$marst<-relevel(brfss14$marst, ref='married')
#Age cut into intervals
brfss14$agec<-cut(brfss14$age80, breaks=c(0,24,39,59,79,99), include.lowest = T)
#Health behaviors
#insurance
brfss14$ins<-recode(brfss14$hlthpln1, recodes = "1=1; 2=0; else=NA" )
#smoking currently
brfss14$smoke<-recode(brfss14$smoker3, recodes="1:2='Current'; 3='Former';4='NeverSmoked'; else=NA", as.factor.result=T)
brfss14$smoke<-relevel(brfss14$smoke, ref = "NeverSmoked")
If our outcome is dichtomous (1/0), the natural distribution to consider for a GLM is the binomial \[y \sim \ \text{Binomial}\binom{n}{p}\] with \(p\) being the mean of the binomial, and n being the number of trials, generally when you have individual data, n is always 1, and p is the probability of observing the 1, conditional on the observed predictors. There are two common techniques to estimate the mean, logistic and probit regression. In a Logistic model, the link function is the inverse logit function, or
\(\text{Logit}^{-1}(p) =log \frac{p}{(1-p)}\)
Which gives us the following conditional mean model:
\[E(y|x) = \frac{1}{1+ exp({-\sum_k \beta_k x_{ik}})}\] Which situates the model within the logistic distribution function. Expressing p as a linear model is done via this log odds transformation of the probability:
\[log \frac{p}{(1-p)} = \sum_k \beta_k x_{ik}\]
Often in a research setting we are interested in comparing several models, and almost never are we satisfied with a single solitary model. This is because the literature on our subjects often has multiple facets to it. So, for instance, we may be interested in how SES mediates the effects of race/ethnicity on health (see Shuey and Wilson 2008 and Sudano and Baker 2005 or Hummer 1993 for a few examples of these types of analysis).
Typically in these types of analysis, predictor variables are entered into the model in “blocks”. For example, let’s look at the self-rated health outcome (0=good or excellent health, 1= fair/poor health) from last week. But instead of entering all variables in the model simultaneously, we begin with the effect of race/ethnicity, then add the effect of SES then the effects of health behaviour variables.
options(survey.lonely.psu = "adjust")
des<-svydesign(ids=~1, strata = ~ststr, weights = ~mmsawt,data= brfss14, nest=T)
#Logit models
fit.logit1<-svyglm(badhealth~race_eth,design= des, family=binomial) #race only
## Warning: non-integer #successes in a binomial glm!
fit.logit2<-svyglm(badhealth~race_eth+educ,design= des, family=binomial) #race+education
## Warning: non-integer #successes in a binomial glm!
fit.logit3<-svyglm(badhealth~race_eth+educ+ins+smoke,design= des, family=binomial)#race+education+health behaviors
## Warning: non-integer #successes in a binomial glm!
summary(fit.logit1)
##
## Call:
## svyglm(formula = badhealth ~ race_eth, design = des, family = binomial)
##
## Survey design:
## svydesign(ids = ~1, strata = ~ststr, weights = ~mmsawt, data = brfss14,
## nest = T)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.81225 0.01258 -144.070 < 2e-16 ***
## race_ethhispanic 0.81773 0.03004 27.223 < 2e-16 ***
## race_ethnh black 0.50635 0.03211 15.770 < 2e-16 ***
## race_ethnh multirace 0.34237 0.08086 4.234 2.30e-05 ***
## race_ethnh other -0.30556 0.06551 -4.665 3.09e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 0.9980624)
##
## Number of Fisher Scoring iterations: 4
In model 1 we see that Hispanics and blacks have a higher odds of reporting poor self rated health, compared to non-Hispanic whites, while the “other” group shows lower odds of reporting poor health.
Now, let’s see if, by controlling for education, some of these differences go away, or are reduced. The fancy word for when an effect is reduced is “attenuated”. We will also do a test to see if the model with the education term significantly improves the model fit. Traditionally this would be done using a likelihood ratio test, but in survey models, that’s not kosherload
summary(fit.logit2)
##
## Call:
## svyglm(formula = badhealth ~ race_eth + educ, design = des, family = binomial)
##
## Survey design:
## svydesign(ids = ~1, strata = ~ststr, weights = ~mmsawt, data = brfss14,
## nest = T)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.20534 0.05332 -3.851 0.000118 ***
## race_ethhispanic 0.22203 0.03332 6.664 2.67e-11 ***
## race_ethnh black 0.32486 0.03350 9.696 < 2e-16 ***
## race_ethnh multirace 0.22609 0.08693 2.601 0.009298 **
## race_ethnh other -0.21407 0.07000 -3.058 0.002227 **
## educ1somehs -0.59801 0.06257 -9.557 < 2e-16 ***
## educ2hsgrad -1.27173 0.05503 -23.110 < 2e-16 ***
## educ3somecol -1.62042 0.05582 -29.027 < 2e-16 ***
## educ4colgrad -2.42243 0.05662 -42.784 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 0.9923749)
##
## Number of Fisher Scoring iterations: 5
regTermTest(fit.logit2, test.terms = ~educ, method="Wald", df = NULL)
## Wald test for educ
## in svyglm(formula = badhealth ~ race_eth + educ, design = des, family = binomial)
## F = 805.2959 on 4 and 235130 df: p= < 2.22e-16
so, we see the race effects in all groups attenuate (reduce in size) somewhat after considering education, so the differences in health are smaller once you control for education. The F test also suggest that the second model fits the data better than the first.
Next we consider the third model, which contains health behaviors of current smoking and insurance coverage:
summary(fit.logit3)
##
## Call:
## svyglm(formula = badhealth ~ race_eth + educ + ins + smoke, design = des,
## family = binomial)
##
## Survey design:
## svydesign(ids = ~1, strata = ~ststr, weights = ~mmsawt, data = brfss14,
## nest = T)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.64784 0.06604 -9.809 < 2e-16 ***
## race_ethhispanic 0.40986 0.03557 11.523 < 2e-16 ***
## race_ethnh black 0.42337 0.03512 12.056 < 2e-16 ***
## race_ethnh multirace 0.26179 0.08968 2.919 0.00351 **
## race_ethnh other -0.06810 0.07259 -0.938 0.34815
## educ1somehs -0.61995 0.06606 -9.385 < 2e-16 ***
## educ2hsgrad -1.26118 0.05825 -21.651 < 2e-16 ***
## educ3somecol -1.59819 0.05899 -27.093 < 2e-16 ***
## educ4colgrad -2.32393 0.05972 -38.916 < 2e-16 ***
## ins 0.12232 0.04011 3.049 0.00229 **
## smokeCurrent 0.61914 0.03214 19.261 < 2e-16 ***
## smokeFormer 0.53426 0.02762 19.341 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 0.9788161)
##
## Number of Fisher Scoring iterations: 5
regTermTest(fit.logit3, test.terms=~ins+smoke, method="Wald", df = NULL)
## Wald test for ins smoke
## in svyglm(formula = badhealth ~ race_eth + educ + ins + smoke, design = des,
## family = binomial)
## F = 181.8163 on 3 and 224775 df: p= < 2.22e-16
In this model, we see the effects for Hispanics and blacks go back up. This is somewhat confusing, but is most likely related to the differential rates of smoking among those groups, compared to whites. Both current and former smokers are more likely to report poor/fair health, while insurance coverage does not affect the odds at all. Finally, we see that the model is significantly better fitting than the previous model.
res<-data.frame(model1=c(round(coef(fit.logit1),3),rep("-",7)) ,
model2=c(round(coef(fit.logit2),3), rep("-",3)),
model3=round(coef(fit.logit3),3))
row.names(res)<-c("Intercept", "Hispanic", "Black","MultRace" ,"Other", "Some HS","HS Graduate", "Some College", "College Grad", "Insurance", "Current Smoker", "Former Smoker")
knitr::kable(res, caption="Model Coefficients in nested Logit Models", align="c")
| model1 | model2 | model3 | |
|---|---|---|---|
| Intercept | -1.812 | -0.205 | -0.648 |
| Hispanic | 0.818 | 0.222 | 0.410 |
| Black | 0.506 | 0.325 | 0.423 |
| MultRace | 0.342 | 0.226 | 0.262 |
| Other | -0.306 | -0.214 | -0.068 |
| Some HS | - | -0.598 | -0.620 |
| HS Graduate | - | -1.272 | -1.261 |
| Some College | - | -1.62 | -1.598 |
| College Grad | - | -2.422 | -2.324 |
| Insurance | - | - | 0.122 |
| Current Smoker | - | - | 0.619 |
| Former Smoker | - | - | 0.534 |
Often in the literature, we will see models stratified by some predictor. This is usually because a specific hypothesis is stated regarding how the effect of a particular predictor varies by some categoricial variable. In this case, we may be interested in considering if education or smoking universally affects the poor health outcome. We get at this by stratifying our analysis by race (in this example).
The easiest way to do this is to subset the data by race and run the models separately.
The first thing we do is test for the interaction of education and race. If this interaction is not significant, we have no justification for proceeding, becuase the effect of education does not vary by race group. This is the test of parallel slopes, a’ la the ANCOVA model
fit.logitint<-svyglm(badhealth~race_eth*educ+ins+smoke,design= des, family=binomial)#race*education interaction+health behaviors
## Warning: non-integer #successes in a binomial glm!
regTermTest(fit.logitint, test.terms = ~race_eth:educ, method = "Wald", df=NULL)
## Wald test for race_eth:educ
## in svyglm(formula = badhealth ~ race_eth * educ + ins + smoke, design = des,
## family = binomial)
## F = 3.314641 on 16 and 224759 df: p= 7.462e-06
Here, the F-test does indicate that the interaction term in the model is significant, so the effects of education are not constant by race.
Now we stratify our models:
fit.unrestricted<-svyglm(badhealth~educ+ins+smoke,design= des, family=binomial)
## Warning: non-integer #successes in a binomial glm!
fit.logit.white<-svyglm(badhealth~(educ+ins+smoke),design= subset(des, white==1), family=binomial)
## Warning: non-integer #successes in a binomial glm!
fit.logit.black<-svyglm(badhealth~(educ+ins+smoke),design= subset(des, black==1), family=binomial)
## Warning: non-integer #successes in a binomial glm!
fit.logit.other1<-svyglm(badhealth~(educ+ins+smoke),design= subset(des, race_eth=="nh other"), family=binomial)
## Warning: non-integer #successes in a binomial glm!
fit.logit.other2<-svyglm(badhealth~(educ+ins+smoke),design= subset(des, race_eth=="nh multirace"), family=binomial)
## Warning: non-integer #successes in a binomial glm!
fit.logit.hisp<-svyglm(badhealth~(educ+ins+smoke),design= subset(des, hispanic==1), family=binomial)
## Warning: non-integer #successes in a binomial glm!
White only model
summary(fit.logit.white)
##
## Call:
## svyglm(formula = badhealth ~ (educ + ins + smoke), design = subset(des,
## white == 1), family = binomial)
##
## Survey design:
## subset(des, white == 1)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.52011 0.10422 -4.991 6.02e-07 ***
## educ1somehs -0.57210 0.10201 -5.608 2.05e-08 ***
## educ2hsgrad -1.29270 0.09251 -13.973 < 2e-16 ***
## educ3somecol -1.64253 0.09286 -17.688 < 2e-16 ***
## educ4colgrad -2.36821 0.09323 -25.402 < 2e-16 ***
## ins -0.04241 0.05360 -0.791 0.429
## smokeCurrent 0.73902 0.03628 20.371 < 2e-16 ***
## smokeFormer 0.59157 0.03023 19.569 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 0.9945642)
##
## Number of Fisher Scoring iterations: 5
Black only model
summary(fit.logit.black)
##
## Call:
## svyglm(formula = badhealth ~ (educ + ins + smoke), design = subset(des,
## black == 1), family = binomial)
##
## Survey design:
## subset(des, black == 1)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.87473 0.22324 -3.918 8.94e-05 ***
## educ1somehs -0.02020 0.21683 -0.093 0.92579
## educ2hsgrad -0.59577 0.20392 -2.922 0.00349 **
## educ3somecol -0.92213 0.20467 -4.505 6.66e-06 ***
## educ4colgrad -1.70163 0.20795 -8.183 2.92e-16 ***
## ins 0.12787 0.09763 1.310 0.19028
## smokeCurrent 0.59094 0.08059 7.333 2.33e-13 ***
## smokeFormer 0.66819 0.07932 8.424 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 0.9961611)
##
## Number of Fisher Scoring iterations: 4
Hispanic only model
summary(fit.logit.hisp)
##
## Call:
## svyglm(formula = badhealth ~ (educ + ins + smoke), design = subset(des,
## hispanic == 1), family = binomial)
##
## Survey design:
## subset(des, hispanic == 1)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.18094 0.07567 -2.391 0.01680 *
## educ1somehs -0.74659 0.09689 -7.706 1.36e-14 ***
## educ2hsgrad -1.36480 0.08422 -16.206 < 2e-16 ***
## educ3somecol -1.75429 0.09341 -18.781 < 2e-16 ***
## educ4colgrad -2.24096 0.09410 -23.815 < 2e-16 ***
## ins 0.28046 0.06958 4.031 5.57e-05 ***
## smokeCurrent 0.28216 0.08931 3.159 0.00158 **
## smokeFormer 0.36002 0.07731 4.657 3.23e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 0.9720207)
##
## Number of Fisher Scoring iterations: 4
Multi race only model
summary(fit.logit.other2)
##
## Call:
## svyglm(formula = badhealth ~ (educ + ins + smoke), design = subset(des,
## race_eth == "nh multirace"), family = binomial)
##
## Survey design:
## subset(des, race_eth == "nh multirace")
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.6574 0.4507 -1.459 0.14480
## educ1somehs -1.2981 0.4589 -2.829 0.00471 **
## educ2hsgrad -1.1172 0.4270 -2.616 0.00893 **
## educ3somecol -1.1844 0.4227 -2.802 0.00512 **
## educ4colgrad -1.9433 0.4385 -4.432 9.71e-06 ***
## ins 0.1247 0.2311 0.540 0.58946
## smokeCurrent 0.9220 0.2095 4.401 1.12e-05 ***
## smokeFormer 0.5002 0.1989 2.515 0.01195 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1.00517)
##
## Number of Fisher Scoring iterations: 4
Other only model
summary(fit.logit.other1)
##
## Call:
## svyglm(formula = badhealth ~ (educ + ins + smoke), design = subset(des,
## race_eth == "nh other"), family = binomial)
##
## Survey design:
## subset(des, race_eth == "nh other")
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.1416 0.3661 -3.118 0.001829 **
## educ1somehs -0.2260 0.4058 -0.557 0.577574
## educ2hsgrad -0.9352 0.3433 -2.724 0.006458 **
## educ3somecol -1.0311 0.3356 -3.073 0.002130 **
## educ4colgrad -1.8710 0.3338 -5.605 2.16e-08 ***
## ins 0.1079 0.2137 0.505 0.613506
## smokeCurrent 0.6339 0.1746 3.632 0.000284 ***
## smokeFormer 0.6014 0.1630 3.689 0.000227 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 0.9845029)
##
## Number of Fisher Scoring iterations: 5
beta.test<-function(model1, model2, betaname){
s1<-summary(model1)$coef
s2<-summary(model2)$coef
db <- ((s2[rownames(s2)==betaname,1]-s1[rownames(s1)==betaname,1]))^2
sd <-s2[rownames(s2)==betaname,2]^2+s1[rownames(s1)==betaname,2]^2
td <- db/sd
beta1=s1[rownames(s1)==betaname,1]
beta2=s2[rownames(s2)==betaname,1]
pv<-1- pchisq(td, df = 1)
print(list(beta=betaname,beta1=beta1, beta2=beta2, x2=td, pvalue=pv))
}
Here is an example of testing if the “Current Smoking” effect is the same among whites and blacks. This follows the logic set forth in Allison 2010, p 219
Test for \(\beta_{1j} = \beta_{1k}\) in two models \(j \text{ and } k\) \[z= \frac{\beta_{1j} - \beta_{1k}}{\left[ s.e.(\beta_{1j}) \right]^2+\left[ s.e.(\beta_{1k}) \right]^2}\] compare the \(|z|\) test to a normal distribution for the p value.
beta.test(fit.logit.white, fit.logit.black, "smokeCurrent")
## $beta
## [1] "smokeCurrent"
##
## $beta1
## [1] 0.7390187
##
## $beta2
## [1] 0.5909446
##
## $x2
## [1] 2.807394
##
## $pvalue
## [1] 0.09383066
Which suggests that the effect of current smoking is the same in the two groups.
Here we also examine the equality of the college education effect, this time between Hispanics and non hispanic multirace
beta.test(fit.logit.hisp, fit.logit.other2, "educ4colgrad")
## $beta
## [1] "educ4colgrad"
##
## $beta1
## [1] -2.240955
##
## $beta2
## [1] -1.943324
##
## $x2
## [1] 0.440408
##
## $pvalue
## [1] 0.5069256
Which suggests the effect of college education is the same in these two groups.
If we do this for all our beta’s in our model, this is referred to as a Chow test, following Chow, 1960. This is a test for equality of regression effects in different regression models. It is a ‘global test’, meaning that it tests for equality of all regression effects, as opposed to the test above, which considers one at a time.
#construct a test of whether the betas are the same for each race group
#See Allison p 217 for this
#deviance from total model
d1<-fit.unrestricted$deviance
#sum of deviances from cause-specific models
otherds<- (fit.logit.white$deviance+ fit.logit.black$deviance+ fit.logit.other1$deviance+fit.logit.other2$deviance+fit.logit.hisp$deviance)
#Chow test
test<- d1-otherds
df<-(length(coef(fit.logit.white))*5)-length(coef(fit.unrestricted))
#print the test results
print(list(chowtest=test, df=df,pval= pchisq(test, df=df, lower=F)))
## $chowtest
## [1] 10460.98
##
## $df
## [1] 32
##
## $pval
## [1] 0
Which is our Chow test, and the result means that not all races have the same effects of education and health behaviors.