Last time, we saw the logistic regression model:

If our outcome is dichtomous (1/0), the natural distribution to consider for a GLM is the binomial \[y \sim \ \text{Binomial}\binom{n}{p}\] with \(p\) being the mean of the binomial, and n being the number of trials, generally when you have individual data, n is always 1, and p is the probability of observing the 1, conditional on the observed predictors. There are two common techniques to estimate the mean, logistic and probit regression. In a Logistic model, the link function is the inverse logit function, or

\(\text{Logit}^{-1}(p) =log \frac{p}{(1-p)}\)

Which gives us the following conditional mean model:

\[E(y|x) = \frac{1}{1+ exp({-\sum_k \beta_k x_{ik}})}\] Which situates the model within the logistic distribution function. Expressing p as a linear model is done via this log odds transformation of the probability:

\[log \frac{p}{(1-p)} = \sum_k \beta_k x_{ik}\]

Stratified models

Often in the literature, we will see models stratified by some predictor. This is usually because a specific hypothesis is stated regarding how the effect of a particular predictor varies by some categoricial variable. In this case, we may be interested in considering if education or race universally affects the migration outcome. We get at this by stratifying our analysis by race (in this example).

The easiest way to do this is to subset the data by race and run the models separately.

The first thing we do is test for the interaction of education and race. If this interaction is not significant, we have no justification for proceeding, becuase the effect of education does not vary by race group. This is the test of parallel slopes, a’ la the ANCOVA model

library(haven)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(stargazer)

## 
## Please cite as:

##  Hlavac, Marek (2015). stargazer: Well-Formatted Regression and Summary Statistics Tables.

##  R package version 5.2. http://CRAN.R-project.org/package=stargazer

library(ggplot2)
library(knitr)

ipums<-read_dta("https://github.com/coreysparks/data/blob/master/usa_00045.dta?raw=true")

newpums<-ipums%>%
  filter(labforce==2, age>=18, incwage>0, relate==1)%>%
  mutate(mywage= ifelse(incwage%in%c(999998,999999), NA,incwage),
         unemp=ifelse(empstat==2,1,0), #unemployment
         mig=as.factor(migrate1), #moved in past year
         sexrecode=ifelse(sex==1, "male", "female"))%>%
  mutate(race_eth = case_when(.$hispan %in% c(1:4) & .$race %in%c(1:9) ~ "hispanic", 
                          .$hispan ==0 & .$race==1 ~"0nh_white",
                         .$hispan ==0 & .$race==2 ~"nh_black",
                         .$hispan ==0 & .$race%in%c(3,7,8,9) ~"nh_other",
                         .$hispan ==0 & .$race%in%c(4:6) ~"nh_asian",
                          .$hispan==9 ~ "missing"))%>%
  mutate(edurec = as.factor(case_when(.$educd %in% c(0:61)~"nohs", 
                            .$educd %in% c(62:64)~"hs",
                            .$educd %in% c(65:100)~"somecoll",
                            .$educd %in% c(101:116)~"collgrad", 
                            .$educd ==999 ~ "missing")) )

fitting the models

In genearl, we need to compare the model with race and education as independent variables to the model where they interact with one another. We can do this with a F-test for a linear model, but for a glm we need to use a \(\chi^2\) test.

#Logit model
fit.logit1<-glm(mig~race_eth+edurec+sexrecode+scale(age)+scale(I(age^2)),data=newpums, family=binomial)

fit.logit2<-glm(mig~race_eth*edurec+sexrecode+scale(age)+scale(I(age^2)),data=newpums, family=binomial)

anova(fit.logit2, test="Chisq" )

## Analysis of Deviance Table
## 
## Model: binomial, link: logit
## 
## Response: mig
## 
## Terms added sequentially (first to last)
## 
## 
##                 Df Deviance Resid. Df Resid. Dev Pr(>Chi)    
## NULL                            68041      54102             
## race_eth         4    119.4     68037      53982  < 2e-16 ***
## edurec           3    119.1     68034      53863  < 2e-16 ***
## sexrecode        1      0.3     68033      53863  0.60132    
## scale(age)       1   5642.7     68032      48220  < 2e-16 ***
## scale(I(age^2))  1    480.7     68031      47740  < 2e-16 ***
## race_eth:edurec 12     21.6     68019      47718  0.04215 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

In this case, we see a significant interaction between race and education. So we can conclude that education does not affect the liklihood of migration equally for all race/ethnicities in our data. This gives us some statistical justificaiton to stratify the analysis by race.

Stratifying the models by a categorical variable

This is done by using the subset option in most modeling fuctions in R.

#nhwhites
fit_w<-glm(mig~edurec+sexrecode+scale(age)+scale(I(age^2)),data=newpums, family=binomial, subset= race_eth=="0nh_white")

#nhblacks
fit_b<-glm(mig~edurec+sexrecode+scale(age)+scale(I(age^2)),data=newpums, family=binomial, subset= race_eth=="nh_black")

#hispanics
fit_h<-glm(mig~edurec+sexrecode+scale(age)+scale(I(age^2)),data=newpums, family=binomial, subset= race_eth=="hispanic")

#asians
fit_a<-glm(mig~edurec+sexrecode+scale(age)+scale(I(age^2)),data=newpums, family=binomial, subset= race_eth=="nh_asian")


#other
fit_o<-glm(mig~edurec+sexrecode+scale(age)+scale(I(age^2)),data=newpums, family=binomial, subset= race_eth=="nh_other")

Present all model coefficients next to one another

stargazer(fit_w,fit_b,fit_h,fit_a,fit_o, type = "html", style = "demography", ci = T,  model.names = T, column.labels = c("white" , "black", "hispanic", "asian", "other"),keep.stat = c("n"))


	mig
	logistic
	white	black	hispanic	asian	other
	Model 1	Model 2	Model 3	Model 4	Model 5

edurechs	-0.207^***	0.011	-0.391^***	-0.140	-0.133
	(-0.287, -0.126)	(-0.182, 0.204)	(-0.578, -0.204)	(-0.490, 0.209)	(-0.497, 0.231)
edurecnohs	-0.037	0.098	-0.462^***	-0.169	-0.215
	(-0.199, 0.125)	(-0.208, 0.404)	(-0.660, -0.265)	(-0.624, 0.286)	(-0.870, 0.440)
edurecsomecoll	-0.077^*	0.096	-0.235^**	-0.322^*	-0.196
	(-0.142, -0.012)	(-0.074, 0.267)	(-0.409, -0.062)	(-0.587, -0.056)	(-0.506, 0.114)
sexrecodemale	0.032	0.096	0.054	0.224^*	0.310^*
	(-0.026, 0.089)	(-0.046, 0.239)	(-0.078, 0.185)	(0.026, 0.421)	(0.041, 0.578)
scale(age)	-2.846^***	-1.539^***	-2.423^***	-2.136^***	-2.188^***
	(-3.013, -2.679)	(-2.023, -1.055)	(-2.852, -1.995)	(-2.897, -1.375)	(-3.105, -1.271)
scale(I(age2))	2.080^***	0.808^**	1.793^***	1.296^**	1.404^**
	(1.904, 2.256)	(0.272, 1.343)	(1.316, 2.270)	(0.451, 2.141)	(0.372, 2.436)
Constant	-2.109^***	-2.004^***	-1.866^***	-1.983^***	-1.980^***
	(-2.163, -2.055)	(-2.152, -1.857)	(-2.019, -1.713)	(-2.181, -1.785)	(-2.268, -1.693)
N	48,334	6,610	7,976	3,481	1,641

p < .05; p < .01; p < .001

These models show the exact same patterns of effects, with Hispanics, blacks and multi-race individuals showing increased chances of reporting poor/fair health, when compared to whites (Reference group). Similarly, the education variables shows a negative linear trend, with those with more education having lower chances of reporting poor/fair health compared to those with a primary school education (Reference group), and likewise, as people get older, they are more likely to report poor/fair health, compared to those under age 24 (Reference group).

Homogeneity of effects across models

beta.test<-function(model1, model2, betaname){
s1<-summary(model1)$coef
s2<-summary(model2)$coef
db <- ((s2[rownames(s2)==betaname,1]-s1[rownames(s1)==betaname,1]))^2
sd <-s2[rownames(s2)==betaname,2]^2+s1[rownames(s1)==betaname,2]^2
td <- db/sd
beta1=s1[rownames(s1)==betaname,1]
beta2=s2[rownames(s2)==betaname,1]
pv<-1- pchisq(td, df = 1)
print(list(beta=betaname,beta1=beta1, beta2=beta2, x2=td, pvalue=pv))
}

Here is an example of testing if the “high school” effect is the same among whites and blacks. This follows the logic set forth in Allison 2010, p 219

Test for \(\beta_{1j} = \beta_{1k}\) in two models \(j \text{ and } k\) \[\chi^2= \frac{(\beta_{1j} - \beta_{1k})^2}{\left[ s.e.(\beta_{1j}) \right]^2+\left[ s.e.(\beta_{1k}) \right]^2}\]

Where you have beta’s (the same regression effect), in two different models, and you want to see if they are equal.

beta.test(fit_w, fit_b, betaname = "edurechs")

## $beta
## [1] "edurechs"
## 
## $beta1
## [1] -0.2069903
## 
## $beta2
## [1] 0.0106681
## 
## $x2
## [1] 4.165079
## 
## $pvalue
## [1] 0.04126549

Which looks like the High school education effect is not equal between blacks and whites, for whites, the effect is negative and statistically significant, and blacks the effect is positive and not significant.

Chow tests

If we do this for all our beta’s in our model, this is referred to as a Chow test, following Chow, 1960. This is a test for equality of regression effects in different regression models. It is a ‘global test’, meaning that it tests for equality of all regression effects, as opposed to the test above, which considers one at a time.

#construct a test of whether the betas are the same for each race group
#See Allison p 217 for this

#deviance from total model, the one with all subjects in it
d1<-fit.logit1$deviance

#sum of deviances from stratified models
otherds<- (fit_w$deviance+ fit_b$deviance+ fit_h$deviance+fit_a$deviance+fit_o$deviance)

#Chow test
test<- d1-otherds
df<-(length(coef(fit_w))*5)-length(coef(fit.logit1))
#print the test results
print(list(chowtest=test, df=df,pval= pchisq(test, df=df, lower=F)))

## $chowtest
## [1] 79.52261
## 
## $df
## [1] 24
## 
## $pval
## [1] 7.247417e-08

Which is our Chow test, and the result means that not all races have the same effects of of education, sex or age. This is a global test, so if you want to see which coefficients are different, use the coefficient tests above.

DEM 7273 - Logit Models Part 2 - Model stratification

Corey Sparks, PhD

November 29, 2017