Last time, we saw the logistic regression model:
If our outcome is dichtomous (1/0), the natural distribution to consider for a GLM is the binomial \[y \sim \ \text{Binomial}\binom{n}{p}\] with \(p\) being the mean of the binomial, and n being the number of trials, generally when you have individual data, n is always 1, and p is the probability of observing the 1, conditional on the observed predictors. There are two common techniques to estimate the mean, logistic and probit regression. In a Logistic model, the link function is the inverse logit function, or
\(\text{Logit}^{-1}(p) =log \frac{p}{(1-p)}\)
Which gives us the following conditional mean model:
\[E(y|x) = \frac{1}{1+ exp({-\sum_k \beta_k x_{ik}})}\] Which situates the model within the logistic distribution function. Expressing p as a linear model is done via this log odds transformation of the probability:
\[log \frac{p}{(1-p)} = \sum_k \beta_k x_{ik}\]
Often in the literature, we will see models stratified by some predictor. This is usually because a specific hypothesis is stated regarding how the effect of a particular predictor varies by some categoricial variable. In this case, we may be interested in considering if education or race universally affects the migration outcome. We get at this by stratifying our analysis by race (in this example).
The easiest way to do this is to subset the data by race and run the models separately.
The first thing we do is test for the interaction of education and race. If this interaction is not significant, we have no justification for proceeding, becuase the effect of education does not vary by race group. This is the test of parallel slopes, a’ la the ANCOVA model
library(haven)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(stargazer)
##
## Please cite as:
## Hlavac, Marek (2015). stargazer: Well-Formatted Regression and Summary Statistics Tables.
## R package version 5.2. http://CRAN.R-project.org/package=stargazer
library(ggplot2)
library(knitr)
ipums<-read_dta("https://github.com/coreysparks/data/blob/master/usa_00045.dta?raw=true")
newpums<-ipums%>%
filter(labforce==2, age>=18, incwage>0, relate==1)%>%
mutate(mywage= ifelse(incwage%in%c(999998,999999), NA,incwage),
unemp=ifelse(empstat==2,1,0), #unemployment
mig=as.factor(migrate1), #moved in past year
sexrecode=ifelse(sex==1, "male", "female"))%>%
mutate(race_eth = case_when(.$hispan %in% c(1:4) & .$race %in%c(1:9) ~ "hispanic",
.$hispan ==0 & .$race==1 ~"0nh_white",
.$hispan ==0 & .$race==2 ~"nh_black",
.$hispan ==0 & .$race%in%c(3,7,8,9) ~"nh_other",
.$hispan ==0 & .$race%in%c(4:6) ~"nh_asian",
.$hispan==9 ~ "missing"))%>%
mutate(edurec = as.factor(case_when(.$educd %in% c(0:61)~"nohs",
.$educd %in% c(62:64)~"hs",
.$educd %in% c(65:100)~"somecoll",
.$educd %in% c(101:116)~"collgrad",
.$educd ==999 ~ "missing")) )
In genearl, we need to compare the model with race and education as independent variables to the model where they interact with one another. We can do this with a F-test for a linear model, but for a glm we need to use a \(\chi^2\) test.
#Logit model
fit.logit1<-glm(mig~race_eth+edurec+sexrecode+scale(age)+scale(I(age^2)),data=newpums, family=binomial)
fit.logit2<-glm(mig~race_eth*edurec+sexrecode+scale(age)+scale(I(age^2)),data=newpums, family=binomial)
anova(fit.logit2, test="Chisq" )
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: mig
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 68041 54102
## race_eth 4 119.4 68037 53982 < 2e-16 ***
## edurec 3 119.1 68034 53863 < 2e-16 ***
## sexrecode 1 0.3 68033 53863 0.60132
## scale(age) 1 5642.7 68032 48220 < 2e-16 ***
## scale(I(age^2)) 1 480.7 68031 47740 < 2e-16 ***
## race_eth:edurec 12 21.6 68019 47718 0.04215 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
In this case, we see a significant interaction between race and education. So we can conclude that education does not affect the liklihood of migration equally for all race/ethnicities in our data. This gives us some statistical justificaiton to stratify the analysis by race.
This is done by using the subset
option in most modeling fuctions in R.
#nhwhites
fit_w<-glm(mig~edurec+sexrecode+scale(age)+scale(I(age^2)),data=newpums, family=binomial, subset= race_eth=="0nh_white")
#nhblacks
fit_b<-glm(mig~edurec+sexrecode+scale(age)+scale(I(age^2)),data=newpums, family=binomial, subset= race_eth=="nh_black")
#hispanics
fit_h<-glm(mig~edurec+sexrecode+scale(age)+scale(I(age^2)),data=newpums, family=binomial, subset= race_eth=="hispanic")
#asians
fit_a<-glm(mig~edurec+sexrecode+scale(age)+scale(I(age^2)),data=newpums, family=binomial, subset= race_eth=="nh_asian")
#other
fit_o<-glm(mig~edurec+sexrecode+scale(age)+scale(I(age^2)),data=newpums, family=binomial, subset= race_eth=="nh_other")
stargazer(fit_w,fit_b,fit_h,fit_a,fit_o, type = "html", style = "demography", ci = T, model.names = T, column.labels = c("white" , "black", "hispanic", "asian", "other"),keep.stat = c("n"))
mig | |||||
logistic | |||||
white | black | hispanic | asian | other | |
Model 1 | Model 2 | Model 3 | Model 4 | Model 5 | |
edurechs | -0.207*** | 0.011 | -0.391*** | -0.140 | -0.133 |
(-0.287, -0.126) | (-0.182, 0.204) | (-0.578, -0.204) | (-0.490, 0.209) | (-0.497, 0.231) | |
edurecnohs | -0.037 | 0.098 | -0.462*** | -0.169 | -0.215 |
(-0.199, 0.125) | (-0.208, 0.404) | (-0.660, -0.265) | (-0.624, 0.286) | (-0.870, 0.440) | |
edurecsomecoll | -0.077* | 0.096 | -0.235** | -0.322* | -0.196 |
(-0.142, -0.012) | (-0.074, 0.267) | (-0.409, -0.062) | (-0.587, -0.056) | (-0.506, 0.114) | |
sexrecodemale | 0.032 | 0.096 | 0.054 | 0.224* | 0.310* |
(-0.026, 0.089) | (-0.046, 0.239) | (-0.078, 0.185) | (0.026, 0.421) | (0.041, 0.578) | |
scale(age) | -2.846*** | -1.539*** | -2.423*** | -2.136*** | -2.188*** |
(-3.013, -2.679) | (-2.023, -1.055) | (-2.852, -1.995) | (-2.897, -1.375) | (-3.105, -1.271) | |
scale(I(age2)) | 2.080*** | 0.808** | 1.793*** | 1.296** | 1.404** |
(1.904, 2.256) | (0.272, 1.343) | (1.316, 2.270) | (0.451, 2.141) | (0.372, 2.436) | |
Constant | -2.109*** | -2.004*** | -1.866*** | -1.983*** | -1.980*** |
(-2.163, -2.055) | (-2.152, -1.857) | (-2.019, -1.713) | (-2.181, -1.785) | (-2.268, -1.693) | |
N | 48,334 | 6,610 | 7,976 | 3,481 | 1,641 |
p < .05; p < .01; p < .001 |
These models show the exact same patterns of effects, with Hispanics, blacks and multi-race individuals showing increased chances of reporting poor/fair health, when compared to whites (Reference group). Similarly, the education variables shows a negative linear trend, with those with more education having lower chances of reporting poor/fair health compared to those with a primary school education (Reference group), and likewise, as people get older, they are more likely to report poor/fair health, compared to those under age 24 (Reference group).
beta.test<-function(model1, model2, betaname){
s1<-summary(model1)$coef
s2<-summary(model2)$coef
db <- ((s2[rownames(s2)==betaname,1]-s1[rownames(s1)==betaname,1]))^2
sd <-s2[rownames(s2)==betaname,2]^2+s1[rownames(s1)==betaname,2]^2
td <- db/sd
beta1=s1[rownames(s1)==betaname,1]
beta2=s2[rownames(s2)==betaname,1]
pv<-1- pchisq(td, df = 1)
print(list(beta=betaname,beta1=beta1, beta2=beta2, x2=td, pvalue=pv))
}
Here is an example of testing if the “high school” effect is the same among whites and blacks. This follows the logic set forth in Allison 2010, p 219
Test for \(\beta_{1j} = \beta_{1k}\) in two models \(j \text{ and } k\) \[\chi^2= \frac{(\beta_{1j} - \beta_{1k})^2}{\left[ s.e.(\beta_{1j}) \right]^2+\left[ s.e.(\beta_{1k}) \right]^2}\]
Where you have beta’s (the same regression effect), in two different models, and you want to see if they are equal.
beta.test(fit_w, fit_b, betaname = "edurechs")
## $beta
## [1] "edurechs"
##
## $beta1
## [1] -0.2069903
##
## $beta2
## [1] 0.0106681
##
## $x2
## [1] 4.165079
##
## $pvalue
## [1] 0.04126549
Which looks like the High school education effect is not equal between blacks and whites, for whites, the effect is negative and statistically significant, and blacks the effect is positive and not significant.
If we do this for all our beta’s in our model, this is referred to as a Chow test, following Chow, 1960. This is a test for equality of regression effects in different regression models. It is a ‘global test’, meaning that it tests for equality of all regression effects, as opposed to the test above, which considers one at a time.
#construct a test of whether the betas are the same for each race group
#See Allison p 217 for this
#deviance from total model, the one with all subjects in it
d1<-fit.logit1$deviance
#sum of deviances from stratified models
otherds<- (fit_w$deviance+ fit_b$deviance+ fit_h$deviance+fit_a$deviance+fit_o$deviance)
#Chow test
test<- d1-otherds
df<-(length(coef(fit_w))*5)-length(coef(fit.logit1))
#print the test results
print(list(chowtest=test, df=df,pval= pchisq(test, df=df, lower=F)))
## $chowtest
## [1] 79.52261
##
## $df
## [1] 24
##
## $pval
## [1] 7.247417e-08
Which is our Chow test, and the result means that not all races have the same effects of of education, sex or age. This is a global test, so if you want to see which coefficients are different, use the coefficient tests above.