For the following task, I am interested in looking at whether mortality as a binary outcome is affected by educational attainment and race/ethnicity which will be treated as categorical variables.

Research Questions: I am interested in answering the following couple of questoins: 1. Does mortality vary with increased educational attainment? 2. Does mortality vary across racial and ethinic population? 3. Does mortality vary with health behavior status, such as smoking?

To address these two questions, I am going to use NHIS-LMF 1999-2009 dataset. However, it is a large dataset which is difficult to manage in regular computing capacity. Hence, I will be using a subset of around 20,000 samples from the total population.

Recoding Variables:

Mortality, as the main outcome variable, will be recoded as a binary outcome. That means, whether someone is dead or not. Education, as one of the moderators, will be categorized into three groups: High School or Less, Some College, and College. Race/Ethnicity, as another moderator, will be categorized into four groups: NHWhite, NHBlack, Hispanic, and NHother. Age is recoded to break into the following age groups: 18-24, 24-35, 35-45, 45-65, 65+. Smoking status is recoded to three groups: Current Smoker, Former Smoker and Never Smoked.

Loading Libraries

library(readr)
library(car)
library(stargazer)

## 
## Please cite as:

##  Hlavac, Marek (2018). stargazer: Well-Formatted Regression and Summary Statistics Tables.

##  R package version 5.2.1. https://CRAN.R-project.org/package=stargazer

library(survey)

## Loading required package: grid

## Loading required package: Matrix

## Loading required package: survival

## 
## Attaching package: 'survey'

## The following object is masked from 'package:graphics':
## 
##     dotchart

library(survival)
library(cmprsk)
library(questionr)
library(haven)
library(tidyr)

## 
## Attaching package: 'tidyr'

## The following object is masked from 'package:Matrix':
## 
##     expand

Reading data

mydata <- read_dta("C://Users/munta/Google Drive/NHIS/nhis_00005.dta")

Taking a Subset of the Population and Recoding Variables

#Taking a subset of the total observations
sub<-subset(mydata, mydata$mortelig==1&is.na(mydata$racenew)==F)

samps<-sample(1:length(sub$psu), size = 20000, replace = F)
sub<-sub[samps,] #Using a subset of 20000 samples

#Age
sub$d.age<-ifelse(sub$mortstat==1, sub$mortdody-(sub$year-sub$age) ,
                  ifelse(sub$mortstat==2, 2006-(sub$year-sub$age), NA))
#Mortality Status
sub$d.event<-ifelse(sub$mortstat==1,1,0)
sub$timetodeath<-ifelse(sub$mortstat ==1, sub$mortdody-sub$year  , 2006 - sub$year )
sub$d5yr<-ifelse(sub$timetodeath<=5&sub$mortstat==1, 1,0)

#Marital Status
sub$married<-recode(sub$marstat, recodes="00=NA; 10:13='married'; 20:40='sep'; 50='nm'; 99=NA" ,as.factor.result=T )
sub$sep<-ifelse(sub$married=='sep',1,0)
sub$nm<-ifelse(sub$married=='nm',1,0)

#Smoking Status
sub$smoke<-recode(sub$smokestatus2, recodes="00=NA; 10:13='CurrentSmoker'; 20='FormerSmoker'; 30='NeverSmoked'; else=NA", as.factor.result=T)
sub$smoke<-relevel(sub$smoke, ref = "NeverSmoked")

sub$male<-ifelse(sub$sex==1,1,0)

sub$mwt<-sub$mortwt/mean(sub$mortwt, na.rm=T)

#Age cut into intervals
sub$agere<-cut(sub$age, breaks=c(15,18,24,35,45,65,85))

#Education 
sub$moredu<-recode(sub$educrec2, recodes="00=NA; 10:42='High School or Less'; 50:53='Some College'; 54:60='College'; else=NA", as.factor.result=T)

sub$hs<-ifelse(sub$moredu=='hs or less',1,0)
sub$scol1<-ifelse(sub$moredu=='some coll',1,0)

#Race/Ethnicity
sub$race<-recode(sub$racenew, recodes ="10='wht'; 20 ='blk'; 30:61='other'; 97:99=NA", as.factor.result=T)
sub$black<-ifelse(sub$race=='blk',1,0)
sub$oth<-ifelse(sub$race=='other',1,0)

sub$hisp<-recode(sub$hispyn, recodes="1=0; 2=1; else=NA")

sub$race_eth[sub$hisp == 0 & sub$race=="wht"]<-"NHWhite"

## Warning: Unknown or uninitialised column: 'race_eth'.

sub$race_eth[sub$hisp == 0 & sub$race=="blk"]<-"NHBlack"
sub$race_eth[sub$hisp == 0 & sub$race=="other"]<-"NHOther"
sub$race_eth[sub$hisp == 1 ]<-"Hispanic"
sub$race_eth[is.na(sub$hisp) ==T | is.na(sub$race)==T]<-NA

Analysis

Descriptive analysis: means and cross tabulations

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following object is masked from 'package:car':
## 
##     recode

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

#Keeping only clean cases
sub2<-sub%>%
  select(mortstat, agere,race_eth, married, moredu, race, black, hisp, oth, smoke, mortwt, strata, psu) %>%
  filter( complete.cases(.))

#Survey Design
options(survey.lonely.psu="adjust")
des<-svydesign(ids=~psu, strata=~strata, weights = ~mortwt, data=sub[sub$mortwt>0,], nest=T)

We can examine the % of US adults alive by education level, and do a survey-corrected chi-square test for independence.

#Chi-square test
library(ggplot2)
ed<-svyby(formula = ~mortstat, by = ~moredu, design = des, FUN = svymean, na.rm=T)
svychisq(~mortstat+moredu, design = des)

## 
##  Pearson's X^2: Rao & Scott adjustment
## 
## data:  svychisq(~mortstat + moredu, design = des)
## F = 106.68, ndf = 1.9977, ddf = 1276.6000, p-value < 2.2e-16

The output is really big. We can also look at the following plot for a better pciture.

#education*mortality cross tabulation
qplot(x=ed$moredu,y=ed$mortstat, data=ed ,xlab="Education", ylab="% Alive" )+
geom_errorbar(aes(x=moredu, ymin=mortstat-se,ymax= mortstat+se), width=.25)+
ggtitle(label = "% of US Adults Alive by Education")

The qplot shows that with increaseing education the perentage of being alive increases.

#race*mortality cross tabulation
rac<-svyby(formula = ~mortstat, by = ~race_eth, design = des, FUN = svymean, na.rm=T)
svychisq(~mortstat+race_eth, design = des)

## 
##  Pearson's X^2: Rao & Scott adjustment
## 
## data:  svychisq(~mortstat + race_eth, design = des)
## F = 21.958, ndf = 2.7682, ddf = 1768.9000, p-value = 4.397e-13

The F static is smaller than what we got from the education and mortality crosstabulation.

qplot(x=rac$race_eth,y=rac$mortstat, data=rac ,xlab="Race", ylab="%  Alive" )+
geom_errorbar(aes(x=race_eth, ymin=mortstat-se,ymax= mortstat+se), width=.25)+
ggtitle(label = "% of US Adults Alive by Race/Ethnicity")

With respect to racial and ethinic population and their chances of being alive, the Hispanics enjoy the best outcome, with the NHOthers not faring behind. The NHWhites and NHBlacks are at about the same level, with NHBlacks have wider error bar, which means they have wider lifespan variability.

The following plot gives us a comprehensive picture of the outcome so far.

Bivariate Means

#race*education*health cross tabulation
edrac<-svyby(formula = ~mortstat, by = ~race_eth+moredu, design = des, FUN = svymean, na.rm=T)
edrac

##                              race_eth              moredu mortstat
## Hispanic.College             Hispanic             College 1.981437
## NHBlack.College               NHBlack             College 1.968185
## NHOther.College               NHOther             College 1.965674
## NHWhite.College               NHWhite             College 1.961475
## Hispanic.High School or Less Hispanic High School or Less 1.954621
## NHBlack.High School or Less   NHBlack High School or Less 1.894406
## NHOther.High School or Less   NHOther High School or Less 1.947244
## NHWhite.High School or Less   NHWhite High School or Less 1.882310
## Hispanic.Some College        Hispanic        Some College 1.973558
## NHBlack.Some College          NHBlack        Some College 1.969867
## NHOther.Some College          NHOther        Some College 1.969355
## NHWhite.Some College          NHWhite        Some College 1.943417
##                                       se
## Hispanic.College             0.006959276
## NHBlack.College              0.010239678
## NHOther.College              0.009600790
## NHWhite.College              0.003303336
## Hispanic.High School or Less 0.004491775
## NHBlack.High School or Less  0.007811621
## NHOther.High School or Less  0.011622430
## NHWhite.High School or Less  0.004478658
## Hispanic.Some College        0.006042576
## NHBlack.Some College         0.006358586
## NHOther.Some College         0.008548979
## NHWhite.Some College         0.004320557

The bivariate means also show the same picture what we have seen above; college educated population has a higher likelihood of being alive.

Plot

p<-ggplot(edrac, aes(moredu, mortstat,),xlab="Race", ylab="% Alive")
p<-p+geom_point(aes(colour=race_eth))
p<-p+geom_line(aes(colour=race_eth,group=race_eth))
p<-p+ylab("% Alive")
p<-p+xlab("Education Level")
p+ggtitle("% of US Adults in 2011 in Alive Race and Education")

The plot above shows the importance of education on the percentage of being alive among the US population across different race and ethnicities. Hispanics enjoy the mortality advantage even though their lower socioeconomic satus compared the non-Hispanic Whites.

Logit/Probit Regression

#Logit model
fit.logit<-svyglm(I(mortstat==1)~race_eth+moredu+agere, design=des, family=binomial)

## Warning in eval(family$initialize): non-integer #successes in a binomial
## glm!

#probit model
fit.probit<-svyglm(I(mortstat==1)~race_eth+moredu+agere, design=des, family=binomial(link= "probit"))

## Warning in eval(family$initialize): non-integer #successes in a binomial
## glm!

Both model coefficients next to one another

expcoeff<- function(x) (exp(x))

stargazer(fit.logit, fit.probit, apply.coef = expcoeff, style="demography", type = "html",
          title = "Logit/Probit models for Mortality Status by Race/Ethnicty and Education using survey data - NHIS2009",
          covariate.labels=c("Hispanic", "NHBlack","NHOther" , "High School or Less", "Some College", "Age 24-35", "Age 35-45" ,"Age 45-65", "Age 65+"), keep.stat="n", model.names=F, align=T, ci=T)

**Logit/Probit models for Mortality Status by Race/Ethnicty and Education using survey data - NHIS2009**

	I(mortstat == 1)
	Model 1	Model 2

Hispanic	1.549^***	1.236^***
	(1.274, 1.824)	(1.101, 1.371)
NHBlack	1.142^***	1.120^***
	(0.696, 1.587)	(0.909, 1.331)
NHOther	1.372^***	1.150^***
	(1.163, 1.581)	(1.048, 1.252)
High School or Less	2.216^***	1.497^***
	(2.037, 2.394)	(1.410, 1.583)
Some College	1.459^***	1.211^***
	(1.244, 1.674)	(1.106, 1.316)
Age 24-35	1.041	1.019^***
	(-0.176, 2.259)	(0.585, 1.453)
Age 35-45	1.469^*	1.181^***
	(0.308, 2.630)	(0.765, 1.596)
Age 45-65	3.287^***	1.606^***
	(2.204, 4.369)	(1.218, 1.994)
Age 65+	9.062^***	2.501^***
	(7.961, 10.163)	(2.106, 2.895)
agere(65,85]	63.099^***	7.143^***
	(61.997, 64.201)	(6.747, 7.540)
Constant	0.003	0.054
	(-1.121, 1.127)	(-0.358, 0.466)
N	19,547	19,547

p < .05; p < .01; p < .001

Both of these models show the exact same patterns of effects, with Hispanics, and non-Hispanic Others individuals showing increased chances of being alive, when compared to non-Hispanic whites (Reference group). Similarly, the education variables shows a negative linear trend, with those with less education having less chances of being alive compared to those with a college education (Reference group); aslo, as people get older, their chances of being alive decreases, compared to those under age 24 (Reference group).

Marginal effects

#Logit marginal effects
log.marg<-coef(fit.logit)*mean(dlogis(predict(fit.logit)), na.rm=T) #Logisitic Distribution

#for probit now
prob.marg<-coef(fit.probit)*mean(dnorm(predict(fit.probit)), na.rm=T) #Normal Distribution

plot(log.marg[-1], ylab="Marginal Effects", axes=T, xaxt="n", main="Marginal Effects from Logit and Probit models", ylim=c(-.25, .2))
axis(side=1, at=1:13, labels=F)
text(x=1:13, y=-.3,  srt = 45, pos = 1, xpd = TRUE,
     labels = c( "Hispanic", "NHBlack","NHOther" ,"NHWhite", "College","High School or Less", "Some College", "Age 18-24", "Age 24-35", "Age 35-45" ,"Age 45-65", "Age 65+"))
points(prob.marg[-1], col=2)
abline(h=0, col=3)
legend("bottomright", legend=c("Logit Model", "Probit Model"), col=c("black", "red"),pch=1)

The above plot shows that the marginal effects are very similar between the two models, except in the age groups 18-24 and 24-25. We can coax these into a table like:

data.frame(m.logit=log.marg, m.probit=prob.marg)

##                                m.logit    m.probit
## (Intercept)               -0.315573038 -0.30006099
## race_ethNHBlack            0.023520786  0.02179510
## race_ethNHOther            0.007120606  0.01164322
## race_ethNHWhite            0.016996375  0.01438507
## moreduHigh School or Less  0.042767546  0.04152832
## moreduSome College         0.020318024  0.01971461
## agere(18,24]               0.002179014  0.00194973
## agere(24,35]               0.020678019  0.01709377
## agere(35,45]               0.063966434  0.04880570
## agere(45,65]               0.118494681  0.09439185
## agere(65,85]               0.222824074  0.20247233

Fitted Values

We can generate a bunch of “fake people” that have variability in the model covariates, and fit the model for each type of person.

dat<-expand.grid(race_eth=levels(factor(sub$race_eth)), moredu=levels(factor(sub$moredu)), agere=levels(factor(sub$agere))) ##"Factor" for changing the class

#fitted values
fit<-predict(fit.logit, newdata=dat, type="response")
fitp<-predict(fit.probit, newdata=dat, type="response")

#add the values to the fake data
dat$fitted.prob.lrm<-round(fit, 3)
dat$fitted.prob.pro<-round(fitp, 3)

#Print the fitted probabilities for the first 20 cases
head(dat, n=20)

##    race_eth              moredu   agere fitted.prob.lrm fitted.prob.pro
## 1  Hispanic             College (15,18]           0.003           0.002
## 2   NHBlack             College (15,18]           0.004           0.003
## 3   NHOther             College (15,18]           0.003           0.003
## 4   NHWhite             College (15,18]           0.004           0.003
## 5  Hispanic High School or Less (15,18]           0.006           0.006
## 6   NHBlack High School or Less (15,18]           0.010           0.011
## 7   NHOther High School or Less (15,18]           0.007           0.008
## 8   NHWhite High School or Less (15,18]           0.009           0.009
## 9  Hispanic        Some College (15,18]           0.004           0.003
## 10  NHBlack        Some College (15,18]           0.006           0.006
## 11  NHOther        Some College (15,18]           0.005           0.005
## 12  NHWhite        Some College (15,18]           0.006           0.005
## 13 Hispanic             College (18,24]           0.003           0.002
## 14  NHBlack             College (18,24]           0.005           0.004
## 15  NHOther             College (18,24]           0.003           0.003
## 16  NHWhite             College (18,24]           0.004           0.003
## 17 Hispanic High School or Less (18,24]           0.006           0.006
## 18  NHBlack High School or Less (18,24]           0.010           0.011
## 19  NHOther High School or Less (18,24]           0.007           0.009
## 20  NHWhite High School or Less (18,24]           0.009           0.009

It shows us the estimated probabilty of reporting alive or dead for each specified type of “fake person” that we generate. For example, we can look at the probability for a Non-Hispanic white, age 24-35 with a college education, compared to a Hispanic person, age 24-35 with a high school education or less:

dat[which(dat$race_eth=="NHWhite" & dat$agere=="(24,35]" & dat$moredu=="College"),]

##    race_eth  moredu   agere fitted.prob.lrm fitted.prob.pro
## 28  NHWhite College (24,35]           0.006           0.005

dat[which(dat$race_eth=="Hispanic" & dat$agere=="(24,35]" & dat$moredu=="High School or Less"),]

##    race_eth              moredu   agere fitted.prob.lrm fitted.prob.pro
## 29 Hispanic High School or Less (24,35]           0.009            0.01

The first case has an estimated probability of being alive of about 4%, while the second case, with lower education, still show a slightly better,6% probability.

Nested model comparison

fit.logit1<-svyglm(I(mortstat==1)~race_eth,design= des, family=binomial) #race only

## Warning in eval(family$initialize): non-integer #successes in a binomial
## glm!

fit.logit2<-svyglm(I(mortstat==1)~race_eth+moredu,design= des, family=binomial) #race+education

## Warning in eval(family$initialize): non-integer #successes in a binomial
## glm!

fit.logit3<-svyglm(I(mortstat==1)~race_eth+moredu+smoke,design= des, family=binomial)#race+education+health behaviors

## Warning in eval(family$initialize): non-integer #successes in a binomial
## glm!

summary(fit.logit1)

## 
## Call:
## svyglm(formula = I(mortstat == 1) ~ race_eth, design = des, family = binomial)
## 
## Survey design:
## svydesign(ids = ~psu, strata = ~strata, weights = ~mortwt, data = sub[sub$mortwt > 
##     0, ], nest = T)
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -3.22415    0.08958 -35.993  < 2e-16 ***
## race_ethNHBlack  0.64926    0.12506   5.192 2.81e-07 ***
## race_ethNHOther  0.04138    0.20544   0.201     0.84    
## race_ethNHWhite  0.76325    0.09612   7.940 9.16e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1.000201)
## 
## Number of Fisher Scoring iterations: 6

In model 1 we see that Blacks and Whites have a higher odds of being alive, compared to Hispanics; the other category does not seem to be significant.

#controlling for education
summary(fit.logit2)

## 
## Call:
## svyglm(formula = I(mortstat == 1) ~ race_eth + moredu, design = des, 
##     family = binomial)
## 
## Survey design:
## svydesign(ids = ~psu, strata = ~strata, weights = ~mortwt, data = sub[sub$mortwt > 
##     0, ], nest = T)
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               -4.16236    0.12315 -33.800  < 2e-16 ***
## race_ethNHBlack            0.76722    0.12648   6.066 2.26e-09 ***
## race_ethNHOther            0.36436    0.20723   1.758  0.07919 .  
## race_ethNHWhite            0.98693    0.09787  10.084  < 2e-16 ***
## moreduHigh School or Less  1.16285    0.09065  12.827  < 2e-16 ***
## moreduSome College         0.32827    0.10697   3.069  0.00224 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1.007648)
## 
## Number of Fisher Scoring iterations: 6

regTermTest(fit.logit2, test.terms = ~moredu, method="Wald", df = NULL)

## Wald test for moredu
##  in svyglm(formula = I(mortstat == 1) ~ race_eth + moredu, design = des, 
##     family = binomial)
## F =  114.8922  on  2  and  634  df: p= < 2.22e-16

After controlling for education, we see the significant effects of race and ethnicity; so the differences in being alive are considerable once we control for education.

The F test also suggest that the second model fits the data better than the first.

The third model includes health behaviors of current smoking:

summary(fit.logit3)

## 
## Call:
## svyglm(formula = I(mortstat == 1) ~ race_eth + moredu + smoke, 
##     design = des, family = binomial)
## 
## Survey design:
## svydesign(ids = ~psu, strata = ~strata, weights = ~mortwt, data = sub[sub$mortwt > 
##     0, ], nest = T)
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               -4.31905    0.16719 -25.833  < 2e-16 ***
## race_ethNHBlack            0.93445    0.17214   5.429 8.13e-08 ***
## race_ethNHOther            0.66446    0.25401   2.616  0.00911 ** 
## race_ethNHWhite            1.11211    0.13585   8.186 1.52e-15 ***
## moreduHigh School or Less  1.22253    0.13023   9.387  < 2e-16 ***
## moreduSome College         0.42933    0.14360   2.990  0.00290 ** 
## smokeCurrentSmoker        -0.01312    0.11272  -0.116  0.90736    
## smokeFormerSmoker          0.71437    0.09630   7.418 3.89e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1.025507)
## 
## Number of Fisher Scoring iterations: 6

regTermTest(fit.logit3, test.terms=~smoke, method="Wald", df = NULL)

## Wald test for smoke
##  in svyglm(formula = I(mortstat == 1) ~ race_eth + moredu + smoke, 
##     design = des, family = binomial)
## F =  31.76775  on  2  and  626  df: p= 7.2367e-14

In this model, we see the effects of smoking do not change the outcome much, however, has significant effects.

Finally, we see that the model is significantly better fitting than the previous model.

expcoeff<- function(x) (exp(x))

stargazer(fit.logit1, fit.logit2, fit.logit3, apply.coef = expcoeff, type = "html", style="demography", covariate.labels =c("Intercept", "Hispanic", "NHBlack","NHOther" , "High School or Less", "Some College", "CurrentSmoker", "FormerSmoker"), keep.stat="n", model.names=F, align=T, ci=T )


	I(mortstat == 1)
	Model 1	Model 2	Model 3

Intercept	1.914^***	2.154^***	2.546^***
	(1.669, 2.159)	(1.906, 2.402)	(2.208, 2.883)
Hispanic	1.042^***	1.440^***	1.943^***
	(0.640, 1.445)	(1.033, 1.846)	(1.446, 2.441)
NHBlack	2.145^***	2.683^***	3.041^***
	(1.957, 2.334)	(2.491, 2.875)	(2.775, 3.307)
NHOther		3.199^***	3.396^***
		(3.021, 3.377)	(3.141, 3.651)
High School or Less		1.389^***	1.536^***
		(1.179, 1.598)	(1.255, 1.818)
Some College			0.987^***
			(0.766, 1.208)
CurrentSmoker			2.043^***
			(1.854, 2.232)
FormerSmoker	0.040	0.016	0.013
	(-0.136, 0.215)	(-0.226, 0.257)	(-0.314, 0.341)
N	19,903	19,547	9,234

p < .05; p < .01; p < .001

The nested model comparison shows that, with addtion of covariates such as education and smoking status, the effect of race and ethinicity or mortality outcome changes slightly yet remains significant. That is, Hispanics enjoy an advantage over other groups when it comes to remain alive. Education enhances that advantage.

Homework2

Muntasir

February 4, 2018