This example will cover the use of R functions for fitting Ordinal and Multinomial logit models to complex survey data.
For this example I am using 2016 CDC Behavioral Risk Factor Surveillance System (BRFSS) SMART MSA data. Link
#load brfss
library(car)
library(stargazer)
##
## Please cite as:
## Hlavac, Marek (2018). stargazer: Well-Formatted Regression and Summary Statistics Tables.
## R package version 5.2.1. https://CRAN.R-project.org/package=stargazer
library(survey)
## Loading required package: grid
## Loading required package: Matrix
## Loading required package: survival
##
## Attaching package: 'survey'
## The following object is masked from 'package:graphics':
##
## dotchart
load("~/Google Drive/classes/dem7283/class18/data/brfss16_mmsa.Rdata")
#The names in the data are very ugly, so I make them less ugly
nams<-names(brfss16m)
head(nams, n=10)
## [1] "DISPCODE" "HHADULT" "GENHLTH" "PHYSHLTH" "MENTHLTH" "POORHLTH"
## [7] "HLTHPLN1" "PERSDOC2" "MEDCOST" "CHECKUP1"
#we see some names are lower case, some are upper and some have a little _ in the first position. This is a nightmare.
newnames<-tolower(gsub(pattern = "_",replacement = "",x = nams))
names(brfss16m)<-newnames
#Poor or fair self rated health
#brfss16m$badhealth<-ifelse(brfss16m$genhlth %in% c(4,5),1,0)
brfss16m$badhealth<-recode(brfss16m$genhlth, recodes="4:5=1; 1:3=0; else=NA")
#race/ethnicity
brfss16m$black<-recode(brfss16m$racegr3, recodes="2=1; 9=NA; else=0")
brfss16m$white<-recode(brfss16m$racegr3, recodes="1=1; 9=NA; else=0")
brfss16m$other<-recode(brfss16m$racegr3, recodes="3:4=1; 9=NA; else=0")
brfss16m$hispanic<-recode(brfss16m$racegr3, recodes="5=1; 9=NA; else=0")
brfss16m$race_eth<-recode(brfss16m$racegr3, recodes="1='nhwhite'; 2='nh black'; 3='nh other';4='nh multirace'; 5='hispanic'; else=NA", as.factor.result = T)
brfss16m$race_eth<-relevel(brfss16m$race_eth, ref = "nhwhite")
#insurance
brfss16m$ins<-recode(brfss16m$hlthpln1, recodes ="7:9=NA; 1=1;2=0")
#income grouping
brfss16m$inc<-ifelse(brfss16m$incomg==9, NA, brfss16m$incomg)
#education level
brfss16m$educ<-recode(brfss16m$educa, recodes="1:2='0Prim'; 3='1somehs'; 4='2hsgrad'; 5='3somecol'; 6='4colgrad';9=NA", as.factor.result=T)
brfss16m$educ<-relevel(brfss16m$educ, ref='2hsgrad')
#employment
brfss16m$employ<-recode(brfss16m$employ1, recodes="1:2='Employed'; 2:6='nilf'; 7='retired'; 8='unable'; else=NA", as.factor.result=T)
brfss16m$employ<-relevel(brfss16m$employ, ref='Employed')
#marital status
brfss16m$marst<-recode(brfss16m$marital, recodes="1='married'; 2='divorced'; 3='widowed'; 4='separated'; 5='nm';6='cohab'; else=NA", as.factor.result=T)
brfss16m$marst<-relevel(brfss16m$marst, ref='married')
#Age cut into intervals
brfss16m$agec<-cut(brfss16m$age80, breaks=c(0,24,39,59,79,99))
#BMI, in the brfss16ma the bmi variable has 2 implied decimal places, so we must divide by 100 to get real bmi's
brfss16m$bmi<-brfss16m$bmi5/100
#smoking currently
brfss16m$smoke<-recode(brfss16m$smoker3, recodes="1:2='Current'; 3='Former';4='NeverSmoked'; else=NA", as.factor.result=T)
brfss16m$smoke<-relevel(brfss16m$smoke, ref = "NeverSmoked")
In the basic logistic regression model, we assumed a dichotomous outcome * y = 0 or 1 This is a very flexible type of model since we can write any number of outcomes as a dichtomous variable.
Although the logistic regression model is very flexible, we may often run into other kinds of outcomes that it is not applicable to
These types of outcomes require other kinds of regression models to analyze them.
Suppose we have an outcome with J ordered outcomes, and each person, i, has a response: \(Y_{i}\), and we can write the probablity of a person having response level J as: \(Pr(Y_ij = j)\) for J= 1 … J
For an ordered response it is easier to work with the cumluative probablity function or
\(Pr(Y_ij \leqslant j)\)
So the probabilities increase in adjacent categories (ordered categories). I.e. \[Pr(Y_ij \leqslant 3)>Pr(Y_ij \leqslant 2)>Pr(Y_ij \leqslant 1)\] Since we are using cumulative probabilities, we do not acutally have to estimate all J probabilities. We need only estimate J-1 probabilites, due to the complementary rule of probability: \[Pr(y=0) = 1-Pr(y>0)\]
In order to construct a regression model, we have to link a linear function of covariates to these probabilities. A general depiction of this model can be thought of as a latent trait model, where a latent variable, z is not observable but continuous, but we can observe the discrete representation of it as the J categories, then we can write this as:
\[\pi_{j-1} < Z{i} \leqslant \pi{j}\] or graphically as:
The most common model for an ordinal outcome such as this is the Proportional Odds Model. In this model, for each level of J, there is an intercept term, \(\beta_{0j}\) which depends on the category, J. The other explantory variables x do not depend on the category J, however. This generates a model of the form:
\[\text {log} \frac{\pi_1+\cdots +\pi_j}{\pi_{j+1}+\cdots +\pi_j} = \beta_{0j}+x'\beta\] The base assumption for this model is that each x variable affects the probability of J increasing by 1 level in the same way, for all transitions. We can visualize this as:
This model is attractive because we have a direct and consistent interpretation of our \(\beta\)’s in the model. While this is nice, the assumption can be very weak in practice, with the \(\beta\)’s not being equal between categories. In other words, come covariates may affect the probability of certain transitions more than others.
We can evaluate the assumption generally by fitting a series of binomial logistic regressions and examining the \(\beta\)’s between them. This can be done by re-coding the outcome from a single ordered outcome to a series of binomial outcomes. For example:
#General ordinal coding, with 5 being the worst and 1 being the best health
brfss16m$generalhealth<-recode(brfss16m$genhlth, recodes="1:2=1;3=2;4:5=3; else=NA", as.factor.result = T)
brfss16m$generalhealth<-relevel(brfss16m$generalhealth, ref = "1")
brfss16m$healthnum<-car::recode(brfss16m$genhlth, recodes="1:2=1;3=2;4:5=3; else=NA", as.factor.result = F)
#First we tell R our survey design
options(survey.lonely.psu = "adjust")
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:car':
##
## recode
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
sub<-brfss16m%>%
select(badhealth,healthnum,generalhealth, mmsaname, bmi, agec,race_eth, marst, educ,white, black, hispanic, other, smoke, ins, mmsawt, ststr) %>%
filter( complete.cases(.))
#First we tell R our survey design
options(survey.lonely.psu = "adjust")
des<-svydesign(ids=~1, strata=~ststr, weights=~mmsawt, data =sub )
To fit an ordinal logit to survey data in R, we use the svyolr fucntion in the survey library.
#Here I fit three nested models for the health outcome
fit.solr1<-svyolr(generalhealth~race_eth+educ+agec,des)
summary(fit.solr1)
## Call:
## svyolr(generalhealth ~ race_eth + educ + agec, des)
##
## Coefficients:
## Value Std. Error t value
## race_ethhispanic 0.5155520 0.02846057 18.114602
## race_ethnh black 0.4170751 0.02639000 15.804286
## race_ethnh multirace 0.4521600 0.06608237 6.842369
## race_ethnh other 0.3030651 0.04198436 7.218523
## educ0Prim 1.0574528 0.05780852 18.292336
## educ1somehs 0.6620081 0.04072828 16.254261
## educ3somecol -0.3089586 0.02299982 -13.433088
## educ4colgrad -0.9845144 0.02150067 -45.789934
## agec(24,39] 0.3632629 0.03623319 10.025693
## agec(39,59] 0.7456899 0.03447668 21.628817
## agec(59,79] 1.1022439 0.03504340 31.453678
## agec(79,99] 1.3501218 0.04544723 29.707465
##
## Intercepts:
## Value Std. Error t value
## 1|2 0.6313 0.0341 18.5165
## 2|3 2.3040 0.0353 65.1897
#Calculate the AIC ourself
fit.solr1$deviance+2*length(fit.solr1$coefficients)
## [1] 402859
Which gives us the results from the proportional odds model. In this case, we see that non-Hispanic blacks have higher odds of reporting poorer health, compared to non-Hispanic whites. We also see that as education increases, the odds of reporting poorer health decreases, while for lower leves of education, the odds of having worse health are higher. As age increases, the odds of reporting poorer health increases, compared to those <25 years of age.
Now we can “examine” the proportional odds assumption by fitting logits for each change
ex1<-svyglm(I(healthnum>1)~race_eth+educ+agec,des, family="binomial")
## Warning in eval(family$initialize): non-integer #successes in a binomial
## glm!
ex2<-svyglm(I(healthnum>2)~race_eth+educ+agec,des, family="binomial")
## Warning in eval(family$initialize): non-integer #successes in a binomial
## glm!
This is NOT a statistical test, just a a rough guide. Here, I plot the coefficients of each model. If the proportional odds is correct, and the assumption is ok, then all these should be “approximately” the same values.
plot(coef(ex1)[-1], ylim=c(-3, 3), type="l",xaxt="n", ylab="Beta", main=c("Comparison of betas for", " proportional odds assumption"))
lines(coef(ex2)[-1], col=1, lty=2)
axis(side=1, at=1:12, labels=F)
text(x=1:12, y=-4, srt = 90, pos = 1, xpd = TRUE,cex=.8,
labels = c( "Hispanic", "Black","MultRace" ,"Other",
"PrimarySch","SomeHS", "SomeColl", "CollGrad", "Age 24-39","Age 39-59" ,"Age 59-79", "Age 80+" ))
legend("bottomright",col=c(1,1),lty=c(1,2), legend=c(">1", ">2"))
lines(coef(fit.solr1)[c(-13:-16)], col=4, lwd=3)
Based on this, the effects appear to be pretty consistent across transitions.
#Just print the odds ratios,
round(exp(rbind(coef(ex1)[-1], coef(ex2)[-1])),3)
## race_ethhispanic race_ethnh black race_ethnh multirace
## [1,] 1.698 1.577 1.472
## [2,] 1.630 1.369 1.825
## race_ethnh other educ0Prim educ1somehs educ3somecol educ4colgrad
## [1,] 1.407 2.519 1.828 0.737 0.385
## [2,] 1.089 2.957 2.060 0.733 0.315
## agec(24,39] agec(39,59] agec(59,79] agec(79,99]
## [1,] 1.427 1.985 2.736 3.562
## [2,] 1.629 2.883 4.344 5.125
Which again shows us the effects are pretty consistent across transitions, just like the plot does. If anything the plot is easier to digest.
We can also fit a cumulative logit model where the coefficients are not assumed to be the same across transitions. This is an alternative model where we don’t assume proportionality of the odds of the various transitions. We can’t use survey design to do this however.
library(VGAM)
## Loading required package: stats4
## Loading required package: splines
##
## Attaching package: 'VGAM'
## The following object is masked from 'package:survey':
##
## calibrate
## The following object is masked from 'package:car':
##
## logit
#Proportional odds
fit.vgam<-vglm(as.ordered(generalhealth)~race_eth+educ+agec,brfss16m, weights =mmsawt/mean(mmsawt, na.rm=T), family=cumulative(parallel = T, reverse = T)) #<-parallel = T == proportional odds
summary(fit.vgam)
##
## Call:
## vglm(formula = as.ordered(generalhealth) ~ race_eth + educ +
## agec, family = cumulative(parallel = T, reverse = T), data = brfss16m,
## weights = mmsawt/mean(mmsawt, na.rm = T))
##
##
## Pearson residuals:
## Min 1Q Median 3Q Max
## logit(P[Y>=2]) -10.937 -0.5576 -0.1609 0.51902 8.531
## logit(P[Y>=3]) -7.738 -0.3470 -0.1677 -0.07377 17.376
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept):1 -0.62517 0.01405 -44.49 <2e-16 ***
## (Intercept):2 -2.32192 0.01489 -155.93 <2e-16 ***
## race_ethhispanic 0.52140 0.01145 45.52 <2e-16 ***
## race_ethnh black 0.40714 0.01209 33.69 <2e-16 ***
## race_ethnh multirace 0.47611 0.03457 13.77 <2e-16 ***
## race_ethnh other 0.33090 0.01561 21.20 <2e-16 ***
## educ0Prim 0.97749 0.02025 48.26 <2e-16 ***
## educ1somehs 0.66847 0.01554 43.01 <2e-16 ***
## educ3somecol -0.30490 0.01042 -29.26 <2e-16 ***
## educ4colgrad -0.97252 0.01125 -86.48 <2e-16 ***
## agec(24,39] 0.35346 0.01455 24.29 <2e-16 ***
## agec(39,59] 0.75487 0.01399 53.97 <2e-16 ***
## agec(59,79] 1.10589 0.01479 74.79 <2e-16 ***
## agec(79,99] 1.33419 0.02280 58.51 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Number of linear predictors: 2
##
## Names of linear predictors: logit(P[Y>=2]), logit(P[Y>=3])
##
## Residual deviance: 452289.9 on 485742 degrees of freedom
##
## Log-likelihood: -226144.9 on 485742 degrees of freedom
##
## Number of iterations: 5
##
## No Hauck-Donner effect found in any of the estimates
##
## Exponentiated coefficients:
## race_ethhispanic race_ethnh black race_ethnh multirace
## 1.6843782 1.5025127 1.6098072
## race_ethnh other educ0Prim educ1somehs
## 1.3922222 2.6577733 1.9512507
## educ3somecol educ4colgrad agec(24,39]
## 0.7371934 0.3781280 1.4239798
## agec(39,59] agec(59,79] agec(79,99]
## 2.1273290 3.0219149 3.7969261
#Non-proportional odds
fit.vgam2<-vglm(as.ordered(generalhealth)~race_eth+educ+agec,brfss16m, weights =mmsawt/mean(mmsawt, na.rm=T), family=cumulative(parallel = F, reverse = T)) #<-parallel = F == Nonproportional odds
summary(fit.vgam2)
##
## Call:
## vglm(formula = as.ordered(generalhealth) ~ race_eth + educ +
## agec, family = cumulative(parallel = F, reverse = T), data = brfss16m,
## weights = mmsawt/mean(mmsawt, na.rm = T))
##
##
## Pearson residuals:
## Min 1Q Median 3Q Max
## logit(P[Y>=2]) -10.358 -0.5532 -0.1598 0.53470 8.177
## logit(P[Y>=3]) -8.507 -0.3466 -0.1665 -0.07256 19.839
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept):1 -0.59831 0.01453 -41.189 < 2e-16 ***
## (Intercept):2 -2.52039 0.02363 -106.640 < 2e-16 ***
## race_ethhispanic:1 0.54140 0.01246 43.453 < 2e-16 ***
## race_ethhispanic:2 0.46721 0.01559 29.969 < 2e-16 ***
## race_ethnh black:1 0.43848 0.01301 33.712 < 2e-16 ***
## race_ethnh black:2 0.30590 0.01689 18.112 < 2e-16 ***
## race_ethnh multirace:1 0.40835 0.03707 11.017 < 2e-16 ***
## race_ethnh multirace:2 0.61056 0.04643 13.151 < 2e-16 ***
## race_ethnh other:1 0.38058 0.01627 23.387 < 2e-16 ***
## race_ethnh other:2 0.10654 0.02510 4.245 2.19e-05 ***
## educ0Prim:1 0.89652 0.02591 34.602 < 2e-16 ***
## educ0Prim:2 0.99948 0.02307 43.318 < 2e-16 ***
## educ1somehs:1 0.61562 0.01816 33.898 < 2e-16 ***
## educ1somehs:2 0.70911 0.01875 37.820 < 2e-16 ***
## educ3somecol:1 -0.29973 0.01109 -27.036 < 2e-16 ***
## educ3somecol:2 -0.30658 0.01464 -20.941 < 2e-16 ***
## educ4colgrad:1 -0.93844 0.01171 -80.111 < 2e-16 ***
## educ4colgrad:2 -1.13193 0.01788 -63.323 < 2e-16 ***
## agec(24,39]:1 0.34321 0.01497 22.930 < 2e-16 ***
## agec(24,39]:2 0.45835 0.02493 18.384 < 2e-16 ***
## agec(39,59]:1 0.68875 0.01447 47.588 < 2e-16 ***
## agec(39,59]:2 1.05354 0.02340 45.016 < 2e-16 ***
## agec(59,79]:1 1.00645 0.01546 65.088 < 2e-16 ***
## agec(59,79]:2 1.45215 0.02396 60.611 < 2e-16 ***
## agec(79,99]:1 1.23550 0.02513 49.155 < 2e-16 ***
## agec(79,99]:2 1.64567 0.03185 51.676 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Number of linear predictors: 2
##
## Names of linear predictors: logit(P[Y>=2]), logit(P[Y>=3])
##
## Residual deviance: 451015.4 on 485730 degrees of freedom
##
## Log-likelihood: -225507.7 on 485730 degrees of freedom
##
## Number of iterations: 5
##
## No Hauck-Donner effect found in any of the estimates
##
## Exponentiated coefficients:
## race_ethhispanic:1 race_ethhispanic:2 race_ethnh black:1
## 1.7184147 1.5955431 1.5503519
## race_ethnh black:2 race_ethnh multirace:1 race_ethnh multirace:2
## 1.3578505 1.5043379 1.8414533
## race_ethnh other:1 race_ethnh other:2 educ0Prim:1
## 1.4631332 1.1124241 2.4510690
## educ0Prim:2 educ1somehs:1 educ1somehs:2
## 2.7168723 1.8508036 2.0321802
## educ3somecol:1 educ3somecol:2 educ4colgrad:1
## 0.7410215 0.7359574 0.3912369
## educ4colgrad:2 agec(24,39]:1 agec(24,39]:2
## 0.3224093 1.4094587 1.5814612
## agec(39,59]:1 agec(39,59]:2 agec(59,79]:1
## 1.9912267 2.8677712 2.7358603
## agec(59,79]:2 agec(79,99]:1 agec(79,99]:2
## 4.2722948 3.4400973 5.1844637
AIC(fit.vgam)
## [1] 452317.9
AIC(fit.vgam2)
## [1] 451067.4
Based on the AIC, the nonproportioal odds model fits better than the proportional odds model.
There are other types of models for ordinal data. Another is the Adjacent categories model, which models:
\[\text {log} \frac{\pi_{ij}}{\pi_{i{(j-1)}}} = \beta_{ij} , i = 1 , \cdots, I; \text{ } j= 2, \cdots, J\]
#Adjacent categories
fit.acat<-vglm(as.ordered(generalhealth)~race_eth+educ+agec,brfss16m, weights = mmsawt/mean(mmsawt, na.rm=T), family=acat(parallel = F, reverse = T))
summary(fit.acat)
##
## Call:
## vglm(formula = as.ordered(generalhealth) ~ race_eth + educ +
## agec, family = acat(parallel = F, reverse = T), data = brfss16m,
## weights = mmsawt/mean(mmsawt, na.rm = T))
##
##
## Pearson residuals:
## Min 1Q Median 3Q Max
## loge(P[Y=1]/P[Y=2]) -7.684 -0.5706 0.1626 0.5636 10.543
## loge(P[Y=2]/P[Y=3]) -19.450 0.0422 0.1169 0.3534 7.733
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept):1 0.82801 0.01572 52.656 < 2e-16 ***
## (Intercept):2 1.37123 0.02581 53.137 < 2e-16 ***
## race_ethhispanic:1 -0.47882 0.01372 -34.897 < 2e-16 ***
## race_ethhispanic:2 -0.21507 0.01727 -12.454 < 2e-16 ***
## race_ethnh black:1 -0.41737 0.01426 -29.271 < 2e-16 ***
## race_ethnh black:2 -0.07686 0.01864 -4.123 3.74e-05 ***
## race_ethnh multirace:1 -0.25863 0.04202 -6.155 7.51e-10 ***
## race_ethnh multirace:2 -0.47722 0.05280 -9.038 < 2e-16 ***
## race_ethnh other:1 -0.40286 0.01747 -23.062 < 2e-16 ***
## race_ethnh other:2 0.14551 0.02737 5.316 1.06e-07 ***
## educ0Prim:1 -0.52185 0.02910 -17.933 < 2e-16 ***
## educ0Prim:2 -0.78445 0.02578 -30.433 < 2e-16 ***
## educ1somehs:1 -0.41110 0.02034 -20.215 < 2e-16 ***
## educ1somehs:2 -0.52389 0.02091 -25.051 < 2e-16 ***
## educ3somecol:1 0.24390 0.01228 19.859 < 2e-16 ***
## educ3somecol:2 0.17060 0.01631 10.463 < 2e-16 ***
## educ4colgrad:1 0.75251 0.01288 58.424 < 2e-16 ***
## educ4colgrad:2 0.66517 0.01971 33.749 < 2e-16 ***
## agec(24,39]:1 -0.26653 0.01608 -16.578 < 2e-16 ***
## agec(24,39]:2 -0.31121 0.02710 -11.485 < 2e-16 ***
## agec(39,59]:1 -0.46549 0.01567 -29.702 < 2e-16 ***
## agec(39,59]:2 -0.80641 0.02553 -31.586 < 2e-16 ***
## agec(59,79]:1 -0.68517 0.01688 -40.597 < 2e-16 ***
## agec(59,79]:2 -1.08443 0.02622 -41.358 < 2e-16 ***
## agec(79,99]:1 -0.89401 0.02794 -31.992 < 2e-16 ***
## agec(79,99]:2 -1.14086 0.03525 -32.363 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Number of linear predictors: 2
##
## Names of linear predictors: loge(P[Y=1]/P[Y=2]), loge(P[Y=2]/P[Y=3])
##
## Residual deviance: 450954 on 485730 degrees of freedom
##
## Log-likelihood: -225477 on 485730 degrees of freedom
##
## Number of iterations: 5
We also commonly see a resonse variable with unordered categories for a response. Such as:
These are example of what generally is called an âalternative decision makingâ process.
A distribution commonly used for this type of outcome is the multivariate extension of the binomial distribution, known as the multinomial distribution.
If you have an outcome with J unordered classes, then \(\pi_1\), \(\pi_2\), …, \(\pi_J\) are the probabilities of observing the J classes. Remember, \(\sum_J \pi_j = 1\).
If we observe \(y_1\) outcomes in the first category and \(y_2\) outcomes in the second and so on, the let:
\[\mathbf{y}=\begin{bmatrix} y_1\\ y_2\\ \vdots \\ y_4 \end{bmatrix}, with \sum_{j=1}^Jy_i = n\]
In this case y follows the multinomial distribution:
\[f(y|n)=\frac{n!}{y_1!y_2!\cdots y_j!} \pi_1^{y1}\pi_2^{y2}\cdots\pi_J^{yJ}\]
When constructing a regression model for such a distrubution, we choose one level of the outcome as the reference level and compare the probabilty of chooseing other options to it. For example, if we modeled the probability of choosing option j to option 1 (as the reference level), the we would have the logistic regression model:
\[logit(\pi_{j}) =log \left ( \frac{\pi_j}{\pi_1} \right)=x'_j\beta_j\]
This makes J-1 equations, again with the reference category estimated by the complementary rule. We then have a total of J-1 regression equations to interpret, where we compare the odds of being in each of the other j-1 categories to the reference category. If we want to estimate the probability of being in each particular class of the outcome, for a response with a given set of x values, we can solve the equation for \(\pi_j\) as:
\[\pi_j = \frac{exp(x'_j\beta_j)}{1+\sum_{j=2}^J exp(x'_j\beta_j)}\]
This is a much more complicated model than the proportional odds model, and we will have many more effects to interpret, owing to the J-1 separate equations we are estimating.
Unfortunately, in R there is no multinomial survey design model, so we have to use VGAM again and feed the function our normalized weights.
Now I fit the model, this is screwy looking because I have to use the multinom() function from the nnet library to fit the model and since I want to extract the equations from the model I have to use the coef() function.
mfit<-vglm(generalhealth~race_eth+educ+agec,multinomial(refLevel = 1),brfss16m, weights=mmsawt/mean(mmsawt, na.rm=T))
summary(mfit)
##
## Call:
## vglm(formula = generalhealth ~ race_eth + educ + agec, family = multinomial(refLevel = 1),
## data = brfss16m, weights = mmsawt/mean(mmsawt, na.rm = T))
##
##
## Pearson residuals:
## Min 1Q Median 3Q Max
## log(mu[,2]/mu[,1]) -7.150 -0.5176 -0.2604 0.52666 8.274
## log(mu[,3]/mu[,1]) -8.438 -0.3476 -0.1723 -0.06237 19.844
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept):1 -0.82801 0.01572 -52.656 < 2e-16 ***
## (Intercept):2 -2.19924 0.02469 -89.086 < 2e-16 ***
## race_ethhispanic:1 0.47882 0.01372 34.897 < 2e-16 ***
## race_ethhispanic:2 0.69390 0.01711 40.552 < 2e-16 ***
## race_ethnh black:1 0.41737 0.01426 29.271 < 2e-16 ***
## race_ethnh black:2 0.49423 0.01836 26.921 < 2e-16 ***
## race_ethnh multirace:1 0.25863 0.04202 6.155 7.51e-10 ***
## race_ethnh multirace:2 0.73585 0.05021 14.657 < 2e-16 ***
## race_ethnh other:1 0.40286 0.01747 23.062 < 2e-16 ***
## race_ethnh other:2 0.25735 0.02663 9.665 < 2e-16 ***
## educ0Prim:1 0.52185 0.02910 17.933 < 2e-16 ***
## educ0Prim:2 1.30630 0.02924 44.668 < 2e-16 ***
## educ1somehs:1 0.41110 0.02034 20.215 < 2e-16 ***
## educ1somehs:2 0.93499 0.02194 42.610 < 2e-16 ***
## educ3somecol:1 -0.24390 0.01228 -19.859 < 2e-16 ***
## educ3somecol:2 -0.41450 0.01568 -26.431 < 2e-16 ***
## educ4colgrad:1 -0.75251 0.01288 -58.424 < 2e-16 ***
## educ4colgrad:2 -1.41768 0.01868 -75.877 < 2e-16 ***
## agec(24,39]:1 0.26653 0.01608 16.578 < 2e-16 ***
## agec(24,39]:2 0.57774 0.02596 22.254 < 2e-16 ***
## agec(39,59]:1 0.46549 0.01567 29.702 < 2e-16 ***
## agec(39,59]:2 1.27189 0.02447 51.982 < 2e-16 ***
## agec(59,79]:1 0.68517 0.01688 40.597 < 2e-16 ***
## agec(59,79]:2 1.76960 0.02525 70.080 < 2e-16 ***
## agec(79,99]:1 0.89401 0.02794 31.992 < 2e-16 ***
## agec(79,99]:2 2.03487 0.03504 58.066 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Number of linear predictors: 2
##
## Names of linear predictors: log(mu[,2]/mu[,1]), log(mu[,3]/mu[,1])
##
## Residual deviance: 450954 on 485730 degrees of freedom
##
## Log-likelihood: -225477 on 485730 degrees of freedom
##
## Number of iterations: 5
##
## Reference group is level 1 of the response
Calculate the odds ratios and confidence intervals
round(exp(coef(mfit)), 3)
## (Intercept):1 (Intercept):2 race_ethhispanic:1
## 0.437 0.111 1.614
## race_ethhispanic:2 race_ethnh black:1 race_ethnh black:2
## 2.001 1.518 1.639
## race_ethnh multirace:1 race_ethnh multirace:2 race_ethnh other:1
## 1.295 2.087 1.496
## race_ethnh other:2 educ0Prim:1 educ0Prim:2
## 1.293 1.685 3.692
## educ1somehs:1 educ1somehs:2 educ3somecol:1
## 1.508 2.547 0.784
## educ3somecol:2 educ4colgrad:1 educ4colgrad:2
## 0.661 0.471 0.242
## agec(24,39]:1 agec(24,39]:2 agec(39,59]:1
## 1.305 1.782 1.593
## agec(39,59]:2 agec(59,79]:1 agec(59,79]:2
## 3.568 1.984 5.869
## agec(79,99]:1 agec(79,99]:2
## 2.445 7.651
round(exp(confint(mfit)), 3)
## 2.5 % 97.5 %
## (Intercept):1 0.424 0.451
## (Intercept):2 0.106 0.116
## race_ethhispanic:1 1.571 1.658
## race_ethhispanic:2 1.935 2.070
## race_ethnh black:1 1.476 1.561
## race_ethnh black:2 1.581 1.699
## race_ethnh multirace:1 1.193 1.406
## race_ethnh multirace:2 1.892 2.303
## race_ethnh other:1 1.446 1.548
## race_ethnh other:2 1.228 1.363
## educ0Prim:1 1.592 1.784
## educ0Prim:2 3.487 3.910
## educ1somehs:1 1.450 1.570
## educ1somehs:2 2.440 2.659
## educ3somecol:1 0.765 0.803
## educ3somecol:2 0.641 0.681
## educ4colgrad:1 0.459 0.483
## educ4colgrad:2 0.234 0.251
## agec(24,39]:1 1.265 1.347
## agec(24,39]:2 1.694 1.875
## agec(39,59]:1 1.545 1.642
## agec(39,59]:2 3.401 3.743
## agec(59,79]:1 1.920 2.051
## agec(59,79]:2 5.585 6.166
## agec(79,99]:1 2.315 2.583
## agec(79,99]:2 7.143 8.195
So in these results, each row consists of an equation for a particular set of outcomes. For example, let’s interpret the first row, which corresponds to the model \(Pr(\text{y='good health'})/Pr(\text{y='excellent/vg health'})\) which corresponts to the coefficent for race_ethhispanic:1
In this case, hispanics are more likely to report good instead of excellent/vg health, compared to NH Whites. This is also the case for NH blacks,others and multirace respondents. In terms of education, those with primary school and some highschool are more likely to report good, compared to excellent/vg health, compared to those with high school, but those with college educatoin are less likely to report good vs excellent/vg health, compared to those with high school.
Let’s consider the second transition now. This corresponds to the model \(Pr(\text{y='poor/fair health'})/Pr(\text{y='excellent/vg health'})\) . In this case hispanics are more likely to report poor instead of excellent health, compared to NH Whites. This is also the case for NH blacks, Others and multirace respondents. In terms of education, those with less education(primary or some HS) are more likely to report poor, compared to excellent health, compared to those with high school education, while those with more education (some college or college grad) are less likely to report poor, compared to excellent health, compared to those with high school education.
#get a series of predicted probabilites for different "types" of people for each model
dat<-expand.grid(race_eth=levels(brfss16m$race_eth), educ=levels(brfss16m$educ), agec=levels(brfss16m$agec))
#generate the fitted values
#Unfortunately, the survey proportional odds model won't generate fitted values
#but here I use a weighted multinomial model, which fits the data better anyway
fitm<-predict(mfit, newdat=dat,type="response")
#add the values to the fake data
dat<-cbind(dat, round(fitm,3))
#Print the fitted probabilities
head(dat, n=20)
## race_eth educ agec 1 2 3
## 1 nhwhite 2hsgrad (0,24] 0.646 0.282 0.072
## 2 hispanic 2hsgrad (0,24] 0.519 0.366 0.115
## 3 nh black 2hsgrad (0,24] 0.542 0.359 0.099
## 4 nh multirace 2hsgrad (0,24] 0.556 0.315 0.129
## 5 nh other 2hsgrad (0,24] 0.556 0.364 0.080
## 6 nhwhite 0Prim (0,24] 0.466 0.343 0.191
## 7 hispanic 0Prim (0,24] 0.332 0.395 0.272
## 8 nh black 0Prim (0,24] 0.359 0.401 0.241
## 9 nh multirace 0Prim (0,24] 0.356 0.340 0.304
## 10 nh other 0Prim (0,24] 0.380 0.419 0.201
## 11 nhwhite 1somehs (0,24] 0.515 0.339 0.145
## 12 hispanic 1somehs (0,24] 0.380 0.405 0.215
## 13 nh black 1somehs (0,24] 0.406 0.406 0.188
## 14 nh multirace 1somehs (0,24] 0.409 0.349 0.241
## 15 nh other 1somehs (0,24] 0.425 0.419 0.155
## 16 nhwhite 3somecol (0,24] 0.706 0.242 0.052
## 17 hispanic 3somecol (0,24] 0.588 0.325 0.086
## 18 nh black 3somecol (0,24] 0.610 0.317 0.073
## 19 nh multirace 3somecol (0,24] 0.626 0.278 0.096
## 20 nh other 3somecol (0,24] 0.622 0.319 0.059
This will use polr with survey weights to get fitted probabilities from the proportional odds model. The fitted values are right, but the standard errors won’t be.
fitted.ord<-round(predict(fit.vgam, newdat=dat[,1:3], type="response"), 3)
dat<-cbind(dat, fitted.ord)
names(dat)<-c(names(dat)[1:3], "mp1", "mp2", "mp3", "op1", "op2", "op3")
head(dat, n=20)
## race_eth educ agec mp1 mp2 mp3 op1 op2 op3
## 1 nhwhite 2hsgrad (0,24] 0.646 0.282 0.072 0.651 0.259 0.089
## 2 hispanic 2hsgrad (0,24] 0.519 0.366 0.115 0.526 0.332 0.142
## 3 nh black 2hsgrad (0,24] 0.542 0.359 0.099 0.554 0.317 0.128
## 4 nh multirace 2hsgrad (0,24] 0.556 0.315 0.129 0.537 0.326 0.136
## 5 nh other 2hsgrad (0,24] 0.556 0.364 0.080 0.573 0.307 0.120
## 6 nhwhite 0Prim (0,24] 0.466 0.343 0.191 0.413 0.380 0.207
## 7 hispanic 0Prim (0,24] 0.332 0.395 0.272 0.294 0.400 0.305
## 8 nh black 0Prim (0,24] 0.359 0.401 0.241 0.319 0.400 0.281
## 9 nh multirace 0Prim (0,24] 0.356 0.340 0.304 0.304 0.400 0.296
## 10 nh other 0Prim (0,24] 0.380 0.419 0.201 0.336 0.398 0.266
## 11 nhwhite 1somehs (0,24] 0.515 0.339 0.145 0.489 0.350 0.161
## 12 hispanic 1somehs (0,24] 0.380 0.405 0.215 0.362 0.394 0.244
## 13 nh black 1somehs (0,24] 0.406 0.406 0.188 0.389 0.387 0.223
## 14 nh multirace 1somehs (0,24] 0.409 0.349 0.241 0.373 0.391 0.236
## 15 nh other 1somehs (0,24] 0.425 0.419 0.155 0.408 0.382 0.210
## 16 nhwhite 3somecol (0,24] 0.706 0.242 0.052 0.717 0.215 0.067
## 17 hispanic 3somecol (0,24] 0.588 0.325 0.086 0.601 0.291 0.109
## 18 nh black 3somecol (0,24] 0.610 0.317 0.073 0.628 0.274 0.098
## 19 nh multirace 3somecol (0,24] 0.626 0.278 0.096 0.612 0.284 0.104
## 20 nh other 3somecol (0,24] 0.622 0.319 0.059 0.645 0.263 0.091
Let’s look at the relative model fits:
AIC(fit.vgam) #proportional odds
## [1] 452317.9
AIC(fit.vgam2) #cumulative logit, non proportional
## [1] 451067.4
AIC(mfit) #multinomial
## [1] 451006
Looks like the multinomial is the best fitting model, but it is very close to the nonproportional odds model using the cumlative logit.