Defining an Ordinal Variable for Analysis

For this homework project, I would like to use “Socio-economic status quartile” from the ELS:2002 dataset.

In the actual dataset, SES quartiles are coded as follows:

Category Label Unweighted
1 Lowest quartile 3,608
2 Second quartile 3,604
3 Third quartile 3,731
4 Highest quartile 4,301

This should be fairly easy to recode as an ordinal variable, with “1” being the lowest quartile and “4” being the highest:

#recoding SES quartiles as numbers. This is pretty easy because they are already divided into 1-4 quartiles. 

els$ses_ordinal<-recode(els$byses1qu,recodes="1=1;2=2;3=3;4=4;else=NA",as.factor.result = T)
els$ses_ordinal<-relevel(els$ses_ordinal,ref="1")
els$ses_number<-recode(els$byses1qu,recodes="1=1;2=2;3=3;4=4;else=NA",as.factor.result = F)

Research Question

I am interested in the effect of non-traditional family structure, English-language fluency (or lack thereof) and suburban living upon SES quartiles. My hypothesis is that non-traditional family structure, lack of English language fluency will be associated negatively with higher SES quartiles, while suburban living will be associated positively with higher SES quartiles. In each case, the predictor variable in question (non-traditional family, non-fluent in English and suburban living) has been coded as “1” in a binary variable.

#creating non-traditional family variable
els<-els %>% mutate(non_trad_family=case_when(.$byfcomp %in% c(2:9) ~ 1,
                                              .$byfcomp==1 ~ 0,
                                              .$byfcomp %in% c(-4,-8,-9) ~ NA_real_))

#creating parent's English-language fluency variable
els<-els %>% mutate(non_fluent_english=case_when(.$byplang %in% c(0,1,3) ~ 1,
                                                 .$byplang %in% c(2,4) ~ 0,
                                                 .$byplang %in% c(-4,-8,-9) ~ NA_real_))

#creating urban-suburban variable
els<-els %>% mutate(suburban=case_when(.$byurban==2 ~ 1,
                                       .$byurban %in% c(1,3) ~ 0))

Creating a survey design

Here, I am creating a survey design with my dataset:

dataset<-els %>% select(stu_id,bystuwt,strat_id,ses_ordinal,ses_number,non_trad_family,non_fluent_english,suburban) %>% filter(complete.cases(.))
options(survey.lonely.psu = "adjust")
design<-svydesign(ids=~stu_id,strata=~strat_id,weights=~bystuwt,data=dataset)

Fitting an Ordinal Logistic Regression

By using a survey design and ordinal logistic regression, we observe the following relationships between the variables in question and SES quartile:

fit.ordinal1<-svyolr(ses_ordinal~non_trad_family+non_fluent_english+suburban,design)
summary(fit.ordinal1)
## Call:
## svyolr(ses_ordinal ~ non_trad_family + non_fluent_english + suburban, 
##     design)
## 
## Coefficients:
##                         Value Std. Error    t value
## non_trad_family    -0.9221582 0.03926129 -23.487723
## non_fluent_english -2.4026825 0.08517680 -28.208179
## suburban            0.1365971 0.03641969   3.750639
## 
## Intercepts:
##     Value    Std. Error t value 
## 1|2  -1.7757   0.0384   -46.2496
## 2|3  -0.5463   0.0339   -16.1026
## 3|4   0.6432   0.0339    18.9746

We see that non-traditonal families and being non-fluent in English have a negative relationship with higher SES quartiles, while living in the suburbs has a slightly positive relationship with higher SES quartiles.

Testing the proportional odds assumption

Here, I will test the proportional odds assumption (i.e., the assumption that the relationship between the coefficients of each ordinal level are the same):

#testing the proportional odds assumption
delta1<-svyglm(I(ses_number>1)~non_trad_family+non_fluent_english+suburban,design,family="binomial")
## Warning in eval(family$initialize): non-integer #successes in a binomial
## glm!
delta2<-svyglm(I(ses_number>2)~non_trad_family+non_fluent_english+suburban,design,family="binomial")
## Warning in eval(family$initialize): non-integer #successes in a binomial
## glm!
delta3<-svyglm(I(ses_number>3)~non_trad_family+non_fluent_english+suburban,design,family="binomial")
## Warning in eval(family$initialize): non-integer #successes in a binomial
## glm!
rounded<-data.frame(round(exp(rbind(coef(delta1)[-1],coef(delta2)[-1],coef(delta3)[-1])),3))
print(rounded)
##   non_trad_family non_fluent_english suburban
## 1           0.426              0.087    1.217
## 2           0.400              0.107    1.139
## 3           0.345              0.148    1.114

As we can see, there are some differences between the levels of outcome variable across predictor variables, which means that the assumption is unlikely to hold.

Non-Proportional assumption

Since the proportional odds assumption appears to be false, we can fit a non-proportional odds model to the data:

fit.vgam<-vglm(as.ordered(ses_ordinal)~non_trad_family+non_fluent_english+suburban,dataset,weights=bystuwt/mean(bystuwt,na.rm=T),family=cumulative(parallel=T,reverse=T))
summary(fit.vgam)
## 
## Call:
## vglm(formula = as.ordered(ses_ordinal) ~ non_trad_family + non_fluent_english + 
##     suburban, family = cumulative(parallel = T, reverse = T), 
##     data = dataset, weights = bystuwt/mean(bystuwt, na.rm = T))
## 
## 
## Pearson residuals:
##                   Min       1Q  Median     3Q    Max
## logit(P[Y>=2]) -4.430  0.08298  0.2191 0.4532  4.886
## logit(P[Y>=3]) -3.389 -0.58172  0.1929 0.5820  8.431
## logit(P[Y>=4]) -2.294 -0.51690 -0.2375 0.6958 10.911
## 
## Coefficients: 
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept):1       1.77575    0.03299  53.827  < 2e-16 ***
## (Intercept):2       0.54628    0.02911  18.763  < 2e-16 ***
## (Intercept):3      -0.64319    0.02934 -21.921  < 2e-16 ***
## non_trad_family    -0.92216    0.03293 -28.006  < 2e-16 ***
## non_fluent_english -2.40268    0.06806 -35.304  < 2e-16 ***
## suburban            0.13660    0.03185   4.289 1.79e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Number of linear predictors:  3 
## 
## Names of linear predictors: 
## logit(P[Y>=2]), logit(P[Y>=3]), logit(P[Y>=4])
## 
## Residual deviance: 34597.43 on 39822 degrees of freedom
## 
## Log-likelihood: -17298.71 on 39822 degrees of freedom
## 
## Number of iterations: 5 
## 
## No Hauck-Donner effect found in any of the estimates
## 
## Exponentiated coefficients:
##    non_trad_family non_fluent_english           suburban 
##         0.39765983         0.09047537         1.14636631
fit.vgam2<-vglm(as.ordered(ses_ordinal)~non_trad_family+non_fluent_english+suburban,dataset,weights=bystuwt/mean(bystuwt,na.rm=T),family=cumulative(parallel=F,reverse=T))
## Warning in Deviance.categorical.data.vgam(mu = mu, y = y, w = w, residuals
## = residuals, : fitted values close to 0 or 1
summary(fit.vgam2)
## 
## Call:
## vglm(formula = as.ordered(ses_ordinal) ~ non_trad_family + non_fluent_english + 
##     suburban, family = cumulative(parallel = F, reverse = T), 
##     data = dataset, weights = bystuwt/mean(bystuwt, na.rm = T))
## 
## 
## Pearson residuals:
##                   Min       1Q  Median     3Q   Max
## logit(P[Y>=2]) -4.298  0.08421  0.2198 0.4534 5.237
## logit(P[Y>=3]) -3.340 -0.58333  0.1943 0.5814 7.940
## logit(P[Y>=4]) -2.633 -0.51675 -0.2367 0.6849 8.467
## 
## Coefficients: 
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept):1         1.72001    0.04025  42.729  < 2e-16 ***
## (Intercept):2         0.53193    0.03155  16.858  < 2e-16 ***
## (Intercept):3        -0.59795    0.03338 -17.916  < 2e-16 ***
## non_trad_family:1    -0.85035    0.04410 -19.282  < 2e-16 ***
## non_trad_family:2    -0.89637    0.03698 -24.242  < 2e-16 ***
## non_trad_family:3    -1.05164    0.04426 -23.760  < 2e-16 ***
## non_fluent_english:1 -2.43754    0.07085 -34.404  < 2e-16 ***
## non_fluent_english:2 -2.19087    0.08940 -24.505  < 2e-16 ***
## non_fluent_english:3 -1.84161    0.12246 -15.039  < 2e-16 ***
## suburban:1            0.19215    0.04339   4.428  9.5e-06 ***
## suburban:2            0.12648    0.03637   3.477 0.000506 ***
## suburban:3            0.10093    0.04064   2.483 0.013012 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Number of linear predictors:  3 
## 
## Names of linear predictors: 
## logit(P[Y>=2]), logit(P[Y>=3]), logit(P[Y>=4])
## 
## Residual deviance: 34553.18 on 39816 degrees of freedom
## 
## Log-likelihood: -17276.59 on 39816 degrees of freedom
## 
## Number of iterations: 8 
## 
## No Hauck-Donner effect found in any of the estimates
## 
## Exponentiated coefficients:
##    non_trad_family:1    non_trad_family:2    non_trad_family:3 
##           0.42726687           0.40804852           0.34936455 
## non_fluent_english:1 non_fluent_english:2 non_fluent_english:3 
##           0.08737573           0.11181976           0.15856128 
##           suburban:1           suburban:2           suburban:3 
##           1.21185445           1.13482995           1.10620207
AIC(fit.vgam)
## [1] 34609.43
AIC(fit.vgam2)
## [1] 34577.18

From this comparison we see that the AIC for the proportional odds model (34,609) is a bit higher than for the non-proportional odds model (34,577), suggesting that we use the non-proportional odds model.

Best Model and Interpretation

Here we see the odds ratio of the non-proportional odds model (the best model) along with the confidence intervals. What we see is that having a non traditional-family and being not fluent in English greatly reduces the chances of rising in socio-economic status: non-traditional families are about sixty five percent less likely to be in the highest SES category than traditional families. Likewise, households who have heads not fluent in English are almost 84% less likely to be in the highest SES category than households where English is spoken with fluency. Meanwhile, suburban households are about 10% more likely than non-suburban households to be in the highest SES category.

binded<-data.frame(cbind(round(exp(coef(fit.vgam2)),3),round(exp(confint(fit.vgam2)),3)))
names(binded)<-c("Odds_Ratio","2.5%","97.5%")
print(binded)
##                      Odds_Ratio  2.5% 97.5%
## (Intercept):1             5.585 5.161 6.043
## (Intercept):2             1.702 1.600 1.811
## (Intercept):3             0.550 0.515 0.587
## non_trad_family:1         0.427 0.392 0.466
## non_trad_family:2         0.408 0.380 0.439
## non_trad_family:3         0.349 0.320 0.381
## non_fluent_english:1      0.087 0.076 0.100
## non_fluent_english:2      0.112 0.094 0.133
## non_fluent_english:3      0.159 0.125 0.202
## suburban:1                1.212 1.113 1.319
## suburban:2                1.135 1.057 1.219
## suburban:3                1.106 1.022 1.198

Conclusion

There does appear to be a negative relationship between non-traditional families and lack of fluency in English and lower SES quartile status. There also appears to be a positive relationship between suburban living and higher SES quartile status.