DEM7283: Statistics for Demographic Data II

Objective of the exercise
Set working directory and load packages
Load and process data set
Operationalize working variables
- Health outcome of interest
- Hypothesized covariates
Define complex survey design object
Results
Acknowledgements

Author: Elhakim Ibrahim

Instructor: Corey Sparks, PhD¹

February 24, 2020

Objective of the exercise

This exercise seeks to evaluate if risk of obesity in the United States differ substantially by socioeconomic characteristics (age, race/ethinicity and income). Data for the State of Texas extracted from nationally representative 2016 Behavioral Risk Factor Surveillance System (BRFSS) SMART metro area survey data set were employed². Comparison by the hypothesized characteristics will be based on odds ratios from logistic regression models. In addition, predicted values will be fitted to facilitate more intuitive understanding of the logistic regression model estimates.

Set working directory and load packages

setwd("C:/A/01/02S20/D283/H/03")

suppressPackageStartupMessages({
    library(car)
    library(dplyr)
    library(gtools)
    library(survey)
    library(sjPlot)
    library(sjmisc)
    library(forcats)
    library(stargazer)
    library(questionr)
    library(htmlTable)
    library(kableExtra)
    library(sjlabelled)
    options(scipen = 5)
})

Load and process data set

load(url("https://github.com/coreysparks/data/blob/master/brfss_2017.Rdata?raw=true"))

mydata <- brfss_17
rawnames <- names(mydata)
newnames <- tolower(gsub(pattern = "_", replacement = "", x = rawnames))
# above identifies and removes '_' from the variable names and changes all
# to lowercase
names(mydata) <- newnames
head(newnames, n = 48)

 [1] "dispcode" "statere1" "safetime" "hhadult"  "genhlth"  "physhlth"
 [7] "menthlth" "poorhlth" "hlthpln1" "persdoc2" "medcost"  "checkup1"
[13] "bphigh4"  "bpmeds"   "cholchk1" "toldhi2"  "cholmed1" "cvdinfr4"
[19] "cvdcrhd4" "cvdstrk3" "asthma3"  "asthnow"  "chcscncr" "chcocncr"
[25] "chccopd1" "havarth3" "addepev2" "chckidny" "diabete3" "diabage2"
[31] "lmtjoin3" "arthdis2" "arthsocl" "joinpai1" "sex"      "marital" 
[37] "educa"    "renthom1" "numhhol2" "numphon2" "cpdemo1a" "veteran3"
[43] "employ1"  "children" "income2"  "internet" "weight2"  "height3"

Operationalize working variables

Health outcome of interest

Obesity status

In the data, the bmi variable was computed as with implied 2 decimal places as shown above. Hence, the variable is divided by 100 to get the values that represent actual bmi’s. The new bmi is then dummied as 0 = Not obese for bmi below 30 and 1 = Obese for bmi 30 and above

mydata$bmi <- mydata$bmi5/100
# print(mydata$bmi)
mydata$bmigrp <- ifelse(mydata$bmi >= 30, 1, 0)
mydata$bmigrp <- as.factor(ifelse(mydata$bmigrp == 1, "Obese", "Not obese"))
# continuous bmi factorized

Hypothesized covariates

Age group

mydata$agegrp <- cut(mydata$age80, breaks = c(0, 24, 39, 59, 79, 99))
# continuous age variable categorized
levels(mydata$agegrp)

[1] "(0,24]"  "(24,39]" "(39,59]" "(59,79]" "(79,99]"

Race/ethnic identity

mydata$racegr3 <- as.factor(mydata$racegr3)
# race/ethicity factorized
levels(mydata$racegr3)

[1] "1" "2" "3" "4" "5" "9"

mydata <- mydata %>% mutate(racegrp = factor(racegr3, levels = c(1, 2, 3, 4, 
    5), labels = c("Non-Hispanic White", "Non-Hispanic Black", "Other", "Other", 
    "Hispanic")))
mydata$racegrp <- fct_relevel(mydata$racegrp, "Non-Hispanic White", "Non-Hispanic Black", 
    "Hispanic")
levels(mydata$racegrp)

[1] "Non-Hispanic White" "Non-Hispanic Black" "Hispanic"          
[4] "Other"

Income group

mydata$income <- quantcut(mydata$incomg, q = seq(0, 1, by = 0.25), na.rm = TRUE)
# income factorized into quantile income group
mydata <- mydata %>% mutate(incomegrp = factor(income, levels = c("[1,3]", "(3,5)", 
    "5", "(5,9]"), labels = c("Quantile 1", "Quantile 2", "Quantile 3", "Quantile 4")))
levels(mydata$incomegrp)

[1] "Quantile 1" "Quantile 2" "Quantile 3" "Quantile 4"

Define complex survey design object

I create a survey design object from complex survey parameters ids = PSU identifers, strata=strata identifiers, weights=case weights, data=data frame. In doing this, respondents with missing case weights are excluded from the analysis, while options(survey.lonely.psu = "adjust") was applied to facilitate calculation of within stratum variance for stratum with single PSU.

mydata$tx <- NA
mydata$tx[grep(pattern = "TX", mydata$mmsaname)] <- 1
# above extracts only cities in the State of Texas by looking for 'TX' in
# the MSA's name field in the mydata

mydata <- mydata %>% filter(tx == 1, is.na(mmsawt) == F)
options(survey.lonely.psu = "adjust")
design.object <- svydesign(ids = ~1, strata = ~ststr, weights = ~mmsawt, data = mydata)

Results

Odds ratios

logit.model <- svyglm(bmigrp ~ agegrp + racegrp + incomegrp, design = design.object, 
    family = binomial)
summary(logit.model)


Call:
svyglm(formula = bmigrp ~ agegrp + racegrp + incomegrp, design = design.object, 
    family = binomial)

Survey design:
svydesign(ids = ~1, strata = ~ststr, weights = ~mmsawt, data = mydata)

Coefficients:
                          Estimate Std. Error t value Pr(>|t|)    
(Intercept)                -1.9675     0.2561  -7.682 1.77e-14 ***
agegrp(24,39]               1.1495     0.2363   4.866 1.16e-06 ***
agegrp(39,59]               1.6848     0.2342   7.194 6.90e-13 ***
agegrp(59,79]               1.3892     0.2438   5.699 1.25e-08 ***
agegrp(79,99]               0.4582     0.3891   1.178   0.2389    
racegrpNon-Hispanic Black   0.2261     0.1792   1.262   0.2070    
racegrpHispanic             0.2699     0.1310   2.060   0.0394 *  
racegrpOther               -0.4718     0.2594  -1.819   0.0690 .  
incomegrpQuantile 2        -0.3750     0.1812  -2.070   0.0385 *  
incomegrpQuantile 3        -0.2890     0.1326  -2.179   0.0293 *  
incomegrpQuantile 4        -0.1285     0.1885  -0.682   0.4955    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 0.9723225)

Number of Fisher Scoring iterations: 4

In the above table, the summary function is applied to generate parameter coefficients with accompanying s.e., t-statistic, and p-value for each. But, we are equally interested in relative size of difference in gradients across categories compared with the reference category, not only in the direction of difference in the gradients. Hence, same results are represented below in odds ratios and accompanying confidence intervals using stargazer³ package.

# first set the rule for coefficients exponentiation
myexp <- function(x) {
    exp(x)
}

stargazer(logit.model, ci = T, ci.level = 0.95, ci.separator = ",", digits = 2, 
    font.size = "normalsize", dep.var.caption = c("Outcome: Obesity likelihood"), 
    dep.var.labels.include = F, intercept.bottom = F, align = F, type = "text", 
    single.row = T, no.space = T, column.sep.width = "4pt", p.auto = F, t.auto = F, 
    covariate.labels = c("Intercept", "25-39", "40-59", "60-79", "80+", "NH Black", 
        "Hispanic", "Other", "Quantile 2", "Quantile 3", "Quantile 4"), column.labels = c("Odds ratios (25% CI,95% CI)"), 
    star.cutoffs = c(0.05, 0.01, 0.001), object.names = F, model.numbers = F, 
    apply.coef = myexp, report = "vcs*", title = "Odds ratios of Obesity by selected characteristics, BRFSS 2016", 
    notes = "Variables selected for exercise purpose only", notes.align = "l", 
    notes.label = "Notes:", keep.stat = c("n", "rsq", "ll", "aic", "bic", "adj.rsq"), 
    multicolumn = T, out = "tab002.htm")


Odds ratios of Obesity by selected characteristics, BRFSS 2016
==============================================================
                          Outcome: Obesity likelihood         
                  --------------------------------------------
                          Odds ratios (25% CI,95% CI)         
--------------------------------------------------------------
Intercept                     0.14 (-0.36,0.64)***            
25-39                         3.16 (2.69,3.62)***             
40-59                         5.39 (4.93,5.85)***             
60-79                         4.01 (3.53,4.49)***             
80+                             1.58 (0.82,2.34)              
NH Black                        1.25 (0.90,1.60)              
Hispanic                       1.31 (1.05,1.57)*              
Other                           0.62 (0.12,1.13)              
Quantile 2                     0.69 (0.33,1.04)*              
Quantile 3                     0.75 (0.49,1.01)*              
Quantile 4                      0.88 (0.51,1.25)              
--------------------------------------------------------------
Observations                         7,750                    
Log Likelihood                     -4,265.40                  
Akaike Inf. Crit.                   8,552.80                  
==============================================================
Notes:            *p<0.05; **p<0.01; ***p<0.001               
                  Variables selected for exercise purpose only

Odds ratios interpretation

The odds ratios indicating variations in likelihood of obesity among Texas residents are presented in the above table. Considering differentials by age groups, the model suggests a curvilinear relation between age and risk of obesity with the highest odds found among population aged 39-59 years (OR: 5.39; CI: 3.41-8.53), followed by those aged 59-79 years (OR: 4.01; CI: 2.49-6.47) compared with those aged less than 24 years. Also, income appears to be a siginficant factor: compared with those in lower 25% of income ladder, obesity propensity is signifcantly lower by approximately 31% and 25% for those in upper 50% and 75% of income ladder. Meanwhile, the likelihood of being obese is 1.31 times higher (p<0.05) for an average Hispanic relative to a Non-Hispanic White. Similar risk gradient is exhibited by an average Non-Hispanic Black, but the probability is falls outside significant thresholds.

Predicted values

mydata2 <- expand.grid(agegrp = levels(mydata$agegrp), racegrp = levels(mydata$racegrp), 
    incomegrp = levels(mydata$incomegrp))
fitted <- predict(logit.model, newdata = mydata2, type = "response")
mydata2$fitted <- round(fitted, 3) * 100

mydata2[which(mydata2$racegrp == "Non-Hispanic White" & mydata2$agegrp == "(24,39]" & 
    mydata2$incomegrp == "Quantile 1"), ]

mydata2[which(mydata2$racegrp == "Hispanic" & mydata2$agegrp == "(24,39]" & 
    mydata2$incomegrp == "Quantile 1"), ]

mydata2[which(mydata2$racegrp == "Non-Hispanic White" & mydata2$agegrp == "(24,39]" & 
    mydata2$incomegrp == "Quantile 4"), ]

mydata2[which(mydata2$racegrp == "Hispanic" & mydata2$agegrp == "(24,39]" & 
    mydata2$incomegrp == "Quantile 4"), ]

Predicted values interpretation

Attemp is made to understand inherent variations in likelihood of obesity within the diverse groups of Texas population. Obesity risk disparities are examined between a hypothetical Non-Hispanic White and Hispanic person by specific age category and income status. As shown in the tables, a low-income (income quantile 1, lower 25%), middle-aged (24-39 years) Non-Hispanic White is averagely 6% point less likely to be obese than a Hispanic of similar income and age characteristics (i.e. estimated probability of being obese: Non-Hispanic White, 30.6% vs. Hispanic, 36.6%). Further, almost equal difference is observed between the two hypothetical individuals (in favor of the Non-Hispanic White) in the same age group but of highest income status (estimated probability of being obese: Non-Hispanic White, 20.0% vs. Hispanic, 33.7%).

Acknowledgements

This content adapts steps and examples from instructional materials authored by Dr. Corey Sparks and made available at https://rpubs.com/corey_sparks.
The 2016 BRFSS SMART metro area survey data are open source resources made available by the US Centers for Disease Control and Prevention at https://www.cdc.gov/brfss/smart/smart_2016.html.
Stargazer is authored by Marek Hlavac and full document can be downloaded from https://cran.r-project.org/web/packages/stargazer/stargazer.pdf

############################################################################################################