Author: Elhakim Ibrahim

Instructor: Corey Sparks, PhD

February 10, 2020



Objective of the exercise

The focus of this exercise is to assess if point estimates deriving from analysis based on assumption of simple random sampling survey design (SRSSD, unweighted estimates) differ substantially from those based on complex sampling survey design (CSSD, design-weighted estimates). Data for the State of Texas extracted from nationally representative 2016 CDC Behavioral Risk Factor Surveillance System (BRFSS) SMART metro area survey data set (https://www.cdc.gov/brfss/smart/smart_2016.html) were employed. To facilitate comparison of pattern and size of differences in estimates between SRSSD and CSSD approaches, respondents are profiled by obesity status (health outcome of interest) and race/ethnicity (main background variable of interest). Other profiling characteristics are respondents’ age, sex, level of education and income.



Set working directory and load packages

setwd("C:/PhD/YR1/02SP20/DEM7283/HWK/01")

suppressPackageStartupMessages({
    library(car)
    library(dplyr)
    library(gtools)
    library(survey)
    library(forcats)
    library(tableone)
    library(questionr)
    library(htmlTable)
})


Load and process data set

load(url("https://github.com/coreysparks/data/blob/master/brfss_2017.Rdata?raw=true"))

mydata <- brfss_17
rawnames <- names(mydata)
# head(rawnames, n=55)

newnames <- tolower(gsub(pattern = "_", replacement = "", x = rawnames))
# above identifies and removes '_' from the variable names and changes all
# to lowercase

names(mydata) <- newnames
head(newnames, n = 18)
 [1] "dispcode" "statere1" "safetime" "hhadult"  "genhlth"  "physhlth"
 [7] "menthlth" "poorhlth" "hlthpln1" "persdoc2" "medcost"  "checkup1"
[13] "bphigh4"  "bpmeds"   "cholchk1" "toldhi2"  "cholmed1" "cvdinfr4"


Operationalize variables

Health outcome of interest:

Obesity status

options(max.print = 42)
print(mydata$bmi5)
 [1] 3132 2330 3557 3004 2746 3487 3109 2661 2746 1921 2645   NA 3470 2732
[15] 3119 3068   NA 2468 3658 2639 2303 3544   NA 2441 2037 3497 2338 2378
[29] 2103 2573 2880 2813 3325 2550 2179   NA 3543   NA 2613 2152 3695 3389
 [ reached getOption("max.print") -- omitted 230833 entries ]
attr(,"label")
[1] "COMPUTED BODY MASS INDEX"

In the data, the bmi variable was computed as with implied 2 decimal places as shown above. Hence, the variable is divided by 100 to get the values that represent actual bmi’s. The new bmi is then dummied as 0 = Not obese for bmi below 30 and 1 = Obese for bmi 30 and above

mydata$bmi <- mydata$bmi5/100
options(max.print = 33)
print(mydata$bmi)
 [1] 31.32 23.30 35.57 30.04 27.46 34.87 31.09 26.61 27.46 19.21 26.45
[12]    NA 34.70 27.32 31.19 30.68    NA 24.68 36.58 26.39 23.03 35.44
[23]    NA 24.41 20.37 34.97 23.38 23.78 21.03 25.73 28.80 28.13 33.25
 [ reached getOption("max.print") -- omitted 230842 entries ]
attr(,"label")
[1] "COMPUTED BODY MASS INDEX"
mydata$bmigrp <- ifelse(mydata$bmi >= 30, 1, 0)
mydata$bmigrp <- as.factor(ifelse(mydata$bmigrp == 1, "Obese", "Not obese"))  #parsed as factor variable with label

Background characteristics:

Age group

# continuous variable cut into categorical intervals:
mydata$agegrp <- cut(mydata$age80, breaks = c(0, 24, 39, 59, 79, 99))
levels(mydata$agegrp)
[1] "(0,24]"  "(24,39]" "(39,59]" "(59,79]" "(79,99]"

Sex

mydata$sex <- as.factor(ifelse(mydata$sex == 1, "Male", "Female"))
levels(mydata$sex)
[1] "Female" "Male"  

Race/ethnic identity

mydata$racegr3 <- as.factor(mydata$racegr3)  #converts variable race/ethicity to factor variable
levels(mydata$racegr3)  #displays levels of variable race/ethicity
[1] "1" "2" "3" "4" "5" "9"
mydata <- mydata %>% mutate(racegrp = factor(racegr3, levels = c(1, 2, 3, 4, 
    5), labels = c("White", "Black", "Other", "Other", "Hispanic")))
mydata$racegrp <- fct_relevel(mydata$racegrp, "White", "Black", "Hispanic")
levels(mydata$racegrp)
[1] "White"    "Black"    "Hispanic" "Other"   

Educational attainment

# mydata$edugrp<-Recode(mydata$educa, recodes='1:2='0Primary'; 3='1SomeHS';
# 4='2HSGrad'; 5='3SomeCol'; 6='4ColGrad';9=NA', as.factor=T)

mydata <- mydata %>% mutate(edugrp = factor(mydata$educa, levels = c(1, 2, 3, 
    4, 5, 6), labels = c("Primary", "Primary", "Some HS", "HS Grad", "Some Col", 
    "Col Grad")))
mydata$edugrp <- fct_relevel(mydata$edugrp, "Primary", "Some HS", "HS Grad", 
    "Some Col")
levels(mydata$edugrp)
[1] "Primary"  "Some HS"  "HS Grad"  "Some Col" "Col Grad"
# below arranges educational attainment in logical order

# mydata$edugrp<-relevel(mydata$edugrp, 'Primary','Some HS','HS Grad','Some
# Col','Col Grad')

# brfss_17$educ<-Recode(brfss_17$educa, recodes='1:2='0Prim'; 3='1somehs';
# 4='2hsgrad'; 5='3somecol'; 6='4colgrad';9=NA', as.factor=T)
# brfss_17$educ<-relevel(brfss_17$educ, ref='2hsgrad') levels(mydata$edugrp)

Income group

mydata$income <- quantcut(mydata$incomg, q = seq(0, 1, by = 0.25), na.rm = TRUE)

# above factorizes income to quantile income group

mydata <- mydata %>% mutate(incomegrp = factor(income, levels = c("[1,3]", "(3,5)", 
    "5", "(5,9]"), labels = c("Quant 1", "Quant 2", "Quant 3", "Quant 4")))
levels(mydata$incomegrp)
[1] "Quant 1" "Quant 2" "Quant 3" "Quant 4"


Define complex survey design object

I create a survey design object from complex survey parameters ids = PSU identifers, strata=strata identifiers, weights=case weights, data=data frame. In doing this, respondents with missing case weights are excluded from the analysis, while options(survey.lonely.psu = "adjust") was applied to facilitate calculation of within stratum variance for stratum with single PSU.

mydata$tx <- NA
mydata$tx[grep(pattern = "TX", mydata$mmsaname)] <- 1

# above extracts only cities in the State of Texas by looking for 'TX' in
# the MSA's name field in the mydata

mydata <- mydata %>% filter(tx == 1, is.na(mmsawt) == F)

options(survey.lonely.psu = "adjust")

des <- svydesign(ids = ~1, strata = ~ststr, weights = ~mmsawt, data = mydata)


Results

1. Sample distribution by background characteristics stratified by obesity status

SRSSD, unweighted distribution

# without survey design

tab1_unweighted <- CreateTableOne(data = mydata, vars = c("racegrp", "agegrp", 
    "sex", "edugrp", "incomegrp"), strata = "bmigrp")
options(max.print = 1000)
print(tab1_unweighted, varLabels = T, format = "p", showAllLevels = T, catDigits = 1, 
    pDigits = 3, contDigits = 1, quote = F, missing = F, explain = F, printToggle = T, 
    noSpaces = T, cramVars = NULL, dropEqual = T)
           Stratified by bmigrp
            level    Not obese Obese p      test
  n                  5425      2493             
  racegrp   White    65.8      58.2  <0.001     
            Black    7.9       11.5             
            Hispanic 20.5      26.6             
            Other    5.8       3.8              
  agegrp    (0,24]   7.1       2.7   <0.001     
            (24,39]  17.3      16.5             
            (39,59]  25.2      35.3             
            (59,79]  38.3      39.9             
            (79,99]  12.2      5.6              
  sex       Female   57.3      55.0  0.049      
            Male     42.7      45.0             
  edugrp    Primary  3.3       4.7   <0.001     
            Some HS  4.9       6.1              
            HS Grad  20.7      25.7             
            Some Col 25.2      28.3             
            Col Grad 45.9      35.2             
  incomegrp Quant 1  30.1      36.6  <0.001     
            Quant 2  11.2      11.8             
            Quant 3  45.3      41.7             
            Quant 4  13.3      9.8              

CSSD, design-weighted distribution

# with survey design

tab1_weighted <- svyCreateTableOne(data = des, vars = c("racegrp", "agegrp", 
    "sex", "edugrp", "incomegrp"), strata = "bmigrp")
options(max.print = 1000)
print(tab1_weighted, varLabels = T, format = "p", showAllLevels = T, catDigits = 1, 
    pDigits = 3, contDigits = 1, quote = F, missing = F, explain = F, printToggle = T, 
    noSpaces = T, cramVars = NULL, dropEqual = T)
           Stratified by bmigrp
            level    Not obese Obese     p      test
  n                  9488033.4 4305759.1            
  racegrp   White    47.1      42.9      0.002      
            Black    12.6      15.1                 
            Hispanic 31.4      37.6                 
            Other    8.8       4.3                  
  agegrp    (0,24]   17.3      5.1       <0.001     
            (24,39]  29.2      26.2                 
            (39,59]  29.6      44.8                 
            (59,79]  20.3      22.3                 
            (79,99]  3.5       1.7                  
  sex       Female   50.2      46.6      0.160      
            Male     49.8      53.4                 
  edugrp    Primary  7.0       9.0       0.013      
            Some HS  7.3       8.4                  
            HS Grad  22.6      28.1                 
            Some Col 32.1      31.1                 
            Col Grad 31.0      23.4                 
  incomegrp Quant 1  32.4      39.2      0.047      
            Quant 2  12.5      11.0                 
            Quant 3  43.5      39.7                 
            Quant 4  11.6      10.1                 

Results from the tables establish substantially significant difference in the point estimates from the two estimation approaches. First, there is huge variation in number of observations (n) resulting from the two approaches; this simply indicates that the CSSD attempts to extrapolates survey characteristics to the entire population taking into account all survey design parameters. In addition, the percent distribution of the observations by background characteristics vary significantly by method. This is evident, for instance, in the proportion of obese population that identified as White (SRSSD, 58.2 vs CSSD, 42.9; approx. 15 percent point difference), Black (SRSSD, 11.5 vs CSSD, 15.1; approx. 4 percent point difference), Hispanic (SRSSD, 26.6 vs CSSD, 37.6; approx. 11 percent point difference) and Other (SRSSD, 3.8 vs CSSD, 4.3; approx. 1 percent point difference).

By indication, the difference attributable to weighting is largest for White respondents with decreased share of obese population ,followed by the Black and Hispanic resondents with relatively equal proportional increase. However, the effect is very minimal and almost negligible for respondents of Other race/ethnic background.



2. Sample profile highlighting obesity prevalence by racial/ethnic identity

SRSSD, unweighted distribution

# without survey design

tab2_unweighted <- CreateTableOne(data = mydata, vars = c("bmigrp", "agegrp", 
    "sex", "edugrp", "incomegrp"), strata = "racegrp")
options(max.print = 1000)
print(tab2_unweighted, varLabels = T, format = "p", showAllLevels = T, catDigits = 1, 
    pDigits = 3, contDigits = 1, quote = F, missing = F, explain = F, printToggle = T, 
    noSpaces = T, cramVars = NULL, dropEqual = T)
           Stratified by racegrp
            level     White Black Hispanic Other p      test
  n                   5234  761   1989     441              
  bmigrp    Not obese 71.1  59.9  62.6     76.9  <0.001     
            Obese     28.9  40.1  37.4     23.1             
  agegrp    (0,24]    4.0   4.9   9.8      10.0  <0.001     
            (24,39]   12.7  20.5  27.0     30.6             
            (39,59]   25.6  31.9  34.6     29.3             
            (59,79]   45.0  33.9  24.6     25.9             
            (79,99]   12.7  8.8   3.9      4.3              
  sex       Female    57.9  64.8  58.8     49.7  <0.001     
            Male      42.1  35.2  41.2     50.3             
  edugrp    Primary   0.6   1.3   16.1     1.6   <0.001     
            Some HS   2.7   7.6   12.7     2.7              
            HS Grad   20.3  28.6  27.6     14.3             
            Some Col  27.1  28.4  22.4     22.5             
            Col Grad  49.3  34.1  21.2     58.9             
  incomegrp Quant 1   23.3  44.8  50.2     28.6  <0.001     
            Quant 2   11.5  11.7  10.6     7.3              
            Quant 3   51.1  30.2  25.1     49.0             
            Quant 4   14.1  13.3  14.1     15.2             

CSSD, design-weighted distribution

# with survey design

tab2_weighted <- svyCreateTableOne(data = des, vars = c("bmigrp", "agegrp", 
    "sex", "edugrp", "incomegrp"), strata = "racegrp")
options(max.print = 1000)
print(tab2_weighted, varLabels = T, format = "p", showAllLevels = T, catDigits = 1, 
    pDigits = 3, contDigits = 1, quote = F, missing = F, explain = F, printToggle = T, 
    noSpaces = T, cramVars = NULL, dropEqual = T)
           Stratified by racegrp
            level     White     Black     Hispanic  Other     p      test
  n                   6584630.9 2010422.2 5250640.7 1220788.3            
  bmigrp    Not obese 70.6      64.6      64.6      81.6      0.002      
            Obese     29.4      35.4      35.4      18.4                 
  agegrp    (0,24]    9.3       9.2       17.5      23.8      <0.001     
            (24,39]   23.0      28.7      35.1      37.6                 
            (39,59]   34.9      37.8      34.6      27.9                 
            (59,79]   28.6      22.2      11.3      8.9                  
            (79,99]   4.2       2.1       1.5       1.8                  
  sex       Female    51.0      54.7      50.0      53.2      0.676      
            Male      49.0      45.3      50.0      46.8                 
  edugrp    Primary   1.1       0.2       21.2      4.6       <0.001     
            Some HS   3.4       6.8       15.6      3.3                  
            HS Grad   21.9      28.7      29.1      17.1                 
            Some Col  34.4      40.3      23.5      27.5                 
            Col Grad  39.3      24.1      10.6      47.4                 
  incomegrp Quant 1   19.5      38.0      53.0      25.5      <0.001     
            Quant 2   10.1      13.6      12.7      7.3                  
            Quant 3   57.9      37.9      19.4      48.2                 
            Quant 4   12.5      10.5      14.9      18.9                 

Results in the preceding tables establish that the prevalence of obesity is significantly higher in the sample than the population among Blacks (SRSSD, 40.1 vs CSSD, 35.4; approx. 5 percent point difference), Hispanics (SRSSD, 37.4 vs CSSD, 35.4; approx. 2 percent point difference) and Others (SRSSD, 23.1 vs CSSD, 18.4; approx. 5 percent point difference), but relatively equal for Whites (SRSSD, 28.9 vs CSSD, 29.4).

Also worthy of mentioning is the variation in the proportion obese by sex of the respondents across race/ethnic strata. While point estimates from the SRSSD approach indicate very substantial difference in obese population by sex (p<0.001), the estimates from CSSD approach indicate otherwise (p=0.676).



Conclusion

The above results affirm that there exist substantial differences in the point estimates generated based on the assumption of simple random sampling survey design (SRSSD, unweighted estimates) and those based on complex sampling survey design (CSSD, design-weighted estimates) parameters. The next exercise will further examine the variance at higher level of analysis using linear and logistic regression techniques.

############################################################################################################################