Author: Elhakim Ibrahim
Instructor: Corey Sparks, PhD
February 10, 2020
Objective of the exercise
The focus of this exercise is to assess if point estimates deriving from analysis based on assumption of simple random sampling survey design (SRSSD, unweighted estimates) differ substantially from those based on complex sampling survey design (CSSD, design-weighted estimates). Data for the State of Texas extracted from nationally representative 2016 CDC Behavioral Risk Factor Surveillance System (BRFSS) SMART metro area survey data set (https://www.cdc.gov/brfss/smart/smart_2016.html) were employed. To facilitate comparison of pattern and size of differences in estimates between SRSSD and CSSD approaches, respondents are profiled by obesity status (health outcome of interest) and race/ethnicity (main background variable of interest). Other profiling characteristics are respondents’ age, sex, level of education and income.
Set working directory and load packages
setwd("C:/PhD/YR1/02SP20/DEM7283/HWK/01")
suppressPackageStartupMessages({
library(car)
library(dplyr)
library(gtools)
library(survey)
library(forcats)
library(tableone)
library(questionr)
library(htmlTable)
})
Load and process data set
load(url("https://github.com/coreysparks/data/blob/master/brfss_2017.Rdata?raw=true"))
mydata <- brfss_17
rawnames <- names(mydata)
# head(rawnames, n=55)
newnames <- tolower(gsub(pattern = "_", replacement = "", x = rawnames))
# above identifies and removes '_' from the variable names and changes all
# to lowercase
names(mydata) <- newnames
head(newnames, n = 18)
[1] "dispcode" "statere1" "safetime" "hhadult" "genhlth" "physhlth"
[7] "menthlth" "poorhlth" "hlthpln1" "persdoc2" "medcost" "checkup1"
[13] "bphigh4" "bpmeds" "cholchk1" "toldhi2" "cholmed1" "cvdinfr4"
Operationalize variables
Health outcome of interest:
Obesity status
options(max.print = 42)
print(mydata$bmi5)
[1] 3132 2330 3557 3004 2746 3487 3109 2661 2746 1921 2645 NA 3470 2732
[15] 3119 3068 NA 2468 3658 2639 2303 3544 NA 2441 2037 3497 2338 2378
[29] 2103 2573 2880 2813 3325 2550 2179 NA 3543 NA 2613 2152 3695 3389
[ reached getOption("max.print") -- omitted 230833 entries ]
attr(,"label")
[1] "COMPUTED BODY MASS INDEX"
In the data, the bmi variable was computed as with implied 2 decimal places as shown above. Hence, the variable is divided by 100 to get the values that represent actual bmi’s. The new bmi is then dummied as 0 = Not obese for bmi below 30 and 1 = Obese for bmi 30 and above
mydata$bmi <- mydata$bmi5/100
options(max.print = 33)
print(mydata$bmi)
[1] 31.32 23.30 35.57 30.04 27.46 34.87 31.09 26.61 27.46 19.21 26.45
[12] NA 34.70 27.32 31.19 30.68 NA 24.68 36.58 26.39 23.03 35.44
[23] NA 24.41 20.37 34.97 23.38 23.78 21.03 25.73 28.80 28.13 33.25
[ reached getOption("max.print") -- omitted 230842 entries ]
attr(,"label")
[1] "COMPUTED BODY MASS INDEX"
mydata$bmigrp <- ifelse(mydata$bmi >= 30, 1, 0)
mydata$bmigrp <- as.factor(ifelse(mydata$bmigrp == 1, "Obese", "Not obese")) #parsed as factor variable with label
Background characteristics:
Age group
# continuous variable cut into categorical intervals:
mydata$agegrp <- cut(mydata$age80, breaks = c(0, 24, 39, 59, 79, 99))
levels(mydata$agegrp)
[1] "(0,24]" "(24,39]" "(39,59]" "(59,79]" "(79,99]"
Sex
mydata$sex <- as.factor(ifelse(mydata$sex == 1, "Male", "Female"))
levels(mydata$sex)
[1] "Female" "Male"
Race/ethnic identity
mydata$racegr3 <- as.factor(mydata$racegr3) #converts variable race/ethicity to factor variable
levels(mydata$racegr3) #displays levels of variable race/ethicity
[1] "1" "2" "3" "4" "5" "9"
mydata <- mydata %>% mutate(racegrp = factor(racegr3, levels = c(1, 2, 3, 4,
5), labels = c("White", "Black", "Other", "Other", "Hispanic")))
mydata$racegrp <- fct_relevel(mydata$racegrp, "White", "Black", "Hispanic")
levels(mydata$racegrp)
[1] "White" "Black" "Hispanic" "Other"
Educational attainment
# mydata$edugrp<-Recode(mydata$educa, recodes='1:2='0Primary'; 3='1SomeHS';
# 4='2HSGrad'; 5='3SomeCol'; 6='4ColGrad';9=NA', as.factor=T)
mydata <- mydata %>% mutate(edugrp = factor(mydata$educa, levels = c(1, 2, 3,
4, 5, 6), labels = c("Primary", "Primary", "Some HS", "HS Grad", "Some Col",
"Col Grad")))
mydata$edugrp <- fct_relevel(mydata$edugrp, "Primary", "Some HS", "HS Grad",
"Some Col")
levels(mydata$edugrp)
[1] "Primary" "Some HS" "HS Grad" "Some Col" "Col Grad"
# below arranges educational attainment in logical order
# mydata$edugrp<-relevel(mydata$edugrp, 'Primary','Some HS','HS Grad','Some
# Col','Col Grad')
# brfss_17$educ<-Recode(brfss_17$educa, recodes='1:2='0Prim'; 3='1somehs';
# 4='2hsgrad'; 5='3somecol'; 6='4colgrad';9=NA', as.factor=T)
# brfss_17$educ<-relevel(brfss_17$educ, ref='2hsgrad') levels(mydata$edugrp)
Income group
mydata$income <- quantcut(mydata$incomg, q = seq(0, 1, by = 0.25), na.rm = TRUE)
# above factorizes income to quantile income group
mydata <- mydata %>% mutate(incomegrp = factor(income, levels = c("[1,3]", "(3,5)",
"5", "(5,9]"), labels = c("Quant 1", "Quant 2", "Quant 3", "Quant 4")))
levels(mydata$incomegrp)
[1] "Quant 1" "Quant 2" "Quant 3" "Quant 4"
Define complex survey design object
I create a survey design object from complex survey parameters ids = PSU identifers, strata=strata identifiers, weights=case weights, data=data frame. In doing this, respondents with missing case weights are excluded from the analysis, while options(survey.lonely.psu = "adjust") was applied to facilitate calculation of within stratum variance for stratum with single PSU.
mydata$tx <- NA
mydata$tx[grep(pattern = "TX", mydata$mmsaname)] <- 1
# above extracts only cities in the State of Texas by looking for 'TX' in
# the MSA's name field in the mydata
mydata <- mydata %>% filter(tx == 1, is.na(mmsawt) == F)
options(survey.lonely.psu = "adjust")
des <- svydesign(ids = ~1, strata = ~ststr, weights = ~mmsawt, data = mydata)
Results
1. Sample distribution by background characteristics stratified by obesity status
SRSSD, unweighted distribution
# without survey design
tab1_unweighted <- CreateTableOne(data = mydata, vars = c("racegrp", "agegrp",
"sex", "edugrp", "incomegrp"), strata = "bmigrp")
options(max.print = 1000)
print(tab1_unweighted, varLabels = T, format = "p", showAllLevels = T, catDigits = 1,
pDigits = 3, contDigits = 1, quote = F, missing = F, explain = F, printToggle = T,
noSpaces = T, cramVars = NULL, dropEqual = T)
Stratified by bmigrp
level Not obese Obese p test
n 5425 2493
racegrp White 65.8 58.2 <0.001
Black 7.9 11.5
Hispanic 20.5 26.6
Other 5.8 3.8
agegrp (0,24] 7.1 2.7 <0.001
(24,39] 17.3 16.5
(39,59] 25.2 35.3
(59,79] 38.3 39.9
(79,99] 12.2 5.6
sex Female 57.3 55.0 0.049
Male 42.7 45.0
edugrp Primary 3.3 4.7 <0.001
Some HS 4.9 6.1
HS Grad 20.7 25.7
Some Col 25.2 28.3
Col Grad 45.9 35.2
incomegrp Quant 1 30.1 36.6 <0.001
Quant 2 11.2 11.8
Quant 3 45.3 41.7
Quant 4 13.3 9.8
CSSD, design-weighted distribution
# with survey design
tab1_weighted <- svyCreateTableOne(data = des, vars = c("racegrp", "agegrp",
"sex", "edugrp", "incomegrp"), strata = "bmigrp")
options(max.print = 1000)
print(tab1_weighted, varLabels = T, format = "p", showAllLevels = T, catDigits = 1,
pDigits = 3, contDigits = 1, quote = F, missing = F, explain = F, printToggle = T,
noSpaces = T, cramVars = NULL, dropEqual = T)
Stratified by bmigrp
level Not obese Obese p test
n 9488033.4 4305759.1
racegrp White 47.1 42.9 0.002
Black 12.6 15.1
Hispanic 31.4 37.6
Other 8.8 4.3
agegrp (0,24] 17.3 5.1 <0.001
(24,39] 29.2 26.2
(39,59] 29.6 44.8
(59,79] 20.3 22.3
(79,99] 3.5 1.7
sex Female 50.2 46.6 0.160
Male 49.8 53.4
edugrp Primary 7.0 9.0 0.013
Some HS 7.3 8.4
HS Grad 22.6 28.1
Some Col 32.1 31.1
Col Grad 31.0 23.4
incomegrp Quant 1 32.4 39.2 0.047
Quant 2 12.5 11.0
Quant 3 43.5 39.7
Quant 4 11.6 10.1
Results from the tables establish substantially significant difference in the point estimates from the two estimation approaches. First, there is huge variation in number of observations (n) resulting from the two approaches; this simply indicates that the CSSD attempts to extrapolates survey characteristics to the entire population taking into account all survey design parameters. In addition, the percent distribution of the observations by background characteristics vary significantly by method. This is evident, for instance, in the proportion of obese population that identified as White (SRSSD, 58.2 vs CSSD, 42.9; approx. 15 percent point difference), Black (SRSSD, 11.5 vs CSSD, 15.1; approx. 4 percent point difference), Hispanic (SRSSD, 26.6 vs CSSD, 37.6; approx. 11 percent point difference) and Other (SRSSD, 3.8 vs CSSD, 4.3; approx. 1 percent point difference).
By indication, the difference attributable to weighting is largest for White respondents with decreased share of obese population ,followed by the Black and Hispanic resondents with relatively equal proportional increase. However, the effect is very minimal and almost negligible for respondents of Other race/ethnic background.
2. Sample profile highlighting obesity prevalence by racial/ethnic identity
SRSSD, unweighted distribution
# without survey design
tab2_unweighted <- CreateTableOne(data = mydata, vars = c("bmigrp", "agegrp",
"sex", "edugrp", "incomegrp"), strata = "racegrp")
options(max.print = 1000)
print(tab2_unweighted, varLabels = T, format = "p", showAllLevels = T, catDigits = 1,
pDigits = 3, contDigits = 1, quote = F, missing = F, explain = F, printToggle = T,
noSpaces = T, cramVars = NULL, dropEqual = T)
Stratified by racegrp
level White Black Hispanic Other p test
n 5234 761 1989 441
bmigrp Not obese 71.1 59.9 62.6 76.9 <0.001
Obese 28.9 40.1 37.4 23.1
agegrp (0,24] 4.0 4.9 9.8 10.0 <0.001
(24,39] 12.7 20.5 27.0 30.6
(39,59] 25.6 31.9 34.6 29.3
(59,79] 45.0 33.9 24.6 25.9
(79,99] 12.7 8.8 3.9 4.3
sex Female 57.9 64.8 58.8 49.7 <0.001
Male 42.1 35.2 41.2 50.3
edugrp Primary 0.6 1.3 16.1 1.6 <0.001
Some HS 2.7 7.6 12.7 2.7
HS Grad 20.3 28.6 27.6 14.3
Some Col 27.1 28.4 22.4 22.5
Col Grad 49.3 34.1 21.2 58.9
incomegrp Quant 1 23.3 44.8 50.2 28.6 <0.001
Quant 2 11.5 11.7 10.6 7.3
Quant 3 51.1 30.2 25.1 49.0
Quant 4 14.1 13.3 14.1 15.2
CSSD, design-weighted distribution
# with survey design
tab2_weighted <- svyCreateTableOne(data = des, vars = c("bmigrp", "agegrp",
"sex", "edugrp", "incomegrp"), strata = "racegrp")
options(max.print = 1000)
print(tab2_weighted, varLabels = T, format = "p", showAllLevels = T, catDigits = 1,
pDigits = 3, contDigits = 1, quote = F, missing = F, explain = F, printToggle = T,
noSpaces = T, cramVars = NULL, dropEqual = T)
Stratified by racegrp
level White Black Hispanic Other p test
n 6584630.9 2010422.2 5250640.7 1220788.3
bmigrp Not obese 70.6 64.6 64.6 81.6 0.002
Obese 29.4 35.4 35.4 18.4
agegrp (0,24] 9.3 9.2 17.5 23.8 <0.001
(24,39] 23.0 28.7 35.1 37.6
(39,59] 34.9 37.8 34.6 27.9
(59,79] 28.6 22.2 11.3 8.9
(79,99] 4.2 2.1 1.5 1.8
sex Female 51.0 54.7 50.0 53.2 0.676
Male 49.0 45.3 50.0 46.8
edugrp Primary 1.1 0.2 21.2 4.6 <0.001
Some HS 3.4 6.8 15.6 3.3
HS Grad 21.9 28.7 29.1 17.1
Some Col 34.4 40.3 23.5 27.5
Col Grad 39.3 24.1 10.6 47.4
incomegrp Quant 1 19.5 38.0 53.0 25.5 <0.001
Quant 2 10.1 13.6 12.7 7.3
Quant 3 57.9 37.9 19.4 48.2
Quant 4 12.5 10.5 14.9 18.9
Results in the preceding tables establish that the prevalence of obesity is significantly higher in the sample than the population among Blacks (SRSSD, 40.1 vs CSSD, 35.4; approx. 5 percent point difference), Hispanics (SRSSD, 37.4 vs CSSD, 35.4; approx. 2 percent point difference) and Others (SRSSD, 23.1 vs CSSD, 18.4; approx. 5 percent point difference), but relatively equal for Whites (SRSSD, 28.9 vs CSSD, 29.4).
Also worthy of mentioning is the variation in the proportion obese by sex of the respondents across race/ethnic strata. While point estimates from the SRSSD approach indicate very substantial difference in obese population by sex (p<0.001), the estimates from CSSD approach indicate otherwise (p=0.676).
Conclusion
The above results affirm that there exist substantial differences in the point estimates generated based on the assumption of simple random sampling survey design (SRSSD, unweighted estimates) and those based on complex sampling survey design (CSSD, design-weighted estimates) parameters. The next exercise will further examine the variance at higher level of analysis using linear and logistic regression techniques.
############################################################################################################################