Processing math: 100%

This example will cover the use of R functions for analyzing complex survey data. Most social and health surveys are not simple random samples of the population, but instead consist of respondents from a complex survey design. These designs often stratify the population based on one or more characteristics, including geography, race, age, etc. In addition the designs can be multi-stage, meaning that initial strata are created, then respondents are sampled from smaller units within those strata. An example would be if a school district was chosen as a sample strata, and then schools were then chosen as the primary sampling units (PSUs) within the district. From this 2 stage design, we could further sample classrooms within the school (3 stage design) or simply sample students (or whatever our unit of interest is).

A second feature of survey data we often want to account for is differential respondent weighting. This means that each respondent is given a weight to represent how common that particular respondent is within the population. This reflects the differenital probability of sampling based on respondent characteristics. As demographers, we are also often interested in making inference for the population, not just the sample, so our results must be generalizable to the population at large. Sample weights are used in the process as well.

When such data are analyzed, we must take into account this nesting structure (sample design) as well as the respondent sample weight in order to make valid estimates of ANY statistical parameter. If we do not account for design, the parameter standard errors will be incorrect, and if we do not account for weighting, the parameters themselves will be incorrect and biased.

In general there are typically three things we need to find in our survey data codebooks: The sample strata identifier, the sample primary sampling unit identifier (often called a cluster identifier) and the respondent survey weight. These will typically have one of these names and should be easily identifiable in the codebook.

Statistical software will have special routines for analyzing these types of data and you must be aware that the diversity of statistical routines that generally exists will be lower for analyzing complex survey data, and some forms of analysis may not be available!

Below I illustrate the use of survey characteristics when conducting descriptive analysis of a survey data set and a linear regression model estimated from that data. For this example I am using 2016 CDC Behavioral Risk Factor Surveillance System (BRFSS) SMART metro area survey data. Link

#load brfss
library(car)
library(stargazer)
## 
## Please cite as:
##  Hlavac, Marek (2015). stargazer: Well-Formatted Regression and Summary Statistics Tables.
##  R package version 5.2. http://CRAN.R-project.org/package=stargazer
library(survey)
## Loading required package: grid
## Loading required package: Matrix
## Loading required package: survival
## 
## Attaching package: 'survey'
## The following object is masked from 'package:graphics':
## 
##     dotchart
library(questionr)
load("~/Google Drive/classes/dem7283/class18/data/brfss16_mmsa.Rdata")

#The names in the data are very ugly, so I make them less ugly
nams<-names(brfss16m)
head(nams, n=10)
##  [1] "DISPCODE" "HHADULT"  "GENHLTH"  "PHYSHLTH" "MENTHLTH" "POORHLTH"
##  [7] "HLTHPLN1" "PERSDOC2" "MEDCOST"  "CHECKUP1"
#we see some names are lower case, some are upper and some have a little _ in the first position. This is a nightmare.
newnames<-tolower(gsub(pattern = "_",replacement =  "",x =  nams))
names(brfss16m)<-newnames
#Poor or fair self rated health
#brfss16m$badhealth<-ifelse(brfss16m$genhlth %in% c(4,5),1,0)
brfss16m$badhealth<-recode(brfss16m$genhlth, recodes="4:5=1; 1:3=0; else=NA")
#race/ethnicity
brfss16m$black<-recode(brfss16m$racegr3, recodes="2=1; 9=NA; else=0")
brfss16m$white<-recode(brfss16m$racegr3, recodes="1=1; 9=NA; else=0")
brfss16m$other<-recode(brfss16m$racegr3, recodes="3:4=1; 9=NA; else=0")
brfss16m$hispanic<-recode(brfss16m$racegr3, recodes="5=1; 9=NA; else=0")

#insurance
brfss16m$ins<-recode(brfss16m$hlthpln1, recodes ="7:9=NA; 1=1;2=0")

#income grouping
brfss16m$inc<-ifelse(brfss16m$incomg==9, NA, brfss16m$incomg)

#education level
brfss16m$educ<-recode(brfss16m$educa, recodes="1:2='0Prim'; 3='1somehs'; 4='2hsgrad'; 5='3somecol'; 6='4colgrad';9=NA", as.factor.result=T)
brfss16m$educ<-relevel(brfss16m$educ, ref='2hsgrad')

#employment
brfss16m$employ<-recode(brfss16m$employ1, recodes="1:2='Employed'; 2:6='nilf'; 7='retired'; 8='unable'; else=NA", as.factor.result=T)
brfss16m$employ<-relevel(brfss16m$employ, ref='Employed')

#marital status
brfss16m$marst<-recode(brfss16m$marital, recodes="1='married'; 2='divorced'; 3='widowed'; 4='separated'; 5='nm';6='cohab'; else=NA", as.factor.result=T)
brfss16m$marst<-relevel(brfss16m$marst, ref='married')

#Age cut into intervals
brfss16m$agec<-cut(brfss16m$age80, breaks=c(0,24,39,59,79,99))

#BMI, in the brfss16ma the bmi variable has 2 implied decimal places, so we must divide by 100 to get real bmi's

brfss16m$bmi<-brfss16m$bmi5/100

Analysis

First, we will do some descriptive analysis, such as means and cross tabulations.

#First we will do some tables
#Raw frequencies
table(brfss16m$badhealth, brfss16m$educ)
##    
##     2hsgrad 0Prim 1somehs 3somecol 4colgrad
##   0   47484  3008    6646    54755    92394
##   1   14612  2880    4434    11928     9222
#column percentages
prop.table(table(brfss16m$badhealth, brfss16m$educ), margin=2)
##    
##        2hsgrad      0Prim    1somehs   3somecol   4colgrad
##   0 0.76468694 0.51086957 0.59981949 0.82112382 0.90924658
##   1 0.23531306 0.48913043 0.40018051 0.17887618 0.09075342
#basic chi square test of independence
chisq.test(table(brfss16m$badhealth, brfss16m$educ))
## 
##  Pearson's Chi-squared test
## 
## data:  table(brfss16m$badhealth, brfss16m$educ)
## X-squared = 14538, df = 4, p-value < 2.2e-16

So basically all of these numbers are incorrect, since they all assume random sampling. Now, we must tell R what the survey design is and what the weight variable is, then we can re-do these so they are correct.

Create a survey design object

Now we identify the survey design. ids = PSU identifers, strata=strata identifiers, weights=case weights, data= the data frame where these variables are located. Lastly, I only include respondents with NON-MISSING case weights.

options(survey.lonely.psu = "adjust")
des<-svydesign(ids=~1, strata=~ststr, weights=~mmsawt, data = brfss16m[is.na(brfss16m$mmsawt)==F,] )

simple weighted analysis

Now , we re-do the analysis from above using only weights:

cat<-wtd.table(brfss16m$badhealth, brfss16m$educ, weights = brfss16m$mmsawt)
prop.table(wtd.table(brfss16m$badhealth, brfss16m$educ, weights = brfss16m$mmsawt), margin=2)
##      2hsgrad      0Prim    1somehs   3somecol   4colgrad
## 0 0.80516192 0.52749701 0.65210907 0.85249850 0.92874589
## 1 0.19483808 0.47250299 0.34789093 0.14750150 0.07125411
#compare that with the original
prop.table(table(brfss16m$badhealth, brfss16m$educ), margin=2)
##    
##        2hsgrad      0Prim    1somehs   3somecol   4colgrad
##   0 0.76468694 0.51086957 0.59981949 0.82112382 0.90924658
##   1 0.23531306 0.48913043 0.40018051 0.17887618 0.09075342

There are differences, notably that the prevalence of poor SRH is higher in the sample than the population. This is important!

Let’s say we also want the standard errors of these percentages. This can be found for a proportion by: s.e.(p)=p(1p)n

So we need to get n and p, that’s easy:

n<-table(is.na(brfss16m$badhealth)==F)
n
## 
##  FALSE   TRUE 
##    655 248356
p<-prop.table(wtd.table(brfss16m$badhealth, brfss16m$educ, weights = brfss16m$mmsawt), margin=2)
se<-sqrt((p*(1-p))/n[2])

data.frame(proportion=p, se=se)
##    proportion.Var1 proportion.Var2 proportion.Freq se.Var1  se.Var2
## 1                0         2hsgrad      0.80516192       0  2hsgrad
## 2                1         2hsgrad      0.19483808       1  2hsgrad
## 3                0           0Prim      0.52749701       0    0Prim
## 4                1           0Prim      0.47250299       1    0Prim
## 5                0         1somehs      0.65210907       0  1somehs
## 6                1         1somehs      0.34789093       1  1somehs
## 7                0        3somecol      0.85249850       0 3somecol
## 8                1        3somecol      0.14750150       1 3somecol
## 9                0        4colgrad      0.92874589       0 4colgrad
## 10               1        4colgrad      0.07125411       1 4colgrad
##         se.Freq
## 1  0.0007947695
## 2  0.0007947695
## 3  0.0010017860
## 4  0.0010017860
## 5  0.0009557501
## 6  0.0009557501
## 7  0.0007115537
## 8  0.0007115537
## 9  0.0005161977
## 10 0.0005161977

Which shows us the errors in the estimates based on the weighted proportions. That’s nice, but since we basically inflated the n to be the population of the US, these se’s are too small. This is another example of using survey statistical methods, to get the right se for a statistic.

Proper survey design analysis

#Now consider the full sample design + weights
cat<-svytable(~badhealth+educ, design = des)
prop.table(svytable(~badhealth+educ, design = des), margin = 2)
##          educ
## badhealth    2hsgrad      0Prim    1somehs   3somecol   4colgrad
##         0 0.80516192 0.52749701 0.65210907 0.85249850 0.92874589
##         1 0.19483808 0.47250299 0.34789093 0.14750150 0.07125411

Which gives the same %’s as the weighted table above, but we also want the correct standard errors for our bad health prevalences.

The svyby() function will calculate statistics by groups, in this case we want the % in bad health by each level of education. The %’s can be gotten using the svymean() function, which finds means of variables using survey design:

sv.table<-svyby(formula = ~badhealth, by = ~educ, design = des, FUN = svymean, na.rm=T)
sv.table
##              educ  badhealth          se
## 2hsgrad   2hsgrad 0.19483808 0.003069167
## 0Prim       0Prim 0.47250299 0.011651082
## 1somehs   1somehs 0.34789093 0.008429821
## 3somecol 3somecol 0.14750150 0.002630547
## 4colgrad 4colgrad 0.07125411 0.001462964

And we see the same point estimates of our prevalences as in the simple weighted table, but the standard errors have now been adjusted for survey design as well, so they are also correct. You also see they are much larger than the ones we computed above, which assumed random sampling.

Regression example

Next we apply this logic to a regression case. First we fit the OLS model for our BMI outcome using education and age as predictors:

fit1<-lm(bmi~educ+agec, data=brfss16m)
summary(fit1)
## 
## Call:
## lm(formula = bmi ~ educ + agec, data = brfss16m)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -16.972  -4.011  -0.938   2.869  71.275 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  25.48127    0.05543 459.697  < 2e-16 ***
## educ0Prim     0.64448    0.08925   7.221 5.17e-13 ***
## educ1somehs   0.38467    0.06453   5.961 2.51e-09 ***
## educ3somecol -0.20410    0.03487  -5.852 4.85e-09 ***
## educ4colgrad -1.56122    0.03202 -48.765  < 2e-16 ***
## agec(24,39]   2.84383    0.06074  46.822  < 2e-16 ***
## agec(39,59]   3.85094    0.05674  67.871  < 2e-16 ***
## agec(59,79]   3.37236    0.05604  60.176  < 2e-16 ***
## agec(79,99]   0.99002    0.06838  14.478  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.984 on 227307 degrees of freedom
##   (21695 observations deleted due to missingness)
## Multiple R-squared:  0.04132,    Adjusted R-squared:  0.04128 
## F-statistic:  1225 on 8 and 227307 DF,  p-value: < 2.2e-16

Next we incorporate case weights

fit2<-lm(bmi~educ+agec, data=brfss16m, weights = mmsawt)
summary(fit2)
## 
## Call:
## lm(formula = bmi ~ educ + agec, data = brfss16m, weights = mmsawt)
## 
## Weighted Residuals:
##     Min      1Q  Median      3Q     Max 
## -2153.7   -60.9    -9.1    46.4  6500.9 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  25.30692    0.04041 626.213  < 2e-16 ***
## educ0Prim     0.75253    0.06634  11.343  < 2e-16 ***
## educ1somehs   0.37305    0.05125   7.279 3.37e-13 ***
## educ3somecol -0.12319    0.03354  -3.673  0.00024 ***
## educ4colgrad -1.78033    0.03399 -52.381  < 2e-16 ***
## agec(24,39]   2.82911    0.04398  64.332  < 2e-16 ***
## agec(39,59]   3.82963    0.04228  90.584  < 2e-16 ***
## agec(59,79]   3.35996    0.04456  75.404  < 2e-16 ***
## agec(79,99]   1.12917    0.07154  15.783  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 152.4 on 227307 degrees of freedom
##   (21695 observations deleted due to missingness)
## Multiple R-squared:  0.05289,    Adjusted R-squared:  0.05286 
## F-statistic:  1587 on 8 and 227307 DF,  p-value: < 2.2e-16

We see the low education effect reduce and the age effects increase. Now we will incorporate design effects as well:

fit3<-svyglm(bmi~educ+agec,des, family=gaussian)
summary(fit3)
## 
## Call:
## svyglm(formula = bmi ~ educ + agec, des, family = gaussian)
## 
## Survey design:
## svydesign(ids = ~1, strata = ~ststr, weights = ~mmsawt, data = brfss16m[is.na(brfss16m$mmsawt) == 
##     F, ])
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  25.30692    0.09108 277.861  < 2e-16 ***
## educ0Prim     0.75253    0.17921   4.199 2.68e-05 ***
## educ1somehs   0.37305    0.13122   2.843  0.00447 ** 
## educ3somecol -0.12319    0.07326  -1.682  0.09264 .  
## educ4colgrad -1.78033    0.06390 -27.862  < 2e-16 ***
## agec(24,39]   2.82911    0.10108  27.987  < 2e-16 ***
## agec(39,59]   3.82963    0.09403  40.729  < 2e-16 ***
## agec(59,79]   3.35996    0.09342  35.966  < 2e-16 ***
## agec(79,99]   1.12917    0.11511   9.809  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 35.73852)
## 
## Number of Fisher Scoring iterations: 2

Notice, the resuls for the education levels are much less significant than the were with either of the other two analysis. This is because those models had standard errors for the parameters that were too small. You see all the standard errors are larger and the T statistics are smaller.

Now I make a table to show the results of the three models:

stargazer(fit1, fit2, fit3, style="demography", type="html",
          column.labels = c("OLS", "Weights", "Survey"),
          title = "Regression models for BMI using survey data - BRFSS 2016", 
          covariate.labels=c("PrimarySchool", "SomeHS", "SomeColl", "CollGrad", "Age 24-39","Age 39-59" ,"Age 59-79", "Age 80+"), 
          keep.stat="n", model.names=F, align=T, ci=T)
Regression models for BMI using survey data - BRFSS 2016
bmi
OLS Weights Survey
Model 1 Model 2 Model 3
PrimarySchool 0.644*** 0.753*** 0.753***
(0.470, 0.819) (0.623, 0.883) (0.401, 1.104)
SomeHS 0.385*** 0.373*** 0.373**
(0.258, 0.511) (0.273, 0.474) (0.116, 0.630)
SomeColl -0.204*** -0.123*** -0.123
(-0.272, -0.136) (-0.189, -0.057) (-0.267, 0.020)
CollGrad -1.561*** -1.780*** -1.780***
(-1.624, -1.498) (-1.847, -1.714) (-1.906, -1.655)
Age 24-39 2.844*** 2.829*** 2.829***
(2.725, 2.963) (2.743, 2.915) (2.631, 3.027)
Age 39-59 3.851*** 3.830*** 3.830***
(3.740, 3.962) (3.747, 3.912) (3.645, 4.014)
Age 59-79 3.372*** 3.360*** 3.360***
(3.263, 3.482) (3.273, 3.447) (3.177, 3.543)
Age 80+ 0.990*** 1.129*** 1.129***
(0.856, 1.124) (0.989, 1.269) (0.904, 1.355)
Constant 25.481*** 25.307*** 25.307***
(25.373, 25.590) (25.228, 25.386) (25.128, 25.485)
N 227,316 227,316 227,316
p < .05; p < .01; p < .001

Which shows the same β’s between the survey design model and the weighted model but the standard errors are larger in the survey model, so the test statistics are more conservative (smaller t statistics).

While in this simple model, our overall interpretation of the effects do not change (positive effects of education, negative effects of age), it is entirely possible that they could once we include our survey design effects.

It may be informative to plot the results of the models to see how different the coefficients are from one another:

plot(coef(fit1)[-1], ylab="Beta parameters",ylim=c(-2, 4), xlab=NULL,axes=T,xaxt="n",main=expression(paste(beta , " Parameters from Survey Regression and non survey regression models")))
axis(side=1, at=1:8, labels=F)
text(x=1:8, y=-2.5,  srt = 45, pos = 1, xpd = TRUE,labels = c( "PrimarySch", "SomeHS", "somecol", "colgrad", "25_40", "40_60", "60_80", "80+" ))
#add the coefficients for the unweighted model
points(coef(fit3)[-1], col=2, pch=4, cex=1.5)
abline(h=0, lty=2)
legend("topleft", legend=c("Non-survey model", "Survey Model"), col=c(1,2), pch=c(1,4))

Which shows us that the betas are similar but have some differences between the two models.

Replicate Weights If your dataset comes with replicate weights, you have to specify the survey design slightly differently. Here is an example using the IPUMS CPS data. For this data, you can get information here, but you must consult your specific data source for the appropriate information for your data.

load("~/Google Drive/classes/dem7283/class18/data/cpsmar10tx.Rdata")
names(cpsmar10tx)
##   [1] "year"     "serial"   "hwtsupp"  "repwt"    "statefip" "metarea" 
##   [7] "foodstmp" "REPWT1"   "REPWT2"   "REPWT3"   "REPWT4"   "REPWT5"  
##  [13] "REPWT6"   "REPWT7"   "REPWT8"   "REPWT9"   "REPWT10"  "REPWT11" 
##  [19] "REPWT12"  "REPWT13"  "REPWT14"  "REPWT15"  "REPWT16"  "REPWT17" 
##  [25] "REPWT18"  "REPWT19"  "REPWT20"  "REPWT21"  "REPWT22"  "REPWT23" 
##  [31] "REPWT24"  "REPWT25"  "REPWT26"  "REPWT27"  "REPWT28"  "REPWT29" 
##  [37] "REPWT30"  "REPWT31"  "REPWT32"  "REPWT33"  "REPWT34"  "REPWT35" 
##  [43] "REPWT36"  "REPWT37"  "REPWT38"  "REPWT39"  "REPWT40"  "REPWT41" 
##  [49] "REPWT42"  "REPWT43"  "REPWT44"  "REPWT45"  "REPWT46"  "REPWT47" 
##  [55] "REPWT48"  "REPWT49"  "REPWT50"  "REPWT51"  "REPWT52"  "REPWT53" 
##  [61] "REPWT54"  "REPWT55"  "REPWT56"  "REPWT57"  "REPWT58"  "REPWT59" 
##  [67] "REPWT60"  "REPWT61"  "REPWT62"  "REPWT63"  "REPWT64"  "REPWT65" 
##  [73] "REPWT66"  "REPWT67"  "REPWT68"  "REPWT69"  "REPWT70"  "REPWT71" 
##  [79] "REPWT72"  "REPWT73"  "REPWT74"  "REPWT75"  "REPWT76"  "REPWT77" 
##  [85] "REPWT78"  "REPWT79"  "REPWT80"  "REPWT81"  "REPWT82"  "REPWT83" 
##  [91] "REPWT84"  "REPWT85"  "REPWT86"  "REPWT87"  "REPWT88"  "REPWT89" 
##  [97] "REPWT90"  "REPWT91"  "REPWT92"  "REPWT93"  "REPWT94"  "REPWT95" 
## [103] "REPWT96"  "REPWT97"  "REPWT98"  "REPWT99"  "REPWT100" "REPWT101"
## [109] "REPWT102" "REPWT103" "REPWT104" "REPWT105" "REPWT106" "REPWT107"
## [115] "REPWT108" "REPWT109" "REPWT110" "REPWT111" "REPWT112" "REPWT113"
## [121] "REPWT114" "REPWT115" "REPWT116" "REPWT117" "REPWT118" "REPWT119"
## [127] "REPWT120" "REPWT121" "REPWT122" "REPWT123" "REPWT124" "REPWT125"
## [133] "REPWT126" "REPWT127" "REPWT128" "REPWT129" "REPWT130" "REPWT131"
## [139] "REPWT132" "REPWT133" "REPWT134" "REPWT135" "REPWT136" "REPWT137"
## [145] "REPWT138" "REPWT139" "REPWT140" "REPWT141" "REPWT142" "REPWT143"
## [151] "REPWT144" "REPWT145" "REPWT146" "REPWT147" "REPWT148" "REPWT149"
## [157] "REPWT150" "REPWT151" "REPWT152" "REPWT153" "REPWT154" "REPWT155"
## [163] "REPWT156" "REPWT157" "REPWT158" "REPWT159" "REPWT160" "month"   
## [169] "pernum"   "wtsupp"   "relate"   "age"      "sex"      "race"    
## [175] "marst"    "offpov"   "MIGRATE1"
cpsmar10tx$poverty<-ifelse(cpsmar10tx$offpov==1,1,0)
des2<-svrepdesign( data = cpsmar10tx,repweights = cpsmar10tx[, c(8:167)]  , weights = ~wtsupp , type="JK1", scale=.025)
des2
## Call: svrepdesign.default(data = cpsmar10tx, repweights = cpsmar10tx[, 
##     c(8:167)], weights = ~wtsupp, type = "JK1", scale = 0.025)
## Unstratified cluster jacknife (JK1) with 160 replicates.
#Without design
prop.table(table(cpsmar10tx$poverty))
## 
##         0         1 
## 0.8374617 0.1625383
#with design
prop.table(svytable(~poverty, design = des2))
## poverty
##         0         1 
## 0.8481106 0.1518894
#Again, using the mean
mean(cpsmar10tx$poverty)
## [1] 0.1625383
#Using the design. This would be an official estimate of poverty in TX in 2010:
svymean(~poverty, design=des2)
##            mean    SE
## poverty 0.15189 0.007
fit<-svyglm(poverty~cut(age, breaks = 5), des2, family=binomial)
summary(fit)
## 
## Call:
## svyglm(formula = poverty ~ cut(age, breaks = 5), des2, family = binomial)
## 
## Survey design:
## svrepdesign.default(data = cpsmar10tx, repweights = cpsmar10tx[, 
##     c(8:167)], weights = ~wtsupp, type = "JK1", scale = 0.025)
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    -1.0512     0.1147  -9.163 2.88e-16 ***
## cut(age, breaks = 5)(29,43]    -0.7178     0.1405  -5.109 9.39e-07 ***
## cut(age, breaks = 5)(43,57]    -0.7677     0.1412  -5.436 2.07e-07 ***
## cut(age, breaks = 5)(57,71]    -1.1828     0.1690  -7.000 7.28e-11 ***
## cut(age, breaks = 5)(71,85.1]  -0.8162     0.2249  -3.629 0.000385 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
## Number of Fisher Scoring iterations: 5