Exploring the BRFSS data

Setup

Load packages

library(ggplot2)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Load data

load("brfss2013.RData")

Part 1: Data

The Behavioral Risk Factor Surveillance System (BRFSS) is a United States health survey that looks at behavioral risk factors. It is run by Centers for Disease Control and Prevention (CDC) and conducted by the individual state health departments. This survey is administered by telephone and is the world’s largest such survey. In 2009, the BRFSS began conducting surveys by cellular phone in addition to traditional landline telephones.

The BRFSS is a cross-sectional telephone survey conducted by state health departments with technical and methodological assistance provided by CDC. In addition to all 50 states, the BRFSS is also conducted by health departments in The District of Columbia, Guam, Puerto Rico, and the U.S. Virgin Islands.

Sampling Design

In order to conduct the BRFSS, states obtain samples of telephone numbers from CDC. The BRFSS uses two samples : one for landline telephone respondents and one for cellular telephone respondents. Since landline telephones are often shared among persons living within a residence, household sampling is used in the landline sample. Household sampling requires interviewers to collect information on the number of adults living within a residence and then select randomly from all eligible adults.

The Landline Sample

Disproportionate stratified sampling (DSS¹) has been used for the landline sample since 2003. DSS draws telephone numbers from two strata (lists) that are based on the presumed density of known telephone household numbers. In this design, telephone numbers are classified into strata that are either high density (listed 1+ block telephone numbers) or medium density (not listed 1+ block telephone numbers) to yield residential telephone numbers.

The Cellular Telephone Sample

The cellular telephone sample is randomly generated from a sampling frame of confirmed cellular area code and prefix combinations. Cellular telephone respondents are randomly selected with each having equal probability of selection.

Scope of Inference

From what is described in the points above, it seems pretty clear that the BRFSS survey is actually an observational retrosepctive study that uses a stratified sampling design based on random digit dialing methods to select a representative sample from each state’s non-institutionalized residents. In this sample design, each state begins with a single stratum. To provide adequate sample sizes for smaller geographically defined populations of interest, however, many states sample disproportionately from strata that correspond to sub-state regions.

Generalizability

As it is, the BRFSS survey should be generalizable to all non-institutionalized adults (18 years of age and older) residing in the U.S.

Causality

As is well-known, making causal conclusions based on observational data is not recommended. Observational studies are only sufficient to show associations.

Part 2: Research questions

Research question 1: Is there an association between general health (genhlth variable) and the frequency of feeling restless in the past 30 days (misrstls variable) ?

It would be interesting to know if the sentiment of feeling well-rested or restless is linked in any way with the general health of a person.

Research question 2: Is there an association between prediabetes condition (prediab1 variable) and BMI (X_bmi5 variable) for males and females (sex variable) ?

An actual association between prediabetes and BMI might provide us with ways to better understand this more and more common disease in our western societies.

Research question 3: Is there an association between BMI (X_bmi5 variable) and the minutes of total physical activity per week (pa1min_ variable) for males and females (sex variable) ?

As in the previous research question, an actual association between BMI and physical activity might provide us with means to better deal with this disease.

Part 3: Exploratory data analysis

Research question 1:

brfss2013_rest_genhlth <- brfss2013 %>% filter(misrstls != "NA") %>% mutate(gen_hlth = ifelse(genhlth == "Poor", "Poor", "Good")) %>% filter(gen_hlth != "NA")
ggplot(brfss2013_rest_genhlth, aes(x = misrstls, fill = gen_hlth)) + geom_bar(position = "fill") + scale_fill_discrete(name = "General Health") + xlab("Frequency of restless days") + ylab("Proportion")

brfss2013_rest_genhlth %>% group_by(misrstls) %>% summarise(prop_poor_hlth = sum(gen_hlth == "Poor") / n())

## # A tibble: 5 x 2
##   misrstls prop_poor_hlth
##     <fctr>          <dbl>
## 1      All     0.25836820
## 2     Most     0.19343987
## 3     Some     0.08153702
## 4 A little     0.04125977
## 5     None     0.03030004

From the statistics above (prop_poor_hlth : proportion of people in poor health), we may clearly see that the proportion of people feeling in poor general health increases with the number of restless days in the previous month.

Research question 2:

brfss2013_prediab_bmi <- brfss2013 %>% filter(prediab1 != "NA") %>% filter(X_bmi5 != "NA") %>% mutate(bmi = X_bmi5 / 100)
ggplot(brfss2013_prediab_bmi, aes(x = prediab1, y = bmi)) + geom_boxplot() + facet_wrap(~sex) + xlab("Prediabetes condition") + ylab("BMI")

brfss2013_prediab_bmi %>% group_by(sex, prediab1) %>% summarise(median_bmi = median(bmi), iqr_bmi = IQR(bmi))

## # A tibble: 5 x 4
## # Groups:   sex [?]
##      sex              prediab1 median_bmi iqr_bmi
##   <fctr>                <fctr>      <dbl>   <dbl>
## 1   Male                   Yes      29.05    7.04
## 2   Male                    No      26.78    5.66
## 3 Female                   Yes      29.29    8.82
## 4 Female Yes, during pregnancy      27.34    7.82
## 5 Female                    No      25.69    7.13

The statistics above (median_bmi : median of BMI, iqr_bmi : interquartile range of BMI) show that the median and interquartile range of the BMI are slightly higher when prediabetes condition is present for both males and females. However, we cannot rule out the possibility that these results happen by luck only.

Research question 3:

brfss2013_pa_bmi <- brfss2013 %>% filter(pa1min_ != "NA") %>% filter(X_bmi5 != "NA") %>% mutate(bmi = X_bmi5 / 100)
ggplot(brfss2013_pa_bmi, aes(x = bmi, y = log(pa1min_ + 1), colour = sex)) + geom_point(shape = 19, alpha = 1/4) + geom_smooth(method = lm, se = FALSE) + scale_colour_discrete(name = "Sex") + xlab("BMI") + ylab("Minutes of physical activity per week (Log)")

brfss2013_pa_bmi %>% group_by(sex) %>% summarise(corr_bmi_phys_activity = cor(bmi, pa1min_))

## # A tibble: 2 x 2
##      sex corr_bmi_phys_activity
##   <fctr>                  <dbl>
## 1   Male            -0.04421816
## 2 Female            -0.07044956

The statistics above (corr_bmi_phys_activity : correlation of BMI and minutes of total physical activity per week) show a trend of doing less physical activity when BMI is higher for both males and females. Though, here again we cannot tell if this results from a real trend in the data or is merely happening by luck only.

Disproportionate stratification is a type of stratified sampling. With disproportionate stratification, the sample size of each stratum does not have to be proportionate to the population size of the stratum. This means that two or more strata will have different sampling fractions .↩