library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
load("brfss2013.RData")
The Behavioral Risk Factor Surveillance System (BRFSS) is a United States health survey that looks at behavioral risk factors. It is run by the Centers for Disease Control and Prevention (CDC) and conducted by the individual state health departments. The survey is administered by telephone and is the world’s largest such survey. In 2009, the BRFSS began conducting surveys by cellular phone in addition to traditional landline telephones.
The BRFSS is a cross-sectional telephone survey conducted by state health departments with technical and methodological assistance provided by CDC. In addition to all 50 states, the BRFSS is also conducted by health departments in the District of Columbia, Guam, Puerto Rico, and the U.S. Virgin Islands.
In order to conduct the BRFSS, states obtain samples of telephone numbers from CDC. The BRFSS uses two samples: one for landline telephone respondents and one for cellular telephone respondents. Since landline telephones are often shared among persons living within a residence, household sampling is used in the landline sample. Household sampling requires interviewers to collect information on the number of adults living within a residence and then to select one at random from all eligible adults.
Disproportionate stratified sampling (DSS; see the note at the end of this document) has been used for the landline sample since 2003. DSS draws telephone numbers from two strata (lists) that are based on the presumed density of known telephone household numbers. In this design, telephone numbers are classified into strata that are either high density (listed 1+ block telephone numbers) or medium density (not listed 1+ block telephone numbers) to yield residential telephone numbers.
The cellular telephone sample is randomly generated from a sampling frame of confirmed cellular area code and prefix combinations. Cellular telephone respondents are randomly selected with each having equal probability of selection.
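The idea behind disproportionate stratified sampling can be illustrated with a minimal sketch. Everything in it is hypothetical: the frame of 1,000 numbers, the stratum sizes, and the sampling fractions (including the choice to oversample the high-density stratum) are assumptions made for illustration, not the values CDC uses. It uses base R only, so it runs without any package.

```r
# Hypothetical sampling frame: 1,000 phone numbers in two density strata.
set.seed(42)
frame <- data.frame(
  phone_id = 1:1000,
  stratum  = rep(c("high", "medium"), times = c(300, 700))
)

# Assumed sampling fractions: they differ between strata, which is what
# makes the design "disproportionate".
fractions <- c(high = 0.20, medium = 0.05)

# Draw a simple random sample within each stratum.
sampled <- do.call(rbind, lapply(names(fractions), function(s) {
  rows   <- frame[frame$stratum == s, ]
  n_s    <- round(nrow(rows) * fractions[[s]])
  picked <- rows[sample(nrow(rows), n_s), ]
  # Design weight = inverse of the selection probability in this stratum.
  picked$weight <- 1 / fractions[[s]]
  picked
}))

table(sampled$stratum)  # sample size per stratum
sum(sampled$weight)     # weighted total, comparable to the frame size
```

Because the strata are sampled at different rates, unweighted estimates would over-represent the oversampled stratum; the design weights in the sketch are one common way to correct for that.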
From the description above, the BRFSS is an observational, retrospective study that uses a stratified sampling design based on random digit dialing to select a representative sample of each state’s non-institutionalized residents. In this sample design, each state begins with a single stratum. To provide adequate sample sizes for smaller geographically defined populations of interest, however, many states sample disproportionately from strata that correspond to sub-state regions.
Because respondents are selected at random, the BRFSS results should be generalizable to all non-institutionalized adults (18 years of age and older) residing in the U.S.
Because the data are observational and no random assignment was used, causal conclusions cannot be drawn from them. Observational studies are only sufficient to show associations.
Research question 1: Is there an association between general health (genhlth variable) and the frequency of feeling restless in the past 30 days (misrstls variable)?
It would be interesting to know whether how often a person feels restless is linked in any way with that person's general health.
Research question 2: Is there an association between prediabetes condition (prediab1 variable) and BMI (X_bmi5 variable) for males and females (sex variable)?
An association between prediabetes and BMI might help us better understand this increasingly common condition in Western societies.
Research question 3: Is there an association between BMI (X_bmi5 variable) and the minutes of total physical activity per week (pa1min_ variable) for males and females (sex variable)?
As in the previous research question, an association between BMI and physical activity might suggest ways to better deal with this condition.
Research question 1:
# Keep non-missing answers; collapse general health into "Poor" vs. all other levels (labelled "Good")
brfss2013_rest_genhlth <- brfss2013 %>%
  filter(!is.na(misrstls), !is.na(genhlth)) %>%
  mutate(gen_hlth = ifelse(genhlth == "Poor", "Poor", "Good"))
# Stacked bars scaled to proportions within each restlessness level
ggplot(brfss2013_rest_genhlth, aes(x = misrstls, fill = gen_hlth)) +
  geom_bar(position = "fill") +
  scale_fill_discrete(name = "General Health") +
  xlab("Frequency of restless days") +
  ylab("Proportion")
# Proportion of respondents in poor health for each restlessness level
brfss2013_rest_genhlth %>%
  group_by(misrstls) %>%
  summarise(prop_poor_hlth = mean(gen_hlth == "Poor"))
## # A tibble: 5 x 2
##   misrstls prop_poor_hlth
##     <fctr>          <dbl>
## 1      All     0.25836820
## 2     Most     0.19343987
## 3     Some     0.08153702
## 4 A little     0.04125977
## 5     None     0.03030004
From the statistics above (prop_poor_hlth: proportion of people in poor health), the proportion of respondents in poor general health rises steadily with how often they felt restless in the previous 30 days, from about 3% among those who answered "None" to about 26% among those who answered "All".
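As a small illustration of how such a per-group proportion is obtained, the sketch below uses a toy data frame whose values are made up and are not taken from the BRFSS. It relies on the fact that the mean of a logical vector is the share of `TRUE` values, and it uses base R (`tapply`) so that it runs without any package.

```r
# Toy data (hypothetical values, not BRFSS records).
toy <- data.frame(
  misrstls = c("All", "All", "None", "None", "None", "None"),
  gen_hlth = c("Poor", "Good", "Good", "Good", "Poor", "Good")
)

# gen_hlth == "Poor" is TRUE/FALSE; averaging it within each level of
# misrstls gives the proportion of "Poor" answers in that level.
prop_poor <- tapply(toy$gen_hlth == "Poor", toy$misrstls, mean)
prop_poor
```

In this toy example one of the two "All" respondents and one of the four "None" respondents are in poor health, so the proportions are 0.5 and 0.25 respectively.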
Research question 2:
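# Keep respondents with known prediabetes status and BMI; X_bmi5 stores BMI with two implied decimals
brfss2013_prediab_bmi <- brfss2013 %>%
  filter(!is.na(prediab1), !is.na(X_bmi5)) %>%
  mutate(bmi = X_bmi5 / 100)
# BMI distribution by prediabetes status, one panel per sex
ggplot(brfss2013_prediab_bmi, aes(x = prediab1, y = bmi)) +
  geom_boxplot() +
  facet_wrap(~sex) +
  xlab("Prediabetes condition") +
  ylab("BMI")
# Median and interquartile range of BMI by sex and prediabetes status
brfss2013_prediab_bmi %>%
  group_by(sex, prediab1) %>%
  summarise(median_bmi = median(bmi), iqr_bmi = IQR(bmi))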
## # A tibble: 5 x 4
## # Groups:   sex [?]
##      sex              prediab1 median_bmi iqr_bmi
##   <fctr>                <fctr>      <dbl>   <dbl>
## 1   Male                   Yes      29.05    7.04
## 2   Male                    No      26.78    5.66
## 3 Female                   Yes      29.29    8.82
## 4 Female Yes, during pregnancy      27.34    7.82
## 5 Female                    No      25.69    7.13
The statistics above (median_bmi: median of BMI, iqr_bmi: interquartile range of BMI) show that both the median and the interquartile range of BMI are higher when prediabetes is present, for males (median 29.05 vs. 26.78) and for females (29.29 vs. 25.69). However, no formal test was carried out here, so we cannot rule out the possibility that these differences arose by chance alone.
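One way to probe whether a difference in medians could be due to chance is a permutation test. The sketch below is not part of the original analysis and is only an illustration under stated assumptions: the two BMI vectors are made-up values, not BRFSS data, and the names `bmi_yes`, `bmi_no` and the choice of 2,000 shuffles are hypothetical.

```r
set.seed(1)

# Hypothetical BMI values for two groups (not BRFSS data).
bmi_yes <- c(31.2, 28.4, 33.0, 29.5, 27.8, 30.1)
bmi_no  <- c(25.3, 26.9, 24.1, 27.5, 23.8, 26.0)

# Observed difference in medians between the two groups.
observed <- median(bmi_yes) - median(bmi_no)

# Shuffle the pooled values many times, split them into two groups of the
# original sizes, and recompute the difference in medians each time.
pooled <- c(bmi_yes, bmi_no)
n_yes  <- length(bmi_yes)
perm_diffs <- replicate(2000, {
  shuffled <- sample(pooled)
  median(shuffled[seq_len(n_yes)]) - median(shuffled[-seq_len(n_yes)])
})

# Two-sided p-value: share of shuffles at least as extreme as observed.
p_value <- mean(abs(perm_diffs) >= abs(observed))
observed
p_value
```

A small p-value would suggest that the observed difference is unlikely under random relabelling of the groups. On the real survey data, the sampling weights would also need to be taken into account, which this sketch does not do.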
Research question 3:
# Keep respondents with known physical activity minutes and BMI
brfss2013_pa_bmi <- brfss2013 %>%
  filter(!is.na(pa1min_), !is.na(X_bmi5)) %>%
  mutate(bmi = X_bmi5 / 100)
# Scatter plot with one linear fit per sex; minutes are log-transformed (+1 to handle zeros)
ggplot(brfss2013_pa_bmi, aes(x = bmi, y = log(pa1min_ + 1), colour = sex)) +
  geom_point(shape = 19, alpha = 1/4) +
  geom_smooth(method = "lm", se = FALSE) +
  scale_colour_discrete(name = "Sex") +
  xlab("BMI") +
  ylab("Minutes of physical activity per week (Log)")
# Pearson correlation between BMI and untransformed minutes, by sex
brfss2013_pa_bmi %>%
  group_by(sex) %>%
  summarise(corr_bmi_phys_activity = cor(bmi, pa1min_))
## # A tibble: 2 x 2
##      sex corr_bmi_phys_activity
##   <fctr>                  <dbl>
## 1   Male            -0.04421816
## 2 Female            -0.07044956
The statistics above (corr_bmi_phys_activity: correlation between BMI and minutes of total physical activity per week, computed on the untransformed minutes) show a very weak negative correlation for both males (about -0.04) and females (about -0.07): a higher BMI goes with slightly less physical activity. Here again, without a formal test we cannot tell whether this reflects a real trend in the data or merely chance.
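To judge whether a correlation differs from zero, one option is `cor.test()` from base R's stats package, which reports the estimate together with a p-value and a confidence interval. The sketch below is an illustration only: the data are simulated, and the sample size, the slope of -3, and the noise level are arbitrary assumptions rather than anything estimated from the BRFSS.

```r
set.seed(7)

# Simulated data (hypothetical, not BRFSS records).
n   <- 200
bmi <- rnorm(n, mean = 27, sd = 5)
# Assumed weak negative dependence on BMI plus a lot of noise;
# pmax() keeps the simulated minutes non-negative.
minutes <- pmax(0, 300 - 3 * bmi + rnorm(n, sd = 150))

# Pearson correlation, as computed with cor() in the analysis above.
r <- cor(bmi, minutes)

# Test of the null hypothesis that the true correlation is zero.
test <- cor.test(bmi, minutes)
test$estimate   # same value as r
test$p.value    # p-value of the test
test$conf.int   # 95% confidence interval for the correlation
```

With a sample as large as the BRFSS, even a very small correlation can be statistically significant, so the size of the estimate matters as much as the p-value.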
Disproportionate stratification is a type of stratified sampling. With disproportionate stratification, the sample size of each stratum does not have to be proportionate to the population size of the stratum. This means that two or more strata will have different sampling fractions.