library(ggplot2)
library(dplyr)
library(corrplot)load("brfss2013.RData")The Behavioral Risk Factor Surveillance System (BRFSS) is a survey administered by the Centers for Disease Control and Prevention (CDC). Each of the fifty (50) state health departments conducts the phone survey (ie., both landline and cellular phones) of adult U.S. residents regarding their risk behaviors and preventive health practices; for example, sleep, exercise, chronic health conditions.
Importantly, my exploratory analysis CANNOT infer causality between variables. The BRFSS employs disproportionate stratified sampling (DSS), but we are not here ex ante assigning individuals into groups (nor does the survey simulate this dynamic); in other words, we are not conducting an experment. This is an analysis of obervational data such that we can only conclude association(s). As it says in OpenIntro Statistics (3rd Edition), “In general association does not imply causation, and causation can only be inferred from a randomized experiment.”
The survey is a vertible iceberg of interesting data! To me some of the interesting variables include:
Research quesion 1: Is self-reported sleep time associated with self-reported general health? My hypothesis is that respondents who get seven or eight hours of sleep will tend to report BETTER general health; and, conversely, that respondents who get LESS sleep may report worse health. This is interesting because personally I believe sleep, exercise, and stress control are the most important contributors to better health. (In order to add a third variable per the assignment, I am going to partition the results by gender)
Research quesion 2: Is HIGHER income associated with better self-reported general health? Further, might this vary according to marital status? This question is intersting to me because personally I believe that higher income, up to a point, contributes to a better quality of life; further, I wonder if marital status is a factor.
Research quesion 3: Is excercise (ie, physical activity) associated with lower body mass index (BMI)? I noticed that the survey collects weight and height and realized that I could therefore calculate BMI. While BMI might be imperfect, it seems like an elegant approximation of fitness (BMI is mass divided by height squared, such that a lower BMI is generally better and the division by height squared somewhat standardizes the metric across body types if not gender). I selected this question because I hope for a strong negative relationship between physical activity and BMI.
Research quesion 1: Is self-reported sleep time associated with self-reported general health (by gender)?
Although the median, for all health categories and both sexes, is seven (7) hours of sleep and the third quartile is eight (8) hours, there are two slightly interesting observations:
sleep.health <- select(brfss2013, genhlth, sex, sleptim1) %>%
filter(!is.na(genhlth), !is.na(sex), sleptim1 <= 12 )
ggplot(data = sleep.health, aes(x = genhlth, y = sleptim1)) +
geom_violin() +
scale_y_continuous(limits = c(4,10), breaks = 5:9) +
facet_grid(. ~ sex) +
xlab("genhlth = Self reported general health") +
ylab ("sleptim1 = How much time do you sleep?")Research quesion 2: Is HIGHER income associated with better self-reported general health? I decided to collapse the marital categories into three: married, not married (either divorced, widowed or separated), and never married. To me, the most interesting observation here is the disproportionate collection of individuals who are married AND higher income AND reporting general health of good or better!
income.h <- select(brfss2013, genhlth, marital, income2) %>%
filter(!is.na(genhlth), !is.na(income2),
!is.na(marital), marital %in% c("Married","Divorced",
"Widowed", "Separated",
"Never married"))
income.h.m <- income.h %>%
filter(marital %in% c("Married", "Never married")) %>%
mutate(married = marital)
income.h.nm <- income.h %>%
filter(marital %in% c("Divorced", "Widowed", "Separated")) %>%
mutate(married = "Not married")
income.health <- rbind(income.h.m, income.h.nm)
levels(income.health$genhlth) <- c("E","VG","G","F","P")
ggplot(data = income.health, aes(x = genhlth, y = income2)) +
geom_count() +
facet_grid(. ~ married) +
xlab("genhlth = Self reported general health")Research quesion 3: Is excercise (ie, physical activity) associated with lower body mass index (BMI)?
At this superficial level of analysis, I did NOT find a strong inverse linear relationship. So I resorted to the heatmap of 2-dimensional bin counts (below); this has an advantage over simple geom_point where there is a very large amount of overlapping data. I think this makes a somewhat visual point that the upper bound on BMI is lower as physical activity increases (maybe not too profound!)
bmi.activity <- select(brfss2013, htm4, wtkg3, pa1min_) %>%
filter(!is.na(htm4), !is.na(wtkg3), !is.na(pa1min_)) %>%
mutate(bmi = (wtkg3/100) / (htm4/100)^2 )
ggplot(data = bmi.activity, aes(x = pa1min_, y = bmi)) +
geom_bin2d(binwidth = c(100,5)) +
scale_x_continuous(limits = c(0,10000)) +
scale_y_continuous(limits = c(0,100)) +
xlab("pa1min_ = Calculated Minutes Of Total Physical Activity Per Week") +
ylab ("My Calculated Body Mass Index")As a final step, I just wanted to explore (superficially) correlations between selected key variables. Two interesting observations:
some.ideas <- select(brfss2013, genhlth, sleptim1, income2, smokday2, avedrnk2, drocdy3_, grenday_, pa1min_) %>%
mutate(genhlth = as.numeric(genhlth),
income2 = as.numeric(income2),
smokday2 = as.numeric(smokday2))
matrix.cor <- cor(some.ideas, use = "complete.obs")
corrplot(matrix.cor, type="upper", method = "number")