BRFSS 2013 - Data Exploration

Setup

Load packages

library(ggplot2)
library(dplyr)
library(maps)

Load data

load("brfss2013.RData")

Part 1: Data

What is the data?

The Behavioral Risk Factor Surveillance System (BRFSS) is a survey that collects data on the health characteristics of the non-institutionalized adult population in the US and some of its territories.
Each row contains one participant's responses to the 2013 survey.

How was the data collected?

Households were selected for participation using random digit dialing in order to create a randomized sample.
An adult household member at the dialed number was then randomly selected to participate in the survey over the phone.

Is the data generalizable?

The data is generalizable to the participating US states and territories.
Within those areas, it is generalizable to adults aged 18 or older who are not institutionalized.
The sample does not include households outside the US states and selected territories, nor anyone who is under 18 years old or institutionalized, so the results cannot be generalized to these populations.

Can the data imply causality?

The data collected in the survey is not experimental: respondents were not randomly assigned to treatment or control groups.
For example, if you compared respondents in Alabama to respondents in California, many variables other than state of residence could explain a difference in the metric of interest.
Because the data is observational, it cannot imply causality.
However, it can show correlation.

Part 2: Research questions

Research question 1: What is the relationship between perceived physical health and mental health? To answer this question, I will look at the following variables:
genhlth : General Health rated Poor –> Excellent
menthlth : Number of Days (out of the past 30) Mental Health Not Good

Research question 2: Does the number of not-good mental health days differ by gender? Does it differ by state? To answer this question, I will look at the following variables:
sex : Respondent’s sex
X_state : Respondent’s state where they were surveyed
menthlth : Number of Days (out of the past 30) Mental Health Not Good

Research question 3: Finally, I will explore the relationship between bad mental health days, gender, and reported general health. Are females more likely to report good general health despite having more bad mental health days?

Part 3: Exploratory data analysis


Research question 1:

# subset dataframe to only the columns of interest
q1 <- select(brfss2013, genhlth, menthlth) %>% na.omit()

# bucket number of mental health days into groups - low, medium, high
q1$menthlth_group<-ifelse(q1$menthlth < 5, "low",
                          ifelse(q1$menthlth>=5 & q1$menthlth<10,"medium","high"))
prop.table(table(q1$genhlth, q1$menthlth_group),2)

##            
##                   high        low     medium
##   Excellent 0.06467717 0.19821348 0.11355926
##   Very good 0.18121018 0.35110083 0.31544238
##   Good      0.29034201 0.30806971 0.33469294
##   Fair      0.27195791 0.10936918 0.17676209
##   Poor      0.19181274 0.03324680 0.05954332
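The columns above come out in alphabetical order (high, low, medium) because the nested ifelse() produces plain character strings. As a minimal sketch of one alternative, base R's cut() applies the same thresholds and returns a factor whose levels stay in low, medium, high order. The values below are toy numbers assumed for illustration, not survey data.

```r
# Toy values standing in for q1$menthlth (assumed, not taken from the survey)
menthlth <- c(0, 2, 4, 5, 9, 10, 15, 30)

# right = FALSE makes each interval closed on the left: [0,5), [5,10), [10,Inf)
# These are the same cut points as the nested ifelse() above
menthlth_group <- cut(menthlth,
                      breaks = c(0, 5, 10, Inf),
                      labels = c("low", "medium", "high"),
                      right = FALSE)

# Counts are reported in level order rather than alphabetical order
table(menthlth_group)
```

Using the factor in prop.table() or as the x aesthetic in ggplot would then order the groups from low to high.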

The table shows how often participants gave each general health rating, stratified by a low, medium, or high number of “mental health not good” days. Of the 3 groups, participants with a low number of bad mental health days rate their general health as excellent most frequently and as poor least frequently.
The opposite is true for participants in the high group.

g1 <- ggplot(q1) + aes(x=menthlth_group, fill=genhlth) + geom_bar(position="fill")
g1 <- g1 + xlab("# of Bad Mental Health Days") + ylab("Proportion") + scale_fill_discrete(name="Reported General Health")
g1

You can see in the stacked bar chart above that participants who report the most bad mental health days also report poor general health most frequently.
Conversely, participants who reported the fewest bad mental health days reported the best overall general health.
It seems as though reported mental health is correlated with reported general health.
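One way to put a number on this apparent association is a Spearman rank correlation, which suits an ordinal rating paired with a count of days. The sketch below uses a toy data frame assumed for illustration. It also assumes the genhlth levels run from Excellent to Poor, as in the table above, so a positive value would mean that more bad mental health days go with worse reported general health.

```r
# Toy data (assumed for illustration only, not survey results)
toy <- data.frame(
  genhlth  = factor(c("Excellent", "Very good", "Good", "Fair", "Poor", "Good"),
                    levels = c("Excellent", "Very good", "Good", "Fair", "Poor")),
  menthlth = c(0, 1, 3, 10, 25, 5)
)

# as.integer() on the factor gives 1 = Excellent ... 5 = Poor
rho <- cor(as.integer(toy$genhlth), toy$menthlth, method = "spearman")
rho
```

On the real data the same call would use q1$genhlth and q1$menthlth. It describes association only, not causation.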

Research question 2:

# subset data to male & female to get two dataframes
male_q2 <- brfss2013 %>%
  filter(sex == "Male")
female_q2 <- brfss2013 %>%
  filter(sex == "Female")
male_q2 <- select(male_q2, menthlth) %>% na.omit()
female_q2 <- select(female_q2, menthlth) %>% na.omit()

dim(male_q2)

## [1] 198066      1

dim(female_q2)

## [1] 285079      1

# plot male # of not good mental health days
male_q2_distribution = ggplot(data = male_q2, aes(x = menthlth)) +
  geom_histogram(binwidth = 5) + ggtitle("Male Frequency of Bad Mental Health Days in a Month")
male_q2_distribution

# plot female # of not good mental health days
female_q2_distribution = ggplot(data = female_q2, aes(x = menthlth)) +
  geom_histogram(binwidth = 5) + ggtitle("Female Frequency of Bad Mental Health Days in a Month")
female_q2_distribution
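Because the two groups differ in size (198066 males vs. 285079 females), raw count histograms are hard to compare directly. As a rough sketch, the share of each group falling into 5-day-wide bins can be tabulated in base R. The data frame below is a toy example assumed for illustration, and the bin edges are my own choice rather than the ones geom_histogram picks.

```r
# Toy data (assumed): unequal group sizes, as in the real survey
toy <- data.frame(
  sex      = c("Male", "Male", "Male",
               "Female", "Female", "Female", "Female", "Female"),
  menthlth = c(0, 0, 12, 0, 0, 3, 7, 30)
)

# 5-day-wide bins; right = FALSE gives [0,5), [5,10), ..., [30,35)
toy$bin <- cut(toy$menthlth, breaks = seq(0, 35, by = 5), right = FALSE)

# margin = 1 makes each row (each sex) sum to 1
bin_props <- prop.table(table(toy$sex, toy$bin), margin = 1)
round(bin_props, 2)
```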

The distributions look pretty similar, but I also want to compare the average # of bad mental health days in a month for each group:

q2 <- select(brfss2013, sex, menthlth)  %>% na.omit()
q2 %>%
  group_by(sex) %>%
  summarise(mean_days = mean(menthlth),
            median_days = median(menthlth),
            sd_days = sd(menthlth),n = n())

## # A tibble: 2 x 5
##   sex    mean_days median_days sd_days      n
##   <fct>      <dbl>       <dbl>   <dbl>  <int>
## 1 Male        2.78           0    7.13 198066
## 2 Female      3.78           0    8.05 285079

Though the male and female distributions look fairly similar, the female distribution is more heavily right-skewed, so the female mean number of days is higher. However, the median is 0 for both groups, which means that at least half of the respondents in each group reported no bad mental health days. The median is a more robust statistic than the mean, meaning it is less affected by skewness.
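The gap between mean and median can be illustrated with a small zero-heavy sample. The vector below is made up for illustration: a few large values pull the mean up, while the median stays at 0 because more than half of the values are 0.

```r
# Toy zero-inflated sample (assumed): 6 of 10 respondents report 0 days
days <- c(0, 0, 0, 0, 0, 0, 2, 5, 15, 30)

mean(days)       # pulled upward by the few large values
median(days)     # stays at 0 because more than half the values are 0
mean(days == 0)  # share of respondents reporting zero days
```

Running mean(q2$menthlth == 0) by sex on the real data would show how large the zero-day share is in each group.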

# create map of US states
all_states <- map_data("state")

# select the relevant columns from brfss dataframe
q2_state <- select(brfss2013, X_state, menthlth)  %>% na.omit()

# calculate the mean number of mental health days by state
# using the state mean, calculate the mean % of bad mental health days
q2_state_menthlth <- q2_state %>%
  filter(menthlth<=30) %>%
  group_by(X_state) %>%
  summarise(mean_days = mean(menthlth),n = n()) %>%
  mutate(pct_bad_menthlth_days = (mean_days/30)*100)

# create the mental health map and merge with the US states map
menthlth_states_map <- q2_state_menthlth %>%
  mutate(region= tolower(X_state))

states_map <- merge(all_states, menthlth_states_map, by="region")

# plot the heatmap
ggplot() +
  geom_polygon(data = states_map, aes(x= long, y= lat, group = group, fill = pct_bad_menthlth_days), color = "white") +
  ggtitle("Heat Map of % Bad Mental Health Days in the US") +
  scale_fill_gradient2(low = "white", high = "darkred") +
  theme(legend.position = c(1,0), legend.justification = c(1,0))
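As far as I know, map_data("state") only covers the contiguous states and the District of Columbia, and merge() keeps only rows that match in both data frames by default. Regions such as Alaska, Hawaii, and Puerto Rico would therefore drop out of the map without any warning. A quick check is to compare the two sets of region names. The vectors below are hypothetical stand-ins for the real ones.

```r
# Hypothetical region names; in the report these would come from
# unique(all_states$region) and tolower(q2_state_menthlth$X_state)
map_regions    <- c("alabama", "kentucky", "west virginia")
survey_regions <- c("alabama", "alaska", "kentucky",
                    "puerto rico", "west virginia")

# Regions in the survey summary with no polygon to draw
missing_from_map <- setdiff(survey_regions, map_regions)
missing_from_map
```

Any region listed here still appears in the summary table but not on the heat map.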

States with the highest average percent of bad mental health days appear to be:
- Alabama
- Kentucky
- West Virginia

# print a table of the top 10 states with highest percent bad mental health days
head(arrange(q2_state_menthlth, desc(pct_bad_menthlth_days)), n=10)

## # A tibble: 10 x 4
##    X_state       mean_days     n pct_bad_menthlth_days
##    <fct>             <dbl> <int>                 <dbl>
##  1 Alabama            4.44  6334                  14.8
##  2 West Virginia      4.40  5798                  14.7
##  3 Kentucky           4.31 10717                  14.4
##  4 Oklahoma           4.01  8114                  13.4
##  5 Arkansas           3.92  5139                  13.1
##  6 Puerto Rico        3.91  5945                  13.0
##  7 Mississippi        3.88  7298                  12.9
##  8 Oregon             3.85  5861                  12.8
##  9 Tennessee          3.85  5703                  12.8
## 10 California         3.76 11410                  12.5

This table shows the 10 states and territories with the highest average percent of bad mental health days in a month. Puerto Rico ranks sixth in the table.

Research question 3: Are females more likely to report good general health despite having more bad mental health days?

# subset the dataframe to metrics of interest, remove NA's, remove mental health outliers
q3_cleaned <- select(brfss2013, sex, menthlth, genhlth) %>%
  na.omit() %>%
  filter(menthlth <= 30)
dim(q3_cleaned)

## [1] 481487      3

# assign low, medium & high mental health days categories to
# each individual as we did in question 1
q3_cleaned$menthlth_group<-ifelse(q3_cleaned$menthlth < 5, "low # days",
                          ifelse(q3_cleaned$menthlth>=5 & q3_cleaned$menthlth<10,"medium # days","high # days"))

# plot the proportion of participants in each mental health group (low, med, high days)
# who reported each general health rating and segment by gender
g3 <- ggplot(q3_cleaned) + aes(x=sex, fill=genhlth) + geom_bar(position = "fill") + facet_grid(.~menthlth_group)
g3 <- g3 + xlab("Mental Health Category by Gender") + ylab("Proportion") + scale_fill_discrete(name="Reported General Health")
g3

Though reported general health does vary by group (low, medium and high # of bad mental health days), it does not seem to vary much by gender. It seems that of participants who reported a high number of bad mental health days, participants were slightly less likely to report poor general health if they were female, but this difference may be due to sampling error & would require statistical testing to determine if this effect of gender is real.
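As a sketch of what such a test could look like, a chi-squared test of independence can be run on a sex by general-health table within one mental health group. The counts below are made up for illustration and are not survey results. The sketch also ignores the BRFSS survey weights, so it is a starting point rather than a final analysis.

```r
# Made-up counts for the "high # days" group only (not survey results)
counts <- matrix(c(150, 50,
                   170, 40),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(sex = c("Male", "Female"),
                                 genhlth = c("Not poor", "Poor")))

# Null hypothesis: reporting poor general health is independent of sex
test <- chisq.test(counts)
test$p.value
```

On the real data, the table could be built with something like table(high$sex, high$genhlth == "Poor"), where high is a hypothetical subset of q3_cleaned restricted to the "high # days" group.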