Exploring the BRFSS data

Setup

Load packages

library(ggplot2)
library(dplyr)
library(tidyr)

Load data

load("brfss2013.RData")

Part 1: Data

BRFSS is an ongoing surveillance system that began in 1984 with data collection in 15 states. It now extends to all 50 states and additional territories. The information is collected via landline and cellular telephone-based surveys.

Due to the observational nature of the data collection, any correlation between variables cannot be viewed as causal. Also, because the data come from phone surveys, any generalization of inferences has to be done with extreme caution, as survey respondents may not be representative of the population.

Part 2: Research questions

Research questions: (11 points) Come up with at least three research questions that you want to answer using these data. You should phrase your research questions in a way that matches up with the scope of inference your dataset allows for. Make sure that at least two of these questions involve at least three variables. You are welcomed to create new variables based on existing ones. With each question include a brief discussion (1-2 sentences) as to why this question is of interest to you and/or your audience.

Research question 1:

Does the interview month matter for when people report the best (and worst) Health-Related Quality of Life, as measured by poor health days (poorhlth)? And does this differ by marital status?

Research question 2:

Do people who consume alcoholic beverages more often also tend to smoke more? Does this change by sex?

Research question 3:

Do people who exercise eat more fruit? Does this change by level of education?

Part 3: Exploratory data analysis

Research question 1:

In order to determine the best (and worst) Quality of Life, we will need to select the columns needed and deal with NAs.

data1 <- brfss2013 %>%
        select(Month = imonth, Poor_Health = poorhlth, Mar_Status = marital) %>%
        as.data.frame()
        
round(sapply(data1, function(x) mean(is.na(x))),4)

##       Month Poor_Health  Mar_Status 
##      0.0000      0.4944      0.0070

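Nearly 50% of the rows have an NA for poor health, so let’s drop those entries. Note that everything below therefore describes only the roughly half of respondents who have a recorded value.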

data1 <- data1 %>%
        drop_na()

round(sapply(data1, function(x) mean(is.na(x))),4)

##       Month Poor_Health  Mar_Status 
##           0           0           0
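Because dropping incomplete rows removes about half of the data, it can be worth recording how many rows survive and whether missingness is spread evenly across groups. The following is a minimal sketch in base R on a small made-up data frame (`toy` and its values are illustrative assumptions, not BRFSS data); the same idea could be applied to `data1` before calling `drop_na()`.

```r
# Toy stand-in for data1 (values are made up for illustration only)
toy <- data.frame(
  Month       = c("January", "January", "February", "February", "March"),
  Poor_Health = c(3, NA, NA, 10, 0),
  Mar_Status  = c("Married", "Divorced", "Married", NA, "Widowed")
)

# Row counts before and after keeping only complete rows
n_before     <- nrow(toy)
toy_complete <- toy[complete.cases(toy), ]
n_after      <- nrow(toy_complete)

# Share of rows with a missing Poor_Health value, by month
na_share_by_month <- tapply(is.na(toy$Poor_Health), toy$Month, mean)

c(before = n_before, after = n_after)
na_share_by_month
```

If the NA share differed a lot between months or marital groups, the group averages would be comparing differently selected subsets of respondents.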

Let’s create some tables:

data1_by_married <- data1 %>%
                group_by(Mar_Status) %>%
                summarize(Average_Days = mean(Poor_Health))

## `summarise()` ungrouping output (override with `.groups` argument)

data1_by_married

## # A tibble: 6 x 2
##   Mar_Status                      Average_Days
##   <fct>                                  <dbl>
## 1 Married                                 4.48
## 2 Divorced                                7.32
## 3 Widowed                                 6.21
## 4 Separated                               7.95
## 5 Never married                           4.50
## 6 A member of an unmarried couple         4.51

data_by_month <- data1 %>%
                group_by(Month) %>%
                summarize(Average_Days = mean(Poor_Health))

## `summarise()` ungrouping output (override with `.groups` argument)

data_by_month

## # A tibble: 12 x 2
##    Month     Average_Days
##    <fct>            <dbl>
##  1 January           5.27
##  2 February          5.14
##  3 March             5.18
##  4 April             5.27
##  5 May               5.26
##  6 June              5.53
##  7 July              5.49
##  8 August            5.37
##  9 September         5.39
## 10 October           5.21
## 11 November          5.07
## 12 December          5.11

data1_both <- data1 %>%
                group_by(Mar_Status, Month) %>%
                summarize(Average_Days = mean(Poor_Health))

## `summarise()` regrouping output by 'Mar_Status' (override with `.groups` argument)

data1_both %>%
        spread(Month, Average_Days)

## # A tibble: 6 x 13
## # Groups:   Mar_Status [6]
##   Mar_Status January February March April   May  June  July August September
##   <fct>        <dbl>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>     <dbl>
## 1 Married       4.46     4.47  4.39  4.49  4.39  4.78  4.71   4.67      4.45
## 2 Divorced      7.50     6.89  7.38  7.34  7.35  7.47  7.47   7.22      7.67
## 3 Widowed       6.07     6.21  6.34  6.52  6.35  6.39  6.23   6.35      6.25
## 4 Separated     7.90     8.22  7.82  7.33  7.67  7.88  8.90   7.84      8.26
## 5 Never mar~    4.58     4.30  4.26  4.30  4.64  4.81  4.74   4.55      4.62
## 6 A member ~    4.74     4.08  4.21  4.78  4.15  4.70  4.67   4.42      5.15
## # ... with 3 more variables: October <dbl>, November <dbl>, December <dbl>

## Spread the data to make a nice visual table

There does appear to be a relationship between Marital Status and Poor Health Days, but not a strong relationship with the Month interviewed.
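One rough way to put numbers on that comparison is to look at how far the group averages spread in each table. The sketch below simply re-enters the averages printed in the two summary tables above and compares their ranges; it is a descriptive check only, not a formal test, and the object names are chosen here for illustration.

```r
# Group means copied from the two summary tables above
by_status <- c(Married = 4.48, Divorced = 7.32, Widowed = 6.21,
               Separated = 7.95, `Never married` = 4.50,
               `A member of an unmarried couple` = 4.51)
by_month <- c(January = 5.27, February = 5.14, March = 5.18, April = 5.27,
              May = 5.26, June = 5.53, July = 5.49, August = 5.37,
              September = 5.39, October = 5.21, November = 5.07,
              December = 5.11)

# Difference between the highest and lowest group average
spread_status <- diff(range(by_status))  # 7.95 - 4.48
spread_month  <- diff(range(by_month))   # 5.53 - 5.07

round(c(status = spread_status, month = spread_month), 2)
# status 3.47, month 0.46
```

The averages differ by about 3.5 days across marital status groups but by less than half a day across months.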

Let’s look at it visually.

ggplot(data1_both, aes(x = Month, y = Average_Days, fill = Mar_Status)) +
        geom_bar(stat = "identity", position = position_dodge())

This plot is a little busy, but you can get a sense that Separated and Divorced respondents have higher Poor_Health days and that month doesn’t change the picture much.
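A line chart with one line per marital status is one possible way to reduce the clutter of 72 dodged bars. The sketch below uses a handful of values copied from the printed table above (three groups, first three months) so that it runs on its own; `plot_data` is a name introduced here, and with the full data one would presumably pass `data1_both` instead.

```r
library(ggplot2)

# A few rows copied from the printed table above (first three months only)
plot_data <- data.frame(
  Mar_Status   = rep(c("Married", "Divorced", "Separated"), each = 3),
  Month        = factor(rep(c("January", "February", "March"), times = 3),
                        levels = c("January", "February", "March")),
  Average_Days = c(4.46, 4.47, 4.39,
                   7.50, 6.89, 7.38,
                   7.90, 8.22, 7.82)
)

# One line per marital status keeps the month axis readable
p <- ggplot(plot_data, aes(x = Month, y = Average_Days,
                           colour = Mar_Status, group = Mar_Status)) +
        geom_line() +
        geom_point() +
        labs(y = "Average poor health days", colour = "Marital status")
p
```

The `group` aesthetic is needed because `Month` is a factor; without it ggplot2 would not know which points to connect.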

Research question 2:

data2 <- brfss2013 %>%
        select(Sex = sex, Alcohol_Days = alcday5, Smoke_Days = smokday2) %>%
        as.data.frame()
        
round(sapply(data2, function(x) mean(is.na(x))),4)

##          Sex Alcohol_Days   Smoke_Days 
##       0.0000       0.0399       0.5632

Let’s get rid of the NAs again

data2 <- data2 %>%
        drop_na()

round(sapply(data2, function(x) mean(is.na(x))),4)

##          Sex Alcohol_Days   Smoke_Days 
##            0            0            0

Let’s look at some average numbers by sex.

data2 %>%
        group_by(Sex) %>%
        summarize(Average_Drink_Days = mean(Alcohol_Days))

## `summarise()` ungrouping output (override with `.groups` argument)

## # A tibble: 2 x 2
##   Sex    Average_Drink_Days
##   <fct>               <dbl>
## 1 Male                106. 
## 2 Female               88.4

data2_both <- data2 %>% group_by(Sex, Smoke_Days) %>%
        summarize(Average_Drink_Days = mean(Alcohol_Days))

## `summarise()` regrouping output by 'Sex' (override with `.groups` argument)

data2_both %>%
        spread(Sex, Average_Drink_Days)

## # A tibble: 3 x 3
##   Smoke_Days  Male Female
##   <fct>      <dbl>  <dbl>
## 1 Every day   107.   84.4
## 2 Some days   117.   91.1
## 3 Not at all  105.   89.7

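Interestingly, men seem to report higher values for alcoholic beverage consumption than women, but everyday smokers don’t seem to drink more than non-smokers, while those who answered “Some days” have the highest averages. One caution: averages above 100 cannot be days per week or per month, which suggests that alcday5 is stored as a coded value (in the BRFSS codebook the leading digit marks the reporting unit) rather than as a plain count of days. These means should therefore be read as a rough ordering at best, not as actual drinking days.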
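Since averages above 100 cannot be literal day counts, a conversion step would be needed before taking means. The function below is a hedged sketch of such a conversion: the encoding it assumes (0 for no drinks, 101 to 107 for days per week, 201 to 230 for days in the past 30 days) is my reading of the BRFSS codebook and should be checked against it, and the function name `alcohol_days_per_month` is hypothetical.

```r
# Convert a coded alcohol-frequency value to days per 30 days.
# Assumed encoding (verify against the BRFSS codebook before relying on it):
#   0        -> no drinks in the past 30 days
#   101-107  -> 1 to 7 days per week
#   201-230  -> 1 to 30 days in the past 30 days
# Anything else (including NA) is returned as NA.
alcohol_days_per_month <- function(code) {
  out <- rep(NA_real_, length(code))
  per_week  <- !is.na(code) & code >= 101 & code <= 107
  per_month <- !is.na(code) & code >= 201 & code <= 230
  none      <- !is.na(code) & code == 0
  out[per_week]  <- (code[per_week] - 100) * 30 / 7  # scale weekly to 30 days
  out[per_month] <- code[per_month] - 200
  out[none]      <- 0
  out
}

decoded <- alcohol_days_per_month(c(0, 101, 107, 215, NA))
round(decoded, 2)
```

Applied to the real data, this would be something like `mutate(Alcohol_Days = alcohol_days_per_month(Alcohol_Days))` before the `group_by()` step, which would put weekly and monthly answers on the same scale.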

Again, let’s try to visualize this data.

ggplot(data2_both, aes(x = Smoke_Days, y = Average_Drink_Days, fill = Smoke_Days)) +
        geom_bar(stat = "identity") +
        facet_wrap(. ~ Sex)

Research question 3:

data3 <- brfss2013 %>%
        select(Education = educa, Fruit_Times = fruit1, Exercise_Times = exeroft1) %>%
        as.data.frame()
        
round(sapply(data3, function(x) mean(is.na(x))),4)

##      Education    Fruit_Times Exercise_Times 
##         0.0046         0.0687         0.3338

Let’s get rid of the NAs again

data3 <- data3 %>%
        drop_na()

round(sapply(data3, function(x) mean(is.na(x))),4)

##      Education    Fruit_Times Exercise_Times 
##              0              0              0

Let’s look at some average numbers by education

data3 %>%
        group_by(Education) %>%
        summarize(Average_Fruit_Times = mean(Fruit_Times),
                  Average_Exercise_Times = mean(Exercise_Times))

## `summarise()` ungrouping output (override with `.groups` argument)

## # A tibble: 6 x 3
##   Education                               Average_Fruit_Tim~ Average_Exercise_T~
##   <fct>                                                <dbl>               <dbl>
## 1 Never attended school or only kinderga~               160.                130.
## 2 Grades 1 through 8 (Elementary)                       166.                134.
## 3 Grades 9 though 11 (Some high school)                 180.                138.
## 4 Grade 12 or GED (High school graduate)                180.                138.
## 5 College 1 year to 3 years (Some colleg~               179.                137.
## 6 College 4 years or more (College gradu~               172.                133.

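It does appear that there is some relationship between fruit eating and education, but not much between exercise and education. As with alcohol, averages in the 130 to 180 range suggest that fruit1 and exeroft1 are stored as coded frequencies rather than plain counts, so these numbers are best treated as indicative only. The table also compares each variable with education separately; it does not yet show whether people who exercise more eat more fruit.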
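To address the research question more directly, one option is to measure the association between exercise and fruit consumption within each education level. The sketch below does this with a Spearman rank correlation on a small made-up data frame; the column names `Exercise_Per_Week` and `Fruit_Per_Week` and all values are illustrative assumptions, and the real fruit1 and exeroft1 columns would first need to be converted to a common unit.

```r
# Toy data: already-decoded weekly frequencies (made up for illustration)
toy <- data.frame(
  Education         = rep(c("High school graduate", "College graduate"),
                          each = 4),
  Exercise_Per_Week = c(1, 2, 3, 4, 1, 2, 3, 4),
  Fruit_Per_Week    = c(2, 4, 6, 8, 7, 5, 6, 4)
)

# Spearman rank correlation between exercise and fruit, per education level
cor_by_education <- sapply(
  split(toy, toy$Education),
  function(d) cor(d$Exercise_Per_Week, d$Fruit_Per_Week, method = "spearman")
)

cor_by_education
```

In this toy example the association is positive in one group and negative in the other, which is the kind of difference by education level the question asks about. A rank correlation is used here because reported frequencies tend to be skewed; that is a design choice for the sketch, not something the original analysis specifies.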