Exploring the BRFSS data

Part 3: Exploratory data analysis

Research quesion 1: To see how the median of reported sleep time varies across reported general health we use the following:

sleep_vs_health <- brfss2013 %>% filter(!is.na(genhlth)) %>% filter(!is.na(sleptim1)) %>% group_by(genhlth) 
sleep_vs_health_chart <- sleep_vs_health %>% summarise(mean_sleptim = mean(sleptim1), med_sleptim = median(sleptim1))

## `summarise()` ungrouping output (override with `.groups` argument)

ggplot(data = sleep_vs_health_chart, aes(x = genhlth, y = mean_sleptim)) + geom_col()

ggplot(data = sleep_vs_health_chart, aes(x = genhlth, y = med_sleptim)) + geom_col()

ggplot(data=sleep_vs_health, aes(x = genhlth, y = sleptim1)) + geom_boxplot()

By the above graphs, it seems that by both the mean and median, there isn’t much difference in reported sleep time across reported general health.

To see if the reported general health changed throughout the year:

health_by_month <- brfss2013 %>% filter(!is.na(imonth), !is.na(genhlth)) %>% select(imonth, genhlth)
health_by_month %>% group_by(imonth, genhlth) %>% summarise(count = n())

## `summarise()` regrouping output by 'imonth' (override with `.groups` argument)

## # A tibble: 60 x 3
## # Groups:   imonth [12]
##    imonth   genhlth   count
##    <fct>    <fct>     <int>
##  1 January  Excellent  6094
##  2 January  Very good 11217
##  3 January  Good      10354
##  4 January  Fair       4473
##  5 January  Poor       1923
##  6 February Excellent  7699
##  7 February Very good 13946
##  8 February Good      13058
##  9 February Fair       5656
## 10 February Poor       2345
## # … with 50 more rows

instead of viewing all the types of entries, I just want to see how the number of respondents answering “poor” changed over time.

poor_by_month <- health_by_month %>% filter(genhlth == "Poor") %>% group_by(imonth) %>% summarise(poorcount = n())

## `summarise()` ungrouping output (override with `.groups` argument)

poor_by_month

## # A tibble: 12 x 2
##    imonth    poorcount
##    <fct>         <int>
##  1 January        1923
##  2 February       2345
##  3 March          2488
##  4 April          2418
##  5 May            2334
##  6 June           2196
##  7 July           2513
##  8 August         2422
##  9 September      2267
## 10 October        2420
## 11 November       2318
## 12 December       2307

ggplot(data = poor_by_month, aes(x=imonth, y=poorcount)) + geom_col()

There doesn’t seem to be any trend in poor responses by month.

To simplify the general health responses, I want to combine “good,” “very good,” and “excellent” into one category as “at least good”.

health_by_month <- health_by_month %>% mutate(healthasnumber = as.numeric(genhlth)) %>% mutate(atleastgood = ifelse(healthasnumber < 4, "yes", "no")) 
head(health_by_month, n=10)

##      imonth   genhlth healthasnumber atleastgood
## 1   January      Fair              4          no
## 2   January      Good              3         yes
## 3   January      Good              3         yes
## 4   January Very good              2         yes
## 5  February      Good              3         yes
## 6     March Very good              2         yes
## 7     March      Fair              4          no
## 8     March      Good              3         yes
## 9     April Excellent              1         yes
## 10    April      Good              3         yes

now we can use atleastgood to combine the entries that are at least good by month and graph the overall changes and create new tables of just those who felt at least good and those who felt fair or worse.

health_by_month %>% group_by(imonth, atleastgood) %>% summarise(count = n())

## `summarise()` regrouping output by 'imonth' (override with `.groups` argument)

## # A tibble: 24 x 3
## # Groups:   imonth [12]
##    imonth   atleastgood count
##    <fct>    <chr>       <int>
##  1 January  no           6396
##  2 January  yes         27665
##  3 February no           8001
##  4 February yes         34703
##  5 March    no           8422
##  6 March    yes         35886
##  7 April    no           8339
##  8 April    yes         34427
##  9 May      no           7981
## 10 May      yes         32301
## # … with 14 more rows

at_least_good <- health_by_month %>% filter(atleastgood == "yes") %>% group_by(imonth) %>% summarise(count = n())

## `summarise()` ungrouping output (override with `.groups` argument)

fair_or_worse <- health_by_month %>% filter(atleastgood =="no") %>% group_by(imonth) %>% summarise(count = n())

## `summarise()` ungrouping output (override with `.groups` argument)

ggplot(data = at_least_good, aes(x = imonth, y = count)) + geom_col()

ggplot(data = fair_or_worse, aes(x = imonth, y = count)) + geom_col()

Though the absolute count between at least good and fair or worse is quite different (see the scaling on the graphs), the trend is similar in both showing rises in the same months.

I want to also see if the ratio of good to bad health responses changes by month:

health_by_month %>% group_by(imonth) %>% summarise(count = n()) %>% mutate(good_to_bad_ratio = at_least_good$count / fair_or_worse$count)

## `summarise()` ungrouping output (override with `.groups` argument)

## # A tibble: 12 x 3
##    imonth    count good_to_bad_ratio
##    <fct>     <int>             <dbl>
##  1 January   34061              4.33
##  2 February  42704              4.34
##  3 March     44308              4.26
##  4 April     42766              4.13
##  5 May       40282              4.05
##  6 June      37772              4.02
##  7 July      43515              4.17
##  8 August    42132              4.17
##  9 September 38391              4.03
## 10 October   42054              4.20
## 11 November  41756              4.26
## 12 December  40049              4.14

Overall there doesnt seem to be a relationship between the season and the general health of respondents.

Research quesion 2: To see what categories are available in the income variable and filter out NAs and then summarize median days of poor physical and mental health:

income_health <- brfss2013 %>% filter(!is.na(income2), !is.na(physhlth), !is.na(menthlth)) 
income_health %>% group_by(income2) %>% summarise(count = n(), med_phys = median(physhlth), med_ment = median(menthlth))

## `summarise()` ungrouping output (override with `.groups` argument)

## # A tibble: 8 x 4
##   income2            count med_phys med_ment
##   <fct>              <int>    <dbl>    <dbl>
## 1 Less than $10,000  23900        2        0
## 2 Less than $15,000  25247        2        0
## 3 Less than $20,000  33264        0        0
## 4 Less than $25,000  40078        0        0
## 5 Less than $35,000  47263        0        0
## 6 Less than $50,000  60049        0        0
## 7 Less than $75,000  64164        0        0
## 8 $75,000 or more   114421        0        0

It appears that the median days of poor physical health is higher for severely lower income survey responders.

ggplot(data = income_health, aes(x = income2, y = physhlth)) + geom_boxplot()

Since the interquartile range for higher income respondents is more narrow, we know that there is less variability in those categories. The category of people earning less than $10,000 shows a large interquartile range, indicating that a large portion of the category is in that wide range. Although the median is only slightly higher than that of the higher income respondents, it seems that there is more variability in the number of days a person was in poor health if they are also low income.

ggplot(data = income_health, aes(x = income2, y = menthlth)) + geom_boxplot()

there seems to be one obvious outlier in the data that’s making the y-axis range much larger than it should be so to remove that one outlier and graph again:

income_health <- income_health %>% filter(menthlth < 200)
ggplot(data = income_health, aes(x = income2, y = menthlth)) + geom_boxplot()

Although the median days of having poor mental health doesnt change across income brackets, the IQR varies a lot with much less variation (more respondents answering in low numbers) in the higher income brackets. This implies that there are more respondents in the lower income brackets answering with a high number of days where they experiences poor mental health.

Research quesion 3: Are those who have had a heart attack or stroke likely to have also been diagnosed with high blood pressure?

heart_disease <- brfss2013 %>% filter(!is.na(toldhi2), !is.na(cvdinfr4), !is.na(cvdstrk3)) %>% select(toldhi2, cvdinfr4, cvdstrk3)

Discounting respondents who didn’t answer whether or not they have been diagnosed with high blood pressure (filtering out NAs), we can calculate the probability of randomly selecting a respondent who was diagnosed with high bp.

head(heart_disease, n=10)

##    toldhi2 cvdinfr4 cvdstrk3
## 1      Yes       No       No
## 2       No       No       No
## 3       No       No       No
## 4      Yes       No       No
## 5       No       No       No
## 6      Yes       No       No
## 7       No       No       No
## 8      Yes       No       No
## 9      Yes       No       No
## 10      No       No       No

heart_disease %>% group_by(toldhi2) %>% summarise(count = n())

## `summarise()` ungrouping output (override with `.groups` argument)

## # A tibble: 2 x 2
##   toldhi2  count
##   <fct>    <int>
## 1 Yes     181744
## 2 No      235430

prob_hbp <- nrow(heart_disease %>% filter(toldhi2 == "Yes"))/nrow(heart_disease)
prob_hbp

## [1] 0.4356551

following the same method, we’ll find the probability of randomly selecting a respondent who has had a heart attack and stroke respectively.

heart_disease %>% group_by(cvdinfr4) %>% summarise(count = n())

## `summarise()` ungrouping output (override with `.groups` argument)

## # A tibble: 2 x 2
##   cvdinfr4  count
##   <fct>     <int>
## 1 Yes       27665
## 2 No       389509

prob_ha <- nrow(heart_disease %>% filter(cvdinfr4 == "Yes"))/nrow(heart_disease)
prob_ha

## [1] 0.06631525

heart_disease %>% group_by(cvdstrk3) %>% summarise(count = n())

## `summarise()` ungrouping output (override with `.groups` argument)

## # A tibble: 2 x 2
##   cvdstrk3  count
##   <fct>     <int>
## 1 Yes       18730
## 2 No       398444

prob_strk <- nrow(heart_disease %>% filter(cvdstrk3 == "Yes"))/nrow(heart_disease)
prob_strk

## [1] 0.04489733

to get conditional probabilities we need to group by two factors:

heart_disease %>% group_by(toldhi2, cvdinfr4) %>% summarise(count = n())

## `summarise()` regrouping output by 'toldhi2' (override with `.groups` argument)

## # A tibble: 4 x 3
## # Groups:   toldhi2 [2]
##   toldhi2 cvdinfr4  count
##   <fct>   <fct>     <int>
## 1 Yes     Yes       19419
## 2 Yes     No       162325
## 3 No      Yes        8246
## 4 No      No       227184

prob_ha_giv_hbp <- nrow(heart_disease %>% filter(toldhi2 == "Yes", cvdinfr4 == "Yes"))/nrow(heart_disease %>% filter(toldhi2 == "Yes"))
prob_ha_giv_hbp

## [1] 0.1068481

prob_ha

## [1] 0.06631525

Since the probability of having had a heart attack given that the respondent is also diagnosed with high blood pressure is different from the probability of randomly selecting a respondent who had a heart attack from all respondents, these two events are not independent. It also shows that the probability of having had a heart attack increases given that the respondent also was diagnosed with high blood pressure.

We can do the same test with respondents who have had a stroke.

heart_disease %>% group_by(toldhi2, cvdstrk3) %>% summarise(count = n())

## `summarise()` regrouping output by 'toldhi2' (override with `.groups` argument)

## # A tibble: 4 x 3
## # Groups:   toldhi2 [2]
##   toldhi2 cvdstrk3  count
##   <fct>   <fct>     <int>
## 1 Yes     Yes       12196
## 2 Yes     No       169548
## 3 No      Yes        6534
## 4 No      No       228896

prob_strk_giv_hbp <- nrow(heart_disease %>% filter(toldhi2 == "Yes", cvdstrk3 == "Yes"))/nrow(heart_disease %>% filter(toldhi2 == "Yes"))
prob_strk_giv_hbp

## [1] 0.06710538

prob_strk

## [1] 0.04489733

Similarly, the events of having had a stroke and having been diagnosed with high blood pressure are not independent. It also shows that the probability of having had a stroke increases given that the respondent also was diagnosed with high blood pressure.

We can also check the reverse conditional probabilities (prob hbp given ha or strk)

prob_hbp_giv_ha <- nrow(heart_disease %>% filter(toldhi2 == "Yes", cvdinfr4 == "Yes"))/nrow(heart_disease %>% filter(cvdinfr4 == "Yes"))
prob_hbp_giv_ha

## [1] 0.7019339

prob_hbp

## [1] 0.4356551

Since those who have had a heart attack seem to have significantly more incidence of also having been diagnosed with high blood pressure, high blood pressure may be a risk factor for heart attacks.

prob_hbp_giv_strk <- nrow(heart_disease %>% filter(toldhi2 == "Yes", cvdstrk3 == "Yes"))/nrow(heart_disease %>% filter(cvdstrk3 == "Yes"))
prob_hbp_giv_strk

## [1] 0.6511479

prob_hbp

## [1] 0.4356551

This also shows that high blood pressure could be a risk factor for those who have had a stroke. It seems that there may be associations between high blood pressure and heart attack and strokes. Since this is all observational data and since there aren’t any controls, we cannot determine causalit but there seems to be an association.

Exploring the BRFSS data

Setup

Load packages

Load data

Part 1: Data

Part 2: Research questions

Part 3: Exploratory data analysis