Observations in the sample are collected by random surveys over landline and personal cell phone calls. Previously only including adults, the survey was recently adjusted to include young adults in college. Each observation is self-reported by the survey taker. There are both categorical and numerical variables. In the scope of inference, since all variables are not experimental but just observational, causality cannot be determined for any variables. Thus with this data we can only assess for general associations. Since sampling was stratified geographically, any inferences or relations between variables may be generalized by state.The survey also includes some state-specific questions. These particular questions target the general state population. Otherwise, the questions asked to all states target the US adult population including territories outside of the 50 states.
Research quesion 1: Is there a relationship between sleep time and the self-reported general health of survey respondents? Perhaps sufficient sleep would be associated with general health/mood. Does the general health of respondents vary throughout the year, perhaps better in sunnier seasons and more poor in colder seasons (Seasonal Affective Disorder)?
Research quesion 2: Is there a relationship between health-related qualtiy of life and income?
Research quesion 3: Are those who have had a heart attack or stroke likely to have also been diagnosed with high blood pressure?
Research quesion 1: To see how the median of reported sleep time varies across reported general health we use the following:
sleep_vs_health <- brfss2013 %>% filter(!is.na(genhlth)) %>% filter(!is.na(sleptim1)) %>% group_by(genhlth)
sleep_vs_health_chart <- sleep_vs_health %>% summarise(mean_sleptim = mean(sleptim1), med_sleptim = median(sleptim1))## `summarise()` ungrouping output (override with `.groups` argument)
By the above graphs, it seems that by both the mean and median, there isn’t much difference in reported sleep time across reported general health.
To see if the reported general health changed throughout the year:
health_by_month <- brfss2013 %>% filter(!is.na(imonth), !is.na(genhlth)) %>% select(imonth, genhlth)
health_by_month %>% group_by(imonth, genhlth) %>% summarise(count = n())## `summarise()` regrouping output by 'imonth' (override with `.groups` argument)
## # A tibble: 60 x 3
## # Groups: imonth [12]
## imonth genhlth count
## <fct> <fct> <int>
## 1 January Excellent 6094
## 2 January Very good 11217
## 3 January Good 10354
## 4 January Fair 4473
## 5 January Poor 1923
## 6 February Excellent 7699
## 7 February Very good 13946
## 8 February Good 13058
## 9 February Fair 5656
## 10 February Poor 2345
## # … with 50 more rows
instead of viewing all the types of entries, I just want to see how the number of respondents answering “poor” changed over time.
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 12 x 2
## imonth poorcount
## <fct> <int>
## 1 January 1923
## 2 February 2345
## 3 March 2488
## 4 April 2418
## 5 May 2334
## 6 June 2196
## 7 July 2513
## 8 August 2422
## 9 September 2267
## 10 October 2420
## 11 November 2318
## 12 December 2307
There doesn’t seem to be any trend in poor responses by month.
To simplify the general health responses, I want to combine “good,” “very good,” and “excellent” into one category as “at least good”.
health_by_month <- health_by_month %>% mutate(healthasnumber = as.numeric(genhlth)) %>% mutate(atleastgood = ifelse(healthasnumber < 4, "yes", "no"))
head(health_by_month, n=10)## imonth genhlth healthasnumber atleastgood
## 1 January Fair 4 no
## 2 January Good 3 yes
## 3 January Good 3 yes
## 4 January Very good 2 yes
## 5 February Good 3 yes
## 6 March Very good 2 yes
## 7 March Fair 4 no
## 8 March Good 3 yes
## 9 April Excellent 1 yes
## 10 April Good 3 yes
now we can use atleastgood to combine the entries that are at least good by month and graph the overall changes and create new tables of just those who felt at least good and those who felt fair or worse.
## `summarise()` regrouping output by 'imonth' (override with `.groups` argument)
## # A tibble: 24 x 3
## # Groups: imonth [12]
## imonth atleastgood count
## <fct> <chr> <int>
## 1 January no 6396
## 2 January yes 27665
## 3 February no 8001
## 4 February yes 34703
## 5 March no 8422
## 6 March yes 35886
## 7 April no 8339
## 8 April yes 34427
## 9 May no 7981
## 10 May yes 32301
## # … with 14 more rows
## `summarise()` ungrouping output (override with `.groups` argument)
## `summarise()` ungrouping output (override with `.groups` argument)
Though the absolute count between at least good and fair or worse is quite different (see the scaling on the graphs), the trend is similar in both showing rises in the same months.
I want to also see if the ratio of good to bad health responses changes by month:
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 12 x 3
## imonth count good_to_bad_ratio
## <fct> <int> <dbl>
## 1 January 34061 4.33
## 2 February 42704 4.34
## 3 March 44308 4.26
## 4 April 42766 4.13
## 5 May 40282 4.05
## 6 June 37772 4.02
## 7 July 43515 4.17
## 8 August 42132 4.17
## 9 September 38391 4.03
## 10 October 42054 4.20
## 11 November 41756 4.26
## 12 December 40049 4.14
Overall there doesnt seem to be a relationship between the season and the general health of respondents.
Research quesion 2: To see what categories are available in the income variable and filter out NAs and then summarize median days of poor physical and mental health:
income_health <- brfss2013 %>% filter(!is.na(income2), !is.na(physhlth), !is.na(menthlth))
income_health %>% group_by(income2) %>% summarise(count = n(), med_phys = median(physhlth), med_ment = median(menthlth))## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 8 x 4
## income2 count med_phys med_ment
## <fct> <int> <dbl> <dbl>
## 1 Less than $10,000 23900 2 0
## 2 Less than $15,000 25247 2 0
## 3 Less than $20,000 33264 0 0
## 4 Less than $25,000 40078 0 0
## 5 Less than $35,000 47263 0 0
## 6 Less than $50,000 60049 0 0
## 7 Less than $75,000 64164 0 0
## 8 $75,000 or more 114421 0 0
It appears that the median days of poor physical health is higher for severely lower income survey responders.
Since the interquartile range for higher income respondents is more narrow, we know that there is less variability in those categories. The category of people earning less than $10,000 shows a large interquartile range, indicating that a large portion of the category is in that wide range. Although the median is only slightly higher than that of the higher income respondents, it seems that there is more variability in the number of days a person was in poor health if they are also low income.
there seems to be one obvious outlier in the data that’s making the y-axis range much larger than it should be so to remove that one outlier and graph again:
income_health <- income_health %>% filter(menthlth < 200)
ggplot(data = income_health, aes(x = income2, y = menthlth)) + geom_boxplot() Although the median days of having poor mental health doesnt change across income brackets, the IQR varies a lot with much less variation (more respondents answering in low numbers) in the higher income brackets. This implies that there are more respondents in the lower income brackets answering with a high number of days where they experiences poor mental health.
Research quesion 3: Are those who have had a heart attack or stroke likely to have also been diagnosed with high blood pressure?
Discounting respondents who didn’t answer whether or not they have been diagnosed with high blood pressure (filtering out NAs), we can calculate the probability of randomly selecting a respondent who was diagnosed with high bp.
## toldhi2 cvdinfr4 cvdstrk3
## 1 Yes No No
## 2 No No No
## 3 No No No
## 4 Yes No No
## 5 No No No
## 6 Yes No No
## 7 No No No
## 8 Yes No No
## 9 Yes No No
## 10 No No No
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 2 x 2
## toldhi2 count
## <fct> <int>
## 1 Yes 181744
## 2 No 235430
## [1] 0.4356551
following the same method, we’ll find the probability of randomly selecting a respondent who has had a heart attack and stroke respectively.
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 2 x 2
## cvdinfr4 count
## <fct> <int>
## 1 Yes 27665
## 2 No 389509
## [1] 0.06631525
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 2 x 2
## cvdstrk3 count
## <fct> <int>
## 1 Yes 18730
## 2 No 398444
## [1] 0.04489733
to get conditional probabilities we need to group by two factors:
## `summarise()` regrouping output by 'toldhi2' (override with `.groups` argument)
## # A tibble: 4 x 3
## # Groups: toldhi2 [2]
## toldhi2 cvdinfr4 count
## <fct> <fct> <int>
## 1 Yes Yes 19419
## 2 Yes No 162325
## 3 No Yes 8246
## 4 No No 227184
prob_ha_giv_hbp <- nrow(heart_disease %>% filter(toldhi2 == "Yes", cvdinfr4 == "Yes"))/nrow(heart_disease %>% filter(toldhi2 == "Yes"))
prob_ha_giv_hbp## [1] 0.1068481
## [1] 0.06631525
Since the probability of having had a heart attack given that the respondent is also diagnosed with high blood pressure is different from the probability of randomly selecting a respondent who had a heart attack from all respondents, these two events are not independent. It also shows that the probability of having had a heart attack increases given that the respondent also was diagnosed with high blood pressure.
We can do the same test with respondents who have had a stroke.
## `summarise()` regrouping output by 'toldhi2' (override with `.groups` argument)
## # A tibble: 4 x 3
## # Groups: toldhi2 [2]
## toldhi2 cvdstrk3 count
## <fct> <fct> <int>
## 1 Yes Yes 12196
## 2 Yes No 169548
## 3 No Yes 6534
## 4 No No 228896
prob_strk_giv_hbp <- nrow(heart_disease %>% filter(toldhi2 == "Yes", cvdstrk3 == "Yes"))/nrow(heart_disease %>% filter(toldhi2 == "Yes"))
prob_strk_giv_hbp## [1] 0.06710538
## [1] 0.04489733
Similarly, the events of having had a stroke and having been diagnosed with high blood pressure are not independent. It also shows that the probability of having had a stroke increases given that the respondent also was diagnosed with high blood pressure.
We can also check the reverse conditional probabilities (prob hbp given ha or strk)
prob_hbp_giv_ha <- nrow(heart_disease %>% filter(toldhi2 == "Yes", cvdinfr4 == "Yes"))/nrow(heart_disease %>% filter(cvdinfr4 == "Yes"))
prob_hbp_giv_ha## [1] 0.7019339
## [1] 0.4356551
Since those who have had a heart attack seem to have significantly more incidence of also having been diagnosed with high blood pressure, high blood pressure may be a risk factor for heart attacks.
prob_hbp_giv_strk <- nrow(heart_disease %>% filter(toldhi2 == "Yes", cvdstrk3 == "Yes"))/nrow(heart_disease %>% filter(cvdstrk3 == "Yes"))
prob_hbp_giv_strk## [1] 0.6511479
## [1] 0.4356551
This also shows that high blood pressure could be a risk factor for those who have had a stroke. It seems that there may be associations between high blood pressure and heart attack and strokes. Since this is all observational data and since there aren’t any controls, we cannot determine causalit but there seems to be an association.