By Bruce Hoppe. August 3, 2017.

Setup

Load packages

library(ggplot2)
library(dplyr)
library(tidyr)

Load data

load("brfss2013.RData")

Introduction

We explore some basic public health questions. How are sleep and health related? How are mental and physical health related? How do chronic diseases differ in their impact on general physical health? Our analysis relies on the BRFSS dataset. We observe some intriguing trends that deserve further investigation.


Part 1: Data

Our investigation uses the 2013 Behavioral Risk Factor Surveillance System (BRFSS) dataset. This dataset is collected regularly across the U.S. via telephone interviews with randomly selected landlines and cell phones. All 50 states and several U.S. territories participate fully in conducting interviews and compiling data. The BRFSS 2013 dataset includes results of 491775 interviews. Results from this dataset can be generalized to non-institutionalized adults 18+ who reside in the U.S. We cannot, however, draw any conclusions about causality from this dataset, because it is purely observational.


Part 2: Research questions

Each of our research questions is motivated by our desire to understand the most basic factors of public health. How do people perceive their general health? How are general health, physical health, mental health, and sleep related? How do prevalent chronic diseases affect general and physical health? With questions like these, we seek to inform a high-level strategy to prioritize future investments in health-related programs, products, and research.

Research question 1: Our first inquiry explores the relationship between sleep and general health. We hypothesize that people with abnormal sleep time suffer worse general health than people with normal sleep time. We investigate this for men and women, and in both cases we find evidence to support the hypothesis.

Research question 2: Our second inquiry explores the relationship between mental and physical health. We hypothesize that people who suffer more problems with mental health also suffer more problems with physical health. We further hypothesize a similar correlation with a person’s general health. We find evidence to support these hypotheses.

Research question 3: Our final inquiry explores the general and physical health of people who have ever been diagnosed with any of the following chronic diseases: skin cancer, other cancers, asthma, arthritis, depression, diabetes, and COPD. We hypothesize that each of these chronic diseases corresponds to a different distribution of general health among those diagnosed. Our motivation is to understand where there is greatest ongoing suffering due to a prevalent chronic disease. We do in fact observe that a diagnosis of one of the selected chronic diseases corresponds to much worse health than the others.


Part 3: Exploratory data analysis

Research question 1:

Respondents reported their general health by answering a simple multiple choice question. The results are in the variable genhlth:

brfss2013 %>% group_by(genhlth) %>% summarise(n=n())
## # A tibble: 6 x 2
##     genhlth      n
##      <fctr>  <int>
## 1 Excellent  85482
## 2 Very good 159076
## 3      Good 150555
## 4      Fair  66726
## 5      Poor  27951
## 6        NA   1985

This data looks reasonable. We will remove NAs from our analysis.

filtered_brfss2013 <- brfss2013 %>% filter(!is.na(genhlth))

It will be convenient to represent general health as a numeric score so that we can discuss the average general health of a population. Because genhlth is a factor, each response already carries an integer level code, which we can access with the function as.numeric. We store this numeric value in a new variable num_genhlth and observe the numeric values for each possible value of genhlth:

filtered_brfss2013$num_genhlth <- as.numeric(filtered_brfss2013$genhlth)
filtered_brfss2013 %>% group_by(num_genhlth, genhlth) %>% summarise(n=n())
## # A tibble: 5 x 3
## # Groups:   num_genhlth [?]
##   num_genhlth   genhlth      n
##         <dbl>    <fctr>  <int>
## 1           1 Excellent  85482
## 2           2 Very good 159076
## 3           3      Good 150555
## 4           4      Fair  66726
## 5           5      Poor  27951

Note that better general health corresponds to lower numeric values. That is perhaps counterintuitive, but we’ll proceed with the numeric values as given.
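
As a minimal sketch of what as.numeric does to a factor, the toy example below uses a small hypothetical vector (not the BRFSS data) with the same level order as genhlth. The names levels_gh and rev_genhlth are introduced here for illustration only; the reversed scale is optional and is not used in the rest of this analysis.

```r
# Toy factor mirroring the level order shown for genhlth (hypothetical data).
levels_gh <- c("Excellent", "Very good", "Good", "Fair", "Poor")
genhlth <- factor(c("Good", "Poor", "Excellent", "Very good", "Fair", "Good"),
                  levels = levels_gh)

# as.numeric() on a factor returns the position of each value in levels(),
# not a number parsed from the label.
num_genhlth <- as.numeric(genhlth)

# Optional: reverse the scale so that higher values mean better health.
rev_genhlth <- length(levels_gh) + 1 - num_genhlth

print(data.frame(genhlth, num_genhlth, rev_genhlth))
```

The mapping depends entirely on the order of the factor levels, so it is worth printing the pairs (as done above with group_by) before averaging the codes.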

Next we look at sleep time. Considering the 30 days prior to being interviewed, respondents reported how many hours they typically slept in a 24-hour period. The results are in variable sleptim1:

filtered_brfss2013 %>% group_by(sleptim1) %>% summarise(n=n())
## # A tibble: 26 x 2
##    sleptim1      n
##       <int>  <int>
##  1        0      1
##  2        1    228
##  3        2   1063
##  4        3   3466
##  5        4  14194
##  6        5  33290
##  7        6 105880
##  8        7 142090
##  9        8 140498
## 10        9  23688
## # ... with 16 more rows

This data looks OK except for the NAs and some outliers both large and small. Let’s visualize the distribution to help us determine outlier cutoffs.

filtered_brfss2013 %>% filter(!is.na(sleptim1)) %>% ggplot(aes(x=sleptim1)) + geom_histogram(binwidth = 1)

The distribution looks approximately normal. Without getting too sophisticated, we will simply define outliers to be sleep values less than 4 or greater than 10. We remove outliers with this:

filtered_brfss2013 <- filtered_brfss2013 %>% filter(!is.na(sleptim1), sleptim1>=4, sleptim1<=10)

Our hypothesis is that people with normal sleep time have better general health than people with abnormal sleep time. Let’s rephrase this more precisely. We see in the above distribution that the mode of sleep times is 7. Below we see that the median is 7 and the mean is very nearly 7:

filtered_brfss2013 %>% summarise(median_sl=median(sleptim1), mean_sl=mean(sleptim1))
##   median_sl  mean_sl
## 1         7 7.018916

Therefore, we will say 7 hours is our standard of normal sleep. Our new more precise hypothesis is that we expect people who sleep 7 hours per 24 hours to have the best average general health, and people who sleep more or less than 7 hours per 24 hours to have worse average general health. We furthermore expect average general health to degrade progressively with the gap (positive or negative) between sleep time and the standard 7 hours.

To evaluate our hypothesis we group our data by sleep time and determine the mean general health for each group.

filtered_brfss2013 %>% group_by(sleptim1) %>% summarise(n=n(), mean_gh=mean(num_genhlth))
## # A tibble: 7 x 3
##   sleptim1      n  mean_gh
##      <int>  <int>    <dbl>
## 1        4  14194 3.337678
## 2        5  33290 2.960829
## 3        6 105880 2.639299
## 4        7 142090 2.344275
## 5        8 140498 2.456092
## 6        9  23688 2.580252
## 7       10  12030 3.006234
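
As a cross-check, the mode, median, and mean of sleep time can be recomputed from the per-hour counts in the table above. This is only a sketch in base R; the variable names mode_sl, median_sl, and mean_sl are chosen here for illustration.

```r
# Counts copied from the table above (sleep hours 4 to 10).
sleptim1 <- 4:10
n <- c(14194, 33290, 105880, 142090, 140498, 23688, 12030)

# Mode: the sleep value with the largest count.
mode_sl <- sleptim1[which.max(n)]

# Mean: count-weighted average of the sleep values.
mean_sl <- weighted.mean(sleptim1, n)

# Median: first sleep value whose cumulative count reaches half the total.
median_sl <- sleptim1[which(cumsum(n) >= sum(n) / 2)[1]]

print(c(mode = mode_sl, median = median_sl, mean = mean_sl))
```

The three statistics agree with the values reported earlier (mode 7, median 7, mean 7.018916), which supports using 7 hours as the reference point.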

Our hypothesis is supported because the group sleeping 7 hours has the lowest mean_gh. (Remember that better general health translates to lower numeric general health scores.) This table will be easier to interpret visually. Before we chart it, we separate the analysis by gender:

filtered_brfss2013 %>% group_by(sex) %>% summarise(n=n())
## # A tibble: 3 x 2
##      sex      n
##   <fctr>  <int>
## 1   Male 193803
## 2 Female 277866
## 3     NA      1

It appears that women are overrepresented in our dataset. It would be interesting to determine why this is so, but for now we simply note that this will not affect our analysis (because we are averaging values for each gender). We will filter out the NA.

filtered_brfss2013 <- filtered_brfss2013 %>% filter(!is.na(sex))

gender_sleptim_gh <- filtered_brfss2013 %>% group_by(sleptim1,sex) %>% summarise( mean_gh=mean(num_genhlth))

gender_sleptim_gh
## # A tibble: 14 x 3
## # Groups:   sleptim1 [?]
##    sleptim1    sex  mean_gh
##       <int> <fctr>    <dbl>
##  1        4   Male 3.232639
##  2        4 Female 3.408517
##  3        5   Male 2.861427
##  4        5 Female 3.025895
##  5        6   Male 2.595615
##  6        6 Female 2.670991
##  7        7   Male 2.343612
##  8        7 Female 2.344764
##  9        8   Male 2.471812
## 10        8 Female 2.445658
## 11        9   Male 2.599569
## 12        9 Female 2.567832
## 13       10   Male 2.966879
## 14       10 Female 3.031557

Then we visualize this data with the following bar chart:

gender_sleptim_gh %>% ggplot(aes(sleptim1,mean_gh)) + geom_bar(aes(fill=sex), position="dodge", stat="identity") + ggtitle("Health vs. Sleep\nMean general health score vs. hours of sleep per 24 hours") + ylab("Mean general health score\n1=Excellent ... 5=Poor") + xlab("Hours of sleep per 24 hours")

We observe that for both men and women, those who reported 7 hours of sleep per 24 hours had the best general health on average. Fewer than 7 hours sleep corresponds to progressively worse general health. More than 7 hours also corresponds to worse general health (especially 10 hours). Again, these trends hold for both women and men. We also observe that sleep-deprived women seem to have generally worse health than their male counterparts.

We conclude that we have strong evidence supporting a correlation between sleep and health, with the healthiest people typically sleeping 7 hours (i.e., an average amount) per 24 hours. We recommend further study to determine causality.
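
The pattern described above can also be checked programmatically. The sketch below re-enters the mean scores from the gender table and tests, for each sex, whether 7 hours has the lowest mean score and whether scores worsen steadily on either side of it. The helper names best and v_shaped are hypothetical and not part of the original analysis.

```r
# Mean general health scores copied from the table above (1 = Excellent, 5 = Poor).
gh <- data.frame(
  sleptim1 = rep(4:10, each = 2),
  sex = rep(c("Male", "Female"), times = 7),
  mean_gh = c(3.232639, 3.408517, 2.861427, 3.025895, 2.595615, 2.670991,
              2.343612, 2.344764, 2.471812, 2.445658, 2.599569, 2.567832,
              2.966879, 3.031557)
)

# For each sex, find the sleep time with the lowest (best) mean score.
best <- sapply(split(gh, gh$sex), function(d) d$sleptim1[which.min(d$mean_gh)])

# For each sex, check that scores worsen monotonically moving away from 7 hours.
v_shaped <- sapply(split(gh, gh$sex), function(d) {
  d <- d[order(d$sleptim1), ]
  short <- d$mean_gh[d$sleptim1 <= 7]  # 4, 5, 6, 7 hours
  long <- d$mean_gh[d$sleptim1 >= 7]   # 7, 8, 9, 10 hours
  all(diff(short) < 0) && all(diff(long) > 0)
})

print(best)
print(v_shaped)
```

This only confirms the shape of the group means; it says nothing about statistical significance or causality.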

Research question 2: What is the relationship between mental and physical health? We expect that these are closely related. BRFSS respondents considered the 30 days prior to being interviewed and reported how many of those days were “not good” both mentally and physically. Our hypothesis is that people with more not good days of mental health will on average also suffer more not good days of physical health. Furthermore we hypothesize that general health will also degrade along with the number of not good days of mental (and physical) health.

Our data reside in the variables menthlth and physhlth:

brfss2013 %>% group_by(menthlth) %>% summarise(n=n())
## # A tibble: 34 x 2
##    menthlth      n
##       <int>  <int>
##  1        0 334461
##  2        1  15206
##  3        2  23520
##  4        3  13593
##  5        4   6660
##  6        5  16654
##  7        6   1861
##  8        7   6353
##  9        8   1244
## 10        9    198
## # ... with 24 more rows
brfss2013 %>% group_by(physhlth) %>% summarise(n=n())
## # A tibble: 33 x 2
##    physhlth      n
##       <int>  <int>
##  1        0 304265
##  2        1  20098
##  3        2  26793
##  4        3  15846
##  5        4   8356
##  6        5  14206
##  7        6   2425
##  8        7   9052
##  9        8   1544
## 10        9    353
## # ... with 23 more rows

We filter out the NAs for these two variables as well as genhlth, and we filter out responses reporting more than 30 not good days. We will also refresh our numeric version of general health:

filtered_brfss2013 <- brfss2013 %>% filter(!is.na(genhlth), !is.na(menthlth),!is.na(physhlth), menthlth<=30, physhlth<=30)
filtered_brfss2013$num_genhlth <- as.numeric(filtered_brfss2013$genhlth)

Let’s chart the distributions of menthlth and physhlth:

filtered_brfss2013 %>% ggplot(aes(x=menthlth)) + geom_histogram(binwidth = 1) + ggtitle("Mental health\nDistribution of # not good days in past 30 days") + ylab("Number of respondents") + xlab("# of not good mental health days in past 30 days")

filtered_brfss2013 %>% ggplot(aes(x=physhlth)) + geom_histogram(binwidth = 1) + ggtitle("Physical health\nDistribution of # not good days in past 30 days") + ylab("Number of respondents") + xlab("# of not good physical health days in past 30 days")

We see that menthlth and physhlth have strikingly similar distributions. Both are strongly right skewed, with a large majority of respondents reporting 0 not good days, and both show a second, smaller peak at the worst possible response of 30 not good days.

Before we analyze menthlth and physhlth in more depth, we check how they correlate with general health. We expect that general health will closely correlate to how many not good days a person has suffered with mental or physical health problems. We calculate the mean number of not good days (mentally and physically) for each level of general health in the following table:

filtered_brfss2013 %>% group_by(genhlth) %>% summarise(mean_mh=mean(menthlth),mean_ph=mean(physhlth),n=n())
## # A tibble: 5 x 4
##     genhlth   mean_mh    mean_ph      n
##      <fctr>     <dbl>      <dbl>  <int>
## 1 Excellent  1.365762  0.8779332  84380
## 2 Very good  1.980085  1.5162531 156216
## 3      Good  3.146355  3.2212872 145092
## 4      Fair  6.557414 10.6503034  61805
## 5      Poor 11.668142 23.0211585  25758

We see strong confirmation that general health correlates with the number of not good days of mental and physical health, just as we expected.

We proceed to our primary research interest of this section: We hypothesize a positive correlation between the number of not good mental health days and the number of not good physical health days.

To check our hypothesis, we group respondents according to how many not good days of mental health they reported. For each subgroup we then calculate the average number of not good days of physical health.

physical_vs_mental <- filtered_brfss2013 %>% group_by(menthlth) %>%  summarise(mean_ph=mean(physhlth),n=n())
physical_vs_mental
## # A tibble: 31 x 3
##    menthlth  mean_ph      n
##       <int>    <dbl>  <int>
##  1        0 2.970824 327602
##  2        1 2.634743  14973
##  3        2 3.522698  23130
##  4        3 3.884144  13370
##  5        4 4.114857   6556
##  6        5 5.284329  16393
##  7        6 5.500000   1814
##  8        7 5.284135   6240
##  9        8 6.605070   1223
## 10        9 6.401042    192
## # ... with 21 more rows

Let’s view this with an x-y scatter plot:

physical_vs_mental %>% ggplot(aes(x=menthlth,y=mean_ph)) + geom_point() + ggtitle("Physical health vs Mental health\nComparing # not good days in past 30 days") + ylab("Mean # not good physical days\nFor ppl w spec # mentl hlth days") + xlab("# not good mental days")

This scatter plot strongly suggests a linear correlation. However, it would be misleading to read the y axis of this chart as representing the physical health of some average individual with a given mental health profile. Recall that variable physhlth is highly skewed, with most respondents reporting 0 not good days. Anyone with more than zero not good physical health days is therefore, in some intuitive sense, “not average.” We could thus interpret the rising y-values of mean physical health to indicate an increasing fraction of “not average” people with more than zero not good physical health days.

We can see this better with a boxplot, which below shows the distribution of physhlth across each level of menthlth:

filtered_brfss2013 %>% select(physhlth,menthlth) %>% ggplot(aes(group=menthlth, x=menthlth,y=physhlth)) + geom_boxplot() + ggtitle("Physical health vs Mental health\nComparing # not good days in past 30 days") + ylab("Boxplot of # not good physical days\nFor ppl w spec # mntl hlth days") + xlab("# not good mental days")

The boxplot shows clearly how the variability of physical health scores widens progressively with increasing number of not good mental health days. We see that the median number of not good physical health days generally rises as well. This again supports our hypothesized positive correlation between mental and physical health. Further study is needed to better understand the relationship of mental and physical health and any associated causality.
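
One way to put a number on the trend is a rank correlation between menthlth and the group means of physhlth. The sketch below uses only the first ten rows of physical_vs_mental that are printed above, so it is a partial check rather than a result for the full table. Spearman's method is our choice for this illustration, picked because the underlying variables are heavily skewed.

```r
# First ten rows of physical_vs_mental as printed above; the remaining
# 21 rows are not shown in the output, so this is only a partial check.
menthlth <- 0:9
mean_ph <- c(2.970824, 2.634743, 3.522698, 3.884144, 4.114857,
             5.284329, 5.500000, 5.284135, 6.605070, 6.401042)

# Spearman's rank correlation measures monotonic association and does not
# assume a linear relationship or normally distributed values.
rho <- cor(menthlth, mean_ph, method = "spearman")
print(rho)
```

Note that a correlation computed on group means will generally look stronger than the correlation between individual responses, so this value should not be read as the respondent-level correlation.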

Research question 3: How do chronic diseases differ in their impact on general and physical health? Do people diagnosed with some chronic diseases feel on average worse than people diagnosed with other chronic diseases? We expect to find differences. Our aim is to highlight chronic diseases with the worst effects on general health and thereby help prioritize clinical research.

We explore seven chronic diseases, each of which is tracked by its own variable: skin cancer (chcscncr), other cancers (chcocncr), arthritis (havarth3), depression (addepev2), diabetes (diabete3), asthma (asthma3), and COPD (chccopd1). Based on a preliminary analysis, these chronic diseases are the seven with the most reported diagnoses in the BRFSS data.

We begin our analysis with the same filtered set of respondents from our previous question. (We are only looking at records with valid values for genhlth, menthlth, and physhlth.)

filtered_brfss2013 %>% group_by(chcscncr) %>% summarise(n=n())
## # A tibble: 3 x 2
##   chcscncr      n
##     <fctr>  <int>
## 1      Yes  43268
## 2       No 428813
## 3       NA   1170
filtered_brfss2013 %>% group_by(chcocncr) %>% summarise(n=n())
## # A tibble: 3 x 2
##   chcocncr      n
##     <fctr>  <int>
## 1      Yes  44560
## 2       No 427716
## 3       NA    975
filtered_brfss2013 %>% group_by(havarth3) %>% summarise(n=n())
## # A tibble: 3 x 2
##   havarth3      n
##     <fctr>  <int>
## 1      Yes 156173
## 2       No 314464
## 3       NA   2614
filtered_brfss2013 %>% group_by(addepev2) %>% summarise(n=n())
## # A tibble: 3 x 2
##   addepev2      n
##     <fctr>  <int>
## 1      Yes  90969
## 2       No 380486
## 3       NA   1796
filtered_brfss2013 %>% group_by(diabete3) %>% summarise(n=n())
## # A tibble: 5 x 2
##                                     diabete3      n
##                                       <fctr>  <int>
## 1                                        Yes  58752
## 2 Yes, but female told only during pregnancy   4438
## 3                                         No 401266
## 4    No, pre-diabetes or borderline diabetes   8103
## 5                                         NA    692
filtered_brfss2013 %>% group_by(asthma3) %>% summarise(n=n())
## # A tibble: 3 x 2
##   asthma3      n
##    <fctr>  <int>
## 1     Yes  64188
## 2      No 407704
## 3      NA   1359
filtered_brfss2013 %>% group_by(chccopd1) %>% summarise(n=n())
## # A tibble: 3 x 2
##   chccopd1      n
##     <fctr>  <int>
## 1      Yes  38079
## 2       No 432831
## 3       NA   2341

We will focus on the “Yes” responses for these variables (and ignore diabetic complications of pregnancy as more temporary than the other chronic conditions under consideration). We see that the number of respondents diagnosed with each disease varies from 156173 (arthritis) to 38079 (COPD). We next focus on the impact each disease has on general health and physical health.

We filter out NAs for our seven diseases.

filtered_brfss2013 <- filtered_brfss2013 %>% filter(!is.na(chcscncr), !is.na(chcocncr), !is.na(havarth3), !is.na(addepev2), !is.na(diabete3), !is.na(asthma3), !is.na(chccopd1))

For each of these seven chronic diseases, how do those who have ever been diagnosed feel about their general health?

disease <- c("general_pop","skin_cancer","other_cancer","asthma","arthritis","depression","diabetes","copd")

general_health <-
  c(mean(filtered_brfss2013$num_genhlth),
    mean(filtered_brfss2013$num_genhlth[filtered_brfss2013$chcscncr=="Yes"]),
    mean(filtered_brfss2013$num_genhlth[filtered_brfss2013$chcocncr=="Yes"]),
    mean(filtered_brfss2013$num_genhlth[filtered_brfss2013$asthma3=="Yes"]),
    mean(filtered_brfss2013$num_genhlth[filtered_brfss2013$havarth3=="Yes"]),
    mean(filtered_brfss2013$num_genhlth[filtered_brfss2013$addepev2=="Yes"]),
    mean(filtered_brfss2013$num_genhlth[filtered_brfss2013$diabete3=="Yes"]),
    mean(filtered_brfss2013$num_genhlth[filtered_brfss2013$chccopd1=="Yes"]))

For each disease, how do those ever diagnosed feel about their physical health? (On average, how many not good physical health days have they had in the past 30 days?)

physical_hlth <-
  c(mean(filtered_brfss2013$physhlth),
    mean(filtered_brfss2013$physhlth[filtered_brfss2013$chcscncr=="Yes"]),
    mean(filtered_brfss2013$physhlth[filtered_brfss2013$chcocncr=="Yes"]),
    mean(filtered_brfss2013$physhlth[filtered_brfss2013$asthma3=="Yes"]),
    mean(filtered_brfss2013$physhlth[filtered_brfss2013$havarth3=="Yes"]),
    mean(filtered_brfss2013$physhlth[filtered_brfss2013$addepev2=="Yes"]),
    mean(filtered_brfss2013$physhlth[filtered_brfss2013$diabete3=="Yes"]),
    mean(filtered_brfss2013$physhlth[filtered_brfss2013$chccopd1=="Yes"]))
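
The two vectors above repeat the same subsetting pattern once per disease. As a sketch of a more compact alternative, the example below wraps the pattern in a small helper and applies it to a hypothetical toy data frame with only two of the seven diagnosis columns. The names toy, disease_cols, mean_by_diagnosis, and avg_impact_toy are assumptions made for this illustration and do not appear in the original analysis.

```r
# Hypothetical toy data standing in for filtered_brfss2013.
toy <- data.frame(
  physhlth = c(0, 2, 30, 0, 10, 5, 0, 30),
  num_genhlth = c(1, 2, 5, 2, 4, 3, 1, 5),
  asthma3 = c("No", "Yes", "Yes", "No", "No", "Yes", "No", "No"),
  chccopd1 = c("No", "No", "Yes", "No", "Yes", "No", "No", "Yes")
)

# Map display names to diagnosis columns (a subset of the seven, for brevity).
disease_cols <- c(asthma = "asthma3", copd = "chccopd1")

# Mean of `outcome` among respondents answering "Yes" in each column,
# preceded by the mean over all respondents.
mean_by_diagnosis <- function(df, outcome, cols) {
  diagnosed <- sapply(cols, function(col) mean(df[[outcome]][df[[col]] == "Yes"]))
  c(general_pop = mean(df[[outcome]]), diagnosed)
}

avg_impact_toy <- data.frame(
  general_health = mean_by_diagnosis(toy, "num_genhlth", disease_cols),
  physical_hlth = mean_by_diagnosis(toy, "physhlth", disease_cols)
)
print(avg_impact_toy)
```

Like the original code, this assumes NAs in the diagnosis columns have already been filtered out; otherwise the logical subsetting would introduce NA values into the means.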

We store our results in a data frame, use mutate to ensure we preserve our ordering of the diseases, and then show our results (first with general health, then physical health):

avg_impact <- data.frame(disease,general_health,physical_hlth)
avg_impact <- avg_impact %>% mutate(disease=factor(disease,levels=c("general_pop","skin_cancer","other_cancer","asthma","arthritis","depression","diabetes","copd"),ordered=TRUE))
avg_impact
##        disease general_health physical_hlth
## 1  general_pop       2.542266      4.220651
## 2  skin_cancer       2.692021      5.393529
## 3 other_cancer       3.018849      7.132244
## 4       asthma       2.934620      7.201472
## 5    arthritis       3.015981      7.632811
## 6   depression       3.077846      8.484227
## 7     diabetes       3.364954      8.275621
## 8         copd       3.547810     11.532723
avg_impact %>% ggplot(aes(x=disease,y=general_health)) + geom_bar(stat="identity") + ggtitle("Average general health for those with a chronic disease") + ylab("Mean general health score\nFor people ever diagnosed with the disease\n1=Excellent ... 5=Poor") + xlab("Disease")

avg_impact %>% ggplot(aes(x=disease,y=physical_hlth)) + geom_bar(stat="identity") + ggtitle("Physical health for select chronic diseases:\nMean # not good physical health days for those with a diagnosis") + ylab("Mean # not good physical days in past 30 days\nFor those ever diagnosed with the disease") + xlab("Disease")

In terms of impact on both general health and physical health, the extremes from our selection of chronic diseases are skin cancer and COPD. COPD shows by far the worst impact. In contrast, people diagnosed with skin cancer have general and physical health scores closer to those of the general population than to the scores of those diagnosed with COPD.

This supports our hypothesis that chronic diseases vary substantially in their impacts on general health and physical health. In particular we see evidence that people diagnosed with COPD report markedly worse health than people diagnosed with the other chronic diseases we examined. We have not run a formal significance test, so these differences are descriptive.

Explaining the cause of these differences is beyond the scope of this analysis. We merely note that each disease has its own symptoms and treatments. We would expect severity of symptoms to correlate to degradation of health scores and effectiveness of treatments to correlate to improvement of health scores. Additional research would be needed to confirm this.

We end with one more analysis to explore in greater depth how those diagnosed with COPD differ from others in their physical health. Below we show the distribution of physical health responses for each disease, as well as for all respondents:

general_ph <- filtered_brfss2013 %>% select(physhlth)
skincancer_ph <- filtered_brfss2013 %>% filter(chcscncr=="Yes") %>% select(physhlth)
othercancer_ph <- filtered_brfss2013 %>% filter(chcocncr=="Yes") %>% select(physhlth)
asthma_ph <- filtered_brfss2013 %>% filter(asthma3=="Yes") %>% select(physhlth)
arthritis_ph <- filtered_brfss2013 %>% filter(havarth3=="Yes") %>% select(physhlth)
depression_ph <- filtered_brfss2013 %>% filter(addepev2=="Yes") %>% select(physhlth)
diabetes_ph <- filtered_brfss2013 %>% filter(diabete3=="Yes") %>% select(physhlth)
copd_ph <- filtered_brfss2013 %>% filter(chccopd1=="Yes") %>% select(physhlth)

general_ph$disease <- "general_pop"
skincancer_ph$disease <- "skin_cancer"
othercancer_ph$disease <- "other_cancer"
asthma_ph$disease <- "asthma"
arthritis_ph$disease <- "arthritis"
depression_ph$disease <- "depression"
diabetes_ph$disease <- "diabetes"
copd_ph$disease <- "copd"

all_ph <- rbind(general_ph,skincancer_ph,othercancer_ph,asthma_ph,arthritis_ph,depression_ph,diabetes_ph,copd_ph)

all_ph <- all_ph %>% mutate(disease=factor(disease,levels=c("general_pop","skin_cancer","other_cancer","asthma","arthritis","depression","diabetes","copd"),ordered=TRUE))

all_ph %>% ggplot(aes(x=physhlth,fill=disease))+
  geom_histogram(aes(color=disease, y=..density..),position="dodge",binwidth = 1) + facet_wrap(~disease,nrow=2,ncol=4) + ggtitle("Distribution of # not good physical days\nCompared across several chronic diseases") + xlab("# not good physical health days in past 30 days")

The physical health distributions have similar shapes in each case but with notable differences at the extreme left and right sides of the distributions. We are most interested in the far right: of people ever diagnosed, what fraction reported that all 30 of the past 30 days were not good? We see that approximately 25% of those ever diagnosed with COPD suffer in this way, which is noticeably higher than for any other disease on our list. (No other disease reaches 20%.) According to the NIH, 12 million U.S. adults are diagnosed with COPD and perhaps 12 million more have undiagnosed COPD. We recommend further research to verify our results and to assess opportunities to reduce the extreme burden of COPD.
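
The fractions at the two extremes can be computed directly instead of being read off the histograms. The sketch below does this on a small hypothetical data frame shaped like all_ph (a physhlth column plus a disease label); the values are invented for illustration and are not BRFSS results.

```r
# Hypothetical long-format data shaped like all_ph (invented values).
toy_ph <- data.frame(
  physhlth = c(0, 0, 0, 5, 30,  0, 10, 30, 2,  30, 30, 0, 10),
  disease = rep(c("general_pop", "asthma", "copd"), times = c(5, 4, 4))
)

# The mean of a logical vector is the share of TRUE values, so these give
# the share of each group reporting 0 and all 30 not good days.
share_0 <- tapply(toy_ph$physhlth == 0, toy_ph$disease, mean)
share_30 <- tapply(toy_ph$physhlth == 30, toy_ph$disease, mean)

print(round(cbind(share_0, share_30), 2))
```

Applied to all_ph, the same two tapply calls would give the exact share of respondents at 30 days for each disease group.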


Summary and Conclusion

We explored a few basic public health questions. We observed evidence that people who sleep close to 7 hours per 24 hours report the best general health on average. We saw evidence supporting a positive correlation between the number of not good mental health days and the number of not good physical health days. And we saw evidence that those diagnosed with COPD suffer worse health on average than those diagnosed with other common chronic diseases. We recommend further research to verify these findings and to assess the causality and mechanisms at play.