Setup

Load packages

library(ggplot2)
library(dplyr)

Load data

load("brfss2013.RData")

Data

The BRFSS-2013 data were collected through annual landline and cellular telephone surveys of the non-institutionalized adult population residing in the US, in order to measure factors linked to disease. In the landline telephone survey, data are collected from a randomly selected adult in a household. In the cellular telephone survey, interviewees are adults who participate by using a cellular telephone and reside in a private residence or college housing.

Generalizability, causality and bias

Generalizability: The full data set contains 491775 observations of 330 variables. Both surveys use random sampling to collect information, all of the states of the United States participate, and the survey is administered and supported by the CDC, so the results from exploring this dataset can be considered generalizable to the non-institutionalized adult population of the US.

Causality: Since no random assignment was used, we cannot establish any causal relationship based on this data set.

The following are some concerns about bias in the surveys: 1) Residents who don’t use landline telephones or cellphones were not included in the sample. 2) Residents who use landline phones but were not available when the survey was conducted were not included in the sample. 3) …

Unless these conditions are taken into account, conclusions based on this dataset could be misleading.


Part 2: Research questions

Research question 1: What does the distribution of general health status look like across the states, and is there any difference between genders?

To answer the question, we will use the following variables: 1) X_state: state of residence of the respondent 2) genhlth: general health 3) sex: respondent’s sex

Research question 2: Among all the behavioral risk factors, here we study whether people who sleep longer tend to be mentally healthier.

Our analysis therefore uses the following variables: 1) menthlth: number of days in a month on which the respondent’s mental health was not good 2) sleptim1: hours of sleep in a day

Research question 3: Here we look at whether having many children at home is related to whether people exercise. Along with the general distribution, it is also interesting to explore whether there are any gender differences.

This analysis uses the following variables in the dataset: 1) X_chldcnt: number of children in the household 2) exerany2: whether the respondent exercised in the past 30 days 3) sex: respondent’s sex


Part 3: Exploratory data analysis

Research question 1: Since we are interested in what the distribution of general health status looks like in different states and whether it is reported differently according to gender, we first want to take a look at the values of these three variables.

First we need to find the name of the variable that records the state where the respondent resides. This can be done using the grep() function in R.

grep("state", names(brfss2013), value = TRUE)
## [1] "X_state"  "stateres" "cstate"
brfss2013  %>% 
  select(X_state, stateres, cstate) %>% 
  str()
## 'data.frame':    491775 obs. of  3 variables:
##  $ X_state : Factor w/ 55 levels "0","Alabama",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ stateres: Factor w/ 1 level "Yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ cstate  : Factor w/ 2 levels "Yes","No": NA NA NA NA NA NA NA NA NA NA ...

Therefore we can now confirm that the X_state variable is what we are looking for.
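
As a further check, a candidate variable's values can be tabulated before committing to it. The sketch below uses a tiny made-up data frame (`toy` and its values are hypothetical stand-ins for brfss2013, not survey data) to show the pattern: search the column names, then count the values of each match with NAs kept visible.

```r
# Hypothetical stand-in for brfss2013; the values are invented for illustration.
toy <- data.frame(
  X_state  = factor(c("Alabama", "Alabama", "Alaska", NA)),
  stateres = factor(c("Yes", "Yes", "Yes", "Yes")),
  cstate   = factor(c(NA, NA, "Yes", "No"))
)

# Column names containing "state", as in the grep() call above.
candidates <- grep("state", names(toy), value = TRUE)

# Count each value of every candidate column, keeping NAs visible.
level_counts <- lapply(toy[candidates], table, useNA = "ifany")
level_counts$X_state
```

A column dominated by NAs or by a single value, as cstate and stateres are in the str() output above, is unlikely to be the one that identifies the state.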

brfss2013%>% 
  select(X_state, sex, genhlth) %>% 
  str()
## 'data.frame':    491775 obs. of  3 variables:
##  $ X_state: Factor w/ 55 levels "0","Alabama",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ sex    : Factor w/ 2 levels "Male","Female": 2 2 2 2 1 2 2 2 1 2 ...
##  $ genhlth: Factor w/ 5 levels "Excellent","Very good",..: 4 3 3 2 3 2 4 3 1 3 ...

From the above we can see that the variables of interest are all categorical. X_state has 55 levels (these are not all states: they include the District of Columbia and Puerto Rico as well as a placeholder level “0”), sex has 2 levels, and health status is categorized into five levels: Excellent, Very good, Good, Fair, and Poor.

Now we want to filter out NAs and take a closer look at the valid responses.

q1 <- select(brfss2013,X_state, sex, genhlth) %>% 
  na.omit()                                     #filter out NAs

ggplot(q1, aes(x= genhlth)) +
  geom_bar(width = 0.5) +
  ggtitle('General Health of Respondents') +    #add title for graph
  xlab('General Health') 

totalnum <- nrow(q1)                            # number of complete responses

healthy <- c("Excellent","Very good","Good")    # define healthy

  q1%>%
  mutate(hlth = ifelse(genhlth %in% healthy, "healthy","unhealthy")) %>%
  group_by(hlth)   %>%
  summarise( hlth_rate = n()/totalnum)
## # A tibble: 2 x 2
##   hlth      hlth_rate
##   <chr>         <dbl>
## 1 healthy       0.807
## 2 unhealthy     0.193
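
The same rates can be obtained without a separately stored total, by dividing each group's count by the sum of the counts. This is a minimal sketch on invented data (`toy_q1` is hypothetical, not the survey data):

```r
library(dplyr)

# Hypothetical responses standing in for q1$genhlth.
toy_q1 <- data.frame(
  genhlth = c("Excellent", "Very good", "Good", "Good", "Fair",
              "Poor", "Very good", "Good", "Excellent", "Fair")
)
healthy <- c("Excellent", "Very good", "Good")

toy_rates <- toy_q1 %>%
  mutate(hlth = ifelse(genhlth %in% healthy, "healthy", "unhealthy")) %>%
  count(hlth) %>%                    # one row per group with its count n
  mutate(hlth_rate = n / sum(n))     # share of all rows in each group
toy_rates
```

Keeping the count column n next to the rate also makes it easy to see how many respondents each share is based on.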

We can see that responses are concentrated in the better health categories, with a thinner tail toward Fair and Poor, and that approximately 80% of respondents in this dataset report good health or better.

As for geographical differences, we compute the share of healthy respondents in each state and rank the states:

 q1%>%
  mutate(hlth = ifelse(genhlth %in% healthy, "healthy","unhealthy")) %>%
  group_by(X_state) %>%
  summarise(state_hlth_rate = sum(hlth=="healthy")/n()) %>%
  arrange(desc(state_hlth_rate))
## # A tibble: 53 x 2
##    X_state              state_hlth_rate
##    <fct>                          <dbl>
##  1 Minnesota                      0.867
##  2 Vermont                        0.865
##  3 Utah                           0.861
##  4 Colorado                       0.859
##  5 New Hampshire                  0.853
##  6 District of Columbia           0.853
##  7 Connecticut                    0.852
##  8 Hawaii                         0.849
##  9 Alaska                         0.846
## 10 North Dakota                   0.845
## # ... with 43 more rows
 q1dis<- q1 %>%
      mutate(hlth = ifelse(genhlth %in% healthy, "healthy","unhealthy")) %>%
      select(X_state, hlth) 

  ggplot(data = q1dis, aes(x = X_state, fill = hlth)) +
  geom_bar() +
  ggtitle('General Health Distribution')  +    #add title for graph
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  xlab('States')
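
Because the mean of a logical vector is the share of TRUE values, the per-state rate and its range can be computed compactly. The sketch below uses invented rows (`toy_states` and the state labels "A" and "B" are hypothetical):

```r
library(dplyr)

# Hypothetical rows standing in for q1.
toy_states <- data.frame(
  X_state = c("A", "A", "A", "A", "B", "B", "B", "B"),
  genhlth = c("Excellent", "Good", "Fair", "Good",
              "Poor", "Fair", "Good", "Poor")
)
healthy <- c("Excellent", "Very good", "Good")

state_rates <- toy_states %>%
  group_by(X_state) %>%
  summarise(state_hlth_rate = mean(genhlth %in% healthy),  # share healthy
            n = n()) %>%                                   # respondents per state
  arrange(desc(state_hlth_rate))

range(state_rates$state_hlth_rate)   # lowest and highest state-level rate
```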

From the above analysis we can see that the share of healthy respondents varies considerably across states, ranging from 60% to 86%. Now we want to investigate whether there is a gender difference in health status. Let’s sample six states from the top-ranking, middle-ranking and lower-ranking groups:

sample_states <- c("Minnesota","Vermont","Pennsylvania","Arizona","Mississippi","Puerto Rico")  
q1_states <- filter(q1,X_state %in% sample_states)

gen <- q1_states%>%
  mutate(hlth = ifelse(genhlth %in% healthy, "healthy","unhealthy")) %>%
  group_by(sex,X_state,hlth) 
  
ggplot(data = gen, aes(x = sex, fill = hlth)) +
  geom_bar() +
  facet_grid(.~X_state) +
  xlab("Sex") + 
  ylab("Count") + 
  scale_fill_discrete(name="Health status")

gen %>%
  group_by(X_state,sex) %>%
  summarize(proportion = sum(hlth == "healthy")/n()) #%>%
## # A tibble: 12 x 3
## # Groups:   X_state [?]
##    X_state      sex    proportion
##    <fct>        <fct>       <dbl>
##  1 Arizona      Male        0.799
##  2 Arizona      Female      0.790
##  3 Minnesota    Male        0.864
##  4 Minnesota    Female      0.869
##  5 Mississippi  Male        0.744
##  6 Mississippi  Female      0.686
##  7 Pennsylvania Male        0.828
##  8 Pennsylvania Female      0.796
##  9 Vermont      Male        0.856
## 10 Vermont      Female      0.872
## 11 Puerto Rico  Male        0.680
## 12 Puerto Rico  Female      0.567
  #arrange(desc(proportion))
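
To compare the sexes within each state, the proportions printed above can be laid out side by side and subtracted. The values below are copied from that output; `gen_tab`, `wide` and `gap` are names introduced here for illustration only.

```r
# Proportions copied from the table above
# (share of respondents reporting good health or better).
gen_tab <- data.frame(
  X_state = rep(c("Arizona", "Minnesota", "Mississippi",
                  "Pennsylvania", "Vermont", "Puerto Rico"), each = 2),
  sex = rep(c("Male", "Female"), times = 6),
  proportion = c(0.799, 0.790, 0.864, 0.869, 0.744, 0.686,
                 0.828, 0.796, 0.856, 0.872, 0.680, 0.567)
)

# One row per state, one column per sex.
wide <- xtabs(proportion ~ X_state + sex, data = gen_tab)

# Positive values: men report good health more often than women.
gap <- wide[, "Male"] - wide[, "Female"]
round(sort(gap), 3)
```

With these six states the gap is positive in four (Arizona, Mississippi, Pennsylvania and Puerto Rico) and negative in two (Minnesota and Vermont).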

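Thus, to answer the first question, approximately 80% of respondents report a health status of “good” or better, and this share varies considerably across the states. The gender difference is not consistent: among the six sampled states, men report good health more often than women in Arizona, Mississippi, Pennsylvania and Puerto Rico, while women do so slightly more often in Minnesota and Vermont.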

Research question 2:

First, we’d like to see a brief summary of people’s sleep hours and mental health.

q2 <- brfss2013 %>%
   select(sleptim1,menthlth,sex) %>% na.omit()

q2brief <- q2 %>%
  summarize(ave_slep = mean(sleptim1),ave_unhlth =mean(menthlth),sd(sleptim1),sd(menthlth))
 q2brief
##   ave_slep ave_unhlth sd(sleptim1) sd(menthlth)
## 1 7.052296   3.340112     1.458264     7.648201
ggplot(q2, aes(x = sleptim1)) +
  geom_bar() +
  ggtitle('Sleeping hours distribution')  +    #add title for graph
  xlab('Sleep Hours in a Day') 

ggplot(q2, aes(x = menthlth)) +
  geom_bar() +
  ggtitle('Distribution of mentally unhealthy days')  +    #add title for graph
  xlab('Mentally unhealthy days in the past 30 days') 

Now let’s see if there is any relationship between these two variables. Here we define respondents who report more than 4 days of poor mental health in the month as mentally unhealthy.

q2_relate <-q2 %>%
   mutate(men_status = ifelse(menthlth <= 4,"healthy","unhealthy")) %>%
   select(sleptim1,men_status)

q2_ave <- q2_relate%>%
   group_by(sleptim1) %>%
   summarise(rate = sum(men_status=="healthy")/n()) %>%
   arrange(desc(rate))
q2_ave
## # A tibble: 24 x 2
##    sleptim1  rate
##       <int> <dbl>
##  1        8 0.872
##  2        7 0.865
##  3        9 0.844
##  4        6 0.778
##  5       10 0.753
##  6       24 0.719
##  7       11 0.703
##  8        5 0.676
##  9       16 0.671
## 10       19 0.667
## # ... with 14 more rows
   ggplot(q2_ave,aes(x=sleptim1,y=rate)) +
   geom_bar(stat = "identity", width = 0.8) +
   ggtitle('Mentally healthy rate by sleep hours')  +    #add title for graph
   xlab('Sleep hours in a day') 
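
Rates for rarely reported sleep durations rest on few respondents, so it can help to carry the group size alongside each rate. This is a sketch with invented data (`toy_q2` is hypothetical; the 4-day cutoff follows the definition above):

```r
library(dplyr)

# Hypothetical rows standing in for q2.
toy_q2 <- data.frame(
  sleptim1 = c(7, 7, 7, 7, 8, 8, 8, 4, 4, 24),
  menthlth = c(0, 2, 10, 0, 0, 1, 3, 15, 2, 30)
)

toy_rate <- toy_q2 %>%
  group_by(sleptim1) %>%
  summarise(
    n    = n(),                    # respondents reporting this many hours
    rate = mean(menthlth <= 4)     # share classed as mentally healthy
  ) %>%
  arrange(sleptim1)
toy_rate
```

A rate computed from a handful of respondents, such as the single 24-hour row in this toy example, should be read with more caution than one computed from thousands.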

From our analysis, we observe that: 1) Respondents sleep around 7 hours a day on average. 2) The distribution of sleep hours is roughly bell-shaped around that value (standard deviation about 1.5 hours), though, being whole-hour counts, it is only approximately normal. 3) Respondents who report very short or very long sleep are less often classed as mentally healthy, while those who sleep between 7 and 9 hours have the highest rates (about 84% to 87%).

So these two variables appear to be related. This is an association only: establishing a causal relationship between them would require an experiment with random assignment.

Research question 3:

To answer the question whether having many children at home will have an impact on people’s choice to do exercises, we first want to take a look at the distribution of this variable:

q3 <- brfss2013 %>%
  select(X_chldcnt,exerany2,sex) %>% na.omit()

total3 <- count(q3)%>%
as.numeric() 

q3 %>%
  group_by(X_chldcnt) %>%
  summarise(count = n(), proportion = n()/total3) 
## # A tibble: 6 x 3
##   X_chldcnt                           count proportion
##   <fct>                               <int>      <dbl>
## 1 No children in household           336378    0.737  
## 2 One child in household              49174    0.108  
## 3 Two children in household           43074    0.0944 
## 4 Three children in household         18279    0.0401 
## 5 Four children in household           6380    0.0140 
## 6 Five or more children in household   3086    0.00676
  ggplot(q3, aes(x=X_chldcnt)) +
  geom_bar(width = 0.8) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) 

  #facet_grid(.~sex) +
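
The proportions in the table can be reproduced from the counts alone with base R's prop.table(). The counts below are copied from the output above; `chld_counts` and `chld_props` are names introduced here, and the group labels are shortened for readability.

```r
# Counts copied from the table above.
chld_counts <- c(
  "No children"    = 336378,
  "One child"      = 49174,
  "Two children"   = 43074,
  "Three children" = 18279,
  "Four children"  = 6380,
  "Five or more"   = 3086
)

chld_props <- prop.table(chld_counts)   # each count divided by the total
round(chld_props, 3)

# Share of respondents with at least one child in the household.
sum(chld_props[-1])
```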

From the graph we can see that the distribution of the number of children living in the respondents’ households is right skewed. Now we want to take a look at whether having many kids is related to whether people exercise:

q3 %>% group_by(X_chldcnt) %>%
  summarise(count = sum(exerany2 == "Yes"), proportion = count/n())
## # A tibble: 6 x 3
##   X_chldcnt                           count proportion
##   <fct>                               <int>      <dbl>
## 1 No children in household           240500      0.715
## 2 One child in household              36702      0.746
## 3 Two children in household           33234      0.772
## 4 Three children in household         13909      0.761
## 5 Four children in household           4769      0.747
## 6 Five or more children in household   2268      0.735
q3 %>%
  ggplot(aes(x = X_chldcnt,fill = exerany2)) +
  geom_bar(position = "fill") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  ggtitle('How many did exercise within past 30 days') +
  xlab("children in household") +
  ylab('Proportion')
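
To put a number on how much the exercise rate varies, the counts from the two tables above can be combined directly. The figures are copied from those outputs; `n_total`, `n_yes`, `rate_none` and `rate_any` are names introduced here, and the none-versus-any split is an extra view added for illustration, not part of the original analysis.

```r
# Counts copied from the two tables above: respondents per group and
# those answering "Yes" to exerany2, in the same group order.
n_total <- c(336378, 49174, 43074, 18279, 6380, 3086)
n_yes   <- c(240500, 36702, 33234, 13909, 4769, 2268)

rate <- n_yes / n_total
round(range(rate), 3)               # smallest and largest group rate

# Collapse to "no children" versus "at least one child".
rate_none <- n_yes[1] / n_total[1]
rate_any  <- sum(n_yes[-1]) / sum(n_total[-1])
round(c(none = rate_none, any = rate_any), 3)
```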

Therefore, we see that there’s not much difference in the share of respondents who exercise: it stays between roughly 72% and 77% whatever the number of children. Now we further analyze whether there’s any gender difference:

q3 %>%
  ggplot(aes(x = sex,fill = exerany2)) +
  geom_bar(position = "fill") +
  facet_grid(.~X_chldcnt) +
  xlab("Sex (panels: children in household)") +
  ylab('Proportion')
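
The proportions behind a faceted fill chart like this one can also be tabulated. This is a sketch on invented data (`toy_q3` is hypothetical, with shortened group labels) using a three-way table normalised within each combination of sex and number of children:

```r
# Hypothetical rows standing in for q3.
toy_q3 <- data.frame(
  sex       = c("Male", "Male", "Male", "Female", "Female", "Female",
                "Male", "Male", "Female", "Female"),
  X_chldcnt = c("None", "None", "None", "None", "None", "None",
                "One", "One", "One", "One"),
  exerany2  = c("Yes", "Yes", "No", "Yes", "No", "No",
                "Yes", "No", "Yes", "Yes")
)

counts <- table(toy_q3$sex, toy_q3$X_chldcnt, toy_q3$exerany2,
                dnn = c("sex", "children", "exercise"))

# Divide each count by its sex-by-children total, so Yes + No = 1 per cell.
shares <- prop.table(counts, margin = c(1, 2))
round(shares[, , "Yes"], 2)
```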

And now we have the answer to the last question: having children, or having many children, does not appear to be strongly associated with whether people exercise, and there is not much difference between genders either.

Summary

To go back to the three questions, our answers are as follows: Q1: What does the distribution of health status look like across the states, and is there any difference between genders? 1) Approximately 80% of respondents report a health status of “good” or better. 2) The share varies considerably between states. 3) There is no consistent gender difference: in four of the six sampled states men report good health more often than women, and in the other two women do so slightly more often.

Q2: Do people who sleep longer tend to be mentally healthier? 1) Respondents sleep around 7 hours a day on average. 2) The distribution of sleep hours is roughly bell-shaped around that value, though only approximately normal. 3) Respondents who report very short or very long sleep are less often classed as mentally healthy, while those who sleep between 7 and 9 hours have the highest rates.

Q3: Is having children, or having many children, related to whether people exercise? Are there any gender differences? 1) Having children, or having many children, does not appear to be strongly associated with whether people exercise. 2) There is not much difference between genders either.