In this project, we carry out an exploratory analysis of the BRFSS-2013 data set by setting out research questions and then exploring the relationships between selected variables to answer those questions. To learn more about BRFSS and the dataset, visit this link.

The project was completed as part of Duke University’s ‘Introduction to Probability and Data’ online course on Coursera, the first course in the Statistics with R Specialization.


Load packages
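
The code for this step is not shown. Since the analysis below calls `%>%`, `group_by()` and `ggplot()`, a minimal sketch would be the following, assuming the dplyr and ggplot2 packages are installed:

```r
# Assumed packages, inferred from the functions used later in this report
library(ggplot2)  # plotting: ggplot(), geom_bar(), ggtitle(), xlab(), theme_bw()
library(dplyr)    # data manipulation: %>%, group_by()
```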


Load data
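
The loading code is not shown either. The sketch below assumes the data was saved as an `.RData` file containing a data frame named `brfss2013` (the name used throughout this report). The file name is hypothetical and should be adjusted to the actual location.

```r
# Hypothetical file name; adjust to wherever the BRFSS-2013 data is stored
data_file <- "brfss2013.RData"

if (file.exists(data_file)) {
  # load() restores the saved objects (assumed here to include `brfss2013`)
  load(data_file)
  print(dim(brfss2013))
} else {
  message("Data file not found: ", data_file)
}
```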


The Data

The BRFSS-2013 dataset was sampled from the non-institutionalised adult population (i.e. 18 years and older) residing in the US. The data was collected through landline and cellular-telephone based surveys.

Disproportionate stratified sampling, which is more efficient than simple random sampling, was used for the landline sample (source). The cellular sample was generated from randomly selected respondents, with an equal probability of selection.
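
As a rough illustration of the idea (not the actual BRFSS procedure), the sketch below draws from two made-up strata at different sampling rates and attaches the design weight each sampled unit would need. The strata, sizes and rates are all invented for the example.

```r
set.seed(1)

# Toy population with two strata of unequal size (invented for illustration)
population <- data.frame(
  id      = 1:1000,
  stratum = rep(c("A", "B"), times = c(800, 200))
)

# Disproportionate design: stratum B is sampled at a higher rate than A
rates <- c(A = 0.05, B = 0.20)

sampled <- do.call(rbind, lapply(names(rates), function(s) {
  rows   <- population[population$stratum == s, ]
  n_draw <- round(nrow(rows) * rates[[s]])
  picked <- rows[sample(nrow(rows), n_draw), ]
  # Each sampled unit stands in for 1 / rate units of its stratum
  picked$weight <- 1 / rates[[s]]
  picked
}))

table(sampled$stratum)
sum(sampled$weight)  # weighted total recovers the population size
```

Because the strata are sampled at different rates, unweighted summaries of such a sample would over-represent stratum B. The weights correct for that.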

As random sampling was used for both data collection methods, findings from the sample can be generalized to the population. On the other hand, as this is an observational study with no random assignment, it won’t be possible to make causal inferences from the data.

Research questions

Research question 1:

Are non-smoking heavy drinkers generally healthier than regular smokers who are not heavy drinkers?

While researching this, we’re trying to explore the impact of consuming alcohol vs smoking tobacco on a person’s health and see which is worse.

Research question 2:

Do people who sleep fewer hours than the average person also have more days than average with poor mental health?

Research has suggested that inadequate sleep has a negative effect on a person’s overall health. Here we try to determine if it also has a negative effect on their mental health.

Research question 3:

Are people who have completed higher levels of education more likely to consume fruits and vegetables once or more a day?

We might assume that more educated people live a healthier lifestyle, e.g. by exercising or eating nutritious food. We’ll try to find out whether that’s the case here by comparing education levels with fruit and vegetable consumption.

Exploratory data analysis

Research question 1:

Are non-smoking heavy drinkers generally healthier than regular smokers who are not heavy drinkers?

We’ll be using the following variables for this question:

  • genhlth: Respondent’s health, in general
  • _rfsmok3: Is the respondent a current smoker?
  • _rfdrhv4: Is the respondent a heavy drinker?
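
The codebook names above start with an underscore, which is not a syntactically valid R name. The `X_` prefix on these variables in the R code and output of this section is consistent with what base R’s `make.names()` produces. The short check below illustrates this. Whether the dataset was actually renamed this way is an assumption.

```r
# Codebook names for the three variables used in this question
codebook_names <- c("genhlth", "_rfsmok3", "_rfdrhv4")

# make.names() prepends "X" when a name does not start with a valid character
r_names <- make.names(codebook_names)
print(r_names)
```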

Types of the variables we’re dealing with:

## 'data.frame':    491775 obs. of  3 variables:
##  $ genhlth  : Factor w/ 5 levels "Excellent","Very good",..: 4 3 3 2 3 2 4 3 1 3 ...
##  $ X_rfsmok3: Factor w/ 2 levels "No","Yes": 1 1 2 1 1 1 1 2 1 1 ...
##  $ X_rfdrhv4: Factor w/ 2 levels "No","Yes": 1 1 2 1 1 1 1 1 1 1 ...

All of the above are categorical variables. General health is recorded on five levels (‘Excellent’ to ‘Poor’), while the smoking and heavy-drinking variables are binary (‘No’ or ‘Yes’).
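
The code that produced the structure listing is not shown. A listing of that kind is what `str()` prints, so here is a minimal sketch on a toy data frame (invented rows, same column names and factor levels as in the listing):

```r
# Toy stand-in for brfss2013; the values are invented for illustration
toy <- data.frame(
  genhlth   = factor(c("Fair", "Good", "Very good"),
                     levels = c("Excellent", "Very good", "Good", "Fair", "Poor")),
  X_rfsmok3 = factor(c("No", "Yes", "No"), levels = c("No", "Yes")),
  X_rfdrhv4 = factor(c("No", "No", "Yes"), levels = c("No", "Yes")),
  other_col = c(1, 2, 3)
)

# Keep only the three variables of interest and inspect their types
selected <- toy[, c("genhlth", "X_rfsmok3", "X_rfdrhv4")]
str(selected)

# Number of levels per factor
sapply(selected, nlevels)
```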

To begin, let’s check out our selected variables individually.

genhlth: General Health

total_obs <- nrow(brfss2013)

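# Count respondents per health level and express each count as a share of all observations
brfss2013 %>%
  group_by(genhlth) %>%
  summarise(count = n(), percentage = n() * 100 / total_obs)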
## # A tibble: 6 × 3
##     genhlth  count percentage
##      <fctr>  <int>      <dbl>
## 1 Excellent  85482 17.3823395
## 2 Very good 159076 32.3473133
## 3      Good 150555 30.6146103
## 4      Fair  66726 13.5684002
## 5      Poor  27951  5.6836968
## 6        NA   1985  0.4036399
ggplot(brfss2013, aes(x=genhlth)) + geom_bar() + ggtitle('General Health of Respondents') + xlab('General Health') + theme_bw()

Around 80% of the respondents in our dataset are in good health or better, and ‘Very good’ is the most common response. There are also some missing (NA) values, which we’ll deal with later as they are not useful for our analysis.
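
The “around 80%” figure can be checked directly from the percentages in the summary table above. The sketch below copies those values into a named vector (the vector name `pct` is only for illustration):

```r
# Percentages copied from the genhlth summary table (NA row omitted)
pct <- c(
  "Excellent" = 17.3823395,
  "Very good" = 32.3473133,
  "Good"      = 30.6146103,
  "Fair"      = 13.5684002,
  "Poor"      = 5.6836968
)

# Share of respondents reporting good health or better
good_or_better <- sum(pct[c("Excellent", "Very good", "Good")])
round(good_or_better, 1)

# Most common response
names(which.max(pct))
```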

_rfsmok3: Currently a smoker?

According to the codebook, respondents coded ‘Yes’ currently smoke every day or on some days, while those coded ‘No’ have either never smoked or do not smoke now.

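# Count respondents per smoking status and express each count as a share of all observations
brfss2013 %>%
  group_by(X_rfsmok3) %>%
  summarise(count = n(), percentage = n() * 100 / total_obs)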
## # A tibble: 3 × 3
##   X_rfsmok3  count percentage
##      <fctr>  <int>      <dbl>
## 1        No 399786  81.294494
## 2       Yes  76654  15.587210
## 3        NA  15335   3.118296
ggplot(brfss2013, aes(x=X_rfsmok3)) + geom_bar() + ggtitle('Smoking Status of Respondents') + xlab('Currently a smoker?')+ theme_bw()

More than 81% of the respondents are not current smokers, though they might have smoked earlier in their lifetimes.

_rfdrhv4: Heavy drinker?

The heavy drinker variable flags adult men having more than two drinks per day and adult women having more than one drink per day.
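
To make the rule concrete, here is a small sketch of the definition as stated. The variable already comes precomputed in the dataset, so the function `is_heavy_drinker` and its arguments are purely illustrative and are not part of BRFSS.

```r
# Illustrative only: applies the stated thresholds (men > 2, women > 1 drinks per day)
is_heavy_drinker <- function(sex, drinks_per_day) {
  threshold <- ifelse(sex == "Male", 2, 1)
  drinks_per_day > threshold
}

flags <- is_heavy_drinker(
  sex            = c("Male", "Male", "Female", "Female"),
  drinks_per_day = c(2, 3, 1, 1.5)
)
print(flags)
```

Note that the comparison is strict: exactly two drinks per day for a man, or exactly one for a woman, does not count as heavy drinking under this definition.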

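# Count respondents per drinking status and express each count as a share of all observations
brfss2013 %>%
  group_by(X_rfdrhv4) %>%
  summarise(count = n(), percentage = n() * 100 / total_obs)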
## # A tibble: 3 × 3
##   X_rfdrhv4  count percentage
##      <fctr>  <int>      <dbl>
## 1        No 442359  89.951502
## 2       Yes  25533   5.192009
## 3        NA  23883   4.856489
ggplot(brfss2013, aes(x=X_rfdrhv4)) + geom_bar() + ggtitle('Drinking Habits of Respondents') + xlab('Heavy Drinker?') +theme_bw()
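
The same count-and-percentage summary is repeated for each of the three variables, so it could be wrapped in a small helper. The sketch below is one possible way to do that, run on a toy data frame. The helper name `share_table` is hypothetical, and the `{{ }}` syntax assumes a reasonably recent version of dplyr.

```r
library(dplyr)

# Toy data with a few missing values (invented for illustration)
toy <- data.frame(
  X_rfsmok3 = factor(c("No", "No", "Yes", NA, "No"), levels = c("No", "Yes")),
  X_rfdrhv4 = factor(c("No", "Yes", "No", "No", NA), levels = c("No", "Yes"))
)

# Hypothetical helper: count rows per level of `var` and the share of all rows
share_table <- function(data, var) {
  total <- nrow(data)
  data %>%
    group_by({{ var }}) %>%
    summarise(count = n(), percentage = n() * 100 / total)
}

smok  <- share_table(toy, X_rfsmok3)
drink <- share_table(toy, X_rfdrhv4)
print(smok)
print(drink)
```

As in the tables above, missing values form their own NA group, so the percentages always add up to 100.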