Setup

Load packages

library(ggplot2)
library(dplyr)

Load data

load("brfss2013.RData")

Part 1: Data

The Behavioral Risk Factor Surveillance System (BRFSS) is a collaborative project between all of the states in
the United States (US) and participating US territories and the Centers for Disease Control and Prevention (CDC) The BRFSS is administered and supported by CDC’s Population Health Surveillance Branch, under the Division of Population Health at the National Center for Chronic Disease Prevention and Health Promotion.

BRFSS is an ongoing surveillance system designed to measure behavioral risk factors for the non-institutionalized adult population (18 years of age and older) residing in the US.Bias to data can exist. This data source is
valid and generalizable. The reliability is around 75% and compares well with other surveys of similar nature.

The BRFSS objective is to collect uniform, state-specific include tobacco use, HIV/AIDS knowledge and prevention,exercise, immunization, health status, healthy days - health-related quality of life, health care access,inadequate sleep, hypertension awareness, cholesterol awareness, chronic health conditions, alcohol consumption, fruits and vegetables consumption, arthritis burden, and seatbelt use.

Dataset Notes

The dataset is provided in both Stata (.dta) and R Workspace (.Rdata) formats. Categorical values are factors in the R workspace, and value labels are attached in the Stata version (except when a categorical variable contains more than 50 categories.

All missing values are coded NA in the R Workspace. For Stata, missing values are coded missing using the following codes: . (numbers) or empty text field: BLANK; .a: Don’t know/Not sure; .b: Refused; .c: Zero (none); .d: Don’t know/Not Sure Or Refused/Missing

Many variables, such as age, race, education, as well as variables that measure counts of events (drinks, times eating fruit, etc.) have alternate versions in the Calculated Variables section of the dataset. Review this section prior to choosing variables for analysis.


Part 2: Research questions

Research quesion 1: What is the health Status of the population by Age,Sex and activity ?

Research quesion 2: Is it possible to derive some relationship between the number of hours one sleeps and his health condition like heart attack, Depressive Disorders etc ?

Research quesion 3: Are non-smoking heavy drinkers, generally healthier than regular smokers, who are not heavy drinkers?


Part 3: Exploratory data analysis

Research quesion 1:

Health Status by Age, Sex and Activity ?

Variables used

  1. genhlth: General Health
  2. physhlth: Physical Health
  3. menthlth: Mental Health
  4. poorhlth: Poor Physical Or Mental Health
  5. Sex: Respondents Sex
  6. X_age_g: Imputed Age In Six Groups
  7. X_pacat1: Physical Activity Categories

We will create plots of physical, mental health.

# Create subset of data
hlthpmp <- select(brfss2013, physhlth, menthlth,poorhlth) %>% filter(!is.na(physhlth),!is.na(menthlth),!is.na(poorhlth)) 
# Create Histogram of Physical Health
hist(hlthpmp$physhlth, main="Physical Health", xlab="Physical Health",col=rainbow(5))

# Create Histogram of Physical Health
hist(hlthpmp$physhlth, main="Physical Health", xlab="Physical Health",col=rainbow(5))

# Create Histogram of Health Status
hist(hlthpmp$poorhlth, main="Health status", xlab="Poor Health",col=rainbow(5))

# Create a data set of general health, age, sex and activness.
health <- brfss2013 %>% dplyr::select(genhlth,X_age_g,sex,X_pacat1) %>% dplyr::filter(sex %in%c("Male","Female"))

# Create factor for sex
levels(health$sex) <- c("Male", "Female")

# Plot the dataset
ggplot(health,aes(x = sex, fill = X_pacat1)) + geom_bar() +facet_grid(~X_age_g) + coord_flip() +ggtitle("Health, Age,sex and activity")

From the above plots and results we can infer

  1. There is a natural trend in health deterioration with age.
  2. Females have better health in old age.
  3. Also level of activity decreases with age.

Research quesion 2:

Is it possible to derive some relationship between the number of hours one sleeps and his health condition like heart attack, Depressive Disorders etc ?

Variables used

  1. sleptim1: How Much Time Do You Sleep
  2. cvdinfr4: Ever Diagnosed With Heart Attack
  3. addepev2: Ever Told You Had A Depressive Disorder

We will calculate correlation among the variables and create plot.

# Create a subset of data
vars <- names(brfss2013) %in% c("sleptim1", "cvdinfr4", "addepev2")
subdata <- brfss2013[vars]
# make a backup
subdata1 <- subdata
# conver factor levels into numeric levels
subdata1$addepev2 <- ifelse(subdata$addepev2=="Yes", 1, 0)
subdata1$cvdinfr4 <- ifelse(subdata$cvdinfr4=="Yes", 1, 0)
# remove rows containing NAs
library(Hmisc)
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## 
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
## 
##     combine, src, summarize
## The following objects are masked from 'package:base':
## 
##     format.pval, round.POSIXt, trunc.POSIXt, units
subdata1 <- na.delete(subdata1)
# find correlation
cor(subdata1)
##               sleptim1     cvdinfr4    addepev2
## sleptim1  1.0000000000 0.0002048415 -0.05743056
## cvdinfr4  0.0002048415 1.0000000000  0.05267562
## addepev2 -0.0574305638 0.0526756160  1.00000000
library(corrplot)
M <- cor(subdata1)
corrplot(M, method="ellipse")

From the above results we can infer that:

  1. Sleep time and Depressive Disorder has negative correlation, which mean if one sleeps less, chances for Depressive Disorder go high.
  2. Sleep time and Ever Diagnosed With Heart Attack shows almost no relation among them.

Research quesion 3:

Are non-smoking heavy drinkers, generally healthier than regular smokers, who are not heavy drinkers?

Variables used

  1. genhlth: General Health
  2. X_rfsmok3: Current Smoking Calculated Variable
  3. X_rfdrhv4: Heavy Alcohol Consumption Calculated Variable

General Health

total_obs <- nrow(brfss2013)

# Group together General Health categories
brfss2013 %>%
  group_by(genhlth) %>%
  summarise(count=n(),percentage=n()*100/total_obs)
## # A tibble: 6 x 3
##     genhlth  count percentage
##      <fctr>  <int>      <dbl>
## 1 Excellent  85482 17.3823395
## 2 Very good 159076 32.3473133
## 3      Good 150555 30.6146103
## 4      Fair  66726 13.5684002
## 5      Poor  27951  5.6836968
## 6      <NA>   1985  0.4036399
ggplot(brfss2013, aes(x=genhlth)) + geom_bar() + ggtitle('General Health') + xlab('General Health') + theme_bw()

80% of the respondents in our dataset are in good health or better, and most of the people have ‘Very good’ health.

Smoking Status

# Group together smoking categories
brfss2013 %>%
group_by(X_rfsmok3) %>%
summarise(count=n(),percentage=n()*100/total_obs)
## # A tibble: 3 x 3
##   X_rfsmok3  count percentage
##      <fctr>  <int>      <dbl>
## 1        No 399786  81.294494
## 2       Yes  76654  15.587210
## 3      <NA>  15335   3.118296
ggplot(brfss2013, aes(x=X_rfsmok3)) + geom_bar() + ggtitle('Smoking Status') + xlab('Currently a smoker?')+ theme_bw()

More than 81% of the respondents are not current smokers.

Drinking Status

# Group together drinking categories
brfss2013 %>%
group_by(X_rfdrhv4) %>%
summarise(count=n(),percentage=n()*100/total_obs)
## # A tibble: 3 x 3
##   X_rfdrhv4  count percentage
##      <fctr>  <int>      <dbl>
## 1        No 442359  89.951502
## 2       Yes  25533   5.192009
## 3      <NA>  23883   4.856489
ggplot(brfss2013, aes(x=X_rfdrhv4)) + geom_bar() + ggtitle('Drinking Status') + xlab('Heavy Drinker?') +theme_bw()

Only about 5% of the respondends in our dataset are heavy drinkers.

Now to answer our main question, we create a new categorical variable to categorise a person as: ‘Smoker’, ‘Heavy Drinker’, ‘Both’ or ‘None.’

# Group together Smoker, Heavy Drinker, Both and None
brfss2013 <- brfss2013 %>%
mutate(smoke_alc = ifelse(X_rfdrhv4 == 'Yes',
                            ifelse(X_rfsmok3 == 'Yes','Both','Heavy Drinker'),
                            ifelse(X_rfsmok3 == 'Yes','Current Smoker','None')))

Let us check the distribution of our new variable

brfss2013 %>%
  group_by(smoke_alc) %>%
  summarise(count=n(),percentage=n()*100/total_obs)
## # A tibble: 5 x 3
##        smoke_alc  count percentage
##            <chr>  <int>      <dbl>
## 1           Both   8144   1.656042
## 2 Current Smoker  66000  13.420772
## 3  Heavy Drinker  17269   3.511565
## 4           None 374377  76.127701
## 5           <NA>  25985   5.283920
ggplot(brfss2013,aes(x=smoke_alc)) + geom_bar() + ggtitle('Drinking and Smoking Status') + xlab('Drinker or Smoker?') +theme_bw()

About 76% of the respondents don’t smoke or drink heavily. Around 13.4% are current smokers, and about 3.5% drink heavily. We’ll be focusing on the last two.

# Create contingency table for better understanding
cont_table <- table(brfss2013$smoke_alc,brfss2013$genhlth)

mosaicplot(prop.table(cont_table,1),main='Drinking and/or Smoking vs General Health', xlab='Drinking and/or Smoking status', ylab='General Health')

From the above results we can infer that:

  1. It looks like smokers have poorer health than heavy drinkers

References:

Hussain, T. (2016). BRFSS 2013 Exploratory Analysis. [Blog] Available at: http://www.rpubs.com/tahirhussa/198234 [Accessed 27 Sep. 2017].

Sharma, S. (2016). Behavioral Risk Factor Surveillance System (BRFSS) 2013: Exploratory Data Analysis. [Blog] Available at: https://rpubs.com/sajal_sharma/brfss2013 [Accessed 27 Sep. 2017].