library(ggplot2)
library(dplyr)load("brfss2013.RData")The Behavioral Risk Factor Surveillance System (BRFSS) is a collaborative project between all of the states in
the United States (US) and participating US territories and the Centers for Disease Control and Prevention (CDC) The BRFSS is administered and supported by CDC’s Population Health Surveillance Branch, under the Division of Population Health at the National Center for Chronic Disease Prevention and Health Promotion.
BRFSS is an ongoing surveillance system designed to measure behavioral risk factors for the non-institutionalized adult population (18 years of age and older) residing in the US.Bias to data can exist. This data source is
valid and generalizable. The reliability is around 75% and compares well with other surveys of similar nature.
The BRFSS objective is to collect uniform, state-specific include tobacco use, HIV/AIDS knowledge and prevention,exercise, immunization, health status, healthy days - health-related quality of life, health care access,inadequate sleep, hypertension awareness, cholesterol awareness, chronic health conditions, alcohol consumption, fruits and vegetables consumption, arthritis burden, and seatbelt use.
Dataset Notes
The dataset is provided in both Stata (.dta) and R Workspace (.Rdata) formats. Categorical values are factors in the R workspace, and value labels are attached in the Stata version (except when a categorical variable contains more than 50 categories.
All missing values are coded NA in the R Workspace. For Stata, missing values are coded missing using the following codes: . (numbers) or empty text field: BLANK; .a: Don’t know/Not sure; .b: Refused; .c: Zero (none); .d: Don’t know/Not Sure Or Refused/Missing
Many variables, such as age, race, education, as well as variables that measure counts of events (drinks, times eating fruit, etc.) have alternate versions in the Calculated Variables section of the dataset. Review this section prior to choosing variables for analysis.
Research quesion 1: What is the health Status of the population by Age,Sex and activity ?
Research quesion 2: Is it possible to derive some relationship between the number of hours one sleeps and his health condition like heart attack, Depressive Disorders etc ?
Research quesion 3: Are non-smoking heavy drinkers, generally healthier than regular smokers, who are not heavy drinkers?
Research quesion 1:
Health Status by Age, Sex and Activity ?
Variables used
We will create plots of physical, mental health.
# Create subset of data
hlthpmp <- select(brfss2013, physhlth, menthlth,poorhlth) %>% filter(!is.na(physhlth),!is.na(menthlth),!is.na(poorhlth)) # Create Histogram of Physical Health
hist(hlthpmp$physhlth, main="Physical Health", xlab="Physical Health",col=rainbow(5))# Create Histogram of Physical Health
hist(hlthpmp$physhlth, main="Physical Health", xlab="Physical Health",col=rainbow(5))# Create Histogram of Health Status
hist(hlthpmp$poorhlth, main="Health status", xlab="Poor Health",col=rainbow(5))# Create a data set of general health, age, sex and activness.
health <- brfss2013 %>% dplyr::select(genhlth,X_age_g,sex,X_pacat1) %>% dplyr::filter(sex %in%c("Male","Female"))
# Create factor for sex
levels(health$sex) <- c("Male", "Female")
# Plot the dataset
ggplot(health,aes(x = sex, fill = X_pacat1)) + geom_bar() +facet_grid(~X_age_g) + coord_flip() +ggtitle("Health, Age,sex and activity")From the above plots and results we can infer
Research quesion 2:
Is it possible to derive some relationship between the number of hours one sleeps and his health condition like heart attack, Depressive Disorders etc ?
Variables used
We will calculate correlation among the variables and create plot.
# Create a subset of data
vars <- names(brfss2013) %in% c("sleptim1", "cvdinfr4", "addepev2")
subdata <- brfss2013[vars]# make a backup
subdata1 <- subdata# conver factor levels into numeric levels
subdata1$addepev2 <- ifelse(subdata$addepev2=="Yes", 1, 0)
subdata1$cvdinfr4 <- ifelse(subdata$cvdinfr4=="Yes", 1, 0)# remove rows containing NAs
library(Hmisc)## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
##
## combine, src, summarize
## The following objects are masked from 'package:base':
##
## format.pval, round.POSIXt, trunc.POSIXt, units
subdata1 <- na.delete(subdata1)# find correlation
cor(subdata1)## sleptim1 cvdinfr4 addepev2
## sleptim1 1.0000000000 0.0002048415 -0.05743056
## cvdinfr4 0.0002048415 1.0000000000 0.05267562
## addepev2 -0.0574305638 0.0526756160 1.00000000
library(corrplot)
M <- cor(subdata1)
corrplot(M, method="ellipse")From the above results we can infer that:
Research quesion 3:
Are non-smoking heavy drinkers, generally healthier than regular smokers, who are not heavy drinkers?
Variables used
General Health
total_obs <- nrow(brfss2013)
# Group together General Health categories
brfss2013 %>%
group_by(genhlth) %>%
summarise(count=n(),percentage=n()*100/total_obs)## # A tibble: 6 x 3
## genhlth count percentage
## <fctr> <int> <dbl>
## 1 Excellent 85482 17.3823395
## 2 Very good 159076 32.3473133
## 3 Good 150555 30.6146103
## 4 Fair 66726 13.5684002
## 5 Poor 27951 5.6836968
## 6 <NA> 1985 0.4036399
ggplot(brfss2013, aes(x=genhlth)) + geom_bar() + ggtitle('General Health') + xlab('General Health') + theme_bw()80% of the respondents in our dataset are in good health or better, and most of the people have ‘Very good’ health.
Smoking Status
# Group together smoking categories
brfss2013 %>%
group_by(X_rfsmok3) %>%
summarise(count=n(),percentage=n()*100/total_obs)## # A tibble: 3 x 3
## X_rfsmok3 count percentage
## <fctr> <int> <dbl>
## 1 No 399786 81.294494
## 2 Yes 76654 15.587210
## 3 <NA> 15335 3.118296
ggplot(brfss2013, aes(x=X_rfsmok3)) + geom_bar() + ggtitle('Smoking Status') + xlab('Currently a smoker?')+ theme_bw()More than 81% of the respondents are not current smokers.
Drinking Status
# Group together drinking categories
brfss2013 %>%
group_by(X_rfdrhv4) %>%
summarise(count=n(),percentage=n()*100/total_obs)## # A tibble: 3 x 3
## X_rfdrhv4 count percentage
## <fctr> <int> <dbl>
## 1 No 442359 89.951502
## 2 Yes 25533 5.192009
## 3 <NA> 23883 4.856489
ggplot(brfss2013, aes(x=X_rfdrhv4)) + geom_bar() + ggtitle('Drinking Status') + xlab('Heavy Drinker?') +theme_bw()Only about 5% of the respondends in our dataset are heavy drinkers.
Now to answer our main question, we create a new categorical variable to categorise a person as: ‘Smoker’, ‘Heavy Drinker’, ‘Both’ or ‘None.’
# Group together Smoker, Heavy Drinker, Both and None
brfss2013 <- brfss2013 %>%
mutate(smoke_alc = ifelse(X_rfdrhv4 == 'Yes',
ifelse(X_rfsmok3 == 'Yes','Both','Heavy Drinker'),
ifelse(X_rfsmok3 == 'Yes','Current Smoker','None')))Let us check the distribution of our new variable
brfss2013 %>%
group_by(smoke_alc) %>%
summarise(count=n(),percentage=n()*100/total_obs)## # A tibble: 5 x 3
## smoke_alc count percentage
## <chr> <int> <dbl>
## 1 Both 8144 1.656042
## 2 Current Smoker 66000 13.420772
## 3 Heavy Drinker 17269 3.511565
## 4 None 374377 76.127701
## 5 <NA> 25985 5.283920
ggplot(brfss2013,aes(x=smoke_alc)) + geom_bar() + ggtitle('Drinking and Smoking Status') + xlab('Drinker or Smoker?') +theme_bw()About 76% of the respondents don’t smoke or drink heavily. Around 13.4% are current smokers, and about 3.5% drink heavily. We’ll be focusing on the last two.
# Create contingency table for better understanding
cont_table <- table(brfss2013$smoke_alc,brfss2013$genhlth)
mosaicplot(prop.table(cont_table,1),main='Drinking and/or Smoking vs General Health', xlab='Drinking and/or Smoking status', ylab='General Health')From the above results we can infer that:
References:
Hussain, T. (2016). BRFSS 2013 Exploratory Analysis. [Blog] Available at: http://www.rpubs.com/tahirhussa/198234 [Accessed 27 Sep. 2017].
Sharma, S. (2016). Behavioral Risk Factor Surveillance System (BRFSS) 2013: Exploratory Data Analysis. [Blog] Available at: https://rpubs.com/sajal_sharma/brfss2013 [Accessed 27 Sep. 2017].