Load packages

library(ggplot2)
library(dplyr)

Load data

load("brfss2013.RData")

Part 1: About the Data

The Behavioral Risk Factor Surveillance System (BRFSS) is an annual telephone survey in the United States. The BRFSS is designed to identify risk factors in the adult population and report emerging trends. For example, respondents are asked about their diet and weekly physical activity, their HIV/AIDS status, possible tobacco use, immunization, health status, healthy days - health-related quality of life, health care access, inadequate sleep, hypertension awareness, cholesterol awareness, chronic health conditions, alcohol consumption, fruits and vegetables consumption, arthritis burden, and seatbelt use.

Data Collection:

The data collection procedure is explained in brfss_codebook. The data were collected from all 50 U.S. states, the District of Columbia, Puerto Rico, Guam, American Samoa, the Federated States of Micronesia, and Palau, through both landline and cellular telephone surveys. Disproportionate stratified sampling (DSS) was used for the landline sample, while cellular telephone respondents were randomly selected, each with an equal probability of selection. The dataset we are working with contains 330 variables for a total of 491,775 observations in the 2013 survey. Missing values are denoted by “NA”.

Generalizability:

The sample data should allow us to generalize to the population of interest. It is a survey of 491,775 U.S. adults aged 18 years or older, based on a large stratified random sample. Potential biases are associated with non-response, incomplete interviews, missing values and coverage bias (some potential respondents may not have been included because they have neither a landline nor a cell phone).

Causality:

No causation can be established, because BRFSS is an observational study that can only establish correlation/association between variables.

Part 2: Research Questions

Research question 1:

Does the distribution of the number of days in which physical and mental health was not good during the past 30 days differ by gender?

Research question 2:

Is there an association between the month in which a respondent was interviewed and the respondent’s self-reported health perception?

Research question 3:

Is there any association between a respondent’s income and health care coverage?

Research question 4:

Is there any relation between smoking, drinking alcohol, cholesterol level, blood pressure, weight and having a stroke? Eventually, I would like to see whether stroke can be predicted from the above-mentioned variables.

Part 3: Exploratory data analysis

Research question 1:

ggplot(aes(x=physhlth, fill=sex), data = brfss2013[!is.na(brfss2013$sex), ]) +
  geom_histogram(bins=30, position = position_dodge()) + ggtitle('Number of Days Physical Health not Good in the Past 30 Days')

by(brfss2013$physhlth, brfss2013$sex, summary)
## brfss2013$sex: Male
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.000   0.000   3.926   2.000  30.000    3818 
## -------------------------------------------------------- 
## brfss2013$sex: Female
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    0.00    0.00    4.65    4.00   30.00    7135
ggplot(aes(x=menthlth, fill=sex), data=brfss2013[!is.na(brfss2013$sex), ]) +
  geom_histogram(bins=30, position = position_dodge()) + ggtitle('Number of Days Mental Health not Good in the Past 30 Days')

by(brfss2013$menthlth, brfss2013$sex, summary)
## brfss2013$sex: Male
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.000   0.000   2.779   1.000  30.000    3247 
## -------------------------------------------------------- 
## brfss2013$sex: Female
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.000   0.000   3.784   3.000  30.000    5376
ggplot(aes(x=poorhlth, fill=sex), data=brfss2013[!is.na(brfss2013$sex), ]) +
  geom_histogram(bins=30, position = position_dodge()) + ggtitle('Number of Days with Poor Physical Or Mental Health in the Past 30 Days')

by(brfss2013$poorhlth, brfss2013$sex, summary)
## brfss2013$sex: Male
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    0.00    0.00    5.32    5.00   30.00  109880 
## -------------------------------------------------------- 
## brfss2013$sex: Female
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    0.00    0.00    5.24    5.00   30.00  133269
summary(brfss2013$sex)
##   Male Female   NA's 
## 201313 290455      7

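The three figures above show how male and female respondents answered the questions about the number of days in the past 30 days on which their physical health was not good, their mental health was not good, or they had poor physical or mental health. There were far more female respondents (290,455) than male respondents (201,313), so the raw bar heights are not directly comparable. The summaries show a median of 0 for both sexes on all three measures. Women report a higher mean number of days of poor physical health (4.65 versus 3.93) and poor mental health (3.78 versus 2.78), while the means for poorhlth are nearly equal (5.24 versus 5.32).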
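Because the two groups differ so much in size, comparing within-group proportions can be easier to read than comparing counts. The following is a minimal sketch of that idea, not the analysis used above. The data frame `toy` and its values are made up for illustration; only the column names `sex` and `physhlth` come from the dataset.

```r
library(dplyr)

# Made-up miniature stand-in for brfss2013 (values are illustrative only)
toy <- data.frame(
  sex      = factor(c("Male", "Male", "Male",
                      "Female", "Female", "Female", "Female", "Female")),
  physhlth = c(0, 0, 5, 0, 10, 30, 0, 2)
)

# Share of each sex reporting at least one day of poor physical health
prop_by_sex <- toy %>%
  filter(!is.na(sex), !is.na(physhlth)) %>%
  group_by(sex) %>%
  summarise(n = n(), share_any_bad_day = mean(physhlth > 0))

prop_by_sex
```

In the plots, a similar effect could be obtained by mapping `y = after_stat(density)` inside `geom_histogram()`, which scales each group separately instead of showing raw counts.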

Research question 2:

by_month1 <- brfss2013 %>% filter(iyear=='2013') %>% group_by(imonth) %>% summarise(n=n())
by_month1.num <- by_month1[, 2]
colSums(by_month1.num)
##      n 
## 486088
by_month1$total <- colSums(by_month1.num)
ggplot(aes(x=imonth, y=n/total*100), data=by_month1) + geom_bar(stat = 'identity') + ggtitle('Percentage of Respondents by Month')+ ylab('Percent')

by_month <- brfss2013 %>% filter(iyear=='2013') %>% group_by(imonth, genhlth) %>% summarise(no=n())
by_month <- left_join(by_month, by_month1, by='imonth')
ggplot(aes(x=imonth, y=no/n*100, fill = genhlth), data = by_month[!is.na(by_month$genhlth), ]) + geom_bar(stat = 'identity', position = position_dodge()) + ggtitle('Health Perception By Month') + ylab('Percent')

I wanted to find out whether respondents rate their health differently depending on the month of the interview. For example, are people more likely to say they are in good health in the spring or summer? No obvious pattern appears in the plot.
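The percentages plotted above are built by joining two summary tables. A more compact sketch of a similar calculation is shown below, using a made-up toy data frame with the same column names (`imonth`, `genhlth`). One difference is an assumption of this sketch: it drops missing `genhlth` answers before computing the shares, so each month's percentages sum to 100 among those who answered.

```r
library(dplyr)

# Made-up stand-in for the 2013 interviews (values are illustrative only)
toy <- data.frame(
  imonth  = factor(c("January", "January", "January", "January", "July", "July"),
                   levels = c("January", "July")),
  genhlth = factor(c("Good", "Good", "Fair", "Good", "Good", "Fair"))
)

pct_by_month <- toy %>%
  filter(!is.na(genhlth)) %>%
  count(imonth, genhlth, name = "no") %>%   # respondents per month and answer
  group_by(imonth) %>%
  mutate(percent = no / sum(no) * 100) %>%  # share within each month
  ungroup()

pct_by_month
```

The resulting `percent` column could be passed to `ggplot()` in place of `no/n*100`.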

Research question 3:

plot(brfss2013$income2, brfss2013$hlthpln1, xlab = 'Income Level', ylab = 'Health Care Coverage', main =
'Income Level versus Health Care Coverage')

by(brfss2013$hlthpln1, brfss2013$income2, summary)
## brfss2013$income2: Less than $10,000
##   Yes    No  NA's 
## 18732  6551   158 
## -------------------------------------------------------- 
## brfss2013$income2: Less than $15,000
##   Yes    No  NA's 
## 21143  5558    93 
## -------------------------------------------------------- 
## brfss2013$income2: Less than $20,000
##   Yes    No  NA's 
## 26695  8061   117 
## -------------------------------------------------------- 
## brfss2013$income2: Less than $25,000
##   Yes    No  NA's 
## 33312  8295   125 
## -------------------------------------------------------- 
## brfss2013$income2: Less than $35,000
##   Yes    No  NA's 
## 41738  7024   105 
## -------------------------------------------------------- 
## brfss2013$income2: Less than $50,000
##   Yes    No  NA's 
## 55575  5824   110 
## -------------------------------------------------------- 
## brfss2013$income2: Less than $75,000
##   Yes    No  NA's 
## 61732  3414    85 
## -------------------------------------------------------- 
## brfss2013$income2: $75,000 or more
##    Yes     No   NA's 
## 113023   2771    108

In general, higher-income respondents are more likely to have health care coverage than lower-income respondents.
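The counts are easier to compare as percentages. The sketch below copies the Yes/No counts from the `by()` output above into a small data frame (the name `coverage` is only for illustration) and computes the share of respondents with coverage in each income bracket, leaving out the NA responses.

```r
# Yes/No counts copied from the by() output above (NA responses excluded)
coverage <- data.frame(
  income = c("Less than $10,000", "Less than $15,000", "Less than $20,000",
             "Less than $25,000", "Less than $35,000", "Less than $50,000",
             "Less than $75,000", "$75,000 or more"),
  yes = c(18732, 21143, 26695, 33312, 41738, 55575, 61732, 113023),
  no  = c(6551, 5558, 8061, 8295, 7024, 5824, 3414, 2771)
)

# Percentage with coverage among those who answered Yes or No
coverage$pct_covered <- round(100 * coverage$yes / (coverage$yes + coverage$no), 1)

coverage[, c("income", "pct_covered")]
```

Coverage rises from roughly 74% in the lowest bracket to roughly 98% in the highest. The increase is not strictly monotonic: the "Less than $20,000" bracket sits slightly below "Less than $15,000".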

Research question 4:

To answer this question, I will be using the following variables:

smoke100: Smoked At Least 100 Cigarettes

avedrnk2: Avg Alcoholic Drinks Per Day In Past 30

bphigh4: Ever Told Blood Pressure High

toldhi2: Ever Told Blood Cholesterol High

weight2: Reported Weight In Pounds

cvdstrk3: Ever Diagnosed With A Stroke

First, convert the predictor variables above (all except cvdstrk3) to numeric, and check for any correlation between these numeric variables.
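One pitfall in this kind of conversion is worth a short illustration. If a column such as `weight2` is stored as a factor (an assumption here; `class(brfss2013$weight2)` would confirm it), calling `as.numeric()` directly returns the internal level codes rather than the printed values. The vector `w` below is hypothetical.

```r
# Hypothetical weights stored as a factor, as can happen after import
w <- factor(c("150", "200", "95"))

codes  <- as.numeric(w)                # internal level codes, not the weights
values <- as.numeric(as.character(w))  # the actual numeric values

codes
values
```

Converting through `as.character()` first recovers the values; entries that are not valid numbers become NA with a warning.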

vars <- names(brfss2013) %in% c('smoke100', 'avedrnk2', 'bphigh4', 'toldhi2', 'weight2')
selected_brfss <- brfss2013[vars]
selected_brfss$toldhi2 <- ifelse(selected_brfss$toldhi2=="Yes", 1, 0)
selected_brfss$smoke100 <- ifelse(selected_brfss$smoke100=="Yes", 1, 0)
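# Convert via character: as.numeric() on a factor would return level codes, not pounds
selected_brfss$weight2 <- as.numeric(as.character(selected_brfss$weight2))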
# Count pregnancy-only and borderline answers as high blood pressure (NA stays NA)
bp_yes <- c("Yes", "Yes, but female told only during pregnancy", "Told borderline or pre-hypertensive")
selected_brfss$bphigh4 <- ifelse(is.na(selected_brfss$bphigh4), NA, as.numeric(selected_brfss$bphigh4 %in% bp_yes))
library(Hmisc)
library(corrplot)
selected_brfss <- na.delete(selected_brfss)
corr.matrix <- cor(selected_brfss)
corrplot(corr.matrix, main="\n\nCorrelation Plot of Smoke, Alcohol, Blood pressure, Cholesterol, and Weight", method="number")
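The correlation plot only relates the predictors to one another. One possible next step toward the prediction goal in research question 4 is a logistic regression with stroke as the outcome. The sketch below is not part of the analysis above. It runs on simulated data with invented effect sizes, and the column `stroke` is a hypothetical 0/1 recoding of cvdstrk3.

```r
set.seed(1)

# Simulated data only: the effect sizes below are invented, not estimated from BRFSS
n <- 500
sim <- data.frame(
  smoke100 = rbinom(n, 1, 0.4),
  bphigh4  = rbinom(n, 1, 0.4),
  toldhi2  = rbinom(n, 1, 0.4)
)

# Invented log-odds of stroke, used only to generate a toy outcome
log_odds <- -3 + 0.5 * sim$smoke100 + 1.0 * sim$bphigh4 + 0.5 * sim$toldhi2
sim$stroke <- rbinom(n, 1, plogis(log_odds))

# Logistic regression: models the probability of stroke from the binary predictors
fit <- glm(stroke ~ smoke100 + bphigh4 + toldhi2, data = sim, family = binomial)

# Predicted stroke probability for each simulated respondent
sim$p_hat <- predict(fit, type = "response")

summary(fit)$coefficients
```

With the real data, the same call would use the recoded BRFSS columns instead of `sim`. Any fitted association would still be observational and would not establish causation.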