Exploring the BRFSS data

Aadarsh Srivastava

Setup

Load packages

library(ggplot2)
library(dplyr)
library(Hmisc)
library(corrplot)
library(caret)

Load data

Make sure your data and R Markdown files are in the same directory. When loaded your data file will be called brfss2013. Delete this note when before you submit your work.

load("~/EDA/brfss2013.RData")

Part 1: Data

About the Data:

The Behavioral Risk Factor Surveillance System (BRFSS) is an annual telephone survey in the United States. The BRFSS is designed to identify risk factors in the adult population and report emerging trends. For example, respondents are asked about their diet and weekly physical activity, their HIV/AIDS status, possible tobacco use, immunization, health status, healthy days - health-related quality of life, health care access, inadequate sleep, hypertension awareness, cholesterol awareness, chronic health conditions, alcohol consumption, fruits and vegetables consumption, arthritis burden, and seatbelt use.

Data Collection:

Data collection procedure is explained in brfss_codebook. The data were collected from United States’ all 50 states, the District of Columbia, Puerto Rico, Guam and American Samoa, Federated States of Micronesia, and Palau, by conducting both landline telephone- and cellular telephone-based surveys. Disproportionate stratified sampling (DSS) has been used for the landline sample and the cellular telephone respondents are randomly selected with each having equal probability of selection. The dataset we are working on contains 330 variables for a total of 491, 775 observations in 2013. The missing values denoted by “NA”.

Generalizability:

The sample data should allow us to generalize to the population of interest. It is a survey of 491,775 U.S. adults aged 18 years or older. It is based on a large stratified random sample. Potential biases are associated with non-response, incomplete interviews, missing values and convenience bias (some potential respondents may not have been included because they do not have a landline and cell phone).

Causality:

There is no causation can be established as BRFSS is an observation study that can only establish correlation/association between variables.

Part 2: Research questions

Research quesion 1: For the general population in the US,is there a relation between person’s frequency of tobacoo consumption and number of days with poor mental health condition? Usually people say that they find escape to their mental issues thorugh tobacco usage. Ifs this just an excuse or does the Data reflect this?

Research quesion 2: For the general population in the US, is there a relation between physical activity and income level. We might suspect that higher income or employment statua might lead to more activity as there might be more freedom to give proper attention to their health.

Research quesion 3: For the general population of the US who are over 18 years of age,what is the distribution of hours of sleep based on the gender difference. Does there exist difference in the distribution based on the whether the individual is male or female?

Part 3: Exploratory data analysis

Research quesion 1:Relationship between Frequency of days now Smoking and Number of days of Poor Mental Health reported.

#Remove the NAs from the data is in focus.
cleaned_data1 <- brfss2013 %>% filter(!is.na(smokday2),!is.na(menthlth))
#Plotting the selected feature
ggplot(data = cleaned_data1,aes(x=smokday2,fill=smokday2))+geom_bar(alpha = 0.5)+ labs(title="Distribution of Smoking habits",x="Frequency of Smoking",y="Population")+scale_y_continuous(breaks = scales::pretty_breaks(n=10))

ggplot(data = cleaned_data1,aes(x=menthlth,color='red'))+geom_freqpoly(bins = 10)+ labs(title="Frequency of Bad Mental Health",x="Number of days",y="Frequency")+scale_y_continuous(breaks = scales::pretty_breaks(n=10))+scale_x_continuous(breaks = scales::pretty_breaks(n=10))

#Plotting the Bar plot to display data
pl = ggplot(data = cleaned_data1,aes(x=smokday2,y= menthlth,fill = smokday2)) + geom_bar(stat = "Identity",alpha=0.5)
pl + labs(title="Smoking and Mental Health related?",x="Frequency of Smoking",y="Days with Poor Mental Health",fill = "Frequency of Smoking")+scale_y_continuous(breaks = scales::pretty_breaks(n=10))

The bar plot between Mental Health and Frequency of smoking suggests that people who smokes not at all has the highest number of days with poor mental health.But this is not true,as the Feature corresponds to the numeric value of number of days of poor mental health and should be considered a categorical variable.

pl = ggplot(data = cleaned_data1,aes(y=smokday2,x= factor(menthlth),fill = smokday2)) + geom_count()
pl + labs(title="Smoking and Mental Health related?",x="Frequency of Smoking",y="Days with Poor Mental Health")

by(cleaned_data1$menthlth,cleaned_data1$smokday2,summary)

## cleaned_data1$smokday2: Every day
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   6.386  10.000  30.000 
## -------------------------------------------------------- 
## cleaned_data1$smokday2: Some days
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    0.00    0.00    5.77    7.00   30.00 
## -------------------------------------------------------- 
## cleaned_data1$smokday2: Not at all
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   3.159   2.000  30.000

The statistics show that the mean number of days with poor mental health is the least for people who smokes not at all. This means that people with more days with poor mental health tend to smoke more.

Research quesion 2: More Income = More Exercise?

# Remove the NAs and find porportion that does any exercise grouped by income.
cleaned_data2 <- brfss2013 %>% filter(!is.na(exerany2), !is.na(income2), !is.na(employ1)) %>% group_by(income2) %>% summarise(prop_exer = sum(exerany2 == "Yes") / n())
# Plot - first replace spaces with line breaks in income2 names.
levels(cleaned_data2$income2) <- gsub(" ", "\n", levels(cleaned_data2$income2))
ggplot(cleaned_data2, aes(income2, prop_exer)) +
    geom_point(aes(income2, prop_exer)) +
    labs(title="Proportion who exercised in last 30 days vs. Income", x="Income", y="Proportion Exercise")

The plot clearly indicates that there is an increasing proportion of the people who exercised in the last 30 days as we jump to a higher income scale. There does come a dip in the proportion who has a income between 10,000 and 15,000 dollars which does not help us to infer anything.

Research quesion 3: Nap time also has a Battle of the Sexes?**

#Cleaning the data and then selection of required columns.
cleaned_data3 <- brfss2013 %>% filter(!is.na(ladult),!is.na(sleptim1))
cleaned_data3 <- cleaned_data3 %>% select(ladult,sleptim1)
#Getting rid of the outliers
cleaned_data3 <- cleaned_data3 %>% filter(sleptim1 < 20)
#Mean sleep of male and female respondents 
levels(cleaned_data3$ladult) <- c("Male","Female")
names(cleaned_data3) <- c("Gender","Sleep_Time")
#Examining the ditribution of Male and Female respondents' hours of sleep
 ggplot(data = cleaned_data3,aes(Gender,factor(Sleep_Time)))+geom_count(fill="green")

 by(cleaned_data3$Sleep_Time,cleaned_data3$Gender,summary)

## cleaned_data3$Gender: Male
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   5.000   6.000   8.000   7.412   8.000  10.000 
## -------------------------------------------------------- 
## cleaned_data3$Gender: Female
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   5.000   7.500   6.611   8.000   9.000

The sleep time of male adults were observed to distributed between 5 to 10 hours where as that of the female adults were between 1 to 9 hours. For both the genders the median is greater than the mean, hence a left skewness in observed. Also the mean sleep time for the male adults were observed to be greater than that of the females.

Logistic Regression to predict Chronic Diseases

Firstly,the process of Feature selection. The features selected are: smoke100: Smoked At Least 100 Cigarettes

avedrnk2: Avg Alcoholic Drinks Per Day In Past 30

bphigh4: Ever Told Blood Pressure High

toldhi2: Ever Told Blood Cholesterol High

weight2: Reported Weight In Pounds

cvdstrk3: Ever Diagnosed With A Stroke

Converting the above variables to numeric, and see any correlation between these numerical variables.

vars <- names(brfss2013) %in% c('smoke100', 'avedrnk2', 'bphigh4', 'toldhi2', 'weight2')
selected_brfss <- brfss2013[vars]
selected_brfss$toldhi2 <- ifelse(selected_brfss$toldhi2=="Yes", 1, 0)
selected_brfss$smoke100 <- ifelse(selected_brfss$smoke100=="Yes", 1, 0)
selected_brfss$weight2 <- as.numeric(selected_brfss$weight2)
selected_brfss$bphigh4 <- as.factor(ifelse(selected_brfss$bphigh4=="Yes", "Yes", (ifelse(selected_brfss$bphigh4=="Yes, but female told only during pregnancy", "Yes", (ifelse(selected_brfss$bphigh4=="Told borderline or pre-hypertensive", "Yes", "No"))))))
selected_brfss$bphigh4 <- ifelse(selected_brfss$bphigh4=="Yes", 1, 0)
names(selected_brfss) <- c("High_BP","High_Cholestrol","Weight","Smoke","Alcohol")
selected_corr <- na.delete(selected_brfss)
corr.matrix <- cor(selected_corr)
corrplot(corr.matrix, method="number")

No any two numeric variables seem to have strong correlations.

Next,I will combine three features to make our target variable,which are: cvdinfr4: Ever Diagnosed With Heart Attack

cvdcrhd4: Ever Diagnosed With Angina Or Coronary Heart Disease

cvdstrk3: Ever Diagnosed With A Stroke

chronic_diseases <- data.frame("Stroke"=brfss2013$cvdstrk3,"Coronary_disease"=brfss2013$cvdcrhd4,"Heart_attack"=brfss2013$cvdinfr4)
chronic_diseases$Stroke <- ifelse(chronic_diseases$Stroke=="Yes",1,0)
chronic_diseases$Coronary_disease <- ifelse(chronic_diseases$Coronary_disease=="Yes",1,0)
chronic_diseases$Heart_attack <- ifelse(chronic_diseases$Heart_attack=="Yes",1,0)
chronic_diseases$Stroke <- replace(chronic_diseases$Stroke,which(is.na(chronic_diseases$Stroke)),0)
chronic_diseases$Coronary_disease <- replace(chronic_diseases$Coronary_disease,which(is.na(chronic_diseases$Coronary_disease)),0)
chronic_diseases$Heart_attack <- replace(chronic_diseases$Heart_attack,which(is.na(chronic_diseases$Heart_attack)),0)
chronic_diseases$Chronic_disease<-ifelse((chronic_diseases$Stroke+chronic_diseases$Coronary_disease+chronic_diseases$Heart_attack)>0,1,0)

model_data <- as.data.frame(cbind(selected_brfss,chronic_diseases$Chronic_disease))
names(model_data)[6] <- "chronic_disease"

model_data$High_BP <- replace(model_data$High_BP, which(is.na(model_data$High_BP)),0)
model_data$High_Cholestrol <- replace(model_data$High_Cholestrol, which(is.na(model_data$High_Cholestrol)),0)
model_data$chronic_disease <- replace(model_data$chronic_disease, which(is.na(model_data$chronic_disease)),0)
model_data$Smoke <- replace(model_data$Smoke, which(is.na(model_data$Smoke)),0)

#Replace NA with average
model_data$Alcohol <- replace(model_data$Alcohol, which(is.na(model_data$Alcohol)),round(mean(model_data$Alcohol,na.rm = T)))

summary(model_data)

##     High_BP       High_Cholestrol      Weight           Smoke       
##  Min.   :0.0000   Min.   :0.0000   Min.   :  1.00   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.: 43.00   1st Qu.:0.0000  
##  Median :0.0000   Median :0.0000   Median : 73.00   Median :0.0000  
##  Mean   :0.4223   Mean   :0.3731   Mean   : 80.22   Mean   :0.4376  
##  3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:103.00   3rd Qu.:1.0000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :570.00   Max.   :1.0000  
##     Alcohol       chronic_disease 
##  Min.   : 1.000   Min.   :0.0000  
##  1st Qu.: 2.000   1st Qu.:0.0000  
##  Median : 2.000   Median :0.0000  
##  Mean   : 2.099   Mean   :0.1164  
##  3rd Qu.: 2.000   3rd Qu.:0.0000  
##  Max.   :76.000   Max.   :1.0000

head(model_data)

##   High_BP High_Cholestrol Weight Smoke Alcohol chronic_disease
## 1       1               1    154     1       2               0
## 2       0               0     30     0       2               0
## 3       0               0     63     1       4               0
## 4       0               1     31     0       2               0
## 5       1               0    169     1       2               0
## 6       1               1    128     0       2               0

After wrangling and cleaning the data, now we can fit the model.

Model Fitting

train_index <- createDataPartition(model_data$chronic_disease,p=0.8,list = FALSE,times = 1)
train <- model_data[train_index,]
test <- model_data[-train_index,]

model <- glm(chronic_disease~.,family=binomial(link = 'logit'),data=train)
summary(model)

## 
## Call:
## glm(formula = chronic_disease ~ ., family = binomial(link = "logit"), 
##     data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.0409  -0.5104  -0.3305  -0.2475   3.4258  
## 
## Coefficients:
##                   Estimate Std. Error  z value Pr(>|z|)    
## (Intercept)     -3.403e+00  1.607e-02 -211.744   <2e-16 ***
## High_BP          1.276e+00  1.185e-02  107.704   <2e-16 ***
## High_Cholestrol  8.780e-01  1.101e-02   79.783   <2e-16 ***
## Weight           7.391e-04  8.549e-05    8.646   <2e-16 ***
## Smoke            5.596e-01  1.053e-02   53.136   <2e-16 ***
## Alcohol         -6.077e-02  4.570e-03  -13.298   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 283561  on 393419  degrees of freedom
## Residual deviance: 250669  on 393414  degrees of freedom
## AIC: 250681
## 
## Number of Fisher Scoring iterations: 5

Interpreting the results of my logistic regression model:

All the variables are statistically significant.

All other variables being equal, being told high cholestrol is more likely to have a stroke.

The negative coefficient for the predictor Alcohol suggests that all other variables being equal, higher the alcohol drinksper day in the past 30 days is less likely to have a stroke.

For every one unit change in weight, the log odds of having a stroke (versus no-stroke) increases by 0.0007391.

Not Smoked At Least 100 Cigarettes, less likely to have a stroke.

For a one unit increase in Avg alcoholic drinks per day in past 30 days, the log odds of having a stroke decreases by 0.06077.

anova(model, test="Chisq")

## Analysis of Deviance Table
## 
## Model: binomial, link: logit
## 
## Response: chronic_disease
## 
## Terms added sequentially (first to last)
## 
## 
##                 Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
## NULL                           393419     283561              
## High_BP          1  22644.2    393418     260917 < 2.2e-16 ***
## High_Cholestrol  1   7192.1    393417     253725 < 2.2e-16 ***
## Weight           1     99.5    393416     253625 < 2.2e-16 ***
## Smoke            1   2745.6    393415     250880 < 2.2e-16 ***
## Alcohol          1    211.1    393414     250669 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Analyzing the deviance table we can see the drop in deviance when adding each variable one at a time.

Assessing the predictive ability of the model

fitted.results <- predict(model,newdata=test,type='response')
fitted.results <- ifelse(fitted.results > 0.5,1,0)

misClasificError <- mean(fitted.results != test$chronic_disease)
print(paste('Accuracy',(1-misClasificError)*100,"%"))

## [1] "Accuracy 88.4998220731025 %"

One last note, when we analyze health survery data, we must be aware that self-reported prevalence may be biased because respondents may not be aware of their risk status. Therefore, to achieve more precise estimates, researchers are using laboratory tests as well as self-reported data.