Exploring the BRFSS data

Setup

Load packages

library(ggplot2)
library(dplyr)

Load data

load("brfss2013.RData")

Part 1: Data

This data is from the Behavior Risk Factor Survelliance System (BRFSS) project which involved all the states in the US and participating US territories with support from Centers for Disease Control and Prevention (CDC). The health characteristics in the data include tobacco use, HIV/AIDS knowledge and prevention, exercise, immunization, health status, healthy days - health-related quality of life, health care access, inadequate sleep, hypertension awareness, cholesterol awareness, chronic health conditions, alcohol consumption, fruits and vegetables consumption, arthritis burden, and seatbelt use, characteristics estimated from the BRFSS pertain to the non-institutionalized adult population, aged 18 years or older, who reside in the US.

This is an observational study since researchers have collected the data in a way that does not interfere with the way the data arises.

As stated in the BRFSS overview - “In conducting the BRFSS landline telephone survey, interviewers collect data from a randomly selected adult in a household. In conducting the cellular telephone version of the BRFSS questionnaire, interviewers collect data from an adult who participates by using a cellular telephone and resides in a private residence or college housing.”. However, in an overview of the BRFSS 2016[1], CDC has stated that since 2014, all adults conducted through their cellular telephone were eligible for the survey regardless of their landline usage.

Since the adults were selected randomly, the study’s results are generalizable to the population. However since no random assignment was done under experimental settings, no causal conclusions can be made based on the study.

Part 2: Research questions

Research quesion 1:

Is there a correlation between getting diagnosed with high blood pressure (bphigh4), getting diagnosed with angina or coronary heart disease (cvdcrhd4) and getting diagnosed with a heart attack (cvdinfr4)?

This is of immense importance to the general audience as well as health agencies such as the CDC to understand and promote the importance of taking cardiac abnormalities very seriously. Statistics like this are of utmost importance in developing strategies to reduce death due to cardiac arrests. It will also help the common people realise the need for regular checkups as well as to keep a strict diet and medication plan, as needed, to minimise the chances of heart disease and cardiac arrests.

Research quesion 2:

Is regular internet usage (internet) associated with the usage of seatbelts in cars (seatbelt)? In other words, is there a correlation between the two?

This is intended to unravel the extent to which online campaigns, promoting the use of seatbelts in cars to reduce fatality rates and injuries in accidents, have been successful. If a correlation can be established between the two variables mentioned above, road safety and other government agencies could decide whether or not to continue the online campaigns and decide on strategies to reach the maximum audience possible.

Research quesion 3:

Over the years, research has established the harmful effects of smoking on health which has also led to an increase in medical costs for smokers and their families. Health Insurance and similar plan coverages help mitigate a part of these costs and this forms the basis of my last question - Is there a correlation between smoking, income levels and having a health plan coverage.

This question has profound significance for the smokers, their families as well as the medical insurance companies. Since we are also considering the income levels of the participating adults, this analysis holds a significance for government health regulatory agencies as well.

Part 3: Exploratory data analysis

Overview of the Data

With the research questions formulated above, I know the variables that I will be working with and so I want to take a look at the data within those variables for inconsistent/missing values that might hamper with the analysis.

Let us calculate the number of distinct values present in the each of the variables:

summary(brfss2013$bphigh4)

##                                        Yes 
##                                     198921 
## Yes, but female told only during pregnancy 
##                                       3680 
##                                         No 
##                                     282687 
##        Told borderline or pre-hypertensive 
##                                       5067 
##                                       NA's 
##                                       1420

summary(brfss2013$cvdcrhd4)

##    Yes     No   NA's 
##  29064 458288   4423

summary(brfss2013$cvdinfr4)

##    Yes     No   NA's 
##  29284 459904   2587

summary(brfss2013$internet)

##    Yes     No   NA's 
## 367510 119339   4926

summary(brfss2013$seatbelt)

##                       Always                Nearly always 
##                       390327                        34494 
##                    Sometimes                       Seldom 
##                        13718                         5708 
##                        Never Never drive or ride in a car 
##                         7445                         1148 
##                         NA's 
##                        38935

summary(brfss2013$X_smoker3)

## Current smoker - now smokes every day 
##                                 55162 
## Current smoker - now smokes some days 
##                                 21495 
##                         Former smoker 
##                                138134 
##                          Never smoked 
##                                261651 
##                                  NA's 
##                                 15333

summary(brfss2013$income2)

## Less than $10,000 Less than $15,000 Less than $20,000 Less than $25,000 
##             25441             26794             34873             41732 
## Less than $35,000 Less than $50,000 Less than $75,000   $75,000 or more 
##             48867             61509             65231            115902 
##              NA's 
##             71426

summary(brfss2013$hlthpln1)

##    Yes     No   NA's 
## 434571  55300   1904

As we can see there are quite a few values like ‘NA’ and ‘Yes, but female told only during pregnancy’ which we can ignore, since the proportions of such values in each of the 3 variables is ~ 1-2%. Before proceeding with further calculations, it’s best we load all the necessary data into a new dataframe without the inconsistent/missing values:

brfss2013_clean <- brfss2013[which(brfss2013$bphigh4 == 'Yes' | brfss2013$bphigh4 == 'No' & brfss2013$cvdcrhd4 != 'NA' & brfss2013$cvdinfr4 != 'NA' & brfss2013$seatbelt != 'NA' & brfss2013$seatbelt != 'Never drive or ride in a car' & brfss2013$internet != 'NA' & brfss2013$X_smoker3 != 'NA' & brfss2013$income2 != 'NA' & brfss2013$hlthpln1 != 'NA'),]
nrow(brfss2013_clean)

## [1] 421804

Now that we have eliminated the unknown/missing/inconsistent values, we still have a large enough sample to conduct our statistical analysis on the research questions.

Research quesion 1:

Before delving into our research question, let’s plot the proportions of people suffering from these heart diseases in a barplot:

ggplot(data = brfss2013_clean %>% filter(cvdcrhd4 != 'NA') , aes(x = bphigh4, fill = cvdcrhd4)) + geom_bar() + xlab("High BP") + ggtitle("Weighted Bar Plot of High BP and Angina or CHD")

## Warning: package 'bindrcpp' was built under R version 3.4.4

From the bar plot above, we can see that High BP is a much more common health issue than Angina or Coronary Heart Disease. The proportion of adults diagnosed with High BP is close to 50% of the total sample, which is a solid reason to explore our analysis further to find the relationship between the 3 variables i.e. bphigh4, cvdcrhd4 and cvdinfr4.

Since all 3 variables are categorical variables, we turn to Logistic Regression to establish whether or not any correlation exists between the variables (cvdinfr4, bphigh4 and cvdcrhd4).

model <- glm(cvdinfr4 ~ bphigh4 + cvdcrhd4, family = binomial(link = 'logit'), data = brfss2013_clean)
summary(model)

## 
## Call:
## glm(formula = cvdinfr4 ~ bphigh4 + cvdcrhd4, family = binomial(link = "logit"), 
##     data = brfss2013_clean)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.8311   0.1915   0.1915   0.3239   1.2402  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.14645    0.01266  -11.57   <2e-16 ***
## bphigh4No    1.06776    0.01691   63.15   <2e-16 ***
## cvdcrhd4No   3.06788    0.01559  196.79   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 195043  on 417489  degrees of freedom
## Residual deviance: 145648  on 417487  degrees of freedom
##   (4314 observations deleted due to missingness)
## AIC: 145654
## 
## Number of Fisher Scoring iterations: 6

Inference from the Logistic Regression results:

Both the variables bphigh4No (High BP) and cvdcrhd4No (Angina or CHD) have p-value (<2e-16) which is extremely less than the significance level of 0.05. This means the variables bphigh4No and cvdcrhd4No are statistically significant and hence have a correlation with cvdinfr4.

In lay man’s terms, this means that people getting diagnosed with High BP and Angina or CHD is correlated with getting diagnosed with Heart Attack.

Research quesion 2:

In this question, we are trying to establish if there’s any correlation between regular internet usage (internet), as depicted by the usage of last 30 days, and the use of seatbelts in cars (seatbelt).

ggplot(data = brfss2013_clean %>% filter(internet != 'NA' & seatbelt != 'NA'), aes(x = internet, fill = seatbelt)) + geom_bar() + xlab("Internet Used in the last 30 days") + ggtitle("Weighted Bar Plot of Internet Usage in Last 30 days and Wearing Seatbelt in cars")

From the bar-plot, while we can see that with internet usage in 30 days, there is a relatively higher proportion of wearing seatbelts, this is not enough to suggest if any correlation actually exists between the two or not.

We can establish this correlation much better by resorting to hypothesis testing. Since both these variables are categorical, we can use the Chi-squared test of independence for our hypothesis testing:

NULL HYPOTHESIS - There is no correlation between ‘internet’ and ‘seatbelt’.
ALTERNATIVE HYPOTHESIS - There is a correlation between ‘internet’ and ‘seatbelt’.

chisq.test(brfss2013_clean$internet, brfss2013_clean$seatbelt, correct = FALSE)

## 
##  Pearson's Chi-squared test
## 
## data:  brfss2013_clean$internet and brfss2013_clean$seatbelt
## X-squared = 2535.5, df = 5, p-value < 2.2e-16

Since we get a p-value (<2.2e-16) that is way less than the significance level of 0.05, we reject the NULL hypothesis in favour of the ALTERNATIVE hypothesis and conclude that there is indeed a correlation between “regular internet usage” and “use of seatbelts in cars”.

Research quesion 3:

To address this question, let’s construct a sequence of bar plots to check the relationship between hlthpln1 (Have a health plan coverage), Xsmoker3 (Smoker or not) and income2 (Levels of income in the household):

ggplot(brfss2013_clean %>% filter(income2 != 'NA' & hlthpln1 != 'NA' & X_smoker3 != 'NA'), aes(x = hlthpln1, fill = income2)) + geom_bar(position = "dodge") + facet_wrap(~ X_smoker3, ncol = 2) + xlab("Smoker or not") + ggtitle("Weighted Bar Plot of Smoker, Income Levels and Having a Health Plan Coverage")

An interesting fact that we notice from the plots is that the number of former smokers and non-smokers who have health plan coverage is much higher than current smokers. There appears to be some sort of correlation with the incomes as well, as depicted by increase in heights of bars with the increase in income. Let’s examine the same with a logistic regression model to confirm if there is statiscal significance of the variables X_smoker3 and income2 in their correlation with hlthpln1.

model_1 <- glm(hlthpln1 ~ X_smoker3 + income2, family = binomial(link = 'logit'), data = brfss2013_clean)
summary(model_1)

## 
## Call:
## glm(formula = hlthpln1 ~ X_smoker3 + income2, family = binomial(link = "logit"), 
##     data = brfss2013_clean)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.9531  -0.5535  -0.3171  -0.2179   2.8421  
## 
## Coefficients:
##                                                Estimate Std. Error z value
## (Intercept)                                    -0.55362    0.01821 -30.409
## X_smoker3Current smoker - now smokes some days -0.15064    0.02289  -6.582
## X_smoker3Former smoker                         -0.97854    0.01627 -60.132
## X_smoker3Never smoked                          -0.68596    0.01404 -48.850
## income2Less than $15,000                       -0.25163    0.02227 -11.301
## income2Less than $20,000                       -0.09643    0.02057  -4.689
## income2Less than $25,000                       -0.26626    0.02024 -13.157
## income2Less than $35,000                       -0.64830    0.02070 -31.320
## income2Less than $50,000                       -1.09527    0.02124 -51.565
## income2Less than $75,000                       -1.72564    0.02422 -71.254
## income2$75,000 or more                         -2.48880    0.02551 -97.549
##                                                Pr(>|z|)    
## (Intercept)                                     < 2e-16 ***
## X_smoker3Current smoker - now smokes some days 4.66e-11 ***
## X_smoker3Former smoker                          < 2e-16 ***
## X_smoker3Never smoked                           < 2e-16 ***
## income2Less than $15,000                        < 2e-16 ***
## income2Less than $20,000                       2.75e-06 ***
## income2Less than $25,000                        < 2e-16 ***
## income2Less than $35,000                        < 2e-16 ***
## income2Less than $50,000                        < 2e-16 ***
## income2Less than $75,000                        < 2e-16 ***
## income2$75,000 or more                          < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 267639  on 387092  degrees of freedom
## Residual deviance: 237291  on 387082  degrees of freedom
##   (34711 observations deleted due to missingness)
## AIC: 237313
## 
## Number of Fisher Scoring iterations: 6

Thus, here again, we see that the p-values for all levels are less than 0.05, suggesting that both smoking and income levels are statistically significant in the correlation to having a health plan coverage.

Part 4: Conclusion

In my capstone project, I have conducted research about three important parameters that form a component part of a nation’s health - Safety while driving, Cardiac problems and Having a health plan coverage. While each of the research topics have a lot more to be studied about, we have been able to establish correlations and statistical significance between the variables of interest.

Part 5: References

[1] https://www.cdc.gov/brfss/annual_data/2016/pdf/overview_2016.pdf
[2] https://stats.idre.ucla.edu/r/dae/logit-regression/