Exploring the BRFSS data

Setup

Load packages

library(ggplot2)
library(dplyr)
library(tidyverse)

## Warning: package 'tidyverse' was built under R version 3.4.3

#library(statsr)

Load data

load("brfss2013.Data")

Part 1: Data

Behavioral Risk Factor Surveillance System (BRFSS): This is a cross-sectional telephone survey that state health departments conduct monthly over landline and cellular telephones, using a standardized questionnaire and with technical and methodological assistance from the Centers for Disease Control and Prevention (CDC).

The data is collected across all states in the US as well as the District of Columbia and three U.S. territories. More than 0.4M adult interviews are completed every year. The data covers a wide range of topics, including: 1) general information such as residential conditions, age, number of family members and employment status; 2) current health status; 3) health history over the past month, from sleeping routine to blood pressure, blood sugar and cardiac health issues; and 4) eating and drinking habits.

With this much information collected from more than 0.4M adults across the United States and its territories, the data can be used to look for patterns in how any one survey item relates to health status. Because respondents are selected through random sampling, the results should be broadly generalizable to the adult population of the United States, subject to the usual survey caveats such as non-response and adults who cannot be reached by telephone. However, this is an observational study: there is no random assignment to treatment and control groups, so the data can show association but cannot establish causation.

Part 2: Research questions

Research question 1: Q1. How could health relate to different working categories?

How people go about their daily lives may have a large bearing on their health. Working might go with an active lifestyle, thus promoting good health. It could also mean better economic conditions, which in turn could make it easier to have regular check-ups and buy costly medicines. At the same time, long work hours could relate to stress, or maybe excessive smoking, leading to an unhealthy lifestyle altogether. And what about those who don’t work, i.e. adults who are students, retired or otherwise out of work? How does health vary across all these people, and is there a plausible connection between these factors?

Research question 2: Q2. How could smoking relate to a person’s health, and how does that differ across working categories?

Building on the first question, the focus here is on how smoking relates to health conditions and how that relationship varies across working categories. Smoking may or may not show up as poorer health for a given person; it may vary with age or daily activities. The survey data lets us check how things actually turn out, which makes this an important question to address.

Research question 3: Q3. How does having or not having a health coverage plan relate to health conditions, and how does that differ across working categories?

Having health cover may prove beneficial, but is that a given in every scenario, or does it vary from person to person? How does having a health plan but not enough time to go for a check-up or treatment relate to a person’s current health status? And how likely is it for a person to have no health plan and still be just as healthy? This one research question might not answer all of these, but it might bring us one step closer to finding the connections.

Part 3: Exploratory data analysis

Research question 1: Q1. How could health relate to different working categories?

To begin with, let’s look at how the surveyed people fare when it comes to their health. The following code chunk tabulates their general health:

table(brfss2013$genhlth)

## 
## Excellent Very good      Good      Fair      Poor 
##     85482    159076    150555     66726     27951

As can be seen, roughly 28K of the approximately 0.49M people who answered this question (about 5.7%) rate their general health as Poor.
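
Raw counts are easier to compare as shares. As a small sketch, the counts printed above can be turned into percentages with prop.table(). The vector name genhlth_counts is made up for illustration, and the numbers are simply copied from the table output rather than recomputed from the data:

```r
# Counts copied from the table(brfss2013$genhlth) output above.
genhlth_counts <- c(
  "Excellent" = 85482,
  "Very good" = 159076,
  "Good"      = 150555,
  "Fair"      = 66726,
  "Poor"      = 27951
)

# prop.table() divides each count by the total, so the shares sum to 1.
genhlth_share <- prop.table(genhlth_counts)

# Percentages rounded to one decimal place.
round(100 * genhlth_share, 1)
```

With the data loaded, prop.table(table(brfss2013$genhlth)) should give the same shares directly.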

Next we need to learn about the respondents’ work situations. For that, we can look at the column named employ1 and at how health differs across its levels:

brfss2013 %>%
  filter(!is.na(genhlth)) %>%
  group_by(employ1,genhlth ) %>%
  summarise(n())

## # A tibble: 45 x 3
## # Groups:   employ1 [?]
##               employ1   genhlth `n()`
##                <fctr>    <fctr> <int>
##  1 Employed for wages Excellent 43990
##  2 Employed for wages Very good 78534
##  3 Employed for wages      Good 60875
##  4 Employed for wages      Fair 15825
##  5 Employed for wages      Poor  2476
##  6      Self-employed Excellent 10232
##  7      Self-employed Very good 14271
##  8      Self-employed      Good 11283
##  9      Self-employed      Fair  3198
## 10      Self-employed      Poor   709
## # ... with 35 more rows

The results above show how people who are employed or self-employed fare with respect to health; the remaining rows (not printed) cover the rest of the participants, such as those who are students, not working or retired.

The data is fine as is, but it is not very readable: it is still hard to get the complete picture. To make interpretation easier, let’s create a new column that stores broader work categories. This is how it goes: 1) a person who is working, whether for wages or self-employed, falls under employed; 2) a person who is neither working nor retired, be it a student or someone who has been out of work for a while, falls under NotWorking; and finally 3) those who are retired are categorized simply as Retired.

# Drop rows with no employment answer, then collapse employ1 into three categories.
brfss2013 <- brfss2013 %>%
  filter(!is.na(employ1)) %>%
  mutate(work_cat = ifelse(employ1 %in% c("Employed for wages", "Self-employed"), "employed",
                           ifelse(employ1 == "Retired", "Retired", "NotWorking")))

Explanation of how the code works: this is a nested ifelse inside mutate. It does exactly what the three categories above describe. If a person is employed for wages or self-employed, they are categorized as employed; else if they are retired they are categorized as Retired; everyone else is categorized as NotWorking. Now, one might wonder, what about the NAs in employ1? The concern is genuine: without a filter, missing answers would carry through as NA in the new column, so the filter drops every row where employ1 is NA. Note that the filtered result is assigned back to brfss2013, so those rows are excluded from every later step as well.
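
The same three-way split can also be written with dplyr’s case_when(), which some find easier to read than nested ifelse calls. The following is only a sketch on a made-up five-row data frame (toy), not the code used in this analysis; "A student" is assumed here as an example of an employ1 level that falls through to NotWorking:

```r
library(dplyr)

# Toy stand-in for brfss2013. "A student" is an assumed example level;
# the NA row mimics a respondent with no employment answer.
toy <- data.frame(
  employ1 = c("Employed for wages", "Self-employed", "Retired", "A student", NA),
  stringsAsFactors = FALSE
)

toy_cat <- toy %>%
  filter(!is.na(employ1)) %>%          # drop missing answers first
  mutate(
    work_cat = case_when(
      employ1 %in% c("Employed for wages", "Self-employed") ~ "employed",
      employ1 == "Retired"                                  ~ "Retired",
      TRUE                                                  ~ "NotWorking"  # everything else
    )
  )

toy_cat
```

Conditions are checked top to bottom and the first match wins, so the final TRUE branch plays the role of the innermost else.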

So, let’s create the summary again:

brfss2013 %>%
  filter(!is.na(genhlth)) %>%
  group_by(work_cat, genhlth) %>%
  summarise(totals = n())

## # A tibble: 15 x 3
## # Groups:   work_cat [?]
##      work_cat   genhlth totals
##         <chr>    <fctr>  <int>
##  1   employed Excellent  54222
##  2   employed Very good  92805
##  3   employed      Good  72158
##  4   employed      Fair  19023
##  5   employed      Poor   3185
##  6 NotWorking Excellent  13516
##  7 NotWorking Very good  24214
##  8 NotWorking      Good  30552
##  9 NotWorking      Fair  23587
## 10 NotWorking      Poor  15625
## 11    Retired Excellent  17110
## 12    Retired Very good  41216
## 13    Retired      Good  46701
## 14    Retired      Fair  23602
## 15    Retired      Poor   8937

Although the numbers are right in front of us, it is still hard to see how each work category is distributed across the health levels. The following piece of code helps with just that:

# geom_bar() counts rows per category by default, so no stat override is needed.
ggplot(data = brfss2013, aes(x = work_cat, fill = factor(genhlth))) +
  geom_bar(width = 0.5) +
  xlab("work categories") +
  ylab("total count") +
  labs(fill = "genhlth")

As the visual shows, those who are employed are the least likely to report Poor health (about 1.3% of employed respondents in the summary above), those who are not working are the most likely (about 14.5%), and those who are retired fall in between (about 6.5%).
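
Because the three work categories differ a lot in size, within-category shares are a fairer comparison than bar heights. A minimal sketch, using the counts copied from the work_cat by genhlth summary above (the matrix name counts is made up for illustration):

```r
health_levels <- c("Excellent", "Very good", "Good", "Fair", "Poor")

# Counts copied from the work_cat by genhlth summary above.
counts <- rbind(
  employed   = c(54222, 92805, 72158, 19023,  3185),
  NotWorking = c(13516, 24214, 30552, 23587, 15625),
  Retired    = c(17110, 41216, 46701, 23602,  8937)
)
colnames(counts) <- health_levels

# margin = 1 normalises within each row, so every work category sums to 1.
shares <- prop.table(counts, margin = 1)

# Percentages rounded to one decimal place.
round(100 * shares, 1)
```

In the plot itself, passing position = "fill" to the bar layer should draw the same shares instead of counts.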

This gives us something to begin with, but it is not enough to establish a definitive connection between work categories and health conditions. For that, one must dig further into the dataset and perform more exploratory analysis based on the information from this visual.

Research question 2: Q2. How could smoking relate to a person’s health, and how does that differ across working categories?

Smoking is injurious to health: that is one way to put it, as a fact, but a statistician must always rely on numbers. To begin with, let’s look at what data is collected on an individual’s smoking habits and what the count is for each level in our survey data frame.

brfss2013 %>%
  filter(!is.na(X_smoker3))%>%
  group_by(X_smoker3) %>%
  summarise(total = n())

## # A tibble: 4 x 2
##                               X_smoker3  total
##                                  <fctr>  <int>
## 1 Current smoker - now smokes every day  54935
## 2 Current smoker - now smokes some days  21384
## 3                         Former smoker 137626
## 4                          Never smoked 260308

A large number of respondents have never smoked, but does that keep them from being ill? What about passive smoking? That is a question the survey could have included, along the lines of “Although you do not smoke, do you have friends or family who do, and do you accompany them while they smoke?” For now, given the data at hand, we have many people who smoke or who used to. To analyze this further, let’s summarize health conditions within each smoking level.

brfss2013 %>%
  filter(!is.na(genhlth), !is.na(X_smoker3)) %>%
  group_by(X_smoker3, genhlth) %>%
  summarise(total_count = n())

## # A tibble: 20 x 3
## # Groups:   X_smoker3 [?]
##                                X_smoker3   genhlth total_count
##                                   <fctr>    <fctr>       <int>
##  1 Current smoker - now smokes every day Excellent        5223
##  2 Current smoker - now smokes every day Very good       14034
##  3 Current smoker - now smokes every day      Good       19373
##  4 Current smoker - now smokes every day      Fair       10798
##  5 Current smoker - now smokes every day      Poor        5278
##  6 Current smoker - now smokes some days Excellent        2660
##  7 Current smoker - now smokes some days Very good        6061
##  8 Current smoker - now smokes some days      Good        6693
##  9 Current smoker - now smokes some days      Fair        3772
## 10 Current smoker - now smokes some days      Poor        2102
## 11                         Former smoker Excellent       20305
## 12                         Former smoker Very good       43238
## 13                         Former smoker      Good       43258
## 14                         Former smoker      Fair       20626
## 15                         Former smoker      Poor        9626
## 16                          Never smoked Excellent       54099
## 17                          Never smoked Very good       90811
## 18                          Never smoked      Good       75556
## 19                          Never smoked      Fair       29007
## 20                          Never smoked      Poor        9903

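So, there we go! At first glance, a large number of those who smoke are still faring well when it comes to health. It may look surprising that those who have never smoked have the largest raw count of Poor responses (9,903) of any X_smoker3 level, but that group is also by far the largest. As a share of each group, Poor health is reported by about 3.8% of never smokers, compared with about 7.0% of former smokers, 9.6% of daily smokers and 9.9% of occasional smokers, so the counts alone are misleading here.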
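
Since the smoking groups are very different in size, it helps to look at the share of Poor responses within each group rather than the raw counts. The sketch below rebuilds the summary from the counts printed above; the data frame name smoke_counts and the shortened group labels ("Every day", "Some days") are made up for illustration:

```r
library(dplyr)

# Counts copied from the X_smoker3 by genhlth summary above
# (labels for the two current-smoker levels are shortened).
smoke_counts <- data.frame(
  X_smoker3   = rep(c("Every day", "Some days", "Former smoker", "Never smoked"), each = 5),
  genhlth     = rep(c("Excellent", "Very good", "Good", "Fair", "Poor"), times = 4),
  total_count = c( 5223, 14034, 19373, 10798, 5278,
                   2660,  6061,  6693,  3772, 2102,
                  20305, 43238, 43258, 20626, 9626,
                  54099, 90811, 75556, 29007, 9903),
  stringsAsFactors = FALSE
)

# Share of each health level within its smoking group, then keep only Poor.
poor_share <- smoke_counts %>%
  group_by(X_smoker3) %>%
  mutate(share = total_count / sum(total_count)) %>%
  ungroup() %>%
  filter(genhlth == "Poor") %>%
  select(X_smoker3, total_count, share)

poor_share
```

On the full data, the same group_by() and mutate() steps could be appended to the summarise() call above instead of retyping the counts.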

To see how these two factors fare across working categories, we first need a new column with combined categories. But don’t we already have smoking categories in X_smoker3? We do, but to compare smoking, health and work together, it is easier to first combine two of the factors into one column and then map that against the third.

In order to do so, let’s create the new column. Its categories are: 1) ‘DangerZone’, for current or former smokers who are in Poor health; 2) ‘Safe’, for everyone whose general health is better than Poor, whether or not they smoke; and finally 3) ‘illNonSmoker’, for those who have never smoked but are still in Poor health.

And as a precaution, let’s filter out the NAs from both columns before creating the new one:

# Drop rows missing either answer, then combine smoking status with Poor health.
brfss2013 <- brfss2013 %>%
  filter(!is.na(X_smoker3), !is.na(genhlth)) %>%
  mutate(smoke_cat = ifelse(genhlth == "Poor" &
                              X_smoker3 %in% c("Current smoker - now smokes every day",
                                               "Current smoker - now smokes some days",
                                               "Former smoker"), "DangerZone",
                            ifelse(genhlth == "Poor" & X_smoker3 == "Never smoked", "illNonSmoker", "Safe")))

Let’s summarize this newly created column with work categories to see how things turn out to be:

brfss2013 %>%
  group_by(work_cat,smoke_cat) %>%
  summarize(totalCount = n())

## # A tibble: 9 x 3
## # Groups:   work_cat [?]
##     work_cat    smoke_cat totalCount
##        <chr>        <chr>      <int>
## 1   employed   DangerZone       1770
## 2   employed illNonSmoker       1311
## 3   employed         Safe     231374
## 4 NotWorking   DangerZone      10025
## 5 NotWorking illNonSmoker       5142
## 6 NotWorking         Safe      89157
## 7    Retired   DangerZone       5211
## 8    Retired illNonSmoker       3450
## 9    Retired         Safe     124983

This gives a fairly clear measure of how things stand, but let’s put a visual on top of it to make things clearer still:

# One panel per smoke_cat/genhlth combination; geom_bar() counts rows by default.
ggplot(data = brfss2013, aes(x = work_cat, fill = factor(genhlth))) +
  geom_bar(width = 0.5) +
  facet_wrap(~ smoke_cat + genhlth) +
  xlab("Work_Category") +
  ylab("TotalCount") +
  labs(fill = "genhlth")

Here the visual borrows the simplicity of facets, which split a plot into separate panels for different scenarios. This is particularly helpful when plotting three columns at once. As can be seen, those who are employed are a lot less prone to Poor health, whether they smoke, used to smoke or never smoked, while those who are not working fare the worst. This brings us a step closer to seeing how not having work might go along with poor health. There are, of course, a multitude of other factors that might relate to this situation, and without examining them one can never be sure what relates to what.
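
Because the Safe group dwarfs the other two, a chart of shares can make the smaller groups easier to compare across work categories. The following is a sketch rather than part of the original analysis: it rebuilds the work_cat by smoke_cat counts printed above in a made-up data frame (smoke_work) and lets position = "fill" rescale each bar to the same height:

```r
library(ggplot2)

# Counts copied from the work_cat by smoke_cat summary above.
smoke_work <- data.frame(
  work_cat   = rep(c("employed", "NotWorking", "Retired"), each = 3),
  smoke_cat  = rep(c("DangerZone", "illNonSmoker", "Safe"), times = 3),
  totalCount = c( 1770, 1311, 231374,
                 10025, 5142,  89157,
                  5211, 3450, 124983),
  stringsAsFactors = FALSE
)

# Share of each smoke_cat within its work category (ave() returns group sums per row).
smoke_work$share <- with(smoke_work, totalCount / ave(totalCount, work_cat, FUN = sum))

# position = "fill" stacks the bars and rescales each one to height 1.
p <- ggplot(smoke_work, aes(x = work_cat, y = totalCount, fill = smoke_cat)) +
  geom_col(position = "fill", width = 0.5) +
  xlab("work categories") +
  ylab("share of respondents") +
  labs(fill = "smoke_cat")

p
```

The share column holds the same proportions as numbers, which is handy when the DangerZone and illNonSmoker slices are too thin to read off the chart.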

For now, what this research shows is that people who smoke do report Poor health, but those who are working have it easier than those who are either retired or not working.

Research question 3: Q3. How does having or not having a health coverage plan relate to health conditions, and how does that differ across working categories?

As mentioned earlier, having a health plan suggests that one is concerned about their health. But does that mean that someone who is concerned about their well-being is on the greener side of health?

This one factor alone is not enough to give us solid proof of how things relate here. This research may not produce anything conclusive, but it might take us a step in the right direction.

First things first, let’s look at what the data tells us:

table(brfss2013$hlthpln1)

## 
##    Yes     No 
## 418201  52606

Most respondents do have a health plan, which suggests that people ARE concerned about their well-being. But wait a minute, this does not tell us whether they really are that concerned, for having a health plan and actually caring for one’s health are two completely different things.

To dig deeper, let’s quickly summarize our data with respect to health conditions this time:

brfss2013 %>%
  filter(!is.na(genhlth)) %>%
  group_by(hlthpln1,genhlth) %>%
  summarize(totalcount = n())

## # A tibble: 15 x 3
## # Groups:   hlthpln1 [?]
##    hlthpln1   genhlth totalcount
##      <fctr>    <fctr>      <int>
##  1      Yes Excellent      74262
##  2      Yes Very good     140063
##  3      Yes      Good     125607
##  4      Yes      Fair      54784
##  5      Yes      Poor      23485
##  6       No Excellent       7745
##  7       No Very good      13628
##  8       No      Good      18703
##  9       No      Fair       9206
## 10       No      Poor       3324
## 11     <NA> Excellent        280
## 12     <NA> Very good        453
## 13     <NA>      Good        570
## 14     <NA>      Fair        213
## 15     <NA>      Poor        100

This data tells us that most people who have a health plan report Good health or better. But wait a minute! In raw counts, Poor health is reported about seven times as often by people with a health plan (23,485) as by people without one (3,324). Could it be that those who take a health plan become more careless about their well-being, thinking that if things go astray they still have a plan in place to take care of it? Or could it mean something else? It could: there are also about eight times as many people with a plan as without. As a share of each group, Poor health is slightly more common among those without a plan (about 6.3%) than among those with one (about 5.6%).
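
One way to check what the counts mean is to convert them to shares within each health plan group. A minimal sketch, using the Yes and No rows copied from the summary above (the NA rows are left out, and the matrix name plan_counts is made up for illustration):

```r
health_levels <- c("Excellent", "Very good", "Good", "Fair", "Poor")

# Counts copied from the hlthpln1 by genhlth summary above (NA rows omitted).
plan_counts <- rbind(
  Yes = c(74262, 140063, 125607, 54784, 23485),
  No  = c( 7745,  13628,  18703,  9206,  3324)
)
colnames(plan_counts) <- health_levels

# Normalise within each row so the two groups are on the same scale.
plan_share <- prop.table(plan_counts, margin = 1)

# Percentage reporting Poor health, and Fair or Poor combined.
poor_pct      <- round(100 * plan_share[, "Poor"], 1)
fair_poor_pct <- round(100 * rowSums(plan_share[, c("Fair", "Poor")]), 1)

poor_pct
fair_poor_pct
```

On this scale the group without a health plan has the larger share in both Poor and Fair-or-Poor health, which is the opposite of what the raw counts suggest.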

For now, let’s see how this spreads out across the different work categories.

brfss2013 %>%
  filter(!is.na(genhlth)) %>%
  group_by(hlthpln1,genhlth,work_cat) %>%
  summarize(totalCount = n())

## # A tibble: 45 x 4
## # Groups:   hlthpln1, genhlth [?]
##    hlthpln1   genhlth   work_cat totalCount
##      <fctr>    <fctr>      <chr>      <int>
##  1      Yes Excellent   employed      47476
##  2      Yes Excellent NotWorking      10544
##  3      Yes Excellent    Retired      16242
##  4      Yes Very good   employed      81594
##  5      Yes Very good NotWorking      19148
##  6      Yes Very good    Retired      39321
##  7      Yes      Good   employed      58955
##  8      Yes      Good NotWorking      22585
##  9      Yes      Good    Retired      44067
## 10      Yes      Fair   employed      14111
## # ... with 35 more rows

Let’s look at those in Poor health in two groups: those who have a health plan and those who don’t. Among people in Poor health who have a health plan, those who are NotWorking or Retired make up the bigger chunk. Among people in Poor health who do not have a health plan, the ranking changes: people who are NotWorking still take the lead, but the employed take second place and the retired come last.

It’s now time to create a new column that combines the values in the hlthpln1 and genhlth columns. The categories are as follows: 1) illnessCovered: people who have a health plan and are in Poor health. 2) illnessNotCovered: people who do not have a health plan and are in Poor health. 3) HlthyOthrwse: everyone else, i.e. people whose general health is better than Poor, with or without a health plan.

brfss2013<- brfss2013 %>%
  filter(!is.na(hlthpln1), !is.na(genhlth))%>%
  mutate(plans_cat = ifelse(hlthpln1 == "Yes" & genhlth == "Poor","illnessCovered",
                            ifelse(hlthpln1 == "No" & genhlth == "Poor", "illnessNotCovered","HlthyOthrwse")))

Let’s create a visual on top of it to have a better understanding of the situation at hand:

# One panel per plans_cat/genhlth combination; geom_bar() counts rows by default.
ggplot(data = brfss2013, aes(x = work_cat, fill = factor(genhlth))) +
  geom_bar(width = 0.5) +
  facet_wrap(~ plans_cat + genhlth) +
  xlab("work categories") +
  ylab("total count") +
  labs(fill = "genhlth")

This visual captures the information behind this research question. It is consistent with the trend observed so far: people who are not working report Poor health more often than those who are retired, who in turn report it more often than those who are working.

But, as mentioned earlier, this is just one part of the whole. The BRFSS data is massive, and building concrete evidence for any of these relationships would take a much more detailed analysis, including multiple derived columns and variable mappings, created and plotted before a substantial result can be given.