library(ggplot2)
library(dplyr)
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 3.4.3
#library(statsr)
load("brfss2013.Data")
Behavioral Risk Factor Surveillance System (BRFSS): This is a cross-sectional telephone survey that state health departments conduct monthly over landline and cellular telephones, using a standardized questionnaire and with technical and methodological assistance from the Centers for Disease Control and Prevention (CDC).
The data are collected across all states in the US as well as the District of Columbia and three U.S. territories. Close to 0.4M adult interviews are completed every year. The data cover a wide range of topics, including: general information such as residential conditions, age, number of family members and whether the respondent is working; current health status; health history over the past month, from sleeping routine to BP levels, blood sugar levels and cardiac health issues; and eating and drinking habits.
All this information, collected from as many as 0.4M adults across the United States and its territories, provides a large body of data that can be used to look for patterns in how any surveyed behavior or condition relates to health status. Because respondents are selected by random sampling, the results are at the very least broadly generalizable to the adult population of the United States. However, the survey is observational: there is no random assignment and there are no control groups per se, so the data can show association but cannot establish causation.
Research question 1: Q1. How does health relate to different working categories?
How people go about their daily lives may have a large effect on their health. Working might go along with an active lifestyle, thus promoting good health. It could also mean better economic conditions, which in turn make it easier to have regular check-ups and to buy costly medicines. At the same time, long work hours could bring stress, or perhaps excessive smoking, which might lead to an unhealthy lifestyle altogether. And what about those who don’t work, i.e. adults who are students or retired? How does health vary across all these groups, and is there a plausible connection between these factors?
Research question 2: Q2. How does smoking relate to a person’s health, and how does this differ across working categories?
Building on the first question, the focus here is on how smoking relates to health and how that relationship varies across working categories. Smoking may or may not show up as poorer health for a given person; it may vary with age or daily activities. The survey data make it possible to check how this plays out, which makes this an important question to address.
Research question 3: Q3. How does having or not having a health coverage plan relate to health conditions, and how does this differ across working categories?
Having health coverage may prove beneficial, but is that a given in all scenarios, or does it vary from person to person? How does having a health plan but not enough time to go for a check-up or treatment relate to a person’s current health status? And how likely is a person without a health plan to be just as healthy? This one research question might not answer all of these, but it might bring us one step closer to finding the connections.
Research question 1: Q1. How does health relate to different working categories?
To begin with, let’s look at how the surveyed people fare when it comes to their health. The following code chunk tabulates their self-reported general health:
table(brfss2013$genhlth)
##
## Excellent Very good Good Fair Poor
## 85482 159076 150555 66726 27951
As can be seen, about 28K of the roughly 490K people who answered this question (27,951 of 489,790) rate their general health as Poor.
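Raw counts like these are easier to compare once they are turned into shares. Here is a minimal sketch of that step, assuming only the five counts printed above; the names genhlth_counts and genhlth_share are made up for this illustration and are not part of the original analysis.

```r
# Counts copied from the table(brfss2013$genhlth) output above
genhlth_counts <- c(Excellent = 85482, `Very good` = 159076, Good = 150555,
                    Fair = 66726, Poor = 27951)

# prop.table() divides each count by the overall total
genhlth_share <- prop.table(genhlth_counts)

# Shares as percentages, rounded to one decimal place
round(100 * genhlth_share, 1)
```

On the full data frame the same idea would be prop.table(table(brfss2013$genhlth)).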
Next we need to learn about the work situations of the people surveyed. For that, we can look at the column employ1 and at how general health differs across its levels:
brfss2013 %>%
  filter(!is.na(genhlth)) %>%
  group_by(employ1, genhlth) %>%
  summarise(n())
## # A tibble: 45 x 3
## # Groups: employ1 [?]
## employ1 genhlth `n()`
## <fctr> <fctr> <int>
## 1 Employed for wages Excellent 43990
## 2 Employed for wages Very good 78534
## 3 Employed for wages Good 60875
## 4 Employed for wages Fair 15825
## 5 Employed for wages Poor 2476
## 6 Self-employed Excellent 10232
## 7 Self-employed Very good 14271
## 8 Self-employed Good 11283
## 9 Self-employed Fair 3198
## 10 Self-employed Poor 709
## # ... with 35 more rows
The results above show how respondents in each employment category fare with respect to general health. Only the first 10 of the 45 rows are printed, covering those employed for wages and the self-employed; the remaining rows cover the rest of the participants (students, those not working, and retirees).
The data is great as is, but it is not very readable, and it is still hard to get the complete picture. To make interpretation easier, let’s create a new column that stores broader work categories: 1) a person who is working, whether for wages or self-employed, is categorized as employed; 2) a person who is not working, whether a student or someone who has been out of work for some time, is categorized as NotWorking; and 3) those who are retired are categorized simply as Retired.
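# Keep only respondents with a known employment status, then collapse
# employ1 into three broader work categories.
brfss2013 <- brfss2013 %>%
  filter(!is.na(employ1)) %>%
  mutate(work_cat = ifelse(employ1 == "Employed for wages" | employ1 == "Self-employed", "employed",
                    ifelse(employ1 == "Retired", "Retired", "NotWorking")))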
Explanation of how the code works: this is a nested ifelse inside mutate. It does exactly what the three categories listed above describe. If a person is employed for wages or self-employed, they are categorized as employed; otherwise, if they are retired, they are categorized as Retired; everyone else is categorized as NotWorking. What about the NAs among the employ1 responses? To handle them, a filter is in place that drops every row where the employment response is NA. Note that, because the result is assigned back to brfss2013, those rows are removed for all later steps as well.
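For comparison, the same three-way categorization can also be written with dplyr’s case_when, which some find easier to read than nested ifelse calls. This is only a sketch on toy data: the data frame toy and its rows are hypothetical and are not taken from brfss2013, and it is not the code used in this analysis.

```r
library(dplyr)

# Toy data (hypothetical rows, not taken from brfss2013)
toy <- data.frame(
  employ1 = c("Employed for wages", "Self-employed", "Retired",
              "A student", "Unable to work", NA),
  stringsAsFactors = FALSE
)

toy_cat <- toy %>%
  filter(!is.na(employ1)) %>%   # drop NAs first, as in the chunk above
  mutate(work_cat = case_when(
    employ1 %in% c("Employed for wages", "Self-employed") ~ "employed",
    employ1 == "Retired"                                  ~ "Retired",
    TRUE                                                  ~ "NotWorking"
  ))

toy_cat$work_cat
```

One difference worth knowing: without the filter, case_when would send an NA response to the catch-all TRUE branch and label it NotWorking, whereas the nested ifelse would return NA for that row. Dropping the NAs first avoids the question either way.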
So, let’s create the summary again:
brfss2013 %>%
  filter(!is.na(genhlth)) %>%
  group_by(work_cat, genhlth) %>%
  summarise(totals = n())
## # A tibble: 15 x 3
## # Groups: work_cat [?]
## work_cat genhlth totals
## <chr> <fctr> <int>
## 1 employed Excellent 54222
## 2 employed Very good 92805
## 3 employed Good 72158
## 4 employed Fair 19023
## 5 employed Poor 3185
## 6 NotWorking Excellent 13516
## 7 NotWorking Very good 24214
## 8 NotWorking Good 30552
## 9 NotWorking Fair 23587
## 10 NotWorking Poor 15625
## 11 Retired Excellent 17110
## 12 Retired Very good 41216
## 13 Retired Good 46701
## 14 Retired Fair 23602
## 15 Retired Poor 8937
Although the numbers are right in front of us, it is still hard to see how each work category compares across the different health levels. The following piece of code helps with just that:
# geom_bar() counts the rows in each work_cat/genhlth combination
ggplot(data = brfss2013, aes(x = work_cat, fill = factor(genhlth))) +
  geom_bar(width = 0.5) +
  xlab("work categories") +
  ylab("total count") +
  labs(fill = "genhlth")
As the visual above and the counts behind it show, those who are employed report Poor health least often (3,185 of 241,393, about 1.3%), those who are not working report it most often (15,625 of 107,494, about 14.5%), and those who are retired fall in between (8,937 of 137,566, about 6.5%).
This gives us something to begin with, but it is not enough to establish a definitive connection between work categories and health conditions. For that, one must dig further into the dataset and perform more exploratory analysis based on what this visual suggests.
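Because the three work categories differ a lot in size, comparing them is fairer on a within-group scale. The sketch below rebuilds the work_cat by genhlth summary as a plain matrix and converts each row to shares. The object names counts and row_share are made up for this illustration; the numbers are the ones printed in the summary above.

```r
# Counts copied from the work_cat by genhlth summary above
counts <- matrix(
  c(54222, 92805, 72158, 19023,  3185,
    13516, 24214, 30552, 23587, 15625,
    17110, 41216, 46701, 23602,  8937),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    work_cat = c("employed", "NotWorking", "Retired"),
    genhlth  = c("Excellent", "Very good", "Good", "Fair", "Poor")
  )
)

# margin = 1 makes each row (work category) sum to 1
row_share <- prop.table(counts, margin = 1)

# Percentage of each work category reporting Poor health
round(100 * row_share[, "Poor"], 1)
```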
Research question 2: Q2. How does smoking relate to a person’s health, and how does this differ across working categories?
Smoking is injurious to health; that is usually stated as a fact, but a statistician must rely on numbers. To begin with, let’s see what data are collected on an individual’s smoking habits and what the count is for each level in our survey data frame.
brfss2013 %>%
  filter(!is.na(X_smoker3)) %>%
  group_by(X_smoker3) %>%
  summarise(total = n())
## # A tibble: 4 x 2
## X_smoker3 total
## <fctr> <int>
## 1 Current smoker - now smokes every day 54935
## 2 Current smoker - now smokes some days 21384
## 3 Former smoker 137626
## 4 Never smoked 260308
A large number of respondents have never smoked, but does that keep them from being ill? What about passive smoking? That would have been a useful question to include in the survey, along the lines of “Though you do not smoke, do you have friends or family who do, and do you accompany them while they smoke?” Given the data at hand, we also have many people who currently smoke or who are former smokers. To analyze this further, let’s summarize health conditions for each smoking status.
brfss2013 %>%
  filter(!is.na(genhlth), !is.na(X_smoker3)) %>%
  group_by(X_smoker3, genhlth) %>%
  summarise(total_count = n())
## # A tibble: 20 x 3
## # Groups: X_smoker3 [?]
## X_smoker3 genhlth total_count
## <fctr> <fctr> <int>
## 1 Current smoker - now smokes every day Excellent 5223
## 2 Current smoker - now smokes every day Very good 14034
## 3 Current smoker - now smokes every day Good 19373
## 4 Current smoker - now smokes every day Fair 10798
## 5 Current smoker - now smokes every day Poor 5278
## 6 Current smoker - now smokes some days Excellent 2660
## 7 Current smoker - now smokes some days Very good 6061
## 8 Current smoker - now smokes some days Good 6693
## 9 Current smoker - now smokes some days Fair 3772
## 10 Current smoker - now smokes some days Poor 2102
## 11 Former smoker Excellent 20305
## 12 Former smoker Very good 43238
## 13 Former smoker Good 43258
## 14 Former smoker Fair 20626
## 15 Former smoker Poor 9626
## 16 Never smoked Excellent 54099
## 17 Never smoked Very good 90811
## 18 Never smoked Good 75556
## 19 Never smoked Fair 29007
## 20 Never smoked Poor 9903
So, there we go! At first glance, a large number of those who smoke are still faring well when it comes to health. Those who have never smoked have the largest raw count of Poor health responses (9,903), but they are also by far the largest group. As a share of each group, Poor health is reported by about 9.6% of daily smokers, 9.9% of occasional smokers, 7.0% of former smokers and only 3.8% of never-smokers, so the raw counts alone are misleading here.
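Since the four smoking groups are very different in size, it helps to look at the share of Poor responses within each group and not only at the counts. Here is a small sketch of that calculation with dplyr, using the counts printed in the summary above. The names smoke_counts, poor_rates and share are assumptions made for this illustration.

```r
library(dplyr)

# Counts copied from the X_smoker3 by genhlth summary above
smoke_counts <- data.frame(
  X_smoker3 = rep(c("Current smoker - now smokes every day",
                    "Current smoker - now smokes some days",
                    "Former smoker", "Never smoked"), each = 5),
  genhlth = rep(c("Excellent", "Very good", "Good", "Fair", "Poor"), times = 4),
  total_count = c( 5223, 14034, 19373, 10798, 5278,
                   2660,  6061,  6693,  3772, 2102,
                  20305, 43238, 43258, 20626, 9626,
                  54099, 90811, 75556, 29007, 9903),
  stringsAsFactors = FALSE
)

poor_rates <- smoke_counts %>%
  group_by(X_smoker3) %>%
  mutate(share = total_count / sum(total_count)) %>%  # share within each smoking status
  ungroup() %>%
  filter(genhlth == "Poor") %>%
  select(X_smoker3, total_count, share)

poor_rates
```

The group with the largest count of Poor responses turns out to be the one with the smallest share, which is why counts and shares should be read together.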
Now, to see how these two factors fare across the different work categories, we first need a new column with smoking categories. We already have X_smoker3, but comparing smoking, health and work at once is easier if two of the factors are first combined into a single column, which can then be mapped against the third.
The categories for the new column are: 1) ‘DangerZone’ for those who smoke or used to smoke and have Poor health; 2) ‘illNonSmoker’ for those who have never smoked but still have Poor health; and 3) ‘Safe’ for everyone else, i.e. anyone whose health is not Poor, whatever their smoking status.
As a precaution, let’s filter out all the NAs from both columns before creating the new one:
# Drop rows with missing smoking status or general health, then combine the two
# columns. & is evaluated before |, so each "Poor & status" pair is tested first.
brfss2013 <- brfss2013 %>%
  filter(!is.na(X_smoker3), !is.na(genhlth)) %>%
  mutate(smoke_cat = ifelse(
    genhlth == "Poor" & X_smoker3 == "Current smoker - now smokes every day" |
      genhlth == "Poor" & X_smoker3 == "Current smoker - now smokes some days" |
      genhlth == "Poor" & X_smoker3 == "Former smoker", "DangerZone",
    ifelse(genhlth == "Poor" & X_smoker3 == "Never smoked", "illNonSmoker", "Safe")))
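The condition in the mutate call above mixes & and | without parentheses. That works because R evaluates & before |, but it is easy to misread. The sketch below checks this on toy logical vectors; poor, daily and former are hypothetical stand-ins for the comparisons in the real code, not BRFSS data.

```r
# Toy logical vectors (hypothetical, not BRFSS data)
poor   <- c(TRUE, TRUE, FALSE, FALSE)
daily  <- c(TRUE, FALSE, TRUE, FALSE)
former <- c(FALSE, TRUE, FALSE, TRUE)

# No parentheses: & binds tighter than |
implicit <- poor & daily | poor & former

# The same condition with the grouping spelled out
explicit <- (poor & daily) | (poor & former)

# An equivalent, shorter form that factors out the shared test
factored <- poor & (daily | former)

# A different grouping gives a different answer
wrong <- poor & (daily | poor) & former

identical(implicit, explicit)
identical(implicit, factored)
identical(implicit, wrong)
```

Adding the parentheses explicitly, or factoring out the shared genhlth == "Poor" test, would make the intent clearer without changing the result.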
Let’s summarize this newly created column with work categories to see how things turn out to be:
brfss2013 %>%
  group_by(work_cat, smoke_cat) %>%
  summarize(totalCount = n())
## # A tibble: 9 x 3
## # Groups: work_cat [?]
## work_cat smoke_cat totalCount
## <chr> <chr> <int>
## 1 employed DangerZone 1770
## 2 employed illNonSmoker 1311
## 3 employed Safe 231374
## 4 NotWorking DangerZone 10025
## 5 NotWorking illNonSmoker 5142
## 6 NotWorking Safe 89157
## 7 Retired DangerZone 5211
## 8 Retired illNonSmoker 3450
## 9 Retired Safe 124983
This gives a fairly clear measure of how things stand in the data, but let’s create a visual on top of it to make things clearer.
# One panel per smoke_cat/genhlth combination; geom_bar() counts rows per work_cat
ggplot(data = brfss2013, aes(x = work_cat, fill = factor(genhlth))) +
  geom_bar(width = 0.5) +
  facet_wrap(~ smoke_cat + genhlth) +
  xlab("Work_Category") +
  ylab("TotalCount") +
  labs(fill = "genhlth")
Here the visual uses facets, which draw a separate panel for each combination of the faceting variables. This is particularly helpful when visualizing three columns at once. As can be seen, those who are employed are much less likely to be in Poor health, whether they smoke, used to smoke or never smoked, while those who are not working are the worst off. This brings us a step closer to seeing how not working might go along with poor health. Still, a multitude of other factors might be involved, and without examining them all, one can never be sure what relates to what.
For now, what this research shows is that Poor health is reported more often by those who smoke or used to smoke, but those who are working fare better than those who are retired or not working.
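To see faceting in isolation, here is a small sketch that plots the work_cat by smoke_cat counts printed earlier, with one panel per smoking category. It is an illustration only and not the plot used above: the data frame smoke_work is made up from the printed summary, and the choice of scales = "free_y" is an assumption made so that the large Safe counts do not flatten the other two panels.

```r
library(ggplot2)

# Counts copied from the work_cat by smoke_cat summary above
smoke_work <- data.frame(
  work_cat   = rep(c("employed", "NotWorking", "Retired"), each = 3),
  smoke_cat  = rep(c("DangerZone", "illNonSmoker", "Safe"), times = 3),
  totalCount = c(1770, 1311, 231374,
                 10025, 5142, 89157,
                 5211, 3450, 124983)
)

# geom_col() plots the pre-computed counts; facet_wrap() draws one panel per
# smoke_cat, and scales = "free_y" lets each panel use its own y-axis range
p <- ggplot(smoke_work, aes(x = work_cat, y = totalCount)) +
  geom_col(width = 0.5) +
  facet_wrap(~ smoke_cat, scales = "free_y") +
  xlab("Work_Category") +
  ylab("TotalCount")

print(p)
```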
Research question 3: Q3. How does having or not having a health coverage plan relate to health conditions, and how does this differ across working categories?
As mentioned earlier, having a health plan suggests that one is concerned about their health. But does that mean that someone who is concerned about their well-being is actually in better health?
This one factor alone is not enough to give solid proof of how things relate here. This research may not produce anything conclusive, but it might take us a step closer in the right direction.
First things first, let’s look at what the data tells us about people:
table(brfss2013$hlthpln1)
##
## Yes No
## 418201 52606
Well, surprisingly enough, people ARE concerned about their well-being: about 89% of respondents (418,201 of 470,807) have a health plan. But wait a minute, this does not tell us whether they are really that concerned, for having a health plan and actually caring for one’s health are two completely different things.
To dig deeper, let’s summarize the data with respect to health conditions this time:
brfss2013 %>%
  filter(!is.na(genhlth)) %>%
  group_by(hlthpln1, genhlth) %>%
  summarize(totalcount = n())
## # A tibble: 15 x 3
## # Groups: hlthpln1 [?]
## hlthpln1 genhlth totalcount
## <fctr> <fctr> <int>
## 1 Yes Excellent 74262
## 2 Yes Very good 140063
## 3 Yes Good 125607
## 4 Yes Fair 54784
## 5 Yes Poor 23485
## 6 No Excellent 7745
## 7 No Very good 13628
## 8 No Good 18703
## 9 No Fair 9206
## 10 No Poor 3324
## 11 <NA> Excellent 280
## 12 <NA> Very good 453
## 13 <NA> Good 570
## 14 <NA> Fair 213
## 15 <NA> Poor 100
This shows that most people with a health plan report Good health or better. But wait a minute! Compared with people who do not have a health plan, about seven times as many people with a plan report Poor health (23,485 versus 3,324). That is largely because the insured group is about eight times larger; as a share of each group, Poor health is slightly less common among those with a plan (about 5.6%) than among those without one (about 6.3%). Could it still be that some people who take out a health plan become more careless about their well-being, thinking that if things go astray they have a plan in place to take care of it?
Or could it mean something else?
For now, let’s see how this spreads out across the different work categories.
brfss2013 %>%
  filter(!is.na(genhlth)) %>%
  group_by(hlthpln1, genhlth, work_cat) %>%
  summarize(totalCount = n())
## # A tibble: 45 x 4
## # Groups: hlthpln1, genhlth [?]
## hlthpln1 genhlth work_cat totalCount
## <fctr> <fctr> <chr> <int>
## 1 Yes Excellent employed 47476
## 2 Yes Excellent NotWorking 10544
## 3 Yes Excellent Retired 16242
## 4 Yes Very good employed 81594
## 5 Yes Very good NotWorking 19148
## 6 Yes Very good Retired 39321
## 7 Yes Good employed 58955
## 8 Yes Good NotWorking 22585
## 9 Yes Good Retired 44067
## 10 Yes Fair employed 14111
## # ... with 35 more rows
Let’s look at those in Poor health one group at a time: first those who have a health plan, then those who don’t. Among people with a health plan and Poor health, those who are NotWorking or Retired make up the bigger share. The pattern shifts somewhat among people without a health plan who are in Poor health: the ranking of work categories changes. People who are NotWorking still lead, but those who are employed come second and those who are retired come last. (These rows fall in the part of the summary that is not printed above.)
It’s now time to create a new column that combines the values in the hlthpln1 and genhlth columns. The categories are as follows: 1) illnessCovered: people who have a health plan and are in Poor health. 2) illnessNotCovered: people who do not have a health plan and are in Poor health. 3) HlthyOthrwse: everyone else, i.e. people whose health is not Poor, with or without a health plan.
# Drop rows with missing plan status or general health, then combine the two columns
brfss2013 <- brfss2013 %>%
  filter(!is.na(hlthpln1), !is.na(genhlth)) %>%
  mutate(plans_cat = ifelse(hlthpln1 == "Yes" & genhlth == "Poor", "illnessCovered",
                     ifelse(hlthpln1 == "No" & genhlth == "Poor", "illnessNotCovered", "HlthyOthrwse")))
Let’s create a visual on top of it to have a better understanding of the situation at hand:
# One panel per plans_cat/genhlth combination; geom_bar() counts rows per work_cat
ggplot(data = brfss2013, aes(x = work_cat, fill = factor(genhlth))) +
  geom_bar(width = 0.5) +
  facet_wrap(~ plans_cat + genhlth) +
  xlab("work categories") +
  ylab("total count") +
  labs(fill = "genhlth")
This visual helps capture the information examined in this research question. Up to this point there has been an observable trend: people who are not working report Poor health more often than those who are retired, who in turn report it more often than those who are employed. Among people without a health plan, however, the retired and the employed swap places, as noted above.
But, as mentioned earlier, this is just one part of the whole. The BRFSS data set is massive, and building concrete evidence for any conclusion would need far more detailed analysis, including multiple derived columns and variable mappings, created and plotted before a substantial result can be given.