Setup

Load packages

install.packages(
   'printr',
   type = 'source',
   repos = c('http://yihui.name/xran', 'http://cran.rstudio.com')
)
library(ggplot2)
library(dplyr)
library(printr)
library(reshape2)

Load data

load("brfss2013.RData")

Part 1: Data

The sampling methodology used in the BRFSS study is a survey through telephone interviews. This is done across the 50 states of the United States of America including, the District of Columbia, Puerto Rico, Guam, and the US Virgin Islands.

“The BRFSS objective is to collect uniform, state-specific data on preventive health practices and risk behaviors that are linked to chronic diseases, injuries, and preventable infectious diseases that affect the adult population”.

BRFSS conducts both landline telephone and cellular telephone-based surveys. In conducting the BRFSS landline telephone survey, interviewers collect data from a randomly selected adult in a household. In conducting the cellular telephone version of the BRFSS questionnaire, interviewers collect data from an adult who participates by using a cellular telephone and resides in a private residence or college housing" - From Coursera Project Information on BRFSS, 2016.

The method of data collection used in observational in nature (i.e. random sampling). No random assignment is done to control for extraneous variables. As such we can only look for associations and correlations within variables of interest vis-a-vis our research questions. No decision on causality can be made from this study.


Part 2: Research questions

Research question 1:
What is the percentage of good health for people who exercised in the past 30 days exercise and have also drank alcohol and/or smoked compared to those that have not.

==> My belief on this issue, which forms my status quo opinion is that people who exercise and abstain from alcohol and smoking should report a better quality of health. This question will investigate that claim.

Research question 2:
What is the health status of smokers across various demographics who eat fruits.
==> How much does eating fruits impact the negative effects of smoking. This question will provide some insignts into the general perception of wellbeing of smokers who eat fruits.

Research question 3:
Is there any association between good health and income level

==> Do high level income earners generally enjoy better health quality? Are low level income earners more prone to general poor wellbeing. This question will investigate this issue.


Part 3: Exploratory data analysis

Research question 1:
For the first question, our interest is to examine the percentage of good health for people who exercised in the past 30 days, but also drank alcohol and/or smoked within that same time frame compared to those that exercised, but have neither drank nor smoked. To prepare for this research question, we will consider the variables:

# Variable type
brfss2013 %>%
        select(genhlth, exerany2, alcday5, smoke100) %>%
        str()
## 'data.frame':    491775 obs. of  4 variables:
##  $ genhlth : Factor w/ 5 levels "Excellent","Very good",..: 4 3 3 2 3 2 4 3 1 3 ...
##  $ exerany2: Factor w/ 2 levels "Yes","No": 2 1 2 1 2 1 1 1 1 1 ...
##  $ alcday5 : int  201 0 220 208 210 0 201 202 101 0 ...
##  $ smoke100: Factor w/ 2 levels "Yes","No": 1 2 1 2 1 2 1 1 2 2 ...
# Get People who have exercised in the past 30 days and those who have not
exercised30 <- brfss2013 %>%
        filter(exerany2 == "Yes")
noexercise <- brfss2013 %>%
        filter(exerany2 == "No")

# From people exercise30 and noexercise,
# get those who have had an alcoholic beverage or smoked
#exercise30
exercised30withAlcohol_and_Smoke <- exercised30 %>%
        filter(alcday5 > 0) %>%
        filter(smokday2 != "Not at all") %>%
        select(genhlth) %>%
        group_by(genhlth) %>% 
        summarise(count = n())
proptable30withnarcotics <- exercised30withAlcohol_and_Smoke %>%
        mutate(prop = count/sum(count))
proptable30withnarcotics
genhlth count prop
Excellent 3584 0.1366791
Very good 9058 0.3454351
Good 9195 0.3506598
Fair 3391 0.1293189
Poor 918 0.0350088
NA 76 0.0028983
proptable30withnarcotics[-6,] %>%
        ggplot(aes(x = genhlth, y = prop, fill = genhlth,
                   label = factor(count))) +
        geom_bar(stat="identity") +
        geom_label() +
        labs(title = "Exercised in the last 30 days \nwhile Smoking or Drinking Alcohol", x = "Quality of Health",
             y = "Percentage (%)", fill = "genhlth") +
        theme(axis.text.x = element_text(face = "bold", size = 11, angle = 25),
              axis.text.y = element_text(face = "bold", size = 11),
              plot.title = element_text(face = "bold"),
              legend.title = element_text(face ="bold")) +
        guides(fill=guide_legend(title = "Quality of Health"))

The proportion table and barplot above shows the health quality proportion of people who exercised within the last 30 days and drank alcohol or smoked. From the table, 13% reported to enjoy Excellent health, while 34% and 35% reported Very Good and Good health respectively. Approximately 12% and 3% said they have Fair and Poor health respectively.

exercised30_NO_Alcohol_and_Smoke <- exercised30 %>%
        filter(alcday5 == 0) %>%
        filter(smokday2 == "Not at all") %>%
        select(genhlth) %>%
        group_by(genhlth) %>% 
        summarise(count = n())
proptable30NONarcotics <- exercised30_NO_Alcohol_and_Smoke %>%
        mutate(prop = count/sum(count))
proptable30NONarcotics
genhlth count prop
Excellent 4708 0.1249900
Very good 10747 0.2853161
Good 12656 0.3359970
Fair 6633 0.1760958
Poor 2753 0.0730878
NA 170 0.0045132
proptable30NONarcotics[-6,] %>%
        ggplot(aes(x = genhlth, y = prop, fill = genhlth,
                   label = factor(count))) +
        geom_bar(stat="identity") +
        geom_label() +
        labs(title = "Exercised in the last 30 days \nwithout Smoking or Drinking Alcohol", x = "Quality of Health",
             y = "Percentage (%)", fill = "genhlth") +
        theme(axis.text.x = element_text(face = "bold", size = 11, angle = 25),
              axis.text.y = element_text(face = "bold", size = 11),
              plot.title = element_text(face = "bold"),
              legend.title = element_text(face ="bold")) +
        guides(fill=guide_legend(title = "Quality of Health"))

The proportion table and barplot above shows that of the people who exercised within the last 30 days without drinking alcohol or smoking, 19% reported Excellent health, while 33% and 31% reported Very Good and Good health respectively. Approximately 12% and 3% said they have Fair and Poor health respectively.

# noexercise
noexercisewithAlcohol_and_Smoke <- noexercise %>%
        filter(alcday5 > 0) %>%
        filter(smokday2 != "Not al all") %>%
        select(genhlth) %>%
        group_by(genhlth) %>% 
        summarise(count = n())
proptableNoExerciseWithNarcotics <- noexercisewithAlcohol_and_Smoke %>%
        mutate(prop = count/sum(count))
proptableNoExerciseWithNarcotics
genhlth count prop
Excellent 2555 0.0948650
Very good 7216 0.2679241
Good 9635 0.3577396
Fair 5181 0.1923662
Poor 2208 0.0819812
NA 138 0.0051238
proptableNoExerciseWithNarcotics[-6,] %>%
        ggplot(aes(x = genhlth, y = prop, fill = genhlth,
                   label = factor(count))) +
        geom_bar(stat="identity") +
        geom_label() +
        labs(title = "No Exercise in the last 30 days \nwith Smoking or Drinking Alcohol", x = "Quality of Health",
             y = "Percentage (%)", fill = "genhlth") +
        theme(axis.text.x = element_text(face = "bold", size = 11, angle = 25),
              axis.text.y = element_text(face = "bold", size = 11),
              plot.title = element_text(face = "bold"),
              legend.title = element_text(face ="bold")) +
        guides(fill=guide_legend(title = "Quality of Health"))

The proportion table and barplot above shows that of the people who did not exercise in the last 30 days but drank alcohol or smoked, 9% reported Excellent health, while 26% and 35% reported Very Good and Good health respectively. Approximately 19% and 8% said they have Fair and Poor health respectively.

noexercise_NO_Alcohol_and_Smoke <- noexercise %>%
        filter(alcday5 == 0) %>%
        filter(smokday2 == "Not at all") %>%
        select(genhlth) %>%
        group_by(genhlth) %>% 
        summarise(count = n())
proptableNoExerciseNONarcotics <- noexercise_NO_Alcohol_and_Smoke %>%
        mutate(prop = count/sum(count))
proptableNoExerciseNONarcotics
genhlth count prop
Excellent 1231 0.0563206
Very good 3665 0.1676808
Good 7005 0.3204923
Fair 5687 0.2601912
Poor 4124 0.1886810
NA 145 0.0066340
proptableNoExerciseNONarcotics[-6,] %>%
        ggplot(aes(x = genhlth, y = prop, fill = genhlth,
                   label = factor(count))) +
        geom_bar(stat="identity") +
        geom_label() +
        labs(title = "No Exercise in the last 30 days \nwithout Smoking or Drinking Alcohol", x = "Quality of Health",
             y = "Percentage (%)", fill = "genhlth") +
        theme(axis.text.x = element_text(face = "bold", size = 11, angle = 25),
              axis.text.y = element_text(face = "bold", size = 11),
              plot.title = element_text(face = "bold"),
              legend.title = element_text(face ="bold")) +
        guides(fill=guide_legend(title = "Quality of Health"))

The proportion table and barplot above shows the peoportion of people who did not exercise in the last 30 days and did not drink alcohol or smoked. The report shows that 5% has Excellent health, while 16% and 32% had Very Good and Good health respectively. Approximately 26% and 18% had Fair and Poor health respectively.

# join objects to form a dataframe
quest1 <- cbind(exercised30withAlcohol_and_Smoke,
                exercised30_NO_Alcohol_and_Smoke,
                noexercisewithAlcohol_and_Smoke,
                noexercise_NO_Alcohol_and_Smoke)

# remove redundant columns
quest1 <- quest1[,c(-3,-5,-7)]

# add proper column names
colnames(quest1) <- c("Health quality",
                      "Exercise with Alcohol or Smoking",
                      "Exercise without Alcohol or Smoking",
                      "No Exercise with Alcohol or Smoking",
                      "No Exercise without Alcohol or Smoking")

# remove NA row
quest1 <- quest1[-6,]
quest1
Health quality Exercise with Alcohol or Smoking Exercise without Alcohol or Smoking No Exercise with Alcohol or Smoking No Exercise without Alcohol or Smoking
Excellent 3584 4708 2555 1231
Very good 9058 10747 7216 3665
Good 9195 12656 9635 7005
Fair 3391 6633 5181 5687
Poor 918 2753 2208 4124
# prop.table
quest1prop <- data.frame(health = as.factor(proptable30withnarcotics$genhlth),
                exercisewith = proptable30withnarcotics$prop,
                exercisewithout = proptable30NONarcotics$prop,
                noexercisewith = proptableNoExerciseWithNarcotics$prop,
                noexercisewithout = proptableNoExerciseNONarcotics$prop)
# remove NA row
quest1prop <- quest1prop[-6,]

# line plot
lineplotdf <- melt(quest1prop, id.vars = "health")
colnames(lineplotdf) <- c("health", "variable", "value")
# add proper column names
colnames(quest1prop) <- c("Health quality",
                      "Exercise with Alcohol or Smoking",
                      "Exercise without Alcohol or Smoking",
                      "No Exercise with Alcohol or Smoking",
                      "No Exercise without Alcohol or Smoking")
quest1prop
Health quality Exercise with Alcohol or Smoking Exercise without Alcohol or Smoking No Exercise with Alcohol or Smoking No Exercise without Alcohol or Smoking
Excellent 0.1366791 0.1249900 0.0948650 0.0563206
Very good 0.3454351 0.2853161 0.2679241 0.1676808
Good 0.3506598 0.3359970 0.3577396 0.3204923
Fair 0.1293189 0.1760958 0.1923662 0.2601912
Poor 0.0350088 0.0730878 0.0819812 0.1886810
lineplotdf %>%
        ggplot(aes(x = health, y = value, group = variable)) +
        geom_line(linetype="solid", aes(color = factor(variable))) +
        geom_point() +
        labs(title = "Scatterplot to Compare Health Quality of those who Exercised\n and did not Exercise while taking alcholol or smoking\n and not taking alcohol or smoking", x = "Quality of Health",
             y = "Percentage (%)", fill = "health") +
        theme(axis.text.x = element_text(face = "bold", size = 11, angle = 25),
              axis.text.y = element_text(face = "bold", size = 11),
              plot.title = element_text(face = "bold"),
              legend.title = element_text(face ="bold")) +
        scale_colour_discrete(name="Status",
                         labels=c("Exercise with \nAlcohol or Smoking",
                      "Exercise without \nAlcohol or Smoking",
                      "No Exercise with \nAlcohol or Smoking",
                      "No Exercise without \nAlcohol or Smoking"))

The proportion table and line graph above showws the comparisons between those who:

From the above results, it seems that cutting across demographics and other issues, those who exercised in the last 30 days and drank alcohol or smoked reported a better quality of health. Strange ad dubious result!!!. While those who did not exercise and did not drink or smoke in the last 30 days had an overall lower health quality.

Research question 2:
The second research question investigates the health status of smokers across various demographics who eat fruits.
The variables that will be involved in this investigation are:

eatfruits <- brfss2013 %>%
        filter(smokday2 != "Not at all") %>%
        filter(fruit1 > 0) %>%
        select(genhlth) %>%
        group_by(genhlth) %>% 
        summarise(count = n()) %>%
        mutate(prop = count/sum(count))
colnames(eatfruits) <- c(c("Health status", "count", "proportion"))
eatfruits[-6,]
Health status count proportion
Excellent 6943 0.1036919
Very good 18192 0.2716927
Good 22911 0.3421697
Fair 12512 0.1868634
Poor 6138 0.0916694
# bar plot
eatfruits[-6,] %>%
        ggplot(aes(x = `Health status`, y = proportion, fill = `Health status`,
                   label = factor(count))) +
        geom_bar(stat="identity") +
        geom_label() +
        labs(title = "Health Status of Smokers \n who eat fruits",
             x = "Quality of Health",
             y = "Percentage (%)", fill = "Health status") +
        theme(axis.text.x = element_text(face = "bold", size = 11, angle = 25),
              axis.text.y = element_text(face = "bold", size = 11),
              plot.title = element_text(face = "bold"),
              legend.title = element_text(face ="bold")) +
        guides(fill=guide_legend(title = "Quality of Health"))

The proportion table and barplot above shows the peoportion of smokers across demographics who eat fruits. The report shows that 10% reported Excellent health, while 27% and 22% reported Very Good and Good health respectively. Approximately 12% and 9% reported Fair and Poor health status.

Research question 3:
The third research question looks into the association between good health and income level.
The variables of interest are:

health_income <- brfss2013 %>%
        select(genhlth, income2) %>%
        group_by(genhlth, income2) %>% 
        summarise(count = n()) %>%
        mutate(prop = count/sum(count))
colnames(health_income) <- c(c("Health status", "Income level",
                               "count", "proportion"))
health_income <- health_income[complete.cases(health_income),]
health_income
Health status Income level count proportion
Excellent Less than $10,000 2319 0.0271285
Excellent Less than $15,000 2128 0.0248941
Excellent Less than $20,000 3551 0.0415409
Excellent Less than $25,000 4719 0.0552046
Excellent Less than $35,000 6521 0.0762851
Excellent Less than $50,000 9972 0.1166561
Excellent Less than $75,000 12701 0.1485810
Excellent $75,000 or more 32343 0.3783604
Very good Less than $10,000 3976 0.0249943
Very good Less than $15,000 4567 0.0287095
Very good Less than $20,000 7479 0.0470153
Very good Less than $25,000 10800 0.0678921
Very good Less than $35,000 14740 0.0926601
Very good Less than $50,000 21668 0.1362116
Very good Less than $75,000 26044 0.1637205
Very good $75,000 or more 48973 0.3078591
Good Less than $10,000 7515 0.0499153
Good Less than $15,000 8440 0.0560592
Good Less than $20,000 11948 0.0793597
Good Less than $25,000 14931 0.0991731
Good Less than $35,000 17287 0.1148218
Good Less than $50,000 20468 0.1359503
Good Less than $75,000 19710 0.1309156
Good $75,000 or more 27402 0.1820066
Fair Less than $10,000 6929 0.1038426
Fair Less than $15,000 7187 0.1077091
Fair Less than $20,000 8053 0.1206876
Fair Less than $25,000 7845 0.1175704
Fair Less than $35,000 7564 0.1133591
Fair Less than $50,000 7153 0.1071996
Fair Less than $75,000 5226 0.0783203
Fair $75,000 or more 5645 0.0845997
Poor Less than $10,000 4513 0.1614611
Poor Less than $15,000 4311 0.1542342
Poor Less than $20,000 3674 0.1314443
Poor Less than $25,000 3268 0.1169189
Poor Less than $35,000 2575 0.0921255
Poor Less than $50,000 2058 0.0736289
Poor Less than $75,000 1421 0.0508390
Poor $75,000 or more 1296 0.0463669
# bar plot
health_income %>%
        ggplot(aes(x = `Health status`, y = proportion, fill = `Health status`,
                   label = factor(count))) +
        geom_bar(stat="identity") +
        facet_grid(~ `Income level`) +
        facet_wrap(~ `Income level`, ncol = 3) +
        #geom_label() +
        labs(title = "Association bettwen Health Quality \n and Income Level",
             x = "Quality of Health",
             y = "Percentage (%)") +
        theme(axis.text.x = element_text(face = "bold", size = 9, angle = 25),
              axis.text.y = element_text(face = "bold", size = 9),
              plot.title = element_text(face = "bold"),
              legend.title = element_text(face ="bold"))

The proportion table and barplot above shows that good health is mostly enjoyed among high income earners compared to low income earners.