install.packages(
'printr',
type = 'source',
repos = c('http://yihui.name/xran', 'http://cran.rstudio.com')
)
library(ggplot2)
library(dplyr)
library(printr)
library(reshape2)
load("brfss2013.RData")
The sampling methodology used in the BRFSS study is a survey through telephone interviews. This is done across the 50 states of the United States of America including, the District of Columbia, Puerto Rico, Guam, and the US Virgin Islands.
“The BRFSS objective is to collect uniform, state-specific data on preventive health practices and risk behaviors that are linked to chronic diseases, injuries, and preventable infectious diseases that affect the adult population”.
BRFSS conducts both landline telephone and cellular telephone-based surveys. In conducting the BRFSS landline telephone survey, interviewers collect data from a randomly selected adult in a household. In conducting the cellular telephone version of the BRFSS questionnaire, interviewers collect data from an adult who participates by using a cellular telephone and resides in a private residence or college housing" - From Coursera Project Information on BRFSS, 2016.
The method of data collection used in observational in nature (i.e. random sampling). No random assignment is done to control for extraneous variables. As such we can only look for associations and correlations within variables of interest vis-a-vis our research questions. No decision on causality can be made from this study.
Research question 1:
What is the percentage of good health for people who exercised in the past 30 days exercise and have also drank alcohol and/or smoked compared to those that have not.
==> My belief on this issue, which forms my status quo opinion is that people who exercise and abstain from alcohol and smoking should report a better quality of health. This question will investigate that claim.
Research question 2:
What is the health status of smokers across various demographics who eat fruits.
==> How much does eating fruits impact the negative effects of smoking. This question will provide some insignts into the general perception of wellbeing of smokers who eat fruits.
Research question 3:
Is there any association between good health and income level
==> Do high level income earners generally enjoy better health quality? Are low level income earners more prone to general poor wellbeing. This question will investigate this issue.
Research question 1:
For the first question, our interest is to examine the percentage of good health for people who exercised in the past 30 days, but also drank alcohol and/or smoked within that same time frame compared to those that exercised, but have neither drank nor smoked. To prepare for this research question, we will consider the variables:
genhlth: General Healthexerany2: Exercise In Past 30 Daysalcday5: Days In Past 30 Had Alcoholic Beveragesmokday2: Frequency Of Days Now Smoking# Variable type
brfss2013 %>%
select(genhlth, exerany2, alcday5, smoke100) %>%
str()
## 'data.frame': 491775 obs. of 4 variables:
## $ genhlth : Factor w/ 5 levels "Excellent","Very good",..: 4 3 3 2 3 2 4 3 1 3 ...
## $ exerany2: Factor w/ 2 levels "Yes","No": 2 1 2 1 2 1 1 1 1 1 ...
## $ alcday5 : int 201 0 220 208 210 0 201 202 101 0 ...
## $ smoke100: Factor w/ 2 levels "Yes","No": 1 2 1 2 1 2 1 1 2 2 ...
genhlth is a factor variable with 5 levels namely: Excellent, Very good, Good, Fair and Poor.exerany2 is a factor variable with 2 levels, Yes and Noalcday5 is an integer variable. 0 indicates no alcohol consumed in the past 30 dayssmokday2 is a factor variable with 3 levels “Not at all”, “Some days” and “Every day”# Get People who have exercised in the past 30 days and those who have not
exercised30 <- brfss2013 %>%
filter(exerany2 == "Yes")
noexercise <- brfss2013 %>%
filter(exerany2 == "No")
# From people exercise30 and noexercise,
# get those who have had an alcoholic beverage or smoked
#exercise30
exercised30withAlcohol_and_Smoke <- exercised30 %>%
filter(alcday5 > 0) %>%
filter(smokday2 != "Not at all") %>%
select(genhlth) %>%
group_by(genhlth) %>%
summarise(count = n())
proptable30withnarcotics <- exercised30withAlcohol_and_Smoke %>%
mutate(prop = count/sum(count))
proptable30withnarcotics
| genhlth | count | prop |
|---|---|---|
| Excellent | 3584 | 0.1366791 |
| Very good | 9058 | 0.3454351 |
| Good | 9195 | 0.3506598 |
| Fair | 3391 | 0.1293189 |
| Poor | 918 | 0.0350088 |
| NA | 76 | 0.0028983 |
proptable30withnarcotics[-6,] %>%
ggplot(aes(x = genhlth, y = prop, fill = genhlth,
label = factor(count))) +
geom_bar(stat="identity") +
geom_label() +
labs(title = "Exercised in the last 30 days \nwhile Smoking or Drinking Alcohol", x = "Quality of Health",
y = "Percentage (%)", fill = "genhlth") +
theme(axis.text.x = element_text(face = "bold", size = 11, angle = 25),
axis.text.y = element_text(face = "bold", size = 11),
plot.title = element_text(face = "bold"),
legend.title = element_text(face ="bold")) +
guides(fill=guide_legend(title = "Quality of Health"))
The proportion table and barplot above shows the health quality proportion of people who exercised within the last 30 days and drank alcohol or smoked. From the table, 13% reported to enjoy Excellent health, while 34% and 35% reported Very Good and Good health respectively. Approximately 12% and 3% said they have Fair and Poor health respectively.
exercised30_NO_Alcohol_and_Smoke <- exercised30 %>%
filter(alcday5 == 0) %>%
filter(smokday2 == "Not at all") %>%
select(genhlth) %>%
group_by(genhlth) %>%
summarise(count = n())
proptable30NONarcotics <- exercised30_NO_Alcohol_and_Smoke %>%
mutate(prop = count/sum(count))
proptable30NONarcotics
| genhlth | count | prop |
|---|---|---|
| Excellent | 4708 | 0.1249900 |
| Very good | 10747 | 0.2853161 |
| Good | 12656 | 0.3359970 |
| Fair | 6633 | 0.1760958 |
| Poor | 2753 | 0.0730878 |
| NA | 170 | 0.0045132 |
proptable30NONarcotics[-6,] %>%
ggplot(aes(x = genhlth, y = prop, fill = genhlth,
label = factor(count))) +
geom_bar(stat="identity") +
geom_label() +
labs(title = "Exercised in the last 30 days \nwithout Smoking or Drinking Alcohol", x = "Quality of Health",
y = "Percentage (%)", fill = "genhlth") +
theme(axis.text.x = element_text(face = "bold", size = 11, angle = 25),
axis.text.y = element_text(face = "bold", size = 11),
plot.title = element_text(face = "bold"),
legend.title = element_text(face ="bold")) +
guides(fill=guide_legend(title = "Quality of Health"))
The proportion table and barplot above shows that of the people who exercised within the last 30 days without drinking alcohol or smoking, 19% reported Excellent health, while 33% and 31% reported Very Good and Good health respectively. Approximately 12% and 3% said they have Fair and Poor health respectively.
# noexercise
noexercisewithAlcohol_and_Smoke <- noexercise %>%
filter(alcday5 > 0) %>%
filter(smokday2 != "Not al all") %>%
select(genhlth) %>%
group_by(genhlth) %>%
summarise(count = n())
proptableNoExerciseWithNarcotics <- noexercisewithAlcohol_and_Smoke %>%
mutate(prop = count/sum(count))
proptableNoExerciseWithNarcotics
| genhlth | count | prop |
|---|---|---|
| Excellent | 2555 | 0.0948650 |
| Very good | 7216 | 0.2679241 |
| Good | 9635 | 0.3577396 |
| Fair | 5181 | 0.1923662 |
| Poor | 2208 | 0.0819812 |
| NA | 138 | 0.0051238 |
proptableNoExerciseWithNarcotics[-6,] %>%
ggplot(aes(x = genhlth, y = prop, fill = genhlth,
label = factor(count))) +
geom_bar(stat="identity") +
geom_label() +
labs(title = "No Exercise in the last 30 days \nwith Smoking or Drinking Alcohol", x = "Quality of Health",
y = "Percentage (%)", fill = "genhlth") +
theme(axis.text.x = element_text(face = "bold", size = 11, angle = 25),
axis.text.y = element_text(face = "bold", size = 11),
plot.title = element_text(face = "bold"),
legend.title = element_text(face ="bold")) +
guides(fill=guide_legend(title = "Quality of Health"))
The proportion table and barplot above shows that of the people who did not exercise in the last 30 days but drank alcohol or smoked, 9% reported Excellent health, while 26% and 35% reported Very Good and Good health respectively. Approximately 19% and 8% said they have Fair and Poor health respectively.
noexercise_NO_Alcohol_and_Smoke <- noexercise %>%
filter(alcday5 == 0) %>%
filter(smokday2 == "Not at all") %>%
select(genhlth) %>%
group_by(genhlth) %>%
summarise(count = n())
proptableNoExerciseNONarcotics <- noexercise_NO_Alcohol_and_Smoke %>%
mutate(prop = count/sum(count))
proptableNoExerciseNONarcotics
| genhlth | count | prop |
|---|---|---|
| Excellent | 1231 | 0.0563206 |
| Very good | 3665 | 0.1676808 |
| Good | 7005 | 0.3204923 |
| Fair | 5687 | 0.2601912 |
| Poor | 4124 | 0.1886810 |
| NA | 145 | 0.0066340 |
proptableNoExerciseNONarcotics[-6,] %>%
ggplot(aes(x = genhlth, y = prop, fill = genhlth,
label = factor(count))) +
geom_bar(stat="identity") +
geom_label() +
labs(title = "No Exercise in the last 30 days \nwithout Smoking or Drinking Alcohol", x = "Quality of Health",
y = "Percentage (%)", fill = "genhlth") +
theme(axis.text.x = element_text(face = "bold", size = 11, angle = 25),
axis.text.y = element_text(face = "bold", size = 11),
plot.title = element_text(face = "bold"),
legend.title = element_text(face ="bold")) +
guides(fill=guide_legend(title = "Quality of Health"))
The proportion table and barplot above shows the peoportion of people who did not exercise in the last 30 days and did not drink alcohol or smoked. The report shows that 5% has Excellent health, while 16% and 32% had Very Good and Good health respectively. Approximately 26% and 18% had Fair and Poor health respectively.
# join objects to form a dataframe
quest1 <- cbind(exercised30withAlcohol_and_Smoke,
exercised30_NO_Alcohol_and_Smoke,
noexercisewithAlcohol_and_Smoke,
noexercise_NO_Alcohol_and_Smoke)
# remove redundant columns
quest1 <- quest1[,c(-3,-5,-7)]
# add proper column names
colnames(quest1) <- c("Health quality",
"Exercise with Alcohol or Smoking",
"Exercise without Alcohol or Smoking",
"No Exercise with Alcohol or Smoking",
"No Exercise without Alcohol or Smoking")
# remove NA row
quest1 <- quest1[-6,]
quest1
| Health quality | Exercise with Alcohol or Smoking | Exercise without Alcohol or Smoking | No Exercise with Alcohol or Smoking | No Exercise without Alcohol or Smoking |
|---|---|---|---|---|
| Excellent | 3584 | 4708 | 2555 | 1231 |
| Very good | 9058 | 10747 | 7216 | 3665 |
| Good | 9195 | 12656 | 9635 | 7005 |
| Fair | 3391 | 6633 | 5181 | 5687 |
| Poor | 918 | 2753 | 2208 | 4124 |
# prop.table
quest1prop <- data.frame(health = as.factor(proptable30withnarcotics$genhlth),
exercisewith = proptable30withnarcotics$prop,
exercisewithout = proptable30NONarcotics$prop,
noexercisewith = proptableNoExerciseWithNarcotics$prop,
noexercisewithout = proptableNoExerciseNONarcotics$prop)
# remove NA row
quest1prop <- quest1prop[-6,]
# line plot
lineplotdf <- melt(quest1prop, id.vars = "health")
colnames(lineplotdf) <- c("health", "variable", "value")
# add proper column names
colnames(quest1prop) <- c("Health quality",
"Exercise with Alcohol or Smoking",
"Exercise without Alcohol or Smoking",
"No Exercise with Alcohol or Smoking",
"No Exercise without Alcohol or Smoking")
quest1prop
| Health quality | Exercise with Alcohol or Smoking | Exercise without Alcohol or Smoking | No Exercise with Alcohol or Smoking | No Exercise without Alcohol or Smoking |
|---|---|---|---|---|
| Excellent | 0.1366791 | 0.1249900 | 0.0948650 | 0.0563206 |
| Very good | 0.3454351 | 0.2853161 | 0.2679241 | 0.1676808 |
| Good | 0.3506598 | 0.3359970 | 0.3577396 | 0.3204923 |
| Fair | 0.1293189 | 0.1760958 | 0.1923662 | 0.2601912 |
| Poor | 0.0350088 | 0.0730878 | 0.0819812 | 0.1886810 |
lineplotdf %>%
ggplot(aes(x = health, y = value, group = variable)) +
geom_line(linetype="solid", aes(color = factor(variable))) +
geom_point() +
labs(title = "Scatterplot to Compare Health Quality of those who Exercised\n and did not Exercise while taking alcholol or smoking\n and not taking alcohol or smoking", x = "Quality of Health",
y = "Percentage (%)", fill = "health") +
theme(axis.text.x = element_text(face = "bold", size = 11, angle = 25),
axis.text.y = element_text(face = "bold", size = 11),
plot.title = element_text(face = "bold"),
legend.title = element_text(face ="bold")) +
scale_colour_discrete(name="Status",
labels=c("Exercise with \nAlcohol or Smoking",
"Exercise without \nAlcohol or Smoking",
"No Exercise with \nAlcohol or Smoking",
"No Exercise without \nAlcohol or Smoking"))
The proportion table and line graph above showws the comparisons between those who:
From the above results, it seems that cutting across demographics and other issues, those who exercised in the last 30 days and drank alcohol or smoked reported a better quality of health. Strange ad dubious result!!!. While those who did not exercise and did not drink or smoke in the last 30 days had an overall lower health quality.
Research question 2:
The second research question investigates the health status of smokers across various demographics who eat fruits.
The variables that will be involved in this investigation are:
genhlth: General Health,smokday2: Frequency Of Days Now Smoking,fruit1: How Many Times Did You Eat Fruit?eatfruits <- brfss2013 %>%
filter(smokday2 != "Not at all") %>%
filter(fruit1 > 0) %>%
select(genhlth) %>%
group_by(genhlth) %>%
summarise(count = n()) %>%
mutate(prop = count/sum(count))
colnames(eatfruits) <- c(c("Health status", "count", "proportion"))
eatfruits[-6,]
| Health status | count | proportion |
|---|---|---|
| Excellent | 6943 | 0.1036919 |
| Very good | 18192 | 0.2716927 |
| Good | 22911 | 0.3421697 |
| Fair | 12512 | 0.1868634 |
| Poor | 6138 | 0.0916694 |
# bar plot
eatfruits[-6,] %>%
ggplot(aes(x = `Health status`, y = proportion, fill = `Health status`,
label = factor(count))) +
geom_bar(stat="identity") +
geom_label() +
labs(title = "Health Status of Smokers \n who eat fruits",
x = "Quality of Health",
y = "Percentage (%)", fill = "Health status") +
theme(axis.text.x = element_text(face = "bold", size = 11, angle = 25),
axis.text.y = element_text(face = "bold", size = 11),
plot.title = element_text(face = "bold"),
legend.title = element_text(face ="bold")) +
guides(fill=guide_legend(title = "Quality of Health"))
The proportion table and barplot above shows the peoportion of smokers across demographics who eat fruits. The report shows that 10% reported Excellent health, while 27% and 22% reported Very Good and Good health respectively. Approximately 12% and 9% reported Fair and Poor health status.
Research question 3:
The third research question looks into the association between good health and income level.
The variables of interest are:
genhlth: General Health,income2: Income Levelhealth_income <- brfss2013 %>%
select(genhlth, income2) %>%
group_by(genhlth, income2) %>%
summarise(count = n()) %>%
mutate(prop = count/sum(count))
colnames(health_income) <- c(c("Health status", "Income level",
"count", "proportion"))
health_income <- health_income[complete.cases(health_income),]
health_income
| Health status | Income level | count | proportion |
|---|---|---|---|
| Excellent | Less than $10,000 | 2319 | 0.0271285 |
| Excellent | Less than $15,000 | 2128 | 0.0248941 |
| Excellent | Less than $20,000 | 3551 | 0.0415409 |
| Excellent | Less than $25,000 | 4719 | 0.0552046 |
| Excellent | Less than $35,000 | 6521 | 0.0762851 |
| Excellent | Less than $50,000 | 9972 | 0.1166561 |
| Excellent | Less than $75,000 | 12701 | 0.1485810 |
| Excellent | $75,000 or more | 32343 | 0.3783604 |
| Very good | Less than $10,000 | 3976 | 0.0249943 |
| Very good | Less than $15,000 | 4567 | 0.0287095 |
| Very good | Less than $20,000 | 7479 | 0.0470153 |
| Very good | Less than $25,000 | 10800 | 0.0678921 |
| Very good | Less than $35,000 | 14740 | 0.0926601 |
| Very good | Less than $50,000 | 21668 | 0.1362116 |
| Very good | Less than $75,000 | 26044 | 0.1637205 |
| Very good | $75,000 or more | 48973 | 0.3078591 |
| Good | Less than $10,000 | 7515 | 0.0499153 |
| Good | Less than $15,000 | 8440 | 0.0560592 |
| Good | Less than $20,000 | 11948 | 0.0793597 |
| Good | Less than $25,000 | 14931 | 0.0991731 |
| Good | Less than $35,000 | 17287 | 0.1148218 |
| Good | Less than $50,000 | 20468 | 0.1359503 |
| Good | Less than $75,000 | 19710 | 0.1309156 |
| Good | $75,000 or more | 27402 | 0.1820066 |
| Fair | Less than $10,000 | 6929 | 0.1038426 |
| Fair | Less than $15,000 | 7187 | 0.1077091 |
| Fair | Less than $20,000 | 8053 | 0.1206876 |
| Fair | Less than $25,000 | 7845 | 0.1175704 |
| Fair | Less than $35,000 | 7564 | 0.1133591 |
| Fair | Less than $50,000 | 7153 | 0.1071996 |
| Fair | Less than $75,000 | 5226 | 0.0783203 |
| Fair | $75,000 or more | 5645 | 0.0845997 |
| Poor | Less than $10,000 | 4513 | 0.1614611 |
| Poor | Less than $15,000 | 4311 | 0.1542342 |
| Poor | Less than $20,000 | 3674 | 0.1314443 |
| Poor | Less than $25,000 | 3268 | 0.1169189 |
| Poor | Less than $35,000 | 2575 | 0.0921255 |
| Poor | Less than $50,000 | 2058 | 0.0736289 |
| Poor | Less than $75,000 | 1421 | 0.0508390 |
| Poor | $75,000 or more | 1296 | 0.0463669 |
# bar plot
health_income %>%
ggplot(aes(x = `Health status`, y = proportion, fill = `Health status`,
label = factor(count))) +
geom_bar(stat="identity") +
facet_grid(~ `Income level`) +
facet_wrap(~ `Income level`, ncol = 3) +
#geom_label() +
labs(title = "Association bettwen Health Quality \n and Income Level",
x = "Quality of Health",
y = "Percentage (%)") +
theme(axis.text.x = element_text(face = "bold", size = 9, angle = 25),
axis.text.y = element_text(face = "bold", size = 9),
plot.title = element_text(face = "bold"),
legend.title = element_text(face ="bold"))
The proportion table and barplot above shows that good health is mostly enjoyed among high income earners compared to low income earners.