Setup
Load packages
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.5.1
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.5.1
library(colorspace)
library(tidyr)
library(knitr)
opts_chunk$set(echo = TRUE, fig.align = "center")
Load data
load("brfss2013.RData")
————-
Part 1 Answer:
The data is described as follows (quoted directly from the CDC website): “The Behavioral Risk Factor Surveillance System (BRFSS) is the nation’s premier system of health-related telephone surveys that collect state data about U.S. residents regarding their health-related risk behaviors, chronic health conditions, and use of preventative services. Established in 1984 with 15 states, BRFSS now collects data in all 50 states as well as the District of Columbia and three U.S. territories. BRFSS completes more than 400,000 adult interviews each year, making it the largest continuously conducted health survey system in the world.”
The methodology, that is, how the samples were collected, is quoted directly from the CDC website below: “BRFSS is a cross-sectional telephone survey that state health departments conduct monthly over landline telephones and cellular telephones with a standardized questionnaire and technical and methodological assistance from CDC. In conducting the BRFSS landline telephone survey, interviewers collect data from a randomly selected adult in a household. In conducting the cellular telephone version of the BRFSS questionnaire, interviewers collect data from an adult who participates by using a cellular telephone and resides in a private residence or college housing.”
Scope of Inference (Generalizability/Causality):
Generalizability: Survey data were collected from randomly selected adults in all 50 states and U.S. territories, so the results can reasonably be generalized to the U.S. adult population as a whole.
Causality: Because this is an observational study (participants were not randomly assigned to treatment and control groups), causation cannot be inferred; only correlation can be measured.
Issues with Methodology, Bias, and Areas for Improvement: Because the survey is conducted by telephone, several types of individuals may be underrepresented:
1. Individuals without a landline or cellphone
2. Individuals who refuse to respond to or participate in phone surveys.
3. Individuals who are unavailable/unreachable by phone for the survey at the time it was being conducted.
The answers to the interview questions were not validated, so the data may be distorted by respondents:
1. Overreporting desirable behaviors and/or traits.
2. Underreporting undesirable behaviors and/or traits.
3. Exaggerating or misrepresenting certain traits, such as height, education, or income.
4. Misremembering key information (respondents were asked to recall details from 30 days ago or more, so their memory may be inaccurate).
In addition, interview practices and question sets may be inconsistent between the participating state agencies. See the CDC website for more details.
For future reference, it would be useful if the dataset included details about each interview, such as the time of day the data was collected and the duration of the interview. These additional pieces of information would provide further insight into who may or may not have taken part in the survey.
————-
————-
————-
Part 2: Research questions
Research Question 1: Is Body Mass Index (BMI) somehow related/correlated to a respondent’s opinion of their own health?
This question explores whether or not individuals with “normal” BMI may have a slightly better perception of their own health. While BMI is not a perfect indicator of health, it is still widely recognized as an initial measure of health and wellness.
Total Variables Used: 2
genhlth - General Health
X_bmi5cat - computed variable that categorizes BMI into 4 categories (underweight, normal, overweight, obese)
————-
Research Question 2: Is there a correlation between the hours of sleep an individual gets per night and their energy levels? Is there a difference between the sexes?
This is an interesting question because sleep is often touted as an important part of maintaining good general health. Research suggests that those individuals getting less than 5 hours of sleep may even be more prone to chronic or serious diseases or illnesses.
Total Variables Used: 3
sleptim1 - hours of sleep reported
qlhlth2 - days reported as having full energy in the past 30 days
sex - reported biological sex
————-
Research Question 3: Is there a correlation between overall life satisfaction and level of education? Are there any differences between the sexes?
This question will attempt to see if any correlation exists between overall life satisfaction and the level of education of an individual. Some research shows that individuals with a higher level of education are less likely to have marital problems and may have better health than those with less education. It will further explore whether there is any difference between males and females.
Total Variables Used: 3
lsatisfy - overall life satisfaction
educa - level of education
sex - individual’s biological sex
————-
————-
————-
Part 3: Exploratory Data Analysis
Research Question 1: Is Body Mass Index (BMI) somehow related/correlated to a respondent’s opinion of their own health?
load("brfss2013.RData")
dim(brfss2013)
## [1] 491775 330
q1 <- select(brfss2013,genhlth,X_bmi5cat) %>% na.omit()
dim(q1)
## [1] 463275 2
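To put the effect of `na.omit()` in perspective, the two `dim()` results above can be compared directly. The short sketch below uses only the row counts printed above; the variable names are illustrative and not part of the original analysis.

```r
# Row counts taken from the dim() output above
n_total    <- 491775  # rows in brfss2013
n_complete <- 463275  # rows in q1 after na.omit()

# Rows dropped because genhlth or X_bmi5cat was missing
n_dropped     <- n_total - n_complete
share_dropped <- n_dropped / n_total

cat("Rows dropped:", n_dropped, "\n")
cat("Share dropped (%):", round(100 * share_dropped, 1), "\n")
```

Roughly 5.8% of the rows are dropped, so the large majority of respondents contribute to the table that follows.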
With over 460,000 observations, the relationship is easier to examine in a table of column proportions, as shown below:
prop.table(table(q1$genhlth,q1$X_bmi5cat),2)
##
##             Underweight Normal weight Overweight      Obese
##   Excellent  0.19987805    0.26019496 0.17373887 0.07933813
##   Very good  0.26402439    0.35069868 0.35401238 0.26824837
##   Good       0.26146341    0.24667514 0.30698451 0.37088006
##   Fair       0.15829268    0.09751640 0.11943759 0.19913468
##   Poor       0.11634146    0.04491484 0.04582665 0.08239876
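The second argument to `prop.table()` selects which margin is normalized: `1` makes each row sum to 1, `2` makes each column sum to 1. That is why every BMI column above adds up to 1. A minimal sketch of the difference follows; the counts are a made-up toy table, not BRFSS data.

```r
# Hypothetical toy counts (NOT BRFSS values), used only to show the margins
toy_counts <- matrix(
  c(10, 30,
    40, 20),
  nrow = 2, byrow = TRUE,
  dimnames = list(health = c("Good", "Poor"), bmi = c("Normal", "Obese"))
)

by_column <- prop.table(toy_counts, 2)  # each BMI column sums to 1
by_row    <- prop.table(toy_counts, 1)  # each health row sums to 1

print(by_column)
print(by_row)
```

Column proportions suit this question because they describe how health ratings are distributed within each BMI category.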
Even this table is hard to take in at a glance. A stacked bar chart makes it easier to interpret:
g1 <- ggplot(q1) + aes(x=X_bmi5cat,fill=genhlth) + geom_bar(position = "fill")
g1

Each bar represents one of the 4 BMI categories (underweight, normal, overweight, obese) and shows the proportion of respondents in that category reporting each level of general health.
g2 <- g1 + xlab("BMI category") + ylab("Proportion") + scale_fill_discrete(name="Reported Health")
g2

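One can conclude that, to an extent, the answer to this question is “yes”. There does seem to be an association between an individual’s BMI and his or her own perception of health: respondents in the normal weight category most often rate their health as excellent (about 26%), while obese respondents do so least often (about 8%).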
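One way to make the pattern easier to read is to collapse the five health ratings into a "top two" share and a "bottom two" share per BMI category. The sketch below re-enters the column proportions printed by `prop.table()` above; the object names (`health_by_bmi`, `top_two`, `bottom_two`) are illustrative and not part of the original analysis.

```r
# Column proportions copied from the prop.table() output above
# (rows: reported health; columns: BMI category)
health_by_bmi <- matrix(
  c(0.19987805, 0.26019496, 0.17373887, 0.07933813,
    0.26402439, 0.35069868, 0.35401238, 0.26824837,
    0.26146341, 0.24667514, 0.30698451, 0.37088006,
    0.15829268, 0.09751640, 0.11943759, 0.19913468,
    0.11634146, 0.04491484, 0.04582665, 0.08239876),
  nrow = 5, byrow = TRUE,
  dimnames = list(
    genhlth = c("Excellent", "Very good", "Good", "Fair", "Poor"),
    bmi     = c("Underweight", "Normal weight", "Overweight", "Obese")
  )
)

# Share of each BMI category reporting the two best / two worst ratings
top_two    <- colSums(health_by_bmi[c("Excellent", "Very good"), ])
bottom_two <- colSums(health_by_bmi[c("Fair", "Poor"), ])

print(round(rbind(top_two, bottom_two), 3))
```

About 61.1% of normal weight respondents report excellent or very good health, against about 34.8% of obese respondents; fair or poor health is reported by about 14.2% and 28.2% of those groups respectively.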
+++++++++++++++++
Research Question 2: Is there a correlation between the hours of sleep an individual gets per night and their energy levels? Is there a difference between the sexes?
q2 <- select(brfss2013, qlhlth2 , sex , sleptim1) %>%
filter(!is.na(qlhlth2), !is.na(sex), sleptim1 <= 12 )
summary(q2)
##     qlhlth2          sex         sleptim1
##  Min.   : 0.00   Male  :162   Min.   : 2.000
##  1st Qu.: 2.00   Female:287   1st Qu.: 6.000
##  Median :15.00                Median : 7.000
##  Mean   :15.56                Mean   : 7.013
##  3rd Qu.:28.00                3rd Qu.: 8.000
##  Max.   :30.00                Max.   :12.000
ggplot(data = q2, aes(x = sleptim1, y = qlhlth2 ))+
geom_point(shape=1) + # Use hollow circles
geom_smooth(method=lm) +
scale_x_continuous(limits = c(4,10), breaks = 4:10) +
facet_grid(. ~ sex) +
xlab("sleptim1 = Hours of Sleep") +
ylab ("qlhlth2: Days of full energy in the last 30 days")
## Warning: Removed 12 rows containing non-finite values (stat_smooth).
## Warning: Removed 12 rows containing missing values (geom_point).

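There seems to be a generally positive correlation between the hours of sleep and days of full energy. The correlation seems to be slightly stronger for females than for males, as the males’ data is more widely dispersed. Note that only 449 respondents (162 male, 287 female) remain after filtering, so this pattern should be read with caution.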
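The comparison between the sexes could be made more concrete by computing a Pearson correlation within each group. The sketch below is one possible approach, not the method used in this report: `cor_by_group()` is a hypothetical helper, and `toy` is an invented data frame (not BRFSS values) that only reuses the column names of `q2` so the example runs on its own.

```r
# Hypothetical helper: Pearson correlation between columns x and y of df,
# computed separately within each level of the grouping column
cor_by_group <- function(df, x, y, group) {
  groups <- split(df, df[[group]])
  sapply(groups, function(d) cor(d[[x]], d[[y]], use = "complete.obs"))
}

# Invented toy data (NOT BRFSS values) with the same column names as q2
toy <- data.frame(
  sex      = factor(rep(c("Male", "Female"), each = 4)),
  sleptim1 = c(5, 6, 7, 8, 5, 6, 7, 8),
  qlhlth2  = c(10, 12, 14, 16, 20, 15, 25, 30)
)

print(cor_by_group(toy, "sleptim1", "qlhlth2", "sex"))
```

With the real data, the equivalent call would be `cor_by_group(q2, "sleptim1", "qlhlth2", "sex")`, assuming `q2` is built as shown above.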
++++++++++++++++++++++
Research Question 3: Is there a correlation between overall life satisfaction and level of education? Are there any differences between the sexes?
q3 <- select(brfss2013, lsatisfy , sex, educa) %>%
filter(!is.na(lsatisfy), !is.na(sex), !is.na(educa))
q3 %>% group_by(lsatisfy) %>% summarise(count=n())
## # A tibble: 4 x 2
## lsatisfy count
## <fct> <int>
## 1 Very satisfied 5378
## 2 Satisfied 5506
## 3 Dissatisfied 598
## 4 Very dissatisfied 161
q3 %>% group_by(educa) %>% summarise(count=n())
## # A tibble: 6 x 2
## educa count
## <fct> <int>
## 1 Never attended school or only kindergarten 10
## 2 Grades 1 through 8 (Elementary) 496
## 3 Grades 9 though 11 (Some high school) 1078
## 4 Grade 12 or GED (High school graduate) 3708
## 5 College 1 year to 3 years (Some college or technical school) 3055
## 6 College 4 years or more (College graduate) 3296
q3 %>% group_by(sex) %>% summarise(count=n())
## # A tibble: 2 x 2
## sex count
## <fct> <int>
## 1 Male 4078
## 2 Female 7565
ggplot(data = q3, aes(x = lsatisfy, y = educa )) +
geom_count () +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))+
facet_grid(. ~ sex) +
xlab("lsatisfy: Overall Life Satisfaction") +
ylab ("educa: Level of Education")

There seems to be a generally positive correlation between level of education and overall life satisfaction for both males and females. Satisfaction levels seem to be higher for those who have at least completed high school (or equivalent). There are some outliers in the data as well, with a few respondents answering “satisfied” or “very satisfied” while having no education. Since most of the American population has at least graduated from high school, the education distribution of the respondents seems broadly representative.
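As a quick consistency check, the three count tables printed above should all describe the same set of respondents. The sketch below re-enters those counts by hand (the vector names are illustrative) and derives the overall satisfaction shares and the share with at least a high school education.

```r
# Counts copied from the summarise() output above
satisfaction <- c("Very satisfied" = 5378, "Satisfied" = 5506,
                  "Dissatisfied" = 598, "Very dissatisfied" = 161)
education    <- c(10, 496, 1078, 3708, 3055, 3296)  # lowest to highest level
sex          <- c(Male = 4078, Female = 7565)

# All three tabulations should sum to the same number of respondents
totals <- c(sum(satisfaction), sum(education), sum(sex))
stopifnot(length(unique(totals)) == 1)

satisfaction_share <- satisfaction / sum(satisfaction)
hs_or_more_share   <- sum(education[4:6]) / sum(education)

print(round(satisfaction_share, 3))
cat("High school graduate or more:", round(hs_or_more_share, 3), "\n")
```

All three tables sum to 11,643 respondents. About 93.5% of them report being satisfied or very satisfied, and about 86.4% have at least graduated from high school, so the lower education and dissatisfied cells in the plot rest on comparatively few observations.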