Setup

Load packages

library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.5.1
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.5.1
library(colorspace)
library(tidyr)
library(knitr)
opts_chunk$set(echo = TRUE, fig.align = "center")

Load data

load("brfss2013.RData")

Part 1: Data (3 points) Describe how the observations in the sample are collected, and the implications of this data collection method on the scope of inference (generalizability/causality). Note that you will need to look into documentation on the BRFSS to answer this question. See http://www.cdc.gov/brfss/ as well as “More information on the data” section below.

————-

Part 1 Answer:

The methodology, that is, how the samples were collected, is quoted directly from the CDC website below: “BRFSS is a cross-sectional telephone survey that state health departments conduct monthly over landline telephones and cellular telephones with a standardized questionnaire and technical and methodological assistance from CDC. In conducting the BRFSS landline telephone survey, interviewers collect data from a randomly selected adult in a household. In conducting the cellular telephone version of the BRFSS questionnaire, interviewers collect data from an adult who participates by using a cellular telephone and resides in a private residence or college housing.”

Scope of Inference (Generalizability/Causality):

Generalizability: Survey data were collected from all 50 states and U.S. territories, and the landline interviews use a randomly selected adult in each household, so the results can reasonably be generalized to the U.S. adult population, subject to the sources of bias listed below.

Causality: This is an observational study: participants were not assigned to treatment and control groups, so causation can’t be assumed. Only correlation can be measured.

Issues with Methodology, Bias, and Areas for Improvement: Because this is a telephone survey, several types of individuals may be underrepresented:

1. Individuals without a landline or cellphone

2. Individuals who refuse to respond to or participate in phone surveys.

3. Individuals who are unavailable/unreachable by phone for the survey at the time it was being conducted.

The answers to the interview questions were unvalidated, so the data may be distorted by:

1. Overreporting desirable behaviors and/or traits.

2. Underreporting undesirable behaviors and/or traits.

3. Exaggerating or misrepresenting certain traits, such as height, education, or income.

4. Inaccurately recalling key information (respondents were asked to remember details from the past 30 days or more, so their recollection may be unreliable).

5. Possible inconsistencies in interview practices and question sets between the participating state agencies. See the CDC website for more details.

For future reference, it would be useful if the dataset included details about each interview, such as the time of day the data were collected and the duration of the interview. These additional pieces of information would provide further insight into who may or may not have taken part in the survey.

————-


Part 2: Research questions

Research Question 1: Is Body Mass Index (BMI) somehow related/correlated to a respondent’s opinion of their own health?

This question explores whether or not individuals with “normal” BMI may have a slightly better perception of their own health. While BMI is not a perfect indicator of health, it is still widely recognized as an initial measure of health and wellness.

Total Variables Used: 2

genhlth - General Health

X_bmi5cat - computed variable that categorizes BMI into 4 categories (underweight, normal, overweight, obese)

————-

Research Question 2: Is there a correlation between the hours of sleep an individual gets per night and their energy levels? Is there a difference between the sexes?

This is an interesting question because sleep is often touted as an important part of maintaining good general health. Research suggests that individuals getting fewer than 5 hours of sleep may even be more prone to chronic or serious illnesses.

Total Variables Used: 3

sleptim1 - hours of sleep reported

qlhlth2 - days reported as having full energy in the past 30 days

sex - reported biological sex

————-

Research Question 3: Is there a correlation between overall life satisfaction and level of education? Are there any differences between the sexes?

This question attempts to see whether any correlation exists between overall life satisfaction and an individual’s level of education. Some research shows that individuals with a higher level of education are less likely to have marital problems and may have better health than those with less education. It also explores whether there is any difference between males and females.

Total Variables Used: 3

lsatisfy - overall life satisfaction

educa - level of education

sex - individual’s biological sex

————-


Part 3: Exploratory Data Analysis

Research Question 1: Is Body Mass Index (BMI) somehow related/correlated to a respondent’s opinion of their own health?

load("brfss2013.RData")
dim(brfss2013)
## [1] 491775    330
q1 <- select(brfss2013,genhlth,X_bmi5cat) %>% na.omit()
dim(q1)
## [1] 463275      2

With over 460,000 observations, the data are easier to examine as a table of proportions within each BMI category, as shown below:

prop.table(table(q1$genhlth,q1$X_bmi5cat),2)
##            
##             Underweight Normal weight Overweight      Obese
##   Excellent  0.19987805    0.26019496 0.17373887 0.07933813
##   Very good  0.26402439    0.35069868 0.35401238 0.26824837
##   Good       0.26146341    0.24667514 0.30698451 0.37088006
##   Fair       0.15829268    0.09751640 0.11943759 0.19913468
##   Poor       0.11634146    0.04491484 0.04582665 0.08239876
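
The second argument to prop.table() selects the margin: 2 normalizes within each column, which is why every BMI column in the table above sums to 1. A minimal sketch on made-up counts may make this clearer. The `toy` data frame and its values are hypothetical and are not drawn from BRFSS:

```r
# Hypothetical toy data; values are illustrative only, not BRFSS results
toy <- data.frame(
  health = c("Good", "Good", "Poor", "Good", "Poor", "Poor"),
  bmi    = c("Normal", "Normal", "Normal", "Obese", "Obese", "Obese")
)

# Cross-tabulate: rows are health levels, columns are BMI categories
counts <- table(toy$health, toy$bmi)

# margin = 2: proportions within each column (each BMI category sums to 1)
by_bmi <- prop.table(counts, margin = 2)

# margin = 1: proportions within each row (each health level sums to 1)
by_health <- prop.table(counts, margin = 1)

print(by_bmi)
print(colSums(by_bmi))
```

Column proportions suit this question because they allow BMI categories of very different sizes to be compared directly.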

Even the table is dense and hard to read at a glance. A stacked bar chart of the same proportions is easier to interpret:

g1 <- ggplot(q1) + aes(x=X_bmi5cat,fill=genhlth) + geom_bar(position = "fill")
g1

Each bar represents one of the 4 BMI categories (underweight, normal, overweight, obese), and the fill shows the proportion of respondents in that category who reported each level of general health.

g2 <- g1 + xlab("BMI category") + ylab("Proportion") + scale_fill_discrete(name="Reported Health")
g2

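One can conclude that, to an extent, the answer to this question is “yes”. There does seem to be an association between an individual’s BMI category and his or her own perception of health: respondents in the normal weight category report “Excellent” health most often (about 26%), obese respondents least often (about 8%), and “Fair” or “Poor” responses are most common among underweight and obese respondents.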

+++++++++++++++++

Research Question 2: Is there a correlation between the hours of sleep an individual gets per night and their energy levels? Is there a difference between the sexes?

q2 <- select(brfss2013, qlhlth2 , sex , sleptim1) %>% 
  filter(!is.na(qlhlth2), !is.na(sex), sleptim1 <= 12 )

# Summarise each column of q2 (no grouping is needed for summary())
summary(q2)
##     qlhlth2          sex         sleptim1     
##  Min.   : 0.00   Male  :162   Min.   : 2.000  
##  1st Qu.: 2.00   Female:287   1st Qu.: 6.000  
##  Median :15.00                Median : 7.000  
##  Mean   :15.56                Mean   : 7.013  
##  3rd Qu.:28.00                3rd Qu.: 8.000  
##  Max.   :30.00                Max.   :12.000
ggplot(data = q2, aes(x = sleptim1, y = qlhlth2)) +
  geom_point(shape = 1) +        # use hollow circles
  geom_smooth(method = "lm") +   # linear fit within each facet
  scale_x_continuous(limits = c(4, 10), breaks = 4:10) +
  facet_grid(. ~ sex) +
  xlab("sleptim1: Hours of Sleep") +
  ylab("qlhlth2: Days of full energy in the last 30 days")
## Warning: Removed 12 rows containing non-finite values (stat_smooth).
## Warning: Removed 12 rows containing missing values (geom_point).

There seems to be a generally positive correlation between hours of sleep and days of full energy, although only 449 respondents (162 males, 287 females) remain after filtering. The correlation seems to be slightly stronger for females than for males, as the males’ data are more widely dispersed.
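
One way to put a number on "slightly stronger" would be to compute the Pearson correlation separately for each sex. The sketch below uses a small made-up data frame; `toy` and its values are hypothetical, and with the real data `q2` would take its place:

```r
# Hypothetical toy data standing in for q2; values are illustrative only
toy <- data.frame(
  sex      = factor(c("Male", "Male", "Male", "Male",
                      "Female", "Female", "Female", "Female")),
  sleptim1 = c(5, 6, 7, 8, 5, 6, 7, 8),
  qlhlth2  = c(10, 20, 12, 22, 8, 14, 20, 26)
)

# Split the rows by sex, then compute Pearson's r between
# hours of sleep and days of full energy within each group
cor_by_sex <- sapply(split(toy, toy$sex), function(d) {
  cor(d$sleptim1, d$qlhlth2)
})

print(round(cor_by_sex, 2))
```

A larger coefficient for one group would support the visual impression from the faceted plot, though with a few hundred respondents the difference may not be meaningful.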

++++++++++++++++++++++

Research Question 3: Is there a correlation between overall life satisfaction and level of education? Are there any differences between the sexes?

q3 <- select(brfss2013, lsatisfy , sex, educa) %>% 
  filter(!is.na(lsatisfy), !is.na(sex), !is.na(educa))

q3 %>% group_by(lsatisfy) %>%    summarise(count=n())
## # A tibble: 4 x 2
##   lsatisfy          count
##   <fct>             <int>
## 1 Very satisfied     5378
## 2 Satisfied          5506
## 3 Dissatisfied        598
## 4 Very dissatisfied   161
q3 %>% group_by(educa) %>%   summarise(count=n())
## # A tibble: 6 x 2
##   educa                                                        count
##   <fct>                                                        <int>
## 1 Never attended school or only kindergarten                      10
## 2 Grades 1 through 8 (Elementary)                                496
## 3 Grades 9 though 11 (Some high school)                         1078
## 4 Grade 12 or GED (High school graduate)                        3708
## 5 College 1 year to 3 years (Some college or technical school)  3055
## 6 College 4 years or more (College graduate)                    3296
q3 %>% group_by(sex) %>%   summarise(count=n())
## # A tibble: 2 x 2
##   sex    count
##   <fct>  <int>
## 1 Male    4078
## 2 Female  7565
ggplot(data = q3, aes(x = lsatisfy, y = educa)) +
  geom_count() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
  facet_grid(. ~ sex) +
  xlab("lsatisfy: Overall Life Satisfaction") +
  ylab("educa: Level of Education")

There seems to be a generally positive association between level of education and overall life satisfaction for both males and females. Satisfaction levels seem to be higher for those who have at least completed high school (or equivalent). There are exceptions as well, with some respondents reporting “satisfied” or “very satisfied” while having no education. Note that only 11,643 respondents answered the life satisfaction question. Since most of the American population has at least graduated from high school, the education distribution in the sample seems plausible.
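
Because the education groups differ greatly in size (10 respondents in the smallest group versus 3,708 in the largest), raw counts make the groups hard to compare. A possible follow-up is to look at the share of each satisfaction level within each education level. The sketch below uses made-up data; `toy`, its shortened labels, and its values are hypothetical:

```r
# Hypothetical toy data standing in for q3; values are illustrative only
toy <- data.frame(
  educa    = c("HS", "HS", "HS", "HS", "College", "College"),
  lsatisfy = c("Satisfied", "Satisfied", "Satisfied", "Dissatisfied",
               "Satisfied", "Dissatisfied")
)

# Rows are education levels; margin = 1 makes each row sum to 1,
# so groups of different sizes can be compared directly
share <- prop.table(table(toy$educa, toy$lsatisfy), margin = 1)

print(round(share, 2))
```

With the real data, very small groups such as "Never attended school or only kindergarten" would still produce unstable proportions, so they should be read with caution.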