Setup

Load packages

library(ggplot2)
library(dplyr)

Load data

load("brfss2013.RData")

Part 1: Data

John Eugene Driscoll | Week 5 - Data Analysis Project | June 2017 Submission

Introduction

The purpose of this document is to complete the Data Analysis Project required in week 5 of the Introduction to Probability and Data course offered by Duke University on Coursera.

The Data:

The background context regarding the assignment can be found at: https://www.coursera.org/learn/probability-intro/supplement/1E7zQ/project-information.

The analysis uses data from the Behavioral Risk Factor Surveillance System (BRFSS) survey conducted in 2013.

Generalizability:

Per additional details in the context material: Since 2011, BRFSS has conducted both landline telephone- and cellular telephone-based surveys. In conducting the BRFSS landline telephone survey, interviewers collect data from a randomly selected adult in a household. In conducting the cellular telephone version of the BRFSS questionnaire, interviewers collect data from an adult who participates by using a cellular telephone and resides in a private residence or college housing.

In summary, because both data collection methods draw large random samples, the results from the sample are generalizable to the adult population of the participating states.

Highlights of the data collection methodology

(Source: https://www.cdc.gov/brfss/data_documentation/pdf/userguidejune2013.pdf)

  • Disproportionate stratified sampling (DSS) has been used for the landline sample since 2003.

  • The cellular telephone sample is randomly generated from a sampling frame of confirmed cellular area code and prefix combinations.

  • The BRFSS goal is to support at least 4,000 interviews per state each year.

  • Data weighting is an important statistical process that attempts to remove bias in the sample. The BRFSS uses a weighting process that includes two steps: design weighting and iterative proportional fitting (also known as “raking”).
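
As a rough illustration of the raking step named in the last bullet, the sketch below rescales weights one variable at a time until the weighted sample shares match a set of target population shares. The function name rake_weights, the toy respondents, and the 50/50 target shares are all assumptions made for this example. None of them come from BRFSS, whose actual procedure is more involved.

```r
# Minimal raking (iterative proportional fitting) sketch in base R.
# Every value here is made up for illustration; nothing is a BRFSS figure.

# Hypothetical helper: adjust weights until weighted shares match the targets.
# `margins` is a named list; each element is a named vector of target shares.
rake_weights <- function(data, margins, weight = rep(1, nrow(data)),
                         max_iter = 100, tol = 1e-8) {
  for (iter in seq_len(max_iter)) {
    max_change <- 0
    for (v in names(margins)) {
      target  <- margins[[v]]
      groups  <- as.character(data[[v]])
      # Current weighted share of each category of variable v
      current <- vapply(split(weight, groups), sum, numeric(1)) / sum(weight)
      # Per-row adjustment: target share divided by current share
      adjust  <- as.numeric(target[groups] / current[groups])
      max_change <- max(max_change, abs(adjust - 1))
      weight  <- weight * adjust
    }
    # Stop once a full pass over all variables barely changes the weights
    if (max_change < tol) break
  }
  weight
}

# Toy sample that over-represents women and older respondents
sample_df <- data.frame(
  sex = c("Male", "Male", "Female", "Female", "Female", "Female"),
  age = c("18-44", "45+", "18-44", "45+", "45+", "45+")
)

# Assumed population shares (illustrative only)
margins <- list(
  sex = c(Male = 0.5, Female = 0.5),
  age = c("18-44" = 0.5, "45+" = 0.5)
)

w <- rake_weights(sample_df, margins)

# Weighted shares after raking; both should be close to the targets
tapply(w, sample_df$sex, sum) / sum(w)
tapply(w, sample_df$age, sum) / sum(w)
```

Each pass fixes one margin exactly and slightly disturbs the others, which is why the adjustment is repeated until the weights stop changing.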

Causality:

Per additional details in the context material, BRFSS is an ongoing surveillance system. Per the CDC website (https://www.cdc.gov/obesity/data/surveillance.html), a surveillance system is a series of surveys conducted again and again to monitor long-term trends in public health. It is used to examine public health issues across several years, to track the trends, compare health among groups of people, and determine whether something is improving or worsening for a specific group of people.

In summary, a surveillance system is an observational study. Therefore, it won’t be possible to make causal inferences from the data.


Part 2: Research questions

Research question 1:

Is there any correlation between the average number of hours a respondent sleeps and the number of days in the past 30 days on which the respondent reported feeling full of energy? Further, are there any noticeable differences in this correlation between genders?

Interest factor: As someone who often sacrifices sleep in order to meet the time requirements of being a top professional performer, I am interested to see how survey participants' reported hours of sleep correlate with overall energy levels. Sometimes mine feel low! (Even though the exciting writing in this project may lead you to assume otherwise!)

Id variables

sleptim1: How Much Time Do You Sleep

sex: Respondents Sex

qlhlth2: How Many Days Full Of Energy In Past 30 Days

Research question 2:

Is there any correlation between the level of education obtained and overall life satisfaction? Further, are there any noticeable differences in this correlation between genders?

Interest factor: Are there any noticeable trends in overall life satisfaction for those who push hard to achieve high levels of formal education? As the first in my family to graduate from university and a strong proponent of self-advancement through education, I am interested to see whether there are any noticeable trends between completing more education and reported satisfaction.

Id variables

lsatisfy: Satisfaction With Life

educa: Education Level

sex: Respondents Sex

Research question 3:

Is there any correlation between reported level of income earned and general health? Further, are there any noticeable differences in this correlation between genders?

Interest factor: Are there any noticeable trends in general health for those who hold high-paying jobs? One would think jobs of this nature would come with benefits that promote general health, such as medical insurance and cash flow for healthy habits. Conversely, positions that come with higher income could also come with higher levels of stress and more hours on the job. This analysis should help identify some high-level trends that could drive future analysis on the topic.

Id variables

genhlth: General Health

sex: Respondents Sex

income2: Income Level


Part 3: Exploratory data analysis

Research question 1:

# Query the relevant variables
# sleptim1: How Much Time Do You Sleep
# sex: Respondents Sex
# qlhlth2: How Many Days Full Of Energy In Past 30 Days

q1 <- select(brfss2013, qlhlth2 , sex , sleptim1) %>% 
  filter(!is.na(qlhlth2), !is.na(sex), sleptim1 <= 12 )

# Present summary statistics for the continuous variables and counts for the categorical variable.
# summary() ignores grouping, so it is called on the data frame directly.

summary(q1)
##     qlhlth2          sex         sleptim1     
##  Min.   : 0.00   Male  :162   Min.   : 2.000  
##  1st Qu.: 2.00   Female:287   1st Qu.: 6.000  
##  Median :15.00                Median : 7.000  
##  Mean   :15.56                Mean   : 7.013  
##  3rd Qu.:28.00                3rd Qu.: 8.000  
##  Max.   :30.00                Max.   :12.000
# Plot relevant variables
# Use a scatter plot to observe the relationship between two continuous variables. Add a linear trend line with a confidence interval to show the relationship along with the spread of the results. Facet by gender to observe differences between the sexes.

ggplot(data = q1, aes(x = sleptim1, y = qlhlth2)) +
  geom_point(shape = 1) +        # Use hollow circles
  geom_smooth(method = "lm") +   # Linear trend line with confidence band
  # Limits drop rows outside 4-10 hours of sleep (hence the warnings below)
  scale_x_continuous(limits = c(4, 10), breaks = 4:10) +
  facet_grid(. ~ sex) +
  xlab("sleptim1: How Much Time Do You Sleep") +
  ylab("qlhlth2: How Many Days Full Of Energy In Past 30 Days")
## Warning: Removed 12 rows containing non-finite values (stat_smooth).
## Warning: Removed 12 rows containing missing values (geom_point).

# Respondents who sleep between 6 and 8 hours make up the majority of the sample. There appears to be a positive linear relationship between hours of sleep and the number of energetic days reported, with wide dispersion in energetic days at each recorded level of sleep (in hours). The dispersion is greater among male respondents.
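
One way to put a number on the relationship described above is a correlation coefficient computed separately for each sex. The sketch below is only an illustration: the helper name cor_by_sex and the toy rows are assumptions, and the choice of Pearson correlation is mine rather than something the original analysis used. The helper expects the same column names as q1.

```r
library(dplyr)

# Hypothetical helper: Pearson correlation between hours of sleep and
# energetic days within each sex. Expects columns sex, sleptim1, qlhlth2.
cor_by_sex <- function(df) {
  df %>%
    group_by(sex) %>%
    summarise(
      n = n(),                                           # respondents per group
      r = cor(sleptim1, qlhlth2, use = "complete.obs")   # Pearson by default
    )
}

# Made-up rows for illustration only; these are not BRFSS records.
toy <- data.frame(
  sex      = c("Male", "Male", "Male", "Female", "Female", "Female"),
  sleptim1 = c(5, 7, 8, 6, 7, 9),
  qlhlth2  = c(10, 22, 25, 12, 15, 30)
)

cor_by_sex(toy)
```

With the real data, the call would be cor_by_sex(q1). Because qlhlth2 is a bounded count of days, passing method = "spearman" to cor() may be a reasonable alternative.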

Research question 2:

# Query the relevant variables
# lsatisfy: Satisfaction With Life (categorical)
# educa: Education Level
# sex: Respondents Sex

q2 <- select(brfss2013, lsatisfy , sex, educa) %>% 
  filter(!is.na(lsatisfy), !is.na(sex), !is.na(educa))

# Present totals of variables being analyzed

q2 %>% group_by(lsatisfy) %>%    summarise(count=n())
## # A tibble: 4 x 2
##            lsatisfy count
##              <fctr> <int>
## 1    Very satisfied  5378
## 2         Satisfied  5506
## 3      Dissatisfied   598
## 4 Very dissatisfied   161
q2 %>% group_by(educa) %>%   summarise(count=n())
## # A tibble: 6 x 2
##                                                          educa count
##                                                         <fctr> <int>
## 1                   Never attended school or only kindergarten    10
## 2                              Grades 1 through 8 (Elementary)   496
## 3                        Grades 9 though 11 (Some high school)  1078
## 4                       Grade 12 or GED (High school graduate)  3708
## 5 College 1 year to 3 years (Some college or technical school)  3055
## 6                   College 4 years or more (College graduate)  3296
q2 %>% group_by(sex) %>%   summarise(count=n())
## # A tibble: 2 x 2
##      sex count
##   <fctr> <int>
## 1   Male  4078
## 2 Female  7565
# Plot relevant variables
# Use a count plot to observe the relationship between two categorical variables. Facet by gender to observe differences between the sexes.


ggplot(data = q2, aes(x = lsatisfy, y = educa)) +
  geom_count() +   # Point size reflects the number of respondents in each cell
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
  facet_grid(. ~ sex) +
  xlab("lsatisfy: Satisfaction With Life") +
  ylab("educa: Education Level")

# Both genders behave similarly: the largest counts of satisfied responses (Satisfied, Very satisfied) come from respondents who have completed at least high school or the equivalent. Further, most respondents in this survey have completed at least high school, which may indicate some systematic bias in the survey (those with phones may be more likely to have completed high school and may therefore be more satisfied with life overall).

# Finally, it is interesting to note the makeup of the outlier cases. The handful of respondents who reported never attending school report being either satisfied or very satisfied with life. This result, along with the reports of dissatisfaction (Dissatisfied or Very dissatisfied) among those who have completed high school or more, indicates that education level is not perfectly correlated with life satisfaction.
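
Because the education groups differ greatly in size, raw counts can hide differences in how satisfaction is distributed within each group. The sketch below shows one possible way to compute the share of each satisfaction level within each education level. The helper name satisfaction_share, the shortened education labels, and the toy rows are assumptions for illustration and do not reproduce any BRFSS result.

```r
library(dplyr)

# Hypothetical helper: share of each satisfaction level within each
# education level. Expects columns named educa and lsatisfy, as in q2.
satisfaction_share <- function(df) {
  df %>%
    count(educa, lsatisfy, name = "n") %>%   # respondents per cell
    group_by(educa) %>%
    mutate(share = n / sum(n)) %>%           # shares sum to 1 within educa
    ungroup()
}

# Made-up rows for illustration only; these are not BRFSS records.
toy <- data.frame(
  educa    = c("HS", "HS", "HS", "HS", "College", "College"),
  lsatisfy = c("Satisfied", "Satisfied", "Very satisfied", "Dissatisfied",
               "Very satisfied", "Satisfied")
)

shares <- satisfaction_share(toy)
shares
```

With the real data, the call would be satisfaction_share(q2). Adding sex to both count() and group_by() would give the same shares separately for each gender.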

Research question 3:

# Query the relevant variables
# genhlth: General Health
# income2: Income Level

q3 <- select(brfss2013, genhlth ,income2) %>% 
  filter(!is.na(genhlth), !is.na(income2))

# Present totals of variables being analyzed

q3 %>% group_by(genhlth) %>%    summarise(count=n())
## # A tibble: 5 x 2
##     genhlth  count
##      <fctr>  <int>
## 1 Excellent  74254
## 2 Very good 138247
## 3      Good 127701
## 4      Fair  55602
## 5      Poor  23116
q3 %>% group_by(income2) %>%   summarise(count=n())
## # A tibble: 8 x 2
##             income2  count
##              <fctr>  <int>
## 1 Less than $10,000  25252
## 2 Less than $15,000  26633
## 3 Less than $20,000  34705
## 4 Less than $25,000  41563
## 5 Less than $35,000  48687
## 6 Less than $50,000  61319
## 7 Less than $75,000  65102
## 8   $75,000 or more 115659
ggplot(data = q3, aes(x = genhlth , y = income2 ))+
    geom_count () +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))+
    xlab("genhlth: General Health ") +
    ylab ("income2: Income Level")

# About 58 percent of the surveyed population (242,080 of 418,920 respondents) reported an income level of 35,000 USD or more. There appears to be a positive relationship between earning more income and reporting a health level of at least good.

# Further, the number of respondents reporting very good or excellent health trends upward as we move up the income scale. This provides some evidence that more research could help identify possible causal relationships.

# Finally, short of more robust analysis to identify causation, I believe this survey would benefit from further segmentation of those who earn more than 75,000 USD to see how even higher earners fare in terms of general health relative to income level. The positions that pay even higher salaries may in fact come with more work time and stress, which could correlate with lower reported health levels.
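
The first observation above puts more than half of respondents at an income of 35,000 USD or more. That share can be checked directly from the income2 counts printed earlier. The sketch below copies those counts by hand and assumes each "Less than" band starts where the previous band ends, so that "Less than $50,000" covers incomes from 35,000 USD up to 50,000 USD.

```r
# Counts copied from the income2 table printed above.
income_counts <- c(
  "Less than $10,000" = 25252,
  "Less than $15,000" = 26633,
  "Less than $20,000" = 34705,
  "Less than $25,000" = 41563,
  "Less than $35,000" = 48687,
  "Less than $50,000" = 61319,
  "Less than $75,000" = 65102,
  "$75,000 or more"   = 115659
)

# Share of respondents in each income band
share <- income_counts / sum(income_counts)

# Bands at or above 35,000 USD, under the assumption stated above
bands_35k_plus <- c("Less than $50,000", "Less than $75,000", "$75,000 or more")
share_35k_plus <- sum(share[bands_35k_plus])

round(100 * share_35k_plus, 1)   # 57.8
```

The same pattern applied to the genhlth counts would give the share of respondents reporting each health level.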