library(ggplot2)
library(dplyr)
load("brfss2013.RData")
John Eugene Driscoll | Week 5 - Data Analysis Project | June 2017 Submission
The purpose of this document is to complete the Data Analysis Project required during week 5 of the Introduction to Probability and Data course by Duke University (Coursera).
The background context regarding the assignment can be found at: https://www.coursera.org/learn/probability-intro/supplement/1E7zQ/project-information.
The underlying analysis will use data from the BRFSS survey performed in 2013.
Per additional details in the context material: Since 2011, BRFSS has conducted both landline telephone- and cellular telephone-based surveys. In conducting the BRFSS landline telephone survey, interviewers collect data from a randomly selected adult in a household. In conducting the cellular telephone version of the BRFSS questionnaire, interviewers collect data from an adult who participates by using a cellular telephone and resides in a private residence or college housing.
In summary, because a large random sample was drawn under both data collection methods, results from the sample are generalizable to the adult population of the participating states.
Highlights of the data collection methodology
(Source: https://www.cdc.gov/brfss/data_documentation/pdf/userguidejune2013.pdf)
Disproportionate stratified sampling (DSS) has been used for the landline sample since 2003.
The cellular telephone sample is randomly generated from a sampling frame of confirmed cellular area code and prefix combinations.
The BRFSS goal is to support at least 4,000 interviews per state each year.
Data weighting is an important statistical process that attempts to remove bias in the sample. The BRFSS weighting process includes two steps: design weighting and iterative proportional fitting (also known as “raking”).
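The raking step can be illustrated with a minimal sketch. The toy respondents, the `rake` helper, and the target margins below are all hypothetical; they are not BRFSS values and this is not the CDC's actual procedure. The sketch only shows the general idea of rescaling weights repeatedly until the weighted sample matches each known population margin in turn.

```r
# Minimal raking (iterative proportional fitting) sketch on toy data.
# All values are hypothetical; they are not BRFSS figures.

# Toy sample: one row per respondent, starting (design) weight of 1
toy <- data.frame(
  sex = c("Male", "Male", "Female", "Female", "Female", "Female"),
  age = c("18-44", "45+", "18-44", "18-44", "45+", "45+"),
  w   = rep(1, 6),
  stringsAsFactors = FALSE
)

# Hypothetical population margins the weighted sample should match
target_sex <- c(Male = 0.5, Female = 0.5)
target_age <- c("18-44" = 0.4, "45+" = 0.6)

rake <- function(df, targets, iterations = 50) {
  for (i in seq_len(iterations)) {
    for (v in names(targets)) {
      # Current weighted share of each level of variable v
      current <- tapply(df$w, df[[v]], sum) / sum(df$w)
      # Rescale each weight so the margin for v matches its target
      lev <- as.character(df[[v]])
      df$w <- df$w * as.numeric(targets[[v]][lev]) / as.numeric(current[lev])
    }
  }
  df
}

raked <- rake(toy, list(sex = target_sex, age = target_age))

# Weighted shares after raking
sex_share <- tapply(raked$w, raked$sex, sum) / sum(raked$w)
age_share <- tapply(raked$w, raked$age, sum) / sum(raked$w)
```

With this toy data the weighted shares settle at the targets (50 percent male, 60 percent aged 45+). A real application would involve more variables and an explicit convergence check rather than a fixed number of iterations.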
Per additional details in the context material, BRFSS is an ongoing surveillance system. Per the CDC website (https://www.cdc.gov/obesity/data/surveillance.html), a surveillance system is a series of surveys conducted again and again to monitor long-term trends in public health. It is used to examine public health issues across several years, to track the trends, compare health among groups of people, and determine whether something is improving or worsening for a specific group of people.
In summary, a surveillance system is an observational study: respondents are surveyed as they are, not randomly assigned to conditions. Therefore, it won’t be possible to make causal inferences from the data.
Research question 1:
Is there any correlation between the average number of hours a respondent sleeps and the reported number of days on which the respondent felt full of energy (in the past 30 days)? Further, are there any noticeable differences in this correlation between genders?
Interest factor: As someone who often sacrifices sleep in order to meet the time requirements to be a top professional performer, I am interested to see how survey participants' reported hours of sleep correlate with overall energy levels. Sometimes mine feel low! (Even though the exciting writing in this project may lead you to assume otherwise!)
Id variables
sleptim1: How Much Time Do You Sleep
sex: Respondents Sex
qlhlth2: How Many Days Full Of Energy In Past 30 Days
Research question 2:
Is there any correlation between the level of education obtained and overall life satisfaction? Further, are there any noticeable differences in this correlation between genders?
Interest factor: Are there any noticeable trends in overall life satisfaction for those who push hard to achieve high levels of formal education? As the first in my family to graduate from university and a strong proponent of self-advancement through education, I am interested to see if there are any noticeable trends between those who complete more education and reported satisfaction.
Id variables
lsatisfy: Satisfaction With Life
educa: Education Level
sex: Respondents Sex
Research question 3:
Is there any correlation between reported level of income earned and general health? Further, are there any noticeable differences in this correlation between genders?
Interest factor: Are there any noticeable trends in general health for those who maintain high-paying jobs? One would think jobs of this nature would come with benefits that promote general health, such as medical insurance and cash flow for healthy habits. Conversely, positions that come with higher income could come with higher levels of stress and more hours on the job. This analysis should help identify some higher-level trends that could drive future analysis on the topic.
Id variables
genhlth: General Health
sex: Respondents Sex
income2: Income Level
Research question 1:
# Query the relevant variables
# sleptim1: How Much Time Do You Sleep
# sex: Respondents Sex
# qlhlth2: How Many Days Full Of Energy In Past 30 Days
q1 <- select(brfss2013, qlhlth2 , sex , sleptim1) %>%
filter(!is.na(qlhlth2), !is.na(sex), sleptim1 <= 12 )
# Present summary statistics of continuous variables and counts of categorical data
summary(q1)
##     qlhlth2          sex         sleptim1
##  Min.   : 0.00   Male  :162   Min.   : 2.000
##  1st Qu.: 2.00   Female:287   1st Qu.: 6.000
##  Median :15.00                Median : 7.000
##  Mean   :15.56                Mean   : 7.013
##  3rd Qu.:28.00                3rd Qu.: 8.000
##  Max.   :30.00                Max.   :12.000
# Plot relevant variables
# Use a scatter plot to observe the relationship between two continuous variables. Add a linear trend line with a confidence interval to see the relationship along with the spread of the results. Facet by gender to observe differences between the sexes.
ggplot(data = q1, aes(x = sleptim1, y = qlhlth2)) +
  geom_point(shape = 1) + # Use hollow circles
  geom_smooth(method = lm) +
  scale_x_continuous(limits = c(4, 10), breaks = 4:10) +
  facet_grid(. ~ sex) +
  xlab("sleptim1 = How much time do you sleep?") +
  ylab("qlhlth2: How Many Days Full Of Energy In Past 30 Days")
## Warning: Removed 12 rows containing non-finite values (stat_smooth).
## Warning: Removed 12 rows containing missing values (geom_point).
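The relationship shown in the scatter plot could also be summarized numerically. The following is only a sketch: `q1_toy` is a hypothetical stand-in with the same column names as `q1` and made-up values, so the numbers it produces say nothing about the BRFSS data. It computes a Pearson correlation and the spread of energetic days separately for each sex.

```r
library(dplyr)

# Hypothetical stand-in for q1 (same column names, made-up values)
q1_toy <- data.frame(
  sex      = factor(rep(c("Male", "Female"), each = 5)),
  sleptim1 = c(5, 6, 7, 8, 9, 5, 6, 7, 8, 9),
  qlhlth2  = c(8, 12, 15, 21, 24, 10, 9, 18, 20, 27)
)

cor_by_sex <- q1_toy %>%
  group_by(sex) %>%
  summarise(
    n         = n(),
    r         = cor(sleptim1, qlhlth2), # Pearson correlation within each sex
    sd_energy = sd(qlhlth2)             # spread of reported energetic days
  )

cor_by_sex
```

Replacing `q1_toy` with `q1` would give the per-sex correlation for the survey data, assuming the filtered data frame contains no missing values in either column.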
# Survey results indicate that respondents who sleep between 6 and 8 hours make up the majority of the sample. There appears to be a positive linear relationship between hours of sleep and the number of energetic days, with wide dispersion in the number of energetic days reported at each recorded level of sleep (in hours). The dispersion is greater among male respondents.
Research question 2:
# Query the relevant variables
# lsatisfy: Satisfaction With Life (categorical)
# educa: Education Level
# sex: Respondents Sex
q2 <- select(brfss2013, lsatisfy , sex, educa) %>%
filter(!is.na(lsatisfy), !is.na(sex), !is.na(educa))
# Present totals of variables being analyzed
q2 %>% group_by(lsatisfy) %>% summarise(count=n())
## # A tibble: 4 x 2
##   lsatisfy          count
##   <fctr>            <int>
## 1 Very satisfied     5378
## 2 Satisfied          5506
## 3 Dissatisfied        598
## 4 Very dissatisfied   161
q2 %>% group_by(educa) %>% summarise(count=n())
## # A tibble: 6 x 2
##   educa                                                        count
##   <fctr>                                                       <int>
## 1 Never attended school or only kindergarten                      10
## 2 Grades 1 through 8 (Elementary)                                496
## 3 Grades 9 though 11 (Some high school)                         1078
## 4 Grade 12 or GED (High school graduate)                        3708
## 5 College 1 year to 3 years (Some college or technical school)  3055
## 6 College 4 years or more (College graduate)                    3296
q2 %>% group_by(sex) %>% summarise(count=n())
## # A tibble: 2 x 2
##   sex    count
##   <fctr> <int>
## 1 Male    4078
## 2 Female  7565
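Because the education groups differ greatly in size, raw counts can be hard to compare across groups. One option, sketched here on a hypothetical `q2_toy` data frame with made-up values and shortened category labels, is to convert a cross-tabulation into within-group proportions with base R's `table()` and `prop.table()`.

```r
# Hypothetical stand-in for q2 (same column names, made-up values and labels)
q2_toy <- data.frame(
  educa = factor(c("High school", "High school", "High school", "High school",
                   "College", "College", "College", "College")),
  lsatisfy = factor(c("Satisfied", "Satisfied", "Dissatisfied", "Very satisfied",
                      "Very satisfied", "Very satisfied", "Satisfied", "Dissatisfied"))
)

# Cross-tabulate education (rows) against life satisfaction (columns)
counts <- table(q2_toy$educa, q2_toy$lsatisfy)

# margin = 1 rescales each education row so that it sums to 1
shares <- prop.table(counts, margin = 1)
round(shares, 2)
```

Row proportions make it possible to compare the share of satisfied respondents in a small group (such as those who never attended school) with the share in a much larger group.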
# Plot relevant variables
# Use a count plot to observe the relationship between two categorical variables. Facet by gender to observe differences between the sexes.
ggplot(data = q2, aes(x = lsatisfy, y = educa)) +
  geom_count() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
  facet_grid(. ~ sex) +
  xlab("lsatisfy: Satisfaction With Life") +
  ylab("educa: Education Level")
# Both genders behave similarly: satisfaction levels are greatest (Satisfied, Very satisfied) for those who have at least completed high school or the equivalent. Further, most respondents in this survey have completed at least high school, which may indicate some systematic bias in the survey (those with phones may be more likely to have completed high school and therefore may be more satisfied with life overall).
# Finally, it is interesting to note the makeup of the outlier cases. The handful of respondents who reported never attending school report being either satisfied or very satisfied with life. This result, along with the reported cases of dissatisfaction with life (Dissatisfied or Very dissatisfied) among those who have completed high school or better, indicates that education level is not perfectly correlated with life satisfaction.
Research question 3:
# Query the relevant variables
# genhlth: General Health
# income2: Income Level
q3 <- select(brfss2013, genhlth ,income2) %>%
filter(!is.na(genhlth), !is.na(income2))
# Present totals of variables being analyzed
q3 %>% group_by(genhlth) %>% summarise(count=n())
## # A tibble: 5 x 2
##   genhlth    count
##   <fctr>     <int>
## 1 Excellent  74254
## 2 Very good 138247
## 3 Good      127701
## 4 Fair       55602
## 5 Poor       23116
q3 %>% group_by(income2) %>% summarise(count=n())
## # A tibble: 8 x 2
##   income2            count
##   <fctr>             <int>
## 1 Less than $10,000  25252
## 2 Less than $15,000  26633
## 3 Less than $20,000  34705
## 4 Less than $25,000  41563
## 5 Less than $35,000  48687
## 6 Less than $50,000  61319
## 7 Less than $75,000  65102
## 8 $75,000 or more   115659
ggplot(data = q3, aes(x = genhlth, y = income2)) +
  geom_count() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
  xlab("genhlth: General Health") +
  ylab("income2: Income Level")
# More than 50 percent of the surveyed population reported an income of at least 35,000 USD. There appears to be a positive relationship between earning more income and reporting a health level of at least Good.
# Further, the number of respondents reporting Very good or Excellent health trends upward as we move up the income scale. This provides some evidence that more research could help identify possible causal relationships.
# Finally, short of a more robust analysis to identify causation, I believe this survey would benefit from further segmentation of those who earn more than 75,000 USD, to see how even higher earners fare in terms of general health. Positions that pay even higher salaries may in fact come with more work time and stress, which could correlate with lower reported health levels.
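The income and health trend could be checked with proportions rather than raw counts, since the income brackets differ in size. The sketch below uses a hypothetical `q3_toy` data frame with made-up values and only two income brackets; it is an illustration of the approach, not a result from the survey. It computes, for each income level, the share of respondents reporting Good health or better.

```r
library(dplyr)

# Hypothetical stand-in for q3 (same column names, made-up values)
q3_toy <- data.frame(
  income2 = factor(rep(c("Less than $35,000", "$75,000 or more"), each = 4),
                   levels = c("Less than $35,000", "$75,000 or more")),
  genhlth = c("Good", "Fair", "Poor", "Very good",
              "Excellent", "Very good", "Good", "Fair")
)

health_by_income <- q3_toy %>%
  group_by(income2) %>%
  summarise(
    n = n(),
    # Share of respondents in this bracket reporting Good health or better
    share_good_or_better = mean(genhlth %in% c("Excellent", "Very good", "Good"))
  )

health_by_income
```

Applied to `q3`, a share that rises steadily across the eight income levels would support the positive relationship described above, while still saying nothing about causation.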