BRFSS is an ongoing surveillance system that began in 1984 with data collection in 15 states. It now extends to all 50 states and additional territories. The information is collected via landline and cellular telephone surveys.
Because the data are observational, any correlation between variables cannot be interpreted as causal. Also, because the data come from phone surveys, any inferences must be generalized with extreme caution, as survey respondents may not be representative of the population.
Research questions: (11 points) Come up with at least three research questions that you want to answer using these data. You should phrase your research questions in a way that matches up with the scope of inference your dataset allows for. Make sure that at least two of these questions involve at least three variables. You are welcome to create new variables based on existing ones. With each question include a brief discussion (1-2 sentences) as to why this question is of interest to you and/or your audience.
Research question 1:
Does the month of the interview matter for how people report their best (and worst) Health-Related Quality of Life, as measured by poor health days (poorhlth)? And does this differ by marital status?
Research question 2:
Do people who consume alcoholic beverages more often also tend to smoke more? Does this change by sex?
Research question 3:
Do people who exercise eat more fruit? Does this change by level of education?
Research question 1:
In order to determine the best (and worst) Quality of Life, we need to select the columns of interest and deal with the NAs.
data1 <- brfss2013 %>%
  select(Month = imonth, Poor_Health = poorhlth, Mar_Status = marital) %>%
  as.data.frame()
round(sapply(data1, function(x) mean(is.na(x))), 4)
##       Month Poor_Health  Mar_Status
##      0.0000      0.4944      0.0070
Nearly 50% of the data has NAs reported for poor health, so let’s drop those entries.
## Month Poor_Health Mar_Status
## 0 0 0
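The step that drops the missing values is not shown above. A minimal sketch of one way to do it follows, using a small made-up data frame (`toy1`, hypothetical values) in place of `data1` so that the example runs on its own. The original analysis may have used a different call, such as `na.omit()`.

```r
library(dplyr)

# Toy stand-in for the three selected columns (hypothetical values).
toy1 <- data.frame(
  Month       = factor(c("January", "January", "February", "March")),
  Poor_Health = c(3, NA, 10, 0),
  Mar_Status  = factor(c("Married", "Divorced", NA, "Widowed"))
)

# Keep only rows with no missing value in either analysis column.
toy1_complete <- toy1 %>%
  filter(!is.na(Poor_Health), !is.na(Mar_Status))

# Re-check the share of NAs per column; every entry should now be 0.
na_share <- round(sapply(toy1_complete, function(x) mean(is.na(x))), 4)
print(na_share)
```

Dropping rows this way restricts all later averages to respondents who answered the poor-health question, which is worth keeping in mind when interpreting them.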
Let’s create some tables:
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 6 x 2
## Mar_Status Average_Days
## <fct> <dbl>
## 1 Married 4.48
## 2 Divorced 7.32
## 3 Widowed 6.21
## 4 Separated 7.95
## 5 Never married 4.50
## 6 A member of an unmarried couple 4.51
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 12 x 2
## Month Average_Days
## <fct> <dbl>
## 1 January 5.27
## 2 February 5.14
## 3 March 5.18
## 4 April 5.27
## 5 May 5.26
## 6 June 5.53
## 7 July 5.49
## 8 August 5.37
## 9 September 5.39
## 10 October 5.21
## 11 November 5.07
## 12 December 5.11
## `summarise()` regrouping output by 'Mar_Status' (override with `.groups` argument)
## # A tibble: 6 x 13
## # Groups: Mar_Status [6]
## Mar_Status January February March April May June July August September
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Married 4.46 4.47 4.39 4.49 4.39 4.78 4.71 4.67 4.45
## 2 Divorced 7.50 6.89 7.38 7.34 7.35 7.47 7.47 7.22 7.67
## 3 Widowed 6.07 6.21 6.34 6.52 6.35 6.39 6.23 6.35 6.25
## 4 Separated 7.90 8.22 7.82 7.33 7.67 7.88 8.90 7.84 8.26
## 5 Never mar~ 4.58 4.30 4.26 4.30 4.64 4.81 4.74 4.55 4.62
## 6 A member ~ 4.74 4.08 4.21 4.78 4.15 4.70 4.67 4.42 5.15
## # ... with 3 more variables: October <dbl>, November <dbl>, December <dbl>
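The code behind the three tables above is not shown. The sketch below is one plausible way to build them with `group_by()` and `summarize()`, followed by `pivot_wider()` for the month-by-status layout. The data frame `toy1` and its values are made up for illustration, and the use of `tidyr::pivot_wider()` is an assumption (the original may have used `spread()` or another approach).

```r
library(dplyr)
library(tidyr)

# Toy stand-in for the cleaned data (hypothetical values).
toy1 <- data.frame(
  Month       = factor(c("January", "January", "February", "February"),
                       levels = c("January", "February")),
  Mar_Status  = factor(c("Married", "Divorced", "Married", "Divorced")),
  Poor_Health = c(2, 8, 4, 6)
)

# Average poor-health days by marital status.
by_status <- toy1 %>%
  group_by(Mar_Status) %>%
  summarize(Average_Days = mean(Poor_Health), .groups = "drop")

# Average poor-health days by interview month.
by_month <- toy1 %>%
  group_by(Month) %>%
  summarize(Average_Days = mean(Poor_Health), .groups = "drop")

# Average by both variables, in long form (one row per status/month pair).
toy1_both <- toy1 %>%
  group_by(Mar_Status, Month) %>%
  summarize(Average_Days = mean(Poor_Health), .groups = "drop")

# Wide form: one row per marital status, one column per month.
wide <- toy1_both %>%
  pivot_wider(names_from = Month, values_from = Average_Days)

print(by_status)
print(by_month)
print(wide)
```

The long form (`toy1_both` here) is the shape that a plotting call with `x = Month` and `fill = Mar_Status` expects, while the wide form is easier to read as a table.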
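There does appear to be a relationship between Marital Status and Poor Health Days (about 4.5 days on average for Married versus about 8 for Separated), but not a strong relationship with the Month interviewed (monthly averages range only from 5.07 to 5.53).
Let’s look at it visually.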
ggplot(data1_both, aes(x = Month, y = Average_Days, fill = Mar_Status)) +
  geom_bar(stat = "identity", position = position_dodge())
This plot is a little busy, but you can get a sense that Separated and Divorced respondents have higher Poor_Health days and that the month makes little difference.
Research question 2:
data2 <- brfss2013 %>%
  select(Sex = sex, Alcohol_Days = alcday5, Smoke_Days = smokday2) %>%
  as.data.frame()
round(sapply(data2, function(x) mean(is.na(x))), 4)
##          Sex Alcohol_Days   Smoke_Days
##       0.0000       0.0399       0.5632
Let’s get rid of the NAs again
## Sex Alcohol_Days Smoke_Days
## 0 0 0
Let’s look at some average numbers by sex.
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 2 x 2
## Sex Average_Drink_Days
## <fct> <dbl>
## 1 Male 106.
## 2 Female 88.4
data2_both <- data2 %>%
  group_by(Sex, Smoke_Days) %>%
  summarize(Average_Drink_Days = mean(Alcohol_Days))
## `summarise()` regrouping output by 'Sex' (override with `.groups` argument)
## # A tibble: 3 x 3
## Smoke_Days Male Female
## <fct> <dbl> <dbl>
## 1 Every day 107. 84.4
## 2 Some days 117. 91.1
## 3 Not at all 105. 89.7
Interestingly, men seem to consume alcoholic beverages more often, but everyday smokers don’t seem to drink more than those who answered “Not at all”, while those who answered “Some days” drink the most.
Again, let’s try to visualize this data.
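One caveat when reading these averages: means near 100 suggest that alcday5 is being averaged as a raw code rather than as a count of days. Assuming the usual BRFSS convention (worth confirming in the codebook) in which 101 to 107 mean days per week, 201 to 230 mean days in the past 30 days, and 0 means no drinks, the values can be put on a common scale before averaging. The sketch below uses a made-up data frame (`toy2`), and the new column name `Drink_Days_30` is hypothetical.

```r
library(dplyr)

# Hypothetical coded values, under the assumed coding:
#   0       = no drinks in the past 30 days
#   101-107 = days per week
#   201-230 = days in the past 30 days
toy2 <- data.frame(
  Sex          = factor(c("Male", "Male", "Female", "Female")),
  Alcohol_Days = c(102, 215, 0, 201)
)

# Convert every code to (approximate) drinking days per 30 days.
toy2 <- toy2 %>%
  mutate(
    code = as.numeric(Alcohol_Days),
    Drink_Days_30 = case_when(
      code == 0                 ~ 0,
      code >= 101 & code <= 107 ~ (code - 100) * 30 / 7,  # weekly -> 30 days
      code >= 201 & code <= 230 ~ code - 200,             # already per 30 days
      TRUE                      ~ NA_real_                # unexpected codes
    )
  ) %>%
  select(-code)

# Average drinking days per 30 days by sex, on the common scale.
by_sex <- toy2 %>%
  group_by(Sex) %>%
  summarize(Average_Drink_Days = mean(Drink_Days_30), .groups = "drop")

print(by_sex)
```

If the coding assumption holds for this dataset, rerunning the group averages on the converted column would give figures that can be read directly as days per month.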
ggplot(data2_both, aes(x = Smoke_Days, y = Average_Drink_Days, fill = Smoke_Days)) +
  geom_bar(stat = "identity") +
  facet_wrap(. ~ Sex)
Research question 3:
data3 <- brfss2013 %>%
  select(Education = educa, Fruit_Times = fruit1, Exercise_Times = exeroft1) %>%
  as.data.frame()
round(sapply(data3, function(x) mean(is.na(x))), 4)
##      Education    Fruit_Times Exercise_Times
##         0.0046         0.0687         0.3338
Let’s get rid of the NAs again
## Education Fruit_Times Exercise_Times
## 0 0 0
Let’s look at some average numbers by education
data3 %>%
  group_by(Education) %>%
  summarize(Average_Fruit_Times = mean(Fruit_Times),
            Average_Exercise_Times = mean(Exercise_Times))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 6 x 3
## Education Average_Fruit_Tim~ Average_Exercise_T~
## <fct> <dbl> <dbl>
## 1 Never attended school or only kinderga~ 160. 130.
## 2 Grades 1 through 8 (Elementary) 166. 134.
## 3 Grades 9 though 11 (Some high school) 180. 138.
## 4 Grade 12 or GED (High school graduate) 180. 138.
## 5 College 1 year to 3 years (Some colleg~ 179. 137.
## 6 College 4 years or more (College gradu~ 172. 133.
There does appear to be some relationship between fruit consumption and education, but not much between exercise and education.
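The averages in this table are taken over the stored values of fruit1 and exeroft1, which appear to be frequency codes rather than plain counts. As a rough sketch only, assuming the BRFSS convention (to be checked against the codebook) that fruit1 uses 101 to 199 for times per day, 201 to 299 for times per week and 301 to 399 for times per month, and that exeroft1 uses 101 to 199 for times per week and 201 to 299 for times per month, both can be converted to times per week. The helper names, the data frame `toy3`, and its values are hypothetical.

```r
library(dplyr)

# Convert an assumed fruit1 code to times per week.
fruit_per_week <- function(code) {
  code <- as.numeric(code)
  case_when(
    code == 0                 ~ 0,
    code >= 101 & code <= 199 ~ (code - 100) * 7,       # per day
    code >= 201 & code <= 299 ~ code - 200,             # per week
    code >= 301 & code <= 399 ~ (code - 300) * 7 / 30,  # per month
    TRUE                      ~ NA_real_
  )
}

# Convert an assumed exeroft1 code to times per week.
exercise_per_week <- function(code) {
  code <- as.numeric(code)
  case_when(
    code >= 101 & code <= 199 ~ code - 100,             # per week
    code >= 201 & code <= 299 ~ (code - 200) * 7 / 30,  # per month
    TRUE                      ~ NA_real_
  )
}

# Toy stand-in for data3 (hypothetical values).
toy3 <- data.frame(
  Education      = factor(c("HS", "HS", "College", "College")),
  Fruit_Times    = c(101, 203, 302, 0),
  Exercise_Times = c(103, 204, 102, 201)
) %>%
  mutate(Fruit_Per_Week    = fruit_per_week(Fruit_Times),
         Exercise_Per_Week = exercise_per_week(Exercise_Times))

# Group averages on the common scale.
by_education <- toy3 %>%
  group_by(Education) %>%
  summarize(Average_Fruit    = mean(Fruit_Per_Week),
            Average_Exercise = mean(Exercise_Per_Week),
            .groups = "drop")

# Direct look at the research question: do the two move together?
fruit_exercise_cor <- cor(toy3$Fruit_Per_Week, toy3$Exercise_Per_Week)

print(by_education)
print(fruit_exercise_cor)
```

Computing the correlation within each education level (for example inside `summarize()`) would speak more directly to whether the fruit and exercise relationship changes by education.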