Load Packages

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyverse)
## -- Attaching packages -------------------------------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 3.0.0     v readr   1.1.1
## v tibble  1.4.2     v purrr   0.2.5
## v tidyr   0.8.1     v stringr 1.3.1
## v ggplot2 3.0.0     v forcats 0.3.0
## -- Conflicts ----------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

Load Dataset

getwd()
## [1] "C:/Users/Toby/Downloads"
setwd("C:/Users/Toby/Downloads")
load("brfss.RData")

Dataset

The Centers for Disease Control and Prevention (CDC) has an ongoing yearly project called The Behavioral Risk Factor Surveillance System (BRFSS) that collects the data of health factors of adults in the United States. The BRFSS was collected by telephone survey to adults above 18 years old in 2013. The dataset includes 491,775 observations within 330 different variables.

Research Question 1

-Is there a correlation between income and mental health?

Research Question 2

-Is there a correlation between education and BMI?

Research Question 3

-Is there a correlation between exercise and general health levels?

Question 1

glimpse(brfss2013$income2)
##  Factor w/ 8 levels "Less than $10,000",..: 7 8 8 7 6 8 NA 6 8 4 ...
glimpse(brfss2013$menthlth)
##  int [1:491775] 29 0 2 0 2 0 15 0 0 0 ...
brfss2013$menthlth <- as.factor(brfss2013$menthlth)
ivsm <- filter(brfss2013,!is.na(income2), !is.na(menthlth))
ivsm <- select(ivsm,income2, menthlth)

ivsm %>% group_by(income2) %>% summarize(count = n()) 
## # A tibble: 8 x 2
##   income2            count
##   <fct>              <int>
## 1 Less than $10,000  24608
## 2 Less than $15,000  26065
## 3 Less than $20,000  34119
## 4 Less than $25,000  40945
## 5 Less than $35,000  48116
## 6 Less than $50,000  60812
## 7 Less than $75,000  64701
## 8 $75,000 or more   115097
ivsm %>% group_by(menthlth) %>% summarize(count = n())
## # A tibble: 32 x 2
##    menthlth  count
##    <fct>     <int>
##  1 0        283709
##  2 1         13434
##  3 2         20890
##  4 3         12094
##  5 4          5854
##  6 5         14858
##  7 6          1613
##  8 7          5541
##  9 8          1074
## 10 9           176
## # ... with 22 more rows
colnames(ivsm)[2] <- "mentalhealth"
colnames(ivsm)[1] <- "income"

glimpse(ivsm)
## Observations: 414,463
## Variables: 2
## $ income       <fct> Less than $75,000, $75,000 or more, $75,000 or mo...
## $ mentalhealth <fct> 29, 0, 2, 0, 2, 0, 0, 0, 0, 1, 0, 0, 1, 30, 0, 3,...
ggplot(aes(x = mentalhealth, y = income), data = ivsm) +
  geom_count() + xlab("Number of Days with Bad Mental Health") +
  ylab("Income") + ggtitle("Income vs Bad Mental Health Days")

The plot shows a increasing trend that people with higher income have less number of bad mental health days. Although the plot shows this statement’s correlation, the data shows that more people with higher incomes were surveyed more than people with lower income. This might have skewed the results of this plot.

Question 2

glimpse(brfss2013$educa)
##  Factor w/ 6 levels "Never attended school or only kindergarten",..: 6 5 6 4 6 6 4 5 6 4 ...
glimpse(brfss2013$X_bmi5) 
##  int [1:491775] 3916 1822 2746 2197 3594 3986 2070 NA 3017 2829 ...
brfss2013$X_bmi5 <- (brfss2013$X_bmi5)/100
edbmi <- select(brfss2013, educa, X_bmi5) %>% filter(!is.na(X_bmi5), !is.na(educa))
educat <- edbmi %>% group_by(educa) %>% summarize(count = n())

as_tibble(educat) 
## # A tibble: 6 x 2
##   educa                                                         count
##   <fct>                                                         <int>
## 1 Never attended school or only kindergarten                      487
## 2 Grades 1 through 8 (Elementary)                               11870
## 3 Grades 9 though 11 (Some high school)                         26544
## 4 Grade 12 or GED (High school graduate)                       135505
## 5 College 1 year to 3 years (Some college or technical school) 127455
## 6 College 4 years or more (College graduate)                   162295
edvsbmi<- edbmi %>% group_by(educa) %>% summarize(avg_bmi = mean(X_bmi5))

colnames(edvsbmi)[2] <- "averagebmi"
colnames(edvsbmi)[1] <- "education"

glimpse(edvsbmi)
## Observations: 6
## Variables: 2
## $ education  <fct> Never attended school or only kindergarten, Grades ...
## $ averagebmi <dbl> 29.27897, 28.95172, 28.63817, 28.25168, 28.16299, 2...
ggplot(aes(x = averagebmi, y = education), data = edvsbmi) + geom_point() + xlab("Average BMI") + 
  ylab("Education Level") + ggtitle("BMI vs Education Level")

The plot shows that there is a negative trend that people with lower education levels have a higher BMI. Something interesting when looking at this plot is that the average BMI are quite high. The plot’s graph goes from 27 to 29. These levels of BMI are considered “overweight”. But according to several health studies of American BMIs, the average American BMI is around 28. Therefore, the plotting of the graph is not incorrect. But the data does show that more people with high school and college education were surveyed the most. This might have affected the data’s correlation betwen BMI and education level.

Question 3

glimpse(brfss2013$exerany2)
##  Factor w/ 2 levels "Yes","No": 2 1 2 1 2 1 1 1 1 1 ...
glimpse(brfss2013$genhlth)
##  Factor w/ 5 levels "Excellent","Very good",..: 4 3 3 2 3 2 4 3 1 3 ...
health <- select(brfss2013, exerany2, genhlth) %>% filter(
  !is.na(exerany2), !is.na(genhlth))
health %>% group_by(genhlth) %>% summarize(count = n())
## # A tibble: 5 x 2
##   genhlth    count
##   <fct>      <int>
## 1 Excellent  79367
## 2 Very good 149455
## 3 Good      139538
## 4 Fair       61793
## 5 Poor       25797
health %>% group_by(exerany2) %>% summarize(count = n())
## # A tibble: 2 x 2
##   exerany2  count
##   <fct>     <int>
## 1 Yes      331389
## 2 No       124561
colnames(health)[1] <- "exercise"
colnames(health)[2] <- "generalhealth"

glimpse(health)
## Observations: 455,950
## Variables: 2
## $ exercise      <fct> No, Yes, No, Yes, No, Yes, Yes, Yes, Yes, Yes, Y...
## $ generalhealth <fct> Fair, Good, Good, Very good, Good, Very good, Fa...
ggplot(aes(x = generalhealth, y = exercise), data = health) +
  geom_count() + xlab("General Health Level") +
  ylab("Exercise in the Past 30 Days") + ggtitle("General Health vs Exercise")

This plot shows that people who exercise more have a higher health level. Although it seems there might be a correlation, people might just be more biased towards their own feelings of general health of being good.

Overall Summary

The dataset focuses on 491,775 observation within 330 variables pertaining to behavioral health risks of 2013. I focused on examining three research questions and the correlation between two variables.

The first question focuses on income and mental health. I used a dot plot to show the correlation between income and bad mental health days. The plot showed an postiive correlation that people with higher income had less number of bad mental health days.

The second question focuses on BMI and education level. I used a point plot to show the correlation between the two variables. THe graph shows a negative correlation that the higher the education, the lower the BMI level.

The third question focuses on general health and exercise. I used a dot plot to the show the correlation between the two variables. This plot showed a increasing trend that people who exercised more in the past 30 days had better general health.

~People who higher incomes have less number of bad mental health days

~People with higher education have lower BMI

~People who exercise more have better health levels

Some future improvements:

~Have more lower income people surveyed

~Have more lower education people surveyed

~Have more people who do not regularly exercise surveyed