“The Behavioral Risk Factor Surveillance System (BRFSS) is the nation’s premier system of health-related telephone surveys that collect state data about U.S. residents regarding their health-related risk behaviors, chronic health conditions, and use of preventive services. BRFSS collects data in all 50 states as well as the District of Columbia and three U.S. territories. BRFSS completes more than 400,000 adult interviews each year, making it the largest continuously conducted health survey system in the world…” The survey is conducted using Random Digit Dialing (RDD) techniques on both landlines and cell phones."
Since the data is collecting in a way that does not directly interfere with how the data arise, we can conlude that it is an observational study. Therefore, we can only establish association between the variables.
The population of interest, adults 18 years or older, is defined using a stratified sample. Hence, the results of this survey can be generalized to the US population When conducting studies, many biases are likely to occur. In surveys, potential source of bias can come from non responses. Non responses biases occur if a part of the sampled people who respond are no longer representative of the population.
We’re interested in seeing if the health coverage might differ from state to state. Different policies concerning healtch access might exist across the states, which can result in differences in health access.
In the question above, we wanted to see if some differences in health access existed across the states. We make the hypothesis that income level might explain the health care access. Thus we want to see if there’s a correlation between income level and health care access.
Finally, we want to look at some activities impacting the health. We choose exercise and sleep time. We wanted to include tobacco use but the variable we’re interested in (smokday2) contains 56% of missing values. Therefore we decided not to use it.
In this part, we’ll perform an exploratory data analysis (EDA) that addresses each of the three research questions above.
DATA PROCESSING
We start by checking the info of the session and the workspace
sessionInfo()
## R version 3.6.1 (2019-07-05)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Mojave 10.14.6
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
##
## locale:
## [1] fr_FR.UTF-8/fr_FR.UTF-8/fr_FR.UTF-8/C/fr_FR.UTF-8/fr_FR.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## loaded via a namespace (and not attached):
## [1] compiler_3.6.1 magrittr_1.5 tools_3.6.1 htmltools_0.3.6
## [5] yaml_2.2.0 Rcpp_1.0.2 stringi_1.4.3 rmarkdown_1.15
## [9] knitr_1.25 stringr_1.4.0 xfun_0.9 digest_0.6.21
## [13] evaluate_0.14
getwd(); ls()
## [1] "/Users/mamba/Downloads/Data_Scientist_Path/Courses/Coursera/Statistics_With_R_Duke/1_Introduction_to_Data_&_Probability/scripts"
## character(0)
We load the necessary packages
library(tidyverse) # data wrangling and analysis
## ── Attaching packages ────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.2.1 ✔ purrr 0.3.2
## ✔ tibble 2.1.3 ✔ dplyr 0.8.3
## ✔ tidyr 1.0.0 ✔ stringr 1.4.0
## ✔ readr 1.3.1 ✔ forcats 0.4.0
## ── Conflicts ───────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(knitr) # convert markdown to HTML
library(rmarkdown) # markdown
library(fcuk) # correct typos
We downlaod the data if(!file.exists(“data”)) { dir.create(“data”) }
if(!file.exists("data")) {
dir.create("data")
}
fileurl <- "https://d3c33hcgiwev3.cloudfront.net/_384b2d9eda4b29131fb681b243a7767d_brfss2013.RData?Expires=1569801600&Signature=I-Q07DUZYfnFGP38xLnib7Td~ry~puzFZfUIoh5PMxYMzeMg7v3OABvBltUeaz7sNz1-vQKDfGcHKdSK3j5z6cuo1qta87b6JS0SmKzV6tRqp~qXbTrClNz40lClELyGsszVdgwc7glFrvRgOVwIk3CWlqXQY4dYsQshGMLxD0Q_&Key-Pair-Id=APKAJLTNE6QMUY6HBC5A"
destfile <- "data/brfss2013.RData"
download.file(fileurl, destfile)
dateDownloaded <- date()
dir("data") # check that the file has been downloaded
## [1] "brfss2013.RData"
We may now read in the data
load("data/brfss2013.Rdata")
We look at some information about the data
class(brfss2013); dim(brfss2013)
## [1] "data.frame"
## [1] 491775 330
We will select the variables we’ll need to answer the question. Through the rest of the analysis, we will refer to the codebook for variables selection. We store the link in an object.
codebook <- "https://d3c33hcgiwev3.cloudfront.net/_e34476fda339107329fc316d1f98e042_brfss_codebook.html?Expires=1569369600&Signature=BEynyfHDi3iw2VjikExnq-rXYEN2X5bTcTdFHg0d~qYePPXHMynC8Ls85dtAMa2kPzWmByy9dUADMEbJ4gPFFDxxZPHu2lXLtgY7kbukqj9ht3LrYuzzp-ruFHK3gm89nfxVGpQqbJYhrVMMSKJTXdXAZiTfqOHM1EMzK6ArRdc_&Key-Pair-Id=APKAJLTNE6QMUY6HBC5A"
datecode <- date()
So we’ll need the variables X_state and hlthpln1 from section 0 and 3.
We first want to see if there’s any missing values in the two variables.
We can obtain this information either in the codebook, or by the following code
sum(is.na(c(brfss2013$X_state, brfss2013$hlthpln1)))
## [1] 1904
sum(is.na(brfss2013$X_state))
## [1] 0
There are 1904 missing values and they are all in the hlthpln1 variable. They only represent 0.3% of all observations in the variable.
sum(is.na(brfss2013$hlthpln1)) / length(brfss2013$hlthpln1)
## [1] 0.003871689
This information is also confirmed by the notebook in the section 3. Since the missing values only represent 0.3% of the observations in the variable, we can delete them.
The following code selects the two variables from the data, then drop the missing values, group the observations by state, compute the total of people with an health care coverage and those without. Finally it calculates the ratio of perople with an healtch care coverage.
So basically, it gives us the ratio of people with an health care coverage by state.
states <- brfss2013 %>%
select(X_state, hlthpln1) %>%
na.omit() %>%
group_by(X_state) %>%
summarise(hlth = sum(hlthpln1 == "Yes", na.rm = T), nohlth = sum(hlthpln1 == "No", na.rm = T)) %>%
mutate(ratio = hlth / (hlth + nohlth)) %>%
arrange(desc(ratio))
str(states)
## Classes 'tbl_df', 'tbl' and 'data.frame': 53 obs. of 4 variables:
## $ X_state: Factor w/ 55 levels "0","Alabama",..: 23 10 54 17 47 22 25 8 36 9 ...
## $ hlth : int 14377 4698 5708 7606 5949 12064 13277 7090 7140 4737 ...
## $ nohlth : int 642 220 286 527 427 904 1014 604 630 445 ...
## $ ratio : num 0.957 0.955 0.952 0.935 0.933 ...
head(states)
## # A tibble: 6 x 4
## X_state hlth nohlth ratio
## <fct> <int> <int> <dbl>
## 1 Massachusetts 14377 642 0.957
## 2 District of Columbia 4698 220 0.955
## 3 Puerto Rico 5708 286 0.952
## 4 Iowa 7606 527 0.935
## 5 Vermont 5949 427 0.933
## 6 Maryland 12064 904 0.930
Once our new dataframe is created, we can now plot the ratios.
ggplot(states, aes(X_state, ratio, fill = X_state)) + geom_bar(stat = "identity") +
xlab("States") +
ylab("Health Care Coverage Ratio") +
ggtitle("Health Care Access")
20 of 53 states have an health access ratio equal or higher than 90 percent.
above <- states %>%
filter(ratio >= 0.9)
nrow(above)
## [1] 20
ggplot(above, aes(X_state, ratio, fill = X_state)) + geom_bar(stat = "identity") +
xlab("States") +
ylab("Health Care Coverage Ratio") +
ggtitle("Health Care Access")
31 of 53 states have an health access ratio between 80 and 90 percent.
middle <- states %>%
filter(ratio >= 0.8 & ratio < 0.9)
nrow(middle)
## [1] 31
ggplot(middle, aes(X_state, ratio, fill = X_state)) + geom_bar(stat = "identity") +
xlab("States") +
ylab("Health Care Coverage Ratio") +
ggtitle("Health Care Access")
2 of 53 states have an health access ratio lower than 80 percent
below <- states %>%
filter(ratio < 0.8)
nrow(below)
## [1] 2
ggplot(below, aes(X_state, ratio, fill = X_state)) + geom_bar(stat = "identity") +
xlab("States") +
ylab("Health Care Coverage Ratio") +
ggtitle("Health Care Access")
We don’t notice a great difference across the states about health care access. 31 of the 53 states have a health care coverage ratio of 80 percent of higher, with Massachussets, DC and Puerto Rico being the only one with a ratio higher than 95 percent.
On the other hand, there are two states whose ratio is lower than 80 percents. Those states are Guam and Texas.
We start by selecting the variables we need for the analysis and we drop the missing values. The hlthpln1 variable tells us if the person has any health care coverage
income_hlth <- brfss2013 %>%
select(income2, hlthpln1) %>%
na.omit()
We plot the income level values and compare it with health care access.
ggplot(income_hlth, aes(income2, fill = hlthpln1)) + geom_bar() +
xlab("Income Level") +
ylab("") +
ggtitle("Health Care Access by Income Level")
We saw earlier that the vast majority of people have an health care coverage. This graph confirms it and shows us as well than as the income level increases, people are more likely to have an health care.
We’ll focus on exercise, and sleeping time.
We first start by selecting the three variables we need. The exerany2 variable describes people who exercised in the past 30 days
activity <- brfss2013 %>%
select(exerany2, sleptim1, genhlth)
str(activity)
## 'data.frame': 491775 obs. of 3 variables:
## $ exerany2: Factor w/ 2 levels "Yes","No": 2 1 2 1 2 1 1 1 1 1 ...
## $ sleptim1: int NA 6 9 8 6 8 7 6 8 8 ...
## $ genhlth : Factor w/ 5 levels "Excellent","Very good",..: 4 3 3 2 3 2 4 3 1 3 ...
head(activity)
## exerany2 sleptim1 genhlth
## 1 No NA Fair
## 2 Yes 6 Good
## 3 No 9 Good
## 4 Yes 8 Very good
## 5 No 6 Good
## 6 Yes 8 Very good
We want to check if there’re any missing values
missing <- sum(is.na(activity))
missing
## [1] 43401
There are 61060 missing values in our new dataframe and we want to see which variables contain the Nas
summary(activity)
## exerany2 sleptim1 genhlth
## Yes :332464 Min. : 0.000 Excellent: 85482
## No :125282 1st Qu.: 6.000 Very good:159076
## NA's: 34029 Median : 7.000 Good :150555
## Mean : 7.052 Fair : 66726
## 3rd Qu.: 8.000 Poor : 27951
## Max. :450.000 NA's : 1985
## NA's :7387
We looked into the codebook to see the percentage of missing values in each variables and these are the different percentages. I gave the approximations. exerany2 7% sleptim1 1%
Since the missing values present in our dataset represent a small portion of our observations, we can delete them.
activity_clean <- na.omit(activity)
str(activity_clean)
## 'data.frame': 449742 obs. of 3 variables:
## $ exerany2: Factor w/ 2 levels "Yes","No": 1 2 1 2 1 1 1 1 1 1 ...
## $ sleptim1: int 6 9 8 6 8 7 6 8 8 6 ...
## $ genhlth : Factor w/ 5 levels "Excellent","Very good",..: 3 3 2 3 2 4 3 1 3 3 ...
## - attr(*, "na.action")= 'omit' Named int 1 36 42 53 54 70 71 128 161 176 ...
## ..- attr(*, "names")= chr "1" "36" "42" "53" ...
sum(is.na(activity_clean))
## [1] 0
Now that the data is clean, we can analyse it.
We visualize the people who exercised and those who didn’t.
table(activity_clean$exerany2)
##
## Yes No
## 328131 121611
ggplot(activity_clean, aes(exerany2, fill = exerany2)) + geom_bar() +
xlab("People Exercising") +
ylab("") +
ggtitle("Exercise in Past 30 Days")
We then make a plot to see the relationship between general health and exercising.
exercising <- brfss2013 %>%
select(genhlth, exerany2) %>%
na.omit()
ggplot(exercising, aes(exerany2, fill = genhlth)) + geom_bar() +
xlab("People Exercising") +
ylab("") +
ggtitle("Exercising and General Health")
We do the same thing for the sleep time. First we want to see the distribution of sleep time
summary(activity_clean$sleptim1)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 6.00 7.00 7.05 8.00 24.00
boxplot(activity_clean$sleptim1)
In the sleptim1 variable, there are unusual values. The variable refers to the average hours of sleep in a 24-period. So anything beyond 24 is unusual and will be deleted. We look at the distribution of the variable
activity_clean$sleptim1 <- activity_clean %>%
filter(sleptim1 <= 24)
summary(activity_clean$sleptim1)
## exerany2 sleptim1 genhlth
## Yes:328131 Min. : 1.00 Excellent: 78803
## No :121611 1st Qu.: 6.00 Very good:148378
## Median : 7.00 Good :137599
## Mean : 7.05 Fair : 60318
## 3rd Qu.: 8.00 Poor : 24644
## Max. :24.00
boxplot(activity_clean$sleptim1)
We want to look for any association between general helath and sleeping time.
slep_hlth <- brfss2013 %>%
select(genhlth, sleptim1) %>%
na.omit() %>%
filter(sleptim1 <= 24)
ggplot(slep_hlth, aes(sleptim1, fill = genhlth)) + geom_bar() +
xlab("Sleeping Time") +
ylab("") +
ggtitle("General Health and Sleeping Time")
There’s no apparent association between sleeping time and general health. We expected the people who sleep the most to have a better health. This can be explained by the fact that the general health of the population is good. The following plot shows it
ggplot(slep_hlth, aes(genhlth, fill = genhlth)) + geom_bar() +
xlab("General Health") +
ylab("") +
ggtitle("General Health")
In this analysis, we explored the BRFSS data from 2013 and tried to answer some questions related to the health of US adults (18+). We first looked at the potential differences across the different states. The results show that people, irrespective of the state, have in the vast majority access to the health care coverage.
Then we asked if there were any association between health care coverage and income level. The analysis yieled a negative result. There’s no apparent association between health care coverage and income level. However, we notice that as income level increases, there are more people with health care coverage.
Finally, we analyzed different behaviors related to health and looked at their impact to the general health.