PART III : EXPLORATORY DATA ANALYSIS

In this part, we’ll perform an exploratory data analysis (EDA) that addresses each of the three research questions above.

DATA PROCESSING

We start by checking the info of the session and the workspace

sessionInfo()

## R version 3.6.1 (2019-07-05)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Mojave 10.14.6
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] fr_FR.UTF-8/fr_FR.UTF-8/fr_FR.UTF-8/C/fr_FR.UTF-8/fr_FR.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## loaded via a namespace (and not attached):
##  [1] compiler_3.6.1  magrittr_1.5    tools_3.6.1     htmltools_0.3.6
##  [5] yaml_2.2.0      Rcpp_1.0.2      stringi_1.4.3   rmarkdown_1.15 
##  [9] knitr_1.25      stringr_1.4.0   xfun_0.9        digest_0.6.21  
## [13] evaluate_0.14

getwd(); ls()

## [1] "/Users/mamba/Downloads/Data_Scientist_Path/Courses/Coursera/Statistics_With_R_Duke/1_Introduction_to_Data_&_Probability/scripts"

## character(0)

We load the necessary packages

library(tidyverse) # data wrangling and analysis

## ── Attaching packages ────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──

## ✔ ggplot2 3.2.1     ✔ purrr   0.3.2
## ✔ tibble  2.1.3     ✔ dplyr   0.8.3
## ✔ tidyr   1.0.0     ✔ stringr 1.4.0
## ✔ readr   1.3.1     ✔ forcats 0.4.0

## ── Conflicts ───────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

library(knitr) # convert markdown to HTML
library(rmarkdown) # markdown
library(fcuk) # correct typos

We downlaod the data if(!file.exists(“data”)) { dir.create(“data”) }

if(!file.exists("data")) {
    dir.create("data")
}

fileurl <- "https://d3c33hcgiwev3.cloudfront.net/_384b2d9eda4b29131fb681b243a7767d_brfss2013.RData?Expires=1569801600&Signature=I-Q07DUZYfnFGP38xLnib7Td~ry~puzFZfUIoh5PMxYMzeMg7v3OABvBltUeaz7sNz1-vQKDfGcHKdSK3j5z6cuo1qta87b6JS0SmKzV6tRqp~qXbTrClNz40lClELyGsszVdgwc7glFrvRgOVwIk3CWlqXQY4dYsQshGMLxD0Q_&Key-Pair-Id=APKAJLTNE6QMUY6HBC5A"
destfile <- "data/brfss2013.RData"
download.file(fileurl, destfile)
dateDownloaded <- date()
dir("data") # check that the file has been downloaded

## [1] "brfss2013.RData"

We may now read in the data

load("data/brfss2013.Rdata")

We look at some information about the data

class(brfss2013); dim(brfss2013)

## [1] "data.frame"

## [1] 491775    330

Question 1 : Across the states, is there any difference in health care access?

We will select the variables we’ll need to answer the question. Through the rest of the analysis, we will refer to the codebook for variables selection. We store the link in an object.

codebook <- "https://d3c33hcgiwev3.cloudfront.net/_e34476fda339107329fc316d1f98e042_brfss_codebook.html?Expires=1569369600&Signature=BEynyfHDi3iw2VjikExnq-rXYEN2X5bTcTdFHg0d~qYePPXHMynC8Ls85dtAMa2kPzWmByy9dUADMEbJ4gPFFDxxZPHu2lXLtgY7kbukqj9ht3LrYuzzp-ruFHK3gm89nfxVGpQqbJYhrVMMSKJTXdXAZiTfqOHM1EMzK6ArRdc_&Key-Pair-Id=APKAJLTNE6QMUY6HBC5A"
datecode <- date()

So we’ll need the variables X_state and hlthpln1 from section 0 and 3.

We first want to see if there’s any missing values in the two variables.
We can obtain this information either in the codebook, or by the following code

sum(is.na(c(brfss2013$X_state, brfss2013$hlthpln1)))

## [1] 1904

sum(is.na(brfss2013$X_state))

## [1] 0

There are 1904 missing values and they are all in the hlthpln1 variable. They only represent 0.3% of all observations in the variable.

sum(is.na(brfss2013$hlthpln1)) / length(brfss2013$hlthpln1)

## [1] 0.003871689

This information is also confirmed by the notebook in the section 3. Since the missing values only represent 0.3% of the observations in the variable, we can delete them.

The following code selects the two variables from the data, then drop the missing values, group the observations by state, compute the total of people with an health care coverage and those without. Finally it calculates the ratio of perople with an healtch care coverage.

So basically, it gives us the ratio of people with an health care coverage by state.

states <- brfss2013 %>%
    select(X_state, hlthpln1) %>%
    na.omit() %>%
    group_by(X_state) %>%
    summarise(hlth = sum(hlthpln1 == "Yes", na.rm = T), nohlth = sum(hlthpln1 == "No", na.rm = T)) %>%
    mutate(ratio = hlth / (hlth + nohlth)) %>%
    arrange(desc(ratio))
    
str(states)

## Classes 'tbl_df', 'tbl' and 'data.frame':    53 obs. of  4 variables:
##  $ X_state: Factor w/ 55 levels "0","Alabama",..: 23 10 54 17 47 22 25 8 36 9 ...
##  $ hlth   : int  14377 4698 5708 7606 5949 12064 13277 7090 7140 4737 ...
##  $ nohlth : int  642 220 286 527 427 904 1014 604 630 445 ...
##  $ ratio  : num  0.957 0.955 0.952 0.935 0.933 ...

head(states)

## # A tibble: 6 x 4
##   X_state               hlth nohlth ratio
##   <fct>                <int>  <int> <dbl>
## 1 Massachusetts        14377    642 0.957
## 2 District of Columbia  4698    220 0.955
## 3 Puerto Rico           5708    286 0.952
## 4 Iowa                  7606    527 0.935
## 5 Vermont               5949    427 0.933
## 6 Maryland             12064    904 0.930

Once our new dataframe is created, we can now plot the ratios.

ggplot(states, aes(X_state, ratio, fill = X_state)) + geom_bar(stat = "identity") +
    xlab("States") +
    ylab("Health Care Coverage Ratio") +
    ggtitle("Health Care Access")

20 of 53 states have an health access ratio equal or higher than 90 percent.

above <- states %>%
    filter(ratio >= 0.9)
nrow(above)

## [1] 20

ggplot(above, aes(X_state, ratio, fill = X_state)) + geom_bar(stat = "identity") +
     xlab("States") +
    ylab("Health Care Coverage Ratio") +
    ggtitle("Health Care Access")

31 of 53 states have an health access ratio between 80 and 90 percent.

middle <- states %>%
    filter(ratio >= 0.8 & ratio < 0.9)
nrow(middle)

## [1] 31

ggplot(middle, aes(X_state, ratio, fill = X_state)) + geom_bar(stat = "identity") +
     xlab("States") +
    ylab("Health Care Coverage Ratio") +
    ggtitle("Health Care Access")

2 of 53 states have an health access ratio lower than 80 percent

below <- states %>%
    filter(ratio < 0.8)
nrow(below)

## [1] 2

ggplot(below, aes(X_state, ratio, fill = X_state)) + geom_bar(stat = "identity") +
    xlab("States") +
    ylab("Health Care Coverage Ratio") +
    ggtitle("Health Care Access")

We don’t notice a great difference across the states about health care access. 31 of the 53 states have a health care coverage ratio of 80 percent of higher, with Massachussets, DC and Puerto Rico being the only one with a ratio higher than 95 percent.

On the other hand, there are two states whose ratio is lower than 80 percents. Those states are Guam and Texas.

Question 2 : Is there an association between income level and health care access?

We start by selecting the variables we need for the analysis and we drop the missing values. The hlthpln1 variable tells us if the person has any health care coverage

income_hlth <- brfss2013 %>%
  select(income2, hlthpln1) %>%
  na.omit()

We plot the income level values and compare it with health care access.

ggplot(income_hlth, aes(income2, fill = hlthpln1)) + geom_bar() +
     xlab("Income Level") +
    ylab("") +
    ggtitle("Health Care Access by Income Level")

We saw earlier that the vast majority of people have an health care coverage. This graph confirms it and shows us as well than as the income level increases, people are more likely to have an health care.

Question 3 : Which behaviors impact health?

We’ll focus on exercise, and sleeping time.

We first start by selecting the three variables we need. The exerany2 variable describes people who exercised in the past 30 days

activity <- brfss2013 %>%
  select(exerany2, sleptim1, genhlth)

str(activity)

## 'data.frame':    491775 obs. of  3 variables:
##  $ exerany2: Factor w/ 2 levels "Yes","No": 2 1 2 1 2 1 1 1 1 1 ...
##  $ sleptim1: int  NA 6 9 8 6 8 7 6 8 8 ...
##  $ genhlth : Factor w/ 5 levels "Excellent","Very good",..: 4 3 3 2 3 2 4 3 1 3 ...

head(activity)

##   exerany2 sleptim1   genhlth
## 1       No       NA      Fair
## 2      Yes        6      Good
## 3       No        9      Good
## 4      Yes        8 Very good
## 5       No        6      Good
## 6      Yes        8 Very good

We want to check if there’re any missing values

missing <- sum(is.na(activity))
missing

## [1] 43401

There are 61060 missing values in our new dataframe and we want to see which variables contain the Nas

summary(activity)

##  exerany2         sleptim1            genhlth      
##  Yes :332464   Min.   :  0.000   Excellent: 85482  
##  No  :125282   1st Qu.:  6.000   Very good:159076  
##  NA's: 34029   Median :  7.000   Good     :150555  
##                Mean   :  7.052   Fair     : 66726  
##                3rd Qu.:  8.000   Poor     : 27951  
##                Max.   :450.000   NA's     :  1985  
##                NA's   :7387

We looked into the codebook to see the percentage of missing values in each variables and these are the different percentages. I gave the approximations. exerany2 7% sleptim1 1%

Since the missing values present in our dataset represent a small portion of our observations, we can delete them.

activity_clean <- na.omit(activity)
str(activity_clean)

## 'data.frame':    449742 obs. of  3 variables:
##  $ exerany2: Factor w/ 2 levels "Yes","No": 1 2 1 2 1 1 1 1 1 1 ...
##  $ sleptim1: int  6 9 8 6 8 7 6 8 8 6 ...
##  $ genhlth : Factor w/ 5 levels "Excellent","Very good",..: 3 3 2 3 2 4 3 1 3 3 ...
##  - attr(*, "na.action")= 'omit' Named int  1 36 42 53 54 70 71 128 161 176 ...
##   ..- attr(*, "names")= chr  "1" "36" "42" "53" ...

sum(is.na(activity_clean))

## [1] 0

Now that the data is clean, we can analyse it.

We visualize the people who exercised and those who didn’t.

table(activity_clean$exerany2)

## 
##    Yes     No 
## 328131 121611

ggplot(activity_clean, aes(exerany2, fill = exerany2)) + geom_bar() +
    xlab("People Exercising") +
    ylab("") +
    ggtitle("Exercise in Past 30 Days")

We then make a plot to see the relationship between general health and exercising.

exercising <- brfss2013 %>%
  select(genhlth, exerany2) %>%
  na.omit()

ggplot(exercising, aes(exerany2, fill = genhlth)) + geom_bar() +
    xlab("People Exercising") +
    ylab("") +
    ggtitle("Exercising and General Health")

We do the same thing for the sleep time. First we want to see the distribution of sleep time

summary(activity_clean$sleptim1)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    6.00    7.00    7.05    8.00   24.00

boxplot(activity_clean$sleptim1)

In the sleptim1 variable, there are unusual values. The variable refers to the average hours of sleep in a 24-period. So anything beyond 24 is unusual and will be deleted. We look at the distribution of the variable

activity_clean$sleptim1 <- activity_clean %>%
    filter(sleptim1 <= 24)

summary(activity_clean$sleptim1)

##  exerany2        sleptim1          genhlth      
##  Yes:328131   Min.   : 1.00   Excellent: 78803  
##  No :121611   1st Qu.: 6.00   Very good:148378  
##               Median : 7.00   Good     :137599  
##               Mean   : 7.05   Fair     : 60318  
##               3rd Qu.: 8.00   Poor     : 24644  
##               Max.   :24.00

boxplot(activity_clean$sleptim1)

We want to look for any association between general helath and sleeping time.

slep_hlth <- brfss2013 %>%
    select(genhlth, sleptim1) %>%
    na.omit() %>%
    filter(sleptim1 <= 24)

ggplot(slep_hlth, aes(sleptim1, fill = genhlth)) + geom_bar() +
    xlab("Sleeping Time") +
    ylab("") +
    ggtitle("General Health and Sleeping Time")

There’s no apparent association between sleeping time and general health. We expected the people who sleep the most to have a better health. This can be explained by the fact that the general health of the population is good. The following plot shows it

ggplot(slep_hlth, aes(genhlth, fill = genhlth)) + geom_bar() +
    xlab("General Health") +
    ylab("") +
    ggtitle("General Health")

CONCLUSION

In this analysis, we explored the BRFSS data from 2013 and tried to answer some questions related to the health of US adults (18+). We first looked at the potential differences across the different states. The results show that people, irrespective of the state, have in the vast majority access to the health care coverage.

Then we asked if there were any association between health care coverage and income level. The analysis yieled a negative result. There’s no apparent association between health care coverage and income level. However, we notice that as income level increases, there are more people with health care coverage.

Finally, we analyzed different behaviors related to health and looked at their impact to the general health.

Data Analysis Project

Lionel Ngoubou

9/28/2019

PART I : DATA

PART II : RESEARCH QUESTIONS

Question 1 : Across the states, is there any difference in health care access?

Question 2 : Is there an association between income level and health care access?

Question 3 : Which behaviors impact health?

PART III : EXPLORATORY DATA ANALYSIS

Question 1 : Across the states, is there any difference in health care access?

Question 2 : Is there an association between income level and health care access?

Question 3 : Which behaviors impact health?

CONCLUSION