library(ggplot2)
library(dplyr)load("F:/COURSERA COURSES/Intro to Probability/brfss2013.RData")The Behavioral Risk Factor Surveillance System (BRFSS) is a collaborative project between all of the states in the United States (US) and participating US territories and the Centers for Disease Control and Prevention (CDC). The BRFSS is administered and supported by CDC's Population Health Surveillance Branch, under the Division of Population Health at the National Center for Chronic Disease Prevention and Health Promotion.
BRFSS is an ongoing surveillance system designed to measure behavioral risk factors for the non-institutionalized adult population (18 years of age and older) residing in the US. The BRFSS was initiated in 1984, with 15 states collecting surveillance data on risk behaviors through monthly telephone interviews.
Over time, the number of states participating in the survey increased; by 2001, 50 states, the District of Columbia, Puerto Rico, Guam, and the US Virgin Islands were participating in the BRFSS. Today, all 50 states, the District of Columbia, Puerto Rico, and Guam collect data annually and American Samoa, Federated States of Micronesia, and Palau collect survey data over a limited point- in-time (usually one to three months). In this document, the term "state" is used to refer to all areas participating in BRFSS, including the District of Columbia, Guam, and the Commonwealth of Puerto Rico.
The BRFSS objective is to collect uniform, state-specific data on preventive health practices and risk behaviors that are linked to chronic diseases, injuries, and preventable infectious diseases that affect the adult population. Factors assessed by the BRFSS in 2013 include tobacco use, HIV/AIDS knowledge and prevention, exercise, immunization, health status, healthy days - health-related quality of life, health care access, inadequate sleep, hypertension awareness, cholesterol awareness, chronic health conditions, alcohol consumption, fruits and vegetables consumption, arthritis burden, and seatbelt use. Since 2011, BRFSS conducts both landline telephone- and cellular telephone-based surveys. In conducting the BRFSS landline telephone survey, interviewers collect data from a randomly selected adult in a household. In conducting the cellular telephone version of the BRFSS questionnaire, interviewers collect data from an adult who participates by using a cellular telephone and resides in a private residence or college housing.
Health characteristics estimated from the BRFSS pertain to the non-institutionalized adult population, aged 18 years or older, who reside in the US. In 2013, additional question sets were included as optional modules to provide a measure for several childhood health and wellness indicators, including asthma prevalence for people aged 17 years or younger.
The dataset is provided in both Stata (.dta) and R Workspace (.Rdata) formats. Categorical values are factors in the R workspace, and value labels are attached in the Stata version (except when a categorical variable contains more than 50 categories.
All missing values are coded NA in the R Workspace. For Stata, missing values are coded missing using the following codes: . (numbers) or empty text field: BLANK; .a: Don't know/Not sure; .b: Refused; .c: Zero (none); .d: Don't know/Not Sure Or Refused/Missing
Many variables, such as age, race, education, as well as variables that measure counts of events (drinks, times eating fruit, etc.) have alternate versions in the Calculated Variables section of the dataset. Review this section prior to choosing variables for analysis.
The skip logic used in the survey is contained in the notes section for each variable where present.
Research quesion 1:
First I want to explore from Main Survey - Section 2 - Healthy Days
Health-Related Quality of Life
Variables will be used are:
physhlth: Number Of Days Physical Health Not Good
menthlth: Number Of Days Mental Health Not Good
poorhlth: Poor Physical Or Mental Health
My research question will be:
What is the distributon of health conditions.
Research quesion 2:
Second, I would like to see if there exists any relation between how many hours one sleeps and certain health conditions such as Heart Attack, Depressive Disorder etc.
Variables will be used are:
sleptim1: How Much Time Do You Sleep
cvdinfr4: Ever Diagnosed With Heart Attack
addepev2: Ever Told You Had A Depressive Disorder
What I will do is:
Run an initial correlation analysis among these variables.
Research quesion 3:
Lastly, I would like to see if health condition such as Heart Attack can be predicted using sleep time, Cholesterol, blood pressure etc.
Variables I will be using for that are:
cvdinfr4: Ever Diagnosed With Heart Attack
bloodcho: Ever Had Blood Cholesterol Checked
cholchk: How Long Since Cholesterol Checked
toldhi2: Ever Told Blood Cholesterol High
bphigh4: Ever Told Blood Pressure High
sleptim1: How Much Time Do You Sleep
What I will do is:
Run a basic generalized linear regression analysis among these variables.
Research quesion 1:
# op <- par(mfrow = c(2, 2))
hist(brfss2013$physhlth, main="Histogram-Physical Health", xlab="Physical Health")hist(brfss2013$menthlth, main="Histogram-Mental Health", xlab="Mental Health")hist(brfss2013$poorhlth, main="Histogram-Health", xlab="Poor Physical Or Mental Health")# par(op)Research quesion 2:
# Create a subset of data
vars <- names(brfss2013) %in% c("sleptim1", "cvdinfr4", "addepev2")
subdata <- brfss2013[vars]
# make a backup
subdata1 <- subdata
# conver factor levels into numeric levels
subdata1$addepev2 <- ifelse(subdata$addepev2=="Yes", 1, 0)
subdata1$cvdinfr4 <- ifelse(subdata$cvdinfr4=="Yes", 1, 0)
# remove rows containing NAs
library(Hmisc)## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
##
## combine, src, summarize
## The following objects are masked from 'package:base':
##
## format.pval, round.POSIXt, trunc.POSIXt, units
subdata1 <- na.delete(subdata1)
# find correlation
cor(subdata1)## sleptim1 cvdinfr4 addepev2
## sleptim1 1.0000000000 0.0002048415 -0.05743056
## cvdinfr4 0.0002048415 1.0000000000 0.05267562
## addepev2 -0.0574305638 0.0526756160 1.00000000
Plotting correlations
library(corrplot)
M <- cor(subdata1)
corrplot(M, method="ellipse")From the above results we can infer that:
1. Sleep time and Depressive Disorder has negative correlation, which mean if one sleeps less, chances for Depressive Disorder go high (this is not a causation, but just an initial inference).
2. Sleep time and Ever Diagnosed With Heart Attack shows almost no relation among them (corerlation is almost zero)
This can further be explored in subsequent in depth analysis.
Research quesion 3:
Data Prepration
# Create a subset of data
vars <- names(brfss2013) %in% c("cvdinfr4", "bloodcho", "toldhi2", "bphigh4", "sleptim1")
subdata <- brfss2013[vars]How Many NAs in each column of the above subdata
MissingData <- function(x){sum(is.na(x))/length(x)*100}
apply(subdata, 2, MissingData)## sleptim1 bphigh4 bloodcho toldhi2 cvdinfr4
## 1.5021097 0.2887499 1.8321387 14.5721112 0.5260536
# NAs encountered
summary(subdata$bloodcho)## Yes No NA's
## 423868 58897 9010
# Replace NA with "No"s
subdata$bloodcho <- replace(subdata$bloodcho, which(is.na(subdata$bloodcho)), "No")
#check replacement
summary(subdata$bloodcho)## Yes No
## 423868 67907
summary(subdata$bphigh4)## Yes
## 198921
## Yes, but female told only during pregnancy
## 3680
## No
## 282687
## Told borderline or pre-hypertensive
## 5067
## NA's
## 1420
# Replace NA with "No"s
subdata$bphigh4 <- replace(subdata$bphigh4, which(is.na(subdata$bphigh4)), "No")
#check replacement
summary(subdata$bphigh4)## Yes
## 198921
## Yes, but female told only during pregnancy
## 3680
## No
## 284107
## Told borderline or pre-hypertensive
## 5067
# Sleep time NA to be replaced with mean of the data
summary(subdata$sleptim1)## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 6.000 7.000 7.052 8.000 450.000 7387
mean(subdata$sleptim1,na.rm = T)## [1] 7.052099
subdata$sleptim1 <- replace(subdata$sleptim1, which(is.na(subdata$sleptim1)), 7)
#check replacement
summary(subdata$sleptim1)## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 6.000 7.000 7.051 8.000 450.000
summary(subdata$toldhi2)## Yes No NA's
## 183501 236612 71662
subdata$toldhi2 <- replace(subdata$toldhi2, which(is.na(subdata$toldhi2)), "No")
#check replacement
summary(subdata$toldhi2)## Yes No
## 183501 308274
summary(subdata$cvdinfr4)## Yes No NA's
## 29284 459904 2587
subdata$cvdinfr4 <- replace(subdata$cvdinfr4, which(is.na(subdata$cvdinfr4)), "No")
#check replacement
summary(subdata$cvdinfr4)## Yes No
## 29284 462491
Now, let's look into the High BP data
# Make a copy of the data
subdata1 <- subdata
# table the data
table(subdata1$bphigh4)##
## Yes
## 198921
## Yes, but female told only during pregnancy
## 3680
## No
## 284107
## Told borderline or pre-hypertensive
## 5067
So we see that there are some other factor levels also i.e.
Yes, but female told only during pregnancy - 2647 cases
Told borderline or pre-hypertensive - 4467 cases
I will treat both the cases as "YES" (excuse me). why so, because I guess they are High BP cases, almost.
Therefore, let's prepare data again.
# COnverting factor levels into proper factor levels
subdata1$bphigh4 <- as.factor(ifelse(subdata1$bphigh4=="Yes", "Yes",
(ifelse(subdata1$bphigh4=="Yes, but female told only during pregnancy", "Yes",
(ifelse(subdata1$bphigh4=="Told borderline or pre-hypertensive", "Yes",
"No"))))))
# table the data
summary(subdata1$bphigh4)## No Yes
## 284107 207668
# check class of sub dataset
class(subdata1)## [1] "data.frame"
Now we will fit the data for logistic regression using binomial method
# summarize data
summary(subdata1)## sleptim1 bphigh4 bloodcho toldhi2 cvdinfr4
## Min. : 0.000 No :284107 Yes:423868 Yes:183501 Yes: 29284
## 1st Qu.: 6.000 Yes:207668 No : 67907 No :308274 No :462491
## Median : 7.000
## Mean : 7.051
## 3rd Qu.: 8.000
## Max. :450.000
# Fit the logistic regression
fit <- glm(cvdinfr4 ~ ., data=subdata1, family = "binomial")Summarize the model
summary(fit)##
## Call:
## glm(formula = cvdinfr4 ~ ., family = "binomial", data = subdata1)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.9987 0.2045 0.2749 0.3702 0.5719
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.031227 0.029578 102.481 <2e-16 ***
## sleptim1 -0.004202 0.003677 -1.143 0.253
## bphigh4Yes -1.202519 0.014496 -82.957 <2e-16 ***
## bloodchoNo 0.606773 0.032612 18.606 <2e-16 ***
## toldhi2No 0.850998 0.013786 61.729 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 222008 on 491774 degrees of freedom
## Residual deviance: 202329 on 491770 degrees of freedom
## AIC: 202339
##
## Number of Fisher Scoring iterations: 6
```
So from the summary we can infer that the model says the sleeping time is insignificant in predicting heart attack conditions while other variables are highly significat.
``