Exploring the BRFSS_2013 Data
Objective: Data analysis of the dataset BRFSS2013 data for the evaluation of the final assignment as part of the statistics with R by Duke University
Part 1: Data
The Behavioral Risk Factor Surveillance System (BRFSS) is a collaborative project between all of the states in the United States (US) and participating US territories and the Centers for Disease Control and Prevention (CDC).
The BRFSS objective is to collect uniform, state-specific data on preventive health practices and risk behaviors that are linked to chronic diseases, injuries, and preventable infectious diseases that affect the adult population. Factors assessed by the BRFSS in 2013 include tobacco use, HIV/AIDS knowledge and prevention, exercise, immunization, health status, healthy days - health-related quality of life, health care access, inadequate sleep, hypertension awareness, cholesterol awareness, chronic health conditions, alcohol consumption, fruits and vegetables consumption, arthritis burden, and seatbelt use.
Since 2011, BRFSS conducts both landline telephone- and cellular telephone-based surveys. In conducting the BRFSS landline telephone survey, interviewers collect data from a randomly selected adult in a household. In conducting the cellular telephone version of the BRFSS questionnaire, interviewers collect data from an adult who participates by using a cellular telephone and resides in a private residence or college housing. Health characteristics estimated from the BRFSS pertain to the non-institutionalized adult population, aged 18 years or older, who reside in the US. In 2013, additional question sets were included as optional modules to provide a measure for several childhood health and wellness indicators, including asthma prevalence for people aged 17 years or younger.
Required packages
r = getOption("repos")
r["CRAN"] = "http://cran.us.r-project.org"
options(repos = r)
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(rio)
download.file("http://stat.duke.edu/~cr173/Sta102_Sp16/Proj/brfss2013.RData", destfile="brfss2013.RData")
load("brfss2013.RData")
PART 2: Research questions
Q1: What is the health Status of the population by Age,Sex and activity ? Q2: What are different behaviours and habits which might cause poor health ? Q3: What is correlation between behaviours and life styles on the chronic health conditions ?
Part 3: Exploratory Data Analysis
hlthpmp <- select(brfss2013, physhlth, menthlth,poorhlth) %>% filter(!is.na(physhlth),!is.na(menthlth),!is.na(poorhlth))
hist(hlthpmp$physhlth, main="Physical Health", xlab="Physical Health",col=rainbow(5))
hist(hlthpmp$menthlth, main="Mental Health", xlab="Mental Health",col=rainbow(5))
Q2: Behaviours and their implications on the human health .How exercise,smoking and sleeping time causes obesity,diabetes,high blood pressure,high cholesterol level,cardiovascular disorder and depression in respondents. Plots show their association.Behaviours and Habits
Sol: Exercise can improves Health by reducing obesity,high blood pressure and cholesterol.
exer <- brfss2013 %>% dplyr::select(exerany2,bphigh4,sex) %>% dplyr::filter(bphigh4 %in% c("Yes","No"))
levels(exer$bphigh4) <- c("Yes, been told blood pressure high","Yes, but female told only during pregnancy","No, never been told blood pressure high","Yes, been told borderline or pre-hypertensive")
ggplot(exer,aes(x = exerany2, fill = sex)) + geom_bar() +facet_grid(~bphigh4) + coord_flip() + ggtitle(" Exercise vs High Blood Pressure")
Sleeping time and its implications on Health. Here a comparison is shown
sleep <- brfss2013 %>% select(sleptim1,poorhlth, physhlth)
sleep <- subset(sleep, sleptim1 >= 0 & sleptim1 < 25, select=c(sleptim1, poorhlth, physhlth))
sleep = within(sleep, {sleepstatus = ifelse(sleptim1 >= 7 & sleptim1 < 10, "opt", "non-op")})
sleep_poorhlth <- subset(sleep, poorhlth >= 0 & sleptim1 < 1000, select=c(sleptim1, poorhlth, sleepstatus))
sleep_physhlth <- subset(sleep, physhlth >= 0 & sleptim1 < 1000, select=c(sleptim1, physhlth, sleepstatus))
ggplot(data = sleep_poorhlth, aes(x =sleptim1, y= poorhlth)) + geom_point()
ggplot(data = sleep_physhlth, aes(x =sleptim1, y= physhlth)) + geom_point()
Poor Health is also associated with chronic disorders. These are Diabetes,Heart disease,High B.P,High cholesterol and depressive disorders.
disease <- select(brfss2013,genhlth,sex,X_age_g, X_rfbmi5,X_bmi5cat,diabete3,cvdinfr4,bphigh4,addepev2,exerany2,toldhi2,smoke100,sleptim1 ) %>% filter(!is.na(genhlth),!is.na(sex),!is.na(exerany2),!is.na(X_age_g),!is.na(toldhi2),!is.na(smoke100),!is.na(diabete3),!is.na(addepev2),!is.na(cvdinfr4),!is.na(bphigh4),!is.na(X_rfbmi5),!is.na(X_bmi5cat),!is.na(sleptim1))
Diabetes Prevalence and General Health
table(disease$diabete3,disease$X_age_g)
##
## Age 18 to 24 Age 25 to 34
## Yes 152 713
## Yes, but female told only during pregnancy 60 449
## No 8765 25044
## No, pre-diabetes or borderline diabetes 61 185
##
## Age 35 to 44 Age 45 to 54
## Yes 2325 6699
## Yes, but female told only during pregnancy 854 697
## No 38745 57530
## No, pre-diabetes or borderline diabetes 458 1031
##
## Age 55 to 64 Age 65 or older
## Yes 14056 27257
## Yes, but female told only during pregnancy 488 557
## No 72603 100907
## No, pre-diabetes or borderline diabetes 1809 3347
diabete <- brfss2013 %>%
filter(!is.na(diabete3),!is.na(genhlth)) %>%
group_by(genhlth,diabete3) %>%
summarise(n = n()) %>%
mutate(pct_total_stacked = n/sum(n),
position_stacked = cumsum(pct_total_stacked)-0.5*pct_total_stacked,
position_n = cumsum(n)-0.5*n)
## `summarise()` has grouped output by 'genhlth'. You can override using the `.groups` argument.
ggplot(diabete, aes(genhlth), y=n) +
geom_bar(aes(fill = diabete3, weight = n), width = .7, color="black") +
geom_text(aes(label=n, y=position_n), color="white")
ggplot(diabete, aes(x=genhlth, y=pct_total_stacked, fill=diabete3)) +
geom_bar(stat='identity', width = .7, color="black")+
geom_text(aes(label=ifelse(diabete3 == 'Yes', paste0(sprintf("%.0f", pct_total_stacked*100),"%"),""), y=position_stacked), color="white") +
coord_flip() +
scale_y_continuous() +
labs(y="", x="")
rm(diabete)
prop.table(ftable(table(disease$addepev2,disease$X_age_g)))*100
## Age 18 to 24 Age 25 to 34 Age 35 to 44 Age 45 to 54 Age 55 to 64 Age 65 or older
##
## Yes 0.436139 1.444933 2.423025 4.241869 5.863067 5.772331
## No 2.041437 5.789601 9.195103 13.838845 18.522336 30.431314
qplot(disease$addepev2)
qplot(X_age_g, data = disease, fill=addepev2,xlab="Depression",ylab ="Prevalence",main ="Heart Disease vs age")
ggplot(disease,aes(x = addepev2, fill = diabete3)) + geom_bar() +facet_grid(~addepev2) + coord_flip() + ggtitle(" Depression Vs Diabetes disease")
Q3: Analyzing Association of behavioural outcomes with the causes by Linear Regression.
vars <- names(brfss2013) %in% c("sleptim1","cvdinfr4","addepev2","diabete3","bphigh4","toldhi2",
"smoke100","X_rfbmi5")
hlthsub2 <- brfss2013[vars]
names(hlthsub2)
## [1] "sleptim1" "bphigh4" "toldhi2" "cvdinfr4" "addepev2" "diabete3" "smoke100"
## [8] "X_rfbmi5"
MissingData <- function(x){sum(is.na(x))/length(x)*100}
apply(hlthsub2, 2, MissingData)
## sleptim1 bphigh4 toldhi2 cvdinfr4 addepev2 diabete3 smoke100
## 1.5021097 0.2887499 14.5721112 0.5260536 0.4654568 0.1691831 3.0339078
## X_rfbmi5
## 5.4352092
summary(hlthsub2)
## sleptim1 bphigh4
## Min. : 0.000 Yes :198921
## 1st Qu.: 6.000 Yes, but female told only during pregnancy: 3680
## Median : 7.000 No :282687
## Mean : 7.052 Told borderline or pre-hypertensive : 5067
## 3rd Qu.: 8.000 NA's : 1420
## Max. :450.000
## NA's :7387
## toldhi2 cvdinfr4 addepev2
## Yes :183501 Yes : 29284 Yes : 95779
## No :236612 No :459904 No :393707
## NA's: 71662 NA's: 2587 NA's: 2289
##
##
##
##
## diabete3 smoke100
## Yes : 62363 Yes :215201
## Yes, but female told only during pregnancy: 4602 No :261654
## No :415374 NA's: 14920
## No, pre-diabetes or borderline diabetes : 8604
## NA's : 832
##
##
## X_rfbmi5
## No :163161
## Yes :301885
## NA's: 26729
##
##
##
##
summary(hlthsub2$X_rfbmi5)
## No Yes NA's
## 163161 301885 26729
hlthsub2$X_rfbmi5 <- replace(hlthsub2$X_rfbmi5, which(is.na(hlthsub2$X_rfbmi5)), "Yes")
summary(hlthsub2$X_rfbmi5)
## No Yes
## 163161 328614
summary(hlthsub2$smoke100)
## Yes No NA's
## 215201 261654 14920
hlthsub2$addepev2 <- replace(hlthsub2$addepev2, which(is.na(hlthsub2$addepev2)), "Yes")
summary(hlthsub2$addepev2)
## Yes No
## 98068 393707
hlthsub2$diabete3 <- replace(hlthsub2$diabete3, which(is.na(hlthsub2$diabete3)), "Yes")
summary(hlthsub2$diabete3)
## Yes
## 63195
## Yes, but female told only during pregnancy
## 4602
## No
## 415374
## No, pre-diabetes or borderline diabetes
## 8604
hlthsub2$bphigh4 <- replace(hlthsub2$bphigh4, which(is.na(hlthsub2$bphigh4)), "No")
summary(hlthsub2$bphigh4)
## Yes
## 198921
## Yes, but female told only during pregnancy
## 3680
## No
## 284107
## Told borderline or pre-hypertensive
## 5067
summary(hlthsub2$sleptim1)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 6.000 7.000 7.052 8.000 450.000 7387
mean(hlthsub2$sleptim1,na.rm = T)
## [1] 7.052099
hlthsub2$sleptim1 <- replace(hlthsub2$sleptim1, which(is.na(hlthsub2$sleptim1)), 7)
summary(hlthsub2$sleptim1)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 6.000 7.000 7.051 8.000 450.000
summary(hlthsub2$toldhi2)
## Yes No NA's
## 183501 236612 71662
hlthsub2$toldhi2 <- replace(hlthsub2$toldhi2, which(is.na(hlthsub2$toldhi2)), "No")
summary(hlthsub2$toldhi2)
## Yes No
## 183501 308274
summary(hlthsub2$cvdinfr4)
## Yes No NA's
## 29284 459904 2587
hlthsub2$cvdinfr4 <- replace(hlthsub2$cvdinfr4, which(is.na(hlthsub2$cvdinfr4)), "Yes")
summary(hlthsub2$cvdinfr4)
## Yes No
## 31871 459904
hlthsub2$cvdinfr4 <- replace(hlthsub2$cvdinfr4, which(is.na(hlthsub2$cvdinfr4)), "Yes")
summary(hlthsub2$cvdinfr4)
## Yes No
## 31871 459904
hlthsub2$cvdinfr4 <- replace(hlthsub2$cvdinfr4, which(is.na(hlthsub2$cvdinfr4)), "Yes")
summary(hlthsub2$cvdinfr4)
## Yes No
## 31871 459904
#install.packages("Hmisc")
#library(Hmisc)
Converting factors to numerics for analysis.
hlthsub2$addepev2 <- ifelse(hlthsub2$addepev2=="Yes", 1, 0)
hlthsub2$cvdinfr4 <- ifelse(hlthsub2$cvdinfr4=="Yes", 1, 0)
hlthsub2$smoke100 <- ifelse(hlthsub2$smoke100=="Yes", 1, 0)
hlthsub2$diabete3 <- ifelse(hlthsub2$diabete3=="Yes", 1, 0)
hlthsub2$toldhi2 <- ifelse(hlthsub2$toldhi2 =="Yes", 1, 0)
hlthsub2$bphigh4 <- ifelse(hlthsub2$bphigh4=="Yes", 1, 0)
hlthsub2$X_rfbmi5 <- ifelse(hlthsub2$X_rfbmi5=="Yes", 1, 0)
Checking Missing Values
MissingData <- function(x){sum(is.na(x))/length(x)*100}
apply(hlthsub2, 2, MissingData)
## sleptim1 bphigh4 toldhi2 cvdinfr4 addepev2 diabete3 smoke100 X_rfbmi5
## 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 3.033908 0.000000
Q3: Finding correlation and plotting a corrplot.Fitting model by logistic regression using binomial method. Plotting correlations.
#install.packages("corrplot")
library(corrplot)
## corrplot 0.88 loaded
M <- cor(hlthsub2)
corrplot(M, method="number",pch=23,col=rainbow(7))
summary(M)
## sleptim1 bphigh4 toldhi2 cvdinfr4
## Min. :-0.051986 Min. :0.002993 Min. :0.003714 Min. :0.001512
## 1st Qu.:-0.013410 1st Qu.:0.124331 1st Qu.:0.112266 1st Qu.:0.049895
## Median : 0.002993 Median :0.186748 Median :0.150813 Median :0.150813
## Mean : 0.133140 Mean :0.290694 Mean :0.273895 Mean :0.226881
## 3rd Qu.: 0.003897 3rd Qu.:0.298228 3rd Qu.:0.269103 3rd Qu.:0.168027
## Max. : 1.000000 Max. :1.000000 Max. :1.000000 Max. :1.000000
## NA's :1 NA's :1 NA's :1 NA's :1
## addepev2 diabete3 smoke100 X_rfbmi5
## Min. :-0.05199 Min. :0.00408 Min. :1 Min. :-0.02833
## 1st Qu.: 0.05372 1st Qu.:0.11515 1st Qu.:1 1st Qu.: 0.04430
## Median : 0.07234 Median :0.15973 Median :1 Median : 0.12663
## Mean : 0.18640 Mean :0.26852 Mean :1 Mean : 0.21784
## 3rd Qu.: 0.08849 3rd Qu.:0.24278 3rd Qu.:1 3rd Qu.: 0.16899
## Max. : 1.00000 Max. :1.00000 Max. :1 Max. : 1.00000
## NA's :1 NA's :1 NA's :7 NA's :1
Correlation and Fitting of the Models,cardio,depression and diabetes with other variables. A plot of a fit model for cardiovascular disorder against all variables is drawn.
plot(glm(formula = cvdinfr4 ~ ., family = "binomial", data = hlthsub2))