Exploring the BRFSS_2013 Data

Objective: Data analysis of the dataset BRFSS2013 data for the evaluation of the final assignment as part of the statistics with R by Duke University

Part 1: Data

The Behavioral Risk Factor Surveillance System (BRFSS) is a collaborative project between all of the states in the United States (US) and participating US territories and the Centers for Disease Control and Prevention (CDC).

The BRFSS objective is to collect uniform, state-specific data on preventive health practices and risk behaviors that are linked to chronic diseases, injuries, and preventable infectious diseases that affect the adult population. Factors assessed by the BRFSS in 2013 include tobacco use, HIV/AIDS knowledge and prevention, exercise, immunization, health status, healthy days - health-related quality of life, health care access, inadequate sleep, hypertension awareness, cholesterol awareness, chronic health conditions, alcohol consumption, fruits and vegetables consumption, arthritis burden, and seatbelt use.

Since 2011, BRFSS conducts both landline telephone- and cellular telephone-based surveys. In conducting the BRFSS landline telephone survey, interviewers collect data from a randomly selected adult in a household. In conducting the cellular telephone version of the BRFSS questionnaire, interviewers collect data from an adult who participates by using a cellular telephone and resides in a private residence or college housing. Health characteristics estimated from the BRFSS pertain to the non-institutionalized adult population, aged 18 years or older, who reside in the US. In 2013, additional question sets were included as optional modules to provide a measure for several childhood health and wellness indicators, including asthma prevalence for people aged 17 years or younger.

Required packages

r = getOption("repos")
r["CRAN"] = "http://cran.us.r-project.org"
options(repos = r)
library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(rio)

Loading dataset

download.file("http://stat.duke.edu/~cr173/Sta102_Sp16/Proj/brfss2013.RData",  destfile="brfss2013.RData")
load("brfss2013.RData")

PART 2: Research questions

Q1: What is the health Status of the population by Age,Sex and activity ? Q2: What are different behaviours and habits which might cause poor health ? Q3: What is correlation between behaviours and life styles on the chronic health conditions ?

Part 3: Exploratory Data Analysis

hlthpmp <- select(brfss2013, physhlth, menthlth,poorhlth) %>% filter(!is.na(physhlth),!is.na(menthlth),!is.na(poorhlth))
hist(hlthpmp$physhlth, main="Physical Health", xlab="Physical Health",col=rainbow(5))

hist(hlthpmp$menthlth, main="Mental Health", xlab="Mental Health",col=rainbow(5))

Q2: Behaviours and their implications on the human health .How exercise,smoking and sleeping time causes obesity,diabetes,high blood pressure,high cholesterol level,cardiovascular disorder and depression in respondents. Plots show their association.Behaviours and Habits

Sol: Exercise can improves Health by reducing obesity,high blood pressure and cholesterol.

exer <- brfss2013 %>% dplyr::select(exerany2,bphigh4,sex) %>% dplyr::filter(bphigh4 %in% c("Yes","No"))
levels(exer$bphigh4) <- c("Yes, been told blood pressure high","Yes, but female told only during pregnancy","No, never been told blood pressure high","Yes, been told borderline or pre-hypertensive")
ggplot(exer,aes(x = exerany2, fill = sex)) + geom_bar() +facet_grid(~bphigh4) + coord_flip() + ggtitle(" Exercise vs High Blood Pressure")

Sleeping time and its implications on Health. Here a comparison is shown

sleep <- brfss2013 %>% select(sleptim1,poorhlth, physhlth)
sleep <- subset(sleep, sleptim1 >= 0 & sleptim1 < 25, select=c(sleptim1, poorhlth, physhlth))
sleep = within(sleep, {sleepstatus = ifelse(sleptim1 >= 7 & sleptim1 < 10, "opt", "non-op")})
sleep_poorhlth <- subset(sleep, poorhlth >= 0 & sleptim1 < 1000, select=c(sleptim1, poorhlth, sleepstatus))
sleep_physhlth <- subset(sleep, physhlth >= 0 & sleptim1 < 1000, select=c(sleptim1, physhlth, sleepstatus))
ggplot(data = sleep_poorhlth, aes(x =sleptim1, y= poorhlth)) + geom_point()

ggplot(data = sleep_physhlth, aes(x =sleptim1, y= physhlth)) + geom_point()

Poor Health is also associated with chronic disorders. These are Diabetes,Heart disease,High B.P,High cholesterol and depressive disorders.

disease <- select(brfss2013,genhlth,sex,X_age_g, X_rfbmi5,X_bmi5cat,diabete3,cvdinfr4,bphigh4,addepev2,exerany2,toldhi2,smoke100,sleptim1 ) %>%          filter(!is.na(genhlth),!is.na(sex),!is.na(exerany2),!is.na(X_age_g),!is.na(toldhi2),!is.na(smoke100),!is.na(diabete3),!is.na(addepev2),!is.na(cvdinfr4),!is.na(bphigh4),!is.na(X_rfbmi5),!is.na(X_bmi5cat),!is.na(sleptim1))

Diabetes Prevalence and General Health

table(disease$diabete3,disease$X_age_g)
##                                             
##                                              Age 18 to 24 Age 25 to 34
##   Yes                                                 152          713
##   Yes, but female told only during pregnancy           60          449
##   No                                                 8765        25044
##   No, pre-diabetes or borderline diabetes              61          185
##                                             
##                                              Age 35 to 44 Age 45 to 54
##   Yes                                                2325         6699
##   Yes, but female told only during pregnancy          854          697
##   No                                                38745        57530
##   No, pre-diabetes or borderline diabetes             458         1031
##                                             
##                                              Age 55 to 64 Age 65 or older
##   Yes                                               14056           27257
##   Yes, but female told only during pregnancy          488             557
##   No                                                72603          100907
##   No, pre-diabetes or borderline diabetes            1809            3347
diabete <- brfss2013 %>%
     filter(!is.na(diabete3),!is.na(genhlth)) %>%
     group_by(genhlth,diabete3) %>%
     summarise(n = n()) %>%
     mutate(pct_total_stacked = n/sum(n), 
     position_stacked = cumsum(pct_total_stacked)-0.5*pct_total_stacked,
     position_n = cumsum(n)-0.5*n)
## `summarise()` has grouped output by 'genhlth'. You can override using the `.groups` argument.
ggplot(diabete, aes(genhlth), y=n) +
    geom_bar(aes(fill = diabete3, weight = n),  width = .7, color="black") +
    geom_text(aes(label=n, y=position_n), color="white")

  ggplot(diabete, aes(x=genhlth, y=pct_total_stacked, fill=diabete3)) +
    geom_bar(stat='identity',  width = .7, color="black")+
    geom_text(aes(label=ifelse(diabete3 == 'Yes', paste0(sprintf("%.0f",                                                 pct_total_stacked*100),"%"),""), y=position_stacked), color="white") +
    coord_flip() +
    scale_y_continuous() +
    labs(y="", x="")

rm(diabete)

Q2: Stress and Depression are associated with work,environments and behaviours.The habits seem to trigger chronic disorders e.g; high blood pressure, diabetes and cardiovascular diseases.Correlation to be calculated under Question 3.

prop.table(ftable(table(disease$addepev2,disease$X_age_g)))*100
##      Age 18 to 24 Age 25 to 34 Age 35 to 44 Age 45 to 54 Age 55 to 64 Age 65 or older
##                                                                                      
## Yes      0.436139     1.444933     2.423025     4.241869     5.863067        5.772331
## No       2.041437     5.789601     9.195103    13.838845    18.522336       30.431314
qplot(disease$addepev2)

qplot(X_age_g, data = disease, fill=addepev2,xlab="Depression",ylab ="Prevalence",main ="Heart Disease vs age")  

ggplot(disease,aes(x = addepev2, fill = diabete3)) + geom_bar() +facet_grid(~addepev2) + coord_flip() +              ggtitle(" Depression Vs Diabetes disease") 

Q3: Analyzing Association of behavioural outcomes with the causes by Linear Regression.

vars <- names(brfss2013) %in% c("sleptim1","cvdinfr4","addepev2","diabete3","bphigh4","toldhi2",
                                                                  "smoke100","X_rfbmi5")
hlthsub2 <- brfss2013[vars] 
names(hlthsub2)
## [1] "sleptim1" "bphigh4"  "toldhi2"  "cvdinfr4" "addepev2" "diabete3" "smoke100"
## [8] "X_rfbmi5"
MissingData <- function(x){sum(is.na(x))/length(x)*100}
apply(hlthsub2, 2, MissingData)
##   sleptim1    bphigh4    toldhi2   cvdinfr4   addepev2   diabete3   smoke100 
##  1.5021097  0.2887499 14.5721112  0.5260536  0.4654568  0.1691831  3.0339078 
##   X_rfbmi5 
##  5.4352092
summary(hlthsub2)
##     sleptim1                                             bphigh4      
##  Min.   :  0.000   Yes                                       :198921  
##  1st Qu.:  6.000   Yes, but female told only during pregnancy:  3680  
##  Median :  7.000   No                                        :282687  
##  Mean   :  7.052   Told borderline or pre-hypertensive       :  5067  
##  3rd Qu.:  8.000   NA's                                      :  1420  
##  Max.   :450.000                                                      
##  NA's   :7387                                                         
##  toldhi2       cvdinfr4      addepev2     
##  Yes :183501   Yes : 29284   Yes : 95779  
##  No  :236612   No  :459904   No  :393707  
##  NA's: 71662   NA's:  2587   NA's:  2289  
##                                           
##                                           
##                                           
##                                           
##                                        diabete3      smoke100     
##  Yes                                       : 62363   Yes :215201  
##  Yes, but female told only during pregnancy:  4602   No  :261654  
##  No                                        :415374   NA's: 14920  
##  No, pre-diabetes or borderline diabetes   :  8604                
##  NA's                                      :   832                
##                                                                   
##                                                                   
##  X_rfbmi5     
##  No  :163161  
##  Yes :301885  
##  NA's: 26729  
##               
##               
##               
## 
summary(hlthsub2$X_rfbmi5)
##     No    Yes   NA's 
## 163161 301885  26729
hlthsub2$X_rfbmi5 <- replace(hlthsub2$X_rfbmi5, which(is.na(hlthsub2$X_rfbmi5)), "Yes")
summary(hlthsub2$X_rfbmi5)
##     No    Yes 
## 163161 328614
summary(hlthsub2$smoke100)
##    Yes     No   NA's 
## 215201 261654  14920
hlthsub2$addepev2 <- replace(hlthsub2$addepev2, which(is.na(hlthsub2$addepev2)), "Yes")
summary(hlthsub2$addepev2)
##    Yes     No 
##  98068 393707
hlthsub2$diabete3 <- replace(hlthsub2$diabete3, which(is.na(hlthsub2$diabete3)), "Yes")
summary(hlthsub2$diabete3)
##                                        Yes 
##                                      63195 
## Yes, but female told only during pregnancy 
##                                       4602 
##                                         No 
##                                     415374 
##    No, pre-diabetes or borderline diabetes 
##                                       8604
hlthsub2$bphigh4 <- replace(hlthsub2$bphigh4, which(is.na(hlthsub2$bphigh4)), "No")
summary(hlthsub2$bphigh4)
##                                        Yes 
##                                     198921 
## Yes, but female told only during pregnancy 
##                                       3680 
##                                         No 
##                                     284107 
##        Told borderline or pre-hypertensive 
##                                       5067
summary(hlthsub2$sleptim1)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   6.000   7.000   7.052   8.000 450.000    7387
mean(hlthsub2$sleptim1,na.rm = T)
## [1] 7.052099
hlthsub2$sleptim1 <- replace(hlthsub2$sleptim1, which(is.na(hlthsub2$sleptim1)), 7)
summary(hlthsub2$sleptim1)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   6.000   7.000   7.051   8.000 450.000
summary(hlthsub2$toldhi2)
##    Yes     No   NA's 
## 183501 236612  71662
hlthsub2$toldhi2 <- replace(hlthsub2$toldhi2, which(is.na(hlthsub2$toldhi2)), "No")
summary(hlthsub2$toldhi2)
##    Yes     No 
## 183501 308274
summary(hlthsub2$cvdinfr4)
##    Yes     No   NA's 
##  29284 459904   2587
hlthsub2$cvdinfr4 <- replace(hlthsub2$cvdinfr4, which(is.na(hlthsub2$cvdinfr4)), "Yes")
summary(hlthsub2$cvdinfr4)
##    Yes     No 
##  31871 459904
hlthsub2$cvdinfr4 <- replace(hlthsub2$cvdinfr4, which(is.na(hlthsub2$cvdinfr4)), "Yes")
summary(hlthsub2$cvdinfr4)
##    Yes     No 
##  31871 459904
hlthsub2$cvdinfr4 <- replace(hlthsub2$cvdinfr4, which(is.na(hlthsub2$cvdinfr4)), "Yes")
summary(hlthsub2$cvdinfr4)
##    Yes     No 
##  31871 459904
#install.packages("Hmisc")
#library(Hmisc)

Converting factors to numerics for analysis.

hlthsub2$addepev2 <- ifelse(hlthsub2$addepev2=="Yes", 1, 0)
hlthsub2$cvdinfr4 <- ifelse(hlthsub2$cvdinfr4=="Yes", 1, 0)
hlthsub2$smoke100 <- ifelse(hlthsub2$smoke100=="Yes", 1, 0)
hlthsub2$diabete3 <- ifelse(hlthsub2$diabete3=="Yes", 1, 0)
hlthsub2$toldhi2 <- ifelse(hlthsub2$toldhi2 =="Yes", 1, 0)
hlthsub2$bphigh4 <- ifelse(hlthsub2$bphigh4=="Yes", 1, 0)
hlthsub2$X_rfbmi5 <- ifelse(hlthsub2$X_rfbmi5=="Yes", 1, 0)

Checking Missing Values

MissingData <- function(x){sum(is.na(x))/length(x)*100}
apply(hlthsub2, 2, MissingData)
## sleptim1  bphigh4  toldhi2 cvdinfr4 addepev2 diabete3 smoke100 X_rfbmi5 
## 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 3.033908 0.000000

Q3: Finding correlation and plotting a corrplot.Fitting model by logistic regression using binomial method. Plotting correlations.

#install.packages("corrplot")
library(corrplot)
## corrplot 0.88 loaded
M <- cor(hlthsub2)
corrplot(M, method="number",pch=23,col=rainbow(7))

summary(M)
##     sleptim1            bphigh4            toldhi2            cvdinfr4       
##  Min.   :-0.051986   Min.   :0.002993   Min.   :0.003714   Min.   :0.001512  
##  1st Qu.:-0.013410   1st Qu.:0.124331   1st Qu.:0.112266   1st Qu.:0.049895  
##  Median : 0.002993   Median :0.186748   Median :0.150813   Median :0.150813  
##  Mean   : 0.133140   Mean   :0.290694   Mean   :0.273895   Mean   :0.226881  
##  3rd Qu.: 0.003897   3rd Qu.:0.298228   3rd Qu.:0.269103   3rd Qu.:0.168027  
##  Max.   : 1.000000   Max.   :1.000000   Max.   :1.000000   Max.   :1.000000  
##  NA's   :1           NA's   :1          NA's   :1          NA's   :1         
##     addepev2           diabete3          smoke100    X_rfbmi5       
##  Min.   :-0.05199   Min.   :0.00408   Min.   :1   Min.   :-0.02833  
##  1st Qu.: 0.05372   1st Qu.:0.11515   1st Qu.:1   1st Qu.: 0.04430  
##  Median : 0.07234   Median :0.15973   Median :1   Median : 0.12663  
##  Mean   : 0.18640   Mean   :0.26852   Mean   :1   Mean   : 0.21784  
##  3rd Qu.: 0.08849   3rd Qu.:0.24278   3rd Qu.:1   3rd Qu.: 0.16899  
##  Max.   : 1.00000   Max.   :1.00000   Max.   :1   Max.   : 1.00000  
##  NA's   :1          NA's   :1         NA's   :7   NA's   :1

Correlation and Fitting of the Models,cardio,depression and diabetes with other variables. A plot of a fit model for cardiovascular disorder against all variables is drawn.

plot(glm(formula = cvdinfr4 ~ ., family = "binomial", data = hlthsub2))

END