BRFSS 2013 Exploratory Analysis

    knitr::opts_chunk$set(error = TRUE, tidy = FALSE, size = "small")
Load packages

    set.seed(123)
    library(ggplot2)
    library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

BRFSS 2013: SELF-REPORTED SURVEILLANCE SURVEY ANALYSIS
Part 1: About Data  

The Behavioral Risk Factor Surveillance System (BRFSS) is a collaborative project between all of the states in
the United States (US), participating US territories and the Centers for Disease Control and Prevention
(CDC). The BRFSS is administered and supported by CDC's Population Health Surveillance Branch, under the Division
of Population Health at the National Center for Chronic Disease Prevention and Health Promotion.

BRFSS is an ongoing surveillance system designed to measure behavioral risk factors for the non-institutionalized adult population (18 years of age and older) residing in the US. Because the answers are self-reported, bias can exist in the data. Even so, this data source is
considered valid and generalizable. Its reliability is around 75% and compares well with other surveys of a similar nature.

The BRFSS objective is to collect uniform, state-specific data. Factors assessed include tobacco use, HIV/AIDS
knowledge and prevention, exercise, immunization, health status, healthy days (health-related quality of life),
health care access, inadequate sleep, hypertension awareness, cholesterol awareness, chronic health conditions,
alcohol consumption, fruit and vegetable consumption, arthritis burden, and seatbelt use.

Dataset notes
The dataset is provided in R Workspace (.RData) format for analysis with RStudio. Categorical values are stored
as factors in the R workspace. All missing values are coded NA in the R workspace.
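
A minimal sketch of loading such a workspace and checking the two conventions noted above. The file name `brfss2013.RData` is an assumption (use the path of your own copy), and the toy data frame only stands in for the real data so that the snippet runs on its own.

```r
# Toy stand-in for the real data (hypothetical values), saved and reloaded
# the same way the BRFSS workspace would be.
toy <- data.frame(
  genhlth  = factor(c("Good", "Poor", NA), levels = c("Good", "Poor")),
  sleptim1 = c(7, NA, 6)
)
path <- tempfile(fileext = ".RData")
save(toy, file = path)

rm(toy)
loaded <- load(path)   # load() returns the names of the restored objects

# Categorical columns arrive as factors; missing values are NA.
is_factor  <- is.factor(toy$genhlth)
na_per_col <- colSums(is.na(toy))
print(loaded)
print(is_factor)
print(na_per_col)

# With the real file (assumed name) this would be:
# load("brfss2013.RData")
```

Counting NA values per column right after loading gives an early view of how much data each later filter will drop.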

Part 2: Research Questions
Research question 1:
What is the health status of the population by age, sex and activity?
Research question 2:
Which behaviours and habits might cause poor health?
Research question 3:
What is the correlation between behaviours and lifestyles and chronic health conditions?

The following variables will be used:
    genhlth  : General health status
    sex      : Sex of respondent
    X_age_g  : Age group of respondent
    qlhlth2  : Number of days full of energy in the past 30 days
    sleptim1 : Hours of sleep
    cvdinfr4 : Ever diagnosed with heart attack
    addepev2 : Ever told you had a depressive disorder
    smoke100 : Smoked at least 100 cigarettes
    diabete3 : Ever told you have diabetes
    exerany2 : Exercise or physical activity in the past 30 days
    toldhi2  : Ever told blood cholesterol is high
    X_bmi5cat: Weight category (from body mass index)
    X_rfbmi5 : Overweight or obese
    bphigh4  : Ever told blood pressure is high
    X_pacat1 : Physical activity category (active or passive life)
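
Column names in R are case-sensitive, so it can help to confirm that every variable in such a list exists before subsetting. The sketch below uses a small hypothetical data frame in place of `brfss2013`; the mis-cased name `Sex` is included on purpose to show what a failed lookup looks like.

```r
# A few of the variables used in the analysis, plus a deliberately
# mis-cased name ("Sex") for illustration.
vars_needed <- c("genhlth", "sex", "X_age_g", "sleptim1", "Sex")

# Hypothetical stand-in for brfss2013 with only a few columns.
toy <- data.frame(
  genhlth  = factor("Good"),
  sex      = factor("Male"),
  X_age_g  = factor("Age 18 to 24"),
  sleptim1 = 7
)

# setdiff() returns the requested names that are absent from the data frame.
missing_vars <- setdiff(vars_needed, names(toy))
print(missing_vars)
```

On the real data the same check would be `setdiff(vars_needed, names(brfss2013))`, and an empty result means every name was found.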

Research question 1:
Health status by (1) sex, (2) age group and (3) active or passive life.

Exploring Data Frame brfss2013

  dim(brfss2013)

## [1] 491775    330

  names(brfss2013)

##   [1] "X_state"   "fmonth"    "idate"     "imonth"    "iday"     
##   [6] "iyear"     "dispcode"  "seqno"     "X_psu"     "ctelenum" 
##  [11] "pvtresd1"  "colghous"  "stateres"  "cellfon3"  "ladult"   
##  [16] "numadult"  "nummen"    "numwomen"  "genhlth"   "physhlth" 
##  [21] "menthlth"  "poorhlth"  "hlthpln1"  "persdoc2"  "medcost"  
##  [26] "checkup1"  "sleptim1"  "bphigh4"   "bpmeds"    "bloodcho" 
##  [31] "cholchk"   "toldhi2"   "cvdinfr4"  "cvdcrhd4"  "cvdstrk3" 
##  [36] "asthma3"   "asthnow"   "chcscncr"  "chcocncr"  "chccopd1" 
##  [41] "havarth3"  "addepev2"  "chckidny"  "diabete3"  "veteran3" 
##  [46] "marital"   "children"  "educa"     "employ1"   "income2"  
##  [51] "weight2"   "height3"   "numhhol2"  "numphon2"  "cpdemo1"  
##  [56] "cpdemo4"   "internet"  "renthom1"  "sex"       "pregnant" 
##  [61] "qlactlm2"  "useequip"  "blind"     "decide"    "diffwalk" 
##  [66] "diffdres"  "diffalon"  "smoke100"  "smokday2"  "stopsmk2" 
##  [71] "lastsmk2"  "usenow3"   "alcday5"   "avedrnk2"  "drnk3ge5" 
##  [76] "maxdrnks"  "fruitju1"  "fruit1"    "fvbeans"   "fvgreen"  
##  [81] "fvorang"   "vegetab1"  "exerany2"  "exract11"  "exeroft1" 
##  [86] "exerhmm1"  "exract21"  "exeroft2"  "exerhmm2"  "strength" 
##  [91] "lmtjoin3"  "arthdis2"  "arthsocl"  "joinpain"  "seatbelt" 
##  [96] "flushot6"  "flshtmy2"  "tetanus"   "pneuvac3"  "hivtst6"  
## [101] "hivtstd3"  "whrtst10"  "pdiabtst"  "prediab1"  "diabage2" 
## [106] "insulin"   "bldsugar"  "feetchk2"  "doctdiab"  "chkhemo3" 
## [111] "feetchk"   "eyeexam"   "diabeye"   "diabedu"   "painact2" 
## [116] "qlmentl2"  "qlstres2"  "qlhlth2"   "medicare"  "hlthcvrg" 
## [121] "delaymed"  "dlyother"  "nocov121"  "lstcovrg"  "drvisits" 
## [126] "medscost"  "carercvd"  "medbills"  "ssbsugar"  "ssbfrut2" 
## [131] "wtchsalt"  "longwtch"  "dradvise"  "asthmage"  "asattack" 
## [136] "aservist"  "asdrvist"  "asrchkup"  "asactlim"  "asymptom" 
## [141] "asnoslep"  "asthmed3"  "asinhalr"  "harehab1"  "strehab1" 
## [146] "cvdasprn"  "aspunsaf"  "rlivpain"  "rduchart"  "rducstrk" 
## [151] "arttoday"  "arthwgt"   "arthexer"  "arthedu"   "imfvplac" 
## [156] "hpvadvc2"  "hpvadsht"  "hadmam"    "howlong"   "profexam" 
## [161] "lengexam"  "hadpap2"   "lastpap2"  "hadhyst2"  "bldstool" 
## [166] "lstblds3"  "hadsigm3"  "hadsgco1"  "lastsig3"  "pcpsaad2" 
## [171] "pcpsadi1"  "pcpsare1"  "psatest1"  "psatime"   "pcpsars1" 
## [176] "pcpsade1"  "pcdmdecn"  "rrclass2"  "rrcognt2"  "rratwrk2" 
## [181] "rrhcare3"  "rrphysm2"  "rremtsm2"  "misnervs"  "mishopls" 
## [186] "misrstls"  "misdeprd"  "miseffrt"  "miswtles"  "misnowrk" 
## [191] "mistmnt"   "mistrhlp"  "misphlpf"  "scntmony"  "scntmeal" 
## [196] "scntpaid"  "scntwrk1"  "scntlpad"  "scntlwk1"  "scntvot1" 
## [201] "rcsgendr"  "rcsrltn2"  "casthdx2"  "casthno2"  "emtsuprt" 
## [206] "lsatisfy"  "ctelnum1"  "cellfon2"  "cadult"    "pvtresd2" 
## [211] "cclghous"  "cstate"    "landline"  "pctcell"   "qstver"   
## [216] "qstlang"   "mscode"    "X_ststr"   "X_strwt"   "X_rawrake"
## [221] "X_wt2rake" "X_imprace" "X_impnph"  "X_impeduc" "X_impmrtl"
## [226] "X_imphome" "X_chispnc" "X_crace1"  "X_impcage" "X_impcrac"
## [231] "X_impcsex" "X_cllcpwt" "X_dualuse" "X_dualcor" "X_llcpwt2"
## [236] "X_llcpwt"  "X_rfhlth"  "X_hcvu651" "X_rfhype5" "X_cholchk"
## [241] "X_rfchol"  "X_ltasth1" "X_casthm1" "X_asthms1" "X_drdxar1"
## [246] "X_prace1"  "X_mrace1"  "X_hispanc" "X_race"    "X_raceg21"
## [251] "X_racegr3" "X_race_g1" "X_ageg5yr" "X_age65yr" "X_age_g"  
## [256] "htin4"     "htm4"      "wtkg3"     "X_bmi5"    "X_bmi5cat"
## [261] "X_rfbmi5"  "X_chldcnt" "X_educag"  "X_incomg"  "X_smoker3"
## [266] "X_rfsmok3" "drnkany5"  "drocdy3_"  "X_rfbing5" "X_drnkdy4"
## [271] "X_drnkmo4" "X_rfdrhv4" "X_rfdrmn4" "X_rfdrwm4" "ftjuda1_" 
## [276] "frutda1_"  "beanday_"  "grenday_"  "orngday_"  "vegeda1_" 
## [281] "X_misfrtn" "X_misvegn" "X_frtresp" "X_vegresp" "X_frutsum"
## [286] "X_vegesum" "X_frtlt1"  "X_veglt1"  "X_frt16"   "X_veg23"  
## [291] "X_fruitex" "X_vegetex" "X_totinda" "metvl11_"  "metvl21_" 
## [296] "maxvo2_"   "fc60_"     "actin11_"  "actin21_"  "padur1_"  
## [301] "padur2_"   "pafreq1_"  "pafreq2_"  "X_minac11" "X_minac21"
## [306] "strfreq_"  "pamiss1_"  "pamin11_"  "pamin21_"  "pa1min_"  
## [311] "pavig11_"  "pavig21_"  "pa1vigm_"  "X_pacat1"  "X_paindx1"
## [316] "X_pa150r2" "X_pa300r2" "X_pa30021" "X_pastrng" "X_parec1" 
## [321] "X_pastae1" "X_lmtact1" "X_lmtwrk1" "X_lmtscl1" "X_rfseat2"
## [326] "X_rfseat3" "X_flshot6" "X_pneumo2" "X_aidtst3" "X_age80"

Preparation of subsets for Question 1
Health tends to deteriorate with age, and the level of activity also decreases with age.
Females appear to have better health in old age.
General health by sex, age and active life.

    hlthpmp <- select(brfss2013, physhlth, menthlth,poorhlth) %>% 
    filter(!is.na(physhlth),!is.na(menthlth),!is.na(poorhlth))  
   
     hist(hlthpmp$physhlth, main="Physical Health", xlab="Physical Health",col=rainbow(5))

     hist(hlthpmp$menthlth, main="Mental Health", xlab="Mental Health",col=rainbow(5))

     hist(hlthpmp$poorhlth, main="Health status", xlab="Poor Health",col=rainbow(5))

    health <- brfss2013 %>%
      dplyr::select(genhlth, X_age_g, sex, X_pacat1) %>%
      dplyr::filter(sex %in% c("Male", "Female"))
    levels(health$sex) <- c("Male", "Female")

    ggplot(health, aes(x = sex, fill = X_pacat1)) + geom_bar() + facet_grid(~X_age_g) + coord_flip() +
      ggtitle("Health, age, sex and activity")

    rm(hlthpmp)  
    rm(health)
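
Bar charts of raw counts are hard to compare across age groups of different sizes. A small sketch, on hypothetical toy data standing in for the `health` subset, of turning the counts into shares within each age group:

```r
# Hypothetical toy data; the level names follow the BRFSS labels only loosely.
health_toy <- data.frame(
  X_age_g  = factor(c("Age 18 to 24", "Age 18 to 24", "Age 65 or older",
                      "Age 65 or older", "Age 65 or older", "Age 18 to 24")),
  X_pacat1 = factor(c("Highly active", "Inactive", "Inactive",
                      "Inactive", "Highly active", "Highly active"))
)

counts <- table(health_toy$X_age_g, health_toy$X_pacat1)

# margin = 1 divides each row by its total, giving shares within an age group.
shares <- prop.table(counts, margin = 1)
print(round(shares, 2))
```

Each row of `shares` sums to 1, so the "Inactive" column can be read directly as the inactive share per age group.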

Research question 2
Behaviours and their implications for human health: how exercise, smoking and sleeping time relate to obesity,
diabetes, high blood pressure, high cholesterol, cardiovascular disorders and depression in respondents.
The plots show their association.
Behaviours and Habits

A. Exercise can improve health by reducing obesity, high blood pressure and cholesterol.

    exer <- brfss2013 %>%
      dplyr::select(exerany2, bphigh4, sex) %>%
      dplyr::filter(bphigh4 %in% c("Yes", "No"))

    # Relabel the four bphigh4 levels (same order as the original levels).
    levels(exer$bphigh4) <- c("Yes, been told blood pressure high",
                              "Yes, but female told only during pregnancy",
                              "No, never been told blood pressure high",
                              "Yes, been told borderline or pre-hypertensive")

    ggplot(exer, aes(x = exerany2, fill = sex)) + geom_bar() + facet_grid(~bphigh4) + coord_flip() +
      ggtitle("Exercise vs High Blood Pressure")

    # The plots below use `disease`, the complete-case subset created in the
    # "Chronic Diseases and Health" section; it must exist before they are run.
    qplot(disease$toldhi2)

    qplot(X_age_g, data = disease, fill = toldhi2, xlab = "Age group", ylab = "Respondents",
          main = "High blood cholesterol vs Age")

    ggplot(disease, aes(x = toldhi2, fill = exerany2)) + geom_bar() + facet_grid(~cvdinfr4) + coord_flip() +
      ggtitle("Exercise vs High Cholesterol")

    rm(exer)

B. Smoking is harmful to health. The plots suggest that this habit goes together with poorer general health.

    smoke <- brfss2013 %>%
      dplyr::select(genhlth, smoke100, smokday2, sex, X_age_g) %>%
      dplyr::filter(smoke100 %in% c("Yes", "No"))

    levels(smoke$genhlth) <- c("Excellent", "Very good", "Good", "Fair", "Poor")

    ggplot(smoke, aes(x = smoke100, fill = smoke100)) + geom_bar() + facet_grid(~genhlth) + coord_flip() +
      ggtitle("Smoking Harms General health")

    # Share of smokers within each general-health level.
    smoke <- brfss2013 %>%
      filter(!is.na(smoke100), !is.na(genhlth)) %>%
      group_by(genhlth, smoke100) %>%
      summarise(n = n()) %>%
      mutate(pct_total_stacked = n/sum(n),
             position_stacked = cumsum(pct_total_stacked) - 0.5*pct_total_stacked,
             position_n = cumsum(n) - 0.5*n)

    ggplot(smoke, aes(x = genhlth, y = pct_total_stacked, fill = smoke100)) +
      geom_bar(stat = 'identity', width = .7, color = "black") +
      geom_text(aes(label = ifelse(smoke100 == 'Yes', paste0(sprintf("%.0f", pct_total_stacked*100), "%"), ""),
                    y = position_stacked), color = "white") +
      coord_flip() +
      scale_y_continuous() +
      labs(y = "", x = "")
    rm(smoke)
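
The label positions in the chart above come from a small piece of arithmetic: the midpoint of each stacked segment is the running total minus half of the segment itself. A worked sketch with hypothetical counts for a single general-health level:

```r
# Hypothetical counts for one general-health level: smokers vs non-smokers.
n <- c(Yes = 40, No = 60)

# Share of each answer within the level.
pct_total_stacked <- n / sum(n)

# Midpoint of each segment: cumulative share minus half the segment's own share.
position_stacked <- cumsum(pct_total_stacked) - 0.5 * pct_total_stacked
print(position_stacked)
```

Here the "Yes" segment spans 0 to 0.4, so its midpoint is 0.2, and the "No" segment spans 0.4 to 1.0, so its midpoint is 0.7. These midpoints assume the segments are stacked in level order. Recent ggplot2 versions stack in reverse level order, so it may be worth checking the label placement, or letting `geom_text()` compute it with `position = position_stack(vjust = 0.5)`.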

C. Sleeping time and its implications for health. Here a comparison is shown.

    sleep <- brfss2013 %>% select(sleptim1, poorhlth, physhlth)
    # Keep plausible sleep durations only (hours per day).
    sleep <- subset(sleep, sleptim1 >= 0 & sleptim1 < 25)

    # 7 to 9 hours of sleep is labelled optimal ("opt"), everything else "non-op".
    sleep <- within(sleep, {sleepstatus <- ifelse(sleptim1 >= 7 & sleptim1 < 10, "opt", "non-op")})

    # subset() also drops rows where the condition is NA.
    sleep_poorhlth <- subset(sleep, poorhlth >= 0, select = c(sleptim1, poorhlth, sleepstatus))
    sleep_physhlth <- subset(sleep, physhlth >= 0, select = c(sleptim1, physhlth, sleepstatus))

    ggplot(data = sleep_poorhlth, aes(x = sleptim1, y = poorhlth)) + geom_point()

    ggplot(data = sleep_physhlth, aes(x = sleptim1, y = physhlth)) + geom_point()

    rm(sleep)
    rm(sleep_poorhlth)
    rm(sleep_physhlth)
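
With several hundred thousand points the scatter plots overplot heavily, so a group summary can make the comparison easier to read. A sketch on hypothetical toy values, reusing the same 7 to 9 hour rule for `sleepstatus`:

```r
# Hypothetical toy values standing in for the sleep subset.
sleep_toy <- data.frame(
  sleptim1 = c(5, 6, 7, 8, 9, 11),
  poorhlth = c(10, 8, 2, 1, 3, 12)
)

# Same rule as in the analysis: 7 to 9 hours counts as optimal sleep.
sleep_toy$sleepstatus <- ifelse(sleep_toy$sleptim1 >= 7 & sleep_toy$sleptim1 < 10,
                                "opt", "non-op")

# Mean number of poor-health days within each sleep group.
by_status <- aggregate(poorhlth ~ sleepstatus, data = sleep_toy, FUN = mean)
print(by_status)
```

On the real data the same call would use `sleep_poorhlth` in place of `sleep_toy`, before that object is removed.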

Poor health is also associated with chronic disorders: diabetes, heart disease, high blood pressure, high cholesterol and depressive disorders.

Chronic Diseases and Health

    disease <- select(brfss2013, genhlth, sex, X_age_g, X_rfbmi5, X_bmi5cat, diabete3, cvdinfr4,
                      bphigh4, addepev2, exerany2, toldhi2, smoke100, sleptim1) %>%
      filter(!is.na(genhlth), !is.na(sex), !is.na(exerany2), !is.na(X_age_g), !is.na(toldhi2),
             !is.na(smoke100), !is.na(diabete3), !is.na(addepev2), !is.na(cvdinfr4),
             !is.na(bphigh4), !is.na(X_rfbmi5), !is.na(X_bmi5cat), !is.na(sleptim1))

Diabetes Prevalence and General Health

     table(disease$diabete3,disease$X_age_g)

##                                             
##                                              Age 18 to 24 Age 25 to 34
##   Yes                                                 152          713
##   Yes, but female told only during pregnancy           60          449
##   No                                                 8765        25044
##   No, pre-diabetes or borderline diabetes              61          185
##                                             
##                                              Age 35 to 44 Age 45 to 54
##   Yes                                                2325         6699
##   Yes, but female told only during pregnancy          854          697
##   No                                                38745        57530
##   No, pre-diabetes or borderline diabetes             458         1031
##                                             
##                                              Age 55 to 64 Age 65 or older
##   Yes                                               14056           27257
##   Yes, but female told only during pregnancy          488             557
##   No                                                72603          100907
##   No, pre-diabetes or borderline diabetes            1809            3347

    diabete <- brfss2013 %>%
      filter(!is.na(diabete3), !is.na(genhlth)) %>%
      group_by(genhlth, diabete3) %>%
      summarise(n = n()) %>%
      mutate(pct_total_stacked = n/sum(n),
             position_stacked = cumsum(pct_total_stacked) - 0.5*pct_total_stacked,
             position_n = cumsum(n) - 0.5*n)

    # Counts per general-health level, stacked by diabetes status.
    ggplot(diabete, aes(x = genhlth)) +
      geom_bar(aes(fill = diabete3, weight = n), width = .7, color = "black") +
      geom_text(aes(label = n, y = position_n), color = "white")

    # The same data as shares within each general-health level.
    ggplot(diabete, aes(x = genhlth, y = pct_total_stacked, fill = diabete3)) +
      geom_bar(stat = 'identity', width = .7, color = "black") +
      geom_text(aes(label = ifelse(diabete3 == 'Yes', paste0(sprintf("%.0f", pct_total_stacked*100), "%"), ""),
                    y = position_stacked), color = "white") +
      coord_flip() +
      scale_y_continuous() +
      labs(y = "", x = "")

    rm(diabete)

Prevalence of Cardiovascular Disorders


   Yes     No 
 24347 340445


      Yes        No 
 6.674214 93.325786

     Age 18 to 24 Age 25 to 34 Age 35 to 44 Age 45 to 54 Age 55 to 64 Age 65 or older
                                                                                     
Yes    0.01288405   0.04468300   0.15077085   0.60966249   1.52278559      4.33342836
No     2.46469221   7.18985065  11.46735674  17.47105200  22.86261760     31.87021645
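
The code that produced the three tables above is not shown. A minimal sketch of how counts, percentages and an age breakdown of this kind can be obtained, assuming the same `table()` and `prop.table()` pattern used for depression below, and run here on hypothetical toy data:

```r
# Hypothetical toy data standing in for the `disease` subset.
disease_toy <- data.frame(
  cvdinfr4 = factor(c("Yes", "No", "No", "No", "Yes", "No", "No", "No"),
                    levels = c("Yes", "No")),
  X_age_g  = factor(c("Age 65 or older", "Age 18 to 24", "Age 18 to 24",
                      "Age 65 or older", "Age 65 or older", "Age 65 or older",
                      "Age 18 to 24", "Age 65 or older"))
)

counts <- table(disease_toy$cvdinfr4)      # number of respondents per answer
pct    <- prop.table(counts) * 100         # the same as percentages

# Percent of all respondents in each answer-by-age cell (cells sum to 100).
by_age <- prop.table(table(disease_toy$cvdinfr4, disease_toy$X_age_g)) * 100

print(counts)
print(pct)
print(by_age)
```

Because the cells are percentages of the whole sample, a larger value in an older age group reflects both higher prevalence and a larger group. Passing `margin = 2` to `prop.table()` would give prevalence within each age group instead.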

Stress and depression are associated with work, environment and behaviours. These habits seem to be linked to chronic
disorders such as high blood pressure, diabetes and cardiovascular disease. Correlations are calculated under
Question 3.

    prop.table(ftable(table(disease$addepev2,disease$X_age_g)))*100

##      Age 18 to 24 Age 25 to 34 Age 35 to 44 Age 45 to 54 Age 55 to 64 Age 65 or older
##                                                                                      
## Yes      0.436139     1.444933     2.423025     4.241869     5.863067        5.772331
## No       2.041437     5.789601     9.195103    13.838845    18.522336       30.431314

    qplot(disease$addepev2)

    qplot(X_age_g, data = disease, fill = addepev2, xlab = "Age group", ylab = "Respondents",
          main = "Depression vs Age")

    ggplot(disease, aes(x = addepev2, fill = diabete3)) + geom_bar() + facet_grid(~addepev2) + coord_flip() +
      ggtitle("Depression vs Diabetes")

     brfss_dep <- brfss2013 %>%
     filter(!is.na(addepev2),!is.na(diabete3)) %>%
     group_by(addepev2,diabete3) %>%
     summarise(n = n()) %>%
     mutate(pct_total_stacked = n/sum(n), 
     position_stacked = cumsum(pct_total_stacked)-0.5*pct_total_stacked,
     position_n = cumsum(n)-0.5*n)

    ggplot(brfss_dep, aes(x=diabete3, y=pct_total_stacked, fill=addepev2)) +
    geom_bar(stat='identity',  width = .7, color="black")+
    geom_text(aes(label=ifelse(addepev2 == 'Yes', paste0(sprintf("%.0f", pct_total_stacked*100),"%"),""),
                  y=position_stacked), color="white") +
    coord_flip() +
    scale_y_continuous() +
    labs(y="", x="")

High Blood Pressure Prevalence


                                       Yes 
                                    161096 
Yes, but female told only during pregnancy 
                                      2258 
                                        No 
                                    197358 
       Told borderline or pre-hypertensive 
                                      4080


                                       Yes 
                                44.1610562 
Yes, but female told only during pregnancy 
                                 0.6189829 
                                        No 
                                54.1015154 
       Told borderline or pre-hypertensive 
                                 1.1184456

                                            Age 18 to 24 Age 25 to 34 Age 35 to 44 Age 45 to 54 Age 55 to 64 Age 65 or older
                                                                                                                            
Yes                                          0.223140858  1.020581592  2.488541415  6.123215421 11.593181868    22.712395009
Yes, but female told only during pregnancy   0.025219851  0.146384789  0.161461874  0.106087853  0.084705805     0.095122700
No                                           2.220443431  6.018224084  8.859021031 11.644443957 12.393638018    12.965744863
Told borderline or pre-hypertensive          0.008772122  0.049343187  0.109103270  0.206967258  0.313877497     0.430382245

High Blood cholesterol Prevalence


   Yes     No 
160553 204239


    Yes      No 
44.0122 55.9878

     Age 18 to 24 Age 25 to 34 Age 35 to 44 Age 45 to 54 Age 55 to 64 Age 65 or older
                                                                                     
Yes      0.234106     1.142295     3.027753     7.028663    12.330863       20.248525
No       2.243470     6.092239     8.590375    11.052052    12.054541       15.955120

Research question 3:
Analyzing the association of behavioural outcomes with their causes using correlation and multiple logistic regression

vars <- names(brfss2013) %in% c("sleptim1", "cvdinfr4", "addepev2", "diabete3", "bphigh4", "toldhi2",
                                "smoke100", "X_rfbmi5")

hlthsub2 <- brfss2013[vars] 
names(hlthsub2)

## [1] "sleptim1" "bphigh4"  "toldhi2"  "cvdinfr4" "addepev2" "diabete3"
## [7] "smoke100" "X_rfbmi5"

MissingData <- function(x){sum(is.na(x))/length(x)*100}
apply(hlthsub2, 2, MissingData)

##   sleptim1    bphigh4    toldhi2   cvdinfr4   addepev2   diabete3 
##  1.5021097  0.2887499 14.5721112  0.5260536  0.4654568  0.1691831 
##   smoke100   X_rfbmi5 
##  3.0339078  5.4352092

summary(hlthsub2)

##     sleptim1                                             bphigh4      
##  Min.   :  0.000   Yes                                       :198921  
##  1st Qu.:  6.000   Yes, but female told only during pregnancy:  3680  
##  Median :  7.000   No                                        :282687  
##  Mean   :  7.052   Told borderline or pre-hypertensive       :  5067  
##  3rd Qu.:  8.000   NA's                                      :  1420  
##  Max.   :450.000                                                      
##  NA's   :7387                                                         
##  toldhi2       cvdinfr4      addepev2     
##  Yes :183501   Yes : 29284   Yes : 95779  
##  No  :236612   No  :459904   No  :393707  
##  NA's: 71662   NA's:  2587   NA's:  2289  
##                                           
##                                           
##                                           
##                                           
##                                        diabete3      smoke100     
##  Yes                                       : 62363   Yes :215201  
##  Yes, but female told only during pregnancy:  4602   No  :261654  
##  No                                        :415374   NA's: 14920  
##  No, pre-diabetes or borderline diabetes   :  8604                
##  NA's                                      :   832                
##                                                                   
##                                                                   
##  X_rfbmi5     
##  No  :163161  
##  Yes :301885  
##  NA's: 26729  
##               
##               
##               
##

summary(hlthsub2$X_rfbmi5)

##     No    Yes   NA's 
## 163161 301885  26729

hlthsub2$X_rfbmi5 <- replace(hlthsub2$X_rfbmi5, which(is.na(hlthsub2$X_rfbmi5)), "Yes")
summary(hlthsub2$X_rfbmi5)

##     No    Yes 
## 163161 328614

summary(hlthsub2$smoke100)

##    Yes     No   NA's 
## 215201 261654  14920

hlthsub2$addepev2 <- replace(hlthsub2$addepev2, which(is.na(hlthsub2$addepev2)), "Yes")
summary(hlthsub2$addepev2)

##    Yes     No 
##  98068 393707

hlthsub2$smoke100 <- replace(hlthsub2$smoke100, which(is.na(hlthsub2$smoke100)), "Yes")
summary(hlthsub2$smoke100)

##    Yes     No 
## 230121 261654

hlthsub2$diabete3 <- replace(hlthsub2$diabete3, which(is.na(hlthsub2$diabete3)), "Yes")
summary(hlthsub2$diabete3)

##                                        Yes 
##                                      63195 
## Yes, but female told only during pregnancy 
##                                       4602 
##                                         No 
##                                     415374 
##    No, pre-diabetes or borderline diabetes 
##                                       8604

hlthsub2$bphigh4 <- replace(hlthsub2$bphigh4, which(is.na(hlthsub2$bphigh4)), "No")
summary(hlthsub2$bphigh4)

##                                        Yes 
##                                     198921 
## Yes, but female told only during pregnancy 
##                                       3680 
##                                         No 
##                                     284107 
##        Told borderline or pre-hypertensive 
##                                       5067

summary(hlthsub2$sleptim1)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   6.000   7.000   7.052   8.000 450.000    7387

mean(hlthsub2$sleptim1,na.rm = T)

## [1] 7.052099

hlthsub2$sleptim1 <- replace(hlthsub2$sleptim1, which(is.na(hlthsub2$sleptim1)), 7)
summary(hlthsub2$sleptim1)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   6.000   7.000   7.051   8.000 450.000

summary(hlthsub2$toldhi2)

##    Yes     No   NA's 
## 183501 236612  71662

hlthsub2$toldhi2 <- replace(hlthsub2$toldhi2, which(is.na(hlthsub2$toldhi2)), "No")
summary(hlthsub2$toldhi2)

##    Yes     No 
## 183501 308274

summary(hlthsub2$cvdinfr4)

##    Yes     No   NA's 
##  29284 459904   2587

hlthsub2$cvdinfr4 <- replace(hlthsub2$cvdinfr4, which(is.na(hlthsub2$cvdinfr4)), "Yes")
summary(hlthsub2$cvdinfr4)

##    Yes     No 
##  31871 459904
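
The repeated `replace()` calls above all follow one pattern, which could be wrapped in a small helper. `fill_na` below is a hypothetical name, not part of the original analysis, and the toy vectors only illustrate the behaviour.

```r
# Hypothetical helper: fill missing entries of a vector or factor with one value.
fill_na <- function(x, value) {
  # For factors, the fill value must already be one of the levels.
  if (is.factor(x)) stopifnot(value %in% levels(x))
  replace(x, which(is.na(x)), value)
}

# Toy factor with two missing answers.
smoke_toy <- factor(c("Yes", NA, "No", NA), levels = c("Yes", "No"))
filled <- fill_na(smoke_toy, "Yes")
print(summary(filled))

# The same helper works for numeric columns such as hours of sleep.
sleep_filled <- fill_na(c(7, NA, 6), 7)
print(sleep_filled)
```

Filling missing answers with a fixed category shifts the prevalence of that category (for `cvdinfr4`, "Yes" rises from 29284 to 31871 above), so it may be worth repeating the later models on complete cases as a check.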

hlthsub2$bphigh4 <- as.factor(ifelse(hlthsub2$bphigh4=="Yes", "Yes", 
                           (ifelse(hlthsub2$bphigh4=="Yes, but female told only during pregnancy", "Yes",
                           (ifelse(hlthsub2$bphigh4=="Told borderline or pre-hypertensive", "Yes",
                                   "No"))))))
 

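# Note: the third label below does not match the actual factor level
# "No, pre-diabetes or borderline diabetes", so those respondents are coded "No".
hlthsub2$diabete3 <- as.factor(ifelse(hlthsub2$diabete3 == "Yes","Yes", 
                    (ifelse(hlthsub2$diabete3 == "Yes, but female told only during pregnancy","Yes",
                    (ifelse(hlthsub2$diabete3 == "Told borderline diabetes or pre-diabetes","Yes",
                            "No"))))))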
library(Hmisc)

## Loading required package: lattice

## Loading required package: survival

## Loading required package: Formula

## 
## Attaching package: 'Hmisc'

## The following objects are masked from 'package:dplyr':
## 
##     combine, src, summarize

## The following objects are masked from 'package:base':
## 
##     format.pval, round.POSIXt, trunc.POSIXt, units
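
The recoding of `bphigh4` and `diabete3` above collapses several answer levels into a Yes/No pair with nested `ifelse()` calls. A sketch of an alternative that lists the "Yes" levels once and uses `%in%`, shown on a hypothetical toy factor (this is not the author's code, and which levels count as "Yes" is a choice made here for illustration):

```r
# Toy factor using the four diabete3 level names from the summary output.
diab_toy <- factor(
  c("Yes", "No", "No, pre-diabetes or borderline diabetes",
    "Yes, but female told only during pregnancy", NA),
  levels = c("Yes", "Yes, but female told only during pregnancy",
             "No", "No, pre-diabetes or borderline diabetes")
)

# Levels treated as "Yes" (an assumption for this sketch).
yes_levels <- c("Yes", "Yes, but female told only during pregnancy")

# %in% returns FALSE for NA, so missing answers become "No" here.
collapsed <- factor(ifelse(diab_toy %in% yes_levels, "Yes", "No"),
                    levels = c("Yes", "No"))
print(table(collapsed))
```

Keeping the level names in one vector makes a misspelled label easier to spot. Note the difference in NA handling: `==` would leave a missing answer as NA, while `%in%` silently turns it into "No".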

 Converting factors to numerics for analysis.

hlthsub2$addepev2 <- ifelse(hlthsub2$addepev2=="Yes", 1, 0)
hlthsub2$cvdinfr4 <- ifelse(hlthsub2$cvdinfr4=="Yes", 1, 0)
hlthsub2$smoke100 <- ifelse(hlthsub2$smoke100=="Yes", 1, 0)
hlthsub2$diabete3 <- ifelse(hlthsub2$diabete3=="Yes", 1, 0)
hlthsub2$toldhi2 <- ifelse(hlthsub2$toldhi2 =="Yes", 1, 0)
hlthsub2$bphigh4 <- ifelse(hlthsub2$bphigh4=="Yes", 1, 0)
hlthsub2$X_rfbmi5 <- ifelse(hlthsub2$X_rfbmi5=="Yes", 1, 0)

Checking Missing Values

MissingData <- function(x){sum(is.na(x))/length(x)*100}
apply(hlthsub2, 2, MissingData)

## sleptim1  bphigh4  toldhi2 cvdinfr4 addepev2 diabete3 smoke100 X_rfbmi5 
##        0        0        0        0        0        0        0        0

 Research question 3:
 Finding correlations and plotting them with corrplot.
 Fitting models by logistic regression (binomial family).

library(corrplot)
M <- cor(hlthsub2)
corrplot(M, method="number",pch=23,col=rainbow(7))

library(corrplot)
corrplot(M, type="upper", order="hclust", tl.col=rainbow(7), tl.srt=45,method="number",pch=23)
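
Most columns of `hlthsub2` are coded 0/1, and for two binary variables the Pearson correlation returned by `cor()` equals the phi coefficient of their 2x2 table. A small check on hypothetical toy vectors:

```r
# Two hypothetical 0/1 variables.
x <- c(1, 1, 0, 0, 1, 0, 1, 0)
y <- c(1, 0, 0, 0, 1, 0, 1, 1)

r <- cor(x, y)   # Pearson correlation

# Cell counts of the 2x2 table, converted to numeric to avoid integer overflow
# when the same formula is applied to large samples.
tab <- table(x, y)
n11 <- as.numeric(tab["1", "1"]); n10 <- as.numeric(tab["1", "0"])
n01 <- as.numeric(tab["0", "1"]); n00 <- as.numeric(tab["0", "0"])

# Phi coefficient computed directly from the cell counts.
phi <- (n11 * n00 - n10 * n01) /
  sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))

print(c(pearson = r, phi = phi))
```

Both values are 0.5 for these vectors. Phi is bounded well below 1 when the two variables have very different prevalences, so small entries in the plot do not by themselves mean that an association is negligible.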

summary(M)

    sleptim1            bphigh4             toldhi2        
 Min.   :-0.051986   Min.   :0.0009575   Min.   :0.003714  
 1st Qu.:-0.025489   1st Qu.:0.0722225   1st Qu.:0.091977  
 Median : 0.001235   Median :0.1804427   Median :0.138721  
 Mean   : 0.112901   Mean   :0.2597608   Mean   :0.246796  
 3rd Qu.: 0.002340   3rd Qu.:0.2698770   3rd Qu.:0.231654  
 Max.   : 1.000000   Max.   :1.0000000   Max.   :1.000000  
    cvdinfr4           addepev2           diabete3       
 Min.   :0.001512   Min.   :-0.05199   Min.   :0.001882  
 1st Qu.:0.054606   1st Qu.: 0.05652   1st Qu.:0.070512  
 Median :0.122297   Median : 0.07588   Median :0.149478  
 Mean   :0.208511   Mean   : 0.17623   Mean   :0.234932  
 3rd Qu.:0.156271   3rd Qu.: 0.09960   3rd Qu.:0.215495  
 Max.   :1.000000   Max.   : 1.00000   Max.   :1.000000  
    smoke100           X_rfbmi5       
 Min.   :-0.02454   Min.   :-0.02833  
 1st Qu.: 0.03625   1st Qu.: 0.03496  
 Median : 0.07328   Median : 0.08737  
 Mean   : 0.17264   Mean   : 0.19287  
 3rd Qu.: 0.09651   3rd Qu.: 0.15817  
 Max.   : 1.00000   Max.   : 1.00000

Correlation and fitting of the models for cardiovascular disorder, depression and diabetes against the other variables.
A model for cardiovascular disorder is fitted against all other variables and plotted.


Please cite as:

 Hlavac, Marek (2015). stargazer: Well-Formatted Regression and Summary Statistics Tables.

 R package version 5.2. http://CRAN.R-project.org/package=stargazer

Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred


Call:
glm(formula = cvdinfr4 ~ ., family = "binomial", data = hlthsub2)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.9064  -0.4012  -0.2552  -0.1755   2.9322  

Coefficients:
             Estimate Std. Error  z value Pr(>|z|)    
(Intercept) -4.218351   0.026644 -158.325  < 2e-16 ***
sleptim1     0.006618   0.002905    2.278   0.0227 *  
bphigh4      1.043494   0.014198   73.496  < 2e-16 ***
toldhi2      0.734294   0.013012   56.433  < 2e-16 ***
addepev2     0.257595   0.013494   19.089  < 2e-16 ***
diabete3     0.756986   0.013609   55.624  < 2e-16 ***
smoke100     0.642813   0.012433   51.701  < 2e-16 ***
X_rfbmi5    -0.073540   0.013952   -5.271 1.36e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 236049  on 491774  degrees of freedom
Residual deviance: 210390  on 491767  degrees of freedom
AIC: 210406

Number of Fisher Scoring iterations: 6
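
The chunk that fitted this model is not shown; only the `Call:` line is. The sketch below follows the shape of that call on simulated data (the data frame, effect sizes and sample size are all assumptions) and converts the coefficients to odds ratios with `exp()`.

```r
set.seed(123)
n <- 5000

# Simulated 0/1 exposures (hypothetical prevalences).
toy <- data.frame(
  bphigh4  = rbinom(n, 1, 0.4),
  smoke100 = rbinom(n, 1, 0.45)
)

# Simulated outcome: both exposures raise the log-odds of a heart attack.
log_odds <- -3 + 1.0 * toy$bphigh4 + 0.6 * toy$smoke100
toy$cvdinfr4 <- rbinom(n, 1, plogis(log_odds))

# Same form as the call in the output above: outcome against all other columns.
fit_toy <- glm(cvdinfr4 ~ ., family = binomial, data = toy)

# Coefficients are on the log-odds scale; exp() turns them into odds ratios.
odds_ratios <- exp(coef(fit_toy))
print(round(odds_ratios, 2))
```

Read the same way, the `bphigh4` estimate of 1.043 in the summary above corresponds to an odds ratio of about 2.8, holding the other variables fixed.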

Multiple Logistic Regression Models

library(stargazer, quietly = TRUE)
par(op)
stargazer(fit,fit1,fit2,type="text")


========================================================
                           Dependent variable:          
                  --------------------------------------
                    cvdinfr4     addepev2     diabete3  
                      (1)          (2)          (3)     
--------------------------------------------------------
sleptim1            0.007**     -0.094***     0.006***  
                    (0.003)      (0.003)      (0.002)   
                                                        
bphigh4             1.043***     0.117***     1.244***  
                    (0.014)      (0.008)      (0.010)   
                                                        
toldhi2             0.734***     0.344***     0.729***  
                    (0.013)      (0.008)      (0.009)   
                                                        
addepev2            0.258***                  0.340***  
                    (0.013)                   (0.010)   
                                                        
cvdinfr4                         0.235***     0.741***  
                                 (0.013)      (0.013)   
                                                        
diabete3            0.757***     0.312***               
                    (0.014)      (0.010)                
                                                        
smoke100            0.643***     0.468***     0.042***  
                    (0.012)      (0.007)      (0.009)   
                                                        
X_rfbmi5           -0.074***     0.137***               
                    (0.014)      (0.008)                
                                                        
Constant           -4.218***    -1.327***    -3.090***  
                    (0.027)      (0.019)      (0.020)   
                                                        
--------------------------------------------------------
Observations        491,775      491,775      491,775   
Log Likelihood    -105,194.800 -239,030.700 -175,230.900
Akaike Inf. Crit. 210,405.600  478,077.300  350,475.800 
========================================================
Note:                        *p<0.1; **p<0.05; ***p<0.01

Correlation of variables with, e.g., diabetes
(this can be repeated for sleptim1, bphigh4, toldhi2, cvdinfr4, addepev2 or smoke100 if required)

[1] 0.001881772

[1] 0.2537181

[1] 0.2027539

[1] 0.1514461

[1] 0.07995236

[1] 0.04218968

  X_rfbmi5   diabete3
1        0 0.06568972
2        1 0.17369619
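
The code behind these values is not shown. The single numbers look like pairwise correlations with `diabete3` (the first matches the `sleptim1` entry of the correlation matrix), and the two-row table looks like the share of respondents coded 1 for `diabete3` within each `X_rfbmi5` group. A sketch of calls that produce output of this shape, on hypothetical toy data:

```r
# Hypothetical 0/1 data standing in for hlthsub2.
toy <- data.frame(
  diabete3 = c(1, 0, 0, 1, 0, 0, 1, 0),
  X_rfbmi5 = c(1, 1, 0, 1, 0, 0, 1, 1)
)

# Pairwise correlation between two columns.
r_bmi <- cor(toy$diabete3, toy$X_rfbmi5)
print(r_bmi)

# Mean of a 0/1 variable within each group = share coded 1 in that group.
share_by_bmi <- aggregate(diabete3 ~ X_rfbmi5, data = toy, FUN = mean)
print(share_by_bmi)
```

In the toy data no respondent with `X_rfbmi5 = 0` is coded 1 for `diabete3`, against 60% of those with `X_rfbmi5 = 1`. The output above reads the same way: about 6.6% versus 17.4%.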

Correlation analysis on subset hlthsub2.

 coran <- sort(cor(hlthsub2)[,1])
 print(coran)

     addepev2      X_rfbmi5      smoke100       bphigh4      cvdinfr4 
-0.0519859273 -0.0283317665 -0.0245406984  0.0009575273  0.0015123802 
     diabete3       toldhi2      sleptim1 
 0.0018817718  0.0037143384  1.0000000000

 plot(coran)

Fisher Test on Selected variables.


    Fisher's Exact Test for Count Data

data:  hlthsub2$cvdinfr4 and hlthsub2$X_rfbmi5
p-value < 2.2e-16
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
 1.412973 1.488011
sample estimates:
odds ratio 
  1.449966


    Fisher's Exact Test for Count Data

data:  hlthsub2$diabete3 and hlthsub2$addepev2
p-value < 2.2e-16
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
 1.657903 1.720391
sample estimates:
odds ratio 
  1.688858


    Fisher's Exact Test for Count Data

data:  hlthsub2$smoke100 and hlthsub2$cvdinfr4
p-value < 2.2e-16
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
 2.126275 2.229919
sample estimates:
odds ratio 
  2.177476


    Fisher's Exact Test for Count Data

data:  hlthsub2$diabete3 and hlthsub2$X_rfbmi5
p-value < 2.2e-16
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
 2.926220 3.055166
sample estimates:
odds ratio 
  2.989719


    Fisher's Exact Test for Count Data

data:  hlthsub2$cvdinfr4 and hlthsub2$diabete3
p-value < 2.2e-16
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
 3.486181 3.664181
sample estimates:
odds ratio 
  3.573851
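
The test calls themselves are not shown. Judging by the `data:` lines they take two vectors, e.g. `fisher.test(hlthsub2$smoke100, hlthsub2$cvdinfr4)`. A minimal sketch on a hypothetical 2x2 table of counts:

```r
# Hypothetical counts: rows = smoke100 (1/0), columns = cvdinfr4 (1/0).
tab <- matrix(c(30, 10, 20, 40), nrow = 2,
              dimnames = list(smoke100 = c("1", "0"), cvdinfr4 = c("1", "0")))

res <- fisher.test(tab)

# Conditional maximum-likelihood estimate of the odds ratio and its 95% interval.
print(res$estimate)
print(res$conf.int)
print(res$p.value)
```

An odds ratio above 1 with a confidence interval that excludes 1 indicates a positive association between the two variables. These tests are unadjusted, so their odds ratios can differ from the regression coefficients, which hold the other variables fixed.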

Run Density Plots of a few selected health outcomes

Inference of Analysis

From the above results we can infer that:
   1.  Sleeping time has a weak negative association with depression, smoking and obesity
       (correlations of roughly -0.02 to -0.05). Respondents who smoke, are overweight or obese, or
       report depression tend to sleep slightly fewer hours.
   2.  Exercise appears to be negatively associated with depression, diabetes, high blood pressure and
       cholesterol. This is consistent with exercise improving health and reducing chronic disorders
       such as high blood pressure and cholesterol, although survey data cannot establish cause.
   3.  Cardiovascular disorders are positively associated with diabetes, high blood pressure, high
       cholesterol, depression and smoking. Overweight or obesity has an unadjusted odds ratio above 1
       (1.45), but its coefficient is slightly negative (-0.074) once the other variables are in the
       model. Sleeping time shows only a very weak relation, and exercise was not part of the
       regression models.

Therefore, the self-reported survey data contain ample information that supports the hypothesis:
there are associations between the causes and the outcomes. To lead a good life it is advisable to avoid
excesses and be physically active irrespective of age or sex.

Tahir Hussain

20th July 2016