knitr::opts_chunk$set(error = TRUE, tidy = FALSE, size = "small")
Load packages
set.seed(123)
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
BRFSS 2013 - SELF-REPORTED SURVEILLANCE SURVEY ANALYSIS
Part 1: About Data
The Behavioral Risk Factor Surveillance System (BRFSS) is a collaborative project between all of the states in
the United States (US), participating US territories and the Centers for Disease Control and Prevention
(CDC). The BRFSS is administered and supported by CDC's Population Health Surveillance Branch, under the Division
of Population Health at the National Center for Chronic Disease Prevention and Health Promotion.
BRFSS is an ongoing surveillance system designed to measure behavioral risk factors for the non-institutionalized adult population (18 years of age and older) residing in the US. Because the answers are self-reported, the data can be biased. The data source is nevertheless
considered valid and generalizable; its reliability is around 75% and compares well with other surveys of a similar nature.
The BRFSS objective is to collect uniform, state-specific data on preventive health practices and risk behaviors. Factors assessed include tobacco use, HIV/AIDS knowledge and
prevention, exercise, immunization, health status, healthy days (health-related quality of life), health
care access, inadequate sleep, hypertension awareness, cholesterol awareness, chronic health conditions,
alcohol consumption, fruit and vegetable consumption, arthritis burden, and seatbelt use.
Dataset notes
The dataset is provided in R Workspace (.Rdata) format for analysis with RStudio. Categorical values are factors
in the R workspace. All missing values are coded NA in the R workspace.
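As a minimal sketch of what these notes mean in practice, the snippet below round-trips a small made-up data frame through an `.RData` file and checks that a categorical column comes back as a factor and that missing values come back as `NA`. The toy data frame and the temporary file are illustrative assumptions; the name of the real workspace file is not given in this report.

```r
# Toy stand-in for brfss2013 (hypothetical values, not survey data)
toy <- data.frame(
  genhlth  = factor(c("Good", "Fair", NA, "Excellent")),
  sleptim1 = c(7, NA, 6, 8)
)

# Save to a temporary .RData workspace file and load it into a fresh environment
path <- tempfile(fileext = ".RData")
save(toy, file = path)
env <- new.env()
loaded_names <- load(path, envir = env)   # load() returns the names of the restored objects

restored  <- env$toy
is_factor <- is.factor(restored$genhlth)  # categorical values are factors
na_counts <- colSums(is.na(restored))     # missing values are coded NA
print(loaded_names)
print(na_counts)
```

Loading into a separate environment keeps the restored objects from silently overwriting anything already in the session.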
Part 2: Research Questions
Research question 1:
What is the health status of the population by age, sex and activity level?
Research question 2:
Which behaviours and habits might cause poor health?
Research question 3:
How are behaviours and lifestyle correlated with chronic health conditions?
The following variables will be used:
genhlth: General health status
sex: Sex of respondent
X_age_g: Age group of respondent
qlhlth2: Number of days full of energy in the past 30 days
sleptim1: Average hours of sleep in a 24-hour period
cvdinfr4: Ever diagnosed with heart attack
addepev2: Ever told you had a depressive disorder
smoke100: Smoked at least 100 cigarettes in entire life
diabete3: Ever told you have diabetes
exerany2: Exercise or physical activity in past 30 days
toldhi2: Ever told blood cholesterol is high
X_bmi5cat: Body mass index (BMI) categories
X_rfbmi5: Overweight or obese
bphigh4: Ever told blood pressure is high
X_pacat1: Physical activity category
Research question 1:
Health status by: 1. sex, 2. age group, 3. active or passive life.
Exploring Data Frame brfss2013
dim(brfss2013)
## [1] 491775 330
names(brfss2013)
## [1] "X_state" "fmonth" "idate" "imonth" "iday"
## [6] "iyear" "dispcode" "seqno" "X_psu" "ctelenum"
## [11] "pvtresd1" "colghous" "stateres" "cellfon3" "ladult"
## [16] "numadult" "nummen" "numwomen" "genhlth" "physhlth"
## [21] "menthlth" "poorhlth" "hlthpln1" "persdoc2" "medcost"
## [26] "checkup1" "sleptim1" "bphigh4" "bpmeds" "bloodcho"
## [31] "cholchk" "toldhi2" "cvdinfr4" "cvdcrhd4" "cvdstrk3"
## [36] "asthma3" "asthnow" "chcscncr" "chcocncr" "chccopd1"
## [41] "havarth3" "addepev2" "chckidny" "diabete3" "veteran3"
## [46] "marital" "children" "educa" "employ1" "income2"
## [51] "weight2" "height3" "numhhol2" "numphon2" "cpdemo1"
## [56] "cpdemo4" "internet" "renthom1" "sex" "pregnant"
## [61] "qlactlm2" "useequip" "blind" "decide" "diffwalk"
## [66] "diffdres" "diffalon" "smoke100" "smokday2" "stopsmk2"
## [71] "lastsmk2" "usenow3" "alcday5" "avedrnk2" "drnk3ge5"
## [76] "maxdrnks" "fruitju1" "fruit1" "fvbeans" "fvgreen"
## [81] "fvorang" "vegetab1" "exerany2" "exract11" "exeroft1"
## [86] "exerhmm1" "exract21" "exeroft2" "exerhmm2" "strength"
## [91] "lmtjoin3" "arthdis2" "arthsocl" "joinpain" "seatbelt"
## [96] "flushot6" "flshtmy2" "tetanus" "pneuvac3" "hivtst6"
## [101] "hivtstd3" "whrtst10" "pdiabtst" "prediab1" "diabage2"
## [106] "insulin" "bldsugar" "feetchk2" "doctdiab" "chkhemo3"
## [111] "feetchk" "eyeexam" "diabeye" "diabedu" "painact2"
## [116] "qlmentl2" "qlstres2" "qlhlth2" "medicare" "hlthcvrg"
## [121] "delaymed" "dlyother" "nocov121" "lstcovrg" "drvisits"
## [126] "medscost" "carercvd" "medbills" "ssbsugar" "ssbfrut2"
## [131] "wtchsalt" "longwtch" "dradvise" "asthmage" "asattack"
## [136] "aservist" "asdrvist" "asrchkup" "asactlim" "asymptom"
## [141] "asnoslep" "asthmed3" "asinhalr" "harehab1" "strehab1"
## [146] "cvdasprn" "aspunsaf" "rlivpain" "rduchart" "rducstrk"
## [151] "arttoday" "arthwgt" "arthexer" "arthedu" "imfvplac"
## [156] "hpvadvc2" "hpvadsht" "hadmam" "howlong" "profexam"
## [161] "lengexam" "hadpap2" "lastpap2" "hadhyst2" "bldstool"
## [166] "lstblds3" "hadsigm3" "hadsgco1" "lastsig3" "pcpsaad2"
## [171] "pcpsadi1" "pcpsare1" "psatest1" "psatime" "pcpsars1"
## [176] "pcpsade1" "pcdmdecn" "rrclass2" "rrcognt2" "rratwrk2"
## [181] "rrhcare3" "rrphysm2" "rremtsm2" "misnervs" "mishopls"
## [186] "misrstls" "misdeprd" "miseffrt" "miswtles" "misnowrk"
## [191] "mistmnt" "mistrhlp" "misphlpf" "scntmony" "scntmeal"
## [196] "scntpaid" "scntwrk1" "scntlpad" "scntlwk1" "scntvot1"
## [201] "rcsgendr" "rcsrltn2" "casthdx2" "casthno2" "emtsuprt"
## [206] "lsatisfy" "ctelnum1" "cellfon2" "cadult" "pvtresd2"
## [211] "cclghous" "cstate" "landline" "pctcell" "qstver"
## [216] "qstlang" "mscode" "X_ststr" "X_strwt" "X_rawrake"
## [221] "X_wt2rake" "X_imprace" "X_impnph" "X_impeduc" "X_impmrtl"
## [226] "X_imphome" "X_chispnc" "X_crace1" "X_impcage" "X_impcrac"
## [231] "X_impcsex" "X_cllcpwt" "X_dualuse" "X_dualcor" "X_llcpwt2"
## [236] "X_llcpwt" "X_rfhlth" "X_hcvu651" "X_rfhype5" "X_cholchk"
## [241] "X_rfchol" "X_ltasth1" "X_casthm1" "X_asthms1" "X_drdxar1"
## [246] "X_prace1" "X_mrace1" "X_hispanc" "X_race" "X_raceg21"
## [251] "X_racegr3" "X_race_g1" "X_ageg5yr" "X_age65yr" "X_age_g"
## [256] "htin4" "htm4" "wtkg3" "X_bmi5" "X_bmi5cat"
## [261] "X_rfbmi5" "X_chldcnt" "X_educag" "X_incomg" "X_smoker3"
## [266] "X_rfsmok3" "drnkany5" "drocdy3_" "X_rfbing5" "X_drnkdy4"
## [271] "X_drnkmo4" "X_rfdrhv4" "X_rfdrmn4" "X_rfdrwm4" "ftjuda1_"
## [276] "frutda1_" "beanday_" "grenday_" "orngday_" "vegeda1_"
## [281] "X_misfrtn" "X_misvegn" "X_frtresp" "X_vegresp" "X_frutsum"
## [286] "X_vegesum" "X_frtlt1" "X_veglt1" "X_frt16" "X_veg23"
## [291] "X_fruitex" "X_vegetex" "X_totinda" "metvl11_" "metvl21_"
## [296] "maxvo2_" "fc60_" "actin11_" "actin21_" "padur1_"
## [301] "padur2_" "pafreq1_" "pafreq2_" "X_minac11" "X_minac21"
## [306] "strfreq_" "pamiss1_" "pamin11_" "pamin21_" "pa1min_"
## [311] "pavig11_" "pavig21_" "pa1vigm_" "X_pacat1" "X_paindx1"
## [316] "X_pa150r2" "X_pa300r2" "X_pa30021" "X_pastrng" "X_parec1"
## [321] "X_pastae1" "X_lmtact1" "X_lmtwrk1" "X_lmtscl1" "X_rfseat2"
## [326] "X_rfseat3" "X_flshot6" "X_pneumo2" "X_aidtst3" "X_age80"
Preparation of subsets for Question 1
Health tends to deteriorate with age. Females appear to have better health in old age.
The level of activity also decreases with age.
General health by sex, age and active life.
# Days of poor physical, mental and overall health (complete cases only)
hlthpmp <- select(brfss2013, physhlth, menthlth, poorhlth) %>%
  filter(!is.na(physhlth), !is.na(menthlth), !is.na(poorhlth))
hist(hlthpmp$physhlth, main = "Physical Health", xlab = "Physical Health", col = rainbow(5))
hist(hlthpmp$menthlth, main = "Mental Health", xlab = "Mental Health", col = rainbow(5))
hist(hlthpmp$poorhlth, main = "Health status", xlab = "Poor Health", col = rainbow(5))
# Physical activity category by sex, one panel per age group
health <- brfss2013 %>%
  dplyr::select(genhlth, X_age_g, sex, X_pacat1) %>%
  dplyr::filter(sex %in% c("Male", "Female"))
levels(health$sex) <- c("Male", "Female")
ggplot(health, aes(x = sex, fill = X_pacat1)) +
  geom_bar() + facet_grid(~X_age_g) + coord_flip() +
  ggtitle("Health, age, sex and activity")
rm(hlthpmp)
rm(health)
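Because the stacked bars show raw counts, groups of different size are hard to compare by eye. One possible complement, sketched here on a made-up data frame (the column names follow the report, the values and the two activity labels are invented), is to tabulate activity level within each age group as row proportions.

```r
# Hypothetical miniature of the `health` subset
health_toy <- data.frame(
  X_age_g  = factor(c("Age 18 to 24", "Age 18 to 24", "Age 65 or older",
                      "Age 65 or older", "Age 65 or older", "Age 18 to 24")),
  X_pacat1 = factor(c("Active", "Active", "Inactive",
                      "Inactive", "Active", "Inactive"))
)

counts    <- table(health_toy$X_age_g, health_toy$X_pacat1)  # age group x activity
row_share <- prop.table(counts, margin = 1)                  # each row sums to 1
print(round(row_share, 2))
```

With `margin = 1` every age group is scaled to 1, so the share of active respondents can be compared across groups regardless of how many people each group contains.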
Research question 2
Behaviours and their implications for human health: how exercise, smoking and sleeping time relate to obesity,
diabetes, high blood pressure, high cholesterol, cardiovascular disorders and depression in respondents.
The plots show their association.
Behaviours and Habits
A. Exercise can improve health by reducing obesity, high blood pressure and cholesterol.
exer <- brfss2013 %>%
  dplyr::select(exerany2, bphigh4, sex) %>%
  dplyr::filter(bphigh4 %in% c("Yes", "No"))
# Relabel the four bphigh4 levels (same order as the original factor levels)
levels(exer$bphigh4) <- c("Yes, been told blood pressure high","Yes, but female told only during pregnancy","No, never been told blood pressure high","Yes, been told borderline or pre-hypertensive")
ggplot(exer, aes(x = exerany2, fill = sex)) + geom_bar() + facet_grid(~bphigh4) + coord_flip() + ggtitle("Exercise vs High Blood Pressure")
# The next plots use `disease`, the complete-case subset built in the "Chronic Diseases and Health" section
qplot(disease$toldhi2)
qplot(X_age_g, data = disease, fill = toldhi2, xlab = "Age group", ylab = "Count", main = "High blood cholesterol vs Age")
ggplot(disease, aes(x = toldhi2, fill = exerany2)) + geom_bar() + facet_grid(~cvdinfr4) + coord_flip() + ggtitle("Exercise vs High cholesterol")
rm(exer)
B. Smoking is harmful to health. The plots below show how general health relates to this habit.
smoke <- brfss2013 %>% dplyr::select(genhlth,smoke100,smokday2,sex,X_age_g) %>% dplyr::filter(smoke100 %in% c("Yes","No"))
levels(smoke$genhlth) <- c("Excellent","Very good","Good","Fair","Poor")
ggplot(smoke,aes(x = smoke100,fill = smoke100)) + geom_bar() + facet_grid(~genhlth) + coord_flip() + ggtitle("Smoking Harms General health")
smoke <- brfss2013 %>%
  filter(!is.na(smoke100), !is.na(genhlth)) %>%
  group_by(genhlth, smoke100) %>%
  summarise(n = n()) %>%
  mutate(pct_total_stacked = n / sum(n),
         position_stacked = cumsum(pct_total_stacked) - 0.5 * pct_total_stacked,
         position_n = cumsum(n) - 0.5 * n)
ggplot(smoke, aes(x = genhlth, y = pct_total_stacked, fill = smoke100)) +
  geom_bar(stat = 'identity', width = .7, color = "black") +
  geom_text(aes(label = ifelse(smoke100 == 'Yes', paste0(sprintf("%.0f", pct_total_stacked * 100), "%"), ""), y = position_stacked), color = "white") +
  coord_flip() +
  scale_y_continuous() +
  labs(y = "", x = "")
rm(smoke)
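The `position_stacked` column used for the labels is the mid-point of each stacked segment: the running total of the shares minus half of the segment's own share. A small worked example with invented counts for a single `genhlth` level:

```r
# Hypothetical counts of smoke100 answers within one genhlth level
n <- c(Yes = 30, No = 70)

pct      <- n / sum(n)                  # share of each answer: 0.3 and 0.7
midpoint <- cumsum(pct) - 0.5 * pct     # centre of each stacked segment
print(rbind(pct, midpoint))
```

These mid-points assume the segments are stacked in the same order as the rows. The stacking order of ggplot2 has changed between versions, so it is worth checking where the labels land; `position_stack(vjust = 0.5)` inside `geom_text()` is a commonly used alternative that lets ggplot2 compute the centres itself.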
C. Sleeping time and its implications for health. Here a comparison is shown.
sleep <- brfss2013 %>% select(sleptim1, poorhlth, physhlth)
# Keep only plausible reports (0 to 24 hours per day)
sleep <- subset(sleep, sleptim1 >= 0 & sleptim1 < 25)
# 7 to 9 hours of sleep is flagged as optimal
sleep <- within(sleep, {sleepstatus <- ifelse(sleptim1 >= 7 & sleptim1 < 10, "opt", "non-op")})
sleep_poorhlth <- subset(sleep, poorhlth >= 0, select = c(sleptim1, poorhlth, sleepstatus))
sleep_physhlth <- subset(sleep, physhlth >= 0, select = c(sleptim1, physhlth, sleepstatus))
ggplot(data = sleep_poorhlth, aes(x = sleptim1, y = poorhlth)) + geom_point()
ggplot(data = sleep_physhlth, aes(x = sleptim1, y = physhlth)) + geom_point()
rm(sleep)
rm(sleep_poorhlth)
rm(sleep_physhlth)
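The scatter plots draw many identical integer points on top of each other, so a numeric summary per sleep group can be easier to read. The following sketch uses invented values (450 mimics the implausible maximum of `sleptim1` in the data) and base R only; it is an illustration, not part of the original analysis.

```r
# Hypothetical responses; 450 mimics the implausible maximum seen in the data
sleep_toy <- data.frame(
  sleptim1 = c(5, 7, 8, 9, 12, 450, NA),
  poorhlth = c(10, 2, 0, 1, 15, 3, 4)
)

# Keep only reports that fit inside a 24-hour day (rows with NA are dropped too)
valid <- subset(sleep_toy, sleptim1 >= 0 & sleptim1 <= 24)

# Flag 7 to 9 hours as "opt" and everything else as "non-op", as in the report
valid$sleepstatus <- ifelse(valid$sleptim1 >= 7 & valid$sleptim1 < 10, "opt", "non-op")

# Mean number of poor-health days in each sleep group
mean_poor <- tapply(valid$poorhlth, valid$sleepstatus, mean)
print(mean_poor)
```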
Poor health is also associated with chronic disorders: diabetes, heart disease, high blood pressure, high cholesterol and depressive disorders.
Chronic Diseases and Health
```
disease <- select(brfss2013, genhlth, sex, X_age_g, X_rfbmi5, X_bmi5cat, diabete3, cvdinfr4,
                  bphigh4, addepev2, exerany2, toldhi2, smoke100, sleptim1) %>%
  filter(!is.na(genhlth), !is.na(sex), !is.na(exerany2), !is.na(X_age_g), !is.na(toldhi2),
         !is.na(smoke100), !is.na(diabete3), !is.na(addepev2), !is.na(cvdinfr4),
         !is.na(bphigh4), !is.na(X_rfbmi5), !is.na(X_bmi5cat), !is.na(sleptim1))
```
Diabetes Prevalence and General Health
table(disease$diabete3,disease$X_age_g)
##
## Age 18 to 24 Age 25 to 34
## Yes 152 713
## Yes, but female told only during pregnancy 60 449
## No 8765 25044
## No, pre-diabetes or borderline diabetes 61 185
##
## Age 35 to 44 Age 45 to 54
## Yes 2325 6699
## Yes, but female told only during pregnancy 854 697
## No 38745 57530
## No, pre-diabetes or borderline diabetes 458 1031
##
## Age 55 to 64 Age 65 or older
## Yes 14056 27257
## Yes, but female told only during pregnancy 488 557
## No 72603 100907
## No, pre-diabetes or borderline diabetes 1809 3347
diabete <- brfss2013 %>%
  filter(!is.na(diabete3), !is.na(genhlth)) %>%
  group_by(genhlth, diabete3) %>%
  summarise(n = n()) %>%
  mutate(pct_total_stacked = n / sum(n),
         position_stacked = cumsum(pct_total_stacked) - 0.5 * pct_total_stacked,
         position_n = cumsum(n) - 0.5 * n)
# Counts: `weight = n` makes geom_bar() sum n instead of counting rows
ggplot(diabete, aes(x = genhlth)) +
  geom_bar(aes(fill = diabete3, weight = n), width = .7, color = "black") +
  geom_text(aes(label = n, y = position_n), color = "white")
# Shares within each general-health level
ggplot(diabete, aes(x = genhlth, y = pct_total_stacked, fill = diabete3)) +
  geom_bar(stat = 'identity', width = .7, color = "black") +
  geom_text(aes(label = ifelse(diabete3 == 'Yes', paste0(sprintf("%.0f", pct_total_stacked * 100), "%"), ""), y = position_stacked), color = "white") +
  coord_flip() +
  scale_y_continuous() +
  labs(y = "", x = "")
rm(diabete)
Prevalence of Cardiovascular Disorders
Yes No
24347 340445
Yes No
6.674214 93.325786
Age 18 to 24 Age 25 to 34 Age 35 to 44 Age 45 to 54 Age 55 to 64 Age 65 or older
Yes 0.01288405 0.04468300 0.15077085 0.60966249 1.52278559 4.33342836
No 2.46469221 7.18985065 11.46735674 17.47105200 22.86261760 31.87021645
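The code behind the prevalence tables above is not echoed in this report. Tables of this shape can be obtained with `table()` and `prop.table()`; the sketch below shows the idea on a made-up miniature of the `disease` subset (values invented, column names as in the report) and is an assumption about the approach rather than the original code.

```r
# Hypothetical miniature of the `disease` subset
disease_toy <- data.frame(
  cvdinfr4 = factor(c("Yes", "No", "No", "No", "Yes", "No", "No", "No"),
                    levels = c("Yes", "No")),
  X_age_g  = factor(c("Age 65 or older", "Age 18 to 24", "Age 18 to 24",
                      "Age 65 or older", "Age 65 or older", "Age 18 to 24",
                      "Age 65 or older", "Age 18 to 24"))
)

counts  <- table(disease_toy$cvdinfr4)                     # absolute counts
percent <- prop.table(counts) * 100                        # share of all respondents
by_age  <- prop.table(table(disease_toy$cvdinfr4,
                            disease_toy$X_age_g)) * 100    # cell percentages, total 100
print(counts)
print(percent)
print(round(by_age, 1))
```

Without a `margin` argument, `prop.table()` divides by the grand total, so the cells of the age table add up to 100 across the whole table and not within each age group.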
Stress and depression are associated with work, environment and behaviours. Such habits seem to trigger chronic
disorders, e.g. high blood pressure, diabetes and cardiovascular diseases. The correlations are calculated under
Question 3.
prop.table(ftable(table(disease$addepev2,disease$X_age_g)))*100
## Age 18 to 24 Age 25 to 34 Age 35 to 44 Age 45 to 54 Age 55 to 64 Age 65 or older
##
## Yes 0.436139 1.444933 2.423025 4.241869 5.863067 5.772331
## No 2.041437 5.789601 9.195103 13.838845 18.522336 30.431314
qplot(disease$addepev2)
qplot(X_age_g, data = disease, fill = addepev2, xlab = "Age group", ylab = "Count", main = "Depression vs age")
ggplot(disease, aes(x = addepev2, fill = diabete3)) + geom_bar() + facet_grid(~addepev2) + coord_flip() + ggtitle("Depression vs Diabetes")
brfss_dep <- brfss2013 %>%
  filter(!is.na(addepev2), !is.na(diabete3)) %>%
  # Group by diabetes status first so the shares sum to 1 within each plotted bar
  group_by(diabete3, addepev2) %>%
  summarise(n = n()) %>%
  mutate(pct_total_stacked = n / sum(n),
         position_stacked = cumsum(pct_total_stacked) - 0.5 * pct_total_stacked,
         position_n = cumsum(n) - 0.5 * n)
ggplot(brfss_dep, aes(x = diabete3, y = pct_total_stacked, fill = addepev2)) +
  geom_bar(stat = 'identity', width = .7, color = "black") +
  geom_text(aes(label = ifelse(addepev2 == 'Yes', paste0(sprintf("%.0f", pct_total_stacked * 100), "%"), ""), y = position_stacked), color = "white") +
  coord_flip() +
  scale_y_continuous() +
  labs(y = "", x = "")
High Blood Pressure Prevalence
                                             Count     Percent
Yes                                         161096  44.1610562
Yes, but female told only during pregnancy    2258   0.6189829
No                                          197358  54.1015154
Told borderline or pre-hypertensive           4080   1.1184456
Age 18 to 24 Age 25 to 34 Age 35 to 44 Age 45 to 54 Age 55 to 64 Age 65 or older
Yes 0.223140858 1.020581592 2.488541415 6.123215421 11.593181868 22.712395009
Yes, but female told only during pregnancy 0.025219851 0.146384789 0.161461874 0.106087853 0.084705805 0.095122700
No 2.220443431 6.018224084 8.859021031 11.644443957 12.393638018 12.965744863
Told borderline or pre-hypertensive 0.008772122 0.049343187 0.109103270 0.206967258 0.313877497 0.430382245
High Blood cholesterol Prevalence
Yes No
160553 204239
Yes No
44.0122 55.9878
Age 18 to 24 Age 25 to 34 Age 35 to 44 Age 45 to 54 Age 55 to 64 Age 65 or older
Yes 0.234106 1.142295 3.027753 7.028663 12.330863 20.248525
No 2.243470 6.092239 8.590375 11.052052 12.054541 15.955120
Research question 3:
Analyzing the association of behaviours with health outcomes by correlation and multiple logistic regression
vars <- names(brfss2013) %in% c("sleptim1","cvdinfr4","addepev2","diabete3","bphigh4","toldhi2",
"smoke100","X_rfbmi5")
hlthsub2 <- brfss2013[vars]
names(hlthsub2)
## [1] "sleptim1" "bphigh4" "toldhi2" "cvdinfr4" "addepev2" "diabete3"
## [7] "smoke100" "X_rfbmi5"
MissingData <- function(x){sum(is.na(x))/length(x)*100}
apply(hlthsub2, 2, MissingData)
## sleptim1 bphigh4 toldhi2 cvdinfr4 addepev2 diabete3
## 1.5021097 0.2887499 14.5721112 0.5260536 0.4654568 0.1691831
## smoke100 X_rfbmi5
## 3.0339078 5.4352092
summary(hlthsub2)
## sleptim1 bphigh4
## Min. : 0.000 Yes :198921
## 1st Qu.: 6.000 Yes, but female told only during pregnancy: 3680
## Median : 7.000 No :282687
## Mean : 7.052 Told borderline or pre-hypertensive : 5067
## 3rd Qu.: 8.000 NA's : 1420
## Max. :450.000
## NA's :7387
## toldhi2 cvdinfr4 addepev2
## Yes :183501 Yes : 29284 Yes : 95779
## No :236612 No :459904 No :393707
## NA's: 71662 NA's: 2587 NA's: 2289
##
##
##
##
## diabete3 smoke100
## Yes : 62363 Yes :215201
## Yes, but female told only during pregnancy: 4602 No :261654
## No :415374 NA's: 14920
## No, pre-diabetes or borderline diabetes : 8604
## NA's : 832
##
##
## X_rfbmi5
## No :163161
## Yes :301885
## NA's: 26729
##
##
##
##
summary(hlthsub2$X_rfbmi5)
## No Yes NA's
## 163161 301885 26729
hlthsub2$X_rfbmi5 <- replace(hlthsub2$X_rfbmi5, which(is.na(hlthsub2$X_rfbmi5)), "Yes")
summary(hlthsub2$X_rfbmi5)
## No Yes
## 163161 328614
summary(hlthsub2$smoke100)
## Yes No NA's
## 215201 261654 14920
hlthsub2$addepev2 <- replace(hlthsub2$addepev2, which(is.na(hlthsub2$addepev2)), "Yes")
summary(hlthsub2$addepev2)
## Yes No
## 98068 393707
hlthsub2$smoke100 <- replace(hlthsub2$smoke100, which(is.na(hlthsub2$smoke100)), "Yes")
summary(hlthsub2$smoke100)
## Yes No
## 230121 261654
hlthsub2$diabete3 <- replace(hlthsub2$diabete3, which(is.na(hlthsub2$diabete3)), "Yes")
summary(hlthsub2$diabete3)
## Yes
## 63195
## Yes, but female told only during pregnancy
## 4602
## No
## 415374
## No, pre-diabetes or borderline diabetes
## 8604
hlthsub2$bphigh4 <- replace(hlthsub2$bphigh4, which(is.na(hlthsub2$bphigh4)), "No")
summary(hlthsub2$bphigh4)
## Yes
## 198921
## Yes, but female told only during pregnancy
## 3680
## No
## 284107
## Told borderline or pre-hypertensive
## 5067
summary(hlthsub2$sleptim1)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 6.000 7.000 7.052 8.000 450.000 7387
mean(hlthsub2$sleptim1,na.rm = T)
## [1] 7.052099
hlthsub2$sleptim1 <- replace(hlthsub2$sleptim1, which(is.na(hlthsub2$sleptim1)), 7)
summary(hlthsub2$sleptim1)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 6.000 7.000 7.051 8.000 450.000
summary(hlthsub2$toldhi2)
## Yes No NA's
## 183501 236612 71662
hlthsub2$toldhi2 <- replace(hlthsub2$toldhi2, which(is.na(hlthsub2$toldhi2)), "No")
summary(hlthsub2$toldhi2)
## Yes No
## 183501 308274
summary(hlthsub2$cvdinfr4)
## Yes No NA's
## 29284 459904 2587
hlthsub2$cvdinfr4 <- replace(hlthsub2$cvdinfr4, which(is.na(hlthsub2$cvdinfr4)), "Yes")
summary(hlthsub2$cvdinfr4)
## Yes No
## 31871 459904
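The replacement pattern used above can be tried on a small made-up factor. The sketch also defines a hypothetical helper, `impute_mode()`, that fills gaps with the most frequent observed level; it is offered only as one alternative convention for comparison and is not the method used in this report.

```r
# Hypothetical factor with two missing answers
smoke_toy <- factor(c("Yes", "No", NA, "No", NA, "No"), levels = c("Yes", "No"))

# Pattern used in the report: overwrite every NA with one fixed level
filled_fixed <- replace(smoke_toy, which(is.na(smoke_toy)), "Yes")

# Hypothetical helper: overwrite NA with the most frequent observed level
impute_mode <- function(x) {
  mode_level <- names(which.max(table(x)))   # table() ignores NA by default
  replace(x, which(is.na(x)), mode_level)
}
filled_mode <- impute_mode(smoke_toy)

print(summary(filled_fixed))
print(summary(filled_mode))
```

In this toy case the two conventions give different counts, which shows that the choice of fill value shifts the prevalence of a category, most visibly for variables with many missing answers such as `toldhi2`.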
# Collapse bphigh4 to Yes/No: pregnancy-only and borderline answers count as "Yes"
hlthsub2$bphigh4 <- as.factor(ifelse(hlthsub2$bphigh4 %in% c("Yes",
    "Yes, but female told only during pregnancy",
    "Told borderline or pre-hypertensive"), "Yes", "No"))
# Collapse diabete3 to Yes/No: pregnancy-only answers count as "Yes".
# "No, pre-diabetes or borderline diabetes" stays "No": the label tested originally,
# "Told borderline diabetes or pre-diabetes", is not a level of diabete3.
hlthsub2$diabete3 <- as.factor(ifelse(hlthsub2$diabete3 %in% c("Yes",
    "Yes, but female told only during pregnancy"), "Yes", "No"))
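The recoding step maps several answer levels onto a single yes/no indicator. A compact way to express the same idea, sketched on invented answers that use the four `bphigh4` levels listed in the summary output above:

```r
# Hypothetical answers using the four bphigh4 levels
bp <- factor(c("Yes", "No", "Told borderline or pre-hypertensive",
               "Yes, but female told only during pregnancy", "No"))

yes_levels <- c("Yes",
                "Yes, but female told only during pregnancy",
                "Told borderline or pre-hypertensive")

# Any "yes-like" answer becomes 1, everything else 0
bp_binary <- as.integer(bp %in% yes_levels)
print(table(bp, bp_binary))
```

Note that `%in%` turns `NA` into 0, whereas a comparison with `==` would keep it as `NA`, so missing values should be handled before this step, as is done in the report.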
library(Hmisc)
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
##
## combine, src, summarize
## The following objects are masked from 'package:base':
##
## format.pval, round.POSIXt, trunc.POSIXt, units
Converting factors to numerics for analysis.
hlthsub2$addepev2 <- ifelse(hlthsub2$addepev2=="Yes", 1, 0)
hlthsub2$cvdinfr4 <- ifelse(hlthsub2$cvdinfr4=="Yes", 1, 0)
hlthsub2$smoke100 <- ifelse(hlthsub2$smoke100=="Yes", 1, 0)
hlthsub2$diabete3 <- ifelse(hlthsub2$diabete3=="Yes", 1, 0)
hlthsub2$toldhi2 <- ifelse(hlthsub2$toldhi2 =="Yes", 1, 0)
hlthsub2$bphigh4 <- ifelse(hlthsub2$bphigh4=="Yes", 1, 0)
hlthsub2$X_rfbmi5 <- ifelse(hlthsub2$X_rfbmi5=="Yes", 1, 0)
Checking Missing Values
MissingData <- function(x){sum(is.na(x))/length(x)*100}
apply(hlthsub2, 2, MissingData)
## sleptim1 bphigh4 toldhi2 cvdinfr4 addepev2 diabete3 smoke100 X_rfbmi5
## 0 0 0 0 0 0 0 0
Research question 3:
Finding correlations and plotting them with corrplot.
Fitting models by logistic regression with the binomial family.
library(corrplot)
M <- cor(hlthsub2)
corrplot(M, method = "number", pch = 23, col = rainbow(7))
corrplot(M, type = "upper", order = "hclust", tl.col = rainbow(7), tl.srt = 45, method = "number", pch = 23)
summary(M)
sleptim1 bphigh4 toldhi2
Min. :-0.051986 Min. :0.0009575 Min. :0.003714
1st Qu.:-0.025489 1st Qu.:0.0722225 1st Qu.:0.091977
Median : 0.001235 Median :0.1804427 Median :0.138721
Mean : 0.112901 Mean :0.2597608 Mean :0.246796
3rd Qu.: 0.002340 3rd Qu.:0.2698770 3rd Qu.:0.231654
Max. : 1.000000 Max. :1.0000000 Max. :1.000000
cvdinfr4 addepev2 diabete3
Min. :0.001512 Min. :-0.05199 Min. :0.001882
1st Qu.:0.054606 1st Qu.: 0.05652 1st Qu.:0.070512
Median :0.122297 Median : 0.07588 Median :0.149478
Mean :0.208511 Mean : 0.17623 Mean :0.234932
3rd Qu.:0.156271 3rd Qu.: 0.09960 3rd Qu.:0.215495
Max. :1.000000 Max. : 1.00000 Max. :1.000000
smoke100 X_rfbmi5
Min. :-0.02454 Min. :-0.02833
1st Qu.: 0.03625 1st Qu.: 0.03496
Median : 0.07328 Median : 0.08737
Mean : 0.17264 Mean : 0.19287
3rd Qu.: 0.09651 3rd Qu.: 0.15817
Max. : 1.00000 Max. : 1.00000
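All columns of `hlthsub2` except `sleptim1` are 0/1 indicators at this point, and the Pearson correlation of two such indicators equals the phi coefficient of their 2x2 table. The sketch below checks this on two invented indicator vectors.

```r
# Hypothetical 0/1 indicators for two conditions
x <- c(1, 1, 1, 0, 0, 0, 0, 1, 0, 0)
y <- c(1, 1, 0, 0, 0, 0, 1, 1, 0, 0)

pearson <- cor(x, y)

# Phi coefficient from the 2x2 table of counts
tab <- table(x, y)
n11 <- tab["1", "1"]; n00 <- tab["0", "0"]
n10 <- tab["1", "0"]; n01 <- tab["0", "1"]
phi <- (n11 * n00 - n10 * n01) /
  sqrt(sum(tab["1", ]) * sum(tab["0", ]) * sum(tab[, "1"]) * sum(tab[, "0"]))

print(c(pearson = pearson, phi = phi))
```

Reading the correlation matrix as a set of phi coefficients explains why even strongly related conditions show modest values: phi is bounded well below 1 when the two conditions have very different prevalences.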
Correlation and fitting of the models for cardiovascular disorder, depression and diabetes against the other variables.
The fitted model for cardiovascular disorder against all other variables is summarized below.
Please cite as:
Hlavac, Marek (2015). stargazer: Well-Formatted Regression and Summary Statistics Tables.
R package version 5.2. http://CRAN.R-project.org/package=stargazer
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Call:
glm(formula = cvdinfr4 ~ ., family = "binomial", data = hlthsub2)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.9064 -0.4012 -0.2552 -0.1755 2.9322
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.218351 0.026644 -158.325 < 2e-16 ***
sleptim1 0.006618 0.002905 2.278 0.0227 *
bphigh4 1.043494 0.014198 73.496 < 2e-16 ***
toldhi2 0.734294 0.013012 56.433 < 2e-16 ***
addepev2 0.257595 0.013494 19.089 < 2e-16 ***
diabete3 0.756986 0.013609 55.624 < 2e-16 ***
smoke100 0.642813 0.012433 51.701 < 2e-16 ***
X_rfbmi5 -0.073540 0.013952 -5.271 1.36e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 236049 on 491774 degrees of freedom
Residual deviance: 210390 on 491767 degrees of freedom
AIC: 210406
Number of Fisher Scoring iterations: 6
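The call that produced `fit` is not echoed, but the `Call:` line shows its form. The sketch below fits the same kind of model to simulated data (all variable names and effect sizes here are invented) and converts the coefficients to odds ratios with `exp()`. On that scale, the `bphigh4` estimate of 1.043 reported above corresponds to an odds ratio of roughly exp(1.043) ≈ 2.84.

```r
set.seed(123)

# Simulated data: the outcome depends on a binary risk factor with log-odds 1
n       <- 5000
risk    <- rbinom(n, 1, 0.4)
noise   <- rnorm(n)                      # unrelated numeric covariate
p       <- plogis(-2 + 1 * risk)         # true probabilities
outcome <- rbinom(n, 1, p)
sim     <- data.frame(outcome, risk, noise)

# Logistic regression, same form as glm(cvdinfr4 ~ ., family = "binomial", ...)
fit_sim <- glm(outcome ~ ., family = binomial, data = sim)

odds_ratios <- exp(coef(fit_sim))        # exp(log-odds) = odds ratio
print(round(odds_ratios, 2))
```

The coefficients of a logistic model are adjusted for the other predictors, so they can differ in size and even in sign from a two-variable comparison of the same pair.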
Multiple Logistic Regression Models
library(stargazer, quietly = TRUE)
par(op)
stargazer(fit,fit1,fit2,type="text")
========================================================
Dependent variable:
--------------------------------------
cvdinfr4 addepev2 diabete3
(1) (2) (3)
--------------------------------------------------------
sleptim1 0.007** -0.094*** 0.006***
(0.003) (0.003) (0.002)
bphigh4 1.043*** 0.117*** 1.244***
(0.014) (0.008) (0.010)
toldhi2 0.734*** 0.344*** 0.729***
(0.013) (0.008) (0.009)
addepev2 0.258*** 0.340***
(0.013) (0.010)
cvdinfr4 0.235*** 0.741***
(0.013) (0.013)
diabete3 0.757*** 0.312***
(0.014) (0.010)
smoke100 0.643*** 0.468*** 0.042***
(0.012) (0.007) (0.009)
X_rfbmi5 -0.074*** 0.137***
(0.014) (0.008)
Constant -4.218*** -1.327*** -3.090***
(0.027) (0.019) (0.020)
--------------------------------------------------------
Observations 491,775 491,775 491,775
Log Likelihood -105,194.800 -239,030.700 -175,230.900
Akaike Inf. Crit. 210,405.600 478,077.300 350,475.800
========================================================
Note: *p<0.1; **p<0.05; ***p<0.01
Correlation of variables with, e.g., diabetes
(the same can be done for sleptim1, bphigh4, toldhi2, cvdinfr4, addepev2 or smoke100 if required)
[1] 0.001881772
[1] 0.2537181
[1] 0.2027539
[1] 0.1514461
[1] 0.07995236
[1] 0.04218968
X_rfbmi5 diabete3
1 0 0.06568972
2 1 0.17369619
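The call that produced the two-row table above is not shown. Its shape matches the mean of a 0/1 outcome within each group, which is simply the prevalence in that group. A sketch with `aggregate()` on invented data, as one plausible way to obtain such a table:

```r
# Hypothetical 0/1 indicators
toy <- data.frame(
  X_rfbmi5 = c(0, 0, 0, 0, 1, 1, 1, 1, 1, 1),
  diabete3 = c(0, 0, 0, 1, 0, 0, 1, 1, 0, 1)
)

# Mean of a 0/1 variable = share of ones in each group
prev <- aggregate(diabete3 ~ X_rfbmi5, data = toy, FUN = mean)
print(prev)
```

Read this way, the table above says that about 6.6% of respondents who are not overweight or obese are coded as diabetic, against about 17.4% of those who are.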
Correlation analysis on subset hlthsub2.
coran <- sort(cor(hlthsub2)[,1])
print(coran)
addepev2 X_rfbmi5 smoke100 bphigh4 cvdinfr4
-0.0519859273 -0.0283317665 -0.0245406984 0.0009575273 0.0015123802
diabete3 toldhi2 sleptim1
0.0018817718 0.0037143384 1.0000000000
plot(coran)
Fisher Test on Selected variables.
Fisher's Exact Test for Count Data
data: hlthsub2$cvdinfr4 and hlthsub2$X_rfbmi5
p-value < 2.2e-16
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
1.412973 1.488011
sample estimates:
odds ratio
1.449966
Fisher's Exact Test for Count Data
data: hlthsub2$diabete3 and hlthsub2$addepev2
p-value < 2.2e-16
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
1.657903 1.720391
sample estimates:
odds ratio
1.688858
Fisher's Exact Test for Count Data
data: hlthsub2$smoke100 and hlthsub2$cvdinfr4
p-value < 2.2e-16
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
2.126275 2.229919
sample estimates:
odds ratio
2.177476
Fisher's Exact Test for Count Data
data: hlthsub2$diabete3 and hlthsub2$X_rfbmi5
p-value < 2.2e-16
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
2.926220 3.055166
sample estimates:
odds ratio
2.989719
Fisher's Exact Test for Count Data
data: hlthsub2$cvdinfr4 and hlthsub2$diabete3
p-value < 2.2e-16
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
3.486181 3.664181
sample estimates:
odds ratio
3.573851
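The calls behind these tests are not echoed; the `data:` lines suggest `fisher.test()` applied to pairs of 0/1 columns. The sketch below runs the test on an invented 2x2 table and compares its odds-ratio estimate with the simple cross-product ratio.

```r
# Hypothetical 2x2 table: rows = exposure, columns = outcome
tab <- matrix(c(30, 10,
                10, 30),
              nrow = 2, byrow = TRUE,
              dimnames = list(smoke100 = c("Yes", "No"),
                              cvdinfr4 = c("Yes", "No")))

res <- fisher.test(tab)

# Sample (cross-product) odds ratio, for comparison with the test's estimate
sample_or <- (tab[1, 1] * tab[2, 2]) / (tab[1, 2] * tab[2, 1])

print(res$p.value)
print(res$estimate)   # conditional maximum-likelihood estimate
print(sample_or)
```

`fisher.test()` reports a conditional maximum-likelihood estimate, so its odds ratio is usually close to, but not identical with, the cross-product ratio. These odds ratios are unadjusted, unlike the regression coefficients.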
Run Density Plots of a few selected health outcomes
Inference of Analysis
From the above results we can infer that:
1. Sleeping time has a weak negative association with depression, smoking and obesity (correlations of roughly -0.05, -0.02 and -0.03).
Respondents who smoke, are obese or report depression tend to sleep slightly fewer hours.
2. Exercise has a negative association with depression, diabetes, high blood pressure and cholesterol.
The plots suggest that respondents who exercise have better health and fewer chronic disorders such as high blood pressure
and high cholesterol.
3. Cardiovascular disorders are positively associated with diabetes, high blood pressure, high
cholesterol, depression and smoking. Obesity raises the odds when taken alone (Fisher odds ratio 1.45), but its coefficient in the
adjusted model is slightly negative (-0.074). Sleeping time has a negligible coefficient (0.007), and exercise was not included in the models.
The self-reported survey data therefore support the hypothesis:
there are associations between the causes and the outcomes, although an observational survey cannot establish causation. To lead a good life it is essential to avoid
excesses and be physically active irrespective of age or sex.