knitr::opts_chunk$set(error = TRUE, tidy = FALSE, size = "small")
set.seed(123)
library(ggplot2)
library(dplyr)
Data Sources
The data frame (brfss2013) was downloaded from the assignment page of the Coursera course web site. I saved it to the hard drive (in the R data files directory) and then loaded it into the R Markdown file in RStudio for analysis. The data set is provided in workspace (.RData) format. Categorical variables are stored as factors, and all missing values are coded NA. During the analysis I had to free working memory from time to time by deleting subsets that were no longer required.
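The load-and-clean-up workflow described above can be sketched as follows. This is only an illustration: the toy data frame `brfss_toy` and the temporary file name are assumptions standing in for the real course file, which would be loaded the same way with `load()`.

```r
# Toy stand-in for the course data; object and file names are assumptions.
brfss_toy <- data.frame(genhlth  = factor(c("Good", "Poor", NA)),
                        sleptim1 = c(7L, NA, 8L))
rdata_path <- file.path(tempdir(), "brfss_toy.RData")
save(brfss_toy, file = rdata_path)

rm(brfss_toy)                 # start from a clean workspace
loaded <- load(rdata_path)    # load() returns the names of the restored objects
print(loaded)

# Categorical columns arrive as factors and missing values as NA.
na_counts <- colSums(is.na(brfss_toy))
print(na_counts)

# Drop intermediate objects that are no longer needed and release the memory.
tmp_subset <- brfss_toy[1:2, ]
rm(tmp_subset)
invisible(gc())
```

Removing large intermediate subsets with `rm()` followed by `gc()` is one simple way to keep memory use down when working with a data frame of this size.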
DATA - Generalizability
This data was collected through landline and cell phone interviews with households in all the American states. Any estimate based on 50 or fewer respondents, or thought to be biased, was dropped. The data were further weighted and stratified. Non-English-speaking respondents were not questioned in their own language. The surveyed population is non-institutionalized. Given the size of the sample, the data can be treated as random. The data appears to be valid and reliable when compared with other surveys of a similar nature; according to independent analysts its reliability stands at 75%, and it is generalizable. It is used for planning health care: by comparing results year on year, health care planners can see whether their plans are working well, and the prevalence trends can help them revise those plans.
What is BRFSS?
The Behavioral Risk Factor Surveillance System (BRFSS) is a collaborative project between all of the states in the United States (US), participating US territories, and the Centers for Disease Control and Prevention (CDC). The BRFSS is administered and supported by CDC’s Population Health Surveillance Branch, under the Division of Population Health at the National Center for Chronic Disease Prevention and Health Promotion.
BRFSS is an ongoing surveillance system designed to measure behavioral risk factors for the non-institutionalized adult population (18 years of age and older) residing in the US. Some bias in the data is possible.
The BRFSS objective is to collect uniform, state-specific data on health-related behaviours and conditions. The topics covered include tobacco use, HIV/AIDS knowledge and prevention, exercise, immunization, health status, healthy days (health-related quality of life), health care access, inadequate sleep, hypertension awareness, cholesterol awareness, chronic health conditions, alcohol consumption, fruit and vegetable consumption, arthritis burden, and seat belt use.
Research question 1:
What is the health status of the population by age, sex and activity?
Research question 2:
Which behaviours might cause poor health?
Research question 3:
What is the correlation between behaviours and chronic health conditions?
The following variables will be used:
**This list shows which variables are used for which research question(s)**
genhlth: General health status (Research Question 1)
sex: Sex of respondent (Research Question 1)
X_age_g: Age group of respondent (Research Question 1)
sleptim1: Hours of sleep per day (Research Question 2, 3)
cvdinfr4: Ever diagnosed with heart attack (Research Question 2, 3)
addepev2: Ever told you had a depressive disorder (Research Question 2, 3)
smoke100: Smoked at least 100 cigarettes in lifetime (Research Question 2, 3)
diabete3: Ever told you have diabetes (Research Question 2, 3)
exerany2: Exercise or physical activity in past 30 days (Research Question 1, 2, 3)
toldhi2: Ever told blood cholesterol is high (Research Question 2, 3)
X_bmi5cat: Body mass index category (Research Question 2, 3)
X_rfbmi5: Overweight or obese (Research Question 2, 3)
bphigh4: Ever told blood pressure is high (Research Question 2, 3)
X_pacat1: Physical activity category (Research Question 1)
Exploring Data Frame brfss2013
This data frame has 330 columns and 491775 rows.
dim(brfss2013)
## [1] 491775 330
There are 330 variables, as shown below. The required variables have to be chosen from these by subsetting.
str(brfss2013)
## 'data.frame': 491775 obs. of 330 variables:
## $ X_state : Factor w/ 55 levels "0","Alabama",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ fmonth : Factor w/ 12 levels "January","February",..: 1 1 1 1 2 3 3 3 4 4 ...
## $ idate : int 1092013 1192013 1192013 1112013 2062013 3272013 3222013 3042013 4242013 4242013 ...
## $ imonth : Factor w/ 12 levels "January","February",..: 1 1 1 1 2 3 3 3 4 4 ...
## $ iday : Factor w/ 31 levels "1","2","3","4",..: 9 19 19 11 6 27 22 4 24 24 ...
## $ iyear : Factor w/ 2 levels "2013","2014": 1 1 1 1 1 1 1 1 1 1 ...
## $ dispcode : Factor w/ 2 levels "Completed interview",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ seqno : int 2013000580 2013000593 2013000600 2013000606 2013000608 2013000630 2013000634 2013000644 2013001305 2013001338 ...
## $ X_psu : int 2013000580 2013000593 2013000600 2013000606 2013000608 2013000630 2013000634 2013000644 2013001305 2013001338 ...
## $ ctelenum : Factor w/ 1 level "Yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ pvtresd1 : Factor w/ 2 levels "Yes","No": 1 1 1 1 1 1 1 1 1 1 ...
## $ colghous : Factor w/ 1 level "Yes": NA NA NA NA NA NA NA NA NA NA ...
## $ stateres : Factor w/ 1 level "Yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ cellfon3 : Factor w/ 1 level "Not a cellular phone": 1 1 1 1 1 1 1 1 1 1 ...
## $ ladult : Factor w/ 2 levels "Yes, male respondent",..: NA NA NA NA NA NA NA NA NA NA ...
## $ numadult : Factor w/ 19 levels "1","2","3","4",..: 2 2 3 2 2 1 2 1 5 2 ...
## $ nummen : Factor w/ 14 levels "0","1","2","3",..: 2 2 3 2 2 1 2 1 5 2 ...
## $ numwomen : Factor w/ 12 levels "0","1","2","3",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ genhlth : Factor w/ 5 levels "Excellent","Very good",..: 4 3 3 2 3 2 4 3 1 3 ...
## $ physhlth : int 30 0 3 2 10 0 1 5 0 0 ...
## $ menthlth : int 29 0 2 0 2 0 15 0 0 0 ...
## $ poorhlth : int 30 NA 0 0 0 NA 0 10 NA NA ...
## $ hlthpln1 : Factor w/ 2 levels "Yes","No": 1 1 1 1 1 1 1 1 1 1 ...
## $ persdoc2 : Factor w/ 3 levels "Yes, only one",..: 1 1 1 1 1 1 2 1 1 1 ...
## $ medcost : Factor w/ 2 levels "Yes","No": 2 2 2 2 2 2 2 2 2 2 ...
## $ checkup1 : Factor w/ 5 levels "Within past year",..: 1 1 1 2 4 1 1 1 1 1 ...
## $ sleptim1 : int NA 6 9 8 6 8 7 6 8 8 ...
## $ bphigh4 : Factor w/ 4 levels "Yes","Yes, but female told only during pregnancy",..: 1 3 3 3 1 1 1 1 3 3 ...
## $ bpmeds : Factor w/ 2 levels "Yes","No": 1 NA NA NA 2 1 1 1 NA NA ...
## $ bloodcho : Factor w/ 2 levels "Yes","No": 1 1 1 1 1 1 1 1 1 1 ...
## $ cholchk : Factor w/ 4 levels "Within past year",..: 1 1 4 1 2 1 1 1 1 1 ...
## $ toldhi2 : Factor w/ 2 levels "Yes","No": 1 2 2 1 2 1 2 1 1 2 ...
## $ cvdinfr4 : Factor w/ 2 levels "Yes","No": 2 2 2 2 2 2 2 2 2 2 ...
## $ cvdcrhd4 : Factor w/ 2 levels "Yes","No": NA 2 2 2 2 2 2 1 2 2 ...
## $ cvdstrk3 : Factor w/ 2 levels "Yes","No": 2 2 2 2 2 2 2 2 2 2 ...
## $ asthma3 : Factor w/ 2 levels "Yes","No": 1 2 2 2 1 2 2 2 2 2 ...
## $ asthnow : Factor w/ 2 levels "Yes","No": 1 NA NA NA 2 NA NA NA NA NA ...
## $ chcscncr : Factor w/ 2 levels "Yes","No": 2 2 2 2 2 2 2 2 2 2 ...
## $ chcocncr : Factor w/ 2 levels "Yes","No": 2 2 2 2 1 2 2 2 2 2 ...
## $ chccopd1 : Factor w/ 2 levels "Yes","No": 1 2 2 2 2 2 2 2 2 2 ...
## $ havarth3 : Factor w/ 2 levels "Yes","No": 1 2 1 2 2 2 1 1 1 2 ...
## $ addepev2 : Factor w/ 2 levels "Yes","No": 1 1 1 2 2 2 2 2 2 2 ...
## $ chckidny : Factor w/ 2 levels "Yes","No": 1 2 2 2 2 2 2 2 2 2 ...
## $ diabete3 : Factor w/ 4 levels "Yes","Yes, but female told only during pregnancy",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ veteran3 : Factor w/ 2 levels "Yes","No": 2 2 2 2 2 2 2 2 2 2 ...
## $ marital : Factor w/ 6 levels "Married","Divorced",..: 2 1 1 1 1 2 1 3 1 1 ...
## $ children : int 0 2 0 0 0 0 1 0 1 0 ...
## $ educa : Factor w/ 6 levels "Never attended school or only kindergarten",..: 6 5 6 4 6 6 4 5 6 4 ...
## $ employ1 : Factor w/ 8 levels "Employed for wages",..: 7 1 1 7 7 1 1 7 7 5 ...
## $ income2 : Factor w/ 8 levels "Less than $10,000",..: 7 8 8 7 6 8 NA 6 8 4 ...
## $ weight2 : Factor w/ 570 levels "",".b","100",..: 154 30 63 31 169 128 9 1 139 73 ...
## $ height3 : int 507 510 504 504 600 503 500 505 602 505 ...
## $ numhhol2 : Factor w/ 2 levels "Yes","No": 1 2 2 2 2 1 2 2 2 2 ...
## $ numphon2 : Factor w/ 6 levels "1 residential telephone number",..: 2 NA NA NA NA 1 NA NA NA NA ...
## $ cpdemo1 : Factor w/ 2 levels "Yes","No": 1 1 1 1 1 1 1 1 1 1 ...
## $ cpdemo4 : int 10 70 70 75 0 70 40 1 60 50 ...
## $ internet : Factor w/ 2 levels "Yes","No": 1 1 1 1 1 1 1 1 1 1 ...
## $ renthom1 : Factor w/ 3 levels "Own","Rent","Other arrangement": 1 1 1 1 1 1 1 2 1 1 ...
## $ sex : Factor w/ 2 levels "Male","Female": 2 2 2 2 1 2 2 2 1 2 ...
## $ pregnant : Factor w/ 2 levels "Yes","No": NA NA NA NA NA NA 2 NA NA NA ...
## $ qlactlm2 : Factor w/ 2 levels "Yes","No": 1 2 1 2 2 2 1 1 2 2 ...
## $ useequip : Factor w/ 2 levels "Yes","No": 1 2 2 2 2 2 2 2 2 2 ...
## $ blind : Factor w/ 2 levels "Yes","No": 2 2 2 2 2 2 2 2 2 2 ...
## $ decide : Factor w/ 2 levels "Yes","No": 2 2 2 2 2 2 2 2 2 2 ...
## $ diffwalk : Factor w/ 2 levels "Yes","No": 1 2 1 2 2 2 2 1 2 2 ...
## $ diffdres : Factor w/ 2 levels "Yes","No": 2 2 2 2 2 2 2 2 2 2 ...
## $ diffalon : Factor w/ 2 levels "Yes","No": 1 2 2 2 2 2 2 2 2 2 ...
## $ smoke100 : Factor w/ 2 levels "Yes","No": 1 2 1 2 1 2 1 1 2 2 ...
## $ smokday2 : Factor w/ 3 levels "Every day","Some days",..: 3 NA 2 NA 3 NA 3 1 NA NA ...
## $ stopsmk2 : Factor w/ 2 levels "Yes","No": NA NA 1 NA NA NA NA 2 NA NA ...
## $ lastsmk2 : Factor w/ 8 levels "Within the past month",..: 7 NA NA NA 1 NA 5 NA NA NA ...
## $ usenow3 : Factor w/ 3 levels "Every day","Some days",..: 3 3 3 3 3 3 3 3 1 3 ...
## $ alcday5 : int 201 0 220 208 210 0 201 202 101 0 ...
## $ avedrnk2 : int 2 NA 4 2 2 NA 1 1 1 NA ...
## $ drnk3ge5 : int 0 NA 20 0 0 NA 0 0 0 NA ...
## $ maxdrnks : int 2 NA 10 2 3 NA 1 1 2 NA ...
## $ fruitju1 : int 304 305 301 202 0 205 320 0 0 202 ...
## $ fruit1 : int 104 301 203 306 302 206 325 320 101 202 ...
## $ fvbeans : int 303 310 202 202 101 0 330 360 202 203 ...
## $ fvgreen : int 310 203 202 310 310 203 315 315 203 201 ...
## $ fvorang : int 303 202 310 305 303 0 310 325 0 201 ...
## $ vegetab1 : int NA 203 330 204 101 207 310 308 101 203 ...
## $ exerany2 : Factor w/ 2 levels "Yes","No": 2 1 2 1 2 1 1 1 1 1 ...
## $ exract11 : Factor w/ 75 levels "Active Gaming Devices (Wii Fit, Dance, Dance revolution)",..: NA 64 NA 64 NA 6 64 64 7 64 ...
## $ exeroft1 : int NA 105 NA 205 NA 102 220 102 102 220 ...
## $ exerhmm1 : int NA 20 NA 30 NA 15 100 15 100 30 ...
## $ exract21 : Factor w/ 76 levels "Active Gaming Devices (Wii Fit, Dance, Dance revolution)",..: NA 71 NA 75 NA 18 75 75 75 18 ...
## $ exeroft2 : int NA 101 NA NA NA 102 NA NA NA 101 ...
## $ exerhmm2 : int NA 10 NA NA NA 30 NA NA NA 100 ...
## $ strength : int 0 0 0 0 0 0 205 0 102 0 ...
## $ lmtjoin3 : Factor w/ 2 levels "Yes","No": 1 NA 1 NA NA NA 2 1 2 NA ...
## $ arthdis2 : Factor w/ 2 levels "Yes","No": 1 NA 1 NA NA NA 1 2 2 NA ...
## $ arthsocl : Factor w/ 3 levels "A lot","A little",..: 1 NA 2 NA NA NA 3 1 3 NA ...
## $ joinpain : int 7 NA 5 NA NA NA 3 8 4 NA ...
## $ seatbelt : Factor w/ 6 levels "Always","Nearly always",..: 1 1 1 1 1 1 1 1 2 1 ...
## $ flushot6 : Factor w/ 2 levels "Yes","No": 2 1 1 2 2 1 2 1 1 2 ...
## $ flshtmy2 : Factor w/ 26 levels "January 2012",..: NA 10 13 NA NA NA NA 10 10 NA ...
## $ tetanus : Factor w/ 4 levels "Yes, received Tdap",..: 4 1 1 4 4 4 4 4 1 4 ...
## $ pneuvac3 : Factor w/ 2 levels "Yes","No": 1 2 2 2 2 1 2 2 2 2 ...
## [list output truncated]
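Choosing the required variables from a wide data frame can be sketched with `dplyr::select()`. The toy data frame below is made up and only mimics a handful of the 330 columns; `brfss_toy`, `wanted` and `health_toy` are illustrative names, not objects from the analysis.

```r
library(dplyr)

# Toy stand-in for brfss2013 with a few of its columns (values are made up).
brfss_toy <- data.frame(
  genhlth  = factor(c("Good", "Fair", "Excellent")),
  sex      = factor(c("Male", "Female", "Female")),
  X_age_g  = factor(c("Age 18 to 24", "Age 65 or older", "Age 35 to 44")),
  sleptim1 = c(7L, 6L, 8L),
  fmonth   = factor(c("January", "March", "May"))   # not needed for the analysis
)

# Keep only the variables listed for the research questions.
wanted <- c("genhlth", "sex", "X_age_g", "sleptim1")
health_toy <- select(brfss_toy, all_of(wanted))

print(dim(health_toy))
print(names(health_toy))
```

Keeping the variable names in one character vector makes it easy to reuse the same list for tabulation and for checking missing values.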
What is the Current Status of Health by Sex, Age Group and Activity Level?
The tabulated counts give general health by sex, by age group and by exercise in the past 30 days.
## sex
## genhlth Male Female
## Excellent 35741 49740
## Very good 65135 93940
## Good 62998 87557
## Fair 25882 40844
## Poor 10713 17238
## X_age_g
## genhlth Age 18 to 24 Age 25 to 34 Age 35 to 44 Age 45 to 54
## Excellent 6896 12204 13243 15683
## Very good 10266 18453 21542 27630
## Good 7795 14518 17677 24513
## Fair 1874 4047 6055 10738
## Poor 303 821 1729 4950
## X_age_g
## genhlth Age 55 to 64 Age 65 or older
## Excellent 17434 20020
## Very good 34626 46557
## Good 32745 53305
## Fair 16092 27919
## Poor 8098 12050
## exerany2
## genhlth Yes No
## Excellent 67914 11453
## Very good 120595 28860
## Good 97095 42443
## Fair 34844 26949
## Poor 10941 14856
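Two-way count tables like the ones above can be produced with `table()`. The sketch below uses a small made-up data frame (`toy`) rather than the survey data, so the counts are purely illustrative.

```r
# Toy data (made up) standing in for two factor columns of the survey.
toy <- data.frame(
  genhlth = factor(c("Good", "Poor", "Good", "Excellent", "Poor", "Good"),
                   levels = c("Excellent", "Good", "Poor")),
  sex     = factor(c("Male", "Female", "Female", "Male", "Male", "Female"),
                   levels = c("Male", "Female"))
)

# Counts of general health by sex; dnn labels the two dimensions.
counts <- table(toy$genhlth, toy$sex, dnn = c("genhlth", "sex"))
print(counts)

# addmargins() appends row and column totals, which helps when group sizes differ.
print(addmargins(counts))
```

Because there are more female than male respondents in the survey, row or column totals are useful context when comparing raw counts across groups.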
General health status: prevalence of excellent to poor health in the population under study. The plots show health in the population.
Activity level and general health are shown here.
Both sexes have broadly similar health levels.
Behaviour Patterns and Prevalence of Major Disorders. Preparation of the data frame and analysis: checking missing values and tabulating summaries of the variables used in the analysis. Three habits seem to affect health: physical activity, smoking and sleep.
Calculating figures and Plotting
## genhlth sex X_age_g X_rfbmi5
## Excellent: 63241 Male :149886 Age 18 to 24 : 9038 No :121282
## Very good:121403 Female:214906 Age 25 to 34 : 26391 Yes:243510
## Good :109965 Age 35 to 44 : 42382
## Fair : 49386 Age 45 to 54 : 65957
## Poor : 20797 Age 55 to 64 : 88956
## Age 65 or older:132068
## X_bmi5cat
## Underweight : 5537
## Normal weight:115745
## Overweight :133390
## Obese :110120
##
##
## diabete3 cvdinfr4
## Yes : 51202 Yes: 24347
## Yes, but female told only during pregnancy: 3105 No :340445
## No :303594
## No, pre-diabetes or borderline diabetes : 6891
##
##
## bphigh4 addepev2
## Yes :161096 Yes: 73620
## Yes, but female told only during pregnancy: 2258 No :291172
## No :197358
## Told borderline or pre-hypertensive : 4080
##
##
## exerany2 toldhi2 smoke100 sleptim1
## Yes:269002 Yes:160553 Yes:166624 Min. : 1.000
## No : 95790 No :204239 No :198168 1st Qu.: 6.000
## Median : 7.000
## Mean : 7.057
## 3rd Qu.: 8.000
## Max. :24.000
Smoking by Age and Sex.
Tabulating and plotting.
##
## Yes No
## 166624 198168
##
## Yes No
## 45.67644 54.32356
## Age 18 to 24 Age 25 to 34 Age 35 to 44 Age 45 to 54 Age 55 to 64 Age 65 or older
##
## Yes 0.6485888 2.7755543 4.3364438 7.7849843 11.9136385 18.2172306
## No 1.8289875 4.4589794 7.2816838 10.2957302 12.4717647 17.9864142
## Male Female
##
## Yes 21.44099 24.23545
## No 19.64709 34.67647
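The percentages above are shares of the whole sample, so cells for the older age groups are larger partly because those groups contain more respondents. A minimal sketch, using made-up counts, of how `prop.table()` with `margin = 2` gives the share within each age group instead:

```r
# Made-up counts, filled column by column:
# rows = smoked at least 100 cigarettes, columns = age group.
smoke_by_age <- as.table(matrix(
  c(20, 60, 90, 110),
  nrow = 2,
  dimnames = list(smoke100 = c("Yes", "No"),
                  age = c("Age 18 to 24", "Age 65 or older"))
))

overall_pct <- prop.table(smoke_by_age) * 100              # share of the whole sample
within_age  <- prop.table(smoke_by_age, margin = 2) * 100  # share within each age group

print(round(overall_pct, 1))
print(round(within_age, 1))
```

In this toy table the youngest group contributes only about 7% of the sample as smokers, yet 25% of that group smokes; the two views answer different questions.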
Exercise and Physical Activity. This activity is useful for health.
Summary of the subset: tables and plots.
##
## Yes No
## 269002 95790
##
## Yes No
## 73.7412 26.2588
## Age 18 to 24 Age 25 to 34 Age 35 to 44 Age 45 to 54 Age 55 to 64 Age 65 or older
##
## Yes 2.109421 5.970526 9.181945 13.653534 17.868265 24.957510
## No 0.368155 1.264008 2.436183 4.427180 6.517139 11.246135
Sleeping Time. How much time do people spend sleeping?
Insufficient sleep causes health problems.
Let us compare sleeping time with health. Tables and plots.
##
## 1 2 3 4 5 6 7 8 9 10
## 130 718 2430 10283 24362 79431 110243 105978 17984 8820
## 11 12 13 14 15 16 17 18 19 20
## 580 2632 137 323 265 254 27 115 12 44
## 21 22 23 24
## 3 6 1 14
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
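The `stat_bin()` message appears because no bin width was given for the histogram. A sketch of one way to draw it with one bar per reported hour, using made-up sleep values; the 1 to 24 hour filter is an assumption about which values are plausible.

```r
library(ggplot2)

# Made-up sleep durations in hours; 450 mimics an implausible recorded value.
sleep_toy <- data.frame(sleptim1 = c(5, 6, 6, 7, 7, 7, 8, 8, 9, NA, 450))

# Keep only values that are possible within a 24-hour day.
sleep_ok <- subset(sleep_toy, !is.na(sleptim1) & sleptim1 >= 1 & sleptim1 <= 24)

# binwidth = 1 gives one bar per reported hour and avoids the stat_bin() message.
p <- ggplot(sleep_ok, aes(x = sleptim1)) +
  geom_histogram(binwidth = 1, fill = "steelblue", colour = "white") +
  labs(x = "Hours of sleep", y = "Respondents", title = "Sleep time")
print(p)
```

Since sleep time is recorded in whole hours, a bin width of 1 matches the resolution of the data.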
Obesity and Overweight Prevalence: How does obesity affect our health? Tables and plots.
table(disease$X_rfbmi5)
##
## No Yes
## 121282 243510
table(disease$X_rfbmi5,disease$genhlth)
##
## Excellent Very good Good Fair Poor
## No 31835 42698 28934 11938 5877
## Yes 31406 78705 81031 37448 14920
prop.table(table(disease$X_bmi5cat,disease$X_age_g))*100
##
## Age 18 to 24 Age 25 to 34 Age 35 to 44 Age 45 to 54
## Underweight 0.09018838 0.11760126 0.12966293 0.21957718
## Normal weight 1.24317419 2.69578280 3.65331477 5.27012654
## Overweight 0.67654992 2.38574311 4.03216079 6.47903463
## Obese 0.46766376 2.03540648 3.80298910 6.11197614
##
## Age 55 to 64 Age 65 or older
## Underweight 0.28454571 0.67627580
## Normal weight 6.92449396 11.94214785
## Overweight 8.98649093 14.00606373
## Obese 8.18987258 9.57915744
prop.table(table(disease$X_rfbmi5,disease$exerany2))*100
##
## Yes No
## No 26.339119 6.907772
## Yes 47.402081 19.351027
obs <- table(disease$genhlth,disease$X_rfbmi5)
plot(obs,main="Obesity vs General Health",col=rainbow(5))
qplot(disease$X_bmi5cat,xlab="Obesity",main="Obesity prevalence")
qplot(X_age_g, data = disease, fill = X_bmi5cat,
      xlab = "Age", ylab = "Population",
      main = "Obesity and overweight by age")
prop.table(table(disease$diabete3,disease$genhlth))*100
##
## Excellent Very good
## Yes 0.43641308 2.26238514
## Yes, but female told only during pregnancy 0.15570517 0.30208996
## No 16.60864273 30.25559771
## No, pre-diabetes or borderline diabetes 0.13541964 0.45998816
##
## Good Fair
## Yes 5.09961841 4.12152679
## Yes, but female told only during pregnancy 0.25631045 0.10800675
## No 24.06494660 8.90425229
## No, pre-diabetes or borderline diabetes 0.72370008 0.40434001
##
## Poor
## Yes 2.11600035
## Yes, but female told only during pregnancy 0.02905765
## No 3.39042523
## No, pre-diabetes or borderline diabetes 0.16557381
prop.table(table(disease$genhlth,disease$exerany2))*100
##
## Yes No
## Excellent 15.107239 2.228941
## Very good 27.166988 6.113073
## Good 21.289118 8.855457
## Fair 7.716726 5.821400
## Poor 2.461129 3.239929
prop.table(table(disease$genhlth,disease$sex))*100
##
## Male Female
## Excellent 7.140782 10.195399
## Very good 13.661210 19.618851
## Good 12.778789 17.365787
## Fair 5.302748 8.235378
## Poor 2.204544 3.496513
Diabetes: What is the prevalence of diabetes by age group?
Tabulating and plotting the relevant information.
##
## Yes
## 51202
## Yes, but female told only during pregnancy
## 3105
## No
## 303594
## No, pre-diabetes or borderline diabetes
## 6891
##
## Yes
## 14.035944
## Yes, but female told only during pregnancy
## 0.851170
## No
## 83.223865
## No, pre-diabetes or borderline diabetes
## 1.889022
##
## Age 18 to 24 Age 25 to 34
## Yes 0.04166758 0.19545385
## Yes, but female told only during pregnancy 0.01644773 0.12308384
## No 2.40273910 6.86528213
## No, pre-diabetes or borderline diabetes 0.01672186 0.05071383
##
## Age 35 to 44 Age 45 to 54
## Yes 0.63734950 1.83638896
## Yes, but female told only during pregnancy 0.23410601 0.19106779
## No 10.62112108 15.77063093
## No, pre-diabetes or borderline diabetes 0.12555100 0.28262681
##
## Age 55 to 64 Age 65 or older
## Yes 3.85315467 7.47192921
## Yes, but female told only during pregnancy 0.13377486 0.15268975
## No 19.90257462 27.66151670
## No, pre-diabetes or borderline diabetes 0.49589903 0.91750916
## Age 18 to 24 Age 25 to 34 Age 35 to 44 Age 45 to 54 Age 55 to 64 Age 65 or older
##
## Yes 0.04166758 0.19545385 0.63734950 1.83638896 3.85315467 7.47192921
## Yes, but female told only during pregnancy 0.01644773 0.12308384 0.23410601 0.19106779 0.13377486 0.15268975
## No 2.40273910 6.86528213 10.62112108 15.77063093 19.90257462 27.66151670
## No, pre-diabetes or borderline diabetes 0.01672186 0.05071383 0.12555100 0.28262681 0.49589903 0.91750916
Cardiovascular Disorders
How does the prevalence of heart disease increase with age?
Tabulating and plotting.
table(disease$cvdinfr4)
##
## Yes No
## 24347 340445
prop.table(table(disease$cvdinfr4))*100
##
## Yes No
## 6.674214 93.325786
prop.table(ftable(table(disease$cvdinfr4,disease$X_age_g)))*100
## Age 18 to 24 Age 25 to 34 Age 35 to 44 Age 45 to 54 Age 55 to 64 Age 65 or older
##
## Yes 0.01288405 0.04468300 0.15077085 0.60966249 1.52278559 4.33342836
## No 2.46469221 7.18985065 11.46735674 17.47105200 22.86261760 31.87021645
qplot(disease$cvdinfr4,xlab="affirmation",ylab="count",
main="Heart Disease Prevalence")
# The x axis shows the age groups; the bars are counts split by cvdinfr4.
qplot(X_age_g, data = disease, fill = cvdinfr4, xlab = "Age group",
      ylab = "Count", main = "Heart Disease by Age")
High Blood Pressure
How does high blood pressure increase with age?
table(disease$bphigh4)
##
## Yes
## 161096
## Yes, but female told only during pregnancy
## 2258
## No
## 197358
## Told borderline or pre-hypertensive
## 4080
prop.table(table(disease$bphigh4))*100
##
## Yes
## 44.1610562
## Yes, but female told only during pregnancy
## 0.6189829
## No
## 54.1015154
## Told borderline or pre-hypertensive
## 1.1184456
prop.table(ftable(table(disease$bphigh4,disease$X_age_g)))*100
## Age 18 to 24 Age 25 to 34 Age 35 to 44 Age 45 to 54 Age 55 to 64 Age 65 or older
##
## Yes 0.223140858 1.020581592 2.488541415 6.123215421 11.593181868 22.712395009
## Yes, but female told only during pregnancy 0.025219851 0.146384789 0.161461874 0.106087853 0.084705805 0.095122700
## No 2.220443431 6.018224084 8.859021031 11.644443957 12.393638018 12.965744863
## Told borderline or pre-hypertensive 0.008772122 0.049343187 0.109103270 0.206967258 0.313877497 0.430382245
qplot(disease$bphigh4)
# The x axis shows the age groups; the bars are counts split by bphigh4.
qplot(X_age_g, data = disease, fill = bphigh4, xlab = "Age group",
      ylab = "Count", main = "High blood pressure by age")
High Blood Cholesterol
How does the prevalence of high cholesterol relate to age?
Tabulating and plotting this information.
##
## Yes No
## 160553 204239
##
## Yes No
## 44.0122 55.9878
## Age 18 to 24 Age 25 to 34 Age 35 to 44 Age 45 to 54 Age 55 to 64 Age 65 or older
##
## Yes 0.234106 1.142295 3.027753 7.028663 12.330863 20.248525
## No 2.243470 6.092239 8.590375 11.052052 12.054541 15.955120
Depression and Stress Disorders
How does depression relate to the different age groups?
Tabulating and plotting this information.
Yes No
73620 291172
Yes No
20.18136 79.81864
Age 18 to 24 Age 25 to 34 Age 35 to 44 Age 45 to 54 Age 55 to 64 Age 65 or older
Yes 0.436139 1.444933 2.423025 4.241869 5.863067 5.772331
No 2.041437 5.789601 9.195103 13.838845 18.522336 30.431314
Compare Diabetes and Obesity
Tabulation and Plots.
## hlthdiab1
## Yes
## 62363
## Yes, but female told only during pregnancy
## 4602
## No
## 415374
## No, pre-diabetes or borderline diabetes
## 8604
##
## Underweight Normal weight Overweight Obese
## 5537 115745 133390 110120
Analyzing the Association of Health Outcomes with Their Possible Causes by Correlation and Logistic Regression
vars <- names(brfss2013) %in% c("sleptim1",
"cvdinfr4","addepev2","diabete3",
"bphigh4","toldhi2","smoke100","X_rfbmi5")
hlthsub2 <- brfss2013[vars]
names(hlthsub2)
## [1] "sleptim1" "bphigh4" "toldhi2" "cvdinfr4" "addepev2" "diabete3"
## [7] "smoke100" "X_rfbmi5"
MissingData <- function(x){sum(is.na(x))/length(x)*100}
apply(hlthsub2, 2, MissingData)
## sleptim1 bphigh4 toldhi2 cvdinfr4 addepev2 diabete3
## 1.5021097 0.2887499 14.5721112 0.5260536 0.4654568 0.1691831
## smoke100 X_rfbmi5
## 3.0339078 5.4352092
summary(hlthsub2)
## sleptim1 bphigh4
## Min. : 0.000 Yes :198921
## 1st Qu.: 6.000 Yes, but female told only during pregnancy: 3680
## Median : 7.000 No :282687
## Mean : 7.052 Told borderline or pre-hypertensive : 5067
## 3rd Qu.: 8.000 NA's : 1420
## Max. :450.000
## NA's :7387
## toldhi2 cvdinfr4 addepev2
## Yes :183501 Yes : 29284 Yes : 95779
## No :236612 No :459904 No :393707
## NA's: 71662 NA's: 2587 NA's: 2289
##
##
##
##
## diabete3 smoke100
## Yes : 62363 Yes :215201
## Yes, but female told only during pregnancy: 4602 No :261654
## No :415374 NA's: 14920
## No, pre-diabetes or borderline diabetes : 8604
## NA's : 832
##
##
## X_rfbmi5
## No :163161
## Yes :301885
## NA's: 26729
##
##
##
##
summary(hlthsub2$X_rfbmi5)
## No Yes NA's
## 163161 301885 26729
hlthsub2$X_rfbmi5 <- replace(hlthsub2$X_rfbmi5,
which(is.na(hlthsub2$X_rfbmi5)), "Yes")
summary(hlthsub2$X_rfbmi5)
## No Yes
## 163161 328614
summary(hlthsub2$smoke100)
## Yes No NA's
## 215201 261654 14920
hlthsub2$addepev2 <- replace(hlthsub2$addepev2,
which(is.na(hlthsub2$addepev2)), "Yes")
summary(hlthsub2$addepev2)
## Yes No
## 98068 393707
hlthsub2$smoke100 <- replace(hlthsub2$smoke100,
which(is.na(hlthsub2$smoke100)), "Yes")
summary(hlthsub2$smoke100)
## Yes No
## 230121 261654
hlthsub2$diabete3 <- replace(hlthsub2$diabete3,
which(is.na(hlthsub2$diabete3)), "Yes")
summary(hlthsub2$diabete3)
## Yes
## 63195
## Yes, but female told only during pregnancy
## 4602
## No
## 415374
## No, pre-diabetes or borderline diabetes
## 8604
hlthsub2$bphigh4 <- replace(hlthsub2$bphigh4,
which(is.na(hlthsub2$bphigh4)), "No")
summary(hlthsub2$bphigh4)
## Yes
## 198921
## Yes, but female told only during pregnancy
## 3680
## No
## 284107
## Told borderline or pre-hypertensive
## 5067
summary(hlthsub2$sleptim1)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 6.000 7.000 7.052 8.000 450.000 7387
mean(hlthsub2$sleptim1,na.rm = T)
## [1] 7.052099
hlthsub2$sleptim1 <- replace(hlthsub2$sleptim1,
which(is.na(hlthsub2$sleptim1)), 7)
summary(hlthsub2$sleptim1)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 6.000 7.000 7.051 8.000 450.000
summary(hlthsub2$toldhi2)
## Yes No NA's
## 183501 236612 71662
hlthsub2$toldhi2 <- replace(hlthsub2$toldhi2,
which(is.na(hlthsub2$toldhi2)), "Yes")
summary(hlthsub2$toldhi2)
## Yes No
## 255163 236612
summary(hlthsub2$cvdinfr4)
## Yes No NA's
## 29284 459904 2587
hlthsub2$cvdinfr4 <- replace(hlthsub2$cvdinfr4,
which(is.na(hlthsub2$cvdinfr4)), "Yes")
summary(hlthsub2$cvdinfr4)
## Yes No
## 31871 459904
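The replacements above assign every missing answer to a single level. The sketch below shows, on made-up answers, how that shifts the proportions compared with simply leaving the missing values out. The helper `fill_na_level()` is hypothetical and written only for this illustration.

```r
# Hypothetical helper: replace NA in a factor with one of its existing levels.
fill_na_level <- function(x, level) {
  stopifnot(is.factor(x), level %in% levels(x))
  x[is.na(x)] <- level
  x
}

# Made-up answers with two missing values.
toldhi_toy <- factor(c("Yes", "No", "No", NA, "No", NA), levels = c("Yes", "No"))

before <- prop.table(table(toldhi_toy))                        # NA dropped by table()
after  <- prop.table(table(fill_na_level(toldhi_toy, "Yes")))  # NA counted as "Yes"

print(round(before, 2))
print(round(after, 2))
```

For toldhi2, where about 14.6% of the answers are missing, the same effect is visible in the summaries above: the "Yes" share is about 43.7% among those who answered (183501 of 420113) and about 51.9% after the replacement (255163 of 491775).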
# Collapse bphigh4 and diabete3 to two levels. Level names must match exactly:
# a string literal broken across lines never equals a factor level, so only an
# exact "Yes" is coded "Yes" here, which is what the results below reflect.
# To also count the pregnancy-only and borderline answers as "Yes", compare
# with %in% against the full level names, each written on a single line.
hlthsub2$bphigh4 <- as.factor(ifelse(hlthsub2$bphigh4 == "Yes", "Yes", "No"))
hlthsub2$diabete3 <- as.factor(ifelse(hlthsub2$diabete3 == "Yes", "Yes", "No"))
library(Hmisc)
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
##
## combine, src, summarize
## The following objects are masked from 'package:base':
##
## format.pval, round.POSIXt, trunc.POSIXt, units
Converting factors to numerics for analysis.
hlthsub2$addepev2 <- ifelse(hlthsub2$addepev2=="Yes", 1, 0)
hlthsub2$cvdinfr4 <- ifelse(hlthsub2$cvdinfr4=="Yes", 1, 0)
hlthsub2$smoke100 <- ifelse(hlthsub2$smoke100=="Yes", 1, 0)
hlthsub2$diabete3 <- ifelse(hlthsub2$diabete3=="Yes", 1, 0)
hlthsub2$toldhi2 <- ifelse(hlthsub2$toldhi2 =="Yes", 1, 0)
hlthsub2$bphigh4 <- ifelse(hlthsub2$bphigh4=="Yes", 1, 0)
hlthsub2$X_rfbmi5 <- ifelse(hlthsub2$X_rfbmi5=="Yes", 1, 0)
Checking Missing Values
MissingData <- function(x){sum(is.na(x))/length(x)*100}
apply(hlthsub2, 2, MissingData)
## sleptim1 bphigh4 toldhi2 cvdinfr4 addepev2 diabete3 smoke100 X_rfbmi5
## 0 0 0 0 0 0 0 0
Finding correlations and plotting them with corrplot. Fitting models by logistic regression with the binomial family. Tabulating and plotting correlations between the variables.
library(corrplot)
M <- cor(hlthsub2)
corrplot(M, method = "number", pch = 23, col = rainbow(7))
corrplot(M, type = "upper", order = "hclust", tl.col = rainbow(7), tl.srt = 45,
         method = "number", pch = 23)
print(M)
sleptim1 bphigh4 toldhi2 cvdinfr4 addepev2
sleptim1 1.000000000 0.00299319 -0.002282787 0.00151238 -0.05198593
bphigh4 0.002993190 1.00000000 0.180710056 0.17632702 0.07233525
toldhi2 -0.002282787 0.18071006 1.000000000 0.10240121 0.07886054
cvdinfr4 0.001512380 0.17632702 0.102401212 1.00000000 0.05931655
addepev2 -0.051985927 0.07233525 0.078860535 0.05931655 1.00000000
diabete3 0.004080381 0.27190219 0.130802191 0.15972742 0.07907738
smoke100 -0.024540698 0.07609272 0.064803460 0.09378122 0.10470799
X_rfbmi5 -0.028331766 0.18674776 0.067617184 0.04047252 0.04811917
diabete3 smoke100 X_rfbmi5
sleptim1 0.004080381 -0.02454070 -0.02833177
bphigh4 0.271902195 0.07609272 0.18674776
toldhi2 0.130802191 0.06480346 0.06761718
cvdinfr4 0.159727418 0.09378122 0.04047252
addepev2 0.079077382 0.10470799 0.04811917
diabete3 1.000000000 0.04739043 0.15123138
smoke100 0.047390432 1.00000000 0.01842239
X_rfbmi5 0.151231383 0.01842239 1.00000000
summary(M)
sleptim1 bphigh4 toldhi2
Min. :-0.0519859 Min. :0.002993 Min. :-0.002283
1st Qu.:-0.0254885 1st Qu.:0.075153 1st Qu.: 0.066914
Median :-0.0003852 Median :0.178518 Median : 0.090631
Mean : 0.1126806 Mean :0.245889 Mean : 0.202864
3rd Qu.: 0.0032650 3rd Qu.:0.208036 3rd Qu.: 0.143279
Max. : 1.0000000 Max. :1.000000 Max. : 1.000000
cvdinfr4 addepev2 diabete3
Min. :0.001512 Min. :-0.05199 Min. :0.00408
1st Qu.:0.054606 1st Qu.: 0.05652 1st Qu.:0.07116
Median :0.098091 Median : 0.07560 Median :0.14102
Mean :0.204192 Mean : 0.17380 Mean :0.23053
3rd Qu.:0.163877 3rd Qu.: 0.08549 3rd Qu.:0.18777
Max. :1.000000 Max. : 1.00000 Max. :1.00000
smoke100 X_rfbmi5
Min. :-0.02454 Min. :-0.02833
1st Qu.: 0.04015 1st Qu.: 0.03496
Median : 0.07045 Median : 0.05787
Mean : 0.17258 Mean : 0.18553
3rd Qu.: 0.09651 3rd Qu.: 0.16011
Max. : 1.00000 Max. : 1.00000
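Most of the variables in this correlation matrix are 0/1 indicators. For two such variables the Pearson correlation returned by `cor()` is the phi coefficient of their 2 x 2 table. A small check on made-up indicators (`diab` and `bp` are illustrative names):

```r
# Made-up 0/1 indicators for two conditions.
diab <- c(1, 1, 1, 0, 0, 0, 0, 0, 1, 0)
bp   <- c(1, 1, 0, 1, 0, 0, 0, 0, 1, 0)

tab <- table(diab, bp)
n11 <- tab["1", "1"]; n00 <- tab["0", "0"]
n10 <- tab["1", "0"]; n01 <- tab["0", "1"]

# Phi coefficient computed from the cells and margins of the 2 x 2 table.
phi <- (n11 * n00 - n10 * n01) /
  sqrt(prod(rowSums(tab)) * prod(colSums(tab)))

pearson <- cor(diab, bp)
print(c(phi = unname(phi), pearson = pearson))
```

Phi values for rare conditions tend to be small even when the odds ratio is large, so modest correlations in the matrix do not rule out a strong association on the odds scale.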
Correlation and model fitting: cardiovascular disorder, depression and diabetes are each modelled against the other variables (the diabetes model omits X_rfbmi5). Diagnostic plots of the fitted model for cardiovascular disorder against all variables are drawn.
library(stargazer, quietly = TRUE)
##
## Please cite as:
## Hlavac, Marek (2015). stargazer: Well-Formatted Regression and Summary Statistics Tables.
## R package version 5.2. http://CRAN.R-project.org/package=stargazer
fit <- glm(cvdinfr4 ~ ., data=hlthsub2, family = "binomial")
fit1 <- glm(addepev2 ~ ., data=hlthsub2, family = "binomial")
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
fit2 <- glm(diabete3 ~ cvdinfr4+addepev2+bphigh4+smoke100+
toldhi2+sleptim1, data=hlthsub2, family = "binomial")
summary(fit)
##
## Call:
## glm(formula = cvdinfr4 ~ ., family = "binomial", data = hlthsub2)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.8987 -0.3972 -0.2618 -0.2160 2.9409
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.262688 0.027655 -154.140 < 2e-16 ***
## sleptim1 0.006726 0.003001 2.241 0.025 *
## bphigh4 1.171271 0.013841 84.623 < 2e-16 ***
## toldhi2 0.544248 0.013241 41.105 < 2e-16 ***
## addepev2 0.267102 0.013492 19.797 < 2e-16 ***
## diabete3 0.834225 0.013728 60.766 < 2e-16 ***
## smoke100 0.640269 0.012428 51.517 < 2e-16 ***
## X_rfbmi5 -0.055023 0.013937 -3.948 7.88e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 236049 on 491774 degrees of freedom
## Residual deviance: 211103 on 491767 degrees of freedom
## AIC: 211119
##
## Number of Fisher Scoring iterations: 6
plot(fit)
Comparison of the Three Logistic Regression Models
library(stargazer, quietly = TRUE)
stargazer(fit,fit1,fit2,type="text")
##
## ========================================================
## Dependent variable:
## --------------------------------------
## cvdinfr4 addepev2 diabete3
## (1) (2) (3)
## --------------------------------------------------------
## sleptim1 0.007** -0.092*** 0.012***
## (0.003) (0.003) (0.003)
##
## bphigh4 1.171*** 0.165*** 1.532***
## (0.014) (0.008) (0.010)
##
## toldhi2 0.544*** 0.288*** 0.505***
## (0.013) (0.007) (0.010)
##
## addepev2 0.267*** 0.351***
## (0.013) (0.010)
##
## cvdinfr4 0.253*** 0.818***
## (0.013) (0.014)
##
## diabete3 0.834*** 0.327***
## (0.014) (0.010)
##
## smoke100 0.640*** 0.468*** 0.073***
## (0.012) (0.007) (0.009)
##
## X_rfbmi5 -0.055*** 0.152***
## (0.014) (0.008)
##
## Constant -4.263*** -1.383*** -3.349***
## (0.028) (0.020) (0.023)
##
## --------------------------------------------------------
## Observations 491,775 491,775 491,775
## Log Likelihood -105,551.600 -239,307.700 -166,126.000
## Akaike Inf. Crit. 211,119.100 478,631.300 332,266.000
## ========================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
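The coefficients in this table are on the log-odds scale. Exponentiating them gives odds ratios; for example, the bphigh4 coefficient of 1.171 in model (1) corresponds to an odds ratio of roughly exp(1.171) ≈ 3.2. A sketch on made-up grouped counts (the data frame `toy` and its numbers are assumptions):

```r
# Made-up grouped counts: smoking status against heart attack.
toy <- data.frame(
  smoke100 = c(1, 1, 0, 0),
  cvd      = c(1, 0, 1, 0),
  n        = c(30, 70, 10, 90)
)

# Logistic regression, using the counts as frequency weights.
fit_toy <- glm(cvd ~ smoke100, data = toy, family = binomial, weights = n)

# exp() turns log-odds coefficients into odds ratios.
odds_ratios <- exp(coef(fit_toy))
print(round(odds_ratios, 2))

# With a single binary predictor this equals the cross-product ratio of the table.
cross_product <- (30 * 90) / (70 * 10)
print(round(cross_product, 2))
```

In a model with several predictors each odds ratio is adjusted for the others, so it can differ from the unadjusted ratio of the corresponding 2 x 2 table, and may even point in the other direction.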
Correlation of the variables with, e.g., diabetes (the same can be done for sleptim1, bphigh4, toldhi2, cvdinfr4, addepev2 or smoke100 if required). The correlations of diabetes with sleep time, high blood pressure, cholesterol, heart attack, depression and smoking are given below.
cor(hlthsub2$diabete3,hlthsub2$sleptim1)
## [1] 0.004080381
cor(hlthsub2$diabete3,hlthsub2$bphigh4)
## [1] 0.2719022
cor(hlthsub2$diabete3,hlthsub2$toldhi2)
## [1] 0.1308022
cor(hlthsub2$diabete3,hlthsub2$cvdinfr4)
## [1] 0.1597274
cor(hlthsub2$diabete3,hlthsub2$addepev2)
## [1] 0.07907738
cor(hlthsub2$diabete3,hlthsub2$smoke100)
## [1] 0.04739043
aggregate(diabete3~X_rfbmi5,data=hlthsub2,mean)
## X_rfbmi5 diabete3
## 1 0 0.05668021
## 2 1 0.16416525
Correlation analysis on subset hlthsub2: the correlation of each variable with sleptim1, sorted in ascending order.
coran <- sort(cor(hlthsub2)[,1])
print(coran)
addepev2 X_rfbmi5 smoke100 toldhi2 cvdinfr4
-0.051985927 -0.028331766 -0.024540698 -0.002282787 0.001512380
bphigh4 diabete3 sleptim1
0.002993190 0.004080381 1.000000000
plot(coran)
Fisher's Exact Test on Selected Variables.
Cardiovascular disorder and obesity
##
## Fisher's Exact Test for Count Data
##
## data: hlthsub2$cvdinfr4 and hlthsub2$X_rfbmi5
## p-value < 2.2e-16
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 1.412973 1.488011
## sample estimates:
## odds ratio
## 1.449966
Diabetes and Depression
##
## Fisher's Exact Test for Count Data
##
## data: hlthsub2$diabete3 and hlthsub2$addepev2
## p-value < 2.2e-16
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 1.669775 1.734516
## sample estimates:
## odds ratio
## 1.701818
Smoking and cardiovascular disease
##
## Fisher's Exact Test for Count Data
##
## data: hlthsub2$smoke100 and hlthsub2$cvdinfr4
## p-value < 2.2e-16
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 2.126275 2.229919
## sample estimates:
## odds ratio
## 2.177476
Diabetes and Obesity
##
## Fisher's Exact Test for Count Data
##
## data: hlthsub2$diabete3 and hlthsub2$X_rfbmi5
## p-value < 2.2e-16
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 3.194582 3.344753
## sample estimates:
## odds ratio
## 3.26872
Diabetes and cardiovascular disease
##
## Fisher's Exact Test for Count Data
##
## data: hlthsub2$cvdinfr4 and hlthsub2$diabete3
## p-value < 2.2e-16
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 3.743158 3.936279
## sample estimates:
## odds ratio
## 3.838856
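Output of this kind comes from `fisher.test()` applied to two binary variables or to their 2 x 2 table. A minimal sketch on made-up counts (the table `tab` is an assumption, not survey data):

```r
# Made-up 2 x 2 counts: rows = diabetes, columns = overweight or obese.
tab <- matrix(c(40, 10,
                60, 90),
              nrow = 2, byrow = TRUE,
              dimnames = list(diabetes = c("Yes", "No"),
                              overweight = c("Yes", "No")))

ft <- fisher.test(tab)
print(ft$estimate)   # conditional maximum-likelihood odds ratio
print(ft$conf.int)   # 95 percent confidence interval
print(ft$p.value)

# The simple cross-product ratio is close to, but not identical to, the estimate.
sample_or <- (40 * 90) / (10 * 60)
print(sample_or)
```

An odds ratio whose confidence interval lies entirely above 1 indicates a positive association between the two conditions; with samples as large as the survey, even small departures from 1 produce very small p-values, so the size of the odds ratio is the more informative number.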
Run Density Plots of a few selected health outcomes
From the above results we can infer that:
1. Sleeping time has a weak negative association with depression, smoking and obesity: respondents who smoke, are overweight or obese, or report depression tend to sleep slightly fewer hours.
2. Exercise has a negative association with depression, diabetes, high blood pressure and cholesterol. This suggests that health improves with exercise and that chronic disorders such as high blood pressure and high cholesterol become less common.
3. Cardiovascular disorders are directly related to diabetes, high blood pressure, high cholesterol, depression, smoking and obesity, but show no clear relationship with exercise or sleeping time.
These associations between outcomes and their possible causes support our hypothesis.
"To lead a good life, one must avoid excesses and be physically active, irrespective of age or sex."