library(viridis)
library(gt)
library(hrbrthemes)
library(measurements)
library(dplyr)
library(ggplot2)setwd("D:/Google Drive/COPPEAD/R/COURSERA/Project")
load("brfss2013.RData")
# Trimming by the ones who completed the interview
brfss2013_TRIM <- brfss2013 %>%
filter (brfss2013$dispcode == "Completed interview")
brfss2013_TRIM <- brfss2013_TRIM[,!(names(brfss2013_TRIM) %in% c("dispcode"))]The Behavioral Risk Factor Surveillance System (BRFSS) is described on the by the CDC as
“The BRFSS objective is to collect uniform, state-specific data on preventive health practices and risk behaviors that are linked to chronic diseases, injuries, and preventable infectious diseases that affect the adult population. (…) For most states, BRFSS is the only source of state-based health risk behavior data related to chronic disease prevalence. (…) Currently, BRFSS collects data in all 50 states as well as the District of Columbia and three U.S. territories. BRFSS completes more than 400,000 adult interviews each year, making it the largest continuously conducted multi-mode (mail, landline phone, and cell phone) health survey system in the world”.
For that matter, from a data scientist, statistician, or any researcher point of view, the 2013 sample has an enormous amount of information, not to mention the gigantic datafile with almost 500.000 observations of 330 variables. The wealth of data is only matched by the care that must be taken manipulating it. One issue, for example, is the overwhelmingly large number of NAs, caused mostly by the conditional multi-sections of the survey. Some variables show great discrepancy, and values often transmute to categories (e.g. the height related variables, some of them appearing in this paper).
Anyway, it is understood that any hassle is the direct consequence of the nature of the survey, it being large, complex and comprehensive. The health and socioeconomic data here present many, almost infinite possibilities of correlation, as seen in the following sections.
NAreport <- sapply(brfss2013_TRIM, function(x) sum(is.na(x)))
# How many NAs per column?
NAreport ## X_state fmonth idate imonth iday iyear seqno X_psu
## 0 0 0 0 0 0 0 0
## ctelenum pvtresd1 colghous stateres cellfon3 ladult numadult nummen
## 8 105539 433187 105544 109006 433192 105547 105548
## numwomen genhlth physhlth menthlth poorhlth hlthpln1 persdoc2 medcost
## 105548 1701 9237 7250 213303 1502 1429 1017
## checkup1 sleptim1 bphigh4 bpmeds bloodcho cholchk toldhi2 cvdinfr4
## 5327 6076 1159 254692 7563 60350 59254 2155
## cvdcrhd4 cvdstrk3 asthma3 asthnow chcscncr chcocncr chccopd1 havarth3
## 3877 1273 1346 375411 1146 968 2366 2561
## addepev2 chckidny diabete3 veteran3 marital children educa employ1
## 1959 1475 674 368 2215 1210 1040 1803
## income2 weight2 height3 numhhol2 numphon2 cpdemo1 cpdemo4 internet
## 58890 0 4333 106321 418472 106774 187283 1054
## renthom1 sex pregnant qlactlm2 useequip blind decide diffwalk
## 3479 2 368060 2415 706 1307 2378 1683
## diffdres diffalon smoke100 smokday2 stopsmk2 lastsmk2 usenow3 alcday5
## 801 1422 2159 238197 365217 307035 830 4331
## avedrnk2 drnk3ge5 maxdrnks fruitju1 fruit1 fvbeans fvgreen fvorang
## 223515 223026 226489 8791 5903 8246 5363 6003
## vegetab1 exerany2 exract11 exeroft1 exerhmm1 exract21 exeroft2 exerhmm2
## 7305 1385 121760 124293 127991 125414 231177 236051
## strength lmtjoin3 arthdis2 arthsocl joinpain seatbelt flushot6 flshtmy2
## 3960 285259 287798 285097 290054 1221 1800 245389
## tetanus pneuvac3 hivtst6 hivtstd3 whrtst10 pdiabtst prediab1 diabage2
## 42466 42282 13338 334671 311104 228443 219840 0
## insulin bldsugar feetchk2 doctdiab chkhemo3 feetchk eyeexam diabeye
## 399996 400541 400947 0 402892 401510 400494 400419
## diabedu painact2 qlmentl2 qlstres2 qlhlth2 medicare hlthcvrg delaymed
## 400067 432763 432759 432764 432785 148250 0 118678
## dlyother nocov121 lstcovrg drvisits medscost carercvd medbills ssbsugar
## 0 151260 401015 126730 116727 126087 118156 331778
## ssbfrut2 wtchsalt longwtch dradvise asthmage asattack aservist asdrvist
## 331921 306916 363432 306886 431757 432243 432715 432731
## asrchkup asactlim asymptom asnoslep asthmed3 asinhalr harehab1 strehab1
## 432269 432315 432274 432525 432250 432259 424823 427157
## cvdasprn aspunsaf rlivpain rduchart rducstrk arttoday arthwgt arthexer
## 299203 346224 387161 388116 390642 411946 411939 412056
## arthedu imfvplac hpvadvc2 hpvadsht hadmam howlong profexam lengexam
## 411877 410851 420550 432062 403697 409231 403829 407123
## hadpap2 lastpap2 hadhyst2 bldstool lstblds3 hadsigm3 hadsgco1 lastsig3
## 403828 405987 404017 354569 403051 354366 377424 376626
## pcpsaad2 pcpsadi1 pcpsare1 psatest1 psatime pcpsars1 pcpsade1 pcdmdecn
## 426413 426459 426438 426519 429500 429525 429449 431996
## rrclass2 rrcognt2 rratwrk2 rrhcare3 rrphysm2 rremtsm2 misnervs mishopls
## 429413 429465 431725 429916 429364 429364 397684 397660
## misrstls misdeprd miseffrt miswtles misnowrk mistmnt mistrhlp misphlpf
## 397766 397695 397914 397719 397831 397732 399068 399129
## scntmony scntmeal scntpaid scntwrk1 scntlpad scntlwk1 scntvot1 rcsgendr
## 371603 367716 401315 401149 411352 412053 368408 364566
## rcsrltn2 casthdx2 casthno2 emtsuprt lsatisfy ctelnum1 cellfon2 cadult
## 364772 368556 424329 421614 421563 327684 327683 327683
## pvtresd2 cclghous cstate landline pctcell qstver qstlang mscode
## 327688 432335 327687 327690 412329 6 288 110328
## X_ststr X_strwt X_rawrake X_wt2rake X_imprace X_impnph X_impeduc X_impmrtl
## 0 2 2 2 5 105543 433220 433220
## X_imphome X_chispnc X_crace1 X_impcage X_impcrac X_impcsex X_cllcpwt X_dualuse
## 433220 362652 363028 358830 358829 358828 363533 5
## X_dualcor X_llcpwt2 X_llcpwt X_rfhlth X_hcvu651 X_rfhype5 X_cholchk X_rfchol
## 379029 1 2 1703 150406 1164 11902 59255
## X_ltasth1 X_casthm1 X_asthms1 X_drdxar1 X_prace1 X_mrace1 X_hispanc X_race
## 1352 2981 2980 2566 6523 6523 4419 6870
## X_raceg21 X_racegr3 X_race_g1 X_ageg5yr X_age65yr X_age_g htin4 htm4
## 6870 6872 6870 3525 3527 7 5860 4351
## wtkg3 X_bmi5 X_bmi5cat X_rfbmi5 X_chldcnt X_educag X_incomg X_smoker3
## 15260 20159 20159 20160 1215 1044 58891 2439
## X_rfsmok3 drnkany5 drocdy3_ X_rfbing5 X_drnkdy4 X_drnkmo4 X_rfdrhv4 X_rfdrmn4
## 2441 4333 4332 6913 7405 7406 7407 261670
## X_rfdrwm4 ftjuda1_ frutda1_ beanday_ grenday_ orngday_ vegeda1_ X_misfrtn
## 178962 8793 5904 8248 5364 6003 7305 5
## X_misvegn X_frtresp X_vegresp X_frutsum X_vegesum X_frtlt1 X_veglt1 X_frt16
## 3 7 5 2194 1662 2194 1662 5
## X_veg23 X_fruitex X_vegetex X_totinda metvl11_ metvl21_ maxvo2_ fc60_
## 6 4 6 1390 135016 142322 3523 3525
## actin11_ actin21_ padur1_ padur2_ pafreq1_ pafreq2_ X_minac11 X_minac21
## 137441 144669 127992 236052 124297 231178 129534 133037
## strfreq_ pamiss1_ pamin11_ pamin21_ pa1min_ pavig11_ pavig21_ pa1vigm_
## 3964 1388 144761 151037 127494 140690 147741 0
## X_pacat1 X_paindx1 X_pa150r2 X_pa300r2 X_pa30021 X_pastrng X_parec1 X_pastae1
## 40046 31447 31446 40046 40046 3965 33782 33784
## X_lmtact1 X_lmtwrk1 X_lmtscl1 X_rfseat2 X_rfseat3 X_flshot6 X_pneumo2 X_aidtst3
## 3967 6505 3804 1228 1227 288065 293707 13341
## X_age80
## 6
str(NAreport)## Named int [1:329] 0 0 0 0 0 0 0 0 8 105539 ...
## - attr(*, "names")= chr [1:329] "X_state" "fmonth" "idate" "imonth" ...
summary(NAreport)## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 2441 116727 174883 379029 433220
COMMENTARY
The variables with many NAs seem to be the result of:
Any considered variable should be investigated carefully to avoid problems in areas of R where NAs are not well treated.
The strategy adopted in this work is:
Research question 1:
How the height is distributed on this population? What happens when you consider Gender?
It is a classic statistical inference. Does this sample follow the normal curve? Its statistics are consistent with the ones known about the population.
Research question 2:
What can be said about the population who smokes?
The decrease of the smoker’s proportion of the general population is good news. But what can we learn from this sample about them? How it correlates with other health aspects?
Research question 3:
How Education impacts another aspects of life, including health?
Education affects all aspects of life. How it does impact with health habits? At what amount it correlates with income, for example?
Research question 4:
A study on the correlations of depression.
Here we have a simple question: if you have depression or not. If you have the disorder, what kind of person you are more likely to be?
Research question 1:
For this question I will use the variable htin4, described “Reported height in inches”. Tried before using other which used foot and inches. It did not computed statistics well, since 1 foot = 12 inches. After some tries, I settled for number of inches and transformed to metric.
Lets create a new variable with the metric values:
brfss2013_TRIM <- brfss2013_TRIM %>%
mutate(Metric_Height = round(conv_unit(htin4,"inch","m"), digits = 2))
summary(brfss2013_TRIM$Metric_Height)## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.050 1.630 1.680 1.693 1.780 172.820 5860
Since the median and the mean are very close, this distribution looks not much skewed and close to the normal. But there are some unreasonably extreme values. Considering this distribution resembles the normal curve enough, I decided to restrain into the sd range for the overall distribution, and plot it using the 2 SD range.
brfss_mean <- mean(brfss2013_TRIM$Metric_Height, na.rm = TRUE)
brfss_sd <- sd(brfss2013_TRIM$Metric_Height, na.rm = TRUE)
brfss2013_TRIM %>%
ggplot() +
geom_bar(aes(x = Metric_Height)) +
xlim(brfss_mean-(2*brfss_sd),brfss_mean+(2*brfss_sd))## Warning: Removed 5871 rows containing non-finite values (stat_count).
Now, lets separate the population by gender. Each one will have its own SD and Mean. Lets calculate it.
M_mean <- brfss2013_TRIM %>%
filter(sex == "Male") %>%
summarise("Mean for Males" = mean(Metric_Height, na.rm = TRUE))
F_mean <- brfss2013_TRIM %>%
filter(sex == "Female") %>%
summarise("Mean for Females" = mean(Metric_Height, na.rm = TRUE))
M_sd <- brfss2013_TRIM %>%
filter(sex == "Male") %>%
summarise("SD for Males" = sd(Metric_Height, na.rm = TRUE))
F_sd <- brfss2013_TRIM %>%
filter(sex == "Female") %>%
summarise("SD for Females" = sd(Metric_Height, na.rm = TRUE))
paste("Stats for Male distribution of height")## [1] "Stats for Male distribution of height"
M_sd## SD for Males
## 1 0.07698236
brfss2013_TRIM %>%
filter(sex == "Male") %>%
select(Metric_Height) %>%
summary() ## Metric_Height
## Min. :0.050
## 1st Qu.:1.730
## Median :1.780
## Mean :1.781
## 3rd Qu.:1.830
## Max. :2.360
## NA's :1898
paste("Stats for Female distribution of height")## [1] "Stats for Female distribution of height"
F_sd## SD for Females
## 1 0.4624587
brfss2013_TRIM %>%
filter(sex == "Female") %>%
select(Metric_Height) %>%
summary() ## Metric_Height
## Min. : 0.910
## 1st Qu.: 1.570
## Median : 1.630
## Mean : 1.633
## 3rd Qu.: 1.680
## Max. :172.820
## NA's :3962
Something is wrong, the SD of the two populations are too different. Looking at summary of the female distribution, we see an unreasonable max value, suggesting that there are too much extreme values interfering with the SD, a statistic particularly sensible to these values. The strategy:
brfss2013_TRIM$Metric_Height <- replace(brfss2013_TRIM$Metric_Height,brfss2013_TRIM$Metric_Height > 2.360,NA)
F_mean <- brfss2013_TRIM %>%
filter(sex == "Female") %>%
summarise("Mean for Females" = mean(Metric_Height, na.rm = TRUE))
F_mean <- brfss2013_TRIM %>%
filter(sex == "Female") %>%
summarise("Mean for Females" = mean(Metric_Height, na.rm = TRUE))
paste("Stats for Female distribution of height")## [1] "Stats for Female distribution of height"
paste("Standard Deviation")## [1] "Standard Deviation"
F_sd## SD for Females
## 1 0.4624587
brfss2013_TRIM %>%
filter(sex == "Female") %>%
select(Metric_Height) %>%
summary() ## Metric_Height
## Min. :0.910
## 1st Qu.:1.570
## Median :1.630
## Mean :1.632
## 3rd Qu.:1.680
## Max. :2.260
## NA's :3964
F_sd <- as.numeric(F_sd); M_sd <- as.numeric(M_sd); F_mean <- as.numeric(F_mean); M_mean <- as.numeric(M_mean)Still a large disparity between the SDs. Globally, the height ratio between genders is 1.07, meaning that on average, men are about 7% taller than women. Across the world, this relative difference between the sexes can vary from only 2-3% to over 12%. Why this scenario is not reflected on the BRFSS survey can be subject for further research.
Proceeding with the plot, the following piece of code is an example of the range used to plot.
xlim(M_mean-(2*M_sd),M_mean+(2*M_sd))
brfss2013_TRIM %>%
filter(sex == "Male") %>%
ggplot() +
geom_bar(aes(x = Metric_Height)) +
xlim(M_mean-(2*M_sd),M_mean+(2*M_sd)) +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1), axis.title.x=element_blank()) +
labs(title="Male Height Frequency Distrib")## Warning: Removed 7442 rows containing non-finite values (stat_count).
## Warning: Removed 2 rows containing missing values (geom_bar).
brfss2013_TRIM %>%
filter(sex == "Female") %>%
ggplot() +
geom_bar(aes(x = Metric_Height)) +
xlim(F_mean-(2*F_sd),F_mean+(2*F_sd)) +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1), axis.title.x=element_blank()) +
labs(title="Female Height Frequency Distrib")## Warning: Removed 3964 rows containing non-finite values (stat_count).
ggplot(na.omit(brfss2013_TRIM[,c("sex","Metric_Height")]), aes(x = sex, y = Metric_Height)) +
geom_boxplot(notch = TRUE, fill = "lightgray")+
stat_summary(fun.y = mean, geom = "point", shape = 18, size = 2.5, color = "#FC4E07")## Warning: `fun.y` is deprecated. Use `fun` instead.
The distributions are very similar, as we can see in the last boxplots.
Research quesion 2:
First of all, lets adjust the factors. It will make the graphics more manageable. Then, we trim further, now with variables of interest and we are ready to plot the absolute numbers.
brfss2013_TRIM$X_smoker3 <- as.character(brfss2013_TRIM$X_smoker3)
brfss2013_TRIM$X_smoker3[brfss2013_TRIM$X_smoker3 == "Current smoker - now smokes every day"] <- c("Smokes Daily")
brfss2013_TRIM$X_smoker3[brfss2013_TRIM$X_smoker3 == "Current smoker - now smokes some days"] <- c("Smokes occasionally")
brfss2013_TRIM$X_smoker3 <- as.factor(brfss2013_TRIM$X_smoker3)
Interest_Smoking <- na.omit(brfss2013_TRIM[,c("X_smoker3","sex","X_age_g","genhlth","X_bmi5cat","X_educag","X_incomg","X_rfdrhv4")])
Interest_Smoking %>%
ggplot() +
geom_bar(aes(x = X_smoker3)) +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1), axis.title.x=element_blank()) +
labs(title="Smoking categories: Absolute numbers")Fortunately we have a significant number of non smokers or people who have never smoked. Anyway, The CDC says almost 38 million americans still smoke.
The removal of the NAs left us with more than 353k observations, who have completed the survey. Good enough.
ggplot(Interest_Smoking, aes(X_smoker3, group = sex)) +
geom_bar(aes(y = ..prop.., fill = factor(..x..)), stat="count",show.legend = FALSE) +
scale_y_continuous(labels=scales::percent) +
ylab("relative frequencies") +
facet_grid(~sex) +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1), axis.title.x=element_blank()) +
labs(title="Gender Vs. Smoking Habits")ggplot(count(Interest_Smoking,sex,X_smoker3), aes(fill=X_smoker3, y=n, x=sex)) +
geom_bar(position="fill", stat="identity") +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1), legend.title=element_blank(), axis.title.x=element_blank()) +
labs(title="Gender Vs. Smoking Habits: Stacked", y = "Proportion")COMMENTARY
Similar amounts of smokers among the genders, but more male former smokers
ggplot(Interest_Smoking, aes(X_smoker3, group = X_age_g)) +
geom_bar(aes(y = ..prop.., fill = factor(..x..)), stat="count",show.legend = FALSE) +
scale_y_continuous(labels=scales::percent) +
ylab("relative frequencies") +
facet_grid(~X_age_g) +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1), axis.title.x=element_blank()) +
labs(title="Age Vs. Smoking Habits")ggplot(count(Interest_Smoking,X_age_g,X_smoker3), aes(fill=X_smoker3, y=n, x=X_age_g)) +
geom_bar(position="fill", stat="identity") +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1), legend.title=element_blank(), axis.title.x=element_blank()) +
labs(title="Age Vs. Smoking Habits: Stacked",y = "Proportion")COMMENTARY
Irregular frequency of smokers in general, and the expected trend of increasing number of former smokers as age progresses. A large amount of youth who have never smoked.
ggplot(Interest_Smoking, aes(X_smoker3, group = genhlth)) +
geom_bar(aes(y = ..prop.., fill = factor(..x..)), stat="count",show.legend = FALSE) +
scale_y_continuous(labels=scales::percent) +
ylab("relative frequencies") +
facet_grid(~genhlth) +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1), axis.title.x=element_blank()) +
labs(title="General Health Vs. Smoking Habits")ggplot(count(Interest_Smoking,genhlth,X_smoker3), aes(fill=X_smoker3, y=n, x=genhlth)) +
geom_bar(position="fill", stat="identity") +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1), legend.title=element_blank(), axis.title.x=element_blank()) +
labs(title="General Health Vs. Smoking Habits: Stacked", y = "Proportion")COMMENTARY
A clear and expected trend correlating smoking with poorness of health.
ggplot(Interest_Smoking, aes(X_smoker3, group = X_educag)) +
geom_bar(aes(y = ..prop.., fill = factor(..x..)), stat="count",show.legend = FALSE) +
scale_y_continuous(labels=scales::percent) +
ylab("relative frequencies") +
facet_grid(~X_educag) +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1), axis.title.x=element_blank()) +
labs(title="Education Level Vs. Smoking Habits")ggplot(count(Interest_Smoking,X_educag,X_smoker3), aes(fill=X_smoker3, y=n, x=X_educag)) +
geom_bar(position="fill", stat="identity") +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1), legend.title=element_blank(), axis.title.x=element_blank()) +
labs(title="Education Level Vs. Smoking Habits: Stacked", y = "Proportion")COMMENTARY
Another trend that may be somewhat recent: A more educated a person, the lesser likelihood he or she have never smoked, or smokes.
ggplot(Interest_Smoking, aes(X_smoker3, group = X_bmi5cat)) +
geom_bar(aes(y = ..prop.., fill = factor(..x..)), stat="count",show.legend = FALSE) +
scale_y_continuous(labels=scales::percent) +
ylab("relative frequencies") +
facet_grid(~X_bmi5cat) +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1), axis.title.x=element_blank()) +
labs(title="Body Mass Index Vs. Smoking Habits")ggplot(count(Interest_Smoking,X_bmi5cat,X_smoker3), aes(fill=X_smoker3, y=n, x=X_bmi5cat)) +
geom_bar(position="fill", stat="identity") +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1), legend.title=element_blank(), axis.title.x=element_blank()) +
labs(title="Body Mass Index Vs. Smoking Habits: Stacked", y = "Proportion")COMMENTARY
A slight negative correlation linking weight and smoking.
ggplot(Interest_Smoking, aes(X_smoker3, group = X_incomg)) +
geom_bar(aes(y = ..prop.., fill = factor(..x..)), stat="count",show.legend = FALSE) +
scale_y_continuous(labels=scales::percent) +
ylab("relative frequencies") +
facet_grid(~X_incomg) +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1), axis.title.x=element_blank()) +
labs(title="Income Vs. Smoking Habits")ggplot(count(Interest_Smoking,X_incomg,X_smoker3), aes(fill=X_smoker3, y=n, x=X_incomg)) +
geom_bar(position="fill", stat="identity") +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1), legend.title=element_blank(), axis.title.x=element_blank()) +
labs(title="Income Vs. Smoking Habits: Stacked", y = "Proportion")COMMENTARY
A mild but consistent correlation: the more educated are more prone to forfeit smoking.
ggplot(Interest_Smoking, aes(X_smoker3, group = X_rfdrhv4)) +
geom_bar(aes(y = ..prop.., fill = factor(..x..)), stat="count",show.legend = FALSE) +
scale_y_continuous(labels=scales::percent) +
ylab("relative frequencies") +
facet_grid(~X_rfdrhv4) +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1), axis.title.x=element_blank()) +
labs(title="Heavy Drinking Vs. Smoking Habits")ggplot(count(Interest_Smoking,X_rfdrhv4,X_smoker3), aes(fill=X_smoker3, y=n, x=X_rfdrhv4)) +
geom_bar(position="fill", stat="identity") +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1), legend.title=element_blank(), axis.title.x=element_blank()) +
labs(title="Heavy Drinking Vs. Smoking Habits: Stacked", y = "Proportion")COMMENTARY
A clear indicator of self-destructive behavior.
Research question 3:
Again, more adjustments to increase the quality of the graphs. Zooming in on the variables of interest and plotting the absolute values.
brfss2013_TRIM$X_educag <- as.character(brfss2013_TRIM$X_educag)
brfss2013_TRIM$X_educag[brfss2013_TRIM$X_educag == "Attended college or technical school"] <- c("Atten. College/Tech")
brfss2013_TRIM$X_educag[brfss2013_TRIM$X_educag == "Did not graduate high school"] <- c("HS Dropout")
brfss2013_TRIM$X_educag[brfss2013_TRIM$X_educag == "Graduated from college or technical school"] <- c("Grad. College/Tech")
brfss2013_TRIM$X_educag[brfss2013_TRIM$X_educag == "Graduated high school"] <- c("Grad. HS")
brfss2013_TRIM$X_educag <- as.factor(brfss2013_TRIM$X_educag)
Interest_Education <- na.omit(brfss2013_TRIM[,c("X_educag","sex","X_age_g","genhlth","X_bmi5cat","X_incomg","X_rfdrhv4")])
# The absolute numbers
Interest_Education %>%
ggplot() +
geom_bar(aes(x = X_educag)) +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1), axis.title.x=element_blank()) +
labs(title="Education categories: Absolute numbers")The removal of the NAs left us with more than 353k observations, who have completed the survey. Good enough.
The education profile shown here resembles the US population.
# Proportions by Gender
ggplot(Interest_Education, aes(X_educag, group = sex)) +
geom_bar(aes(y = ..prop.., fill = factor(..x..)), stat="count",show.legend = FALSE) +
scale_y_continuous(labels=scales::percent) +
ylab("relative frequencies") +
facet_grid(~sex) +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1), axis.title.x=element_blank()) +
labs(title="Gender Vs. Education")ggplot(count(Interest_Education,sex,X_educag), aes(fill=X_educag, y=n, x=sex)) +
geom_bar(position="fill", stat="identity") +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1), legend.title=element_blank(), axis.title.x=element_blank()) +
labs(title="Gender Vs. Education: Stacked", y = "Proportion")COMMENTARY
Very similar gender profiles, indicating almost ideal gender equality on education.
ggplot(Interest_Education, aes(X_educag, group = X_age_g)) +
geom_bar(aes(y = ..prop.., fill = factor(..x..)), stat="count",show.legend = FALSE) +
scale_y_continuous(labels=scales::percent) +
ylab("relative frequencies") +
facet_grid(~X_age_g) +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1), axis.title.x=element_blank()) +
labs(title="Age Vs. Education")ggplot(count(Interest_Education,X_age_g,X_educag), aes(fill=X_educag, y=n, x=X_age_g)) +
geom_bar(position="fill", stat="identity") +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1), legend.title=element_blank(), axis.title.x=element_blank()) +
labs(title="Age Vs. Education: Stacked",y = "Proportion")COMMENTARY
The tendency here shows that the likelihood of a grad/tech diploma somewhat decrease with age, indicating an increased proportion of the population graduated on grad/tech over time.
ggplot(Interest_Education, aes(X_educag, group = genhlth)) +
geom_bar(aes(y = ..prop.., fill = factor(..x..)), stat="count",show.legend = FALSE) +
scale_y_continuous(labels=scales::percent) +
ylab("relative frequencies") +
facet_grid(~genhlth) +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1), axis.title.x=element_blank()) +
labs(title="General Health Vs. Education")ggplot(count(Interest_Education,genhlth,X_educag), aes(fill=X_educag, y=n, x=genhlth)) +
geom_bar(position="fill", stat="identity") +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1), legend.title=element_blank(), axis.title.x=element_blank()) +
labs(title="General Health Vs. Education: Stacked", y = "Proportion")COMMENTARY
The more educated tend to have larger incomes, consequently better health care. The correlation shown is particularly strong.
# Proportions by BMI
ggplot(Interest_Education, aes(X_educag, group = X_bmi5cat)) +
geom_bar(aes(y = ..prop.., fill = factor(..x..)), stat="count",show.legend = FALSE) +
scale_y_continuous(labels=scales::percent) +
ylab("relative frequencies") +
facet_grid(~X_bmi5cat) +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1), axis.title.x=element_blank()) +
labs(title="Body Mass Index Vs. Education")ggplot(count(Interest_Education,X_bmi5cat,X_educag), aes(fill=X_educag, y=n, x=X_bmi5cat)) +
geom_bar(position="fill", stat="identity") +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1), legend.title=element_blank(), axis.title.x=element_blank()) +
labs(title="Body Mass Index Vs. Education: Stacked", y = "Proportion")COMMENTARY
A somewhat surprising positive correlation between weight and college education, albeit very mild.
ggplot(Interest_Education, aes(X_educag, group = X_incomg)) +
geom_bar(aes(y = ..prop.., fill = factor(..x..)), stat="count",show.legend = FALSE) +
scale_y_continuous(labels=scales::percent) +
ylab("relative frequencies") +
facet_grid(~X_incomg) +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1), axis.title.x=element_blank()) +
labs(title="Income Vs. Education")ggplot(count(Interest_Education,X_incomg,X_educag), aes(fill=X_educag, y=n, x=X_incomg)) +
geom_bar(position="fill", stat="identity") +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1), legend.title=element_blank(), axis.title.x=element_blank()) +
labs(title="Income Vs. Education: Stacked", y = "Proportion")COMMENTARY
An expected strong linking about education and income, despite the student loan crisis in the US.
ggplot(Interest_Education, aes(X_educag, group = X_rfdrhv4)) +
geom_bar(aes(y = ..prop.., fill = factor(..x..)), stat="count",show.legend = FALSE) +
scale_y_continuous(labels=scales::percent) +
ylab("relative frequencies") +
facet_grid(~X_rfdrhv4) +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1), axis.title.x=element_blank()) +
labs(title="Heavy Drinking Vs. Education")ggplot(count(Interest_Education,X_rfdrhv4,X_educag), aes(fill=X_educag, y=n, x=X_rfdrhv4)) +
geom_bar(position="fill", stat="identity") +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1), legend.title=element_blank(), axis.title.x=element_blank()) +
labs(title="Heavy Drinking Vs. Education: Stacked", y = "Proportion")COMMENTARY
Probably alcohol has a diversified audience among almost all walks of life. No good correlation here.
Research quesion 4:
First, we narrow the scope of this research. Then, we will plot the absolute numers.
Interest_Depression <- na.omit(brfss2013_TRIM[,c("addepev2","addepev2","sex","X_age_g","genhlth","X_bmi5cat","X_educag","X_incomg","X_rfdrhv4","X_smoker3")])
Interest_Depression %>%
ggplot() +
geom_bar(aes(x = addepev2)) +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1), axis.title.x=element_blank()) +
labs(title="Depression: Absolute numbers")COMMENTARY
A significant amount of people suffering of depression in the general population, showing an actual health crisis.
The removal of the NAs left us with more than 353k observations, who have completed the survey. Good enough.
ggplot(Interest_Depression, aes(addepev2, group = sex)) +
geom_bar(aes(y = ..prop.., fill = factor(..x..)), stat="count",show.legend = FALSE) +
scale_y_continuous(labels=scales::percent) +
ylab("relative frequencies") +
facet_grid(~sex) +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1), axis.title.x=element_blank()) +
labs(title="Gender Vs. Depression")ggplot(count(Interest_Depression,sex,addepev2), aes(fill=addepev2, y=n, x=sex)) +
geom_bar(position="fill", stat="identity") +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1), legend.title=element_blank(), axis.title.x=element_blank()) +
labs(title="Gender Vs. Depression: Stacked", y = "Proportion")COMMENTARY
A significant amount of the female population suffering from depression.
ggplot(Interest_Depression, aes(addepev2, group = X_age_g)) +
geom_bar(aes(y = ..prop.., fill = factor(..x..)), stat="count",show.legend = FALSE) +
scale_y_continuous(labels=scales::percent) +
ylab("relative frequencies") +
facet_grid(~X_age_g) +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1), axis.title.x=element_blank()) +
labs(title="Age Vs. Depression")ggplot(count(Interest_Depression,X_age_g,addepev2), aes(fill=addepev2, y=n, x=X_age_g)) +
geom_bar(position="fill", stat="identity") +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1), legend.title=element_blank(), axis.title.x=element_blank()) +
labs(title="Age Vs. Depression: Stacked",y = "Proportion")COMMENTARY
The probability of acquiring depression increases with age until it falls suddenly on the +65 demographic. The reasons deserve to be thoroughly investigated.
ggplot(Interest_Depression, aes(addepev2, group = genhlth)) +
geom_bar(aes(y = ..prop.., fill = factor(..x..)), stat="count",show.legend = FALSE) +
scale_y_continuous(labels=scales::percent) +
ylab("relative frequencies") +
facet_grid(~genhlth) +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1), axis.title.x=element_blank()) +
labs(title="General Health Vs. Depression")ggplot(count(Interest_Depression,genhlth,addepev2), aes(fill=addepev2, y=n, x=genhlth)) +
geom_bar(position="fill", stat="identity") +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1), legend.title=element_blank(), axis.title.x=element_blank()) +
labs(title="General Health Vs. Depression: Stacked", y = "Proportion")COMMENTARY
The expected link with general health and mental health. Although not absolute, it is strong.
ggplot(Interest_Depression, aes(addepev2, group = X_educag)) +
geom_bar(aes(y = ..prop.., fill = factor(..x..)), stat="count",show.legend = FALSE) +
scale_y_continuous(labels=scales::percent) +
ylab("relative frequencies") +
facet_grid(~X_educag) +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1), axis.title.x=element_blank()) +
labs(title="Education Level Vs. Depression")ggplot(count(Interest_Depression,X_educag,addepev2), aes(fill=addepev2, y=n, x=X_educag)) +
geom_bar(position="fill", stat="identity") +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1), legend.title=element_blank(), axis.title.x=element_blank()) +
labs(title="Education Level Vs. Depression: Stacked", y = "Proportion")COMMENTARY
A not clear correlation here. The motives and triggers may vary, though.
ggplot(Interest_Depression, aes(addepev2, group = X_bmi5cat)) +
geom_bar(aes(y = ..prop.., fill = factor(..x..)), stat="count",show.legend = FALSE) +
scale_y_continuous(labels=scales::percent) +
ylab("relative frequencies") +
facet_grid(~X_bmi5cat) +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1), axis.title.x=element_blank()) +
labs(title="Body Mass Index Vs. Depression")ggplot(count(Interest_Depression,X_bmi5cat,addepev2), aes(fill=addepev2, y=n, x=X_bmi5cat)) +
geom_bar(position="fill", stat="identity") +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1), legend.title=element_blank(), axis.title.x=element_blank()) +
labs(title="Body Mass Index Vs. Depression: Stacked", y = "Proportion")COMMENTARY
Another not very clear correlation until the obese demographic is considered, when the proportion of depressed people increases significantly.
ggplot(Interest_Depression, aes(addepev2, group = X_incomg)) +
geom_bar(aes(y = ..prop.., fill = factor(..x..)), stat="count",show.legend = FALSE) +
scale_y_continuous(labels=scales::percent) +
ylab("relative frequencies") +
facet_grid(~X_incomg) +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1), axis.title.x=element_blank()) +
labs(title="Income Vs. Depression")ggplot(count(Interest_Depression,X_incomg,addepev2), aes(fill=addepev2, y=n, x=X_incomg)) +
geom_bar(position="fill", stat="identity") +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1), legend.title=element_blank(), axis.title.x=element_blank()) +
labs(title="Income Vs. Depression: Stacked", y = "Proportion")COMMENTARY
Yes, the old saying “money brings happiness” can be more statistically and accurately rephrased as “happiness seems to be positively correlated with income”.
ggplot(Interest_Depression, aes(addepev2, group = X_rfdrhv4)) +
geom_bar(aes(y = ..prop.., fill = factor(..x..)), stat="count",show.legend = FALSE) +
scale_y_continuous(labels=scales::percent) +
ylab("relative frequencies") +
facet_grid(~X_rfdrhv4) +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1), axis.title.x=element_blank()) +
labs(title="Heavy Drinking Vs. Depression")ggplot(count(Interest_Depression,X_rfdrhv4,addepev2), aes(fill=addepev2, y=n, x=X_rfdrhv4)) +
geom_bar(position="fill", stat="identity") +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1), legend.title=element_blank(), axis.title.x=element_blank()) +
labs(title="Heavy Drinking Vs. Depression: Stacked", y = "Proportion")COMMENTARY
No clear link here. Probably you will find an excuse to drink regardless you are happy or sad.