Exploring the BRFSS data

Setup

Load packages

library(viridis)
library(gt)
library(hrbrthemes)
library(measurements)
library(dplyr)
library(ggplot2)

Load data

setwd("D:/Google Drive/COPPEAD/R/COURSERA/Project")
load("brfss2013.RData")


# Trimming by the ones who completed the interview

brfss2013_TRIM <- brfss2013 %>%
  filter (brfss2013$dispcode == "Completed interview")

brfss2013_TRIM <- brfss2013_TRIM[,!(names(brfss2013_TRIM) %in% c("dispcode"))]

Part 1: Data

The Behavioral Risk Factor Surveillance System (BRFSS) is described on the by the CDC as

“The BRFSS objective is to collect uniform, state-specific data on preventive health practices and risk behaviors that are linked to chronic diseases, injuries, and preventable infectious diseases that affect the adult population. (…) For most states, BRFSS is the only source of state-based health risk behavior data related to chronic disease prevalence. (…) Currently, BRFSS collects data in all 50 states as well as the District of Columbia and three U.S. territories. BRFSS completes more than 400,000 adult interviews each year, making it the largest continuously conducted multi-mode (mail, landline phone, and cell phone) health survey system in the world”.

For that matter, from a data scientist, statistician, or any researcher point of view, the 2013 sample has an enormous amount of information, not to mention the gigantic datafile with almost 500.000 observations of 330 variables. The wealth of data is only matched by the care that must be taken manipulating it. One issue, for example, is the overwhelmingly large number of NAs, caused mostly by the conditional multi-sections of the survey. Some variables show great discrepancy, and values often transmute to categories (e.g. the height related variables, some of them appearing in this paper).

Anyway, it is understood that any hassle is the direct consequence of the nature of the survey, it being large, complex and comprehensive. The health and socioeconomic data here present many, almost infinite possibilities of correlation, as seen in the following sections.

NAreport <- sapply(brfss2013_TRIM, function(x) sum(is.na(x)))

# How many NAs per column?
NAreport

##   X_state    fmonth     idate    imonth      iday     iyear     seqno     X_psu 
##         0         0         0         0         0         0         0         0 
##  ctelenum  pvtresd1  colghous  stateres  cellfon3    ladult  numadult    nummen 
##         8    105539    433187    105544    109006    433192    105547    105548 
##  numwomen   genhlth  physhlth  menthlth  poorhlth  hlthpln1  persdoc2   medcost 
##    105548      1701      9237      7250    213303      1502      1429      1017 
##  checkup1  sleptim1   bphigh4    bpmeds  bloodcho   cholchk   toldhi2  cvdinfr4 
##      5327      6076      1159    254692      7563     60350     59254      2155 
##  cvdcrhd4  cvdstrk3   asthma3   asthnow  chcscncr  chcocncr  chccopd1  havarth3 
##      3877      1273      1346    375411      1146       968      2366      2561 
##  addepev2  chckidny  diabete3  veteran3   marital  children     educa   employ1 
##      1959      1475       674       368      2215      1210      1040      1803 
##   income2   weight2   height3  numhhol2  numphon2   cpdemo1   cpdemo4  internet 
##     58890         0      4333    106321    418472    106774    187283      1054 
##  renthom1       sex  pregnant  qlactlm2  useequip     blind    decide  diffwalk 
##      3479         2    368060      2415       706      1307      2378      1683 
##  diffdres  diffalon  smoke100  smokday2  stopsmk2  lastsmk2   usenow3   alcday5 
##       801      1422      2159    238197    365217    307035       830      4331 
##  avedrnk2  drnk3ge5  maxdrnks  fruitju1    fruit1   fvbeans   fvgreen   fvorang 
##    223515    223026    226489      8791      5903      8246      5363      6003 
##  vegetab1  exerany2  exract11  exeroft1  exerhmm1  exract21  exeroft2  exerhmm2 
##      7305      1385    121760    124293    127991    125414    231177    236051 
##  strength  lmtjoin3  arthdis2  arthsocl  joinpain  seatbelt  flushot6  flshtmy2 
##      3960    285259    287798    285097    290054      1221      1800    245389 
##   tetanus  pneuvac3   hivtst6  hivtstd3  whrtst10  pdiabtst  prediab1  diabage2 
##     42466     42282     13338    334671    311104    228443    219840         0 
##   insulin  bldsugar  feetchk2  doctdiab  chkhemo3   feetchk   eyeexam   diabeye 
##    399996    400541    400947         0    402892    401510    400494    400419 
##   diabedu  painact2  qlmentl2  qlstres2   qlhlth2  medicare  hlthcvrg  delaymed 
##    400067    432763    432759    432764    432785    148250         0    118678 
##  dlyother  nocov121  lstcovrg  drvisits  medscost  carercvd  medbills  ssbsugar 
##         0    151260    401015    126730    116727    126087    118156    331778 
##  ssbfrut2  wtchsalt  longwtch  dradvise  asthmage  asattack  aservist  asdrvist 
##    331921    306916    363432    306886    431757    432243    432715    432731 
##  asrchkup  asactlim  asymptom  asnoslep  asthmed3  asinhalr  harehab1  strehab1 
##    432269    432315    432274    432525    432250    432259    424823    427157 
##  cvdasprn  aspunsaf  rlivpain  rduchart  rducstrk  arttoday   arthwgt  arthexer 
##    299203    346224    387161    388116    390642    411946    411939    412056 
##   arthedu  imfvplac  hpvadvc2  hpvadsht    hadmam   howlong  profexam  lengexam 
##    411877    410851    420550    432062    403697    409231    403829    407123 
##   hadpap2  lastpap2  hadhyst2  bldstool  lstblds3  hadsigm3  hadsgco1  lastsig3 
##    403828    405987    404017    354569    403051    354366    377424    376626 
##  pcpsaad2  pcpsadi1  pcpsare1  psatest1   psatime  pcpsars1  pcpsade1  pcdmdecn 
##    426413    426459    426438    426519    429500    429525    429449    431996 
##  rrclass2  rrcognt2  rratwrk2  rrhcare3  rrphysm2  rremtsm2  misnervs  mishopls 
##    429413    429465    431725    429916    429364    429364    397684    397660 
##  misrstls  misdeprd  miseffrt  miswtles  misnowrk   mistmnt  mistrhlp  misphlpf 
##    397766    397695    397914    397719    397831    397732    399068    399129 
##  scntmony  scntmeal  scntpaid  scntwrk1  scntlpad  scntlwk1  scntvot1  rcsgendr 
##    371603    367716    401315    401149    411352    412053    368408    364566 
##  rcsrltn2  casthdx2  casthno2  emtsuprt  lsatisfy  ctelnum1  cellfon2    cadult 
##    364772    368556    424329    421614    421563    327684    327683    327683 
##  pvtresd2  cclghous    cstate  landline   pctcell    qstver   qstlang    mscode 
##    327688    432335    327687    327690    412329         6       288    110328 
##   X_ststr   X_strwt X_rawrake X_wt2rake X_imprace  X_impnph X_impeduc X_impmrtl 
##         0         2         2         2         5    105543    433220    433220 
## X_imphome X_chispnc  X_crace1 X_impcage X_impcrac X_impcsex X_cllcpwt X_dualuse 
##    433220    362652    363028    358830    358829    358828    363533         5 
## X_dualcor X_llcpwt2  X_llcpwt  X_rfhlth X_hcvu651 X_rfhype5 X_cholchk  X_rfchol 
##    379029         1         2      1703    150406      1164     11902     59255 
## X_ltasth1 X_casthm1 X_asthms1 X_drdxar1  X_prace1  X_mrace1 X_hispanc    X_race 
##      1352      2981      2980      2566      6523      6523      4419      6870 
## X_raceg21 X_racegr3 X_race_g1 X_ageg5yr X_age65yr   X_age_g     htin4      htm4 
##      6870      6872      6870      3525      3527         7      5860      4351 
##     wtkg3    X_bmi5 X_bmi5cat  X_rfbmi5 X_chldcnt  X_educag  X_incomg X_smoker3 
##     15260     20159     20159     20160      1215      1044     58891      2439 
## X_rfsmok3  drnkany5  drocdy3_ X_rfbing5 X_drnkdy4 X_drnkmo4 X_rfdrhv4 X_rfdrmn4 
##      2441      4333      4332      6913      7405      7406      7407    261670 
## X_rfdrwm4  ftjuda1_  frutda1_  beanday_  grenday_  orngday_  vegeda1_ X_misfrtn 
##    178962      8793      5904      8248      5364      6003      7305         5 
## X_misvegn X_frtresp X_vegresp X_frutsum X_vegesum  X_frtlt1  X_veglt1   X_frt16 
##         3         7         5      2194      1662      2194      1662         5 
##   X_veg23 X_fruitex X_vegetex X_totinda  metvl11_  metvl21_   maxvo2_     fc60_ 
##         6         4         6      1390    135016    142322      3523      3525 
##  actin11_  actin21_   padur1_   padur2_  pafreq1_  pafreq2_ X_minac11 X_minac21 
##    137441    144669    127992    236052    124297    231178    129534    133037 
##  strfreq_  pamiss1_  pamin11_  pamin21_   pa1min_  pavig11_  pavig21_  pa1vigm_ 
##      3964      1388    144761    151037    127494    140690    147741         0 
##  X_pacat1 X_paindx1 X_pa150r2 X_pa300r2 X_pa30021 X_pastrng  X_parec1 X_pastae1 
##     40046     31447     31446     40046     40046      3965     33782     33784 
## X_lmtact1 X_lmtwrk1 X_lmtscl1 X_rfseat2 X_rfseat3 X_flshot6 X_pneumo2 X_aidtst3 
##      3967      6505      3804      1228      1227    288065    293707     13341 
##   X_age80 
##         6

str(NAreport)

##  Named int [1:329] 0 0 0 0 0 0 0 0 8 105539 ...
##  - attr(*, "names")= chr [1:329] "X_state" "fmonth" "idate" "imonth" ...

summary(NAreport)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0    2441  116727  174883  379029  433220

COMMENTARY

The variables with many NAs seem to be the result of:

Conditional surveying
Sensible/problematic subject questions
Surveying errors
Unfinished interviews

Any considered variable should be investigated carefully to avoid problems in areas of R where NAs are not well treated.

The strategy adopted in this work is:

Preferring composite/calculated variables
Creating a separated dataframe with the variables of interest
Verify if removing the NAs would significantly decerase the number of observations.

Part 2: Research questions

Research question 1:

How the height is distributed on this population? What happens when you consider Gender?

It is a classic statistical inference. Does this sample follow the normal curve? Its statistics are consistent with the ones known about the population.

Research question 2:

What can be said about the population who smokes?

The decrease of the smoker’s proportion of the general population is good news. But what can we learn from this sample about them? How it correlates with other health aspects?

Research question 3:

How Education impacts another aspects of life, including health?

Education affects all aspects of life. How it does impact with health habits? At what amount it correlates with income, for example?

Research question 4:

A study on the correlations of depression.

Here we have a simple question: if you have depression or not. If you have the disorder, what kind of person you are more likely to be?

Part 3: Exploratory data analysis

Research question 1:

How the height is distributed on this population? What happens when you consider Gender?

For this question I will use the variable htin4, described “Reported height in inches”. Tried before using other which used foot and inches. It did not computed statistics well, since 1 foot = 12 inches. After some tries, I settled for number of inches and transformed to metric.

Lets create a new variable with the metric values:

brfss2013_TRIM <- brfss2013_TRIM %>% 
  mutate(Metric_Height  = round(conv_unit(htin4,"inch","m"), digits = 2))
summary(brfss2013_TRIM$Metric_Height)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.050   1.630   1.680   1.693   1.780 172.820    5860

Since the median and the mean are very close, this distribution looks not much skewed and close to the normal. But there are some unreasonably extreme values. Considering this distribution resembles the normal curve enough, I decided to restrain into the sd range for the overall distribution, and plot it using the 2 SD range.

brfss_mean <- mean(brfss2013_TRIM$Metric_Height, na.rm = TRUE)
brfss_sd   <- sd(brfss2013_TRIM$Metric_Height, na.rm = TRUE)
brfss2013_TRIM %>%
ggplot() + 
  geom_bar(aes(x = Metric_Height))  +
  xlim(brfss_mean-(2*brfss_sd),brfss_mean+(2*brfss_sd))

## Warning: Removed 5871 rows containing non-finite values (stat_count).

Now, lets separate the population by gender. Each one will have its own SD and Mean. Lets calculate it.

M_mean <- brfss2013_TRIM %>%
  filter(sex == "Male") %>%
  summarise("Mean for Males" = mean(Metric_Height, na.rm = TRUE))

F_mean <- brfss2013_TRIM %>%
  filter(sex == "Female") %>%
  summarise("Mean for Females" = mean(Metric_Height, na.rm = TRUE))  
  
M_sd <- brfss2013_TRIM %>%
  filter(sex == "Male") %>%
  summarise("SD for Males" = sd(Metric_Height, na.rm = TRUE))

F_sd <- brfss2013_TRIM %>%
  filter(sex == "Female") %>%
  summarise("SD for Females" = sd(Metric_Height, na.rm = TRUE))

paste("Stats for Male distribution of height")

## [1] "Stats for Male distribution of height"

M_sd

##   SD for Males
## 1   0.07698236

brfss2013_TRIM %>%
  filter(sex == "Male") %>% 
  select(Metric_Height) %>% 
  summary()

##  Metric_Height  
##  Min.   :0.050  
##  1st Qu.:1.730  
##  Median :1.780  
##  Mean   :1.781  
##  3rd Qu.:1.830  
##  Max.   :2.360  
##  NA's   :1898

paste("Stats for Female distribution of height")

## [1] "Stats for Female distribution of height"

F_sd

##   SD for Females
## 1      0.4624587

brfss2013_TRIM %>%
  filter(sex == "Female") %>% 
  select(Metric_Height) %>% 
  summary()

##  Metric_Height    
##  Min.   :  0.910  
##  1st Qu.:  1.570  
##  Median :  1.630  
##  Mean   :  1.633  
##  3rd Qu.:  1.680  
##  Max.   :172.820  
##  NA's   :3962

Something is wrong, the SD of the two populations are too different. Looking at summary of the female distribution, we see an unreasonable max value, suggesting that there are too much extreme values interfering with the SD, a statistic particularly sensible to these values. The strategy:

Identify the values above the male max values
Turn them to NAs.

brfss2013_TRIM$Metric_Height <- replace(brfss2013_TRIM$Metric_Height,brfss2013_TRIM$Metric_Height > 2.360,NA)

F_mean <- brfss2013_TRIM %>%
  filter(sex == "Female") %>%
  summarise("Mean for Females" = mean(Metric_Height, na.rm = TRUE))  
F_mean <- brfss2013_TRIM %>%
  filter(sex == "Female") %>%
  summarise("Mean for Females" = mean(Metric_Height, na.rm = TRUE))  

paste("Stats for Female distribution of height")

## [1] "Stats for Female distribution of height"

paste("Standard Deviation")

## [1] "Standard Deviation"

F_sd

##   SD for Females
## 1      0.4624587

brfss2013_TRIM %>%
  filter(sex == "Female") %>% 
  select(Metric_Height) %>% 
  summary()

##  Metric_Height  
##  Min.   :0.910  
##  1st Qu.:1.570  
##  Median :1.630  
##  Mean   :1.632  
##  3rd Qu.:1.680  
##  Max.   :2.260  
##  NA's   :3964

F_sd <- as.numeric(F_sd); M_sd <- as.numeric(M_sd); F_mean <- as.numeric(F_mean); M_mean <- as.numeric(M_mean)

Still a large disparity between the SDs. Globally, the height ratio between genders is 1.07, meaning that on average, men are about 7% taller than women. Across the world, this relative difference between the sexes can vary from only 2-3% to over 12%. Why this scenario is not reflected on the BRFSS survey can be subject for further research.

Proceeding with the plot, the following piece of code is an example of the range used to plot.

xlim(M_mean-(2*M_sd),M_mean+(2*M_sd))

brfss2013_TRIM %>%
  filter(sex == "Male") %>%
  ggplot() + 
  geom_bar(aes(x = Metric_Height))  +
  xlim(M_mean-(2*M_sd),M_mean+(2*M_sd)) +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1), axis.title.x=element_blank()) + 
  labs(title="Male Height Frequency Distrib")

## Warning: Removed 7442 rows containing non-finite values (stat_count).

## Warning: Removed 2 rows containing missing values (geom_bar).

brfss2013_TRIM %>%
  filter(sex == "Female") %>%
  ggplot() + 
  geom_bar(aes(x = Metric_Height))  +
  xlim(F_mean-(2*F_sd),F_mean+(2*F_sd)) +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1), axis.title.x=element_blank()) + 
  labs(title="Female Height Frequency Distrib")

## Warning: Removed 3964 rows containing non-finite values (stat_count).

ggplot(na.omit(brfss2013_TRIM[,c("sex","Metric_Height")]), aes(x = sex, y = Metric_Height)) +
 geom_boxplot(notch = TRUE, fill = "lightgray")+
  stat_summary(fun.y = mean, geom = "point", shape = 18, size = 2.5, color = "#FC4E07")

## Warning: `fun.y` is deprecated. Use `fun` instead.

The distributions are very similar, as we can see in the last boxplots.

Research quesion 2:

What can be said about the population who smokes?

First of all, lets adjust the factors. It will make the graphics more manageable. Then, we trim further, now with variables of interest and we are ready to plot the absolute numbers.

brfss2013_TRIM$X_smoker3 <- as.character(brfss2013_TRIM$X_smoker3)

brfss2013_TRIM$X_smoker3[brfss2013_TRIM$X_smoker3 == "Current smoker - now smokes every day"] <- c("Smokes Daily")
brfss2013_TRIM$X_smoker3[brfss2013_TRIM$X_smoker3 == "Current smoker - now smokes some days"] <- c("Smokes occasionally")

brfss2013_TRIM$X_smoker3 <- as.factor(brfss2013_TRIM$X_smoker3)

Interest_Smoking <- na.omit(brfss2013_TRIM[,c("X_smoker3","sex","X_age_g","genhlth","X_bmi5cat","X_educag","X_incomg","X_rfdrhv4")])



Interest_Smoking %>%
  ggplot() + 
  geom_bar(aes(x = X_smoker3)) +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1), axis.title.x=element_blank()) + 
  labs(title="Smoking categories: Absolute numbers")

Fortunately we have a significant number of non smokers or people who have never smoked. Anyway, The CDC says almost 38 million americans still smoke.

The removal of the NAs left us with more than 353k observations, who have completed the survey. Good enough.

Smoking: Proportions by Gender

ggplot(Interest_Smoking, aes(X_smoker3, group = sex)) + 
  geom_bar(aes(y = ..prop.., fill = factor(..x..)), stat="count",show.legend = FALSE) + 
  scale_y_continuous(labels=scales::percent) +
  ylab("relative frequencies") +
  facet_grid(~sex) +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1), axis.title.x=element_blank()) + 
  labs(title="Gender Vs. Smoking Habits")

ggplot(count(Interest_Smoking,sex,X_smoker3), aes(fill=X_smoker3, y=n, x=sex)) + 
  geom_bar(position="fill", stat="identity") +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1), legend.title=element_blank(), axis.title.x=element_blank()) +
  labs(title="Gender Vs. Smoking Habits: Stacked", y = "Proportion")

COMMENTARY

Similar amounts of smokers among the genders, but more male former smokers

Smoking: Proportions by age

ggplot(Interest_Smoking, aes(X_smoker3, group = X_age_g)) + 
  geom_bar(aes(y = ..prop.., fill = factor(..x..)), stat="count",show.legend = FALSE) + 
  scale_y_continuous(labels=scales::percent) +
  ylab("relative frequencies") +
  facet_grid(~X_age_g) +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1), axis.title.x=element_blank()) + 
  labs(title="Age Vs. Smoking Habits")

ggplot(count(Interest_Smoking,X_age_g,X_smoker3), aes(fill=X_smoker3, y=n, x=X_age_g)) + 
  geom_bar(position="fill", stat="identity") +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1), legend.title=element_blank(), axis.title.x=element_blank()) +
  labs(title="Age Vs. Smoking Habits: Stacked",y = "Proportion")

COMMENTARY

Irregular frequency of smokers in general, and the expected trend of increasing number of former smokers as age progresses. A large amount of youth who have never smoked.

Smoking: Proportions by General Health

ggplot(Interest_Smoking, aes(X_smoker3, group = genhlth)) + 
  geom_bar(aes(y = ..prop.., fill = factor(..x..)), stat="count",show.legend = FALSE) + 
  scale_y_continuous(labels=scales::percent) +
  ylab("relative frequencies") +
  facet_grid(~genhlth) +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1), axis.title.x=element_blank()) + 
  labs(title="General Health Vs. Smoking Habits")

ggplot(count(Interest_Smoking,genhlth,X_smoker3), aes(fill=X_smoker3, y=n, x=genhlth)) + 
  geom_bar(position="fill", stat="identity") +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1), legend.title=element_blank(), axis.title.x=element_blank()) +
  labs(title="General Health Vs. Smoking Habits: Stacked", y = "Proportion")

COMMENTARY

A clear and expected trend correlating smoking with poorness of health.

Smoking: Proportions by Education Level

ggplot(Interest_Smoking, aes(X_smoker3, group = X_educag)) + 
  geom_bar(aes(y = ..prop.., fill = factor(..x..)), stat="count",show.legend = FALSE) + 
  scale_y_continuous(labels=scales::percent) +
  ylab("relative frequencies") +
  facet_grid(~X_educag) +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1), axis.title.x=element_blank()) + 
  labs(title="Education Level Vs. Smoking Habits")

ggplot(count(Interest_Smoking,X_educag,X_smoker3), aes(fill=X_smoker3, y=n, x=X_educag)) + 
  geom_bar(position="fill", stat="identity") +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1), legend.title=element_blank(), axis.title.x=element_blank()) + 
  labs(title="Education Level Vs. Smoking Habits: Stacked", y = "Proportion")

COMMENTARY

Another trend that may be somewhat recent: A more educated a person, the lesser likelihood he or she have never smoked, or smokes.

Smoking: Proportions by BMI

ggplot(Interest_Smoking, aes(X_smoker3, group = X_bmi5cat)) + 
  geom_bar(aes(y = ..prop.., fill = factor(..x..)), stat="count",show.legend = FALSE) + 
  scale_y_continuous(labels=scales::percent) +
  ylab("relative frequencies") +
  facet_grid(~X_bmi5cat) +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1), axis.title.x=element_blank()) + 
  labs(title="Body Mass Index Vs. Smoking Habits")

ggplot(count(Interest_Smoking,X_bmi5cat,X_smoker3), aes(fill=X_smoker3, y=n, x=X_bmi5cat)) + 
  geom_bar(position="fill", stat="identity") +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1), legend.title=element_blank(), axis.title.x=element_blank()) + 
  labs(title="Body Mass Index Vs. Smoking Habits: Stacked", y = "Proportion")

COMMENTARY

A slight negative correlation linking weight and smoking.

Smoking: Proportions by Income

ggplot(Interest_Smoking, aes(X_smoker3, group = X_incomg)) + 
  geom_bar(aes(y = ..prop.., fill = factor(..x..)), stat="count",show.legend = FALSE) + 
  scale_y_continuous(labels=scales::percent) +
  ylab("relative frequencies") +
  facet_grid(~X_incomg) +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1), axis.title.x=element_blank()) + 
  labs(title="Income Vs. Smoking Habits")

ggplot(count(Interest_Smoking,X_incomg,X_smoker3), aes(fill=X_smoker3, y=n, x=X_incomg)) + 
  geom_bar(position="fill", stat="identity") +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1), legend.title=element_blank(), axis.title.x=element_blank()) + 
  labs(title="Income Vs. Smoking Habits: Stacked", y = "Proportion")

COMMENTARY

A mild but consistent correlation: the more educated are more prone to forfeit smoking.

Smoking: Proportions by Heavy Drinking

ggplot(Interest_Smoking, aes(X_smoker3, group = X_rfdrhv4)) + 
  geom_bar(aes(y = ..prop.., fill = factor(..x..)), stat="count",show.legend = FALSE) + 
  scale_y_continuous(labels=scales::percent) +
  ylab("relative frequencies") +
  facet_grid(~X_rfdrhv4) +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1), axis.title.x=element_blank()) + 
  labs(title="Heavy Drinking Vs. Smoking Habits")

ggplot(count(Interest_Smoking,X_rfdrhv4,X_smoker3), aes(fill=X_smoker3, y=n, x=X_rfdrhv4)) + 
  geom_bar(position="fill", stat="identity") +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1), legend.title=element_blank(), axis.title.x=element_blank()) + 
  labs(title="Heavy Drinking Vs. Smoking Habits: Stacked", y = "Proportion")

COMMENTARY

A clear indicator of self-destructive behavior.

Research question 3:

How Education impacts another aspects of life, including health?

Again, more adjustments to increase the quality of the graphs. Zooming in on the variables of interest and plotting the absolute values.

brfss2013_TRIM$X_educag <- as.character(brfss2013_TRIM$X_educag)

brfss2013_TRIM$X_educag[brfss2013_TRIM$X_educag == "Attended college or technical school"] <- c("Atten. College/Tech")
brfss2013_TRIM$X_educag[brfss2013_TRIM$X_educag == "Did not graduate high school"] <- c("HS Dropout")
brfss2013_TRIM$X_educag[brfss2013_TRIM$X_educag == "Graduated from college or technical school"] <- c("Grad. College/Tech")
brfss2013_TRIM$X_educag[brfss2013_TRIM$X_educag == "Graduated high school"] <- c("Grad. HS")

brfss2013_TRIM$X_educag <- as.factor(brfss2013_TRIM$X_educag)


Interest_Education <- na.omit(brfss2013_TRIM[,c("X_educag","sex","X_age_g","genhlth","X_bmi5cat","X_incomg","X_rfdrhv4")])

# The absolute numbers

Interest_Education %>%
  ggplot() + 
  geom_bar(aes(x = X_educag)) +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1), axis.title.x=element_blank()) + 
  labs(title="Education categories: Absolute numbers")

The removal of the NAs left us with more than 353k observations, who have completed the survey. Good enough.

The education profile shown here resembles the US population.

Education: Proportions by Gender

# Proportions by Gender

ggplot(Interest_Education, aes(X_educag, group = sex)) + 
  geom_bar(aes(y = ..prop.., fill = factor(..x..)), stat="count",show.legend = FALSE) + 
  scale_y_continuous(labels=scales::percent) +
  ylab("relative frequencies") +
  facet_grid(~sex) +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1), axis.title.x=element_blank()) + 
  labs(title="Gender Vs. Education")

ggplot(count(Interest_Education,sex,X_educag), aes(fill=X_educag, y=n, x=sex)) + 
  geom_bar(position="fill", stat="identity") +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1), legend.title=element_blank(), axis.title.x=element_blank()) +
  labs(title="Gender Vs. Education: Stacked", y = "Proportion")

COMMENTARY

Very similar gender profiles, indicating almost ideal gender equality on education.

Education: Proportions by Age

ggplot(Interest_Education, aes(X_educag, group = X_age_g)) + 
  geom_bar(aes(y = ..prop.., fill = factor(..x..)), stat="count",show.legend = FALSE) + 
  scale_y_continuous(labels=scales::percent) +
  ylab("relative frequencies") +
  facet_grid(~X_age_g) +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1), axis.title.x=element_blank()) + 
  labs(title="Age Vs. Education")

ggplot(count(Interest_Education,X_age_g,X_educag), aes(fill=X_educag, y=n, x=X_age_g)) + 
  geom_bar(position="fill", stat="identity") +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1), legend.title=element_blank(), axis.title.x=element_blank()) +
  labs(title="Age Vs. Education: Stacked",y = "Proportion")

COMMENTARY

The tendency here shows that the likelihood of a grad/tech diploma somewhat decrease with age, indicating an increased proportion of the population graduated on grad/tech over time.

Education: Proportions by General Health

ggplot(Interest_Education, aes(X_educag, group = genhlth)) + 
  geom_bar(aes(y = ..prop.., fill = factor(..x..)), stat="count",show.legend = FALSE) + 
  scale_y_continuous(labels=scales::percent) +
  ylab("relative frequencies") +
  facet_grid(~genhlth) +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1), axis.title.x=element_blank()) + 
  labs(title="General Health Vs. Education")

ggplot(count(Interest_Education,genhlth,X_educag), aes(fill=X_educag, y=n, x=genhlth)) + 
  geom_bar(position="fill", stat="identity") +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1), legend.title=element_blank(), axis.title.x=element_blank()) +
  labs(title="General Health Vs. Education: Stacked", y = "Proportion")

COMMENTARY

The more educated tend to have larger incomes, consequently better health care. The correlation shown is particularly strong.

Education: Proportions by BMI

# Proportions by BMI

ggplot(Interest_Education, aes(X_educag, group = X_bmi5cat)) + 
  geom_bar(aes(y = ..prop.., fill = factor(..x..)), stat="count",show.legend = FALSE) + 
  scale_y_continuous(labels=scales::percent) +
  ylab("relative frequencies") +
  facet_grid(~X_bmi5cat) +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1), axis.title.x=element_blank()) + 
  labs(title="Body Mass Index Vs. Education")

ggplot(count(Interest_Education,X_bmi5cat,X_educag), aes(fill=X_educag, y=n, x=X_bmi5cat)) + 
  geom_bar(position="fill", stat="identity") +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1), legend.title=element_blank(), axis.title.x=element_blank()) + 
  labs(title="Body Mass Index Vs. Education: Stacked", y = "Proportion")

COMMENTARY

A somewhat surprising positive correlation between weight and college education, albeit very mild.

Education: Proportions by Income

ggplot(Interest_Education, aes(X_educag, group = X_incomg)) + 
  geom_bar(aes(y = ..prop.., fill = factor(..x..)), stat="count",show.legend = FALSE) + 
  scale_y_continuous(labels=scales::percent) +
  ylab("relative frequencies") +
  facet_grid(~X_incomg) +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1), axis.title.x=element_blank()) + 
  labs(title="Income Vs. Education")

ggplot(count(Interest_Education,X_incomg,X_educag), aes(fill=X_educag, y=n, x=X_incomg)) + 
  geom_bar(position="fill", stat="identity") +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1), legend.title=element_blank(), axis.title.x=element_blank()) + 
  labs(title="Income Vs. Education: Stacked", y = "Proportion")

COMMENTARY

An expected strong linking about education and income, despite the student loan crisis in the US.

Education: Proportions by Heavy Drinking

ggplot(Interest_Education, aes(X_educag, group = X_rfdrhv4)) + 
  geom_bar(aes(y = ..prop.., fill = factor(..x..)), stat="count",show.legend = FALSE) + 
  scale_y_continuous(labels=scales::percent) +
  ylab("relative frequencies") +
  facet_grid(~X_rfdrhv4) +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1), axis.title.x=element_blank()) + 
  labs(title="Heavy Drinking Vs. Education")

ggplot(count(Interest_Education,X_rfdrhv4,X_educag), aes(fill=X_educag, y=n, x=X_rfdrhv4)) + 
  geom_bar(position="fill", stat="identity") +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1), legend.title=element_blank(), axis.title.x=element_blank()) + 
  labs(title="Heavy Drinking Vs. Education: Stacked", y = "Proportion")

COMMENTARY

Probably alcohol has a diversified audience among almost all walks of life. No good correlation here.

Research quesion 4:

A study on the correlations of depression.

First, we narrow the scope of this research. Then, we will plot the absolute numers.

Interest_Depression <- na.omit(brfss2013_TRIM[,c("addepev2","addepev2","sex","X_age_g","genhlth","X_bmi5cat","X_educag","X_incomg","X_rfdrhv4","X_smoker3")])


Interest_Depression %>%
  ggplot() + 
  geom_bar(aes(x = addepev2)) +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1), axis.title.x=element_blank()) + 
  labs(title="Depression: Absolute numbers")

COMMENTARY

A significant amount of people suffering of depression in the general population, showing an actual health crisis.

The removal of the NAs left us with more than 353k observations, who have completed the survey. Good enough.

Depression: Proportions by Heavy Gender

ggplot(Interest_Depression, aes(addepev2, group = sex)) + 
  geom_bar(aes(y = ..prop.., fill = factor(..x..)), stat="count",show.legend = FALSE) + 
  scale_y_continuous(labels=scales::percent) +
  ylab("relative frequencies") +
  facet_grid(~sex) +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1), axis.title.x=element_blank()) + 
  labs(title="Gender Vs. Depression")

ggplot(count(Interest_Depression,sex,addepev2), aes(fill=addepev2, y=n, x=sex)) + 
  geom_bar(position="fill", stat="identity") +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1), legend.title=element_blank(), axis.title.x=element_blank()) +
  labs(title="Gender Vs. Depression: Stacked", y = "Proportion")

COMMENTARY

A significant amount of the female population suffering from depression.

Depression: Proportions by Age

ggplot(Interest_Depression, aes(addepev2, group = X_age_g)) + 
  geom_bar(aes(y = ..prop.., fill = factor(..x..)), stat="count",show.legend = FALSE) + 
  scale_y_continuous(labels=scales::percent) +
  ylab("relative frequencies") +
  facet_grid(~X_age_g) +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1), axis.title.x=element_blank()) + 
  labs(title="Age Vs. Depression")

ggplot(count(Interest_Depression,X_age_g,addepev2), aes(fill=addepev2, y=n, x=X_age_g)) + 
  geom_bar(position="fill", stat="identity") +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1), legend.title=element_blank(), axis.title.x=element_blank()) +
  labs(title="Age Vs. Depression: Stacked",y = "Proportion")

COMMENTARY

The probability of acquiring depression increases with age until it falls suddenly on the +65 demographic. The reasons deserve to be thoroughly investigated.

Depression: Proportions by General Health

ggplot(Interest_Depression, aes(addepev2, group = genhlth)) + 
  geom_bar(aes(y = ..prop.., fill = factor(..x..)), stat="count",show.legend = FALSE) + 
  scale_y_continuous(labels=scales::percent) +
  ylab("relative frequencies") +
  facet_grid(~genhlth) +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1), axis.title.x=element_blank()) + 
  labs(title="General Health Vs. Depression")

ggplot(count(Interest_Depression,genhlth,addepev2), aes(fill=addepev2, y=n, x=genhlth)) + 
  geom_bar(position="fill", stat="identity") +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1), legend.title=element_blank(), axis.title.x=element_blank()) +
  labs(title="General Health Vs. Depression: Stacked", y = "Proportion")

COMMENTARY

The expected link with general health and mental health. Although not absolute, it is strong.

Depression: Proportions by Education Level

ggplot(Interest_Depression, aes(addepev2, group = X_educag)) + 
  geom_bar(aes(y = ..prop.., fill = factor(..x..)), stat="count",show.legend = FALSE) + 
  scale_y_continuous(labels=scales::percent) +
  ylab("relative frequencies") +
  facet_grid(~X_educag) +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1), axis.title.x=element_blank()) + 
  labs(title="Education Level Vs. Depression")

ggplot(count(Interest_Depression,X_educag,addepev2), aes(fill=addepev2, y=n, x=X_educag)) + 
  geom_bar(position="fill", stat="identity") +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1), legend.title=element_blank(), axis.title.x=element_blank()) + 
  labs(title="Education Level Vs. Depression: Stacked", y = "Proportion")

COMMENTARY

A not clear correlation here. The motives and triggers may vary, though.

Depression: Proportions by BMI

ggplot(Interest_Depression, aes(addepev2, group = X_bmi5cat)) + 
  geom_bar(aes(y = ..prop.., fill = factor(..x..)), stat="count",show.legend = FALSE) + 
  scale_y_continuous(labels=scales::percent) +
  ylab("relative frequencies") +
  facet_grid(~X_bmi5cat) +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1), axis.title.x=element_blank()) + 
  labs(title="Body Mass Index Vs. Depression")

ggplot(count(Interest_Depression,X_bmi5cat,addepev2), aes(fill=addepev2, y=n, x=X_bmi5cat)) + 
  geom_bar(position="fill", stat="identity") +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1), legend.title=element_blank(), axis.title.x=element_blank()) + 
  labs(title="Body Mass Index Vs. Depression: Stacked", y = "Proportion")

COMMENTARY

Another not very clear correlation until the obese demographic is considered, when the proportion of depressed people increases significantly.

Depression: Proportions by Income

ggplot(Interest_Depression, aes(addepev2, group = X_incomg)) + 
  geom_bar(aes(y = ..prop.., fill = factor(..x..)), stat="count",show.legend = FALSE) + 
  scale_y_continuous(labels=scales::percent) +
  ylab("relative frequencies") +
  facet_grid(~X_incomg) +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1), axis.title.x=element_blank()) + 
  labs(title="Income Vs. Depression")

ggplot(count(Interest_Depression,X_incomg,addepev2), aes(fill=addepev2, y=n, x=X_incomg)) + 
  geom_bar(position="fill", stat="identity") +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1), legend.title=element_blank(), axis.title.x=element_blank()) + 
  labs(title="Income Vs. Depression: Stacked", y = "Proportion")

COMMENTARY

Yes, the old saying “money brings happiness” can be more statistically and accurately rephrased as “happiness seems to be positively correlated with income”.

Depression: Proportions by Heavy Drinking

ggplot(Interest_Depression, aes(addepev2, group = X_rfdrhv4)) + 
  geom_bar(aes(y = ..prop.., fill = factor(..x..)), stat="count",show.legend = FALSE) + 
  scale_y_continuous(labels=scales::percent) +
  ylab("relative frequencies") +
  facet_grid(~X_rfdrhv4) +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1), axis.title.x=element_blank()) + 
  labs(title="Heavy Drinking Vs. Depression")

ggplot(count(Interest_Depression,X_rfdrhv4,addepev2), aes(fill=addepev2, y=n, x=X_rfdrhv4)) + 
  geom_bar(position="fill", stat="identity") +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1), legend.title=element_blank(), axis.title.x=element_blank()) + 
  labs(title="Heavy Drinking Vs. Depression: Stacked", y = "Proportion")

COMMENTARY

No clear link here. Probably you will find an excuse to drink regardless you are happy or sad.