Objective

This document serves the purpose of a final evaluation of the 5 week Introduction to Probability and Data course by Duke University. The data of interest is Behavioral Risk Factor Surveillance System (BRFSS) dataset. It can be downloaded from here:


Part 1: Data

The Behavioral Risk Factor Surveillance System (BRFSS) is a collaborative project between all of the states in the United States (US) and participating US territories and the Centers for Disease Control and Prevention (CDC).

More detailed information of the data can be found here:

More detailed information of the dataset can be found here:


Required packages

library(ggplot2)
library(dplyr)

Load data

load("brfss2013.RData")

First we’ll select and extract the variables we want to work with and clean them from NA’s

# make a small data frame with the variables of interest
t1 <- select(brfss2013, sex, income2, seatbelt, educa, scntwrk1, employ1, lsatisfy, X_bmi5) %>%
      filter(!is.na(sex), !is.na(income2), !is.na(seatbelt), !is.na(educa), !is.na(scntwrk1), !is.na(lsatisfy),!is.na(X_bmi5),!is.na(employ1))

Part 2: Research questions

Research quesion 1:
What is the body mass index between different incomes of men and women? Does it change when people have higher income, consequently they can afford a better lifestyle and nutrition? Does employment status reflect the BMI between genders?

Research quesion 2:
What type of education brings you what amount of salary? What kind of level of education have men vs women?And are Men or Women better educated? Does education reflects putting a seatbelt on?

Research quesion 3:
What is the overall satisfaction level in different education levels among genders?


Part 3: Exploratory data analysis

Research quesion 1:

# Let's first see the median BMI across gender
t1 %>%
      group_by(sex) %>%
      summarise(medBMI = median(X_bmi5))
## # A tibble: 2 x 2
##      sex medBMI
##   <fctr>  <int>
## 1   Male   2889
## 2 Female   2694

We see that men have higher BMI than women when looking at their median.

### BODY MASS, INCOME, GENDER
ggplot(data = t1, aes(x = income2, y = X_bmi5, fill = sex)) +
      geom_boxplot() +
      scale_x_discrete(name = "Income",
                       labels = c("Less than $10,000" = "< $10,000", 
                                "Less than $15,000" = "< $15,000",
                                "Less than $20,000" = "< $20,000",
                                "Less than $25,000" = "< $25,000", 
                                "Less than $35,000" = "< $35,000",
                                "Less than $50,000" = "< $50,000", 
                                "Less than $75,000" = "< $75,000",
                                "$75,000 or more" = ">= $75,000")) +
      scale_y_continuous(name = "Body Mass Index") +
      ggtitle("Body Mass across income and gender") +
      theme(legend.position = "bottom",
      legend.background = element_rect(fill = "gray90", size = .5, linetype = "dotted"))

### BMI, emplyment status, gender
ggplot(data = t1, aes(x = employ1, y = X_bmi5, fill = sex)) +
      geom_boxplot() +
      labs(title = "Body Mass across Employment Status and gender",
           x = "Employment Status", y = "Body Mass Index") +
      theme(legend.position = "bottom",
      legend.background = element_rect(fill = "gray90", size = .5, linetype = "dotted"))

What we can observe here is that women who earn less than $25.000 but more then $10.000 tend to be with lower BMI than men. In any other income level women have BMI higher than men. When women are selfemplyed have lower BMI than the employed for wages women. With men is vice versa.

Research quesion 2:

# education, income
ggplot(t1, aes(income2)) + 
      geom_bar(aes(fill = educa), position = "dodge") +
      scale_x_discrete(name = "Income",
                       labels = c("Less than $10,000" = "< $10,000", 
                                  "Less than $15,000" = "< $15,000",
                                  "Less than $20,000" = "< $20,000",
                                  "Less than $25,000" = "< $25,000", 
                                  "Less than $35,000" = "< $35,000",
                                  "Less than $50,000" = "< $50,000", 
                                  "Less than $75,000" = "< $75,000",
                                  "$75,000 or more" = ">= $75,000")) +
      scale_y_continuous(name = "Count") +
      ggtitle("Education vs Income") +
      labs(fill = "Education") + 
      scale_fill_discrete(labels = c("Never attended school or only kindergarten" = "Never attended school",
                          "Grades 1 through 8 (Elementary)" = "Elementary",
                          "Grades 9 though 11 (Some high school)" = "Some high school",
                          "Grade 12 or GED (High school graduate)" = "High school graduate",
                          "College 1 year to 3 years (Some college or technical school)" = "Some college or technical school",
                          "College 4 years or more (College graduate)" = "College graduate")) +
      theme(legend.position = "bottom",
      legend.background = element_rect(fill = "gray90", size = .5, linetype = "dotted"))

# education and sex
ggplot(t1, aes(educa)) + 
      geom_bar(aes(fill = sex), position = "dodge") +
      labs(title = "Education among genders",
           x = "Education Level", y = "Count") +
      scale_x_discrete(labels = c("Never attended school or only kindergarten" = "Never attended school",
                                     "Grades 1 through 8 (Elementary)" = "Elementary",
                                     "Grades 9 though 11 (Some high school)" = "Some high school",
                                     "Grade 12 or GED (High school graduate)" = "High school graduate",
                                     "College 1 year to 3 years (Some college or technical school)" = "Some college or technical school",
                                     "College 4 years or more (College graduate)" = "College graduate")) +
      theme(legend.position = "bottom",
      legend.background = element_rect(fill = "gray90", size = .5, linetype = "dotted"))

# education and seatbelt
ggplot(t1, aes(educa)) + 
      geom_bar(aes(fill = seatbelt), position = "fill") + 
      facet_grid(.~sex) +
      labs(title = "Education / seatbelt",
           x = "Education Level", fill = "Seatbelt") +
      scale_x_discrete(labels = c("Never attended school or only kindergarten" = "Never attended school",
                                  "Grades 1 through 8 (Elementary)" = "Elementary",
                                  "Grades 9 though 11 (Some high school)" = "Some high school",
                                  "Grade 12 or GED (High school graduate)" = "High school graduate",
                                  "College 1 year to 3 years (Some college or technical school)" = "Some college or technical school",
                                  "College 4 years or more (College graduate)" = "College graduate")) +
      theme(legend.position = "bottom",
      legend.background = element_rect(fill = "gray90", size = .5, linetype = "dotted"))

Of course, the old concept “the more you learn, richer you’ll become” is also supported by this data, but what’s interesting here is that with a high school degree you can also receive up to $75.000, which level is dominant by the college graduates, who are women (shown by the second graph) and college graduated women are the only ones among all other educated levels women or men who never put a seatbealt on (shown by the third graph)? The question whether a person puts a seatbelt on was not stated as wheter the person puts a seatbelt on when driving, so we don’t know whether women drivers put their seatbelt on or not? But we do know that a small procent of college educated women don’t put a seatbelt on regardles of whether they are driving or not!

Research quesion 3:

# income and satisfaction
ggplot(t1, aes(income2)) + 
      geom_bar(aes(fill = lsatisfy), position = "dodge") + 
      facet_grid(.~sex) +
      scale_x_discrete(name = "Income",
                       labels = c("Less than $10,000" = "< $10,000", 
                                  "Less than $15,000" = "< $15,000",
                                  "Less than $20,000" = "< $20,000",
                                  "Less than $25,000" = "< $25,000", 
                                  "Less than $35,000" = "< $35,000",
                                  "Less than $50,000" = "< $50,000", 
                                  "Less than $75,000" = "< $75,000",
                                  "$75,000 or more" = ">= $75,000")) +
      labs(title = "Satisfaction and Income among genders",
           x = "Income", y = "Count", fill = "Satisfaction")+
      theme(legend.position = "bottom",
      legend.background = element_rect(fill = "gray90", size = .5, linetype = "dotted"))

What’s interesting in this graph is that women experience more often an overall dissatisfaction, regardless of their payment than men do.