Setup

Load packages

library(ggplot2) # for nice and easy plotting
library(dplyr) # for data manipulation
library(tidyr) # for dataframes realignments
library(knitr) # for printing nicer tables

Load data

load("brfss2013.RData")

Part 1: Data

The data for this study comes from The Behavioral Risk Factor Surveillance System (BRFSS), which collects information from all of the states in USA and participating US territories via monthly telephone surveys from randomly selected adults residing in the USA employing standardized core questionnaire, optional modules, and state-added questions. From the details provided on the official pages of the project (http://www.cdc.gov/brfss/) I assume that the data are collected randomly accros the adult population of the USA, e.g. the data follow normal distribution for each of the questioned preventive health practices and risk behaviors.

The data are quite a big dataset, with many rows and many collumns:

dim(brfss2013)
## [1] 491775    330

Therefore I trim it to somewhat tidier working dataset. It contains only the data from year 2013 and only the variables necessary for addressing my research questions (see them below):

#create working dataset - only 2013 and only target variables
a <- filter(brfss2013, iyear == 2013) %>%
      select(menthlth, X_chldcnt, sex, X_drnkdy4, X_age_g, X_bmi5cat, frutda1_, beanday_, grenday_, orngday_, vegeda1_)

#rename variables to something more understandable
names(a) <- c("mental_health", "num_children", "sex", "drinks", "age", "BMI", "fruit", "beans", "green_vegs", "orange_vegs", "other_vegs")
names(a)
##  [1] "mental_health" "num_children"  "sex"           "drinks"       
##  [5] "age"           "BMI"           "fruit"         "beans"        
##  [9] "green_vegs"    "orange_vegs"   "other_vegs"

Part 2: Research questions

Research quesion 1:

Is there difference in bad mental health days based on number of children in household between men and women?

Does having more children prevent or encourage feelings of bad mental health? Does women feel it in different way than men? Being mother of two I find this question rather interesting. I utilize the data of adult men and women and of children present in household together with reported number of not good mental health days per last 30 days.

Research quesion 2:

How is the BMI distributed in age categories? Which age category has the highest prevalence of the highest BMI?

Are older people more obese? Or is there more young adults with such a problem? Every day most of us can see some overweighted and obese people, and depending on social circles one can get impression that there are more overweighted olders or youngsters or some other age category. What does these census data can say to it?

Research quesion 3:

Does people who drink more also eat more fruits and veggies, e.g. do they have more hedonistic tendencies?

Does drinkers in this dataset spend less time and money on eating fruits and vegetables? Does no-drinkers eat more fruits and vegetables, suggesting overally healthier life style? Or are the drinkers in provided dataset just more hedonistic than no-drinkers, as they eat similar or more amounts of fruits and vegetables?


Part 3: Exploratory data analysis

Research quesion 1:

Is there difference in bad mental health days based on number of children in household between men and women?

I chose to omit people who don’t report any bad mental health day. I want to look at people who do experience some mental struggle and the overwhelming majority of people without mental health bad days would screw the numbers.

#create dataset for first question
a1 <- select(a, mental_health, num_children, sex) %>%
      filter(!is.na(mental_health) & !is.na(num_children) & !is.na(sex)) %>% 
      filter(!mental_health==0)

#summary table
a1_sum <- group_by(a1, num_children, sex) %>%
          summarise(Mean = mean(mental_health), Median = median(mental_health), SD = sd(mental_health))
kable(a1_sum, digits = 2, align = "c", 
      caption="Tab. 1 Number of bad health days per 30 days for men and women based on number of children in household",
      col.names=c("Number of children", "Sex", "Mean", "Median", "SD"))
Tab. 1 Number of bad health days per 30 days for men and women based on number of children in household
Number of children Sex Mean Median SD
No children in household Male 11.30 6 10.70
No children in household Female 11.24 6 10.54
One child in household Male 10.05 5 10.14
One child in household Female 10.96 5 10.37
Two children in household Male 9.46 5 9.88
Two children in household Female 10.13 5 10.01
Three children in household Male 9.54 5 9.90
Three children in household Female 10.40 5 10.12
Four children in household Male 9.58 5 10.02
Four children in household Female 10.54 5 10.25
Five or more children in household Male 11.49 6 10.69
Five or more children in household Female 11.49 6 10.64

There is no difference between men and women in how good or bad their mental health was. Also the number of bad days doesn’t differ regardless of number of children present. Thus the bad mental health has different sources of dispair than having or not having children.

Research quesion 2:

How is the BMI distributed in age categories? Which age category has the highest prevalence of the highest BMI?

#create dataset for second question
a2 <- select(a, age, BMI) %>%
      filter(!is.na(age) & !is.na(BMI))

#table of data
table(a2)
##                  BMI
## age               Underweight Normal weight Overweight Obese
##   Age 18 to 24           1119         13387       6617  4013
##   Age 25 to 34            899         17738      14967 12054
##   Age 35 to 44            710         17916      19483 17738
##   Age 45 to 54           1021         23362      28298 26062
##   Age 55 to 64           1287         29497      37352 33715
##   Age 65 or older        3117         51156      58493 39709

Let’s see some summary statistics for this partial dataset. In absolute numbers age category with the highest BMI are the seniors older than 64 years. However the numbers are not equal for each age category, therefore proportional distribution in each age class should be more informative.

#plot
ggplot(a2, aes(x=age, fill=BMI)) + 
  geom_bar(position = "fill") +
  labs(title = "Fig. 2 BMI-class distribution in six age categories of Americans in 2013",
       x = "Age category",
       y = "BMI-class distribution",
       fill="BMI class")

And in form of a table:

#percentage table
a2_sum <- as.data.frame(table(a2))
a2_sum <- spread(a2_sum, BMI, Freq)
names(a2_sum)[3] <- "Normal_weight"

a2_sum2 <- group_by(a2_sum, age) %>%
          transmute(No_of_people = sum(c(Underweight, Normal_weight, Overweight, Obese)),
                    Perc_under = Underweight*100/No_of_people,
                    Perc_norm = Normal_weight*100/No_of_people,
                    Perc_over = Overweight*100/No_of_people,
                    Perc_obes = Obese*100/No_of_people) %>%
          select(-No_of_people)

kable(a2_sum2, digits=1, align="c",
      caption="Tab. 2 BMI-class distribution (%) in six age categories of Americans in 2013",
      col.names=c("Age", "Underweight (%)", "Normal weight (%)", "Overweight (%)", "Obese (%)"))
Tab. 2 BMI-class distribution (%) in six age categories of Americans in 2013
Age Underweight (%) Normal weight (%) Overweight (%) Obese (%)
Age 18 to 24 4.5 53.3 26.3 16.0
Age 25 to 34 2.0 38.8 32.8 26.4
Age 35 to 44 1.3 32.1 34.9 31.8
Age 45 to 54 1.3 29.7 35.9 33.1
Age 55 to 64 1.3 29.0 36.7 33.1
Age 65 or older 2.0 33.6 38.4 26.0
a2_sum3 <- gather(a2_sum2, key=BMI, value=BMI2, 2:5) %>%
          arrange(age)

One can see that with increasing age the proportion of normal weighted people is getting smaller in favor of overweighted and obese ones. The most obese age category of Americans in 2013 are people in age 45 to 64, with more than 25 % of them being obese. One could speculate that decreased obese proportions in the oldest age category may be connected with lower life length expectancies of obese people.

Research quesion 3:

Does people who drink more also eat more fruits and veggies, e.g. do they have more hedonistic tendencies?

#create dataset for third question
a3 <- select(a, drinks, fruit, beans, green_vegs, orange_vegs, other_vegs) %>%
      filter(!is.na(drinks) & !is.na(fruit) & !is.na(beans) & !is.na(green_vegs) & !is.na(orange_vegs) & !is.na(other_vegs)) %>%
      filter(drinks < 500, fruit < 500, beans < 500, green_vegs < 500, orange_vegs < 500, other_vegs < 500)

I decide to limit the scales of all variables to maximum of 500. Higher values are too sparse (can be classified as outliers) and probably not correct anyway (who drinks 7000 drinks per 30 days?!).

#plot
# I create multilayer ggplot with custom legend - this requires stating each line as an unique geom and setting labels (labs()) and legend (scale_colour_manual()) manually

ggplot() +
  geom_smooth(data=a3, aes(x=drinks, y=fruit, color = "purple")) +
  geom_smooth(data=a3, aes(x=drinks, y=beans, color = "brown")) +
  geom_smooth(data=a3, aes(x=drinks, y=green_vegs, color = "darkgreen")) +
  geom_smooth(data=a3, aes(x=drinks, y=orange_vegs, color = "orange")) +
  geom_smooth(data=a3, aes(x=drinks, y=other_vegs, color = "chartreuse2")) +
  scale_colour_manual(name = "The edibles", 
         values=c("purple" = "purple", "brown" = "brown", "darkgreen" = "darkgreen", "orange" = "orange", "chartreuse2" = "chartreuse2"), 
         labels = c("beans", "dark green vegs", "other vegs", "orange vegs", "fruits")) +
  labs(title="Fig. 3 Consumptions of fruits and veggies in relation to drinking",
       x = "Number of drinks per month",
       y = "Number of fruits or veggies per month")

There seems to be no relation between drinking and consumption od fruits and veggies for adult Americans in 2013. One would expect that drinkers (more than 300 drinks per 30 days) would eat far less fruits and veggies than the rest of population, but this plot indicates no trend whatsoever (maybe slight decrease in fruits intake). That suggests that indeed the drinkers have more hedonistic tendencies as they are able to consume quite a lot of drinks together with their non-decreased day ratio of fruits and vegetables.