library(ggplot2) # for nice and easy plotting
library(dplyr) # for data manipulation
library(tidyr) # for dataframes realignments
library(knitr) # for printing nicer tablesload("brfss2013.RData")The data for this study comes from The Behavioral Risk Factor Surveillance System (BRFSS), which collects information from all of the states in USA and participating US territories via monthly telephone surveys from randomly selected adults residing in the USA employing standardized core questionnaire, optional modules, and state-added questions. From the details provided on the official pages of the project (http://www.cdc.gov/brfss/) I assume that the data are collected randomly accros the adult population of the USA, e.g. the data follow normal distribution for each of the questioned preventive health practices and risk behaviors.
The data are quite a big dataset, with many rows and many collumns:
dim(brfss2013)## [1] 491775 330
Therefore I trim it to somewhat tidier working dataset. It contains only the data from year 2013 and only the variables necessary for addressing my research questions (see them below):
#create working dataset - only 2013 and only target variables
a <- filter(brfss2013, iyear == 2013) %>%
select(menthlth, X_chldcnt, sex, X_drnkdy4, X_age_g, X_bmi5cat, frutda1_, beanday_, grenday_, orngday_, vegeda1_)
#rename variables to something more understandable
names(a) <- c("mental_health", "num_children", "sex", "drinks", "age", "BMI", "fruit", "beans", "green_vegs", "orange_vegs", "other_vegs")
names(a)## [1] "mental_health" "num_children" "sex" "drinks"
## [5] "age" "BMI" "fruit" "beans"
## [9] "green_vegs" "orange_vegs" "other_vegs"
Research quesion 1:
Is there difference in bad mental health days based on number of children in household between men and women?
Does having more children prevent or encourage feelings of bad mental health? Does women feel it in different way than men? Being mother of two I find this question rather interesting. I utilize the data of adult men and women and of children present in household together with reported number of not good mental health days per last 30 days.
Research quesion 2:
How is the BMI distributed in age categories? Which age category has the highest prevalence of the highest BMI?
Are older people more obese? Or is there more young adults with such a problem? Every day most of us can see some overweighted and obese people, and depending on social circles one can get impression that there are more overweighted olders or youngsters or some other age category. What does these census data can say to it?
Research quesion 3:
Does people who drink more also eat more fruits and veggies, e.g. do they have more hedonistic tendencies?
Does drinkers in this dataset spend less time and money on eating fruits and vegetables? Does no-drinkers eat more fruits and vegetables, suggesting overally healthier life style? Or are the drinkers in provided dataset just more hedonistic than no-drinkers, as they eat similar or more amounts of fruits and vegetables?
Is there difference in bad mental health days based on number of children in household between men and women?
I chose to omit people who don’t report any bad mental health day. I want to look at people who do experience some mental struggle and the overwhelming majority of people without mental health bad days would screw the numbers.
#create dataset for first question
a1 <- select(a, mental_health, num_children, sex) %>%
filter(!is.na(mental_health) & !is.na(num_children) & !is.na(sex)) %>%
filter(!mental_health==0)
#summary table
a1_sum <- group_by(a1, num_children, sex) %>%
summarise(Mean = mean(mental_health), Median = median(mental_health), SD = sd(mental_health))
kable(a1_sum, digits = 2, align = "c",
caption="Tab. 1 Number of bad health days per 30 days for men and women based on number of children in household",
col.names=c("Number of children", "Sex", "Mean", "Median", "SD"))| Number of children | Sex | Mean | Median | SD |
|---|---|---|---|---|
| No children in household | Male | 11.30 | 6 | 10.70 |
| No children in household | Female | 11.24 | 6 | 10.54 |
| One child in household | Male | 10.05 | 5 | 10.14 |
| One child in household | Female | 10.96 | 5 | 10.37 |
| Two children in household | Male | 9.46 | 5 | 9.88 |
| Two children in household | Female | 10.13 | 5 | 10.01 |
| Three children in household | Male | 9.54 | 5 | 9.90 |
| Three children in household | Female | 10.40 | 5 | 10.12 |
| Four children in household | Male | 9.58 | 5 | 10.02 |
| Four children in household | Female | 10.54 | 5 | 10.25 |
| Five or more children in household | Male | 11.49 | 6 | 10.69 |
| Five or more children in household | Female | 11.49 | 6 | 10.64 |
There is no difference between men and women in how good or bad their mental health was. Also the number of bad days doesn’t differ regardless of number of children present. Thus the bad mental health has different sources of dispair than having or not having children.
How is the BMI distributed in age categories? Which age category has the highest prevalence of the highest BMI?
#create dataset for second question
a2 <- select(a, age, BMI) %>%
filter(!is.na(age) & !is.na(BMI))
#table of data
table(a2)## BMI
## age Underweight Normal weight Overweight Obese
## Age 18 to 24 1119 13387 6617 4013
## Age 25 to 34 899 17738 14967 12054
## Age 35 to 44 710 17916 19483 17738
## Age 45 to 54 1021 23362 28298 26062
## Age 55 to 64 1287 29497 37352 33715
## Age 65 or older 3117 51156 58493 39709
Let’s see some summary statistics for this partial dataset. In absolute numbers age category with the highest BMI are the seniors older than 64 years. However the numbers are not equal for each age category, therefore proportional distribution in each age class should be more informative.
#plot
ggplot(a2, aes(x=age, fill=BMI)) +
geom_bar(position = "fill") +
labs(title = "Fig. 2 BMI-class distribution in six age categories of Americans in 2013",
x = "Age category",
y = "BMI-class distribution",
fill="BMI class")And in form of a table:
#percentage table
a2_sum <- as.data.frame(table(a2))
a2_sum <- spread(a2_sum, BMI, Freq)
names(a2_sum)[3] <- "Normal_weight"
a2_sum2 <- group_by(a2_sum, age) %>%
transmute(No_of_people = sum(c(Underweight, Normal_weight, Overweight, Obese)),
Perc_under = Underweight*100/No_of_people,
Perc_norm = Normal_weight*100/No_of_people,
Perc_over = Overweight*100/No_of_people,
Perc_obes = Obese*100/No_of_people) %>%
select(-No_of_people)
kable(a2_sum2, digits=1, align="c",
caption="Tab. 2 BMI-class distribution (%) in six age categories of Americans in 2013",
col.names=c("Age", "Underweight (%)", "Normal weight (%)", "Overweight (%)", "Obese (%)"))| Age | Underweight (%) | Normal weight (%) | Overweight (%) | Obese (%) |
|---|---|---|---|---|
| Age 18 to 24 | 4.5 | 53.3 | 26.3 | 16.0 |
| Age 25 to 34 | 2.0 | 38.8 | 32.8 | 26.4 |
| Age 35 to 44 | 1.3 | 32.1 | 34.9 | 31.8 |
| Age 45 to 54 | 1.3 | 29.7 | 35.9 | 33.1 |
| Age 55 to 64 | 1.3 | 29.0 | 36.7 | 33.1 |
| Age 65 or older | 2.0 | 33.6 | 38.4 | 26.0 |
a2_sum3 <- gather(a2_sum2, key=BMI, value=BMI2, 2:5) %>%
arrange(age)One can see that with increasing age the proportion of normal weighted people is getting smaller in favor of overweighted and obese ones. The most obese age category of Americans in 2013 are people in age 45 to 64, with more than 25 % of them being obese. One could speculate that decreased obese proportions in the oldest age category may be connected with lower life length expectancies of obese people.
Does people who drink more also eat more fruits and veggies, e.g. do they have more hedonistic tendencies?
#create dataset for third question
a3 <- select(a, drinks, fruit, beans, green_vegs, orange_vegs, other_vegs) %>%
filter(!is.na(drinks) & !is.na(fruit) & !is.na(beans) & !is.na(green_vegs) & !is.na(orange_vegs) & !is.na(other_vegs)) %>%
filter(drinks < 500, fruit < 500, beans < 500, green_vegs < 500, orange_vegs < 500, other_vegs < 500)I decide to limit the scales of all variables to maximum of 500. Higher values are too sparse (can be classified as outliers) and probably not correct anyway (who drinks 7000 drinks per 30 days?!).
#plot
# I create multilayer ggplot with custom legend - this requires stating each line as an unique geom and setting labels (labs()) and legend (scale_colour_manual()) manually
ggplot() +
geom_smooth(data=a3, aes(x=drinks, y=fruit, color = "purple")) +
geom_smooth(data=a3, aes(x=drinks, y=beans, color = "brown")) +
geom_smooth(data=a3, aes(x=drinks, y=green_vegs, color = "darkgreen")) +
geom_smooth(data=a3, aes(x=drinks, y=orange_vegs, color = "orange")) +
geom_smooth(data=a3, aes(x=drinks, y=other_vegs, color = "chartreuse2")) +
scale_colour_manual(name = "The edibles",
values=c("purple" = "purple", "brown" = "brown", "darkgreen" = "darkgreen", "orange" = "orange", "chartreuse2" = "chartreuse2"),
labels = c("beans", "dark green vegs", "other vegs", "orange vegs", "fruits")) +
labs(title="Fig. 3 Consumptions of fruits and veggies in relation to drinking",
x = "Number of drinks per month",
y = "Number of fruits or veggies per month")There seems to be no relation between drinking and consumption od fruits and veggies for adult Americans in 2013. One would expect that drinkers (more than 300 drinks per 30 days) would eat far less fruits and veggies than the rest of population, but this plot indicates no trend whatsoever (maybe slight decrease in fruits intake). That suggests that indeed the drinkers have more hedonistic tendencies as they are able to consume quite a lot of drinks together with their non-decreased day ratio of fruits and vegetables.