The data for this project come from the “Behavioral Risk Factor Surveillance System” (BRFSS). You can learn more about the survey at https://www.cdc.gov/brfss/.
The BRFSS is a national telephone survey that collects data from adult U.S. residents (aged 18 years and older) about their health-related risk behaviors, chronic health conditions, and use of preventive services. It was established in 1984 with 15 states and now collects data in all 50 states, the District of Columbia, and three U.S. territories. The survey completes more than 400,000 adult interviews each year, making it the largest continuously conducted health survey system in the world.
Note on causality: Because this is an observational, cross-sectional survey, the data cannot establish causation. We can draw conclusions about prevalence and about correlation or association between variables, but we cannot determine the direction of an association: we cannot tell whether one outcome causes the other or the reverse.
library(ggplot2)
library(dplyr)
library(magrittr)
library(scales)
library(RColorBrewer)
library(tidyverse)
library(pander)
library(statsr)
library(devtools)
load("brfss2013.RData")
Background: Some studies in the field of chrononutrition have suggested a possible relationship between individuals’ sleep time and their weight status. In this question, I explore this relationship under the hypothesis that people who sleep less have a higher BMI.
After removing NAs and refusals from the three variables (sleptim1, X_bmi5cat, and sex), the dataset contains 458,915 observations.
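The code that builds the `sleep` data frame is not shown in this report. The following is a minimal sketch of one way such a subset could be created, assuming that missing answers and refusals are already coded as `NA`. The toy data frame `brfss_toy` and its values are hypothetical and only stand in for `brfss2013`.

```r
# Toy stand-in for brfss2013 (hypothetical values, for illustration only)
brfss_toy <- data.frame(
  sleptim1  = c(7, NA, 6, 8, 5),
  X_bmi5cat = factor(c("Normal weight", "Obese", NA, "Overweight", "Obese")),
  sex       = factor(c("Male", "Female", "Female", NA, "Male"))
)

# Keep the three variables of interest and drop rows with any missing value
vars <- c("sleptim1", "X_bmi5cat", "sex")
sleep_toy <- brfss_toy[complete.cases(brfss_toy[, vars]), vars]

nrow(sleep_toy)    # number of complete rows that remain
summary(sleep_toy) # same kind of summary as shown for the real data
```

Applied to the real data, the same pattern would use `brfss2013` in place of `brfss_toy`.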
## sleptim1 X_bmi5cat sex
## Min. : 1.000 Underweight : 8054 Male :195085
## 1st Qu.: 6.000 Normal weight:152911 Female:263830
## Median : 7.000 Overweight :165107
## Mean : 7.049 Obese :132843
## 3rd Qu.: 8.000
## Max. :24.000
From the summary output we can see the absolute frequency of each level of the categorical variables (sex and BMI category). The summary statistics for sleep time, a discrete variable, show a range of 1-24 hours with a mean (7.049 hours) close to the median (7 hours).
#Descriptive Statistics
table1 <- table(sleep$sex, sleep$X_bmi5cat)
prop_table <- prop.table(table1, 1)
prop_table_percent <- prop_table * 100
pander(prop_table_percent, caption = "Contingency Table of BMI status by sex (%)")
| | Underweight | Normal weight | Overweight | Obese |
|---|---|---|---|---|
| Male | 0.9565 | 26.89 | 43.03 | 29.13 |
| Female | 2.345 | 38.08 | 30.76 | 28.81 |
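# Mean sleep time (hours) for each combination of BMI category and sex;
# the formula interface keeps the variable names as column headers
result <- aggregate(sleptim1 ~ X_bmi5cat + sex, data = sleep, FUN = mean)
pander(result, caption = "Mean Sleep Time by BMI Category and Sex")
| X_bmi5cat | sex | sleptim1 |
|---|---|---|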
| Underweight | Male | 7.034 |
| Normal weight | Male | 7.105 |
| Overweight | Male | 7.041 |
| Obese | Male | 6.945 |
| Underweight | Female | 7.098 |
| Normal weight | Female | 7.126 |
| Overweight | Female | 7.078 |
| Obese | Female | 6.959 |
The descriptive statistics show that overweight is the most common category among men (43.0%) and normal weight the most common among women (38.1%). The obesity rate is close to 30% for both sexes (29.1% for men and 28.8% for women). Very few individuals were underweight (1.0% of men and 2.3% of women).
The mean sleep time was very similar between the sexes: 7.03 h for men and 7.06 h for women.
By weight status, the highest mean sleep time was among normal-weight women (7.126 h), while the lowest was among obese men (6.945 h).
# Plot the bar graph with percentage labels
# prop is the share of each sex within a BMI category (in %);
# .groups = "drop_last" keeps the grouping by X_bmi5cat and silences the summarise() message
prop_table <- sleep %>%
  group_by(X_bmi5cat, sex) %>%
  summarise(count = n(), .groups = "drop_last") %>%
  mutate(prop = count / sum(count) * 100)
ggplot(prop_table, aes(x = X_bmi5cat, y = prop, fill = sex)) +
  geom_bar(stat = "identity", position = "stack") +
  geom_text(aes(label = paste0(round(prop), "%")),
            position = position_stack(vjust = 0.5),
            color = "white", fontface = "bold", hjust = 0.5) +
  labs(x = "BMI Categories", y = "Percentage", fill = "Sex") +
  scale_fill_brewer(palette = "Set2") +
  theme_classic()
# Plotting graphs
ggplot(sleep, aes(x = X_bmi5cat, y = sleptim1, fill = sex)) +
  geom_boxplot() +
  labs(y = "Hours of Sleep", x = "BMI categories") +
  theme_bw() +
  scale_fill_brewer(palette = "Set2")
When we compared sleep time by sex within each BMI category, the boxplots showed no relevant differences between the distributions. However, there are several outliers in the categories.
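To make the notion of an outlier concrete: `geom_boxplot()` by default draws as individual points any values lying more than 1.5 times the interquartile range beyond the quartiles. The sketch below applies that rule to the quartiles reported in the summary output for the whole sample. It is an illustration only, since the quartiles within each BMI category and sex may differ from the overall ones.

```r
# Quartiles of sleptim1 for the whole sample, taken from the summary output
q1 <- 6
q3 <- 8
iqr <- q3 - q1

# Default boxplot rule: values beyond 1.5 * IQR from the quartiles are outliers
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr
c(lower = lower_fence, upper = upper_fence)
```

Under this rule, reported sleep times below 3 hours or above 11 hours would be drawn as outliers for the overall distribution.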
ttest <- t.test(sleep$sleptim1 ~ sleep$sex)
pander(ttest, caption = "Independent t-test - Sleeping time by sex")
| Test statistic | df | P value | Alternative hypothesis |
|---|---|---|---|
| -7.348 | 425925 | 2.009e-13 * * * | two.sided |
| mean in group Male | mean in group Female |
|---|---|
| 7.03 | 7.062 |
chi_square <- chisq.test(sleep$X_bmi5cat, sleep$sex, correct = FALSE)
pander(chi_square, caption = "Chi-Square Test Results")
| Test statistic | df | P value |
|---|---|---|
| 10141 | 3 | 0 * * * |
oneway <- aov(sleptim1 ~ X_bmi5cat, data = sleep)
summary(oneway)
## Df Sum Sq Mean Sq F value Pr(>F)
## X_bmi5cat 3 1988 662.8 310.4 <2e-16 ***
## Residuals 458911 979913 2.1
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
pander(oneway, caption = "Analysis of variance - Sleeping time by BMI categories")
| | Df | Sum Sq | Mean Sq | F value | Pr(>F) |
|---|---|---|---|---|---|
| X_bmi5cat | 3 | 1988 | 662.8 | 310.4 | 2.394e-201 |
| Residuals | 458911 | 979913 | 2.135 | NA | NA |
TukeyHSD(oneway)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = sleptim1 ~ X_bmi5cat, data = sleep)
##
## $X_bmi5cat
## diff lwr upr p adj
## Normal weight-Underweight 0.03520084 -0.007717174 0.07811885 0.1507698
## Overweight-Underweight -0.02454772 -0.067386343 0.01829091 0.4544401
## Obese-Underweight -0.13031921 -0.173399119 -0.08723930 0.0000000
## Overweight-Normal weight -0.05974855 -0.073072192 -0.04642492 0.0000000
## Obese-Normal weight -0.16552005 -0.179600173 -0.15143992 0.0000000
## Obese-Overweight -0.10577149 -0.119607752 -0.09193524 0.0000000
tukey <- TukeyHSD(oneway)
pander(tukey, caption = "Analysis of variance with Tukey multiple pairwise-comparisons - Sleeping time by BMI categories")
## Warning in pander.default(tukey, caption = "Analysis of variance with
## Tukey multiple pairwise-comparisons - Sleeping time by BMI categories"):
## No pander.method for "TukeyHSD", reverting to default.No pander.method for
## "multicomp", reverting to default.
X_bmi5cat:
| | diff | lwr | upr | p adj |
|---|---|---|---|---|
| Normal weight-Underweight | 0.0352 | -0.007717 | 0.07812 | 0.1508 |
| Overweight-Underweight | -0.02455 | -0.06739 | 0.01829 | 0.4544 |
| Obese-Underweight | -0.1303 | -0.1734 | -0.08724 | 7.006e-14 |
| Overweight-Normal weight | -0.05975 | -0.07307 | -0.04642 | 0 |
| Obese-Normal weight | -0.1655 | -0.1796 | -0.1514 | 0 |
| Obese-Overweight | -0.1058 | -0.1196 | -0.09194 | 0 |
Because the p-value of the independent (Welch) t-test (p < 0.001) is less than alpha = 0.05, we reject the null hypothesis of equal means. This means we have sufficient evidence to say that the mean sleeping time of U.S. adults differs between the sexes, although the difference itself is small (7.03 h for men versus 7.06 h for women).
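With a sample of this size, even a very small difference can be statistically significant, so it helps to look at the magnitude as well. The short calculation below converts the difference between the two group means reported in the t-test output into minutes. It uses the rounded means from the table, so the result is approximate.

```r
# Group means reported in the t-test output (hours)
mean_male   <- 7.03
mean_female <- 7.062

# Difference between the sexes, in hours and in minutes
diff_hours   <- mean_female - mean_male
diff_minutes <- diff_hours * 60
round(diff_minutes, 1)
```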
Since the p-value of the chi-square test (p < 0.001) is less than the significance level of 0.05, we reject the null hypothesis of independence and conclude that BMI category and sex are dependent (or associated).
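As a rough check that does not need the raw data, the chi-square test can be re-run on cell counts reconstructed from the row percentages and the sample sizes by sex reported earlier. This is only a sketch: the counts are rebuilt from rounded percentages, so the statistic should be close to, but not exactly equal to, the value in the table above. The object names `n_sex`, `pct`, `counts`, and `chi_approx` are introduced here for illustration.

```r
# Sample sizes by sex, from the summary output
n_sex <- c(Male = 195085, Female = 263830)

# Row percentages of BMI category within each sex, from the contingency table
pct <- rbind(
  Male   = c(0.9565, 26.89, 43.03, 29.13),
  Female = c(2.345,  38.08, 30.76, 28.81)
)
colnames(pct) <- c("Underweight", "Normal weight", "Overweight", "Obese")

# Approximate cell counts; n_sex is recycled down the rows (Male, Female)
counts <- round(pct / 100 * n_sex)

# Pearson chi-square test of independence on the reconstructed table
chi_approx <- chisq.test(counts, correct = FALSE)
chi_approx$statistic
chi_approx$parameter  # degrees of freedom: (2 - 1) * (4 - 1)
```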
In a one-way ANOVA test, a significant p-value indicates that some of the group means are different, but it does not tell us which pairs of groups differ. We therefore perform multiple pairwise comparisons to determine whether the mean difference between specific pairs of groups is statistically significant.
As the p-value in the ANOVA (p < 0.001) is less than the significance level of 0.05, we can conclude that there are significant differences in sleeping time among BMI categories.
The multiple pairwise comparisons show that all differences are significant except those between the underweight and normal-weight groups and between the underweight and overweight groups (adjusted p-value > 0.05). The obese group has the lowest mean sleep time in every comparison, although the largest difference (obese versus normal weight) is only about 0.17 hours.
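The pairs that differ can also be extracted programmatically rather than read off the table. The sketch below runs the same ANOVA and Tukey workflow on `PlantGrowth`, a small dataset that ships with R, used here only as a stand-in for the BRFSS data. Converting the comparison matrix to a data frame is also one way to avoid the `pander` warning shown in the output above, since `pander` has no dedicated method for `TukeyHSD` objects.

```r
# PlantGrowth is a built-in R dataset (plant weight by treatment group),
# used here only as a stand-in for the BRFSS data
fit <- aov(weight ~ group, data = PlantGrowth)
tukey_toy <- TukeyHSD(fit)

# The comparisons for a factor are stored as a matrix, which converts
# cleanly to a data frame for tabulation
tukey_df <- as.data.frame(tukey_toy$group)

# Pairs whose adjusted p-value is below the 0.05 significance level
rownames(tukey_df)[tukey_df[["p adj"]] < 0.05]
```

For the analysis in this report, the equivalent objects would be `oneway` and `tukey$X_bmi5cat`.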