DATA 606 Final Project: BMI & Self-Reported General Health

Kyle Gilde

May 13, 2017

Part 1 - Introduction

Research Question

Does the Body Mass Index (BMI) fluctuate with the self-reported level of general health?

Rationale

Do U.S. adults have a subjective or misconceived perception of their own health, or does the self-reported level of general health correspond to an empirical measure of their health such as BMI?

Part 2 - Data

Data collection

The data were collected as a collaborative project between the U.S. states and territories and the Centers for Disease Control and Prevention (CDC) in their Behavioral Rick Factor Surveillance System (BRFSS) telephone survey. Given the rise of cellular telephone-only households, the survey began sampling both land-line and cellular phone numbers in 2011.

The American Community Survey found that nearly 98% of U.S. households had telephone survice available. The BRFSS did not attempt to survey non-telephone residents. They used a weighting method instead.

Note: Despite a thorough search, I could not find specific instructions on how to use the weighted variable _LLCPWT

State health departments conduct the surveys, and most states use a stratified sampling method that divides land-line phone numbers into 2 groups: high-density or medium-density populations

2015 BRFSS Overview

BRFSS Zip File

The Cases

The cases are adult U.S. residents, specifically “non-institutionalized adults who reside in each of the states and selected U.S. territories.” There are 441,456 cases in the 2015 survey, and 404,030 of the cases gave measurable responses for the 2 selected variables.

Module Data for Analysis for 2015 BRFSS

The Variables

Explanatory Variable

  • The explanatory variable is GENHLTH, which is self-reported level of general health response to the following question.

Question: Would you say that in general your health is…

  • The General Health variable is categorical variable that has been stored as an integer. In the study, nearly all categorical variables were stored in a numeric format.

  • This categorical variable has an ordinal scale because its values of Excellent through Poor have an order to them. However, it is not an interval scale, so we cannot assume that the intervals between adjacent values are of equal length.

Value Value Label
1 Excellent
2 Very Good
3 Good
4 Fair
5 Poor
7 Don’t know/Note Sure
9 Refused
BLANK Not asked or Missing

Response Variable

  • The response variable is the Body Mass Index originally labeled as X_BMI5. The World Health Organization (WHO) describes the BMI as a “simple index of weight-for-height that is commonly used to classify underweight, overweight and obesity in adults.”

  • It is a numeric and continuous variable (although the study only records it to 2 decimal places) that was multiplied by 100 and stored as an integer.

  • The BRFSS calculated the respondents’ BMI by converting the reported heights and weights to meters and kilograms and then dividing the weight by the square of the height.

  • The WHO lists the primary classification of the BMI as follows:

Class BMI Range
Underweight <18.50
Normal range 18.50 - 24.99
Overweight 25.00 - 29.99
Obese +30.00

WHO: BMI

Type of study

The BRFSS is an observational study because there was no experiment with subjects being randomly assigned to treatment and control groups.

Scope of inference

Generalizability:

The population of interest is adult U.S. residents. Since random sampling was used, the findings can be generalized to the population.

Weighting was used to account for residents without telephone access.

Causality:

Because this is an observation study and not an experimental one, no random assignment was used, and therefore, these data cannot be used to establish causal links between the variables of interest.

Part 3 - Exploratory data analysis

Required Packages

##               
## psych     TRUE
## prettydoc TRUE
## dplyr     TRUE
## tidyr     TRUE
## foreign   TRUE
## knitr     TRUE
## ggplot2   TRUE

Download & Read Data

url_data <- "https://www.cdc.gov/brfss/annual_data/2015/files/LLCP2015XPT.zip"
brfss_file <- "brfss.zip"

# download file if it isn't already unzip and read if is isn't already
if (!exists("brfss_raw")) {
    download.file(url_data, brfss_file)
    brfss_raw <- read.xport(unzip(brfss_file))
}

Tidy & format the data

  • The non-ordinal values of “Don’t know/Note Sure” and BLANK were removed from GENHLTH. They represented only a tiny fraction of 1 percent.

  • Concatenated the friendly names to the GENHLTH integers

  • Divided the BMI integer by 100

  • Incomplete observations were removed.

bmi_gen_health <- brfss_raw %>% 
  filter(GENHLTH < 6) %>% 
  transmute(BMI = X_BMI5/100, 
            reported_general_health = 
              as.factor(paste(GENHLTH, 
                    sapply(GENHLTH,
                           function(x) switch(x,
                            "1"     = "Excellent",
                            "2"     = "Very Good",
                            "3"     = "Good",
                            "4"     = "Fair",
                            "5"     = "Poor")))
              )) %>% 
  na.omit()

Summary Statistics

Entire Sample

options(scipen = 999)
kable(t(round(describe(bmi_gen_health$BMI), 4)))
X1
vars 1.0000
n 404030.0000
mean 28.0427
sd 6.6535
median 26.9500
trimmed 27.3352
mad 5.1298
min 12.0200
max 99.9500
range 87.9300
skew 2.2434
kurtosis 12.1673
se 0.0105

By General Health Response

stats_by_grp <- describeBy(bmi_gen_health$BMI, bmi_gen_health$reported_general_health)

kable(sapply(stats_by_grp, function(x) rbind(x)))
1 Excellent 2 Very Good 3 Good 4 Fair 5 Poor
vars 1 1 1 1 1
n 70305 134048 124152 54085 21440
mean 25.64967 27.29787 28.96909 30.07246 30.0614
sd 5.247704 5.797408 6.761064 7.790429 8.620629
median 24.96 26.57 27.98 28.8 28.48
trimmed 25.14349 26.76525 28.33719 29.30584 29.15305
mad 3.825108 4.610886 5.48562 6.52344 6.997872
min 12.16 12.02 12.11 12.53 12.15
max 94.66 97.65 97.65 97.65 99.95
range 82.5 85.63 85.54 85.12 87.8
skew 3.671951 2.66578 1.980394 1.606167 1.589135
kurtosis 32.43733 19.36357 10.48139 6.003336 5.28648
se 0.01979139 0.01583447 0.01918837 0.03349831 0.05887441

Distribution of the General Health Responses

ggplot(bmi_gen_health, aes(x = reported_general_health)) + geom_bar()

Boxplots by Reported General Health

boxplot(bmi_gen_health$BMI ~ bmi_gen_health$reported_general_health)

BMI Histogram & Q-Q plot

The BMI distribution is right-skewed.

ggplot(bmi_gen_health, aes(x = BMI)) + geom_histogram()

qqnorm(bmi_gen_health$BMI)
qqline(bmi_gen_health$BMI)

Part 4 - Inference

Confidence Intervals

Check Conditions

  • Independence: Yes, the data are from a stratified random sample.

  • Sample Size: Yes, the sample size is very large.

  • Normal Distribution: The distribution is right-skewed, but this condition can be relaxed since we have a massive sample size.

95% Confidence interval for the entire sample

I used the T-distribution since the BMI distribution is right-skewed. However, the sample size and degrees of freedom are so large that it will approach the normal distribution.

n <- nrow(bmi_gen_health)
df <- n - 1
xbar <- mean(bmi_gen_health$BMI)
sign_level <- 0.05
s <- sd(bmi_gen_health$BMI)
tstar <- qt(sign_level/2, df, lower.tail = F)
se <- s/sqrt(n)
me <- se * tstar
c(xbar - me, xbar + me)
## [1] 28.02216 28.06319

We are 95% confident that the BMI population mean is between 28.02216 and 28.06319.

Confidence intervals for each general health group

  • We are 95% confident that the BMI population mean for each of these groups are within the following intervals.

  • Only the intervals for the Fair and Poor group overlap.

for (i in levels(bmi_gen_health$reported_general_health)) {
    genhlth_subset <- subset(bmi_gen_health, reported_general_health == i)
    n <- nrow(genhlth_subset)
    df <- n - 1
    xbar <- mean(genhlth_subset$BMI)
    sign_level <- 0.05
    s <- sd(genhlth_subset$BMI)
    tstar <- qt(sign_level/2, df, lower.tail = F)
    se <- s/sqrt(n)
    me <- se * tstar
    print(paste("Confidence interval for", i, "group:", xbar - me, xbar + me))
}
## [1] "Confidence interval for 1 Excellent group: 25.6108785067945 25.688460658272"
## [1] "Confidence interval for 2 Very Good group: 27.2668356311951 27.3289061926292"
## [1] "Confidence interval for 3 Good group: 28.931481662721 29.0066994378654"
## [1] "Confidence interval for 4 Fair group: 30.006798434683 30.1381123538905"
## [1] "Confidence interval for 5 Poor group: 29.9459972844314 30.1767937603448"

ANOVA

  • Analysis of Variance is used to to compare means across several groups and uses the F-distribution.

  • The F-statistic is calculated by dividing the mean square between groups by the mean square error, which measures the variability within the groups.

Check Conditions for ANOVA

  • Independence: Yes, the data are from a random sample from less than 10% of the population. Also, the groups are independent of each other.

  • Approximately normal: The Q-Q plots do not look very normal at all. However, since the sample size is so large, perhaps the normality requirement can be set aside.

op <- par(mfrow = c(2, 3))

for (i in levels(bmi_gen_health$reported_general_health)) {
    tmp <- subset(bmi_gen_health$BMI, bmi_gen_health$reported_general_health == 
        i)
    hist(tmp, xlab = "Reported General Health", main = i)
}
par(op)

op2 <- par(mfrow = c(2, 3))

for (i in levels(bmi_gen_health$reported_general_health)) {
    tmp <- subset(bmi_gen_health$BMI, bmi_gen_health$reported_general_health == 
        i)
    qqnorm(tmp, xlab = "Reported General Health", main = i)
    qqline(tmp)
}
par(op2)

# code citation:
# http://stackoverflow.com/questions/20781663/qqnorm-plotting-for-multiple-subsets
  • Constant variance: From the 5 box plots above and the standard deviation statistics, we can see that the variability increases from the Excellent to Poor health groups. This may undermine the results of ANOVA.
t(t(sapply(stats_by_grp, function(x) x$sd)))
##                 [,1]
## 1 Excellent 5.247704
## 2 Very Good 5.797408
## 3 Good      6.761064
## 4 Fair      7.790429
## 5 Poor      8.620629

ANOVA Hypotheses

\(H_0:\) The average BMI is equal across all levels of self-reported general health.

\(H_A:\) The average BMI varies by the level of self-reported general health. At least one pair of group averages is more different than we would expect to see by chance.

ANOVA Analysis

The F-statistics is quite large at 5312, which yields a p-value near zero. Consequently, we would reject the null hypothesis.

aov_output <- aov(BMI ~ reported_general_health, bmi_gen_health)

summary(aov_output)
##                             Df   Sum Sq Mean Sq F value
## reported_general_health      4   893717  223429    5312
## Residuals               404025 16992204      42        
##                                      Pr(>F)    
## reported_general_health <0.0000000000000002 ***
## Residuals                                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Multiple differences in means

  • To test the 10 differences in mean BMIs for the 5 levels of reported general health, we use the Bonferroni correction to apply a more strigent significance level for multiple comparisons.

  • The modified significance level is used to avoid committing a Type 1 error, which is rejecting the null hypothese when it is true. The Type 1 error rate increases with the number of tests conducted.

Check conditions

  • Independence: Yes, the data are from a random sample of less than 10% of the population. There is independence between groups.

  • Sample Size: Yes, the sample size is very large.

  • Normal Distribution: The distribution is right-skewed, but this condition can be relaxed since we have a massive sample size.

Hypotheses

For each comparison, the null hypotheis is that there is no difference between the means, and the alternative hypotheis is that there is a difference.

\(H_0: \mu_diff = 0\)

\(H_A: \mu_diff \neq 0\)

Analysis

For 9 out of the 10 comparisons, the p-value is less than the Bonferroni-corrected significance level of 0.005. Thus, we would reject the null hypothesis that there is no difference in means for these instances.

H0 <- 0

health_levels <- levels(bmi_gen_health$reported_general_health)
# number of groups
k <- length(health_levels)

K <- k * (k - 1)/2
alpha <- 0.05

# Bonferroni correction
alpha_star <- alpha/K

combos <- data.frame(t(combn(health_levels, 2)), stringsAsFactors = F)

results <- data.frame()

for (i in 1:nrow(combos)) {
    v1 <- subset(bmi_gen_health$BMI, bmi_gen_health$reported_general_health == 
        combos[i, 1])
    v2 <- subset(bmi_gen_health$BMI, bmi_gen_health$reported_general_health == 
        combos[i, 2])
    n1 <- length(v1)
    n2 <- length(v2)
    xbar1 <- mean(v1)
    xbar2 <- mean(v2)
    s1 <- sd(v1)
    s2 <- sd(v2)
    xbar_diff <- xbar1 - xbar2
    
    se_diff <- sqrt(s1^2/n1 + s2^2/n2)
    df_diff <- min(n1, n2) - 1
    
    t_score <- (xbar_diff - H0)/se_diff
    pvalue <- pt(-abs(t_score), df_diff) * 2
    
    results <- rbind(results, cbind(combos[i, ], xbar_diff, t_score, pvalue, 
        pvalue < alpha_star))
}

kable(results)
X1 X2 xbar_diff t_score pvalue pvalue < alpha_star
1 Excellent 2 Very Good -1.6482013 -65.0275626 0.0000000 TRUE
1 Excellent 3 Good -3.3194210 -120.4165530 0.0000000 TRUE
1 Excellent 4 Fair -4.4227858 -113.6727784 0.0000000 TRUE
1 Excellent 5 Poor -4.4117259 -71.0286097 0.0000000 TRUE
2 Very Good 3 Good -1.6712196 -67.1760979 0.0000000 TRUE
2 Very Good 4 Fair -2.7745845 -74.8830651 0.0000000 TRUE
2 Very Good 5 Poor -2.7635246 -45.3284988 0.0000000 TRUE
3 Good 4 Fair -1.1033648 -28.5810256 0.0000000 TRUE
3 Good 5 Poor -1.0923050 -17.6398862 0.0000000 TRUE
4 Fair 5 Poor 0.0110599 0.1632761 0.8703026 FALSE

Part 5 - Conclusions

  • We have some evidence that the mean BMIs do fluctuate between the levels of self-reported general health, which means that overall U.S. adults’ self-perception of their own health is affected by an empirical measure of health such as BMI.

  • The ANOVA may be of limited validity because the groups lack the necessary constant variability. However, the large F-statistic and tiny p-value do indicate that we would reject the null hypothesis that there is no difference between the mean BMIs of the self-reported groups.

  • In the differences in means tests, all of the conditions were met, and we saw that 9 of the 10 pairs had statistically significant differences. The only pair that didn’t display a significant difference is the 2 lowest self-reported levels of general health, Fair and Poor.

# unlink(brfss_file) unlink('LLCP2015.XPT')