Setup

Load packages

library(ggplot2)
library(plyr)
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.3.1
library(e1071)
## Warning: package 'e1071' was built under R version 3.3.1

Load data

load("brfss2013.RData")
#Remove individuals who reported they were pregnant.
idx <- which(brfss2013$pregnant == "Yes")
brfss2013 <- brfss2013[-idx,]

Part 1: Data

The data is obtained through an ongoing surveillance system designed to measure behavioral risk factors for the non-institutionalized adult population (18 years of age and older) residing in the US. Specifically, telephone interviews are performed every one to three months. This particular method of data collections is a nonprobability sampling, i.e., participants are selected on the basis of availability and, as such, a critical evaluation whether an unknown portion of the population is (systematically) excluded should always be performed. As a corollary, generalization of any statistical inference should be done with caution if at all possible.


Part 2a: Research questions

Obesity is increasingly recognized as a public health concern because of its adverse consequences, such as diabetes mellitus, heart disease, various cancers, and arthritis. It is not inconceivable that obese persons may experience a diminished health-related quality of life. Here, we will ascertain whether there is indeed a correlation between the body mass index (BMI) of an individual and the number of self-reported unhealthy days, by age, race, and employment status.

**Research quesion 1: Is there a difference in the mean number of unhealthy (physical or mental) days, by body mass index (BMI) and by age?

**Research quesion 2: Is there a difference in the mean number of unhealthy (physical or mental) days, by BMI and by race or ethnicity?

**Research quesion 3: Is there a difference in the mean number of unhealthy days (physical or mental) days, by BMI and by employment status?


Part 2b: Select variables into a new dataframe and clean data

#Select variables for analysis into new data frame, df, and subset for complete cases.

vars <- c("seqno", "htm4", "wtkg3", "iyear", "physhlth", "menthlth",
          "employ1", "X_race", "X_bmi5", "X_ageg5yr")
df <- brfss2013[vars]
df <- df[complete.cases(df),]

A preliminary analysis of the data showed some obvious entry errors, e.g., a height of 2469 cm or a weight of 2 grams. As such, only rows with a height smaller than 2.5 m or a weight greater than 40 kg were selected for further analysis.

df <- subset(df, htm4 < 250 | wtkg3 > 40)
#Rename the column names and standardize values
colnames(df) = c("id", "height", "weight", "year", "pHealth", "mHealth",
                 "employed", "race", "bmi", "ageGroup")
df$height <- df$height / 100
df$weight <- df$weight / 100
df$bmi <- df$bmi / 100
levels(df$ageGroup) <- gsub("Age ", "", levels(df$ageGroup))
#Combine poor mental and poor physical health days
df$aHealth = (df$pHealth + df$mHealth) / 2
#Remove original data frame
#rm(brfss2013)

The structure of the resulting data frame is:

str(df)
## 'data.frame':    439310 obs. of  11 variables:
##  $ id      : int  2013000580 2013000593 2013000600 2013000606 2013000608 2013000630 2013000634 2013001305 2013001338 2013001640 ...
##  $ height  : num  1.7 1.78 1.63 1.63 1.83 1.6 1.52 1.88 1.65 1.85 ...
##  $ weight  : num  113.4 57.6 72.6 58.1 120.2 ...
##  $ year    : Factor w/ 2 levels "2013","2014": 1 1 1 1 1 1 1 1 1 1 ...
##  $ pHealth : int  30 0 3 2 10 0 1 0 0 0 ...
##  $ mHealth : int  29 0 2 0 2 0 15 0 0 1 ...
##  $ employed: Factor w/ 8 levels "Employed for wages",..: 7 1 1 7 7 1 1 7 5 1 ...
##  $ race    : Factor w/ 8 levels "White only, non-Hispanic",..: 2 1 1 1 1 2 1 1 1 1 ...
##  $ bmi     : num  39.2 18.2 27.5 22 35.9 ...
##  $ ageGroup: Factor w/ 13 levels "18 to 24","25 to 29",..: 9 7 8 9 10 6 4 7 10 5 ...
##  $ aHealth : num  29.5 0 2.5 1 6 0 8 0 0 0.5 ...

Part 3: Exploratory data analysis

The distribution of BMI values is shown in Figure 1. It is unimodal and skewed to the right. The blue line represents a normal distribution with a mean and standard deviation equal to that of the BMI values.

hist(df$bmi, prob=TRUE, main="Histogram of BMI values", breaks=100)
curve(dnorm(x, mean(df$bmi), sd(df$bmi)), add=TRUE, col="darkblue", lwd=2)

skewness(df$bmi)
## [1] 1.380694

The mean and standard deviation of the BMI values for individuals in all age groups are:

mean = mean(df$bmi); mean
## [1] 27.83377
sd = sd(df$bmi); sd
## [1] 6.141421

Assuming the population standard deviation sigma being 6.14, the margin of error for the BMI values at a 95% confidence level is 3.81 kg/m^2. The 95% confidence interval is calculated as follows:

n=length(df)
x_bar = mean(df$bmi)
stdev = sd(df$bmi)
sem = stdev / sqrt(n)
E = qnorm(0.975) * sem
x_bar + c(-E, E)
## [1] 24.20448 31.46305

Figure 2 shows a comparison of the BMI values grouped by the different age groups.

ggplot(df, aes(x=ageGroup, y=bmi)) +
  geom_boxplot(fill="grey80", colour="blue") +
  xlab("Age Group") + ylab("BMI") +
  theme(axis.text.x = element_text(angle=45, hjust=1))

Due to the similar IQR and the significant variation of BMI within each group, it is not unreasonable to assume that the point estimates for the population means and population variances are the same for the different age groups in further analyses.

In order to simplify subsequent analyses, we will group individuals in three age groups, namely 18 to 44 years of age, 45 to 64 years of age, and 65 or older. Moreover, with respect to race only individuals belonging to White, Black or Hispanic will be considered. Finally, BMI values will be aggregated into six groups (<18.5, 18.5 to <25, 25 to <30, 30 to <35, 35 to <40, and >=40). According to the guidelines by the CDC, the first three categories are associated with the status underweight, normal (or healthy) weight, and overweight, respectively while any BMI value of 30 or higher is considered to be obese.

lbls <- c("18 to 24", "18 to 24", "18 to 24", "18 to 24", "18 to 24",
          "45 to 64", "45 to 64", "45 to 64", "45 to 64", "65 or older",
          "65 or older", "65 or older", "65 or older")
levels(df$ageGroup) <- lbls
df <- subset(df, race == "White only, non-Hispanic" |
               race == "Black only, non-Hispanic" | race == "Hispanic")
df$bmi <- cut(df$bmi, breaks = c(0,18.5,25,30,35,40,100),
              labels = c("UW", "HW", "Overweight", "Ob", "VOb", "COb"))

**Research quesion 1: Is there a difference in the mean number of unhealthy (physical or mental) days, by body mass index (BMI) and by age?

df_rq1 <- ddply(df, ~ bmi + ageGroup, summarise, mean=mean(aHealth), sd=sd(aHealth))
ggplot(df_rq1, aes(bmi, mean, colour=ageGroup)) + geom_point() +
  xlab("Body Mass Index") + ylab("Mean number of unhealthy days") +
  ggtitle("Figure 3")

The data shown in Figure 3 suggest that individuals classified as underweight or obese typically report a higher number of unhealthy days (physical or mental). It should be noted that in this and subsequent graphs, BMI cannot considered a true predictor for the mean number of unhealthy days. At best we can show an association between the two variables.

**Research quesion 2: Is there a difference in the mean number of unhealthy (physical or mental) days, by BMI and by race or ethnicity?

df_rq2 <- ddply(df, ~bmi + race, summarise, mean=mean(aHealth), sd=sd(aHealth))
ggplot(df_rq2, aes(bmi, mean, colour=race)) + geom_point() +
  xlab("Body Mass Index") + ylab("Mean number of unhealthy days") +
  ggtitle("Figure 4")

Similar to results presented in Figure 3, the mean number of unhealthy days (physical or mental) is higher for individuals classified as underweight or obese when the data are grouped by race (Figure 4).

**Research quesion 3: Is there a difference in the mean number of unhealthy days (physical or mental) days, by BMI and by employment status?

df_rq3 <- ddply(df, ~bmi + employed, summarise, mean=mean(aHealth), sd=sd(aHealth))
ggplot(df_rq3, aes(bmi, mean, colour=employed)) + geom_point() +
  xlab("Body Mass Index") + ylab("Mean number of unhealthy days") +
  ggtitle("Figure 5")

Finally, from the results presented in Figure 5, it would appear that individuals which were unable to work reported the highest mean number of unhealthy days regardless of their BMI value. Likewise, individuals who reported to be out of work tended to report a higher mean number of unhealthy days as compared to those classified as either homemaker, student, retired or self-employed although this number was lower than the mean number of unhealthy days reported by those classified as unable to work.

Note that the mean number of unhealthy days seems to be related to BMI in a J- or U-shaped fashion which precludes fitting a simple linear or even exponential model to the data. Additonally, at the extremes of the self-reported BMI an increased variability was evident (not shown).

In summary, individuals classified as belonging to the “Healthy Weight” (HW) or “Overweight” categories, based on their BMI values, reported a lower mean number of unhealthy days (physical or mental) regardless of age or race. A similar trend was evident when comparing the mean number of unhealthy days by bmi and by employment but it as less marked. Nevertheless, those categorized as “Unable to work” did report a significantly higher mean number of unhealthy days as compared to the other employment categories.