Exploring the BRFSS data - Cantone Final Project

Setup

Load packages

library(ggplot2)

## Warning: package 'ggplot2' was built under R version 3.6.2

library(dplyr)

## Warning: package 'dplyr' was built under R version 3.6.2

Load data

load("brfss2013data.RData")
dim(brfss2013)

## [1] 491775    330

Part 1: Data

The description below comes directly from the CDC's documentation for the Behavioral Risk Factor Surveillance System.

The Behavioral Risk Factor Surveillance System (BRFSS) is a collaborative project between all of the states in the United States (US) and participating US territories and the Centers for Disease Control and Prevention (CDC).

All 50 states, the District of Columbia, Puerto Rico, and Guam collect data annually, and American Samoa, the Federated States of Micronesia, and Palau collect survey data over a limited point in time (usually one to three months).

Each year, the states—represented by their BRFSS coordinators and CDC—agree on the content of the questionnaire.

In a telephone survey such as the BRFSS, a sample record is one telephone number in the list of all telephone numbers the system randomly selects for dialing. To meet the BRFSS standard for the participating states’ sample designs, one must be able to justify sample records as a probability sample of all households with telephones in the state. All participating areas met this criterion in 2018.

Fifty-one projects used a disproportionate stratified sample (DSS) design for their landline samples. Guam and Puerto Rico used a simple random-sample design. This design choice affects how the data can be generalized. In a simple random sample every unit has the same chance of selection, so unweighted results generalize directly to the population. A disproportionate stratified sample divides the population into strata on some attribute (say, race) and samples the strata at different rates. It is still a probability sample, but it is not a simple random sample, and unweighted summaries can over- or under-represent some groups.

These decisions affect inference and generalizability. While the goal is to generalize to all non-institutionalized adults aged 18 and over, stratification means that unweighted estimates, like the ones in this report, may deviate slightly from the population values.
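To make the effect of disproportionate stratification concrete, here is a minimal sketch with made-up numbers. The strata, sizes, and outcome values are hypothetical and are not BRFSS data. It only illustrates why a design weight (population units represented by each sampled unit) pulls the estimate back toward the population value.

```r
# Hypothetical population: two strata of very different sizes.
pop_size  <- c(urban = 9000, rural = 1000)
# Disproportionate design: the same number is sampled from each stratum.
n_sampled <- c(urban = 100, rural = 100)

# Made-up sample in which the outcome differs by stratum.
toy_sample <- data.frame(
  stratum = rep(c("urban", "rural"), times = n_sampled),
  y       = rep(c(10, 20), times = n_sampled),
  stringsAsFactors = FALSE  # keep stratum as character so name lookup below works in any R version
)

# Design weight: how many population units each sampled unit stands for.
toy_sample$weight <- unname((pop_size / n_sampled)[toy_sample$stratum])

unweighted_mean <- mean(toy_sample$y)                               # treats both strata as equal in size
weighted_mean   <- weighted.mean(toy_sample$y, toy_sample$weight)   # restores the population proportions

unweighted_mean
weighted_mean
```

In this toy setup the true population mean is (9000 * 10 + 1000 * 20) / 10000 = 11, which the weighted mean recovers, while the unweighted mean over-represents the small stratum.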

Also, because BRFSS is an observational study with no random assignment, causality cannot be assessed; only associations can be reported. Establishing causality requires a true experiment.

Part 2: Research questions

Research question 1:

Do obese people eat fewer vegetables and fruits?

This question will use X_bmi5cat (computed body mass index category) together with X_vegesum and X_frutsum, the variables tracking self-reported consumption of vegetables and fruits.

One might hypothesize that those who eat fewer vegetables and fruits are more likely to have a higher BMI.

Research question 2:

The second research question builds on the obesity variable (using BMI) to ask whether individuals with a high BMI have a higher risk of heart attack.

Again, this research question will use X_bmi5cat: Computed Body Mass Index Categories; it will also use cvdinfr4: Ever Diagnosed With Heart Attack.

Research question 3:

I predict that individuals who are more obese (on the BMI scale) are also more likely to report joint pain. I would predict this because of the extra strain on the joints, and from the anecdotal experiences of those around me. This analysis will use three variables:

lmtjoin3
X_bmi5cat
joinpain

Part 3: Exploratory data analysis

Research question 1 data analysis

For the first research question, I computed the mean vegetable and fruit intake within each BMI category. My children oppose eating fruit and veggies and I want them to eat more of these foods.

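As shown in the data below, for vegetables, the results are consistent with my hypothesis. Those who are obese had a mean X_vegesum of 177.86 (as far as I can tell from the BRFSS codebook, these totals are times per day with two implied decimal places, so roughly 1.78 times per day). This was the lowest among the categories (overweight = 187.76; normal weight = 202.89; underweight = 193.16). Underweight respondents also fall below the normal-weight group, so the means peak at normal weight rather than declining steadily across the categories.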

For fruit, the results are also consistent with my hypothesis. Those who are obese had a mean X_frutsum of 127.94. This was the lowest among the categories (overweight = 138.38; normal weight = 149.90; underweight = 142.94). The same pattern holds: fruit consumption is highest among those of normal weight.

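# Mean vegetable intake for underweight respondents; rows with a missing BMI
# category are already dropped by the == comparison inside filter()
df1 <- brfss2013 %>%
  filter(X_bmi5cat == 'Underweight', !is.na(X_vegesum)) %>%
  summarise(mean_Underweight_veg = mean(X_vegesum))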
df1

##   mean_Underweight_veg
## 1             193.1601

# Mean vegetable intake for normal-weight respondents
df2 <- brfss2013 %>%
  filter(X_bmi5cat == 'Normal weight', !is.na(X_vegesum)) %>%
  summarise(mean_NormalWeight_veg = mean(X_vegesum))
df2

##   mean_NormalWeight_veg
## 1              202.8943

# Mean vegetable intake for overweight respondents
df3 <- brfss2013 %>%
  filter(X_bmi5cat == 'Overweight', !is.na(X_vegesum)) %>%
  summarise(mean_Overweight_veg = mean(X_vegesum))
df3

##   mean_Overweight_veg
## 1            187.7619

# Mean vegetable intake for obese respondents
df4 <- brfss2013 %>%
  filter(X_bmi5cat == 'Obese', !is.na(X_vegesum)) %>%
  summarise(mean_Obese_veg = mean(X_vegesum))
df4

##   mean_Obese_veg
## 1       177.8585

# Mean fruit intake for underweight respondents
df5 <- brfss2013 %>%
  filter(X_bmi5cat == 'Underweight', !is.na(X_frutsum)) %>%
  summarise(mean_Underweight_fruit = mean(X_frutsum))
df5

##   mean_Underweight_fruit
## 1               142.9423

# Mean fruit intake for normal-weight respondents
df6 <- brfss2013 %>%
  filter(X_bmi5cat == 'Normal weight', !is.na(X_frutsum)) %>%
  summarise(mean_NormalWeight_fruit = mean(X_frutsum))
df6

##   mean_NormalWeight_fruit
## 1                149.9006

# Mean fruit intake for overweight respondents
df7 <- brfss2013 %>%
  filter(X_bmi5cat == 'Overweight', !is.na(X_frutsum)) %>%
  summarise(mean_Overweight_fruit = mean(X_frutsum))
df7

##   mean_Overweight_fruit
## 1              138.3816

# Mean fruit intake for obese respondents
df8 <- brfss2013 %>%
  filter(X_bmi5cat == 'Obese', !is.na(X_frutsum)) %>%
  summarise(mean_Obese_fruit = mean(X_frutsum))
df8

##   mean_Obese_fruit
## 1         127.9404
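The eight per-category pipelines above can also be collapsed into a single grouped summary. The sketch below is one way to do it, not the method used in this report. It runs on a small made-up data frame (`toy_brfss` is a hypothetical stand-in for brfss2013) so that it is self-contained.

```r
library(dplyr)

# Toy stand-in for brfss2013 so the sketch runs on its own; all values are made up.
toy_brfss <- data.frame(
  X_bmi5cat = factor(
    c("Underweight", "Normal weight", "Normal weight", "Obese", "Obese", NA),
    levels = c("Underweight", "Normal weight", "Overweight", "Obese")
  ),
  X_vegesum = c(150, 200, 220, NA, 180, 190),
  X_frutsum = c(100, 160, 140, 120, NA, 130)
)

mean_by_bmi <- toy_brfss %>%
  filter(!is.na(X_bmi5cat)) %>%   # drop rows with no BMI category
  group_by(X_bmi5cat) %>%         # one group per BMI category present in the data
  summarise(
    mean_veg   = mean(X_vegesum, na.rm = TRUE),  # na.rm skips missing intake values
    mean_fruit = mean(X_frutsum, na.rm = TRUE),
    n          = n()                             # rows per category, including rows with missing intake
  )

mean_by_bmi
```

Applied to brfss2013 itself, the same pipeline should return one row per BMI category, with the vegetable and fruit means side by side.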

mean_fruit_vector_bmi<-c(df5$mean_Underweight_fruit,df6$mean_NormalWeight_fruit,df7$mean_Overweight_fruit,df8$mean_Obese_fruit)

mean_veg_vector_bmi <-c(df1$mean_Underweight_veg,df2$mean_NormalWeight_veg,df3$mean_Overweight_veg,df4$mean_Obese_veg)

mean_veg_vector_bmi

## [1] 193.1601 202.8943 187.7619 177.8585

bmi_labels <- c("Underweight", "NormalWeight", "Overweight", "Obese")
# barplot() returns the x positions of the bar midpoints; reuse them so the line sits on the bars
bar_mids <- barplot(mean_veg_vector_bmi, names.arg=bmi_labels, cex.names=.9, ylab = 'mean of veg consumption', main='Mean of veg consumption for various BMI categories', xlab="BMI category")
lines(bar_mids, mean_veg_vector_bmi, col='Red')

mean_fruit_vector_bmi

## [1] 142.9423 149.9006 138.3816 127.9404

bmi_labels <- c("Underweight", "NormalWeight", "Overweight", "Obese")
# barplot() returns the x positions of the bar midpoints; reuse them so the line sits on the bars
bar_mids <- barplot(mean_fruit_vector_bmi, names.arg=bmi_labels, cex.names=.9, ylab = 'mean of fruits consumption', main='Mean of fruits consumption for various BMI categories', xlab="BMI category")
lines(bar_mids, mean_fruit_vector_bmi, col='Red')

Individuals with normal BMIs are those who ate the most fruits and vegetables. Obese and overweight individuals ate the least. This is the start of potentially good research on a correlation between nutrition and BMI.

Research question 2

DATA ANALYSIS: I created a new data set with just the two variables in question and counted the responses, which are very unbalanced: NO for heart attack history (434,459) far outnumbers YES (28,265). In the stacked bar chart, both the NO and the YES bars appear to be split in similar proportions among obese, overweight, and normal-weight respondents, so BMI category does not look like a predictor here. Because the two bars differ so much in height, this is only a visual impression; comparing the share of YES within each BMI category would be a firmer check.

RQ2Data<-brfss2013%>%
  filter(!is.na(X_bmi5cat),!is.na(cvdinfr4))%>%
  select(X_bmi5cat,cvdinfr4)

count(RQ2Data,cvdinfr4)

##   cvdinfr4      n
## 1      Yes  28265
## 2       No 434459

ggplot(data=RQ2Data,aes(x=cvdinfr4,fill=X_bmi5cat))+
  geom_bar()

Research question 3

RQ3Data<-brfss2013%>%
  filter(!is.na(X_bmi5cat),!is.na(lmtjoin3),!is.na(joinpain))%>%
  select(X_bmi5cat,lmtjoin3,joinpain)

count(RQ3Data,joinpain)

##    joinpain     n
## 1         0 10595
## 2         1  8660
## 3         2 14251
## 4         3 17676
## 5         4 15316
## 6         5 22485
## 7         6 12072
## 8         7 12962
## 9         8 14650
## 10        9  3671
## 11       10  8194

count(RQ3Data,lmtjoin3)

##   lmtjoin3     n
## 1      Yes 69844
## 2       No 70688

count(RQ3Data,X_bmi5cat)

##       X_bmi5cat     n
## 1   Underweight  2043
## 2 Normal weight 36739
## 3    Overweight 48988
## 4         Obese 52762

ggplot(data=RQ3Data,aes(x=joinpain,fill=X_bmi5cat))+
  geom_bar()

ggplot(data=RQ3Data,aes(x=lmtjoin3,fill=X_bmi5cat))+
  geom_bar()

DATA ANALYSIS: Judging from the bar charts, there is no obvious difference between the BMI groups for the third research question; no formal significance test was run. It might instead be that obesity (BMI) mediates this relationship.
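A numeric check could complement the bar charts. The sketch below shows two options that were not run in this report: the mean joint-pain score per BMI category, and a chi-squared test of independence between BMI category and activity limitation. It uses a made-up data frame (`toy_rq3`, a hypothetical stand-in for RQ3Data), so its output does not describe the BRFSS data.

```r
# Toy stand-in for RQ3Data; all values are made up.
toy_rq3 <- data.frame(
  X_bmi5cat = rep(c("Normal weight", "Obese"), each = 50),
  lmtjoin3  = c(rep(c("Yes", "No"), times = c(20, 30)),
                rep(c("Yes", "No"), times = c(30, 20))),
  joinpain  = c(rep(c(3, 5), times = c(25, 25)),
                rep(c(5, 7), times = c(25, 25)))
)

# Mean joint-pain score (0 to 10 scale) within each BMI category.
mean_pain <- tapply(toy_rq3$joinpain, toy_rq3$X_bmi5cat, mean)

# Chi-squared test of independence between BMI category and activity limitation.
limit_test <- chisq.test(table(toy_rq3$X_bmi5cat, toy_rq3$lmtjoin3))

mean_pain
limit_test$p.value
```

A small p-value would suggest that the share reporting limitation differs across BMI categories. Like the other summaries in this report, the test ignores the survey weights.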