Exploring the BRFSS data

Setup

Load packages

library(ggplot2)
library(dplyr)
library(tidyverse)

Part 1: Data

In this study, we will use data extracted from The Behavioral Risk Factor Surveillance System - BRFSS2013.

As stated on their website (https://www.cdc.gov/brfss/), the BRFSS is the premier “system of health-related telephone surveys that collect state data about U.S. residents regarding their health-related risk behaviors, chronic health conditions, and use of preventive services”.

The data are collected by landline telephone and cellular telephone within each of the states and territories. According to the 2013 Summary Data Quality Report, released on August 15, 2014, landline and cellular phone response rates achieved both a satisfactory level, when compared to other similar surveys. The same report states that BRFSS2013 sample was random and weighted by demographic characteristics of respondents to ensure accurate estimates for most measures.

Base on that information, we can say that this survey’s sample tends to be a good representation of the population and the results have a good degree of generabizability. On the other hand, we cannot talk about causality here. This is an observational study. There’s no intervention and, therefore, no random allocation between intervention and control groups. So, despite all the controls and the quality this study shows, there’s no way we can talk about causal relationships between the variables observerd.

As a preliminary approach, we are only going to focus on five variables in this study: 1) Reported sleep time, 2) Reported High levels of cholesterol, 3) Reported weights in pounds and, 4) Reported Height in pounds and inches and 5) sex

It’s also important to note that the studies that follow are only a preliminary exploratory data analysis. All statements should be treated with caution, and should only be read as an introductory analysis. Further conclusion, specially the inferential ones, should be postponed to when we are able to run more advanced analysis.

Load data

load("brfss2013.RData")

Part 2: Research questions

Research quesion 1: Is there any relationship between how much time someone sleeps and cholesterol levels?

Research quesion 2: Is there any relationship between sex and cholesterol levels?

Research quesion 3: What’s the relationship between cholesterol levels and obesity?

Part 3: Exploratory data analysis

Before addressing the research questions, we will create a new data frame containing only the variables we are interested at the moment. We can do this by using the select function.

final_project <- brfss2013 %>%
  select(sleptim1, toldhi2, sex, weight2, height3)
final_project %>%
  head()

##   sleptim1 toldhi2    sex weight2 height3
## 1       NA     Yes Female     250     507
## 2        6      No Female     127     510
## 3        9      No Female     160     504
## 4        8     Yes Female     128     504
## 5        6      No   Male     265     600
## 6        8     Yes Female     225     503

Research question 1:

To begin addressing the first question, we will run some summaries, to discover basic information about our data. We will start by summarizing data about sleeping time. First, from all the sample, and then grouping by sex. We use the filter function to exclude NA values from sleptim1. If we don’t do this, the numerical functions won’t be able to display the desirable outcomes.

final_project %>% 
  filter(!(is.na(sleptim1))) %>%
  summarise(slep_mean = mean(sleptim1), slep_median = median(sleptim1), slep_sd = sd(sleptim1), 
            slep_min = min(sleptim1), slep_max = max(sleptim1))

##   slep_mean slep_median slep_sd slep_min slep_max
## 1  7.052099           7 1.60411        0      450

final_project %>% 
  filter(!(is.na(sleptim1))) %>%
  group_by(sex) %>%
  summarise(slep_mean = mean(sleptim1), slep_median = median(sleptim1), slep_sd = sd(sleptim1), 
            slep_min = min(sleptim1), slep_max = max(sleptim1), count = n())

## `summarise()` ungrouping output (override with `.groups` argument)

## # A tibble: 3 x 7
##   sex    slep_mean slep_median slep_sd slep_min slep_max  count
##   <fct>      <dbl>       <dbl>   <dbl>    <int>    <int>  <int>
## 1 Male        7.03         7      1.45        1       24 198910
## 2 Female      7.06         7      1.48        1       24 285474
## 3 <NA>      140.          55.5  212.          0      450      4

The first data summarization shows that mean and median values are pretty close to each other. It also shows that we have some possible outilers recorded at max and min values. When we group the sample by sex, we can see those extreme values are associated to NA entries, maybe due to some filling mistake, and that data from men and women seem to be quite uniform on every parameter described by summarization.

Now we can use those values as baseline to investigate if the amount of sleep has any relationship with reported hi cholesterol. To do this, we can start by creating (mutate) a new variable indicating people who reported sleep routines bigger or smaller than our baseline. We will use the same pipe to exclude NA values from sex and baseline variables.

final_project <- final_project %>% 
  mutate(baseline = ifelse(sleptim1 < 7, "below 7", "at least 7"))
final_project %>%
  filter(!(is.na(sex))) %>%
  filter(!(is.na(baseline))) %>%
  group_by(baseline)%>% 
  summarise( toldhi2 = length(toldhi2))

## `summarise()` ungrouping output (override with `.groups` argument)

## # A tibble: 2 x 2
##   baseline   toldhi2
##   <chr>        <int>
## 1 at least 7  325690
## 2 below 7     158694

According to data, people who slept 7 or more hours a day reported more hi cholesterol than the people who slept less than 7 hours a day.

We can do a similar analysis, but this time adding the variable sex to the dataframe.

final_project %>%
  filter(!(is.na(sex))) %>%
  filter(!(is.na(baseline))) %>%
  group_by(sex, baseline)%>% 
  summarise( toldhi2 = length(toldhi2))

## `summarise()` regrouping output by 'sex' (override with `.groups` argument)

## # A tibble: 4 x 3
## # Groups:   sex [2]
##   sex    baseline   toldhi2
##   <fct>  <chr>        <int>
## 1 Male   at least 7  133280
## 2 Male   below 7      65630
## 3 Female at least 7  192410
## 4 Female below 7      93064

As we can see, sex dos not alter the trend of reporting hi cholesterol if the respondent has a sleep routine bigger than 7h a day.

Research question 2: To address research question 2, we are going to create a contingency table that can show us how sex and reported hi cholesterol are distributed on this sample. To do it, we group the data by sex and reported cholesterol and then we summarize the results asking the function to count the times men and women reported or not a high cholesterol. We end the pipe using the spread() function, from tidyverse package, in order to organize Men and Women data in separated columns.

final_project %>% 
  filter(!(is.na(sex))) %>%
  group_by(toldhi2, sex  ) %>%
  summarise(n = n()) %>% 
  spread(sex, n)

## `summarise()` regrouping output by 'toldhi2' (override with `.groups` argument)

## # A tibble: 3 x 3
## # Groups:   toldhi2 [3]
##   toldhi2  Male Female
##   <fct>   <int>  <int>
## 1 Yes     74917 108584
## 2 No      92402 144210
## 3 <NA>    33994  37661

The first outcome we get is the great amount of NA entries. On question 1, we saw that NA entries could be excluded from the sample without many concerns. This time, however, NA occurrence can’t be ignored on further analysis. Another thing we can see on this tibble is that both men and women tended more to report a non-hi cholesterol level when asked about it.

We can see proportions for that numbers. That’s the outcome of the following pipe. According to it, the proportion of men and women that did not report hi cholesterol was 46% and 50% respectively. We can also see that the missing proportion is almost 17% for men and 13% for woman respondents.

Although we need further analysis, data seems to show no relationship between sex and hi cholesterol reporting, with both men and women showing similar outcomes. On the other hand, could this result be different if this data had less NA entries? We don’t have enough information to answer this.

final_project %>% 
  filter(!(is.na(sex))) %>%
  count(toldhi2, sex) %>% 
  group_by(sex) %>% 
  mutate(prop = prop.table(n)) %>%
  gather(measure, value, n, prop) %>%
  unite(measure, measure, toldhi2) %>% 
  spread(measure, value)

## # A tibble: 2 x 7
## # Groups:   sex [2]
##   sex     n_NA   n_No  n_Yes prop_NA prop_No prop_Yes
##   <fct>  <dbl>  <dbl>  <dbl>   <dbl>   <dbl>    <dbl>
## 1 Male   33994  92402  74917   0.169   0.459    0.372
## 2 Female 37661 144210 108584   0.130   0.496    0.374

Research quesion 3:

To address question 3, we will create a new variable, Body Mass Index (BMI). This is the most commonly used index to measure obesity levels. According to WHO (shorturl.at/fpCHQ), the BMI is defined as a person’s weight in kilograms divided by the square of his height in meters (kg/m2). Note that our data is in pounds and feet. As we are going to use a BMI based on metric system inputs, we will first convert them before plugging in the values on the BMI formula.

final_project <- final_project %>%
  mutate(numweight2 = as.numeric(as.character((weight2))))

## Warning: Problem with `mutate()` input `numweight2`.
## ℹ NAs introduced by coercion
## ℹ Input `numweight2` is `as.numeric(as.character((weight2)))`.

## Warning in mask$eval_all_mutate(dots[[i]]): NAs introduced by coercion

final_project <- final_project %>%
  mutate(numheight = as.numeric(as.character((height3))))

final_project <- final_project %>%
  mutate(weight_KG = numweight2 /2.20462262185)
final_project <- final_project %>%
  mutate(height_Mt = (height3/100)/3.2808)

final_project %>%
  select(sex, weight_KG, height_Mt)%>%
  head()

##      sex weight_KG height_Mt
## 1 Female 113.39809  1.545355
## 2 Female  57.60623  1.554499
## 3 Female  72.57478  1.536211
## 4 Female  58.05982  1.536211
## 5   Male 120.20198  1.828822
## 6 Female 102.05828  1.533163

Note that originally, weight and height are recorded as factors. We first asked R to read them as numeric, and only then we used this values to create new columns for kg and meter metrics. Now, we can go further and create some summaries on weight and height.

final_project %>%
  filter(!(is.na(weight_KG))) %>%
  summarise(wgt_mean = mean(weight_KG), wgt_median = median(weight_KG), wgt_sd = sd(weight_KG), 
            wgt_min = min(weight_KG), wgt_max = max(weight_KG))

##   wgt_mean wgt_median   wgt_sd  wgt_min  wgt_max
## 1 87.93668    77.1107 179.1129 1.360777 4229.749

final_project %>%
  filter(!(is.na(height_Mt))) %>%
  summarise(hgt_mean = mean(height_Mt), hgt_median = median(height_Mt), hgt_sd = sd(height_Mt), 
            hgt_min = min(height_Mt), hgt_max = max(height_Mt))

##   hgt_mean hgt_median   hgt_sd     hgt_min  hgt_max
## 1 1.679829   1.542307 1.642893 0.003048037 28.98378

At the first glance, we can see we have some strange data on this data frame. Min and max values are way out of the curve we would expect from a distribution of weights and heights for human beings. On the other hand, median and mean for both weight and height are not to apart from each other, and both fall under what we would expect to be a normal standard for human adults.

Making another step on our analysis, we can now go ahead and create our BMI index , using the formula BMI = kg/m^2.

final_project <- final_project %>% 
  mutate(bmi = weight_KG/(height_Mt^2))

final_project %>%
  select(sex, weight_KG, height_Mt, bmi)%>%
  head()

##      sex weight_KG height_Mt      bmi
## 1 Female 113.39809  1.545355 47.48422
## 2 Female  57.60623  1.554499 23.83903
## 3 Female  72.57478  1.536211 30.75276
## 4 Female  58.05982  1.536211 24.60221
## 5   Male 120.20198  1.828822 35.93922
## 6 Female 102.05828  1.533163 43.41820

According to World Health Organization site (shorturl.at/bfpzG), we can use BMI to classify people in 3 categories: healthy (BMI<25), overweight (BMI >= 25) and obesity (BMI >=30). We will create another column on our working dataframe in order to classify our bmi values into those 3 categories.

final_project <- final_project %>% 
  mutate(condition = ifelse(bmi < 25, "healthy", ifelse(bmi >= 30, "obsesity", "overweight")))

final_project %>%
  select(sex, toldhi2, weight_KG, height_Mt, bmi,condition) %>%
  head()

##      sex toldhi2 weight_KG height_Mt      bmi condition
## 1 Female     Yes 113.39809  1.545355 47.48422  obsesity
## 2 Female      No  57.60623  1.554499 23.83903   healthy
## 3 Female      No  72.57478  1.536211 30.75276  obsesity
## 4 Female     Yes  58.05982  1.536211 24.60221   healthy
## 5   Male      No 120.20198  1.828822 35.93922  obsesity
## 6 Female     Yes 102.05828  1.533163 43.41820  obsesity

Now we can cross obesity entries and reported hi cholesterol entries to finally adress our third question. We do it again using mutate and spread function.

final_project %>% 
  group_by(condition, toldhi2) %>%
  summarise(n = n()) %>% 
  spread(condition, n)

## `summarise()` regrouping output by 'condition' (override with `.groups` argument)

## # A tibble: 3 x 5
##   toldhi2 healthy obsesity overweight `<NA>`
##   <fct>     <int>    <int>      <int>  <int>
## 1 Yes       19129   111863      44571   7938
## 2 No        39600   119580      66020  11412
## 3 <NA>      15608    31900      19752   4402

Again, the great number of NA entries on both toldhi2 and condition variables does not let us comfortable to look for patterns on this table. Nevertheless, it looks like that people with good BMI scores tend to report less high cholesterol. Unless, of course, all the healthy that haven’t reported their cholesterol (NA entries) would report a high cholesterol, than we would have to consider the opposite situation (good BMI associating with high cholesterol reports)

When we consider obesity and overweight, the situation is even less conclusive. Actually, just looking at the table does not allow us to think of any kind of association between hi reported levels of cholesterol and obesity on the sample at hand. Again, the great numbers of NA entries weakens the analysis here, but if can trust these numbers and generalize them to the population, it could even mean that people with obese and overweight BMIs scores would tend to report low cholesterol levels more often than they would tend to report hi level ones. Or, in a more conservative conclusion, that obesity and overweight conditions does not affect cholesterol levels.

Now that we cleaned up the data in a way that addresses the needs of the research questions we want to explore, we are ready to continue with our analysis.