Exploring the BRFSS data

Introduction

The Behavioral Risk Factor Surveillance System (BRFSS) is a collaborative project between all of the states in the United States (US) and participating US territories and the Centers for Disease Control and Prevention (CDC). The BRFSS is administered and supported by CDC’s Population Health Surveillance Branch, under the Division of Population Health at the National Center for Chronic Disease Prevention and Health Promotion. BRFSS is an ongoing surveillance system designed to measure behavioral risk factors for the non-institutionalized adult population (18 years of age and older) residing in the US. The BRFSS was initiated in 1984, with 15 states collecting surveillance data on risk behaviors through monthly telephone interviews. Over time, the number of states participating in the survey increased; by 2001, 50 states, the District of Columbia, Puerto Rico, Guam, and the US Virgin Islands were participating in the BRFSS. Today, all 50 states, the District of Columbia, Puerto Rico, and Guam collect data annually and American Samoa, Federated States of Micronesia, and Palau collect survey data over a limited point-in-time (usually one to three months). In this document, the term “state” is used to refer to all areas participating in BRFSS, including the District of Columbia, Guam, and the Commonwealth of Puerto Rico.

The BRFSS objective is to collect uniform, state-specific data on preventive health practices and risk behaviors that are linked to chronic diseases, injuries, and preventable infectious diseases that affect the adult population. Factors assessed by the BRFSS in 2013 include tobacco use, HIV/AIDS knowledge and prevention, exercise, immunization, health status, healthy days (health-related quality of life), health care access, inadequate sleep, hypertension awareness, cholesterol awareness, chronic health conditions, alcohol consumption, fruits and vegetables consumption, arthritis burden, and seatbelt use. Since 2011, BRFSS has conducted both landline telephone- and cellular telephone-based surveys. In conducting the BRFSS landline telephone survey, interviewers collect data from a randomly selected adult in a household. In conducting the cellular telephone version of the BRFSS questionnaire, interviewers collect data from an adult who participates by using a cellular telephone and resides in a private residence or college housing.

Health characteristics estimated from the BRFSS pertain to the non-institutionalized adult population, aged 18 years or older, who reside in the US. In 2013, additional question sets were included as optional modules to provide a measure for several childhood health and wellness indicators, including asthma prevalence for people aged 17 years or younger.

Setup

Load packages

library(ggplot2)
library(dplyr)
library(plotly)
library(tidyr)

Load data

load("brfss2013.RData")

Part 1: Data

Data collection procedures for the brfss2013 dataset are documented in the BRFSS Data User’s Guide, available at https://www.cdc.gov/brfss/data_documentation/index.htm. The guide explains three components of the process in detail: survey protocol, sample design, and weighting. Because respondents are selected at random, most, if not all, statistics calculated from the data can be generalized to the non-institutionalized adult population of the participating states. Because the survey is observational, with no random assignment, the analysis below can show associations but cannot establish causation.

Part 2: Research questions

Research question 1:

How does exercise affect a person’s health? The goal here is to determine whether exercise positively or negatively impacts a person’s health.

Research question 2:

Do respondents who classify themselves as ‘healthy’ smoke cigarettes or drink alcohol? We want to understand the positive or negative effects of smoking and alcohol consumption on a person’s health.

Research question 3:

What are some of the best habits or activities in which a person can engage to improve their overall health? It’s a given that most people want to improve their health but they, more often than not, are either confused or uncertain as to what to do to achieve this goal.

Part 3: Exploratory data analysis

Research question 1: “How does exercise affect a person’s health? The goal here is to determine whether exercise positively or negatively impacts a person’s health.”

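Let’s plot the distribution of the general health variable, genhlth. Because genhlth is categorical, the histogram is effectively a bar chart of counts for the five ordered categories. On visual inspection, the counts are roughly bell-shaped across the categories rather than strictly ‘normal’: they peak at ‘Very good’ and fall away toward ‘Excellent’ and ‘Poor’.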

fig <- plot_ly(x=brfss2013$genhlth,type="histogram",nbinsx=50) %>%
  layout(autosize = FALSE, width = 800, height = 400,
         title="Histogram of General Health 'genhlth' Feature",
         xaxis=list(title="Health Condition"),yaxis=list(title="Number of Instances"),
         legend = list(x = 5, y = 1))
fig
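
Since genhlth is a factor, the same picture can be checked numerically by tabulating it. The following is only a minimal sketch in base R: the vector `health` is made-up illustration data, not values taken from brfss2013, and the level order is assumed to match the one shown in the outputs in this report.

```r
# Hypothetical responses; the level order is assumed to match genhlth.
levels_genhlth <- c("Excellent", "Very good", "Good", "Fair", "Poor")
health <- factor(
  c("Good", "Very good", "Fair", "Very good", "Excellent", "Good", "Very good", NA),
  levels = levels_genhlth
)

# Counts per category; table() drops NA values by default.
health_counts <- table(health)
health_counts

# Share of non-missing responses in each category.
health_share <- prop.table(health_counts)
round(health_share, 2)
```

On the real data, the same two calls applied to brfss2013$genhlth would give the heights of the bars in the plot above.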

The key dataset variables related to exercise we will investigate are the following:

exerany2: Exercise In Past 30 Days
exract11: Type Of Physical Activity
exract21: Other Type Of Physical Activity Giving Most Exercise During Past Month

Let’s subset the dataset using genhlth and these exercise features.

exercise <- brfss2013 %>% select(genhlth,exerany2,exract11,exract21) 
head(exercise)

##     genhlth exerany2                   exract11
## 1      Fair       No                       <NA>
## 2      Good      Yes                    Walking
## 3      Good       No                       <NA>
## 4 Very good      Yes                    Walking
## 5      Good       No                       <NA>
## 6 Very good      Yes Bicycling machine exercise
##                                                       exract21
## 1                                                         <NA>
## 2 Household Activities (vacuuming, dusting, home repair, etc.)
## 3                                                         <NA>
## 4                                            No other activity
## 5                                                         <NA>
## 6               Gardening (spading, weeding, digging, filling)

Let’s group the exercise data frame by genhlth and exerany2 and count how many respondents in each health category engaged in any exercise in the last 30 days. We will also remove rows where genhlth is NA.

exercise_last_30_days <- exercise %>% group_by(genhlth,exerany2) %>% 
  tally()  %>% filter(!is.na(genhlth))
head(exercise_last_30_days)

## # A tibble: 6 x 3
## # Groups:   genhlth [2]
##   genhlth   exerany2      n
##   <fct>     <fct>     <int>
## 1 Excellent Yes       67914
## 2 Excellent No        11453
## 3 Excellent <NA>       6115
## 4 Very good Yes      120595
## 5 Very good No        28860
## 6 Very good <NA>       9621

We create a grouped column chart of health conditions and counts of whether individuals exercised in the last 30 days or not.

fig <- plot_ly(data=exercise_last_30_days, x = ~genhlth, y = ~n, color = ~exerany2, type = 'bar') %>%
   layout(autosize = FALSE, width = 800, height = 400,
         title="Exercised in Last Thirty Days? (Y/N) vs. General Health",
         xaxis=list(title="Health Condition"),yaxis=list(title="Count"))
fig

As expected, among respondents in ‘Fair’ or better health, those who exercised in the last 30 days outnumber those who did not.
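
Raw counts depend on how many respondents fall into each health category, so shares within a category are easier to compare. A rough sketch of that calculation, using only the Yes/No counts printed above for the first two categories; the names `exercise_counts` and `exercise_share` are introduced here for illustration, and the NA rows are left out as a simplifying assumption:

```r
library(dplyr)

# Counts copied from the head() output above (first two health categories only).
exercise_counts <- data.frame(
  genhlth  = c("Excellent", "Excellent", "Very good", "Very good"),
  exerany2 = c("Yes", "No", "Yes", "No"),
  n        = c(67914, 11453, 120595, 28860)
)

# Convert counts to shares within each health category.
exercise_share <- exercise_counts %>%
  group_by(genhlth) %>%
  mutate(share = n / sum(n)) %>%
  ungroup()

exercise_share
```

Applying the same group_by() and mutate() steps to exercise_last_30_days would cover all five categories.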

Finally, let’s get an idea of the most popular types of activity that individuals in this dataset do.

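# Combine the two activity columns and count each combination per health category
exercise_activities <- exercise %>%
  unite("Activities", exract11, exract21) %>%
  group_by(genhlth, Activities) %>% tally() %>%
  filter(!is.na(genhlth)) %>% arrange(genhlth, desc(n))

# Drop respondents with no recorded activity ("NA_NA") first, then keep the top three
exercise_top_3_activities <- exercise_activities %>%
  filter(Activities != "NA_NA") %>% group_by(genhlth) %>% slice(1:3)
exercise_top_3_activities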

## # A tibble: 15 x 3
## # Groups:   genhlth [5]
##    genhlth   Activities                                                        n
##    <fct>     <chr>                                                         <int>
##  1 Excellent Walking_No other activity                                     10468
##  2 Excellent Walking_Gardening (spading, weeding, digging, filling)         2265
##  3 Excellent Running_Weight lifting                                         1972
##  4 Very good Walking_No other activity                                     24183
##  5 Very good Walking_Gardening (spading, weeding, digging, filling)         4955
##  6 Very good Walking_Household Activities (vacuuming, dusting, home repai…  3632
##  7 Good      Walking_No other activity                                     26068
##  8 Good      Walking_Gardening (spading, weeding, digging, filling)         3592
##  9 Good      Walking_Household Activities (vacuuming, dusting, home repai…  3525
## 10 Fair      Walking_No other activity                                     12012
## 11 Fair      Walking_Household Activities (vacuuming, dusting, home repai…  1629
## 12 Fair      Walking_Other                                                  1334
## 13 Poor      Walking_No other activity                                      4448
## 14 Poor      Walking_Household Activities (vacuuming, dusting, home repai…   571
## 15 Poor      Walking_Other                                                   490
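
The top-three selection relies on two details: unite() pastes missing values as the text "NA", so a respondent with no recorded activity becomes "NA_NA", and the ranking must be done within each health category. The sketch below illustrates both on made-up data. The data frame `toy` is hypothetical, and slice_max() is used in place of the slice() call in this report; it assumes dplyr 1.0.0 or later.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data standing in for the exercise data frame.
toy <- data.frame(
  genhlth  = c("Good", "Good", "Good", "Good", "Fair", "Fair"),
  exract11 = c("Walking", "Walking", "Running", NA, "Walking", NA),
  exract21 = c("No other activity", "No other activity", "Weight lifting", NA,
               "Other", NA),
  stringsAsFactors = FALSE
)

top_activities <- toy %>%
  # Drop respondents with no recorded activity before pasting the columns,
  # since unite() would otherwise turn two missing values into "NA_NA".
  filter(!is.na(exract11) | !is.na(exract21)) %>%
  unite("Activities", exract11, exract21) %>%
  count(genhlth, Activities) %>%
  group_by(genhlth) %>%
  # Keep at most the three most frequent combinations per health category.
  slice_max(order_by = n, n = 3, with_ties = FALSE) %>%
  ungroup()

top_activities
```

Filtering before ranking avoids depending on where the "NA_NA" row happens to fall in the ordering.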

We create a grouped column chart of the three most common activity combinations in each health category.

fig <- plot_ly(data=exercise_top_3_activities, x = ~genhlth, y = ~n, color = ~Activities, type = 'bar') %>%
   layout(autosize = FALSE, width = 800, height = 400,
         title="Top Three Exercise Activities vs. General Health",
         xaxis=list(title="Health Condition"),yaxis=list(title="Count"),
         legend=list(orientation = "v", x = 1.75, y = 0.8))
fig

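We can see from the plot above that walking with no other activity is clearly the most common form of exercise in every health category. The second most common combination is walking plus gardening in the ‘Excellent’, ‘Very good’ and ‘Good’ categories, and walking plus household activities in the ‘Fair’ and ‘Poor’ categories.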

Research question 2: “Do respondents who classify themselves as ‘healthy’ smoke cigarettes or drink alcohol? We want to understand the positive or negative effects of smoking and alcohol consumption on a person’s health.”

The features in the dataset related to smoking and alcohol consumption are found in the following columns:

Main Survey - Section 9 - Tobacco Use

smoke100: Smoked At Least 100 Cigarettes

Main Survey - Section 10 - Alcohol Consumption

alcday5: Days In Past 30 Had Alcoholic Beverage
avedrnk2: Avg Alcoholic Drinks Per Day In Past 30

Let’s subset the brfss2013 dataset using genhlth and these features.

smoke_drink <- brfss2013 %>% select(genhlth,smoke100,alcday5,avedrnk2)

We will use the variable smoke100, which records whether an individual has smoked at least 100 cigarettes, as an indicator of whether the individual has ever smoked.

smoking_no_smoking <- smoke_drink %>% group_by(genhlth,smoke100) %>% tally() %>% 
  filter(!is.na(genhlth),!is.na(smoke100))
head(smoking_no_smoking)

## # A tibble: 6 x 3
## # Groups:   genhlth [3]
##   genhlth   smoke100     n
##   <fct>     <fct>    <int>
## 1 Excellent Yes      28342
## 2 Excellent No       54392
## 3 Very good Yes      63642
## 4 Very good No       91157
## 5 Good      Yes      69765
## 6 Good      No       75995

We create a grouped column chart of health conditions and counts of whether individuals have ever smoked or not.

fig <- plot_ly(data=smoking_no_smoking, x = ~genhlth, y = ~n, color = ~smoke100, type = 'bar') %>%
   layout(autosize = FALSE, width = 800, height = 400,
         title="Has Individual Ever Smoked? (Y/N) vs. General Health",
         xaxis=list(title="Health Condition"),yaxis=list(title="Count"))
fig

The bar plot above shows that the number of respondents who have smoked rises steadily from the ‘poor’ to the ‘good’ health category and then drops in the ‘very good’ and ‘excellent’ categories. It also appears that smokers outnumber non-smokers in the ‘fair’ and ‘poor’ categories, whereas non-smokers are the majority in the other three. This is consistent with the extensive scientific literature associating smoking with poorer health.
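
To compare categories of different sizes, the Yes and No counts can be placed side by side and turned into a share. This is a sketch only, built from the six counts printed in the head() output above; `smoke_counts`, `smoke_wide` and `share_smoked` are names introduced here for illustration.

```r
library(dplyr)
library(tidyr)

# Counts copied from the head() output above (first three health categories).
smoke_counts <- data.frame(
  genhlth  = rep(c("Excellent", "Very good", "Good"), each = 2),
  smoke100 = rep(c("Yes", "No"), times = 3),
  n        = c(28342, 54392, 63642, 91157, 69765, 75995)
)

# One row per health category, with the Yes and No counts as columns.
smoke_wide <- smoke_counts %>%
  pivot_wider(names_from = smoke100, values_from = n) %>%
  mutate(share_smoked = Yes / (Yes + No))

smoke_wide
```

For these three categories the share of respondents who have smoked works out to roughly 34%, 41% and 48%, so it increases as reported health declines.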

To understand how alcohol consumption relates to an individual’s general health, we will create a new variable, current_drinker, based on the feature avedrnk2. We will assume that an individual consumes alcohol if avedrnk2 is not NA. Note that under this assumption respondents whose answer is missing for any other reason are also counted as non-drinkers.

drinking_no_drinking <- smoke_drink %>% mutate(current_drinker=ifelse(is.na(avedrnk2),"No","Yes")) %>% 
  group_by(genhlth,current_drinker) %>% tally() %>% filter(!is.na(genhlth))
head(drinking_no_drinking)

## # A tibble: 6 x 3
## # Groups:   genhlth [3]
##   genhlth   current_drinker     n
##   <fct>     <chr>           <int>
## 1 Excellent No              36413
## 2 Excellent Yes             49069
## 3 Very good No              70147
## 4 Very good Yes             88929
## 5 Good      No              84626
## 6 Good      Yes             65929
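
One possible refinement is to separate respondents who reported no drinking from those with no usable answer, using alcday5 instead of avedrnk2. The sketch below runs on made-up rows and rests on an assumption that is not verified here: that alcday5 is 0 for respondents who had no drinks in the past 30 days and NA when no answer was recorded. That coding should be checked against the BRFSS codebook before using it on brfss2013.

```r
library(dplyr)

# Hypothetical toy rows. The coding assumed here (alcday5 == 0 means no drinks
# in the past 30 days, NA means no usable answer) is an assumption to verify.
toy_drink <- data.frame(
  alcday5  = c(203, 0, NA),
  avedrnk2 = c(2, NA, NA)
)

drinker_status <- toy_drink %>%
  mutate(current_drinker = case_when(
    is.na(alcday5) ~ "Unknown",  # no usable answer recorded
    alcday5 == 0   ~ "No",       # reported no drinking days
    TRUE           ~ "Yes"       # reported at least one drinking day
  ))

drinker_status
```

With a three-level variable like this, the "Unknown" group can be filtered out or plotted separately instead of being folded into the non-drinkers.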

We create a grouped column chart of health conditions and counts of whether individuals are alcohol consumers or not.

fig <- plot_ly(data=drinking_no_drinking, x = ~genhlth, y = ~n, color = ~current_drinker, type = 'bar') %>%
   layout(autosize = FALSE, width = 800, height = 400,
         title="Currently Consume Alcohol? (Y/N) vs. General Health",
         xaxis=list(title="Health Condition"),yaxis=list(title="Count"))
fig

The bar plot above shows that the number of alcohol consumers rises steadily from the ‘poor’ to the ‘very good’ health category and then drops significantly in the ‘excellent’ category. In other words, the ‘excellent’, ‘fair’ and ‘poor’ categories contain fewer drinkers, although these raw counts also reflect how many respondents fall into each category.

Research question 3: “What are some of the best habits or activities in which a person can engage to improve their overall health? It’s a given that most people want to improve their health but they are, more often than not, either confused or uncertain as to what to do to achieve this goal.”

To answer this question, we will investigate how one of the provided calculated variables, X_pacat1 (Physical Activity Categories), correlates with an individual’s general health condition.

First, let’s subset the brfss2013 dataset using genhlth and X_pacat1.

physical_activity <- brfss2013 %>% select(genhlth,X_pacat1) 
head(physical_activity)

##     genhlth              X_pacat1
## 1      Fair              Inactive
## 2      Good Insufficiently active
## 3      Good              Inactive
## 4 Very good Insufficiently active
## 5      Good              Inactive
## 6 Very good Insufficiently active

activity_summary <- physical_activity %>% group_by(genhlth,X_pacat1) %>% tally() %>% 
  filter(!is.na(genhlth),!is.na(X_pacat1))
head(activity_summary)

## # A tibble: 6 x 3
## # Groups:   genhlth [2]
##   genhlth   X_pacat1                  n
##   <fct>     <fct>                 <int>
## 1 Excellent Highly active         31183
## 2 Excellent Active                14881
## 3 Excellent Insufficiently active 12757
## 4 Excellent Inactive              12605
## 5 Very good Highly active         49750
## 6 Very good Active                27230

Let’s generate a bar chart of physical activity category counts vs. general health.

fig <- plot_ly(data=activity_summary, x = ~genhlth, y = ~n, color = ~X_pacat1, type = 'bar') %>%
   layout(autosize = FALSE, width = 800, height = 400,
         title="Physical Activity vs. General Health",
         xaxis=list(title="Health Condition"),yaxis=list(title="Count"),
         legend=list(orientation = "v", x = 0.65, y = 0.9))
fig

The plot above suggests that physical activity is positively associated with a person’s health condition: respondents in better health tend to be more active. Since the data are observational, this does not show that activity by itself improves health.
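
The strength of this association could be summarized with a rank correlation between the two ordered factors. What follows is a sketch on made-up responses, not a result from brfss2013: the data frame `toy` is hypothetical, and the level orders are assumed to run from best to worst as they appear in the outputs above.

```r
# Hypothetical toy responses; both level orders run from best to worst.
health_levels   <- c("Excellent", "Very good", "Good", "Fair", "Poor")
activity_levels <- c("Highly active", "Active", "Insufficiently active", "Inactive")

toy <- data.frame(
  genhlth  = factor(c("Excellent", "Very good", "Good", "Fair", "Poor", "Good"),
                    levels = health_levels),
  X_pacat1 = factor(c("Highly active", "Active", "Insufficiently active",
                      "Inactive", "Inactive", "Active"),
                    levels = activity_levels)
)

# Spearman rank correlation on the integer level codes.
# A positive value means better health goes with higher activity,
# because both factors are coded from best (1) to worst.
rho <- cor(as.integer(toy$genhlth), as.integer(toy$X_pacat1),
           method = "spearman", use = "complete.obs")
rho
```

The same call on physical_activity would give one number for the whole sample, with use = "complete.obs" dropping rows where either variable is NA.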