Exploring the BRFSS data

Setup

Load packages

library(ggplot2)
library(dplyr)

Load data

BRFSS collects state data about U.S. residents regarding their health-related risk behaviors and events, chronic health conditions, and use of preventive services. BRFSS also collects data on important emerging health issues such as vaccine shortage and influenza-like illness.

Dataset notes:

Categorical values are factors
Many variables, such as age, race, and education, as well as variables that measure counts of events (drinks, times eating fruit, etc.), have alternate versions in the Calculated Variables section of the dataset. We choose the calculated variables for analysis.

load("brfss2013.RData")
data = brfss2013
names = names(data)
# names[!grepl("(^X_)|(_$)", names)]

Part 1: Data

Describe how the observations in the sample are collected, and the implications of this data collection method on generalizability.
With technical and methodological assistance from CDC, state health departments use in-house interviewers or contract with telephone call centers or universities to administer the BRFSS surveys continuously through the year. The states use a standardized core questionnaire, optional modules, and state-added questions. The survey is conducted using Random Digit Dialing (RDD) techniques on both landlines and cell phones. Because respondents are selected at random, the results can reasonably be generalized to U.S. residents who can be reached by telephone. People without a phone and people who decline to answer are not represented, so some non-response bias is possible.
Describe how the observations in the sample are collected, and the implications of this data collection method on causality.
Random sampling was used in conducting the telephone survey, which reduces selection bias in the sample. Random Digit Dialing selects respondents at random, but it does not assign them to treatments or block on variables suspected to affect the outcome, so the data are observational. Causal inference generally requires both randomization of treatment and control of confounders. Without them, this data collection method lets us describe associations between variables but not establish causality.

Part 2: Research questions

Research question 1: Are people who exercise healthier than those who don’t?

We select the variable genhlth as the indicator of health, and the variable exerany2, which records whether someone exercised in the past 30 days.

# We select the variable genhlth as indicator of health
dat1 = select(data, genhlth, exerany2) %>%
  filter(!is.na(exerany2))
ggplot(dat1, aes(x = exerany2, fill = genhlth)) + geom_bar(position = "fill") + ylab("Percentages of general health") + xlab("Exercise or not")

The plot shows that people who exercised in the past 30 days tend to report better general health than those who did not.

Many other variables are closely related to health, so this topic could be explored further. So far we have only given a general description of the population.

Research question 2: What is the health condition of the whole population?

Several variables describe the health condition of the whole population.

# Select the variables whose names contain "hlth", then keep the five related to health condition.
hlthdata = select(data, grep("hlth", names)) %>%
  select(genhlth, physhlth, menthlth, poorhlth, qlhlth2)
# We take genhlth, physhlth, menthlth, poorhlth, and qlhlth2 as measures of health condition.
summary(hlthdata)

##       genhlth          physhlth         menthlth           poorhlth     
##  Excellent: 85482   Min.   : 0.000   Min.   :   0.000   Min.   :   0.0  
##  Very good:159076   1st Qu.: 0.000   1st Qu.:   0.000   1st Qu.:   0.0  
##  Good     :150555   Median : 0.000   Median :   0.000   Median :   0.0  
##  Fair     : 66726   Mean   : 4.353   Mean   :   3.383   Mean   :   5.3  
##  Poor     : 27951   3rd Qu.: 3.000   3rd Qu.:   2.000   3rd Qu.:   5.0  
##  NA's     :  1985   Max.   :60.000   Max.   :5000.000   Max.   :7000.0  
##                     NA's   :10957    NA's   :8627       NA's   :243153  
##     qlhlth2      
##  Min.   :  0.0   
##  1st Qu.:  2.0   
##  Median : 15.0   
##  Mean   : 15.9   
##  3rd Qu.: 28.0   
##  Max.   :243.0   
##  NA's   :491310

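As the summary shows, most respondents rate their General Health as Good or better: about 81% of those who answered chose Excellent, Very good, or Good. The bar chart shows the distribution.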

ggplot(hlthdata, aes(x = genhlth)) + geom_bar(position = "stack", fill = "green") + ylab("number of person") + xlab("general health")

Because of outliers and missing values (the maxima of 60, 5000, and 7000 days are impossible for a 30-day window), we use the third quartile to describe these variables. The third quartile of Number Of Days Physical Health Not Good is 3 days. The third quartile of Number Of Days Mental Health Not Good is 2 days. The third quartile of days of Poor Physical Or Mental Health is 5 days. The mean number of Days Full Of Energy In Past 30 Days is 15.9.
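The reason a quartile is preferred over the mean here can be illustrated with a small sketch. The vector below is made-up toy data, not BRFSS values; it only mimics a "days not good" variable with one impossible entry and one missing value.

```r
# Hypothetical toy data (NOT from BRFSS): mostly zeros, one impossible value, one NA.
days <- c(rep(0, 12), 1, 2, 2, 3, 5, 10, 30, 5000, NA)

# Keep only values that are possible for a 30-day window.
valid <- days[!is.na(days) & days <= 30]

# The mean is pulled far upward by the single outlier...
mean_all   <- mean(days, na.rm = TRUE)
mean_valid <- mean(valid)

# ...while the third quartile barely moves.
q3_all   <- unname(quantile(days, 0.75, na.rm = TRUE))
q3_valid <- unname(quantile(valid, 0.75))

print(c(mean_all = mean_all, mean_valid = mean_valid,
        q3_all = q3_all, q3_valid = q3_valid))
```

In this toy case the mean changes by two orders of magnitude once the outlier is dropped, whereas the third quartile stays close to 2.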

Research question 3: What is the situation of health care?

Health care matters to all of us, so we explore health care coverage and related items.

# select some variables related to health care
healthcaredata = select(data, medicare, hlthcvrg, drvisits, carercvd)
summary(healthcaredata)

##  medicare         hlthcvrg         drvisits     
##  Yes :134598          :175883   Min.   : 0.00   
##  No  :178386   01     : 69541   1st Qu.: 1.00   
##  NA's:178791   03     : 31973   Median : 3.00   
##                1      : 31230   Mean   : 5.24   
##                0      : 28736   3rd Qu.: 6.00   
##                02     : 27340   Max.   :76.00   
##                (Other):127072   NA's   :154152  
##                  carercvd     
##  Very satisfied      :227880  
##  Somewhat satisfied  : 98644  
##  Not at all satisfied: 12035  
##  NA's                :153216  
##                               
##                               
##

From the table we could see that:

About \(\frac{2}{3}\) of the respondents who answered are very satisfied with the care received (227880 of 338559).
134598 respondents have Medicare and 178386 do not.
Health insurance coverage comes from several kinds of sources (the hlthcvrg codes).
The median number of Doctor Visits Past 12 Months is 3.
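The shares behind these statements can be recomputed from the counts printed in the summary. The sketch below copies those counts by hand (it does not read the dataset) and excludes the NA rows, so the percentages refer to respondents who answered.

```r
# Counts copied from the summary(healthcaredata) output; NA rows are left out.
carercvd <- c("Very satisfied"       = 227880,
              "Somewhat satisfied"   = 98644,
              "Not at all satisfied" = 12035)
medicare <- c(Yes = 134598, No = 178386)

# prop.table() divides each count by the total of its vector.
care_share     <- prop.table(carercvd)
medicare_share <- prop.table(medicare)

print(round(100 * care_share, 1))
print(round(100 * medicare_share, 1))
```

Note that the two-thirds figure refers to the "Very satisfied" answer alone; counting "Somewhat satisfied" as well gives a much larger share.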

Part 3: Exploratory data analysis

The plots below explore the relationships between certain variables.

Research question 1: The relationship between exercise and health

We select some related variables: General Health, Number Of Days Physical Health Not Good, Number Of Days Mental Health Not Good, and Exercise In Past 30 Days.

# Remove outliers and NAs. By definition physhlth and menthlth cannot exceed 30 days, so larger values are outliers.
dat1 = select(data, genhlth, exerany2, physhlth, menthlth) %>%
  filter(physhlth <= 30 & menthlth <= 30 & !is.na(exerany2) & !is.na(genhlth))  

group_by(dat1, exerany2) %>%
  summarise(phys_mean = mean(physhlth), ment_mean = mean(menthlth))

## Source: local data frame [2 x 3]
## 
##   exerany2 phys_mean ment_mean
##     (fctr)     (dbl)     (dbl)
## 1      Yes  3.183990  2.825518
## 2       No  7.295681  4.768849

From the table we can see that people who exercise report fewer days of poor health on average than those who don’t: 3.18 vs. 7.30 days for physical health and 2.83 vs. 4.77 days for mental health.

ggplot(dat1, aes(x= exerany2, y = physhlth, fill = exerany2)) + geom_boxplot() + ylab(" Number Of Days Physical Health Not Good") + xlab("Exercise In Past 30 Days or not")

The mean of physhlth is 3.18 days for people who exercised and 7.30 days for those who did not. The plot shows that people who exercised in the past 30 days have a noticeably smaller Number Of Days Physical Health Not Good.

ggplot(dat1, aes(x= exerany2, y = menthlth, fill = exerany2)) + geom_boxplot() + ylab(" Number Of Days Mental Health Not Good") + xlab("Exercise In Past 30 Days or not")

The plot shows that people who exercised in the past 30 days also have a noticeably smaller Number Of Days Mental Health Not Good.

Research question 2: Tobacco use in the population

There are several variables related to tobacco use.

smoke100: Smoked At Least 100 Cigarettes
smokday2: Frequency Of Days Now Smoking
stopsmk2: Stopped Smoking In Past 12 Months
lastsmk2: Interval Since Last Smoked
usenow3: Use Of Smokeless Tobacco Products
_smoker3: Computed Smoking Status
_rfsmok3: Current Smoking Calculated Variable

We use only the two computed variables, which appear in the R data as X_smoker3 and X_rfsmok3.

tobacco_data = select(data, X_smoker3, X_rfsmok3) %>% 
  filter(!is.na(X_smoker3) & !is.na(X_rfsmok3))
summary(tobacco_data)

##                                  X_smoker3      X_rfsmok3   
##  Current smoker - now smokes every day: 55161   No :399785  
##  Current smoker - now smokes some days: 21493   Yes: 76654  
##  Former smoker                        :138134               
##  Never smoked                         :261651

As we can see, 76654 of the 476439 respondents with a known smoking status are current smokers, while 399785 are not. Non-smokers outnumber smokers by about 5.2 to 1. Let’s see some plots of tobacco use.
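As a quick check of the arithmetic, the share of current smokers and the ratio of non-smokers to smokers can be recomputed from the counts in the summary output. The numbers below are copied by hand from that output, which already excludes rows with missing smoking status.

```r
# Counts copied from the summary(tobacco_data) output.
smokers     <- 76654   # X_rfsmok3 == "Yes"
non_smokers <- 399785  # X_rfsmok3 == "No"

total        <- smokers + non_smokers  # respondents with a known smoking status
smoker_share <- smokers / total        # proportion of current smokers
ratio        <- non_smokers / smokers  # non-smokers per smoker

# Shares of the four X_smoker3 categories, from the same output.
smoker3 <- c("Smokes every day" = 55161, "Smokes some days" = 21493,
             "Former smoker" = 138134, "Never smoked" = 261651)
smoker3_share <- prop.table(smoker3)

print(round(c(total = total, smoker_share = smoker_share, ratio = ratio), 3))
print(round(100 * smoker3_share, 1))
```

The denominator here is the filtered sample, not the full number of rows in the dataset, because respondents with missing smoking status were removed before summarising.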

ggplot(tobacco_data, aes(x = factor(1), fill = X_smoker3))+ geom_bar() + coord_polar(theta = "y") + xlab("") + ylab("")

From the plot, we can see that most people are not current smokers: more than half never smoked, and many are former smokers.

Research question 3: Tobacco use and alcohol consumption

Tobacco use and alcohol consumption are related behaviors, so we study the relationship between them. We select the variables X_smoker3, X_rfsmok3, and X_rfdrhv4 for this.

tobacco_alcohol_data = select(data, X_smoker3, X_rfsmok3, X_rfdrhv4) %>% 
  filter(!is.na(X_smoker3) & !is.na(X_rfdrhv4) & !is.na(X_rfsmok3))
table(select(tobacco_alcohol_data, X_rfsmok3, X_rfdrhv4))

##          X_rfdrhv4
## X_rfsmok3     No    Yes
##       No  374377  17269
##       Yes  66000   8144

Among heavy drinkers, 8144 of 17269+8144 (about 32%) are smokers, which is far more than the proportion \(\frac{66000+8144}{374377+17269+66000+8144}\) (about 16%) of smokers in the population. Let’s see some plots.
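The comparison above is a comparison of conditional proportions, which prop.table() gives directly. This sketch rebuilds the 2 x 2 table by hand from the printed counts rather than from the dataset.

```r
# Rebuild the contingency table from the printed counts (filled column by column).
tab <- matrix(c(374377, 66000, 17269, 8144), nrow = 2,
              dimnames = list(X_rfsmok3 = c("No", "Yes"),
                              X_rfdrhv4 = c("No", "Yes")))

# margin = 2 normalises within each column, i.e. within each drinking group.
by_drinking <- prop.table(tab, margin = 2)

smokers_among_heavy     <- by_drinking["Yes", "Yes"]
smokers_among_non_heavy <- by_drinking["Yes", "No"]
smokers_overall         <- sum(tab["Yes", ]) / sum(tab)

print(round(c(heavy = smokers_among_heavy,
              non_heavy = smokers_among_non_heavy,
              overall = smokers_overall), 3))
```

Comparing the heavy and non-heavy columns is slightly cleaner than comparing against the overall proportion, because the overall figure already includes the heavy drinkers.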

ggplot(tobacco_alcohol_data, aes(x = X_rfdrhv4, fill = X_rfsmok3)) + geom_bar(position = "fill") + xlab("Heavy Alcohol Consumption") + ylab("Current smokers or not")

This plot supports the conclusion above: the proportion of smokers among heavy alcohol consumers is much higher than among those who are not heavy alcohol consumers.

ggplot(tobacco_alcohol_data, aes(x = X_rfdrhv4, fill = X_smoker3)) + geom_bar(position = "fill") + xlab("Heavy Alcohol Consumption") + ylab("proportion")

This plot also shows the relationship between tobacco use and alcohol consumption. Let’s run a test to see whether the two variables are independent.

chisq.test(x = data$X_rfdrhv4, y = data$X_rfsmok3)

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  data$X_rfdrhv4 and data$X_rfsmok3
## X-squared = 5223, df = 1, p-value < 2.2e-16

The very small p-value shows that tobacco use and alcohol consumption are not independent. Because the data are observational, this indicates an association, not a causal effect.
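The same test can be run on the 2 x 2 table of counts instead of the raw columns, which makes it reproducible without loading the dataset. The sketch below assumes the table printed earlier contains the same complete pairs that chisq.test() used; under that assumption the statistic should come out close to the reported X-squared of 5223.

```r
# Contingency table rebuilt by hand from the counts printed earlier.
tab <- matrix(c(374377, 66000, 17269, 8144), nrow = 2,
              dimnames = list(X_rfsmok3 = c("No", "Yes"),
                              X_rfdrhv4 = c("No", "Yes")))

# For a 2 x 2 table, chisq.test() applies Yates' continuity correction by default.
res <- chisq.test(tab)
print(res)

# Expected counts under independence, for comparison with the observed table.
print(round(res$expected))
```

With a sample this large, even a weak association yields a tiny p-value, so the expected counts are worth inspecting alongside the test: the observed number of heavy-drinking smokers (8144) is about twice the count expected under independence.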