Author: Arta Gharib Parsa

Setup

Load packages

library(ggplot2)
library(dplyr)

Part 1: Data

The Behavioral Risk Factor Surveillance System (BRFSS) is a national system of telephone surveys with the objective of collecting data on preventative health practices and risk behaviors linked to chronic diseases in the adult population. Specifically, the survey is targeted at non-institutionalized adults aged 18 our older residing in the U.S. This data analysis will focus on the 2013 BRFSS, which introduced several new categories of survey questions.

The BRFSS compiles a huge amount of data on various health practices and chronic conditions. R is a powerful tool for separating this data into manageable chunks and creating meaningful visualizations that may help inform clinical practice.

Data Collection Methodology and Scope of Inference

(Source: https://www.cdc.gov/brfss/data_documentation/pdf/UserguideJune2013.pdf)

Since 2011, the BRFSS collects state-specific data via landline and cellphone surveys, generally monthly. For landline surveys, data is collected from one randomly selected adult per household via disproportionate stratified sampling (DSS). For cellular surveys, data is collected from a list of randomly generated cellphone numbers, from a sampling frame of confirmed cellular area code and prefix combinations. Each number has an equal probability of selection.

The BRFFSS has a two-step weighting protocol (design weighting followed by iterative proportional fitting), which ensures data are representative of the population. More information about this weighting process can be found in the source link above.

Generalizability
Random sampling at a state level is performed in both collection methods, and thus results of the BRFSS are generalizable to adults of the state who use landlines and cellphones.

Causality
THe BRFSS is a survey that examines trends in public health. It is not interventional and does not feature random assignment of participants to treatments or controls. It is an observational study, and thus no conclusions of causality can be drawn from the data - only correlations between our variables of interest.


Part 2: Research questions

Variables are defined in the BRFSS codebase.

Codebase source: https://bit.ly/2WghPMV.

Research question 1: Is there a correlation between income class and average sleep per night in the US population?

Variables:
income2: Income Level
sleptim1: How Much Time Do You Sleep A Night

Many studies, most famously the Whitehall Study, have shown a correlation between various indicators of stress and income level - generally, more stress at lower incomes. It would be informative to see if average sleep per night may be one such indicator in BRFSS data.

Research question 2: Is there a correlation between drinking and veteran status and how does this apply across genders?

Variables:
avedrnk2: Avg Alcoholic Drinks Per Day In Past 30
veteran3: Are You A Veteran
sex: Respondent’s Sex

Veterans are classically more likely to suffer from mental health disorders including substance abuse. Although this data is restricted by its collection methodology, it would be informative to see if it suggests a similar correlation.

Research question 3: Is the severity of joint pain correlated with the impact of joint pain on work performance, and does this depend on gender?

Variables
joinpain: How Bad Is Joint Pain
arthdis2: Does Joint Pain Affect Whether You Work
sex: Respondent’s Sex

For self-reported joint pain on a scale of 1-10, it may be informative to know if there is a threshold of pain at which work performance seems to be affected. This could be useful for clinicians to understand self-reported raw scores in the context of functional impacts.


Part 3: Exploratory data analysis

Research question 1: Is there a correlation between income class and average sleep per night in the US population?

load("~/Stats with R/Week 5 data analysis project/brfss.rdata")

# Variables: 
# income2: Income Level
# sleptim1: How Much Time Do You Sleep A Night
# Filter out N/A values for our variables of interest
incomesleep <- brfss2013 %>% filter(!is.na(sleptim1), !is.na(income2))

# Create a table with our new filterd variables of interest and find the average amount of sleep for each income group
mean_sleep_by_income <- incomesleep %>% group_by(income2) %>% summarise(mean_sleep = mean(sleptim1))

# Numerical summary of our data of interest
mean_sleep_by_income
## # A tibble: 8 x 2
##   income2           mean_sleep
##   <fct>                  <dbl>
## 1 Less than $10,000       6.86
## 2 Less than $15,000       6.95
## 3 Less than $20,000       7.03
## 4 Less than $25,000       7.04
## 5 Less than $35,000       7.07
## 6 Less than $50,000       7.07
## 7 Less than $75,000       7.05
## 8 $75,000 or more         7.05
mean_sleep_by_income %>% summarise(sd_sleep = sd(mean_sleep))
## # A tibble: 1 x 1
##   sd_sleep
##      <dbl>
## 1   0.0726

Our results show that there does not appear to be significant variation among income classes in regard to average sleep duration per night. We can represent this graphically.

# Cleaning up the x-axis: replacing spaces with breaks in the names of the income levels 
levels(mean_sleep_by_income$income2) <- gsub(" ", "\n", levels(mean_sleep_by_income$income2))

# Visualization
ggplot(data = mean_sleep_by_income, aes(x = income2, y = mean_sleep)) + 
  geom_point() + 
  labs(title = "Hours of Sleep by Income Class for US Adults", x="Income Class", y="Mean Hours of Sleep Per Night") +
  theme(plot.title = element_text(hjust = 0.5))

Our graph represents the data, but the range of the y-axis is too small, visually suggesting large differences for subjectively small differences in amount of sleep. We can change these aesthetics as follows:

# Adjusting title and y-axis range
ggplot(data = mean_sleep_by_income, aes(x = income2, y = mean_sleep)) + 
  geom_point() + 
  labs(title = "Hours of Sleep by Income Class for US Adults", x="Income Class", y="Mean Hours of Sleep Per Night") + 
  theme(plot.title = element_text(hjust = 0.5)) +
  ylim(6.75,7.25)

We can conclude that there is no strong correlation between income class and average sleep per night in the US adult population. This is too tenuous of a connection to make any commentary on stress level vs income class. But this may suggest that stress in lower income levels manifests in ways other than sleep.


Research question 2: Is there a correlation between drinking and veteran status and how does this apply across genders?

# Variables:
# avedrnk2: Avg Alcoholic Drinks Per Day In Past 30
# veteran3: Are You A Veteran
# sex: Respondents Sex
# Filter out N/A values for our data of interest and create a data table
vetsdrnks <- brfss2013 %>% filter(!is.na(veteran3), !is.na(sex), !is.na(avedrnk2)) %>% select(veteran3, sex, avedrnk2)

# Numerical summary of our data of interest
vetsdrnks %>% summary()
##  veteran3         sex            avedrnk2    
##  Yes: 31709   Male  :110062   Min.   : 1.00  
##  No :199305   Female:120952   1st Qu.: 1.00  
##                               Median : 2.00  
##                               Mean   : 2.21  
##                               3rd Qu.: 2.00  
##                               Max.   :76.00
vetsdrnks %>% summarise(sd_drinks = sd(avedrnk2))
##   sd_drinks
## 1  2.308651

This summary suggests that there are significant outliers (such as 76 drinks per day). This may be due to misunderstanding the question as total number of drinks in a 30 day period. The means and standard deviation suggest that most of our data points lie between 1 and 5 drinks, so we will narrow the scope of our histogram for better data visualization.

Of note, the minimum number of drinks is 1, meaning the data only applies to the drinking population of the U.S.

# Plot veteran status and drinks per day as a porportion of the population, for both males and females
ggplot (data = vetsdrnks, aes(avedrnk2, ..density.., fill=veteran3)) + 
  geom_histogram(binwidth = 1, position= position_dodge(width=0.75)) + 
  xlim(0.5,7.5) + 
  facet_grid(. ~ sex) + 
  labs(title = "Number of Drinks Per Day for Veterans and Non-Veterans", x= "Average Alcoholic Drinks Per Day in Past 30 days", y= "Proportion of Drinking Population", fill= "Veteran Status") + 
  theme(plot.title = element_text(hjust = 0.5))
## Warning: Removed 5726 rows containing non-finite values (stat_bin).

These histograms suggest that, in the U.S. drinking population, there is no strong correlation between Veteran Status and the number of drinks on average per day. Veteran males, proportionally, may have fewer drinks than their non-veteran counterparts, but more nuanced statistical inference is required to draw conclusions from this. Of females who drink, more than 50% have only 1 drink on average. In comparison, males are more likely to have more than one drink.

It is important to note that the BRFSS data does not include incarcerated individuals or those in homeless shelters without cell phones. As such, we cannot draw conclusions about the veteran population as a whole.


Research question 3: Is the severity of joint pain correlated with the impact of joint pain on work performance, and does this depend on gender?

# Variables
# joinpain: How Bad Is Joint Pain
# arthdis2: Does Joint Pain Affect Whether You Work
# sex: Respondent's Sex 
# Filter out N/A values for our data of interest and create a data table
arthritis2 <- brfss2013 %>% filter(!is.na(sex), !is.na(joinpain), !is.na(arthdis2)) %>% select(sex, joinpain, arthdis2)

# Separate the data by gender to interpret joint pain and effect on work in context
arthritisfemale <- select(filter(arthritis2, sex == "Female"), c(joinpain, arthdis2))
arthritismale <- select(filter(arthritis2, sex == "Male"), c(sex, joinpain, arthdis2))

# Numerical summary of our data of interest for both genders
arthritisfemale %>% summary()
##     joinpain      arthdis2   
##  Min.   : 0.000   Yes:31602  
##  1st Qu.: 3.000   No :63683  
##  Median : 5.000              
##  Mean   : 4.909              
##  3rd Qu.: 7.000              
##  Max.   :10.000
arthritisfemale %>% summarise(sd_female = sd(joinpain))
##   sd_female
## 1  2.759097
arthritismale %>% summary()
##      sex           joinpain     arthdis2   
##  Male  :48896   Min.   : 0.00   Yes:15408  
##  Female:    0   1st Qu.: 2.00   No :33488  
##                 Median : 4.00              
##                 Mean   : 4.25              
##                 3rd Qu.: 6.00              
##                 Max.   :10.00
arthritismale %>% summarise(sd_male = sd(joinpain))                              
##    sd_male
## 1 2.682954

Female reported joint pain has a mean of 4.9 with a standard deviation of 2.76, and male reported joint pain has a mean of 4.25 with a standard deviation of 2.68. Generally, the two have a similar spread. Joint pain affects work performance in similar proportions in both populations, with 49% of females and 46% of males reporting that joint pain affects work. We can visualize the data below.

ggplot(data = arthritisfemale, aes(x = joinpain, fill = arthdis2)) + 
  geom_histogram(binwidth = 1, position = position_dodge(width = 0.75)) + 
  labs(title = "Severity of Joint Pain and Its Effect on Work in U.S. Adult Females", x = "Joint Pain on Scale of 1-10", y = "Individuals", fill = "Does Joint Pain Affect Work?") + 
  theme(plot.title = element_text(hjust = 0.5))

Our ggplot2 function automatically genereates the x-axis, but increments of 2.5 are not ideal for representing our scale of 1-10.

# Adding scale of 1-10 on x-axis for visual clarity
ggplot(data = arthritisfemale, aes(x = joinpain, fill = arthdis2)) + 
  geom_histogram(binwidth = 1, position = position_dodge(width = 0.75)) + 
  scale_x_continuous(breaks=c(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10)) + 
  labs(title = "Severity of Joint Pain and Its Effect on Work in U.S. Adult Females", x = "Joint Pain on Scale of 1-10", y = "Individuals", fill = "Does Joint Pain Affect Work?") + 
  theme(plot.title = element_text(hjust = 0.5))

ggplot(data = arthritismale, aes(x = joinpain, fill = arthdis2)) + 
  geom_histogram(binwidth = 1, position = position_dodge(width = 0.75)) + 
  scale_x_continuous(breaks=c(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10)) + 
  labs(title = "Severity of Joint Pain and Its Effect on Work in U.S. Adult Males", x = "Joint Pain on Scale of 1-10", y = "Individuals", fill = "Does Joint Pain Affect Work?") + 
  theme(plot.title = element_text(hjust = 0.5))

There seems to be a correlation between joint pain scores of 7-10 and an effect on work performance. For both genders, the proportion of people affected by joint pain in their work is higher than those who are unaffected at these pain levels.