Exploring The BRFSS-2013 Dataset

Siddharth Samant

06/07/2020


Setup

The objective of this assignment is to conduct exploratory data analysis on the Behavioral Risk Factor Surveillance System (BRFSS) - 2013 dataset.


Load packages

We load the tidyverse set of packages, which includes the visualisation package ggplot2, the relational data package dplyr, and the forcats package for working with factors, among others.

We also load the kableExtra package for table styling and formatting options.

library(tidyverse)
library(kableExtra)

Load data

We will load in the dataset, copy and save it as an object named br, and then remove the original dataset.

load("brfss2013.Rdata")
br <- brfss2013
rm(brfss2013)
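
As an aside, `load()` restores every object stored in the file under its saved name, so a slightly more defensive variant is to load into a scratch environment and pull out only the object we need. The sketch below is illustrative: the helper name `load_object()` and the throwaway demo file are assumptions, not part of the original analysis.

```r
# Hypothetical helper: fetch one named object from an .Rdata file
# without adding anything else to the global environment.
load_object <- function(path, name) {
        stopifnot(file.exists(path))
        scratch <- new.env()
        load(path, envir = scratch)
        stopifnot(exists(name, envir = scratch, inherits = FALSE))
        get(name, envir = scratch, inherits = FALSE)
}

# Demo on a temporary file that stands in for "brfss2013.Rdata"
demo_path <- tempfile(fileext = ".Rdata")
brfss_demo <- data.frame(sex = c("Male", "Female"), sleptim1 = c(7, 6))
save(brfss_demo, file = demo_path)
rm(brfss_demo)

br_demo <- load_object(demo_path, "brfss_demo")
dim(br_demo)
```

With the real file, the equivalent call would be `br <- load_object("brfss2013.Rdata", "brfss2013")`.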

Part 1: Data¹

The Behavioral Risk Factor Surveillance System (BRFSS) is a collaborative project between all of the states in the United States (US) and participating US territories and the Centers for Disease Control and Prevention (CDC). The BRFSS’s objective is to collect uniform, state-specific data on preventive health practices and risk behaviors that are linked to chronic diseases, injuries, and preventable infectious diseases that affect the adult population.

Factors assessed by the BRFSS in 2013 include tobacco use, HIV/AIDS knowledge and prevention, exercise, immunization, health status, healthy days — health-related quality of life, health care access, inadequate sleep, hypertension awareness, cholesterol awareness, chronic health conditions, alcohol consumption, fruits and vegetables consumption, arthritis burden, and seatbelt use.

The total number of rows and columns in the dataset are as follows:

knitr::kable(
        tibble(Rows = dim(br)[1],
               Columns = dim(br)[2]),
        align = "cc"
) %>%
        kable_styling(full_width = TRUE)
Rows Columns
491775 330

Let’s take a quick look at 10 sampled rows and the first few columns of the dataset:

knitr::kable(
        br[sample(nrow(br), 10),1:7],
        align = "cc"
) %>%
        kable_styling(full_width = TRUE)
        X_state       fmonth    idate    imonth    iday  iyear  dispcode
408091  South Dakota  May       5302013  May       30    2013   Completed interview
272120  Nebraska      July      7172013  July      17    2013   Completed interview
359971  Oklahoma      February  3192013  March     19    2013   Completed interview
13469   Arizona       February  2072013  February  7     2013   Completed interview
356810  Ohio          June      6142013  June      14    2013   Completed interview
166642  Kansas        January   2012013  February  1     2013   Partially completed interview
105578  Georgia       May       5012013  May       1     2013   Completed interview
461882  Washington    June      6032013  June      3     2013   Completed interview
159568  Kansas        July      7212013  July      21    2013   Completed interview
222603  Michigan      January   3162013  March     16    2013   Completed interview
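
Because `sample()` draws different rows on every run, the table above changes each time the report is knitted. A minimal sketch of how the draw could be made reproducible is shown below, using a toy data frame in place of `br`. The seed value is arbitrary and chosen only for illustration.

```r
# Toy data frame standing in for br
toy <- data.frame(id = 1:100, value = (1:100)^2)

set.seed(2013)                        # arbitrary seed, for illustration only
first_draw <- toy[sample(nrow(toy), 10), ]

set.seed(2013)                        # same seed, so the same rows are drawn
second_draw <- toy[sample(nrow(toy), 10), ]

identical(first_draw, second_draw)
```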

Data Collection:

  • The observations in the sample are collected from all 50 states in the USA, the District of Columbia, Puerto Rico, Guam, American Samoa, Federated States of Micronesia, and Palau.

  • 80% of the data is collected through landline phone surveys, while 20% is collected through cellular phone surveys.

  • For persons interviewed on landline telephones, individual respondents are randomly selected from all adults, aged 18 years and older, living in a household and are interviewed in accordance with BRFSS protocol. Cellular telephone interviews are conducted with respondents who answer the number called and are treated as one-person households.

Survey Design & Scope of Inference:

  • The survey is an observational study. The population of interest is the non-institutionalised adult population of the United States of America.

  • It employs a stratified random sampling method that places weights on age, race, sex, and state of residence, among other factors.

  • It tries to limit non-response bias by placing 80% of its calls on weeknights and weekends, that is, outside weekday business hours.

  • Since the study is observational and not an experiment, there is no random assignment. Thus, we can infer correlation between the explanatory and response variables, but not causation.

  • However, since random sampling techniques have been extensively used for the survey, our inferences will be generalizable to the population of interest.
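
Since the design relies on sampling weights, an unweighted sample proportion and a weighted estimate can differ. The sketch below illustrates the difference on made-up data. In the real dataset the final weight is assumed to be stored in a column such as `X_llcpwt`; that name is an assumption here and should be checked against the BRFSS codebook.

```r
# Hypothetical respondents: an outcome plus a made-up sampling weight
toy <- data.frame(
        outcome = c("Yes", "No", "No", "Yes", "No"),
        weight  = c(100, 400, 250, 50, 200)
)

# Every respondent counts equally
unweighted <- mean(toy$outcome == "Yes")

# Each respondent counts in proportion to the weight
weighted <- weighted.mean(toy$outcome == "Yes", w = toy$weight)

c(unweighted = unweighted, weighted = weighted)
```

In this toy case the two "Yes" respondents carry small weights, so the weighted share is lower than the raw share.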


Part 2: Research questions

Research question 1: Is there a correlation between mental health and hours of sleep? Does this correlation stand for both sexes?

Poor mental health is widely believed to be associated with poor sleep and sleep disorders², and we want to see if this association is reflected in the BRFSS data. We also want to find out if there are differences between men and women when it comes to the link between mental health and sleep.

Variables: Occurrence of poor mental health, quality of sleep, sex

Research question 2: For people with high blood pressure, is the proportion of those who take BP medicines positively correlated with income level?

Once you start taking high blood pressure (BP) medication, you have to continue taking it for life³. This may lead people with low incomes to forgo BP medication. While the observational nature of the BRFSS prevents us from making causal inferences, it would still be illuminating to see if there is a positive association between the proportion of people with high BP who take medication, and income levels.

Variables: Income level, proportion of individuals with high BP who take BP meds

Research question 3: Does the proportion of women who stay married differ significantly by race?

There is a huge racial disparity in incarceration rates in the US: black men are six times as likely to be incarcerated as white men. It is estimated that as many as one in three black men born at the turn of this century might be imprisoned during their lifetime.⁴

Such a high rate of incarceration for black men, coupled with wide racial disparities, could result in poorer marriage outcomes for black women. We check if this is borne out in the BRFSS data. Since the observational nature of the study prevents us from making causal inferences, we limit the scope of the question to differences in marriage outcomes by race.

Variables: Race, marital status, sex


Part 3: Exploratory data analysis

Research question 1

Is there a correlation between mental health and hours of sleep? Does this correlation stand for both sexes?

We will narrow the above research question to the following, and attempt to answer it: Is there a correlation between quality of sleep (measured by hours of sleep), and occurrence of mental health issues? Does this correlation stand for both men and women?

Before we begin to answer this question, let’s look at the menthlth variable. This variable records the number of days of “not good” mental health over the previous 30 days. Let’s display the first few results as a table.

br0 <- br %>% count(menthlth)
knitr::kable(
        br0[1:5,],
        align = "cc"
) %>%
        kable_styling(full_width = TRUE)
menthlth n
0 334461
1 15206
2 23520
3 13593
4 6660

We create a new variable mentalIllness, which takes only two values: “Yes” and “No”. Respondents who reported 0 “not good” mental health days are classified as a “No”, and the rest as a “Yes”. Missing values are dropped. The results are stored as a new object, br1.

The values of the mentalIllness variable are as follows. 148,687 individuals reported at least one day of poor mental health.

br1 <- br %>%
        filter(!is.na(menthlth)) %>%
        mutate(mentalIllness = ifelse(menthlth == 0, "No", "Yes"))

knitr::kable(
        br1 %>% count(mentalIllness),
        align = "cc"
) %>%
        kable_styling(full_width = TRUE)
mentalIllness n
No 334461
Yes 148687
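
To make the recoding rule concrete, here is a small sketch on made-up `menthlth` values. It shows that `ifelse()` propagates missing values, which is why they are filtered out first, and it recomputes the overall share of "Yes" from the counts in the table above.

```r
days <- c(0, 0, 5, 30, NA, 1)          # made-up menthlth values

# 0 days -> "No"; any positive number of days -> "Yes"; NA stays NA
recoded <- ifelse(days == 0, "No", "Yes")
table(recoded, useNA = "ifany")

toy_share <- mean(recoded == "Yes", na.rm = TRUE)

# Share of respondents with at least one poor day, from the table above
share_yes <- 148687 / (334461 + 148687)
round(share_yes, 3)
```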

Most researchers agree that 7-9 hours is the optimum amount of sleep for an individual⁵. We will create a new variable, sleepQuality, which takes one of 3 values: 7-9 hours of sleep is classified as “Optimum”, less than 7 hours as “Low”, and more than 9 hours as “High”. This variable is derived from the sleptim1 variable, which records individuals’ average hours of sleep.

The values of the sleepQuality variable are as follows.

br1 <- br1 %>%
        filter(!is.na(sleptim1)) %>%
        mutate(sleepQuality = ifelse(sleptim1 < 7, "Low", 
                                     ifelse(sleptim1 > 9, "High", 
                                            "Optimum")),
               sleepQuality = fct_relevel(sleepQuality, "Low",
                                          "Optimum", "High")) 
knitr::kable(
        br1 %>% count(sleepQuality),
        align = "cc"
) %>%
        kable_styling(full_width = TRUE)
sleepQuality n
Low 155640
Optimum 303142
High 17834
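
The same three-way banding can also be sketched with base R's `cut()`, which returns an ordered set of factor levels in one step. This is an alternative formulation, not the code used above, and it assumes hours are recorded as whole numbers: with left-closed intervals a value such as 9.5 would fall into "Optimum" rather than "High".

```r
hours <- c(4, 6, 7, 9, 10, 12, NA)     # made-up whole-hour values

# Left-closed bands: [-Inf, 7) Low, [7, 10) Optimum, [10, Inf) High
sleep_band <- cut(hours,
                  breaks = c(-Inf, 7, 10, Inf),
                  labels = c("Low", "Optimum", "High"),
                  right  = FALSE)

data.frame(hours, sleep_band)
levels(sleep_band)
```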

Finally, we show the correlation between sleep quality and mental health by sex both as a summary table and as a line graph.

br1 <- br1 %>%
        filter(!is.na(sex)) %>%
        group_by(sex, sleepQuality) %>%
        summarize(mentIllPercent = mean(mentalIllness == "Yes"))

knitr::kable(
        br1,
        align = "cc"
) %>%
        kable_styling(full_width = TRUE)
sex sleepQuality mentIllPercent
Male Low 0.3465563
Male Optimum 0.2051385
Male High 0.3202945
Female Low 0.4405082
Female Optimum 0.2899323
Female High 0.4196240

br1 %>%
        ggplot(aes(sleepQuality, mentIllPercent)) +
        geom_line(aes(colour = sex, group = sex), size = 1.2) +
        geom_point(size = 3) +
        labs(x = "Hours of sleep",
             y = "Occurrence of poor mental health outcomes", 
             title = "Correlation between sleep, mental health & sex",
             colour = "Sex") +
        scale_y_continuous(limits = c(0,1), breaks = seq(0,1,by = 0.2))

The results show a clear correlation between mental health and sleep. Let’s break them down:

  1. Individuals who sleep an optimum number of hours report much better mental health outcomes. This is true for both men and women.

  2. Both individuals who sleep too little, and individuals who sleep too much, report poorer mental health outcomes.

  3. The share of women reporting at least one day of poor mental health is roughly 8 to 10 percentage points higher than the share of men. This is true regardless of the hours of sleep.
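
To make the size of the gap between men and women concrete, the sketch below recomputes it from the proportions in the summary table above, both as a percentage-point difference and as a ratio.

```r
# Proportions copied from the summary table above
male   <- c(Low = 0.3465563, Optimum = 0.2051385, High = 0.3202945)
female <- c(Low = 0.4405082, Optimum = 0.2899323, High = 0.4196240)

gap_pp <- round((female - male) * 100, 1)   # gap in percentage points
ratio  <- round(female / male, 2)           # gap as a ratio

rbind(gap_pp, ratio)
```

The absolute gap is similar across the three sleep bands, while the relative gap is largest in the "Optimum" band, where the baseline rate is lowest.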


Research question 2

For people with high blood pressure, is the proportion of those who take BP medicines positively correlated with income level?

The variable bphigh4 asks survey respondents to answer the following question: “Have you EVER been told by a doctor, nurse or other health professional that you have high blood pressure?”

knitr::kable(
        br %>% count(bphigh4),
        align = "cc"
) %>%
        kable_styling(full_width = TRUE)
bphigh4 n
Yes 198921
Yes, but female told only during pregnancy 3680
No 282687
Told borderline or pre-hypertensive 5067
NA 1420

Those whose answer to the above question is yes (198,921 respondents) are then asked if they are currently taking BP medicines. The answers are captured by the variable bpmeds.

We can confirm that only high BP individuals are asked whether they are taking BP medicine, by means of the below table.

knitr::kable(
        table(br$bphigh4, br$bpmeds),
        align = "cc"
) %>%
        kable_styling(full_width = TRUE)
                                            Yes     No
Yes                                         166178  32396
Yes, but female told only during pregnancy  0       0
No                                          0       0
Told borderline or pre-hypertensive         0       0

The values of bphigh4 form the row headers, and the values of bpmeds form the column headers. It is clear that only those individuals who have high BP are further asked whether they are taking BP medicines. Furthermore, we note that 166,178 respondents are taking BP medicines, and 32,396 are not.
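
A quick arithmetic check on the counts reported above shows that a small number of respondents who answered "Yes" to bphigh4 have no recorded answer for bpmeds. By default `table()` drops missing values, so these respondents do not appear in the cross-tabulation; passing `useNA = "ifany"` would display them.

```r
told_high_bp <- 198921           # bphigh4 == "Yes"
on_meds      <- 166178           # bpmeds == "Yes"
not_on_meds  <- 32396            # bpmeds == "No"

answered      <- on_meds + not_on_meds
no_answer     <- told_high_bp - answered   # "Yes" on bphigh4, bpmeds missing
share_on_meds <- on_meds / answered

c(answered = answered, no_answer = no_answer,
  share_on_meds = round(share_on_meds, 3))
```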

The income2 variable records the income level of respondents.

knitr::kable(
        br %>% count(income2),
        align = "cc"
) %>%
        kable_styling(full_width = TRUE)
income2 n
Less than $10,000 25441
Less than $15,000 26794
Less than $20,000 34873
Less than $25,000 41732
Less than $35,000 48867
Less than $50,000 61509
Less than $75,000 65231
$75,000 or more 115902
NA 71426

Next, we plot the relationship between the proportion of high BP individuals who take BP medicines, and income level. We first remove missing values from both the income2 and bpmeds variables, then create a summary table that groups survey respondents by their income level. For each income level, we calculate the proportion of individuals who are taking BP medicines. The results are displayed in the below summary table and plot.

br2 <- br %>%
        filter(!is.na(income2) & !is.na(bpmeds)) %>%
        group_by(income2) %>%
        summarize(bpMedProp = mean(bpmeds == "Yes"))
knitr::kable(
        br2,
        align = "cc"
) %>%
        kable_styling(full_width = TRUE)
income2 bpMedProp
Less than $10,000 0.8035208
Less than $15,000 0.8473082
Less than $20,000 0.8396687
Less than $25,000 0.8460262
Less than $35,000 0.8470706
Less than $50,000 0.8404967
Less than $75,000 0.8297079
$75,000 or more 0.8040028

br2 %>%
        ggplot(aes(income2, bpMedProp)) +
        geom_line(size = 1.2, group = 1) +
        geom_point(size = 3, colour = "red") +
        scale_y_continuous(limits = c(0,1), breaks = seq(0,1,by = 0.2)) +
        labs(x = "Income Level",
             y = "% High BP individuals who take BP meds",
             title = "Do a higher % of wealthy people take BP meds?") +
        coord_flip()

The results indicate a very weak correlation. We can break it down as follows:

  1. The proportion of individuals with high Blood Pressure who take BP medicines, and income level, are not strongly correlated with each other.

  2. A smaller proportion of individuals with incomes below $10,000 or of $75,000 and above take BP medicines after being diagnosed with high BP, than those with incomes between $10,000 and $75,000. However, the difference is less than 5 percentage points.
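
As a rough numerical companion to the plot, the sketch below takes the eight proportions from the summary table above and computes their spread, along with a Spearman rank correlation against the order of the income brackets. This is only a bracket-level summary of eight points, not a respondent-level test, so it should be read as a description of the plot and nothing more.

```r
# Proportions copied from the summary table above, lowest to highest bracket
bp_med_prop <- c(0.8035208, 0.8473082, 0.8396687, 0.8460262,
                 0.8470706, 0.8404967, 0.8297079, 0.8040028)
income_rank <- seq_along(bp_med_prop)

# Spread between the highest and lowest bracket, in percentage points
spread_pp <- (max(bp_med_prop) - min(bp_med_prop)) * 100

# Rank correlation between bracket order and proportion on medication
rho <- cor(income_rank, bp_med_prop, method = "spearman")

c(spread_pp = round(spread_pp, 2), spearman_rho = round(rho, 2))
```

The pattern is an inverted U rather than a steady rise or fall, which is why a single correlation coefficient comes out close to zero.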


Research question 3

Does the proportion of women who stay married differ significantly by race?

Variables: Race, marital status, sex

The respondents’ race is captured by the X_imprace variable.

knitr::kable(
        br %>% count(X_imprace),
        align = "cc"
) %>%
        kable_styling(full_width = TRUE)
X_imprace n
White, Non-Hispanic 383624
Black, Non-Hispanic 39817
Asian, Non-Hispanic 9629
American Indian/Alaskan Native, Non-Hispanic 7781
Hispanic 37138
Other race, Non-Hispanic 13777
NA 9

The variable marital captures the marital status of respondents.

knitr::kable(
        br %>% count(marital),
        align = "cc"
) %>%
        kable_styling(full_width = TRUE)
marital n
Married 253329
Divorced 70376
Widowed 65745
Separated 10662
Never married 75070
A member of an unmarried couple 13173
NA 3420

We want to see if there’s a significant difference in the current marital status of women who have been married at least once, when we cut by race. In order to conduct this analysis, we follow the below steps:

  1. We filter out men from the sex variable, and women who reported their marital status as either “Never married” or “A member of an unmarried couple” from the marital variable.
  2. Missing values are removed from the X_imprace, sex, and marital variables.
  3. We change the labels of the values of the X_imprace variable. For example, “White, Non-Hispanic” is re-labeled as “White”.
  4. We change the order of the values of the marital variable, so that “Separated” follows “Divorced” (since these are similar categories).
  5. We group the dataset by the X_imprace and marital variables, and compute the count of each group in a new summary table.
  6. We group the new table by the X_imprace category, and summarize the proportion of each marital category by race.
  7. We plot the results in a barplot of proportions, so that we can compare marital status of women by race.

br %>%
        filter(marital %in% c("Married", "Separated", "Widowed", "Divorced"),
               !is.na(X_imprace), !is.na(marital), !is.na(sex), 
               sex == "Female") %>%
        mutate(X_imprace = 
                 fct_recode(
                       X_imprace,
                       "White" = "White, Non-Hispanic",
                       "Black" = "Black, Non-Hispanic",
                       "Asian" = "Asian, Non-Hispanic",
                       "Native American" = 
                       "American Indian/Alaskan Native, Non-Hispanic",
                       "Other Race" = "Other race, Non-Hispanic"
                   ),
               marital = 
                 fct_relevel(marital, "Married",
                             "Divorced", "Separated", "Widowed")) %>%
        group_by(X_imprace, marital) %>%
        summarise(count  = n()) %>% 
        group_by(X_imprace) %>%
        mutate(prop = count/sum(count)) %>%
        ggplot(aes(X_imprace, prop)) +
        geom_bar(aes(fill = marital), stat = "identity", 
                 position = "fill") + 
        labs(x = "Race",
             y = "% Marital Status",
             fill = "Marital Status",
             title = "Proportion of women who stay married",
             subtitle = "Comparison by race") +
        coord_flip()
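
The count-then-normalize step in the pipeline above (group by race and marital status, then divide each count by its race total) can be cross-checked with base R's `prop.table()`. The sketch below uses made-up respondents and placeholder group labels purely to show the mechanics.

```r
# Made-up respondents, for illustration only
toy <- data.frame(
        race    = c("A", "A", "A", "A", "B", "B"),
        marital = c("Married", "Married", "Divorced", "Widowed",
                    "Married", "Divorced")
)

counts <- table(toy$race, toy$marital)
props  <- prop.table(counts, margin = 1)   # margin = 1: each row sums to 1

props
rowSums(props)
```

Since the proportions within each race already sum to 1, the stacked bars would have the same heights with or without `position = "fill"`.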

There are significant differences in the marital status of women respondents, when compared by race. Let’s break them down:

  1. There is a significantly lower proportion of married black women, when compared with their counterparts in other races.

  2. There is a higher proportion of divorced and widowed black women, when compared with their counterparts in other races.

  3. There is a significantly lower proportion of married Asian women, when compared with their counterparts in other races.


Conclusion

We can answer the 3 research questions thus:

  1. There is a clear correlation between quality of sleep (as measured by hours of sleep) and mental health. This correlation stands for both sexes.

  2. For people with high blood pressure, the proportion of those who take BP medicines is very weakly correlated with income level.

  3. There are significant differences in the proportion of women who stay married, when compared by race.

These results are generalizable to the non-institutionalized adult population of the United States.


References