Setup
The objective of this assignment is to conduct exploratory data analysis on the Behavioral Risk Factor Surveillance System (BRFSS) - 2013 dataset.
Load packages
We load the tidyverse set of packages, which includes the visualisation package ggplot2, the data manipulation package dplyr, and the forcats package for working with factors, among others.
We also load the kableExtra package for table styling and formatting options.
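The loading chunk itself does not appear in this report. A minimal sketch of what it might look like is given below. How the data was read in is not shown anywhere in the report, so that step is left as a comment; the only thing known is that later code refers to the data frame as br.

```r
# Load the packages described above.
library(tidyverse)   # attaches ggplot2, dplyr, forcats, tibble, among others
library(kableExtra)  # provides kable_styling() for table formatting

# Assumption: the BRFSS 2013 data has already been read into a data frame
# named `br`. The file name and loading call are not shown in this report.
```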
Part 1: Data [1]
The Behavioral Risk Factor Surveillance System (BRFSS) is a collaborative project between all of the states in the United States (US) and participating US territories and the Centers for Disease Control and Prevention (CDC). The BRFSS’s objective is to collect uniform, state-specific data on preventive health practices and risk behaviors that are linked to chronic diseases, injuries, and preventable infectious diseases that affect the adult population.
Factors assessed by the BRFSS in 2013 include tobacco use, HIV/AIDS knowledge and prevention, exercise, immunization, health status, healthy days — health-related quality of life, health care access, inadequate sleep, hypertension awareness, cholesterol awareness, chronic health conditions, alcohol consumption, fruits and vegetables consumption, arthritis burden, and seatbelt use.
The total number of rows and columns in the dataset are as follows:
# Show the dimensions of the dataset as a one-row table
knitr::kable(
  tibble(Rows = dim(br)[1],
         Columns = dim(br)[2]),
  align = "cc"
) %>%
  kable_styling(full_width = TRUE)

| Rows | Columns |
|---|---|
| 491775 | 330 |
Let’s take a quick look at 10 sampled rows and the first few columns of the dataset:
| | X_state | fmonth | idate | imonth | iday | iyear | dispcode |
|---|---|---|---|---|---|---|---|
| 408091 | South Dakota | May | 5302013 | May | 30 | 2013 | Completed interview |
| 272120 | Nebraska | July | 7172013 | July | 17 | 2013 | Completed interview |
| 359971 | Oklahoma | February | 3192013 | March | 19 | 2013 | Completed interview |
| 13469 | Arizona | February | 2072013 | February | 7 | 2013 | Completed interview |
| 356810 | Ohio | June | 6142013 | June | 14 | 2013 | Completed interview |
| 166642 | Kansas | January | 2012013 | February | 1 | 2013 | Partially completed interview |
| 105578 | Georgia | May | 5012013 | May | 1 | 2013 | Completed interview |
| 461882 | Washington | June | 6032013 | June | 3 | 2013 | Completed interview |
| 159568 | Kansas | July | 7212013 | July | 21 | 2013 | Completed interview |
| 222603 | Michigan | January | 3162013 | March | 16 | 2013 | Completed interview |
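The code that produced this preview is not shown. One way to draw a small random preview is sketched below on a toy stand-in for br; the toy values, the name br_toy, and the choice of slice_sample() are assumptions for illustration, not the author's method.

```r
library(dplyr)

# Toy stand-in for `br` (hypothetical values), so the sketch runs on its own.
br_toy <- tibble::tibble(
  X_state = c("Ohio", "Kansas", "Georgia", "Michigan"),
  fmonth  = c("June", "January", "May", "January"),
  iyear   = c(2013, 2013, 2013, 2013),
  other   = 1:4
)

set.seed(1)                # make the random sample reproducible
preview <- br_toy %>%
  select(1:3) %>%          # keep only the first few columns
  slice_sample(n = 2)      # draw rows at random, without replacement

preview
```

On the real data, the same pattern with n = 10 and a wider column range would give a preview of the kind shown above.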
Data Collection:
The observations in the sample are collected from all 50 states in the USA, the District of Columbia, Puerto Rico, Guam, American Samoa, Federated States of Micronesia, and Palau.
80% of the data is collected through landline phone surveys, while 20% is collected through cellular phone surveys.
For persons interviewed on landline telephones, individual respondents are randomly selected from all adults, aged 18 years and older, living in a household and are interviewed in accordance with BRFSS protocol. Cellular telephone interviews are conducted with respondents who answer the number called and are treated as one-person households.
Survey Design & Scope of Inference:
The survey is an observational study. The population of interest is the non-institutionalised adult population of the United States of America.
It employs a stratified random sampling method that places weights on age, race, sex, and state of residence, among other factors.
It tries to reduce non-response bias by making 80% of its calls on weeknights and weekends, that is, outside weekday business hours.
Since the study is observational and not an experiment, there is no random assignment. Thus, we can infer correlation between the explanatory and response variables, but not causation.
However, since random sampling techniques have been extensively used for the survey, our inferences will be generalizable to the population of interest.
Part 2: Research questions
Research question 1: Is there a correlation between mental health and hours of sleep? Does this correlation stand for both sexes?
Poor mental health is widely believed to be associated with poor sleep and sleep disorders [2], and we want to see if this association is reflected in the BRFSS data. We also want to find out if there are differences between men and women when it comes to the link between mental health and sleep.
Variables: Occurrence of poor mental health, quality of sleep, sex
Research question 2: For people with high blood pressure, is the proportion of those who take BP medicines positively correlated with income level?
Once you start taking high blood pressure (BP) medication, you generally have to continue taking it for life [3]. This may lead people with low incomes to forgo BP medication. While the observational nature of the BRFSS prevents us from making causal inferences, it would still be illuminating to see if there is a positive association between the proportion of people with high BP who take medication, and income levels.
Variables: Income level, proportion of individuals with high BP who take BP meds
Research question 3: Does the proportion of women who stay married differ significantly by race?
There is a huge racial disparity in incarceration rates in the US: black men are six times as likely to be incarcerated as white men. It is estimated that as many as one in three black men born at the turn of this century might be imprisoned during their lifetime [4].
Such a high rate of incarceration for black men, coupled with wide racial disparities, could result in poorer marriage outcomes for black women. We check if this is borne out in the BRFSS data. Since the observational nature of the study prevents us from making causal inferences, we limit the scope of the question to differences in marriage outcomes by race.
Variables: Race, marital status, sex
Part 3: Exploratory data analysis
Research question 1
Is there a correlation between mental health and hours of sleep? Does this correlation stand for both sexes?
We will narrow the above research question to the following, and attempt to answer it: Is there a correlation between quality of sleep (measured by hours of sleep), and occurrence of mental health issues? Does this correlation stand for both men and women?
Before we begin to answer this question, let’s look at the menthlth variable. This variable records the number of days of “not good” mental health over the previous 30 days. Let’s display the first few results as a table.
# Count respondents by number of "not good" mental health days
br0 <- br %>% count(menthlth)
knitr::kable(
  br0[1:5, ],
  align = "cc"
) %>%
  kable_styling(full_width = TRUE)

| menthlth | n |
|---|---|
| 0 | 334461 |
| 1 | 15206 |
| 2 | 23520 |
| 3 | 13593 |
| 4 | 6660 |
We create a new variable mentalIllness, which takes only two values - “Yes” and “No”. Respondents who reported 0 “not good” mental health days are classified as “No”, and the rest as “Yes”. Missing values are dropped. The results are stored as a new object, br1.
The values of the mentalIllness variable are as follows. 148,687 individuals reported at least one day of poor mental health.
# Drop missing values, then flag respondents with at least one "not good" day
br1 <- br %>%
  filter(!is.na(menthlth)) %>%
  mutate(mentalIllness = ifelse(menthlth == 0, "No", "Yes"))
knitr::kable(
  br1 %>% count(mentalIllness),
  align = "cc"
) %>%
  kable_styling(full_width = TRUE)

| mentalIllness | n |
|---|---|
| No | 334461 |
| Yes | 148687 |
Most researchers agree that 7-9 hours is the optimum amount of sleep for an individual [5]. We will create a new variable, sleepQuality, which takes one of 3 values. 7-9 hours of sleep is classified as “Optimum”, less than 7 hours as “Low”, and more than 9 hours as “High”. This variable is derived from the sleptim1 variable, which records individuals’ average hours of sleep.
The values of the sleepQuality variable are as follows.
# Bin average hours of sleep into three ordered categories
br1 <- br1 %>%
  filter(!is.na(sleptim1)) %>%
  mutate(sleepQuality = ifelse(sleptim1 < 7, "Low",
                               ifelse(sleptim1 > 9, "High",
                                      "Optimum")),
         sleepQuality = fct_relevel(sleepQuality, "Low",
                                    "Optimum", "High"))
knitr::kable(
  br1 %>% count(sleepQuality),
  align = "cc"
) %>%
  kable_styling(full_width = TRUE)

| sleepQuality | n |
|---|---|
| Low | 155640 |
| Optimum | 303142 |
| High | 17834 |
Finally, we show the correlation between sleep quality and mental health by sex both as a summary table and as a line graph.
# Proportion of respondents with at least one "not good" day, by sex and sleep category
br1 <- br1 %>%
  filter(!is.na(sex)) %>%
  group_by(sex, sleepQuality) %>%
  summarize(mentIllPercent = mean(mentalIllness == "Yes"))
knitr::kable(
  br1,
  align = "cc"
) %>%
  kable_styling(full_width = TRUE)

| sex | sleepQuality | mentIllPercent |
|---|---|---|
| Male | Low | 0.3465563 |
| Male | Optimum | 0.2051385 |
| Male | High | 0.3202945 |
| Female | Low | 0.4405082 |
| Female | Optimum | 0.2899323 |
| Female | High | 0.4196240 |
# One line per sex across the three sleep categories
br1 %>%
  ggplot(aes(sleepQuality, mentIllPercent)) +
  geom_line(aes(colour = sex, group = sex), size = 1.2) +
  geom_point(size = 3) +
  labs(x = "Hours of sleep",
       y = "Occurrence of poor mental health outcomes",
       title = "Correlation between sleep, mental health & sex",
       colour = "Sex") +
  scale_y_continuous(limits = c(0, 1), breaks = seq(0, 1, by = 0.2))

The results show a clear correlation between mental health and sleep. Let’s break them down:
- Individuals who sleep an optimum number of hours report much better mental health outcomes. This is true for both men and women.
- Both individuals who sleep too little, and individuals who sleep too much, report poorer mental health outcomes.
- The share of women reporting at least one day of poor mental health is roughly 8 to 10 percentage points higher than the share of men. This is true regardless of the hours of sleep.
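The size of the gap between women and men can be read directly off the summary table. The sketch below re-enters the six proportions reported above and computes the female-minus-male difference in percentage points for each sleep category; the names sleep_summary, gap_by_sleep and gap_pp are introduced here for illustration only.

```r
library(dplyr)
library(tidyr)

# Proportions copied from the summary table above.
sleep_summary <- tibble::tribble(
  ~sex,     ~sleepQuality, ~mentIllPercent,
  "Male",   "Low",         0.3465563,
  "Male",   "Optimum",     0.2051385,
  "Male",   "High",        0.3202945,
  "Female", "Low",         0.4405082,
  "Female", "Optimum",     0.2899323,
  "Female", "High",        0.4196240
)

gap_by_sleep <- sleep_summary %>%
  pivot_wider(names_from = sex, values_from = mentIllPercent) %>%  # one column per sex
  mutate(gap_pp = 100 * (Female - Male))                           # gap in percentage points

gap_by_sleep
```

Expressing the gap in percentage points avoids confusion with a relative (percent) difference, which would be considerably larger for these proportions.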
Research question 2
For people with high blood pressure, is the proportion of those who take BP medicines positively correlated with income level?
The variable bphigh4 asks survey respondents to answer the following question: “Have you EVER been told by a doctor, nurse or other health professional that you have high blood pressure?”
| bphigh4 | n |
|---|---|
| Yes | 198921 |
| Yes, but female told only during pregnancy | 3680 |
| No | 282687 |
| Told borderline or pre-hypertensive | 5067 |
| NA | 1420 |
Those who answer yes to the above question (198,921 respondents) are then asked whether they are currently taking BP medicines. The answers are captured by the variable bpmeds.
We can confirm that only high BP individuals are asked whether they are taking BP medicine, by means of the below table.
| bphigh4 | bpmeds: Yes | bpmeds: No |
|---|---|---|
| Yes | 166178 | 32396 |
| Yes, but female told only during pregnancy | 0 | 0 |
| No | 0 | 0 |
| Told borderline or pre-hypertensive | 0 | 0 |
The values of bphigh4 form the row headers, and the values of bpmeds form the column headers. It is clear that only those individuals who have high BP are further asked whether they are taking BP medicines. Furthermore, we note that 166,178 respondents are taking BP medicines, and 32,396 are not.
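The code behind this cross-tabulation is not shown. A minimal sketch of one way to build such a table with base R's table() follows, using a toy stand-in for the two columns; the toy values and the names bp_toy and bp_tab are assumptions for illustration.

```r
# Toy stand-in for the two columns of `br` (hypothetical values).
bp_toy <- data.frame(
  bphigh4 = c("Yes", "Yes", "Yes", "No", "No",
              "Told borderline or pre-hypertensive"),
  bpmeds  = c("Yes", "No", "Yes", NA, NA, NA)
)

# table() drops NA values by default, so rows for respondents who were never
# asked about medication contain only zeros.
bp_tab <- table(bp_toy$bphigh4, bp_toy$bpmeds)
bp_tab
```

Rows that are all zero, as in the real table above, indicate that bpmeds was recorded only for respondents who answered “Yes” to bphigh4.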
The income2 variable records the income level of respondents.
| income2 | n |
|---|---|
| Less than $10,000 | 25441 |
| Less than $15,000 | 26794 |
| Less than $20,000 | 34873 |
| Less than $25,000 | 41732 |
| Less than $35,000 | 48867 |
| Less than $50,000 | 61509 |
| Less than $75,000 | 65231 |
| $75,000 or more | 115902 |
| NA | 71426 |
Next, we plot the relationship between the proportion of high BP individuals who take BP medicines, and income level. We first remove missing values from both income2 and bpmeds, then create a summary table that groups survey respondents by their income level. For each income level, we calculate the proportion of individuals who are taking BP medicines. The results are displayed in the summary table and plot below.
# Proportion of respondents taking BP medicines, by income level
br2 <- br %>%
  filter(!is.na(income2) & !is.na(bpmeds)) %>%
  group_by(income2) %>%
  summarize(bpMedProp = mean(bpmeds == "Yes"))
knitr::kable(
  br2,
  align = "cc"
) %>%
  kable_styling(full_width = TRUE)

| income2 | bpMedProp |
|---|---|
| Less than $10,000 | 0.8035208 |
| Less than $15,000 | 0.8473082 |
| Less than $20,000 | 0.8396687 |
| Less than $25,000 | 0.8460262 |
| Less than $35,000 | 0.8470706 |
| Less than $50,000 | 0.8404967 |
| Less than $75,000 | 0.8297079 |
| $75,000 or more | 0.8040028 |
br2 %>%
  ggplot(aes(income2, bpMedProp)) +
  geom_line(size = 1.2, group = 1) +
  geom_point(size = 3, colour = "red") +
  scale_y_continuous(limits = c(0, 1), breaks = seq(0, 1, by = 0.2)) +
  labs(x = "Income Level",
       y = "% High BP individuals who take BP meds",
       title = "Do a higher % of wealthy people take BP meds?") +
  coord_flip()

The results indicate a very weak correlation. We can break it down as follows:
- The proportion of individuals with high blood pressure who take BP medicines is not strongly correlated with income level.
- A smaller proportion of individuals in the lowest income category (less than $10,000) and the highest ($75,000 or more) take BP medicines after being diagnosed with high BP than of those in the categories in between. However, the difference is less than 5 percentage points.
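The size of the difference can be checked from the summary table itself. The sketch below re-enters the eight proportions reported above and computes the spread between the largest and smallest in percentage points; the names bp_by_income and spread_pp are introduced here for illustration only.

```r
# Proportions copied from the summary table above, in income order.
bp_by_income <- data.frame(
  income2 = c("Less than $10,000", "Less than $15,000", "Less than $20,000",
              "Less than $25,000", "Less than $35,000", "Less than $50,000",
              "Less than $75,000", "$75,000 or more"),
  bpMedProp = c(0.8035208, 0.8473082, 0.8396687, 0.8460262,
                0.8470706, 0.8404967, 0.8297079, 0.8040028)
)

# Spread between the highest and lowest proportion, in percentage points.
spread_pp <- 100 * diff(range(bp_by_income$bpMedProp))
spread_pp

# Income categories with the lowest and highest proportion.
bp_by_income$income2[which.min(bp_by_income$bpMedProp)]
bp_by_income$income2[which.max(bp_by_income$bpMedProp)]
```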
Research question 3
Does the proportion of women who stay married differ significantly by race?
Variables: Race, marital status, sex
The respondents’ race is captured by the X_imprace variable.
| X_imprace | n |
|---|---|
| White, Non-Hispanic | 383624 |
| Black, Non-Hispanic | 39817 |
| Asian, Non-Hispanic | 9629 |
| American Indian/Alaskan Native, Non-Hispanic | 7781 |
| Hispanic | 37138 |
| Other race, Non-Hispanic | 13777 |
| NA | 9 |
The variable marital captures the marital status of respondents.
| marital | n |
|---|---|
| Married | 253329 |
| Divorced | 70376 |
| Widowed | 65745 |
| Separated | 10662 |
| Never married | 75070 |
| A member of an unmarried couple | 13173 |
| NA | 3420 |
We want to see if there’s a significant difference in the current marital status of women who have been married at least once, when we cut by race. In order to conduct this analysis, we follow the below steps:
- We filter out men from the sex variable, and women who reported their marital status as either “Never married” or “A member of an unmarried couple” from the marital variable.
- Missing values are removed from the X_imprace, sex, and marital variables.
- We change the labels of the values of the X_imprace variable. For example, “White, Non-Hispanic” is re-labeled as “White”.
- We change the order of the values of the marital variable, so that “Separated” follows “Divorced” (since these are similar categories).
- We group the dataset by the X_imprace and marital variables, and compute the count of each group in a new summary table.
- We group the new table by the X_imprace category, and summarize the proportion of each marital category by race.
- We plot the results in a barplot of proportions, so that we can compare marital status of women by race.
# Keep ever-married women, tidy the factor labels, and plot marital status by race
br %>%
  filter(marital %in% c("Married", "Separated", "Widowed", "Divorced"),
         !is.na(X_imprace), !is.na(marital), !is.na(sex),
         sex == "Female") %>%
  mutate(X_imprace =
           fct_recode(
             X_imprace,
             "White" = "White, Non-Hispanic",
             "Black" = "Black, Non-Hispanic",
             "Asian" = "Asian, Non-Hispanic",
             "Native American" =
               "American Indian/Alaskan Native, Non-Hispanic",
             "Other Race" = "Other race, Non-Hispanic"
           ),
         marital =
           fct_relevel(marital, "Married",
                       "Divorced", "Separated", "Widowed")) %>%
  group_by(X_imprace, marital) %>%
  summarise(count = n()) %>%
  group_by(X_imprace) %>%
  mutate(prop = count / sum(count)) %>%
  ggplot(aes(X_imprace, prop)) +
  geom_bar(aes(fill = marital), stat = "identity",
           position = "fill") +
  labs(x = "Race",
       y = "% Marital Status",
       fill = "Marital Status",
       title = "Proportion of women who stay married",
       subtitle = "Comparison by race") +
  coord_flip()

There are significant differences in the marital status of women respondents, when compared by race. Let’s break them down:
- A significantly lower proportion of black women are married, when compared with their counterparts in other races.
- A higher proportion of black women are divorced or widowed, when compared with their counterparts in other races.
- A significantly lower proportion of Asian women are married, when compared with their counterparts in other races.
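The pipeline above sends the proportions straight into the plot, so the underlying numbers are never printed. A minimal sketch of how they could be tabulated instead is given below, using a toy stand-in for the filtered data; the toy counts are hypothetical and are not BRFSS values, and the names toy and married_share are introduced for illustration.

```r
library(dplyr)

# Toy stand-in for the filtered data (hypothetical rows, not BRFSS values).
toy <- tibble::tibble(
  X_imprace = c("White", "White", "White", "Black", "Black", "Asian"),
  marital   = c("Married", "Married", "Divorced", "Married", "Widowed", "Married")
)

married_share <- toy %>%
  count(X_imprace, marital, name = "count") %>%  # one row per race/status pair
  group_by(X_imprace) %>%
  mutate(prop = count / sum(count)) %>%          # proportions sum to 1 within each race
  ungroup() %>%
  filter(marital == "Married") %>%               # keep only the married share
  arrange(prop)                                  # lowest share first

married_share
```

Printing such a table alongside the barplot would let readers compare the married share across races without reading values off the chart.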
Conclusion
We can answer the 3 research questions thus:
- There is a clear correlation between quality of sleep (as measured by hours of sleep) and mental health. This correlation stands for both sexes.
- For people with high blood pressure, the proportion of those who take BP medicines is very weakly correlated with income level.
- There are significant differences in the proportion of women who stay married, when compared by race.
These results are generalizable to the non-institutionalized adult population of the United States.
References
[1] The information in Part 1 draws heavily from the following document: https://www.cdc.gov/brfss/data_documentation/pdf/UserguideJune2013.pdf
[2] https://www.health.harvard.edu/newsletter_article/sleep-and-mental-health
[3] https://www.columbianeurology.org/neurology/staywell/document.php?id=1428
[4] https://sentencingproject.org/wp-content/uploads/2016/01/Trends-in-US-Corrections.pdf (Page 5)
[5] https://www.sleepfoundation.org/articles/how-much-sleep-do-we-really-need