Exploring the BRFSS data

Setup

Load packages

library(ggplot2)
library(dplyr)

Load data

setwd("C:/Users/devea/Downloads")
load("brfss2013.RData")

Part 1: Data

About: The Behavioral Risk Factor Surveillance System (BRFSS) is the US’s premier system of health-related telephone surveys that collect state data about U.S. residents regarding their health-related risk behaviors, chronic health conditions, and use of preventive services

Objective: to collect uniform, state-specific data on preventive health practices and risk behaviors that are linked to chronic diseases, injuries, and preventable infectious diseases that affect the adult population.

Scope: all 50 US states, the District of Columbia, Puerto Rico, and Guam collect data annually and American Samoa, Federated States of Micronesia, and Palau collect survey data over a limited point- in-time (usually one to three months)

Type: BRFSS completes adult interviews each year through landline telephone and cellular telephone-based surveys

Sample size: Sample size refers to the number of telephone numbers that must be called within a given period of time. The BRFSS goal is to support at least 4,000 interviews per state each year

Stratified sampling Process for the landline Sample: Disproportionate stratified sampling (DSS) has been used for the landline sample since 2003. DDS draws telephone numbers from two strata (lists) that are based on the presumed density of known telephone household numbers. In this design, telephone numbers are classified into strata that are either high density (listed 1+ block telephone numbers) or medium density (not listed 1+ block telephone numbers) to yield residential telephone numbers. Telephone numbers in the high density stratum are sampled at the highest rate.

Simple Random Sampling Process for the Cellular Telephone Sample: The cellular telephone sample is randomly generated from a sampling frame of confirmed cellular area code and prefix combinations. Cellular telephone respondents are randomly selected with each having equal probability of selection.

Weighting: Data weighting is an important statistical process that attempts to remove bias in the sample. The BRFSS weighting process includes two steps: design weighting and iterative proportional fitting (also known as “raking” weighting). The objective is to ensure that data are representative of the population on a number of demographic characteristics including sex, age, race, education, marital status, home ownership, phone ownership (landline telephone, cellular telephone or both) and sub-state region.

Implication: The BRFSS is an observational study because the data is collected in a way that does not directly interfere with how the data arise (on the contrary of an experiment where we want to investigate the possibility of a causal connection).

It uses random sampling where individuals of each state are randomly selected. Random sampling ensures that the sample is a good representation of the population. As it is not an experimental study, there are no random assignment. As a result, the BRFSS observations are not causable but generalizable.

Part 2: Research questions

I recently started to do regularly sports (3-4 times a week) so the research questions will be linked with individuals who are doing regularly a physical activities and how do they compare with the rest of the population. The questions are coming from some assumptions I have and I would like to check if, according to the BRFSS data, those assumptions seem to be indeed correct.

Research question 1:

One of the assumption I have is that although doing sport is good for your heatlh, doing a lot of physical activities can have a negative effect on your body. One example I have in mind would be for people who practice regularly weight lifting and the potential impact on their body.

Q: Do individuals who are doing a lot physical activities to strengthen their muscles are more likely to have join pain issues?

Variables to use:

strength: How Many Times Did You Do Physical Activities Or Exercises To Strengthen Your Muscles
lmtjoin3: Limited Because Of Joint Symptoms
joinpain: How Bad Was Joint Pain
arthdis2: Does Arthritis Affect Whether You Work

Research question 2:

This question is related to an article I read some time ago who was saying that the lowest social classes tend to be the first victims of unhealthy/junk foods. We will try here to have an understanding of the demographic profile of people who do not practice sport regularly.

Q: What is the demographic profile of the individuals who do not practice sport regularly compared with the rest of the population?

Variables to use:

exerany2: Exercise In Past 30 Days
educa: Education Level
employ1: Employment Status
income2: Income Level
sex: Respondents Sex

Research question 3:

The last question is related to the relation between physical activities and mental health. I often heard that the two goes together, so let’s check if it is the case:

Q: Do individuals who are doing regularly physical activities tend to be in a better mental health than people who do not?

Variables to use:

exeroft1: How Many Times Walking, Running, Jogging, Or Swimming
misdeprd: How Often Feel Depressed Past 30 Days
menthlth: Number Of Days Mental Health Not Good
lsatisfy: Satisfaction With Life

Part 3: Exploratory data analysis

Research question 1:

Q: Do individuals who are doing a lot physical activities to strenghthen their muscles are more likely to have join pain issues?

First, we will create a new variable based on the “strength” variable to determine if YES or NO, the individuals are considered as doing regularly strength related activities. Doing “regular” is subjective, but for this exercise, I will consider that doing 3 times or more a strength physical activities per week is “regular”.

Strengh variable coding:

1_ _ Times per week
2_ _ Times per month
8 8 8 Never
7 7 7 Don’t know / Not sure
9 9 9 Refused

# first we will check how many NA we have in the strength variable and we will create a new dataframe without them

print(sum(is.na(brfss2013$strength))) #number of NA in strength

## [1] 39819

ds_str <- filter(brfss2013, is.na(strength) == FALSE)

# new variable to determine individuals doing a lot of strength related activites

ds_str$High_strength <- ds_str$strength >= 103 & ds_str$strength <= 199 | ds_str$strength >= 212 & ds_str$strength <= 299

print(table(ds_str$High_strength))

## 
##  FALSE   TRUE 
## 363333  88623

# 88623 individuals are considered as doing a lot of strength physical activities

# We can now compare those 2 categories (lot of physical activities, not a lot) with the lmtjoin3 variable: "Are you now limited in any way in any of your usual activities because of arthritis or joint symptoms?". This variable is a Yes / No factor

print(sum(is.na(ds_str$lmtjoin3))) # lot of NA among the answers of this question

## [1] 299952

limited <- as.data.frame(table(ds_str$High_strength, ds_str$lmtjoin3))
colnames(limited) <- c("Frequent strength training", "Limitation due to joint pain", "Freq")

my.palette <- c("#edc951", "#eb6841", "#cc2a36", "#4f372d", "#00a0b0")
    
ggplot(limited, aes(x= `Frequent strength training`, y= Freq, fill= `Limitation due to joint pain`))+
    geom_bar(stat = "identity", position = "fill") +
    scale_fill_manual(values = my.palette)+
    theme_bw()+
    xlab("Practice a lot of strength related physical activities")+
    ylab("Freq")+
    theme(legend.position="top")+
    ggtitle("Usual activities limitations because of arthritis or joint symptoms?")

It doesn’t look like practicing regularly strength training correlate with joint pain. However there were many NA in the data so let’s look at the other variable related to joint pain.

joinpain variable: How Bad Was Joint Pain. Please think about the past 30 days, keeping in mind all of your joint pain or aching and # whether or not you have taken medication. DURING THE PAST 30 DAYS, how bad was # your joint pain ON AVERAGE? Please answer on a scale of 0 to 10 where 0 is no pain or aching and 10 is pain or aching as bad as it can be.

print(sum(is.na(ds_str$joinpain))) # 68% of NA among the answers of this question

## [1] 305499

pain <- as.data.frame(table(ds_str$joinpain, ds_str$High_strength))
colnames(pain) <- c("Joint Pain level", "Frequent strength training", "Freq")

perc_T <- prop.table(pain$Freq[pain$`Frequent strength training` == TRUE])
perc_F <- prop.table(pain$Freq[pain$`Frequent strength training` == FALSE])

perc <- c(perc_F, perc_T)

pain$Perc <- perc

ggplot(pain, aes(x= `Joint Pain level`, y= Perc, fill= `Frequent strength training`))+
    geom_bar(stat = "identity", position = "dodge") +
    scale_fill_manual(values = my.palette)+
    theme_bw()+
    xlab("Joint pain scale of 0 to 10 where 0 is no pain")+
    ylab("Percentage")+
    theme(legend.position="top")+
    ggtitle("Joint pain level over the last 30 days in percentage")

The joint pain for those 2 categories are mostly similar although people who are doing strength training regularly have a slightly smaller pain level than people who don’t.

To finish, we will check the variable arthsocl: During the past 30 days, to what extent has your arthritis or joint symptoms interfered with your normal social activities, such as going shopping, to the movies, or to religious or social gatherings?

soc <- as.data.frame(table(ds_str$High_strength, ds_str$arthsocl))
colnames(soc) <- c("Frequent strength training", "Interference with social activities", "Freq")


perc_T <- prop.table(soc$Freq[soc$`Frequent strength training` == TRUE])
perc_F <- prop.table(soc$Freq[soc$`Frequent strength training` == FALSE])

perc <- c(perc_F, perc_T)

soc$Perc <- perc

ggplot(soc, aes(x= `Interference with social activities`, y= Perc, fill= `Frequent strength training`))+
    geom_bar(stat = "identity", position = "dodge") +
    scale_fill_manual(values = my.palette)+
    theme_bw()+
    xlab("joint symptoms interference")+
    ylab("Percentage")+
    theme(legend.position="top")+
    ggtitle("Joint symptoms interference with normal social activities in percentage")

62% of the individuals doing regularly strength exercise say joint symptoms do not interfer with their social activities.

So on the contrary to my assumption, it does not look like people who are doing strength exercises regularly suffer more from join pain that the rest of the population. It sometimes even look like they suffer less (or are less disturbed) from potential joint pain. However, it is worth noting that the data was often NA.

Research question 2:

Q: What is the demographic profile of the individuals who do not practice sport regularly compared with the rest of the population?

Variables to use:

exerany2: Exercise In Past 30 Days
educa: Education Level
employ1: Employment Status
income2: Income Level
sex: Respondents Sex

We will start by checking if there is any difference among gender. In order to determine the individuals who do not practice sport regularly, we will use the exerany2 variable: Have you exercise In Past 30 Day.

gender <- as.data.frame(table(brfss2013$sex, brfss2013$exerany2))
colnames(gender) <- c("Sex", "Any physical activities during last month", "Freq")

ggplot(gender, aes(x= `Any physical activities during last month`, y= Freq, fill= Sex))+
    geom_bar(stat= "identity", position = "stack")+
    scale_fill_manual(values = my.palette)+
    theme_bw()+
    xlab("Physical activities during the last month")+
    ylab("Freq")+
    theme(legend.position="top")+
    ggtitle("Physical activities during the last month by sex")

73% of the interviewed individuals had some physical activities over the last 30 days.

28% of female interviewed didn’t do any physical activities over the last 30 days compared to 25% for male.

We will now check if there are any differences in the individual income.

income <- as.data.frame(table(brfss2013$income2, brfss2013$exerany2))
colnames(income) <- c("Category", "Any physical activities during last month", "Freq")

perc_T <- prop.table(income$Freq[income$`Any physical activities during last month` == "Yes"])
perc_F <- prop.table(income$Freq[income$`Any physical activities during last month` == "No"])

perc <- c(perc_T, perc_F)

income$Perc <- perc

ggplot(income, aes(x= Category, y= Perc, fill= `Any physical activities during last month`))+
    geom_bar(stat= "identity", position = "dodge")+
    scale_fill_manual(values = my.palette)+
    theme_bw()+
    xlab("Income categories")+
    ylab("Percentage")+
    theme(legend.position="top")+
    ggtitle("Income categories in percentage split by individuals \n who had physical activities or not")+
    coord_flip()

We can see that the individuals who did not do any physical activities over the last 30 days tend to be from lower income categories: 42% are earning less than USD 25,000. Among people who did some physical activities, 26% are earning less than USD 25,000. Moreover, 32% are earning USD 75,000 or more compared to 13% for people who did not do some physical activities.

Related to the income level, we can have a look at the employment status to pick any visible differences

status <- as.data.frame(table(brfss2013$employ1, brfss2013$exerany2))
colnames(status) <- c("Category", "Any physical activities during last month", "Freq")

perc_T <- prop.table(status$Freq[status$`Any physical activities during last month` == "Yes"])
perc_F <- prop.table(status$Freq[status$`Any physical activities during last month` == "No"])

perc <- c(perc_T, perc_F)

status$Perc <- perc

ggplot(status, aes(x= Category, y= Perc, fill= `Any physical activities during last month`))+
    geom_bar(stat= "identity", position = "dodge")+
    scale_fill_manual(values = my.palette)+
    theme_bw()+
    xlab("Employment status")+
    ylab("Percentage")+
    theme(legend.position="top")+
    ggtitle("Employment status categories in percentage split by  individuals \n who had physical activities or not")+
    coord_flip()

5.3% of the individuals who did not do any physical activities over the last 30 days are unable to work which may suggest some disabilities. 31% of the individuals who did not do any physical activities over the last 30 days are retired.

We will now check the education level.

edu <- as.data.frame(table(brfss2013$educa, brfss2013$exerany2))
colnames(edu) <- c("Category", "Physical activities during last month", "Freq")

perc_T <- prop.table(edu$Freq[edu$`Physical activities during last month` == "Yes"])
perc_F <- prop.table(edu$Freq[edu$`Physical activities during last month` == "No"])

perc <- c(perc_T, perc_F)

edu$Perc <- perc

ggplot(edu, aes(x= Category, y= Perc, fill= `Physical activities during last month`))+
    geom_bar(stat= "identity", position = "dodge")+
    scale_fill_manual(values = my.palette)+
    theme_bw()+
    xlab("")+
    ylab("Percentage")+
    theme(legend.position="top")+
    ggtitle("Education categories in percentage split by individuals who had \n physical activities or not")+
    coord_flip()

51% of the individuals who did not have any physical activities recently have a Grade 12 or GED or less. 40% of people who did had some physical activities have a college 4 years or more.

As a conclusion, we can see that people who didn’t exercise over the last month tend to be from lower income categories than people who did exercises. Similarly, they tend to have a lower education background. The female male ratio was slightly higher for female.

Research question 3:

Q: Do individuals who are doing regularly physical activities tend to be in a better mental health than people who do not?

Variables to use:

exeroft1: How Many Times Walking, Running, Jogging, Or Swimming
misdeprd: How Often Feel Depressed Past 30 Days
menthlth: Number Of Days Mental Health Not Good
lsatisfy: Satisfaction With Life

We will start by selecting people that we consider as doing physical activities regularly with the exeroft1 variable. We will take people who are doing physical activities at least 5 times a week.

exeroft1 variable: How many times per week or per month did you take part in this activity (pre chosen before) during the past month?

1_ _ Times per week
2_ _ Times per month
7 7 7 Don’t know / Not sure
9 9 9 Refused

print(sum(is.na(brfss2013$exeroft1))) #number of NA in exeroft1

## [1] 164165

ds_phy <- filter(brfss2013, is.na(exeroft1) == FALSE)

# selecting people doing regularly physical activities
ds_phy$High_phy <- ds_phy$exeroft1 >= 105 & ds_phy$exeroft1 <= 199 | ds_phy$exeroft1 >= 220 & ds_phy$exeroft1 <= 299

print(table(ds_phy$High_phy)) # 37% are in the regular physical activities category

## 
##  FALSE   TRUE 
## 205035 122575

# lsatisfy: Satisfaction With Life
satisfy <- as.data.frame(table(ds_phy$High_phy, ds_phy$lsatisfy))

colnames(satisfy) <- c("Regular physical activities", "Life satisfaction", "Freq")

ggplot(satisfy, aes(x= `Regular physical activities`, y= Freq, fill= `Life satisfaction`))+
    geom_bar(stat= "identity", position = "fill")+
    scale_fill_manual(values = my.palette)+
    theme_bw()+
    xlab("Regular physical activities")+
    ylab("Percentage")+
    theme(legend.position="top")+
    ggtitle("Life satisfaction split by individuals who are doing regularly physical activities or not")+
    coord_flip()

We can see that there are not so many differences between the 2 categories. People who are doing regularly physical activities have a slightly higher “very satisfied” level. “very dissatisfied” and “dissatisfied” tend to have more or less the same weight.

misdeprd variable: How Often Feel Depressed Past 30 Days

depr <- as.data.frame(table(ds_phy$High_phy, ds_phy$misdeprd))

colnames(depr) <- c("Regular physical activities", "Depression level", "Freq")

ggplot(depr, aes(x= `Regular physical activities`, y= Freq, fill= `Depression level`))+
    geom_bar(stat= "identity", position = "fill")+
    scale_fill_manual(values = my.palette)+
    theme_bw()+
    xlab("Regular physical activities")+
    ylab("Percentage")+
    theme(legend.position="top")+
    ggtitle("Depression level split by individuals who are doing regularly physical activities or not")+
    coord_flip()

Once again the 2 categories are highly similar and no particular differences are visible.

To finish, we will check the variable menthlth: Number Of Days Mental Health Not Good over the last month.

As the menthlth variable is discrete, we are going to be able to use the fivenum function

print(fivenum(ds_phy$menthlth[ds_phy$High_phy == TRUE]))

## [1]  0  0  0  1 30

print(fivenum(ds_phy$menthlth[ds_phy$High_phy == FALSE]))

## [1]  0  0  0  2 30

#fiv number shows* the min, Q1, median, Q3, the max

The min, Q1 and median are in both case at 0. The max are both at 30 and only the Q3 slightly differs

what if we select only the people who said they were at least one day depressed?

# if we filter out the 0 from the data
ment_modified <- filter(ds_phy, menthlth != 0) 

print(fivenum(ment_modified$menthlth[ment_modified$High_phy == TRUE]))

## [1]  1  2  5 15 30

print(fivenum(ment_modified$menthlth[ment_modified$High_phy == FALSE]))

## [1]  1  2  5 14 30

Once again, it looks mostly similar. We can show it with a box plot

ggplot(ment_modified, aes(x= High_phy, y= menthlth)) + geom_boxplot() +
    theme_bw()+
    xlab("Regular physical activity")+
    ylab("Number of days depressed")+
    ggtitle("Box plot: Number of depression days over the last month \n for the individuals who felt depressed")

Contrary to the basic assumption I had, using the BRFSS database, it does not look like there is a clear visible correlation between practising physical exercises and mental health