Setup

Load packages

library(ggplot2)
library(dplyr)

Load data

load("brfss2013.RData")

Part 1: Data

The United States (U.S.) Centers for Disease Control (CDC) collects data from all 50 states and also Washington DC and three territories of the U.S. (Reference 1). This is known as the Behavioral Risk Factor Surveillance System (BRFSS). The data is collected by the state health departments by performing telephone interviews (both cellular and landline) using a standardized questionnaire from the CDC. The questionnaire is designed to collect “prevalence data” regarding risk behaviors and also preventive practices for good health. The data is forwarded to CDC, where it is tabulated by a standardized method and then returned to the states for annual publishing (Reference 1).

The questionnaire includes standard core questions that are included each year and must be asked by all states, rotating core questions that are asked every other year by all states, optional modules that states can decide whether or not to include, and state-added questions. States are allowed to contract out the interview work, but it must be done in accordance with the BRFF protocols (Reference 2).

The protocol requires that the timing of the interviews is distributed (i.e., 20% on weekdays, 80% on weeknights and weekends, change schedules to accommodate holidays and special events) (Reference 2). This distribution of interview time helps to prevent addition of bias into the sample and therefore helps to keep the sample more random than if all of the interviews were performed on the same schedule.

The study excludes (as households) vacation homes not occupied more than 30 days per year, and institutions. However, “since 2011, adult students living in college housing were included as eligible respondents.” (Reference 2).

Rather than use simple random sampling, to achieve greater efficiency, the BRFSS utilizes Disproportionate Stratified Sampling (DSS). “In disproportionate stratified random sampling, the different strata do not have the same sampling fractions as each other…Using a stratified sample will always achieve greater precision than a simple random sample, provided that the strata have been chosen so that members of the same stratum are as similar as possible in terms of the characteristic of interest…A final advantage is that a stratified sample guarantees better coverage of the population.” (Reference 3). The BRFSS specifies the strata based upon the telephone number characteristics (i.e., listed, not listed, landline, cellular, etc.).

Some of the variables in the BRFSS are calculated variables such as Body Mass Index if the respondent was asked to report height and weight (Reference 2).

Generalizability and Causality:

As discussed above, the data is not from a pure random sample; however, methods included in the BRFSS sampling protocol (disproportionate stratified sampling, timing of interviews to increase randomization, representation by each state) that ensure wide representation of respondents and to minimize sample bias. Within each strata the sampling is approximately random; therefore, the data can be generalized to the population. One potential limitation of the survey may be response bias. People who tend to have high or very high-risk health-related behaviors may be less likely to respond to the survey or may answer some questions untruthfully. The data does not come from a controlled experiment, so no causal (cause and effect) conclusions can be made based solely upon this data.

Part 1 References:

  1. The Behavioral Risk Ractor Surveillance System, (CDC) Centers for Disease Control and Prevention, https://www.cdc.gov/brfss/ web-accessed 29MAY2021.
  2. The BRFSS Data User Guide, 15AUG2013, (CDC) Centers for Disease Control and Prevention, https://www.cdc.gov/brfss/data_documentation/pdf/UserguideJune2013.pdf web-accessed 29MAY2021.
  3. Understanding Stratified Samples and How to Make Them, Crossman, Ashley, Updated 27JAN2020, ThoughtCo. https://www.thoughtco.com/stratified-sampling-3026731 web accessed 29MAY2021.
  4. The Behavioral Risk Factor Surveillance System (BRFSS) - 2013, https://d18ky98rnyall9.cloudfront.net/_e34476fda339107329fc316d1f98e042_brfss_codebook.html?Expires=1622419200&Signature=d-iUq6NEDXx2qpVRrAN7QQjSNfK72j6goRK3eZqW9wd6uC-TvivDMNQV0Di7BU~NyLam7nzxCSzn5dhOdveBmR-mqz0L1LJgqkmfFNHH-7YY4UkgHd8jp4p98CRszDJq9mAmP9Ci2oOgCID9fCNNWxwfW3oXgW8jYqNOGn7SaQY_&Key-Pair-Id=APKAJLTNE6QMUY6HBC5A web-accessed 29MAY2021 - 30MAY2021

Part 2: Research questions

Research Question 1:

Are married people more likely to have their cholesterol checked? It would be interesting to see how this factor (married or not married) correlates to health.

To explore this research question, three variables will be used. The first variable, “marital” is a variable that describes the person’s marital status. The second variable that will be evaluated for this question is “cholchk” and describes if the person had cholesterol checked in the past five years, or five years or greater. The third variable “wedded” will be created in order to place people into two categories instead of six, for marital status.

It is of particular interest to this author because it will be interesting to see if people who are married tend to monitor their health status more than people who are not married, with checking cholesterol being a “marker” for personal health monitoring.

This (or any of the research questions in this project) is not intended to infer or explore any causal (cause and effect) analysis.

Research Question 2:

People are spending so much of their energy focused on working to make more money! It would be interesting to see how much people work, and how the amount of time worked correlates to their depression level. The research question is, do people who work more have higher depression levels?

To explore this research question, four variables will be used: “scntwrk1” (how many hours per week do you work), “misdeprd” (how often feel depressed in past 30 days) (Reference 5). A new variable “hardworker” will be created to create two categories of workers, those who work 40 hours or less and those who work greater than 40 hours. Additionally a new variable “prop” will be created to help to show a table of proportions of workers in two different levels (40 hours or less, or >40 hours) versus levels of reported depression.

Research Question 3:

In addition to looking at depression levels, it would also be interesting to see if people who worked more hours have higher levels of anxiety. The research question is: Do people who work more tend to experience nervousness more often?

To explore this research question, four variables will be used: “scntwrk1” (how many hours per week do you work), “misnervs” (how often feel nervous in past 30 days) (Reference 5). The new variable “hardworker” which was created to create two categories of workers, those who work 40 hours or less and those who work greater than 40 hours. Additionally a new variable “proprtn” will be created to help to show a table of proportions of workers in two different levels (40 hours or less, or >40 hours) versus levels of reported nervousness Note: a new variable “proprtn” will be created in addition to the created variable “prop” that is used to created a table for depression because they are different proportions, while the reused variables “scntwrk1” and “hardworker” are the same for both research questions # 2 and # 3.


Part 3: Exploratory data analysis

Research Question 1: The first variable we will look at is marital, which is a categorical variable with values that represent six levels (factors) of marital status. The seventh level in the table below is “NA” indicating that the information was Not Available.

brfss2013%>%
  group_by(marital) %>%
  summarize(count=n(), proportn = n() / sum (253329,70376,65745,10662,75070,13173,3420)) 
## # A tibble: 7 x 3
##   marital                          count proportn
## * <fct>                            <int>    <dbl>
## 1 Married                         253329  0.515  
## 2 Divorced                         70376  0.143  
## 3 Widowed                          65745  0.134  
## 4 Separated                        10662  0.0217 
## 5 Never married                    75070  0.153  
## 6 A member of an unmarried couple  13173  0.0268 
## 7 <NA>                              3420  0.00695

To separated these categories of marital status into just two categories of married people and unmarried people, we will use the mutate() function to create a new variable called “wedded” that is a categorical (character) variable with two levels and will take the value of “wed” if they are married and take the value of “not_wed” if they are not married (divorced, widowed, separated, never married or a member of an unmarried couple). Note: some people would consider some of these levels actually married, but for this study they are not. We will also filter out the observations where the information is not available (NA).

brfss2013 <- brfss2013 %>%
  mutate(wedded = ifelse(marital=="Married", "wed", "not_wed"))
brfss2013 %>%
  group_by(wedded) %>%
  filter(!(is.na(wedded))) %>%
  summarize(count=n(), proportion = n() / sum (253329,235026) )
## # A tibble: 2 x 3
##   wedded   count proportion
##   <chr>    <int>      <dbl>
## 1 not_wed 235026      0.481
## 2 wed     253329      0.519

We can double-check this by observing that the number “not_wed” 235026 = 70376 + 65745 + 10662 + 75070 + 13173 and that the number “wed” is the same value as “married” in the variable marital. This checks out fine. So, approximately 51.9% of respondents reported that they were married and approximately 48.1% of respondents reported they were something other than married.

Now we want to look at when people has their cholesterol checked. To see this, we will look at the sample statistics: (Please note that these are categorical variables and do not have sample statistics such as mean, median, standard deviation, etc. since they are not numeric.)

brfss2013 %>%
  filter(
    !is.na(wedded), 
    !is.na(cholchk)
  ) %>%
  count(wedded, cholchk) %>%
  group_by(wedded) %>%
  mutate(prop = n / sum(n))
## # A tibble: 8 x 4
## # Groups:   wedded [2]
##   wedded  cholchk                  n   prop
##   <chr>   <fct>                <int>  <dbl>
## 1 not_wed Within past year    146784 0.777 
## 2 not_wed Within past 2 years  21313 0.113 
## 3 not_wed Within past 5 years  13132 0.0695
## 4 not_wed 5 or more years ago   7746 0.0410
## 5 wed     Within past year    174441 0.769 
## 6 wed     Within past 2 years  27903 0.123 
## 7 wed     Within past 5 years  16675 0.0735
## 8 wed     5 or more years ago   7885 0.0348

Note: this only includes the people who reported both marital status and whether they had their cholesterol checked. If there was missing information in either variable then the row was removed in this table.

We can see from the summary statistics table that of the not_wed (not married) people, 77.7% had cholesterol checked in the last year, 11.3% had it checked within the last two years, 6.9% had it checked in the last five years, 4.1% had it checked five or more years ago. Summary statistics such as mean, median, standard deviation are not applicable here since this is a categorical variable, and not a numeric variable.

We can see that of the wed (married) people, 76.9.9% had cholesterol checked in the last year, 12.3% had it checked in the last two years, 7.3% had it checked in the last five years, 3.5% had it checked five or more years ago.

If we combine those who had cholesterol checked within the last year and the last two years, we see that: 89.0% of not married people had it checked in the last year or two. (0.776738 + 0.11278 = 0.890 = 89.0%) 89.2% of married people had it checked in the last year or two. (0.768788 + 0.12297 = 0.892 = 89.2%)

brfss2013 %>%
    filter((wedded %in% c("not_wed", "wed"))) %>% 
    filter((cholchk %in% c("Within past year", "Within past 2 years", "Within past 5 years", "5 or more years ago"))) %>% 
    ggplot() +
    geom_bar(mapping = aes(x=wedded, fill=cholchk)) +
    ggtitle("Plot of cholesterol check \n by marital status")

The answer to the research question is that married people are approximately just as likely (89.2% versus 89.0%) to have had their cholesterol checked in the last year or two as people who are not married. It is possible that other factors (confounding variables) may influence this as well, such as age, since younger people may be less likely to be married and less likely to need cholesterol checking.

Part 3, Research Question 1 References:

  1. ggplot2 bar plot with two categorical variables, Stack Overflow, https://stackoverflow.com/questions/24895575/ggplot2-bar-plot-with-two-categorical-variables web-accessed 29MAY2021.

  2. UC Business Analytics R Programming Guide, Categorical Data Descriptive Statistics, https://uc-r.github.io/descriptives_categorical web-accessed 29MAY2021.

  3. Removing NA’s from ggplot2, RStudio Community, https://community.rstudio.com/t/removing-nas-from-ggplot2/47030/3 web-accessed 30MAY2021


Research Question 2:

For the second research question, we will look at two existing variables scntwrk1 (how many hours per week do you work), misdeprd (how often feel depressed past 30 days), and we will introduce new calculated variables hardworker to separate people into two groups based on how much they worked and prop to help to display the proportions in a table.

brfss2013 %>%
  select(scntwrk1, misdeprd) %>%
  str()
## 'data.frame':    491775 obs. of  2 variables:
##  $ scntwrk1: int  NA 35 40 NA NA 40 40 NA NA NA ...
##  $ misdeprd: Factor w/ 5 levels "All","Most","Some",..: 5 5 5 5 NA NA NA NA NA NA ...

We can see that scntwrk1 (hours worked per week) is an integer, but misdeprd is a factor. It is an ordinal factor with five levels, meaning that the levels can be put in an ascending or descending order.

Let’s look at the summary statistics for the integer variable scntwrk1 which is the number of hours people reported working.

brfss2013 %>%
  filter(!(is.na(scntwrk1))) %>%
  summarize(mean_wrk=mean(scntwrk1), median_wrk=median(scntwrk1),
sd_wrk=sd(scntwrk1), min_wrk = min(scntwrk1), max_wrk=max(scntwrk1), count=n())
##   mean_wrk median_wrk   sd_wrk min_wrk max_wrk count
## 1 43.03659         40 15.85198       0      98 32362

We can see that since the mean is greater value than the median, the data are likely somewhat right-skewed from a normal distribution. We also see that the standard deviation is quite wide. We can also see that we only have 32362 observations that have values for this variable. Let’s see how many NA’s we have.

brfss2013 %>%
  group_by(scntwrk1) %>%
  summarize(count = n()) %>%
  filter(is.na(scntwrk1))
## # A tibble: 1 x 2
##   scntwrk1  count
##      <int>  <int>
## 1       NA 459413

Yes that is a lot of NA’s (459413). THere are still 32362 observations that did have data for this variable. Hopefully those same observations also have data reported for the other variable misdeprd. We will remove the NA’s from the graphic when we create the histogram below.

# Create a histogram of `scntwrk1`. Notice that we filter out the NA values using subset


ggplot(data = subset (brfss2013, !is.na(scntwrk1)), aes(x = scntwrk1)) +
  geom_histogram(binwidth = 2) +
  ggtitle("Histogram of hours worked last week (variable scntwrk1)")

In the plot above, we can see the distribution of how many hours per week different people report working. The survey question was “About how many hours do you work per week on all of your jobs and businesses combined?” (Reference 5) We see that it appears to be somewhat normally distributed; however, it has long, light tails to the left and right (possibly too long and light to be considered normal). The median value of 40 hours is prevalent.

Here are the summary statistics of the variable scntwrk1:

# Calculated Summary Statistics for the Variable scntwrk1
brfss2013 %>%
  filter(!(is.na(scntwrk1))) %>%
  summarize (work.mean = mean(scntwrk1), work.median = median(scntwrk1), work.sd = sd(scntwrk1),
             work.max = max(scntwrk1), work.min = min(scntwrk1))
##   work.mean work.median  work.sd work.max work.min
## 1  43.03659          40 15.85198       98        0
# Creating the quantiles-quantiles plot to check for normal distribution
ggplot(brfss2013, aes(sample = scntwrk1)) + stat_qq() +
ggtitle("Q-Q plot for variable scntwrk1 to check for normal distribution")
## Warning: Removed 459413 rows containing non-finite values (stat_qq).

The Q-Q plot confirms that the amount of hours people report working is not normally distributed. If it was normally distributed, this plot would approximate a straight, diagonal line. The S-shape of the plot indicates that the tails are too light, and that is also evident in the distribution plot.

Now let’s take a look at depression levels that people reported. The survey question was “During the past 30 days, about how often did you feel so depressed that nothing could cheer you up? (all, most, some, a little, or none of the time.)” (Reference 5)

# Create a histogram of `misdeprd`. Notice that we filter out the NA values using subset


ggplot(data = subset (brfss2013, !is.na(misdeprd)), aes(x = misdeprd)) +
  geom_bar() +
  ggtitle("Histogram of depression levels (variable misdeprd)")

In the bar-plot above, we can see the distribution of people who felt depressed in the last 30 days.

Let’s combine the results and look at a distribution of reported number of hours worked, color-coded for reported depression levels.

# Create a histogram of hours worked color coded for depression levels.

ggplot(data = subset (brfss2013, !is.na(misdeprd)), aes(x = scntwrk1,
           fill = misdeprd)) +
  geom_bar(position = "stack") +
  ggtitle("Histogram of hours worked last week, color coded for depression levels")
## Warning: Removed 26472 rows containing non-finite values (stat_count).

Notice that while the shape of the distribution is roughly the same as it was in the last histogram of hours worked, there are fewer observations on the plot above. This is because this plot requires observations where people reported both hours worked, and also depression levels. If they only reported one or the other (or neither) on the survey, then this plot removes the observation from the data-set for the plot; (that’s why the message “warning: removed 26472 rows containing non-finite values (stat_count)”).

Based on the plot, it does not visually appear that people who reported working longer hours reported suffering more from depression (since the proportions of the color-coding to the left of the median looks about the same as the color-coding to the right of the median).

To examine further, let’s create a new variable hardworker that will divide the workers into two groups. One group reported working 40 hours or less and the other group reported working more than 40 hours. Let’s take those two categories and color-code them for depression levels.

brfss2013 <- brfss2013 %>%
  mutate(hardworker = ifelse(scntwrk1<=40, "forty.or.less", "more.than.40"))
  


# Create a histogram of two groups of workers color coded for depression levels.
brfss2013 %>%
    filter((hardworker %in% c("forty.or.less", "more.than.40"))) %>% 
    filter((misdeprd %in% c("All", "Most", "Some", "A little", "None"))) %>% 
    ggplot() +
    geom_bar(mapping = aes(x=hardworker, fill=misdeprd)) +
    ggtitle("Plot of two groups of workers, color coded for depression levels")

It does not appear that people who worked over 40 hours experienced higher depression levels. Let’s create a probability table to see the proportion of people who reported different levels of depression among the two groups (those who worked 40 or less hours versus those who worked more than 40 hours).

brfss2013 %>%
  filter(
    !is.na(hardworker), 
    !is.na(misdeprd)
  ) %>%
  count(hardworker, misdeprd) %>%
  group_by(hardworker) %>%
  mutate(prop = n / sum(n))
## # A tibble: 10 x 4
## # Groups:   hardworker [2]
##    hardworker    misdeprd     n    prop
##    <chr>         <fct>    <int>   <dbl>
##  1 forty.or.less All         15 0.00268
##  2 forty.or.less Most        30 0.00536
##  3 forty.or.less Some       167 0.0299 
##  4 forty.or.less A little   341 0.0610 
##  5 forty.or.less None      5040 0.901  
##  6 more.than.40  All          7 0.00173
##  7 more.than.40  Most        18 0.00445
##  8 more.than.40  Some        78 0.0193 
##  9 more.than.40  A little   208 0.0515 
## 10 more.than.40  None      3730 0.923

The distributions are approximately the same. The people who worked 40 hours or less reported very slightly higher depression levels as listed below.

92.3% of those who worked over 40 hours had no depression (versus 90.1% of those who worked 40 hours or less).

5.1% of those who worked over 40 hours had a little depression (versus 6.1% of those who worked 40 hours or less).

1.9% of those who worked over 40 hours had some depression (versus 3.0% of those who worked 40 hours or less).

0.4% of those who worked over 40 hours were mostly depressed (versus 0.5% of those who worked 40 hours or less).

0.2% of those who worked over 40 hours were always depressed (versus 0.3% of those who worked 40 hours or less).

Based upon this analysis, the data indicate that people who work more do not have higher depression levels.


Research Question 3:

The third research question is “Do people who work more tend to experience nervousness more often?”

The survey question was “About how often during the past 30 days did you feel nervous- would you say all of the time, most of the time, some of the time, a little of the time or none of the time?” (Reference 5) The variable is misnervs.

As stated above in Part 2 (Research Questions), To explore this research question, four variables will be used: scntwrk1 (how many hours per week do you work), misnervs (how often feel nervous in past 30 days) (Reference 5). The new variable hardworker which was created to create two categories of workers, those who work 40 hours or less and those who work greater than 40 hours. Additionally a new variable proprtn will be created to help to show a table of proportions of workers in two different levels (40 hours or less, or >40 hours) versus levels of reported nervousness.

Since we have already evaluated the variable scntwrk in the last question, let’s evaluate the variable msnervs which shows how often people felt nervous in the past 30 days.

# Create a histogram of `misnervs`. Notice that we filter out the NA values using subset


ggplot(data = subset (brfss2013, !is.na(misnervs)), aes(x = misnervs)) +
  geom_bar() +
  ggtitle("Histogram of nervousness levels (variable misnervs)")

This is the distribution of how often people reported feeling nervous in the past 30 days. We can visually see that it is higher than what people reported for feeling depression; (visually comparring the plots).

Let’s combine reported levels of nervousness with amount of time spent at work.

# Create a histogram of hours worked color coded for nervousness levels.

ggplot(data = subset (brfss2013, !is.na(misnervs)), aes(x = scntwrk1,
           fill = misnervs)) +
  geom_bar(position = "stack") +
  ggtitle("Histogram of hours worked last week, color coded for nervousness levels")
## Warning: Removed 26516 rows containing non-finite values (stat_count).

Based on the plot above, it does not appear that people who worked more experienced higher levels of nervousness. Let’s group the people who worked 40 hours or less and compare that to people who worked more than 40 hours.

# Create a histogram of two groups of workers color coded for depression levels.
brfss2013 %>%
    filter((hardworker %in% c("forty.or.less", "more.than.40"))) %>% 
    filter((misnervs %in% c("All", "Most", "Some", "A little", "None"))) %>% 
    ggplot() +
    geom_bar(mapping = aes(x=hardworker, fill=misnervs)) +
    ggtitle("Plot of two groups of workers, color coded for nervousness levels")

Visually, the distributions look roughly similar. Let’s get the exact percentages by calculating the sample statistics in a table. We will use the mutate function to create the new variable proprtn to show the proportions.

brfss2013 %>%
  filter(
    !is.na(hardworker), 
    !is.na(misnervs)
  ) %>%
  count(hardworker, misnervs) %>%
  group_by(hardworker) %>%
  mutate(proprtn = n / sum(n))
## # A tibble: 10 x 4
## # Groups:   hardworker [2]
##    hardworker    misnervs     n proprtn
##    <chr>         <fct>    <int>   <dbl>
##  1 forty.or.less All         65 0.0116 
##  2 forty.or.less Most       126 0.0225 
##  3 forty.or.less Some       726 0.130  
##  4 forty.or.less A little  1814 0.325  
##  5 forty.or.less None      2859 0.511  
##  6 more.than.40  All         36 0.00891
##  7 more.than.40  Most        90 0.0223 
##  8 more.than.40  Some       512 0.127  
##  9 more.than.40  A little  1294 0.320  
## 10 more.than.40  None      2107 0.522

The distributions are very similar indeed. The people who worked 40 hours or less reported very slightly higher nervousness levels as listed below.

52.1% of those who worked over 40 hours had no nervousness (versus 51.1% of those who worked 40 hours or less).

32.0% of those who worked over 40 hours had a little nervousness (versus 32.4% of those who worked 40 hours or less).

12.7% of those who worked over 40 hours had some nervousness (versus 13.0% of those who worked 40 hours or less).

2.2% of those who worked over 40 hours were mostly nervous (versus 2.3% of those who worked 40 hours or less).

0.9% of those who worked over 40 hours were always nervous (versus 1.1% of those who worked 40 hours or less).

Based upon this analysis, the data indicate that people who work more do not tend to experience nervousness more often.


Part 3, Research Questions 2 and 3 References:

  1. Eliminating NAs from a ggplot, Stack Overflow, https://stackoverflow.com/questions/17216358/eliminating-nas-from-a-ggplot web-accessed 29MAY2021.

  2. Data Visualization with R, Kabacoff, Rob, 01DEC2020, Chapter 4 Bivariate Graphs https://rkabacoff.github.io/datavis/Bivariate.html web-accessed 30MAY2021

  3. Removing NA’s from ggplot2, RStudio Community, https://community.rstudio.com/t/removing-nas-from-ggplot2/47030/3 web-accessed 30MAY2021

  4. Conditional Probability with dplyr, RStudio Community, https://community.rstudio.com/t/conditional-probability-with-dplyr/5117 web-accessed 30MAY2021

  5. The Behavioral Risk Factor Surveillance System (BRFSS) - 2013, https://d18ky98rnyall9.cloudfront.net/_e34476fda339107329fc316d1f98e042_brfss_codebook.html?Expires=1622419200&Signature=d-iUq6NEDXx2qpVRrAN7QQjSNfK72j6goRK3eZqW9wd6uC-TvivDMNQV0Di7BU~NyLam7nzxCSzn5dhOdveBmR-mqz0L1LJgqkmfFNHH-7YY4UkgHd8jp4p98CRszDJq9mAmP9Ci2oOgCID9fCNNWxwfW3oXgW8jYqNOGn7SaQY_&Key-Pair-Id=APKAJLTNE6QMUY6HBC5A web-accessed 29MAY2021 - 30MAY2021

  6. ggplot2 qq plot (quantile-quantile graph): Quick Start Guide - R Software and Data Visualization, STHDA, http://www.sthda.com/english/wiki/ggplot2-qq-plot-quantile-quantile-graph-quick-start-guide-r-software-and-data-visualization, web accessed 30MAY2021