Exploring the BRFSS data

Setup

Load packages

library(ggplot2)
library(dplyr)
library(tidyr)

Load data

load("brfss2013.RData")

Joseph Antony | 18th December 2020

Part 1: Data

Sampling Method

BRFSS conducted the survey in every U.S. state, dividing the U.S. population into two subgroups: those interviewed by landline telephone and those interviewed by cellular telephone. According to the BRFSS codebook document, “In conducting the BRFSS landline telephone survey, interviewers collect data from a randomly selected adult in a household. In conducting the cellular telephone version of the BRFSS questionnaire, interviewers collect data from an adult who participates by using a cellular telephone and resides in a private residence or college housing.”

This indicates that BRFSS employed a stratified sampling method, in which the population is divided into subgroups and random sampling is then applied to select a sample from each subgroup.
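As a rough illustration of the idea (not the actual BRFSS procedure, which is more involved), stratified sampling can be sketched in R on a made-up population. The data frame `population`, its columns, the 25% sampling fraction, and the seed are all assumptions chosen for the example.

```r
library(dplyr)

# Hypothetical toy population: 60 landline and 40 cellular respondents.
population <- data.frame(
  id    = 1:100,
  phone = rep(c("landline", "cellular"), times = c(60, 40))
)

set.seed(2013)  # arbitrary seed so the draw is repeatable

# Stratified sampling: draw a simple random sample within each stratum.
strat_sample <- population %>%
  group_by(phone) %>%
  slice_sample(prop = 0.25) %>%
  ungroup()

# Number of sampled rows per stratum.
count(strat_sample, phone)
```

Sampling the same fraction from every stratum keeps each subgroup represented in proportion to its size in the population.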

Study Type

Since no treatment was applied to the sample population and only a survey was conducted, the study is observational.

Generalizability

Because a large sample was drawn randomly from both subgroups, the results can be generalized to the adult U.S. population.

Causality

As mentioned earlier, this is an observational study: no treatment was applied to the sample group and respondents were not randomly assigned to groups. Therefore, causality cannot be inferred from the data.

Part 2: Research questions

Research question 1:

Is there a correlation between the average hours people sleep and their mental health within the past 30 days? Furthermore, how does this vary by gender?

My previous job on oil rigs required me to work alternating six-hour shifts around the clock (6 hours work - 6 hours rest - 6 hours work - 6 hours rest). On average, I could only manage to sleep 2.5 hours per rest period, for a total of approximately 5 hours of sleep a day. I did not feel good mentally on those days. Hence, I am interested to see whether there is any correlation in this case.

Research question 2:

How do average working hours per week vary among different ethnic groups in the U.S.? Furthermore, are there differences in average working hours per week between genders?

Very few studies have looked into the work-life balance among different ethnic groups. Hence, I am curious to see the average working hours per week for people of different ethnic groups in the U.S.

Research question 3:

Is there a correlation between smoking and people’s general health status in the U.S.? Does this differ between male and female respondents?

Numerous studies highlight the dangers smoking poses to a person’s health. Therefore, it would be interesting to see whether the data collected here supports this claim and shows a correlation between smoking and general health.

Part 3: Exploratory data analysis

Research question 1: Sleep vs Mental Health

# The variables used are:

#sleptim1: How Much Time Do You Sleep (within 24-hour period).
#menthlth: Number Of Days Mental Health Not Good (during the past 30 days).
#sex: Respondent's sex.

#Summarizing the number of hours and their counts:

table(brfss2013$sleptim1)

## 
##      0      1      2      3      4      5      6      7      8      9     10 
##      1    228   1076   3496  14261  33436 106197 142469 141102  23800  12102 
##     11     12     13     14     15     16     17     18     19     20     21 
##    833   3675    199    447    367    369     35    164     13     64      3 
##     22     23     24    103    450 
##     10      4     35      1      1

#There are 2 main outliers - 103 and 450. It is impossible to sleep for 103 or 450 hours 
#in a day that only has 24 hours. 

#Next, summarizing days of mental health not good in a month.

table(brfss2013$menthlth)

## 
##      0      1      2      3      4      5      6      7      8      9     10 
## 334461  15206  23520  13593   6660  16654   1861   6353   1244    198  11917 
##     11     12     13     14     15     16     17     18     19     20     21 
##     74    812     97   2516  10910    161    105    181     45   6633    497 
##     22     23     24     25     26     27     28     29     30    247   5000 
##    113     61     87   2318     89    164    680    415  25521      1      1

#Here too, there are 2 outliers - 247 and 5000 days. 

#Both of the above variables are discrete. Now, using the dplyr and tidyr 
#packages, the data will be arranged in a way that provides meaningful insight. 
#Outliers in both variables will be removed along with NAs. 

#The variable 'sex' will also be included.
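The pipeline that follows only calls `is.na()` on `sex`; missing values in the two numeric variables are removed implicitly, because `filter()` keeps only rows where a condition evaluates to `TRUE`. A minimal sketch on made-up values (the data frame `toy` is hypothetical):

```r
library(dplyr)

# Hypothetical toy values: one impossible entry (103) and one missing entry.
toy <- data.frame(sleptim1 = c(7, 8, 103, NA, 6))

# filter() keeps only rows where the condition is TRUE, so both the
# out-of-range value and the NA (where the comparison yields NA) are dropped.
kept <- toy %>% filter(sleptim1 <= 24)

kept$sleptim1
```

An explicit `!is.na(sleptim1)` would therefore be redundant, although some authors add it to make the intent visible.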

brfss2013 %>%
  select(menthlth, sleptim1, sex) %>%
  filter(sleptim1 <= 24, menthlth <= 30, !is.na(sex)) %>%
  group_by(menthlth, sex) %>%
  summarize(mn_slp = mean(sleptim1)) %>%
  spread(key = sex, value = "mn_slp")

## # A tibble: 31 x 3
## # Groups:   menthlth [31]
##    menthlth  Male Female
##       <int> <dbl>  <dbl>
##  1        0  7.13   7.18
##  2        1  6.97   7.07
##  3        2  6.93   7.04
##  4        3  6.86   6.98
##  5        4  6.88   6.92
##  6        5  6.84   6.96
##  7        6  6.69   6.87
##  8        7  6.78   6.86
##  9        8  6.78   6.98
## 10        9  6.65   6.88
## # ... with 21 more rows
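In current versions of tidyr, `spread()` is superseded by `pivot_wider()`. The sketch below shows the same long-to-wide reshaping on a small made-up summary; the values in `long` are invented for illustration and are not BRFSS results.

```r
library(dplyr)
library(tidyr)

# Hypothetical summary in long form: one row per (menthlth, sex) pair.
long <- data.frame(
  menthlth = c(0, 0, 1, 1),
  sex      = c("Male", "Female", "Male", "Female"),
  mn_slp   = c(7.1, 7.2, 7.0, 7.1)
)

# pivot_wider() turns each value of `sex` into its own column,
# which is the same reshaping spread() performs.
wide <- long %>%
  pivot_wider(names_from = sex, values_from = mn_slp)

wide
```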

#Here, each row is a number of days in the past 30 days on which respondents felt their 
#mental health was not good, together with the mean number of hours per day slept by the 
#male and female respondents who reported that number of days.

#Plotting the variables using ggplot2 package.

slp.mnth <- brfss2013 %>%
  select(menthlth, sleptim1, sex) %>%
  filter(sleptim1 <= 24, menthlth <= 30, !is.na(sex)) %>%
  group_by(menthlth, sex) %>%
  summarize(mn_slp = mean(sleptim1))

 
ggplot(slp.mnth, aes(x = menthlth, y = mn_slp)) + 
  geom_point(aes(color = sex)) + 
  #One linear trend line fitted to all points (both sexes), without a confidence band.
  stat_smooth(method = "lm", se = FALSE) + 
  labs(title = "Mean sleeping hours & mental health in a month", 
       y = "mn_slp: Mean hours slept", 
       x = "menthlth: Number of days Mental Health not good in a month") + 
  theme(plot.title = element_text(hjust = 0.5))

#Here, a scatter plot is used for the two numerical variables. The colors of the dots 
#distinguish between male and female respondents. 

#The plot illustrates a strong negative correlation between the two variables: 
#the more hours people slept on average, the fewer days they reported that 
#their mental health was not good. The same trend is observed for both sexes, and most 
#points lie close to the trend line. Note that each point is a group mean, so the 
#relationship among individual respondents may be weaker than the plot suggests.
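The strength of the relationship read off the plot could also be quantified. The sketch below does this on made-up group means (the values in `toy_means` are invented, not BRFSS results); in the actual analysis the same calls would be applied to the summarized data frame.

```r
# Hypothetical group means (not BRFSS values): poor-mental-health days
# and the mean hours slept by respondents reporting that many days.
toy_means <- data.frame(
  menthlth = c(0, 5, 10, 15, 20, 25, 30),
  mn_slp   = c(7.2, 7.0, 6.9, 6.8, 6.6, 6.5, 6.3)
)

# Pearson correlation coefficient: values near -1 indicate a strong
# negative linear relationship.
r <- cor(toy_means$menthlth, toy_means$mn_slp)

# Slope of the least-squares line, the kind of line stat_smooth() draws
# when a linear model is requested.
fit   <- lm(mn_slp ~ menthlth, data = toy_means)
slope <- unname(coef(fit)["menthlth"])

round(c(r = r, slope = slope), 3)
```

A coefficient computed on group means describes how the averages move together, which is a different quantity from the correlation across individual respondents.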

#This does not imply causation, though, as other variables can also affect a person's 
#mental health. It is also important to note that sleeping more can itself be a sign of 
#depression, and that oversleeping can worsen depression symptoms.

#There are a couple of interesting outliers within the plot. Among females, some groups 
#that reported approximately 10-12 days of poor mental health still averaged more than 
#7 hours of sleep. In contrast, some male groups that reported 10-12 such days 
#averaged less than 6.25 hours.

Research question 2: Average working hours for different ethnic groups in the U.S

#Variables used:

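#X_imprace: Ethnicity groups (imputed; used for the summary table below).
#X_race: Ethnicity groups (8 categories; used for the grouped summary and the plot below).
#scntwrk1: How Many Hours Per Week Do You Work.
#sex: Respondent's sex.

#Summarizing ethnicities in the U.S.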

table(brfss2013$X_imprace) #Categorical variable.

## 
##                          White, Non-Hispanic 
##                                       383624 
##                          Black, Non-Hispanic 
##                                        39817 
##                          Asian, Non-Hispanic 
##                                         9629 
## American Indian/Alaskan Native, Non-Hispanic 
##                                         7781 
##                                     Hispanic 
##                                        37138 
##                     Other race, Non-Hispanic 
##                                        13777

#Most of the respondents are of White, non-Hispanic ethnicity.

#Next, summarizing hours worked in a week for each of the respondents. 

table(brfss2013$scntwrk1) #Numerical variable (whole hours).

## 
##     0     1     2     3     4     5     6     7     8     9    10    11    12 
##     6    17    42    44    75   104    74    48   170    41   304    16   167 
##    13    14    15    16    17    18    19    20    21    22    23    24    25 
##    20    34   318   161    29    72    20  1095    40    46    38   288   590 
##    26    27    28    29    30    31    32    33    34    35    36    37    38 
##    43    44   129    28  1467     9   537    51    62   985   392   176   275 
##    39    40    41    42    43    44    45    46    47    48    49    50    51 
##    38 10851    36   322   118   153  2289    86    59   375     8  4148     7 
##    52    53    54    55    56    57    58    59    60    61    62    63    64 
##    70    21    30   922    62    20    31     2  2260     2    13    14    11 
##    65    66    67    68    70    72    73    74    75    76    78    79    80 
##   366     9     3    15   573    46     2     1    86     4     5     1   338 
##    81    82    83    84    85    86    87    89    90    91    94    95    96 
##     2     2     1    45    31     4     3     1    73     2     1     7    98 
##    97    98 
##   521   117

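#There do not seem to be any impossible values here. The maximum recorded value is 98 hours, 
#and a week has 168 hours in total. However, the counts at 97 and 98 (521 and 117) are much 
#larger than those at neighboring values, so it is worth checking the codebook to confirm 
#that these are real hours and not special response codes.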

#Arranging the data for meaningful insight.

brfss2013 %>%
  select(X_race, scntwrk1, sex) %>%
  filter(!is.na(X_race), !is.na(scntwrk1), !is.na(sex)) %>%
  group_by(X_race, sex) %>%
  summarise(mn_wrkhrs = mean(scntwrk1)) %>%
  spread(key = sex, value = "mn_wrkhrs") %>%
  arrange(desc(Male))

## # A tibble: 8 x 3
## # Groups:   X_race [8]
##   X_race                                                        Male Female
##   <fct>                                                        <dbl>  <dbl>
## 1 Multiracial, non-Hispanic                                     48.1   41.4
## 2 Other race only, non-Hispanic                                 46.8   41.6
## 3 White only, non-Hispanic                                      46.7   40.0
## 4 Black only, non-Hispanic                                      46.2   41.2
## 5 Hispanic                                                      45.2   39.7
## 6 Asian only, non-Hispanic                                      45.1   41.2
## 7 Native Hawaiian or other Pacific Islander only, Non-Hispanic  45     41.7
## 8 American Indian or Alaskan Native only, Non-Hispanic          44.7   39.4

#Here, the data shows the mean working hours in a week for male and female 
#respondents in each ethnic group. 

#Among the male respondents, those in the 'Multiracial, non-Hispanic' category 
#have the highest mean working hours in a week. Among the females, 'Native Hawaiian or other 
#Pacific Islander only, Non-Hispanic' respondents have the highest mean working hours.

#Plotting the variables using ggplot2.

#No group_by() is needed here: ggplot2 groups the boxes by the mapped aesthetics.
r2 <- brfss2013 %>%
  select(X_race, scntwrk1, sex) %>%
  filter(!is.na(X_race), !is.na(scntwrk1), !is.na(sex))

ggplot(data = r2, aes(y = X_race, x = scntwrk1)) + 
  geom_boxplot(aes(color = sex)) + 
  labs(title = "Working Hours of Different Ethnicities", 
       x = "scntwrk1: Working Hours per week", y = "X_race: Ethnicity Groups") + 
  theme(plot.title = element_text(hjust = 0.5), legend.position="top")

#A boxplot is used to plot a numerical variable against a categorical one. 
#Male and female respondents are shown separately for each ethnic category. 

#Across all of the box plots, the median hours worked in a week range from approximately 39 to 50.
#Among males, the box plots for Black, Native American, Asian, Native Hawaiian, and Hispanic 
#respondents show the same pattern: the lower half of the box is not visible, which suggests 
#that the first quartile coincides with the median, and the plots are right-skewed. This implies 
#that the majority of these male respondents work at least approximately 40 hours weekly. 

#For female respondents, only the box for the Native American group appears to end at 
#approximately 40 hours per week; that plot is left-skewed. Overall, female respondents 
#tend to work fewer hours per week than male respondents in the U.S. 
#A possible explanation is that more female respondents also carry household 
#responsibilities, though the variables used here cannot confirm this.

#There are noticeable outliers in all the ethnic groups. The low outliers 
#range from approximately 0 to 30 hours, whereas the high outliers range from 
#approximately 55 to 98 working hours a week. The low outliers could suggest that some of 
#the respondents are still in college and work part-time. The high outliers may reflect 
#jobs or sectors that require people to work more hours, such as construction, 
#oil and gas, hospitals, etc.
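The quartile and outlier observations above follow from how a boxplot is constructed. The sketch below uses a made-up vector of weekly hours (`hours` is hypothetical, not BRFSS data) and assumes the conventional 1.5 * IQR rule for flagging outliers, which is the default in `geom_boxplot()`.

```r
# Hypothetical weekly hours in which many respondents report exactly 40.
hours <- c(10, 38, 40, 40, 40, 40, 40, 40, 45, 50, 80)

# First quartile, median, and third quartile.
q <- quantile(hours, probs = c(0.25, 0.5, 0.75))

# Fences of the conventional 1.5 * IQR rule.
iqr   <- q[["75%"]] - q[["25%"]]
lower <- q[["25%"]] - 1.5 * iqr
upper <- q[["75%"]] + 1.5 * iqr

# Values beyond the fences are the ones drawn as individual points.
outliers <- hours[hours < lower | hours > upper]

q
outliers
```

When a single value such as 40 is very common, the first quartile and the median can be equal, so the lower half of the box collapses; a narrow box in turn produces tight fences and many flagged points.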

Research question 3: Smoking vs General health

#Variables used:

#X_smoker3: Computed Smoking Status.
#genhlth: General Health.
#sex: Respondent's sex.

#Summarizing smoking status

table(brfss2013$X_smoker3)

## 
## Current smoker - now smokes every day Current smoker - now smokes some days 
##                                 55162                                 21495 
##                         Former smoker                          Never smoked 
##                                138134                                261651

#The number of observations is not equal across the smoker categories. Therefore, 
#I will take a random sample of 20,000 observations from each of these categories.

#Summarizing general health 

table(brfss2013$genhlth)

## 
## Excellent Very good      Good      Fair      Poor 
##     85482    159076    150555     66726     27951

#Both are categorical variables. Arranging the data for meaningful insight.

brfss2013 %>%
  select(X_smoker3, genhlth, sex) %>%
  filter(!is.na(X_smoker3), !is.na(genhlth), !is.na(sex)) %>%
  group_by(X_smoker3, sex, genhlth) %>%
  summarise(count = n()) %>%
  spread(key = genhlth, value = "count")

## # A tibble: 8 x 7
## # Groups:   X_smoker3, sex [8]
##   X_smoker3                        sex   Excellent `Very good`  Good  Fair  Poor
##   <fct>                            <fct>     <int>       <int> <int> <int> <int>
## 1 Current smoker - now smokes eve~ Male       2526        6266  9152  4734  2296
## 2 Current smoker - now smokes eve~ Fema~      2719        7807 10309  6113  3004
## 3 Current smoker - now smokes som~ Male       1312        2761  3214  1540   784
## 4 Current smoker - now smokes som~ Fema~      1364        3327  3525  2247  1321
## 5 Former smoker                    Male       9266       20256 21733 10100  4408
## 6 Former smoker                    Fema~     11114       23118 21685 10618  5258
## 7 Never smoked                     Male      21412       33943 26827  8610  2835
## 8 Never smoked                     Fema~     32980       57213 49167 20585  7128

#The data is arranged in a way that shows male and female respondents' 
#smoking status and their corresponding self-reported general health.

#Random sampling of smokers.

df <- brfss2013 %>%
  select(X_smoker3, genhlth, sex) %>%
  na.omit()

a <- df %>%
  filter(X_smoker3 == 'Never smoked') %>%
  slice_sample(n=20000)

b <- df %>%
  filter(X_smoker3 == 'Former smoker') %>%
  slice_sample(n=20000)

c <- df %>%
  filter(X_smoker3 == 'Current smoker - now smokes every day') %>%
  slice_sample(n=20000)

d <- df %>%
  filter(X_smoker3 == 'Current smoker - now smokes some days') %>%
  slice_sample(n=20000)

z <- rbind(a,b,c,d)
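The four filter-and-sample steps above can also be written as a single grouped call, and fixing a seed makes the random draw repeatable. This is only a sketch on made-up data: the data frame `toy`, the group sizes, the sample size of 10, and the seed are assumptions for illustration.

```r
library(dplyr)

# Hypothetical toy data with unequal group sizes (50, 30, and 20 rows).
toy <- data.frame(
  status = rep(c("Never smoked", "Former smoker", "Daily smoker"),
               times = c(50, 30, 20)),
  id     = 1:100
)

set.seed(42)  # arbitrary seed; fixing it makes the random draw repeatable

# One grouped call draws the same number of rows from every category.
balanced <- toy %>%
  group_by(status) %>%
  slice_sample(n = 10) %>%
  ungroup()

table(balanced$status)
```

Without a seed, each run draws a different sample, so the counts in any later table will change slightly from run to run.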

table(z)

## , , sex = Male
## 
##                                        genhlth
## X_smoker3                               Excellent Very good Good Fair Poor
##   Current smoker - now smokes every day       929      2277 3276 1748  838
##   Current smoker - now smokes some days      1237      2587 2994 1422  734
##   Former smoker                              1323      2931 3210 1483  632
##   Never smoked                               1642      2659 2022  648  214
## 
## , , sex = Female
## 
##                                        genhlth
## X_smoker3                               Excellent Very good Good Fair Poor
##   Current smoker - now smokes every day       987      2883 3789 2173 1100
##   Current smoker - now smokes some days      1273      3148 3281 2098 1226
##   Former smoker                              1603      3361 3186 1493  778
##   Never smoked                               2518      4400 3772 1557  568

prop.table(table(z$X_smoker3))

## 
## Current smoker - now smokes every day Current smoker - now smokes some days 
##                                  0.25                                  0.25 
##                         Former smoker                          Never smoked 
##                                  0.25                                  0.25

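#Shorten the two longest labels for plotting. Note that gsub() returns a character 
#vector, so X_smoker3 is no longer a factor after this step.
z$X_smoker3 <- gsub(pattern = "Current smoker - now smokes some days", 
                    replacement = "smokes some days", x = z$X_smoker3)

z$X_smoker3 <- gsub(pattern = "Current smoker - now smokes every day", 
                    replacement = "Daily smoker", x = z$X_smoker3)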


#Plotting the data using ggplot2.

ggplot(data = z, aes(x = X_smoker3, fill = genhlth)) + 
  geom_bar(position = "fill") + facet_grid(~ sex) + 
  labs(title = "Smoking status and General Health", 
       x = "X_smoker3: Smoker Status", 
       y = "Proportion of respondents") + 
  theme(plot.title = element_text(hjust = 0.5), 
        axis.text.x = element_text(angle = 45, hjust=1))

#A bar plot is used to plot two categorical variables. The smoking status and general 
#health of both male and female respondents are included in the plot.

#The plots for both genders show a common pattern: respondents who never smoked, followed by 
#former smokers, most often feel their general health is excellent or very good (a majority 
#of never-smokers and roughly 44-48% of former smokers in the sample above). 
#The opposite is true for current smokers, where the proportion who feel their general health 
#is poor is higher than in the non-smoking categories.

#Between some-days and everyday smokers, there is no noticeable difference 
#across the general health categories. In general, only a few smokers feel their 
#general health is excellent.
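The proportions that the filled bars display can also be computed directly. The sketch below uses the counts for male respondents copied from the `table(z)` output shown earlier; the row labels are shortened here for readability, and the matrix name `male` is just an illustrative choice.

```r
# Counts for male respondents, copied from the table(z) output above.
male <- matrix(
  c( 929, 2277, 3276, 1748, 838,
    1237, 2587, 2994, 1422, 734,
    1323, 2931, 3210, 1483, 632,
    1642, 2659, 2022,  648, 214),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    smoker  = c("Every day", "Some days", "Former", "Never"),
    genhlth = c("Excellent", "Very good", "Good", "Fair", "Poor")
  )
)

# Row-wise proportions: each smoking category sums to 1,
# which is what position = "fill" displays.
props <- prop.table(male, margin = 1)

round(props, 3)

# Share rating their health as excellent or very good, per category.
round(rowSums(props[, c("Excellent", "Very good")]), 3)
```

Reading the numbers alongside the plot makes it easier to check statements about which categories form a majority.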

#Many factors can influence a person's general health, such as dietary habits and 
#alcohol consumption. Therefore, causality cannot be inferred in this case. There is, 
#however, a strong correlation between smoking habits and people's general feeling 
#about their health.

#Finally, further research is needed, for example on how many packs of 
#cigarettes a person smokes per day. Non-smokers may also live with 
#smokers and so be exposed to passive smoking, which might affect 
#their perception of health.