BRFSS conducted the survey in every U.S. state by dividing the U.S. population into subgroups, reached through landline-telephone and cellular-telephone interviews. According to the BRFSS codebook document, “In conducting the BRFSS landline telephone survey, interviewers collect data from a randomly selected adult in a household. In conducting the cellular telephone version of the BRFSS questionnaire, interviewers collect data from an adult who participates by using a cellular telephone and resides in a private residence or college housing.”
This indicates that BRFSS employed a stratified sampling method, in which the population is divided into subgroups and random sampling is then applied to select a sample from each subgroup.
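The idea of stratified sampling can be sketched in a few lines of base R. This is only an illustration on a made-up population, not the actual BRFSS design; the data frame `population`, the stratum sizes and `n_per_stratum` are all assumptions introduced here.

```r
# Hypothetical population of 1,000 people split into two strata (not BRFSS data).
set.seed(1)
population <- data.frame(
  id      = 1:1000,
  stratum = rep(c("landline", "cellular"), times = c(600, 400))
)

# Draw a simple random sample of the same size within each stratum.
n_per_stratum <- 50
sampled <- do.call(
  rbind,
  lapply(split(population, population$stratum), function(s) {
    s[sample(nrow(s), n_per_stratum), ]
  })
)

# Each stratum contributes exactly n_per_stratum rows.
table(sampled$stratum)
```

Sampling within each subgroup separately guarantees that both subgroups are represented, which a single simple random sample of the whole population does not.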
Since no treatment was applied to the sample population and the data were collected by survey only, the study is observational.
As a large number of respondents were drawn randomly from both subgroups, the results can be generalized to the adult U.S. population.
As mentioned earlier, this is an observational study: no treatment was applied to the sample group and respondents were not randomly assigned to groups. Therefore, no causality can be inferred from the data.
Research question 1:
Is there a correlation between the average hours people sleep and people’s mental health within the past 30 days? Furthermore, how does this vary in terms of gender?
My previous job on oil rigs required me to work continuous six-hour shifts (6 hours work - 6 hours rest - 6 hours work - 6 hours rest) each day. On average, I could only manage to sleep 2.5 hours per rest period, for a total of approximately 5 hours of sleep a day. I did not feel well mentally on those days. Hence, I am interested to see whether there is any correlation in this case.
Research question 2:
How do average working hours per week vary among different ethnic groups in the U.S? Furthermore, are there differences in average working hours per week between the genders?
Very few studies have looked into the work-life balance among different ethnic groups. Hence, I am curious to see the average working hours per week for people of different ethnic groups in the U.S.
Research question 3:
Is there a correlation between smoking and people’s general health status in the U.S? Does this vary for both male and female respondents?
Numerous studies have highlighted the dangers smoking poses to a person’s health. Therefore, it would be interesting to see whether the data collected here supports this claim and shows a correlation between smoking and general health.
Research question 1: Sleep vs Mental Health
# The variables used are:
#sleptim1: How Much Time Do You Sleep (within 24-hour period).
#menthlth: Number Of Days Mental Health Not Good (during the past 30 days).
#sex: Respondents Gender.
#Summarizing the number of hours and their counts:
table(brfss2013$sleptim1)
##
## 0 1 2 3 4 5 6 7 8 9 10
## 1 228 1076 3496 14261 33436 106197 142469 141102 23800 12102
## 11 12 13 14 15 16 17 18 19 20 21
## 833 3675 199 447 367 369 35 164 13 64 3
## 22 23 24 103 450
## 10 4 35 1 1
#There are 2 main outliers - 103 and 450. It is impossible to sleep for 103 or 450 hours
#in a day that only has 24 hours.
#Next, summarizing days of mental health not good in a month.
table(brfss2013$menthlth)
##
## 0 1 2 3 4 5 6 7 8 9 10
## 334461 15206 23520 13593 6660 16654 1861 6353 1244 198 11917
## 11 12 13 14 15 16 17 18 19 20 21
## 74 812 97 2516 10910 161 105 181 45 6633 497
## 22 23 24 25 26 27 28 29 30 247 5000
## 113 61 87 2318 89 164 680 415 25521 1 1
#Here too, there are 2 outliers - 247 and 5000 days.
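The range checks implied by these tables can be written out explicitly. The sketch below uses a small made-up data frame (`toy` is hypothetical, not `brfss2013`) and assumes the valid ranges are 0 to 24 hours of sleep and 0 to 30 days.

```r
# Hypothetical responses, including impossible values like those seen above.
toy <- data.frame(
  sleptim1 = c(7, 8, 103, 6, NA, 450, 5),
  menthlth = c(0, 30, 2, 247, 5, 1, 5000)
)

# A value is valid only if it is present and inside the possible range.
valid_sleep <- !is.na(toy$sleptim1) & toy$sleptim1 >= 0 & toy$sleptim1 <= 24
valid_days  <- !is.na(toy$menthlth) & toy$menthlth >= 0 & toy$menthlth <= 30

flagged <- toy[!(valid_sleep & valid_days), ]  # rows to inspect or drop
clean   <- toy[valid_sleep & valid_days, ]     # rows kept for analysis
```

Inspecting `flagged` before dropping it makes it easier to judge whether the impossible values are isolated entry errors or a sign of a coding problem.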
#Both of the above variables are discrete variables. Now, using the dplyr and tidyr
#package, data will be arranged in a way that provides meaningful insight.
#Outliers in both variables will be removed along with NA's.
#The variable 'sex' will also be included.
brfss2013 %>%
  select(menthlth, sleptim1, sex) %>%
  filter(sleptim1 <= 24, menthlth <= 30, !is.na(sex)) %>%
  group_by(menthlth, sex) %>%
  summarize(mn_slp = mean(sleptim1)) %>%
  spread(key = sex, value = "mn_slp")
## # A tibble: 31 x 3
## # Groups: menthlth [31]
## menthlth Male Female
## <int> <dbl> <dbl>
## 1 0 7.13 7.18
## 2 1 6.97 7.07
## 3 2 6.93 7.04
## 4 3 6.86 6.98
## 5 4 6.88 6.92
## 6 5 6.84 6.96
## 7 6 6.69 6.87
## 8 7 6.78 6.86
## 9 8 6.78 6.98
## 10 9 6.65 6.88
## # ... with 21 more rows
#Here, the data is arranged so that each row is a number of days in the month on which respondents
#felt their mental health was not good, together with the corresponding mean number of hours
#that male and female respondents sleep in a day.
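The grouping and reshaping step can be illustrated on a tiny example in base R. The data frame `toy` below is made up for illustration; `tapply()` is used here as a stand-in for the `group_by()`, `summarize()` and `spread()` chain, not as the method used in this analysis.

```r
# Hypothetical respondents: days of poor mental health, sex, hours slept.
toy <- data.frame(
  menthlth = c(0, 0, 0, 0, 5, 5, 5, 5),
  sex      = c("Male", "Male", "Female", "Female",
               "Male", "Male", "Female", "Female"),
  sleptim1 = c(7, 8, 7, 9, 6, 7, 6, 8)
)

# Mean hours slept for every (days, sex) combination, as a wide table:
# one row per value of menthlth, one column per sex.
wide <- tapply(toy$sleptim1, list(toy$menthlth, toy$sex), mean)
wide
```

Each cell of `wide` is a group mean, so the table describes groups of respondents rather than individual people.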
#Plotting the variables using ggplot2 package.
slp.mnth <- brfss2013 %>%
  select(menthlth, sleptim1, sex) %>%
  filter(sleptim1 <= 24, menthlth <= 30, !is.na(sex)) %>%
  group_by(menthlth, sex) %>%
  summarize(mn_slp = mean(sleptim1))
ggplot(slp.mnth, aes(x = menthlth, y = mn_slp)) +
  geom_point(aes(color = sex)) +
  stat_smooth(method = lm, se = FALSE) +
  labs(title = "Mean sleeping hours & mental health in a month",
       y = "mn_slp: Mean hours slept",
       x = "menthlth: Number of days Mental Health not good in a month") +
  theme(plot.title = element_text(hjust = 0.5))
#Here, a scatter plot is used for the 2 numerical variables. The colors of the dots distinguish
#between male and female respondents.
#The plot illustrates a strong negative correlation between the number of days of poor
#mental health and the mean hours slept. The more hours people slept on average, the fewer days
#they felt that their mental health was not good. The same trend is observed for both sexes.
#Most of the points lie fairly close to the trend line.
#This does not imply causation though, as other variables can also affect a person's
#mental health. Correlation does not mean causation. It is important to
#note that sleeping more can actually be a sign of depression and oversleeping can
#worsen depression symptoms.
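To put a number on the trend, the correlation can be computed directly from the grouped means. The sketch below uses only the ten male rows visible in the printed tibble (menthlth 0 to 9); the remaining 21 rows are truncated in the output, so the result is illustrative rather than the correlation for the full plot. The names `days`, `male_mean_sleep`, `r` and `fit` are introduced here.

```r
# Male group means copied from the first ten rows of the printed tibble.
days            <- 0:9
male_mean_sleep <- c(7.13, 6.97, 6.93, 6.86, 6.88, 6.84, 6.69, 6.78, 6.78, 6.65)

# Pearson correlation between days of poor mental health and mean sleep.
r <- cor(days, male_mean_sleep)

# Least-squares line, the same kind of fit that stat_smooth(method = lm) draws.
fit <- lm(male_mean_sleep ~ days)

r
coef(fit)
```

Because these are correlations between group means, they say how the averages move together, not how strongly sleep and mental health are related for an individual respondent.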
#There are a couple of interesting outliers within the plot. For females, some of the points
#at approximately 10-12 days of poor mental health show a mean sleep of more than 7 hours.
#In contrast, for males, some of the points at 10-12 days of poor mental health show a
#mean sleep of less than 6.25 hours.
Research question 2: Average working hours for different ethnic groups in the U.S
#Variables used:
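#X_imprace: Ethnicity groups (6 categories, summarized in the table below).
#X_race: Ethnicity groups (8 categories, used for the grouped means and the box plot).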
#scntwrk1: How Many Hours Per Week Do You Work.
#sex: Respondents sex.
#Summarizing ethnicities in U.S
table(brfss2013$X_imprace) #Categorical variable.
##
## White, Non-Hispanic
## 383624
## Black, Non-Hispanic
## 39817
## Asian, Non-Hispanic
## 9629
## American Indian/Alaskan Native, Non-Hispanic
## 7781
## Hispanic
## 37138
## Other race, Non-Hispanic
## 13777
#Most of the respondents are of White, non-Hispanic ethnicity.
#Next, summarizing hours worked in a week for each of the respondents.
table(brfss2013$scntwrk1) #Continuous variable.
##
## 0 1 2 3 4 5 6 7 8 9 10 11 12
## 6 17 42 44 75 104 74 48 170 41 304 16 167
## 13 14 15 16 17 18 19 20 21 22 23 24 25
## 20 34 318 161 29 72 20 1095 40 46 38 288 590
## 26 27 28 29 30 31 32 33 34 35 36 37 38
## 43 44 129 28 1467 9 537 51 62 985 392 176 275
## 39 40 41 42 43 44 45 46 47 48 49 50 51
## 38 10851 36 322 118 153 2289 86 59 375 8 4148 7
## 52 53 54 55 56 57 58 59 60 61 62 63 64
## 70 21 30 922 62 20 31 2 2260 2 13 14 11
## 65 66 67 68 70 72 73 74 75 76 78 79 80
## 366 9 3 15 573 46 2 1 86 4 5 1 338
## 81 82 83 84 85 86 87 89 90 91 94 95 96
## 2 2 1 45 31 4 3 1 73 2 1 7 98
## 97 98
## 521 117
#There do not seem to be any impossible values here. The maximum recorded number of hours worked
#in a week is 98, well below the 168 hours in a week.
#Arranging the data for meaningful insight.
brfss2013 %>%
  select(X_race, scntwrk1, sex) %>%
  filter(!is.na(X_race), !is.na(scntwrk1), !is.na(sex)) %>%
  group_by(X_race, sex) %>%
  summarise(mn_wrkhrs = mean(scntwrk1)) %>%
  spread(key = sex, value = "mn_wrkhrs") %>%
  arrange(desc(Male))
## # A tibble: 8 x 3
## # Groups: X_race [8]
## X_race Male Female
## <fct> <dbl> <dbl>
## 1 Multiracial, non-Hispanic 48.1 41.4
## 2 Other race only, non-Hispanic 46.8 41.6
## 3 White only, non-Hispanic 46.7 40.0
## 4 Black only, non-Hispanic 46.2 41.2
## 5 Hispanic 45.2 39.7
## 6 Asian only, non-Hispanic 45.1 41.2
## 7 Native Hawaiian or other Pacific Islander only, Non-Hispanic 45 41.7
## 8 American Indian or Alaskan Native only, Non-Hispanic 44.7 39.4
#Here, the data shows the different ethnic groups and the mean working
#hours in a week for both male and female respondents.
#Among the male respondents, those in the 'Multiracial, non-Hispanic' ethnic category
#have the highest mean working hours in a week. Among the females, 'Native Hawaiian or other
#Pacific Islander only, Non-Hispanic' respondents have the highest mean working hours.
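The two "highest mean" statements can be checked mechanically from the printed tibble. The sketch below re-enters the eight rows by hand (rounded as printed) into a data frame called `means`, a name introduced here, and also computes the male-female gap per group.

```r
# Group means copied from the printed tibble (rounded as shown).
means <- data.frame(
  race = c("Multiracial, non-Hispanic",
           "Other race only, non-Hispanic",
           "White only, non-Hispanic",
           "Black only, non-Hispanic",
           "Hispanic",
           "Asian only, non-Hispanic",
           "Native Hawaiian or other Pacific Islander only, Non-Hispanic",
           "American Indian or Alaskan Native only, Non-Hispanic"),
  Male   = c(48.1, 46.8, 46.7, 46.2, 45.2, 45.1, 45.0, 44.7),
  Female = c(41.4, 41.6, 40.0, 41.2, 39.7, 41.2, 41.7, 39.4),
  stringsAsFactors = FALSE
)

# Group with the highest mean weekly hours for each sex.
top_male   <- means$race[which.max(means$Male)]
top_female <- means$race[which.max(means$Female)]

# Difference between male and female means within each group.
means$gap <- means$Male - means$Female
means[, c("race", "gap")]
```

In every group the male mean exceeds the female mean, although the size of the gap differs from group to group.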
#Plotting the variables using ggplot2.
r2 <- brfss2013 %>%
  select(X_race, scntwrk1, sex) %>%
  filter(!is.na(X_race), !is.na(scntwrk1), !is.na(sex)) %>%
  group_by(X_race, sex, scntwrk1)
ggplot(data = r2, aes(y = X_race, x = scntwrk1)) +
  geom_boxplot(aes(color = sex)) +
  labs(title = "Working Hours of Different Ethnicities",
       x = "scntwrk1: Working Hours per week", y = "X_race: Ethnicity Groups") +
  theme(plot.title = element_text(hjust = 0.5), legend.position = "top")
#A box plot was used to plot a categorical variable against a numerical variable.
#Both male and female respondents are represented in the box plots for each ethnic category.
#Across all of the box plots, the median hours worked in a week range from approximately 39 to 50.
#Among males, the box plots for Black, Native American, Asian, Native Hawaiian, and Hispanic
#respondents show the same pattern: no visible first-quartile section (the first quartile
#appears to coincide with the median) and a right-skewed distribution. This implies that the
#majority of these male respondents work a minimum of approximately 40 hours weekly.
#Among female respondents, only those in the Native American group seem to have a maximum
#of approximately 40 working hours per week; that plot is left skewed. Overall, the majority
#of female respondents work fewer hours than male respondents in the U.S.
#One possible explanation, which this data cannot confirm, is that many female respondents are housewives.
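A box with no visible lower section arises when the first quartile and the median are the same value, which happens when a large share of respondents report exactly the same number of hours. The sketch below shows this with a made-up vector `hours` (hypothetical, not taken from `brfss2013`).

```r
# Hypothetical weekly hours: most people report exactly 40, a few report more.
hours <- c(rep(40, 6), 45, 50, 55, 60)

# Quartiles define the lower edge, the middle line and the upper edge of the box.
q <- quantile(hours, c(0.25, 0.5, 0.75))
q

# The lower edge and the median line fall on the same value, so the lower
# half of the box collapses, while the upper half stretches to the right.
q[["25%"]] == q[["50%"]]
```

With this shape, at least half of the values sit at or just above the lower edge of the box, which is consistent with reading the edge as a typical minimum.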
#There are noticeable outliers among all the ethnic groups. The low outliers
#range from approximately 0-30 hours, whereas the high outliers range from
#approximately 55-98 working hours a week. The low outliers could suggest that many of
#these respondents are still in college. The high outliers may reflect those
#who work in jobs or sectors that require longer hours, such as construction,
#oil and gas, hospitals etc.
Research question 3: Smoking vs General health
#Variables used:
#X_smoker3: Computed Smoking Status.
#genhlth: General Health.
#sex: Respondent's sex.
#Summarizing smoking status
table(brfss2013$X_smoker3)
##
## Current smoker - now smokes every day Current smoker - now smokes some days
## 55162 21495
## Former smoker Never smoked
## 138134 261651
#The number of observations is not equal across the smoker categories. Therefore,
#I will take a random sample of 20,000 observations from each of these categories.
#Summarizing general health
table(brfss2013$genhlth)
##
## Excellent Very good Good Fair Poor
## 85482 159076 150555 66726 27951
#Both are categorical variables. Arranging the data for meaningful insight.
brfss2013 %>%
  select(X_smoker3, genhlth, sex) %>%
  filter(!is.na(X_smoker3), !is.na(genhlth), !is.na(sex)) %>%
  group_by(X_smoker3, sex, genhlth) %>%
  summarise(count = n()) %>%
  spread(key = genhlth, value = "count")
## # A tibble: 8 x 7
## # Groups: X_smoker3, sex [8]
## X_smoker3 sex Excellent `Very good` Good Fair Poor
## <fct> <fct> <int> <int> <int> <int> <int>
## 1 Current smoker - now smokes eve~ Male 2526 6266 9152 4734 2296
## 2 Current smoker - now smokes eve~ Fema~ 2719 7807 10309 6113 3004
## 3 Current smoker - now smokes som~ Male 1312 2761 3214 1540 784
## 4 Current smoker - now smokes som~ Fema~ 1364 3327 3525 2247 1321
## 5 Former smoker Male 9266 20256 21733 10100 4408
## 6 Former smoker Fema~ 11114 23118 21685 10618 5258
## 7 Never smoked Male 21412 33943 26827 8610 2835
## 8 Never smoked Fema~ 32980 57213 49167 20585 7128
#The data is arranged to show male and female respondents'
#smoking status and their corresponding general feeling about their health.
#Random sampling of smokers.
df <- brfss2013 %>%
  select(X_smoker3, genhlth, sex) %>%
  na.omit()
a <- df %>%
  filter(X_smoker3 == 'Never smoked') %>%
  slice_sample(n = 20000)
b <- df %>%
  filter(X_smoker3 == 'Former smoker') %>%
  slice_sample(n = 20000)
c <- df %>%
  filter(X_smoker3 == 'Current smoker - now smokes every day') %>%
  slice_sample(n = 20000)
d <- df %>%
  filter(X_smoker3 == 'Current smoker - now smokes some days') %>%
  slice_sample(n = 20000)
z <- rbind(a, b, c, d)
table(z)
## , , sex = Male
##
## genhlth
## X_smoker3 Excellent Very good Good Fair Poor
## Current smoker - now smokes every day 929 2277 3276 1748 838
## Current smoker - now smokes some days 1237 2587 2994 1422 734
## Former smoker 1323 2931 3210 1483 632
## Never smoked 1642 2659 2022 648 214
##
## , , sex = Female
##
## genhlth
## X_smoker3 Excellent Very good Good Fair Poor
## Current smoker - now smokes every day 987 2883 3789 2173 1100
## Current smoker - now smokes some days 1273 3148 3281 2098 1226
## Former smoker 1603 3361 3186 1493 778
## Never smoked 2518 4400 3772 1557 568
##
## Current smoker - now smokes every day Current smoker - now smokes some days
## 0.25 0.25
## Former smoker Never smoked
## 0.25 0.25
z$X_smoker3 <- gsub(pattern = "Current smoker - now smokes some days",
                    replacement = "smokes some days", x = z$X_smoker3)
z$X_smoker3 <- gsub(pattern = "Current smoker - now smokes every day",
                    replacement = "Daily smoker", x = z$X_smoker3)
#Plotting the data using ggplot2.
ggplot(data = z, aes(x = X_smoker3, fill = genhlth)) +
  geom_bar(position = "fill") +
  facet_grid(~ sex) +
  labs(title = "Effects of smoking on General Health",
       x = "X_smoker3: Smoker Status") +
  theme(plot.title = element_text(hjust = 0.5),
        axis.text.x = element_text(angle = 45, hjust = 1))
#A bar plot was used to plot two categorical variables against each other. Both male and female
#respondents' smoking status and their general health are included in the plot.
#The plots for both genders show a common pattern in which the majority of those who never smoked
#and of former smokers feel their general health ranges from excellent to very good.
#The opposite is true for current smokers, where the proportion who feel their general health
#is poor is higher than in the non-smoking categories.
#Between the some-days and every-day smokers, there is no marked difference
#across the general health categories. In general, though, only a few smokers feel their
#general health is excellent.
#A person's general health can have numerous causes, such as dietary habits, consumption
#of alcohol etc. Therefore, causality cannot be inferred in this case. However, there is a clear
#association between smoking habits and people's general feeling about their health.
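For two categorical variables, the association can be examined with row proportions and a chi-squared test of independence. This test is one common choice and was not part of the original analysis. The sketch below uses the male counts from the printed sample table; because that sample was drawn at random, a rerun would give slightly different counts. The name `male_counts` is introduced here.

```r
# Male counts copied from the printed table of the random sample.
male_counts <- matrix(
  c( 929, 2277, 3276, 1748, 838,
    1237, 2587, 2994, 1422, 734,
    1323, 2931, 3210, 1483, 632,
    1642, 2659, 2022,  648, 214),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    smoker  = c("Smokes every day", "Smokes some days",
                "Former smoker", "Never smoked"),
    genhlth = c("Excellent", "Very good", "Good", "Fair", "Poor")
  )
)

# Share of each health rating within each smoking category
# (the quantity that geom_bar(position = "fill") displays).
row_props <- prop.table(male_counts, margin = 1)
round(row_props, 3)

# Chi-squared test of independence between smoking status and health rating.
test <- chisq.test(male_counts)
test
```

A small p-value indicates that the health ratings are not distributed the same way across smoking categories; it does not measure how strong the association is, and it does not establish causation.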
#Finally, further research has to be done, for example on how many packs of
#cigarettes a person smokes per day. Non-smokers may also spend time with people
#who smoke and so be subjected to passive smoking, which might also affect
#their perception of health.