Research Questions
Research Question 1:
What is the general health status of people who live in a private residence compared to those in college housing? And how does that differ between genders?
Research Question 2:
Do men have a higher income level than women? And what is the most common income level for each gender?
Research Question 3:
Does gender affect average sleeping hours? And is there any relation between sleeping hours and having a depressive disorder?
About The Data
Data Collection Method:
BRFSS collects health-related data in all 50 states as well as the District of Columbia and three U.S. territories. The data is collected using both landline telephone- and cellular telephone-based surveys. Random Digit Dialing (RDD) techniques are used to obtain participants’ telephone numbers on both landlines and cell phones. The number of interviews may vary between the states based on funding and the size of the region.
For more information, visit: https://www.cdc.gov/brfss/
Targeted Population:
The surveys target the adult population, aged 18 years or older.
Load packages
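The chunk that loads the packages is not echoed in this report. A minimal sketch, assuming the analysis relies on dplyr (for `%>%`, `select()`, `filter()`) and ggplot2 (for the plots). The file name `brfss2013.RData` is a hypothetical placeholder, not something stated in the report:

```r
# Packages used by the code in this report
library(dplyr)    # select(), filter(), group_by(), summarise(), %>%
library(ggplot2)  # ggplot(), geom_bar(), geom_boxplot()

# Assumption: the data set is stored as "brfss2013.RData" in the working
# directory and contains a data frame called brfss2013.
data_file <- "brfss2013.RData"
if (file.exists(data_file)) {
  load(data_file)
} else {
  message("Data file not found: ", data_file)
}
```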
## Warning: package 'dplyr' was built under R version 4.0.3
Exploratory Data Analysis
Research Question 1:
What is the general health status of people who live in a private residence compared to those in college housing? And how does that differ between genders?
#select the targeted variables
brfss2013_q1 <- brfss2013 %>%
  select(pvtresd1, sex, genhlth)
#check for missing (NA) values
colSums(is.na(brfss2013_q1))
## pvtresd1 sex genhlth
## 131636 7 1985
#remove observations with missing values in any variable
brfss2013_q1 <- brfss2013_q1 %>%
  filter(!is.na(pvtresd1), !is.na(sex), !is.na(genhlth))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 2 x 2
## pvtresd1 `n()`
## <fct> <int>
## 1 Yes 358476
## 2 No 44
Only 44 respondents in the sample live in college housing; the rest live in a private residence. With so few college housing residents, the comparison may not reflect reality, but we will use the sample anyway to get a rough idea.
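The code that produced the residence counts above is not echoed. A minimal sketch, assuming it was a grouped `summarise()` (the `summarise()` message in the output suggests dplyr). The data frame `toy_q1` is a hypothetical stand-in with made-up values, used only so the example runs on its own:

```r
library(dplyr)

# Toy stand-in for brfss2013_q1 (hypothetical values, not BRFSS data)
toy_q1 <- data.frame(
  pvtresd1 = factor(c("Yes", "Yes", "Yes", "No"), levels = c("Yes", "No"))
)

# Number of respondents per residence type
residence_counts <- toy_q1 %>%
  group_by(pvtresd1) %>%
  summarise(n = n())

residence_counts
```

With the real data, the same pipeline would start from `brfss2013_q1` instead of `toy_q1`.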
Let’s take a look at the proportion of each health status within each residence type
##
## Excellent Very good Good Fair Poor
## Yes 0.16249623 0.32116515 0.31076278 0.14313371 0.06244212
## No 0.15909091 0.20454545 0.45454545 0.15909091 0.02272727
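Each row of the table above sums to 1, so the proportions are computed within each residence type. The code is not echoed; a minimal sketch, assuming base R `table()` and `prop.table()` with `margin = 1`, on a hypothetical toy data frame:

```r
# Toy stand-in for brfss2013_q1 (hypothetical values, not BRFSS data)
toy_q1 <- data.frame(
  pvtresd1 = factor(c("Yes", "Yes", "Yes", "Yes", "No", "No"),
                    levels = c("Yes", "No")),
  genhlth = factor(c("Excellent", "Good", "Good", "Poor", "Good", "Fair"),
                   levels = c("Excellent", "Very good", "Good", "Fair", "Poor"))
)

# Rows are residence types, columns are health statuses;
# margin = 1 divides every count by its row total
health_by_residence <- prop.table(table(toy_q1$pvtresd1, toy_q1$genhlth),
                                  margin = 1)

health_by_residence
```

Row proportions make the two residence types comparable even though one group is far smaller than the other.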
#let's visualize the proportions using a bar plot
ggplot(brfss2013_q1, aes(x = genhlth, group = pvtresd1, fill = pvtresd1)) +
  geom_bar(aes(y = ..prop..), position = "dodge") +
  labs(title = "Health Status By Residence Type",
       x = "Health Status",
       y = "Proportion",
       fill = "Private Residence") +
  theme(plot.title = element_text(hjust = 0.5))
##
## Male Female
## 0.373332 0.626668
Frequency table for health status, residence type, and gender
## , , = Yes
##
##
## Excellent Very good Good Fair Poor
## Male 21447 42717 42995 18602 8067
## Female 36804 72413 68406 32708 14317
##
## , , = No
##
##
## Excellent Very good Good Fair Poor
## Male 5 5 6 3 0
## Female 2 4 14 4 1
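The layout of the output above (one gender by health status table per residence type) is what base R prints for a three-way `table()`. A minimal sketch under that assumption, again on hypothetical toy data:

```r
# Toy stand-in for brfss2013_q1 (hypothetical values, not BRFSS data)
toy_q1 <- data.frame(
  sex = factor(c("Male", "Female", "Female", "Male", "Female", "Male"),
               levels = c("Male", "Female")),
  genhlth = factor(c("Excellent", "Good", "Good", "Poor", "Good", "Fair"),
                   levels = c("Excellent", "Very good", "Good", "Fair", "Poor")),
  pvtresd1 = factor(c("Yes", "Yes", "Yes", "Yes", "No", "No"),
                    levels = c("Yes", "No"))
)

# The third variable splits the output into one 2-D table per residence type
health_counts <- table(toy_q1$sex, toy_q1$genhlth, toy_q1$pvtresd1)

health_counts
```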
Plotting health status for each residence type, grouped by gender
ggplot(brfss2013_q1, aes(x = genhlth, fill = sex, group = sex)) +
  geom_bar(aes(y = ..prop..), position = position_dodge()) +
  facet_wrap(~pvtresd1) +
  labs(title = "Health Status By Residence Type Among Genders",
       x = "Health Status",
       y = "Proportion") +
  theme(plot.title = element_text(hjust = 0.5))
Overall, in private residences the proportions of each health status are almost the same for both genders. In college housing, on the other hand, men have higher proportions of “Excellent” and “Very good” status, whereas women have a higher proportion of “Good”. Keep in mind that the college housing panel is based on only 44 respondents.
Research Question 2:
Do men have a higher income level than women? And what is the most common income level for each gender?
Select the variables of interest
#select the variables and drop observations with missing values
brfss2013_q2 <- brfss2013 %>%
  filter(!is.na(income2), !is.na(sex)) %>%
  select(income2, sex)
#change labels for income
levels(brfss2013_q2$income2) <- c("<$10,000", "<$15,000", "<$20,000", "<$25,000", "<$35,000", "<$50,000", "<$75,000", ">=$75,000")
View the frequency table of income level and gender
##
## Male Female
## <$10,000 8296 17145
## <$15,000 9207 17586
## <$20,000 12562 22311
## <$25,000 15734 25998
## <$35,000 19628 29239
## <$50,000 26817 34692
## <$75,000 29405 35826
## >=$75,000 56537 59365
View the proportions within each gender to get a better idea
##
## Male Female
## <$10,000 0.04655809 0.07079971
## <$15,000 0.05167073 0.07262081
## <$20,000 0.07049937 0.09213254
## <$25,000 0.08830099 0.10735788
## <$35,000 0.11015456 0.12074149
## <$50,000 0.15050004 0.14325947
## <$75,000 0.16502419 0.14794229
## >=$75,000 0.31729204 0.24514581
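Neither income table is shown with its code. In the second table each gender column sums to 1, which matches `prop.table()` with `margin = 2`. A minimal sketch under that assumption, using a hypothetical toy data frame with only two of the eight income levels:

```r
# Toy stand-in for brfss2013_q2 (hypothetical values, not BRFSS data)
toy_q2 <- data.frame(
  income2 = factor(c("<$10,000", ">=$75,000", ">=$75,000",
                     "<$10,000", "<$10,000", ">=$75,000"),
                   levels = c("<$10,000", ">=$75,000")),
  sex = factor(c("Male", "Male", "Male", "Female", "Female", "Female"),
               levels = c("Male", "Female"))
)

# Counts: income levels in rows, genders in columns
income_counts <- table(toy_q2$income2, toy_q2$sex)

# margin = 2 divides every count by its column (gender) total
income_props <- prop.table(income_counts, margin = 2)

income_counts
income_props
```

Dividing by the column totals matters here because the sample contains more women than men, so raw counts alone would overstate the female share at every income level.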
Plot the income level grouped by sex
ggplot(brfss2013_q2, aes(income2, group = sex, fill = sex)) +
  geom_bar(aes(y = ..prop..), position = "dodge") +
  labs(title = "Income Level By Gender",
       x = "Income",
       y = "Proportion") +
  theme(plot.title = element_text(hjust = 0.5))
It looks like women have higher proportions for the income levels from “less than 10k” up to “less than 35k”, while men have higher proportions from “less than 50k” upward.
Let’s look from another angle to see how the proportions are distributed within each gender.
ggplot(brfss2013_q2, aes(sex, fill = income2)) +
  geom_bar(position = "fill") +
  labs(title = "Income Level Among Gender",
       x = "Gender",
       y = "Proportion",
       fill = "Income") +
  theme(plot.title = element_text(hjust = 0.5))
This graph supports the previous one, so there seems to be an association between gender and income level. For both genders, the most common income level is “>=$75,000” (about 32% of men and 25% of women).
Research Question 3:
Does gender affect average sleeping hours? And is there any relation between sleeping hours and having a depressive disorder?
Select the variables of interest
View summary statistics for the variables
## sex sleptim1 addepev2
## Male :201313 Min. : 0.000 Yes : 95779
## Female:290455 1st Qu.: 6.000 No :393707
## NA's : 7 Median : 7.000 NA's: 2289
## Mean : 7.052
## 3rd Qu.: 8.000
## Max. :450.000
## NA's :7387
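The selection step and the call that printed the summary above are not echoed. A minimal sketch, assuming `select()` followed by base R `summary()`. The data frame `toy_brfss` is a hypothetical stand-in for `brfss2013`:

```r
library(dplyr)

# Toy stand-in for brfss2013 (hypothetical values, not BRFSS data)
toy_brfss <- data.frame(
  sex = factor(c("Male", "Female", "Female", "Male"),
               levels = c("Male", "Female")),
  sleptim1 = c(7, 6, NA, 450),
  addepev2 = factor(c("No", "Yes", "No", "No"), levels = c("Yes", "No")),
  genhlth = factor(c("Good", "Fair", "Good", "Poor"))
)

# Keep only the three variables needed for this question
toy_q3 <- toy_brfss %>%
  select(sex, sleptim1, addepev2)

# Level counts for factors, quantiles and NA counts for numeric columns
q3_summary <- summary(toy_q3)

q3_summary
```

A summary like this is what reveals the two problems handled next: missing values and an implausible maximum for sleep hours.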
Removing observations that contain missing values
#remove missing values in all three variables
brfss2013_q3 <- brfss2013_q3 %>%
  filter(!is.na(sex), !is.na(sleptim1), !is.na(addepev2))
Removing observations that contain invalid values, in this case sleeping hours greater than 24
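The code for this filtering step is not echoed. A minimal sketch, assuming a simple `sleptim1 <= 24` condition, followed by the kind of call that would print the gender proportions shown next. The data frame `toy_q3` is a hypothetical stand-in:

```r
library(dplyr)

# Toy stand-in for brfss2013_q3 (hypothetical values, not BRFSS data)
toy_q3 <- data.frame(
  sex = factor(c("Male", "Female", "Female", "Male"),
               levels = c("Male", "Female")),
  sleptim1 = c(7, 6, 450, 24)
)

# A day has 24 hours, so larger values cannot be valid answers
toy_q3 <- toy_q3 %>%
  filter(sleptim1 <= 24)

# Share of each gender among the remaining observations
gender_props <- prop.table(table(toy_q3$sex))

gender_props
```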
##
## Male Female
## 0.4103382 0.5896618
Plot the distribution of sleeping hours for each gender
ggplot(brfss2013_q3, aes(y = sleptim1, x = sex)) +
  geom_boxplot() +
  labs(title = "Average Sleep Hours Among Gender",
       x = "Gender",
       y = "Average Sleep Hours") +
  theme(plot.title = element_text(hjust = 0.5))
It appears that both genders have similar distributions of average sleeping hours. There are a few observations with very high sleep hours, which may require further investigation.
#view summary statistics for the above graph to get a better idea
brfss2013_q3 %>%
  group_by(sex) %>%
  summarize(mean = mean(sleptim1),
            median = median(sleptim1))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 2 x 3
## sex mean median
## <fct> <dbl> <dbl>
## 1 Male 7.03 7
## 2 Female 7.07 7
brfss2013_q3 %>%
  ggplot(aes(x = sex, fill = addepev2)) +
  geom_bar(position = "dodge") +
  labs(title = "Depressive Disorder By Gender",
       x = "Gender",
       y = "Count",
       fill = "Depressive Disorder") +
  theme(plot.title = element_text(hjust = 0.5))
Women show a higher proportion of depressive disorder than men.
Plot the proportion of depressive disorder for each average sleep hour by gender
ggplot(brfss2013_q3, aes(as.factor(sleptim1), fill = addepev2)) +
  #position = "fill" already rescales each bar to proportions, so no ..prop.. mapping is needed
  geom_bar(position = "fill") +
  facet_wrap(~sex, ncol = 1) +
  labs(title = "Depressive Disorder And Sleep Hours Among Gender",
       x = "Average Sleep Hours",
       y = "Proportion",
       fill = "Depressive Disorder") +
  theme(plot.title = element_text(hjust = 0.5))
The proportion of people who have a depressive disorder increases as sleep hours move away from the average (7 hours). Both genders show a similar pattern, with slight differences for sleep hours above 16.