setwd("C:/Stat501stQuiz")
getwd()
## [1] "C:/Stat501stQuiz"
library(ggplot2)
library(dplyr)
load("brfss2013.RData")
Come up with at least three research questions that you want to answer using these data. You should phrase your research questions in a way that matches up with the scope of inference your dataset allows for. Make sure that at least two of these questions involve at least three variables. You are welcome to create new variables based on existing ones. With each question include a brief discussion (1-2 sentences) as to why this question is of interest to you and/or your audience.
Research question 1:
How many female respondents reported being pregnant in 2013?
Research question 2:
Are married women who are already retired still generally healthy?
Research question 3:
Are females more likely than males to have arthritis?
Perform exploratory data analysis (EDA) that addresses each of the three research questions you outlined above. Your EDA should contain numerical summaries and visualizations. Each R output and plot should be accompanied by a brief interpretation.
Research question 1:
str(select(brfss2013, iyear, pregnant, sex))
## 'data.frame': 491775 obs. of 3 variables:
## $ iyear : Factor w/ 2 levels "2013","2014": 1 1 1 1 1 1 1 1 1 1 ...
## $ pregnant: Factor w/ 2 levels "Yes","No": NA NA NA NA NA NA 2 NA NA NA ...
## $ sex : Factor w/ 2 levels "Male","Female": 2 2 2 2 1 2 2 2 1 2 ...
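# Respondents who reported being pregnant
Preggy <- brfss2013 %>%
  filter(pregnant == "Yes")
npreggy <- nrow(Preggy)
# Pregnant respondents by interview year, as a share of all pregnant respondents
brfss2013 %>%
  filter(pregnant == "Yes", !is.na(iyear)) %>%
  group_by(iyear, pregnant) %>%
  summarise(count = n(), percentage = n() * 100 / npreggy)
## `summarise()` has grouped output by 'iyear'. You can override using the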
## `.groups` argument.
## # A tibble: 2 x 4
## # Groups:   iyear [2]
##   iyear pregnant count percentage
##   <fct> <fct>    <int>      <dbl>
## 1 2013  Yes       3004      98.4
## 2 2014  Yes         49       1.60
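In our data, 3,004 respondents who reported being pregnant were interviewed in 2013, and 49 were interviewed in 2014. The 2013 interviews account for 98.4% of all respondents who reported a pregnancy. This is a count of survey respondents, not an estimate of the number of pregnant women in the population.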
ggplot(Preggy, aes(y = iyear)) + geom_bar() + ggtitle('Pregnant Women')
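The percentages above use all pregnant respondents as the denominator, so they describe when pregnant respondents were interviewed rather than how common pregnancy was in each year. A minimal sketch of the alternative, the share of women answering the pregnancy question who said "Yes" within each interview year, is shown below. The data frame `toy` and its values are invented for illustration; only the column names `iyear`, `sex`, and `pregnant` come from the data set.

```r
library(dplyr)

# Hypothetical toy data standing in for brfss2013 (values are invented)
toy <- data.frame(
  iyear    = factor(c("2013", "2013", "2013", "2013", "2014", "2014")),
  sex      = factor(c("Female", "Female", "Female", "Male", "Female", "Female")),
  pregnant = factor(c("Yes", "No", NA, NA, "Yes", "No"), levels = c("Yes", "No"))
)

# Share of women who answered the pregnancy question with "Yes", per interview year
preg_by_year <- toy %>%
  filter(sex == "Female", !is.na(pregnant), !is.na(iyear)) %>%
  group_by(iyear) %>%
  summarise(
    n_asked      = n(),                      # women with a non-missing answer
    n_pregnant   = sum(pregnant == "Yes"),   # of those, how many said "Yes"
    pct_pregnant = 100 * n_pregnant / n_asked,
    .groups = "drop"
  )

print(preg_by_year)
```

Replacing `toy` with `brfss2013` would apply the same steps to the real data; the resulting numbers are not shown here because they were not computed in this report.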
Research question 2:
str(select(brfss2013, sex, genhlth, employ1))
## 'data.frame': 491775 obs. of 3 variables:
## $ sex : Factor w/ 2 levels "Male","Female": 2 2 2 2 1 2 2 2 1 2 ...
## $ genhlth: Factor w/ 5 levels "Excellent","Very good",..: 4 3 3 2 3 2 4 3 1 3 ...
## $ employ1: Factor w/ 8 levels "Employed for wages",..: 7 1 1 7 7 1 1 7 7 5 ...
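# Retired female respondents (denominator for the percentages below)
fret <- brfss2013 %>%
  filter(sex == "Female", employ1 == "Retired")
nfret <- nrow(fret)
# Distribution of self-reported general health among retired women
brfss2013 %>%
  filter(sex == "Female", employ1 == "Retired", !is.na(genhlth)) %>%
  group_by(sex, genhlth, employ1) %>%
  summarise(count = n(), percentage = n() * 100 / nfret)
## `summarise()` has grouped output by 'sex', 'genhlth'. You can override using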
## the `.groups` argument.
## # A tibble: 5 x 5
## # Groups:   sex, genhlth [5]
##   sex    genhlth   employ1 count percentage
##   <fct>  <fct>     <fct>   <int>      <dbl>
## 1 Female Excellent Retired 10561      12.7
## 2 Female Very good Retired 25703      30.8
## 3 Female Good      Retired 27921      33.5
## 4 Female Fair      Retired 13727      16.5
## 5 Female Poor      Retired  5063       6.07
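In general, retired women in our data report being healthy: about 77% rate their health as good, very good, or excellent, and only 6.07% rate it as poor. Note that the filter above does not condition on marital status, so these figures describe all retired female respondents, not only married ones.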
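Because the research question asks about married women, the filter would also need a marital-status condition. The sketch below shows one way to add it on invented toy data. It assumes the data set has a `marital` column with a level named "Married", which should be checked against the codebook before use. It also uses only respondents with a non-missing `genhlth` as the denominator, which differs from the `nfret` denominator used above.

```r
library(dplyr)

# Hypothetical toy data; `marital` and its "Married" level are assumptions
toy <- data.frame(
  sex     = factor(c("Female", "Female", "Female", "Female", "Male", "Female")),
  marital = factor(c("Married", "Married", "Widowed", "Married", "Married", "Married")),
  employ1 = factor(c("Retired", "Retired", "Retired", "Employed for wages",
                     "Retired", "Retired")),
  genhlth = factor(c("Good", "Poor", "Good", "Excellent", "Good", NA),
                   levels = c("Excellent", "Very good", "Good", "Fair", "Poor"))
)

# General health among married, retired women with a non-missing answer
health_married_retired <- toy %>%
  filter(sex == "Female", marital == "Married",
         employ1 == "Retired", !is.na(genhlth)) %>%
  count(genhlth, name = "count") %>%
  mutate(percentage = 100 * count / sum(count))

print(health_married_retired)
```

With this denominator the percentages sum to 100 across the health categories, which makes the table easier to read as a distribution.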
ggplot(filter(fret, !is.na(genhlth)), aes(y = genhlth)) + geom_bar() + ggtitle('General Health')
Research question 3:
str(select(brfss2013, sex, arthdis2))
## 'data.frame': 491775 obs. of 2 variables:
## $ sex : Factor w/ 2 levels "Male","Female": 2 2 2 2 1 2 2 2 1 2 ...
## $ arthdis2: Factor w/ 2 levels "Yes","No": 1 NA 1 NA NA NA 1 2 2 NA ...
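# Respondents who answered "Yes" to arthdis2 (denominator for the percentages below)
farth <- brfss2013 %>%
  filter(arthdis2 == "Yes")
nfarth <- nrow(farth)
# Split of the "Yes" respondents by sex
brfss2013 %>%
  filter(arthdis2 == "Yes") %>%
  group_by(sex, arthdis2) %>%
  summarise(count = n(), percentage = n() * 100 / nfarth)
## `summarise()` has grouped output by 'sex'. You can override using the `.groups`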
## argument.
## # A tibble: 2 x 4
## # Groups:   sex [2]
##   sex    arthdis2 count percentage
##   <fct>  <fct>    <int>      <dbl>
## 1 Male   Yes      16030       32.8
## 2 Female Yes      32844       67.2
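Our data are consistent with our expectation: of the respondents who answered "Yes", 32,844 (67.2%) are female and 16,030 (32.8%) are male. This split depends on how many women and men were surveyed, so it does not by itself show that women are more likely to have arthritis; that would require comparing the proportion of "Yes" answers within each sex. Also, per the BRFSS codebook, arthdis2 appears to record whether arthritis or joint symptoms affect the respondent's work, while the diagnosis itself is recorded in a separate variable (havarth3), so the variable choice should be checked before drawing conclusions.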
ggplot(farth, aes(y = sex)) + geom_bar() + ggtitle('Arthritis')
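To compare the likelihood of a "Yes" answer between women and men, the percentage has to be computed within each sex rather than across all "Yes" respondents. The following is a minimal sketch on invented toy data; only the column names `sex` and `arthdis2` come from the data set, and the values are hypothetical.

```r
library(dplyr)

# Hypothetical toy data standing in for brfss2013 (values are invented)
toy <- data.frame(
  sex      = factor(c("Male", "Male", "Male", "Male",
                      "Female", "Female", "Female", "Female", "Female", "Female")),
  arthdis2 = factor(c("Yes", "No", "No", NA,
                      "Yes", "Yes", "No", "No", "No", NA),
                    levels = c("Yes", "No"))
)

# Proportion of "Yes" answers within each sex, ignoring missing answers
arth_by_sex <- toy %>%
  filter(!is.na(sex), !is.na(arthdis2)) %>%
  group_by(sex) %>%
  summarise(
    n_answered = n(),
    n_yes      = sum(arthdis2 == "Yes"),
    pct_yes    = 100 * n_yes / n_answered,
    .groups = "drop"
  )

print(arth_by_sex)
```

In the toy data women make up two thirds of the "Yes" answers, yet the within-sex rates are much closer (40% versus about 33%), which illustrates why the two summaries answer different questions.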