Setup

setwd("C:/Stat501stQuiz")
getwd()
## [1] "C:/Stat501stQuiz"

Load Packages

library(ggplot2)
library(dplyr)

Load data

load("brfss2013.RData")

The data file is the one provided in our Google Classroom.

Part 1: Research questions

Come up with at least three research questions that you want to answer using these data. You should phrase your research questions in a way that matches up with the scope of inference your dataset allows for. Make sure that at least two of these questions involve at least three variables. You are welcome to create new variables based on existing ones. With each question include a brief discussion (1-2 sentences) as to why this question is of interest to you and/or your audience.

Research question 1:

How many women reported being pregnant in 2013?

Research question 2:

Are married women who are already retired still generally healthy?

Research question 3:

Are women more likely than men to have arthritis?

Part 3: Exploratory data analysis

Perform exploratory data analysis (EDA) that addresses each of the three research questions you outlined above. Your EDA should contain numerical summaries and visualizations. Each R output and plot should be accompanied by a brief interpretation.

Research question 1:

str(select(brfss2013, iyear, pregnant, sex))
## 'data.frame':    491775 obs. of  3 variables:
##  $ iyear   : Factor w/ 2 levels "2013","2014": 1 1 1 1 1 1 1 1 1 1 ...
##  $ pregnant: Factor w/ 2 levels "Yes","No": NA NA NA NA NA NA 2 NA NA NA ...
##  $ sex     : Factor w/ 2 levels "Male","Female": 2 2 2 2 1 2 2 2 1 2 ...
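# Respondents who reported being pregnant; npreggy is the percentage denominator
Preggy <- brfss2013 %>%
  filter(pregnant == "Yes")
npreggy <- nrow(Preggy)

# Count pregnant respondents by interview year, dropping missing years
brfss2013 %>%
  filter(pregnant == "Yes", !is.na(iyear)) %>%
  group_by(iyear, pregnant) %>%
  summarise(count = n(), percentage = n() * 100 / npreggy)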
## `summarise()` has grouped output by 'iyear'. You can override using the
## `.groups` argument.
## # A tibble: 2 x 4
## # Groups:   iyear [2]
##   iyear pregnant count percentage
##   <fct> <fct>    <int>      <dbl>
## 1 2013  Yes       3004      98.4 
## 2 2014  Yes         49       1.60

In our data, 3004 of the respondents who reported being pregnant (about 98.4%) were interviewed in 2013, and 49 (about 1.6%) were interviewed in 2014.
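As a quick check on the denominator, the sketch below works backwards from the printed count and percentage to the value of npreggy. It assumes 98.362803 is the unrounded 2013 percentage returned by summarise(); the variable names are our own.

```r
# Counts from the summary table; pct_2013 is the unrounded 2013 percentage.
count_2013 <- 3004
count_2014 <- 49
pct_2013   <- 98.362803

# Denominator implied by the printed percentage (this should equal npreggy).
implied_total <- round(count_2013 * 100 / pct_2013)

# Rows counted in npreggy that appear in neither year group.
unaccounted <- implied_total - (count_2013 + count_2014)

print(c(implied_total = implied_total, unaccounted = unaccounted))
```

If unaccounted is not zero, this suggests that npreggy includes pregnant respondents whose iyear is missing, because Preggy is built without the missing-year filter.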

ggplot(filter(Preggy, !is.na(iyear)), aes(y = iyear)) + geom_bar() + ggtitle("Pregnant Women by Interview Year")

Research question 2:

str(select(brfss2013, sex, genhlth,employ1))
## 'data.frame':    491775 obs. of  3 variables:
##  $ sex    : Factor w/ 2 levels "Male","Female": 2 2 2 2 1 2 2 2 1 2 ...
##  $ genhlth: Factor w/ 5 levels "Excellent","Very good",..: 4 3 3 2 3 2 4 3 1 3 ...
##  $ employ1: Factor w/ 8 levels "Employed for wages",..: 7 1 1 7 7 1 1 7 7 5 ...
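# Retired female respondents; nfret is the percentage denominator
fret <- brfss2013 %>%
  filter(sex == "Female", employ1 == "Retired")
nfret <- nrow(fret)

# Count retired women by general health, dropping missing health ratings
brfss2013 %>%
  filter(sex == "Female", employ1 == "Retired", !is.na(genhlth)) %>%
  group_by(sex, genhlth, employ1) %>%
  summarise(count = n(), percentage = n() * 100 / nfret)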
## `summarise()` has grouped output by 'sex', 'genhlth'. You can override using
## the `.groups` argument.
## # A tibble: 5 x 5
## # Groups:   sex, genhlth [5]
##   sex    genhlth   employ1 count percentage
##   <fct>  <fct>     <fct>   <int>      <dbl>
## 1 Female Excellent Retired 10561      12.7 
## 2 Female Very good Retired 25703      30.8 
## 3 Female Good      Retired 27921      33.5 
## 4 Female Fair      Retired 13727      16.5 
## 5 Female Poor      Retired  5063       6.07

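In general, retired women are still generally healthy. In our data, about 77% of them rated their health as good, very good, or excellent, and only about 6.1% rated it as poor. Note that the code filters on sex and employment status only, so these figures cover all retired women, not just married ones.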
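The percentages in the table are divided by nfret, which still counts retired women with a missing genhlth, so they add up to a little under 100. A minimal sketch, assuming we only want respondents with a recorded answer, recomputes the shares from the printed counts. The names counts, shares, and good_or_better are our own.

```r
# Counts copied from the summary table for retired women.
counts <- c(Excellent = 10561, `Very good` = 25703, Good = 27921,
            Fair = 13727, Poor = 5063)

# Shares among respondents with a recorded general health rating.
shares <- 100 * counts / sum(counts)

# Combined share reporting good, very good, or excellent health.
good_or_better <- sum(shares[c("Excellent", "Very good", "Good")])

print(round(shares, 1))
print(round(good_or_better, 1))
```

With this denominator the five shares sum to exactly 100, which makes the categories easier to compare.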

ggplot(filter(fret, !is.na(genhlth)), aes(y = genhlth)) + geom_bar() + ggtitle("General Health of Retired Women")

Research question 3:

str(select(brfss2013, sex, arthdis2))
## 'data.frame':    491775 obs. of  2 variables:
##  $ sex     : Factor w/ 2 levels "Male","Female": 2 2 2 2 1 2 2 2 1 2 ...
##  $ arthdis2: Factor w/ 2 levels "Yes","No": 1 NA 1 NA NA NA 1 2 2 NA ...
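# Respondents who answered "Yes" to arthdis2; nfarth is the percentage denominator
farth <- brfss2013 %>%
  filter(arthdis2 == "Yes")
nfarth <- nrow(farth)

# Split the "Yes" answers by sex
brfss2013 %>%
  filter(arthdis2 == "Yes") %>%
  group_by(sex, arthdis2) %>%
  summarise(count = n(), percentage = n() * 100 / nfarth)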
## `summarise()` has grouped output by 'sex'. You can override using the `.groups`
## argument.
## # A tibble: 2 x 4
## # Groups:   sex [2]
##   sex    arthdis2 count percentage
##   <fct>  <fct>    <int>      <dbl>
## 1 Male   Yes      16030       32.8
## 2 Female Yes      32844       67.2

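Among the respondents who answered "Yes" to arthdis2, 32844 (about 67.2%) are female and 16030 (about 32.8%) are male. This is consistent with our assumption that women are more likely than men to have arthritis, but it does not prove it: the figure is the share of women among the "Yes" answers, not the rate of "Yes" within each sex, so it also depends on how many women and men answered the question. As far as we can tell from the BRFSS codebook, arthdis2 records whether arthritis or joint symptoms affect work, so a diagnosis variable such as havarth3 may be a better fit for this question.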
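To compare women and men directly, the "Yes" rate should be computed within each sex. The sketch below shows one way to do that with row proportions. The data frame toy is hypothetical, made-up data standing in for brfss2013, and yes_rate_by_sex is our own helper, not a function from any package.

```r
# Hypothetical toy data standing in for brfss2013 (values are made up).
toy <- data.frame(
  sex = factor(c("Male", "Male", "Male", "Male",
                 "Female", "Female", "Female", "Female", "Female", "Female"),
               levels = c("Male", "Female")),
  arthdis2 = factor(c("Yes", "No", "No", NA,
                      "Yes", "Yes", "No", "No", "No", NA),
                    levels = c("Yes", "No"))
)

# Percentage answering "Yes" within each sex.
# table() drops NA answers by default, so only recorded answers are counted.
yes_rate_by_sex <- function(df) {
  tab <- table(df$sex, df$arthdis2)
  prop.table(tab, margin = 1)[, "Yes"] * 100
}

rates <- yes_rate_by_sex(toy)
print(round(rates, 1))
```

Calling yes_rate_by_sex(brfss2013) should give the within-sex rates for the real data, which is the comparison the research question asks for.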

ggplot(farth, aes(y = sex)) + geom_bar() + ggtitle("Arthritis by Sex")