setwd("D:/STAT50/QUIZ1")
getwd()
## [1] "D:/STAT50/QUIZ1"
library(ggplot2)
library(dplyr)
load("brfss2013.Rdata")
Come up with at least three research questions that you want to answer using these data. You should phrase your research questions in a way that matches up with the scope of inference your dataset allows for. Make sure that at least two of these questions involve at least three variables. You are welcome to create new variables based on existing ones. With each question include a brief discussion (1-2 sentences) as to why this question is of interest to you and/or your audience.
Research question 1:
Do asthmatic people usually avoid smoking? (Smoking is a habit for many people, and this will tell us whether people with asthma tend to avoid it.)
Research question 2:
Do people with income less than $10,000 who have never had a check-up generally have poor health?
(Low-income people often cannot afford health check-ups.)
Research question 3:
Does having arthritis imply that men have difficulty walking?
(Arthritis affects many men and causes them trouble.)
Perform exploratory data analysis (EDA) that addresses each of the three research questions you outlined above. Your EDA should contain numerical summaries and visualizations. Each R output and plot should be accompanied by a brief interpretation.
Research question 1:
str(select(brfss2013, asthma3, smoke100))
## 'data.frame': 491775 obs. of 2 variables:
## $ asthma3 : Factor w/ 2 levels "Yes","No": 1 2 2 2 1 2 2 2 2 2 ...
## $ smoke100: Factor w/ 2 levels "Yes","No": 1 2 1 2 1 2 1 1 2 2 ...
# Denominator: all respondents with asthma (including missing smoke100)
asthmatic <- brfss2013 %>%
  filter(asthma3 == "Yes")
asthmatic2 <- nrow(asthmatic)
# Counts and percentages of smoking status among respondents with asthma
brfss2013 %>%
  filter(asthma3 != "No", !is.na(smoke100)) %>%
  group_by(asthma3, smoke100) %>%
  summarise(count = n(), percentage = n() * 100 / asthmatic2)
## `summarise()` has grouped output by 'asthma3'. You can override using the
## `.groups` argument.
## # A tibble: 2 × 4
## # Groups: asthma3 [1]
##   asthma3 smoke100 count percentage
##   <fct>   <fct>    <int>      <dbl>
## 1 Yes     Yes      32263       48.0
## 2 Yes     No       33148       49.3
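Missing answers in these survey factors are stored as NA, not as the string "NA", which affects how filters treat them. The following is a minimal base-R sketch on a small hypothetical data frame (the `toy` values are invented for illustration). It uses `subset()`, which, like dplyr's `filter()`, drops rows where the condition evaluates to NA.

```r
# Hypothetical toy answers; NA marks a missing response.
toy <- data.frame(
  smoke100 = factor(c("Yes", "No", NA, "Yes"), levels = c("Yes", "No"))
)

# Comparing against the string "NA" never matches a missing value:
# the comparison itself evaluates to NA for that row.
cmp <- toy$smoke100 != "NA"

# is.na() is the explicit test for a missing value.
miss <- is.na(toy$smoke100)

# subset() drops rows whose condition is NA, so both forms
# keep the same three answered rows here.
kept_by_compare <- subset(toy, smoke100 != "NA")
kept_by_is_na   <- subset(toy, !is.na(smoke100))

print(cmp)
print(miss)
print(nrow(kept_by_compare))
print(nrow(kept_by_is_na))
```

Both forms return the same rows, but `!is.na()` states the intent directly and does not rely on NA rows being dropped as a side effect.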
Graph
# Respondents with asthma who answered "Yes" to smoke100
Asmoke <- brfss2013 %>%
  filter(asthma3 == "Yes", smoke100 == "Yes")
AS <- nrow(Asmoke)
# Respondents with asthma who answered "No" to smoke100
Nsmoke <- brfss2013 %>%
  filter(asthma3 == "Yes", smoke100 == "No")
AN <- nrow(Nsmoke)
x <- c(AS, AN)
labels <- c("Smokers", "Nonsmokers")
pie(x, main = "Asthmatic People", col = rainbow(length(x)))
legend("topright", labels, cex = 0.8,
       fill = rainbow(length(x)))
Interpretation: Among respondents with asthma, 49.3% report not having smoked (at least 100 cigarettes, per smoke100) and 48.0% report having smoked; the remaining 2.7% or so have no recorded answer. The two groups are nearly equal in size, so we cannot say that most asthmatic people avoid smoking.
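Because the percentages in the summary table are computed over all respondents with asthma, including those with no smoking answer, they do not sum to 100. A minimal base-R sketch that rebases the two reported counts (32263 and 33148) on respondents who answered; the variable names are illustrative only.

```r
# Counts reported in the summary table for respondents with asthma.
answered <- c(Smokers = 32263, Nonsmokers = 33148)

# Share of each group among those who gave a smoking answer.
pct_answered <- 100 * prop.table(answered)

print(round(pct_answered, 1))
```

On this basis the split is roughly 49% smokers to 51% nonsmokers, which leads to the same conclusion: neither group clearly dominates.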
Research question 2:
str(select(brfss2013, genhlth, checkup1, income2))
## 'data.frame': 491775 obs. of 3 variables:
## $ genhlth : Factor w/ 5 levels "Excellent","Very good",..: 4 3 3 2 3 2 4 3 1 3 ...
## $ checkup1: Factor w/ 5 levels "Within past year",..: 1 1 1 2 4 1 1 1 1 1 ...
## $ income2 : Factor w/ 8 levels "Less than $10,000",..: 7 8 8 7 6 8 NA 6 8 4 ...
# Denominator: respondents with income under $10,000 who never had a check-up
lincome <- brfss2013 %>%
  filter(income2 == "Less than $10,000", checkup1 == "Never")
lincome2 <- nrow(lincome)
lincome2
## [1] 405
# Distribution of general health within that group
brfss2013 %>%
  filter(income2 == "Less than $10,000", checkup1 == "Never", !is.na(genhlth)) %>%
  group_by(income2, checkup1, genhlth) %>%
  summarise(count = n(), percentage = n() * 100 / lincome2)
## `summarise()` has grouped output by 'income2', 'checkup1'. You can override
## using the `.groups` argument.
## # A tibble: 5 × 5
## # Groups: income2, checkup1 [1]
##   income2           checkup1 genhlth   count percentage
##   <fct>             <fct>    <fct>     <int>      <dbl>
## 1 Less than $10,000 Never    Excellent    53       13.1
## 2 Less than $10,000 Never    Very good    68       16.8
## 3 Less than $10,000 Never    Good        127       31.4
## 4 Less than $10,000 Never    Fair         98       24.2
## 5 Less than $10,000 Never    Poor         54       13.3
Graph
# One subset per general-health level within the low-income, never-checked group
L <- brfss2013 %>%
  filter(income2 == "Less than $10,000", checkup1 == "Never", genhlth == "Excellent")
M <- brfss2013 %>%
  filter(income2 == "Less than $10,000", checkup1 == "Never", genhlth == "Very good")
N <- brfss2013 %>%
  filter(income2 == "Less than $10,000", checkup1 == "Never", genhlth == "Good")
O <- brfss2013 %>%
  filter(income2 == "Less than $10,000", checkup1 == "Never", genhlth == "Fair")
P <- brfss2013 %>%
  filter(income2 == "Less than $10,000", checkup1 == "Never", genhlth == "Poor")
A <- nrow(L)
B <- nrow(M)
C <- nrow(N)
D <- nrow(O)
E <- nrow(P)
x <- c(A, B, C, D, E)
labels <- c("Excellent", "Very good", "Good", "Fair", "Poor")
pie(x, main = "General Health of People with Income Less than $10,000\nthat Never Go to Check-ups",
    col = rainbow(length(x)))
legend("topright", labels, cex = 0.8,
       fill = rainbow(length(x)))
Interpretation: People with income less than $10,000 who have never had a check-up do not generally report poor health. Only 13.3% of them rate their general health as poor, and the most common rating is good (31.4%).
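The five shares can be re-derived from the reported counts and the group size of 405, and combined where a coarser view helps. A minimal base-R sketch; merging Fair and Poor into one category is an assumption made here for illustration and is not part of the original analysis.

```r
# Counts reported in the summary table, by general-health rating.
counts <- c(Excellent = 53, `Very good` = 68, Good = 127, Fair = 98, Poor = 54)

# Group size reported above (includes respondents with no health rating).
group_size <- 405

# Percentage of the whole group in each rating.
pct <- 100 * counts / group_size
print(round(pct, 1))

# Combined share rating their health as fair or poor (illustrative grouping).
fair_or_poor <- sum(pct[c("Fair", "Poor")])
print(round(fair_or_poor, 1))

# Respondents in the group with no recorded health rating.
not_answered <- group_size - sum(counts)
print(not_answered)
```

Even with Fair and Poor combined, the share is about 37.5%, so ratings of good or better remain the majority in this group.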
Research question 3:
str(select(brfss2013, sex, arthdis2, diffwalk))
## 'data.frame': 491775 obs. of 3 variables:
## $ sex : Factor w/ 2 levels "Male","Female": 2 2 2 2 1 2 2 2 1 2 ...
## $ arthdis2: Factor w/ 2 levels "Yes","No": 1 NA 1 NA NA NA 1 2 2 NA ...
## $ diffwalk: Factor w/ 2 levels "Yes","No": 1 2 1 2 2 2 2 1 2 2 ...
# Denominator: male respondents with arthdis2 == "Yes" (including missing diffwalk)
marthrit <- brfss2013 %>%
  filter(sex == "Male", arthdis2 == "Yes")
marthrit2 <- nrow(marthrit)
marthrit2
## [1] 16030
# Difficulty walking within that group
brfss2013 %>%
  filter(sex == "Male", arthdis2 == "Yes", !is.na(diffwalk)) %>%
  group_by(sex, arthdis2, diffwalk) %>%
  summarise(count = n(), percentage = n() * 100 / marthrit2)
## `summarise()` has grouped output by 'sex', 'arthdis2'. You can override using
## the `.groups` argument.
## # A tibble: 2 × 5
## # Groups: sex, arthdis2 [1]
##   sex   arthdis2 diffwalk count percentage
##   <fct> <fct>    <fct>    <int>      <dbl>
## 1 Male  Yes      Yes       8725       54.4
## 2 Male  Yes      No        7200       44.9
Graph
# Men with arthritis who report difficulty walking
Marthdis <- brfss2013 %>%
  filter(sex == "Male", arthdis2 == "Yes", diffwalk == "Yes")
X <- nrow(Marthdis)
# Men with arthritis who report no difficulty walking
Marthdis2 <- brfss2013 %>%
  filter(sex == "Male", arthdis2 == "Yes", diffwalk == "No")
Y <- nrow(Marthdis2)
x <- c(X, Y)
labels <- c("Having Difficulty in Walking", "Do not Have Difficulty in Walking")
pie(x, main = "Men with Arthritis", col = rainbow(length(x)))
legend("topright", labels, cex = 0.8,
       fill = rainbow(length(x)))
Interpretation: Therefore, we can say that a slight majority of men with arthritis have difficulty walking: 54.4% report it, while 44.9% do not.
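Whether arthritis is associated with difficulty walking also depends on how common the difficulty is among men without arthritis, which the summary above does not show. A minimal base-R sketch of that comparison on a small, entirely hypothetical data frame (the values are made up and are not BRFSS results; the column name `arthritis` is illustrative):

```r
# Hypothetical toy data for male respondents; NA marks a missing answer.
toy <- data.frame(
  arthritis = factor(c("Yes", "Yes", "Yes", "Yes", "No", "No", "No", "No"),
                     levels = c("Yes", "No")),
  diffwalk  = factor(c("Yes", "Yes", "No", NA, "No", "No", "Yes", "No"),
                     levels = c("Yes", "No"))
)

# Two-way counts; table() leaves out NA values by default.
tab <- table(arthritis = toy$arthritis, diffwalk = toy$diffwalk)

# Row percentages: share with and without difficulty walking in each group.
row_pct <- 100 * prop.table(tab, margin = 1)

print(round(row_pct, 1))
```

Applied to the real data, the same calls could take `arthdis2` and `diffwalk` for male respondents, so the two groups can be read side by side.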