Setup

setwd("D:/STAT50/QUIZ1")
getwd()
## [1] "D:/STAT50/QUIZ1"

Load packages

library(ggplot2)
library(dplyr)

Load data

load("brfss2013.Rdata")

Refer to the data provided in our Google Classroom.

Part 1: Research questions

Come up with at least three research questions that you want to answer using these data. You should phrase your research questions in a way that matches up with the scope of inference your dataset allows for. Make sure that at least two of these questions involve at least three variables. You are welcome to create new variables based on existing ones. With each question, include a brief discussion (1-2 sentences) as to why this question is of interest to you and/or your audience.

Research question 1:

Do asthmatic people usually avoid smoking? (Smoking is a habit for many people, and this will tell us whether people who have asthma are less likely to have that habit.)

Research question 2:

Do people with an income of less than $10,000 who never go for check-ups generally have poor health?

(Low-income people often cannot afford to go for health check-ups.)

Research question 3:

Do men who have arthritis generally have difficulty walking?

(Arthritis can limit mobility, and we want to see how often it causes this trouble for men.)

Part 3: Exploratory data analysis

Perform exploratory data analysis (EDA) that addresses each of the three research questions you outlined above. Your EDA should contain numerical summaries and visualizations. Each R output and plot should be accompanied by a brief interpretation.

Research question 1:

str(select(brfss2013,asthma3,smoke100))
## 'data.frame':    491775 obs. of  2 variables:
##  $ asthma3 : Factor w/ 2 levels "Yes","No": 1 2 2 2 1 2 2 2 2 2 ...
##  $ smoke100: Factor w/ 2 levels "Yes","No": 1 2 1 2 1 2 1 1 2 2 ...
# Denominator: every respondent who reported asthma, including those with no smoke100 answer
asthmatic <- brfss2013 %>%
  filter(asthma3 == "Yes")
asthmatic2 <- nrow(asthmatic)
brfss2013 %>%
  filter(asthma3 == "Yes", !is.na(smoke100)) %>%
  group_by(asthma3, smoke100) %>%
  summarise(count = n(), percentage = n() * 100 / asthmatic2)
## `summarise()` has grouped output by 'asthma3'. You can override using the
## `.groups` argument.
## # A tibble: 2 × 4
## # Groups:   asthma3 [1]
##   asthma3 smoke100 count percentage
##   <fct>   <fct>    <int>      <dbl>
## 1 Yes     Yes      32263       48.0
## 2 Yes     No       33148       49.3
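
Missing answers in these columns are stored as NA, not as the text "NA", and that affects how rows get filtered. The sketch below uses a small made-up factor (smoke is a hypothetical stand-in, not the real smoke100 column) to show that a comparison against the string "NA" returns NA for missing entries, while is.na() tests for them directly.

```r
# Hypothetical stand-in for a Yes/No survey column with one missing answer
smoke <- factor(c("Yes", "No", NA, "Yes"), levels = c("Yes", "No"))

# Comparing with the string "NA" does not detect missing values;
# the missing entry simply yields NA in the result
cmp <- smoke != "NA"
print(cmp)

# is.na() states the intent directly and returns only TRUE/FALSE
answered <- !is.na(smoke)
print(answered)

# Number of respondents who actually gave an answer
n_answered <- sum(answered)
print(n_answered)
```

dplyr's filter() keeps only rows where the condition is TRUE, so rows where the comparison yields NA are dropped as well. Both forms therefore select the same rows, but the is.na() version says what is meant.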

Graph

# Count asthmatic respondents by smoking status (NA answers never match ==)
Asmoke <- brfss2013 %>%
  filter(asthma3 == "Yes", smoke100 == "Yes")
AS <- nrow(Asmoke)
Nsmoke <- brfss2013 %>%
  filter(asthma3 == "Yes", smoke100 == "No")
AN <- nrow(Nsmoke)
x <- c(AS, AN)
labels <- c("Smokers", "Nonsmokers")

# Pass the labels so the slices are not labelled with bare index numbers
pie(x, labels = labels, main = "Asthmatic People", col = rainbow(length(x)))
legend("topright", labels, cex = 0.8,
   fill = rainbow(length(x)))

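Interpretation: Among respondents who reported asthma, 49.3% answered "No" to smoke100 (they have not smoked at least 100 cigarettes in their lifetime) and 48.0% answered "Yes"; the remaining 2.7% or so gave no answer. Because the two groups are nearly equal in size, we cannot say that most asthmatic people avoid smoking.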
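
The two percentages in the table add up to 97.3% rather than 100% because the denominator counts every respondent with asthma, including those with no smoke100 answer. As a rough check, the sketch below recomputes the shares among only those who answered, using the two counts copied from the table (the variable names are illustrative).

```r
# Counts copied from the summary table for respondents with asthma
counts <- c(Yes = 32263, No = 33148)

# Share of each answer among respondents who gave a smoke100 answer
share_answered <- round(100 * counts / sum(counts), 1)
print(share_answered)
# Yes 49.3, No 50.7
```

Under this denominator the split is still close to even, so the conclusion does not change.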

Research question 2:

str(select(brfss2013,genhlth,checkup1,income2))
## 'data.frame':    491775 obs. of  3 variables:
##  $ genhlth : Factor w/ 5 levels "Excellent","Very good",..: 4 3 3 2 3 2 4 3 1 3 ...
##  $ checkup1: Factor w/ 5 levels "Within past year",..: 1 1 1 2 4 1 1 1 1 1 ...
##  $ income2 : Factor w/ 8 levels "Less than $10,000",..: 7 8 8 7 6 8 NA 6 8 4 ...
lincome<- brfss2013 %>%
  filter(income2 == "Less than $10,000", checkup1 == "Never")
lincome2<-nrow(lincome)
lincome2
## [1] 405
brfss2013 %>%
  filter(income2 == "Less than $10,000", checkup1 == "Never", !is.na(genhlth)) %>%
  group_by(income2, checkup1, genhlth) %>%
  summarise(count = n(), percentage = n() * 100 / lincome2)
## `summarise()` has grouped output by 'income2', 'checkup1'. You can override
## using the `.groups` argument.
## # A tibble: 5 × 5
## # Groups:   income2, checkup1 [1]
##   income2           checkup1 genhlth   count percentage
##   <fct>             <fct>    <fct>     <int>      <dbl>
## 1 Less than $10,000 Never    Excellent    53       13.1
## 2 Less than $10,000 Never    Very good    68       16.8
## 3 Less than $10,000 Never    Good        127       31.4
## 4 Less than $10,000 Never    Fair         98       24.2
## 5 Less than $10,000 Never    Poor         54       13.3

Graph

# Count the low-income, never-checked-up group by general health in one pass
health <- brfss2013 %>%
  filter(income2 == "Less than $10,000", checkup1 == "Never", !is.na(genhlth)) %>%
  count(genhlth)
x <- health$n
labels <- as.character(health$genhlth)

# Use "\n" for the line break so no stray indentation ends up in the title
pie(x, labels = labels,
    main = "General Health of People with Income Less than $10,000\nthat Never Go to Check-ups",
    col = rainbow(length(x)))
legend("topright", labels, cex = 0.8,
   fill = rainbow(length(x)))

Interpretation: People with an income of less than $10,000 who never go for check-ups do not generally report poor health. Only 13.3% of this group (54 of 405) report poor general health, and the most common answer is "Good" at 31.4%.
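
Since "Fair" sits next to "Poor" on the genhlth scale, it can help to look at the two lowest categories together. The sketch below does that arithmetic with the counts copied from the table and the group size of 405 reported earlier (the variable names are illustrative).

```r
# Counts copied from the summary table for the low-income, never-checked-up group
counts <- c(Excellent = 53, "Very good" = 68, Good = 127, Fair = 98, Poor = 54)
group_size <- 405  # value of lincome2 printed earlier; includes missing genhlth answers

# Combined share of the two lowest health categories
fair_or_poor <- sum(counts[c("Fair", "Poor")])
fair_or_poor_pct <- round(100 * fair_or_poor / group_size, 1)
print(fair_or_poor_pct)
# 37.5

# Most common answer in the group
most_common <- names(which.max(counts))
print(most_common)
# "Good"
```

Even when combined, fair or poor health accounts for well under half of the group.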

Research question 3:

str(select(brfss2013, sex, arthdis2, diffwalk))
## 'data.frame':    491775 obs. of  3 variables:
##  $ sex     : Factor w/ 2 levels "Male","Female": 2 2 2 2 1 2 2 2 1 2 ...
##  $ arthdis2: Factor w/ 2 levels "Yes","No": 1 NA 1 NA NA NA 1 2 2 NA ...
##  $ diffwalk: Factor w/ 2 levels "Yes","No": 1 2 1 2 2 2 2 1 2 2 ...
marthrit<- brfss2013 %>%
  filter(sex == "Male", arthdis2 == "Yes")
marthrit2<-nrow(marthrit)
marthrit2
## [1] 16030
brfss2013 %>%
  filter(sex == "Male", arthdis2 == "Yes", !is.na(diffwalk)) %>%
  group_by(sex, arthdis2, diffwalk) %>%
  summarise(count = n(), percentage = n() * 100 / marthrit2)
## `summarise()` has grouped output by 'sex', 'arthdis2'. You can override using
## the `.groups` argument.
## # A tibble: 2 × 5
## # Groups:   sex, arthdis2 [1]
##   sex   arthdis2 diffwalk count percentage
##   <fct> <fct>    <fct>    <int>      <dbl>
## 1 Male  Yes      Yes       8725       54.4
## 2 Male  Yes      No        7200       44.9

Graph

# Count men with arthdis2 == "Yes" by walking difficulty (NA answers never match ==)
Marthdis <- brfss2013 %>%
  filter(sex == "Male", arthdis2 == "Yes", diffwalk == "Yes")
X <- nrow(Marthdis)
Marthdis2 <- brfss2013 %>%
  filter(sex == "Male", arthdis2 == "Yes", diffwalk == "No")
Y <- nrow(Marthdis2)
x <- c(X, Y)
labels <- c("Having Difficulty in Walking", "Do not Have Difficulty in Walking")

# Pass the labels so the slices are not labelled with bare index numbers
pie(x, labels = labels, main = "Men with Arthritis", col = rainbow(length(x)))
legend("topright", labels, cex = 0.8,
   fill = rainbow(length(x)))

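Interpretation: Therefore, we can say that men in this group have difficulty walking more often than not: 54.4% of them (8,725 of 16,030) report it. One caveat: as we read the BRFSS codebook, arthdis2 asks whether arthritis or joint symptoms affect the respondent's work, so this group is men who answered "Yes" to that question rather than all men with arthritis.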
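
A share for one group is easier to judge next to the same share for a comparison group. The sketch below shows one way to get row percentages from a two-way table in base R. The toy data frame is entirely hypothetical and is not drawn from brfss2013; only the column names mirror the ones used above.

```r
# Hypothetical toy data, NOT taken from brfss2013
toy <- data.frame(
  arthdis2 = factor(c("Yes", "Yes", "Yes", "Yes", "No", "No", "No", "No"),
                    levels = c("Yes", "No")),
  diffwalk = factor(c("Yes", "Yes", "Yes", "No", "Yes", "No", "No", "No"),
                    levels = c("Yes", "No"))
)

# Two-way table of counts: rows are arthdis2, columns are diffwalk
tab <- table(arthdis2 = toy$arthdis2, diffwalk = toy$diffwalk)

# Row percentages: within each arthdis2 group, the share of each diffwalk answer
row_pct <- round(100 * prop.table(tab, margin = 1), 1)
print(row_pct)
```

Applied to the real data, the same table would show whether the 54.4% figure is high or low relative to men who answered "No" to arthdis2.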