setwd("C:/StatQuiz")
getwd()
## [1] "C:/StatQuiz"
load("brfss2013.RData")
library(ggplot2)
library(dplyr)
Come up with at least three research questions that you want to answer using these data. You should phrase your research questions in a way that matches up with the scope of inference your dataset allows for. Make sure that at least two of these questions involve at least three variables. You are welcome to create new variables based on existing ones. With each question include a brief discussion (1-2 sentences) as to why this question is of interest to you and/or your audience.
Research question 1: Among respondents who sleep less than 7 hours, do more college graduates than high school graduates still report good general health?
Research question 2: Who is more likely to consume fruits and vegetables one or more times per day: respondents with good or better health, or respondents with fair or poor health?
Research question 3: Of the two most common bad habits, which group is more likely to report good or better health: respondents who drink alcohol or respondents who smoke?
Perform exploratory data analysis (EDA) that addresses each of the three research questions you outlined above. Your EDA should contain numerical summaries and visualizations. Each R output and plot should be accompanied by a brief interpretation.
Research question 1: Among respondents who sleep less than 7 hours, do more college graduates than high school graduates still report good general health?
We’ll be using the following variables for this question:
* sleptim1: the hours of sleep a respondent gets in a 24-hour period.
* genhlth: the respondent’s self-reported general health.
* X_educag: the highest level of education the respondent completed.
The types of the variables we’re dealing with:
str(select(brfss2013,sleptim1,genhlth, X_educag))
## 'data.frame': 491775 obs. of 3 variables:
## $ sleptim1: int NA 6 9 8 6 8 7 6 8 8 ...
## $ genhlth : Factor w/ 5 levels "Excellent","Very good",..: 4 3 3 2 3 2 4 3 1 3 ...
## $ X_educag: Factor w/ 4 levels "Did not graduate high school",..: 4 3 4 2 4 4 2 3 4 2 ...
brfss2013 %>%
  filter(sleptim1>24) %>%
  select(sleptim1)
## sleptim1
## 1 103
## 2 450
Two respondents reported more than 24 hours of sleep in a day, which is impossible, so we filter out these values to get a clean data frame for the visualizations that follow.
rq2_brfss2013 <- brfss2013 %>%
  filter(sleptim1 <= 24)
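One detail worth keeping in mind: dplyr's `filter()` keeps only the rows where the condition is `TRUE`, so rows with a missing `sleptim1` are dropped together with the impossible values. The following is a minimal sketch on a made-up data frame; the values in `toy` are hypothetical and are not taken from BRFSS.

```r
library(dplyr)

# Hypothetical toy data: two plausible values, one missing, one impossible
toy <- data.frame(sleptim1 = c(6, 8, NA, 103))

clean <- toy %>%
  filter(sleptim1 <= 24)   # NA <= 24 evaluates to NA, and filter() drops that row

nrow(clean)                # only the two plausible rows remain
mean(clean$sleptim1)       # no na.rm needed, because the NA row is already gone
```

This is why the later `mean(sleptim1)` call on the filtered data frame can return a number without `na.rm = TRUE`.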
Next we plot the distribution of hours slept in a 24-hour period.
ggplot(rq2_brfss2013,aes(x=as.factor(sleptim1))) + geom_bar() + ggtitle('Amount of Sleep of Respondents') + xlab('Hours Slept') + theme_bw()
As we see, most respondents in our sample sleep 6 to 8 hours.
Next we compute the average amount of sleep.
rq2_brfss2013 %>%
  summarise(avg_sleep = mean(sleptim1))
## avg_sleep
## 1 7.050986
The average amount of sleep in our sample is therefore about 7 hours. What we are interested in, though, is the specific group of respondents who sleep less than 7 hours.
total_obs <- nrow(brfss2013)
brfss2013 %>%
  group_by(X_educag) %>%
  filter(!is.na(X_educag)) %>%   # drop respondents with no education level recorded
  summarise(count = n(), Percentage = n()*100/total_obs)
## # A tibble: 4 × 3
## X_educag count Percentage
## <fct> <int> <dbl>
## 1 Did not graduate high school 42213 8.58
## 2 Graduated high school 142968 29.1
## 3 Attended college or technical school 134196 27.3
## 4 Graduated from college or technical school 170118 34.6
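Because `total_obs` counts every row, the percentages above do not add up to 100: respondents with a missing education level are still in the denominator. As a small worked check, using only the counts printed above, the sketch below compares both denominators. The names `counts`, `pct_all` and `pct_valid` are introduced here for illustration only.

```r
# Counts copied from the education table above
counts <- c(
  "Did not graduate high school"               = 42213,
  "Graduated high school"                      = 142968,
  "Attended college or technical school"       = 134196,
  "Graduated from college or technical school" = 170118
)
total_obs <- 491775

n_missing <- total_obs - sum(counts)      # respondents with no education level recorded
pct_all   <- 100 * counts / total_obs     # denominator includes the missing rows
pct_valid <- 100 * counts / sum(counts)   # denominator excludes them

n_missing
round(cbind(pct_all, pct_valid), 2)
```

Either denominator can be reasonable; what matters is stating which one a percentage refers to.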
total_obs <- nrow(brfss2013)
brfss2013 %>%
  group_by(sleptim1, X_educag) %>%
  filter(sleptim1 < 7) %>%
  summarise(count=n(), Percentage = n()*100/total_obs)
## `summarise()` has grouped output by 'sleptim1'. You can override using the
## `.groups` argument.
## # A tibble: 31 × 4
## # Groups: sleptim1 [7]
## sleptim1 X_educag count Percentage
## <int> <fct> <int> <dbl>
## 1 0 <NA> 1 0.000203
## 2 1 Did not graduate high school 65 0.0132
## 3 1 Graduated high school 76 0.0155
## 4 1 Attended college or technical school 51 0.0104
## 5 1 Graduated from college or technical school 33 0.00671
## 6 1 <NA> 3 0.000610
## 7 2 Did not graduate high school 252 0.0512
## 8 2 Graduated high school 381 0.0775
## 9 2 Attended college or technical school 278 0.0565
## 10 2 Graduated from college or technical school 156 0.0317
## # … with 21 more rows
brfss2013 %>%
  group_by(sleptim1, X_educag,genhlth) %>%
  filter(sleptim1 < 7) %>%
  summarise(count=n(), Percentage = n()*100/total_obs)
## `summarise()` has grouped output by 'sleptim1', 'X_educag'. You can override
## using the `.groups` argument.
## # A tibble: 172 × 5
## # Groups: sleptim1, X_educag [31]
## sleptim1 X_educag genhlth count Percentage
## <int> <fct> <fct> <int> <dbl>
## 1 0 <NA> Excellent 1 0.000203
## 2 1 Did not graduate high school Excellent 8 0.00163
## 3 1 Did not graduate high school Very good 3 0.000610
## 4 1 Did not graduate high school Good 12 0.00244
## 5 1 Did not graduate high school Fair 17 0.00346
## 6 1 Did not graduate high school Poor 25 0.00508
## 7 1 Graduated high school Excellent 8 0.00163
## 8 1 Graduated high school Very good 10 0.00203
## 9 1 Graduated high school Good 23 0.00468
## 10 1 Graduated high school Fair 13 0.00264
## # … with 162 more rows
brfss2013 %>%
  group_by(sleptim1, X_educag,genhlth) %>%
  filter(sleptim1 < 7, genhlth == "Good") %>%
  summarise(count=n(), Percentage = n()*100/total_obs)
## `summarise()` has grouped output by 'sleptim1', 'X_educag'. You can override
## using the `.groups` argument.
## # A tibble: 30 × 5
## # Groups: sleptim1, X_educag [30]
## sleptim1 X_educag genhlth count Percentage
## <int> <fct> <fct> <int> <dbl>
## 1 1 Did not graduate high school Good 12 0.00244
## 2 1 Graduated high school Good 23 0.00468
## 3 1 Attended college or technical school Good 12 0.00244
## 4 1 Graduated from college or technical school Good 10 0.00203
## 5 1 <NA> Good 1 0.000203
## 6 2 Did not graduate high school Good 45 0.00915
## 7 2 Graduated high school Good 91 0.0185
## 8 2 Attended college or technical school Good 53 0.0108
## 9 2 Graduated from college or technical school Good 42 0.00854
## 10 2 <NA> Good 3 0.000610
## # … with 20 more rows
brfss2013 %>%
  group_by(sleptim1, X_educag,genhlth) %>%
  filter(sleptim1 < 7, genhlth == "Good", X_educag != "Did not graduate high school", X_educag != "Attended college or technical school", X_educag != "Graduated high school") %>%
  summarise(count = n(), Percentage = n()*100/total_obs)
## `summarise()` has grouped output by 'sleptim1', 'X_educag'. You can override
## using the `.groups` argument.
## # A tibble: 6 × 5
## # Groups: sleptim1, X_educag [6]
## sleptim1 X_educag genhlth count Percentage
## <int> <fct> <fct> <int> <dbl>
## 1 1 Graduated from college or technical school Good 10 0.00203
## 2 2 Graduated from college or technical school Good 42 0.00854
## 3 3 Graduated from college or technical school Good 136 0.0277
## 4 4 Graduated from college or technical school Good 804 0.163
## 5 5 Graduated from college or technical school Good 2560 0.521
## 6 6 Graduated from college or technical school Good 10002 2.03
sum(0.00203, 0.00854, 0.02765, 0.16349, 0.520256, 2.03386)/6
## [1] 0.4593043
brfss2013 %>%
  group_by(sleptim1, X_educag,genhlth) %>%
  filter(sleptim1 < 7, genhlth == "Good", X_educag != "Did not graduate high school", X_educag != "Attended college or technical school", X_educag != "Graduated from college or technical school") %>%
  summarise(count = n(), Percentage = n()*100/total_obs)
## `summarise()` has grouped output by 'sleptim1', 'X_educag'. You can override
## using the `.groups` argument.
## # A tibble: 6 × 5
## # Groups: sleptim1, X_educag [6]
## sleptim1 X_educag genhlth count Percentage
## <int> <fct> <fct> <int> <dbl>
## 1 1 Graduated high school Good 23 0.00468
## 2 2 Graduated high school Good 91 0.0185
## 3 3 Graduated high school Good 323 0.0657
## 4 4 Graduated high school Good 1554 0.316
## 5 5 Graduated high school Good 3744 0.761
## 6 6 Graduated high school Good 11574 2.35
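The two filters above exclude the unwanted education levels one by one. An alternative, sketched here on a hypothetical toy data frame whose column names mirror the BRFSS variables, is to keep the two groups of interest with `%in%` and summarise both in a single pipeline. The helper names `groups_of_interest`, `total`, `n_good` and `prop_good` are introduced for illustration.

```r
library(dplyr)

# Hypothetical toy respondents; the values are made up for illustration
toy <- data.frame(
  sleptim1 = c(5, 6, 6, 8, 4, 6, 5, 7),
  genhlth  = c("Good", "Good", "Fair", "Good", "Good", "Very good", "Good", "Good"),
  X_educag = c("Graduated high school",
               "Graduated from college or technical school",
               "Graduated high school",
               "Graduated high school",
               "Graduated from college or technical school",
               "Graduated from college or technical school",
               "Did not graduate high school",
               "Graduated high school")
)

groups_of_interest <- c("Graduated high school",
                        "Graduated from college or technical school")

short_sleepers <- toy %>%
  filter(sleptim1 < 7, X_educag %in% groups_of_interest) %>%
  group_by(X_educag) %>%
  summarise(total     = n(),                      # short sleepers in the group
            n_good    = sum(genhlth == "Good"),   # of those, how many report "Good"
            prop_good = n_good / total)

short_sleepers
```

Grouping by education level alone gives one row per group, so the result can be compared directly without adding up the per-hour rows by hand.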
After filtering down to college graduates and high school graduates, we have the respondents who sleep less than 7 hours (our average amount of sleep) and still report good general health. Which group is larger, college graduates or high school graduates?
abc <- c(sum(23,91,323,1554,3744,11574), sum(10,42,136,804,2560,10002))
labels <- c("Graduated High School", "Graduated College")
piepercent <- round(100*abc/491775,1)
pie(abc, main = "College VS Highschool Graduates who sleep less than 7 hours at night",col = rainbow(length(abc)))
legend("topright", c("Graduated High School","Graduated College"), cex = 0.8,
fill = rainbow(length(abc)))
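Therefore, among respondents who sleep less than 7 hours, more high school graduates (17,309) than college graduates (13,554) report "Good" general health. This compares raw counts only and covers only the "Good" level of genhlth, not "Very good" or "Excellent".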
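Since the two education groups differ in size, it can help to put these counts next to the group totals. The sketch below uses only numbers printed earlier in this section. Note the assumption: the denominators are all respondents in each education group, not just the short sleepers, so the result is the share of each whole group that both sleeps less than 7 hours and reports "Good" health.

```r
# Short sleepers (< 7 hours) reporting "Good" health, summed from the tables above
good_short <- c(high_school = sum(23, 91, 323, 1554, 3744, 11574),
                college     = sum(10, 42, 136, 804, 2560, 10002))

# Size of each education group, from the education table earlier in this section
group_size <- c(high_school = 142968, college = 170118)

good_short                                  # raw counts
round(100 * good_short / group_size, 1)     # share of each whole group, in percent
```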
Research question 2: Who is more likely to consume fruits and vegetables one or more times per day: respondents with good or better health, or respondents with fair or poor health?
We’ll be using the following variables for this question:
* X_rfhlth: the respondent’s health status (good or better vs. fair or poor).
* X_frtlt1: whether the respondent consumes fruit one or more times per day.
* X_veglt1: whether the respondent consumes vegetables one or more times per day.
str(select(brfss2013,X_rfhlth,X_frtlt1,X_veglt1))
## 'data.frame': 491775 obs. of 3 variables:
## $ X_rfhlth: Factor w/ 2 levels "Good or Better Health",..: 2 1 1 1 1 1 2 1 1 1 ...
## $ X_frtlt1: Factor w/ 2 levels "Consumed fruit one or more times per day",..: 1 2 2 2 2 1 1 2 1 2 ...
## $ X_veglt1: Factor w/ 2 levels "Consumed vegetables one or more times per day",..: 2 1 1 1 1 1 1 1 1 1 ...
brfss2013 %>%
  group_by(X_rfhlth) %>%
  summarise(count = n(), percentage = n() * 100/total_obs)
## # A tibble: 3 × 3
## X_rfhlth count percentage
## <fct> <int> <dbl>
## 1 Good or Better Health 395109 80.3
## 2 Fair or Poor Health 94677 19.3
## 3 <NA> 1989 0.404
brfss2013 %>%
  group_by(X_rfhlth) %>%
  filter(!is.na(X_rfhlth)) %>%   # drop respondents with no health status recorded
  summarise(count = n(), Percentage = n() *100/total_obs)
## # A tibble: 2 × 3
## X_rfhlth count Percentage
## <fct> <int> <dbl>
## 1 Good or Better Health 395109 80.3
## 2 Fair or Poor Health 94677 19.3
ggplot(rq2_brfss2013,aes(x=X_rfhlth)) + geom_bar() + ggtitle('Respondents') + xlab('Health') + theme_bw()
About 80.3% of the respondents have good or better health, while 19.3%
have fair or poor health.
brfss2013 %>%
  group_by(X_frtlt1) %>%
  summarise(count = n(), percentage = n() * 100/total_obs)
## # A tibble: 3 × 3
## X_frtlt1 count percentage
## <fct> <int> <dbl>
## 1 Consumed fruit one or more times per day 291729 59.3
## 2 Consumed fruit less than one time per day 171343 34.8
## 3 <NA> 28703 5.84
ggplot(rq2_brfss2013,aes(x=X_frtlt1)) + geom_bar() + ggtitle('Fruit Consumption by Respondents') + xlab('Consumption of Fruits') + theme_bw()
About 59.3% of the respondents consume fruit one or more times per day,
while 34.8% consume it less than once per day.
brfss2013 %>%
  group_by(X_veglt1) %>%
  summarise(count = n(), percentage = n() * 100/total_obs)
## # A tibble: 3 × 3
## X_veglt1 count percentage
## <fct> <int> <dbl>
## 1 Consumed vegetables one or more times per day 359834 73.2
## 2 Consumed vegetables less than one time per day 101777 20.7
## 3 <NA> 30164 6.13
ggplot(rq2_brfss2013,aes(x=X_veglt1)) + geom_bar() + ggtitle('Vegetable Consumption by Respondents') + xlab('Consumption of Vegetables') + theme_bw()
About 73.2% of the respondents consume vegetables one or more times per
day, while 20.7% consume them less than once per day.
ct_f <- table(brfss2013$X_rfhlth,brfss2013$X_frtlt1)
prop.table(ct_f,1)
##
## Consumed fruit one or more times per day
## Good or Better Health 0.6467407
## Fair or Poor Health 0.5602353
##
## Consumed fruit less than one time per day
## Good or Better Health 0.3532593
## Fair or Poor Health 0.4397647
mosaicplot(prop.table(ct_f,1), main='Health vs Fruit Consumption', xlab='', ylab='Consumption of Fruits')
The row proportions, also shown in the mosaic plot, indicate that
respondents with good or better health are more likely to eat fruit at
least once per day (64.7%) than respondents with fair or poor health (56.0%).
ct_v <- table(brfss2013$X_rfhlth,brfss2013$X_veglt1)
prop.table(ct_v,1)
##
## Consumed vegetables one or more times per day
## Good or Better Health 0.7994668
## Fair or Poor Health 0.6973174
##
## Consumed vegetables less than one time per day
## Good or Better Health 0.2005332
## Fair or Poor Health 0.3026826
mosaicplot(prop.table(ct_v,1), main='Health vs Vegetable Consumption', xlab='', ylab='Consumption of Vegetables')
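As we can see, the pattern for vegetables mirrors the one for fruit, with
higher proportions overall: 79.9% of respondents with good or better
health eat vegetables at least once per day, compared with 69.7% of those
with fair or poor health. We can conclude that respondents with good or
better health are more likely to consume fruits and vegetables each day.
Because the data are observational, this is an association and does not
show that diet is the reason for their health status.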
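To make the comparison concrete, the gap between the two health groups can be expressed in percentage points. This small sketch retypes the row proportions from the two `prop.table()` outputs above; the object names `daily` and `gap` are introduced here for illustration.

```r
# Row proportions copied from the two prop.table() outputs above:
# share of each health group consuming the item one or more times per day
daily <- matrix(
  c(0.6467407, 0.7994668,    # Good or Better Health: fruit, vegetables
    0.5602353, 0.6973174),   # Fair or Poor Health:   fruit, vegetables
  nrow = 2, byrow = TRUE,
  dimnames = list(c("Good or Better Health", "Fair or Poor Health"),
                  c("fruit", "vegetables"))
)

# Gap between the two health groups, in percentage points
gap <- 100 * (daily["Good or Better Health", ] - daily["Fair or Poor Health", ])
round(gap, 1)
```

The gap is somewhat larger for vegetables than for fruit, though both point in the same direction.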
Research question 3: Of the two most common bad habits, which group is more likely to report good or better health: respondents who drink alcohol or respondents who smoke?
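We’ll be using the following variables for this question:
* X_rfhlth: the respondent’s health status (good or better vs. fair or poor).
* X_rfsmok3: whether the respondent is a current smoker.
* X_rfchol: treated in this analysis as an indicator of whether the respondent currently drinks alcohol. Note that the BRFSS codebook defines this calculated variable as a high-cholesterol indicator, so the alcohol interpretation should be checked against the codebook.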
str(select(brfss2013,X_rfhlth,X_rfsmok3,X_rfchol))
## 'data.frame': 491775 obs. of 3 variables:
## $ X_rfhlth : Factor w/ 2 levels "Good or Better Health",..: 2 1 1 1 1 1 2 1 1 1 ...
## $ X_rfsmok3: Factor w/ 2 levels "No","Yes": 1 1 2 1 1 1 1 2 1 1 ...
## $ X_rfchol : Factor w/ 2 levels "No","Yes": 2 1 1 2 1 2 1 2 2 1 ...
Before asking whether smoking or drinking goes together with a poorer health status, let’s first check our selected variables one by one.
total_obs <- nrow(brfss2013)
brfss2013 %>%
  group_by(X_rfhlth) %>%
  summarise(count = n(), percentage = n()*100/total_obs)
## # A tibble: 3 × 3
## X_rfhlth count percentage
## <fct> <int> <dbl>
## 1 Good or Better Health 395109 80.3
## 2 Fair or Poor Health 94677 19.3
## 3 <NA> 1989 0.404
ggplot(brfss2013, aes(x=X_rfhlth)) + geom_bar() + ggtitle('Health Status of Respondents') + xlab('Health Status') + theme_bw()
According to our data, just over 80% of the respondents have good or
better health.
brfss2013 %>%
  group_by(X_rfsmok3) %>%
  summarise(count = n(), percentage = n()*100/total_obs)
## # A tibble: 3 × 3
## X_rfsmok3 count percentage
## <fct> <int> <dbl>
## 1 No 399786 81.3
## 2 Yes 76654 15.6
## 3 <NA> 15335 3.12
ggplot(brfss2013, aes(x=X_rfsmok3)) + geom_bar() + ggtitle('Smoking Status of Respondents') + xlab('Currently a Smoker?') + theme_bw()
About 81% of the respondents are not current smokers, although some of
them may have smoked in the past.
brfss2013 %>%
  group_by(X_rfchol) %>%
  summarise(count = n(), percentage = n()*100/total_obs)
## # A tibble: 3 × 3
## X_rfchol count percentage
## <fct> <int> <dbl>
## 1 No 236614 48.1
## 2 Yes 183497 37.3
## 3 <NA> 71664 14.6
ggplot(brfss2013, aes(x=X_rfchol)) + geom_bar() + ggtitle('Alcohol Status of Respondents') + xlab('Currently an Alcoholic?') + theme_bw()
About 37% of the respondents are coded ‘Yes’ for drinking alcohol. Now we
need to create a new variable that categorizes each respondent as
‘Smoker’, ‘Alcoholic’, ‘Both’, or ‘None’.
brfss2013 <- brfss2013 %>%
  mutate(smoke_alc = ifelse(X_rfchol == 'Yes',
                            ifelse(X_rfsmok3 == 'Yes','Both','Alcoholic'),
                            ifelse(X_rfsmok3 == 'Yes','Smoker','None')))
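The nested `ifelse()` above can be hard to read at a glance. As an alternative sketch, not the approach used in this analysis, the same four categories can be written with dplyr's `case_when()`, one condition per line. The toy data frame below is hypothetical and only serves to cover every combination, including missing answers.

```r
library(dplyr)

# Hypothetical toy data covering every combination, including missing answers
toy <- data.frame(
  X_rfchol  = c("Yes", "Yes", "No", "No", NA, "Yes"),
  X_rfsmok3 = c("Yes", "No", "Yes", "No", "Yes", NA)
)

toy <- toy %>%
  mutate(smoke_alc = case_when(
    X_rfchol == "Yes" & X_rfsmok3 == "Yes" ~ "Both",
    X_rfchol == "Yes" & X_rfsmok3 == "No"  ~ "Alcoholic",
    X_rfchol == "No"  & X_rfsmok3 == "Yes" ~ "Smoker",
    X_rfchol == "No"  & X_rfsmok3 == "No"  ~ "None"
    # no catch-all branch: rows with a missing answer match nothing and stay NA
  ))

toy
```

As with the nested `ifelse()`, a respondent with a missing value in either variable ends up with a missing `smoke_alc`, which is where the `<NA>` row in the distribution comes from.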
For the distribution of our new variable:
brfss2013 %>%
  group_by(smoke_alc) %>%
  summarise(count = n(), percentage = n()*100/total_obs)
## # A tibble: 5 × 3
## smoke_alc count percentage
## <chr> <int> <dbl>
## 1 Alcoholic 152508 31.0
## 2 Both 26126 5.31
## 3 None 195855 39.8
## 4 Smoker 33154 6.74
## 5 <NA> 84132 17.1
ggplot(brfss2013,aes(x=smoke_alc)) + geom_bar() + ggtitle('Being an Alcoholic and Smoker Habits of Respondents') + xlab('Alcoholic or Smoker?') +theme_bw()
About 40% of the respondents neither smoke nor drink alcohol. About 7%
only smoke, about 31% only drink alcohol, and about 5% do both. The
remaining 17% have a missing value for at least one of the two variables.
rq1_table <- table(brfss2013$smoke_alc,brfss2013$X_rfhlth)
rq1_table
##
## Good or Better Health Fair or Poor Health
## Alcoholic 114718 37198
## Both 15615 10382
## None 169910 25264
## Smoker 25385 7650
The table shows the number of respondents in each habit group by health status; in every group, most respondents still have good or better health. Now let’s calculate the proportions within each group.
prop.table(rq1_table, 1)
##
## Good or Better Health Fair or Poor Health
## Alcoholic 0.7551410 0.2448590
## Both 0.6006462 0.3993538
## None 0.8705565 0.1294435
## Smoker 0.7684274 0.2315726
mosaicplot(prop.table(rq1_table, 1),main='Alcoholic and/or Smoker vs Health', xlab='Alcoholic and/or Smoking status', ylab='Poor Health')
Comparing the ‘Alcoholic’ and ‘Smoker’ groups, we see that they are indeed
very close to each other, but smokers have a slightly higher proportion
with good or better health (76.8%) than drinkers (75.5%). Respondents with
neither habit do best (87.1%), and those with both do worst (60.1%).
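To see how small the difference between the two groups is, the counts from `rq1_table` can be turned into percentage-point gaps. This sketch retypes the counts printed above into a matrix, so it runs without the BRFSS data; the names `rq1_counts` and `good_share` are introduced for illustration.

```r
# Counts copied from rq1_table above
rq1_counts <- matrix(
  c(114718, 37198,
     15615, 10382,
    169910, 25264,
     25385,  7650),
  ncol = 2, byrow = TRUE,
  dimnames = list(c("Alcoholic", "Both", "None", "Smoker"),
                  c("Good or Better Health", "Fair or Poor Health"))
)

# Share of each habit group with good or better health
good_share <- prop.table(rq1_counts, 1)[, "Good or Better Health"]

# Differences in percentage points
round(100 * (good_share["Smoker"] - good_share["Alcoholic"]), 1)
round(100 * (good_share["None"]   - good_share["Both"]), 1)
```

The gap between the two single-habit groups is far smaller than the gap between respondents with neither habit and respondents with both.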