library(ggplot2)
library(dplyr)
load("C:/Users/BackUp/Desktop/brfss2013 data.RData")
dim(brfss2013)
## [1] 491775 330
BRFSS stands for the Behavioral Risk Factor Surveillance System, a health-related survey that collects data from US residents about their health conditions and health-related behaviors. The BRFSS is run by the CDC (Centers for Disease Control and Prevention) and conducted by individual state health departments. Data Collection: The observations in the sample are collected through both landline and cell phone surveys. (N = 491,775 and variables = 330)
Scope of Inference/Generalizability: The Behavioral Risk Factor Surveillance System (BRFSS) project collects data from the adult population, aged 18 years and older, residing in the United States within the 50 states.
Causality: No causal conclusions can be drawn, because this is a survey, which makes it an observational study.
Research question 1: I want to explore the average number of alcoholic drinks per day by sex.
How does the distribution of the average number of drinks per day in the past 30 days differ between the sexes?
Variables used:
sex: Respondent's sex (Value 1 = Male, Value 2 = Female)
avedrnk2: Average alcoholic drinks per day in past 30 days
This question is of interest to me because I want to see whether males and females show different patterns in alcoholic drinks per day.
Research question 2:
Investigate counts of three different categorical variables. How do males and females differ in terms of smoking frequency and self-reported general health?
Variables used:
sex: Respondent's sex
smokday2: Frequency of days now smoking (Every day, Some days, Not at all)
genhlth: Self-rated general health (Excellent, Very good, Good, Fair, Poor)
This question is of interest to me because I want to know whether people who reported smoking more also reported worse general health.
Research question 3: Investigate people in the different tetanus shot categories (received Tdap, received tetanus but not Tdap, received tetanus but not sure what type, did not receive a tetanus shot) and how these categories relate to hours of sleep per day and alcoholic drinks per day in the past 30 days.
Variables used:
tetanus: Received tetanus shot since 2005, 4 categories (Yes, received Tdap; Yes, received tetanus shot, but not Tdap; Yes, received tetanus shot but not sure what type; No, did not receive any tetanus since 2005)
sleptim1: Hours of sleep per day (24-hour period)
avedrnk2: Average alcoholic drinks per day in past 30 days
This question is of interest to me because it looks at three different variables that should be independent of one another, and I want to see whether any insights or patterns emerge.
Research question 1:
ggplot(data = brfss2013, aes(x = avedrnk2, fill = sex)) +
  geom_bar(position = "dodge") +
  xlim(0, 20) +
  xlab("Average amount of drinks per day in past 30 days") +
  ggtitle("Distribution of average drinks per day in past 30 days by sex")
## Warning: Removed 260978 rows containing non-finite values (stat_count).
## Warning: Removed 1 rows containing missing values (geom_bar).
This graph shows the distribution of the average number of drinks per day in the past 30 days, color coded by sex. At 1 and 2 drinks, the female count is higher than the male count. From 3 drinks upward, however, the male count is higher than the female count.
brfss2013 %>%
  group_by(avedrnk2, sex) %>%
  filter(!is.na(avedrnk2), !is.na(sex)) %>%
  summarise(n = n())
## Source: local data frame [79 x 3]
## Groups: avedrnk2 [?]
##
## avedrnk2 sex n
## <int> <fctr> <int>
## 1 1 Male 38881
## 2 1 Female 67072
## 3 2 Male 34308
## 4 2 Female 35018
## 5 3 Male 15869
## 6 3 Female 10231
## 7 4 Male 7388
## 8 4 Female 3756
## 9 5 Male 4061
## 10 5 Female 1987
## # ... with 69 more rows
The table above gives the same breakdown as counts by sex for the average number of drinks per day in the past 30 days: females have the higher counts at 1 and 2 drinks, and males have the higher counts at 3 drinks and more.
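Because the two sexes contribute different numbers of respondents, raw counts can be hard to compare directly. The sketch below is one way to convert the counts into within-sex shares. It uses only the ten rows printed above (1 to 5 drinks), so the shares are relative to those rows and not to the full sample; the names `drink_counts` and `drink_share` are introduced here for illustration.

```r
library(dplyr)

# Counts copied from the first ten rows of the table above (1 to 5 drinks only).
drink_counts <- data.frame(
  avedrnk2 = rep(1:5, each = 2),
  sex      = rep(c("Male", "Female"), times = 5),
  n        = c(38881, 67072, 34308, 35018, 15869, 10231, 7388, 3756, 4061, 1987)
)

# Share of each sex's respondents (within these rows) at each drink level.
drink_share <- drink_counts %>%
  group_by(sex) %>%
  mutate(share = n / sum(n)) %>%
  ungroup()

print(drink_share)
```

On the full data, the same `group_by(sex)` and `mutate(share = n / sum(n))` steps could be applied to the complete summary table instead of the ten rows typed in here.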
brfss2013 %>%
  group_by(sex) %>%
  filter(!is.na(avedrnk2), !is.na(sex)) %>%
  summarise(mean_drnk = mean(avedrnk2))
## # A tibble: 2 x 2
## sex mean_drnk
## <fctr> <dbl>
## 1 Male 2.654153
## 2 Female 1.805249
In fact, on average, males have 2.65 drinks per day in the past 30 days versus 1.81 drinks per day for females.
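The group means depend on dropping missing values before averaging. As a small sketch of that step, the example below uses a hypothetical toy data frame (`toy` is not part of the BRFSS data) to show that filtering out `NA` values first and passing `na.rm = TRUE` to `mean()` give the same group means.

```r
library(dplyr)

# Hypothetical toy data with missing drink values, for illustration only.
toy <- data.frame(
  sex      = c("Male", "Male", "Male", "Female", "Female", "Female"),
  avedrnk2 = c(2, NA, 4, 1, 3, NA)
)

# Approach used in this report: drop rows with NA, then average.
by_filter <- toy %>%
  filter(!is.na(avedrnk2), !is.na(sex)) %>%
  group_by(sex) %>%
  summarise(mean_drnk = mean(avedrnk2))

# Alternative: keep the rows and let mean() skip the NA values.
by_narm <- toy %>%
  filter(!is.na(sex)) %>%
  group_by(sex) %>%
  summarise(mean_drnk = mean(avedrnk2, na.rm = TRUE))

print(by_filter)
print(by_narm)
```

Without either step, `mean()` returns `NA` for any group that contains a missing value.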
Research question 2:
#clean the data and get rid of NAs#
brfss2013cleandata<- brfss2013 %>%
filter(!is.na(smokday2),!is.na(genhlth),!is.na(sex))%>%
select(sex,smokday2,genhlth)
# Distribution of general health reported, broken down by smokday2 and sex #
ggplot(brfss2013cleandata, aes(x = smokday2, fill = genhlth)) +
  geom_bar(position = "dodge") +
  facet_wrap(~sex, ncol = 4)
In this graph, I first look at the male group. From left to right, male every-day smokers, some-days smokers, and those who do not smoke at all each most often reported Good general health. Among females, every-day smokers and some-days smokers most often reported Good general health, while those who do not smoke at all most often reported Very good general health.
brfss2013cleandata %>%
  group_by(sex, smokday2, genhlth) %>%
  summarise(count = n())
## Source: local data frame [30 x 4]
## Groups: sex, smokday2 [?]
##
## sex smokday2 genhlth count
## <fctr> <fctr> <fctr> <int>
## 1 Male Every day Excellent 2526
## 2 Male Every day Very good 6266
## 3 Male Every day Good 9152
## 4 Male Every day Fair 4733
## 5 Male Every day Poor 2296
## 6 Male Some days Excellent 1312
## 7 Male Some days Very good 2761
## 8 Male Some days Good 3214
## 9 Male Some days Fair 1540
## 10 Male Some days Poor 784
## # ... with 20 more rows
Above are the counts of general health by smokday2 and sex. They point to a similar conclusion: whether a respondent smokes every day, some days, or not at all, for both sexes most people reported “Very good” or “Good” general health. This is a descriptive observation from counts only, since no significance test was run.
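Since the smoking groups differ in size, one way to compare them is by the share of each group reporting Fair or Poor health. The sketch below does this using only the ten rows printed above (male every-day and some-days smokers); the names `health_counts` and `fair_poor_share` are introduced here for illustration, and the remaining 20 rows of the table are not included.

```r
library(dplyr)

# Counts copied from the ten rows shown above (males only).
health_counts <- data.frame(
  smokday2 = rep(c("Every day", "Some days"), each = 5),
  genhlth  = rep(c("Excellent", "Very good", "Good", "Fair", "Poor"), times = 2),
  count    = c(2526, 6266, 9152, 4733, 2296, 1312, 2761, 3214, 1540, 784)
)

# Share of each smoking group that reported Fair or Poor general health.
fair_poor_share <- health_counts %>%
  group_by(smokday2) %>%
  summarise(fair_or_poor = sum(count[genhlth %in% c("Fair", "Poor")]) / sum(count))

print(fair_poor_share)
```

Applying the same summary to the full `brfss2013cleandata` counts, grouped by `sex` and `smokday2`, would extend the comparison to non-smokers and to females.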
Research question 3:
# Clean data, get rid of NA #
brfss2013cleandata2<- brfss2013 %>%
filter(!is.na(renthom1),!is.na(tetanus),!is.na(sleptim1),!is.na(avedrnk2))
# Plot Sleep time hours x Avg Alcoholic Drinks Per Day In Past 30 by tetanus: Received tetanus shot since 2005
ggplot(brfss2013cleandata2, aes(x = avedrnk2, y = sleptim1, color = tetanus)) +
  geom_point(alpha = 0.3) +
  facet_grid(~ tetanus) +
  xlim(0, 20)
## Warning: Removed 333 rows containing missing values (geom_point).
Across the four tetanus categories, the distributions look similar, with no visible differences between the panels. Most points fall at 6 to 7 hours of sleep and about 2 alcoholic drinks per day in the past 30 days. Receiving a tetanus shot or not does not appear to make much of a difference.
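Both variables are whole numbers, so many respondents land on exactly the same point and a plain scatterplot hides how many rows sit at each position. As a sketch of one possible alternative (not the plot used in this report), `geom_count()` sizes each point by the number of rows at that position. The data frame `toy` below is hypothetical and only stands in for the survey data.

```r
library(ggplot2)

# Hypothetical toy data: whole-number drinks and sleep hours, for illustration only.
toy <- data.frame(
  avedrnk2 = c(1, 1, 1, 2, 2, 2, 2, 3, 3, 5),
  sleptim1 = c(7, 7, 6, 7, 7, 7, 6, 6, 8, 5)
)

# geom_count() draws one point per distinct (x, y) pair, sized by the number of rows there.
p <- ggplot(toy, aes(x = avedrnk2, y = sleptim1)) +
  geom_count(alpha = 0.6) +
  xlab("Average drinks per day in past 30 days") +
  ylab("Hours of sleep per day")

print(p)
```

On the real data, the same layer could replace `geom_point(alpha = 0.3)` while keeping the `facet_grid(~ tetanus)` panels.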
# average alcoholic drinks per day by tetanus category
brfss2013cleandata2 %>%
  group_by(tetanus) %>%
  summarise(mean_avedrnk2 = mean(avedrnk2))
## # A tibble: 4 x 2
## tetanus mean_avedrnk2
## <fctr> <dbl>
## 1 Yes, received Tdap 2.081844
## 2 Yes, received tetanus shot, but not Tdap 2.136087
## 3 Yes, received tetanus shot but not sure what type 2.299207
## 4 No, did not receive any tetanus since 2005 2.220108
The average number of alcoholic drinks per day is about 2 (between 2.08 and 2.30) in all 4 tetanus categories, whether or not a Tdap shot was received.
# average sleep time (hours) by tetanus category and drinks per day
brfss2013cleandata2 %>%
  group_by(tetanus, avedrnk2) %>%
  summarise(mean_sleep = mean(sleptim1))
## Source: local data frame [141 x 3]
## Groups: tetanus [?]
##
## tetanus avedrnk2 mean_sleep
## <fctr> <int> <dbl>
## 1 Yes, received Tdap 1 7.098516
## 2 Yes, received Tdap 2 7.050672
## 3 Yes, received Tdap 3 6.943933
## 4 Yes, received Tdap 4 6.926234
## 5 Yes, received Tdap 5 6.957240
## 6 Yes, received Tdap 6 6.823745
## 7 Yes, received Tdap 7 6.796178
## 8 Yes, received Tdap 8 7.038136
## 9 Yes, received Tdap 9 7.074074
## 10 Yes, received Tdap 10 6.600858
## # ... with 131 more rows
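Average sleep time stays close to 7 hours per day (about 6.6 to 7.1 hours in the rows shown) regardless of the number of drinks per day, and this holds in all 4 tetanus categories, whether or not a Tdap shot was received.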