Setup

Load packages

library(ggplot2)
library(dplyr)

Load data

load("C:/Users/BackUp/Desktop/brfss2013 data.RData")
dim(brfss2013)
## [1] 491775    330

Part 1: Data

BRFSS stands for the Behavioral Risk Factor Surveillance System, a health-related survey that collects data from US residents about their health conditions and health-related behaviors. The BRFSS is run by the Centers for Disease Control and Prevention (CDC) and conducted by individual state health departments. Data collection: The observations in the sample are collected through both landline and cell phone surveys. (N = 491,775 observations and 330 variables)

Scope of inference/generalizability: The Behavioral Risk Factor Surveillance System (BRFSS) project collects data from the adult population (aged 18 years and older) residing in the United States within the 50 states. Causality: No causal conclusions can be drawn, because this is a survey, that is, an observational study.


Part 2: Research questions

Research question 1: I want to compare the average number of drinks per day between the sexes.
How does the distribution of the average number of drinks per day in the past 30 days differ by sex?

Variables used:
sex: Respondent's sex (1 = Male, 2 = Female)
avedrnk2: Average alcoholic drinks per day in past 30 days

This question is of interest to me because I want to see whether males and females show different patterns in alcoholic drinks per day.

Research question 2:

Investigate counts across three categorical variables. How do males and females differ in terms of smoking frequency and self-reported general health?

Variables used:
sex: Respondent's sex
smokday2: Frequency of days now smoking (Every day, Some days, Not at all)
genhlth: Self-rated general health (Excellent, Very good, Good, Fair, Poor)

This question is of interest to me because I want to see whether people who reported smoking more often also reported worse general health.

Research question 3: Compare people across the tetanus shot categories (received Tdap; received tetanus shot, but not Tdap; received tetanus shot, but not sure what type; did not receive a tetanus shot). How do these groups differ in hours of sleep per day and alcoholic drinks per day in the past 30 days?

Variables used:
tetanus: Received tetanus shot since 2005, 4 categories (Yes, received Tdap; Yes, received tetanus shot, but not Tdap; Yes, received tetanus shot but not sure what type; No, did not receive any tetanus shot)
sleptim1: Hours of sleep per day (24-hour period)
avedrnk2: Average alcoholic drinks per day in past 30 days

This question is of interest to me because it looks at three variables that should be independent of one another, and I want to see whether any insights or patterns emerge.


Part 3: Exploratory data analysis

Research question 1:

ggplot(data = brfss2013, aes(x = avedrnk2, fill = sex)) +
  geom_bar(position = "dodge") +
  xlim(0, 20) +
  xlab("Average amount of drinks per day in past 30 days") +
  ggtitle("Distribution of average drinks per day in past 30 days by sex")
## Warning: Removed 260978 rows containing non-finite values (stat_count).
## Warning: Removed 1 rows containing missing values (geom_bar).

This graph shows the distribution of the average number of drinks per day in the past 30 days, color coded by sex. At 1 and 2 drinks, the female count is higher than the male count. From 3 drinks upward, however, the male count is higher than the female count.
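Because the two sexes are not equally represented in the sample, raw counts can be hard to compare directly. The following is a minimal sketch of converting counts to within-sex proportions. The `toy` data frame and the `drink_share` name are hypothetical stand-ins with made-up values, not BRFSS data.

```r
library(dplyr)

# Hypothetical toy data standing in for brfss2013 (values are made up)
toy <- data.frame(
  sex = factor(c("Male", "Male", "Male", "Male",
                 "Female", "Female", "Female", "Female", "Female", "Female")),
  avedrnk2 = c(1L, 2L, 3L, NA, 1L, 1L, 1L, 2L, 2L, 3L)
)

drink_share <- toy %>%
  filter(!is.na(avedrnk2), !is.na(sex)) %>%  # drop missing values first
  count(sex, avedrnk2) %>%                   # one row per sex x drinks combination
  group_by(sex) %>%
  mutate(prop = n / sum(n)) %>%              # share within each sex
  ungroup()

print(drink_share)
```

Applied to the real data, the same pipeline would show what share of each sex reports 1, 2, 3, ... drinks, which removes the effect of unequal group sizes.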

brfss2013%>%
  group_by(avedrnk2,sex)%>%
  filter(!is.na(avedrnk2), !is.na(sex))%>%
  summarise(n=n())
## Source: local data frame [79 x 3]
## Groups: avedrnk2 [?]
## 
##    avedrnk2    sex     n
##       <int> <fctr> <int>
## 1         1   Male 38881
## 2         1 Female 67072
## 3         2   Male 34308
## 4         2 Female 35018
## 5         3   Male 15869
## 6         3 Female 10231
## 7         4   Male  7388
## 8         4 Female  3756
## 9         5   Male  4061
## 10        5 Female  1987
## # ... with 69 more rows

The table above breaks down the counts by sex for the average number of drinks per day in the past 30 days. It shows that females lead in counts at 1 and 2 drinks, and males lead in counts at 3 drinks and more.

brfss2013%>%
  group_by(sex)%>%
  filter(!is.na(avedrnk2), !is.na(sex))%>%
  summarise(mean_drnk = mean(avedrnk2))
## # A tibble: 2 x 2
##      sex mean_drnk
##   <fctr>     <dbl>
## 1   Male  2.654153
## 2 Female  1.805249

In fact, on average, males report 2.65 drinks per day in the past 30 days versus 1.81 drinks per day for females.
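A mean can be pulled upward by a few very large values, so it may help to report the median and the group size next to it. This is a minimal sketch on a hypothetical `toy` data frame with made-up values; only the column names `sex` and `avedrnk2` come from the analysis above.

```r
library(dplyr)

# Hypothetical toy data (made-up values); one large value in the Male group
toy <- data.frame(
  sex = factor(c("Male", "Male", "Male", "Female", "Female", "Female")),
  avedrnk2 = c(1, 2, 12, 1, 1, 4)
)

drink_summary <- toy %>%
  filter(!is.na(avedrnk2), !is.na(sex)) %>%
  group_by(sex) %>%
  summarise(
    n = n(),                          # respondents per group
    mean_drnk = mean(avedrnk2),       # sensitive to extreme values
    median_drnk = median(avedrnk2)    # robust to extreme values
  )

print(drink_summary)
```

In this toy example the single value of 12 moves the male mean well above the male median, which is the kind of skew the summary is meant to reveal.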

Research question 2:

# Clean the data and remove NAs

brfss2013cleandata<- brfss2013 %>%
  filter(!is.na(smokday2),!is.na(genhlth),!is.na(sex))%>%
  select(sex,smokday2,genhlth)

# Distribution of reported general health, broken down by smoking frequency and sex

ggplot(brfss2013cleandata, aes(x = smokday2, fill = genhlth)) +
  geom_bar(position = "dodge") +
  facet_wrap(~sex, ncol = 4)

In this graph, I first look at the male group. From left to right, males who smoke every day, males who smoke some days, and males who do not smoke at all each most often reported Good general health. Among females, those who smoke every day and those who smoke some days also most often reported Good general health, while those who do not smoke at all most often reported Very good general health.

brfss2013cleandata%>%
  group_by(sex,smokday2,genhlth)%>%
  summarise(count=n())
## Source: local data frame [30 x 4]
## Groups: sex, smokday2 [?]
## 
##       sex  smokday2   genhlth count
##    <fctr>    <fctr>    <fctr> <int>
## 1    Male Every day Excellent  2526
## 2    Male Every day Very good  6266
## 3    Male Every day      Good  9152
## 4    Male Every day      Fair  4733
## 5    Male Every day      Poor  2296
## 6    Male Some days Excellent  1312
## 7    Male Some days Very good  2761
## 8    Male Some days      Good  3214
## 9    Male Some days      Fair  1540
## 10   Male Some days      Poor   784
## # ... with 20 more rows

Above are the counts of general health ratings by smoking frequency (smokday2) and sex. They support a similar conclusion: whether a respondent smokes every day, some days, or not at all, most respondents of both sexes reported "Very good" or "Good" general health. No statistical test was run here, so this is a descriptive observation only.
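Since the smoking groups differ in size, comparing shares within each group can be more informative than comparing counts. The sketch below re-enters only the ten rows printed above (male respondents, "Every day" and "Some days") and computes the within-group share of each health rating. The names `shown_counts`, `health_share`, and `fair_poor` are hypothetical, and the result covers only those ten rows, not the full table.

```r
library(dplyr)

# The ten rows shown in the output above, typed in by hand
health_levels <- c("Excellent", "Very good", "Good", "Fair", "Poor")
shown_counts <- data.frame(
  sex = "Male",
  smokday2 = rep(c("Every day", "Some days"), each = 5),
  genhlth = factor(rep(health_levels, times = 2), levels = health_levels),
  count = c(2526, 6266, 9152, 4733, 2296,
            1312, 2761, 3214, 1540, 784)
)

# Share of each health rating within a sex x smoking group
health_share <- shown_counts %>%
  group_by(sex, smokday2) %>%
  mutate(prop = count / sum(count)) %>%
  ungroup()

# Combined share reporting Fair or Poor health in each group
fair_poor <- health_share %>%
  filter(genhlth %in% c("Fair", "Poor")) %>%
  group_by(sex, smokday2) %>%
  summarise(fair_or_poor = sum(prop), .groups = "drop")

print(fair_poor)
```

Running the same steps on `brfss2013cleandata` (starting from its `summarise(count = n())` result) would extend the comparison to all 30 rows, including the "Not at all" group and female respondents.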

Research question 3:

# Clean data: remove NAs (note: renthom1 is also filtered here, although it is not used in the analysis below)
brfss2013cleandata2<- brfss2013 %>%
  filter(!is.na(renthom1),!is.na(tetanus),!is.na(sleptim1),!is.na(avedrnk2))


# Plot hours of sleep against average alcoholic drinks per day in past 30 days, by tetanus (received tetanus shot since 2005)

ggplot(brfss2013cleandata2, aes(x = avedrnk2, y = sleptim1, color = tetanus)) +
  geom_point(alpha = 0.3) +
  facet_grid(~ tetanus) +
  xlim(0, 20)
## Warning: Removed 333 rows containing missing values (geom_point).

Across the four tetanus categories, the distributions appear similar. There are no visible differences between the panels. Most points fall at 6 to 7 hours of sleep and about 2 alcoholic drinks per day in the past 30 days. Sleep and drinking patterns do not appear to differ much by tetanus shot status.
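Both variables in the scatter plot take whole-number values, so many respondents land on exactly the same point, and `xlim(0, 20)` drops rows outside the range (hence the warning above). One possible alternative, sketched here on a hypothetical `toy` data frame with made-up values, is to size points by how many observations they represent with `geom_count()` and to zoom with `coord_cartesian()`, which does not remove rows.

```r
library(ggplot2)
library(dplyr)

# Hypothetical toy data (made-up values) with the same column names
toy <- data.frame(
  tetanus = factor(c("Yes, received Tdap", "Yes, received Tdap",
                     "Yes, received Tdap",
                     "No, did not receive any tetanus since 2005",
                     "No, did not receive any tetanus since 2005")),
  avedrnk2 = c(2, 2, 1, 2, 3),
  sleptim1 = c(7, 7, 6, 7, 8)
)

# Tabulate how many respondents share each (tetanus, drinks, sleep) combination
combo_counts <- toy %>%
  count(tetanus, avedrnk2, sleptim1)
print(combo_counts)

# Point size reflects the number of overlapping observations
p <- ggplot(toy, aes(x = avedrnk2, y = sleptim1)) +
  geom_count(alpha = 0.5) +
  facet_grid(~ tetanus) +
  coord_cartesian(xlim = c(0, 20))  # zooms without dropping rows
```

Printing `p` draws the plot. The `combo_counts` table gives the same information numerically, which makes it easier to check a statement such as "most points are at 6 to 7 hours and 2 drinks".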

# Average alcoholic drinks per day by tetanus category
brfss2013cleandata2%>%
  group_by(tetanus)%>%
  summarise(mean_avedrnk2 = mean(avedrnk2))
## # A tibble: 4 x 2
##                                             tetanus mean_avedrnk2
##                                              <fctr>         <dbl>
## 1                                Yes, received Tdap      2.081844
## 2          Yes, received tetanus shot, but not Tdap      2.136087
## 3 Yes, received tetanus shot but not sure what type      2.299207
## 4        No, did not receive any tetanus since 2005      2.220108

The average number of alcoholic drinks per day is about 2 (between 2.08 and 2.30) in all 4 tetanus categories, whether or not a Tdap shot was received.

# Average sleep time (hours) by tetanus category and average drinks per day

brfss2013cleandata2%>%
  group_by(tetanus, avedrnk2)%>%
  summarise(mean_sleep = mean(sleptim1))
## Source: local data frame [141 x 3]
## Groups: tetanus [?]
## 
##               tetanus avedrnk2 mean_sleep
##                <fctr>    <int>      <dbl>
## 1  Yes, received Tdap        1   7.098516
## 2  Yes, received Tdap        2   7.050672
## 3  Yes, received Tdap        3   6.943933
## 4  Yes, received Tdap        4   6.926234
## 5  Yes, received Tdap        5   6.957240
## 6  Yes, received Tdap        6   6.823745
## 7  Yes, received Tdap        7   6.796178
## 8  Yes, received Tdap        8   7.038136
## 9  Yes, received Tdap        9   7.074074
## 10 Yes, received Tdap       10   6.600858
## # ... with 131 more rows

Average sleep time is roughly between 6 and 7 hours in all 4 tetanus categories, whether or not a Tdap shot was received. In the rows shown above for the Tdap group, it ranges from 6.60 to 7.10 hours.
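The table above groups by both `tetanus` and `avedrnk2`, which yields 141 rows. To compare the four tetanus categories directly, one option is to group by `tetanus` alone. This is a minimal sketch on a hypothetical `toy` data frame with made-up values; `sleep_by_tetanus` is an illustrative name.

```r
library(dplyr)

# Hypothetical toy data (made-up values) with the same column names
toy <- data.frame(
  tetanus = c("Yes, received Tdap", "Yes, received Tdap",
              "No, did not receive any tetanus since 2005",
              "No, did not receive any tetanus since 2005"),
  sleptim1 = c(7, 6, 8, 7)
)

# One row per tetanus category
sleep_by_tetanus <- toy %>%
  group_by(tetanus) %>%
  summarise(
    n = n(),                       # respondents per category
    mean_sleep = mean(sleptim1),   # average hours of sleep
    sd_sleep = sd(sleptim1)        # spread within the category
  )

print(sleep_by_tetanus)
```

Replacing `toy` with `brfss2013cleandata2` would give a four-row table that can be read alongside the mean drinks table for the same categories.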