Exploring the BRFSS data

Setup

Load packages

Load the ggplot2 package, which creates elegant data visualizations, and the dplyr package, which provides a flexible grammar for data manipulation.

library(ggplot2)

## Warning: package 'ggplot2' was built under R version 3.3.2

library(dplyr)

Load data

Because the data file is saved in the same directory as the R Markdown file, load() can restore the R objects from it directly. Make sure your data and R Markdown files are in the same directory.

load("brfss2013.RData")
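As a minimal sketch of what load() does, the example below writes a small stand-in file to a temporary directory and restores it. The object name toy and the file name toy_example.RData are hypothetical and only serve the illustration; they are not part of the BRFSS data.

```r
# Create a small stand-in .RData file in a temporary location (hypothetical data)
toy <- data.frame(id = 1:3, answer = factor(c("Yes", "No", NA)))
rdata_path <- file.path(tempdir(), "toy_example.RData")
save(toy, file = rdata_path)
rm(toy)

# Fail early with a readable message if the file is not where we expect it
if (!file.exists(rdata_path)) stop("Cannot find ", rdata_path)

# load() restores the saved objects and invisibly returns their names
loaded <- load(rdata_path)
print(loaded)
```

Checking the returned names is a quick way to confirm which objects a .RData file has added to the workspace.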

Summary of the data

First, dim() retrieves the dimensions of an object, which gives the overall size of the dataset: the number of observations (rows) and the number of variables (columns), in that order. Although the BRFSS codebook describes each variable, a summary of the dataset itself gives a better feel for the data. str() compactly displays the internal structure of the dataset; it is a diagnostic function and an alternative to summary(). Be cautious with summary() when the dataset has this many variables. A related function, ls.str(), applies str() to each variable and returns every variable name with its type and its levels or leading values.

dim(brfss2013)

## [1] 491775    330

str(brfss2013, list.len = 5)

## 'data.frame':    491775 obs. of  330 variables:
##  $ X_state  : Factor w/ 55 levels "0","Alabama",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ fmonth   : Factor w/ 12 levels "January","February",..: 1 1 1 1 2 3 3 3 4 4 ...
##  $ idate    : int  1092013 1192013 1192013 1112013 2062013 3272013 3222013 3042013 4242013 4242013 ...
##  $ imonth   : Factor w/ 12 levels "January","February",..: 1 1 1 1 2 3 3 3 4 4 ...
##  $ iday     : Factor w/ 31 levels "1","2","3","4",..: 9 19 19 11 6 27 22 4 24 24 ...
##   [list output truncated]

As the output shows, the list.len argument limits the number of list elements displayed, which is useful for datasets with long lists of variables. brfss2013 is a data frame with 491,775 observations of 330 variables. Running ls.str(brfss2013) returns the complete list of variables with each variable's type, number of levels and leading values. The dataset contains categorical, ordinal and continuous variables.
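With hundreds of variables, a compact tally can be easier to read than the full str() listing. The sketch below counts columns by class and ranks them by their share of missing values. It runs on a small made-up data frame named toy (hypothetical values, not survey data); the same two calls should work on brfss2013.

```r
# Toy stand-in for brfss2013 (hypothetical values, not real survey data)
toy <- data.frame(
  state  = factor(c("Alabama", "Alaska", "Alabama", NA)),
  idate  = c(1092013L, 1192013L, NA, 2062013L),
  weight = c(70.5, 82.1, 64.0, 90.3)
)

# How many columns are there of each class?
class_counts <- table(sapply(toy, function(col) class(col)[1]))
print(class_counts)

# Share of missing values per column, from most to least missing
na_share <- sort(colMeans(is.na(toy)), decreasing = TRUE)
print(na_share)
```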

Part 1: Data

Generalization

According to the BRFSS webpage, the survey has been conducted since 1981, and the data have always been collected through telephone interviews. From 2008, cellular phone interviews were included, which improved access to hard-to-reach respondents and strengthened the representativeness of the survey. Both the landline and cellular phone surveys rely on random digit dialing (RDD), which selects the sample at random. For landline contacts, one adult within the household is randomly selected for the interview. Random sampling through RDD, together with random selection of an adult within each household, supports generalizing the results to the adult population.

Causality

The monthly survey is designed to collect data on health-related behaviors. Respondents answer the survey questions without being randomly assigned to a control or treatment group. Because this is an observational study with no random assignment, no causal conclusions can be drawn; the results can only show association.

Part 2: Research questions

Research question 1: About 8.2% of adults aged 18 or older in the sample report taking Aspirin to reduce the chance of a heart attack. People tend to believe that taking Aspirin has a positive effect on heart-related disease and heart attack. Are people diagnosed with Angina, Coronary Heart Disease or a Heart Attack more likely than the average respondent to take Aspirin to reduce the chance of a heart attack?

Research question 2: People tend to think that heavy smokers have a higher chance of a heart attack. Are people who smoke every day and have smoked at least 100 cigarettes more likely than the average respondent to have been diagnosed with a heart attack?

Research question 3: Are heavy alcohol drinkers more likely than the average respondent to binge drink on an occasion? Are male heavy drinkers more likely to binge drink than female heavy drinkers?

Part 3: Exploratory data analysis

Research question 1:

The research question started from a vague assumption: people with Angina or Coronary Heart Disease, or people who have had a heart attack, would take Aspirin to reduce the chance of a heart attack.

To make the research question more precise and concrete, it is a good start to examine the data structure of the variables in question.

str() and summary() show the dimensions and characteristics of each variable.

brfss2013 %>%
    select(cvdinfr4, cvdcrhd4, rduchart) %>%
    str()

## 'data.frame':    491775 obs. of  3 variables:
##  $ cvdinfr4: Factor w/ 2 levels "Yes","No": 2 2 2 2 2 2 2 2 2 2 ...
##  $ cvdcrhd4: Factor w/ 2 levels "Yes","No": NA 2 2 2 2 2 2 1 2 2 ...
##  $ rduchart: Factor w/ 2 levels "Yes","No": NA NA NA NA NA NA NA NA NA NA ...

brfss2013 %>%
    select(cvdinfr4, cvdcrhd4, rduchart) %>%
    summary()

##  cvdinfr4      cvdcrhd4      rduchart     
##  Yes : 29284   Yes : 29064   Yes : 40145  
##  No  :459904   No  :458288   No  :  5595  
##  NA's:  2587   NA's:  4423   NA's:446035

The variables in question are all categorical with “Yes” or “No” answers. NA values are rare, except for rduchart (whether the respondent takes Aspirin to reduce the chance of a heart attack), where most responses are missing.

The next step is to find the overall incidence of heart attack among adults aged 18 or older, as well as the incidence in the group of interest: those with high blood cholesterol and Angina or Coronary Heart Disease.

# National heart attack rate
transform(as.data.frame(table(brfss2013$cvdinfr4)), percent_col = Freq/nrow(brfss2013)*100)

##   Var1   Freq percent_col
## 1  Yes  29284    5.954756
## 2   No 459904   93.519191

ggplot(brfss2013, aes(x=cvdinfr4))+geom_bar()

# Rate of being diagnosed with Angina or Coronary Heart Disease
transform(as.data.frame(table(brfss2013$cvdcrhd4)), percent_col = Freq/nrow(brfss2013)*100)

##   Var1   Freq percent_col
## 1  Yes  29064     5.91002
## 2   No 458288    93.19059

ggplot(brfss2013, aes(x=cvdcrhd4))+geom_bar()

# Rate of taking Aspirin to reduce the Heart Attack
transform(as.data.frame(table(brfss2013$rduchart)), percent_col = Freq/nrow(brfss2013)*100)

##   Var1  Freq percent_col
## 1  Yes 40145    8.163286
## 2   No  5595    1.137715

ggplot(brfss2013, aes(x=rduchart))+geom_bar()

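According to the BRFSS 2013 survey, approximately 5.9% of respondents aged 18 or older have ever been diagnosed with a heart attack, 5.9% have been diagnosed with Angina or Coronary Heart Disease, and 8.2% take Aspirin to reduce the chance of a heart attack. These percentages use all 491,775 respondents as the denominator, NA included. Because rduchart is NA for about 90.7% of respondents, the 8.2% figure understates Aspirin use among those who answered the question (40,145 of 45,740, or about 87.8%).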
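The choice of denominator matters whenever a variable has many NA values. The sketch below uses a made-up ten-element factor (hypothetical values, not survey data) to contrast a percentage over all rows, as computed with nrow() above, with a percentage over answered rows only.

```r
# Hypothetical answers: 3 "Yes", 1 "No", 6 missing
rduchart <- factor(c("Yes", "Yes", "No", NA, NA, NA, NA, NA, "Yes", NA),
                   levels = c("Yes", "No"))

# Denominator = all rows, missing answers included
pct_all <- table(rduchart) / length(rduchart) * 100

# Denominator = answered rows only; table() drops NA by default
pct_answered <- prop.table(table(rduchart)) * 100

print(pct_all)
print(pct_answered)
```

Both numbers are valid, but they answer different questions, so it helps to state which denominator a reported percentage uses.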

rq1_tbl1 <- table(brfss2013$rduchart, brfss2013$cvdcrhd4, dnn = c("Aspirin Intake", "Heart Disease"))
prop.table(rq1_tbl1)

##               Heart Disease
## Aspirin Intake        Yes         No
##            Yes 0.14208841 0.73496842
##            No  0.00547007 0.11747309

rq1_tbl2 <- table(brfss2013$rduchart, brfss2013$cvdinfr4, dnn = c("Aspirin intakes", "Heart Attack"))
prop.table(rq1_tbl2)

##                Heart Attack
## Aspirin intakes        Yes         No
##             Yes 0.13808527 0.73936323
##             No  0.00623554 0.11631596

rq1_tbl3 <- table(brfss2013$cvdcrhd4, brfss2013$cvdinfr4, dnn = c("Heart Disease", "Heart Attack"))
prop.table(rq1_tbl3)

##              Heart Attack
## Heart Disease        Yes         No
##           Yes 0.02875058 0.03026095
##           No  0.02865373 0.91233475

rq1_tbl4 <- table(brfss2013$cvdinfr4, brfss2013$rduchart, brfss2013$cvdcrhd4, dnn = c("Heart Attack", "Aspirin Intake", "Heart Disease"))
prop.table(rq1_tbl4)

## , , Heart Disease = Yes
## 
##             Aspirin Intake
## Heart Attack         Yes          No
##          Yes 0.075020702 0.002148564
##          No  0.066538350 0.003222846
## 
## , , Heart Disease = No
## 
##             Aspirin Intake
## Heart Attack         Yes          No
##          Yes 0.058861709 0.003782368
##          No  0.676506793 0.113918668

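According to the additional analysis, 14.2% of the respondents who answered both questions report both heart disease and Aspirin intake, and 13.8% report both a heart attack and Aspirin intake. These are joint proportions, so they are not directly comparable with the 8.2% overall figure. Conditioning on diagnosis gives a clearer picture: among respondents with heart disease, about 96% take Aspirin (0.1421 / (0.1421 + 0.0055)), compared with about 86% of those without heart disease (0.7350 / (0.7350 + 0.1175)). Among those ever diagnosed with a heart attack the share is also about 96%, against about 86% otherwise. In sum, people ever diagnosed with Angina, Coronary Heart Disease or a Heart Attack are more likely to take Aspirin as a preventive practice against a future heart attack.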
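The cells printed by prop.table() without a margin argument are joint proportions. To see Aspirin intake within each diagnosis group, the same table can be normalized by column. The sketch below re-enters the joint proportions reported above for Aspirin intake and heart disease by hand; the object names joint and cond are illustrative only.

```r
# Joint proportions copied from the prop.table(rq1_tbl1) output above
joint <- matrix(
  c(0.14208841, 0.00547007,   # Heart Disease = Yes column
    0.73496842, 0.11747309),  # Heart Disease = No column
  nrow = 2,
  dimnames = list("Aspirin Intake" = c("Yes", "No"),
                  "Heart Disease"  = c("Yes", "No"))
)

# margin = 2 rescales each column to sum to 1,
# giving the share taking Aspirin within each heart disease group
cond <- prop.table(joint, margin = 2)
print(round(cond, 3))
```

On the full data, prop.table(rq1_tbl1, margin = 2) should give the same conditional shares directly.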

summarise(group_by(brfss2013, cvdinfr4), count = n(), perc_col = count/nrow(brfss2013)*100)

## # A tibble: 3 × 3
##   cvdinfr4  count   perc_col
##     <fctr>  <int>      <dbl>
## 1      Yes  29284  5.9547557
## 2       No 459904 93.5191907
## 3       NA   2587  0.5260536

summarise(group_by(brfss2013, cvdcrhd4), count = n(), perc_col = count/nrow(brfss2013)*100)

## # A tibble: 3 × 3
##   cvdcrhd4  count  perc_col
##     <fctr>  <int>     <dbl>
## 1      Yes  29064  5.910020
## 2       No 458288 93.190585
## 3       NA   4423  0.899395

summarise(group_by(brfss2013, rduchart), count = n(), perc_col = count/nrow(brfss2013)*100)

## # A tibble: 3 × 3
##   rduchart  count  perc_col
##     <fctr>  <int>     <dbl>
## 1      Yes  40145  8.163286
## 2       No   5595  1.137715
## 3       NA 446035 90.698999

summarise(group_by(brfss2013, cvdinfr4, cvdcrhd4, rduchart), count=n(), perc_col = count/nrow(brfss2013)*100)

## Source: local data frame [27 x 5]
## Groups: cvdinfr4, cvdcrhd4 [?]
## 
##    cvdinfr4 cvdcrhd4 rduchart count   perc_col
##      <fctr>   <fctr>   <fctr> <int>      <dbl>
## 1       Yes      Yes      Yes  3352 0.68161253
## 2       Yes      Yes       No    96 0.01952112
## 3       Yes      Yes       NA 10505 2.13613949
## 4       Yes       No      Yes  2630 0.53479742
## 5       Yes       No       No   169 0.03436531
## 6       Yes       No       NA 11107 2.25855320
## 7       Yes       NA      Yes   285 0.05795333
## 8       Yes       NA       No    18 0.00366021
## 9       Yes       NA       NA  1122 0.22815312
## 10       No      Yes      Yes  2973 0.60454476
## # ... with 17 more rows

The variable cvdinfr4 records whether the respondent has ever been diagnosed with a Heart Attack, and cvdcrhd4 with Angina or Coronary Heart Disease. The variable rduchart records whether the respondent takes Aspirin to reduce the chance of a heart attack.

# Flag respondents with heart disease who take Aspirin, then drop rows with missing values
brfss2013_rq1 <- brfss2013 %>%
    mutate(hd_asp = ifelse(cvdcrhd4 == "Yes" & rduchart == "Yes", "Yes", "No")) %>%
    filter(!is.na(hd_asp), !is.na(cvdinfr4))
summarise(group_by(brfss2013_rq1, hd_asp, cvdinfr4), count = n(), perc_col = count/nrow(brfss2013_rq1)*100)

## Source: local data frame [4 x 4]
## Groups: hd_asp [?]
## 
##   hd_asp cvdinfr4  count   perc_col
##    <chr>   <fctr>  <int>      <dbl>
## 1     No      Yes  14020  3.0261105
## 2     No       No 442956 95.6086864
## 3    Yes      Yes   3352  0.7235037
## 4    Yes       No   2973  0.6416995

The newly added variable hd_asp records whether the respondent has been diagnosed with Heart Disease and takes Aspirin to reduce the chance of a heart attack.

Research question 2:

First, we need to define “heavy smoker”. We define heavy smokers as those who currently smoke every day and have smoked at least 100 cigarettes in their entire life. The variables used for this definition are smokday2 and smoke100, respectively. The variable cvdinfr4 records whether the respondent has ever been diagnosed with a Heart Attack.

First, have a look at the structure and the summary of variables at hand.

brfss2013 %>%
    select(cvdinfr4, smokday2, smoke100) %>%
    str()

## 'data.frame':    491775 obs. of  3 variables:
##  $ cvdinfr4: Factor w/ 2 levels "Yes","No": 2 2 2 2 2 2 2 2 2 2 ...
##  $ smokday2: Factor w/ 3 levels "Every day","Some days",..: 3 NA 2 NA 3 NA 3 1 NA NA ...
##  $ smoke100: Factor w/ 2 levels "Yes","No": 1 2 1 2 1 2 1 1 2 2 ...

brfss2013 %>%
    select(cvdinfr4, smokday2, smoke100) %>%
    summary()

##  cvdinfr4            smokday2      smoke100     
##  Yes : 29284   Every day : 55163   Yes :215201  
##  No  :459904   Some days : 21494   No  :261654  
##  NA's:  2587   Not at all:138135   NA's: 14920  
##                NA's      :276983

All three variables are categorical: cvdinfr4 and smoke100 take “Yes” or “No”, and smokday2 takes “Every day”, “Some days” or “Not at all”. smokday2 has a large number of NA values. Even so, the NA values carry no information for this analysis, so they are excluded where possible.

The next step is to find the overall incidence of heart attack among adults aged 18 or older, as well as the incidence in the group of interest: those who smoke every day and have smoked at least 100 cigarettes in their entire life.

# Incidence of Heart Attack
summarise(group_by(brfss2013, cvdinfr4), count=n(), per_col = count/nrow(brfss2013)*100)

## # A tibble: 3 × 3
##   cvdinfr4  count    per_col
##     <fctr>  <int>      <dbl>
## 1      Yes  29284  5.9547557
## 2       No 459904 93.5191907
## 3       NA   2587  0.5260536

ggplot(brfss2013, aes(x=cvdinfr4))+geom_bar()

summarise(group_by(brfss2013 %>% filter(!is.na(cvdinfr4)), cvdinfr4), count=n(), per_col = count/nrow(brfss2013)*100)

## # A tibble: 2 × 3
##   cvdinfr4  count   per_col
##     <fctr>  <int>     <dbl>
## 1      Yes  29284  5.954756
## 2       No 459904 93.519191

# Incidence of those smoking everyday
summarise(group_by(brfss2013, smokday2), count=n(), per_col = count/nrow(brfss2013)*100)

## # A tibble: 4 × 3
##     smokday2  count   per_col
##       <fctr>  <int>     <dbl>
## 1  Every day  55163 11.217122
## 2  Some days  21494  4.370698
## 3 Not at all 138135 28.089065
## 4         NA 276983 56.323115

ggplot(brfss2013, aes(x=smokday2))+geom_bar()

summarise(group_by(brfss2013 %>% filter(!is.na(smokday2)), smokday2), count=n(), per_col = count/nrow(brfss2013)*100)

## # A tibble: 3 × 3
##     smokday2  count   per_col
##       <fctr>  <int>     <dbl>
## 1  Every day  55163 11.217122
## 2  Some days  21494  4.370698
## 3 Not at all 138135 28.089065

# Incidence of having smoked 100+ cigarettes in entire life
summarise(group_by(brfss2013, smoke100), count=n(), per_col = count/nrow(brfss2013)*100)

## # A tibble: 3 × 3
##   smoke100  count   per_col
##     <fctr>  <int>     <dbl>
## 1      Yes 215201 43.760053
## 2       No 261654 53.206039
## 3       NA  14920  3.033908

ggplot(brfss2013, aes(x=smoke100))+geom_bar()

summarise(group_by(brfss2013 %>% filter(!is.na(smoke100)), smoke100), count=n(), per_col = count/nrow(brfss2013)*100)

## # A tibble: 2 × 3
##   smoke100  count  per_col
##     <fctr>  <int>    <dbl>
## 1      Yes 215201 43.76005
## 2       No 261654 53.20604

As covered in research question 1, about 5.9% of adults aged 18 or older have ever been diagnosed with a heart attack. About 11.2% currently smoke every day, and 43.8% have smoked at least 100 cigarettes in their entire life.

It seems that those who currently smoke every day have also smoked at least 100 cigarettes in their entire life, but we need to check whether this is true. We also need to create a new heavy smoker variable that combines both variables in order to explore the research question.

brfss2013 <- brfss2013 %>% mutate(smk_every=ifelse(smokday2 == "Every day", "Yes", "No"))
summarise(group_by(brfss2013 %>% mutate(heavy_smoke=ifelse(smk_every == smoke100, "Heavy", "Not")), heavy_smoke), count=n(), per_col=count/nrow(brfss2013)*100)

## # A tibble: 3 × 3
##   heavy_smoke  count  per_col
##         <chr>  <int>    <dbl>
## 1       Heavy  55161 11.21671
## 2         Not 159629 32.45976
## 3        <NA> 276985 56.32352

summarise(group_by(brfss2013 %>% mutate(heavy_smoke=ifelse(smokday2 == "Every day" & smoke100 == "Yes", "Heavy", "Not")), heavy_smoke), count=n(), per_col=count/nrow(brfss2013)*100)

## # A tibble: 3 × 3
##   heavy_smoke  count   per_col
##         <chr>  <int>     <dbl>
## 1       Heavy  55161 11.216715
## 2         Not 421283 85.665802
## 3        <NA>  15331  3.117483

brfss2013 <- brfss2013 %>% mutate(heavy_smoke=ifelse(smokday2 == "Every day" & smoke100 == "Yes", "Heavy", "Not"))
ggplot(brfss2013, aes(x=heavy_smoke))+geom_bar()

According to the analysis, about 11.2% of respondents currently smoke every day and have smoked at least 100 cigarettes in their entire life. That is almost everyone who smokes every day (55,161 of 55,163), which supports the assumption made above.
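The two heavy_smoke constructions above return the same “Heavy” count but very different NA counts. A likely reason is how R combines NA with logical values: NA & FALSE is FALSE, whereas a comparison with NA is always NA. The sketch below reproduces this on five made-up rows (hypothetical values); the last row is a combination that may not occur in the survey and is included only to show a pitfall of the equality test.

```r
# Hypothetical answers for five respondents
smokday2 <- c("Every day", NA,   "Not at all", NA, "Not at all")
smoke100 <- c("Yes",       "No", "Yes",        NA, "No")

# Construction 1: compare a derived Yes/No flag with smoke100
smk_every   <- ifelse(smokday2 == "Every day", "Yes", "No")
by_equality <- ifelse(smk_every == smoke100, "Heavy", "Not")

# Construction 2: require both conditions explicitly
by_and <- ifelse(smokday2 == "Every day" & smoke100 == "Yes", "Heavy", "Not")

print(data.frame(smokday2, smoke100, by_equality, by_and))
```

Row 2 is NA under the equality test but “Not” under the & test, because NA & FALSE evaluates to FALSE. Row 5 is labeled “Heavy” by the equality test only because “No” equals “No”, which is why the explicit & form is the safer definition.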

Now, let’s investigate whether heavy smokers are more likely than the average respondent to have been diagnosed with a heart attack.

summarise(group_by(brfss2013 %>% mutate(heavy_smoke=ifelse(smk_every == smoke100, "Heavy", "Not")), cvdinfr4, heavy_smoke), count=n(), per_col=count/nrow(brfss2013)*100)

## Source: local data frame [9 x 4]
## Groups: cvdinfr4 [?]
## 
##   cvdinfr4 heavy_smoke  count     per_col
##     <fctr>       <chr>  <int>       <dbl>
## 1      Yes       Heavy   4016  0.81663362
## 2      Yes         Not  14292  2.90620711
## 3      Yes        <NA>  10976  2.23191500
## 4       No       Heavy  50820 10.33399420
## 5       No         Not 144426 29.36830868
## 6       No        <NA> 264658 53.81688780
## 7       NA       Heavy    325  0.06608713
## 8       NA         Not    911  0.18524732
## 9       NA        <NA>   1351  0.27471913

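The 0.8% figure is the share of all respondents who are both heavy smokers and have been diagnosed with a heart attack, so it cannot be compared directly with the 5.9% overall incidence. Within the heavy smoker group, 4,016 of the 54,836 respondents with a known answer (about 7.3%) have ever been diagnosed with a heart attack, which is somewhat higher than the overall incidence. Note that the “Not” group in this table (about 9.0%) contains only respondents for whom smokday2 was recorded, and that these unadjusted rates do not account for other differences between the groups.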
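One way to read the grouped counts above is as a rate within each smoking group rather than as a share of the whole sample. The sketch below re-enters the four non-missing counts from the table by hand (the object names counts and rates are illustrative) and divides the “Yes” count by the group total.

```r
library(dplyr)

# Counts copied from the grouped summary above, rows with NA left out
counts <- data.frame(
  cvdinfr4    = c("Yes", "Yes", "No", "No"),
  heavy_smoke = c("Heavy", "Not", "Heavy", "Not"),
  count       = c(4016, 14292, 50820, 144426)
)

# Heart attack rate (%) within each smoking group
rates <- counts %>%
  group_by(heavy_smoke) %>%
  summarise(attack_rate = sum(count[cvdinfr4 == "Yes"]) / sum(count) * 100)

print(rates)  # roughly 7.3% for "Heavy" and 9.0% for "Not"
```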

Research question 3:

First, heavy drinkers are defined as males having more than 2 drinks per day and females having more than 1 drink per day. Three calculated variables identify them: X_rfdrhv4 for males and females combined, X_rfdrmn4 for males only and X_rfdrwm4 for females only. Binge drinkers are defined as males having 5 or more drinks on one occasion and females having 4 or more. This is stored in the calculated variable X_rfbing5.

First, have a look at the structure and the summary of the variables at hand, along with a bar chart for each.

brfss2013 %>%
    select(X_rfdrhv4, X_rfdrmn4, X_rfdrwm4, X_rfbing5) %>%
    str()

## 'data.frame':    491775 obs. of  4 variables:
##  $ X_rfdrhv4: Factor w/ 2 levels "No","Yes": 1 1 2 1 1 1 1 1 1 1 ...
##  $ X_rfdrmn4: Factor w/ 2 levels "No","Yes": NA NA NA NA 1 NA NA NA 1 NA ...
##  $ X_rfdrwm4: Factor w/ 2 levels "No","Yes": 1 1 2 1 NA 1 1 1 NA 1 ...
##  $ X_rfbing5: Factor w/ 2 levels "No","Yes": 1 1 2 1 1 1 1 1 1 1 ...

brfss2013 %>%
    select(X_rfdrhv4, X_rfdrmn4, X_rfdrwm4, X_rfbing5) %>%
    summary()

##  X_rfdrhv4     X_rfdrmn4     X_rfdrwm4     X_rfbing5    
##  No  :442359   No  :178553   No  :263803   No  :409195  
##  Yes : 25533   Yes : 11868   Yes : 13665   Yes : 58849  
##  NA's: 23883   NA's:301354   NA's:214307   NA's: 23731

ggplot(brfss2013, aes(x=X_rfdrhv4))+geom_bar()

ggplot(brfss2013, aes(x=X_rfdrmn4))+geom_bar()

ggplot(brfss2013, aes(x=X_rfdrwm4))+geom_bar()

ggplot(brfss2013, aes(x=X_rfbing5))+geom_bar()

All the variables are categorical with “Yes” and “No” levels, plus NA for missing values. Now let’s find the overall incidence of heavy drinking and binge drinking.

# Incidence of heavy alcohol drinking
summarise(group_by(brfss2013 %>% filter(!is.na(X_rfdrhv4)), X_rfdrhv4), count=n(), per_col=count/nrow(brfss2013)*100)

## # A tibble: 2 × 3
##   X_rfdrhv4  count   per_col
##      <fctr>  <int>     <dbl>
## 1        No 442359 89.951502
## 2       Yes  25533  5.192009

# Incidence of male heavy alcohol drinking
summarise(group_by(brfss2013 %>% filter(!is.na(X_rfdrmn4)), X_rfdrmn4), count=n(), per_col=count/nrow(brfss2013)*100)

## # A tibble: 2 × 3
##   X_rfdrmn4  count   per_col
##      <fctr>  <int>     <dbl>
## 1        No 178553 36.307864
## 2       Yes  11868  2.413299

# Incidence of female heavy alcohol drinking
summarise(group_by(brfss2013 %>% filter(!is.na(X_rfdrwm4)), X_rfdrwm4), count=n(), per_col=count/nrow(brfss2013)*100)

## # A tibble: 2 × 3
##   X_rfdrwm4  count  per_col
##      <fctr>  <int>    <dbl>
## 1        No 263803 53.64303
## 2       Yes  13665  2.77871

# Incidence of binge drinking
summarise(group_by(brfss2013 %>% filter(!is.na(X_rfbing5)), X_rfbing5), count=n(), per_col=count/nrow(brfss2013)*100)

## # A tibble: 2 × 3
##   X_rfbing5  count  per_col
##      <fctr>  <int>    <dbl>
## 1        No 409195 83.20777
## 2       Yes  58849 11.96665

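About 5.2% of respondents drink alcohol heavily. As shares of all respondents, male heavy drinkers (2.4%) are slightly fewer than female heavy drinkers (2.8%), but more women than men have a recorded answer. Within each sex, about 6.2% of men (11,868 of 190,421) and 4.9% of women (13,665 of 277,468) are heavy drinkers. Meanwhile, about 12% of adults aged 18 or older binge drink.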

To explore whether heavy drinkers are more likely to be binge drinkers, the next step is to build a contingency table combining “heavy drinking” and “binge drinking”. It is also worth analyzing each sex separately to see the relationship between heavy drinking and binge drinking occasions.

summarise(group_by(brfss2013 %>% filter(!is.na(X_rfbing5)), X_rfbing5, X_rfdrhv4), count=n(), per_col=count/nrow(brfss2013)*100)

## Source: local data frame [6 x 4]
## Groups: X_rfbing5 [?]
## 
##   X_rfbing5 X_rfdrhv4  count    per_col
##      <fctr>    <fctr>  <int>      <dbl>
## 1        No        No 400625 81.4651009
## 2        No       Yes   6930  1.4091810
## 3        No        NA   1640  0.3334858
## 4       Yes        No  39809  8.0949621
## 5       Yes       Yes  17826  3.6248284
## 6       Yes        NA   1214  0.2468609

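The percentages in this table are shares of all respondents, not rates within a group, so the 8.1% and 3.6% figures cannot be compared directly with the 12% binge drinking incidence. Among heavy drinkers with a known binge status, 17,826 of 24,756 (about 72%) also binge drink, compared with 39,809 of 440,434 (about 9%) of non-heavy drinkers. Heavy drinkers therefore appear much more likely than the average respondent to binge drink on an occasion.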
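A row-wise proportion makes the comparison between heavy and non-heavy drinkers explicit. The sketch below rebuilds the contingency table from the four non-missing counts printed above (entered by hand; the names counts, tab and binge_by_heavy are illustrative) and normalizes each heavy-drinking row to 100%.

```r
# Counts copied from the grouped summary above, rows with NA left out
counts <- data.frame(
  binge = c("No", "No", "Yes", "Yes"),
  heavy = c("No", "Yes", "No", "Yes"),
  count = c(400625, 6930, 39809, 17826)
)

# Build a heavy x binge contingency table from the long-format counts
tab <- xtabs(count ~ heavy + binge, data = counts)

# margin = 1 rescales each row, giving binge shares within each heavy-drinking group
binge_by_heavy <- prop.table(tab, margin = 1) * 100
print(round(binge_by_heavy, 1))
```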

summarise(group_by(brfss2013 %>% filter(!is.na(X_rfbing5)), X_rfbing5, X_rfdrmn4), count=n(), per_col=count/nrow(brfss2013)*100)

## Source: local data frame [6 x 4]
## Groups: X_rfbing5 [?]
## 
##   X_rfbing5 X_rfdrmn4  count    per_col
##      <fctr>    <fctr>  <int>      <dbl>
## 1        No        No 152827 31.0766102
## 2        No       Yes   1774  0.3607341
## 3        No        NA 254594 51.7704235
## 4       Yes        No  24698  5.0222154
## 5       Yes       Yes   9678  1.9679732
## 6       Yes        NA  24473  4.9764628

summarise(group_by(brfss2013 %>% filter(!is.na(X_rfbing5)), X_rfbing5, X_rfdrwm4), count=n(), per_col=count/nrow(brfss2013)*100)

## Source: local data frame [6 x 4]
## Groups: X_rfbing5 [?]
## 
##   X_rfbing5 X_rfdrwm4  count   per_col
##      <fctr>    <fctr>  <int>     <dbl>
## 1        No        No 247797 50.388287
## 2        No       Yes   5156  1.048447
## 3        No        NA 156242 31.771034
## 4       Yes        No  15111  3.072747
## 5       Yes       Yes   8148  1.656855
## 6       Yes        NA  35590  7.237049

brfss2013 <- brfss2013 %>%
    mutate(bing_malehv=ifelse(X_rfbing5=="Yes" & X_rfdrmn4=="Yes", "Male Heavy & Binge", "Not"), bing_femalehv=ifelse(X_rfbing5=="Yes" & X_rfdrwm4=="Yes", "Female Heavy & Binge", "Not"))
ggplot(brfss2013, aes(x=bing_malehv))+geom_bar()

ggplot(brfss2013, aes(x=bing_femalehv))+geom_bar()

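As assumed in the research question, more male heavy drinkers than female heavy drinkers binge drink on an occasion: 2.0% versus 1.7% of all respondents. Within each group the gap is wider: about 85% of male heavy drinkers (9,678 of 11,452) binge drink, compared with about 61% of female heavy drinkers (8,148 of 13,304).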
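A proportional stacked bar chart can show the same comparison at a glance. The sketch below is one possible way to draw it, using only the heavy drinker counts printed above (entered by hand; the data frame name heavy and the axis labels are illustrative choices).

```r
library(ggplot2)

# Heavy drinkers only, counts copied from the two grouped summaries above
heavy <- data.frame(
  sex   = c("Male", "Male", "Female", "Female"),
  binge = c("No", "Yes", "No", "Yes"),
  count = c(1774, 9678, 5156, 8148)
)

# Share of binge drinkers within each sex, in percent
heavy$share <- heavy$count / ave(heavy$count, heavy$sex, FUN = sum) * 100

# position = "fill" rescales each bar to 1 so the two sexes are directly comparable
p <- ggplot(heavy, aes(x = sex, y = count, fill = binge)) +
    geom_bar(stat = "identity", position = "fill") +
    labs(x = "Heavy drinkers by sex", y = "Proportion", fill = "Binge drinker")
print(p)
```

Unlike the raw-count bar charts above, this view is not affected by the different numbers of men and women in the sample.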