Background

This project is aimed at an exploratory look at understanding health risks related to the the data in the Behavioral Risk Factor Surveillance System (BRFSS) dataset published by the CDC. All data is publicly available and can be found at http://www.cdc.gov/brfss/.

Load packages

library(ggplot2)
library(dplyr)
library(stringr)

Load data

Make sure your data and R Markdown files are in the same directory. When loaded your data file will be called brfss2013. Delete this note when before you submit your work.

load("brfss2013.RData")

Part 1: Data

The observations are randomly collected via telephone survey. This is random sampling but not random assignment (observations only), therefore the results of studying this data can not be used for causal inference but results may be generalized to the population. * * *

Part 2: Research questions

Research question 1: Do people with a lower income tend to drink more?

This would be of interest due to the fact that many mental and physical health problems can be affected by alcohol consumption and can affect overall well being. This could be a stepping stone to understanding where alcoholism education and outreach efforts should be directed or possible other social program efforts.

Research question 2: Do people with early onset diabetes tend to exhibit higher instance of heart disease? How does this compare to those that get diagnosed later in life?

This would be of interest due to the fact that it is already known that people with diabetes are at higher risk of heart disease, than those that aren’t. However, it would be interesting to know if when it is diagnosed has an positive or negative. Additionally it could be investigated whether have an earlier vs. later onset diabetes has any affect on the ability to manage any heart disease. Perhaps this could help determine if earlier diagnosis could be helpful in treatment and disease management, resulting in better quality of life.

Research question 3: Do certain races/ethnicity groups tend to exhibit higher instances of obesity and poor diet?

This would be interesting to understand if health and nutrition could be affected by which ethnic group you belong to, and perhaps further study to look at income levels across ethnicities, and how it plays a role in overall health and diet quality of the individual. Geography, regional food availability, and many other affects could also play a role , but this is just a preliminary study to investigate the patterns in the data.


Part 3: Exploratory data analysis

Research question 1

Do people with a lower income tend to drink more based on their average drinks per day, over the past 30 days that they drank?

# Filter out columns of interest and remove values outside acceptable range, NA's, etc.
q1 <- brfss2013 %>% select(income2,alcday5,avedrnk2)
q1 <- q1 %>% filter(!is.na(income2),!is.na(alcday5),!is.na(avedrnk2))
q1 <- q1  %>% filter(alcday5 != 1)
q1  <- q1  %>% mutate(X_drnk_per_mo = ifelse(alcday5 >200,avedrnk2*(alcday5-200), ifelse(alcday5>100,avedrnk2*(alcday5-100)*(30/7),"different")))
q1$X_drnk_per_mo <- round(q1$X_drnk_per_mo)

q1 <- q1  %>% mutate(income1 = ifelse(income2 == "$75,000 or more",">$75K",ifelse(income2 == "Less than $75,000","<$75K",ifelse(income2 == "Less than $50,000","<$50K",ifelse(income2 == "Less than $35,000","<$35K",ifelse(income2 == "Less than $25,000","<$25K",ifelse(income2 == "Less than $20,000","<$20K",ifelse(income2 == "Less than $15,000","<$15K",ifelse(income2 == "Less than $10,000","<$10K", "different")))))))))
# head(q1[order(-q1$X_drnk_per_mo),])

# Compute summary statistics for "Drinks per 30 Days" across income levels
q1 %>% group_by(income2) %>% summarise(drnk_mean = mean(X_drnk_per_mo), drnk_median = median(X_drnk_per_mo),drnk_sd = sd(X_drnk_per_mo), drnk_min = min(X_drnk_per_mo), drnk_max = max(X_drnk_per_mo))
## # A tibble: 8 x 6
##   income2           drnk_mean drnk_median drnk_sd drnk_min drnk_max
##   <fct>                 <dbl>       <dbl>   <dbl>    <dbl>    <dbl>
## 1 Less than $10,000      29.1           9    75.9        1     2280
## 2 Less than $15,000      26.2           9    56.6        1     1800
## 3 Less than $20,000      24.2           9    50.3        1     2280
## 4 Less than $25,000      24.0           9    50.7        1     2100
## 5 Less than $35,000      23.4           9    47.5        1     1500
## 6 Less than $50,000      22.9           9    48.1        1     2280
## 7 Less than $75,000      22.2          10    40.6        1     1800
## 8 $75,000 or more        22.0          12    35.8        1     1800
# Create visualizations 

# Test Box plot
# par(mar=c(5,10,2,2))
# boxplot((X_drnk_per_mo/30)~income1,data = q1,xlab = 'Income Level ($ USD)',ylab ='Average Drinks per Day',log='y',las=1)

# Bar plot to visualize average drinking levels across income levels
q1_summ <- q1 %>% group_by(income2) %>% summarise(drnk_mean = mean(X_drnk_per_mo), drnk_median = median(X_drnk_per_mo),drnk_sd = sd(X_drnk_per_mo), drnk_min = min(X_drnk_per_mo), drnk_max = max(X_drnk_per_mo),count=n())

q1_summ <- q1_summ  %>% mutate(income1 = ifelse(income2 == "$75,000 or more",">$75K",ifelse(income2 == "Less than $75,000","<$75K",ifelse(income2 == "Less than $50,000","<$50K",ifelse(income2 == "Less than $35,000","<$35K",ifelse(income2 == "Less than $25,000","<$25K",ifelse(income2 == "Less than $20,000","<$20K",ifelse(income2 == "Less than $15,000","<$15K",ifelse(income2 == "Less than $10,000","<$10K", "different")))))))))


ggplot(q1_summ, aes(x=income1, y=drnk_mean),xlab='average drinks per month') + 
geom_bar(stat = "identity",fill='blue') +
labs(title="Average Drinks per 30 Days Across Income Levels ",x ="Income Level $ USD (in thousands)", y = "Average Drinks per 30 Days") + 
geom_text(aes(label=count), vjust=1.6, color="white", size=3.5)+
theme(plot.title = element_text(hjust = 0.5))

Conclusion:

Based on the above plot it appears that average drinking levels are higher among those groups with a lower income level, and this can be generalized to public since the sampling was random, as mentioned in Part 1. However it should be noted that for statistics reported with high drinking levels, that a confounding variable could be that participants could be under the influence while reporting, making statistics unreliable or introduce an unknown bias. Living below poverty line, age, mental state all could influence accuracy of reporting as well, or ability to participate. Note that the count of people surveyed varies widely across income levels.

Other questions could be posed to understand history of participant alcohol usage, i.e. “Have you ever checked into a support group for people with alcoholism?”, “Have you ever been cited for driving under the influence (DUI)?”. Further investigation is needed.

Research question 2

Do people with early onset diabetes tend to exhibit higher instance of heart attacks and other heart diseas? How does this compare to those that get diagnosed later in life?

# Filter out data of interest

q2 <- brfss2013 %>% select(diabage2,cvdinfr4,cvdcrhd4) %>% 
filter(!is.na(diabage2),diabage2!="",!is.na(cvdinfr4),!is.na(cvdcrhd4))  # Get rid of NA values
q2$diabage2 <- as.numeric(as.character(q2$diabage2)) # Convert age to integers
# q2 %>% filter(!is.na(q2$diabage2))    # Clean up NA values

# Characterize as early onset (assume less than 40 years old = early onset)
q2 <- q2 %>% filter(!is.na(q2$diabage2)) %>% mutate(early_onset = ifelse(diabage2<40,"early","not early"),num_people = 1)
head(q2)
##   diabage2 cvdinfr4 cvdcrhd4 early_onset num_people
## 1       69       No       No   not early          1
## 2       60       No       No   not early          1
## 3       66       No       No   not early          1
## 4       55       No       No   not early          1
## 5       64       No       No   not early          1
## 6       46       No       No   not early          1
# Calculate percentages of heart disease within 'early' and 'not early' groups and compare

# Count in each agegroup and make bar plot to investigate
q2_summ <- q2 %>% group_by(diabage2) %>% summarise(agecount = as.integer(n()))

# Bar Plot REDO
ggplot(data=q2, aes(x=diabage2)) +
  geom_histogram(binwidth =5,boundary=0,color = 'black',fill="blue") +
  labs(title = "Histogram Age of Diabetes Diagnosis",x = "Age of Diagnosis", y= "Number of People") +
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_x_continuous(breaks = pretty(q2$diabage2, n = 20)) #+

#  
# Heart Attack tables and % calculations
h_at_early <- table(q2$cvdinfr4[q2$early_onset=="early"])
perc_att_early <- round(100*h_at_early[1]/(h_at_early[1]+h_at_early[2]),1)

h_at_non <- table(q2$cvdinfr4[q2$early_onset=="not early"])
perc_att_nonearly <- round(100*h_at_non[1]/(h_at_non[1]+h_at_non[2]),1)

# Heart Disease tables and %
h_dis_early <- table(q2$cvdcrhd4[q2$early_onset=="early"])
perc_dis_early <- round(100*h_dis_early[1]/(h_dis_early[1]+h_dis_early[2]),1)

h_dis_non <- table(q2$cvdcrhd4[q2$early_onset=="not early"])
perc_dis_nonearly <- round(100*h_dis_non[1]/(h_dis_non[1]+h_dis_non[2]),1)


# Heart Attack stacked bar plot
ggplot(q2,aes(x=early_onset,y=num_people,fill=cvdinfr4)) +
  geom_bar(stat = 'identity',position = 'stack') +
  labs(title = paste("Heart attacks among early vs. non-early onset diabetics \n",perc_att_early,"% Attacks for early \n",perc_att_nonearly,"% Attacks for non-early"),x = "Early or Non-Early Diabetics\n (40 = Cutoff Age for Early Diagnosis)", y= "Number of People") +
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_fill_discrete(name = 'Heart Attack?')

# NORMALIZED Heart Attack stacked bar plot
ggplot(q2,aes(x=early_onset,y=num_people,fill=cvdinfr4)) +
  geom_bar(stat = 'identity',position = 'fill') +
  labs(title = paste("(Normalized) Heart attacks among early vs. non-early onset diabetics \n",perc_att_early,"% Attacks for early \n",perc_att_nonearly,"% Attacks for non-early"),x = "Early or Non-Early Diabetics \n (40 = Cutoff Age for Early Diagnosis)", y= "Proportion of People") +
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_fill_discrete(name = 'Heart Attack?')

# Heart Disease stacked bar plot
ggplot(q2,aes(x=early_onset,y=num_people,fill=cvdcrhd4)) +
  geom_bar(stat = 'identity',position = 'stack') +
  labs(title = paste("Heart disease among early vs. non-early onset diabetics \n",perc_dis_early,"% disease for early \n",perc_dis_nonearly,"% disease for non-early"),x = "Early or Non-Early Diabetics\n (40 = Cutoff Age for Early Diagnosis)", y= "Number of People") +
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_fill_discrete(name = 'Heart disease?')

# NORMALIZED Heart Disease stacked bar plot
ggplot(q2,aes(x=early_onset,y=num_people,fill=cvdcrhd4)) + geom_bar(stat = 'identity',position = 'fill') +
  labs(title = paste("(Normalized) Heart disease among early vs. non-early onset diabetics \n",perc_dis_early,"% disease for early \n",perc_dis_nonearly,"% disease for non-early"),x = "Early or Non-Early Diabetics\n (40 = Cutoff Age for Early Diagnosis)", y= "Proportion of People") +
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_fill_discrete(name = 'Heart disease?')

Conclusion:

Looking at the proportion of early onset diabetics in the survey, it appears there is a lower % of heart attack and disease. However, it should be noted that checking the heart attack and disease statistics of people with early onset diabetes vs. those who were diagnosed later in life, isn’t considering that the older you are, the more time you have had to develop general (non diabetes related) health issues, and have lived around longer amount of time (in which to suffer from a heart disease or attack). Statistically, I would presume the more life you have lived, the more likely you are to have suffered a heart attack or other ailment, regardless of pre-existing health conditions. This needs to be taken into account in future studies.

Research question 3

Do certain races/ethnicity groups tend to exhibit higher instances of obesity?

# Select rows of interest
q3 <- brfss2013 %>% select(X_imprace,X_prace1,X_bmi5cat) %>%
filter(!is.na(X_imprace),!is.na(X_prace1),!is.na(X_bmi5cat)) %>%  # Get rid of NA values
mutate(imprace_short = str_split(X_imprace,',',simplify = T)[,1],imprace_short = str_split(imprace_short,'/',simplify = T)[,1],num_people=1) # add column with shortened race names


# Plot survey demographics for permuted race/ethnicity
q3_summ <- q3 %>% group_by(imprace_short) %>% summarise(count = n()) 

ggplot(q3_summ, aes(x=reorder(imprace_short, -count), y=count)) + geom_bar(stat = "identity",color = 'black',fill="blue") + labs(title = "Survey Demographics Across Permuted Ethnicity/Race",x = "Permuted Race/Ethnicity Category", y= "Number of People") + theme(plot.title = element_text(hjust = 0.5)) +geom_text(aes(label=count,vjust=-0.2))# ,axis.text.x = element_text(angle = 90,))

# Filter, count and plot obese across ethnicity
q3_obese <- q3 %>% filter(X_bmi5cat=="Obese") %>% group_by(imprace_short) %>% summarise(count = n())

ggplot(q3,aes(x=imprace_short,y=num_people,fill=X_bmi5cat)) + geom_bar(stat = 'identity',position = 'fill') + labs(title = "Normalized BMI Across Race/Ethnicity",x = "Permuted Race/Ethnicity Category", y= "Number of People") + theme(plot.title = element_text(hjust = 0.5)) +
scale_fill_discrete(name = 'BMI Category')

Conclusion:

It appears that based off of the normalized BMI bar plot, that the occurrence of obesity in both Black and American Indian populations surveyed, is higher than that of any other ethnicities. However, reviewing the demographic overview plot, the sample populations for these is significantly lower, and should be taken into account (and is likely closer in line to the overall population distribution of ethnicities). Also noteworthy is how despite similar proportions of overweight populations across race/ethnicity, that obesity in Asian population surveyed, is significantly lower than that of the rest of the ethnic groups surveyed. It would also be of great interest to view how BMI varied across the rest of the world, with respection to region, country and ethnicity. Based on the random sampling nature of the survey and sample size, this trend can be generalized to the general population, but additional investigation is still warranted.