title: “Exploring the BRFSS data”

author: “Michelle Tan”

date: “3/3/2018”

output: html_document

Exploring the BRFSS data

Introduction

I am going to use the 2013 Behavioral Risk Factor Surveillance System (BRFSS) data, the data set provided by Duke University for the Introduction to Probability and Data course on Coursera.

“The BRFSS objective is to collect uniform, state-specific data on preventive health practices and risk behaviors that are linked to chronic diseases, injuries, and preventable infectious diseases that affect the adult population. Factors assessed by the BRFSS in 2013 include tobacco use, HIV/AIDS knowledge and prevention, exercise, immunization, health status, healthy days — health-related quality of life, health care access, inadequate sleep, hypertension awareness, cholesterol awareness, chronic health conditions, alcohol consumption, fruits and vegetables consumption, arthritis burden, and seatbelt use.” (Background from Codebook)

The survey has more than 400,000 respondents, all aged 18 or older and resident in the U.S. Respondents are reached by the Random Digit Dialing method and weighted to represent the population. More than 97.5% of people in the U.S. were telephone users, so the sampling frame covers almost everyone and the survey should be close to a representative random sample.

There can still be problems with the responses. Answers are not verified, so we may face over- or under-reporting, where respondents answer what they feel is appropriate rather than what is true. This can bias not only income levels but also other sensitive questions, such as those about the respondent's health conditions. Memory is another issue, as respondents have to recall, for example, how much they drank or smoked in the last 30 days.

Setup

Before I start working with the data set I have to load the libraries I will use in the exploratory analysis.

Load packages
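The chunk that loaded the packages is not shown here. As a minimal sketch, assuming the packages are the ones implied by the functions called later (`%>%` and `filter()` from dplyr, `ggplot()` and `map_data()` from ggplot2 with the maps package behind it, and `corrplot()` from corrplot), the setup could look like this:

```r
# Packages the code in this report appears to rely on. This list is an
# assumption based on the functions called later, not the original chunk.
pkgs <- c("dplyr", "ggplot2", "maps", "corrplot")

# Check which ones are installed without attaching anything yet.
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
if (any(!installed)) {
  message("Not installed: ", paste(pkgs[!installed], collapse = ", "))
}

# Attach only the packages that are available.
for (p in pkgs[installed]) {
  library(p, character.only = TRUE)
}
```

Checking before attaching means a missing package shows up as one clear message instead of an error halfway through the report.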

Load data

load("~/Downloads/_384b2d9eda4b29131fb681b243a7767d_brfss2013 .RData")

Part 1: Data

I was really concerned about whether this data set is biased by the sampling method. After reading the Overview (accessible on CDC.gov) I got the whole picture of how the survey is made and what to expect. As the data provided by the CDC are observational and respondents are selected at random, the results can be generalized to the population of the U.S., but they cannot be used to establish causation. There are a few variables that I can see as possibly correlated, and these will be part of my exploratory data analysis later. I will not provide the full list of variables, because it would not add any information to this report. The variables used are listed with the research questions.

Part 2: Research questions

Research question 1: First of all I am interested in the overall health condition of the U.S. Which states are the sickest? I will plot three variables to get this picture: genhlth, poorhlth and, of course, X_state.

Research question 2: How is sickness affected by healthcare access? Do sick states have good healthcare access, or is health care too expensive? Do patients go to medical checkups? I will use the hlthpln1, medcost, income2 and checkup1 variables.

Research question 3: I am quite interested in strokes. Is there a higher incidence of strokes among obese, smoking people with high blood cholesterol who drink alcohol, or is the occurrence of strokes in the population more or less random? I will use the cvdstrk3, toldhi2, X_bmi5, smokday2 and alcday5 variables.

All variables used in the exploratory analysis will be X_state, physhlth, menthlth, poorhlth, hlthpln1, medcost, income2, checkup1, cvdstrk3, toldhi2, smokday2, X_bmi5, alcday5, avedrnk2, genhlth and drnk3ge5.

Part 3: Exploratory data analysis

Research question 1: Which state is the sickest and which is the fittest? A heat map could be useful! I will copy just the columns I need.

# map_data() comes from ggplot2 (it needs the maps package) and returns state outlines
all_states <- map_data("state")
health_states <- brfss2013 %>%
    filter(!is.na(genhlth)) %>% # drop rows with a missing genhlth
    select(genhlth, X_state, poorhlth)
health_states_poorhlth <- health_states %>%
    group_by(genhlth, X_state) %>%
    summarise(mean_poorhlth = mean(poorhlth, na.rm = TRUE), n = n()) %>%
    # summarise() leaves the data grouped by genhlth, so pct is each state's
    # share of the respondents within one genhlth level
    mutate(pct = (n / sum(n)) * 100)
health_states_map <- health_states_poorhlth %>%
    mutate(region = tolower(X_state)) # map_data() uses lower-case state names
states_map <- merge(all_states, health_states_map, by = "region")
# plotting the map...
ggplot() +
  geom_polygon(data = states_map, aes(x = long, y = lat, group = group, fill = pct), color = "white") +
  ggtitle("Heat map of poor health conditions in the U.S.") +
  scale_fill_gradient2(low = "blue", mid = "grey", high = "darkred") +
  theme(legend.position = c(1, 0), legend.justification = c(1, 0))
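Here is a minimal sketch of an alternative per-state measure: the share of each state's own respondents who rate their health as "Poor". It is not the calculation used for the map. It runs on a small made-up data frame (`toy` and its values are invented for illustration) and uses base R only:

```r
# Toy data standing in for brfss2013 (values are invented for illustration).
toy <- data.frame(
  X_state = c("Florida", "Florida", "Florida", "Florida", "Ohio", "Ohio"),
  genhlth = c("Poor", "Good", "Good", "Fair", "Poor", "Good"),
  stringsAsFactors = FALSE
)

# Cross-tabulate state by health rating, then convert each row to percentages
# so that every state sums to 100 regardless of how many people it contributed.
counts <- table(toy$X_state, toy$genhlth)
pct_by_state <- prop.table(counts, margin = 1) * 100

# Share of each state's respondents who answered "Poor".
poor_share <- pct_by_state[, "Poor"]
print(poor_share)
```

Normalizing within each state keeps a state with many respondents from looking sicker simply because it was sampled more heavily.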

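The heat map suggests that Florida has a higher number of people with poor health than other states. Note that the fill variable is pct rather than poorhlth itself, and that pct is a share within each genhlth level, so it also reflects how many respondents each state contributed. Still, why Florida?

Research question 2: Is the poor result affected by lower healthcare access? Is health care too expensive? Do patients go to medical checkups?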

To answer this question we have to explore more data. Is it because residents of Florida do not go to medical exams? Is healthcare accessible in Florida? Or is it expensive?

Well, when I look at the histograms of the variables I thought could explain this result, it is clear this is not the case. No, medical costs are not high. No, they are not heavy smokers.

# do they drink daily?
# alcday5: Days In Past 30 Had Alcoholic Beverage
summary(brfss2013 %>% 
          filter(!is.na(alcday5), X_state=="Florida") %>% 
          select(alcday5)
        )
##     alcday5      
##  Min.   :  0.00  
##  1st Qu.:  0.00  
##  Median :  0.00  
##  Mean   : 81.66  
##  3rd Qu.:201.00  
##  Max.   :230.00

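No, they do not drink too often: the median is 0, so at least half of the respondents reported no drinking days. The mean is not meaningful here, because alcday5 is stored as a coded value (note the maximum of 230) rather than a plain count of days.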
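A minimal sketch of how the coded alcday5 values could be turned into days per 30 days before summarizing. It assumes the coding described in the BRFSS codebook (0 for no drinks, 101 to 107 for days per week, 201 to 230 for days in the past 30 days). The helper `alcday5_to_days30()` and the example values are hypothetical:

```r
# Hypothetical helper: convert coded alcday5 values to days per 30 days.
# Assumed coding: 0 = no drinks, 101-107 = days per week,
# 201-230 = days in the past 30 days. Anything else becomes NA.
alcday5_to_days30 <- function(x) {
  out <- rep(NA_real_, length(x))
  none    <- !is.na(x) & x == 0
  weekly  <- !is.na(x) & x >= 101 & x <= 107
  monthly <- !is.na(x) & x >= 201 & x <= 230
  out[none]    <- 0
  out[weekly]  <- (x[weekly] - 100) * 30 / 7   # scale days/week to 30 days
  out[monthly] <- x[monthly] - 200
  out
}

coded  <- c(0, 101, 107, 201, 230, NA)   # made-up example values
days30 <- alcday5_to_days30(coded)
print(round(days30, 1))
```

After this conversion, `summary()` of the decoded values describes drinking days directly, which the raw codes do not.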

# do they have a health plan?
plot(brfss2013 %>% 
       filter(X_state=="Florida") %>% 
       select(hlthpln1), main="Do they have a health plan?")

Yes, they do have a health plan.

# aren't they poor and therefore sick?
plot(brfss2013 %>% 
       filter(X_state=="Florida") %>% 
       select(income2), main="What are the levels of income?"
     )

No, they are not poor.

# do they go to medical exams?
plot(brfss2013 %>%
       filter(X_state=="Florida") %>%
       select(checkup1), main="Do they regularly go to medical exams?")

Yes, they visit the doctor regularly. So what is the problem?

# aren't they just old?
plot(brfss2013 %>%
       filter(X_state=="Florida") %>%
       select(employ1), main="Aren't they just old?")

Well, there is a large group of retirees, which means there are a lot of older people, who are (I suppose) sick more often than younger ones.

# what about joint pain?
summary(brfss2013 %>% 
          filter(!is.na(joinpain), X_state=="Florida") %>% 
          select(joinpain)
        )
##     joinpain     
##  Min.   : 0.000  
##  1st Qu.: 3.000  
##  Median : 5.000  
##  Mean   : 5.021  
##  3rd Qu.: 7.000  
##  Max.   :10.000

Not a lot of joint pain: the median is 5, and the values here run from 0 to 10.

# and what about cholesterol?
plot(brfss2013 %>% 
       filter(X_state=="Florida") %>% 
       select(toldhi2), main="Blood cholesterol"
     )

A lot of respondents report high blood cholesterol!

# Do retired people have a bigger issue with blood cholesterol?
plot(brfss2013 %>% 
       filter(X_state=="Florida", employ1=="Retired") %>% 
       select(toldhi2), main="Blood cholesterol in retired group"
     )

In summary: there are a lot of elderly people with blood cholesterol problems. That could be the reason for the poor health conditions in Florida.

Research question 3: Does high blood cholesterol result in more strokes? Are strokes correlated with obesity, smoking, drinking and high blood cholesterol?

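# keep Florida respondents with complete answers for the four variables
heart <- brfss2013 %>%
  filter(!is.na(toldhi2), !is.na(cvdstrk3), !is.na(X_bmi5), !is.na(maxdrnks), X_state=="Florida") %>%
  select(toldhi2, cvdstrk3, maxdrnks, X_bmi5)

# recode the Yes/No factors as 1/0 so that cor() can use them
heart$toldhi2 <- ifelse(heart$toldhi2=="Yes",1,0)
heart$cvdstrk3 <- ifelse(heart$cvdstrk3=="Yes",1,0)

# correlation matrix ("korelace" is Czech for correlation), drawn with corrplot
korelace <- cor(heart)
corrplot(korelace, method="ellipse")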

It looks like there is a weak correlation between high blood cholesterol and diagnosed stroke. There also seems to be a weak correlation between high blood cholesterol and high BMI (obesity), but no correlation between obesity and diagnosed stroke. Even more! There is a weak negative correlation between max drinks in the past 30 days and high blood cholesterol, but only a very weak correlation between max drinks and obesity.
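To make the recode-then-correlate step concrete, here is a minimal sketch on a small made-up data frame (`toy`, its `bmi` column and all values are invented for illustration). Once Yes/No answers are coded as 1/0, Pearson's `cor()` gives the phi coefficient for two binary variables and the point-biserial correlation for a binary and a numeric variable:

```r
# Made-up data: two Yes/No answers and one numeric measure.
toy <- data.frame(
  toldhi2  = c("Yes", "Yes", "No", "No", "Yes", "No"),
  cvdstrk3 = c("Yes", "No",  "No", "No", "Yes", "No"),
  bmi      = c(31, 29, 24, 22, 33, 26),
  stringsAsFactors = FALSE
)

# Recode Yes/No as 1/0 so cor() can treat the answers as numbers.
toy$toldhi2  <- ifelse(toy$toldhi2 == "Yes", 1, 0)
toy$cvdstrk3 <- ifelse(toy$cvdstrk3 == "Yes", 1, 0)

# Pearson correlation matrix of all three columns.
r <- cor(toy)
print(round(r, 2))
```

With only a handful of rows the numbers mean nothing in themselves. The point is the shape of the result: a symmetric matrix with ones on the diagonal, which is what `corrplot()` draws.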

And what about other questions?

How is sickness affected by healthcare access? Does Florida have good healthcare access, or is health care too expensive? Do patients go to medical checkups?

sickness <- brfss2013 %>% 
  filter(!is.na(hlthpln1), !is.na(medcost), !is.na(income2), !is.na(checkup1), !is.na(medicare), !is.na(poorhlth)) %>%
  select(hlthpln1, medcost, income2, checkup1, medicare, poorhlth)

plot(sickness$income2, sickness$hlthpln1,  xlab="income", ylab="healthplan")

It is obvious that the probability of having a health plan is higher for those in higher income levels.
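The same relationship can be read as numbers rather than from a plot. Below is a minimal sketch on made-up data (the `toy` data frame, its two income levels and its values are invented for illustration) that computes the share of respondents with a health plan within each income level:

```r
# Made-up data: an income level and a Yes/No health plan answer.
toy <- data.frame(
  income2  = factor(c("Low", "Low", "Low", "High", "High", "High", "High"),
                    levels = c("Low", "High")),
  hlthpln1 = factor(c("No", "No", "Yes", "Yes", "Yes", "Yes", "No"),
                    levels = c("Yes", "No"))
)

# Count respondents by income level and plan status, then turn each row
# (income level) into proportions that sum to 1.
tab   <- table(toy$income2, toy$hlthpln1)
share <- prop.table(tab, margin = 1)
print(round(share, 2))
```

Row-wise proportions make income levels of different sizes directly comparable. The plot shows this through the heights within each bar.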

plot(sickness$income2, sickness$medcost,  xlab="income", ylab="medcost")

Unsurprisingly, the lower the income, the stronger the feeling that medical costs are too high.

plot(sickness$income2, sickness$medicare,  xlab="income", ylab="medicare")

And, unsurprisingly, Medicare is more common at lower income levels.

I would like to say whether there is any causality in this data set, but BRFSS is only an observational survey: it can indicate a few associations worth deeper research but, alas, it does not support causal conclusions. BRFSS is sufficient for monitoring population health, not for far-reaching conclusions about the population's healthcare issues.