Exploring the BRFSS data

Setup

Load packages

library(ggplot2)
library(dplyr)
library(purrr)
library(reshape2)
library(stringr)

Load data

load("brfss2013.RData")

Part 1: Data

The data set contains 491,775 individuals, nearly half a million. They come from all 50 states plus additional areas such as the District of Columbia, Puerto Rico, Guam, and the US Virgin Islands. The data contain 330 variables related to preventive health and risk factors. Data are collected through telephone surveys (both landline and mobile phones).

Although the participants in the study are randomly selected, I would question several aspects of the technique: how non-English speakers reached on the phone were handled, and how people who do not answer their phone (for whatever reason, including preference or disability) or who do not wish to speak to phone solicitors are reached. Additionally, people without access to a phone are left out entirely. Some groups may be oversampled because they have both a landline and a mobile phone. Nonetheless, the data collection seems robust, and I believe we can reasonably generalize to the non-institutionalized adult population of the USA. The data collection process, however, is retrospective and observational: it does not involve random assignment, as an experimental design would, so the data cannot establish causal relationships.

Part 2: Research questions

Research question 1: How are asthma and mental health related (as measured by how much of the last 30 days the respondent felt nervous, hopeless, restless, depressed, or worthless, or felt that everything was an effort)?

One theory regarding the social stigma of disease holds that people with a physical illness may experience more mental health symptoms because of the stigma attached to the disease. Additionally, these are two issues I am personally interested in. If there is a relationship, it may point toward developing mental health interventions for people with asthma, or toward health care providers screening for mental health issues in people with asthma. Such a relationship could also support further research into the link between mental health and asthma.

Research question 2: Do smokers who exercise try to quit smoking more often than smokers who do not exercise?

Here, I am thinking about smoking as a generally unhealthy activity with multiple negative effects on health: higher incidence of cancer, heart disease, and lung disease, among others. Regular exercise has positive effects on health: improved mood, improved cardiovascular health, and prevention of metabolic syndrome, among others. For a smoker, quitting can reverse some of the negative effects of smoking, and those who exercise, and so are already taking positive steps for their health, may be more open to quitting. Exercise may also help with quitting itself. If we find that exercisers try to quit more frequently, further research could examine the relationship and test whether exercise can assist the effort to quit smoking.

Research question 3: How are education level and overall health status related?

A higher education level may indicate more awareness of what it takes to maintain good health. While other variables not measured or tested here are probably at play (such as poverty, class, race, age/generation, and geography), we want to see whether education level has any relation to health status. The results could be used to drive further research into whether more public health work should target people at particular education levels.

Part 3: Exploratory data analysis

Research question 1:

##create a smaller data frame
q1.select<-c('asthnow', 'misnervs', 'mishopls',  'misrstls', 'misdeprd', 'miseffrt', 'miswtles')
q1data<-brfss2013[,q1.select]
##find only complete cases and summarise data
q1data<-q1data[complete.cases(q1data),]
summary(q1data)

##  asthnow        misnervs        mishopls        misrstls   
##  Yes:3222   All     : 166   All     :  74   All     : 251  
##  No :1366   Most    : 286   Most    : 142   Most    : 282  
##             Some    : 863   Some    : 401   Some    : 892  
##             A little:1390   A little: 610   A little:1083  
##             None    :1883   None    :3361   None    :2080  
##      misdeprd        miseffrt        miswtles   
##  All     :  72   All     : 283   All     :  95  
##  Most    : 121   Most    : 262   Most    : 115  
##  Some    : 311   Some    : 680   Some    : 271  
##  A little: 436   A little: 748   A little: 360  
##  None    :3648   None    :2615   None    :3747

#convert scores of mis* variables to numeric - and inverting (5=all, 1= none)
q1data.num<-q1data
q1data.num$misnervs<-6-as.numeric(q1data.num$misnervs)
q1data.num$misdeprd<-6-as.numeric(q1data.num$misdeprd)
q1data.num$miseffrt<-6-as.numeric(q1data.num$miseffrt)
q1data.num$mishopls<-6-as.numeric(q1data.num$mishopls)
q1data.num$misrstls<-6-as.numeric(q1data.num$misrstls)
q1data.num$miswtles<-6-as.numeric(q1data.num$miswtles)
summary(q1data.num)

##  asthnow       misnervs        mishopls        misrstls    
##  Yes:3222   Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  No :1366   1st Qu.:1.000   1st Qu.:1.000   1st Qu.:1.000  
##             Median :2.000   Median :1.000   Median :2.000  
##             Mean   :2.011   Mean   :1.465   Mean   :2.028  
##             3rd Qu.:3.000   3rd Qu.:2.000   3rd Qu.:3.000  
##             Max.   :5.000   Max.   :5.000   Max.   :5.000  
##     misdeprd        miseffrt        miswtles    
##  Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:1.000   1st Qu.:1.000   1st Qu.:1.000  
##  Median :1.000   Median :1.000   Median :1.000  
##  Mean   :1.372   Mean   :1.878   Mean   :1.355  
##  3rd Qu.:1.000   3rd Qu.:3.000   3rd Qu.:1.000  
##  Max.   :5.000   Max.   :5.000   Max.   :5.000

#split data and display summary grouped by asthnow
q1data.num %>% split(.$asthnow) %>% map(~summary(.))

## $Yes
##  asthnow       misnervs        mishopls        misrstls       misdeprd    
##  Yes:3222   Min.   :1.000   Min.   :1.000   Min.   :1.00   Min.   :1.000  
##  No :   0   1st Qu.:1.000   1st Qu.:1.000   1st Qu.:1.00   1st Qu.:1.000  
##             Median :2.000   Median :1.000   Median :2.00   Median :1.000  
##             Mean   :2.049   Mean   :1.496   Mean   :2.05   Mean   :1.401  
##             3rd Qu.:3.000   3rd Qu.:2.000   3rd Qu.:3.00   3rd Qu.:1.000  
##             Max.   :5.000   Max.   :5.000   Max.   :5.00   Max.   :5.000  
##     miseffrt        miswtles    
##  Min.   :1.000   Min.   :1.000  
##  1st Qu.:1.000   1st Qu.:1.000  
##  Median :1.000   Median :1.000  
##  Mean   :1.923   Mean   :1.393  
##  3rd Qu.:3.000   3rd Qu.:1.000  
##  Max.   :5.000   Max.   :5.000  
## 
## $No
##  asthnow       misnervs        mishopls        misrstls    
##  Yes:   0   Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  No :1366   1st Qu.:1.000   1st Qu.:1.000   1st Qu.:1.000  
##             Median :2.000   Median :1.000   Median :2.000  
##             Mean   :1.921   Mean   :1.392   Mean   :1.976  
##             3rd Qu.:2.000   3rd Qu.:1.000   3rd Qu.:3.000  
##             Max.   :5.000   Max.   :5.000   Max.   :5.000  
##     misdeprd        miseffrt       miswtles    
##  Min.   :1.000   Min.   :1.00   Min.   :1.000  
##  1st Qu.:1.000   1st Qu.:1.00   1st Qu.:1.000  
##  Median :1.000   Median :1.00   Median :1.000  
##  Mean   :1.305   Mean   :1.77   Mean   :1.264  
##  3rd Qu.:1.000   3rd Qu.:2.00   3rd Qu.:1.000  
##  Max.   :5.000   Max.   :5.00   Max.   :5.000

#create means table for graphing
q1means<-aggregate(q1data.num[,2:7],by=list(q1data.num$asthnow),mean)
print(q1means)

##   Group.1 misnervs mishopls misrstls misdeprd miseffrt miswtles
## 1     Yes 2.049038 1.496276 2.050279 1.401304 1.923029 1.392924
## 2      No 1.920937 1.391654 1.975842 1.304539 1.770132 1.264275

q1means.long<-melt(q1means,id.vars = 'Group.1')
ggplot(q1means.long,aes(x=variable,y=value,fill=factor(Group.1)))+geom_bar(stat="identity",position="dodge")+
   scale_fill_discrete(name="Has Asthma",
                      labels=c("Yes", "No"))+
  xlab("Variable")+ylab("Mean Score")+
  labs(title="Mean Score for each of 6 mental health variables\ngrouped by whether the individual has asthma")

These data are interesting. First, the data were subset to create a smaller, more manageable data frame containing seven variables:
1. asthnow: a yes/no factor indicating whether the individual has asthma now
2. misnervs: a five-level scale (All, Most, Some, A little, None) indicating how much of the last 30 days the individual felt nervous
3. mishopls: same five levels, for feeling hopeless (rest of question like #2)
4. misrstls: same five levels, for feeling restless (rest of question like #2)
5. misdeprd: same five levels, for feeling depressed (rest of question like #2)
6. miseffrt: same five levels, for feeling that everything was an effort (rest of question like #2)
7. miswtles: same five levels, for feeling worthless (rest of question like #2)

Then only the observations that were complete for all seven variables were kept, which brought the group down to 4,588 respondents. A summary was given.
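The complete-case filter can be illustrated on a toy data frame. This is a minimal sketch: `toy` and its values are hypothetical stand-ins for `q1data`, and only `complete.cases()` is taken from the analysis itself.

```r
# Hypothetical toy data standing in for q1data (values are made up)
toy <- data.frame(
  asthnow  = factor(c("Yes", "No", NA, "Yes")),
  misnervs = factor(c("None", NA, "Some", "All"))
)

# complete.cases() returns TRUE for rows with no NA in any column
keep <- complete.cases(toy)
toy.complete <- toy[keep, ]

# Compare row counts before and after to see how many cases were dropped
c(before = nrow(toy), after = nrow(toy.complete))
```

Making the same before/after comparison on the real data is worthwhile here, because the filter reduces 491,775 respondents to 4,588, and any conclusions apply only to that much smaller subgroup.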

The levels were treated as a Likert-type scale and assigned scores, so that a score of 5 indicates “All” and a score of 1 indicates “None”. Higher scores therefore mean the respondent felt the indicated way for more of the past 30 days. The data were then split into two groups by asthma status and summarized separately. Finally, a table of means was created and plotted with ggplot.
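The score inversion can be checked on a small example. This sketch assumes the factor levels are stored in the order shown in the summary output (All, Most, Some, A little, None); the vector `x` is hypothetical.

```r
# Level order assumed to match the summary output (All first, None last)
lv <- c("All", "Most", "Some", "A little", "None")

# Hypothetical responses, one per level
x <- factor(c("None", "A little", "Some", "Most", "All"), levels = lv)

# as.numeric() on a factor returns the level index (All = 1, ..., None = 5),
# so subtracting from 6 reverses the scale (All = 5, ..., None = 1)
score <- 6 - as.numeric(x)
data.frame(response = x, score = score)
```

If the level order in the real data differed from this assumption, the scores would be wrong without any error being raised, so it may be worth confirming with `levels(q1data$misnervs)` before converting. The six repeated assignments could also be written as one call, for example `q1data.num[2:7] <- lapply(q1data[2:7], function(v) 6 - as.numeric(v))`.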

Here we treat “has asthma” as the independent variable and the six mental health variables as dependent variables. The mean is higher for the group that currently has asthma on all six mental health variables, and the graph shows the same pattern. Although no statistical test was run, the consistent direction of the differences suggests that the mental health scores may depend on asthma status. If they were independent of asthma status, we would expect the mean of each mental health variable to be about the same in both groups.
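To see how large the gaps are, the group means can be subtracted directly. The sketch below re-enters the values printed in the `q1means` output rather than loading the data; `q1diff` is a name introduced here for illustration.

```r
# Group means copied from the q1means output above
q1means <- data.frame(
  Group.1  = c("Yes", "No"),
  misnervs = c(2.049038, 1.920937),
  mishopls = c(1.496276, 1.391654),
  misrstls = c(2.050279, 1.975842),
  misdeprd = c(1.401304, 1.304539),
  miseffrt = c(1.923029, 1.770132),
  miswtles = c(1.392924, 1.264275)
)

# Difference in mean score: current-asthma group minus no-current-asthma group
q1diff <- unlist(q1means[1, -1]) - unlist(q1means[2, -1])
round(q1diff, 3)
```

Every difference is positive, but all are small relative to the 1 to 5 scale (the largest is about 0.15, for `miseffrt`), so a formal test would be needed before reading much into them. Because the scores are ordinal, a rank-based test such as `wilcox.test(misnervs ~ asthnow, data = q1data.num)` is one possible choice; it is a suggestion here, not something the analysis ran.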

Research question 2:

q2.select<-c('X_smoker3', 'stopsmk2', 'X_totinda' )
q2data<-brfss2013[,q2.select]
#subset to just include those who smoke every day
q2data<-subset(q2data, X_smoker3=="Current smoker - now smokes every day")
q2data<-q2data[,2:3]
summary(q2data)

##  stopsmk2                                                X_totinda    
##  Yes :27428   Had physical activity or exercise               :31229  
##  No  :27603   No physical activity or exercise in last 30 days:21000  
##  NA's:  131   NA's                                            : 2933

q2data<-q2data[complete.cases(q2data),]
summary(q2data)

##  stopsmk2                                               X_totinda    
##  Yes:26041   Had physical activity or exercise               :31177  
##  No :26098   No physical activity or exercise in last 30 days:20962

q2table<-table(q2data)
q2table

##         X_totinda
## stopsmk2 Had physical activity or exercise
##      Yes                             16609
##      No                              14568
##         X_totinda
## stopsmk2 No physical activity or exercise in last 30 days
##      Yes                                             9432
##      No                                             11530

q2table.prop<-prop.table(q2table,2)
q2table.prop

##         X_totinda
## stopsmk2 Had physical activity or exercise
##      Yes                         0.5327325
##      No                          0.4672675
##         X_totinda
## stopsmk2 No physical activity or exercise in last 30 days
##      Yes                                        0.4499571
##      No                                         0.5500429

q2table.df<-as.data.frame(q2table)

ggplot(q2table.df,aes(x=X_totinda,y=Freq,fill=factor(stopsmk2)))+geom_bar(stat="identity",position="dodge")+
  scale_fill_discrete(name="tried to quit smoking\nin past year", labels=c("yes", "no"))+
  ylab("Frequency")+
  labs(title="number of daily smokers who tried to quit smoking by exercise status")+
  scale_x_discrete(labels = function(x) str_wrap(x, width = 15))+
  xlab("Exercise status")

In research question two, we examine the relationship between physical activity and trying to quit smoking among daily smokers. First, we subset the relevant variables from the larger data set. The three variables used are “X_smoker3”, a variable calculated by the survey administrators that indicates whether someone is a current daily smoker (among other levels); “stopsmk2”, which indicates whether someone tried to quit smoking in the past year; and “X_totinda”, which indicates whether someone had physical activity (exercise) in the last 30 days.

Then, we subset the data frame so that only individuals who smoke every day are included. Next, we drop “X_smoker3” from the data frame to simplify the analysis, since every remaining row is a daily smoker. The summary shows NAs in both remaining variables; these won’t help our analysis, so we remove all incomplete cases (i.e. those with NAs).

The next step builds a cross-tabulation of “stopsmk2” and “X_totinda”. We also build a table of column proportions, giving the share who tried to stop smoking within each exercise group. We find that those who had physical activity did try to stop smoking at a higher rate (about 53.3%) than those who did not exercise (about 45.0%). This is not evidence of causation, but it does suggest a possible association between the two. A third variable could be responsible for both behaviors, but we cannot tell from this analysis.
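One way to check whether the gap between the two groups is larger than chance would produce is a chi-squared test of independence. The report does not run one; the sketch below is an illustration that re-enters the counts from the `q2table` output, shortens the exercise labels for readability, and ignores any survey weights, so the result should be read as indicative only. The names `q2counts`, `q2prop`, and `q2test` are introduced here.

```r
# Counts copied from the q2table output above (matrix fills column by column)
q2counts <- matrix(
  c(16609, 14568,   # had exercise:  tried to quit Yes, No
     9432, 11530),  # no exercise:   tried to quit Yes, No
  nrow = 2,
  dimnames = list(stopsmk2  = c("Yes", "No"),
                  X_totinda = c("Exercise", "No exercise"))
)

# Column proportions: share who tried to quit within each exercise group
q2prop <- prop.table(q2counts, 2)
q2prop

# Difference in quit-attempt rates between the two groups
q2prop["Yes", "Exercise"] - q2prop["Yes", "No exercise"]

# Pearson chi-squared test of independence (unweighted)
q2test <- chisq.test(q2counts)
q2test
```

A small p-value here would support an association between exercise and quit attempts among daily smokers, but it says nothing about direction or about confounders such as age or income.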

Next, a plot was built. It shows counts rather than proportions: among those who had physical activity, more had tried to quit smoking than had not (16,609 vs. 14,568), while among those without exercise, fewer had tried to quit than had not (9,432 vs. 11,530).
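The plotting code above passes raw counts (`Freq` from `q2table.df`) to ggplot, so the two exercise groups are drawn at different overall heights. To compare rates directly, the same chart can be drawn from column proportions. This is a minimal sketch that re-enters the counts from the `q2table` output and uses shortened exercise labels; `q2counts`, `q2prop.df`, and `p` are names introduced here.

```r
library(ggplot2)

# Counts copied from the q2table output above, stored as a table object
q2counts <- as.table(matrix(
  c(16609, 14568, 9432, 11530),
  nrow = 2,
  dimnames = list(stopsmk2  = c("Yes", "No"),
                  X_totinda = c("Exercise", "No exercise"))
))

# Column proportions in long format (columns: stopsmk2, X_totinda, Freq)
q2prop.df <- as.data.frame(prop.table(q2counts, 2))

# Same dodged bar chart as before, but with proportions on the y axis
p <- ggplot(q2prop.df, aes(x = X_totinda, y = Freq, fill = stopsmk2)) +
  geom_bar(stat = "identity", position = "dodge") +
  scale_fill_discrete(name = "Tried to quit smoking\nin past year") +
  xlab("Exercise status") +
  ylab("Proportion within exercise group") +
  labs(title = "Share of daily smokers who tried to quit, by exercise status")
print(p)
```

With proportions, the bars within each exercise group sum to 1, which makes the difference in quit-attempt rates visible without the reader having to adjust for group size.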

Research question 3:

q3.select<-c("X_educag", "genhlth")
q3data<-brfss2013[,q3.select]
summary(q3data)

##                                        X_educag           genhlth      
##  Did not graduate high school              : 42213   Excellent: 85482  
##  Graduated high school                     :142968   Very good:159076  
##  Attended college or technical school      :134196   Good     :150555  
##  Graduated from college or technical school:170118   Fair     : 66726  
##  NA's                                      :  2280   Poor     : 27951  
##                                                      NA's     :  1985

q3data<-q3data[complete.cases(q3data),]
summary(q3data)

##                                        X_educag           genhlth      
##  Did not graduate high school              : 41787   Excellent: 85046  
##  Graduated high school                     :142367   Very good:158500  
##  Attended college or technical school      :133739   Good     :149835  
##  Graduated from college or technical school:169665   Fair     : 66389  
##                                                      Poor     : 27788

q3table<-table(q3data)
q3table

##                                             genhlth
## X_educag                                     Excellent Very good  Good
##   Did not graduate high school                    3237      6241 13473
##   Graduated high school                          17551     39816 50212
##   Attended college or technical school           21347     44861 42828
##   Graduated from college or technical school     42911     67582 43322
##                                             genhlth
## X_educag                                      Fair  Poor
##   Did not graduate high school               12137  6699
##   Graduated high school                      24613 10175
##   Attended college or technical school       17642  7061
##   Graduated from college or technical school 11997  3853

q3table.prop<-prop.table(q3table,1)
q3table.prop

##                                             genhlth
## X_educag                                      Excellent  Very good
##   Did not graduate high school               0.07746428 0.14935267
##   Graduated high school                      0.12327997 0.27967155
##   Attended college or technical school       0.15961687 0.33543693
##   Graduated from college or technical school 0.25291604 0.39832611
##                                             genhlth
## X_educag                                           Good       Fair
##   Did not graduate high school               0.32242085 0.29044918
##   Graduated high school                      0.35269409 0.17288417
##   Attended college or technical school       0.32023568 0.13191365
##   Graduated from college or technical school 0.25533846 0.07070993
##                                             genhlth
## X_educag                                           Poor
##   Did not graduate high school               0.16031302
##   Graduated high school                      0.07147021
##   Attended college or technical school       0.05279687
##   Graduated from college or technical school 0.02270946

q3table.df<-as.data.frame(q3table)
#q3df.prop<-as.data.frame(round(q3table.prop,2))
q3df.prop<-as.data.frame(q3table.prop)
q3df.prop

##                                      X_educag   genhlth       Freq
## 1                Did not graduate high school Excellent 0.07746428
## 2                       Graduated high school Excellent 0.12327997
## 3        Attended college or technical school Excellent 0.15961687
## 4  Graduated from college or technical school Excellent 0.25291604
## 5                Did not graduate high school Very good 0.14935267
## 6                       Graduated high school Very good 0.27967155
## 7        Attended college or technical school Very good 0.33543693
## 8  Graduated from college or technical school Very good 0.39832611
## 9                Did not graduate high school      Good 0.32242085
## 10                      Graduated high school      Good 0.35269409
## 11       Attended college or technical school      Good 0.32023568
## 12 Graduated from college or technical school      Good 0.25533846
## 13               Did not graduate high school      Fair 0.29044918
## 14                      Graduated high school      Fair 0.17288417
## 15       Attended college or technical school      Fair 0.13191365
## 16 Graduated from college or technical school      Fair 0.07070993
## 17               Did not graduate high school      Poor 0.16031302
## 18                      Graduated high school      Poor 0.07147021
## 19       Attended college or technical school      Poor 0.05279687
## 20 Graduated from college or technical school      Poor 0.02270946

ggplot(q3df.prop,aes(x=X_educag,y=Freq,fill=factor(genhlth)))+geom_bar(stat="identity",position="dodge")+
  scale_fill_discrete(name="general health")+
  ylab("Proportion")+
  labs(title="General health by education level")+
  scale_x_discrete(labels = function(x) str_wrap(x, width = 15))+
  xlab("education level")

To answer question 3, we first subset the data to create a data frame with only the two variables of interest: “X_educag”, a variable grouped by the survey administrators that indicates education level, and “genhlth”, the self-reported general health of the respondent. We then show a summary table of this data frame.

Next, we eliminate those cases with NAs - as a case with one or both NA responses would not be helpful for our analysis. Again, we show a summary table.

Then, we cross-tabulate education level and general health. The raw counts are hard to compare because the groups differ so much in size. To make the analysis easier, we calculate proportions by education level, so that, for example, all those who graduated high school are broken into the proportions reporting excellent, very good, good, fair, and poor health. With the proportions, we can see that the share reporting excellent health rises with each education level, from about 7.7% among those who did not graduate high school to about 25.3% among those who graduated college. Similarly, the share reporting poor health falls as education level rises, from about 16.0% to about 2.3%. It appears that higher education level goes together with better self-reported health. This is not causal, as noted when the question was posed, but it is an interesting correlation.
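The trend is easier to read when the five health categories are collapsed. The sketch below re-enters the counts from the `q3table` output, recomputes the row proportions, and adds a combined fair-or-poor share; the shortened row labels and the names `q3counts`, `q3prop`, and `q3summary` are introduced here for illustration.

```r
# Counts copied from the q3table output above, one row per education level
q3counts <- matrix(
  c( 3237,  6241, 13473, 12137,  6699,
    17551, 39816, 50212, 24613, 10175,
    21347, 44861, 42828, 17642,  7061,
    42911, 67582, 43322, 11997,  3853),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    X_educag = c("No HS diploma", "HS graduate", "Some college", "College graduate"),
    genhlth  = c("Excellent", "Very good", "Good", "Fair", "Poor")
  )
)

# Row proportions: health distribution within each education level
q3prop <- prop.table(q3counts, 1)

# Collapse to two summary columns to make the trend easier to read
q3summary <- cbind(
  excellent    = q3prop[, "Excellent"],
  fair_or_poor = q3prop[, "Fair"] + q3prop[, "Poor"]
)
round(q3summary, 3)
```

Under these counts, the share reporting fair or poor health falls from roughly 45% among those who did not graduate high school to roughly 9% among college graduates, while the share reporting excellent health rises at every step.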

The plot created for this question also shows this relationship - compare, for example, the level of excellent health across the educational levels, and the level of poor health across educational levels. Further research can be done to examine this relationship and explore whether there is a clinical or public health significance.