library(ggplot2)
library(dplyr)
library(purrr)
library(reshape2)
library(stringr)
load("brfss2013.RData")
Nearly half a million individuals (491,775) are in the data set. They come from all 50 states plus additional areas, such as the District of Columbia, Puerto Rico, Guam, and the US Virgin Islands. The data contain 330 variables related to preventative health and risk factors. Data are collected through telephone surveys (both landline and mobile phones).
Although the participants in the study are randomly selected, I would question how certain cases were handled: non-English speakers reached on the phone, and people who do not answer their phone (for whatever reason, including preference or disability) or do not wish to speak to phone solicitors. Additionally, people without access to phones are left out of the sample. Some groups may be oversampled because they have both a landline and a mobile phone. Nonetheless, the data collection seems robust, and I believe we can reasonably generalize to the non-institutionalized adult population of the USA. The data collection, however, is retrospective and observational: it does not involve random assignment, as an experimental design would, so it cannot establish causal relationships.
Research question 1: How are asthma and mental health related (as measured through days nervous, hopeless, restless, depressed, and worthless in addition to days that everything felt like it was an effort in the last 30 days)?
One theory regarding the social stigma of disease holds that people with a physical illness may experience more mental health symptoms because of the stigma attached to the disease. Additionally, these are two issues I am personally interested in. If there is a relationship, it may indicate a need for mental health interventions for those with asthma, or for health care providers to screen for mental health issues in those with asthma. Such a relationship could also support further research into how mental health and asthma are related.
Research question 2: Do smokers who exercise try to quit smoking more often than smokers who do not exercise?
Here, I am thinking about smoking as a generally unhealthy activity with multiple effects that harm health: higher incidence of cancer, heart disease, and lung disease, among others. Regular exercise has positive effects on health: improved mood, improved cardiovascular health, and prevention of metabolic syndrome, among others. For a smoker, quitting can remediate some of the negative effects of smoking, and perhaps those who exercise, already taking positive steps for their health, are more open to quitting. If we find that exercisers try to quit smoking more frequently, then further research could examine the relationship and whether exercise can assist with the effort to quit smoking.
Research question 3: How are education level and overall health status related?
A higher education level may indicate more awareness of having good health. While there are probably other variables not measured or tested here at play (such as poverty, class, race, age/generation, geography), we want to see if levels of education have any relation to health status. The results of this question could be used to drive further research into whether more public health work should be done with those with particular education levels.
Research question 1:
##create a smaller data frame
q1.select<-c('asthnow', 'misnervs', 'mishopls', 'misrstls', 'misdeprd', 'miseffrt', 'miswtles')
q1data<-brfss2013[,q1.select]
##find only complete cases and summarise data
q1data<-q1data[complete.cases(q1data),]
summary(q1data)
## asthnow misnervs mishopls misrstls
## Yes:3222 All : 166 All : 74 All : 251
## No :1366 Most : 286 Most : 142 Most : 282
## Some : 863 Some : 401 Some : 892
## A little:1390 A little: 610 A little:1083
## None :1883 None :3361 None :2080
## misdeprd miseffrt miswtles
## All : 72 All : 283 All : 95
## Most : 121 Most : 262 Most : 115
## Some : 311 Some : 680 Some : 271
## A little: 436 A little: 748 A little: 360
## None :3648 None :2615 None :3747
#convert scores of mis* variables to numeric - and inverting (5=all, 1= none)
q1data.num<-q1data
q1data.num$misnervs<-6-as.numeric(q1data.num$misnervs)
q1data.num$misdeprd<-6-as.numeric(q1data.num$misdeprd)
q1data.num$miseffrt<-6-as.numeric(q1data.num$miseffrt)
q1data.num$mishopls<-6-as.numeric(q1data.num$mishopls)
q1data.num$misrstls<-6-as.numeric(q1data.num$misrstls)
q1data.num$miswtles<-6-as.numeric(q1data.num$miswtles)
summary(q1data.num)
## asthnow misnervs mishopls misrstls
## Yes:3222 Min. :1.000 Min. :1.000 Min. :1.000
## No :1366 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:1.000
## Median :2.000 Median :1.000 Median :2.000
## Mean :2.011 Mean :1.465 Mean :2.028
## 3rd Qu.:3.000 3rd Qu.:2.000 3rd Qu.:3.000
## Max. :5.000 Max. :5.000 Max. :5.000
## misdeprd miseffrt miswtles
## Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:1.000
## Median :1.000 Median :1.000 Median :1.000
## Mean :1.372 Mean :1.878 Mean :1.355
## 3rd Qu.:1.000 3rd Qu.:3.000 3rd Qu.:1.000
## Max. :5.000 Max. :5.000 Max. :5.000
#split data and display summary grouped by asthnow
q1data.num %>% split(.$asthnow) %>% map(~summary(.))
## $Yes
## asthnow misnervs mishopls misrstls misdeprd
## Yes:3222 Min. :1.000 Min. :1.000 Min. :1.00 Min. :1.000
## No : 0 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:1.00 1st Qu.:1.000
## Median :2.000 Median :1.000 Median :2.00 Median :1.000
## Mean :2.049 Mean :1.496 Mean :2.05 Mean :1.401
## 3rd Qu.:3.000 3rd Qu.:2.000 3rd Qu.:3.00 3rd Qu.:1.000
## Max. :5.000 Max. :5.000 Max. :5.00 Max. :5.000
## miseffrt miswtles
## Min. :1.000 Min. :1.000
## 1st Qu.:1.000 1st Qu.:1.000
## Median :1.000 Median :1.000
## Mean :1.923 Mean :1.393
## 3rd Qu.:3.000 3rd Qu.:1.000
## Max. :5.000 Max. :5.000
##
## $No
## asthnow misnervs mishopls misrstls
## Yes: 0 Min. :1.000 Min. :1.000 Min. :1.000
## No :1366 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:1.000
## Median :2.000 Median :1.000 Median :2.000
## Mean :1.921 Mean :1.392 Mean :1.976
## 3rd Qu.:2.000 3rd Qu.:1.000 3rd Qu.:3.000
## Max. :5.000 Max. :5.000 Max. :5.000
## misdeprd miseffrt miswtles
## Min. :1.000 Min. :1.00 Min. :1.000
## 1st Qu.:1.000 1st Qu.:1.00 1st Qu.:1.000
## Median :1.000 Median :1.00 Median :1.000
## Mean :1.305 Mean :1.77 Mean :1.264
## 3rd Qu.:1.000 3rd Qu.:2.00 3rd Qu.:1.000
## Max. :5.000 Max. :5.00 Max. :5.000
#create means table for graphing
q1means<-aggregate(q1data.num[,2:7],by=list(q1data.num$asthnow),mean)
print(q1means)
## Group.1 misnervs mishopls misrstls misdeprd miseffrt miswtles
## 1 Yes 2.049038 1.496276 2.050279 1.401304 1.923029 1.392924
## 2 No 1.920937 1.391654 1.975842 1.304539 1.770132 1.264275
q1means.long<-melt(q1means,id.vars = 'Group.1')
ggplot(q1means.long,aes(x=variable,y=value,fill=factor(Group.1)))+geom_bar(stat="identity",position="dodge")+
scale_fill_discrete(name="Has Asthma",
labels=c("Yes", "No"))+
xlab("Variable")+ylab("Mean Score")+
labs(title="Mean Score for each of 6 mental health variables\ngrouped by whether the individual has asthma")
These data are interesting. First, the data were subset to create a smaller, more manageable data frame with seven variables:
1. asthnow: a yes/no factor indicating whether the individual has asthma now
2. misnervs: a five-level scale (All, Most, Some, A little, None) indicating how much of the time the individual felt nervous in the last 30 days
3. mishopls: same five levels - regarding feeling hopeless (rest of question like #2)
4. misrstls: same five levels - feeling restless (rest of question like #2)
5. misdeprd: same five levels - feeling depressed (rest of question like #2)
6. miseffrt: same five levels - everything feels like an effort (rest of question like #2)
7. miswtles: same five levels - feeling worthless (rest of question like #2)
Then only the observations that were complete for all seven variables were kept, which brought the group down to 4,588 respondents. A summary was given.
The levels were treated like a Likert-type scale and scored so that 5 = “All” and 1 = “None”; higher scores therefore mean feeling the indicated way more of the time in the past 30 days. The data were then split into two groups by asthma status, and a summary was given for each. Finally, a means table was created and ggplot was used to graph it.
Here we treat “has asthma” as the independent variable and the six mental health variables as dependent variables. The mean is higher for the group that has asthma on all 6 mental health variables, and the graph shows the same pattern, although the differences are small (for example, 2.05 versus 1.92 for misnervs). No statistical test was run, but the pattern suggests that each of the 6 mental health variables may depend on asthma status. If they were independent of asthma status, we would expect the means to be about the same whether or not the individual has asthma.
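The scoring step can be illustrated on a toy factor. This is only a sketch: the responses below are made up for illustration, and the level order (All, Most, Some, A little, None) is assumed to match the order shown in the summary output above.

```r
# Toy responses (hypothetical values), with the level order assumed
# to match the BRFSS mis* factors: All, Most, Some, A little, None
resp <- factor(c("None", "All", "Some", "A little", "Most", "None"),
               levels = c("All", "Most", "Some", "A little", "None"))

# as.numeric() on a factor returns the level index (All = 1, ..., None = 5);
# subtracting from 6 reverses the scale so that All = 5 and None = 1
score <- 6 - as.numeric(resp)

data.frame(resp, score)
```

Note that the conversion depends entirely on the level order of the factor; if the levels were stored in a different order, the subtraction would assign the wrong scores.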
Research question 2:
q2.select<-c('X_smoker3', 'stopsmk2', 'X_totinda' )
q2data<-brfss2013[,q2.select]
#subset to just include those who smoke every day
q2data<-subset(q2data, X_smoker3=="Current smoker - now smokes every day")
q2data<-q2data[,2:3]
summary(q2data)
## stopsmk2 X_totinda
## Yes :27428 Had physical activity or exercise :31229
## No :27603 No physical activity or exercise in last 30 days:21000
## NA's: 131 NA's : 2933
q2data<-q2data[complete.cases(q2data),]
summary(q2data)
## stopsmk2 X_totinda
## Yes:26041 Had physical activity or exercise :31177
## No :26098 No physical activity or exercise in last 30 days:20962
q2table<-table(q2data)
q2table
## X_totinda
## stopsmk2 Had physical activity or exercise
## Yes 16609
## No 14568
## X_totinda
## stopsmk2 No physical activity or exercise in last 30 days
## Yes 9432
## No 11530
q2table.prop<-prop.table(q2table,2)
q2table.prop
## X_totinda
## stopsmk2 Had physical activity or exercise
## Yes 0.5327325
## No 0.4672675
## X_totinda
## stopsmk2 No physical activity or exercise in last 30 days
## Yes 0.4499571
## No 0.5500429
q2table.df<-as.data.frame(q2table)
ggplot(q2table.df,aes(x=X_totinda,y=Freq,fill=factor(stopsmk2)))+geom_bar(stat="identity",position="dodge")+
scale_fill_discrete(name="tried to quit smoking\nin past year", labels=c("yes", "no"))+
ylab("Frequency")+
labs(title="number of daily smokers who tried to quit smoking by exercise status")+
scale_x_discrete(labels = function(x) str_wrap(x, width = 15))+
xlab("Exercise status")
In research question two, we examine the relationship between physical activity and trying to quit smoking among daily smokers. First, we subset the relevant variables from the larger data set. The three variables used are “X_smoker3”, a calculated variable (computed by the administrators of the survey) indicating whether someone is a current daily smoker (among other levels); “stopsmk2”, indicating whether someone had tried to quit smoking in the past year; and “X_totinda”, which indicates whether someone had physical activity (exercise) in the last 30 days.
Then, we subset the data frame so that only individuals who smoke every day are included. In the next step, we drop this variable (“X_smoker3”) from the data frame to simplify analysis, since only those who smoke every day now remain. We see that there are NAs, which won’t help our analysis, so we remove all cases that are not complete (i.e. those with an NA in either variable).
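The NA-removal step can be seen on a small example. This is a sketch on a toy data frame with made-up rows (the column names mirror the report's, but the values and shortened labels are hypothetical):

```r
# Toy data frame with hypothetical rows; two of the four rows contain an NA
toy <- data.frame(
  stopsmk2  = factor(c("Yes", "No", NA, "Yes")),
  X_totinda = factor(c("Had activity", NA, "No activity", "No activity"))
)

# complete.cases() returns TRUE only for rows with no NA in any column,
# so indexing with it keeps just the fully observed rows
toy.complete <- toy[complete.cases(toy), ]

nrow(toy.complete)
```

A row is dropped if either variable is missing, which is why the counts for both variables fall after this step.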
The next step builds a cross-tabulation of “stopsmk2” and “X_totinda”. We also build a table of column proportions, giving the share who had tried to stop smoking within each exercise status. We find that those who had physical activity did try to stop smoking at a higher rate (about 53%) than those who did not exercise (about 45%). While this is not causal, it suggests a possible association between the two. It could be that a third variable is actually responsible for both behaviors, but we cannot tell from this analysis.
Next a plot was built from the counts. It shows that among daily smokers who exercised, more had tried to quit smoking than had not, whereas among those who did not exercise, fewer had tried to quit than had not.
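One way to check whether the difference in proportions is larger than chance would suggest is a chi-squared test of independence, which the report itself did not run. The sketch below re-enters the counts printed in q2table above; the shortened column labels are my own, and because BRFSS uses a complex weighted sample design, an unweighted test like this should be read as a rough indication only.

```r
# Counts copied from the q2table output above (rows: tried to quit;
# columns: exercise status, with shortened labels for illustration)
q2counts <- matrix(c(16609, 14568,   # exercised: Yes, No
                      9432, 11530),  # did not exercise: Yes, No
                   nrow = 2,
                   dimnames = list(stopsmk2  = c("Yes", "No"),
                                   X_totinda = c("Exercise", "No exercise")))

# Column proportions, matching q2table.prop
q2props <- prop.table(q2counts, 2)
q2props

# Chi-squared test of independence; for a 2x2 table R applies
# Yates' continuity correction by default
q2test <- chisq.test(q2counts)
q2test
```

In the report's own session, chisq.test(q2table) would run the same test directly on the existing table.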
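Because the plotting code above uses the raw counts (Freq from q2table), a version that plots column proportions may make the comparison across exercise groups easier to read. The following is a minimal sketch that re-enters the proportions printed in q2table.prop; in the report's own session, as.data.frame(q2table.prop) should give an equivalent data frame without retyping. The ggplot2 call is guarded so the data step still runs where the package is not installed.

```r
# Proportions copied from the q2table.prop output above
q2prop.df <- data.frame(
  X_totinda = rep(c("Had physical activity or exercise",
                    "No physical activity or exercise in last 30 days"),
                  each = 2),
  stopsmk2  = rep(c("Yes", "No"), times = 2),
  Freq      = c(0.5327325, 0.4672675, 0.4499571, 0.5500429)
)

# Dodged bar chart of proportions by exercise status
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(q2prop.df, aes(x = X_totinda, y = Freq, fill = stopsmk2)) +
    geom_col(position = "dodge") +
    labs(x = "Exercise status", y = "Proportion",
         fill = "Tried to quit smoking\nin past year")
  print(p)
}
```

Plotting proportions removes the effect of the two exercise groups having different sizes, which the count-based plot cannot do.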
Research question 3:
q3.select<-c("X_educag", "genhlth")
q3data<-brfss2013[,q3.select]
summary(q3data)
## X_educag genhlth
## Did not graduate high school : 42213 Excellent: 85482
## Graduated high school :142968 Very good:159076
## Attended college or technical school :134196 Good :150555
## Graduated from college or technical school:170118 Fair : 66726
## NA's : 2280 Poor : 27951
## NA's : 1985
q3data<-q3data[complete.cases(q3data),]
summary(q3data)
## X_educag genhlth
## Did not graduate high school : 41787 Excellent: 85046
## Graduated high school :142367 Very good:158500
## Attended college or technical school :133739 Good :149835
## Graduated from college or technical school:169665 Fair : 66389
## Poor : 27788
q3table<-table(q3data)
q3table
## genhlth
## X_educag Excellent Very good Good
## Did not graduate high school 3237 6241 13473
## Graduated high school 17551 39816 50212
## Attended college or technical school 21347 44861 42828
## Graduated from college or technical school 42911 67582 43322
## genhlth
## X_educag Fair Poor
## Did not graduate high school 12137 6699
## Graduated high school 24613 10175
## Attended college or technical school 17642 7061
## Graduated from college or technical school 11997 3853
q3table.prop<-prop.table(q3table,1)
q3table.prop
## genhlth
## X_educag Excellent Very good
## Did not graduate high school 0.07746428 0.14935267
## Graduated high school 0.12327997 0.27967155
## Attended college or technical school 0.15961687 0.33543693
## Graduated from college or technical school 0.25291604 0.39832611
## genhlth
## X_educag Good Fair
## Did not graduate high school 0.32242085 0.29044918
## Graduated high school 0.35269409 0.17288417
## Attended college or technical school 0.32023568 0.13191365
## Graduated from college or technical school 0.25533846 0.07070993
## genhlth
## X_educag Poor
## Did not graduate high school 0.16031302
## Graduated high school 0.07147021
## Attended college or technical school 0.05279687
## Graduated from college or technical school 0.02270946
q3table.df<-as.data.frame(q3table)
#q3df.prop<-as.data.frame(round(q3table.prop,2))
q3df.prop<-as.data.frame(q3table.prop)
q3df.prop
## X_educag genhlth Freq
## 1 Did not graduate high school Excellent 0.07746428
## 2 Graduated high school Excellent 0.12327997
## 3 Attended college or technical school Excellent 0.15961687
## 4 Graduated from college or technical school Excellent 0.25291604
## 5 Did not graduate high school Very good 0.14935267
## 6 Graduated high school Very good 0.27967155
## 7 Attended college or technical school Very good 0.33543693
## 8 Graduated from college or technical school Very good 0.39832611
## 9 Did not graduate high school Good 0.32242085
## 10 Graduated high school Good 0.35269409
## 11 Attended college or technical school Good 0.32023568
## 12 Graduated from college or technical school Good 0.25533846
## 13 Did not graduate high school Fair 0.29044918
## 14 Graduated high school Fair 0.17288417
## 15 Attended college or technical school Fair 0.13191365
## 16 Graduated from college or technical school Fair 0.07070993
## 17 Did not graduate high school Poor 0.16031302
## 18 Graduated high school Poor 0.07147021
## 19 Attended college or technical school Poor 0.05279687
## 20 Graduated from college or technical school Poor 0.02270946
ggplot(q3df.prop,aes(x=X_educag,y=Freq,fill=factor(genhlth)))+geom_bar(stat="identity",position="dodge")+
scale_fill_discrete(name="general health")+
ylab("Proportion")+
labs(title="General health by education level")+
scale_x_discrete(labels = function(x) str_wrap(x, width = 15))+
xlab("education level")
In order to answer question 3, we first subset the data to create a data frame with only the two variables of interest: “X_educag”, a grouped variable (computed by the survey administrators) that indicates education level, and “genhlth”, the self-reported general health of the respondent. We then show a summary table of this.
Next, we eliminate those cases with NAs - as a case with one or both NA responses would not be helpful for our analysis. Again, we show a summary table.
Then, we cross-tabulate the educational level and general health level. This is a bit hard to compare because of all the large numbers in the table. In order to make analysis easier, we calculate proportions by educational level, so, for example, all those who graduated high school are broken into the proportions reporting excellent, very good, good, fair, and poor health. With the proportions, we can now see that the share reporting excellent health rises with each education level: it is lowest among those who did not graduate high school (about 8%) and highest among those who graduated college (about 25%). Similarly, the share reporting poor health falls as educational level increases (from about 16% to about 2%). It appears that a higher education level goes with better reported health. This is not causal, as stated in the question, but it is an interesting association.
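The row-proportion step can be reproduced outside the full data set. The sketch below re-enters the counts printed in q3table above; the shortened education labels are my own and stand in for the longer BRFSS level names.

```r
# Counts copied from the q3table output above, one row per education level
q3counts <- matrix(
  c( 3237,  6241, 13473, 12137,  6699,
    17551, 39816, 50212, 24613, 10175,
    21347, 44861, 42828, 17642,  7061,
    42911, 67582, 43322, 11997,  3853),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    X_educag = c("No HS diploma", "HS graduate",
                 "Some college", "College graduate"),  # shortened labels
    genhlth  = c("Excellent", "Very good", "Good", "Fair", "Poor")))

# margin = 1 divides each cell by its row total, so every education
# level's proportions sum to 1 and rows become directly comparable
q3props <- prop.table(q3counts, margin = 1)
round(q3props, 3)
```

Using margin = 1 (rows) rather than margin = 2 (columns) is what lets us compare health status across education groups of very different sizes.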
The plot created for this question also shows this relationship - compare, for example, the level of excellent health across the educational levels, and the level of poor health across educational levels. Further research can be done to examine this relationship and explore whether there is a clinical or public health significance.