Setup

This is the initial setup necessary to run data analysis on the given dataset.

Load packages

An additional package of stats was added for further analysis

library(ggplot2)
library(dplyr)
library(stats)
library(corrplot)
library(psych)

Load data

The data was loaded from a local file provided, it contains data from the 2013 BRFSS.

load("brfss2013.RData")

# Preview the data
View(brfss2013)

Part 1: Data

About BRFSS

The Behavioral Risk Factor Surveillance System, otherwise known as the BRFSS, is a collaborative observational project between all the United States (US) states - including the District of Columbia (DC) and three US territories. Currently, the BRFSS is sponsored by the CDC National Center for Chronic Disease Prevention and Health Promotion, other CDC centers, as well as federal agencies such as Health Resources and Services Administration, Administration on Aging, Department of Veteran Affairs, and the Substance Abuse and Mental Health Services Administration. The BRFSS is the largest continuously conducted health survey in the world with more than 400,000 randomly selected adults interviewed annually. The BRFSS is designed to measure behavioral risk factors for non-institutionalized adults (individuals 18 years of age or older) residing in the US and territories.

The BRFSS began in 1984 with 15 states collecting surveillance data on major risk behaviors through monthly telephone interviews. The CDC developed a standard questionnaire for states to use with the initial topics including smoking, alcohol use, physical inactivity, diet, hypertension, and seat-belt use. By 1988, standardized sets of specific questions were added as optional modules. The BRFSS became a nationwide system by 1993, with a sample size of over 100,000. Currently, the survey is conducted by, with the assistance of the CDC, state health departments’ in-house interviewers, or contracted telephone call centers and universities continuously throughout the year. States (referring to the US states and territories) use a standardized main survey with optional modules with states additionally having the option to add questions as well that pertain to their region. The survey is conducted using a Random Digit Dialing (RDD) system for both landlines and cell phones. Cell phones were implemented in 2011. In conducting landline interviews, interviewers collect data from a randomly selected adult within the household; cellphone interviews are conducted with someone who resides in a private residence or college housing.

The BRFSS states its objective is to collect uniform, state-specific data on preventative health practices and risk behaviors that are linked to chronic diseases, injuries, and preventable infectious diseases that affect the adult population.

The 2013 BRFSS assessed tobacco use, HIV/AIDS knowledge, and prevention, exercise, immunization, health status, healthy days, health-related quality of life, healthcare access, inadequate sleep, hypertension awareness, cholesterol awareness, chronic health conditions, alcohol consumption, produce consumption, arthritis burden, and seat-belt use. In 2013, there were 22 optional standardized modules a state could choose to additionally use, a few of these optional modules are: sugar drink consumption, breast and cervical cancer screening, and reactions to race.

Dataset Notes

The 2013 BRFSS dataset is a randomly-selected observational study meaning results from this study could potentially be generalized to the entire population and causal relationships could be inferred between two or more variables. It is important to remember correlation does not equal causation.

Since cellphone lines are conducted with individuals who reside in a private residence or college housing, the data may be biased to those who can afford a mobile line and reside in a private or college residence; meaning, data may exclude individuals who cannot afford a cellular line or landline, individuals who cannot afford a college education, and individuals who do not live in a private residence. Thus, results concluded from this study may only be generalizable to part of the population, those who have a landline or cellphone (and are in college or private residence).

An additional concern can be raised with the way the survey was conducted, respondents’ answers were not validated. This can lead to under-reporting, over-reporting, or respondents forgetting key information.



References
Behavioral Risk Factor Surveillance System Overview: BRFSS 2013
About BRFSS
BRFSS FAQs
BRFSS History
BRFSS History Fact Sheet
BRFSS 2013 Calculated Variables
BRFSS Questionaire
BRFSS 2013 Optional Modules by State


Part 2: Research questions

Research quesion 1:

Is income level a determinate in healthcare status? Are there any differences among races?

A person’s ethnicity (cultural background) can have a profound effect on one’s views about healthcare and their access to healthcare. Income inequalities among different races could further create a barrier when it comes to healthcare. Here, we explore the relationship between race, income level, and healthcare insurance coverage to see if race and income level could be a determinant if someone will have healthcare insurance.

Variables used for analysis:

hlthpln1

Description: Do you have any kind of health care coverage, including health insurance, prepaid plans such as HMOs, or government plans such as Medicare, or Indian Health Service?

income2

Description: Respondant’s income level

_X_raceg21

Description: Respondant’s race classfied as Non-hispanic White or Non-White

Research quesion 2:

Do education and gender affect general health?


Generally, the higher education one receives the better their job opportunities, income, and access to healthcare. When individuals have access to healthcare, one would believe they are in a better position to deal with health concerns before those concerns can negatively impact their health. In this question, we explore the differences between education level on general health when stratified by gender.

Variables used for analysis:

genhlth

Description: Would you say that in general your health is: Excellant, Very Good, Good, Fair, Poor, Don’t Know/Not Sure, Refused, Not asked or Missing

educa

Description: What’s the highest grade or year of school you complete?

sex

Description: Respondant’s sex

Research quesion 3:

Of those seeking mental healthcare care from a professional, how many agree or disagree with the following statements that “treatment can help people with mental illness lead normal lives” and “people are generally caring and sympathetic to people with mental illness?” How does this compare across states?

1 in 5 adults suffer from a mental health problem yet there is a stigma around mental health; personally, I have felt this stigma translated into my own life. The fear of mistreatment from others, once your conditions are known, produces an environment where seeking treatment is put on the back burner. Here I want to explore how individuals seeking treatment think about seeking healthcare and societies’ views towards individuals with a mental health issue.

Variables used for analysis:

mistrhlp

Description: Treatment can help people with mental illness lead normal lives. Do you agree slightly or strongly, or disagree slightly or strongly?

mistrhlpf

People are generally caring and sympathetic to people with mental illness. Do you agree slightly or strongly or disagree slightly or strongly?

mistmnt

Description: Are you now taking medicine or receiving treatment from a doctor or other health professional for any type of mental health condition or emotional problem?

_state

Description: State code


Part 3: Exploratory data analysis

Research quesion 1:

Is income level a determinate in healthcare status? Is there any differences among race?


# Selection of relavant variables into a new dataset and omit NA responses
brfss2013_a <- brfss2013 %>% select(hlthpln1, income2, X_raceg21) %>% na.omit

# Checking to make sure the correct data columns were selected
colnames(brfss2013_a)
## [1] "hlthpln1"  "income2"   "X_raceg21"
#Get counts for each variable


c_raceg21 <- brfss2013_a %>%
  group_by(X_raceg21) %>%
  summarise(counts = n())
c_raceg21
## # A tibble: 2 x 2
##   X_raceg21             counts
##   <fct>                  <int>
## 1 Non-Hispanic White    322322
## 2 Non-White or Hispanic  91355
c_income2 <- brfss2013_a %>%
  group_by(income2) %>%
  summarise(counts = n())
c_income2
## # A tibble: 8 x 2
##   income2           counts
##   <fct>              <int>
## 1 Less than $10,000  24774
## 2 Less than $15,000  26207
## 3 Less than $20,000  34133
## 4 Less than $25,000  40941
## 5 Less than $35,000  48068
## 6 Less than $50,000  60652
## 7 Less than $75,000  64344
## 8 $75,000 or more   114558
c_hlthpln1 <- brfss2013_a %>%
  group_by(hlthpln1) %>%
  summarise(counts = n())
c_hlthpln1
## # A tibble: 2 x 2
##   hlthpln1 counts
##   <fct>     <int>
## 1 Yes      367077
## 2 No        46600
# Viewing the frequencies for income vs healthcare status
protable1 <- prop.table(table(brfss2013_a$income2, brfss2013_a$hlthpln1),2)
protable1
##                    
##                            Yes         No
##   Less than $10,000 0.05004127 0.13744635
##   Less than $15,000 0.05653037 0.11708155
##   Less than $20,000 0.07149726 0.16927039
##   Less than $25,000 0.08932458 0.17493562
##   Less than $35,000 0.11215358 0.14804721
##   Less than $50,000 0.14960621 0.12306867
##   Less than $75,000 0.16613953 0.07206009
##   $75,000 or more   0.30470719 0.05809013

Here we can see there appears to be some association between income level and if someone has healthcare. Let’s view a graphic representation.

#Graphical Representation 
graph1 <- ggplot(data = brfss2013_a, 
                 aes(x=income2, fill=hlthpln1)) + 
                 geom_bar(position = "fill") + 
                 coord_flip()+
                 ggtitle("Healthcare Status by Income Level")
graph1 <- graph1 + 
  xlab("Income Level") + 
  ylab("Proportions") + 
  scale_fill_discrete("Do you have healthcare coverage?")
graph1

It would seem the lower your income, the greater your chances are of not having healthcare, but throughout each income bracket more than 50% have healthcare. What about race and healthcare status?

#Graphical Representation of Income Level vs Healthcare Coverage = Yes vs Race (divided into White vs Non-white)
plotdat = brfss2013_a %>% 
  #Taking a subset of the data so the RMarkdown "Knit" function doesn't take over an hour to process 
    sample_frac(.1) %>% 
  group_by(X_raceg21, income2) %>% 
  #Creating dummy variable for Healthcare, if they have healthcare it is coded as 1, if not 0. 
  mutate(prop = sum(hlthpln1 == "Yes")/n()) %>% 
  filter(row_number(hlthpln1) == 1) 

 

plotdat %>% 
  group_by(X_raceg21, income2) %>% 
  ggplot(aes(income2, 
             prop,
             group = X_raceg21,
             color = X_raceg21)) + 
  geom_jitter(width = .2) +
  geom_line() + 
  ggtitle("Healthcare Status stratified by Income and Race") +
  theme(axis.text.x = element_text(angle = 90)) +
  labs(x = "Income Levels", y = "Proportion with Healthcare", color = "Population")

Here we can see that Non-Whites or Hispanics on average are less likely to have healthcare, with a greater divide at income levels “Less than $15,000” and “Less than $25,000”. What is interesting is the smallest divide is at the lowest level; this could potentially be caused by people qualifying for government-run healthcare. Let’s explore more and create a Logistic Regression because the dataset has a binary response (outcome, dependent) variable called hlthpln1.

#Creating A Fresh Dataset so we can set some of the variables as numeric rather than factors
brfss2013_catreg <- brfss2013 %>% 
  select(hlthpln1, income2, X_raceg21) %>% 
  na.omit 

#Changing factor variables to numeric
#Where 1=yes, 2=no
brfss2013_catreg$hlthpln1=as.factor(brfss2013_catreg$hlthpln1)
#Where 1=White, 2=Non-White
brfss2013_catreg$X_raceg21=as.numeric(brfss2013_catreg$X_raceg21)
#Where 1=Less than $10,000, 2=Less than $15,000, 3=Less than $20,000, 4=Less than $25,000, 5=Less than $35,000, 6=Less than $50,000,    7=Less than $75,000, 8=$75,000 or more  
brfss2013_catreg$income2=as.numeric(brfss2013_catreg$income2)


#Running a Logistic Regression 
catreg <- glm (hlthpln1~income2 + X_raceg21, 
               data = brfss2013_catreg, 
               family=binomial)
catreg
## 
## Call:  glm(formula = hlthpln1 ~ income2 + X_raceg21, family = binomial, 
##     data = brfss2013_catreg)
## 
## Coefficients:
## (Intercept)      income2    X_raceg21  
##     -1.1765      -0.3195       0.5372  
## 
## Degrees of Freedom: 413676 Total (i.e. Null);  413674 Residual
## Null Deviance:       291200 
## Residual Deviance: 263900    AIC: 263900
summary(catreg)
## 
## Call:
## glm(formula = hlthpln1 ~ income2 + X_raceg21, family = binomial, 
##     data = brfss2013_catreg)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.0044  -0.5238  -0.3681  -0.2834   2.5437  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.176479   0.020010  -58.80   <2e-16 ***
## income2     -0.319476   0.002286 -139.78   <2e-16 ***
## X_raceg21    0.537219   0.010999   48.84   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 291242  on 413676  degrees of freedom
## Residual deviance: 263892  on 413674  degrees of freedom
## AIC: 263898
## 
## Number of Fisher Scoring iterations: 5
#Catagorical Regression Plots
par(mfrow=c(2,2))
plot(catreg)

From the Logistic Regression model we see that income has a negative effect and race has a positive effect. Both the coefficient of race and income are significant (p>0.05). Looking at the plots we can see the residuals are far from the line, and this makes sense healthcare status is determined by more than just income and race, other factors that possibly contribute to healthcare status are age, employment status, marriage status, etc. While, this is a good start to creating a predictive model for healthcare status, more variables will have to be added. However, we can say that race and income are significant when building a model for healthcare status.

Research quesion 2:

Does education and gender have an effect on general health?


# Selection of relavant variables into a new dataset and omit NA responses
brfss2013_b <- brfss2013 %>% select(genhlth, sex, educa) %>% na.omit

# Checking to make sure the correct data columns were selected
colnames(brfss2013_b)
## [1] "genhlth" "sex"     "educa"
#Creating A Fresh Dataset so we can set some of the variables as numeric rather than factors
brfss2013_cor <- brfss2013 %>% 
  select(genhlth, sex, educa) %>% 
  na.omit 
#Changing factor variables to numeric 
#Where 1=Excellant, 2=Very Good, 3=Good,4= Fair,5=Poor
brfss2013_cor$genhlth=as.numeric(brfss2013_cor$genhlth) 
#Where 0=Male, 1=Female
brfss2013_cor$sex=as.numeric(brfss2013_cor$sex==
                               "Female", 0, 1)
#Where 1=Never attended school or only kindergarten , 2=Grades 1 through 8 (Elementary), 3=Grades 9 through 11 (Some high school) , 4=Grade 12 or GED (High school graduate) , 5=College 1 year to 3 years (Some college or technical school), 6=College 4 years or more (College graduate),    
brfss2013_cor$educa=as.numeric(brfss2013_cor$educa)

Now that we have our variables as numeric representations (dummy variables), we can run a correlation to see if these variables correlate in any way.

#Finding correlation
r2_cor <-cor(brfss2013_cor)
r2_cor
##             genhlth         sex       educa
## genhlth  1.00000000  0.01673846 -0.29371654
## sex      0.01673846  1.00000000 -0.01702893
## educa   -0.29371654 -0.01702893  1.00000000

Now let’s plot the correlation to better view it.

corrplot(r2_cor, order="AOE", method="color", addCoef.col = "gray")

#Summary Statisitics
psych::describe(brfss2013_b)
##          vars      n mean   sd median trimmed  mad min max range  skew kurtosis
## genhlth*    1 487561 2.58 1.10      3    2.52 1.48   1   5     4  0.36    -0.50
## sex*        2 487561 1.59 0.49      2    1.61 0.00   1   2     1 -0.37    -1.86
## educa*      3 487561 4.85 1.06      5    4.96 1.48   1   6     5 -0.62    -0.15
##          se
## genhlth*  0
## sex*      0
## educa*    0

As we can see here, education and sex have almost no correlation. Genhlth and sex have almost no correlation. Interesting enough there is a weak negative correlation between education level and genhlth. We can infer that there is a weak association between being more education and poorer health. As for the summary statistics, since we are working with catagorical variables, the R package psych converts them to numerical values which should be interpreted cautiously (if at all).

Research quesion 3:

Of those seeking mental healthcare care from a professional, how many agree or disagree with the following statements that “treatment can help people with mental illiness lead normal lives” and “people are generally caring and sympathetic to people with mental illness?” How does this compare across states?


# Selection of relavant variables into a new dataset and omit NA responses
brfss2013_c <- brfss2013 %>% select(mistrhlp, mistmnt, misphlpf, X_state) %>% na.omit

# Checking to make sure the correct data columns were selected
colnames(brfss2013_c)
## [1] "mistrhlp" "mistmnt"  "misphlpf" "X_state"
#Creating a proportion table
prop <- with(brfss2013_c, table(mistrhlp, mistmnt))%>%
  prop.table(margin=2)
prop
##                             mistmnt
## mistrhlp                            Yes         No
##   Agree strongly             0.81523437 0.71261822
##   Agree slightly             0.13203125 0.21583518
##   Neither agree nor disagree 0.01679687 0.02763422
##   Disagree slightly          0.02363281 0.02928664
##   Disagree strongly          0.01230469 0.01462574
#Summary Statistics
psych::describe(prop)
##     vars n mean   sd median trimmed  mad  min  max range skew kurtosis   se
## Yes    1 5  0.2 0.35   0.02     0.2 0.02 0.01 0.82   0.8 1.02    -0.99 0.16
## No     2 5  0.2 0.30   0.03     0.2 0.02 0.01 0.71   0.7 0.89    -1.19 0.13

Looking at the proportion table (comparing the Yes and No columns), it would seem that regardless of if one receives mental healthcare help from a professional the majority of people agree that seeking help from a mental health professional can help live a normal life. However, of our group of interest it looks that the majority of individuals seeking help believe that it can lead to a normal life. Let’s look at the other statement that people are generally caring and sympathetic to people with mental illness and if one is seeking mental health care. Looking at the summary statistics we can tell that the data is heavily skewed, the positive skewness indicates that the right-handed tail is larger than the left-handed tail. The kurtosis tells us that the data is “light-tailed” meaning there is less in the tails than the normal distribution. The data’s “skewness” is probably from the differences in agree strongly.

#Creating a second proportion table
prop2 <- with(brfss2013_c, table(misphlpf, mistmnt))%>%
  prop.table(margin=2)
prop2
##                             mistmnt
## misphlpf                            Yes         No
##   Agree strongly             0.19003906 0.24983300
##   Agree slightly             0.28906250 0.34876771
##   Neither agree nor disagree 0.04355469 0.04918609
##   Disagree slightly          0.28867188 0.24364519
##   Disagree strongly          0.18867187 0.10856801

Looking at this proportion table, we can see that of the individuals seeking mental healthcare help from a professional that it is roughly split on agreeing or disagreeing with this statement. Since, this data was collected in 2013, it would be interesting to see how it compares to 2019’s data, to see if the stigma around mental health has changed. Let’s filter the data now so agreeing slightly or strongly is just agree, same with disagree, and we are looking at only those who are seeking mental health help from a professional.

#Filtering Data to only those seeking mental healthcare from a professional
brfss2013_mh <- select(brfss2013_c, misphlpf, mistmnt, mistrhlp, X_state) %>% filter(mistmnt=="Yes")

#Mutating variavles so strongly and slightly agree are now just agree and the same with disagree, and filtering out those who nethier agree or disagree
brfss2013_mh <- brfss2013_mh %>%
  mutate(misphlpf = recode(misphlpf, "Agree strongly" = "Agree",
                           "Agree slightly" = "Agree", 
                           "Disagree strongly"="Disagree",
                           "Disagree slightly"="Disagree"), 
         mistrhlp = recode(mistrhlp, "Agree strongly" = "Agree",
                           "Agree slightly" = "Agree", 
                           "Disagree strongly"="Disagree",
                           "Disagree slightly"="Disagree")) %>%
  filter(mistrhlp !="Neither agree nor disagree", misphlpf!="Neither agree nor disagree") 


#Grouping by states, calculate counts and percentages
brfss2013_mh <- group_by(brfss2013_mh, X_state, mistrhlp, misphlpf) %>% 
  summarise(count=n()) %>% 
  mutate(pct=prop.table(count)*100)
brfss2013_mh
## # A tibble: 24 x 5
## # Groups:   X_state, mistrhlp [12]
##    X_state       mistrhlp misphlpf count   pct
##    <fct>         <fct>    <fct>    <int> <dbl>
##  1 Alabama       Agree    Agree       47  72.3
##  2 Alabama       Agree    Disagree    18  27.7
##  3 Alabama       Disagree Agree        1  33.3
##  4 Alabama       Disagree Disagree     2  66.7
##  5 Massachusetts Agree    Agree      292  51.9
##  6 Massachusetts Agree    Disagree   271  48.1
##  7 Massachusetts Disagree Agree        9  29.0
##  8 Massachusetts Disagree Disagree    22  71.0
##  9 Minnesota     Agree    Agree      825  50.6
## 10 Minnesota     Agree    Disagree   806  49.4
## # ... with 14 more rows

So what is this telling us? In most states, if you agree to the first question, you have almost a 50/50 chance of agreeing to the second question that people are caring towards those with mental health issues, except for Alabama where it is more likely you will agree to both statements. In terms of disagreeing to the first question, across the states it looks as though you are more likely to disagree to the second statement.

graph2 <- ggplot(brfss2013_mh)+
  aes(y=pct, 
      x=mistrhlp,
      fill=misphlpf, 
      )+
  geom_bar(stat="identity", position = "fill", 
           color="grey40")+
  facet_grid(.~X_state) 

graph2 <- graph2 + ggtitle("Beliefs About Mental Health Care By State") +
  xlab("Can Those with Menthal Health Issues Seeking \nProfessional Mental Healthcare Lead a Normal Life? By State") + 
  ylab("Proportions") + 
  scale_fill_discrete("Are Individuals Caring to Those with Mental Health Issues?") +
  theme(axis.text.x = element_text(colour="grey40", size=8, angle=90, hjust=.5, vjust=.5),
        axis.text.y = element_text(colour="grey40", size=8),
        axis.title.x= element_text(size=8, vjust=-1),
        axis.title.y=element_text(size=8),
        plot.title = element_text (hjust = 0.5, face="bold"),
        legend.text = element_text(size=8, color="white"),
        legend.position = "bottom", 
        legend.box = "vertical", 
        legend.title = element_text (size = 8, color="white"),
        legend.background = element_rect(fill="grey40", color = "gray40")
        
        )
graph2


Interestingly, individuals who are seeking mental health help from a professional and agree that seeking such help leads to a normal life are almost split down the middle when it comes to agreeing or disagreeing that individuals are caring to those with mental health issues across the states that did participate in this optional module, except for Alabama where individuals are more likely to agree to both statements. Those seeking mental health help from a professional that disagree that ‘seeking such help leads to a normal life’ seem more likely to disagree that individuals are caring to those with mental health issues. It would have been helpful for analysis if this module asked an additional question “Are you satisfied with your mental health care provider?” to understand if the individuals who disagree with both questions are potentially disagreeing because of dissatisfaction with care. With mental illness affecting 1 in 5 adults, the fact that so little states participated in this module is disappointing as it could give insight into which states need to improve on their mental health system.