Setup

Load packages and data

suppressWarnings(library(ggplot2))
suppressWarnings(library(dplyr))
load("brfss2013.RData")

Part 1: Data

The data used in this Exploratory data analysis was collected from all 50 states and other districts, by a collaboration between CDC (Center for Disease Control and Prevention) and the states. The data tries to find behavioural risk factors for non-institutionalized adult population residing in the US.The data used in this EDA can be found at: https://www.cdc.gov/brfss/annual_data/annual_2013.html

It is important to point out that the data was collected as part of an observational study, not an experimental study.This is clear because there is no random assignment of treatments to subjects which is required to establish causation. For example, the workout methods (eg. swimming, mentioned in the data) were not randomly assigned to subjects to study its effect on a response variable like energy level. Instead, the subjects themselves decide whether they want to do swimming or walking and give an indication of their energy level.Thus only correlation and not causation can be established from this study.

Stratified sampling techniques and randomization have been implemented (with much work and effort) to create a representative sample. Further details about survey sampling procedure can be found here- https://www.cdc.gov/brfss/data_documentation/pdf/userguidejune2013.pdf

However, this survey is done by calling on a landline or cell phone. People and families not having a phone are not represented in the sample. Also we do not get data from people who are very busy and choose not to participate in the study. Thus convienience bias is introduced. ‘Weighting’ procedures were used to counter any biases introduced. Thus, it is safe to assume that we do have a largely representative sample and the results/correlations gleaned from the study could be generalized to the said population of non-institutionalized United States adults(18 or older).

An exploratory data analysis of this huge data set should definitely aid in our understanding of behavioural risk factors corresponding to different diseases.

An observation about variables in the data:

Many variables in the data have a large number of Non-responses (NA). For example, I did some interesing analysis about percentage of women who had had a mamogram done depending on whether they had health care coverage or not and also how this varied with age. But this analysis was marred by the fact that most women did not choose to answer if they had had a mamogram, giving the unlikely result where more women said they did have a mamogram than those who said they did not have a mamogram, the remaining being Non-responses.

As a result, in my research questions, I have focused on variables where most participants have answered the question, hence the non-responses are not significant enough to affect the outcome of my research question.


Part 2: Research questions

Research quesion 1: Is the diagnosis of depressive disorder in people related to their income levels and their employment status?

Research quesion 2: Does the diagnosis of depressive disorder in people relate to their marital status or age?

Research quesion 3: Is the diagnosis of arthritis, high Blood Pressure or high Cholesterol related Body Mass Index (BMI) level? How does the BMI level itself vary with age or employment status of participants?

Research quesion 4: Is the income level of participants positively correlated with their education level? Is this same for both sexes?

About the research questions:

For my research questions, I chose variables that are solid enough - variables whose data is most likely factual and can be relied upon.

Q1,Q2 : These questions use the variable ‘addepev’ - were the participants ever told by a healthcare professional that they have a depressive disorder? (list of disorders given to participants) - Yes/No. The variable is framed in a solid way, its answers are more reliable than say, if the participants were asked if they believe that they suffer from depression. This makes my research question and its analysis more significant and dependable. My research questions ask how depression varies with, a)employment status, b)income, c)marital status, d)age. These questions would be very interesting to, say, an advertising campaign manager of a drug company that makes antidepressants and would help him identify his target market or audience with more precision.

Q3 : This question is made up of many questions, but fundamentally focuses on one variable- Body Mass Index(BMI). Initially I was planning to calculate this variable from height and weight of participants, but soon was pleasantly surprised to find that it was already calculated and available in the data set. First part of this question asks how arthritis diagnosis, blood pressure level and cholesterol level relate to BMI. I have people with arthritis in my family, hence these questions were particularly interesting to me. The second part asks how BMI varies with employment status and age. Answers of these research questions would be of intersest to companies creating medical diagnosis software because as we will see, BMI holds quite a lot of predictive power for anticipating arthritis or high cholesterol.

Q4: The last question checks the most commonly held belief that higher education implies higher income. The analysis also gives an idea about the strength of correlation, although not explicity calculated. We also ask whether this question leads to similar answers for both sexes. This question will be interesting and helpful to sociologists doing gender studies to compare income of males and females at same level of education


Part 3: Exploratory data analysis

Research quesion 1:

Research quesion 2:

Does the diagnosis of depressive disorder in people relate to their marital status or age?

Let’s look at the two additional variables:

  1. marital : marital status
str(brfss2013$marital)
##  Factor w/ 6 levels "Married","Divorced",..: 2 1 1 1 1 2 1 3 1 1 ...
levels(brfss2013$marital)
## [1] "Married"                         "Divorced"                       
## [3] "Widowed"                         "Separated"                      
## [5] "Never married"                   "A member of an unmarried couple"
table(brfss2013$marital, useNA = "ifany")
## 
##                         Married                        Divorced 
##                          253329                           70376 
##                         Widowed                       Separated 
##                           65745                           10662 
##                   Never married A member of an unmarried couple 
##                           75070                           13173 
##                            <NA> 
##                            3420

Maximum respondents are married. NA values are very few.

  1. X_age80 : Age (collapsed above 80)
str(brfss2013$X_age80)
##  int [1:491775] 60 50 55 64 66 49 39 60 50 68 ...
summary(brfss2013$X_age80)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00   42.00   57.00   54.77   68.00   80.00      11
IQR(brfss2013$X_age80, na.rm=TRUE)
## [1] 26

Middle 50% respondents are aged 42-68, a range of 26 years. Only 11 NA values.

Let’s first look at how depression relates to marital status.

ggplot(data = na.omit(brfss2013[,c("marital","addepev2")]),aes(x=marital, fill=addepev2)) + geom_bar(na.rm = TRUE, position = 'fill')+theme(axis.text.x=element_text(angle = -90, hjust = 0),legend.position="top")+ xlab("Marital status") + scale_fill_discrete(name="Diagnosed with depressive disorder?") + ylab("Proportion")

The chart seems very suggestive. We see higher rate of depression among separated and divorced people (almost double) than married people. Members of unmarried couples also show higher proportion of depressive disorder than married people. Thus we atleast get interesting ideas for further research.

Next, let’s look at how depression varies with age.

ggplot(data = brfss2013,aes(x=X_age80, fill=addepev2)) + geom_bar(na.rm = TRUE, position = 'fill') + theme(legend.position="top")+ scale_fill_discrete(name="Diagnosed with depressive disorder?") + ylab("Proportion") + xlab("age") + xlim(18,80)

Although we see a pattern, a scatter plot would be much better.

Let’s try to plot % of people with depressive disorder at different ages.

a <- brfss2013 %>%
  count(X_age80)

b <- brfss2013 %>%
  filter(addepev2 == "Yes") %>%
  count(X_age80)
a <- a[-c(1,2,3,4,68),] 
b <- b[-c(1,2,66),]
a$prop = 100 *(b$n / a$n)
ggplot(data = a, aes(x=a$X_age80,y=a$prop)) + geom_point() + xlab("age") + ylab("percentage diagnosed with Depressive disorder")

It seems that incidence of depressive disorder increases with age uptil about 57 years, after which it swiftly degreases.

Now let’s look at the variation of depression with age at different marital statuses.

a <- brfss2013 %>%
  filter(X_age80 >= 25) %>%
  group_by(marital) %>%
  count(X_age80)
a <- na.omit(a)
b <- brfss2013 %>%
  filter(X_age80 >= 25) %>%
  group_by(marital) %>%
  filter(addepev2 == "Yes") %>%
  count(X_age80)
b <- na.omit(b)
a$prop = 100 *(b$n / a$n)
oldw <- getOption("warn")
suppressMessages(
ggplot(data = a, aes(x=a$X_age80,y=a$prop)) + geom_point() + xlab("age") + ylab("percentage diagnosed with Depressive disorder") + ylim(0,45) + facet_wrap(~marital)
)
## Warning: Removed 2 rows containing missing values (geom_point).

Note: We only plot ages:25-80, as the ‘signal’ (number of cases) below 25 is too low. For example there are no widowed repondents below age 20 who were diagnosed with depression.

The first thing to notice is that some plots are “noisier” than others. ‘Married’ plot has lowest noise (due to high signal across all ages) whereas ‘Widowed’ plot is unreliably noisy below age 45.

All plots attain maximum value in the range ~50-60 age. Although, the position of peak varies in the range - never married people attain a peak in depressive disorder at age ~50 years, while divorced people attain the same at age ~60 years.

Next we notice some similarities. The plot for ‘Widowed’ and ‘Divorced’ match very well after the age of 60. The plots for ‘Never married’ and ‘A member of an unmarried couple’ look quite similar. In fact, if we look carefully, the plot for ‘A member of an unmarried couple’ looks like a hybrid of ‘Never married’ and ‘Widowed’! Can lead to some interesing research.

In the age range 25-60, separated and divorced people have about twice the incidence of depression as compared with married people. This is something we saw before in a bar plot, but now we see it much more clearly across age.

Research quesion 3:

Research quesion 4:

Is the income level of participants positively correlated with their education level? Is this same for both sexes?

Vaiables:

  1. X_educag: Education level (4 levels)
str(brfss2013$X_educag)
##  Factor w/ 4 levels "Did not graduate high school",..: 4 3 4 2 4 4 2 3 4 2 ...
levels(brfss2013$X_educag)
## [1] "Did not graduate high school"              
## [2] "Graduated high school"                     
## [3] "Attended college or technical school"      
## [4] "Graduated from college or technical school"
table(brfss2013$X_educag, useNA = "ifany")
## 
##               Did not graduate high school 
##                                      42213 
##                      Graduated high school 
##                                     142968 
##       Attended college or technical school 
##                                     134196 
## Graduated from college or technical school 
##                                     170118 
##                                       <NA> 
##                                       2280
  1. X_incomg : Income category (5 levels)
str(brfss2013$X_incomg)
##  Factor w/ 5 levels "Less than $15,000",..: 5 5 5 5 4 5 NA 4 5 2 ...
levels(brfss2013$X_incomg)
## [1] "Less than $15,000"            "$15,000 to less than $25,000"
## [3] "$25,000 to less than $35,000" "$35,000 to less than $50,000"
## [5] "$50,000 or more"
table(brfss2013$X_incomg, useNA = "ifany")
## 
##            Less than $15,000 $15,000 to less than $25,000 
##                        52235                        76606 
## $25,000 to less than $35,000 $35,000 to less than $50,000 
##                        48867                        61508 
##              $50,000 or more                         <NA> 
##                       181131                        71428
  1. Sex: Male/Female
str(brfss2013$sex)
##  Factor w/ 2 levels "Male","Female": 2 2 2 2 1 2 2 2 1 2 ...
levels(brfss2013$sex)
## [1] "Male"   "Female"
table(brfss2013$sex, useNA = "ifany")
## 
##   Male Female   <NA> 
## 201313 290455      7

Surprisingly, number of female respondents is much higher than male respondents.

Now let’s look at the relationship between income and education level.

ggplot(data = brfss2013,aes(x=X_educag, fill=X_incomg)) + geom_bar(na.rm = TRUE, position = 'fill')+theme(axis.text.x=element_text(angle = -90, hjust = 0)) + xlab('Education') + ylab('Proportion')+
scale_fill_discrete(name="Income category")

Cleary income is positively correlated to education level. Of those who did not graduate high school, less than 5% have income > 50K per year, while of those who graduated college or technical school, about ~65% have income>50K. Let’s see if this behaviour is the same regardless of sex.

MF = c('Male','Female')
ggplot(data = brfss2013[brfss2013$sex %in% MF,],aes(x=X_educag, fill=X_incomg)) + geom_bar(na.rm = TRUE, position = 'fill')+theme(axis.text.x=element_text(angle = -90, hjust = 0)) + xlab('Education') + ylab('Proportion')+ scale_fill_discrete(name="Income category") + facet_wrap(~sex)

The plots look similar for both sexes, although females in general have lower income than males in every education category (look at red and pink in each bar).

Part 4: Conclusion

This was my first stab at data analysis using R. It was an interesting experience trying to find meaningful and interesting correlations between different variables in such a huge data set. Many times, I guessed some variables to be correlated, only to find out they were not, while also finding unexpected correlations. This simple EDA, more than anything else, demonstrated the power of R and ggplot2 to quickly find correlations. This will be useful to me and anybody else who wants to do predictive analysis and machine learning in future using R because this same procedure is used to find predictive power of different features in machine learning.