Part 1: Data
The data used in this Exploratory data analysis was collected from all 50 states and other districts, by a collaboration between CDC (Center for Disease Control and Prevention) and the states. The data tries to find behavioural risk factors for non-institutionalized adult population residing in the US.The data used in this EDA can be found at: https://www.cdc.gov/brfss/annual_data/annual_2013.html
It is important to point out that the data was collected as part of an observational study, not an experimental study.This is clear because there is no random assignment of treatments to subjects which is required to establish causation. For example, the workout methods (eg. swimming, mentioned in the data) were not randomly assigned to subjects to study its effect on a response variable like energy level. Instead, the subjects themselves decide whether they want to do swimming or walking and give an indication of their energy level.Thus only correlation and not causation can be established from this study.
Stratified sampling techniques and randomization have been implemented (with much work and effort) to create a representative sample. Further details about survey sampling procedure can be found here- https://www.cdc.gov/brfss/data_documentation/pdf/userguidejune2013.pdf
However, this survey is done by calling on a landline or cell phone. People and families not having a phone are not represented in the sample. Also we do not get data from people who are very busy and choose not to participate in the study. Thus convienience bias is introduced. ‘Weighting’ procedures were used to counter any biases introduced. Thus, it is safe to assume that we do have a largely representative sample and the results/correlations gleaned from the study could be generalized to the said population of non-institutionalized United States adults(18 or older).
An exploratory data analysis of this huge data set should definitely aid in our understanding of behavioural risk factors corresponding to different diseases.
An observation about variables in the data:
Many variables in the data have a large number of Non-responses (NA). For example, I did some interesing analysis about percentage of women who had had a mamogram done depending on whether they had health care coverage or not and also how this varied with age. But this analysis was marred by the fact that most women did not choose to answer if they had had a mamogram, giving the unlikely result where more women said they did have a mamogram than those who said they did not have a mamogram, the remaining being Non-responses.
As a result, in my research questions, I have focused on variables where most participants have answered the question, hence the non-responses are not significant enough to affect the outcome of my research question.
Part 2: Research questions
Research quesion 1: Is the diagnosis of depressive disorder in people related to their income levels and their employment status?
Research quesion 2: Does the diagnosis of depressive disorder in people relate to their marital status or age?
Research quesion 3: Is the diagnosis of arthritis, high Blood Pressure or high Cholesterol related Body Mass Index (BMI) level? How does the BMI level itself vary with age or employment status of participants?
Research quesion 4: Is the income level of participants positively correlated with their education level? Is this same for both sexes?
About the research questions:
For my research questions, I chose variables that are solid enough - variables whose data is most likely factual and can be relied upon.
Q1,Q2 : These questions use the variable ‘addepev’ - were the participants ever told by a healthcare professional that they have a depressive disorder? (list of disorders given to participants) - Yes/No. The variable is framed in a solid way, its answers are more reliable than say, if the participants were asked if they believe that they suffer from depression. This makes my research question and its analysis more significant and dependable. My research questions ask how depression varies with, a)employment status, b)income, c)marital status, d)age. These questions would be very interesting to, say, an advertising campaign manager of a drug company that makes antidepressants and would help him identify his target market or audience with more precision.
Q3 : This question is made up of many questions, but fundamentally focuses on one variable- Body Mass Index(BMI). Initially I was planning to calculate this variable from height and weight of participants, but soon was pleasantly surprised to find that it was already calculated and available in the data set. First part of this question asks how arthritis diagnosis, blood pressure level and cholesterol level relate to BMI. I have people with arthritis in my family, hence these questions were particularly interesting to me. The second part asks how BMI varies with employment status and age. Answers of these research questions would be of intersest to companies creating medical diagnosis software because as we will see, BMI holds quite a lot of predictive power for anticipating arthritis or high cholesterol.
Q4: The last question checks the most commonly held belief that higher education implies higher income. The analysis also gives an idea about the strength of correlation, although not explicity calculated. We also ask whether this question leads to similar answers for both sexes. This question will be interesting and helpful to sociologists doing gender studies to compare income of males and females at same level of education
Part 3: Exploratory data analysis
Research quesion 1:
Is the diagnosis of depressive disorder in people related to their income levels and their employment status?
Let’s first take a brief look at all of the three variables we are looking at:
1)addepev2 : Diagnosed with depressive disorder?
## Factor w/ 2 levels "Yes","No": 1 1 1 2 2 2 2 2 2 2 ...
table(brfss2013$addepev2, useNA = "ifany")
##
## Yes No <NA>
## 95779 393707 2289
We don’t have a large number of ‘NA’ for this variable. This will make our analysis more reliable.
95779/(393707+95779+2289)
## [1] 0.1947618
About 20% of respondents were diagnosed with depression.
- X_incomg : Income category
## Factor w/ 5 levels "Less than $15,000",..: 5 5 5 5 4 5 NA 4 5 2 ...
levels(brfss2013$X_incomg)
## [1] "Less than $15,000" "$15,000 to less than $25,000"
## [3] "$25,000 to less than $35,000" "$35,000 to less than $50,000"
## [5] "$50,000 or more"
table(brfss2013$X_incomg, useNA = "ifany")
##
## Less than $15,000 $15,000 to less than $25,000
## 52235 76606
## $25,000 to less than $35,000 $35,000 to less than $50,000
## 48867 61508
## $50,000 or more <NA>
## 181131 71428
We have ~70K ‘NA’ which is pretty huge if we want to make inferences or predictions. But income is an important variable - to me at least - and I would allow the unspecified observations in my exploratory analysis with caution. Besides, most other variables in the data have even larger number of ‘NA’.
- employ1 : employment status
## Factor w/ 8 levels "Employed for wages",..: 7 1 1 7 7 1 1 7 7 5 ...
levels(brfss2013$employ1)
## [1] "Employed for wages" "Self-employed"
## [3] "Out of work for 1 year or more" "Out of work for less than 1 year"
## [5] "A homemaker" "A student"
## [7] "Retired" "Unable to work"
table(brfss2013$employ1, useNA = "ifany")
##
## Employed for wages Self-employed
## 202200 39832
## Out of work for 1 year or more Out of work for less than 1 year
## 14074 12242
## A homemaker A student
## 31647 12682
## Retired Unable to work
## 138259 37453
## <NA>
## 3386
We see that the largest chunk of respondents are either employed for wages or retired. We have relatively few NA values.
First, let’s look at 2 variables only. Let’s plot a bar chart and see how deppression diagnosis depends on income levels:
ggplot(data = brfss2013,aes(x=X_incomg, fill=addepev2)) + geom_bar(na.rm = TRUE, position = 'fill')+theme(axis.text.x=element_text(angle = -90, hjust = 0),legend.position="top") + xlab("Income level") + scale_fill_discrete(name="Diagnosed with depressive disorder?") + ylab("Proportion")

This is wonderfully revealing! We see a clear relationship between income and depression. Higher the income, lower the proportion of depression.
Let’s plot without normalizing to make sure that we do not have too few obserations in any of the cases to make meaningful correlations:
ggplot(data = brfss2013,aes(x=X_incomg, fill=addepev2)) + geom_bar(na.rm = TRUE)+theme(axis.text.x=element_text(angle = -90, hjust = 0),legend.position="top") + xlab("Income level") + scale_fill_discrete(name="Diagnosed with depressive disorder?") + ylab("Count")

We see that we have atleast ~ 50,000 observations for every case which is good. Although, we notice we have many ‘NA’(in the right-most bar) - people who have not specified their income. If these belong to a particular income category more than others, they could potentially skew our observations.
Next let’s look at how depression varies with employment status.
ggplot(data = brfss2013,aes(x=employ1, fill=addepev2)) + geom_bar(na.rm = TRUE, position = 'fill')+theme(axis.text.x=element_text(angle = -90, hjust = 0),legend.position="top") + xlab("Employment status") + scale_fill_discrete(name="Diagnosed with depressive disorder?") + ylab("Proportion")

Most notable feature of above chart is that depression diagnosisis is highest (~50%) among people who are unable to work, far higher than all other employment categories. We also notice depression incidence to be lowest among self-employed people and among those employed for wages.
Next we look at how incidence of depression varies with income level for different employment statuses.
ggplot(data = brfss2013,aes(x=X_incomg, fill=addepev2)) + geom_bar(na.rm = TRUE, position = 'fill')+theme(axis.text.x=element_text(angle = -90, hjust = 0),legend.position="top") + xlab("Income level") + scale_fill_discrete(name="Diagnosed with depressive disorder?") + ylab("Proportion") + facet_wrap(~employ1)

Note: We are looking at household income, not individual income.
We make an interesting observation - among those unable to work, the general trend of decreasing depression with increasing household income does not hold.
Research quesion 2:
Does the diagnosis of depressive disorder in people relate to their marital status or age?
Let’s look at the two additional variables:
- marital : marital status
## Factor w/ 6 levels "Married","Divorced",..: 2 1 1 1 1 2 1 3 1 1 ...
levels(brfss2013$marital)
## [1] "Married" "Divorced"
## [3] "Widowed" "Separated"
## [5] "Never married" "A member of an unmarried couple"
table(brfss2013$marital, useNA = "ifany")
##
## Married Divorced
## 253329 70376
## Widowed Separated
## 65745 10662
## Never married A member of an unmarried couple
## 75070 13173
## <NA>
## 3420
Maximum respondents are married. NA values are very few.
- X_age80 : Age (collapsed above 80)
## int [1:491775] 60 50 55 64 66 49 39 60 50 68 ...
summary(brfss2013$X_age80)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 42.00 57.00 54.77 68.00 80.00 11
IQR(brfss2013$X_age80, na.rm=TRUE)
## [1] 26
Middle 50% respondents are aged 42-68, a range of 26 years. Only 11 NA values.
Let’s first look at how depression relates to marital status.
ggplot(data = na.omit(brfss2013[,c("marital","addepev2")]),aes(x=marital, fill=addepev2)) + geom_bar(na.rm = TRUE, position = 'fill')+theme(axis.text.x=element_text(angle = -90, hjust = 0),legend.position="top")+ xlab("Marital status") + scale_fill_discrete(name="Diagnosed with depressive disorder?") + ylab("Proportion")

The chart seems very suggestive. We see higher rate of depression among separated and divorced people (almost double) than married people. Members of unmarried couples also show higher proportion of depressive disorder than married people. Thus we atleast get interesting ideas for further research.
Next, let’s look at how depression varies with age.
ggplot(data = brfss2013,aes(x=X_age80, fill=addepev2)) + geom_bar(na.rm = TRUE, position = 'fill') + theme(legend.position="top")+ scale_fill_discrete(name="Diagnosed with depressive disorder?") + ylab("Proportion") + xlab("age") + xlim(18,80)

Although we see a pattern, a scatter plot would be much better.
Let’s try to plot % of people with depressive disorder at different ages.
a <- brfss2013 %>%
count(X_age80)
b <- brfss2013 %>%
filter(addepev2 == "Yes") %>%
count(X_age80)
a <- a[-c(1,2,3,4,68),]
b <- b[-c(1,2,66),]
a$prop = 100 *(b$n / a$n)
ggplot(data = a, aes(x=a$X_age80,y=a$prop)) + geom_point() + xlab("age") + ylab("percentage diagnosed with Depressive disorder")

It seems that incidence of depressive disorder increases with age uptil about 57 years, after which it swiftly degreases.
Now let’s look at the variation of depression with age at different marital statuses.
a <- brfss2013 %>%
filter(X_age80 >= 25) %>%
group_by(marital) %>%
count(X_age80)
a <- na.omit(a)
b <- brfss2013 %>%
filter(X_age80 >= 25) %>%
group_by(marital) %>%
filter(addepev2 == "Yes") %>%
count(X_age80)
b <- na.omit(b)
a$prop = 100 *(b$n / a$n)
oldw <- getOption("warn")
suppressMessages(
ggplot(data = a, aes(x=a$X_age80,y=a$prop)) + geom_point() + xlab("age") + ylab("percentage diagnosed with Depressive disorder") + ylim(0,45) + facet_wrap(~marital)
)
## Warning: Removed 2 rows containing missing values (geom_point).

Note: We only plot ages:25-80, as the ‘signal’ (number of cases) below 25 is too low. For example there are no widowed repondents below age 20 who were diagnosed with depression.
The first thing to notice is that some plots are “noisier” than others. ‘Married’ plot has lowest noise (due to high signal across all ages) whereas ‘Widowed’ plot is unreliably noisy below age 45.
All plots attain maximum value in the range ~50-60 age. Although, the position of peak varies in the range - never married people attain a peak in depressive disorder at age ~50 years, while divorced people attain the same at age ~60 years.
Next we notice some similarities. The plot for ‘Widowed’ and ‘Divorced’ match very well after the age of 60. The plots for ‘Never married’ and ‘A member of an unmarried couple’ look quite similar. In fact, if we look carefully, the plot for ‘A member of an unmarried couple’ looks like a hybrid of ‘Never married’ and ‘Widowed’! Can lead to some interesing research.
In the age range 25-60, separated and divorced people have about twice the incidence of depression as compared with married people. This is something we saw before in a bar plot, but now we see it much more clearly across age.
Research quesion 3:
Is the diagnosis of arthritis, high Blood Pressure or high Cholesterol related Body Mass Index (BMI) level? How does the BMI level itself vary with age or employment status of participants?
Variables that we will explore:
- X_bmi5cat : 4 levels of Body mass index (BMI) - Underweight (12-18.5), Normal weight (18.5-25), Overweight (25-30), Obese (>30).
## Factor w/ 4 levels "Underweight",..: 4 1 3 2 4 4 2 NA 4 3 ...
levels(brfss2013$X_bmi5cat)
## [1] "Underweight" "Normal weight" "Overweight" "Obese"
table(brfss2013$X_bmi5cat, useNA = "ifany")
##
## Underweight Normal weight Overweight Obese <NA>
## 8267 154898 167084 134799 26727
- X_drdxar1 : diagnosed by a doctor of having some form of Arthritis.
## Factor w/ 2 levels "Diagnosed with arthritis",..: 1 2 1 2 2 2 1 1 1 2 ...
levels(brfss2013$X_drdxar1)
## [1] "Diagnosed with arthritis" "Not diagnosed with arthritis"
table(brfss2013$X_drdxar1, useNA = "ifany")
##
## Diagnosed with arthritis Not diagnosed with arthritis
## 165151 323648
## <NA>
## 2976
- X_rfhype5 : told by a health care professional of having high blood pressure.
## Factor w/ 2 levels "No","Yes": 2 1 1 1 2 2 2 2 1 1 ...
levels(brfss2013$X_rfhype5)
## [1] "No" "Yes"
table(brfss2013$X_rfhype5, useNA = "ifany")
##
## No Yes <NA>
## 291428 198920 1427
- X_rfchol : had cholesterol level checked and told by health care professional of having high cholesterol.
## Factor w/ 2 levels "No","Yes": 2 1 1 2 1 2 1 2 2 1 ...
levels(brfss2013$X_rfchol)
## [1] "No" "Yes"
table(brfss2013$X_rfchol, useNA = "ifany")
##
## No Yes <NA>
## 236614 183497 71664
- X_rfbmi5 : Very similar to X_bmi5cat. Answers the question - “overweight or obese?”. This has only two levels - BMI>25 (‘Yes’) and BMI<25 (‘No’).
## Factor w/ 2 levels "No","Yes": 2 1 2 1 2 2 1 NA 2 2 ...
levels(brfss2013$X_rfbmi5)
## [1] "No" "Yes"
table(brfss2013$X_rfbmi5, useNA = "ifany")
##
## No Yes <NA>
## 163161 301885 26729
- X_ageg5yr : Divides the ages of respondents into 5-year bins (total 13 levels)
## Factor w/ 13 levels "Age 18 to 24",..: 9 7 8 9 10 6 4 9 7 10 ...
levels(brfss2013$X_ageg5yr)
## [1] "Age 18 to 24" "Age 25 to 29" "Age 30 to 34"
## [4] "Age 35 to 39" "Age 40 to 44" "Age 45 to 49"
## [7] "Age 50 to 54" "Age 55 to 59" "Age 60 to 64"
## [10] "Age 65 to 69" "Age 70 to 74" "Age 75 to 79"
## [13] "Age 80 or older"
table(brfss2013$X_ageg5yr, useNA = "ifany")
##
## Age 18 to 24 Age 25 to 29 Age 30 to 34 Age 35 to 39
## 27202 22910 27213 28107
## Age 40 to 44 Age 45 to 49 Age 50 to 54 Age 55 to 59
## 31508 36210 47173 52491
## Age 60 to 64 Age 65 to 69 Age 70 to 74 Age 75 to 79
## 53659 49997 39967 30093
## Age 80 or older <NA>
## 40515 4730
Now let’s look at their relationships by plotting them.
- Arthritis vs BMI
ggplot(data = brfss2013,aes(x=X_bmi5cat, fill=X_drdxar1)) + geom_bar(na.rm = TRUE, position = 'fill')+theme(axis.text.x=element_text(angle = -90, hjust = 0),legend.position="top") + xlab('BMI category') + ylab('Proportion')+ scale_fill_discrete(name="Arthritis Diagnosis")

It is clear that arthritis incidence is lowest among people with ‘normal weight’ (18.5 < BMI < 25), highest among ‘obese’ individuals (BMI > 30), and somewhere in between for overweight people. It is surprising that ‘underweight’ individuals show a slighlty higher incidence of arthritis than people with normal weight, althoug the difference is not significant.
- Blood Pressure vs BMI
ggplot(data = brfss2013,aes(x=X_bmi5cat, fill=X_rfhype5)) + geom_bar(na.rm = TRUE, position = 'fill')+theme(axis.text.x=element_text(angle = -90, hjust = 0),legend.position="top") + xlab('BMI category') + ylab('Proportion')+ scale_fill_discrete(name="Diagnosed with high blood pressure?")

The results are similar to what we saw with arthritis. The incidence of high blood pressure increases with the BMI level. Almost 60% of obese respondents have high BP, whereas only about ~25% of underweight or normal weight people have high BP.
- cholesterol vs BMI
ggplot(data = brfss2013,aes(x=X_bmi5cat, fill=X_rfchol)) + geom_bar(na.rm = TRUE, position = 'fill')+theme(axis.text.x=element_text(angle = -90, hjust = 0),legend.position="top") + xlab('BMI category') + ylab('Proportion')+ scale_fill_discrete(name="Diagnosed with high cholesterol? ")

As before, we see increasing diagnosis of high cholesterol (blue color in the bars) with increasing BMI level. Although, the proportion of ‘NA’ cases (grey color in the bars) is pretty huge, and more importantly varies with BMI level(!). It is sufficient enough to mess with our conclusions about relationship between cholesterol and BMI.
- BMI vs Employment category.
We are using the binary variable - X_rfbmi5 discussed before, which answers the question “is BMI>25?” to indicate BMI, for easier analysis.
ggplot(data = brfss2013,aes(x=employ1, fill=X_rfbmi5)) + geom_bar(na.rm = TRUE, position = 'fill')+theme(axis.text.x=element_text(angle = -90, hjust = 0),legend.position="top") + xlab('Employment category') + ylab('Proportion')+ scale_fill_discrete(name="Overweight or obese?, or, is BMI>25? ")

As expected, less than 50% of students show BMI>25, whereas every other employment category has more than 60% folks with BMI>25. Those who are unable to work, not surprisingly, show the highest proportion (~75%) having BMI>25.
- BMI vs Age
ggplot(data = brfss2013,aes(x=X_ageg5yr, fill=X_rfbmi5)) + geom_bar(na.rm = TRUE, position = 'fill')+theme(axis.text.x=element_text(angle = -90, hjust = 0),legend.position="top") + xlab('Employment category') + ylab('Proportion')+ scale_fill_discrete(name="Overweight or obese?, or, is BMI>25? ")

Interestingly, the proportion of respondents with BMI>25 (amount of ‘blue’ in a bar in the above chart) progressively increases from less than 50% for age 18-25, to ~70% for age group 40 to 44. Then it seems to stabilize through the ages of 45 to 70, and progressively decreases after that. We can hypothesize that this decrease above age 70 corresponds to the loss of weight in old age, possibly due to an increase in bodily diseases.
Research quesion 4:
Is the income level of participants positively correlated with their education level? Is this same for both sexes?
Vaiables:
- X_educag: Education level (4 levels)
## Factor w/ 4 levels "Did not graduate high school",..: 4 3 4 2 4 4 2 3 4 2 ...
levels(brfss2013$X_educag)
## [1] "Did not graduate high school"
## [2] "Graduated high school"
## [3] "Attended college or technical school"
## [4] "Graduated from college or technical school"
table(brfss2013$X_educag, useNA = "ifany")
##
## Did not graduate high school
## 42213
## Graduated high school
## 142968
## Attended college or technical school
## 134196
## Graduated from college or technical school
## 170118
## <NA>
## 2280
- X_incomg : Income category (5 levels)
## Factor w/ 5 levels "Less than $15,000",..: 5 5 5 5 4 5 NA 4 5 2 ...
levels(brfss2013$X_incomg)
## [1] "Less than $15,000" "$15,000 to less than $25,000"
## [3] "$25,000 to less than $35,000" "$35,000 to less than $50,000"
## [5] "$50,000 or more"
table(brfss2013$X_incomg, useNA = "ifany")
##
## Less than $15,000 $15,000 to less than $25,000
## 52235 76606
## $25,000 to less than $35,000 $35,000 to less than $50,000
## 48867 61508
## $50,000 or more <NA>
## 181131 71428
- Sex: Male/Female
## Factor w/ 2 levels "Male","Female": 2 2 2 2 1 2 2 2 1 2 ...
## [1] "Male" "Female"
table(brfss2013$sex, useNA = "ifany")
##
## Male Female <NA>
## 201313 290455 7
Surprisingly, number of female respondents is much higher than male respondents.
Now let’s look at the relationship between income and education level.
ggplot(data = brfss2013,aes(x=X_educag, fill=X_incomg)) + geom_bar(na.rm = TRUE, position = 'fill')+theme(axis.text.x=element_text(angle = -90, hjust = 0)) + xlab('Education') + ylab('Proportion')+
scale_fill_discrete(name="Income category")

Cleary income is positively correlated to education level. Of those who did not graduate high school, less than 5% have income > 50K per year, while of those who graduated college or technical school, about ~65% have income>50K. Let’s see if this behaviour is the same regardless of sex.
MF = c('Male','Female')
ggplot(data = brfss2013[brfss2013$sex %in% MF,],aes(x=X_educag, fill=X_incomg)) + geom_bar(na.rm = TRUE, position = 'fill')+theme(axis.text.x=element_text(angle = -90, hjust = 0)) + xlab('Education') + ylab('Proportion')+ scale_fill_discrete(name="Income category") + facet_wrap(~sex)

The plots look similar for both sexes, although females in general have lower income than males in every education category (look at red and pink in each bar).