Setup

Load packages

library(ggplot2)
library(dplyr)

Load data

load("brfss2013.RData")

Part 1: Data

The Behavioral Risk Factor Surveillance System(BRFSS) is a project administered by CDC. All of the states in the United States participate in this project. The main goal is to measure and collect behavioral risk factors data from people over 18 years old and eventually find ways to improve health of people. Here we are using year 2013 data, and data was collected by phone call surveys, including both landline and cellular telephones. For landline surveys, randomly selected adult in a household was chosen as sample data. On the other hand, only adult who participates by using a cellular phone was selected. Generally speaking, we can say random sampling was used in this data collection process, and the research results can be generalized to people in the states. However, we should be aware that there’s a population that doesn’t have either landline or cellular phone. Lastly, this is only a observation research and there’s no random assignment for experiment, so we can only say there’s a correlation but not a causation.


Part 2: Research questions

Research quesion 1: The life style for modern society is full of stress, and experts say depression is on the rise even for teenagers. Here I want to explore factors that might play a role in depression. First I want to know if household income level correlates with depression or not. We all know that rich people are getting richer and poor people are poorer, it will be interesting to see if depression is less likely happen in richer population. Later I want to explore the relationship between education level and depression. I would like to know if people with higher education degree know better dealing with stress and hence have lower depression rate.

Research quesion 2: Diabetes is a serious problem in the U.S. According to CDC’s report, almost 10% Americans have this health issue. Here I want to explore the correlations between sex, excercise or not, and diabetes. I want to find out if diabetes is more common in men or women, and we should pay more attention in certain gender. Also, regular exercise can help improve immune system and overall health. Let’s find out if the data supports this idea or not.

Research quesion 3: Obesity is also a common health issue and can result in numerous diseases, increase risk of certain type of cancer, coronary artery disease, diabetes, stroke, etc. Roughly there’s over 30% of population in the U.S. have obesity. Again, I’m gonna explore relationships between obesity and income level. It would be interesting to see if people with higher income tend to have lower obesity rate or not. I’ve heard a lot of people saying usually richer people have more time to excercise and they care more about their body shape and fitness. Let’s see what data tells us.


Part 3: Exploratory data analysis

Research quesion 1:

Relationship between income level and depression. In below plot, we can see in each income level, the number of people that has or had depression disorder are pretty close. However, there’s a huge difference in total number of people in each income level. It seems like people have lower household income are more likely to have depression.

brfss2013_1 <- brfss2013 %>% filter(!is.na(income2), !is.na(addepev2))

ggplot(brfss2013_1) + geom_bar(aes(x=income2, fill=addepev2)) + coord_flip() + labs( x = "Income Level") + guides(fill = guide_legend(title = "Told Depression"))

We further calculate the percetange of depression in each group. As shown in below table, the percentage of having depression is gradually decreasing as the income level increasing. It’s likely that people with lower income have more stress and thus are more likely to have depression issue. It’s also possible that having depression cause a person can’t find a well paid job and thus have lower income.

plot1<- brfss2013_1 %>% group_by(income2,addepev2) %>% summarise(Yes_No_Count = n()) %>% ungroup() %>%
  group_by(income2) %>% mutate(all_count = sum(Yes_No_Count), pct = round(Yes_No_Count/all_count,2)) %>%
  filter(addepev2 == "Yes")
plot1
## # A tibble: 8 x 5
## # Groups:   income2 [8]
##   income2           addepev2 Yes_No_Count all_count   pct
##   <fct>             <fct>           <int>     <int> <dbl>
## 1 Less than $10,000 Yes              9240     25223  0.37
## 2 Less than $15,000 Yes              8731     26623  0.33
## 3 Less than $20,000 Yes              9152     34716  0.26
## 4 Less than $25,000 Yes              9717     41528  0.23
## 5 Less than $35,000 Yes              9609     48670  0.2 
## 6 Less than $50,000 Yes             11056     61315  0.18
## 7 Less than $75,000 Yes             10894     65081  0.17
## 8 $75,000 or more   Yes             15448    115686  0.13
#Visualization

ggplot(plot1) + geom_bar(aes(x = reorder(income2,pct), y = pct),stat = "identity",fill = "tomato1") + coord_flip() + labs(x = "Household income level", y = "Depression percentage",title = "Decrease of Depression rate as Imcome level goes up ")

Education level and depression. The same analysis has been used. Here we can see the data for lower educated is relatively small, we can not draw conclusion at this point.

brfss2013_1 <- brfss2013_1 %>% filter(!is.na(educa))

ggplot(brfss2013_1) + geom_bar(aes(x=educa, fill=addepev2)) + coord_flip() + labs( x = "edu level") + guides(fill = guide_legend(title = "Told Depression"))

Looking at the percentage of depression for each education level, we can tell people having college 4 years or more degress have only 17% of depression chance. It’s possible that people with higher education level are less likely to have depression disorder. However, like I mentioned earlier, we will need more data of lower education level to justify this idea.

plot2 <- brfss2013_1 %>% group_by(educa,addepev2) %>% summarise(Yes_No_Count = n()) %>% ungroup() %>%
  group_by(educa) %>% mutate(all_count = sum(Yes_No_Count), pct = round(Yes_No_Count/all_count,2)) %>%
  filter(addepev2 == "Yes")
plot2
## # A tibble: 6 x 5
## # Groups:   educa [6]
##   educa                              addepev2 Yes_No_Count all_count   pct
##   <fct>                              <fct>           <int>     <int> <dbl>
## 1 Never attended school or only kin~ Yes                88       450  0.2 
## 2 Grades 1 through 8 (Elementary)    Yes              2350     10168  0.23
## 3 Grades 9 though 11 (Some high sch~ Yes              6077     22373  0.27
## 4 Grade 12 or GED (High school grad~ Yes             23794    118216  0.2 
## 5 College 1 year to 3 years (Some c~ Yes             25803    115620  0.22
## 6 College 4 years or more (College ~ Yes             25647    151500  0.17
ggplot(plot2) + geom_bar(aes(x = educa, y = pct),stat = "identity",fill = "tomato1") + coord_flip()

Finally, let’s put all three variables together and see what are the relationships between them. We now look at depression percentage for different education level in dependence of household income level. Here we found an interesting result that in most education groups, people are more likely to have depression when their income level is relatively lower. However, people have college or more education level, the result is opposite and people are more likely to have depression while their income level getting higher.

plot3<- brfss2013_1 %>% group_by(income2,educa,addepev2) %>% summarise(Yes_No_Count = n()) %>% ungroup() %>%
  group_by(income2) %>% mutate(all_count = sum(Yes_No_Count), pct = round(Yes_No_Count/all_count,2))%>%
  filter(addepev2 == "Yes")

levels(plot3$educa)
## [1] "Never attended school or only kindergarten"                  
## [2] "Grades 1 through 8 (Elementary)"                             
## [3] "Grades 9 though 11 (Some high school)"                       
## [4] "Grade 12 or GED (High school graduate)"                      
## [5] "College 1 year to 3 years (Some college or technical school)"
## [6] "College 4 years or more (College graduate)"
levels(plot3$educa) <- c("Never","Grades 1 to 8", "Grades 9 to 11", "Grades 12 or GED", "Some College","College graduate")

ggplot(plot3) + geom_bar(aes(x=income2,y=pct,fill = income2),stat="identity")+facet_wrap(~educa)+theme(axis.text.x = element_blank(),axis.ticks.x = element_blank())

Research quesion 2:

Sex and diabetes. First I want to explore whether men or women are more likely to have diabetes. In below bar graph, we can see women have more total observations than men, and number of people who have diabetes is slightly higher in women than men as well.

brfss2013_1 <- brfss2013_1 %>% filter(!is.na(diabete3))

ggplot(brfss2013_1) + geom_bar(aes(x=sex, fill=diabete3)) +  labs( x = "sex") + guides(fill = guide_legend(title = "Told Diabetes"))

We calculate the diabetes percentage in each group, and found the numbers are almost the same. Based on the data, we can say the probability to have diabetes is the same for men and women in the U.S., around 13%.

brfss2013_1 %>% group_by(sex,diabete3) %>% summarise(Yes_No_Count = n()) %>% ungroup() %>%
  group_by(sex) %>% mutate(all_count = sum(Yes_No_Count), pct = round(Yes_No_Count/all_count,2))
## # A tibble: 7 x 5
## # Groups:   sex [2]
##   sex    diabete3                             Yes_No_Count all_count   pct
##   <fct>  <fct>                                       <int>     <int> <dbl>
## 1 Male   Yes                                         23138    176948  0.13
## 2 Male   No                                         150829    176948  0.85
## 3 Male   No, pre-diabetes or borderline diab~         2981    176948  0.02
## 4 Female Yes                                         29236    240945  0.12
## 5 Female Yes, but female told only during pr~         3965    240945  0.02
## 6 Female No                                         203599    240945  0.85
## 7 Female No, pre-diabetes or borderline diab~         4145    240945  0.02

Exercise and diabetes. In below graph we can see one fourth of people don’t even exercise in the last 30 days.

brfss2013_1 <- brfss2013_1 %>% filter(!is.na(exerany2))

options(scipen = 999)

ggplot(brfss2013_1) + geom_bar(aes(x=exerany2, fill=diabete3)) +  labs( x = "exercise in 30 days") + guides(fill = guide_legend(title = "Told Diabetes"))

Let’s calculate diabetes rate for people do exercise and people don’t exercise. As below table shows, people who do exercise in the last 30 days, the diabetes rate is 10%. However, for people don’t exercise, the rate is almost 2 fold as 19%, higher than average 13%, the diabetes rate for both men and women as shown in previous table. Even though we can not say exercise can reduce the risks of having diabetes here, we see the relationship between them and hopefully can provide more motivation for people to do exercise regularly.

brfss2013_1 %>% group_by(exerany2,diabete3) %>% summarise(Yes_No_Count = n()) %>% ungroup() %>%
  group_by(exerany2) %>% mutate(all_count = sum(Yes_No_Count), pct = round(Yes_No_Count/all_count,2)) %>% filter(diabete3 == "Yes")
## # A tibble: 2 x 5
## # Groups:   exerany2 [2]
##   exerany2 diabete3 Yes_No_Count all_count   pct
##   <fct>    <fct>           <int>     <int> <dbl>
## 1 Yes      Yes             30133    288391  0.1 
## 2 No       Yes             19492    104871  0.19

Putting together, we can see the percentage of having diabetes is higher in both male and female, but the association seems more strong in female group (almost double the chance 6% for excercise vs. 11% for no exercise). Exercise seems to be more important for female in reducing chances to get diabetes.

plot4<- brfss2013_1 %>% group_by(exerany2,sex,diabete3) %>% summarise(Yes_No_Count = n()) %>% 
  ungroup() %>% group_by(exerany2) %>% 
  mutate(all_count = sum(Yes_No_Count), pct = round(Yes_No_Count/all_count,2))%>%
  filter(diabete3 == "Yes")
plot4
## # A tibble: 4 x 6
## # Groups:   exerany2 [2]
##   exerany2 sex    diabete3 Yes_No_Count all_count   pct
##   <fct>    <fct>  <fct>           <int>     <int> <dbl>
## 1 Yes      Male   Yes             14249    288391  0.05
## 2 Yes      Female Yes             15884    288391  0.06
## 3 No       Male   Yes              7570    104871  0.07
## 4 No       Female Yes             11922    104871  0.11
ggplot(plot4) + geom_bar(aes(x=exerany2,y=pct,fill=exerany2),stat="identity")+facet_wrap(~sex)

Research quesion 3:

Obesity and income level. Let’s first look at how many people are overweight or having obesity in each income level group. If BMI >= 25, then it’s categorised into Yes group. Here we can the blue region “Yes”, increases as the income level increase. However, the total number of people are also increasing. So we need to calculate percentage to better understand the trend.

brfss2013_1 <- brfss2013_1 %>% filter(!is.na(X_rfbmi5))

ggplot(brfss2013_1) + geom_bar(aes(x=income2, fill=X_rfbmi5)) +  labs( x = "income level") + guides(fill = guide_legend(title = "BMI >= 25"))+ coord_flip()

In below table, we summarise and calculate the percentage of obesity people in each income level. The number is around 67% percentage for each group, and there’s is a slightly lower number 63% in household income level $75,000 or more. It’s hard to conclude that richer people are less likely to have obesity problem given what we have. We may need to further dissect the “Yes”/“No” since BMI >= 25 is too wide and we can’t see any difference between each group. If we classify BMI >= 30 as obesity for example, we might find something different.

brfss2013_1 %>% group_by(income2, X_rfbmi5) %>% summarise(Yes_No_Count = n()) %>% ungroup() %>%
  group_by(income2) %>% mutate(all_count = sum(Yes_No_Count), pct = round(Yes_No_Count/all_count,2))%>%
  filter(X_rfbmi5 == "Yes")
## # A tibble: 8 x 5
## # Groups:   income2 [8]
##   income2           X_rfbmi5 Yes_No_Count all_count   pct
##   <fct>             <fct>           <int>     <int> <dbl>
## 1 Less than $10,000 Yes             14419     21938  0.66
## 2 Less than $15,000 Yes             15909     23537  0.68
## 3 Less than $20,000 Yes             20445     30551  0.67
## 4 Less than $25,000 Yes             24723     37001  0.67
## 5 Less than $35,000 Yes             29214     43730  0.67
## 6 Less than $50,000 Yes             37801     55824  0.68
## 7 Less than $75,000 Yes             40184     59665  0.67
## 8 $75,000 or more   Yes             66967    106634  0.63