library(ggplot2)
library(dplyr)load("brfss2013.RData")The Behavioral Risk Factor Surveillance System(BRFSS) is a project administered by CDC. All of the states in the United States participate in this project. The main goal is to measure and collect behavioral risk factors data from people over 18 years old and eventually find ways to improve health of people. Here we are using year 2013 data, and data was collected by phone call surveys, including both landline and cellular telephones. For landline surveys, randomly selected adult in a household was chosen as sample data. On the other hand, only adult who participates by using a cellular phone was selected. Generally speaking, we can say random sampling was used in this data collection process, and the research results can be generalized to people in the states. However, we should be aware that there’s a population that doesn’t have either landline or cellular phone. Lastly, this is only a observation research and there’s no random assignment for experiment, so we can only say there’s a correlation but not a causation.
Research quesion 1: The life style for modern society is full of stress, and experts say depression is on the rise even for teenagers. Here I want to explore factors that might play a role in depression. First I want to know if household income level correlates with depression or not. We all know that rich people are getting richer and poor people are poorer, it will be interesting to see if depression is less likely happen in richer population. Later I want to explore the relationship between education level and depression. I would like to know if people with higher education degree know better dealing with stress and hence have lower depression rate.
Research quesion 2: Diabetes is a serious problem in the U.S. According to CDC’s report, almost 10% Americans have this health issue. Here I want to explore the correlations between sex, excercise or not, and diabetes. I want to find out if diabetes is more common in men or women, and we should pay more attention in certain gender. Also, regular exercise can help improve immune system and overall health. Let’s find out if the data supports this idea or not.
Research quesion 3: Obesity is also a common health issue and can result in numerous diseases, increase risk of certain type of cancer, coronary artery disease, diabetes, stroke, etc. Roughly there’s over 30% of population in the U.S. have obesity. Again, I’m gonna explore relationships between obesity and income level. It would be interesting to see if people with higher income tend to have lower obesity rate or not. I’ve heard a lot of people saying usually richer people have more time to excercise and they care more about their body shape and fitness. Let’s see what data tells us.
Research quesion 1:
Relationship between income level and depression. In below plot, we can see in each income level, the number of people that has or had depression disorder are pretty close. However, there’s a huge difference in total number of people in each income level. It seems like people have lower household income are more likely to have depression.
brfss2013_1 <- brfss2013 %>% filter(!is.na(income2), !is.na(addepev2))
ggplot(brfss2013_1) + geom_bar(aes(x=income2, fill=addepev2)) + coord_flip() + labs( x = "Income Level") + guides(fill = guide_legend(title = "Told Depression"))We further calculate the percetange of depression in each group. As shown in below table, the percentage of having depression is gradually decreasing as the income level increasing. It’s likely that people with lower income have more stress and thus are more likely to have depression issue. It’s also possible that having depression cause a person can’t find a well paid job and thus have lower income.
plot1<- brfss2013_1 %>% group_by(income2,addepev2) %>% summarise(Yes_No_Count = n()) %>% ungroup() %>%
group_by(income2) %>% mutate(all_count = sum(Yes_No_Count), pct = round(Yes_No_Count/all_count,2)) %>%
filter(addepev2 == "Yes")
plot1## # A tibble: 8 x 5
## # Groups: income2 [8]
## income2 addepev2 Yes_No_Count all_count pct
## <fct> <fct> <int> <int> <dbl>
## 1 Less than $10,000 Yes 9240 25223 0.37
## 2 Less than $15,000 Yes 8731 26623 0.33
## 3 Less than $20,000 Yes 9152 34716 0.26
## 4 Less than $25,000 Yes 9717 41528 0.23
## 5 Less than $35,000 Yes 9609 48670 0.2
## 6 Less than $50,000 Yes 11056 61315 0.18
## 7 Less than $75,000 Yes 10894 65081 0.17
## 8 $75,000 or more Yes 15448 115686 0.13
#Visualization
ggplot(plot1) + geom_bar(aes(x = reorder(income2,pct), y = pct),stat = "identity",fill = "tomato1") + coord_flip() + labs(x = "Household income level", y = "Depression percentage",title = "Decrease of Depression rate as Imcome level goes up ")Education level and depression. The same analysis has been used. Here we can see the data for lower educated is relatively small, we can not draw conclusion at this point.
brfss2013_1 <- brfss2013_1 %>% filter(!is.na(educa))
ggplot(brfss2013_1) + geom_bar(aes(x=educa, fill=addepev2)) + coord_flip() + labs( x = "edu level") + guides(fill = guide_legend(title = "Told Depression"))Looking at the percentage of depression for each education level, we can tell people having college 4 years or more degress have only 17% of depression chance. It’s possible that people with higher education level are less likely to have depression disorder. However, like I mentioned earlier, we will need more data of lower education level to justify this idea.
plot2 <- brfss2013_1 %>% group_by(educa,addepev2) %>% summarise(Yes_No_Count = n()) %>% ungroup() %>%
group_by(educa) %>% mutate(all_count = sum(Yes_No_Count), pct = round(Yes_No_Count/all_count,2)) %>%
filter(addepev2 == "Yes")
plot2## # A tibble: 6 x 5
## # Groups: educa [6]
## educa addepev2 Yes_No_Count all_count pct
## <fct> <fct> <int> <int> <dbl>
## 1 Never attended school or only kin~ Yes 88 450 0.2
## 2 Grades 1 through 8 (Elementary) Yes 2350 10168 0.23
## 3 Grades 9 though 11 (Some high sch~ Yes 6077 22373 0.27
## 4 Grade 12 or GED (High school grad~ Yes 23794 118216 0.2
## 5 College 1 year to 3 years (Some c~ Yes 25803 115620 0.22
## 6 College 4 years or more (College ~ Yes 25647 151500 0.17
ggplot(plot2) + geom_bar(aes(x = educa, y = pct),stat = "identity",fill = "tomato1") + coord_flip()Finally, let’s put all three variables together and see what are the relationships between them. We now look at depression percentage for different education level in dependence of household income level. Here we found an interesting result that in most education groups, people are more likely to have depression when their income level is relatively lower. However, people have college or more education level, the result is opposite and people are more likely to have depression while their income level getting higher.
plot3<- brfss2013_1 %>% group_by(income2,educa,addepev2) %>% summarise(Yes_No_Count = n()) %>% ungroup() %>%
group_by(income2) %>% mutate(all_count = sum(Yes_No_Count), pct = round(Yes_No_Count/all_count,2))%>%
filter(addepev2 == "Yes")
levels(plot3$educa)## [1] "Never attended school or only kindergarten"
## [2] "Grades 1 through 8 (Elementary)"
## [3] "Grades 9 though 11 (Some high school)"
## [4] "Grade 12 or GED (High school graduate)"
## [5] "College 1 year to 3 years (Some college or technical school)"
## [6] "College 4 years or more (College graduate)"
levels(plot3$educa) <- c("Never","Grades 1 to 8", "Grades 9 to 11", "Grades 12 or GED", "Some College","College graduate")
ggplot(plot3) + geom_bar(aes(x=income2,y=pct,fill = income2),stat="identity")+facet_wrap(~educa)+theme(axis.text.x = element_blank(),axis.ticks.x = element_blank())Research quesion 2:
Sex and diabetes. First I want to explore whether men or women are more likely to have diabetes. In below bar graph, we can see women have more total observations than men, and number of people who have diabetes is slightly higher in women than men as well.
brfss2013_1 <- brfss2013_1 %>% filter(!is.na(diabete3))
ggplot(brfss2013_1) + geom_bar(aes(x=sex, fill=diabete3)) + labs( x = "sex") + guides(fill = guide_legend(title = "Told Diabetes"))We calculate the diabetes percentage in each group, and found the numbers are almost the same. Based on the data, we can say the probability to have diabetes is the same for men and women in the U.S., around 13%.
brfss2013_1 %>% group_by(sex,diabete3) %>% summarise(Yes_No_Count = n()) %>% ungroup() %>%
group_by(sex) %>% mutate(all_count = sum(Yes_No_Count), pct = round(Yes_No_Count/all_count,2))## # A tibble: 7 x 5
## # Groups: sex [2]
## sex diabete3 Yes_No_Count all_count pct
## <fct> <fct> <int> <int> <dbl>
## 1 Male Yes 23138 176948 0.13
## 2 Male No 150829 176948 0.85
## 3 Male No, pre-diabetes or borderline diab~ 2981 176948 0.02
## 4 Female Yes 29236 240945 0.12
## 5 Female Yes, but female told only during pr~ 3965 240945 0.02
## 6 Female No 203599 240945 0.85
## 7 Female No, pre-diabetes or borderline diab~ 4145 240945 0.02
Exercise and diabetes. In below graph we can see one fourth of people don’t even exercise in the last 30 days.
brfss2013_1 <- brfss2013_1 %>% filter(!is.na(exerany2))
options(scipen = 999)
ggplot(brfss2013_1) + geom_bar(aes(x=exerany2, fill=diabete3)) + labs( x = "exercise in 30 days") + guides(fill = guide_legend(title = "Told Diabetes"))Let’s calculate diabetes rate for people do exercise and people don’t exercise. As below table shows, people who do exercise in the last 30 days, the diabetes rate is 10%. However, for people don’t exercise, the rate is almost 2 fold as 19%, higher than average 13%, the diabetes rate for both men and women as shown in previous table. Even though we can not say exercise can reduce the risks of having diabetes here, we see the relationship between them and hopefully can provide more motivation for people to do exercise regularly.
brfss2013_1 %>% group_by(exerany2,diabete3) %>% summarise(Yes_No_Count = n()) %>% ungroup() %>%
group_by(exerany2) %>% mutate(all_count = sum(Yes_No_Count), pct = round(Yes_No_Count/all_count,2)) %>% filter(diabete3 == "Yes")## # A tibble: 2 x 5
## # Groups: exerany2 [2]
## exerany2 diabete3 Yes_No_Count all_count pct
## <fct> <fct> <int> <int> <dbl>
## 1 Yes Yes 30133 288391 0.1
## 2 No Yes 19492 104871 0.19
Putting together, we can see the percentage of having diabetes is higher in both male and female, but the association seems more strong in female group (almost double the chance 6% for excercise vs. 11% for no exercise). Exercise seems to be more important for female in reducing chances to get diabetes.
plot4<- brfss2013_1 %>% group_by(exerany2,sex,diabete3) %>% summarise(Yes_No_Count = n()) %>%
ungroup() %>% group_by(exerany2) %>%
mutate(all_count = sum(Yes_No_Count), pct = round(Yes_No_Count/all_count,2))%>%
filter(diabete3 == "Yes")
plot4## # A tibble: 4 x 6
## # Groups: exerany2 [2]
## exerany2 sex diabete3 Yes_No_Count all_count pct
## <fct> <fct> <fct> <int> <int> <dbl>
## 1 Yes Male Yes 14249 288391 0.05
## 2 Yes Female Yes 15884 288391 0.06
## 3 No Male Yes 7570 104871 0.07
## 4 No Female Yes 11922 104871 0.11
ggplot(plot4) + geom_bar(aes(x=exerany2,y=pct,fill=exerany2),stat="identity")+facet_wrap(~sex)Research quesion 3:
Obesity and income level. Let’s first look at how many people are overweight or having obesity in each income level group. If BMI >= 25, then it’s categorised into Yes group. Here we can the blue region “Yes”, increases as the income level increase. However, the total number of people are also increasing. So we need to calculate percentage to better understand the trend.
brfss2013_1 <- brfss2013_1 %>% filter(!is.na(X_rfbmi5))
ggplot(brfss2013_1) + geom_bar(aes(x=income2, fill=X_rfbmi5)) + labs( x = "income level") + guides(fill = guide_legend(title = "BMI >= 25"))+ coord_flip()In below table, we summarise and calculate the percentage of obesity people in each income level. The number is around 67% percentage for each group, and there’s is a slightly lower number 63% in household income level $75,000 or more. It’s hard to conclude that richer people are less likely to have obesity problem given what we have. We may need to further dissect the “Yes”/“No” since BMI >= 25 is too wide and we can’t see any difference between each group. If we classify BMI >= 30 as obesity for example, we might find something different.
brfss2013_1 %>% group_by(income2, X_rfbmi5) %>% summarise(Yes_No_Count = n()) %>% ungroup() %>%
group_by(income2) %>% mutate(all_count = sum(Yes_No_Count), pct = round(Yes_No_Count/all_count,2))%>%
filter(X_rfbmi5 == "Yes")## # A tibble: 8 x 5
## # Groups: income2 [8]
## income2 X_rfbmi5 Yes_No_Count all_count pct
## <fct> <fct> <int> <int> <dbl>
## 1 Less than $10,000 Yes 14419 21938 0.66
## 2 Less than $15,000 Yes 15909 23537 0.68
## 3 Less than $20,000 Yes 20445 30551 0.67
## 4 Less than $25,000 Yes 24723 37001 0.67
## 5 Less than $35,000 Yes 29214 43730 0.67
## 6 Less than $50,000 Yes 37801 55824 0.68
## 7 Less than $75,000 Yes 40184 59665 0.67
## 8 $75,000 or more Yes 66967 106634 0.63