library(ggplot2)
library(dplyr)
library(tidyr)
library(colorspace)
library(knitr)
load("brfss2013.RData")
The Behavior Risk Factor Surveillance System (BRFSS) is the U.S health survey system to collect nationwide population-based state-level data on a multitude of health behaviors and conditions. Currently, there are 50 states as well as the District of Columbia and three U.S territories participated in this project.
The Centers of Disease Control and Prevention (CDC) conducts monthly survey via telephone with a standardized questionaire. The questionaire has three parts:
CDC limits to collect information just from the U.S residents living in a private residence or college housing who are 18 years old or older. Once these conditions are met, a random adult will be selected for the interview. Thus, the study is obsevational and only shows associational relationship.
There are some issues with this sampling method:
This research is useful for planning, initiating, and supporting health promotion and disease prevention programs at the state and federal level, and monitoring progress toward achieving health objectives for the state and nation.
Research quesion 1:
Are income level and having health care coverage associated with the ability to see doctor because of the cost or not?
Variables:
Health care is a big concern for the U.s government. Doing analysis in this area will help the government knows whether the health care coverage will affect people’s ability to see doctor when they are sick or not. And then the government can adjust health care program to benefit its residents
Research quesion 2:
Age and eating habit have any association in having diabetes or not?
Variables:
The number of people having diabetes are increasing. Answering this question will help people realize the benefit of having a good eating habit for their health.
Reasearch question 3:
For the U.S residents who reported that they had joint pain and their limitation due to the joint pain, do physical activity levels have some association with the join pain?
Variables:
There are some people will be interested in this topic especially athletes. They may want to know if their physical activities will have any relationship with their join pain. From there, they can reschedule their activities so that they can avoid or reduce the join pain during exercises
Research question 4:
Are smoking habit and ages in association with the number of days they reported their health was not good?
Variables:
Many people want to know whether smoking affects their health and at which age that people will have more bad bad with their health if they smoke
We will start looking the data in general
dim(brfss2013)
## [1] 491775 330
The data has 491,775 observations and 330 variables and some variables carry NA value. To simply the EDA process, we will create a function called “complete” to just take observations that have a value for each variables
complete <- function(...) {
study <- brfss2013 %>%
select(...)
return(study[complete.cases(study),]) #* Create new data without missing values with "complete.cases" function*
}
Research quesion 1:
Are income level and having health care coverage associated with the ability to see doctor because of the cost or not?
in_med_h <- complete(income2, medcost, hlthpln1)
head(in_med_h)
## income2 medcost hlthpln1
## 1 Less than $75,000 No Yes
## 2 $75,000 or more No Yes
## 3 $75,000 or more No Yes
## 4 Less than $75,000 No Yes
## 5 Less than $50,000 No Yes
## 6 $75,000 or more No Yes
Line 2 shows “Yes” for hlthpln1
and “No” for medcost
. In general, when a person is covered by health insurance, they will be able to see doctor if they are sick. So, we should revalue the medcost column by using plyr function “revalue” to revalue medcost column
in_med_h$medcost <- plyr::revalue(in_med_h$medcost, c("Yes" = "No", "No" = "Yes"))
head(in_med_h)
## income2 medcost hlthpln1
## 1 Less than $75,000 Yes Yes
## 2 $75,000 or more Yes Yes
## 3 $75,000 or more Yes Yes
## 4 Less than $75,000 Yes Yes
## 5 Less than $50,000 Yes Yes
## 6 $75,000 or more Yes Yes
We will look at income2 and medcost and see if income affects people’s ability to go to see doctor or not
income_medcost <- in_med_h %>%
group_by(income2, medcost) %>%
summarize(Sum = n()) %>%
spread(medcost, Sum) %>%
mutate(Sum = Yes + No, `%No` = round(No/Sum*100, digit = 1))
head(income_medcost)
## # A tibble: 6 x 5
## # Groups: income2 [8]
## income2 No Yes Sum `%No`
## <fct> <int> <int> <int> <dbl>
## 1 Less than $10,000 7249 17916 25165 28.8
## 2 Less than $15,000 6804 19782 26586 25.6
## 3 Less than $20,000 8047 26589 34636 23.2
## 4 Less than $25,000 8262 33238 41500 19.9
## 5 Less than $35,000 7118 41542 48660 14.6
## 6 Less than $50,000 6462 54855 61317 10.5
According to the above table, %No
for lower income is higher than higher income. It means that people who earn less will not afford to see doctor when they are sick. For example, for income less than $10,000; 28.8% cannot afford to see doctor. But for someone makes more than $75,000; only 3.5% cannot afford to see doctor. Thus, income does affect the ability to see doctor. We will see if these figures change when having health insurance or not
income_medcost_in <- in_med_h %>%
filter(hlthpln1 == "Yes") %>%
group_by(income2, medcost) %>%
summarize(Sum = n()) %>%
spread(medcost, Sum) %>%
mutate(Sum = Yes + No, `%No` = round(No/Sum*100, digit = 1))
head(income_medcost_in, 8)
## # A tibble: 8 x 5
## # Groups: income2 [8]
## income2 No Yes Sum `%No`
## <fct> <int> <int> <int> <dbl>
## 1 Less than $10,000 3464 15178 18642 18.6
## 2 Less than $15,000 3681 17372 21053 17.5
## 3 Less than $20,000 3858 22750 26608 14.5
## 4 Less than $25,000 4205 29029 33234 12.7
## 5 Less than $35,000 4155 37502 41657 10
## 6 Less than $50,000 4289 51221 55510 7.7
## 7 Less than $75,000 3513 58154 61667 5.7
## 8 $75,000 or more 3457 109492 112949 3.1
According to the new table, having health insurance will increase the ability to see doctor. We can see that all figures in the column “%No” decreases. However, we want to have a look at these figures when people dont have health insurance
income_medcost_no_in <- in_med_h %>%
filter(hlthpln1 == 'No') %>%
group_by(income2, medcost) %>%
summarize(Sum = n()) %>%
spread(medcost, Sum) %>%
mutate(Sum = Yes + No, `%No` = round(No/Sum*100, digit = 1))
head(income_medcost_no_in, 8)
## # A tibble: 8 x 5
## # Groups: income2 [8]
## income2 No Yes Sum `%No`
## <fct> <int> <int> <int> <dbl>
## 1 Less than $10,000 3785 2738 6523 58
## 2 Less than $15,000 3123 2410 5533 56.4
## 3 Less than $20,000 4189 3839 8028 52.2
## 4 Less than $25,000 4057 4209 8266 49.1
## 5 Less than $35,000 2963 4040 7003 42.3
## 6 Less than $50,000 2173 3634 5807 37.4
## 7 Less than $75,000 1074 2323 3397 31.6
## 8 $75,000 or more 599 2166 2765 21.7
we can see that income and having health care coverage are in association with the the medcost(the ability to see doctor because of medical expenses). For example, 58% of respondents are not willing to see doctor if income is less than $10,000 and having no health coverage. Yet, if covered by health policy, just 18.6% is not afford to see doctor.
TO be clearer, we will plot a bar chart to show these relationships
ggplot(in_med_h, aes(x = income2, y = 1, fill = medcost)) +
geom_col(position = 'fill') +
scale_y_continuous(labels = c("0%", "25%", "50%", "75%", "100%")) +
scale_fill_manual(values = rainbow_hcl(2)) +
facet_grid(hlthpln1 ~.) +
labs(title = "Proportion of Respondents Who Could Afford to See Doctor by \n Income Levels and Whether They Have Health Care Coverage", x = "Income level", y = "% of Respondents", fill = "Affordability") +
theme(axis.text.x = element_text(angle = 35, hjust = 1), plot.title = element_text(hjust = 0.5))
From the plot, we can see that income and having health care insurance will affect the ability to see doctor. Lower income people are more likely not affrod to see doctor because of the cost.
Summary:
Question: Are income level and have health care coverage associated with the ability to see doctor because of the cost or not?
Narrative from the Exploratory Analysis: the answer is yes. Having insurance will increase the affordability of people to go to see doctor. Also, the higher income people made, the more affordable people go to see doctor when they are sick. It looks like that lower income struggle with getting treatment when they are sick so the government should have a look at this issue to offer different type of health policy based on income so that everyone can have insurance and could afford for the medial cost.
Research quesion 2:
Age and eating habit have any association in having diabetes or not? We will first prepare the data for analysis
diabetes <- complete(X_age_g, diabete3, X_vegesum)
diabate3
has 4 observations: “Yes”, “No”, “No, pre-diabetes or borderline diabetes” and “Yes, but female told only during pregnancy”. We will just focus on Yes and No for this analysis
sum <- diabetes %>%
filter(diabete3 == c("Yes", "No")) %>%
group_by(diabete3, X_age_g) %>%
summarize(Sum = n()) %>%
spread(diabete3, Sum) %>%
mutate(Sum = Yes + No, `%Yes` = round(Yes/Sum*100, digit = 1))
head(sum)
## # A tibble: 6 x 5
## X_age_g Yes No Sum `%Yes`
## <fct> <int> <int> <int> <dbl>
## 1 Age 18 to 24 129 12199 12328 1
## 2 Age 25 to 34 443 22057 22500 2
## 3 Age 35 to 44 1363 25981 27344 5
## 4 Age 45 to 54 3912 34569 38481 10.2
## 5 Age 55 to 64 8130 42052 50182 16.2
## 6 Age 65 or older 15641 57878 73519 21.3
We can see that percentage of people having diabetes increases by ages. Only 1% people from age 18 to 24 having diabetes while 21.3% people from the age of 65 or older said that they have been struggling with diabetes. We will visualize the relationship between diabetes and the consumption of vegetables to see if increasing more vegetables everday will help with the disease or not
aa <- diabetes %>%
filter(diabete3 == c("Yes", "No"))
ggplot(aa, aes(x = X_age_g, y = X_vegesum)) +
geom_boxplot(color = "brown") +
facet_grid(diabete3 ~.) +
labs(title = "Diabetes In Association With The Consumpsion of Vegetables", y = "Vegetables consumed per day", x = "Ages")
According to the plot, having more vegetables everyday will lower the chance to have diabetes
Summary:
Question: Age and eating habit have any association in having diabetes or not?
Narrative from the Exploratory Analysis: the answer is yes. People who are getting old will have higher chance to have diabetes but increasing more vegetables everday will help with the disease.
Research question 3:
For the U.S residents who reported that they had joint pain and their limitation due to the joint pain, do physical activity levels have some association with the pain?
We will just focus on respondents who reported that they had some limitations due to join pain and what type of activity they participated
join_li_phy <- complete(X_pacat1, joinpain, lmtjoin3) %>%
filter(joinpain != 0 & lmtjoin3 == "Yes")
head(join_li_phy)
## X_pacat1 joinpain lmtjoin3
## 1 Inactive 7 Yes
## 2 Inactive 5 Yes
## 3 Insufficiently active 8 Yes
## 4 Inactive 8 Yes
## 5 Insufficiently active 5 Yes
## 6 Highly active 7 Yes
Some people may think that they will not have any join pain if they do not do any physical activies but according to the table, we can see that some respondents said that they had join pain even though they are inactive. To demtermine if there is any association between physical activity and joint pain score, we will do summary statistics
sum_stat <- join_li_phy %>%
group_by(X_pacat1) %>%
summarize(Q1 = quantile(joinpain, .25), MEAN = mean(joinpain), MEDIAN = median(joinpain), Q3 = quantile(joinpain, .75), IQR = IQR(joinpain), SDEV = sd(joinpain)) %>%
mutate(SKEW = ifelse(MEAN > MEDIAN, "RIGHT", "LEFT"))
sum_stat
## # A tibble: 4 x 8
## X_pacat1 Q1 MEAN MEDIAN Q3 IQR SDEV SKEW
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 Highly active 4 5.49 5 7 3 2.26 RIGHT
## 2 Active 4 5.62 5 7 3 2.25 RIGHT
## 3 Insufficiently active 4 5.96 6 8 4 2.27 LEFT
## 4 Inactive 5 6.68 7 8 3 2.22 LEFT
We can see that the standard deviation of four categories are pretty similar to each other. Yet, the Q1, mean, median and Q3 are difeerent for these activities. For example, Hightly active and Active are right skew so more joint pain values are less than 5.5 and 5.6, respectively. However, the insufficiently active and inactive are left skew. The insufficiently active is slightly symmetric (mean = 5.965, median = 6).
To have a better view about these relations, we will do a density plot as below:
ggplot(join_li_phy, aes(x = joinpain, color = X_pacat1)) +
geom_density(adjust = 2) +
scale_color_manual(values = rev(heat_hcl(5, h = c(0, -100), c = c(80,40), l = c(40, 75), power = 1))) +
scale_x_continuous(breaks = c(1:10)) +
labs(title = "Distribution of Reported Joint Pain Levels \n by Physical Activity Level", y = "Density", x = "Joint Pain Level", col = "Physical Activity Level")
The density plot reflects the summary statistics as Inactive distribution is denser at higher joint pain scores compared to the other categories. High active distribution is denser at the low joint pain than the other three
join_li_phy_sum <- join_li_phy %>%
group_by(X_pacat1, joinpain) %>%
summarize(Sum = n())
cc <- scales::div_gradient_pal("brown", "purple", "pink", "Lab")(seq(0, 1, length.out = 10))
ggplot(join_li_phy_sum, aes(x = factor(X_pacat1, labels = c("Inactive", "Insufficiently active", "Active", "Highly active")), y = 1, fill = factor(joinpain, levels = c(10:1)))) +
geom_col(position = 'fill') +
scale_fill_manual(values = cc, labels = c('10-Worst pain', 9:2, '1-Least pain')) +
scale_y_continuous(labels = c("0%", "25%", "50%", "75%", "100%")) +
labs(title = "Proportion of Respondents Suffered from Joint Pain \n from Activity Levels", y = "% Respondents", x = "Physical Activity Levels", fill = "Pain Levels")
From the chart, we can see that % respondents who reported that they had worst pain is the highest for “Inactive”, followed by “Insufficiently active”, then “Active” and “Highly active”
Summary:
Question: For the U.S residents who reported that they had joint pain and their limitation due to the joint pain, do physical activity levels have some association with the join pain?
Narrative from the Exploratory Analysis: The answer is yes. Increased physical activies associated with the reduction of join pain level
Research question 4:
Are smoking habit and ages in association with the number of days they reported their health was not good?
smoke_phy <- complete(X_age_g, physhlth, X_smoker3)
head(smoke_phy)
## X_age_g physhlth X_smoker3
## 1 Age 55 to 64 30 Former smoker
## 2 Age 45 to 54 0 Never smoked
## 3 Age 55 to 64 3 Current smoker - now smokes some days
## 4 Age 55 to 64 2 Never smoked
## 5 Age 65 or older 10 Former smoker
## 6 Age 45 to 54 0 Never smoked
We will revalue X_smoker3
smoke_phy$X_smoker3 <- plyr::revalue(smoke_phy$X_smoker3, c("Current smoker - now smokes every day" = "Smokes every day", "Current smoker - now smokes some days" = "Smokes some days"))
head(smoke_phy)
## X_age_g physhlth X_smoker3
## 1 Age 55 to 64 30 Former smoker
## 2 Age 45 to 54 0 Never smoked
## 3 Age 55 to 64 3 Smokes some days
## 4 Age 55 to 64 2 Never smoked
## 5 Age 65 or older 10 Former smoker
## 6 Age 45 to 54 0 Never smoked
we will look at these three variables but exclude respondents who reported have zero bad day with their physical health. Then, we will perform summary statistics
smoke_phy_sum <- smoke_phy %>%
filter(physhlth != 0) %>%
group_by(X_age_g, X_smoker3) %>%
summarize(Q1 = quantile(physhlth, .25), MEAN = mean(physhlth), MEDIAN = median(physhlth), IQR = IQR(physhlth), Q3 = quantile(physhlth, .75), SDEV = sd(physhlth))
head(smoke_phy_sum)
## # A tibble: 6 x 8
## # Groups: X_age_g [2]
## X_age_g X_smoker3 Q1 MEAN MEDIAN IQR Q3 SDEV
## <fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Age 18 to 24 Smokes every day 2 8.52 5 8 10 8.88
## 2 Age 18 to 24 Smokes some days 2 7.24 4 6 8 8.37
## 3 Age 18 to 24 Former smoker 2 6.83 3 5 7 7.96
## 4 Age 18 to 24 Never smoked 2 5.79 3 5 7 7.15
## 5 Age 25 to 34 Smokes every day 2 10.2 5 13 15 10.1
## 6 Age 25 to 34 Smokes some days 2 9.22 5 12 14 10.0
We can see that for those smoke everyday, mean and median are higher than former smoker or never smoked people. Especially for old people (age 55 or more), the mean and median are higher than other age group. It means that smoking people experice more bad day with their physical health than former smoker and never smoked people. We will plot a line chart to see more about these relationships
ggplot(smoke_phy_sum, aes(x = X_smoker3, y = MEAN, col = X_age_g, group = X_age_g)) +
geom_line(lwd = 2) +
scale_y_continuous(breaks = c(0:20)) +
scale_color_manual(values = heat_hcl(6, h = c(20, -100), c = c(80,40), l = c(40, 75), power = 1))+
labs(title = "Average Number of Days Respondent Experience Bad Day \n with Their Physical Health by Smoking Habit and Age", x = "Smoking Habit", y = "Average Number of Days",color = "Age")
We can see from the chart that the more age people have and the more smoking they do, the more day they will experience with their physical health. All of lines (except age 18-24) peak at “Smokes some days” so it means that for those people even though they just smoke some day, it is a high change for them to have more bad day with their physical health than other people.
Summary:
Question: Are smoking habit and ages in association with the number of days people reported to have bad day with their health?
Narrative from the Exploratory Analysis: The answer is Yes. Smoking and older age associated with more bad days people have with their physical health