Setup

Load packages

library(ggplot2)
library(dplyr)
library(RColorBrewer)
library(knitr)

setwd("/Users/Sheena/MOOC/Coursera/Statistics_with_R/1_Intro_to_Probability_and_Data/Project/")

Load data

load("brfss2013.RData")

Part 1: Data

From codebook of The Behavioral Risk Factor Surveillance Systems:

BRFSS is an ongoing surveillance system designed to measure behavioral risk factors for the non-institutionalized adult population (18 years of age and older) residing in the US. The BRFSS objective is to collect uniform, state-specific data on preventive health practices and risk behaviors that are linked to chronic diseases, injuries, and preventable infectious diseases that affect the adult population. Factors assessed by the BRFSS in 2013 include tobacco use, HIV/AIDS knowledge and prevention, exercise, immunization, health status, healthy days — health-related quality of life, health care access, inadequate sleep, hypertension awareness, cholesterol awareness, chronic health conditions, alcohol consumption, fruits and vegetables consumption, arthritis burden, and seatbelt use. Since 2011, BRFSS conducts both landline telephone- and cellular telephone-based surveys. In conducting the BRFSS landline telephone survey, interviewers collect data from a randomly selected adult in a household. In conducting the cellular telephone version of the BRFSS questionnaire, interviewers collect data from an adult who participates by using a cellular telephone and resides in a private residence or college housing.

As can be noted here, BRFSS carried out the experiment by randomly selecting non-institutionalized adults, which qualifies the output a random sample that can be generalized to the adults of 18 years or older in the US. However, the method of collecting samples is the landline and cellular telephone survey. There are wo problems underlying this sample collection method, which will introduce bias into sample:

This survey design indicates generalizability of the sample but no causality since it did not involve any randomized experiment to account for uncontrollable variables.


Part 2: Research questions

Research quesion 1: What is the distribution of the observations in the sample look like in terms of levels of general health? Is there any noticeable differences between the distributions by males and females?

Research quesion 2: For the purpose of getting a general idea about women’s health awareness, we’d like to understand the popularity of HPV vaccines among different age groups of females.

Research quesion 3: We would like to find any recognizabe patterns showing that people’s weights are correlated with one or more variables. Here we will use the following data: > sex, education, income, exercising times per month and weight


Part 3: Exploratory data analysis

Research quesion 1:

One most appealing question is to get to know people’s health status by studying a random sample out of the total population. Here we will study the general health conditions for males and females.

health1 <- brfss2013 %>%
    filter(!is.na(brfss2013$genhlth), !is.na(brfss2013$sex), !is.na(brfss2013$X_age65yr)) %>%
    group_by(genhlth, sex) %>%
    summarize(n = n())

health2 <- brfss2013 %>%
    filter(!is.na(brfss2013$genhlth), !is.na(brfss2013$sex), !is.na(brfss2013$X_age65yr)) %>%
    group_by(genhlth, sex, X_age65yr) %>%
    summarize(n = n()) %>%
    mutate(pct_genhlth = n/sum(n),
    posn_pct = cumsum(pct_genhlth)-0.5*pct_genhlth)
# Plot the distribution of general health status by sex
ggplot(health1, aes(x = genhlth, y = n, fill = sex)) +
    geom_bar(stat = "identity", position = position_dodge(0.6), alpha = 0.8) +
    scale_fill_brewer(palette = "Accent")+
    scale_y_continuous("count", breaks = seq(0, 100000, by = 10000)) +
    labs(title = "General health distribution by sex")

From the bar chart above, we know that the majority of both males and females are in good or even better health conditions. The rest have fair and poor health status.

ggplot(health2, aes(x = genhlth, y = pct_genhlth, fill = X_age65yr)) +
    geom_bar(stat = "identity", alpha = 0.6, col = "dark grey") +
    scale_fill_manual(values = c("blue", "red")) +
    facet_wrap(~sex, ncol = 2) +
    geom_text(aes(label = paste0(sprintf("%.0f", pct_genhlth*100),"%"), y=1-posn_pct), col = "white") +
    labs(title = "Proportion of two age groups in each health status") +
    scale_x_discrete("health status") +
    scale_y_continuous("proportion") +
    coord_flip()

Let’s drill down a little bit further by taking the age into consideration. For both males and females, the proportion of people who are age 65 or older is higher (42% for male, 45% for female) in the “Poor” health status than that (22% for male, 25% for female) in the “Excellent” health status, which is not surprising.

Research quesion 2:

To focus on women’s health, we’d like to understand the popularity of HPV vaccine among different age groups of females.

hpv <- brfss2013 %>%
    filter(sex == "Female", !is.na(hpvadvc2), !is.na(X_ageg5yr)) %>%
    group_by(hpvadvc2, X_ageg5yr) %>%
    summarize(count = n())

ggplot(hpv, aes(x = X_ageg5yr, y = count, fill = hpvadvc2)) +
    geom_bar(stat = "identity", position = "stack", alpha = 0.6) +
    scale_fill_manual(values = c("#198C19", "#FD7D5F", "#48616C"))

The stacked bar chart tells us the answer directly by showing an absolute larger number of young females that have been vaccinated. We can also conclude that the porportion of females getting the HPV shots is higher among younger groups than among older groups. One main reason resulting in this obvious difference is that The Advisory Committee on Immunization Practices (ACIP), which advises the Centers for Disease Control and Prevention, recommends vaccination for females 13 through 26 who have not previously been vaccinated.

Research quesion 3:

For the last research question, we are going to look into a bunch of data extracted from the dataset brfss2013 to find any interesting patterns about people’s weight.

First off, let’s calculate the exercising times per month for each respondent, then select the data we need and regenerate a new dataset to simplify the data source for our analysis.

splitted <- t(sapply(brfss2013$exeroft1, function(x) substring(x, first=c(1,2), last=c(1,3))))
df_splitted <- as.data.frame(splitted)
df_census <- brfss2013 %>% select(sex, X_ageg5yr, educa, income2, weight2)
df3 <- cbind(df_census, df_splitted)

# Cleaning data
df3 <- df3 %>% filter(complete.cases(.), weight2 != "")

df3$V1 <- as.numeric(as.character(df3$V1))
df3$V2 <- as.numeric(as.character(df3$V2))
df3$weight2 <- as.numeric(as.character(df3$weight2))

df3 <- df3 %>%
    filter(weight2 <= 999) %>%
    # transform the exercising times per week into per month by multiplying by 4
    mutate(freq_exercise = ifelse(V1 == 1, 4*V2, V2)) %>%
    select(-c(V1, V2))

names(df3) <- c("sex", "age", "education", "income", "weight", "freq_exercise")

Next, we are intersted in finding out if there exists some kind of correlation among the education people received, income level and their weights. Notice here that we choose the median weight for each sub group, as the large number of outliers in people’s weights are very likely to cause bias for further analysis.

# heat map showing the relationship between weight and the combination of levels of income and education 
df_wt <- df3 %>%
    group_by(education, income, sex) %>%
    summarise(wt_md = median(weight))
head(df_wt)
## Source: local data frame [6 x 4]
## Groups: education, income [3]
## 
## # A tibble: 6 x 4
##                                    education            income    sex
##                                       <fctr>            <fctr> <fctr>
## 1 Never attended school or only kindergarten Less than $10,000   Male
## 2 Never attended school or only kindergarten Less than $10,000 Female
## 3 Never attended school or only kindergarten Less than $15,000   Male
## 4 Never attended school or only kindergarten Less than $15,000 Female
## 5 Never attended school or only kindergarten Less than $20,000   Male
## 6 Never attended school or only kindergarten Less than $20,000 Female
## # ... with 1 more variables: wt_md <dbl>
ggplot(df_wt, aes(x = education, y = income, fill = wt_md)) +
    geom_tile() +
    scale_fill_gradient(low = "green", high = "red", guide = guide_legend("median weight")) +
    geom_text(aes(label = wt_md), size = 2) +
    facet_grid(sex ~.) +
    theme(axis.text.x = element_text(size = 7, angle = 45, vjust = 0.7))

There are indeed a couple noteworthy facts from the heat map we created above. * Males with higher income tend to weigh heavier than those with relatively lower income. * Males who never attended school or only kindergarden are showing more variable weights in this group, ranging from 160 pounds to 210 pounds, depending on their income levels.

  • For females, those who received 4-year college education are more likely to control their weights.

  • The median weights are higher for women who earn less than $20,000 as well as hold a high school diploma or lower.

# weight box plot
ggplot(df3, aes(x = sex, y = weight, fill = sex)) +
    geom_boxplot() +
    scale_y_continuous(breaks = seq(0, 600, by = 100))

From the boxplot it is easy to see that the IQR for males is about 160 to 220 pound, while IQR for female is about 130 to 180 pounds. Outliers are a big problem in this weight analysis, so we have to carefully deal with them.

# Scatter plot: weight in pounds vs exercising frequency
df_ex <- df3 %>%
    filter(freq_exercise <= 250)
ggplot(df_ex, aes(x = weight, y = freq_exercise, col = sex)) +
    geom_point(alpha = 0.6, shape = 1) +
    scale_color_manual(values = c("#0099cc", "#ff4500")) 

In this scatterplot, dots are more dense on the lower left side, meaning most people weighing from 100 pounds to 300 pounds choose to exercise 0 to 50 times per month. And the heavier a person is, the more unlikely he/she will work out more than 100 times per month.