Setup

Load packages

library(ggplot2)
library(dplyr)

Load data

load("brfss2013.RData")

Part 1: Data

The dataset brfss2013 was collected by the Behavioral Risk Surveillance System (BRFSS) through telephone surveys in 2013 which includes information on preventive health practices and risk behaviors that may cause chronic diseases, injuries, and preventable infectious diseases that affect the adult population (18 and older) in 50 states across the US, territories like Guam, Puerto Rico, and the District of Columbia. The data was randomly chosen and there are biases caused by missing values.

The columns include information like the date and time and how the survey was conducted as well as basic resident information and demographics, race etc; risk behaviors like tobacco use, alcohol consumption and consumption of sugar drinks etc; preventive health practices, health care access,diets etc. This data set consists of 491,775 observations and 330 variables.

This project will only focus on several variables such as Medicare coverage, sex, race, age, population health conditions etc. I particularly picked these variables is because I’m interested in discovering the relationships among them and I think they can provide us a broad view of the message from this survey.


Part 2: Research questions

Research quesion 1:Investigate Medicare coverage (medicare) in each U.S. state and territories (X_state).

Research quesion 2:Investigate the relationship among race (X_race), income level (income2) and health care coverage for the past 12                                      month (nocov121).

Research quesion 3:Find out the distributions of the health conditions (X_rfhlth) by gender (sex) and age groups (X_age_g).


Part 3: Exploratory data analysis

Research quesion 1:Investigate Medicare coverage (medicare) in each U.S. state and territories (X_state).

Check the structures of the columns we will be working with.

brfss2013 %>% 
  select(X_state, medicare) %>%
  str()
## 'data.frame':    491775 obs. of  2 variables:
##  $ X_state : Factor w/ 55 levels "0","Alabama",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ medicare: Factor w/ 2 levels "Yes","No": 1 2 2 2 1 2 2 2 2 1 ...

List of states with the most medicare coverage to the least.

brfss2013 %>% 
  filter(!is.na(X_state), medicare == "Yes") %>% 
  group_by(X_state) %>% 
  summarise(With_medicare = n()) %>% 
  arrange(desc(With_medicare))
## # A tibble: 38 x 2
##    X_state        With_medicare
##    <fct>                  <int>
##  1 Florida                14404
##  2 Maryland                6690
##  3 Nebraska                6499
##  4 Massachusetts           5371
##  5 Michigan                4890
##  6 Minnesota               4722
##  7 Ohio                    4477
##  8 South Carolina          4440
##  9 Kentucky                4421
## 10 New Jersey              4278
## # ... with 28 more rows

List of states with no medicare coverage from the highest to the least.

brfss2013 %>% 
  filter(!is.na(X_state), medicare == "No") %>% 
  group_by(X_state) %>% 
  summarise(Without_medicare = n()) %>% 
  arrange(desc(Without_medicare))
## # A tibble: 38 x 2
##    X_state       Without_medicare
##    <fct>                    <int>
##  1 Florida                  13923
##  2 Nebraska                  8841
##  3 Massachusetts             8627
##  4 Minnesota                 8179
##  5 New Jersey                7479
##  6 Utah                      7430
##  7 California                7281
##  8 Michigan                  6353
##  9 Ohio                      5965
## 10 Indiana                   5428
## # ... with 28 more rows

Medicare coverage by state.

brfss2013 %>% 
  filter(!is.na(medicare), !is.na(X_state)) %>% 
  group_by(X_state, medicare) %>% 
  summarise(count = n())
## `summarise()` has grouped output by 'X_state'. You can override using the `.groups` argument.
## # A tibble: 76 x 3
## # Groups:   X_state [38]
##    X_state     medicare count
##    <fct>       <fct>    <int>
##  1 Alabama     Yes       3030
##  2 Alabama     No        2607
##  3 Alaska      Yes       1022
##  4 Alaska      No        2733
##  5 Arizona     Yes       1702
##  6 Arizona     No        1696
##  7 California  Yes       3530
##  8 California  No        7281
##  9 Connecticut Yes       2693
## 10 Connecticut No        4229
## # ... with 66 more rows

Visualize the number of people without Medicare by state.

brfss2013 %>% 
  filter(!is.na(medicare), !is.na(X_state), medicare == 'Yes') %>% 
  ggplot(aes(x = X_state, fill = X_state)) +
  geom_bar() +
  labs(title = "Medicare Coverage by U.S. state (with Medicare)",
       caption = paste0("Data Source: BRFSS2013"),
       x = "U.S. states and territories", y = "Number of people",
       fill = 'U.S. States') +
  theme(plot.title = element_text(size=9)) +
  theme(text = element_text(size=8)) +
  coord_flip()


Visualize the number of people without Medicare by state.

brfss2013 %>% 
  filter(!is.na(medicare), !is.na(X_state), medicare == 'No') %>% 
  ggplot(aes(x = X_state, fill = X_state)) +
  geom_bar() +
  labs(title = "Medicare Coverage by U.S. state (without Medicare)",
       caption = paste0("Data Source: BRFSS2013"),
       x = "U.S. states and territories", y = "Number of people",
       fill = 'U.S. States') +
  theme(plot.title = element_text(size=9)) +
  theme(text = element_text(size=8)) +
  coord_flip()


From the two graphs above we can clearly see that the state of Florida has both the highest and the lowest Medicare coverage among all U.S. states and territories according to the survey.

Research quesion 2:Investigate the relationship among race (X_race), income level (income2) and health care coverage for the past 12 months.

Check the structures of the columns we will be working with.

brfss2013 %>% 
  select(X_race, income2, nocov121) %>%
  str()
## 'data.frame':    491775 obs. of  3 variables:
##  $ X_race  : Factor w/ 8 levels "White only, non-Hispanic",..: 2 1 1 1 1 2 1 1 1 1 ...
##  $ income2 : Factor w/ 8 levels "Less than $10,000",..: 7 8 8 7 6 8 NA 6 8 4 ...
##  $ nocov121: Factor w/ 2 levels "Yes","No": 2 2 2 2 2 2 2 2 2 2 ...

Filter the NAs from those three columns and look into more information within each racial group.

brfss2013 %>% 
  filter(!is.na(X_race), !is.na(income2), !is.na(nocov121)) %>% 
  group_by(X_race, income2, nocov121) %>% 
  summarise(count = n())
## `summarise()` has grouped output by 'X_race', 'income2'. You can override using the `.groups` argument.
## # A tibble: 128 x 4
## # Groups:   X_race, income2 [64]
##    X_race                   income2           nocov121 count
##    <fct>                    <fct>             <fct>    <int>
##  1 White only, non-Hispanic Less than $10,000 Yes        757
##  2 White only, non-Hispanic Less than $10,000 No        6461
##  3 White only, non-Hispanic Less than $15,000 Yes        737
##  4 White only, non-Hispanic Less than $15,000 No        9202
##  5 White only, non-Hispanic Less than $20,000 Yes        957
##  6 White only, non-Hispanic Less than $20,000 No       11790
##  7 White only, non-Hispanic Less than $25,000 Yes       1110
##  8 White only, non-Hispanic Less than $25,000 No       16558
##  9 White only, non-Hispanic Less than $35,000 Yes       1250
## 10 White only, non-Hispanic Less than $35,000 No       21607
## # ... with 118 more rows

We can see that within each racial group, the number of people with health care coverage for the past 12 month outnumbers the number of the ones who don’t no matter which income level they are in. In most cases, those with higher incomes tend to have more health care coverage within each racial group.

brfss2013 %>% 
  filter(!is.na(X_race), !is.na(income2), !is.na(nocov121)) %>% 
  ggplot(aes(x = X_race, fill = nocov121)) +
  geom_bar() +
  labs(title = 'Health coverage by race',
       subtitle = 'NO: not without health care \nYES: without health care',
       x = 'Race',
       y = 'Number of population',
       caption = paste0("Data Source: BRFSS2013"),
       fill = 'Response') +
  coord_flip()


The visualization above shows the health care distribution in each racial group, we can see that the absolute majority of people in each racial group are covered with health care.

Research quesion 3:Find out the distributions of the health conditions (X_rfhlth) by gender (sex) and age groups (X_age_g).

Get an idea of the variables we will be working with.

brfss2013 %>% 
  select(X_rfhlth, sex, X_age_g) %>%
  str()
## 'data.frame':    491775 obs. of  3 variables:
##  $ X_rfhlth: Factor w/ 2 levels "Good or Better Health",..: 2 1 1 1 1 1 2 1 1 1 ...
##  $ sex     : Factor w/ 2 levels "Male","Female": 2 2 2 2 1 2 2 2 1 2 ...
##  $ X_age_g : Factor w/ 6 levels "Age 18 to 24",..: 5 4 5 5 6 4 3 5 4 6 ...

Clean our data:

brfss2013 %>% 
  filter(!is.na(X_rfhlth), !is.na(sex), !is.na(X_age_g)) %>% 
  group_by(sex, X_age_g, X_rfhlth) %>% 
  summarise(count = n())
## `summarise()` has grouped output by 'sex', 'X_age_g'. You can override using the `.groups` argument.
## # A tibble: 24 x 4
## # Groups:   sex, X_age_g [12]
##    sex   X_age_g      X_rfhlth              count
##    <fct> <fct>        <fct>                 <int>
##  1 Male  Age 18 to 24 Good or Better Health 12627
##  2 Male  Age 18 to 24 Fair or Poor Health     929
##  3 Male  Age 25 to 34 Good or Better Health 20514
##  4 Male  Age 25 to 34 Fair or Poor Health    1932
##  5 Male  Age 35 to 44 Good or Better Health 22783
##  6 Male  Age 35 to 44 Fair or Poor Health    3072
##  7 Male  Age 45 to 54 Good or Better Health 28949
##  8 Male  Age 45 to 54 Fair or Poor Health    6179
##  9 Male  Age 55 to 64 Good or Better Health 34711
## 10 Male  Age 55 to 64 Fair or Poor Health    9773
## # ... with 14 more rows

Here is a visualization of what we found:

health <- brfss2013 %>% 
  filter(!is.na(X_rfhlth), !is.na(sex), !is.na(X_age_g))

Health conditions by gender

options(scipen = 999)
ggplot(health, aes(x =X_rfhlth, fill = sex)) +
  geom_bar() +
  labs(title = 'Health Conditions by Gender',
       x = 'Health Conditions',
       y = 'Number of population',
       caption = paste0("Data Source: BRFSS2013"),
       fill = 'Gender')


We can see that, overall, the number of females who has good health condition outnumbers that of males’ throughout all age groups according to the survey, but at the same time, the number of females who are in fair or poor health also outnumbers that of males’ in all age groups in the survey.
Health condition by age groups.

ggplot(health, aes(x =X_rfhlth, fill = X_age_g)) +
  geom_bar() +
  labs(title = 'Health conditions by each age group', 
       x = 'Health Conditions',
       y = 'Number of population',
       caption = paste0("Data Source: BRFSS2013"),
       fill = 'Age groups')


Based on the visualization, we can tell that the majority of the population are in good and better health; while under each category, older people (age 65 and older) are the largest groups under both good and poor health.