library(ggplot2)
library(dplyr)load("brfss2013.RData")The dataset brfss2013 was collected by the Behavioral Risk Surveillance System (BRFSS) through telephone surveys in 2013 which includes information on preventive health practices and risk behaviors that may cause chronic diseases, injuries, and preventable infectious diseases that affect the adult population (18 and older) in 50 states across the US, territories like Guam, Puerto Rico, and the District of Columbia. The data was randomly chosen and there are biases caused by missing values.
The columns include information like the date and time and how the survey was conducted as well as basic resident information and demographics, race etc; risk behaviors like tobacco use, alcohol consumption and consumption of sugar drinks etc; preventive health practices, health care access,diets etc. This data set consists of 491,775 observations and 330 variables.
This project will only focus on several variables such as Medicare coverage, sex, race, age, population health conditions etc. I particularly picked these variables is because I’m interested in discovering the relationships among them and I think they can provide us a broad view of the message from this survey.
Research quesion 1:Investigate Medicare coverage (medicare) in each U.S. state and territories (X_state).
Research quesion 2:Investigate the relationship among race (X_race), income level (income2) and health care coverage for the past 12 month (nocov121).
Research quesion 3:Find out the distributions of the health conditions (X_rfhlth) by gender (sex) and age groups (X_age_g).
Research quesion 1:Investigate Medicare coverage (medicare) in each U.S. state and territories (X_state).
Check the structures of the columns we will be working with.
brfss2013 %>%
select(X_state, medicare) %>%
str()## 'data.frame': 491775 obs. of 2 variables:
## $ X_state : Factor w/ 55 levels "0","Alabama",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ medicare: Factor w/ 2 levels "Yes","No": 1 2 2 2 1 2 2 2 2 1 ...
List of states with the most medicare coverage to the least.
brfss2013 %>%
filter(!is.na(X_state), medicare == "Yes") %>%
group_by(X_state) %>%
summarise(With_medicare = n()) %>%
arrange(desc(With_medicare))## # A tibble: 38 x 2
## X_state With_medicare
## <fct> <int>
## 1 Florida 14404
## 2 Maryland 6690
## 3 Nebraska 6499
## 4 Massachusetts 5371
## 5 Michigan 4890
## 6 Minnesota 4722
## 7 Ohio 4477
## 8 South Carolina 4440
## 9 Kentucky 4421
## 10 New Jersey 4278
## # ... with 28 more rows
List of states with no medicare coverage from the highest to the least.
brfss2013 %>%
filter(!is.na(X_state), medicare == "No") %>%
group_by(X_state) %>%
summarise(Without_medicare = n()) %>%
arrange(desc(Without_medicare))## # A tibble: 38 x 2
## X_state Without_medicare
## <fct> <int>
## 1 Florida 13923
## 2 Nebraska 8841
## 3 Massachusetts 8627
## 4 Minnesota 8179
## 5 New Jersey 7479
## 6 Utah 7430
## 7 California 7281
## 8 Michigan 6353
## 9 Ohio 5965
## 10 Indiana 5428
## # ... with 28 more rows
Medicare coverage by state.
brfss2013 %>%
filter(!is.na(medicare), !is.na(X_state)) %>%
group_by(X_state, medicare) %>%
summarise(count = n())## `summarise()` has grouped output by 'X_state'. You can override using the `.groups` argument.
## # A tibble: 76 x 3
## # Groups: X_state [38]
## X_state medicare count
## <fct> <fct> <int>
## 1 Alabama Yes 3030
## 2 Alabama No 2607
## 3 Alaska Yes 1022
## 4 Alaska No 2733
## 5 Arizona Yes 1702
## 6 Arizona No 1696
## 7 California Yes 3530
## 8 California No 7281
## 9 Connecticut Yes 2693
## 10 Connecticut No 4229
## # ... with 66 more rows
Visualize the number of people without Medicare by state.
brfss2013 %>%
filter(!is.na(medicare), !is.na(X_state), medicare == 'Yes') %>%
ggplot(aes(x = X_state, fill = X_state)) +
geom_bar() +
labs(title = "Medicare Coverage by U.S. state (with Medicare)",
caption = paste0("Data Source: BRFSS2013"),
x = "U.S. states and territories", y = "Number of people",
fill = 'U.S. States') +
theme(plot.title = element_text(size=9)) +
theme(text = element_text(size=8)) +
coord_flip()
Visualize the number of people without Medicare by state.
brfss2013 %>%
filter(!is.na(medicare), !is.na(X_state), medicare == 'No') %>%
ggplot(aes(x = X_state, fill = X_state)) +
geom_bar() +
labs(title = "Medicare Coverage by U.S. state (without Medicare)",
caption = paste0("Data Source: BRFSS2013"),
x = "U.S. states and territories", y = "Number of people",
fill = 'U.S. States') +
theme(plot.title = element_text(size=9)) +
theme(text = element_text(size=8)) +
coord_flip()
From the two graphs above we can clearly see that the state of Florida has both the highest and the lowest Medicare coverage among all U.S. states and territories according to the survey.
Research quesion 2:Investigate the relationship among race (X_race), income level (income2) and health care coverage for the past 12 months.
Check the structures of the columns we will be working with.
brfss2013 %>%
select(X_race, income2, nocov121) %>%
str()## 'data.frame': 491775 obs. of 3 variables:
## $ X_race : Factor w/ 8 levels "White only, non-Hispanic",..: 2 1 1 1 1 2 1 1 1 1 ...
## $ income2 : Factor w/ 8 levels "Less than $10,000",..: 7 8 8 7 6 8 NA 6 8 4 ...
## $ nocov121: Factor w/ 2 levels "Yes","No": 2 2 2 2 2 2 2 2 2 2 ...
Filter the NAs from those three columns and look into more information within each racial group.
brfss2013 %>%
filter(!is.na(X_race), !is.na(income2), !is.na(nocov121)) %>%
group_by(X_race, income2, nocov121) %>%
summarise(count = n())## `summarise()` has grouped output by 'X_race', 'income2'. You can override using the `.groups` argument.
## # A tibble: 128 x 4
## # Groups: X_race, income2 [64]
## X_race income2 nocov121 count
## <fct> <fct> <fct> <int>
## 1 White only, non-Hispanic Less than $10,000 Yes 757
## 2 White only, non-Hispanic Less than $10,000 No 6461
## 3 White only, non-Hispanic Less than $15,000 Yes 737
## 4 White only, non-Hispanic Less than $15,000 No 9202
## 5 White only, non-Hispanic Less than $20,000 Yes 957
## 6 White only, non-Hispanic Less than $20,000 No 11790
## 7 White only, non-Hispanic Less than $25,000 Yes 1110
## 8 White only, non-Hispanic Less than $25,000 No 16558
## 9 White only, non-Hispanic Less than $35,000 Yes 1250
## 10 White only, non-Hispanic Less than $35,000 No 21607
## # ... with 118 more rows
We can see that within each racial group, the number of people with health care coverage for the past 12 month outnumbers the number of the ones who don’t no matter which income level they are in. In most cases, those with higher incomes tend to have more health care coverage within each racial group.
brfss2013 %>%
filter(!is.na(X_race), !is.na(income2), !is.na(nocov121)) %>%
ggplot(aes(x = X_race, fill = nocov121)) +
geom_bar() +
labs(title = 'Health coverage by race',
subtitle = 'NO: not without health care \nYES: without health care',
x = 'Race',
y = 'Number of population',
caption = paste0("Data Source: BRFSS2013"),
fill = 'Response') +
coord_flip()
The visualization above shows the health care distribution in each racial group, we can see that the absolute majority of people in each racial group are covered with health care.
Research quesion 3:Find out the distributions of the health conditions (X_rfhlth) by gender (sex) and age groups (X_age_g).
Get an idea of the variables we will be working with.
brfss2013 %>%
select(X_rfhlth, sex, X_age_g) %>%
str()## 'data.frame': 491775 obs. of 3 variables:
## $ X_rfhlth: Factor w/ 2 levels "Good or Better Health",..: 2 1 1 1 1 1 2 1 1 1 ...
## $ sex : Factor w/ 2 levels "Male","Female": 2 2 2 2 1 2 2 2 1 2 ...
## $ X_age_g : Factor w/ 6 levels "Age 18 to 24",..: 5 4 5 5 6 4 3 5 4 6 ...
Clean our data:
brfss2013 %>%
filter(!is.na(X_rfhlth), !is.na(sex), !is.na(X_age_g)) %>%
group_by(sex, X_age_g, X_rfhlth) %>%
summarise(count = n())## `summarise()` has grouped output by 'sex', 'X_age_g'. You can override using the `.groups` argument.
## # A tibble: 24 x 4
## # Groups: sex, X_age_g [12]
## sex X_age_g X_rfhlth count
## <fct> <fct> <fct> <int>
## 1 Male Age 18 to 24 Good or Better Health 12627
## 2 Male Age 18 to 24 Fair or Poor Health 929
## 3 Male Age 25 to 34 Good or Better Health 20514
## 4 Male Age 25 to 34 Fair or Poor Health 1932
## 5 Male Age 35 to 44 Good or Better Health 22783
## 6 Male Age 35 to 44 Fair or Poor Health 3072
## 7 Male Age 45 to 54 Good or Better Health 28949
## 8 Male Age 45 to 54 Fair or Poor Health 6179
## 9 Male Age 55 to 64 Good or Better Health 34711
## 10 Male Age 55 to 64 Fair or Poor Health 9773
## # ... with 14 more rows
Here is a visualization of what we found:
health <- brfss2013 %>%
filter(!is.na(X_rfhlth), !is.na(sex), !is.na(X_age_g))Health conditions by gender
options(scipen = 999)
ggplot(health, aes(x =X_rfhlth, fill = sex)) +
geom_bar() +
labs(title = 'Health Conditions by Gender',
x = 'Health Conditions',
y = 'Number of population',
caption = paste0("Data Source: BRFSS2013"),
fill = 'Gender')
We can see that, overall, the number of females who has good health condition outnumbers that of males’ throughout all age groups according to the survey, but at the same time, the number of females who are in fair or poor health also outnumbers that of males’ in all age groups in the survey.
Health condition by age groups.
ggplot(health, aes(x =X_rfhlth, fill = X_age_g)) +
geom_bar() +
labs(title = 'Health conditions by each age group',
x = 'Health Conditions',
y = 'Number of population',
caption = paste0("Data Source: BRFSS2013"),
fill = 'Age groups')
Based on the visualization, we can tell that the majority of the population are in good and better health; while under each category, older people (age 65 and older) are the largest groups under both good and poor health.