Introduction to CDC 500 Cities/PLACES Data

This report explores CDC 500 Cities/PLACES data at the county level. In addition, economic variables such as income and unemployment are included to show associations between health and socioeconomic variables.

Physical and mental health are related in counties across the United States. Health is also related to economic well-being, shown here through median household income.

Health varies by county, and there are some counties that have very poor health. Some regions are likelier than others to contain counties where residents report poor health.

This data comes from the 2022 CDC Division of Population Health release available through the CDC data portal here. The 500 Cities/PLACES dataset produces estimates for counties, cities, and census tracts across a number of important health variables. These variables include health status, estimates for the prevalence of different diseases, and other health indicators.

In addition, income and employment variables from the USDA Economic Research Service found here have been attached to this dataset.

Census designations for U.S. regions (Census Regions and Divisions, BEA Regions) are also joined to the data frame in order to see if regional patterns exist.

Using dplyr, ggplot2, and other tidyverse packages, I analyze and visualize the relationships among health variables and between health and economic variables.

knitr::opts_chunk$set(echo  = TRUE, warning = FALSE, message = FALSE, fig.dim = c(8, 7))
# Load packages
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.0     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.1     ✔ tibble    3.1.8
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(here)

## here() starts at /Users/simone/Dropbox/GEOG 588/Lesson 1/Labs/RYouWithMe

library(janitor)

## 
## Attaching package: 'janitor'
## 
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test

library(DT)

new_counties_df <- read_csv(here("data", "myowndata", "new_counties_df.csv"))

## Rows: 81744 Columns: 118
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (16): StateAbbr, StateDesc, LocationName, DataSource, Category, Measure...
## dbl (100): Year, Data_Value, Low_Confidence_Limit, High_Confidence_Limit, To...
## lgl   (2): Data_Value_Footnote_Symbol, Data_Value_Footnote
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# Read in regions by state
state_regions <- read_csv(here("data", "myowndata", "state_region_helper_with_fips.csv"))

## Rows: 51 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (9): state_abb_lower, state_abb, state, region, division, commonsense, b...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Data Preparation

This data set is currently cleaned fairly well, but using the clean_names() function from janitor makes data wrangling easier.

At this point, the state regions file is also joined to the main county dataset.

The factor order for the census divisions is also changed at this point so that divisions are grouped by region.

Shared theme edits are saved to an object so that the same styling can be applied to every plot and changed in one place.

# clean names
new_counties_df  <- new_counties_df %>% clean_names()
state_regions <- state_regions %>% 
  select(-state) %>%
  clean_names()

# Keep only the variables used in the analysis
new_counties_df <- new_counties_df %>%
  select(state_abbr:location_name, category:data_value, 
         low_confidence_limit:area_name, ends_with("2020"))

# Join counties data set with state regions dataset
new_counties_df <- new_counties_df %>% 
  left_join(state_regions, by = c("state_abbr" = "state_abb"))


# Factor the division to change the order
new_counties_df$division <- factor(new_counties_df$division,
            levels = c('East North Central', 'West North Central',
                       'Middle Atlantic', 'New England',
                       'East South Central','South Atlantic', 'West South Central',
                       'Mountain','Pacific'))

# Save theme changes so that they can be called each time
theme_edits <- 
    theme(plot.margin = margin(.25, .25, .25, .25, "cm"),
        legend.position = "bottom",
        text = element_text(family = "Lato", size=12, hjust=.05),
        strip.text = element_text(size=11, hjust=.05),
        plot.caption = element_text(hjust=0, colour="grey50",
                                    margin = margin(t = 5, r=0, b = 0, l =0)),
        plot.title = element_text(size=16, face = "bold"),
        plot.subtitle = element_text(size=12, colour = "grey50"),
        axis.text = element_text(colour="grey20"),
        axis.text.x = element_text(size = 9, vjust = 0.5),
        axis.title = element_text(size=10),
        axis.title.y = element_text(angle= 0),
        axis.title.x = element_text(margin = margin(t = 10, r=0, b = 0, l =0)))

# Look at general health values - sorted by lowest to start
new_counties_df %>%
  filter(measure_id == "GHLTH") %>%
  select(state_abbr, location_name, data_value) %>%
  arrange(data_value) %>%
  datatable()

Table 1. General health values arranged from lowest to highest. The table can be sorted by data_value to check that the values make sense and to see where health status values are highest and lowest.
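The join and factor steps in the data preparation code can fail quietly: a state abbreviation with no match in the regions file, or a division name missing from the `levels` vector, simply becomes `NA`. The sketch below shows one way to check for both. It uses small invented tables (`toy_counties`, `toy_regions`) rather than the real files, so the values are illustrative only.

```r
library(dplyr)

# Toy stand-ins for the county and region tables (all values are invented)
toy_counties <- tibble::tibble(
  state_abbr    = c("PA", "PA", "TX", "PR"),
  location_name = c("Centre", "Erie", "Travis", "Example Municipio"),
  data_value    = c(12.1, 15.8, 13.4, 20.0)
)
toy_regions <- tibble::tibble(
  state_abb = c("PA", "TX"),
  division  = c("Middle Atlantic", "West South Central")
)

# Same join pattern as in the report
joined <- toy_counties %>%
  left_join(toy_regions, by = c("state_abbr" = "state_abb"))

# Rows with no matching region: these get NA in the joined columns
unmatched <- toy_counties %>%
  anti_join(toy_regions, by = c("state_abbr" = "state_abb"))

# A division name that is absent from `levels` would silently become NA
division_levels <- c("Middle Atlantic", "West South Central")
joined$division <- factor(joined$division, levels = division_levels)

nrow(joined) == nrow(toy_counties)  # TRUE when the join did not duplicate rows
unmatched$state_abbr                # state abbreviations lacking a region entry
sum(is.na(joined$division))         # rows without a usable division
```

On the real data, the same `anti_join()` call against `state_regions` would list any records that drop out of the regional charts.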

Analyze the dataset

The variables used for this analysis are health status variables: mental health status, general health status, and physical health status. These variables measure the share of residents reporting 14 or more days of poor health in the last 30 days, so a lower number is considered better. You can read more about these variables here.

Relationship Between Physical and Mental Health

There is a strong relationship between physical and mental health at the county level. Counties where respondents report having fewer bad mental health days also tend to have respondents report having fewer bad physical health days.

knitr::opts_chunk$set(echo  = TRUE, warning = FALSE, message = FALSE, fig.dim = c(8, 7))
# Scatterplot - mental and physical health
new_counties_df %>% 
  filter(short_question_text %in% c("Physical Health", "Mental Health")) %>%
  select(location_name, location_id, data_value, short_question_text) %>%
  pivot_wider(names_from = short_question_text, values_from = data_value) %>%
  clean_names() %>%
  ggplot(aes(x = physical_health, y = mental_health)) + 
  geom_jitter(alpha = .5, color = "#04025C") +
  geom_smooth() +
    labs(caption = "Source: CDC",
       title = "Physical and Mental Health Values in U.S. Counties",
       subtitle = 'Lower values represent fewer "bad" health days',
       x = "Physical Health", y = "Mental \nHealth") +
  theme_minimal() +
  theme_edits

Figure 1. Physical and mental health status measures are highly correlated with each other. Counties in which a higher percentage of residents report poor physical health also tend to have a higher percentage of residents who report poor mental health.
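The strength of the relationship in the scatterplot can also be summarized with a correlation coefficient. The sketch below reshapes a small invented table the same way the plotting code does and then computes Pearson's r. The `toy_long` values are made up for illustration, so the resulting number says nothing about the real counties.

```r
library(dplyr)
library(tidyr)

# Invented long-format data with the same column names as the report
toy_long <- tibble::tibble(
  location_id         = rep(c("01001", "01003", "01005", "01007"), each = 2),
  short_question_text = rep(c("Physical Health", "Mental Health"), times = 4),
  data_value          = c(10.2, 13.1, 9.8, 12.5, 15.6, 17.0, 12.4, 15.2)
)

# One row per county, one column per measure
toy_wide <- toy_long %>%
  pivot_wider(names_from = short_question_text, values_from = data_value) %>%
  rename(physical_health = `Physical Health`, mental_health = `Mental Health`)

# Pearson correlation, ignoring counties with a missing measure
r <- cor(toy_wide$physical_health, toy_wide$mental_health, use = "complete.obs")
r
```

Applied to the pivoted county data, a value of `r` close to 1 would support the "highly correlated" reading of Figure 1.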

Relationship Between Health Status and Income

Income is strongly associated with health status. Counties with lower median household incomes are more likely to have a higher percentage of residents that report 14 or more bad health days.

knitr::opts_chunk$set(echo  = TRUE, warning = FALSE, message = FALSE, fig.dim = c(8, 7))
# Scatterplot - health status x income in 2020-----
new_counties_df %>% 
  filter(short_question_text == "General Health" ) %>%
  ggplot(aes(x = median_household_income_2020, y = data_value)) + 
  geom_jitter(alpha = .5, color = "#04025C") +
  # Note: limits drop counties with median income above $100,000 from the plot
  scale_x_continuous(name = "Median Household Income", limits = c(0, 100000),
                     breaks = seq(0, 100000, by = 50000),
                     labels = scales::dollar) +
  labs(caption = "Source: CDC, USDA ERS",
       title = "Median Household Income and Health Status in U.S. Counties - 2020",
       subtitle = 'Lower values represent fewer "bad" health days',
       x = "Median Household Income", y = "% Reporting\n Bad Health Day") +
  facet_wrap(~short_question_text) + 
  theme_minimal() +
  theme(plot.margin = margin(1, 2, 1, 1, "cm"),
        legend.position = "bottom",
        text = element_text(family = "Lato", size=12, hjust=.05),
        strip.text = element_text(size=11, hjust=.05),
        plot.caption = element_text(hjust=0, colour="grey50",
                                    margin = margin(t = 5, r=0, b = 0, l =0)),
        plot.title = element_text(size=14, face = "bold"),
        plot.subtitle = element_text(size=12, colour = "grey50"),
        axis.text = element_text(colour="grey20"),
        axis.text.x = element_text(size = 9, vjust = 0.5),
        axis.title = element_text(size=10),
        axis.title.y = element_text(angle= 0),
        axis.title.x = element_text(margin = margin(t = 10, r=0, b = 0, l =0)))

Figure 2. Counties with a higher median household income tend to have fewer residents who report poor general health.
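One way to put a number on the association between income and general health is a rank correlation together with a simple linear fit. The sketch below uses base R on an invented data frame (`toy_income`) that borrows the report's column names. The figures are placeholders, not results from the CDC or USDA data.

```r
# Invented county-level values using the report's column names
toy_income <- data.frame(
  median_household_income_2020 = c(38000, 45000, 52000, 61000, 75000, 90000),
  data_value                   = c(24.5, 21.0, 18.2, 16.9, 13.5, 11.0)
)

# Spearman's rho: direction and strength of the monotonic association
rho <- cor(toy_income$median_household_income_2020,
           toy_income$data_value,
           method = "spearman")

# Linear fit with income rescaled, so the slope reads as
# "percentage points of poor health per $10,000 of income"
fit   <- lm(data_value ~ I(median_household_income_2020 / 10000), data = toy_income)
slope <- coef(fit)[[2]]

rho
slope
```

A negative `rho` and a negative `slope` correspond to the pattern described above: higher income, lower share reporting poor health.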

Export Median Household Income and Health Status in U.S. Counties - 2020 Chart

ggsave(here("output", "healthstatus_income.jpg"))

General Health Status Distribution by County

General health varies by region, state, and county in the United States. Some counties have good general health values, with less than 10 percent of residents reporting 14 or more bad health days in the last 30 days. In other counties, more than 30 percent of residents report poor general health (14 or more bad health days in the last 30 days).

knitr::opts_chunk$set(echo  = TRUE, warning = FALSE, message = FALSE, fig.dim = c(8, 7))
# Get the median general health value to use for a chart
median_genhealth <- new_counties_df %>%
  filter(short_question_text == "General Health" ) %>%
  summarize(median = median(data_value)) %>%
  pull()

new_counties_df %>%
  filter(short_question_text == "General Health" ) %>%
  ggplot(aes(x = data_value)) +
  geom_histogram(fill = "#6BAB90", color = "white", binwidth = 2) +
  annotate("segment", x = median_genhealth, xend = median_genhealth, y = 0, yend = 595,
           linewidth = 1, color = "#04025C", linetype = "dashed") +
  # annotate() draws the label once rather than once per row, using the computed median
  annotate("text", label = paste0(median_genhealth, "%"), x = median_genhealth, y = 610) +
  labs(caption = "Source: CDC",
       title = "2020 General Health Values in U.S. Counties",
       subtitle = paste0('The median percent of residents reporting "bad" health is ', median_genhealth, "%"),
       x = "Data Value", y = "Count") +
  theme_minimal() +
  theme_edits

Figure 3. There is a wide variation in general health in counties across the country. The median percent of residents reporting poor health is 15.3%.
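The thresholds mentioned in this section (under 10 percent and over 30 percent) can be turned into shares of counties alongside the median. The sketch below does this on an invented table (`toy_health`); the five values are placeholders chosen only to make the arithmetic easy to follow.

```r
library(dplyr)

# Invented general health values for five hypothetical counties
toy_health <- tibble::tibble(
  location_name = c("A", "B", "C", "D", "E"),
  data_value    = c(8.9, 12.4, 15.3, 19.7, 31.2)
)

toy_summary <- toy_health %>%
  summarize(
    median         = median(data_value, na.rm = TRUE),
    share_under_10 = mean(data_value < 10, na.rm = TRUE),  # proportion of counties below 10%
    share_over_30  = mean(data_value > 30, na.rm = TRUE)   # proportion of counties above 30%
  )

toy_summary
```

Taking the mean of a logical vector gives the proportion of `TRUE` values, which is why `mean(data_value < 10)` returns a share of counties.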

General Health Status by Census Region and Division

In general, health varies significantly by region. Although there are counties with a low percent of bad health days in every census region (and in most census divisions), the most frequent values vary by division.

The regions and divisions as defined by the Census Bureau can be seen in the following map from the Census Bureau:

Figure 4. Census Bureau region and division designations.

Counties in the Northeast tend to have better general health, while counties in the South tend to have worse values.

knitr::opts_chunk$set(echo  = TRUE, warning = FALSE, message = FALSE, fig.dim = c(8, 7))
# Violin chart by region and division------
new_counties_df %>% 
  filter(short_question_text == "General Health" , !is.na(bea))  %>%
  ggplot(aes(x =division, y = data_value)) +
  geom_violin(aes(fill = region, color = region)) +
  coord_flip() +
  facet_wrap(~short_question_text) +
  labs(caption = "Source: CDC",
       title = "General Health in U.S. Counties by Census Division",
       subtitle = 'Lower values represent fewer "bad" health days',
       x = " ", y = " ") +
  scale_fill_manual(values=c("#E9C46A", "#B16379","#527394", "#6BAB90"), name = "Region") +
  scale_color_manual(values=c("#E9C46A", "#B16379","#527394", "#6BAB90"), name = "Region") +
  theme_minimal() + 
  theme_edits

Figure 5. General health status varies across counties, but also varies significantly by census region and census division. Counties in the South tend to have residents who report worse health than counties in the Northeast.
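The regional comparison in the violin chart can be backed up with a grouped summary table. The sketch below computes the county count and median value per division on an invented table (`toy_div`); the region and division names are real Census labels, but the values are made up.

```r
library(dplyr)

# Invented general health values tagged with Census region and division
toy_div <- tibble::tibble(
  region     = c("Northeast", "Northeast", "South", "South", "South"),
  division   = c("New England", "New England",
                 "East South Central", "East South Central", "South Atlantic"),
  data_value = c(11.0, 13.0, 22.0, 24.0, 18.0)
)

division_summary <- toy_div %>%
  group_by(region, division) %>%
  summarize(
    n_counties   = n(),
    median_value = median(data_value, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  arrange(median_value)  # best (lowest) divisions first

division_summary
```

On the county data, the same pipeline would follow the `filter(short_question_text == "General Health")` step used for the chart.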

Conclusion

Health status is associated with economic variables such as household income. Location plays a role in reported health status, in part due to economic variables. More health, economic, and location variables can be explored to further depict the relationship between health status, economic status, and location.

References

Centers for Disease Control and Prevention. (2020, December 8). 500 Cities Project: 2016 to 2019 | PLACES: Local Data for Better Health | CDC. Retrieved April 3, 2022, from https://www.cdc.gov/places/about/500-cities-2016-2019/index.html

U.S. Census Bureau. 2010 Census Regions and Divisions of the United States. https://www2.census.gov/geo/pdfs/maps-data/maps/reference/us_regdiv.pdf

USDA Economic Research Service. (2022, June 3). Unemployment and median household income for the U.S., States, and counties, 2000-2021. County-level Data Sets: Download Data. https://www.ers.usda.gov/data-products/county-level-data-sets/county-level-data-sets-download-data/

CDC 500 Cities/PLACES Data Analysis

Simone Roy

2023-05-27