Introduction

Missouri, located in the Midwestern United States, showcases a geographically and socioeconomically varied landscape. The state’s educational and income levels reflect its mix of urban and rural areas, with distinct differences between metropolitan regions like St. Louis and Kansas City and the more rural and agricultural parts of the state.

Various public and private institutions serve Missouri’s education system, from primary and secondary schools to high-ranking universities and colleges. The presence of renowned universities such as the University of Missouri system, Washington University in St. Louis, and several other colleges and technical schools contributes to the state’s educational profile.

Income levels in Missouri, as in many states, correlate with education levels, geographic location, and industry employment. The state’s economy is diverse, with sectors such as agriculture, manufacturing, healthcare, and technology playing significant roles. Urban areas, particularly around St. Louis and Kansas City, tend to have higher average income levels compared to rural areas, reflecting the concentration of higher-paying jobs in these regions.

However, Missouri faces challenges similar to other states, such as income disparity and pockets of poverty, particularly in rural areas or where industries have declined. Missouri’s median household income and per capita income are essential indicators of the state’s economic health. While these figures have shown growth, they also highlight the need for continued focus on economic development and educational opportunities to ensure all Missourians can access high-quality jobs and achieve a comfortable standard of living.

This report will use U.S. Census Bureau data to compare graduate-level education levels and median income levels for 2020. Because the U.S. Census data is comprehensive and contains millions of data tables, we will utilize the tidycensus R package for ease of data use. Additionally, we will showcase two types of data analysis: Non-spatial and Spatial data analysis.

Loading Packages

The first step for this analysis is to load the library packages that will be used throughout the process. The packages are:

tidycensus = An R package that allows users to interface with a select number of the US Census Bureau’s data APIs and return tidyverse-ready data frames, optionally with simple feature geometry included. More information can be found at This Site

tidyverse = Includes the packages that are used in everyday data analyses, such as ggplot2, dplyr, and readr, to name a few. Additional details can be found at This Site

plotly = A graphing library that makes interactive, publication-quality graphs. More information can be found at This Site

sf = Support for simple features, a standardized way to encode spatial vector data. Additional information can be located at This Site

mapview = Allows you to quickly and conveniently create interactive visualisations of spatial data with or without background maps. More information can be found at This Site

scales = The scales package provides the internal scaling infrastructure used by ggplot2, and gives you tools to override the default breaks, labels, transformations and palettes. Additional details are located at This Site

ggiraph = A tool that allows you to create dynamic ggplot graphs. Additional information can be found at This Site

# Load necessary R packages for this analysis

library(tidycensus)
library(tidyverse)
library(plotly)
library(sf)
library(mapview)
library(scales)
library(ggiraph)

Data Overview

The U.S. Census Data refers to the wealth of information collected by the United States Census Bureau, a principal agency of the U.S. Federal Statistical System, responsible for producing data about the American people and economy. The annual survey the Census Bureau conducts is the American Community Survey (ACS). This survey provides vital information about the nation and its people on a yearly basis. The ACS covers a broad range of topics, such as education, occupation, housing, commuting, and many more. It provides detailed demographic, social, economic, and housing data for congressional districts, counties, and other localities.

The two primary data categories we will examine are the number of graduate degree holders in Missouri by county and the median household income ranges. We will conduct the non-spatial data analysis with the graduate degree levels, show which counties have the largest and smallest percentages of graduate degree holders, and then make a margin of error plot for Missouri. Then, for the spatial data analysis, we will examine the five-year median income level ranges for 2020 for each county in Missouri. We will show the data with a series of graphs and plots, both static and interactive. Let’s begin.

To save time, we will use the functions of the tidycensus package to pull census data directly from the U.S. Census Bureau. Before that can occur, you have to go to This Site and sign up for an API key to pull the ACS data. This key allows you to make more than 500 queries of the census data per day. While 500 may be enough for small projects, larger projects can easily exceed that number. Once you register for an API key, it can be reused across multiple projects. For this project, we only need to set up the key once for both analyses.

# Set up your Census API key for the project
# Replace the placeholder with your own key; avoid publishing a real key in a shared report

census_api_key("YOUR_CENSUS_API_KEY", install = TRUE, overwrite = TRUE)

## Your original .Renviron will be backed up and stored in your R HOME directory if needed.

## Your API key has been stored in your .Renviron and can be accessed by Sys.getenv("CENSUS_API_KEY"). 
## To use now, restart R or run `readRenviron("~/.Renviron")`

## [1] "YOUR_CENSUS_API_KEY"
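Before making any requests, it can help to confirm that the key is visible to the current R session. The following is a minimal sketch using only base R; the variable names `key` and `has_key` are illustrative and not part of the original analysis.

```r
# Read the key from the environment variable named in the message above
key <- Sys.getenv("CENSUS_API_KEY")

# nzchar() is TRUE when the string is non-empty
has_key <- nzchar(key)

if (has_key) {
  message("Census API key found.")
} else {
  message("No Census API key set; run census_api_key() or readRenviron(\"~/.Renviron\").")
}
```

If the key was installed during the current session, it may not be visible until R is restarted or `.Renviron` is re-read, as the installation message notes.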

Non-Spatial Data Analysis

This portion focuses on analyzing educational attainment across counties in Missouri using data from the American Community Survey (ACS) conducted by the U.S. Census Bureau. The script is a data-driven exploration aimed at uncovering patterns or insights related to graduate degrees (or higher education levels) across different geographical areas within the state. We will pull down the data using the tidycensus library and create a series of graphs to show the distribution of graduate-level degrees by county in Missouri.

The first step is to fetch the education data of Missouri from the U.S. Census Bureau.

# Fetch the data for Missouri's counties

graduate_degrees <- get_acs(
  geography = "county", #This indicates that the data requested is at the county level.
  variables = "DP02_0066P", # Percent of the population 25 years and over with a graduate or professional degree
  year = 2020, # The year of the data for which the survey was conducted 
  state = "MO", # State of interest
  survey = "acs5" # Survey type, in this case, American Community Survey Five Year Estimate
)

## Getting data from the 2016-2020 5-year ACS

## Using the ACS Data Profile

Now that we have the data, next we want to see the percentage levels of each county.

# Identify counties with the highest and lowest percentages

highest_percentages <- graduate_degrees %>%
  arrange(desc(estimate)) %>%
  head()

smallest_percentages <- graduate_degrees %>%
  arrange(estimate) %>%
  head()

print("Counties with the highest percentages of graduate degree holders:")

## [1] "Counties with the highest percentages of graduate degree holders:"

print(highest_percentages)

## # A tibble: 6 × 5
##   GEOID NAME                         variable   estimate   moe
##   <chr> <chr>                        <chr>         <dbl> <dbl>
## 1 29019 Boone County, Missouri       DP02_0066P     20.7   1.4
## 2 29189 St. Louis County, Missouri   DP02_0066P     19.3   0.4
## 3 29001 Adair County, Missouri       DP02_0066P     16.8   2.3
## 4 29510 St. Louis city, Missouri     DP02_0066P     16.8   0.8
## 5 29165 Platte County, Missouri      DP02_0066P     16.2   1.1
## 6 29183 St. Charles County, Missouri DP02_0066P     14.9   0.5

print("Counties with the smallest percentages of graduate degree holders:")

## [1] "Counties with the smallest percentages of graduate degree holders:"

print(smallest_percentages)

## # A tibble: 6 × 5
##   GEOID NAME                      variable   estimate   moe
##   <chr> <chr>                     <chr>         <dbl> <dbl>
## 1 29057 Dade County, Missouri     DP02_0066P      2.4   0.8
## 2 29085 Hickory County, Missouri  DP02_0066P      2.8   1.2
## 3 29223 Wayne County, Missouri    DP02_0066P      2.8   1.1
## 4 29197 Schuyler County, Missouri DP02_0066P      3.1   1.1
## 5 29119 McDonald County, Missouri DP02_0066P      3.2   0.9
## 6 29137 Monroe County, Missouri   DP02_0066P      3.3   1

As you can see, there is a large gap between the highest, Boone County at 20.7%, and the lowest, Dade County at 2.4%. While this information is useful, we can improve on it by showing every county's percentage in a margin of error plot. The main purpose of a margin of error plot is to provide a visual representation of the confidence we have in our estimates. The margin of error reflects the range within which the true value is expected to fall at a given level of confidence (the ACS publishes its margins of error at the 90% level). This type of plot helps in understanding the precision of estimates, especially when comparing groups or trends over time.
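As a worked example of what the margin of error means, the sketch below turns Boone County's published figures into confidence intervals. ACS margins of error are published at the 90% confidence level (z = 1.645), so a 95% interval requires rescaling. The variable names are illustrative, and the calculation uses base R only.

```r
# Boone County values taken from the table above
estimate <- 20.7  # percent with a graduate degree
moe_90   <- 1.4   # published ACS margin of error (90% confidence level)

# Convert the 90% margin of error to a standard error
se <- moe_90 / 1.645

# Rescale the standard error to a 95% margin of error
moe_95 <- se * 1.96

# Confidence intervals at both levels
ci_90 <- c(lower = estimate - moe_90, upper = estimate + moe_90)
ci_95 <- c(lower = estimate - moe_95, upper = estimate + moe_95)

print(round(ci_90, 2))
print(round(ci_95, 2))
```

The 95% interval is slightly wider than the published 90% interval, roughly 19.0% to 22.4%. `get_acs()` also accepts a `moe_level` argument if you prefer to have the margins returned at a different confidence level directly.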

# Create a margin of error plot

mo_error_plot <- ggplot(graduate_degrees, aes(x = reorder(NAME, estimate), y = estimate)) +
  geom_point() +
  geom_errorbar(aes(ymin = estimate - moe, ymax = estimate + moe), width = 0.1) +
  coord_flip() +
  labs(title = "Margin of Error for Graduate Degrees by County in Missouri",
  subtitle = "ACS 5-Year Estimates (2020)",
  x = "County",
  y = "Percentage with Graduate Degrees",
  caption = "Data Source: American Community Survey 5-Year Estimates") +
  scale_x_discrete(labels = function(x) str_remove(x, " County, Missouri|, Missouri")) + # Cleaning up the labels of the chart by removing redundant information
  theme_minimal()

# Print the plot

print(mo_error_plot)
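The label clean-up can be checked on a couple of names before plotting. This is a small sketch that uses base R's `sub()` as a stand-in for `str_remove()`, with a pattern that strips " County, Missouri" where present and otherwise just ", Missouri"; the vector names are illustrative.

```r
# Two names as they appear in the NAME column
county_names <- c("Boone County, Missouri", "St. Louis city, Missouri")

# Remove " County, Missouri" where present, otherwise just ", Missouri"
short_names <- sub(" County, Missouri|, Missouri", "", county_names)

print(short_names)
```

Keeping "city" in the St. Louis city label distinguishes it from St. Louis County, which would otherwise share the same shortened name.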

After plotting graduate degree levels for every county, the disparity among Missouri counties in 2020 is clear. The plot also shows that the precision of the estimates varies. Each estimate is drawn as a central point (here, a dot) marking the estimated percentage. The error bars are the lines extending from each point; they represent the margin of error around the estimate. The length of a bar indicates the size of the margin of error: longer bars signify greater uncertainty. For example, Putnam County has a long error bar, indicating a less precise estimate, whereas St. Louis has a very short error bar, indicating a more precise one.
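Error bars also help judge whether two counties really differ. The sketch below applies the standard z-test for the difference between two survey estimates to Boone County and St. Louis County, using the figures from the table of highest percentages. It assumes the margins of error are at the 90% level, and the variable names are illustrative.

```r
# Estimates and 90% margins of error from the table of highest percentages
boone    <- c(estimate = 20.7, moe = 1.4)
st_louis <- c(estimate = 19.3, moe = 0.4)  # St. Louis County

# Standard errors implied by the 90% margins of error
se_boone    <- boone["moe"] / 1.645
se_st_louis <- st_louis["moe"] / 1.645

# Z statistic for the difference between the two estimates
z <- (boone["estimate"] - st_louis["estimate"]) /
  sqrt(se_boone^2 + se_st_louis^2)
z <- unname(z)

# The difference is significant at the 90% level only if |z| exceeds 1.645
is_significant <- abs(z) > 1.645

print(round(z, 2))
print(is_significant)
```

Here the z statistic falls just short of 1.645, so although Boone County ranks first, its lead over St. Louis County is not statistically significant at the 90% level.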

However, with this many counties on one static plot, it can be difficult to examine and compare the information. Let’s make the plot interactive using the plotly package.

# Convert the plot to an interactive plot using plotly

interactive_plot <- ggplotly(mo_error_plot)
interactive_plot

Now we can look at each line of the data and see what the percentage information is. However, it is still a little difficult to read. Let’s clean up and adjust the interactive plot using the ggiraph package to make it easier to read.

# Modify the plot for ggiraph

interactive_mo_plot <- ggplot(graduate_degrees, aes(x = reorder(NAME, estimate), 
                                                    y = estimate,
                                                    tooltip = paste(NAME,
                                                    ":",
                                                    estimate, "%"))) +
  
  geom_point_interactive(size = 4, color = "darkred") + # Interactive points, with a color and size chosen for readability
  geom_errorbar_interactive(aes(ymin = estimate - moe, ymax = estimate + moe),  width = 1) + # Interactive error bars
  coord_flip() +
  labs(title = "Interactive Margin of Error for Graduate Degrees by County in Missouri",
  subtitle = "ACS 5-Year Estimates (2020)",
  x = "County",
  y = "Percentage with Graduate Degrees",
  caption = "Data Source: American Community Survey 5-Year Estimates") +
  scale_x_discrete(labels = function(x) str_remove(x, " County, Missouri|, Missouri")) +
  scale_y_continuous(labels = function(x) paste0(as.integer(x), "%")) + # Format the y-axis labels as percentages
 
  theme_minimal() +
  theme(text = element_text(size = 12), # Default text size for the plot, adjust as necessary
  plot.title = element_text(size = 20), # Adjust title size
  plot.subtitle = element_text(size = 16), # Adjust subtitle size
  axis.title = element_text(size = 16), # Adjust axis titles size
  axis.text = element_text(size = 12)) # Adjust axis text size if necessary

# Render the plot as an interactive plot using 'girafe'

interactive_plot <- girafe(ggobj = interactive_mo_plot)

interactive_plot

With the plot cleaned up, we can clearly see the differences in graduate degree holders by county in Missouri. Next, we will continue to use data from the U.S. Census Bureau to view the median household income.

Spatial Data Analysis

Data Processing

Fetch median household income data for Missouri counties using ACS 5-year estimates

Earlier, we loaded the R packages and established a connection to the U.S. Census Bureau using an API key. Next, we will pull the median household income for analysis.

# Fetch median household income data for Missouri counties using ACS 5-year estimates

mo_income <- get_acs(geography = "county", 
                     variables = "B19013_001", 
                     state = "MO", 
                     year = 2020,
                     survey = "acs5")

## Getting data from the 2016-2020 5-year ACS

Fetch Missouri county spatial data

Next, we need to pull some spatial data to tie to the median household income we just pulled.

# Fetch spatial data for Missouri counties using the tigris package

counties <- tigris::counties(state = "MO", cb = TRUE, class = "sf")

## Retrieving data for the year 2022

Merge data

To tie the income data to the county data, we will merge the two datasets on a common ID column named “GEOID”.

# Merge ACS data with spatial data

MO_Income <- merge(counties, mo_income, by = "GEOID")
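One thing worth checking after a merge is whether any counties were silently dropped, since `merge()` performs an inner join by default. The sketch below illustrates this with two toy data frames. The GEOIDs are taken from the tables above, but the income values and the names `toy_counties` and `toy_income` are made up for illustration.

```r
# Toy stand-ins for the two tables; the income values are made up for illustration
toy_counties <- data.frame(
  GEOID = c("29019", "29189", "29057"),
  NAME  = c("Boone", "St. Louis", "Dade")
)
toy_income <- data.frame(
  GEOID    = c("29019", "29189"),
  estimate = c(50000, 60000)
)

# Default merge() is an inner join: unmatched counties are dropped
inner <- merge(toy_counties, toy_income, by = "GEOID")

# all.x = TRUE keeps every county and fills missing income with NA
left <- merge(toy_counties, toy_income, by = "GEOID", all.x = TRUE)

# GEOIDs present in the county table but absent from the income table
unmatched <- setdiff(toy_counties$GEOID, toy_income$GEOID)

print(nrow(inner))
print(nrow(left))
print(unmatched)
```

Applied to the real data, an empty `setdiff()` result between `counties$GEOID` and `mo_income$GEOID` would suggest that every county found a match.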

Create a Choropleth Map

Now that the data has been merged, let’s see how it looks by creating a choropleth map.

# Create a choropleth map with ggplot2 --------

ggplot(data = MO_Income) + 
  geom_sf(aes(fill = estimate), color = NA) + 
  scale_fill_viridis_c(name = "Median Income", option = "plasma", labels = dollar_format()) + 
  labs(title = "Median Household Income by County in Missouri, 2020",
  subtitle = "ACS 5-Year Estimates") + 
  theme_minimal() +
  theme(text = element_text(size = 12), # Default text size for the plot, adjust as necessary
  plot.title = element_text(size = 20), # Adjust title size
  plot.subtitle = element_text(size = 16), # Adjust subtitle size
  axis.title = element_text(size = 16), # Adjust axis titles size
  axis.text = element_text(size = 12), # Adjust axis text size if necessary 
  legend.text = element_text(size = 14)) # Adjust legend text size

While this map looks nice, we can make it better by adding interactivity. Let’s take the same data and create an interactive map for users to gain more information and insights.

Create an Interactive Map

Before we produce an interactive map, there is some data formatting that needs to take place in order for the map to be easily understood.

# Format the numbers to display in dollar format for tooltip

MO_Income$Estimate_Value <- scales::dollar(MO_Income$estimate)

# Calculate the income breaks for five categories of income ranges

income_breaks <- quantile(MO_Income$estimate, probs = seq(0, 1, by = 0.2), na.rm = TRUE)

# Generate custom range labels based on the calculated income ranges

labels_ranges <- sapply(1:(length(income_breaks)-1), function(i) {
  low <- dollar(income_breaks[i])
  high <- dollar(income_breaks[i+1] - 1)  # Subtracting 1 to avoid overlapping ranges
  paste(low, "-", high)  # Label each range from its lower bound to its upper bound
})

# Create a factor variable for categorization based on the income range breaks

MO_Income$Income_Category <- cut(MO_Income$estimate,
                                 breaks = income_breaks,
                                 include.lowest = TRUE,
                                 labels = labels_ranges)

# Reverse the factor levels of 'Income_Category' to go from highest to lowest

MO_Income$Income_Category <- factor(MO_Income$Income_Category, levels = rev(levels(MO_Income$Income_Category)))
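To see how the quantile-based binning behaves, here is a minimal base R sketch on ten hypothetical income values. The values, the helper `fmt()`, and the other names are assumptions for illustration only; the real analysis uses the county estimates and `scales::dollar()`.

```r
# Hypothetical income values, chosen only to make the bins easy to follow
toy_income <- c(30000, 35000, 40000, 45000, 50000,
                55000, 60000, 65000, 70000, 75000)

# Six break points define five quintile bins
breaks <- quantile(toy_income, probs = seq(0, 1, by = 0.2))

# Build "low - high" labels with base R formatting
fmt <- function(x) {
  paste0("$", format(x, big.mark = ",", scientific = FALSE, trim = TRUE))
}
bin_labels <- paste(fmt(head(breaks, -1)), "-", fmt(tail(breaks, -1)))

# include.lowest = TRUE keeps the minimum value inside the first bin
bins <- cut(toy_income, breaks = breaks, include.lowest = TRUE, labels = bin_labels)

print(table(bins))
```

Because the breaks are quintiles, each bin receives the same number of values, which is why a quantile-based map shows roughly equal numbers of counties in each color class.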

Now that the household median income has been formatted into five categories for ease of viewing, let’s now build an interactive map.

# 'zcol' specifies the column for the color scale, which will now be the factor variable 'Income_Category'
# 'layer.name' sets the display name of the layer (used in the legend and layer control);
# the dollar-formatted 'Estimate_Value' column appears in the click popup along with the other attributes

map <- mapview(MO_Income, zcol = "Income_Category", 
               layer.name = "Estimate_Value", 
               legend = TRUE)

# Display the map
map

Median Household Income by County in Missouri, 2020

Now you can examine each county in more depth. You can hover your cursor over a county to get some information, and clicking on a county feature shows even more information from the attribute table.

Conclusion

This report comprehensively analyzed the educational attainment and income levels across Missouri counties, utilizing the American Community Survey (ACS) 5-year estimates for 2020. With its mix of urban and rural areas, Missouri showcases significant disparities in education and income levels, reflecting its diverse socio-economic landscape. High-ranking universities and colleges in metropolitan regions like St. Louis and Kansas City contribute positively to the state’s educational profile. However, there’s a noticeable gap between these urban centers and the more rural parts of Missouri regarding education and median household income levels.

The analysis employed a variety of R packages to manage, visualize, and interpret U.S. Census Bureau data, highlighting the utility of tools such as tidycensus, tidyverse, plotly, sf, mapview, scales, and ggiraph for both non-spatial and spatial data analysis. For non-spatial analysis, the focus was on graduate degree holders across Missouri counties, revealing a wide gap between the highest (Boone County at 20.7%) and lowest (Dade County at 2.4%) percentages of graduate degree holders. This disparity was further illustrated through the margin-of-error plots and interactive visualizations, enhancing the comprehensibility of the data.

Spatial data analysis centered on median household income by county, presented through both choropleth and interactive maps. These visualizations effectively communicated the income disparities across Missouri, with the interactive maps providing an engaging means for users to explore detailed income data by county.

Missouri’s socio-economic landscape is marked by its diversity, with significant disparities in education and income levels across the state. The presence of prominent educational institutions in urban areas contrasts with the challenges faced by rural counties, emphasizing the need for targeted economic development and educational opportunities to bridge these gaps. The report underscores the value of utilizing comprehensive data analysis tools to understand and address the socio-economic disparities within Missouri. By continuing to focus on economic development and educational access, policymakers and stakeholders can work towards a more equitable future for all Missourians.

Missouri Population Education and Income Statistics

Erik Reid

2024-04-01