Source: Data Driven Detroit, 2018 “https://datadrivendetroit.org/blog/2018/11/27/how-does-the-census-affect-local-and-state-budgeting/”
Introduction
For this project, I decided to explore state-level government finances in the United States. The dataset I used comes from the U.S. Census Bureau, which collects detailed financial data from all 50 states. The dataset includes variables like Total Expenditure, Capital Outlay, Revenue, General Revenue, and Insurance Trust Revenue, as well as more specific spending categories such as Education, Health, Welfare, Transportation, Parks, and Utilities. Most of these variables are numeric, measuring dollars spent or received by each state in a given year.
I was especially drawn to this dataset because I’ve always been a “numbers girl” and love looking at budgets and comparing how money is spent. I wanted to explore questions like: Which states spend the most overall? How do different categories of spending compare across states? And is there a relationship between a state’s revenue and its total expenditure?
I could not find a ReadMe file for the dataset, so I am not sure of the exact methodology used to collect all the data, but it is standard for the Census Bureau to gather state financial reports submitted each year.I find it interesting to see how different states manage their money and where priorities like education or healthcare are reflected in spending. Working with this data allows me to combine my love for math, budgets, and data analysis all in one project.
To work with the dataset, I first cleaned it by removing totals for the entire United States and grouping states into regions (Northeast, Midwest, South, West) for easier comparisons. I also added latitude and longitude coordinates for each state to make map visualizations possible.
Some background research
Research shows that state governments in the U.S. spend money on many of the same things my dataset helps explore. For example, education and health care make up some of the largest areas of state spending, with public welfare and Medicaid being big pieces of the budget for most states. According to the Urban Institute, state and local governments spend a large part of their budgets on education, health, and welfare programs, and these categories have grown significantly over the past few decades. The Census Bureau’s Annual Survey of State Government Finances collects data on revenue and expenditure categories such as taxes, intergovernmental revenue, and capital outlay for each state. This supports the idea that the numbers in my dataset—like total expenditure, capital outlay, and revenue—reflect how state governments allocate their money each year. I chose this research because it helps explain why different categories of spending (like education and health) are important for understanding how states manage their budgets.
Cited: U.S. Census Bureau. Annual Survey of State and Local Government Finances. U.S. Census Bureau, 2025, https://www.census.gov/programs-surveys/gov-finances.html.
Urban Institute. “State and Local Expenditures.” Urban Institute, 2025, https://www.urban.org/policy-centers/cross-center-initiatives/state-and-local-finance-initiative/state-and-local-backgrounders/state-and-local-expenditures.
Load Libraries and Dataset
# Look at datasetlibrary(tidyverse)
Warning: package 'ggplot2' was built under R version 4.5.1
Warning: package 'tibble' was built under R version 4.5.1
Warning: package 'purrr' was built under R version 4.5.1
Warning: package 'stringr' was built under R version 4.5.1
Warning: package 'forcats' was built under R version 4.5.1
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.1 ✔ stringr 1.5.2
✔ ggplot2 4.0.0 ✔ tibble 3.3.0
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.1.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(readr)library(plotly)
Warning: package 'plotly' was built under R version 4.5.1
Attaching package: 'plotly'
The following object is masked from 'package:ggplot2':
last_plot
The following object is masked from 'package:stats':
filter
The following object is masked from 'package:graphics':
layout
library(leaflet)
Warning: package 'leaflet' was built under R version 4.5.1
Rows: 1051 Columns: 31
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): State
dbl (30): Year, Totals.Capital outlay, Totals.Revenue, Totals.Expenditure, T...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Filternewdf <- df |>filter(State !="UNITED STATES") |>mutate(region =case_when( State %in%c("CONNECTICUT", "MAINE", "MASSACHUSETTS", "NEW HAMPSHIRE", "RHODE ISLAND", "VERMONT", "NEW JERSEY", "NEW YORK", "PENNSYLVANIA") ~"Northeast", State %in%c("ILLINOIS", "INDIANA", "MICHIGAN", "OHIO", "WISCONSIN", "IOWA", "KANSAS", "MINNESOTA", "MISSOURI", "NEBRASKA", "NORTH DAKOTA", "SOUTH DAKOTA") ~"Midwest", State %in%c("DELAWARE", "FLORIDA", "GEORGIA", "MARYLAND", "NORTH CAROLINA", "SOUTH CAROLINA", "VIRGINIA", "DISTRICT OF COLUMBIA", "WEST VIRGINIA", "ALABAMA", "KENTUCKY", "MISSISSIPPI", "TENNESSEE", "ARKANSAS", "LOUISIANA", "OKLAHOMA", "TEXAS", "VIRGIN ISLANDS", "PUERTO RICO") ~"South", State %in%c("ARIZONA", "COLORADO", "IDAHO", "MONTANA", "NEVADA", "NEW MEXICO", "UTAH", "WYOMING", "ALASKA", "CALIFORNIA", "HAWAII", "OREGON", "WASHINGTON", "GUAM") ~"West",TRUE~"Other" ))#Add lat and lng points (Used Chat gpt to get the cordanates)newdf <- newdf |>mutate(lat =case_when( State =="ALABAMA"~32.8, State =="ALASKA"~64.2, State =="ARIZONA"~34.3, State =="ARKANSAS"~35.1, State =="CALIFORNIA"~36.8, State =="COLORADO"~39.0, State =="CONNECTICUT"~41.6, State =="DELAWARE"~39.0, State =="FLORIDA"~27.8, State =="GEORGIA"~32.6, State =="HAWAII"~20.8, State =="IDAHO"~44.1, State =="ILLINOIS"~40.0, State =="INDIANA"~40.0, State =="IOWA"~42.1, State =="KANSAS"~38.5, State =="KENTUCKY"~37.5, State =="LOUISIANA"~31.0, State =="MAINE"~45.3, State =="MARYLAND"~39.0, State =="MASSACHUSETTS"~42.4, State =="MICHIGAN"~44.3, State =="MINNESOTA"~46.4, State =="MISSISSIPPI"~32.7, State =="MISSOURI"~38.6, State =="MONTANA"~46.9, State =="NEBRASKA"~41.1, State =="NEVADA"~39.3, State =="NEW HAMPSHIRE"~43.7, State =="NEW JERSEY"~40.1, State =="NEW MEXICO"~34.5, State =="NEW YORK"~42.9, State =="NORTH CAROLINA"~35.5, State =="NORTH DAKOTA"~47.5, State =="OHIO"~40.4, State =="OKLAHOMA"~35.6, State =="OREGON"~44.0, State =="PENNSYLVANIA"~41.2, State =="RHODE ISLAND"~41.7, State =="SOUTH CAROLINA"~33.8, State =="SOUTH DAKOTA"~44.4, State =="TENNESSEE"~35.8, State =="TEXAS"~31.0, State =="UTAH"~39.3, State =="VERMONT"~44.0, State =="VIRGINIA"~37.5, State =="WASHINGTON"~47.4, State =="WEST VIRGINIA"~38.6, State =="WISCONSIN"~44.6, State =="WYOMING"~43.0,TRUE~NA_real_ ),lng =case_when( State =="ALABAMA"~-86.8, State =="ALASKA"~-152.4, State =="ARIZONA"~-111.7, State =="ARKANSAS"~-92.4, State =="CALIFORNIA"~-119.7, State =="COLORADO"~-105.5, State =="CONNECTICUT"~-72.7, State =="DELAWARE"~-75.5, State =="FLORIDA"~-81.7, State =="GEORGIA"~-83.4, State =="HAWAII"~-157.5, State =="IDAHO"~-114.7, State =="ILLINOIS"~-89.2, State =="INDIANA"~-86.1, State =="IOWA"~-93.5, State =="KANSAS"~-98.0, State =="KENTUCKY"~-85.3, State =="LOUISIANA"~-92.0, State =="MAINE"~-69.0, State =="MARYLAND"~-76.7, State =="MASSACHUSETTS"~-71.8, State =="MICHIGAN"~-85.4, State =="MINNESOTA"~-94.6, State =="MISSISSIPPI"~-89.7, State =="MISSOURI"~-92.5, State =="MONTANA"~-110.4, State =="NEBRASKA"~-99.9, State =="NEVADA"~-116.6, State =="NEW HAMPSHIRE"~-71.6, State =="NEW JERSEY"~-74.5, State =="NEW MEXICO"~-106.0, State =="NEW YORK"~-75.5, State =="NORTH CAROLINA"~-79.4, State =="NORTH DAKOTA"~-100.5, State =="OHIO"~-82.8, State =="OKLAHOMA"~-97.5, State =="OREGON"~-120.6, State =="PENNSYLVANIA"~-77.2, State =="RHODE ISLAND"~-71.5, State =="SOUTH CAROLINA"~-81.0, State =="SOUTH DAKOTA"~-100.2, State =="TENNESSEE"~-86.4, State =="TEXAS"~-99.3, State =="UTAH"~-111.7, State =="VERMONT"~-72.7, State =="VIRGINIA"~-78.7, State =="WASHINGTON"~-120.5, State =="WEST VIRGINIA"~-80.6, State =="WISCONSIN"~-89.9, State =="WYOMING"~-107.6,TRUE~NA_real_ ) )
Multiple Linear Regression
model <-lm(`Totals.Expenditure`~`Totals.Capital outlay`+`Totals.Revenue`+`Totals.General revenue`+`Totals.Insurance trust revenue`,data = newdf)# Show regression resultssummary(model)
I ran a multiple linear regression to see if different types of state revenue, like Capital Outlay, Total Revenue, General Revenue, and Insurance Trust Revenue, could predict a state’s Total Expenditure. The numbers in the results, called coefficients, tell us how much each type of revenue is linked to spending. For example, if the coefficient for Total Revenue is positive, that basically means states that make more money tend to spend more.
The R-squared value shows how much of the spending differences between states the model can explain. From my results, it looks like some revenue types matter more than others. A few of the numbers did not seem very significant, which suggests those types of revenue do not really affect total spending that much.
Overall, the model makes sense and kind of confirms what I expected. States with more money usually spend more. But it is not perfect. I could not include things like population size or federal funding, which probably also affect spending. Even so, it was really interesting to see the math behind budgets and how the numbers connect, which is exactly why I like working with this kind of data.
Visualization 1
#Filter top 5 statestop5_states <- newdf |>filter(Year ==max(Year)) |>group_by(State) |>summarise(TotalExpenditure =sum(`Totals.Expenditure`)) |>arrange(desc(TotalExpenditure)) |>slice(1:5) |>pull(State)# Prepare expenditure categories using case_whenexp_categories <- newdf |>filter(State %in% top5_states & Year ==max(Year)) |>select(State,`Details.Education.Education Total`,`Details.Health.Health Total Expenditure`,`Details.Welfare.Welfare Institution Total Expenditure`,`Details.Transportation.Highways.Highways Total Expenditure`,`Details.Natural Resources.Parks.Parks Total Expenditure`,`Details.Utilities.Utilities Current Operation`) |>pivot_longer(cols =-State, names_to ="Category", values_to ="Expenditure") |>mutate(Category =case_when( Category =="Details.Education.Education Total"~"Education", Category =="Details.Health.Health Total Expenditure"~"Health", Category =="Details.Welfare.Welfare Institution Total Expenditure"~"Welfare", Category =="Details.Transportation.Highways.Highways Total Expenditure"~"Transportation", Category =="Details.Natural Resources.Parks.Parks Total Expenditure"~"Parks", Category =="Details.Utilities.Utilities Current Operation"~"Utilities",TRUE~ Category ))# "pivot_longer" and this chunk found online at "https://tidyr.tidyverse.org/reference/pivot_longer.html?"
# bar chartplot_ly(exp_categories, x =~State, y =~Expenditure, color =~Category, colors =c("Education"="blue", "Health"="red", "Welfare"="green", "Transportation"="orange", "Parks"="purple", "Utilities"="brown"),type ='bar') |>layout(title =paste("Expenditure Breakdown for Top 5 States in", max(newdf$Year)),barmode ='stack',xaxis =list(title ="State"),yaxis =list(title ="Expenditure ($)"),legend =list(title=list(text='Category')))
Explanation
This stacked bar chart shows how the top five states with the highest total expenditure in the most recent year split their spending across different categories like Education, Health, Welfare, Transportation, Parks, and Utilities. One interesting pattern is that Education and Health consistently take up the largest portions of the budget, which shows that states prioritize these areas over things like Parks or Utilities. Another surprise is that some states spend a noticeable amount on Welfare, while others allocate very little, which could reflect differences in population needs or state policies.
I wish I could have included a trend over time, so we could see how these expenditures change from year to year, but it would have been too much at once to comprehend. I also would have liked to include per capita spending to better compare states of different sizes.
#Legend help "https://rstudio.github.io/leaflet/reference/addLegend.html?"
Explanation
This interactive map uses circles to show the total expenditure of each state in the latest year, with the size and color of the circle representing the amount spent. One interesting pattern is that large states with higher populations, like California and Texas, have much bigger circles, indicating higher spending, while smaller states in the Midwest or Northeast have smaller circles. A surprise is that some smaller states still have relatively high expenditures, which might reflect higher per capita spending or specific programs.
I wish I could have included both revenue and expenditure in the same map, so we could compare how much money each state brings in versus how much it spends. Another feature that didn’t work as smoothly as I wanted was adding different layers for regions, which could have made it easier to visually compare regions at a glance.