This project explores total U.S. health expenditures from 1960 to 2023, using data sourced from the Centers for Medicare & Medicaid Services (CMS). Four classic forecasting models (naive, simple exponential smoothing, linear trend, and quadratic trend) are applied in R to evaluate their predictive accuracy and to generate forecasts for 2024–2026. The goal is to identify which model performs best for medium-term (2–5 years) annual forecasts, providing actionable insights into likely near-future national healthcare spending trends.
# Load the packages used throughout the analysis
library(forecast)
library(ggplot2)
library(scales)
library(knitr)
# Read in the data, skipping the first header row
nhe_raw <- read.csv("NHE2023.csv", skip = 1, header = TRUE, stringsAsFactors = FALSE)
# Strip the leading "X" that read.csv() prepends to numeric column names
colnames(nhe_raw) <- c(
  colnames(nhe_raw)[1],                 # Keep first column as-is
  sub("^X", "", colnames(nhe_raw)[-1])  # Remove 'X' from the rest
)
# Extract the first row (exclude first column)
spending_values_raw <- nhe_raw[1, -1]
# Remove commas and convert to numeric
spending_values <- as.numeric(gsub(",", "", spending_values_raw))
# Get years from column names
years <- as.numeric(colnames(nhe_raw)[-1])
# Build dataframe
nhe_df <- data.frame(
  Year = years,
  TotalSpending = spending_values
)
# check for NAs
# sum(is.na(nhe_df))
# Plot data
ggplot(data = nhe_df, mapping = aes(x = Year, y = TotalSpending)) +
  geom_line() +
  geom_point() +
  scale_y_continuous(labels = comma) +
  labs(
    title = "Total Annual National Health Expenditure 1960-2023 in Millions (USD)",
    x = "Year",
    y = "Total Spending in Millions (USD)"
  ) +
  theme_minimal()
Spending shows a clear increasing trend and accelerating growth over time, suggesting a quadratic model may be the best fit for the data.
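As a rough, informal check on the "accelerating growth" reading, first and second differences can be inspected: a growing series has positive first differences, and a quadratic-like series has roughly constant, positive second differences. The sketch below uses a made-up toy series (`toy`, not the CMS data) purely for illustration.

```r
# Toy series following 2*t^2 + 3*t + 5 (illustrative values, not CMS data)
t_idx <- 1:6
toy <- 2 * t_idx^2 + 3 * t_idx + 5

first_diff <- diff(toy)                    # year-over-year change
second_diff <- diff(toy, differences = 2)  # change in the year-over-year change

print(first_diff)   # increasing values indicate accelerating growth
print(second_diff)  # constant values indicate an exactly quadratic series
```

On the real data the same calls could be applied to `nhe_df$TotalSpending`; second differences that drift or swing widely would suggest the quadratic form is only an approximation.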
nhe_ts <- ts(nhe_df$TotalSpending, start = min(nhe_df$Year), frequency = 1)
# Use all but last 5 years for training, last 5 years for testing
yearstotal <- length(nhe_ts) # Total number of years
# Train: all years but last 5
train <- head(nhe_ts, yearstotal - 5)
# Test: last 5 years (realistic forecast window, leaves more history for training)
test <- tail(nhe_ts, 5)
The last five years (2019–2023) were held out as the test set, with earlier years used for training. This maximizes historical data for model fitting while providing a realistic window to evaluate forecasts and predict through 2026.
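The same kind of holdout split can also be written with `window()`, which makes the calendar years of each set explicit. This is a minimal sketch on a made-up annual series; the names `toy_ts`, `train_demo`, and `test_demo` are hypothetical and not part of the analysis.

```r
# Toy annual series, 2016-2023 (illustrative values, not CMS data)
toy_ts <- ts(c(10, 12, 15, 19, 24, 30, 37, 45), start = 2016, frequency = 1)

h <- 3                               # number of years to hold out
split_year <- end(toy_ts)[1] - h     # last year of the training set

train_demo <- window(toy_ts, end = split_year)        # 2016-2020
test_demo  <- window(toy_ts, start = split_year + 1)  # 2021-2023

print(train_demo)
print(test_demo)
```

Because `window()` works on the time index rather than on positions, the resulting objects keep their start and end years, which is what lets forecasts and actuals line up on a shared axis.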
# Naive forecast: every future year equals the last training observation
f_naive <- naive(train, h = length(test))
autoplot(f_naive) + autolayer(test, series = "Test Set (Actual Spending)") +
  scale_y_continuous(labels = comma) +
  labs(title = "Naive Forecast vs. Test Set (Actual Spending) in Millions (USD)",
       y = "Total Spending in Millions (USD)", color = "Data Type") +
  theme_minimal()
# Simple exponential smoothing: flat forecast from a smoothed level
f_ses <- ses(train, h = length(test))
autoplot(f_ses) + autolayer(test, series = "Test Set (Actual Spending)") +
  scale_y_continuous(labels = comma) +
  labs(title = "SES Forecast vs. Test Set (Actual Spending) in Millions (USD)",
       y = "Total Spending in Millions (USD)", color = "Data Type") +
  theme_minimal()
# Linear trend regression
f_lm <- forecast(tslm(train ~ trend), h = length(test))
autoplot(f_lm) + autolayer(test, series = "Test Set (Actual Spending)") +
  scale_y_continuous(labels = comma) +
  labs(title = "Linear Model Forecast vs. Test Set (Actual Spending) in Millions (USD)",
       y = "Total Spending in Millions (USD)", color = "Data Type") +
  theme_minimal()
# Quadratic trend regression
f_quad <- forecast(tslm(train ~ trend + I(trend^2)), h = length(test))
autoplot(f_quad) + autolayer(test, series = "Test Set (Actual Spending)") +
  scale_y_continuous(labels = comma) +
  labs(title = "Quadratic Model Forecast vs. Test Set (Actual Spending) in Millions (USD)",
       y = "Total Spending in Millions (USD)", color = "Data Type") +
  theme_minimal()
autoplot(train, series = "Training Set") +
  autolayer(test, series = "Test Set (Actual Spending)") +
  autolayer(f_naive$mean, series = "Naive", linetype = "dashed") +
  autolayer(f_ses$mean, series = "SES", linetype = "dotted") +
  autolayer(f_lm$mean, series = "Linear") +
  autolayer(f_quad$mean, series = "Quadratic") +
  labs(
    title = "U.S. Health Expenditure Forecasts vs. Actual Spending (Train + Test)",
    y = "Total Spending in Millions (USD)",
    x = "Year"
  ) +
  scale_y_continuous(labels = comma) +
  guides(colour = guide_legend(title = "Model")) +
  theme_minimal()
# Build a table of test-set accuracy metrics for each model
accuracy_table <- data.frame(
  Model = c("Naive", "SES", "Linear", "Quadratic"),
  RMSE = c(
    accuracy(f_naive, test)["Test set", "RMSE"],
    accuracy(f_ses, test)["Test set", "RMSE"],
    accuracy(f_lm, test)["Test set", "RMSE"],
    accuracy(f_quad, test)["Test set", "RMSE"]
  ),
  MAPE = c(
    accuracy(f_naive, test)["Test set", "MAPE"],
    accuracy(f_ses, test)["Test set", "MAPE"],
    accuracy(f_lm, test)["Test set", "MAPE"],
    accuracy(f_quad, test)["Test set", "MAPE"]
  )
)
# Display table
kable(accuracy_table, digits = 2, caption = "**Forecast Accuracy Comparison of Models**")
| Model | RMSE | MAPE |
|---|---|---|
| Naive | 811980.6 | 16.10 |
| SES | 811994.7 | 16.10 |
| Linear | 1377657.8 | 30.80 |
| Quadratic | 336833.6 | 6.58 |
Naive Forecast:
Naive forecasting provides a baseline for comparison with other models, but ignores trend and accelerating growth.
Simple Exponential Smoothing (SES):
The SES model reacts slowly to trend and fails to capture accelerating growth. This model performs similarly to the naive forecast.
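One reason SES lands so close to the naive forecast is that both produce a flat line: SES forecasts every future year at its final smoothed level, and when the smoothing parameter alpha is near 1 that level is essentially the last observation. The hand-rolled recursion below illustrates this on made-up numbers. The function name `ses_level` is hypothetical and is not part of the forecast package; unlike `ses()`, it does not estimate alpha or the initial level.

```r
# Final SES level after running the recursion
#   level_t = alpha * y_t + (1 - alpha) * level_{t-1}
# over the whole series, starting from l0 (assumed here to be the first value).
ses_level <- function(y, alpha, l0 = y[1]) {
  level <- l0
  for (obs in y) {
    level <- alpha * obs + (1 - alpha) * level
  }
  level
}

toy <- c(100, 120, 145, 175, 210)  # illustrative upward-trending series

level_high <- ses_level(toy, alpha = 1)    # reduces to the last observation
level_low  <- ses_level(toy, alpha = 0.5)  # lags behind the trend

# SES forecasts are flat: the same level repeated for every horizon
print(rep(level_high, 3))
print(rep(level_low, 3))
```

With a trending series, any alpha below 1 leaves the level behind the latest value, so the flat forecast starts too low and falls further behind each year. The near-identical naive and SES errors in the table would be consistent with an estimated alpha close to 1, though the fitted value is not reported here.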
Linear Model:
The linear model captures overall upward trend, but significantly underestimates accelerating growth.
Quadratic Model:
Of the four models, the quadratic model best captures overall upward trend and accelerating growth, as seen by the lowest Root Mean Squared Error (RMSE) and Mean Absolute Percentage Error (MAPE). With a MAPE of 6.58%, the quadratic model is capturing the changes in yearly spending reasonably well.
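For readers unfamiliar with the two metrics, the sketch below computes RMSE and MAPE by hand in base R on a small made-up example. The helper names `rmse` and `mape` and the toy vectors are assumptions for illustration only; the analysis itself relies on `accuracy()` from the forecast package.

```r
# Root Mean Squared Error: penalizes large misses, in the units of the data
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# Mean Absolute Percentage Error: average miss as a percentage of the actual value
mape <- function(actual, predicted) {
  mean(abs((actual - predicted) / actual)) * 100
}

actual    <- c(100, 200, 400)  # illustrative values, not CMS data
predicted <- c(110, 180, 400)

toy_rmse <- rmse(actual, predicted)
toy_mape <- mape(actual, predicted)

print(toy_rmse)
print(toy_mape)
```

RMSE is scale-dependent, which is why the values in the table are in millions of dollars, while MAPE is unit-free and easier to compare across series of different size.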
# Refit quadratic model on full dataset (1960–2023)
fit_best <- tslm(nhe_ts ~ trend + I(trend^2))
# Forecast 3 years ahead (2024–2026)
f_best <- forecast(fit_best, h = 3)
# Plot forecast
autoplot(f_best) +
  scale_y_continuous(labels = comma) +
  labs(title = "Forecasted U.S. Health Spending in Millions (USD) (2024–2026)",
       x = "Year", y = "Total Spending in Millions (USD)") +
  theme_minimal()
# Table: Predicted total U.S. healthcare spending 2024-2026 with 80% and 95% prediction intervals
forecast_df <- data.frame(
  Year = 2024:2026,
  Forecast_Trillions = round(as.numeric(f_best$mean) / 1e6, 2),
  Forecast_Millions = comma(round(as.numeric(f_best$mean), 0)),
  Lo80 = comma(round(as.numeric(f_best$lower[, 1]), 0)),
  Hi80 = comma(round(as.numeric(f_best$upper[, 1]), 0)),
  Lo95 = comma(round(as.numeric(f_best$lower[, 2]), 0)),
  Hi95 = comma(round(as.numeric(f_best$upper[, 2]), 0))
)
kable(forecast_df, caption = "Predicted U.S. Health Spending in Millions (USD) 2024–2026 with 80% and 95% Prediction Intervals")
| Year | Forecast_Trillions | Forecast_Millions | Lo80 | Hi80 | Lo95 | Hi95 |
|---|---|---|---|---|---|---|
| 2024 | 4.69 | 4,686,183 | 4,564,545 | 4,807,821 | 4,498,446 | 4,873,920 |
| 2025 | 4.86 | 4,858,837 | 4,736,168 | 4,981,507 | 4,669,507 | 5,048,167 |
| 2026 | 5.03 | 5,034,634 | 4,910,831 | 5,158,438 | 4,843,554 | 5,225,714 |
The 80% and 95% prediction intervals indicate the range of plausible spending outcomes for each year. Each year's projected increase is larger than the previous year's, as expected under a quadratic trend.
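The growing year-over-year increase can be checked directly from the point forecasts in the table above. The short sketch below re-enters those three values by hand (the vector name `forecast_millions` is hypothetical) and differences them.

```r
# Point forecasts copied from the table, in millions of USD
forecast_millions <- c(`2024` = 4686183, `2025` = 4858837, `2026` = 5034634)

# Year-over-year increases: 2024 to 2025, then 2025 to 2026
yearly_increase <- diff(forecast_millions)

# Change in the increase; a quadratic trend implies this stays constant
acceleration <- diff(yearly_increase)

print(yearly_increase)
print(acceleration)
```

The increases come to 172,654 and 175,797 million USD, so the second step is larger than the first by about 3.1 billion USD. Under a quadratic trend that gap is the same for every pair of consecutive years, apart from rounding in the tabulated values.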
Based on the quadratic trend forecast, total national healthcare spending is expected to continue its upward trajectory with accelerated growth. This aligns with historical patterns, particularly since the early 2000s and even more so since 2020. The model predicts U.S. healthcare spending to exceed $5 trillion by 2026.
CMS projections for 2024–2026 are higher, reflecting factors not included in this model.
Policymakers and healthcare administrators should anticipate higher-than-linear increases in spending. These findings highlight the rapid growth in healthcare costs and underscore the importance of predictive analytics in budgeting and reimbursement planning.
Predictions assume that past trends will continue; however, unexpected events such as policy changes, pandemics, or inflation could alter actual spending. COVID-19 and inflation during the test set period (2019–2023) likely created accelerated spending patterns not seen before, which simple forecasting models cannot fully capture. This may explain why even the best-fitting model, the quadratic, produces a forecast that dips below actual 2023 spending.
For comparison, CMS’s official projections for 2024–2026 are higher than this model’s results, likely because they incorporate healthcare price inflation, as well as demographic and utilization factors, which are not included in simple time-series models.
Future work could improve forecast accuracy by incorporating inflation adjustments, utilization trends, and demographic variables, or by exploring more sophisticated modeling techniques beyond basic time-series approaches.
This project was implemented in R (version 4.5.1), using the following packages:
- forecast: Time series modeling and forecasting
- ggplot2: Data visualization
- scales: Axis formatting and labels
- knitr: Report generation