As part of this study, R libraries such as tidyverse will be used for data manipulation and visualization. More specifically, the functions in ggplot2 and dplyr will be used frequently for plotting and reorganizing data. For consistency and clarity, the tidyverse library and the dataset itself will be loaded before descriptive analysis.
# Loading tidyverse for data manipulation and visualization
suppressPackageStartupMessages(library(tidyverse))
# Loading Medicare dataset for study (information is provided below)
medicare <- read.csv("~/Downloads/medicare_providers_2022.csv")
The dataset selected for R data and statistical analysis in this study is sourced from the Centers for Medicare & Medicaid Services (CMS). Updated annually, this dataset represents the 2022 fiscal year and provides recent Medicare information on services and procedures for Original Medicare (fee-for-service) Part B (Medical Insurance) beneficiaries. The dataset is unique in its representation of information as it includes multiple geographic and service-related variables.
The Medicare Physician & Other Practitioners - by Geography and Service Overview link provided below includes a list of all the data available for Medicare physicians and other practitioners, sorted by year. The dataset used for this study is under the 2022 selection area. To access the data, an acceptance of usage policy may be required.
Medicare Physician & Other Practitioners - by Geography and Service Overview
The Medicare Physician & Other Practitioners - by Geography and Service Dataset link provided below is the most recent available dataset from 2022. The dataset can be exported in CSV or CSV for Excel formats; the former was used for data analysis in this document. This dataset is 42 MB in size and has 270,673 rows of data.
Medicare Physician & Other Practitioners - by Geography and Service Dataset
The Medicare Physician & Other Practitioners - by Geography and Service Data Dictionary link provided below includes detailed information on each variable. However, these variables and their implications will be covered in this study, which makes this source optional for reviewing as most information will be reiterated in this document.
Medicare Physician & Other Practitioners - by Geography and Service Data Dictionary
The Medicare Physician & Other Practitioners - by Geography and Service dataset selected for this study includes information regarding Medicare provider locations and services, overall geographic factors, and quantitative representations of payments submitted by providers and the aid paid to beneficiaries by Medicare.
Before analyzing the 15 variables in the dataset, the variables will be renamed for the purpose of clarity moving forward. This step is important to ensure that each variable name reflects its content and purpose in an accessible manner.
# Renaming medicare variables to accessible identifiers
medicare <- medicare %>%
rename(
provider_geo_level = Rndrng_Prvdr_Geo_Lvl,
provider_geo_code = Rndrng_Prvdr_Geo_Cd,
provider_geo_description = Rndrng_Prvdr_Geo_Desc,
hcpcs_code = HCPCS_Cd,
hcpcs_description = HCPCS_Desc,
hcpcs_drug_indicator = HCPCS_Drug_Ind,
place_of_service = Place_Of_Srvc,
total_providers = Tot_Rndrng_Prvdrs,
total_services = Tot_Srvcs,
total_beneficiaries = Tot_Benes,
total_beneficiary_day_services = Tot_Bene_Day_Srvcs,
average_submitted_charge = Avg_Sbmtd_Chrg,
average_medicare_allowed_amount = Avg_Mdcr_Alowd_Amt,
average_medicare_payment_amount = Avg_Mdcr_Pymt_Amt,
average_medicare_standardized_amount = Avg_Mdcr_Stdzd_Amt
)
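As a minimal sketch of the renaming pattern above, the following uses a made-up two-row data frame (the values are invented and are not CMS data). It only illustrates that `rename()` expects `new_name = old_name` pairs.

```r
library(dplyr)

# Toy data frame with two of the original CMS column names (values are invented)
toy <- data.frame(
  Rndrng_Prvdr_Geo_Lvl = c("National", "State"),
  Tot_Srvcs = c(120, 45)
)

# rename() takes new_name = old_name pairs
toy_renamed <- toy %>%
  rename(
    provider_geo_level = Rndrng_Prvdr_Geo_Lvl,
    total_services = Tot_Srvcs
  )

names(toy_renamed)
```

Columns that are not listed in `rename()` keep their original names, so a partial rename is safe.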
The following table includes the type (character or numeric), classification (binary, nominal, ordinal, discrete, or continuous), and a basic overview description of each variable in the dataset. Additional information regarding the implications of each variable is covered below this table. However, this table can be used as a reference to understand what each variable means throughout the study.
| Variable Name | Type | Classification | Description |
|---|---|---|---|
| Geographic Information | | | |
| provider_geo_level | character | Binary | Geographic levels (e.g., “National”, “State”) |
| provider_geo_code | character | Discrete | Geographic identifiers (e.g., FIPS codes in number) |
| provider_geo_description | character | Nominal | Descriptive names of geographic regions |
| Service Information | | | |
| hcpcs_code | character | Nominal | Healthcare Common Procedure Coding System |
| hcpcs_description | character | Nominal | Descriptions of HCPCS codes |
| hcpcs_drug_indicator | character | Binary | Indicates if code represents a drug (‘Y’ or ‘N’) |
| place_of_service | character | Binary | Facility (‘F’) or non-facility (‘O’) service |
| Utilization Metrics | | | |
| total_providers | numeric | Discrete | Count of rendering providers |
| total_services | numeric | Discrete | Count of services provided |
| total_beneficiaries | numeric | Discrete | Count of unique beneficiaries |
| total_beneficiary_day_services | numeric | Discrete | Count of beneficiary day services |
| Financial Metrics | | | |
| average_submitted_charge | numeric | Continuous | Average charge submitted by providers |
| average_medicare_allowed_amount | numeric | Continuous | Average amount allowed by Medicare |
| average_medicare_payment_amount | numeric | Continuous | Average amount paid by Medicare |
| average_medicare_standardized_amount | numeric | Continuous | Average standardized Medicare payment |
The table above briefly describes the 15 variables in the dataset. The explanations below expand on key variables to enhance understanding of the dataset, focusing on HCPCS codes, place of service data, service and beneficiary metrics, and financial indicators.
HCPCS Code and Description
The Healthcare Common Procedure Coding System (HCPCS) codes stored in
the hcpcs_code variable are identifiers that describe the
service furnished by the provider. Level I codes (CPT codes) are
maintained by the American Medical Association, and Level II codes are
created by the Centers for Medicare & Medicaid Services for services
that are not covered by CPT codes. The accompanying descriptions
(hcpcs_description) are patient-friendly summaries that
describe the service provided.
Place of Service
This binary variable (place_of_service) takes one of two
values: facility (‘F’) or non-facility (‘O’). Non-facility
generally refers to an office setting but can include other entities as
described by CMS.
Number of Services and Beneficiaries
The number of services (total_services) is largely
self-explanatory; however, the method for arriving at this number
can vary from service to service. The number of beneficiaries
(total_beneficiaries) represents distinct individuals
receiving the service, and the total beneficiary days of services
(total_beneficiary_day_services) accounts for multiple
services provided to a beneficiary on the same day to avoid
double-counting.
Financial Metrics
The average charge submitted (average_submitted_charge)
represents what providers bill for the service. The average amount
allowed (average_medicare_allowed_amount) is the amount
Medicare allows including beneficiary responsibility and third-party
payments. The average amount paid by Medicare
(average_medicare_payment_amount) is what Medicare actually
pays after deductibles and coinsurance. The average standardized
Medicare payment (average_medicare_standardized_amount) is
a standardized payment amount that removes geographic differences which
allows for more accurate comparisons across regions.
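To make the relationship between these amounts concrete, here is a small worked example in R. The three dollar figures are invented for a single hypothetical service row (they are not taken from the dataset), and the 80/20 split is only an assumed illustration of Medicare paying part of the allowed amount.

```r
# Invented figures for one hypothetical service row (not from the dataset)
toy_row <- data.frame(
  average_submitted_charge = 500,
  average_medicare_allowed_amount = 100,
  average_medicare_payment_amount = 80
)

# Share of the allowed amount that Medicare pays
paid_share_of_allowed <- toy_row$average_medicare_payment_amount /
  toy_row$average_medicare_allowed_amount

# Share of the billed (submitted) charge that Medicare pays
paid_share_of_charge <- toy_row$average_medicare_payment_amount /
  toy_row$average_submitted_charge

paid_share_of_allowed  # 0.8
paid_share_of_charge   # 0.16
```

The two ratios answer different questions: the first compares the payment to what Medicare allows, the second to what the provider billed.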
Each observation or row in the dataset represents the aggregated service and payment information for a specific geographic region (National or State), the HCPCS code, and the place of service with Medicare payment amounts detailed for the procedure. While it may seem that the data represents a single beneficiary, each observation is based on the service provided and the costs associated with the procedure. However, this data can be used to determine implicit or possibly connected impacts on Medicare beneficiaries from costs.
Implications of Observations
By analyzing many of these observations from the Medicare Physician & Other Practitioners - by Geography and Service dataset, trends in healthcare delivery, cost variations, and service utilization patterns within the Medicare system can be identified.
This study is dedicated to my father, who frequently requires medical attention and relies on Medicare as his primary insurance. Through this analysis, I aim to discover insights that will not only benefit my family but also assist others in understanding the intricacies, advantages, and procedures associated with Medicare. I hope to empower individuals and families to navigate the complexities of the Medicare system more effectively.
Analyzing this dataset allows insights to be derived in the following areas:
The following functions and table summarize key information regarding
the dataset. Starting with built-in methods, the summary()
function provides the minimum, median, mean, maximum, and quartiles
for quantitative variables in the dataset, while the str()
function summarizes the dimensions of the dataset and the type
associated with each variable. From this, additional R methods are used
to determine the number of unique provider geographies, the number of
unique HCPCS codes, and the region with the highest service utilization
from the data.
The summary() function is used to derive information on
the minimum, median, maximum, and quartiles for quantitative
variables.
# Summary of dataset (minimum, maximum, median, mean, etc.)
summary(medicare)
## provider_geo_level provider_geo_code provider_geo_description
## Length:270673 Length:270673 Length:270673
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## hcpcs_code hcpcs_description hcpcs_drug_indicator place_of_service
## Length:270673 Length:270673 Length:270673 Length:270673
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## total_providers total_beneficiaries total_services
## Min. : 1.0 Min. : 11 Min. : 11
## 1st Qu.: 11.0 1st Qu.: 30 1st Qu.: 40
## Median : 29.0 Median : 106 Median : 162
## Mean : 266.8 Mean : 5343 Mean : 23595
## 3rd Qu.: 95.0 3rd Qu.: 586 3rd Qu.: 1102
## Max. :601911.0 Max. :21459588 Max. :103325664
## total_beneficiary_day_services average_submitted_charge
## Min. : 11 Min. : 0.0
## 1st Qu.: 38 1st Qu.: 127.0
## Median : 143 Median : 440.6
## Mean : 10396 Mean : 1309.7
## 3rd Qu.: 831 3rd Qu.: 1606.7
## Max. :90436622 Max. :99509.8
## average_medicare_allowed_amount average_medicare_payment_amount
## Min. : 0.00 Min. : 0.00
## 1st Qu.: 35.13 1st Qu.: 27.78
## Median : 110.84 Median : 85.52
## Mean : 291.80 Mean : 232.25
## 3rd Qu.: 315.65 3rd Qu.: 250.72
## Max. :58494.73 Max. :46612.60
## average_medicare_standardized_amount
## Min. : 0.00
## 1st Qu.: 27.60
## Median : 85.03
## Mean : 230.06
## 3rd Qu.: 249.34
## Max. :46577.95
The str() function is used to derive the dimensions of
the dataset and the types associated with each variable.
# Summary of dimensions and variables (data types and example data)
str(medicare)
## 'data.frame': 270673 obs. of 15 variables:
## $ provider_geo_level : chr "State" "State" "State" "State" ...
## $ provider_geo_code : chr "9E" "9E" "9E" "9E" ...
## $ provider_geo_description : chr "Foreign Country" "Foreign Country" "Foreign Country" "Foreign Country" ...
## $ hcpcs_code : chr "J1885" "G0439" "G0416" "G0283" ...
## $ hcpcs_description : chr "Injection, ketorolac tromethamine, per 15 mg" "Annual wellness visit, includes a personalized prevention plan of service (pps), subsequent visit" "Surgical pathology, gross and microscopic examinations, for prostate needle biopsy, any method" "Electrical stimulation (unattended), to one or more areas for indication(s) other than wound care, as part of a"| __truncated__ ...
## $ hcpcs_drug_indicator : chr "Y" "N" "N" "N" ...
## $ place_of_service : chr "O" "O" "F" "O" ...
## $ total_providers : int 6 3 3 2 3 3 3 3 2 1 ...
## $ total_beneficiaries : int 21 37 89 19 13 14 38 51 50 19 ...
## $ total_services : num 29 37 89 93 13 14 38 54 347 41 ...
## $ total_beneficiary_day_services : int 23 37 89 93 13 14 38 54 347 36 ...
## $ average_submitted_charge : num 39.8 226.4 821.9 41.7 947.5 ...
## $ average_medicare_allowed_amount : num 0.564 119.932 178.407 9.294 178.261 ...
## $ average_medicare_payment_amount : num 0.375 119.932 142.263 7.025 178.261 ...
## $ average_medicare_standardized_amount: num 0.373 128.837 139.189 7.065 183.48 ...
Additional R Data Summary Techniques
The num_unique_geos variable is calculated to be the
number of unique provider geographies from the
provider_geo_description variable.
# Number of Unique Provider Geographies
num_unique_geos <- length(unique(medicare$provider_geo_description))
print(paste("Number of Unique Provider Geographies:", num_unique_geos))
## [1] "Number of Unique Provider Geographies: 63"
The num_unique_hcpcs variable is calculated to be the
number of unique HCPCS codes from the hcpcs_code
variable.
# Number of Unique HCPCS Codes
num_unique_hcpcs <- length(unique(medicare$hcpcs_code))
print(paste("Number of Unique HCPCS Codes:", num_unique_hcpcs))
## [1] "Number of Unique HCPCS Codes: 9231"
The service_by_state variable is calculated to be the
aggregate of total services from the
provider_geo_description variable.
# Identify State with Highest Service Utilization
service_by_state <- aggregate(medicare$total_services, FUN=sum,
by=list(Category=medicare$provider_geo_description))
highest_state <- service_by_state[which.max(service_by_state$x), ]
print(paste("State with Highest Service Utilization:", highest_state$Category))
## [1] "State with Highest Service Utilization: National"
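Because the file contains national-level rows alongside state-level rows, the aggregate above returns "National". A minimal sketch of one way to restrict the calculation to state-level rows is shown below. The toy data frame is invented, and filtering on `provider_geo_level == "State"` is an assumption about how one might do this rather than the method used in this study.

```r
# Toy rows mimicking the renamed columns (values are invented)
toy <- data.frame(
  provider_geo_level = c("National", "State", "State", "State"),
  provider_geo_description = c("National", "Texas", "Ohio", "Texas"),
  total_services = c(1000, 300, 250, 200)
)

# Keep only state-level rows before aggregating
state_rows <- toy[toy$provider_geo_level == "State", ]
service_by_state <- aggregate(total_services ~ provider_geo_description,
                              data = state_rows, FUN = sum)

# Row with the largest summed service count
highest_state <- service_by_state[which.max(service_by_state$total_services), ]
highest_state$provider_geo_description
```

In the real file, state-level rows also include entries such as "Foreign Country" (visible in the str() output), so an additional filter may be needed if only the 50 states are of interest.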
The following table includes summarized information regarding the dataset. From this, ideas regarding Medicare services and costs can be formulated such as how aid differs from the submitted payment amount from providers. Further analysis will be conducted on such potentially observed relationships.
| Metric | Value | Description |
|---|---|---|
| Total Number of Variables | 15 | Number of columns in the dataset |
| Total Number of Observations | 270,673 | Total number of rows representing service instances |
| Avg. Medicare Submitted per Service | $1309.7 | Average charge initially billed by providers |
| Avg. Medicare Allowed Aid per Service | $291.80 | Average amount Medicare approves for payment |
| Avg. Medicare Standardized Aid | $230.06 | Average payment adjusted for geographic variations |
| Avg. Medicare Actual Aid | $232.25 | Average payment provided by Medicare |
| Number of Unique Provider Geographies | 63 | Count of distinct geographical areas included |
| Number of Unique HCPCS Codes | 9231 | Count of different medical service codes in the dataset |
| State with Highest Service Utilization | National | Services aggregated at a National Level |
There are three plots covered in this section: two box plots and one scatter plot. From these charts, significant insights regarding Medicare aid and payments are revealed. Found below are three charts with the following titles:
* Average Submitted Charge vs. Average Medicare Aid Amount
* Distribution of Average Medicare Submitted Amount for Top 10 Areas
* Distribution of Average Medicare Aid Amount for Top 10 Areas
The “Average Submitted Charge vs. Average Medicare Aid Amount” plot
demonstrates the relationship between two quantitative variables, namely
average_medicare_payment_amount and
average_submitted_charge. A few modifications were made
when plotting these variables. The first was limiting both axes to
$50,000, which hides outliers above that value for both variables. From
calculations, there are only 45 cases where the variable values are
above $50,000, which makes these insignificant for evaluating overall
trends. In addition to this, three lines are drawn on the plot. The
solid black line with the equation “y = 0.20x - 29.3” is the
line of best fit for the data. Two extra dashed lines are present:
the green line is “y = x” and the orange line is “y = 0.8x”.
The slope of each line represents the share of the submitted charge that Medicare pays for a service. For example, y = x corresponds to 100% coverage of the submitted charge. As can be seen from the chart below, very few services are fully covered by Medicare. As a more realistic reference, the y = 0.8x line represents 80% coverage by Medicare, and few points lie at or above that line either. In fact, the slope of the line of best fit (about 0.20) suggests that Medicare payments amounted to roughly 20% of submitted charges on average across the services provided in 2022.
# Loading Scales Library
suppressPackageStartupMessages(library(scales))
# Fitting Linear Model
lm_fit <- lm(average_medicare_payment_amount ~ average_submitted_charge, data = medicare)
lm_coef <- coef(lm_fit)
# Average Submitted Charge vs. Average Medicare Aid Amount
ggplot(medicare, aes(average_submitted_charge, average_medicare_payment_amount)) +
geom_point(alpha = 0.75, color = "navyblue", fill = "turquoise1", shape = 21, size = 2, stroke = 0.5) +
geom_smooth(method = "lm", color = "black", se = FALSE) +
geom_abline(intercept = 0, slope = 1, linetype = "longdash", color = "seagreen") +
geom_abline(intercept = 0, slope = 0.80, linetype = "longdash", color = "darkorange") +
# Labels, Annotations, and Coordinate Values
labs(title = "Average Submitted Charge vs. Average Medicare Aid Amount",
subtitle = "Comparison of submitted charges to Medicare aid",
x = "Average Submitted Charge ($)",
y = "Average Medicare Aid Amount ($)") +
annotate("text", x = 45000, y = 45000, label = "y = x", color = "seagreen", hjust = -0.5, size = 4) +
annotate("text", x = 45000, y = 36000, label = "y = 0.8x", color = "darkorange", hjust = -0.5, size = 4) +
annotate("text", x = 40000, y = lm_coef[1] + lm_coef[2] * 40000,
label = sprintf("\ny = %.2fx + %.2f", lm_coef[2], lm_coef[1]),
color = "black", hjust = -0.1, size = 4) +
theme_minimal(base_size = 12) +
theme(
plot.title = element_text(face = "bold"),
plot.subtitle = element_text(color = "gray50"),
panel.grid.minor = element_blank(),
legend.position = "none"
) +
scale_x_continuous(labels = dollar_format(scale = 1e-3, suffix = "K")) +
scale_y_continuous(labels = dollar_format(scale = 1e-3, suffix = "K")) +
coord_cartesian(xlim = c(0, 50000), ylim = c(0, 50000))
## `geom_smooth()` using formula = 'y ~ x'
Most cases have roughly 20% coverage whereas few services are fully covered by Medicare.
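One detail worth noting is that `coord_cartesian()` only zooms the view, so a model fitted with `lm()` on the full data still includes rows above $50,000. The sketch below uses invented data to show how a fit on all rows can differ from a fit on rows filtered to the threshold. It is an illustration under made-up numbers, not a re-analysis of the Medicare data.

```r
# Invented data: payments are 20% of charges, plus one extreme row
toy <- data.frame(
  average_submitted_charge = c(100, 200, 300, 400, 90000),
  average_medicare_payment_amount = c(20, 40, 60, 80, 45000)
)

# Fit on all rows, including the extreme one
fit_all <- lm(average_medicare_payment_amount ~ average_submitted_charge,
              data = toy)

# Fit after dropping rows above the $50,000 threshold on either variable
toy_trimmed <- subset(toy, average_submitted_charge <= 50000 &
                        average_medicare_payment_amount <= 50000)
fit_trimmed <- lm(average_medicare_payment_amount ~ average_submitted_charge,
                  data = toy_trimmed)

slope_all <- unname(coef(fit_all)[2])
slope_trimmed <- unname(coef(fit_trimmed)[2])

slope_all      # pulled toward the extreme row
slope_trimmed  # about 0.2 for the remaining rows
```

With only 45 rows above the threshold in the real data, the difference may well be small, but filtering before fitting makes the stated exclusion explicit.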
The “Distribution of Average Medicare Submitted Amount for Top 10
Areas” box plot visualizes the distribution of average submitted charges
across the top 10 geographic areas as determined by the total number of
services provided (total_services). A few adjustments were
made to the dataset in order to create this plot. The first was sorting
the data by the sum of total services and storing the top 10 areas. From
here, the data was reordered for plotting. In addition to this, the
maximum for the average submitted charges was set to be $50,000, which
includes outliers.
From this box plot, it can be seen that the median (the 50th percentile) of the average submitted amount from providers is between $550 and $1,000 for the top 10 regions in the dataset. The top 5 regions with the highest average submitted charge include the National level, New York, California, Texas, and Florida. This data is particularly useful for comparison with the average aid paid out by Medicare, which is investigated in the next box plot.
# Find Top 10 Geographic Areas by Total Services
top_10_geo <- medicare %>%
group_by(provider_geo_description) %>%
summarize(total_services_sum = sum(total_services, na.rm = TRUE)) %>%
arrange(desc(total_services_sum)) %>%
head(10)
# Include Top 10 Geographic Areas and Keep Charges from $1 up to $50000
# (one filter, so the upper limit is not overwritten by a second assignment)
medicare_top_10 <- medicare %>%
filter(provider_geo_description %in% top_10_geo$provider_geo_description,
average_submitted_charge <= 50000,
average_submitted_charge >= 1)
# Calculate median payment for each area and arrange
area_medians <- medicare_top_10 %>%
group_by(provider_geo_description) %>%
summarize(median_payment = median(average_submitted_charge, na.rm = TRUE)) %>%
arrange(desc(median_payment))
# Reorder the Geographic Areas based on arranged medians
medicare_top_10 <- medicare_top_10 %>%
mutate(provider_geo_description = factor(provider_geo_description,
levels = area_medians$provider_geo_description))
# Distribution of Average Medicare Submitted Amount for Top 10 Areas
ggplot(medicare_top_10, aes(provider_geo_description, average_submitted_charge)) +
geom_boxplot(aes(fill = provider_geo_description), outlier.shape = 1, outlier.size = 0.5, outlier.alpha = 0.3) +
scale_y_log10(labels = scales::dollar_format(accuracy = 1)) +
coord_cartesian(ylim = c(1, 100000)) +
labs(
title = "Distribution of Average Medicare Submitted Amount for Top 10 Areas",
subtitle = "Top areas based on total services provided, payments <= $50,000 (log scale)",
x = "Geographic Area",
y = "Average Submitted Amount ($) - Log Scale"
) +
theme_minimal() +
theme(
axis.text.x = element_text(angle = 45, hjust = 1),
plot.title = element_text(face = "bold"),
plot.subtitle = element_text(color = "gray50"),
legend.position = "none"
)
The National level, New York, California, Texas, and Florida have the highest average submitted amounts.
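The reordering step used for this box plot (compute group medians, then rebuild the factor levels) can also be expressed with base R's `reorder()`. The sketch below is an alternative under invented data, with made-up area labels "A", "B", and "C". It is not the code that produced the plot.

```r
# Invented charges for three hypothetical areas
toy <- data.frame(
  provider_geo_description = c("A", "A", "B", "B", "C", "C"),
  average_submitted_charge = c(10, 30, 100, 300, 50, 70)
)

# reorder() sorts levels by increasing FUN(X); negating the charge
# therefore yields descending order of the median charge
toy$provider_geo_description <- reorder(
  toy$provider_geo_description,
  -toy$average_submitted_charge,
  FUN = median
)

area_levels <- levels(toy$provider_geo_description)
area_levels
```

Here the medians are 200 for "B", 60 for "C", and 20 for "A", so the levels come out as "B", "C", "A".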
The “Distribution of Average Medicare Aid Amount for Top 10 Areas”
box plot visualizes the distribution of average Medicare aid amounts
across the top 10 geographic areas as determined by the total number of
services provided (total_services). A few adjustments were
made to the dataset in order to create this plot. The first was sorting
the data by the sum of total services and storing the top 10 areas. From
here, the data was reordered for plotting. In addition to this, the
maximum for the average aid amounts was set to be $50,000, which
includes outliers.
From this box plot, it can be seen that the median (the 50th percentile) of the average aid amount provided by Medicare is between $90 and $300 for the top 10 regions in the dataset. The top 5 regions with the highest average aid amount include the National level, California, Florida, New York, and Texas. This is consistent with the scatter plot: for most services, Medicare aid covers only a small share of the submitted charge, roughly between 1% and 20%.
# Find Top 10 Geographic Areas by Total Services
top_10_geo_aid <- medicare %>%
group_by(provider_geo_description) %>%
summarize(total_services_sum = sum(total_services, na.rm = TRUE)) %>%
arrange(desc(total_services_sum)) %>%
head(10)
# Include Top 10 Geographic Areas and Keep Payments Above $0 up to $50000
# (one filter, so the upper limit is not overwritten; zeros cannot be shown on a log scale)
medicare_top_10_aid <- medicare %>%
filter(provider_geo_description %in% top_10_geo_aid$provider_geo_description,
average_medicare_payment_amount <= 50000,
average_medicare_payment_amount > 0)
# Calculate median payment for each area and arrange
area_medians <- medicare_top_10_aid %>%
group_by(provider_geo_description) %>%
summarize(median_payment = median(average_medicare_payment_amount, na.rm = TRUE)) %>%
arrange(desc(median_payment))
# Reorder the Geographic Areas based on arranged medians
medicare_top_10_aid <- medicare_top_10_aid %>%
mutate(provider_geo_description = factor(provider_geo_description,
levels = area_medians$provider_geo_description))
# Distribution of Average Medicare Aid Amount for Top 10 Areas
ggplot(medicare_top_10_aid, aes(provider_geo_description, average_medicare_payment_amount)) +
geom_boxplot(aes(fill = provider_geo_description), outlier.shape = 1, outlier.size = 0.5, outlier.alpha = 0.3) +
scale_y_log10(labels = scales::dollar_format(accuracy = 1)) +
coord_cartesian(ylim = c(1, 100000)) +
labs(
title = "Distribution of Average Medicare Aid Amount for Top 10 Areas",
subtitle = "Top areas based on total services provided, payments <= $50,000 (log scale)",
x = "Geographic Area",
y = "Average Medicare Aid Amount ($) - Log Scale"
) +
theme_minimal() +
theme(
axis.text.x = element_text(angle = 45, hjust = 1),
plot.title = element_text(face = "bold"),
plot.subtitle = element_text(color = "gray50"),
legend.position = "none"
)
The National level, California, Florida, New York, and Texas have the highest average Medicare aid amounts.
This section contains four parts covering the data manipulations
performed on the original medicare dataset, the results of
which are plotted as a single figure of four separate maps. The primary
data manipulations are filtering the unique geographic locations down to
the 50 states in the United States and grouping the data by state. This
section includes the following components:
* Part 0: Loading Additional Libraries for Mapping
* Part 1: Generating Subset Mappings for Each Map
* Part 2: Creating Maps for Average Aid, Payment Amounts, and Beneficiary Data
* Part 3: Multi-Dimensional Analysis of Medicare Services Across States
This code block loads the necessary libraries for creating US maps
and visualizations. The usmap package provides functions
for plotting US maps, viridis offers color palettes, and
patchwork allows for combining multiple plots. The last
library, knitr, is loaded for report output options such as
suppressing warning messages.
# Libraries for Mappings
suppressPackageStartupMessages(library(usmap))
suppressPackageStartupMessages(library(viridis))
suppressPackageStartupMessages(library(patchwork))
suppressPackageStartupMessages(library(knitr))
This section prepares the data for mapping. It defines valid US states, filters the Medicare data to include only those states, and creates several summary datasets including the average Medicare payment amount by state, the average submitted charge amount by state, the total beneficiaries by state, and the average services per beneficiary by state.
# Creating Valid Regions Vector
geo_descriptions <- c("Wyoming", "Wisconsin", "West Virginia", "Washington", "Virginia",
"Vermont", "Utah", "Texas", "Tennessee", "South Dakota", "South Carolina", "Rhode Island",
"Oregon", "Oklahoma", "Ohio", "North Dakota", "North Carolina", "New York",
"New Mexico", "New Jersey", "New Hampshire", "Nevada", "Nebraska", "Montana",
"Missouri", "Mississippi", "Minnesota", "Michigan", "Massachusetts", "Maryland",
"Maine", "Louisiana", "Kentucky", "Kansas", "Iowa", "Indiana", "Illinois",
"Idaho", "Hawaii", "Georgia", "Florida", "Delaware", "Connecticut", "Colorado",
"California", "Arkansas", "Arizona", "Alaska", "Alabama", "Pennsylvania")
# Constructing Medicare States Template Subset
medicare_states <- medicare %>%
filter(provider_geo_description %in% geo_descriptions) %>%
mutate(state = provider_geo_description)
# Subset 1: Average Medicare Aid Amount
state_summary <- medicare_states %>%
group_by(state) %>%
summarize(avg_payment = mean(average_medicare_payment_amount, na.rm = TRUE))
# Subset 2: Average Medicare Submitted Amount
state_summary_payment <- medicare_states %>%
group_by(state) %>%
summarize(avg_submitted = mean(average_submitted_charge, na.rm = TRUE))
# Subset 3: Beneficiaries (mean of total_beneficiaries across service rows in each state)
state_summary_beneficiaries <- medicare_states %>%
group_by(state) %>%
summarize(benefits = mean(total_beneficiaries, na.rm = TRUE))
# Subset 4: Total Services by Total Beneficiaries
state_summary_services <- medicare_states %>%
group_by(state) %>%
summarize(avg_services = mean(total_services / total_beneficiaries, na.rm = TRUE))
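As a side note on the state vector, base R ships a built-in `state.name` vector holding the 50 state names, which could serve as a cross-check or a shorter alternative. The sketch below applies it to a few invented geography labels. It is an illustration, not the filtering used in this study.

```r
# Invented geography labels of the kind found in provider_geo_description
toy_geos <- c("National", "Texas", "District of Columbia", "Ohio", "Foreign Country")

# state.name is part of base R (datasets package) and holds the 50 state names
n_states <- length(state.name)

# Keep only labels that match one of the 50 states
kept <- toy_geos[toy_geos %in% state.name]

n_states
kept
```

Non-state labels such as "National" and "Foreign Country" are dropped because they do not appear in `state.name`.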
This code block creates four separate US maps with the plot_usmap function: a map showing average Medicare aid by state, a map displaying average submitted amounts by state, a map illustrating total beneficiaries by state, and a map presenting average services per beneficiary by state.
# Map 1: Average Medicare Aid by State
map_payment <- plot_usmap(data = state_summary, values = "avg_payment",
color = "white") +
scale_fill_viridis_c(name = "Avg Medicare\nPayment ($)",
label = scales::dollar_format(),
option = "plasma") +
labs(title = "Average Medicare Aid by State") +
theme(legend.position = "right")
# Map 2: Average Submitted Amount by State
map_submitted <- plot_usmap(data = state_summary_payment, values = "avg_submitted",
color = "white") +
scale_fill_viridis_c(name = "Avg Submitted\nAmount ($)",
label = scales::dollar_format(),
option = "viridis") +
labs(title = "Average Submitted Amount by State") +
theme(legend.position = "right")
# Map 3: Total Beneficiaries by State
map_beneficiaries <- plot_usmap(data = state_summary_beneficiaries, values = "benefits",
color = "white") +
scale_fill_viridis_c(name = "Total\nBeneficiaries", option = "mako") +
labs(title = "Total Beneficiaries by State") +
theme(legend.position = "right")
# Map 4: Total Services per Beneficiary by State
map_services <- plot_usmap(data = state_summary_services, values = "avg_services",
color = "white") +
scale_fill_viridis_c(name = "Avg Services\nper Beneficiary", option = "turbo") +
labs(title = "Average Services per Beneficiary by State") +
theme(legend.position = "right")
This final section combines the four individual maps into a single multi-panel visualization using the patchwork package. It arranges the maps in a 2x2 grid, adds an overall title, and adjusts the layout for optimal viewing.
Analyzing the relationship between average Medicare aid and average submitted amounts provides a nuanced understanding of cost coverage across the country. For instance, while states like California and New York may have higher Medicare aid than some smaller states, they also show a significant disparity between that aid and their average submitted amounts. This indicates that many services are not fully covered within these states. Moreover, states with substantial Medicare populations, such as Florida and Texas, do not show the high utilization of Medicare services that would be expected given their large senior demographics. This may reflect a reliance on private insurance or alternative healthcare systems, potentially influenced by the political leanings or demographic composition of these states. In contrast, Northeastern states like Connecticut demonstrate a propensity for using Medicare services, which contributes to a higher average services per beneficiary metric. The trends depicted here could reflect better healthcare infrastructure, a higher acceptance of Medicare benefits, or simply a larger proportion of elderly individuals dependent on Medicare.
# Multi-Dimensional Analysis of Medicare Services Across States
combined_map <- (map_payment + map_submitted) / (map_beneficiaries + map_services) +
plot_layout(heights = c(4, 4, 0.5)) +
plot_annotation(
title = "Multi-Dimensional Analysis of Medicare Services Across States",
theme = theme(plot.title = element_text(size = 16, face = "bold"),
plot.subtitle = element_text(size = 12))
)
# Displaying Combined Map
print(combined_map)
California has the highest aid, but also one of the highest average submitted payment amounts.
States with lower average submitted amounts and fewer average services per beneficiary may face challenges related to healthcare access or awareness among Medicare recipients. The higher utilization of services in states such as West Virginia, when contrasted with the comparatively lower average Medicare aid, suggests a potential need for healthcare policies and resource allocation that support beneficiaries’ awareness. Mapped together, the four factors (aid, submitted amounts, beneficiaries, and utilization) provide a view of healthcare efficiency, equity, and accessibility across the United States.
This section covers the components of the statistical analysis: correlation, regression, and t-tests. For the t-tests, the alpha value used to determine statistical significance is 0.05. In addition, the data is manipulated below to remove outliers and abbreviate variable names for ease of demonstration.
The dataset is first refined by renaming/adjusting key variables related to charges, payments, providers, and beneficiaries. To improve the accuracy of correlation and regression analyses, extreme outliers are removed based on predefined thresholds. The outlier-filtered dataset ensures that extreme values do not distort statistical relationships.
# Creating Abbreviated Dataset
medicare_abbrev <- medicare %>%
  select(
    avg_sub_chg = average_submitted_charge,
    avg_med_pay = average_medicare_payment_amount,
    avg_med_all = average_medicare_allowed_amount,
    avg_med_std = average_medicare_standardized_amount,
    tot_prov = total_providers,
    tot_benef = total_beneficiaries,
    tot_ben_day = total_beneficiary_day_services,
    tot_serv = total_services
  )
# Removing Outliers: Filtered Dataset for Correlation Testing w/ Abbreviations
medicare_adjust_abbrev <- medicare_abbrev %>%
  filter(avg_sub_chg <= 50000, avg_med_all <= 50000, avg_med_pay <= 50000,
         avg_med_std <= 5000, tot_prov <= 500000, tot_benef <= 14000000,
         tot_serv <= 10000000, tot_ben_day <= 10000000)
# Removing Outliers: Filtered Dataset
medicare_adjust <- medicare %>%
  filter(average_submitted_charge <= 50000, average_medicare_allowed_amount <= 50000,
         average_medicare_payment_amount <= 50000, total_services <= 10000000,
         average_medicare_standardized_amount <= 5000, total_providers <= 500000,
         total_beneficiaries <= 14000000, total_beneficiary_day_services <= 10000000)
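A threshold filter like the one above is easier to audit when the number of rows each rule removes is reported. The following is a minimal sketch on simulated data, not on the Medicare file itself: the data frame `toy` and its values are hypothetical, and only two of the eight columns are mimicked.

```r
# Hypothetical toy data: column names mirror the dataset, values are simulated.
set.seed(42)
toy <- data.frame(
  average_submitted_charge = c(rlnorm(998, meanlog = 6, sdlog = 1), 75000, 120000),
  total_services           = c(rpois(998, lambda = 500), 2.5e7, 400)
)

# Same style of threshold filter as above, written in base R.
keep <- toy$average_submitted_charge <= 50000 & toy$total_services <= 10000000
toy_adjust <- toy[keep, ]

# Count how many rows each rule flags, and how many rows are dropped overall.
removed <- c(
  charge_rule  = sum(toy$average_submitted_charge > 50000),
  service_rule = sum(toy$total_services > 10000000),
  any_rule     = sum(!keep)
)
print(removed)
cat("Share of rows retained:", round(mean(keep), 4), "\n")
```

Applied to the real data, the same bookkeeping would show how selective the thresholds are. The degrees of freedom reported by the correlation tests in this document (270671 before filtering and 270165 after) imply that the filter drops 506 rows.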
This section examines the relationships between key financial and service-related Medicare metrics. Pearson and Spearman correlation tests are used to identify the strength and direction of relationships between submitted charges, allowed payments, and provider counts.
Pearson’s correlation measures linear relationships, while Spearman’s correlation accounts for non-linear but monotonic relationships. Here, it is assessed whether higher submitted charges correlate with Medicare reimbursements and whether provider counts relate to beneficiary counts.
The Pearson and Spearman analyses on the original dataset may show different results. If the relationships are purely linear, Pearson and Spearman should yield similar results. However, if there are non-linear but monotonic relationships, Spearman might show stronger correlations. Any significant differences between the two methods could indicate the presence of non-linear relationships or outliers affecting the Pearson correlation.
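The difference between the two coefficients can be illustrated on simulated data. The sketch below is an illustration only; the variables `x`, `y`, `x2`, and `y2` are hypothetical and unrelated to the Medicare columns. The first pair is perfectly monotonic but non-linear, and the second is linear apart from one extreme point.

```r
set.seed(1)

# Case 1: monotonic but non-linear relationship (exponential growth, no noise).
x <- runif(200, min = 1, max = 10)
y <- exp(x)
pearson_curve  <- cor(x, y, method = "pearson")
spearman_curve <- cor(x, y, method = "spearman")

# Case 2: a linear relationship with a single extreme outlier appended.
x2 <- c(1:50, 51)
y2 <- c(2 * (1:50) + rnorm(50), -5000)
pearson_outlier  <- cor(x2, y2, method = "pearson")
spearman_outlier <- cor(x2, y2, method = "spearman")

print(round(c(pearson_curve = pearson_curve, spearman_curve = spearman_curve,
              pearson_outlier = pearson_outlier, spearman_outlier = spearman_outlier), 3))
```

In the first case the ranks agree exactly, so Spearman's rho is 1 while Pearson's r is lower. In the second case the single outlier pulls Pearson's r far more than it pulls the rank-based coefficient.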
# Pearson Correlation Testing: Average Submitted Charge vs. Average Aid Amount
cor.test(medicare$average_submitted_charge, medicare$average_medicare_payment_amount)
##
## Pearson's product-moment correlation
##
## data: medicare$average_submitted_charge and medicare$average_medicare_payment_amount
## t = 654.06, df = 270671, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.7811460 0.7840658
## sample estimates:
## cor
## 0.7826102
# Pearson Correlation Testing: Average Submitted Charge vs. Average Allowed Amount
cor.test(medicare$average_submitted_charge, medicare$average_medicare_allowed_amount)
##
## Pearson's product-moment correlation
##
## data: medicare$average_submitted_charge and medicare$average_medicare_allowed_amount
## t = 656.65, df = 270671, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.7823469 0.7852526
## sample estimates:
## cor
## 0.7838041
# Pearson Correlation Testing: Total Providers vs. Total Beneficiaries
cor.test(medicare$total_providers, medicare$total_beneficiaries)
##
## Pearson's product-moment correlation
##
## data: medicare$total_providers and medicare$total_beneficiaries
## t = 517.47, df = 270671, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.7033023 0.7070899
## sample estimates:
## cor
## 0.7052011
The first three tests show a fairly strong positive Pearson correlation for each pair: about 0.78 between the average submitted charge and both the average Medicare payment and the average Medicare allowed amount, and about 0.71 between total providers and total beneficiaries. The Spearman correlation is examined next to determine whether the rank-based measure differs.
# Spearman Correlation Testing: Average Submitted Charge vs. Average Aid Amount
suppressWarnings({
cor.test(medicare$average_submitted_charge, medicare$average_medicare_payment_amount, method = "spearman")
})
##
## Spearman's rank correlation rho
##
## data: medicare$average_submitted_charge and medicare$average_medicare_payment_amount
## S = 2.2847e+14, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.930874
# Spearman Correlation Testing: Average Submitted Charge vs. Average Allowed Amount
suppressWarnings({
cor.test(medicare$average_submitted_charge, medicare$average_medicare_allowed_amount, method = "spearman")
})
##
## Spearman's rank correlation rho
##
## data: medicare$average_submitted_charge and medicare$average_medicare_allowed_amount
## S = 2.2459e+14, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.9320485
# Spearman Correlation Testing: Total Providers vs. Total Beneficiaries
suppressWarnings({
cor.test(medicare$total_providers, medicare$total_beneficiaries, method = "spearman")
})
##
## Spearman's rank correlation rho
##
## data: medicare$total_providers and medicare$total_beneficiaries
## S = 8.4523e+14, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.7442654
For the same pairs, the Spearman correlation rises to a near-perfect 0.93 for the two charge comparisons and more modestly to 0.74 for providers versus beneficiaries. The gap between the two methods suggests that outliers or a non-linear relationship may be influencing the Pearson results. In the next sections, the Pearson and Spearman correlations are recomputed on the filtered data to see how much of this gap is due to outliers.
# Pearson Correlation: Average Submitted Charge vs. Average Aid Amount (Removed Outliers)
cor.test(medicare_adjust$average_submitted_charge, medicare_adjust$average_medicare_payment_amount)
##
## Pearson's product-moment correlation
##
## data: medicare_adjust$average_submitted_charge and medicare_adjust$average_medicare_payment_amount
## t = 726.08, df = 270165, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.8118438 0.8143992
## sample estimates:
## cor
## 0.8131254
# Pearson Correlation: Average Submitted Charge vs. Average Allowed Amount (Removed Outliers)
cor.test(medicare_adjust$average_submitted_charge, medicare_adjust$average_medicare_allowed_amount)
##
## Pearson's product-moment correlation
##
## data: medicare_adjust$average_submitted_charge and medicare_adjust$average_medicare_allowed_amount
## t = 736.96, df = 270165, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.8159379 0.8184432
## sample estimates:
## cor
## 0.8171944
# Pearson Correlation: Total Providers vs. Total Beneficiaries (Removed Outliers)
cor.test(medicare_adjust$total_providers, medicare_adjust$total_beneficiaries)
##
## Pearson's product-moment correlation
##
## data: medicare_adjust$total_providers and medicare_adjust$total_beneficiaries
## t = 448.95, df = 270165, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.6515021 0.6558213
## sample estimates:
## cor
## 0.653667
After removing outliers, the Pearson correlation coefficient between the average submitted charge and the average Medicare aid amount increases from 0.78 to 0.81, a somewhat stronger positive relationship. A similar increase, from 0.78 to 0.82, occurs between the average submitted charge and the Medicare allowed amount. However, the correlation between total providers and total beneficiaries decreases from 0.71 to 0.65, although it remains positive.
# Spearman Correlation: Average Submitted Charge vs. Average Aid Amount (Removed Outliers)
suppressWarnings({
cor.test(medicare_adjust$average_submitted_charge, medicare_adjust$average_medicare_payment_amount, method = "spearman")
})
##
## Spearman's rank correlation rho
##
## data: medicare_adjust$average_submitted_charge and medicare_adjust$average_medicare_payment_amount
## S = 2.2841e+14, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.930503
# Spearman Correlation: Average Submitted Charge vs. Average Allowed Amount (Removed Outliers)
suppressWarnings({
cor.test(medicare_adjust$average_submitted_charge, medicare_adjust$average_medicare_allowed_amount, method = "spearman")
})
##
## Spearman's rank correlation rho
##
## data: medicare_adjust$average_submitted_charge and medicare_adjust$average_medicare_allowed_amount
## S = 2.2453e+14, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.9316841
# Spearman Correlation: Total Providers vs. Total Beneficiaries (Removed Outliers)
suppressWarnings({
cor.test(medicare_adjust$total_providers, medicare_adjust$total_beneficiaries, method = "spearman")
})
##
## Spearman's rank correlation rho
##
## data: medicare_adjust$total_providers and medicare_adjust$total_beneficiaries
## S = 8.4048e+14, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.7442714
After removing outliers, the Spearman correlations for the first two relationships remain near-perfect and essentially unchanged (0.9305 and 0.9317, compared with 0.9309 and 0.9320 before filtering). Because Spearman’s method works on ranks, it was already largely insensitive to the extreme values, whereas the increase in the Pearson coefficients indicates that outliers had been weakening the linear correlation. Likewise, the Spearman correlation for total providers and total beneficiaries does not decrease (0.7443 in both cases), which shows that the rank-based coefficient had already accounted for the outliers present in the original dataset.
The table summarizes the results of correlation tests conducted using both Pearson and Spearman methods on various Medicare-related variables. The analysis includes comparisons between the average submitted charge and both the average Medicare payment and allowed amounts, as well as between the total number of providers and beneficiaries. The results are presented for both the original dataset and an adjusted version with outliers removed. Across all tests, strong positive correlations were observed, with Spearman correlations generally indicating a stronger relationship than Pearson correlations. This suggests that while there are linear components to these relationships, there may also be non-linear aspects. Removing outliers tended to increase the strength of the Pearson correlations, except in the case of providers versus beneficiaries, where the correlation decreased.
| Variables | Test Type | Original Data | Adjusted Data (Outliers Removed) |
|---|---|---|---|
| Submitted Charge vs Medicare Payment | Pearson | 0.7826 | 0.8131 |
| Submitted Charge vs Medicare Payment | Spearman | 0.9309 | 0.9305 |
| Submitted Charge vs Medicare Allowed | Pearson | 0.7838 | 0.8172 |
| Submitted Charge vs Medicare Allowed | Spearman | 0.9320 | 0.9317 |
| Total Providers vs Total Beneficiaries | Pearson | 0.7052 | 0.6537 |
| Total Providers vs Total Beneficiaries | Spearman | 0.7443 | 0.7443 |
All correlations are statistically significant with p-values < 2.2e-16.
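A table like this can also be assembled directly from the test objects rather than copied by hand, which avoids transcription slips. The following is a minimal sketch under stated assumptions: `cor_summary()` is a hypothetical helper that is not part of this study's code, and `toy` is simulated data standing in for the Medicare columns.

```r
# Hypothetical helper: run both tests on one pair of columns, one row per method.
cor_summary <- function(data, x, y, label) {
  methods <- c("pearson", "spearman")
  estimates <- vapply(methods, function(m) {
    # exact = FALSE requests the asymptotic p-value for Spearman (ignored by Pearson).
    test <- suppressWarnings(cor.test(data[[x]], data[[y]], method = m, exact = FALSE))
    unname(test$estimate)
  }, numeric(1))
  data.frame(variables = label, test_type = methods,
             estimate = round(estimates, 4), row.names = NULL)
}

# Simulated stand-in data (values are not from the Medicare file).
set.seed(7)
toy <- data.frame(charge = rlnorm(300, meanlog = 6, sdlog = 1))
toy$payment <- 0.25 * toy$charge * exp(rnorm(300, mean = 0, sd = 0.3))

result <- cor_summary(toy, "charge", "payment", "Charge vs Payment")
print(result)
```

Calling the helper once per variable pair and once per dataset, then combining the pieces with `rbind()`, would yield the rows of the table above.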
A correlation matrix is generated to visualize relationships across all numerical variables, both with and without outliers. This helps identify which factors are strongly associated with each other.
# Pearson Correlation Matrix
pearson_cor <- cor(medicare_abbrev) %>% round(3)
print(pearson_cor)
## avg_sub_chg avg_med_pay avg_med_all avg_med_std tot_prov tot_benef
## avg_sub_chg 1.000 0.783 0.784 0.775 -0.019 -0.020
## avg_med_pay 0.783 1.000 1.000 0.997 -0.014 -0.013
## avg_med_all 0.784 1.000 1.000 0.997 -0.014 -0.013
## avg_med_std 0.775 0.997 0.997 1.000 -0.014 -0.013
## tot_prov -0.019 -0.014 -0.014 -0.014 1.000 0.705
## tot_benef -0.020 -0.013 -0.013 -0.013 0.705 1.000
## tot_ben_day -0.014 -0.009 -0.009 -0.009 0.700 0.853
## tot_serv -0.018 -0.012 -0.012 -0.012 0.389 0.508
## tot_ben_day tot_serv
## avg_sub_chg -0.014 -0.018
## avg_med_pay -0.009 -0.012
## avg_med_all -0.009 -0.012
## avg_med_std -0.009 -0.012
## tot_prov 0.700 0.389
## tot_benef 0.853 0.508
## tot_ben_day 1.000 0.578
## tot_serv 0.578 1.000
This matrix shows strong positive correlations (>0.77) between average submitted charges, Medicare payments, allowed amounts, and standardized amounts. There is also a strong correlation (0.705) between total providers and total beneficiaries. However, the financial variables show very weak negative correlations with provider and beneficiary counts. Total beneficiary days are strongly correlated with total beneficiaries (0.853), while total services are only moderately correlated with total beneficiaries (0.508).
# Spearman Correlation Matrix
spearman_cor <- cor(medicare_abbrev, method = "spearman") %>% round(3)
print(spearman_cor)
## avg_sub_chg avg_med_pay avg_med_all avg_med_std tot_prov tot_benef
## avg_sub_chg 1.000 0.931 0.932 0.927 -0.007 -0.297
## avg_med_pay 0.931 1.000 0.999 0.999 -0.040 -0.294
## avg_med_all 0.932 0.999 1.000 0.997 -0.032 -0.296
## avg_med_std 0.927 0.999 0.997 1.000 -0.042 -0.296
## tot_prov -0.007 -0.040 -0.032 -0.042 1.000 0.744
## tot_benef -0.297 -0.294 -0.296 -0.296 0.744 1.000
## tot_ben_day -0.322 -0.312 -0.313 -0.313 0.729 0.974
## tot_serv -0.367 -0.349 -0.351 -0.350 0.686 0.923
## tot_ben_day tot_serv
## avg_sub_chg -0.322 -0.367
## avg_med_pay -0.312 -0.349
## avg_med_all -0.313 -0.351
## avg_med_std -0.313 -0.350
## tot_prov 0.729 0.686
## tot_benef 0.974 0.923
## tot_ben_day 1.000 0.970
## tot_serv 0.970 1.000
The Spearman correlations reveal even stronger relationships between the financial variables (>0.92). Interestingly, they show moderate negative correlations (-0.29 to -0.37) between the financial variables and the beneficiary and service counts, which were not apparent in the Pearson correlations. The relationships among provider counts, beneficiary counts, and service metrics are strong to very strong (0.69 to 0.97) and exceed their Pearson counterparts, suggesting monotonic but non-linear associations.
# Removing Outliers: Pearson Correlation Matrix
pearson_cor_outlier_rm <- cor(medicare_adjust_abbrev) %>% round(3)
print(pearson_cor_outlier_rm)
## avg_sub_chg avg_med_pay avg_med_all avg_med_std tot_prov tot_benef
## avg_sub_chg 1.000 0.813 0.817 0.800 -0.025 -0.032
## avg_med_pay 0.813 1.000 0.999 0.995 -0.027 -0.030
## avg_med_all 0.817 0.999 1.000 0.994 -0.027 -0.030
## avg_med_std 0.800 0.995 0.994 1.000 -0.027 -0.030
## tot_prov -0.025 -0.027 -0.027 -0.027 1.000 0.654
## tot_benef -0.032 -0.030 -0.030 -0.030 0.654 1.000
## tot_ben_day -0.033 -0.030 -0.031 -0.031 0.623 0.874
## tot_serv -0.045 -0.042 -0.043 -0.042 0.400 0.558
## tot_ben_day tot_serv
## avg_sub_chg -0.033 -0.045
## avg_med_pay -0.030 -0.042
## avg_med_all -0.031 -0.043
## avg_med_std -0.031 -0.042
## tot_prov 0.623 0.400
## tot_benef 0.874 0.558
## tot_ben_day 1.000 0.648
## tot_serv 0.648 1.000
This matrix shows strong positive correlations (>0.79) between average submitted charges, Medicare payments, allowed amounts, and standardized amounts. The correlation between total providers and total beneficiaries is moderately strong (0.654). Financial variables show very weak negative correlations with provider and beneficiary counts. Total beneficiary days are strongly correlated with total beneficiaries (0.874), while total services are moderately correlated with total beneficiaries (0.558).
# Removing Outliers: Spearman Correlation Matrix
spearman_cor_outlier_rm <- cor(medicare_adjust_abbrev, method = "spearman") %>% round(3)
print(spearman_cor_outlier_rm)
## avg_sub_chg avg_med_pay avg_med_all avg_med_std tot_prov tot_benef
## avg_sub_chg 1.000 0.931 0.932 0.927 -0.005 -0.296
## avg_med_pay 0.931 1.000 0.999 0.999 -0.038 -0.294
## avg_med_all 0.932 0.999 1.000 0.997 -0.030 -0.296
## avg_med_std 0.927 0.999 0.997 1.000 -0.040 -0.296
## tot_prov -0.005 -0.038 -0.030 -0.040 1.000 0.744
## tot_benef -0.296 -0.294 -0.296 -0.296 0.744 1.000
## tot_ben_day -0.322 -0.312 -0.313 -0.313 0.729 0.974
## tot_serv -0.367 -0.349 -0.351 -0.350 0.686 0.923
## tot_ben_day tot_serv
## avg_sub_chg -0.322 -0.367
## avg_med_pay -0.312 -0.349
## avg_med_all -0.313 -0.351
## avg_med_std -0.313 -0.350
## tot_prov 0.729 0.686
## tot_benef 0.974 0.923
## tot_ben_day 1.000 0.970
## tot_serv 0.970 1.000
With outliers removed, the Spearman correlations are virtually identical to those from the unfiltered data, as expected for a rank-based measure. The financial variables remain very strongly related (>0.92), and the moderate negative correlations (-0.29 to -0.37) between the financial variables and the beneficiary and service counts persist, which was not as apparent in the Pearson correlations. The relationships among provider counts, beneficiary counts, and service metrics are again strong to very strong (0.69 to 0.97), suggesting monotonic but non-linear associations.
Financial variables (submitted charges, payments, allowed amounts) show strong positive correlations with each other in both the Pearson and Spearman matrices, which demonstrates that they tend to increase together. However, the two methods differ in the relationship between the aid/payment variables and the beneficiary/service totals. While the Pearson correlations show weak negative relationships, the Spearman correlations reveal moderate negative associations, which suggests that these relationships are non-linear and better captured by Spearman’s rank-based approach. The strong correlations between beneficiary counts, beneficiary days, and services indicate that these variables are closely related because they measure similar quantities. The relationship between total providers and total beneficiaries is moderately strong in the filtered Pearson correlation (0.6537) but stronger in the Spearman correlation (0.7443), which shows that this relationship may not be fully linear.
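One way to locate the pairs where the two methods disagree most is to subtract the matrices and list the result in long form. The sketch below uses simulated data; the data frame `toy`, its three columns, and the generating formulas are assumptions made for illustration and are not the study's variables.

```r
set.seed(3)
n <- 500
benef <- rlnorm(n, meanlog = 5, sdlog = 1.5)   # heavily right-skewed counts

# Simulated stand-in columns (hypothetical, not the Medicare data).
toy <- data.frame(
  tot_benef = benef,
  tot_serv  = benef * rlnorm(n, meanlog = 1, sdlog = 0.4),
  avg_pay   = 5000 / sqrt(benef) * rlnorm(n, meanlog = 0, sdlog = 0.5)
)

pearson  <- cor(toy)
spearman <- cor(toy, method = "spearman")
gap      <- spearman - pearson

# Keep each pair once (upper triangle) and list the pairs in long form.
idx <- which(upper.tri(gap), arr.ind = TRUE)
gap_long <- data.frame(
  var1     = rownames(gap)[idx[, "row"]],
  var2     = colnames(gap)[idx[, "col"]],
  pearson  = round(pearson[idx], 3),
  spearman = round(spearman[idx], 3),
  gap      = round(gap[idx], 3)
)

# Largest absolute disagreement first.
print(gap_long[order(-abs(gap_long$gap)), ])
```

Replacing `toy` with `medicare_abbrev` would rank the real variable pairs by how far the rank-based and linear coefficients diverge.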
The correlation matrices reveal strong relationships between financial variables and moderate to strong associations between provider and beneficiary metrics, highlighting both linear and non-linear interactions within the Medicare dataset.
Scatterplot matrices are used to visually assess trends and potential relationships among Medicare charge, payment, provider, and beneficiary data. These graphs help in detecting patterns or clustering within the dataset.
# Pearson Correlation Graphs
pairs(select(medicare_abbrev, avg_sub_chg, avg_med_pay, avg_med_all, avg_med_std))
This graph set visualizes the relationships between average submitted charges, Medicare payments, allowed amounts, and standardized amounts. The scatter plots show strong positive linear relationships between these financial variables, consistent with the high Pearson correlation coefficients (>0.77) in the matrix. This shows that as one financial metric increases, the others tend to increase proportionally. However, the pattern is not equally clean for every pair, as some panels contain outliers that skew the data.
# Spearman Correlation Graphs
pairs(select(medicare_abbrev, tot_prov, tot_benef, tot_ben_day, tot_serv))
These graphs depict the relationships between total providers, total beneficiaries, total beneficiary days, and total services. The panels plot the raw values rather than ranks, so they should be read alongside the Spearman coefficients reported above, which indicate strong positive monotonic relationships, especially between total beneficiaries, beneficiary days, and services (0.92 to 0.97). The relationship with total providers is less linear but still positive.
# Removing Outliers: Pearson Correlation Graphs
pairs(select(medicare_adjust_abbrev, avg_sub_chg, avg_med_pay, avg_med_all, avg_med_std))
After removing outliers, some graphs, such as the one between the average Medicare aid and the average Medicare allowed amount, demonstrate a strong linear correlation. In addition, the outliers in the average submitted charge and average Medicare aid had caused the linear correlation to appear weaker than it is without them, since the Pearson coefficient rises from 0.783 to 0.813 once they are removed.
# Removing Outliers: Spearman Correlation Graphs
pairs(select(medicare_adjust_abbrev, tot_prov, tot_benef, tot_ben_day, tot_serv))
These graphs focus on the provider and service metrics with outliers removed. With the extreme points gone, the axes cover a narrower range, so the bulk of the data is easier to see than in the graphs with outliers. The underlying monotonic relationships are essentially unchanged, as the Spearman coefficients are nearly identical before and after filtering. This is particularly relevant for the relationships involving total beneficiaries, which have lower Pearson correlations but higher Spearman correlations.
The correlation graphs visually confirm the findings from the correlation matrices. They illustrate the strong positive relationships among financial variables and among service/beneficiary metrics. The differences between Pearson and Spearman correlations are likely visible in the shapes of the relationships, with some showing more linear patterns (financial variables) and others displaying non-linear but monotonic relationships (service metrics).
The correlation graphs visually reinforce the strong relationships between financial variables and the moderate to strong associations between provider and beneficiary metrics, highlighting both linear and non-linear patterns in the Medicare dataset.
Regression analysis helps quantify the relationship between Medicare provider services and reimbursements. The goal is to determine how well one variable (e.g., total beneficiaries) predicts another (e.g., total providers).
This linear regression model examines how the number of total beneficiaries influences the total number of providers. The coefficient and R-squared value indicate the extent to which variations in provider numbers are explained by beneficiary counts.
# Creating Linear Regression w/ Outliers
medicare_rgs_outliers <- lm(total_providers ~ total_beneficiaries, data = medicare)
summary_stats_outliers <- summary(medicare_rgs_outliers)
print(summary_stats_outliers)
##
## Call:
## lm(formula = total_providers ~ total_beneficiaries, data = medicare)
##
## Residuals:
## Min 1Q Median 3Q Max
## -233213 -146 -132 -83 258446
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.545e+02 4.475e+00 34.53 <2e-16 ***
## total_beneficiaries 2.101e-02 4.061e-05 517.47 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2325 on 270671 degrees of freedom
## Multiple R-squared: 0.4973, Adjusted R-squared: 0.4973
## F-statistic: 2.678e+05 on 1 and 270671 DF, p-value: < 2.2e-16
This output presents the results of a linear regression analysis examining the relationship between total providers and total beneficiaries in the Medicare dataset, including outliers. The model is statistically significant (F-statistic: 2.678e+05, p-value: < 2.2e-16). The coefficient for total_beneficiaries (2.101e-02) is positive and highly significant (p < 2e-16), demonstrating that for every additional beneficiary, there is an average increase of about 0.02101 providers. The intercept (154.5) represents the estimated number of providers when there are zero beneficiaries. The model explains approximately 49.73% of the variance in total providers (R-squared: 0.4973), demonstrating a moderate fit. However, the large range in residuals (Min: -233213, Max: 258446) suggests the presence of influential outliers that may be affecting the accuracy of the linear model.
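In a simple linear regression with one predictor, R-squared equals the square of the Pearson correlation between the two variables, which gives a quick consistency check between the regression output and the correlation tests reported earlier. The sketch below first squares the two Pearson coefficients copied from those tests and then confirms the identity on simulated data; `x`, `y`, and `fit` are hypothetical names used only for this illustration.

```r
# Pearson coefficients copied from the cor.test outputs in this document.
r_original <- 0.7052011   # total providers vs. total beneficiaries, original data
r_filtered <- 0.653667    # same pair, outliers removed
r_squared_check <- round(c(original = r_original^2, filtered = r_filtered^2), 4)
print(r_squared_check)

# Confirm the identity R-squared = r^2 on simulated data.
set.seed(11)
x <- rexp(1000, rate = 1 / 1000)
y <- 150 + 0.02 * x + rnorm(1000, sd = 20)
fit <- lm(y ~ x)
print(all.equal(summary(fit)$r.squared, cor(x, y)^2))
```

The squared coefficients come to 0.4973 and 0.4273, matching the Multiple R-squared values that `summary()` prints for the models fitted to the original and filtered data.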
# Relationship Between Total Providers and Total Beneficiaries
ggplot(medicare, aes(total_providers, total_beneficiaries)) +
  geom_point(alpha = 0.75, color = "navyblue", fill = "turquoise1", shape = 21, size = 2, stroke = 0.5) +
  geom_smooth(method = "lm", color = "black", se = FALSE) +
  labs(
    title = "Relationship Between Total Providers and Total Beneficiaries",
    x = "Total Providers",
    y = "Total Beneficiaries",
    caption = paste("R-squared:", round(summary_stats_outliers$r.squared, 3))
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", size = 14),
    axis.title = element_text(face = "bold"),
    axis.text = element_text(size = 10)
  ) +
  scale_x_continuous(labels = scales::comma) +
  scale_y_continuous(labels = scales::comma)
## `geom_smooth()` using formula = 'y ~ x'
This graph illustrates the relationship between the total number of providers and total beneficiaries in the Medicare dataset. The scatter plot shows a positive linear relationship, with a moderate R-squared value of approximately 0.497. This indicates that as the number of providers increases, the number of beneficiaries tends to increase as well, although there is considerable variation around the trend line. The presence of outliers is notable, with some points far from the main cluster. These outliers are removed and the relationship is re-examined below.
A second regression model is generated after removing extreme outliers, in an attempt to better capture the underlying trend between beneficiaries and providers. The change in model fit is assessed by comparing R-squared values before and after filtering.
# Creating Linear Regression w/o Outliers
medicare_rgs_filtered <- lm(total_providers ~ total_beneficiaries, data = medicare_adjust)
summary_stats_filtered <- summary(medicare_rgs_filtered)
print(summary_stats_filtered)
##
## Call:
## lm(formula = total_providers ~ total_beneficiaries, data = medicare_adjust)
##
## Residuals:
## Min 1Q Median 3Q Max
## -147172 -125 -111 -65 218391
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.328e+02 3.656e+00 36.34 <2e-16 ***
## total_beneficiaries 2.589e-02 5.767e-05 448.95 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1895 on 270165 degrees of freedom
## Multiple R-squared: 0.4273, Adjusted R-squared: 0.4273
## F-statistic: 2.016e+05 on 1 and 270165 DF, p-value: < 2.2e-16
This output presents the results of a linear regression analysis examining the relationship between total providers and total beneficiaries in the Medicare study dataset after removing outliers. The model remains statistically significant (F-statistic: 2.016e+05, p-value: < 2.2e-16). The coefficient for total_beneficiaries (2.589e-02) is positive and highly significant (p < 2e-16), demonstrating that for every additional beneficiary, there is an average increase of about 0.02589 providers. The intercept (132.8) represents the estimated number of providers when there are zero beneficiaries. The model explains approximately 42.73% of the variance in total providers (R-squared: 0.4273), indicating a moderate fit, though lower than the 49.73% of the model with outliers. The range of residuals (Min: -147172, Max: 218391) is smaller than in the previous model, and the residual standard error falls from 2325 to 1895, demonstrating that removing outliers has reduced the influence of extreme values. This filtered model likely provides a better representation of the relationship between providers and beneficiaries for the majority of the data points.
# Relationship Between Total Providers and Total Beneficiaries (Outliers Removed)
ggplot(medicare_adjust, aes(total_providers, total_beneficiaries)) +
  geom_point(alpha = 0.75, color = "navyblue", fill = "turquoise1", shape = 21, size = 2, stroke = 0.5) +
  geom_smooth(method = "lm", color = "black", se = FALSE) +
  labs(
    title = "Relationship Between Total Providers and Total Beneficiaries (Outliers Removed)",
    x = "Total Providers",
    y = "Total Beneficiaries",
    caption = paste("R-squared:", round(summary_stats_filtered$r.squared, 3))
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", size = 14),
    axis.title = element_text(face = "bold"),
    axis.text = element_text(size = 10)
  ) +
  scale_x_continuous(labels = scales::comma) +
  scale_y_continuous(labels = scales::comma)
## `geom_smooth()` using formula = 'y ~ x'
After removing outliers, this graph displays a more dispersed linear relationship between total beneficiaries and total providers. The R-squared value decreases to about 0.427, indicating that while the relationship remains positive, the outliers were inflating the strength of the correlation in the original dataset. This graph provides a more representative view of the typical relationship between providers and beneficiaries, showing how the majority of data points relate to each other without the skewing effect of extreme values.
The beneficiary-to-provider ratio is analyzed and visualized on a logarithmic scale to better interpret variations across different locations. This highlights disparities in healthcare access and identifies outlier regions where provider shortages or surpluses exist.
# Filtering Data to Demonstrate Ratio
medicare_adjust$beneficiary_provider_ratio <-
medicare_adjust$total_beneficiaries / medicare_adjust$total_providers
# Calculate Mean Ratio and Print
mean_ratio <- mean(medicare_adjust$beneficiary_provider_ratio, na.rm = TRUE)
cat("Mean Beneficiary to Provider Ratio:", round(mean_ratio, 2), "\n")
## Mean Beneficiary to Provider Ratio: 23.49
This segment calculates the ratio of beneficiaries to providers for each entry in the adjusted Medicare dataset (with outliers removed) and then computes the average ratio across all entries. The resulting mean ratio of 23.49 indicates that, on average, there are approximately 23.49 beneficiaries for every provider in the dataset. This metric provides a quick snapshot of the overall distribution of beneficiaries among providers, suggesting that each provider typically serves about 23 to 24 Medicare beneficiaries. This ratio can be useful for understanding the general workload or patient load per provider in the Medicare system. It is important to note that this is an average of row-level ratios and individual cases may vary significantly.
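Because the mean of row-level ratios is sensitive to a few very large ratios, it can help to compare it with the median and with the pooled ratio (total beneficiaries divided by total providers). The following is a small worked sketch; the data frame `toy` and its four rows are hypothetical values chosen to make the arithmetic easy to follow.

```r
# Hypothetical rows: three small ratios and one very large one.
toy <- data.frame(
  total_beneficiaries = c(20, 45, 300, 12000),
  total_providers     = c(10, 15,  20,    40)
)

# Row-level ratios: 2, 3, 15, 300.
ratio <- toy$total_beneficiaries / toy$total_providers

# A row with zero providers would give Inf, which na.rm = TRUE does not drop,
# so only finite ratios are kept before summarizing.
ratio <- ratio[is.finite(ratio)]

summary_ratios <- c(
  mean_of_ratios = mean(ratio),
  median_ratio   = median(ratio),
  pooled_ratio   = sum(toy$total_beneficiaries) / sum(toy$total_providers)
)
print(round(summary_ratios, 2))
```

Here the mean of ratios is 80, the median is 9, and the pooled ratio is about 145.47, so the three summaries can tell quite different stories when the ratios are skewed.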
# Distribution of Beneficiary to Provider Ratio (Log Scale)
ggplot(medicare_adjust, aes(x = beneficiary_provider_ratio)) +
  geom_histogram(binwidth = 0.1, fill = "aquamarine", color = "aquamarine4") +
  scale_x_log10(labels = scales::comma) + # Apply log10 transformation
  labs(title = "Distribution of Beneficiary to Provider Ratio (Log Scale)",
       x = "Beneficiaries per Provider (Log Scale)",
       y = "Frequency") +
  theme_bw() +
  facet_wrap(~place_of_service) +
  coord_cartesian(xlim = c(1, 500), ylim = c(1, 25000))
This visualization presents the distribution of the beneficiary-to-provider ratio, displayed on a log scale, and faceted by place of service (labeled as ‘F’ and ‘O’). The histogram for place of service ‘F’ shows a high frequency of ratios clustered towards the lower end, indicating that many providers serve a relatively small number of beneficiaries. As the ratio increases (moving towards the right on the x-axis), the frequency decreases significantly. The histogram for place of service ‘O’ is also concentrated towards the lower end, with a tail extending to the right, but with a less pronounced peak. The use of a logarithmic scale allows for a better understanding of the distribution across a wide range of ratios.
The following t-tests compare Medicare service metrics across different provider geographic levels to determine if there are statistically significant differences. This helps assess whether provider location impacts payment amounts and service utilization.
# Performing t-test for Average Submitted Charge vs. Provider Geographic Level
t.test(average_submitted_charge ~ provider_geo_level, data = medicare)
##
## Welch Two Sample t-test
##
## data: average_submitted_charge by provider_geo_level
## t = 21.249, df = 14617, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group National and group State is not equal to 0
## 95 percent confidence interval:
## 449.1008 540.3763
## sample estimates:
## mean in group National mean in group State
## 1780.090 1285.352
This t-test compares the average submitted charge between national and state-level providers. The results show a statistically significant difference (t = 21.249, p-value < 2.2e-16) between the two groups. National providers have a higher mean submitted charge ($1780.09) compared to state providers ($1285.35), with a difference of approximately $494.74 (95% CI: $449.10 to $540.38).
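The difference and confidence interval quoted above can be read directly from the object that `t.test()` returns. The following is a minimal sketch on simulated data; the data frame `toy` and its columns `level` and `charge` are hypothetical stand-ins for `medicare`, `provider_geo_level`, and `average_submitted_charge`.

```r
# Simulated stand-in for the Medicare data (hypothetical values)
set.seed(1)
toy <- data.frame(
  level  = rep(c("National", "State"), times = c(200, 800)),
  charge = c(rnorm(200, mean = 1800, sd = 900),
             rnorm(800, mean = 1300, sd = 700))
)

# Welch two-sample t-test (the default for t.test, var.equal = FALSE)
tt <- t.test(charge ~ level, data = toy)

# Group means are stored in `estimate`; the first group is "National"
mean_diff <- unname(tt$estimate[1] - tt$estimate[2])

# The confidence interval for the difference is stored in `conf.int`
ci <- as.numeric(tt$conf.int)

print(round(mean_diff, 2))
print(round(ci, 2))
```

The interval is centered on the difference in group means, which is why the reported difference of about $494.74 sits midway between the bounds of $449.10 and $540.38.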
# Performing t-test for Average Medicare Payment Amount vs. Provider Geographic Level
t.test(average_medicare_payment_amount ~ provider_geo_level, data = medicare)
##
## Welch Two Sample t-test
##
## data: average_medicare_payment_amount by provider_geo_level
## t = 16.985, df = 14571, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group National and group State is not equal to 0
## 95 percent confidence interval:
## 90.75193 114.43107
## sample estimates:
## mean in group National mean in group State
## 329.7900 227.1985
This analysis examines the difference in average Medicare payment amounts between national and state-level providers. The test reveals a statistically significant difference (t = 16.985, p-value < 2.2e-16). National providers receive higher average Medicare payments ($329.79) compared to state providers ($227.20), with a mean difference of about $102.59 (95% CI: $90.75 to $114.43).
# Performing t-test for Total Services vs. Provider Geographic Level
t.test(total_services ~ provider_geo_level, data = medicare)
##
## Welch Two Sample t-test
##
## data: total_services by provider_geo_level
## t = 9.9498, df = 13326, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group National and group State is not equal to 0
## 95 percent confidence interval:
## 186513.2 278029.3
## sample estimates:
## mean in group National mean in group State
## 244433.90 12162.67
To display the relationship between these two variables as a bar graph, the t-test results must first be stored in a variable.
# Storing t-test Results: Total Services vs. Provider Geographic Level for Plotting
t_test_results <- t.test(total_services ~ provider_geo_level, data = medicare)
The data can then be summarized by grouping on provider geographic level and computing each group's mean of total services. The lower and upper bounds of each group's 95% confidence interval are also calculated for display as error bars in the bar graph.
# Summarizing t-test: Total Services vs. Provider Geographic Level
summary_data <- medicare %>%
group_by(provider_geo_level) %>%
summarise(
mean = mean(total_services),
se = sd(total_services) / sqrt(n()),
ci_lower = mean - qt(0.975, df = n() - 1) * se,
ci_upper = mean + qt(0.975, df = n() - 1) * se
)
With the data summarized and stored, the following snippet plots the bar graph to demonstrate the difference between the National and State levels of Medicare.
# Creating Gradient: Total Services vs. Provider Geographic Level
gradient_fill <- scale_fill_gradient(low = "royalblue1", high = "brown2")
suppressWarnings({
# Plotting Data: Total Services vs. Provider Geographic Level
ggplot(summary_data, aes(x = provider_geo_level, y = mean, fill = mean)) +
geom_bar(stat = "identity", position = position_dodge(),
color = "black", size = 0.5) + # Add black outline
geom_errorbar(aes(ymin = ci_lower, ymax = ci_upper), width = 0.2, position = position_dodge(0.9)) +
labs(
title = "Comparison of Total Services by Provider Geographic Level",
subtitle = paste("t =", round(t_test_results$statistic, 2),
", df =", round(t_test_results$parameter, 2),
", p-value =", format.pval(t_test_results$p.value, digits = 3)),
x = "Provider Geographic Level",
y = "Mean Total Services"
) +
scale_y_continuous(labels = scales::comma_format()) +
gradient_fill +
theme_minimal() +
theme(
plot.title = element_text(face = "bold"),
legend.position = "none",
panel.grid.major = element_line(color = "gray90"),
panel.grid.minor = element_line(color = "gray95")
)
})
This t-test compares the total services provided by national and state-level providers. The results indicate a statistically significant difference (t = 9.9498, p-value < 2.2e-16). National providers offer substantially more services on average (244,433.90) compared to state providers (12,162.67), with a mean difference of approximately 232,271 services (95% CI: 186,513 to 278,029).
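To make the reported statistic, degrees of freedom, and interval less of a black box, the sketch below reproduces a Welch test by hand and checks it against `t.test()`. It runs on simulated data: the vectors `national` and `state` and the helper `welch_by_hand()` are hypothetical names introduced for illustration only.

```r
# Simulated, skewed service counts (hypothetical, not the Medicare values)
set.seed(7)
national <- rexp(300, rate = 1 / 2000)
state    <- rexp(3000, rate = 1 / 150)

# Welch's t-test computed step by step
welch_by_hand <- function(x, y, conf_level = 0.95) {
  v1 <- var(x) / length(x)          # squared standard error of mean(x)
  v2 <- var(y) / length(y)          # squared standard error of mean(y)
  se <- sqrt(v1 + v2)               # standard error of the difference
  diff_means <- mean(x) - mean(y)
  # Welch-Satterthwaite approximation for the degrees of freedom
  df <- (v1 + v2)^2 / (v1^2 / (length(x) - 1) + v2^2 / (length(y) - 1))
  margin <- qt(1 - (1 - conf_level) / 2, df) * se
  c(t = diff_means / se, df = df,
    ci_lower = diff_means - margin, ci_upper = diff_means + margin)
}

by_hand  <- welch_by_hand(national, state)
built_in <- t.test(national, state)

print(round(by_hand, 3))
print(built_in)
```

The Welch-Satterthwaite approximation is why the degrees of freedom in the output above (for example, df = 13326) are not simply the sample size minus two: unequal group variances and group sizes reduce them.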
These t-tests consistently show statistically significant differences between national and state-level providers across all three metrics: average submitted charges, average Medicare payment amounts, and total services provided. National providers show higher values for all three measures. The most pronounced difference is in total services provided, where the national average is about 20 times the state average. These results demonstrate that the geographic level of the provider (national vs. state) is strongly associated with differences in pricing, payment, and service volume in the Medicare system.
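The size of each gap can be checked with simple arithmetic on the group means printed in the three t-test outputs. The sketch below is one way to tabulate them; the data frame `reported_means` and its column names are assumptions made for this example, while the numbers are copied from the outputs above.

```r
# Group means copied from the three t-test outputs
reported_means <- data.frame(
  metric   = c("average_submitted_charge",
               "average_medicare_payment_amount",
               "total_services"),
  national = c(1780.090, 329.7900, 244433.90),
  state    = c(1285.352, 227.1985, 12162.67)
)

# Absolute difference and national-to-state ratio for each metric
reported_means$difference <- reported_means$national - reported_means$state
reported_means$ratio      <- reported_means$national / reported_means$state

print(reported_means)
```

The ratio column shows that the two dollar-denominated metrics differ by well under a factor of two, while total services differ by a factor of roughly 20.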
The objective of this study was to examine patterns in Medicare services, the costs incurred, and differences in the compensation received by care providers in different geographic areas. The data collected by the Centers for Medicare & Medicaid Services provided a solid foundation for examining healthcare utilization and costs across regions. The preliminary analyses showed differences between the amounts reimbursed by Medicare and the costs charged by care providers, which indicated a possible financial strain on beneficiaries. By reformatting the dataset for greater clarity and consistency in presentation, a firm analytical perspective was developed for assessing the impact of Medicare on care access and policy concerns.
The visual analysis of Medicare service distributions revealed patterns in service provision and differences in payment. The scatter plots and box plots showed that Medicare typically reimbursed about 20% of total costs, although there were occasional instances of full reimbursement. Geographic differences were also found: higher-cost states, such as New York and California, showed larger gaps between the amounts charged by practitioners and the amounts reimbursed by Medicare. In contrast, the state-by-state breakdown showed that some regions, especially in the Northeast, had a greater number of services per patient. In addition, other states with large numbers of Medicare beneficiaries did not reach the level of services that would be expected based on their population. These results point to the need to consider regional differences in medical infrastructure and policy when assessing the efficiency of Medicare.
To strengthen this study’s validity, a variety of data manipulation strategies were used, such as filtering provider locations to focus solely on the 50 U.S. states and aggregating the data into state-level service metrics. Visualizing Medicare support and payment data clarified differences in coverage between regions. Interestingly, despite its very high levels of Medicare support, California reported some of the largest physician-submitted charges, which supports the trend of partial cost coverage. Areas with lower average submitted amounts and fewer services rendered per beneficiary, especially in parts of the Midwest, may face related barriers to accessing care or unfamiliarity with Medicare among beneficiaries. The subsetted and filtered datasets allowed for a more detailed and nuanced understanding of the pattern of Medicare support.
Statistical evaluations showed substantial correlations between physician-submitted charges and Medicare reimbursement, although non-linear trends influenced by outliers were evident. Regression analysis showed a direct relationship between beneficiary counts and provider counts, though with wide variability between states. T-tests comparing national-level and state-level providers showed statistically significant differences in submitted charges, Medicare payment amounts, and total services. National-level providers receive greater compensation on average. These outcomes point toward systemic inequalities in Medicare resource provision, which may necessitate adjustments to reimbursement systems for greater equality between different levels of providers.
This study sought to uncover substantial patterns in Medicare provision, cost variability, and reimbursement differences across geographic regions for medical practitioners. Future studies may consider examining how medical specialties contribute to these differences and analyzing the lasting influence of Medicare policy on access to medical care. One limitation of the dataset was the lack of co-insurance data for beneficiaries, which made it difficult to determine whether Medicare coverage was intentionally low because beneficiaries held outside coverage. A study of the role of supplemental coverage in financing care may therefore provide insights into patient financing practices. Policy-making may be enhanced by examining ways of simplifying reimbursement systems, ultimately reducing disparities in physician compensation and Medicare financing, which, in turn, can increase care access throughout the country. Continued study is aimed at further understanding Medicare’s critical position in healthcare provision in the United States.
This study of Medicare services, providers, and fees in 2022 highlights trends and anomalies that impact the healthcare landscape for senior Americans. Understanding these patterns is important for informing policy decisions, improving service delivery, and bridging gaps between Medicare and the public. Thank you for your time and dedication to exploring this data with me!