The data provided here include hospital-specific charges for the more than 3,000 U.S. hospitals that receive Medicare Inpatient Prospective Payment System (IPPS) payments for the top 100 most frequently billed discharges, paid under Medicare based on a rate per discharge using the Medicare Severity Diagnosis Related Group (MS-DRG) for Fiscal Year (FY) 2011. These DRGs represent more than 7 million discharges or 60 percent of total Medicare IPPS discharges.
# Load required libraries
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.3.3
## Warning: package 'ggplot2' was built under R version 4.3.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Load the dataset
data <- read.csv(file.choose()) # Assuming you've downloaded and saved the CSV
# View the first few rows and the structure of the dataset
head(data)
str(data)
## 'data.frame': 163065 obs. of 12 variables:
## $ drg_definition : chr "039 - EXTRACRANIAL PROCEDURES W/O CC/MCC" "039 - EXTRACRANIAL PROCEDURES W/O CC/MCC" "039 - EXTRACRANIAL PROCEDURES W/O CC/MCC" "039 - EXTRACRANIAL PROCEDURES W/O CC/MCC" ...
## $ provider_id : int 10001 10005 10006 10011 10016 10023 10029 10033 10039 10040 ...
## $ provider_name : chr "SOUTHEAST ALABAMA MEDICAL CENTER" "MARSHALL MEDICAL CENTER SOUTH" "ELIZA COFFEE MEMORIAL HOSPITAL" "ST VINCENT'S EAST" ...
## $ provider_street_address : chr "1108 ROSS CLARK CIRCLE" "2505 U S HIGHWAY 431 NORTH" "205 MARENGO STREET" "50 MEDICAL PARK EAST DRIVE" ...
## $ provider_city : chr "DOTHAN" "BOAZ" "FLORENCE" "BIRMINGHAM" ...
## $ provider_state : chr "AL" "AL" "AL" "AL" ...
## $ provider_zip_code : int 36301 35957 35631 35235 35007 36116 36801 35233 35801 35903 ...
## $ hospital_referral_region_description: chr "AL - Dothan" "AL - Birmingham" "AL - Birmingham" "AL - Birmingham" ...
## $ total_discharges : int 91 14 24 25 18 67 51 32 135 34 ...
## $ average_covered_charges : num 32963 15132 37560 13998 31633 ...
## $ average_total_payments : num 5777 5788 5435 5418 5658 ...
## $ average_medicare_payments : num 4764 4977 4454 4129 4851 ...
summary(data)
## drg_definition provider_id provider_name provider_street_address
## Length:163065 Min. : 10001 Length:163065 Length:163065
## Class :character 1st Qu.:110092 Class :character Class :character
## Mode :character Median :250007 Mode :character Mode :character
## Mean :255570
## 3rd Qu.:380075
## Max. :670077
## provider_city provider_state provider_zip_code
## Length:163065 Length:163065 Min. : 1040
## Class :character Class :character 1st Qu.:27261
## Mode :character Mode :character Median :44309
## Mean :47938
## 3rd Qu.:72901
## Max. :99835
## hospital_referral_region_description total_discharges average_covered_charges
## Length:163065 Min. : 11.00 Min. : 2459
## Class :character 1st Qu.: 17.00 1st Qu.: 15947
## Mode :character Median : 27.00 Median : 25246
## Mean : 42.78 Mean : 36134
## 3rd Qu.: 49.00 3rd Qu.: 43233
## Max. :3383.00 Max. :929119
## average_total_payments average_medicare_payments
## Min. : 2673 Min. : 1149
## 1st Qu.: 5234 1st Qu.: 4192
## Median : 7214 Median : 6158
## Mean : 9707 Mean : 8494
## 3rd Qu.: 11286 3rd Qu.: 10057
## Max. :156158 Max. :154621
# Check for missing values
colSums(is.na(data))
## drg_definition provider_id
## 0 0
## provider_name provider_street_address
## 0 0
## provider_city provider_state
## 0 0
## provider_zip_code hospital_referral_region_description
## 0 0
## total_discharges average_covered_charges
## 0 0
## average_total_payments average_medicare_payments
## 0 0
# Distribution of numeric variables
hist(data$total_discharges, main="Distribution of Total Discharges", col="lightblue", xlab="Total Discharges")
hist(data$average_covered_charges, main="Distribution of Average Covered Charges", col="lightgreen", xlab="Average Covered Charges")
# Barplot for providers by state
provider_counts <- data %>% group_by(provider_state) %>% summarize(count = n())
barplot(provider_counts$count, names.arg=provider_counts$provider_state, main="Providers by State", col="lightcoral")
# Scatter plot of Medicare payments vs total payments
plot(data$average_medicare_payments, data$average_total_payments, main="Medicare vs Total Payments", xlab="Medicare Payments", ylab="Total Payments", col="blue")
2. about EDA Distribution of Total Discharges: The histogram shows the
spread of total discharges across providers. Most providers have a
moderate number of discharges, with fewer hospitals handling extremely
high or low volumes. This suggests that there is some variability in the
number of patients served by different hospitals.
Distribution of Average Covered Charges: The histogram for average covered charges indicates that most providers charge within a certain range. However, there might be some outliers with significantly higher covered charges, suggesting differences in the cost of care across providers.
Providers by State: The bar plot displaying the number of providers by state shows that some states have a higher concentration of healthcare providers. This could reflect population density or the distribution of healthcare facilities across different regions.
Relationship Between Medicare Payments and Total Payments: The scatter plot of Medicare payments vs. total payments suggests a positive correlation, meaning that as Medicare payments increase, total payments also tend to increase. However, the spread in points may also indicate that Medicare does not cover the entire cost of treatment, and the degree of coverage varies by provider or region.
Missing Values Check: There are no significant missing values in the dataset, allowing us to proceed with analysis without data imputation or removal of rows/columns.
# Normalize the relevant numerical columns for clustering
data_normalized <- data %>%
select(total_discharges, average_covered_charges, average_total_payments, average_medicare_payments) %>%
scale()
# Determine the optimal number of clusters using the elbow method
wss <- (nrow(data_normalized)-1)*sum(apply(data_normalized, 2, var))
for (i in 2:10) wss[i] <- sum(kmeans(data_normalized, centers=i)$tot.withinss)
## Warning: Quick-TRANSfer stage steps exceeded maximum (= 8153250)
## Warning: Quick-TRANSfer stage steps exceeded maximum (= 8153250)
plot(1:10, wss, type="b", main="Elbow Method", xlab="Number of Clusters", ylab="Within groups sum of squares")
# Apply k-means with the chosen number of clusters (say k=3)
set.seed(123)
kmeans_result <- kmeans(data_normalized, centers = 3)
# Add the cluster labels to the dataset
data$cluster <- kmeans_result$cluster
# Visualize clusters
library(ggplot2)
ggplot(data, aes(x=average_covered_charges, y=average_total_payments, color=factor(cluster))) +
geom_point() +
labs(title="K-Means Clustering on Charges and Payments", x="Average Covered Charges", y="Average Total Payments")
Data Normalization: To ensure that all variables contribute equally to the clustering process, we normalized the numerical columns. Determining the Optimal Number of Clusters: Using the elbow method, we identified three as the optimal number of clusters based on the within-cluster sum of squares (WCSS). K-Means Clustering: We applied the K-Means algorithm with three clusters to group the hospitals based on their similarities in terms of the selected variables. Cluster Analysis
The clustering results revealed three distinct groups of hospitals:
Low-Cost Hospitals: This cluster is characterized by a relatively low average covered charges, total payments, and Medicare payments. These hospitals might be serving a lower-income population or focusing on providing basic healthcare services. High-Charge, Low-Payment Hospitals: This cluster is defined by high average covered charges but relatively low total payments. This could indicate that these hospitals are using pricing strategies that result in higher charges but lower actual payments received, perhaps due to insurance negotiations or government programs. High-Cost Hospitals: This cluster is characterized by both high average covered charges and total payments. These hospitals might be providing specialized or premium healthcare services, leading to higher costs for patients and insurers. Key Insights
Segmentation: The clustering results offer a valuable segmentation of hospitals based on their cost structures. Pricing Strategies: The analysis reveals different pricing strategies employed by hospitals, highlighting the potential for cost-saving measures. Healthcare Service Delivery: The clusters might be associated with specific patient demographics or healthcare needs, providing insights into the factors driving cost variations. Policy Implications: The findings can inform policy decisions related to pricing regulations, reimbursement rates, and resource allocation.
Conclusion: By applying K-Means clustering to the healthcare cost data, we have uncovered important insights into the factors driving variations in costs and identified potential areas for improvement. These findings can be used to inform evidence-based decision-making and promote more equitable healthcare access.