Hospital charge data

About the data Data are being released that show significant variation across the country and within communities in what providers charge for common services. These data include information comparing the charges for the 100 most common inpatient services and 30 common outpatient services. Providers determine what they will charge for items and services provided to patients and these charges are the amount the providers bills for an item or service.

The data provided here include hospital-specific charges for the more than 3,000 U.S. hospitals that receive Medicare Inpatient Prospective Payment System (IPPS) payments for the top 100 most frequently billed discharges, paid under Medicare based on a rate per discharge using the Medicare Severity Diagnosis Related Group (MS-DRG) for Fiscal Year (FY) 2011. These DRGs represent more than 7 million discharges or 60 percent of total Medicare IPPS discharges.

# Load required libraries
library(tidyverse)

## Warning: package 'tidyverse' was built under R version 4.3.3

## Warning: package 'ggplot2' was built under R version 4.3.3

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

# Load the dataset
data <- read.csv(file.choose())  # Assuming you've downloaded and saved the CSV

# View the first few rows and the structure of the dataset
head(data)

str(data)

## 'data.frame':    163065 obs. of  12 variables:
##  $ drg_definition                      : chr  "039 - EXTRACRANIAL PROCEDURES W/O CC/MCC" "039 - EXTRACRANIAL PROCEDURES W/O CC/MCC" "039 - EXTRACRANIAL PROCEDURES W/O CC/MCC" "039 - EXTRACRANIAL PROCEDURES W/O CC/MCC" ...
##  $ provider_id                         : int  10001 10005 10006 10011 10016 10023 10029 10033 10039 10040 ...
##  $ provider_name                       : chr  "SOUTHEAST ALABAMA MEDICAL CENTER" "MARSHALL MEDICAL CENTER SOUTH" "ELIZA COFFEE MEMORIAL HOSPITAL" "ST VINCENT'S EAST" ...
##  $ provider_street_address             : chr  "1108 ROSS CLARK CIRCLE" "2505 U S HIGHWAY 431 NORTH" "205 MARENGO STREET" "50 MEDICAL PARK EAST DRIVE" ...
##  $ provider_city                       : chr  "DOTHAN" "BOAZ" "FLORENCE" "BIRMINGHAM" ...
##  $ provider_state                      : chr  "AL" "AL" "AL" "AL" ...
##  $ provider_zip_code                   : int  36301 35957 35631 35235 35007 36116 36801 35233 35801 35903 ...
##  $ hospital_referral_region_description: chr  "AL - Dothan" "AL - Birmingham" "AL - Birmingham" "AL - Birmingham" ...
##  $ total_discharges                    : int  91 14 24 25 18 67 51 32 135 34 ...
##  $ average_covered_charges             : num  32963 15132 37560 13998 31633 ...
##  $ average_total_payments              : num  5777 5788 5435 5418 5658 ...
##  $ average_medicare_payments           : num  4764 4977 4454 4129 4851 ...

summary(data)

##  drg_definition      provider_id     provider_name      provider_street_address
##  Length:163065      Min.   : 10001   Length:163065      Length:163065          
##  Class :character   1st Qu.:110092   Class :character   Class :character       
##  Mode  :character   Median :250007   Mode  :character   Mode  :character       
##                     Mean   :255570                                             
##                     3rd Qu.:380075                                             
##                     Max.   :670077                                             
##  provider_city      provider_state     provider_zip_code
##  Length:163065      Length:163065      Min.   : 1040    
##  Class :character   Class :character   1st Qu.:27261    
##  Mode  :character   Mode  :character   Median :44309    
##                                        Mean   :47938    
##                                        3rd Qu.:72901    
##                                        Max.   :99835    
##  hospital_referral_region_description total_discharges  average_covered_charges
##  Length:163065                        Min.   :  11.00   Min.   :  2459         
##  Class :character                     1st Qu.:  17.00   1st Qu.: 15947         
##  Mode  :character                     Median :  27.00   Median : 25246         
##                                       Mean   :  42.78   Mean   : 36134         
##                                       3rd Qu.:  49.00   3rd Qu.: 43233         
##                                       Max.   :3383.00   Max.   :929119         
##  average_total_payments average_medicare_payments
##  Min.   :  2673         Min.   :  1149           
##  1st Qu.:  5234         1st Qu.:  4192           
##  Median :  7214         Median :  6158           
##  Mean   :  9707         Mean   :  8494           
##  3rd Qu.: 11286         3rd Qu.: 10057           
##  Max.   :156158         Max.   :154621

# Check for missing values
colSums(is.na(data))

##                       drg_definition                          provider_id 
##                                    0                                    0 
##                        provider_name              provider_street_address 
##                                    0                                    0 
##                        provider_city                       provider_state 
##                                    0                                    0 
##                    provider_zip_code hospital_referral_region_description 
##                                    0                                    0 
##                     total_discharges              average_covered_charges 
##                                    0                                    0 
##               average_total_payments            average_medicare_payments 
##                                    0                                    0

# Distribution of numeric variables
hist(data$total_discharges, main="Distribution of Total Discharges", col="lightblue", xlab="Total Discharges")

hist(data$average_covered_charges, main="Distribution of Average Covered Charges", col="lightgreen", xlab="Average Covered Charges")

# Barplot for providers by state
provider_counts <- data %>% group_by(provider_state) %>% summarize(count = n())
barplot(provider_counts$count, names.arg=provider_counts$provider_state, main="Providers by State", col="lightcoral")

# Scatter plot of Medicare payments vs total payments
plot(data$average_medicare_payments, data$average_total_payments, main="Medicare vs Total Payments", xlab="Medicare Payments", ylab="Total Payments", col="blue")

2. about EDA Distribution of Total Discharges: The histogram shows the spread of total discharges across providers. Most providers have a moderate number of discharges, with fewer hospitals handling extremely high or low volumes. This suggests that there is some variability in the number of patients served by different hospitals.

Distribution of Average Covered Charges: The histogram for average covered charges indicates that most providers charge within a certain range. However, there might be some outliers with significantly higher covered charges, suggesting differences in the cost of care across providers.

Providers by State: The bar plot displaying the number of providers by state shows that some states have a higher concentration of healthcare providers. This could reflect population density or the distribution of healthcare facilities across different regions.

Relationship Between Medicare Payments and Total Payments: The scatter plot of Medicare payments vs. total payments suggests a positive correlation, meaning that as Medicare payments increase, total payments also tend to increase. However, the spread in points may also indicate that Medicare does not cover the entire cost of treatment, and the degree of coverage varies by provider or region.

Missing Values Check: There are no significant missing values in the dataset, allowing us to proceed with analysis without data imputation or removal of rows/columns.

# Normalize the relevant numerical columns for clustering
data_normalized <- data %>%
  select(total_discharges, average_covered_charges, average_total_payments, average_medicare_payments) %>%
  scale()

# Determine the optimal number of clusters using the elbow method
wss <- (nrow(data_normalized)-1)*sum(apply(data_normalized, 2, var))
for (i in 2:10) wss[i] <- sum(kmeans(data_normalized, centers=i)$tot.withinss)

## Warning: Quick-TRANSfer stage steps exceeded maximum (= 8153250)

## Warning: Quick-TRANSfer stage steps exceeded maximum (= 8153250)

plot(1:10, wss, type="b", main="Elbow Method", xlab="Number of Clusters", ylab="Within groups sum of squares")

# Apply k-means with the chosen number of clusters (say k=3)
set.seed(123)
kmeans_result <- kmeans(data_normalized, centers = 3)

# Add the cluster labels to the dataset
data$cluster <- kmeans_result$cluster

# Visualize clusters
library(ggplot2)
ggplot(data, aes(x=average_covered_charges, y=average_total_payments, color=factor(cluster))) + 
  geom_point() + 
  labs(title="K-Means Clustering on Charges and Payments", x="Average Covered Charges", y="Average Total Payments")

k means Clustering

Data Normalization: To ensure that all variables contribute equally to the clustering process, we normalized the numerical columns. Determining the Optimal Number of Clusters: Using the elbow method, we identified three as the optimal number of clusters based on the within-cluster sum of squares (WCSS). K-Means Clustering: We applied the K-Means algorithm with three clusters to group the hospitals based on their similarities in terms of the selected variables. Cluster Analysis

The clustering results revealed three distinct groups of hospitals:

Low-Cost Hospitals: This cluster is characterized by a relatively low average covered charges, total payments, and Medicare payments. These hospitals might be serving a lower-income population or focusing on providing basic healthcare services. High-Charge, Low-Payment Hospitals: This cluster is defined by high average covered charges but relatively low total payments. This could indicate that these hospitals are using pricing strategies that result in higher charges but lower actual payments received, perhaps due to insurance negotiations or government programs. High-Cost Hospitals: This cluster is characterized by both high average covered charges and total payments. These hospitals might be providing specialized or premium healthcare services, leading to higher costs for patients and insurers. Key Insights

Segmentation: The clustering results offer a valuable segmentation of hospitals based on their cost structures. Pricing Strategies: The analysis reveals different pricing strategies employed by hospitals, highlighting the potential for cost-saving measures. Healthcare Service Delivery: The clusters might be associated with specific patient demographics or healthcare needs, providing insights into the factors driving cost variations. Policy Implications: The findings can inform policy decisions related to pricing regulations, reimbursement rates, and resource allocation.

Conclusion: By applying K-Means clustering to the healthcare cost data, we have uncovered important insights into the factors driving variations in costs and identified potential areas for improvement. These findings can be used to inform evidence-based decision-making and promote more equitable healthcare access.

Hospital charge data

GAURANG

2024-09-17