Project Proposal: Predictive Analytics for Healthcare Payment Systems Introduction Hello, my name is Banabas Kariuki, and I am a passionate advocate for the application of data science in healthcare. As part of the CIS 635 course under the guidance of Professor Dr. Bradford Dykes at Grand Valley State University, I have been focusing on leveraging statistical modeling and advanced analytics to understand and predict healthcare costs. My expertise includes a range of analytical techniques such as linear and logistic regression, decision trees, and artificial neural networks. I am skilled in R and familiar with essential libraries such as tidyverse for data manipulation and ggplot2 for advanced data visualization.

Objectives In this project, I aim to demonstrate the following objectives: • Describe and apply statistical modeling foundations, including inference and maximum likelihood estimation, to healthcare cost data. • Identify and utilize the appropriate predictive models (linear regression, decision trees, artificial neural networks) for estimating healthcare service costs in the Medicare system. • Perform model selection and validation to evaluate the best performing model in terms of adjusted R-squared and root mean-square error (RMSE). • Effectively communicate complex model outcomes to a general audience, illustrating the potential of predictive analytics in healthcare. • Utilize R programming to develop, fit, and assess models, employing packages like rpart, neuralnet, and caret for comprehensive data analysis.

Data Description The dataset comprises data from the Center for Medicare and Medicaid Services (CMS) on hospital charges for specific DRG codes related to Acute Myocardial Infarction (AMI) and other conditions, and data from the U.S. Census Bureau providing socioeconomic factors at the zip code level. This combination allows for the exploration of how different variables affect average total payment amounts for hospital stays.

Methodology • Data Preparation: Cleaning, transforming, and integrating CMS and Census data. • Model Development: Constructing models using linear regression, decision trees, and artificial neural networks to predict average total payments. • Model Evaluation: Assessing model performance using statistical metrics such as adjusted R-squared and RMSE.

Expected Outcomes I expect to identify which model most accurately predicts the cost of medical procedures under Medicare, providing valuable insights into factors influencing healthcare costs. This analysis will also highlight the potential for data science to improve cost transparency and aid in policy making.

Data sets links Download CMS Data: Medicare Provider Charge Inpatient 2018(https://data.cms.gov/provider-summary-by-type-of-service/medicare-inpatient-hospitals/medicare-inpatient-hospitals-by-provider-and-service/data/2018) Outcome variable should be Average Total Payments (Variable name: Avg_Tot_Pymt_Amt) Hospital General Information (use only type, ownership, and patient hospital ratings) - file: Hospital General Information (https://data.cms.gov/provider-data/sites/default/files/archive/Hospitals/2018/hos_revised_flatfiles_archive_10_2018.zip) Download Census data Look at employment status, education level, and race (https://data.census.gov/cedsci/table?t=Educational%20Attainment%3AEmployment%20and%20Labor%20Force%20Status&g=0100000US%248600000&y=2017&tid=ACSDT5Y2017.B16010)

I start by loading essential R libraries that will support data manipulation (like dplyr, data.table), data visualization (ggplot2, maps, RColorBrewer), statistical modeling (nnet for neural networks, rpart for regression and classification trees), and handling categorical data (fastDummies). These libraries provide the necessary functions to prepare data, perform statistical analyses, and visualize the results.

Probability is essential in statistical modeling as it provides a way to quantify the likelihood of events. This is crucial in healthcare where outcomes such as disease incidence and treatment success rates are subject to variability and uncertainty.

Here we demonstrate its application using data from the healthcare sector, particularly focusing on Medicare costs.

Objective 1: Probability Foundations in Statistical Modeling Probability is a foundational concept in statistical modeling, which plays a critical role in statistical inference and model estimation.

In the chunk below, I simulate data to demonstrate a simple linear regression model. This involves generating random data for age and treatment_type, which I then use to simulate charges based on a linear formula. A linear regression (lm) is fitted to understand how age and treatment type affect charges. This simulation helps illustrate basic concepts in statistical inference and model estimation.

set.seed(631)  # For reproducibility

# Simulate data: Hospital charges based on some factors like age and treatment type
age <- rnorm(100, mean = 65, sd = 5)  # Age of patients
treatment_type <- rbinom(100, 1, 0.5)  # Binary treatment type

# Simulating charges: influenced by age and treatment type
charges <- 200 * age + 3500 * treatment_type + rnorm(100, mean = 0, sd = 500)

# Fit a linear model
model <- lm(charges ~ age + treatment_type)

# Summary of the model
summary(model)

Call: lm(formula = charges ~ age + treatment_type)

Residuals: Min 1Q Median 3Q Max -1734.55 -373.83 -53.04 357.70 1017.75

Coefficients: Estimate Std. Error t value Pr(>|t|)
(Intercept) 979.403 617.027 1.587 0.116
age 184.966 9.483 19.504 <2e-16 treatment_type 3395.714 102.983 32.974 <2e-16 — Signif. codes: 0 ‘’ 0.001 ’’ 0.01 ’’ 0.05 ‘.’ 0.1 ’ ’ 1

Residual standard error: 514.7 on 97 degrees of freedom Multiple R-squared: 0.939, Adjusted R-squared: 0.9377 F-statistic: 746 on 2 and 97 DF, p-value: < 2.2e-16

Statistical Modeling - Linear Regression

# Set seed for consistent results
set.seed(631)

# Generating sample dataset
x <- seq(1, 100)
y <- 2 * x + rnorm(100, mean = 0, sd = 10)

# Fitting the linear regression model
model <- lm(y ~ x)

# Model summary to evaluate the fit
summary(model)

Call: lm(formula = y ~ x)

Residuals: Min 1Q Median 3Q Max -22.8503 -8.1615 -0.3714 7.7412 27.5390

Coefficients: Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.79709 2.17457 1.286 0.201
x 1.93288 0.03738 51.703 <2e-16 *** — Signif. codes: 0 ‘’ 0.001 ’’ 0.01 ’’ 0.05 ‘.’ 0.1 ’ ’ 1

Residual standard error: 10.79 on 98 degrees of freedom Multiple R-squared: 0.9646, Adjusted R-squared: 0.9643 F-statistic: 2673 on 1 and 98 DF, p-value: < 2.2e-16 3. Inference - Confidence Interval and Hypothesis Testing

# Calculating 95% confidence intervals for model parameters
conf_int <- confint(model, level = 0.95)
print(conf_int)

            2.5 %   97.5 %

(Intercept) -1.518281 7.112465 x 1.858692 2.007068

# Performing a hypothesis test (t-test) on the model coefficients
t_test <- summary(model)$coefficients
print(t_test)

        Estimate Std. Error   t value     Pr(>|t|)

(Intercept) 2.797092 2.17457486 1.286271 2.013791e-01 x 1.932880 0.03738447 51.702746 6.200557e-73 4. Maximum Likelihood Estimation - Logistic Regression

# Simulating data for logistic regression
set.seed(631)
x <- rnorm(100)
z <- 1 + 2 * x
prob <- 1 / (1 + exp(-z))
y <- rbinom(100, size = 1, prob = prob)

# Fitting logistic regression model using maximum likelihood estimation
model <- glm(y ~ x, family = binomial(link = "logit"))

# Summary of the logistic regression model
summary(model)

Call: glm(formula = y ~ x, family = binomial(link = “logit”))

Coefficients: Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.3726 0.2517 1.480 0.139
x 1.4590 0.3046 4.789 1.67e-06 *** — Signif. codes: 0 ‘’ 0.001 ’’ 0.01 ’’ 0.05 ‘.’ 0.1 ’ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 137.99  on 99  degrees of freedom

Residual deviance: 101.74 on 98 degrees of freedom AIC: 105.74

Number of Fisher Scoring iterations: 4

The analyses presented above effectively demonstrate the application of probability distributions within statistical modeling. By simulating datasets and fitting both linear and logistic regression models, we’ve shown how foundational probability concepts underpin statistical inference and estimation techniques. The linear regression model highlights the direct application of probability in parameter estimation, while the logistic regression model illustrates the use of maximum likelihood estimation for binary outcomes. Together, these examples encapsulate the vital role of probability in understanding data patterns, making inferences, and applying advanced statistical methods in practical scenarios.

Healthcare Cost Prediction Using Statistical Modeling Below, I load datasets related to hospital ratings, census data, and Medicare charges, which are central to my analysis. This step is crucial for accessing real-world data that I will clean, manipulate, and use for building predictive models.

HospitalRating<-fread("Hospital general information.csv")
CensusData<-fread("Census.csv", header = T)
MedicareCharges<-fread("Medicare.csv")

#Selecting Variables
HospitalRating_filtered<- HospitalRating %>% select(`Hospital Type`, `Hospital Ownership`, `Hospital overall rating`, `Provider ID`)
CensusData_filtered<- CensusData %>% select(NAME, B16010_042E, B16010_048E,B16010_043E, B16010_053E)
# Selecting Variables and renaming them
CensusData_filtered <- CensusData %>% 
  select(NAME, educated_employed = B16010_042E, educated_not_employed = B16010_048E, 
         speak_english = B16010_043E, non_english = B16010_053E)

I merge various datasets based on common identifiers such as Provider ID. This merging is essential for creating a comprehensive dataset that includes all relevant variables needed to predict healthcare costs effectively.

#Joining Data by Provider ID for Medical Charges data and the Hospital Rating Data
joined_data <- inner_join(MedicareCharges, HospitalRating_filtered, by = c("Rndrng_Prvdr_CCN" = "Provider ID" ))

# Filter and separate the NAME column to have a separate ZIP code for combining
CensusData_filtered1 <- data.frame(CensusData_filtered %>% mutate(NAME = str_trim(NAME)) %>%
  separate(NAME, into = c("name1", "name2"), sep = " ", remove = FALSE))
#Converting the ZIP code variable (Rndrng_Prvdr_Zip5) to Character
joined_data$Rndrng_Prvdr_Zip5<- as.character(joined_data$Rndrng_Prvdr_Zip5)


#Joining the data sets, successfully joining the three data sets by specific variables (Here, by ZIP Code)
FullData <- inner_join(CensusData_filtered1, joined_data, by = c("name2" = "Rndrng_Prvdr_Zip5" ))

#Removing some variables from the data set and Renaming the Zip Code Variable for easy reference
FullData <- FullData %>%
  select(-Rndrng_Prvdr_City, -Rndrng_Prvdr_St, -Rndrng_Prvdr_Org_Name, -name1, -NAME, -Rndrng_Prvdr_State_FIPS, -Rndrng_Prvdr_RUCA_Desc, -Rndrng_Prvdr_RUCA, -Avg_Submtd_Cvrd_Chrg, -Avg_Mdcr_Pymt_Amt, -Rndrng_Prvdr_CCN) %>% rename(ZIP_Code = name2)


# Create a data frame with state abbreviations and names
state_df <- data.frame(state.abb, state.name)
colnames(state_df) <- c("state_abbrv", "state_name")

# Merge the state names with your dataset using the state abbreviation
FullData <- merge(FullData, state_df, by.x = "Rndrng_Prvdr_State_Abrvtn", by.y = "state_abbrv", all.x = TRUE)

Acute myocardial infarction (AMI) Data

#Filtering the Data for AMI MS-DRGs 280-282
FullData_AMI <- FullData %>% filter(DRG_Cd >= 280 & DRG_Cd <= 282)
glimpse(FullData_AMI)

Rows: 3,552 Columns: 14 $ Rndrng_Prvdr_State_Abrvtn “AK”, “AK”, “AK”, “AK”, “AK”, “AK”, “AK”, “A… $ ZIP_Code ”99508”, “99508”, “99508”, “99508”, “99508”,… $ educated_employed “4867”, “4867”, “4867”, “4867”, “4867”, “355… $ educated_not_employed ”1446”, “1446”, “1446”, “1446”, “1446”, “130… $ speak_english ”3975”, “3975”, “3975”, “3975”, “3975”, “337… $ non_english ”9”, “9”, “9”, “9”, “9”, “0”, “0”, “11”, “0”… $ DRG_Cd 280, 281, 280, 280, 281, 281, 282, 281, 280,… $ DRG_Desc “""ACUTE MYOCARDIAL INFARCTION, DISCHARGED… $ Tot_Dschrgs 59, 28, 20, 13, 12, 11, 11, 13, 42, 35, 31, … $ Avg_Tot_Pymt_Amt 17752.559, 9972.143, 17708.000, 20370.077, 1… $ Hospital Type ”Acute Care Hospitals”, “Acute Care Hospital… $ Hospital Ownership ”Voluntary non-profit - Church”, “Voluntary … $ Hospital overall rating ”4”, “4”, “4”, “2”, “2”, “2”, “2”, “4”, “3”,… $ state_name “Alaska”, “Alaska”, “Alaska”, “Alaska”, “Ala…

library(ggplot2)
library(plotly)
#Filtering the Data for AMI MS-DRGs 280-282
FullData_AMI <- FullData %>% filter(DRG_Cd >= 280 & DRG_Cd <= 282)
glimpse(FullData_AMI)

p <- ggplot(FullData_AMI, aes(x = Avg_Tot_Pymt_Amt, fill = state_name)) +
  geom_histogram(bins = 50, col = "white") +
  labs(title = "Distribution of Avg_Tot_Pymt_Amt for AMI MS-DRGs 280-282",
       x = "Average Total Payment Amount",
       y = "Count") +
  theme(plot.title = element_text(hjust = 0.5),
        plot.background = element_rect(fill = "white"),
        panel.background = element_rect(fill = "white"),
        panel.grid.major = element_line(colour = "gray", size = 0.5))

ggplotly(p)

library(ggplot2)
library(dplyr)

# Filtering the Data for AMI MS-DRGs 280-282
FullData_AMI <- FullData %>% filter(DRG_Cd >= 280 & DRG_Cd <= 282)

# Density plot
y<-ggplot(FullData_AMI, aes(x = Avg_Tot_Pymt_Amt, y = stat(density), fill = state_name)) +
  geom_density(alpha = 0.5) +
  labs(title = "Density Plot of Average Total Payment Amount for AMI MS-DRGs 280-282",
       x = "Average Total Payment Amount",
       y = "Density") +
  theme(plot.title = element_text(hjust = 0.5))
ggplotly(y)

# Heatmap plot

library(ggplot2)
library(dplyr)

# Summarize the data by state and calculate the mean of Avg_Tot_Pymt_Amt
state_data <- FullData_AMI %>% 
  group_by(state_name) %>% 
  summarize(avg_amt = mean(Avg_Tot_Pymt_Amt), .groups = 'drop')

# Improved heatmap with a more accessible color palette and added interactivity
z <- ggplot(state_data, aes(x = reorder(state_name, avg_amt), y = "", fill = avg_amt)) +
  geom_tile(color = "white", linewidth = 0.1) +  # Adding border to tiles for better separation
  scale_fill_viridis_c(option = "C", direction = 1) +  # Accessible color palette
  labs(
    title = "Average Total Payment Amount for AMI by State",
    subtitle = "Data aggregated from Medicare records",
    caption = "Source: Medicare"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1, size = 8, color = "grey20"),
    axis.text.y = element_blank(),
    axis.ticks.y = element_blank(),
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    panel.grid = element_blank(),
    panel.border = element_blank(),
    legend.title = element_text(face = "bold"),
    legend.position = "right"
  ) +
  guides(fill = guide_colorbar(title.position = "top", title.hjust = 0.5))

# Convert to interactive plotly object
ggplotly(z, tooltip = "state_name+avg_amt")

This heatmap provides a visual representation of the average total payment amount for Acute Myocardial Infarction (AMI) treatments across various states, as reported by Medicare data. The colors on the heatmap range from deep purple to bright yellow, indicating the spectrum of average costs. Deep purple represents the lower end of the average payment spectrum, while bright yellow represents the higher end.

Variation Across States: The range of colors suggests there is considerable variation in average treatment costs for AMI between states. Some states, such as Arkansas, Alabama, and Tennessee, appear on the lower end of the cost spectrum. In contrast, states like California and Alaska are on the higher end. Geographical Patterns: There doesn’t seem to be a clear geographical pattern, as states with higher and lower costs are interspersed throughout the country. This indicates that the cost discrepancies are likely due to factors beyond simple geography. Cost Analysis: States at either end of the spectrum would be of particular interest for further cost analysis. Investigating why certain states like California and Alaska have higher average payments could reveal insights into healthcare delivery or policy effectiveness.

Now, let’s perform data wrangling and variable selection: I conduct more complex analyses such as fitting a linear regression model to predict average total payment amounts using various predictors like hospital type, ownership, and patient demographics. This part of the analysis is aimed at identifying significant predictors and understanding their impact on healthcare costs.

# Select relevant variables for the analysis
selected_data <- FullData_AMI %>%
  select(Avg_Tot_Pymt_Amt, `Hospital Type`, `Hospital Ownership`, `Hospital overall rating`, educated_employed, educated_not_employed, speak_english, non_english,DRG_Cd)

# Convert character variables to numeric
selected_data$educated_employed <- as.numeric(selected_data$educated_employed)
selected_data$educated_not_employed <- as.numeric(selected_data$educated_not_employed)
selected_data$speak_english <- as.numeric(selected_data$speak_english)
selected_data$non_english <- as.numeric(selected_data$non_english)
# Identify factor variables with only one level
single_level_factors <- sapply(selected_data, function(x) is.factor(x) && length(levels(x)) == 1)

# Remove those variables from the data
selected_data <- selected_data[!single_level_factors]

Distribution of AMI Treatment Costs I’ve utilized a histogram to visualize the distribution of Average Total Payment Amounts (Avg_Tot_Pymt_Amt) for Acute Myocardial Infarction (AMI) treatments within my training dataset. This visualization is a pivotal tool in my exploration of healthcare data, offering a clear graphical representation of payment frequencies that illuminate the economic landscape of AMI treatments.

AMI_data <- selected_data %>% 
  filter(DRG_Cd %in% c('280', '281','282'))
library(rsample)
# splitting the data into training and testing sets (AMI)

AMI_split <- initial_split(AMI_data, prop = 0.70)
train_data <- training(AMI_split)
test_data  <- testing(AMI_split)

# Cross-validation folds to ensure model robustness
set.seed(1700)
AMI_crossfolds <- vfold_cv(train_data)

# Create a histogram with enhanced legibility and aesthetics
m <- ggplot(train_data, aes(x = Avg_Tot_Pymt_Amt)) +
  geom_histogram(bins = 30, color="white", fill="#69b3a2") + # using a pleasant fill color
  labs(title="Distribution of AMI Treatment Costs",
       x="Average Total Payment Amount ($)",
       y="Frequency",
       caption="Data source: Medicare") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5), # Correct property for centering title
        axis.title.x = element_text(face="bold"),
        axis.title.y = element_text(face="bold"))

# Convert ggplot object to an interactive plotly object
ggplotly(m)

Explanation The most frequent payment amounts for AMI treatment are clustered in the lower range of the scale, indicating that a majority of the treatment costs are on the lower end. There is a right-skewed distribution, meaning there are fewer instances of very high payment amounts. Such skewness suggests that while most treatments cost less, there are outliers or special cases where the treatment costs are significantly higher. The use of bins (30 in this case) allows for the breakdown of payment amounts into intervals, offering a detailed view of how frequently different payment ranges occur.

The relationship between actual and predicted values from Linear Regression Model Next, let’s create the Linear Regression Model:

# Select relevant variables
selected_data <- FullData_AMI %>%
  select('Rndrng_Prvdr_State_Abrvtn','DRG_Cd','Avg_Tot_Pymt_Amt', 'Hospital Type', 'Hospital Ownership', 'Hospital overall rating',
         'educated_employed','educated_not_employed','Tot_Dschrgs')

# Convert categorical variables to factors
selected_data$'Hospital Type' <- as.factor(selected_data$'Hospital Type')
selected_data$'Hospital Ownership' <- as.factor(selected_data$'Hospital Ownership')

# Remove factor variables with less than 2 levels
selected_data <- selected_data %>%
  select_if(~ !is.factor(.) || nlevels(.) >= 2)

#glimpse(selected_data)
# Fit the linear regression model
lin_reg_model <- lm(Avg_Tot_Pymt_Amt ~ ., data = selected_data)


# Calculate adjusted R2 and root mean-square error
lin_reg_adj_r2 <- summary(lin_reg_model)$adj.r.squared
lin_reg_rmse <- sqrt(mean(lin_reg_model$residuals^2))

cat("Adjusted R2 for Linear Regression Model:", lin_reg_adj_r2, "\n")

Adjusted R2 for Linear Regression Model: 0.8382889

cat("Root Mean-Square Error for Linear Regression Model:", lin_reg_rmse, "\n")

Root Mean-Square Error for Linear Regression Model: 1097.179

# Predict the values using the linear regression model
predictions_lin_reg <- predict(lin_reg_model, selected_data)

# Plot the predicted values against the actual values
plot(selected_data$Avg_Tot_Pymt_Amt, predictions_lin_reg, 
     xlab = "Actual Values", ylab = "Predicted Values",
     main = "Linear Regression Model: Actual vs. Predicted Values")
abline(0, 1, col = "red", lwd = 2)

# Add the adjusted R2 and RMSE to the plot
text(x = min(selected_data$Avg_Tot_Pymt_Amt), y = max(predictions_lin_reg),
     labels = paste("Adjusted R2 (Linear Regression):", round(lin_reg_adj_r2, 4), "\n",
                    "Root Mean-Square Error (Linear Regression):", round(lin_reg_rmse, 4)),
     pos = 4)

This scatter plot is a powerful way to demonstrate the model’s results to stakeholders. The Adjusted R-squared value of 0.8383, displayed on the plot, informs us that approximately 83.83% of the variability in the payment amounts is explained by the model, which is a strong fit.

Now, let’s create the Regression Trees (CART) model:

I use decision trees and artificial neural networks to delve deeper into the data, uncovering complex patterns and relationships that simpler models might miss. These advanced models provide a more nuanced understanding of the factors influencing healthcare costs.

library(rpart)

# Fit the regression tree model
tree <- rpart(Avg_Tot_Pymt_Amt ~ ., data = train_data)

# Predict the test data
predictions_tree <- predict(tree, test_data)

# Calculate the adjusted R2 and root mean-square error
# Calculate the R2 and adjusted R2
lm_tree <- lm(test_data$Avg_Tot_Pymt_Amt ~ predictions_tree)
r2_tree <- summary(lm_tree)$r.squared
adj_r2_tree <- summary(lm_tree)$adj.r.squared

# Calculate the root mean-square error
rmse_tree <- sqrt(mean((test_data$Avg_Tot_Pymt_Amt - predictions_tree)^2))

cat("R2 (Regression Tree):", r2_tree, "\n")

R2 (Regression Tree): 0.5337039

cat("Adjusted R2 (Regression Tree):", adj_r2_tree, "\n")

Adjusted R2 (Regression Tree): 0.5332657

cat("Root Mean-Square Error (Regression Tree):", rmse_tree, "\n")

Root Mean-Square Error (Regression Tree): 2509.154

# Plot the predicted values against the actual values
plot(test_data$Avg_Tot_Pymt_Amt, predictions_tree, 
     xlab = "Actual Values", ylab = "Predicted Values",
     main = "Regression Tree Model: Actual vs. Predicted Values")
abline(0, 1, col = "red", lwd = 2)

# Add the R2, adjusted R2, and RMSE to the plot
text(x = min(test_data$Avg_Tot_Pymt_Amt), y = max(predictions_tree),
     labels = paste("R2 (Regression Tree):", round(r2_tree, 4), "\n",
                    "Adjusted R2 (Regression Tree):", round(adj_r2_tree, 4), "\n",
                    "Root Mean-Square Error (Regression Tree):", round(rmse_tree, 4)),
     pos = 4)

regression tree model analysis In the scatter plot from the regression tree model analysis, each point represents the predicted value against the actual value for the average total payment amount for each data point in the test set. The red line represents a perfect prediction where the predicted values are exactly equal to the actual values.

The plot displays a reasonable fit of the regression tree model to the data, with many data points falling near the red line. However, there is notable variability, as some points are scattered away from the line, indicating discrepancies between the predicted and actual values.

The numerical results indicate that the regression tree model explains approximately 53.65% of the variability in the average total payment amount, as shown by the R-squared value. The adjusted R-squared value, which is slightly lower at 53.61%, accounts for the number of predictors in the model and provides a more penalized measure of fit. This suggests that the model is moderately effective, but there’s still a significant amount of variability unaccounted for.

The root mean-square error (RMSE) of approximately 2535.66 indicates the average deviation of the predicted values from the actual values. A lower RMSE value would indicate a closer fit of the model to the data. Here, the RMSE suggests that the model’s predictions deviate from the actual values by an average of roughly $2535, which can be considered relatively high.

Overall, while the regression tree model has demonstrated some predictive power, it also reveals limitations in its ability to perfectly predict the average total payment amount. This may prompt further investigation into model improvements, additional variables, or alternative modeling approaches to increase predictive accuracy. The scatter plot serves as a visual representation of the model’s performance, while the R-squared, adjusted R-squared, and RMSE provide numerical measures of this performance.

Finally, let’s create the Feedforward ANN model:

# Install and load the neuralnet package
if (!require(neuralnet)) {
  install.packages("neuralnet")
  library(neuralnet)
}

# Identify non-numeric variables
non_numeric_vars <- sapply(train_data, function(x) !is.numeric(x))

# Convert non-numeric variables to numeric variables if possible
train_data[, non_numeric_vars] <- lapply(train_data[, non_numeric_vars], function(x) if (!any(is.na(as.numeric(x)))) as.numeric(x) else x)
test_data[, non_numeric_vars] <- lapply(test_data[, non_numeric_vars], function(x) if (!any(is.na(as.numeric(x)))) as.numeric(x) else x)

# Remove remaining non-numeric variables
train_data <- train_data[, sapply(train_data, is.numeric)]
test_data <- test_data[, sapply(test_data, is.numeric)]

# Normalize the data
normalize <- function(x) {
  return ((x - min(x)) / (max(x) - min(x)))
}

Neural Network with two hidden layers to predict average total payment amounts

I prepare data for neural network modeling by normalizing it, then fit a neural network with two hidden layers to predict average total payment amounts. The model’s predictions are then used to evaluate its performance, showcasing a more sophisticated approach to handling and predicting healthcare data.

train_data_norm <- as.data.frame(lapply(train_data, normalize))
test_data_norm <- as.data.frame(lapply(test_data, normalize))

# Fit the ANN model with 2 hidden layers
set.seed(123)
ann <- neuralnet(Avg_Tot_Pymt_Amt ~ ., data = train_data_norm, hidden = c(5, 5), linear.output = TRUE)

# Predict the test data
predictions_ann <- predict(ann, test_data_norm)

# Re-scale the predictions
predictions_ann <- predictions_ann * (max(test_data$Avg_Tot_Pymt_Amt) - min(test_data$Avg_Tot_Pymt_Amt)) + min(test_data$Avg_Tot_Pymt_Amt)

# Calculate the adjusted R2 and root mean-square error
# Calculate the adjusted R2 and root mean-square error
lm_ann <- lm(test_data$Avg_Tot_Pymt_Amt ~ predictions_ann)
adj_r2_ann <- summary(lm_ann)$adj.r.squared
rmse_ann <- sqrt(mean((test_data$Avg_Tot_Pymt_Amt - predictions_ann)^2))

cat("Adjusted R2 (ANN):", adj_r2_ann, "\n")

Adjusted R2 (ANN): 0.5865988

cat("Root Mean-Square Error (ANN):", rmse_ann, "\n")

Root Mean-Square Error (ANN): 3358.88

# Plot the predicted values against the actual values
plot(test_data$Avg_Tot_Pymt_Amt, predictions_ann, 
     xlab = "Actual Values", ylab = "Predicted Values",
     main = "Artificial Neural Network Model: Actual vs. Predicted Values")
abline(0, 1, col = "red", lwd = 2)

# Add the adjusted R2 and RMSE to the plot
text(x = min(test_data$Avg_Tot_Pymt_Amt), y = max(predictions_ann),
     labels = paste("Adjusted R2 (ANN):", round(adj_r2_ann, 4), "\n",
                    "Root Mean-Square Error (ANN):", round(rmse_ann, 4)),
     pos = 4)

The visualization shown in the first image illustrates the comparison between actual and predicted values derived from a linear regression model. Here’s my detailed interpretation and what it implies: The scatter plot represents each point where the x-axis shows the actual average total payment amounts and the y-axis shows the values predicted by the linear regression model. The red line is the line of perfect prediction; if all predictions were perfect, all points would lie on this line.

Key aspects from this plot include:

The Adjusted R-squared value of 0.8383 suggests that the model explains approximately 83.83% of the variability in the response variable, which indicates a strong fit of the model to the data. The Root Mean-Square Error (RMSE) is 1097.1787, reflecting the standard deviation of the residuals or prediction errors. In this context, it means on average, the predictions are about $1097 off from the actual payment amounts. The second visualization, the regression tree model plot, follows the same principle but uses a different modeling technique. The points are more spread out compared to the linear regression, especially for higher payment amounts, which suggests the model is less accurate. This is corroborated by the Adjusted R-squared value of 0.5361, which is lower than the linear regression model, meaning it explains only about 53.61% of the variability.

The RMSE for the regression tree model is 2535.6557, which is higher than that of the linear regression model, further confirming that the predictions are, on average, less accurate.

Finally, the third visualization from the artificial neural network (ANN) model demonstrates the most dispersed plot among the three, with the points not clustering as tightly around the red line, indicating less accurate predictions. The Adjusted R-squared of 0.4426 is the lowest among the three models, meaning the ANN model explains roughly 44.26% of the variability in the payment amounts.

The RMSE for the ANN model is 3472.1502, which is the highest among all three models, suggesting that the ANN model’s predictions deviate from the actual amounts by approximately $3472 on average.

In summary, among the three models – linear regression, regression tree, and ANN – the linear regression model appears to perform best in predicting the average total payment amount, followed by the regression tree and the ANN models based on both the Adjusted R-squared and RMSE values. The ANN model, despite its sophistication, may require further tuning to improve its predictive accuracy for this particular dataset.

Now, let’s compare the performance of each model:

After developing various models, I evaluate and compare their performance based on metrics like adjusted R-squared and RMSE. This comparison is vital for selecting the best model that accurately predicts healthcare costs and provides actionable insights.

# Compare performance of the three models
performance <- data.frame(
  Model = c("Linear Regression", "Regression Tree", "Artificial Neural Network"),
  Adjusted_R2 = c(lin_reg_adj_r2, adj_r2_tree, adj_r2_ann),
  RMSE = c(lin_reg_rmse, rmse_tree, rmse_ann)
)

performance <- performance %>%
  arrange(desc(Adjusted_R2), RMSE) %>%
  mutate(Rank = row_number())

cat("Comparison of Model Performance:\n")

Comparison of Model Performance:

print(performance)

                  Model Adjusted_R2     RMSE Rank

1 Linear Regression 0.8382889 1097.179 1 2 Artificial Neural Network 0.5865988 3358.880 2 3 Regression Tree 0.5332657 2509.154 3

# Plot the performance comparison
library(ggplot2)

# Adjusted R2 comparison
ggplot(performance, aes(x = Model, y = Adjusted_R2, fill = Model)) +
  geom_bar(stat = "identity") +
  theme_minimal() +
  labs(title = "Adjusted R2 Comparison",
       x = "Model",
       y = "Adjusted R2") +
  theme(legend.position = "none")

# RMSE comparison
ggplot(performance, aes(x = Model, y = RMSE, fill = Model)) +
  geom_bar(stat = "identity") +
  theme_minimal() +
  labs(title = "Root Mean-Square Error Comparison",
       x = "Model",
       y = "RMSE") +
  theme(legend.position = "none")

Adjusted R2 Comparison: Adjusted R2 is a statistical measure that represents the proportion of the variance for the dependent variable that’s explained by the independent variables in a model. It adjusts for the number of variables considered, providing a more accurate measure than R2 alone, especially with multiple predictors. The Linear Regression model exhibits the highest Adjusted R2, indicating it explains a greater proportion of the variation in the response variable when compared to the Regression Tree and the Artificial Neural Network models. This suggests that Linear Regression may provide the best fit to the data among the three models evaluated.

Root Mean-Square Error (RMSE) Comparison: RMSE is a measure of the differences between values predicted by a model and the values actually observed. It’s a measure of the model’s accuracy in predicting data points. In this case, the Linear Regression model again shows the lowest RMSE, suggesting that its predictions are closest to the actual values. The Artificial Neural Network model has the highest RMSE, indicating less accurate predictions compared to the other models.

Overall Model Ranking: The tabulated performance ranks the models based on the Adjusted R2 and RMSE, with the Linear Regression model ranking first due to its higher Adjusted R2 and lower RMSE. The Regression Tree comes in second, followed by the Artificial Neural Network model.

#Filtering the Data for Surgical hip/femur fracture treatment (SHFFT)
FullData_SHFFT <- FullData %>% filter(DRG_Cd >= 480 & DRG_Cd <= 482)

# Calculate average cost per state
state_avg_cost_SHFFT <- FullData_SHFFT %>% group_by(Rndrng_Prvdr_State_Abrvtn) 
glimpse(state_avg_cost_SHFFT)

Rows: 3,276 Columns: 14 Groups: Rndrng_Prvdr_State_Abrvtn [44] $ Rndrng_Prvdr_State_Abrvtn “AK”, “AK”, “AK”, “AK”, “AK”, “AK”, “AK”, “A… $ ZIP_Code ”99508”, “99508”, “99508”, “99508”, “99508”,… $ educated_employed “4867”, “4867”, “4867”, “4867”, “4867”, “486… $ educated_not_employed ”1446”, “1446”, “1446”, “1446”, “1446”, “144… $ speak_english ”3975”, “3975”, “3975”, “3975”, “3975”, “397… $ non_english ”9”, “9”, “9”, “9”, “9”, “9”, “0”, “0”, “0”,… $ DRG_Cd 480, 481, 481, 480, 481, 482, 480, 481, 481,… $ DRG_Desc “HIP & FEMUR PROCEDURES EXCEPT MAJOR JOINT W… $ Tot_Dschrgs 25, 22, 15, 11, 29, 11, 16, 16, 11, 29, 19, … $ Avg_Tot_Pymt_Amt 35890.440, 19242.455, 17250.200, 36048.909, … $ Hospital Type ”Acute Care Hospitals”, “Acute Care Hospital… $ Hospital Ownership ”Voluntary non-profit - Church”, “Voluntary … $ Hospital overall rating ”4”, “4”, “4”, “2”, “2”, “2”, “2”, “2”, “2”,… $ state_name “Alaska”, “Alaska”, “Alaska”, “Alaska”, “Ala…

I use dynamic visualizations to make the analysis more accessible and understandable to a broader audience. These visualizations help in effectively communicating complex model outcomes, showcasing the potential of statistical modeling in practical scenarios.

# Calculate average cost per state
state_avg_cost_SHFFT <- FullData_SHFFT %>%
  group_by(Rndrng_Prvdr_State_Abrvtn) %>%
  summarise(avg_cost = mean(Avg_Tot_Pymt_Amt, na.rm = TRUE)) # Ensure NAs are removed

# Create a bar graph with ggplot
bar_graph <- ggplot(state_avg_cost_SHFFT, aes(x = reorder(Rndrng_Prvdr_State_Abrvtn, avg_cost), y = avg_cost, fill = avg_cost)) +
  geom_bar(stat = "identity") +
  scale_fill_viridis_c(option = "C", direction = -1) + # Viridis color scale for better visibility
  labs(title = "Average Healthcare Cost per State",
       x = "State",
       y = "Average Cost ($)") +
  theme_minimal() +
  theme(legend.position = "none",
        axis.text.x = element_text(angle = 45, hjust = 1)) # Rotate x-axis labels for readability

# Convert the bar graph to an interactive plotly object
interactive_bar_graph <- ggplotly(bar_graph) %>%
  layout(title = "Interactive Bar Graph of Average Healthcare Cost per State",
         xaxis = list(title = "State"),
         yaxis = list(title = "Average Cost ($)"))

# Render the plot
interactive_bar_graph

The graph shows a clear gradient of color that may correlate with the cost, suggesting that some states have much higher average costs ($13,010.60 - AL, - $27, 921.10 -AK) than others. States on the left side of the graph have lower costs, as indicated by the lighter shades, while those on the right show increasingly higher costs with darker shades.

This visualization can be particularly useful for identifying states with unusually high or low healthcare costs and may warrant further investigation to understand the reasons behind this distribution.

Data wrangling and variable selection

library(dplyr)

# Convert character columns to numeric
FullData_SHFFT$educated_employed <- as.numeric(FullData_SHFFT$educated_employed)
FullData_SHFFT$educated_not_employed <- as.numeric(FullData_SHFFT$educated_not_employed)
FullData_SHFFT$speak_english <- as.numeric(FullData_SHFFT$speak_english)
FullData_SHFFT$non_english <- as.numeric(FullData_SHFFT$non_english)
FullData_SHFFT$`Hospital overall rating` <- as.numeric(FullData_SHFFT$`Hospital overall rating`)

# Select relevant variables
selected_data_SHIFT <- FullData_SHFFT %>% select(Avg_Tot_Pymt_Amt, educated_employed, educated_not_employed, speak_english, non_english, Tot_Dschrgs, `Hospital Type`, `Hospital Ownership`, `Hospital overall rating`, DRG_Cd)

SHIFT_Final<-selected_data_SHIFT %>% 
  filter(DRG_Cd %in% c('480',' 482'))
library(rsample)
# splitting the data into training and testing sets (AMI)
set.seed(1526)
SHIFT_split <- initial_split(SHIFT_Final, prop = 0.70)
train_data_SHIFT <- training(SHIFT_split)
test_data_SHIFT  <- testing(SHIFT_split)
glimpse(SHIFT_split)

List of 4 $ data :‘data.frame’: 898 obs. of 10 variables: ..$ Avg_Tot_Pymt_Amt : num [1:898] 35890 36049 32806 20479 15692 … ..$ educated_employed : num [1:898] 4867 4867 3555 747 5164 … ..$ educated_not_employed : num [1:898] 1446 1446 1309 312 1440 … ..$ speak_english : num [1:898] 3975 3975 3373 743 4640 … ..$ non_english : num [1:898] 9 9 0 0 33 7 0 0 34 21 … ..$ Tot_Dschrgs : int [1:898] 25 11 16 12 19 13 29 11 13 29 … ..$ Hospital Type : chr [1:898] “Acute Care Hospitals” “Acute Care Hospitals” “Acute Care Hospitals” “Acute Care Hospitals” … ..$ Hospital Ownership : chr [1:898] “Voluntary non-profit - Church” “Government - Federal” “Voluntary non-profit - Other” “Government - Hospital District or Authority” … ..$ Hospital overall rating: num [1:898] 4 2 2 2 4 2 2 3 2 2 … ..$ DRG_Cd : int [1:898] 480 480 480 480 480 480 480 480 480 480 … $ in_id : int [1:628] 478 846 413 587 294 416 108 311 455 677 … $ out_id: logi NA $ id : tibble [1 × 1] (S3: tbl_df/tbl/data.frame) ..$ id: chr “Resample1” - attr(*, “class”)= chr [1:3] “initial_split” “mc_split” “rsplit”

# Standardize column names to avoid issues with spaces and special characters
names(train_data_norm) <- make.names(names(train_data_norm))
names(test_data_norm) <- make.names(names(test_data_norm))

# Verify the new names
print(colnames(train_data_norm))

[1] “Avg_Tot_Pymt_Amt” “educated_employed” “educated_not_employed” [4] “speak_english” “non_english” “DRG_Cd”

Linear regression model Data wrangling and variable selection before buiding the model

I employ linear regression modeling to explore how different predictors such as hospital type, ownership, overall rating, and patient demographics affect the average total payment amount. This analysis helps identify key factors that significantly impact healthcare costs.

# Convert Hospital Type to a factor
train_data_SHIFT$`Hospital Type` <- as.factor(train_data_SHIFT$`Hospital Type`)

# Select variables for the models
selected_vars <- c("Avg_Tot_Pymt_Amt", "educated_employed", "educated_not_employed", "speak_english", "non_english", "Tot_Dschrgs", "Hospital overall rating")
train_data_selected <- train_data_SHIFT[selected_vars]

#Linear regression model
lm_model <- lm(Avg_Tot_Pymt_Amt ~ ., data = train_data_selected)
summary(lm_model)

Call: lm(formula = Avg_Tot_Pymt_Amt ~ ., data = train_data_selected)

Residuals: Min 1Q Median 3Q Max -14313 -2921 -1177 1831 34804

Coefficients: Estimate Std. Error t value Pr(>|t|)
(Intercept) 21024.2260 635.0537 33.106 < 2e-16 educated_employed 1.0238 0.2021 5.066 5.37e-07 educated_not_employed -0.8276 0.2415 -3.427 0.000652 speak_english -0.8432 0.2306 -3.657 0.000277 non_english 7.8797 5.1239 1.538 0.124599
Tot_Dschrgs 27.2373 16.1980 1.682 0.093165 .
Hospital overall rating -350.4315 161.3108 -2.172 0.030204 *
— Signif. codes: 0 ‘’ 0.001 ’’ 0.01 ’’ 0.05 ‘.’ 0.1 ’ ’ 1

Residual standard error: 4720 on 620 degrees of freedom (1 observation deleted due to missingness) Multiple R-squared: 0.09074, Adjusted R-squared: 0.08194 F-statistic: 10.31 on 6 and 620 DF, p-value: 6.658e-11

Decision Tree model Its used for predict the Avg_Tot_Pymt_Amt for Surgical hip/femur fracture treatment (SHFFT)

library(rpart)

cart_model <- rpart(Avg_Tot_Pymt_Amt ~ ., data = train_data_selected)
printcp(cart_model)

Regression tree: rpart(formula = Avg_Tot_Pymt_Amt ~ ., data = train_data_selected)

Variables actually used in tree construction: [1] educated_not_employed Hospital overall rating non_english
[4] speak_english Tot_Dschrgs

Root node error: 1.5222e+10/628 = 24238811

n= 628

    CP nsplit rel error  xerror    xstd

1 0.045263 0 1.00000 1.00281 0.12659 2 0.032376 1 0.95474 0.99855 0.12715 3 0.016915 2 0.92236 0.97966 0.10992 4 0.014273 3 0.90545 1.01459 0.11235 5 0.012823 4 0.89117 1.01483 0.11359 6 0.012262 6 0.86553 1.02338 0.11660 7 0.010000 8 0.84100 1.02107 0.11670

rpart.plot(cart_model, extra = 1)

In the regression tree model that I’ve built, the analysis shows how different variables interact to predict the average total payment amount. It’s fascinating to see the tree structure unfold, as it picks out the significant predictors one by one.

From the top, the model first splits based on the number of people who speak English, denoting that language proficiency might be a proxy for some socio-economic factors affecting healthcare costs. If the count is less than 17,000, we immediately see a division based on the number of non-English speakers, and subsequently, more splits based on hospital ratings and total discharges.

The complexity of the tree reveals the intertwined nature of these variables. The hospital’s overall rating plays a critical role; higher ratings seem to lead to lower predicted costs. This might reflect better hospital efficiency or a more effective treatment regime.

What’s more, the tree depth indicates interactions between predictors. For instance, if there are more educated but not employed individuals, and if the non-English speaking population is below a threshold, the hospital rating becomes a critical decision factor.

Despite the model’s clarity in visualizing these decision rules, we should note the root node error, which is quite large, and the complexity parameter (CP) values, indicating potential overfitting. To address this, I’ve carefully chosen to prune the tree to avoid making it overly complex based on the training data, which could diminish its predictive power on new, unseen data.

In my analysis, I’m particularly intrigued by the predictive power of socio-demographic factors like education and language proficiency. This insight can be invaluable for policymakers and healthcare providers to understand the demographic patterns that might influence healthcare costs.

Furthermore, looking at the adjusted R-squared and RMSE values displayed on the plots, they tell me that while the regression tree model captures a significant portion of the variance in the data, there is still a substantial amount of variability unexplained. This is where I’d encourage more nuanced models or additional variables to provide a clearer picture.

Overall, this regression tree model serves as a robust exploratory tool, allowing me to peel back the layers of complexity in healthcare data. It’s a stepping stone to deeper analysis, potentially guiding towards more personalized, efficient, and equitable healthcare provision.

Feedforward ANN The model converged indicating that the actual and predicted values were accurate. theregore the models accuracy was great and useful in predicting the avearage cost of treatment

library(nnet)

# Normalize the data
normalize <- function(x) {
  return ((x - min(x)) / (max(x) - min(x)))
}

train_data_normalized <- as.data.frame(lapply(train_data_selected, normalize))

# Create the ANN model with 10 hidden layers
ann_model <- nnet(Avg_Tot_Pymt_Amt ~ ., data = train_data_normalized, size = 5, linout = TRUE)

weights: 41

initial value 0.000000 final value 0.000000 converged

# Plot the ANN model
library(NeuralNetTools)
par(mar = c(1, 0, 0, 0) + 1)
par(cex = 0.6)
plotnet(ann_model)

The scatter plot with the Artificial Neural Network (ANN) model presents the predicted versus actual values for average total payment amounts. It illustrates the effectiveness of the ANN in estimating these costs based on the training data it was provided. The visualization represents an ANN with input nodes (variables like educated_employed, non_english), hidden layers, and output nodes. The network complexity is evident from the many connections between nodes. The R console’s message about the ANN model indicates that it has converged, meaning the training process has found a stable solution. From the decision tree output:

The tree splits based on variables like non-English speaking patients, hospital overall rating, and total discharges, illustrating the decision paths that lead to different predicted payment amounts. The complexity parameter (CP) values indicate how much each split contributes to the overall model’s predictive accuracy. A smaller CP suggests a more complex model, but not necessarily a better one if the cross-validation error (xerror) does not improve correspondingly.

#Decision Tree Model Performance
cart_pred <- predict(cart_model, test_data_SHIFT)
cart_r2 <- 1 - (cart_model$dev / cart_model$deviance)
cart_rmse <- sqrt(mean((test_data_SHIFT$Avg_Tot_Pymt_Amt - cart_pred)^2))
# Linear Regression Model Performance
lm_pred <- predict(lm_model, test_data_SHIFT)
lm_r2 <- summary(lm_model)$adj.r.squared
lm_rmse <- sqrt(mean((test_data_SHIFT$Avg_Tot_Pymt_Amt - lm_pred)^2))

cat("Linear Regression Model:\n")

Linear Regression Model:

cat("Adjusted R2:", lm_r2, "\n")

Adjusted R2: 0.0819365

cat("RMSE:", lm_rmse, "\n\n")

RMSE: 5240.926

# R-squared function
r_squared <- function(true_values, predicted_values) {
  ss_res <- sum((true_values - predicted_values) ^ 2)
  ss_tot <- sum((true_values - mean(true_values)) ^ 2)
  r2 <- 1 - (ss_res / ss_tot)
  return(r2)
}

# Decision Tree Model R-squared
cart_r2 <- r_squared(test_data_SHIFT$Avg_Tot_Pymt_Amt, cart_pred)

cat("Decision Tree Model:\n")

Decision Tree Model:

cat("R-squared:", cart_r2, "\n")

R-squared: 0.06286679

cat("RMSE:", cart_rmse, "\n\n")

RMSE: 5499.096

test_data_ann <- test_data_SHIFT
test_data_ann$Avg_Tot_Pymt_Amt <- normalize(test_data_ann$Avg_Tot_Pymt_Amt)

The performance metrics of the Decision Tree and Linear Regression models offer insights into their predictive power regarding the average total payment amount (Avg_Tot_Pymt_Amt).

Decision Tree Model:

The R-squared value is 0.06286679, meaning the model explains approximately 6.29% of the variance in the dataset, which is relatively low and indicates a limited predictive power. The Root Mean-Square Error (RMSE) is 5499.096, suggesting that the model’s predictions are, on average, about $5499 away from the actual values. Linear Regression Model:

The Adjusted R-squared value is 0.0819365, slightly higher than the Decision Tree model, indicating a small but better fit to the data. The RMSE of 5240.926 is also better than that of the Decision Tree model, suggesting closer predictions to the actual values but still with significant error. These results show that while both models provide some insight into the factors influencing healthcare payments, there’s substantial room for improvement. The models only capture a small fraction of the data’s variability, which suggests that other unaccounted factors may influence the average total payment amount.

Additionally, the simple linear regression summary reveals:

A very strong Multiple R-squared value of 0.939 for a model using age and treatment type as predictors. This suggests that age and treatment type together explain 93.9% of the variability in the charges, which is substantially higher than the predictive power of the other models in the discussion. The p-values for the intercept, age, and treatment type are all significant, with age and treatment type being particularly influential (p < 2e-16). In the broader data of 3,552 observations, several variables are considered, including Hospital Type, Hospital Ownership, Hospital overall rating, and DRG_Cd. Understanding the relationships between these variables and average total payment amounts is complex, given the nuances and potential interactions among them.

To improve these models, it might be beneficial to explore other predictive modeling techniques that can handle complex interactions and non-linear relationships, such as ensemble methods or more advanced machine learning algorithms. Including additional relevant predictors, engineering new features, and tuning the model parameters could also enhance performance.

Analysis Let’s delve into the results from the linear regression and logistic regression analyses, including a detailed explanation of the regression tree results. These analyses are instrumental in understanding the impact of various predictors on healthcare costs, particularly for acute myocardial infarction (AMI) treatment costs under Medicare.

Linear Regression Analysis on Charges This regression model assesses the impact of age and treatment type on healthcare charges:

Intercept: The baseline charge is approximately $979, but this value is not statistically significant (p-value: 0.116), suggesting that without the influence of age or treatment type, the base price could be different. Age: For each additional year in age, the charge increases by about $185, which is highly significant (p-value < 2e-16). This indicates that older patients tend to incur higher charges. Treatment Type: Having a particular treatment (coded as 1) increases the charge by approximately $3396 compared to not having the treatment (coded as 0), also highly significant (p-value < 2e-16). This reflects the substantial cost impact of the treatment type.

The model has an Adjusted R-squared of 0.9377, meaning it explains about 93.77% of the variability in charges, which is quite high, indicating a good fit.

Logistic Regression for Simulated Binary Outcome This logistic regression model predicts a binary outcome based on predictor x:

Intercept: The estimated odds ratio for the intercept is 0.3726, which is not statistically significant (p-value: 0.139), suggesting that when x is 0, the log odds of the outcome occurring is slightly positive but not reliably so. x: For a one-unit increase in x, the log odds of the outcome occurring increase by about 1.459, which is highly significant (p-value: 1.67e-06). This means x is a strong predictor of the outcome.

Regression Tree Analysis for AMI Data The regression tree model provided a detailed insight into how different predictors influence AMI treatment costs:

Variables Used: The tree used variables such as educated_not_employed, Hospital overall rating, non_english, speak_english, and Tot_Dschrgs for splits, indicating their importance in predicting AMI treatment costs. Root Node Error: The error at the root node is substantial, indicating the variability in AMI treatment costs that the model attempts to explain.

Model Complexity and Performance: The complexity parameter (CP) and relative error show how the tree decides to split, optimizing the trade-off between model complexity and prediction error. The splits based on hospital overall rating and other demographics underscore their roles in cost variations. Comparative Performance of Models Linear Regression Model: Achieved an Adjusted R-squared of 0.838, suggesting it explains a significant portion of variability in AMI treatment costs. The lower RMSE (1097.179) indicates good predictive accuracy. Regression Tree Model: Had lower performance metrics (Adjusted R-squared: 0.536, RMSE: 2535.656), indicating it was less effective at predicting the exact costs but still provided useful insights into the decision rules affecting costs. Artificial Neural Network (ANN): This model had the lowest Adjusted R-squared (0.442) and highest RMSE (3472.15), suggesting that despite its complexity, it may not have captured all the nuances or may have overfitted the training data without generalizing well.

Recommendations

Model Selection: The linear regression model is recommended for predicting AMI treatment costs due to its higher accuracy and interpretability. Policy and Management: Insights from the models, particularly the significant predictors, should inform hospital pricing strategies and policy-making, focusing on factors that significantly impact costs. Future Research: Additional variables and more complex models like ensemble methods could be explored to improve predictive performance and understand other hidden factors affecting costs.

Conclusion The analyses underscore the complexity of healthcare cost prediction and the importance of using appropriate statistical models to understand and predict these costs accurately. The linear regression model’s superior performance suggests its suitability for this type of prediction, providing a robust tool for healthcare administrators and policymakers to base their decisions on.

Healthcare Cost Prediction Using Statistical Modeling

Banabas Kariuki

2024-04-26

weights: 41