The cost of medical care significantly impacts both healthcare providers and patients. This project aims to explore the predictive utility of patient features captured by an insurance firm to estimate the annual cost of medical care. The dataset used is the publicly available Medical Cost Personal dataset from Kaggle, containing information on 1338 beneficiaries and 7 variables, including the target variable: medical costs billed by health insurance in a year.In this study, we aim to build upon previous work by applying advanced techniques to improve the accuracy of predictions and enhance model interpretability.
age: Age of the primary beneficiary.
sex: Gender of the insurance contractor.
bmi: Body mass index of the beneficiary.
children: Number of children covered by health insurance.
smoker: Smoking status of the beneficiary.
region: Residential area of the beneficiary in the US.
charges: Individual medical costs per beneficiary billed by health insurance in a year.
df = read.csv('insurance.csv', header=TRUE) #loading data
str(df) #examining structure of the dataset
## 'data.frame': 1338 obs. of 7 variables:
## $ age : int 19 18 28 33 32 31 46 37 37 60 ...
## $ sex : chr "female" "male" "male" "male" ...
## $ bmi : num 27.9 33.8 33 22.7 28.9 ...
## $ children: int 0 1 3 0 0 0 1 3 2 0 ...
## $ smoker : chr "yes" "no" "no" "no" ...
## $ region : chr "southwest" "southeast" "southeast" "northwest" ...
## $ charges : num 16885 1726 4449 21984 3867 ...
summary(df)
## age sex bmi children
## Min. :18.00 Length:1338 Min. :15.96 Min. :0.000
## 1st Qu.:27.00 Class :character 1st Qu.:26.30 1st Qu.:0.000
## Median :39.00 Mode :character Median :30.40 Median :1.000
## Mean :39.21 Mean :30.66 Mean :1.095
## 3rd Qu.:51.00 3rd Qu.:34.69 3rd Qu.:2.000
## Max. :64.00 Max. :53.13 Max. :5.000
## smoker region charges
## Length:1338 Length:1338 Min. : 1122
## Class :character Class :character 1st Qu.: 4740
## Mode :character Mode :character Median : 9382
## Mean :13270
## 3rd Qu.:16640
## Max. :63770
Developing a robust predictive model for medical costs is crucial for assisting healthcare providers, insurers, and policymakers. This study aims to demonstrate the practical application of such a model.
The objective is to develop a predictive model using advanced regression techniques, establishing relationships between predictor variables (e.g., age, BMI, location) and the target variable (medical cost).
While the model provides valuable insights based on historical data, it assumes observed relationships will continue in the future. External factors not in the dataset may influence medical costs in the real world.
The objective of this study is to demonstrate the development of a linear regression model for the purpose of learning and research only and does not necessarily reflect the real-world situation for predicting cost of insurance or medical care for any individual patient. The study findings are not intended to be used for any commercial or diagnostic purposes.
This study employs advanced statistical modelling and machine learning techniques to predict cost of medical insurance based on patient characteristics.
# Load the dataset
df <- read.csv('insurance.csv')
# Data preprocessing
df$sex <- as.factor(df$sex)
df$smoker <- as.factor(df$smoker)
df$region <- as.factor(df$region)
# Feature Engineering: Add interaction terms and transformations
df$smoker_bmi <- ifelse(df$smoker == "yes", df$bmi, 0)
df$age_squared <- df$age^2
head(df)
## age sex bmi children smoker region charges smoker_bmi age_squared
## 1 19 female 27.900 0 yes southwest 16884.924 27.9 361
## 2 18 male 33.770 1 no southeast 1725.552 0.0 324
## 3 28 male 33.000 3 no southeast 4449.462 0.0 784
## 4 33 male 22.705 0 no northwest 21984.471 0.0 1089
## 5 32 male 28.880 0 no northwest 3866.855 0.0 1024
## 6 31 female 25.740 0 no southeast 3756.622 0.0 961
A train-test split ensures that the model is evaluated on unseen data.
set.seed(42)
train_index <- createDataPartition(df$charges, p = 0.8, list = FALSE)
train_data <- df[train_index, ]
test_data <- df[-train_index, ]
save(train_data, file = "train_data.RData")
# Charges Distribution
summary(df$charges)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1122 4740 9382 13270 16640 63770
ggplot(df, aes(x = charges)) +
geom_histogram(binwidth = 2000, fill = "blue", alpha = 0.7) +
labs(title = "Distribution of Charges", x = "Charges", y = "Frequency")
# Charges vs BMI
ggplot(df, aes(x = bmi, y = charges, color = smoker)) +
geom_point(alpha = 0.5) +
labs(title = "Charges vs BMI by Smoker Status", x = "BMI", y = "Charges")
#Define function to calculate RMSE
calc_rmse <- function(actual, predicted) {
sqrt(mean((actual - predicted)^2))
}
lm_model <- lm(charges ~ age + age_squared + bmi + children + smoker + region + smoker_bmi, data = train_data)
summary(lm_model)
##
## Call:
## lm(formula = charges ~ age + age_squared + bmi + children + smoker +
## region + smoker_bmi, data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14572.3 -1549.4 -1292.3 -859.2 22923.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.369e+03 1.502e+03 1.577 0.1150
## age -1.134e+01 7.077e+01 -0.160 0.8727
## age_squared 3.554e+00 8.846e-01 4.017 6.30e-05 ***
## bmi 6.057e+00 2.777e+01 0.218 0.8274
## children 6.418e+02 1.248e+02 5.141 3.25e-07 ***
## smokeryes -2.094e+04 1.758e+03 -11.910 < 2e-16 ***
## regionnorthwest -4.727e+02 4.093e+02 -1.155 0.2484
## regionsoutheast -1.065e+03 4.126e+02 -2.580 0.0100 *
## regionsouthwest -1.063e+03 4.178e+02 -2.545 0.0111 *
## smoker_bmi 1.451e+03 5.572e+01 26.036 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4659 on 1062 degrees of freedom
## Multiple R-squared: 0.8537, Adjusted R-squared: 0.8525
## F-statistic: 688.8 on 9 and 1062 DF, p-value: < 2.2e-16
# Predict and evaluate
lm_predictions <- predict(lm_model, test_data)
lm_rmse <- calc_rmse(test_data$charges, lm_predictions)
cat("Linear Regression RMSE:", lm_rmse, "\n")
## Linear Regression RMSE: 5408.644
residuals <- test_data$charges - lm_predictions
ggplot(data.frame(residuals), aes(x = residuals)) +
geom_histogram(binwidth = 500, fill = "blue", alpha = 0.5) +
labs(title = "Residual Histogram - Linear Regression", x = "Residuals", y = "Frequency")
The spread of residuals indicates how well the model’s predictions match the actual data. in this model, Min residual: -14572.3; Max residual: 22923.6 shows some large errors, especially on the higher end. However most of the residuals from the graph are very close to 0, which that for majority of the data points the model is predicting relatively accurately.
The histogram shows some large residuals (both positive and negative) far from zero. These represent instances where the model’s predictions are far off the mark.The residuals extend significantly to the right (positive residuals), suggesting the presence of underestimation for certain data points.
Ideally, residuals in a well-fitted linear regression model should follow a normal distribution centered around zero.Here, the residuals are heavily skewed and far from symmetrical, indicating potential problems with model assumptions.
Understanding output of our model:
Value: 4659 This Indicates the typical distance of observed data points from the regression line.
Multiple R-squared:
Value: 0.8537 85.37% of the variance in charges is explained by the predictors.
Adjusted R-squared: 0.8525 adjusts for the number of predictors, still very high, indicating a good fit.
F = 688.8, p-value < 2.2e-16 The overall model is highly statistically significant.
Value: 5408.644 Root Mean Square Error quantifies the average error magnitude, indicating a typical error of approximately 5409 units in predicted charges.
Coefficients Interpretation
Estimate: 2369 (not statistically significant at α = 0.05). This is the baseline charge when all predictors are at their reference levels.
Estimate: -11.34, not significant (p = 0.8727). Suggests no linear relationship between age and charges after accounting for age_squared.
Estimate: 3.554, highly significant (p = 6.3e-05). Indicates a significant quadratic effect of age on charges, implying costs increase more sharply for older individuals.
Estimate: 6.057, not significant (p = 0.8274). BMI alone doesn’t significantly impact charges after accounting for smoker_bmi.
Estimate: 641.8, highly significant (p = 3.25e-07). Each additional child increases charges by approximately $641.8 on average.
Estimate: -20940, highly significant (p < 2e-16). Smokers, on average, have significantly lower base costs, but this effect is counteracted by the strong positive interaction with smoker_bmi.
northwest: Estimate -472.7 (not significant, p = 0.2484). southeast: Estimate -1065 (significant, p = 0.01). southwest: Estimate -1063 (significant, p = 0.011). Indicates some regional differences, with southeast and southwest having significantly lower charges compared to the baseline (northeast).
Estimate: 1451, highly significant (p < 2e-16). Suggests a very strong interaction between being a smoker and BMI; for smokers, higher BMI leads to dramatically increased costs.
Insights The model explains most of the variability in charges (R² ≈ 85%).
Significant Predictors: age_squared, children, smoker, smoker_bmi, and some region effects are highly statistically significant.
Non-Significant Predictors: age, bmi (without interaction), regionnorthwest are not significant.
Practical Takeaways
Smoking combined with BMI has the largest impact on increasing charges.
There is a quadratic effect of age on charges, indicating that costs increase non-linearly as people age.
Living in southeast or southwest regions may slightly lower costs.
# Refine the model by excluding non-significant predictors
refined_model <- lm(formula = charges ~ age_squared + children + smoker + region + smoker_bmi, data = train_data)
# Summarize the refined model
summary(refined_model)
##
## Call:
## lm(formula = charges ~ age_squared + children + smoker + region +
## smoker_bmi, data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14551.7 -1548.3 -1287.5 -888.2 22896.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.345e+03 3.963e+02 5.917 4.42e-09 ***
## age_squared 3.418e+00 1.271e-01 26.890 < 2e-16 ***
## children 6.367e+02 1.187e+02 5.362 1.01e-07 ***
## smokeryes -2.112e+04 1.569e+03 -13.461 < 2e-16 ***
## regionnorthwest -4.730e+02 4.086e+02 -1.157 0.24736
## regionsoutheast -1.044e+03 4.024e+02 -2.593 0.00963 **
## regionsouthwest -1.056e+03 4.163e+02 -2.538 0.01130 *
## smoker_bmi 1.456e+03 4.946e+01 29.446 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4655 on 1064 degrees of freedom
## Multiple R-squared: 0.8537, Adjusted R-squared: 0.8528
## F-statistic: 887.2 on 7 and 1064 DF, p-value: < 2.2e-16
# Calculate RMSE for the refined model
predicted_values <- predict(refined_model, newdata = train_data)
actual_values <- train_data$charges
rmse_refined <- sqrt(mean((predicted_values - actual_values)^2))
cat("Refined Model RMSE:", rmse_refined, "\n")
## Refined Model RMSE: 4637.44
Save the refined model:
save(refined_model, file = "linear_model.RData")
# Assessing model assumptions
par(mfrow=c(2,2))
plot(refined_model)
Interpretation of the Refined Model Outputs
Model Assumptions:Model assumptions of linearity and normality are mostly met.
Residual Standard Error (RSE):Reduced slightly to 4655 from 4659 in the original model.
Adjusted R-squared: Improved marginally from 0.8525 to 0.8528.
Refined RMSE:Reduced significantly to 4637.44 from 5408.644, indicating improved prediction accuracy.
Significant Predictors:
age_squared: Highly significant (p < 2e-16), confirming the quadratic effect of age.
children: Highly significant (p < 0.001), with each child adding ~$636.7 to costs.
smoker: Smoking dramatically reduces baseline costs but is counteracted by the interaction with BMI.
smoker_bmi: Highly significant, with a very strong positive interaction between smoking and BMI.
regionsoutheast and regionsouthwest: These regions have significantly lower costs than the baseline(northeast).
Insignificant Predictors:
regionnorthwest: Retained for completeness but remains insignificant (p = 0.24736).
Performance Improvement
Better Fit: Adjusted R-squared indicates a slight improvement in explanatory power after simplifying the model.
Lower Error: RMSE dropped by ~14%, showing better predictions.
Bayesian Optimization is employed to tune hyperparameters of the Random Forest model.
NB: Bayesian Optimization can be used to reduce the computational cost of selecting the best hyperparameters in computationally expensive models. The goal is to efficiently explore the hyperparameter space and identify a set of hyperparameters that minimize the model’s cost function (e.g., RMSE for regression tasks) using a probabilistic surrogate model. This process helps achieve the best model accuracy with fewer evaluations compared to exhaustive search methods.
rf_bayesian <- function(mtry, min_node_size) {
model <- randomForest(
charges ~ age + age_squared + bmi + children + smoker + region + smoker_bmi,
data = train_data,
mtry = as.integer(mtry),
nodesize = as.integer(min_node_size)
)
predictions <- predict(model, test_data)
rmse <- calc_rmse(test_data$charges, predictions)
list(Score = -rmse)
}
set.seed(42)
rf_bo <- BayesianOptimization(
FUN = rf_bayesian,
bounds = list(mtry = c(2L, 5L), min_node_size = c(5L, 15L)),
init_points = 5,
n_iter = 20,
acq = "ucb"
)
## elapsed = 3.36 Round = 1 mtry = 5.0000 min_node_size = 10.0000 Value = -5288.5540
## elapsed = 2.19 Round = 2 mtry = 5.0000 min_node_size = 12.0000 Value = -5276.7645
## elapsed = 1.62 Round = 3 mtry = 3.0000 min_node_size = 6.0000 Value = -5295.0543
## elapsed = 1.73 Round = 4 mtry = 4.0000 min_node_size = 12.0000 Value = -5269.7620
## elapsed = 1.60 Round = 5 mtry = 4.0000 min_node_size = 12.0000 Value = -5255.5042
## elapsed = 1.03 Round = 6 mtry = 2.0000 min_node_size = 12.0000 Value = -5175.6109
## elapsed = 1.48 Round = 7 mtry = 2.0000 min_node_size = 5.0000 Value = -5195.7240
## elapsed = 0.95 Round = 8 mtry = 2.0000 min_node_size = 14.0000 Value = -5184.0545
## elapsed = 0.84 Round = 9 mtry = 2.0000 min_node_size = 10.0000 Value = -5190.2363
## elapsed = 0.81 Round = 10 mtry = 2.0000 min_node_size = 12.0000 Value = -5206.5339
## elapsed = 0.89 Round = 11 mtry = 2.0000 min_node_size = 12.0000 Value = -5173.1805
## elapsed = 0.87 Round = 12 mtry = 2.0000 min_node_size = 12.0000 Value = -5187.9912
## elapsed = 1.00 Round = 13 mtry = 2.0000 min_node_size = 12.0000 Value = -5191.4101
## elapsed = 0.84 Round = 14 mtry = 2.0000 min_node_size = 12.0000 Value = -5180.6849
## elapsed = 1.00 Round = 15 mtry = 2.0000 min_node_size = 12.0000 Value = -5173.3701
## elapsed = 0.78 Round = 16 mtry = 2.0000 min_node_size = 12.0000 Value = -5185.2254
## elapsed = 1.08 Round = 17 mtry = 2.0000 min_node_size = 12.0000 Value = -5189.3970
## elapsed = 0.78 Round = 18 mtry = 2.0000 min_node_size = 12.0000 Value = -5162.4277
## elapsed = 0.86 Round = 19 mtry = 2.0000 min_node_size = 12.0000 Value = -5183.1236
## elapsed = 0.88 Round = 20 mtry = 2.0000 min_node_size = 12.0000 Value = -5164.5908
## elapsed = 0.95 Round = 21 mtry = 2.0000 min_node_size = 12.0000 Value = -5191.8742
## elapsed = 0.83 Round = 22 mtry = 2.0000 min_node_size = 12.0000 Value = -5185.4855
## elapsed = 1.55 Round = 23 mtry = 2.0000 min_node_size = 12.0000 Value = -5182.5062
## elapsed = 1.18 Round = 24 mtry = 2.0000 min_node_size = 12.0000 Value = -5172.3029
## elapsed = 0.89 Round = 25 mtry = 2.0000 min_node_size = 12.0000 Value = -5181.5966
##
## Best Parameters Found:
## Round = 18 mtry = 2.0000 min_node_size = 12.0000 Value = -5162.4277
# Train the optimized Random Forest model
rf_optimized <- randomForest(
charges ~ age + age_squared + bmi + children + smoker + region + smoker_bmi,
data = train_data,
mtry = rf_bo$Best_Par["mtry"],
nodesize = rf_bo$Best_Par["min_node_size"]
)
rf_predictions_optimized <- predict(rf_optimized, test_data)
rf_rmse_optimized <- calc_rmse(test_data$charges, rf_predictions_optimized)
cat("Optimized Random Forest RMSE:", rf_rmse_optimized, "\n")
## Optimized Random Forest RMSE: 5189.678
Save the random forest model
save(rf_optimized, file = "random_forest_model.RData")
XGBoost (eXtreme Gradient Boosting) is a powerful boosting algorithm used for supervised learning problems. In this study, we employ Bayesian optimization to tune XGBoost’s hyperparameters to minimize the RMSE.
# One-hot encode categorical variables
dummy_model <- dummyVars(" ~ .", data = train_data, fullRank = TRUE)
train_data_encoded <- data.frame(predict(dummy_model, newdata = train_data))
test_data_encoded <- data.frame(predict(dummy_model, newdata = test_data))
# Ensure all columns are numeric
train_data_encoded <- train_data_encoded %>% mutate(across(everything(), as.numeric))
test_data_encoded <- test_data_encoded %>% mutate(across(everything(), as.numeric))
# Create xgb.DMatrix objects
xgb_train <- xgb.DMatrix(data = as.matrix(train_data_encoded), label = train_data$charges)
xgb_test <- xgb.DMatrix(data = as.matrix(test_data_encoded), label = test_data$charges)
# Validate Data Consistency
if (nrow(as.matrix(train_data_encoded)) != length(train_data$charges)) {
stop("Mismatch between features and labels in train_data.")
}
if (nrow(as.matrix(test_data_encoded)) != length(test_data$charges)) {
stop("Mismatch between features and labels in test_data.")
}
# 2. Bayesian Optimization Function for XGBoost
xgb_bayesian <- function(max_depth, eta, subsample, colsample_bytree) {
params <- list(
objective = "reg:squarederror",
max_depth = as.integer(max_depth),
eta = eta,
subsample = subsample,
colsample_bytree = colsample_bytree,
nthread = 4 # Enable parallel processing
)
# Cross-validation
xgb_cv <- xgb.cv(
params = params,
data = xgb_train,
nrounds = 500, # Reasonable max boosting rounds
nfold = 3, # Reduce folds for faster evaluation
metrics = "rmse",
early_stopping_rounds = 20,
verbose = 0
)
# Handle edge case for missing best_iteration
best_nrounds <- ifelse(is.null(xgb_cv$best_iteration), 500, xgb_cv$best_iteration)
# Train the model with the best parameters
model <- xgb.train(
params = params,
data = xgb_train,
nrounds = best_nrounds,
verbose = 0
)
# Predictions and RMSE
predictions <- predict(model, xgb_test)
rmse <- sqrt(mean((test_data$charges - predictions)^2))
list(Score = -rmse, Pred = model, Nrounds = best_nrounds)
}
# 3. Perform Bayesian Optimization
set.seed(42)
xgb_bo <- BayesianOptimization(
FUN = xgb_bayesian,
bounds = list(
max_depth = c(3L, 10L),
eta = c(0.01, 0.3),
subsample = c(0.7, 1),
colsample_bytree = c(0.7, 1)
),
init_points = 5,
n_iter = 20,
acq = "ucb"
)
## elapsed = 7.38 Round = 1 max_depth = 9.0000 eta = 0.1605378 subsample = 0.8373225 colsample_bytree = 0.9820044 Value = -154.6936
## elapsed = 3.64 Round = 2 max_depth = 10.0000 eta = 0.2236106 subsample = 0.9157337 colsample_bytree = 0.9934679 Value = -228.0648
## elapsed = 5.78 Round = 3 max_depth = 5.0000 eta = 0.04905331 subsample = 0.9804017 colsample_bytree = 0.7352462 Value = -591.5660
## elapsed = 3.47 Round = 4 max_depth = 9.0000 eta = 0.2005278 subsample = 0.7766286 colsample_bytree = 0.8424991 Value = -377.7979
## elapsed = 7.84 Round = 5 max_depth = 7.0000 eta = 0.2144688 subsample = 0.8386878 colsample_bytree = 0.8680998 Value = -489.4881
## elapsed = 14.09 Round = 6 max_depth = 9.0000 eta = 0.0100 subsample = 0.9909259 colsample_bytree = 0.9445517 Value = -447.6260
## elapsed = 23.73 Round = 7 max_depth = 3.0000 eta = 0.3000 subsample = 0.7000 colsample_bytree = 1.0000 Value = -381.5564
## elapsed = 10.07 Round = 8 max_depth = 9.0000 eta = 0.06766773 subsample = 0.813348 colsample_bytree = 1.0000 Value = -115.5309
## elapsed = 1.91 Round = 9 max_depth = 10.0000 eta = 0.2517071 subsample = 0.7000 colsample_bytree = 0.9848041 Value = -287.5922
## elapsed = 2.83 Round = 10 max_depth = 10.0000 eta = 0.3000 subsample = 0.9932048 colsample_bytree = 0.7874764 Value = -1325.5548
## elapsed = 2.02 Round = 11 max_depth = 9.0000 eta = 0.3000 subsample = 0.7000 colsample_bytree = 0.7000 Value = -669.0375
## elapsed = 3.52 Round = 12 max_depth = 3.0000 eta = 0.0100 subsample = 0.7000 colsample_bytree = 0.7000 Value = -1271.0771
## elapsed = 2.05 Round = 13 max_depth = 9.0000 eta = 0.3000 subsample = 0.9838869 colsample_bytree = 0.9178851 Value = -224.9772
## elapsed = 3.59 Round = 14 max_depth = 3.0000 eta = 0.0100 subsample = 0.91239 colsample_bytree = 0.9035874 Value = -789.2112
## elapsed = 5.62 Round = 15 max_depth = 5.0000 eta = 0.3000 subsample = 0.9349806 colsample_bytree = 0.9428715 Value = -408.3555
## elapsed = 8.15 Round = 16 max_depth = 7.0000 eta = 0.0100 subsample = 0.764681 colsample_bytree = 0.8401223 Value = -657.1144
## elapsed = 6.25 Round = 17 max_depth = 6.0000 eta = 0.0100 subsample = 0.935249 colsample_bytree = 0.9840084 Value = -420.3384
## elapsed = 9.43 Round = 18 max_depth = 10.0000 eta = 0.0100 subsample = 0.9794313 colsample_bytree = 0.8804068 Value = -645.3962
## elapsed = 4.09 Round = 19 max_depth = 10.0000 eta = 0.3000 subsample = 0.7000 colsample_bytree = 0.893692 Value = -644.3185
## elapsed = 2.14 Round = 20 max_depth = 8.0000 eta = 0.3000 subsample = 0.9905041 colsample_bytree = 1.0000 Value = -247.0145
## elapsed = 4.70 Round = 21 max_depth = 7.0000 eta = 0.2894137 subsample = 0.9933751 colsample_bytree = 0.9129171 Value = -245.2481
## elapsed = 3.55 Round = 22 max_depth = 3.0000 eta = 0.0100 subsample = 0.7000 colsample_bytree = 0.9769014 Value = -526.1476
## elapsed = 2.10 Round = 23 max_depth = 3.0000 eta = 0.3000 subsample = 0.7000 colsample_bytree = 0.7929499 Value = -562.9228
## elapsed = 3.94 Round = 24 max_depth = 10.0000 eta = 0.2146027 subsample = 0.7881165 colsample_bytree = 1.0000 Value = -234.0391
## elapsed = 1.62 Round = 25 max_depth = 7.0000 eta = 0.297981 subsample = 0.8150998 colsample_bytree = 0.9463932 Value = -706.6844
##
## Best Parameters Found:
## Round = 8 max_depth = 9.0000 eta = 0.06766773 subsample = 0.813348 colsample_bytree = 1.0000 Value = -115.5309
# Retrieve Optimal Parameters
best_params <- xgb_bo$Best_Par
best_nrounds <- ifelse(
is.null(xgb_bo$History$Nrounds[[which.min(xgb_bo$History$Value)]]),
500,
xgb_bo$History$Nrounds[[which.min(xgb_bo$History$Value)]]
)
# 4. Train Final Model with Optimal Parameters
xgb_best <- xgb.train(
params = list(
objective = "reg:squarederror",
max_depth = best_params["max_depth"],
eta = best_params["eta"],
subsample = best_params["subsample"],
colsample_bytree = best_params["colsample_bytree"]
),
data = xgb_train,
nrounds = best_nrounds,
verbose = 0
)
# 5. Evaluate the Final Model
xgb_predictions <- predict(xgb_best, xgb_test)
xgb_rmse <- sqrt(mean((test_data$charges - xgb_predictions)^2))
cat("Optimized XGBoost RMSE:", xgb_rmse, "\n")
## Optimized XGBoost RMSE: 76.42326
# Save the dummyVars object after creating it
saveRDS(dummy_model, "dummy_model.rds")
save(xgb_best, file = "xgboost_model.RData")
# Calculate residuals for XGBoost
residuals_xgb <- test_data$charges - xgb_predictions
# Residuals Histogram for XGBoost
ggplot(data.frame(residuals_xgb), aes(x = residuals_xgb)) +
geom_histogram(binwidth = 500, fill = "green", alpha = 0.5) +
labs(title = "Residual Histogram - XGBoost", x = "Residuals", y = "Frequency") +
theme_minimal()
Distribution of Residuals:
The XGBoost model has a good performance overall, as residuals are centered around zero and evenly spread.
Key Takeaways:
The XGBoost model has improved prediction accuracy compared to the earlier linear regression results.
# Measure computational time for Random Forest
rf_time <- system.time({
rf_optimized <- randomForest(
charges ~ age + age_squared + bmi + children + smoker + region + smoker_bmi,
data = train_data,
mtry = rf_bo$Best_Par["mtry"],
nodesize = rf_bo$Best_Par["min_node_size"]
)
})
# Measure computational time for XGBoost
xgb_time <- system.time({
xgb_best <- xgb.train(
params = list(
objective = "reg:squarederror",
max_depth = best_params["max_depth"],
eta = best_params["eta"],
subsample = best_params["subsample"],
colsample_bytree = best_params["colsample_bytree"]
),
data = xgb_train,
nrounds = best_nrounds
)
})
# Display computation times
computation_times <- data.frame(
Model = c("Random Forest", "XGBoost"),
Time = c(round(rf_time["elapsed"], 2), round(xgb_time["elapsed"], 2))
)
kable(computation_times, caption = "Computational Time for Model Training")
Model | Time |
---|---|
Random Forest | 1.93 |
XGBoost | 4.30 |
Comment: Though XGBoost gave the most accurate predictions with an RMSE of 76.42326, it took significantly longer to train.
To compare the performance of the models developed in this study, we visualize their RMSEs. This comparison helps identify the most effective approach for the medical insurance cost prediction problem.
# Combine RMSE results
results <- data.frame(
Model = c("Linear Regression Base Model","Linear Regression Refined", "Random Forest Optimized", "XGBoost"),
RMSE = c(lm_rmse, rmse_refined, rf_rmse_optimized, xgb_rmse)
)
# Plot RMSE comparison with enhanced aesthetics
ggplot(results, aes(x = reorder(Model, RMSE), y = RMSE, fill = Model)) +
geom_bar(stat = "identity", width = 0.7, show.legend = FALSE) + # Remove legend
coord_flip() + # Flip coordinates for better readability of model names
scale_fill_brewer(palette = "Set2") + # Use a pleasant color palette
theme_minimal(base_size = 14) + # Increase base font size
labs(
title = "Comparison of Model RMSE Values",
subtitle = "Lower RMSE indicates better model performance",
x = "Model",
y = "Root Mean Squared Error (RMSE)"
) +
geom_text(aes(label = round(RMSE, 2)), hjust = -0.2, size = 4) + # Add RMSE values as labels
theme(
plot.title = element_text(face = "bold", hjust = 0.5), # Center title and bold
plot.subtitle = element_text(hjust = 0.5, face = "italic"), # Center and italicize subtitle
axis.text.x = element_text(angle = 45, hjust = 1), # Adjust x-axis text for clarity
axis.text.y = element_text(size = 12), # Increase y-axis text size
panel.grid.major.y = element_blank(), # Remove horizontal grid lines
panel.grid.minor = element_blank(), # Remove minor grid lines
panel.border = element_blank() # Remove plot border
)
Real-World Validation:
Collaborate with healthcare organizations to access real-world datasets
that reflect actual patient and policyholder data. Validate the current
model using this data to assess its robustness and
applicability.
Dynamic Data Integration:
Model Generalizability:
Evaluate the model’s performance across various healthcare systems,
geographic regions, and populations. A focus on transfer
learning or domain adaptation techniques may
improve generalizability across datasets.
Feature Expansion:
Include additional predictors such as:
Advanced Algorithms:
Explore other state-of-the-art models, such as:
Cost-Optimization Analysis:
Develop models to predict not only costs but also identify cost-saving
opportunities for healthcare providers and insurers based on patient
data.
This study provides a solid foundation for predicting medical insurance costs using machine learning models. While XGBoost achieved the lowest RMSE, the deployment of such models into practical healthcare and insurance settings requires further validation using real-world data.
By incorporating external features, validating across diverse populations, and ensuring fairness and interpretability, these models can deliver significant value to stakeholders in the healthcare industry. Future work should prioritize generalizability, scalability, and ethical considerations to ensure practical usability and real-world impact.