Homework 2: Forecasting Personal Healthcare Expenses with Decision Trees

Introduction

In this homework, a decision tree model is used to predict healthcare costs from features such as age, BMI, smoking status, sex, region and number of dependents. A decision tree is a decision support tool that uses a tree-like model of decisions, here to show how these factors affect healthcare costs. The objective is to build two regression trees for performance comparison, a classification tree and random forest models to forecast individual healthcare expenses, and to identify the most important predictors of healthcare costs, in support of cost management and decision-making in the healthcare sector.

Essay

A decision tree uses a hierarchical, tree-like structure to map relationships between input features (predictors) and possible outcomes. These outcomes can be discrete classes (in a classification tree) or continuous values (in a regression tree). Structurally, a decision tree resembles an inverted tree: starting from a root node, it branches out through a series of splits based on features in the data. Each decision node determines how to separate the data further based on a specific predictor. The results of each decision node are represented by branches, which lead either to additional decision nodes or to leaf nodes. Leaf nodes sit at the endpoints of the branches and represent the final predictions of the model. By following the path from the root node through the decision nodes to a leaf node, a decision tree arrives at a predicted outcome based on the values of the input features at each step. However, like any algorithm, decision trees come with advantages, limitations, and specific challenges that need to be considered.
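To make the idea of a split concrete, the sketch below searches for the threshold on one numeric predictor that minimizes the total sum of squared errors of the two child nodes, which is the criterion a regression tree uses at each decision node. The toy values of x and y are made up for illustration and are not taken from the insurance data.

```r
# Toy data: one numeric predictor and a continuous outcome (illustrative values).
x <- c(20, 25, 30, 45, 50, 60)
y <- c(2000, 2500, 3000, 9000, 9500, 11000)

# Sum of squared errors of a group around its own mean.
sse <- function(v) sum((v - mean(v))^2)

# Candidate thresholds are the midpoints between consecutive sorted x values.
xs <- sort(unique(x))
thresholds <- (head(xs, -1) + tail(xs, -1)) / 2

# For each threshold, add up the SSE of the left and right child nodes.
split_sse <- sapply(thresholds, function(t) sse(y[x < t]) + sse(y[x >= t]))

# The chosen split is the threshold with the smallest total SSE.
best_threshold <- thresholds[which.min(split_sse)]
best_threshold
```

A real tree repeats this search over every predictor at every node and keeps splitting until a stopping rule is met.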

The Good: Pros of Using Decision Trees

One of the main strengths of decision trees is their interpretability. Decision trees closely resemble human decision-making processes, breaking down complex problems into a series of smaller, easily understandable decisions. In predicting healthcare costs, a decision tree might split the data on smoker status, age, or BMI, so a reader can easily see how each factor contributes to the cost prediction. Decision trees can handle both numerical and categorical data and need little preprocessing: they can often use raw data directly, which simplifies data preparation.

The Bad: Cons of Using Decision Trees

Decision trees tend to overfit, especially deep trees that capture every detail of the training data. In predicting healthcare costs, this can lead to a model that performs well on training data but poorly on unseen data, because the tree captures noise rather than the general pattern. Decision trees are also sensitive to small changes in the data: a slight variation in the dataset can lead to entirely different splits, affecting the overall structure and outcome. Additionally, decision trees are prone to bias on imbalanced datasets, as they may favor the majority class, leading to skewed predictions. For this reason, decision trees are often used as part of ensemble methods (e.g., random forests), which average results over multiple trees to produce more stable predictions.
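The overfitting described above can be illustrated with a small experiment. The sketch below is run on synthetic data (a noisy step function, which is an assumption made only for illustration) and grows two rpart trees that differ only in the complexity parameter cp. The unpenalized tree always fits the training set at least as well, and typically does worse on held-out data.

```r
library(rpart)

set.seed(1)
# Synthetic data (assumption): a step at x = 5 plus Gaussian noise.
n <- 300
x <- runif(n, 0, 10)
y <- ifelse(x > 5, 10, 0) + rnorm(n, sd = 3)
dat <- data.frame(x = x, y = y)
idx <- sample(n, 200)
train_toy <- dat[idx, ]
test_toy <- dat[-idx, ]

rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))

# Same growth limits; only the complexity penalty differs.
deep <- rpart(y ~ x, data = train_toy, method = "anova",
              control = rpart.control(cp = 0, minsplit = 2, minbucket = 1))
shallow <- rpart(y ~ x, data = train_toy, method = "anova",
                 control = rpart.control(cp = 0.05, minsplit = 2, minbucket = 1))

# Compare training and test error of the two trees.
errors <- data.frame(
  tree = c("deep", "shallow"),
  train_rmse = c(rmse(train_toy$y, predict(deep, train_toy)),
                 rmse(train_toy$y, predict(shallow, train_toy))),
  test_rmse = c(rmse(test_toy$y, predict(deep, test_toy)),
                rmse(test_toy$y, predict(shallow, test_toy)))
)
errors
```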

The Ugly: Challenges with Conventional Decision Trees

In machine learning, conventional decision trees lack certain capabilities that are needed for real-world problems. For instance, a single tree captures interactions between variables only through successive greedy splits, so it can miss insights when features are interdependent. Conventional trees also offer limited built-in support for feature selection and hyperparameter tuning, which can lead to suboptimal performance compared with ensemble methods like random forests or gradient boosting.

Analysis on Article

While the decision tree and random forest models were employed to predict personal healthcare costs, they addressed only some of the usability, accessibility and integration problems discussed in the article. The decision tree implemented in Homework 2 provides a static representation of the decision-making process and is effective for understanding feature splits, but it does not address long-term maintenance or adaptivity. This could become an issue as the healthcare landscape changes or new data become available, requiring updates to the model. Integrating the model with enterprise systems (for example SAP or other ERP software) or exposing it through an API that supports dynamic updates and real-time collaboration would help keep the decision tree relevant in real-world applications. Tree diagrams are also hard to read on mobile devices, which limits their reach to broader audiences. To address this, decision tree outputs can be converted into web-accessible, interactive formats, for instance with DeciZone or a Shiny app, so that results can be presented to non-technical stakeholders on a range of platforms. Large decision trees are difficult to scale and to comprehend for big datasets. In that case, random forests and gradient boosting handle large datasets better and give more robust performance than standalone decision trees. PCA (Principal Component Analysis) and feature selection can be applied to reduce the dimensionality of the input features for large datasets. Tree pruning also simplifies the structure of a decision tree, which keeps the model clear and easy to interpret.

Loading Data Set

The data was downloaded from https://raw.githubusercontent.com/stedy/Machine-Learning-with-R-datasets/refs/heads/master/insurance.csv. It is split into training and testing sets by random selection. The rpart library in R is then used to build decision tree models that predict charges for the records in the training set.

#Load required libraries
library(ggplot2)
library(ggcorrplot)
library(rpart)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(caret)

## Loading required package: lattice

library(MASS)

## 
## Attaching package: 'MASS'

## The following object is masked from 'package:dplyr':
## 
##     select

library(tidymodels)

## ── Attaching packages ────────────────────────────────────── tidymodels 1.2.0 ──

## ✔ broom        1.0.7     ✔ rsample      1.2.1
## ✔ dials        1.3.0     ✔ tibble       3.2.1
## ✔ infer        1.0.7     ✔ tidyr        1.3.1
## ✔ modeldata    1.4.0     ✔ tune         1.2.1
## ✔ parsnip      1.2.1     ✔ workflows    1.1.4
## ✔ purrr        1.0.2     ✔ workflowsets 1.1.0
## ✔ recipes      1.1.0     ✔ yardstick    1.3.1

## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ purrr::discard()         masks scales::discard()
## ✖ dplyr::filter()          masks stats::filter()
## ✖ dplyr::lag()             masks stats::lag()
## ✖ purrr::lift()            masks caret::lift()
## ✖ yardstick::precision()   masks caret::precision()
## ✖ dials::prune()           masks rpart::prune()
## ✖ yardstick::recall()      masks caret::recall()
## ✖ MASS::select()           masks dplyr::select()
## ✖ yardstick::sensitivity() masks caret::sensitivity()
## ✖ yardstick::specificity() masks caret::specificity()
## ✖ recipes::step()          masks stats::step()
## • Search for functions across packages at https://www.tidymodels.org/find/

library(tidyr)
library(rpart.plot)
library(DataExplorer)
library(yardstick)
library(randomForest)

## randomForest 4.7-1.1

## Type rfNews() to see new features/changes/bug fixes.

## 
## Attaching package: 'randomForest'

## The following object is masked from 'package:dplyr':
## 
##     combine

## The following object is masked from 'package:ggplot2':
## 
##     margin

set.seed(123)
Healthcare_Cost <- read.csv("https://raw.githubusercontent.com/stedy/Machine-Learning-with-R-datasets/refs/heads/master/insurance.csv", header = TRUE)
sample_n(Healthcare_Cost, 5)

##   age    sex    bmi children smoker    region   charges
## 1  19 female 35.150        0     no northwest  2134.901
## 2  62 female 38.095        2     no northeast 15230.324
## 3  46 female 28.900        2     no southwest  8823.279
## 4  18 female 33.880        0     no southeast 11482.635
## 5  18   male 34.430        0     no southeast  1137.470

summary(Healthcare_Cost)

##       age            sex                 bmi           children    
##  Min.   :18.00   Length:1338        Min.   :15.96   Min.   :0.000  
##  1st Qu.:27.00   Class :character   1st Qu.:26.30   1st Qu.:0.000  
##  Median :39.00   Mode  :character   Median :30.40   Median :1.000  
##  Mean   :39.21                      Mean   :30.66   Mean   :1.095  
##  3rd Qu.:51.00                      3rd Qu.:34.69   3rd Qu.:2.000  
##  Max.   :64.00                      Max.   :53.13   Max.   :5.000  
##     smoker             region             charges     
##  Length:1338        Length:1338        Min.   : 1122  
##  Class :character   Class :character   1st Qu.: 4740  
##  Mode  :character   Mode  :character   Median : 9382  
##                                        Mean   :13270  
##                                        3rd Qu.:16640  
##                                        Max.   :63770

Exploratory Data Analysis

Exploring the data shows the relationships and distributions of the variables and helps identify which factors might affect charges (the healthcare cost). The data set contains 1338 observations of the 7 variables below.

Age: age of the insurance contractor, in years
Sex: gender of the insurance contractor [female, male]
BMI: body mass index, a measure of body weight relative to height (kg / m ^ 2) that indicates whether weight is high or low for a given height; ideally 18.5 to 24.9
Children: number of children covered by health insurance (number of dependents)
Smoker: smoking status [yes, no]
Region: the beneficiary’s residential area in the US [northeast, southeast, southwest, northwest]
Charges: individual medical costs billed by health insurance, in $ (the value to be predicted)

Features’ Distributions

The distribution of each variable is visualized individually to understand its range and possible skewness.

# Age distribution
ggplot(Healthcare_Cost, aes(x = age)) +
  geom_histogram(bins = 20, fill = "skyblue", color = "black") +
  labs(title = "Age Distribution", x = "Age", y = "Count")

# BMI distribution
ggplot(Healthcare_Cost, aes(x = bmi)) +
  geom_histogram(bins = 20, fill = "lightgreen", color = "black") +
  labs(title = "BMI Distribution", x = "BMI", y = "Count")

# Number of Dependents
ggplot(Healthcare_Cost, aes(x = children)) +
  geom_bar(fill = "salmon", color = "black") +
  labs(title = "Number of Dependents", x = "Children", y = "Count")

# Charges distribution
ggplot(Healthcare_Cost, aes(x = charges)) +
  geom_histogram(bins = 20, fill = "purple", color = "black") +
  labs(title = "Charges Distribution", x = "Charges", y = "Count")

The age distribution plot shows the range of ages in the data set, from 18 to 64 according to the summary above. Because age spans such a wide range, it may have a strong influence on predicted healthcare costs.

BMI is often correlated with health risks. Data with a high concentration of individuals in the “overweight” or “obese” range might show higher average healthcare costs due to associated health conditions. The histogram reveals a bell-shaped curve, with most BMI values centered around the mean (about 30.7), suggesting a roughly symmetric distribution of BMI across the population.

The number-of-dependents plot shows a right-skewed distribution: most people in the data set have a small family or no dependents at all. This matters for healthcare cost prediction because family size can affect charges.

The distribution of the response variable, charges, is right-skewed: most individuals have lower healthcare costs, while a few have significantly higher charges. Individuals with extremely high charges may act as outliers in predictive modeling.

Correlation and Relationship of Predictors

The relationship of each predictor with charges is examined below, especially for bmi, age, and smoker, which are likely to have a significant impact on healthcare costs:

# Charges by age
ggplot(Healthcare_Cost, aes(x = age, y = charges)) +
  geom_point(color = "blue", alpha = 0.5) +
  geom_smooth(method = "lm", color = "red") +
  labs(title = "Charges by Age", x = "Age", y = "Charges")

## `geom_smooth()` using formula = 'y ~ x'

# Charges by BMI
ggplot(Healthcare_Cost, aes(x = bmi, y = charges)) +
  geom_point(color = "green", alpha = 0.5) +
  geom_smooth(method = "lm", color = "red") +
  labs(title = "Charges by BMI", x = "BMI", y = "Charges")

## `geom_smooth()` using formula = 'y ~ x'

# Charges by Smoking Status
ggplot(Healthcare_Cost, aes(x = smoker, y = charges)) +
  geom_boxplot(fill = c("skyblue", "pink")) +
  labs(title = "Charges by Smoking Status", x = "Smoker", y = "Charges")

# Charges by Region
ggplot(Healthcare_Cost, aes(x = region, y = charges)) +
  geom_boxplot(fill = "purple") +
  labs(title = "Charges by Region", x = "Region", y = "Charges")

# Calculate correlations for numeric variables
numeric_vars <- Healthcare_Cost %>%
  dplyr::select(age, bmi, children, charges)
correlations <- cor(numeric_vars)

# Plot correlation matrix
ggcorrplot(correlations, lab = TRUE, type = "lower", title = "Correlation Matrix")

BMI, age, and smoking status are important predictors of healthcare costs because of their strong association with health risks and chronic conditions. Their positive association with charges suggests that healthcare expenses rise with age and BMI and are higher for smokers. Region has less influence on charges than the other variables. Accurately capturing these relationships can improve model accuracy and support effective resource allocation in healthcare.

The correlation matrix shows which numeric variables are most correlated with charges; those with higher correlations, such as bmi and age, are likely to matter in a model predicting healthcare costs.

Split the Data into Training and Testing Sets

# Convert categorical variables such as `sex`, `smoker`, and `region` to factors.
Healthcare_Cost$sex <- as.factor(Healthcare_Cost$sex)
Healthcare_Cost$smoker <- as.factor(Healthcare_Cost$smoker)
Healthcare_Cost$region <- as.factor(Healthcare_Cost$region)
# Create a categorical outcome: "high" if charges > median, "low" otherwise
Healthcare_Cost <- Healthcare_Cost %>%
  mutate(category = ifelse(charges > median(charges), "high", "low"))

# Convert category to a factor
Healthcare_Cost$category <- as.factor(Healthcare_Cost$category)
# Split the data into training and testing sets (80% train, 20% test)
set.seed(123)
data_split <- initial_split(Healthcare_Cost, prop = 0.8)
train <- training(data_split)
test <- testing(data_split)
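Because category is computed from charges, any formula written with `.` on this data frame will use one of the two columns to predict the other. A minimal sketch of building formulas that exclude both target columns is shown below. The toy data frame only mimics the column names of the homework data (its values are made up), and the names predictors, reg_formula and class_formula are hypothetical.

```r
# Toy frame with the same column names as the homework data (made-up values).
toy <- data.frame(
  age = c(19, 62, 46, 18),
  sex = factor(c("female", "female", "male", "male")),
  bmi = c(35.2, 38.1, 28.9, 33.9),
  children = c(0, 2, 2, 0),
  smoker = factor(c("no", "no", "yes", "no")),
  region = factor(c("northwest", "northeast", "southwest", "southeast")),
  charges = c(2134.9, 15230.3, 8823.3, 11482.6)
)
toy$category <- factor(ifelse(toy$charges > median(toy$charges), "high", "low"))

# Predictors exclude both the numeric target and the label derived from it.
predictors <- setdiff(names(toy), c("charges", "category"))

# Explicit formulas for the regression and classification tasks.
reg_formula <- reformulate(predictors, response = "charges")
class_formula <- reformulate(predictors, response = "category")

reg_formula
class_formula
```

Either formula can then be passed wherever `charges ~ .` or `category ~ .` would otherwise be used.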

Building Regression Tree with Hyperparameter Tuning

Charges is set as the response variable, and the rpart library is used to build a decision tree on the training set.

# Create a decision tree model specification
tree_spec <- decision_tree() %>%
 set_engine("rpart", model = TRUE) %>%
 set_mode("regression")

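# Fit the model to the training data
# (`.` also picks up `category`, which is derived from charges; tree_fit is not used below)
tree_fit <- tree_spec %>%
 fit(charges ~ ., data = train)
# Build the decision tree with rpart directly, with explicit control over depth and node size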
Model_0 <- rpart(
  charges ~ age + sex + bmi + children + smoker + region,
  data = train,
  method = "anova", # Specify regression using "anova"
  xval = 10,
  model = TRUE,
  control = rpart.control(
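    maxdepth = 5,        # Maximum depth of the tree
    minsplit = 10,       # Minimum observations in a node to attempt a split
    minbucket = 10,      # Minimum observations in a terminal (leaf) node
    cp = 0.005           # Complexity parameter: minimum relative improvement per split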
    
  )
)

# Plot the decision tree using rpart.plot
rpart.plot(Model_0, type = 4, extra = 101, under = TRUE, cex = 0.8, box.palette = "auto")

printcp(Model_0)

## 
## Regression tree:
## rpart(formula = charges ~ age + sex + bmi + children + smoker + 
##     region, data = train, method = "anova", model = TRUE, control = rpart.control(maxdepth = 5, 
##     minsplit = 10, minbucket = 10, cp = 0.005), xval = 10)
## 
## Variables actually used in tree construction:
## [1] age    bmi    smoker
## 
## Root node error: 1.6239e+11/1070 = 151767506
## 
## n= 1070 
## 
##          CP nsplit rel error  xerror     xstd
## 1 0.6344899      0   1.00000 1.00075 0.057241
## 2 0.1418519      1   0.36551 0.36793 0.020914
## 3 0.0576898      2   0.22366 0.22775 0.015715
## 4 0.0105785      3   0.16597 0.17208 0.014578
## 5 0.0065087      4   0.15539 0.16494 0.014546
## 6 0.0062557      5   0.14888 0.16259 0.014592
## 7 0.0050000      6   0.14263 0.15885 0.014746

plotcp(Model_0)

# Display feature importance
importance <- Model_0$variable.importance
# Create a bar plot
barplot(
  importance,
  main = "Variable Importance with Simple Regression",
  horiz = TRUE,
  las = 1,
  col = "Green",
  xlab = "Importance"
)

The smoker feature has the highest importance score, which makes it the most important feature in the data set for forecasting healthcare cost. The final model was trained with the hyperparameters fixed at maxdepth = 5, minsplit = 10, and cp = 0.005 in rpart.control. The tree splits on smoker status, bmi, and age, identifying them as the most critical predictors of healthcare costs. This aligns with existing research, as smoking and obesity are associated with chronic health conditions.
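A grid search over these hyperparameters is not shown in the code above. One way it could look is sketched below, using rpart's built-in 10-fold cross-validated error (the xerror column of cptable) as the selection criterion. The data are a synthetic stand-in for the training set, and the grid values are hypothetical, so the selected row says nothing about the insurance data.

```r
library(rpart)

set.seed(123)
# Synthetic stand-in for the training data (assumption, not the insurance file).
n <- 400
toy <- data.frame(
  age = sample(18:64, n, replace = TRUE),
  bmi = rnorm(n, mean = 30, sd = 6),
  smoker = factor(sample(c("yes", "no"), n, replace = TRUE, prob = c(0.2, 0.8)))
)
toy$charges <- 250 * toy$age + 20000 * (toy$smoker == "yes") +
  300 * pmax(toy$bmi - 30, 0) * (toy$smoker == "yes") + rnorm(n, sd = 3000)

# Hypothetical grid of candidate settings.
grid <- expand.grid(maxdepth = c(3, 5), minsplit = c(10, 20), cp = c(0.001, 0.005))

# For each setting, record the smallest cross-validated relative error.
grid$xerror <- sapply(seq_len(nrow(grid)), function(i) {
  fit <- rpart(charges ~ age + bmi + smoker, data = toy, method = "anova",
               control = rpart.control(maxdepth = grid$maxdepth[i],
                                       minsplit = grid$minsplit[i],
                                       cp = grid$cp[i], xval = 10))
  min(fit$cptable[, "xerror"])
})

# Setting with the lowest cross-validated error.
best <- grid[which.min(grid$xerror), ]
best
```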

Evaluate the model on the test data

# Extract optimal cp
optimal_cp <- Model_0$cptable[which.min(Model_0$cptable[, "xerror"]), "CP"]
print(optimal_cp)

## [1] 0.005

#Prune the tree
pruned_tree <- rpart::prune(Model_0, cp = optimal_cp)

# Plot the pruned tree
rpart.plot(pruned_tree)

# Generate predictions and evaluate
predictions <- predict(pruned_tree, newdata = test)
results <- data.frame(actual = test$charges, predicted = predictions)

As the printcp output shows, some variables, such as sex, are not used in the tree at all, while smoking has a very large influence on charges. A model trained without the unused variables could be checked for improved performance. Pruning with the complexity parameter (cp) is meant to limit overfitting while keeping the tree interpretable. Here the cp with the lowest cross-validated error is 0.005, the same value the tree was grown with, so the pruned tree is identical to Model_0.
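Selecting the cp with the lowest xerror is one convention; a common alternative is the one-standard-error rule, which picks the smallest tree whose cross-validated error is within one standard error of the minimum. The sketch below applies both rules to the complexity table printed by printcp(Model_0) above, copied into a plain data frame so that the example runs on its own.

```r
# Complexity table copied from the printcp(Model_0) output above.
cptable <- data.frame(
  CP     = c(0.6344899, 0.1418519, 0.0576898, 0.0105785,
             0.0065087, 0.0062557, 0.0050000),
  nsplit = 0:6,
  xerror = c(1.00075, 0.36793, 0.22775, 0.17208, 0.16494, 0.16259, 0.15885),
  xstd   = c(0.057241, 0.020914, 0.015715, 0.014578,
             0.014546, 0.014592, 0.014746)
)

# Row with the lowest cross-validated error (what which.min() selects).
best <- which.min(cptable$xerror)

# One-standard-error threshold around that minimum.
threshold <- cptable$xerror[best] + cptable$xstd[best]

# Smallest tree (first row) whose xerror falls under the threshold.
one_se <- which(cptable$xerror <= threshold)[1]

cptable[c(best, one_se), ]
```

On this table the minimum-error rule keeps all 6 splits, while the one-standard-error rule selects the tree with 3 splits (CP = 0.0105785), which could be passed to rpart::prune() as the cp value.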

Compare the Models’ Predicted Values and Actual Values

library(WVPlots)

## Loading required package: wrapr

## 
## Attaching package: 'wrapr'

## The following objects are masked from 'package:tidyr':
## 
##     pack, unpack

## The following object is masked from 'package:tibble':
## 
##     view

## The following object is masked from 'package:dplyr':
## 
##     coalesce

test$prediction <- predict(Model_0, newdata = test)
ggplot(test, aes(x = prediction, y = charges)) + 
  geom_point(color = "blue", alpha = 0.7) + 
  geom_abline(color = "red") +
  ggtitle("Prediction vs. Actual values")

GainCurvePlot(test, "prediction", "charges", "Model_0")

The gain curve shows how well the model orders individuals by charges. It lies close to the ideal curve, so the model ranks and predicts charges quite well.

Forcing the first split on the smoker variable may increase bias if smoker is not the optimal initial split for minimizing error, and may reduce variance, because it prioritizes smoker over potentially better splits. Both models’ performance metrics are the same, which means that the smoker predictor is as informative as the feature chosen automatically in the first model.

Building Classification Tree

A classification decision tree is built to predict the healthcare cost category (“low” or “high”) from demographic and lifestyle variables: age, sex, BMI, number of dependents, smoking status, and region. Performance is evaluated on the test set with a confusion matrix (accuracy, sensitivity, specificity, and kappa), which assesses both overall correctness and per-class performance. The initial tree (tree_model) splits on age, children, and smoker, so these features most influence cost categorization, providing a useful tool for predicting personal healthcare costs.

# Build the initial tree model
tree_model <- rpart(
  category ~ age + sex + bmi + children + smoker + region,
  data = train,
  method = "class",
  control = rpart.control(
    maxdepth = 5,
    minsplit = 10,
    minbucket = 10,
    cp = 0.001 # Low cp to grow a large tree
  )
)

# Verify the complexity parameter table
printcp(tree_model)

## 
## Classification tree:
## rpart(formula = category ~ age + sex + bmi + children + smoker + 
##     region, data = train, method = "class", control = rpart.control(maxdepth = 5, 
##     minsplit = 10, minbucket = 10, cp = 0.001))
## 
## Variables actually used in tree construction:
## [1] age      children smoker  
## 
## Root node error: 523/1070 = 0.48879
## 
## n= 1070 
## 
##          CP nsplit rel error  xerror     xstd
## 1 0.4990440      0   1.00000 1.00000 0.031264
## 2 0.3288719      1   0.50096 0.50478 0.026963
## 3 0.0095602      2   0.17208 0.17591 0.017534
## 4 0.0019120      4   0.15296 0.15870 0.016730
## 5 0.0010000      6   0.14914 0.16826 0.017183

optimal_cp <- tree_model$cptable[which.min(tree_model$cptable[, "xerror"]), "CP"]
print(optimal_cp)

## [1] 0.001912046

Tree.class <- rpart(
  category ~ age + sex + bmi + children + smoker + region,
  data = train,
  method = "class",
  control = rpart.control(cp = optimal_cp)
)

#rm(tree_model)
# Visualize the refitted tree (Tree.class is refit at optimal_cp with default depth and split limits)
rpart.plot(Tree.class, yesno = TRUE)

# Predictions on the test set
Tree.class.pred <- predict(Tree.class, test, type = "class")

# Create a confusion matrix for performance evaluation
(Tree.class.conf <- confusionMatrix(data = Tree.class.pred, 
                                  reference = test$category))

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction high low
##       high  110   7
##       low    12 139
##                                           
##                Accuracy : 0.9291          
##                  95% CI : (0.8915, 0.9568)
##     No Information Rate : 0.5448          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.8566          
##                                           
##  Mcnemar's Test P-Value : 0.3588          
##                                           
##             Sensitivity : 0.9016          
##             Specificity : 0.9521          
##          Pos Pred Value : 0.9402          
##          Neg Pred Value : 0.9205          
##              Prevalence : 0.4552          
##          Detection Rate : 0.4104          
##    Detection Prevalence : 0.4366          
##       Balanced Accuracy : 0.9268          
##                                           
##        'Positive' Class : high            
##
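The main statistics in this output can be recomputed by hand from the four cell counts, which makes their definitions explicit. The sketch below uses only the counts printed above; the variable names are chosen for illustration.

```r
# Counts copied from the confusion matrix above ('high' is the positive class).
tp <- 110  # predicted high, actually high
fp <- 7    # predicted high, actually low
fn <- 12   # predicted low, actually high
tn <- 139  # predicted low, actually low
n <- tp + fp + fn + tn

accuracy    <- (tp + tn) / n
sensitivity <- tp / (tp + fn)   # share of actual 'high' cases that were found
specificity <- tn / (tn + fp)   # share of actual 'low' cases that were found
ppv         <- tp / (tp + fp)   # positive predictive value

# Cohen's kappa: observed agreement corrected for agreement expected by chance.
p_chance <- ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n^2
cohen_kappa <- (accuracy - p_chance) / (1 - p_chance)

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, ppv = ppv, kappa = cohen_kappa), 4)
```

The rounded values match the Accuracy, Sensitivity, Specificity, Pos Pred Value and Kappa lines reported by confusionMatrix.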

# Visualize predicted vs. actual values
plot(test$category, Tree.class.pred, 
     main = "Simple Classification: Predicted vs. Actual",
     xlab = "Actual",
     ylab = "Predicted")

# Extract and plot variable importance (taken from the initial tree, tree_model)
variable_importance <- tree_model$variable.importance

# Create a bar plot
barplot(
  variable_importance,
  main = "Variable Importance with Simple Classification",
  horiz = TRUE,
  las = 1,
  col = "skyblue",
  xlab = "Importance"
)

Building Random Forest Regression Model

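The random forest was trained on 1,070 samples with 7 predictors. Because the formula is charges ~ ., these are the six original features plus category, which is derived from charges, so the reported errors are likely optimistic. The model used 10-fold cross-validation: the training data is split into 10 folds, and each fold in turn is held out for validation while the model is trained on the other nine, to check generalizability and robustness. The tuning grid varied mtry and splitrule, with min.node.size held at 5, to measure their influence on the model’s performance.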

# Define the random forest model specification
rf_spec <- rand_forest(
  mode = "regression",  
  trees = 500          
) %>%
  set_engine("ranger")  

# Fit the random forest model to the training data
rf_model <- rf_spec %>%
  fit(charges ~ age + sex + bmi + children + smoker + region, data = train)

# Generate predictions on the test data
test.rf <- test %>%
  mutate(rf_pred = predict(rf_model, new_data = test)$.pred)

# Random Forest Regression (caret)
random.frst <- train(charges ~ ., 
               data = train, 
               method = "ranger",  # for random forest
               tuneLength = 5,  # try up to 5 candidate values per tuning parameter
               metric = "RMSE",  # evaluate hyperparameter combinations with RMSE
               trControl = trainControl(
                 method = "cv",  # k-fold cross validation
                 number = 10,  # 10 folds
                 savePredictions = "final"       # save predictions for the optimal tuning parameters
                 )
               )
random.frst

## Random Forest 
## 
## 1070 samples
##    7 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 963, 963, 964, 963, 962, 964, ... 
## Resampling results across tuning parameters:
## 
##   mtry  splitrule   RMSE      Rsquared   MAE     
##   2     variance    4371.774  0.8899820  2735.013
##   2     extratrees  5094.252  0.8451633  3360.581
##   3     variance    3687.366  0.9116965  1941.169
##   3     extratrees  4358.334  0.8781913  2568.037
##   5     variance    3550.802  0.9160564  1685.549
##   5     extratrees  3847.607  0.9028767  1946.602
##   7     variance    3615.651  0.9132422  1702.543
##   7     extratrees  3716.980  0.9089116  1793.592
##   9     variance    3674.298  0.9107917  1746.086
##   9     extratrees  3653.920  0.9118395  1746.506
## 
## Tuning parameter 'min.node.size' was held constant at a value of 5
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were mtry = 5, splitrule = variance
##  and min.node.size = 5.

plot(random.frst)

random.frst.pred <- predict(random.frst, test.rf, type = "raw")
# Visualize Predicted vs. Actual
ggplot(data = test.rf, aes(x = charges, y = random.frst.pred)) +
  geom_point(color = "blue", alpha = 0.6) +
  geom_abline(intercept = 0, slope = 1, color = "red", linetype = "dashed") +
  labs(
    title = "Random Forest Regression: Predicted vs. Actual",
    x = "Actual Charges",
    y = "Predicted Charges"
  ) +
  theme_minimal()

# Store the test RMSE under its own name so the predictions are not overwritten
(random.frst.rmse <- RMSE(pred = random.frst.pred,
                           obs = test.rf$charges))

## [1] 3000.998

Random Forest Classification

In this section, a random forest is developed to classify healthcare costs into two categories, high and low. The formula category ~ . uses seven predictors: age, sex, bmi, children, smoker, region, and charges. Through cross-validation and hyperparameter tuning, the model achieved near-perfect performance. Note, however, that category is defined directly from charges, so this result largely reflects the presence of charges among the predictors.

class.frst <- train(category ~ ., 
               data = train, 
               method = "ranger",  # for random forest
               tuneLength = 5,  # try up to 5 candidate values per tuning parameter
               metric = "ROC",  # evaluate hyperparameter combinations with ROC
               trControl = trainControl(
                 method = "cv",  # k-fold cross validation
                 number = 10,  # 10 folds
                 savePredictions = "final",       # save predictions for the optimal tuning parameters
                 classProbs = TRUE,  # return class probabilities in addition to predicted values
                 summaryFunction = twoClassSummary  # for binary response variable
                 )
               )
class.frst

## Random Forest 
## 
## 1070 samples
##    7 predictor
##    2 classes: 'high', 'low' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 964, 964, 962, 963, 963, 963, ... 
## Resampling results across tuning parameters:
## 
##   mtry  splitrule   ROC        Sens       Spec     
##   2     gini        0.9999301  1.0000000  0.9980769
##   2     extratrees  0.9992600  0.9780471  0.9885704
##   3     gini        0.9999650  1.0000000  0.9980769
##   3     extratrees  0.9996135  0.9926599  0.9866110
##   5     gini        1.0000000  1.0000000  0.9980769
##   5     extratrees  0.9996828  0.9944781  0.9942671
##   7     gini        1.0000000  1.0000000  0.9980769
##   7     extratrees  0.9999644  0.9981481  0.9961538
##   9     gini        1.0000000  1.0000000  0.9980769
##   9     extratrees  0.9999644  0.9981481  0.9961538
## 
## Tuning parameter 'min.node.size' was held constant at a value of 1
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were mtry = 5, splitrule = gini
##  and min.node.size = 1.

plot(class.frst)

#Generate Predictions
class.frst.pred <- predict(class.frst, test.rf, type = "raw")
# Combine actual and predicted values
results <- data.frame(Actual = test.rf$category, Predicted = class.frst.pred)

# Bar plot for predicted vs. actual classes
ggplot(results, aes(x = Actual, fill = Predicted)) +
  geom_bar(position = "dodge", color = "black") +  # Dodge for side-by-side comparison
  labs(
    title = "Class Distribution: Predicted vs. Actual",
    x = "Actual Class",
    y = "Frequency",
    fill = "Predicted Class"
  ) +
  theme_minimal()

The cross-validated ROC (Receiver Operating Characteristic) score measures the model’s ability to distinguish between classes. For this dataset, several settings reach ROC = 1.000, and caret selected Gini as the splitting rule with mtry = 5 as the final model. The bar chart shows that the majority of the predictions align with the actual class labels. There are few or no mismatched predictions visible (red bars under the low category or blue bars under the high category).
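The ROC score reported by caret is the area under the ROC curve (AUC). It equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, and can be computed from ranks. The sketch below shows this on a toy example; the labels and scores are hypothetical, and auc_rank is an illustrative helper rather than a caret function.

```r
# Toy example: 1 = positive class; scores are hypothetical predicted probabilities.
labels <- c(1, 1, 0, 0, 1, 0)
scores <- c(0.9, 0.8, 0.7, 0.3, 0.6, 0.2)

auc_rank <- function(labels, scores) {
  r <- rank(scores)                 # average ranks handle tied scores
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  # Mann-Whitney U statistic for the positive class, scaled to [0, 1].
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

auc_rank(labels, scores)
```

Here 8 of the 9 positive-negative pairs are ordered correctly, so the AUC is 8/9. A value of 1 means every positive case is scored above every negative case.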

Evaluate Models’ Performance

Metrics such as RMSE (Root Mean Squared Error), MAE (Mean Absolute Error), and R-squared are used to compare the regression decision tree and random forest models.

# Actual values
actual <- test$charges

# Metrics for Decision Tree
dt_rmse <- RMSE(pred = predictions, obs = actual)
dt_mae <- MAE(pred = predictions, obs = actual)
dt_r2 <- 1 - sum((actual - predictions)^2) / sum((actual - mean(actual))^2)

# Metrics for Random Forest
rf_predictions <- predict(random.frst, test.rf, type = "raw")
rf_rmse <- RMSE(pred = rf_predictions, obs = actual)
rf_mae <- MAE(pred = rf_predictions, obs = actual)
sst <- sum((actual - mean(actual))^2)  # Total sum of squares
ssr <- sum((actual - rf_predictions)^2)  # Sum of squared residuals
rf_r2 <- 1 - (ssr / sst)

# Combine metrics into a data frame
performance_summary <- data.frame(
  Model = c("Decision Tree", "Random Forest"),
  RMSE = c(dt_rmse, rf_rmse),
  MAE = c(dt_mae, rf_mae),
  R_squared = c(dt_r2, rf_r2)
)
# Print summary table
print(performance_summary)

##           Model     RMSE      MAE R_squared
## 1 Decision Tree 4690.709 2958.388 0.8222974
## 2 Random Forest 3000.998 1389.043 0.9272642

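The decision tree has a higher RMSE and a lower R-squared than the random forest (RMSE 4690.7 vs. 3001.0; R-squared 0.82 vs. 0.93). A single tree predicts one constant value per leaf and is sensitive to the particular training sample, whereas a random forest aggregates many trees, which reduces variance and leads to better generalization and greater accuracy. Part of the gap may also come from the random forest having category among its predictors (charges ~ .), which the decision tree does not use.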
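For reference, the three metrics in the table can be written as short base R functions. The sketch below checks them on a tiny made-up example where the values are easy to verify by hand; the function names are illustrative and are not the caret versions used above.

```r
# Root mean squared error: penalizes large errors more heavily.
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))

# Mean absolute error: average size of the error, in the units of the outcome.
mae <- function(obs, pred) mean(abs(obs - pred))

# R-squared: 1 minus residual sum of squares over total sum of squares.
r2 <- function(obs, pred) 1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)

# Made-up values: only the last prediction is off, by 2.
obs  <- c(1, 2, 3, 4)
pred <- c(1, 2, 3, 6)

c(RMSE = rmse(obs, pred), MAE = mae(obs, pred), R_squared = r2(obs, pred))
```

For these values RMSE is 1, MAE is 0.5 and R-squared is 0.2, which shows how a single large error affects RMSE more than MAE.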

Estimating an Individual Healthcare Cost Using Decision Tree

The fitted tree can now be used to estimate healthcare charges for three individuals:

Andrew: 19 years old, BMI 27.9, no children, smokes, from the northwest region.

Lisa: 60 years old, BMI 50, 2 children, doesn’t smoke, from the southeast region.

James: 30 years old, BMI 31.2, no children, doesn’t smoke, from the northeast region.

Andrew <- data.frame(age = 19,
                  bmi = 27.9,
                  sex = "male",
                  children = 0,
                  smoker = "yes",
                  region = "northwest")
print(paste0("Health care charges for Andrew is $",round(predict(Model_0, Andrew), 2 )))

## [1] "Health care charges for Andrew is $18728.54"

Lisa <- data.frame(age = 60,
                   bmi = 50,
                   sex = "female",
                   children = 2,
                   smoker = "no",
                   region = "southeast")
print(paste0("Health care charges for Lisa is $ ", round(predict(Model_0, Lisa), 2)))

## [1] "Health care charges for Lisa is $ 13645.35"

James <- data.frame(age = 30,
                   bmi = 31.2,
                   sex = "male",
                   children = 0,
                   smoker = "no",
                   region = "northeast")
print(paste0("Health care charges for James is $ ", round(predict(Model_0, James), 2)))

## [1] "Health care charges for James is $ 5490.38"

Conclusions

The comparison between the decision tree and random forest regression models shows that the random forest predicts healthcare costs more accurately. Its lower RMSE and MAE, combined with a higher R-squared value, indicate that the random forest provides more accurate, precise, and reliable predictions. While the decision tree offers simplicity and interpretability, its performance limitations make it less suitable for complex datasets. In the context of healthcare cost forecasting, the random forest emerges as the better model, enabling informed decision-making and effective resource allocation.