Introduction

In this homework, a decision tree model is mainly employed to predict healthcare costs based on features such as age, BMI, smoking status, sex, region and number of dependents. A decision tree is a decision support tool that uses a tree-like graph or model of decisions that classify how various factors impact healthcare costs. The objective of the study is to build two regression trees with significant features for performance comparison, classification tree and random forest model to forecast individual healthcare expenses and find out the most important predictors of healthcare costs, offering effective cost management and decision-making within the healthcare sector.

Essay

A decision tree uses a hierarchical, tree-like structure to map out relationships between input features (predictors) and possible outcomes. These outcomes can either be discrete classes (in a classification tree) or continuous values (in a regression tree). Structurally, a decision tree resembles an inverted tree, starting from a root node, the tree branches out through a series of splits based on certain features in the data. Each decision node is responsible for determining how to separate the data further based on a specific predictor. The results of each decision node are represented by branches, leading to either additional decision nodes known as leaf nodes positioned at the endpoints of the branches, represent the final predictions of the model. By following the path from the root node through various decision nodes to a leaf node, a decision tree arrives at a predicted outcome based on the values of the input features at each step. However, like any algorithm, decision trees come with advantages, limitations, and specific challenges that need to be considered.

The Good: Pros of Using Decision Trees

One of the main strengths of decision trees is their interpretability. Decision trees closely resemble human decision-making processes, breaking down complex problems into a series of smaller, easily understandable decisions. In predicting healthcare costs, a decision tree might split the data based on smoker status, age, or BMI, individual can easily see how each factor contributes to cost predictions. Decision trees can handle both numerical and categorical data and work well with little preprocessing and can use raw data directly, simplifying the data preparation process.

The Bad: Cons of Using Decision Trees

Decision trees tend to overfit, especially with deep trees that capture every detail of the training data. In predicting healthcare costs, this can lead to a model that performs well on training data but poorly on unseen data, as the tree captures noise rather than the general pattern. Decision trees are also sensitive to small changes in the data. A slight variation in the dataset can lead to entirely different splits, affecting the overall structure and outcome. Additionally, decision trees are prone to bias in imbalanced datasets, as they may favor the majority class, leading to skewed predictions. For this reason, decision trees are often used as part of ensemble methods (e.g., random forests), which average results over multiple trees to produce more stable predictions

The Ugly: Challenges with Conventional Decision Trees

In machine learning, conventional decision trees lack certain advanced capabilities that are needed for real-world problems. For instance, traditional decision trees don’t handle interactions between variables well, often resulting in missed insights when features are interdependent. They are also limited in scope when it comes to feature selection and hyperparameter tuning, leading to suboptimal performance compared to ensemble methods like random forests or gradient boosting.

Summary

Therefore, combining decision trees within ensemble methods such as random forests or gradient boosting could address these limitation and challenges, providing both robust and interpretable models that are well-suited for the complex nature of healthcare data.

Loading Data Set

This data was extracted from https://raw.githubusercontent.com/stedy/Machine-Learning-with-R-datasets/refs/heads/master/insurance.csv. It is split into training and testing data sets, using random selection. By using the rpart library in R to construct decision tree models for predicted value charges for all records in the training set.

#Load required libraries
library(ggplot2)
library(ggcorrplot)
library('rpart')
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(caret)
## Loading required package: lattice
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 1.2.0 ──
## ✔ broom        1.0.7     ✔ rsample      1.2.1
## ✔ dials        1.3.0     ✔ tibble       3.2.1
## ✔ infer        1.0.7     ✔ tidyr        1.3.1
## ✔ modeldata    1.4.0     ✔ tune         1.2.1
## ✔ parsnip      1.2.1     ✔ workflows    1.1.4
## ✔ purrr        1.0.2     ✔ workflowsets 1.1.0
## ✔ recipes      1.1.0     ✔ yardstick    1.3.1
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ purrr::discard()         masks scales::discard()
## ✖ dplyr::filter()          masks stats::filter()
## ✖ dplyr::lag()             masks stats::lag()
## ✖ purrr::lift()            masks caret::lift()
## ✖ yardstick::precision()   masks caret::precision()
## ✖ dials::prune()           masks rpart::prune()
## ✖ yardstick::recall()      masks caret::recall()
## ✖ MASS::select()           masks dplyr::select()
## ✖ yardstick::sensitivity() masks caret::sensitivity()
## ✖ yardstick::specificity() masks caret::specificity()
## ✖ recipes::step()          masks stats::step()
## • Use tidymodels_prefer() to resolve common conflicts.
library(tidyr)
library(rpart.plot)
library(DataExplorer)
library(yardstick)
library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
## 
##     combine
## The following object is masked from 'package:ggplot2':
## 
##     margin
set.seed(123)
Healthcare_Cost <- read.csv("https://raw.githubusercontent.com/stedy/Machine-Learning-with-R-datasets/refs/heads/master/insurance.csv", header = TRUE)
sample_n(Healthcare_Cost, 5)
##   age    sex    bmi children smoker    region   charges
## 1  19 female 35.150        0     no northwest  2134.901
## 2  62 female 38.095        2     no northeast 15230.324
## 3  46 female 28.900        2     no southwest  8823.279
## 4  18 female 33.880        0     no southeast 11482.635
## 5  18   male 34.430        0     no southeast  1137.470
summary(Healthcare_Cost)
##       age            sex                 bmi           children    
##  Min.   :18.00   Length:1338        Min.   :15.96   Min.   :0.000  
##  1st Qu.:27.00   Class :character   1st Qu.:26.30   1st Qu.:0.000  
##  Median :39.00   Mode  :character   Median :30.40   Median :1.000  
##  Mean   :39.21                      Mean   :30.66   Mean   :1.095  
##  3rd Qu.:51.00                      3rd Qu.:34.69   3rd Qu.:2.000  
##  Max.   :64.00                      Max.   :53.13   Max.   :5.000  
##     smoker             region             charges     
##  Length:1338        Length:1338        Min.   : 1122  
##  Class :character   Class :character   1st Qu.: 4740  
##  Mode  :character   Mode  :character   Median : 9382  
##                                        Mean   :13270  
##                                        3rd Qu.:16640  
##                                        Max.   :63770

Exploratory Data Analysis

By exploring the data comprehensively, the relationships and distributions of the variables in the data set and understand which factors might impact charges (the healthcare cost) will be determined. The available data set contains 1338 observations and 7 variables below.

  1. Age: insurance contractor age, years

  2. Sex: insurance contractor gender, [female, male]

  3. BMI: Body mass index, providing an understanding of body, weights that are relatively high or low relative to height, objective index of body weight (kg / m ^ 2) using the ratio of height to weight, ideally 18.5 to 24.9

  4. Children: number of children covered by health insurance / Number of dependents

  5. Smoker: smoking, [yes, no]

  6. Region: the beneficiary’s residential area in the US, [northeast, southeast, southwest, northwest]

  7. Charges: Individual medical costs billed by health insurance, $ #predicted value

Features’ Distributions

The distributions of each variable individually is visualized to understand their ranges and possible skewness

# Age distribution
ggplot(Healthcare_Cost, aes(x = age)) +
  geom_histogram(bins = 20, fill = "skyblue", color = "black") +
  labs(title = "Age Distribution", x = "Age", y = "Count")

# BMI distribution
ggplot(Healthcare_Cost, aes(x = bmi)) +
  geom_histogram(bins = 20, fill = "lightgreen", color = "black") +
  labs(title = "BMI Distribution", x = "BMI", y = "Count")

# Number of Dependents
ggplot(Healthcare_Cost, aes(x = children)) +
  geom_bar(fill = "salmon", color = "black") +
  labs(title = "Number of Dependents", x = "Children", y = "Count")

# Charges distribution
ggplot(Healthcare_Cost, aes(x = charges)) +
  geom_histogram(bins = 20, fill = "purple", color = "black") +
  labs(title = "Charges Distribution", x = "Charges", y = "Count")

Age Distribution plot indicates the minimum and maximum ages in the data set. this variable may disproportionately influence predictions of healthcare costs.

BMI is often correlated with health risks. Data with a high concentration of individuals in the “overweight” or “obese” range might show higher average healthcare costs due to associated health conditions. The histogram reveals a bell-shaped curve that represents most BMI values are centered around the mean, suggesting a balanced distribution of BMI across the population.

Number of Dependents plot shows the right-skewed distribution that means most people in the data set have a small family size or no dependents at all. This is important in the context of healthcare cost prediction because family size can impact charges.

The response variable charges distribution is right-skewed, that means most individuals have lower healthcare costs, with a few individuals having significantly higher charges. The individuals with extremely high charges may be outliers in predictive modeling.

Correlation and Relationship of Predictors

How each predictor correlates with charges, especially for bmi, age, and smoker, which are likely to have a significant impact on healthcare costs will be determined as follow:

# Charges by age
ggplot(Healthcare_Cost, aes(x = age, y = charges)) +
  geom_point(color = "blue", alpha = 0.5) +
  geom_smooth(method = "lm", color = "red") +
  labs(title = "Charges by Age", x = "Age", y = "Charges")
## `geom_smooth()` using formula = 'y ~ x'

# Charges by BMI
ggplot(Healthcare_Cost, aes(x = bmi, y = charges)) +
  geom_point(color = "green", alpha = 0.5) +
  geom_smooth(method = "lm", color = "red") +
  labs(title = "Charges by BMI", x = "BMI", y = "Charges")
## `geom_smooth()` using formula = 'y ~ x'

# Charges by Smoking Status
ggplot(Healthcare_Cost, aes(x = smoker, y = charges)) +
  geom_boxplot(fill = c("skyblue", "pink")) +
  labs(title = "Charges by Smoking Status", x = "Smoker", y = "Charges")

# Charges by Region
ggplot(Healthcare_Cost, aes(x = region, y = charges)) +
  geom_boxplot(fill = "purple") +
  labs(title = "Charges by Region", x = "Region", y = "Charges")

# Calculate correlations for numeric variables
numeric_vars <- Healthcare_Cost %>% 
  dplyr::select(age, bmi, children, charges)
  correlations <- cor(numeric_vars)

# Plot correlation matrix
ggcorrplot(correlations, lab = TRUE, type = "lower", title = "Correlation Matrix")

BMI, Age, and Smoker are significant predictors of healthcare costs due to their strong association with health risks and chronic conditions. Positive correlations with charges suggest that as these factors increase so do the associated healthcare expenses. Differences in region also less influence charges than other variables. Accurately capturing these relationships can improve model accuracy and support effective resource allocation in healthcare.

The correlation matrix provides numerical variables with higher correlations to charges, such as bmi and age, may impact in a model predicting healthcare costs.

Split the Data into Training and Testing Sets

# Convert categorical variables such as `sex`, `smoker`, and `region` to factors.
Healthcare_Cost$sex <- as.factor(Healthcare_Cost$sex)
Healthcare_Cost$smoker <- as.factor(Healthcare_Cost$smoker)
Healthcare_Cost$region <- as.factor(Healthcare_Cost$region)
# Create a categorical outcome: "high" if charges > median, "low" otherwise
Healthcare_Cost <- Healthcare_Cost %>%
  mutate(category = ifelse(charges > median(charges), "high", "low"))

# Convert cost_category to a factor 
Healthcare_Cost$category <- as.factor(Healthcare_Cost$category)
# Split the data into training and testing sets (80% train, 20% test)
set.seed(123)
data_split <- initial_split(Healthcare_Cost, prop = 0.8)
train <- training(data_split)
test <- testing(data_split)

Building Regression Tree

Set the charges as the response variable and use rpart library to construct decision tree in training set.

# Create a decision tree model specification
tree_spec <- decision_tree() %>%
 set_engine("rpart", model = TRUE) %>%
 set_mode("regression")

# Fit the model to the training data
tree_fit <- tree_spec %>%
 fit(charges ~ ., data = train)
# Build the decision tree model using numeric predictors alone, specifying "anova" for regression
Model_0 <- rpart(charges ~ age + sex + bmi + children + smoker + region,
                       data = train, 
                       control = rpart.control(maxdepth = 4), 
                       method = "anova")
#summary(Model_0)

# Plot the decision tree using rpart.plot
rpart.plot(Model_0, type = 4, extra = 101, under = TRUE, cex = 0.8, box.palette = "auto")

# Display feature importance
importance <- Model_0$variable.importance
print(importance)
##       smoker          bmi          age       region          sex     children 
## 103035589747  23677340482  11301463510   4650635601   1506995269    215285038

The smoker feature with higher scores indicates the most important feature in the data set to forecast the health care cost.

Evaluate the model on the test data

# Generate predictions 
predictions <- predict(Model_0, newdata = test)

# Convert predictions to a data frame
predictions_df <- as.data.frame(predictions)
colnames(predictions_df) <- "predicted_charges"

# Combine predictions with test_data
results <- bind_cols(test, predictions_df)
# View the results
print(head(results))
##   age    sex    bmi children smoker    region   charges category
## 1  56 female 39.820        0     no southeast 11090.718     high
## 2  27   male 42.130        0    yes southeast 39611.758     high
## 3  60 female 36.005        0     no northeast 13228.847     high
## 4  30 female 32.400        1     no southwest  4149.736      low
## 5  63 female 23.085        0     no northeast 14451.835     high
## 6  19 female 28.600        5     no southwest  4687.797      low
##   predicted_charges
## 1         12212.295
## 2         38485.712
## 3         12212.295
## 4          5490.382
## 5         12212.295
## 6          5490.382

As we can see, summary of a model showed us that some of the variable are not significant such as sex factor, while smoking seems to have a huge influence on charges. Training a model without non-significant variables and check if performance can be improved.

Building the New Decision Tree Model without sex Variable

# Build the new decision tree model 
Model_01 <- rpart(charges ~ age +  bmi + children + smoker + region,
                       data = train, 
                       control = rpart.control(maxdepth = 4), 
                       method = "anova")
#summary(Model_01)
# Plot the decision tree using rpart.plot
rpart.plot(Model_01, type = 4, extra = 101, under = TRUE, cex = 0.8, box.palette = "auto")

# Generate predictions 
tree_predictions <- predict(Model_01, newdata = test) %>%
  as.data.frame()  
colnames(tree_predictions) <- "tree_pred"  

# Combine predictions with test data
tree_predictions <- bind_cols(test, tree_predictions)

# View the first few rows of the combined data
print(head(tree_predictions))
##   age    sex    bmi children smoker    region   charges category tree_pred
## 1  56 female 39.820        0     no southeast 11090.718     high 12212.295
## 2  27   male 42.130        0    yes southeast 39611.758     high 38485.712
## 3  60 female 36.005        0     no northeast 13228.847     high 12212.295
## 4  30 female 32.400        1     no southwest  4149.736      low  5490.382
## 5  63 female 23.085        0     no northeast 14451.835     high 12212.295
## 6  19 female 28.600        5     no southwest  4687.797      low  5490.382

Compare the Models’ Performance

library(WVPlots)
## Loading required package: wrapr
## 
## Attaching package: 'wrapr'
## The following objects are masked from 'package:tidyr':
## 
##     pack, unpack
## The following object is masked from 'package:tibble':
## 
##     view
## The following object is masked from 'package:dplyr':
## 
##     coalesce
test$prediction <- predict(Model_01, newdata = test)
ggplot(test, aes(x = prediction, y = charges)) + 
  geom_point(color = "blue", alpha = 0.7) + 
  geom_abline(color = "red") +
  ggtitle("Prediction vs. Real values")

GainCurvePlot(test, "prediction", "charges", "Model_01")

The curve illustrates that errors in the model are close to zero so model predicts quite well.

Train the Second Decision Tree

To train and compare two different decision trees with varying levels of bias and variance, we can manually set the first split for each tree by modifying the feature selection process. This approach observes how changing the initial split affects the performance and interpretability of each model.

# Set up a decision tree 
tree2 <- decision_tree() %>%
  set_engine("rpart", parms = list(split = "smoker")) %>%
  set_mode("regression")  

# Fit the second decision tree model on the training data
model_2 <- tree2 %>%
  fit(charges ~ age + bmi + children + smoker + region, data = train)

# Generate predictions on the test data
predictions_2 <- predict(model_2, new_data = test) %>%
  bind_cols(test) %>%
  rename(predicted_charges = .pred)

Evaluate and Compare Two Decision Trees’ Performance

# Decision tree performance metrics
tree_metrics <- tree_predictions %>%
  metrics(truth = charges, estimate = tree_pred)
print(tree_metrics)
## # A tibble: 3 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard    5019.   
## 2 rsq     standard       0.799
## 3 mae     standard    3340.
# Model 2 performance metrics
metrics_2 <- predictions_2 %>%
  metrics(truth = charges, estimate = predicted_charges)
print(metrics_2)
## # A tibble: 3 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard    5019.   
## 2 rsq     standard       0.799
## 3 mae     standard    3340.

Forcing the split with smoker variable may increase bias if that feature is not the optimal initial split for minimizing error and reduce variance, as it prioritizes smoker over potentially better splits. Both model’s performance matrices are the same that means smoker predictor is as informative as the optimal feature chosen in the first model.

Building Classification Tree

A classification decision tree model is built to predict healthcare cost categories such as “low,” or “high” based on several demographic and lifestyle for individuals based on input variables such as age, sex, BMI, number of dependents, smoking status, and region. The model evaluates performance using accuracy and ROC AUC, allowing an assessment of both overall correctness and discriminatory power. The decision tree shows age and bmi features most influence cost categorization, providing a useful tool for predicting personal healthcare costs.

# Set up a decision tree specification 
tree_spec <- decision_tree() %>%
  set_engine("rpart", model = TRUE) %>%
  set_mode("classification")

# Fit the model to the training data, predicting cost category
tree_model <- tree_spec %>%
  fit(category ~ age + sex + bmi + children + smoker + region, data = train)

# Print the fitted model
print(tree_model)
## parsnip model object
## 
## n= 1070 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
## 1) root 1070 523 high (0.51121495 0.48878505)  
##   2) age>=49.5 311  13 high (0.95819936 0.04180064) *
##   3) age< 49.5 759 249 low (0.32806324 0.67193676)  
##     6) smoker=yes 172   0 high (1.00000000 0.00000000) *
##     7) smoker=no 587  77 low (0.13117547 0.86882453) *
# Generate class and probability predictions
predictions <- predict(tree_model, new_data = test, type = "prob") %>%
  bind_cols(predict(tree_model, new_data = test)) %>%
  bind_cols(test) 

# Calculate performance metrics: accuracy and ROC AUC
metrics <- predictions %>%
  metrics(truth = category, estimate = .pred_class)

# Calculate ROC AUC using the probability for the "high" class
roc_auc_value <- predictions %>%
  roc_auc(truth = category, .pred_high)

# Print the performance metrics
print(metrics)
## # A tibble: 2 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.918
## 2 kap      binary         0.833
print(roc_auc_value)
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 roc_auc binary         0.916
# Plot the decision tree
rpart.plot(tree_model$fit, type = 4, extra = 101, under = TRUE, cex = 0.8, box.palette = "auto")

Building Random Forest Model

# Set up the random forest specification
rf_spec <- rand_forest() %>%
  set_engine("ranger") %>%
  set_mode("regression")  # Use "classification" for classification tasks

# Fit the model to the training data
rf_model <- rf_spec %>%
  fit(charges ~ age + bmi + children + smoker + region, data = train)

# Random forest predictions
rf_predictions <- predict(rf_model, new_data = test) %>%
  bind_cols(test) %>%
  rename(rf_pred = .pred)

Evaluate Models’ Performance

Metrics such as RMSE (Root Mean Squared Error), MAE (Mean Absolute Error), and R-squared will be used for regression decision tree and random forest models

# Decision tree performance metrics
tree_metrics <- tree_predictions %>%
  metrics(truth = charges, estimate = tree_pred)
print(tree_metrics)
## # A tibble: 3 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard    5019.   
## 2 rsq     standard       0.799
## 3 mae     standard    3340.
# Random forest performance metrics
rf_metrics <- rf_predictions %>%
  metrics(truth = charges, estimate = rf_pred)
print(rf_metrics)
## # A tibble: 3 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard    4228.   
## 2 rsq     standard       0.858
## 3 mae     standard    2539.

Compare the Results

# Combine metrics into a single data frame 
metrics <- bind_rows(
  tree_metrics %>% mutate(model = "Decision Tree"),
  rf_metrics %>% mutate(model = "Random Forest")
)

# Plot RMSE, MAE, and R-squared for comparison
metrics %>%
  filter(.metric %in% c("rmse", "mae", "rsq")) %>%
  ggplot(aes(x = model, y = .estimate, fill = model)) +
  geom_bar(stat = "identity", position = "dodge") +
  facet_wrap(~.metric, scales = "free") +
  labs(title = "Model Performance Comparison",
       y = "Metric Value", x = "Model") +
  theme_minimal()

Decision tree model has higher RMSE and lower R-squared compared to random forests due to over-fitting or under-fitting. On the other hand, Random forest model has lower RMSE and higher R-squared because they aggregate multiple trees, leading to better generalization and greater accuracy.

Estimating an Individual Healthcare Cost Using Decision Tree

Let’s calculate what charges on health care will be for them:

‘Andrew’: 19 years old, BMI 27.9, has no children, smokes, from northwest region.

Lisa: 60 years old, BMI 50, 2 children, doesn’t smoke, from southeast region.

James: 30 years old. BMI 31.2, no children, doesn’t smoke, from northeast region.

Andrew <- data.frame(age = 19,
                  bmi = 27.9,
                  children = 0,
                  smoker = "yes",
                  region = "northwest")
print(paste0("Health care charges for Andrew is $",round(predict(Model_01, Andrew), 2 )))
## [1] "Health care charges for Andrew is $21672.18"
Lisa <- data.frame(age = 60,
                   bmi = 50,
                   children = 2,
                   smoker = "no",
                   region = "southeast")
print(paste0("Health care charges for Lisa is $ ", round(predict(Model_01, Lisa), 2)))
## [1] "Health care charges for Lisa is $ 12212.3"
James <- data.frame(age = 30,
                   bmi = 31.2,
                   children = 0,
                   smoker = "no",
                   region = "northeast")
print(paste0("Health care charges for James is $ ", round(predict(Model_01, James), 2)))
## [1] "Health care charges for James is $ 5490.38"

Conclusions

Generally a strong predictor, as smoking status has a significant impact on health-related costs. High BMI often correlates with health risks and increased medical expenses. Older individuals may have higher healthcare costs. Region plays a role due to geographic variations in healthcare costs. The models employed are quite understandable as each decision is represented by one node of the tree but this approach is only useful for low-dimension feature vectors. Comparing the performance of decision trees and random forests, decision trees are easier to interpret, but can be prone to over fitting and may perform less robustly than random forests on complex data. Random Forest particularly offers better predictive performance due to its ensemble approach, making it more suitable when accuracy is a priority over interpretability.