In this homework, decision tree models are used to predict healthcare costs from features such as age, BMI, smoking status, sex, region, and number of dependents. A decision tree is a decision support tool that uses a tree-like model of decisions, here to show how various factors affect healthcare costs. The objective of the study is to build two regression trees for performance comparison, a classification tree, and random forest models in order to forecast individual healthcare expenses and find the most important predictors of healthcare costs, supporting effective cost management and decision-making within the healthcare sector.
A decision tree uses a hierarchical, tree-like structure to map relationships between input features (predictors) and possible outcomes. These outcomes can be discrete classes (in a classification tree) or continuous values (in a regression tree). Structurally, a decision tree resembles an inverted tree: starting from a root node, it branches out through a series of splits based on features in the data. Each decision node determines how to separate the data further based on a specific predictor. The results of each decision node are represented by branches, which lead either to additional decision nodes or to leaf nodes. Leaf nodes, positioned at the endpoints of the branches, represent the final predictions of the model. By following the path from the root node through the decision nodes to a leaf node, a decision tree arrives at a predicted outcome based on the values of the input features at each step. Like any algorithm, however, decision trees come with advantages, limitations, and specific challenges that need to be considered.
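To make the idea of a decision node concrete, the following is a minimal sketch of how a regression tree could choose one split: it tries every midpoint between consecutive values of a predictor and keeps the threshold that most reduces the sum of squared errors. The data values and the helper names (`sse`, `best_split`) are hypothetical and only illustrate the principle; rpart's actual implementation is more elaborate.

```r
# Toy data (hypothetical values, for illustration only)
x <- c(20, 25, 30, 35, 40, 45, 50, 55)
y <- c(2, 3, 2, 3, 10, 11, 10, 12)

# Sum of squared errors around the group mean
sse <- function(v) sum((v - mean(v))^2)

# Try every midpoint between consecutive sorted x values as a threshold
# and return the one with the largest reduction in squared error
best_split <- function(x, y) {
  xs <- sort(unique(x))
  thresholds <- (head(xs, -1) + tail(xs, -1)) / 2
  gains <- sapply(thresholds, function(t) {
    sse(y) - (sse(y[x < t]) + sse(y[x >= t]))
  })
  list(threshold = thresholds[which.max(gains)], gain = max(gains))
}

root_split <- best_split(x, y)
print(root_split$threshold)  # 37.5: separates the low-y group from the high-y group
```

A full tree repeats this search over all predictors at every node and stops when a rule such as a minimum node size or a minimum improvement is reached.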
The Good: Pros of Using Decision Trees
One of the main strengths of decision trees is their interpretability. Decision trees closely resemble human decision-making processes, breaking down complex problems into a series of smaller, easily understandable decisions. In predicting healthcare costs, a decision tree might split the data on smoker status, age, or BMI, so a reader can easily see how each factor contributes to the predicted cost. Decision trees also handle both numerical and categorical data and need little preprocessing, which simplifies data preparation.
The Bad: Cons of Using Decision Trees
Decision trees tend to overfit, especially with deep trees that capture every detail of the training data. In predicting healthcare costs, this can lead to a model that performs well on training data but poorly on unseen data, as the tree captures noise rather than the general pattern. Decision trees are also sensitive to small changes in the data. A slight variation in the dataset can lead to entirely different splits, affecting the overall structure and outcome. Additionally, decision trees are prone to bias in imbalanced datasets, as they may favor the majority class, leading to skewed predictions. For this reason, decision trees are often used as part of ensemble methods (e.g., random forests), which average results over multiple trees to produce more stable predictions.
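The overfitting behaviour described above can be sketched with rpart on simulated data. Everything in this example is an assumption made for illustration: the data are generated, not taken from the insurance set, and the control settings are chosen only to contrast a very deep tree with a shallow one.

```r
library(rpart)

set.seed(42)
n <- 300
sim <- data.frame(x = runif(n, 0, 10))
sim$y <- 5 * sin(sim$x) + rnorm(n, sd = 2)  # simulated signal plus noise

train_idx <- sample(n, 200)
sim_train <- sim[train_idx, , drop = FALSE]
sim_test  <- sim[-train_idx, , drop = FALSE]

rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))

# Deep tree: almost no stopping rules, so it can memorise the training data
deep <- rpart(y ~ x, data = sim_train, method = "anova",
              control = rpart.control(cp = 0, minsplit = 2, minbucket = 1))
# Shallow tree: depth limited to 3, default complexity penalty
shallow <- rpart(y ~ x, data = sim_train, method = "anova",
                 control = rpart.control(maxdepth = 3))

overfit_check <- data.frame(
  model = c("deep", "shallow"),
  train_rmse = c(rmse(sim_train$y, predict(deep, sim_train)),
                 rmse(sim_train$y, predict(shallow, sim_train))),
  test_rmse = c(rmse(sim_test$y, predict(deep, sim_test)),
                rmse(sim_test$y, predict(shallow, sim_test)))
)
print(overfit_check)
```

The deep tree should show a much smaller training error than test error, which is the signature of a model that has fitted noise.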
The Ugly: Challenges with Conventional Decision Trees
In machine learning, conventional decision trees lack certain capabilities that real-world problems often need. For instance, because splits are chosen greedily one variable at a time, a single tree can miss interactions in which no feature is informative on its own, resulting in missed insights when features are interdependent. A single tree also offers little built-in support for feature selection and hyperparameter tuning, and it typically performs worse than ensemble methods such as random forests or gradient boosting.
While the decision tree and Random Forest models were employed to predict personal healthcare costs, they were limited in addressing some of the usability, accessibility, and integration problems discussed in the article. The decision tree model implemented in homework 2 provided a static representation of the decision-making process and was effective for understanding feature splits, but it did not address long-term maintenance or adaptivity. This could become an issue as the healthcare landscape changes or new data become available, requiring updates to the model. To address this, integrating the model with tools such as SAP or other ERP systems, or exposing it through an API that supports dynamic updates and real-time collaboration, would help ensure the decision tree remains relevant and impactful in real-world applications. Tree diagrams are also less accessible on mobile devices and to broader audiences. To address this issue, decision tree outputs need to be converted into web-accessible, interactive formats, for instance with DeciZone or a Shiny app, so that the results are easier to interpret across platforms and can be shown to non-technical stakeholders. Large decision trees are difficult to scale and comprehend for big datasets. In this case, Random Forest and Gradient Boosting handle large datasets better and provide more robust performance than standalone decision trees. PCA (Principal Component Analysis) and feature selection can be applied to reduce the dimensionality of the input features for large datasets. Tree pruning also simplifies the structure of decision trees, improving model clarity and ease of interpretation.
The data were obtained from https://raw.githubusercontent.com/stedy/Machine-Learning-with-R-datasets/refs/heads/master/insurance.csv.
They are split into training and testing sets by random selection.
The rpart library in R is then used to construct decision tree models that predict charges for the records in the training set.
#Load required libraries
library(ggplot2)
library(ggcorrplot)
library(rpart)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(caret)
## Loading required package: lattice
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 1.2.0 ──
## ✔ broom 1.0.7 ✔ rsample 1.2.1
## ✔ dials 1.3.0 ✔ tibble 3.2.1
## ✔ infer 1.0.7 ✔ tidyr 1.3.1
## ✔ modeldata 1.4.0 ✔ tune 1.2.1
## ✔ parsnip 1.2.1 ✔ workflows 1.1.4
## ✔ purrr 1.0.2 ✔ workflowsets 1.1.0
## ✔ recipes 1.1.0 ✔ yardstick 1.3.1
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ purrr::discard() masks scales::discard()
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ✖ purrr::lift() masks caret::lift()
## ✖ yardstick::precision() masks caret::precision()
## ✖ dials::prune() masks rpart::prune()
## ✖ yardstick::recall() masks caret::recall()
## ✖ MASS::select() masks dplyr::select()
## ✖ yardstick::sensitivity() masks caret::sensitivity()
## ✖ yardstick::specificity() masks caret::specificity()
## ✖ recipes::step() masks stats::step()
## • Search for functions across packages at https://www.tidymodels.org/find/
library(tidyr)
library(rpart.plot)
library(DataExplorer)
library(yardstick)
library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
set.seed(123)
Healthcare_Cost <- read.csv("https://raw.githubusercontent.com/stedy/Machine-Learning-with-R-datasets/refs/heads/master/insurance.csv", header = TRUE)
sample_n(Healthcare_Cost, 5)
## age sex bmi children smoker region charges
## 1 19 female 35.150 0 no northwest 2134.901
## 2 62 female 38.095 2 no northeast 15230.324
## 3 46 female 28.900 2 no southwest 8823.279
## 4 18 female 33.880 0 no southeast 11482.635
## 5 18 male 34.430 0 no southeast 1137.470
summary(Healthcare_Cost)
## age sex bmi children
## Min. :18.00 Length:1338 Min. :15.96 Min. :0.000
## 1st Qu.:27.00 Class :character 1st Qu.:26.30 1st Qu.:0.000
## Median :39.00 Mode :character Median :30.40 Median :1.000
## Mean :39.21 Mean :30.66 Mean :1.095
## 3rd Qu.:51.00 3rd Qu.:34.69 3rd Qu.:2.000
## Max. :64.00 Max. :53.13 Max. :5.000
## smoker region charges
## Length:1338 Length:1338 Min. : 1122
## Class :character Class :character 1st Qu.: 4740
## Mode :character Mode :character Median : 9382
## Mean :13270
## 3rd Qu.:16640
## Max. :63770
Exploring the data comprehensively helps to determine the relationships and distributions of the variables in the data set and to understand which factors might affect charges (the healthcare cost). The data set contains 1338 observations and the 7 variables below.
Age: insurance contractor age, years
Sex: insurance contractor gender, [female, male]
BMI: body mass index, an objective index of body weight relative to height, computed as weight divided by height squared (kg / m^2); it indicates whether weight is relatively high or low for a given height, ideally 18.5 to 24.9
Children: number of children covered by health insurance / Number of dependents
Smoker: smoking, [yes, no]
Region: the beneficiary’s residential area in the US, [northeast, southeast, southwest, northwest]
Charges: individual medical costs billed by health insurance, in $ (the value to be predicted)
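As a small worked example of the BMI definition in the list above, the sketch below computes BMI from weight and height and assigns the commonly used categories (under 18.5, 18.5 to under 25, 25 to under 30, 30 and above). The function names and the example person are hypothetical.

```r
# BMI = weight in kilograms divided by the square of height in metres
bmi <- function(weight_kg, height_m) weight_kg / height_m^2

# Commonly used BMI categories; intervals are closed on the left
bmi_class <- function(b) {
  cut(b, breaks = c(-Inf, 18.5, 25, 30, Inf),
      labels = c("underweight", "normal", "overweight", "obese"),
      right = FALSE)
}

# Hypothetical person: 70 kg, 1.75 m
example_bmi <- bmi(70, 1.75)
print(round(example_bmi, 1))                 # 22.9
print(as.character(bmi_class(example_bmi)))  # "normal"
```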
The distribution of each variable is visualized individually to understand its range and possible skewness.
# Age distribution
ggplot(Healthcare_Cost, aes(x = age)) +
geom_histogram(bins = 20, fill = "skyblue", color = "black") +
labs(title = "Age Distribution", x = "Age", y = "Count")
# BMI distribution
ggplot(Healthcare_Cost, aes(x = bmi)) +
geom_histogram(bins = 20, fill = "lightgreen", color = "black") +
labs(title = "BMI Distribution", x = "BMI", y = "Count")
# Number of Dependents
ggplot(Healthcare_Cost, aes(x = children)) +
geom_bar(fill = "salmon", color = "black") +
labs(title = "Number of Dependents", x = "Children", y = "Count")
# Charges distribution
ggplot(Healthcare_Cost, aes(x = charges)) +
geom_histogram(bins = 20, fill = "purple", color = "black") +
labs(title = "Charges Distribution", x = "Charges", y = "Count")
The age distribution plot shows the range of ages in the data set (18 to 64, per the summary above). If some age groups are over-represented, they may disproportionately influence predictions of healthcare costs.
BMI is often correlated with health risks. Data with a high concentration of individuals in the “overweight” or “obese” range might show higher average healthcare costs due to associated health conditions. The histogram reveals a bell-shaped curve in which most BMI values are centered around the mean, suggesting a roughly symmetric distribution of BMI across the population.
The number of dependents plot shows a right-skewed distribution, which means most people in the data set have a small family or no dependents at all. This is important in the context of healthcare cost prediction because family size can affect charges.
The distribution of the response variable charges is right-skewed, which means most individuals have lower healthcare costs, with a few individuals having significantly higher charges. The individuals with extremely high charges may act as outliers in predictive modeling.
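One way to quantify right skew, and to see how a log transform reduces it, is sketched below. The data are simulated from a log-normal distribution purely for illustration; they are not the insurance charges, and the report itself does not apply a log transform. The `skewness` helper is a simple moment-based estimate defined here, not a function from a package.

```r
set.seed(1)
# Simulated right-skewed costs (log-normal); not the insurance data
sim_charges <- rlnorm(1000, meanlog = 9, sdlog = 1)

# Sample skewness: third central moment scaled by the cubed standard deviation
skewness <- function(v) mean((v - mean(v))^3) / sd(v)^3

skew_raw <- skewness(sim_charges)
skew_log <- skewness(log(sim_charges))
print(round(c(raw = skew_raw, log = skew_log), 2))
```

A value well above zero indicates a long right tail; after the log transform the estimate should be close to zero.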
Next, the relationship between each predictor and charges is examined, especially for bmi, age, and smoker, which are likely to have a significant impact on healthcare costs:
# Charges by age
ggplot(Healthcare_Cost, aes(x = age, y = charges)) +
geom_point(color = "blue", alpha = 0.5) +
geom_smooth(method = "lm", color = "red") +
labs(title = "Charges by Age", x = "Age", y = "Charges")
## `geom_smooth()` using formula = 'y ~ x'
# Charges by BMI
ggplot(Healthcare_Cost, aes(x = bmi, y = charges)) +
geom_point(color = "green", alpha = 0.5) +
geom_smooth(method = "lm", color = "red") +
labs(title = "Charges by BMI", x = "BMI", y = "Charges")
## `geom_smooth()` using formula = 'y ~ x'
# Charges by Smoking Status
ggplot(Healthcare_Cost, aes(x = smoker, y = charges)) +
geom_boxplot(fill = c("skyblue", "pink")) +
labs(title = "Charges by Smoking Status", x = "Smoker", y = "Charges")
# Charges by Region
ggplot(Healthcare_Cost, aes(x = region, y = charges)) +
geom_boxplot(fill = "purple") +
labs(title = "Charges by Region", x = "Region", y = "Charges")
# Calculate correlations for numeric variables
numeric_vars <- Healthcare_Cost %>%
dplyr::select(age, bmi, children, charges)
correlations <- cor(numeric_vars)
# Plot correlation matrix
ggcorrplot(correlations, lab = TRUE, type = "lower", title = "Correlation Matrix")
BMI, age, and smoker status are important predictors of healthcare costs because of their strong association with health risks and chronic conditions. Positive associations with charges suggest that healthcare expenses rise as age and BMI increase, and are higher for smokers than for non-smokers. Region has less influence on charges than the other variables. Accurately capturing these relationships can improve model accuracy and support effective resource allocation in healthcare.
The correlation matrix shows which numeric variables are more strongly correlated with charges, such as bmi and age; these may carry weight in a model predicting healthcare costs. Smoker is categorical and therefore does not appear in this matrix.
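Because `cor()` only accepts numeric columns, a two-level factor such as smoker has to be encoded before it can be compared with charges in the same way. The sketch below shows one option, a 0/1 indicator, on simulated data; the data frame, its coefficients, and the column name `smoker_yes` are assumptions made for illustration, not values from the insurance set.

```r
set.seed(7)
n <- 500
# Simulated data loosely shaped like the insurance set; values are hypothetical
sim <- data.frame(
  age = sample(18:64, n, replace = TRUE),
  smoker = sample(c("yes", "no"), n, replace = TRUE, prob = c(0.2, 0.8))
)
sim$charges <- 250 * sim$age + 20000 * (sim$smoker == "yes") + rnorm(n, sd = 3000)

# Encode the two-level variable as 0/1 so it can enter a Pearson correlation
sim$smoker_yes <- as.numeric(sim$smoker == "yes")
cor_with_charges <- cor(sim[, c("age", "smoker_yes")], sim$charges)
print(round(cor_with_charges, 2))
```

For the real data, the same encoding could be added to `numeric_vars` before calling `cor()`, so that the plotted matrix includes smoking status.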
# Convert categorical variables such as `sex`, `smoker`, and `region` to factors.
Healthcare_Cost$sex <- as.factor(Healthcare_Cost$sex)
Healthcare_Cost$smoker <- as.factor(Healthcare_Cost$smoker)
Healthcare_Cost$region <- as.factor(Healthcare_Cost$region)
# Create a categorical outcome: "high" if charges > median, "low" otherwise
Healthcare_Cost <- Healthcare_Cost %>%
mutate(category = ifelse(charges > median(charges), "high", "low"))
# Convert category to a factor
Healthcare_Cost$category <- as.factor(Healthcare_Cost$category)
# Split the data into training and testing sets (80% train, 20% test)
set.seed(123)
data_split <- initial_split(Healthcare_Cost, prop = 0.8)
train <- training(data_split)
test <- testing(data_split)
With charges as the response variable, the rpart library is used to construct a decision tree on the training set.
# Create a decision tree model specification
tree_spec <- decision_tree() %>%
set_engine("rpart", model = TRUE) %>%
set_mode("regression")
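# Fit the model to the training data; predictors are listed explicitly because
# `charges ~ .` would also include `category`, which is derived from charges
tree_fit <- tree_spec %>%
  fit(charges ~ age + sex + bmi + children + smoker + region, data = train)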
# Build the decision tree model with additional control over depth and width
Model_0 <- rpart(
charges ~ age + sex + bmi + children + smoker + region,
data = train,
method = "anova", # Specify regression using "anova"
xval = 10,
model = TRUE,
control = rpart.control(
maxdepth = 5, # Maximum tree depth
minsplit = 10, # Minimum observations in a node to attempt a split
minbucket = 10, # Minimum observations in a terminal (leaf) node
cp = 0.005 # Complexity parameter
)
)
# Plot the decision tree using rpart.plot
rpart.plot(Model_0, type = 4, extra = 101, under = TRUE, cex = 0.8, box.palette = "auto")
printcp(Model_0)
##
## Regression tree:
## rpart(formula = charges ~ age + sex + bmi + children + smoker +
## region, data = train, method = "anova", model = TRUE, control = rpart.control(maxdepth = 5,
## minsplit = 10, minbucket = 10, cp = 0.005), xval = 10)
##
## Variables actually used in tree construction:
## [1] age bmi smoker
##
## Root node error: 1.6239e+11/1070 = 151767506
##
## n= 1070
##
## CP nsplit rel error xerror xstd
## 1 0.6344899 0 1.00000 1.00075 0.057241
## 2 0.1418519 1 0.36551 0.36793 0.020914
## 3 0.0576898 2 0.22366 0.22775 0.015715
## 4 0.0105785 3 0.16597 0.17208 0.014578
## 5 0.0065087 4 0.15539 0.16494 0.014546
## 6 0.0062557 5 0.14888 0.16259 0.014592
## 7 0.0050000 6 0.14263 0.15885 0.014746
plotcp(Model_0)
# Display feature importance
importance <- Model_0$variable.importance
# Create a bar plot
barplot(
importance,
main = "Variable Importance with Simple Regression",
horiz = TRUE,
las = 1,
col = "Green",
xlab = "Importance"
)
The smoker feature has the highest importance score, which marks it as the most important feature in the data set for forecasting healthcare cost. The final model was trained with the hyperparameters identified through grid search (maxdepth = 5, minsplit = 10, and cp = 0.005). It identified smoker status, bmi, and age as the most critical predictors of healthcare costs. This aligns with existing research, as smoking and obesity are associated with chronic health conditions.
# Extract optimal cp
optimal_cp <- Model_0$cptable[which.min(Model_0$cptable[, "xerror"]), "CP"]
print(optimal_cp)
## [1] 0.005
#Prune the tree
pruned_tree <- rpart::prune(Model_0, cp = optimal_cp)
# Plot the pruned tree
rpart.plot(pruned_tree)
# Generate predictions and evaluate
predictions <- predict(pruned_tree, newdata = test)
results <- data.frame(actual = test$charges, predicted = predictions)
As the model summary shows, some variables, such as sex, are not used by the tree, while smoking has a very large influence on charges. A natural next step is to train a model without the unused variables and check whether performance improves. Pruning the tree with the complexity parameter (cp) guards against overfitting while maintaining interpretability. Here the optimal cp equals the cp used to grow the tree (0.005), so pruning leaves the tree unchanged.
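The cp selection used above picks the row of the cp table with the smallest cross-validated error. A common alternative is the one-standard-error (1-SE) rule, which picks the simplest tree whose error is within one standard error of that minimum. The sketch below compares both rules on simulated data; the data and object names are hypothetical, and the 1-SE rule is offered as an option, not as the method used in this report.

```r
library(rpart)

set.seed(123)
n <- 400
sim <- data.frame(x1 = runif(n), x2 = runif(n), x3 = runif(n))
sim$y <- 10 * (sim$x1 > 0.5) + 5 * (sim$x2 > 0.3) + rnorm(n)

# Grow a deliberately large tree, then inspect the cross-validated errors
big_tree <- rpart(y ~ ., data = sim, method = "anova",
                  control = rpart.control(cp = 0.0005, xval = 10))
cp_tab <- big_tree$cptable

# Rule 1: cp with the smallest cross-validated error
i_min <- which.min(cp_tab[, "xerror"])
cp_min <- cp_tab[i_min, "CP"]

# Rule 2 (1-SE): first, i.e. simplest, row within one standard error of the minimum
threshold <- cp_tab[i_min, "xerror"] + cp_tab[i_min, "xstd"]
i_1se <- min(which(cp_tab[, "xerror"] <= threshold))
cp_1se <- cp_tab[i_1se, "CP"]

# rpart::prune is called with its namespace because other packages also export prune()
pruned_min <- rpart::prune(big_tree, cp = cp_min)
pruned_1se <- rpart::prune(big_tree, cp = cp_1se)

n_leaves <- function(fit) sum(fit$frame$var == "<leaf>")
print(c(leaves_min = n_leaves(pruned_min), leaves_1se = n_leaves(pruned_1se)))
```

The 1-SE tree is never larger than the minimum-error tree, which can make it easier to read at a small cost in cross-validated error.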
library(WVPlots)
## Loading required package: wrapr
##
## Attaching package: 'wrapr'
## The following objects are masked from 'package:tidyr':
##
## pack, unpack
## The following object is masked from 'package:tibble':
##
## view
## The following object is masked from 'package:dplyr':
##
## coalesce
test$prediction <- predict(Model_0, newdata = test)
ggplot(test, aes(x = prediction, y = charges)) +
geom_point(color = "blue", alpha = 0.7) +
geom_abline(color = "red") +
ggtitle("Prediction vs. Actual values")
GainCurvePlot(test, "prediction", "charges", "Model_0")
The gain curve indicates that sorting individuals by predicted charges comes close to sorting them by actual charges, so the model predicts quite well.
Forcing the first split on the smoker variable may increase bias if that feature is not the optimal initial split for minimizing error, and may reduce variance, because it prioritizes smoker over potentially better splits. The performance metrics of both models are the same, which means the smoker predictor is as informative as the optimal feature chosen in the first model.
A classification decision tree model is built to predict healthcare cost categories, “low” or “high”, for individuals from demographic and lifestyle variables: age, sex, BMI, number of dependents, smoking status, and region. The model's performance is evaluated using accuracy and ROC AUC, allowing an assessment of both overall correctness and discriminatory power. The decision tree shows that the age and bmi features most influence cost categorization, providing a useful tool for predicting personal healthcare costs.
# Build the initial tree model
tree_model <- rpart(
category ~ age + sex + bmi + children + smoker + region,
data = train,
method = "class",
control = rpart.control(
maxdepth = 5,
minsplit = 10,
minbucket = 10,
cp = 0.001 # Low cp to grow a large tree
)
)
# Verify the complexity parameter table
printcp(tree_model)
##
## Classification tree:
## rpart(formula = category ~ age + sex + bmi + children + smoker +
## region, data = train, method = "class", control = rpart.control(maxdepth = 5,
## minsplit = 10, minbucket = 10, cp = 0.001))
##
## Variables actually used in tree construction:
## [1] age children smoker
##
## Root node error: 523/1070 = 0.48879
##
## n= 1070
##
## CP nsplit rel error xerror xstd
## 1 0.4990440 0 1.00000 1.00000 0.031264
## 2 0.3288719 1 0.50096 0.50478 0.026963
## 3 0.0095602 2 0.17208 0.17591 0.017534
## 4 0.0019120 4 0.15296 0.15870 0.016730
## 5 0.0010000 6 0.14914 0.16826 0.017183
optimal_cp <- tree_model$cptable[which.min(tree_model$cptable[, "xerror"]), "CP"]
print(optimal_cp)
## [1] 0.001912046
Tree.class <- rpart(
category ~ age + sex + bmi + children + smoker + region,
data = train,
method = "class",
control = rpart.control(cp = optimal_cp)
)
#rm(tree_model)
# Visualize the tree refitted at the optimal cp
rpart.plot(Tree.class, yesno = TRUE)
# Predictions on the test set
Tree.class.pred <- predict(Tree.class, test, type = "class")
# Create a confusion matrix for performance evaluation
(Tree.class.conf <- confusionMatrix(data = Tree.class.pred,
reference = test$category))
## Confusion Matrix and Statistics
##
## Reference
## Prediction high low
## high 110 7
## low 12 139
##
## Accuracy : 0.9291
## 95% CI : (0.8915, 0.9568)
## No Information Rate : 0.5448
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.8566
##
## Mcnemar's Test P-Value : 0.3588
##
## Sensitivity : 0.9016
## Specificity : 0.9521
## Pos Pred Value : 0.9402
## Neg Pred Value : 0.9205
## Prevalence : 0.4552
## Detection Rate : 0.4104
## Detection Prevalence : 0.4366
## Balanced Accuracy : 0.9268
##
## 'Positive' Class : high
##
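The statistics reported above can be reproduced by hand from the four counts in the confusion matrix, which makes their definitions explicit. The sketch below copies the counts from the output; the object names are arbitrary, and "high" is treated as the positive class, as in the caret output.

```r
# Counts copied from the confusion matrix above (rows = prediction, columns = reference)
cm <- matrix(c(110, 12, 7, 139), nrow = 2,
             dimnames = list(Prediction = c("high", "low"),
                             Reference = c("high", "low")))

total <- sum(cm)
accuracy    <- sum(diag(cm)) / total                   # (110 + 139) / 268
sensitivity <- cm["high", "high"] / sum(cm[, "high"])  # 110 / 122
specificity <- cm["low", "low"] / sum(cm[, "low"])     # 139 / 146
ppv         <- cm["high", "high"] / sum(cm["high", ])  # 110 / 117
npv         <- cm["low", "low"] / sum(cm["low", ])     # 139 / 151

# Cohen's kappa: agreement beyond what the row and column totals alone would give
expected    <- sum(rowSums(cm) * colSums(cm)) / total^2
cohen_kappa <- (accuracy - expected) / (1 - expected)

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, ppv = ppv, npv = npv,
              kappa = cohen_kappa), 4))
# accuracy 0.9291, sensitivity 0.9016, specificity 0.9521,
# ppv 0.9402, npv 0.9205, kappa 0.8566
```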
# Visualize predicted vs. actual values
plot(test$category, Tree.class.pred,
main = "Simple Classification: Predicted vs. Actual",
xlab = "Actual",
ylab = "Predicted")
# Extract and plot variable importance (taken from the initial tree_model)
variable_importance <- tree_model$variable.importance
# Create a bar plot
barplot(
variable_importance,
main = "Variable Importance with Simple Classification",
horiz = TRUE,
las = 1,
col = "skyblue",
xlab = "Importance"
)
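The Random Forest model was built using 1,070 training samples and 7 predictors. Because the formula is charges ~ ., the seven predictors are the six original features plus the derived category column, which is computed from charges and therefore leaks information about the response. The model used 10-fold cross-validation, in which the data are split into 10 subsets and each subset serves once for validation while the rest are used for training, to assess generalizability and robustness. Several combinations of hyperparameters (mtry and splitrule) were tested for their influence on the model's performance.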
# Define the random forest model specification
rf_spec <- rand_forest(
mode = "regression",
trees = 500
) %>%
set_engine("ranger")
# Fit the random forest model to the training data
rf_model <- rf_spec %>%
fit(charges ~ age + sex + bmi + children + smoker + region, data = train)
# Generate predictions on the test data
test.rf <- test %>%
mutate(rf_pred = predict(rf_model, new_data = test)$.pred)
# Random forest regression with caret
random.frst = train(charges ~ ., # "." uses every other column, including category
  data = train,
  method = "ranger", # for random forest
  tuneLength = 5, # try up to 5 values per tuning parameter
  metric = "RMSE", # evaluate hyperparameter combinations with RMSE
  trControl = trainControl(
    method = "cv", # k-fold cross validation
    number = 10, # 10 folds
    savePredictions = "final" # save predictions for the optimal tuning parameters
  )
)
random.frst
## Random Forest
##
## 1070 samples
## 7 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 963, 963, 964, 963, 962, 964, ...
## Resampling results across tuning parameters:
##
## mtry splitrule RMSE Rsquared MAE
## 2 variance 4371.774 0.8899820 2735.013
## 2 extratrees 5094.252 0.8451633 3360.581
## 3 variance 3687.366 0.9116965 1941.169
## 3 extratrees 4358.334 0.8781913 2568.037
## 5 variance 3550.802 0.9160564 1685.549
## 5 extratrees 3847.607 0.9028767 1946.602
## 7 variance 3615.651 0.9132422 1702.543
## 7 extratrees 3716.980 0.9089116 1793.592
## 9 variance 3674.298 0.9107917 1746.086
## 9 extratrees 3653.920 0.9118395 1746.506
##
## Tuning parameter 'min.node.size' was held constant at a value of 5
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were mtry = 5, splitrule = variance
## and min.node.size = 5.
plot(random.frst)
random.frst.pred <- predict(random.frst, test.rf, type = "raw")
# Visualize Predicted vs. Actual
ggplot(data = test.rf, aes(x = charges, y = random.frst.pred)) +
geom_point(color = "blue", alpha = 0.6) +
geom_abline(intercept = 0, slope = 1, color = "red", linetype = "dashed") +
labs(
title = "Random Forest Regression: Predicted vs. Actual",
x = "Actual Charges",
y = "Predicted Charges"
) +
theme_minimal()
# Test-set RMSE, stored under its own name so the predictions are not overwritten
(random.frst.rmse <- RMSE(pred = random.frst.pred,
  obs = test.rf$charges))
## [1] 3000.998
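In this section, a random forest model is developed to classify healthcare costs into two categories, high and low, based on seven predictors. Because the formula is category ~ ., these are the six original features (age, sex, bmi, children, smoking status, and region) plus charges itself. Since category is defined by whether charges exceed the median, the near-perfect cross-validated scores below largely reflect this leakage rather than genuine predictive power, and the model should be refitted without charges before it is relied on to predict cost categories.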
class.frst = train(category ~ ., # "." uses every other column, including charges
  data = train,
  method = "ranger", # for random forest
  tuneLength = 5, # try up to 5 values per tuning parameter
  metric = "ROC", # evaluate hyperparameter combinations with ROC
  trControl = trainControl(
    method = "cv", # k-fold cross validation
    number = 10, # 10 folds
    savePredictions = "final", # save predictions for the optimal tuning parameters
    classProbs = TRUE, # return class probabilities in addition to predicted values
    summaryFunction = twoClassSummary # for binary response variable
  )
)
class.frst
## Random Forest
##
## 1070 samples
## 7 predictor
## 2 classes: 'high', 'low'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 964, 964, 962, 963, 963, 963, ...
## Resampling results across tuning parameters:
##
## mtry splitrule ROC Sens Spec
## 2 gini 0.9999301 1.0000000 0.9980769
## 2 extratrees 0.9992600 0.9780471 0.9885704
## 3 gini 0.9999650 1.0000000 0.9980769
## 3 extratrees 0.9996135 0.9926599 0.9866110
## 5 gini 1.0000000 1.0000000 0.9980769
## 5 extratrees 0.9996828 0.9944781 0.9942671
## 7 gini 1.0000000 1.0000000 0.9980769
## 7 extratrees 0.9999644 0.9981481 0.9961538
## 9 gini 1.0000000 1.0000000 0.9980769
## 9 extratrees 0.9999644 0.9981481 0.9961538
##
## Tuning parameter 'min.node.size' was held constant at a value of 1
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were mtry = 5, splitrule = gini
## and min.node.size = 1.
plot(class.frst)
#Generate Predictions
class.frst.pred <- predict(class.frst, test.rf, type = "raw")
# Combine actual and predicted values
results <- data.frame(Actual = test.rf$category, Predicted = class.frst.pred)
# Bar plot for predicted vs. actual classes
ggplot(results, aes(x = Actual, fill = Predicted)) +
geom_bar(position = "dodge", color = "black") + # Dodge for side-by-side comparison
labs(
title = "Class Distribution: Predicted vs. Actual",
x = "Actual Class",
y = "Frequency",
fill = "Predicted Class"
) +
theme_minimal()
The cross-validated ROC (Receiver Operating Characteristic) score, i.e. the area under the ROC curve, measures the model's ability to distinguish between classes. For this dataset, Gini as the splitting rule with mtry = 5 was selected as the best configuration (ROC = 1.000); mtry = 7 and mtry = 9 reach the same score. The bar chart shows that the majority of the predictions align with the actual class labels. Few or no mismatched predictions are visible (red bars under the low category or blue bars under the high category).
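A ROC of exactly 1.000 is worth a sanity check. In the classification call above, the formula `category ~ .` makes charges available as a predictor, and category is defined directly from charges. The sketch below reproduces that situation on simulated data with a single rpart tree: the data, coefficients, and object names are hypothetical, and a tree is used instead of a random forest only to keep the example short.

```r
library(rpart)

set.seed(99)
n <- 400
# Simulated stand-in for the insurance data; all values are hypothetical
sim <- data.frame(
  age = sample(18:64, n, replace = TRUE),
  smoker = factor(sample(c("yes", "no"), n, replace = TRUE, prob = c(0.2, 0.8)))
)
sim$charges <- 250 * sim$age + 20000 * (sim$smoker == "yes") + rnorm(n, sd = 4000)
sim$category <- factor(ifelse(sim$charges > median(sim$charges), "high", "low"))

acc <- function(fit, data) mean(predict(fit, data, type = "class") == data$category)

# "category ~ ." lets the tree split on charges, which defines category
leaky <- rpart(category ~ ., data = sim, method = "class")
# Listing the predictors explicitly keeps charges out of the model
clean <- rpart(category ~ age + smoker, data = sim, method = "class")

leak_check <- c(leaky = acc(leaky, sim), clean = acc(clean, sim))
print(leak_check)
```

The leaky model separates the classes perfectly with one split on charges, while the explicit formula gives a more realistic picture. Applying the same explicit formula in the `train()` call would give an honest estimate for the random forest.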
Metrics such as RMSE (Root Mean Squared Error), MAE (Mean Absolute Error), and R-squared are used to compare the regression decision tree and random forest models.
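For reference, the three metrics can be written as short base R functions and checked on a tiny hand-made example. The function names and the three observations below are hypothetical; the analysis itself relies on caret's `RMSE()` and `MAE()`.

```r
rmse_fn <- function(obs, pred) sqrt(mean((obs - pred)^2))
mae_fn  <- function(obs, pred) mean(abs(obs - pred))
r2_fn   <- function(obs, pred) {
  1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)
}

# Hypothetical observations and predictions; the errors are -2, 2 and -3
obs  <- c(10, 20, 30)
pred <- c(12, 18, 33)

metric_demo <- c(RMSE = rmse_fn(obs, pred),
                 MAE  = mae_fn(obs, pred),
                 R2   = r2_fn(obs, pred))
print(round(metric_demo, 3))  # RMSE 2.380, MAE 2.333, R2 0.915
```

RMSE penalizes large errors more heavily than MAE, which is why the two can rank models differently when a few predictions are far off.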
# Actual values
actual <- test$charges
# Metrics for Decision Tree
dt_rmse <- RMSE(pred = predictions, obs = actual)
dt_mae <- MAE(pred = predictions, obs = actual)
dt_r2 <- 1 - sum((actual - predictions)^2) / sum((actual - mean(actual))^2)
# Metrics for Random Forest
rf_predictions <- predict(random.frst, test.rf, type = "raw")
rf_rmse <- RMSE(pred = rf_predictions, obs = actual)
rf_mae <- MAE(pred = rf_predictions, obs = actual)
sst <- sum((actual - mean(actual))^2) # Total sum of squares
ssr <- sum((actual - rf_predictions)^2) # Sum of squared residuals
rf_r2 <- 1 - (ssr / sst)
# Combine metrics into a data frame
performance_summary <- data.frame(
Model = c("Decision Tree", "Random Forest"),
RMSE = c(dt_rmse, rf_rmse),
MAE = c(dt_mae, rf_mae),
R_squared = c(dt_r2, rf_r2)
)
# Print summary table
print(performance_summary)
## Model RMSE MAE R_squared
## 1 Decision Tree 4690.709 2958.388 0.8222974
## 2 Random Forest 3000.998 1389.043 0.9272642
The decision tree model has a higher RMSE and a lower R-squared than the random forest. A single tree predicts one constant value per leaf and is sensitive to the particular training sample, so it can both under-fit and over-fit. The random forest has a lower RMSE and a higher R-squared because it aggregates many trees, leading to better generalization and greater accuracy.
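The effect of aggregating trees can be sketched with plain bagging: fit one tree per bootstrap resample and average the predictions. This is a simplification of a random forest, which additionally considers only a random subset of predictors at each split (the mtry parameter). The data are simulated and all object names are hypothetical.

```r
library(rpart)

set.seed(2024)
n <- 300
sim <- data.frame(x1 = runif(n, 0, 10), x2 = runif(n, 0, 10))
sim$y <- 3 * sim$x1 + 5 * sin(sim$x2) + rnorm(n, sd = 2)
idx <- sample(n, 200)
sim_train <- sim[idx, ]
sim_test  <- sim[-idx, ]

rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))

# One tree fitted to the full training set
single <- rpart(y ~ x1 + x2, data = sim_train, method = "anova")

# Bagging: one tree per bootstrap resample, predictions averaged per test row
n_trees <- 50
bag_preds <- sapply(seq_len(n_trees), function(b) {
  boot <- sim_train[sample(nrow(sim_train), replace = TRUE), ]
  fit <- rpart(y ~ x1 + x2, data = boot, method = "anova")
  predict(fit, sim_test)
})
bagged <- rowMeans(bag_preds)

bag_check <- c(single = rmse(sim_test$y, predict(single, sim_test)),
               bagged = rmse(sim_test$y, bagged))
print(bag_check)
```

Averaging smooths the step-like predictions of individual trees, which usually lowers test error on data like these, although the size of the gain depends on the data.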
Let’s calculate the predicted healthcare charges for three hypothetical individuals:
Andrew: 19 years old, BMI 27.9, no children, smokes, from the northwest region.
Lisa: 60 years old, BMI 50, 2 children, doesn’t smoke, from the southeast region.
James: 30 years old, BMI 31.2, no children, doesn’t smoke, from the northeast region.
Andrew <- data.frame(age = 19,
bmi = 27.9,
sex = "male",
children = 0,
smoker = "yes",
region = "northwest")
print(paste0("Health care charges for Andrew is $",round(predict(Model_0, Andrew), 2 )))
## [1] "Health care charges for Andrew is $18728.54"
Lisa <- data.frame(age = 60,
bmi = 50,
sex = "female",
children = 2,
smoker = "no",
region = "southeast")
print(paste0("Health care charges for Lisa is $ ", round(predict(Model_0, Lisa), 2)))
## [1] "Health care charges for Lisa is $ 13645.35"
James <- data.frame(age = 30,
bmi = 31.2,
sex = "male",
children = 0,
smoker = "no",
region = "northeast")
print(paste0("Health care charges for James is $ ", round(predict(Model_0, James), 2)))
## [1] "Health care charges for James is $ 5490.38"
The comparison between the Decision Tree and Random Forest regression models shows that Random Forest predicts healthcare costs better on this test set. Its lower RMSE and MAE, combined with a higher R-squared value, indicate that Random Forest provides more accurate and reliable predictions. While the Decision Tree offers simplicity and interpretability, its performance limitations make it less suitable for complex datasets. In the context of healthcare cost forecasting, Random Forest emerges as the better of the two models, supporting informed decision-making and effective resource allocation.