Introduction

In Homework 3, the Support Vector Machine (SVM) algorithm is applied to predict personal healthcare costs using features such as age, BMI, smoking status, sex, region, and the number of dependents. The same healthcare_cost dataset used in Homework 2 is revisited to analyze the performance of the SVM model. This homework aims to evaluate the model’s prediction accuracy, compare its results with those obtained in Homework 2, and assess how well the SVM model fits the dataset. Additionally, the analysis investigates whether SVM is more suitable for classification or regression scenarios in this context and evaluates recommendations for achieving more accurate predictions. In this algorithm, each data item is plotted as a point in n-dimensional space (where n is the number of features), with the value of each feature being the value of a particular coordinate. Classification is then performed by finding the hyperplane that best separates the two classes. This homework concludes with a critical discussion of the findings and whether the recommendations align with the observed results.
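
To make the hyperplane idea concrete, here is a minimal illustrative sketch (not part of the homework pipeline): a linear SVM fitted to synthetic two-dimensional data with e1071, whose plot method marks the support vectors.

# Illustrative sketch only: a linear SVM on synthetic 2-D data
library(e1071)
set.seed(1)
toy <- data.frame(
  x1 = c(rnorm(20, mean = 0), rnorm(20, mean = 3)),
  x2 = c(rnorm(20, mean = 0), rnorm(20, mean = 3)),
  y  = factor(rep(c("A", "B"), each = 20))
)
toy_fit <- svm(y ~ x1 + x2, data = toy, kernel = "linear", cost = 1)
summary(toy_fit)   # reports the number of support vectors
plot(toy_fit, toy) # shades the two decision regions and marks support vectors with "x"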

Loading Required Libraries and the Dataset Used in Homework 2

#Load required libraries
library(caTools)
library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 1.2.0 ──
## ✔ broom        1.0.7     ✔ recipes      1.1.0
## ✔ dials        1.3.0     ✔ rsample      1.2.1
## ✔ dplyr        1.1.4     ✔ tibble       3.2.1
## ✔ ggplot2      3.5.1     ✔ tidyr        1.3.1
## ✔ infer        1.0.7     ✔ tune         1.2.1
## ✔ modeldata    1.4.0     ✔ workflows    1.1.4
## ✔ parsnip      1.2.1     ✔ workflowsets 1.1.0
## ✔ purrr        1.0.2     ✔ yardstick    1.3.1
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ purrr::discard() masks scales::discard()
## ✖ dplyr::filter()  masks stats::filter()
## ✖ dplyr::lag()     masks stats::lag()
## ✖ recipes::step()  masks stats::step()
## • Learn how to get started at https://www.tidymodels.org/start/
library(e1071)
## 
## Attaching package: 'e1071'
## The following object is masked from 'package:tune':
## 
##     tune
## The following object is masked from 'package:rsample':
## 
##     permutations
## The following object is masked from 'package:parsnip':
## 
##     tune
library(dplyr)
library(ggplot2)
library(pROC)
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following objects are masked from 'package:yardstick':
## 
##     precision, recall, sensitivity, specificity
## The following object is masked from 'package:purrr':
## 
##     lift
set.seed(123)
Healthcare_Cost <- read.csv("https://raw.githubusercontent.com/stedy/Machine-Learning-with-R-datasets/refs/heads/master/insurance.csv", header = TRUE)
sample_n(Healthcare_Cost, 5)
##   age    sex    bmi children smoker    region   charges
## 1  19 female 35.150        0     no northwest  2134.901
## 2  62 female 38.095        2     no northeast 15230.324
## 3  46 female 28.900        2     no southwest  8823.279
## 4  18 female 33.880        0     no southeast 11482.635
## 5  18   male 34.430        0     no southeast  1137.470
summary(Healthcare_Cost)
##       age            sex                 bmi           children    
##  Min.   :18.00   Length:1338        Min.   :15.96   Min.   :0.000  
##  1st Qu.:27.00   Class :character   1st Qu.:26.30   1st Qu.:0.000  
##  Median :39.00   Mode  :character   Median :30.40   Median :1.000  
##  Mean   :39.21                      Mean   :30.66   Mean   :1.095  
##  3rd Qu.:51.00                      3rd Qu.:34.69   3rd Qu.:2.000  
##  Max.   :64.00                      Max.   :53.13   Max.   :5.000  
##     smoker             region             charges     
##  Length:1338        Length:1338        Min.   : 1122  
##  Class :character   Class :character   1st Qu.: 4740  
##  Mode  :character   Mode  :character   Median : 9382  
##                                        Mean   :13270  
##                                        3rd Qu.:16640  
##                                        Max.   :63770

Data Preprocessing

Converting variables such as sex, smoker, and region to factors allows the model to interpret them correctly during training. The dataset is split into training and testing sets (75%/25%) to evaluate the model’s performance on unseen data, and the numeric columns (age, bmi, and charges) are standardized so that no single feature dominates the kernel distance calculations; the regression metrics reported later are therefore on the standardized charges scale.

# Convert categorical variables to factors
df <- Healthcare_Cost %>%
  mutate(
    sex = as.factor(sex),
    smoker = as.factor(smoker),
    region = as.factor(region)
  )
# Split the data into training and testing sets (75% train, 25% test)
set.seed(123)
data_split <- initial_split(df, prop = 0.75)
train_df <- training(data_split)
test_df <- testing(data_split)
train_df <- train_df %>%
  mutate(across(c(age, bmi, charges), scale))  # Standardize numeric columns (train set)
test_df <- test_df %>%
  mutate(across(c(age, bmi, charges), scale))  # Note: standardized with the test set's own mean/sd

SVM Regression

The Support Vector Machine (SVM) algorithm aims to find the hyperplane that separates the classes in the dataset while maximizing the margin between them. Support vectors are the sample observations that lie closest to the hyperplane. The decision boundary can be linear or non-linear: kernel functions such as the linear, polynomial, radial basis function (RBF), and sigmoid kernels implicitly map a non-linearly separable dataset into a higher-dimensional space, where a linear separator can be found. For regression, the same machinery (epsilon-SVR) fits a function that keeps most training points within an epsilon-wide tube around the prediction.

# Train SVM Model
svm_model <- svm(
  charges ~ age + bmi + children + smoker + sex + region,
  data = train_df,
  type = "eps-regression",          # Regression type
  kernel = "radial",                # Kernel type
  cost = 1,                         # Regularization parameter
  gamma = 0.1,                      # Kernel coefficient
  epsilon = 0.1                     # Epsilon margin  
)
# Print Model Summary
summary(svm_model)
## 
## Call:
## svm(formula = charges ~ age + bmi + children + smoker + sex + region, 
##     data = train_df, type = "eps-regression", kernel = "radial", 
##     cost = 1, gamma = 0.1, epsilon = 0.1)
## 
## 
## Parameters:
##    SVM-Type:  eps-regression 
##  SVM-Kernel:  radial 
##        cost:  1 
##       gamma:  0.1 
##     epsilon:  0.1 
## 
## 
## Number of Support Vectors:  371

The model is designed to predict a continuous target variable, charges, using the input features age, bmi, children, smoker, sex, and region. The formula specifies the relationship being modeled, with charges as the dependent variable and the remaining features as predictors. The SVM-Type eps-regression indicates epsilon-Support Vector Regression (SVR), which is suited to continuous outputs. The hyperparameters shape the model’s fit: cost controls the penalty for errors, gamma controls the reach of each support vector, and epsilon sets the width of the error-tolerant tube. The relatively large number of support vectors (371, roughly a third of the training set) reflects the complexity of the data and the model’s reliance on a significant portion of the observations.
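
To see how the cost hyperparameter shapes the fit, the short sketch below retrains the model over a few illustrative cost values (arbitrary choices, not tuned settings; formal tuning follows later in this homework) and reports the in-sample error and support-vector count.

# Hedged sketch: effect of the cost parameter (illustrative values only)
for (c_val in c(0.1, 1, 10)) {
  fit <- svm(charges ~ age + bmi + children + smoker + sex + region,
             data = train_df, type = "eps-regression",
             kernel = "radial", cost = c_val, gamma = 0.1, epsilon = 0.1)
  in_rmse <- sqrt(mean((train_df$charges - predict(fit, train_df))^2))
  cat("cost =", c_val, "| training RMSE:", round(in_rmse, 3),
      "| support vectors:", fit$tot.nSV, "\n")
}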

Generate Predictions and Evaluate Model

Calculating both RMSE and MAE highlights the predictive accuracy of the SVM regression model. Because charges was standardized during preprocessing, both metrics are expressed in standard-deviation units rather than dollars.

# Predict charges on the test set
test_df <- test_df %>%
  mutate(predicted_charges = predict(svm_model, newdata = test_df))

# Calculate performance metrics: RMSE and MAE
rmse <- sqrt(mean((test_df$charges - test_df$predicted_charges)^2))
mae <- mean(abs(test_df$charges - test_df$predicted_charges))

# Print performance metrics
cat("RMSE:", round(rmse, 2), "\n")
## RMSE: 0.42
cat("MAE:", round(mae, 2), "\n")
## MAE: 0.2

The RMSE of 0.42 (in standardized units, since charges was scaled) indicates that, while the model captures the general trend of the data, there are instances where predictions deviate noticeably from the true values, potentially due to noise, outliers, or unmodeled complexity in the dataset. The RMSE is roughly double the MAE of 0.20, indicating that the errors are not uniformly distributed: a small number of large errors inflates the squared-error metric. Overall, the model performs reasonably well.
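
A quick way to check the claim that errors are not uniformly distributed is to inspect the residuals directly; the sketch below reuses the mae value computed above.

# Hedged sketch: residual distribution behind the RMSE/MAE gap
residuals_svm <- as.numeric(test_df$charges - test_df$predicted_charges)
summary(residuals_svm)              # a long tail here explains RMSE > MAE
mean(abs(residuals_svm) > 2 * mae)  # share of cases with well-above-typical errors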

Visualize Predictions vs Actual Charges

# Plot actual vs predicted charges
ggplot(test_df, aes(x = charges, y = predicted_charges)) +
  geom_point(alpha = 0.5, color = "blue") +
  geom_abline(slope = 1, intercept = 0, color = "red", linetype = "dashed") +
  labs(title = "Actual vs Predicted Charges",
       x = "Actual Charges",
       y = "Predicted Charges") +
  theme_minimal()

This plot depicts the relationship between actual and predicted charges for the SVM regression model. Each blue point represents a single observation in the test dataset, positioned by its actual charge value and the model’s prediction for that observation. The red dashed line marks perfect agreement, where predicted charges equal actual charges. The model appears to perform better in the lower range of charges, where the points closely follow the dashed line, but struggles with the highest charges, as evidenced by the wider spread in that region.
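
A complementary view, sketched below, plots the residuals against the actual charges; a fan shape on the right would confirm that errors grow in the high-charge region.

# Hedged sketch: residuals vs actual charges (standardized units)
ggplot(test_df, aes(x = as.numeric(charges),
                    y = as.numeric(charges - predicted_charges))) +
  geom_point(alpha = 0.5, color = "blue") +
  geom_hline(yintercept = 0, color = "red", linetype = "dashed") +
  labs(title = "Residuals vs Actual Charges",
       x = "Actual Charges (standardized)", y = "Residual") +
  theme_minimal()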

SVM Binary Classification

The SVM classification model can handle both linear and non-linear classification tasks by projecting data into higher-dimensional spaces using kernel functions. Here it is used to predict a binary target variable, charges_category, which classifies healthcare costs as either “high” or “low” depending on whether the charges exceed the median value. The input features age, bmi, children, smoker, sex, and region are used to predict the target class. Categorical variables (sex, smoker, and region) are converted into factors, and the data is split 75%/25% into training and testing sets to evaluate the model’s ability to generalize to unseen data. The SVM identifies a hyperplane that best separates the two classes in the training data; this hyperplane is defined by the support vectors, the data points closest to the decision boundary.

# Create charges_category (binary classification)
Healthcare_Cost <- Healthcare_Cost %>%
  mutate(
    sex = as.factor(sex),
    smoker = as.factor(smoker),
    region = as.factor(region),
    charges_category = as.factor(ifelse(charges > median(charges), "high", "low")) 
  )

# Split the dataset
set.seed(123)
data_split <- sample(nrow(Healthcare_Cost), round(nrow(Healthcare_Cost) * 0.75), replace = FALSE)
train <- Healthcare_Cost[data_split, ]
test <- Healthcare_Cost[-data_split, ]
# Convert charges_category to a factor
train$charges_category <- as.factor(train$charges_category)
test$charges_category <- as.factor(test$charges_category)

# Train the SVM classification model
svm_prob <- svm(
  charges_category ~ age + bmi + children + smoker + sex + region,
  data = train,
  type = "C-classification",  
  kernel = "radial",          
  cost = 1,                   
  gamma = 0.1               
)

# Predict categories on the test dataset
test <- test %>%
  mutate(predicted_category = predict(svm_prob, newdata = test))

# Create a confusion matrix
conf_matrix <- confusionMatrix(test$predicted_category, test$charges_category)
print(conf_matrix)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction high low
##       high  141   7
##       low    17 169
##                                          
##                Accuracy : 0.9281         
##                  95% CI : (0.895, 0.9534)
##     No Information Rate : 0.5269         
##     P-Value [Acc > NIR] : < 2e-16        
##                                          
##                   Kappa : 0.8554         
##                                          
##  Mcnemar's Test P-Value : 0.06619        
##                                          
##             Sensitivity : 0.8924         
##             Specificity : 0.9602         
##          Pos Pred Value : 0.9527         
##          Neg Pred Value : 0.9086         
##              Prevalence : 0.4731         
##          Detection Rate : 0.4222         
##    Detection Prevalence : 0.4431         
##       Balanced Accuracy : 0.9263         
##                                          
##        'Positive' Class : high           
## 

The confusion matrix indicates that the model performs very well: with an accuracy of 92.81%, a high Kappa of 0.8554, and balanced sensitivity (89.24%) and specificity (96.02%), it demonstrates robust predictive capability. However, the slightly lower sensitivity compared to specificity suggests that the model misses some “high” cases (17 false negatives).
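
These headline figures can be recomputed by hand from the confusion-matrix counts, which makes their definitions explicit:

# Recomputing the metrics from the matrix (TP = 141, FN = 17, FP = 7, TN = 169;
# "high" is the positive class)
TP <- 141; FN <- 17; FP <- 7; TN <- 169
sensitivity <- TP / (TP + FN)                   # 141/158 = 0.8924
specificity <- TN / (TN + FP)                   # 169/176 = 0.9602
accuracy    <- (TP + TN) / (TP + TN + FP + FN)  # 310/334 = 0.9281
round(c(sensitivity = sensitivity, specificity = specificity, accuracy = accuracy), 4)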

Build Hyperplane with Non-Linear Kernel

The SVM is employed to construct a hyperplane (or decision boundary) in the feature space that optimally separates these two classes while maximizing the margin.

# Define the range for age and bmi
age_range <- seq(min(train$age) - 1, max(train$age) + 1, length.out = 100)
bmi_range <- seq(min(train$bmi) - 1, max(train$bmi) + 1, length.out = 100)

# Fix other variables to representative values
fixed_children <- round(mean(train$children))  #  mean number of children
fixed_smoker <- levels(train$smoker)[1]        # first level of smoker ("no")
fixed_sex <- levels(train$sex)[1]              # first level of sex ("female")
fixed_region <- levels(train$region)[1]        # first level of region ("northeast")

# Create the grid
grid <- expand.grid(
  age = age_range,
  bmi = bmi_range,
  children = fixed_children,
  smoker = fixed_smoker,
  sex = fixed_sex,
  region = fixed_region
)

# Ensure the data types match the training data
grid$children <- as.numeric(grid$children)  
grid$smoker <- as.factor(grid$smoker)     
grid$sex <- as.factor(grid$sex)            
grid$region <- as.factor(grid$region)     

# Predict classes for the grid
grid$predicted <- predict(svm_prob, newdata = grid)


# Plot data points and hyperplane
ggplot(train, aes(x = age, y = bmi, color = charges_category)) +
  geom_point(size = 2) +  # Plot data points
  geom_contour(data = grid, aes(x = age, y = bmi, z = as.numeric(predicted)),
               breaks = 1.5, color = "black") +  # factor levels map to 1/2, so the class boundary sits at 1.5
  labs(title = "SVM Hyperplane", x = "Age", y = "BMI") +
  theme_minimal()

In the visualization, the relationship between the two categories (high and low) is complex and cannot be separated by a straight line. The non-linear radial basis function (RBF) kernel handles intricate patterns in the data and creates flexible decision boundaries that adapt to them. The black contour traces the decision boundary in the age/BMI plane (with the other predictors held at the fixed representative values), and the grouping of points shows that the model has separated the classes.

Receiver Operating Characteristic (ROC) Curve

To plot a ROC curve, the response variable is treated as a binary classification problem. Fitting the SVM with probability = TRUE enables probability estimation, so the model produces class-membership probabilities rather than only hard labels.

# Classifier 1: SVM Model
svm_model1 <- svm(
  charges_category ~ age + bmi + children + smoker + sex + region,
  data = train,
  kernel = "radial",
  cost = 1,
  gamma = 0.1,
  probability = TRUE
)
# Generate Predicted Probabilities for Classifier 1
test$svm1_probs <- attr(predict(svm_model1, newdata = test, probability = TRUE), "probabilities")[, "high"]

# Classifier 2: SVM Model with Different Parameters
svm_model2 <- svm(
  charges_category ~ age + bmi + children + smoker + sex + region,
  data = train,
  kernel = "sigmoid",
  cost = 0.5,  # Lower regularization parameter (weaker error penalty)
  gamma = 0.2, # Higher kernel coefficient
  probability = TRUE
)
# Generate Predicted Probabilities for Classifier 2
test$svm2_probs <- attr(predict(svm_model2, newdata = test, probability = TRUE), "probabilities")[, "high"]

# Compute ROC Curves
roc1 <- roc(test$charges_category, test$svm1_probs, levels = c("low", "high"))
## Setting direction: controls < cases
roc2 <- roc(test$charges_category, test$svm2_probs, levels = c("low", "high"))
## Setting direction: controls < cases
# Plot ROC Curves
plot(roc1, col = "blue", main = "Comparing two ROCs with similar AUC", lwd = 2, legacy.axes = TRUE, xlab = "1 - Specificity", ylab = "Sensitivity")
lines(roc2, col = "green", lwd = 2)
# Add legend
legend(
  "bottomright",                        
  legend = c("SVM Model 1 (Cost=1, Gamma=0.1)", 
             "SVM Model 2 (Cost=0.5, Gamma=0.2)"), 
  col = c("blue", "green"),            
  lwd = 2                               
)

The blue ROC curve remains closer to the upper-left corner, indicating strong discriminative ability: Classifier 1 achieves high true positive rates with relatively few false positives across most thresholds. The green ROC curve indicates that Classifier 2 has slightly higher false positive rates at the same levels of sensitivity. The gray diagonal line represents a random classifier with no ability to distinguish between the two classes; both ROC curves lie well above this line, confirming that both SVM models outperform random guessing by a substantial margin.
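
The visual comparison can be quantified by computing the area under each curve with pROC's auc() on the roc objects created above:

# AUC for each classifier (closer to 1 is better)
auc(roc1)  # SVM Model 1 (radial kernel)
auc(roc2)  # SVM Model 2 (sigmoid kernel)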

Improving Performance by Hyperparameter Tuning

Hyperparameter tuning is conducted with cross-validation on the SVM using a radial (RBF) kernel, a non-linear transformation that implicitly maps the data into a higher-dimensional space. The RBF kernel allows the SVM to capture non-linear relationships, improving performance when the data is not linearly separable. The grid below searches cost over 2^0 to 2^5 and gamma over 2^-5 to 2^-1.

library(e1071)

# Tune the SVM hyperparameters using cross-validation
set.seed(123)
tune_result <- tune(
  svm,
  charges_category ~ .,
  data = train,
  kernel = "radial",
  ranges = list(cost = 2^(0:5), gamma = 2^(-5:-1))
)

# Best model parameters
best_model <- tune_result$best.model
summary(tune_result)
## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##  cost   gamma
##    32 0.03125
## 
## - best performance: 0.0139505 
## 
## - Detailed performance results:
##    cost   gamma      error  dispersion
## 1     1 0.03125 0.03284158 0.016880474
## 2     2 0.03125 0.03185149 0.017368333
## 3     4 0.03125 0.02589109 0.016356607
## 4     8 0.03125 0.02291089 0.013340210
## 5    16 0.03125 0.01494059 0.007055987
## 6    32 0.03125 0.01395050 0.009638484
## 7     1 0.06250 0.03185149 0.019191812
## 8     2 0.06250 0.02490099 0.013467712
## 9     4 0.06250 0.01991089 0.011480923
## 10    8 0.06250 0.01395050 0.009638484
## 11   16 0.06250 0.01395050 0.009638484
## 12   32 0.06250 0.01994059 0.014126970
## 13    1 0.12500 0.02690099 0.012508647
## 14    2 0.12500 0.01793069 0.010302894
## 15    4 0.12500 0.01495050 0.009724164
## 16    8 0.12500 0.01596040 0.012652778
## 17   16 0.12500 0.01994059 0.011528426
## 18   32 0.12500 0.02092079 0.010953720
## 19    1 0.25000 0.02091089 0.015137603
## 20    2 0.25000 0.01693069 0.009474534
## 21    4 0.25000 0.01793069 0.009161189
## 22    8 0.25000 0.01893069 0.008760222
## 23   16 0.25000 0.02391089 0.008415421
## 24   32 0.25000 0.02589109 0.011660048
## 25    1 0.50000 0.01891089 0.011918533
## 26    2 0.50000 0.02092079 0.009931641
## 27    4 0.50000 0.02292079 0.011599357
## 28    8 0.50000 0.02490099 0.015012856
## 29   16 0.50000 0.02787129 0.012198839
## 30   32 0.50000 0.02788119 0.015376294

The best cross-validation error of 0.0140 (about 1.4% misclassification) shows the tuned model’s accuracy, and 10-fold cross-validation makes the estimate more reliable than a single train/test split. One caveat: the formula charges_category ~ . includes charges itself among the predictors, and since charges_category was derived directly from charges, this leakage partly explains the very low error.
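
As a sanity check, the tuned model can be evaluated on the held-out test set to see whether the cross-validation estimate carries over to unseen data (a sketch; the same caveat about charges appearing among the predictors applies here).

# Evaluate the tuned model on the test set
tuned_preds <- predict(best_model, newdata = test)
confusionMatrix(tuned_preds, test$charges_category)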

Using Linear Kernels

The linear kernel is suitable for linearly separable data or datasets where a straight hyperplane can separate the classes effectively.

# Train the SVM model with a linear kernel and enable probabilities
svm_linear <- svm(
  charges_category ~ .,
  data = train,
  kernel = "linear",
  cost = best_model$cost,
  probability = TRUE  # Enable probability estimates
)

# Predict and obtain probabilities
test$predicted_probs_linear <- attr(predict(svm_linear, newdata = test, probability = TRUE), "probabilities")[, "high"]
head(test$predicted_probs_linear)
## [1] 0.99999903 0.99999990 0.05609239 0.99999990 0.98821155 0.99999990

Each probability represents the model’s confidence that a given observation belongs to the “high” class. Values close to 1 (e.g., 0.99999903, 0.99999990) indicate high confidence in the “high” class, while values close to 0 (e.g., 0.05609239) indicate high confidence that the observation belongs to the “low” class.
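
To turn these probabilities into class labels, a cut-off is needed; the 0.5 threshold below is a conventional assumption, and moving it would trade sensitivity against specificity.

# Hedged sketch: thresholding the linear-kernel probabilities at 0.5
pred_class_linear <- factor(ifelse(test$predicted_probs_linear > 0.5, "high", "low"),
                            levels = levels(test$charges_category))
mean(pred_class_linear == test$charges_category)  # test-set accuracy at this cut-off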

Cross-Validation for Evaluation

Five-fold cross-validation is used to evaluate a Support Vector Machine (SVM) with a radial kernel for classifying charges_category as “high” or “low”: the dataset is divided into five equal subsets, four folds are used for training and one for testing, and the process repeats five times so that each fold serves once as the test set. This provides a more robust estimate of the model’s performance than a single split.

# Define training control
train_control <- trainControl(method = "cv", number = 5)

# Train the model with cross-validation
svm_cv <- train(
  charges_category ~ .,
  data = train,
  method = "svmRadial",
  trControl = train_control,
  tuneGrid = expand.grid(C = tune_result$best.parameters$cost, sigma = tune_result$best.parameters$gamma)
)

# Evaluate performance
svm_cv
## Support Vector Machines with Radial Basis Function Kernel 
## 
## 1004 samples
##    7 predictor
##    2 classes: 'high', 'low' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 803, 804, 802, 803, 804 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.9721191  0.9442261
## 
## Tuning parameter 'sigma' was held constant at a value of 0.03125
## 
## Tuning parameter 'C' was held constant at a value of 32

The data consist of 1004 samples with 7 predictors and 2 classes (high and low). The fold sizes are consistent: roughly 80% of the samples (e.g., 803 or 804) are used for fitting in each fold, with the remaining 20% used for validation. The 5-fold cross-validation results demonstrate the strength of the SVM model with an RBF kernel: the high accuracy (97.21%) and Kappa statistic (0.9442) indicate the model is well tuned and effective at classifying charges_category as “high” or “low.” As in the tuning step above, charges is among the 7 predictors, which inflates these figures.
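
caret also retains the per-fold metrics, which show how stable the accuracy estimate is across the five folds (a small sketch using the fitted svm_cv object):

# Per-fold resampling results
svm_cv$resample               # Accuracy and Kappa for each fold
sd(svm_cv$resample$Accuracy)  # spread of the fold-level accuracies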

Feature Selection

Feature engineering directly impacts the regression hyperplane: well-represented, relevant features reduce noise and improve the SVM’s ability to focus on meaningful patterns. Combined with hyperparameter tuning, feature engineering enables the SVM to construct a non-linear decision surface that captures the intricacies of the dataset while avoiding overfitting.

# SVM hyperparameter tuning
tune_result <- tune(
  svm,
  charges ~ .,  # Use all features for modeling
  data = train_df, 
  kernel = "radial",
  ranges = list(cost = 2^(0:5), gamma = 2^(-5:-1))
)

# Train the best model
best_model <- tune_result$best.model
summary(best_model)
## 
## Call:
## best.tune(METHOD = svm, train.x = charges ~ ., data = train_df, ranges = list(cost = 2^(0:5), 
##     gamma = 2^(-5:-1)), kernel = "radial")
## 
## 
## Parameters:
##    SVM-Type:  eps-regression 
##  SVM-Kernel:  radial 
##        cost:  32 
##       gamma:  0.03125 
##     epsilon:  0.1 
## 
## 
## Number of Support Vectors:  347
# Scale numeric features (excluding the target 'charges'); age and bmi were
# already standardized above, so this re-scaling is effectively a no-op
numeric_cols <- setdiff(names(train_df)[sapply(train_df, is.numeric)], "charges")
train_df[, numeric_cols] <- scale(train_df[, numeric_cols])
test_df[, numeric_cols] <- scale(test_df[, numeric_cols])

# Cross-validation control
ctrl <- trainControl(method = "cv", number = 10)  # 10-fold cross-validation

# Perform cross-validation for SVM with radial kernel
svm_cv <- train(
  charges ~ .,  
  data = train_df,
  method = "svmRadial",
  trControl = ctrl,
  tuneLength = 5
)

# Print the cross-validation results
print(svm_cv)
## Support Vector Machines with Radial Basis Function Kernel 
## 
## 1003 samples
##    6 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 903, 903, 903, 903, 903, 903, ... 
## Resampling results across tuning parameters:
## 
##   C     RMSE       Rsquared   MAE      
##   0.25  0.4135637  0.8306700  0.2226263
##   0.50  0.3987401  0.8381194  0.2148013
##   1.00  0.3924086  0.8414038  0.2134274
##   2.00  0.3924064  0.8411199  0.2138868
##   4.00  0.3939351  0.8397814  0.2145014
## 
## Tuning parameter 'sigma' was held constant at a value of 0.09129893
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.09129893 and C = 2.

Comparison of RF vs. SVM

The Random Forest (RF) model is an ensemble learning method that constructs multiple decision trees and aggregates their predictions; it handles non-linear relationships and captures complex interactions between features. By combining the outputs of many trees, RF reduces overfitting and increases robustness. SVM regression with an RBF kernel maps the input data into a higher-dimensional space to capture non-linear relationships, constructing a regression function that keeps prediction errors within a specified tolerance.

library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
## The following object is masked from 'package:dplyr':
## 
##     combine
# Random Forest
set.seed(123)
rf_model <- randomForest(
  charges ~ age + bmi + children + smoker + sex + region,
  data = train_df,
  ntree = 500,
  importance = TRUE
)

# Predict and evaluate RF
rf_preds <- predict(rf_model, test_df)
rf_rmse <- sqrt(mean((rf_preds - test_df$charges)^2))
cat("Random Forest RMSE:", rf_rmse, "\n")
## Random Forest RMSE: 0.3935922
# SVM (Radial Kernel)
set.seed(123)
svm_model <- svm(
  charges ~ age + bmi + children + smoker + sex + region,
  data = train_df,
  kernel = "radial",
  cost = 1,
  gamma = 0.1
)

# Predict and evaluate SVM
svm_preds <- predict(svm_model, test_df)
svm_rmse <- sqrt(mean((svm_preds - test_df$charges)^2))
cat("SVM RMSE:", svm_rmse, "\n")
## SVM RMSE: 0.418551
# Compare RMSE
if (rf_rmse < svm_rmse) {
  cat("Random Forest performs better.\n")
} else {
  cat("SVM performs better.\n")
}
## Random Forest performs better.

For this analysis of predicting healthcare costs with the available dataset, Random Forest (RF) is recommended: it handled the complex, non-linear relationships without requiring extensive hyperparameter tuning and achieved the lower test RMSE (0.394 vs. 0.419 for the SVM, in standardized units).
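
Because rf_model was fitted with importance = TRUE, the relative contribution of each predictor can be inspected directly; this is a short sketch, with the expectation (not a guarantee) that smoker and bmi rank highly for this dataset.

# Variable importance from the fitted random forest
importance(rf_model)  # %IncMSE and IncNodePurity per feature
varImpPlot(rf_model)  # visual ranking of the predictors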

Comparing Random Forest and SVM Performance for Predicting Healthcare Costs

When choosing between Random Forest and Support Vector Machine (SVM) for predicting healthcare costs, several factors must be considered, including dataset size, feature types, the presence of noise, interpretability, hyperparameter tuning, and computational complexity, alongside empirical performance. Each algorithm has strengths and weaknesses that make it suitable in different circumstances.

Dataset size is an important factor. Random Forest performs effectively on large datasets, as it can manage a considerable number of samples and features without a high risk of overfitting. In contrast, SVM may struggle with large datasets because its training complexity grows roughly quadratically with the number of samples. For large datasets, Random Forest is generally the more suitable choice, offering scalability and reliability.

The types of features also influence algorithm selection. Random Forest handles both numerical and categorical data efficiently with minimal preprocessing, making it highly versatile. SVM works best with numerical data and often requires preprocessing steps, such as encoding categorical variables and scaling features, to perform well. If the dataset contains many categorical variables or preprocessing is challenging, Random Forest is the recommended algorithm.

Noise and outliers can significantly impact model performance. Random Forest is inherently robust to noisy data and outliers because its ensemble approach aggregates predictions from multiple decision trees, diluting the influence of anomalies. SVM is more sensitive to noise, especially when classes overlap or are not well separated. For datasets with significant noise or outliers, Random Forest is the better option.

The complexity of hyperparameter tuning also differs. Random Forest requires tuning relatively few hyperparameters, such as the number of trees, maximum depth, and minimum samples per split. SVM demands careful tuning of the regularization parameter (C), the kernel coefficient (gamma), and the choice of kernel, which can be computationally expensive. For faster and simpler tuning, Random Forest is recommended.

Computational efficiency is another key consideration. Random Forest offers fast training and prediction, especially on large datasets, whereas SVM training slows considerably as the dataset grows. Cross-validation remains a robust way to compare the two empirically.

In summary, Random Forest is often the better choice for predicting healthcare costs due to its scalability, versatility, robustness, ease of use, and interpretability. SVM has strengths, particularly on smaller, well-separated datasets with minimal noise, but it requires more preprocessing and computational resources.
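
As a rough, hedged check of the computational-efficiency point above, the training times of the two models can be compared directly. Absolute times are machine-dependent, and at roughly 1,000 rows both fits are fast; only the relative magnitudes are informative.

# Hedged sketch: rough training-time comparison
system.time(randomForest(charges ~ ., data = train_df, ntree = 500))
system.time(svm(charges ~ ., data = train_df, kernel = "radial"))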

Is it better for classification or regression scenarios?

The choice of algorithm also depends on whether the task involves regression or classification. For regression tasks, such as predicting healthcare costs as a continuous variable, Random Forest is naturally suited and effectively models complex, non-linear relationships. While SVM can perform regression (SVR), it requires precise tuning and is more effective on clean, less noisy datasets; Random Forest is therefore generally more robust and reliable for regression. In classification tasks, such as categorizing healthcare costs into “low” and “high,” Random Forest performs well on imbalanced datasets and includes built-in mechanisms such as class weighting (sketched below) to address class imbalance. SVM is effective when the classes are well separated and the dataset is moderate in size, but Random Forest often proves the better choice for its ability to handle imbalanced datasets with ease and reliability.
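
As a hedged illustration of the class-weighting point, both packages accept per-class weights. The 2:1 weights below are purely hypothetical, since the median split makes charges_category roughly balanced and no reweighting is actually needed here.

# Hedged sketch: class weighting in e1071 and randomForest (illustrative weights)
svm_weighted <- svm(charges_category ~ age + bmi + children + smoker + sex + region,
                    data = train, kernel = "radial",
                    class.weights = c(high = 2, low = 1))  # penalize "high" errors more
rf_weighted <- randomForest(charges_category ~ age + bmi + children + smoker + sex + region,
                            data = train,
                            classwt = c(high = 2, low = 1))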

Do you agree with the recommendations? Why?

Yes, I agree with the recommendations stated above, because Random Forest is robust, flexible, and capable of predicting healthcare costs in both classification and regression settings. That said, the SVM classifier also performed strongly here (92.81% test accuracy), so the preference for Random Forest reflects convenience and robustness rather than a categorical failure of SVM.

Academic Article References

  1. “A hybrid mental health prediction model using Support Vector Machine, Multilayer Perceptron, and Random Forest algorithms.” ScienceDirect. https://www.sciencedirect.com/science/article/pii/S2772442523000527#sec3. This study evaluates the performance of SVM, Random Forest, and Multilayer Perceptron for predicting mental health outcomes.

  2. “Comparison of Models for the Prediction of Medical Costs of Spinal Fusion in Taiwan Diagnosis-Related Groups by Machine Learning Algorithms.” PubMed Central (PMC). https://pmc.ncbi.nlm.nih.gov/articles/PMC5820083/. This article compares Decision Trees and SVMs for predicting spinal fusion costs.

  3. “Comparing different supervised machine learning algorithms for disease prediction.” BMC Medical Informatics and Decision Making. https://bmcmedinformdecismak.biomedcentral.com/articles/10.1186/s12911-019-1004-8. This research investigates supervised learners, including Decision Trees and SVMs, for disease prediction; it finds that SVMs often achieve higher accuracy on complex datasets, whereas Decision Trees are advantageous for interpretability.