Predictive Modeling


Insurance Health Plans


You are just hired as a Senior Data Scientist in Bank of Universe (BOU). BOU is a private company founded twenty years ago and now has more than 5,000 employees. BOU did all the research, chose the insurance company, and picked plan options for employees twenty years ago. The new CEO, Mr. Buffet, wants to make changes and offer self-funded Health Plans (SHP) starting next year. SHP is cheaper for BOU, since BOU does not have to pay for the separate insurance carrier by taking some risks. BOU has received several years’ medical costs in file insurance.csv Download insurance.csvfrom the current insurance carrier.

Data source: From Kaggle https://www.kaggle.com/mirichoi0218/insurance/home

It contains the following columns:

  • age: age of primary beneficiary
  • sex: insurance contractor gender, female, male
  • bmi: Body mass index, providing an understanding of body, weights that are relatively high or low relative to height, objective index of body weight (kg / m ^ 2) using the ratio of height to weight, ideally 18.5 to 24.9
  • children: Number of children covered by health insurance / Number of dependents
  • smoker: smoking
  • region: the beneficiary’s residential area in the US, northeast, southeast, southwest, northwest.
  • charges: individual medical costs billed by health insurance


1. Data Preparation.

Section A:

Load the dataset insurance.csv into memory.

insurance <- read.csv("data/insurance.csv")

Display the dimension of the data frame

dim(insurance)
## [1] 1338    7

Display the first six rows of the data frame

head(insurance)
##   age    sex    bmi children smoker    region   charges
## 1  19 female 27.900        0    yes southwest 16884.924
## 2  18   male 33.770        1     no southeast  1725.552
## 3  28   male 33.000        3     no southeast  4449.462
## 4  33   male 22.705        0     no northwest 21984.471
## 5  32   male 28.880        0     no northwest  3866.855
## 6  31 female 25.740        0     no southeast  3756.622

The code loads and explores the insurance.csv dataset, which contains 1,338 rows and 7 variables related to individuals’ demographics, health behaviors, and insurance charges. An initial preview shows a mix of numeric and categorical variables, with noticeable variation in medical costs. For example, smokers tend to have higher charges. Overall, the dataset is well-structured and ready for further analysis or predictive modeling.


Section B:

In the data frame, transform the variable charges by setting insurance$charges = log(insurance$charges). Do not transform it outside of the data frame.

insurance$charges <- log(insurance$charges)

Display the first ten occurrences

head(insurance$charges, 10)
##  [1]  9.734176  7.453302  8.400538  9.998092  8.260197  8.231275  9.016827
##  [8]  8.893093  8.765054 10.272397

The code applies a log transformation to the charges variable in the insurance dataset to reduce skewness, normalize the distribution, and stabilize variance—making the data more suitable for regression modeling. The transformed values are compressed and more statistically manageable, improving model performance and interpretability.


Section C:

Using the data set from 1.b, use the model.matrix() function to create another data set that uses dummy variables in place of categorical variables. Verify that the first column only has ones (1) as values, and then discard the column only after verifying it has only ones as values.

insurance_dummy <- model.matrix(charges ~ ., data = insurance)

Verify first column has only 1’s

all(insurance_dummy[, 1] == 1)
## [1] TRUE

Display the first ten rows

head(insurance_dummy[, 1], 10)
##  1  2  3  4  5  6  7  8  9 10 
##  1  1  1  1  1  1  1  1  1  1

Remove the intercept column (first column)

insurance_dummy <- insurance_dummy[, -1]

The code prepares the insurance dataset for regression modeling by converting categorical variables into dummy variables using model.matrix(). It confirms that the first column (the intercept) contains only ones, then removes it to avoid redundancy, ensuring the dataset is fully numeric and ready for further analysis or modeling.


Section D:

Use the sample() function with set.seed equal to 1 to generate row indexes for your training and tests sets, with 2/3 of the row indexes for your training set and 1/3 for your test set. Do not use any method other than the sample() function for splitting your data.

Generate row indexes

set.seed(1)
sample_index <- sample(1:nrow(insurance), nrow(insurance) * 2/3)

The code performs a randomized split of the insurance dataset into training and testing sets, using two-thirds of the data for training. It sets a seed for reproducibility and uses the sample() function to generate row indices for the training set. This ensures consistent and unbiased model evaluation on unseen data.


Section E:

Create a training and test data set from the data set created in 1.b using the training and test row indexes created in 1.d. Unless otherwise stated, only use the training and test data sets created in this step.

Create a training data set and display its dimensions

train_data <- insurance[sample_index, ]
dim(train_data)
## [1] 892   7

Create a testing data set and display its dimensions

test_data <- insurance[-sample_index, ]
dim(test_data)
## [1] 446   7

The code splits the insurance dataset into a training set with 892 rows and a test set with 446 rows using previously sampled indices. This step ensures a proper division of data for building and evaluating predictive models, supporting accurate and unbiased model performance assessment.


Section F:

Create a training and test data set from data set created in 1.c using the training and test row indexes created in 1.d

Create a dummy training data set and display its dimensions

train_dummy <- insurance_dummy[sample_index, ]
dim(train_dummy)
## [1] 892   8

Create a dummy testing data set and display its dimensions

test_dummy <- insurance_dummy[-sample_index, ]
dim(test_dummy)
## [1] 446   8

The code splits the dummy-variable version of the insurance dataset into a training set with 892 rows and a test set with 446 rows using previously generated indices. This prepares fully numeric data for machine learning models that require consistent input formats for training and evaluation.



2. Build a multiple linear regression model.

Section A:

Perform multiple linear regression with charges as the response and the predictors are age, sex, bmi, children, smoker, and region. Print out the results using the summary() function. Use the training data set created in step 1.e to train your model.

set.seed(1)
lm_model <- lm(charges ~ age + sex + bmi + children + smoker + region, 
               data = train_data)
summary(lm_model)
## 
## Call:
## lm(formula = charges ~ age + sex + bmi + children + smoker + 
##     region, data = train_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.03533 -0.17138 -0.03868  0.06382  2.12159 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      7.042076   0.082891  84.956  < 2e-16 ***
## age              0.034915   0.001030  33.911  < 2e-16 ***
## sexmale         -0.063228   0.028700  -2.203 0.027847 *  
## bmi              0.011805   0.002441   4.836 1.56e-06 ***
## children         0.100152   0.011894   8.420  < 2e-16 ***
## smokeryes        1.533561   0.035475  43.229  < 2e-16 ***
## regionnorthwest -0.082398   0.041040  -2.008 0.044972 *  
## regionsoutheast -0.140025   0.040180  -3.485 0.000516 ***
## regionsouthwest -0.107718   0.041108  -2.620 0.008934 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4259 on 883 degrees of freedom
## Multiple R-squared:  0.7839, Adjusted R-squared:  0.7819 
## F-statistic: 400.4 on 8 and 883 DF,  p-value: < 2.2e-16

The regression analysis shows that insurance charges are strongly influenced by several factors, with the model explaining over 78% of the variation in charges. Key predictors such as age, BMI, number of children, and especially smoking status have significant positive effects on charges. Smoking is the most impactful, leading to much higher costs. While gender is statistically significant, its effect is minor. Regional differences also exist, with the Southeast and Southwest associated with lower charges. Overall, the model effectively highlights the main factors in driving insurance costs.


Section B:

Is there a relationship between the predictors and the response?

The multiple linear regression model shows a strong relationship between the predictors and insurance charges, with an adjusted R-squared of 78.2%, indicating the model explains most of the variation in charges. Key predictors such as age, BMI, number of children, and especially smoking status significantly influence charges. Smokers are associated with notably higher charges, making it the strongest predictor. Geographic regions also impact charges, with the Southeast and Southwest linked to lower costs. Overall, the model confirms that several demographic and behavioral factors meaningfully affect insurance charges.


Section C:

Does sex have a statistically significant relationship to the response?

The model shows that sex is a statistically significant predictor of insurance charges, with males having slightly higher charges than females. However, the effect size is small, indicating that while gender does influence charges, its practical impact is minimal compared to stronger predictors like smoking status.


Section D:

Perform best subset selection using the stepAIC() function from the MASS library, choose best model based on AIC. For the “direction” parameter in the stepAIC() method, set direciton=“backward”

lm_best <- stepAIC(lm_model, direction = "backward")
## Start:  AIC=-1513.99
## charges ~ age + sex + bmi + children + smoker + region
## 
##            Df Sum of Sq    RSS      AIC
## <none>                  160.13 -1513.99
## - sex       1      0.88 161.01 -1511.10
## - region    3      2.39 162.52 -1506.80
## - bmi       1      4.24 164.37 -1492.67
## - children  1     12.86 172.99 -1447.09
## - age       1    208.54 368.67  -772.14
## - smoker    1    338.89 499.02  -502.09
summary(lm_best)
## 
## Call:
## lm(formula = charges ~ age + sex + bmi + children + smoker + 
##     region, data = train_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.03533 -0.17138 -0.03868  0.06382  2.12159 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      7.042076   0.082891  84.956  < 2e-16 ***
## age              0.034915   0.001030  33.911  < 2e-16 ***
## sexmale         -0.063228   0.028700  -2.203 0.027847 *  
## bmi              0.011805   0.002441   4.836 1.56e-06 ***
## children         0.100152   0.011894   8.420  < 2e-16 ***
## smokeryes        1.533561   0.035475  43.229  < 2e-16 ***
## regionnorthwest -0.082398   0.041040  -2.008 0.044972 *  
## regionsoutheast -0.140025   0.040180  -3.485 0.000516 ***
## regionsouthwest -0.107718   0.041108  -2.620 0.008934 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4259 on 883 degrees of freedom
## Multiple R-squared:  0.7839, Adjusted R-squared:  0.7819 
## F-statistic: 400.4 on 8 and 883 DF,  p-value: < 2.2e-16

The backward stepwise regression analysis identified a model that includes age, sex, BMI, number of children, smoking status, and region as significant predictors of insurance charges. All variables remained in the model because removing any of them increased the AIC, indicating each contribute meaningfully. Smoking status had the strongest impact, significantly increasing charges, followed by age and regional differences. The model explains about 78% of the variance in charges (R² = 0.784), and all predictors are statistically significant. Overall, the model is robust, accurate, and highlights key factors influencing medical insurance costs.


Section E:

Compute the test error of the best model in 3d based on AIC using LOOCV using trainControl() and train() from the caret library. Report the MSE by squaring the reported RMSE.

ctrl_loocv <- trainControl(method = "LOOCV")
set.seed(1)
lm_loocv <- train(charges ~ age + bmi + children + smoker + region, 
                  data = train_data, 
                  method = "lm", 
                  trControl = ctrl_loocv)
mse_loocv <- lm_loocv$results$RMSE^2
mse_loocv
## [1] 0.1839504

The Leave-One-Out Cross-Validation (LOOCV) was used to evaluate the performance of the best regression model selected by AIC. Using the caret package in R, the model was trained and tested across all individual data points. The resulting Mean Squared Error (MSE) was approximately *0.184**, indicating that the model has good predictive accuracy and generalizes well to unseen data.


Section F:

Calculate the test error of the best model in 3d based on AIC using 10-fold Cross-Validation. Use train and trainControl from the caret library. Refer to model selected in 3d based on AIC. Report the MSE.

ctrl_cv10 <- trainControl(method = "cv", number = 10)
set.seed(1)
lm_cv10 <- train(charges ~ age + bmi + children + smoker + region, 
                 data = train_data, 
                 method = "lm", 
                 trControl = ctrl_cv10)
mse_cv10 <- lm_cv10$results$RMSE^2
mse_cv10
## [1] 0.1806693

The model was trained with selected predictors (age, BMI, children, smoker, and region) and achieved a Mean Squared Error (MSE) of approximately 0.181. This low MSE indicates strong predictive accuracy and suggests that the model generalizes well to unseen data, reaffirming its reliability for estimating insurance charges.


Section G:

Calculate and report the test MSE using the best model from 2.d and test data set created in step 1.e.

pred_lm <- predict(lm_best, newdata = test_data)
mse_test_lm <- mean((test_data$charges - pred_lm)^2)
mse_test_lm
## [1] 0.231291

The resulting MSE is approximately 0.231, indicating a low average prediction error on unseen data. While slightly higher than the cross-validation errors, this is expected and confirms that the model generalizes well and maintains strong predictive performance.


Section H:

Compare the test MSE calculated in step 2.f using 10-fold cross-validation with the test MSE calculated in step 2.g. How similar are they?

print(c(CV10_MSE = mse_cv10, Test_MSE = mse_test_lm))
##  CV10_MSE  Test_MSE 
## 0.1806693 0.2312910

The code compares the 10-fold cross-validation MSE (0.1807) with the test set MSE (0.2313). While the test MSE is slightly higher—as expected due to its evaluation on unseen data—the values are relatively close. This indicates that the model generalizes well, is not overfitting, and performs reliably on new data.



3. Build a regression tree model.

Section A:

Build a regression tree model using function tree(), where charges is the response and the predictors are age, sex, bmi, children, smoker, and region.

tree_model <- tree(charges ~ age + sex + bmi + children + smoker + region, 
                   data = train_data)
summary(tree_model)
## 
## Regression tree:
## tree(formula = charges ~ age + sex + bmi + children + smoker + 
##     region, data = train_data)
## Variables actually used in tree construction:
## [1] "age"      "children"
## Number of terminal nodes:  5 
## Residual mean deviance:  0.5649 = 501 / 887 
## Distribution of residuals:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -1.2870 -0.4765 -0.2651  0.0000  0.4276  2.4230

The regression tree model was built to predict insurance charges using several predictors, but only age and number of children were used in the final tree, indicating their stronger influence. The model has five terminal nodes and a residual mean deviance of 0.5649, reflecting moderate predictive accuracy. Residuals show a balanced distribution around zero, but the range indicates potential prediction errors. Overall, while the model highlights key predictors, data preprocessing needs improvement for better performance.


Section B:

Find the optimal tree by using cross-validation and display the results in a graphic. Report the best size.

cv_tree <- cv.tree(tree_model)
plot(cv_tree$size, cv_tree$dev, type = "b", xlab = "Tree Size", ylab = "Deviance")

The plot shows that deviance decreases as tree size increases, with a sharp improvement up to size 3. Beyond that, additional splits yield minimal performance gains. This suggests that a tree size of 3 is optimal, offering a good balance between model accuracy and complexity, and helps avoid overfitting based on the bias-variance trade-off.


Section C:

Justify the number you picked for the optimal tree with regard to the principle of variance-bias trade-off.

best_size <- 3

A tree size of 3 is optimal because it strikes a balance between bias and variance. It significantly reduces bias compared to smaller trees while avoiding the overfitting and high variance that come with larger trees, aligning well with the bias-variance trade-off principle.


Section D:

Prune the tree using the optinal size found in 3.b.

pruned_tree <- prune.tree(tree_model, best = best_size)
summary(pruned_tree)
## 
## Regression tree:
## snip.tree(tree = tree_model, nodes = 3:4)
## Variables actually used in tree construction:
## [1] "age"
## Number of terminal nodes:  3 
## Residual mean deviance:  0.5959 = 529.7 / 889 
## Distribution of residuals:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -1.2870 -0.5271 -0.2305  0.0000  0.4768  2.3320

The pruned regression tree, reduced to three terminal nodes, uses only the variable age for prediction. Despite being simpler, it maintains similar predictive performance to the unpruned tree, with only a slight increase in deviance. The residuals remain centered around zero, indicating unbiased predictions. Overall, the pruned model achieves a strong balance between simplicity and accuracy, aligning well with the bias-variance trade-off.


Section E:

Plot the best tree model and give labels.

plot(pruned_tree)
text(pruned_tree, pretty = 0)

The pruned regression tree uses age as the sole predictor to group individuals into three categories based on insurance charges. It splits at ages 22.5 and 36.5, showing that predicted charges increase with age. Despite its simplicity, the model effectively captures the relationship between age and healthcare costs, offering clear and interpretable predictions.


Section F:

Calculate the test MSE for the best model.

pred_tree <- predict(pruned_tree, newdata = test_data)
mse_test_tree <- mean((test_data$charges - pred_tree)^2)
mse_test_tree
## [1] 0.7035004

The pruned regression tree achieved a test MSE of 0.7035, indicating reasonable predictive performance on unseen data. A warning about NAs introduced by coercion suggests that some variables in the test set may be improperly formatted, likely due to unconverted categorical data. Despite this, the model—based solely on age—performed adequately, though data formatting should be reviewed for cleaner execution.



4. Build a random forest model.

Section A:

Build a random forest model using function randomForest(), where charges is the response and the predictors are age, sex, bmi, children, smoker, and region.

set.seed(1)
rf_model <- randomForest(charges ~ age + sex + bmi + children + smoker + region, 
                         data = train_data, importance = TRUE)
summary(rf_model)
##                 Length Class  Mode     
## call              4    -none- call     
## type              1    -none- character
## predicted       892    -none- numeric  
## mse             500    -none- numeric  
## rsq             500    -none- numeric  
## oob.times       892    -none- numeric  
## importance       12    -none- numeric  
## importanceSD      6    -none- numeric  
## localImportance   0    -none- NULL     
## proximity         0    -none- NULL     
## ntree             1    -none- numeric  
## mtry              1    -none- numeric  
## forest           11    -none- list     
## coefs             0    -none- NULL     
## y               892    -none- numeric  
## test              0    -none- NULL     
## inbag             0    -none- NULL     
## terms             3    terms  call

The Random Forest regression model predict healthcare charges using predictors such as age, sex, BMI, children, smoker status, and region. The model was built with 500 trees and set to measure variable importance. The summary includes predicted values for 892 observations, and tracks model performance via Mean Squared Error (MSE) and R-squared across all trees. It also supports out-of-bag (OOB) error estimation for internal validation. The setup enables later analysis of which variables most influence the predicted charges.

Section B:

Compute the test error using the test data set.

pred_rf <- predict(rf_model, newdata = test_data)
mse_test_rf <- mean((test_data$charges - pred_rf)^2)
mse_test_rf
## [1] 0.1779905

The Random Forest regression model was evaluated using test data by generating predictions and calculating the Mean Squared Error (MSE) to assess its performance. The resulting MSE of 0.1779905 suggests that the model provides accurate predictions and generalizes well to new data, particularly if the response variable was log-transformed.


Section C:

Extract variable importance measure using the importance() function.

importance(rf_model)
##             %IncMSE IncNodePurity
## age      107.746484    244.796624
## sex        4.645977      7.228707
## bmi       20.522785     56.674792
## children  31.544284     32.628370
## smoker   146.783672    310.455172
## region     8.976811     14.513753

The variable importance output from the Random Forest model shows that smoker status and age are the most influential predictors of healthcare charges, with the highest values in both %IncMSE and IncNodePurity. BMI and children have moderate impact, while region and sex contribute the least. This indicates that lifestyle and demographic factors, especially smoking, play a key role in predicting healthcare costs.


Section D:

Plot the variable importance using the function, varImpPlot(). Which are the top 3 important predictors in this model?

varImpPlot(rf_model)

The variable importance plot from the Random Forest model shows that smoker, age, and BMI are the top three predictors influencing healthcare charges. Smoker is the most significant by far, followed by age, while BMI ranks slightly above children in importance. These variables contribute most to the model’s accuracy and decision-making process.



5. Build a support vector machine model.

Section A:

The response is charges and the predictors are age, sex, bmi, children, smoker, and region. Please use the svm() function with radial kernel and gamma=5 and cost = 50.

set.seed(1)
svm_model <- svm(charges ~ age + sex + bmi + children + smoker + region, 
                 data = train_data, kernel = "radial", gamma = 5, cost = 50)
summary(svm_model)
## 
## Call:
## svm(formula = charges ~ age + sex + bmi + children + smoker + region, 
##     data = train_data, kernel = "radial", gamma = 5, cost = 50)
## 
## 
## Parameters:
##    SVM-Type:  eps-regression 
##  SVM-Kernel:  radial 
##        cost:  50 
##       gamma:  5 
##     epsilon:  0.1 
## 
## 
## Number of Support Vectors:  715

This SVM regression model predicts insurance charges using predictors like age, sex, BMI, children, smoking status, and region. It employs a radial kernel with a high gamma (5) and cost (50), indicating a complex, flexible model tailored to the training data. The model uses epsilon-regression with a tolerance of 0.1, allowing small prediction errors. A total of 715 support vectors were used, suggesting many data points influence the model, which may lead to overfitting. While the model is likely accurate on training data, its generalization should be tested on unseen data to confirm performance.


Section B:

Perform a grid search to find the best model with potential cost: 1, 10, 50, 100 and potential gamma: 1,3 and 5 and potential kernel: “linear”,“radial” and “sigmoid”. And use the training set created in step 1.e.

set.seed(1)
svm_tune <- tune(svm, charges ~ ., data = train_data,
                 ranges = list(kernel = c("linear", "radial", "sigmoid"),
                               cost = c(1, 10, 50, 100),
                               gamma = c(1, 3, 5)))
summary(svm_tune)
## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##  kernel cost gamma
##  radial    1     1
## 
## - best performance: 0.1902345 
## 
## - Detailed performance results:
##     kernel cost gamma        error   dispersion
## 1   linear    1     1 1.948499e-01 4.769998e-02
## 2   radial    1     1 1.902345e-01 6.255756e-02
## 3  sigmoid    1     1 2.299086e+03 3.024145e+02
## 4   linear   10     1 1.950497e-01 4.765131e-02
## 5   radial   10     1 2.389178e-01 7.217036e-02
## 6  sigmoid   10     1 2.274547e+05 3.370394e+04
## 7   linear   50     1 1.950977e-01 4.767836e-02
## 8   radial   50     1 3.462730e-01 7.312303e-02
## 9  sigmoid   50     1 5.685687e+06 9.284122e+05
## 10  linear  100     1 1.950707e-01 4.768355e-02
## 11  radial  100     1 4.390889e-01 1.115394e-01
## 12 sigmoid  100     1 2.354112e+07 3.291762e+06
## 13  linear    1     3 1.948499e-01 4.769998e-02
## 14  radial    1     3 3.811084e-01 6.663638e-02
## 15 sigmoid    1     3 4.489028e+03 3.940325e+02
## 16  linear   10     3 1.950497e-01 4.765131e-02
## 17  radial   10     3 4.050899e-01 4.282095e-02
## 18 sigmoid   10     3 4.474026e+05 4.804788e+04
## 19  linear   50     3 1.950977e-01 4.767836e-02
## 20  radial   50     3 4.543307e-01 5.637178e-02
## 21 sigmoid   50     3 1.117223e+07 1.033690e+06
## 22  linear  100     3 1.950707e-01 4.768355e-02
## 23  radial  100     3 4.771536e-01 7.218485e-02
## 24 sigmoid  100     3 4.451442e+07 4.405710e+06
## 25  linear    1     5 1.948499e-01 4.769998e-02
## 26  radial    1     5 4.785357e-01 7.555896e-02
## 27 sigmoid    1     5 5.151436e+03 4.795631e+02
## 28  linear   10     5 1.950497e-01 4.765131e-02
## 29  radial   10     5 4.764771e-01 5.680321e-02
## 30 sigmoid   10     5 4.847946e+05 3.574412e+04
## 31  linear   50     5 1.950977e-01 4.767836e-02
## 32  radial   50     5 4.894965e-01 5.697417e-02
## 33 sigmoid   50     5 1.224914e+07 1.227219e+06
## 34  linear  100     5 1.950707e-01 4.768355e-02
## 35  radial  100     5 4.920183e-01 5.631912e-02
## 36 sigmoid  100     5 4.972590e+07 3.832052e+06

A grid search was conducted to find the best SVM regression model using different combinations of kernel types, cost values, and gamma values. The best-performing model used a radial kernel with cost = 1 and gamma = 1, achieving the lowest cross-validation error of 0.1902. This configuration outperformed others, including models with linear and sigmoid kernels. The results indicate that a radial SVM with low cost and gamma values provides the best balance of flexibility and generalization for predicting insurance charges.


Section C:

Print out the model results. What are the best model parameters?

best_svm <- svm_tune$best.model
best_svm
## 
## Call:
## best.tune(METHOD = svm, train.x = charges ~ ., data = train_data, 
##     ranges = list(kernel = c("linear", "radial", "sigmoid"), cost = c(1, 
##         10, 50, 100), gamma = c(1, 3, 5)))
## 
## 
## Parameters:
##    SVM-Type:  eps-regression 
##  SVM-Kernel:  radial 
##        cost:  1 
##       gamma:  1 
##     epsilon:  0.1 
## 
## 
## Number of Support Vectors:  493

The best SVM regression model uses a radial kernel with cost = 1, gamma = 1, and epsilon = 0.1, selected through cross-validation. It relies on 493 support vectors, indicating a balanced model that captures non-linear patterns while maintaining generalizability. This configuration provides an effective trade-off between model accuracy and complexity for predicting insurance charges.


Section D:

Forecast charges using the test dataset and the best model found in c).

pred_svm <- predict(best_svm, newdata = test_data)
summary(pred_svm)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   7.123   8.518   9.142   9.039   9.519  10.786

The best SVM model was used to predict insurance charges on the test dataset, producing results on a log scale. The predicted values ranged from 7.123 to 10.786, with a mean of 9.039 and a median of 9.142, indicating a symmetrical distribution. The model appears to generalize well, showing consistent predictions without extreme outliers. To better interpret the results, back-transformation to the original dollar scale and error evaluation would be useful for the next steps.


Section E:

Compute the MSE (Mean Squared Error) on the test data.

mse_test_svm <- mean((test_data$charges - pred_svm)^2)
mse_test_svm
## [1] 0.2259622

The SVM model achieved a test Mean Squared Error (MSE) of 0.226 on the log-transformed charges, indicating accurate predictions with low average squared error. While this reflects good model performance on the transformed scale, back-transforming to the original scale would provide clearer real-world interpretation.



6. Perform the k-means cluster analysis.

Section A:

Remove the sex, smoker, and region, since they are not numerical values.

insurance_numeric <- insurance[, c("age", "bmi", "children", "charges")]
head(insurance_numeric)
##   age    bmi children  charges
## 1  19 27.900        0 9.734176
## 2  18 33.770        1 7.453302
## 3  28 33.000        3 8.400538
## 4  33 22.705        0 9.998092
## 5  32 28.880        0 8.260197
## 6  31 25.740        0 8.231275

The code shows a data preprocessing step, where non-numeric variables (sex, smoker, and region) are removed to create a new dataset, insurance_numeric, containing only numerical features: age, bmi, children, and charges. This ensures compatibility with modeling techniques that require numerical input. The preview of the data confirms the dataset is clean and ready for further analysis, with “charges” appearing to be log-transformed to improve model performance.


Section B:

Determine the optimal number of clusters. Justify your answer. It may take longer running time since it uses a large dataset.

fviz_nbclust(insurance_numeric, kmeans, method = "gap_stat") +
  ggtitle("Optimal Number of Clusters - Gap Statistic Method") +
  theme_test()

The Gap Statistic plot indicates that the optimal number of clusters is 2, where the gap value peaks. This suggests the data is best divided into two well-separated groups, as adding more clusters beyond this point does not significantly improve clustering quality and may lead to unnecessary complexity.


Section C:

Perform k-means clustering using the 3 clusters.

kmeans_model <- kmeans(insurance_numeric, centers = 3, nstart = 25)
summary(kmeans_model)
##              Length Class  Mode   
## cluster      1338   -none- numeric
## centers        12   -none- numeric
## totss           1   -none- numeric
## withinss        3   -none- numeric
## tot.withinss    1   -none- numeric
## betweenss       1   -none- numeric
## size            3   -none- numeric
## iter            1   -none- numeric
## ifault          1   -none- numeric

K-means clustering was performed on the “insurance_numeric” dataset using three clusters. The model successfully assigned all 1,338 observations into groups, with metrics such as the total sum of squares and within-cluster sum of squares offering insights into the compactness and separation of the clusters. Although the Gap Statistic had previously indicated that two clusters might be optimal, this step intentionally explores three clusters for comparison and deeper analysis. The summary output confirms that the algorithm executed correctly, laying the groundwork for further investigation into cluster characteristics, such as their sizes and centroid values.


Section D:

Visualize the clusters in different colors.

fviz_cluster(kmeans_model, data = insurance_numeric, geom = "point",
             ellipse.type = "norm", ggtheme = theme_test(),
             main = "Cluster Visualization"
             )

The cluster visualization shows the results of k-means clustering with three clusters applied to the “insurance_numeric” dataset, using PCA for dimensionality reduction. Cluster 3 appears well-separated on the left, while Clusters 1 and 2 overlap more in the center and right, indicating similarities between them. The plot suggests that while the data can be grouped into three clusters, further analysis may be needed to confirm whether this separation is meaningful or if fewer clusters might be more appropriate.



7. Build a neural networks model.

Section A:

Remove the sex, smoker, and region, since they are not numerical values.

insurance_nn <- insurance[, c("age", "bmi", "children", "charges")]
head(insurance_nn)
##   age    bmi children  charges
## 1  19 27.900        0 9.734176
## 2  18 33.770        1 7.453302
## 3  28 33.000        3 8.400538
## 4  33 22.705        0 9.998092
## 5  32 28.880        0 8.260197
## 6  31 25.740        0 8.231275

The code filters the insurance dataset to retain only numerical variables—age, bmi, children, and charges—by removing categorical features like sex, smoker, and region. The resulting dataset, insurance_nn, contains clean numeric data suitable for statistical modeling. The first six rows confirm successful extraction, showing varied values across all selected columns. This step ensures the data is ready for machine learning or regression analysis.

Section B:

Standardize the inputs using the scale() function.

insurance_scaled <- scale(insurance_nn)
head(insurance_scaled)
##             age        bmi    children    charges
## [1,] -1.4382265 -0.4531506 -0.90827406  0.6911354
## [2,] -1.5094011  0.5094306 -0.07873775 -1.7893505
## [3,] -0.7976553  0.3831636  1.58033487 -0.7592166
## [4,] -0.4417824 -1.3050431 -0.90827406  0.9781472
## [5,] -0.5129570 -0.2924471 -0.90827406 -0.9118403
## [6,] -0.5841316 -0.8073542 -0.90827406 -0.9432929

The code standardizes the numerical features (age, bmi, children, and charges) in the dataset using the scale() function, converting each value into a z-score with a mean of 0 and standard deviation of 1. This ensures all variables are on the same scale, preventing any single feature from disproportionately influencing the model. The output confirms successful standardization, making the data suitable for scale-sensitive algorithms like clustering or neural networks.

Section C:

Convert the standardized inputs to a data frame using the as.data.frame() function.

insurance_scaled <- as.data.frame(insurance_scaled)
class(insurance_scaled)
## [1] "data.frame"

The code converts the standardized data from a matrix to a data frame using as.data.frame(), ensuring compatibility with R functions that require data in data.frame format. The class() check confirms the successful conversion, making the dataset ready for further analysis or modeling.

Section D:

Split the dataset into a training set containing 80% of the original data and the test set containing the remaining 20%.

set.seed(1)
index_nn <- sample(1:nrow(insurance_scaled), nrow(insurance_scaled)*0.8)
train_nn <- insurance_scaled[index_nn, ]
dim(train_nn)
## [1] 1070    4
test_nn <- insurance_scaled[-index_nn, ]
dim(test_nn)
## [1] 268   4

The code splits the standardized dataset into training and test sets, using 80% of the data (1,070 records) for training and 20% (268 records) for testing. A random seed ensures reproducibility, and the “sample()” function is used to randomly select training indices. This setup allows the model to be trained in one subset and evaluated on another, supporting accurate performance assessment and preventing overfitting.

Section E:

The response is charges and the predictors are age, bmi, and children. Please use 1 hidden layer with 1 neuron.

set.seed(1)
nn_model <- neuralnet(charges ~ age + bmi + children, 
                      data = train_nn, hidden = c(1))
summary(nn_model)
##                     Length Class      Mode    
## call                   4   -none-     call    
## response            1070   -none-     numeric 
## covariate           3210   -none-     numeric 
## model.list             2   -none-     list    
## err.fct                1   -none-     function
## act.fct                1   -none-     function
## linear.output          1   -none-     logical 
## data                   4   data.frame list    
## exclude                0   -none-     NULL    
## net.result             1   -none-     list    
## weights                1   -none-     list    
## generalized.weights    1   -none-     list    
## startweights           1   -none-     list    
## result.matrix          9   -none-     numeric

A simple neural network model was built using the “neuralnet” package to predict insurance charges based on age, BMI, and number of children. The model uses one hidden layer with a single neuron and was trained on standardized data. The summary confirms successful training on 1,070 observations, with a structure appropriate for regression tasks. This model is now ready for evaluation or further refinement.

Section F:

Plot the neural networks.

plot(nn_model)

The neural network, consisting of a single hidden neuron, was trained to predict insurance charges using age, BMI, and number of children as input features. The connection weights reveal that age has a moderately strong negative influence on the prediction, while BMI and children have relatively minor effects. The model achieved a training error of 352.77 and required 22,494 iterations to converge. This visualization confirms that the model was successfully trained and provides clear insight into how each variable contributes to the output.

Section G:

Forecast the charges in the test dataset.

pred_nn <- predict(nn_model, newdata = test_nn)
summary(pred_nn)
##        V1          
##  Min.   :-1.14625  
##  1st Qu.:-0.52559  
##  Median : 0.13413  
##  Mean   :-0.03579  
##  3rd Qu.: 0.43390  
##  Max.   : 0.90450

The trained neural network model was used to predict insurance charges on the test dataset, producing outputs in standardized form. The predicted values range from -1.15 to 0.90, with a mean near zero, reflecting the scaled nature of the data. These results confirm successful forecasting, but the predictions need to be converted back to the original scale for real-world interpretation and evaluation.

Section H:

Get the observed charges of the test dataset.

obs_nn <- test_nn$charges
summary(obs_nn)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -2.25758 -0.66002  0.08807  0.01331  0.81200  2.13630

This step extracts and summarizes the actual (observed) insurance charges from the standardized test dataset. The observed values range from -2.26 to 2.14 and are centered around zero, reflecting the standardized scale. These values are essential for evaluating the model’s performance by comparing them to the predicted charges.

Section I:

Compute test error (MSE).

mse_test_nn <- mean((obs_nn - pred_nn)^2)
mse_test_nn
## [1] 0.8142879

The Mean Squared Error (MSE) was computed to assess the prediction accuracy of the neural network on the test dataset. The resulting MSE of 0.8143 reflects the average squared difference between the predicted and actual values, based on standardized data. This metric serves as a critical indicator of the model’s performance and provides a baseline for future tuning or comparison with other models.



8. Putting it all together.

Section A:

For predicting insurance charges, your supervisor asks you to choose the best model among the multiple regression, regression tree, random forest, support vector machine, and neural network models. Compare the test MSEs of the models generated in steps 2.g, 3.f, 4.b, 5.e, and 7.d. Display the names for these types of these models, using these labels: Multiple Linear Regression, Regression Tree, Random Forest, Support Vector Machine, and Neural Network and their corresponding test MSEs in a data.frame. Label the column in your data frame with the labels as Model.Type, and label the column with the test MSEs as Test.MSE and round the data in this column to 4 decimal places. Present the formatted data to your supervisor and recommend which model is best and why.

model_compare <- data.frame(
  Model.Type = c("Multiple Linear Regression", "Regression Tree", "Random Forest", "Support Vector Machine", "Neural Network"),
  Test.MSE = round(c(mse_test_lm, mse_test_tree, mse_test_rf, mse_test_svm, mse_test_nn), 4)
)
print(model_compare)
##                   Model.Type Test.MSE
## 1 Multiple Linear Regression   0.2313
## 2            Regression Tree   0.7035
## 3              Random Forest   0.1780
## 4     Support Vector Machine   0.2260
## 5             Neural Network   0.8143

Recommendation based on lowest MSE

best_model <- model_compare[which.min(model_compare$Test.MSE), ]
print(paste("Recommended model is", best_model$Model.Type, "with Test MSE =", best_model$Test.MSE))
## [1] "Recommended model is Random Forest with Test MSE = 0.178"

The code compares five predictive models for estimating insurance charges using Test Mean Squared Error (MSE). Among Multiple Linear Regression, Regression Tree, Random Forest, Support Vector Machine, and Neural Network, the Random Forest model achieves the lowest MSE (0.1780), indicating the best performance. This suggests Random Forest is the most accurate and reliable model for predicting insurance charges in this analysis.


Section B:

Another supervisor from the sales department has requested your help to create a predictive model that his sales representatives can use to explain to clients what the potential costs could be for different kinds of customers, and they need an easy and visual way of explaining it. What model would you recommend, and what are the benefits and disadvantages of your recommended model compared to other models?

A Regression Tree model is recommended for the sales department because it provides a simple, visual, and easy-to-explain way to predict insurance costs. It allows sales reps to clearly show clients how different factors affect pricing through intuitive “if-then” decision paths. While it may not be as accurate as models like Random Forests, its interpretability makes it ideal for client communication. For best results, it can be used alongside more accurate models for internal predictions.


Section C:

The supervisor from the sales department likes your regression tree model. But she says that the salespeople say the numbers in it are way too low and suggests that maybe the numbers on the leaf nodes predicting charges are log transformations of the actual charges. You realize that in step 1.b of this project that you had indeed transformed charges using the log function. And now you realize that you need to reverse the transformation in your final output. The solution you have is to reverse the log transformation of the variables in the regression tree model you created and redisplay the result.

Follow these steps:

Subsection i:

Copy your pruned tree model to a new variable.

tree_copy <- pruned_tree
summary(tree_copy)
## 
## Regression tree:
## snip.tree(tree = tree_model, nodes = 3:4)
## Variables actually used in tree construction:
## [1] "age"
## Number of terminal nodes:  3 
## Residual mean deviance:  0.5959 = 529.7 / 889 
## Distribution of residuals:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -1.2870 -0.5271 -0.2305  0.0000  0.4768  2.3320

The pruned regression tree model uses only age as the predictor and has 3 terminal nodes, making it simple and easy to interpret. With a residual mean deviance of 0.5959 and balanced residuals, the model provides reasonably accurate predictions while maintaining high transparency. This makes it well-suited for situations where clarity and explainability are more important than maximum predictive accuracy.

Subsection ii:

In your new variable, find the data.frame named “frame” and reverse the log transformation on the data.frame column yval using the exp() function. (If the copy of your pruned tree model is named copy_of_my_pruned_tree, then the data frame is accessed as copy_of_my_pruned_tree$frame, and it works just like a normal data frame.).

tree_copy$frame$yval <- exp(tree_copy$frame$yval)
summary(tree_copy$frame$yval)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3723    5374    6576    7561    8832   13300

The code reverses the earlier log transformation on predicted insurance charges from a pruned regression tree, converting them back to dollar amounts using the exp() function. The summary shows that predicted charges range from $3,723 to $13,300, with a mean of $7,561. This step makes the model’s output interpretable and usable for real-world applications, such as explaining costs to clients.

Subsection iii:

After you reverse the log transform on the yval column, then replot the tree with labels.

plot(tree_copy)
text(tree_copy, pretty = 0)

The pruned regression tree uses age as the sole predictor to estimate insurance charges, dividing clients into three age groups with distinct cost predictions: $3,723 for those under 22.5, $6,576 for ages 22.5–36.5, and $13,300 for those 36.5 and older. This simple, interpretable model effectively highlights how age influences insurance costs, making it ideal for client-facing explanations.