Predictive Modeling
Insurance Health Plans
You are just hired as a Senior Data Scientist in Bank of Universe (BOU). BOU is a private company founded twenty years ago and now has more than 5,000 employees. BOU did all the research, chose the insurance company, and picked plan options for employees twenty years ago. The new CEO, Mr. Buffet, wants to make changes and offer self-funded Health Plans (SHP) starting next year. SHP is cheaper for BOU, since BOU does not have to pay for the separate insurance carrier by taking some risks. BOU has received several years’ medical costs in file insurance.csv Download insurance.csvfrom the current insurance carrier.
Data source: From Kaggle https://www.kaggle.com/mirichoi0218/insurance/home
It contains the following columns:
- age: age of primary beneficiary
- sex: insurance contractor gender, female, male
- bmi: Body mass index, providing an understanding of body, weights that are relatively high or low relative to height, objective index of body weight (kg / m ^ 2) using the ratio of height to weight, ideally 18.5 to 24.9
- children: Number of children covered by health insurance / Number of dependents
- smoker: smoking
- region: the beneficiary’s residential area in the US, northeast, southeast, southwest, northwest.
- charges: individual medical costs billed by health insurance
1. Data Preparation.
Section A:
Load the dataset insurance.csv into memory.
Display the dimension of the data frame
## [1] 1338 7
Display the first six rows of the data frame
## age sex bmi children smoker region charges
## 1 19 female 27.900 0 yes southwest 16884.924
## 2 18 male 33.770 1 no southeast 1725.552
## 3 28 male 33.000 3 no southeast 4449.462
## 4 33 male 22.705 0 no northwest 21984.471
## 5 32 male 28.880 0 no northwest 3866.855
## 6 31 female 25.740 0 no southeast 3756.622
The code loads and explores the insurance.csv dataset, which contains 1,338 rows and 7 variables related to individuals’ demographics, health behaviors, and insurance charges. An initial preview shows a mix of numeric and categorical variables, with noticeable variation in medical costs. For example, smokers tend to have higher charges. Overall, the dataset is well-structured and ready for further analysis or predictive modeling.
Section B:
In the data frame, transform the variable charges by setting insurance$charges = log(insurance$charges). Do not transform it outside of the data frame.
Display the first ten occurrences
## [1] 9.734176 7.453302 8.400538 9.998092 8.260197 8.231275 9.016827
## [8] 8.893093 8.765054 10.272397
The code applies a log transformation to the charges
variable in the insurance dataset to reduce skewness,
normalize the distribution, and stabilize variance—making the data more
suitable for regression modeling. The transformed values are compressed
and more statistically manageable, improving model performance and
interpretability.
Section C:
Using the data set from 1.b, use the model.matrix() function to create another data set that uses dummy variables in place of categorical variables. Verify that the first column only has ones (1) as values, and then discard the column only after verifying it has only ones as values.
Verify first column has only 1’s
## [1] TRUE
Display the first ten rows
## 1 2 3 4 5 6 7 8 9 10
## 1 1 1 1 1 1 1 1 1 1
Remove the intercept column (first column)
The code prepares the insurance dataset for
regression modeling by converting categorical variables into
dummy variables using model.matrix(). It
confirms that the first column (the intercept) contains only
ones, then removes it to avoid redundancy, ensuring the
dataset is fully numeric and ready for further analysis or modeling.
Section D:
Use the sample() function with set.seed equal to 1 to generate row indexes for your training and tests sets, with 2/3 of the row indexes for your training set and 1/3 for your test set. Do not use any method other than the sample() function for splitting your data.
Generate row indexes
The code performs a randomized split of the
insurance dataset into training and testing sets, using
two-thirds of the data for training.
It sets a seed for reproducibility and uses the sample()
function to generate row indices for the training set. This ensures
consistent and unbiased model evaluation on unseen data.
Section E:
Create a training and test data set from the data set created in 1.b using the training and test row indexes created in 1.d. Unless otherwise stated, only use the training and test data sets created in this step.
Create a training data set and display its dimensions
## [1] 892 7
Create a testing data set and display its dimensions
## [1] 446 7
The code splits the insurance dataset into a training set with 892 rows and a test set with 446 rows using previously sampled indices. This step ensures a proper division of data for building and evaluating predictive models, supporting accurate and unbiased model performance assessment.
Section F:
Create a training and test data set from data set created in 1.c using the training and test row indexes created in 1.d
Create a dummy training data set and display its dimensions
## [1] 892 8
Create a dummy testing data set and display its dimensions
## [1] 446 8
The code splits the dummy-variable version of the insurance dataset into a training set with 892 rows and a test set with 446 rows using previously generated indices. This prepares fully numeric data for machine learning models that require consistent input formats for training and evaluation.
2. Build a multiple linear regression model.
Section A:
Perform multiple linear regression with charges as the response and the predictors are age, sex, bmi, children, smoker, and region. Print out the results using the summary() function. Use the training data set created in step 1.e to train your model.
set.seed(1)
lm_model <- lm(charges ~ age + sex + bmi + children + smoker + region,
data = train_data)
summary(lm_model)##
## Call:
## lm(formula = charges ~ age + sex + bmi + children + smoker +
## region, data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.03533 -0.17138 -0.03868 0.06382 2.12159
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.042076 0.082891 84.956 < 2e-16 ***
## age 0.034915 0.001030 33.911 < 2e-16 ***
## sexmale -0.063228 0.028700 -2.203 0.027847 *
## bmi 0.011805 0.002441 4.836 1.56e-06 ***
## children 0.100152 0.011894 8.420 < 2e-16 ***
## smokeryes 1.533561 0.035475 43.229 < 2e-16 ***
## regionnorthwest -0.082398 0.041040 -2.008 0.044972 *
## regionsoutheast -0.140025 0.040180 -3.485 0.000516 ***
## regionsouthwest -0.107718 0.041108 -2.620 0.008934 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4259 on 883 degrees of freedom
## Multiple R-squared: 0.7839, Adjusted R-squared: 0.7819
## F-statistic: 400.4 on 8 and 883 DF, p-value: < 2.2e-16
The regression analysis shows that insurance charges are strongly
influenced by several factors, with the model explaining over
78% of the variation in charges. Key predictors such as
age, BMI, number of children, and
especially smoking status have significant positive effects
on charges. Smoking is the most impactful, leading to much
higher costs. While gender is statistically significant, its effect is
minor. Regional differences also exist, with the Southeast and Southwest
associated with lower charges. Overall, the model effectively highlights
the main factors in driving insurance costs.
Section B:
Is there a relationship between the predictors and the response?
The multiple linear regression model shows a strong relationship
between the predictors and insurance charges, with an adjusted R-squared
of 78.2%, indicating the model explains most of the
variation in charges. Key predictors such as age,
BMI, number of children, and especially
smoking status significantly influence
charges. Smokers are associated with
notably higher charges, making it the strongest
predictor. Geographic regions also impact charges, with the Southeast
and Southwest linked to lower costs. Overall, the model confirms that
several demographic and behavioral factors meaningfully affect insurance
charges.
Section C:
Does sex have a statistically significant relationship to the response?
The model shows that sex is a statistically significant
predictor of insurance charges, with males having slightly
higher charges than females. However, the effect size is small,
indicating that while gender does influence charges, its practical
impact is minimal compared to stronger predictors like smoking
status.
Section D:
Perform best subset selection using the stepAIC() function from the MASS library, choose best model based on AIC. For the “direction” parameter in the stepAIC() method, set direciton=“backward”
## Start: AIC=-1513.99
## charges ~ age + sex + bmi + children + smoker + region
##
## Df Sum of Sq RSS AIC
## <none> 160.13 -1513.99
## - sex 1 0.88 161.01 -1511.10
## - region 3 2.39 162.52 -1506.80
## - bmi 1 4.24 164.37 -1492.67
## - children 1 12.86 172.99 -1447.09
## - age 1 208.54 368.67 -772.14
## - smoker 1 338.89 499.02 -502.09
##
## Call:
## lm(formula = charges ~ age + sex + bmi + children + smoker +
## region, data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.03533 -0.17138 -0.03868 0.06382 2.12159
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.042076 0.082891 84.956 < 2e-16 ***
## age 0.034915 0.001030 33.911 < 2e-16 ***
## sexmale -0.063228 0.028700 -2.203 0.027847 *
## bmi 0.011805 0.002441 4.836 1.56e-06 ***
## children 0.100152 0.011894 8.420 < 2e-16 ***
## smokeryes 1.533561 0.035475 43.229 < 2e-16 ***
## regionnorthwest -0.082398 0.041040 -2.008 0.044972 *
## regionsoutheast -0.140025 0.040180 -3.485 0.000516 ***
## regionsouthwest -0.107718 0.041108 -2.620 0.008934 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4259 on 883 degrees of freedom
## Multiple R-squared: 0.7839, Adjusted R-squared: 0.7819
## F-statistic: 400.4 on 8 and 883 DF, p-value: < 2.2e-16
The backward stepwise regression analysis identified a model that
includes age, sex, BMI, number of
children, smoking status, and
region as significant predictors of insurance
charges. All variables remained in the model because
removing any of them increased the AIC, indicating each contribute
meaningfully. Smoking status had the strongest impact, significantly
increasing charges, followed by age and regional differences. The model
explains about 78% of the variance in charges
(R² = 0.784), and all predictors are statistically
significant. Overall, the model is robust, accurate, and highlights key
factors influencing medical insurance costs.
Section E:
Compute the test error of the best model in 3d based on AIC using LOOCV using trainControl() and train() from the caret library. Report the MSE by squaring the reported RMSE.
ctrl_loocv <- trainControl(method = "LOOCV")
set.seed(1)
lm_loocv <- train(charges ~ age + bmi + children + smoker + region,
data = train_data,
method = "lm",
trControl = ctrl_loocv)
mse_loocv <- lm_loocv$results$RMSE^2
mse_loocv## [1] 0.1839504
The Leave-One-Out Cross-Validation (LOOCV) was used to evaluate the
performance of the best regression model selected by AIC. Using the
caret package in R, the model was trained and tested across
all individual data points. The resulting Mean Squared Error (MSE) was
approximately *0.184**, indicating that the model has good predictive
accuracy and generalizes well to unseen data.
Section F:
Calculate the test error of the best model in 3d based on AIC using 10-fold Cross-Validation. Use train and trainControl from the caret library. Refer to model selected in 3d based on AIC. Report the MSE.
ctrl_cv10 <- trainControl(method = "cv", number = 10)
set.seed(1)
lm_cv10 <- train(charges ~ age + bmi + children + smoker + region,
data = train_data,
method = "lm",
trControl = ctrl_cv10)
mse_cv10 <- lm_cv10$results$RMSE^2
mse_cv10## [1] 0.1806693
The model was trained with selected predictors (age,
BMI, children, smoker, and
region) and achieved a Mean Squared Error (MSE) of
approximately 0.181. This low MSE indicates strong
predictive accuracy and suggests that the model generalizes well to
unseen data, reaffirming its reliability for estimating insurance
charges.
Section G:
Calculate and report the test MSE using the best model from 2.d and test data set created in step 1.e.
pred_lm <- predict(lm_best, newdata = test_data)
mse_test_lm <- mean((test_data$charges - pred_lm)^2)
mse_test_lm## [1] 0.231291
The resulting MSE is approximately 0.231, indicating a low average prediction error on unseen data. While slightly higher than the cross-validation errors, this is expected and confirms that the model generalizes well and maintains strong predictive performance.
Section H:
Compare the test MSE calculated in step 2.f using 10-fold cross-validation with the test MSE calculated in step 2.g. How similar are they?
## CV10_MSE Test_MSE
## 0.1806693 0.2312910
The code compares the 10-fold cross-validation MSE (0.1807) with the test set MSE (0.2313). While the test MSE is slightly higher—as expected due to its evaluation on unseen data—the values are relatively close. This indicates that the model generalizes well, is not overfitting, and performs reliably on new data.
3. Build a regression tree model.
Section A:
Build a regression tree model using function tree(),
where charges is the response and the predictors are
age, sex, bmi,
children, smoker, and region.
tree_model <- tree(charges ~ age + sex + bmi + children + smoker + region,
data = train_data)
summary(tree_model)##
## Regression tree:
## tree(formula = charges ~ age + sex + bmi + children + smoker +
## region, data = train_data)
## Variables actually used in tree construction:
## [1] "age" "children"
## Number of terminal nodes: 5
## Residual mean deviance: 0.5649 = 501 / 887
## Distribution of residuals:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.2870 -0.4765 -0.2651 0.0000 0.4276 2.4230
The regression tree model was built to predict insurance charges using several predictors, but only age and number of children were used in the final tree, indicating their stronger influence. The model has five terminal nodes and a residual mean deviance of 0.5649, reflecting moderate predictive accuracy. Residuals show a balanced distribution around zero, but the range indicates potential prediction errors. Overall, while the model highlights key predictors, data preprocessing needs improvement for better performance.
Section B:
Find the optimal tree by using cross-validation and display the results in a graphic. Report the best size.
cv_tree <- cv.tree(tree_model)
plot(cv_tree$size, cv_tree$dev, type = "b", xlab = "Tree Size", ylab = "Deviance")
The plot shows that deviance decreases as tree size increases, with a
sharp improvement up to size 3. Beyond that, additional
splits yield minimal performance gains. This suggests that a tree size
of 3 is optimal, offering a good balance between model accuracy and
complexity, and helps avoid overfitting based on the bias-variance
trade-off.
Section C:
Justify the number you picked for the optimal tree with regard to the principle of variance-bias trade-off.
A tree size of 3 is optimal because it strikes a balance between bias and variance. It significantly reduces bias compared to smaller trees while avoiding the overfitting and high variance that come with larger trees, aligning well with the bias-variance trade-off principle.
Section D:
Prune the tree using the optinal size found in 3.b.
##
## Regression tree:
## snip.tree(tree = tree_model, nodes = 3:4)
## Variables actually used in tree construction:
## [1] "age"
## Number of terminal nodes: 3
## Residual mean deviance: 0.5959 = 529.7 / 889
## Distribution of residuals:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.2870 -0.5271 -0.2305 0.0000 0.4768 2.3320
The pruned regression tree, reduced to three terminal nodes, uses
only the variable age for prediction. Despite being
simpler, it maintains similar predictive performance to the unpruned
tree, with only a slight increase in deviance. The residuals remain
centered around zero, indicating unbiased predictions. Overall, the
pruned model achieves a strong balance between simplicity and accuracy,
aligning well with the bias-variance trade-off.
Section E:
Plot the best tree model and give labels.
The pruned regression tree uses
age as the sole predictor
to group individuals into three categories based on insurance
charges. It splits at ages 22.5 and
36.5, showing that predicted charges increase with age.
Despite its simplicity, the model effectively captures the relationship
between age and healthcare costs, offering clear and interpretable
predictions.
Section F:
Calculate the test MSE for the best model.
pred_tree <- predict(pruned_tree, newdata = test_data)
mse_test_tree <- mean((test_data$charges - pred_tree)^2)
mse_test_tree## [1] 0.7035004
The pruned regression tree achieved a test MSE of 0.7035, indicating reasonable predictive performance on unseen data. A warning about NAs introduced by coercion suggests that some variables in the test set may be improperly formatted, likely due to unconverted categorical data. Despite this, the model—based solely on age—performed adequately, though data formatting should be reviewed for cleaner execution.
4. Build a random forest model.
Section A:
Build a random forest model using function randomForest(), where charges is the response and the predictors are age, sex, bmi, children, smoker, and region.
set.seed(1)
rf_model <- randomForest(charges ~ age + sex + bmi + children + smoker + region,
data = train_data, importance = TRUE)
summary(rf_model)## Length Class Mode
## call 4 -none- call
## type 1 -none- character
## predicted 892 -none- numeric
## mse 500 -none- numeric
## rsq 500 -none- numeric
## oob.times 892 -none- numeric
## importance 12 -none- numeric
## importanceSD 6 -none- numeric
## localImportance 0 -none- NULL
## proximity 0 -none- NULL
## ntree 1 -none- numeric
## mtry 1 -none- numeric
## forest 11 -none- list
## coefs 0 -none- NULL
## y 892 -none- numeric
## test 0 -none- NULL
## inbag 0 -none- NULL
## terms 3 terms call
The Random Forest regression model predict healthcare charges using predictors such as age, sex, BMI, children, smoker status, and region. The model was built with 500 trees and set to measure variable importance. The summary includes predicted values for 892 observations, and tracks model performance via Mean Squared Error (MSE) and R-squared across all trees. It also supports out-of-bag (OOB) error estimation for internal validation. The setup enables later analysis of which variables most influence the predicted charges.
Section B:
Compute the test error using the test data set.
pred_rf <- predict(rf_model, newdata = test_data)
mse_test_rf <- mean((test_data$charges - pred_rf)^2)
mse_test_rf## [1] 0.1779905
The Random Forest regression model was evaluated using test data by generating predictions and calculating the Mean Squared Error (MSE) to assess its performance. The resulting MSE of 0.1779905 suggests that the model provides accurate predictions and generalizes well to new data, particularly if the response variable was log-transformed.
Section C:
Extract variable importance measure using the importance() function.
## %IncMSE IncNodePurity
## age 107.746484 244.796624
## sex 4.645977 7.228707
## bmi 20.522785 56.674792
## children 31.544284 32.628370
## smoker 146.783672 310.455172
## region 8.976811 14.513753
The variable importance output from the Random Forest model shows
that smoker status and age are the most
influential predictors of healthcare charges, with the highest values in
both %IncMSE and IncNodePurity.
BMI and children have moderate impact, while
region and sex contribute the least. This
indicates that lifestyle and demographic factors, especially smoking,
play a key role in predicting healthcare costs.
Section D:
Plot the variable importance using the function, varImpPlot(). Which are the top 3 important predictors in this model?
The variable importance plot from the Random Forest model shows that
smoker, age, and BMI are the top
three predictors influencing healthcare charges. Smoker is the most
significant by far, followed by age, while BMI ranks slightly above
children in importance. These variables contribute most to the model’s
accuracy and decision-making process.
5. Build a support vector machine model.
Section A:
The response is charges and the predictors are age, sex, bmi, children, smoker, and region. Please use the svm() function with radial kernel and gamma=5 and cost = 50.
set.seed(1)
svm_model <- svm(charges ~ age + sex + bmi + children + smoker + region,
data = train_data, kernel = "radial", gamma = 5, cost = 50)
summary(svm_model)##
## Call:
## svm(formula = charges ~ age + sex + bmi + children + smoker + region,
## data = train_data, kernel = "radial", gamma = 5, cost = 50)
##
##
## Parameters:
## SVM-Type: eps-regression
## SVM-Kernel: radial
## cost: 50
## gamma: 5
## epsilon: 0.1
##
##
## Number of Support Vectors: 715
This SVM regression model predicts insurance charges using predictors like age, sex, BMI, children, smoking status, and region. It employs a radial kernel with a high gamma (5) and cost (50), indicating a complex, flexible model tailored to the training data. The model uses epsilon-regression with a tolerance of 0.1, allowing small prediction errors. A total of 715 support vectors were used, suggesting many data points influence the model, which may lead to overfitting. While the model is likely accurate on training data, its generalization should be tested on unseen data to confirm performance.
Section B:
Perform a grid search to find the best model with potential cost: 1, 10, 50, 100 and potential gamma: 1,3 and 5 and potential kernel: “linear”,“radial” and “sigmoid”. And use the training set created in step 1.e.
set.seed(1)
svm_tune <- tune(svm, charges ~ ., data = train_data,
ranges = list(kernel = c("linear", "radial", "sigmoid"),
cost = c(1, 10, 50, 100),
gamma = c(1, 3, 5)))
summary(svm_tune)##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## kernel cost gamma
## radial 1 1
##
## - best performance: 0.1902345
##
## - Detailed performance results:
## kernel cost gamma error dispersion
## 1 linear 1 1 1.948499e-01 4.769998e-02
## 2 radial 1 1 1.902345e-01 6.255756e-02
## 3 sigmoid 1 1 2.299086e+03 3.024145e+02
## 4 linear 10 1 1.950497e-01 4.765131e-02
## 5 radial 10 1 2.389178e-01 7.217036e-02
## 6 sigmoid 10 1 2.274547e+05 3.370394e+04
## 7 linear 50 1 1.950977e-01 4.767836e-02
## 8 radial 50 1 3.462730e-01 7.312303e-02
## 9 sigmoid 50 1 5.685687e+06 9.284122e+05
## 10 linear 100 1 1.950707e-01 4.768355e-02
## 11 radial 100 1 4.390889e-01 1.115394e-01
## 12 sigmoid 100 1 2.354112e+07 3.291762e+06
## 13 linear 1 3 1.948499e-01 4.769998e-02
## 14 radial 1 3 3.811084e-01 6.663638e-02
## 15 sigmoid 1 3 4.489028e+03 3.940325e+02
## 16 linear 10 3 1.950497e-01 4.765131e-02
## 17 radial 10 3 4.050899e-01 4.282095e-02
## 18 sigmoid 10 3 4.474026e+05 4.804788e+04
## 19 linear 50 3 1.950977e-01 4.767836e-02
## 20 radial 50 3 4.543307e-01 5.637178e-02
## 21 sigmoid 50 3 1.117223e+07 1.033690e+06
## 22 linear 100 3 1.950707e-01 4.768355e-02
## 23 radial 100 3 4.771536e-01 7.218485e-02
## 24 sigmoid 100 3 4.451442e+07 4.405710e+06
## 25 linear 1 5 1.948499e-01 4.769998e-02
## 26 radial 1 5 4.785357e-01 7.555896e-02
## 27 sigmoid 1 5 5.151436e+03 4.795631e+02
## 28 linear 10 5 1.950497e-01 4.765131e-02
## 29 radial 10 5 4.764771e-01 5.680321e-02
## 30 sigmoid 10 5 4.847946e+05 3.574412e+04
## 31 linear 50 5 1.950977e-01 4.767836e-02
## 32 radial 50 5 4.894965e-01 5.697417e-02
## 33 sigmoid 50 5 1.224914e+07 1.227219e+06
## 34 linear 100 5 1.950707e-01 4.768355e-02
## 35 radial 100 5 4.920183e-01 5.631912e-02
## 36 sigmoid 100 5 4.972590e+07 3.832052e+06
A grid search was conducted to find the best SVM regression model using different combinations of kernel types, cost values, and gamma values. The best-performing model used a radial kernel with cost = 1 and gamma = 1, achieving the lowest cross-validation error of 0.1902. This configuration outperformed others, including models with linear and sigmoid kernels. The results indicate that a radial SVM with low cost and gamma values provides the best balance of flexibility and generalization for predicting insurance charges.
Section C:
Print out the model results. What are the best model parameters?
##
## Call:
## best.tune(METHOD = svm, train.x = charges ~ ., data = train_data,
## ranges = list(kernel = c("linear", "radial", "sigmoid"), cost = c(1,
## 10, 50, 100), gamma = c(1, 3, 5)))
##
##
## Parameters:
## SVM-Type: eps-regression
## SVM-Kernel: radial
## cost: 1
## gamma: 1
## epsilon: 0.1
##
##
## Number of Support Vectors: 493
The best SVM regression model uses a radial kernel with cost = 1, gamma = 1, and epsilon = 0.1, selected through cross-validation. It relies on 493 support vectors, indicating a balanced model that captures non-linear patterns while maintaining generalizability. This configuration provides an effective trade-off between model accuracy and complexity for predicting insurance charges.
Section D:
Forecast charges using the test dataset and the best model found in c).
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 7.123 8.518 9.142 9.039 9.519 10.786
The best SVM model was used to predict insurance charges on the test dataset, producing results on a log scale. The predicted values ranged from 7.123 to 10.786, with a mean of 9.039 and a median of 9.142, indicating a symmetrical distribution. The model appears to generalize well, showing consistent predictions without extreme outliers. To better interpret the results, back-transformation to the original dollar scale and error evaluation would be useful for the next steps.
Section E:
Compute the MSE (Mean Squared Error) on the test data.
## [1] 0.2259622
The SVM model achieved a test Mean Squared Error (MSE) of 0.226 on the log-transformed charges, indicating accurate predictions with low average squared error. While this reflects good model performance on the transformed scale, back-transforming to the original scale would provide clearer real-world interpretation.
6. Perform the k-means cluster analysis.
Section A:
Remove the sex, smoker, and region, since they are not numerical values.
## age bmi children charges
## 1 19 27.900 0 9.734176
## 2 18 33.770 1 7.453302
## 3 28 33.000 3 8.400538
## 4 33 22.705 0 9.998092
## 5 32 28.880 0 8.260197
## 6 31 25.740 0 8.231275
The code shows a data preprocessing step, where non-numeric variables
(sex, smoker, and region) are
removed to create a new dataset,
insurance_numeric, containing only
numerical features: age, bmi,
children, and charges. This ensures
compatibility with modeling techniques that require numerical input. The
preview of the data confirms the dataset is clean and ready for further
analysis, with “charges” appearing to be log-transformed to improve
model performance.
Section B:
Determine the optimal number of clusters. Justify your answer. It may take longer running time since it uses a large dataset.
fviz_nbclust(insurance_numeric, kmeans, method = "gap_stat") +
ggtitle("Optimal Number of Clusters - Gap Statistic Method") +
theme_test()
The Gap Statistic plot indicates that the optimal number of clusters is
2, where the gap value peaks. This suggests the data is best divided
into two well-separated groups, as adding more clusters beyond this
point does not significantly improve clustering quality and may lead to
unnecessary complexity.
Section C:
Perform k-means clustering using the 3 clusters.
## Length Class Mode
## cluster 1338 -none- numeric
## centers 12 -none- numeric
## totss 1 -none- numeric
## withinss 3 -none- numeric
## tot.withinss 1 -none- numeric
## betweenss 1 -none- numeric
## size 3 -none- numeric
## iter 1 -none- numeric
## ifault 1 -none- numeric
K-means clustering was performed on the “insurance_numeric” dataset using three clusters. The model successfully assigned all 1,338 observations into groups, with metrics such as the total sum of squares and within-cluster sum of squares offering insights into the compactness and separation of the clusters. Although the Gap Statistic had previously indicated that two clusters might be optimal, this step intentionally explores three clusters for comparison and deeper analysis. The summary output confirms that the algorithm executed correctly, laying the groundwork for further investigation into cluster characteristics, such as their sizes and centroid values.
Section D:
Visualize the clusters in different colors.
fviz_cluster(kmeans_model, data = insurance_numeric, geom = "point",
ellipse.type = "norm", ggtheme = theme_test(),
main = "Cluster Visualization"
)
The cluster visualization shows the results of k-means clustering with
three clusters applied to the “insurance_numeric” dataset, using PCA for
dimensionality reduction. Cluster 3 appears well-separated on the left,
while Clusters 1 and 2 overlap more in the center and right, indicating
similarities between them. The plot suggests that while the data can be
grouped into three clusters, further analysis may be needed to confirm
whether this separation is meaningful or if fewer clusters might be more
appropriate.
7. Build a neural networks model.
Section A:
Remove the sex, smoker, and region, since they are not numerical values.
## age bmi children charges
## 1 19 27.900 0 9.734176
## 2 18 33.770 1 7.453302
## 3 28 33.000 3 8.400538
## 4 33 22.705 0 9.998092
## 5 32 28.880 0 8.260197
## 6 31 25.740 0 8.231275
The code filters the insurance dataset
to retain only numerical variables—age, bmi,
children, and charges—by removing categorical
features like sex, smoker, and
region. The resulting dataset,
insurance_nn, contains clean numeric data
suitable for statistical modeling. The first six rows confirm successful
extraction, showing varied values across all selected columns. This step
ensures the data is ready for machine learning or regression
analysis.
Section B:
Standardize the inputs using the scale() function.
## age bmi children charges
## [1,] -1.4382265 -0.4531506 -0.90827406 0.6911354
## [2,] -1.5094011 0.5094306 -0.07873775 -1.7893505
## [3,] -0.7976553 0.3831636 1.58033487 -0.7592166
## [4,] -0.4417824 -1.3050431 -0.90827406 0.9781472
## [5,] -0.5129570 -0.2924471 -0.90827406 -0.9118403
## [6,] -0.5841316 -0.8073542 -0.90827406 -0.9432929
The code standardizes the numerical features (age,
bmi, children, and charges) in
the dataset using the scale() function, converting each
value into a z-score with a mean of 0 and standard deviation of 1. This
ensures all variables are on the same scale, preventing any single
feature from disproportionately influencing the model. The output
confirms successful standardization, making the data suitable for
scale-sensitive algorithms like clustering or neural networks.
Section C:
Convert the standardized inputs to a data frame using the as.data.frame() function.
## [1] "data.frame"
The code converts the standardized data from a matrix to a data frame
using as.data.frame(), ensuring compatibility with R
functions that require data in data.frame format. The
class() check confirms the successful conversion, making
the dataset ready for further analysis or modeling.
Section D:
Split the dataset into a training set containing 80% of the original data and the test set containing the remaining 20%.
## [1] 1070 4
## [1] 268 4
The code splits the standardized dataset into training and test sets, using 80% of the data (1,070 records) for training and 20% (268 records) for testing. A random seed ensures reproducibility, and the “sample()” function is used to randomly select training indices. This setup allows the model to be trained in one subset and evaluated on another, supporting accurate performance assessment and preventing overfitting.
Section E:
The response is charges and the predictors are age, bmi, and children. Please use 1 hidden layer with 1 neuron.
set.seed(1)
nn_model <- neuralnet(charges ~ age + bmi + children,
data = train_nn, hidden = c(1))
summary(nn_model)## Length Class Mode
## call 4 -none- call
## response 1070 -none- numeric
## covariate 3210 -none- numeric
## model.list 2 -none- list
## err.fct 1 -none- function
## act.fct 1 -none- function
## linear.output 1 -none- logical
## data 4 data.frame list
## exclude 0 -none- NULL
## net.result 1 -none- list
## weights 1 -none- list
## generalized.weights 1 -none- list
## startweights 1 -none- list
## result.matrix 9 -none- numeric
A simple neural network model was built using the “neuralnet” package to predict insurance charges based on age, BMI, and number of children. The model uses one hidden layer with a single neuron and was trained on standardized data. The summary confirms successful training on 1,070 observations, with a structure appropriate for regression tasks. This model is now ready for evaluation or further refinement.
Section F:
Plot the neural networks.
The neural network, consisting of a single hidden neuron, was trained to predict insurance charges using age, BMI, and number of children as input features. The connection weights reveal that age has a moderately strong negative influence on the prediction, while BMI and children have relatively minor effects. The model achieved a training error of 352.77 and required 22,494 iterations to converge. This visualization confirms that the model was successfully trained and provides clear insight into how each variable contributes to the output.
Section G:
Forecast the charges in the test dataset.
## V1
## Min. :-1.14625
## 1st Qu.:-0.52559
## Median : 0.13413
## Mean :-0.03579
## 3rd Qu.: 0.43390
## Max. : 0.90450
The trained neural network model was used to predict insurance charges on the test dataset, producing outputs in standardized form. The predicted values range from -1.15 to 0.90, with a mean near zero, reflecting the scaled nature of the data. These results confirm successful forecasting, but the predictions need to be converted back to the original scale for real-world interpretation and evaluation.
Section H:
Get the observed charges of the test dataset.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -2.25758 -0.66002 0.08807 0.01331 0.81200 2.13630
This step extracts and summarizes the actual (observed) insurance charges from the standardized test dataset. The observed values range from -2.26 to 2.14 and are centered around zero, reflecting the standardized scale. These values are essential for evaluating the model’s performance by comparing them to the predicted charges.
Section I:
Compute test error (MSE).
## [1] 0.8142879
The Mean Squared Error (MSE) was computed to assess the prediction accuracy of the neural network on the test dataset. The resulting MSE of 0.8143 reflects the average squared difference between the predicted and actual values, based on standardized data. This metric serves as a critical indicator of the model’s performance and provides a baseline for future tuning or comparison with other models.
8. Putting it all together.
Section A:
For predicting insurance charges, your supervisor asks you to choose the best model among the multiple regression, regression tree, random forest, support vector machine, and neural network models. Compare the test MSEs of the models generated in steps 2.g, 3.f, 4.b, 5.e, and 7.d. Display the names for these types of these models, using these labels: Multiple Linear Regression, Regression Tree, Random Forest, Support Vector Machine, and Neural Network and their corresponding test MSEs in a data.frame. Label the column in your data frame with the labels as Model.Type, and label the column with the test MSEs as Test.MSE and round the data in this column to 4 decimal places. Present the formatted data to your supervisor and recommend which model is best and why.
model_compare <- data.frame(
Model.Type = c("Multiple Linear Regression", "Regression Tree", "Random Forest", "Support Vector Machine", "Neural Network"),
Test.MSE = round(c(mse_test_lm, mse_test_tree, mse_test_rf, mse_test_svm, mse_test_nn), 4)
)
print(model_compare)## Model.Type Test.MSE
## 1 Multiple Linear Regression 0.2313
## 2 Regression Tree 0.7035
## 3 Random Forest 0.1780
## 4 Support Vector Machine 0.2260
## 5 Neural Network 0.8143
Recommendation based on lowest MSE
best_model <- model_compare[which.min(model_compare$Test.MSE), ]
print(paste("Recommended model is", best_model$Model.Type, "with Test MSE =", best_model$Test.MSE))## [1] "Recommended model is Random Forest with Test MSE = 0.178"
The code compares five predictive models for estimating insurance charges using Test Mean Squared Error (MSE). Among Multiple Linear Regression, Regression Tree, Random Forest, Support Vector Machine, and Neural Network, the Random Forest model achieves the lowest MSE (0.1780), indicating the best performance. This suggests Random Forest is the most accurate and reliable model for predicting insurance charges in this analysis.
Section B:
Another supervisor from the sales department has requested your help to create a predictive model that his sales representatives can use to explain to clients what the potential costs could be for different kinds of customers, and they need an easy and visual way of explaining it. What model would you recommend, and what are the benefits and disadvantages of your recommended model compared to other models?
A Regression Tree model is recommended for the sales department because it provides a simple, visual, and easy-to-explain way to predict insurance costs. It allows sales reps to clearly show clients how different factors affect pricing through intuitive “if-then” decision paths. While it may not be as accurate as models like Random Forests, its interpretability makes it ideal for client communication. For best results, it can be used alongside more accurate models for internal predictions.
Section C:
The supervisor from the sales department likes your regression tree model. But she says that the salespeople say the numbers in it are way too low and suggests that maybe the numbers on the leaf nodes predicting charges are log transformations of the actual charges. You realize that in step 1.b of this project that you had indeed transformed charges using the log function. And now you realize that you need to reverse the transformation in your final output. The solution you have is to reverse the log transformation of the variables in the regression tree model you created and redisplay the result.
Follow these steps:
Subsection i:
Copy your pruned tree model to a new variable.
##
## Regression tree:
## snip.tree(tree = tree_model, nodes = 3:4)
## Variables actually used in tree construction:
## [1] "age"
## Number of terminal nodes: 3
## Residual mean deviance: 0.5959 = 529.7 / 889
## Distribution of residuals:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.2870 -0.5271 -0.2305 0.0000 0.4768 2.3320
The pruned regression tree model uses only age as the predictor and has 3 terminal nodes, making it simple and easy to interpret. With a residual mean deviance of 0.5959 and balanced residuals, the model provides reasonably accurate predictions while maintaining high transparency. This makes it well-suited for situations where clarity and explainability are more important than maximum predictive accuracy.
Subsection ii:
In your new variable, find the data.frame named “frame” and reverse
the log transformation on the data.frame column yval using the
exp() function. (If the copy of your pruned tree model is
named copy_of_my_pruned_tree, then the data frame is accessed as
copy_of_my_pruned_tree$frame, and it works just like a normal data
frame.).
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3723 5374 6576 7561 8832 13300
The code reverses the earlier log transformation on predicted
insurance charges from a pruned regression tree, converting them back to
dollar amounts using the exp() function. The summary shows
that predicted charges range from $3,723 to $13,300, with a mean of
$7,561. This step makes the model’s output interpretable and usable for
real-world applications, such as explaining costs to clients.
Subsection iii:
After you reverse the log transform on the yval column, then replot the tree with labels.
The pruned regression tree uses age as the sole predictor to estimate
insurance charges, dividing clients into three age groups with distinct
cost predictions: $3,723 for those under 22.5, $6,576 for ages
22.5–36.5, and $13,300 for those 36.5 and older. This simple,
interpretable model effectively highlights how age influences insurance
costs, making it ideal for client-facing explanations.