The healthcare industry is continually evolving, driven by advancements in medical technology, changes in healthcare policies, and shifting demographics. With these changes come challenges and opportunities for insurers and individuals seeking medical coverage. As insurance providers strive to offer competitive and personalized healthcare plans, it becomes essential to understand the factors influencing premium prices and create models that can accurately predict them.
This report presents a comprehensive analysis of a medical insurance premium dataset using various regression and machine learning techniques. The dataset contains valuable information about policyholders, including their age, weight, and height, as well as medical history attributes such as chronic diseases, diabetes, allergies, blood pressure problems, transplants, and family cancer history. These attributes may play a crucial role in determining the premium price of medical insurance.
To begin, the dataset was preprocessed to ensure data quality and consistency. Missing values were removed, and categorical variables were transformed into factors for further analysis. Exploratory data analysis (EDA) was conducted to gain insights into the relationships between specific attributes and the premium price. Boxplots and scatter plots were employed to visualize the distributions and correlations.
Subsequently, seven regression and machine learning models were built and evaluated: Forward Stepwise Linear Regression, Principal Component Regression (PCR) with 5 and 9 components, Random Forest Regression, Gradient Boosting Regression, Support Vector Machines (SVM), and Decision Trees. Each model's performance was assessed using the Mean Squared Error (MSE) and R-squared (R2) metrics on both the training and testing datasets.
The results demonstrated that the Random Forest Regression model achieved the lowest MSE and highest R2 on the testing dataset, suggesting its effectiveness in predicting premium prices based on the given attributes. Gradient Boosting Regression also showed promising results, highlighting the potential of this technique for insurance premium predictions.
In conclusion, this report provides valuable insights into the relationship between medical insurance premium prices and various individual attributes. The results of the regression and machine learning models can assist insurance providers in making data-driven decisions while devising personalized insurance plans for their clients. Further refinements and investigations could be made to enhance the predictive accuracy and interpretability of the models.
The dataset used in this study was obtained from Kaggle and contains information about individuals' medical insurance premiums along with various attributes that may influence the premium price. Before conducting the analysis, the dataset was pre-processed to ensure its quality and suitability for modeling.
EDA was conducted to gain insights into the relationships between specific attributes and the medical insurance premium price. Boxplots and scatter plots were utilized to visualize the distributions and correlations. The boxplots showed the distribution of premium prices based on different categorical attributes, while scatter plots depicted the relationship between continuous attributes (e.g., age, weight, height) and premium prices.
For this study, seven regression and machine learning models were built and evaluated to predict medical insurance premium prices based on the given attributes:
Forward Stepwise Linear Regression: A linear regression model was constructed using the lm function from the stats package. The step function was used to perform forward stepwise regression, selecting variables that best explained the variance in the premium price.
Principal Component Regression (PCR): PCR is a dimensionality reduction technique that combines principal components analysis with linear regression. Two PCR models were built, one with 5 components and another with 9 components, using the pcr function from the pls library. Cross-validation was used for model evaluation.
Random Forest Regression: A random forest regression model was built using the randomForest function from the randomForest library. The number of variables considered at each split (mtry) was set to 3. This ensemble learning method utilizes multiple decision trees to make predictions.
Gradient Boosting Regression: The gradient boosting model was created using the gbm function from the gbm library. This boosting algorithm builds an ensemble of weak learners, typically decision trees, to iteratively improve predictions.
Support Vector Machines (SVM): SVM is a supervised learning algorithm used for classification and regression tasks. In this study, the svm function from the e1071 library was employed to build an SVM regression model with a radial kernel.
Decision Trees: A decision tree model was constructed using the rpart function from the rpart library. Decision trees recursively partition the data into subsets to make predictions.
The performance of each model was assessed using two evaluation metrics:
Mean Squared Error (MSE): MSE measures the average squared difference between the predicted and actual values. Lower MSE indicates better model performance.
R-squared (R2): R2 represents the proportion of variance in the dependent variable (premium price) that is predictable from the independent variables. Higher R2 values indicate better fit and prediction.
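For reference, these are the standard definitions used throughout the report (and implemented as R functions in the modeling code below), with $y_i$ the observed premium price, $\hat{y}_i$ the model prediction, and $\bar{y}$ the mean observed price over the $n$ observations:

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2, \qquad R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}$$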
The final section of the report presents a tabular summary of the results obtained from the different models, including the training and testing MSE and R2 values. These results provide insights into the relative performance of each model in predicting medical insurance premium prices based on the given attributes.
In conclusion, the methodology section outlines the data collection, pre-processing, exploratory data analysis, model building, and evaluation steps employed in this study to analyze and predict medical insurance premium prices. The combination of regression and machine learning techniques offers a comprehensive approach to understanding the factors affecting insurance premiums and enables the development of accurate predictive models to aid insurance providers in decision-making and personalized plan design.
knitr::opts_chunk$set(echo = TRUE)
library(dplyr)
library(ggplot2)
library(janitor)
library(caret)
library(randomForest)
library(tidyverse)
library(rsample)
library(ggthemes)
library(gbm)
library(rpart)
library(pls)
library(gridExtra)
library(tree)
library(RColorBrewer)
library(e1071)
# Read and clean the dataset
medpremium <- read.csv("Medicalpremium.csv")
medpremium <- clean_names(medpremium)
# Data preprocessing: drop rows with missing values and recode the
# 0/1 indicator columns as No/Yes factors
medpremium <- medpremium %>%
  drop_na() %>%
  mutate(diabetes = as.factor(case_when(diabetes == 0 ~ "No",
                                        diabetes == 1 ~ "Yes"))) %>%
  mutate(blood_pressure_problems = as.factor(case_when(blood_pressure_problems == 0 ~ "No",
                                                       blood_pressure_problems == 1 ~ "Yes"))) %>%
  mutate(any_transplants = as.factor(case_when(any_transplants == 0 ~ "No",
                                               any_transplants == 1 ~ "Yes"))) %>%
  mutate(any_chronic_diseases = as.factor(case_when(any_chronic_diseases == 0 ~ "No",
                                                    any_chronic_diseases == 1 ~ "Yes"))) %>%
  mutate(known_allergies = as.factor(case_when(known_allergies == 0 ~ "No",
                                               known_allergies == 1 ~ "Yes"))) %>%
  mutate(history_of_cancer_in_family = as.factor(case_when(history_of_cancer_in_family == 0 ~ "No",
                                                           history_of_cancer_in_family == 1 ~ "Yes")))
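As a design note, the six parallel mutate() calls above can be collapsed with dplyr::across(); the following sketch is equivalent to the chain above (the binary_cols helper name is illustrative):

# Equivalent, more compact recoding of the six 0/1 indicator columns
binary_cols <- c("diabetes", "blood_pressure_problems", "any_transplants",
                 "any_chronic_diseases", "known_allergies",
                 "history_of_cancer_in_family")
medpremium <- medpremium %>%
  drop_na() %>%
  mutate(across(all_of(binary_cols),
                ~ factor(if_else(.x == 1, "Yes", "No"))))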
# Exploratory data analysis - Boxplots of premium price by categorical attribute
v1 <- ggplot(medpremium) +
  geom_boxplot(aes(y = premium_price, x = diabetes, fill = diabetes), show.legend = FALSE) +
  xlab("Diabetes") +
  ylab("Premium Price")
v2 <- ggplot(medpremium) +
  geom_boxplot(aes(y = premium_price, x = any_transplants, fill = any_transplants), show.legend = FALSE) +
  xlab("Any Transplants") +
  ylab("Premium Price")
v3 <- ggplot(medpremium) +
  geom_boxplot(aes(y = premium_price, x = any_chronic_diseases, fill = any_chronic_diseases), show.legend = FALSE) +
  xlab("Chronic Diseases") +
  ylab("Premium Price")
v4 <- ggplot(medpremium) +
  geom_boxplot(aes(y = premium_price, x = blood_pressure_problems, fill = blood_pressure_problems), show.legend = FALSE) +
  xlab("Blood Pressure Problems") +
  ylab("Premium Price")
v5 <- ggplot(medpremium) +
  geom_boxplot(aes(y = premium_price, x = known_allergies, fill = known_allergies), show.legend = FALSE) +
  xlab("Known Allergies") +
  ylab("Premium Price")
v6 <- ggplot(medpremium) +
  geom_boxplot(aes(y = premium_price, x = history_of_cancer_in_family, fill = history_of_cancer_in_family), show.legend = FALSE) +
  xlab("Cancer in Family") +
  ylab("Premium Price")
grid.arrange(v1, v2, v3, v4, v5, v6, nrow = 2)
# Scatter plots of premium price against the continuous attributes
v7 <- ggplot(medpremium) +
  geom_point(aes(x = age, y = premium_price)) +
  geom_smooth(aes(x = age, y = premium_price)) +
  xlab("Age (years)") +
  ylab("Premium Price")
v8 <- ggplot(medpremium) +
  geom_point(aes(x = weight, y = premium_price)) +
  geom_smooth(aes(x = weight, y = premium_price), colour = "green") +
  xlab("Weight (kg)") +
  ylab("Premium Price")
v9 <- ggplot(medpremium) +
  geom_point(aes(x = height, y = premium_price)) +
  geom_smooth(aes(x = height, y = premium_price), colour = "red") +
  xlab("Height (cm)") +
  ylab("Premium Price")
# Violin plot of premium price by number of major surgeries (constant orange fill)
v10 <- ggplot(medpremium, aes(x = premium_price, y = factor(number_of_major_surgeries))) +
  geom_violin(color = "red", fill = "orange", alpha = 0.2) +
  ylab("Number of Major Surgeries") +
  xlab("Premium Price")
grid.arrange(v7, v8, v9, v10, nrow = 2)
Here, we have observed the following:
The policyholders’ ages range from 18 to 66 years, with a mean age of approximately 41.75 years. The first quartile (25th percentile) is 30 years, and the third quartile (75th percentile) is 53 years.
Among the policyholders, 572 individuals do not have diabetes (No), and 414 individuals have diabetes (Yes).
524 policyholders do not have blood pressure problems (No), while 462 individuals have blood pressure problems (Yes).
A total of 931 policyholders have not undergone any transplants (No), and 55 individuals have had transplants (Yes).
808 policyholders do not have any chronic diseases (No), and 178 individuals have chronic diseases (Yes).
The policyholders’ heights range from 145 cm to 188 cm, with a mean height of approximately 168.2 cm. The first quartile (25th percentile) is 161 cm, and the third quartile (75th percentile) is 176 cm.
The policyholders’ weights range from 51 kg to 132 kg, with a mean weight of approximately 76.95 kg. The first quartile (25th percentile) is 67 kg, and the third quartile (75th percentile) is 87 kg.
774 policyholders do not have known allergies (No), and 212 individuals have known allergies (Yes).
870 policyholders do not have a history of cancer in the family (No), while 116 individuals have a history of cancer in the family (Yes).
The number of major surgeries ranges from 0 to 3, with a mean value of approximately 0.6673. The first quartile (25th percentile) is 0 surgeries, and the third quartile (75th percentile) is 1 surgery.
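These descriptive statistics can be reproduced with a standard numeric summary of the cleaned data frame; summary() reports the quartiles and means quoted above for the numeric columns and the No/Yes counts for the factor columns:

# Reproduce the descriptive statistics quoted above
summary(medpremium)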
To further explore the relationship between categorical variables and the premium price, boxplots were created. These visualizations provide an overview of how different categorical attributes influence the distribution of premium prices.
Boxplot 1 - Premium Price vs. Diabetes: The boxplot shows the distribution of premium prices for policyholders with and without diabetes. It provides insights into whether diabetes affects medical insurance premiums.
Boxplot 2 - Premium Price vs. Any Transplants: This boxplot compares the premium prices for policyholders who have had transplants (Yes) and those who have not (No).
Boxplot 3 - Premium Price vs. Any Chronic Diseases: This boxplot displays the distribution of premium prices for policyholders with and without chronic diseases.
Boxplot 4 - Premium Price vs. Blood Pressure Problems: The boxplot illustrates how the presence of blood pressure problems influences medical insurance premiums.
Boxplot 5 - Premium Price vs. Known Allergies: This boxplot compares the premium prices for policyholders with known allergies (Yes) and those without (No).
Boxplot 6 - Premium Price vs. History of Cancer in Family: The boxplot presents the distribution of premium prices for policyholders with and without a history of cancer in the family.
To investigate the relationship between continuous variables (age, weight, height) and premium prices, scatter plots were created.
Scatter Plot 1 - Premium Price vs. Age: This scatter plot shows the relationship between age and premium prices. It helps identify any potential patterns or trends in insurance premiums based on age.
Scatter Plot 2 - Premium Price vs. Weight: The scatter plot depicts how weight is related to medical insurance premium prices.
Scatter Plot 3 - Premium Price vs. Height: This scatter plot illustrates the relationship between height and medical insurance premium prices.
In this study, several regression and machine learning models were built to predict medical insurance premium prices based on the given attributes. Model evaluation is a crucial step to assess the performance and accuracy of these models in making predictions. The evaluation metrics used in this analysis are Mean Squared Error (MSE) and R-squared (R2). MSE measures the average squared difference between the predicted and actual premium prices, while R2 represents the proportion of variance in the premium prices that can be explained by the model.
# Split the data into training and testing sets (75% / 25%)
set.seed(1234)
med.split <- initial_split(medpremium, prop = 3 / 4)
med.train <- training(med.split)
med.test <- testing(med.split)
# Define evaluation functions: R-squared (1 - SSE/SST) and mean squared error
rsquared <- function(pred, actual) {
  1 - (sum((actual - pred)^2) / sum((actual - mean(actual))^2))
}
MSE <- function(pred, actual) {
  sum((actual - pred)^2) / length(actual)
}
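As a quick sanity check on these hand-rolled metrics, they can be compared against caret's postResample() (caret is already loaded above). Note that postResample() reports RMSE rather than MSE and computes R-squared as the squared correlation rather than 1 - SSE/SST, so small differences in R2 are expected; the toy vectors below are illustrative:

# Toy check of the metric functions against caret's postResample()
obs  <- c(10, 20, 30, 40)
pred <- c(12, 18, 33, 39)
MSE(pred, obs)           # mean squared error
rsquared(pred, obs)      # 1 - SSE/SST
postResample(pred, obs)  # RMSE (whose square is the MSE), Rsquared, MAE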
# Perform forward stepwise regression
# Note: step() is started from the full model here; with direction = "forward"
# no further terms can be added, so the selected model equals the full linear model
linear.fwd <- step(lm(premium_price ~ ., data = med.train), direction = "forward")
## Start: AIC=12198.41
## premium_price ~ age + diabetes + blood_pressure_problems + any_transplants +
## any_chronic_diseases + height + weight + known_allergies +
## history_of_cancer_in_family + number_of_major_surgeries
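Because the search starts at the full model, the fit above is effectively the full linear regression. A conventional forward search would instead start from the intercept-only model and let step() add terms up to the full predictor set; a minimal sketch (the null.model, upper.form, and linear.fwd2 names are illustrative):

# Forward stepwise search from the null (intercept-only) model upward
null.model <- lm(premium_price ~ 1, data = med.train)
upper.form <- reformulate(setdiff(names(med.train), "premium_price"),
                          response = "premium_price")
linear.fwd2 <- step(null.model, scope = upper.form, direction = "forward")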
fwd.pred.train <- predict(linear.fwd, med.train)
mse.fwd.train <- MSE(fwd.pred.train, med.train$premium_price)
r2.fwd.train <- rsquared(fwd.pred.train, med.train$premium_price)
fwd.pred.test <- predict(linear.fwd, med.test)
mse.fwd.test <- MSE(fwd.pred.test, med.test$premium_price)
r2.fwd.test <- rsquared(fwd.pred.test, med.test$premium_price)
# Perform Principal Component Regression (PCR)
pcr.model <- pcr(premium_price ~ ., data = med.train, scale = TRUE, validation = "CV")
validationplot(pcr.model, val.type = "MSEP", main = "Premium Price", ylab = "Mean Squared Error")
pcr.pred5.train <- predict(pcr.model, med.train, ncomp = 5)
mse.pcr5.train <- MSE(pcr.pred5.train, med.train$premium_price)
r2.pcr5.train <- rsquared(pcr.pred5.train, med.train$premium_price)
pcr.pred5.test <- predict(pcr.model, med.test, ncomp = 5)
mse.pcr5.test <- MSE(pcr.pred5.test, med.test$premium_price)
r2.pcr5.test <- rsquared(pcr.pred5.test, med.test$premium_price)
pcr.pred9.train <- predict(pcr.model, med.train, ncomp = 9)
mse.pcr9.train <- MSE(pcr.pred9.train, med.train$premium_price)
r2.pcr9.train <- rsquared(pcr.pred9.train, med.train$premium_price)
pcr.pred9.test <- predict(pcr.model, med.test, ncomp = 9)
mse.pcr9.test <- MSE(pcr.pred9.test, med.test$premium_price)
r2.pcr9.test <- rsquared(pcr.pred9.test, med.test$premium_price)
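The 5- and 9-component fits above were read off the validation plot; as a complementary check, the pls package provides selectNcomp(), which suggests a component count directly from the cross-validation curve. A sketch using the one-sigma heuristic:

# Let pls suggest a component count from the CV curve (one-sigma rule)
ncomp.suggested <- selectNcomp(pcr.model, method = "onesigma", plot = TRUE)
ncomp.suggested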
# Perform Random Forest regression
set.seed(11)
medcost.rf.model <- randomForest(premium_price ~ ., data = med.train, mtry = 3, importance = TRUE)
pred.rf.train <- predict(medcost.rf.model, med.train)
mse.rf.train <- MSE(pred.rf.train, med.train$premium_price)
r2.rf.train <- rsquared(pred.rf.train, med.train$premium_price)
pred.rf.test <- predict(medcost.rf.model, med.test)
mse.rf.test <- MSE(pred.rf.test, med.test$premium_price)
r2.rf.test <- rsquared(pred.rf.test, med.test$premium_price)
# Variable importance (%IncMSE) from the random forest
imp <- data.frame(importance(medcost.rf.model, type = 1))
imp <- rownames_to_column(imp, var = "variable")
# Note: brewer.pal() requires at least 3 colors, so the full 9-color
# Purples palette is interpolated to 10 values
ggplot(imp, aes(x = reorder(variable, X.IncMSE), y = X.IncMSE, color = reorder(variable, X.IncMSE))) +
  geom_point(show.legend = FALSE, size = 3) +
  geom_segment(aes(x = variable, xend = variable, y = 0, yend = X.IncMSE), size = 3, show.legend = FALSE) +
  xlab("") +
  ylab("% Increase in MSE") +
  labs(title = "Variable Importance for Prediction of Premium Price") +
  coord_flip() +
  scale_color_manual(values = colorRampPalette(brewer.pal(9, "Purples"))(10)) +
  theme_classic()
imp %>%
  arrange(desc(X.IncMSE)) %>%
  rename(`% Increase in MSE` = X.IncMSE)
## variable % Increase in MSE
## 1 age 115.6331152
## 2 any_transplants 40.5955759
## 3 weight 29.7742047
## 4 any_chronic_diseases 23.3300828
## 5 number_of_major_surgeries 21.8641653
## 6 history_of_cancer_in_family 17.2005894
## 7 blood_pressure_problems 14.2675086
## 8 height 7.8357810
## 9 diabetes 1.3737953
## 10 known_allergies 0.6030496
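The mtry value of 3 used above was fixed by hand; the randomForest package also provides tuneRF(), which searches over mtry using out-of-bag error. A minimal sketch on the same training frame (the step factor and improvement threshold are illustrative):

# Search for an mtry value using out-of-bag error, starting from mtry = 3
set.seed(11)
rf.tune <- tuneRF(x = med.train[, setdiff(names(med.train), "premium_price")],
                  y = med.train$premium_price,
                  mtryStart = 3, ntreeTry = 500,
                  stepFactor = 1.5, improve = 0.01, trace = FALSE)
rf.tune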
# Perform Gradient Boosting regression
set.seed(1234)
medcost.boost.model <- gbm(premium_price ~ ., data = med.train, distribution = "gaussian", n.trees = 5000, interaction.depth = 4)
pred.boost.train <- predict(medcost.boost.model, med.train, n.trees = 5000)
mse.boost.train <- MSE(pred.boost.train, med.train$premium_price)
r2.boost.train <- rsquared(pred.boost.train, med.train$premium_price)
pred.boost.test <- predict(medcost.boost.model, med.test, n.trees = 5000)
mse.boost.test <- MSE(pred.boost.test, med.test$premium_price)
r2.boost.test <- rsquared(pred.boost.test, med.test$premium_price)
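The large train/test gap reported later for this model suggests that 5000 trees overfit; a common remedy is to fit with cv.folds and let gbm.perf() choose the iteration count. A minimal sketch (shrinkage left at the gbm default; boost.cv and best.iter are illustrative names):

# Choose the number of boosting iterations by 5-fold cross-validation
set.seed(1234)
boost.cv <- gbm(premium_price ~ ., data = med.train, distribution = "gaussian",
                n.trees = 5000, interaction.depth = 4, cv.folds = 5)
best.iter <- gbm.perf(boost.cv, method = "cv")  # CV-optimal tree count
pred.boost.cv <- predict(boost.cv, med.test, n.trees = best.iter)
MSE(pred.boost.cv, med.test$premium_price)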
# Perform Support Vector Machines (SVM)
set.seed(1234)
svm.model <- svm(premium_price ~ ., data = med.train, kernel = "radial")
pred.svm.train <- predict(svm.model, med.train)
mse.svm.train <- MSE(pred.svm.train, med.train$premium_price)
r2.svm.train <- rsquared(pred.svm.train, med.train$premium_price)
pred.svm.test <- predict(svm.model, med.test)
mse.svm.test <- MSE(pred.svm.test, med.test$premium_price)
r2.svm.test <- rsquared(pred.svm.test, med.test$premium_price)
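The SVM above uses e1071's default cost and gamma; tune.svm() runs a small cross-validated grid search over these hyperparameters. A sketch with an illustrative grid:

# Grid-search gamma and cost by cross-validation (grid values are illustrative)
set.seed(1234)
svm.tuned <- tune.svm(premium_price ~ ., data = med.train,
                      gamma = c(0.01, 0.1, 1), cost = c(1, 10, 100))
summary(svm.tuned)
pred.svm.tuned <- predict(svm.tuned$best.model, med.test)
MSE(pred.svm.tuned, med.test$premium_price)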
# Perform Decision Trees
decisiontree.model <- rpart(premium_price ~ ., data = med.train)
pred.decisiontree.train <- predict(decisiontree.model, med.train)
mse.decisiontree.train <- MSE(pred.decisiontree.train, med.train$premium_price)
r2.decisiontree.train <- rsquared(pred.decisiontree.train, med.train$premium_price)
pred.decisiontree.test <- predict(decisiontree.model, med.test)
mse.decisiontree.test <- MSE(pred.decisiontree.test, med.test$premium_price)
r2.decisiontree.test <- rsquared(pred.decisiontree.test, med.test$premium_price)
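rpart grows the tree under its default complexity parameter; a standard refinement is to prune at the cp value minimizing the cross-validated error in the cptable. A minimal sketch (pruned.tree is an illustrative name):

# Prune at the complexity parameter with the lowest cross-validated error
cp.table <- decisiontree.model$cptable
best.cp <- cp.table[which.min(cp.table[, "xerror"]), "CP"]
pruned.tree <- prune(decisiontree.model, cp = best.cp)
pred.pruned <- predict(pruned.tree, med.test)
MSE(pred.pruned, med.test$premium_price)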
# Summary of results
results <- data.frame(
  Model = c("Linear Regression (Forward Stepwise)",
            "Principal Component Regression (PCR, 5 components)",
            "Principal Component Regression (PCR, 9 components)",
            "Random Forest Regression",
            "Gradient Boosting Regression",
            "Support Vector Machines (SVM)",
            "Decision Trees"),
  Train_MSE = c(mse.fwd.train, mse.pcr5.train, mse.pcr9.train, mse.rf.train,
                mse.boost.train, mse.svm.train, mse.decisiontree.train),
  Test_MSE = c(mse.fwd.test, mse.pcr5.test, mse.pcr9.test, mse.rf.test,
               mse.boost.test, mse.svm.test, mse.decisiontree.test),
  Train_R2 = c(r2.fwd.train, r2.pcr5.train, r2.pcr9.train, r2.rf.train,
               r2.boost.train, r2.svm.train, r2.decisiontree.train),
  Test_R2 = c(r2.fwd.test, r2.pcr5.test, r2.pcr9.test, r2.rf.test,
              r2.boost.test, r2.svm.test, r2.decisiontree.test)
)
# Print the results
print(results)
##                                                 Model  Train_MSE Test_MSE  Train_R2   Test_R2
## 1               Linear Regression (Forward Stepwise) 14315738.7 13036555 0.6363780 0.6547589
## 2 Principal Component Regression (PCR, 5 components) 23665439.5 19761695 0.3988941 0.4766601
## 3 Principal Component Regression (PCR, 9 components) 20866836.3 18879532 0.4699790 0.5000220
## 4                            Random Forest Regression  2593700.3 10051203 0.9341196 0.7338186
## 5                        Gradient Boosting Regression   231936.4 11977133 0.9941088 0.6828151
## 6                      Support Vector Machines (SVM)  9891260.0 11759199 0.7487604 0.6885865
## 7                                      Decision Trees  7784322.1 11542743 0.8022770 0.6943188
1. Forward Stepwise Linear Regression Evaluation:
The forward stepwise linear regression model was evaluated on both the training and testing datasets. The model achieved a training MSE of approximately 14,315,738.7 and a testing MSE of approximately 13,036,555. The R2 values for the training and testing datasets were approximately 0.636 and 0.655, respectively. These results indicate that the model explains around 63.6% of the variance in the training data and around 65.5% of the variance in the testing data.
2. Principal Component Regression (PCR) Evaluation:
PCR models with 5 and 9 components were evaluated using cross-validation. The model with 5 components achieved a training MSE of approximately 23,665,439.5 and a testing MSE of approximately 19,761,695. The corresponding R2 values for the training and testing datasets were approximately 0.399 and 0.477, respectively. The model with 9 components showed a training MSE of approximately 20,866,836.3 and a testing MSE of approximately 18,879,532, with R2 values of approximately 0.470 and 0.500 for the training and testing datasets, respectively. These results suggest that the PCR models were less effective in predicting premium prices compared to other models.
3. Random Forest Regression Evaluation:
The random forest regression model demonstrated strong performance in predicting medical insurance premium prices. The model achieved a training MSE of approximately 2,593,700.3 and a testing MSE of approximately 10,051,203. The R2 values for the training and testing datasets were approximately 0.934 and 0.734, respectively. These high R2 values indicate that the random forest model explains a substantial portion of the variance in both the training and testing datasets, making it an effective model for premium price prediction.
4. Gradient Boosting Regression Evaluation:
The gradient boosting regression model also showed promising results. The model achieved a training MSE of approximately 231,936.4 and a testing MSE of approximately 11,977,133. The R2 values for the training and testing datasets were approximately 0.994 and 0.683, respectively. The near-perfect training R2 against the much lower testing R2 indicates that the model overfits the training data. Nonetheless, the model still performed well in predicting medical insurance premium prices.
5. Support Vector Machines (SVM) Evaluation:
The SVM regression model achieved a training MSE of approximately 9,891,260 and a testing MSE of approximately 11,759,199. The R2 values for the training and testing datasets were approximately 0.749 and 0.689, respectively. These results indicate that the SVM model provides a reasonably good fit to the data and performs well in predicting premium prices.
6. Decision Trees Evaluation:
The decision tree model showed moderate performance in predicting premium prices. The model achieved a training MSE of approximately 7,784,322.1 and a testing MSE of approximately 11,542,743. The R2 values for the training and testing datasets were approximately 0.802 and 0.694, respectively. The decision tree model demonstrated a good fit to the training data, but its performance on the testing data was slightly lower, suggesting some degree of overfitting.
The results of this study present the performance of different regression and machine learning models in predicting medical insurance premium prices based on the given attributes. The models evaluated include Forward Stepwise Linear Regression, Principal Component Regression (PCR) with 5 and 9 components, Random Forest Regression, Gradient Boosting Regression, Support Vector Machines (SVM), and Decision Trees. The evaluation metrics used to assess the models’ performance were Mean Squared Error (MSE) and R-squared (R2). Here are the key findings:
Forward Stepwise Linear Regression:
The linear regression model achieved a testing MSE of approximately 13,036,555, with a testing R2 of approximately 0.655. While the model provided reasonable predictions, other models showed better performance.
Principal Component Regression (PCR):
The PCR model with 5 components achieved a testing MSE of approximately 19,761,695 and a testing R2 of approximately 0.477. The model with 9 components achieved a testing MSE of approximately 18,879,532 and a testing R2 of approximately 0.500. Both PCR models performed adequately but were less effective than the other models.
Random Forest Regression:
The random forest model demonstrated strong predictive power with a testing MSE of approximately 10,051,203 and a testing R2 of approximately 0.734. This model outperformed other models, providing the most accurate predictions.
Gradient Boosting Regression:
The gradient boosting model achieved a testing MSE of approximately 11,977,133 and a testing R2 of approximately 0.683. The model showed promising results, but the gap between its training and testing performance indicates some overfitting to the training data.
Support Vector Machines (SVM):
The SVM regression model achieved a testing MSE of approximately 11,759,199 and a testing R2 of approximately 0.689. The SVM model performed reasonably well, providing accurate predictions.
Decision Trees:
The decision tree model showed moderate performance with a testing MSE of approximately 11,542,743 and a testing R2 of approximately 0.694. The model demonstrated good fit to the training data but exhibited slight overfitting on the testing data.
The analysis and evaluation of different regression and machine learning models reveal insights into predicting medical insurance premium prices based on various attributes. Among the models tested, the Random Forest Regression model emerged as the most effective in accurately predicting premium prices. The model provided the lowest testing MSE and a high testing R2, indicating strong predictive capabilities. The Gradient Boosting Regression and Support Vector Machines (SVM) models also demonstrated reasonably good performance, though the gradient boosting model showed some overfitting.
Overall, this study highlights the importance of data-driven modeling techniques in understanding the factors influencing medical insurance premium prices and aids insurance providers in making informed decisions. By utilizing the Random Forest Regression model, insurance companies can improve premium pricing accuracy, leading to fairer and more personalized insurance plans for their clients. It is crucial for insurers to continually refine and update their predictive models with new data to maintain relevance and accuracy in a dynamic insurance market.
Although the Random Forest model showed excellent performance, further research could explore ensemble methods and other advanced modeling techniques to potentially improve prediction accuracy further. Additionally, incorporating more diverse data sources and exploring the impact of additional features might enhance the model’s predictive capabilities. Continuous monitoring and validation of the model on new data will be essential to ensure its continued effectiveness in real-world insurance scenarios.
In conclusion, this study contributes to the field of insurance analytics by providing valuable insights into medical insurance premium pricing. The results highlight the potential of machine learning algorithms, specifically the Random Forest Regression model, in predicting premium prices with high accuracy. Ultimately, the knowledge gained from this research can aid insurance companies in delivering more precise, fair, and personalized insurance plans to their clients, fostering stronger customer satisfaction and loyalty.