In this final project, we aim to predict medical insurance costs using data science and machine learning techniques. The healthcare industry has seen significant advancements in treatments, technologies, and services, leading to a rise in the cost of medical care and insurance. Predicting these costs is critical for healthcare providers and insurance companies for better financial planning, risk management, and developing effective insurance policies.
The challenge lies in the multitude of variables that drive medical insurance costs, such as individual health conditions, lifestyle choices, geographic location, and healthcare provider pricing. This project addresses that complexity by collecting and preprocessing relevant data, engineering useful features, and selecting machine learning models suited to the problem, while respecting data privacy and security standards.
Our approach involves several key steps:
1. Data Understanding: collect and understand the data, identifying key features that influence insurance costs.
2. Data Preparation: clean and preprocess the data, handling missing values, outliers, and feature scaling as needed.
3. Data Visualization: visualize the data to uncover patterns, relationships, and trends that can inform model building.
4. Modeling: develop and train machine learning models to predict insurance costs accurately.
5. Evaluation: evaluate the models using appropriate metrics to ensure their effectiveness and reliability.
By leveraging data science and machine learning, this project aims to provide insights and predictions that help healthcare providers and insurance companies navigate the intricate landscape of medical insurance cost prediction, supporting better financial planning, risk mitigation, and more effective insurance policies.
# Load necessary libraries for models
if (!require("e1071")) install.packages("e1071", dependencies=TRUE)
## Loading required package: e1071
## Warning: package 'e1071' was built under R version 4.3.3
if (!require("randomForest")) install.packages("randomForest", dependencies=TRUE)
## Loading required package: randomForest
## Warning: package 'randomForest' was built under R version 4.3.3
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
if (!require("gbm")) install.packages("gbm", dependencies=TRUE)
## Loading required package: gbm
## Warning: package 'gbm' was built under R version 4.3.3
## Loaded gbm 2.1.9
## This version of gbm is no longer under development. Consider transitioning to gbm3, https://github.com/gbm-developers/gbm3
if (!require("xgboost")) install.packages("xgboost", dependencies=TRUE)
## Loading required package: xgboost
## Warning: package 'xgboost' was built under R version 4.3.3
if (!require("Metrics")) install.packages("Metrics", dependencies=TRUE)
## Loading required package: Metrics
## Warning: package 'Metrics' was built under R version 4.3.3
if (!require("caret")) install.packages("caret", dependencies=TRUE)
## Loading required package: caret
## Warning: package 'caret' was built under R version 4.3.3
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 4.3.3
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:randomForest':
##
## margin
## Loading required package: lattice
## Warning: package 'lattice' was built under R version 4.3.3
##
## Attaching package: 'caret'
## The following objects are masked from 'package:Metrics':
##
## precision, recall
if (!require("ggplot2")) install.packages("ggplot2", dependencies=TRUE)
library(e1071)
library(randomForest)
library(gbm)
library(xgboost)
library(Metrics)
library(caret)
library(ggplot2)
# Specify the path to your CSV file
file_path <- "D:\\Jeevani\\insurance.csv"
# Read the CSV file into a data frame
df <- read.csv(file_path, header = TRUE, sep = ",")
# Set CRAN mirror
options(repos = c(CRAN = "https://cran.rstudio.com/"))
options(warn = -1)
# Data Understanding
head(df)
## age sex bmi children smoker region charges
## 1 19 female 27.900 0 yes southwest 16884.924
## 2 18 male 33.770 1 no southeast 1725.552
## 3 28 male 33.000 3 no southeast 4449.462
## 4 33 male 22.705 0 no northwest 21984.471
## 5 32 male 28.880 0 no northwest 3866.855
## 6 31 female 25.740 0 no southeast 3756.622
class(df)
## [1] "data.frame"
str(df)
## 'data.frame': 1338 obs. of 7 variables:
## $ age : int 19 18 28 33 32 31 46 37 37 60 ...
## $ sex : chr "female" "male" "male" "male" ...
## $ bmi : num 27.9 33.8 33 22.7 28.9 ...
## $ children: int 0 1 3 0 0 0 1 3 2 0 ...
## $ smoker : chr "yes" "no" "no" "no" ...
## $ region : chr "southwest" "southeast" "southeast" "northwest" ...
## $ charges : num 16885 1726 4449 21984 3867 ...
dim(df)
## [1] 1338 7
summary(df)
## age sex bmi children
## Min. :18.00 Length:1338 Min. :15.96 Min. :0.000
## 1st Qu.:27.00 Class :character 1st Qu.:26.30 1st Qu.:0.000
## Median :39.00 Mode :character Median :30.40 Median :1.000
## Mean :39.21 Mean :30.66 Mean :1.095
## 3rd Qu.:51.00 3rd Qu.:34.69 3rd Qu.:2.000
## Max. :64.00 Max. :53.13 Max. :5.000
## smoker region charges
## Length:1338 Length:1338 Min. : 1122
## Class :character Class :character 1st Qu.: 4740
## Mode :character Mode :character Median : 9382
## Mean :13270
## 3rd Qu.:16640
## Max. :63770
sum(is.na(df))
## [1] 0
colSums(is.na(df))
## age sex bmi children smoker region charges
## 0 0 0 0 0 0 0
# Handling Missing Values (if any)
#df <- na.omit(df)
# Encoding Categorical Variables
for (column in colnames(df)) {
  if (is.factor(df[[column]]) || is.character(df[[column]])) {
    df[[column]] <- as.integer(factor(df[[column]]))
  }
}
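# A hedged alternative (not used in this report): integer codes impose an
# artificial order on nominal variables such as region; one-hot encoding
# avoids that. A minimal sketch using base R on a fresh read of the file:
region_dummies <- model.matrix(~ region - 1, data = read.csv(file_path))
head(region_dummies)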
# Splitting the Dataset
x <- df[, !names(df) %in% 'charges']
y <- df$charges
set.seed(2)
trainIndex <- createDataPartition(y, p = 0.8, list = FALSE)
X_train <- x[trainIndex, ]
X_test <- x[-trainIndex, ]
Y_train <- y[trainIndex]
Y_test <- y[-trainIndex]
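# Quick sanity check (illustrative, not part of the original script):
# createDataPartition stratifies a numeric outcome by quantile groups, so the
# train and test charge distributions should be broadly similar.
summary(Y_train)
summary(Y_test)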
# Visualizations
# (Note: the categorical columns were integer-encoded above, so the sex,
# smoker, and region bar charts below show numeric codes rather than the
# original labels; plotting before encoding would preserve the labels.)
# Age Column
ggplot(df, aes(x = age)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 1, fill = "blue", alpha = 0.3) +
  geom_density(color = "blue", linewidth = 1) +
  ggtitle('Age Distribution') +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5))
# Gender column
ggplot(df, aes(x=sex)) + geom_bar() + ggtitle('Sex Distribution')
# BMI Column
ggplot(df, aes(x = bmi)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 1, fill = "green", alpha = 0.3) +
  geom_density(color = "green", linewidth = 1) +
  ggtitle('BMI Distribution') +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5))
# Children column
ggplot(df, aes(x=children)) + geom_bar() + ggtitle('Children Distribution')
# Smoker column
ggplot(df, aes(x=smoker)) + geom_bar() + ggtitle('Smoker Distribution')
# Region column
ggplot(df, aes(x=region)) + geom_bar() + ggtitle('Region Distribution')
# Charges column
ggplot(df, aes(x = charges)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 1000, fill = "red", alpha = 0.3) +
  geom_density(color = "red", linewidth = 1) +
  ggtitle('Charges Distribution') +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5))
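# Charges is strongly right-skewed (median 9382 vs. mean 13270 above); an
# illustrative variant, not part of the original analysis, plots the
# distribution on a log scale, which often makes the shape easier to read.
ggplot(df, aes(x = log(charges))) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 0.25, fill = "red", alpha = 0.3) +
  geom_density(color = "red", linewidth = 1) +
  ggtitle('Log(Charges) Distribution') +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5))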
# Linear Regression Model
model <- train(Y_train ~ ., data = data.frame(X_train, Y_train), method = "lm")
predictions <- predict(model, newdata = data.frame(X_test))
r2 <- R2(predictions, Y_test)
mae <- mae(predictions, Y_test)
print(sprintf("Linear Regression R-squared score: %.2f", r2))
## [1] "Linear Regression R-squared score: 0.67"
print(sprintf("Linear Regression Mean Absolute Error: %.2f", mae))
## [1] "Linear Regression Mean Absolute Error: 4520.58"
# Decision Tree Model
library(rpart)
model <- rpart(Y_train ~ ., data = data.frame(X_train, Y_train), method = "anova")
predictions <- predict(model, newdata = data.frame(X_test))
r2 <- R2(predictions, Y_test)
mae <- mae(predictions, Y_test)
print(sprintf("Decision Tree R-squared score: %.2f", r2))
## [1] "Decision Tree R-squared score: 0.76"
print(sprintf("Decision Tree Mean Absolute Error: %.2f", mae))
## [1] "Decision Tree Mean Absolute Error: 3468.12"
# Random Forest Model
model <- randomForest(Y_train ~ ., data = data.frame(X_train, Y_train), ntree = 100)
predictions <- predict(model, newdata = data.frame(X_test))
r2 <- R2(predictions, Y_test)
mae <- mae(predictions, Y_test)
print(sprintf("Random Forest R-squared score: %.2f", r2))
## [1] "Random Forest R-squared score: 0.79"
print(sprintf("Random Forest Mean Absolute Error: %.2f", mae))
## [1] "Random Forest Mean Absolute Error: 2994.57"
# Ordinary Least Squares Model
model <- lm(Y_train ~ ., data = data.frame(X_train, Y_train))
predictions <- predict(model, newdata = data.frame(X_test))
r2 <- R2(predictions, Y_test)
mae <- mae(predictions, Y_test)
print(sprintf("OLS Model R-squared score: %.2f", r2))
## [1] "OLS Model R-squared score: 0.67"
print(sprintf("OLS Model Mean Absolute Error: %.2f", mae))
## [1] "OLS Model Mean Absolute Error: 4520.58"
summary(model)
##
## Call:
## lm(formula = Y_train ~ ., data = data.frame(X_train, Y_train))
##
## Residuals:
## Min 1Q Median 3Q Max
## -11754.7 -2674.3 -836.9 1631.3 29220.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -36083.32 1287.62 -28.023 < 2e-16 ***
## age 255.42 13.03 19.608 < 2e-16 ***
## sex -186.18 366.14 -0.508 0.611209
## bmi 338.81 30.99 10.932 < 2e-16 ***
## children 563.43 151.26 3.725 0.000206 ***
## smoker 24475.14 451.56 54.202 < 2e-16 ***
## region -300.77 165.84 -1.814 0.070005 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5965 on 1065 degrees of freedom
## Multiple R-squared: 0.7685, Adjusted R-squared: 0.7672
## F-statistic: 589.2 on 6 and 1065 DF, p-value: < 2.2e-16
# Support Vector Regressor Model
svr_model <- train(Y_train ~ ., data = data.frame(X_train, Y_train), method = "svmRadial", trControl = trainControl(method="cv", number=5))
svr_predictions <- predict(svr_model, newdata = data.frame(X_test))
svr_r2 <- R2(svr_predictions, Y_test)
svr_mae <- mae(svr_predictions, Y_test)
print(sprintf("SVR R-squared score: %.2f", svr_r2))
## [1] "SVR R-squared score: 0.77"
print(sprintf("SVR Mean Absolute Error: %.2f", svr_mae))
## [1] "SVR Mean Absolute Error: 2913.61"
# Bagging Model
# (Note: this reuses randomForest with default mtry, which samples a subset of
# predictors at each split; true bagging would set mtry = ncol(X_train) so
# every split considers all predictors.)
bagging_model <- randomForest(Y_train ~ ., data = data.frame(X_train, Y_train), ntree = 100)
bagging_predictions <- predict(bagging_model, newdata = data.frame(X_test))
bagging_r2 <- R2(bagging_predictions, Y_test)
bagging_mae <- mae(bagging_predictions, Y_test)
print(sprintf("Bagging R-squared score: %.2f", bagging_r2))
## [1] "Bagging R-squared score: 0.79"
print(sprintf("Bagging Mean Absolute Error: %.2f", bagging_mae))
## [1] "Bagging Mean Absolute Error: 3006.58"
# Gradient Boosting Model
gbm_model <- gbm(Y_train ~ ., data = data.frame(X_train, Y_train), distribution = "gaussian", n.trees = 100, interaction.depth = 1)
gbm_predictions <- predict(gbm_model, newdata = data.frame(X_test), n.trees = 100)
gbm_r2 <- R2(gbm_predictions, Y_test)
gbm_mae <- mae(gbm_predictions, Y_test)
print(sprintf("GBM R-squared score: %.2f", gbm_r2))
## [1] "GBM R-squared score: 0.67"
print(sprintf("GBM Mean Absolute Error: %.2f", gbm_mae))
## [1] "GBM Mean Absolute Error: 4629.07"
# XGBoost Model
dtrain <- xgb.DMatrix(data = as.matrix(X_train), label = Y_train)
dtest <- xgb.DMatrix(data = as.matrix(X_test))
xgboost_model <- xgboost(data = dtrain, max_depth = 2, eta = 1, nrounds = 100, objective = "reg:squarederror", eval_metric = "rmse", seed = 555)
## [1]   train-rmse:4943.464399
## [2]   train-rmse:4623.665082
## ... (rounds 3-98 omitted; train RMSE decreases steadily) ...
## [99]  train-rmse:3238.257000
## [100] train-rmse:3234.215602
xgboost_predictions <- predict(xgboost_model, newdata = dtest)
xgboost_r2 <- R2(xgboost_predictions, Y_test)
xgboost_mae <- mae(xgboost_predictions, Y_test)
print(sprintf("XGBoost R-squared score: %.2f", xgboost_r2))
## [1] "XGBoost R-squared score: 0.77"
print(sprintf("XGBoost Mean Absolute Error: %.2f", xgboost_mae))
## [1] "XGBoost Mean Absolute Error: 2940.03"
# Load necessary libraries for models
if (!require("e1071")) install.packages("e1071", dependencies=TRUE)
if (!require("randomForest")) install.packages("randomForest", dependencies=TRUE)
if (!require("gbm")) install.packages("gbm", dependencies=TRUE)
if (!require("xgboost")) install.packages("xgboost", dependencies=TRUE)
if (!require("Metrics")) install.packages("Metrics", dependencies=TRUE)
if (!require("caret")) install.packages("caret", dependencies=TRUE)
if (!require("dplyr")) install.packages("dplyr", dependencies=TRUE)
library(e1071)
library(randomForest)
library(gbm)
library(xgboost)
library(Metrics)
library(caret)
library(dplyr)
# Specify the path to your CSV file
file_path <- "D:\\Jeevani\\insurance.csv"
# Read the CSV file into a data frame
df <- read.csv(file_path, header = TRUE, sep = ",")
# Data Understanding: the same checks as in the first run (head, str, dim,
# summary, NA counts); the reloaded data is identical, so the output is too.
# Handling Missing Values (if any)
df <- df %>% mutate(across(everything(), ~ ifelse(is.na(.), median(., na.rm = TRUE), .)))
# Handling Outliers with Z-scores
z_score_outliers <- function(data, cols, threshold = 2.6) {
  for (col in cols) {
    z_scores <- scale(data[[col]])
    data <- data[abs(z_scores) < threshold, ]
  }
  return(data)
}
numerical_cols <- c("age", "bmi", "children", "charges")
df <- z_score_outliers(df, numerical_cols)
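# Note (sketch, not used above): the loop filters sequentially, so each later
# column's z-scores are computed on rows already trimmed by earlier columns.
# A single-pass alternative computes every column's mask on the same data:
z_keep_mask <- function(data, cols, threshold = 2.6) {
  Reduce(`&`, lapply(cols, function(col) abs(scale(data[[col]])) < threshold))
}
# df <- df[z_keep_mask(df, numerical_cols), ]  # would replace the loop above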
# Encoding Categorical Variables
for (column in colnames(df)) {
  if (is.factor(df[[column]]) || is.character(df[[column]])) {
    df[[column]] <- as.integer(factor(df[[column]]))
  }
}
# Standard Scaling Features
# (Note: charges is scaled together with the features, so the MAE values
# reported below are in standard-deviation units of charges, not dollars.)
standard_scaler <- preProcess(df, method = c("center", "scale"))
df <- predict(standard_scaler, df)
# Splitting the Dataset
x <- df[, !names(df) %in% 'charges']
y <- df$charges
set.seed(2)
trainIndex <- createDataPartition(y, p = 0.8, list = FALSE)
X_train <- x[trainIndex, ]
X_test <- x[-trainIndex, ]
Y_train <- y[trainIndex]
Y_test <- y[-trainIndex]
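# A hedged, leakage-aware variant (illustrative only; the report above scales
# the full data before splitting, so test rows inform the scaler). The pattern
# is to fit on training rows only, then apply to both partitions; it is shown
# here on the already-scaled features purely to illustrate the idiom.
scaler_tr  <- preProcess(x[trainIndex, ], method = c("center", "scale"))
X_train_ns <- predict(scaler_tr, x[trainIndex, ])
X_test_ns  <- predict(scaler_tr, x[-trainIndex, ])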
# Define a function to evaluate models
evaluate_model <- function(model, X_test, Y_test) {
  predictions <- predict(model, newdata = data.frame(X_test))
  r2 <- R2(predictions, Y_test)
  mae <- mae(Y_test, predictions)
  return(list(r2 = r2, mae = mae))
}
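# For reference, a minimal sketch of what these metrics compute: caret::R2
# defaults to the squared correlation between predictions and observations,
# and Metrics::mae is the mean absolute error.
r2_manual  <- function(pred, obs) cor(pred, obs)^2
mae_manual <- function(obs, pred) mean(abs(obs - pred))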
# Linear Regression Model
model <- train(Y_train ~ ., data = data.frame(X_train, Y_train), method = "lm")
results <- evaluate_model(model, X_test, Y_test)
print(sprintf("Linear Regression R-squared score: %.2f, MAE: %.2f", results$r2, results$mae))
## [1] "Linear Regression R-squared score: 0.72, MAE: 0.37"
# Decision Tree Model
library(rpart)
model <- rpart(Y_train ~ ., data = data.frame(X_train, Y_train), method = "anova")
results <- evaluate_model(model, X_test, Y_test)
print(sprintf("Decision Tree R-squared score: %.2f, MAE: %.2f", results$r2, results$mae))
## [1] "Decision Tree R-squared score: 0.81, MAE: 0.31"
# Random Forest Model
model <- randomForest(Y_train ~ ., data = data.frame(X_train, Y_train), ntree = 100)
results <- evaluate_model(model, X_test, Y_test)
print(sprintf("Random Forest R-squared score: %.2f, MAE: %.2f", results$r2, results$mae))
## [1] "Random Forest R-squared score: 0.83, MAE: 0.26"
# Ordinary Least Squares Model
model <- lm(Y_train ~ ., data = data.frame(X_train, Y_train))
results <- evaluate_model(model, X_test, Y_test)
print(sprintf("OLS Model R-squared score: %.2f, MAE: %.2f", results$r2, results$mae))
## [1] "OLS Model R-squared score: 0.72, MAE: 0.37"
summary(model)
##
## Call:
## lm(formula = Y_train ~ ., data = data.frame(X_train, Y_train))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.92761 -0.26383 -0.11607 0.08204 2.32909
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.004320 0.016884 0.256 0.79812
## age 0.309094 0.017085 18.091 < 2e-16 ***
## sex 0.001671 0.016964 0.098 0.92156
## bmi 0.165484 0.017258 9.589 < 2e-16 ***
## children 0.053755 0.016819 3.196 0.00144 **
## smoker 0.796958 0.017186 46.373 < 2e-16 ***
## region -0.038366 0.016967 -2.261 0.02396 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5386 on 1012 degrees of freedom
## Multiple R-squared: 0.7107, Adjusted R-squared: 0.709
## F-statistic: 414.4 on 6 and 1012 DF, p-value: < 2.2e-16
# Support Vector Regressor Model
svr_model <- train(Y_train ~ ., data = data.frame(X_train, Y_train), method = "svmRadial", trControl = trainControl(method="cv", number=5))
results <- evaluate_model(svr_model, X_test, Y_test)
print(sprintf("SVR R-squared score: %.2f, MAE: %.2f", results$r2, results$mae))
## [1] "SVR R-squared score: 0.81, MAE: 0.23"
# Bagging Model
bagging_model <- randomForest(Y_train ~ ., data = data.frame(X_train, Y_train), ntree = 100)
results <- evaluate_model(bagging_model, X_test, Y_test)
print(sprintf("Bagging R-squared score: %.2f, MAE: %.2f", results$r2, results$mae))
## [1] "Bagging R-squared score: 0.83, MAE: 0.27"
# Gradient Boosting Model
gbm_model <- gbm(Y_train ~ ., data = data.frame(X_train, Y_train), distribution = "gaussian", n.trees = 100, interaction.depth = 1)
results <- evaluate_model(gbm_model, X_test, Y_test)
## Using 100 trees...
print(sprintf("GBM R-squared score: %.2f, MAE: %.2f", results$r2, results$mae))
## [1] "GBM R-squared score: 0.72, MAE: 0.37"
# XGBoost Model
dtrain <- xgb.DMatrix(data = as.matrix(X_train), label = Y_train)
dtest <- xgb.DMatrix(data = as.matrix(X_test))
xgboost_model <- xgboost(data = dtrain, max_depth = 2, eta = 0.1, nrounds = 100, objective = "reg:squarederror", eval_metric = "rmse", seed = 555)
## [1]   train-rmse:1.025036
## [2]   train-rmse:0.944706
## ... (rounds 3-98 omitted; train RMSE decreases steadily) ...
## [99]  train-rmse:0.394454
## [100] train-rmse:0.394351
xgboost_predictions <- predict(xgboost_model, newdata = dtest)
xgboost_r2 <- R2(xgboost_predictions, Y_test)
xgboost_mae <- mae(Y_test, xgboost_predictions)
print(sprintf("XGBoost R-squared score: %.2f, MAE: %.2f", xgboost_r2, xgboost_mae))
## [1] "XGBoost R-squared score: 0.84, MAE: 0.23"
# Load necessary libraries for models
if (!require("e1071")) install.packages("e1071", dependencies=TRUE)
if (!require("randomForest")) install.packages("randomForest", dependencies=TRUE)
if (!require("gbm")) install.packages("gbm", dependencies=TRUE)
if (!require("xgboost")) install.packages("xgboost", dependencies=TRUE)
if (!require("Metrics")) install.packages("Metrics", dependencies=TRUE)
if (!require("caret")) install.packages("caret", dependencies=TRUE)
if (!require("dplyr")) install.packages("dplyr", dependencies=TRUE)
library(e1071)
library(randomForest)
library(gbm)
library(xgboost)
library(Metrics)
library(caret)
library(dplyr)
# Specify the path to your CSV file
file_path <- "D:\\Jeevani\\insurance.csv"
# Read the CSV file into a data frame
df <- read.csv(file_path, header = TRUE, sep = ",")
# Data Understanding: the same checks as in the first run (head, str, dim,
# summary, NA counts); the reloaded data is identical, so the output is too.
# Handling Missing Values (if any)
df <- df %>% mutate(across(everything(), ~ ifelse(is.na(.), median(., na.rm = TRUE), .)))
# Handling Outliers with Z-scores
z_score_outliers <- function(data, cols, threshold = 2.4) {
  for (col in cols) {
    z_scores <- scale(data[[col]])
    data <- data[abs(z_scores) < threshold, ]
  }
  return(data)
}
numerical_cols <- c("age", "bmi", "children", "charges")
df <- z_score_outliers(df, numerical_cols)
# Encoding Categorical Variables
for (column in colnames(df)) {
  if (is.factor(df[[column]]) || is.character(df[[column]])) {
    df[[column]] <- as.integer(factor(df[[column]]))
  }
}
# Min-Max Scaling Features
# (Note: charges is rescaled to [0, 1] together with the features, so the MAE
# values reported below are fractions of the charges range, not dollars.)
min_max_scaler <- preProcess(df, method = c("range"))
df <- predict(min_max_scaler, df)
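# For reference, preProcess(method = "range") maps each column to [0, 1]; a
# minimal sketch of the same transform:
rescale01 <- function(x) (x - min(x)) / (max(x) - min(x))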
# Splitting the Dataset
x <- df[, !names(df) %in% 'charges']
y <- df$charges
set.seed(2)
trainIndex <- createDataPartition(y, p = 0.8, list = FALSE)
X_train <- x[trainIndex, ]
X_test <- x[-trainIndex, ]
Y_train <- y[trainIndex]
Y_test <- y[-trainIndex]
# Define a function to evaluate models
evaluate_model <- function(model, X_test, Y_test) {
  predictions <- predict(model, newdata = data.frame(X_test))
  r2 <- R2(predictions, Y_test)
  mae <- mae(Y_test, predictions)
  return(list(r2 = r2, mae = mae))
}
# Linear Regression Model
model <- train(Y_train ~ ., data = data.frame(X_train, Y_train), method = "lm")
results <- evaluate_model(model, X_test, Y_test)
print(sprintf("Linear Regression R-squared score: %.2f, MAE: %.2f", results$r2, results$mae))
## [1] "Linear Regression R-squared score: 0.74, MAE: 0.10"
# Decision Tree Model
library(rpart)
model <- rpart(Y_train ~ ., data = data.frame(X_train, Y_train), method = "anova")
results <- evaluate_model(model, X_test, Y_test)
print(sprintf("Decision Tree R-squared score: %.2f, MAE: %.2f", results$r2, results$mae))
## [1] "Decision Tree R-squared score: 0.80, MAE: 0.07"
# Random Forest Model
model <- randomForest(Y_train ~ ., data = data.frame(X_train, Y_train), ntree = 100)
results <- evaluate_model(model, X_test, Y_test)
print(sprintf("Random Forest R-squared score: %.2f, MAE: %.2f", results$r2, results$mae))
## [1] "Random Forest R-squared score: 0.81, MAE: 0.08"
# Ordinary Least Squares Model
model <- lm(Y_train ~ ., data = data.frame(X_train, Y_train))
results <- evaluate_model(model, X_test, Y_test)
print(sprintf("OLS Model R-squared score: %.2f, MAE: %.2f", results$r2, results$mae))
## [1] "OLS Model R-squared score: 0.74, MAE: 0.10"
summary(model)
##
## Call:
## lm(formula = Y_train ~ ., data = data.frame(X_train, Y_train))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.21235 -0.07005 -0.03101 0.01356 0.54131
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.022199 0.014336 -1.548 0.12185
## age 0.236484 0.014241 16.606 < 2e-16 ***
## sex 0.001546 0.008576 0.180 0.85693
## bmi 0.186606 0.021388 8.725 < 2e-16 ***
## children 0.037139 0.012453 2.982 0.00293 **
## smoker 0.490011 0.011462 42.751 < 2e-16 ***
## region -0.028291 0.011727 -2.412 0.01603 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1336 on 970 degrees of freedom
## Multiple R-squared: 0.6777, Adjusted R-squared: 0.6757
## F-statistic: 339.9 on 6 and 970 DF, p-value: < 2.2e-16
# Support Vector Regressor Model
svr_model <- train(Y_train ~ ., data = data.frame(X_train, Y_train), method = "svmRadial", trControl = trainControl(method="cv", number=5))
results <- evaluate_model(svr_model, X_test, Y_test)
print(sprintf("SVR R-squared score: %.2f, MAE: %.2f", results$r2, results$mae))
## [1] "SVR R-squared score: 0.82, MAE: 0.06"
# Bagging Model
bagging_model <- randomForest(Y_train ~ ., data = data.frame(X_train, Y_train), ntree = 100)
results <- evaluate_model(bagging_model, X_test, Y_test)
print(sprintf("Bagging R-squared score: %.2f, MAE: %.2f", results$r2, results$mae))
## [1] "Bagging R-squared score: 0.81, MAE: 0.07"
# Gradient Boosting Model
gbm_model <- gbm(Y_train ~ ., data = data.frame(X_train, Y_train), distribution = "gaussian", n.trees = 100, interaction.depth = 1)
results <- evaluate_model(gbm_model, X_test, Y_test)
## Using 100 trees...
print(sprintf("GBM R-squared score: %.2f, MAE: %.2f", results$r2, results$mae))
## [1] "GBM R-squared score: 0.74, MAE: 0.10"
# XGBoost Model
dtrain <- xgb.DMatrix(data = as.matrix(X_train), label = Y_train)
dtest <- xgb.DMatrix(data = as.matrix(X_test))
xgboost_model <- xgboost(data = dtrain, max_depth = 2, eta = 0.1, nrounds = 100, objective = "reg:squarederror", eval_metric = "rmse", seed = 555)
## [1]   train-rmse:0.309193
## [2]   train-rmse:0.282836
## ... (rounds 3-98 omitted; train RMSE decreases steadily) ...
## [99]  train-rmse:0.098537
## [100] train-rmse:0.098486
xgboost_predictions <- predict(xgboost_model, newdata = dtest)
xgboost_r2 <- R2(xgboost_predictions, Y_test)
xgboost_mae <- mae(Y_test, xgboost_predictions)
print(sprintf("XGBoost R-squared score: %.2f, MAE: %.2f", xgboost_r2, xgboost_mae))
## [1] "XGBoost R-squared score: 0.82, MAE: 0.06"