Medical insurance, now a necessity for almost everyone, provides a safety net against financial ruin during medical emergencies. In the modern healthcare landscape, the ability to accurately predict medical insurance costs is of paramount importance for both insurers and insured individuals. Insurance companies must balance the need for competitive pricing with the necessity of maintaining profitability and managing risk. For consumers, fair and predictable insurance premiums are critical for ensuring access to necessary medical care without facing financial hardship.
Understanding customer characteristics and predicting medical costs are crucial for designing better pricing strategies and improving customer satisfaction. Insurers face the constant challenge of balancing premiums with clients’ risk profiles to ensure profitability and customer retention. With the rise of data-driven decision-making, leveraging large datasets to gain insights into customer behavior and medical cost trends has become increasingly important.
By utilizing machine learning techniques, we will build models that accurately predict medical costs from customer characteristics. These models will assist insurers in setting premiums that reflect the true risk profile of each customer, leading to more equitable pricing and improved profitability. By segmenting customers into groups with similar traits and cost patterns, insurers can tailor their products and services to the specific needs of each segment. This targeted approach will enhance customer satisfaction and loyalty while optimizing resource allocation.
Ultimately, this project strives to contribute to a more efficient and equitable healthcare system where insurance coverage is both accessible and sustainable.
The dataset used in this project, insurance.csv, contains the following variables:
• ‘age’: Age of the primary insured individual
• ‘sex’: Gender of the primary insured individual (male, female)
• ‘bmi’: Body mass index, providing an understanding of body fat
• ‘children’: Number of children covered under health insurance
• ‘smoker’: Smoking status of the insured individual (yes, no)
• ‘region’: Residential area in the US (northeast, northwest, southeast, southwest)
• ‘charges’: Medical expenses billed to the individual’s health insurance
# Load the ggplot2 package for data visualization
library(ggplot2)
# Load the reshape2 package for data manipulation, especially for reshaping data
library(reshape2)
# Load the corrplot package for visualizing correlation matrices
library(corrplot)
## corrplot 0.92 loaded
# Load the ggthemes package, which enhances ggplot2 with additional themes and color palettes
library(ggthemes)
# Load the ggsci package, which provides scientific-journal and sci-fi color palettes for ggplot2
library(ggsci)
# Load the caret package to streamline the model training process
library(caret)
## Loading required package: lattice
# Load the e1071 package for SVM-related functions
library(e1071)
# Load the pROC package
library(pROC)
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
# Load the data from a CSV file named "insurance.csv" into a data frame called df
df <- read.csv("insurance.csv")
# Display the first few rows of the data frame to get an overview of the data
head(df)
## age sex bmi children smoker region charges
## 1 19 female 27.900 0 yes southwest 16884.924
## 2 18 male 33.770 1 no southeast 1725.552
## 3 28 male 33.000 3 no southeast 4449.462
## 4 33 male 22.705 0 no northwest 21984.471
## 5 32 male 28.880 0 no northwest 3866.855
## 6 31 female 25.740 0 no southeast 3756.622
To determine the number of rows and columns:
# Get the number of rows in the data frame and store it in the variable num_rows
num_rows <- nrow(df)
# Get the number of columns in the data frame and store it in the variable num_columns
num_columns <- ncol(df)
# Print the number of rows in the data frame
cat("Number of rows:", num_rows, "\n")
## Number of rows: 1338
# Print the number of columns in the data frame
cat("Number of columns:", num_columns, "\n")
## Number of columns: 7
Here, we can see that the dataset contains 1338 observations and 7 variables. Next, we display a summary of the dataset:
# Print a message indicating that a summary of the dataset will be displayed
cat("Summary of the dataset:\n")
## Summary of the dataset:
# Display a summary of the dataset, which includes statistics for each column
summary(df)
## age sex bmi children
## Min. :18.00 Length:1338 Min. :15.96 Min. :0.000
## 1st Qu.:27.00 Class :character 1st Qu.:26.30 1st Qu.:0.000
## Median :39.00 Mode :character Median :30.40 Median :1.000
## Mean :39.21 Mean :30.66 Mean :1.095
## 3rd Qu.:51.00 3rd Qu.:34.69 3rd Qu.:2.000
## Max. :64.00 Max. :53.13 Max. :5.000
## smoker region charges
## Length:1338 Length:1338 Min. : 1122
## Class :character Class :character 1st Qu.: 4740
## Mode :character Mode :character Median : 9382
## Mean :13270
## 3rd Qu.:16640
## Max. :63770
The summary shows that customer ages range from 18 to 64, while medical charges range from 1,122 USD to 63,770 USD with a mean of 13,270 USD. The spread of medical costs is large, so we are interested in discovering which characteristics are most strongly associated with them.
We also need to determine if there is any missing value in each column:
# Calculate the number of missing values in each column of the data frame
# The sapply function applies a function to each column of the data frame
# The function used here is sum(is.na(x)), which calculates the number of NA values in each column
missing_values <- sapply(df, function(x) sum(is.na(x)))
# Print a message indicating that the number of missing values in each column will be displayed
cat("Missing values in each column:\n")
## Missing values in each column:
# Print the number of missing values for each column
print(missing_values)
## age sex bmi children smoker region charges
## 0 0 0 0 0 0 0
The result shows that there are no missing values in any of the columns. From the summary, we can also observe that the columns 'sex', 'smoker', and 'region' are stored as character type. Next, let's determine the unique values and their counts in the categorical variables to ensure there are no outliers or unexpected entries.
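Since 'sex', 'smoker', and 'region' are categorical, one option is to convert them to factors explicitly. This step is optional (lm, caret, and ggplot2 coerce character columns on the fly); a minimal sketch:
# Optional: make the categorical columns explicit factors so their levels are fixed
df$sex <- factor(df$sex)
df$smoker <- factor(df$smoker)
df$region <- factor(df$region)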
Determine the unique values and counts for ‘sex’:
# Find the unique values in the 'sex' column of the data frame and store them in unique_sex
unique_sex <- unique(df$sex)
# Create a frequency table of the 'sex' column to count the occurrences of each unique value
sex_count <- table(df$sex)
# Print the unique values in the 'sex' column
cat("Unique values in 'sex':", unique_sex, "\n")
## Unique values in 'sex': female male
# Print a message indicating that the counts of 'male' and 'female' will be displayed
cat("Count of 'male' and 'female':\n")
## Count of 'male' and 'female':
# Print the frequency table showing the count of each unique value in the 'sex' column
print(sex_count)
##
## female male
## 662 676
Determine the unique values and counts for ‘smoker’:
# Find the unique values in the 'smoker' column of the data frame and store them in unique_smoker
unique_smoker <- unique(df$smoker)
# Create a frequency table of the 'smoker' column to count the occurrences of each unique value
smoker_count <- table(df$smoker)
# Print the unique values in the 'smoker' column
cat("Unique values in 'smoker':", unique_smoker, "\n")
## Unique values in 'smoker': yes no
# Print a message indicating that the counts of 'yes' and 'no' will be displayed
cat("Count of 'yes' and 'no':\n")
## Count of 'yes' and 'no':
# Print the frequency table showing the count of each unique value in the 'smoker' column
print(smoker_count)
##
## no yes
## 1064 274
Determine the unique values and counts for ‘region’:
# Find the unique values in the 'region' column of the data frame and store them in unique_region
unique_region <- unique(df$region)
# Create a frequency table of the 'region' column to count the occurrences of each unique value
region_count <- table(df$region)
# Print the unique values in the 'region' column
cat("Unique values in 'region':", unique_region, "\n")
## Unique values in 'region': southwest southeast northwest northeast
# Print a message indicating that the counts of each region will be displayed
cat("Count of each region:\n")
## Count of each region:
# Print the frequency table showing the count of each unique value in the 'region' column
print(region_count)
##
## northeast northwest southeast southwest
## 324 325 364 325
Now, we will check if there is any duplicate data and remove it.
# Find duplicate rows in the data frame and store the result in duplicate_rows
duplicate_rows <- duplicated(df)
# Print the number of duplicate rows in the data frame
cat("Number of duplicate rows:", sum(duplicate_rows), "\n")
## Number of duplicate rows: 1
# Print a message indicating that duplicate rows will be displayed
cat("Duplicate rows:\n")
## Duplicate rows:
# Print the duplicate rows from the data frame
print(df[duplicate_rows, ])
## age sex bmi children smoker region charges
## 582 19 male 30.59 0 no northwest 1639.563
# Create a new data frame (df_clean) by removing the duplicate rows
df_clean <- df[!duplicate_rows, ]
# Print the number of rows in the cleaned data frame
cat("Number of rows after removing duplicates:", nrow(df_clean), "\n")
## Number of rows after removing duplicates: 1337
# Print the number of duplicate rows in the cleaned data frame
cat("Number of duplicate rows after cleaning:", sum(duplicated(df_clean)), "\n")
## Number of duplicate rows after cleaning: 0
# Assign the cleaned data frame (df_clean) back to the original data frame (df)
df <- df_clean
Hence, this shows that the duplicate row has been removed and the data is now clean.
We will also round the values in the 'charges' and 'bmi' columns so that they display only two decimal places.
# Round the 'charges' column in the data frame (df) to two decimal places
df$charges <- round(df$charges, 2)
# Round the 'bmi' column in the data frame (df) to two decimal places
df$bmi <- round(df$bmi, 2)
For ease of analysis, we will create a new column that categorizes customer ages into three groups: 'Young Adult', 'Middle Aged Adult', and 'Old Adult'.
# Create a new column named 'age_categories' in the data frame (df)
# This column will categorize ages into different age groups
df$age_categories <- cut(
# Specify the variable to categorize (age)
x = df$age,
# Specify the breaks (age intervals) for categorization
breaks = c(15, 39, 59, 69),
# Specify the labels for each age group
labels = c("Young Adult", "Middle Aged Adult", "Old Adult"),
# Specify whether the intervals are right-closed or left-closed
right = FALSE
)
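Because right = FALSE makes the intervals left-closed, the groups are [15, 39), [39, 59), and [59, 69), so a 39-year-old falls into 'Middle Aged Adult'. A quick tabulation can confirm that every age lands in a category (counts not shown here):
# Count customers per age category; useNA = "ifany" would expose any ages outside the breaks
table(df$age_categories, useNA = "ifany")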
BMI values will be categorized into five groups as follows:
# Create a new column named 'bmi_categories' in the data frame (df)
# This column will categorize BMI values into different BMI categories
df$bmi_categories <- cut(
# Specify the variable to categorize (bmi)
x = df$bmi,
# Specify the breaks (BMI intervals) for categorization
breaks = c(-Inf, 18.5, 24.9, 29.9, 34.9, Inf),
# Specify the labels for each BMI category
labels = c("Underweight", "Normal", "Overweight", "Obese", "Extremely Obese"),
# Specify whether the intervals are right-closed or left-closed
right = FALSE
)
# Display the first few rows of the data frame to examine the changes
# This helps to verify that the 'age_categories' and 'bmi_categories' columns have been added correctly
head(df)
## age sex bmi children smoker region charges age_categories
## 1 19 female 27.90 0 yes southwest 16884.92 Young Adult
## 2 18 male 33.77 1 no southeast 1725.55 Young Adult
## 3 28 male 33.00 3 no southeast 4449.46 Young Adult
## 4 33 male 22.70 0 no northwest 21984.47 Young Adult
## 5 32 male 28.88 0 no northwest 3866.86 Young Adult
## 6 31 female 25.74 0 no southeast 3756.62 Young Adult
## bmi_categories
## 1 Overweight
## 2 Obese
## 3 Obese
## 4 Normal
## 5 Overweight
## 6 Overweight
A correlation heatmap of the numerical columns is created to examine the relationships between the variables.
# Select numerical columns from the data frame (df) using sapply and is.numeric
numerical_columns <- df[, sapply(df, is.numeric)]
# Calculate the correlation matrix for the numerical columns
correlation_matrix <- cor(numerical_columns)
# Visualize the correlation matrix using corrplot
# Method "color" is used for visualization
# tl.col specifies the color of the text labels on the plot
# tl.srt specifies the rotation angle of the text labels
corrplot(correlation_matrix, method = "color", tl.col = "black", tl.srt = 45)
# Reshape the correlation matrix into long format to observe the correlation values of each pair of variables
(melted_correlation_matrix <- melt(correlation_matrix))
## Var1 Var2 value
## 1 age age 1.00000000
## 2 bmi age 0.10933639
## 3 children age 0.04153621
## 4 charges age 0.29830821
## 5 age bmi 0.10933639
## 6 bmi bmi 1.00000000
## 7 children bmi 0.01274871
## 8 charges bmi 0.19837982
## 9 age children 0.04153621
## 10 bmi children 0.01274871
## 11 children children 1.00000000
## 12 charges children 0.06738935
## 13 age charges 0.29830821
## 14 bmi charges 0.19837982
## 15 children charges 0.06738935
## 16 charges charges 1.00000000
Correlation Summary: among the numerical variables, age shows the strongest (though still moderate) positive correlation with charges (0.30), followed by bmi (0.20) and children (0.07). Correlations among the predictors themselves are weak, so multicollinearity between the numeric features is not a concern.
# Viz 1: Create a bar plot to visualize the distribution of children in the 'insurance' dataset
ggplot(df, aes(x = children, fill = as.factor(children))) +
geom_bar() + # Add bars to the plot
theme_solarized() + # Apply the solarized theme
scale_fill_brewer(palette = "Set2") + # Set color palette using Brewer
ggtitle("Distribution of Children") + # Add title to the plot
theme(plot.title = element_text(hjust = 0.5)) + # Center the title
theme(legend.position = "none") + # Remove the legend
xlab("Number of Children") # Label the x-axis
Insights: the number of children is right-skewed; most customers have two or fewer children (the median is one), and having no children is the most common case.
# Viz 2: Create a bar plot to visualize the distribution of smokers in the 'insurance' dataset
ggplot(df, aes(x = smoker, fill = smoker)) +
geom_bar() + # Add bars to the plot
theme_solarized() + # Apply the solarized theme
scale_fill_brewer(palette = "Accent") + # Set color palette using Brewer
ggtitle("Distribution of Smoker") + # Add title to the plot
theme(plot.title = element_text(hjust = 0.5)) + # Center the title
xlab("Smoker") # Label the x-axis
Insights: the dataset is imbalanced with respect to smoking status, with non-smokers outnumbering smokers roughly four to one (about 1,064 vs. 274).
# Viz 3: Create a box plot to visualize the distribution of charges by region in the 'insurance' dataset
ggplot(df, aes(y = charges, x = region, fill = region)) +
geom_boxplot() + # Add boxplots to the plot
ggtitle("Boxplot of the Charges by Region") + # Add title to the plot
ylab("Charges") + # Label the y-axis
xlab("Region") + # Label the x-axis
theme_solarized() + # Apply the solarized theme
scale_fill_jco() + # Set color palette using the ggsci JCO (Journal of Clinical Oncology) palette
theme(plot.title = element_text(hjust = 0.5)) # Center the title
Insights: the distributions of charges are broadly similar across the four regions, and every region shows a long upper tail of high-cost customers.
# Viz 4: Create a scatter plot to visualize the relationship between charges and age in the 'insurance' dataset,
# with points colored by category of BMI and shaped by smoker status
ggplot(df, aes(x = age, y = charges, col = bmi_categories, shape = smoker)) +
geom_point() + # Add points to the plot
theme_solarized() + # Apply the solarized theme
ggtitle("Relationship between Charges, Age by BMI Category, Smoker Status") + # Add title to the plot
xlab("Age") + # Label the x-axis
ylab("Charges") + # Label the y-axis
guides(color = guide_legend(title = "Category of BMI")) + # Add legend title for color
theme(plot.title = element_text(hjust = 0.5)) # Center the title
Insights: charges tend to increase with age, and the points separate into bands: non-smokers occupy the lowest band, while smokers, especially those in the higher BMI categories, incur markedly higher charges. This pattern motivates the obesity-smoker interaction term used in the second model below.
set.seed(123)
# K fold cross-validation
train_control <- trainControl(method = "repeatedcv",
number = 10, repeats = 3)
# Building model 1
model1 <- train(charges ~., data = df,
method = "lm", preProcess = c("center", "scale"),
trControl = train_control)
summary(model1)
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12037.0 -3383.1 -331.6 1532.7 29167.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13279.12 164.07 80.936 < 2e-16 ***
## age 3461.19 401.36 8.624 < 2e-16 ***
## sexmale -67.62 164.89 -0.410 0.68180
## bmi 306.40 492.73 0.622 0.53416
## children 671.85 166.84 4.027 5.97e-05 ***
## smokeryes 9608.53 165.43 58.080 < 2e-16 ***
## regionnorthwest -172.04 202.55 -0.849 0.39582
## regionsoutheast -423.52 211.42 -2.003 0.04536 *
## regionsouthwest -427.96 203.30 -2.105 0.03548 *
## `age_categoriesMiddle Aged Adult` -185.57 351.30 -0.528 0.59742
## `age_categoriesOld Adult` 421.02 350.01 1.203 0.22924
## bmi_categoriesNormal 371.59 543.42 0.684 0.49422
## bmi_categoriesOverweight 801.79 717.74 1.117 0.26416
## bmi_categoriesObese 2133.93 829.52 2.572 0.01021 *
## `bmi_categoriesExtremely Obese` 2485.28 938.32 2.649 0.00818 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5999 on 1322 degrees of freedom
## Multiple R-squared: 0.7572, Adjusted R-squared: 0.7546
## F-statistic: 294.4 on 14 and 1322 DF, p-value: < 2.2e-16
# Compute in-sample predictions and the corresponding actual values
predictions <- predict(model1, newdata = df)
actual <- df$charges
# Calculate the metric
r_squared <- cor(actual, predictions)^2
rmse <- sqrt(mean((actual - predictions)^2))
mae <- mean(abs(actual - predictions))
mape <- mean(abs((actual - predictions) / actual)) * 100
# Create a dataframe to store the metrics
metrics_df <- data.frame(
Metric = c("RMSE", "MAE", "MAPE", "R-squared"),
Value = c(rmse, mae, mape, r_squared)
)
# Print the metrics dataframe
print(metrics_df)
## Metric Value
## 1 RMSE 5965.423375
## 2 MAE 4206.458682
## 3 MAPE 44.498669
## 4 R-squared 0.757175
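Note that these metrics are computed on the same data the model was fitted to, so they are in-sample estimates. caret also stores the cross-validated estimates from the repeated 10-fold procedure; a minimal sketch for retrieving them (values will differ slightly from the table above):
# Cross-validated RMSE, R-squared, and MAE averaged over the 10-fold x 3-repeat resamples
model1$results[, c("RMSE", "Rsquared", "MAE")]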
# Create new variable of square of age
df$age2 <- df$age^2
# Create an indicator variable for obesity (BMI of 30 or above), derived from bmi
df$bmi30 <- ifelse(df$bmi >= 30, "yes", "no")
# Build a second model using derived variables
set.seed(123)
train_control <- trainControl(method = "repeatedcv",
number = 10, repeats = 3)
model2 <- train(charges ~ age + age2 + children + bmi + sex + bmi30*smoker + region,
data = df, method = "lm", preProcess = c("center", "scale"),
trControl = train_control)
summary(model2)
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17297.1 -1657.7 -1267.8 -723.7 24155.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13279.1 121.6 109.202 < 2e-16 ***
## age -465.9 840.9 -0.554 0.579655
## age2 4205.8 840.6 5.003 6.39e-07 ***
## children 817.6 127.7 6.402 2.13e-10 ***
## bmi 730.6 209.2 3.493 0.000494 ***
## sexmale -247.6 122.3 -2.025 0.043102 *
## bmi30yes -497.1 211.4 -2.351 0.018863 *
## smokeryes 5412.9 177.7 30.458 < 2e-16 ***
## regionnorthwest -118.2 149.9 -0.789 0.430344
## regionsoutheast -368.9 156.6 -2.355 0.018654 *
## regionsouthwest -524.6 150.5 -3.487 0.000505 ***
## `bmi30yes:smokeryes` 6161.9 188.2 32.746 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4446 on 1325 degrees of freedom
## Multiple R-squared: 0.8663, Adjusted R-squared: 0.8652
## F-statistic: 780.5 on 11 and 1325 DF, p-value: < 2.2e-16
# Compute in-sample predictions and the corresponding actual values
predictions <- predict(model2, newdata = df)
actual <- df$charges
# Calculate the metric
r_squared <- cor(actual, predictions)^2
rmse <- sqrt(mean((actual - predictions)^2))
mae <- mean(abs(actual - predictions))
mape <- mean(abs((actual - predictions) / actual)) * 100
# Create a dataframe to store the metrics
metrics_df <- data.frame(
Metric = c("RMSE", "MAE", "MAPE", "R-squared"),
Value = c(rmse, mae, mape, r_squared)
)
# Print the metrics dataframe
print(metrics_df)
## Metric Value
## 1 RMSE 4426.3724416
## 2 MAE 2405.7782409
## 3 MAPE 27.3613029
## 4 R-squared 0.8663076
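Because both models were trained with the same seed and an identical trainControl specification, their resampling folds match, and caret's resamples() can compare the cross-validated performance directly; a minimal sketch:
# Side-by-side summary of the cross-validated metrics of the two linear models
model_comparison <- resamples(list(baseline = model1, engineered = model2))
summary(model_comparison)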
# Predict charges using model2
predicted_charges <- predict(model2, newdata = df)
# Plot actual vs predicted charges using model 2
ggplot(df, aes(x = charges, y = predicted_charges)) +
geom_point(color = "blue", alpha = 0.5) +
geom_abline(intercept = 0, slope = 1, color = "red", linetype = "dashed") +
ggtitle("Predicted vs Actual Charges") +
xlab("Actual Charges") +
ylab("Predicted Charges") +
theme_solarized()
# Feature Scaling Using Standardization
numeric_columns <- sapply(df, is.numeric)
scaled_data <- as.data.frame(scale(df[, numeric_columns]))
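# Caution: 'numeric_columns' includes the target 'charges', so the scaled charges
# column is carried into the SVR feature matrix below, leaking the target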
# Split into train and test set
set.seed(123)
train_index <- sample(1:nrow(df), 0.8 * nrow(df))
train_data <- scaled_data[train_index,]
test_data <- scaled_data[-train_index,]
train_target <- df$charges[train_index]
test_target <- df$charges[-train_index]
# Build SVR Model
svr_model <- svm(train_target ~ ., data = train_data, kernel = "radial")
# SVR Model Evaluation
predictions <- predict(svr_model, test_data)
rmse <- sqrt(mean((test_target - predictions)^2))
r_squared <- cor(test_target, predictions)^2
mae <- mean(abs(test_target - predictions))
mape <- mean(abs((test_target - predictions) / test_target)) * 100
# Create a dataframe to store the metrics
metrics_df <- data.frame(
Metric = c("RMSE", "MAE", "MAPE", "R-squared"),
Value = c(rmse, mae, mape, r_squared)
)
# Print the metrics dataframe
print(metrics_df)
## Metric Value
## 1 RMSE 819.1449701
## 2 MAE 619.6960785
## 3 MAPE 9.5085004
## 4 R-squared 0.9954605
# Create the plot
plot(test_target, predictions,
main = "SVR Model: Predicted vs Actual Charges",
xlab = "Actual Charges", ylab = "Predicted Charges",
pch = 16, col = rgb(0.2, 0.4, 0.6, 0.6), # Adding color and transparency
cex = 1.2, # Increase point size
cex.main = 1.5, cex.lab = 1.3, cex.axis = 1.2) # Increase font sizes for better readability
# Add the reference line
abline(0, 1, col = "red", lwd = 2) # Make the reference line thicker
# Add grid lines
grid(col = "lightgray", lty = "dotted")
# Add a legend
legend("topleft", legend = c("Data Points", "Ideal Fit"),
col = c(rgb(0.2, 0.4, 0.6, 0.6), "red"),
pch = c(16, NA), lty = c(NA, 1), lwd = c(NA, 2),
pt.cex = 1.2, cex = 1.2, bty = "n")
The Support Vector Regression model appears to outperform the multiple linear regression models on every metric. These numbers should be read with caution, however: the scaled feature matrix still contains the (scaled) charges column, so the SVR effectively sees its own target as a predictor, which inflates the reported performance.
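A leakage-free variant would drop the target from the feature matrix before scaling. A minimal sketch reusing the same split (the object names are illustrative, and the resulting metrics can be expected to be considerably less flattering):
# Rebuild the scaled feature matrix without the target column
feature_cols <- setdiff(names(df)[sapply(df, is.numeric)], "charges")
scaled_features <- as.data.frame(scale(df[, feature_cols]))
# Reuse the existing train/test split
train_X <- scaled_features[train_index, ]
test_X <- scaled_features[-train_index, ]
# Refit the radial-kernel SVR on predictors only and evaluate on the test set
svr_clean <- svm(train_target ~ ., data = train_X, kernel = "radial")
clean_pred <- predict(svr_clean, test_X)
cat("Leakage-free RMSE:", sqrt(mean((test_target - clean_pred)^2)), "\n")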
# Install and load necessary packages
if (!require(randomForest)) {
install.packages("randomForest", dependencies = TRUE)
library(randomForest)
}
## Loading required package: randomForest
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
# Load other necessary libraries
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:randomForest':
##
## combine
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(caret)
library(ROSE)
## Loaded ROSE 0.0-4
library(ggplot2)
library(pROC)
# Read the dataset
df <- read.csv('insurance.csv')
# Convert smoker variable to binary (0 and 1)
df$smoker <- ifelse(df$smoker == "yes", 1, 0)
# Add new categorical variables: age_group and bmi_group
df <- df %>%
mutate(age_group = cut(age, breaks = c(0, 18, 30, 45, 60, Inf), labels = c("0-18", "19-30", "31-45", "46-60", "60+")),
bmi_group = cut(bmi, breaks = c(0, 18.5, 25, 30, Inf), labels = c("Underweight", "Normal", "Overweight", "Obese")))
# Define high and low insurance cost based on median charges
median_charge <- median(df$charges)
df <- df %>%
mutate(high_low = ifelse(charges > median_charge, "High", "Low"))
# Convert high_low to factor
df$high_low <- as.factor(df$high_low)
# Split the data into training and testing sets
set.seed(123)
train_index <- sample(1:nrow(df), 0.8 * nrow(df))
train_data <- df[train_index, ]
test_data <- df[-train_index, ]
train_target <- df$high_low[train_index]
test_target <- df$high_low[-train_index]
# Ensure training data has both classes
while (length(unique(train_target)) < 2) {
train_index <- sample(1:nrow(df), 0.8 * nrow(df))
train_data <- df[train_index, ]
test_data <- df[-train_index, ]
train_target <- df$high_low[train_index]
test_target <- df$high_low[-train_index]
}
# Balance the training data using oversampling
train_data_balanced <- ovun.sample(high_low ~ ., data = train_data, method = "over", N = sum(df$high_low == "Low") * 2)$data
# Build the random forest model on balanced data
set.seed(123)
rf_model <- randomForest(as.factor(high_low) ~ age + bmi + sex + children + region + smoker, data = train_data_balanced, ntree = 500)
# Predict on the test set
# The probability columns follow the factor level order ("High", "Low"),
# so we take the "Low" column and label the predictions accordingly
prob_predictions <- predict(rf_model, newdata = test_data, type = "prob")[, "Low"]
predictions <- ifelse(prob_predictions > 0.5, "Low", "High")
# Calculate confusion matrix
confusion_matrix <- table(test_target, predictions)
# Visualize confusion matrix
confusionMatrixPlot <- function(cm) {
cm_df <- as.data.frame(as.table(cm))
colnames(cm_df) <- c("Reference", "Prediction", "Freq")
ggplot(data = cm_df, aes(x = Prediction, y = Reference, fill = Freq)) +
geom_tile() +
geom_text(aes(label = Freq)) +
scale_fill_gradient(low = "white", high = "red") +
labs(title = "Confusion Matrix", x = "Predicted", y = "Actual") +
theme_minimal()
}
# Calculate model performance metrics
accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
# Ensure there are "High" and "Low" in the confusion matrix to avoid subscript out of bounds error
if ("High" %in% colnames(confusion_matrix) && "High" %in% rownames(confusion_matrix)) {
precision <- confusion_matrix["High", "High"] / sum(confusion_matrix[, "High"])
recall <- confusion_matrix["High", "High"] / sum(confusion_matrix["High", ])
} else {
precision <- 0
recall <- 0
}
f1_score <- ifelse(precision + recall == 0, 0, 2 * (precision * recall) / (precision + recall))
metrics_df <- data.frame(Metric = c("Accuracy", "Precision", "Recall", "F1-Score"),
Value = c(accuracy, precision, recall, f1_score))
# Print performance metrics
print(metrics_df)
## Metric Value
## 1 Accuracy 0.94402985
## 2 Precision 0.97345133
## 3 Recall 0.90163934
## 4 F1-Score 0.93617021
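As an alternative to computing each metric by hand, caret's confusionMatrix() returns accuracy, precision, recall, and more in one call; a minimal sketch treating "High" as the positive class:
# One-call computation of the classification metrics
cm <- confusionMatrix(factor(predictions, levels = levels(test_target)),
test_target, positive = "High")
cm$byClass[c("Precision", "Recall", "F1")]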
# Plot confusion matrix
confusionMatrixPlot(confusion_matrix)
# Plot ROC curve and calculate AUC
roc_obj <- roc(test_target, prob_predictions)
## Setting levels: control = High, case = Low
## Setting direction: controls < cases
auc_value <- auc(roc_obj)
ggroc(roc_obj) +
ggtitle(paste("ROC Curve (AUC:", round(auc_value, 3), ")")) +
theme_minimal()
# Install and load ROSE package (if not already installed)
if (!require(ROSE)) {
install.packages("ROSE", dependencies = TRUE)
library(ROSE)
}
# Load other necessary libraries
library(dplyr)
library(caret)
library(pROC)
library(ggplot2)
# Read the dataset
df <- read.csv('insurance.csv')
# Convert smoker variable to binary (0 and 1)
df$smoker <- ifelse(df$smoker == "yes", 1, 0)
# Add new categorical variables: age_group and bmi_group
df <- df %>%
mutate(age_group = cut(age, breaks = c(0, 18, 30, 45, 60, Inf), labels = c("0-18", "19-30", "31-45", "46-60", "60+")),
bmi_group = cut(bmi, breaks = c(0, 18.5, 25, 30, Inf), labels = c("Underweight", "Normal", "Overweight", "Obese")))
# Define high and low insurance cost based on median charges
median_charge <- median(df$charges)
df <- df %>%
mutate(high_low = ifelse(charges > median_charge, "High", "Low"))
# Convert high_low to factor
df$high_low <- as.factor(df$high_low)
# Split the data into training and testing sets
set.seed(123)
train_index <- sample(1:nrow(df), 0.8 * nrow(df))
train_data <- df[train_index, ]
test_data <- df[-train_index, ]
train_target <- df$high_low[train_index]
test_target <- df$high_low[-train_index]
# Ensure training data has both classes
while (length(unique(train_target)) < 2) {
train_index <- sample(1:nrow(df), 0.8 * nrow(df))
train_data <- df[train_index, ]
test_data <- df[-train_index, ]
train_target <- df$high_low[train_index]
test_target <- df$high_low[-train_index]
}
# Balance the training data using oversampling
train_data_balanced <- ovun.sample(high_low ~ ., data = train_data, method = "over", N = sum(df$high_low == "Low") * 2)$data
# Build the logistic regression model on balanced data
log_model <- glm(high_low ~ age + bmi + sex + children + region + smoker, data = train_data_balanced, family = binomial)
# Predict on the test set
# For a factor response, glm models the probability of the second level ("Low"),
# so probabilities above 0.5 correspond to the "Low" class
prob_predictions <- predict(log_model, newdata = test_data, type = "response")
predictions <- ifelse(prob_predictions > 0.5, "Low", "High")
# Calculate confusion matrix
confusion_matrix <- table(test_target, predictions)
# Visualize confusion matrix
confusionMatrixPlot <- function(cm) {
cm_df <- as.data.frame(as.table(cm))
colnames(cm_df) <- c("Reference", "Prediction", "Freq")
ggplot(data = cm_df, aes(x = Prediction, y = Reference, fill = Freq)) +
geom_tile() +
geom_text(aes(label = Freq)) +
scale_fill_gradient(low = "white", high = "red") +
labs(title = "Confusion Matrix", x = "Predicted", y = "Actual") +
theme_minimal()
}
# Calculate model performance metrics
accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
# Ensure there are "High" and "Low" in the confusion matrix to avoid subscript out of bounds error
if ("High" %in% colnames(confusion_matrix) && "High" %in% rownames(confusion_matrix)) {
precision <- confusion_matrix["High", "High"] / sum(confusion_matrix[, "High"])
recall <- confusion_matrix["High", "High"] / sum(confusion_matrix["High", ])
} else {
precision <- 0
recall <- 0
}
f1_score <- ifelse(precision + recall == 0, 0, 2 * (precision * recall) / (precision + recall))
metrics_df <- data.frame(Metric = c("Accuracy", "Precision", "Recall", "F1-Score"),
Value = c(accuracy, precision, recall, f1_score))
# Print performance metrics
print(metrics_df)
## Metric Value
## 1 Accuracy 0.92164179
## 2 Precision 0.90400000
## 3 Recall 0.92622951
## 4 F1-Score 0.91497976
# Plot ROC curve
roc_obj <- roc(test_target, as.numeric(prob_predictions))
## Setting levels: control = High, case = Low
## Setting direction: controls < cases
auc_value <- auc(roc_obj)
ggroc(roc_obj) +
ggtitle(paste("ROC Curve (AUC:", round(auc_value, 3), ")")) +
theme_minimal()
# Plot confusion matrix
confusionMatrixPlot(confusion_matrix)
After comparing the results of the logistic regression and random forest models, we choose the random forest model. The logistic regression model achieves an AUC of 0.957, an accuracy of 92.2%, a precision of 90.4%, a recall of 92.6%, and an F1-score of 91.5%, while the random forest model achieves an AUC of 0.963, an accuracy of 94.4%, a precision of 97.3%, a recall of 90.2%, and an F1-score of 93.6%. The random forest model outperforms logistic regression in AUC, accuracy, precision, and F1-score, showing stronger discriminative ability, higher overall accuracy, and better balance, although logistic regression attains a slightly higher recall. Combining all the performance indicators, the random forest model performs better at predicting high versus low insurance costs, and we therefore select it.
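To see which features drive the chosen random forest model, its variable importance can be inspected; a minimal sketch (given the regression results above, smoker can be expected to dominate):
# Mean decrease in Gini impurity for each predictor in the fitted forest
importance(rf_model)
varImpPlot(rf_model, main = "Random Forest Variable Importance")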