Customer segmentation divides customers into groups based on shared characteristics. In the insurance industry, segmentation supports data-driven decisions and helps insurers tailor offerings to customer needs.
In this workshop, we will:
1. Load and explore a dataset.
2. Perform segmentation using K-Means clustering.
3. Interpret and visualize the results.
We will use a simulated dataset with the following features:
- CustomerID: Unique identifier for each customer.
- Age: Age of the customer.
- Income: Annual income of the customer (in USD).
- PolicyType: Type of insurance policy (e.g., Auto, Home, Life).
- ClaimFrequency: Number of claims made in the last year.
- RiskProfile: A score indicating the customer’s risk level.
Alternatively, RiskProfile can be derived from ClaimFrequency and Income using this formula:
data$RiskProfile <- (data$ClaimFrequency / max(data$ClaimFrequency)) * 0.6 +
  (1 - (data$Income / max(data$Income))) * 0.4
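As a quick check of how this formula behaves, the sketch below applies it to a tiny hand-made data frame. The three rows in toy are made-up values for illustration only and are not part of the workshop dataset.

```r
# Toy data: three hypothetical customers (illustrative values only)
toy <- data.frame(
  ClaimFrequency = c(0, 5, 10),
  Income = c(100000, 50000, 25000)
)

# Weighted score: 60% relative claim frequency, 40% inverse relative income
toy$RiskProfile <- (toy$ClaimFrequency / max(toy$ClaimFrequency)) * 0.6 +
  (1 - (toy$Income / max(toy$Income))) * 0.4

print(toy)
# RiskProfile comes out as 0.0, 0.5, and 0.9 for the three rows
```

With non-negative inputs, the score stays between 0 and 1: frequent claimants with low income land near 1, while claim-free, high-income customers land near 0.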
Let’s create and load the dataset.
set.seed(123)
data <- data.frame(
  CustomerID = 1:200,
  Age = sample(18:70, 200, replace = TRUE),
  Income = sample(20000:120000, 200, replace = TRUE),
  PolicyType = sample(c("Auto", "Home", "Life"), 200, replace = TRUE),
  ClaimFrequency = sample(0:10, 200, replace = TRUE),
  RiskProfile = runif(200, 0, 1)
)
# Save the dataset for reference
write.csv(data, "insurance_customers.csv", row.names = FALSE)
head(data)
##   CustomerID Age Income PolicyType ClaimFrequency RiskProfile
## 1          1  48  48077       Auto              0  0.35710642
## 2          2  32  89368       Home              0  0.97536501
## 3          3  68 111563       Home              1  0.38437852
## 4          4  31  63405       Auto              1  0.48867354
## 5          5  20  61487       Home              3  0.49077623
## 6          6  59  67484       Auto             10  0.01225462
Let’s summarize and visualize the dataset.
library(ggplot2)
# Summary statistics
summary(data)
## CustomerID Age Income PolicyType
## Min. : 1.00 Min. :18.00 Min. : 20030 Length:200
## 1st Qu.: 50.75 1st Qu.:32.00 1st Qu.: 45821 Class :character
## Median :100.50 Median :44.00 Median : 73394 Mode :character
## Mean :100.50 Mean :45.09 Mean : 70877
## 3rd Qu.:150.25 3rd Qu.:58.25 3rd Qu.: 96259
## Max. :200.00 Max. :70.00 Max. :119700
## ClaimFrequency RiskProfile
## Min. : 0.000 Min. :0.004842
## 1st Qu.: 2.000 1st Qu.:0.244109
## Median : 5.000 Median :0.488741
## Mean : 4.845 Mean :0.505674
## 3rd Qu.: 7.250 3rd Qu.:0.747032
## Max. :10.000 Max. :0.999524
# Income distribution
ggplot(data, aes(x = Income)) +
  geom_histogram(bins = 20, fill = "steelblue", color = "black") +
  theme_minimal() +
  labs(title = "Income Distribution", x = "Income (USD)", y = "Count")
# ClaimFrequency vs Age
ggplot(data, aes(x = Age, y = ClaimFrequency)) +
  geom_point(color = "darkorange", alpha = 0.6) +
  theme_minimal() +
  labs(title = "Claims vs Age", x = "Age", y = "Claim Frequency")
We will segment customers based on Age, Income, ClaimFrequency, and RiskProfile.
K-Means clustering is an unsupervised machine learning algorithm used to partition a dataset into a predefined number of groups (clusters). It aims to group data points so that those in the same cluster are more similar to each other than to those in other clusters.
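To make "more similar to each other" concrete: K-Means tries to minimize the total within-cluster sum of squares, that is, the summed squared distance from each point to the center of its assigned cluster. The sketch below uses simulated two-dimensional points (an assumption for illustration, not the workshop data) and recomputes that quantity by hand to compare it with the value that kmeans() reports.

```r
# Small two-dimensional example (simulated points, for illustration only)
set.seed(1)
pts <- rbind(
  matrix(rnorm(40, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(40, mean = 3, sd = 0.5), ncol = 2)
)

# Fit K-Means with two clusters; nstart reruns from several random starts
fit <- kmeans(pts, centers = 2, nstart = 10)

# Recompute the objective by hand: squared distance of each point
# to the center of the cluster it was assigned to
assigned_centers <- fit$centers[fit$cluster, , drop = FALSE]
wss_manual <- sum((pts - assigned_centers)^2)

# Both numbers should agree up to floating-point error
c(manual = wss_manual, kmeans = fit$tot.withinss)
```

Because the algorithm starts from random centers and can settle in a local minimum, running several starts (nstart) and keeping the best result is a common precaution.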
# Select relevant features
features <- data[, c("Age", "Income", "ClaimFrequency", "RiskProfile")]
# Standardize the data
features_scaled <- scale(features)
# Perform K-Means clustering
set.seed(123)
kmeans_result <- kmeans(features_scaled, centers = 3, nstart = 25)
data$Cluster <- as.factor(kmeans_result$cluster)
# Cluster centers
kmeans_result$centers
##           Age     Income ClaimFrequency RiskProfile
## 1 -0.35926774  0.7855373     0.21330437  -0.8939650
## 2 -0.02521289  0.3292201     0.05893589   1.0075631
## 3  0.37006035 -1.1039920    -0.26700944  -0.2305634
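Because the features were standardized before clustering, the centers are expressed in standard deviations from each column mean rather than in years or dollars. One possible way to read them in original units is to reverse the scaling using the attributes that scale() attaches to its result. The sketch below does this on a small made-up feature table; the values and the two-cluster choice are assumptions for illustration.

```r
# Toy feature table (hypothetical values, not the workshop data)
features <- data.frame(
  Age = c(25, 30, 35, 60, 65, 70),
  Income = c(90000, 95000, 100000, 30000, 35000, 40000)
)

# scale() stores the column means and standard deviations as attributes
features_scaled <- scale(features)
col_means <- attr(features_scaled, "scaled:center")
col_sds <- attr(features_scaled, "scaled:scale")

set.seed(123)
fit <- kmeans(features_scaled, centers = 2, nstart = 25)

# Undo the standardization: multiply by the sd, then add the mean
centers_original <- sweep(fit$centers, 2, col_sds, "*")
centers_original <- sweep(centers_original, 2, col_means, "+")
centers_original
```

The same two sweep() calls could be applied to kmeans_result$centers with the attributes of features_scaled from the workshop code.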
# Cluster visualization (Age vs Income)
ggplot(data, aes(x = Age, y = Income, color = Cluster)) +
  geom_point(alpha = 0.7) +
  theme_minimal() +
  labs(title = "Customer Segments", x = "Age", y = "Income")
# Count of customers in each cluster
table(data$Cluster)
##
##  1  2  3
## 63 71 66
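Cluster sizes alone say little about who is in each segment. A simple way to interpret the segments, sketched here on made-up data with cluster labels already attached, is to average each feature within each cluster using aggregate(). The toy values and labels below are hypothetical.

```r
# Toy data with cluster labels already attached (hypothetical values)
toy <- data.frame(
  Age = c(25, 35, 60, 70, 45, 55),
  Income = c(90000, 100000, 30000, 40000, 60000, 70000),
  ClaimFrequency = c(1, 3, 8, 6, 4, 4),
  Cluster = factor(c(1, 1, 2, 2, 3, 3))
)

# Mean of each feature within each cluster, in original units
profile <- aggregate(cbind(Age, Income, ClaimFrequency) ~ Cluster,
                     data = toy, FUN = mean)
profile
```

On the workshop data, the same call with data = data (and RiskProfile added inside cbind()) would give one row of averages per segment.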
PolicyType can further refine targeting strategies within each segment.
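One way to explore this, as a sketch, is to cross-tabulate cluster membership against policy type and look at the mix inside each segment. The small data frame below is invented for illustration; the counts are not workshop results.

```r
# Hypothetical cluster labels and policy types (illustrative only)
toy <- data.frame(
  Cluster = factor(c(1, 1, 1, 2, 2, 3, 3, 3)),
  PolicyType = c("Auto", "Auto", "Home", "Life", "Life", "Home", "Auto", "Home")
)

# Counts of each policy type within each cluster
counts <- table(toy$Cluster, toy$PolicyType)
counts

# Row-wise proportions: share of each policy type inside a cluster
shares <- prop.table(counts, margin = 1)
round(shares, 2)
```

On the workshop data, table(data$Cluster, data$PolicyType) would produce the corresponding counts.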
Customer segmentation enables data-driven decision-making in insurance and helps tailor offerings to customer needs.