Introduction

Customer segmentation involves dividing customers into groups based on shared characteristics. In the insurance industry, it helps insurers tailor offerings to customer needs and make data-driven decisions.

In this workshop, we will:

1. Load and explore a dataset.
2. Perform segmentation using K-Means clustering.
3. Interpret and visualize the results.

Dataset

We will use a simulated dataset with the following features:

- CustomerID: Unique identifier for each customer.
- Age: Age of the customer.
- Income: Annual income of the customer (in USD).
- PolicyType: Type of insurance policy (e.g., Auto, Home, Life).
- ClaimFrequency: Number of claims made in the last year.
- RiskProfile: A score indicating the customer’s risk level.

RiskProfile is simulated as a random value below, but it could instead be derived from the other columns with this formula:

data$RiskProfile <- (data$ClaimFrequency / max(data$ClaimFrequency)) * 0.6 + (1 - (data$Income / max(data$Income))) * 0.4
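To make the RiskProfile formula concrete, here is a minimal sketch that applies it to a three-row data frame. The data frame `toy` and its values are hypothetical and are not part of the workshop dataset. The score weights relative claim frequency at 60% and inverse relative income at 40%.

```r
# Toy data frame (hypothetical values, not the workshop dataset)
toy <- data.frame(
  Income = c(30000, 60000, 120000),
  ClaimFrequency = c(10, 5, 0)
)

# Weighted score: 60% relative claim frequency, 40% inverse relative income
toy$RiskProfile <- (toy$ClaimFrequency / max(toy$ClaimFrequency)) * 0.6 +
  (1 - (toy$Income / max(toy$Income))) * 0.4

print(toy)
```

With these values the three scores are 0.9, 0.5, and 0: the customer with the most claims and the lowest income scores highest. Note that the formula returns NaN if no customer has any claims, because it then divides by a maximum of zero.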

Let’s create the dataset and save it to a CSV file.

set.seed(123)

data <- data.frame(
  CustomerID = 1:200,
  Age = sample(18:70, 200, replace = TRUE),
  Income = sample(20000:120000, 200, replace = TRUE),
  PolicyType = sample(c("Auto", "Home", "Life"), 200, replace = TRUE),
  ClaimFrequency = sample(0:10, 200, replace = TRUE),
  RiskProfile = runif(200, 0, 1)
)

# Save the dataset for reference
write.csv(data, "insurance_customers.csv", row.names = FALSE)

head(data)
##   CustomerID Age Income PolicyType ClaimFrequency RiskProfile
## 1          1  48  48077       Auto              0  0.35710642
## 2          2  32  89368       Home              0  0.97536501
## 3          3  68 111563       Home              1  0.38437852
## 4          4  31  63405       Auto              1  0.48867354
## 5          5  20  61487       Home              3  0.49077623
## 6          6  59  67484       Auto             10  0.01225462

Exploratory Data Analysis

Let’s summarize and visualize the dataset.

library(ggplot2)

# Summary statistics
summary(data)
##    CustomerID          Age            Income        PolicyType       
##  Min.   :  1.00   Min.   :18.00   Min.   : 20030   Length:200        
##  1st Qu.: 50.75   1st Qu.:32.00   1st Qu.: 45821   Class :character  
##  Median :100.50   Median :44.00   Median : 73394   Mode  :character  
##  Mean   :100.50   Mean   :45.09   Mean   : 70877                     
##  3rd Qu.:150.25   3rd Qu.:58.25   3rd Qu.: 96259                     
##  Max.   :200.00   Max.   :70.00   Max.   :119700                     
##  ClaimFrequency    RiskProfile      
##  Min.   : 0.000   Min.   :0.004842  
##  1st Qu.: 2.000   1st Qu.:0.244109  
##  Median : 5.000   Median :0.488741  
##  Mean   : 4.845   Mean   :0.505674  
##  3rd Qu.: 7.250   3rd Qu.:0.747032  
##  Max.   :10.000   Max.   :0.999524

# Income distribution
ggplot(data, aes(x = Income)) +
  geom_histogram(bins = 20, fill = "steelblue", color = "black") +
  theme_minimal() +
  labs(title = "Income Distribution", x = "Income (USD)", y = "Count")

# ClaimFrequency vs Age
ggplot(data, aes(x = Age, y = ClaimFrequency)) +
  geom_point(color = "darkorange", alpha = 0.6) +
  theme_minimal() +
  labs(title = "Claims vs Age", x = "Age", y = "Claim Frequency")

K-Means Clustering

We will segment customers based on Age, Income, ClaimFrequency, and RiskProfile.

K-Means clustering is an unsupervised machine learning algorithm used to partition a dataset into a predefined number of groups (clusters). It aims to group data points so that those in the same cluster are more similar to each other than to those in other clusters.
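"More similar" has a precise meaning here: K-Means looks for the assignment that minimizes the total squared distance between each point and the center of its cluster. The sketch below, on a hypothetical two-group matrix `toy`, recomputes that quantity by hand and compares it with the `tot.withinss` value that `kmeans()` reports.

```r
set.seed(42)

# Two well-separated hypothetical groups in two dimensions
toy <- rbind(
  matrix(rnorm(40, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(40, mean = 5, sd = 0.5), ncol = 2)
)

fit <- kmeans(toy, centers = 2, nstart = 10)

# Recompute the objective by hand: squared distance of each point
# to the center of its assigned cluster, summed over all points
assigned_centers <- fit$centers[fit$cluster, , drop = FALSE]
manual_wss <- sum((toy - assigned_centers)^2)

# Compare with the value stored in the kmeans result
print(all.equal(manual_wss, fit$tot.withinss))
```

Because the objective is built from squared Euclidean distances, features measured on large scales (such as Income) would dominate it unless the data are standardized first.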

# Select relevant features
features <- data[, c("Age", "Income", "ClaimFrequency", "RiskProfile")]

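# Standardize the data (mean 0, standard deviation 1 per column)
# so that Income does not dominate the distance calculation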
features_scaled <- scale(features)

# Perform K-Means clustering
set.seed(123)
kmeans_result <- kmeans(features_scaled, centers = 3, nstart = 25)

data$Cluster <- as.factor(kmeans_result$cluster)

# Cluster centers (in standardized units)
kmeans_result$centers
##           Age     Income ClaimFrequency RiskProfile
## 1 -0.35926774  0.7855373     0.21330437  -0.8939650
## 2 -0.02521289  0.3292201     0.05893589   1.0075631
## 3  0.37006035 -1.1039920    -0.26700944  -0.2305634
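The centers above are in standardized units, which makes them hard to read as ages or incomes. One way to interpret them is to undo the scaling using the column means and standard deviations that `scale()` stores as attributes. The sketch below does this on a small hypothetical data frame `toy` with two features; the same steps should carry over to `features_scaled` and `kmeans_result`, though that is not shown here.

```r
set.seed(1)

# Hypothetical stand-in for two of the workshop's numeric features
toy <- data.frame(
  Age = sample(18:70, 50, replace = TRUE),
  Income = sample(20000:120000, 50, replace = TRUE)
)

toy_scaled <- scale(toy)
fit <- kmeans(toy_scaled, centers = 3, nstart = 25)

# scale() stores the column means and standard deviations as attributes
col_means <- attr(toy_scaled, "scaled:center")
col_sds <- attr(toy_scaled, "scaled:scale")

# Undo the standardization: original = scaled * sd + mean
centers_original <- sweep(fit$centers, 2, col_sds, "*")
centers_original <- sweep(centers_original, 2, col_means, "+")

# Cross-check: these equal the per-cluster means of the raw columns
cluster_means <- aggregate(toy, by = list(Cluster = fit$cluster), FUN = mean)

print(round(centers_original))
print(cluster_means)
```

The cross-check with `aggregate()` is often the simpler route in practice: averaging the raw columns by cluster gives the same profile without touching the scaling attributes.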

Visualization of Clusters

# Cluster visualization (Age vs Income)
ggplot(data, aes(x = Age, y = Income, color = Cluster)) +
  geom_point(alpha = 0.7) +
  theme_minimal() +
  labs(title = "Customer Segments", x = "Age", y = "Income")

# Count of customers in each cluster
table(data$Cluster)
## 
##  1  2  3 
## 63 71 66
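The number of clusters was fixed at three in advance. A common heuristic for checking that choice is the elbow method: fit K-Means for several values of k and look for the point where the total within-cluster sum of squares stops dropping sharply. The sketch below uses a hypothetical standardized matrix `toy_scaled` as a stand-in for `features_scaled`; the range of k is an arbitrary choice for illustration.

```r
set.seed(7)

# Hypothetical standardized feature matrix (stand-in for features_scaled)
toy_scaled <- scale(matrix(rnorm(400), ncol = 4))

# Total within-cluster sum of squares for k = 1, ..., 6
k_values <- 1:6
wss <- sapply(k_values, function(k) {
  kmeans(toy_scaled, centers = k, nstart = 25)$tot.withinss
})

# Tabulate the curve; plotting wss against k shows the "elbow", if any
print(data.frame(k = k_values, wss = round(wss, 1)))
```

When the features are sampled independently, as in this simulated dataset, the curve may decline smoothly without a clear elbow, which suggests the segments are a convenient partition rather than naturally separated groups.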

Insights and Recommendations

Customer segmentation enables data-driven decision-making in insurance and helps tailor offerings to customer needs.