This exploratory data analysis focuses on identifying patterns and risk factors associated with motor insurance claims. Using real policy-level data, we investigate how different vehicle attributes—such as type, usage, manufacturer, insured value, and premium—relate to claim frequency and severity. The goal is to derive actionable insights that support underwriting decisions and pricing strategies in the motor insurance sector.
Claim Amounts Distribution Claim amounts are highly skewed to the right, with most being relatively small but a few large claims driving up the variability. This is a typical pattern in insurance claims data, justifying the use of a log scale in visualizations.
Vehicle Type and Claims Certain vehicle types exhibit higher claim rates, indicating elevated risk levels. This suggests the need for differentiated underwriting based on vehicle classification.
Vehicle Usage Impact Vehicles used for commercial or delivery purposes show significantly higher claim frequencies and severities compared to those used for private purposes. This underscores the importance of usage-based segmentation in pricing.
Insured Value and Claim Occurrence Higher median insured values are observed among vehicles that filed claims, although with high variability. This may reflect greater exposure or a higher tendency to report damage for more valuable vehicles.
Manufacturer Risk Profiles Some vehicle manufacturers are associated with notably higher claim rates. These patterns may be influenced by driver demographics or repair costs and are valuable for refining risk selection.
Mismatch Between Premiums and Risk In some segments, average premiums appear low relative to the average claim amounts. This misalignment could indicate underpricing and profitability risk in certain usage categories.
Temporal Trends Claim rates remain relatively stable across years, with minor fluctuations. Monitoring these trends helps identify long-term shifts in portfolio risk and performance.
# Load libraries
library(readr)
library(dplyr)
library(lubridate)
library(ggplot2)
library(tidyr)
library(plotly)
# Data to be used
data <- read.csv("motor_data14-2018.csv")
# Add a binary flag for claim occurrence
data <- data %>%
mutate(claim_occurred = ifelse(is.na(CLAIM_PAID) | CLAIM_PAID == 0, 0, 1))
# Check how many claims vs no-claims
table(data$claim_occurred)
##
## 0 1
## 470762 37737
prop.table(table(data$claim_occurred))
##
## 0 1
## 0.92578746 0.07421254
# Histogram of CLAIM_PAID (only where claims occurred)
claim_dist_plot <- ggplot(data %>% filter(claim_occurred == 1), aes(x = CLAIM_PAID)) +
geom_histogram(bins = 30, fill = "cadetblue3") +
scale_x_log10() +
labs(
title = "Distribution of Claim Amounts",
x = "CLAIM_PAID (log scale)",
y = "Frequency"
) +
theme_minimal()
ggplotly(claim_dist_plot, tooltip = c("x", "y"))
Interpretation: The distribution of claim amounts is highly right-skewed, which is typical in insurance datasets. Most claims are relatively small, but there are a few large claims that significantly increase the overall variance. The log scale helps visualize the wide range of values more clearly.
# Group by vehicle type and claim status
vehicle_claims <- data %>%
group_by(TYPE_VEHICLE, claim_occurred) %>%
summarise(count = n(), .groups = "drop") %>%
pivot_wider(names_from = claim_occurred, values_from = count, values_fill = 0) %>%
rename(no_claim = `0`, had_claim = `1`) %>%
mutate(total = no_claim + had_claim,
claim_rate = round(had_claim / total, 4)) %>%
arrange(desc(claim_rate))
# Print the summary
print(vehicle_claims)
## # A tibble: 11 × 5
## TYPE_VEHICLE no_claim had_claim total claim_rate
## <chr> <int> <int> <int> <dbl>
## 1 Station Wagones 31529 4455 35984 0.124
## 2 Bus 57997 7591 65588 0.116
## 3 Pick-up 81190 8961 90151 0.0994
## 4 Truck 85939 8157 94096 0.0867
## 5 Automobile 67143 5828 72971 0.0799
## 6 Special construction 7611 609 8220 0.0741
## 7 Tanker 6081 356 6437 0.0553
## 8 Trailers and semitrailers 20436 1110 21546 0.0515
## 9 Tractor 7392 126 7518 0.0168
## 10 Motor-cycle 105433 544 105977 0.0051
## 11 Trade plates 11 0 11 0
vehicle_plot <- ggplot(vehicle_claims, aes(x = reorder(TYPE_VEHICLE, claim_rate), y = claim_rate)) +
geom_col(fill = "cadetblue3") +
coord_flip() +
labs(
title = "Claim Rate by Vehicle Type",
x = "Vehicle Type",
y = "Claim Rate"
) +
theme_minimal()
ggplotly(vehicle_plot, tooltip = c("x", "y"))
Interpretation: Certain vehicle types are associated with a noticeably higher claim rate. For example, commercial vehicles or specialized types may present more risk. This kind of segmentation helps in risk-based pricing and underwriting decisions.
# Group by usage and claim occurrence
usage_claims <- data %>%
group_by(USAGE, claim_occurred) %>%
summarise(count = n(), .groups = "drop") %>%
pivot_wider(names_from = claim_occurred, values_from = count, values_fill = 0) %>%
rename(no_claim = `0`, had_claim = `1`) %>%
mutate(total = no_claim + had_claim,
claim_rate = round(had_claim / total, 4)) %>%
arrange(desc(claim_rate))
# Print the summary
print(usage_claims)
## # A tibble: 14 × 5
## USAGE no_claim had_claim total claim_rate
## <chr> <int> <int> <int> <dbl>
## 1 Ambulance 1738 415 2153 0.193
## 2 Own service 30756 3105 33861 0.0917
## 3 General Cartage 72857 7314 80171 0.0912
## 4 Own Goods 122845 11262 134107 0.084
## 5 Private 112435 9785 122220 0.0801
## 6 Fare Paying Passengers 81299 4776 86075 0.0555
## 7 Car Hires 1500 74 1574 0.047
## 8 Special Construction 4717 150 4867 0.0308
## 9 Fire fighting 36 1 37 0.027
## 10 Learnes 1878 52 1930 0.0269
## 11 Taxi 26166 632 26798 0.0236
## 12 Agricultural Own Farm 5174 89 5263 0.0169
## 13 Agricultural Any Farm 2546 39 2585 0.0151
## 14 Others 6815 43 6858 0.0063
usage_claims_plot <- ggplot(usage_claims, aes(x = reorder(USAGE, claim_rate), y = claim_rate)) +
geom_col(fill = "cadetblue3") +
coord_flip() +
labs(
title = "Claim Rate by Vehicle Usage",
x = "Vehicle Usage",
y = "Claim Rate"
) +
theme_minimal()
ggplotly(usage_claims_plot, tooltip = c("x", "y"))
Interpretation: The usage of the vehicle significantly affects the likelihood of a claim. Vehicles used for business or deliveries tend to have higher claim rates compared to private usage. This insight is crucial for adjusting premiums or eligibility criteria based on risk exposure.
# Summary of claims by manufacturer
make_claims <- data %>%
group_by(MAKE, claim_occurred) %>%
summarise(count = n(), .groups = "drop") %>%
pivot_wider(names_from = claim_occurred, values_from = count, values_fill = 0) %>%
rename(no_claim = `0`, had_claim = `1`) %>%
mutate(total = no_claim + had_claim,
claim_rate = round(had_claim / total, 4)) %>%
arrange(desc(claim_rate))
print(make_claims)
## # A tibble: 747 × 5
## MAKE no_claim had_claim total claim_rate
## <chr> <int> <int> <int> <dbl>
## 1 CHINA ZENGIZO 0 1 1 1
## 2 JAC DUMP 0 2 2 1
## 3 JERMEN 0 1 1 1
## 4 LIZE 0 1 1 1
## 5 NISSAN X-TRIAL 0 1 1 1
## 6 COASTER BUS 1 2 3 0.667
## 7 TOYOTA L/C PRADO 1 2 3 0.667
## 8 CO 3 5 8 0.625
## 9 CO BUS 3 5 8 0.625
## 10 AMI 1 1 2 0.5
## # ℹ 737 more rows
make_plot <- ggplot(make_claims %>% filter(total > 1000),
aes(x = reorder(MAKE, claim_rate), y = claim_rate)) +
geom_col(fill = "#6A5ACD") +
coord_flip() +
labs(
title = "Claim Rate by Vehicle Manufacturer (min 1000 policies)",
x = "Manufacturer",
y = "Claim Rate"
) +
theme_minimal()
# Convert to interactive plot
ggplotly(make_plot, tooltip = c("x", "y"))
Interpretation: Manufacturers show different patterns in claim frequency, even when controlling for sample size (min 1000 policies). Some brands may correlate with higher-risk drivers or vehicles with higher repair costs. This can inform underwriting strategies or partnerships with auto manufacturers.
usage_claims <- data %>%
group_by(USAGE, claim_occurred) %>%
summarise(count = n(), .groups = "drop") %>%
pivot_wider(names_from = claim_occurred, values_from = count, values_fill = 0) %>%
rename(no_claim = `0`, had_claim = `1`) %>%
mutate(total = no_claim + had_claim,
claim_rate = round(had_claim / total, 4)) %>%
filter(total > 1000) %>%
arrange(desc(claim_rate))
print(usage_claims)
## # A tibble: 13 × 5
## USAGE no_claim had_claim total claim_rate
## <chr> <int> <int> <int> <dbl>
## 1 Ambulance 1738 415 2153 0.193
## 2 Own service 30756 3105 33861 0.0917
## 3 General Cartage 72857 7314 80171 0.0912
## 4 Own Goods 122845 11262 134107 0.084
## 5 Private 112435 9785 122220 0.0801
## 6 Fare Paying Passengers 81299 4776 86075 0.0555
## 7 Car Hires 1500 74 1574 0.047
## 8 Special Construction 4717 150 4867 0.0308
## 9 Learnes 1878 52 1930 0.0269
## 10 Taxi 26166 632 26798 0.0236
## 11 Agricultural Own Farm 5174 89 5263 0.0169
## 12 Agricultural Any Farm 2546 39 2585 0.0151
## 13 Others 6815 43 6858 0.0063
This analysis highlights how data-driven exploration of insurance portfolios can reveal critical risk factors that support underwriting and pricing optimization. By understanding the relationships between claim behavior and policy characteristics, insurers can make more informed decisions to improve risk selection, set appropriate premiums, and enhance overall profitability. Future work may include predictive modeling to estimate claim probability and severity at the individual policy level.