Introduction

This exploratory data analysis focuses on identifying patterns and risk factors associated with motor insurance claims. Using real policy-level data, we investigate how different vehicle attributes—such as type, usage, manufacturer, insured value, and premium—relate to claim frequency and severity. The goal is to derive actionable insights that support underwriting decisions and pricing strategies in the motor insurance sector.

Key Insights

Claim Amounts Distribution Claim amounts are highly skewed to the right, with most being relatively small but a few large claims driving up the variability. This is a typical pattern in insurance claims data, justifying the use of a log scale in visualizations.

Vehicle Type and Claims Certain vehicle types exhibit higher claim rates, indicating elevated risk levels. This suggests the need for differentiated underwriting based on vehicle classification.

Vehicle Usage Impact Vehicles used for commercial or delivery purposes show significantly higher claim frequencies and severities compared to those used for private purposes. This underscores the importance of usage-based segmentation in pricing.

Insured Value and Claim Occurrence Higher median insured values are observed among vehicles that filed claims, although with high variability. This may reflect greater exposure or a higher tendency to report damage for more valuable vehicles.

Manufacturer Risk Profiles Some vehicle manufacturers are associated with notably higher claim rates. These patterns may be influenced by driver demographics or repair costs and are valuable for refining risk selection.

Mismatch Between Premiums and Risk In some segments, average premiums appear low relative to the average claim amounts. This misalignment could indicate underpricing and profitability risk in certain usage categories.

Temporal Trends Claim rates remain relatively stable across years, with minor fluctuations. Monitoring these trends helps identify long-term shifts in portfolio risk and performance.

# Load libraries
library(readr)
library(dplyr)
library(lubridate)
library(ggplot2)
library(tidyr)
library(plotly)
# Data to be used
data <- read.csv("motor_data14-2018.csv")

Create a binary variable for claims

# Add a binary flag for claim occurrence
data <- data %>%
  mutate(claim_occurred = ifelse(is.na(CLAIM_PAID) | CLAIM_PAID == 0, 0, 1))

# Check how many claims vs no-claims
table(data$claim_occurred)
## 
##      0      1 
## 470762  37737
prop.table(table(data$claim_occurred))
## 
##          0          1 
## 0.92578746 0.07421254

Visualize claim amounts

# Histogram of CLAIM_PAID (only where claims occurred)
claim_dist_plot <- ggplot(data %>% filter(claim_occurred == 1), aes(x = CLAIM_PAID)) +
  geom_histogram(bins = 30, fill = "cadetblue3") +
  scale_x_log10() +
  labs(
    title = "Distribution of Claim Amounts",
    x = "CLAIM_PAID (log scale)",
    y = "Frequency"
  ) +
  theme_minimal()
ggplotly(claim_dist_plot, tooltip = c("x", "y"))

Interpretation: The distribution of claim amounts is highly right-skewed, which is typical in insurance datasets. Most claims are relatively small, but there are a few large claims that significantly increase the overall variance. The log scale helps visualize the wide range of values more clearly.

Analyze Claims by Vehicle Type (TYPE_VEHICLE)

# Group by vehicle type and claim status
vehicle_claims <- data %>%
  group_by(TYPE_VEHICLE, claim_occurred) %>%
  summarise(count = n(), .groups = "drop") %>%
  pivot_wider(names_from = claim_occurred, values_from = count, values_fill = 0) %>%
  rename(no_claim = `0`, had_claim = `1`) %>%
  mutate(total = no_claim + had_claim,
         claim_rate = round(had_claim / total, 4)) %>%
  arrange(desc(claim_rate))

# Print the summary
print(vehicle_claims)
## # A tibble: 11 × 5
##    TYPE_VEHICLE              no_claim had_claim  total claim_rate
##    <chr>                        <int>     <int>  <int>      <dbl>
##  1 Station Wagones              31529      4455  35984     0.124 
##  2 Bus                          57997      7591  65588     0.116 
##  3 Pick-up                      81190      8961  90151     0.0994
##  4 Truck                        85939      8157  94096     0.0867
##  5 Automobile                   67143      5828  72971     0.0799
##  6 Special construction          7611       609   8220     0.0741
##  7 Tanker                        6081       356   6437     0.0553
##  8 Trailers and semitrailers    20436      1110  21546     0.0515
##  9 Tractor                       7392       126   7518     0.0168
## 10 Motor-cycle                 105433       544 105977     0.0051
## 11 Trade plates                    11         0     11     0

Plot Claim Rate by Vehicle Type

vehicle_plot <- ggplot(vehicle_claims, aes(x = reorder(TYPE_VEHICLE, claim_rate), y = claim_rate)) +
  geom_col(fill = "cadetblue3") +
  coord_flip() +
  labs(
    title = "Claim Rate by Vehicle Type",
    x = "Vehicle Type",
    y = "Claim Rate"
  ) +
  theme_minimal()

ggplotly(vehicle_plot, tooltip = c("x", "y"))

Interpretation: Certain vehicle types are associated with a noticeably higher claim rate. For example, commercial vehicles or specialized types may present more risk. This kind of segmentation helps in risk-based pricing and underwriting decisions.

Analyze Claims by Vehicle Usage (USAGE)

# Group by usage and claim occurrence
usage_claims <- data %>%
  group_by(USAGE, claim_occurred) %>%
  summarise(count = n(), .groups = "drop") %>%
  pivot_wider(names_from = claim_occurred, values_from = count, values_fill = 0) %>%
  rename(no_claim = `0`, had_claim = `1`) %>%
  mutate(total = no_claim + had_claim,
         claim_rate = round(had_claim / total, 4)) %>%
  arrange(desc(claim_rate))

# Print the summary
print(usage_claims)
## # A tibble: 14 × 5
##    USAGE                  no_claim had_claim  total claim_rate
##    <chr>                     <int>     <int>  <int>      <dbl>
##  1 Ambulance                  1738       415   2153     0.193 
##  2 Own service               30756      3105  33861     0.0917
##  3 General Cartage           72857      7314  80171     0.0912
##  4 Own Goods                122845     11262 134107     0.084 
##  5 Private                  112435      9785 122220     0.0801
##  6 Fare Paying Passengers    81299      4776  86075     0.0555
##  7 Car Hires                  1500        74   1574     0.047 
##  8 Special Construction       4717       150   4867     0.0308
##  9 Fire fighting                36         1     37     0.027 
## 10 Learnes                    1878        52   1930     0.0269
## 11 Taxi                      26166       632  26798     0.0236
## 12 Agricultural Own Farm      5174        89   5263     0.0169
## 13 Agricultural Any Farm      2546        39   2585     0.0151
## 14 Others                     6815        43   6858     0.0063
usage_claims_plot <- ggplot(usage_claims, aes(x = reorder(USAGE, claim_rate), y = claim_rate)) +
  geom_col(fill = "cadetblue3") +
  coord_flip() +
  labs(
    title = "Claim Rate by Vehicle Usage",
    x = "Vehicle Usage",
    y = "Claim Rate"
  ) +
  theme_minimal()

ggplotly(usage_claims_plot, tooltip = c("x", "y"))

Interpretation: The usage of the vehicle significantly affects the likelihood of a claim. Vehicles used for business or deliveries tend to have higher claim rates compared to private usage. This insight is crucial for adjusting premiums or eligibility criteria based on risk exposure.

Analyze Relationship Between Insured Value, Premium, and Claims

# Summary statistics by claim occurrence
summary_stats <- data %>%
  group_by(claim_occurred) %>%
  summarise(
    avg_insured_value = mean(INSURED_VALUE, na.rm = TRUE),
    median_insured_value = median(INSURED_VALUE, na.rm = TRUE),
    avg_premium = mean(PREMIUM, na.rm = TRUE),
    median_premium = median(PREMIUM, na.rm = TRUE),
    count = n()
  )

print(summary_stats)
## # A tibble: 2 × 6
##   claim_occurred avg_insured_value median_insured_value avg_premium
##            <dbl>             <dbl>                <dbl>       <dbl>
## 1              0           525827.               52463.       7332.
## 2              1          1036063.              800000       15586.
## # ℹ 2 more variables: median_premium <dbl>, count <int>

Visualization: Boxplots of Insured Value and Premium by Claim Status

# Boxplot for Insured Value
ggplot(data, aes(x = factor(claim_occurred), y = INSURED_VALUE)) +
  geom_boxplot(fill = c("cadetblue3", "#FF6347")) +
  scale_y_log10() +
  labs(
    title = "Insured Value by Claim Occurrence",
    x = "Claim Occurred (0 = No, 1 = Yes)",
    y = "Insured Value (log scale)"
  ) +
  theme_minimal()
## Warning in scale_y_log10(): log-10 transformation introduced infinite values.
## Warning: Removed 231392 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

# Boxplot for Premium
ggplot(data, aes(x = factor(claim_occurred), y = PREMIUM)) +
  geom_boxplot(fill = c("cadetblue3", "#FF6347")) +
  scale_y_log10() +
  labs(title = "Premium by Claim Occurrence",
       x = "Claim Occurred (0 = No, 1 = Yes)",
       y = "Premium (log scale)") +
  theme_minimal()
## Warning in scale_y_log10(): log-10 transformation introduced infinite values.
## Warning: Removed 22 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

Interpretation: The median insured value tends to be higher among vehicles that filed a claim. However, the variability is also wider, indicating some high-value vehicles are involved in claims. This could imply that higher-value vehicles are either more exposed or more likely to report and process claims.

Analyze Claims by Vehicle Manufacturer (MAKE)

# Summary of claims by manufacturer
make_claims <- data %>%
  group_by(MAKE, claim_occurred) %>%
  summarise(count = n(), .groups = "drop") %>%
  pivot_wider(names_from = claim_occurred, values_from = count, values_fill = 0) %>%
  rename(no_claim = `0`, had_claim = `1`) %>%
  mutate(total = no_claim + had_claim,
         claim_rate = round(had_claim / total, 4)) %>%
  arrange(desc(claim_rate))

print(make_claims)
## # A tibble: 747 × 5
##    MAKE             no_claim had_claim total claim_rate
##    <chr>               <int>     <int> <int>      <dbl>
##  1 CHINA ZENGIZO           0         1     1      1    
##  2 JAC DUMP                0         2     2      1    
##  3 JERMEN                  0         1     1      1    
##  4 LIZE                    0         1     1      1    
##  5 NISSAN X-TRIAL          0         1     1      1    
##  6 COASTER BUS             1         2     3      0.667
##  7 TOYOTA L/C PRADO        1         2     3      0.667
##  8 CO                      3         5     8      0.625
##  9 CO BUS                  3         5     8      0.625
## 10 AMI                     1         1     2      0.5  
## # ℹ 737 more rows

Visualization: Claims rate by manufacturer

make_plot <- ggplot(make_claims %>% filter(total > 1000),
                    aes(x = reorder(MAKE, claim_rate), y = claim_rate)) +
  geom_col(fill = "#6A5ACD") +
  coord_flip() +
  labs(
    title = "Claim Rate by Vehicle Manufacturer (min 1000 policies)",
    x = "Manufacturer",
    y = "Claim Rate"
  ) +
  theme_minimal()

# Convert to interactive plot
ggplotly(make_plot, tooltip = c("x", "y"))

Interpretation: Manufacturers show different patterns in claim frequency, even when controlling for sample size (min 1000 policies). Some brands may correlate with higher-risk drivers or vehicles with higher repair costs. This can inform underwriting strategies or partnerships with auto manufacturers.

Summary of insured value and average premium by manufacturer

make_summary <- data %>%
  group_by(MAKE) %>%
  summarise(
    avg_insured_value = mean(INSURED_VALUE, na.rm = TRUE),
    avg_premium = mean(PREMIUM, na.rm = TRUE),
    total_policies = n()
  ) %>%
  filter(total_policies > 50) %>%
  arrange(desc(avg_premium))

print(make_summary)
## # A tibble: 178 × 4
##    MAKE      avg_insured_value avg_premium total_policies
##    <chr>                 <dbl>       <dbl>          <int>
##  1 CRANE              8004138.      69712.             53
##  2 TRAKER             2673792.      38715.             79
##  3 LOWBED              575851.      34609.            290
##  4 VERSATILE          5178649.      30656.            161
##  5 SUNLONG            2071534.      28762.            470
##  6 BUS                1818015.      26322.             60
##  7 IVECO              1944861.      25197.          13192
##  8 GELION             1591989.      22977.            937
##  9 BEBEN              1892542.      22468.            571
## 10 RENAULT            1712719.      22046.           1731
## # ℹ 168 more rows

Analyze Claims by Vehicle Usage (USAGE)

usage_claims <- data %>%
  group_by(USAGE, claim_occurred) %>%
  summarise(count = n(), .groups = "drop") %>%
  pivot_wider(names_from = claim_occurred, values_from = count, values_fill = 0) %>%
  rename(no_claim = `0`, had_claim = `1`) %>%
  mutate(total = no_claim + had_claim,
         claim_rate = round(had_claim / total, 4)) %>%
  filter(total > 1000) %>%
  arrange(desc(claim_rate))

print(usage_claims)
## # A tibble: 13 × 5
##    USAGE                  no_claim had_claim  total claim_rate
##    <chr>                     <int>     <int>  <int>      <dbl>
##  1 Ambulance                  1738       415   2153     0.193 
##  2 Own service               30756      3105  33861     0.0917
##  3 General Cartage           72857      7314  80171     0.0912
##  4 Own Goods                122845     11262 134107     0.084 
##  5 Private                  112435      9785 122220     0.0801
##  6 Fare Paying Passengers    81299      4776  86075     0.0555
##  7 Car Hires                  1500        74   1574     0.047 
##  8 Special Construction       4717       150   4867     0.0308
##  9 Learnes                    1878        52   1930     0.0269
## 10 Taxi                      26166       632  26798     0.0236
## 11 Agricultural Own Farm      5174        89   5263     0.0169
## 12 Agricultural Any Farm      2546        39   2585     0.0151
## 13 Others                     6815        43   6858     0.0063

Average of CLAIM_PAID, INSURED_VALUE and PREMIUM per usage (claims only)

usage_summary <- data %>%
  filter(!is.na(CLAIM_PAID) & CLAIM_PAID > 0) %>%
  group_by(USAGE) %>%
  summarise(
    avg_claim = mean(CLAIM_PAID, na.rm = TRUE),
    avg_insured_value = mean(INSURED_VALUE, na.rm = TRUE),
    avg_premium = mean(PREMIUM, na.rm = TRUE),
    count = n()
  ) %>%
  filter(count > 100) %>%
  arrange(desc(avg_claim))

print(usage_summary)
## # A tibble: 8 × 5
##   USAGE                  avg_claim avg_insured_value avg_premium count
##   <chr>                      <dbl>             <dbl>       <dbl> <int>
## 1 General Cartage          647975.          1207016.      24811.  7314
## 2 Fare Paying Passengers   235423.           584976.      18072.  4776
## 3 Own Goods                159606.          1135283.      11748. 11262
## 4 Ambulance                155104.          1635985.      25411.   415
## 5 Private                  107631.           936792.      11318.  9785
## 6 Special Construction     105591.          1581973.      17567.   150
## 7 Own service              104737.          1389931.      19380.  3105
## 8 Taxi                      26853.            11131.       1614.   632

Bar plot of average claims, insured value and premium by USAGE

# Convert data to long format
usage_summary_long <- usage_summary %>%
  pivot_longer(cols = c(avg_claim, avg_insured_value, avg_premium),
               names_to = "metric",
               values_to = "value")

# Create grouped bar plot
usage_plot <- ggplot(usage_summary_long,
                     aes(x = reorder(USAGE, value), y = value, fill = metric)) +
  geom_col(position = "dodge") +
  coord_flip() +
  scale_fill_manual(
    values = c("avg_claim" = "#FF7F50",
               "avg_insured_value" = "#4682B4",
               "avg_premium" = "#6A5ACD"),
    labels = c("Avg. Claim", "Avg. Insured Value", "Avg. Premium")
  ) +
  labs(
    title = "Average Claim, Insured Value and Premium by Vehicle Usage",
    x = "Vehicle Usage",
    y = "Amount",
    fill = "Metric"
  ) +
  theme_minimal()

# Convertir a gráfico interactivo
ggplotly(usage_plot, tooltip = c("x", "y", "fill"))

Interpretation: Vehicles used for specific purposes (e.g., taxis or commercial delivery) show higher average claim amounts. However, premiums and insured values vary inconsistently, suggesting potential underpricing in some segments. This type of comparative analysis is essential for portfolio profitability and risk adjustment.

Conclusion

This analysis highlights how data-driven exploration of insurance portfolios can reveal critical risk factors that support underwriting and pricing optimization. By understanding the relationships between claim behavior and policy characteristics, insurers can make more informed decisions to improve risk selection, set appropriate premiums, and enhance overall profitability. Future work may include predictive modeling to estimate claim probability and severity at the individual policy level.