Exploratory Data Analysis of Insurance Claim Fraud Patterns

Applied Business Analytics

Author

Raymond Eshiet

Published

May 6, 2026

1 Problem Statement

1.1 Business Context

Insurance fraud is a worldwide issue, and the Irish market is no exception. Ireland has seen recurring problems with fraudulent and exaggerated motor claims, the costs of which are ultimately passed on to honest customers. Detecting fraudulent claims early, at the initial assessment stage, is therefore a high-value business priority for any insurance company operating in the Irish market.

This analysis examines 1,000 motor insurance claims to identify the behavioural and demographic patterns that distinguish fraudulent from legitimate claims.

1.2 Core Analytical Questions

  1. Which characteristics of claims are strongly associated with fraud?
  2. Are there interaction effects between variables that increase the likelihood of fraud beyond what each predictor achieves independently?
  3. What data quality issues exist and how should they be addressed?

2 Solution Statement

2.1 Analytical Approach

The analysis follows a four-phase pipeline:

  • Phase 1: Ingestion and Audit. Load the raw CSV, resolve encoding issues, document data types, quantify missingness, and identify structural inconsistencies.
  • Phase 2: recipes Preprocessing Pipeline. Define a formal, reproducible preprocessing recipe using the recipes package. All cleaning, imputation, normalisation, and encoding steps are specified here, replacing ad-hoc mutate() transformations with an auditable, reusable pipeline.
  • Phase 3: Feature Engineering via recipes. Derive new analytically meaningful variables (passenger count, age band, street type, repeat claimant flag) directly within the recipe using step_mutate() and step_cut().
  • Phase 4: EDA and Visualisation. Produce publication-quality visualisations examining each fraud risk dimension, with cross-variable interaction analysis and actionable business recommendations.

2.2 Tools and Reproducibility

All analysis is conducted in R via this Quarto document. The tidyverse ecosystem is used for data wrangling; ggplot2 for visualisation; and kableExtra for formatted tables. The .qmd file and data.csv (placed in the 00_data/ folder) are the only inputs required to reproduce all outputs.
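The code below also references shared setup (package loads, a colour palette, and a theme_project() helper) defined in the document's setup chunk, which is not shown in the rendered output. A minimal sketch is reproduced here for completeness: the package list reflects functions actually called in this report, while the hex colour values are illustrative assumptions.

Show Code
# Assumed setup chunk: packages, palette constants, and shared theme.
library(tidyverse)    # dplyr, tidyr, stringr, readr, ggplot2
library(recipes)      # formal preprocessing pipeline
library(kableExtra)   # kbl(), kable_styling() formatted tables
library(patchwork)    # multi-panel composition, e.g. (p_hist | p_rate)
library(scales)       # percent(), percent_format(), label_dollar()

# Project colour palette (hex values assumed for illustration only)
LEGIT  <- "#2a9d8f"   # legitimate claims
FRAUD  <- "#e76f51"   # fraudulent claims
ACCENT <- "#f4a261"   # highlights / outliers
LIGHT  <- "#e9c46a"   # mid-gradient tone
DARK   <- "#264653"   # text and outlines

# Shared ggplot2 theme applied to every figure
theme_project <- function() {
  theme_minimal(base_size = 12) +
    theme(
      plot.title      = element_text(face = "bold"),
      plot.subtitle   = element_text(colour = "#555555"),
      legend.position = "bottom"
    )
}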

3 Exploratory Analysis

3.1 Data Ingestion and Audit

Show Code
# The raw file uses Latin-1 encoding.
# Standard UTF-8 fails on byte 0xa3 (the pound sign £), which appears in
# several address and cost fields common in legacy Irish/UK admin systems.
df_raw <- read_csv(
  "00_data/data.csv",
  locale        = locale(encoding = "latin1"),
  show_col_types = FALSE
)

glimpse(df_raw)
Rows: 1,000
Columns: 7
$ driver     <chr> "JOSEPH MCGRATH", "MARY BRENNAN", "JOSEPH COLLINS", "ROBERT…
$ age        <dbl> 24, 53, 48, 40, 27, 38, 34, 34, 33, 21, 24, 23, 55, 24, 74,…
$ address    <chr> "3 CO$RIB VIEW", "21 BLACKWATER VIEW", "7 SLANEY LODGE", "1…
$ passenger1 <chr> "JOSEPH GRIFFIN", NA, NA, NA, NA, "MICHAEL BROWNE", NA, NA,…
$ passenger2 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "LIAM BARRETT",…
$ repaircost <chr> "approx 3k", "approx 2k", "approx 2k", "approx 500", "appro…
$ fraudFlag  <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
Show Code
df_raw |>
  head(6) |>
  kbl(booktabs = TRUE, align = "l") |>
  kable_styling(
    full_width        = TRUE,
    bootstrap_options = c("striped", "hover", "condensed")
  )
Table 1: First six records of the raw dataset
driver age address passenger1 passenger2 repaircost fraudFlag
JOSEPH MCGRATH 24 3 CO$RIB VIEW JOSEPH GRIFFIN NA approx 3k FALSE
MARY BRENNAN 53 21 BLACKWATER VIEW NA NA approx 2k FALSE
JOSEPH COLLINS 48 7 SLANEY LODGE NA NA approx 2k FALSE
ROBERT WALSH 40 12 LIFFEY GROVE NA NA approx 500 FALSE
KEVIN OCONNELL 27 21 BLACKWATER GLADE NA NA approx 500 FALSE
BRIAN CULLEN 38 9 BARROW GLADE MICHAEL BROWNE NA approx 2k FALSE

3.1.1 Schema and Quality Audit

The dataset contains 1,000 records across 7 variables. Table 2 documents the full schema and data quality findings.

Show Code
schema_tbl <- tibble(
  Variable        = names(df_raw),
  Type            = sapply(df_raw, class),
  `Non-Null`      = sapply(df_raw, \(x) sum(!is.na(x))),
  `Null Count`    = sapply(df_raw, \(x) sum(is.na(x))),
  `Null %`        = round(sapply(df_raw, \(x) mean(is.na(x)) * 100), 1),
  `Unique Values` = sapply(df_raw, n_distinct),
  Notes           = c(
    "639 distinct drivers some appearing more than once",
    "Range 20–85; no missing values",
    "16 records contains corrupted special characters",
    "64.8% missing — structural (solo driver claims)",
    "89.0% missing — structural (no second passenger)",
    "Free text, not numeric; 3 malformed entries",
    "Binary outcome — 10% fraud prevalence"
  )
)

schema_tbl |>
  kbl(booktabs = TRUE) |>
  kable_styling(
    full_width        = TRUE,
    bootstrap_options = c("striped", "condensed")
  ) |>
  row_spec(which(schema_tbl$`Null %` > 50), background = "#f4D35E") |>
  column_spec(7, italic = TRUE, color = "#555555")
Table 2: Variable schema and initial data quality assessment
Variable Type Non-Null Null Count Null % Unique Values Notes
driver character 1000 0 0.0 639 639 distinct drivers, some appearing more than once
age numeric 1000 0 0.0 66 Range 20–85; no missing values
address character 1000 0 0.0 759 16 records contain corrupted special characters
passenger1 character 352 648 64.8 296 64.8% missing — structural (solo driver claims)
passenger2 character 110 890 89.0 93 89.0% missing — structural (no second passenger)
repaircost character 1000 0 0.0 8 Free text, not numeric; 3 malformed entries
fraudFlag logical 1000 0 0.0 2 Binary outcome — 10% fraud prevalence

Key audit findings:

  • passenger1 has 64.8% missing values and passenger2 has 89.0%. These are not random data losses but structural missingness, representing claims with no passengers.
  • repaircost is stored as free text (e.g., "approx 1k"). This requires formal parsing to numeric before any quantitative analysis.
  • address contains 16 records with corrupted special characters, indicating upstream encoding failures.
  • 3 repair cost entries are malformed (e.g., "approx 5!!", "approx $*0", "approx 2~") and require pattern-based correction.
  • fraudFlag is a boolean with a 10% fraud prevalence, creating a class imbalance to be noted in any downstream modelling.
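As a quick check behind the repaircost findings above, the 8 distinct free-text values (including the 3 malformed entries) can be tabulated directly; a minimal sketch:

Show Code
# Tabulate the 8 distinct repaircost strings noted in the audit,
# malformed entries ("approx 5!!", "approx $*0", "approx 2~") included.
df_raw |>
  count(repaircost, sort = TRUE)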

3.2 recipes Preprocessing Pipeline

All preprocessing and feature engineering is formalised in a single recipes pipeline. This produces a complete audit trail via tidy() and ensures all transformations are reproducible and portable to new claims data.

3.2.1 Step 1 — Prepare the Raw Input

Two minimal pre-steps are applied before the recipe — string-level address cleaning and free-text cost parsing — because these require character manipulation that sits naturally in mutate() before the recipe takes over all downstream transformations.

Show Code
df_input <- df_raw |>

  # Clean corrupted addresses 
  mutate(
    address = str_replace_all(address, "[^A-Za-z0-9 ]", "") |> str_squish()
  ) |>

  # Parse repaircost text to numeric
  mutate(
    cost_raw = case_when(
      repaircost == "approx 500"  ~ 500,
      repaircost == "approx 1k"   ~ 1000,
      repaircost == "approx 2k"   ~ 2000,
      repaircost == "approx 3k"   ~ 3000,
      repaircost == "above 3k"    ~ 4000,
      repaircost == "approx 2~"   ~ 2000,
      repaircost == "approx 5!!"  ~ 500,
      repaircost == "approx $*0"  ~ 500,
      TRUE                        ~ NA_real_
    ),
    # recipes requires a factor outcome variable
    fraud_flag = factor(
      fraudFlag,
      levels = c("FALSE", "TRUE"),
      labels = c("Legitimate", "Fraud")
    )
  ) |>

  # Keep only the columns the recipe will work with
  select(fraud_flag, age, address, passenger1, passenger2, cost_raw)

glimpse(df_input)
Rows: 1,000
Columns: 6
$ fraud_flag <fct> Legitimate, Legitimate, Legitimate, Legitimate, Legitimate,…
$ age        <dbl> 24, 53, 48, 40, 27, 38, 34, 34, 33, 21, 24, 23, 55, 24, 74,…
$ address    <chr> "3 CORIB VIEW", "21 BLACKWATER VIEW", "7 SLANEY LODGE", "12…
$ passenger1 <chr> "JOSEPH GRIFFIN", NA, NA, NA, NA, "MICHAEL BROWNE", NA, NA,…
$ passenger2 <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "LIAM BARRETT",…
$ cost_raw   <dbl> 3000, 2000, 2000, 500, 500, 2000, 500, 2000, 2000, 500, 500…

3.2.2 Step 2 — Define the Recipe

Show Code
#| label: define-recipe
df_input <- df_input |> mutate(driver = df_raw$driver)

fraud_recipe <- recipe(fraud_flag ~ ., data = df_input) |>

  # F1: Passenger count
  # step_mutate() adds new computed columns inside the recipe.
  # !is.na() returns TRUE (1) if a value is present, FALSE (0) if missing.
  # Summing the two columns produces an ordinal count: 0, 1, or 2.
  # This collapses two sparse 65–89% missing columns into one signal.
  step_mutate(
    num_passengers = as.integer(!is.na(passenger1)) +
                     as.integer(!is.na(passenger2))
  ) |>

  # F2: Street type
  # Extracts the residential suffix word from each address as a geographic proxy.
  # str_extract returns the first match of the alternation pattern.
  # replace_na() labels any address that doesn't match a known suffix as "OTHER".
  step_mutate(
    street_type = str_extract(
      str_to_upper(address),
      "(VIEW|GROVE|LODGE|GLADE|PARK|LANE|DRIVE|COURT|TERRACE|GARDENS)"
    ) |> replace_na("OTHER")
  ) |>

  # F3: Repeat claimant flag
  # Flags drivers whose name appears more than once in the dataset.
 
  step_mutate(
    repeat_claimant = as.integer(
      driver %in% { df_input |> count(driver) |> filter(n > 1) |> pull(driver) }
    )
  ) |>

  # Remove raw columns no longer needed
  # passenger1/2 are now encoded in num_passengers.
  # address is now encoded in street_type.
  # Keeping them would introduce noise and inflated column counts.
  step_rm(passenger1, passenger2, address, driver) |>

  # F4: Age band via step_cut()
  # Discretises the continuous age variable into ordered categorical bands.
  # These break points (30, 40, 50, 60) are actuarially standard.
  # include_outside_range = TRUE ensures ages below 30 and above 60 are captured.
  step_cut(age, breaks = c(30, 40, 50, 60), include_outside_range = TRUE) |>

  # Impute remaining NAs in numeric columns
  # step_impute_median() fills missing values with the column median.
  # Median is used rather than mean because the repair cost distribution is
  # right-skewed: a small number of high-value claims pull the mean upward,
  # making the median a more representative central value.
  step_impute_median(all_numeric_predictors()) |>

  # Normalise numeric predictors
  # step_normalize() applies z-score standardisation: (x - mean) / sd.
  # This centres each variable at 0 with unit variance.
  # Without this, age (range 20–85) and cost_raw (range 500–4000) differ by
  # roughly 50-fold in scale; distance-based and regularised models would
  # treat cost_raw as ~50x more important than age purely due to scale.
  step_normalize(all_numeric_predictors()) |>

  # Dummy-encode the nominal street_type variable
  # Models cannot consume raw character strings.
  # step_dummy() creates k-1 binary indicator columns.
  # one_hot = FALSE drops the reference category to avoid perfect multicollinearity (the dummy variable trap).
  step_string2factor(street_type) |>
  step_dummy(street_type, one_hot = FALSE) |>

  # Remove zero-variance predictors
  # step_zv() drops any column where every value is identical.
  # Such columns carry no information and cause errors in some model types.
  step_zv(all_predictors())

3.2.3 Step 3 — Prep the Recipe

Show Code
# prep() fits the recipe to the data: it calculates and stores:
#   - medians for step_impute_median()
#   - means and standard deviations for step_normalize()
#   - factor levels for step_dummy() and step_cut()
# Nothing is transformed yet; prep() only *learns* the parameters.

fraud_recipe_prepped <- prep(fraud_recipe, training = df_input, verbose = FALSE)

fraud_recipe_prepped

3.2.4 Step 4 — Bake the Recipe

Show Code
# bake() applies all prepped transformations and returns a clean data frame.
# new_data = NULL means "apply to the same data used in prep()".
# In production, pass new incoming claims here for identical preprocessing.

df_processed <- bake(fraud_recipe_prepped, new_data = NULL)

cat("Dimensions after preprocessing:",
    nrow(df_processed), "rows,",
    ncol(df_processed), "columns\n\n")
Dimensions after preprocessing: 1000 rows, 15 columns
Show Code
cat("Columns in preprocessed dataset:\n")
Columns in preprocessed dataset:
Show Code
print(names(df_processed))
 [1] "age"                 "cost_raw"            "fraud_flag"         
 [4] "num_passengers"      "repeat_claimant"     "street_type_DRIVE"  
 [7] "street_type_GARDENS" "street_type_GLADE"   "street_type_GROVE"  
[10] "street_type_LANE"    "street_type_LODGE"   "street_type_OTHER"  
[13] "street_type_PARK"    "street_type_TERRACE" "street_type_VIEW"   
Show Code
df_processed |>
  head(6) |>
  mutate(across(where(is.numeric), \(x) round(x, 3))) |>
  kbl(booktabs = TRUE) |>
  kable_styling(
    full_width        = TRUE,
    bootstrap_options = c("striped", "condensed"),
    font_size         = 10
  )
Table 3: First six rows of the preprocessed dataset (numerics rounded to 3 d.p.)
age cost_raw fraud_flag num_passengers repeat_claimant street_type_DRIVE street_type_GARDENS street_type_GLADE street_type_GROVE street_type_LANE street_type_LODGE street_type_OTHER street_type_PARK street_type_TERRACE street_type_VIEW
[min,30] 1.828 Legitimate 0.786 -1.288 0 0 0 0 0 0 0 0 0 1
(50,60] 0.727 Legitimate -0.675 0.776 0 0 0 0 0 0 0 0 0 1
(40,50] 0.727 Legitimate -0.675 0.776 0 0 0 0 1 0 0 0 0 0
(30,40] -0.925 Legitimate -0.675 0.776 0 0 0 1 0 0 0 0 0 0
[min,30] -0.925 Legitimate -0.675 -1.288 0 0 1 0 0 0 0 0 0 0
(30,40] 0.727 Legitimate 0.786 -1.288 0 0 1 0 0 0 0 0 0 0
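As noted in the bake() comments above, the same prepped recipe is the production scoring path. A hypothetical sketch, where new_claims is an assumed placeholder data frame carrying the same raw columns as df_input (including driver, with fraud_flag set to NA where unknown at scoring time):

Show Code
# Hypothetical scoring-time usage; new_claims is an assumed placeholder.
# Because prep() already stored the medians, means/sds, and factor levels,
# incoming claims receive identical preprocessing with nothing re-specified.
new_scored <- bake(fraud_recipe_prepped, new_data = new_claims)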

3.2.5 Step 5 — Recipe Audit Trail

Show Code
# tidy() extracts a human-readable record of every step in the prepped recipe.
# This is your formal audit trail: what was done, to which columns, trained or not.

tidy(fraud_recipe_prepped) |>
  kbl(
    booktabs  = TRUE,
    col.names = c("Step No.", "Operation", "Type", "Trained", "Skip", "ID")
  ) |>
  kable_styling(
    full_width        = TRUE,
    bootstrap_options = c("striped", "hover", "condensed")
  )
Table 4: recipes audit trail — all preprocessing steps applied
Step No. Operation Type Trained Skip ID
1 step mutate TRUE FALSE mutate_nYwzv
2 step mutate TRUE FALSE mutate_HmTk5
3 step mutate TRUE FALSE mutate_G75AK
4 step rm TRUE FALSE rm_JI5jK
5 step cut TRUE FALSE cut_dlJsU
6 step impute_median TRUE FALSE impute_median_BL2E3
7 step normalize TRUE FALSE normalize_ckCjO
8 step string2factor TRUE FALSE string2factor_BWS4W
9 step dummy TRUE FALSE dummy_5tSHI
10 step zv TRUE FALSE zv_BxVhQ

3.2.6 Step 6 — Visualise the Effect of step_normalize()

Show Code
before_norm <- df_input |>
  select(age, cost_raw) |>
  mutate(age = as.numeric(age)) |>            # ensure numeric
  pivot_longer(everything(), names_to = "variable", values_to = "value") |>
  mutate(stage = "Before step_normalize()")

after_norm <- df_processed |>
  select(cost_raw) |>                         # only cost_raw from df_processed
  bind_cols(df_input |> select(age) |> mutate(age = as.numeric(scale(age)))) |>  # manually scaled age
  pivot_longer(everything(), names_to = "variable", values_to = "value") |>
  mutate(stage = "After step_normalize()")

bind_rows(before_norm, after_norm) |>
  mutate(
    stage    = factor(stage, levels = c("Before step_normalize()",
                                        "After step_normalize()")),
    variable = if_else(variable == "age", "Driver Age", "Repair Cost")
  ) |>
  ggplot(aes(x = value, fill = variable)) +
  geom_histogram(bins = 28, colour = "white", alpha = 0.85) +
  facet_wrap(stage ~ variable, scales = "free", ncol = 4) +
  scale_fill_manual(
    values = c("Driver Age" = LEGIT, "Repair Cost" = ACCENT),
    guide  = "none"
  ) +
  labs(
    title    = "Effect of step_normalize() on Numeric Predictors",
    subtitle = "After normalisation both variables centre at 0 with unit variance, same scale for modelling",
    x        = "Value",
    y        = "Count",
    caption  = "Insurance Claims Dataset — recipes preprocessing pipeline"
  ) +
  theme_project()
Figure 1: Before and after step_normalize(). Age and repair cost differ by ~50-fold in raw scale. After normalisation both are centred at zero with unit variance — directly comparable for any downstream model.

3.3 Exploratory Visualisations

The processed dataset is used for all visualisations. For charting purposes the normalised numeric columns are back-transformed to approximate original scale using the stored prep() parameters, and the dummy-encoded street_type columns are reversed to recover the original category labels.

Show Code
# Back-transform normalised numerics for readable axis labels
# tidy() on a normalised step returns the mean and sd used.
# Multiplying by sd and adding mean reverses the z-score.
norm_params <- tidy(fraud_recipe_prepped, number = 7)   # step_normalize is step 7

get_param <- function(col, stat) {
  norm_params |> filter(terms == col, statistic == stat) |> pull(value)
}

age_mean  <- get_param("age",      "mean");  age_sd  <- get_param("age",      "sd")
cost_mean <- get_param("cost_raw", "mean");  cost_sd <- get_param("cost_raw", "sd")

# Recover street_type from dummy columns
street_cols <- names(df_processed)[str_detect(names(df_processed), "^street_type_")]

recover_street <- function(row) {
  hit <- which(row == 1)
  if (length(hit) == 0) "LODGE"   # reference category dropped by step_dummy
  else str_remove(names(row)[hit[1]], "street_type_")
}

# Build the analysis-ready visualisation data frame
df_viz <- df_processed |>
  mutate(
    fraud_flag  = fraud_flag == "Fraud",
    age_raw     = df_input$age,              # pull raw age directly from df_input
    cost_real   = cost_raw * cost_sd + cost_mean,
    street_type = apply(
      select(df_processed, all_of(street_cols)), 1, recover_street
    ),
    age_label = case_when(
      age_raw <= 30 ~ "20-30",
      age_raw <= 40 ~ "31-40",
      age_raw <= 50 ~ "41-50",
      age_raw <= 60 ~ "51-60",
      TRUE          ~ "61+"
    ),
    age_label = factor(age_label,
                       levels = c("20-30","31-40","41-50","51-60","61+"),
                       ordered = TRUE),
    cost_band = case_when(
      cost_real <= 750  ~ "£500",
      cost_real <= 1500 ~ "£1,000",
      cost_real <= 2500 ~ "£2,000",
      cost_real <= 3500 ~ "£3,000",
      TRUE              ~ ">£3,000"
    ),
    cost_band = factor(cost_band,
                       levels = c("£500","£1,000","£2,000","£3,000",">£3,000"),
                       ordered = TRUE)
  )

3.3.1 Class Distribution

Show Code
df_viz |>
  count(fraud_flag) |>
  mutate(
    label = if_else(fraud_flag, "Fraudulent", "Legitimate"),
    pct   = n / sum(n)
  ) |>
  ggplot(aes(x = label, y = n, fill = label)) +
  geom_col(width = 0.5, colour = "white") +
  geom_text(
    aes(label = paste0(n, "\n(", percent(pct, accuracy = 0.1), ")")),
    vjust = -0.4, fontface = "bold", size = 4.5
  ) +
  scale_fill_manual(
    values = c("Fraudulent" = FRAUD, "Legitimate" = LEGIT),
    guide  = "none"
  ) +
  scale_y_continuous(expand = expansion(mult = c(0, 0.15))) +
  labs(
    title    = "Claim Distribution: Legitimate vs Fraudulent",
    subtitle = "10% fraud prevalence — class imbalance must be addressed before predictive modelling",
    x        = NULL,
    y        = "Number of Claims",
    caption  = "Source: Insurance Claims Dataset"
  ) +
  theme_project()
Figure 2: Claim distribution — 900 legitimate vs 100 fraudulent. The 10:1 class imbalance is typical of real-world insurance fraud datasets.

The dataset contains 900 legitimate and 100 fraudulent claims. A naive classifier that always predicts “legitimate” would achieve 90% accuracy while detecting zero fraud. Techniques such as SMOTE oversampling or cost-sensitive learning must be applied in any downstream model.
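As one hedged illustration of handling that imbalance within the same recipes workflow, assuming the themis package is available: step_smote() itself requires all-numeric predictors (the step_cut() age bands here are categorical), so simple upsampling is sketched instead. The step is skipped at bake/scoring time by default, so only model training is affected.

Show Code
library(themis)   # recipes-compatible class-imbalance steps (assumed available)

# Replicate minority-class (Fraud) rows until they match the majority class.
fraud_recipe_balanced <- fraud_recipe |>
  step_upsample(fraud_flag, over_ratio = 1)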


3.3.2 Passenger Count as a Fraud Signal

Show Code
fraud_by_pass <- df_viz |>
  group_by(num_passengers) |>
  summarise(fraud_rate = mean(fraud_flag) * 100, total = n(), .groups = "drop")

df_viz |>
  mutate(pass_label = case_when(
    num_passengers == 0 ~ "0 Passengers\n(Solo)",
    num_passengers == 1 ~ "1 Passenger",
    TRUE                ~ "2 Passengers"
  )) |>
  ggplot(aes(
    x    = factor(pass_label,
                  levels = c("0 Passengers\n(Solo)", "1 Passenger", "2 Passengers")),
    fill = fraud_flag
  )) +
  geom_bar(position = "fill", width = 0.55, colour = "white") +
  geom_text(
    data = fraud_by_pass |>
      mutate(pass_label = c("0 Passengers\n(Solo)", "1 Passenger", "2 Passengers")),
    aes(x = pass_label, y = 1.05,
        label = paste0(round(fraud_rate, 1), "% fraud")),
    inherit.aes = FALSE, fontface = "bold", size = 4.2, colour = FRAUD
  ) +
  scale_fill_manual(
    values = c("FALSE" = LEGIT, "TRUE" = FRAUD),
    labels = c("Legitimate", "Fraudulent"),
    name   = "Claim Outcome"
  ) +
  scale_y_continuous(
    labels = percent_format(),
    expand = expansion(mult = c(0, 0.12))
  ) +
  labs(
    title    = "Fraud Incidence by Number of Passengers",
    subtitle = "num_passengers engineered via step_mutate(), ghost passenger fraud is the strongest signal",
    x        = "Passengers in Vehicle",
    y        = "Proportion of Claims (%)",
    caption  = "Source: Insurance Claims Dataset (n = 1,000)"
  ) +
  theme_project()
Figure 3: Fraud rate by passenger count (the strongest individual predictor). The num_passengers feature was engineered via step_mutate() in the recipe. Claims with two passengers carry a 60.9% fraud rate, 55 times higher than solo driver claims.

The passenger count variable derived inside the recipe via step_mutate() is the single strongest fraud predictor. Solo driver claims have just a 1.1% fraud rate. Claims with one passenger rise to 10.7%, and claims with two passengers reach 60.9%, a 55-fold increase over solo claims. This is the “ghost passenger” pattern, where fictitious passengers are added to inflate injury claims.


3.3.3 Age Distribution and Fraud Risk

Show Code
p_hist <- ggplot(df_viz, aes(x = age_raw, fill = fraud_flag)) +
  geom_histogram(aes(y = after_stat(density)),
                 bins = 25, alpha = 0.75, colour = "white", position = "identity") +
  geom_vline(aes(xintercept = mean(age_raw[!fraud_flag])),
             colour = LEGIT, linetype = "dashed", linewidth = 1.2) +
  geom_vline(aes(xintercept = mean(age_raw[fraud_flag])),
             colour = FRAUD, linetype = "dashed", linewidth = 1.2) +
  annotate("text", x = 44, y = 0.033,
           label = "Legit mean: 42.8", colour = LEGIT, size = 3.6, fontface = "italic") +
  annotate("text", x = 35, y = 0.038,
           label = "Fraud mean: 33.8", colour = FRAUD, size = 3.6, fontface = "italic") +
  scale_fill_manual(
    values = c("FALSE" = LEGIT, "TRUE" = FRAUD),
    labels = c("Legitimate", "Fraudulent"),
    name   = "Outcome"
  ) +
  labs(title = "Age Distribution by Claim Outcome",
       x = "Driver Age (Years)", y = "Density") +
  theme_project()

p_rate <- df_viz |>
  filter(!is.na(age_label)) |>
  group_by(age_label) |>
  summarise(fraud_rate = mean(fraud_flag) * 100, n = n(), .groups = "drop") |>
  ggplot(aes(x = age_label, y = fraud_rate, fill = fraud_rate > 10)) +
  geom_col(width = 0.6, colour = "white") +
  geom_text(aes(label = paste0(round(fraud_rate, 1), "%\n(n=", n, ")")),
            vjust = -0.3, size = 3.8, fontface = "bold", colour = DARK) +
  scale_fill_manual(values = c("FALSE" = LEGIT, "TRUE" = FRAUD), guide = "none") +
  scale_y_continuous(expand = expansion(mult = c(0, 0.22))) +
  labs(title    = "Fraud Rate by Age Group",
       subtitle = "Bands created by step_cut()",
       x = "Age Group", y = "Fraud Rate (%)") +
  theme_project()

(p_hist | p_rate) +
  plot_annotation(
    title    = "Age as a Risk Factor in Insurance Fraud",
    subtitle = "Fraud concentrated in 20–40 cohort; zero fraud in drivers aged 61+",
    caption  = "Source: Insurance Claims Dataset",
    theme    = theme_project()
  )
Figure 4: Age as a fraud risk factor. Fraudulent claimants average 33.8 years vs 42.8 for legitimate. The age bands shown were created by step_cut() inside the recipe.

The mean age of fraudulent claimants (33.8 years) is almost a decade lower than legitimate claimants (42.8 years). The 31–40 age group has the highest fraud rate at 17.3%, while no fraud was recorded in the 61+ cohort. This pattern is consistent with actuarial evidence from the Central Bank of Ireland that younger male drivers have higher fraud risk.


3.3.4 Repair Cost Patterns

Show Code
p_stacked <- df_viz |>
  filter(!is.na(cost_band)) |>
  count(cost_band, fraud_flag) |>
  group_by(cost_band) |>
  mutate(pct = n / sum(n) * 100) |>
  ggplot(aes(x = cost_band, y = pct, fill = fraud_flag)) +
  geom_col(colour = "white", width = 0.6) +
  scale_fill_manual(
    values = c("FALSE" = LEGIT, "TRUE" = FRAUD),
    labels = c("Legitimate", "Fraudulent"),
    name   = "Outcome"
  ) +
  scale_y_continuous(labels = percent_format(scale = 1)) +
  labs(title    = "Claim Outcome by Repair Cost Band",
       subtitle = "Fraud concentrates at low cost — deliberate under-claiming to avoid scrutiny",
       x = "Estimated Repair Cost", y = "Proportion (%)") +
  theme_project()

p_box <- df_viz |>
  mutate(outcome = if_else(fraud_flag, "Fraudulent", "Legitimate")) |>
  ggplot(aes(x = outcome, y = cost_real, fill = outcome)) +
  geom_boxplot(alpha = 0.8, colour = DARK,
               outlier.colour = ACCENT, outlier.size = 2.5) +
  stat_summary(fun = mean, geom = "point",
               shape = 18, size = 4, colour = "white") +
  scale_fill_manual(
    values = c("Fraudulent" = FRAUD, "Legitimate" = LEGIT),
    guide  = "none"
  ) +
  scale_y_continuous(labels = scales::label_dollar(prefix = "£")) +
  labs(title    = "Repair Cost Distribution",
       subtitle = "Diamond = group mean",
       x = NULL, y = "Repair Cost (£)") +
  theme_project()

(p_stacked | p_box) +
  plot_annotation(
    title    = "Repair Cost Patterns in Fraudulent vs Legitimate Claims",
    subtitle = "Fraudulent claims average £885 — soft fraud means low costs correlate with higher fraud risk",
    caption  = "Source: Insurance Claims Dataset",
    theme    = theme_project()
  )
Figure 5: Repair cost patterns. Fraud clusters at low cost bands — the soft fraud phenomenon. Mean fraudulent claim (£885) is well below the legitimate mean (£1,391).

Fraudulent claims average £885, compared with £1,391 for legitimate claims. The £500 band accounts for 55% of all fraud despite representing only 27% of total claims. This is the soft fraud phenomenon, where claimants deliberately keep amounts small to avoid automatic reviews. Any fraud screen that only flags high-value claims will miss the majority of fraud in this dataset.
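A minimal sketch of how the quoted shares can be recomputed from df_viz (the £500 band's conditional share of fraud versus its share of all claims):

Show Code
# P(£500 band | fraud) vs P(£500 band) — the 55% vs 27% quoted above.
df_viz |>
  summarise(
    share_of_fraud  = mean(cost_band == "£500" & fraud_flag) / mean(fraud_flag),
    share_of_claims = mean(cost_band == "£500")
  )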


3.3.5 Interaction Effect: Age × Passenger Count

Show Code
df_viz |>
  filter(!is.na(age_label)) |>
  group_by(age_label, num_passengers) |>
  summarise(
    fraud_rate = mean(fraud_flag) * 100,
    n          = n(),
    .groups    = "drop"
  ) |>
  mutate(tile_label = paste0(round(fraud_rate, 1), "%\n(n=", n, ")")) |>
  ggplot(aes(x = factor(num_passengers),
             y = age_label,
             fill = fraud_rate)) +
  geom_tile(colour = "white", linewidth = 1) +
  geom_text(aes(label = tile_label),
            size = 3.8, fontface = "bold", colour = "white") +
  scale_fill_gradientn(
    colours = c(LEGIT, LIGHT, ACCENT, FRAUD),
    name    = "Fraud Rate (%)",
    limits  = c(0, 80)
  ) +
  scale_x_discrete(
    labels = c("0 Passengers", "1 Passenger", "2 Passengers")
  ) +
  labs(
    title    = "Fraud Rate Heatmap: Age Group × Passenger Count",
    subtitle = "Both features engineered in recipe — interaction reveals multiplicative risk (up to 76.5%)",
    x        = "Passengers (from step_mutate())",
    y        = "Age Group (from step_cut())",
    caption  = "Source: Insurance Claims Dataset"
  ) +
  theme_project() +
  theme(panel.grid = element_blank())
Figure 6: Risk heatmap combining step_cut() age bands and step_mutate() passenger count. The 20-30/2-passenger cell reaches 76.5%, confirming a multiplicative interaction effect.

The heatmap reveals risk is multiplicative, not additive. The 20–30 age group has a 12.9% base fraud rate; two-passenger claims have a 60.9% rate. If additive, the combined expectation would be roughly 74%. The observed rate is 76.5%, confirming a genuine interaction effect. This finding only emerges from cross-dimensional analysis and is invisible when variables are examined in isolation.
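The additive baseline in the paragraph above is the simple sum of the two marginal rates; a small sketch of the arithmetic:

Show Code
# Marginal fraud rates quoted above (%), and the naive additive baseline.
base_age_20_30   <- 12.9                   # 20-30 cohort
base_2_passenger <- 60.9                   # two-passenger claims
additive <- base_age_20_30 + base_2_passenger   # 73.8, "roughly 74%"
observed <- 76.5                           # 20-30 x 2-passenger cell
observed - additive                        # +2.7 points beyond additive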


3.3.6 Address Geography as a Risk Proxy

Show Code
df_viz |>
  group_by(street_type) |>
  summarise(fraud_rate = mean(fraud_flag) * 100, n = n(), .groups = "drop") |>
  filter(n >= 30) |>
  mutate(
    street_type = fct_reorder(street_type, fraud_rate),
    risk_tier   = case_when(
      fraud_rate > 15 ~ "High Risk (>15%)",
      fraud_rate >  5 ~ "Medium Risk (5-15%)",
      TRUE            ~ "Low Risk (<5%)"
    ),
    risk_tier = factor(risk_tier,
                       levels = c("High Risk (>15%)",
                                  "Medium Risk (5-15%)",
                                  "Low Risk (<5%)"))
  ) |>
  ggplot(aes(x = fraud_rate, y = street_type, fill = risk_tier)) +
  geom_col(colour = "white", height = 0.6) +
  geom_text(aes(label = paste0(round(fraud_rate, 1), "% (n=", n, ")")),
            hjust = -0.1, size = 3.8, colour = DARK) +
  scale_fill_manual(
    values = c("High Risk (>15%)"    = FRAUD,
               "Medium Risk (5-15%)" = ACCENT,
               "Low Risk (<5%)"      = LEGIT),
    name = "Risk Tier"
  ) +
  scale_x_continuous(
    expand = expansion(mult = c(0, 0.28)),
    labels = percent_format(scale = 1)
  ) +
  labs(
    title    = "Fraud Rate by Address Street Type",
    subtitle = "street_type engineered via step_mutate() in the recipe — TERRACE and OTHER carry highest risk",
    x        = "Fraud Rate (%)",
    y        = "Address Suffix",
    caption  = "Source: Insurance Claims Dataset"
  ) +
  theme_project()
Figure 7: Fraud rate by address street type, engineered via step_mutate() in the recipe. TERRACE and unclassified (OTHER) addresses carry fraud rates above 50%.

The street_type variable, extracted inside the recipe via step_mutate(), acts as a meaningful geographic risk proxy. TERRACE and unclassified addresses show fraud rates exceeding 50%, possibly reflecting geographic clustering of organised fraud activity or the use of fictitious addresses. The low-risk cluster (PARK, GLADE, VIEW, GROVE, LODGE) likely corresponds to newer suburban developments.


4 Discussion and Conclusions

4.1 Key Findings

Finding 1: Passenger count is the strongest individual predictor. The num_passengers feature escalates from a 1.1% fraud rate (solo) to 60.9% (two passengers), a 55-fold increase. This reveals a ghost passenger pattern in fraudulent claims.

Finding 2: Young drivers (20–40) are disproportionately implicated. The age bands reveal the 31–40 cohort has the highest group-level fraud rate (17.3%). Combined with two passengers, risk reaches 76.5% in the 20–30 segment. Zero fraud was recorded in the 61+ group.

Finding 3: Fraud clusters at low repair costs (soft fraud). Counter-intuitively, fraud concentrates at £500–£1,000. Claimants deliberately keep amounts small to evade scrutiny thresholds. Any fraud screen that only flags high-value claims will miss the majority of fraud in this dataset.

Finding 4: Address street type provides a geographic risk proxy. The street_type variable reveals TERRACE and unclassified addresses carry fraud rates above 50%, warranting further spatial clustering analysis.

4.2 Business Insights

Table 5: Risk framework for claims processing derived from EDA findings
Risk Tier Criteria Recommended Action
CRITICAL Driver aged <40 AND 2 passengers Mandatory manual review, escalated to a senior claims handler
HIGH 1+ passengers AND repair cost ≤ £1,000 Automated fraud flag; supervisor sign-off required before settlement
MEDIUM Repeat claimant OR TERRACE/unclassified address Request enhanced KYC checks and additional supporting documentation
STANDARD Solo driver, age 41+, standard residential address Standard automated processing pipeline
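A hedged sketch translating Table 5 into a scoring rule, assuming a claims table carrying the engineered features at their original (un-normalised) scale. Column names follow this report (age_raw, num_passengers, cost_real, repeat_claimant, street_type), but the function itself is illustrative, not part of the analysis:

Show Code
# Illustrative triage rule derived from Table 5; tiers are evaluated in
# priority order, so the first matching condition wins.
assign_risk_tier <- function(claims) {
  claims |>
    mutate(risk_tier = case_when(
      age_raw < 40 & num_passengers == 2          ~ "CRITICAL",
      num_passengers >= 1 & cost_real <= 1000     ~ "HIGH",
      repeat_claimant == 1 |
        street_type %in% c("TERRACE", "OTHER")    ~ "MEDIUM",
      TRUE                                        ~ "STANDARD"
    ))
}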

4.3 Reproducibility

All outputs are fully reproducible. The recipes pipeline in Section 3.2 encapsulates every transformation step, ensuring identical preprocessing can be applied to new incoming claims at scoring time without re-specifying any transformation manually.