As an actuarial analyst, the report establishes a defensible, data-driven framework to predict claim frequency for a motor portfolio, enabling pricing, underwriting, and portfolio steering decisions to be made with evidence rather than intuition. With assumed Ni∼Poisson(Viλ(Xi)) with exposure Vi as an offset and covariates x={weight, distance, age, carage, gender}. The work delivers validated rating relativities for pricing, risk segmentation that underwrites can act on, and an operational High-risk flag with a documented threshold. Together these support tariff setting, portfolio monitoring, fairness checks, and communication with non-technical stakeholders.
#1 Load the library, set seed
library(cluster)
library(readr)
library(rpart)
library(rpart.plot)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ purrr 1.1.0
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 4.0.0 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidyr)
library(GGally)
library(scales)
##
## Attaching package: 'scales'
##
## The following object is masked from 'package:purrr':
##
## discard
##
## The following object is masked from 'package:readr':
##
## col_factor
library(ggplot2)
library(dplyr)
library(magrittr)
##
## Attaching package: 'magrittr'
##
## The following object is masked from 'package:purrr':
##
## set_names
##
## The following object is masked from 'package:tidyr':
##
## extract
library(MASS)
##
## Attaching package: 'MASS'
##
## The following object is masked from 'package:dplyr':
##
## select
library(mboost)
## Loading required package: parallel
## Loading required package: stabs
##
## Attaching package: 'mboost'
##
## The following object is masked from 'package:magrittr':
##
## extract
##
## The following object is masked from 'package:tidyr':
##
## extract
##
## The following object is masked from 'package:ggplot2':
##
## %+%
library(partykit)
## Loading required package: grid
## Loading required package: libcoin
## Loading required package: mvtnorm
##
## Attaching package: 'partykit'
##
## The following object is masked from 'package:mboost':
##
## varimp
library(rpart)
library(gbm)
## Loaded gbm 2.2.2
## This version of gbm is no longer under development. Consider transitioning to gbm3, https://github.com/gbm-developers/gbm3
set.seed(10)
#2 Load the data set
CaseStudyData <- read_csv("CaseStudyData_ACST8040_2025.csv")
## Rows: 200000 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): gender
## dbl (6): Counts, distance, weight, age, carage, exposure
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(CaseStudyData)
## Rows: 200,000
## Columns: 7
## $ Counts <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1…
## $ gender <chr> "male", "male", "female", "male", "male", "male", "female", "…
## $ distance <dbl> 17, 27, 24, 19, 31, 48, 18, 33, 22, 51, 43, 71, 52, 7, 71, 62…
## $ weight <dbl> 2076, 1640, 1279, 2655, 964, 1458, 1231, 2085, 3935, 1267, 14…
## $ age <dbl> 85, 44, 38, 81, 72, 44, 73, 44, 72, 69, 67, 32, 38, 80, 43, 3…
## $ carage <dbl> 13, 10, 28, 35, 12, 16, 9, 16, 31, 9, 28, 7, 26, 10, 12, 7, 1…
## $ exposure <dbl> 0.9249603, 0.9642885, 0.9612256, 0.9163487, 0.8896720, 0.9424…
CaseStudyData <- CaseStudyData %>%
mutate(
gender = factor(gender), # categorical feature
across(c(distance, weight, age, carage, exposure, Counts), as.numeric)
)
CaseStudyData <- CaseStudyData %>%
mutate(
gender = factor(gender), # categorical feature
across(c(distance, weight, age, carage, exposure, Counts), as.numeric)
)
CaseStudyData <- CaseStudyData %>% mutate(freq = Counts / pmax(exposure, 1e-9))
#3 Summary data statistics
# Get numeric column names
num_cols <- names(dplyr::select(CaseStudyData, tidyselect::where(is.numeric)))
# Summary stats for numeric variables
num_summary <- CaseStudyData %>%
dplyr::summarise(
dplyr::across(
dplyr::all_of(num_cols),
list(
n = ~sum(!is.na(.x)),
mean = ~mean(.x, na.rm = TRUE),
sd = ~sd(.x, na.rm = TRUE),
p25 = ~quantile(.x, 0.25, na.rm = TRUE),
p50 = ~quantile(.x, 0.50, na.rm = TRUE),
p75 = ~quantile(.x, 0.75, na.rm = TRUE)
),
.names = "{.col}_{.fn}"
)
) %>%
tidyr::pivot_longer(
everything(),
names_to = c("variable", ".value"),
names_sep = "_"
)
print(num_summary)
## # A tibble: 7 × 7
## variable n mean sd p25 p50 p75
## <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Counts 200000 0.159 0.408 0 0 0
## 2 distance 200000 36.1 21.0 19 33 51
## 3 weight 200000 1915. 758. 1316 1809 2451
## 4 age 200000 56.6 20.0 38 59 76
## 5 carage 200000 21.5 11.1 11 20 32
## 6 exposure 200000 0.900 0.0577 0.850 0.900 0.950
## 7 freq 200000 0.176 0.454 0 0 0
#4 Plot histograms for numeric variables
if (!exists("CaseStudyData", inherits = FALSE) || is.function(get("CaseStudyData"))) {
path <- "/mnt/data/CaseStudyData_ACST8040_2025 (1).csv" # change if running locally
CaseStudyData <- readr::read_csv(path, show_col_types = FALSE)
}
CaseStudyData <- CaseStudyData |>
dplyr::mutate(
gender = factor(gender),
dplyr::across(c(distance, weight, age, carage, exposure, Counts),
~ readr::parse_number(as.character(.)))
)
num_cols <- names(CaseStudyData)[vapply(CaseStudyData, is.numeric, logical(1))]
long_num <- CaseStudyData |>
dplyr::select(dplyr::all_of(num_cols)) |>
tidyr::pivot_longer(dplyr::everything(), names_to = "variable", values_to = "value") |>
dplyr::filter(is.finite(value))
stats_mm <- long_num |>
dplyr::group_by(variable) |>
dplyr::summarise(mean = mean(value), median = median(value), .groups = "drop")
fill_hist <- "#CBD5E1"; col_mean <- "#2563EB"; col_med <- "#DC2626"
p_num <- ggplot(long_num, aes(x = value)) +
geom_histogram(aes(y = after_stat(density)),
bins = 40, fill = fill_hist, color = "white", linewidth = 0.25) +
geom_density(linewidth = 0.7) +
geom_vline(data = stats_mm, aes(xintercept = mean), color = col_mean, linewidth = 0.5) +
geom_vline(data = stats_mm, aes(xintercept = median), color = col_med, linewidth = 0.5, linetype = 2) +
facet_wrap(~ variable, scales = "free", ncol = 3) +
scale_x_continuous(labels = scales::label_number(big.mark = ",")) +
labs(title = "Histograms of Numeric Variables",
subtitle = "Density overlay + mean (blue) & median (red)",
x = NULL, y = "Density") +
theme_minimal(base_size = 13) +
theme(plot.title = element_text(face = "bold"),
strip.text = element_text(face = "bold"),
panel.grid.minor = element_blank())
print(p_num)
#5 Plot histograms for non-numeric variables
cat_cols <- names(CaseStudyData)[vapply(CaseStudyData, \(x) is.factor(x) || is.character(x), logical(1))]
if (length(cat_cols) > 0) {
long_cat <- CaseStudyData |>
dplyr::select(dplyr::all_of(cat_cols)) |>
dplyr::mutate(dplyr::across(dplyr::everything(), as.factor)) |>
tidyr::pivot_longer(dplyr::everything(), names_to = "variable", values_to = "level")
p_cat <- ggplot(long_cat, aes(x = level)) +
geom_bar(fill = fill_hist, color = "white", linewidth = 0.25) +
facet_wrap(~ variable, scales = "free_x", ncol = 3) +
labs(title = "Categorical Variables", x = NULL, y = "Count") +
theme_minimal(base_size = 13) +
theme(plot.title = element_text(face = "bold"),
strip.text = element_text(face = "bold"),
axis.text.x = element_text(angle = 30, hjust = 1))
print(p_cat)
}
#6 Build numeric matrix to create pairwise correlations
num_idx <- vapply(CaseStudyData, is.numeric, logical(1))
X_num <- CaseStudyData[, num_idx, drop = FALSE]
cat_idx <- vapply(CaseStudyData, function(x) is.factor(x) || is.character(x), logical(1))
if (any(cat_idx)) {
X_cat <- stats::model.matrix(~ . - 1, data = CaseStudyData[, cat_idx, drop = FALSE]) |> as.data.frame()
X_all <- cbind(X_num, X_cat)
} else {
X_all <- X_num
}
# Correlation matrices
cor_pearson <- stats::cor(X_all, use = "pairwise.complete.obs", method = "pearson")
cor_spearman <- stats::cor(X_all, use = "pairwise.complete.obs", method = "spearman")
print(round(cor_pearson, 3))
## Counts distance weight age carage exposure freq genderfemale
## Counts 1.000 0.044 0.051 0.114 0.136 0.020 0.998 -0.003
## distance 0.044 1.000 0.000 0.000 0.000 -0.001 0.043 -0.002
## weight 0.051 0.000 1.000 0.002 0.000 -0.002 0.052 -0.002
## age 0.114 0.000 0.002 1.000 0.000 -0.001 0.113 0.000
## carage 0.136 0.000 0.000 0.000 1.000 -0.005 0.136 0.002
## exposure 0.020 -0.001 -0.002 -0.001 -0.005 1.000 -0.005 0.003
## freq 0.998 0.043 0.052 0.113 0.136 -0.005 1.000 -0.003
## genderfemale -0.003 -0.002 -0.002 0.000 0.002 0.003 -0.003 1.000
## gendermale 0.003 0.002 0.002 0.000 -0.002 -0.003 0.003 -1.000
## gendermale
## Counts 0.003
## distance 0.002
## weight 0.002
## age 0.000
## carage -0.002
## exposure -0.003
## freq 0.003
## genderfemale -1.000
## gendermale 1.000
print(round(cor_spearman, 3))
## Counts distance weight age carage exposure freq genderfemale
## Counts 1.000 0.041 0.047 0.105 0.129 0.019 0.997 -0.003
## distance 0.041 1.000 0.000 0.000 0.001 -0.001 0.041 -0.003
## weight 0.047 0.000 1.000 0.003 0.000 -0.003 0.047 -0.002
## age 0.105 0.000 0.003 1.000 0.000 -0.001 0.104 0.000
## carage 0.129 0.001 0.000 0.000 1.000 -0.005 0.129 0.002
## exposure 0.019 -0.001 -0.003 -0.001 -0.005 1.000 -0.008 0.003
## freq 0.997 0.041 0.047 0.104 0.129 -0.008 1.000 -0.004
## genderfemale -0.003 -0.003 -0.002 0.000 0.002 0.003 -0.004 1.000
## gendermale 0.003 0.003 0.002 0.000 -0.002 -0.003 0.004 -1.000
## gendermale
## Counts 0.003
## distance 0.003
## weight 0.002
## age 0.000
## carage -0.002
## exposure -0.003
## freq 0.004
## genderfemale -1.000
## gendermale 1.000
#7 Plot the correlation heatmaps
plot_corr <- function(C, title) {
ord <- stats::hclust(stats::as.dist(1 - abs(C)))$order
C_ord <- C[ord, ord]
df <- as.data.frame(as.table(C_ord))
names(df) <- c("Var1","Var2","value")
ggplot(df, aes(Var1, Var2, fill = value)) +
geom_tile(color = "white", linewidth = 0.25) +
scale_fill_gradient2(limits = c(-1, 1), oob = scales::squish,
low = "#B91C1C", mid = "#F3F4F6", high = "#1D4ED8", midpoint = 0) +
coord_fixed() +
labs(title = title, x = NULL, y = NULL, fill = "ρ") +
theme_minimal(base_size = 12) +
theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1),
plot.title = element_text(face = "bold"),
panel.grid = element_blank())
}
p1 <- plot_corr(cor_pearson, "Pearson correlation (numeric + factor dummies)")
p2 <- plot_corr(cor_spearman, "Spearman correlation (numeric + factor dummies)")
print(p1); print(p2)
#8 Estimate Poisson for GLM
DF <- if (exists("CaseStudyData", inherits = FALSE) && !is.function(get("CaseStudyData"))) {
CaseStudyData
} else if (exists("data", inherits = FALSE) && !is.function(get("data"))) {
data
} else {
# Load if needed (edit path if not running in this session)
read_csv("/mnt/data/CaseStudyData_ACST8040_2025 (1).csv", show_col_types = FALSE)
}
# Fixin column types
DF <- DF %>%
mutate(
gender = factor(gender),
distance = readr::parse_number(as.character(distance)),
weight = readr::parse_number(as.character(weight)),
age = readr::parse_number(as.character(age)),
carage = readr::parse_number(as.character(carage)),
exposure = readr::parse_number(as.character(exposure)),
Counts = readr::parse_number(as.character(Counts))
)
# Create full model formula
# x1=weight
# x2=distance
# x3=age
# x4=carages
# x5=gender
form_full <- Counts ~
gender +
weight + distance + age + carage +
I(weight^2) + I(distance^2) + I(age^2) + I(carage^2) +
weight:distance + weight:age + weight:carage +
distance:age + distance:carage + age:carage +
offset(log(exposure))
# Fit full Poisson
m_full <- glm(form_full, data = DF, family = poisson(link = "log"))
#9 Select the glm model
# AIC-based stepwise
m_step_aic <- stepAIC(m_full, direction = "both", trace = FALSE)
# Final model :)
m_final <- m_step_aic
summary(m_final)
##
## Call:
## glm(formula = Counts ~ gender + weight + distance + age + carage +
## I(age^2) + I(carage^2) + weight:distance + weight:age + age:carage +
## offset(log(exposure)), family = poisson(link = "log"), data = DF)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.778e+00 1.188e-01 -57.063 < 2e-16 ***
## gendermale 1.655e-02 1.148e-02 1.441 0.150
## weight 2.316e-04 2.772e-05 8.353 < 2e-16 ***
## distance 6.260e-03 7.230e-04 8.658 < 2e-16 ***
## age 6.797e-02 3.033e-03 22.413 < 2e-16 ***
## carage 1.332e-01 4.197e-03 31.744 < 2e-16 ***
## I(age^2) -1.858e-04 2.310e-05 -8.046 8.54e-16 ***
## I(carage^2) -5.941e-04 7.456e-05 -7.968 1.62e-15 ***
## weight:distance -5.171e-07 3.352e-07 -1.542 0.123
## weight:age -6.707e-07 3.755e-07 -1.786 0.074 .
## age:carage -1.162e-03 3.006e-05 -38.648 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 125453 on 199999 degrees of freedom
## Residual deviance: 116186 on 199989 degrees of freedom
## AIC: 175303
##
## Number of Fisher Scoring iterations: 6
coef(m_final)
## (Intercept) gendermale weight distance age
## -6.777607e+00 1.654691e-02 2.315685e-04 6.259824e-03 6.797319e-02
## carage I(age^2) I(carage^2) weight:distance weight:age
## 1.332362e-01 -1.858359e-04 -5.940642e-04 -5.170794e-07 -6.706863e-07
## age:carage
## -1.161681e-03
# Small check for over-dispersion
dispersion <- sum(residuals(m_final, type = "pearson")^2) / m_final$df.residual
cat("Dispersion estimate:", round(dispersion, 3), "\n")
## Dispersion estimate: 1.004
#9.1 (continue) Benchmark lambda λ
# Benchmark λ(x): exposure=1
gender_val <- if ("female" %in% levels(DF$gender)) "female" else levels(DF$gender)[1]
x_benchmark <- data.frame(
weight = 1500,
distance = 10,
age = 30,
carage = 4,
gender = factor(gender_val, levels = levels(DF$gender)),
exposure = 1
)
lambda_hat <- predict(m_final, newdata = x_benchmark, type = "response") # this is λ(x)
cat("Benchmark λ(x) at (1500,10,30,4,female):", round(lambda_hat, 6), "\n")
## Benchmark λ(x) at (1500,10,30,4,female): 0.015771
#10 Plot lambda λ vs x3 = age
age_grid <- data.frame(
weight = 1500,
distance = 10,
age = seq(16, 85, by = 1),
carage = 4,
gender = factor(gender_val, levels = levels(DF$gender)),
exposure = 1
)
pred <- predict(m_final, newdata = age_grid, type = "link", se.fit = TRUE)
age_grid$lambda <- exp(pred$fit)
age_grid$lower <- exp(pred$fit - 1.96 * pred$se.fit)
age_grid$upper <- exp(pred$fit + 1.96 * pred$se.fit)
ggplot(age_grid, aes(x = age, y = lambda)) +
geom_ribbon(aes(ymin = lower, ymax = upper), alpha = 0.15) +
geom_line(linewidth = 0.8) +
labs(title = expression(hat(lambda)(x)~"vs driver age (others fixed)"),
subtitle = "Fixed at weight=1500, distance=10, carage=4, gender=female; exposure = 1",
x = "Driver age (x3)", y = expression(hat(lambda)(x))) +
theme_minimal(base_size = 13) +
theme(plot.title = element_text(face = "bold"))
#11 Using Elbow Method for Total Within-Cluster SS vs K (age, carage)
# Age (x3) and carage (x4)
# Numeric + drop NA
X <- DF |>
dplyr::mutate(
age = readr::parse_number(as.character(age)),
carage = readr::parse_number(as.character(carage))
) |>
dplyr::select(dplyr::all_of(c("age","carage"))) |>
tidyr::drop_na()
# Standardise the variable
X_s <- scale(X)
Kmax <- 10
elbow <- data.frame(K = integer(), tot_withinss = double(), betweenss = double(), totss = double())
# Elbow plot: total within-cluster SS vs K
for (k in 1:Kmax) {
km <- kmeans(X_s, centers = k, nstart = 20, iter.max = 100)
elbow <- rbind(elbow,
data.frame(K = k,
tot_withinss = km$tot.withinss,
betweenss = km$betweenss,
totss = km$totss))
}
ggplot(elbow, aes(K, tot_withinss)) +
geom_line() + geom_point(size = 2) +
labs(title = "Elbow plot (age, carage)",
subtitle = "Total within-cluster sum of squares vs K",
x = "K (number of clusters)", y = "Total within-cluster SS") +
theme_minimal(base_size = 13) +
theme(plot.title = element_text(face = "bold"))
#12 Using Gap statistic subsample for speed on big data
n_gap <- min(nrow(X_s), 5000) # sample with 5k points (cause my laptop is weak T_T )
idx <- sample.int(nrow(X_s), n_gap)
FUNkm <- function(x, k) stats::kmeans(x, centers = k, nstart = 20, iter.max = 100)
gap <- cluster::clusGap(X_s[idx, , drop = FALSE], FUN = FUNkm, K.max = Kmax, B = 50) # B=50 is a good default
## Warning: Quick-TRANSfer stage steps exceeded maximum (= 250000)
## Warning: Quick-TRANSfer stage steps exceeded maximum (= 250000)
gap_df <- as.data.frame(gap$Tab)
gap_df$K <- 1:Kmax
ggplot(gap_df, aes(K, gap)) +
geom_line() + geom_point(size = 2) +
geom_errorbar(aes(ymin = gap - SE.sim, ymax = gap + SE.sim), width = 0.2) +
labs(title = "Gap statistic vs K (age, carage)",
x = "K", y = "Gap") +
theme_minimal(base_size = 13) +
theme(plot.title = element_text(face = "bold"))
# Select K*
K_star <- cluster::maxSE(gap$Tab[,"gap"], gap$Tab[,"SE.sim"], method = "Tibs2001SEmax")
cat("Selected K* by gap statistic:", K_star, "\n")
## Selected K* by gap statistic: 4
#13 Finalise k-means, scattering with cluster colours + centroids
km_final <- kmeans(X_s, centers = K_star, nstart = 50, iter.max = 200)
DF$cluster_k <- factor(km_final$cluster)
# Recover centroids on original scale for plotting
cent_scaled <- km_final$centers
cent_orig <- sweep(cent_scaled, 2, attr(X_s, "scaled:scale"), "*")
cent_orig <- sweep(cent_orig, 2, attr(X_s, "scaled:center"), "+")
cent_df <- data.frame(age = cent_orig[, "age"], carage = cent_orig[, "carage"],
cluster_k = factor(seq_len(K_star)))
ggplot(DF, aes(x = age, y = carage, color = cluster_k)) +
geom_point(alpha = 0.35, size = 0.7) +
geom_point(data = cent_df, aes(x = age, y = carage), inherit.aes = FALSE,
shape = 8, size = 3, color = "black") +
labs(title = "K-means clusters on (age, carage)",
subtitle = paste0("K* = ", K_star, " by gap statistic"),
x = "Driver age (x3)", y = "Car age (x4)", color = "Cluster") +
theme_minimal(base_size = 13) +
theme(plot.title = element_text(face = "bold"))
#14 Rebuild clusters
X <- DF %>%
dplyr::mutate(
age = readr::parse_number(as.character(age)),
carage = readr::parse_number(as.character(carage))
) %>%
dplyr::select(age, carage) %>%
tidyr::drop_na()
X_s <- scale(X)
# choose K* with gap
n_gap <- min(nrow(X_s), 10000)
idx <- sample.int(nrow(X_s), n_gap)
FUNkm <- function(x, k) stats::kmeans(x, centers=k, nstart=15, iter.max=200, algorithm="Lloyd")
gap <- cluster::clusGap(X_s[idx, , drop=FALSE], FUN=FUNkm, K.max=10, B=30)
K_star <- cluster::maxSE(gap$Tab[,"gap"], gap$Tab[,"SE.sim"], method="Tibs2001SEmax")
# final kmeans on full data with K*
km_final <- kmeans(X_s, centers = K_star, nstart = 50, iter.max = 200)
DF$cluster_k <- factor(km_final$cluster) # x6
# Predicting cluster on new (age,carage)
sc_center <- attr(X_s, "scaled:center")
sc_scale <- attr(X_s, "scaled:scale")
assign_cluster <- function(age, carage, centers, center, scale) {
z1 <- (age - center["age"]) / scale["age"]
z2 <- (carage- center["carage"])/ scale["carage"]
d2 <- rowSums((centers - matrix(c(z1, z2), nrow=nrow(centers), ncol=2, byrow=TRUE))^2)
which.min(d2)
}
#15 Form full GLM (linear + quadratics + mixed + cluster interactions)
form_full <- Counts ~
gender + cluster_k +
weight + distance + age + carage +
I(weight^2) + I(distance^2) + I(age^2) + I(carage^2) +
weight:distance + weight:age + weight:carage +
distance:age + distance:carage + age:carage +
cluster_k:weight + cluster_k:distance + cluster_k:age + cluster_k:carage +
offset(log(exposure))
m_full <- glm(form_full, data = DF, family = poisson(link="log"))
#16 Select the model, Benchmark λ(x)
# Model Selection
m_final <- MASS::stepAIC(m_full, direction="both", trace=FALSE) # use stats::step(..., k=2) if MASS unavailable
summary(m_final)
##
## Call:
## glm(formula = Counts ~ cluster_k + weight + distance + age +
## carage + I(age^2) + weight:distance + weight:age + distance:age +
## age:carage + cluster_k:distance + cluster_k:age + offset(log(exposure)),
## family = poisson(link = "log"), data = DF)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.099e+00 2.328e-01 -13.314 < 2e-16 ***
## cluster_k2 2.161e+00 4.990e-01 4.331 1.49e-05 ***
## cluster_k3 1.157e+00 2.020e-01 5.728 1.02e-08 ***
## cluster_k4 2.545e+00 5.044e-01 5.047 4.50e-07 ***
## weight 2.313e-04 2.763e-05 8.370 < 2e-16 ***
## distance 3.625e-03 1.968e-03 1.842 0.065520 .
## age -3.389e-02 9.003e-03 -3.764 0.000167 ***
## carage 5.347e-03 5.069e-03 1.055 0.291469
## I(age^2) 2.273e-04 1.053e-04 2.158 0.030911 *
## weight:distance -5.262e-07 3.353e-07 -1.569 0.116637
## weight:age -6.569e-07 3.754e-07 -1.750 0.080168 .
## distance:age 1.099e-04 3.986e-05 2.756 0.005846 **
## age:carage 1.910e-04 7.868e-05 2.428 0.015193 *
## cluster_k2:distance -5.996e-03 1.957e-03 -3.065 0.002180 **
## cluster_k3:distance -1.842e-03 1.203e-03 -1.532 0.125596
## cluster_k4:distance -5.323e-03 1.931e-03 -2.757 0.005840 **
## cluster_k2:age -2.995e-03 9.181e-03 -0.326 0.744271
## cluster_k3:age 8.265e-03 4.700e-03 1.758 0.078689 .
## cluster_k4:age -9.326e-03 9.268e-03 -1.006 0.314267
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 125453 on 199999 degrees of freedom
## Residual deviance: 115141 on 199981 degrees of freedom
## AIC: 174274
##
## Number of Fisher Scoring iterations: 6
dispersion <- sum(residuals(m_final, type="pearson")^2) / m_final$df.residual
cat("Dispersion:", round(dispersion, 3), "\n")
## Dispersion: 1.006
# Benchmark λ(x) at (x1=1500, x2=10, x3=30, x4=4, x5=female); exposure=1
gender_val <- if ("female" %in% levels(DF$gender)) "female" else levels(DF$gender)[1]
bench_cluster <- assign_cluster(age=30, carage=4, centers=km_final$centers,
center=sc_center, scale=sc_scale)
x_benchmark <- data.frame(
weight = 1500, distance = 10, age = 30, carage = 4,
gender = factor(gender_val, levels = levels(DF$gender)),
cluster_k = factor(bench_cluster, levels = levels(DF$cluster_k)),
exposure = 1
)
lambda_hat <- predict(m_final, newdata = x_benchmark, type = "response")
cat("Benchmark λ(x) with cluster =", as.character(x_benchmark$cluster_k), ":",
round(lambda_hat, 6), "\n")
## Benchmark λ(x) with cluster = 1 : 0.030556
#17 Plot λ(x) vs age with others fixed
age_seq <- seq(16, 85, by = 1)
age_clusters <- vapply(age_seq, function(a)
assign_cluster(a, 4, km_final$centers, sc_center, sc_scale), integer(1))
age_grid <- data.frame(
weight = 1500, distance = 10, age = age_seq, carage = 4,
gender = factor(gender_val, levels = levels(DF$gender)),
cluster_k = factor(age_clusters, levels = levels(DF$cluster_k)),
exposure = 1
)
pred <- predict(m_final, newdata = age_grid, type = "link", se.fit = TRUE)
age_grid$lambda <- exp(pred$fit)
age_grid$lower <- exp(pred$fit - 1.96 * pred$se.fit)
age_grid$upper <- exp(pred$fit + 1.96 * pred$se.fit)
ggplot(age_grid, aes(x = age, y = lambda)) +
geom_ribbon(aes(ymin = lower, ymax = upper), alpha = 0.15) +
geom_line(linewidth = 0.9) +
labs(title = expression(hat(lambda)(x)~"vs driver age with cluster x"[6]),
subtitle = paste0("Fixed: weight=1500, distance=10, carage=4, gender=", gender_val,
"; exposure=1; cluster chosen by k-means for each age"),
x = "Driver age (x3)", y = expression(hat(lambda)(x))) +
theme_minimal(base_size = 13) +
theme(plot.title = element_text(face = "bold"))
#18 Grow Poisson regression tree
tree0 <- rpart(
formula = Counts ~ weight + distance + age + carage + gender + offset(log(exposure)),
data = DF,
method = "poisson",
control = rpart.control(cp = 0.001, xval = 10, minbucket = 500, maxdepth = 6)
)
#19 Choose optimal size of the tree, plot the final tree
ctab <- tree0$cptable
i_min <- which.min(ctab[,"xerror"])
xerr_min <- ctab[i_min,"xerror"]; xstd_min <- ctab[i_min,"xstd"]
cp_1se <- ctab[ max(which(ctab[,"xerror"] <= xerr_min + xstd_min)), "CP" ]
tree <- prune.rpart(tree0, cp = cp_1se)
printcp(tree0) # CV table
##
## Rates regression tree:
## rpart(formula = Counts ~ weight + distance + age + carage + gender +
## offset(log(exposure)), data = DF, method = "poisson", control = rpart.control(cp = 0.001,
## xval = 10, minbucket = 500, maxdepth = 6))
##
## Variables actually used in tree construction:
## [1] age carage weight
##
## Root node error: 125453/200000 = 0.62726
##
## n= 200000
##
## CP nsplit rel error xerror xstd
## 1 0.0351320 0 1.00000 1.00002 0.0034205
## 2 0.0032330 2 0.92974 0.92981 0.0032510
## 3 0.0012241 3 0.92650 0.92667 0.0032438
## 4 0.0010772 4 0.92528 0.92581 0.0032411
## 5 0.0010000 5 0.92420 0.92484 0.0032374
tree$variable.importance # variable importance
## age carage weight distance gender
## 5325.89354 3915.00566 314.26181 16.60581 13.15646
# Plot the tree
rpart.plot(tree, type = 2, extra = 101, under = TRUE, faclen = 0, varlen = 0, cex = 0.75,
main = "Poisson regression tree for claim frequency λ(x)")
#20 Benchmark λ(x) as in (b)
gender_val <- if ("female" %in% levels(DF$gender)) "female" else levels(DF$gender)[1]
x_bench <- data.frame(
weight = 1500, distance = 10, age = 30, carage = 4,
gender = factor(gender_val, levels = levels(DF$gender)),
exposure = 1
)
lambda_hat <- predict(tree, newdata = x_bench, type = "vector")
cat("Benchmark λ(x) from tree:", round(lambda_hat, 6), "\n")
## Benchmark λ(x) from tree: 0.039014
#21 Plot λ(x) vs age with others fixed
#tree predictions
age_grid <- data.frame(
weight = 1500, distance = 10, age = seq(16, 85, by = 1), carage = 4,
gender = factor(gender_val, levels = levels(DF$gender)),
exposure = 1
)
age_grid$lambda <- predict(tree, newdata = age_grid, type = "vector")
ggplot(age_grid, aes(x = age, y = lambda)) +
geom_line(linewidth = 0.9) +
labs(title = expression(hat(lambda)(x)~"vs driver age — Poisson tree"),
subtitle = "Fixed at weight=1500, distance=10, carage=4, gender=female; exposure=1",
x = "Driver age (x3)", y = expression(hat(lambda)(x))) +
theme_minimal(base_size = 13) +
theme(plot.title = element_text(face = "bold"))
#22 Grow a Poisson boosting tree NO BASE MODEL
ctrl <- mboost::boost_control(mstop = 500, nu = 0.05) # Shrinkage ν = 0.05
tctrl <- partykit::ctree_control(maxdepth = 3, minbucket = 1000) # Tree size
bb0 <- mboost::blackboost(
Counts ~ weight + distance + age + carage + gender,
data = DF,
family = mboost::Poisson(),
offset = log(DF$exposure), # Baseline λ(x)=exp(0)=1
control = ctrl,
tree_controls = tctrl
)
#23 Benchmark λ(x) as in (b): exposure = 1 so prediction is λ(x)
gender_val <- if ("female" %in% levels(DF$gender)) "female" else levels(DF$gender)[1]
x_bench <- data.frame(
weight = 1500, distance = 10, age = 30, carage = 4,
gender = factor(gender_val, levels = levels(DF$gender)),
exposure = 1
)
# blackboost predicts on link scale
eta_bench <- predict(bb0, newdata = x_bench, type = "link")
## Warning in object$predict(newdata = newdata, which = which, aggregate =
## aggregate): User-specified offset is not a scalar, thus it cannot be used for
## predictions when 'newdata' is specified.
lambda_bench <- exp(eta_bench)
cat("Benchmark λ(x) from boosting:", round(lambda_bench, 6), "\n")
## Benchmark λ(x) from boosting: 0.062315
#24 Plot λ(x) vs age with other predictors fixed
age_grid <- data.frame(
weight = 1500, distance = 10, age = seq(16, 85, by = 1), carage = 4,
gender = factor(gender_val, levels = levels(DF$gender)),
exposure = 1
)
eta_grid <- predict(bb0, newdata = age_grid, type = "link")
## Warning in object$predict(newdata = newdata, which = which, aggregate =
## aggregate): User-specified offset is not a scalar, thus it cannot be used for
## predictions when 'newdata' is specified.
age_grid$lambda <- exp(eta_grid)
ggplot(age_grid, aes(x = age, y = lambda)) +
geom_line(linewidth = 0.9) +
labs(title = expression(hat(lambda)(x)~"vs driver age — Poisson boosting tree"),
subtitle = paste0("Fixed: weight=1500, distance=10, carage=4, gender=", gender_val,
"; exposure=1 | ν=0.05, maxdepth=3, mstop=", mboost::mstop(bb0)),
x = "Driver age (x3)", y = expression(hat(lambda)(x))) +
theme_minimal(base_size = 13) +
theme(plot.title = element_text(face = "bold"))
#25 Grow a Poisson boosting tree WITH BASE MODEL
form_full_b <- Counts ~
gender +
weight + distance + age + carage +
I(weight^2) + I(distance^2) + I(age^2) + I(carage^2) +
weight:distance + weight:age + weight:carage +
distance:age + distance:carage + age:carage +
offset(log(exposure))
m_glm_full <- glm(form_full_b, data = DF, family = poisson(link = "log"))
m_glm_best <- suppressWarnings(MASS::stepAIC(m_glm_full, direction = "both", trace = FALSE))
# Log-rate from GLM base (exposure set to 1 so this is log λ_hat)
eta_glm <- predict(m_glm_best, newdata = transform(DF, exposure = 1), type = "link")
#26 Boosting tree that corrects the GLM
ctrl <- mboost::boost_control(mstop = 500, nu = 0.05) # shrinkage ν
tctrl <- partykit::ctree_control(maxdepth = 3, minbucket = 1000) # tree size
bb_glm_base <- mboost::blackboost(
Counts ~ weight + distance + age + carage + gender, # NO x6
data = DF,
family = mboost::Poisson(),
offset = log(DF$exposure) + eta_glm, # base = GLM
control = ctrl,
tree_controls = tctrl
)
#27 Tune number of boosting steps
folds <- mboost::cv(model.weights(bb_glm_base), type = "kfold", B = 5)
grid <- seq(50, 1500, by = 50)
cvr <- mboost::cvrisk(bb_glm_base, folds = folds, grid = grid)
mboost::mstop(bb_glm_base) <- mboost::mstop(cvr)
cat("Chosen mstop:", mboost::mstop(bb_glm_base),
"| shrinkage ν:", ctrl$nu, "| tree maxdepth:", tctrl$maxdepth, "\n")
## Chosen mstop: 50 | shrinkage ν: 0.05 | tree maxdepth: 3
#28 Benchmark λ(x) as in (b); exposure=1
gender_val <- if ("female" %in% levels(DF$gender)) "female" else levels(DF$gender)[1]
x_bench <- data.frame(
weight = 1500, distance = 10, age = 30, carage = 4,
gender = factor(gender_val, levels = levels(DF$gender)),
exposure = 1
)
eta_base_bench <- predict(m_glm_best, newdata = x_bench, type = "link") # log λ from GLM
eta_boost_corr <- predict(bb_glm_base, newdata = x_bench, type = "link") # correction f(x)
## Warning in object$predict(newdata = newdata, which = which, aggregate =
## aggregate): User-specified offset is not a scalar, thus it cannot be used for
## predictions when 'newdata' is specified.
lambda_bench <- exp(eta_base_bench + eta_boost_corr)
cat("Benchmark λ(x) from GLM+boosting:", round(lambda_bench, 6), "\n")
## Benchmark λ(x) from GLM+boosting: 0.015771
#29 Benchmark λ(x) as in (b); exposure=1
age_grid <- data.frame(
weight = 1500, distance = 10, age = seq(16, 85, by = 1), carage = 4,
gender = factor(gender_val, levels = levels(DF$gender)),
exposure = 1
)
eta_base <- predict(m_glm_best, newdata = age_grid, type = "link")
eta_corr <- predict(bb_glm_base, newdata = age_grid, type = "link")
## Warning in object$predict(newdata = newdata, which = which, aggregate =
## aggregate): User-specified offset is not a scalar, thus it cannot be used for
## predictions when 'newdata' is specified.
age_grid$lambda <- exp(eta_base + eta_corr)
ggplot(age_grid, aes(x = age, y = lambda)) +
geom_line(linewidth = 0.9) +
labs(title = expression(hat(lambda)(x)~"vs driver age — GLM base + boosting tree"),
subtitle = paste0("Base: best GLM (b); correction: boosted trees | ν=", ctrl$nu,
", maxdepth=", tctrl$maxdepth, ", mstop=", mboost::mstop(bb_glm_base),
" | fixed: weight=1500, distance=10, carage=4, gender=", gender_val,
"; exposure=1"),
x = "Driver age (x3)", y = expression(hat(lambda)(x))) +
theme_minimal(base_size = 13) +
theme(plot.title = element_text(face = "bold"))
#30 Compare the models fitted in items Comparing models from: (b) GLM (d) GLM+cluster (d) Poisson tree (f) Poisson boosting (no base) (g) Poisson boosting (with base) Using 10-fold cross validation error to select the best model. The reports mean Poisson deviance (lower is better) across folds.
poisson_dev <- function(y, mu){ mu <- pmax(mu,1e-12); 2*sum(ifelse(y==0,-mu, y*log(y/mu)-(y-mu))) }
kfold_index <- function(n, K=10) sample(rep(1:K, length.out = n))
# centroid assignment fix
assign_cluster <- function(age, carage, centers, center, scale){
C <- as.matrix(centers)
c_age <- if (!is.null(names(center)) && "age" %in% names(center)) center[["age"]] else center[1]
c_car <- if (!is.null(names(center)) && "carage" %in% names(center)) center[["carage"]] else center[2]
s_age <- if (!is.null(names(scale)) && "age" %in% names(scale)) scale[["age"]] else scale[1]
s_car <- if (!is.null(names(scale)) && "carage" %in% names(scale)) scale[["carage"]] else scale[2]
if (isTRUE(all.equal(s_age,0))) s_age <- 1; if (isTRUE(all.equal(s_car,0))) s_car <- 1
z <- c((age-c_age)/s_age, (carage-c_car)/s_car)
target <- matrix(z, nrow=nrow(C), ncol=2, byrow=TRUE)
which.min(rowSums((C - target)^2))
}
choose_Kstar <- function(Xs, Kmax=10, n_sample=3000, B=10){ # cheaper GAP
idx <- sample.int(nrow(Xs), min(nrow(Xs), n_sample))
FUNkm <- function(x,k) stats::kmeans(x, centers=k, nstart=10, iter.max=100)
gap <- cluster::clusGap(Xs[idx,,drop=FALSE], FUN=FUNkm, K.max=Kmax, B=B)
cluster::maxSE(gap$Tab[,"gap"], gap$Tab[,"SE.sim"], method="Tibs2001SEmax")
}
#31 Define baseline Poisson GLM formula
# GLM formula
form_b_full <- Counts ~
gender + weight + distance + age + carage +
I(weight^2) + I(distance^2) + I(age^2) + I(carage^2) +
weight:distance + weight:age + weight:carage +
distance:age + distance:carage + age:carage +
offset(log(exposure))
#32 Select baseline GLM using AIC
# Pick GLM (b) structure
if (!exists("form_b_sel", inherits = FALSE)) {
sidx <- sample.int(nrow(DF), min(50000, nrow(DF)))
glm_b_once <- glm(form_b_full, data=DF[sidx,], family=poisson("log"))
form_b_sel <- formula(MASS::stepAIC(glm_b_once, direction="both", trace=FALSE))
}
# Cluster model
form_d_light <- update(form_b_sel, . ~ . + cluster_k)
#33 10-fold cross-validation comparing models (b, d, e, f, g)
K <- 10
folds <- kfold_index(nrow(DF), K)
out <- vector("list", K)
for (k in 1:K) {
train <- DF[folds != k, , drop=FALSE]
test <- DF[folds == k, , drop=FALSE]
## (b) GLM
glm_b <- glm(form_b_sel, data=train, family=poisson("log"))
mu_b <- predict(glm_b, newdata=test, type="response")
dev_b <- poisson_dev(test$Counts, mu_b)
## (d) GLM + cluster x6
Xtr <- train[, c("age","carage")]
Xtr_s <- scale(Xtr)
Kstar <- choose_Kstar(Xtr_s, Kmax=10, n_sample=3000, B=10)
km <- kmeans(Xtr_s, centers=Kstar, nstart=10, iter.max=100)
train$cluster_k <- factor(km$cluster, levels = seq_len(nrow(as.matrix(km$centers))))
sc_center <- attr(Xtr_s,"scaled:center"); sc_scale <- attr(Xtr_s,"scaled:scale")
test$cluster_k <- factor(
vapply(seq_len(nrow(test)),
function(i) assign_cluster(test$age[i], test$carage[i], km$centers, sc_center, sc_scale),
integer(1)),
levels = levels(train$cluster_k)
)
glm_d <- glm(form_d_light, data=train, family=poisson("log"))
mu_d <- predict(glm_d, newdata=test, type="response")
dev_d <- poisson_dev(test$Counts, mu_d)
## (e) Poisson tree
tree0 <- rpart(Counts ~ weight + distance + age + carage + gender + offset(log(exposure)),
data=train, method="poisson",
control=rpart.control(cp=0.001, xval=5, minbucket=2000, maxdepth=5))
ctab <- tree0$cptable; i_min <- which.min(ctab[,"xerror"])
cp_1se <- ctab[max(which(ctab[,"xerror"] <= ctab[i_min,"xerror"] + ctab[i_min,"xstd"])), "CP"]
tree <- prune.rpart(tree0, cp=cp_1se)
mu_e <- predict(tree, newdata=test, type="vector")
dev_e <- poisson_dev(test$Counts, mu_e)
## (f) Boosting (no base)
ctrl_f <- mboost::boost_control(mstop=300, nu=0.05)
tctrl_f <- partykit::ctree_control(maxdepth=3, minbucket=2000)
bb_f <- mboost::blackboost(
Counts ~ weight + distance + age + carage + gender,
data=train, family=mboost::Poisson(),
offset=log(train$exposure),
control=ctrl_f, tree_controls=tctrl_f
)
folds_f <- mboost::cv(model.weights(bb_f), type="kfold", B=3)
cvr_f <- mboost::cvrisk(bb_f, folds=folds_f, grid=seq(50,300,by=50))
mboost::mstop(bb_f) <- mboost::mstop(cvr_f)
mu_f <- exp(predict(bb_f, newdata=test, type="link")) * test$exposure
dev_f <- poisson_dev(test$Counts, mu_f)
## (g) GLM base + boosting correction
glm_base <- glm(form_b_sel, data=train, family=poisson("log"))
eta_glm_tr <- predict(glm_base, newdata=transform(train, exposure=1), type="link")
ctrl_g <- mboost::boost_control(mstop=300, nu=0.05)
tctrl_g <- partykit::ctree_control(maxdepth=3, minbucket=2000)
bb_g <- mboost::blackboost(
Counts ~ weight + distance + age + carage + gender,
data=train, family=mboost::Poisson(),
offset = log(train$exposure) + eta_glm_tr,
control=ctrl_g, tree_controls=tctrl_g
)
folds_g <- mboost::cv(model.weights(bb_g), type="kfold", B=3)
cvr_g <- mboost::cvrisk(bb_g, folds=folds_g, grid=seq(50,300,by=50))
mboost::mstop(bb_g) <- mboost::mstop(cvr_g)
eta_base_te <- predict(glm_base, newdata=transform(test, exposure=1), type="link")
eta_corr_te <- predict(bb_g, newdata=test, type="link")
mu_g <- exp(eta_base_te + eta_corr_te) * test$exposure
dev_g <- poisson_dev(test$Counts, mu_g)
out[[k]] <- data.frame(
fold = k,
model = c("(b)-GLM","(d)_GLM_cluster","(e)_Poisson_Tree",
"(f)_Poisson_Boost_NoBase","(g)_Poisson_Boost_Base"),
dev = c(dev_b, dev_d, dev_e, dev_f, dev_g),
n = nrow(test)
)
}
## Warning in object$predict(newdata = newdata, which = which, aggregate =
## aggregate): User-specified offset is not a scalar, thus it cannot be used for
## predictions when 'newdata' is specified.
## Warning in object$predict(newdata = newdata, which = which, aggregate =
## aggregate): User-specified offset is not a scalar, thus it cannot be used for
## predictions when 'newdata' is specified.
## Warning in object$predict(newdata = newdata, which = which, aggregate =
## aggregate): User-specified offset is not a scalar, thus it cannot be used for
## predictions when 'newdata' is specified.
## Warning in object$predict(newdata = newdata, which = which, aggregate =
## aggregate): User-specified offset is not a scalar, thus it cannot be used for
## predictions when 'newdata' is specified.
## Warning in object$predict(newdata = newdata, which = which, aggregate =
## aggregate): User-specified offset is not a scalar, thus it cannot be used for
## predictions when 'newdata' is specified.
## Warning in object$predict(newdata = newdata, which = which, aggregate =
## aggregate): User-specified offset is not a scalar, thus it cannot be used for
## predictions when 'newdata' is specified.
## Warning in object$predict(newdata = newdata, which = which, aggregate =
## aggregate): User-specified offset is not a scalar, thus it cannot be used for
## predictions when 'newdata' is specified.
## Warning in object$predict(newdata = newdata, which = which, aggregate =
## aggregate): User-specified offset is not a scalar, thus it cannot be used for
## predictions when 'newdata' is specified.
## Warning in object$predict(newdata = newdata, which = which, aggregate =
## aggregate): User-specified offset is not a scalar, thus it cannot be used for
## predictions when 'newdata' is specified.
## Warning in object$predict(newdata = newdata, which = which, aggregate =
## aggregate): User-specified offset is not a scalar, thus it cannot be used for
## predictions when 'newdata' is specified.
## Warning in object$predict(newdata = newdata, which = which, aggregate =
## aggregate): User-specified offset is not a scalar, thus it cannot be used for
## predictions when 'newdata' is specified.
## Warning in object$predict(newdata = newdata, which = which, aggregate =
## aggregate): User-specified offset is not a scalar, thus it cannot be used for
## predictions when 'newdata' is specified.
## Warning: Quick-TRANSfer stage steps exceeded maximum (= 150000)
## Warning: Quick-TRANSfer stage steps exceeded maximum (= 150000)
## Warning: Quick-TRANSfer stage steps exceeded maximum (= 150000)
## Warning in object$predict(newdata = newdata, which = which, aggregate =
## aggregate): User-specified offset is not a scalar, thus it cannot be used for
## predictions when 'newdata' is specified.
## Warning in object$predict(newdata = newdata, which = which, aggregate =
## aggregate): User-specified offset is not a scalar, thus it cannot be used for
## predictions when 'newdata' is specified.
## Warning in object$predict(newdata = newdata, which = which, aggregate =
## aggregate): User-specified offset is not a scalar, thus it cannot be used for
## predictions when 'newdata' is specified.
## Warning in object$predict(newdata = newdata, which = which, aggregate =
## aggregate): User-specified offset is not a scalar, thus it cannot be used for
## predictions when 'newdata' is specified.
## Warning in object$predict(newdata = newdata, which = which, aggregate =
## aggregate): User-specified offset is not a scalar, thus it cannot be used for
## predictions when 'newdata' is specified.
## Warning in object$predict(newdata = newdata, which = which, aggregate =
## aggregate): User-specified offset is not a scalar, thus it cannot be used for
## predictions when 'newdata' is specified.
## Warning in object$predict(newdata = newdata, which = which, aggregate =
## aggregate): User-specified offset is not a scalar, thus it cannot be used for
## predictions when 'newdata' is specified.
## Warning in object$predict(newdata = newdata, which = which, aggregate =
## aggregate): User-specified offset is not a scalar, thus it cannot be used for
## predictions when 'newdata' is specified.
#34 Create categorial variable High
DF <- DF %>%
mutate(High = factor(ifelse(Counts >= 2, "Yes", "No"), levels = c("No","Yes")))
#35 Fit logistic regression
fit_logit <- glm(High ~ weight + distance + age + carage + gender,
data = DF, family = binomial("logit"))
#36 Benchmark probability at (1500,10,30,4,female)
gender_val <- if ("female" %in% levels(DF$gender)) "female" else levels(DF$gender)[1]
x_bench <- data.frame(weight=1500, distance=10, age=30, carage=4,
gender=factor(gender_val, levels=levels(DF$gender)))
p_bench <- predict(fit_logit, newdata=x_bench, type="response")
cat("Pr(High=Yes) at (1500,10,30,4,female):", round(p_bench,4), "\n")
## Pr(High=Yes) at (1500,10,30,4,female): 0.0018
#37 Plot Pr(High=Yes) vs driver age
age_grid <- data.frame(
weight=1500, distance=10, age=seq(16,85,1), carage=4,
gender=factor(gender_val, levels=levels(DF$gender))
)
age_grid$prob <- predict(fit_logit, newdata=age_grid, type="response")
ggplot(age_grid, aes(age, prob)) +
geom_line(linewidth=0.9) +
labs(title=expression(Pr(High==Yes)~" vs driver age"),
subtitle="Fixed: weight=1500, distance=10, carage=4, gender=female",
x="Driver age (x3)", y="Probability") +
theme_minimal(base_size=13) +
theme(plot.title=element_text(face="bold"))
#38 Create ROC curve + optimal threshold
phat <- predict(fit_logit, type = "response")
y <- as.integer(DF$High == "Yes")
# small grid
thr <- seq(0, 1, by = 0.01)
calc_rates <- function(t){
pred <- phat >= t
TP <- sum(pred & y == 1); FP <- sum(pred & y == 0)
FN <- sum(!pred & y == 1); TN <- sum(!pred & y == 0)
TPR <- TP / (TP + FN); FPR <- FP / (FP + TN)
ACC <- (TP + TN) / length(y)
c(TPR = TPR, FPR = FPR, ACC = ACC)
}
M <- vapply(thr, calc_rates, numeric(3))
roc_df <- data.frame(
threshold = thr,
TPR = unname(M["TPR",]),
FPR = unname(M["FPR",]),
ACC = unname(M["ACC",])
)
best <- roc_df[which.max(roc_df$ACC), ]
cat("Best threshold:", round(best$threshold, 4),
"| TPR:", round(best$TPR, 3),
"| FPR:", round(best$FPR, 3), "\n")
## Best threshold: 0.12 | TPR: 0 | FPR: 0
ggplot(roc_df, aes(FPR, TPR)) +
geom_path(linewidth = 1) +
geom_abline(slope = 1, intercept = 0, linetype = 2) +
geom_point(data = best, color = "red", size = 3) +
labs(title = "ROC — logistic classifier (light)",
subtitle = paste0("Best threshold = ", round(best$threshold,3),
" | TPR = ", round(best$TPR,3),
" | FPR = ", round(best$FPR,3)),
x = "False Positive Rate", y = "True Positive Rate") +
theme_minimal(base_size = 13) +
theme(plot.title = element_text(face = "bold"))
#Summary: 1) Exploratory analysis. Cleaned the dataset, produced summary statistics and histograms; examined pairwise correlations. Pearson was used as the primary measure (continuous, roughly linear relationships), with Spearman noted for robustness when monotonic but non-linear patterns appeared.
Baseline frequency model (GLM). Fitted Poisson GLMs with log link including linear terms, strategic quadratics, and key interactions. Model selection was done via AIC/BIC. Benchmarked predictions at a fixed profile and drew partial-dependence style plots, λ^(x) vs driver age.
Unsupervised segmentation. Ran k-means on (age, carage), chose K* by the gap statistic; visualised clusters and assigned each policy a cluster label.
Cluster-augmented GLM. Added the cluster label as x6 and re-fit the GLM to test whether segmentation improves deviance and interpretability.
Tree models. Built a Poisson regression tree (with pruning to 1-SE) to capture thresholds and provide rule-based insight.
Boosting. Trained Poisson boosting trees to learn residual nonlinearity the GLM can miss.
Hybrid model. Used the GLM as a base and added a boosting correction to combine interpretability with extra predictive lift.
Model comparison. Performed 10-fold cross-validation using out-of-fold Poisson deviance per exposure to compare (GLM), (GLM+cluster), (Tree), (Boosting), and (GLM+Boosting). Selected the best model balancing accuracy, stability, and governance.
High-risk classification. Defined High = 1(Counts ≥ 2), fitted a logistic regression on linear terms, plotted Pr(High=1) vs age (others fixed), produced the ROC curve, and chose the optimal threshold assuming equal cost of FP and FN, reported the corresponding TPR/FPR.