This report establishes a defensible, data-driven framework for predicting claim frequency in a motor portfolio, so that pricing, underwriting, and portfolio-steering decisions rest on evidence rather than intuition. Claim counts are assumed to follow N_i ~ Poisson(V_i λ(x_i)), with exposure V_i entering as an offset and covariates x = {weight, distance, age, carage, gender}. The work delivers validated rating relativities for pricing, risk segmentation that underwriters can act on, and an operational high-risk flag with a documented threshold. Together these support tariff setting, portfolio monitoring, fairness checks, and communication with non-technical stakeholders.
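
In symbols, the frequency model assumed throughout is

  N_i ~ Poisson(V_i · λ(x_i)),    log E[N_i] = log V_i + log λ(x_i),

so the log-rate log λ(x) is modelled on the covariates while the exposure enters only through the offset log V_i.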

Question (a)

#1 Load the libraries, set seed

library(cluster)
library(readr)
library(rpart)
library(rpart.plot)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ purrr     1.1.0
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidyr)
library(GGally)
library(scales)
## 
## Attaching package: 'scales'
## 
## The following object is masked from 'package:purrr':
## 
##     discard
## 
## The following object is masked from 'package:readr':
## 
##     col_factor
library(ggplot2)
library(dplyr)
library(magrittr)
## 
## Attaching package: 'magrittr'
## 
## The following object is masked from 'package:purrr':
## 
##     set_names
## 
## The following object is masked from 'package:tidyr':
## 
##     extract
library(MASS)
## 
## Attaching package: 'MASS'
## 
## The following object is masked from 'package:dplyr':
## 
##     select
library(mboost)
## Loading required package: parallel
## Loading required package: stabs
## 
## Attaching package: 'mboost'
## 
## The following object is masked from 'package:magrittr':
## 
##     extract
## 
## The following object is masked from 'package:tidyr':
## 
##     extract
## 
## The following object is masked from 'package:ggplot2':
## 
##     %+%
library(partykit)
## Loading required package: grid
## Loading required package: libcoin
## Loading required package: mvtnorm
## 
## Attaching package: 'partykit'
## 
## The following object is masked from 'package:mboost':
## 
##     varimp
library(gbm)
## Loaded gbm 2.2.2
## This version of gbm is no longer under development. Consider transitioning to gbm3, https://github.com/gbm-developers/gbm3
set.seed(10)

#2 Load the data set

CaseStudyData <- read_csv("CaseStudyData_ACST8040_2025.csv")
## Rows: 200000 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): gender
## dbl (6): Counts, distance, weight, age, carage, exposure
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(CaseStudyData)
## Rows: 200,000
## Columns: 7
## $ Counts   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1…
## $ gender   <chr> "male", "male", "female", "male", "male", "male", "female", "…
## $ distance <dbl> 17, 27, 24, 19, 31, 48, 18, 33, 22, 51, 43, 71, 52, 7, 71, 62…
## $ weight   <dbl> 2076, 1640, 1279, 2655, 964, 1458, 1231, 2085, 3935, 1267, 14…
## $ age      <dbl> 85, 44, 38, 81, 72, 44, 73, 44, 72, 69, 67, 32, 38, 80, 43, 3…
## $ carage   <dbl> 13, 10, 28, 35, 12, 16, 9, 16, 31, 9, 28, 7, 26, 10, 12, 7, 1…
## $ exposure <dbl> 0.9249603, 0.9642885, 0.9612256, 0.9163487, 0.8896720, 0.9424…
CaseStudyData <- CaseStudyData %>%
  mutate(
    gender = factor(gender),                   # categorical feature
    across(c(distance, weight, age, carage, exposure, Counts), as.numeric)
  )

CaseStudyData <- CaseStudyData %>% mutate(freq = Counts / pmax(exposure, 1e-9))

#3 Summary statistics

# Get numeric column names 
num_cols <- names(dplyr::select(CaseStudyData, tidyselect::where(is.numeric)))

# Summary stats for numeric variables
num_summary <- CaseStudyData %>%
  dplyr::summarise(
    dplyr::across(
      dplyr::all_of(num_cols),
      list(
        n    = ~sum(!is.na(.x)),
        mean = ~mean(.x, na.rm = TRUE),
        sd   = ~sd(.x,   na.rm = TRUE),
        p25  = ~quantile(.x, 0.25, na.rm = TRUE),
        p50  = ~quantile(.x, 0.50, na.rm = TRUE),
        p75  = ~quantile(.x, 0.75, na.rm = TRUE)
      ),
      .names = "{.col}_{.fn}"
    )
  ) %>%
  tidyr::pivot_longer(
    everything(),
    names_to = c("variable", ".value"),
    names_sep = "_"
  )

print(num_summary)
## # A tibble: 7 × 7
##   variable      n     mean       sd      p25      p50      p75
##   <chr>     <int>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
## 1 Counts   200000    0.159   0.408     0        0        0    
## 2 distance 200000   36.1    21.0      19       33       51    
## 3 weight   200000 1915.    758.     1316     1809     2451    
## 4 age      200000   56.6    20.0      38       59       76    
## 5 carage   200000   21.5    11.1      11       20       32    
## 6 exposure 200000    0.900   0.0577    0.850    0.900    0.950
## 7 freq     200000    0.176   0.454     0        0        0

#4 Plot histograms for numeric variables

if (!exists("CaseStudyData", inherits = FALSE) || is.function(get("CaseStudyData"))) {
  path <- "/mnt/data/CaseStudyData_ACST8040_2025 (1).csv"  # change if running locally
  CaseStudyData <- readr::read_csv(path, show_col_types = FALSE)
}

CaseStudyData <- CaseStudyData |>
  dplyr::mutate(
    gender = factor(gender),
    dplyr::across(c(distance, weight, age, carage, exposure, Counts),
                  ~ readr::parse_number(as.character(.)))
  )


num_cols <- names(CaseStudyData)[vapply(CaseStudyData, is.numeric, logical(1))]
long_num <- CaseStudyData |>
  dplyr::select(dplyr::all_of(num_cols)) |>
  tidyr::pivot_longer(dplyr::everything(), names_to = "variable", values_to = "value") |>
  dplyr::filter(is.finite(value))

stats_mm <- long_num |>
  dplyr::group_by(variable) |>
  dplyr::summarise(mean = mean(value), median = median(value), .groups = "drop")


fill_hist <- "#CBD5E1"; col_mean <- "#2563EB"; col_med <- "#DC2626"

p_num <- ggplot(long_num, aes(x = value)) +
  geom_histogram(aes(y = after_stat(density)),
                 bins = 40, fill = fill_hist, color = "white", linewidth = 0.25) +
  geom_density(linewidth = 0.7) +
  geom_vline(data = stats_mm, aes(xintercept = mean),  color = col_mean, linewidth = 0.5) +
  geom_vline(data = stats_mm, aes(xintercept = median), color = col_med,  linewidth = 0.5, linetype = 2) +
  facet_wrap(~ variable, scales = "free", ncol = 3) +
  scale_x_continuous(labels = scales::label_number(big.mark = ",")) +
  labs(title = "Histograms of Numeric Variables",
       subtitle = "Density overlay + mean (blue) & median (red)",
       x = NULL, y = "Density") +
  theme_minimal(base_size = 13) +
  theme(plot.title = element_text(face = "bold"),
        strip.text  = element_text(face = "bold"),
        panel.grid.minor = element_blank())
print(p_num)

#5 Plot bar charts for categorical variables

cat_cols <- names(CaseStudyData)[vapply(CaseStudyData, \(x) is.factor(x) || is.character(x), logical(1))]
if (length(cat_cols) > 0) {
  long_cat <- CaseStudyData |>
    dplyr::select(dplyr::all_of(cat_cols)) |>
    dplyr::mutate(dplyr::across(dplyr::everything(), as.factor)) |>
    tidyr::pivot_longer(dplyr::everything(), names_to = "variable", values_to = "level")

  p_cat <- ggplot(long_cat, aes(x = level)) +
    geom_bar(fill = fill_hist, color = "white", linewidth = 0.25) +
    facet_wrap(~ variable, scales = "free_x", ncol = 3) +
    labs(title = "Categorical Variables", x = NULL, y = "Count") +
    theme_minimal(base_size = 13) +
    theme(plot.title = element_text(face = "bold"),
          strip.text  = element_text(face = "bold"),
          axis.text.x = element_text(angle = 30, hjust = 1))
  print(p_cat)
}

#6 Build a numeric matrix for pairwise correlations

num_idx <- vapply(CaseStudyData, is.numeric, logical(1))
X_num  <- CaseStudyData[, num_idx, drop = FALSE]

cat_idx <- vapply(CaseStudyData, function(x) is.factor(x) || is.character(x), logical(1))
if (any(cat_idx)) {
  X_cat <- stats::model.matrix(~ . - 1, data = CaseStudyData[, cat_idx, drop = FALSE]) |> as.data.frame()
  X_all <- cbind(X_num, X_cat)
} else {
  X_all <- X_num
}

# Correlation matrices
cor_pearson  <- stats::cor(X_all, use = "pairwise.complete.obs", method = "pearson")
cor_spearman <- stats::cor(X_all, use = "pairwise.complete.obs", method = "spearman")

print(round(cor_pearson, 3))
##              Counts distance weight    age carage exposure   freq genderfemale
## Counts        1.000    0.044  0.051  0.114  0.136    0.020  0.998       -0.003
## distance      0.044    1.000  0.000  0.000  0.000   -0.001  0.043       -0.002
## weight        0.051    0.000  1.000  0.002  0.000   -0.002  0.052       -0.002
## age           0.114    0.000  0.002  1.000  0.000   -0.001  0.113        0.000
## carage        0.136    0.000  0.000  0.000  1.000   -0.005  0.136        0.002
## exposure      0.020   -0.001 -0.002 -0.001 -0.005    1.000 -0.005        0.003
## freq          0.998    0.043  0.052  0.113  0.136   -0.005  1.000       -0.003
## genderfemale -0.003   -0.002 -0.002  0.000  0.002    0.003 -0.003        1.000
## gendermale    0.003    0.002  0.002  0.000 -0.002   -0.003  0.003       -1.000
##              gendermale
## Counts            0.003
## distance          0.002
## weight            0.002
## age               0.000
## carage           -0.002
## exposure         -0.003
## freq              0.003
## genderfemale     -1.000
## gendermale        1.000
print(round(cor_spearman, 3))
##              Counts distance weight    age carage exposure   freq genderfemale
## Counts        1.000    0.041  0.047  0.105  0.129    0.019  0.997       -0.003
## distance      0.041    1.000  0.000  0.000  0.001   -0.001  0.041       -0.003
## weight        0.047    0.000  1.000  0.003  0.000   -0.003  0.047       -0.002
## age           0.105    0.000  0.003  1.000  0.000   -0.001  0.104        0.000
## carage        0.129    0.001  0.000  0.000  1.000   -0.005  0.129        0.002
## exposure      0.019   -0.001 -0.003 -0.001 -0.005    1.000 -0.008        0.003
## freq          0.997    0.041  0.047  0.104  0.129   -0.008  1.000       -0.004
## genderfemale -0.003   -0.003 -0.002  0.000  0.002    0.003 -0.004        1.000
## gendermale    0.003    0.003  0.002  0.000 -0.002   -0.003  0.004       -1.000
##              gendermale
## Counts            0.003
## distance          0.003
## weight            0.002
## age               0.000
## carage           -0.002
## exposure         -0.003
## freq              0.004
## genderfemale     -1.000
## gendermale        1.000

#7 Plot the correlation heatmaps

plot_corr <- function(C, title) {
  ord <- stats::hclust(stats::as.dist(1 - abs(C)))$order
  C_ord <- C[ord, ord]
  df <- as.data.frame(as.table(C_ord))
  names(df) <- c("Var1","Var2","value")
  ggplot(df, aes(Var1, Var2, fill = value)) +
    geom_tile(color = "white", linewidth = 0.25) +
    scale_fill_gradient2(limits = c(-1, 1), oob = scales::squish,
                         low = "#B91C1C", mid = "#F3F4F6", high = "#1D4ED8", midpoint = 0) +
    coord_fixed() +
    labs(title = title, x = NULL, y = NULL, fill = "ρ") +
    theme_minimal(base_size = 12) +
    theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1),
          plot.title = element_text(face = "bold"),
          panel.grid = element_blank())
}

p1 <- plot_corr(cor_pearson,  "Pearson correlation (numeric + factor dummies)")
p2 <- plot_corr(cor_spearman, "Spearman correlation (numeric + factor dummies)")
print(p1); print(p2)

Question (b)

#8 Estimate the Poisson GLM

DF <- if (exists("CaseStudyData", inherits = FALSE) && !is.function(get("CaseStudyData"))) {
  CaseStudyData
} else if (exists("data", inherits = FALSE) && !is.function(get("data"))) {
  data
} else {
  # Load if needed (edit path if not running in this session)
  read_csv("/mnt/data/CaseStudyData_ACST8040_2025 (1).csv", show_col_types = FALSE)
}

# Fix column types
DF <- DF %>%
  mutate(
    gender   = factor(gender),
    distance = readr::parse_number(as.character(distance)),
    weight   = readr::parse_number(as.character(weight)),
    age      = readr::parse_number(as.character(age)),
    carage   = readr::parse_number(as.character(carage)),
    exposure = readr::parse_number(as.character(exposure)),
    Counts   = readr::parse_number(as.character(Counts))
  )

# Create full model formula
# x1=weight
# x2=distance 
# x3=age 
# x4=carage
# x5=gender

form_full <- Counts ~
  gender +
  weight + distance + age + carage +
  I(weight^2) + I(distance^2) + I(age^2) + I(carage^2) +
  weight:distance + weight:age + weight:carage +
  distance:age + distance:carage + age:carage +
  offset(log(exposure))

# Fit full Poisson
m_full <- glm(form_full, data = DF, family = poisson(link = "log"))

#9 Select the GLM model

# AIC-based stepwise 
m_step_aic <- stepAIC(m_full, direction = "both", trace = FALSE)

# Final model
m_final <- m_step_aic
summary(m_final)           
## 
## Call:
## glm(formula = Counts ~ gender + weight + distance + age + carage + 
##     I(age^2) + I(carage^2) + weight:distance + weight:age + age:carage + 
##     offset(log(exposure)), family = poisson(link = "log"), data = DF)
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     -6.778e+00  1.188e-01 -57.063  < 2e-16 ***
## gendermale       1.655e-02  1.148e-02   1.441    0.150    
## weight           2.316e-04  2.772e-05   8.353  < 2e-16 ***
## distance         6.260e-03  7.230e-04   8.658  < 2e-16 ***
## age              6.797e-02  3.033e-03  22.413  < 2e-16 ***
## carage           1.332e-01  4.197e-03  31.744  < 2e-16 ***
## I(age^2)        -1.858e-04  2.310e-05  -8.046 8.54e-16 ***
## I(carage^2)     -5.941e-04  7.456e-05  -7.968 1.62e-15 ***
## weight:distance -5.171e-07  3.352e-07  -1.542    0.123    
## weight:age      -6.707e-07  3.755e-07  -1.786    0.074 .  
## age:carage      -1.162e-03  3.006e-05 -38.648  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 125453  on 199999  degrees of freedom
## Residual deviance: 116186  on 199989  degrees of freedom
## AIC: 175303
## 
## Number of Fisher Scoring iterations: 6
coef(m_final)              
##     (Intercept)      gendermale          weight        distance             age 
##   -6.777607e+00    1.654691e-02    2.315685e-04    6.259824e-03    6.797319e-02 
##          carage        I(age^2)     I(carage^2) weight:distance      weight:age 
##    1.332362e-01   -1.858359e-04   -5.940642e-04   -5.170794e-07   -6.706863e-07 
##      age:carage 
##   -1.161681e-03
# Small check for over-dispersion
dispersion <- sum(residuals(m_final, type = "pearson")^2) / m_final$df.residual
cat("Dispersion estimate:", round(dispersion, 3), "\n")
## Dispersion estimate: 1.004
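
A complementary check (a minimal sketch, not part of the fitted pipeline) is a formal score test for overdispersion; this assumes the AER package is available:

# Optional: score test of H0 equidispersion vs H1 Var > mean (assumes AER installed)
if (requireNamespace("AER", quietly = TRUE)) {
  print(AER::dispersiontest(m_final))
}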

#9.1 (continued) Benchmark λ(x)

#  Benchmark λ(x): exposure=1 
gender_val <- if ("female" %in% levels(DF$gender)) "female" else levels(DF$gender)[1]

x_benchmark <- data.frame(
  weight = 1500,
  distance = 10,
  age = 30,
  carage = 4,
  gender = factor(gender_val, levels = levels(DF$gender)),
  exposure = 1
)

lambda_hat <- predict(m_final, newdata = x_benchmark, type = "response")  # this is λ(x)
cat("Benchmark λ(x) at (1500,10,30,4,female):", round(lambda_hat, 6), "\n")
## Benchmark λ(x) at (1500,10,30,4,female): 0.015771
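
As a sanity check (a sketch, assuming x_benchmark is exactly as built above), the same rate can be reproduced from the coefficients by hand; offset terms are not columns of the model matrix, and log(exposure) = 0 here:

# Manual reproduction of predict(): eta = x'beta, then lambda = exp(eta)
X_bm <- model.matrix(delete.response(terms(m_final)), data = x_benchmark)
exp(drop(X_bm %*% coef(m_final)))   # should match lambda_hat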

#10 Plot λ(x) vs x3 = age

age_grid <- data.frame(
  weight = 1500,
  distance = 10,
  age = seq(16, 85, by = 1),  
  carage = 4,
  gender = factor(gender_val, levels = levels(DF$gender)),
  exposure = 1
)

pred <- predict(m_final, newdata = age_grid, type = "link", se.fit = TRUE)
age_grid$lambda  <- exp(pred$fit)
age_grid$lower   <- exp(pred$fit - 1.96 * pred$se.fit)
age_grid$upper   <- exp(pred$fit + 1.96 * pred$se.fit)

ggplot(age_grid, aes(x = age, y = lambda)) +
  geom_ribbon(aes(ymin = lower, ymax = upper), alpha = 0.15) +
  geom_line(linewidth = 0.8) +
  labs(title = expression(hat(lambda)(x)~"vs driver age (others fixed)"),
       subtitle = "Fixed at weight=1500, distance=10, carage=4, gender=female; exposure = 1",
       x = "Driver age (x3)", y = expression(hat(lambda)(x))) +
  theme_minimal(base_size = 13) +
  theme(plot.title = element_text(face = "bold"))

Question (c)

#11 Elbow method: total within-cluster SS vs K (age, carage)

# Age (x3) and carage (x4) 
# Numeric + drop NA
X <- DF |>
  dplyr::mutate(
    age    = readr::parse_number(as.character(age)),
    carage = readr::parse_number(as.character(carage))
  ) |>
  dplyr::select(dplyr::all_of(c("age","carage"))) |>
  tidyr::drop_na()

# Standardise the variables
X_s <- scale(X)

Kmax <- 10
elbow <- data.frame(K = integer(), tot_withinss = double(), betweenss = double(), totss = double())

# Elbow plot: total within-cluster SS vs K 
for (k in 1:Kmax) {
  km <- kmeans(X_s, centers = k, nstart = 20, iter.max = 100)
  elbow <- rbind(elbow,
                 data.frame(K = k,
                            tot_withinss = km$tot.withinss,
                            betweenss    = km$betweenss,
                            totss        = km$totss))
}
ggplot(elbow, aes(K, tot_withinss)) +
  geom_line() + geom_point(size = 2) +
  labs(title = "Elbow plot (age, carage)",
       subtitle = "Total within-cluster sum of squares vs K",
       x = "K (number of clusters)", y = "Total within-cluster SS") +
  theme_minimal(base_size = 13) +
  theme(plot.title = element_text(face = "bold"))

#12 Gap statistic (subsampled for speed on big data)

n_gap <- min(nrow(X_s), 5000)            # subsample 5,000 points to keep clusGap tractable
idx <- sample.int(nrow(X_s), n_gap)
FUNkm <- function(x, k) stats::kmeans(x, centers = k, nstart = 20, iter.max = 100)
gap <- cluster::clusGap(X_s[idx, , drop = FALSE], FUN = FUNkm, K.max = Kmax, B = 50)  # B=50 is a good default
## Warning: Quick-TRANSfer stage steps exceeded maximum (= 250000)
## Warning: Quick-TRANSfer stage steps exceeded maximum (= 250000)
gap_df <- as.data.frame(gap$Tab)
gap_df$K <- 1:Kmax

ggplot(gap_df, aes(K, gap)) +
  geom_line() + geom_point(size = 2) +
  geom_errorbar(aes(ymin = gap - SE.sim, ymax = gap + SE.sim), width = 0.2) +
  labs(title = "Gap statistic vs K (age, carage)",
       x = "K", y = "Gap") +
  theme_minimal(base_size = 13) +
  theme(plot.title = element_text(face = "bold"))

# Select K* 
K_star <- cluster::maxSE(gap$Tab[,"gap"], gap$Tab[,"SE.sim"], method = "Tibs2001SEmax")
cat("Selected K* by gap statistic:", K_star, "\n")
## Selected K* by gap statistic: 4
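
For reference, clusGap implements the gap statistic of Tibshirani, Walther and Hastie (2001): with W_K the pooled within-cluster sum of squares and B reference data sets drawn uniformly over the range of each variable,

  Gap(K) = (1/B) Σ_b log(W*_{K,b}) − log(W_K),

and the "Tibs2001SEmax" rule selects the smallest K with Gap(K) ≥ Gap(K+1) − s_{K+1}, where s_{K+1} is the simulation standard error.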

#13 Finalise k-means; scatter plot with cluster colours + centroids

km_final <- kmeans(X_s, centers = K_star, nstart = 50, iter.max = 200)
DF$cluster_k <- factor(km_final$cluster)

# Recover centroids on original scale for plotting
cent_scaled <- km_final$centers
cent_orig <- sweep(cent_scaled, 2, attr(X_s, "scaled:scale"), "*")
cent_orig <- sweep(cent_orig, 2, attr(X_s, "scaled:center"), "+")
cent_df <- data.frame(age = cent_orig[, "age"], carage = cent_orig[, "carage"],
                      cluster_k = factor(seq_len(K_star)))

ggplot(DF, aes(x = age, y = carage, color = cluster_k)) +
  geom_point(alpha = 0.35, size = 0.7) +
  geom_point(data = cent_df, aes(x = age, y = carage), inherit.aes = FALSE,
             shape = 8, size = 3, color = "black") +
  labs(title = "K-means clusters on (age, carage)",
       subtitle = paste0("K* = ", K_star, " by gap statistic"),
       x = "Driver age (x3)", y = "Car age (x4)", color = "Cluster") +
  theme_minimal(base_size = 13) +
  theme(plot.title = element_text(face = "bold"))

Question (d)

#14 Rebuild clusters

X <- DF %>%
  dplyr::mutate(
    age    = readr::parse_number(as.character(age)),
    carage = readr::parse_number(as.character(carage))
  ) %>%
  dplyr::select(age, carage) %>%   
  tidyr::drop_na()

X_s <- scale(X)

# choose K* with gap 
n_gap <- min(nrow(X_s), 10000)
idx   <- sample.int(nrow(X_s), n_gap)
FUNkm <- function(x, k) stats::kmeans(x, centers=k, nstart=15, iter.max=200, algorithm="Lloyd")
gap   <- cluster::clusGap(X_s[idx, , drop=FALSE], FUN=FUNkm, K.max=10, B=30)
K_star <- cluster::maxSE(gap$Tab[,"gap"], gap$Tab[,"SE.sim"], method="Tibs2001SEmax")

# final kmeans on full data with K*
km_final <- kmeans(X_s, centers = K_star, nstart = 50, iter.max = 200)
DF$cluster_k <- factor(km_final$cluster)   # x6

# Predicting cluster on new (age,carage)
sc_center <- attr(X_s, "scaled:center")
sc_scale  <- attr(X_s, "scaled:scale")
assign_cluster <- function(age, carage, centers, center, scale) {
  z1 <- (age   - center["age"])   / scale["age"]
  z2 <- (carage- center["carage"])/ scale["carage"]
  d2 <- rowSums((centers - matrix(c(z1, z2), nrow=nrow(centers), ncol=2, byrow=TRUE))^2)
  which.min(d2)
}
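
A quick usage check of the helper (values depend on the centres fitted in this run):

# Example: cluster assignment for the benchmark profile (age = 30, carage = 4)
assign_cluster(30, 4, km_final$centers, sc_center, sc_scale)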

#15 Form full GLM (linear + quadratics + mixed + cluster interactions)

form_full <- Counts ~
  gender + cluster_k +
  weight + distance + age + carage +
  I(weight^2) + I(distance^2) + I(age^2) + I(carage^2) +
  weight:distance + weight:age + weight:carage +
  distance:age + distance:carage + age:carage +
  cluster_k:weight + cluster_k:distance + cluster_k:age + cluster_k:carage +
  offset(log(exposure))

m_full <- glm(form_full, data = DF, family = poisson(link="log"))

#16 Select the model, Benchmark λ(x)

# Model Selection
m_final <- MASS::stepAIC(m_full, direction="both", trace=FALSE)  # use stats::step(..., k=2) if MASS unavailable
summary(m_final)
## 
## Call:
## glm(formula = Counts ~ cluster_k + weight + distance + age + 
##     carage + I(age^2) + weight:distance + weight:age + distance:age + 
##     age:carage + cluster_k:distance + cluster_k:age + offset(log(exposure)), 
##     family = poisson(link = "log"), data = DF)
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         -3.099e+00  2.328e-01 -13.314  < 2e-16 ***
## cluster_k2           2.161e+00  4.990e-01   4.331 1.49e-05 ***
## cluster_k3           1.157e+00  2.020e-01   5.728 1.02e-08 ***
## cluster_k4           2.545e+00  5.044e-01   5.047 4.50e-07 ***
## weight               2.313e-04  2.763e-05   8.370  < 2e-16 ***
## distance             3.625e-03  1.968e-03   1.842 0.065520 .  
## age                 -3.389e-02  9.003e-03  -3.764 0.000167 ***
## carage               5.347e-03  5.069e-03   1.055 0.291469    
## I(age^2)             2.273e-04  1.053e-04   2.158 0.030911 *  
## weight:distance     -5.262e-07  3.353e-07  -1.569 0.116637    
## weight:age          -6.569e-07  3.754e-07  -1.750 0.080168 .  
## distance:age         1.099e-04  3.986e-05   2.756 0.005846 ** 
## age:carage           1.910e-04  7.868e-05   2.428 0.015193 *  
## cluster_k2:distance -5.996e-03  1.957e-03  -3.065 0.002180 ** 
## cluster_k3:distance -1.842e-03  1.203e-03  -1.532 0.125596    
## cluster_k4:distance -5.323e-03  1.931e-03  -2.757 0.005840 ** 
## cluster_k2:age      -2.995e-03  9.181e-03  -0.326 0.744271    
## cluster_k3:age       8.265e-03  4.700e-03   1.758 0.078689 .  
## cluster_k4:age      -9.326e-03  9.268e-03  -1.006 0.314267    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 125453  on 199999  degrees of freedom
## Residual deviance: 115141  on 199981  degrees of freedom
## AIC: 174274
## 
## Number of Fisher Scoring iterations: 6
dispersion <- sum(residuals(m_final, type="pearson")^2) / m_final$df.residual
cat("Dispersion:", round(dispersion, 3), "\n")
## Dispersion: 1.006
# Benchmark λ(x) at (x1=1500, x2=10, x3=30, x4=4, x5=female); exposure=1
gender_val <- if ("female" %in% levels(DF$gender)) "female" else levels(DF$gender)[1]
bench_cluster <- assign_cluster(age=30, carage=4, centers=km_final$centers,
                                center=sc_center, scale=sc_scale)

x_benchmark <- data.frame(
  weight = 1500, distance = 10, age = 30, carage = 4,
  gender = factor(gender_val, levels = levels(DF$gender)),
  cluster_k = factor(bench_cluster, levels = levels(DF$cluster_k)),
  exposure = 1
)

lambda_hat <- predict(m_final, newdata = x_benchmark, type = "response")
cat("Benchmark λ(x) with cluster =", as.character(x_benchmark$cluster_k), ":",
    round(lambda_hat, 6), "\n")
## Benchmark λ(x) with cluster = 1 : 0.030556

#17 Plot λ(x) vs age with others fixed

age_seq <- seq(16, 85, by = 1)
age_clusters <- vapply(age_seq, function(a)
  assign_cluster(a, 4, km_final$centers, sc_center, sc_scale), integer(1))

age_grid <- data.frame(
  weight = 1500, distance = 10, age = age_seq, carage = 4,
  gender = factor(gender_val, levels = levels(DF$gender)),
  cluster_k = factor(age_clusters, levels = levels(DF$cluster_k)),
  exposure = 1
)

pred <- predict(m_final, newdata = age_grid, type = "link", se.fit = TRUE)
age_grid$lambda <- exp(pred$fit)
age_grid$lower  <- exp(pred$fit - 1.96 * pred$se.fit)
age_grid$upper  <- exp(pred$fit + 1.96 * pred$se.fit)

ggplot(age_grid, aes(x = age, y = lambda)) +
  geom_ribbon(aes(ymin = lower, ymax = upper), alpha = 0.15) +
  geom_line(linewidth = 0.9) +
  labs(title = expression(hat(lambda)(x)~"vs driver age with cluster x"[6]),
       subtitle = paste0("Fixed: weight=1500, distance=10, carage=4, gender=", gender_val,
                         "; exposure=1; cluster chosen by k-means for each age"),
       x = "Driver age (x3)", y = expression(hat(lambda)(x))) +
  theme_minimal(base_size = 13) +
  theme(plot.title = element_text(face = "bold"))

Question (e)

#18 Grow Poisson regression tree

tree0 <- rpart(
  formula = Counts ~ weight + distance + age + carage + gender + offset(log(exposure)),
  data    = DF,
  method  = "poisson",
  control = rpart.control(cp = 0.001, xval = 10, minbucket = 500, maxdepth = 6) 
)

#19 Choose optimal size of the tree, plot the final tree

ctab   <- tree0$cptable
i_min  <- which.min(ctab[,"xerror"])
xerr_min <- ctab[i_min,"xerror"]; xstd_min <- ctab[i_min,"xstd"]
# 1-SE rule: pick the simplest tree whose CV error is within one SE of the minimum
cp_1se <- ctab[ min(which(ctab[,"xerror"] <= xerr_min + xstd_min)), "CP" ]

tree <- prune.rpart(tree0, cp = cp_1se)

printcp(tree0)       # CV table
## 
## Rates regression tree:
## rpart(formula = Counts ~ weight + distance + age + carage + gender + 
##     offset(log(exposure)), data = DF, method = "poisson", control = rpart.control(cp = 0.001, 
##     xval = 10, minbucket = 500, maxdepth = 6))
## 
## Variables actually used in tree construction:
## [1] age    carage weight
## 
## Root node error: 125453/200000 = 0.62726
## 
## n= 200000 
## 
##          CP nsplit rel error  xerror      xstd
## 1 0.0351320      0   1.00000 1.00002 0.0034205
## 2 0.0032330      2   0.92974 0.92981 0.0032510
## 3 0.0012241      3   0.92650 0.92667 0.0032438
## 4 0.0010772      4   0.92528 0.92581 0.0032411
## 5 0.0010000      5   0.92420 0.92484 0.0032374
tree$variable.importance  # variable importance
##        age     carage     weight   distance     gender 
## 5325.89354 3915.00566  314.26181   16.60581   13.15646
# Plot the tree 
rpart.plot(tree, type = 2, extra = 101, under = TRUE, faclen = 0, varlen = 0, cex = 0.75,
           main = "Poisson regression tree for claim frequency λ(x)")

#20 Benchmark λ(x) as in (b)

gender_val <- if ("female" %in% levels(DF$gender)) "female" else levels(DF$gender)[1]
x_bench <- data.frame(
  weight = 1500, distance = 10, age = 30, carage = 4,
  gender = factor(gender_val, levels = levels(DF$gender)),
  exposure = 1
)
lambda_hat <- predict(tree, newdata = x_bench, type = "vector") 
cat("Benchmark λ(x) from tree:", round(lambda_hat, 6), "\n")
## Benchmark λ(x) from tree: 0.039014

#21 Plot λ(x) vs age with others fixed

#tree predictions
age_grid <- data.frame(
  weight = 1500, distance = 10, age = seq(16, 85, by = 1), carage = 4,
  gender = factor(gender_val, levels = levels(DF$gender)),
  exposure = 1
)
age_grid$lambda <- predict(tree, newdata = age_grid, type = "vector")

ggplot(age_grid, aes(x = age, y = lambda)) +
  geom_line(linewidth = 0.9) +
  labs(title = expression(hat(lambda)(x)~"vs driver age — Poisson tree"),
       subtitle = "Fixed at weight=1500, distance=10, carage=4, gender=female; exposure=1",
       x = "Driver age (x3)", y = expression(hat(lambda)(x))) +
  theme_minimal(base_size = 13) +
  theme(plot.title = element_text(face = "bold"))

Question (f)

#22 Grow a Poisson boosting tree (no base model)

ctrl <- mboost::boost_control(mstop = 500, nu = 0.05)   # Shrinkage ν = 0.05 
tctrl <- partykit::ctree_control(maxdepth = 3, minbucket = 1000)  # Tree size 

bb0 <- mboost::blackboost(
  Counts ~ weight + distance + age + carage + gender,
  data = DF,
  family = mboost::Poisson(),              
  offset = log(DF$exposure),               # exposure enters as a log-offset
  control = ctrl,
  tree_controls = tctrl
)

#23 Benchmark λ(x) as in (b): exposure = 1 so prediction is λ(x)

gender_val <- if ("female" %in% levels(DF$gender)) "female" else levels(DF$gender)[1]
x_bench <- data.frame(
  weight = 1500, distance = 10, age = 30, carage = 4,
  gender = factor(gender_val, levels = levels(DF$gender)),
  exposure = 1
)

# blackboost predicts on the link scale; the training offset is ignored for newdata
# (see warning below), which is harmless here since exposure = 1 gives a zero log-offset
eta_bench   <- predict(bb0, newdata = x_bench, type = "link")
## Warning in object$predict(newdata = newdata, which = which, aggregate =
## aggregate): User-specified offset is not a scalar, thus it cannot be used for
## predictions when 'newdata' is specified.
lambda_bench <- exp(eta_bench)
cat("Benchmark λ(x) from boosting:", round(lambda_bench, 6), "\n")
## Benchmark λ(x) from boosting: 0.062315

#24 Plot λ(x) vs age with other predictors fixed

age_grid <- data.frame(
  weight = 1500, distance = 10, age = seq(16, 85, by = 1), carage = 4,
  gender = factor(gender_val, levels = levels(DF$gender)),
  exposure = 1
)
eta_grid <- predict(bb0, newdata = age_grid, type = "link")
## Warning in object$predict(newdata = newdata, which = which, aggregate =
## aggregate): User-specified offset is not a scalar, thus it cannot be used for
## predictions when 'newdata' is specified.
age_grid$lambda <- exp(eta_grid)

ggplot(age_grid, aes(x = age, y = lambda)) +
  geom_line(linewidth = 0.9) +
  labs(title = expression(hat(lambda)(x)~"vs driver age — Poisson boosting tree"),
       subtitle = paste0("Fixed: weight=1500, distance=10, carage=4, gender=", gender_val,
                         "; exposure=1 | ν=0.05, maxdepth=3, mstop=", mboost::mstop(bb0)),
       x = "Driver age (x3)", y = expression(hat(lambda)(x))) +
  theme_minimal(base_size = 13) +
  theme(plot.title = element_text(face = "bold"))

Question (g)

#25 Grow a Poisson boosting tree (with base model)

form_full_b <- Counts ~
  gender +
  weight + distance + age + carage +
  I(weight^2) + I(distance^2) + I(age^2) + I(carage^2) +
  weight:distance + weight:age + weight:carage +
  distance:age + distance:carage + age:carage +
  offset(log(exposure))

m_glm_full  <- glm(form_full_b, data = DF, family = poisson(link = "log"))
m_glm_best  <- suppressWarnings(MASS::stepAIC(m_glm_full, direction = "both", trace = FALSE))

# Log-rate from GLM base (exposure set to 1 so this is log λ_hat)
eta_glm <- predict(m_glm_best, newdata = transform(DF, exposure = 1), type = "link")

#26 Boosting tree that corrects the GLM

ctrl  <- mboost::boost_control(mstop = 500, nu = 0.05)              # shrinkage ν
tctrl <- partykit::ctree_control(maxdepth = 3, minbucket = 1000)    # tree size
bb_glm_base <- mboost::blackboost(
  Counts ~ weight + distance + age + carage + gender,               # NO x6
  data    = DF,
  family  = mboost::Poisson(),
  offset  = log(DF$exposure) + eta_glm,                             # base = GLM
  control = ctrl,
  tree_controls = tctrl
)
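
Conceptually, the hybrid keeps the GLM as the base and lets boosting learn an additive correction on the log scale,

  log λ(x) = η_GLM(x) + f_boost(x),   i.e.   λ(x) = λ_GLM(x) · exp(f_boost(x)),

so f_boost(x) stays near zero wherever the GLM is already adequate.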

#27 Tune number of boosting steps

folds <- mboost::cv(model.weights(bb_glm_base), type = "kfold", B = 5)
grid  <- seq(50, 1500, by = 50)
cvr   <- mboost::cvrisk(bb_glm_base, folds = folds, grid = grid)
mboost::mstop(bb_glm_base) <- mboost::mstop(cvr)
cat("Chosen mstop:", mboost::mstop(bb_glm_base),
    "| shrinkage ν:", ctrl$nu, "| tree maxdepth:", tctrl$maxdepth, "\n")
## Chosen mstop: 50 | shrinkage ν: 0.05 | tree maxdepth: 3

#28 Benchmark λ(x) as in (b); exposure=1

gender_val <- if ("female" %in% levels(DF$gender)) "female" else levels(DF$gender)[1]
x_bench <- data.frame(
  weight = 1500, distance = 10, age = 30, carage = 4,
  gender = factor(gender_val, levels = levels(DF$gender)),
  exposure = 1
)

eta_base_bench <- predict(m_glm_best, newdata = x_bench, type = "link")  # log λ from GLM
eta_boost_corr <- predict(bb_glm_base, newdata = x_bench, type = "link") # correction f(x)
## Warning in object$predict(newdata = newdata, which = which, aggregate =
## aggregate): User-specified offset is not a scalar, thus it cannot be used for
## predictions when 'newdata' is specified.
lambda_bench   <- exp(eta_base_bench + eta_boost_corr)
cat("Benchmark λ(x) from GLM+boosting:", round(lambda_bench, 6), "\n")
## Benchmark λ(x) from GLM+boosting: 0.015771

#29 Plot λ(x) vs age with others fixed; exposure=1

age_grid <- data.frame(
  weight = 1500, distance = 10, age = seq(16, 85, by = 1), carage = 4,
  gender = factor(gender_val, levels = levels(DF$gender)),
  exposure = 1
)
eta_base <- predict(m_glm_best, newdata = age_grid, type = "link")
eta_corr <- predict(bb_glm_base, newdata = age_grid, type = "link")
## Warning in object$predict(newdata = newdata, which = which, aggregate =
## aggregate): User-specified offset is not a scalar, thus it cannot be used for
## predictions when 'newdata' is specified.
age_grid$lambda <- exp(eta_base + eta_corr)

ggplot(age_grid, aes(x = age, y = lambda)) +
  geom_line(linewidth = 0.9) +
  labs(title = expression(hat(lambda)(x)~"vs driver age — GLM base + boosting tree"),
       subtitle = paste0("Base: best GLM (b); correction: boosted trees | ν=", ctrl$nu,
                         ", maxdepth=", tctrl$maxdepth, ", mstop=", mboost::mstop(bb_glm_base),
                         " | fixed: weight=1500, distance=10, carage=4, gender=", gender_val,
                         "; exposure=1"),
       x = "Driver age (x3)", y = expression(hat(lambda)(x))) +
  theme_minimal(base_size = 13) +
  theme(plot.title = element_text(face = "bold"))

Question (h)

#30 Compare the models fitted in (b) GLM, (d) GLM+cluster, (e) Poisson tree, (f) Poisson boosting (no base), and (g) Poisson boosting (with base), using 10-fold cross-validation error to select the best model. The comparison reports mean Poisson deviance across folds (lower is better).

# Poisson deviance; for y = 0 the unit contribution is 2*mu (not -2*mu)
poisson_dev <- function(y, mu){ mu <- pmax(mu,1e-12); 2*sum(ifelse(y==0, mu, y*log(y/mu)-(y-mu))) }
kfold_index <- function(n, K=10) sample(rep(1:K, length.out = n))

# Robust centroid assignment (handles unnamed centre/scale attributes)
assign_cluster <- function(age, carage, centers, center, scale){
  C <- as.matrix(centers)
  c_age <- if (!is.null(names(center)) && "age"    %in% names(center)) center[["age"]]    else center[1]
  c_car <- if (!is.null(names(center)) && "carage" %in% names(center)) center[["carage"]] else center[2]
  s_age <- if (!is.null(names(scale))  && "age"    %in% names(scale))  scale[["age"]]     else scale[1]
  s_car <- if (!is.null(names(scale))  && "carage" %in% names(scale))  scale[["carage"]]  else scale[2]
  if (isTRUE(all.equal(s_age,0))) s_age <- 1; if (isTRUE(all.equal(s_car,0))) s_car <- 1
  z <- c((age-c_age)/s_age, (carage-c_car)/s_car)
  target <- matrix(z, nrow=nrow(C), ncol=2, byrow=TRUE)
  which.min(rowSums((C - target)^2))
}
choose_Kstar <- function(Xs, Kmax=10, n_sample=3000, B=10){        # cheaper GAP
  idx <- sample.int(nrow(Xs), min(nrow(Xs), n_sample))
  FUNkm <- function(x,k) stats::kmeans(x, centers=k, nstart=10, iter.max=100)
  gap <- cluster::clusGap(Xs[idx,,drop=FALSE], FUN=FUNkm, K.max=Kmax, B=B)
  cluster::maxSE(gap$Tab[,"gap"], gap$Tab[,"SE.sim"], method="Tibs2001SEmax")
}
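
For reference, poisson_dev implements the Poisson deviance

  D(y, μ) = 2 Σ_i [ y_i log(y_i / μ_i) − (y_i − μ_i) ],

with the y_i = 0 contribution taken as 2 μ_i; this is the out-of-fold loss reported per fold below.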

#31 Define baseline Poisson GLM formula

# GLM formula
form_b_full <- Counts ~
  gender + weight + distance + age + carage +
  I(weight^2) + I(distance^2) + I(age^2) + I(carage^2) +
  weight:distance + weight:age + weight:carage +
  distance:age + distance:carage + age:carage +
  offset(log(exposure))

#32 Select baseline GLM using AIC

# Pick GLM (b) structure 
if (!exists("form_b_sel", inherits = FALSE)) {
  sidx <- sample.int(nrow(DF), min(50000, nrow(DF)))
  glm_b_once <- glm(form_b_full, data=DF[sidx,], family=poisson("log"))
  form_b_sel <- formula(MASS::stepAIC(glm_b_once, direction="both", trace=FALSE))
}
# Cluster model
form_d_light <- update(form_b_sel, . ~ . + cluster_k)

#33 10-fold cross-validation comparing models (b, d, e, f, g)

K <- 10
folds <- kfold_index(nrow(DF), K)
out <- vector("list", K)

for (k in 1:K) {
  train <- DF[folds != k, , drop=FALSE]
  test  <- DF[folds == k, , drop=FALSE]

  ## (b) GLM 
  glm_b <- glm(form_b_sel, data=train, family=poisson("log"))
  mu_b  <- predict(glm_b, newdata=test, type="response")
  dev_b <- poisson_dev(test$Counts, mu_b)

  ## (d) GLM + cluster x6 
  Xtr   <- train[, c("age","carage")]
  Xtr_s <- scale(Xtr)
  Kstar <- choose_Kstar(Xtr_s, Kmax=10, n_sample=3000, B=10)
  km    <- kmeans(Xtr_s, centers=Kstar, nstart=10, iter.max=100)
  train$cluster_k <- factor(km$cluster, levels = seq_len(nrow(as.matrix(km$centers))))
  sc_center <- attr(Xtr_s,"scaled:center"); sc_scale <- attr(Xtr_s,"scaled:scale")
  test$cluster_k <- factor(
    vapply(seq_len(nrow(test)),
           function(i) assign_cluster(test$age[i], test$carage[i], km$centers, sc_center, sc_scale),
           integer(1)),
    levels = levels(train$cluster_k)
  )
  glm_d <- glm(form_d_light, data=train, family=poisson("log"))
  mu_d  <- predict(glm_d, newdata=test, type="response")
  dev_d <- poisson_dev(test$Counts, mu_d)

  ## (e) Poisson tree
  tree0 <- rpart(Counts ~ weight + distance + age + carage + gender + offset(log(exposure)),
                 data=train, method="poisson",
                 control=rpart.control(cp=0.001, xval=5, minbucket=2000, maxdepth=5))
  ctab <- tree0$cptable; i_min <- which.min(ctab[,"xerror"])
  cp_1se <- ctab[min(which(ctab[,"xerror"] <= ctab[i_min,"xerror"] + ctab[i_min,"xstd"])), "CP"]  # 1-SE: simplest tree within one SE
  tree  <- prune.rpart(tree0, cp=cp_1se)
  mu_e  <- predict(tree, newdata=test, type="vector")
  dev_e <- poisson_dev(test$Counts, mu_e)

  ## (f) Boosting (no base) 
  ctrl_f  <- mboost::boost_control(mstop=300, nu=0.05)
  tctrl_f <- partykit::ctree_control(maxdepth=3, minbucket=2000)
  bb_f <- mboost::blackboost(
    Counts ~ weight + distance + age + carage + gender,
    data=train, family=mboost::Poisson(),
    offset=log(train$exposure),
    control=ctrl_f, tree_controls=tctrl_f
  )
  folds_f <- mboost::cv(model.weights(bb_f), type="kfold", B=3)
  cvr_f   <- mboost::cvrisk(bb_f, folds=folds_f, grid=seq(50,300,by=50))
  mboost::mstop(bb_f) <- mboost::mstop(cvr_f)
  mu_f  <- exp(predict(bb_f, newdata=test, type="link")) * test$exposure
  dev_f <- poisson_dev(test$Counts, mu_f)

  ## (g) GLM base + boosting correction 
  glm_base <- glm(form_b_sel, data=train, family=poisson("log"))
  eta_glm_tr <- predict(glm_base, newdata=transform(train, exposure=1), type="link")
  ctrl_g  <- mboost::boost_control(mstop=300, nu=0.05)
  tctrl_g <- partykit::ctree_control(maxdepth=3, minbucket=2000)
  bb_g <- mboost::blackboost(
    Counts ~ weight + distance + age + carage + gender,
    data=train, family=mboost::Poisson(),
    offset = log(train$exposure) + eta_glm_tr,
    control=ctrl_g, tree_controls=tctrl_g
  )
  folds_g <- mboost::cv(model.weights(bb_g), type="kfold", B=3)
  cvr_g   <- mboost::cvrisk(bb_g, folds=folds_g, grid=seq(50,300,by=50))
  mboost::mstop(bb_g) <- mboost::mstop(cvr_g)
  eta_base_te <- predict(glm_base, newdata=transform(test, exposure=1), type="link")
  eta_corr_te <- predict(bb_g, newdata=test, type="link")
  mu_g  <- exp(eta_base_te + eta_corr_te) * test$exposure
  dev_g <- poisson_dev(test$Counts, mu_g)

  out[[k]] <- data.frame(
    fold = k,
    model = c("(b)-GLM","(d)_GLM_cluster","(e)_Poisson_Tree",
              "(f)_Poisson_Boost_NoBase","(g)_Poisson_Boost_Base"),
    dev   = c(dev_b, dev_d, dev_e, dev_f, dev_g),
    n     = nrow(test)
  )
}
## Warning in object$predict(newdata = newdata, which = which, aggregate =
## aggregate): User-specified offset is not a scalar, thus it cannot be used for
## predictions when 'newdata' is specified.
## Warning in object$predict(newdata = newdata, which = which, aggregate =
## aggregate): User-specified offset is not a scalar, thus it cannot be used for
## predictions when 'newdata' is specified.
## Warning in object$predict(newdata = newdata, which = which, aggregate =
## aggregate): User-specified offset is not a scalar, thus it cannot be used for
## predictions when 'newdata' is specified.
## Warning in object$predict(newdata = newdata, which = which, aggregate =
## aggregate): User-specified offset is not a scalar, thus it cannot be used for
## predictions when 'newdata' is specified.
## Warning in object$predict(newdata = newdata, which = which, aggregate =
## aggregate): User-specified offset is not a scalar, thus it cannot be used for
## predictions when 'newdata' is specified.
## Warning in object$predict(newdata = newdata, which = which, aggregate =
## aggregate): User-specified offset is not a scalar, thus it cannot be used for
## predictions when 'newdata' is specified.
## Warning in object$predict(newdata = newdata, which = which, aggregate =
## aggregate): User-specified offset is not a scalar, thus it cannot be used for
## predictions when 'newdata' is specified.
## Warning in object$predict(newdata = newdata, which = which, aggregate =
## aggregate): User-specified offset is not a scalar, thus it cannot be used for
## predictions when 'newdata' is specified.
## Warning in object$predict(newdata = newdata, which = which, aggregate =
## aggregate): User-specified offset is not a scalar, thus it cannot be used for
## predictions when 'newdata' is specified.
## Warning in object$predict(newdata = newdata, which = which, aggregate =
## aggregate): User-specified offset is not a scalar, thus it cannot be used for
## predictions when 'newdata' is specified.
## Warning in object$predict(newdata = newdata, which = which, aggregate =
## aggregate): User-specified offset is not a scalar, thus it cannot be used for
## predictions when 'newdata' is specified.
## Warning in object$predict(newdata = newdata, which = which, aggregate =
## aggregate): User-specified offset is not a scalar, thus it cannot be used for
## predictions when 'newdata' is specified.
## Warning: Quick-TRANSfer stage steps exceeded maximum (= 150000)
## Warning: Quick-TRANSfer stage steps exceeded maximum (= 150000)
## Warning: Quick-TRANSfer stage steps exceeded maximum (= 150000)
## Warning in object$predict(newdata = newdata, which = which, aggregate =
## aggregate): User-specified offset is not a scalar, thus it cannot be used for
## predictions when 'newdata' is specified.
## Warning in object$predict(newdata = newdata, which = which, aggregate =
## aggregate): User-specified offset is not a scalar, thus it cannot be used for
## predictions when 'newdata' is specified.
## Warning in object$predict(newdata = newdata, which = which, aggregate =
## aggregate): User-specified offset is not a scalar, thus it cannot be used for
## predictions when 'newdata' is specified.
## Warning in object$predict(newdata = newdata, which = which, aggregate =
## aggregate): User-specified offset is not a scalar, thus it cannot be used for
## predictions when 'newdata' is specified.
## Warning in object$predict(newdata = newdata, which = which, aggregate =
## aggregate): User-specified offset is not a scalar, thus it cannot be used for
## predictions when 'newdata' is specified.
## Warning in object$predict(newdata = newdata, which = which, aggregate =
## aggregate): User-specified offset is not a scalar, thus it cannot be used for
## predictions when 'newdata' is specified.
## Warning in object$predict(newdata = newdata, which = which, aggregate =
## aggregate): User-specified offset is not a scalar, thus it cannot be used for
## predictions when 'newdata' is specified.
## Warning in object$predict(newdata = newdata, which = which, aggregate =
## aggregate): User-specified offset is not a scalar, thus it cannot be used for
## predictions when 'newdata' is specified.
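
The loop stores per-fold deviances in out; the aggregation step is not shown above, so here is a minimal sketch (assuming the loop completed) reporting mean deviance per observation by model:

# Aggregate CV results: mean Poisson deviance per observation (lower is better)
cv_results <- dplyr::bind_rows(out) |>
  dplyr::group_by(model) |>
  dplyr::summarise(mean_dev_per_obs = mean(dev / n), .groups = "drop") |>
  dplyr::arrange(mean_dev_per_obs)
print(cv_results)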

Question (i)

#34 Create categorical variable High

DF <- DF %>%
  mutate(High = factor(ifelse(Counts >= 2, "Yes", "No"), levels = c("No","Yes")))
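
For context, under the Poisson model of (b) the implied probability of two or more claims for a policy with rate λ(x) and exposure V is

  Pr(N ≥ 2) = 1 − e^{−Vλ(x)} (1 + Vλ(x)),

so the logistic model below is a direct, distribution-free alternative to this implied probability.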

#35 Fit logistic regression

fit_logit <- glm(High ~ weight + distance + age + carage + gender,
                 data = DF, family = binomial("logit"))

#36 Benchmark probability at (1500,10,30,4,female)

gender_val <- if ("female" %in% levels(DF$gender)) "female" else levels(DF$gender)[1]
x_bench <- data.frame(weight=1500, distance=10, age=30, carage=4,
                      gender=factor(gender_val, levels=levels(DF$gender)))
p_bench <- predict(fit_logit, newdata=x_bench, type="response")
cat("Pr(High=Yes) at (1500,10,30,4,female):", round(p_bench,4), "\n")
## Pr(High=Yes) at (1500,10,30,4,female): 0.0018

#37 Plot Pr(High=Yes) vs driver age

age_grid <- data.frame(
  weight=1500, distance=10, age=seq(16,85,1), carage=4,
  gender=factor(gender_val, levels=levels(DF$gender))
)
age_grid$prob <- predict(fit_logit, newdata=age_grid, type="response")

ggplot(age_grid, aes(age, prob)) +
  geom_line(linewidth=0.9) +
  labs(title=expression(Pr(High==Yes)~" vs driver age"),
       subtitle="Fixed: weight=1500, distance=10, carage=4, gender=female",
       x="Driver age (x3)", y="Probability") +
  theme_minimal(base_size=13) +
  theme(plot.title=element_text(face="bold"))

#38 Create ROC curve + optimal threshold

phat <- predict(fit_logit, type = "response")
y    <- as.integer(DF$High == "Yes")

# small grid 
thr <- seq(0, 1, by = 0.01)

calc_rates <- function(t){
  pred <- phat >= t
  TP <- sum(pred & y == 1); FP <- sum(pred & y == 0)
  FN <- sum(!pred & y == 1); TN <- sum(!pred & y == 0)
  TPR <- TP / (TP + FN); FPR <- FP / (FP + TN)
  ACC <- (TP + TN) / length(y)
  c(TPR = TPR, FPR = FPR, ACC = ACC)
}

M <- vapply(thr, calc_rates, numeric(3))
roc_df <- data.frame(
  threshold = thr,
  TPR = unname(M["TPR",]),
  FPR = unname(M["FPR",]),
  ACC = unname(M["ACC",])
)

best <- roc_df[which.max(roc_df$ACC), ]
cat("Best threshold:", round(best$threshold, 4),
    "| TPR:", round(best$TPR, 3),
    "| FPR:", round(best$FPR, 3), "\n")
## Best threshold: 0.12 | TPR: 0 | FPR: 0
ggplot(roc_df, aes(FPR, TPR)) +
  geom_path(linewidth = 1) +
  geom_abline(slope = 1, intercept = 0, linetype = 2) +
  geom_point(data = best, color = "red", size = 3) +
  labs(title = "ROC — logistic classifier (light)",
       subtitle = paste0("Best threshold = ", round(best$threshold,3),
                         " | TPR = ", round(best$TPR,3),
                         " | FPR = ", round(best$FPR,3)),
       x = "False Positive Rate", y = "True Positive Rate") +
  theme_minimal(base_size = 13) +
  theme(plot.title = element_text(face = "bold"))
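
Because High = "Yes" is rare, accuracy is maximised by predicting "No" almost everywhere, which is why the accuracy-optimal threshold above has TPR = 0. A common alternative (a sketch reusing roc_df from above) is the Youden-J threshold, which weights sensitivity and specificity equally:

# Youden's J = TPR - FPR; picks the ROC point furthest above the diagonal
best_j <- roc_df[which.max(roc_df$TPR - roc_df$FPR), ]
cat("Youden-optimal threshold:", round(best_j$threshold, 3),
    "| TPR:", round(best_j$TPR, 3), "| FPR:", round(best_j$FPR, 3), "\n")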

#Summary

  1. Exploratory analysis. Cleaned the dataset, produced summary statistics and histograms, and examined pairwise correlations. Pearson was used as the primary measure (continuous, roughly linear relationships), with Spearman noted for robustness where monotonic but non-linear patterns appeared.

  2. Baseline frequency model (GLM). Fitted Poisson GLMs with log link including linear terms, strategic quadratics, and key interactions; model selection was done via stepwise AIC. Benchmarked predictions at a fixed profile and drew partial-dependence-style plots of λ̂(x) vs driver age.

  3. Unsupervised segmentation. Ran k-means on (age, carage), chose K* by the gap statistic, visualised the clusters, and assigned each policy a cluster label.

  4. Cluster-augmented GLM. Added the cluster label as x6 and re-fitted the GLM to test whether segmentation improves deviance and interpretability.

  5. Tree models. Built a Poisson regression tree (pruned with the 1-SE rule) to capture thresholds and provide rule-based insight.

  6. Boosting. Trained Poisson boosting trees to learn residual non-linearity the GLM can miss.

  7. Hybrid model. Used the GLM as a base and added a boosting correction to combine interpretability with extra predictive lift.

  8. Model comparison. Performed 10-fold cross-validation using out-of-fold Poisson deviance to compare (GLM), (GLM+cluster), (Tree), (Boosting), and (GLM+Boosting), selecting the best model to balance accuracy, stability, and governance.

  9. High-risk classification. Defined High = 1(Counts ≥ 2), fitted a logistic regression on linear terms, plotted Pr(High = 1) vs age (others fixed), produced the ROC curve, chose the optimal threshold assuming equal cost of FP and FN, and reported the corresponding TPR/FPR.

THE END