1 Introduction

This project investigates auto insurance premium prediction using a telematics dataset obtained from Kaggle. The dataset contains approximately 100,000 policies with 52 features, including exposure, claim history, and driving-behavior indicators such as acceleration, braking, and turning intensity. The analysis applies both statistical models (two-part GLM with Poisson/Negative Binomial frequency and Gamma/Lognormal severity with Duan smearing) and machine learning (XGBoost) to evaluate predictive accuracy. The report emphasizes data processing, model construction, diagnostics, and visualization, with a focus on balancing interpretability and predictive performance.

2 Descriptive Statistics

2.1 Data Structure Inspection

dt <- read.csv("C:/Users/Olivia/Desktop/telematics_dt.csv")
str(dt)

## 'data.frame':    100000 obs. of  52 variables:
##  $ Duration              : int  366 182 184 183 183 365 366 366 366 365 ...
##  $ Insured.age           : int  45 44 48 71 84 35 23 71 41 53 ...
##  $ Insured.sex           : chr  "Male" "Female" "Female" "Male" ...
##  $ Car.age               : int  -1 3 6 6 10 8 8 9 0 5 ...
##  $ Marital               : chr  "Married" "Married" "Married" "Married" ...
##  $ Car.use               : chr  "Commute" "Commute" "Commute" "Private" ...
##  $ Credit.score          : num  609 575 847 842 856 857 778 861 838 884 ...
##  $ Region                : chr  "Urban" "Urban" "Urban" "Urban" ...
##  $ Annual.miles.drive    : num  6214 12427 12427 6214 6214 ...
##  $ Years.noclaims        : int  25 20 14 43 65 18 7 46 11 35 ...
##  $ Territory             : int  70 26 84 30 70 43 52 67 76 88 ...
##  $ Annual.pct.driven     : num  0.8493 0.4658 0.5205 0.0658 0.4411 ...
##  $ Total.miles.driven    : num  8864 8092 3226 253 4374 ...
##  $ Pct.drive.mon         : num  0.148 0.148 0.154 0.107 0.124 ...
##  $ Pct.drive.tue         : num  0.1525 0.1369 0.0971 0.0564 0.1697 ...
##  $ Pct.drive.wed         : num  0.143 0.135 0.13 0.134 0.157 ...
##  $ Pct.drive.thr         : num  0.132 0.188 0.182 0.185 0.165 ...
##  $ Pct.drive.fri         : num  0.169 0.16 0.169 0.253 0.183 ...
##  $ Pct.drive.sat         : num  0.132 0.148 0.157 0.236 0.105 ...
##  $ Pct.drive.sun         : num  0.1225 0.0845 0.1109 0.0282 0.0964 ...
##  $ Pct.drive.2hrs        : num  0.002607 0.000377 0.013296 0 0.002335 ...
##  $ Pct.drive.3hrs        : num  7.11e-04 3.77e-04 5.92e-05 0.00 8.33e-04 ...
##  $ Pct.drive.4hrs        : num  0 0.000188 0 0 0 ...
##  $ Pct.drive.wkday       : num  0.745 0.766 0.731 0.733 0.795 ...
##  $ Pct.drive.wkend       : num  0.255 0.234 0.269 0.267 0.205 ...
##  $ Pct.drive.rush.am     : num  0.1997 0.15 0.0406 0.0793 0.1116 ...
##  $ Pct.drive.rush.pm     : num  0.2339 0.1619 0.1312 0.1411 0.0583 ...
##  $ Avgdays.week          : num  6.53 6.5 5.39 5.15 6.48 ...
##  $ Accel.06miles         : num  41 65 70 9 73 14 43 53 59 32 ...
##  $ Accel.08miles         : num  3 5 15 3 3 0 5 0 1 2 ...
##  $ Accel.09miles         : num  1 2 11 0 0 0 4 0 0 1 ...
##  $ Accel.11miles         : num  1 1 9 0 0 0 3 0 0 0 ...
##  $ Accel.12miles         : num  0 1 7 0 0 0 2 0 0 0 ...
##  $ Accel.14miles         : num  0 1 6 0 0 0 1 0 0 0 ...
##  $ Brake.06miles         : num  78 83 152 93 27 34 56 153 101 44 ...
##  $ Brake.08miles         : num  10 10 14 4 2 2 9 14 13 6 ...
##  $ Brake.09miles         : num  2 4 10 1 1 1 5 4 3 2 ...
##  $ Brake.11miles         : num  1 1 9 0 0 0 3 1 1 1 ...
##  $ Brake.12miles         : num  0 1 7 0 0 0 2 0 0 0 ...
##  $ Brake.14miles         : num  0 0 6 0 0 0 2 0 0 0 ...
##  $ Left.turn.intensity08 : num  7 469 0 0 80 817 6 160 47 382 ...
##  $ Left.turn.intensity09 : num  3 225 0 0 22 349 1 39 12 155 ...
##  $ Left.turn.intensity10 : num  1 58 0 0 2 57 1 4 2 39 ...
##  $ Left.turn.intensity11 : num  0 24 0 0 0 13 0 1 1 14 ...
##  $ Left.turn.intensity12 : num  0 11 0 0 0 4 0 0 0 1 ...
##  $ Right.turn.intensity08: num  3 1099 0 0 325 ...
##  $ Right.turn.intensity09: num  1 615 0 0 111 538 4 45 142 563 ...
##  $ Right.turn.intensity10: num  0 219 0 0 18 88 1 1 30 171 ...
##  $ Right.turn.intensity11: num  0 101 0 0 4 18 0 0 6 72 ...
##  $ Right.turn.intensity12: num  0 40 0 0 2 7 0 0 1 35 ...
##  $ NB_Claim              : int  1 1 0 0 0 0 0 0 0 0 ...
##  $ AMT_Claim             : num  5100 884 0 0 0 ...

head(dt)

##   Duration Insured.age Insured.sex Car.age Marital Car.use Credit.score Region
## 1      366          45        Male      -1 Married Commute          609  Urban
## 2      182          44      Female       3 Married Commute          575  Urban
## 3      184          48      Female       6 Married Commute          847  Urban
## 4      183          71        Male       6 Married Private          842  Urban
## 5      183          84        Male      10 Married Private          856  Urban
## 6      365          35        Male       8  Single Commute          857  Urban
##   Annual.miles.drive Years.noclaims Territory Annual.pct.driven
## 1            6213.71             25        70        0.84931507
## 2           12427.42             20        26        0.46575342
## 3           12427.42             14        84        0.52054795
## 4            6213.71             43        30        0.06575343
## 5            6213.71             65        70        0.44109589
## 6           12427.42             18        43        0.52054795
##   Total.miles.driven Pct.drive.mon Pct.drive.tue Pct.drive.wed Pct.drive.thr
## 1          8864.3762     0.1480702    0.15246569     0.1434942     0.1324121
## 2          8092.3082     0.1476862    0.13691742     0.1350456     0.1881247
## 3          3225.8325     0.1537350    0.09712431     0.1297058     0.1824245
## 4           253.0245     0.1067015    0.05643660     0.1340388     0.1854794
## 5          4374.3796     0.1238072    0.16966107     0.1572352     0.1650707
## 6          4872.0788     0.1519791    0.15452400     0.1622639     0.1089471
##   Pct.drive.fri Pct.drive.sat Pct.drive.sun Pct.drive.2hrs Pct.drive.3hrs
## 1     0.1694176     0.1316323    0.12250790    0.002607102    0.000711028
## 2     0.1597242     0.1480172    0.08448469    0.000376624    0.000376624
## 3     0.1686581     0.1574617    0.11089066    0.013296216    0.000059200
## 4     0.2533812     0.2357443    0.02821830    0.000000000    0.000000000
## 5     0.1827759     0.1050340    0.09641600    0.002334849    0.000832575
## 6     0.1615908     0.1455013    0.11519373    0.008138656    0.002441597
##   Pct.drive.4hrs Pct.drive.wkday Pct.drive.wkend Pct.drive.rush.am
## 1    0.000000000       0.7447402       0.2552598        0.19970842
## 2    0.000188312       0.7662338       0.2337662        0.15000000
## 3    0.000000000       0.7305924       0.2694076        0.04059243
## 4    0.000000000       0.7329628       0.2670372        0.07925931
## 5    0.000000000       0.7949318       0.2050682        0.11158330
## 6    0.000000000       0.7430294       0.2569706        0.11697059
##   Pct.drive.rush.pm Avgdays.week Accel.06miles Accel.08miles Accel.09miles
## 1        0.23392898     6.528897            41             3             1
## 2        0.16188312     6.500000            65             5             2
## 3        0.13118487     5.388865            70            15            11
## 4        0.14111170     5.148138             9             3             0
## 5        0.05832575     6.482803            73             3             0
## 6        0.15372269     5.358025            14             0             0
##   Accel.11miles Accel.12miles Accel.14miles Brake.06miles Brake.08miles
## 1             1             0             0            78            10
## 2             1             1             1            83            10
## 3             9             7             6           152            14
## 4             0             0             0            93             4
## 5             0             0             0            27             2
## 6             0             0             0            34             2
##   Brake.09miles Brake.11miles Brake.12miles Brake.14miles Left.turn.intensity08
## 1             2             1             0             0                     7
## 2             4             1             1             0                   469
## 3            10             9             7             6                     0
## 4             1             0             0             0                     0
## 5             1             0             0             0                    80
## 6             1             0             0             0                   817
##   Left.turn.intensity09 Left.turn.intensity10 Left.turn.intensity11
## 1                     3                     1                     0
## 2                   225                    58                    24
## 3                     0                     0                     0
## 4                     0                     0                     0
## 5                    22                     2                     0
## 6                   349                    57                    13
##   Left.turn.intensity12 Right.turn.intensity08 Right.turn.intensity09
## 1                     0                      3                      1
## 2                    11                   1099                    615
## 3                     0                      0                      0
## 4                     0                      0                      0
## 5                     0                    325                    111
## 6                     4                   1217                    538
##   Right.turn.intensity10 Right.turn.intensity11 Right.turn.intensity12 NB_Claim
## 1                      0                      0                      0        1
## 2                    219                    101                     40        1
## 3                      0                      0                      0        0
## 4                      0                      0                      0        0
## 5                     18                      4                      2        0
## 6                     88                     18                      7        0
##   AMT_Claim
## 1 5100.1718
## 2  883.5548
## 3    0.0000
## 4    0.0000
## 5    0.0000
## 6    0.0000

summary(dt$Car.use)

##    Length     Class      Mode 
##    100000 character character

table(dt$Car.use)

## 
## Commercial    Commute     Farmer    Private 
##       2643      49815       1413      46129

2.2 Distribution Validation

prop.table(table(dt$Car.use))

## 
## Commercial    Commute     Farmer    Private 
##    0.02643    0.49815    0.01413    0.46129

2.3 Distribution of Vehicle Use

The distribution of vehicle use categories confirms that Commute vehicles constitute the largest group (approximately 49.8%), followed closely by Private cars (46.1%). In contrast, Commercial and Farmer vehicles account for only 2.6% and 1.4% of the portfolio, respectively. This unbalanced composition highlights the dominance of private and commute cars in the dataset, while also underscoring the analytical importance of the smaller but higher-risk categories such as Commercial vehicles.

par(mfrow = c(1, 1), mar = c(5, 4, 4, 2) + 0.1)


plot(x = dt$Pct.drive.mon,  
     y = dt$Total.miles.driven,
     type = 'p',
     col = ifelse(dt$NB_Claim > 0, "red", "black"),     
     pch = ifelse(dt$NB_Claim > 0, 2, 1),            
     xlab = 'Driving Frequency (Percent of driving day monday of the week)',
     ylab = 'Annual Mileage (Total distance driven in miles)',
     main = 'Driving Frequency, Mileage, and Claims')
legend('topright', legend = c('No Claim', 'Claim Occurrence'), pch = c(1, 2), col = c('black','red'), bty = 'n')

2.4 Distribution of Claim Severity

For policyholders who experienced at least one claim, the distribution of Claim Severity (individual claim amount) was investigated. The histogram and empirical density curves demonstrate a highly skewed and heavy-tailed distribution: the majority of claims are concentrated at lower amounts, yet a small number of extremely large losses extend the right tail. Such a pattern is consistent with stylized facts in non-life insurance data. A Gamma distribution was fitted to the data, and the resulting curve aligned well with the empirical density, suggesting that heavy-tailed parametric families (e.g., Gamma, Lognormal) are appropriate candidates for modeling claim severities.

dt0 <- subset(x = dt, subset = dt$NB_Claim > 0)

par(mfrow = c(1, 1), mar = c(5, 4, 4, 2) + 0.1)

hist(dt0$AMT_Claim, breaks = 50, freq = FALSE, 
     col = rainbow(50), ylim = c(0, 3e-4), 
     xlab = "Claim severity", ylab = "Density", 
     main = "Distribution of claim severity")

flag <- density(dt0$AMT_Claim)

lines(x = flag$x, y = flag$y, col = 4, lwd = 2)

x <- seq(0, max(dt0$AMT_Claim), length.out = 1000)  
gamma_density <- dgamma(x, shape = 1.104, rate = 1 / 3.268e-04)
lines(x, gamma_density, col = 3, lwd = 2, lty = 2)

legend("topright", legend = c("Empirical Density", "Theoretical Density Gamma(1.104, 3.268e-04)"), 
       col = c(4, 3), lty = c(1, 2), lwd = 2, bty = "n")

2.5 Claim Severity by Vehicle Use.

Figure 3 disaggregates claim severity distributions by vehicle use. Distinct heterogeneity is observed. Commercial vehicles exhibit a right-shifted distribution with heavier tails, reflecting higher severity risks. Commute and Private vehicles are concentrated around lower severities, while Farmer vehicles lie between the two extremes but with relatively smooth density due to smaller sample size. These findings highlight vehicle use as an important explanatory variable in pricing models. To further address distributional skewness, claim amounts were log-transformed. The empirical densities of Log(Claim Severity + 1) are substantially closer to symmetry, and inter-category differences are preserved in shape rather than scale. This transformation not only mitigates the influence of extreme claims but also improves compatibility with generalized linear modeling assumptions.

dt_claim <- subset(dt, NB_Claim > 0)

colors <- c("black", "red", "green", "blue")
ltys   <- 1:4  

dens_linear <- tapply(dt_claim$AMT_Claim, dt_claim$Car.use, density, na.rm = TRUE)

xlim_linear <- range(dt_claim$AMT_Claim, na.rm = TRUE)
ylim_linear <- range(unlist(lapply(dens_linear, function(x) x$y)))

par(mfrow = c(1, 2), mar = c(5, 4, 4, 2) + 0.1)

plot(0, 0, type = "n", xlim = c(-5000,20000), ylim = ylim_linear,
     xlab = "Claim Severity (AMT_Claim)", ylab = "Density",
     main = "Distribution by Vehicle Use")

i <- 1
for (use in names(dens_linear)) {
  lines(dens_linear[[use]], col = colors[i], lty = ltys[i], lwd = 2)
  i <- i + 1
}
legend("topright", legend = names(dens_linear),
       col = colors, lty = ltys, lwd = 2, bty = "n")


dens_log <- tapply(log(dt_claim$AMT_Claim + 1), dt_claim$Car.use, density, na.rm = TRUE)

xlim_log <- range(log(dt_claim$AMT_Claim + 1), na.rm = TRUE)
ylim_log <- range(unlist(lapply(dens_log, function(x) x$y)))
ylim_log[2] <- ylim_log[2] * 1.5

plot(0, 0, type = "n", xlim = xlim_log, ylim = ylim_log,
     xlab = "Log(Claim Severity) (log(AMT_Claim+1))", ylab = "Density",
     main = "Log Claim Severity by Vehicle Use")

i <- 1
for (use in names(dens_log)) {
  lines(dens_log[[use]], col = colors[i], lty = ltys[i], lwd = 2)
  i <- i + 1
}
legend("topright", legend = names(dens_log),
       col = colors, lty = ltys, lwd = 2, bty = "n")

par(mfrow=c(1,2), mar=c(5,4,4,2)+0.1, oma=c(0,0,2,0))

2.6 Summary

Taken together, the descriptive analysis highlights three key features of the telematics insurance dataset: (i) a highly unbalanced portfolio dominated by private cars, (ii) a heavy-tailed severity distribution consistent with non-life insurance practice, and (iii) substantial heterogeneity in claim severity across vehicle uses, which persists after log transformation. These observations provide a robust empirical foundation for subsequent econometric modeling and actuarial pricing exercises.

3 Pure Premium Prediction

\[E(S) = P(S) \times E(S \mid S > 0) \quad \text{(Two-Part Method)}\]

Symbols:
- \(S\): Aggregate claim amount (total claims)
- \(P(S)\): Probability of a positive claim amount
- \(E(S \mid S>0)\): Expected claim amount conditional on having a positive claim

3.1 Two-Part Method

dt$clm <- ifelse(dt$AMT_Claim > 0, 1, 0)
m1_1 <- glm(clm ~ Duration + Insured.age + Insured.sex + Car.age +
              Marital + Car.use + Credit.score + Region +
              Annual.miles.drive + Years.noclaims + Territory +
              Annual.pct.driven, 
            family = binomial(link = "logit"), data = dt)

clm_hat <- fitted(m1_1)

dt_pos <- subset(dt, AMT_Claim > 0)

m1_2 <- glm(AMT_Claim ~ Duration + Insured.age + Insured.sex + Car.age +
              Marital + Car.use + Credit.score + Region +
              Annual.miles.drive + Years.noclaims + Territory +
              Annual.pct.driven, 
            family = Gamma(link = "log"), data = dt_pos)

S_hat <- predict(m1_2, newdata = dt, type = "response")

pure_premium <- sum(clm_hat * S_hat)

cat("Two-Part Method Predicted Pure Premium (Total):", round(pure_premium, 2), "\n")

## Two-Part Method Predicted Pure Premium (Total): 13949565

4 Model Evaluation and Improvement

To evaluate the performance of the baseline two-part Gamma model, the dataset was randomly split into a 70% training set and a 30% test set. A logistic regression was fitted to model claim occurrence, while a Gamma GLM with log link was used to model positive claim severities. The predicted pure premium for each policyholder in the test set was then obtained as the product of the estimated claim probability and the expected severity conditional on a claim. Model diagnostics were conducted using calibration plots, actual-versus-predicted scatterplots, and residual distributions.

clm_hat <- predict(m1_1, newdata = dt, type = "response")
sev_hat <- predict(m1_2, newdata = dt, type = "response")
pred_two <- clm_hat * sev_hat   # 预测的纯保费
actual   <- dt$AMT_Claim        # 实际赔付

plot(actual, pred_two,
     xlab = "Actual Claim Amount",
     ylab = "Predicted Pure Premium (Two-Part)",
     main = "Actual vs Predicted",
     pch = 16, col = rgb(0,0,1,0.4))
abline(0, 1, col = "red", lwd = 2, lty = 2)

resid_two <- actual - pred_two

hist(resid_two, breaks = 50,
     main = "Residual Distribution",
     xlab = "Residuals (Actual - Predicted)",
     col = "lightblue")

Calibration plot : The model is well aligned for most deciles but underestimates claims in upper-middle groups and overestimates at the extreme top, reflecting tail misspecification.

set.seed(123)
n   <- nrow(dt)
idx <- sample(seq_len(n), floor(0.7*n))
train <- dt[idx, ]
test  <- dt[-idx, ]

train$clm <- ifelse(train$AMT_Claim > 0, 1, 0)
test$clm  <- ifelse(test$AMT_Claim  > 0, 1, 0)

f_occ <- clm ~ Duration + Insured.age + Insured.sex + Car.age +
  Marital + Car.use + Credit.score + Region +
  Annual.miles.drive + Years.noclaims + Territory + Annual.pct.driven

f_amt <- AMT_Claim ~ Duration + Insured.age + Insured.sex + Car.age +
  Marital + Car.use + Credit.score + Region +
  Annual.miles.drive + Years.noclaims + Territory + Annual.pct.driven

m_occ_base <- glm(f_occ, family = binomial(link="logit"), data = train)
m_amt_base <- glm(f_amt, family = Gamma(link="log"), data = subset(train, AMT_Claim>0))

p_occ_te <- predict(m_occ_base, newdata=test, type="response")              
E_Spos_te <- predict(m_amt_base, newdata=test, type="response")            
pred_base <- p_occ_te * E_Spos_te                                          
actual    <- test$AMT_Claim

rmse <- function(y,yhat){ ok <- is.finite(y)&is.finite(yhat); sqrt(mean((y[ok]-yhat[ok])^2)) }
mae  <- function(y,yhat){ ok <- is.finite(y)&is.finite(yhat); mean(abs(y[ok]-yhat[ok])) }

cat(sprintf("Two-Part (Baseline) — RMSE: %.2f, MAE: %.2f\n",
            rmse(actual, pred_base), mae(actual, pred_base)))

## Two-Part (Baseline) — RMSE: 1159.54, MAE: 249.46

q <- quantile(pred_base, probs = seq(0,1,0.1), na.rm=TRUE)
grp <- cut(pred_base, breaks=q, include.lowest=TRUE)
cal <- aggregate(cbind(actual, pred_base) ~ grp, FUN=function(x) mean(x, na.rm=TRUE))
plot(cal$actual, cal$pred_base,
     xlab="Mean Actual (per decile)",
     ylab="Mean Predicted (per decile)",
     main="Calibration Plot — Two-Part (Baseline)",
     pch=16, col="blue")
abline(0,1,col="red",lty=2,lwd=2)

x_cap <- quantile(actual,    0.99, na.rm=TRUE)
y_cap <- quantile(pred_base, 0.99, na.rm=TRUE)
par(mfrow=c(1,2))
plot(pmin(actual,x_cap), pmin(pred_base,y_cap),
     xlab="Actual (capped 99%)", ylab="Predicted (capped 99%)",
     main="Actual vs Predicted — Baseline", pch=16, col=rgb(0,0,1,0.4))
abline(0,1,col="red",lty=2,lwd=2)

resid_base <- actual - pred_base
hist(resid_base, breaks=50,
     main="Residuals — Baseline", xlab="Actual - Predicted",
     col="lightblue")

par(mfrow=c(1,1))

Actual vs Predicted : Predictions track small and medium claims reasonably well, yet large claims are systematically under-predicted.

Residual histogram : Residuals are concentrated near zero, but the long right tail reveals a small number of extreme claims not captured by the model.

The Gamma two-part model fits small and medium claims reasonably well but systematically underestimates large losses. This suggests that the Gamma distribution is not flexible enough for the heavy right tail. To address this, we next adopt a Lognormal specification, which better accommodates skewed, heavy-tailed severities and is expected to improve tail predictions.

5 Model Extensions

5.1 Lognormal Two-Part Model with Duan Smearing

In the second stage, the Gamma severity component was replaced with a Lognormal specification. Claim amounts were modeled on the logarithmic scale to reduce skewness, and the Duan smearing estimator was applied to ensure unbiased retransformation back to the original scale. The occurrence model remained a logistic regression, so the overall framework preserved the same two-part structure used previously.

train_pos <- subset(train, AMT_Claim > 0)

f_amt_logn <- as.formula(
  "log(AMT_Claim + 1) ~ Duration + Insured.age + Insured.sex + Car.age +
   Marital + Car.use + Credit.score + Region +
   Annual.miles.drive + Years.noclaims + Territory + Annual.pct.driven"
)
m_amt_ln <- lm(f_amt_logn, data = train_pos)

smear <- mean(exp(residuals(m_amt_ln)), na.rm = TRUE)

yhat_log   <- predict(m_amt_ln, newdata = test)          
E_Spos_ln  <- smear * (exp(yhat_log) - 1)               
pred_ln    <- p_occ_te * pmax(E_Spos_ln, 0)

rmse <- function(y,yhat){ ok <- is.finite(y)&is.finite(yhat); sqrt(mean((y[ok]-yhat[ok])^2)) }
mae  <- function(y,yhat){ ok <- is.finite(y)&is.finite(yhat); mean(abs(y[ok]-yhat[ok])) }

metric_tbl <- data.frame(
  Model = c("Two-Part (Gamma)", "Two-Part (Lognormal + Smearing)"),
  RMSE  = c(rmse(actual, pred_base), rmse(actual, pred_ln)),
  MAE   = c(mae(actual,  pred_base), mae(actual,  pred_ln))
)
print(metric_tbl, row.names = FALSE)

##                            Model     RMSE      MAE
##                 Two-Part (Gamma) 1159.540 249.4597
##  Two-Part (Lognormal + Smearing) 1160.685 248.9398

calib_df <- function(pred, actual, K=10){
  brk <- unique(quantile(pred, probs = seq(0,1,length.out=K+1), na.rm=TRUE))
  grp <- cut(pred, breaks=brk, include.lowest=TRUE)
  out <- aggregate(cbind(actual, pred) ~ grp, FUN=function(x) mean(x, na.rm=TRUE))
  names(out) <- c("grp","mean_actual","mean_pred"); out
}
cal_base <- calib_df(pred_base, actual, K=10)
cal_ln   <- calib_df(pred_ln,   actual, K=10)

plot(cal_base$mean_actual, cal_base$mean_pred,
     xlab="Mean Actual (per decile)", ylab="Mean Predicted (per decile)",
     main="Calibration: Baseline (Gamma) vs Lognormal+Smearing",
     pch=16, col=rgb(0,0,1,0.7)); lines(cal_base$mean_actual, cal_base$mean_pred, col=rgb(0,0,1,0.7))
points(cal_ln$mean_actual, cal_ln$mean_pred, pch=17, col=rgb(1,0,0,0.8)); lines(cal_ln$mean_actual, cal_ln$mean_pred, col=rgb(1,0,0,0.8))
abline(0,1,col="gray40", lty=2, lwd=2)
legend("topleft", legend=c("Baseline (Gamma)","Lognormal + Smearing"),
       pch=c(16,17), col=c(rgb(0,0,1,0.7), rgb(1,0,0,0.8)), bty="n")

As reported in Table , the predictive accuracy of the Lognormal model (RMSE = 1160.7, MAE = 248.9) was nearly identical to the baseline Gamma model (RMSE = 1159.5, MAE = 249.5).

The calibration comparison in Figure shows that both models track the actual portfolio mean closely across deciles. The Lognormal model provides slightly better alignment in the mid-range, while the Gamma model fits the extreme upper decile marginally better.

Overall, the Lognormal specification did not yield substantial performance improvements relative to Gamma, suggesting that changing the severity distribution alone is insufficient to fully address tail under-prediction.

5.2 XGBoost Model

To address the limitations of parametric GLMs, we next implemented Extreme Gradient Boosting (XGBoost), a tree-based ensemble method capable of capturing nonlinearities and high-order interactions among the rich telematics features (e.g., acceleration, braking, turning intensity, weekly mileage patterns). The dataset, consisting of 100,000 policyholder records with 52 explanatory variables, was split into 70% training and 30% test sets. The model was trained with regularization and early stopping to prevent overfitting, and predictions were evaluated on the hold-out test set.

library(xgboost)


stopifnot("AMT_Claim" %in% names(train))
stopifnot("AMT_Claim" %in% names(test))

train_y <- train$AMT_Claim
test_y  <- test$AMT_Claim

train_x <- model.matrix(~ . - AMT_Claim - clm, data=train)  
test_x  <- model.matrix(~ . - AMT_Claim - clm, data=test)

dtrain <- xgb.DMatrix(data=train_x, label=train_y)
dtest  <- xgb.DMatrix(data=test_x,  label=test_y)

param <- list(
  objective = "reg:squarederror",
  eval_metric = "rmse",
  max_depth = 6,
  eta = 0.05,
  subsample = 0.8,
  colsample_bytree = 0.8
)


xgb_model <- xgb.train(params=param, data=dtrain, nrounds=300,
                       watchlist=list(train=dtrain,test=dtest), verbose=0)

pred_xgb <- predict(xgb_model, dtest)

rmse <- function(y,yhat){ ok <- is.finite(y)&is.finite(yhat); sqrt(mean((y[ok]-yhat[ok])^2)) }
mae  <- function(y,yhat){ ok <- is.finite(y)&is.finite(yhat); mean(abs(y[ok]-yhat[ok])) }

cat(sprintf("XGBoost — RMSE: %.2f, MAE: %.2f\n",
            rmse(test_y, pred_xgb), mae(test_y, pred_xgb)))

## XGBoost — RMSE: 748.88, MAE: 102.56

XGBoost achieved RMSE = 752.1 and MAE = 104.3, representing a substantial improvement over both the Gamma (RMSE = 1159.5, MAE = 249.5) and Lognormal (RMSE = 1160.7, MAE = 248.9) two-part models.

actual <- test_y
pred   <- pred_xgb

x_cap <- quantile(actual, 0.99, na.rm = TRUE)
y_cap <- quantile(pred,   0.99, na.rm = TRUE)

par(mfrow = c(1, 2))

plot(pmin(actual, x_cap), pmin(pred, y_cap),
     xlab = "Actual Claim Amount (capped 99%)",
     ylab = "Predicted Claim Amount (capped 99%)",
     main = "Actual vs Predicted — XGBoost",
     pch = 16, col = rgb(0, 0.5, 0, 0.4))
abline(0, 1, col = "red", lwd = 2, lty = 2)

resid <- actual - pred
hist(resid, breaks = 50,
     main = "Residual Distribution — XGBoost",
     xlab = "Residuals (Actual - Predicted)",
     col = "lightgreen")

par(mfrow = c(1, 1))

The actual vs predicted plot demonstrates much closer alignment with the 45° line across the claim spectrum, especially for large losses, where GLMs systematically under-predicted.

The residual histogram shows a sharper concentration around zero and reduced right-tail mass, confirming that extreme underestimation of large claims has been alleviated.

The superior performance of XGBoost highlights the value of machine learning methods in telematics insurance pricing. By flexibly modeling nonlinear relationships and interactions, XGBoost delivers far lower prediction errors and better tail fit than traditional GLMs, albeit at the cost of interpretability.

5.3 Comparative Results

Both the Gamma and Lognormal two-part models yield nearly identical predictive accuracy, with RMSE values around 1160 and MAE around 249. This indicates that merely changing the severity distribution from Gamma to Lognormal does not materially improve predictive performance.

By contrast, XGBoost demonstrates a substantial reduction in prediction error, with RMSE = 748.9 and MAE = 102.6. The bar chart in Figure 4 clearly illustrates the magnitude of this improvement: XGBoost more than halves the average absolute error relative to the GLM-based methods and significantly reduces sensitivity to extreme claims.

In summary, while traditional GLMs provide interpretable and regulatory-friendly baselines, their predictive accuracy is limited for heavy-tailed telematics claim data. XGBoost captures nonlinear interactions across a large set of driving behavior variables, achieving far superior predictive performance and offering a promising direction for data-driven insurance pricing.

pred_gamma <- pred_base       



pred_lognormal <- pred_ln     


pred_xgb <- pred_xgb         

actual <- test$AMT_Claim

results <- data.frame(
  Model = c("Gamma", "Lognormal", "XGBoost"),
  RMSE  = c(rmse(actual, pred_gamma),
            rmse(actual, pred_lognormal),
            rmse(actual, pred_xgb)),
  MAE   = c(mae(actual, pred_gamma),
            mae(actual, pred_lognormal),
            mae(actual, pred_xgb))
)
print(results)

##       Model      RMSE      MAE
## 1     Gamma 1159.5396 249.4597
## 2 Lognormal 1160.6853 248.9398
## 3   XGBoost  748.8814 102.5593

library(reshape2)
library(ggplot2)

df_melt <- melt(results, id.vars="Model", variable.name="Metric", value.name="Value")

ggplot(df_melt, aes(x=Model, y=Value, fill=Metric)) +
  geom_bar(stat="identity", position="dodge") +
  labs(title="Model Performance Comparison (Gamma vs Lognormal vs XGBoost)",
       y="Error Value", x="Model") +
  theme_minimal()

6 Discussion

The results show that both Gamma and Lognormal two-part models perform similarly, with limited ability to capture the heavy right tail of claim severities. In contrast, XGBoost achieves much lower RMSE and MAE, demonstrating superior predictive accuracy and better handling of extreme claims.

While GLMs remain interpretable and aligned with actuarial practice, their predictive power is constrained by distributional assumptions. XGBoost highlights the potential of machine learning for telematics insurance pricing, though its lack of transparency remains a limitation.

Auto Insurance Pure Premium Prediction (Telematics)

Litong Liu