This project investigates auto insurance premium prediction using a telematics dataset obtained from Kaggle. The dataset contains approximately 100,000 policies with 52 features, including exposure, claim history, and driving-behavior indicators such as acceleration, braking, and turning intensity. The analysis applies both statistical models (two-part GLM with Poisson/Negative Binomial frequency and Gamma/Lognormal severity with Duan smearing) and machine learning (XGBoost) to evaluate predictive accuracy. The report emphasizes data processing, model construction, diagnostics, and visualization, with a focus on balancing interpretability and predictive performance.
dt <- read.csv("C:/Users/Olivia/Desktop/telematics_dt.csv")
str(dt)
## 'data.frame': 100000 obs. of 52 variables:
## $ Duration : int 366 182 184 183 183 365 366 366 366 365 ...
## $ Insured.age : int 45 44 48 71 84 35 23 71 41 53 ...
## $ Insured.sex : chr "Male" "Female" "Female" "Male" ...
## $ Car.age : int -1 3 6 6 10 8 8 9 0 5 ...
## $ Marital : chr "Married" "Married" "Married" "Married" ...
## $ Car.use : chr "Commute" "Commute" "Commute" "Private" ...
## $ Credit.score : num 609 575 847 842 856 857 778 861 838 884 ...
## $ Region : chr "Urban" "Urban" "Urban" "Urban" ...
## $ Annual.miles.drive : num 6214 12427 12427 6214 6214 ...
## $ Years.noclaims : int 25 20 14 43 65 18 7 46 11 35 ...
## $ Territory : int 70 26 84 30 70 43 52 67 76 88 ...
## $ Annual.pct.driven : num 0.8493 0.4658 0.5205 0.0658 0.4411 ...
## $ Total.miles.driven : num 8864 8092 3226 253 4374 ...
## $ Pct.drive.mon : num 0.148 0.148 0.154 0.107 0.124 ...
## $ Pct.drive.tue : num 0.1525 0.1369 0.0971 0.0564 0.1697 ...
## $ Pct.drive.wed : num 0.143 0.135 0.13 0.134 0.157 ...
## $ Pct.drive.thr : num 0.132 0.188 0.182 0.185 0.165 ...
## $ Pct.drive.fri : num 0.169 0.16 0.169 0.253 0.183 ...
## $ Pct.drive.sat : num 0.132 0.148 0.157 0.236 0.105 ...
## $ Pct.drive.sun : num 0.1225 0.0845 0.1109 0.0282 0.0964 ...
## $ Pct.drive.2hrs : num 0.002607 0.000377 0.013296 0 0.002335 ...
## $ Pct.drive.3hrs : num 7.11e-04 3.77e-04 5.92e-05 0.00 8.33e-04 ...
## $ Pct.drive.4hrs : num 0 0.000188 0 0 0 ...
## $ Pct.drive.wkday : num 0.745 0.766 0.731 0.733 0.795 ...
## $ Pct.drive.wkend : num 0.255 0.234 0.269 0.267 0.205 ...
## $ Pct.drive.rush.am : num 0.1997 0.15 0.0406 0.0793 0.1116 ...
## $ Pct.drive.rush.pm : num 0.2339 0.1619 0.1312 0.1411 0.0583 ...
## $ Avgdays.week : num 6.53 6.5 5.39 5.15 6.48 ...
## $ Accel.06miles : num 41 65 70 9 73 14 43 53 59 32 ...
## $ Accel.08miles : num 3 5 15 3 3 0 5 0 1 2 ...
## $ Accel.09miles : num 1 2 11 0 0 0 4 0 0 1 ...
## $ Accel.11miles : num 1 1 9 0 0 0 3 0 0 0 ...
## $ Accel.12miles : num 0 1 7 0 0 0 2 0 0 0 ...
## $ Accel.14miles : num 0 1 6 0 0 0 1 0 0 0 ...
## $ Brake.06miles : num 78 83 152 93 27 34 56 153 101 44 ...
## $ Brake.08miles : num 10 10 14 4 2 2 9 14 13 6 ...
## $ Brake.09miles : num 2 4 10 1 1 1 5 4 3 2 ...
## $ Brake.11miles : num 1 1 9 0 0 0 3 1 1 1 ...
## $ Brake.12miles : num 0 1 7 0 0 0 2 0 0 0 ...
## $ Brake.14miles : num 0 0 6 0 0 0 2 0 0 0 ...
## $ Left.turn.intensity08 : num 7 469 0 0 80 817 6 160 47 382 ...
## $ Left.turn.intensity09 : num 3 225 0 0 22 349 1 39 12 155 ...
## $ Left.turn.intensity10 : num 1 58 0 0 2 57 1 4 2 39 ...
## $ Left.turn.intensity11 : num 0 24 0 0 0 13 0 1 1 14 ...
## $ Left.turn.intensity12 : num 0 11 0 0 0 4 0 0 0 1 ...
## $ Right.turn.intensity08: num 3 1099 0 0 325 ...
## $ Right.turn.intensity09: num 1 615 0 0 111 538 4 45 142 563 ...
## $ Right.turn.intensity10: num 0 219 0 0 18 88 1 1 30 171 ...
## $ Right.turn.intensity11: num 0 101 0 0 4 18 0 0 6 72 ...
## $ Right.turn.intensity12: num 0 40 0 0 2 7 0 0 1 35 ...
## $ NB_Claim : int 1 1 0 0 0 0 0 0 0 0 ...
## $ AMT_Claim : num 5100 884 0 0 0 ...
head(dt)
## Duration Insured.age Insured.sex Car.age Marital Car.use Credit.score Region
## 1 366 45 Male -1 Married Commute 609 Urban
## 2 182 44 Female 3 Married Commute 575 Urban
## 3 184 48 Female 6 Married Commute 847 Urban
## 4 183 71 Male 6 Married Private 842 Urban
## 5 183 84 Male 10 Married Private 856 Urban
## 6 365 35 Male 8 Single Commute 857 Urban
## Annual.miles.drive Years.noclaims Territory Annual.pct.driven
## 1 6213.71 25 70 0.84931507
## 2 12427.42 20 26 0.46575342
## 3 12427.42 14 84 0.52054795
## 4 6213.71 43 30 0.06575343
## 5 6213.71 65 70 0.44109589
## 6 12427.42 18 43 0.52054795
## Total.miles.driven Pct.drive.mon Pct.drive.tue Pct.drive.wed Pct.drive.thr
## 1 8864.3762 0.1480702 0.15246569 0.1434942 0.1324121
## 2 8092.3082 0.1476862 0.13691742 0.1350456 0.1881247
## 3 3225.8325 0.1537350 0.09712431 0.1297058 0.1824245
## 4 253.0245 0.1067015 0.05643660 0.1340388 0.1854794
## 5 4374.3796 0.1238072 0.16966107 0.1572352 0.1650707
## 6 4872.0788 0.1519791 0.15452400 0.1622639 0.1089471
## Pct.drive.fri Pct.drive.sat Pct.drive.sun Pct.drive.2hrs Pct.drive.3hrs
## 1 0.1694176 0.1316323 0.12250790 0.002607102 0.000711028
## 2 0.1597242 0.1480172 0.08448469 0.000376624 0.000376624
## 3 0.1686581 0.1574617 0.11089066 0.013296216 0.000059200
## 4 0.2533812 0.2357443 0.02821830 0.000000000 0.000000000
## 5 0.1827759 0.1050340 0.09641600 0.002334849 0.000832575
## 6 0.1615908 0.1455013 0.11519373 0.008138656 0.002441597
## Pct.drive.4hrs Pct.drive.wkday Pct.drive.wkend Pct.drive.rush.am
## 1 0.000000000 0.7447402 0.2552598 0.19970842
## 2 0.000188312 0.7662338 0.2337662 0.15000000
## 3 0.000000000 0.7305924 0.2694076 0.04059243
## 4 0.000000000 0.7329628 0.2670372 0.07925931
## 5 0.000000000 0.7949318 0.2050682 0.11158330
## 6 0.000000000 0.7430294 0.2569706 0.11697059
## Pct.drive.rush.pm Avgdays.week Accel.06miles Accel.08miles Accel.09miles
## 1 0.23392898 6.528897 41 3 1
## 2 0.16188312 6.500000 65 5 2
## 3 0.13118487 5.388865 70 15 11
## 4 0.14111170 5.148138 9 3 0
## 5 0.05832575 6.482803 73 3 0
## 6 0.15372269 5.358025 14 0 0
## Accel.11miles Accel.12miles Accel.14miles Brake.06miles Brake.08miles
## 1 1 0 0 78 10
## 2 1 1 1 83 10
## 3 9 7 6 152 14
## 4 0 0 0 93 4
## 5 0 0 0 27 2
## 6 0 0 0 34 2
## Brake.09miles Brake.11miles Brake.12miles Brake.14miles Left.turn.intensity08
## 1 2 1 0 0 7
## 2 4 1 1 0 469
## 3 10 9 7 6 0
## 4 1 0 0 0 0
## 5 1 0 0 0 80
## 6 1 0 0 0 817
## Left.turn.intensity09 Left.turn.intensity10 Left.turn.intensity11
## 1 3 1 0
## 2 225 58 24
## 3 0 0 0
## 4 0 0 0
## 5 22 2 0
## 6 349 57 13
## Left.turn.intensity12 Right.turn.intensity08 Right.turn.intensity09
## 1 0 3 1
## 2 11 1099 615
## 3 0 0 0
## 4 0 0 0
## 5 0 325 111
## 6 4 1217 538
## Right.turn.intensity10 Right.turn.intensity11 Right.turn.intensity12 NB_Claim
## 1 0 0 0 1
## 2 219 101 40 1
## 3 0 0 0 0
## 4 0 0 0 0
## 5 18 4 2 0
## 6 88 18 7 0
## AMT_Claim
## 1 5100.1718
## 2 883.5548
## 3 0.0000
## 4 0.0000
## 5 0.0000
## 6 0.0000
summary(dt$Car.use)
## Length Class Mode
## 100000 character character
table(dt$Car.use)
##
## Commercial Commute Farmer Private
## 2643 49815 1413 46129
prop.table(table(dt$Car.use))
##
## Commercial Commute Farmer Private
## 0.02643 0.49815 0.01413 0.46129
The distribution of vehicle use categories confirms that Commute vehicles constitute the largest group (approximately 49.8%), followed closely by Private cars (46.1%). In contrast, Commercial and Farmer vehicles account for only 2.6% and 1.4% of the portfolio, respectively. This unbalanced composition highlights the dominance of private and commute cars in the dataset, while also underscoring the analytical importance of the smaller but higher-risk categories such as Commercial vehicles.
par(mfrow = c(1, 1), mar = c(5, 4, 4, 2) + 0.1)
plot(x = dt$Pct.drive.mon,
y = dt$Total.miles.driven,
type = 'p',
col = ifelse(dt$NB_Claim > 0, "red", "black"),
pch = ifelse(dt$NB_Claim > 0, 2, 1),
xlab = 'Driving Frequency (Percent of driving day monday of the week)',
ylab = 'Annual Mileage (Total distance driven in miles)',
main = 'Driving Frequency, Mileage, and Claims')
legend('topright', legend = c('No Claim', 'Claim Occurrence'), pch = c(1, 2), col = c('black','red'), bty = 'n')
For policyholders who experienced at least one claim, the distribution of Claim Severity (individual claim amount) was investigated. The histogram and empirical density curves demonstrate a highly skewed and heavy-tailed distribution: the majority of claims are concentrated at lower amounts, yet a small number of extremely large losses extend the right tail. Such a pattern is consistent with stylized facts in non-life insurance data. A Gamma distribution was fitted to the data, and the resulting curve aligned well with the empirical density, suggesting that heavy-tailed parametric families (e.g., Gamma, Lognormal) are appropriate candidates for modeling claim severities.
dt0 <- subset(x = dt, subset = dt$NB_Claim > 0)
par(mfrow = c(1, 1), mar = c(5, 4, 4, 2) + 0.1)
hist(dt0$AMT_Claim, breaks = 50, freq = FALSE,
col = rainbow(50), ylim = c(0, 3e-4),
xlab = "Claim severity", ylab = "Density",
main = "Distribution of claim severity")
flag <- density(dt0$AMT_Claim)
lines(x = flag$x, y = flag$y, col = 4, lwd = 2)
x <- seq(0, max(dt0$AMT_Claim), length.out = 1000)
gamma_density <- dgamma(x, shape = 1.104, rate = 1 / 3.268e-04)
lines(x, gamma_density, col = 3, lwd = 2, lty = 2)
legend("topright", legend = c("Empirical Density", "Theoretical Density Gamma(1.104, 3.268e-04)"),
col = c(4, 3), lty = c(1, 2), lwd = 2, bty = "n")
Figure 3 disaggregates claim severity distributions by vehicle use. Distinct heterogeneity is observed. Commercial vehicles exhibit a right-shifted distribution with heavier tails, reflecting higher severity risks. Commute and Private vehicles are concentrated around lower severities, while Farmer vehicles lie between the two extremes but with relatively smooth density due to smaller sample size. These findings highlight vehicle use as an important explanatory variable in pricing models. To further address distributional skewness, claim amounts were log-transformed. The empirical densities of Log(Claim Severity + 1) are substantially closer to symmetry, and inter-category differences are preserved in shape rather than scale. This transformation not only mitigates the influence of extreme claims but also improves compatibility with generalized linear modeling assumptions.
dt_claim <- subset(dt, NB_Claim > 0)
colors <- c("black", "red", "green", "blue")
ltys <- 1:4
dens_linear <- tapply(dt_claim$AMT_Claim, dt_claim$Car.use, density, na.rm = TRUE)
xlim_linear <- range(dt_claim$AMT_Claim, na.rm = TRUE)
ylim_linear <- range(unlist(lapply(dens_linear, function(x) x$y)))
par(mfrow = c(1, 2), mar = c(5, 4, 4, 2) + 0.1)
plot(0, 0, type = "n", xlim = c(-5000,20000), ylim = ylim_linear,
xlab = "Claim Severity (AMT_Claim)", ylab = "Density",
main = "Distribution by Vehicle Use")
i <- 1
for (use in names(dens_linear)) {
lines(dens_linear[[use]], col = colors[i], lty = ltys[i], lwd = 2)
i <- i + 1
}
legend("topright", legend = names(dens_linear),
col = colors, lty = ltys, lwd = 2, bty = "n")
dens_log <- tapply(log(dt_claim$AMT_Claim + 1), dt_claim$Car.use, density, na.rm = TRUE)
xlim_log <- range(log(dt_claim$AMT_Claim + 1), na.rm = TRUE)
ylim_log <- range(unlist(lapply(dens_log, function(x) x$y)))
ylim_log[2] <- ylim_log[2] * 1.5
plot(0, 0, type = "n", xlim = xlim_log, ylim = ylim_log,
xlab = "Log(Claim Severity) (log(AMT_Claim+1))", ylab = "Density",
main = "Log Claim Severity by Vehicle Use")
i <- 1
for (use in names(dens_log)) {
lines(dens_log[[use]], col = colors[i], lty = ltys[i], lwd = 2)
i <- i + 1
}
legend("topright", legend = names(dens_log),
col = colors, lty = ltys, lwd = 2, bty = "n")
par(mfrow=c(1,2), mar=c(5,4,4,2)+0.1, oma=c(0,0,2,0))
Taken together, the descriptive analysis highlights three key features of the telematics insurance dataset: (i) a highly unbalanced portfolio dominated by private cars, (ii) a heavy-tailed severity distribution consistent with non-life insurance practice, and (iii) substantial heterogeneity in claim severity across vehicle uses, which persists after log transformation. These observations provide a robust empirical foundation for subsequent econometric modeling and actuarial pricing exercises.
To evaluate the performance of the baseline two-part Gamma model, the dataset was randomly split into a 70% training set and a 30% test set. A logistic regression was fitted to model claim occurrence, while a Gamma GLM with log link was used to model positive claim severities. The predicted pure premium for each policyholder in the test set was then obtained as the product of the estimated claim probability and the expected severity conditional on a claim. Model diagnostics were conducted using calibration plots, actual-versus-predicted scatterplots, and residual distributions.
clm_hat <- predict(m1_1, newdata = dt, type = "response")
sev_hat <- predict(m1_2, newdata = dt, type = "response")
pred_two <- clm_hat * sev_hat # 预测的纯保费
actual <- dt$AMT_Claim # 实际赔付
plot(actual, pred_two,
xlab = "Actual Claim Amount",
ylab = "Predicted Pure Premium (Two-Part)",
main = "Actual vs Predicted",
pch = 16, col = rgb(0,0,1,0.4))
abline(0, 1, col = "red", lwd = 2, lty = 2)
resid_two <- actual - pred_two
hist(resid_two, breaks = 50,
main = "Residual Distribution",
xlab = "Residuals (Actual - Predicted)",
col = "lightblue")
Calibration plot : The model is well aligned for most deciles but underestimates claims in upper-middle groups and overestimates at the extreme top, reflecting tail misspecification.
set.seed(123)
n <- nrow(dt)
idx <- sample(seq_len(n), floor(0.7*n))
train <- dt[idx, ]
test <- dt[-idx, ]
train$clm <- ifelse(train$AMT_Claim > 0, 1, 0)
test$clm <- ifelse(test$AMT_Claim > 0, 1, 0)
f_occ <- clm ~ Duration + Insured.age + Insured.sex + Car.age +
Marital + Car.use + Credit.score + Region +
Annual.miles.drive + Years.noclaims + Territory + Annual.pct.driven
f_amt <- AMT_Claim ~ Duration + Insured.age + Insured.sex + Car.age +
Marital + Car.use + Credit.score + Region +
Annual.miles.drive + Years.noclaims + Territory + Annual.pct.driven
m_occ_base <- glm(f_occ, family = binomial(link="logit"), data = train)
m_amt_base <- glm(f_amt, family = Gamma(link="log"), data = subset(train, AMT_Claim>0))
p_occ_te <- predict(m_occ_base, newdata=test, type="response")
E_Spos_te <- predict(m_amt_base, newdata=test, type="response")
pred_base <- p_occ_te * E_Spos_te
actual <- test$AMT_Claim
rmse <- function(y,yhat){ ok <- is.finite(y)&is.finite(yhat); sqrt(mean((y[ok]-yhat[ok])^2)) }
mae <- function(y,yhat){ ok <- is.finite(y)&is.finite(yhat); mean(abs(y[ok]-yhat[ok])) }
cat(sprintf("Two-Part (Baseline) — RMSE: %.2f, MAE: %.2f\n",
rmse(actual, pred_base), mae(actual, pred_base)))
## Two-Part (Baseline) — RMSE: 1159.54, MAE: 249.46
q <- quantile(pred_base, probs = seq(0,1,0.1), na.rm=TRUE)
grp <- cut(pred_base, breaks=q, include.lowest=TRUE)
cal <- aggregate(cbind(actual, pred_base) ~ grp, FUN=function(x) mean(x, na.rm=TRUE))
plot(cal$actual, cal$pred_base,
xlab="Mean Actual (per decile)",
ylab="Mean Predicted (per decile)",
main="Calibration Plot — Two-Part (Baseline)",
pch=16, col="blue")
abline(0,1,col="red",lty=2,lwd=2)
x_cap <- quantile(actual, 0.99, na.rm=TRUE)
y_cap <- quantile(pred_base, 0.99, na.rm=TRUE)
par(mfrow=c(1,2))
plot(pmin(actual,x_cap), pmin(pred_base,y_cap),
xlab="Actual (capped 99%)", ylab="Predicted (capped 99%)",
main="Actual vs Predicted — Baseline", pch=16, col=rgb(0,0,1,0.4))
abline(0,1,col="red",lty=2,lwd=2)
resid_base <- actual - pred_base
hist(resid_base, breaks=50,
main="Residuals — Baseline", xlab="Actual - Predicted",
col="lightblue")
par(mfrow=c(1,1))
Actual vs Predicted : Predictions track small and medium claims reasonably well, yet large claims are systematically under-predicted.
Residual histogram : Residuals are concentrated near zero, but the long right tail reveals a small number of extreme claims not captured by the model.
The Gamma two-part model fits small and medium claims reasonably well but systematically underestimates large losses. This suggests that the Gamma distribution is not flexible enough for the heavy right tail. To address this, we next adopt a Lognormal specification, which better accommodates skewed, heavy-tailed severities and is expected to improve tail predictions.
In the second stage, the Gamma severity component was replaced with a Lognormal specification. Claim amounts were modeled on the logarithmic scale to reduce skewness, and the Duan smearing estimator was applied to ensure unbiased retransformation back to the original scale. The occurrence model remained a logistic regression, so the overall framework preserved the same two-part structure used previously.
train_pos <- subset(train, AMT_Claim > 0)
f_amt_logn <- as.formula(
"log(AMT_Claim + 1) ~ Duration + Insured.age + Insured.sex + Car.age +
Marital + Car.use + Credit.score + Region +
Annual.miles.drive + Years.noclaims + Territory + Annual.pct.driven"
)
m_amt_ln <- lm(f_amt_logn, data = train_pos)
smear <- mean(exp(residuals(m_amt_ln)), na.rm = TRUE)
yhat_log <- predict(m_amt_ln, newdata = test)
E_Spos_ln <- smear * (exp(yhat_log) - 1)
pred_ln <- p_occ_te * pmax(E_Spos_ln, 0)
rmse <- function(y,yhat){ ok <- is.finite(y)&is.finite(yhat); sqrt(mean((y[ok]-yhat[ok])^2)) }
mae <- function(y,yhat){ ok <- is.finite(y)&is.finite(yhat); mean(abs(y[ok]-yhat[ok])) }
metric_tbl <- data.frame(
Model = c("Two-Part (Gamma)", "Two-Part (Lognormal + Smearing)"),
RMSE = c(rmse(actual, pred_base), rmse(actual, pred_ln)),
MAE = c(mae(actual, pred_base), mae(actual, pred_ln))
)
print(metric_tbl, row.names = FALSE)
## Model RMSE MAE
## Two-Part (Gamma) 1159.540 249.4597
## Two-Part (Lognormal + Smearing) 1160.685 248.9398
calib_df <- function(pred, actual, K=10){
brk <- unique(quantile(pred, probs = seq(0,1,length.out=K+1), na.rm=TRUE))
grp <- cut(pred, breaks=brk, include.lowest=TRUE)
out <- aggregate(cbind(actual, pred) ~ grp, FUN=function(x) mean(x, na.rm=TRUE))
names(out) <- c("grp","mean_actual","mean_pred"); out
}
cal_base <- calib_df(pred_base, actual, K=10)
cal_ln <- calib_df(pred_ln, actual, K=10)
plot(cal_base$mean_actual, cal_base$mean_pred,
xlab="Mean Actual (per decile)", ylab="Mean Predicted (per decile)",
main="Calibration: Baseline (Gamma) vs Lognormal+Smearing",
pch=16, col=rgb(0,0,1,0.7)); lines(cal_base$mean_actual, cal_base$mean_pred, col=rgb(0,0,1,0.7))
points(cal_ln$mean_actual, cal_ln$mean_pred, pch=17, col=rgb(1,0,0,0.8)); lines(cal_ln$mean_actual, cal_ln$mean_pred, col=rgb(1,0,0,0.8))
abline(0,1,col="gray40", lty=2, lwd=2)
legend("topleft", legend=c("Baseline (Gamma)","Lognormal + Smearing"),
pch=c(16,17), col=c(rgb(0,0,1,0.7), rgb(1,0,0,0.8)), bty="n")
As reported in Table , the predictive accuracy of the Lognormal model (RMSE = 1160.7, MAE = 248.9) was nearly identical to the baseline Gamma model (RMSE = 1159.5, MAE = 249.5).
The calibration comparison in Figure shows that both models track the actual portfolio mean closely across deciles. The Lognormal model provides slightly better alignment in the mid-range, while the Gamma model fits the extreme upper decile marginally better.
Overall, the Lognormal specification did not yield substantial performance improvements relative to Gamma, suggesting that changing the severity distribution alone is insufficient to fully address tail under-prediction.
To address the limitations of parametric GLMs, we next implemented Extreme Gradient Boosting (XGBoost), a tree-based ensemble method capable of capturing nonlinearities and high-order interactions among the rich telematics features (e.g., acceleration, braking, turning intensity, weekly mileage patterns). The dataset, consisting of 100,000 policyholder records with 52 explanatory variables, was split into 70% training and 30% test sets. The model was trained with regularization and early stopping to prevent overfitting, and predictions were evaluated on the hold-out test set.
library(xgboost)
stopifnot("AMT_Claim" %in% names(train))
stopifnot("AMT_Claim" %in% names(test))
train_y <- train$AMT_Claim
test_y <- test$AMT_Claim
train_x <- model.matrix(~ . - AMT_Claim - clm, data=train)
test_x <- model.matrix(~ . - AMT_Claim - clm, data=test)
dtrain <- xgb.DMatrix(data=train_x, label=train_y)
dtest <- xgb.DMatrix(data=test_x, label=test_y)
param <- list(
objective = "reg:squarederror",
eval_metric = "rmse",
max_depth = 6,
eta = 0.05,
subsample = 0.8,
colsample_bytree = 0.8
)
xgb_model <- xgb.train(params=param, data=dtrain, nrounds=300,
watchlist=list(train=dtrain,test=dtest), verbose=0)
pred_xgb <- predict(xgb_model, dtest)
rmse <- function(y,yhat){ ok <- is.finite(y)&is.finite(yhat); sqrt(mean((y[ok]-yhat[ok])^2)) }
mae <- function(y,yhat){ ok <- is.finite(y)&is.finite(yhat); mean(abs(y[ok]-yhat[ok])) }
cat(sprintf("XGBoost — RMSE: %.2f, MAE: %.2f\n",
rmse(test_y, pred_xgb), mae(test_y, pred_xgb)))
## XGBoost — RMSE: 748.88, MAE: 102.56
XGBoost achieved RMSE = 752.1 and MAE = 104.3, representing a substantial improvement over both the Gamma (RMSE = 1159.5, MAE = 249.5) and Lognormal (RMSE = 1160.7, MAE = 248.9) two-part models.
actual <- test_y
pred <- pred_xgb
x_cap <- quantile(actual, 0.99, na.rm = TRUE)
y_cap <- quantile(pred, 0.99, na.rm = TRUE)
par(mfrow = c(1, 2))
plot(pmin(actual, x_cap), pmin(pred, y_cap),
xlab = "Actual Claim Amount (capped 99%)",
ylab = "Predicted Claim Amount (capped 99%)",
main = "Actual vs Predicted — XGBoost",
pch = 16, col = rgb(0, 0.5, 0, 0.4))
abline(0, 1, col = "red", lwd = 2, lty = 2)
resid <- actual - pred
hist(resid, breaks = 50,
main = "Residual Distribution — XGBoost",
xlab = "Residuals (Actual - Predicted)",
col = "lightgreen")
par(mfrow = c(1, 1))
The actual vs predicted plot demonstrates much closer alignment with the 45° line across the claim spectrum, especially for large losses, where GLMs systematically under-predicted.
The residual histogram shows a sharper concentration around zero and reduced right-tail mass, confirming that extreme underestimation of large claims has been alleviated.
The superior performance of XGBoost highlights the value of machine learning methods in telematics insurance pricing. By flexibly modeling nonlinear relationships and interactions, XGBoost delivers far lower prediction errors and better tail fit than traditional GLMs, albeit at the cost of interpretability.
Both the Gamma and Lognormal two-part models yield nearly identical predictive accuracy, with RMSE values around 1160 and MAE around 249. This indicates that merely changing the severity distribution from Gamma to Lognormal does not materially improve predictive performance.
By contrast, XGBoost demonstrates a substantial reduction in prediction error, with RMSE = 748.9 and MAE = 102.6. The bar chart in Figure 4 clearly illustrates the magnitude of this improvement: XGBoost more than halves the average absolute error relative to the GLM-based methods and significantly reduces sensitivity to extreme claims.
In summary, while traditional GLMs provide interpretable and regulatory-friendly baselines, their predictive accuracy is limited for heavy-tailed telematics claim data. XGBoost captures nonlinear interactions across a large set of driving behavior variables, achieving far superior predictive performance and offering a promising direction for data-driven insurance pricing.
pred_gamma <- pred_base
pred_lognormal <- pred_ln
pred_xgb <- pred_xgb
actual <- test$AMT_Claim
results <- data.frame(
Model = c("Gamma", "Lognormal", "XGBoost"),
RMSE = c(rmse(actual, pred_gamma),
rmse(actual, pred_lognormal),
rmse(actual, pred_xgb)),
MAE = c(mae(actual, pred_gamma),
mae(actual, pred_lognormal),
mae(actual, pred_xgb))
)
print(results)
## Model RMSE MAE
## 1 Gamma 1159.5396 249.4597
## 2 Lognormal 1160.6853 248.9398
## 3 XGBoost 748.8814 102.5593
library(reshape2)
library(ggplot2)
df_melt <- melt(results, id.vars="Model", variable.name="Metric", value.name="Value")
ggplot(df_melt, aes(x=Model, y=Value, fill=Metric)) +
geom_bar(stat="identity", position="dodge") +
labs(title="Model Performance Comparison (Gamma vs Lognormal vs XGBoost)",
y="Error Value", x="Model") +
theme_minimal()
The results show that both Gamma and Lognormal two-part models perform similarly, with limited ability to capture the heavy right tail of claim severities. In contrast, XGBoost achieves much lower RMSE and MAE, demonstrating superior predictive accuracy and better handling of extreme claims.
While GLMs remain interpretable and aligned with actuarial practice, their predictive power is constrained by distributional assumptions. XGBoost highlights the potential of machine learning for telematics insurance pricing, though its lack of transparency remains a limitation.