Modeling Car Insurance Accidents and Cost of Accidents
Author
Darwhin Gomez
Modeling for Car Insurance
Car insurance provides financial protection in cases of property damage or personal injury resulting from automobile accidents. Insurers must estimate two key outcomes for each policyholder:
The probability that the driver will be involved in a crash, and
The expected financial cost of that crash, if one occurs.
Accurate predictions of both components are essential for pricing policies, managing risk, and ensuring the financial stability of the insurer. From a modeling perspective, this naturally leads to a two-part predictive task: a binary classification problem (crash vs. no crash) and a continuous regression problem (claim cost conditional on a crash).
To address these questions, we apply supervised machine learning techniques: binary logistic regression to model crash probability and multiple linear regression to estimate crash cost. Both modeling approaches are appropriate given the structured, tabular nature of the data and the interpretability requirements common in insurance analytics.
The provided training data consist of 8,161 observations and 26 variables, including demographic characteristics, vehicle attributes, prior claims history, driving record, and socioeconomic indicators. The evaluation dataset includes an additional 2,141 records for which model predictions must be generated.
Methods
This study follows a structured end-to-end modeling workflow typical in actuarial data mining:
Data Exploration
We begin by reviewing the distributions, central tendencies, correlations, and missingness patterns across all predictors. This step provides intuition into variable behavior and informs subsequent transformations.
Data Preparation
Several variables contain missing values (e.g., income, home value, years on job). We address missingness using median imputation for numeric variables and create missing-indicator flags where appropriate. Skewed variables such as vehicle value and prior claim amounts undergo log-transformations to stabilize variance. Categorical predictors are encoded as factors.
Model Development
Binary Logistic Regression:
Multiple logistic regression models are trained to predict TARGET_FLAG, the indicator for whether a driver experienced a crash. Different variable subsets and transformations are explored, including stepwise selection.
Multiple Linear Regression:
For records where a crash occurred, linear regression models are fitted to TARGET_AMT, the associated claim cost. Alternative specifications are compared based on goodness-of-fit, interpretability, and model diagnostics.
Model Evaluation and Selection
We evaluate logistic models using accuracy, precision, sensitivity, specificity, F1-score, and AUC. Linear regression models are assessed using R², adjusted R², RMSE, F-statistics, and residual diagnostics. Cross-validation is used to mitigate overfitting and ensure model generalizability.
Prediction on Evaluation Data
Once the final models are selected, we generate predictions for the evaluation dataset.
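As a rough illustration of the preparation steps described under Methods, the sketch below applies median imputation with a missing-indicator flag and a log transform to a toy data frame. The values are invented, and BLUEBOOK is assumed here as the name of the vehicle-value column; this is not the report's actual code.

```r
# Toy data; values are made up and BLUEBOOK is an assumed column name.
df <- data.frame(
  INCOME   = c(52000, NA, 31000, 87000, NA),
  BLUEBOOK = c(14000, 9000, 31000, 5500, 22000),
  OLDCLAIM = c(0, 4500, 0, 0, 12000)
)

# Record missingness before imputing so the information is not lost.
df$INCOME_MISSING <- as.integer(is.na(df$INCOME))

# Median imputation for a numeric variable.
df$INCOME[is.na(df$INCOME)] <- median(df$INCOME, na.rm = TRUE)

# Log transforms for right-skewed amounts. log1p() is used for prior claims
# because many policyholders have a prior claim amount of zero and log(0) is -Inf.
df$LOG_BLUEBOOK <- log(df$BLUEBOOK)
df$LOG_OLDCLAIM <- log1p(df$OLDCLAIM)
```

Using log1p() rather than log() on columns that contain zeros avoids non-finite values, which plotting and modeling functions would otherwise drop or reject.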
Data manipulation
Missing Data
Job: could be missing for any number of reasons, so we keep these cases under a new label, “Unspecified” (526 cases in train).
Car age: this is peculiar, since model year is a primary data point for insurance. These could be very new or very old cars (510 cases in train).
Age: a small number of cases (6 in train). We could mean-impute here, although these rows were ultimately removed.
Home value: a missing value could indicate that the person does not own a home, in which case the value would be 0 (464 cases).
YOJ (years on job): we check whether missingness is connected to whether a job is specified (454 cases).
Income: if no job is listed, an income of zero could make sense (445 cases in train).
During data preparation, I observed that many individuals with missing income also had commercial-use vehicles and job category recoded as “SelfEmployed.”
Because self-employed drivers with commercial auto policies likely report income similarly, I imputed their missing INCOME values using the median income of all commercial-use customers:
57892
This preserves domain logic and stabilizes the logistic regression model.
We also imputed missing income for private-use cases with the median income of all private-use customers:
51110
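A minimal sketch of this group-wise median imputation, using base R's ave() on invented values. CAR_USE is assumed to be the name of the vehicle-use column; the medians printed above come from the real data, not from this toy example.

```r
# Toy data; CAR_USE is an assumed column name and the incomes are made up.
toy <- data.frame(
  CAR_USE = c("Commercial", "Commercial", "Commercial",
              "Private", "Private", "Private"),
  INCOME  = c(40000, NA, 60000, 30000, 50000, NA)
)

# Median income within each vehicle-use group, repeated for every row.
grp_median <- ave(toy$INCOME, toy$CAR_USE,
                  FUN = function(x) median(x, na.rm = TRUE))

# Fill only the missing entries with their group's median.
toy$INCOME <- ifelse(is.na(toy$INCOME), grp_median, toy$INCOME)
```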
skim(train)
Missing values were addressed using domain-appropriate logic.
Income was imputed using median values segmented by vehicle use (commercial vs. private) and adjusted for self-employed individuals. Home values were imputed to zero, YOJ was imputed to zero due to its distribution and realistic interpretation, and CAR_AGE was cleaned by setting negative values to zero and imputing the remaining values using the mean. Job missingness was recoded to “Unspecified,” and records with commercial vehicle use and unspecified job type were reassigned to “SelfEmployed.” All categorical variables were cleaned by removing “z_” prefixes and refactoring levels. Rows missing AGE were removed. After transformation, the dataset contains no problematic missingness and is suitable for modeling.
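The cleaning rules in this summary could be expressed roughly as follows. This is a sketch on a four-row toy data frame with invented values, not the code used for the report; the order of operations (clamp negative CAR_AGE first, then mean-impute) is an assumption based on the wording above.

```r
# Toy data; values are made up for illustration.
toy <- data.frame(
  JOB     = c("z_Blue Collar", NA, "Manager", NA),
  CAR_USE = c("Private", "Commercial", "Private", "Private"),
  CAR_AGE = c(-3, 10, NA, 4),
  stringsAsFactors = FALSE
)

# Strip the "z_" prefix from category labels (NA stays NA).
toy$JOB <- sub("^z_", "", toy$JOB)

# Missing job becomes "Unspecified"; commercial use with an unspecified job
# is reassigned to "SelfEmployed".
toy$JOB[is.na(toy$JOB)] <- "Unspecified"
toy$JOB[toy$JOB == "Unspecified" & toy$CAR_USE == "Commercial"] <- "SelfEmployed"
toy$JOB <- factor(toy$JOB)

# Negative car ages are set to zero, then remaining gaps get the mean.
toy$CAR_AGE[!is.na(toy$CAR_AGE) & toy$CAR_AGE < 0] <- 0
toy$CAR_AGE[is.na(toy$CAR_AGE)] <- mean(toy$CAR_AGE, na.rm = TRUE)
```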
The correlation matrix revealed several meaningful relationships between numeric predictors and the likelihood of being involved in an accident (TARGET_FLAG). Variables related to household composition showed notable correlations: having children (HOMEKIDS) and especially having children of driving age (KIDSDRIV) were positively associated with crash risk. Behavioral and driving-history measures were also strong indicators. Prior claims history (OLDCLAIM), claim frequency (CLM_FREQ), and accumulated motor vehicle record points (MVR_PTS) all demonstrated positive correlations with accident involvement, consistent with actuarial expectations that past behavior is predictive of future risk. Additionally, longer commute distances (TRAVTIME) exhibited a mild but meaningful correlation with higher crash probability, reflecting increased road exposure. Overall, the correlation structure supports the inclusion of these variables in the logistic regression model, both for predictive strength and domain relevance.
The dataset exhibits a clear class imbalance: only about one in four policyholders experienced an accident, meaning roughly 25% of observations have TARGET_FLAG = 1, while the remaining 75% did not. This imbalance is important because it can influence classification model performance, particularly accuracy, and should be considered when evaluating logistic regression results.
Modeling
Logistic models
To prepare for modeling, we encoded the categorical variables.
[1] "Dimensions"
[1] "train:"
[1] 8155 39
[1] "eval:"
[1] 2140 39
[1] TRUE
[1] "All collumn names are excact in the train and eval sets."
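One way to get encoded train and evaluation sets with identical columns, as the check above reports, is to share factor levels before calling model.matrix(). The sketch below assumes a hypothetical CAR_TYPE column and toy values; the report may have used a different encoding helper.

```r
# Toy data; CAR_TYPE is an assumed column name.
train_toy <- data.frame(
  CAR_TYPE = factor(c("Van", "SUV", "Minivan")),
  AGE      = c(40, 35, 50)
)

# Reuse the training levels so a level absent from eval still gets a column.
eval_toy <- data.frame(
  CAR_TYPE = factor(c("SUV", "SUV"), levels = levels(train_toy$CAR_TYPE)),
  AGE      = c(29, 61)
)

# Dummy-encode the factors; drop the intercept column.
x_train <- model.matrix(~ ., data = train_toy)[, -1]
x_eval  <- model.matrix(~ ., data = eval_toy)[, -1]

same_columns <- identical(colnames(x_train), colnames(x_eval))
```

Without the shared levels, the evaluation matrix would lack columns for categories that happen not to appear in it, and predictions from a model fitted on the training matrix would fail or misalign.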
Confusion Matrix and Statistics

          Reference
Prediction    No   Yes
       No  10551  2904
       Yes  1463  1392

               Accuracy : 0.7323
                 95% CI : (0.7254, 0.739)
    No Information Rate : 0.7366
    P-Value [Acc > NIR] : 0.898

                  Kappa : 0.2267

 Mcnemar's Test P-Value : <2e-16

            Sensitivity : 0.32402
            Specificity : 0.87823
         Pos Pred Value : 0.48757
         Neg Pred Value : 0.78417
             Prevalence : 0.26340
         Detection Rate : 0.08535
   Detection Prevalence : 0.17505
      Balanced Accuracy : 0.60112

       'Positive' Class : Yes
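The summary statistics in this output follow directly from the four cell counts. The sketch below recomputes them by hand from the counts in the matrix above; the F1 score is not printed in that output and is derived here from precision and sensitivity.

```r
# Cell counts from the confusion matrix above ("Yes" is the positive class).
tp <- 1392   # predicted Yes, actual Yes
fn <- 2904   # predicted No,  actual Yes
fp <- 1463   # predicted Yes, actual No
tn <- 10551  # predicted No,  actual No
total <- tp + fn + fp + tn

sensitivity <- tp / (tp + fn)         # true positive rate
specificity <- tn / (tn + fp)         # true negative rate
precision   <- tp / (tp + fp)         # positive predictive value
accuracy    <- (tp + tn) / total
balanced_accuracy <- (sensitivity + specificity) / 2
f1 <- 2 * precision * sensitivity / (precision + sensitivity)

# No-information rate: accuracy of always predicting the majority class "No".
nir <- (tn + fp) / total
```

Here the accuracy (about 0.7323) falls just below the no-information rate (about 0.7366), which is why accuracy alone is a weak yardstick under the roughly 74/26 class split and why sensitivity and balanced accuracy are reported alongside it.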
cm_rf
Confusion Matrix and Statistics

          Reference
Prediction    No   Yes
       No  17004  4444
       Yes  1017  2000

               Accuracy : 0.7768
                 95% CI : (0.7715, 0.782)
    No Information Rate : 0.7366
    P-Value [Acc > NIR] : < 2.2e-16

                  Kappa : 0.3062

 Mcnemar's Test P-Value : < 2.2e-16

            Sensitivity : 0.31037
            Specificity : 0.94357
         Pos Pred Value : 0.66291
         Neg Pred Value : 0.79280
             Prevalence : 0.26340
         Detection Rate : 0.08175
   Detection Prevalence : 0.12332
      Balanced Accuracy : 0.62697

       'Positive' Class : Yes
cm_xgb
Confusion Matrix and Statistics

          Reference
Prediction     No    Yes
       No  596778 135528
       Yes  51978  96456

               Accuracy : 0.7871
                 95% CI : (0.7862, 0.788)
    No Information Rate : 0.7366
    P-Value [Acc > NIR] : < 2.2e-16

                  Kappa : 0.3796

 Mcnemar's Test P-Value : < 2.2e-16

            Sensitivity : 0.4158
            Specificity : 0.9199
         Pos Pred Value : 0.6498
         Neg Pred Value : 0.8149
             Prevalence : 0.2634
         Detection Rate : 0.1095
   Detection Prevalence : 0.1685
      Balanced Accuracy : 0.6678

       'Positive' Class : Yes
| Model | Sensitivity (TPR) | Specificity (TNR) | Accuracy | Balanced Accuracy |
|---|---|---|---|---|
| Logistic Regression | 0.417 | 0.922 | 0.789 | 0.669 |
| Naive Bayes | 0.322 | 0.879 | 0.733 | 0.601 |
| Random Forest | 0.311 | 0.946 | 0.779 | 0.628 |
| XGBoost | 0.414 | 0.919 | 0.786 | 0.667 |
The logistic regression model performs well with the encoded variables, slightly outperforming the other tested models on sensitivity (0.417), accuracy (0.789), and balanced accuracy (0.669); only the random forest achieves a higher specificity (0.946). In addition to its predictive accuracy, it has the advantage of being easily interpretable, as the direction and magnitude of each coefficient show directly how the predictors influence crash likelihood.
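To make the interpretability point concrete, the sketch below fits a small logistic regression with glm() on simulated data and converts the coefficients to odds ratios. MVR_PTS and KIDSDRIV are column names from the report, but the data and the effect sizes here are invented for illustration.

```r
set.seed(7)
n <- 3000

# Simulated predictors and outcome; coefficients are made up.
toy <- data.frame(
  MVR_PTS  = rpois(n, 1.7),
  KIDSDRIV = rbinom(n, 2, 0.1)
)
toy$TARGET_FLAG <- rbinom(
  n, 1, plogis(-1.6 + 0.2 * toy$MVR_PTS + 0.4 * toy$KIDSDRIV)
)

fit <- glm(TARGET_FLAG ~ MVR_PTS + KIDSDRIV, data = toy, family = binomial)

# exp(coefficient) is the multiplicative change in the odds of a crash
# for a one-unit increase in that predictor, holding the others fixed.
odds_ratios <- exp(coef(fit))
```

An odds ratio above 1 indicates that the predictor raises the odds of a crash; a value below 1 indicates that it lowers them.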
Linear Regression for TARGET_AMT
Because TARGET_AMT represents the dollar amount of a crash only when a crash actually occurs, the severity model must be trained exclusively on policyholders who experienced an accident (TARGET_FLAG = 1). This results in a much smaller and more concentrated training subset. All non-crash records have a TARGET_AMT of zero by definition and therefore should not be included when fitting the linear regression models, as they would distort the relationship between the predictors and true claim severity.
The severity model was therefore fit exclusively on crash records. Several peers reported higher R² values by fitting a regression model to the entire dataset, where approximately 75% of records have TARGET_AMT = 0. That approach inflates apparent model performance, since predicting zero is trivial, but it mixes frequency and severity and does not reflect proper actuarial modeling practice. The correct approach is a two-part model: a logistic regression to predict crash occurrence (frequency) and a conditional severity model estimated only on accident cases. As a result, the R² of the severity model is lower, which is expected given the inherent variability of claim costs and the limited predictors available.
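A minimal sketch of the two-part structure on simulated data: glm() for crash occurrence and lm() for cost on crash records only. The data-generating values are invented, and multiplying the two predictions to obtain an expected loss per policy is a common actuarial convention assumed here, not a step taken in this report.

```r
set.seed(1)
n <- 2000

# Simulated policyholders; coefficients and cost distribution are made up.
sim <- data.frame(
  MVR_PTS  = rpois(n, 1.5),
  TRAVTIME = runif(n, 5, 90)
)
p_true <- plogis(-2 + 0.3 * sim$MVR_PTS + 0.01 * sim$TRAVTIME)
sim$TARGET_FLAG <- rbinom(n, 1, p_true)
sim$TARGET_AMT  <- ifelse(sim$TARGET_FLAG == 1,
                          rlnorm(n, meanlog = 8, sdlog = 0.8), 0)

# Part 1 (frequency): probability of a crash, fitted on all records.
freq_fit <- glm(TARGET_FLAG ~ MVR_PTS + TRAVTIME,
                data = sim, family = binomial)

# Part 2 (severity): claim cost, fitted on crash records only.
crashes <- subset(sim, TARGET_FLAG == 1)
sev_fit <- lm(TARGET_AMT ~ MVR_PTS + TRAVTIME, data = crashes)

# Combine: P(crash) times predicted cost given a crash.
p_crash       <- predict(freq_fit, newdata = sim, type = "response")
cost_if_crash <- predict(sev_fit, newdata = sim)
expected_loss <- p_crash * cost_if_crash
```

Keeping the two fits separate means the severity coefficients describe only how cost varies among crashes, while the frequency model carries all the information about who crashes.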
Furthermore, predicting the dollar cost of a crash is inherently difficult using this dataset, as actual severity is influenced by unobserved factors such as injury level, property damage, speed of impact, environmental conditions, and accident type, none of which are captured in the data. To illustrate this point, I also trained a Box–Cox transformed regression model on the entire training set and obtained a much higher R² of approximately 0.29. However, this improvement is misleading: the model achieves a high R² only because it learns to predict values close to zero, which dominate the dataset. In other words, the model appears more accurate simply because most policyholders did not file a claim, not because it is better at predicting true claim severity.
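The inflation effect can be reproduced on simulated data. In the sketch below, the predictor x influences only whether a crash occurs; claim cost given a crash is drawn independently of x. All values are invented, and an untransformed lm() stands in for the Box–Cox model to keep the example short.

```r
set.seed(123)
n <- 5000

x    <- rnorm(n)
flag <- rbinom(n, 1, plogis(-1.2 + 1.5 * x))  # crash depends on x

# Cost given a crash is independent of x by construction.
severity <- rlnorm(n, meanlog = 8, sdlog = 0.8)
amt <- flag * severity                         # zero when there is no crash

# Fit on all records: x "explains" cost only by separating zeros from non-zeros.
r2_full <- summary(lm(amt ~ x))$r.squared

# Fit on crash records only: x has no real relationship with cost.
r2_crash <- summary(lm(amt ~ x, subset = flag == 1))$r.squared
```

The full-data R² comes out clearly higher even though x carries no information about severity, which mirrors the pattern described above: the gain reflects the frequency signal, not better cost prediction.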
Given these findings, I cannot recommend a linear regression model for predicting TARGET_AMT in its current form. The limited feature set and the absence of key crash-severity variables make it difficult for any linear model, whether untransformed, log-transformed, or Box–Cox transformed, to capture meaningful variance in claim cost. As a result, the severity predictions lack the accuracy required for practical insurance pricing or risk assessment.
If we examine the Q–Q plot for the model trained on the entire dataset, we immediately see why this model is invalid. The upper tail of the plot deviates sharply upward after approximately the second theoretical quantile. This spike corresponds to all observations with non-zero claim amounts, i.e., the policyholders who actually experienced a crash. Because 75% of the data consists of zeros, the model is essentially trying to fit two fundamentally different distributions simultaneously: a large mass at zero and a long, continuous right tail for crash costs. The resulting Q–Q pattern shows that the linear model cannot capture this mixture distribution, confirming that a full-dataset severity model is statistically misspecified and inappropriate for predicting TARGET_AMT.
Model Selected
Based on the modeling results, I recommend using the logistic regression model fitted with the glm() function as the final model for predicting crash occurrence (TARGET_FLAG). This model demonstrated strong overall performance, competitive AUC, and clear interpretability, making it the most suitable choice for estimating accident likelihood.
However, I will not provide predictions for TARGET_AMT in the evaluation set. Despite extensive testing (including untransformed, log-transformed, and Box–Cox transformed linear models), I was unable to identify a severity model with sufficient explanatory power or reliable residual behavior. The available predictors do not capture key determinants of claim cost (such as injury severity, collision type, repair estimates, or environmental factors), resulting in weak or unstable models. Therefore, no regression model tested offered a robust or valid explanation of variance in crash amounts.