Auto Insurance Crash Prediction

Author

Andres Garcia Damasco

Published

April 15, 2026

Introduction

This analysis explores an auto insurance dataset of 6,528 customers to predict two outcomes: whether a customer will be involved in a car crash (TARGET_FLAG) and, if so, how much the claim will cost (TARGET_AMT). We build and compare multiple logistic regression models for the binary crash outcome and multiple linear regression models for the cost outcome. The analysis covers data exploration, cleaning and preparation, model building, and final model selection with performance evaluation.

Section 1: Data Exploration

Data

library(tidyverse)

Warning: package 'ggplot2' was built under R version 4.5.2

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   4.0.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

fix_money <- function(x) as.numeric(gsub("[$,]", "", x))

train <- read.csv("insurance-training-data2-2.csv", stringsAsFactors = FALSE) %>%
  select(-INDEX) %>%
  mutate(across(c(INCOME, HOME_VAL, BLUEBOOK, OLDCLAIM), fix_money))

test <- read.csv("insurance-testing-data2-2.csv", stringsAsFactors = FALSE) %>%
  select(-INDEX) %>%
  mutate(across(c(INCOME, HOME_VAL, BLUEBOOK, OLDCLAIM), fix_money))

The training and test sets are loaded with stringsAsFactors = FALSE so that the dollar-formatted columns stay as plain character strings that can be cleaned with gsub(). The INDEX column is dropped immediately since it is just a row identifier with no predictive value. The four money columns (INCOME, HOME_VAL, BLUEBOOK, and OLDCLAIM) are stripped of $ and commas and converted to numeric.
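As a quick sanity check of the cleaning step, the helper can be run on a few hand-made strings. The sample values below are made up for illustration and are not taken from the dataset.

```r
# Same helper as in the analysis: strip "$" and "," and convert to numeric
fix_money <- function(x) as.numeric(gsub("[$,]", "", x))

# Hypothetical raw values in the format used by the money columns
raw <- c("$67,349", "$0", "", NA)

cleaned <- fix_money(raw)
print(cleaned)
```

Blank strings and NA both come back as NA, so either would be picked up by a later is.na() check.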

Missing Values

colSums(is.na(train))

TARGET_FLAG  TARGET_AMT    KIDSDRIV         AGE    HOMEKIDS         YOJ 
          0           0           0           3           0         375 
     INCOME     PARENT1    HOME_VAL     MSTATUS         SEX   EDUCATION 
        354           0         368           0           0           0 
        JOB    TRAVTIME     CAR_USE    BLUEBOOK         TIF    CAR_TYPE 
          0           0           0           0           0           0 
    RED_CAR    OLDCLAIM    CLM_FREQ     REVOKED     MVR_PTS     CAR_AGE 
          0           0           0           0           0         399 
 URBANICITY 
          0

Five variables have missing values: CAR_AGE (399), YOJ (375), HOME_VAL (368), INCOME (354), and AGE (3). No other columns are affected.

Medians

impute_median <- function(x) ifelse(is.na(x), median(x, na.rm = TRUE), x)

train <- train %>% mutate(across(c(AGE, YOJ, INCOME, HOME_VAL, CAR_AGE), impute_median))
test  <- test  %>% mutate(across(c(AGE, YOJ, INCOME, HOME_VAL, CAR_AGE), impute_median))

# Confirm no more NAs
colSums(is.na(train))

TARGET_FLAG  TARGET_AMT    KIDSDRIV         AGE    HOMEKIDS         YOJ 
          0           0           0           0           0           0 
     INCOME     PARENT1    HOME_VAL     MSTATUS         SEX   EDUCATION 
          0           0           0           0           0           0 
        JOB    TRAVTIME     CAR_USE    BLUEBOOK         TIF    CAR_TYPE 
          0           0           0           0           0           0 
    RED_CAR    OLDCLAIM    CLM_FREQ     REVOKED     MVR_PTS     CAR_AGE 
          0           0           0           0           0           0 
 URBANICITY 
          0

Missing values are replaced with each variable’s median. The median is preferred over the mean here because variables like INCOME and HOME_VAL are right-skewed: a small number of high earners or high-value homeowners would pull the mean upward, making it a poor stand-in for a typical missing observation. Note that the test set is filled with its own medians rather than the training medians.
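The code above imputes each dataset with its own medians. A common alternative, sketched here on toy data, is to estimate the medians once on the training set and reuse them for the test set so that both are prepared identically. The data frames toy_train and toy_test and the helper impute_with are hypothetical names introduced only for this illustration.

```r
# Toy data standing in for train/test (values are made up)
toy_train <- data.frame(AGE    = c(30, 40, NA, 50),
                        INCOME = c(20000, NA, 60000, 250000))
toy_test  <- data.frame(AGE    = c(NA, 45),
                        INCOME = c(NA, 80000))

cols <- c("AGE", "INCOME")

# Medians are estimated once, from the training data only
train_medians <- sapply(toy_train[cols], median, na.rm = TRUE)

# Hypothetical helper: fill NAs in each named column with the supplied value
impute_with <- function(df, values) {
  for (col in names(values)) {
    df[[col]][is.na(df[[col]])] <- values[[col]]
  }
  df
}

toy_train <- impute_with(toy_train, train_medians)
toy_test  <- impute_with(toy_test,  train_medians)
print(toy_test)
```

In the toy INCOME column the median of the observed values is 60,000 while the mean is 110,000, which illustrates how a single large value drags the mean but not the median.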

Summary Stats

library(stargazer)

train %>%
  select(AGE, INCOME, HOME_VAL, BLUEBOOK, TRAVTIME, MVR_PTS, CLM_FREQ, OLDCLAIM, TARGET_AMT) %>%
  as.data.frame() %>%
  stargazer(type = "text", title = "Summary Statistics")


Summary Statistics
==========================================================
Statistic    N      Mean      St. Dev.    Min      Max    
----------------------------------------------------------
AGE        6,528   44.853       8.649     16       81     
INCOME     6,528 61,195.250  46,386.770    0     367,030  
HOME_VAL   6,528 154,658.200 125,197.500   0     885,282  
BLUEBOOK   6,528 15,641.760   8,381.489  1,500   69,740   
TRAVTIME   6,528   33.442      15.966      5       142    
MVR_PTS    6,528    1.700       2.145      0       13     
CLM_FREQ   6,528    0.799       1.159      0        5     
OLDCLAIM   6,528  4,119.320   8,924.665    0     57,037   
TARGET_AMT 6,528  1,466.616   4,545.654  0.000 107,586.100
----------------------------------------------------------

prop.table(table(train$TARGET_FLAG))


        0         1 
0.7362132 0.2637868

The training set contains 6,528 observations across 25 variables. A few patterns stand out. INCOME and HOME_VAL are both strongly right-skewed: their standard deviations ($46k and $125k respectively) are nearly as large as their means, suggesting a long right tail of wealthy customers. OLDCLAIM tells a similar story, with a mean of $4,119 but a max of $57,037. TARGET_AMT is even more extreme, averaging $1,467 (a figure that includes the zeros recorded for customers without a crash) but reaching over $107,000, so it will likely require a log transformation before modeling.

MVR_PTS averages 1.7 with a max of 13, and CLM_FREQ averages 0.8 claims over the past 5 years, both of which we’d expect to be positively associated with crash probability.

The target variable is imbalanced — about 26% of customers were in a crash and 74% were not. This is worth keeping in mind when evaluating logistic regression model performance, since a model that predicts “no crash” for everyone would still achieve 74% accuracy.

Plots

library(ggplot2)

# Histograms for numeric variables
train %>%
  select(AGE, INCOME, HOME_VAL, BLUEBOOK, TRAVTIME, MVR_PTS, OLDCLAIM, TARGET_AMT) %>%
  pivot_longer(everything()) %>%
  ggplot(aes(x = value)) +
  geom_histogram(bins = 30, fill = "steelblue") +
  facet_wrap(~name, scales = "free") +
  labs(title = "Distribution of Numeric Variables")

Several numeric variables show strong right skew: INCOME, HOME_VAL, OLDCLAIM, and TARGET_AMT all have the bulk of observations clustered near zero with long right tails. This suggests log transformations may be appropriate before modeling. AGE is roughly bell-shaped and centered around 45. TRAVTIME is also fairly symmetric, centered around 30 minutes. MVR_PTS is heavily concentrated at zero, meaning most customers have a clean driving record. BLUEBOOK shows a moderate right skew with most vehicles valued under $20,000.

# Crash rate by car type
train %>%
  mutate(TARGET_FLAG = factor(TARGET_FLAG)) %>%
  ggplot(aes(x = CAR_TYPE, fill = TARGET_FLAG)) +
  geom_bar(position = "fill") +
  labs(title = "Crash Rate by Car Type", y = "Proportion") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Crash rates vary noticeably across car types. Minivans have the lowest crash rate at roughly 17%, while Panel Trucks, Pickups, and Sports Cars all show crash rates around 30-33%, nearly double that of Minivans. Vans and SUVs fall in between at around 26-30%. This suggests CAR_TYPE will be a useful predictor in the logistic regression model.

Correlation with target

# Correlation of numeric variables with TARGET_FLAG
train %>%
  select(where(is.numeric)) %>%
  cor(use = "complete.obs") %>%
  as.data.frame() %>%
  select(TARGET_FLAG) %>%
  arrange(desc(abs(TARGET_FLAG))) %>%
  round(2)

            TARGET_FLAG
TARGET_FLAG        1.00
TARGET_AMT         0.54
MVR_PTS            0.22
CLM_FREQ           0.22
HOME_VAL          -0.17
OLDCLAIM           0.14
INCOME            -0.13
HOMEKIDS           0.12
KIDSDRIV           0.12
BLUEBOOK          -0.11
AGE               -0.10
CAR_AGE           -0.09
TIF               -0.08
YOJ               -0.07
TRAVTIME           0.04

The predictors most correlated with TARGET_FLAG are MVR_PTS and CLM_FREQ (both +0.22), which makes intuitive sense: drivers with more traffic violations and prior claims are more likely to crash. TARGET_AMT ranks first at 0.54, but it is itself an outcome recorded after a crash, so it is not a candidate predictor. HOME_VAL (−0.17) and INCOME (−0.13) are negatively correlated, which would be consistent with wealthier individuals driving more carefully, although the correlation alone cannot establish that. HOMEKIDS and KIDSDRIV (both +0.12) suggest that households with children, especially driving-age teenagers, face higher crash risk. BLUEBOOK (−0.11) is mildly negative, possibly reflecting that owners of more valuable cars drive more cautiously. Overall the correlations are modest, which is typical for crash prediction. No single variable dominates, so a multivariate model is necessary.

Section 2: Data Preparation

# Check unique values of categorical variables
train %>% select(where(is.character)) %>% sapply(unique)

$PARENT1
[1] "No"  "Yes"

$MSTATUS
[1] "z_No" "Yes" 

$SEX
[1] "z_F" "M"  

$EDUCATION
[1] "Masters"       "Bachelors"     "PhD"           "z_High School"
[5] "<High School" 

$JOB
[1] "Lawyer"        "Home Maker"    "Clerical"      ""             
[5] "Manager"       "Professional"  "Doctor"        "z_Blue Collar"
[9] "Student"      

$CAR_USE
[1] "Private"    "Commercial"

$CAR_TYPE
[1] "Pickup"      "Sports Car"  "z_SUV"       "Van"         "Panel Truck"
[6] "Minivan"    

$RED_CAR
[1] "no"  "yes"

$REVOKED
[1] "No"  "Yes"

$URBANICITY
[1] "z_Highly Rural/ Rural" "Highly Urban/ Urban"

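Several levels carry a z_ prefix (for example z_No in MSTATUS, z_F in SEX, and z_SUV in CAR_TYPE). The prefix comes from the raw data rather than from R. Its practical effect is that those levels sort last alphabetically, so R never picks them as the default reference category. JOB has a blank “” which needs to be treated as missing, and EDUCATION has <High School as a category.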

Clean and dummy code categoricals

# Fix blank JOB entries -> NA, then impute with most common category
train$JOB[train$JOB == ""] <- NA
test$JOB[test$JOB == ""] <- NA

most_common <- names(sort(table(train$JOB), decreasing = TRUE))[1]
train$JOB[is.na(train$JOB)] <- most_common
test$JOB[is.na(test$JOB)] <- most_common

# Convert all character columns to factors — glm() handles dummies automatically
train <- train %>% mutate(across(where(is.character), as.factor))
test  <- test  %>% mutate(across(where(is.character), as.factor))

All binary categorical variables (PARENT1, MSTATUS, SEX, CAR_USE, RED_CAR, REVOKED, URBANICITY) are converted to factors. Multi-level categoricals (EDUCATION, JOB, CAR_TYPE) are also factored, and R will use the first level alphabetically as the reference category in regression. Blank JOB entries are treated as missing and imputed with the most frequently occurring job category in the training data.
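To make the reference-category rule concrete, the short sketch below builds a factor from the CAR_TYPE values listed earlier and inspects its levels. The use of relevel() is only an illustration of how a different baseline could be chosen; the analysis itself keeps the default.

```r
# Car types as they appear in the unique() output above
car_type <- factor(c("Pickup", "Sports Car", "z_SUV", "Van",
                     "Panel Truck", "Minivan"))

# Levels are sorted alphabetically; the first one is the regression baseline
print(levels(car_type))
ref_default <- levels(car_type)[1]

# relevel() moves a chosen level to the front without changing the data
car_type2 <- relevel(car_type, ref = "z_SUV")
ref_new <- levels(car_type2)[1]

print(c(default = ref_default, releveled = ref_new))
```

With the default ordering Minivan is the baseline, so every CAR_TYPE coefficient in a model is a comparison against Minivans.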

Log transformations

From the histograms we saw that INCOME, HOME_VAL, OLDCLAIM, and TARGET_AMT are all heavily right-skewed. Log transforming them pulls in the tail and makes the linear regression assumptions more reasonable.

# Log transform skewed variables; add 1 because log(0) is -Inf
train <- train %>%
  mutate(
    LOG_INCOME   = log(INCOME + 1),
    LOG_HOME_VAL = log(HOME_VAL + 1),
    LOG_OLDCLAIM = log(OLDCLAIM + 1),
    LOG_TARGET_AMT = log(TARGET_AMT + 1)
  )

test <- test %>%
  mutate(
    LOG_INCOME   = log(INCOME + 1),
    LOG_HOME_VAL = log(HOME_VAL + 1),
    LOG_OLDCLAIM = log(OLDCLAIM + 1)
  )

We add 1 before taking the log to handle zero values: INCOME, HOME_VAL, and OLDCLAIM all have observations at zero, and log(0) evaluates to -Inf in R, which would break model fitting. The transformed variables will be used in place of the originals in our models.
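The shift-by-one transformation and its inverse can be checked on a few values. Base R also provides log1p() and expm1(), which compute the same quantities; using them here is an optional alternative, not what the analysis code does.

```r
# A few example amounts, including a zero
x <- c(0, 1500, 57037)

zero_log <- log(0)        # -Inf, returned silently rather than as an error
shifted  <- log(x + 1)    # the transformation used in the analysis
same     <- log1p(x)      # built-in equivalent of log(x + 1)
back     <- expm1(same)   # inverse transformation: exp(y) - 1

print(zero_log)
print(cbind(x, shifted, back))
```

The inverse matters later, because predictions made on the log scale have to be converted back to dollars with exp(y) - 1.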

Model 1: Simple

# Model 1 — simple, using strongest predictors from correlation analysis
logit1 <- glm(TARGET_FLAG ~ MVR_PTS + CLM_FREQ + KIDSDRIV + REVOKED + URBANICITY,
              data = train, family = binomial)

Model 2: Full model with all predictors

logit2 <- glm(TARGET_FLAG ~ KIDSDRIV + AGE + HOMEKIDS + YOJ + LOG_INCOME + PARENT1 +
                LOG_HOME_VAL + MSTATUS + SEX + EDUCATION + JOB + TRAVTIME + CAR_USE +
                BLUEBOOK + TIF + CAR_TYPE + RED_CAR + LOG_OLDCLAIM + CLM_FREQ +
                REVOKED + MVR_PTS + CAR_AGE + URBANICITY,
              data = train, family = binomial)

Model 3: Stepwise selected model

logit3 <- step(logit2, trace = 0)

Comparison

AIC(logit1, logit2, logit3)

       df      AIC
logit1  6 6671.336
logit2 37 5900.982
logit3 30 5893.970

Three logistic regression models are estimated. Model 1 includes only five predictors suggested by the exploratory analysis. Model 2 includes all available predictors, using log-transformed versions of the skewed money variables INCOME, HOME_VAL, and OLDCLAIM. Model 3 applies stepwise selection starting from Model 2, letting AIC guide variable inclusion. Model 3 achieves the lowest AIC (5894), followed closely by the full Model 2 (5901), with the simple Model 1 performing worst (6671). The small gap between Models 2 and 3 indicates that several variables in Model 2 contribute little, and stepwise selection reaches a slightly better AIC with 7 fewer parameters (30 versus 37). Model 3 will be our preferred logistic regression model going forward.
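For readers unfamiliar with AIC, the sketch below reproduces the value by hand as 2k minus twice the log-likelihood, where k is the number of estimated parameters. It uses R's built-in mtcars data purely as a stand-in, since the insurance data is not available here; the model formulas and the helper manual_aic are assumptions made for the illustration.

```r
# Two nested logistic models on a built-in dataset (illustration only)
fit_small <- glm(am ~ wt,      data = mtcars, family = binomial)
fit_big   <- glm(am ~ wt + hp, data = mtcars, family = binomial)

# Hypothetical helper: AIC = 2 * k - 2 * logLik
manual_aic <- function(fit) {
  ll <- logLik(fit)
  2 * attr(ll, "df") - 2 * as.numeric(ll)
}

# Same layout as the comparison table in the analysis: one row per model
aic_table <- AIC(fit_small, fit_big)
aic_table$manual <- c(manual_aic(fit_small), manual_aic(fit_big))
print(aic_table)
```

Lower values are better, and the 2k term is what lets a smaller model beat a larger one when the extra variables add little to the likelihood.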

Confusion matrix and classification metrics

library(caret)

pred_prob <- predict(logit3, newdata = train, type = "response")
pred_class <- ifelse(pred_prob >= 0.5, 1, 0)

confusionMatrix(factor(pred_class), factor(train$TARGET_FLAG), positive = "1")

Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0 4453 1013
         1  353  709
                                          
               Accuracy : 0.7907          
                 95% CI : (0.7807, 0.8006)
    No Information Rate : 0.7362          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.3857          
                                          
 Mcnemar's Test P-Value : < 2.2e-16       
                                          
            Sensitivity : 0.4117          
            Specificity : 0.9266          
         Pos Pred Value : 0.6676          
         Neg Pred Value : 0.8147          
             Prevalence : 0.2638          
         Detection Rate : 0.1086          
   Detection Prevalence : 0.1627          
      Balanced Accuracy : 0.6691          
                                          
       'Positive' Class : 1

The model achieves 79.1% accuracy, comfortably above the 73.6% no information rate, which indicates it adds real predictive value. These figures are computed on the training data, so they are likely somewhat optimistic. Specificity is strong at 92.7%, meaning the model correctly identifies the vast majority of non-crash cases. Sensitivity is weaker at 41.2%: the model misses roughly 6 in 10 actual crashes, a known limitation of using a 0.5 threshold on imbalanced data where crashes are the minority class. When the model does predict a crash, it is correct about two thirds of the time (precision = 66.8%). The classification error rate is 20.9%. Overall, Model 3 performs meaningfully better than simply predicting no crash for everyone, as confirmed by the p-value well below 0.05 for the test of accuracy against the no information rate.
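The headline metrics can be recomputed directly from the four cells of the confusion matrix, which makes their definitions explicit. The counts below are copied from the output above; the variable names are chosen for this sketch.

```r
# Cells of the confusion matrix reported above
tn <- 4453  # predicted 0, actual 0
fn <- 1013  # predicted 0, actual 1
fp <- 353   # predicted 1, actual 0
tp <- 709   # predicted 1, actual 1
n  <- tn + fn + fp + tp

metrics <- c(
  accuracy    = (tp + tn) / n,
  sensitivity = tp / (tp + fn),   # share of actual crashes caught
  specificity = tn / (tn + fp),   # share of non-crashes correctly cleared
  precision   = tp / (tp + fp),   # share of predicted crashes that are real
  baseline    = (tn + fp) / n     # accuracy of always predicting "no crash"
)
print(round(metrics, 4))
```

Lowering the 0.5 cutoff would typically raise sensitivity at the expense of specificity and precision, which may be a reasonable trade when missing a crash is the costlier error.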

Linear regression models for TARGET_AMT

# Filter to crash cases only
train_crash <- train %>% filter(TARGET_FLAG == 1)

# Model 1 — simple, intuitive predictors
lm1 <- lm(LOG_TARGET_AMT ~ BLUEBOOK + CAR_AGE + CAR_TYPE + MVR_PTS,
           data = train_crash)

# Model 2 — fuller model
lm2 <- lm(LOG_TARGET_AMT ~ BLUEBOOK + CAR_AGE + CAR_TYPE + MVR_PTS +
             LOG_INCOME + TRAVTIME + URBANICITY + CLM_FREQ + LOG_OLDCLAIM,
           data = train_crash)

# Model 3 — stepwise from full
lm3 <- step(lm2, trace = 0)

# Compare
summary(lm1)$adj.r.squared

[1] 0.01154962

summary(lm2)$adj.r.squared

[1] 0.009775249

summary(lm3)$adj.r.squared

[1] 0.01376565

Linear regression models are estimated on the subset of crash cases only (n = 1,722), since TARGET_AMT is only meaningful when a crash occurred. The response variable is log-transformed to address right skew. Model 1 uses vehicle-related predictors only. Model 2 expands to include driver and behavioral variables. Model 3 applies stepwise selection from Model 2. Adjusted R² is low across all three models — 1.2%, 1.0%, and 1.4% respectively — indicating that the available variables explain very little of the variation in crash costs. This is not unusual in insurance modeling, where claim amounts are inherently noisy. Model 3 is selected as the best linear model based on its marginally higher adjusted R².

Residual plots for lm3

par(mfrow = c(2, 2))
plot(lm3)

Residuals vs Fitted: The residuals are roughly centered around zero across all fitted values, and the red line is nearly flat, suggesting no strong nonlinear pattern. However, the spread is very wide, consistent with the low R²: the model captures the mean reasonably but individual predictions are noisy. Observations 574, 1400, and 756 are flagged as potential outliers.

Q-Q Plot: The residuals follow the diagonal line closely through the middle of the distribution, indicating approximate normality for most observations. However, both tails deviate from the line: the left tail drops below it and the right tail rises above it, suggesting heavier tails than a normal distribution. The departure is acceptable given the nature of insurance claim data.

Scale-Location: The red line trends slightly upward at higher fitted values, indicating mild heteroscedasticity, meaning variance increases slightly as predicted values grow. This is common with cost data even after log transformation. It is not severe enough to invalidate the model but is worth acknowledging.

Residuals vs Leverage: No observations appear to have both high leverage and large residuals simultaneously, meaning no single data point is exerting undue influence on the model coefficients. The flagged points (574, 756, 1401) have large residuals but low leverage, so they are outliers in terms of prediction error but not influential on the model fit itself.

Predictions on the test set

# crash probability using best logistic model
test$P_TARGET_FLAG <- predict(logit3, newdata = test, type = "response")

# 0.5 threshold
test$TARGET_FLAG <- ifelse(test$P_TARGET_FLAG >= 0.5, 1, 0)

# crash cost using best linear model, back-transformed to dollars (computed for every row, zeroed below for predicted non-crashes)
test$P_TARGET_AMT <- exp(predict(lm3, newdata = test)) - 1

# If predicted no crash, cost is 0
test$TARGET_AMT <- ifelse(test$TARGET_FLAG == 1, test$P_TARGET_AMT, 0)

# Export predictions
test %>%
  select(P_TARGET_FLAG, TARGET_FLAG, TARGET_AMT) %>%
  write.csv("predictions.csv", row.names = FALSE)

# Get odds ratios for logit3
exp(coef(logit3)) %>% round(3)

                    (Intercept)                        KIDSDRIV 
                          1.251                           1.609 
                     LOG_INCOME                      PARENT1Yes 
                          0.928                           1.589 
                   LOG_HOME_VAL                     MSTATUSz_No 
                          0.972                           1.524 
             EDUCATIONBachelors                EDUCATIONMasters 
                          0.593                           0.560 
                   EDUCATIONPhD          EDUCATIONz_High School 
                          0.540                           0.963 
                      JOBDoctor                   JOBHome Maker 
                          0.347                           0.705 
                      JOBLawyer                      JOBManager 
                          0.825                           0.404 
                JOBProfessional                      JOBStudent 
                          0.784                           0.615 
               JOBz_Blue Collar                        TRAVTIME 
                          0.874                           1.014 
                 CAR_USEPrivate                        BLUEBOOK 
                          0.460                           1.000 
                            TIF             CAR_TYPEPanel Truck 
                          0.946                           1.578 
                 CAR_TYPEPickup              CAR_TYPESports Car 
                          1.730                           2.484 
                    CAR_TYPEVan                   CAR_TYPEz_SUV 
                          1.733                           2.093 
                       CLM_FREQ                      REVOKEDYes 
                          1.166                           2.148 
                        MVR_PTS URBANICITYz_Highly Rural/ Rural 
                          1.116                           0.093

Positive example: REVOKED (2.148). Drivers whose license was revoked in the past 7 years have 2.15 times the odds of crashing compared to drivers with no revocation, holding all else constant. This makes intuitive sense, since license revocation signals a history of dangerous driving behavior.

Negative example: CAR_USEPrivate (0.460). Drivers using their vehicle for private use have 54% lower odds of crashing compared to commercial use drivers, holding all else constant. This aligns with expectations: commercial vehicles are on the road more frequently and in more varied conditions, increasing exposure to crashes.

A few other notable findings: Sports Cars have 2.5 times the odds of crashing compared to the reference category (Minivan), and being unmarried (MSTATUSz_No) multiplies the odds by 1.52, a 52% increase. On the other hand, living in a rural area reduces crash odds dramatically (odds ratio = 0.093), likely due to less traffic. Doctors and Managers show notably low odds ratios (0.35 and 0.40) relative to the reference job category (Clerical), which would be consistent with the idea that these professionals drive more safely or less often.
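The conversion between an odds ratio and the "percent higher or lower odds" wording used above is simple arithmetic, sketched here with four values copied from the odds-ratio output. The vector name odds_ratio is chosen for this illustration.

```r
# Selected odds ratios from exp(coef(logit3)) above
odds_ratio <- c(REVOKEDYes     = 2.148,
                CAR_USEPrivate = 0.460,
                MSTATUSz_No    = 1.524,
                Rural          = 0.093)

# Percent change in odds relative to the reference group
pct_change <- (odds_ratio - 1) * 100

# The underlying logit coefficients are the logs of the odds ratios
log_odds <- log(odds_ratio)

print(round(cbind(odds_ratio, pct_change, log_odds), 3))
```

Values below 1 give negative percent changes (0.460 corresponds to 54% lower odds), and these are changes in odds, not in crash probability.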

RMSE and F-statistic for lm3

# F-statistic and R²
summary(lm3)


Call:
lm(formula = LOG_TARGET_AMT ~ BLUEBOOK + MVR_PTS, data = train_crash)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.1471 -0.3965  0.0402  0.3951  3.2090 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 8.075e+00  4.262e-02 189.476  < 2e-16 ***
BLUEBOOK    1.105e-05  2.321e-06   4.761 2.08e-06 ***
MVR_PTS     1.459e-02  7.405e-03   1.970    0.049 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.7875 on 1719 degrees of freedom
Multiple R-squared:  0.01491,   Adjusted R-squared:  0.01377 
F-statistic: 13.01 on 2 and 1719 DF,  p-value: 2.465e-06

# RMSE
sqrt(mean(lm3$residuals^2))

[1] 0.7868097

The final linear model (lm3) retains only BLUEBOOK and MVR_PTS after stepwise selection. The F-statistic is 13.01 (p-value < 0.001), confirming the model is statistically significant overall. RMSE is 0.787 on the log scale, computed on the training data. Because the response is logged, this corresponds to predictions that are typically off by a multiplicative factor of about exp(0.787) ≈ 2.2 in dollar terms. Adjusted R² is 1.4%, confirming the model explains very little variance in crash costs, which is again expected given the noisy nature of insurance claims.

BLUEBOOK is statistically significant at the 0.001 level: higher vehicle value is associated with slightly higher crash costs, which makes sense since more expensive cars cost more to repair. MVR_PTS is only marginally significant at the 0.05 level (p = 0.049), so more traffic violations are weakly associated with higher claim amounts.
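Because the response is log(TARGET_AMT + 1), the coefficients and the RMSE are easier to read after exponentiating. The sketch below does this arithmetic with the estimates copied from the summary output; the $10,000 increment is an arbitrary choice for illustration, and the percent effects ignore the small +1 shift.

```r
# Estimates copied from summary(lm3) and the RMSE line above
b_bluebook <- 1.105e-05
b_mvr_pts  <- 1.459e-02
rmse_log   <- 0.7868097

# Approximate percent change in claim cost for a $10,000 higher BLUEBOOK
pct_per_10k <- (exp(b_bluebook * 10000) - 1) * 100

# Approximate percent change in claim cost per additional MVR point
pct_per_point <- (exp(b_mvr_pts) - 1) * 100

# Typical multiplicative error implied by the log-scale RMSE
error_factor <- exp(rmse_log)

print(round(c(pct_per_10k   = pct_per_10k,
              pct_per_point = pct_per_point,
              error_factor  = error_factor), 2))
```

Under these estimates a $10,000 difference in vehicle value moves the expected claim by roughly 12%, which is small next to a typical error factor above 2.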

Conclusion

Two modeling objectives were addressed in this analysis. For crash prediction, logistic regression Model 3, selected via stepwise AIC, achieved 79.1% accuracy with strong specificity (92.7%), making it reliable at identifying safe drivers, although at the 0.5 threshold it detects only 41.2% of actual crashes. Key predictors of crash risk include license revocation, sports car ownership, unmarried status, and motor vehicle record points. For crash cost prediction, the linear regression model explained very little variance (adjusted R² = 1.4%), which is consistent with the inherently unpredictable nature of insurance claim amounts. BLUEBOOK and MVR_PTS were the only retained predictors. Predictions for the evaluation dataset have been generated using a 0.5 classification threshold for crash probability and the linear model for cost estimation.