This analysis explores an auto insurance dataset of 6,528 customers to predict two outcomes: whether a customer will be involved in a car crash (TARGET_FLAG) and, if so, how much the claim will cost (TARGET_AMT). We build and compare multiple logistic regression models for the binary crash outcome and multiple linear regression models for the cost outcome. The analysis covers data exploration, cleaning and preparation, model building, and final model selection with performance evaluation.
Section 1: Data Exploration
Data
library(tidyverse)
The training and test sets are loaded with stringsAsFactors = FALSE so that the dollar-formatted columns are read as plain character strings rather than factors. The INDEX column is dropped immediately since it is just a row identifier with no predictive value. The four money columns (INCOME, HOME_VAL, BLUEBOOK, and OLDCLAIM) are stripped of $ and commas and converted to numeric.
Missing values are replaced with each variable’s median. Median is preferred over mean here because variables like INCOME and HOME_VAL are right-skewed: a small number of high earners or high-value homeowners would pull the mean upward, making it a poor stand-in for a typical missing observation.
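The cleaning steps described above can be sketched roughly as follows. This is a minimal illustration on a made-up four-row data frame, not the code used in this analysis; the helper names parse_money and impute_median are hypothetical.

```r
# Toy data standing in for the raw file (all values are made up)
raw <- data.frame(
  INCOME   = c("$67,349", "$91,449", "", "$0"),
  HOME_VAL = c("$0", "$257,252", "$124,191", NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: strip "$" and "," and convert to numeric.
# Empty strings are turned into NA explicitly.
parse_money <- function(x) {
  cleaned <- gsub("[$,]", "", x)
  cleaned[which(cleaned == "")] <- NA
  as.numeric(cleaned)
}

# Hypothetical helper: replace NA with the median of the observed values
impute_median <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}

money_cols <- c("INCOME", "HOME_VAL")
raw[money_cols] <- lapply(raw[money_cols], parse_money)
raw[money_cols] <- lapply(raw[money_cols], impute_median)

raw
```

In a train/test setting, one common choice is to compute the medians on the training set and reuse them for the test set; whether that was done here is not stated.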
Summary Statistics
==========================================================
Statistic N Mean St. Dev. Min Max
----------------------------------------------------------
AGE 6,528 44.853 8.649 16 81
INCOME 6,528 61,195.250 46,386.770 0 367,030
HOME_VAL 6,528 154,658.200 125,197.500 0 885,282
BLUEBOOK 6,528 15,641.760 8,381.489 1,500 69,740
TRAVTIME 6,528 33.442 15.966 5 142
MVR_PTS 6,528 1.700 2.145 0 13
CLM_FREQ 6,528 0.799 1.159 0 5
OLDCLAIM 6,528 4,119.320 8,924.665 0 57,037
TARGET_AMT 6,528 1,466.616 4,545.654 0.000 107,586.100
----------------------------------------------------------
prop.table(table(train$TARGET_FLAG))
0 1
0.7362132 0.2637868
The training set contains 6,528 observations across 25 variables. A few patterns stand out. INCOME and HOME_VAL are both highly right-skewed — the standard deviations ($46k and $125k respectively) are nearly as large as the means, suggesting a long right tail of wealthy customers. OLDCLAIM tells a similar story, with a mean of $4,119 but a max of $57,037. TARGET_AMT is even more extreme, averaging $1,467 but reaching over $107,000 — this will likely require a log transformation before modeling.
MVR_PTS averages 1.7 with a max of 13, and CLM_FREQ averages 0.8 claims over the past 5 years, both of which we’d expect to be positively associated with crash probability.
The target variable is imbalanced — about 26% of customers were in a crash and 74% were not. This is worth keeping in mind when evaluating logistic regression model performance, since a model that predicts “no crash” for everyone would still achieve 74% accuracy.
Plots
library(ggplot2)

# Histograms for numeric variables
train %>%
  select(AGE, INCOME, HOME_VAL, BLUEBOOK, TRAVTIME, MVR_PTS, OLDCLAIM, TARGET_AMT) %>%
  pivot_longer(everything()) %>%
  ggplot(aes(x = value)) +
  geom_histogram(bins = 30, fill = "steelblue") +
  facet_wrap(~name, scales = "free") +
  labs(title = "Distribution of Numeric Variables")
Several numeric variables show strong right skew: INCOME, HOME_VAL, OLDCLAIM, and TARGET_AMT all have the bulk of observations clustered near zero with long right tails. This suggests log transformations may be appropriate before modeling. AGE is roughly bell-shaped and centered around 45. TRAVTIME is also fairly normal, centered around 30 minutes. MVR_PTS is heavily concentrated at zero, meaning most customers have a clean driving record. BLUEBOOK shows a moderate right skew with most vehicles valued under $20,000.
# Crash rate by car type
train %>%
  mutate(TARGET_FLAG = factor(TARGET_FLAG)) %>%
  ggplot(aes(x = CAR_TYPE, fill = TARGET_FLAG)) +
  geom_bar(position = "fill") +
  labs(title = "Crash Rate by Car Type", y = "Proportion") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
Crash rates vary noticeably across car types. Minivans have the lowest crash rate at roughly 17%, while Panel Trucks, Pickups, and Sports Cars all show crash rates around 30-33%, nearly double that of Minivans. Vans and SUVs fall in between at around 26-30%. This suggests CAR_TYPE will be a useful predictor in the logistic regression model.
Correlation with target
# Correlation of numeric variables with TARGET_FLAG
train %>%
  select(where(is.numeric)) %>%
  cor(use = "complete.obs") %>%
  as.data.frame() %>%
  select(TARGET_FLAG) %>%
  arrange(desc(abs(TARGET_FLAG))) %>%
  round(2)
The variables most correlated with TARGET_FLAG are MVR_PTS and CLM_FREQ (both +0.22), which makes intuitive sense: drivers with more traffic violations and prior claims are more likely to crash. HOME_VAL (−0.17) and INCOME (−0.13) are negatively correlated, consistent with the idea that wealthier individuals tend to drive more carefully. HOMEKIDS and KIDSDRIV (both +0.12) suggest that households with children, especially driving-age teenagers, face higher crash risk. BLUEBOOK (−0.11) is mildly negative, possibly reflecting that owners of more valuable cars drive more cautiously. Overall correlations are modest, which is typical for crash prediction. No single variable dominates, so a multivariate model is necessary.
Section 2: Data Preparation
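# Check unique values of categorical variables
train %>%
  select(where(is.character)) %>%
  sapply(unique)
MSTATUS and SEX carry z_ prefixes on some levels (for example z_No). These prefixes are part of the raw data values rather than something R adds; they appear to exist so that the prefixed level sorts last alphabetically, which makes the other level the reference category. JOB contains a blank “” that needs to be treated as missing, and EDUCATION includes <High School as a category.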
Clean and dummy code categoricals
# Fix blank JOB entries -> NA, then impute with most common category
train$JOB[train$JOB == ""] <- NA
test$JOB[test$JOB == ""] <- NA
most_common <- names(sort(table(train$JOB), decreasing = TRUE))[1]
train$JOB[is.na(train$JOB)] <- most_common
test$JOB[is.na(test$JOB)] <- most_common

# Convert all character columns to factors; glm() handles dummies automatically
train <- train %>% mutate(across(where(is.character), as.factor))
test <- test %>% mutate(across(where(is.character), as.factor))
All binary categorical variables (PARENT1, MSTATUS, SEX, CAR_USE, RED_CAR, REVOKED, URBANICITY) are converted to factors. Multi-level categoricals (EDUCATION, JOB, CAR_TYPE) are also factored, and R will use the first level alphabetically as the reference category in regression. The blank JOB value is treated as missing and imputed with the most frequently occurring job category in the training set.
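To make the reference-category behavior concrete, here is a small sketch on a made-up vector of car types (not the actual data). It shows that the first sorted level becomes the reference and that relevel() can change it; the choice of "Pickup" as an alternative reference is purely illustrative.

```r
# Made-up values using level names that appear in this dataset
car_type <- factor(c("Minivan", "z_SUV", "Sports Car", "Pickup", "Minivan"))

# Levels are sorted, so the first one is the reference in glm()/lm()
levels(car_type)
reference_default <- levels(car_type)[1]

# relevel() moves a chosen level to the front without changing the data
car_type_releveled <- relevel(car_type, ref = "Pickup")
reference_new <- levels(car_type_releveled)[1]

c(default = reference_default, releveled = reference_new)
```

Coefficient names in the model output are the variable name followed by the non-reference level, which is why terms such as CAR_USEPrivate and MSTATUSz_No appear later in the odds ratio table.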
Log transformations
From the histograms we saw that INCOME, HOME_VAL, OLDCLAIM, and TARGET_AMT are all heavily right-skewed. Log transforming them pulls in the tail and makes the linear regression assumptions more reasonable.
We add 1 before taking the log to handle zero values: INCOME, HOME_VAL, and OLDCLAIM all have observations at zero, and log(0) is undefined. The transformed variables will be used in place of the originals in our models.
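A minimal sketch of the log(x + 1) transformation on made-up values is shown here. LOG_TARGET_AMT matches the name used in the model output later in this report; LOG_INCOME is an assumed name for illustration only.

```r
# Made-up values, including zeros, standing in for the real columns
toy <- data.frame(
  INCOME     = c(0, 1500, 60000, 367030),
  TARGET_AMT = c(0, 2500, 4100, 107586.1)
)

# log1p(x) computes log(x + 1), so zeros map to 0 instead of -Inf
toy$LOG_INCOME     <- log1p(toy$INCOME)
toy$LOG_TARGET_AMT <- log1p(toy$TARGET_AMT)

# expm1(y) computes exp(y) - 1 and reverses the transformation
toy$INCOME_BACK <- expm1(toy$LOG_INCOME)

toy
```

The same back-transformation, written as exp(...) - 1, is what converts log-scale cost predictions back to dollars.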
Model 1: Simple
# Model 1: simple, using strongest predictors from correlation analysis
logit1 <- glm(TARGET_FLAG ~ MVR_PTS + CLM_FREQ + KIDSDRIV + REVOKED + URBANICITY,
              data = train, family = binomial)
Three logistic regression models are estimated. Model 1 includes only five predictors motivated by the exploratory analysis (MVR_PTS, CLM_FREQ, KIDSDRIV, REVOKED, and URBANICITY). Model 2 includes all available predictors using log-transformed versions of the skewed money variables. Model 3 applies stepwise selection starting from Model 2, letting AIC guide variable inclusion. Model 3 achieves the lowest AIC (5894), followed closely by the full Model 2 (5901), with the simple Model 1 performing worst (6671). The small gap between Models 2 and 3 indicates that not all variables in Model 2 contribute meaningfully, and stepwise selection improves AIC by dropping the weakest ones. Model 3 will be our preferred logistic regression model going forward.
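The three-way comparison can be illustrated on simulated data. The sketch below is not the analysis code: the variables x1, x2, x3 and the data frame toy are invented, and it assumes base R's step() for the stepwise search, which may differ from the function actually used for Model 3.

```r
set.seed(1)

# Simulated data: x1 and x2 drive the outcome, x3 is pure noise
n  <- 500
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- rnorm(n)
y  <- rbinom(n, 1, plogis(-1 + 0.8 * x1 + 1.0 * x2))
toy <- data.frame(y, x1, x2, x3)

# Simple model, full model, and backward stepwise search from the full model
simple_fit <- glm(y ~ x1, data = toy, family = binomial)
full_fit   <- glm(y ~ x1 + x2 + x3, data = toy, family = binomial)
step_fit   <- step(full_fit, trace = 0)   # trace = 0 silences the search log

# Lower AIC is better; sort the comparison table accordingly
aic_table <- AIC(simple_fit, full_fit, step_fit)
aic_table[order(aic_table$AIC), ]
```

Because step() only accepts a move when it lowers AIC, the stepwise model can never have a higher AIC than the model it started from, which matches the ordering of Models 2 and 3 above.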
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 4453 1013
1 353 709
Accuracy : 0.7907
95% CI : (0.7807, 0.8006)
No Information Rate : 0.7362
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.3857
Mcnemar's Test P-Value : < 2.2e-16
Sensitivity : 0.4117
Specificity : 0.9266
Pos Pred Value : 0.6676
Neg Pred Value : 0.8147
Prevalence : 0.2638
Detection Rate : 0.1086
Detection Prevalence : 0.1627
Balanced Accuracy : 0.6691
'Positive' Class : 1
Model 3 achieves 79.1% accuracy on the training data, comfortably above the 73.6% no information rate, confirming it adds real predictive value. Specificity is strong at 92.7%, meaning the model correctly identifies the vast majority of non-crash cases. Sensitivity is weaker at 41.2%: the model misses roughly 6 in 10 actual crashes, which is a known limitation of using a 0.5 threshold on imbalanced data where crashes are the minority class. When the model does predict a crash, it is correct about two thirds of the time (precision = 66.8%). The classification error rate is 20.9%. Overall, Model 3 performs meaningfully better than simply predicting no crash for everyone, as confirmed by the p-value well below 0.05.
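The headline metrics can be reproduced by hand from the four cell counts in the confusion matrix above, treating 1 (crash) as the positive class. This is a plain arithmetic check rather than the code that produced the report.

```r
# Cell counts copied from the confusion matrix (positive class = 1)
tp <- 709    # predicted crash, actual crash
fn <- 1013   # predicted no crash, actual crash
fp <- 353    # predicted crash, actual no crash
tn <- 4453   # predicted no crash, actual no crash

n <- tp + fn + fp + tn

accuracy    <- (tp + tn) / n
sensitivity <- tp / (tp + fn)   # share of actual crashes that were caught
specificity <- tn / (tn + fp)   # share of non-crashes correctly identified
precision   <- tp / (tp + fp)   # share of predicted crashes that were real
nir         <- (tn + fp) / n    # accuracy of always predicting "no crash"
error_rate  <- 1 - accuracy

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, precision = precision,
        nir = nir, error_rate = error_rate), 4)
```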
Linear regression models are estimated on the subset of crash cases only (n = 1,722), since TARGET_AMT is only meaningful when a crash occurred. The response variable is log-transformed to address right skew. Model 1 uses vehicle-related predictors only. Model 2 expands to include driver and behavioral variables. Model 3 applies stepwise selection from Model 2. Adjusted R² is low across all three models — 1.2%, 1.0%, and 1.4% respectively — indicating that the available variables explain very little of the variation in crash costs. This is not unusual in insurance modeling, where claim amounts are inherently noisy. Model 3 is selected as the best linear model based on its marginally higher adjusted R².
Residual plots for lm3
par(mfrow = c(2, 2))
plot(lm3)
Residuals vs Fitted: The residuals are roughly centered around zero across all fitted values, and the red line is nearly flat, suggesting no strong nonlinear pattern. However, the spread is very wide, consistent with the low R²: the model captures the mean reasonably but individual predictions are noisy. Observations 574, 1400, and 756 are flagged as potential outliers.
Q-Q Plot: The residuals follow the diagonal line closely through the middle of the distribution, indicating approximate normality for most observations. However, both tails deviate noticeably: the left tail drops below and the right tail rises above the line, suggesting heavier tails than a normal distribution. This is mild and acceptable given the nature of insurance claim data.
Scale-Location: The red line trends slightly upward at higher fitted values, indicating mild heteroscedasticity: variance increases slightly as predicted values grow. This is common with cost data even after log transformation. It is not severe enough to invalidate the model but is worth acknowledging.
Residuals vs Leverage: No observations appear to have both high leverage and large residuals simultaneously, meaning no single data point is exerting undue influence on the model coefficients. The flagged points (574, 756, 1401) have large residuals but low leverage, so they are outliers in terms of prediction error but not influential on the model fit itself.
Predictions on the test set
# crash probability using best logistic model
test$P_TARGET_FLAG <- predict(logit3, newdata = test, type = "response")

# 0.5 threshold
test$TARGET_FLAG <- ifelse(test$P_TARGET_FLAG >= 0.5, 1, 0)

# crash cost using best linear model, only for predicted crashes
test$P_TARGET_AMT <- exp(predict(lm3, newdata = test)) - 1

# If predicted no crash, cost is 0
test$TARGET_AMT <- ifelse(test$TARGET_FLAG == 1, test$P_TARGET_AMT, 0)

# Export predictions
test %>%
  select(P_TARGET_FLAG, TARGET_FLAG, TARGET_AMT) %>%
  write.csv("predictions.csv", row.names = FALSE)

# Get odds ratios for logit3
exp(coef(logit3)) %>% round(3)
Positive example: REVOKED (2.148). Drivers whose license was revoked in the past 7 years have 2.15 times the odds of crashing compared to drivers with no revocation, holding all else constant. This makes intuitive sense, since license revocation signals a history of dangerous driving behavior.
Negative example: CAR_USEPrivate (0.460). Drivers using their vehicle for private use have 54% lower odds of crashing compared to commercial use drivers, holding all else constant. This aligns with theory: commercial vehicles are on the road more frequently and in more varied conditions, increasing exposure to crashes.
A few other notable findings: Sports Cars have 2.5 times the odds of crashing compared to the reference category (Minivan), and being unmarried (MSTATUSz_No) multiplies the odds by 1.52. On the other hand, living in a rural area reduces crash odds dramatically (odds ratio = 0.093), likely due to less traffic. Doctors and Managers show notably low odds ratios (0.35 and 0.40), consistent with the theory that white collar professionals drive more safely.
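The arithmetic behind these interpretations is sketched below using three odds ratios quoted above. The vector names are illustrative labels rather than the exact coefficient names from logit3, and the baseline probability is simply the overall crash rate in the training data, used here as an assumed starting point.

```r
# Odds ratios quoted in the text (labels are illustrative)
odds_ratios <- c(REVOKED = 2.148, CAR_USEPrivate = 0.460, URBANICITY_rural = 0.093)

# Percent change in odds relative to the reference category
pct_change_in_odds <- (odds_ratios - 1) * 100
round(pct_change_in_odds, 1)   # 114.8, -54.0, -90.7

# An odds ratio multiplies odds, not probabilities. Starting from an
# assumed baseline crash probability equal to the overall rate:
p_base    <- 0.2638
odds_base <- p_base / (1 - p_base)
odds_new  <- odds_base * odds_ratios[["REVOKED"]]
p_new     <- odds_new / (1 + odds_new)

round(c(p_base = p_base, p_new = p_new), 3)
```

Under that assumed baseline, an odds ratio of 2.148 raises the crash probability to roughly 0.43, which is less than 2.148 times the baseline probability.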
RMSE and F-statistic for lm3
# F-statistic and R²
summary(lm3)
Call:
lm(formula = LOG_TARGET_AMT ~ BLUEBOOK + MVR_PTS, data = train_crash)
Residuals:
Min 1Q Median 3Q Max
-4.1471 -0.3965 0.0402 0.3951 3.2090
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 8.075e+00 4.262e-02 189.476 < 2e-16 ***
BLUEBOOK 1.105e-05 2.321e-06 4.761 2.08e-06 ***
MVR_PTS 1.459e-02 7.405e-03 1.970 0.049 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.7875 on 1719 degrees of freedom
Multiple R-squared: 0.01491, Adjusted R-squared: 0.01377
F-statistic: 13.01 on 2 and 1719 DF, p-value: 2.465e-06
# RMSE
sqrt(mean(lm3$residuals^2))
[1] 0.7868097
The final linear model (lm3) retains only BLUEBOOK and MVR_PTS after stepwise selection. The F-statistic is 13.01 (p-value < 0.001), confirming the model is statistically significant overall. RMSE is 0.787 on the log scale, which corresponds to predictions typically missing the actual claim by a factor of about 2.2 (exp(0.787) ≈ 2.2). Adjusted R² is 1.4%, confirming the model explains very little variance in crash costs. Again, this is expected given the noisy nature of insurance claims.
BLUEBOOK is statistically significant at the 0.001 level: higher vehicle value is associated with slightly higher crash costs, which makes sense since more expensive cars cost more to repair. MVR_PTS is only marginally significant at the 0.05 level (p = 0.049): more traffic violations are weakly associated with higher claim amounts.
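Because the response is on the log scale, each coefficient translates into an approximate percent change in claim cost. The sketch below uses the estimates printed in the summary output; the $10,000 step for BLUEBOOK is an arbitrary unit chosen for readability, and the "+ 1" inside the log is ignored, which is a reasonable approximation for claims in the thousands of dollars.

```r
# Coefficients copied from summary(lm3)
b_bluebook <- 1.105e-05   # per dollar of vehicle value
b_mvr_pts  <- 1.459e-02   # per motor vehicle record point

# Percent change in expected claim cost for a given change in the predictor:
# (exp(coefficient * change) - 1) * 100
pct_per_10k_bluebook <- (exp(b_bluebook * 10000) - 1) * 100
pct_per_mvr_point    <- (exp(b_mvr_pts * 1) - 1) * 100

round(c(per_10k_bluebook = pct_per_10k_bluebook,
        per_mvr_point    = pct_per_mvr_point), 2)
```

Read this way, a $10,000 higher BLUEBOOK value goes with claims that are roughly 12% larger, and each additional MVR point with claims roughly 1.5% larger, holding the other predictor fixed.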
Conclusion
Two modeling objectives were addressed in this analysis. For crash prediction, logistic regression Model 3 — selected via stepwise AIC — achieved 79.1% accuracy with strong specificity (92.7%), making it reliable at identifying safe drivers. Key predictors of crash risk include license revocation, sports car ownership, unmarried status, and motor vehicle record points. For crash cost prediction, the linear regression model explained very little variance (adjusted R² = 1.4%), which is consistent with the inherently unpredictable nature of insurance claim amounts. BLUEBOOK and MVR_PTS were the only retained predictors. Predictions for the evaluation dataset have been generated using a 0.5 classification threshold for crash probability and the linear model for cost estimation.