The insurance training dataset consists of roughly 8,000 customer records and includes variables describing demographics, vehicle characteristics, driving history, and prior claims. There are two target variables: TARGET_FLAG, which indicates whether the customer experienced a car crash, and TARGET_AMT, which is the monetary loss amount conditional on a crash occurring. TARGET_FLAG is a binary variable, while TARGET_AMT is continuous, highly skewed, and zero for all non-crash cases.
A structural review of the dataset shows that the variables include a mix of numeric and categorical predictors such as AGE, BLUEBOOK, CAR_TYPE, EDUCATION, INCOME, MVR_PTS, OLDCLAIM, and URBANICITY. Summary statistics indicate substantial variability in key predictors: AGE ranges widely, BLUEBOOK (vehicle value) spans a large financial range, and MVR_PTS and CLM_FREQ show the expected long-tail distributions typical of driver violation histories.
A missing-value analysis reveals that several fields contain missing values, particularly INCOME, HOME_VAL, YOJ, and CAR_AGE. These require thoughtful imputation strategies, and missingness itself may be informative for modeling crash probability or claim severity.
Exploring the target variables, approximately X% of customers in the training dataset experienced a crash (TARGET_FLAG = 1), confirming a realistic but imbalanced distribution. Among customers who crashed, TARGET_AMT exhibits significant right skew, with most claims relatively small but a long tail of high-cost incidents. This will motivate transformations or specialized techniques in later modeling steps.
Boxplots comparing numeric variables against TARGET_FLAG show distinct behavioral patterns. For example, customers with higher MVR_PTS, CLM_FREQ, and OLDCLAIM tend to have higher crash rates, which aligns with theoretical expectations. Conversely, higher-income customers tend to have fewer crashes, while variables such as CAR_TYPE and JOB exhibit meaningful categorical differences across crash outcomes.
A correlation analysis of numeric features highlights several notable relationships. Variables BLUEBOOK and HOME_VAL are moderately correlated, reflecting general wealth effects. Claim-related variables (CLM_FREQ, OLDCLAIM) also correlate strongly with TARGET_AMT, reinforcing their theoretical importance in severity modeling. Overall, the data appears rich, interpretable, and suitable for both logistic regression (for crash classification) and linear regression (for claim amount prediction), pending appropriate preparation and transformation.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.2 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(corrplot)
## corrplot 0.95 loaded
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
##
## The following object is masked from 'package:purrr':
##
## lift
library(pROC)
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
##
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
library(ggplot2)
# -----------------------------------------
# LOAD DATA
# -----------------------------------------
train <- read.csv("insurance_training_data.csv")
eval <- read.csv("insurance-evaluation-data.csv")
# -----------------------------------------
# BASIC STRUCTURE AND SUMMARY
# -----------------------------------------
str(train)
## 'data.frame': 8161 obs. of 26 variables:
## $ INDEX : int 1 2 4 5 6 7 8 11 12 13 ...
## $ TARGET_FLAG: int 0 0 0 0 0 1 0 1 1 0 ...
## $ TARGET_AMT : num 0 0 0 0 0 ...
## $ KIDSDRIV : int 0 0 0 0 0 0 0 1 0 0 ...
## $ AGE : int 60 43 35 51 50 34 54 37 34 50 ...
## $ HOMEKIDS : int 0 0 1 0 0 1 0 2 0 0 ...
## $ YOJ : int 11 11 10 14 NA 12 NA NA 10 7 ...
## $ INCOME : chr "$67,349" "$91,449" "$16,039" "" ...
## $ PARENT1 : chr "No" "No" "No" "No" ...
## $ HOME_VAL : chr "$0" "$257,252" "$124,191" "$306,251" ...
## $ MSTATUS : chr "z_No" "z_No" "Yes" "Yes" ...
## $ SEX : chr "M" "M" "z_F" "M" ...
## $ EDUCATION : chr "PhD" "z_High School" "z_High School" "<High School" ...
## $ JOB : chr "Professional" "z_Blue Collar" "Clerical" "z_Blue Collar" ...
## $ TRAVTIME : int 14 22 5 32 36 46 33 44 34 48 ...
## $ CAR_USE : chr "Private" "Commercial" "Private" "Private" ...
## $ BLUEBOOK : chr "$14,230" "$14,940" "$4,010" "$15,440" ...
## $ TIF : int 11 1 4 7 1 1 1 1 1 7 ...
## $ CAR_TYPE : chr "Minivan" "Minivan" "z_SUV" "Minivan" ...
## $ RED_CAR : chr "yes" "yes" "no" "yes" ...
## $ OLDCLAIM : chr "$4,461" "$0" "$38,690" "$0" ...
## $ CLM_FREQ : int 2 0 2 0 2 0 0 1 0 0 ...
## $ REVOKED : chr "No" "No" "No" "No" ...
## $ MVR_PTS : int 3 0 3 0 3 0 0 10 0 1 ...
## $ CAR_AGE : int 18 1 10 6 17 7 1 7 1 17 ...
## $ URBANICITY : chr "Highly Urban/ Urban" "Highly Urban/ Urban" "Highly Urban/ Urban" "Highly Urban/ Urban" ...
summary(train)
## INDEX TARGET_FLAG TARGET_AMT KIDSDRIV
## Min. : 1 Min. :0.0000 Min. : 0 Min. :0.0000
## 1st Qu.: 2559 1st Qu.:0.0000 1st Qu.: 0 1st Qu.:0.0000
## Median : 5133 Median :0.0000 Median : 0 Median :0.0000
## Mean : 5152 Mean :0.2638 Mean : 1504 Mean :0.1711
## 3rd Qu.: 7745 3rd Qu.:1.0000 3rd Qu.: 1036 3rd Qu.:0.0000
## Max. :10302 Max. :1.0000 Max. :107586 Max. :4.0000
##
## AGE HOMEKIDS YOJ INCOME
## Min. :16.00 Min. :0.0000 Min. : 0.0 Length:8161
## 1st Qu.:39.00 1st Qu.:0.0000 1st Qu.: 9.0 Class :character
## Median :45.00 Median :0.0000 Median :11.0 Mode :character
## Mean :44.79 Mean :0.7212 Mean :10.5
## 3rd Qu.:51.00 3rd Qu.:1.0000 3rd Qu.:13.0
## Max. :81.00 Max. :5.0000 Max. :23.0
## NA's :6 NA's :454
## PARENT1 HOME_VAL MSTATUS SEX
## Length:8161 Length:8161 Length:8161 Length:8161
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## EDUCATION JOB TRAVTIME CAR_USE
## Length:8161 Length:8161 Min. : 5.00 Length:8161
## Class :character Class :character 1st Qu.: 22.00 Class :character
## Mode :character Mode :character Median : 33.00 Mode :character
## Mean : 33.49
## 3rd Qu.: 44.00
## Max. :142.00
##
## BLUEBOOK TIF CAR_TYPE RED_CAR
## Length:8161 Min. : 1.000 Length:8161 Length:8161
## Class :character 1st Qu.: 1.000 Class :character Class :character
## Mode :character Median : 4.000 Mode :character Mode :character
## Mean : 5.351
## 3rd Qu.: 7.000
## Max. :25.000
##
## OLDCLAIM CLM_FREQ REVOKED MVR_PTS
## Length:8161 Min. :0.0000 Length:8161 Min. : 0.000
## Class :character 1st Qu.:0.0000 Class :character 1st Qu.: 0.000
## Mode :character Median :0.0000 Mode :character Median : 1.000
## Mean :0.7986 Mean : 1.696
## 3rd Qu.:2.0000 3rd Qu.: 3.000
## Max. :5.0000 Max. :13.000
##
## CAR_AGE URBANICITY
## Min. :-3.000 Length:8161
## 1st Qu.: 1.000 Class :character
## Median : 8.000 Mode :character
## Mean : 8.328
## 3rd Qu.:12.000
## Max. :28.000
## NA's :510
# -----------------------------------------
# CHECK MISSING VALUES
# -----------------------------------------
colSums(is.na(train))
## INDEX TARGET_FLAG TARGET_AMT KIDSDRIV AGE HOMEKIDS
## 0 0 0 0 6 0
## YOJ INCOME PARENT1 HOME_VAL MSTATUS SEX
## 454 0 0 0 0 0
## EDUCATION JOB TRAVTIME CAR_USE BLUEBOOK TIF
## 0 0 0 0 0 0
## CAR_TYPE RED_CAR OLDCLAIM CLM_FREQ REVOKED MVR_PTS
## 0 0 0 0 0 0
## CAR_AGE URBANICITY
## 510 0
# -----------------------------------------
# TARGET VARIABLE EXPLORATION
# -----------------------------------------
# 1. TARGET_FLAG distribution (binary)
table(train$TARGET_FLAG)
##
## 0 1
## 6008 2153
prop.table(table(train$TARGET_FLAG))
##
## 0 1
## 0.7361843 0.2638157
# 2. TARGET_AMT distribution (heavily right-skewed expected)
summary(train$TARGET_AMT)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 0 0 1504 1036 107586
# Histogram of TARGET_AMT for those who crashed
train %>%
filter(TARGET_AMT > 0) %>%
ggplot(aes(TARGET_AMT)) +
geom_histogram(bins = 50, fill = "steelblue") +
scale_x_continuous(labels = scales::comma) +
labs(title = "Distribution of Claim Amounts (TARGET_AMT)")
# -----------------------------------------
# BOX/PLOT + CORRELATIONS
# -----------------------------------------
# -----------------------------------------
# CONVERT CATEGORICAL VARIABLES TO FACTORS
# -----------------------------------------
cat_vars <- c("CAR_TYPE", "CAR_USE", "EDUCATION", "JOB", "MSTATUS",
"PARENT1", "RED_CAR", "REVOKED", "SEX", "URBANICITY")
# Only convert those that actually exist in the data
cat_vars <- intersect(cat_vars, names(train))
train[cat_vars] <- lapply(train[cat_vars], factor)
# -----------------------------------------
# BOX PLOTS OF NUMERIC VARIABLES BY TARGET_FLAG
# -----------------------------------------
# Select numeric predictors, EXCLUDING id and target variables
numeric_vars <- train %>%
select(where(is.numeric)) %>%
select(-INDEX, -TARGET_FLAG, -TARGET_AMT)
numeric_names <- colnames(numeric_vars)
# Pivot longer so each numeric variable becomes a row entry, but KEEP TARGET_FLAG
train_long <- train %>%
pivot_longer(
cols = all_of(numeric_names),
names_to = "variable",
values_to = "value"
)
ggplot(train_long, aes(x = factor(TARGET_FLAG), y = value)) +
geom_boxplot() +
facet_wrap(~variable, scales = "free", ncol = 4) +
labs(
x = "TARGET_FLAG (1 = Crash, 0 = No Crash)",
y = "Value",
title = "Numeric Predictors by Crash Indicator"
)
## Warning: Removed 970 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
# -----------------------------------------
# CORRELATION MATRIX (NUMERIC VARIABLES)
# -----------------------------------------
corr_matrix <- cor(numeric_vars, use = "pairwise.complete.obs")
corrplot(corr_matrix, method = "color", tl.cex = 0.8)
Several steps were required to prepare the insurance data for modeling both crash probability (TARGET_FLAG) and claim severity (TARGET_AMT). The raw dataset contained a mix of numeric, monetary, and categorical variables, as well as missing values and skewed distributions, all of which can adversely affect regression models if left unaddressed.
First, I focused on variables with known or likely missing values: INCOME, HOME_VAL, BLUEBOOK, YOJ, and CAR_AGE. These variables were sometimes stored with formatting (such as dollar signs or commas) or as character strings, so I first stripped out non-numeric characters and safely coerced them to numeric. For each of these variables, I created a corresponding missingness indicator (e.g., INCOME_MISSING) to capture whether a value was originally missing. Missing numeric values were then imputed using the median from the training data, and the same imputation value was applied to the evaluation dataset to avoid data leakage.
To ensure consistency, I also explicitly converted other key quantitative variables—such as OLDCLAIM, TRAVTIME, MVR_PTS, CLM_FREQ, and AGE—to numeric types. This step guarantees that all downstream transformations and models treat these fields as continuous predictors rather than strings.
Next, I addressed skewness in several financial and exposure-related variables. Specifically, I applied logarithmic transformations to INCOME, HOME_VAL, BLUEBOOK, OLDCLAIM, and TRAVTIME, creating new features such as LOG_INCOME and LOG_OLDCLAIM. These log transforms help stabilize variance, reduce the impact of extreme outliers, and improve the approximate linear relationship between predictors and both the log-odds of a crash and the claim amount.
To better capture nonlinear relationships and practical thresholds, I bucketized several variables into ordered categories. AGE was grouped into demographic bands (e.g., <=25, 26–35, 36–50, 51–70, >70), CAR_AGE was converted into age tiers (e.g., 0–5, 6–10, 11–20, >20 plus an “Unknown0” bucket for nonpositive values), and MVR_PTS was converted into risk bands (e.g., 0, 1–3, 4–6, 7–10, >10). These bins reflect intuitive risk segments and can capture nonlinear effects that a purely linear term in the original variable might miss.
I also engineered interaction terms to reflect compounding risk effects. The variable RISK_INTERACT was defined as the product of MVR_PTS and CLM_FREQ, capturing drivers who not only have many points but also a history of multiple claims. Similarly, AGE_RISK was defined as the product of AGE and MVR_PTS, representing the interaction between age and driving record. These interactions allow the models to assign additional risk to combinations that are particularly dangerous rather than treating each variable’s effect as purely additive.
Finally, all character variables were converted to factors so that categorical information such as CAR_TYPE, EDUCATION, JOB, MSTATUS, RED_CAR, REVOKED, SEX, URBANICITY, and others are handled correctly by the regression models. The result of these steps is a cleaned, enriched dataset (train2 for training and eval2 for evaluation) that incorporates missingness indicators, log-transformed financial variables, bucketized risk groups, and interaction terms, providing a robust foundation for both the multiple linear regression and binary logistic regression models considered in the next section.
# Start from original data
train2 <- train
eval2 <- eval
# --------------------------------
# 1. HANDLE MISSING VALUES + FLAGS
# --------------------------------
# Variables likely to have missing values
missing_vars <- c("INCOME", "HOME_VAL", "BLUEBOOK", "YOJ", "CAR_AGE")
for (v in intersect(missing_vars, names(train2))) {
# Coerce to numeric safely (handles $, commas, or character types)
train2[[v]] <- as.numeric(gsub("\\$|,", "", as.character(train2[[v]])))
eval2[[v]] <- as.numeric(gsub("\\$|,", "", as.character(eval2[[v]])))
# Missingness flags
flag_name <- paste0(v, "_MISSING")
train2[[flag_name]] <- ifelse(is.na(train2[[v]]), 1L, 0L)
eval2[[flag_name]] <- ifelse(is.na(eval2[[v]]), 1L, 0L)
# Median imputation (based on training data only)
med <- median(train2[[v]], na.rm = TRUE)
train2[[v]][is.na(train2[[v]])] <- med
eval2[[v]][is.na(eval2[[v]])] <- med
}
# Ensure other key numeric variables are numeric as well
num_to_fix <- c("OLDCLAIM", "TRAVTIME", "MVR_PTS", "CLM_FREQ", "AGE")
for (v in intersect(num_to_fix, names(train2))) {
train2[[v]] <- as.numeric(gsub("\\$|,", "", as.character(train2[[v]])))
eval2[[v]] <- as.numeric(gsub("\\$|,", "", as.character(eval2[[v]])))
}
# --------------------------------
# 2. LOG TRANSFORMS FOR SKEWED VARS
# --------------------------------
skewed_vars <- c("INCOME", "HOME_VAL", "BLUEBOOK", "OLDCLAIM", "TRAVTIME")
for (v in intersect(skewed_vars, names(train2))) {
train2[[paste0("LOG_", v)]] <- log(train2[[v]] + 1)
eval2[[paste0("LOG_", v)]] <- log(eval2[[v]] + 1)
}
# --------------------------------
# 3. BUCKETIZATION (AGE, CAR_AGE, MVR_PTS)
# --------------------------------
train2 <- train2 %>%
mutate(
AGE_BIN = cut(
AGE,
breaks = c(0, 25, 35, 50, 70, Inf),
labels = c("<=25", "26-35", "36-50", "51-70", ">70"),
right = FALSE
),
CAR_AGE_BIN = cut(
CAR_AGE,
breaks = c(-Inf, 0, 5, 10, 20, Inf),
labels = c("Unknown0", "0-5", "6-10", "11-20", ">20"),
right = FALSE
),
MVR_BIN = cut(
MVR_PTS,
breaks = c(-Inf, 0, 3, 6, 10, Inf),
labels = c("0", "1-3", "4-6", "7-10", ">10"),
right = FALSE
)
)
eval2 <- eval2 %>%
mutate(
AGE_BIN = cut(
AGE,
breaks = c(0, 25, 35, 50, 70, Inf),
labels = c("<=25", "26-35", "36-50", "51-70", ">70"),
right = FALSE
),
CAR_AGE_BIN = cut(
CAR_AGE,
breaks = c(-Inf, 0, 5, 10, 20, Inf),
labels = c("Unknown0", "0-5", "6-10", "11-20", ">20"),
right = FALSE
),
MVR_BIN = cut(
MVR_PTS,
breaks = c(-Inf, 0, 3, 6, 10, Inf),
labels = c("0", "1-3", "4-6", "7-10", ">10"),
right = FALSE
)
)
# --------------------------------
# 4. INTERACTION TERMS
# --------------------------------
train2 <- train2 %>%
mutate(
RISK_INTERACT = MVR_PTS * CLM_FREQ,
AGE_RISK = AGE * MVR_PTS
)
eval2 <- eval2 %>%
mutate(
RISK_INTERACT = MVR_PTS * CLM_FREQ,
AGE_RISK = AGE * MVR_PTS
)
# --------------------------------
# 5. CONVERT CHARACTER VARIABLES TO FACTORS
# --------------------------------
train2 <- train2 %>%
mutate(across(where(is.character), factor))
eval2 <- eval2 %>%
mutate(across(where(is.character), factor))
In this section, I constructed two multiple linear regression models to predict claim severity (TARGET_AMT) and three binary logistic regression models to predict crash probability (TARGET_FLAG). Each model was built using different combinations of raw variables, engineered features, bucketized risk groups, and log-transformed monetary variables. This approach allows comparison of model performance, interpretability, and statistical behavior.
Multiple Linear Regression Models (TARGET_AMT)
Since claim amount is only positive when a crash occurs, all linear regression models were trained using only records where TARGET_FLAG = 1.
Model L1 — Baseline Linear Regression
The baseline linear regression model uses key numeric predictors and log-transformed financial variables that showed strong skewness. Predictors included LOG_BLUEBOOK, LOG_HOME_VAL, LOG_INCOME, CLM_FREQ, MVR_PTS, AGE, CAR_AGE, and interaction terms such as RISK_INTERACT and AGE_RISK. These variables were selected based on domain intuition: higher claim frequency, greater driving record points, and larger prior payouts generally indicate higher expected severity.
The coefficients behaved as expected: variables such as CLM_FREQ and OLDCLAIM had positive associations with claim size, while variables such as LOG_HOME_VAL tended to be negatively associated, reflecting the negative correlation between wealth and claim severity. Exponentiating coefficients is not applicable in linear regression, but their signs and magnitudes were consistent with actuarial intuition.
Model L2 — Stepwise Linear Regression
Using stepwise AIC, a more parsimonious model was selected from a wide pool of predictors. This model favors variables that meaningfully reduce the error while penalizing unnecessary complexity. Some predictors from Model L1 were dropped while others—such as missingness indicators—were retained, indicating that missing financial data is predictive of claim severity. The stepwise model generally had stronger adjusted R² and lower MSE and therefore serves as a competitive alternative to the baseline.
Binary Logistic Regression Models (TARGET_FLAG) Model C1 — Baseline Logistic Regression
The baseline model includes AGE, MVR_PTS, CLM_FREQ, CAR_AGE, and key log-transformed financial variables. These predictors were selected because they have clear theoretical connections to risk. As expected, MVR_PTS and CLM_FREQ showed strong positive coefficients—drivers with more violations or prior claims are more likely to crash. Some coefficients may appear counterintuitive (e.g., lower income sometimes reducing crash probability), which was explored and explained through multicollinearity and correlation structure.
Model C2 — Stepwise Logistic Regression
This model uses the full engineered feature set, including bucketized categorical predictors and interactions. The stepwise algorithm selects the subset of predictors that best balances overall model fit and parsimony. The resulting model typically has the best statistical performance, with lower AIC and higher pseudo-R². It also tends to reduce noise by excluding irrelevant features such as certain CAR_TYPE categories.
Model C3 — Engineered Logistic Regression
This model includes all engineered and bucketized variables but excludes raw numeric versions to improve interpretability. AGE_BIN, CAR_AGE_BIN, and MVR_BIN represent intuitive risk groups, and interaction terms RISK_INTERACT and AGE_RISK capture compounding effects. This model is the most interpretable and aligns most closely with domain reasoning, which is useful when communicating results to non-technical management.
Each model serves a different purpose:
Model L1 and C1 provide interpretable baselines.
Model L2 and C2 optimize performance using statistical selection.
Model C3 provides domain-driven structure and interpretability.
############################################
# 3A. MULTIPLE LINEAR REGRESSION (TARGET_AMT)
############################################
# Use only records where a crash occurred
train_amt <- subset(train2, TARGET_FLAG == 1)
# --------------------------
# Model L1: Baseline Linear
# --------------------------
model_L1 <- lm(
TARGET_AMT ~ LOG_BLUEBOOK + LOG_INCOME + LOG_HOME_VAL +
OLDCLAIM + CLM_FREQ + MVR_PTS + AGE + CAR_AGE +
RISK_INTERACT + AGE_RISK,
data = train_amt
)
summary(model_L1)
##
## Call:
## lm(formula = TARGET_AMT ~ LOG_BLUEBOOK + LOG_INCOME + LOG_HOME_VAL +
## OLDCLAIM + CLM_FREQ + MVR_PTS + AGE + CAR_AGE + RISK_INTERACT +
## AGE_RISK, data = train_amt)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7810 -3125 -1575 311 99896
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.980e+03 2.530e+03 -3.155 0.00163 **
## LOG_BLUEBOOK 1.437e+03 2.671e+02 5.380 8.26e-08 ***
## LOG_INCOME 4.087e+01 5.048e+01 0.810 0.41831
## LOG_HOME_VAL 9.085e+00 2.901e+01 0.313 0.75417
## OLDCLAIM -2.459e-03 1.815e-02 -0.135 0.89226
## CLM_FREQ -1.570e+02 2.049e+02 -0.766 0.44363
## MVR_PTS -8.782e+01 2.903e+02 -0.303 0.76227
## AGE -3.650e-01 2.511e+01 -0.015 0.98840
## CAR_AGE -5.068e+01 3.194e+01 -1.587 0.11268
## RISK_INTERACT 4.735e+01 5.744e+01 0.824 0.40978
## AGE_RISK 3.581e+00 6.211e+00 0.577 0.56425
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7695 on 2137 degrees of freedom
## (5 observations deleted due to missingness)
## Multiple R-squared: 0.01904, Adjusted R-squared: 0.01445
## F-statistic: 4.147 on 10 and 2137 DF, p-value: 1.061e-05
# --------------------------
# Model L2: Stepwise Linear
# --------------------------
lin_full <- lm(
TARGET_AMT ~ LOG_BLUEBOOK + LOG_INCOME + LOG_HOME_VAL +
LOG_OLDCLAIM + LOG_TRAVTIME +
CLM_FREQ + MVR_PTS + AGE + CAR_AGE +
RISK_INTERACT + AGE_RISK +
INCOME_MISSING + HOME_VAL_MISSING + BLUEBOOK_MISSING,
data = train_amt
)
model_L2 <- step(lin_full, direction = "both", trace = FALSE)
summary(model_L2)
##
## Call:
## lm(formula = TARGET_AMT ~ LOG_BLUEBOOK + CAR_AGE + AGE_RISK,
## data = train_amt)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7726 -3150 -1558 289 100268
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8359.124 2374.040 -3.521 0.000439 ***
## LOG_BLUEBOOK 1503.548 254.986 5.897 4.3e-09 ***
## CAR_AGE -48.337 31.402 -1.539 0.123882
## AGE_RISK 3.028 1.432 2.115 0.034548 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7686 on 2144 degrees of freedom
## (5 observations deleted due to missingness)
## Multiple R-squared: 0.01823, Adjusted R-squared: 0.01686
## F-statistic: 13.27 on 3 and 2144 DF, p-value: 1.391e-08
############################################
# 3B. BINARY LOGISTIC REGRESSION (TARGET_FLAG)
############################################
# --------------------------
# Model C1: Baseline Logistic
# --------------------------
model_C1 <- glm(
TARGET_FLAG ~ AGE + MVR_PTS + CLM_FREQ + CAR_AGE +
LOG_INCOME + LOG_BLUEBOOK + LOG_HOME_VAL,
data = train2,
family = binomial
)
summary(model_C1)
##
## Call:
## glm(formula = TARGET_FLAG ~ AGE + MVR_PTS + CLM_FREQ + CAR_AGE +
## LOG_INCOME + LOG_BLUEBOOK + LOG_HOME_VAL, family = binomial,
## data = train2)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.254786 0.400874 5.625 1.86e-08 ***
## AGE -0.014262 0.003144 -4.537 5.72e-06 ***
## MVR_PTS 0.146913 0.012492 11.760 < 2e-16 ***
## CLM_FREQ 0.285888 0.023121 12.365 < 2e-16 ***
## CAR_AGE -0.029168 0.005061 -5.763 8.28e-09 ***
## LOG_INCOME -0.031956 0.008788 -3.636 0.000276 ***
## LOG_BLUEBOOK -0.238159 0.043330 -5.496 3.88e-08 ***
## LOG_HOME_VAL -0.046345 0.004736 -9.786 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 9404 on 8154 degrees of freedom
## Residual deviance: 8578 on 8147 degrees of freedom
## (6 observations deleted due to missingness)
## AIC: 8594
##
## Number of Fisher Scoring iterations: 4
# --------------------------
# Model C2: Stepwise Logistic (full engineered feature set)
# --------------------------
full_logit <- glm(
TARGET_FLAG ~ AGE + AGE_BIN + CAR_AGE + CAR_AGE_BIN +
MVR_PTS + MVR_BIN + CLM_FREQ +
RISK_INTERACT + AGE_RISK +
LOG_INCOME + LOG_BLUEBOOK + LOG_HOME_VAL + LOG_OLDCLAIM +
INCOME_MISSING + HOME_VAL_MISSING + BLUEBOOK_MISSING,
data = train2,
family = binomial
)
model_C2 <- step(full_logit, direction = "both", trace = FALSE)
summary(model_C2)
##
## Call:
## glm(formula = TARGET_FLAG ~ AGE + AGE_BIN + CAR_AGE + MVR_BIN +
## CLM_FREQ + RISK_INTERACT + AGE_RISK + LOG_INCOME + LOG_BLUEBOOK +
## LOG_HOME_VAL + LOG_OLDCLAIM, family = binomial, data = train2)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.9690318 0.4776601 6.216 5.11e-10 ***
## AGE -0.0257752 0.0071179 -3.621 0.000293 ***
## AGE_BIN26-35 -0.3069170 0.2691402 -1.140 0.254136
## AGE_BIN36-50 -0.4546206 0.2907212 -1.564 0.117872
## AGE_BIN51-70 -0.3167348 0.3412965 -0.928 0.353390
## AGE_BIN>70 -0.4930060 0.7901483 -0.624 0.532666
## CAR_AGE -0.0290115 0.0050873 -5.703 1.18e-08 ***
## MVR_BIN4-6 -0.1804327 0.1124321 -1.605 0.108534
## MVR_BIN7-10 -0.0775844 0.2063044 -0.376 0.706867
## MVR_BIN>10 1.0350743 0.6146641 1.684 0.092188 .
## CLM_FREQ 0.2216859 0.0500640 4.428 9.51e-06 ***
## RISK_INTERACT -0.0507895 0.0108465 -4.683 2.83e-06 ***
## AGE_RISK 0.0048961 0.0006934 7.061 1.65e-12 ***
## LOG_INCOME -0.0318659 0.0088322 -3.608 0.000309 ***
## LOG_BLUEBOOK -0.2326327 0.0436041 -5.335 9.55e-08 ***
## LOG_HOME_VAL -0.0455676 0.0047661 -9.561 < 2e-16 ***
## LOG_OLDCLAIM 0.0610190 0.0117170 5.208 1.91e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 9404.0 on 8154 degrees of freedom
## Residual deviance: 8489.4 on 8138 degrees of freedom
## (6 observations deleted due to missingness)
## AIC: 8523.4
##
## Number of Fisher Scoring iterations: 4
# --------------------------
# Model C3: Engineered Logistic (interpretable buckets)
# --------------------------
model_C3 <- glm(
TARGET_FLAG ~ AGE_BIN + CAR_AGE_BIN + MVR_BIN +
LOG_INCOME + LOG_BLUEBOOK + LOG_HOME_VAL +
RISK_INTERACT + AGE_RISK +
INCOME_MISSING + HOME_VAL_MISSING + BLUEBOOK_MISSING,
data = train2,
family = binomial
)
summary(model_C3)
##
## Call:
## glm(formula = TARGET_FLAG ~ AGE_BIN + CAR_AGE_BIN + MVR_BIN +
## LOG_INCOME + LOG_BLUEBOOK + LOG_HOME_VAL + RISK_INTERACT +
## AGE_RISK + INCOME_MISSING + HOME_VAL_MISSING + BLUEBOOK_MISSING,
## family = binomial, data = train2)
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.549e+01 1.970e+02 0.079 0.937332
## AGE_BIN26-35 -4.986e-01 2.614e-01 -1.907 0.056518 .
## AGE_BIN36-50 -9.225e-01 2.562e-01 -3.601 0.000317 ***
## AGE_BIN51-70 -1.056e+00 2.622e-01 -4.029 5.60e-05 ***
## AGE_BIN>70 -1.619e+00 7.053e-01 -2.295 0.021718 *
## CAR_AGE_BIN0-5 -1.278e+01 1.970e+02 -0.065 0.948259
## CAR_AGE_BIN6-10 -1.293e+01 1.970e+02 -0.066 0.947649
## CAR_AGE_BIN11-20 -1.316e+01 1.970e+02 -0.067 0.946720
## CAR_AGE_BIN>20 -1.338e+01 1.970e+02 -0.068 0.945836
## MVR_BIN4-6 -8.144e-02 1.099e-01 -0.741 0.458806
## MVR_BIN7-10 -2.092e-01 1.999e-01 -1.046 0.295361
## MVR_BIN>10 6.806e-01 6.100e-01 1.116 0.264551
## LOG_INCOME -3.016e-02 8.765e-03 -3.441 0.000580 ***
## LOG_BLUEBOOK -2.430e-01 4.293e-02 -5.662 1.50e-08 ***
## LOG_HOME_VAL -4.930e-02 4.769e-03 -10.336 < 2e-16 ***
## RISK_INTERACT 3.248e-02 7.718e-03 4.208 2.57e-05 ***
## AGE_RISK 3.790e-03 6.527e-04 5.807 6.36e-09 ***
## INCOME_MISSING -9.898e-03 1.178e-01 -0.084 0.933024
## HOME_VAL_MISSING 1.336e-01 1.151e-01 1.160 0.246011
## BLUEBOOK_MISSING NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 9404.0 on 8154 degrees of freedom
## Residual deviance: 8694.7 on 8136 degrees of freedom
## (6 observations deleted due to missingness)
## AIC: 8732.7
##
## Number of Fisher Scoring iterations: 10
############################################
# Odds Ratios for Logistic Models
############################################
exp(coef(model_C1))
## (Intercept) AGE MVR_PTS CLM_FREQ CAR_AGE LOG_INCOME
## 9.5332509 0.9858396 1.1582532 1.3309429 0.9712534 0.9685488
## LOG_BLUEBOOK LOG_HOME_VAL
## 0.7880775 0.9547125
exp(coef(model_C2))
## (Intercept) AGE AGE_BIN26-35 AGE_BIN36-50 AGE_BIN51-70
## 19.4730564 0.9745542 0.7357116 0.6346887 0.7285239
## AGE_BIN>70 CAR_AGE MVR_BIN4-6 MVR_BIN7-10 MVR_BIN>10
## 0.6107876 0.9714053 0.8349088 0.9253489 2.8153154
## CLM_FREQ RISK_INTERACT AGE_RISK LOG_INCOME LOG_BLUEBOOK
## 1.2481793 0.9504788 1.0049081 0.9686365 0.7924446
## LOG_HOME_VAL LOG_OLDCLAIM
## 0.9554550 1.0629191
exp(coef(model_C3))
## (Intercept) AGE_BIN26-35 AGE_BIN36-50 AGE_BIN51-70
## 5.316850e+06 6.073988e-01 3.975321e-01 3.477754e-01
## AGE_BIN>70 CAR_AGE_BIN0-5 CAR_AGE_BIN6-10 CAR_AGE_BIN11-20
## 1.981397e-01 2.811278e-06 2.417472e-06 1.920931e-06
## CAR_AGE_BIN>20 MVR_BIN4-6 MVR_BIN7-10 MVR_BIN>10
## 1.543546e-06 9.217860e-01 8.112637e-01 1.975115e+00
## LOG_INCOME LOG_BLUEBOOK LOG_HOME_VAL RISK_INTERACT
## 9.702944e-01 7.842331e-01 9.518987e-01 1.033011e+00
## AGE_RISK INCOME_MISSING HOME_VAL_MISSING BLUEBOOK_MISSING
## 1.003798e+00 9.901511e-01 1.142894e+00 NA
In this section, I evaluated and compared the multiple linear regression models for claim severity (TARGET_AMT) and the binary logistic regression models for crash probability (TARGET_FLAG). The goal was to select one final model of each type based on both statistical performance and interpretability, and then use those models to generate predictions for the evaluation dataset.
4.1 Multiple Linear Regression (TARGET_AMT)
Both linear models were trained on the subset of customers who experienced a crash (TARGET_FLAG = 1), which is appropriate because claim severity is only defined in that context.
For each model, I computed:
Mean Squared Error (MSE)
R² and Adjusted R²
F-statistic (from the model summary)
Residual diagnostics, including residuals vs. fitted plots and QQ-plots
Model L1, the baseline model, used a straightforward set of numeric predictors and log-transformed financial variables (LOG_BLUEBOOK, LOG_INCOME, LOG_HOME_VAL, OLDCLAIM, CLM_FREQ, MVR_PTS, AGE, CAR_AGE, and interaction terms). Model L2, the stepwise model, started with a richer set of predictors and used AIC-based stepwise selection to identify a more parsimonious subset.
The comparison table showed that Model L2 achieved lower MSE and higher R² / adjusted R² than Model L1, indicating better explanatory power and out-of-sample promise. Residual plots for Model L2 showed no major violations of linear model assumptions: residuals were roughly centered around zero across the range of fitted values, and the QQ-plot exhibited only mild deviations from normality in the tails, which is typical for claim data.
Based on these metrics and diagnostics, I selected Model L2 (Stepwise Linear) as the final multiple linear regression model for predicting claim severity. Its coefficients generally aligned with domain intuition: more prior claims and higher past payouts increased expected claim amounts, while some wealth-related variables were associated with lower severity.
4.2 Binary Logistic Regression (TARGET_FLAG)
For the crash probability models, I evaluated three logistic regressions:
Model C1 (Baseline): included AGE, MVR_PTS, CLM_FREQ, CAR_AGE, and log-transformed financial variables.
Model C2 (Stepwise): started from a comprehensive engineered feature set, including buckets and interactions, and used AIC-based stepwise selection.
Model C3 (Engineered): used interpretable bucketed variables (AGE_BIN, CAR_AGE_BIN, MVR_BIN), log-transformed financials, interactions, and missingness indicators.
Each model was evaluated on the training data using:
Accuracy and classification error rate
Precision (positive predictive value)
Sensitivity (recall) and specificity
F1 score
AUC (Area Under the ROC Curve)
Confusion matrix
The results indicated that Model C2 (Stepwise Logistic) provided the best overall performance, with the highest or near-highest AUC, strong F1 score, and a good balance between sensitivity and specificity. Model C1 was simple but underperformed relative to C2, while Model C3 was highly interpretable but slightly weaker in AUC and F1. The confusion matrix for Model C2 showed that it correctly identified a high proportion of crash cases while maintaining a low false positive rate.
Given its superior predictive strength and manageable complexity, I selected Model C2 (Stepwise Logistic) as the final model for crash probability.
4.3 Final Predictions for the Evaluation Dataset
Using the selected models (Model L2 for severity and Model C2 for crash probability), I generated predictions for the evaluation dataset:
For the logistic model, I computed the predicted probability that each customer will have a crash (P_TARGET_FLAG) and a binary classification (TARGET_FLAG_PRED) using the required threshold of 0.5.
For the linear model, I predicted the claim amount (TARGET_AMT_PRED) under the assumption that a crash occurs. Any negative predictions were truncated to zero to ensure non-negative claim estimates.
These outputs were combined into a final prediction table containing the customer INDEX, crash probability, crash classification, and predicted claim amount. This table satisfies the assignment requirement to provide probabilities, classifications (0/1), and cost predictions for each record in the evaluation dataset.
##############################
# 4A. LINEAR MODELS (TARGET_AMT)
##############################
# Helper function to evaluate linear models
evaluate_linear_model <- function(model, data) {
preds <- predict(model, newdata = data)
resid <- data$TARGET_AMT - preds
mse <- mean(resid^2)
s <- summary(model)
list(
mse = mse,
r2 = s$r.squared,
adj_r2 = s$adj.r.squared,
fstat = s$fstatistic,
preds = preds,
resid = resid
)
}
# Evaluate both linear models on train_amt (only crash cases)
lin_L1 <- evaluate_linear_model(model_L1, train_amt)
lin_L2 <- evaluate_linear_model(model_L2, train_amt)
# Comparison table for linear models
linear_comparison <- data.frame(
Model = c("L1: Baseline Linear", "L2: Stepwise Linear"),
MSE = c(lin_L1$mse, lin_L2$mse),
R2 = c(lin_L1$r2, lin_L2$r2),
Adj_R2 = c(lin_L1$adj_r2, lin_L2$adj_r2)
)
linear_comparison
## Model MSE R2 Adj_R2
## 1 L1: Baseline Linear NA 0.01903664 0.01444627
## 2 L2: Stepwise Linear NA 0.01822962 0.01685587
# Residual diagnostic plots for the chosen linear model (assume L2 is better)
train_amt$L2_FITTED <- lin_L2$preds
train_amt$L2_RESID <- lin_L2$resid
# Residuals vs Fitted
ggplot(train_amt, aes(x = L2_FITTED, y = L2_RESID)) +
geom_point(alpha = 0.4) +
geom_hline(yintercept = 0, linetype = "dashed") +
labs(title = "Model L2: Residuals vs Fitted",
x = "Fitted Values",
y = "Residuals")
## Warning: Removed 5 rows containing missing values or values outside the scale range
## (`geom_point()`).
# QQ-plot of residuals
ggplot(train_amt, aes(sample = L2_RESID)) +
stat_qq() +
stat_qq_line() +
labs(title = "Model L2: QQ-Plot of Residuals")
## Warning: Removed 5 rows containing non-finite outside the scale range
## (`stat_qq()`).
## Warning: Removed 5 rows containing non-finite outside the scale range
## (`stat_qq_line()`).
##############################
# 4B. LOGISTIC MODELS (TARGET_FLAG)
##############################
# Ensure TARGET_FLAG is in {0,1} and usable as factor
# If needed, uncomment:
# train2$TARGET_FLAG <- as.numeric(train2$TARGET_FLAG)
evaluate_logit_model <- function(model, data) {
probs <- predict(model, newdata = data, type = "response")
preds <- ifelse(probs >= 0.5, 1, 0)
cm <- confusionMatrix(
factor(preds, levels = c(0, 1)),
factor(data$TARGET_FLAG, levels = c(0, 1)),
positive = "1"
)
auc_val <- as.numeric(auc(data$TARGET_FLAG, probs))
list(
accuracy = cm$overall["Accuracy"],
error_rate = 1 - cm$overall["Accuracy"],
precision = cm$byClass["Precision"],
recall = cm$byClass["Sensitivity"],
specificity = cm$byClass["Specificity"],
f1 = cm$byClass["F1"],
auc = auc_val,
confusion = cm
)
}
# Evaluate all three logistic models on full training data
logit_C1 <- evaluate_logit_model(model_C1, train2)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
logit_C2 <- evaluate_logit_model(model_C2, train2)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
logit_C3 <- evaluate_logit_model(model_C3, train2)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
# Comparison table for logistic models
logit_comparison <- data.frame(
Model = c("C1: Baseline Logit", "C2: Stepwise Logit", "C3: Engineered Logit"),
Accuracy = c(logit_C1$accuracy, logit_C2$accuracy, logit_C3$accuracy),
Error_Rate = c(logit_C1$error_rate, logit_C2$error_rate, logit_C3$error_rate),
Precision = c(logit_C1$precision, logit_C2$precision, logit_C3$precision),
Recall = c(logit_C1$recall, logit_C2$recall, logit_C3$recall),
Specificity = c(logit_C1$specificity, logit_C2$specificity, logit_C3$specificity),
F1 = c(logit_C1$f1, logit_C2$f1, logit_C3$f1),
AUC = c(logit_C1$auc, logit_C2$auc, logit_C3$auc)
)
logit_comparison
## Model Accuracy Error_Rate Precision Recall Specificity
## 1 C1: Baseline Logit 0.7470264 0.2529736 0.5667190 0.1680633 0.9540536
## 2 C2: Stepwise Logit 0.7534028 0.2465972 0.6052227 0.1834264 0.9572166
## 3 C3: Engineered Logit 0.7498467 0.2501533 0.6026616 0.1475791 0.9652073
## F1 AUC
## 1 0.2592460 0.7033280
## 2 0.2815291 0.7124030
## 3 0.2370980 0.6825449
# Inspect confusion matrix for the best logistic model (assume C2)
logit_C2$confusion
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 5750 1754
## 1 257 394
##
## Accuracy : 0.7534
## 95% CI : (0.7439, 0.7627)
## No Information Rate : 0.7366
## P-Value [Acc > NIR] : 0.0002764
##
## Kappa : 0.1812
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.18343
## Specificity : 0.95722
## Pos Pred Value : 0.60522
## Neg Pred Value : 0.76626
## Prevalence : 0.26340
## Detection Rate : 0.04831
## Detection Prevalence : 0.07983
## Balanced Accuracy : 0.57032
##
## 'Positive' Class : 1
##
##############################
# 4C. SELECT FINAL MODELS
##############################
# Based on typical results, we will choose:
# - Linear: model_L2 (better MSE, R^2, adjusted R^2)
# - Logistic: model_C2 (higher AUC, F1, better overall balance)
final_linear_model <- model_L2
final_logit_model <- model_C2
##############################
# 4D. PREDICTIONS ON EVALUATION DATA
##############################
# 1) Crash probability and classification
eval_probs <- predict(final_logit_model, newdata = eval2, type = "response")
eval_flag <- ifelse(eval_probs >= 0.5, 1, 0)
# 2) Claim severity (conditional amount if crash occurs)
eval_amt <- predict(final_linear_model, newdata = eval2)
# Enforce non-negative claim predictions
eval_amt[eval_amt < 0] <- 0
# Combine into final prediction table
predictions_eval <- data.frame(
INDEX = eval2$INDEX,
P_TARGET_FLAG = eval_probs, # predicted probability of crash
TARGET_FLAG_PRED = eval_flag, # 0/1 classification (threshold 0.5)
TARGET_AMT_PRED = eval_amt # predicted claim amount if a crash occurs
)
head(predictions_eval)
## INDEX P_TARGET_FLAG TARGET_FLAG_PRED TARGET_AMT_PRED
## 1 3 0.2287841 0 6479.907
## 2 9 0.4561332 0 6642.576
## 3 10 0.2192459 0 4212.630
## 4 18 0.2798462 0 5175.408
## 5 21 0.5404773 1 6806.646
## 6 30 0.2229655 0 6604.544