Data621 - Homework 4

Author

Haoming Chen, Bikash Bhowmik, Rupendra Shrestha, Anthony Josue Roman, Jerald Melukkaran

Published

April 26, 2026

Introduction

This analysis develops predictive models for an auto insurance dataset with 8,161 records. A logistic regression model is used to estimate the probability of a crash event. A linear regression model is applied to predict the financial cost of crashes. The dataset includes demographic, behavioral, vehicle, and historical claim information.

Code
library(MASS)       # attached before tidyverse so dplyr's select() masks MASS::select()
library(tidyverse)
library(caret)
library(pROC)
library(corrplot)
library(gridExtra)
library(car)
library(knitr)

select <- dplyr::select  # MASS also exports select(); pin the dplyr version explicitly

Data Exploration

This section examines the dataset to assess distributions, relationships, and data quality. Exploratory analysis highlights key risk factors and potential modeling challenges. These insights support informed feature engineering and model design.

Load Data

The training and evaluation datasets are imported from external sources. Both share the same structure but serve different purposes: the training data is used to build and validate the models, while the evaluation data is held out for generating final predictions.

Code
train_raw <- read.csv("https://github.com/vincent-usny/621_hw4/raw/refs/heads/main/insurance_training_data.csv")
eval_raw  <- read.csv("https://github.com/vincent-usny/621_hw4/raw/refs/heads/main/insurance-evaluation-data.csv")

tibble(
  Variable = names(train_raw),
  Type     = sapply(train_raw, class),
  Sample   = sapply(train_raw, function(x) paste(head(na.omit(x), 3), collapse = ", "))
) %>%
  kable(caption = "Dataset Overview (8,161 rows, 26 variables)")
Dataset Overview (8,161 rows, 26 variables)
Variable Type Sample
INDEX integer 1, 2, 4
TARGET_FLAG integer 0, 0, 0
TARGET_AMT numeric 0, 0, 0
KIDSDRIV integer 0, 0, 0
AGE integer 60, 43, 35
HOMEKIDS integer 0, 0, 1
YOJ integer 11, 11, 10
INCOME character $67,349, $91,449, $16,039
PARENT1 character No, No, No
HOME_VAL character $0, $257,252, $124,191
MSTATUS character z_No, z_No, Yes
SEX character M, M, z_F
EDUCATION character PhD, z_High School, z_High School
JOB character Professional, z_Blue Collar, Clerical
TRAVTIME integer 14, 22, 5
CAR_USE character Private, Commercial, Private
BLUEBOOK character $14,230, $14,940, $4,010
TIF integer 11, 1, 4
CAR_TYPE character Minivan, Minivan, z_SUV
RED_CAR character yes, yes, no
OLDCLAIM character $4,461, $0, $38,690
CLM_FREQ integer 2, 0, 2
REVOKED character No, No, No
MVR_PTS integer 3, 0, 3
CAR_AGE integer 18, 1, 10
URBANICITY character Highly Urban/ Urban, Highly Urban/ Urban, Highly Urban/ Urban

Clean Currency Columns

Several variables contain currency symbols and commas that prevent numeric processing. These characters are removed using string operations before converting to numeric format. Cleaning ensures accurate calculations and model compatibility. This step is critical for variables like income, home value, and claim amounts.

Code
clean_currency <- function(x) {
  x %>%
    str_remove_all("\\$") %>%
    str_remove_all(",") %>%
    as.numeric()
}

clean_df <- function(df) {
  df %>%
    mutate(across(c(INCOME, HOME_VAL, BLUEBOOK, OLDCLAIM), clean_currency))
}

train <- clean_df(train_raw)
eval  <- clean_df(eval_raw)

Descriptive Statistics

Summary statistics are computed to understand central tendency and variability. Measures include mean, median, standard deviation, and missing value percentage. The dataset shows moderate skewness in variables like claims and MVR points. These insights guide transformation and modeling decisions.

Code
train %>%
  dplyr::select(INDEX, TARGET_FLAG, TARGET_AMT, KIDSDRIV, AGE, HOMEKIDS,
         YOJ, INCOME, HOME_VAL, TRAVTIME, BLUEBOOK, TIF,
         OLDCLAIM, CLM_FREQ, MVR_PTS, CAR_AGE) %>%
  summarise(across(
    everything(),
    list(
      mean   = ~mean(.x, na.rm = TRUE),
      median = ~median(.x, na.rm = TRUE),
      sd     = ~sd(.x, na.rm = TRUE),
      pct_na = ~mean(is.na(.x)) * 100
    ),
    .names = "{.col}||{.fn}"
  )) %>%
  pivot_longer(
    everything(),
    names_to  = c("variable", ".value"),
    names_sep = "\\|\\|"
  ) %>%
  filter(variable != "INDEX") %>%
  mutate(across(where(is.numeric), ~round(.x, 2))) %>%
  kable(caption = "Descriptive Statistics for Numeric Variables")
Descriptive Statistics for Numeric Variables
variable mean median sd pct_na
TARGET_FLAG 0.26 0 0.44 0.00
TARGET_AMT 1504.32 0 4704.03 0.00
KIDSDRIV 0.17 0 0.51 0.00
AGE 44.79 45 8.63 0.07
HOMEKIDS 0.72 0 1.12 0.00
YOJ 10.50 11 4.09 5.56
INCOME 61898.09 54028 47572.68 5.45
HOME_VAL 154867.29 161160 129123.77 5.69
TRAVTIME 33.49 33 15.91 0.00
BLUEBOOK 15709.90 14440 8419.73 0.00
TIF 5.35 4 4.15 0.00
OLDCLAIM 4037.08 0 8777.14 0.00
CLM_FREQ 0.80 0 1.16 0.00
MVR_PTS 1.70 1 2.15 0.00
CAR_AGE 8.33 8 5.70 6.25

MVR points (or DMV violation points) are a system used by motor vehicle departments (e.g., NY DMV) to track high-risk drivers, assigning points to a license based on convictions for moving violations.

The average driver is around 45 years old with a median income of $54,028. MVR points and claim frequency are right-skewed: most drivers are clean, but a small group accumulates risk fast. Missing data is modest, topping out at about 6.4% for JOB; because blank JOB entries are stored as empty strings rather than NA, they do not appear in the NA table below, where CAR_AGE leads at 6.25%. All missing values are handled via median/mode imputation with missingness flags retained.

Missing Values

Missing values are identified and quantified across all variables. Most variables have low missingness; CAR_AGE has the highest NA count, while JOB blanks (stored as empty strings) are handled separately during imputation. Understanding these patterns informs the imputation strategy, and missingness indicators are later included as predictive features.

Code
train %>%
  summarise(across(everything(), ~sum(is.na(.)))) %>%
  pivot_longer(everything(), names_to = "variable", values_to = "n_missing") %>%
  filter(n_missing > 0) %>%
  mutate(pct_missing = round(n_missing / nrow(train) * 100, 2)) %>%
  arrange(desc(n_missing)) %>%
  kable(caption = "Variables with Missing Values")
Variables with Missing Values
variable n_missing pct_missing
CAR_AGE 510 6.25
HOME_VAL 464 5.69
YOJ 454 5.56
INCOME 445 5.45
AGE 6 0.07

Distribution of Target Variables

The crash indicator is imbalanced, with about 26% crash cases. Crash cost (TARGET_AMT) is strongly right-skewed: most claims are small, but a few extreme values exist. This justifies applying a log transformation for modeling.

Code
p1 <- train %>%
  count(TARGET_FLAG) %>%
  mutate(TARGET_FLAG = factor(TARGET_FLAG, labels = c("No Crash", "Crash"))) %>%
  ggplot(aes(x = TARGET_FLAG, y = n, fill = TARGET_FLAG)) +
  geom_col(show.legend = FALSE) +
  labs(title = "TARGET_FLAG Distribution", x = NULL, y = "Count") +
  theme_minimal()

p2 <- train %>%
  filter(TARGET_FLAG == 1) %>%
  ggplot(aes(x = TARGET_AMT)) +
  geom_histogram(bins = 40, fill = "steelblue", color = "white") +
  labs(title = "TARGET_AMT (Crash Cases Only)", x = "Crash Cost", y = "Count") +
  theme_minimal()

grid.arrange(p1, p2, ncol = 2)

The target classes are imbalanced: about 26.4% of records are crash cases. Among crash cases, the average payout is $5,702 with a median of $4,104 and a long right tail reaching over $107,000. That skew is why we model log(TARGET_AMT) rather than the raw value.
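A quick way to quantify that justification is to compare the sample skewness of raw and log-scaled payouts among crash cases. This is a minimal sketch; the skewness helper is hand-rolled here rather than taken from a package.

Code
# Sample skewness: third standardized moment (sd() uses the n-1 denominator,
# which is fine for a rough comparison)
skewness <- function(x) {
  x <- x[!is.na(x)]
  mean((x - mean(x))^3) / sd(x)^3
}

crash_amt <- train$TARGET_AMT[train$TARGET_FLAG == 1]
c(raw = skewness(crash_amt), log_scale = skewness(log1p(crash_amt)))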

Numeric Variable Distributions by TARGET_FLAG

Boxplots compare numeric variables across crash and non-crash groups. Driving-related variables show clear separation between the two groups. MVR points and claim frequency strongly indicate higher crash risk. Other variables like income show weaker but noticeable trends.

Code
numeric_vars <- train %>%
  dplyr::select(TARGET_FLAG, AGE, INCOME, HOME_VAL, BLUEBOOK, TRAVTIME,
         MVR_PTS, CLM_FREQ, OLDCLAIM, TIF, YOJ) %>%
  pivot_longer(-TARGET_FLAG, names_to = "variable", values_to = "value") %>%
  mutate(TARGET_FLAG = factor(TARGET_FLAG, labels = c("No Crash", "Crash")))
Code
ggplot(numeric_vars, aes(x = TARGET_FLAG, y = value, fill = TARGET_FLAG)) +
  geom_boxplot(outlier.size = 0.5, show.legend = FALSE) +
  facet_wrap(~variable, scales = "free_y", ncol = 3) +
  labs(title = "Numeric Variables by Crash Status", x = NULL, y = NULL) +
  theme_minimal()

MVR points and claim frequency show the clearest separation between crash and non-crash drivers, as expected. Commute distance (TRAVTIME) also skews higher for crash cases. Income and home value lean slightly lower for crash drivers, consistent with the theory that financial stability correlates with safer driving.

Categorical Variable Crash Rates

Crash rates are calculated for each category within categorical variables. Urban drivers and revoked licenses show significantly higher crash rates. Commercial vehicle usage is also associated with increased risk. Some variables, like car color, show little to no predictive value.

Code
cat_vars <- c("CAR_USE", "CAR_TYPE", "EDUCATION", "JOB",
              "MSTATUS", "SEX", "REVOKED", "RED_CAR",
              "PARENT1", "URBANICITY")

train %>%
  select(TARGET_FLAG, all_of(cat_vars)) %>%
  pivot_longer(-TARGET_FLAG, names_to = "variable", values_to = "value") %>%
  group_by(variable, value) %>%
  summarise(crash_rate = mean(TARGET_FLAG, na.rm = TRUE), .groups = "drop") %>%
  ggplot(aes(x = reorder(value, crash_rate), y = crash_rate, fill = crash_rate)) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  facet_wrap(~variable, scales = "free_y", ncol = 2) +
  scale_fill_gradient(low = "lightblue", high = "firebrick") +
  labs(title = "Crash Rate by Categorical Variable Level",
       x = NULL, y = "Crash Rate") +
  theme_minimal(base_size = 8)

The urbanicity split is striking: urban drivers crash at 31.4% vs. just 6.9% for rural drivers. A revoked license nearly doubles the crash rate (44.3% vs. 23.9%), and commercial vehicle use pushes it to 34.6%. The red-car effect, by contrast, is essentially zero (25.9% vs. 26.6%); that urban legend does not hold up.
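These headline rates can be reproduced directly from the training data. A minimal spot-check for two of the splits (TARGET_FLAG is still the raw 0/1 integer at this point):

Code
# Crash rate by urbanicity and by license-revocation status
train %>%
  group_by(URBANICITY) %>%
  summarise(crash_rate = round(mean(TARGET_FLAG) * 100, 1), .groups = "drop")

train %>%
  group_by(REVOKED) %>%
  summarise(crash_rate = round(mean(TARGET_FLAG) * 100, 1), .groups = "drop")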

Correlation Matrix (Numeric Variables)

Correlation analysis identifies relationships among numeric variables. Strong associations exist between claim-related variables and crash outcomes. Some predictors show moderate correlation with each other. This helps detect potential multicollinearity issues in modeling.

Code
train %>%
  select(TARGET_FLAG, TARGET_AMT, AGE, INCOME, HOME_VAL, BLUEBOOK,
         TRAVTIME, MVR_PTS, CLM_FREQ, OLDCLAIM, TIF, YOJ,
         KIDSDRIV, HOMEKIDS, CAR_AGE) %>%
  drop_na() %>%
  cor() %>%
  corrplot(method = "color", type = "lower", tl.cex = 0.7,
           title = "Correlation Matrix", mar = c(0,0,2,0))

MVR_PTS, CLM_FREQ, and OLDCLAIM are the strongest numeric correlates of TARGET_FLAG, as expected. CLM_FREQ and OLDCLAIM are also moderately correlated with each other, suggesting possible multicollinearity if both enter a model; we account for this in model building.
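That flagged pair is easy to quantify directly:

Code
# Pairwise correlation between claim frequency and prior claim dollars
cor(train$CLM_FREQ, train$OLDCLAIM, use = "complete.obs")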

Data Preparation

Imputation Strategy

Missing numeric values are replaced using median imputation. Categorical variables are imputed using the most frequent category. Binary indicators are created to capture missingness effects. This approach preserves data while allowing models to learn missing patterns.

Code
# Median imputation for numeric, mode for categorical
get_mode <- function(x) {
  ux <- na.omit(unique(x))
  ux[which.max(tabulate(match(x, ux)))]
}

# Store medians/modes from training set only (avoid data leakage)
medians <- train %>%
  summarise(
    AGE     = median(AGE,     na.rm = TRUE),
    YOJ     = median(YOJ,     na.rm = TRUE),
    INCOME  = median(INCOME,  na.rm = TRUE),
    HOME_VAL= median(HOME_VAL,na.rm = TRUE),
    CAR_AGE = median(CAR_AGE, na.rm = TRUE)
  )

job_mode <- get_mode(train$JOB)

impute_df <- function(df) {
  df %>%
    mutate(
      AGE_MISSING     = as.integer(is.na(AGE)),
      YOJ_MISSING     = as.integer(is.na(YOJ)),
      INCOME_MISSING  = as.integer(is.na(INCOME)),
      HOME_VAL_MISSING= as.integer(is.na(HOME_VAL)),
      CAR_AGE_MISSING = as.integer(is.na(CAR_AGE)),
      JOB_MISSING     = as.integer(is.na(JOB) | JOB == ""),
      AGE      = if_else(is.na(AGE),      medians$AGE,      AGE),
      YOJ      = if_else(is.na(YOJ),      medians$YOJ,      YOJ),
      INCOME   = if_else(is.na(INCOME),   medians$INCOME,   INCOME),
      HOME_VAL = if_else(is.na(HOME_VAL), medians$HOME_VAL, HOME_VAL),
      CAR_AGE  = if_else(is.na(CAR_AGE),  medians$CAR_AGE,  CAR_AGE),
      JOB      = if_else(is.na(JOB) | JOB == "", job_mode,  JOB)
    )
}

train <- impute_df(train)
eval  <- impute_df(eval)

Missing-value flags are retained as binary indicator columns so the model can learn whether missingness itself is predictive.
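Whether missingness itself carries signal can be checked before any model is fit. A minimal sketch using one of the new flags (INCOME_MISSING; the others work the same way):

Code
# Crash rate for records with vs. without a recorded income
train %>%
  group_by(INCOME_MISSING) %>%
  summarise(n = n(),
            crash_rate = round(mean(TARGET_FLAG) * 100, 1),
            .groups = "drop")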

Feature Engineering

New features are created to enhance predictive power. Log transformations reduce skewness in monetary variables. Binary indicators capture behavioral and demographic risk factors. Composite features summarize overall risk and claim intensity.

Code
engineer_features <- function(df) {
  df %>%
    mutate(
      # Log transforms for right-skewed money variables
      LOG_INCOME   = log1p(INCOME),
      LOG_HOME_VAL = log1p(HOME_VAL),
      LOG_BLUEBOOK = log1p(BLUEBOOK),
      LOG_OLDCLAIM = log1p(OLDCLAIM),

      # Age risk: young (<25) or old (>70) drivers
      AGE_RISK = as.integer(AGE < 25 | AGE > 70),

      # Binary recodes
      URBAN      = as.integer(str_detect(URBANICITY, "Urban")),
      IS_REVOKED = as.integer(REVOKED == "Yes"),
      IS_COMMERCIAL = as.integer(CAR_USE == "Commercial"),
      HAS_KIDS_DRIVING = as.integer(KIDSDRIV > 0),
      IS_RED_CAR = as.integer(RED_CAR == "yes"),
      IS_SINGLE_PARENT = as.integer(PARENT1 == "Yes"),
      IS_MARRIED = as.integer(MSTATUS == "Yes"),
      IS_MALE    = as.integer(SEX == "M"),

      # Claim density: average claim value per claim event
      CLM_DENSITY = if_else(CLM_FREQ > 0, OLDCLAIM / CLM_FREQ, 0),

      # Total risk score (additive heuristic)
      RISK_SCORE = MVR_PTS + CLM_FREQ + IS_REVOKED * 3 + KIDSDRIV * 2,

      # Car age bucket
      CAR_AGE_NEW = as.integer(CAR_AGE <= 3),

      # Education numeric encoding (ordinal)
      EDU_LEVEL = case_when(
        EDUCATION == "<High School" ~ 1,
        EDUCATION == "z_High School" ~ 2,
        EDUCATION == "Bachelors" ~ 3,
        EDUCATION == "Masters" ~ 4,
        EDUCATION == "PhD" ~ 5,
        TRUE ~ 2
      ),

      # Factor variables for model use
      TARGET_FLAG = factor(TARGET_FLAG)
    )
}

train <- engineer_features(train)
eval  <- engineer_features(eval)

Prepare Crash-Only Subset for Linear Regression

Linear regression is applied only to records with crash events. Non-crash observations have zero cost and are excluded. The target variable is log-transformed to stabilize variance. This improves model fit and interpretability.

Code
train_crash <- train %>%
  filter(TARGET_FLAG == 1) %>%
  mutate(LOG_TARGET_AMT = log1p(TARGET_AMT))

Build Models

Binary Logistic Regression (TARGET_FLAG)

Model 1: Core Risk Variables

This model uses fundamental risk-related variables. It focuses on driving history, demographics, and exposure. The goal is to establish a baseline predictive model. All included variables have strong theoretical justification.

Code
logit1 <- glm(
  TARGET_FLAG ~ MVR_PTS + CLM_FREQ + IS_REVOKED + KIDSDRIV +
    AGE + TRAVTIME + IS_COMMERCIAL + URBAN + LOG_INCOME +
    TIF + YOJ,
  data   = train,
  family = binomial(link = "logit")
)

summary(logit1)

Call:
glm(formula = TARGET_FLAG ~ MVR_PTS + CLM_FREQ + IS_REVOKED + 
    KIDSDRIV + AGE + TRAVTIME + IS_COMMERCIAL + URBAN + LOG_INCOME + 
    TIF + YOJ, family = binomial(link = "logit"), data = train)

Coefficients:
               Estimate Std. Error z value Pr(>|z|)    
(Intercept)   -1.728341   0.199199  -8.676  < 2e-16 ***
MVR_PTS        0.122193   0.013005   9.396  < 2e-16 ***
CLM_FREQ       0.182403   0.024238   7.525 5.25e-14 ***
IS_REVOKED     0.774783   0.076182  10.170  < 2e-16 ***
KIDSDRIV       0.422265   0.050941   8.289  < 2e-16 ***
AGE           -0.024337   0.003254  -7.480 7.44e-14 ***
TRAVTIME       0.013721   0.001795   7.646 2.08e-14 ***
IS_COMMERCIAL  0.728174   0.056747  12.832  < 2e-16 ***
URBAN          2.017686   0.110203  18.309  < 2e-16 ***
LOG_INCOME    -0.128777   0.012703 -10.137  < 2e-16 ***
TIF           -0.048119   0.006990  -6.884 5.81e-12 ***
YOJ            0.019279   0.009734   1.981   0.0476 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 9418.0  on 8160  degrees of freedom
Residual deviance: 7886.5  on 8149  degrees of freedom
AIC: 7910.5

Number of Fisher Scoring iterations: 5

Model 2: Expanded with Engineered Features

This model expands the baseline with engineered variables: socioeconomic indicators, missingness flags, and composite scores. One caveat is visible in the output below: RISK_SCORE is an exact linear combination of MVR_PTS, CLM_FREQ, IS_REVOKED, and KIDSDRIV, all of which are already in the model, so R drops it as aliased (the NA row flagged by "1 not defined because of singularities").

Code
logit2 <- glm(
  TARGET_FLAG ~ MVR_PTS + CLM_FREQ + IS_REVOKED + KIDSDRIV +
    AGE_RISK + TRAVTIME + IS_COMMERCIAL + URBAN +
    LOG_INCOME + LOG_HOME_VAL + TIF + YOJ +
    RISK_SCORE + CLM_DENSITY + EDU_LEVEL +
    IS_MARRIED + IS_SINGLE_PARENT + HAS_KIDS_DRIVING +
    CAR_AGE_MISSING + INCOME_MISSING,
  data   = train,
  family = binomial(link = "logit")
)

summary(logit2)

Call:
glm(formula = TARGET_FLAG ~ MVR_PTS + CLM_FREQ + IS_REVOKED + 
    KIDSDRIV + AGE_RISK + TRAVTIME + IS_COMMERCIAL + URBAN + 
    LOG_INCOME + LOG_HOME_VAL + TIF + YOJ + RISK_SCORE + CLM_DENSITY + 
    EDU_LEVEL + IS_MARRIED + IS_SINGLE_PARENT + HAS_KIDS_DRIVING + 
    CAR_AGE_MISSING + INCOME_MISSING, family = binomial(link = "logit"), 
    data = train)

Coefficients: (1 not defined because of singularities)
                   Estimate Std. Error z value Pr(>|z|)    
(Intercept)      -2.073e+00  1.688e-01 -12.276  < 2e-16 ***
MVR_PTS           1.198e-01  1.351e-02   8.870  < 2e-16 ***
CLM_FREQ          1.743e-01  2.548e-02   6.842 7.83e-12 ***
IS_REVOKED        8.092e-01  8.569e-02   9.444  < 2e-16 ***
KIDSDRIV          1.440e-01  1.195e-01   1.205 0.228294    
AGE_RISK          8.514e-01  2.582e-01   3.298 0.000973 ***
TRAVTIME          1.441e-02  1.840e-03   7.828 4.96e-15 ***
IS_COMMERCIAL     7.408e-01  5.830e-02  12.706  < 2e-16 ***
URBAN             2.272e+00  1.127e-01  20.162  < 2e-16 ***
LOG_INCOME       -1.017e-01  1.391e-02  -7.315 2.58e-13 ***
LOG_HOME_VAL     -2.299e-02  6.268e-03  -3.667 0.000245 ***
TIF              -5.095e-02  7.185e-03  -7.091 1.33e-12 ***
YOJ               2.025e-02  1.022e-02   1.982 0.047470 *  
RISK_SCORE               NA         NA      NA       NA    
CLM_DENSITY      -7.687e-06  4.948e-06  -1.554 0.120302    
EDU_LEVEL        -3.162e-01  2.622e-02 -12.060  < 2e-16 ***
IS_MARRIED       -4.594e-01  8.052e-02  -5.705 1.16e-08 ***
IS_SINGLE_PARENT  4.081e-01  9.352e-02   4.364 1.28e-05 ***
HAS_KIDS_DRIVING  4.577e-01  1.917e-01   2.388 0.016953 *  
CAR_AGE_MISSING   1.554e-01  1.155e-01   1.345 0.178745    
INCOME_MISSING    9.864e-02  1.278e-01   0.772 0.440094    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 9418.0  on 8160  degrees of freedom
Residual deviance: 7552.8  on 8141  degrees of freedom
AIC: 7592.8

Number of Fisher Scoring iterations: 5

Model 3: Stepwise Selection

Stepwise selection is used to find a more parsimonious model. Starting from a full specification, variables are added or removed to minimize AIC. This reduces complexity while maintaining performance; the final model keeps only predictors that earn their place under the AIC criterion, though a few retained terms (such as the missingness flags) are marginal by p-value.

Code
logit_full <- glm(
  TARGET_FLAG ~ MVR_PTS + CLM_FREQ + IS_REVOKED + KIDSDRIV +
    AGE + AGE_RISK + TRAVTIME + IS_COMMERCIAL + URBAN +
    LOG_INCOME + LOG_HOME_VAL + LOG_BLUEBOOK + LOG_OLDCLAIM +
    TIF + YOJ + RISK_SCORE + CLM_DENSITY + EDU_LEVEL +
    IS_MARRIED + IS_MALE + IS_SINGLE_PARENT + HAS_KIDS_DRIVING +
    IS_RED_CAR + CAR_AGE + CAR_AGE_NEW +
    CAR_AGE_MISSING + INCOME_MISSING + HOME_VAL_MISSING +
    AGE_MISSING + YOJ_MISSING,
  data   = train,
  family = binomial(link = "logit")
)

# trace = FALSE already suppresses the step-by-step output
logit3 <- stepAIC(logit_full, direction = "both", trace = FALSE)
Code
summary(logit3)

Call:
glm(formula = TARGET_FLAG ~ MVR_PTS + IS_REVOKED + AGE_RISK + 
    TRAVTIME + IS_COMMERCIAL + URBAN + LOG_INCOME + LOG_HOME_VAL + 
    LOG_BLUEBOOK + LOG_OLDCLAIM + TIF + YOJ + CLM_DENSITY + EDU_LEVEL + 
    IS_MARRIED + IS_MALE + IS_SINGLE_PARENT + HAS_KIDS_DRIVING + 
    CAR_AGE_NEW + CAR_AGE_MISSING + AGE_MISSING, family = binomial(link = "logit"), 
    data = train)

Coefficients:
                   Estimate Std. Error z value Pr(>|z|)    
(Intercept)       1.413e+00  4.522e-01   3.124 0.001783 ** 
MVR_PTS           1.045e-01  1.396e-02   7.483 7.25e-14 ***
IS_REVOKED        8.755e-01  8.742e-02  10.016  < 2e-16 ***
AGE_RISK          8.003e-01  2.617e-01   3.058 0.002229 ** 
TRAVTIME          1.482e-02  1.857e-03   7.978 1.49e-15 ***
IS_COMMERCIAL     9.250e-01  6.313e-02  14.654  < 2e-16 ***
URBAN             2.288e+00  1.136e-01  20.140  < 2e-16 ***
LOG_INCOME       -7.893e-02  1.412e-02  -5.590 2.27e-08 ***
LOG_HOME_VAL     -2.108e-02  6.317e-03  -3.338 0.000845 ***
LOG_BLUEBOOK     -4.152e-01  4.856e-02  -8.550  < 2e-16 ***
LOG_OLDCLAIM      6.868e-02  8.833e-03   7.775 7.56e-15 ***
TIF              -5.204e-02  7.264e-03  -7.163 7.89e-13 ***
YOJ               1.991e-02  1.029e-02   1.935 0.052992 .  
CLM_DENSITY      -2.690e-05  6.040e-06  -4.454 8.43e-06 ***
EDU_LEVEL        -2.449e-01  2.918e-02  -8.393  < 2e-16 ***
IS_MARRIED       -5.009e-01  8.125e-02  -6.164 7.09e-10 ***
IS_MALE          -2.853e-01  6.104e-02  -4.674 2.96e-06 ***
IS_SINGLE_PARENT  3.529e-01  9.480e-02   3.722 0.000197 ***
HAS_KIDS_DRIVING  6.647e-01  8.705e-02   7.636 2.24e-14 ***
CAR_AGE_NEW       1.615e-01  7.249e-02   2.228 0.025860 *  
CAR_AGE_MISSING   1.890e-01  1.179e-01   1.603 0.108924    
AGE_MISSING       2.175e+00  1.193e+00   1.823 0.068280 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 9418.0  on 8160  degrees of freedom
Residual deviance: 7433.9  on 8139  degrees of freedom
AIC: 7477.9

Number of Fisher Scoring iterations: 5
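For interpretation, the logistic coefficients read more naturally as odds ratios. A small sketch using Wald intervals (confint.default; the profile-likelihood confint is slower on a model of this size):

Code
# Exponentiate coefficients and Wald confidence limits into odds ratios
exp(cbind(OR = coef(logit3), confint.default(logit3))) %>%
  round(3)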

Multiple Linear Regression (TARGET_AMT)

Models are fit on crash cases only, since TARGET_AMT is identically zero for non-crash records and carries no severity information there.

Model 1: Direct Predictors

This model uses direct predictors of claim cost. It includes vehicle value, driving behavior, and exposure variables. The goal is to establish a simple baseline for cost prediction. Results provide insight into primary cost drivers.

Code
lm1 <- lm(
  LOG_TARGET_AMT ~ BLUEBOOK + CAR_AGE + MVR_PTS + CLM_FREQ +
    TRAVTIME + URBAN + LOG_INCOME + LOG_OLDCLAIM,
  data = train_crash
)

summary(lm1)

Call:
lm(formula = LOG_TARGET_AMT ~ BLUEBOOK + CAR_AGE + MVR_PTS + 
    CLM_FREQ + TRAVTIME + URBAN + LOG_INCOME + LOG_OLDCLAIM, 
    data = train_crash)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.6787 -0.3978  0.0373  0.4009  3.2519 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)   8.065e+00  9.903e-02  81.444  < 2e-16 ***
BLUEBOOK      1.026e-05  2.198e-06   4.669 3.22e-06 ***
CAR_AGE      -1.193e-04  3.339e-03  -0.036   0.9715    
MVR_PTS       1.574e-02  7.248e-03   2.171   0.0300 *  
CLM_FREQ     -4.204e-02  2.386e-02  -1.762   0.0782 .  
TRAVTIME     -3.716e-04  1.152e-03  -0.323   0.7471    
URBAN         2.476e-02  7.822e-02   0.317   0.7516    
LOG_INCOME    3.245e-03  5.150e-03   0.630   0.5287    
LOG_OLDCLAIM  7.138e-03  6.925e-03   1.031   0.3028    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.8073 on 2144 degrees of freedom
Multiple R-squared:  0.01537,   Adjusted R-squared:  0.0117 
F-statistic: 4.184 on 8 and 2144 DF,  p-value: 5.536e-05

Model 2: Engineered + Interaction Terms

This model incorporates engineered features and an interaction between vehicle value and car age, intended to capture how depreciation affects payouts. The additional predictors improve fit only marginally (adjusted R² rises from 0.012 to 0.016), and overall explanatory power remains low on the log scale.

Code
lm2 <- lm(
  LOG_TARGET_AMT ~ LOG_BLUEBOOK + CAR_AGE + MVR_PTS + CLM_FREQ +
    TRAVTIME + URBAN + LOG_INCOME + LOG_OLDCLAIM +
    RISK_SCORE + EDU_LEVEL + IS_COMMERCIAL +
    AGE + CAR_AGE_NEW + CLM_DENSITY +
    LOG_BLUEBOOK:CAR_AGE,
  data = train_crash
)

summary(lm2)

Call:
lm(formula = LOG_TARGET_AMT ~ LOG_BLUEBOOK + CAR_AGE + MVR_PTS + 
    CLM_FREQ + TRAVTIME + URBAN + LOG_INCOME + LOG_OLDCLAIM + 
    RISK_SCORE + EDU_LEVEL + IS_COMMERCIAL + AGE + CAR_AGE_NEW + 
    CLM_DENSITY + LOG_BLUEBOOK:CAR_AGE, data = train_crash)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.6830 -0.3933  0.0401  0.4086  3.1972 

Coefficients:
                       Estimate Std. Error t value Pr(>|t|)    
(Intercept)           6.441e+00  4.453e-01  14.465  < 2e-16 ***
LOG_BLUEBOOK          1.938e-01  4.720e-02   4.107 4.16e-05 ***
CAR_AGE               4.124e-02  4.956e-02   0.832    0.405    
MVR_PTS               3.132e-02  1.274e-02   2.459    0.014 *  
CLM_FREQ             -1.332e-02  3.091e-02  -0.431    0.667    
TRAVTIME             -4.660e-04  1.155e-03  -0.403    0.687    
URBAN                 1.727e-02  7.816e-02   0.221    0.825    
LOG_INCOME            7.386e-04  5.220e-03   0.141    0.887    
LOG_OLDCLAIM          2.633e-04  9.637e-03   0.027    0.978    
RISK_SCORE           -1.489e-02  1.035e-02  -1.438    0.150    
EDU_LEVEL             1.999e-02  2.230e-02   0.896    0.370    
IS_COMMERCIAL         5.966e-03  3.616e-02   0.165    0.869    
AGE                   4.166e-05  1.879e-03   0.022    0.982    
CAR_AGE_NEW          -3.973e-02  6.346e-02  -0.626    0.531    
CLM_DENSITY           4.497e-06  3.713e-06   1.211    0.226    
LOG_BLUEBOOK:CAR_AGE -4.935e-03  5.113e-03  -0.965    0.335    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.8056 on 2137 degrees of freedom
Multiple R-squared:  0.02287,   Adjusted R-squared:  0.01601 
F-statistic: 3.335 on 15 and 2137 DF,  p-value: 1.402e-05

Select Models

Logistic Regression Evaluation

Models are evaluated using accuracy, precision, recall, and AUC. ROC curves assess the trade-off between sensitivity and specificity. Class imbalance is considered when interpreting results. The stepwise model performs best overall.

Code
evaluate_logit <- function(model, data, threshold = 0.5, label = "") {
  probs <- predict(model, newdata = data, type = "response")
  preds <- factor(if_else(probs >= threshold, 1, 0), levels = c(0, 1))
  actuals <- factor(as.integer(as.character(data$TARGET_FLAG)), levels = c(0, 1))

  cm  <- confusionMatrix(preds, actuals, positive = "1")
  roc_obj <- roc(actuals, probs, quiet = TRUE)

  tibble(
    Model       = label,
    AIC         = AIC(model),
    Accuracy    = cm$overall["Accuracy"],
    Error_Rate  = 1 - cm$overall["Accuracy"],
    Precision   = cm$byClass["Precision"],
    Sensitivity = cm$byClass["Sensitivity"],
    Specificity = cm$byClass["Specificity"],
    F1          = cm$byClass["F1"],
    AUC         = as.numeric(auc(roc_obj))
  )
}

logit_results <- bind_rows(
  evaluate_logit(logit1, train, label = "Logit 1: Core"),
  evaluate_logit(logit2, train, label = "Logit 2: Engineered"),
  evaluate_logit(logit3, train, label = "Logit 3: Stepwise")
)

logit_results %>%
  mutate(across(where(is.numeric), ~round(.x, 4))) %>%
  kable(caption = "Logistic Regression Model Comparison (Training Data)")
Logistic Regression Model Comparison (Training Data)
Model AIC Accuracy Error_Rate Precision Sensitivity Specificity F1 AUC
Logit 1: Core 7910.498 0.7660 0.2340 0.6183 0.2949 0.9348 0.3994 0.7696
Logit 2: Engineered 7592.810 0.7794 0.2206 0.6427 0.3693 0.9264 0.4690 0.7952
Logit 3: Stepwise 7477.900 0.7864 0.2136 0.6607 0.3915 0.9279 0.4917 0.8041

Confusion Matrices

Confusion matrices summarize prediction outcomes. They highlight true positives, false positives, and false negatives. False negatives are especially costly for insurers. This helps evaluate real-world model impact.

Code
print_cm <- function(model, data, label) {
  probs   <- predict(model, newdata = data, type = "response")
  preds   <- factor(if_else(probs >= 0.5, 1, 0), levels = c(0, 1))
  actuals <- factor(as.integer(as.character(data$TARGET_FLAG)), levels = c(0, 1))
  cat("\n---", label, "---\n")
  print(confusionMatrix(preds, actuals, positive = "1")$table)
}

print_cm(logit1, train, "Logit 1: Core")

--- Logit 1: Core ---
          Reference
Prediction    0    1
         0 5616 1518
         1  392  635
Code
print_cm(logit2, train, "Logit 2: Engineered")

--- Logit 2: Engineered ---
          Reference
Prediction    0    1
         0 5566 1358
         1  442  795
Code
print_cm(logit3, train, "Logit 3: Stepwise")

--- Logit 3: Stepwise ---
          Reference
Prediction    0    1
         0 5575 1310
         1  433  843

In these tables, rows are predicted classes and columns are actual classes (the Reference), so false negatives (missed crashes) sit in the top-right cell. Those are the costly misses for an insurer. With only 26.4% crash cases, all models lean toward predicting no crash at the default 0.5 threshold.

ROC Curves

ROC curves visualize model discrimination ability. Better models curve closer to the top-left corner. All models outperform random guessing. The stepwise model achieves the highest AUC.

Code
roc1 <- roc(as.integer(as.character(train$TARGET_FLAG)),
            predict(logit1, train, type = "response"), quiet = TRUE)
roc2 <- roc(as.integer(as.character(train$TARGET_FLAG)),
            predict(logit2, train, type = "response"), quiet = TRUE)
roc3 <- roc(as.integer(as.character(train$TARGET_FLAG)),
            predict(logit3, train, type = "response"), quiet = TRUE)

ggroc(list("Logit 1" = roc1, "Logit 2" = roc2, "Logit 3" = roc3)) +
  geom_abline(slope = 1, intercept = 1, linetype = "dashed", color = "gray") +
  labs(title = "ROC Curves - Logistic Models", color = "Model") +
  theme_minimal()

Closer to the top-left corner is better; the dashed diagonal is random guessing. All three models beat chance comfortably, with Logit 3 edging ahead on AUC (0.804 vs. 0.795 and 0.770).
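Because the 0.5 cutoff is arbitrary under class imbalance, pROC can suggest an alternative threshold that balances sensitivity and specificity (Youden's J). A sketch on the selected model:

Code
# Threshold maximizing sensitivity + specificity - 1 on the training ROC
coords(roc3, x = "best", best.method = "youden",
       ret = c("threshold", "sensitivity", "specificity"))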

Linear Regression Evaluation

Performance is measured using R², adjusted R², and RMSE, with residual diagnostics assessing model assumptions. The log transformation improves normality and variance stability. The engineered model fits slightly better, though both models explain little of the variance in log cost.

Code
evaluate_lm <- function(model, data, label = "") {
  preds   <- predict(model, newdata = data)
  actuals <- data$LOG_TARGET_AMT
  residuals <- actuals - preds
  n <- length(actuals)
  p <- length(coef(model)) - 1

  ss_res  <- sum(residuals^2)
  ss_tot  <- sum((actuals - mean(actuals))^2)
  r2      <- 1 - ss_res / ss_tot
  adj_r2  <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
  rmse    <- sqrt(mean(residuals^2))
  f_stat  <- summary(model)$fstatistic[1]

  tibble(
    Model    = label,
    R2       = round(r2, 4),
    Adj_R2   = round(adj_r2, 4),
    RMSE     = round(rmse, 4),
    F_Stat   = round(f_stat, 2),
    AIC      = round(AIC(model), 2)
  )
}

bind_rows(
  evaluate_lm(lm1, train_crash, "LM 1: Direct"),
  evaluate_lm(lm2, train_crash, "LM 2: Engineered")
) %>%
  kable(caption = "Linear Regression Model Comparison (Crash Cases, Log Scale)")
Linear Regression Model Comparison (Crash Cases, Log Scale)
Model R2 Adj_R2 RMSE F_Stat AIC
LM 1: Direct 0.0154 0.0117 0.8056 4.18 5199.38
LM 2: Engineered 0.0229 0.0160 0.8026 3.33 5196.92

Residual Plots - Selected Linear Model

Residual plots check model assumptions visually. They assess linearity, normality, and constant variance. Outliers and influential observations are identified. The log transformation improves diagnostic results.

Code
par(mfrow = c(2, 2))
plot(lm2, which = 1:4)

Code
par(mfrow = c(1, 1))

Residuals vs. Fitted checks for non-linearity, Q-Q checks normality, Scale-Location checks constant variance, and Cook's Distance flags influential observations. Modeling log(TARGET_AMT) rather than the raw cost noticeably improves these diagnostics, given the heavy right tail in crash payouts.

Multicollinearity Check

Variance Inflation Factor (VIF) measures predictor redundancy. High VIF values indicate multicollinearity concerns. Most variables remain within acceptable limits. Some correlated terms are retained due to theoretical importance.

Code
vif(lm2) %>%
  enframe(name = "Variable", value = "VIF") %>%
  arrange(desc(VIF)) %>%
  mutate(VIF = round(VIF, 2)) %>%
  kable(caption = "Variance Inflation Factors - LM 2")
Variance Inflation Factors - LM 2
Variable VIF
CAR_AGE 233.32
LOG_BLUEBOOK:CAR_AGE 232.08
LOG_OLDCLAIM 5.93
CLM_FREQ 4.94
RISK_SCORE 4.78
MVR_PTS 3.58
LOG_BLUEBOOK 3.23
CAR_AGE_NEW 2.81
CLM_DENSITY 2.24
EDU_LEVEL 2.08
LOG_INCOME 1.12
IS_COMMERCIAL 1.08
AGE 1.07
URBAN 1.02
TRAVTIME 1.02

Variables with VIF above roughly 5-10 warrant investigation. The extreme values for CAR_AGE and LOG_BLUEBOOK:CAR_AGE are largely structural: a main effect is inevitably correlated with its own interaction term, so these VIFs overstate any real redundancy. We monitor the pair but retain it unless model stability is compromised; mean-centering the interacting variables, sketched below, is the standard check.
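A sketch of that check; lm2_c and the _C columns are new names introduced here, and the fit is unchanged up to reparameterization:

Code
# Mean-center the two interacting variables before forming the interaction;
# the main-effect VIFs should drop sharply if the inflation is structural
train_crash_c <- train_crash %>%
  mutate(LOG_BLUEBOOK_C = LOG_BLUEBOOK - mean(LOG_BLUEBOOK),
         CAR_AGE_C      = CAR_AGE - mean(CAR_AGE))

lm2_c <- lm(
  LOG_TARGET_AMT ~ LOG_BLUEBOOK_C + CAR_AGE_C + MVR_PTS + CLM_FREQ +
    TRAVTIME + URBAN + LOG_INCOME + LOG_OLDCLAIM +
    RISK_SCORE + EDU_LEVEL + IS_COMMERCIAL +
    AGE + CAR_AGE_NEW + CLM_DENSITY +
    LOG_BLUEBOOK_C:CAR_AGE_C,
  data = train_crash_c
)
vif(lm2_c)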

Model Selection Rationale

Logistic Regression: Logit 3 (stepwise) is selected as the final model. It achieves the lowest AIC (7,477.9) and the best training AUC (0.804). The stepwise process drops noise variables while preserving the strongest signals: MVR_PTS, IS_REVOKED, URBAN, IS_COMMERCIAL, TRAVTIME, and LOG_OLDCLAIM, among others. Note that raw CLM_FREQ and KIDSDRIV are dropped in favor of the related LOG_OLDCLAIM, CLM_DENSITY, and HAS_KIDS_DRIVING terms.

Linear Regression: LM 2 is selected. The interaction between vehicle value and car age has a plausible rationale (an older high-value car may represent different risk than a new cheap one) and modestly improves adjusted R² (0.012 to 0.016). Residual plots show reasonably well-behaved errors on the log scale, though neither model explains much of the variation in claim severity.


Predictions on Evaluation Data

Final predictions are generated on the unseen evaluation data. Crash probabilities and classifications are computed from the logistic model, and predicted costs are generated only for records classified as crashes. Results are saved to CSV for further use.

Code
eval_preds <- eval %>%
  mutate(
    # Logistic: probability and binary classification
    P_CRASH    = predict(logit3, newdata = eval, type = "response"),
    TARGET_FLAG_PRED = if_else(P_CRASH >= 0.5, 1L, 0L),

    # Linear: predicted log cost, back-transformed; 0 for non-crash predictions
    LOG_AMT_PRED   = predict(lm2, newdata = eval),
    TARGET_AMT_PRED = if_else(
      TARGET_FLAG_PRED == 1,
      expm1(pmax(LOG_AMT_PRED, 0)),
      0
    )
  )

eval_preds %>%
  select(INDEX, P_CRASH, TARGET_FLAG_PRED, TARGET_AMT_PRED) %>%
  mutate(across(where(is.numeric), ~round(.x, 2))) %>%
  head(20) %>%
  kable(caption = "Sample Predictions on Evaluation Data (First 20 Rows)")
Sample Predictions on Evaluation Data (First 20 Rows)
INDEX P_CRASH TARGET_FLAG_PRED TARGET_AMT_PRED
3 0.17 0 0.00
9 0.51 1 4237.27
10 0.11 0 0.00
18 0.15 0 0.00
21 0.40 0 0.00
30 0.23 0 0.00
31 0.32 0 0.00
37 0.37 0 0.00
39 0.06 0 0.00
47 0.11 0 0.00
60 0.03 0 0.00
62 0.64 1 3916.52
63 0.84 1 3627.57
64 0.10 0 0.00
68 0.03 0 0.00
75 0.54 1 4011.65
76 0.73 1 3275.28
83 0.13 0 0.00
87 0.50 1 3841.96
92 0.37 0 0.00

These are our final predictions on the 2,141 held-out evaluation records the models have never seen. P_CRASH is the raw probability from Logit 3, TARGET_FLAG_PRED is the 0/1 classification at the 0.5 threshold, and TARGET_AMT_PRED is the dollar estimate from LM 2 (non-zero only for predicted crashes). The predicted crash rate below, 16.1%, sits well under the 26.4% training rate; that gap reflects the low sensitivity of the 0.5 cutoff seen in the confusion matrices rather than a failure to generalize.
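One caveat on the back-transform: exponentiating a log-scale prediction targets the median cost, not the mean. Duan's smearing estimator is a standard correction; a sketch follows (TARGET_AMT_SMEAR is introduced here for illustration only and is not part of the submitted predictions):

Code
# Duan's smearing factor: the mean of the exponentiated log-scale residuals
smear <- mean(exp(residuals(lm2)))

eval_preds <- eval_preds %>%
  mutate(
    TARGET_AMT_SMEAR = if_else(TARGET_FLAG_PRED == 1,
                               smear * expm1(pmax(LOG_AMT_PRED, 0)),
                               0)
  )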

Code
eval_preds %>%
  select(INDEX, P_CRASH, TARGET_FLAG_PRED, TARGET_AMT_PRED) %>%
  write_csv("insurance_predictions.csv")

cat("Predictions saved to insurance_predictions.csv\n")
Predictions saved to insurance_predictions.csv
Code
cat("Total eval records:", nrow(eval_preds), "\n")
Total eval records: 2141 
Code
cat("Predicted crashes:", sum(eval_preds$TARGET_FLAG_PRED), "\n")
Predicted crashes: 345 
Code
cat("Predicted crash rate:", round(mean(eval_preds$TARGET_FLAG_PRED) * 100, 1), "%\n")
Predicted crash rate: 16.1 %

Conclusion

The models identify key determinants of crash risk and, to a lesser extent, claim severity. Behavioral variables such as driving record and claim history emerge as the strongest predictors. Feature engineering and stepwise selection meaningfully improve crash classification; cost prediction remains harder, with low R² even after the log transformation. Overall, the models provide interpretable and actionable insights for insurance risk assessment.

Logit 3 (stepwise) and LM 2 (engineered features with interaction) are our final selected models. The logistic model picks up the key behavioral signals: driving record, claim history, license revocation, and urbanicity. In the linear model, vehicle value and MVR points are the only clearly significant cost drivers. Predictions on the evaluation set are saved to insurance_predictions.csv.