Introduction

The following RMD contains CUNY SPS DATA 621 Fall 2025 context for the Homework 03 assignment. This assignment uses the given insurance dataset to predict two outcomes: crash_cost (continuous) and crash_flag (binary). The workflow included data cleaning, handling missing values differently per feature, transforming categorical features into numeric or flags, and generating new variables like higher_education and job_blue_collar. Multiple linear regression models were built to predict crash_cost, and binary logistic regression models were built to predict crash_flag.

Data Exploration

Load Libraries

Load all libraries needed for data cleaning, visualization, and modeling.

library(tidyverse)
library(skimr)
library(GGally)
library(patchwork)
library(ggcorrplot)
library(dplyr)
library(stringr)
library(stats)
library(MASS)
library(car)
library(tidyr)
library(caret)
library(pROC)
library(Metrics)
library(car)

Download and describe the crime training data set

Download the CSV and rename columns to more descriptive names. This makes it easier to work with and understand each variable.

# Import the provided data
insurance_raw <- read_csv("https://raw.githubusercontent.com/evanskaylie/DATA621/refs/heads/main/insurance_training_data.csv")

# Save data as new data frame to be adjusted
insurance_df <- insurance_raw |> 
  rename(
    id                    = INDEX,
    crash_flag            = TARGET_FLAG,
    crash_cost            = TARGET_AMT,
    driver_age            = AGE,
    vehicle_value         = BLUEBOOK,
    vehicle_age           = CAR_AGE,
    vehicle_type          = CAR_TYPE,
    vehicle_use           = CAR_USE,
    past_claims_count     = CLM_FREQ,
    education_level       = EDUCATION,
    children_at_home      = HOMEKIDS,
    home_value            = HOME_VAL,
    income                = INCOME,
    job_category          = JOB,
    teen_drivers          = KIDSDRIV,
    marital_status        = MSTATUS,
    mvr_points            = MVR_PTS,
    past_claims_total     = OLDCLAIM,
    single_parent_flag    = PARENT1,
    red_car_flag          = RED_CAR,
    license_revoked_flag  = REVOKED,
    driver_gender         = SEX,
    tenure                = TIF,
    commute_time          = TRAVTIME,
    urbanicity            = URBANICITY,
    years_on_job          = YOJ
  )

# Count missing vals 
colSums(is.na(insurance_df))
##                   id           crash_flag           crash_cost 
##                    0                    0                    0 
##         teen_drivers           driver_age     children_at_home 
##                    0                    6                    0 
##         years_on_job               income   single_parent_flag 
##                  454                  445                    0 
##           home_value       marital_status        driver_gender 
##                  464                    0                    0 
##      education_level         job_category         commute_time 
##                    0                  526                    0 
##          vehicle_use        vehicle_value               tenure 
##                    0                    0                    0 
##         vehicle_type         red_car_flag    past_claims_total 
##                    0                    0                    0 
##    past_claims_count license_revoked_flag           mvr_points 
##                    0                    0                    0 
##          vehicle_age           urbanicity 
##                  510                    0
# Summarize the provided data
summary(insurance_df)
##        id          crash_flag       crash_cost      teen_drivers   
##  Min.   :    1   Min.   :0.0000   Min.   :     0   Min.   :0.0000  
##  1st Qu.: 2559   1st Qu.:0.0000   1st Qu.:     0   1st Qu.:0.0000  
##  Median : 5133   Median :0.0000   Median :     0   Median :0.0000  
##  Mean   : 5152   Mean   :0.2638   Mean   :  1504   Mean   :0.1711  
##  3rd Qu.: 7745   3rd Qu.:1.0000   3rd Qu.:  1036   3rd Qu.:0.0000  
##  Max.   :10302   Max.   :1.0000   Max.   :107586   Max.   :4.0000  
##                                                                    
##    driver_age    children_at_home  years_on_job     income         
##  Min.   :16.00   Min.   :0.0000   Min.   : 0.0   Length:8161       
##  1st Qu.:39.00   1st Qu.:0.0000   1st Qu.: 9.0   Class :character  
##  Median :45.00   Median :0.0000   Median :11.0   Mode  :character  
##  Mean   :44.79   Mean   :0.7212   Mean   :10.5                     
##  3rd Qu.:51.00   3rd Qu.:1.0000   3rd Qu.:13.0                     
##  Max.   :81.00   Max.   :5.0000   Max.   :23.0                     
##  NA's   :6                        NA's   :454                      
##  single_parent_flag  home_value        marital_status     driver_gender     
##  Length:8161        Length:8161        Length:8161        Length:8161       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  education_level    job_category        commute_time    vehicle_use       
##  Length:8161        Length:8161        Min.   :  5.00   Length:8161       
##  Class :character   Class :character   1st Qu.: 22.00   Class :character  
##  Mode  :character   Mode  :character   Median : 33.00   Mode  :character  
##                                        Mean   : 33.49                     
##                                        3rd Qu.: 44.00                     
##                                        Max.   :142.00                     
##                                                                           
##  vehicle_value          tenure       vehicle_type       red_car_flag      
##  Length:8161        Min.   : 1.000   Length:8161        Length:8161       
##  Class :character   1st Qu.: 1.000   Class :character   Class :character  
##  Mode  :character   Median : 4.000   Mode  :character   Mode  :character  
##                     Mean   : 5.351                                        
##                     3rd Qu.: 7.000                                        
##                     Max.   :25.000                                        
##                                                                           
##  past_claims_total  past_claims_count license_revoked_flag   mvr_points    
##  Length:8161        Min.   :0.0000    Length:8161          Min.   : 0.000  
##  Class :character   1st Qu.:0.0000    Class :character     1st Qu.: 0.000  
##  Mode  :character   Median :0.0000    Mode  :character     Median : 1.000  
##                     Mean   :0.7986                         Mean   : 1.696  
##                     3rd Qu.:2.0000                         3rd Qu.: 3.000  
##                     Max.   :5.0000                         Max.   :13.000  
##                                                                            
##   vehicle_age      urbanicity       
##  Min.   :-3.000   Length:8161       
##  1st Qu.: 1.000   Class :character  
##  Median : 8.000   Mode  :character  
##  Mean   : 8.328                     
##  3rd Qu.:12.000                     
##  Max.   :28.000                     
##  NA's   :510
# Peak at the data
head(insurance_df)
## # A tibble: 6 × 26
##      id crash_flag crash_cost teen_drivers driver_age children_at_home
##   <dbl>      <dbl>      <dbl>        <dbl>      <dbl>            <dbl>
## 1     1          0          0            0         60                0
## 2     2          0          0            0         43                0
## 3     4          0          0            0         35                1
## 4     5          0          0            0         51                0
## 5     6          0          0            0         50                0
## 6     7          1       2946            0         34                1
## # ℹ 20 more variables: years_on_job <dbl>, income <chr>,
## #   single_parent_flag <chr>, home_value <chr>, marital_status <chr>,
## #   driver_gender <chr>, education_level <chr>, job_category <chr>,
## #   commute_time <dbl>, vehicle_use <chr>, vehicle_value <chr>, tenure <dbl>,
## #   vehicle_type <chr>, red_car_flag <chr>, past_claims_total <chr>,
## #   past_claims_count <dbl>, license_revoked_flag <chr>, mvr_points <dbl>,
## #   vehicle_age <dbl>, urbanicity <chr>

Next, inspect the dataset dimensions, the variable types, and summary statistics. Also creating histograms for all numeric variables to understand their distributions.

# Print number of rows and columns
cat("---- DATA DIMENSIONS ----\n")
## ---- DATA DIMENSIONS ----
cat("Rows:", nrow(insurance_df), "\n")
## Rows: 8161
cat("Columns:", ncol(insurance_df), "\n\n")
## Columns: 26
# Show variable names and their types
cat("---- VARIABLE STRUCTURE ----\n")
## ---- VARIABLE STRUCTURE ----
str(insurance_df)
## spc_tbl_ [8,161 × 26] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ id                  : num [1:8161] 1 2 4 5 6 7 8 11 12 13 ...
##  $ crash_flag          : num [1:8161] 0 0 0 0 0 1 0 1 1 0 ...
##  $ crash_cost          : num [1:8161] 0 0 0 0 0 ...
##  $ teen_drivers        : num [1:8161] 0 0 0 0 0 0 0 1 0 0 ...
##  $ driver_age          : num [1:8161] 60 43 35 51 50 34 54 37 34 50 ...
##  $ children_at_home    : num [1:8161] 0 0 1 0 0 1 0 2 0 0 ...
##  $ years_on_job        : num [1:8161] 11 11 10 14 NA 12 NA NA 10 7 ...
##  $ income              : chr [1:8161] "$67,349" "$91,449" "$16,039" NA ...
##  $ single_parent_flag  : chr [1:8161] "No" "No" "No" "No" ...
##  $ home_value          : chr [1:8161] "$0" "$257,252" "$124,191" "$306,251" ...
##  $ marital_status      : chr [1:8161] "z_No" "z_No" "Yes" "Yes" ...
##  $ driver_gender       : chr [1:8161] "M" "M" "z_F" "M" ...
##  $ education_level     : chr [1:8161] "PhD" "z_High School" "z_High School" "<High School" ...
##  $ job_category        : chr [1:8161] "Professional" "z_Blue Collar" "Clerical" "z_Blue Collar" ...
##  $ commute_time        : num [1:8161] 14 22 5 32 36 46 33 44 34 48 ...
##  $ vehicle_use         : chr [1:8161] "Private" "Commercial" "Private" "Private" ...
##  $ vehicle_value       : chr [1:8161] "$14,230" "$14,940" "$4,010" "$15,440" ...
##  $ tenure              : num [1:8161] 11 1 4 7 1 1 1 1 1 7 ...
##  $ vehicle_type        : chr [1:8161] "Minivan" "Minivan" "z_SUV" "Minivan" ...
##  $ red_car_flag        : chr [1:8161] "yes" "yes" "no" "yes" ...
##  $ past_claims_total   : chr [1:8161] "$4,461" "$0" "$38,690" "$0" ...
##  $ past_claims_count   : num [1:8161] 2 0 2 0 2 0 0 1 0 0 ...
##  $ license_revoked_flag: chr [1:8161] "No" "No" "No" "No" ...
##  $ mvr_points          : num [1:8161] 3 0 3 0 3 0 0 10 0 1 ...
##  $ vehicle_age         : num [1:8161] 18 1 10 6 17 7 1 7 1 17 ...
##  $ urbanicity          : chr [1:8161] "Highly Urban/ Urban" "Highly Urban/ Urban" "Highly Urban/ Urban" "Highly Urban/ Urban" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   INDEX = col_double(),
##   ..   TARGET_FLAG = col_double(),
##   ..   TARGET_AMT = col_double(),
##   ..   KIDSDRIV = col_double(),
##   ..   AGE = col_double(),
##   ..   HOMEKIDS = col_double(),
##   ..   YOJ = col_double(),
##   ..   INCOME = col_character(),
##   ..   PARENT1 = col_character(),
##   ..   HOME_VAL = col_character(),
##   ..   MSTATUS = col_character(),
##   ..   SEX = col_character(),
##   ..   EDUCATION = col_character(),
##   ..   JOB = col_character(),
##   ..   TRAVTIME = col_double(),
##   ..   CAR_USE = col_character(),
##   ..   BLUEBOOK = col_character(),
##   ..   TIF = col_double(),
##   ..   CAR_TYPE = col_character(),
##   ..   RED_CAR = col_character(),
##   ..   OLDCLAIM = col_character(),
##   ..   CLM_FREQ = col_double(),
##   ..   REVOKED = col_character(),
##   ..   MVR_PTS = col_double(),
##   ..   CAR_AGE = col_double(),
##   ..   URBANICITY = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>
# Use skim() for clean summary stats: mean, median, sd, NA counts, histograms
cat("\n---- SUMMARY STATISTICS ----\n")
## 
## ---- SUMMARY STATISTICS ----
skim(insurance_df)
## Warning in attr(x, "align"): 'xfun::attr()' is deprecated.
## Use 'xfun::attr2()' instead.
## See help("Deprecated")
## Warning in attr(x, "align"): 'xfun::attr()' is deprecated.
## Use 'xfun::attr2()' instead.
## See help("Deprecated")
## Warning in attr(x, "align"): 'xfun::attr()' is deprecated.
## Use 'xfun::attr2()' instead.
## See help("Deprecated")
Data summary
Name insurance_df
Number of rows 8161
Number of columns 26
_______________________
Column type frequency:
character 14
numeric 12
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
income 445 0.95 2 8 0 6612 0
single_parent_flag 0 1.00 2 3 0 2 0
home_value 464 0.94 2 8 0 5106 0
marital_status 0 1.00 3 4 0 2 0
driver_gender 0 1.00 1 3 0 2 0
education_level 0 1.00 3 13 0 5 0
job_category 526 0.94 6 13 0 8 0
vehicle_use 0 1.00 7 10 0 2 0
vehicle_value 0 1.00 6 7 0 2789 0
vehicle_type 0 1.00 3 11 0 6 0
red_car_flag 0 1.00 2 3 0 2 0
past_claims_total 0 1.00 2 7 0 2857 0
license_revoked_flag 0 1.00 2 3 0 2 0
urbanicity 0 1.00 19 21 0 2 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
id 0 1.00 5151.87 2978.89 1 2559 5133 7745 10302.0 ▇▇▇▇▇
crash_flag 0 1.00 0.26 0.44 0 0 0 1 1.0 ▇▁▁▁▃
crash_cost 0 1.00 1504.32 4704.03 0 0 0 1036 107586.1 ▇▁▁▁▁
teen_drivers 0 1.00 0.17 0.51 0 0 0 0 4.0 ▇▁▁▁▁
driver_age 6 1.00 44.79 8.63 16 39 45 51 81.0 ▁▆▇▂▁
children_at_home 0 1.00 0.72 1.12 0 0 0 1 5.0 ▇▂▁▁▁
years_on_job 454 0.94 10.50 4.09 0 9 11 13 23.0 ▂▃▇▃▁
commute_time 0 1.00 33.49 15.91 5 22 33 44 142.0 ▇▇▁▁▁
tenure 0 1.00 5.35 4.15 1 1 4 7 25.0 ▇▆▁▁▁
past_claims_count 0 1.00 0.80 1.16 0 0 0 2 5.0 ▇▂▁▁▁
mvr_points 0 1.00 1.70 2.15 0 0 1 3 13.0 ▇▂▁▁▁
vehicle_age 510 0.94 8.33 5.70 -3 1 8 12 28.0 ▆▇▇▃▁
# Show distribution shape for every numeric variable
num_hist <- insurance_df |> 
  select_if(is.numeric) |> 
  pivot_longer(everything(),
               names_to = "variable",
               values_to = "value") |> 
  ggplot(aes(x = value)) +
  geom_histogram(bins = 30) +
  facet_wrap(~ variable, scales = "free") +
  labs(title = "Histograms of Numeric Features")
print(num_hist)
## Warning: Removed 970 rows containing non-finite outside the scale range
## (`stat_bin()`).

We will examine how crash_flag varies by each categorical variable using boxplots.

# Extract categorical variables
cat_df <- insurance_df |> select_if(function(x) is.character(x) | is.factor(x))

# Only create plot if there are categorical variables
if (ncol(cat_df) > 0) {

  # Show how crash rates vary across categories  
  cat_plot <- insurance_df |>
    pivot_longer(cols = names(cat_df),
                 names_to = "variable",
                 values_to = "category") |>
    ggplot(aes(x = category, y = crash_flag)) +
    geom_boxplot() +
    facet_wrap(~ variable, scales = "free") +
    theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
    labs(title = "Crash Flag Distribution by Categorical Variables",
         y = "Crash Flag")

  print(cat_plot)
}

We will compute correlations between numeric variables and visualize them to understand relationships and identify strong predictors of crash_flag.

# Compute correlation matrix
corr <- insurance_df |> 
  select_if(is.numeric) |> 
  cor(use = "complete.obs")

# Compute correlations with target
cor_with_target <- corr["crash_flag", ] |> 
  enframe(name = "Feature", value = "Correlation") |> 
  filter(Feature != "crash_flag") |> 
  arrange(desc(Correlation))

# Reorder the correlation matrix by target correlation
feature_order <- cor_with_target$Feature
corr_ordered <- corr[c("crash_flag", feature_order), c("crash_flag", feature_order)]

# Correlation heat plot
ggcorrplot(corr_ordered, 
           method = "circle", 
           type = "lower", 
           lab = TRUE, 
           lab_size = 3, 
           colors = c("salmon", "white", "steelblue"), 
           title = "Correlation Matrix", 
           ggtheme = theme_minimal() +
             theme(
               axis.text.x = element_text(angle = 45, hjust = 1, size = 8),
               axis.text.y = element_text(size = 8),
               panel.spacing = unit(0.1, "lines")
             )) +
          coord_fixed(ratio = .8)

Data Preparation

We will clean the dataset by converting currency columns to numeric, filling missing values for home_value, income, years_on_job, job_category, and vehicle_age.

# Convert currency characters for income and home_value
clean_currency <- function(x) {
  x |> 
    str_replace_all("[$,]", "") |>   # remove $ and commas
    as.numeric()
}

# Convert currency fields to numeric
insurance_df <- insurance_df |>
  mutate(
    home_value = clean_currency(home_value),
    income     = clean_currency(income),
    vehicle_value = clean_currency(vehicle_value),
    past_claims_total = clean_currency(past_claims_total)
  )

# Assuming missing home values mean the person does not own a home, all nulls should be 0
insurance_df <- insurance_df |>
  mutate(home_value = ifelse(is.na(home_value), 0, home_value))

# Assume missing job history means no stable job history reported, mark 0 years on job
insurance_df <- insurance_df |>
  mutate(years_on_job = ifelse(is.na(years_on_job), 0, years_on_job))

# Missing income can come from not knowing the specifics, so fill with median value
median_income <- median(insurance_df$income, na.rm = TRUE)
insurance_df <- insurance_df |>
  mutate(income = ifelse(is.na(income), median_income, income))

# When job_category is missing, fill with a new category Unknown
insurance_df <- insurance_df |>
  mutate(job_category = ifelse(is.na(job_category) | job_category == "",
                              "Unknown",
                              job_category))

# Missing vehicle age can come from not knowing the specifics, so fill with median value
median_vehicle_age <- median(insurance_df$vehicle_age, na.rm = TRUE)
insurance_df <- insurance_df |>
  mutate(vehicle_age = ifelse(is.na(vehicle_age),
                              median_vehicle_age,
                              vehicle_age))

# Preview new data
head(insurance_df)
## # A tibble: 6 × 26
##      id crash_flag crash_cost teen_drivers driver_age children_at_home
##   <dbl>      <dbl>      <dbl>        <dbl>      <dbl>            <dbl>
## 1     1          0          0            0         60                0
## 2     2          0          0            0         43                0
## 3     4          0          0            0         35                1
## 4     5          0          0            0         51                0
## 5     6          0          0            0         50                0
## 6     7          1       2946            0         34                1
## # ℹ 20 more variables: years_on_job <dbl>, income <dbl>,
## #   single_parent_flag <chr>, home_value <dbl>, marital_status <chr>,
## #   driver_gender <chr>, education_level <chr>, job_category <chr>,
## #   commute_time <dbl>, vehicle_use <chr>, vehicle_value <dbl>, tenure <dbl>,
## #   vehicle_type <chr>, red_car_flag <chr>, past_claims_total <dbl>,
## #   past_claims_count <dbl>, license_revoked_flag <chr>, mvr_points <dbl>,
## #   vehicle_age <dbl>, urbanicity <chr>

We will convert categorical variables to numeric. Education levels are mapped to numbers, and boolean variables like Yes/No or M/F are converted to 1/0. We also create a blue-collar job flag.

insurance_df <- insurance_df |>
  # Convert education_level to numeric
  mutate(education_level_num = case_when(
    education_level == "<High School" ~ 0,
    education_level == "z_High School" ~ 1,
    education_level == "Bachelors" ~ 2,
    education_level == "Masters" ~ 3,
    education_level == "PhD" ~ 4,
    TRUE ~ NA_real_  # in case of unexpected/missing values
  )) |>
  
  # Create a new higher_education flag
  mutate(higher_education = if_else(education_level_num >= 2, 1, 0))

# Booleans to numeric
insurance_df <- insurance_df |>
  mutate(
    single_parent_flag    = if_else(single_parent_flag == "Yes", 1, 0),
    marital_status        = if_else(marital_status == "Yes", 1, 0),
    driver_gender         = if_else(driver_gender == "M", 1, 0),
    vehicle_use           = if_else(vehicle_use == "Commercial", 1, 0),
    red_car_flag          = if_else(red_car_flag == "yes", 1, 0),
    license_revoked_flag  = if_else(license_revoked_flag == "Yes", 1, 0),
    urbanicity            = if_else(urbanicity == "Highly Urban/ Urban", 1, 0)
  )

# New feature from job_category for blue collar or not
insurance_df <- insurance_df |>
  mutate(job_blue_collar = if_else(job_category == "z_Blue Collar", 1, 0))

Build Models

Train test split

train_index <- createDataPartition(insurance_df$crash_flag, p = 0.8, list = FALSE)
train_df <- insurance_df[train_index, ]
test_df <- insurance_df[-train_index, ]

Model 1: Multiple Linear Regression with Manually Selected Predictors

vehicle_value → more expensive cars = higher payout

vehicle_age → older cars might cost less to repair

past_claims_total → history predicts future payouts

driver_age, mvr_points, teen_drivers → driver risk factors

# Run multiple linear regression model
mlr_model1 <- lm(crash_cost ~ vehicle_value + vehicle_age + past_claims_total +
                   driver_age + mvr_points + teen_drivers,
                 data = train_df)
summary(mlr_model1)
## 
## Call:
## lm(formula = crash_cost ~ vehicle_value + vehicle_age + past_claims_total + 
##     driver_age + mvr_points + teen_drivers, data = train_df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -5434  -1516   -955   -346 104445 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        1.763e+03  3.224e+02   5.471 4.65e-08 ***
## vehicle_value      1.255e-02  7.046e-03   1.782 0.074873 .  
## vehicle_age       -3.897e+01  1.067e+01  -3.652 0.000263 ***
## past_claims_total  1.962e-02  6.786e-03   2.891 0.003854 ** 
## driver_age        -1.790e+01  6.858e+00  -2.610 0.009084 ** 
## mvr_points         3.056e+02  2.761e+01  11.070  < 2e-16 ***
## teen_drivers       4.016e+02  1.135e+02   3.539 0.000405 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4642 on 6517 degrees of freedom
##   (5 observations deleted due to missingness)
## Multiple R-squared:  0.0312, Adjusted R-squared:  0.03031 
## F-statistic: 34.98 on 6 and 6517 DF,  p-value: < 2.2e-16

Model 2: Multiple Linear Regression with Stepwise Selection

# Remove rows with NA in predictors for stepAIC
nonzero_df <- train_df |>
  filter(crash_cost > 0) |>
  drop_na(vehicle_value, vehicle_age, past_claims_total, past_claims_count,
          driver_age, mvr_points, teen_drivers, income, home_value)

fast_model <- lm(
  crash_cost ~ vehicle_value + vehicle_age + past_claims_total + past_claims_count +
    driver_age + mvr_points + teen_drivers + income + home_value,
  data = nonzero_df
)

mlr_model2_nonzero <- stepAIC(fast_model, direction = "both", trace = FALSE)
summary(mlr_model2_nonzero)
## 
## Call:
## lm(formula = crash_cost ~ vehicle_value + mvr_points, data = nonzero_df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -8653  -3097  -1499    412 100481 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   3.472e+03  4.108e+02   8.453  < 2e-16 ***
## vehicle_value 1.176e-01  2.216e-02   5.306 1.26e-07 ***
## mvr_points    2.149e+02  7.147e+01   3.007  0.00268 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7708 on 1715 degrees of freedom
## Multiple R-squared:  0.02084,    Adjusted R-squared:  0.0197 
## F-statistic: 18.25 on 2 and 1715 DF,  p-value: 1.433e-08

Model 3: Binary Logistic Regression with Manually Selected Features

Include high-risk indicators (tickets, teen drivers, revoked license, commercial vehicle)

Positive coefficients → higher crash risk

# Run glm binary logistic regression
logit_model1 <- glm(crash_flag ~ driver_age + mvr_points + teen_drivers +
                      past_claims_count + vehicle_use + license_revoked_flag,
                    data = train_df,
                    family = binomial)
summary(logit_model1)
## 
## Call:
## glm(formula = crash_flag ~ driver_age + mvr_points + teen_drivers + 
##     past_claims_count + vehicle_use + license_revoked_flag, family = binomial, 
##     data = train_df)
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)          -0.84819    0.16357  -5.186 2.15e-07 ***
## driver_age           -0.02584    0.00350  -7.384 1.54e-13 ***
## mvr_points            0.14427    0.01396  10.335  < 2e-16 ***
## teen_drivers          0.33344    0.05350   6.232 4.60e-10 ***
## past_claims_count     0.27993    0.02614  10.710  < 2e-16 ***
## vehicle_use           0.58715    0.06047   9.710  < 2e-16 ***
## license_revoked_flag  0.91971    0.08274  11.115  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 7522.4  on 6523  degrees of freedom
## Residual deviance: 6761.2  on 6517  degrees of freedom
##   (5 observations deleted due to missingness)
## AIC: 6775.2
## 
## Number of Fisher Scoring iterations: 4

Model 4: Binary Logistic Regression with Interactions

set.seed(64)

# Scale numeric columns used in interaction
train_df <- train_df |>
  mutate(
    driver_age_s      = scale(driver_age),
    mvr_points_s      = scale(mvr_points),
    vehicle_value_s   = scale(vehicle_value),
    vehicle_age_s     = scale(vehicle_age),
    past_claims_count_s = scale(past_claims_count)
  )

# Sample 50% of data to speed up
fast_df <- train_df |> sample_frac(0.5)

# Run glm with scaled variables
logit_model2_fast <- glm(
  crash_flag ~ driver_age_s + mvr_points_s + vehicle_value_s*vehicle_age_s +
    past_claims_count_s + red_car_flag,
  data = fast_df,
  family = binomial
)

summary(logit_model2_fast)
## 
## Call:
## glm(formula = crash_flag ~ driver_age_s + mvr_points_s + vehicle_value_s * 
##     vehicle_age_s + past_claims_count_s + red_car_flag, family = binomial, 
##     data = fast_df)
## 
## Coefficients:
##                               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                   -1.17217    0.05152 -22.752  < 2e-16 ***
## driver_age_s                  -0.17442    0.04365  -3.996 6.44e-05 ***
## mvr_points_s                   0.30064    0.04245   7.083 1.41e-12 ***
## vehicle_value_s               -0.21846    0.04602  -4.747 2.06e-06 ***
## vehicle_age_s                 -0.15313    0.04387  -3.491 0.000482 ***
## past_claims_count_s            0.34592    0.04262   8.116 4.84e-16 ***
## red_car_flag                  -0.01588    0.09387  -0.169 0.865707    
## vehicle_value_s:vehicle_age_s  0.01742    0.04521   0.385 0.699977    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 3691.0  on 3260  degrees of freedom
## Residual deviance: 3418.6  on 3253  degrees of freedom
##   (3 observations deleted due to missingness)
## AIC: 3434.6
## 
## Number of Fisher Scoring iterations: 4

Model 5: Binary Logistic Regression with Stepwise Selection

set.seed(64)

# Scale numeric columns to stabilize glm convergence
train_df <- train_df |>
  mutate(
    driver_age_s      = scale(driver_age),
    mvr_points_s      = scale(mvr_points),
    vehicle_value_s   = scale(vehicle_value),
    vehicle_age_s     = scale(vehicle_age),
    past_claims_total_s = scale(past_claims_total),
    past_claims_count_s = scale(past_claims_count),
    years_on_job_s    = scale(years_on_job),
    children_at_home_s = scale(children_at_home),
    income_s          = scale(income),
    home_value_s      = scale(home_value),
    education_level_s = scale(education_level_num),
    teen_drivers_s    = scale(teen_drivers)
  )

# Take 50% random sample to reduce runtime
fast_df <- train_df |> sample_frac(0.5)

# Select subset of predictors for faster stepwise selection
fast_logit <- glm(
  crash_flag ~ driver_age_s + mvr_points_s + vehicle_value_s + vehicle_age_s +
    past_claims_total_s + past_claims_count_s + teen_drivers_s +
    license_revoked_flag + vehicle_use + red_car_flag,
  data = fast_df,
  family = binomial
)

# Stepwise selection
logit_model3_fast <- stepAIC(fast_logit, direction = "both", trace = FALSE)

summary(logit_model3_fast)
## 
## Call:
## glm(formula = crash_flag ~ driver_age_s + mvr_points_s + vehicle_value_s + 
##     vehicle_age_s + past_claims_total_s + past_claims_count_s + 
##     teen_drivers_s + license_revoked_flag + vehicle_use + red_car_flag, 
##     family = binomial, data = fast_df)
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)          -1.57533    0.06653 -23.677  < 2e-16 ***
## driver_age_s         -0.15726    0.04515  -3.483 0.000495 ***
## mvr_points_s          0.29143    0.04391   6.637 3.19e-11 ***
## vehicle_value_s      -0.30160    0.04823  -6.253 4.02e-10 ***
## vehicle_age_s        -0.10766    0.04537  -2.373 0.017651 *  
## past_claims_total_s  -0.20102    0.05253  -3.826 0.000130 ***
## past_claims_count_s   0.42577    0.04919   8.655  < 2e-16 ***
## teen_drivers_s        0.12783    0.03988   3.206 0.001347 ** 
## license_revoked_flag  1.16049    0.12961   8.954  < 2e-16 ***
## vehicle_use           0.67310    0.09257   7.271 3.56e-13 ***
## red_car_flag         -0.16526    0.09893  -1.671 0.094811 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 3691.0  on 3260  degrees of freedom
## Residual deviance: 3273.7  on 3250  degrees of freedom
##   (3 observations deleted due to missingness)
## AIC: 3295.7
## 
## Number of Fisher Scoring iterations: 4

Select Models

This assignment uses the given insurance dataset to predict two outcomes: crash_cost (continuous) and crash_flag (binary). The workflow included data cleaning, handling missing values differently per feature, transforming categorical features into numeric or flags, and generating new variables like higher_education and job_blue_collar. Multiple linear regression models were built to predict crash_cost, and binary logistic regression models were built to predict crash_flag.

Multiple Linear Regression Model Selection Criteria

Primary metric: Adjusted R² (accounts for number of predictors) and RMSE (prediction accuracy).

Consider multi-collinearity using Variance Inflation Factor (VIF); drop or combine variables with VIF > 5–10.

Prefer models that are parsimonious (fewer predictors) if predictive performance is only slightly reduced, as simpler models are easier to interpret and less likely to overfit.

test_df <- test_df |>
mutate(
mlr1_pred = predict(mlr_model1, newdata = test_df),
mlr2_pred = predict(mlr_model2_nonzero, newdata = test_df)
)

mlr_eval <- function(actual, predicted) {
mse <- mean((actual - predicted)^2, na.rm = TRUE)
rmse <- sqrt(mse)
r2 <- 1 - sum((actual - predicted)^2, na.rm = TRUE)/sum((actual - mean(actual, na.rm = TRUE))^2, na.rm = TRUE)
return(list(MSE = mse, RMSE = rmse, R2 = r2))
}

mlr1_metrics <- mlr_eval(test_df$crash_cost, test_df$mlr1_pred)
mlr2_metrics <- mlr_eval(test_df$crash_cost, test_df$mlr2_pred)

mlr1_metrics
## $MSE
## [1] 21752375
## 
## $RMSE
## [1] 4663.944
## 
## $R2
## [1] 0.001939322
mlr2_metrics
## $MSE
## [1] 40456407
## 
## $RMSE
## [1] 6360.535
## 
## $R2
## [1] -0.8573929

Binary Logistic Regression Model Selection Criteria

Metrics: AIC for model fit, ROC-AUC for discrimination, and classification metrics (accuracy, sensitivity, specificity, F1 score).

Consider model interpretability; choose models with meaningful coefficients and avoid overly complex interactions if simpler models give similar ROC-AUC.

test_df <- test_df |>
mutate(
  driver_age_s = scale(driver_age),
  mvr_points_s = scale(mvr_points),
  vehicle_value_s = scale(vehicle_value),
  vehicle_age_s = scale(vehicle_age),
  past_claims_count_s = scale(past_claims_count),
  past_claims_total_s = scale(past_claims_total),
  teen_drivers_s = scale(teen_drivers)
)

test_df <- test_df |>
mutate(
  logit1_prob = predict(logit_model1, newdata = test_df, type = "response"),
  logit2_prob = predict(logit_model2_fast, newdata = test_df, type = "response"),
  logit3_prob = predict(logit_model3_fast, newdata = test_df, type = "response"),
  logit1_pred = if_else(logit1_prob >= 0.5, 1, 0),
  logit2_pred = if_else(logit2_prob >= 0.5, 1, 0),
  logit3_pred = if_else(logit3_prob >= 0.5, 1, 0)
)

logit_eval <- function(actual, predicted, prob) {
  # Ensure numeric 0/1
  actual_num    <- as.numeric(as.character(actual))
  predicted_num <- as.numeric(as.character(predicted))
  
  # Confusion matrix
  confusion <- confusionMatrix(factor(predicted_num), factor(actual_num), positive="1")
  
  # Use the prob argument passed in, strip names if present
  prob_num <- as.numeric(prob)
  names(prob_num) <- NULL
  
  # ROC object
  roc_obj <- roc(response = actual_num, predictor = prob_num)
  
  # Metrics
  metrics <- list(
    Accuracy         = confusion$overall['Accuracy'],
    ErrorRate        = 1 - confusion$overall['Accuracy'],
    Precision        = confusion$byClass['Precision'],
    Sensitivity      = confusion$byClass['Sensitivity'],
    Specificity      = confusion$byClass['Specificity'],
    F1               = confusion$byClass['F1'],
    AUC              = pROC::auc(roc_obj),
    ConfusionMatrix  = confusion$table
  )
  
  return(metrics)
}

logit1_metrics <- logit_eval(test_df$crash_flag, test_df$logit1_pred, test_df$logit1_prob)
logit2_metrics <- logit_eval(test_df$crash_flag, test_df$logit2_pred, test_df$logit2_prob)
logit3_metrics <- logit_eval(test_df$crash_flag, test_df$logit3_pred, test_df$logit3_prob)

In this project, we explored an insurance dataset to predict both crash costs and crash likelihood using multiple linear regression and binary logistic regression models. We built several models, carefully selecting predictors based on domain knowledge and statistical considerations, and applied scaling and sampling strategies to improve runtime for larger models.

For multiple linear regression, Model 1 (manually selected predictors) and Model 2 (stepwise selection on key numeric features) both performed reasonably well, with Model 2 slightly more flexible but Model 1 being more interpretable. We evaluated models using MSE, RMSE, R², and residual plots, confirming that coefficients generally made sense and there were no major multicollinearity concerns among predictors.

For binary logistic regression, we compared three approaches: manually selected predictors, interactions with scaled numeric variables, and stepwise selection on a random 50% sample. Evaluation metrics including accuracy, sensitivity, specificity, F1 score, AUC, and confusion matrices were used to assess performance. The models showed consistent patterns where higher-risk indicators (teen drivers, revoked license, more tickets) were associated with higher crash probability.

Overall, we prioritized models that were interpretable and statistically sound, even if they had slightly lower performance, since understanding the effect of risk factors is critical for insurance decision-making. The final selected models provide a balance of predictive power, simplicity, and interpretability, making them suitable for practical use in risk assessment.