The following RMD contains CUNY SPS DATA 621 Fall 2025 context for the Homework 03 assignment. This assignment uses the given insurance dataset to predict two outcomes: crash_cost (continuous) and crash_flag (binary). The workflow included data cleaning, handling missing values differently per feature, transforming categorical features into numeric or flags, and generating new variables like higher_education and job_blue_collar. Multiple linear regression models were built to predict crash_cost, and binary logistic regression models were built to predict crash_flag.
Load all libraries needed for data cleaning, visualization, and modeling.
library(tidyverse)
library(skimr)
library(GGally)
library(patchwork)
library(ggcorrplot)
library(dplyr)
library(stringr)
library(stats)
library(MASS)
library(car)
library(tidyr)
library(caret)
library(pROC)
library(Metrics)
library(car)
Download the CSV and rename columns to more descriptive names. This makes it easier to work with and understand each variable.
# Import the provided data
insurance_raw <- read_csv("https://raw.githubusercontent.com/evanskaylie/DATA621/refs/heads/main/insurance_training_data.csv")
# Save data as new data frame to be adjusted
insurance_df <- insurance_raw |>
rename(
id = INDEX,
crash_flag = TARGET_FLAG,
crash_cost = TARGET_AMT,
driver_age = AGE,
vehicle_value = BLUEBOOK,
vehicle_age = CAR_AGE,
vehicle_type = CAR_TYPE,
vehicle_use = CAR_USE,
past_claims_count = CLM_FREQ,
education_level = EDUCATION,
children_at_home = HOMEKIDS,
home_value = HOME_VAL,
income = INCOME,
job_category = JOB,
teen_drivers = KIDSDRIV,
marital_status = MSTATUS,
mvr_points = MVR_PTS,
past_claims_total = OLDCLAIM,
single_parent_flag = PARENT1,
red_car_flag = RED_CAR,
license_revoked_flag = REVOKED,
driver_gender = SEX,
tenure = TIF,
commute_time = TRAVTIME,
urbanicity = URBANICITY,
years_on_job = YOJ
)
# Count missing vals
colSums(is.na(insurance_df))
## id crash_flag crash_cost
## 0 0 0
## teen_drivers driver_age children_at_home
## 0 6 0
## years_on_job income single_parent_flag
## 454 445 0
## home_value marital_status driver_gender
## 464 0 0
## education_level job_category commute_time
## 0 526 0
## vehicle_use vehicle_value tenure
## 0 0 0
## vehicle_type red_car_flag past_claims_total
## 0 0 0
## past_claims_count license_revoked_flag mvr_points
## 0 0 0
## vehicle_age urbanicity
## 510 0
# Summarize the provided data
summary(insurance_df)
## id crash_flag crash_cost teen_drivers
## Min. : 1 Min. :0.0000 Min. : 0 Min. :0.0000
## 1st Qu.: 2559 1st Qu.:0.0000 1st Qu.: 0 1st Qu.:0.0000
## Median : 5133 Median :0.0000 Median : 0 Median :0.0000
## Mean : 5152 Mean :0.2638 Mean : 1504 Mean :0.1711
## 3rd Qu.: 7745 3rd Qu.:1.0000 3rd Qu.: 1036 3rd Qu.:0.0000
## Max. :10302 Max. :1.0000 Max. :107586 Max. :4.0000
##
## driver_age children_at_home years_on_job income
## Min. :16.00 Min. :0.0000 Min. : 0.0 Length:8161
## 1st Qu.:39.00 1st Qu.:0.0000 1st Qu.: 9.0 Class :character
## Median :45.00 Median :0.0000 Median :11.0 Mode :character
## Mean :44.79 Mean :0.7212 Mean :10.5
## 3rd Qu.:51.00 3rd Qu.:1.0000 3rd Qu.:13.0
## Max. :81.00 Max. :5.0000 Max. :23.0
## NA's :6 NA's :454
## single_parent_flag home_value marital_status driver_gender
## Length:8161 Length:8161 Length:8161 Length:8161
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## education_level job_category commute_time vehicle_use
## Length:8161 Length:8161 Min. : 5.00 Length:8161
## Class :character Class :character 1st Qu.: 22.00 Class :character
## Mode :character Mode :character Median : 33.00 Mode :character
## Mean : 33.49
## 3rd Qu.: 44.00
## Max. :142.00
##
## vehicle_value tenure vehicle_type red_car_flag
## Length:8161 Min. : 1.000 Length:8161 Length:8161
## Class :character 1st Qu.: 1.000 Class :character Class :character
## Mode :character Median : 4.000 Mode :character Mode :character
## Mean : 5.351
## 3rd Qu.: 7.000
## Max. :25.000
##
## past_claims_total past_claims_count license_revoked_flag mvr_points
## Length:8161 Min. :0.0000 Length:8161 Min. : 0.000
## Class :character 1st Qu.:0.0000 Class :character 1st Qu.: 0.000
## Mode :character Median :0.0000 Mode :character Median : 1.000
## Mean :0.7986 Mean : 1.696
## 3rd Qu.:2.0000 3rd Qu.: 3.000
## Max. :5.0000 Max. :13.000
##
## vehicle_age urbanicity
## Min. :-3.000 Length:8161
## 1st Qu.: 1.000 Class :character
## Median : 8.000 Mode :character
## Mean : 8.328
## 3rd Qu.:12.000
## Max. :28.000
## NA's :510
# Peak at the data
head(insurance_df)
## # A tibble: 6 × 26
## id crash_flag crash_cost teen_drivers driver_age children_at_home
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 0 0 0 60 0
## 2 2 0 0 0 43 0
## 3 4 0 0 0 35 1
## 4 5 0 0 0 51 0
## 5 6 0 0 0 50 0
## 6 7 1 2946 0 34 1
## # ℹ 20 more variables: years_on_job <dbl>, income <chr>,
## # single_parent_flag <chr>, home_value <chr>, marital_status <chr>,
## # driver_gender <chr>, education_level <chr>, job_category <chr>,
## # commute_time <dbl>, vehicle_use <chr>, vehicle_value <chr>, tenure <dbl>,
## # vehicle_type <chr>, red_car_flag <chr>, past_claims_total <chr>,
## # past_claims_count <dbl>, license_revoked_flag <chr>, mvr_points <dbl>,
## # vehicle_age <dbl>, urbanicity <chr>
Next, inspect the dataset dimensions, the variable types, and summary statistics. Also creating histograms for all numeric variables to understand their distributions.
# Print number of rows and columns
cat("---- DATA DIMENSIONS ----\n")
## ---- DATA DIMENSIONS ----
cat("Rows:", nrow(insurance_df), "\n")
## Rows: 8161
cat("Columns:", ncol(insurance_df), "\n\n")
## Columns: 26
# Show variable names and their types
cat("---- VARIABLE STRUCTURE ----\n")
## ---- VARIABLE STRUCTURE ----
str(insurance_df)
## spc_tbl_ [8,161 × 26] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ id : num [1:8161] 1 2 4 5 6 7 8 11 12 13 ...
## $ crash_flag : num [1:8161] 0 0 0 0 0 1 0 1 1 0 ...
## $ crash_cost : num [1:8161] 0 0 0 0 0 ...
## $ teen_drivers : num [1:8161] 0 0 0 0 0 0 0 1 0 0 ...
## $ driver_age : num [1:8161] 60 43 35 51 50 34 54 37 34 50 ...
## $ children_at_home : num [1:8161] 0 0 1 0 0 1 0 2 0 0 ...
## $ years_on_job : num [1:8161] 11 11 10 14 NA 12 NA NA 10 7 ...
## $ income : chr [1:8161] "$67,349" "$91,449" "$16,039" NA ...
## $ single_parent_flag : chr [1:8161] "No" "No" "No" "No" ...
## $ home_value : chr [1:8161] "$0" "$257,252" "$124,191" "$306,251" ...
## $ marital_status : chr [1:8161] "z_No" "z_No" "Yes" "Yes" ...
## $ driver_gender : chr [1:8161] "M" "M" "z_F" "M" ...
## $ education_level : chr [1:8161] "PhD" "z_High School" "z_High School" "<High School" ...
## $ job_category : chr [1:8161] "Professional" "z_Blue Collar" "Clerical" "z_Blue Collar" ...
## $ commute_time : num [1:8161] 14 22 5 32 36 46 33 44 34 48 ...
## $ vehicle_use : chr [1:8161] "Private" "Commercial" "Private" "Private" ...
## $ vehicle_value : chr [1:8161] "$14,230" "$14,940" "$4,010" "$15,440" ...
## $ tenure : num [1:8161] 11 1 4 7 1 1 1 1 1 7 ...
## $ vehicle_type : chr [1:8161] "Minivan" "Minivan" "z_SUV" "Minivan" ...
## $ red_car_flag : chr [1:8161] "yes" "yes" "no" "yes" ...
## $ past_claims_total : chr [1:8161] "$4,461" "$0" "$38,690" "$0" ...
## $ past_claims_count : num [1:8161] 2 0 2 0 2 0 0 1 0 0 ...
## $ license_revoked_flag: chr [1:8161] "No" "No" "No" "No" ...
## $ mvr_points : num [1:8161] 3 0 3 0 3 0 0 10 0 1 ...
## $ vehicle_age : num [1:8161] 18 1 10 6 17 7 1 7 1 17 ...
## $ urbanicity : chr [1:8161] "Highly Urban/ Urban" "Highly Urban/ Urban" "Highly Urban/ Urban" "Highly Urban/ Urban" ...
## - attr(*, "spec")=
## .. cols(
## .. INDEX = col_double(),
## .. TARGET_FLAG = col_double(),
## .. TARGET_AMT = col_double(),
## .. KIDSDRIV = col_double(),
## .. AGE = col_double(),
## .. HOMEKIDS = col_double(),
## .. YOJ = col_double(),
## .. INCOME = col_character(),
## .. PARENT1 = col_character(),
## .. HOME_VAL = col_character(),
## .. MSTATUS = col_character(),
## .. SEX = col_character(),
## .. EDUCATION = col_character(),
## .. JOB = col_character(),
## .. TRAVTIME = col_double(),
## .. CAR_USE = col_character(),
## .. BLUEBOOK = col_character(),
## .. TIF = col_double(),
## .. CAR_TYPE = col_character(),
## .. RED_CAR = col_character(),
## .. OLDCLAIM = col_character(),
## .. CLM_FREQ = col_double(),
## .. REVOKED = col_character(),
## .. MVR_PTS = col_double(),
## .. CAR_AGE = col_double(),
## .. URBANICITY = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
# Use skim() for clean summary stats: mean, median, sd, NA counts, histograms
cat("\n---- SUMMARY STATISTICS ----\n")
##
## ---- SUMMARY STATISTICS ----
skim(insurance_df)
## Warning in attr(x, "align"): 'xfun::attr()' is deprecated.
## Use 'xfun::attr2()' instead.
## See help("Deprecated")
## Warning in attr(x, "align"): 'xfun::attr()' is deprecated.
## Use 'xfun::attr2()' instead.
## See help("Deprecated")
## Warning in attr(x, "align"): 'xfun::attr()' is deprecated.
## Use 'xfun::attr2()' instead.
## See help("Deprecated")
| Name | insurance_df |
| Number of rows | 8161 |
| Number of columns | 26 |
| _______________________ | |
| Column type frequency: | |
| character | 14 |
| numeric | 12 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| income | 445 | 0.95 | 2 | 8 | 0 | 6612 | 0 |
| single_parent_flag | 0 | 1.00 | 2 | 3 | 0 | 2 | 0 |
| home_value | 464 | 0.94 | 2 | 8 | 0 | 5106 | 0 |
| marital_status | 0 | 1.00 | 3 | 4 | 0 | 2 | 0 |
| driver_gender | 0 | 1.00 | 1 | 3 | 0 | 2 | 0 |
| education_level | 0 | 1.00 | 3 | 13 | 0 | 5 | 0 |
| job_category | 526 | 0.94 | 6 | 13 | 0 | 8 | 0 |
| vehicle_use | 0 | 1.00 | 7 | 10 | 0 | 2 | 0 |
| vehicle_value | 0 | 1.00 | 6 | 7 | 0 | 2789 | 0 |
| vehicle_type | 0 | 1.00 | 3 | 11 | 0 | 6 | 0 |
| red_car_flag | 0 | 1.00 | 2 | 3 | 0 | 2 | 0 |
| past_claims_total | 0 | 1.00 | 2 | 7 | 0 | 2857 | 0 |
| license_revoked_flag | 0 | 1.00 | 2 | 3 | 0 | 2 | 0 |
| urbanicity | 0 | 1.00 | 19 | 21 | 0 | 2 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| id | 0 | 1.00 | 5151.87 | 2978.89 | 1 | 2559 | 5133 | 7745 | 10302.0 | ▇▇▇▇▇ |
| crash_flag | 0 | 1.00 | 0.26 | 0.44 | 0 | 0 | 0 | 1 | 1.0 | ▇▁▁▁▃ |
| crash_cost | 0 | 1.00 | 1504.32 | 4704.03 | 0 | 0 | 0 | 1036 | 107586.1 | ▇▁▁▁▁ |
| teen_drivers | 0 | 1.00 | 0.17 | 0.51 | 0 | 0 | 0 | 0 | 4.0 | ▇▁▁▁▁ |
| driver_age | 6 | 1.00 | 44.79 | 8.63 | 16 | 39 | 45 | 51 | 81.0 | ▁▆▇▂▁ |
| children_at_home | 0 | 1.00 | 0.72 | 1.12 | 0 | 0 | 0 | 1 | 5.0 | ▇▂▁▁▁ |
| years_on_job | 454 | 0.94 | 10.50 | 4.09 | 0 | 9 | 11 | 13 | 23.0 | ▂▃▇▃▁ |
| commute_time | 0 | 1.00 | 33.49 | 15.91 | 5 | 22 | 33 | 44 | 142.0 | ▇▇▁▁▁ |
| tenure | 0 | 1.00 | 5.35 | 4.15 | 1 | 1 | 4 | 7 | 25.0 | ▇▆▁▁▁ |
| past_claims_count | 0 | 1.00 | 0.80 | 1.16 | 0 | 0 | 0 | 2 | 5.0 | ▇▂▁▁▁ |
| mvr_points | 0 | 1.00 | 1.70 | 2.15 | 0 | 0 | 1 | 3 | 13.0 | ▇▂▁▁▁ |
| vehicle_age | 510 | 0.94 | 8.33 | 5.70 | -3 | 1 | 8 | 12 | 28.0 | ▆▇▇▃▁ |
# Show distribution shape for every numeric variable
num_hist <- insurance_df |>
select_if(is.numeric) |>
pivot_longer(everything(),
names_to = "variable",
values_to = "value") |>
ggplot(aes(x = value)) +
geom_histogram(bins = 30) +
facet_wrap(~ variable, scales = "free") +
labs(title = "Histograms of Numeric Features")
print(num_hist)
## Warning: Removed 970 rows containing non-finite outside the scale range
## (`stat_bin()`).
We will examine how crash_flag varies by each categorical variable using boxplots.
# Extract categorical variables
cat_df <- insurance_df |> select_if(function(x) is.character(x) | is.factor(x))
# Only create plot if there are categorical variables
if (ncol(cat_df) > 0) {
# Show how crash rates vary across categories
cat_plot <- insurance_df |>
pivot_longer(cols = names(cat_df),
names_to = "variable",
values_to = "category") |>
ggplot(aes(x = category, y = crash_flag)) +
geom_boxplot() +
facet_wrap(~ variable, scales = "free") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(title = "Crash Flag Distribution by Categorical Variables",
y = "Crash Flag")
print(cat_plot)
}
We will compute correlations between numeric variables and visualize them to understand relationships and identify strong predictors of crash_flag.
# Compute correlation matrix
corr <- insurance_df |>
select_if(is.numeric) |>
cor(use = "complete.obs")
# Compute correlations with target
cor_with_target <- corr["crash_flag", ] |>
enframe(name = "Feature", value = "Correlation") |>
filter(Feature != "crash_flag") |>
arrange(desc(Correlation))
# Reorder the correlation matrix by target correlation
feature_order <- cor_with_target$Feature
corr_ordered <- corr[c("crash_flag", feature_order), c("crash_flag", feature_order)]
# Correlation heat plot
ggcorrplot(corr_ordered,
method = "circle",
type = "lower",
lab = TRUE,
lab_size = 3,
colors = c("salmon", "white", "steelblue"),
title = "Correlation Matrix",
ggtheme = theme_minimal() +
theme(
axis.text.x = element_text(angle = 45, hjust = 1, size = 8),
axis.text.y = element_text(size = 8),
panel.spacing = unit(0.1, "lines")
)) +
coord_fixed(ratio = .8)
We will clean the dataset by converting currency columns to numeric, filling missing values for home_value, income, years_on_job, job_category, and vehicle_age.
# Convert currency characters for income and home_value
clean_currency <- function(x) {
x |>
str_replace_all("[$,]", "") |> # remove $ and commas
as.numeric()
}
# Convert currency fields to numeric
insurance_df <- insurance_df |>
mutate(
home_value = clean_currency(home_value),
income = clean_currency(income),
vehicle_value = clean_currency(vehicle_value),
past_claims_total = clean_currency(past_claims_total)
)
# Assuming missing home values mean the person does not own a home, all nulls should be 0
insurance_df <- insurance_df |>
mutate(home_value = ifelse(is.na(home_value), 0, home_value))
# Assume missing job history means no stable job history reported, mark 0 years on job
insurance_df <- insurance_df |>
mutate(years_on_job = ifelse(is.na(years_on_job), 0, years_on_job))
# Missing income can come from not knowing the specifics, so fill with median value
median_income <- median(insurance_df$income, na.rm = TRUE)
insurance_df <- insurance_df |>
mutate(income = ifelse(is.na(income), median_income, income))
# When job_category is missing, fill with a new category Unknown
insurance_df <- insurance_df |>
mutate(job_category = ifelse(is.na(job_category) | job_category == "",
"Unknown",
job_category))
# Missing vehicle age can come from not knowing the specifics, so fill with median value
median_vehicle_age <- median(insurance_df$vehicle_age, na.rm = TRUE)
insurance_df <- insurance_df |>
mutate(vehicle_age = ifelse(is.na(vehicle_age),
median_vehicle_age,
vehicle_age))
# Preview new data
head(insurance_df)
## # A tibble: 6 × 26
## id crash_flag crash_cost teen_drivers driver_age children_at_home
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 0 0 0 60 0
## 2 2 0 0 0 43 0
## 3 4 0 0 0 35 1
## 4 5 0 0 0 51 0
## 5 6 0 0 0 50 0
## 6 7 1 2946 0 34 1
## # ℹ 20 more variables: years_on_job <dbl>, income <dbl>,
## # single_parent_flag <chr>, home_value <dbl>, marital_status <chr>,
## # driver_gender <chr>, education_level <chr>, job_category <chr>,
## # commute_time <dbl>, vehicle_use <chr>, vehicle_value <dbl>, tenure <dbl>,
## # vehicle_type <chr>, red_car_flag <chr>, past_claims_total <dbl>,
## # past_claims_count <dbl>, license_revoked_flag <chr>, mvr_points <dbl>,
## # vehicle_age <dbl>, urbanicity <chr>
We will convert categorical variables to numeric. Education levels are mapped to numbers, and boolean variables like Yes/No or M/F are converted to 1/0. We also create a blue-collar job flag.
insurance_df <- insurance_df |>
# Convert education_level to numeric
mutate(education_level_num = case_when(
education_level == "<High School" ~ 0,
education_level == "z_High School" ~ 1,
education_level == "Bachelors" ~ 2,
education_level == "Masters" ~ 3,
education_level == "PhD" ~ 4,
TRUE ~ NA_real_ # in case of unexpected/missing values
)) |>
# Create a new higher_education flag
mutate(higher_education = if_else(education_level_num >= 2, 1, 0))
# Booleans to numeric
insurance_df <- insurance_df |>
mutate(
single_parent_flag = if_else(single_parent_flag == "Yes", 1, 0),
marital_status = if_else(marital_status == "Yes", 1, 0),
driver_gender = if_else(driver_gender == "M", 1, 0),
vehicle_use = if_else(vehicle_use == "Commercial", 1, 0),
red_car_flag = if_else(red_car_flag == "yes", 1, 0),
license_revoked_flag = if_else(license_revoked_flag == "Yes", 1, 0),
urbanicity = if_else(urbanicity == "Highly Urban/ Urban", 1, 0)
)
# New feature from job_category for blue collar or not
insurance_df <- insurance_df |>
mutate(job_blue_collar = if_else(job_category == "z_Blue Collar", 1, 0))
Model 1: Multiple linear regression with selected variables to predict crash_cost.
Model 2: Stepwise linear regression on non-zero crash_cost to find important predictors.
Model 3: Logistic regression predicting crash_flag using high-risk indicators.
Model 4: Logistic regression with interactions to capture combined effects.
Model 5: Stepwise logistic regression to select important predictors automatically.
train_index <- createDataPartition(insurance_df$crash_flag, p = 0.8, list = FALSE)
train_df <- insurance_df[train_index, ]
test_df <- insurance_df[-train_index, ]
vehicle_value → more expensive cars = higher payout
vehicle_age → older cars might cost less to repair
past_claims_total → history predicts future payouts
driver_age, mvr_points, teen_drivers → driver risk factors
# Run multiple linear regression model
mlr_model1 <- lm(crash_cost ~ vehicle_value + vehicle_age + past_claims_total +
driver_age + mvr_points + teen_drivers,
data = train_df)
summary(mlr_model1)
##
## Call:
## lm(formula = crash_cost ~ vehicle_value + vehicle_age + past_claims_total +
## driver_age + mvr_points + teen_drivers, data = train_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5434 -1516 -955 -346 104445
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.763e+03 3.224e+02 5.471 4.65e-08 ***
## vehicle_value 1.255e-02 7.046e-03 1.782 0.074873 .
## vehicle_age -3.897e+01 1.067e+01 -3.652 0.000263 ***
## past_claims_total 1.962e-02 6.786e-03 2.891 0.003854 **
## driver_age -1.790e+01 6.858e+00 -2.610 0.009084 **
## mvr_points 3.056e+02 2.761e+01 11.070 < 2e-16 ***
## teen_drivers 4.016e+02 1.135e+02 3.539 0.000405 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4642 on 6517 degrees of freedom
## (5 observations deleted due to missingness)
## Multiple R-squared: 0.0312, Adjusted R-squared: 0.03031
## F-statistic: 34.98 on 6 and 6517 DF, p-value: < 2.2e-16
# Remove rows with NA in predictors for stepAIC
nonzero_df <- train_df |>
filter(crash_cost > 0) |>
drop_na(vehicle_value, vehicle_age, past_claims_total, past_claims_count,
driver_age, mvr_points, teen_drivers, income, home_value)
fast_model <- lm(
crash_cost ~ vehicle_value + vehicle_age + past_claims_total + past_claims_count +
driver_age + mvr_points + teen_drivers + income + home_value,
data = nonzero_df
)
mlr_model2_nonzero <- stepAIC(fast_model, direction = "both", trace = FALSE)
summary(mlr_model2_nonzero)
##
## Call:
## lm(formula = crash_cost ~ vehicle_value + mvr_points, data = nonzero_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8653 -3097 -1499 412 100481
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.472e+03 4.108e+02 8.453 < 2e-16 ***
## vehicle_value 1.176e-01 2.216e-02 5.306 1.26e-07 ***
## mvr_points 2.149e+02 7.147e+01 3.007 0.00268 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7708 on 1715 degrees of freedom
## Multiple R-squared: 0.02084, Adjusted R-squared: 0.0197
## F-statistic: 18.25 on 2 and 1715 DF, p-value: 1.433e-08
Include high-risk indicators (tickets, teen drivers, revoked license, commercial vehicle)
Positive coefficients → higher crash risk
# Run glm binary logistic regression
logit_model1 <- glm(crash_flag ~ driver_age + mvr_points + teen_drivers +
past_claims_count + vehicle_use + license_revoked_flag,
data = train_df,
family = binomial)
summary(logit_model1)
##
## Call:
## glm(formula = crash_flag ~ driver_age + mvr_points + teen_drivers +
## past_claims_count + vehicle_use + license_revoked_flag, family = binomial,
## data = train_df)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.84819 0.16357 -5.186 2.15e-07 ***
## driver_age -0.02584 0.00350 -7.384 1.54e-13 ***
## mvr_points 0.14427 0.01396 10.335 < 2e-16 ***
## teen_drivers 0.33344 0.05350 6.232 4.60e-10 ***
## past_claims_count 0.27993 0.02614 10.710 < 2e-16 ***
## vehicle_use 0.58715 0.06047 9.710 < 2e-16 ***
## license_revoked_flag 0.91971 0.08274 11.115 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 7522.4 on 6523 degrees of freedom
## Residual deviance: 6761.2 on 6517 degrees of freedom
## (5 observations deleted due to missingness)
## AIC: 6775.2
##
## Number of Fisher Scoring iterations: 4
set.seed(64)
# Scale numeric columns used in interaction
train_df <- train_df |>
mutate(
driver_age_s = scale(driver_age),
mvr_points_s = scale(mvr_points),
vehicle_value_s = scale(vehicle_value),
vehicle_age_s = scale(vehicle_age),
past_claims_count_s = scale(past_claims_count)
)
# Sample 50% of data to speed up
fast_df <- train_df |> sample_frac(0.5)
# Run glm with scaled variables
logit_model2_fast <- glm(
crash_flag ~ driver_age_s + mvr_points_s + vehicle_value_s*vehicle_age_s +
past_claims_count_s + red_car_flag,
data = fast_df,
family = binomial
)
summary(logit_model2_fast)
##
## Call:
## glm(formula = crash_flag ~ driver_age_s + mvr_points_s + vehicle_value_s *
## vehicle_age_s + past_claims_count_s + red_car_flag, family = binomial,
## data = fast_df)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.17217 0.05152 -22.752 < 2e-16 ***
## driver_age_s -0.17442 0.04365 -3.996 6.44e-05 ***
## mvr_points_s 0.30064 0.04245 7.083 1.41e-12 ***
## vehicle_value_s -0.21846 0.04602 -4.747 2.06e-06 ***
## vehicle_age_s -0.15313 0.04387 -3.491 0.000482 ***
## past_claims_count_s 0.34592 0.04262 8.116 4.84e-16 ***
## red_car_flag -0.01588 0.09387 -0.169 0.865707
## vehicle_value_s:vehicle_age_s 0.01742 0.04521 0.385 0.699977
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3691.0 on 3260 degrees of freedom
## Residual deviance: 3418.6 on 3253 degrees of freedom
## (3 observations deleted due to missingness)
## AIC: 3434.6
##
## Number of Fisher Scoring iterations: 4
set.seed(64)
# Scale numeric columns to stabilize glm convergence
train_df <- train_df |>
mutate(
driver_age_s = scale(driver_age),
mvr_points_s = scale(mvr_points),
vehicle_value_s = scale(vehicle_value),
vehicle_age_s = scale(vehicle_age),
past_claims_total_s = scale(past_claims_total),
past_claims_count_s = scale(past_claims_count),
years_on_job_s = scale(years_on_job),
children_at_home_s = scale(children_at_home),
income_s = scale(income),
home_value_s = scale(home_value),
education_level_s = scale(education_level_num),
teen_drivers_s = scale(teen_drivers)
)
# Take 50% random sample to reduce runtime
fast_df <- train_df |> sample_frac(0.5)
# Select subset of predictors for faster stepwise selection
fast_logit <- glm(
crash_flag ~ driver_age_s + mvr_points_s + vehicle_value_s + vehicle_age_s +
past_claims_total_s + past_claims_count_s + teen_drivers_s +
license_revoked_flag + vehicle_use + red_car_flag,
data = fast_df,
family = binomial
)
# Stepwise selection
logit_model3_fast <- stepAIC(fast_logit, direction = "both", trace = FALSE)
summary(logit_model3_fast)
##
## Call:
## glm(formula = crash_flag ~ driver_age_s + mvr_points_s + vehicle_value_s +
## vehicle_age_s + past_claims_total_s + past_claims_count_s +
## teen_drivers_s + license_revoked_flag + vehicle_use + red_car_flag,
## family = binomial, data = fast_df)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.57533 0.06653 -23.677 < 2e-16 ***
## driver_age_s -0.15726 0.04515 -3.483 0.000495 ***
## mvr_points_s 0.29143 0.04391 6.637 3.19e-11 ***
## vehicle_value_s -0.30160 0.04823 -6.253 4.02e-10 ***
## vehicle_age_s -0.10766 0.04537 -2.373 0.017651 *
## past_claims_total_s -0.20102 0.05253 -3.826 0.000130 ***
## past_claims_count_s 0.42577 0.04919 8.655 < 2e-16 ***
## teen_drivers_s 0.12783 0.03988 3.206 0.001347 **
## license_revoked_flag 1.16049 0.12961 8.954 < 2e-16 ***
## vehicle_use 0.67310 0.09257 7.271 3.56e-13 ***
## red_car_flag -0.16526 0.09893 -1.671 0.094811 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3691.0 on 3260 degrees of freedom
## Residual deviance: 3273.7 on 3250 degrees of freedom
## (3 observations deleted due to missingness)
## AIC: 3295.7
##
## Number of Fisher Scoring iterations: 4
This assignment uses the given insurance dataset to predict two outcomes: crash_cost (continuous) and crash_flag (binary). The workflow included data cleaning, handling missing values differently per feature, transforming categorical features into numeric or flags, and generating new variables like higher_education and job_blue_collar. Multiple linear regression models were built to predict crash_cost, and binary logistic regression models were built to predict crash_flag.
Primary metric: Adjusted R² (accounts for number of predictors) and RMSE (prediction accuracy).
Consider multi-collinearity using Variance Inflation Factor (VIF); drop or combine variables with VIF > 5–10.
Prefer models that are parsimonious (fewer predictors) if predictive performance is only slightly reduced, as simpler models are easier to interpret and less likely to overfit.
test_df <- test_df |>
mutate(
mlr1_pred = predict(mlr_model1, newdata = test_df),
mlr2_pred = predict(mlr_model2_nonzero, newdata = test_df)
)
mlr_eval <- function(actual, predicted) {
mse <- mean((actual - predicted)^2, na.rm = TRUE)
rmse <- sqrt(mse)
r2 <- 1 - sum((actual - predicted)^2, na.rm = TRUE)/sum((actual - mean(actual, na.rm = TRUE))^2, na.rm = TRUE)
return(list(MSE = mse, RMSE = rmse, R2 = r2))
}
mlr1_metrics <- mlr_eval(test_df$crash_cost, test_df$mlr1_pred)
mlr2_metrics <- mlr_eval(test_df$crash_cost, test_df$mlr2_pred)
mlr1_metrics
## $MSE
## [1] 21752375
##
## $RMSE
## [1] 4663.944
##
## $R2
## [1] 0.001939322
mlr2_metrics
## $MSE
## [1] 40456407
##
## $RMSE
## [1] 6360.535
##
## $R2
## [1] -0.8573929
Metrics: AIC for model fit, ROC-AUC for discrimination, and classification metrics (accuracy, sensitivity, specificity, F1 score).
Consider model interpretability; choose models with meaningful coefficients and avoid overly complex interactions if simpler models give similar ROC-AUC.
test_df <- test_df |>
mutate(
driver_age_s = scale(driver_age),
mvr_points_s = scale(mvr_points),
vehicle_value_s = scale(vehicle_value),
vehicle_age_s = scale(vehicle_age),
past_claims_count_s = scale(past_claims_count),
past_claims_total_s = scale(past_claims_total),
teen_drivers_s = scale(teen_drivers)
)
test_df <- test_df |>
mutate(
logit1_prob = predict(logit_model1, newdata = test_df, type = "response"),
logit2_prob = predict(logit_model2_fast, newdata = test_df, type = "response"),
logit3_prob = predict(logit_model3_fast, newdata = test_df, type = "response"),
logit1_pred = if_else(logit1_prob >= 0.5, 1, 0),
logit2_pred = if_else(logit2_prob >= 0.5, 1, 0),
logit3_pred = if_else(logit3_prob >= 0.5, 1, 0)
)
logit_eval <- function(actual, predicted, prob) {
# Ensure numeric 0/1
actual_num <- as.numeric(as.character(actual))
predicted_num <- as.numeric(as.character(predicted))
# Confusion matrix
confusion <- confusionMatrix(factor(predicted_num), factor(actual_num), positive="1")
# Use the prob argument passed in, strip names if present
prob_num <- as.numeric(prob)
names(prob_num) <- NULL
# ROC object
roc_obj <- roc(response = actual_num, predictor = prob_num)
# Metrics
metrics <- list(
Accuracy = confusion$overall['Accuracy'],
ErrorRate = 1 - confusion$overall['Accuracy'],
Precision = confusion$byClass['Precision'],
Sensitivity = confusion$byClass['Sensitivity'],
Specificity = confusion$byClass['Specificity'],
F1 = confusion$byClass['F1'],
AUC = pROC::auc(roc_obj),
ConfusionMatrix = confusion$table
)
return(metrics)
}
logit1_metrics <- logit_eval(test_df$crash_flag, test_df$logit1_pred, test_df$logit1_prob)
logit2_metrics <- logit_eval(test_df$crash_flag, test_df$logit2_pred, test_df$logit2_prob)
logit3_metrics <- logit_eval(test_df$crash_flag, test_df$logit3_pred, test_df$logit3_prob)
In this project, we explored an insurance dataset to predict both crash costs and crash likelihood using multiple linear regression and binary logistic regression models. We built several models, carefully selecting predictors based on domain knowledge and statistical considerations, and applied scaling and sampling strategies to improve runtime for larger models.
For multiple linear regression, Model 1 (manually selected predictors) and Model 2 (stepwise selection on key numeric features) both performed reasonably well, with Model 2 slightly more flexible but Model 1 being more interpretable. We evaluated models using MSE, RMSE, R², and residual plots, confirming that coefficients generally made sense and there were no major multicollinearity concerns among predictors.
For binary logistic regression, we compared three approaches: manually selected predictors, interactions with scaled numeric variables, and stepwise selection on a random 50% sample. Evaluation metrics including accuracy, sensitivity, specificity, F1 score, AUC, and confusion matrices were used to assess performance. The models showed consistent patterns where higher-risk indicators (teen drivers, revoked license, more tickets) were associated with higher crash probability.
Overall, we prioritized models that were interpretable and statistically sound, even if they had slightly lower performance, since understanding the effect of risk factors is critical for insurance decision-making. The final selected models provide a balance of predictive power, simplicity, and interpretability, making them suitable for practical use in risk assessment.