The insurance training dataset consists of 8,161 customer records and includes variables describing demographics, vehicle characteristics, driving history, and prior claims. There are two target variables: TARGET_FLAG, which indicates whether the customer experienced a car crash, and TARGET_AMT, which is the monetary loss amount conditional on a crash occurring. TARGET_FLAG is a binary variable, while TARGET_AMT is continuous, highly right-skewed, and zero for all non-crash cases.
A structural review of the dataset shows 26 variables mixing numeric and categorical predictors, including AGE, BLUEBOOK, CAR_TYPE, EDUCATION, INCOME, MVR_PTS, OLDCLAIM, and URBANICITY. Several monetary fields (INCOME, HOME_VAL, BLUEBOOK, OLDCLAIM) are stored as dollar-formatted character strings and must be coerced to numeric before modeling. Summary statistics indicate substantial variability in key predictors: AGE ranges from 16 to 81, BLUEBOOK spans a wide range of vehicle values, and MVR_PTS and CLM_FREQ show the long-tail distributions typical of driver violation histories.
A missing-value analysis reveals that several fields contain missing values, particularly YOJ (454), CAR_AGE (510), and a handful of AGE records. Because INCOME and HOME_VAL are stored as character strings, their missing entries appear as blanks rather than NA and are undercounted by a naive NA scan (see the check after the colSums output below). CAR_AGE also contains an impossible negative minimum of -3. These issues require thoughtful imputation strategies, and missingness itself may be informative for modeling crash probability or claim severity.
Exploring the target variables, approximately 26.4% of customers in the training dataset experienced a crash (TARGET_FLAG = 1), confirming a realistic but imbalanced distribution. Among customers who crashed, TARGET_AMT exhibits pronounced right skew: most claims are relatively small, but a long tail of high-cost incidents reaches $107,586. This motivates transformations or specialized techniques in later modeling steps.
Boxplots comparing numeric variables against TARGET_FLAG show distinct behavioral patterns. For example, customers with higher MVR_PTS and CLM_FREQ tend to have higher crash rates, which aligns with theoretical expectations. Conversely, higher-income customers tend to have fewer crashes, while categorical variables such as CAR_TYPE and JOB exhibit meaningful differences across crash outcomes. Note that the dollar-formatted fields (INCOME, HOME_VAL, BLUEBOOK, OLDCLAIM) are still strings at this stage and therefore do not appear among the numeric boxplot panels.
A correlation analysis of numeric features highlights several notable relationships. BLUEBOOK and HOME_VAL are moderately correlated, reflecting general wealth effects. Claim-related variables (CLM_FREQ, OLDCLAIM) are positively associated with TARGET_AMT, consistent with their theoretical role in severity modeling. Overall, the data appear rich, interpretable, and suitable for both logistic regression (for crash classification) and linear regression (for claim amount prediction), pending appropriate preparation and transformation.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.2 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(corrplot)
## corrplot 0.95 loaded
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
##
## The following object is masked from 'package:purrr':
##
## lift
library(pROC)
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
##
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
library(ggplot2)
# -----------------------------------------
# LOAD DATA
# -----------------------------------------
train <- read.csv("insurance_training_data.csv")
eval <- read.csv("insurance-evaluation-data.csv")
# -----------------------------------------
# BASIC STRUCTURE AND SUMMARY
# -----------------------------------------
str(train)
## 'data.frame': 8161 obs. of 26 variables:
## $ INDEX : int 1 2 4 5 6 7 8 11 12 13 ...
## $ TARGET_FLAG: int 0 0 0 0 0 1 0 1 1 0 ...
## $ TARGET_AMT : num 0 0 0 0 0 ...
## $ KIDSDRIV : int 0 0 0 0 0 0 0 1 0 0 ...
## $ AGE : int 60 43 35 51 50 34 54 37 34 50 ...
## $ HOMEKIDS : int 0 0 1 0 0 1 0 2 0 0 ...
## $ YOJ : int 11 11 10 14 NA 12 NA NA 10 7 ...
## $ INCOME : chr "$67,349" "$91,449" "$16,039" "" ...
## $ PARENT1 : chr "No" "No" "No" "No" ...
## $ HOME_VAL : chr "$0" "$257,252" "$124,191" "$306,251" ...
## $ MSTATUS : chr "z_No" "z_No" "Yes" "Yes" ...
## $ SEX : chr "M" "M" "z_F" "M" ...
## $ EDUCATION : chr "PhD" "z_High School" "z_High School" "<High School" ...
## $ JOB : chr "Professional" "z_Blue Collar" "Clerical" "z_Blue Collar" ...
## $ TRAVTIME : int 14 22 5 32 36 46 33 44 34 48 ...
## $ CAR_USE : chr "Private" "Commercial" "Private" "Private" ...
## $ BLUEBOOK : chr "$14,230" "$14,940" "$4,010" "$15,440" ...
## $ TIF : int 11 1 4 7 1 1 1 1 1 7 ...
## $ CAR_TYPE : chr "Minivan" "Minivan" "z_SUV" "Minivan" ...
## $ RED_CAR : chr "yes" "yes" "no" "yes" ...
## $ OLDCLAIM : chr "$4,461" "$0" "$38,690" "$0" ...
## $ CLM_FREQ : int 2 0 2 0 2 0 0 1 0 0 ...
## $ REVOKED : chr "No" "No" "No" "No" ...
## $ MVR_PTS : int 3 0 3 0 3 0 0 10 0 1 ...
## $ CAR_AGE : int 18 1 10 6 17 7 1 7 1 17 ...
## $ URBANICITY : chr "Highly Urban/ Urban" "Highly Urban/ Urban" "Highly Urban/ Urban" "Highly Urban/ Urban" ...
summary(train)
## INDEX TARGET_FLAG TARGET_AMT KIDSDRIV
## Min. : 1 Min. :0.0000 Min. : 0 Min. :0.0000
## 1st Qu.: 2559 1st Qu.:0.0000 1st Qu.: 0 1st Qu.:0.0000
## Median : 5133 Median :0.0000 Median : 0 Median :0.0000
## Mean : 5152 Mean :0.2638 Mean : 1504 Mean :0.1711
## 3rd Qu.: 7745 3rd Qu.:1.0000 3rd Qu.: 1036 3rd Qu.:0.0000
## Max. :10302 Max. :1.0000 Max. :107586 Max. :4.0000
##
## AGE HOMEKIDS YOJ INCOME
## Min. :16.00 Min. :0.0000 Min. : 0.0 Length:8161
## 1st Qu.:39.00 1st Qu.:0.0000 1st Qu.: 9.0 Class :character
## Median :45.00 Median :0.0000 Median :11.0 Mode :character
## Mean :44.79 Mean :0.7212 Mean :10.5
## 3rd Qu.:51.00 3rd Qu.:1.0000 3rd Qu.:13.0
## Max. :81.00 Max. :5.0000 Max. :23.0
## NA's :6 NA's :454
## PARENT1 HOME_VAL MSTATUS SEX
## Length:8161 Length:8161 Length:8161 Length:8161
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## EDUCATION JOB TRAVTIME CAR_USE
## Length:8161 Length:8161 Min. : 5.00 Length:8161
## Class :character Class :character 1st Qu.: 22.00 Class :character
## Mode :character Mode :character Median : 33.00 Mode :character
## Mean : 33.49
## 3rd Qu.: 44.00
## Max. :142.00
##
## BLUEBOOK TIF CAR_TYPE RED_CAR
## Length:8161 Min. : 1.000 Length:8161 Length:8161
## Class :character 1st Qu.: 1.000 Class :character Class :character
## Mode :character Median : 4.000 Mode :character Mode :character
## Mean : 5.351
## 3rd Qu.: 7.000
## Max. :25.000
##
## OLDCLAIM CLM_FREQ REVOKED MVR_PTS
## Length:8161 Min. :0.0000 Length:8161 Min. : 0.000
## Class :character 1st Qu.:0.0000 Class :character 1st Qu.: 0.000
## Mode :character Median :0.0000 Mode :character Median : 1.000
## Mean :0.7986 Mean : 1.696
## 3rd Qu.:2.0000 3rd Qu.: 3.000
## Max. :5.0000 Max. :13.000
##
## CAR_AGE URBANICITY
## Min. :-3.000 Length:8161
## 1st Qu.: 1.000 Class :character
## Median : 8.000 Mode :character
## Mean : 8.328
## 3rd Qu.:12.000
## Max. :28.000
## NA's :510
# -----------------------------------------
# CHECK MISSING VALUES
# -----------------------------------------
colSums(is.na(train))
## INDEX TARGET_FLAG TARGET_AMT KIDSDRIV AGE HOMEKIDS
## 0 0 0 0 6 0
## YOJ INCOME PARENT1 HOME_VAL MSTATUS SEX
## 454 0 0 0 0 0
## EDUCATION JOB TRAVTIME CAR_USE BLUEBOOK TIF
## 0 0 0 0 0 0
## CAR_TYPE RED_CAR OLDCLAIM CLM_FREQ REVOKED MVR_PTS
## 0 0 0 0 0 0
## CAR_AGE URBANICITY
## 510 0
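# NOTE (sketch): colSums(is.na(...)) understates missingness for the
# dollar-formatted character columns, whose blanks are empty strings
# rather than NA. A quick count that includes blanks:
sapply(train[c("INCOME", "HOME_VAL")],
       function(x) sum(is.na(x) | trimws(as.character(x)) == ""))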
# -----------------------------------------
# TARGET VARIABLE EXPLORATION
# -----------------------------------------
# 1. TARGET_FLAG distribution (binary)
table(train$TARGET_FLAG)
##
## 0 1
## 6008 2153
prop.table(table(train$TARGET_FLAG))
##
## 0 1
## 0.7361843 0.2638157
# 2. TARGET_AMT distribution (heavily right-skewed expected)
summary(train$TARGET_AMT)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 0 0 1504 1036 107586
# Histogram of TARGET_AMT for those who crashed
train %>%
filter(TARGET_AMT > 0) %>%
ggplot(aes(TARGET_AMT)) +
geom_histogram(bins = 50, fill = "steelblue") +
scale_x_continuous(labels = scales::comma) +
labs(title = "Distribution of Claim Amounts (TARGET_AMT)")
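# Quick numeric check (sketch) of the right skew noted above: the moment
# skewness of positive claim amounts should be well above 0
amt <- train$TARGET_AMT[train$TARGET_AMT > 0]
mean((amt - mean(amt))^3) / sd(amt)^3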
# -----------------------------------------
# BOXPLOTS + CORRELATIONS
# -----------------------------------------
# -----------------------------------------
# CONVERT CATEGORICAL VARIABLES TO FACTORS
# -----------------------------------------
cat_vars <- c("CAR_TYPE", "CAR_USE", "EDUCATION", "JOB", "MSTATUS",
"PARENT1", "RED_CAR", "REVOKED", "SEX", "URBANICITY")
# Only convert those that actually exist in the data
cat_vars <- intersect(cat_vars, names(train))
train[cat_vars] <- lapply(train[cat_vars], factor)
# -----------------------------------------
# BOX PLOTS OF NUMERIC VARIABLES BY TARGET_FLAG
# -----------------------------------------
# Select numeric predictors, EXCLUDING id and target variables
numeric_vars <- train %>%
select(where(is.numeric)) %>%
select(-INDEX, -TARGET_FLAG, -TARGET_AMT)
numeric_names <- colnames(numeric_vars)
# Pivot longer so each numeric variable becomes a row entry, but KEEP TARGET_FLAG
train_long <- train %>%
pivot_longer(
cols = all_of(numeric_names),
names_to = "variable",
values_to = "value"
)
ggplot(train_long, aes(x = factor(TARGET_FLAG), y = value)) +
geom_boxplot() +
facet_wrap(~variable, scales = "free", ncol = 4) +
labs(
x = "TARGET_FLAG (1 = Crash, 0 = No Crash)",
y = "Value",
title = "Numeric Predictors by Crash Indicator"
)
## Warning: Removed 970 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
# -----------------------------------------
# CORRELATION MATRIX (NUMERIC VARIABLES)
# -----------------------------------------
corr_matrix <- cor(numeric_vars, use = "pairwise.complete.obs")
corrplot(corr_matrix, method = "color", tl.cex = 0.8)
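# Sketch: quantify the association of claim-history variables with
# TARGET_AMT; OLDCLAIM is still a dollar-formatted string at this point,
# so coerce it to numeric first
oldclaim_num <- as.numeric(gsub("\\$|,", "", train$OLDCLAIM))
cor(cbind(CLM_FREQ = train$CLM_FREQ, OLDCLAIM = oldclaim_num),
    train$TARGET_AMT, use = "pairwise.complete.obs")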
Several steps were required to prepare the insurance data for modeling both crash probability (TARGET_FLAG) and claim severity (TARGET_AMT). The raw dataset contained a mix of numeric, monetary, and categorical variables, as well as missing values and skewed distributions, all of which can adversely affect regression models if left unaddressed.
First, I focused on variables with known or likely missing values: INCOME, HOME_VAL, BLUEBOOK, YOJ, and CAR_AGE. These variables were sometimes stored with formatting (such as dollar signs or commas) or as character strings, so I first stripped out non-numeric characters and safely coerced them to numeric. For each of these variables, I created a corresponding missingness indicator (e.g., INCOME_MISSING) to capture whether a value was originally missing. Missing numeric values were then imputed using the median from the training data, and the same imputation value was applied to the evaluation dataset to avoid data leakage.
To ensure consistency, I also explicitly converted other key quantitative variables—such as OLDCLAIM, TRAVTIME, MVR_PTS, CLM_FREQ, and AGE—to numeric types. This step guarantees that all downstream transformations and models treat these fields as continuous predictors rather than strings.
Next, I addressed skewness in several financial and exposure-related variables. Specifically, I applied logarithmic transformations to INCOME, HOME_VAL, BLUEBOOK, OLDCLAIM, and TRAVTIME, creating new features such as LOG_INCOME and LOG_OLDCLAIM. These log transforms help stabilize variance, reduce the impact of extreme outliers, and improve the approximate linear relationship between predictors and both the log-odds of a crash and the claim amount.
To better capture nonlinear relationships and practical thresholds, I bucketized several variables into ordered categories. AGE was grouped into demographic bands (e.g., <=25, 26–35, 36–50, 51–70, >70), CAR_AGE was converted into age tiers (e.g., 0–5, 6–10, 11–20, >20 plus an “Unknown0” bucket for nonpositive values), and MVR_PTS was converted into risk bands (e.g., 0, 1–3, 4–6, 7–10, >10). These bins reflect intuitive risk segments and can capture nonlinear effects that a purely linear term in the original variable might miss.
I also engineered interaction terms to reflect compounding risk effects. The variable RISK_INTERACT was defined as the product of MVR_PTS and CLM_FREQ, capturing drivers who not only have many points but also a history of multiple claims. Similarly, AGE_RISK was defined as the product of AGE and MVR_PTS, representing the interaction between age and driving record. These interactions allow the models to assign additional risk to combinations that are particularly dangerous rather than treating each variable’s effect as purely additive.
Finally, all character variables were converted to factors so that categorical information such as CAR_TYPE, EDUCATION, JOB, MSTATUS, RED_CAR, REVOKED, SEX, URBANICITY, and others are handled correctly by the regression models. The result of these steps is a cleaned, enriched dataset (train2 for training and eval2 for evaluation) that incorporates missingness indicators, log-transformed financial variables, bucketized risk groups, and interaction terms, providing a robust foundation for both the multiple linear regression and binary logistic regression models considered in the next section.
# Start from original data
train2 <- train
eval2 <- eval
# --------------------------------
# 1. HANDLE MISSING VALUES + FLAGS
# --------------------------------
# Variables likely to have missing values
missing_vars <- c("INCOME", "HOME_VAL", "BLUEBOOK", "YOJ", "CAR_AGE")
for (v in intersect(missing_vars, names(train2))) {
# Coerce to numeric safely (handles $, commas, or character types)
train2[[v]] <- as.numeric(gsub("\\$|,", "", as.character(train2[[v]])))
eval2[[v]] <- as.numeric(gsub("\\$|,", "", as.character(eval2[[v]])))
# Missingness flags
flag_name <- paste0(v, "_MISSING")
train2[[flag_name]] <- ifelse(is.na(train2[[v]]), 1L, 0L)
eval2[[flag_name]] <- ifelse(is.na(eval2[[v]]), 1L, 0L)
# Median imputation (based on training data only)
med <- median(train2[[v]], na.rm = TRUE)
train2[[v]][is.na(train2[[v]])] <- med
eval2[[v]][is.na(eval2[[v]])] <- med
}
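# Sanity check (sketch): the imputed variables should contain no NAs now
colSums(is.na(train2[intersect(missing_vars, names(train2))]))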
# Ensure other key numeric variables are numeric as well
num_to_fix <- c("OLDCLAIM", "TRAVTIME", "MVR_PTS", "CLM_FREQ", "AGE")
for (v in intersect(num_to_fix, names(train2))) {
train2[[v]] <- as.numeric(gsub("\\$|,", "", as.character(train2[[v]])))
eval2[[v]] <- as.numeric(gsub("\\$|,", "", as.character(eval2[[v]])))
}
# --------------------------------
# 2. LOG TRANSFORMS FOR SKEWED VARS
# --------------------------------
skewed_vars <- c("INCOME", "HOME_VAL", "BLUEBOOK", "OLDCLAIM", "TRAVTIME")
for (v in intersect(skewed_vars, names(train2))) {
train2[[paste0("LOG_", v)]] <- log(train2[[v]] + 1)
eval2[[paste0("LOG_", v)]] <- log(eval2[[v]] + 1)
}
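# Before/after glance (sketch): the log transform compresses the long
# right tail of INCOME
summary(train2$INCOME)
summary(train2$LOG_INCOME)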
# --------------------------------
# 3. BUCKETIZATION (AGE, CAR_AGE, MVR_PTS)
# --------------------------------
# Helper to add bucketized versions of AGE, CAR_AGE, and MVR_PTS so the
# same cuts are applied identically to the training and evaluation sets.
# right = TRUE (right-closed intervals) so values land in the bucket their
# label names: AGE 25 -> "<=25", CAR_AGE <= 0 -> "Unknown0", MVR_PTS 0 -> "0"
add_bins <- function(df) {
  df %>%
    mutate(
      AGE_BIN = cut(
        AGE,
        breaks = c(0, 25, 35, 50, 70, Inf),
        labels = c("<=25", "26-35", "36-50", "51-70", ">70"),
        right = TRUE
      ),
      CAR_AGE_BIN = cut(
        CAR_AGE,
        breaks = c(-Inf, 0, 5, 10, 20, Inf),
        labels = c("Unknown0", "0-5", "6-10", "11-20", ">20"),
        right = TRUE
      ),
      MVR_BIN = cut(
        MVR_PTS,
        breaks = c(-Inf, 0, 3, 6, 10, Inf),
        labels = c("0", "1-3", "4-6", "7-10", ">10"),
        right = TRUE
      )
    )
}
train2 <- add_bins(train2)
eval2 <- add_bins(eval2)
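# Sanity check (sketch): how the observations fall into the new buckets;
# useNA surfaces any values the breaks failed to capture
table(train2$AGE_BIN, useNA = "ifany")
table(train2$CAR_AGE_BIN, useNA = "ifany")
table(train2$MVR_BIN, useNA = "ifany")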
# --------------------------------
# 4. INTERACTION TERMS
# --------------------------------
train2 <- train2 %>%
mutate(
RISK_INTERACT = MVR_PTS * CLM_FREQ,
AGE_RISK = AGE * MVR_PTS
)
eval2 <- eval2 %>%
mutate(
RISK_INTERACT = MVR_PTS * CLM_FREQ,
AGE_RISK = AGE * MVR_PTS
)
# --------------------------------
# 5. CONVERT CHARACTER VARIABLES TO FACTORS
# --------------------------------
train2 <- train2 %>%
mutate(across(where(is.character), factor))
eval2 <- eval2 %>%
mutate(across(where(is.character), factor))
In this section, I constructed two multiple linear regression models to predict claim severity (TARGET_AMT) and three binary logistic regression models to predict crash probability (TARGET_FLAG). Each model was built using different combinations of raw variables, engineered features, bucketized risk groups, and log-transformed monetary variables. This approach allows comparison of model performance, interpretability, and statistical behavior.
Multiple Linear Regression Models (TARGET_AMT)
Since claim amount is only positive when a crash occurs, all linear regression models were trained using only records where TARGET_FLAG = 1.
Model L1 — Baseline Linear Regression
The baseline linear regression model uses key numeric predictors and log-transformed financial variables that showed strong skewness. Predictors included LOG_BLUEBOOK, LOG_HOME_VAL, LOG_INCOME, OLDCLAIM, CLM_FREQ, MVR_PTS, AGE, CAR_AGE, and the interaction terms RISK_INTERACT and AGE_RISK. These variables were selected based on domain intuition: higher claim frequency, more driving record points, and larger prior payouts generally indicate higher expected severity.
The fitted coefficients only partially matched those expectations. LOG_BLUEBOOK was strongly positive and highly significant (more valuable vehicles generate larger claims), but CLM_FREQ, OLDCLAIM, and MVR_PTS were statistically insignificant, several with small negative estimates, and the overall R² was under 2%. This suggests that, conditional on a crash occurring, severity is driven mostly by vehicle value and is otherwise hard to predict from these variables. Because this is linear rather than logistic regression, coefficients are read directly as dollar effects instead of being exponentiated into odds ratios.
Model L2 — Stepwise Linear Regression
Using stepwise AIC, a more parsimonious model was selected from a wider pool of predictors that also included additional log transforms and the missingness indicators. The procedure retained only LOG_BLUEBOOK, CAR_AGE, and AGE_RISK, dropping everything else, including the missingness flags. Although its raw R² is marginally lower than Model L1's, its adjusted R² is higher (0.0169 vs. 0.0144), so it achieves comparable fit with far fewer terms and serves as a competitive alternative to the baseline; a direct AIC comparison appears after the model summary below.
Binary Logistic Regression Models (TARGET_FLAG)
Model C1 — Baseline Logistic Regression
The baseline model includes AGE, MVR_PTS, CLM_FREQ, CAR_AGE, and key log-transformed financial variables. These predictors were selected because they have clear theoretical connections to risk. As expected, MVR_PTS and CLM_FREQ showed strong positive coefficients: drivers with more violations or prior claims are more likely to crash. The wealth-related variables (LOG_INCOME, LOG_HOME_VAL, LOG_BLUEBOOK) all carried negative coefficients, consistent with the exploratory finding that wealthier customers crash less often, though their individual magnitudes should be read cautiously given the correlation among them (a VIF check follows the model summary below).
Model C2 — Stepwise Logistic Regression
This model uses the full engineered feature set, including bucketized categorical predictors and interactions. The stepwise algorithm selects the subset of predictors that best balances overall model fit and parsimony. The resulting model has the best statistical performance of the three, with the lowest AIC (8,523.4) and the lowest residual deviance. It also reduces noise by dropping redundant features: CAR_AGE_BIN, the raw MVR_PTS term, and all three missingness indicators were excluded.
Model C3 — Engineered Logistic Regression
This model includes all engineered and bucketized variables but excludes raw numeric versions to improve interpretability. AGE_BIN, CAR_AGE_BIN, and MVR_BIN represent intuitive risk groups, and interaction terms RISK_INTERACT and AGE_RISK capture compounding effects. This model is the most interpretable and aligns most closely with domain reasoning, which is useful when communicating results to non-technical management.
Each model serves a different purpose:
Models L1 and C1 provide interpretable baselines.
Models L2 and C2 optimize performance using statistical selection.
Model C3 provides domain-driven structure and interpretability.
############################################
# 3A. MULTIPLE LINEAR REGRESSION (TARGET_AMT)
############################################
# Use only records where a crash occurred
train_amt <- subset(train2, TARGET_FLAG == 1)
# --------------------------
# Model L1: Baseline Linear
# --------------------------
model_L1 <- lm(
TARGET_AMT ~ LOG_BLUEBOOK + LOG_INCOME + LOG_HOME_VAL +
OLDCLAIM + CLM_FREQ + MVR_PTS + AGE + CAR_AGE +
RISK_INTERACT + AGE_RISK,
data = train_amt
)
summary(model_L1)
##
## Call:
## lm(formula = TARGET_AMT ~ LOG_BLUEBOOK + LOG_INCOME + LOG_HOME_VAL +
## OLDCLAIM + CLM_FREQ + MVR_PTS + AGE + CAR_AGE + RISK_INTERACT +
## AGE_RISK, data = train_amt)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7810 -3125 -1575 311 99896
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.980e+03 2.530e+03 -3.155 0.00163 **
## LOG_BLUEBOOK 1.437e+03 2.671e+02 5.380 8.26e-08 ***
## LOG_INCOME 4.087e+01 5.048e+01 0.810 0.41831
## LOG_HOME_VAL 9.085e+00 2.901e+01 0.313 0.75417
## OLDCLAIM -2.459e-03 1.815e-02 -0.135 0.89226
## CLM_FREQ -1.570e+02 2.049e+02 -0.766 0.44363
## MVR_PTS -8.782e+01 2.903e+02 -0.303 0.76227
## AGE -3.650e-01 2.511e+01 -0.015 0.98840
## CAR_AGE -5.068e+01 3.194e+01 -1.587 0.11268
## RISK_INTERACT 4.735e+01 5.744e+01 0.824 0.40978
## AGE_RISK 3.581e+00 6.211e+00 0.577 0.56425
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7695 on 2137 degrees of freedom
## (5 observations deleted due to missingness)
## Multiple R-squared: 0.01904, Adjusted R-squared: 0.01445
## F-statistic: 4.147 on 10 and 2137 DF, p-value: 1.061e-05
# --------------------------
# Model L2: Stepwise Linear
# --------------------------
lin_full <- lm(
TARGET_AMT ~ LOG_BLUEBOOK + LOG_INCOME + LOG_HOME_VAL +
LOG_OLDCLAIM + LOG_TRAVTIME +
CLM_FREQ + MVR_PTS + AGE + CAR_AGE +
RISK_INTERACT + AGE_RISK +
INCOME_MISSING + HOME_VAL_MISSING + BLUEBOOK_MISSING,
data = train_amt
)
model_L2 <- step(lin_full, direction = "both", trace = FALSE)
summary(model_L2)
##
## Call:
## lm(formula = TARGET_AMT ~ LOG_BLUEBOOK + CAR_AGE + AGE_RISK,
## data = train_amt)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7726 -3150 -1558 289 100268
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8359.124 2374.040 -3.521 0.000439 ***
## LOG_BLUEBOOK 1503.548 254.986 5.897 4.3e-09 ***
## CAR_AGE -48.337 31.402 -1.539 0.123882
## AGE_RISK 3.028 1.432 2.115 0.034548 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7686 on 2144 degrees of freedom
## (5 observations deleted due to missingness)
## Multiple R-squared: 0.01823, Adjusted R-squared: 0.01686
## F-statistic: 13.27 on 3 and 2144 DF, p-value: 1.391e-08
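# Since Model L2 was selected by AIC, a direct comparison (sketch) makes
# the penalty-adjusted fit explicit; both fits drop the same 5 incomplete
# rows, so the values are comparable
AIC(model_L1, model_L2)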
############################################
# 3B. BINARY LOGISTIC REGRESSION (TARGET_FLAG)
############################################
# --------------------------
# Model C1: Baseline Logistic
# --------------------------
model_C1 <- glm(
TARGET_FLAG ~ AGE + MVR_PTS + CLM_FREQ + CAR_AGE +
LOG_INCOME + LOG_BLUEBOOK + LOG_HOME_VAL,
data = train2,
family = binomial
)
summary(model_C1)
##
## Call:
## glm(formula = TARGET_FLAG ~ AGE + MVR_PTS + CLM_FREQ + CAR_AGE +
## LOG_INCOME + LOG_BLUEBOOK + LOG_HOME_VAL, family = binomial,
## data = train2)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.254786 0.400874 5.625 1.86e-08 ***
## AGE -0.014262 0.003144 -4.537 5.72e-06 ***
## MVR_PTS 0.146913 0.012492 11.760 < 2e-16 ***
## CLM_FREQ 0.285888 0.023121 12.365 < 2e-16 ***
## CAR_AGE -0.029168 0.005061 -5.763 8.28e-09 ***
## LOG_INCOME -0.031956 0.008788 -3.636 0.000276 ***
## LOG_BLUEBOOK -0.238159 0.043330 -5.496 3.88e-08 ***
## LOG_HOME_VAL -0.046345 0.004736 -9.786 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 9404 on 8154 degrees of freedom
## Residual deviance: 8578 on 8147 degrees of freedom
## (6 observations deleted due to missingness)
## AIC: 8594
##
## Number of Fisher Scoring iterations: 4
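# Sketch: probe the multicollinearity mentioned in the narrative with
# variance inflation factors (assumes the 'car' package is installed;
# values well above ~5 suggest problematic collinearity)
car::vif(model_C1)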
# --------------------------
# Model C2: Stepwise Logistic (full engineered feature set)
# --------------------------
full_logit <- glm(
TARGET_FLAG ~ AGE + AGE_BIN + CAR_AGE + CAR_AGE_BIN +
MVR_PTS + MVR_BIN + CLM_FREQ +
RISK_INTERACT + AGE_RISK +
LOG_INCOME + LOG_BLUEBOOK + LOG_HOME_VAL + LOG_OLDCLAIM +
INCOME_MISSING + HOME_VAL_MISSING + BLUEBOOK_MISSING,
data = train2,
family = binomial
)
model_C2 <- step(full_logit, direction = "both", trace = FALSE)
summary(model_C2)
##
## Call:
## glm(formula = TARGET_FLAG ~ AGE + AGE_BIN + CAR_AGE + MVR_BIN +
## CLM_FREQ + RISK_INTERACT + AGE_RISK + LOG_INCOME + LOG_BLUEBOOK +
## LOG_HOME_VAL + LOG_OLDCLAIM, family = binomial, data = train2)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.9690318 0.4776601 6.216 5.11e-10 ***
## AGE -0.0257752 0.0071179 -3.621 0.000293 ***
## AGE_BIN26-35 -0.3069170 0.2691402 -1.140 0.254136
## AGE_BIN36-50 -0.4546206 0.2907212 -1.564 0.117872
## AGE_BIN51-70 -0.3167348 0.3412965 -0.928 0.353390
## AGE_BIN>70 -0.4930060 0.7901483 -0.624 0.532666
## CAR_AGE -0.0290115 0.0050873 -5.703 1.18e-08 ***
## MVR_BIN4-6 -0.1804327 0.1124321 -1.605 0.108534
## MVR_BIN7-10 -0.0775844 0.2063044 -0.376 0.706867
## MVR_BIN>10 1.0350743 0.6146641 1.684 0.092188 .
## CLM_FREQ 0.2216859 0.0500640 4.428 9.51e-06 ***
## RISK_INTERACT -0.0507895 0.0108465 -4.683 2.83e-06 ***
## AGE_RISK 0.0048961 0.0006934 7.061 1.65e-12 ***
## LOG_INCOME -0.0318659 0.0088322 -3.608 0.000309 ***
## LOG_BLUEBOOK -0.2326327 0.0436041 -5.335 9.55e-08 ***
## LOG_HOME_VAL -0.0455676 0.0047661 -9.561 < 2e-16 ***
## LOG_OLDCLAIM 0.0610190 0.0117170 5.208 1.91e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 9404.0 on 8154 degrees of freedom
## Residual deviance: 8489.4 on 8138 degrees of freedom
## (6 observations deleted due to missingness)
## AIC: 8523.4
##
## Number of Fisher Scoring iterations: 4
# --------------------------
# Model C3: Engineered Logistic (interpretable buckets)
# --------------------------
model_C3 <- glm(
TARGET_FLAG ~ AGE_BIN + CAR_AGE_BIN + MVR_BIN +
LOG_INCOME + LOG_BLUEBOOK + LOG_HOME_VAL +
RISK_INTERACT + AGE_RISK +
INCOME_MISSING + HOME_VAL_MISSING + BLUEBOOK_MISSING,
data = train2,
family = binomial
)
summary(model_C3)
##
## Call:
## glm(formula = TARGET_FLAG ~ AGE_BIN + CAR_AGE_BIN + MVR_BIN +
## LOG_INCOME + LOG_BLUEBOOK + LOG_HOME_VAL + RISK_INTERACT +
## AGE_RISK + INCOME_MISSING + HOME_VAL_MISSING + BLUEBOOK_MISSING,
## family = binomial, data = train2)
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.549e+01 1.970e+02 0.079 0.937332
## AGE_BIN26-35 -4.986e-01 2.614e-01 -1.907 0.056518 .
## AGE_BIN36-50 -9.225e-01 2.562e-01 -3.601 0.000317 ***
## AGE_BIN51-70 -1.056e+00 2.622e-01 -4.029 5.60e-05 ***
## AGE_BIN>70 -1.619e+00 7.053e-01 -2.295 0.021718 *
## CAR_AGE_BIN0-5 -1.278e+01 1.970e+02 -0.065 0.948259
## CAR_AGE_BIN6-10 -1.293e+01 1.970e+02 -0.066 0.947649
## CAR_AGE_BIN11-20 -1.316e+01 1.970e+02 -0.067 0.946720
## CAR_AGE_BIN>20 -1.338e+01 1.970e+02 -0.068 0.945836
## MVR_BIN4-6 -8.144e-02 1.099e-01 -0.741 0.458806
## MVR_BIN7-10 -2.092e-01 1.999e-01 -1.046 0.295361
## MVR_BIN>10 6.806e-01 6.100e-01 1.116 0.264551
## LOG_INCOME -3.016e-02 8.765e-03 -3.441 0.000580 ***
## LOG_BLUEBOOK -2.430e-01 4.293e-02 -5.662 1.50e-08 ***
## LOG_HOME_VAL -4.930e-02 4.769e-03 -10.336 < 2e-16 ***
## RISK_INTERACT 3.248e-02 7.718e-03 4.208 2.57e-05 ***
## AGE_RISK 3.790e-03 6.527e-04 5.807 6.36e-09 ***
## INCOME_MISSING -9.898e-03 1.178e-01 -0.084 0.933024
## HOME_VAL_MISSING 1.336e-01 1.151e-01 1.160 0.246011
## BLUEBOOK_MISSING NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 9404.0 on 8154 degrees of freedom
## Residual deviance: 8694.7 on 8136 degrees of freedom
## (6 observations deleted due to missingness)
## AIC: 8732.7
##
## Number of Fisher Scoring iterations: 10
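# The NA coefficient on BLUEBOOK_MISSING reflects the reported singularity,
# most likely because the flag never varies in the training data; a quick
# check (sketch):
table(train2$BLUEBOOK_MISSING)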
############################################
# Odds Ratios for Logistic Models
############################################
exp(coef(model_C1))
## (Intercept) AGE MVR_PTS CLM_FREQ CAR_AGE LOG_INCOME
## 9.5332509 0.9858396 1.1582532 1.3309429 0.9712534 0.9685488
## LOG_BLUEBOOK LOG_HOME_VAL
## 0.7880775 0.9547125
exp(coef(model_C2))
## (Intercept) AGE AGE_BIN26-35 AGE_BIN36-50 AGE_BIN51-70
## 19.4730564 0.9745542 0.7357116 0.6346887 0.7285239
## AGE_BIN>70 CAR_AGE MVR_BIN4-6 MVR_BIN7-10 MVR_BIN>10
## 0.6107876 0.9714053 0.8349088 0.9253489 2.8153154
## CLM_FREQ RISK_INTERACT AGE_RISK LOG_INCOME LOG_BLUEBOOK
## 1.2481793 0.9504788 1.0049081 0.9686365 0.7924446
## LOG_HOME_VAL LOG_OLDCLAIM
## 0.9554550 1.0629191
exp(coef(model_C3))
## (Intercept) AGE_BIN26-35 AGE_BIN36-50 AGE_BIN51-70
## 5.316850e+06 6.073988e-01 3.975321e-01 3.477754e-01
## AGE_BIN>70 CAR_AGE_BIN0-5 CAR_AGE_BIN6-10 CAR_AGE_BIN11-20
## 1.981397e-01 2.811278e-06 2.417472e-06 1.920931e-06
## CAR_AGE_BIN>20 MVR_BIN4-6 MVR_BIN7-10 MVR_BIN>10
## 1.543546e-06 9.217860e-01 8.112637e-01 1.975115e+00
## LOG_INCOME LOG_BLUEBOOK LOG_HOME_VAL RISK_INTERACT
## 9.702944e-01 7.842331e-01 9.518987e-01 1.033011e+00
## AGE_RISK INCOME_MISSING HOME_VAL_MISSING BLUEBOOK_MISSING
## 1.003798e+00 9.901511e-01 1.142894e+00 NA
In this section, I evaluated and compared the multiple linear regression models for claim severity (TARGET_AMT) and the binary logistic regression models for crash probability (TARGET_FLAG). The goal was to select one final model of each type based on both statistical performance and interpretability, and then use those models to generate predictions for the evaluation dataset.
4.1 Multiple Linear Regression (TARGET_AMT)
Both linear models were trained on the subset of customers who experienced a crash (TARGET_FLAG = 1), which is appropriate because claim severity is only defined in that context.
For each model, I computed:
Mean Squared Error (MSE)
R² and Adjusted R²
F-statistic (from the model summary)
Residual diagnostics, including residuals vs. fitted plots and QQ-plots
Model L1, the baseline model, used a straightforward set of numeric predictors and log-transformed financial variables (LOG_BLUEBOOK, LOG_INCOME, LOG_HOME_VAL, OLDCLAIM, CLM_FREQ, MVR_PTS, AGE, CAR_AGE, and interaction terms). Model L2, the stepwise model, started with a richer set of predictors and used AIC-based stepwise selection to identify a more parsimonious subset.
The comparison table showed that Model L2 achieved a higher adjusted R² than Model L1 (0.0169 vs. 0.0144) while using only three predictors, indicating better fit per parameter; both raw R² values are very low, which is typical for predicting individual claim severity. Residual diagnostics for Model L2 showed residuals concentrated slightly below zero with a pronounced right tail, and the QQ-plot departed substantially from normality in the upper tail, as expected for heavily skewed claim amounts; beyond that skew, residuals showed no strong pattern against fitted values.
Based on these metrics and diagnostics, I selected Model L2 (Stepwise Linear) as the final multiple linear regression model for predicting claim severity. Its retained coefficients aligned with domain intuition: higher vehicle value (LOG_BLUEBOOK) increased expected claim amounts, older vehicles (CAR_AGE) were associated with somewhat smaller claims, and the AGE_RISK interaction contributed a small positive effect.
4.2 Binary Logistic Regression (TARGET_FLAG)
For the crash probability models, I evaluated three logistic regressions:
Model C1 (Baseline): included AGE, MVR_PTS, CLM_FREQ, CAR_AGE, and log-transformed financial variables.
Model C2 (Stepwise): started from a comprehensive engineered feature set, including buckets and interactions, and used AIC-based stepwise selection.
Model C3 (Engineered): used interpretable bucketed variables (AGE_BIN, CAR_AGE_BIN, MVR_BIN), log-transformed financials, interactions, and missingness indicators.
Each model was evaluated on the training data using:
Accuracy and classification error rate
Precision (positive predictive value)
Sensitivity (recall) and specificity
F1 score
AUC (Area Under the ROC Curve)
Confusion matrix
The results indicated that Model C2 (Stepwise Logistic) provided the best overall performance, with the highest AUC (0.712), the strongest F1 score, and the best balance between sensitivity and specificity. Model C1 was simple but underperformed relative to C2, while Model C3 was highly interpretable but weaker on AUC and F1. The confusion matrix for Model C2 also shows the cost of the fixed 0.5 threshold on imbalanced data: specificity is very high (0.957) and precision is reasonable (0.605), but sensitivity is only 0.183, so most crash cases are missed at that cutoff (a threshold diagnostic follows the confusion matrix output below).
Given its superior predictive strength and manageable complexity, I selected Model C2 (Stepwise Logistic) as the final model for crash probability.
4.3 Final Predictions for the Evaluation Dataset
Using the selected models (Model L2 for severity and Model C2 for crash probability), I generated predictions for the evaluation dataset:
For the logistic model, I computed the predicted probability that each customer will have a crash (P_TARGET_FLAG) and a binary classification (TARGET_FLAG_PRED) using the required threshold of 0.5.
For the linear model, I predicted the claim amount (TARGET_AMT_PRED) under the assumption that a crash occurs. Any negative predictions were truncated to zero to ensure non-negative claim estimates.
These outputs were combined into a final prediction table containing the customer INDEX, crash probability, crash classification, and predicted claim amount. This table satisfies the assignment requirement to provide probabilities, classifications (0/1), and cost predictions for each record in the evaluation dataset.
##############################
# 4A. LINEAR MODELS (TARGET_AMT)
##############################
# Helper function to evaluate linear models
evaluate_linear_model <- function(model, data) {
  preds <- predict(model, newdata = data)
  resid <- data$TARGET_AMT - preds
  # predict() returns NA for rows the model dropped (missing predictors),
  # so compute MSE over complete cases only
  mse <- mean(resid^2, na.rm = TRUE)
  s <- summary(model)
  list(
    mse = mse,
    r2 = s$r.squared,
    adj_r2 = s$adj.r.squared,
    fstat = s$fstatistic,
    preds = preds,
    resid = resid
  )
}
# Evaluate both linear models on train_amt (only crash cases)
lin_L1 <- evaluate_linear_model(model_L1, train_amt)
lin_L2 <- evaluate_linear_model(model_L2, train_amt)
# Comparison table for linear models
linear_comparison <- data.frame(
Model = c("L1: Baseline Linear", "L2: Stepwise Linear"),
MSE = c(lin_L1$mse, lin_L2$mse),
R2 = c(lin_L1$r2, lin_L2$r2),
Adj_R2 = c(lin_L1$adj_r2, lin_L2$adj_r2)
)
linear_comparison
## Model MSE R2 Adj_R2
## 1 L1: Baseline Linear NA 0.01903664 0.01444627
## 2 L2: Stepwise Linear NA 0.01822962 0.01685587
# Residual diagnostic plots for the selected linear model (L2)
train_amt$L2_FITTED <- lin_L2$preds
train_amt$L2_RESID <- lin_L2$resid
# Residuals vs Fitted
ggplot(train_amt, aes(x = L2_FITTED, y = L2_RESID)) +
geom_point(alpha = 0.4) +
geom_hline(yintercept = 0, linetype = "dashed") +
labs(title = "Model L2: Residuals vs Fitted",
x = "Fitted Values",
y = "Residuals")
## Warning: Removed 5 rows containing missing values or values outside the scale range
## (`geom_point()`).
# QQ-plot of residuals
ggplot(train_amt, aes(sample = L2_RESID)) +
stat_qq() +
stat_qq_line() +
labs(title = "Model L2: QQ-Plot of Residuals")
## Warning: Removed 5 rows containing non-finite outside the scale range
## (`stat_qq()`).
## Warning: Removed 5 rows containing non-finite outside the scale range
## (`stat_qq_line()`).
##############################
# 4B. LOGISTIC MODELS (TARGET_FLAG)
##############################
# Ensure TARGET_FLAG is in {0,1} and usable as factor
# If needed, uncomment:
# train2$TARGET_FLAG <- as.numeric(train2$TARGET_FLAG)
evaluate_logit_model <- function(model, data) {
probs <- predict(model, newdata = data, type = "response")
preds <- ifelse(probs >= 0.5, 1, 0)
cm <- confusionMatrix(
factor(preds, levels = c(0, 1)),
factor(data$TARGET_FLAG, levels = c(0, 1)),
positive = "1"
)
auc_val <- as.numeric(auc(data$TARGET_FLAG, probs))
list(
accuracy = cm$overall["Accuracy"],
error_rate = 1 - cm$overall["Accuracy"],
precision = cm$byClass["Precision"],
recall = cm$byClass["Sensitivity"],
specificity = cm$byClass["Specificity"],
f1 = cm$byClass["F1"],
auc = auc_val,
confusion = cm
)
}
# Evaluate all three logistic models on full training data
logit_C1 <- evaluate_logit_model(model_C1, train2)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
logit_C2 <- evaluate_logit_model(model_C2, train2)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
logit_C3 <- evaluate_logit_model(model_C3, train2)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
# Comparison table for logistic models
logit_comparison <- data.frame(
Model = c("C1: Baseline Logit", "C2: Stepwise Logit", "C3: Engineered Logit"),
Accuracy = c(logit_C1$accuracy, logit_C2$accuracy, logit_C3$accuracy),
Error_Rate = c(logit_C1$error_rate, logit_C2$error_rate, logit_C3$error_rate),
Precision = c(logit_C1$precision, logit_C2$precision, logit_C3$precision),
Recall = c(logit_C1$recall, logit_C2$recall, logit_C3$recall),
Specificity = c(logit_C1$specificity, logit_C2$specificity, logit_C3$specificity),
F1 = c(logit_C1$f1, logit_C2$f1, logit_C3$f1),
AUC = c(logit_C1$auc, logit_C2$auc, logit_C3$auc)
)
logit_comparison
## Model Accuracy Error_Rate Precision Recall Specificity
## 1 C1: Baseline Logit 0.7470264 0.2529736 0.5667190 0.1680633 0.9540536
## 2 C2: Stepwise Logit 0.7534028 0.2465972 0.6052227 0.1834264 0.9572166
## 3 C3: Engineered Logit 0.7498467 0.2501533 0.6026616 0.1475791 0.9652073
## F1 AUC
## 1 0.2592460 0.7033280
## 2 0.2815291 0.7124030
## 3 0.2370980 0.6825449
# Inspect confusion matrix for the selected logistic model (C2)
logit_C2$confusion
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 5750 1754
## 1 257 394
##
## Accuracy : 0.7534
## 95% CI : (0.7439, 0.7627)
## No Information Rate : 0.7366
## P-Value [Acc > NIR] : 0.0002764
##
## Kappa : 0.1812
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.18343
## Specificity : 0.95722
## Pos Pred Value : 0.60522
## Neg Pred Value : 0.76626
## Prevalence : 0.26340
## Detection Rate : 0.04831
## Detection Prevalence : 0.07983
## Balanced Accuracy : 0.57032
##
## 'Positive' Class : 1
##
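# The fixed 0.5 cutoff trades sensitivity for specificity on these
# imbalanced data. Diagnostic sketch: the Youden-optimal threshold for
# Model C2 (the reported classifications still use 0.5 as required)
probs_C2 <- predict(model_C2, newdata = train2, type = "response")
roc_C2 <- roc(train2$TARGET_FLAG, probs_C2)
coords(roc_C2, x = "best", best.method = "youden")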
##############################
# 4C. SELECT FINAL MODELS
##############################
# Based on the comparison above, we choose:
# - Linear: model_L2 (higher adjusted R^2 with far fewer predictors)
# - Logistic: model_C2 (highest AUC and F1, best overall balance)
final_linear_model <- model_L2
final_logit_model <- model_C2
##############################
# 4D. PREDICTIONS ON EVALUATION DATA
##############################
# 1) Crash probability and classification
eval_probs <- predict(final_logit_model, newdata = eval2, type = "response")
eval_flag <- ifelse(eval_probs >= 0.5, 1, 0)
# 2) Claim severity (conditional amount if crash occurs)
eval_amt <- predict(final_linear_model, newdata = eval2)
# Enforce non-negative claim predictions
eval_amt[eval_amt < 0] <- 0
# Combine into final prediction table
predictions_eval <- data.frame(
INDEX = eval2$INDEX,
P_TARGET_FLAG = eval_probs, # predicted probability of crash
TARGET_FLAG_PRED = eval_flag, # 0/1 classification (threshold 0.5)
TARGET_AMT_PRED = eval_amt # predicted claim amount if a crash occurs
)
head(predictions_eval)
## INDEX P_TARGET_FLAG TARGET_FLAG_PRED TARGET_AMT_PRED
## 1 3 0.2287841 0 6479.907
## 2 9 0.4561332 0 6642.576
## 3 10 0.2192459 0 4212.630
## 4 18 0.2798462 0 5175.408
## 5 21 0.5404773 1 6806.646
## 6 30 0.2229655 0 6604.544
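# Sketch: export the prediction table for submission (the file name here
# is an assumption, not specified by the assignment)
write.csv(predictions_eval, "insurance_eval_predictions.csv", row.names = FALSE)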