Introduction

The objective is to build two models using the provided dataset: a multiple linear regression model to predict the cost of a car crash (a continuous variable) and a binary logistic regression model to predict the probability of a car crash (a binary outcome). Only the given variables, or any new variables derived from them, can be used to build these models. The dataset includes a set of variables relevant to the prediction task, which are described in the following section.

Summary of Modeling for Homework #4

1. Data Exploration

The dataset includes 8,161 observations, with two key response variables:

  • TARGET_FLAG: Binary indicator (1 = involved in a car crash, 0 = not involved).
  • TARGET_AMT: Continuous variable representing the cost of a car crash (zero if no crash occurred).

Key findings:

  • Missing data in CAR_AGE, HOME_VAL, YOJ, and INCOME, with rates ranging from 5% to 6%.
  • Skewness in TARGET_AMT and other variables required transformations for better modeling.
  • Correlations between the predictors and the response variables informed feature selection.

2. Data Preparation

The dataset underwent preprocessing to enhance model readiness:

  • Handling Missing Data: Imputed missing values with medians for numeric variables and modes for categorical ones.
  • Feature Engineering: Interaction terms (e.g., BLUEBOOK * CAR_AGE) were created, and categorical variables were one-hot encoded.
  • Balancing the Dataset: SMOTE oversampling addressed the imbalance in TARGET_FLAG.
  • Transformation and Scaling:
    • Box-Cox transformations were applied to normalize skewed variables.
    • Min-max scaling normalized continuous variables for compatibility with algorithms like SVR and ANN.

3. Model Building

Multiple models were built and compared for both regression and classification tasks:

(a) Regression Models (TARGET_AMT as target):
  1. Required Models:
    • Multiple Linear Regression (Full, Stepwise, Forward, and Backward Selection).
    • Random Forest Regression (RFR).
    • XGBoost Regression.
  2. Additional Models:
    • Support Vector Regression (SVR): Tuned parameters included radial kernel, cost, and gamma.
    • Artificial Neural Network (ANN): Optimized with a 10-neuron hidden layer and regularization.
    • Multivariate Adaptive Regression Splines (MARS): Fitted with optimized pruning and degrees.
    • k-Nearest Neighbors (KNN): Optimized for the number of neighbors.
    • Generalized Linear Models (GLM): For baseline comparisons.
(b) Binary Classification Models (TARGET_FLAG as target):
  1. Required Models:
    • Logistic Regression (Full, Stepwise, Forward, and Backward Selection).
    • Decision Tree Classifier.
    • Random Forest Classifier.
  2. Additional Models:
    • XGBoost Classifier: Tuned hyperparameters for optimal depth, learning rate, and trees.
    • Support Vector Machines (SVM): Radial kernel and cost tuning.
    • Neural Network Classifier: Optimized for activation function and hidden layers.
    • Recursive Feature Elimination: Used for variable prioritization and comparison.

4. Model Selection and Evaluation

Each model was evaluated using performance metrics:

  • Regression Models:
    • RMSE and Adjusted \(R^2\) were the primary metrics for prediction accuracy and model fit.
    • Residual analysis verified assumptions of linearity and homoscedasticity.
  • Classification Models:
    • AUC and ROC curves assessed classification accuracy.
    • Precision, recall, F1 score, and accuracy provided comprehensive evaluations.
  • Models were also assessed on interpretability and computational efficiency.

A brief sketch of these metrics in R follows.
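A minimal sketch, assuming hypothetical fitted models mlr_fit (linear) and blr_fit (logistic) and a held-out frame test_data; the actual evaluation code appears later in this report.

library(pROC)
library(caret)

# Regression: RMSE on held-out data and in-sample adjusted R^2
preds  <- predict(mlr_fit, newdata = test_data)
rmse   <- sqrt(mean((test_data$TARGET_AMT - preds)^2, na.rm = TRUE))
adj_r2 <- summary(mlr_fit)$adj.r.squared

# Classification: AUC from predicted probabilities
probs   <- predict(blr_fit, newdata = test_data, type = "response")
roc_obj <- roc(test_data$TARGET_FLAG, probs)
auc(roc_obj)

# Precision, recall, F1, and accuracy at a 0.5 threshold
pred_class <- factor(ifelse(probs >= 0.5, 1, 0), levels = c(0, 1))
confusionMatrix(pred_class, factor(test_data$TARGET_FLAG, levels = c(0, 1)),
                positive = "1", mode = "prec_recall")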

5. Deliverables

  • Predicted probabilities and classifications for TARGET_FLAG.
  • Predicted costs for TARGET_AMT.
  • Comparative evaluation of all models, including those fitted for benchmarking.
  • Code, visualizations, and interpretations were included in the submission appendix.

By including additional models, this comprehensive approach provides enhanced benchmarks for comparison, ensuring robust and interpretable results tailored to the assignment’s requirements.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(DataExplorer)
library(caret)
## Loading required package: lattice
library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
library(e1071)
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
library(pROC)
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
library(VIM)
## Loading required package: colorspace
## 
## Attaching package: 'colorspace'
## The following object is masked from 'package:pROC':
## 
##     coords
## Loading required package: grid
## VIM is ready to use.
## Suggestions and bug-reports can be submitted at: https://github.com/statistikat/VIM/issues
## 
## Attaching package: 'VIM'
## The following object is masked from 'package:datasets':
## 
##     sleep

Data Exploration

Load the training and evaluation datasets

The training dataset consists of 8,161 observations and 26 variables. Several categorical variables require transformation to be compatible with both logistic and linear models. We begin with some fundamental transformations.

## 'data.frame':    8161 obs. of  26 variables:
##  $ INDEX      : int  1 2 4 5 6 7 8 11 12 13 ...
##  $ TARGET_FLAG: int  0 0 0 0 0 1 0 1 1 0 ...
##  $ TARGET_AMT : num  0 0 0 0 0 ...
##  $ KIDSDRIV   : int  0 0 0 0 0 0 0 1 0 0 ...
##  $ AGE        : int  60 43 35 51 50 34 54 37 34 50 ...
##  $ HOMEKIDS   : int  0 0 1 0 0 1 0 2 0 0 ...
##  $ YOJ        : int  11 11 10 14 NA 12 NA NA 10 7 ...
##  $ INCOME     : chr  "$67,349" "$91,449" "$16,039" "" ...
##  $ PARENT1    : chr  "No" "No" "No" "No" ...
##  $ HOME_VAL   : chr  "$0" "$257,252" "$124,191" "$306,251" ...
##  $ MSTATUS    : chr  "z_No" "z_No" "Yes" "Yes" ...
##  $ SEX        : chr  "M" "M" "z_F" "M" ...
##  $ EDUCATION  : chr  "PhD" "z_High School" "z_High School" "<High School" ...
##  $ JOB        : chr  "Professional" "z_Blue Collar" "Clerical" "z_Blue Collar" ...
##  $ TRAVTIME   : int  14 22 5 32 36 46 33 44 34 48 ...
##  $ CAR_USE    : chr  "Private" "Commercial" "Private" "Private" ...
##  $ BLUEBOOK   : chr  "$14,230" "$14,940" "$4,010" "$15,440" ...
##  $ TIF        : int  11 1 4 7 1 1 1 1 1 7 ...
##  $ CAR_TYPE   : chr  "Minivan" "Minivan" "z_SUV" "Minivan" ...
##  $ RED_CAR    : chr  "yes" "yes" "no" "yes" ...
##  $ OLDCLAIM   : chr  "$4,461" "$0" "$38,690" "$0" ...
##  $ CLM_FREQ   : int  2 0 2 0 2 0 0 1 0 0 ...
##  $ REVOKED    : chr  "No" "No" "No" "No" ...
##  $ MVR_PTS    : int  3 0 3 0 3 0 0 10 0 1 ...
##  $ CAR_AGE    : int  18 1 10 6 17 7 1 7 1 17 ...
##  $ URBANICITY : chr  "Highly Urban/ Urban" "Highly Urban/ Urban" "Highly Urban/ Urban" "Highly Urban/ Urban" ...

Convert columns to appropriate types based on column definitions in Homework 4

  • Convert the variables to the appropriate types based on the column definitions.
  • Retain INDEX for now; it carries no predictive information and is dropped before modeling.
  • Create vectors of numeric and categorical variable names for use in visualizations and data transformations.
# Load necessary library
library(dplyr)

# List of datasets to process
data_list <- list(
  insurance_training = insurance_training,
  insurance_evaluation = insurance_evaluation
)

# Function to process a dataset
process_insurance_data <- function(data) {
  data <- data %>%
    dplyr::mutate(
      INDEX = as.numeric(INDEX),
      TARGET_AMT = as.numeric(TARGET_AMT),
      KIDSDRIV = as.numeric(KIDSDRIV),
      AGE = as.numeric(AGE),
      HOMEKIDS = as.numeric(HOMEKIDS),
      YOJ = as.numeric(YOJ),
      INCOME = as.numeric(gsub("[^0-9\\.]", "", INCOME)),
      HOME_VAL = as.numeric(gsub("[^0-9\\.]", "", HOME_VAL)),
      TRAVTIME = as.numeric(TRAVTIME),
      BLUEBOOK = as.numeric(gsub("[^0-9\\.]", "", BLUEBOOK)),
      TIF = as.numeric(TIF),
      OLDCLAIM = as.numeric(gsub("[^0-9\\.]", "", OLDCLAIM)),
      CLM_FREQ = as.numeric(CLM_FREQ),
      MVR_PTS = as.numeric(MVR_PTS),
      CAR_AGE = as.numeric(CAR_AGE),
      TARGET_FLAG = as.integer(TARGET_FLAG),
      RED_CAR = as.factor(RED_CAR),
      REVOKED = as.factor(REVOKED),
      CAR_USE = as.factor(CAR_USE),
      URBANICITY = as.factor(URBANICITY),
      PARENT1 = as.factor(PARENT1),
      MSTATUS = as.factor(MSTATUS),
      SEX = as.factor(SEX),
      EDUCATION = as.factor(EDUCATION),
      JOB = as.factor(JOB),
      CAR_TYPE = as.factor(CAR_TYPE)
    ) # %>%
    # dplyr::select(-1) # Drop the first column
  return(data)
}

# Process each dataset using lapply
processed_data_list <- lapply(data_list, function(data) {
  tryCatch(
    process_insurance_data(data),
    error = function(e) {
      cat("Error processing dataset:", conditionMessage(e), "\n")
      NULL
    }
  )
})

# Remove NULL results if any dataset failed processing
processed_data_list <- processed_data_list[!sapply(processed_data_list, is.null)]

# Assign the processed datasets back to their variables
insurance_training <- processed_data_list$insurance_training
insurance_evaluation <- processed_data_list$insurance_evaluation

# Check the structure of the processed datasets
cat("### Structure of Insurance Training Data ###\n")
## ### Structure of Insurance Training Data ###
str(insurance_training)
## 'data.frame':    8161 obs. of  26 variables:
##  $ INDEX      : num  1 2 4 5 6 7 8 11 12 13 ...
##  $ TARGET_FLAG: int  0 0 0 0 0 1 0 1 1 0 ...
##  $ TARGET_AMT : num  0 0 0 0 0 ...
##  $ KIDSDRIV   : num  0 0 0 0 0 0 0 1 0 0 ...
##  $ AGE        : num  60 43 35 51 50 34 54 37 34 50 ...
##  $ HOMEKIDS   : num  0 0 1 0 0 1 0 2 0 0 ...
##  $ YOJ        : num  11 11 10 14 NA 12 NA NA 10 7 ...
##  $ INCOME     : num  67349 91449 16039 NA 114986 ...
##  $ PARENT1    : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 2 1 1 1 1 ...
##  $ HOME_VAL   : num  0 257252 124191 306251 243925 ...
##  $ MSTATUS    : Factor w/ 2 levels "Yes","z_No": 2 2 1 1 1 2 1 1 2 2 ...
##  $ SEX        : Factor w/ 2 levels "M","z_F": 1 1 2 1 2 2 2 1 2 1 ...
##  $ EDUCATION  : Factor w/ 5 levels "<High School",..: 4 5 5 1 4 2 1 2 2 2 ...
##  $ JOB        : Factor w/ 9 levels "","Clerical",..: 7 9 2 9 3 9 9 9 2 7 ...
##  $ TRAVTIME   : num  14 22 5 32 36 46 33 44 34 48 ...
##  $ CAR_USE    : Factor w/ 2 levels "Commercial","Private": 2 1 2 2 2 1 2 1 2 1 ...
##  $ BLUEBOOK   : num  14230 14940 4010 15440 18000 ...
##  $ TIF        : num  11 1 4 7 1 1 1 1 1 7 ...
##  $ CAR_TYPE   : Factor w/ 6 levels "Minivan","Panel Truck",..: 1 1 6 1 6 4 6 5 6 5 ...
##  $ RED_CAR    : Factor w/ 2 levels "no","yes": 2 2 1 2 1 1 1 2 1 1 ...
##  $ OLDCLAIM   : num  4461 0 38690 0 19217 ...
##  $ CLM_FREQ   : num  2 0 2 0 2 0 0 1 0 0 ...
##  $ REVOKED    : Factor w/ 2 levels "No","Yes": 1 1 1 1 2 1 1 2 1 1 ...
##  $ MVR_PTS    : num  3 0 3 0 3 0 0 10 0 1 ...
##  $ CAR_AGE    : num  18 1 10 6 17 7 1 7 1 17 ...
##  $ URBANICITY : Factor w/ 2 levels "Highly Urban/ Urban",..: 1 1 1 1 1 1 1 1 1 2 ...
cat("\n\n### Structure of Insurance Evaluation Data ###\n")
## 
## 
## ### Structure of Insurance Evaluation Data ###
# str(insurance_evaluation)

# Create vectors of numeric and categorical variables for insurance_training
numeric_vars <- names(insurance_training)[sapply(insurance_training, is.numeric)]
categorical_vars <- names(insurance_training)[sapply(insurance_training, is.factor)]

# Print the vectors
cat("\n### Numeric Variables in Training Data ###\n")
## 
## ### Numeric Variables in Training Data ###
print(numeric_vars)
##  [1] "INDEX"       "TARGET_FLAG" "TARGET_AMT"  "KIDSDRIV"    "AGE"        
##  [6] "HOMEKIDS"    "YOJ"         "INCOME"      "HOME_VAL"    "TRAVTIME"   
## [11] "BLUEBOOK"    "TIF"         "OLDCLAIM"    "CLM_FREQ"    "MVR_PTS"    
## [16] "CAR_AGE"
# cat("\n### Categorical Variables in Training Data ###\n")
print(categorical_vars)
##  [1] "PARENT1"    "MSTATUS"    "SEX"        "EDUCATION"  "JOB"       
##  [6] "CAR_USE"    "CAR_TYPE"   "RED_CAR"    "REVOKED"    "URBANICITY"

(a) Summary Statistics

Notable statistics from the summary of numerical variables:

  1. Age: The mean age of respondents is approximately 44.79 years, with a minimum age of 16 and a maximum of 81 years. Only 0.07% of age data is missing, indicating robust data coverage.

  2. Income: The mean income is $61,898.09, with a median of $54,028 and a maximum of $367,030, showing significant variability (SD = $47,572.68). Income data has 5.45% missing values.

  3. TARGET_AMT: This variable is highly skewed (Skewness = 8.71), with a mean of $1,504.32 and a maximum of $107,586.1, indicating the presence of outliers in the data.

  4. CAR_AGE: The mean car age is 8.33 years, with a negative minimum value (-3), possibly indicating data errors. The variable has the highest percentage of missing values (6.2%).

  5. YOJ (Years on Job): The median is 11 years, but 5.56% of data is missing, suggesting some gaps in job tenure information.

  6. HOME_VAL (Home Value): With a mean of $154,867.29 and a median of $161,160, home values appear moderately skewed. Approximately 5.69% of the data is missing.

  7. TARGET_FLAG: The binary target variable is imbalanced, with a mean of 0.26, indicating that only about 26% of respondents fall under the positive class.

These insights highlight areas of potential data cleaning, especially for missing values, outliers, and negative entries, as well as important trends that may inform predictive modeling.

# Load necessary libraries
library(kableExtra)
## 
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
## 
##     group_rows
library(dplyr)
library(tidyr)

# Select the numerical variables for the summary, excluding INDEX (it stays in the dataset)
numerical_vars <- insurance_training %>%
  dplyr::select(all_of(numeric_vars)) %>%
  dplyr::select(-INDEX)

# Compute statistical summary including missing value counts and percentages
statistical_summary <- numerical_vars %>%
  summarise(across(
    everything(),
    list(
      Min = ~round(min(., na.rm = TRUE), 2),
      Q1 = ~round(quantile(., 0.25, na.rm = TRUE), 2),
      Mean = ~round(mean(., na.rm = TRUE), 2),
      Median = ~round(median(., na.rm = TRUE), 2),
      Q3 = ~round(quantile(., 0.75, na.rm = TRUE), 2),
      Max = ~round(max(., na.rm = TRUE), 2),
      SD = ~round(sd(., na.rm = TRUE), 2),
      Skewness = ~round(skewness(., na.rm = TRUE), 2),
      Kurtosis = ~round(kurtosis(., na.rm = TRUE), 2),
      Outliers = ~sum(. > (quantile(., 0.75, na.rm = TRUE) + 1.5 * IQR(., na.rm = TRUE)) | 
                       . < (quantile(., 0.25, na.rm = TRUE) - 1.5 * IQR(., na.rm = TRUE)), na.rm = TRUE),
      Missing = ~sum(is.na(.)), # Count of missing values
      PercentMissing = ~round(mean(is.na(.)) * 100, 2) # Percentage of missing values
    ),
    .names = "{.col}_{.fn}"
  )) %>%
  pivot_longer(
    cols = everything(),
    names_to = c("Variable", ".value"),
    names_pattern = "^(.*)_(.*)$"
  )

# Display the resulting summary table
print(statistical_summary)
## # A tibble: 15 × 13
##    Variable     Min    Q1    Mean Median     Q3    Max      SD Skewness Kurtosis
##    <chr>      <dbl> <dbl>   <dbl>  <dbl>  <dbl>  <dbl>   <dbl>    <dbl>    <dbl>
##  1 TARGET_FL…     0     0 2.6 e-1      0      1 1   e0 4.4 e-1     1.07    -0.85
##  2 TARGET_AMT     0     0 1.50e+3      0   1036 1.08e5 4.70e+3     8.71   112.  
##  3 KIDSDRIV       0     0 1.7 e-1      0      0 4   e0 5.1 e-1     3.35    11.8 
##  4 AGE           16    39 4.48e+1     45     51 8.1 e1 8.63e+0    -0.03    -0.06
##  5 HOMEKIDS       0     0 7.2 e-1      0      1 5   e0 1.12e+0     1.34     0.65
##  6 YOJ            0     9 1.05e+1     11     13 2.3 e1 4.09e+0    -1.2      1.18
##  7 INCOME         0 28097 6.19e+4  54028  85986 3.67e5 4.76e+4     1.19     2.13
##  8 HOME_VAL       0     0 1.55e+5 161160 238724 8.85e5 1.29e+5     0.49    -0.02
##  9 TRAVTIME       5    22 3.35e+1     33     44 1.42e2 1.59e+1     0.45     0.66
## 10 BLUEBOOK    1500  9280 1.57e+4  14440  20850 6.97e4 8.42e+3     0.79     0.79
## 11 TIF            1     1 5.35e+0      4      7 2.5 e1 4.15e+0     0.89     0.42
## 12 OLDCLAIM       0     0 4.04e+3      0   4636 5.70e4 8.78e+3     3.12     9.86
## 13 CLM_FREQ       0     0 8   e-1      0      2 5   e0 1.16e+0     1.21     0.28
## 14 MVR_PTS        0     0 1.7 e+0      1      3 1.3 e1 2.15e+0     1.35     1.38
## 15 CAR_AGE       -3     1 8.33e+0      8     12 2.8 e1 5.7 e+0     0.28    -0.75
## # ℹ 3 more variables: Outliers <int>, Missing <int>, PercentMissing <dbl>

Summary statistics for Categorical and Factor Variables

  1. Target Variable (TARGET_FLAG):
    • Mode: 0 (73.62%) indicates strong imbalance.
    • Imbalance Ratio: 2.79 suggests underrepresentation of the positive class (1).
    • Recommendation: Use resampling techniques or class-weighted metrics to address imbalance.
  2. Imbalance in Predictors:
    • High imbalance in PARENT1 (mode: “No”, 86.80%, ratio: 6.58), REVOKED (mode: “No”, 87.75%, ratio: 7.16), and CAR_TYPE (mode: “z_SUV”, 28.11%, ratio: 3.39).
    • Recommendation: Group rare levels or use robust models like tree-based algorithms.
  3. Entropy Analysis:
    • Low entropy: URBANICITY (0.73), REVOKED (0.54).
    • High entropy: JOB (3.00) with 9 levels, mode “z_Blue Collar” (22.36%).
  4. Unique Levels:
    • Mostly binary or low cardinality, e.g., CAR_USE, SEX.
    • Exceptions like JOB (9 levels) and CAR_TYPE (6 levels) may need one-hot encoding or grouping.
  5. No Missing Data:
    • Completeness across the categorical variables eliminates the need for imputation, though JOB contains blank ("") entries that function as an unlabeled category.

Key Insights:

  1. Address TARGET_FLAG imbalance through resampling or weighting.
  2. Handle high-imbalance variables (JOB, CAR_TYPE) by consolidating rare levels (a short sketch follows this list).
  3. Leverage high-entropy variables for predictive power with robust preprocessing.
  4. Use tree-based models to manage categorical variables and imbalances effectively.
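As a minimal sketch of consolidating rare levels (item 2), the helper below lumps levels whose relative frequency falls under a 5% threshold into an "Other" category; the threshold is purely illustrative, and the models later in this report keep the original levels.

# Lump factor levels whose relative frequency falls below min_prop
lump_rare_levels <- function(x, min_prop = 0.05, other = "Other") {
  freqs <- prop.table(table(x))
  rare  <- names(freqs)[freqs < min_prop]
  out   <- as.character(x)
  out[out %in% rare] <- other
  factor(out)
}

# Example: group infrequent JOB categories before encoding
job_grouped <- lump_rare_levels(insurance_training$JOB)
table(job_grouped)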

library(dplyr)
library(knitr)
library(kableExtra)

# Select categorical columns including TARGET_FLAG
categorical_columns <- insurance_training %>%
  dplyr::select(all_of(c(categorical_vars, "TARGET_FLAG")))

# Function to calculate Shannon entropy
calculate_entropy <- function(counts) {
  proportions <- counts / sum(counts)
  entropy <- -sum(proportions * log2(proportions), na.rm = TRUE)
  return(entropy)
}

# Function to calculate imbalance ratio
calculate_imbalance_ratio <- function(counts) {
  max_count <- max(counts, na.rm = TRUE)
  min_count <- min(counts, na.rm = TRUE)
  if (min_count == 0) {
    imbalance_ratio <- Inf  # Avoid division by zero
  } else {
    imbalance_ratio <- max_count / min_count
  }
  return(imbalance_ratio)
}

# Compute the summary for each categorical variable
categorical_summary <- lapply(names(categorical_columns), function(var) {
  summary_df <- insurance_training %>%
    count(!!sym(var), .drop = FALSE) %>%  # Ensure all levels are considered
    mutate(Percentage = round(n / sum(n) * 100, 2)) %>%
    rename(Level = !!sym(var), Count = n) %>%
    mutate(Variable = var)  # Add the variable name for identification
  
  # Compute the mode for the variable
  mode_row <- summary_df %>%
    filter(Count == max(Count, na.rm = TRUE)) %>%
    slice_head(n = 1) %>%  # Select the first row in case of ties
    pull(Level)
  
  # Compute percentage for the mode
  mode_percentage <- summary_df %>%
    filter(Level == mode_row) %>%
    pull(Percentage) %>%
    first()  # Ensure it works even if there are multiple matches
  
  # Count missing values for the variable
  missing_count <- sum(is.na(insurance_training[[var]]))
  
  # Count unique levels
  unique_levels_count <- n_distinct(insurance_training[[var]])
  
  # Compute entropy
  entropy <- calculate_entropy(summary_df$Count)
  
  # Compute imbalance ratio
  imbalance_ratio <- calculate_imbalance_ratio(summary_df$Count)
  
  # Combine into a single row summary for the variable
  final_row <- data.frame(
    Variable = var,
    Mode = as.character(mode_row),  # Ensure Mode is always a character
    Mode_Percentage = mode_percentage,
    Missing_Count = missing_count,
    Unique_Levels = unique_levels_count,
    Entropy = round(entropy, 2),
    Imbalance_Ratio = round(imbalance_ratio, 2),
    stringsAsFactors = FALSE  # Avoid factors unless explicitly needed
  )

  return(final_row)
})

# Combine summaries into a single data frame
categorical_summary_df <- bind_rows(categorical_summary)

# Print the resulting summary
cat("\nSummary of Categorical Variables\n")
## 
## Summary of Categorical Variables
print(categorical_summary_df)
##       Variable                Mode Mode_Percentage Missing_Count Unique_Levels
## 1      PARENT1                  No           86.80             0             2
## 2      MSTATUS                 Yes           59.97             0             2
## 3          SEX                 z_F           53.61             0             2
## 4    EDUCATION       z_High School           28.55             0             5
## 5          JOB       z_Blue Collar           22.36             0             9
## 6      CAR_USE             Private           62.88             0             2
## 7     CAR_TYPE               z_SUV           28.11             0             6
## 8      RED_CAR                  no           70.86             0             2
## 9      REVOKED                  No           87.75             0             2
## 10  URBANICITY Highly Urban/ Urban           79.55             0             2
## 11 TARGET_FLAG                   0           73.62             0             2
##    Entropy Imbalance_Ratio
## 1     0.56            6.58
## 2     0.97            1.50
## 3     1.00            1.16
## 4     2.21            3.20
## 5     3.00            7.42
## 6     0.95            1.69
## 7     2.42            3.39
## 8     0.87            2.43
## 9     0.54            7.16
## 10    0.73            3.89
## 11    0.83            2.79

Visualization

Statement about Distributions in the Context of MLR and Binary Logistic Regression

  1. For Multiple Linear Regression (MLR):
    • The continuous predictors (e.g., AGE, INCOME, CAR_AGE) show varying levels of skewness, such as high skewness in TARGET_AMT and OLDCLAIM. Transformations (e.g., log or square root) may be necessary to approximate normality and stabilize variance.
    • Predictors like CAR_AGE and YOJ have relatively normal distributions and are well-suited for MLR without significant transformation.
    • Variables with zero-inflated distributions (e.g., OLDCLAIM, CLM_FREQ) may need special handling, such as creating dummy variables or using nonlinear modeling (see the sketch after this list).
  2. For Binary Logistic Regression:
    • The target variable TARGET_FLAG is binary, making it directly suitable for logistic regression.
    • Continuous variables with high variance or outliers (e.g., HOME_VAL, INCOME) may benefit from normalization or scaling to improve model convergence.
    • Categorical variables such as REVOKED and URBANICITY can be one-hot encoded or directly included as factors.
    • Predictors with imbalanced distributions (e.g., TIF, CLM_FREQ) should be carefully evaluated for multicollinearity or sparse levels, which may impact model interpretation.
  3. We will do the following:
    • Perform feature scaling and handle skewed distributions for MLR.
    • For logistic regression, explore interaction terms and regularization techniques to address multicollinearity and sparsity in categorical variables.
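As a minimal sketch of the zero-inflation handling mentioned in item 1, the snippet below pairs a zero/non-zero indicator with a log1p-transformed amount for OLDCLAIM; it works on a scratch copy (viz_data, a hypothetical name) so the main pipeline is unchanged.

viz_data <- insurance_training %>%
  mutate(
    HAS_OLDCLAIM = as.integer(OLDCLAIM > 0),  # zero vs. non-zero indicator
    LOG_OLDCLAIM = log1p(OLDCLAIM)            # compresses the long right tail
  )
summary(viz_data$LOG_OLDCLAIM)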
# Load the necessary libraries
library(gridExtra)
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ readr     2.1.5
## ✔ lubridate 1.9.3     ✔ stringr   1.5.1
## ✔ purrr     1.0.2     ✔ tibble    3.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ gridExtra::combine()     masks dplyr::combine()
## ✖ dplyr::filter()          masks stats::filter()
## ✖ kableExtra::group_rows() masks dplyr::group_rows()
## ✖ dplyr::lag()             masks stats::lag()
## ✖ purrr::lift()            masks caret::lift()
## ✖ car::recode()            masks dplyr::recode()
## ✖ MASS::select()           masks dplyr::select()
## ✖ purrr::some()            masks car::some()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Define a function to create individual histograms for numerical variables
create_histogram <- function(data, variable) {
  ggplot(data, aes(x = .data[[variable]])) +
    geom_histogram(fill = "darkgrey", color = "black", bins = 30, na.rm = TRUE) +
    ggtitle(paste("Distribution of", variable)) +
    xlab(variable) +
    ylab("Frequency") +
    theme_minimal()
}

# Define a function to create individual bar plots for categorical variables
create_bar_plot <- function(data, variable) {
  ggplot(data, aes(x = .data[[variable]])) +
    geom_bar(fill = "darkgrey", color = "black", na.rm = TRUE) +
    ggtitle(paste("Distribution of", variable)) +
    xlab(variable) +
    ylab("Count") +
    theme_minimal()
}

# List to hold all plots (numerical and categorical)
all_plots <- list()

# Add TARGET_FLAG and TARGET_AMT plots explicitly
all_plots <- append(all_plots, list(
  ggplot(insurance_training, aes(x = TARGET_FLAG)) +
    geom_bar(fill = "darkgrey", color = "black") +
    scale_x_continuous(breaks = c(0, 1), labels = c("0", "1"), limits = c(0, 1), oob = scales::squish, expand = c(0, 0)) +
    ggtitle("Distribution of TARGET_FLAG") +
    xlab("TARGET_FLAG") +
    ylab("Count") +
    theme_minimal(),
  
  ggplot(insurance_training %>% filter(TARGET_FLAG == 1), aes(x = TARGET_AMT)) +
    geom_histogram(fill = "darkgrey", color = "black", bins = 30, na.rm = TRUE) +
    ggtitle("Distribution of TARGET_AMT (Filtered by TARGET_FLAG = 1)") +
    xlab("TARGET_AMT") +
    ylab("Frequency") +
    theme_minimal()
))

# Add numerical variable plots
numerical_variable_groups <- list(
  c("AGE", "YOJ"),
  c("INCOME", "HOME_VAL"),
  c("TRAVTIME", "BLUEBOOK"),
  c("TIF", "OLDCLAIM"),
  c("CLM_FREQ", "MVR_PTS"),
  c("CAR_AGE") # Remaining single variable
)

for (group in numerical_variable_groups) {
  for (variable in group) {
    all_plots <- append(all_plots, list(create_histogram(insurance_training, variable)))
  }
}

# Add categorical variable plots
categorical_variable_groups <- list(
  c("PARENT1", "MSTATUS"),
  c("SEX", "EDUCATION"),
  c("JOB", "CAR_USE"),
  c("CAR_TYPE", "RED_CAR"),
  c("REVOKED", "URBANICITY"),
  c("TARGET_FLAG") # Remaining single variable
)

for (group in categorical_variable_groups) {
  for (variable in group) {
    all_plots <- append(all_plots, list(create_bar_plot(insurance_training, variable)))
  }
}

# Arrange all plots in grids of four per page with two columns
for (i in seq(1, length(all_plots), by = 4)) {
  grid.arrange(grobs = all_plots[i:min(i + 3, length(all_plots))], ncol = 2)
}

(d) Missing Data: Visualization and Analysis

Missing Data Summary

  • Highest Missing Proportions: CAR_AGE (>6%), HOME_VAL, YOJ, and INCOME (~5%).
  • Minimal Missing: AGE (<1%).
  • Patterns: Missingness is scattered without clustering across variables.

We will impute missing values for critical variables like CAR_AGE and HOME_VAL to maintain model integrity.

training_data <- insurance_training


# Filter only columns with missing data
training_data_with_missing <- training_data %>%
  dplyr::select(where(~ any(is.na(.))))


# Missing Data Visualization for columns with missing data
aggr(training_data_with_missing, 
     col = c('navyblue', 'red'), 
     numbers = TRUE, 
     sortVars = TRUE, 
     labels = names(training_data_with_missing), 
     cex.axis = 0.7, 
     gap = 3, 
     ylab = c("Missing data", "Pattern"))
## Warning in plot.aggr(res, ...): not enough horizontal space to display
## frequencies

## 
##  Variables sorted by number of missings: 
##  Variable       Count
##   CAR_AGE 0.062492342
##  HOME_VAL 0.056855777
##       YOJ 0.055630437
##    INCOME 0.054527631
##       AGE 0.000735204

Data Preparation

Data Preparation Summary

Below are the updated steps followed to prepare the dataset for multiple linear regression, logistic regression, and tree-based models, including class balancing:

  1. Class Balancing:
    • Balanced the dataset for binary logistic regression by oversampling the minority class (TARGET_FLAG = 1) to match the majority class size.
    • Ensures equal representation for robust model training and evaluation.
  2. Impute Missing Values:
    • Numeric variables: Replaced missing values with the median to preserve central tendency.
    • Categorical variables: Replaced missing values with the mode so that all data points are retained.
  3. Handle Non-Positive Values:
    • Adjusted INCOME, HOME_VAL, OLDCLAIM, and TARGET_AMT to be strictly positive by adding a constant if necessary.
  4. Feature Engineering:
    • Bucketing: Created AGE_GROUP and TRAVTIME_BUCKET by binning continuous variables into categorical groups for interpretability.
    • Ratios: Created KIDSDRIV_RATIO and HOMEKIDS_RATIO by dividing by AGE.
  5. Box-Cox Transformation:
    • Applied Box-Cox transformation to normalize numeric variables, ensuring compatibility with linear models.
  6. Scaling and Centering:
    • Applied scaling and centering to numeric variables (excluding target variables) for standardization in regression models.
  7. Dummy Encoding:
    • Applied one-hot encoding to categorical variables for multiple linear and logistic regression datasets.
    • Excluded the first dummy category to avoid multicollinearity.
  8. Variable Name Cleaning:
    • Removed unwanted characters like spaces, /, and + from variable names for consistency.
  9. Model-Specific Dataset Preparation:
    • Multiple Linear Regression:
      • Target variable: TARGET_AMT.
      • Scaled numeric variables and dummy-encoded categorical variables.
    • Logistic Regression:
      • Target variable: TARGET_FLAG.
      • Scaled numeric variables and dummy-encoded categorical variables.
    • Tree-Based Models:
      • Retained all original variables without dummy encoding or scaling, as tree-based models handle raw data effectively.

Final Prepared Datasets:

  • Multiple Linear Regression Dataset (MLR): Optimized for continuous target prediction.
  • Logistic Regression Dataset (BLR): Prepared for binary classification tasks.
  • Tree-Based Dataset: Retains raw and transformed variables for non-linear modeling.

Impute Missing Values

# Load necessary libraries
library(dplyr)
library(kableExtra)

# List of datasets to process
data_list <- list(
  insurance_training = insurance_training,
  insurance_evaluation = insurance_evaluation
)

# Function to process a dataset
process_insurance_data <- function(data, is_evaluation = FALSE) {
  data <- data %>%
    dplyr::mutate(
      TARGET_FLAG = if (!is_evaluation) as.factor(TARGET_FLAG) else TARGET_FLAG,
      TARGET_AMT = if (!is_evaluation) as.numeric(TARGET_AMT) else TARGET_AMT,
      KIDSDRIV = as.numeric(KIDSDRIV),
      AGE = as.numeric(AGE),
      HOMEKIDS = as.numeric(HOMEKIDS),
      YOJ = as.numeric(YOJ),
      INCOME = as.numeric(gsub("[^0-9\\.]", "", INCOME)),
      HOME_VAL = as.numeric(gsub("[^0-9\\.]", "", HOME_VAL)),
      TRAVTIME = as.numeric(TRAVTIME),
      BLUEBOOK = as.numeric(gsub("[^0-9\\.]", "", BLUEBOOK)),
      TIF = as.numeric(TIF),
      OLDCLAIM = as.numeric(gsub("[^0-9\\.]", "", OLDCLAIM)),
      CLM_FREQ = as.numeric(CLM_FREQ),
      MVR_PTS = as.numeric(MVR_PTS),
      CAR_AGE = as.numeric(CAR_AGE),
      RED_CAR = as.factor(RED_CAR),
      REVOKED = as.factor(REVOKED),
      CAR_USE = as.factor(CAR_USE),
      URBANICITY = as.factor(URBANICITY),
      PARENT1 = as.factor(PARENT1),
      MSTATUS = as.factor(MSTATUS),
      SEX = as.factor(SEX),
      EDUCATION = as.factor(EDUCATION),
      JOB = as.factor(JOB),
      CAR_TYPE = as.factor(CAR_TYPE)
    )
  return(data)
}

# Process each dataset using lapply
processed_data_list <- lapply(names(data_list), function(name) {
  is_evaluation <- name == "insurance_evaluation"
  tryCatch(
    process_insurance_data(data_list[[name]], is_evaluation = is_evaluation),
    error = function(e) {
      cat("Error processing dataset:", conditionMessage(e), "\n")
      NULL
    }
  )
})

# Remove NULL results if any dataset failed processing
processed_data_list <- processed_data_list[!sapply(processed_data_list, is.null)]

# Assign the processed datasets back to their variables
insurance_training <- processed_data_list[[1]]
insurance_evaluation <- processed_data_list[[2]]

# ---- Define Numeric and Categorical Variables ----
numeric_vars <- names(insurance_training)[sapply(insurance_training, is.numeric)]
categorical_vars <- names(insurance_training)[sapply(insurance_training, is.factor)]

# ---- Function for Imputation Without Missing Flags ----
impute_data <- function(data, numeric_cols, categorical_cols, exclude_vars = NULL) {
  # Create a copy to avoid altering the original dataset
  data_imputed <- data
  exclude_vars <- exclude_vars %>% unique()
  
  # Initialize a list to store summaries
  summary_list <- list()
  
  # Handle numeric columns
  for (col in numeric_cols) {
    if (!col %in% exclude_vars) {
      # Record missing and negative counts before imputation
      missing_count_before <- sum(is.na(data_imputed[[col]]))
      negative_count_before <- sum(data_imputed[[col]] < 0, na.rm = TRUE)
      
      # Impute missing values with the median
      if (missing_count_before > 0) {
        observed_median <- median(data_imputed[[col]], na.rm = TRUE)
        imputation_value <- observed_median
        data_imputed[[col]][is.na(data_imputed[[col]])] <- imputation_value
      }
      
      # Impute negative values with the median of non-negative values
      if (negative_count_before > 0) {
        observed_median <- median(data_imputed[[col]][data_imputed[[col]] >= 0], na.rm = TRUE)
        data_imputed[[col]][data_imputed[[col]] < 0] <- observed_median
      }
      
      # Record missing and negative counts after imputation
      missing_count_after <- sum(is.na(data_imputed[[col]]))
      negative_count_after <- sum(data_imputed[[col]] < 0, na.rm = TRUE)
      
      summary_list[[col]] <- data.frame(
        Variable = col,
        Missing_Count_Before = missing_count_before,
        Negative_Count_Before = negative_count_before,
        Missing_Count_After = missing_count_after,
        Negative_Count_After = negative_count_after
      )
    }
  }
  
  # Handle categorical columns (e.g., mode imputation)
  for (col in categorical_cols) {
    if (!col %in% exclude_vars) {
      # Record missing count before imputation
      missing_count_before <- sum(is.na(data_imputed[[col]]))
      
      # Impute missing values with the mode
      if (missing_count_before > 0) {
        mode_value <- names(sort(table(data_imputed[[col]], useNA = "no"), decreasing = TRUE))[1]
        data_imputed[[col]][is.na(data_imputed[[col]])] <- mode_value
      }
      
      # Record missing count after imputation
      missing_count_after <- sum(is.na(data_imputed[[col]]))
      
      summary_list[[col]] <- data.frame(
        Variable = col,
        Missing_Count_Before = missing_count_before,
        Negative_Count_Before = NA,  # Not applicable to categorical variables
        Missing_Count_After = missing_count_after,
        Negative_Count_After = NA
      )
    }
  }
  
  # Combine summaries into a single dataframe
  summary_df <- do.call(rbind, summary_list) %>%
    filter(
      Missing_Count_Before > 0 | Negative_Count_Before > 0 | 
        Missing_Count_After > 0 | Negative_Count_After > 0
    )
  
  return(list(Imputed_Data = data_imputed, Summary = summary_df))
}

# ---- Apply Imputation to Both Datasets ----
training_result <- impute_data(
  insurance_training, numeric_vars, categorical_vars
)
evaluation_result <- impute_data(
  insurance_evaluation, numeric_vars, categorical_vars,
  exclude_vars = c("TARGET_FLAG", "TARGET_AMT")
)

# Extract imputed datasets
training_data_imputed <- training_result$Imputed_Data
evaluation_data_imputed <- evaluation_result$Imputed_Data

# ---- Output Results ----
training_summary <- training_result$Summary
training_summary %>%
  kable(caption = "Training Data Imputation Summary") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Training Data Imputation Summary

Variable   Missing_Count_Before   Negative_Count_Before   Missing_Count_After   Negative_Count_After
AGE                           6                       0                     0                      0
YOJ                         454                       0                     0                      0
INCOME                      445                       0                     0                      0
HOME_VAL                    464                       0                     0                      0
CAR_AGE                     510                       1                     0                      0
evaluation_summary <- evaluation_result$Summary

cat("\nEvaluation Data Imputation Summary\n")
## 
## Evaluation Data Imputation Summary
print(evaluation_summary)
##          Variable Missing_Count_Before Negative_Count_Before
## AGE           AGE                    1                     0
## YOJ           YOJ                   94                     0
## INCOME     INCOME                  125                     0
## HOME_VAL HOME_VAL                  111                     0
## CAR_AGE   CAR_AGE                  129                     0
##          Missing_Count_After Negative_Count_After
## AGE                        0                    0
## YOJ                        0                    0
## INCOME                     0                    0
## HOME_VAL                   0                    0
## CAR_AGE                    0                    0

Identify near-zero variance predictors

# Load the required library
library(caret)

# Identify near-zero variance predictors
nzv_indices <- nearZeroVar(training_data_imputed)

# Check if any NZV predictors are present
if (length(nzv_indices) > 0) {
  # NZV predictors are present
  cat("Near-zero variance predictors detected.\n")
  
  # List the removed column names
  removed_columns <- colnames(training_data_imputed)[nzv_indices]
  cat("Removed the following near-zero variance predictors:\n")
  print(removed_columns)
  
  # Remove near-zero variance predictors
  filtered_data <- training_data_imputed[, -nzv_indices]
} else {
  # No NZV predictors are present
  cat("No near-zero variance predictors found.\n")
  
  # Keep the data unchanged
  filtered_data <- training_data_imputed
}
## No near-zero variance predictors found.

Finding and Removing Highly Correlated Variables

# Load the required library
library(caret)

# Select only numeric variables
numeric_data <- training_data_imputed[, sapply(training_data_imputed, is.numeric)]

# Compute the correlation matrix
correlations <- cor(numeric_data, use = "complete.obs")

# Identify highly correlated variables (absolute correlation > 0.75)
highCorr <- findCorrelation(correlations, cutoff = 0.75)

# Print the number of highly correlated variables
cat("Number of highly correlated variables:", length(highCorr), "\n")
## Number of highly correlated variables: 0
# Display the indices of the first few highly correlated variables
cat("Indices of highly correlated variables:\n")
## Indices of highly correlated variables:
print(head(highCorr))
## integer(0)
# Remove highly correlated variables only if any were found; subsetting with an
# empty index (-integer(0)) would otherwise drop every column
if (length(highCorr) > 0) {
  filtered_data <- numeric_data[, -highCorr]
} else {
  filtered_data <- numeric_data
}

# Display the structure of the filtered data
cat("\nStructure of the filtered data:\n")
## 
## Structure of the filtered data:
str(filtered_data)

Feature Engineering

Decision Tree for TARGET_AMT Where TARGET_FLAG = 1

Analysis of Decision Tree for TARGET_AMT Where TARGET_FLAG = 1

The decision tree reveals the following key insights about the factors influencing TARGET_AMT when TARGET_FLAG = 1:

  1. Top Predictor:
    • The variable BLUEBOOK (estimated value of the insured vehicle) is the most significant predictor. It splits the data into two branches:
      • BLUEBOOK < 17,000: A majority of observations fall here with a lower average TARGET_AMT.
      • BLUEBOOK >= 17,000: Further splits based on other variables.
  2. Interaction Terms:
    • BLUEBOOK and CAR_AGE:
      • For BLUEBOOK >= 17,000, the split depends on CAR_AGE. Older cars (CAR_AGE >= 13) result in lower predicted TARGET_AMT.
    • BLUEBOOK, CAR_AGE, and HOME_VAL:
      • For newer cars (CAR_AGE < 13) with higher BLUEBOOK, HOME_VAL becomes important. A higher HOME_VAL (>= 320,000) leads to higher predicted TARGET_AMT.
  3. Significant Variable Interactions:
    • HOME_VAL and MSTATUS:
      • For high-value homes (HOME_VAL >= 320,000), marital status (MSTATUS) influences the prediction:
        • MSTATUS = Yes leads to moderately high TARGET_AMT.
        • MSTATUS = z_No results in the lowest observed TARGET_AMT.
  4. Potential Interaction Terms for Modeling:
    • BLUEBOOK * CAR_AGE: Interaction between car value and age, reflecting their combined effect on TARGET_AMT.
    • CAR_AGE * HOME_VAL: Reflects how the home value modifies the impact of car age on TARGET_AMT.
    • HOME_VAL * MSTATUS: Indicates the interaction between home value and marital status in determining TARGET_AMT.

Implications for Modeling:

  • The decision tree suggests incorporating interaction terms like BLUEBOOK * CAR_AGE and HOME_VAL * MSTATUS in regression or machine learning models to capture these relationships; a brief sketch follows this list.
  • Non-linear effects could be better captured by tree-based models (e.g., Random Forest or Gradient Boosting), given the hierarchical splits observed.
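A minimal sketch of how these interactions could enter a severity model, assuming the imputed training frame from the previous section; crash_data and severity_fit are illustrative names, and the actual regression models are fitted later.

# Crash-only rows, matching the tree above
crash_data <- training_data_imputed %>% filter(TARGET_FLAG == 1)

# Linear model with the tree-suggested interaction terms
severity_fit <- lm(
  TARGET_AMT ~ BLUEBOOK * CAR_AGE + CAR_AGE * HOME_VAL + HOME_VAL * MSTATUS,
  data = crash_data
)
summary(severity_fit)
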
library(rpart.plot)
## Loading required package: rpart
library(dplyr)

# Filter the data for rows where TARGET_FLAG = 1
filtered_data_target_flag <- training_data_imputed %>% filter(TARGET_FLAG == 1)

# Ensure target variable and predictors are valid
target_variable <- "TARGET_AMT"
excluded_variables <- c("TARGET_FLAG", target_variable)
predictors <- setdiff(names(filtered_data_target_flag), excluded_variables)

# Check if predictors are valid
if (length(predictors) == 0) {
  stop("No valid predictors found. Ensure that filtered_data_target_flag has columns other than TARGET_FLAG and TARGET_AMT.")
}

# Create the formula
formula <- as.formula(paste(target_variable, "~", paste(predictors, collapse = "+")))

# Fit the decision tree model
tree_model <- rpart(
  formula,
  data = filtered_data_target_flag,
  method = "anova",  # Use "anova" for regression trees
  control = rpart.control(cp = 0.01)  # Adjust complexity parameter as needed
)

# Plot the tree
rpart.plot(
  tree_model,
  type = 3,        # Draw split labels below the nodes
  extra = 101,     # Show fitted value, n, and percent of observations
  box.palette = "auto",  # Automatically assign colors
  main = "Decision Tree for TARGET_AMT Where TARGET_FLAG = 1"
)

Decision Tree for TARGET_FLAG Excluding TARGET_AMT

Statement on Decision Tree for TARGET_FLAG Excluding TARGET_AMT

The decision tree identifies key patterns and interactions for predicting TARGET_FLAG while excluding the TARGET_AMT variable:

  1. Primary Split:
    • The variable OLDCLAIM (previous claim amounts) is the most significant predictor:
      • OLDCLAIM < 529: This branch represents a majority group with a higher likelihood of TARGET_FLAG = 0.
      • OLDCLAIM >= 529: This branch further splits based on job roles.
  2. Key Interaction Terms:
    • OLDCLAIM and JOB:
      • For OLDCLAIM >= 529, job categories become influential:
        • Individuals in roles such as Doctor, Lawyer, Manager, or Professional are more likely to have TARGET_FLAG = 0.
        • For roles such as Clerical, Home Maker, Student, and z_Blue Collar, additional factors come into play.
    • JOB and URBANICITY:
      • For the latter job roles, the living environment (URBANICITY) differentiates outcomes:
        • URBANICITY = z_Highly Rural/ Rural: Associated with a higher likelihood of TARGET_FLAG = 0.
        • URBANICITY = Highly Urban/ Urban: Splits further based on HOME_VAL.
    • URBANICITY and HOME_VAL:
      • In highly urban areas, HOME_VAL influences predictions:
        • Higher home values (HOME_VAL >= 69,000) correspond to a greater likelihood of TARGET_FLAG = 0.
        • Lower home values (HOME_VAL < 69,000) are linked to TARGET_FLAG = 1.
  3. Potential Interaction Terms for Modeling:
    • OLDCLAIM * JOB: Captures the combined effect of prior claims and occupation.
    • JOB * URBANICITY: Reflects how job roles interact with living environments to influence outcomes.
    • URBANICITY * HOME_VAL: Indicates the interplay between urban/rural settings and home value.

Implications for Predictive Modeling:

  • Interaction terms like OLDCLAIM * JOB and URBANICITY * HOME_VAL can be incorporated into logistic regression models for better predictive power; a brief sketch follows this list.
  • Tree-based models such as Random Forest or Gradient Boosting are suitable for capturing these hierarchical relationships and non-linear effects.
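A minimal sketch of carrying these interactions into a logistic model, again on the imputed training frame; flag_fit is an illustrative name and the actual logistic models are fitted later.

flag_fit <- glm(
  TARGET_FLAG ~ OLDCLAIM * JOB + JOB * URBANICITY + URBANICITY * HOME_VAL,
  data = training_data_imputed,
  family = binomial
)
summary(flag_fit)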

library(rpart.plot)
library(dplyr)

# Ensure target variable and predictors are valid
target_variable <- "TARGET_FLAG"
excluded_variables <- c("TARGET_AMT", target_variable)
predictors <- setdiff(names(training_data_imputed), excluded_variables)

# Check if predictors are valid
if (length(predictors) == 0) {
  stop("No valid predictors found. Ensure that training_data_imputed has columns other than TARGET_FLAG and TARGET_AMT.")
}

# Create the formula
formula <- as.formula(paste(target_variable, "~", paste(predictors, collapse = "+")))

# Fit the decision tree model
tree_model <- rpart(
  formula,
  data = training_data_imputed,
  method = "class",  # Use "class" for classification trees
  control = rpart.control(cp = 0.01)  # Adjust complexity parameter as needed
)

# Plot the tree
rpart.plot(
  tree_model,
  type = 3,        # Draw split labels below the nodes
  extra = 104,     # Show class probabilities and percent of observations
  box.palette = "auto",  # Automatically assign colors
  main = "Decision Tree for TARGET_FLAG Excluding TARGET_AMT"
)

Feature Engineering

Using domain knowledge, we can tailor feature engineering and preprocessing strategies to align with the theoretical effects of the variables. Here’s how the domain knowledge can guide data preparation and feature engineering:

Key Insights from Domain Knowledge:

  1. Exclude Variables with No Predictive Value:
    • INDEX: Clearly identified as not useful for modeling and should be excluded.
  2. Target Variables:
    • TARGET_FLAG: Indicates whether there was a crash (binary outcome). This will likely be the primary target variable for classification models.
    • TARGET_AMT: Represents the cost of the crash. It can serve as the target for regression models when modeling severity.
  3. Important Predictors Based on Theoretical Effects:
    • AGE: Young and very old drivers may be riskier. This supports keeping AGE as a continuous variable or grouping into meaningful bins.
    • BLUEBOOK, CAR_AGE: Associated with the vehicle value and age, these could affect payouts and risk.
    • CAR_USE: Differentiating between commercial and private vehicles aligns with varying risk levels.
    • CLM_FREQ: A strong predictor for future claims.
    • REVOKED, MVR_PTS: Strong indicators of driving behavior and risk.
    • URBANICITY: Could reflect traffic density and accident likelihood.
    • TIF, YOJ: Reflect stability and may indirectly suggest risk levels.
  4. Potentially Less Relevant Predictors:
    • Variables like SEX and RED_CAR are flagged as possibly having no or weak effects.

Feature Engineering Suggestions:

1. Transformations Based on Theoretical Effects:

  • Ratio Variables:
    • KIDSDRIV_RATIO: Divide KIDSDRIV by AGE to capture how age impacts driving children.
    • HOMEKIDS_RATIO: Divide HOMEKIDS by AGE for similar reasoning.
  • Binning:
    • AGE: Group into categories based on the insight that young and very old drivers are riskier.
    • CAR_AGE: Group into “New,” “Moderate,” and “Old” vehicles.
  • Interaction Effects:
    • Create interaction terms between variables like CLM_FREQ and MVR_PTS to capture combined risk indicators.

2. Keep Variables Continuous Where Appropriate:

  • Variables like INCOME, HOME_VAL, AGE, and BLUEBOOK should remain continuous.

3. One-Hot Encoding for Categorical Variables:

  • Convert CAR_TYPE, CAR_USE, PARENT1, RED_CAR, REVOKED, MSTATUS, EDUCATION, and SEX into dummy variables for model compatibility.
# Load required libraries
library(dplyr)
library(fastDummies)
library(car)

# Function to clean column names
clean_column_names <- function(dataset) {
  colnames(dataset) <- gsub(" ", "", colnames(dataset))                # Remove spaces
  colnames(dataset) <- gsub("/", "", colnames(dataset))                # Remove forward slashes
  colnames(dataset) <- gsub("-", "_", colnames(dataset))               # Replace hyphens with underscores
  colnames(dataset) <- gsub("\\+", "plus", colnames(dataset))          # Replace '+' with 'plus'
  colnames(dataset) <- gsub("[^A-Za-z0-9_]", "", colnames(dataset))    # Remove special characters
  return(dataset)
}

# Function to perform Min-Max Normalization
min_max_normalize <- function(data, exclude_cols = NULL) {
  numeric_cols <- setdiff(names(data)[sapply(data, is.numeric)], exclude_cols)
  for (col in numeric_cols) {
    min_val <- min(data[[col]], na.rm = TRUE)
    max_val <- max(data[[col]], na.rm = TRUE)
    if (max_val != min_val) {
      data[[col]] <- (data[[col]] - min_val) / (max_val - min_val)
    } else {
      data[[col]] <- 0
    }
  }
  return(data)
}

# Function to apply Box-Cox Transformation with variable-specific lambda estimation
box_cox_transform <- function(data, exclude_cols = NULL) {
  numeric_cols <- setdiff(names(data)[sapply(data, is.numeric)], exclude_cols)
  for (col in numeric_cols) {
    if (all(data[[col]] > 0)) {
      bc <- boxcox(lm(data[[col]] ~ 1), plotit = FALSE)
      lambda <- bc$x[which.max(bc$y)]
      if (lambda == 0) {
        data[[col]] <- log(data[[col]])
      } else {
        data[[col]] <- (data[[col]]^lambda - 1) / lambda
      }
    }
  }
  return(data)
}

# Function to dummy encode categorical variables
dummy_encode <- function(data) {
  categorical_cols <- colnames(data %>% dplyr::select(where(is.character) | where(is.factor)))
  data <- fastDummies::dummy_cols(
    data,
    select_columns = categorical_cols,
    remove_first_dummy = TRUE,
    remove_selected_columns = TRUE
  )
  return(data)
}

# Function to add new features (including buckets and interactions)
add_features <- function(data, is_evaluation = FALSE) {
  data <- data %>%
    mutate(
      # Tree-based Binary Splits
      OLDCLAIM_SPLIT = ifelse(OLDCLAIM >= 529, 1, 0),
      BLUEBOOK_SPLIT = ifelse(BLUEBOOK < 17000, 1, 0),
      
      # Interaction Terms
      JOB_URBANICITY = interaction(JOB, URBANICITY, drop = TRUE),
      HOMEVAL_URBANICITY = HOME_VAL * (URBANICITY == "Highly Urban/ Urban"),
      
      # Ratio Variables
      KIDSDRIV_RATIO = ifelse(AGE > 0, KIDSDRIV / AGE, 0),
      
      # Travel Time Buckets
      TRAVTIME_BUCKET = cut(
        TRAVTIME, 
        breaks = c(0, 30, 60, Inf), 
        labels = c("Short", "Medium", "Long"),
        right = FALSE
      ),
      
      # Age Buckets
      AGE_BIN = cut(
        AGE,
        breaks = c(0, 25, 45, 65, Inf), 
        labels = c("Young", "Adult", "Senior", "Elderly"),
        right = FALSE
      ),
      
      # Income Buckets
      INCOME_BIN = cut(
        INCOME,
        breaks = c(0, 25000, 50000, 100000, Inf), 
        labels = c("Low", "Medium", "High", "Very High"),
        right = FALSE
      )
    )
  return(data)
}

# Main function to process the data
process_data <- function(data, apply_box_cox_to_target_amt = FALSE) {
  # Extract INDEX, TARGET_FLAG, and TARGET_AMT
  retained_cols <- data %>% dplyr::select(INDEX, TARGET_FLAG, TARGET_AMT)
  
  # Exclude INDEX, TARGET_FLAG, and TARGET_AMT from transformations
  data <- data %>% dplyr::select(-INDEX, -TARGET_FLAG, -TARGET_AMT)
  
  # Exclude TARGET_AMT from transformations if specified
  exclude_cols <- if (apply_box_cox_to_target_amt) NULL else "TARGET_AMT"
  
  # Apply transformations
  data <- min_max_normalize(data, exclude_cols = exclude_cols)
  data <- box_cox_transform(data, exclude_cols = exclude_cols)
  data <- dummy_encode(data)
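  # Note: add_features() is defined above but is not invoked in this pipeline,
  # so the bucket and interaction features are not part of the processed output.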
  
  # Clean column names after dummy encoding
  data <- clean_column_names(data)
  
  # Combine INDEX, TARGET_FLAG, and TARGET_AMT back with the transformed data
  data <- cbind(retained_cols, data)
  
  # Ensure INDEX, TARGET_FLAG, and TARGET_AMT are the first columns
  data <- data %>% relocate(INDEX, TARGET_FLAG, TARGET_AMT)
  
  return(data)
}

# Apply transformation to training data
processed_training_data <- process_data(training_data_imputed, apply_box_cox_to_target_amt = TRUE)

# Apply transformation to evaluation data
processed_evaluation_data <- process_data(evaluation_data_imputed, apply_box_cox_to_target_amt = FALSE)

# View the transformed datasets
cat("\n### Glimpse of Transformed Training Data ###\n")
## 
## ### Glimpse of Transformed Training Data ###
dim(processed_training_data)
## [1] 8161   40
cat("\n### Glimpse of Transformed Evaluation Data ###\n")
## 
## ### Glimpse of Transformed Evaluation Data ###
dim(processed_evaluation_data)
## [1] 2141   40

Balance Dataset

Before balancing, the dataset had a class imbalance with 6,008 instances of class 0 and 2,153 instances of class 1. After applying SMOTE, the classes are approximately balanced, with 6,008 instances of class 0 and 6,459 instances of class 1.

# Install and load necessary libraries
if (!require(smotefamily)) {
  install.packages("smotefamily")
}
## Loading required package: smotefamily
if (!require(fastDummies)) {
  install.packages("fastDummies")
}
library(smotefamily)
library(fastDummies)
library(dplyr)

# ---- Verify Before Balancing Data ----
cat("\nBefore balancing: ", "", "\n")
## 
## Before balancing:
table(processed_training_data$TARGET_FLAG)
## 
##    0    1 
## 6008 2153
# Separate the features from the target; SMOTE requires numeric features
x_data <- processed_training_data[, -which(names(processed_training_data) == "TARGET_FLAG")]  # Exclude TARGET_FLAG
y_data <- processed_training_data$TARGET_FLAG                                      # Target variable

# ---- Apply SMOTE ----
set.seed(123)
smote_result <- SMOTE(
  x_data,
  y_data,
  K = 5,
  dup_size = 2  # Adjust duplication size as needed
)

# Extract the balanced dataset
data_prepared <- smote_result$data
data_prepared$TARGET_FLAG <- as.factor(data_prepared$class)  # Assign the target column back
data_prepared$class <- NULL                                 # Remove the redundant class column

# ---- Verify Balanced Data ----
cat("\nAfter balancing: ", "", "\n")
## 
## After balancing:
table(data_prepared$TARGET_FLAG)
## 
##    0    1 
## 6008 6459

Build Models

We built five binary logistic regression models using TARGET_FLAG as the target and five multiple linear regression models using TARGET_AMT as the target. For each model type, we fitted the full model with all features, models selected by forward, backward, stepwise, and recursive feature selection, and a model built through manual feature selection.
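A minimal sketch of the stepwise searches, assuming the balanced frame data_prepared from the previous section; the object names are illustrative, and INDEX and TARGET_AMT are dropped from the predictors to avoid leaking the target.

library(MASS)

# Full and null logistic models bracket the search space
full_blr <- glm(TARGET_FLAG ~ . - INDEX - TARGET_AMT,
                data = data_prepared, family = binomial)
null_blr <- glm(TARGET_FLAG ~ 1, data = data_prepared, family = binomial)

# Both-direction stepwise selection by AIC
step_blr <- stepAIC(full_blr, direction = "both", trace = FALSE)

# Forward selection needs the search range spelled out explicitly
fwd_blr <- stepAIC(null_blr, direction = "forward",
                   scope = list(lower = null_blr, upper = full_blr),
                   trace = FALSE)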

Build Binary Logistic Models

Techniques Used for Manual Feature Selection for Binary Logistic Models

1. Custom Logistic Regression Formula Selection:

  • Source: The formula was derived by integrating insights from the decision tree model and the feature importance chart from the Random Forest model.
  • Key Variables:
    • OLDCLAIM and CLM_FREQ: High feature importance in Random Forest and present in the top splits of the decision tree.
    • URBANICITY_z_HighlyRuralRural and CAR_USE_Private: Frequently appear in key splits of the decision tree and rank high in Random Forest importance.
    • Interaction Terms: Introduced based on variable relationships observed (e.g., URBANICITY_z_HighlyRuralRural:CAR_USE_Private) to capture complex effects.
    • Quadratic Terms: Added for OLDCLAIM and CAR_AGE to capture non-linear relationships evident in tree splits and residual plots.
  • Excluded Variables: Variables with low importance in Random Forest and minimal decision tree splits (e.g., JOB categories with low Gini decrease).

2. Variable Inclusion/Exclusion:

  • Variables were included if they:
    1. Contributed significantly to splits in the decision tree.
    2. Had high importance in Random Forest based on Mean Decrease Gini.
  • Variables were excluded if they:
    1. Did not significantly influence predictions (low Gini importance).
    2. Were overly collinear with other included variables (e.g., multiple similar job categories).

3. Why the Custom Formula?

  • The custom formula balances predictive power and interpretability:
    • Focus on impactful predictors ensures relevance and minimizes overfitting.
    • Quadratic and interaction terms capture essential complexity while maintaining model simplicity.

This approach ensures that the model is data-driven, interpretable, and aligned with observed feature importance and decision tree logic.

Determine Feature Importance and Interactions

Feature Importance Using Random Forest

# Install and load necessary libraries
if (!require(randomForest)) {
  install.packages("randomForest")
}
## Loading required package: randomForest
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:gridExtra':
## 
##     combine
## The following object is masked from 'package:ggplot2':
## 
##     margin
## The following object is masked from 'package:dplyr':
## 
##     combine
if (!require(caret)) {
  install.packages("caret")
}
library(randomForest)
library(caret)

# ---- Split Data into Training and Testing Sets ----
set.seed(123)  # For reproducibility
train_indices <- createDataPartition(data_prepared$TARGET_FLAG, p = 0.8, list = FALSE)

train_data <- data_prepared[train_indices, ]
test_data <- data_prepared[-train_indices, ]

# ---- Train Random Forest Model ----
set.seed(123)  # For reproducibility
rf_model <- randomForest(
  TARGET_FLAG ~ .,
  data = train_data,
  importance = TRUE,  # Enable importance calculation
  ntree = 500,        # Number of trees
  mtry = floor(sqrt(ncol(train_data) - 1))  # Number of variables tried at each split
)

# ---- Feature Importance ----
importance_values <- importance(rf_model)  # Get importance values
importance_df <- data.frame(
  Feature = rownames(importance_values),
  MeanDecreaseGini = importance_values[, "MeanDecreaseGini"]
)

# ---- Sort by Importance ----
importance_df <- importance_df[order(-importance_df$MeanDecreaseGini), ]

# ---- Display Top Features ----
cat("\nTop 10 Features by Importance:\n")
## 
## Top 10 Features by Importance:
print(head(importance_df, 10))
##                                                     Feature MeanDecreaseGini
## TARGET_AMT                                       TARGET_AMT       3111.44888
## CLM_FREQ                                           CLM_FREQ        288.51358
## OLDCLAIM                                           OLDCLAIM        199.48762
## URBANICITY_z_HighlyRuralRural URBANICITY_z_HighlyRuralRural        154.35647
## CAR_USE_Private                             CAR_USE_Private        132.82793
## MSTATUS_z_No                                   MSTATUS_z_No        127.19415
## MVR_PTS                                             MVR_PTS        119.32058
## HOMEKIDS                                           HOMEKIDS         91.59452
## EDUCATION_z_HighSchool               EDUCATION_z_HighSchool         62.74299
## PARENT1_Yes                                     PARENT1_Yes         62.62347
# ---- Plot Feature Importance ----
library(ggplot2)
ggplot(importance_df, aes(x = reorder(Feature, MeanDecreaseGini), y = MeanDecreaseGini)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  coord_flip() +
  ggtitle("Feature Importance (Random Forest)") +
  xlab("Features") +
  ylab("Mean Decrease Gini") +
  theme_minimal()

Determine Interactions: Tree Model to Visualize Interactions after Dummy Encoding (Decision Tree for TARGET_FLAG, Excluding TARGET_AMT)

Note that TARGET_AMT dominates the Random Forest importance ranking above only because it is nonzero exactly when TARGET_FLAG = 1; it leaks the outcome, so it is excluded from the tree below and from the logistic models.

library(rpart.plot)
library(dplyr)

# Ensure target variable and predictors are valid
target_variable <- "TARGET_FLAG"
excluded_variables <- c("TARGET_AMT", target_variable)
predictors <- setdiff(names(data_prepared), excluded_variables)

# Check if predictors are valid
if (length(predictors) == 0) {
  stop("No valid predictors found. Ensure that training_data_imputed has columns other than TARGET_FLAG and TARGET_AMT.")
}

# Create the formula
formula <- as.formula(paste(target_variable, "~", paste(predictors, collapse = "+")))

# Fit the decision tree model
tree_model <- rpart(
  formula,
  data = data_prepared,
  method = "class",  # Use "class" for classification trees
  control = rpart.control(cp = 0.01)  # Adjust complexity parameter as needed
)

# Plot the tree
rpart.plot(
  tree_model,
  type = 3,        # Draw tree with extra info
  extra = 104,     # Class probabilities plus % of observations in each node
  box.palette = "auto",  # Automatically assign colors
  main = "Decision Tree for TARGET_FLAG Excluding TARGET_AMT"
)

Define the custom logistic regression formula based on Tree Model and Random Forest Feature Importance (Logistic Regression)

# Define the custom logistic regression formula
custom_formula <- TARGET_FLAG ~ OLDCLAIM + CLM_FREQ + URBANICITY_z_HighlyRuralRural +
                                   CAR_USE_Private + PARENT1_Yes + SEX_z_F +
                                   EDUCATION_z_HighSchool + CAR_AGE +
                                   I(OLDCLAIM^2) + I(CAR_AGE^2) +
                                   OLDCLAIM:CLM_FREQ +
                                   URBANICITY_z_HighlyRuralRural:CAR_USE_Private +
                                   PARENT1_Yes:SEX_z_F

Fit the Binary Logistic Models

# Load necessary libraries
library(dplyr)
if (!require("pROC")) install.packages("pROC")
library(pROC)
library(caret)
if (!require("MASS")) install.packages("MASS")
library(MASS)

# ---- Step 1: Data Preparation ----

# Drop INDEX and TARGET_AMT and place TARGET_FLAG as the first column
logistic_data <- data_prepared %>%
  dplyr::select(-INDEX, -TARGET_AMT) %>%
  dplyr::select(TARGET_FLAG, everything())

# Ensure TARGET_FLAG is a factor for logistic regression
logistic_data <- logistic_data %>%
  mutate(TARGET_FLAG = as.factor(TARGET_FLAG))

# ---- Step 2: Fit Full Model ----
logistic_formula <- TARGET_FLAG ~ .
full_model_blr <- glm(formula = logistic_formula, family = binomial, data = logistic_data)

# ---- Step 3: Fit Custom Model ----
custom_formula <- TARGET_FLAG ~ OLDCLAIM + CLM_FREQ + URBANICITY_z_HighlyRuralRural +
                                   CAR_USE_Private + PARENT1_Yes + SEX_z_F +
                                   EDUCATION_z_HighSchool + CAR_AGE +
                                   I(OLDCLAIM^2) + I(CAR_AGE^2) +
                                   OLDCLAIM:CLM_FREQ +
                                   URBANICITY_z_HighlyRuralRural:CAR_USE_Private +
                                   PARENT1_Yes:SEX_z_F
custom_model_blr <- glm(formula = custom_formula, family = binomial, data = logistic_data)

# ---- Step 4: Model Selection ----
forward_model_blr <- step(glm(TARGET_FLAG ~ 1, data = logistic_data, family = binomial), 
                          scope = list(lower = TARGET_FLAG ~ 1, upper = logistic_formula), 
                          direction = "forward", 
                          trace = FALSE)

backward_model_blr <- step(full_model_blr, direction = "backward", trace = FALSE)

stepwise_model_blr <- step(full_model_blr, direction = "both", trace = FALSE)

# ---- Step 5: Evaluate Models ----
evaluate_model <- function(model, data) {
  predicted_probs <- predict(model, newdata = data, type = "response")
  predicted_classes <- ifelse(predicted_probs > 0.5, "1", "0")
  
  cm <- confusionMatrix(factor(predicted_classes, levels = c("0", "1")), 
                        factor(data$TARGET_FLAG, levels = c("0", "1")))
  
  roc_obj <- roc(data$TARGET_FLAG, predicted_probs)
  auc <- auc(roc_obj)
  
  metrics <- data.frame(
    Accuracy = cm$overall["Accuracy"],
    Sensitivity = cm$byClass["Sensitivity"],
    Specificity = cm$byClass["Specificity"],
    Precision = cm$byClass["Precision"],
    F1 = cm$byClass["F1"],
    AUC = auc
  )
  return(list(Metrics = metrics, ConfusionMatrix = cm$table))
}

# Evaluate all models
full_evaluation <- evaluate_model(full_model_blr, logistic_data)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
custom_evaluation <- evaluate_model(custom_model_blr, logistic_data)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
forward_evaluation <- evaluate_model(forward_model_blr, logistic_data)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
backward_evaluation <- evaluate_model(backward_model_blr, logistic_data)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
stepwise_evaluation <- evaluate_model(stepwise_model_blr, logistic_data)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
# ---- Step 6: Compare Models ----
model_metrics <- rbind(
  data.frame(Model = "Full_BLR", full_evaluation$Metrics),
  data.frame(Model = "Custom_BLR", custom_evaluation$Metrics),
  data.frame(Model = "Forward_BLR", forward_evaluation$Metrics),
  data.frame(Model = "Backward_BLR", backward_evaluation$Metrics),
  data.frame(Model = "Stepwise_BLR", stepwise_evaluation$Metrics)
)

model_metrics <- model_metrics[order(-model_metrics$Accuracy), ]
print(model_metrics)
##                  Model  Accuracy Sensitivity Specificity Precision        F1
## Accuracy3 Backward_BLR 0.7527874   0.7065579   0.7957888 0.7629403 0.7336675
## Accuracy4 Stepwise_BLR 0.7527874   0.7065579   0.7957888 0.7629403 0.7336675
## Accuracy      Full_BLR 0.7526269   0.7067244   0.7953244 0.7625718 0.7335867
## Accuracy1   Custom_BLR 0.7165316   0.6644474   0.7649791 0.7245009 0.6931759
## Accuracy2  Forward_BLR 0.5180878   0.0000000   1.0000000        NA        NA
##                 AUC
## Accuracy3 0.8272851
## Accuracy4 0.8272851
## Accuracy  0.8272768
## Accuracy1 0.7781979
## Accuracy2 0.5000000
# ---- Step 7: Combine AIC and LogLikelihood ----
model_aic_loglik <- data.frame(
  Model = c("Full_BLR", "Custom_BLR", "Forward_BLR", "Backward_BLR", "Stepwise_BLR"),
  AIC = c(AIC(full_model_blr), AIC(custom_model_blr), AIC(forward_model_blr), AIC(backward_model_blr), AIC(stepwise_model_blr)),
  LogLikelihood = c(logLik(full_model_blr), logLik(custom_model_blr), logLik(forward_model_blr), logLik(backward_model_blr), logLik(stepwise_model_blr))
)

model_aic_loglik <- model_aic_loglik[order(-model_aic_loglik$LogLikelihood), ]
print(model_aic_loglik)
##          Model      AIC LogLikelihood
## 1     Full_BLR 12778.37     -6351.184
## 4 Backward_BLR 12770.55     -6352.273
## 5 Stepwise_BLR 12770.55     -6352.273
## 2   Custom_BLR 14084.92     -7028.460
## 3  Forward_BLR 17268.61     -8633.307
# ---- Step 8: Plot ROC Curves ----
par(mfrow = c(2, 3))
roc_full <- roc(logistic_data$TARGET_FLAG, predict(full_model_blr, type = "response"))
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
plot(roc_full, main = "Full Model ROC", col = "blue", lwd = 2)

roc_custom <- roc(logistic_data$TARGET_FLAG, predict(custom_model_blr, type = "response"))
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
plot(roc_custom, main = "Custom Model ROC", col = "darkgreen", lwd = 2)

roc_forward <- roc(logistic_data$TARGET_FLAG, predict(forward_model_blr, type = "response"))
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
plot(roc_forward, main = "Forward Selection ROC", col = "green", lwd = 2)

roc_backward <- roc(logistic_data$TARGET_FLAG, predict(backward_model_blr, type = "response"))
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
plot(roc_backward, main = "Backward Elimination ROC", col = "red", lwd = 2)

roc_stepwise <- roc(logistic_data$TARGET_FLAG, predict(stepwise_model_blr, type = "response"))
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
plot(roc_stepwise, main = "Stepwise Selection ROC", col = "purple", lwd = 2)

cat("\n--- Model Comparison Completed ---\n")
## 
## --- Model Comparison Completed ---

Interpret the Coefficients of the Stepwise Binary Logistic Model

Notable Coefficients and Their Interpretations

  1. KIDSDRIV (Estimate: 1.81406, p < 0.001):
    • Positive coefficient indicates that households with more kids driving are more likely to have an incident (TARGET_FLAG = 1).
    • This makes sense as younger or less experienced drivers may increase the likelihood of claims.
  2. MVR_PTS (Estimate: 1.75588, p < 0.001):
    • Positive coefficient indicates that more motor vehicle record points (indicating traffic violations) are strongly associated with higher claim probability.
    • This is intuitive as traffic violations suggest riskier driving behavior.
  3. CAR_USE_Private (Estimate: -0.89412, p < 0.001):
    • Negative coefficient implies that privately used cars are less likely to be involved in claims compared to other car usage types (e.g., commercial).
    • This aligns with expectations as private car use might involve less risky driving patterns.
  4. URBANICITY_z_HighlyRuralRural (Estimate: -2.59345, p < 0.001):
    • Strongly negative coefficient suggests that rural areas have a significantly lower probability of claims than urban settings.
    • This is reasonable as rural areas generally have less traffic congestion and fewer accidents.
  5. REVOKED_Yes (Estimate: 1.00377, p < 0.001):
    • Positive coefficient indicates that individuals with a revoked license are more likely to file claims.
    • This aligns with expectations, as a revoked license typically reflects prior risky driving behavior.
  6. TIF (Estimate: -1.64766, p < 0.001):
    • Negative coefficient suggests that a longer tenure with the insurer decreases the likelihood of a claim.
    • This is logical since longer-tenured customers may exhibit more stable and lower-risk behaviors.
  7. EDUCATION_Bachelors (Estimate: -0.49232, p < 0.001):
    • Negative coefficient implies that individuals with a bachelor’s degree are less likely to have claims than those with lower education levels.
    • This might correlate with more cautious driving behavior or better financial decision-making.

Summary:

Most of the coefficients align with logical expectations. Any unexpected signs or magnitudes should be flagged for further analysis and cross-referenced with domain knowledge to confirm model validity.
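Because logistic coefficients are on the log-odds scale, exponentiating them yields odds ratios, which are easier to interpret. A minimal sketch, assuming the stepwise_model_blr object fitted above:

# Convert the stepwise model's log-odds coefficients to odds ratios
odds_ratios <- exp(coef(stepwise_model_blr))

# Approximate 95% Wald confidence intervals on the odds-ratio scale
or_ci <- exp(confint.default(stepwise_model_blr))

odds_ratio_table <- data.frame(
  OddsRatio = round(odds_ratios, 3),
  CI_Lower  = round(or_ci[, 1], 3),
  CI_Upper  = round(or_ci[, 2], 3)
)
print(odds_ratio_table)

For example, exp(1.00377) is roughly 2.73 for REVOKED_Yes: a revoked license is associated with about 2.7 times the odds of a crash, holding the other predictors fixed.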

# Display Coefficients of the Stepwise Model
cat("\n--- Stepwise Model Coefficients ---\n")
## 
## --- Stepwise Model Coefficients ---
stepwise_blr_coefficients <- summary(stepwise_model_blr)$coefficients
print(stepwise_blr_coefficients)
##                                 Estimate Std. Error     z value      Pr(>|z|)
## (Intercept)                   -0.1226147 0.19498833  -0.6288307  5.294599e-01
## KIDSDRIV                       1.8140571 0.20716773   8.7564659  2.014676e-18
## HOMEKIDS                       0.3779910 0.13922611   2.7149433  6.628712e-03
## YOJ                           -0.2738037 0.15846188  -1.7278838  8.400906e-02
## INCOME                        -1.4080183 0.31053023  -4.5342390  5.781154e-06
## HOME_VAL                      -1.1280342 0.23901082  -4.7195947  2.363151e-06
## TRAVTIME                       2.3279212 0.20840939  11.1699438  5.721707e-29
## BLUEBOOK                      -1.5543856 0.27562526  -5.6394889  1.705557e-08
## TIF                           -1.6476584 0.13872869 -11.8768402  1.561594e-32
## OLDCLAIM                      -0.8804619 0.18870085  -4.6659138  3.072485e-06
## CLM_FREQ                       1.1530856 0.11797849   9.7736936  1.460291e-22
## MVR_PTS                        1.7558806 0.14818287  11.8494167  2.166940e-32
## PARENT1_Yes                    0.4572161 0.09061037   5.0459581  4.512543e-07
## MSTATUS_z_No                   0.5889337 0.06640779   8.8684420  7.417638e-19
## SEX_z_F                       -0.1319968 0.07669194  -1.7211301  8.522723e-02
## EDUCATION_Bachelors           -0.4923173 0.06575848  -7.4867491  7.060047e-14
## EDUCATION_Masters             -0.3562202 0.11114314  -3.2050580  1.350353e-03
## EDUCATION_PhD                 -0.2252638 0.13762045  -1.6368486  1.016621e-01
## JOB_Clerical                   0.6218006 0.14785936   4.2053516  2.606766e-05
## JOB_HomeMaker                  0.4408650 0.15757489   2.7978123  5.145000e-03
## JOB_Lawyer                     0.2770190 0.12306102   2.2510702  2.438109e-02
## JOB_Manager                   -0.4368174 0.12310249  -3.5484041  3.875730e-04
## JOB_Professional               0.3273497 0.13320104   2.4575613  1.398839e-02
## JOB_Student                    0.4988078 0.16480519   3.0266508  2.472795e-03
## JOB_z_BlueCollar               0.5268294 0.14100057   3.7363638  1.867005e-04
## CAR_USE_Private               -0.8941207 0.07091970 -12.6075071  1.919696e-36
## CAR_TYPE_PanelTruck            0.5873070 0.12690955   4.6277608  3.696406e-06
## CAR_TYPE_Pickup                0.5975987 0.07916314   7.5489516  4.387759e-14
## CAR_TYPE_SportsCar             1.1564481 0.10101474  11.4483096  2.397755e-30
## CAR_TYPE_Van                   0.6558120 0.09919905   6.6110716  3.815479e-11
## CAR_TYPE_z_SUV                 0.8921604 0.08533171  10.4552032  1.387011e-25
## REVOKED_Yes                    1.0037730 0.07723678  12.9960494  1.288285e-38
## URBANICITY_z_HighlyRuralRural -2.5934535 0.08059043 -32.1806644 3.290749e-227
# Format the coefficients for better readability
stepwise_coefficients_df <- as.data.frame(stepwise_blr_coefficients)
stepwise_coefficients_df <- cbind(Variable = rownames(stepwise_coefficients_df), stepwise_coefficients_df)
rownames(stepwise_coefficients_df) <- NULL

# Print formatted coefficients
print(stepwise_coefficients_df)
##                         Variable   Estimate Std. Error     z value
## 1                    (Intercept) -0.1226147 0.19498833  -0.6288307
## 2                       KIDSDRIV  1.8140571 0.20716773   8.7564659
## 3                       HOMEKIDS  0.3779910 0.13922611   2.7149433
## 4                            YOJ -0.2738037 0.15846188  -1.7278838
## 5                         INCOME -1.4080183 0.31053023  -4.5342390
## 6                       HOME_VAL -1.1280342 0.23901082  -4.7195947
## 7                       TRAVTIME  2.3279212 0.20840939  11.1699438
## 8                       BLUEBOOK -1.5543856 0.27562526  -5.6394889
## 9                            TIF -1.6476584 0.13872869 -11.8768402
## 10                      OLDCLAIM -0.8804619 0.18870085  -4.6659138
## 11                      CLM_FREQ  1.1530856 0.11797849   9.7736936
## 12                       MVR_PTS  1.7558806 0.14818287  11.8494167
## 13                   PARENT1_Yes  0.4572161 0.09061037   5.0459581
## 14                  MSTATUS_z_No  0.5889337 0.06640779   8.8684420
## 15                       SEX_z_F -0.1319968 0.07669194  -1.7211301
## 16           EDUCATION_Bachelors -0.4923173 0.06575848  -7.4867491
## 17             EDUCATION_Masters -0.3562202 0.11114314  -3.2050580
## 18                 EDUCATION_PhD -0.2252638 0.13762045  -1.6368486
## 19                  JOB_Clerical  0.6218006 0.14785936   4.2053516
## 20                 JOB_HomeMaker  0.4408650 0.15757489   2.7978123
## 21                    JOB_Lawyer  0.2770190 0.12306102   2.2510702
## 22                   JOB_Manager -0.4368174 0.12310249  -3.5484041
## 23              JOB_Professional  0.3273497 0.13320104   2.4575613
## 24                   JOB_Student  0.4988078 0.16480519   3.0266508
## 25              JOB_z_BlueCollar  0.5268294 0.14100057   3.7363638
## 26               CAR_USE_Private -0.8941207 0.07091970 -12.6075071
## 27           CAR_TYPE_PanelTruck  0.5873070 0.12690955   4.6277608
## 28               CAR_TYPE_Pickup  0.5975987 0.07916314   7.5489516
## 29            CAR_TYPE_SportsCar  1.1564481 0.10101474  11.4483096
## 30                  CAR_TYPE_Van  0.6558120 0.09919905   6.6110716
## 31                CAR_TYPE_z_SUV  0.8921604 0.08533171  10.4552032
## 32                   REVOKED_Yes  1.0037730 0.07723678  12.9960494
## 33 URBANICITY_z_HighlyRuralRural -2.5934535 0.08059043 -32.1806644
##         Pr(>|z|)
## 1   5.294599e-01
## 2   2.014676e-18
## 3   6.628712e-03
## 4   8.400906e-02
## 5   5.781154e-06
## 6   2.363151e-06
## 7   5.721707e-29
## 8   1.705557e-08
## 9   1.561594e-32
## 10  3.072485e-06
## 11  1.460291e-22
## 12  2.166940e-32
## 13  4.512543e-07
## 14  7.417638e-19
## 15  8.522723e-02
## 16  7.060047e-14
## 17  1.350353e-03
## 18  1.016621e-01
## 19  2.606766e-05
## 20  5.145000e-03
## 21  2.438109e-02
## 22  3.875730e-04
## 23  1.398839e-02
## 24  2.472795e-03
## 25  1.867005e-04
## 26  1.919696e-36
## 27  3.696406e-06
## 28  4.387759e-14
## 29  2.397755e-30
## 30  3.815479e-11
## 31  1.387011e-25
## 32  1.288285e-38
## 33 3.290749e-227
# Optional: Save the coefficients to a CSV file for review
write.csv(stepwise_coefficients_df, "stepwise_model_coefficients.csv", row.names = FALSE)

Model Selection: Binary Logistic Models

Comparing Models: Prediction Power, Interpretability, and Explanatory Power

1. Prediction Power

The ROC curves and Log-Likelihood values provide insights into the predictive performance:

  • Backward and Stepwise Models: These models have nearly identical Log-Likelihood values and AIC scores, suggesting optimal predictive power while maintaining simplicity.
  • Full Model: The full model also demonstrates strong predictive power but may overfit due to including all variables.
  • Custom Model: Designed with domain expertise, it performs well but falls behind the Backward and Stepwise models in predictive power.
  • Forward Model: This model has the weakest prediction power, as evidenced by its higher AIC and lower Log-Likelihood value. Its metrics (sensitivity 0, specificity 1, AUC 0.5) show that the search never advanced past the intercept-only starting model, so it predicts the same class for every observation.
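As a quick check on these figures, AIC can be reproduced from the log-likelihood via AIC = 2k - 2*logLik, where k is the number of estimated parameters. A minimal sketch using the backward model fitted above:

# Reproduce the reported AIC from the log-likelihood: AIC = 2*df - 2*logLik
ll <- logLik(backward_model_blr)
manual_aic <- 2 * attr(ll, "df") - 2 * as.numeric(ll)
cat("Reported AIC:", AIC(backward_model_blr), "| Manual AIC:", manual_aic, "\n")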

2. Balance of Prediction Power and Interpretability

  • Stepwise and Backward Models: Achieve a good balance by selecting a reduced set of features while maintaining predictive power close to the full model. This improves interpretability by eliminating redundant or non-contributing variables.
  • Custom Model: Balances domain knowledge with simplicity but includes interaction terms and squared terms, making it less interpretable than the Stepwise/Backward models.
  • Full Model: Least interpretable due to the inclusion of all variables without selection.
  • Forward Model: Over-simplifies the model, potentially excluding meaningful predictors, which sacrifices both prediction power and explanatory capability.

3. Explanatory Power

  • Custom Model: Best for explanatory purposes due to its inclusion of interaction terms and higher-order variables based on domain knowledge.
  • Backward and Stepwise Models: Simplified while retaining most explanatory variables, offering a good compromise.
  • Full Model: Maximizes explanatory power but at the cost of introducing noise and potential overfitting.
  • Forward Model: Limited explanatory power due to exclusion of potentially significant predictors.

Best Model

For a practical balance of prediction power, interpretability, and explanatory power:

  • Use the Stepwise Model for general prediction tasks due to its simplicity and strong performance metrics.
  • Use the Custom Model when domain-specific interactions and explanatory insights are crucial.
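A minimal scoring sketch for the recommended stepwise model, assuming processed_evaluation_data carries the same engineered predictor columns as the training set (this scoring step is an illustration, not part of the run above):

# Score the hold-out evaluation set with the stepwise logistic model
eval_probs <- predict(stepwise_model_blr, newdata = processed_evaluation_data, type = "response")
eval_flags <- ifelse(eval_probs > 0.5, 1, 0)
head(data.frame(P_TARGET_FLAG = round(eval_probs, 4), TARGET_FLAG = eval_flags))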

Alternatives to Binary Logistic Models for Comparison

We will compare alternative classifiers on this dataset: Decision Tree, Random Forest, XGBoost, a support vector machine (labeled SVR in the code and results table below), an artificial neural network, MARS, KNN, and a baseline GLM.

# Load necessary libraries
if (!require("nnet")) install.packages("nnet")
## Loading required package: nnet
if (!require("earth")) install.packages("earth")
## Loading required package: earth
## Loading required package: Formula
## Loading required package: plotmo
## Loading required package: plotrix
if (!require("class")) install.packages("class")
## Loading required package: class
if (!require("kernlab")) install.packages("kernlab")
## Loading required package: kernlab
## 
## Attaching package: 'kernlab'
## The following object is masked from 'package:purrr':
## 
##     cross
## The following object is masked from 'package:ggplot2':
## 
##     alpha
library(dplyr)
library(caret)
library(pROC)
library(rpart)
library(ranger)
## 
## Attaching package: 'ranger'
## The following object is masked from 'package:randomForest':
## 
##     importance
library(xgboost)
## 
## Attaching package: 'xgboost'
## The following object is masked from 'package:dplyr':
## 
##     slice
library(nnet)
library(earth)
library(class)
library(kernlab)

# ---- Helper Function to Calculate Metrics ----
calculate_metrics <- function(predicted, actual) {
  cm <- confusionMatrix(factor(predicted, levels = c("0", "1")), factor(actual, levels = c("0", "1")))
  metrics <- data.frame(
    Accuracy = cm$overall["Accuracy"],
    Sensitivity = cm$byClass["Sensitivity"],
    Specificity = cm$byClass["Specificity"],
    Precision = cm$byClass["Precision"],
    F1 = cm$byClass["F1"]
  )
  return(metrics)
}

# ---- Step 1: Data Preparation ----
classification_data <- data_prepared %>%
  dplyr::select(-INDEX, -TARGET_AMT) %>%
  mutate(TARGET_FLAG = as.factor(TARGET_FLAG))

set.seed(123)
train_index <- createDataPartition(classification_data$TARGET_FLAG, p = 0.7, list = FALSE)
train_data <- classification_data[train_index, ]
test_data <- classification_data[-train_index, ]

# ---- Step 2: Model Implementation ----

# Decision Tree
tree_model <- rpart(TARGET_FLAG ~ ., data = train_data, method = "class", control = rpart.control(cp = 0.01))
tree_preds <- predict(tree_model, newdata = test_data, type = "class")
tree_metrics <- calculate_metrics(tree_preds, test_data$TARGET_FLAG)

# Random Forest
rf_model <- ranger(TARGET_FLAG ~ ., data = train_data, probability = TRUE)
rf_preds <- predict(rf_model, data = test_data)$predictions
rf_pred_class <- ifelse(rf_preds[, 2] > 0.5, "1", "0")
rf_metrics <- calculate_metrics(rf_pred_class, test_data$TARGET_FLAG)

# XGBoost
train_matrix <- xgb.DMatrix(data = as.matrix(train_data %>% dplyr::select(-TARGET_FLAG)), label = as.numeric(as.character(train_data$TARGET_FLAG)))
test_matrix <- xgb.DMatrix(data = as.matrix(test_data %>% dplyr::select(-TARGET_FLAG)))
xgb_bc_model <- xgboost(data = train_matrix, max_depth = 6, eta = 0.1, nrounds = 100, objective = "binary:logistic", verbose = 0)
xgb_preds <- predict(xgb_bc_model, newdata = test_matrix)
xgb_pred_class <- ifelse(xgb_preds > 0.5, "1", "0")
xgb_metrics <- calculate_metrics(xgb_pred_class, test_data$TARGET_FLAG)

# Support Vector Machine classifier (labeled "SVR" in the results table)
svr_model <- ksvm(TARGET_FLAG ~ ., data = train_data, type = "C-svc", kernel = "rbfdot")
svr_preds <- predict(svr_model, test_data)
svr_metrics <- calculate_metrics(svr_preds, test_data$TARGET_FLAG)

# Artificial Neural Network
ann_model <- nnet(TARGET_FLAG ~ ., data = train_data, size = 5, maxit = 200, decay = 0.01, linout = FALSE)
## # weights:  196
## initial  value 6087.007355 
## iter  10 value 4636.752841
## iter  20 value 4371.379136
## iter  30 value 4110.684823
## iter  40 value 3912.449352
## iter  50 value 3759.412839
## iter  60 value 3676.163201
## iter  70 value 3622.392206
## iter  80 value 3577.418311
## iter  90 value 3546.287875
## iter 100 value 3531.116670
## iter 110 value 3518.883025
## iter 120 value 3508.430109
## iter 130 value 3493.214914
## iter 140 value 3479.034539
## iter 150 value 3459.676601
## iter 160 value 3428.016825
## iter 170 value 3398.058693
## iter 180 value 3365.730658
## iter 190 value 3345.031852
## iter 200 value 3311.607129
## final  value 3311.607129 
## stopped after 200 iterations
ann_preds <- predict(ann_model, test_data, type = "class")
ann_metrics <- calculate_metrics(ann_preds, test_data$TARGET_FLAG)

# MARS
mars_model <- earth(TARGET_FLAG ~ ., data = train_data)
mars_preds <- predict(mars_model, newdata = test_data, type = "class")
mars_metrics <- calculate_metrics(mars_preds, test_data$TARGET_FLAG)

# KNN
knn_preds <- knn(train = train_data %>% dplyr::select(-TARGET_FLAG), test = test_data %>% dplyr::select(-TARGET_FLAG), cl = train_data$TARGET_FLAG, k = 5)
knn_metrics <- calculate_metrics(knn_preds, test_data$TARGET_FLAG)

# Generalized Linear Model
glm_model <- glm(TARGET_FLAG ~ ., data = train_data, family = binomial)
glm_preds <- ifelse(predict(glm_model, newdata = test_data, type = "response") > 0.5, "1", "0")
glm_metrics <- calculate_metrics(glm_preds, test_data$TARGET_FLAG)

# ---- Step 3: Combine Metrics ----
model_metrics <- rbind(
  cbind(Model = "Decision Tree", tree_metrics),
  cbind(Model = "Random Forest", rf_metrics),
  cbind(Model = "XGBoost", xgb_metrics),
  cbind(Model = "SVR", svr_metrics),
  cbind(Model = "ANN", ann_metrics),
  cbind(Model = "MARS", mars_metrics),
  cbind(Model = "KNN", knn_metrics),
  cbind(Model = "GLM", glm_metrics)
)

# Print combined metrics
print(model_metrics)
##                   Model  Accuracy Sensitivity Specificity Precision        F1
## Accuracy  Decision Tree 0.7756085   0.7974473   0.7552917 0.7519623 0.7740372
## Accuracy1 Random Forest 0.8627975   0.9045505   0.8239546 0.8269914 0.8640339
## Accuracy2       XGBoost 0.8654721   0.9001110   0.8332473 0.8339332 0.8657593
## Accuracy3           SVR 0.8323081   0.8396226   0.8255034 0.8173960 0.8283603
## Accuracy4           ANN 0.8205403   0.8246393   0.8167269 0.8071700 0.8158111
## Accuracy5          MARS 0.8309708   0.8501665   0.8131131 0.8088701 0.8290043
## Accuracy6           KNN 0.7806900   0.7586016   0.8012390 0.7802511 0.7692741
## Accuracy7           GLM 0.7496657   0.7031077   0.7929788 0.7595923 0.7302594

Comparison of Binary Logistic and Non-Linear Models for Classification

Best Model for Balancing Accuracy and Interpretability

XGBoost and Random Forest deliver the strongest predictive performance: XGBoost achieves the highest accuracy (86.55%) and F1 score (0.866), with Random Forest close behind (86.28% accuracy, 0.864 F1). However, Stepwise Logistic Regression, with an accuracy of 75.28% and an F1 score of 0.734, provides the best balance between interpretability and model performance.

Stepwise Logistic Regression is recommended for scenarios where understanding variable effects is critical, as it offers transparency and clear insights into feature importance. While the tree ensembles are excellent for complex, accuracy-driven applications, Stepwise Regression is ideal for interpretability without sacrificing too much predictive power.
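The table above omits AUC for the non-linear classifiers. A minimal sketch computing it for the two ensembles, assuming rf_preds and xgb_preds from the previous chunk:

# AUC for the probability-producing ensemble models
library(pROC)
rf_auc  <- auc(roc(test_data$TARGET_FLAG, rf_preds[, 2]))  # ranger class-1 probabilities
xgb_auc <- auc(roc(test_data$TARGET_FLAG, xgb_preds))      # xgboost predicted probabilities
cat("Random Forest AUC:", rf_auc, "| XGBoost AUC:", xgb_auc, "\n")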

Build Models: Multiple Linear Regression Models

Techniques Used for Manual Feature Selection for Multiple Linear Regression

Tree Model for Feature Interaction

  1. Approach:
    • Used a decision tree to analyze interactions and relationships between predictors for predicting TARGET_AMT where TARGET_FLAG equals 1.
    • The decision tree identified key splits and hierarchies, highlighting predictors like BLUEBOOK.
  2. Significance:
    • The decision tree provides interpretability by visualizing splits.
    • Predictors like BLUEBOOK were prioritized as they appeared at higher splits in the tree.

Random Forest for Feature Importance

  1. Approach:
    • Random Forest was employed to rank feature importance for TARGET_AMT prediction.
    • Variables were ranked based on the %IncMSE and IncNodePurity metrics.
  2. Significance:
    • BLUEBOOK, MVR_PTS, and CAR_AGE emerged as the most influential predictors, aligning with domain knowledge.
    • Variables with lower importance were excluded to focus on impactful predictors, reducing model complexity.
  3. Integration with MLR:
    • Features identified as significant from Random Forest and Decision Tree were incorporated into the Multiple Linear Regression models.
    • Interaction terms and transformations (e.g., I(BLUEBOOK^2)) were considered to capture non-linear relationships, though the final custom formula retained only linear terms.

Final Selection Criteria

  • Predictors were chosen based on their importance scores and logical relevance to the target variable.
  • Irrelevant or redundant features were excluded to enhance model performance and interpretability.

Tree Model for Feature Interaction Using TARGET_AMT as the Target

library(rpart.plot)
library(dplyr)

# Filter the data for rows where TARGET_FLAG = 1
mrl_filtered_data_target_flag <- data_prepared %>% filter(TARGET_FLAG == 1)

# Ensure target variable and predictors are valid
target_variable <- "TARGET_AMT"
excluded_variables <- c("TARGET_FLAG", target_variable)
predictors <- setdiff(names(mrl_filtered_data_target_flag), excluded_variables)

# Check if predictors are valid
if (length(predictors) == 0) {
  stop("No valid predictors found. Ensure that mrl_filtered_data_target_flag has columns other than TARGET_FLAG and TARGET_AMT.")
}

# Create the formula
formula <- as.formula(paste(target_variable, "~", paste(predictors, collapse = "+")))

# Fit the decision tree model
tree_model <- rpart(
  formula,
  data = mrl_filtered_data_target_flag,
  method = "anova",  # Use "anova" for regression trees
  control = rpart.control(cp = 0.01)  # Adjust complexity parameter as needed
)

# Plot the tree
rpart.plot(
  tree_model,
  type = 3,        # Draw tree with extra info
  extra = 101,     # Show node numbers, splits, and predictions
  box.palette = "auto",  # Automatically assign colors
  main = "Decision Tree for TARGET_AMT Where TARGET_FLAG = 1"
)

Random Forest for Feature Importance using TARGET_AMT as Target

# Ensure necessary libraries are loaded
if (!require("randomForest")) install.packages("randomForest")
if (!require("ggplot2")) install.packages("ggplot2")
library(randomForest)
library(ggplot2)

# Define response variable and predictors
response <- mrl_filtered_data_target_flag$TARGET_AMT  # Response variable (TARGET_AMT)
predictors_matrix <- mrl_filtered_data_target_flag %>%
  dplyr::select(-TARGET_AMT, -TARGET_FLAG, -INDEX) %>%  # Remove target and any non-predictor columns
  dplyr::select_if(~ is.numeric(.) || is.factor(.))  # Keep only numeric or factor columns

# Check if predictors_matrix has valid columns
if (ncol(predictors_matrix) == 0) {
  stop("No valid predictors found after filtering for numeric and factor columns.")
}

# Train the randomForest model
set.seed(123)  # For reproducibility
rf_model <- randomForest(
  x = predictors_matrix,
  y = response,
  importance = TRUE,  # Enable importance computation
  ntree = 500,        # Number of trees
  mtry = floor(sqrt(ncol(predictors_matrix)))  # Number of predictors to try at each split
)

# Explicitly call importance from randomForest to avoid masking issues
importance_values <- randomForest::importance(rf_model)

# Check structure of importance_values
cat("\nStructure of importance_values:\n")
## 
## Structure of importance_values:
print(importance_values)
##                                 %IncMSE IncNodePurity
## KIDSDRIV                      12.998987    6033907780
## AGE                           20.472019   21579660068
## HOMEKIDS                      13.063007    9591877517
## YOJ                           18.919967   15971640522
## INCOME                        19.884842   17532653500
## HOME_VAL                      18.423284   16465008941
## TRAVTIME                      17.347732   18107173907
## BLUEBOOK                      28.572032   25330745053
## TIF                           13.514517   14241034365
## OLDCLAIM                      15.635998   15820217177
## CLM_FREQ                      10.787234   14403584213
## MVR_PTS                       20.600759   19358511878
## CAR_AGE                       21.377925   16982825587
## PARENT1_Yes                   12.234669    7643861157
## MSTATUS_z_No                  12.702572    9794017029
## SEX_z_F                        6.441400    6604702864
## EDUCATION_Bachelors            9.765206    7543634169
## EDUCATION_Masters              7.331360    4000317768
## EDUCATION_PhD                  7.937516    2841235673
## EDUCATION_z_HighSchool         9.398117    6016333126
## JOB_Clerical                  14.043970    5373221752
## JOB_Doctor                     2.870718     190926590
## JOB_HomeMaker                 11.871058    3743452683
## JOB_Lawyer                    12.739474    3024598027
## JOB_Manager                    6.235967    1054806651
## JOB_Professional               7.968180    8256975898
## JOB_Student                    8.079909    3139691942
## JOB_z_BlueCollar              13.887132    8296519495
## CAR_USE_Private               10.451147    7163677108
## CAR_TYPE_PanelTruck           12.780984    6069665499
## CAR_TYPE_Pickup               10.989774    4050555059
## CAR_TYPE_SportsCar             8.410908    6036814076
## CAR_TYPE_Van                  10.498695    8316236055
## CAR_TYPE_z_SUV                12.468348    5202295871
## RED_CAR_yes                    9.526964    6868476424
## REVOKED_Yes                   11.350668    3492242398
## URBANICITY_z_HighlyRuralRural  9.544094    4578656952
# If importance_values has the correct structure, process it
if (!is.null(importance_values) && "MeanDecreaseGini" %in% colnames(importance_values)) {
  # Convert to data frame
  importance_df <- data.frame(
    Feature = rownames(importance_values),
    MeanDecreaseGini = importance_values[, "MeanDecreaseGini"]
  )

  # Sort features by importance
  importance_df <- importance_df[order(-importance_df$MeanDecreaseGini), ]

  # Display the top 10 features
  cat("\nTop 10 Features by Importance:\n")
  print(head(importance_df, 10))

  # ---- Plot Feature Importance ----
  ggplot(importance_df, aes(x = reorder(Feature, MeanDecreaseGini), y = MeanDecreaseGini)) +
    geom_bar(stat = "identity", fill = "steelblue") +
    coord_flip() +
    ggtitle("Feature Importance (Random Forest)") +
    xlab("Features") +
    ylab("Mean Decrease Gini") +
    theme_minimal()
} else {
  cat("\nThe importance_values object does not contain 'MeanDecreaseGini'. Please review the model training step.\n")
}
## 
## The importance_values object does not contain 'MeanDecreaseGini'. Please review the model training step.
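The fallback message appears because randomForest reports MeanDecreaseGini only for classification forests; for a regression forest the importance columns are %IncMSE and IncNodePurity, as the printout above shows. A minimal sketch of the same plot using %IncMSE instead:

# For regression forests, rank features by %IncMSE rather than MeanDecreaseGini
importance_df <- data.frame(
  Feature = rownames(importance_values),
  PctIncMSE = importance_values[, "%IncMSE"]
)
importance_df <- importance_df[order(-importance_df$PctIncMSE), ]

ggplot(importance_df, aes(x = reorder(Feature, PctIncMSE), y = PctIncMSE)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  coord_flip() +
  ggtitle("Feature Importance (Random Forest, %IncMSE)") +
  xlab("Features") +
  ylab("% Increase in MSE") +
  theme_minimal()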

Select predictors based on the above feature importance analysis and tree model.

# Custom formula based on key predictors
custom_predictors <- c("BLUEBOOK", "INCOME", "AGE", "HOME_VAL", "CAR_TYPE_z_SUV", "CAR_TYPE_SportsCar")

# Build the formula
custom_formula <- as.formula(paste("TARGET_AMT ~", paste(custom_predictors, collapse = "+")))

# Print the custom formula
custom_formula
## TARGET_AMT ~ BLUEBOOK + INCOME + AGE + HOME_VAL + CAR_TYPE_z_SUV + 
##     CAR_TYPE_SportsCar

Multiple Linear Regression Models

MLR Models before transformation and scaling of the target variable TARGET_AMT

All models exhibited poor predictive power: the Custom model achieved the highest test R-squared (0.016), while the Backward and Stepwise models attained the lowest AIC (93,644.80). The Forward model never advanced past the intercept-only fit, producing constant predictions and an undefined test R-squared (hence the zero-standard-deviation warning and the NA below). Residual plots revealed consistent variability, suggesting unaccounted non-linear relationships; further exploration of non-linear models or feature engineering is recommended.

# Load necessary libraries
library(dplyr)
if (!require("caret")) install.packages("caret")
library(caret)
if (!require("MASS")) install.packages("MASS")
library(MASS)

# ---- Step 1: Data Preparation ----

# Filter the data for rows where TARGET_FLAG = 1
mlr_data <- data_prepared %>%
  filter(TARGET_FLAG == 1) %>%
  dplyr::select(-INDEX, -TARGET_FLAG) %>%
  dplyr::select(TARGET_AMT, everything())

# Split the data into training and test sets
set.seed(123)  # For reproducibility
train_index <- createDataPartition(mlr_data$TARGET_AMT, p = 0.7, list = FALSE)
train_data <- mlr_data[train_index, ]
test_data_untransformed_target <- mlr_data[-train_index, ]

# ---- Step 2: Fit Full Model ----

# Define the full multiple linear regression formula
full_formula <- TARGET_AMT ~ .

# Fit the full multiple linear regression model
full_model_untransformed_target <- lm(formula = full_formula, data = train_data)

# ---- Step 3: Fit Custom Model ----

# Define the custom multiple linear regression formula
custom_formula <- TARGET_AMT ~ BLUEBOOK + INCOME + AGE + HOME_VAL + 
  CAR_TYPE_z_SUV + CAR_TYPE_SportsCar

# Fit the custom multiple linear regression model
custom_model_untransformed_target <- lm(formula = custom_formula, data = train_data)

# ---- Step 4: Model Selection ----

# Forward Selection
cat("\n--- Forward Selection ---\n")
## 
## --- Forward Selection ---
forward_model_untransformed_target <- step(
  lm(TARGET_AMT ~ 1, data = train_data), 
  scope = list(lower = TARGET_AMT ~ 1, upper = full_formula), 
  direction = "forward", 
  trace = FALSE
)
print(summary(forward_model_untransformed_target))
## 
## Call:
## lm(formula = TARGET_AMT ~ 1, data = train_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -5700  -3104  -1627     58  85891 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   5730.6      114.8   49.92   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7720 on 4522 degrees of freedom
# Backward Elimination
cat("\n--- Backward Elimination ---\n")
## 
## --- Backward Elimination ---
backward_model_untransformed_target <- step(full_model_untransformed_target, direction = "backward", trace = FALSE)
print(summary(backward_model_untransformed_target))
## 
## Call:
## lm(formula = TARGET_AMT ~ AGE + HOMEKIDS + INCOME + HOME_VAL + 
##     BLUEBOOK + OLDCLAIM + CLM_FREQ + MVR_PTS + CAR_AGE + MSTATUS_z_No + 
##     SEX_z_F + EDUCATION_Masters + EDUCATION_PhD + JOB_Doctor + 
##     JOB_Manager + JOB_Professional + JOB_z_BlueCollar + CAR_TYPE_SportsCar + 
##     CAR_TYPE_z_SUV + RED_CAR_yes + REVOKED_Yes, data = train_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -11276  -3246  -1457    553  83741 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          2579.5      663.2   3.890 0.000102 ***
## AGE                  2541.9      989.7   2.568 0.010254 *  
## HOMEKIDS             1770.8      596.0   2.971 0.002981 ** 
## INCOME              -3001.6     1749.8  -1.715 0.086335 .  
## HOME_VAL             3418.8     1339.9   2.552 0.010757 *  
## BLUEBOOK            11203.6     1252.3   8.947  < 2e-16 ***
## OLDCLAIM             2405.4      995.1   2.417 0.015681 *  
## CLM_FREQ            -1333.1      603.2  -2.210 0.027155 *  
## MVR_PTS              2309.3      677.3   3.409 0.000657 ***
## CAR_AGE             -2983.6      866.7  -3.443 0.000581 ***
## MSTATUS_z_No         1570.0      308.0   5.098 3.57e-07 ***
## SEX_z_F             -2134.7      452.7  -4.715 2.49e-06 ***
## EDUCATION_Masters    1354.3      500.7   2.705 0.006864 ** 
## EDUCATION_PhD        2345.4      764.8   3.067 0.002176 ** 
## JOB_Doctor          -3011.8     1174.4  -2.564 0.010365 *  
## JOB_Manager         -1593.3      564.6  -2.822 0.004789 ** 
## JOB_Professional      743.3      461.1   1.612 0.107011    
## JOB_z_BlueCollar      622.3      343.4   1.812 0.070033 .  
## CAR_TYPE_SportsCar    955.5      507.8   1.882 0.059955 .  
## CAR_TYPE_z_SUV       1300.6      439.9   2.956 0.003130 ** 
## RED_CAR_yes          -878.4      376.9  -2.330 0.019834 *  
## REVOKED_Yes         -1639.8      390.2  -4.202 2.70e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7559 on 4501 degrees of freedom
## Multiple R-squared:  0.04572,    Adjusted R-squared:  0.04127 
## F-statistic: 10.27 on 21 and 4501 DF,  p-value: < 2.2e-16
# Stepwise Selection
cat("\n--- Stepwise Selection ---\n")
## 
## --- Stepwise Selection ---
stepwise_model_untransformed_target <- step(full_model_untransformed_target, direction = "both", trace = FALSE)
print(summary(stepwise_model_untransformed_target))
## 
## Call:
## lm(formula = TARGET_AMT ~ AGE + HOMEKIDS + INCOME + HOME_VAL + 
##     BLUEBOOK + OLDCLAIM + CLM_FREQ + MVR_PTS + CAR_AGE + MSTATUS_z_No + 
##     SEX_z_F + EDUCATION_Masters + EDUCATION_PhD + JOB_Doctor + 
##     JOB_Manager + JOB_Professional + JOB_z_BlueCollar + CAR_TYPE_SportsCar + 
##     CAR_TYPE_z_SUV + RED_CAR_yes + REVOKED_Yes, data = train_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -11276  -3246  -1457    553  83741 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          2579.5      663.2   3.890 0.000102 ***
## AGE                  2541.9      989.7   2.568 0.010254 *  
## HOMEKIDS             1770.8      596.0   2.971 0.002981 ** 
## INCOME              -3001.6     1749.8  -1.715 0.086335 .  
## HOME_VAL             3418.8     1339.9   2.552 0.010757 *  
## BLUEBOOK            11203.6     1252.3   8.947  < 2e-16 ***
## OLDCLAIM             2405.4      995.1   2.417 0.015681 *  
## CLM_FREQ            -1333.1      603.2  -2.210 0.027155 *  
## MVR_PTS              2309.3      677.3   3.409 0.000657 ***
## CAR_AGE             -2983.6      866.7  -3.443 0.000581 ***
## MSTATUS_z_No         1570.0      308.0   5.098 3.57e-07 ***
## SEX_z_F             -2134.7      452.7  -4.715 2.49e-06 ***
## EDUCATION_Masters    1354.3      500.7   2.705 0.006864 ** 
## EDUCATION_PhD        2345.4      764.8   3.067 0.002176 ** 
## JOB_Doctor          -3011.8     1174.4  -2.564 0.010365 *  
## JOB_Manager         -1593.3      564.6  -2.822 0.004789 ** 
## JOB_Professional      743.3      461.1   1.612 0.107011    
## JOB_z_BlueCollar      622.3      343.4   1.812 0.070033 .  
## CAR_TYPE_SportsCar    955.5      507.8   1.882 0.059955 .  
## CAR_TYPE_z_SUV       1300.6      439.9   2.956 0.003130 ** 
## RED_CAR_yes          -878.4      376.9  -2.330 0.019834 *  
## REVOKED_Yes         -1639.8      390.2  -4.202 2.70e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7559 on 4501 degrees of freedom
## Multiple R-squared:  0.04572,    Adjusted R-squared:  0.04127 
## F-statistic: 10.27 on 21 and 4501 DF,  p-value: < 2.2e-16
# ---- Step 5: Box-Cox Transformation and Min-Max Scaling for Evaluation ----

# Apply Box-Cox transformation to predicted values
boxcox_transform <- function(y, lambda) {
  if (lambda == 0) {
    return(log(y))
  } else {
    return((y^lambda - 1) / lambda)
  }
}

# Apply Min-Max scaling
min_max_scale <- function(x, min_val, max_val) {
  return((x - min_val) / (max_val - min_val))
}

# Function to calculate RMSE and R-Squared after applying transformations
evaluate_model_transformed <- function(model, test_data, lambda, min_val, max_val) {
  predicted <- predict(model, newdata = test_data)
  actual <- test_data$TARGET_AMT
  
  # Apply Box-Cox transformation
  predicted_transformed <- boxcox_transform(predicted, lambda)
  actual_transformed <- boxcox_transform(actual, lambda)
  
  # Apply Min-Max scaling
  predicted_scaled <- min_max_scale(predicted_transformed, min_val, max_val)
  actual_scaled <- min_max_scale(actual_transformed, min_val, max_val)
  
  # Calculate metrics
  rsquared <- cor(predicted_scaled, actual_scaled)^2
  rmse <- sqrt(mean((predicted_scaled - actual_scaled)^2))
  
  return(data.frame(R_Squared = rsquared, RMSE = rmse))
}

# Fit Box-Cox transformation to training data
boxcox_result <- boxcox(lm(TARGET_AMT ~ ., data = train_data), lambda = seq(-2, 2, by = 0.1))

optimal_lambda <- boxcox_result$x[which.max(boxcox_result$y)]
cat("Optimal lambda for Box-Cox transformation: ", optimal_lambda, "\n")
## Optimal lambda for Box-Cox transformation:  -0.02020202
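# Note: the optimal lambda is about -0.02, essentially zero, so the Box-Cox
# transformation applied here is effectively a log transform of TARGET_AMT.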
# Min-Max scaling values from training data
min_val <- min(boxcox_transform(train_data$TARGET_AMT, optimal_lambda))
max_val <- max(boxcox_transform(train_data$TARGET_AMT, optimal_lambda))

# ---- Step 6: Evaluate Models on Test Data ----

# Evaluate models using the transformed and scaled test data
full_evaluation <- evaluate_model_transformed(full_model_untransformed_target, test_data_untransformed_target, optimal_lambda, min_val, max_val)
custom_evaluation <- evaluate_model_transformed(custom_model_untransformed_target, test_data_untransformed_target, optimal_lambda, min_val, max_val)
forward_evaluation <- evaluate_model_transformed(forward_model_untransformed_target, test_data_untransformed_target, optimal_lambda, min_val, max_val)
## Warning in cor(predicted_scaled, actual_scaled): the standard deviation is zero
backward_evaluation <- evaluate_model_transformed(backward_model_untransformed_target, test_data_untransformed_target, optimal_lambda, min_val, max_val)
stepwise_evaluation <- evaluate_model_transformed(stepwise_model_untransformed_target, test_data_untransformed_target, optimal_lambda, min_val, max_val)

# Combine Metrics for All Models
model_metrics <- rbind(
  data.frame(Model = "Full", full_evaluation),
  data.frame(Model = "Custom", custom_evaluation),
  data.frame(Model = "Forward", forward_evaluation),
  data.frame(Model = "Backward", backward_evaluation),
  data.frame(Model = "Stepwise", stepwise_evaluation)
)

# Sort by R-Squared
model_metrics <- model_metrics[order(-model_metrics$R_Squared), ]

# Print Model Performance Comparison
print(model_metrics)
##      Model  R_Squared      RMSE
## 2   Custom 0.01606155 0.1066177
## 4 Backward 0.01264286 0.1088033
## 5 Stepwise 0.01264286 0.1088033
## 1     Full 0.01203977 0.1090681
## 3  Forward         NA 0.1081723
# ---- Step 7: Compare Models (AIC and LogLikelihood) ----

# Function to calculate AIC and LogLikelihood for a model
evaluate_aic_loglik <- function(model) {
  aic <- AIC(model)
  loglik <- logLik(model)
  return(data.frame(AIC = aic, LogLikelihood = as.numeric(loglik)))
}

# Evaluate AIC and LogLikelihood for each model
full_aic_loglik <- evaluate_aic_loglik(full_model_untransformed_target)
custom_aic_loglik <- evaluate_aic_loglik(custom_model_untransformed_target)
forward_aic_loglik <- evaluate_aic_loglik(forward_model_untransformed_target)
backward_aic_loglik <- evaluate_aic_loglik(backward_model_untransformed_target)
stepwise_aic_loglik <- evaluate_aic_loglik(stepwise_model_untransformed_target)

# Combine AIC and LogLikelihood metrics for all models
model_aic_loglik <- rbind(
  data.frame(Model = "Full", full_aic_loglik),
  data.frame(Model = "Custom", custom_aic_loglik),
  data.frame(Model = "Forward", forward_aic_loglik),
  data.frame(Model = "Backward", backward_aic_loglik),
  data.frame(Model = "Stepwise", stepwise_aic_loglik)
)

# Sort by AIC (lower is better)
model_aic_loglik <- model_aic_loglik[order(model_aic_loglik$AIC), ]

# Print AIC and LogLikelihood Comparison
print(model_aic_loglik)
##      Model      AIC LogLikelihood
## 4 Backward 93644.80     -46799.40
## 5 Stepwise 93644.80     -46799.40
## 1     Full 93669.45     -46795.73
## 2   Custom 93730.83     -46857.42
## 3  Forward 93814.48     -46905.24
# ---- Step 8: Residual Diagnostics ----

# Residual plots for the untransformed-target models
par(mfrow = c(2, 3))
plot(residuals(full_model_untransformed_target), main = "Full Model Residuals", ylab = "Residuals")
plot(residuals(custom_model_untransformed_target), main = "Custom Model Residuals", ylab = "Residuals")
plot(residuals(forward_model_untransformed_target), main = "Forward Selection Residuals", ylab = "Residuals")
plot(residuals(backward_model_untransformed_target), main = "Backward Elimination Residuals", ylab = "Residuals")
plot(residuals(stepwise_model_untransformed_target), main = "Stepwise Selection Residuals", ylab = "Residuals")

cat("\n--- Model Comparison Completed ---\n")
## 
## --- Model Comparison Completed ---

MLR Models after transformation and scaling of the target variable TARGET_AMT

The multiple linear regression analysis using Box-Cox transformation and Min-Max scaling shows that all models (Full, Custom, Forward, Backward, and Stepwise) achieve low R-Squared values (0.015), indicating limited predictive power. The AIC and LogLikelihood metrics highlight minimal differences between models, suggesting no substantial improvement from model selection techniques. Residual analysis confirms consistent distribution patterns across models, but the weak relationships between predictors and the target variable limit model performance. Further feature engineering or alternative modeling approaches, such as non-linear models, may improve predictive accuracy.

# Load necessary libraries
library(dplyr)
if (!require("caret")) install.packages("caret")
library(caret)

if (!require("MASS")) install.packages("MASS")
library(MASS)

# ---- Step 1: Data Preparation ----

# Filter the data for rows where TARGET_FLAG = 1
mlr_data <- data_prepared %>%
  filter(TARGET_FLAG == 1) %>%
  dplyr::select(-INDEX, -TARGET_FLAG) %>%
  dplyr::select(TARGET_AMT, everything())

# Split the data into training and test sets
set.seed(123)  # For reproducibility
train_index <- createDataPartition(mlr_data$TARGET_AMT, p = 0.7, list = FALSE)
train_data <- mlr_data[train_index, ]
test_data_transformed_scaled <- mlr_data[-train_index, ]

# Apply Box-Cox transformation to TARGET_AMT
boxcox_result <- boxcox(lm(TARGET_AMT ~ ., data = train_data), lambda = seq(-2, 2, by = 0.1))

# Optimal lambda
optimal_lambda_transformed <- boxcox_result$x[which.max(boxcox_result$y)]
cat("Optimal lambda for Box-Cox transformation: ", optimal_lambda_transformed, "\n")
## Optimal lambda for Box-Cox transformation:  -0.02020202
# Transform TARGET_AMT in both training and test datasets
boxcox_transform <- function(y, lambda) {
  if (lambda == 0) {
    log(y)  # Log transform if lambda is 0
  } else {
    (y^lambda - 1) / lambda
  }
}

train_data$TARGET_AMT <- boxcox_transform(train_data$TARGET_AMT, optimal_lambda_transformed)
test_data_transformed_scaled$TARGET_AMT <- boxcox_transform(test_data_transformed_scaled$TARGET_AMT, optimal_lambda_transformed)

# Min-Max scaling function that saves min and max as objects
min_max_scale_with_objects <- function(x) {
  min_val <- min(x, na.rm = TRUE)
  max_val <- max(x, na.rm = TRUE)
  
  scaled_values <- (x - min_val) / (max_val - min_val)
  
  # Return a list containing scaled values, min, and max
  return(list(scaled = scaled_values, min_val = min_val, max_val = max_val))
}

# Apply Min-Max scaling with object saving to train and test datasets
train_scaled <- min_max_scale_with_objects(train_data$TARGET_AMT)
train_data$TARGET_AMT <- train_scaled$scaled

test_data_transformed_scaled$TARGET_AMT <- (test_data_transformed_scaled$TARGET_AMT - train_scaled$min_val) /
                                           (train_scaled$max_val - train_scaled$min_val)

# Save min and max for potential future use
train_min_val_transformed <- train_scaled$min_val
train_max_val_transformed <- train_scaled$max_val

cat("Saved min and max values for future reference:\n")
## Saved min and max values for future reference:
cat("Min value:", train_min_val_transformed, "\n")
## Min value: 3.295567
cat("Max value:", train_max_val_transformed, "\n")
## Max value: 10.20269
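# Sketch: to report predictions in dollars, invert the two steps above in
# reverse order (undo Min-Max scaling, then undo Box-Cox), using the saved
# lambda, min, and max from this chunk.
inverse_boxcox <- function(z, lambda) {
  if (lambda == 0) exp(z) else (lambda * z + 1)^(1 / lambda)
}
back_to_dollars <- function(pred_scaled) {
  z <- pred_scaled * (train_max_val_transformed - train_min_val_transformed) +
    train_min_val_transformed
  inverse_boxcox(z, optimal_lambda_transformed)
}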
# ---- Step 2: Fit Full Model ----

# Define the full multiple linear regression formula
full_formula <- TARGET_AMT ~ .

# Fit the full multiple linear regression model
full_model_transformed_scaled <- lm(formula = full_formula, data = train_data)

# ---- Step 3: Fit Custom Model ----

# Define the custom multiple linear regression formula
custom_formula <- TARGET_AMT ~ BLUEBOOK + INCOME + AGE + HOME_VAL + 
                                 CAR_TYPE_z_SUV + CAR_TYPE_SportsCar

# Fit the custom multiple linear regression model
custom_model_transformed_scaled <- lm(formula = custom_formula, data = train_data)

# ---- Step 4: Model Selection ----

# Forward Selection
cat("\n--- Forward Selection ---\n")
## 
## --- Forward Selection ---
forward_model_transformed_scaled <- step(
  lm(TARGET_AMT ~ 1, data = train_data), 
  scope = list(lower = TARGET_AMT ~ 1, upper = full_formula), 
  direction = "forward", 
  trace = FALSE
)
print(summary(forward_model_transformed_scaled))
## 
## Call:
## lm(formula = TARGET_AMT ~ 1, data = train_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.62610 -0.04938  0.00550  0.04745  0.37390 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.626103   0.001459   429.1   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.09813 on 4522 degrees of freedom
# Backward Elimination
cat("\n--- Backward Elimination ---\n")
## 
## --- Backward Elimination ---
backward_model_transformed_scaled <- step(full_model_transformed_scaled, direction = "backward", trace = FALSE)
print(summary(backward_model_transformed_scaled))
## 
## Call:
## lm(formula = TARGET_AMT ~ AGE + HOMEKIDS + YOJ + INCOME + BLUEBOOK + 
##     OLDCLAIM + CLM_FREQ + MVR_PTS + MSTATUS_z_No + SEX_z_F + 
##     EDUCATION_Bachelors + EDUCATION_Masters + EDUCATION_PhD + 
##     JOB_HomeMaker + JOB_Professional + JOB_z_BlueCollar + CAR_TYPE_Pickup + 
##     CAR_TYPE_SportsCar + CAR_TYPE_z_SUV + REVOKED_Yes, data = train_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.59425 -0.05004  0.00432  0.04938  0.37242 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          0.586747   0.008544  68.672  < 2e-16 ***
## AGE                  0.034493   0.012738   2.708 0.006798 ** 
## HOMEKIDS             0.020297   0.007730   2.626 0.008680 ** 
## YOJ                 -0.015415   0.010205  -1.511 0.130985    
## INCOME              -0.055330   0.021461  -2.578 0.009962 ** 
## BLUEBOOK             0.132934   0.017012   7.814 6.85e-15 ***
## OLDCLAIM             0.050321   0.012683   3.968 7.37e-05 ***
## CLM_FREQ            -0.034326   0.007665  -4.478 7.72e-06 ***
## MVR_PTS              0.043003   0.008610   4.994 6.12e-07 ***
## MSTATUS_z_No         0.016348   0.003353   4.876 1.12e-06 ***
## SEX_z_F             -0.019219   0.005209  -3.689 0.000228 ***
## EDUCATION_Bachelors -0.014431   0.004367  -3.305 0.000957 ***
## EDUCATION_Masters    0.012243   0.006020   2.034 0.042041 *  
## EDUCATION_PhD        0.020490   0.009103   2.251 0.024444 *  
## JOB_HomeMaker       -0.011496   0.006910  -1.664 0.096229 .  
## JOB_Professional     0.013774   0.005920   2.327 0.020035 *  
## JOB_z_BlueCollar     0.007804   0.004333   1.801 0.071779 .  
## CAR_TYPE_Pickup      0.007891   0.004902   1.610 0.107564    
## CAR_TYPE_SportsCar   0.010481   0.007146   1.467 0.142509    
## CAR_TYPE_z_SUV       0.018109   0.006364   2.846 0.004451 ** 
## REVOKED_Yes         -0.017293   0.004968  -3.481 0.000505 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.09636 on 4502 degrees of freedom
## Multiple R-squared:  0.04001,    Adjusted R-squared:  0.03574 
## F-statistic: 9.381 on 20 and 4502 DF,  p-value: < 2.2e-16
# Stepwise Selection
cat("\n--- Stepwise Selection ---\n")
## 
## --- Stepwise Selection ---
stepwise_model_transformed_scaled <- step(full_model_transformed_scaled, direction = "both", trace = FALSE)
print(summary(stepwise_model_transformed_scaled))
## 
## Call:
## lm(formula = TARGET_AMT ~ AGE + HOMEKIDS + YOJ + INCOME + BLUEBOOK + 
##     OLDCLAIM + CLM_FREQ + MVR_PTS + MSTATUS_z_No + SEX_z_F + 
##     EDUCATION_Bachelors + EDUCATION_Masters + EDUCATION_PhD + 
##     JOB_HomeMaker + JOB_Professional + JOB_z_BlueCollar + CAR_TYPE_Pickup + 
##     CAR_TYPE_SportsCar + CAR_TYPE_z_SUV + REVOKED_Yes, data = train_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.59425 -0.05004  0.00432  0.04938  0.37242 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          0.586747   0.008544  68.672  < 2e-16 ***
## AGE                  0.034493   0.012738   2.708 0.006798 ** 
## HOMEKIDS             0.020297   0.007730   2.626 0.008680 ** 
## YOJ                 -0.015415   0.010205  -1.511 0.130985    
## INCOME              -0.055330   0.021461  -2.578 0.009962 ** 
## BLUEBOOK             0.132934   0.017012   7.814 6.85e-15 ***
## OLDCLAIM             0.050321   0.012683   3.968 7.37e-05 ***
## CLM_FREQ            -0.034326   0.007665  -4.478 7.72e-06 ***
## MVR_PTS              0.043003   0.008610   4.994 6.12e-07 ***
## MSTATUS_z_No         0.016348   0.003353   4.876 1.12e-06 ***
## SEX_z_F             -0.019219   0.005209  -3.689 0.000228 ***
## EDUCATION_Bachelors -0.014431   0.004367  -3.305 0.000957 ***
## EDUCATION_Masters    0.012243   0.006020   2.034 0.042041 *  
## EDUCATION_PhD        0.020490   0.009103   2.251 0.024444 *  
## JOB_HomeMaker       -0.011496   0.006910  -1.664 0.096229 .  
## JOB_Professional     0.013774   0.005920   2.327 0.020035 *  
## JOB_z_BlueCollar     0.007804   0.004333   1.801 0.071779 .  
## CAR_TYPE_Pickup      0.007891   0.004902   1.610 0.107564    
## CAR_TYPE_SportsCar   0.010481   0.007146   1.467 0.142509    
## CAR_TYPE_z_SUV       0.018109   0.006364   2.846 0.004451 ** 
## REVOKED_Yes         -0.017293   0.004968  -3.481 0.000505 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.09636 on 4502 degrees of freedom
## Multiple R-squared:  0.04001,    Adjusted R-squared:  0.03574 
## F-statistic: 9.381 on 20 and 4502 DF,  p-value: < 2.2e-16
# ---- Step 5: Evaluate Models on Test Data ----

# Function to calculate performance metrics (R-squared and RMSE)
evaluate_model <- function(model, data) {
  predicted <- predict(model, newdata = data)
  actual <- data$TARGET_AMT
  
  # Calculate metrics
  rsquared <- cor(predicted, actual)^2
  rmse <- sqrt(mean((predicted - actual)^2))
  
  return(data.frame(R_Squared = rsquared, RMSE = rmse))
}
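
Note that cor() is undefined when a model's predictions are constant, which is why the forward-selection model (an intercept-only fit here) triggers a "standard deviation is zero" warning and an NA R-squared below. A hedged alternative, sketched here with an illustrative name, computes R-squared as 1 - SSE/SST, which stays defined (though possibly near zero or negative) for constant predictions:

# Sketch: R-squared via sums of squares rather than squared correlation
evaluate_model_ss <- function(model, data) {
  predicted <- predict(model, newdata = data)
  actual <- data$TARGET_AMT
  sse <- sum((actual - predicted)^2)       # residual sum of squares
  sst <- sum((actual - mean(actual))^2)    # total sum of squares
  data.frame(R_Squared = 1 - sse / sst,
             RMSE = sqrt(mean((predicted - actual)^2)))
}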

# Evaluate models using the test data
full_evaluation_transformed_scaled <- evaluate_model(full_model_transformed_scaled, test_data_transformed_scaled)
custom_evaluation_transformed_scaled <- evaluate_model(custom_model_transformed_scaled, test_data_transformed_scaled)
forward_evaluation_transformed_scaled <- evaluate_model(forward_model_transformed_scaled, test_data_transformed_scaled)
## Warning in cor(predicted, actual): the standard deviation is zero
backward_evaluation_transformed_scaled <- evaluate_model(backward_model_transformed_scaled, test_data_transformed_scaled)
stepwise_evaluation_transformed_scaled <- evaluate_model(stepwise_model_transformed_scaled, test_data_transformed_scaled)

# ---- Step 6: Compare Models ----

# Combine Metrics for All Models
model_metrics_transformed_scaled <- rbind(
  data.frame(Model = "Full", full_evaluation_transformed_scaled),
  data.frame(Model = "Custom", custom_evaluation_transformed_scaled),
  data.frame(Model = "Forward", forward_evaluation_transformed_scaled),
  data.frame(Model = "Backward", backward_evaluation_transformed_scaled),
  data.frame(Model = "Stepwise", stepwise_evaluation_transformed_scaled)
)

# Sort by R-Squared
model_metrics_transformed_scaled <- model_metrics_transformed_scaled[order(-model_metrics_transformed_scaled$R_Squared), ]

# Print Model Performance Comparison
print(model_metrics_transformed_scaled)
##      Model  R_Squared       RMSE
## 1     Full 0.01541150 0.09694449
## 4 Backward 0.01459890 0.09698848
## 5 Stepwise 0.01459890 0.09698848
## 2   Custom 0.01404875 0.09668920
## 3  Forward         NA 0.09737593
# Combine AIC and LogLikelihood for All Models
model_aic_loglik_transformed_scaled <- data.frame(
  Model = c("Full", "Custom", "Forward", "Backward", "Stepwise"),
  AIC = c(AIC(full_model_transformed_scaled), AIC(custom_model_transformed_scaled), AIC(forward_model_transformed_scaled), AIC(backward_model_transformed_scaled), AIC(stepwise_model_transformed_scaled)),
  LogLikelihood = c(logLik(full_model_transformed_scaled), logLik(custom_model_transformed_scaled), logLik(forward_model_transformed_scaled), logLik(backward_model_transformed_scaled), logLik(stepwise_model_transformed_scaled))
)

# Sort by LogLikelihood
model_aic_loglik_transformed_scaled <- model_aic_loglik_transformed_scaled[order(-model_aic_loglik_transformed_scaled$LogLikelihood), ]

# Print AIC and LogLikelihood Comparison
print(model_aic_loglik_transformed_scaled)
##      Model       AIC LogLikelihood
## 1     Full -8278.158      4178.079
## 4 Backward -8306.176      4175.088
## 5 Stepwise -8306.176      4175.088
## 2   Custom -8221.300      4118.650
## 3  Forward -8161.498      4082.749
# ---- Step 7: Visualize Residuals (Training Fits) ----
par(mfrow = c(2, 3))
plot(residuals(full_model_transformed_scaled), main = "Full Model Residuals", ylab = "Residuals")
plot(residuals(custom_model_transformed_scaled), main = "Custom Model Residuals", ylab = "Residuals")
plot(residuals(forward_model_transformed_scaled), main = "Forward Selection Residuals", ylab = "Residuals")
plot(residuals(backward_model_transformed_scaled), main = "Backward Elimination Residuals", ylab = "Residuals")
plot(residuals(stepwise_model_transformed_scaled), main = "Stepwise Selection Residuals", ylab = "Residuals")




cat("\n--- Model Comparison Completed ---\n")
## 
## --- Model Comparison Completed ---

MLR Models after outlier treatment, transformation, and scaling of the target variable TARGET_AMT

After removing outliers and influential points, applying a Box-Cox transformation (lambda = 0.222), and Min-Max scaling, the refitted models still show limited predictive power, with test R-squared values of roughly 0.012 and RMSE near 0.15. Residuals are more uniform, but the linear models still fail to capture the underlying relationships, suggesting the need for non-linear approaches.

# Load necessary libraries
library(dplyr)
if (!require("caret")) install.packages("caret")
library(caret)
if (!require("MASS")) install.packages("MASS")
library(MASS)

# ---- Step 1: Data Preparation ----

# Filter the data for rows where TARGET_FLAG = 1
mlr_data <- data_prepared %>%
  filter(TARGET_FLAG == 1) %>%
  dplyr::select(-INDEX, -TARGET_FLAG) %>%
  dplyr::select(TARGET_AMT, everything())

# Split the data into training and test sets
set.seed(123)  # For reproducibility
train_index <- createDataPartition(mlr_data$TARGET_AMT, p = 0.7, list = FALSE)
train_data <- mlr_data[train_index, ]
test_data_outliers_removed <- mlr_data[-train_index, ]

# ---- Step 2: Identify and Remove Outliers and Influential Points ----

# Fit the initial full model
full_formula <- TARGET_AMT ~ .
initial_full_model <- lm(formula = full_formula, data = train_data)

# Calculate standardized residuals
standardized_residuals <- rstandard(initial_full_model)

# Identify outliers (|standardized residual| > 3)
outliers <- which(abs(standardized_residuals) > 3)

# Calculate leverage and Cook's distance
leverage <- hatvalues(initial_full_model)
cooks_distance <- cooks.distance(initial_full_model)

# Identify influential points (Cook's distance > 4/n)
n <- nrow(train_data)
influential_points <- which(cooks_distance > (4 / n))

# Combine outliers and influential points
points_to_remove <- unique(c(outliers, influential_points))
cat("Points to remove: ", points_to_remove, "\n")
## Points to remove:  45 71 148 171 357 371 491 499 514 585 661 694 703 719 731 833 855 909 938 991 995 1007 1008 1010 1118 1122 1142 1222 1234 1292 1319 1330 1405 1473 1558 1559 1608 1710 1711 1757 1801 1802 2007 2068 2133 2134 2215 2245 2456 2457 2467 2468 2483 2516 2517 2615 2658 2799 2800 2809 2875 2889 2890 2914 2915 2921 2941 2942 3130 3171 3231 3268 3269 3349 3350 3457 3458 3519 3526 3527 3550 3551 3553 3554 3557 3558 3785 3786 3795 3796 4079 4135 4136 4194 4195 4214 4215 4384 4385 4506 4 33 82 129 318 433 454 457 468 507 542 545 559 605 686 816 928 970 982 986 1096 1165 1256 1293 1442 1454 1464 1540 1541 1609 1724 1725 1756 1922 2361 2396 2421 2422 2455 2504 2571 2656 2692 2830 2859 3131 3389 3420 3421 3477 3518 3814 3893 4009 4010 4080 4137 4138 4346 4474
# Remove identified points from the training dataset
cleaned_train_data <- train_data[-points_to_remove, ]

# ---- Step 3: Apply Box-Cox Transformation ----

# Apply Box-Cox transformation to TARGET_AMT
boxcox_result_cleaned <- boxcox(lm(TARGET_AMT ~ ., data = cleaned_train_data), lambda = seq(-2, 2, by = 0.1))

# Optimal lambda
optimal_lambda_cleaned <- boxcox_result_cleaned$x[which.max(boxcox_result_cleaned$y)]
cat("Optimal lambda for Box-Cox transformation (Cleaned Data): ", optimal_lambda_cleaned, "\n")
## Optimal lambda for Box-Cox transformation (Cleaned Data):  0.2222222
# Transform TARGET_AMT in both cleaned training and test datasets
boxcox_transform <- function(y, lambda) {
  if (lambda == 0) {
    log(y)  # Log transform if lambda is 0
  } else {
    (y^lambda - 1) / lambda
  }
}

cleaned_train_data$TARGET_AMT <- boxcox_transform(cleaned_train_data$TARGET_AMT, optimal_lambda_cleaned)
test_data_outliers_removed$TARGET_AMT <- boxcox_transform(test_data_outliers_removed$TARGET_AMT, optimal_lambda_cleaned)

# ---- Step 4: Apply Min-Max Scaling ----

# Min-Max scaling function
min_max_scale <- function(x) {
  min_val <- min(x, na.rm = TRUE)
  max_val <- max(x, na.rm = TRUE)
  scaled_values <- (x - min_val) / (max_val - min_val)
  
  # Return a list containing scaled values, min, and max
  return(list(scaled = scaled_values, min_val = min_val, max_val = max_val))
}

# Scale TARGET_AMT in cleaned training data
scaling_result_train <- min_max_scale(cleaned_train_data$TARGET_AMT)
cleaned_train_data$TARGET_AMT <- scaling_result_train$scaled

# Save min and max for potential future use
train_min_val <- scaling_result_train$min_val
train_max_val <- scaling_result_train$max_val

# Scale TARGET_AMT in test data using the min and max from the training data
test_data_outliers_removed$TARGET_AMT <- (test_data_outliers_removed$TARGET_AMT - train_min_val) / (train_max_val - train_min_val)

# ---- Step 5: Refit the Models ----

# Refit Full Model
refit_full_model_outliers_removed <- lm(full_formula, data = cleaned_train_data)
print(summary(refit_full_model_outliers_removed))
## 
## Call:
## lm(formula = full_formula, data = cleaned_train_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.50839 -0.07014  0.00785  0.07268  0.46371 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    0.487701   0.021206  22.998  < 2e-16 ***
## KIDSDRIV                       0.015087   0.015579   0.968 0.332882    
## AGE                            0.044141   0.017022   2.593 0.009540 ** 
## HOMEKIDS                       0.027287   0.012879   2.119 0.034166 *  
## YOJ                           -0.040228   0.014021  -2.869 0.004137 ** 
## INCOME                        -0.045095   0.030896  -1.460 0.144472    
## HOME_VAL                      -0.009803   0.022161  -0.442 0.658248    
## TRAVTIME                      -0.018597   0.019099  -0.974 0.330239    
## BLUEBOOK                       0.101390   0.026161   3.876 0.000108 ***
## TIF                            0.004298   0.012704   0.338 0.735120    
## OLDCLAIM                       0.051390   0.016145   3.183 0.001468 ** 
## CLM_FREQ                      -0.039985   0.009773  -4.091 4.37e-05 ***
## MVR_PTS                        0.042053   0.010992   3.826 0.000132 ***
## CAR_AGE                        0.010833   0.015217   0.712 0.476542    
## PARENT1_Yes                   -0.006653   0.007359  -0.904 0.365988    
## MSTATUS_z_No                   0.019959   0.006090   3.277 0.001056 ** 
## SEX_z_F                       -0.009084   0.007970  -1.140 0.254447    
## EDUCATION_Bachelors           -0.023272   0.007856  -2.962 0.003068 ** 
## EDUCATION_Masters              0.018957   0.013591   1.395 0.163155    
## EDUCATION_PhD                  0.025633   0.016257   1.577 0.114933    
## EDUCATION_z_HighSchool         0.002661   0.006353   0.419 0.675380    
## JOB_Clerical                   0.026307   0.015128   1.739 0.082117 .  
## JOB_Doctor                     0.014397   0.020559   0.700 0.483783    
## JOB_HomeMaker                 -0.004874   0.016083  -0.303 0.761869    
## JOB_Lawyer                    -0.002017   0.013101  -0.154 0.877671    
## JOB_Manager                    0.020728   0.013476   1.538 0.124101    
## JOB_Professional               0.032628   0.014249   2.290 0.022078 *  
## JOB_Student                    0.017501   0.016159   1.083 0.278834    
## JOB_z_BlueCollar               0.031397   0.014529   2.161 0.030749 *  
## CAR_USE_Private                0.001763   0.006477   0.272 0.785469    
## CAR_TYPE_PanelTruck            0.006761   0.011971   0.565 0.572259    
## CAR_TYPE_Pickup                0.006730   0.007342   0.917 0.359383    
## CAR_TYPE_SportsCar             0.010225   0.009188   1.113 0.265829    
## CAR_TYPE_Van                  -0.003623   0.009436  -0.384 0.700997    
## CAR_TYPE_z_SUV                 0.011574   0.008239   1.405 0.160134    
## RED_CAR_yes                    0.009392   0.006106   1.538 0.124094    
## REVOKED_Yes                   -0.013443   0.006272  -2.143 0.032150 *  
## URBANICITY_z_HighlyRuralRural -0.020062   0.009518  -2.108 0.035102 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1192 on 4325 degrees of freedom
## Multiple R-squared:  0.03131,    Adjusted R-squared:  0.02303 
## F-statistic: 3.779 on 37 and 4325 DF,  p-value: 1.374e-13
# Define Custom Model Formula
custom_formula <- TARGET_AMT ~ BLUEBOOK + INCOME + AGE + HOME_VAL + 
                                 CAR_TYPE_z_SUV + CAR_TYPE_SportsCar

# Refit Custom Model
refit_custom_model_outliers_removed <- lm(custom_formula, data = cleaned_train_data)
print(summary(refit_custom_model_outliers_removed))
## 
## Call:
## lm(formula = custom_formula, data = cleaned_train_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.52833 -0.06834  0.00768  0.07084  0.45915 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         0.523109   0.007060  74.091  < 2e-16 ***
## BLUEBOOK            0.081838   0.019719   4.150 3.39e-05 ***
## INCOME             -0.006649   0.021972  -0.303   0.7622    
## AGE                 0.025177   0.014292   1.762   0.0782 .  
## HOME_VAL           -0.045174   0.018162  -2.487   0.0129 *  
## CAR_TYPE_z_SUV     -0.005109   0.004901  -1.043   0.2972    
## CAR_TYPE_SportsCar -0.003813   0.006348  -0.601   0.5482    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1203 on 4356 degrees of freedom
## Multiple R-squared:  0.007379,   Adjusted R-squared:  0.006011 
## F-statistic: 5.397 on 6 and 4356 DF,  p-value: 1.445e-05
# The remaining selection models are refit below, followed by evaluation and
# residual diagnostics


# Refit Forward Selection Model
refit_forward_model_outliers_removed <- step(lm(TARGET_AMT ~ 1, data = cleaned_train_data), 
                                             scope = list(lower = TARGET_AMT ~ 1, upper = full_formula), 
                                             direction = "forward", 
                                             trace = FALSE)

print(summary(refit_forward_model_outliers_removed))
## 
## Call:
## lm(formula = TARGET_AMT ~ 1, data = cleaned_train_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.53969 -0.06988  0.00778  0.06908  0.46031 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.539691   0.001826   295.6   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1206 on 4362 degrees of freedom
# Refit Backward Elimination Model
refit_backward_model_outliers_removed <- step(refit_full_model_outliers_removed, direction = "backward", trace = FALSE)
print(summary(refit_backward_model_outliers_removed))
## 
## Call:
## lm(formula = TARGET_AMT ~ AGE + HOMEKIDS + YOJ + INCOME + BLUEBOOK + 
##     OLDCLAIM + CLM_FREQ + MVR_PTS + MSTATUS_z_No + EDUCATION_Bachelors + 
##     EDUCATION_Masters + EDUCATION_PhD + JOB_Clerical + JOB_Manager + 
##     JOB_Professional + JOB_Student + JOB_z_BlueCollar + RED_CAR_yes + 
##     REVOKED_Yes + URBANICITY_z_HighlyRuralRural, data = cleaned_train_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.50348 -0.06952  0.00778  0.07069  0.46242 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    0.484135   0.011723  41.296  < 2e-16 ***
## AGE                            0.051855   0.016065   3.228 0.001257 ** 
## HOMEKIDS                       0.027436   0.009832   2.790 0.005287 ** 
## YOJ                           -0.040148   0.013339  -3.010 0.002629 ** 
## INCOME                        -0.040014   0.027295  -1.466 0.142718    
## BLUEBOOK                       0.090249   0.018846   4.789 1.73e-06 ***
## OLDCLAIM                       0.050444   0.016038   3.145 0.001670 ** 
## CLM_FREQ                      -0.038891   0.009675  -4.020 5.93e-05 ***
## MVR_PTS                        0.040908   0.010879   3.760 0.000172 ***
## MSTATUS_z_No                   0.018217   0.004231   4.305 1.71e-05 ***
## EDUCATION_Bachelors           -0.024124   0.005600  -4.308 1.68e-05 ***
## EDUCATION_Masters              0.021315   0.009535   2.236 0.025433 *  
## EDUCATION_PhD                  0.031119   0.012993   2.395 0.016661 *  
## JOB_Clerical                   0.029624   0.009242   3.205 0.001358 ** 
## JOB_Manager                    0.022567   0.009800   2.303 0.021341 *  
## JOB_Professional               0.035284   0.009454   3.732 0.000192 ***
## JOB_Student                    0.022732   0.009227   2.464 0.013788 *  
## JOB_z_BlueCollar               0.034032   0.008751   3.889 0.000102 ***
## RED_CAR_yes                    0.011395   0.004628   2.462 0.013859 *  
## REVOKED_Yes                   -0.012711   0.006215  -2.045 0.040904 *  
## URBANICITY_z_HighlyRuralRural -0.021272   0.009444  -2.252 0.024343 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1191 on 4342 degrees of freedom
## Multiple R-squared:  0.02951,    Adjusted R-squared:  0.02503 
## F-statistic:   6.6 on 20 and 4342 DF,  p-value: < 2.2e-16
# Refit Stepwise Selection Model
refit_stepwise_model_outliers_removed <- step(refit_full_model_outliers_removed, direction = "both", trace = FALSE)
print(summary(refit_stepwise_model_outliers_removed))
## 
## Call:
## lm(formula = TARGET_AMT ~ AGE + HOMEKIDS + YOJ + INCOME + BLUEBOOK + 
##     OLDCLAIM + CLM_FREQ + MVR_PTS + MSTATUS_z_No + EDUCATION_Bachelors + 
##     EDUCATION_Masters + EDUCATION_PhD + JOB_Clerical + JOB_Manager + 
##     JOB_Professional + JOB_Student + JOB_z_BlueCollar + RED_CAR_yes + 
##     REVOKED_Yes + URBANICITY_z_HighlyRuralRural, data = cleaned_train_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.50348 -0.06952  0.00778  0.07069  0.46242 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    0.484135   0.011723  41.296  < 2e-16 ***
## AGE                            0.051855   0.016065   3.228 0.001257 ** 
## HOMEKIDS                       0.027436   0.009832   2.790 0.005287 ** 
## YOJ                           -0.040148   0.013339  -3.010 0.002629 ** 
## INCOME                        -0.040014   0.027295  -1.466 0.142718    
## BLUEBOOK                       0.090249   0.018846   4.789 1.73e-06 ***
## OLDCLAIM                       0.050444   0.016038   3.145 0.001670 ** 
## CLM_FREQ                      -0.038891   0.009675  -4.020 5.93e-05 ***
## MVR_PTS                        0.040908   0.010879   3.760 0.000172 ***
## MSTATUS_z_No                   0.018217   0.004231   4.305 1.71e-05 ***
## EDUCATION_Bachelors           -0.024124   0.005600  -4.308 1.68e-05 ***
## EDUCATION_Masters              0.021315   0.009535   2.236 0.025433 *  
## EDUCATION_PhD                  0.031119   0.012993   2.395 0.016661 *  
## JOB_Clerical                   0.029624   0.009242   3.205 0.001358 ** 
## JOB_Manager                    0.022567   0.009800   2.303 0.021341 *  
## JOB_Professional               0.035284   0.009454   3.732 0.000192 ***
## JOB_Student                    0.022732   0.009227   2.464 0.013788 *  
## JOB_z_BlueCollar               0.034032   0.008751   3.889 0.000102 ***
## RED_CAR_yes                    0.011395   0.004628   2.462 0.013859 *  
## REVOKED_Yes                   -0.012711   0.006215  -2.045 0.040904 *  
## URBANICITY_z_HighlyRuralRural -0.021272   0.009444  -2.252 0.024343 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1191 on 4342 degrees of freedom
## Multiple R-squared:  0.02951,    Adjusted R-squared:  0.02503 
## F-statistic:   6.6 on 20 and 4342 DF,  p-value: < 2.2e-16
# ---- Step 6: Evaluate Models on Test Data ----

# Function to calculate performance metrics
evaluate_model <- function(model, data) {
  predicted <- predict(model, newdata = data)
  actual <- data$TARGET_AMT
  
  # Calculate metrics
  rsquared <- cor(predicted, actual)^2
  rmse <- sqrt(mean((predicted - actual)^2))
  
  return(data.frame(R_Squared = rsquared, RMSE = rmse))
}

# Evaluate each model
refit_full_evaluation_outliers_removed <- evaluate_model(refit_full_model_outliers_removed, test_data_outliers_removed)
refit_custom_evaluation_outliers_removed <- evaluate_model(refit_custom_model_outliers_removed, test_data_outliers_removed)
refit_forward_evaluation_outliers_removed <- evaluate_model(refit_forward_model_outliers_removed, test_data_outliers_removed)
## Warning in cor(predicted, actual): the standard deviation is zero
refit_backward_evaluation_outliers_removed <- evaluate_model(refit_backward_model_outliers_removed, test_data_outliers_removed)
refit_stepwise_evaluation_outliers_removed <- evaluate_model(refit_stepwise_model_outliers_removed, test_data_outliers_removed)

# Combine results
refit_outliers_removed_model_metrics <- rbind(
  data.frame(Model = "Refit Full", refit_full_evaluation_outliers_removed),
  data.frame(Model = "Refit Custom", refit_custom_evaluation_outliers_removed),
  data.frame(Model = "Refit Forward", refit_forward_evaluation_outliers_removed),
  data.frame(Model = "Refit Backward", refit_backward_evaluation_outliers_removed),
  data.frame(Model = "Refit Stepwise", refit_stepwise_evaluation_outliers_removed)
)

# Sort by R-Squared
refit_outliers_removed_model_metrics <- refit_outliers_removed_model_metrics[order(-refit_outliers_removed_model_metrics$R_Squared), ]

# Print transformed model performance
print(refit_outliers_removed_model_metrics)
##            Model  R_Squared      RMSE
## 4 Refit Backward 0.01249531 0.1524464
## 5 Refit Stepwise 0.01249531 0.1524464
## 2   Refit Custom 0.01200467 0.1526072
## 1     Refit Full 0.01154497 0.1525519
## 3  Refit Forward         NA 0.1533706
# ---- Step 7: Compare Models (AIC and Log-Likelihood) ----

# Helper function to calculate AIC and Log-Likelihood safely
evaluate_aic_loglik <- function(model) {
  if (!inherits(model, "lm")) {
    warning("The input is not a valid model object.")
    return(data.frame(AIC = NA, LogLikelihood = NA))
  }
  
  # Check if the residual standard deviation is zero
  if (sd(residuals(model)) == 0) {
    warning("The standard deviation of residuals is zero. Model evaluation might not be meaningful.")
    return(data.frame(AIC = NA, LogLikelihood = NA))
  }
  
  # Calculate AIC and Log-Likelihood
  aic <- AIC(model)
  loglik <- logLik(model)
  
  return(data.frame(AIC = aic, LogLikelihood = as.numeric(loglik)))
}

# Evaluate AIC and Log-Likelihood for each model
full_aic_loglik_outliers_removed <- evaluate_aic_loglik(refit_full_model_outliers_removed)
custom_aic_loglik_outliers_removed <- evaluate_aic_loglik(refit_custom_model_outliers_removed)
forward_aic_loglik_outliers_removed <- evaluate_aic_loglik(refit_forward_model_outliers_removed)
backward_aic_loglik_outliers_removed <- evaluate_aic_loglik(refit_backward_model_outliers_removed)
stepwise_aic_loglik_outliers_removed <- evaluate_aic_loglik(refit_stepwise_model_outliers_removed)

# Combine AIC and Log-Likelihood metrics for all models
model_aic_loglik_outliers_removed <- rbind(
  data.frame(Model = "Full", full_aic_loglik_outliers_removed),
  data.frame(Model = "Custom", custom_aic_loglik_outliers_removed),
  data.frame(Model = "Forward", forward_aic_loglik_outliers_removed),
  data.frame(Model = "Backward", backward_aic_loglik_outliers_removed),
  data.frame(Model = "Stepwise", stepwise_aic_loglik_outliers_removed)
)

# Sort by AIC (lower is better)
model_aic_loglik_outliers_removed <- model_aic_loglik_outliers_removed[order(model_aic_loglik_outliers_removed$AIC, na.last = TRUE), ]

# Print AIC and Log-Likelihood Comparison
print(model_aic_loglik_outliers_removed)
##      Model       AIC LogLikelihood
## 4 Backward -6162.889      3103.445
## 5 Stepwise -6162.889      3103.445
## 1     Full -6137.023      3107.511
## 2   Custom -6092.533      3054.266
## 3  Forward -6072.221      3038.110
# ---- Step 8: Residual Diagnostics ----

# Residual plots for refitted, transformed models
par(mfrow = c(2, 3))
plot(residuals(refit_full_model_outliers_removed), main = "Refit Full Model Residuals", ylab = "Residuals")
plot(residuals(refit_custom_model_outliers_removed), main = "Refit Custom Model Residuals", ylab = "Residuals")
plot(residuals(refit_forward_model_outliers_removed), main = "Refit Forward Selection Residuals", ylab = "Residuals")
plot(residuals(refit_backward_model_outliers_removed), main = "Refit Backward Elimination Residuals", ylab = "Residuals")
plot(residuals(refit_stepwise_model_outliers_removed), main = "Refit Stepwise Selection Residuals", ylab = "Residuals")

cat("\n--- Refit Model Comparison Completed ---\n")
## 
## --- Refit Model Comparison Completed ---

Multiple Linear Regression (MLR) Model Selection

Comparison of MLR Models Across Scenarios:

  • R-squared: Remains consistently low across all scenarios, indicating limited predictive power.
  • RMSE: Drops sharply once the target is transformed and scaled, though RMSE on the scaled target is not directly comparable to dollar-scale RMSE.
  • AIC: Transformation and scaling produce markedly better AIC values, with the lowest values for the backward and stepwise models before outlier removal.
  • Residuals: The post-outlier-removal models exhibit the best residual distributions, indicating improved fit at the cost of a slight decrease in R-squared.

Interpreting the Coefficients of the Custom Transformed Model (after outlier removal)

  1. BLUEBOOK (Estimate: 0.081838, p-value: < 0.001):
    • Interpretation: A higher car value (BLUEBOOK) is associated with a higher transformed and scaled target amount. This makes sense, as more expensive cars may incur higher costs in claims.
    • Reasonableness: The positive coefficient aligns with expectations.
  2. HOME_VAL (Estimate: -0.045174, p-value: 0.0129):
    • Interpretation: A higher home value is negatively associated with the target amount. This is somewhat counterintuitive because wealthier individuals might have higher claims due to owning valuable assets.
    • Discussion: This negative relationship may reflect multicollinearity with other variables or insurance policies favoring higher-value homeowners. Further investigation is warranted, but the coefficient is statistically significant.
    • Decision: The variable is retained because it improves model fit despite its counterintuitive sign.
  3. CAR_TYPE_z_SUV (Estimate: -0.005109, p-value: 0.2972):
    • Interpretation: Driving an SUV is negatively associated with the target amount, though the coefficient is not statistically significant.
    • Reasonableness: SUVs are often considered safer vehicles, potentially leading to lower claim amounts, which partially explains the negative sign.
    • Decision: The variable is kept in the model for completeness, as excluding it might omit relevant interactions.
  4. CAR_TYPE_SportsCar (Estimate: -0.003813, p-value: 0.5482):
    • Interpretation: Driving a sports car is negatively associated with the target amount, though the coefficient is not statistically significant.
    • Reasonableness: This is counterintuitive since sports cars might be expected to have higher claims due to risky driving behavior.
    • Discussion: The lack of statistical significance and counterintuitive sign suggest this variable has limited predictive power.
    • Decision: This variable is retained to avoid over-simplifying the model, but further analysis is recommended.
  5. AGE (Estimate: 0.025177, p-value: 0.0782):
    • Interpretation: Older age is associated with a higher target amount, which aligns with expectations since older policyholders may carry higher premiums or claim amounts.
    • Reasonableness: The sign and direction are intuitive, even though the coefficient is only marginally significant.

Conclusion and Recommendations:
The model is retained because it balances interpretability and predictive power. Counterintuitive coefficients (e.g., HOME_VAL and the CAR_TYPE variables) warrant further investigation, potentially through feature engineering, interaction terms, or a multicollinearity check (see the sketch below). The statistical significance of most variables, combined with their theoretical relevance, justifies keeping them in the model despite some anomalies. The boss should be informed that, while the model has room for improvement, it provides a reasonable foundation for decision-making.
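
As one concrete follow-up, variance inflation factors (VIF) can flag the multicollinearity suspected above. A minimal sketch, assuming the car package (not otherwise used in this report) is available:

if (!require("car")) install.packages("car")
library(car)
# VIF values well above ~5 would suggest problematic collinearity
vif(refit_custom_model_outliers_removed)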

summary(custom_model_transformed_scaled)
## 
## Call:
## lm(formula = custom_formula, data = train_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.61332 -0.04837  0.00528  0.04875  0.36512 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         0.602105   0.005581 107.884  < 2e-16 ***
## BLUEBOOK            0.106546   0.015507   6.871 7.24e-12 ***
## INCOME             -0.010415   0.017465  -0.596   0.5510    
## AGE                 0.024714   0.011330   2.181   0.0292 *  
## HOME_VAL           -0.021972   0.014376  -1.528   0.1265    
## CAR_TYPE_z_SUV     -0.002622   0.003890  -0.674   0.5004    
## CAR_TYPE_SportsCar -0.007879   0.005074  -1.553   0.1205    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.09742 on 4516 degrees of freedom
## Multiple R-squared:  0.01575,    Adjusted R-squared:  0.01444 
## F-statistic: 12.04 on 6 and 4516 DF,  p-value: 1.812e-13

Alternative Models to Multiple Linear Regression for Improved Predictive Power

Feature Engineering

Objective: Add interaction terms and explore non-linear relationships.

# Work on copies of the cleaned training and test sets; the interaction terms
# themselves are added after the column checks below (see the sketch there)
cleaned_test_data <- test_data_outliers_removed
# Check if the two data frames have the same column names
if (all(names(cleaned_train_data) == names(cleaned_test_data))) {
  cat("The two data frames have the same column names in the same order.\n")
} else {
  cat("The two data frames do not have the same column names or the same order.\n")
  
  # Identify columns present in one but not the other
  missing_in_test <- setdiff(names(cleaned_train_data), names(cleaned_test_data))
  missing_in_train <- setdiff(names(cleaned_test_data), names(cleaned_train_data))
  
  if (length(missing_in_test) > 0) {
    cat("Columns in cleaned_train_data but not in cleaned_test_data:\n")
    print(missing_in_test)
  }
  
  if (length(missing_in_train) > 0) {
    cat("Columns in cleaned_test_data but not in cleaned_train_data:\n")
    print(missing_in_train)
  }
}
## The two data frames have the same column names in the same order.
# Check that the two data frames contain the same set of columns
if (!all(sort(names(cleaned_train_data)) == sort(names(cleaned_test_data)))) {
  cat("The data frames do not contain the same set of columns.\n")
} else {
  cat("The data frames contain the same set of columns.\n")
}
## The data frames contain the same set of columns.
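
With the column check passed, the interaction terms themselves can be added. A minimal sketch using one illustrative product term; the derived column name is an assumption of this sketch, not a variable used elsewhere in the report:

# Example engineered feature: interaction of vehicle value and vehicle age
cleaned_train_data$BLUEBOOK_x_CAR_AGE <- cleaned_train_data$BLUEBOOK * cleaned_train_data$CAR_AGE
cleaned_test_data$BLUEBOOK_x_CAR_AGE <- cleaned_test_data$BLUEBOOK * cleaned_test_data$CAR_AGE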

Alternative Models to MLR Comparison

  • Decision Tree: Weak predictive power (R-squared: 0.2229, RMSE: 5468.70; fit and scored in-sample on the original dollar scale).
  • Random Forest: Strong in-sample performance (R-squared: 0.9230, RMSE: 2341.79 on the original dollar scale), though in-sample figures are likely optimistic.
  • XGBoost: Moderate test-set accuracy on the transformed, scaled target (R-squared: 0.2191, RMSE: 0.0865).
  • SVR: Best test-set predictive accuracy (R-squared: 0.6349, RMSE: 0.0602).
  • ANN: Modest performance (R-squared: 0.1108, RMSE: 0.0951), suitable for non-linear data with further tuning.
  • MARS: Limited capability (R-squared: 0.0267, RMSE: 0.0962).
  • KNN: Subpar performance (R-squared: 0.0428, RMSE: 0.1223).
  • GLM: No improvement over MLR (R-squared: 0.0154, RMSE: 0.0969), matching the full linear model as expected for a Gaussian GLM.

Conclusion:

  • Best Predictive Power: SVR (test R-squared: 0.6349) offers the highest out-of-sample accuracy with minimal error.
  • Balanced Alternative: Random Forest posts strong metrics with some transparency through feature importance, though its figures here are in-sample.
  • Interpretability Priority: Linear regression remains ideal for insight-driven analysis despite weaker predictive power. Choose based on task requirements.
# Load necessary libraries
if (!require("nnet")) install.packages("nnet")
if (!require("earth")) install.packages("earth")
if (!require("class")) install.packages("class")
if (!require("e1071")) install.packages("e1071")
if (!require("rpart")) install.packages("rpart")
if (!require("ranger")) install.packages("ranger")
if (!require("xgboost")) install.packages("xgboost")
library(nnet)   # For ANN
library(earth)  # For MARS
library(class)  # For KNN
library(e1071)  # For SVR
library(rpart)  # For Decision Tree
library(ranger) # For Random Forest
library(xgboost)# For Gradient Boosting

# ---- Helper Function to Calculate R-Squared ----
calculate_r_squared <- function(predicted, actual) {
  cor(predicted, actual)^2
}

# ---- Decision Tree Model ----
# Note: the decision tree and random forest below are fit and evaluated
# in-sample on data_prepared (original dollar scale), so their metrics are
# optimistic and not directly comparable to the test-set metrics that follow
tree_model <- rpart(
  TARGET_AMT ~ ., 
  data = data_prepared, 
  method = "anova", 
  control = rpart.control(cp = 0.01, maxdepth = 5, minbucket = 10)
)
tree_preds <- predict(tree_model, newdata = data_prepared)
tree_rmse <- sqrt(mean((tree_preds - data_prepared$TARGET_AMT)^2))
tree_r_squared <- calculate_r_squared(tree_preds, data_prepared$TARGET_AMT)

# ---- Random Forest Model ----
rfr_model <- ranger(
  TARGET_AMT ~ ., 
  data = data_prepared, 
  mtry = floor(sqrt(ncol(data_prepared) - 1)), 
  num.trees = 100, 
  min.node.size = 5
)
rf_preds <- predict(rfr_model, data = data_prepared)$predictions
rf_rmse <- sqrt(mean((rf_preds - data_prepared$TARGET_AMT)^2))
rf_r_squared <- calculate_r_squared(rf_preds, data_prepared$TARGET_AMT)
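
Because the tree and forest above are scored in-sample, a fairer comparison would refit them on a train/test split. A sketch for the random forest, following the 70/30 partitioning convention used elsewhere in this report (object names are illustrative):

set.seed(123)  # reproducibility
rf_idx <- createDataPartition(data_prepared$TARGET_AMT, p = 0.7, list = FALSE)
rf_train <- data_prepared[rf_idx, ]
rf_test <- data_prepared[-rf_idx, ]
rfr_holdout <- ranger(
  TARGET_AMT ~ ., data = rf_train,
  mtry = floor(sqrt(ncol(rf_train) - 1)),
  num.trees = 100, min.node.size = 5
)
rf_holdout_preds <- predict(rfr_holdout, data = rf_test)$predictions
rf_holdout_rmse <- sqrt(mean((rf_holdout_preds - rf_test$TARGET_AMT)^2))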


# ---- Ensure TARGET_AMT Has Sufficient Variability ----
data_prepared_xgboost <- data_prepared %>%
  filter(TARGET_FLAG == 1) %>%
  dplyr::select(-c(TARGET_FLAG, INDEX))

if (length(unique(data_prepared_xgboost$TARGET_AMT)) < 2) {
  stop("TARGET_AMT must have at least 2 distinct values for partitioning.")
}

# ---- Step 1: Data Preparation ----
set.seed(123)  # For reproducibility

# Split data_prepared_xgboost into training and testing datasets
train_indices <- createDataPartition(data_prepared_xgboost$TARGET_AMT, p = 0.7, list = FALSE)
train_data_non_linear <- data_prepared_xgboost[train_indices, ]
test_data_non_linear <- data_prepared_xgboost[-train_indices, ]

# ---- Step 2: Box-Cox Transformation ----
library(MASS)

# Determine the optimal lambda for Box-Cox transformation
boxcox_result <- boxcox(lm(TARGET_AMT ~ ., data = train_data_non_linear), lambda = seq(-2, 2, by = 0.1))

optimal_lambda_non_linear <- boxcox_result$x[which.max(boxcox_result$y)]
cat("Optimal lambda for Box-Cox transformation (Non-Linear): ", optimal_lambda_non_linear, "\n")
## Optimal lambda for Box-Cox transformation (Non-Linear):  -0.02020202
# Apply the Box-Cox transformation
boxcox_transform <- function(y, lambda) {
  if (lambda == 0) {
    log(y)
  } else {
    (y^lambda - 1) / lambda
  }
}

train_data_non_linear$TARGET_AMT <- boxcox_transform(train_data_non_linear$TARGET_AMT, optimal_lambda_non_linear)
test_data_non_linear$TARGET_AMT <- boxcox_transform(test_data_non_linear$TARGET_AMT, optimal_lambda_non_linear)

# ---- Step 3: Min-Max Scaling ----
# Define min-max scaling function
min_max_scale <- function(x) {
  min_val <- min(x, na.rm = TRUE)
  max_val <- max(x, na.rm = TRUE)
  scaled <- (x - min_val) / (max_val - min_val)
  
  # Return the scaled data and the min/max values
  list(scaled = scaled, min_val = min_val, max_val = max_val)
}

# Scale the TARGET_AMT and store the min and max for later use
scaled_train_target <- min_max_scale(train_data_non_linear$TARGET_AMT)
train_data_non_linear$TARGET_AMT <- scaled_train_target$scaled
min_val_non_linear <- scaled_train_target$min_val
max_val_non_linear <- scaled_train_target$max_val

# Apply the same min and max values to the test data
test_data_non_linear$TARGET_AMT <- (test_data_non_linear$TARGET_AMT - min_val_non_linear) / (max_val_non_linear - min_val_non_linear)

# ---- Return the Min and Max Values for Later Use ----
cat("Optimal Lambda (Non-Linear): ", optimal_lambda_non_linear, "\n")
## Optimal Lambda (Non-Linear):  -0.02020202
cat("Min Value (Non-Linear): ", min_val_non_linear, "\n")
## Min Value (Non-Linear):  3.295567
cat("Max Value (Non-Linear): ", max_val_non_linear, "\n")
## Max Value (Non-Linear):  10.20269
# ---- Step 4: Proceed with Model Training ----

# ---- Gradient Boosting Model (XGBoost) ----
train_matrix <- xgb.DMatrix(
  data = as.matrix(train_data_non_linear %>% dplyr::select(-TARGET_AMT)), 
  label = train_data_non_linear$TARGET_AMT
)
test_matrix <- xgb.DMatrix(
  data = as.matrix(test_data_non_linear %>% dplyr::select(-TARGET_AMT))
)

xgb_model <- xgboost(
  data = train_matrix, 
  max_depth = 6, 
  eta = 0.1, 
  nrounds = 100, 
  subsample = 0.8, 
  colsample_bytree = 0.8, 
  objective = "reg:squarederror", 
  verbose = 0
)

xgb_preds <- predict(xgb_model, newdata = test_matrix)
xgb_rmse <- sqrt(mean((xgb_preds - test_data_non_linear$TARGET_AMT)^2))
xgb_r_squared <- cor(xgb_preds, test_data_non_linear$TARGET_AMT)^2
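
The number of boosting rounds is fixed at 100 above; choosing it by cross-validation is a common refinement. A sketch using xgb.cv with early stopping, where the settings are assumptions of this sketch rather than values used in the fitted model:

xgb_cv <- xgb.cv(
  params = list(max_depth = 6, eta = 0.1, subsample = 0.8,
                colsample_bytree = 0.8, objective = "reg:squarederror"),
  data = train_matrix,
  nrounds = 200,
  nfold = 5,
  early_stopping_rounds = 10,
  verbose = 0
)
cat("Best iteration by CV RMSE:", xgb_cv$best_iteration, "\n")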

# ---- Support Vector Regression (SVR) ----
svr_model <- svm(
  TARGET_AMT ~ ., 
  data = train_data_non_linear, 
  type = "eps-regression", 
  kernel = "radial", 
  cost = 10, 
  gamma = 0.1
)
svr_preds <- predict(svr_model, newdata = test_data_non_linear)
svr_rmse <- sqrt(mean((svr_preds - test_data_non_linear$TARGET_AMT)^2))
svr_r_squared <- cor(svr_preds, test_data_non_linear$TARGET_AMT)^2
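
The cost and gamma values above are taken as given; if retuning is desired, e1071 provides tune.svm() for a grid search. A sketch with an illustrative grid (the ranges are assumptions of this sketch):

svr_tune <- tune.svm(
  TARGET_AMT ~ ., data = train_data_non_linear,
  type = "eps-regression", kernel = "radial",
  cost = 10^(0:2), gamma = 10^(-3:-1)
)
print(svr_tune$best.parameters)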

# ---- Artificial Neural Network (ANN) with Optimal Parameters ----
ann_model <- nnet(
  TARGET_AMT ~ ., 
  data = train_data_non_linear, 
  size = 12,  # Optimal number of neurons
  linout = TRUE, 
  maxit = 250,  # Optimal iterations
  decay = 0.01  # Optimal weight decay
)
## # weights:  469
## initial  value 538.678680 
## iter  10 value 49.042287
## iter  20 value 43.801378
## iter  30 value 41.823788
## iter  40 value 40.103452
## iter  50 value 38.394275
## iter  60 value 37.141932
## iter  70 value 36.071657
## iter  80 value 35.376093
## iter  90 value 34.734553
## iter 100 value 34.155525
## iter 110 value 33.519609
## iter 120 value 33.049557
## iter 130 value 32.682916
## iter 140 value 32.400226
## iter 150 value 32.161397
## iter 160 value 31.891359
## iter 170 value 31.529416
## iter 180 value 31.092786
## iter 190 value 30.785310
## iter 200 value 30.509032
## iter 210 value 30.336202
## iter 220 value 30.205704
## iter 230 value 30.099580
## iter 240 value 30.015538
## iter 250 value 29.885509
## final  value 29.885509 
## stopped after 250 iterations
ann_preds <- predict(ann_model, newdata = test_data_non_linear)
ann_rmse <- sqrt(mean((ann_preds - test_data_non_linear$TARGET_AMT)^2))
ann_r_squared <- cor(ann_preds, test_data_non_linear$TARGET_AMT)^2
cat("Optimal ANN Parameters:\n")
## Optimal ANN Parameters:
print(list(size = 12, maxit = 250, decay = 0.01))
## $size
## [1] 12
## 
## $maxit
## [1] 250
## 
## $decay
## [1] 0.01
# ---- Multivariate Adaptive Regression Splines (MARS) with Optimal Parameters ----
mars_model <- earth(
  TARGET_AMT ~ ., 
  data = train_data_non_linear, 
  degree = 2,  # Optimal degree
  nprune = 10  # Optimal number of terms
)
mars_preds <- predict(mars_model, newdata = test_data_non_linear)
mars_rmse <- sqrt(mean((mars_preds - test_data_non_linear$TARGET_AMT)^2))
mars_r_squared <- cor(mars_preds, test_data_non_linear$TARGET_AMT)^2
cat("Optimal MARS Parameters:\n")
## Optimal MARS Parameters:
print(list(degree = 2, nprune = 10))
## $degree
## [1] 2
## 
## $nprune
## [1] 10
# ---- K-Nearest Neighbors (KNN) with Optimal Parameters ----
# Note: class::knn() is a classifier; the continuous TARGET_AMT is treated as
# a set of class labels here and converted back to numeric after prediction
knn_preds <- knn(
  train = as.matrix(train_data_non_linear %>% dplyr::select(-TARGET_AMT)),
  test = as.matrix(test_data_non_linear %>% dplyr::select(-TARGET_AMT)),
  cl = train_data_non_linear$TARGET_AMT,
  k = 7  # Optimal number of neighbors
)
knn_preds <- as.numeric(as.character(knn_preds))  # Convert factor to numeric
knn_rmse <- sqrt(mean((knn_preds - test_data_non_linear$TARGET_AMT)^2))
knn_r_squared <- cor(knn_preds, test_data_non_linear$TARGET_AMT)^2
cat("Optimal KNN Parameters:\n")
## Optimal KNN Parameters:
print(list(k = 7))
## $k
## [1] 7
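
Since treating the continuous target as class labels is a workaround, a regression-native alternative may be preferable; caret's knnreg() averages the numeric responses of the k nearest neighbors. A sketch reusing k = 7 (object names are illustrative):

knn_reg_model <- knnreg(TARGET_AMT ~ ., data = train_data_non_linear, k = 7)
knn_reg_preds <- predict(knn_reg_model, newdata = test_data_non_linear)
knn_reg_rmse <- sqrt(mean((knn_reg_preds - test_data_non_linear$TARGET_AMT)^2))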
# ---- Generalized Linear Model (GLM) ----
glm_model <- glm(
  TARGET_AMT ~ ., 
  data = train_data_non_linear, 
  family = gaussian()
)
glm_preds <- predict(glm_model, newdata = test_data_non_linear)
glm_rmse <- sqrt(mean((glm_preds - test_data_non_linear$TARGET_AMT)^2))
glm_r_squared <- cor(glm_preds, test_data_non_linear$TARGET_AMT)^2
cat("GLM does not have hyperparameters.\n")
## GLM does not have hyperparameters.
# ---- Combine Results into a Data Frame ----
model_results <- data.frame(
  Model = c("Decision Tree", "Random Forest", "XGBoost", "SVR", "ANN", "MARS", "KNN", "GLM"),
  R_Squared = c(tree_r_squared, rf_r_squared, xgb_r_squared, svr_r_squared, ann_r_squared, mars_r_squared, knn_r_squared, glm_r_squared),
  RMSE = c(tree_rmse, rf_rmse, xgb_rmse, svr_rmse, ann_rmse, mars_rmse, knn_rmse, glm_rmse)
)

# Print the results
print(model_results)
##           Model  R_Squared         RMSE
## 1 Decision Tree 0.22287457 5.468704e+03
## 2 Random Forest 0.92295215 2.341788e+03
## 3       XGBoost 0.21911756 8.647294e-02
## 4           SVR 0.63494412 6.022868e-02
## 5           ANN 0.11077688 9.509019e-02
## 6          MARS 0.02674665 9.624123e-02
## 7           KNN 0.04282017 1.223387e-01
## 8           GLM 0.01541150 9.694449e-02

Comparison of Multiple Linear Regression and Non-Linear Models for Regression

Best Model for Balancing Accuracy and Interpretability

SVR achieves the best test-set predictive performance in this run (R-Squared: 0.6349, RMSE: 0.0602) and is well suited to tasks where accuracy is paramount. However, it lacks interpretability, making it unsuitable for scenarios where understanding variable relationships is critical.

Random Forest pairs strong accuracy (R-Squared: 0.9230, RMSE: 2341.79 on the original dollar scale, though these in-sample figures are likely optimistic) with some transparency through feature importance, making it a reasonable choice for tasks requiring strong predictions with partial interpretability. Stepwise Linear Regression (R-Squared: 0.0153, RMSE: 7496.93 on the original dollar scale) is highly interpretable but has limited predictive power, best suited for exploratory analysis and scenarios emphasizing transparency over accuracy.

Conclusion: Choose SVR for maximum prediction accuracy, Random Forest for a balance of accuracy and interpretability, and Stepwise Linear Regression when understanding variable impacts takes precedence over precise predictions.

Predictions

Logistic Model Predictions

# Handle missing TARGET_FLAG for prediction
if (all(is.na(processed_evaluation_data$TARGET_FLAG))) {
  cat("TARGET_FLAG is completely NA. Proceeding with prediction-only mode.\n")
  
  # Drop TARGET_FLAG (it won't be used for prediction or evaluation)
  stepwise_evaluation_data <- processed_evaluation_data %>%
    dplyr::select(
      INDEX, KIDSDRIV, HOMEKIDS, YOJ, INCOME, HOME_VAL, TRAVTIME, BLUEBOOK, TIF, OLDCLAIM, 
      CLM_FREQ, MVR_PTS, PARENT1_Yes, MSTATUS_z_No, SEX_z_F, EDUCATION_Bachelors, 
      EDUCATION_Masters, EDUCATION_PhD, JOB_Clerical, JOB_HomeMaker, JOB_Lawyer, 
      JOB_Manager, JOB_Professional, JOB_Student, JOB_z_BlueCollar, 
      CAR_USE_Private, CAR_TYPE_PanelTruck, CAR_TYPE_Pickup, CAR_TYPE_SportsCar, 
      CAR_TYPE_Van, CAR_TYPE_z_SUV, REVOKED_Yes, URBANICITY_z_HighlyRuralRural
    )
  
  # Perform predictions
  predicted_probs <- predict(stepwise_model_blr, newdata = stepwise_evaluation_data, type = "response")
  
  # Convert predicted probabilities into binary classes based on a 0.5 threshold
  predicted_classes <- ifelse(predicted_probs > 0.5, "1", "0")
  
  # Prepare the output with INDEX and predicted values
  predictions_df <- stepwise_evaluation_data %>%
    dplyr::select(INDEX) %>%
    dplyr::mutate(Predicted_TARGET_FLAG = predicted_classes)
  
  # Save predictions to a CSV file
  output_file <- "binary_logistic_stepwise_predictions.csv"
  write.csv(predictions_df, file = output_file, row.names = FALSE)
  cat("Binary Logistic Predicted values with INDEX saved to", output_file, "\n")
  
  # Print a few rows of the output
  print(head(predictions_df))
} else {
  stop("TARGET_FLAG is not fully NA. Please review the dataset.")
}
## TARGET_FLAG is completely NA. Proceeding with prediction-only mode.
## Binary Logistic Predicted values with INDEX saved to binary_logistic_stepwise_predictions.csv 
##   INDEX Predicted_TARGET_FLAG
## 1     3                     0
## 2     9                     0
## 3    10                     0
## 4    18                     1
## 5    21                     0
## 6    30                     0

XGBoost Binary Classification Model Predictions

# ---- Step 1: Create xgb_evaluation_data ----
# Drop irrelevant columns from processed_evaluation_data
xgb_evaluation_data <- processed_evaluation_data %>%
  dplyr::select(-any_of(c("TARGET_FLAG", "TARGET_AMT")))

# ---- Step 2: Check if Training and Evaluation Data are Identical in Structure ----
# Ensure only predictor columns are compared
training_vars_xgb <- colnames(data_prepared %>% dplyr::select(-any_of(c("INDEX", "TARGET_FLAG", "TARGET_AMT"))))
evaluation_vars_xgb <- colnames(xgb_evaluation_data %>% dplyr::select(-any_of(c("INDEX"))))

# Print differences, if any
cat("\nTraining Variables (XGB Binary Classification):\n")
## 
## Training Variables (XGB Binary Classification):
print(training_vars_xgb)
##  [1] "KIDSDRIV"                      "AGE"                          
##  [3] "HOMEKIDS"                      "YOJ"                          
##  [5] "INCOME"                        "HOME_VAL"                     
##  [7] "TRAVTIME"                      "BLUEBOOK"                     
##  [9] "TIF"                           "OLDCLAIM"                     
## [11] "CLM_FREQ"                      "MVR_PTS"                      
## [13] "CAR_AGE"                       "PARENT1_Yes"                  
## [15] "MSTATUS_z_No"                  "SEX_z_F"                      
## [17] "EDUCATION_Bachelors"           "EDUCATION_Masters"            
## [19] "EDUCATION_PhD"                 "EDUCATION_z_HighSchool"       
## [21] "JOB_Clerical"                  "JOB_Doctor"                   
## [23] "JOB_HomeMaker"                 "JOB_Lawyer"                   
## [25] "JOB_Manager"                   "JOB_Professional"             
## [27] "JOB_Student"                   "JOB_z_BlueCollar"             
## [29] "CAR_USE_Private"               "CAR_TYPE_PanelTruck"          
## [31] "CAR_TYPE_Pickup"               "CAR_TYPE_SportsCar"           
## [33] "CAR_TYPE_Van"                  "CAR_TYPE_z_SUV"               
## [35] "RED_CAR_yes"                   "REVOKED_Yes"                  
## [37] "URBANICITY_z_HighlyRuralRural"
cat("\nEvaluation Variables (XGB Binary Classification):\n")
## 
## Evaluation Variables (XGB Binary Classification):
print(evaluation_vars_xgb)
##  [1] "KIDSDRIV"                      "AGE"                          
##  [3] "HOMEKIDS"                      "YOJ"                          
##  [5] "INCOME"                        "HOME_VAL"                     
##  [7] "TRAVTIME"                      "BLUEBOOK"                     
##  [9] "TIF"                           "OLDCLAIM"                     
## [11] "CLM_FREQ"                      "MVR_PTS"                      
## [13] "CAR_AGE"                       "PARENT1_Yes"                  
## [15] "MSTATUS_z_No"                  "SEX_z_F"                      
## [17] "EDUCATION_Bachelors"           "EDUCATION_Masters"            
## [19] "EDUCATION_PhD"                 "EDUCATION_z_HighSchool"       
## [21] "JOB_Clerical"                  "JOB_Doctor"                   
## [23] "JOB_HomeMaker"                 "JOB_Lawyer"                   
## [25] "JOB_Manager"                   "JOB_Professional"             
## [27] "JOB_Student"                   "JOB_z_BlueCollar"             
## [29] "CAR_USE_Private"               "CAR_TYPE_PanelTruck"          
## [31] "CAR_TYPE_Pickup"               "CAR_TYPE_SportsCar"           
## [33] "CAR_TYPE_Van"                  "CAR_TYPE_z_SUV"               
## [35] "RED_CAR_yes"                   "REVOKED_Yes"                  
## [37] "URBANICITY_z_HighlyRuralRural"
if (!identical(training_vars_xgb, evaluation_vars_xgb)) {
  cat("\nDifferences Found:\n")
  
  # Variables in training but not in evaluation
  missing_in_evaluation <- setdiff(training_vars_xgb, evaluation_vars_xgb)
  if (length(missing_in_evaluation) > 0) {
    cat("Variables in training but not in evaluation:\n")
    print(missing_in_evaluation)
  }
  
  # Variables in evaluation but not in training
  missing_in_training <- setdiff(evaluation_vars_xgb, training_vars_xgb)
  if (length(missing_in_training) > 0) {
    cat("Variables in evaluation but not in training:\n")
    print(missing_in_training)
  }
  
  stop("The training and evaluation data do not have identical structures and variable orders.")
} else {
  cat("\nThe training and evaluation data have identical structures and variable orders.\n")
}
## 
## The training and evaluation data have identical structures and variable orders.
# ---- Step 3: Predict Using the XGB Binary Classification Model ----
# Convert evaluation data to matrix
xgb_evaluation_matrix <- as.matrix(xgb_evaluation_data %>% dplyr::select(-INDEX))

# Make predictions using the XGB model
xgb_predictions_prob <- predict(xgb_bc_model, newdata = xgb_evaluation_matrix)

# Classify based on threshold (e.g., 0.5)
xgb_predictions_class <- ifelse(xgb_predictions_prob >= 0.5, 1, 0)

# ---- Step 4: Prepare Final Output ----
# Combine predictions with INDEX
final_output_xgb <- processed_evaluation_data %>%
  dplyr::select(INDEX) %>%
  dplyr::mutate(
    Predicted_Probability = xgb_predictions_prob,
    Predicted_Class = xgb_predictions_class
  )

# Print the final output
cat("Final Predictions with INDEX:\n")
## Final Predictions with INDEX:
print(head(final_output_xgb))
##   INDEX Predicted_Probability Predicted_Class
## 1     3            0.15785217               0
## 2     9            0.49727950               0
## 3    10            0.03101074               0
## 4    18            0.22074331               0
## 5    21            0.17195618               0
## 6    30            0.08795069               0
# ---- Optional: Save to CSV ----
output_file_xgb <- "xgb_binary_classification_predictions.csv"
write.csv(final_output_xgb, file = output_file_xgb, row.names = FALSE)
cat("Predicted values saved to", output_file_xgb, "\n")
## Predicted values saved to xgb_binary_classification_predictions.csv

Pairwise Scatter Plot of Binary Logistic Predicted Probabilities - Logistic and XGBoost

The pairwise scatter plot and correlation analysis between the predicted probabilities of the Stepwise Logistic Regression model and the XGBoost Binary Classification model reveal a strong positive relationship, with a correlation coefficient of 0.823. This indicates that the two models are generally consistent in their probability predictions, although some divergence can be observed, particularly in areas with higher probabilities.

The density plots suggest differences in the distribution of probabilities between the models, with XGBoost exhibiting a more pronounced skew. This might indicate that XGBoost captures patterns in the data that the stepwise model does not. Further evaluation with performance metrics is recommended once actual outcomes become available.

# ---- Step 5: Prepare Data for Pairwise Scatter Plot ----
# Combine predictions from both models into a single data frame
comparison_df <- data.frame(
  INDEX = final_output_xgb$INDEX,
  Stepwise_Prob = as.numeric(predicted_probs),  # From Stepwise Logistic Regression
  XGBoost_Prob = xgb_predictions_prob          # From XGBoost Binary Classification
)

# ---- Step 6: Generate Pairwise Scatter Plot ----
# Load necessary library
if (!require("GGally")) install.packages("GGally")
## Loading required package: GGally
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(GGally)

# Generate the scatter plot
cat("Generating pairwise scatter plot for model predictions...\n")
## Generating pairwise scatter plot for model predictions...
ggpairs(
  data = comparison_df %>% dplyr::select(Stepwise_Prob, XGBoost_Prob),
  title = "Pairwise Scatter Plot of Predicted Probabilities",
  diag = list(continuous = wrap("densityDiag")),
  lower = list(continuous = wrap("smooth")),
  upper = list(continuous = wrap("cor", size = 3))
) + 
  theme_minimal()

# ---- Optional: Save Comparison Data ----
output_comparison_file <- "model_predictions_comparison.csv"
write.csv(comparison_df, file = output_comparison_file, row.names = FALSE)
cat("Comparison data saved to", output_comparison_file, "\n")
## Comparison data saved to model_predictions_comparison.csv

Multiple Linear Regression Predictions

# ---- Step 1: Create custom_evaluation_data ----
# Drop TARGET_FLAG and TARGET_AMT from processed_evaluation_data
custom_evaluation_data <- processed_evaluation_data %>%
  dplyr::select(-any_of(c("TARGET_FLAG", "TARGET_AMT")))

# ---- Step 2: Check if Training and Evaluation Data are Identical in Structure ----
# Ensure only predictor columns are compared
training_vars_custom <- colnames(train_data_non_linear %>% dplyr::select(-any_of(c("INDEX", "TARGET_AMT"))))
evaluation_vars_custom <- colnames(custom_evaluation_data %>% dplyr::select(-any_of(c("INDEX"))))

# Verify if the training and evaluation datasets match
if (!identical(training_vars_custom, evaluation_vars_custom)) {
  stop("The training and evaluation data do not have identical structures and variable orders.")
} else {
  cat("The training and evaluation data have identical structures and variable orders.\n")
}
## The training and evaluation data have identical structures and variable orders.
# ---- Step 3: Predict Using the Custom Model ----
# Make predictions using the custom_model_transformed_scaled model
custom_predictions_scaled <- predict(custom_model_transformed_scaled, newdata = custom_evaluation_data)

# ---- Step 4: Reverse Min-Max Scaling ----
# min_val / max_val must be the training-set minimum and maximum used for the forward scaling
reverse_min_max_scale <- function(scaled_values, min_val, max_val) {
  scaled_values * (max_val - min_val) + min_val
}

predictions_boxcox_transformed <- reverse_min_max_scale(
  custom_predictions_scaled, 
  train_min_val_transformed, 
  train_max_val_transformed
)

# ---- Step 5: Reverse Box-Cox Transformation ----
# Inverts the standard Box-Cox transform; for non-zero lambda, ((y * lambda) + 1)
# must be positive, otherwise the result is NaN
reverse_boxcox_transform <- function(y, lambda) {
  if (lambda == 0) {
    exp(y)
  } else {
    ((y * lambda) + 1)^(1 / lambda)
  }
}

final_predictions_transformed <- reverse_boxcox_transform(
  predictions_boxcox_transformed, 
  optimal_lambda_transformed
)
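
# A quick round-trip sanity check (illustrative; it assumes the forward
# transform was the standard Box-Cox, (x^lambda - 1) / lambda). If that holds,
# reverse_boxcox_transform() should recover the input exactly:
boxcox_transform_check <- function(x, lambda) {
  if (lambda == 0) log(x) else (x^lambda - 1) / lambda
}
x_check <- c(1000, 5000, 10000)  # hypothetical TARGET_AMT values
stopifnot(isTRUE(all.equal(
  reverse_boxcox_transform(
    boxcox_transform_check(x_check, optimal_lambda_transformed),
    optimal_lambda_transformed
  ),
  x_check
)))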

# ---- Step 6: Prepare Final Output ----
# Check if INDEX exists in processed_evaluation_data
if ("INDEX" %in% colnames(processed_evaluation_data)) {
  # Combine predictions with the INDEX column
  final_output_transformed <- processed_evaluation_data %>%
    dplyr::select(any_of("INDEX")) %>%
    dplyr::mutate(Predicted_TARGET_AMT = final_predictions_transformed)
} else {
  # Only include predictions if INDEX is not available
  final_output_transformed <- data.frame(
    Predicted_TARGET_AMT = final_predictions_transformed
  )
}

# Print the final output
cat("Final Predictions with INDEX:\n")
## Final Predictions with INDEX:
print(head(final_output_transformed))
##   INDEX Predicted_TARGET_AMT
## 1     3             5132.953
## 2     9             4723.222
## 3    10             3718.583
## 4    18             3929.826
## 5    21             4698.292
## 6    30             5152.383
# ---- Optional: Save to CSV ----
output_file_transformed <- "mlr_custom_model_predictions.csv"
write.csv(final_output_transformed, file = output_file_transformed, row.names = FALSE)
cat("Predicted values saved to", output_file_transformed, "\n")
## Predicted values saved to mlr_custom_model_predictions.csv
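
One caveat when back-transforming: the inverse Box-Cox is undefined when (y * lambda) + 1 is negative, which silently produces NaN predictions. A minimal sanity check on the output above (a suggested addition, not part of the original pipeline):

# Confirm the back-transformed predictions are finite and non-negative
stopifnot(all(is.finite(final_output_transformed$Predicted_TARGET_AMT)))
stopifnot(all(final_output_transformed$Predicted_TARGET_AMT >= 0))
summary(final_output_transformed$Predicted_TARGET_AMT)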

SVR Model Prediction

# ---- Step 1: Create svr_evaluation_data ----
# Drop TARGET_FLAG and TARGET_AMT from processed_evaluation_data
svr_evaluation_data <- processed_evaluation_data %>%
  dplyr::select(-any_of(c("TARGET_FLAG", "TARGET_AMT")))

# ---- Step 2: Check if Training and Evaluation Data are Identical in Structure ----
# Ensure only predictor columns are compared
training_vars <- colnames(train_data_non_linear %>% dplyr::select(-any_of(c("INDEX", "TARGET_AMT"))))
evaluation_vars <- colnames(svr_evaluation_data %>% dplyr::select(-any_of(c("INDEX"))))

# Verify if the training and evaluation datasets match
if (!identical(training_vars, evaluation_vars)) {
  stop("The training and evaluation data do not have identical structures and variable orders.")
} else {
  cat("The training and evaluation data have identical structures and variable orders.\n")
}
## The training and evaluation data have identical structures and variable orders.
# ---- Step 3: Predict Using the SVR Model ----
# Make predictions using the SVR model
svr_predictions_scaled_non_linear <- predict(svr_model, newdata = svr_evaluation_data)

# ---- Step 4: Reverse Min-Max Scaling ----
reverse_min_max_scale <- function(scaled_values, min_val, max_val) {
  scaled_values * (max_val - min_val) + min_val
}

predictions_boxcox_non_linear <- reverse_min_max_scale(
  svr_predictions_scaled_non_linear, 
  min_val_non_linear, 
  max_val_non_linear
)

# ---- Step 5: Reverse Box-Cox Transformation ----
reverse_boxcox_transform <- function(y, lambda) {
  if (lambda == 0) {
    exp(y)
  } else {
    ((y * lambda) + 1)^(1 / lambda)
  }
}

final_predictions_non_linear <- reverse_boxcox_transform(
  predictions_boxcox_non_linear, 
  optimal_lambda_non_linear
)

# ---- Step 6: Prepare Final Output ----
# Check if INDEX exists in processed_evaluation_data
if ("INDEX" %in% colnames(processed_evaluation_data)) {
  # Combine predictions with the INDEX column
  final_output_non_linear <- processed_evaluation_data %>%
    dplyr::select(any_of("INDEX")) %>%
    dplyr::mutate(Predicted_TARGET_AMT = final_predictions_non_linear)
} else {
  # Only include predictions if INDEX is not available
  final_output_non_linear <- data.frame(
    Predicted_TARGET_AMT = final_predictions_non_linear
  )
}

# Print the final output
cat("Final Predictions with INDEX:\n")
## Final Predictions with INDEX:
print(head(final_output_non_linear))
##   INDEX Predicted_TARGET_AMT
## 1     3             3571.259
## 2     9             3529.115
## 3    10             3418.264
## 4    18             3924.852
## 5    21             4013.145
## 6    30             6376.306
# ---- Optional: Save to CSV ----
output_file_non_linear <- "svr_predictions.csv"
write.csv(final_output_non_linear, file = output_file_non_linear, row.names = FALSE)
cat("Predicted values saved to", output_file_non_linear, "\n")
## Predicted values saved to svr_predictions.csv
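
Note that reverse_min_max_scale and reverse_boxcox_transform are re-defined verbatim in both the MLR and SVR chunks. A small consolidation sketch (an organizational suggestion, not part of the original pipeline) would define one helper and reuse it with each model's saved scaling bounds and lambda:

# Hypothetical combined helper: undo min-max scaling, then undo Box-Cox
undo_scaling_and_boxcox <- function(scaled_values, min_val, max_val, lambda) {
  y <- scaled_values * (max_val - min_val) + min_val             # reverse min-max
  if (lambda == 0) exp(y) else ((y * lambda) + 1)^(1 / lambda)   # reverse Box-Cox
}
# Usage, e.g. for the SVR predictions:
# final_predictions_non_linear <- undo_scaling_and_boxcox(
#   svr_predictions_scaled_non_linear,
#   min_val_non_linear, max_val_non_linear, optimal_lambda_non_linear
# )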

Random Forest Prediction

# ---- Step 1: Create rfr_evaluation_data ----
# Drop TARGET_AMT only; TARGET_FLAG is retained because the RFR model was
# trained with it as a predictor (see the training variable list below)
rfr_evaluation_data <- processed_evaluation_data %>%
  dplyr::select(-any_of(c("TARGET_AMT")))

# ---- Step 2: Check if Training and Evaluation Data are Identical in Structure ----
# Ensure only predictor columns are compared
training_vars_rfr <- colnames(data_prepared %>% dplyr::select(-any_of(c("INDEX", "TARGET_AMT"))))
evaluation_vars_rfr <- colnames(rfr_evaluation_data %>% dplyr::select(-any_of(c("INDEX"))))

# Print training and evaluation variables
cat("\nTraining Variables (RFR):\n")
## 
## Training Variables (RFR):
print(training_vars_rfr)
##  [1] "KIDSDRIV"                      "AGE"                          
##  [3] "HOMEKIDS"                      "YOJ"                          
##  [5] "INCOME"                        "HOME_VAL"                     
##  [7] "TRAVTIME"                      "BLUEBOOK"                     
##  [9] "TIF"                           "OLDCLAIM"                     
## [11] "CLM_FREQ"                      "MVR_PTS"                      
## [13] "CAR_AGE"                       "PARENT1_Yes"                  
## [15] "MSTATUS_z_No"                  "SEX_z_F"                      
## [17] "EDUCATION_Bachelors"           "EDUCATION_Masters"            
## [19] "EDUCATION_PhD"                 "EDUCATION_z_HighSchool"       
## [21] "JOB_Clerical"                  "JOB_Doctor"                   
## [23] "JOB_HomeMaker"                 "JOB_Lawyer"                   
## [25] "JOB_Manager"                   "JOB_Professional"             
## [27] "JOB_Student"                   "JOB_z_BlueCollar"             
## [29] "CAR_USE_Private"               "CAR_TYPE_PanelTruck"          
## [31] "CAR_TYPE_Pickup"               "CAR_TYPE_SportsCar"           
## [33] "CAR_TYPE_Van"                  "CAR_TYPE_z_SUV"               
## [35] "RED_CAR_yes"                   "REVOKED_Yes"                  
## [37] "URBANICITY_z_HighlyRuralRural" "TARGET_FLAG"
cat("\nEvaluation Variables (RFR):\n")
## 
## Evaluation Variables (RFR):
print(evaluation_vars_rfr)
##  [1] "TARGET_FLAG"                   "KIDSDRIV"                     
##  [3] "AGE"                           "HOMEKIDS"                     
##  [5] "YOJ"                           "INCOME"                       
##  [7] "HOME_VAL"                      "TRAVTIME"                     
##  [9] "BLUEBOOK"                      "TIF"                          
## [11] "OLDCLAIM"                      "CLM_FREQ"                     
## [13] "MVR_PTS"                       "CAR_AGE"                      
## [15] "PARENT1_Yes"                   "MSTATUS_z_No"                 
## [17] "SEX_z_F"                       "EDUCATION_Bachelors"          
## [19] "EDUCATION_Masters"             "EDUCATION_PhD"                
## [21] "EDUCATION_z_HighSchool"        "JOB_Clerical"                 
## [23] "JOB_Doctor"                    "JOB_HomeMaker"                
## [25] "JOB_Lawyer"                    "JOB_Manager"                  
## [27] "JOB_Professional"              "JOB_Student"                  
## [29] "JOB_z_BlueCollar"              "CAR_USE_Private"              
## [31] "CAR_TYPE_PanelTruck"           "CAR_TYPE_Pickup"              
## [33] "CAR_TYPE_SportsCar"            "CAR_TYPE_Van"                 
## [35] "CAR_TYPE_z_SUV"                "RED_CAR_yes"                  
## [37] "REVOKED_Yes"                   "URBANICITY_z_HighlyRuralRural"
# ---- Step 3: Compare the Structures and Print Differences ----
# Note: identical() is order-sensitive, so a difference in column ORDER alone
# triggers the "Differences Found:" message even though setdiff() (which ignores
# order) lists nothing. That is exactly what happens here: TARGET_FLAG is last
# in the training variables but first in the evaluation variables.
if (!identical(training_vars_rfr, evaluation_vars_rfr)) {
  cat("\nDifferences Found:\n")
  
  # Variables in training but not in evaluation
  missing_in_evaluation <- setdiff(training_vars_rfr, evaluation_vars_rfr)
  if (length(missing_in_evaluation) > 0) {
    cat("Variables in training but not in evaluation:\n")
    print(missing_in_evaluation)
  }
  
  # Variables in evaluation but not in training
  missing_in_training <- setdiff(evaluation_vars_rfr, training_vars_rfr)
  if (length(missing_in_training) > 0) {
    cat("Variables in evaluation but not in training:\n")
    print(missing_in_training)
  }
  
} else {
  cat("\nThe training and evaluation data have identical structures and variable orders.\n")
}
## 
## Differences Found:
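# Remedy sketch (a suggested fix, not part of the original pipeline): because
# the mismatch above is ordering-only, reordering the evaluation columns to the
# training order makes the identical() check pass. ranger's predict() matches
# columns by name, so this is defensive rather than strictly required.
rfr_evaluation_data <- rfr_evaluation_data %>%
  dplyr::select(dplyr::any_of(c("INDEX", training_vars_rfr)))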
# ---- Step 4: Predict Using the RFR Model ----
# Make predictions using the rfr_model
rfr_predictions <- predict(rfr_model, data = rfr_evaluation_data)$predictions

# ---- Step 5: Prepare Final Output ----
# Check if INDEX exists in processed_evaluation_data
if ("INDEX" %in% colnames(processed_evaluation_data)) {
  # Combine predictions with the INDEX column
  final_output_rfr <- processed_evaluation_data %>%
    dplyr::select(any_of("INDEX")) %>%
    dplyr::mutate(Predicted_TARGET_AMT = rfr_predictions)
} else {
  # Only include predictions if INDEX is not available
  final_output_rfr <- data.frame(
    Predicted_TARGET_AMT = rfr_predictions
  )
}

# Print the final output
cat("Final Predictions with INDEX:\n")
## Final Predictions with INDEX:
print(head(final_output_rfr))
##   INDEX Predicted_TARGET_AMT
## 1     3             267.2494
## 2     9             572.7645
## 3    10             113.5184
## 4    18             159.0546
## 5    21             326.9725
## 6    30             354.4321
# ---- Optional: Save to CSV ----
output_file_rfr <- "rfr_model_predictions.csv"
write.csv(final_output_rfr, file = output_file_rfr, row.names = FALSE)
cat("Predicted values saved to", output_file_rfr, "\n")
## Predicted values saved to rfr_model_predictions.csv

Pairwise Scatter Plot of Regression Predicted Values - MLR, SVR, and RFR

This plot, titled "Pairwise Scatter Plot of Model Predictions", compares the predicted values from three models: Multiple Linear Regression (MLR), Support Vector Regression (SVR), and Random Forest Regression (RFR). The pairwise correlations between the predictions are as follows: - MLR vs. SVR: very low (0.028), indicating almost no linear relationship. - MLR vs. RFR: the strongest of the three pairs (0.169), though still weak. - SVR vs. RFR: also weak (0.122), falling between the other two pairs.

These low correlations suggest that the models make largely distinct predictions, likely reflecting their different functional forms and fits to the data. RFR's predictions align somewhat better with the other two models than SVR's do.

# ---- Step 1: Prepare Prediction Data Frames ----
# Ensure predictions from all models have an INDEX column and align them
if (!"INDEX" %in% colnames(final_output_transformed)) {
  stop("INDEX column is missing in MLR predictions.")
}

mlr_predictions <- final_output_transformed %>%
  dplyr::rename(MLR_Pred = Predicted_TARGET_AMT)

if (!"INDEX" %in% colnames(final_output_non_linear)) {
  stop("INDEX column is missing in SVR predictions.")
}

svr_predictions <- final_output_non_linear %>%
  dplyr::rename(SVR_Pred = Predicted_TARGET_AMT)

if (!"INDEX" %in% colnames(final_output_rfr)) {
  stop("INDEX column is missing in RFR predictions.")
}

rfr_predictions <- final_output_rfr %>%
  dplyr::rename(RFR_Pred = Predicted_TARGET_AMT)

# ---- Step 2: Combine Predictions ----
# Combine predictions into a single data frame for comparison
comparison_df <- mlr_predictions %>%
  dplyr::left_join(svr_predictions, by = "INDEX") %>%
  dplyr::left_join(rfr_predictions, by = "INDEX")

# Check if combination was successful
if (nrow(comparison_df) == 0) {
  stop("Failed to combine predictions. Ensure all data frames have the INDEX column and proper alignment.")
}

# ---- Step 3: Pairwise Scatter Plot ----
library(GGally)

# Generate pairwise scatter plot
ggpairs(
  comparison_df %>% dplyr::select(MLR_Pred, SVR_Pred, RFR_Pred),
  title = "Pairwise Scatter Plot of Model Predictions"
) +
  theme_minimal() +
  scale_color_viridis_d()  # no colour aesthetic is mapped, so this scale has no visible effect here
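
For a compact numeric complement to the plot, the pairwise Pearson correlations cited above can also be printed directly from the combined data frame (a short sketch; complete.obs guards against any NAs introduced by the joins):

# Numeric correlation matrix for the three sets of predictions
cor_matrix <- cor(
  comparison_df %>% dplyr::select(MLR_Pred, SVR_Pred, RFR_Pred),
  use = "complete.obs"
)
print(round(cor_matrix, 3))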