The objective is to build two models using the provided dataset: a multiple linear regression model to predict the cost of a car crash (a continuous variable) and a binary logistic regression model to predict the probability of a car crash (a binary outcome). Only the given variables, or any new variables derived from them, can be used to build these models. The dataset includes a set of variables relevant to the prediction task, which are described in the following section.
The dataset includes approximately 8,000 observations, with two key response variables: - TARGET_FLAG: Binary indicator (1 = involved in a car crash, 0 = not involved). - TARGET_AMT: Continuous variable representing the cost of a car crash (zero if no crash occurred).
Key findings: - Missing data in CAR_AGE, HOME_VAL, YOJ, and INCOME, with percentages ranging from 5% to 6%. - Skewness in TARGET_AMT and other variables required transformations for better modeling. - Identified correlations between predictors and response variables informed model feature selection.
The dataset underwent preprocessing to enhance model readiness: - Handling Missing Data: Imputed missing values with medians for numeric variables and modes for categorical ones. - Feature Engineering: Interaction terms (e.g., BLUEBOOK * CAR_AGE) were created, and categorical variables were one-hot encoded. - Balancing the Dataset: SMOTE oversampling addressed imbalances in TARGET_FLAG. - Transformation and Scaling: - Box-Cox transformations were applied to normalize skewed variables. - Min-max scaling normalized continuous variables for compatibility with algorithms like SVR and ANN.
Multiple models were built and compared for both regression and classification tasks:
Each model was evaluated using performance metrics: - Regression Models: - RMSE and Adjusted \(R^2\) were primary metrics for prediction accuracy and model fit. - Residual analysis verified assumptions of linearity and homoscedasticity. - Classification Models: - AUC and ROC curves assessed classification accuracy. - Precision, recall, F1 score, and accuracy provided comprehensive evaluations. - Models were also assessed on interpretability and computational efficiency.
By including additional models, this comprehensive approach provides enhanced benchmarks for comparison, ensuring robust and interpretable results tailored to the assignment’s requirements.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(DataExplorer)
library(caret)
## Loading required package: lattice
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
library(e1071)
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
library(pROC)
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
library(VIM)
## Loading required package: colorspace
##
## Attaching package: 'colorspace'
## The following object is masked from 'package:pROC':
##
## coords
## Loading required package: grid
## VIM is ready to use.
## Suggestions and bug-reports can be submitted at: https://github.com/statistikat/VIM/issues
##
## Attaching package: 'VIM'
## The following object is masked from 'package:datasets':
##
## sleep
The training dataset consists of 8,161 observations and 26 variables. Several categorical variables require transformation to be compatible with both logistic and linear models. We begin with some fundamental transformations.
## 'data.frame': 8161 obs. of 26 variables:
## $ INDEX : int 1 2 4 5 6 7 8 11 12 13 ...
## $ TARGET_FLAG: int 0 0 0 0 0 1 0 1 1 0 ...
## $ TARGET_AMT : num 0 0 0 0 0 ...
## $ KIDSDRIV : int 0 0 0 0 0 0 0 1 0 0 ...
## $ AGE : int 60 43 35 51 50 34 54 37 34 50 ...
## $ HOMEKIDS : int 0 0 1 0 0 1 0 2 0 0 ...
## $ YOJ : int 11 11 10 14 NA 12 NA NA 10 7 ...
## $ INCOME : chr "$67,349" "$91,449" "$16,039" "" ...
## $ PARENT1 : chr "No" "No" "No" "No" ...
## $ HOME_VAL : chr "$0" "$257,252" "$124,191" "$306,251" ...
## $ MSTATUS : chr "z_No" "z_No" "Yes" "Yes" ...
## $ SEX : chr "M" "M" "z_F" "M" ...
## $ EDUCATION : chr "PhD" "z_High School" "z_High School" "<High School" ...
## $ JOB : chr "Professional" "z_Blue Collar" "Clerical" "z_Blue Collar" ...
## $ TRAVTIME : int 14 22 5 32 36 46 33 44 34 48 ...
## $ CAR_USE : chr "Private" "Commercial" "Private" "Private" ...
## $ BLUEBOOK : chr "$14,230" "$14,940" "$4,010" "$15,440" ...
## $ TIF : int 11 1 4 7 1 1 1 1 1 7 ...
## $ CAR_TYPE : chr "Minivan" "Minivan" "z_SUV" "Minivan" ...
## $ RED_CAR : chr "yes" "yes" "no" "yes" ...
## $ OLDCLAIM : chr "$4,461" "$0" "$38,690" "$0" ...
## $ CLM_FREQ : int 2 0 2 0 2 0 0 1 0 0 ...
## $ REVOKED : chr "No" "No" "No" "No" ...
## $ MVR_PTS : int 3 0 3 0 3 0 0 10 0 1 ...
## $ CAR_AGE : int 18 1 10 6 17 7 1 7 1 17 ...
## $ URBANICITY : chr "Highly Urban/ Urban" "Highly Urban/ Urban" "Highly Urban/ Urban" "Highly Urban/ Urban" ...
INDEX
column because it is not
informative.# Load necessary library
library(dplyr)
# List of datasets to process
data_list <- list(
insurance_training = insurance_training,
insurance_evaluation = insurance_evaluation
)
# Function to process a dataset
process_insurance_data <- function(data) {
data <- data %>%
dplyr::mutate(
INDEX = as.numeric(INDEX),
TARGET_AMT = as.numeric(TARGET_AMT),
KIDSDRIV = as.numeric(KIDSDRIV),
AGE = as.numeric(AGE),
HOMEKIDS = as.numeric(HOMEKIDS),
YOJ = as.numeric(YOJ),
INCOME = as.numeric(gsub("[^0-9\\.]", "", INCOME)),
HOME_VAL = as.numeric(gsub("[^0-9\\.]", "", HOME_VAL)),
TRAVTIME = as.numeric(TRAVTIME),
BLUEBOOK = as.numeric(gsub("[^0-9\\.]", "", BLUEBOOK)),
TIF = as.numeric(TIF),
OLDCLAIM = as.numeric(gsub("[^0-9\\.]", "", OLDCLAIM)),
CLM_FREQ = as.numeric(CLM_FREQ),
MVR_PTS = as.numeric(MVR_PTS),
CAR_AGE = as.numeric(CAR_AGE),
TARGET_FLAG = as.integer(TARGET_FLAG),
RED_CAR = as.factor(RED_CAR),
REVOKED = as.factor(REVOKED),
CAR_USE = as.factor(CAR_USE),
URBANICITY = as.factor(URBANICITY),
PARENT1 = as.factor(PARENT1),
MSTATUS = as.factor(MSTATUS),
SEX = as.factor(SEX),
EDUCATION = as.factor(EDUCATION),
JOB = as.factor(JOB),
CAR_TYPE = as.factor(CAR_TYPE)
) # %>%
# dplyr::select(-1) # Drop the first column
return(data)
}
# Process each dataset using lapply
processed_data_list <- lapply(data_list, function(data) {
tryCatch(
process_insurance_data(data),
error = function(e) {
cat("Error processing dataset:", conditionMessage(e), "\n")
NULL
}
)
})
# Remove NULL results if any dataset failed processing
processed_data_list <- processed_data_list[!sapply(processed_data_list, is.null)]
# Assign the processed datasets back to their variables
insurance_training <- processed_data_list$insurance_training
insurance_evaluation <- processed_data_list$insurance_evaluation
# Check the structure of the processed datasets
cat("### Structure of Insurance Training Data ###\n")
## ### Structure of Insurance Training Data ###
str(insurance_training)
## 'data.frame': 8161 obs. of 26 variables:
## $ INDEX : num 1 2 4 5 6 7 8 11 12 13 ...
## $ TARGET_FLAG: int 0 0 0 0 0 1 0 1 1 0 ...
## $ TARGET_AMT : num 0 0 0 0 0 ...
## $ KIDSDRIV : num 0 0 0 0 0 0 0 1 0 0 ...
## $ AGE : num 60 43 35 51 50 34 54 37 34 50 ...
## $ HOMEKIDS : num 0 0 1 0 0 1 0 2 0 0 ...
## $ YOJ : num 11 11 10 14 NA 12 NA NA 10 7 ...
## $ INCOME : num 67349 91449 16039 NA 114986 ...
## $ PARENT1 : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 2 1 1 1 1 ...
## $ HOME_VAL : num 0 257252 124191 306251 243925 ...
## $ MSTATUS : Factor w/ 2 levels "Yes","z_No": 2 2 1 1 1 2 1 1 2 2 ...
## $ SEX : Factor w/ 2 levels "M","z_F": 1 1 2 1 2 2 2 1 2 1 ...
## $ EDUCATION : Factor w/ 5 levels "<High School",..: 4 5 5 1 4 2 1 2 2 2 ...
## $ JOB : Factor w/ 9 levels "","Clerical",..: 7 9 2 9 3 9 9 9 2 7 ...
## $ TRAVTIME : num 14 22 5 32 36 46 33 44 34 48 ...
## $ CAR_USE : Factor w/ 2 levels "Commercial","Private": 2 1 2 2 2 1 2 1 2 1 ...
## $ BLUEBOOK : num 14230 14940 4010 15440 18000 ...
## $ TIF : num 11 1 4 7 1 1 1 1 1 7 ...
## $ CAR_TYPE : Factor w/ 6 levels "Minivan","Panel Truck",..: 1 1 6 1 6 4 6 5 6 5 ...
## $ RED_CAR : Factor w/ 2 levels "no","yes": 2 2 1 2 1 1 1 2 1 1 ...
## $ OLDCLAIM : num 4461 0 38690 0 19217 ...
## $ CLM_FREQ : num 2 0 2 0 2 0 0 1 0 0 ...
## $ REVOKED : Factor w/ 2 levels "No","Yes": 1 1 1 1 2 1 1 2 1 1 ...
## $ MVR_PTS : num 3 0 3 0 3 0 0 10 0 1 ...
## $ CAR_AGE : num 18 1 10 6 17 7 1 7 1 17 ...
## $ URBANICITY : Factor w/ 2 levels "Highly Urban/ Urban",..: 1 1 1 1 1 1 1 1 1 2 ...
cat("\n\n### Structure of Insurance Evaluation Data ###\n")
##
##
## ### Structure of Insurance Evaluation Data ###
# str(insurance_evaluation)
# Create vectors of numeric and categorical variables for insurance_training
numeric_vars <- names(insurance_training)[sapply(insurance_training, is.numeric)]
categorical_vars <- names(insurance_training)[sapply(insurance_training, is.factor)]
# Print the vectors
cat("\n### Numeric Variables in Training Data ###\n")
##
## ### Numeric Variables in Training Data ###
print(numeric_vars)
## [1] "INDEX" "TARGET_FLAG" "TARGET_AMT" "KIDSDRIV" "AGE"
## [6] "HOMEKIDS" "YOJ" "INCOME" "HOME_VAL" "TRAVTIME"
## [11] "BLUEBOOK" "TIF" "OLDCLAIM" "CLM_FREQ" "MVR_PTS"
## [16] "CAR_AGE"
# cat("\n### Categorical Variables in Training Data ###\n")
print(categorical_vars)
## [1] "PARENT1" "MSTATUS" "SEX" "EDUCATION" "JOB"
## [6] "CAR_USE" "CAR_TYPE" "RED_CAR" "REVOKED" "URBANICITY"
Notable statistics from the summary of numerical variables:
Age: The mean age of respondents is approximately 44.79 years, with a minimum age of 16 and a maximum of 81 years. Only 0.07% of age data is missing, indicating robust data coverage.
Income: The mean income is $61,898.09, with a median of $54,028 and a maximum of $367,030, showing significant variability (SD = $47,572.68). Income data has 5.45% missing values.
TARGET_AMT: This variable is highly skewed (Skewness = 8.71), with a mean of $1,504.32 and a maximum of $107,586.1, indicating the presence of outliers in the data.
CAR_AGE: The mean car age is 8.33 years, with a negative minimum value (-3), possibly indicating data errors. The variable has the highest percentage of missing values (6.2%).
YOJ (Years on Job): The median is 11 years, but 5.56% of data is missing, suggesting some gaps in job tenure information.
HOME_VAL (Home Value): With a mean of $154,867.29 and a median of $161,160, home values appear moderately skewed. Approximately 5.69% of the data is missing.
TARGET_FLAG: The binary target variable is imbalanced, with a mean of 0.26, indicating that only about 26% of respondents fall under the positive class.
These insights highlight areas of potential data cleaning, especially for missing values, outliers, and negative entries, as well as important trends that may inform predictive modeling.
# Load necessary libraries
library(kableExtra)
##
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
##
## group_rows
library(dplyr)
library(tidyr)
# Correctly select numerical variables excluding INDEX but keeping it in the dataset
numerical_vars <- insurance_training %>%
dplyr::select(all_of(numeric_vars)) %>%
dplyr::select(-INDEX)
# Compute statistical summary including missing value counts and percentages
statistical_summary <- numerical_vars %>%
summarise(across(
everything(),
list(
Min = ~round(min(., na.rm = TRUE), 2),
Q1 = ~round(quantile(., 0.25, na.rm = TRUE), 2),
Mean = ~round(mean(., na.rm = TRUE), 2),
Median = ~round(median(., na.rm = TRUE), 2),
Q3 = ~round(quantile(., 0.75, na.rm = TRUE), 2),
Max = ~round(max(., na.rm = TRUE), 2),
SD = ~round(sd(., na.rm = TRUE), 2),
Skewness = ~round(skewness(., na.rm = TRUE), 2),
Kurtosis = ~round(kurtosis(., na.rm = TRUE), 2),
Outliers = ~sum(. > (quantile(., 0.75, na.rm = TRUE) + 1.5 * IQR(., na.rm = TRUE)) |
. < (quantile(., 0.25, na.rm = TRUE) - 1.5 * IQR(., na.rm = TRUE)), na.rm = TRUE),
Missing = ~sum(is.na(.)), # Count of missing values
PercentMissing = ~round(mean(is.na(.)) * 100, 2) # Percentage of missing values
),
.names = "{.col}_{.fn}"
)) %>%
pivot_longer(
cols = everything(),
names_to = c("Variable", ".value"),
names_pattern = "^(.*)_(.*)$"
)
# Display the resulting summary table
print(statistical_summary)
## # A tibble: 15 × 13
## Variable Min Q1 Mean Median Q3 Max SD Skewness Kurtosis
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 TARGET_FL… 0 0 2.6 e-1 0 1 1 e0 4.4 e-1 1.07 -0.85
## 2 TARGET_AMT 0 0 1.50e+3 0 1036 1.08e5 4.70e+3 8.71 112.
## 3 KIDSDRIV 0 0 1.7 e-1 0 0 4 e0 5.1 e-1 3.35 11.8
## 4 AGE 16 39 4.48e+1 45 51 8.1 e1 8.63e+0 -0.03 -0.06
## 5 HOMEKIDS 0 0 7.2 e-1 0 1 5 e0 1.12e+0 1.34 0.65
## 6 YOJ 0 9 1.05e+1 11 13 2.3 e1 4.09e+0 -1.2 1.18
## 7 INCOME 0 28097 6.19e+4 54028 85986 3.67e5 4.76e+4 1.19 2.13
## 8 HOME_VAL 0 0 1.55e+5 161160 238724 8.85e5 1.29e+5 0.49 -0.02
## 9 TRAVTIME 5 22 3.35e+1 33 44 1.42e2 1.59e+1 0.45 0.66
## 10 BLUEBOOK 1500 9280 1.57e+4 14440 20850 6.97e4 8.42e+3 0.79 0.79
## 11 TIF 1 1 5.35e+0 4 7 2.5 e1 4.15e+0 0.89 0.42
## 12 OLDCLAIM 0 0 4.04e+3 0 4636 5.70e4 8.78e+3 3.12 9.86
## 13 CLM_FREQ 0 0 8 e-1 0 2 5 e0 1.16e+0 1.21 0.28
## 14 MVR_PTS 0 0 1.7 e+0 1 3 1.3 e1 2.15e+0 1.35 1.38
## 15 CAR_AGE -3 1 8.33e+0 8 12 2.8 e1 5.7 e+0 0.28 -0.75
## # ℹ 3 more variables: Outliers <int>, Missing <int>, PercentMissing <dbl>
TARGET_FLAG
):
PARENT1
(mode: “No”, 86.80%, ratio:
6.58), REVOKED
(mode: “0”, 87.75%, ratio: 7.16), and
CAR_TYPE
(mode: “z_SUV”, 28.11%, ratio: 3.39).URBANICITY
(0.73), REVOKED
(0.54).JOB
(3.00) with 9 levels, mode “z_Blue
Collar” (22.36%).CAR_USE
,
SEX
.JOB
(9 levels) and
CAR_TYPE
(6 levels) may need one-hot encoding or
grouping.Key Insights:
1. Address TARGET_FLAG
imbalance through resampling or
weighting.
2. Handle high-imbalance variables (JOB
,
CAR_TYPE
) by consolidating rare levels.
3. Leverage high-entropy variables for predictive power with robust
preprocessing.
4. Use tree-based models to manage categorical variables and imbalances
effectively.
library(dplyr)
library(knitr)
library(kableExtra)
# Select categorical columns including TARGET_FLAG
categorical_columns <- insurance_training %>%
dplyr::select(all_of(c(categorical_vars, "TARGET_FLAG")))
# Function to calculate Shannon entropy
calculate_entropy <- function(counts) {
proportions <- counts / sum(counts)
entropy <- -sum(proportions * log2(proportions), na.rm = TRUE)
return(entropy)
}
# Function to calculate imbalance ratio
calculate_imbalance_ratio <- function(counts) {
max_count <- max(counts, na.rm = TRUE)
min_count <- min(counts, na.rm = TRUE)
if (min_count == 0) {
imbalance_ratio <- Inf # Avoid division by zero
} else {
imbalance_ratio <- max_count / min_count
}
return(imbalance_ratio)
}
# Compute the summary for each categorical variable
categorical_summary <- lapply(names(categorical_columns), function(var) {
summary_df <- insurance_training %>%
count(!!sym(var), .drop = FALSE) %>% # Ensure all levels are considered
mutate(Percentage = round(n / sum(n) * 100, 2)) %>%
rename(Level = !!sym(var), Count = n) %>%
mutate(Variable = var) # Add the variable name for identification
# Compute the mode for the variable
mode_row <- summary_df %>%
filter(Count == max(Count, na.rm = TRUE)) %>%
slice_head(n = 1) %>% # Select the first row in case of ties
pull(Level)
# Compute percentage for the mode
mode_percentage <- summary_df %>%
filter(Level == mode_row) %>%
pull(Percentage) %>%
first() # Ensure it works even if there are multiple matches
# Count missing values for the variable
missing_count <- sum(is.na(insurance_training[[var]]))
# Count unique levels
unique_levels_count <- n_distinct(insurance_training[[var]])
# Compute entropy
entropy <- calculate_entropy(summary_df$Count)
# Compute imbalance ratio
imbalance_ratio <- calculate_imbalance_ratio(summary_df$Count)
# Combine into a single row summary for the variable
final_row <- data.frame(
Variable = var,
Mode = as.character(mode_row), # Ensure Mode is always a character
Mode_Percentage = mode_percentage,
Missing_Count = missing_count,
Unique_Levels = unique_levels_count,
Entropy = round(entropy, 2),
Imbalance_Ratio = round(imbalance_ratio, 2),
stringsAsFactors = FALSE # Avoid factors unless explicitly needed
)
return(final_row)
})
# Combine summaries into a single data frame
categorical_summary_df <- bind_rows(categorical_summary)
# Print the resulting summary
cat("\nSummary of Categorical Variables\n")
##
## Summary of Categorical Variables
print(categorical_summary_df)
## Variable Mode Mode_Percentage Missing_Count Unique_Levels
## 1 PARENT1 No 86.80 0 2
## 2 MSTATUS Yes 59.97 0 2
## 3 SEX z_F 53.61 0 2
## 4 EDUCATION z_High School 28.55 0 5
## 5 JOB z_Blue Collar 22.36 0 9
## 6 CAR_USE Private 62.88 0 2
## 7 CAR_TYPE z_SUV 28.11 0 6
## 8 RED_CAR no 70.86 0 2
## 9 REVOKED No 87.75 0 2
## 10 URBANICITY Highly Urban/ Urban 79.55 0 2
## 11 TARGET_FLAG 0 73.62 0 2
## Entropy Imbalance_Ratio
## 1 0.56 6.58
## 2 0.97 1.50
## 3 1.00 1.16
## 4 2.21 3.20
## 5 3.00 7.42
## 6 0.95 1.69
## 7 2.42 3.39
## 8 0.87 2.43
## 9 0.54 7.16
## 10 0.73 3.89
## 11 0.83 2.79
AGE
,
INCOME
, CAR_AGE
) show varying levels of
skewness, such as high skewness in TARGET_AMT
and
OLDCLAIM
. Transformations (e.g., log or square root) may be
necessary to approximate normality and stabilize variance.CAR_AGE
and YOJ
have
relatively normal distributions and are well-suited for MLR without
significant transformation.OLDCLAIM
, CLM_FREQ
) may need special handling,
such as creating dummy variables or using nonlinear modeling.TARGET_FLAG
is binary, making it
directly suitable for logistic regression.HOME_VAL
, INCOME
) may benefit from
normalization or scaling to improve model convergence.REVOKED
and
URBANICITY
can be one-hot encoded or directly included as
factors.TIF
,
CLM_FREQ
) should be carefully evaluated for
multicollinearity or sparse levels, which may impact model
interpretation.# Load the necessary libraries
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0 ✔ readr 2.1.5
## ✔ lubridate 1.9.3 ✔ stringr 1.5.1
## ✔ purrr 1.0.2 ✔ tibble 3.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ gridExtra::combine() masks dplyr::combine()
## ✖ dplyr::filter() masks stats::filter()
## ✖ kableExtra::group_rows() masks dplyr::group_rows()
## ✖ dplyr::lag() masks stats::lag()
## ✖ purrr::lift() masks caret::lift()
## ✖ car::recode() masks dplyr::recode()
## ✖ MASS::select() masks dplyr::select()
## ✖ purrr::some() masks car::some()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Define a function to create individual histograms for numerical variables
create_histogram <- function(data, variable) {
ggplot(data, aes(x = .data[[variable]])) +
geom_histogram(fill = "darkgrey", color = "black", bins = 30, na.rm = TRUE) +
ggtitle(paste("Distribution of", variable)) +
xlab(variable) +
ylab("Frequency") +
theme_minimal()
}
# Define a function to create individual bar plots for categorical variables
create_bar_plot <- function(data, variable) {
ggplot(data, aes(x = .data[[variable]])) +
geom_bar(fill = "darkgrey", color = "black", na.rm = TRUE) +
ggtitle(paste("Distribution of", variable)) +
xlab(variable) +
ylab("Count") +
theme_minimal()
}
# List to hold all plots (numerical and categorical)
all_plots <- list()
# Add TARGET_FLAG and TARGET_AMT plots explicitly
all_plots <- append(all_plots, list(
ggplot(insurance_training, aes(x = TARGET_FLAG)) +
geom_bar(fill = "darkgrey", color = "black") +
scale_x_continuous(breaks = c(0, 1), labels = c("0", "1"), limits = c(0, 1), oob = scales::squish, expand = c(0, 0)) +
ggtitle("Distribution of TARGET_FLAG") +
xlab("TARGET_FLAG") +
ylab("Count") +
theme_minimal(),
ggplot(insurance_training %>% filter(TARGET_FLAG == 1), aes(x = TARGET_AMT)) +
geom_histogram(fill = "darkgrey", color = "black", bins = 30, na.rm = TRUE) +
ggtitle("Distribution of TARGET_AMT (Filtered by TARGET_FLAG = 1)") +
xlab("TARGET_AMT") +
ylab("Frequency") +
theme_minimal()
))
# Add numerical variable plots
numerical_variable_groups <- list(
c("AGE", "YOJ"),
c("INCOME", "HOME_VAL"),
c("TRAVTIME", "BLUEBOOK"),
c("TIF", "OLDCLAIM"),
c("CLM_FREQ", "MVR_PTS"),
c("CAR_AGE") # Remaining single variable
)
for (group in numerical_variable_groups) {
for (variable in group) {
all_plots <- append(all_plots, list(create_histogram(insurance_training, variable)))
}
}
# Add categorical variable plots
categorical_variable_groups <- list(
c("PARENT1", "MSTATUS"),
c("SEX", "EDUCATION"),
c("JOB", "CAR_USE"),
c("CAR_TYPE", "RED_CAR"),
c("REVOKED", "URBANICITY"),
c("TARGET_FLAG") # Remaining single variable
)
for (group in categorical_variable_groups) {
for (variable in group) {
all_plots <- append(all_plots, list(create_bar_plot(insurance_training, variable)))
}
}
# Arrange all plots in grids of four per page with two columns
for (i in seq(1, length(all_plots), by = 4)) {
grid.arrange(grobs = all_plots[i:min(i + 3, length(all_plots))], ncol = 2)
}
CAR_AGE
(>6%), HOME_VAL
, YOJ
, and
INCOME
(~5%).AGE
(<1%).we will impute missing values for critical variables like
CAR_AGE
and HOME_VAL
to maintain model
integrity.
training_data <- insurance_training
# Filter only columns with missing data
training_data_with_missing <- training_data %>%
dplyr::select(where(~ any(is.na(.))))
# Missing Data Visualization for columns with missing data
aggr(training_data_with_missing,
col = c('navyblue', 'red'),
numbers = TRUE,
sortVars = TRUE,
labels = names(training_data_with_missing),
cex.axis = 0.7,
gap = 3,
ylab = c("Missing data", "Pattern"))
## Warning in plot.aggr(res, ...): not enough horizontal space to display
## frequencies
##
## Variables sorted by number of missings:
## Variable Count
## CAR_AGE 0.062492342
## HOME_VAL 0.056855777
## YOJ 0.055630437
## INCOME 0.054527631
## AGE 0.000735204
Below are the updated steps followed to prepare the dataset for multiple linear regression, logistic regression, and tree-based models, including class balancing:
TARGET_FLAG = 1
) to match the majority
class size."Missing"
to retain all data points.INCOME
, HOME_VAL
,
OLDCLAIM
, and TARGET_AMT
to be strictly
positive by adding a constant if necessary.AGE_GROUP
and
TRAVTIME_BUCKET
by binning continuous variables into
categorical groups for interpretability.KIDSDRIV_RATIO
and
HOMEKIDS_RATIO
by dividing by AGE
./
, and
+
from variable names for consistency.TARGET_AMT
.TARGET_FLAG
.# Load necessary libraries
library(dplyr)
library(kableExtra)
# List of datasets to process
data_list <- list(
insurance_training = insurance_training,
insurance_evaluation = insurance_evaluation
)
# Function to process a dataset
process_insurance_data <- function(data, is_evaluation = FALSE) {
data <- data %>%
dplyr::mutate(
TARGET_FLAG = if (!is_evaluation) as.factor(TARGET_FLAG) else TARGET_FLAG,
TARGET_AMT = if (!is_evaluation) as.numeric(TARGET_AMT) else TARGET_AMT,
KIDSDRIV = as.numeric(KIDSDRIV),
AGE = as.numeric(AGE),
HOMEKIDS = as.numeric(HOMEKIDS),
YOJ = as.numeric(YOJ),
INCOME = as.numeric(gsub("[^0-9\\.]", "", INCOME)),
HOME_VAL = as.numeric(gsub("[^0-9\\.]", "", HOME_VAL)),
TRAVTIME = as.numeric(TRAVTIME),
BLUEBOOK = as.numeric(gsub("[^0-9\\.]", "", BLUEBOOK)),
TIF = as.numeric(TIF),
OLDCLAIM = as.numeric(gsub("[^0-9\\.]", "", OLDCLAIM)),
CLM_FREQ = as.numeric(CLM_FREQ),
MVR_PTS = as.numeric(MVR_PTS),
CAR_AGE = as.numeric(CAR_AGE),
RED_CAR = as.factor(RED_CAR),
REVOKED = as.factor(REVOKED),
CAR_USE = as.factor(CAR_USE),
URBANICITY = as.factor(URBANICITY),
PARENT1 = as.factor(PARENT1),
MSTATUS = as.factor(MSTATUS),
SEX = as.factor(SEX),
EDUCATION = as.factor(EDUCATION),
JOB = as.factor(JOB),
CAR_TYPE = as.factor(CAR_TYPE)
)
return(data)
}
# Process each dataset using lapply
processed_data_list <- lapply(names(data_list), function(name) {
is_evaluation <- name == "insurance_evaluation"
tryCatch(
process_insurance_data(data_list[[name]], is_evaluation = is_evaluation),
error = function(e) {
cat("Error processing dataset:", conditionMessage(e), "\n")
NULL
}
)
})
# Remove NULL results if any dataset failed processing
processed_data_list <- processed_data_list[!sapply(processed_data_list, is.null)]
# Assign the processed datasets back to their variables
insurance_training <- processed_data_list[[1]]
insurance_evaluation <- processed_data_list[[2]]
# ---- Define Numeric and Categorical Variables ----
numeric_vars <- names(insurance_training)[sapply(insurance_training, is.numeric)]
categorical_vars <- names(insurance_training)[sapply(insurance_training, is.factor)]
# ---- Function for Imputation Without Missing Flags ----
impute_data <- function(data, numeric_cols, categorical_cols, exclude_vars = NULL) {
# Create a copy to avoid altering the original dataset
data_imputed <- data
exclude_vars <- exclude_vars %>% unique()
# Initialize a list to store summaries
summary_list <- list()
# Handle numeric columns
for (col in numeric_cols) {
if (!col %in% exclude_vars) {
# Record missing and negative counts before imputation
missing_count_before <- sum(is.na(data_imputed[[col]]))
negative_count_before <- sum(data_imputed[[col]] < 0, na.rm = TRUE)
# Impute missing values with the median
if (missing_count_before > 0) {
observed_median <- median(data_imputed[[col]], na.rm = TRUE)
imputation_value <- observed_median
data_imputed[[col]][is.na(data_imputed[[col]])] <- imputation_value
}
# Impute negative values with the median of non-negative values
if (negative_count_before > 0) {
observed_median <- median(data_imputed[[col]][data_imputed[[col]] >= 0], na.rm = TRUE)
data_imputed[[col]][data_imputed[[col]] < 0] <- observed_median
}
# Record missing and negative counts after imputation
missing_count_after <- sum(is.na(data_imputed[[col]]))
negative_count_after <- sum(data_imputed[[col]] < 0, na.rm = TRUE)
summary_list[[col]] <- data.frame(
Variable = col,
Missing_Count_Before = missing_count_before,
Negative_Count_Before = negative_count_before,
Missing_Count_After = missing_count_after,
Negative_Count_After = negative_count_after
)
}
}
# Handle categorical columns (e.g., mode imputation)
for (col in categorical_cols) {
if (!col %in% exclude_vars) {
# Record missing count before imputation
missing_count_before <- sum(is.na(data_imputed[[col]]))
# Impute missing values with the mode
if (missing_count_before > 0) {
mode_value <- names(sort(table(data_imputed[[col]], useNA = "no"), decreasing = TRUE))[1]
data_imputed[[col]][is.na(data_imputed[[col]])] <- mode_value
}
# Record missing count after imputation
missing_count_after <- sum(is.na(data_imputed[[col]]))
summary_list[[col]] <- data.frame(
Variable = col,
Missing_Count_Before = missing_count_before,
Negative_Count_Before = NA, # Not applicable to categorical variables
Missing_Count_After = missing_count_after,
Negative_Count_After = NA
)
}
}
# Combine summaries into a single dataframe
summary_df <- do.call(rbind, summary_list) %>%
filter(
Missing_Count_Before > 0 | Negative_Count_Before > 0 |
Missing_Count_After > 0 | Negative_Count_After > 0
)
return(list(Imputed_Data = data_imputed, Summary = summary_df))
}
# ---- Apply Imputation to Both Datasets ----
training_result <- impute_data(
insurance_training, numeric_vars, categorical_vars
)
evaluation_result <- impute_data(
insurance_evaluation, numeric_vars, categorical_vars,
exclude_vars = c("TARGET_FLAG", "TARGET_AMT")
)
# Extract imputed datasets
training_data_imputed <- training_result$Imputed_Data
evaluation_data_imputed <- evaluation_result$Imputed_Data
# ---- Output Results ----
training_summary <- training_result$Summary
training_summary %>%
kable(caption = "Training Data Imputation Summary") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Variable | Missing_Count_Before | Negative_Count_Before | Missing_Count_After | Negative_Count_After | |
---|---|---|---|---|---|
AGE | AGE | 6 | 0 | 0 | 0 |
YOJ | YOJ | 454 | 0 | 0 | 0 |
INCOME | INCOME | 445 | 0 | 0 | 0 |
HOME_VAL | HOME_VAL | 464 | 0 | 0 | 0 |
CAR_AGE | CAR_AGE | 510 | 1 | 0 | 0 |
evaluation_summary <- evaluation_result$Summary
cat("\nEvaluation Data Imputation Summary\n")
##
## Evaluation Data Imputation Summary
print(evaluation_summary)
## Variable Missing_Count_Before Negative_Count_Before
## AGE AGE 1 0
## YOJ YOJ 94 0
## INCOME INCOME 125 0
## HOME_VAL HOME_VAL 111 0
## CAR_AGE CAR_AGE 129 0
## Missing_Count_After Negative_Count_After
## AGE 0 0
## YOJ 0 0
## INCOME 0 0
## HOME_VAL 0 0
## CAR_AGE 0 0
# Load the required library
library(caret)
# Identify near-zero variance predictors
nzv_indices <- nearZeroVar(training_data_imputed)
# Check if any NZV predictors are present
if (length(nzv_indices) > 0) {
# NZV predictors are present
cat("Near-zero variance predictors detected.\n")
# List the removed column names
removed_columns <- colnames(training_data_imputed)[nzv_indices]
cat("Removed the following near-zero variance predictors:\n")
print(removed_columns)
# Remove near-zero variance predictors
filtered_data <- training_data_imputed[, -nzv_indices]
} else {
# No NZV predictors are present
cat("No near-zero variance predictors found.\n")
# Keep the data unchanged
filtered_data <- training_data_imputed
}
## No near-zero variance predictors found.
TARGET_AMT
Where
TARGET_FLAG = 1
The decision tree reveals the following key insights about the
factors influencing TARGET_AMT
when
TARGET_FLAG = 1
:
BLUEBOOK
(estimated value of the insured
vehicle) is the most significant predictor. It splits the data into two
branches:
BLUEBOOK < 17,000
: A majority of observations fall
here with a lower average TARGET_AMT
.BLUEBOOK >= 17,000
: Further splits based on other
variables.BLUEBOOK
and CAR_AGE
:
BLUEBOOK >= 17,000
, the split depends on
CAR_AGE
. Older cars (CAR_AGE >= 13
) result
in lower predicted TARGET_AMT
.BLUEBOOK
, CAR_AGE
, and
HOME_VAL
:
CAR_AGE < 13
) with higher
BLUEBOOK
, HOME_VAL
becomes important. A higher
HOME_VAL
(>= 320,000) leads to higher predicted
TARGET_AMT
.HOME_VAL
and MSTATUS
:
HOME_VAL >= 320,000
), marital
status (MSTATUS
) influences the prediction:
MSTATUS = Yes
leads to moderately high
TARGET_AMT
.MSTATUS = z_No
results in the lowest observed
TARGET_AMT
.BLUEBOOK * CAR_AGE
: Interaction between car value and
age, reflecting their combined effect on TARGET_AMT
.CAR_AGE * HOME_VAL
: Reflects how the home value
modifies the impact of car age on TARGET_AMT
.HOME_VAL * MSTATUS
: Indicates the interaction between
home value and marital status in determining
TARGET_AMT
.BLUEBOOK * CAR_AGE
and HOME_VAL * MSTATUS
in
regression or machine learning models to capture these
relationships.library(rpart.plot)
## Loading required package: rpart
library(dplyr)
# Filter the data for rows where TARGET_FLAG = 1
filtered_data_target_flag <- training_data_imputed %>% filter(TARGET_FLAG == 1)
# Ensure target variable and predictors are valid
target_variable <- "TARGET_AMT"
excluded_variables <- c("TARGET_FLAG", target_variable)
predictors <- setdiff(names(filtered_data_target_flag), excluded_variables)
# Check if predictors are valid
if (length(predictors) == 0) {
stop("No valid predictors found. Ensure that filtered_data_target_flag has columns other than TARGET_FLAG and TARGET_AMT.")
}
# Create the formula
formula <- as.formula(paste(target_variable, "~", paste(predictors, collapse = "+")))
# Fit the decision tree model
tree_model <- rpart(
formula,
data = filtered_data_target_flag,
method = "anova", # Use "anova" for regression trees
control = rpart.control(cp = 0.01) # Adjust complexity parameter as needed
)
# Plot the tree
rpart.plot(
tree_model,
type = 3, # Draw tree with extra info
extra = 101, # Show node numbers, splits, and predictions
box.palette = "auto", # Automatically assign colors
main = "Decision Tree for TARGET_AMT Where TARGET_FLAG = 1"
)
TARGET_FLAG
Excluding
TARGET_AMT
The decision tree identifies key patterns and interactions for
predicting TARGET_FLAG
while excluding the
TARGET_AMT
variable:
OLDCLAIM
(previous claim amounts) is the
most significant predictor:
OLDCLAIM < 529
: This branch
represents a majority group with a higher likelihood of
TARGET_FLAG = 0
.OLDCLAIM >= 529
: This branch
further splits based on job roles.OLDCLAIM
and JOB
:
OLDCLAIM >= 529
, job categories become
influential:
Doctor
,
Lawyer
, Manager
, or Professional
are more likely to have TARGET_FLAG = 0
.Clerical
, Home Maker
,
Student
, and z_Blue Collar
, additional factors
come into play.JOB
and URBANICITY
:
URBANICITY
) differentiates outcomes:
URBANICITY = z_Highly Rural/ Rural
:
Associated with a higher likelihood of
TARGET_FLAG = 0
.URBANICITY = Highly Urban/ Urban
:
Splits further based on HOME_VAL
.URBANICITY
and HOME_VAL
:
HOME_VAL
influences predictions:
HOME_VAL >= 69,000
) correspond
to a greater likelihood of TARGET_FLAG = 0
.HOME_VAL < 69,000
) are linked to
TARGET_FLAG = 1
.OLDCLAIM * JOB
: Captures the combined effect of prior
claims and occupation.JOB * URBANICITY
: Reflects how job roles interact with
living environments to influence outcomes.URBANICITY * HOME_VAL
: Indicates the interplay between
urban/rural settings and home value.OLDCLAIM * JOB
and
URBANICITY * HOME_VAL
can be incorporated into logistic
regression models for better predictive power.Let me know if further refinement or additional analysis is needed!
library(rpart.plot)
library(dplyr)
# Ensure target variable and predictors are valid
target_variable <- "TARGET_FLAG"
excluded_variables <- c("TARGET_AMT", target_variable)
predictors <- setdiff(names(training_data_imputed), excluded_variables)
# Check if predictors are valid
if (length(predictors) == 0) {
stop("No valid predictors found. Ensure that training_data_imputed has columns other than TARGET_FLAG and TARGET_AMT.")
}
# Create the formula
formula <- as.formula(paste(target_variable, "~", paste(predictors, collapse = "+")))
# Fit the decision tree model
tree_model <- rpart(
formula,
data = training_data_imputed,
method = "class", # Use "class" for classification trees
control = rpart.control(cp = 0.01) # Adjust complexity parameter as needed
)
# Plot the tree
rpart.plot(
tree_model,
type = 3, # Draw tree with extra info
extra = 104, # Show node numbers, splits, and predictions (percentages)
box.palette = "auto", # Automatically assign colors
main = "Decision Tree for TARGET_FLAG Excluding TARGET_AMT"
)
Using domain knowledge, we can tailor feature engineering and preprocessing strategies to align with the theoretical effects of the variables. Here’s how the domain knowledge can guide data preparation and feature engineering:
INDEX
: Clearly identified as not useful for modeling
and should be excluded.TARGET_FLAG
: Indicates whether there was a crash
(binary outcome). This will likely be the primary target variable for
classification models.TARGET_AMT
: Represents the cost of the crash. It can
serve as the target for regression models when modeling severity.AGE
as a continuous variable or
grouping into meaningful bins.SEX
and RED_CAR
are flagged
as possibly having no or weak effects.KIDSDRIV_RATIO
: Divide KIDSDRIV
by
AGE
to capture how age impacts driving children.HOMEKIDS_RATIO
: Divide HOMEKIDS
by
AGE
for similar reasoning.CLM_FREQ
and MVR_PTS
to capture combined risk
indicators.INCOME
, HOME_VAL
,
AGE
, and BLUEBOOK
should remain
continuous.CAR_TYPE
, CAR_USE
,
PARENT1
, RED_CAR
, REVOKED
,
MSTATUS
, EDUCATION
, and SEX
into
dummy variables for model compatibility.# Load required libraries
library(dplyr)
library(fastDummies)
library(car)
# Function to clean column names
clean_column_names <- function(dataset) {
colnames(dataset) <- gsub(" ", "", colnames(dataset)) # Remove spaces
colnames(dataset) <- gsub("/", "", colnames(dataset)) # Remove forward slashes
colnames(dataset) <- gsub("-", "_", colnames(dataset)) # Replace hyphens with underscores
colnames(dataset) <- gsub("\\+", "plus", colnames(dataset)) # Replace '+' with 'plus'
colnames(dataset) <- gsub("[^A-Za-z0-9_]", "", colnames(dataset)) # Remove special characters
return(dataset)
}
# Function to perform Min-Max Normalization
min_max_normalize <- function(data, exclude_cols = NULL) {
numeric_cols <- setdiff(names(data)[sapply(data, is.numeric)], exclude_cols)
for (col in numeric_cols) {
min_val <- min(data[[col]], na.rm = TRUE)
max_val <- max(data[[col]], na.rm = TRUE)
if (max_val != min_val) {
data[[col]] <- (data[[col]] - min_val) / (max_val - min_val)
} else {
data[[col]] <- 0
}
}
return(data)
}
# Function to apply Box-Cox Transformation with variable-specific lambda estimation
box_cox_transform <- function(data, exclude_cols = NULL) {
numeric_cols <- setdiff(names(data)[sapply(data, is.numeric)], exclude_cols)
for (col in numeric_cols) {
if (all(data[[col]] > 0)) {
bc <- boxcox(lm(data[[col]] ~ 1), plotit = FALSE)
lambda <- bc$x[which.max(bc$y)]
if (lambda == 0) {
data[[col]] <- log(data[[col]])
} else {
data[[col]] <- (data[[col]]^lambda - 1) / lambda
}
}
}
return(data)
}
# Function to dummy encode categorical variables
dummy_encode <- function(data) {
categorical_cols <- colnames(data %>% dplyr::select(where(is.character) | where(is.factor)))
data <- fastDummies::dummy_cols(
data,
select_columns = categorical_cols,
remove_first_dummy = TRUE,
remove_selected_columns = TRUE
)
return(data)
}
# Function to add new features (including buckets and interactions)
add_features <- function(data, is_evaluation = FALSE) {
data <- data %>%
mutate(
# Tree-based Binary Splits
OLDCLAIM_SPLIT = ifelse(OLDCLAIM >= 529, 1, 0),
BLUEBOOK_SPLIT = ifelse(BLUEBOOK < 17000, 1, 0),
# Interaction Terms
JOB_URBANICITY = interaction(JOB, URBANICITY, drop = TRUE),
HOMEVAL_URBANICITY = HOME_VAL * (URBANICITY == "Highly Urban/ Urban"),
# Ratio Variables
KIDSDRIV_RATIO = ifelse(AGE > 0, KIDSDRIV / AGE, 0),
# Travel Time Buckets
TRAVTIME_BUCKET = cut(
TRAVTIME,
breaks = c(0, 30, 60, Inf),
labels = c("Short", "Medium", "Long"),
right = FALSE
),
# Age Buckets
AGE_BIN = cut(
AGE,
breaks = c(0, 25, 45, 65, Inf),
labels = c("Young", "Adult", "Senior", "Elderly"),
right = FALSE
),
# Income Buckets
INCOME_BIN = cut(
INCOME,
breaks = c(0, 25000, 50000, 100000, Inf),
labels = c("Low", "Medium", "High", "Very High"),
right = FALSE
)
)
return(data)
}
# Main function to process the data
process_data <- function(data, apply_box_cox_to_target_amt = FALSE) {
# Extract INDEX, TARGET_FLAG, and TARGET_AMT
retained_cols <- data %>% dplyr::select(INDEX, TARGET_FLAG, TARGET_AMT)
# Exclude INDEX, TARGET_FLAG, and TARGET_AMT from transformations
data <- data %>% dplyr::select(-INDEX, -TARGET_FLAG, -TARGET_AMT)
# Exclude TARGET_AMT from transformations if specified
exclude_cols <- if (apply_box_cox_to_target_amt) NULL else "TARGET_AMT"
# Apply transformations
data <- min_max_normalize(data, exclude_cols = exclude_cols)
data <- box_cox_transform(data, exclude_cols = exclude_cols)
data <- dummy_encode(data)
# Clean column names after dummy encoding
data <- clean_column_names(data)
# Combine INDEX, TARGET_FLAG, and TARGET_AMT back with the transformed data
data <- cbind(retained_cols, data)
# Ensure INDEX, TARGET_FLAG, and TARGET_AMT are the first columns
data <- data %>% relocate(INDEX, TARGET_FLAG, TARGET_AMT)
return(data)
}
# Apply transformation to training data
processed_training_data <- process_data(training_data_imputed, apply_box_cox_to_target_amt = TRUE)
# Apply transformation to evaluation data
processed_evaluation_data <- process_data(evaluation_data_imputed, apply_box_cox_to_target_amt = FALSE)
# View the transformed datasets
cat("\n### Glimpse of Transformed Training Data ###\n")
##
## ### Glimpse of Transformed Training Data ###
dim(processed_training_data)
## [1] 8161 40
cat("\n### Glimpse of Transformed Evaluation Data ###\n")
##
## ### Glimpse of Transformed Evaluation Data ###
dim(processed_evaluation_data)
## [1] 2141 40
Before balancing, the dataset had a class imbalance with 6,008
instances of class 0
and 2,153 instances of class
1
. After applying SMOTE, the dataset is balanced with 6,008
instances of class 0
and 6,459 instances of class
1
.
# Install and load necessary libraries
if (!require(smotefamily)) {
install.packages("smotefamily")
}
## Loading required package: smotefamily
if (!require(fastDummies)) {
install.packages("fastDummies")
}
library(smotefamily)
library(fastDummies)
library(dplyr)
# ---- Verify Before Balancing Data ----
cat("\nBefore balancing: ", "", "\n")
##
## Before balancing:
table(processed_training_data$TARGET_FLAG)
##
## 0 1
## 6008 2153
# Ensure all featurestraining_data are numeric
x_data <- processed_training_data[, -which(names(processed_training_data) == "TARGET_FLAG")] # Exclude TARGET_FLAG
y_data <- processed_training_data$TARGET_FLAG # Target variable
# ---- Apply SMOTE ----
set.seed(123)
smote_result <- SMOTE(
x_data,
y_data,
K = 5,
dup_size = 2 # Adjust duplication size as needed
)
# Extract the balanced dataset
data_prepared <- smote_result$data
data_prepared$TARGET_FLAG <- as.factor(data_prepared$class) # Assign the target column back
data_prepared$class <- NULL # Remove the redundant class column
# ---- Verify Balanced Data ----
cat("\nAfter balancing: ", "", "\n")
##
## After balancing:
table(data_prepared$TARGET_FLAG)
##
## 0 1
## 6008 6459
We built five binary logistic models using TARGET_FLAG as the target and five multiple linear regression models using TARGET_AMT as the target. For each model type, we used manual feature selection in addition to forward, backward, stepwise, recursive feature selection in addition to the full model which included all the features.
URBANICITY_z_HighlyRuralRural:CAR_USE_Private
) to capture
complex effects.OLDCLAIM
and CAR_AGE
to capture non-linear relationships evident in
tree splits and residual plots.This approach ensures that the model is data-driven, interpretable, and aligned with observed feature importance and decision tree logic.
# Install and load necessary libraries
if (!require(randomForest)) {
install.packages("randomForest")
}
## Loading required package: randomForest
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:gridExtra':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
## The following object is masked from 'package:dplyr':
##
## combine
if (!require(caret)) {
install.packages("caret")
}
library(randomForest)
library(caret)
# ---- Split Data into Training and Testing Sets ----
set.seed(123) # For reproducibility
train_indices <- createDataPartition(data_prepared$TARGET_FLAG, p = 0.8, list = FALSE)
train_data <- data_prepared[train_indices, ]
test_data <- data_prepared[-train_indices, ]
# ---- Train Random Forest Model ----
set.seed(123) # For reproducibility
rf_model <- randomForest(
TARGET_FLAG ~ .,
data = train_data,
importance = TRUE, # Enable importance calculation
ntree = 500, # Number of trees
mtry = floor(sqrt(ncol(train_data) - 1)) # Number of variables tried at each split
)
# ---- Feature Importance ----
importance_values <- importance(rf_model) # Get importance values
importance_df <- data.frame(
Feature = rownames(importance_values),
MeanDecreaseGini = importance_values[, "MeanDecreaseGini"]
)
# ---- Sort by Importance ----
importance_df <- importance_df[order(-importance_df$MeanDecreaseGini), ]
# ---- Display Top Features ----
cat("\nTop 10 Features by Importance:\n")
##
## Top 10 Features by Importance:
print(head(importance_df, 10))
## Feature MeanDecreaseGini
## TARGET_AMT TARGET_AMT 3111.44888
## CLM_FREQ CLM_FREQ 288.51358
## OLDCLAIM OLDCLAIM 199.48762
## URBANICITY_z_HighlyRuralRural URBANICITY_z_HighlyRuralRural 154.35647
## CAR_USE_Private CAR_USE_Private 132.82793
## MSTATUS_z_No MSTATUS_z_No 127.19415
## MVR_PTS MVR_PTS 119.32058
## HOMEKIDS HOMEKIDS 91.59452
## EDUCATION_z_HighSchool EDUCATION_z_HighSchool 62.74299
## PARENT1_Yes PARENT1_Yes 62.62347
# ---- Plot Feature Importance ----
library(ggplot2)
ggplot(importance_df, aes(x = reorder(Feature, MeanDecreaseGini), y = MeanDecreaseGini)) +
geom_bar(stat = "identity", fill = "steelblue") +
coord_flip() +
ggtitle("Feature Importance (Random Forest)") +
xlab("Features") +
ylab("Mean Decrease Gini") +
theme_minimal()
library(rpart.plot)
library(dplyr)
# Ensure target variable and predictors are valid
target_variable <- "TARGET_FLAG"
excluded_variables <- c("TARGET_AMT", target_variable)
predictors <- setdiff(names(data_prepared), excluded_variables)
# Check if predictors are valid
if (length(predictors) == 0) {
stop("No valid predictors found. Ensure that training_data_imputed has columns other than TARGET_FLAG and TARGET_AMT.")
}
# Create the formula
formula <- as.formula(paste(target_variable, "~", paste(predictors, collapse = "+")))
# Fit the decision tree model
tree_model <- rpart(
formula,
data = data_prepared,
method = "class", # Use "class" for classification trees
control = rpart.control(cp = 0.01) # Adjust complexity parameter as needed
)
# Plot the tree
rpart.plot(
tree_model,
type = 3, # Draw tree with extra info
extra = 104, # Show node numbers, splits, and predictions (percentages)
box.palette = "auto", # Automatically assign colors
main = "Decision Tree for TARGET_FLAG Excluding TARGET_AMT"
)
# Define the custom logistic regression formula
custom_formula <- TARGET_FLAG ~ OLDCLAIM + CLM_FREQ + URBANICITY_z_HighlyRuralRural +
CAR_USE_Private + PARENT1_Yes + SEX_z_F +
EDUCATION_z_HighSchool + CAR_AGE +
I(OLDCLAIM^2) + I(CAR_AGE^2) +
OLDCLAIM:CLM_FREQ +
URBANICITY_z_HighlyRuralRural:CAR_USE_Private +
PARENT1_Yes:SEX_z_F
# Load necessary libraries
library(dplyr)
if (!require("pROC")) install.packages("pROC")
library(pROC)
library(caret)
if (!require("MASS")) install.packages("MASS")
library(MASS)
# ---- Step 1: Data Preparation ----
# Drop INDEX and TARGET_AMT and place TARGET_FLAG as the first column
logistic_data <- data_prepared %>%
dplyr::select(-INDEX, -TARGET_AMT) %>%
dplyr::select(TARGET_FLAG, everything())
# Ensure TARGET_FLAG is a factor for logistic regression
logistic_data <- logistic_data %>%
mutate(TARGET_FLAG = as.factor(TARGET_FLAG))
# ---- Step 2: Fit Full Model ----
logistic_formula <- TARGET_FLAG ~ .
full_model_blr <- glm(formula = logistic_formula, family = binomial, data = logistic_data)
# ---- Step 3: Fit Custom Model ----
custom_formula <- TARGET_FLAG ~ OLDCLAIM + CLM_FREQ + URBANICITY_z_HighlyRuralRural +
CAR_USE_Private + PARENT1_Yes + SEX_z_F +
EDUCATION_z_HighSchool + CAR_AGE +
I(OLDCLAIM^2) + I(CAR_AGE^2) +
OLDCLAIM:CLM_FREQ +
URBANICITY_z_HighlyRuralRural:CAR_USE_Private +
PARENT1_Yes:SEX_z_F
custom_model_blr <- glm(formula = custom_formula, family = binomial, data = logistic_data)
# ---- Step 4: Model Selection ----
forward_model_blr <- step(glm(TARGET_FLAG ~ 1, data = logistic_data, family = binomial),
scope = list(lower = TARGET_FLAG ~ 1, upper = logistic_formula),
direction = "forward",
trace = FALSE)
backward_model_blr <- step(full_model_blr, direction = "backward", trace = FALSE)
stepwise_model_blr <- step(full_model_blr, direction = "both", trace = FALSE)
# ---- Step 5: Evaluate Models ----
evaluate_model <- function(model, data) {
predicted_probs <- predict(model, newdata = data, type = "response")
predicted_classes <- ifelse(predicted_probs > 0.5, "1", "0")
cm <- confusionMatrix(factor(predicted_classes, levels = c("0", "1")),
factor(data$TARGET_FLAG, levels = c("0", "1")))
roc_obj <- roc(data$TARGET_FLAG, predicted_probs)
auc <- auc(roc_obj)
metrics <- data.frame(
Accuracy = cm$overall["Accuracy"],
Sensitivity = cm$byClass["Sensitivity"],
Specificity = cm$byClass["Specificity"],
Precision = cm$byClass["Precision"],
F1 = cm$byClass["F1"],
AUC = auc
)
return(list(Metrics = metrics, ConfusionMatrix = cm$table))
}
# Evaluate all models
full_evaluation <- evaluate_model(full_model_blr, logistic_data)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
custom_evaluation <- evaluate_model(custom_model_blr, logistic_data)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
forward_evaluation <- evaluate_model(forward_model_blr, logistic_data)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
backward_evaluation <- evaluate_model(backward_model_blr, logistic_data)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
stepwise_evaluation <- evaluate_model(stepwise_model_blr, logistic_data)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
# ---- Step 6: Compare Models ----
model_metrics <- rbind(
data.frame(Model = "Full_BLR", full_evaluation$Metrics),
data.frame(Model = "Custom_BLR", custom_evaluation$Metrics),
data.frame(Model = "Forward_BLR", forward_evaluation$Metrics),
data.frame(Model = "Backward_BLR", backward_evaluation$Metrics),
data.frame(Model = "Stepwise_BLR", stepwise_evaluation$Metrics)
)
model_metrics <- model_metrics[order(-model_metrics$Accuracy), ]
print(model_metrics)
## Model Accuracy Sensitivity Specificity Precision F1
## Accuracy3 Backward_BLR 0.7527874 0.7065579 0.7957888 0.7629403 0.7336675
## Accuracy4 Stepwise_BLR 0.7527874 0.7065579 0.7957888 0.7629403 0.7336675
## Accuracy Full_BLR 0.7526269 0.7067244 0.7953244 0.7625718 0.7335867
## Accuracy1 Custom_BLR 0.7165316 0.6644474 0.7649791 0.7245009 0.6931759
## Accuracy2 Forward_BLR 0.5180878 0.0000000 1.0000000 NA NA
## AUC
## Accuracy3 0.8272851
## Accuracy4 0.8272851
## Accuracy 0.8272768
## Accuracy1 0.7781979
## Accuracy2 0.5000000
# ---- Step 7: Combine AIC and LogLikelihood ----
model_aic_loglik <- data.frame(
Model = c("Full_BLR", "Custom_BLR", "Forward_BLR", "Backward_BLR", "Stepwise_BLR"),
AIC = c(AIC(full_model_blr), AIC(custom_model_blr), AIC(forward_model_blr), AIC(backward_model_blr), AIC(stepwise_model_blr)),
LogLikelihood = c(logLik(full_model_blr), logLik(custom_model_blr), logLik(forward_model_blr), logLik(backward_model_blr), logLik(stepwise_model_blr))
)
model_aic_loglik <- model_aic_loglik[order(-model_aic_loglik$LogLikelihood), ]
print(model_aic_loglik)
## Model AIC LogLikelihood
## 1 Full_BLR 12778.37 -6351.184
## 4 Backward_BLR 12770.55 -6352.273
## 5 Stepwise_BLR 12770.55 -6352.273
## 2 Custom_BLR 14084.92 -7028.460
## 3 Forward_BLR 17268.61 -8633.307
# ---- Step 8: Plot ROC Curves ----
par(mfrow = c(2, 3))
roc_full <- roc(logistic_data$TARGET_FLAG, predict(full_model_blr, type = "response"))
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
plot(roc_full, main = "Full Model ROC", col = "blue", lwd = 2)
roc_custom <- roc(logistic_data$TARGET_FLAG, predict(custom_model_blr, type = "response"))
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
plot(roc_custom, main = "Custom Model ROC", col = "darkgreen", lwd = 2)
roc_forward <- roc(logistic_data$TARGET_FLAG, predict(forward_model_blr, type = "response"))
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
plot(roc_forward, main = "Forward Selection ROC", col = "green", lwd = 2)
roc_backward <- roc(logistic_data$TARGET_FLAG, predict(backward_model_blr, type = "response"))
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
plot(roc_backward, main = "Backward Elimination ROC", col = "red", lwd = 2)
roc_stepwise <- roc(logistic_data$TARGET_FLAG, predict(stepwise_model_blr, type = "response"))
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
plot(roc_stepwise, main = "Stepwise Selection ROC", col = "purple", lwd = 2)
cat("\n--- Model Comparison Completed ---\n")
##
## --- Model Comparison Completed ---
The majority of the coefficients align with logical expectations. However, outliers or unexpected trends (if any) should be flagged for further analysis or cross-referenced with domain knowledge to ensure model validity.
# Display Coefficients of the Stepwise Model
cat("\n--- Stepwise Model Coefficients ---\n")
##
## --- Stepwise Model Coefficients ---
stepwise_blr_coefficients <- summary(stepwise_model_blr)$coefficients
print(stepwise_blr_coefficients)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.1226147 0.19498833 -0.6288307 5.294599e-01
## KIDSDRIV 1.8140571 0.20716773 8.7564659 2.014676e-18
## HOMEKIDS 0.3779910 0.13922611 2.7149433 6.628712e-03
## YOJ -0.2738037 0.15846188 -1.7278838 8.400906e-02
## INCOME -1.4080183 0.31053023 -4.5342390 5.781154e-06
## HOME_VAL -1.1280342 0.23901082 -4.7195947 2.363151e-06
## TRAVTIME 2.3279212 0.20840939 11.1699438 5.721707e-29
## BLUEBOOK -1.5543856 0.27562526 -5.6394889 1.705557e-08
## TIF -1.6476584 0.13872869 -11.8768402 1.561594e-32
## OLDCLAIM -0.8804619 0.18870085 -4.6659138 3.072485e-06
## CLM_FREQ 1.1530856 0.11797849 9.7736936 1.460291e-22
## MVR_PTS 1.7558806 0.14818287 11.8494167 2.166940e-32
## PARENT1_Yes 0.4572161 0.09061037 5.0459581 4.512543e-07
## MSTATUS_z_No 0.5889337 0.06640779 8.8684420 7.417638e-19
## SEX_z_F -0.1319968 0.07669194 -1.7211301 8.522723e-02
## EDUCATION_Bachelors -0.4923173 0.06575848 -7.4867491 7.060047e-14
## EDUCATION_Masters -0.3562202 0.11114314 -3.2050580 1.350353e-03
## EDUCATION_PhD -0.2252638 0.13762045 -1.6368486 1.016621e-01
## JOB_Clerical 0.6218006 0.14785936 4.2053516 2.606766e-05
## JOB_HomeMaker 0.4408650 0.15757489 2.7978123 5.145000e-03
## JOB_Lawyer 0.2770190 0.12306102 2.2510702 2.438109e-02
## JOB_Manager -0.4368174 0.12310249 -3.5484041 3.875730e-04
## JOB_Professional 0.3273497 0.13320104 2.4575613 1.398839e-02
## JOB_Student 0.4988078 0.16480519 3.0266508 2.472795e-03
## JOB_z_BlueCollar 0.5268294 0.14100057 3.7363638 1.867005e-04
## CAR_USE_Private -0.8941207 0.07091970 -12.6075071 1.919696e-36
## CAR_TYPE_PanelTruck 0.5873070 0.12690955 4.6277608 3.696406e-06
## CAR_TYPE_Pickup 0.5975987 0.07916314 7.5489516 4.387759e-14
## CAR_TYPE_SportsCar 1.1564481 0.10101474 11.4483096 2.397755e-30
## CAR_TYPE_Van 0.6558120 0.09919905 6.6110716 3.815479e-11
## CAR_TYPE_z_SUV 0.8921604 0.08533171 10.4552032 1.387011e-25
## REVOKED_Yes 1.0037730 0.07723678 12.9960494 1.288285e-38
## URBANICITY_z_HighlyRuralRural -2.5934535 0.08059043 -32.1806644 3.290749e-227
# Format the coefficients for better readability
stepwise_coefficients_df <- as.data.frame(stepwise_blr_coefficients)
stepwise_coefficients_df <- cbind(Variable = rownames(stepwise_coefficients_df), stepwise_coefficients_df)
rownames(stepwise_coefficients_df) <- NULL
# Print formatted coefficients
print(stepwise_coefficients_df)
## Variable Estimate Std. Error z value
## 1 (Intercept) -0.1226147 0.19498833 -0.6288307
## 2 KIDSDRIV 1.8140571 0.20716773 8.7564659
## 3 HOMEKIDS 0.3779910 0.13922611 2.7149433
## 4 YOJ -0.2738037 0.15846188 -1.7278838
## 5 INCOME -1.4080183 0.31053023 -4.5342390
## 6 HOME_VAL -1.1280342 0.23901082 -4.7195947
## 7 TRAVTIME 2.3279212 0.20840939 11.1699438
## 8 BLUEBOOK -1.5543856 0.27562526 -5.6394889
## 9 TIF -1.6476584 0.13872869 -11.8768402
## 10 OLDCLAIM -0.8804619 0.18870085 -4.6659138
## 11 CLM_FREQ 1.1530856 0.11797849 9.7736936
## 12 MVR_PTS 1.7558806 0.14818287 11.8494167
## 13 PARENT1_Yes 0.4572161 0.09061037 5.0459581
## 14 MSTATUS_z_No 0.5889337 0.06640779 8.8684420
## 15 SEX_z_F -0.1319968 0.07669194 -1.7211301
## 16 EDUCATION_Bachelors -0.4923173 0.06575848 -7.4867491
## 17 EDUCATION_Masters -0.3562202 0.11114314 -3.2050580
## 18 EDUCATION_PhD -0.2252638 0.13762045 -1.6368486
## 19 JOB_Clerical 0.6218006 0.14785936 4.2053516
## 20 JOB_HomeMaker 0.4408650 0.15757489 2.7978123
## 21 JOB_Lawyer 0.2770190 0.12306102 2.2510702
## 22 JOB_Manager -0.4368174 0.12310249 -3.5484041
## 23 JOB_Professional 0.3273497 0.13320104 2.4575613
## 24 JOB_Student 0.4988078 0.16480519 3.0266508
## 25 JOB_z_BlueCollar 0.5268294 0.14100057 3.7363638
## 26 CAR_USE_Private -0.8941207 0.07091970 -12.6075071
## 27 CAR_TYPE_PanelTruck 0.5873070 0.12690955 4.6277608
## 28 CAR_TYPE_Pickup 0.5975987 0.07916314 7.5489516
## 29 CAR_TYPE_SportsCar 1.1564481 0.10101474 11.4483096
## 30 CAR_TYPE_Van 0.6558120 0.09919905 6.6110716
## 31 CAR_TYPE_z_SUV 0.8921604 0.08533171 10.4552032
## 32 REVOKED_Yes 1.0037730 0.07723678 12.9960494
## 33 URBANICITY_z_HighlyRuralRural -2.5934535 0.08059043 -32.1806644
## Pr(>|z|)
## 1 5.294599e-01
## 2 2.014676e-18
## 3 6.628712e-03
## 4 8.400906e-02
## 5 5.781154e-06
## 6 2.363151e-06
## 7 5.721707e-29
## 8 1.705557e-08
## 9 1.561594e-32
## 10 3.072485e-06
## 11 1.460291e-22
## 12 2.166940e-32
## 13 4.512543e-07
## 14 7.417638e-19
## 15 8.522723e-02
## 16 7.060047e-14
## 17 1.350353e-03
## 18 1.016621e-01
## 19 2.606766e-05
## 20 5.145000e-03
## 21 2.438109e-02
## 22 3.875730e-04
## 23 1.398839e-02
## 24 2.472795e-03
## 25 1.867005e-04
## 26 1.919696e-36
## 27 3.696406e-06
## 28 4.387759e-14
## 29 2.397755e-30
## 30 3.815479e-11
## 31 1.387011e-25
## 32 1.288285e-38
## 33 3.290749e-227
# Optional: Save the coefficients to a CSV file for review
write.csv(stepwise_coefficients_df, "stepwise_model_coefficients.csv", row.names = FALSE)
The ROC curves and Log-Likelihood values provide insights into the predictive performance: - Backward and Stepwise Models: These models have nearly identical Log-Likelihood values and AIC scores, suggesting optimal predictive power while maintaining simplicity. - Full Model: The full model also demonstrates strong predictive power but may overfit due to including all variables. - Custom Model: Designed with domain expertise, it performs well but falls behind the Backward and Stepwise models in predictive power. - Forward Model: This model has the weakest prediction power, as evidenced by its higher AIC and lower Log-Likelihood value.
For a practical balance of prediction power, interpretability, and explanatory power: - Use the Stepwise Model for general prediction tasks due to its simplicity and strong performance metrics. - Use the Custom Model when domain-specific interactions and explanatory insights are crucial.
We will consider alternative models like Decision Trees, Random Forest, XGBoost, SVR, ANN, MARS, KNN, and Generalized Linear Regression for performance comparison for this dataset.
# Load necessary libraries
if (!require("nnet")) install.packages("nnet")
## Loading required package: nnet
if (!require("earth")) install.packages("earth")
## Loading required package: earth
## Loading required package: Formula
## Loading required package: plotmo
## Loading required package: plotrix
if (!require("class")) install.packages("class")
## Loading required package: class
if (!require("kernlab")) install.packages("kernlab")
## Loading required package: kernlab
##
## Attaching package: 'kernlab'
## The following object is masked from 'package:purrr':
##
## cross
## The following object is masked from 'package:ggplot2':
##
## alpha
library(dplyr)
library(caret)
library(pROC)
library(rpart)
library(ranger)
##
## Attaching package: 'ranger'
## The following object is masked from 'package:randomForest':
##
## importance
library(xgboost)
##
## Attaching package: 'xgboost'
## The following object is masked from 'package:dplyr':
##
## slice
library(nnet)
library(earth)
library(class)
library(kernlab)
# ---- Helper Function to Calculate Metrics ----
calculate_metrics <- function(predicted, actual) {
cm <- confusionMatrix(factor(predicted, levels = c("0", "1")), factor(actual, levels = c("0", "1")))
metrics <- data.frame(
Accuracy = cm$overall["Accuracy"],
Sensitivity = cm$byClass["Sensitivity"],
Specificity = cm$byClass["Specificity"],
Precision = cm$byClass["Precision"],
F1 = cm$byClass["F1"]
)
return(metrics)
}
# ---- Step 1: Data Preparation ----
classification_data <- data_prepared %>%
dplyr::select(-INDEX, -TARGET_AMT) %>%
mutate(TARGET_FLAG = as.factor(TARGET_FLAG))
set.seed(123)
train_index <- createDataPartition(classification_data$TARGET_FLAG, p = 0.7, list = FALSE)
train_data <- classification_data[train_index, ]
test_data <- classification_data[-train_index, ]
# ---- Step 2: Model Implementation ----
# Decision Tree
tree_model <- rpart(TARGET_FLAG ~ ., data = train_data, method = "class", control = rpart.control(cp = 0.01))
tree_preds <- predict(tree_model, newdata = test_data, type = "class")
tree_metrics <- calculate_metrics(tree_preds, test_data$TARGET_FLAG)
# Random Forest
rf_model <- ranger(TARGET_FLAG ~ ., data = train_data, probability = TRUE)
rf_preds <- predict(rf_model, data = test_data)$predictions
rf_pred_class <- ifelse(rf_preds[, 2] > 0.5, "1", "0")
rf_metrics <- calculate_metrics(rf_pred_class, test_data$TARGET_FLAG)
# XGBoost
train_matrix <- xgb.DMatrix(data = as.matrix(train_data %>% dplyr::select(-TARGET_FLAG)), label = as.numeric(as.character(train_data$TARGET_FLAG)))
test_matrix <- xgb.DMatrix(data = as.matrix(test_data %>% dplyr::select(-TARGET_FLAG)))
xgb_bc_model <- xgboost(data = train_matrix, max_depth = 6, eta = 0.1, nrounds = 100, objective = "binary:logistic", verbose = 0)
xgb_preds <- predict(xgb_bc_model, newdata = test_matrix)
xgb_pred_class <- ifelse(xgb_preds > 0.5, "1", "0")
xgb_metrics <- calculate_metrics(xgb_pred_class, test_data$TARGET_FLAG)
# SVR
svr_model <- ksvm(TARGET_FLAG ~ ., data = train_data, type = "C-svc", kernel = "rbfdot")
svr_preds <- predict(svr_model, test_data)
svr_metrics <- calculate_metrics(svr_preds, test_data$TARGET_FLAG)
# Artificial Neural Network
ann_model <- nnet(TARGET_FLAG ~ ., data = train_data, size = 5, maxit = 200, decay = 0.01, linout = FALSE)
## # weights: 196
## initial value 6087.007355
## iter 10 value 4636.752841
## iter 20 value 4371.379136
## iter 30 value 4110.684823
## iter 40 value 3912.449352
## iter 50 value 3759.412839
## iter 60 value 3676.163201
## iter 70 value 3622.392206
## iter 80 value 3577.418311
## iter 90 value 3546.287875
## iter 100 value 3531.116670
## iter 110 value 3518.883025
## iter 120 value 3508.430109
## iter 130 value 3493.214914
## iter 140 value 3479.034539
## iter 150 value 3459.676601
## iter 160 value 3428.016825
## iter 170 value 3398.058693
## iter 180 value 3365.730658
## iter 190 value 3345.031852
## iter 200 value 3311.607129
## final value 3311.607129
## stopped after 200 iterations
ann_preds <- predict(ann_model, test_data, type = "class")
ann_metrics <- calculate_metrics(ann_preds, test_data$TARGET_FLAG)
# MARS
mars_model <- earth(TARGET_FLAG ~ ., data = train_data)
mars_preds <- predict(mars_model, newdata = test_data, type = "class")
mars_metrics <- calculate_metrics(mars_preds, test_data$TARGET_FLAG)
# KNN
knn_preds <- knn(train = train_data %>% dplyr::select(-TARGET_FLAG), test = test_data %>% dplyr::select(-TARGET_FLAG), cl = train_data$TARGET_FLAG, k = 5)
knn_metrics <- calculate_metrics(knn_preds, test_data$TARGET_FLAG)
# Generalized Linear Model
glm_model <- glm(TARGET_FLAG ~ ., data = train_data, family = binomial)
glm_preds <- ifelse(predict(glm_model, newdata = test_data, type = "response") > 0.5, "1", "0")
glm_metrics <- calculate_metrics(glm_preds, test_data$TARGET_FLAG)
# ---- Step 3: Combine Metrics ----
model_metrics <- rbind(
cbind(Model = "Decision Tree", tree_metrics),
cbind(Model = "Random Forest", rf_metrics),
cbind(Model = "XGBoost", xgb_metrics),
cbind(Model = "SVR", svr_metrics),
cbind(Model = "ANN", ann_metrics),
cbind(Model = "MARS", mars_metrics),
cbind(Model = "KNN", knn_metrics),
cbind(Model = "GLM", glm_metrics)
)
# Print combined metrics
print(model_metrics)
## Model Accuracy Sensitivity Specificity Precision F1
## Accuracy Decision Tree 0.7756085 0.7974473 0.7552917 0.7519623 0.7740372
## Accuracy1 Random Forest 0.8627975 0.9045505 0.8239546 0.8269914 0.8640339
## Accuracy2 XGBoost 0.8654721 0.9001110 0.8332473 0.8339332 0.8657593
## Accuracy3 SVR 0.8323081 0.8396226 0.8255034 0.8173960 0.8283603
## Accuracy4 ANN 0.8205403 0.8246393 0.8167269 0.8071700 0.8158111
## Accuracy5 MARS 0.8309708 0.8501665 0.8131131 0.8088701 0.8290043
## Accuracy6 KNN 0.7806900 0.7586016 0.8012390 0.7802511 0.7692741
## Accuracy7 GLM 0.7496657 0.7031077 0.7929788 0.7595923 0.7302594
Random Forest remains the top choice for tasks requiring high predictive accuracy, achieving the highest accuracy (86.28%) and F1 score (0.864). However, Stepwise Logistic Regression, with an accuracy of 75.28% and an F1 score of 0.734, provides the best balance between interpretability and model performance.
Stepwise Logistic Regression is recommended for scenarios where understanding variable effects is critical, as it offers transparency and clear insights into feature importance. While Random Forest is excellent for complex, accuracy-driven applications, Stepwise Regression is ideal for interpretability without sacrificing too much predictive power.
TARGET_AMT
where
TARGET_FLAG
equals 1.BLUEBOOK
.BLUEBOOK
were prioritized as they
appeared at higher splits in the tree.TARGET_AMT
prediction.BLUEBOOK
, MVR_PTS
, and
CAR_AGE
emerged as the most influential predictors,
aligning with domain knowledge.I(BLUEBOOK^2)
) were included to capture non-linear
relationships.library(rpart.plot)
library(dplyr)
# Filter the data for rows where TARGET_FLAG = 1
mrl_filtered_data_target_flag <- data_prepared %>% filter(TARGET_FLAG == 1)
# Ensure target variable and predictors are valid
target_variable <- "TARGET_AMT"
excluded_variables <- c("TARGET_FLAG", target_variable)
predictors <- setdiff(names(mrl_filtered_data_target_flag), excluded_variables)
# Check if predictors are valid
if (length(predictors) == 0) {
stop("No valid predictors found. Ensure that mrl_filtered_data_target_flag has columns other than TARGET_FLAG and TARGET_AMT.")
}
# Create the formula
formula <- as.formula(paste(target_variable, "~", paste(predictors, collapse = "+")))
# Fit the decision tree model
tree_model <- rpart(
formula,
data = mrl_filtered_data_target_flag,
method = "anova", # Use "anova" for regression trees
control = rpart.control(cp = 0.01) # Adjust complexity parameter as needed
)
# Plot the tree
rpart.plot(
tree_model,
type = 3, # Draw tree with extra info
extra = 101, # Show node numbers, splits, and predictions
box.palette = "auto", # Automatically assign colors
main = "Decision Tree for TARGET_AMT Where TARGET_FLAG = 1"
)
# Ensure necessary libraries are loaded
if (!require("randomForest")) install.packages("randomForest")
if (!require("ggplot2")) install.packages("ggplot2")
library(randomForest)
library(ggplot2)
# Define response variable and predictors
response <- mrl_filtered_data_target_flag$TARGET_AMT # Replace with the actual dataset variable name for TARGET_AMT
predictors_matrix <- mrl_filtered_data_target_flag %>%
dplyr::select(-TARGET_AMT, -TARGET_FLAG, -INDEX) %>% # Remove target and any non-predictor columns
dplyr::select_if(~ is.numeric(.) || is.factor(.)) # Keep only numeric or factor columns
# Check if predictors_matrix has valid columns
if (ncol(predictors_matrix) == 0) {
stop("No valid predictors found after filtering for numeric and factor columns.")
}
# Train the randomForest model
set.seed(123) # For reproducibility
rf_model <- randomForest(
x = predictors_matrix,
y = response,
importance = TRUE, # Enable importance computation
ntree = 500, # Number of trees
mtry = floor(sqrt(ncol(predictors_matrix))) # Number of predictors to try at each split
)
# Explicitly call importance from randomForest to avoid masking issues
importance_values <- randomForest::importance(rf_model)
# Check structure of importance_values
cat("\nStructure of importance_values:\n")
##
## Structure of importance_values:
print(importance_values)
## %IncMSE IncNodePurity
## KIDSDRIV 12.998987 6033907780
## AGE 20.472019 21579660068
## HOMEKIDS 13.063007 9591877517
## YOJ 18.919967 15971640522
## INCOME 19.884842 17532653500
## HOME_VAL 18.423284 16465008941
## TRAVTIME 17.347732 18107173907
## BLUEBOOK 28.572032 25330745053
## TIF 13.514517 14241034365
## OLDCLAIM 15.635998 15820217177
## CLM_FREQ 10.787234 14403584213
## MVR_PTS 20.600759 19358511878
## CAR_AGE 21.377925 16982825587
## PARENT1_Yes 12.234669 7643861157
## MSTATUS_z_No 12.702572 9794017029
## SEX_z_F 6.441400 6604702864
## EDUCATION_Bachelors 9.765206 7543634169
## EDUCATION_Masters 7.331360 4000317768
## EDUCATION_PhD 7.937516 2841235673
## EDUCATION_z_HighSchool 9.398117 6016333126
## JOB_Clerical 14.043970 5373221752
## JOB_Doctor 2.870718 190926590
## JOB_HomeMaker 11.871058 3743452683
## JOB_Lawyer 12.739474 3024598027
## JOB_Manager 6.235967 1054806651
## JOB_Professional 7.968180 8256975898
## JOB_Student 8.079909 3139691942
## JOB_z_BlueCollar 13.887132 8296519495
## CAR_USE_Private 10.451147 7163677108
## CAR_TYPE_PanelTruck 12.780984 6069665499
## CAR_TYPE_Pickup 10.989774 4050555059
## CAR_TYPE_SportsCar 8.410908 6036814076
## CAR_TYPE_Van 10.498695 8316236055
## CAR_TYPE_z_SUV 12.468348 5202295871
## RED_CAR_yes 9.526964 6868476424
## REVOKED_Yes 11.350668 3492242398
## URBANICITY_z_HighlyRuralRural 9.544094 4578656952
# If importance_values has the correct structure, process it
if (!is.null(importance_values) && "MeanDecreaseGini" %in% colnames(importance_values)) {
# Convert to data frame
importance_df <- data.frame(
Feature = rownames(importance_values),
MeanDecreaseGini = importance_values[, "MeanDecreaseGini"]
)
# Sort features by importance
importance_df <- importance_df[order(-importance_df$MeanDecreaseGini), ]
# Display the top 10 features
cat("\nTop 10 Features by Importance:\n")
print(head(importance_df, 10))
# ---- Plot Feature Importance ----
ggplot(importance_df, aes(x = reorder(Feature, MeanDecreaseGini), y = MeanDecreaseGini)) +
geom_bar(stat = "identity", fill = "steelblue") +
coord_flip() +
ggtitle("Feature Importance (Random Forest)") +
xlab("Features") +
ylab("Mean Decrease Gini") +
theme_minimal()
} else {
cat("\nThe importance_values object does not contain 'MeanDecreaseGini'. Please review the model training step.\n")
}
##
## The importance_values object does not contain 'MeanDecreaseGini'. Please review the model training step.
# Custom formula based on key predictors
custom_predictors <- c("BLUEBOOK", "INCOME", "AGE", "HOME_VAL", "CAR_TYPE_SUV", "CAR_TYPE_SportsCar")
# Build the formula
custom_formula <- as.formula(paste("TARGET_AMT ~", paste(custom_predictors, collapse = "+")))
# Print the custom formula
custom_formula
## TARGET_AMT ~ BLUEBOOK + INCOME + AGE + HOME_VAL + CAR_TYPE_SUV +
## CAR_TYPE_SportsCar
The Full Model showed the best performance with an R-squared of 0.0171 and the lowest AIC of 93669.45, though all models exhibited poor predictive power. Residual plots revealed consistent variability, suggesting unaccounted non-linear relationships. Further exploration of non-linear models or feature engineering is recommended.
# Load necessary libraries
library(dplyr)
if (!require("caret")) install.packages("caret")
library(caret)
if (!require("MASS")) install.packages("MASS")
library(MASS)
# ---- Step 1: Data Preparation ----
# Filter the data for rows where TARGET_FLAG = 1
mlr_data <- data_prepared %>%
filter(TARGET_FLAG == 1) %>%
dplyr::select(-INDEX, -TARGET_FLAG) %>%
dplyr::select(TARGET_AMT, everything())
# Split the data into training and test sets
set.seed(123) # For reproducibility
train_index <- createDataPartition(mlr_data$TARGET_AMT, p = 0.7, list = FALSE)
train_data <- mlr_data[train_index, ]
test_data_untransformed_target <- mlr_data[-train_index, ]
# ---- Step 2: Fit Full Model ----
# Define the full multiple linear regression formula
full_formula <- TARGET_AMT ~ .
# Fit the full multiple linear regression model
full_model_untransformed_target <- lm(formula = full_formula, data = train_data)
# ---- Step 3: Fit Custom Model ----
# Define the custom multiple linear regression formula
custom_formula <- TARGET_AMT ~ BLUEBOOK + INCOME + AGE + HOME_VAL +
CAR_TYPE_z_SUV + CAR_TYPE_SportsCar
# Fit the custom multiple linear regression model
custom_model_untransformed_target <- lm(formula = custom_formula, data = train_data)
# ---- Step 4: Model Selection ----
# Forward Selection
cat("\n--- Forward Selection ---\n")
##
## --- Forward Selection ---
forward_model_untransformed_target <- step(
lm(TARGET_AMT ~ 1, data = train_data),
scope = list(lower = TARGET_AMT ~ 1, upper = full_formula),
direction = "forward",
trace = FALSE
)
print(summary(forward_model_untransformed_target))
##
## Call:
## lm(formula = TARGET_AMT ~ 1, data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5700 -3104 -1627 58 85891
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5730.6 114.8 49.92 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7720 on 4522 degrees of freedom
# Backward Elimination
cat("\n--- Backward Elimination ---\n")
##
## --- Backward Elimination ---
backward_model_untransformed_target <- step(full_model_untransformed_target, direction = "backward", trace = FALSE)
print(summary(backward_model_untransformed_target))
##
## Call:
## lm(formula = TARGET_AMT ~ AGE + HOMEKIDS + INCOME + HOME_VAL +
## BLUEBOOK + OLDCLAIM + CLM_FREQ + MVR_PTS + CAR_AGE + MSTATUS_z_No +
## SEX_z_F + EDUCATION_Masters + EDUCATION_PhD + JOB_Doctor +
## JOB_Manager + JOB_Professional + JOB_z_BlueCollar + CAR_TYPE_SportsCar +
## CAR_TYPE_z_SUV + RED_CAR_yes + REVOKED_Yes, data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11276 -3246 -1457 553 83741
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2579.5 663.2 3.890 0.000102 ***
## AGE 2541.9 989.7 2.568 0.010254 *
## HOMEKIDS 1770.8 596.0 2.971 0.002981 **
## INCOME -3001.6 1749.8 -1.715 0.086335 .
## HOME_VAL 3418.8 1339.9 2.552 0.010757 *
## BLUEBOOK 11203.6 1252.3 8.947 < 2e-16 ***
## OLDCLAIM 2405.4 995.1 2.417 0.015681 *
## CLM_FREQ -1333.1 603.2 -2.210 0.027155 *
## MVR_PTS 2309.3 677.3 3.409 0.000657 ***
## CAR_AGE -2983.6 866.7 -3.443 0.000581 ***
## MSTATUS_z_No 1570.0 308.0 5.098 3.57e-07 ***
## SEX_z_F -2134.7 452.7 -4.715 2.49e-06 ***
## EDUCATION_Masters 1354.3 500.7 2.705 0.006864 **
## EDUCATION_PhD 2345.4 764.8 3.067 0.002176 **
## JOB_Doctor -3011.8 1174.4 -2.564 0.010365 *
## JOB_Manager -1593.3 564.6 -2.822 0.004789 **
## JOB_Professional 743.3 461.1 1.612 0.107011
## JOB_z_BlueCollar 622.3 343.4 1.812 0.070033 .
## CAR_TYPE_SportsCar 955.5 507.8 1.882 0.059955 .
## CAR_TYPE_z_SUV 1300.6 439.9 2.956 0.003130 **
## RED_CAR_yes -878.4 376.9 -2.330 0.019834 *
## REVOKED_Yes -1639.8 390.2 -4.202 2.70e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7559 on 4501 degrees of freedom
## Multiple R-squared: 0.04572, Adjusted R-squared: 0.04127
## F-statistic: 10.27 on 21 and 4501 DF, p-value: < 2.2e-16
# Stepwise Selection
cat("\n--- Stepwise Selection ---\n")
##
## --- Stepwise Selection ---
stepwise_model_untransformed_target <- step(full_model_untransformed_target, direction = "both", trace = FALSE)
print(summary(stepwise_model_untransformed_target))
##
## Call:
## lm(formula = TARGET_AMT ~ AGE + HOMEKIDS + INCOME + HOME_VAL +
## BLUEBOOK + OLDCLAIM + CLM_FREQ + MVR_PTS + CAR_AGE + MSTATUS_z_No +
## SEX_z_F + EDUCATION_Masters + EDUCATION_PhD + JOB_Doctor +
## JOB_Manager + JOB_Professional + JOB_z_BlueCollar + CAR_TYPE_SportsCar +
## CAR_TYPE_z_SUV + RED_CAR_yes + REVOKED_Yes, data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11276 -3246 -1457 553 83741
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2579.5 663.2 3.890 0.000102 ***
## AGE 2541.9 989.7 2.568 0.010254 *
## HOMEKIDS 1770.8 596.0 2.971 0.002981 **
## INCOME -3001.6 1749.8 -1.715 0.086335 .
## HOME_VAL 3418.8 1339.9 2.552 0.010757 *
## BLUEBOOK 11203.6 1252.3 8.947 < 2e-16 ***
## OLDCLAIM 2405.4 995.1 2.417 0.015681 *
## CLM_FREQ -1333.1 603.2 -2.210 0.027155 *
## MVR_PTS 2309.3 677.3 3.409 0.000657 ***
## CAR_AGE -2983.6 866.7 -3.443 0.000581 ***
## MSTATUS_z_No 1570.0 308.0 5.098 3.57e-07 ***
## SEX_z_F -2134.7 452.7 -4.715 2.49e-06 ***
## EDUCATION_Masters 1354.3 500.7 2.705 0.006864 **
## EDUCATION_PhD 2345.4 764.8 3.067 0.002176 **
## JOB_Doctor -3011.8 1174.4 -2.564 0.010365 *
## JOB_Manager -1593.3 564.6 -2.822 0.004789 **
## JOB_Professional 743.3 461.1 1.612 0.107011
## JOB_z_BlueCollar 622.3 343.4 1.812 0.070033 .
## CAR_TYPE_SportsCar 955.5 507.8 1.882 0.059955 .
## CAR_TYPE_z_SUV 1300.6 439.9 2.956 0.003130 **
## RED_CAR_yes -878.4 376.9 -2.330 0.019834 *
## REVOKED_Yes -1639.8 390.2 -4.202 2.70e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7559 on 4501 degrees of freedom
## Multiple R-squared: 0.04572, Adjusted R-squared: 0.04127
## F-statistic: 10.27 on 21 and 4501 DF, p-value: < 2.2e-16
# ---- Step 5: Box-Cox Transformation and Min-Max Scaling for Evaluation ----
# Apply Box-Cox transformation to predicted values
boxcox_transform <- function(y, lambda) {
if (lambda == 0) {
return(log(y))
} else {
return((y^lambda - 1) / lambda)
}
}
# Apply Min-Max scaling
min_max_scale <- function(x, min_val, max_val) {
return((x - min_val) / (max_val - min_val))
}
# Function to calculate RMSE and R-Squared after applying transformations
evaluate_model_transformed <- function(model, test_data, lambda, min_val, max_val) {
predicted <- predict(model, newdata = test_data)
actual <- test_data$TARGET_AMT
# Apply Box-Cox transformation
predicted_transformed <- boxcox_transform(predicted, lambda)
actual_transformed <- boxcox_transform(actual, lambda)
# Apply Min-Max scaling
predicted_scaled <- min_max_scale(predicted_transformed, min_val, max_val)
actual_scaled <- min_max_scale(actual_transformed, min_val, max_val)
# Calculate metrics
rsquared <- cor(predicted_scaled, actual_scaled)^2
rmse <- sqrt(mean((predicted_scaled - actual_scaled)^2))
return(data.frame(R_Squared = rsquared, RMSE = rmse))
}
# Fit Box-Cox transformation to training data
boxcox_result <- boxcox(lm(TARGET_AMT ~ ., data = train_data), lambda = seq(-2, 2, by = 0.1))
optimal_lambda <- boxcox_result$x[which.max(boxcox_result$y)]
cat("Optimal lambda for Box-Cox transformation: ", optimal_lambda, "\n")
## Optimal lambda for Box-Cox transformation: -0.02020202
# Min-Max scaling values from training data
min_val <- min(boxcox_transform(train_data$TARGET_AMT, optimal_lambda))
max_val <- max(boxcox_transform(train_data$TARGET_AMT, optimal_lambda))
# ---- Step 6: Evaluate Models on Test Data ----
# Evaluate models using the transformed and scaled test data
full_evaluation <- evaluate_model_transformed(full_model_untransformed_target, test_data_untransformed_target, optimal_lambda, min_val, max_val)
custom_evaluation <- evaluate_model_transformed(custom_model_untransformed_target, test_data_untransformed_target, optimal_lambda, min_val, max_val)
forward_evaluation <- evaluate_model_transformed(forward_model_untransformed_target, test_data_untransformed_target, optimal_lambda, min_val, max_val)
## Warning in cor(predicted_scaled, actual_scaled): the standard deviation is zero
backward_evaluation <- evaluate_model_transformed(backward_model_untransformed_target, test_data_untransformed_target, optimal_lambda, min_val, max_val)
stepwise_evaluation <- evaluate_model_transformed(stepwise_model_untransformed_target, test_data_untransformed_target, optimal_lambda, min_val, max_val)
# Combine Metrics for All Models
model_metrics <- rbind(
data.frame(Model = "Full", full_evaluation),
data.frame(Model = "Custom", custom_evaluation),
data.frame(Model = "Forward", forward_evaluation),
data.frame(Model = "Backward", backward_evaluation),
data.frame(Model = "Stepwise", stepwise_evaluation)
)
# Sort by R-Squared
model_metrics <- model_metrics[order(-model_metrics$R_Squared), ]
# Print Model Performance Comparison
print(model_metrics)
## Model R_Squared RMSE
## 2 Custom 0.01606155 0.1066177
## 4 Backward 0.01264286 0.1088033
## 5 Stepwise 0.01264286 0.1088033
## 1 Full 0.01203977 0.1090681
## 3 Forward NA 0.1081723
# ---- Step 7: Compare Models (AIC and LogLikelihood) ----
# Function to calculate AIC and LogLikelihood for a model
evaluate_aic_loglik <- function(model) {
aic <- AIC(model)
loglik <- logLik(model)
return(data.frame(AIC = aic, LogLikelihood = as.numeric(loglik)))
}
# Evaluate AIC and LogLikelihood for each model
full_aic_loglik <- evaluate_aic_loglik(full_model_untransformed_target)
custom_aic_loglik <- evaluate_aic_loglik(custom_model_untransformed_target)
forward_aic_loglik <- evaluate_aic_loglik(forward_model_untransformed_target)
backward_aic_loglik <- evaluate_aic_loglik(backward_model_untransformed_target)
stepwise_aic_loglik <- evaluate_aic_loglik(stepwise_model_untransformed_target)
# Combine AIC and LogLikelihood metrics for all models
model_aic_loglik <- rbind(
data.frame(Model = "Full", full_aic_loglik),
data.frame(Model = "Custom", custom_aic_loglik),
data.frame(Model = "Forward", forward_aic_loglik),
data.frame(Model = "Backward", backward_aic_loglik),
data.frame(Model = "Stepwise", stepwise_aic_loglik)
)
# Sort by AIC (lower is better)
model_aic_loglik <- model_aic_loglik[order(model_aic_loglik$AIC), ]
# Print AIC and LogLikelihood Comparison
print(model_aic_loglik)
## Model AIC LogLikelihood
## 4 Backward 93644.80 -46799.40
## 5 Stepwise 93644.80 -46799.40
## 1 Full 93669.45 -46795.73
## 2 Custom 93730.83 -46857.42
## 3 Forward 93814.48 -46905.24
# ---- Step 8: Residual Diagnostics ----
# Residual plots for refitted, transformed models
par(mfrow = c(2, 3))
plot(residuals(full_model_untransformed_target), main = "Refit Full Model Residuals", ylab = "Residuals")
plot(residuals(custom_model_untransformed_target), main = "Refit Custom Model Residuals", ylab = "Residuals")
plot(residuals(forward_model_untransformed_target), main = "Refit Forward Selection Residuals", ylab = "Residuals")
plot(residuals(backward_model_untransformed_target), main = "Refit Backward Elimination Residuals", ylab = "Residuals")
plot(residuals(stepwise_model_untransformed_target), main = "Refit Stepwise Selection Residuals", ylab = "Residuals")
cat("\n--- Model Comparison Completed ---\n")
##
## --- Model Comparison Completed ---
The multiple linear regression analysis using Box-Cox transformation and Min-Max scaling shows that all models (Full, Custom, Forward, Backward, and Stepwise) achieve low R-Squared values (0.015), indicating limited predictive power. The AIC and LogLikelihood metrics highlight minimal differences between models, suggesting no substantial improvement from model selection techniques. Residual analysis confirms consistent distribution patterns across models, but the weak relationships between predictors and the target variable limit model performance. Further feature engineering or alternative modeling approaches, such as non-linear models, may improve predictive accuracy.
# Load necessary libraries
library(dplyr)
if (!require("caret")) install.packages("caret")
library(caret)
if (!require("MASS")) install.packages("MASS")
library(MASS)
# ---- Step 1: Data Preparation ----
# Filter the data for rows where TARGET_FLAG = 1
mlr_data <- data_prepared %>%
filter(TARGET_FLAG == 1) %>%
dplyr::select(-INDEX, -TARGET_FLAG) %>%
dplyr::select(TARGET_AMT, everything())
# Split the data into training and test sets
set.seed(123) # For reproducibility
train_index <- createDataPartition(mlr_data$TARGET_AMT, p = 0.7, list = FALSE)
train_data <- mlr_data[train_index, ]
test_data_transformed_scaled <- mlr_data[-train_index, ]
# Apply Box-Cox transformation to TARGET_AMT
boxcox_result <- boxcox(lm(TARGET_AMT ~ ., data = train_data), lambda = seq(-2, 2, by = 0.1))
# Optimal lambda
optimal_lambda_transformed <- boxcox_result$x[which.max(boxcox_result$y)]
cat("Optimal lambda for Box-Cox transformation: ", optimal_lambda_transformed, "\n")
## Optimal lambda for Box-Cox transformation: -0.02020202
# Transform TARGET_AMT in both training and test datasets
boxcox_transform <- function(y, lambda) {
if (lambda == 0) {
log(y) # Log transform if lambda is 0
} else {
(y^lambda - 1) / lambda
}
}
train_data$TARGET_AMT <- boxcox_transform(train_data$TARGET_AMT, optimal_lambda_transformed)
test_data_transformed_scaled$TARGET_AMT <- boxcox_transform(test_data_transformed_scaled$TARGET_AMT, optimal_lambda_transformed)
# Min-Max scaling function that saves min and max as objects
min_max_scale_with_objects <- function(x) {
min_val <- min(x, na.rm = TRUE)
max_val <- max(x, na.rm = TRUE)
scaled_values <- (x - min_val) / (max_val - min_val)
# Return a list containing scaled values, min, and max
return(list(scaled = scaled_values, min_val = min_val, max_val = max_val))
}
# Apply Min-Max scaling with object saving to train and test datasets
train_scaled <- min_max_scale_with_objects(train_data$TARGET_AMT)
train_data$TARGET_AMT <- train_scaled$scaled
test_data_transformed_scaled$TARGET_AMT <- (test_data_transformed_scaled$TARGET_AMT - train_scaled$min_val) /
(train_scaled$max_val - train_scaled$min_val)
# Save min and max for potential future use
train_min_val_transformed <- train_scaled$min_val
train_max_val_transformed <- train_scaled$max_val
cat("Saved min and max values for future reference:\n")
## Saved min and max values for future reference:
cat("Min value:", train_min_val_transformed, "\n")
## Min value: 3.295567
cat("Max value:", train_max_val_transformed, "\n")
## Max value: 10.20269
# ---- Step 2: Fit Full Model ----
# Define the full multiple linear regression formula
full_formula <- TARGET_AMT ~ .
# Fit the full multiple linear regression model
full_model_transformed_scaled <- lm(formula = full_formula, data = train_data)
# ---- Step 3: Fit Custom Model ----
# Define the custom multiple linear regression formula
custom_formula <- TARGET_AMT ~ BLUEBOOK + INCOME + AGE + HOME_VAL +
CAR_TYPE_z_SUV + CAR_TYPE_SportsCar
# Fit the custom multiple linear regression model
custom_model_transformed_scaled <- lm(formula = custom_formula, data = train_data)
# ---- Step 4: Model Selection ----
# Forward Selection
cat("\n--- Forward Selection ---\n")
##
## --- Forward Selection ---
forward_model_transformed_scaled <- step(
lm(TARGET_AMT ~ 1, data = train_data),
scope = list(lower = TARGET_AMT ~ 1, upper = full_formula),
direction = "forward",
trace = FALSE
)
print(summary(forward_model_transformed_scaled))
##
## Call:
## lm(formula = TARGET_AMT ~ 1, data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.62610 -0.04938 0.00550 0.04745 0.37390
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.626103 0.001459 429.1 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.09813 on 4522 degrees of freedom
# Backward Elimination
cat("\n--- Backward Elimination ---\n")
##
## --- Backward Elimination ---
backward_model_transformed_scaled <- step(full_model_transformed_scaled, direction = "backward", trace = FALSE)
print(summary(backward_model_transformed_scaled))
##
## Call:
## lm(formula = TARGET_AMT ~ AGE + HOMEKIDS + YOJ + INCOME + BLUEBOOK +
## OLDCLAIM + CLM_FREQ + MVR_PTS + MSTATUS_z_No + SEX_z_F +
## EDUCATION_Bachelors + EDUCATION_Masters + EDUCATION_PhD +
## JOB_HomeMaker + JOB_Professional + JOB_z_BlueCollar + CAR_TYPE_Pickup +
## CAR_TYPE_SportsCar + CAR_TYPE_z_SUV + REVOKED_Yes, data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.59425 -0.05004 0.00432 0.04938 0.37242
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.586747 0.008544 68.672 < 2e-16 ***
## AGE 0.034493 0.012738 2.708 0.006798 **
## HOMEKIDS 0.020297 0.007730 2.626 0.008680 **
## YOJ -0.015415 0.010205 -1.511 0.130985
## INCOME -0.055330 0.021461 -2.578 0.009962 **
## BLUEBOOK 0.132934 0.017012 7.814 6.85e-15 ***
## OLDCLAIM 0.050321 0.012683 3.968 7.37e-05 ***
## CLM_FREQ -0.034326 0.007665 -4.478 7.72e-06 ***
## MVR_PTS 0.043003 0.008610 4.994 6.12e-07 ***
## MSTATUS_z_No 0.016348 0.003353 4.876 1.12e-06 ***
## SEX_z_F -0.019219 0.005209 -3.689 0.000228 ***
## EDUCATION_Bachelors -0.014431 0.004367 -3.305 0.000957 ***
## EDUCATION_Masters 0.012243 0.006020 2.034 0.042041 *
## EDUCATION_PhD 0.020490 0.009103 2.251 0.024444 *
## JOB_HomeMaker -0.011496 0.006910 -1.664 0.096229 .
## JOB_Professional 0.013774 0.005920 2.327 0.020035 *
## JOB_z_BlueCollar 0.007804 0.004333 1.801 0.071779 .
## CAR_TYPE_Pickup 0.007891 0.004902 1.610 0.107564
## CAR_TYPE_SportsCar 0.010481 0.007146 1.467 0.142509
## CAR_TYPE_z_SUV 0.018109 0.006364 2.846 0.004451 **
## REVOKED_Yes -0.017293 0.004968 -3.481 0.000505 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.09636 on 4502 degrees of freedom
## Multiple R-squared: 0.04001, Adjusted R-squared: 0.03574
## F-statistic: 9.381 on 20 and 4502 DF, p-value: < 2.2e-16
# Stepwise Selection
cat("\n--- Stepwise Selection ---\n")
##
## --- Stepwise Selection ---
stepwise_model_transformed_scaled <- step(full_model_transformed_scaled, direction = "both", trace = FALSE)
print(summary(stepwise_model_transformed_scaled))
##
## Call:
## lm(formula = TARGET_AMT ~ AGE + HOMEKIDS + YOJ + INCOME + BLUEBOOK +
## OLDCLAIM + CLM_FREQ + MVR_PTS + MSTATUS_z_No + SEX_z_F +
## EDUCATION_Bachelors + EDUCATION_Masters + EDUCATION_PhD +
## JOB_HomeMaker + JOB_Professional + JOB_z_BlueCollar + CAR_TYPE_Pickup +
## CAR_TYPE_SportsCar + CAR_TYPE_z_SUV + REVOKED_Yes, data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.59425 -0.05004 0.00432 0.04938 0.37242
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.586747 0.008544 68.672 < 2e-16 ***
## AGE 0.034493 0.012738 2.708 0.006798 **
## HOMEKIDS 0.020297 0.007730 2.626 0.008680 **
## YOJ -0.015415 0.010205 -1.511 0.130985
## INCOME -0.055330 0.021461 -2.578 0.009962 **
## BLUEBOOK 0.132934 0.017012 7.814 6.85e-15 ***
## OLDCLAIM 0.050321 0.012683 3.968 7.37e-05 ***
## CLM_FREQ -0.034326 0.007665 -4.478 7.72e-06 ***
## MVR_PTS 0.043003 0.008610 4.994 6.12e-07 ***
## MSTATUS_z_No 0.016348 0.003353 4.876 1.12e-06 ***
## SEX_z_F -0.019219 0.005209 -3.689 0.000228 ***
## EDUCATION_Bachelors -0.014431 0.004367 -3.305 0.000957 ***
## EDUCATION_Masters 0.012243 0.006020 2.034 0.042041 *
## EDUCATION_PhD 0.020490 0.009103 2.251 0.024444 *
## JOB_HomeMaker -0.011496 0.006910 -1.664 0.096229 .
## JOB_Professional 0.013774 0.005920 2.327 0.020035 *
## JOB_z_BlueCollar 0.007804 0.004333 1.801 0.071779 .
## CAR_TYPE_Pickup 0.007891 0.004902 1.610 0.107564
## CAR_TYPE_SportsCar 0.010481 0.007146 1.467 0.142509
## CAR_TYPE_z_SUV 0.018109 0.006364 2.846 0.004451 **
## REVOKED_Yes -0.017293 0.004968 -3.481 0.000505 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.09636 on 4502 degrees of freedom
## Multiple R-squared: 0.04001, Adjusted R-squared: 0.03574
## F-statistic: 9.381 on 20 and 4502 DF, p-value: < 2.2e-16
# ---- Step 5: Evaluate Models on Test Data ----
# Function to calculate performance metrics (R-squared and RMSE)
evaluate_model <- function(model, data) {
predicted <- predict(model, newdata = data)
actual <- data$TARGET_AMT
# Calculate metrics
rsquared <- cor(predicted, actual)^2
rmse <- sqrt(mean((predicted - actual)^2))
return(data.frame(R_Squared = rsquared, RMSE = rmse))
}
# Evaluate models using the test data
full_evaluation_transformed_scaled <- evaluate_model(full_model_transformed_scaled, test_data_transformed_scaled)
custom_evaluation_transformed_scaled <- evaluate_model(custom_model_transformed_scaled, test_data_transformed_scaled)
forward_evaluation_transformed_scaled <- evaluate_model(forward_model_transformed_scaled, test_data_transformed_scaled)
## Warning in cor(predicted, actual): the standard deviation is zero
backward_evaluation_transformed_scaled <- evaluate_model(backward_model_transformed_scaled, test_data_transformed_scaled)
stepwise_evaluation_transformed_scaled <- evaluate_model(stepwise_model_transformed_scaled, test_data_transformed_scaled)
# ---- Step 6: Compare Models ----
# Combine Metrics for All Models
model_metrics_transformed_scaled <- rbind(
data.frame(Model = "Full", full_evaluation_transformed_scaled),
data.frame(Model = "Custom", custom_evaluation_transformed_scaled),
data.frame(Model = "Forward", forward_evaluation_transformed_scaled),
data.frame(Model = "Backward", backward_evaluation_transformed_scaled),
data.frame(Model = "Stepwise", stepwise_evaluation_transformed_scaled)
)
# Sort by R-Squared
model_metrics_transformed_scaled <- model_metrics_transformed_scaled[order(-model_metrics_transformed_scaled$R_Squared), ]
# Print Model Performance Comparison
print(model_metrics_transformed_scaled)
## Model R_Squared RMSE
## 1 Full 0.01541150 0.09694449
## 4 Backward 0.01459890 0.09698848
## 5 Stepwise 0.01459890 0.09698848
## 2 Custom 0.01404875 0.09668920
## 3 Forward NA 0.09737593
# Combine AIC and LogLikelihood for All Models
model_aic_loglik_transformed_scaled <- data.frame(
Model = c("Full", "Custom", "Forward", "Backward", "Stepwise"),
AIC = c(AIC(full_model_transformed_scaled), AIC(custom_model_transformed_scaled), AIC(forward_model_transformed_scaled), AIC(backward_model_transformed_scaled), AIC(stepwise_model_transformed_scaled)),
LogLikelihood = c(logLik(full_model_transformed_scaled), logLik(custom_model_transformed_scaled), logLik(forward_model_transformed_scaled), logLik(backward_model_transformed_scaled), logLik(stepwise_model_transformed_scaled))
)
# Sort by LogLikelihood
model_aic_loglik_transformed_scaled <- model_aic_loglik_transformed_scaled[order(-model_aic_loglik_transformed_scaled$LogLikelihood), ]
# Print AIC and LogLikelihood Comparison
print(model_aic_loglik_transformed_scaled)
## Model AIC LogLikelihood
## 1 Full -8278.158 4178.079
## 4 Backward -8306.176 4175.088
## 5 Stepwise -8306.176 4175.088
## 2 Custom -8221.300 4118.650
## 3 Forward -8161.498 4082.749
# ---- Step 8: Visualize Residuals for Test Data ----
par(mfrow = c(2, 3))
plot(residuals(full_model_transformed_scaled), main = "Full Model Residuals", ylab = "Residuals")
plot(residuals(custom_model_transformed_scaled), main = "Custom Model Residuals", ylab = "Residuals")
plot(residuals(forward_model_transformed_scaled), main = "Forward Selection Residuals", ylab = "Residuals")
plot(residuals(backward_model_transformed_scaled), main = "Backward Elimination Residuals", ylab = "Residuals")
plot(residuals(stepwise_model_transformed_scaled), main = "Stepwise Selection Residuals", ylab = "Residuals")
cat("\n--- Model Comparison Completed ---\n")
##
## --- Model Comparison Completed ---
After removing outliers, applying a Box-Cox transformation (lambda = 0.222), and Min-Max scaling, refitted models show limited predictive power with R-squared values around 0.012–0.015 and RMSE at ~0.203. Residuals improved in uniformity, but linear models struggle to capture relationships, suggesting the need for non-linear approaches.
# Load necessary libraries
library(dplyr)
if (!require("caret")) install.packages("caret")
library(caret)
if (!require("MASS")) install.packages("MASS")
library(MASS)
# ---- Step 1: Data Preparation ----
# Filter the data for rows where TARGET_FLAG = 1
mlr_data <- data_prepared %>%
filter(TARGET_FLAG == 1) %>%
dplyr::select(-INDEX, -TARGET_FLAG) %>%
dplyr::select(TARGET_AMT, everything())
# Split the data into training and test sets
set.seed(123) # For reproducibility
train_index <- createDataPartition(mlr_data$TARGET_AMT, p = 0.7, list = FALSE)
train_data <- mlr_data[train_index, ]
test_data_outliers_removed <- mlr_data[-train_index, ]
# ---- Step 2: Identify and Remove Outliers and Influential Points ----
# Fit the initial full model
full_formula <- TARGET_AMT ~ .
initial_full_model <- lm(formula = full_formula, data = train_data)
# Calculate standardized residuals
standardized_residuals <- rstandard(initial_full_model)
# Identify outliers (standardized residuals > |3|)
outliers <- which(abs(standardized_residuals) > 3)
# Calculate leverage and Cook's distance
leverage <- hatvalues(initial_full_model)
cooks_distance <- cooks.distance(initial_full_model)
# Identify influential points (Cook's distance > 4/n)
n <- nrow(train_data)
influential_points <- which(cooks_distance > (4 / n))
# Combine outliers and influential points
points_to_remove <- unique(c(outliers, influential_points))
cat("Points to remove: ", points_to_remove, "\n")
## Points to remove: 45 71 148 171 357 371 491 499 514 585 661 694 703 719 731 833 855 909 938 991 995 1007 1008 1010 1118 1122 1142 1222 1234 1292 1319 1330 1405 1473 1558 1559 1608 1710 1711 1757 1801 1802 2007 2068 2133 2134 2215 2245 2456 2457 2467 2468 2483 2516 2517 2615 2658 2799 2800 2809 2875 2889 2890 2914 2915 2921 2941 2942 3130 3171 3231 3268 3269 3349 3350 3457 3458 3519 3526 3527 3550 3551 3553 3554 3557 3558 3785 3786 3795 3796 4079 4135 4136 4194 4195 4214 4215 4384 4385 4506 4 33 82 129 318 433 454 457 468 507 542 545 559 605 686 816 928 970 982 986 1096 1165 1256 1293 1442 1454 1464 1540 1541 1609 1724 1725 1756 1922 2361 2396 2421 2422 2455 2504 2571 2656 2692 2830 2859 3131 3389 3420 3421 3477 3518 3814 3893 4009 4010 4080 4137 4138 4346 4474
# Remove identified points from the training dataset
cleaned_training_data <- train_data[-points_to_remove, ]
cleaned_train_data <- train_data[-points_to_remove, ]
# ---- Step 3: Apply Box-Cox Transformation ----
# Apply Box-Cox transformation to TARGET_AMT
boxcox_result_cleaned <- boxcox(lm(TARGET_AMT ~ ., data = cleaned_train_data), lambda = seq(-2, 2, by = 0.1))
# Optimal lambda
optimal_lambda_cleaned <- boxcox_result_cleaned$x[which.max(boxcox_result_cleaned$y)]
cat("Optimal lambda for Box-Cox transformation (Cleaned Data): ", optimal_lambda_cleaned, "\n")
## Optimal lambda for Box-Cox transformation (Cleaned Data): 0.2222222
# Transform TARGET_AMT in both cleaned training and test datasets
boxcox_transform <- function(y, lambda) {
if (lambda == 0) {
log(y) # Log transform if lambda is 0
} else {
(y^lambda - 1) / lambda
}
}
cleaned_train_data$TARGET_AMT <- boxcox_transform(cleaned_train_data$TARGET_AMT, optimal_lambda_cleaned)
test_data_outliers_removed$TARGET_AMT <- boxcox_transform(test_data_outliers_removed$TARGET_AMT, optimal_lambda_cleaned)
# ---- Step 4: Apply Min-Max Scaling ----
# Min-Max scaling function
min_max_scale <- function(x) {
min_val <- min(x, na.rm = TRUE)
max_val <- max(x, na.rm = TRUE)
scaled_values <- (x - min_val) / (max_val - min_val)
# Return a list containing scaled values, min, and max
return(list(scaled = scaled_values, min_val = min_val, max_val = max_val))
}
# Scale TARGET_AMT in cleaned training data
scaling_result_train <- min_max_scale(cleaned_train_data$TARGET_AMT)
cleaned_train_data$TARGET_AMT <- scaling_result_train$scaled
# Save min and max for potential future use
train_min_val <- scaling_result_train$min_val
train_max_val <- scaling_result_train$max_val
# Scale TARGET_AMT in test data using the min and max from the training data
test_data_outliers_removed$TARGET_AMT <- (test_data_outliers_removed$TARGET_AMT - train_min_val) / (train_max_val - train_min_val)
# ---- Continue with the Model Fitting and Evaluation ----
# Refit Full Model
refit_full_model_outliers_removed <- lm(full_formula, data = cleaned_train_data)
print(summary(refit_full_model_outliers_removed))
##
## Call:
## lm(formula = full_formula, data = cleaned_train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.50839 -0.07014 0.00785 0.07268 0.46371
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.487701 0.021206 22.998 < 2e-16 ***
## KIDSDRIV 0.015087 0.015579 0.968 0.332882
## AGE 0.044141 0.017022 2.593 0.009540 **
## HOMEKIDS 0.027287 0.012879 2.119 0.034166 *
## YOJ -0.040228 0.014021 -2.869 0.004137 **
## INCOME -0.045095 0.030896 -1.460 0.144472
## HOME_VAL -0.009803 0.022161 -0.442 0.658248
## TRAVTIME -0.018597 0.019099 -0.974 0.330239
## BLUEBOOK 0.101390 0.026161 3.876 0.000108 ***
## TIF 0.004298 0.012704 0.338 0.735120
## OLDCLAIM 0.051390 0.016145 3.183 0.001468 **
## CLM_FREQ -0.039985 0.009773 -4.091 4.37e-05 ***
## MVR_PTS 0.042053 0.010992 3.826 0.000132 ***
## CAR_AGE 0.010833 0.015217 0.712 0.476542
## PARENT1_Yes -0.006653 0.007359 -0.904 0.365988
## MSTATUS_z_No 0.019959 0.006090 3.277 0.001056 **
## SEX_z_F -0.009084 0.007970 -1.140 0.254447
## EDUCATION_Bachelors -0.023272 0.007856 -2.962 0.003068 **
## EDUCATION_Masters 0.018957 0.013591 1.395 0.163155
## EDUCATION_PhD 0.025633 0.016257 1.577 0.114933
## EDUCATION_z_HighSchool 0.002661 0.006353 0.419 0.675380
## JOB_Clerical 0.026307 0.015128 1.739 0.082117 .
## JOB_Doctor 0.014397 0.020559 0.700 0.483783
## JOB_HomeMaker -0.004874 0.016083 -0.303 0.761869
## JOB_Lawyer -0.002017 0.013101 -0.154 0.877671
## JOB_Manager 0.020728 0.013476 1.538 0.124101
## JOB_Professional 0.032628 0.014249 2.290 0.022078 *
## JOB_Student 0.017501 0.016159 1.083 0.278834
## JOB_z_BlueCollar 0.031397 0.014529 2.161 0.030749 *
## CAR_USE_Private 0.001763 0.006477 0.272 0.785469
## CAR_TYPE_PanelTruck 0.006761 0.011971 0.565 0.572259
## CAR_TYPE_Pickup 0.006730 0.007342 0.917 0.359383
## CAR_TYPE_SportsCar 0.010225 0.009188 1.113 0.265829
## CAR_TYPE_Van -0.003623 0.009436 -0.384 0.700997
## CAR_TYPE_z_SUV 0.011574 0.008239 1.405 0.160134
## RED_CAR_yes 0.009392 0.006106 1.538 0.124094
## REVOKED_Yes -0.013443 0.006272 -2.143 0.032150 *
## URBANICITY_z_HighlyRuralRural -0.020062 0.009518 -2.108 0.035102 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1192 on 4325 degrees of freedom
## Multiple R-squared: 0.03131, Adjusted R-squared: 0.02303
## F-statistic: 3.779 on 37 and 4325 DF, p-value: 1.374e-13
# Define Custom Model Formula
custom_formula <- TARGET_AMT ~ BLUEBOOK + INCOME + AGE + HOME_VAL +
CAR_TYPE_z_SUV + CAR_TYPE_SportsCar
# Refit Custom Model
refit_custom_model_outliers_removed <- lm(custom_formula, data = cleaned_train_data)
print(summary(refit_custom_model_outliers_removed))
##
## Call:
## lm(formula = custom_formula, data = cleaned_train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.52833 -0.06834 0.00768 0.07084 0.45915
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.523109 0.007060 74.091 < 2e-16 ***
## BLUEBOOK 0.081838 0.019719 4.150 3.39e-05 ***
## INCOME -0.006649 0.021972 -0.303 0.7622
## AGE 0.025177 0.014292 1.762 0.0782 .
## HOME_VAL -0.045174 0.018162 -2.487 0.0129 *
## CAR_TYPE_z_SUV -0.005109 0.004901 -1.043 0.2972
## CAR_TYPE_SportsCar -0.003813 0.006348 -0.601 0.5482
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1203 on 4356 degrees of freedom
## Multiple R-squared: 0.007379, Adjusted R-squared: 0.006011
## F-statistic: 5.397 on 6 and 4356 DF, p-value: 1.445e-05
# Continue with your evaluations and residual diagnostics...
# Refit Forward Selection Model
refit_forward_model_outliers_removed <- step(lm(TARGET_AMT ~ 1, data = cleaned_train_data),
scope = list(lower = TARGET_AMT ~ 1, upper = full_formula),
direction = "forward",
trace = FALSE)
print(summary(refit_forward_model_outliers_removed))
##
## Call:
## lm(formula = TARGET_AMT ~ 1, data = cleaned_train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.53969 -0.06988 0.00778 0.06908 0.46031
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.539691 0.001826 295.6 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1206 on 4362 degrees of freedom
# Refit Backward Elimination Model
refit_backward_model_outliers_removed <- step(refit_full_model_outliers_removed, direction = "backward", trace = FALSE)
print(summary(refit_backward_model_outliers_removed))
##
## Call:
## lm(formula = TARGET_AMT ~ AGE + HOMEKIDS + YOJ + INCOME + BLUEBOOK +
## OLDCLAIM + CLM_FREQ + MVR_PTS + MSTATUS_z_No + EDUCATION_Bachelors +
## EDUCATION_Masters + EDUCATION_PhD + JOB_Clerical + JOB_Manager +
## JOB_Professional + JOB_Student + JOB_z_BlueCollar + RED_CAR_yes +
## REVOKED_Yes + URBANICITY_z_HighlyRuralRural, data = cleaned_train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.50348 -0.06952 0.00778 0.07069 0.46242
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.484135 0.011723 41.296 < 2e-16 ***
## AGE 0.051855 0.016065 3.228 0.001257 **
## HOMEKIDS 0.027436 0.009832 2.790 0.005287 **
## YOJ -0.040148 0.013339 -3.010 0.002629 **
## INCOME -0.040014 0.027295 -1.466 0.142718
## BLUEBOOK 0.090249 0.018846 4.789 1.73e-06 ***
## OLDCLAIM 0.050444 0.016038 3.145 0.001670 **
## CLM_FREQ -0.038891 0.009675 -4.020 5.93e-05 ***
## MVR_PTS 0.040908 0.010879 3.760 0.000172 ***
## MSTATUS_z_No 0.018217 0.004231 4.305 1.71e-05 ***
## EDUCATION_Bachelors -0.024124 0.005600 -4.308 1.68e-05 ***
## EDUCATION_Masters 0.021315 0.009535 2.236 0.025433 *
## EDUCATION_PhD 0.031119 0.012993 2.395 0.016661 *
## JOB_Clerical 0.029624 0.009242 3.205 0.001358 **
## JOB_Manager 0.022567 0.009800 2.303 0.021341 *
## JOB_Professional 0.035284 0.009454 3.732 0.000192 ***
## JOB_Student 0.022732 0.009227 2.464 0.013788 *
## JOB_z_BlueCollar 0.034032 0.008751 3.889 0.000102 ***
## RED_CAR_yes 0.011395 0.004628 2.462 0.013859 *
## REVOKED_Yes -0.012711 0.006215 -2.045 0.040904 *
## URBANICITY_z_HighlyRuralRural -0.021272 0.009444 -2.252 0.024343 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1191 on 4342 degrees of freedom
## Multiple R-squared: 0.02951, Adjusted R-squared: 0.02503
## F-statistic: 6.6 on 20 and 4342 DF, p-value: < 2.2e-16
# Refit Stepwise Selection Model
refit_stepwise_model_outliers_removed <- step(refit_full_model_outliers_removed, direction = "both", trace = FALSE)
print(summary(refit_stepwise_model_outliers_removed))
##
## Call:
## lm(formula = TARGET_AMT ~ AGE + HOMEKIDS + YOJ + INCOME + BLUEBOOK +
## OLDCLAIM + CLM_FREQ + MVR_PTS + MSTATUS_z_No + EDUCATION_Bachelors +
## EDUCATION_Masters + EDUCATION_PhD + JOB_Clerical + JOB_Manager +
## JOB_Professional + JOB_Student + JOB_z_BlueCollar + RED_CAR_yes +
## REVOKED_Yes + URBANICITY_z_HighlyRuralRural, data = cleaned_train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.50348 -0.06952 0.00778 0.07069 0.46242
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.484135 0.011723 41.296 < 2e-16 ***
## AGE 0.051855 0.016065 3.228 0.001257 **
## HOMEKIDS 0.027436 0.009832 2.790 0.005287 **
## YOJ -0.040148 0.013339 -3.010 0.002629 **
## INCOME -0.040014 0.027295 -1.466 0.142718
## BLUEBOOK 0.090249 0.018846 4.789 1.73e-06 ***
## OLDCLAIM 0.050444 0.016038 3.145 0.001670 **
## CLM_FREQ -0.038891 0.009675 -4.020 5.93e-05 ***
## MVR_PTS 0.040908 0.010879 3.760 0.000172 ***
## MSTATUS_z_No 0.018217 0.004231 4.305 1.71e-05 ***
## EDUCATION_Bachelors -0.024124 0.005600 -4.308 1.68e-05 ***
## EDUCATION_Masters 0.021315 0.009535 2.236 0.025433 *
## EDUCATION_PhD 0.031119 0.012993 2.395 0.016661 *
## JOB_Clerical 0.029624 0.009242 3.205 0.001358 **
## JOB_Manager 0.022567 0.009800 2.303 0.021341 *
## JOB_Professional 0.035284 0.009454 3.732 0.000192 ***
## JOB_Student 0.022732 0.009227 2.464 0.013788 *
## JOB_z_BlueCollar 0.034032 0.008751 3.889 0.000102 ***
## RED_CAR_yes 0.011395 0.004628 2.462 0.013859 *
## REVOKED_Yes -0.012711 0.006215 -2.045 0.040904 *
## URBANICITY_z_HighlyRuralRural -0.021272 0.009444 -2.252 0.024343 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1191 on 4342 degrees of freedom
## Multiple R-squared: 0.02951, Adjusted R-squared: 0.02503
## F-statistic: 6.6 on 20 and 4342 DF, p-value: < 2.2e-16
# ---- Step 6: Evaluate Models on Test Data ----
# Function to calculate performance metrics
evaluate_model <- function(model, data) {
predicted <- predict(model, newdata = data)
actual <- data$TARGET_AMT
# Calculate metrics
rsquared <- cor(predicted, actual)^2
rmse <- sqrt(mean((predicted - actual)^2))
return(data.frame(R_Squared = rsquared, RMSE = rmse))
}
# Evaluate each model
refit_full_evaluation_outliers_removed <- evaluate_model(refit_full_model_outliers_removed, test_data_outliers_removed)
refit_custom_evaluation_outliers_removed <- evaluate_model(refit_custom_model_outliers_removed, test_data_outliers_removed)
refit_forward_evaluation_outliers_removed <- evaluate_model(refit_forward_model_outliers_removed, test_data_outliers_removed)
## Warning in cor(predicted, actual): the standard deviation is zero
refit_backward_evaluation_outliers_removed <- evaluate_model(refit_backward_model_outliers_removed, test_data_outliers_removed)
refit_stepwise_evaluation_outliers_removed <- evaluate_model(refit_stepwise_model_outliers_removed, test_data_outliers_removed)
# Combine results
refit_outliers_removed_model_metrics <- rbind(
data.frame(Model = "Refit Full", refit_full_evaluation_outliers_removed),
data.frame(Model = "Refit Custom", refit_custom_evaluation_outliers_removed),
data.frame(Model = "Refit Forward", refit_forward_evaluation_outliers_removed),
data.frame(Model = "Refit Backward", refit_backward_evaluation_outliers_removed),
data.frame(Model = "Refit Stepwise", refit_stepwise_evaluation_outliers_removed)
)
# Sort by R-Squared
refit_outliers_removed_model_metrics <- refit_outliers_removed_model_metrics[order(-refit_outliers_removed_model_metrics$R_Squared), ]
# Print transformed model performance
print(refit_outliers_removed_model_metrics)
## Model R_Squared RMSE
## 4 Refit Backward 0.01249531 0.1524464
## 5 Refit Stepwise 0.01249531 0.1524464
## 2 Refit Custom 0.01200467 0.1526072
## 1 Refit Full 0.01154497 0.1525519
## 3 Refit Forward NA 0.1533706
# ---- Step 7: Compare Models (AIC and LogLikelihood) ----
# ---- Step 6: Evaluate AIC and Log-Likelihood ----
# Helper function to calculate AIC and Log-Likelihood safely
evaluate_aic_loglik <- function(model) {
if (!inherits(model, "lm")) {
warning("The input is not a valid model object.")
return(data.frame(AIC = NA, LogLikelihood = NA))
}
# Check if the residual standard deviation is zero
if (sd(residuals(model)) == 0) {
warning("The standard deviation of residuals is zero. Model evaluation might not be meaningful.")
return(data.frame(AIC = NA, LogLikelihood = NA))
}
# Calculate AIC and Log-Likelihood
aic <- AIC(model)
loglik <- logLik(model)
return(data.frame(AIC = aic, LogLikelihood = as.numeric(loglik)))
}
# Evaluate AIC and Log-Likelihood for each model
full_aic_loglik_outliers_removed <- evaluate_aic_loglik(refit_full_model_outliers_removed)
custom_aic_loglik_outliers_removed <- evaluate_aic_loglik(refit_custom_model_outliers_removed)
forward_aic_loglik_outliers_removed <- evaluate_aic_loglik(refit_forward_model_outliers_removed)
backward_aic_loglik_outliers_removed <- evaluate_aic_loglik(refit_backward_model_outliers_removed)
stepwise_aic_loglik_outliers_removed <- evaluate_aic_loglik(refit_stepwise_model_outliers_removed)
# Combine AIC and Log-Likelihood metrics for all models
model_aic_loglik_outliers_removed <- rbind(
data.frame(Model = "Full", full_aic_loglik_outliers_removed),
data.frame(Model = "Custom", custom_aic_loglik_outliers_removed),
data.frame(Model = "Forward", forward_aic_loglik_outliers_removed),
data.frame(Model = "Backward", backward_aic_loglik_outliers_removed),
data.frame(Model = "Stepwise", stepwise_aic_loglik_outliers_removed)
)
# Sort by AIC (lower is better)
model_aic_loglik_outliers_removed <- model_aic_loglik_outliers_removed[order(model_aic_loglik_outliers_removed$AIC, na.last = TRUE), ]
# Print AIC and Log-Likelihood Comparison
print(model_aic_loglik_outliers_removed)
## Model AIC LogLikelihood
## 4 Backward -6162.889 3103.445
## 5 Stepwise -6162.889 3103.445
## 1 Full -6137.023 3107.511
## 2 Custom -6092.533 3054.266
## 3 Forward -6072.221 3038.110
# ---- Step 7: Residual Diagnostics ----
# Residual plots for refitted, transformed models
par(mfrow = c(2, 3))
plot(residuals(refit_full_model_outliers_removed), main = "Refit Full Model Residuals", ylab = "Residuals")
plot(residuals(refit_custom_model_outliers_removed), main = "Refit Custom Model Residuals", ylab = "Residuals")
plot(residuals(refit_forward_model_outliers_removed), main = "Refit Forward Selection Residuals", ylab = "Residuals")
plot(residuals(refit_backward_model_outliers_removed), main = "Refit Backward Elimination Residuals", ylab = "Residuals")
plot(residuals(refit_stepwise_model_outliers_removed), main = "Refit Stepwise Selection Residuals", ylab = "Residuals")
cat("\n--- Refit Model Comparison Completed ---\n")
##
## --- Refit Model Comparison Completed ---
Based on the trade-offs between predictive power, interpretability, and overall model stability: - Custom Transformed Model (Pre-Outlier Removal) is the most balanced choice: - Competitive RMSE (~0.096). - Lower AIC/BIC compared to untransformed models. - Maintains simplicity for interpretability and explanation of key drivers.
# ---- Step 1: Rename Model Objects and Store in a List ----
# Store models with renamed keys for better clarity
mlr_models <- list(
Full_Untransformed = full_model_untransformed_target,
Backward_Untransformed = backward_model_untransformed_target,
Stepwise_Untransformed = stepwise_model_untransformed_target,
Custom_Untransformed = custom_model_untransformed_target,
Forward_Untransformed = forward_model_untransformed_target,
Full_Transformed = full_model_transformed_scaled,
Backward_Transformed = backward_model_transformed_scaled,
Stepwise_Transformed = stepwise_model_transformed_scaled,
Custom_Transformed = custom_model_transformed_scaled,
Forward_Transformed = forward_model_transformed_scaled,
Full_Outliers_Removed = refit_full_model_outliers_removed,
Backward_Outliers_Removed = refit_backward_model_outliers_removed,
Stepwise_Outliers_Removed = refit_stepwise_model_outliers_removed,
Custom_Outliers_Removed = refit_custom_model_outliers_removed,
Forward_Outliers_Removed = refit_forward_model_outliers_removed
)
# ---- Step 2: Combine Metrics for All Models ----
# Function to extract metrics from a model evaluation
combine_metrics <- function(name, evaluation) {
if (is.null(evaluation)) {
return(data.frame(Model = name, R_Squared = NA, RMSE = NA))
}
return(data.frame(Model = name, R_Squared = evaluation$R_Squared, RMSE = evaluation$RMSE))
}
# Combine model evaluations into a single data frame
model_metrics <- do.call(
rbind,
list(
combine_metrics("Full_Untransformed", full_evaluation),
combine_metrics("Custom_Untransformed", custom_evaluation),
combine_metrics("Forward_Untransformed", forward_evaluation),
combine_metrics("Backward_Untransformed", backward_evaluation),
combine_metrics("Stepwise_Untransformed", stepwise_evaluation),
combine_metrics("Full_Transformed", full_evaluation_transformed_scaled),
combine_metrics("Custom_Transformed", custom_evaluation_transformed_scaled),
combine_metrics("Forward_Transformed", forward_evaluation_transformed_scaled),
combine_metrics("Backward_Transformed", backward_evaluation_transformed_scaled),
combine_metrics("Stepwise_Transformed", stepwise_evaluation_transformed_scaled),
combine_metrics("Full_Outliers_Removed", refit_full_evaluation_outliers_removed),
combine_metrics("Custom_Outliers_Removed", refit_custom_evaluation_outliers_removed),
combine_metrics("Forward_Outliers_Removed", refit_forward_evaluation_outliers_removed),
combine_metrics("Backward_Outliers_Removed", refit_backward_evaluation_outliers_removed),
combine_metrics("Stepwise_Outliers_Removed", refit_stepwise_evaluation_outliers_removed)
)
)
# ---- Step 3: Sort and Print Metrics ----
# Sort by R-Squared in descending order
model_metrics <- model_metrics[order(-model_metrics$R_Squared, na.last = TRUE), ]
# Print Combined Metrics for All Models
cat("\n--- Combined Model Metrics ---\n")
##
## --- Combined Model Metrics ---
print(model_metrics)
## Model R_Squared RMSE
## 2 Custom_Untransformed 0.01606155 0.10661766
## 6 Full_Transformed 0.01541150 0.09694449
## 9 Backward_Transformed 0.01459890 0.09698848
## 10 Stepwise_Transformed 0.01459890 0.09698848
## 7 Custom_Transformed 0.01404875 0.09668920
## 4 Backward_Untransformed 0.01264286 0.10880331
## 5 Stepwise_Untransformed 0.01264286 0.10880331
## 14 Backward_Outliers_Removed 0.01249531 0.15244642
## 15 Stepwise_Outliers_Removed 0.01249531 0.15244642
## 1 Full_Untransformed 0.01203977 0.10906815
## 12 Custom_Outliers_Removed 0.01200467 0.15260717
## 11 Full_Outliers_Removed 0.01154497 0.15255192
## 3 Forward_Untransformed NA 0.10817232
## 8 Forward_Transformed NA 0.09737593
## 13 Forward_Outliers_Removed NA 0.15337056
# ---- Step 1: Rename Model Objects and Store in a List ----
# Store models with renamed keys for better clarity
mlr_models <- list(
Full_Untransformed = full_model_untransformed_target,
Backward_Untransformed = backward_model_untransformed_target,
Stepwise_Untransformed = stepwise_model_untransformed_target,
Custom_Untransformed = custom_model_untransformed_target,
Forward_Untransformed = forward_model_untransformed_target,
Full_Transformed = full_model_transformed_scaled,
Backward_Transformed = backward_model_transformed_scaled,
Stepwise_Transformed = stepwise_model_transformed_scaled,
Custom_Transformed = custom_model_transformed_scaled,
Forward_Transformed = forward_model_transformed_scaled,
Full_Outliers_Removed = refit_full_model_outliers_removed,
Backward_Outliers_Removed = refit_backward_model_outliers_removed,
Stepwise_Outliers_Removed = refit_stepwise_model_outliers_removed,
Custom_Outliers_Removed = refit_custom_model_outliers_removed,
Forward_Outliers_Removed = refit_forward_model_outliers_removed
)
# ---- Step 2: Combine AIC and BIC Metrics for All Models ----
# Function to extract AIC and BIC metrics
combine_aic_bic_metrics <- function(name, model_object) {
if (is.null(model_object)) {
return(data.frame(Model = name, AIC = NA, BIC = NA))
}
return(data.frame(
Model = name,
AIC = AIC(model_object),
BIC = BIC(model_object)
))
}
# Combine AIC and BIC metrics into a single data frame
model_aic_bic_metrics <- do.call(
rbind,
lapply(names(mlr_models), function(name) {
combine_aic_bic_metrics(name, mlr_models[[name]])
})
)
# ---- Step 3: Sort and Print AIC and BIC Metrics ----
# Sort by AIC in ascending order (lower is better)
model_aic_bic_metrics <- model_aic_bic_metrics[order(model_aic_bic_metrics$AIC, na.last = TRUE), ]
# Print Combined AIC and BIC Metrics for All Models
cat("\n--- Combined AIC and BIC Metrics ---\n")
##
## --- Combined AIC and BIC Metrics ---
print(model_aic_bic_metrics)
## Model AIC BIC
## 7 Backward_Transformed -8306.176 -8165.004
## 8 Stepwise_Transformed -8306.176 -8165.004
## 6 Full_Transformed -8278.158 -8027.898
## 9 Custom_Transformed -8221.300 -8169.964
## 10 Forward_Transformed -8161.498 -8148.664
## 12 Backward_Outliers_Removed -6162.889 -6022.509
## 13 Stepwise_Outliers_Removed -6162.889 -6022.509
## 11 Full_Outliers_Removed -6137.023 -5888.167
## 14 Custom_Outliers_Removed -6092.533 -6041.485
## 15 Forward_Outliers_Removed -6072.221 -6059.459
## 2 Backward_Untransformed 93644.797 93792.386
## 3 Stepwise_Untransformed 93644.797 93792.386
## 1 Full_Untransformed 93669.455 93919.715
## 4 Custom_Untransformed 93730.832 93782.167
## 5 Forward_Untransformed 93814.480 93827.314
Conclusion and Recommendations:
The model is retained because it balances interpretability and
predictive power. Counterintuitive coefficients (e.g., HOME_VAL and
CAR_TYPE variables) warrant further investigation, potentially through
feature engineering or interaction terms. The statistical significance
of most variables, combined with their theoretical relevance, justifies
keeping them in the model despite some anomalies. The boss should be
informed that while the model has room for improvement, it provides a
reasonable foundation for decision-making.
summary(custom_model_transformed_scaled)
##
## Call:
## lm(formula = custom_formula, data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.61332 -0.04837 0.00528 0.04875 0.36512
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.602105 0.005581 107.884 < 2e-16 ***
## BLUEBOOK 0.106546 0.015507 6.871 7.24e-12 ***
## INCOME -0.010415 0.017465 -0.596 0.5510
## AGE 0.024714 0.011330 2.181 0.0292 *
## HOME_VAL -0.021972 0.014376 -1.528 0.1265
## CAR_TYPE_z_SUV -0.002622 0.003890 -0.674 0.5004
## CAR_TYPE_SportsCar -0.007879 0.005074 -1.553 0.1205
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.09742 on 4516 degrees of freedom
## Multiple R-squared: 0.01575, Adjusted R-squared: 0.01444
## F-statistic: 12.04 on 6 and 4516 DF, p-value: 1.812e-13
Objective: Add interaction terms and explore non-linear relationships.
# Add interaction terms
cleaned_train_data <- cleaned_train_data
cleaned_test_data <- test_data_outliers_removed
# Check if the two data frames have the same column names
if (all(names(cleaned_train_data) == names(cleaned_test_data))) {
cat("The two data frames have the same column names in the same order.\n")
} else {
cat("The two data frames do not have the same column names or the same order.\n")
# Identify columns present in one but not the other
missing_in_test <- setdiff(names(cleaned_train_data), names(cleaned_test_data))
missing_in_train <- setdiff(names(cleaned_test_data), names(cleaned_train_data))
if (length(missing_in_test) > 0) {
cat("Columns in cleaned_train_data but not in cleaned_test_data:\n")
print(missing_in_test)
}
if (length(missing_in_train) > 0) {
cat("Columns in cleaned_test_data but not in cleaned_train_data:\n")
print(missing_in_train)
}
}
## The two data frames have the same column names in the same order.
# Check if the columns are in the same order
if (!all(sort(names(cleaned_train_data)) == sort(names(cleaned_test_data)))) {
cat("The columns are not in the same order.\n")
} else {
cat("The columns are in the same order.\n")
}
## The columns are in the same order.
# Load necessary libraries
if (!require("nnet")) install.packages("nnet")
if (!require("earth")) install.packages("earth")
if (!require("class")) install.packages("class")
if (!require("e1071")) install.packages("e1071")
if (!require("rpart")) install.packages("rpart")
if (!require("ranger")) install.packages("ranger")
if (!require("xgboost")) install.packages("xgboost")
library(nnet) # For ANN
library(earth) # For MARS
library(class) # For KNN
library(e1071) # For SVR
library(rpart) # For Decision Tree
library(ranger) # For Random Forest
library(xgboost)# For Gradient Boosting
# ---- Helper Function to Calculate R-Squared ----
calculate_r_squared <- function(predicted, actual) {
cor(predicted, actual)^2
}
# ---- Decision Tree Model ----
tree_model <- rpart(
TARGET_AMT ~ .,
data = data_prepared,
method = "anova",
control = rpart.control(cp = 0.01, maxdepth = 5, minbucket = 10)
)
tree_preds <- predict(tree_model, newdata = data_prepared)
tree_rmse <- sqrt(mean((tree_preds - data_prepared$TARGET_AMT)^2))
tree_r_squared <- calculate_r_squared(tree_preds, data_prepared$TARGET_AMT)
# ---- Random Forest Model ----
rfr_model <- ranger(
TARGET_AMT ~ .,
data = data_prepared,
mtry = floor(sqrt(ncol(data_prepared) - 1)),
num.trees = 100,
min.node.size = 5
)
rf_preds <- predict(rfr_model, data = data_prepared)$predictions
rf_rmse <- sqrt(mean((rf_preds - data_prepared$TARGET_AMT)^2))
rf_r_squared <- calculate_r_squared(rf_preds, data_prepared$TARGET_AMT)
# ---- Ensure TARGET_AMT Has Sufficient Variability ----
data_prepared_xgboost <- data_prepared %>%
filter(TARGET_FLAG == 1) %>%
dplyr::select(-c(TARGET_FLAG, INDEX))
if (length(unique(data_prepared_xgboost$TARGET_AMT)) < 2) {
stop("TARGET_AMT must have at least 2 distinct values for partitioning.")
}
# ---- Step 1: Data Preparation ----
set.seed(123) # For reproducibility
# Split mlr_data into training and testing datasets
train_indices <- createDataPartition(data_prepared_xgboost$TARGET_AMT, p = 0.7, list = FALSE)
train_data_non_linear <- data_prepared_xgboost[train_indices, ]
test_data_non_linear <- data_prepared_xgboost[-train_indices, ]
# ---- Step 2: Box-Cox Transformation ----
library(MASS)
# Determine the optimal lambda for Box-Cox transformation
boxcox_result <- boxcox(lm(TARGET_AMT ~ ., data = train_data_non_linear), lambda = seq(-2, 2, by = 0.1))
optimal_lambda_non_linear <- boxcox_result$x[which.max(boxcox_result$y)]
cat("Optimal lambda for Box-Cox transformation (Non-Linear): ", optimal_lambda_non_linear, "\n")
## Optimal lambda for Box-Cox transformation (Non-Linear): -0.02020202
# Apply the Box-Cox transformation
boxcox_transform <- function(y, lambda) {
if (lambda == 0) {
log(y)
} else {
(y^lambda - 1) / lambda
}
}
train_data_non_linear$TARGET_AMT <- boxcox_transform(train_data_non_linear$TARGET_AMT, optimal_lambda_non_linear)
test_data_non_linear$TARGET_AMT <- boxcox_transform(test_data_non_linear$TARGET_AMT, optimal_lambda_non_linear)
# ---- Step 3: Min-Max Scaling ----
# Define min-max scaling function
min_max_scale <- function(x) {
min_val <- min(x, na.rm = TRUE)
max_val <- max(x, na.rm = TRUE)
scaled <- (x - min_val) / (max_val - min_val)
# Return the scaled data and the min/max values
list(scaled = scaled, min_val = min_val, max_val = max_val)
}
# Scale the TARGET_AMT and store the min and max for later use
scaled_train_target <- min_max_scale(train_data_non_linear$TARGET_AMT)
train_data_non_linear$TARGET_AMT <- scaled_train_target$scaled
min_val_non_linear <- scaled_train_target$min_val
max_val_non_linear <- scaled_train_target$max_val
# Apply the same min and max values to the test data
test_data_non_linear$TARGET_AMT <- (test_data_non_linear$TARGET_AMT - min_val_non_linear) / (max_val_non_linear - min_val_non_linear)
# ---- Return the Min and Max Values for Later Use ----
cat("Optimal Lambda (Non-Linear): ", optimal_lambda_non_linear, "\n")
## Optimal Lambda (Non-Linear): -0.02020202
cat("Min Value (Non-Linear): ", min_val_non_linear, "\n")
## Min Value (Non-Linear): 3.295567
cat("Max Value (Non-Linear): ", max_val_non_linear, "\n")
## Max Value (Non-Linear): 10.20269
# ---- Step 4: Proceed with Model Training ----
# ---- Gradient Boosting Model (XGBoost) ----
train_matrix <- xgb.DMatrix(
data = as.matrix(train_data_non_linear %>% dplyr::select(-TARGET_AMT)),
label = train_data_non_linear$TARGET_AMT
)
test_matrix <- xgb.DMatrix(
data = as.matrix(test_data_non_linear %>% dplyr::select(-TARGET_AMT))
)
xgb_model <- xgboost(
data = train_matrix,
max_depth = 6,
eta = 0.1,
nrounds = 100,
subsample = 0.8,
colsample_bytree = 0.8,
objective = "reg:squarederror",
verbose = 0
)
xgb_preds <- predict(xgb_model, newdata = test_matrix)
xgb_rmse <- sqrt(mean((xgb_preds - test_data_non_linear$TARGET_AMT)^2))
xgb_r_squared <- cor(xgb_preds, test_data_non_linear$TARGET_AMT)^2
# ---- Support Vector Regression (SVR) ----
svr_model <- svm(
TARGET_AMT ~ .,
data = train_data_non_linear,
type = "eps-regression",
kernel = "radial",
cost = 10,
gamma = 0.1
)
svr_preds <- predict(svr_model, newdata = test_data_non_linear)
svr_rmse <- sqrt(mean((svr_preds - test_data_non_linear$TARGET_AMT)^2))
svr_r_squared <- cor(svr_preds, test_data_non_linear$TARGET_AMT)^2
# ---- Artificial Neural Network (ANN) with Optimal Parameters ----
ann_model <- nnet(
TARGET_AMT ~ .,
data = train_data_non_linear,
size = 12, # Optimal number of neurons
linout = TRUE,
maxit = 250, # Optimal iterations
decay = 0.01 # Optimal weight decay
)
## # weights: 469
## initial value 538.678680
## iter 10 value 49.042287
## iter 20 value 43.801378
## iter 30 value 41.823788
## iter 40 value 40.103452
## iter 50 value 38.394275
## iter 60 value 37.141932
## iter 70 value 36.071657
## iter 80 value 35.376093
## iter 90 value 34.734553
## iter 100 value 34.155525
## iter 110 value 33.519609
## iter 120 value 33.049557
## iter 130 value 32.682916
## iter 140 value 32.400226
## iter 150 value 32.161397
## iter 160 value 31.891359
## iter 170 value 31.529416
## iter 180 value 31.092786
## iter 190 value 30.785310
## iter 200 value 30.509032
## iter 210 value 30.336202
## iter 220 value 30.205704
## iter 230 value 30.099580
## iter 240 value 30.015538
## iter 250 value 29.885509
## final value 29.885509
## stopped after 250 iterations
ann_preds <- predict(ann_model, newdata = test_data_non_linear)
ann_rmse <- sqrt(mean((ann_preds - test_data_non_linear$TARGET_AMT)^2))
ann_r_squared <- cor(ann_preds, test_data_non_linear$TARGET_AMT)^2
cat("Optimal ANN Parameters:\n")
## Optimal ANN Parameters:
print(list(size = 12, maxit = 250, decay = 0.01))
## $size
## [1] 12
##
## $maxit
## [1] 250
##
## $decay
## [1] 0.01
# ---- Multivariate Adaptive Regression Splines (MARS) with Optimal Parameters ----
mars_model <- earth(
TARGET_AMT ~ .,
data = train_data_non_linear,
degree = 2, # Optimal degree
nprune = 10 # Optimal number of terms
)
mars_preds <- predict(mars_model, newdata = test_data_non_linear)
mars_rmse <- sqrt(mean((mars_preds - test_data_non_linear$TARGET_AMT)^2))
mars_r_squared <- cor(mars_preds, test_data_non_linear$TARGET_AMT)^2
cat("Optimal MARS Parameters:\n")
## Optimal MARS Parameters:
print(list(degree = 2, nprune = 10))
## $degree
## [1] 2
##
## $nprune
## [1] 10
# ---- K-Nearest Neighbors (KNN) with Optimal Parameters ----
knn_preds <- knn(
train = as.matrix(train_data_non_linear %>% dplyr::select(-TARGET_AMT)),
test = as.matrix(test_data_non_linear %>% dplyr::select(-TARGET_AMT)),
cl = train_data_non_linear$TARGET_AMT,
k = 7 # Optimal number of neighbors
)
knn_preds <- as.numeric(as.character(knn_preds)) # Convert factor to numeric
knn_rmse <- sqrt(mean((knn_preds - test_data_non_linear$TARGET_AMT)^2))
knn_r_squared <- cor(knn_preds, test_data_non_linear$TARGET_AMT)^2
cat("Optimal KNN Parameters:\n")
## Optimal KNN Parameters:
print(list(k = 7))
## $k
## [1] 7
# ---- Generalized Linear Model (GLM) ----
glm_model <- glm(
TARGET_AMT ~ .,
data = train_data_non_linear,
family = gaussian()
)
glm_preds <- predict(glm_model, newdata = test_data_non_linear)
glm_rmse <- sqrt(mean((glm_preds - test_data_non_linear$TARGET_AMT)^2))
glm_r_squared <- cor(glm_preds, test_data_non_linear$TARGET_AMT)^2
cat("GLM does not have hyperparameters.\n")
## GLM does not have hyperparameters.
# ---- Combine Results into a Data Frame ----
model_results <- data.frame(
Model = c("Decision Tree", "Random Forest", "XGBoost", "SVR", "ANN", "MARS", "KNN", "GLM"),
R_Squared = c(tree_r_squared, rf_r_squared, xgb_r_squared, svr_r_squared, ann_r_squared, mars_r_squared, knn_r_squared, glm_r_squared),
RMSE = c(tree_rmse, rf_rmse, xgb_rmse, svr_rmse, ann_rmse, mars_rmse, knn_rmse, glm_rmse)
)
# Print the results
print(model_results)
## Model R_Squared RMSE
## 1 Decision Tree 0.22287457 5.468704e+03
## 2 Random Forest 0.92295215 2.341788e+03
## 3 XGBoost 0.21911756 8.647294e-02
## 4 SVR 0.63494412 6.022868e-02
## 5 ANN 0.11077688 9.509019e-02
## 6 MARS 0.02674665 9.624123e-02
## 7 KNN 0.04282017 1.223387e-01
## 8 GLM 0.01541150 9.694449e-02
SVR achieves the best predictive performance (R-Squared: 0.9949, RMSE: 0.0112) and is ideal for tasks where accuracy is paramount. However, it lacks interpretability, making it unsuitable for scenarios where understanding variable relationships is critical.
Random Forest offers a balance between predictive accuracy (R-Squared: 0.9660, RMSE: 0.0447) and interpretability through feature importance, making it the preferred choice for tasks requiring strong predictions with some transparency. Stepwise Linear Regression (R-Squared: 0.0153, RMSE: 7496.933) is highly interpretable but has limited predictive power, best suited for exploratory analysis and scenarios emphasizing transparency over accuracy.
Conclusion: Choose SVR for maximum prediction accuracy, Random Forest for a balance of accuracy and interpretability, and Stepwise Linear Regression when understanding variable impacts takes precedence over precise predictions.
# Handle missing TARGET_FLAG for prediction
if (all(is.na(processed_evaluation_data$TARGET_FLAG))) {
cat("TARGET_FLAG is completely NA. Proceeding with prediction-only mode.\n")
# Drop TARGET_FLAG (it won't be used for prediction or evaluation)
stepwise_evaluation_data <- processed_evaluation_data %>%
dplyr::select(
INDEX, KIDSDRIV, HOMEKIDS, YOJ, INCOME, HOME_VAL, TRAVTIME, BLUEBOOK, TIF, OLDCLAIM,
CLM_FREQ, MVR_PTS, PARENT1_Yes, MSTATUS_z_No, SEX_z_F, EDUCATION_Bachelors,
EDUCATION_Masters, EDUCATION_PhD, JOB_Clerical, JOB_HomeMaker, JOB_Lawyer,
JOB_Manager, JOB_Professional, JOB_Student, JOB_z_BlueCollar,
CAR_USE_Private, CAR_TYPE_PanelTruck, CAR_TYPE_Pickup, CAR_TYPE_SportsCar,
CAR_TYPE_Van, CAR_TYPE_z_SUV, REVOKED_Yes, URBANICITY_z_HighlyRuralRural
)
# Perform predictions
predicted_probs <- predict(stepwise_model_blr, newdata = stepwise_evaluation_data, type = "response")
# Convert predicted probabilities into binary classes based on a 0.5 threshold
predicted_classes <- ifelse(predicted_probs > 0.5, "1", "0")
# Prepare the output with INDEX and predicted values
predictions_df <- stepwise_evaluation_data %>%
dplyr::select(INDEX) %>%
dplyr::mutate(Predicted_TARGET_FLAG = predicted_classes)
# Save predictions to a CSV file
output_file <- "binary_logistic_stepwise_predictions.csv"
write.csv(predictions_df, file = output_file, row.names = FALSE)
cat("Binary Logistic Predicted values with INDEX saved to", output_file, "\n")
# Print a few rows of the output
print(head(predictions_df))
} else {
stop("TARGET_FLAG is not fully NA. Please review the dataset.")
}
## TARGET_FLAG is completely NA. Proceeding with prediction-only mode.
## Binary Logistic Predicted values with INDEX saved to binary_logistic_stepwise_predictions.csv
## INDEX Predicted_TARGET_FLAG
## 1 3 0
## 2 9 0
## 3 10 0
## 4 18 1
## 5 21 0
## 6 30 0
# ---- Step 1: Create xgb_evaluation_data ----
# Drop irrelevant columns from processed_evaluation_data
xgb_evaluation_data <- processed_evaluation_data %>%
dplyr::select(-any_of(c("TARGET_FLAG", "TARGET_AMT")))
# ---- Step 2: Check if Training and Evaluation Data are Identical in Structure ----
# Ensure only predictor columns are compared
training_vars_xgb <- colnames(data_prepared %>% dplyr::select(-any_of(c("INDEX", "TARGET_FLAG", "TARGET_AMT"))))
evaluation_vars_xgb <- colnames(xgb_evaluation_data %>% dplyr::select(-any_of(c("INDEX"))))
# Print differences, if any
cat("\nTraining Variables (XGB Binary Classification):\n")
##
## Training Variables (XGB Binary Classification):
print(training_vars_xgb)
## [1] "KIDSDRIV" "AGE"
## [3] "HOMEKIDS" "YOJ"
## [5] "INCOME" "HOME_VAL"
## [7] "TRAVTIME" "BLUEBOOK"
## [9] "TIF" "OLDCLAIM"
## [11] "CLM_FREQ" "MVR_PTS"
## [13] "CAR_AGE" "PARENT1_Yes"
## [15] "MSTATUS_z_No" "SEX_z_F"
## [17] "EDUCATION_Bachelors" "EDUCATION_Masters"
## [19] "EDUCATION_PhD" "EDUCATION_z_HighSchool"
## [21] "JOB_Clerical" "JOB_Doctor"
## [23] "JOB_HomeMaker" "JOB_Lawyer"
## [25] "JOB_Manager" "JOB_Professional"
## [27] "JOB_Student" "JOB_z_BlueCollar"
## [29] "CAR_USE_Private" "CAR_TYPE_PanelTruck"
## [31] "CAR_TYPE_Pickup" "CAR_TYPE_SportsCar"
## [33] "CAR_TYPE_Van" "CAR_TYPE_z_SUV"
## [35] "RED_CAR_yes" "REVOKED_Yes"
## [37] "URBANICITY_z_HighlyRuralRural"
cat("\nEvaluation Variables (XGB Binary Classification):\n")
##
## Evaluation Variables (XGB Binary Classification):
print(evaluation_vars_xgb)
## [1] "KIDSDRIV" "AGE"
## [3] "HOMEKIDS" "YOJ"
## [5] "INCOME" "HOME_VAL"
## [7] "TRAVTIME" "BLUEBOOK"
## [9] "TIF" "OLDCLAIM"
## [11] "CLM_FREQ" "MVR_PTS"
## [13] "CAR_AGE" "PARENT1_Yes"
## [15] "MSTATUS_z_No" "SEX_z_F"
## [17] "EDUCATION_Bachelors" "EDUCATION_Masters"
## [19] "EDUCATION_PhD" "EDUCATION_z_HighSchool"
## [21] "JOB_Clerical" "JOB_Doctor"
## [23] "JOB_HomeMaker" "JOB_Lawyer"
## [25] "JOB_Manager" "JOB_Professional"
## [27] "JOB_Student" "JOB_z_BlueCollar"
## [29] "CAR_USE_Private" "CAR_TYPE_PanelTruck"
## [31] "CAR_TYPE_Pickup" "CAR_TYPE_SportsCar"
## [33] "CAR_TYPE_Van" "CAR_TYPE_z_SUV"
## [35] "RED_CAR_yes" "REVOKED_Yes"
## [37] "URBANICITY_z_HighlyRuralRural"
if (!identical(training_vars_xgb, evaluation_vars_xgb)) {
cat("\nDifferences Found:\n")
# Variables in training but not in evaluation
missing_in_evaluation <- setdiff(training_vars_xgb, evaluation_vars_xgb)
if (length(missing_in_evaluation) > 0) {
cat("Variables in training but not in evaluation:\n")
print(missing_in_evaluation)
}
# Variables in evaluation but not in training
missing_in_training <- setdiff(evaluation_vars_xgb, training_vars_xgb)
if (length(missing_in_training) > 0) {
cat("Variables in evaluation but not in training:\n")
print(missing_in_training)
}
stop("The training and evaluation data do not have identical structures and variable orders.")
} else {
cat("\nThe training and evaluation data have identical structures and variable orders.\n")
}
##
## The training and evaluation data have identical structures and variable orders.
# ---- Step 3: Predict Using the XGB Binary Classification Model ----
# Convert evaluation data to matrix
xgb_evaluation_matrix <- as.matrix(xgb_evaluation_data %>% dplyr::select(-INDEX))
# Make predictions using the XGB model
xgb_predictions_prob <- predict(xgb_bc_model, newdata = xgb_evaluation_matrix)
# Classify based on threshold (e.g., 0.5)
xgb_predictions_class <- ifelse(xgb_predictions_prob >= 0.5, 1, 0)
# ---- Step 4: Prepare Final Output ----
# Combine predictions with INDEX
final_output_xgb <- processed_evaluation_data %>%
dplyr::select(INDEX) %>%
dplyr::mutate(
Predicted_Probability = xgb_predictions_prob,
Predicted_Class = xgb_predictions_class
)
# Print the final output
cat("Final Predictions with INDEX:\n")
## Final Predictions with INDEX:
print(head(final_output_xgb))
## INDEX Predicted_Probability Predicted_Class
## 1 3 0.15785217 0
## 2 9 0.49727950 0
## 3 10 0.03101074 0
## 4 18 0.22074331 0
## 5 21 0.17195618 0
## 6 30 0.08795069 0
# ---- Optional: Save to CSV ----
output_file_xgb <- "xgb_binary_classification_predictions.csv"
write.csv(final_output_xgb, file = output_file_xgb, row.names = FALSE)
cat("Predicted values saved to", output_file_xgb, "\n")
## Predicted values saved to xgb_binary_classification_predictions.csv
The pairwise scatter plot and correlation analysis between the predicted probabilities of the Stepwise Logistic Regression model and the XGBoost Binary Classification model reveal a strong positive relationship, with a correlation coefficient of 0.823. This indicates that the two models are generally consistent in their probability predictions, although some divergence can be observed, particularly in areas with higher probabilities.
The density plots suggest differences in the distribution of probabilities between the models, with XGBoost exhibiting a more pronounced skew. This might suggest that XGBoost is capturing some unique patterns in the data compared to the stepwise model. Further evaluation using performance metrics.
# ---- Step 5: Prepare Data for Pairwise Scatter Plot ----
# Combine predictions from both models into a single data frame
comparison_df <- data.frame(
INDEX = final_output_xgb$INDEX,
Stepwise_Prob = as.numeric(predicted_probs), # From Stepwise Logistic Regression
XGBoost_Prob = xgb_predictions_prob # From XGBoost Binary Classification
)
# ---- Step 6: Generate Pairwise Scatter Plot ----
# Load necessary library
if (!require("GGally")) install.packages("GGally")
## Loading required package: GGally
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(GGally)
# Generate the scatter plot
cat("Generating pairwise scatter plot for model predictions...\n")
## Generating pairwise scatter plot for model predictions...
ggpairs(
data = comparison_df %>% dplyr::select(Stepwise_Prob, XGBoost_Prob),
title = "Pairwise Scatter Plot of Predicted Probabilities",
diag = list(continuous = wrap("densityDiag")),
lower = list(continuous = wrap("smooth")),
upper = list(continuous = wrap("cor", size = 3))
) +
theme_minimal()
# ---- Optional: Save Comparison Data ----
output_comparison_file <- "model_predictions_comparison.csv"
write.csv(comparison_df, file = output_comparison_file, row.names = FALSE)
cat("Comparison data saved to", output_comparison_file, "\n")
## Comparison data saved to model_predictions_comparison.csv
# ---- Step 1: Create custom_evaluation_data ----
# Drop TARGET_FLAG and TARGET_AMT from processed_evaluation_data
custom_evaluation_data <- processed_evaluation_data %>%
dplyr::select(-any_of(c("TARGET_FLAG", "TARGET_AMT")))
# ---- Step 2: Check if Training and Evaluation Data are Identical in Structure ----
# Ensure only predictor columns are compared
training_vars_custom <- colnames(train_data_non_linear %>% dplyr::select(-any_of(c("INDEX", "TARGET_AMT"))))
evaluation_vars_custom <- colnames(custom_evaluation_data %>% dplyr::select(-any_of(c("INDEX"))))
# Verify if the training and evaluation datasets match
if (!identical(training_vars_custom, evaluation_vars_custom)) {
stop("The training and evaluation data do not have identical structures and variable orders.")
} else {
cat("The training and evaluation data have identical structures and variable orders.\n")
}
## The training and evaluation data have identical structures and variable orders.
# ---- Step 3: Predict Using the Custom Model ----
# Make predictions using the custom_model_transformed_scaled model
custom_predictions_scaled <- predict(custom_model_transformed_scaled, newdata = custom_evaluation_data)
# ---- Step 4: Reverse Min-Max Scaling ----
reverse_min_max_scale <- function(scaled_values, min_val, max_val) {
scaled_values * (max_val - min_val) + min_val
}
predictions_boxcox_transformed <- reverse_min_max_scale(
custom_predictions_scaled,
train_min_val_transformed,
train_max_val_transformed
)
# ---- Step 5: Reverse Box-Cox Transformation ----
reverse_boxcox_transform <- function(y, lambda) {
if (lambda == 0) {
exp(y)
} else {
((y * lambda) + 1)^(1 / lambda)
}
}
final_predictions_transformed <- reverse_boxcox_transform(
predictions_boxcox_transformed,
optimal_lambda_transformed
)
# ---- Step 6: Prepare Final Output ----
# Check if INDEX exists in processed_evaluation_data
if ("INDEX" %in% colnames(processed_evaluation_data)) {
# Combine predictions with the INDEX column
final_output_transformed <- processed_evaluation_data %>%
dplyr::select(any_of("INDEX")) %>%
dplyr::mutate(Predicted_TARGET_AMT = final_predictions_transformed)
} else {
# Only include predictions if INDEX is not available
final_output_transformed <- data.frame(
Predicted_TARGET_AMT = final_predictions_transformed
)
}
# Print the final output
cat("Final Predictions with INDEX:\n")
## Final Predictions with INDEX:
print(head(final_output_transformed))
## INDEX Predicted_TARGET_AMT
## 1 3 5132.953
## 2 9 4723.222
## 3 10 3718.583
## 4 18 3929.826
## 5 21 4698.292
## 6 30 5152.383
# ---- Optional: Save to CSV ----
output_file_transformed <- "mlr_custom_model_predictions.csv"
write.csv(final_output_transformed, file = output_file_transformed, row.names = FALSE)
cat("Predicted values saved to", output_file_transformed, "\n")
## Predicted values saved to mlr_custom_model_predictions.csv
# ---- Step 1: Create svr_evaluation_data ----
# Drop TARGET_FLAG and TARGET_AMT from processed_evaluation_data
svr_evaluation_data <- processed_evaluation_data %>%
dplyr::select(-any_of(c("TARGET_FLAG", "TARGET_AMT")))
# ---- Step 2: Check if Training and Evaluation Data are Identical in Structure ----
# Ensure only predictor columns are compared
training_vars <- colnames(train_data_non_linear %>% dplyr::select(-any_of(c("INDEX", "TARGET_AMT"))))
evaluation_vars <- colnames(svr_evaluation_data %>% dplyr::select(-any_of(c("INDEX"))))
# Verify if the training and evaluation datasets match
if (!identical(training_vars, evaluation_vars)) {
stop("The training and evaluation data do not have identical structures and variable orders.")
} else {
cat("The training and evaluation data have identical structures and variable orders.\n")
}
## The training and evaluation data have identical structures and variable orders.
# ---- Step 3: Predict Using the SVR Model ----
# Make predictions using the SVR model
svr_predictions_scaled_non_linear <- predict(svr_model, newdata = svr_evaluation_data)
# ---- Step 4: Reverse Min-Max Scaling ----
reverse_min_max_scale <- function(scaled_values, min_val, max_val) {
scaled_values * (max_val - min_val) + min_val
}
predictions_boxcox_non_linear <- reverse_min_max_scale(
svr_predictions_scaled_non_linear,
min_val_non_linear,
max_val_non_linear
)
# ---- Step 5: Reverse Box-Cox Transformation ----
reverse_boxcox_transform <- function(y, lambda) {
if (lambda == 0) {
exp(y)
} else {
((y * lambda) + 1)^(1 / lambda)
}
}
final_predictions_non_linear <- reverse_boxcox_transform(
predictions_boxcox_non_linear,
optimal_lambda_non_linear
)
# ---- Step 6: Prepare Final Output ----
# Check if INDEX exists in processed_evaluation_data
if ("INDEX" %in% colnames(processed_evaluation_data)) {
# Combine predictions with the INDEX column
final_output_non_linear <- processed_evaluation_data %>%
dplyr::select(any_of("INDEX")) %>%
dplyr::mutate(Predicted_TARGET_AMT = final_predictions_non_linear)
} else {
# Only include predictions if INDEX is not available
final_output_non_linear <- data.frame(
Predicted_TARGET_AMT = final_predictions_non_linear
)
}
# Print the final output
cat("Final Predictions with INDEX:\n")
## Final Predictions with INDEX:
print(head(final_output_non_linear))
## INDEX Predicted_TARGET_AMT
## 1 3 3571.259
## 2 9 3529.115
## 3 10 3418.264
## 4 18 3924.852
## 5 21 4013.145
## 6 30 6376.306
# ---- Optional: Save to CSV ----
output_file_non_linear <- "svr_predictions.csv"
write.csv(final_output_non_linear, file = output_file_non_linear, row.names = FALSE)
cat("Predicted values saved to", output_file_non_linear, "\n")
## Predicted values saved to svr_predictions.csv
# ---- Step 1: Create rfr_evaluation_data ----
# Drop TARGET_FLAG and TARGET_AMT from processed_evaluation_data
rfr_evaluation_data <- processed_evaluation_data %>%
dplyr::select(-any_of(c("TARGET_AMT")))
# ---- Step 2: Check if Training and Evaluation Data are Identical in Structure ----
# Ensure only predictor columns are compared
training_vars_rfr <- colnames(data_prepared %>% dplyr::select(-any_of(c("INDEX", "TARGET_AMT"))))
evaluation_vars_rfr <- colnames(rfr_evaluation_data %>% dplyr::select(-any_of(c("INDEX"))))
# Print training and evaluation variables
cat("\nTraining Variables (RFR):\n")
##
## Training Variables (RFR):
print(training_vars_rfr)
## [1] "KIDSDRIV" "AGE"
## [3] "HOMEKIDS" "YOJ"
## [5] "INCOME" "HOME_VAL"
## [7] "TRAVTIME" "BLUEBOOK"
## [9] "TIF" "OLDCLAIM"
## [11] "CLM_FREQ" "MVR_PTS"
## [13] "CAR_AGE" "PARENT1_Yes"
## [15] "MSTATUS_z_No" "SEX_z_F"
## [17] "EDUCATION_Bachelors" "EDUCATION_Masters"
## [19] "EDUCATION_PhD" "EDUCATION_z_HighSchool"
## [21] "JOB_Clerical" "JOB_Doctor"
## [23] "JOB_HomeMaker" "JOB_Lawyer"
## [25] "JOB_Manager" "JOB_Professional"
## [27] "JOB_Student" "JOB_z_BlueCollar"
## [29] "CAR_USE_Private" "CAR_TYPE_PanelTruck"
## [31] "CAR_TYPE_Pickup" "CAR_TYPE_SportsCar"
## [33] "CAR_TYPE_Van" "CAR_TYPE_z_SUV"
## [35] "RED_CAR_yes" "REVOKED_Yes"
## [37] "URBANICITY_z_HighlyRuralRural" "TARGET_FLAG"
cat("\nEvaluation Variables (RFR):\n")
##
## Evaluation Variables (RFR):
print(evaluation_vars_rfr)
## [1] "TARGET_FLAG" "KIDSDRIV"
## [3] "AGE" "HOMEKIDS"
## [5] "YOJ" "INCOME"
## [7] "HOME_VAL" "TRAVTIME"
## [9] "BLUEBOOK" "TIF"
## [11] "OLDCLAIM" "CLM_FREQ"
## [13] "MVR_PTS" "CAR_AGE"
## [15] "PARENT1_Yes" "MSTATUS_z_No"
## [17] "SEX_z_F" "EDUCATION_Bachelors"
## [19] "EDUCATION_Masters" "EDUCATION_PhD"
## [21] "EDUCATION_z_HighSchool" "JOB_Clerical"
## [23] "JOB_Doctor" "JOB_HomeMaker"
## [25] "JOB_Lawyer" "JOB_Manager"
## [27] "JOB_Professional" "JOB_Student"
## [29] "JOB_z_BlueCollar" "CAR_USE_Private"
## [31] "CAR_TYPE_PanelTruck" "CAR_TYPE_Pickup"
## [33] "CAR_TYPE_SportsCar" "CAR_TYPE_Van"
## [35] "CAR_TYPE_z_SUV" "RED_CAR_yes"
## [37] "REVOKED_Yes" "URBANICITY_z_HighlyRuralRural"
# ---- Step 3: Compare the Structures and Print Differences ----
if (!identical(training_vars_rfr, evaluation_vars_rfr)) {
cat("\nDifferences Found:\n")
# Variables in training but not in evaluation
missing_in_evaluation <- setdiff(training_vars_rfr, evaluation_vars_rfr)
if (length(missing_in_evaluation) > 0) {
cat("Variables in training but not in evaluation:\n")
print(missing_in_evaluation)
}
# Variables in evaluation but not in training
missing_in_training <- setdiff(evaluation_vars_rfr, training_vars_rfr)
if (length(missing_in_training) > 0) {
cat("Variables in evaluation but not in training:\n")
print(missing_in_training)
}
} else {
cat("\nThe training and evaluation data have identical structures and variable orders.\n")
}
##
## Differences Found:
# ---- Step 3: Predict Using the RFR Model ----
# Make predictions using the rfr_model
rfr_predictions <- predict(rfr_model, data = rfr_evaluation_data)$predictions
# ---- Step 4: Prepare Final Output ----
# Check if INDEX exists in processed_evaluation_data
if ("INDEX" %in% colnames(processed_evaluation_data)) {
# Combine predictions with the INDEX column
final_output_rfr <- processed_evaluation_data %>%
dplyr::select(any_of("INDEX")) %>%
dplyr::mutate(Predicted_TARGET_AMT = rfr_predictions)
} else {
# Only include predictions if INDEX is not available
final_output_rfr <- data.frame(
Predicted_TARGET_AMT = rfr_predictions
)
}
# Print the final output
cat("Final Predictions with INDEX:\n")
## Final Predictions with INDEX:
print(head(final_output_rfr))
## INDEX Predicted_TARGET_AMT
## 1 3 267.2494
## 2 9 572.7645
## 3 10 113.5184
## 4 18 159.0546
## 5 21 326.9725
## 6 30 354.4321
# ---- Optional: Save to CSV ----
output_file_rfr <- "rfr_model_predictions.csv"
write.csv(final_output_rfr, file = output_file_rfr, row.names = FALSE)
cat("Predicted values saved to", output_file_rfr, "\n")
## Predicted values saved to rfr_model_predictions.csv
The heading of the plot is:
“Pairwise Scatter Plot of Model Predictions”
This plot compares the predicted values from three models: Multiple Linear Regression (MLR), Support Vector Regression (SVR), and Random Forest Regression (RFR). The pairwise correlations between the predictions are indicated as follows: - The correlation between MLR and SVR predictions is very low (0.028), indicating almost no linear relationship. - The correlation between MLR and RFR predictions is slightly stronger (0.169), suggesting a weak linear relationship. - The correlation between SVR and RFR predictions is moderate (0.122), showing a slightly better but still weak relationship.
This suggests that the models make distinct predictions, possibly reflecting different assumptions or fits to the data. RFR predictions align slightly better with the other models than SVR does.
# ---- Step 1: Prepare Prediction Data Frames ----
# Ensure predictions from all models have an INDEX column and align them
mlr_predictions <- final_output_transformed %>%
dplyr::rename(MLR_Pred = Predicted_TARGET_AMT)
if (!"INDEX" %in% colnames(final_output_non_linear)) {
stop("INDEX column is missing in SVR predictions.")
}
svr_predictions <- final_output_non_linear %>%
dplyr::rename(SVR_Pred = Predicted_TARGET_AMT)
if (!"INDEX" %in% colnames(final_output_rfr)) {
stop("INDEX column is missing in RFR predictions.")
}
rfr_predictions <- final_output_rfr %>%
dplyr::rename(RFR_Pred = Predicted_TARGET_AMT)
# ---- Step 2: Combine Predictions ----
# Combine predictions into a single data frame for comparison
comparison_df <- mlr_predictions %>%
dplyr::left_join(svr_predictions, by = "INDEX") %>%
dplyr::left_join(rfr_predictions, by = "INDEX")
# Check if combination was successful
if (nrow(comparison_df) == 0) {
stop("Failed to combine predictions. Ensure all data frames have the INDEX column and proper alignment.")
}
# ---- Step 3: Pairwise Scatter Plot ----
library(GGally)
# Generate pairwise scatter plot
ggpairs(
comparison_df %>% dplyr::select(MLR_Pred, SVR_Pred, RFR_Pred),
title = "Pairwise Scatter Plot of Model Predictions"
) +
theme_minimal() +
scale_color_viridis_d()