Data 621 Homework 4

Introduction

In this homework assignment, we will explore, analyze, and model a data set containing approximately 8,000 records, each representing a customer of an auto insurance company. Each record has two response variables. The first response variable, TARGET_FLAG, is a 1 or a 0: a “1” means that the person was in a car crash, and a “0” means that the person was not. The second response variable is TARGET_AMT. This value is zero if the person did not crash their car; if they did, it is a value greater than zero.

The objective is to build multiple linear regression and binary logistic regression models on the training data to predict the probability that a person will crash their car and, if they do, the amount of money it will cost. We can only use the variables given to us (or variables derived from them).

We begin by loading the training and evaluation data sets:

data_train <- read.csv("https://raw.githubusercontent.com/Meccamarshall/Data621/refs/heads/main/Homework4/insurance_training_data.csv", header = TRUE)
data_test <- read.csv("https://raw.githubusercontent.com/Meccamarshall/Data621/refs/heads/main/Homework4/insurance-evaluation-data.csv", header = TRUE)
head(data_train)
head(data_test)

Data Exploration

The dataset consists of 26 variables and 8161 observations, with the AGE, YOJ, and CAR_AGE variables containing some missing values. As stated previously, TARGET_FLAG and TARGET_AMT are our response variables. Fourteen of the variables are read in as character (discrete) values and the remaining twelve are numeric, although several of the character variables (e.g., INCOME, HOME_VAL, BLUEBOOK, OLDCLAIM) are actually continuous.

skim(data_train)
Data summary
Name data_train
Number of rows 8161
Number of columns 26
_______________________
Column type frequency:
character 14
numeric 12
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
INCOME 0 1 0 8 445 6613 0
PARENT1 0 1 2 3 0 2 0
HOME_VAL 0 1 0 8 464 5107 0
MSTATUS 0 1 3 4 0 2 0
SEX 0 1 1 3 0 2 0
EDUCATION 0 1 3 13 0 5 0
JOB 0 1 0 13 526 9 0
CAR_USE 0 1 7 10 0 2 0
BLUEBOOK 0 1 6 7 0 2789 0
CAR_TYPE 0 1 3 11 0 6 0
RED_CAR 0 1 2 3 0 2 0
OLDCLAIM 0 1 2 7 0 2857 0
REVOKED 0 1 2 3 0 2 0
URBANICITY 0 1 19 21 0 2 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
INDEX 0 1.00 5151.87 2978.89 1 2559 5133 7745 10302.0 ▇▇▇▇▇
TARGET_FLAG 0 1.00 0.26 0.44 0 0 0 1 1.0 ▇▁▁▁▃
TARGET_AMT 0 1.00 1504.32 4704.03 0 0 0 1036 107586.1 ▇▁▁▁▁
KIDSDRIV 0 1.00 0.17 0.51 0 0 0 0 4.0 ▇▁▁▁▁
AGE 6 1.00 44.79 8.63 16 39 45 51 81.0 ▁▆▇▂▁
HOMEKIDS 0 1.00 0.72 1.12 0 0 0 1 5.0 ▇▂▁▁▁
YOJ 454 0.94 10.50 4.09 0 9 11 13 23.0 ▂▃▇▃▁
TRAVTIME 0 1.00 33.49 15.91 5 22 33 44 142.0 ▇▇▁▁▁
TIF 0 1.00 5.35 4.15 1 1 4 7 25.0 ▇▆▁▁▁
CLM_FREQ 0 1.00 0.80 1.16 0 0 0 2 5.0 ▇▂▁▁▁
MVR_PTS 0 1.00 1.70 2.15 0 0 1 3 13.0 ▇▂▁▁▁
CAR_AGE 510 0.94 8.33 5.70 -3 1 8 12 28.0 ▆▇▇▃▁
data_train %>% summarize_all(~ sum(is.na(.)) / length(.))

Data Processing


Fix formatting

The currency symbols present in some values may disrupt our analysis, so we need to reformat these values accordingly.

strip_dollars <- function(x){
  x <- as.character(x)
  x <- gsub(",", "", x)
  x <- gsub("\\$", "", x)
  as.numeric(x)
}

fix_formatting <- function(messy_df){
  # strip_dollars() is vectorized, so no rowwise() is needed
  messy_df %>%
    mutate(INCOME = strip_dollars(INCOME),
           HOME_VAL = strip_dollars(HOME_VAL),
           BLUEBOOK = strip_dollars(BLUEBOOK),
           OLDCLAIM = strip_dollars(OLDCLAIM))
}

Fix data types

We observed that some variables categorized as discrete actually have a high number of unique values. Upon closer inspection of the variable descriptions, we found that although these variables are encoded as factors, they are indeed continuous. Additionally, the TARGET_FLAG variable is listed as numeric in the summary but should be a binary factor. We’ll now correct these data types accordingly.

# After fix_formatting() these columns are already numeric; the explicit
# conversion is kept as a safeguard
fix_data_types <- function(messy_df){
  messy_df %>%
    mutate(INCOME = as.numeric(INCOME),
           HOME_VAL = as.numeric(HOME_VAL),
           BLUEBOOK = as.numeric(BLUEBOOK),
           OLDCLAIM = as.numeric(OLDCLAIM))
}

data_train$TARGET_FLAG <- factor(data_train$TARGET_FLAG)

Fix bad and missing values

Additionally, some values appear to be invalid (e.g., -3 in CAR_AGE), which we recode as missing. Since the share of missing values in each affected variable (AGE, YOJ, CAR_AGE, INCOME, HOME_VAL) is under 5%, we replace them with the median. We calculate the medians on the training set only and apply them to both the training and testing sets to prevent data leakage.

na_bad_values <- function(messy_df){
  messy_df %>%
    mutate(CAR_AGE = ifelse(CAR_AGE < 0, NA, CAR_AGE))
}

# Impute with medians taken from ref_df (defaults to the data itself; we pass
# the cleaned training set when processing the test set to avoid leakage)
fix_missing <- function(df, ref_df = df) {
  impute_vars <- c("CAR_AGE", "YOJ", "AGE", "INCOME", "HOME_VAL")
  for (v in impute_vars) {
    med <- median(ref_df[[v]], na.rm = TRUE)
    df[[v]] <- ifelse(is.na(df[[v]]), med, df[[v]])
  }
  df
}

Process data

We apply the processing steps above to both the training and testing datasets; the test set borrows the training-set medians for imputation.

data_train <- data_train %>%
  fix_formatting() %>%
  fix_data_types() %>%
  na_bad_values() %>%
  fix_missing()
data_test <- data_test %>%
  fix_formatting() %>%
  fix_data_types() %>%
  na_bad_values() %>%
  fix_missing(ref_df = data_train)

Univariate charts

We proceed to examine the distributions of the numeric variables across the two TARGET_FLAG classes. Notably, BLUEBOOK, INCOME, and OLDCLAIM contain a higher number of outliers relative to the other variables. We also observe that older customers and those with older vehicles, higher home values, or higher incomes are generally involved in fewer car accidents. Conversely, individuals with motor vehicle record points or a high number of prior claims tend to have more accidents.

plot_vars <- c("TARGET_FLAG", names(keep(data_train, is.numeric)))

data_train[plot_vars] %>%
  dplyr::select(-INDEX, -TARGET_AMT) %>%
  gather(variable, value, -TARGET_FLAG) %>%
  ggplot(., aes(TARGET_FLAG, value, color=TARGET_FLAG)) + 
  geom_boxplot() +
  scale_color_brewer(palette="Set1") +
  theme_light() +
  theme(legend.position = "none") +
  facet_wrap(~variable, scales ="free", ncol = 4) +
  labs(x = element_blank(), y = element_blank())

data_train %>% 
  dplyr::select(-TARGET_FLAG, -TARGET_AMT, -INDEX) %>%
  keep(is.numeric) %>%
  gather() %>% 
  ggplot(aes(x= value)) + 
  geom_histogram(fill='pink') + 
  facet_wrap(~key, scales = 'free')

The variables displayed below (OLDCLAIM, INCOME, BLUEBOOK, HOME_VAL) need scale transformations, and several variables have a high number of zeros. AGE is the only variable that is approximately normally distributed; the rest show some skewness. We will perform a Box-Cox transformation on these variables (a sketch follows the histograms below).

data_train %>% 
  dplyr::select(OLDCLAIM, INCOME, BLUEBOOK, HOME_VAL) %>%
  gather() %>% 
  ggplot(aes(x= value)) + 
  geom_histogram(fill='pink') + 
  facet_wrap(~key, scales = 'free')
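
Since the Box-Cox transformation requires strictly positive values and these variables contain zeros, one way the step could look is to shift by 1 before estimating lambda. This is a minimal sketch, assuming the caret package is available; the transformed variables are illustrative and are not used in the models below.

library(caret)

boxcox_shifted <- function(x) {
  bc <- BoxCoxTrans(x + 1)  # estimate lambda on the shifted, strictly positive values
  predict(bc, x + 1)        # apply the fitted transformation
}

data_train %>%
  dplyr::select(OLDCLAIM, INCOME, BLUEBOOK, HOME_VAL) %>%
  mutate(across(everything(), boxcox_shifted)) %>%
  gather() %>%
  ggplot(aes(x = value)) +
  geom_histogram(fill = 'pink') +
  facet_wrap(~key, scales = 'free')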

data_train %>%
  keep(is.numeric) %>%
  gather(variable, value, -TARGET_AMT, -INDEX, -CLM_FREQ, -MVR_PTS) %>%
  ggplot(., aes(value, TARGET_AMT)) + 
  geom_point() +
  scale_color_brewer(palette="Set1") +
  theme_light() +
  theme(legend.position = "none") +
  facet_wrap(~variable, scales ="free", ncol = 3) +
  labs(x = element_blank(), y = element_blank())

Correlation

We observe that CLM_FREQ, MVR_PTS, and HOME_VAL are the most positively correlated variables with our response variable, TARGET_AMT, indicating that higher claim frequency, motor vehicle record points, and home values are associated with higher target amounts. The remaining variables have relatively weak correlations with the response. Additionally, we notice a moderate negative correlation between HOMEKIDS and INCOME, suggesting that higher-income households tend to have fewer children at home.

corr_dataframe <- data_train %>%
    mutate_if(is.factor, as.numeric) %>%
    mutate_if(is.character, ~ as.numeric(as.factor(.))) %>%  # encode categories as integer codes
    select_if(is.numeric)

q <- cor(corr_dataframe, use = "pairwise.complete.obs")
ggcorrplot(q, type = "upper", outline.color = "white",
           ggtheme = theme_classic(),
           colors = c("#6D9EC1", "white", "#E46726"),
           lab = TRUE, show.legend = FALSE, tl.cex = 8, lab_size = 3)

Centrality Measures and Outliers

set.seed(42)
accidents <- data_train %>%
  filter(TARGET_FLAG == 1)

ggplot(accidents, aes(x=TARGET_AMT)) + 
  geom_density(fill='pink') +
  theme_light() +
  geom_vline(aes(xintercept = mean(TARGET_AMT)), lty=2, col="red") +
  geom_label(aes(x=25000, y=0.00015, label=paste("mean =", round(mean(TARGET_AMT),0)))) +
  geom_vline(aes(xintercept = median(TARGET_AMT)), lty=2, col="darkgreen") +
  geom_label(aes(x=25000, y=0.00010, label=paste("median = ", round(median(TARGET_AMT), 0)))) +
  labs(title="TARGET_AMT Density Plot", y="Density", x="TARGET_AMT")

The distribution of TARGET_AMT shows a long right tail. Among customers who crashed, the mean payout is $5,702 and the median payout is $4,104, confirming that the distribution is skewed to the right. As expected, both the mean and median are higher for observations classified as outliers. Here, we treat values above $10,594 as outliers; this cutoff is the smallest value flagged by the standard boxplot rule (beyond Q3 + 1.5 × IQR), computed below.

outlier <- min(boxplot(data_train[data_train$TARGET_FLAG==1,]$TARGET_AMT, plot=FALSE)$out)
data_train %>%
  mutate(TARGET_AMT_OUTLIER = ifelse(TARGET_AMT >= outlier, "Yes", "No")) %>%
  group_by(TARGET_AMT_OUTLIER) %>%
  summarise(Mean = mean(TARGET_AMT),
            Median = median(TARGET_AMT)) 
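
For reference, a minimal check of where that cutoff comes from, using the accidents subset defined above:

q3 <- quantile(accidents$TARGET_AMT, 0.75)
upper_fence <- q3 + 1.5 * IQR(accidents$TARGET_AMT)
upper_fence  # boxplot upper fence; $10,594 is the smallest observation beyond it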

Data Preparation

Sampling

table(data_train$TARGET_FLAG)

   0    1 
6008 2153 

There is an imbalance in the TARGET_FLAG variable. Let’s check the class distribution:

prop.table(table(data_train$TARGET_FLAG))

        0         1 
0.7361843 0.2638157 

Only 26% of the customers in the data have had an accident, while 74% have not. This is a severely imbalanced data set; left untreated, it would distort the accuracy score in the model-building step.

To treat this imbalance, we oversample the minority class.

set.seed(42)
minority <- nrow(data_train[data_train$TARGET_FLAG == 1,])
majority <- nrow(data_train[data_train$TARGET_FLAG == 0,])
diff <- majority - minority
minority_index <- data_train[data_train$TARGET_FLAG == 1,]$INDEX
# Sample minority-class INDEX values with replacement, merge to duplicate
# those rows, and append them to the original data to balance the classes
over_sample_train <- data.frame(INDEX = sample(minority_index, diff, TRUE)) %>%
  merge(data_train, .) %>%
  bind_rows(data_train)

data_train_balanced <- over_sample_train

Check the class balance again:

table(over_sample_train$TARGET_FLAG)

   0    1 
6008 6008 
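
An equivalent, more concise alternative is caret’s upSample(); a sketch, assuming the caret package is available (we keep the manual version above):

library(caret)
up_train <- upSample(x = data_train %>% dplyr::select(-TARGET_FLAG),
                     y = data_train$TARGET_FLAG,
                     yname = "TARGET_FLAG")
table(up_train$TARGET_FLAG)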

Model Building - Logit Models

Our goal is to predict both TARGET_FLAG and TARGET_AMT. Given that TARGET_FLAG is a discrete response variable, it should be modeled using logistic regression to estimate the probability of an individual being involved in an accident.
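
For reference, the logistic model relates the predictors to the log-odds of a crash, log(p / (1 − p)) = β₀ + β₁x₁ + ⋯ + βₖxₖ with p = P(TARGET_FLAG = 1); the fitted probabilities are then classified with the 0.5 threshold used in the scoring function below.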

# Initialize a df that will store the metrics of models
models.df <- tibble(id=character(), res.deviance=numeric(), null.deviance=numeric(),
                 aic=numeric(), accuracy=numeric(), sensitivity=numeric(),
                 specificity=numeric())
# A function to extract the relevant metrics from the summary and confusion matrix
score_model <- function(id, model, data, output=FALSE) {
  if (output) print(summary(model))
  glm.probs <- predict(model, type="response")
  # Classify using a 0.5 probability threshold
  glm.pred <- ifelse(glm.probs > 0.5, 1, 0)
  results <- tibble(target=data$TARGET_FLAG, pred=glm.pred)
  results <- results %>%
    mutate(pred.class = as.factor(pred), target.class = as.factor(target))
  
  cm <- confusionMatrix(results$pred.class, results$target.class, positive = "1")
  if (output) print(cm)
  
  acc <- cm$overall['Accuracy']
  sens <- cm$byClass['Sensitivity']
  spec <- cm$byClass['Specificity']
  res.deviance <- model$deviance
  null.deviance <- model$null.deviance  
  aic <- model$aic
  metrics <- list(res.deviance=res.deviance, null.deviance=null.deviance,aic=aic, accuracy=acc, sensitivity=sens, specificity=spec)
  metrics <- lapply(metrics, round, 3)
  
  if (output) plot(roc(results$target.class,glm.probs), print.auc = TRUE)
  model.df <- tibble(id=id, res.deviance=metrics$res.deviance, null.deviance=metrics$null.deviance, 
                         aic=metrics$aic, accuracy=metrics$accuracy, sensitivity=metrics$sensitivity, specificity=metrics$specificity)
  model.list <- list(model=model, df_info=model.df)
  return(model.list)
}

Model 1 A&B: Logit Models

We create three types of models: null, full, and reduced. The null model includes only the intercept, representing the simplest form. The full model includes all predictors, serving as the most complex model. The reduced model is developed by stepping between these two bounds in both directions, adding or dropping predictors according to AIC.

mod1data <- data_train %>% dplyr::select(-c('TARGET_AMT','INDEX'))
#mod1data <- data_train_balanced %>% select(-c('TARGET_AMT','INDEX'))

model.null <- glm(TARGET_FLAG ~ 1,
                 data=mod1data,
                 family = binomial(link="logit")
                 )

model.full <- glm(TARGET_FLAG ~ .,
                 data=mod1data,
                 family = binomial(link="logit")
                 )
    
model.reduced <- step(model.null,
              scope = list(upper=model.full),
             direction="both",
             test="Chisq",
             trace=0,
             data=mod1data)

m1a <- score_model('model.full', model.full, mod1data, output = TRUE)

Call:
glm(formula = TARGET_FLAG ~ ., family = binomial(link = "logit"), 
    data = mod1data)

Coefficients:
                                  Estimate Std. Error z value Pr(>|z|)    
(Intercept)                     -9.286e-01  3.215e-01  -2.889 0.003869 ** 
KIDSDRIV                         3.862e-01  6.122e-02   6.308 2.82e-10 ***
AGE                             -1.015e-03  4.020e-03  -0.252 0.800672    
HOMEKIDS                         4.965e-02  3.713e-02   1.337 0.181119    
YOJ                             -1.105e-02  8.582e-03  -1.288 0.197743    
INCOME                          -3.423e-06  1.081e-06  -3.165 0.001551 ** 
PARENT1Yes                       3.820e-01  1.096e-01   3.485 0.000492 ***
HOME_VAL                        -1.306e-06  3.420e-07  -3.819 0.000134 ***
MSTATUSz_No                      4.938e-01  8.357e-02   5.909 3.45e-09 ***
SEXz_F                          -8.251e-02  1.120e-01  -0.737 0.461416    
EDUCATIONBachelors              -3.812e-01  1.157e-01  -3.296 0.000981 ***
EDUCATIONMasters                -2.903e-01  1.788e-01  -1.624 0.104397    
EDUCATIONPhD                    -1.677e-01  2.140e-01  -0.784 0.433295    
EDUCATIONz_High School           1.764e-02  9.506e-02   0.186 0.852802    
JOBClerical                      4.107e-01  1.967e-01   2.088 0.036763 *  
JOBDoctor                       -4.458e-01  2.671e-01  -1.669 0.095106 .  
JOBHome Maker                    2.323e-01  2.102e-01   1.106 0.268915    
JOBLawyer                        1.049e-01  1.695e-01   0.619 0.535958    
JOBManager                      -5.572e-01  1.716e-01  -3.248 0.001161 ** 
JOBProfessional                  1.619e-01  1.784e-01   0.907 0.364168    
JOBStudent                       2.161e-01  2.145e-01   1.007 0.313729    
JOBz_Blue Collar                 3.106e-01  1.856e-01   1.674 0.094158 .  
TRAVTIME                         1.457e-02  1.883e-03   7.736 1.03e-14 ***
CAR_USEPrivate                  -7.564e-01  9.172e-02  -8.247  < 2e-16 ***
BLUEBOOK                        -2.084e-05  5.263e-06  -3.959 7.52e-05 ***
TIF                             -5.547e-02  7.344e-03  -7.553 4.26e-14 ***
CAR_TYPEPanel Truck              5.607e-01  1.618e-01   3.466 0.000528 ***
CAR_TYPEPickup                   5.540e-01  1.007e-01   5.500 3.80e-08 ***
CAR_TYPESports Car               1.025e+00  1.299e-01   7.893 2.95e-15 ***
CAR_TYPEVan                      6.186e-01  1.265e-01   4.891 1.00e-06 ***
CAR_TYPEz_SUV                    7.682e-01  1.113e-01   6.904 5.05e-12 ***
RED_CARyes                      -9.728e-03  8.636e-02  -0.113 0.910313    
OLDCLAIM                        -1.389e-05  3.910e-06  -3.554 0.000380 ***
CLM_FREQ                         1.959e-01  2.855e-02   6.864 6.69e-12 ***
REVOKEDYes                       8.874e-01  9.133e-02   9.716  < 2e-16 ***
MVR_PTS                          1.133e-01  1.361e-02   8.324  < 2e-16 ***
CAR_AGE                         -7.196e-04  7.549e-03  -0.095 0.924053    
URBANICITYz_Highly Rural/ Rural -2.390e+00  1.128e-01 -21.181  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 9418.0  on 8160  degrees of freedom
Residual deviance: 7297.6  on 8123  degrees of freedom
AIC: 7373.6

Number of Fisher Scoring iterations: 5

Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0 5551 1235
         1  457  918
                                          
               Accuracy : 0.7927          
                 95% CI : (0.7837, 0.8014)
    No Information Rate : 0.7362          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.3963          
                                          
 Mcnemar's Test P-Value : < 2.2e-16       
                                          
            Sensitivity : 0.4264          
            Specificity : 0.9239          
         Pos Pred Value : 0.6676          
         Neg Pred Value : 0.8180          
             Prevalence : 0.2638          
         Detection Rate : 0.1125          
   Detection Prevalence : 0.1685          
      Balanced Accuracy : 0.6752          
                                          
       'Positive' Class : 1               
                                          

m1a$df_info
models.df <- rbind(models.df,m1a$df_info)
m1b <- score_model('model.reduced', model.reduced, mod1data, output = TRUE)

Call:
glm(formula = TARGET_FLAG ~ URBANICITY + JOB + MVR_PTS + MSTATUS + 
    CAR_TYPE + REVOKED + KIDSDRIV + CAR_USE + TIF + TRAVTIME + 
    INCOME + CLM_FREQ + BLUEBOOK + PARENT1 + EDUCATION + HOME_VAL + 
    OLDCLAIM, family = binomial(link = "logit"), data = mod1data)

Coefficients:
                                  Estimate Std. Error z value Pr(>|z|)    
(Intercept)                     -1.053e+00  2.559e-01  -4.117 3.84e-05 ***
URBANICITYz_Highly Rural/ Rural -2.389e+00  1.128e-01 -21.181  < 2e-16 ***
JOBClerical                      4.141e-01  1.965e-01   2.107 0.035104 *  
JOBDoctor                       -4.475e-01  2.667e-01  -1.678 0.093344 .  
JOBHome Maker                    2.748e-01  2.042e-01   1.346 0.178234    
JOBLawyer                        9.715e-02  1.692e-01   0.574 0.565740    
JOBManager                      -5.649e-01  1.714e-01  -3.296 0.000980 ***
JOBProfessional                  1.548e-01  1.784e-01   0.868 0.385304    
JOBStudent                       2.751e-01  2.109e-01   1.304 0.192066    
JOBz_Blue Collar                 3.098e-01  1.855e-01   1.670 0.094879 .  
MVR_PTS                          1.143e-01  1.359e-02   8.412  < 2e-16 ***
MSTATUSz_No                      4.719e-01  7.955e-02   5.932 2.99e-09 ***
CAR_TYPEPanel Truck              6.090e-01  1.509e-01   4.035 5.46e-05 ***
CAR_TYPEPickup                   5.503e-01  1.006e-01   5.469 4.53e-08 ***
CAR_TYPESports Car               9.726e-01  1.074e-01   9.054  < 2e-16 ***
CAR_TYPEVan                      6.466e-01  1.221e-01   5.295 1.19e-07 ***
CAR_TYPEz_SUV                    7.156e-01  8.596e-02   8.324  < 2e-16 ***
REVOKEDYes                       8.927e-01  9.123e-02   9.785  < 2e-16 ***
KIDSDRIV                         4.176e-01  5.512e-02   7.576 3.57e-14 ***
CAR_USEPrivate                  -7.574e-01  9.161e-02  -8.268  < 2e-16 ***
TIF                             -5.538e-02  7.340e-03  -7.545 4.53e-14 ***
TRAVTIME                         1.448e-02  1.881e-03   7.699 1.37e-14 ***
INCOME                          -3.486e-06  1.076e-06  -3.239 0.001199 ** 
CLM_FREQ                         1.963e-01  2.852e-02   6.882 5.91e-12 ***
BLUEBOOK                        -2.308e-05  4.719e-06  -4.891 1.00e-06 ***
PARENT1Yes                       4.602e-01  9.427e-02   4.882 1.05e-06 ***
EDUCATIONBachelors              -3.868e-01  1.089e-01  -3.554 0.000380 ***
EDUCATIONMasters                -3.032e-01  1.615e-01  -1.878 0.060385 .  
EDUCATIONPhD                    -1.818e-01  2.002e-01  -0.908 0.363825    
EDUCATIONz_High School           1.487e-02  9.469e-02   0.157 0.875229    
HOME_VAL                        -1.342e-06  3.407e-07  -3.939 8.18e-05 ***
OLDCLAIM                        -1.405e-05  3.907e-06  -3.595 0.000324 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 9418.0  on 8160  degrees of freedom
Residual deviance: 7301.8  on 8129  degrees of freedom
AIC: 7365.8

Number of Fisher Scoring iterations: 5

Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0 5558 1245
         1  450  908
                                          
               Accuracy : 0.7923          
                 95% CI : (0.7833, 0.8011)
    No Information Rate : 0.7362          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.3934          
                                          
 Mcnemar's Test P-Value : < 2.2e-16       
                                          
            Sensitivity : 0.4217          
            Specificity : 0.9251          
         Pos Pred Value : 0.6686          
         Neg Pred Value : 0.8170          
             Prevalence : 0.2638          
         Detection Rate : 0.1113          
   Detection Prevalence : 0.1664          
      Balanced Accuracy : 0.6734          
                                          
       'Positive' Class : 1               
                                          

m1b$df_info
models.df <- rbind(models.df,m1b$df_info)
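
Because the coefficients above are on the log-odds scale, exponentiating them gives odds ratios, which are easier to interpret; for instance, exp(0.8927) ≈ 2.44 for REVOKEDYes means roughly 2.4 times the odds of a crash for drivers with a revoked license, holding the other predictors fixed.

# Odds ratios for the reduced model (exponentiated coefficients)
round(exp(coef(model.reduced)), 3)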

We calculate McFadden’s pseudo R-squared, 1 − logLik(model)/logLik(null), for the logistic regression models and observe that the difference between the full model and the reduced model is minimal.

full_model_r2 <- round(1 - logLik(model.full) / logLik(model.null), 4)
reduced_model_r2 <- round(1 - logLik(model.reduced) / logLik(model.null), 4)
paste0('Full model = ', full_model_r2)
[1] "Full model = 0.2251"
paste0('Reduced model = ', reduced_model_r2)
[1] "Reduced model = 0.2247"

Multiple linear regression model

mod2data <- data_train %>% dplyr::select(-c('TARGET_FLAG','INDEX'))
mod2data <- na.omit(mod2data)

Model stepwise

model.null <- lm(TARGET_AMT ~ 1, data = mod2data)
model.full <- lm(TARGET_AMT ~ ., data = mod2data)

set.seed(100)
model.stepwise <- step(model.null, 
                       scope = list(lower = model.null, upper = model.full), 
                       direction = "both", 
                       trace = FALSE)
summary(model.stepwise)

Call:
lm(formula = TARGET_AMT ~ MVR_PTS + URBANICITY + JOB + MSTATUS + 
    CAR_TYPE + KIDSDRIV + CAR_USE + TIF + TRAVTIME + PARENT1 + 
    INCOME + REVOKED + CAR_AGE + CLM_FREQ + SEX + BLUEBOOK + 
    OLDCLAIM, data = mod2data)

Residuals:
   Min     1Q Median     3Q    Max 
 -5826  -1689   -763    343 103714 

Coefficients:
                                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)                      1.245e+03  3.821e+02   3.260 0.001120 ** 
MVR_PTS                          1.759e+02  2.588e+01   6.797 1.14e-11 ***
URBANICITYz_Highly Rural/ Rural -1.660e+03  1.393e+02 -11.916  < 2e-16 ***
JOBClerical                      3.376e+02  2.950e+02   1.145 0.252389    
JOBDoctor                       -3.534e+02  3.767e+02  -0.938 0.348266    
JOBHome Maker                    1.998e+02  3.348e+02   0.597 0.550603    
JOBLawyer                        1.416e+02  2.872e+02   0.493 0.622105    
JOBManager                      -6.899e+02  2.678e+02  -2.576 0.010002 *  
JOBProfessional                  1.473e+02  2.677e+02   0.550 0.582146    
JOBStudent                       1.704e+02  3.259e+02   0.523 0.601163    
JOBz_Blue Collar                 2.898e+02  2.686e+02   1.079 0.280669    
MSTATUSz_No                      6.009e+02  1.193e+02   5.035 4.88e-07 ***
CAR_TYPEPanel Truck              3.038e+02  2.750e+02   1.105 0.269248    
CAR_TYPEPickup                   4.025e+02  1.693e+02   2.377 0.017471 *  
CAR_TYPESports Car               1.038e+03  2.162e+02   4.798 1.63e-06 ***
CAR_TYPEVan                      5.327e+02  2.120e+02   2.512 0.012014 *  
CAR_TYPEz_SUV                    7.583e+02  1.784e+02   4.250 2.16e-05 ***
KIDSDRIV                         3.759e+02  1.021e+02   3.682 0.000232 ***
CAR_USEPrivate                  -7.302e+02  1.568e+02  -4.657 3.26e-06 ***
TIF                             -4.751e+01  1.217e+01  -3.904 9.55e-05 ***
TRAVTIME                         1.181e+01  3.220e+00   3.667 0.000247 ***
PARENT1Yes                       6.438e+02  1.767e+02   3.644 0.000270 ***
INCOME                          -4.822e-03  1.562e-03  -3.086 0.002033 ** 
REVOKEDYes                       5.558e+02  1.734e+02   3.205 0.001354 ** 
CAR_AGE                         -2.733e+01  1.123e+01  -2.433 0.014978 *  
CLM_FREQ                         1.438e+02  5.497e+01   2.616 0.008908 ** 
SEXz_F                          -3.316e+02  1.608e+02  -2.062 0.039242 *  
BLUEBOOK                         1.463e-02  8.521e-03   1.717 0.086015 .  
OLDCLAIM                        -1.054e-02  7.432e-03  -1.419 0.156047    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4544 on 8132 degrees of freedom
Multiple R-squared:  0.07011,   Adjusted R-squared:  0.06691 
F-statistic:  21.9 on 28 and 8132 DF,  p-value: < 2.2e-16

Model with manually selected predictors

The variables CLM_FREQ, MVR_PTS, HOME_VAL, and INCOME were chosen for the regression model because they show moderate to strong correlations with TARGET_AMT in the correlation heatmap, suggesting they may be influential predictors of claim amount. CLM_FREQ and MVR_PTS are positively correlated with TARGET_AMT, indicating that a higher claim frequency and more driving points may relate to higher claims. HOME_VAL and INCOME also exhibit a relationship with TARGET_AMT, implying that home values and incomes could correspond to changes in claim amounts. Additionally, the scatter plots and histograms for these variables highlight some variability and range, which may contribute to predicting TARGET_AMT effectively in a linear regression model.

# Create a model with manually selected predictors
model.manual <- lm(TARGET_AMT ~ CLM_FREQ + MVR_PTS + HOME_VAL + INCOME  , data = mod2data)
summary(model.manual)

Call:
lm(formula = TARGET_AMT ~ CLM_FREQ + MVR_PTS + HOME_VAL + INCOME, 
    data = mod2data)

Residuals:
   Min     1Q Median     3Q    Max 
 -4253  -1522   -980   -236 104399 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.334e+03  1.053e+02  12.670  < 2e-16 ***
CLM_FREQ     2.788e+02  4.839e+01   5.761 8.65e-09 ***
MVR_PTS      2.295e+02  2.609e+01   8.798  < 2e-16 ***
HOME_VAL    -2.295e-03  4.893e-04  -4.689 2.79e-06 ***
INCOME      -1.385e-03  1.321e-03  -1.048    0.295    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4638 on 8156 degrees of freedom
Multiple R-squared:  0.02831,   Adjusted R-squared:  0.02783 
F-statistic: 59.41 on 4 and 8156 DF,  p-value: < 2.2e-16

Model with variable transformation

The INCOME_SQ and AGE_INCOME transformations were chosen to better capture the relationships observed in the graphs. The histogram for INCOME is right-skewed, with a concentration of lower values and a long tail, suggesting that a simple linear effect may not fully capture its influence on TARGET_AMT. By adding INCOME_SQ, we allow the model to account for potential nonlinear effects, where higher income levels might impact claim amounts differently. Additionally, the interaction term AGE_INCOME was included to explore the combined effect of age and income on claim amounts, as age may modify the influence of income, which could help capture more complex patterns seen in the data.

mod2data <- mod2data %>%
  mutate(INCOME_SQ = INCOME^2, AGE_INCOME = AGE * INCOME)

model.interaction <- lm(TARGET_AMT ~ CLM_FREQ + MVR_PTS + HOME_VAL + INCOME + INCOME_SQ + AGE_INCOME , data = mod2data)
summary(model.interaction)

Call:
lm(formula = TARGET_AMT ~ CLM_FREQ + MVR_PTS + HOME_VAL + INCOME + 
    INCOME_SQ + AGE_INCOME, data = mod2data)

Residuals:
   Min     1Q Median     3Q    Max 
 -4312  -1524   -981   -228 104410 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.332e+03  1.267e+02  10.508  < 2e-16 ***
CLM_FREQ     2.790e+02  4.840e+01   5.765 8.47e-09 ***
MVR_PTS      2.292e+02  2.611e+01   8.780  < 2e-16 ***
HOME_VAL    -2.277e-03  4.947e-04  -4.602 4.25e-06 ***
INCOME      -1.861e-04  4.760e-03  -0.039    0.969    
INCOME_SQ    4.198e-10  1.425e-08   0.029    0.977    
AGE_INCOME  -2.710e-05  8.766e-05  -0.309    0.757    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4639 on 8154 degrees of freedom
Multiple R-squared:  0.02832,   Adjusted R-squared:  0.02761 
F-statistic: 39.61 on 6 and 8154 DF,  p-value: < 2.2e-16

The stepwise model shows the lowest MSE (20,574,015) and RMSE (4,535.86), while the manual and interaction models have nearly identical MSE and RMSE values (around 21,498,780 and 4,636.68), indicating similar predictive power. The R-squared values are quite low across all models, ranging from 0.028 to 0.070, suggesting that the models explain very little of the variance in the target variable.

evaluate_model <- function(model, data) {
  # Predictions
  predictions <- predict(model, newdata = data)
  
  # Metrics
  actual <- data$TARGET_AMT
  mse <- mean((predictions - actual)^2)
  rmse <- sqrt(mse)
  r_squared <- summary(model)$r.squared
  adj_r_squared <- summary(model)$adj.r.squared
  
  # Return metrics
  tibble(
    Model = deparse(substitute(model)),
    MSE = round(mse, 2),
    RMSE = round(rmse, 2),
    R_Squared = round(r_squared, 3),
    Adj_R_Squared = round(adj_r_squared, 3)
  )
}

# Evaluate all models
results <- bind_rows(
  evaluate_model(model.stepwise, mod2data),
  evaluate_model(model.manual, mod2data),
  evaluate_model(model.interaction, mod2data)
)

print(results)
# A tibble: 3 × 5
  Model                    MSE  RMSE R_Squared Adj_R_Squared
  <chr>                  <dbl> <dbl>     <dbl>         <dbl>
1 model.stepwise     20574015. 4536.     0.07          0.067
2 model.manual       21498780. 4637.     0.028         0.028
3 model.interaction  21498527. 4637.     0.028         0.028

Diagnostics

We examine the reduced model for any irregularities and potential violations of assumptions. The relationships between the continuous predictors and the logit of the predicted probability appear mostly linear.

library(broom)
# Select only numeric predictors and predict
numdata <- mod1data %>%
  dplyr::select_if(is.numeric)
predictors <- colnames(numdata)
probabilities <- predict(model.reduced, type = "response")
# Bind the logit and tidying the data for plot
numdata <- numdata %>%
  mutate(logit = log(probabilities/(1-probabilities))) %>%
  gather(key = "predictors", value = "predictor.value", -logit)

ggplot(numdata, aes(logit, predictor.value))+
  geom_point(size = 0.5, alpha = 0.5) +
  geom_smooth(method = "loess") + 
  theme_bw() + 
  facet_wrap(~predictors, scales = "free_y")

Multiple linear regression model diagnostics

All three sets of plots include the following diagnostics: residuals versus fitted values, histograms of residuals, Q-Q plots, and residuals versus leverage. For all models, the actual versus predicted plots show a high concentration of points near zero, suggesting possible issues with model fit, particularly at larger values. The residuals versus fitted plots and histograms indicate skewness, as most residuals are concentrated near zero with some outliers. The Q-Q plots reveal heavy tails in the residual distribution, indicating deviations from normality. Additionally, in the three residuals versus leverage plots, a few influential points appear. Overall, all three models seem to struggle with fitting certain aspects of the data, with potential issues in normality and residual spread.

visualize_models <- function(models, data, target_col) {
  par(mfrow = c(2, 2)) 
  
  for (i in seq_along(models)) {
    model <- models[[i]]
    model_name <- names(models)[i]
    
  
    # 1. Actual vs. predicted scatter
    actual <- data[[target_col]]
    predicted <- predict(model, newdata = data)
    plot(actual, predicted,
         xlab = "Actual Values", 
         ylab = "Predicted Values",
         main = paste("Actual vs Predicted:", model_name),
         col = "blue", pch = 16)
    abline(0, 1, col = "red", lwd = 2)
    
   
    # 2. Residuals vs. fitted values
    residuals <- residuals(model)
    plot(predicted, residuals,
         xlab = "Fitted Values", 
         ylab = "Residuals",
         main = paste("Residuals vs Fitted:", model_name),
         col = "purple", pch = 16)
    abline(h = 0, col = "red", lwd = 2)
    
    
    # 3. Histogram of residuals
    hist(residuals,
         breaks = 20,
         main = paste("Histogram of Residuals:", model_name),
         xlab = "Residuals",
         col = "lightblue")
    
    
    # 4. Base-R lm diagnostics (incl. Q-Q and residuals vs. leverage)
    plot(model, main = paste("Diagnostics:", model_name))
  }
  
  par(mfrow = c(1, 1)) 
}


models <- list(
  Manual = model.manual,
  Stepwise = model.stepwise,
  Interaction = model.interaction
)

visualize_models(models, data = mod2data, target_col = "TARGET_AMT")

Model selection

Binary logistic model

The reduced model is the better choice: it achieves essentially the same McFadden pseudo R² as the full model (0.2247 vs. 0.2251), has a slightly lower AIC (7365.8 vs. 7373.6), and offers comparable accuracy (0.7923 vs. 0.7927). Although the full model has a marginally higher sensitivity, the reduced model performs equally well or better in terms of specificity, making it a more efficient and effective model overall.

Multiple linear regression

Based on the evaluation metrics, the stepwise model appears to perform slightly better than the manual and interaction models, with the lowest MSE and RMSE values. However, all models have relatively low R-squared values, indicating that they do not explain much of the variance in the target variable, which suggests the presence of unmodeled factors or insufficient predictors. Given these results, the stepwise model may be the most reliable choice due to its better fit.
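
For completeness, the analogous information-criterion check for the three linear models (lower is better):

AIC(model.stepwise, model.manual, model.interaction)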

Predictions

Car crash predictions (data_test)

TargetFlag_pred <- data_test %>% dplyr::select(-c('TARGET_AMT', 'INDEX'))
predictions_prob <- predict(model.reduced, newdata = TargetFlag_pred, type = "response")


predictions_class <- ifelse(predictions_prob > 0.5, 1, 0)


TargetFlag_pred$predictions <- predictions_class


predicted_counts <- table(TargetFlag_pred$predictions)

# Print the table
print(predicted_counts)

   0    1 
1781  360 

TARGET_AMT predictions (data_test)

TargetAmt_pred <- data_test %>% dplyr::select(-c('TARGET_FLAG','INDEX'))
predictions <- predict(model.stepwise, newdata = TargetAmt_pred )


TargetAmt_pred$predictions <- predictions


head(TargetAmt_pred[, c('TARGET_AMT','predictions')])
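
Finally, a sketch of how the two sets of predictions could be combined and exported; the output file name is an assumption, and TARGET_AMT is zeroed out where no crash is predicted, consistent with the definition of the response.

# Combine classification and amount predictions for the evaluation set
predictions_out <- data_test %>%
  dplyr::select(INDEX) %>%
  mutate(TARGET_FLAG = predictions_class,
         TARGET_AMT = ifelse(predictions_class == 1, predictions, 0))
write.csv(predictions_out, "insurance_predictions.csv", row.names = FALSE)  # hypothetical file name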