Data 621 Homework 4

Introduction

In this assignment, we will explore, analyze, and model a data set containing approximately 8000 records, each representing a customer of an auto insurance company. Each record has two response variables. The first, TARGET_FLAG, is binary: a "1" indicates that the customer was in a car crash, while a "0" indicates that they were not. The second response variable, TARGET_AMT, is 0 if the customer did not crash their car; if they did, it is the cost of the crash, a value greater than 0.

The objective is to build multiple linear regression and binary logistic regression models on the training data to predict whether a customer will crash their car and to predict the cost in the case of a crash. We will only use the variables given to us (or variables that we derive from the variables provided).

We begin by loading the training and evaluation data sets:

data_train <- read.csv("https://raw.githubusercontent.com/salma71/Data_621/master/HW_4/insurance_training_data.csv", header = TRUE)
data_test <- read.csv("https://raw.githubusercontent.com/salma71/Data_621/master/HW_4/insurance-evaluation-data.csv", header = TRUE)
data_train

Data Exploration

The dataset consists of 26 variables and 8161 observations, with the AGE, YOJ, and CAR_AGE variables containing some missing values. As stated previously, TARGET_FLAG and TARGET_AMT are our response variables. Fourteen of the variables are read in as discrete factors and the remaining twelve as numeric.

skim(data_train)
Data summary
Name data_train
Number of rows 8161
Number of columns 26
_______________________
Column type frequency:
factor 14
numeric 12
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
INCOME 0 1 FALSE 6613 $0: 615, emp: 445, $26: 4, $48: 4
PARENT1 0 1 FALSE 2 No: 7084, Yes: 1077
HOME_VAL 0 1 FALSE 5107 $0: 2294, emp: 464, $11: 3, $11: 3
MSTATUS 0 1 FALSE 2 Yes: 4894, z_N: 3267
SEX 0 1 FALSE 2 z_F: 4375, M: 3786
EDUCATION 0 1 FALSE 5 z_H: 2330, Bac: 2242, Mas: 1658, <Hi: 1203
JOB 0 1 FALSE 9 z_B: 1825, Cle: 1271, Pro: 1117, Man: 988
CAR_USE 0 1 FALSE 2 Pri: 5132, Com: 3029
BLUEBOOK 0 1 FALSE 2789 $1,: 157, $6,: 34, $5,: 33, $6,: 33
CAR_TYPE 0 1 FALSE 6 z_S: 2294, Min: 2145, Pic: 1389, Spo: 907
RED_CAR 0 1 FALSE 2 no: 5783, yes: 2378
OLDCLAIM 0 1 FALSE 2857 $0: 5009, $1,: 4, $1,: 4, $4,: 4
REVOKED 0 1 FALSE 2 No: 7161, Yes: 1000
URBANICITY 0 1 FALSE 2 Hig: 6492, z_H: 1669

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
INDEX 0 1.00 5151.87 2978.89 1 2559 5133 7745 10302.0 ▇▇▇▇▇
TARGET_FLAG 0 1.00 0.26 0.44 0 0 0 1 1.0 ▇▁▁▁▃
TARGET_AMT 0 1.00 1504.32 4704.03 0 0 0 1036 107586.1 ▇▁▁▁▁
KIDSDRIV 0 1.00 0.17 0.51 0 0 0 0 4.0 ▇▁▁▁▁
AGE 6 1.00 44.79 8.63 16 39 45 51 81.0 ▁▆▇▂▁
HOMEKIDS 0 1.00 0.72 1.12 0 0 0 1 5.0 ▇▂▁▁▁
YOJ 454 0.94 10.50 4.09 0 9 11 13 23.0 ▂▃▇▃▁
TRAVTIME 0 1.00 33.49 15.91 5 22 33 44 142.0 ▇▇▁▁▁
TIF 0 1.00 5.35 4.15 1 1 4 7 25.0 ▇▆▁▁▁
CLM_FREQ 0 1.00 0.80 1.16 0 0 0 2 5.0 ▇▂▁▁▁
MVR_PTS 0 1.00 1.70 2.15 0 0 1 3 13.0 ▇▂▁▁▁
CAR_AGE 510 0.94 8.33 5.70 -3 1 8 12 28.0 ▆▇▇▃▁
# Proportion of missing values in each column
data_train %>% summarize_all(~ sum(is.na(.)) / length(.))

Data Processing


Fix formatting

The currency notation found in some values will interfere with our analysis, so we need to reformat those values appropriately.

strip_dollars <- function(x){
  x <- as.character(x)
  x <- gsub(",", "", x)
  x <- gsub("\\$", "", x)
  as.numeric(x)
}

fix_formatting <- function(messy_df){
  messy_df %>%
    rowwise() %>%
    mutate(INCOME = strip_dollars(INCOME),
           HOME_VAL = strip_dollars(HOME_VAL),
           BLUEBOOK = strip_dollars(BLUEBOOK),
           OLDCLAIM = strip_dollars(OLDCLAIM)) %>%
    ungroup()
}
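
A quick sanity check of the helper on an illustrative value (not taken from the data):

# The helper should drop the currency formatting and return a number
strip_dollars("$26,500")   # returns 26500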

Fix data types

We noticed that a few variables listed as discrete have large numbers of unique values. A closer inspection of the variable descriptions reveals that while these variables are encoded as factors, they are actually continuous. The TARGET_FLAG variable also appears in the summary as a numeric variable, but it should be a binary factor. We proceed to fix these data types.

fix_data_types <- function(messy_df){
  messy_df %>%
    rowwise() %>%
    mutate(INCOME = as.numeric(INCOME),
           HOME_VAL = as.numeric(HOME_VAL),
           BLUEBOOK = as.numeric(BLUEBOOK),
           OLDCLAIM = as.numeric(OLDCLAIM)) %>%
    ungroup()
}

data_train$TARGET_FLAG <- factor(data_train$TARGET_FLAG)

Fix bad and missing values

There are also some values that seem invalid (e.g., a CAR_AGE of -3), which we convert to missing. Since the proportion of missing values in each affected variable is below 5%, we can replace them with the median. We compute the medians on the training set only and impute them into both the training and testing sets to avoid data leakage.

na_bad_values <- function(messy_df){
  messy_df %>%
    rowwise() %>%
    mutate(CAR_AGE = ifelse(CAR_AGE < 0, NA, CAR_AGE))%>%
    ungroup()
}

fix_missing <- function(df, ref_df = df) {
  # Impute with medians from ref_df (pass the training set as ref_df when
  # imputing the test set, so no test information leaks into the medians)
  for (v in c("CAR_AGE", "YOJ", "AGE", "INCOME", "HOME_VAL")) {
    df[[v]] <- ifelse(is.na(df[[v]]), median(ref_df[[v]], na.rm = TRUE), df[[v]])
  }
  df
}

Process data

We apply the processing steps above to both the training and testing datasets.

data_train <- data_train %>%
  fix_formatting() %>%
  fix_data_types() %>%
  na_bad_values() %>%
  fix_missing()
data_test <- data_test %>%
  fix_formatting() %>%
  fix_data_types() %>%
  na_bad_values() %>%
  fix_missing(ref_df = data_train)
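
As a quick check that the imputation succeeded, every column of the training set should now report zero missing values:

# Confirm that no missing values remain after processing
colSums(is.na(data_train))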

Univariate charts

We now explore the distribution of TARGET_FLAG across the numeric variables. We see that BLUEBOOK, INCOME, and OLDCLAIM have a high number of outliers compared to the other variables. We also see that customers who are older, or who have older cars, higher home values, or higher incomes, tend to get into fewer car crashes. However, people with motor vehicle record points or a high number of old claims tend to get into more accidents.

plot_vars <- c("TARGET_FLAG", names(keep(data_train, is.numeric)))

data_train[plot_vars] %>%
  dplyr::select(-INDEX, -TARGET_AMT) %>%
  gather(variable, value, -TARGET_FLAG) %>%
  ggplot(., aes(TARGET_FLAG, value, color=TARGET_FLAG)) + 
  geom_boxplot() +
  scale_color_brewer(palette="Set1") +
  theme_light() +
  theme(legend.position = "none") +
  facet_wrap(~variable, scales ="free", ncol = 4) +
  labs(x = element_blank(), y = element_blank())

data_train %>% 
  dplyr::select(-TARGET_FLAG, -TARGET_AMT, -INDEX) %>%
  keep(is.numeric) %>%
  gather() %>% 
  ggplot(aes(x= value)) + 
  geom_histogram(fill='pink') + 
  facet_wrap(~key, scales = 'free')

The variables displayed below (OLDCLAIM, INCOME, BLUEBOOK, HOME_VAL) need scale transformations, and several of them have a high number of zeros. AGE is the only variable that is approximately normally distributed; the rest show some skewness. We will perform a Box-Cox transformation on these variables (a sketch of the transformation follows the histograms below).

data_train %>% 
  dplyr::select(OLDCLAIM, INCOME, BLUEBOOK, HOME_VAL) %>%
  gather() %>% 
  ggplot(aes(x= value)) + 
  geom_histogram(fill='pink') + 
  facet_wrap(~key, scales = 'free')
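
As a minimal sketch of the Box-Cox step, assuming the caret package is loaded (its confusionMatrix() and createDataPartition() are used later in this report); the offset of 1 is needed because Box-Cox requires strictly positive values and these variables contain zeros:

# Estimate a Box-Cox lambda for INCOME and apply the transformation (sketch)
bc <- BoxCoxTrans(data_train$INCOME + 1)
income_bc <- predict(bc, data_train$INCOME + 1)
hist(income_bc, main = "INCOME after Box-Cox", xlab = "transformed INCOME")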

data_train %>%
  keep(is.numeric) %>%
  gather(variable, value, -TARGET_AMT, -INDEX, -CLM_FREQ, -MVR_PTS) %>%
  ggplot(., aes(value, TARGET_AMT)) + 
  geom_point() +
  scale_color_brewer(palette="Set1") +
  theme_light() +
  theme(legend.position = "none") +
  facet_wrap(~variable, scales ="free", ncol = 3) +
  labs(x = element_blank(), y = element_blank())

Correlation

We see that MVR_PTS, CLM_FREQ, and OLDCLAIM are the variables most positively correlated with our response variables, whereas URBANICITY is the most negatively correlated. The rest of the variables are weakly correlated.

corr_dataframe = data_train %>%
    mutate_if(is.factor, as.numeric)
q <- cor(corr_dataframe)
ggcorrplot(q, type = "upper", outline.color = "white",
           ggtheme = theme_classic,
           colors = c("#6D9EC1", "white", "#E46726"),
           lab = TRUE, show.legend = FALSE, tl.cex = 8, lab_size = 3) 

Centrality Measures and Outliers

set.seed(42)
accidents <- data_train %>%
  filter(TARGET_FLAG == 1)

ggplot(accidents, aes(x=TARGET_AMT)) + 
  geom_density(fill='pink') +
  theme_light() +
  geom_vline(aes(xintercept = mean(TARGET_AMT)), lty=2, col="red") +
  geom_label(aes(x=25000, y=0.00015, label=paste("mean =", round(mean(TARGET_AMT),0)))) +
  geom_vline(aes(xintercept = median(TARGET_AMT)), lty=2, col="darkgreen") +
  geom_label(aes(x=25000, y=0.00010, label=paste("median = ", round(median(TARGET_AMT), 0)))) +
  labs(title="TARGET_AMT Density Plot", y="Density", x="TARGET_AMT")

As previously noted, this distribution has a long tail. The mean payout is $5616 and the median is $4102. The mean and median are, of course, higher for the observations we classify as outliers. The outlier cutoff point is $10594.

outlier <- min(boxplot(data_train[data_train$TARGET_FLAG==1,]$TARGET_AMT, plot=FALSE)$out)
data_train %>%
  mutate(TARGET_AMT_OUTLIER = ifelse(TARGET_AMT >= outlier, "Yes", "No")) %>%
  group_by(TARGET_AMT_OUTLIER) %>%
  summarise(Mean = mean(TARGET_AMT),
            Median = median(TARGET_AMT)) 
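
The cutoff comes from the standard boxplot rule; as a rough check (approximate, since boxplot() uses hinges rather than type-7 quantiles), it is about Q3 + 1.5 * IQR of the claim amounts among crashes:

# Approximate the boxplot outlier threshold directly from the quartiles
q <- quantile(accidents$TARGET_AMT, c(0.25, 0.75))
unname(q[2] + 1.5 * (q[2] - q[1]))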

Data Preparation

Sampling

table(data_train$TARGET_FLAG)

   0    1 
6008 2153 

There is an imbalance in the TARGET_FLAG variable. Let's check the class distribution:

prop.table(table(data_train$TARGET_FLAG))

        0         1 
0.7361843 0.2638157 

Only 26% of the customers in the data have had an accident, while 74% have not. This is a severely imbalanced data set, and if left untreated it would inflate the accuracy score in the model building step.
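
To see why, note that a naive classifier that always predicts "no crash" already achieves roughly 74% accuracy (matching the proportion above) while catching none of the crashes:

# Baseline accuracy of the majority-class ("never crashes") classifier
mean(data_train$TARGET_FLAG == 0)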

To treat this imbalance, we oversample the minority class:

set.seed(42)
minority <- nrow(data_train[data_train$TARGET_FLAG == 1,])
majority <- nrow(data_train[data_train$TARGET_FLAG == 0,])
diff <- majority - minority
minority_index <- data_train[data_train$TARGET_FLAG == 1,]$INDEX
# Sample minority-class INDEX values (with replacement) to close the gap,
# then append the duplicated rows to the original training data
over_sample_train <- data.frame(INDEX = sample(minority_index, diff, TRUE)) %>%
  merge(data_train, .) %>%
  bind_rows(data_train)

data_train_balanced <- over_sample_train

We check the balance again:

table(over_sample_train$TARGET_FLAG)

   0    1 
6008 6008 

Model Building - Logit Models

Our objective is to predict both TARGET_FLAG and TARGET_AMT. TARGET_FLAG is a binary response variable, and for that reason it should be modeled using logistic regression, which estimates the probability that an individual will be in an accident.
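
As a reminder of the model form, here is a minimal sketch with hypothetical coefficient values (not estimates from this data) showing how the logit link maps a linear predictor to a probability:

# Logistic regression models the log-odds as a linear function of predictors:
#   log(p / (1 - p)) = b0 + b1*x1 + ...   so that   p = 1 / (1 + exp(-eta))
eta <- -0.9 + 0.4 * 2   # hypothetical intercept and KIDSDRIV effect, 2 teen drivers
1 / (1 + exp(-eta))     # implied crash probability, about 0.475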

# Initialize a df that will store the metrics of models
models.df <- tibble(id=character(), res.deviance=numeric(), null.deviance=numeric(),
                    aic=numeric(), accuracy=numeric(), sensitivity=numeric(),
                    specificity=numeric())
# A function to extract the relevant metrics from the summary and confusion matrix
score_model <- function(id, model, data, output=FALSE) {
  if (output) print(summary(model))
  glm.probs <- predict(model, type="response")
  # Classify using a 0.5 probability threshold
  glm.pred <- ifelse(glm.probs > 0.5, 1, 0)
  results <- tibble(target=data$TARGET_FLAG, pred=glm.pred) %>%
    mutate(pred.class = as.factor(pred), target.class = as.factor(target))
  
  # Compute the confusion matrix once and pull the metrics from it
  cm <- confusionMatrix(results$pred.class, results$target.class, positive = "1")
  if (output) print(cm)
  
  metrics <- list(res.deviance = model$deviance,
                  null.deviance = model$null.deviance,
                  aic = model$aic,
                  accuracy = cm$overall['Accuracy'],
                  sensitivity = cm$byClass['Sensitivity'],
                  specificity = cm$byClass['Specificity'])
  metrics <- lapply(metrics, round, 3)
  
  if (output) plot(roc(results$target.class, glm.probs), print.auc = TRUE)
  model.df <- tibble(id=id, res.deviance=metrics$res.deviance, null.deviance=metrics$null.deviance,
                     aic=metrics$aic, accuracy=metrics$accuracy, sensitivity=metrics$sensitivity,
                     specificity=metrics$specificity)
  model.list <- list(model=model, df_info=model.df)
  return(model.list)
}

Model 1 A&B: Logit Models

We construct null, full, and reduced models. The null model contains only the intercept and forms the lower bound of complexity. The full model contains all predictors and forms the upper bound, the maximum-complexity model. The reduced model is obtained by stepping iteratively through the predictors between these two bounds until no single addition or removal improves the model.

mod1data <- data_train %>% dplyr::select(-c('TARGET_AMT','INDEX'))
#mod1data <- data_train_balanced %>% select(-c('TARGET_AMT','INDEX'))

model.null <- glm(TARGET_FLAG ~ 1,
                 data=mod1data,
                 family = binomial(link="logit")
                 )

model.full <- glm(TARGET_FLAG ~ .,
                 data=mod1data,
                 family = binomial(link="logit")
                 )
    
model.reduced <- step(model.null,
              scope = list(upper=model.full),
             direction="both",
             test="Chisq",
             trace=0,
             data=mod1data)

m1a <- score_model('model.full', model.full, mod1data, output = TRUE)

Call:
glm(formula = TARGET_FLAG ~ ., family = binomial(link = "logit"), 
    data = mod1data)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.5846  -0.7127  -0.3983   0.6261   3.1524  

Coefficients:
                                  Estimate Std. Error z value Pr(>|z|)    
(Intercept)                     -9.286e-01  3.215e-01  -2.889 0.003869 ** 
KIDSDRIV                         3.862e-01  6.122e-02   6.308 2.82e-10 ***
AGE                             -1.015e-03  4.020e-03  -0.252 0.800672    
HOMEKIDS                         4.965e-02  3.713e-02   1.337 0.181119    
YOJ                             -1.105e-02  8.582e-03  -1.288 0.197743    
INCOME                          -3.423e-06  1.081e-06  -3.165 0.001551 ** 
PARENT1Yes                       3.820e-01  1.096e-01   3.485 0.000492 ***
HOME_VAL                        -1.306e-06  3.420e-07  -3.819 0.000134 ***
MSTATUSz_No                      4.938e-01  8.357e-02   5.909 3.45e-09 ***
SEXz_F                          -8.251e-02  1.120e-01  -0.737 0.461416    
EDUCATIONBachelors              -3.812e-01  1.157e-01  -3.296 0.000981 ***
EDUCATIONMasters                -2.903e-01  1.788e-01  -1.624 0.104397    
EDUCATIONPhD                    -1.677e-01  2.140e-01  -0.784 0.433295    
EDUCATIONz_High School           1.764e-02  9.506e-02   0.186 0.852802    
JOBClerical                      4.107e-01  1.967e-01   2.088 0.036763 *  
JOBDoctor                       -4.458e-01  2.671e-01  -1.669 0.095106 .  
JOBHome Maker                    2.323e-01  2.102e-01   1.106 0.268915    
JOBLawyer                        1.049e-01  1.695e-01   0.619 0.535958    
JOBManager                      -5.572e-01  1.716e-01  -3.248 0.001161 ** 
JOBProfessional                  1.619e-01  1.784e-01   0.907 0.364168    
JOBStudent                       2.161e-01  2.145e-01   1.007 0.313729    
JOBz_Blue Collar                 3.106e-01  1.856e-01   1.674 0.094158 .  
TRAVTIME                         1.457e-02  1.883e-03   7.736 1.03e-14 ***
CAR_USEPrivate                  -7.564e-01  9.172e-02  -8.247  < 2e-16 ***
BLUEBOOK                        -2.084e-05  5.263e-06  -3.959 7.52e-05 ***
TIF                             -5.547e-02  7.344e-03  -7.553 4.26e-14 ***
CAR_TYPEPanel Truck              5.607e-01  1.618e-01   3.466 0.000528 ***
CAR_TYPEPickup                   5.540e-01  1.007e-01   5.500 3.80e-08 ***
CAR_TYPESports Car               1.025e+00  1.299e-01   7.893 2.95e-15 ***
CAR_TYPEVan                      6.186e-01  1.265e-01   4.891 1.00e-06 ***
CAR_TYPEz_SUV                    7.682e-01  1.113e-01   6.904 5.05e-12 ***
RED_CARyes                      -9.728e-03  8.636e-02  -0.113 0.910313    
OLDCLAIM                        -1.389e-05  3.910e-06  -3.554 0.000380 ***
CLM_FREQ                         1.959e-01  2.855e-02   6.864 6.69e-12 ***
REVOKEDYes                       8.874e-01  9.133e-02   9.716  < 2e-16 ***
MVR_PTS                          1.133e-01  1.361e-02   8.324  < 2e-16 ***
CAR_AGE                         -7.196e-04  7.549e-03  -0.095 0.924053    
URBANICITYz_Highly Rural/ Rural -2.390e+00  1.128e-01 -21.181  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 9418.0  on 8160  degrees of freedom
Residual deviance: 7297.6  on 8123  degrees of freedom
AIC: 7373.6

Number of Fisher Scoring iterations: 5

Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0 5551 1235
         1  457  918
                                          
               Accuracy : 0.7927          
                 95% CI : (0.7837, 0.8014)
    No Information Rate : 0.7362          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.3963          
                                          
 Mcnemar's Test P-Value : < 2.2e-16       
                                          
            Sensitivity : 0.4264          
            Specificity : 0.9239          
         Pos Pred Value : 0.6676          
         Neg Pred Value : 0.8180          
             Prevalence : 0.2638          
         Detection Rate : 0.1125          
   Detection Prevalence : 0.1685          
      Balanced Accuracy : 0.6752          
                                          
       'Positive' Class : 1               
                                          

m1a$df_info
models.df <- rbind(models.df,m1a$df_info)

The summary output of the reduced model retains a number of statistically significant predictors.

m1b <- score_model('model.reduced', model.reduced, mod1data, output = TRUE)

Call:
glm(formula = TARGET_FLAG ~ URBANICITY + JOB + MVR_PTS + MSTATUS + 
    CAR_TYPE + REVOKED + KIDSDRIV + CAR_USE + TIF + TRAVTIME + 
    INCOME + CLM_FREQ + BLUEBOOK + PARENT1 + EDUCATION + HOME_VAL + 
    OLDCLAIM, family = binomial(link = "logit"), data = mod1data)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.6039  -0.7115  -0.3979   0.6268   3.1440  

Coefficients:
                                  Estimate Std. Error z value Pr(>|z|)    
(Intercept)                     -1.053e+00  2.559e-01  -4.117 3.84e-05 ***
URBANICITYz_Highly Rural/ Rural -2.389e+00  1.128e-01 -21.181  < 2e-16 ***
JOBClerical                      4.141e-01  1.965e-01   2.107 0.035104 *  
JOBDoctor                       -4.475e-01  2.667e-01  -1.678 0.093344 .  
JOBHome Maker                    2.748e-01  2.042e-01   1.346 0.178234    
JOBLawyer                        9.715e-02  1.692e-01   0.574 0.565740    
JOBManager                      -5.649e-01  1.714e-01  -3.296 0.000980 ***
JOBProfessional                  1.548e-01  1.784e-01   0.868 0.385304    
JOBStudent                       2.751e-01  2.109e-01   1.304 0.192066    
JOBz_Blue Collar                 3.098e-01  1.855e-01   1.670 0.094879 .  
MVR_PTS                          1.143e-01  1.359e-02   8.412  < 2e-16 ***
MSTATUSz_No                      4.719e-01  7.955e-02   5.932 2.99e-09 ***
CAR_TYPEPanel Truck              6.090e-01  1.509e-01   4.035 5.46e-05 ***
CAR_TYPEPickup                   5.503e-01  1.006e-01   5.469 4.53e-08 ***
CAR_TYPESports Car               9.726e-01  1.074e-01   9.054  < 2e-16 ***
CAR_TYPEVan                      6.466e-01  1.221e-01   5.295 1.19e-07 ***
CAR_TYPEz_SUV                    7.156e-01  8.596e-02   8.324  < 2e-16 ***
REVOKEDYes                       8.927e-01  9.123e-02   9.785  < 2e-16 ***
KIDSDRIV                         4.176e-01  5.512e-02   7.576 3.57e-14 ***
CAR_USEPrivate                  -7.574e-01  9.161e-02  -8.268  < 2e-16 ***
TIF                             -5.538e-02  7.340e-03  -7.545 4.53e-14 ***
TRAVTIME                         1.448e-02  1.881e-03   7.699 1.37e-14 ***
INCOME                          -3.486e-06  1.076e-06  -3.239 0.001199 ** 
CLM_FREQ                         1.963e-01  2.852e-02   6.882 5.91e-12 ***
BLUEBOOK                        -2.308e-05  4.719e-06  -4.891 1.00e-06 ***
PARENT1Yes                       4.602e-01  9.427e-02   4.882 1.05e-06 ***
EDUCATIONBachelors              -3.868e-01  1.089e-01  -3.554 0.000380 ***
EDUCATIONMasters                -3.032e-01  1.615e-01  -1.878 0.060385 .  
EDUCATIONPhD                    -1.818e-01  2.002e-01  -0.908 0.363825    
EDUCATIONz_High School           1.487e-02  9.469e-02   0.157 0.875229    
HOME_VAL                        -1.342e-06  3.407e-07  -3.939 8.18e-05 ***
OLDCLAIM                        -1.405e-05  3.907e-06  -3.595 0.000324 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 9418.0  on 8160  degrees of freedom
Residual deviance: 7301.8  on 8129  degrees of freedom
AIC: 7365.8

Number of Fisher Scoring iterations: 5

Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0 5558 1245
         1  450  908
                                          
               Accuracy : 0.7923          
                 95% CI : (0.7833, 0.8011)
    No Information Rate : 0.7362          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.3934          
                                          
 Mcnemar's Test P-Value : < 2.2e-16       
                                          
            Sensitivity : 0.4217          
            Specificity : 0.9251          
         Pos Pred Value : 0.6686          
         Neg Pred Value : 0.8170          
             Prevalence : 0.2638          
         Detection Rate : 0.1113          
   Detection Prevalence : 0.1664          
      Balanced Accuracy : 0.6734          
                                          
       'Positive' Class : 1               
                                          

m1b$df_info
models.df <- rbind(models.df,m1b$df_info)

We compute McFadden's pseudo-R-squared for the logistic regressions, and we see that the difference between the full model and the reduced model is only marginal.

paste0('Full model = ',round(1-logLik(model.full)/logLik(model.null),4))
[1] "Full model = 0.2251"
paste0('Reduced model = ',round(1-logLik(model.reduced)/logLik(model.null),4))
[1] "Reduced model = 0.2247"

Diagnostics

We diagnose the reduced model for irregularities and violations of assumptions. The resulting logit values for the continuous predictors look mostly linear.

library(broom)
# Select only numeric predictors and predict
numdata <- mod1data %>%
  dplyr::select_if(is.numeric)
predictors <- colnames(numdata)
probabilities <- predict(model.reduced, type = "response")
# Bind the logit and tidying the data for plot
numdata <- numdata %>%
  mutate(logit = log(probabilities/(1-probabilities))) %>%
  gather(key = "predictors", value = "predictor.value", -logit)

ggplot(numdata, aes(logit, predictor.value))+
  geom_point(size = 0.5, alpha = 0.5) +
  geom_smooth(method = "loess") + 
  theme_bw() + 
  facet_wrap(~predictors, scales = "free_y")

Influential Values

Inspecting the influential values reveals 3 observations of interest, all of which have advanced degrees and high BLUEBOOK values, except for one sports car observation with a lower BLUEBOOK value.

plot(model.reduced, which = 4, id.n = 3)

model.data <- augment(model.reduced) %>% 
  mutate(index = 1:n())

model.data %>% top_n(3, .cooksd)

Outliers

Only one outlier is identified, which we will consider removing in a subsequent model.

ggplot(model.data, aes(index, .std.resid)) + 
  geom_point(aes(color = TARGET_FLAG), alpha = .5) +
  theme_bw()

model.data %>% 
  filter(abs(.std.resid) > 3)

Multicollinearity

Looking into multicollinearity, we see that the categorical variables JOB and EDUCATION have GVIF values greater than 5. These variables are categorical with several levels, so they add degrees of freedom, and their adjusted GVIF^(1/(2*Df)) values remain modest. We will remove them to compare the results, but we also consider penalized models to deal with multicollinearity. A sketch of this comparison follows the VIF table below.

car::vif(model.reduced)
                GVIF Df GVIF^(1/(2*Df))
URBANICITY  1.140628  1        1.068002
JOB        18.426984  8        1.199750
MVR_PTS     1.158224  1        1.076208
MSTATUS     1.863154  1        1.364974
CAR_TYPE    2.636934  5        1.101818
REVOKED     1.313343  1        1.146012
KIDSDRIV    1.090527  1        1.044283
CAR_USE     2.445424  1        1.563785
TIF         1.009108  1        1.004544
TRAVTIME    1.038286  1        1.018963
INCOME      2.468963  1        1.571293
CLM_FREQ    1.464061  1        1.209984
BLUEBOOK    1.767416  1        1.329442
PARENT1     1.428915  1        1.195372
EDUCATION   7.952890  4        1.295882
HOME_VAL    1.850544  1        1.360347
OLDCLAIM    1.645591  1        1.282806

Model 1 C&D: Balanced Data

As a comparison, we make use of the balanced dataset assembled earlier. We compute both the full and reduced models and store the metrics in the summary dataframe.

mod1data_bal <- data_train_balanced %>% dplyr::select(-c('TARGET_AMT','INDEX'))

modelbal.null <- glm(TARGET_FLAG ~ 1,
                 data=mod1data_bal,
                 family = binomial(link="logit")
                 )

modelbal.full <- glm(TARGET_FLAG ~ .,
                 data=mod1data_bal,
                 family=binomial(link="logit")
                 )
    
modelbal.reduced <- step(modelbal.null,
              scope=list(upper=modelbal.full),
             direction="both",
             test="Chisq",
             trace=0,
             data=mod1data_bal)

m1c <- score_model('modelbal.full', modelbal.full, mod1data_bal, output = FALSE)
models.df <- rbind(models.df,m1c$df_info)
m1d <- score_model('modelbal.reduced', modelbal.reduced, mod1data_bal, output = FALSE)
models.df <- rbind(models.df,m1d$df_info)

Model 1 E&F: Influential and Outlying Observations Removed

Here we model using the previous methods but discard the influential observations as well as the outlier.

mod1data_rmv <- data_train %>% filter(!INDEX %in% c(3722, 3592, 4690, 4742)) %>% dplyr::select(-c('TARGET_AMT','INDEX')) 

modelrmv.null <- glm(TARGET_FLAG ~ 1,
                 data=mod1data_rmv,
                 family = binomial(link="logit")
                 )

modelrmv.full <- glm(TARGET_FLAG ~ .,
                 data=mod1data_rmv,
                 family=binomial(link="logit")
                 )
    
modelrmv.reduced <- step(modelrmv.null,
              scope=list(upper=modelrmv.full),
             direction="both",
             test="Chisq",
             trace=0,
             data=mod1data_rmv)

m1e <- score_model('modelrmv.full', modelrmv.full, mod1data_rmv, output = FALSE)
models.df <- rbind(models.df,m1e$df_info)
m1f <- score_model('modelrmv.reduced', modelrmv.reduced, mod1data_rmv, output = FALSE)
models.df <- rbind(models.df,m1f$df_info)

Model 1: Summary

Finally, we consider the results of these models, which are of fairly comparable performance. The models using the balanced data lose accuracy and specificity but gain sensitivity. For the other models, the metrics are comparable. The model modelrmv.reduced has the smallest AIC overall, although the full version has a slightly higher pseudo-R-squared. We proceed with the smaller model in hopes of reducing variance at the expense of introducing a bit of bias.

models.df
paste0('Full model  = ',round(1-logLik(model.full)/logLik(model.null),4))
[1] "Full model  = 0.2251"
paste0('Reduced model  = ',round(1-logLik(model.reduced)/logLik(model.null),4))
[1] "Reduced model  = 0.2247"
paste0('Full model (bal) = ',round(1-logLik(modelbal.full)/logLik(modelbal.null),4))
[1] "Full model (bal) = 0.2394"
paste0('Reduced model (rmv) = ',round(1-logLik(modelrmv.reduced)/logLik(modelrmv.null),4))
[1] "Reduced model (rmv) = 0.2248"

Model 2A: Penalized Logistic Model

Since the basic model contained many predictor variables, we take a look at a penalized logistic regression model, which imposes a penalty on the model for having too many predictors. We will identify the best shrinkage factor lambda through cross-validation with an 80%/20% train/test data partitioning.

We fit a lasso regression model (alpha=1) and plot the cross-validation error over the log of lambda. The number of predictors is shown on top, and the vertical lines mark the optimal (minimal-error) value of lambda as well as the largest lambda that remains within 1 standard error of the optimum, which gives the sparsest comparable model. We will consider models with both of these lambda values.

# Prepare the data
library(glmnet)

mod2_data <- data_train %>% dplyr::select(-c('TARGET_AMT','INDEX'))

# Split the data into training and test set
set.seed(42)
training.samples <- mod2_data$TARGET_FLAG %>% createDataPartition(p = 0.8, list = FALSE)
train.data <- mod2_data[training.samples, ]
test.data <- mod2_data[-training.samples, ]
# Dummy code categorical predictor variables
x <- model.matrix(TARGET_FLAG~., train.data)[,-1]
# The outcome class (glmnet accepts a factor for the binomial family)
y <- train.data$TARGET_FLAG

# Find the best lambda using cross-validation
cv.lasso <- cv.glmnet(x, y, alpha = 1, family = "binomial")
plot(cv.lasso)

paste0('lambda.min = ',cv.lasso$lambda.min)
[1] "lambda.min = 0.000696605127977698"
paste0('lambda.1se = ',cv.lasso$lambda.1se) # simplest model but also lies within one standard error of the optimal value of lambda
[1] "lambda.1se = 0.00942540015884462"

The columns below compare the variables that are dropped in the lasso regression for the optimal and smallest model lambdas.

# Display regression coefficients
c_min_lambda <- coef(cv.lasso, cv.lasso$lambda.min)
c_1se_lambda <- coef(cv.lasso, cv.lasso$lambda.1se)
cbind(c_min_lambda, c_1se_lambda)
38 x 2 sparse Matrix of class "dgCMatrix"
                                            1             1
(Intercept)                     -6.497808e-01 -3.519117e-01
KIDSDRIV                         3.878002e-01  2.840127e-01
AGE                             -2.204067e-03 -1.070443e-03
HOMEKIDS                         3.772890e-02  2.514669e-02
YOJ                             -1.322932e-02 -3.760041e-03
INCOME                          -3.288182e-06 -3.666900e-06
PARENT1Yes                       4.329532e-01  4.124385e-01
HOME_VAL                        -1.113993e-06 -1.152107e-06
MSTATUSz_No                      4.696062e-01  2.913970e-01
SEXz_F                           .             .           
EDUCATIONBachelors              -3.584300e-01 -6.312949e-02
EDUCATIONMasters                -3.054436e-01 -1.464263e-02
EDUCATIONPhD                    -2.110436e-01  .           
EDUCATIONz_High School           1.692823e-02  1.215931e-01
JOBClerical                      3.211816e-01  1.264661e-01
JOBDoctor                       -4.363530e-01 -6.694927e-02
JOBHome Maker                    1.540398e-01  .           
JOBLawyer                       -2.538496e-02  .           
JOBManager                      -6.927927e-01 -5.685217e-01
JOBProfessional                  .             .           
JOBStudent                       3.713066e-02  .           
JOBz_Blue Collar                 1.440341e-01  .           
TRAVTIME                         1.342723e-02  8.283642e-03
CAR_USEPrivate                  -7.674998e-01 -7.250352e-01
BLUEBOOK                        -2.278061e-05 -1.615039e-05
TIF                             -5.505445e-02 -3.747593e-02
CAR_TYPEPanel Truck              5.143203e-01  .           
CAR_TYPEPickup                   5.117494e-01  4.698230e-02
CAR_TYPESports Car               8.750077e-01  3.391422e-01
CAR_TYPEVan                      6.136813e-01  1.722648e-03
CAR_TYPEz_SUV                    6.625531e-01  2.148920e-01
RED_CARyes                       8.537390e-03  .           
OLDCLAIM                        -1.452753e-05  .           
CLM_FREQ                         1.987771e-01  1.324388e-01
REVOKEDYes                       8.481732e-01  5.503006e-01
MVR_PTS                          1.053631e-01  9.197091e-02
CAR_AGE                         -5.696433e-04 -7.149609e-03
URBANICITYz_Highly Rural/ Rural -2.287235e+00 -1.795670e+00

Predicting TARGET_FLAG

When comparing the accuracy of the two lasso-penalized models, we see that the difference in accuracy is marginal and slightly lower than the simple logit models. For this reason, our selected best model will be the reduced model with influential observations and outliers removed.

# Final model with lambda.min
lasso.model.min <- glmnet(x, y, alpha = 1, family = "binomial",
                      lambda = cv.lasso$lambda.min)
# Make prediction on test data
x.test <- model.matrix(TARGET_FLAG ~., test.data)[,-1]
probabilities <- lasso.model.min %>% predict(newx = x.test, type = "response")
predicted.classes <- as.factor(ifelse(probabilities > 0.5, 1, 0))
# Model accuracy rate
observed.classes <- as.factor(test.data$TARGET_FLAG)
mean(predicted.classes == observed.classes)
[1] 0.7786634
print(confusionMatrix(predicted.classes,observed.classes, positive = "1"))
Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0 1163  323
         1   38  107
                                          
               Accuracy : 0.7787          
                 95% CI : (0.7577, 0.7986)
    No Information Rate : 0.7364          
    P-Value [Acc > NIR] : 4.489e-05       
                                          
                  Kappa : 0.2759          
                                          
 Mcnemar's Test P-Value : < 2.2e-16       
                                          
            Sensitivity : 0.2488          
            Specificity : 0.9684          
         Pos Pred Value : 0.7379          
         Neg Pred Value : 0.7826          
             Prevalence : 0.2636          
         Detection Rate : 0.0656          
   Detection Prevalence : 0.0889          
      Balanced Accuracy : 0.6086          
                                          
       'Positive' Class : 1               
                                          
# Final model with lambda.1se
lasso.model.se1 <- glmnet(x, y, alpha = 1, family = "binomial",
                      lambda = cv.lasso$lambda.1se)
# Make prediction on test data
x.test <- model.matrix(TARGET_FLAG ~., test.data)[,-1]
probabilities <- lasso.model.se1 %>% predict(newx = x.test, type = "response")
predicted.classes <- as.factor(ifelse(probabilities > 0.5, 1, 0))
# Model accuracy rate
observed.classes <- as.factor(test.data$TARGET_FLAG)
#mean(predicted.classes == observed.classes)
print(confusionMatrix(predicted.classes,observed.classes, positive = "1"))
Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0 1184  364
         1   17   66
                                          
               Accuracy : 0.7664          
                 95% CI : (0.7451, 0.7867)
    No Information Rate : 0.7364          
    P-Value [Acc > NIR] : 0.002925        
                                          
                  Kappa : 0.188           
                                          
 Mcnemar's Test P-Value : < 2.2e-16       
                                          
            Sensitivity : 0.15349         
            Specificity : 0.98585         
         Pos Pred Value : 0.79518         
         Neg Pred Value : 0.76486         
             Prevalence : 0.26364         
         Detection Rate : 0.04047         
   Detection Prevalence : 0.05089         
      Balanced Accuracy : 0.56967         
                                          
       'Positive' Class : 1               
                                          

Model Building - Multiple Regression Models

Predicting TARGET_AMT

We subset the dataset to pick out only the observations which were involved in accidents. This is also the population that has previously filed claims, so this information will be used to predict the TARGET_AMT.

Model 3 A&B: Multiple Regression

The simple multiple regression model delivers poor performance with a very low R-squared value. For this reason, we next consider a model using weights as a comparison.

data_train2 <- data_train %>% dplyr::select(c(-'INDEX')) %>% filter(TARGET_FLAG==1) %>% dplyr::select(-c('TARGET_FLAG'))
mod3a <- lm(TARGET_AMT ~ ., data_train2)
summary(mod3a)

Call:
lm(formula = TARGET_AMT ~ ., data = data_train2)

Residuals:
   Min     1Q Median     3Q    Max 
 -8943  -3176  -1501    480  99578 

Coefficients:
                                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)                      3.396e+03  1.845e+03   1.841   0.0658 .  
KIDSDRIV                        -1.719e+02  3.166e+02  -0.543   0.5873    
AGE                              1.835e+01  2.124e+01   0.864   0.3879    
HOMEKIDS                         2.135e+02  2.071e+02   1.031   0.3028    
YOJ                              1.916e+01  4.918e+01   0.390   0.6969    
INCOME                          -9.014e-03  6.742e-03  -1.337   0.1814    
PARENT1Yes                       2.772e+02  5.873e+02   0.472   0.6369    
HOME_VAL                         2.198e-03  2.020e-03   1.088   0.2767    
MSTATUSz_No                      8.036e+02  4.935e+02   1.628   0.1036    
SEXz_F                          -1.397e+03  6.564e+02  -2.129   0.0334 *  
EDUCATIONBachelors               2.606e+02  6.419e+02   0.406   0.6848    
EDUCATIONMasters                 1.194e+03  1.084e+03   1.102   0.2707    
EDUCATIONPhD                     2.396e+03  1.312e+03   1.827   0.0679 .  
EDUCATIONz_High School          -3.954e+02  5.145e+02  -0.769   0.4423    
JOBClerical                      3.084e+02  1.203e+03   0.256   0.7977    
JOBDoctor                       -2.116e+03  1.762e+03  -1.201   0.2299    
JOBHome Maker                   -2.353e+01  1.266e+03  -0.019   0.9852    
JOBLawyer                        3.272e+02  1.029e+03   0.318   0.7506    
JOBManager                      -7.794e+02  1.066e+03  -0.731   0.4647    
JOBProfessional                  1.061e+03  1.129e+03   0.940   0.3475    
JOBStudent                       1.147e+02  1.286e+03   0.089   0.9289    
JOBz_Blue Collar                 5.206e+02  1.146e+03   0.454   0.6497    
TRAVTIME                         7.528e-01  1.108e+01   0.068   0.9458    
CAR_USEPrivate                  -4.384e+02  5.216e+02  -0.840   0.4008    
BLUEBOOK                         1.245e-01  3.053e-02   4.077 4.73e-05 ***
TIF                             -1.574e+01  4.252e+01  -0.370   0.7112    
CAR_TYPEPanel Truck             -6.403e+02  9.605e+02  -0.667   0.5051    
CAR_TYPEPickup                  -5.637e+01  5.968e+02  -0.094   0.9248    
CAR_TYPESports Car               1.061e+03  7.502e+02   1.414   0.1575    
CAR_TYPEVan                      6.335e+01  7.707e+02   0.082   0.9345    
CAR_TYPEz_SUV                    9.025e+02  6.668e+02   1.354   0.1760    
RED_CARyes                      -1.933e+02  4.965e+02  -0.389   0.6970    
OLDCLAIM                         2.502e-02  2.263e-02   1.105   0.2691    
CLM_FREQ                        -1.154e+02  1.580e+02  -0.731   0.4651    
REVOKEDYes                      -1.125e+03  5.166e+02  -2.177   0.0296 *  
MVR_PTS                          1.108e+02  6.853e+01   1.617   0.1061    
CAR_AGE                         -9.842e+01  4.406e+01  -2.233   0.0256 *  
URBANICITYz_Highly Rural/ Rural -9.778e+01  7.562e+02  -0.129   0.8971    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7690 on 2115 degrees of freedom
Multiple R-squared:  0.03061,   Adjusted R-squared:  0.01365 
F-statistic: 1.805 on 37 and 2115 DF,  p-value: 0.002183

We build the reduced model as before and output the summary. The adjusted R-squared remains very low at 1.8%. The diagnostic QQ plot reveals a large deviation from normality in the upper quantiles that heavily affects the results; one possible remedy is sketched after the diagnostics below.

modeltgt.null <- lm(TARGET_AMT ~ 1, data=data_train2)
modeltgt.full <- lm(TARGET_AMT ~ ., data=data_train2)
modeltgt.reduced <- step(modeltgt.null,
              scope=list(upper=modeltgt.full),
             direction="both",
             trace=0,
             data=data_train2)

summary(modeltgt.reduced)

Call:
lm(formula = TARGET_AMT ~ BLUEBOOK + MVR_PTS + SEX + REVOKED + 
    MSTATUS + CAR_AGE, data = data_train2)

Residuals:
   Min     1Q Median     3Q    Max 
 -8203  -3136  -1559    389 100457 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) 4374.99590  489.77634   8.933  < 2e-16 ***
BLUEBOOK       0.11270    0.02035   5.537 3.46e-08 ***
MVR_PTS      122.96517   64.23189   1.914   0.0557 .  
SEXz_F      -637.79264  334.17405  -1.909   0.0565 .  
REVOKEDYes  -694.16132  409.27610  -1.696   0.0900 .  
MSTATUSz_No  542.20258  331.58446   1.635   0.1022    
CAR_AGE      -49.32254   31.56917  -1.562   0.1183    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7672 on 2146 degrees of freedom
Multiple R-squared:  0.02104,   Adjusted R-squared:  0.0183 
F-statistic: 7.685 on 6 and 2146 DF,  p-value: 3.452e-08
plot(modeltgt.reduced)
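
Given that heavy right tail, one common remedy, sketched below but not adopted in the models that follow, is to fit the same reduced formula on the log of the claim amount (safe here because every row of data_train2 has TARGET_AMT > 0); modeltgt.log is a hypothetical name used only for this sketch:

# Refit the reduced formula on log claim amounts (exploratory sketch)
modeltgt.log <- lm(log(TARGET_AMT) ~ BLUEBOOK + MVR_PTS + SEX + REVOKED +
                     MSTATUS + CAR_AGE, data = data_train2)
summary(modeltgt.log)$adj.r.squared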

Model 4: Weighted Least Squares

We explore a weighted least squares model hoping for better performance, and we manage a modest improvement to an adjusted R-squared of 6.8%. We see that, compared to the previous model, the upper range of the QQ plot rapidly increases.

lm4 <- lm(TARGET_AMT ~. -INDEX -TARGET_FLAG, data = data_train)
summary(lm4)

Call:
lm(formula = TARGET_AMT ~ . - INDEX - TARGET_FLAG, data = data_train)

Residuals:
   Min     1Q Median     3Q    Max 
 -5891  -1699   -760    344 103785 

Coefficients:
                                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)                      1.065e+03  5.572e+02   1.912 0.055974 .  
KIDSDRIV                         3.142e+02  1.132e+02   2.777 0.005507 ** 
AGE                              5.221e+00  7.066e+00   0.739 0.459973    
HOMEKIDS                         7.764e+01  6.535e+01   1.188 0.234873    
YOJ                             -3.947e+00  1.509e+01  -0.262 0.793583    
INCOME                          -4.414e-03  1.805e-03  -2.445 0.014499 *  
PARENT1Yes                       5.761e+02  2.020e+02   2.852 0.004360 ** 
HOME_VAL                        -5.574e-04  5.908e-04  -0.944 0.345402    
MSTATUSz_No                      5.703e+02  1.449e+02   3.937 8.33e-05 ***
SEXz_F                          -3.685e+02  1.838e+02  -2.005 0.044978 *  
EDUCATIONBachelors              -2.585e+02  2.048e+02  -1.262 0.207014    
EDUCATIONMasters                 2.427e+01  3.000e+02   0.081 0.935520    
EDUCATIONPhD                     2.871e+02  3.560e+02   0.807 0.419960    
EDUCATIONz_High School          -8.875e+01  1.719e+02  -0.516 0.605722    
JOBClerical                      5.292e+02  3.414e+02   1.550 0.121192    
JOBDoctor                       -4.998e+02  4.087e+02  -1.223 0.221451    
JOBHome Maker                    3.523e+02  3.646e+02   0.966 0.334044    
JOBLawyer                        2.308e+02  2.956e+02   0.781 0.435074    
JOBManager                      -4.788e+02  2.885e+02  -1.660 0.096959 .  
JOBProfessional                  4.566e+02  3.088e+02   1.479 0.139258    
JOBStudent                       2.876e+02  3.740e+02   0.769 0.441842    
JOBz_Blue Collar                 5.077e+02  3.218e+02   1.577 0.114754    
TRAVTIME                         1.195e+01  3.222e+00   3.709 0.000209 ***
CAR_USEPrivate                  -7.796e+02  1.644e+02  -4.742 2.16e-06 ***
BLUEBOOK                         1.433e-02  8.623e-03   1.662 0.096568 .  
TIF                             -4.820e+01  1.218e+01  -3.958 7.63e-05 ***
CAR_TYPEPanel Truck              2.651e+02  2.782e+02   0.953 0.340636    
CAR_TYPEPickup                   3.757e+02  1.707e+02   2.201 0.027747 *  
CAR_TYPESports Car               1.022e+03  2.178e+02   4.692 2.75e-06 ***
CAR_TYPEVan                      5.146e+02  2.132e+02   2.414 0.015815 *  
CAR_TYPEz_SUV                    7.516e+02  1.793e+02   4.192 2.80e-05 ***
RED_CARyes                      -4.822e+01  1.491e+02  -0.323 0.746335    
OLDCLAIM                        -1.056e-02  7.436e-03  -1.420 0.155686    
CLM_FREQ                         1.417e+02  5.503e+01   2.575 0.010028 *  
REVOKEDYes                       5.496e+02  1.735e+02   3.167 0.001545 ** 
MVR_PTS                          1.753e+02  2.592e+01   6.764 1.44e-11 ***
CAR_AGE                         -2.690e+01  1.280e+01  -2.102 0.035598 *  
URBANICITYz_Highly Rural/ Rural -1.665e+03  1.394e+02 -11.943  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4544 on 8123 degrees of freedom
Multiple R-squared:  0.07104,   Adjusted R-squared:  0.06681 
F-statistic: 16.79 on 37 and 8123 DF,  p-value: < 2.2e-16
# Estimate each observation's error spread by regressing |residuals| on the
# fitted values, then weight observations by the inverse of the squared fitted
# spread, so observations with larger estimated variance get less weight
sd <- 1 / lm(abs(lm4$residuals)~lm4$fitted.values)$fitted.values^2
lm4_wls <- lm(TARGET_AMT ~. -INDEX -TARGET_FLAG, data = data_train, weights = sd)

lm4_wls_step <- stepAIC(lm4_wls, direction = "backward", trace = FALSE)
summary(lm4_wls_step)

Call:
lm(formula = TARGET_AMT ~ KIDSDRIV + AGE + HOMEKIDS + INCOME + 
    PARENT1 + HOME_VAL + MSTATUS + SEX + EDUCATION + JOB + TRAVTIME + 
    CAR_USE + BLUEBOOK + TIF + CAR_TYPE + RED_CAR + CLM_FREQ + 
    REVOKED + MVR_PTS + CAR_AGE + URBANICITY, data = data_train, 
    weights = sd)

Weighted Residuals:
   Min     1Q Median     3Q    Max 
-5.784 -0.624 -0.535 -0.008 99.897 

Coefficients:
                                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)                      7.763e+02  9.580e+01   8.103 6.12e-16 ***
KIDSDRIV                         1.001e+02  5.096e+01   1.965 0.049476 *  
AGE                              4.812e+00  1.029e+00   4.677 2.96e-06 ***
HOMEKIDS                         6.926e+01  1.196e+01   5.789 7.35e-09 ***
INCOME                          -2.047e-03  2.998e-04  -6.827 9.28e-12 ***
PARENT1Yes                       4.701e+02  1.642e+02   2.863 0.004211 ** 
HOME_VAL                        -2.688e-04  1.121e-04  -2.398 0.016512 *  
MSTATUSz_No                      2.707e+02  3.861e+01   7.011 2.56e-12 ***
SEXz_F                          -1.659e+02  2.281e+01  -7.275 3.78e-13 ***
EDUCATIONBachelors              -7.303e+01  3.950e+01  -1.849 0.064526 .  
EDUCATIONMasters                 2.735e+01  4.721e+01   0.579 0.562374    
EDUCATIONPhD                     1.465e+02  5.415e+01   2.706 0.006831 ** 
EDUCATIONz_High School          -5.321e+01  3.138e+01  -1.696 0.089998 .  
JOBClerical                      1.607e+02  6.856e+01   2.345 0.019076 *  
JOBDoctor                       -2.739e+02  7.890e+01  -3.471 0.000521 ***
JOBHome Maker                    2.728e+01  6.187e+01   0.441 0.659250    
JOBLawyer                       -1.594e+01  6.305e+01  -0.253 0.800387    
JOBManager                      -2.278e+02  7.034e+01  -3.238 0.001209 ** 
JOBProfessional                  2.838e+02  6.985e+01   4.063 4.90e-05 ***
JOBStudent                       2.943e+01  7.261e+01   0.405 0.685318    
JOBz_Blue Collar                 1.655e+02  7.053e+01   2.347 0.018945 *  
TRAVTIME                         6.244e+00  4.988e-01  12.518  < 2e-16 ***
CAR_USEPrivate                  -3.486e+02  5.398e+01  -6.458 1.12e-10 ***
BLUEBOOK                         4.665e-03  1.357e-03   3.437 0.000592 ***
TIF                             -2.842e+01  2.052e+00 -13.855  < 2e-16 ***
CAR_TYPEPanel Truck             -8.041e+01  1.421e+02  -0.566 0.571436    
CAR_TYPEPickup                   1.511e+02  3.219e+01   4.693 2.74e-06 ***
CAR_TYPESports Car               4.998e+02  6.071e+01   8.232  < 2e-16 ***
CAR_TYPEVan                      5.952e+01  5.357e+01   1.111 0.266600    
CAR_TYPEz_SUV                    3.734e+02  3.196e+01  11.682  < 2e-16 ***
RED_CARyes                       3.275e+01  2.043e+01   1.603 0.108952    
CLM_FREQ                         1.554e+02  3.033e+01   5.124 3.06e-07 ***
REVOKEDYes                       2.367e+02  1.119e+02   2.116 0.034414 *  
MVR_PTS                          6.601e+01  1.075e+01   6.140 8.65e-10 ***
CAR_AGE                         -9.653e+00  1.688e+00  -5.719 1.11e-08 ***
URBANICITYz_Highly Rural/ Rural -6.903e+02  4.767e+01 -14.482  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.308 on 8125 degrees of freedom
Multiple R-squared:  0.07231,   Adjusted R-squared:  0.06831 
F-statistic: 18.09 on 35 and 8125 DF,  p-value: < 2.2e-16
plot(lm4_wls_step)

Model Selection

We use the final selected models to make the predictions and write the results to file. To predict TARGET_FLAG we use modelrmv.reduced, and we use the WLS model for TARGET_AMT. Once the individuals predicted to have accidents are identified, we filter them and predict the claim amount for that subset.

eval_probs <- predict(modelrmv.reduced, newdata=data_test %>% dplyr::select(-INDEX, -TARGET_AMT), type='response')
data_test$TARGET_FLAG <- ifelse(eval_probs > 0.5, 1, 0)
data_test$TARGET_AMT <- 0.0
data_test[data_test$TARGET_FLAG == 1,]$TARGET_AMT <- predict(lm4_wls_step, newdata=data_test %>% filter(TARGET_FLAG==1) %>% dplyr::select(-c('TARGET_FLAG','INDEX')))

data_test
write.csv(data_test, "data_test_results.csv")