In this assignment, we explore, analyze, and model a data set containing approximately 8,000 records, each representing a customer at an auto insurance company. Each record has two response variables. The first response variable, TARGET_FLAG
, is binary: a 1 indicates that the customer was in a car crash, while a 0 indicates that they were not. The second response variable is TARGET_AMT
. This value is 0 if the customer did not crash their car and takes a value greater than 0 if they did.
The objective is to build multiple linear regression and binary logistic regression models on the training data to predict whether a customer will crash their car and, in the case of a crash, to predict the cost. We will only use the variables given to us (or variables derived from them).
Below is a short description of the variables of interest in the data set:
data_train <- read.csv("https://raw.githubusercontent.com/salma71/Data_621/master/HW_4/insurance_training_data.csv", header = TRUE)
data_test <- read.csv("https://raw.githubusercontent.com/salma71/Data_621/master/HW_4/insurance-evaluation-data.csv", header = TRUE)
The dataset consists of 26 variables and 8161 observations, with the AGE
, YOJ
, and CAR_AGE
variables containing some missing values. As stated previously, TARGET_FLAG
and TARGET_AMT
are our response variables. Also, 14
of the variables are read in as discrete (factors) and the remaining 12 are numeric.
Name | data_train |
---|---|
Number of rows | 8161 |
Number of columns | 26 |
Column type frequency: factor | 14 |
Column type frequency: numeric | 12 |
Group variables | None |
Variable type: factor
skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
---|---|---|---|---|---|
INCOME | 0 | 1 | FALSE | 6613 | $0: 615, emp: 445, $26: 4, $48: 4 |
PARENT1 | 0 | 1 | FALSE | 2 | No: 7084, Yes: 1077 |
HOME_VAL | 0 | 1 | FALSE | 5107 | $0: 2294, emp: 464, $11: 3, $11: 3 |
MSTATUS | 0 | 1 | FALSE | 2 | Yes: 4894, z_N: 3267 |
SEX | 0 | 1 | FALSE | 2 | z_F: 4375, M: 3786 |
EDUCATION | 0 | 1 | FALSE | 5 | z_H: 2330, Bac: 2242, Mas: 1658, <Hi: 1203 |
JOB | 0 | 1 | FALSE | 9 | z_B: 1825, Cle: 1271, Pro: 1117, Man: 988 |
CAR_USE | 0 | 1 | FALSE | 2 | Pri: 5132, Com: 3029 |
BLUEBOOK | 0 | 1 | FALSE | 2789 | $1,: 157, $6,: 34, $5,: 33, $6,: 33 |
CAR_TYPE | 0 | 1 | FALSE | 6 | z_S: 2294, Min: 2145, Pic: 1389, Spo: 907 |
RED_CAR | 0 | 1 | FALSE | 2 | no: 5783, yes: 2378 |
OLDCLAIM | 0 | 1 | FALSE | 2857 | $0: 5009, $1,: 4, $1,: 4, $4,: 4 |
REVOKED | 0 | 1 | FALSE | 2 | No: 7161, Yes: 1000 |
URBANICITY | 0 | 1 | FALSE | 2 | Hig: 6492, z_H: 1669 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
INDEX | 0 | 1.00 | 5151.87 | 2978.89 | 1 | 2559 | 5133 | 7745 | 10302.0 | ▇▇▇▇▇ |
TARGET_FLAG | 0 | 1.00 | 0.26 | 0.44 | 0 | 0 | 0 | 1 | 1.0 | ▇▁▁▁▃ |
TARGET_AMT | 0 | 1.00 | 1504.32 | 4704.03 | 0 | 0 | 0 | 1036 | 107586.1 | ▇▁▁▁▁ |
KIDSDRIV | 0 | 1.00 | 0.17 | 0.51 | 0 | 0 | 0 | 0 | 4.0 | ▇▁▁▁▁ |
AGE | 6 | 1.00 | 44.79 | 8.63 | 16 | 39 | 45 | 51 | 81.0 | ▁▆▇▂▁ |
HOMEKIDS | 0 | 1.00 | 0.72 | 1.12 | 0 | 0 | 0 | 1 | 5.0 | ▇▂▁▁▁ |
YOJ | 454 | 0.94 | 10.50 | 4.09 | 0 | 9 | 11 | 13 | 23.0 | ▂▃▇▃▁ |
TRAVTIME | 0 | 1.00 | 33.49 | 15.91 | 5 | 22 | 33 | 44 | 142.0 | ▇▇▁▁▁ |
TIF | 0 | 1.00 | 5.35 | 4.15 | 1 | 1 | 4 | 7 | 25.0 | ▇▆▁▁▁ |
CLM_FREQ | 0 | 1.00 | 0.80 | 1.16 | 0 | 0 | 0 | 2 | 5.0 | ▇▂▁▁▁ |
MVR_PTS | 0 | 1.00 | 1.70 | 2.15 | 0 | 0 | 1 | 3 | 13.0 | ▇▂▁▁▁ |
CAR_AGE | 510 | 0.94 | 8.33 | 5.70 | -3 | 1 | 8 | 12 | 28.0 | ▆▇▇▃▁ |
The currency notation found in some values will interfere with our analysis, so we need to reformat those values appropriately.
# Strip dollar signs and thousands separators, then convert to numeric
strip_dollars <- function(x){
  x <- as.character(x)
  x <- gsub(",", "", x)
  x <- gsub("\\$", "", x)
  as.numeric(x)
}
# Clean up the currency-formatted columns (strip_dollars is vectorized, so no rowwise() is needed)
fix_formatting <- function(messy_df){
  messy_df %>%
    mutate(INCOME   = strip_dollars(INCOME),
           HOME_VAL = strip_dollars(HOME_VAL),
           BLUEBOOK = strip_dollars(BLUEBOOK),
           OLDCLAIM = strip_dollars(OLDCLAIM))
}
We noticed that a few variables listed as discrete have large numbers of unique values. A closer inspection of the variable descriptions reveals that while these variables are encoded as factors, they are actually continuous. The TARGET_FLAG
variable also appears in the summary as a numeric variable, but it should be a binary factor. We proceed to fix these data types.
There are also some values that appear invalid (e.g. a CAR_AGE of -3). Since the missing values in AGE, YOJ, and CAR_AGE each amount to less than 5% of observations, we replace them with the median. We compute the medians on the training set only and use them to impute both the training and test sets, so that no information leaks from the test set.
We apply the processing steps above to both the training and testing datasets; a sketch is shown below.
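The preprocessing code itself is not shown in this excerpt. A minimal sketch of these steps might look like the following; the function names fix_data_types and fix_missing are illustrative, not taken from the original code.
# Illustrative sketch of the preprocessing described above (names are not from the original code)
fix_data_types <- function(df){
  df %>%
    mutate(TARGET_FLAG = as.factor(TARGET_FLAG))  # binary response as a factor
}
# Medians computed on the training set only, then used to impute both sets
age_median     <- median(data_train$AGE, na.rm = TRUE)
yoj_median     <- median(data_train$YOJ, na.rm = TRUE)
car_age_median <- median(data_train$CAR_AGE, na.rm = TRUE)
fix_missing <- function(df){
  df %>%
    mutate(CAR_AGE = ifelse(CAR_AGE < 0, NA, CAR_AGE),  # treat invalid ages (e.g. -3) as missing
           AGE     = ifelse(is.na(AGE), age_median, AGE),
           YOJ     = ifelse(is.na(YOJ), yoj_median, YOJ),
           CAR_AGE = ifelse(is.na(CAR_AGE), car_age_median, CAR_AGE))
}
data_train <- data_train %>% fix_formatting() %>% fix_data_types() %>% fix_missing()
data_test  <- data_test  %>% fix_formatting() %>% fix_data_types() %>% fix_missing()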
We now explore the distribution of TARGET_FLAG
across the numeric variables. We see that BLUEBOOK
, INCOME
, and OLDCLAIM
have a high number of outliers compared to the other variables. We also see that customers who are older, or who have older cars, higher home values, or higher incomes, tend to get into fewer car crashes. However, people with motor vehicle record points or a high number of old claims tend to get into more accidents.
plot_vars <- c("TARGET_FLAG", names(keep(data_train, is.numeric)))
data_train[plot_vars] %>%
dplyr::select(-INDEX, -TARGET_AMT) %>%
gather(variable, value, -TARGET_FLAG) %>%
ggplot(., aes(TARGET_FLAG, value, color=TARGET_FLAG)) +
geom_boxplot() +
scale_color_brewer(palette="Set1") +
theme_light() +
theme(legend.position = "none") +
facet_wrap(~variable, scales ="free", ncol = 4) +
labs(x = element_blank(), y = element_blank())
data_train %>%
dplyr::select(-TARGET_FLAG, -TARGET_AMT, -INDEX) %>%
keep(is.numeric) %>%
gather() %>%
ggplot(aes(x= value)) +
geom_histogram(fill='pink') +
facet_wrap(~key, scales = 'free')
The variables displayed below, OLDCLAIM
, INCOME
, BLUEBOOK
, and HOME_VAL
, need scale transformations. Several variables have a high number of zeros. AGE
is the only variable with an approximately Gaussian distribution; the rest of the variables show some skewness. We will perform a Box-Cox transformation on these variables (a sketch follows the histograms below).
data_train %>%
dplyr::select(OLDCLAIM, INCOME, BLUEBOOK, HOME_VAL) %>%
gather() %>%
ggplot(aes(x= value)) +
geom_histogram(fill='pink') +
facet_wrap(~key, scales = 'free')
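The text above calls for a Box-Cox transformation; a sketch using caret::preProcess is shown here. Because several of these variables contain zeros and Box-Cox requires strictly positive values, the sketch substitutes the closely related Yeo-Johnson transform; this is an assumption about the implementation, not the original code.
# Estimate power transformations for the skewed variables (illustrative sketch)
library(caret)
skewed_vars <- data_train %>% dplyr::select(OLDCLAIM, INCOME, BLUEBOOK, HOME_VAL)
pp <- preProcess(skewed_vars, method = "YeoJohnson")  # "BoxCox" would skip columns containing zeros
transformed_vars <- predict(pp, skewed_vars)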
data_train %>%
keep(is.numeric) %>%
gather(variable, value, -TARGET_AMT, -INDEX, -CLM_FREQ, -MVR_PTS) %>%
ggplot(., aes(value, TARGET_AMT)) +
geom_point() +
scale_color_brewer(palette="Set1") +
theme_light() +
theme(legend.position = "none") +
facet_wrap(~variable, scales ="free", ncol = 3) +
labs(x = element_blank(), y = element_blank())
We see that MVR_PTS
, CLM_FREQ
, and OLDCLAIM
are the variables most positively correlated with our response variables, whereas URBANICITY
is the most negatively correlated. The rest of the variables are weakly correlated.
corr_dataframe = data_train %>%
mutate_if(is.factor, as.numeric)
q <- cor(corr_dataframe)
ggcorrplot(q, type = "upper", outline.color = "white",
ggtheme = theme_classic,
colors = c("#6D9EC1", "white", "#E46726"),
lab = TRUE, show.legend = FALSE, tl.cex = 8, lab_size = 3)
set.seed(42)
accidents <- data_train %>%
filter(TARGET_FLAG == 1)
ggplot(accidents, aes(x=TARGET_AMT)) +
geom_density(fill='pink') +
theme_light() +
geom_vline(aes(xintercept = mean(TARGET_AMT)), lty=2, col="red") +
geom_label(aes(x=25000, y=0.00015, label=paste("mean =", round(mean(TARGET_AMT),0)))) +
geom_vline(aes(xintercept = median(TARGET_AMT)), lty=2, col="darkgreen") +
geom_label(aes(x=25000, y=0.00010, label=paste("median = ", round(median(TARGET_AMT), 0)))) +
labs(title="TARGET_AMT Density Plot", y="Density", x="TARGET_AMT")
As was previously noted, this distribution has a long tail. The mean payout is $5616 and the median is $4102. The median and mean are higher, of course, for those observations we classify as outliers. The outlier cutoff point is $10594.
# Outlier cutoff: the smallest value flagged as an outlier by the boxplot rule for crash records
outlier <- min(boxplot(data_train[data_train$TARGET_FLAG==1,]$TARGET_AMT, plot=FALSE)$out)
data_train %>%
  mutate(TARGET_AMT_OUTLIER = ifelse(TARGET_AMT >= outlier, "Yes", "No")) %>%  # "Yes" = at or above the cutoff
  group_by(TARGET_AMT_OUTLIER) %>%
  summarise(Mean = mean(TARGET_AMT),
            Median = median(TARGET_AMT))
There is an imbalance in the TARGET_FLAG variable. Let's check the class distribution:
   0    1
6008 2153
        0         1
0.7361843 0.2638157
Only about 26% of the customers in the data have had an accident, while 74% have not. This is a severely imbalanced data set, which would distort the accuracy score in the model-building step if left untreated.
To treat this imbalance, we oversample the minority class.
set.seed(42)
minority <- nrow(data_train[data_train$TARGET_FLAG == 1,])
majority <- nrow(data_train[data_train$TARGET_FLAG == 0,])
diff <- majority - minority
minority_index <- data_train[data_train$TARGET_FLAG == 1,]$INDEX
over_sample_train <- data.frame(INDEX = sample(minority_index, diff, TRUE)) %>%
merge(data_train, .) %>%
bind_rows(data_train)
data_train_balanced <- over_sample_train
We check the balance again:
0 1
6008 6008
Our objective is to predict both TARGET_FLAG
and TARGET_AMT
. TARGET_FLAG
is a binary response variable, so it should be modeled using logistic regression, which estimates the probability that an individual will be in an accident.
# Initialize a tibble that will store the metrics of the models
# (columns match the per-model tibble built inside score_model so the rows can be rbind-ed)
models.df <- tibble(id=character(), res.deviance=numeric(), null.deviance=numeric(),
                    aic=numeric(), accuracy=numeric(), sensitivity=numeric(), specificity=numeric())
# A function to extract the relevant metrics from the summary and confusion matrix
# (confusionMatrix from caret and roc from pROC are assumed to be loaded)
score_model <- function(id, model, data, output=FALSE) {
  if (output) print(summary(model))
  glm.probs <- predict(model, type="response")
  # Classify using the 0.5 probability threshold
  glm.pred <- ifelse(glm.probs > 0.5, 1, 0)
  results <- tibble(target=data$TARGET_FLAG, pred=glm.pred) %>%
    mutate(pred.class = as.factor(pred), target.class = as.factor(target))
  # Compute the confusion matrix once and pull the metrics from it
  cm <- confusionMatrix(results$pred.class, results$target.class, positive = "1")
  if (output) print(cm)
  acc <- cm$overall['Accuracy']
  sens <- cm$byClass['Sensitivity']
  spec <- cm$byClass['Specificity']
  res.deviance <- model$deviance
  null.deviance <- model$null.deviance
  aic <- model$aic
  metrics <- list(res.deviance=res.deviance, null.deviance=null.deviance, aic=aic,
                  accuracy=acc, sensitivity=sens, specificity=spec)
  metrics <- lapply(metrics, round, 3)
  if (output) plot(roc(results$target.class, glm.probs), print.auc = TRUE)
  model.df <- tibble(id=id, res.deviance=metrics$res.deviance, null.deviance=metrics$null.deviance,
                     aic=metrics$aic, accuracy=metrics$accuracy, sensitivity=metrics$sensitivity,
                     specificity=metrics$specificity)
  model.list <- list(model=model, df_info=model.df)
  return(model.list)
}
We construct null, full, and reduced models. The null model contains only the intercept and forms the lower bound of complexity. The full model contains all predictors and is the upper bound, the maximum-complexity model. The reduced model is obtained by stepwise selection in both directions between these two bounds, adding or dropping predictors until no single change improves the model.
mod1data <- data_train %>% dplyr::select(-c('TARGET_AMT','INDEX'))
#mod1data <- data_train_balanced %>% select(-c('TARGET_AMT','INDEX'))
model.null <- glm(TARGET_FLAG ~ 1,
data=mod1data,
family = binomial(link="logit")
)
model.full <- glm(TARGET_FLAG ~ .,
data=mod1data,
family = binomial(link="logit")
)
model.reduced <- step(model.null,
scope = list(upper=model.full),
direction="both",
test="Chisq",
trace=0,
data=mod1data)
m1a <- score_model('model.full', model.full, mod1data, output = TRUE)
Call:
glm(formula = TARGET_FLAG ~ ., family = binomial(link = "logit"),
data = mod1data)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.5846 -0.7127 -0.3983 0.6261 3.1524
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -9.286e-01 3.215e-01 -2.889 0.003869 **
KIDSDRIV 3.862e-01 6.122e-02 6.308 2.82e-10 ***
AGE -1.015e-03 4.020e-03 -0.252 0.800672
HOMEKIDS 4.965e-02 3.713e-02 1.337 0.181119
YOJ -1.105e-02 8.582e-03 -1.288 0.197743
INCOME -3.423e-06 1.081e-06 -3.165 0.001551 **
PARENT1Yes 3.820e-01 1.096e-01 3.485 0.000492 ***
HOME_VAL -1.306e-06 3.420e-07 -3.819 0.000134 ***
MSTATUSz_No 4.938e-01 8.357e-02 5.909 3.45e-09 ***
SEXz_F -8.251e-02 1.120e-01 -0.737 0.461416
EDUCATIONBachelors -3.812e-01 1.157e-01 -3.296 0.000981 ***
EDUCATIONMasters -2.903e-01 1.788e-01 -1.624 0.104397
EDUCATIONPhD -1.677e-01 2.140e-01 -0.784 0.433295
EDUCATIONz_High School 1.764e-02 9.506e-02 0.186 0.852802
JOBClerical 4.107e-01 1.967e-01 2.088 0.036763 *
JOBDoctor -4.458e-01 2.671e-01 -1.669 0.095106 .
JOBHome Maker 2.323e-01 2.102e-01 1.106 0.268915
JOBLawyer 1.049e-01 1.695e-01 0.619 0.535958
JOBManager -5.572e-01 1.716e-01 -3.248 0.001161 **
JOBProfessional 1.619e-01 1.784e-01 0.907 0.364168
JOBStudent 2.161e-01 2.145e-01 1.007 0.313729
JOBz_Blue Collar 3.106e-01 1.856e-01 1.674 0.094158 .
TRAVTIME 1.457e-02 1.883e-03 7.736 1.03e-14 ***
CAR_USEPrivate -7.564e-01 9.172e-02 -8.247 < 2e-16 ***
BLUEBOOK -2.084e-05 5.263e-06 -3.959 7.52e-05 ***
TIF -5.547e-02 7.344e-03 -7.553 4.26e-14 ***
CAR_TYPEPanel Truck 5.607e-01 1.618e-01 3.466 0.000528 ***
CAR_TYPEPickup 5.540e-01 1.007e-01 5.500 3.80e-08 ***
CAR_TYPESports Car 1.025e+00 1.299e-01 7.893 2.95e-15 ***
CAR_TYPEVan 6.186e-01 1.265e-01 4.891 1.00e-06 ***
CAR_TYPEz_SUV 7.682e-01 1.113e-01 6.904 5.05e-12 ***
RED_CARyes -9.728e-03 8.636e-02 -0.113 0.910313
OLDCLAIM -1.389e-05 3.910e-06 -3.554 0.000380 ***
CLM_FREQ 1.959e-01 2.855e-02 6.864 6.69e-12 ***
REVOKEDYes 8.874e-01 9.133e-02 9.716 < 2e-16 ***
MVR_PTS 1.133e-01 1.361e-02 8.324 < 2e-16 ***
CAR_AGE -7.196e-04 7.549e-03 -0.095 0.924053
URBANICITYz_Highly Rural/ Rural -2.390e+00 1.128e-01 -21.181 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 9418.0 on 8160 degrees of freedom
Residual deviance: 7297.6 on 8123 degrees of freedom
AIC: 7373.6
Number of Fisher Scoring iterations: 5
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 5551 1235
1 457 918
Accuracy : 0.7927
95% CI : (0.7837, 0.8014)
No Information Rate : 0.7362
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.3963
Mcnemar's Test P-Value : < 2.2e-16
Sensitivity : 0.4264
Specificity : 0.9239
Pos Pred Value : 0.6676
Neg Pred Value : 0.8180
Prevalence : 0.2638
Detection Rate : 0.1125
Detection Prevalence : 0.1685
Balanced Accuracy : 0.6752
'Positive' Class : 1
The summary output of the reduced model retains a number of statistically significant predictors.
Call:
glm(formula = TARGET_FLAG ~ URBANICITY + JOB + MVR_PTS + MSTATUS +
CAR_TYPE + REVOKED + KIDSDRIV + CAR_USE + TIF + TRAVTIME +
INCOME + CLM_FREQ + BLUEBOOK + PARENT1 + EDUCATION + HOME_VAL +
OLDCLAIM, family = binomial(link = "logit"), data = mod1data)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.6039 -0.7115 -0.3979 0.6268 3.1440
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.053e+00 2.559e-01 -4.117 3.84e-05 ***
URBANICITYz_Highly Rural/ Rural -2.389e+00 1.128e-01 -21.181 < 2e-16 ***
JOBClerical 4.141e-01 1.965e-01 2.107 0.035104 *
JOBDoctor -4.475e-01 2.667e-01 -1.678 0.093344 .
JOBHome Maker 2.748e-01 2.042e-01 1.346 0.178234
JOBLawyer 9.715e-02 1.692e-01 0.574 0.565740
JOBManager -5.649e-01 1.714e-01 -3.296 0.000980 ***
JOBProfessional 1.548e-01 1.784e-01 0.868 0.385304
JOBStudent 2.751e-01 2.109e-01 1.304 0.192066
JOBz_Blue Collar 3.098e-01 1.855e-01 1.670 0.094879 .
MVR_PTS 1.143e-01 1.359e-02 8.412 < 2e-16 ***
MSTATUSz_No 4.719e-01 7.955e-02 5.932 2.99e-09 ***
CAR_TYPEPanel Truck 6.090e-01 1.509e-01 4.035 5.46e-05 ***
CAR_TYPEPickup 5.503e-01 1.006e-01 5.469 4.53e-08 ***
CAR_TYPESports Car 9.726e-01 1.074e-01 9.054 < 2e-16 ***
CAR_TYPEVan 6.466e-01 1.221e-01 5.295 1.19e-07 ***
CAR_TYPEz_SUV 7.156e-01 8.596e-02 8.324 < 2e-16 ***
REVOKEDYes 8.927e-01 9.123e-02 9.785 < 2e-16 ***
KIDSDRIV 4.176e-01 5.512e-02 7.576 3.57e-14 ***
CAR_USEPrivate -7.574e-01 9.161e-02 -8.268 < 2e-16 ***
TIF -5.538e-02 7.340e-03 -7.545 4.53e-14 ***
TRAVTIME 1.448e-02 1.881e-03 7.699 1.37e-14 ***
INCOME -3.486e-06 1.076e-06 -3.239 0.001199 **
CLM_FREQ 1.963e-01 2.852e-02 6.882 5.91e-12 ***
BLUEBOOK -2.308e-05 4.719e-06 -4.891 1.00e-06 ***
PARENT1Yes 4.602e-01 9.427e-02 4.882 1.05e-06 ***
EDUCATIONBachelors -3.868e-01 1.089e-01 -3.554 0.000380 ***
EDUCATIONMasters -3.032e-01 1.615e-01 -1.878 0.060385 .
EDUCATIONPhD -1.818e-01 2.002e-01 -0.908 0.363825
EDUCATIONz_High School 1.487e-02 9.469e-02 0.157 0.875229
HOME_VAL -1.342e-06 3.407e-07 -3.939 8.18e-05 ***
OLDCLAIM -1.405e-05 3.907e-06 -3.595 0.000324 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 9418.0 on 8160 degrees of freedom
Residual deviance: 7301.8 on 8129 degrees of freedom
AIC: 7365.8
Number of Fisher Scoring iterations: 5
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 5558 1245
1 450 908
Accuracy : 0.7923
95% CI : (0.7833, 0.8011)
No Information Rate : 0.7362
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.3934
Mcnemar's Test P-Value : < 2.2e-16
Sensitivity : 0.4217
Specificity : 0.9251
Pos Pred Value : 0.6686
Neg Pred Value : 0.8170
Prevalence : 0.2638
Detection Rate : 0.1113
Detection Prevalence : 0.1664
Balanced Accuracy : 0.6734
'Positive' Class : 1
We compute McFadden's pseudo R-squared for the logistic regressions and see that the difference between the full model and the reduced model is only marginal.
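A sketch of how these values can be computed: for ungrouped binary data, McFadden's pseudo R-squared is one minus the ratio of the residual deviance to the null deviance.
# McFadden's pseudo R^2 = 1 - (residual deviance / null deviance)
mcfadden <- function(m) round(1 - m$deviance / m$null.deviance, 4)
paste("Full model =", mcfadden(model.full))
paste("Reduced model =", mcfadden(model.reduced))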
[1] "Full model = 0.2251"
[1] "Reduced model = 0.2247"
We diagnose the reduced model for irregularities and violations of assumptions. The relationship between the logit and each continuous predictor looks mostly linear.
library(broom)
# Select only numeric predictors and predict
numdata <- mod1data %>%
dplyr::select_if(is.numeric)
predictors <- colnames(numdata)
probabilities <- predict(model.reduced, type = "response")
# Bind the logit and tidying the data for plot
numdata <- numdata %>%
mutate(logit = log(probabilities/(1-probabilities))) %>%
gather(key = "predictors", value = "predictor.value", -logit)
ggplot(numdata, aes(logit, predictor.value))+
geom_point(size = 0.5, alpha = 0.5) +
geom_smooth(method = "loess") +
theme_bw() +
facet_wrap(~predictors, scales = "free_y")
Inspecting the influential values reveals three observations of interest, all of which have advanced degrees and high BLUEBOOK values, except for one sports car with a lower BLUEBOOK value.
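The excerpt does not show how these observations were identified, nor how the model.data object used in the plot below was built. A sketch using broom, assuming the largest Cook's distances were inspected, is:
library(broom)
# Augment the reduced model with per-observation diagnostics and list the three largest Cook's distances
model.data <- augment(model.reduced) %>% mutate(index = 1:n())
model.data %>% top_n(3, .cooksd)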
Only one outlier is identified, which we will consider removing in a subsequent model.
ggplot(model.data, aes(index, .std.resid)) +
geom_point(aes(color = TARGET_FLAG), alpha = .5) +
theme_bw()
Looking into multicollinearity, we see that the categorical variables JOB
and EDUCATION
have GVIF values greater than 5. These are categorical variables and therefore add degrees of freedom. We will remove them to compare the results, and we will also consider penalized models to deal with multicollinearity.
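The GVIF table below can be produced with the vif() function from the car package; the original call is not shown in the excerpt, so this is a sketch.
library(car)
# Generalized VIF; for multi-level factors, compare GVIF^(1/(2*Df)) across predictors
vif(model.reduced)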
GVIF Df GVIF^(1/(2*Df))
URBANICITY 1.140628 1 1.068002
JOB 18.426984 8 1.199750
MVR_PTS 1.158224 1 1.076208
MSTATUS 1.863154 1 1.364974
CAR_TYPE 2.636934 5 1.101818
REVOKED 1.313343 1 1.146012
KIDSDRIV 1.090527 1 1.044283
CAR_USE 2.445424 1 1.563785
TIF 1.009108 1 1.004544
TRAVTIME 1.038286 1 1.018963
INCOME 2.468963 1 1.571293
CLM_FREQ 1.464061 1 1.209984
BLUEBOOK 1.767416 1 1.329442
PARENT1 1.428915 1 1.195372
EDUCATION 7.952890 4 1.295882
HOME_VAL 1.850544 1 1.360347
OLDCLAIM 1.645591 1 1.282806
As a comparison, we make use of the balanced dataset assembled earlier. We compute both the full and reduced models and store the metrics in the summary dataframe.
mod1data_bal <- data_train_balanced %>% dplyr::select(-c('TARGET_AMT','INDEX'))
modelbal.null <- glm(TARGET_FLAG ~ 1,
data=mod1data_bal,
family = binomial(link="logit")
)
modelbal.full <- glm(TARGET_FLAG ~ .,
data=mod1data_bal,
family=binomial(link="logit")
)
modelbal.reduced <- step(modelbal.null,
scope=list(upper=modelbal.full),
direction="both",
test="Chisq",
trace=0,
data=mod1data_bal)
m1c <- score_model('modelbal.full', modelbal.full, mod1data_bal, output = FALSE)
models.df <- rbind(models.df,m1c$df_info)
m1d <- score_model('modelbal.reduced', modelbal.reduced, mod1data_bal, output = FALSE)
models.df <- rbind(models.df,m1d$df_info)
Here we model using the previous methods but discard the influential observations as well as the outlier.
mod1data_rmv <- data_train %>% filter(!INDEX %in% c(3722, 3592, 4690, 4742)) %>% dplyr::select(-c('TARGET_AMT','INDEX'))
modelrmv.null <- glm(TARGET_FLAG ~ 1,
data=mod1data_rmv,
family = binomial(link="logit")
)
modelrmv.full <- glm(TARGET_FLAG ~ .,
data=mod1data_rmv,
family=binomial(link="logit")
)
modelrmv.reduced <- step(modelrmv.null,
scope=list(upper=modelrmv.full),
direction="both",
test="Chisq",
trace=0,
data=mod1data_rmv)
m1e <- score_model('modelrmv.full', modelrmv.full, mod1data_rmv, output = FALSE)
models.df <- rbind(models.df,m1e$df_info)
m1f <- score_model('modelrmv.reduced', modelrmv.reduced, mod1data_rmv, output = FALSE)
models.df <- rbind(models.df,m1f$df_info)
Finally, we consider the results of these models, which show fairly comparable performance. The models using the balanced data lose accuracy and specificity but gain sensitivity; for the other models, the metrics are comparable. The model modelrmv.reduced
has the smallest AIC overall, although the full version has a slightly higher pseudo R-squared. We proceed with the smaller model in the hope of reducing variance at the expense of introducing a bit of bias.
[1] "Full model = 0.2251"
[1] "Reduced model = 0.2247"
[1] "Full model (bal) = 0.2394"
[1] "Reduced model (bal) = 0.2248"
[1] "Full model (rmv) = 0.2394"
[1] "Reduced model (rmv) = 0.2248"
Since the basic model contained many predictor variables, we take a look at a penalized logistic regression model, which imposes a penalty on the model for having too many predictors. We will identify the best shrinkage factor lambda
through cross-validation with an 80%/20% train/test data partition.
We fit a lasso regression model (alpha = 1) and plot the cross-validation error over the log of lambda. The number of predictors is shown along the top, and the vertical lines mark the optimal (minimum-error) value of lambda as well as the largest lambda that remains within one standard error of the optimum, which gives the most parsimonious model. We will consider models with both of these lambda values.
# Prepare the data
library(glmnet)
mod2_data <- data_train %>% dplyr::select(-c('TARGET_AMT','INDEX'))
# Split the data into training and test set
set.seed(42)
training.samples <- mod2_data$TARGET_FLAG %>% createDataPartition(p = 0.8, list = FALSE)
train.data <- mod2_data[training.samples, ]
test.data <- mod2_data[-training.samples, ]
# Dummy code categorical predictor variables
x <- model.matrix(TARGET_FLAG~., train.data)[,-1]
# Outcome variable (a two-level factor or 0/1 vector; glmnet accepts either for family = "binomial")
y <- train.data$TARGET_FLAG
# Find the best lambda using cross-validation
cv.lasso <- cv.glmnet(x, y, alpha = 1, family = "binomial")
plot(cv.lasso)
[1] "lambda.min = 0.000696605127977698"
paste0('lambda.1se = ',cv.lasso$lambda.1se) # simplest model but also lies within one standard error of the optimal value of lambda
[1] "lambda.1se = 0.00942540015884462"
The columns below compare the coefficients that are kept or dropped by the lasso at the optimal (lambda.min) and most parsimonious (lambda.1se) values of lambda.
# Display regression coefficients
c_min_lambda <- coef(cv.lasso, cv.lasso$lambda.min)
c_1se_lambda <- coef(cv.lasso, cv.lasso$lambda.1se)
cbind(c_min_lambda, c_1se_lambda)
38 x 2 sparse Matrix of class "dgCMatrix"
1 1
(Intercept) -6.497808e-01 -3.519117e-01
KIDSDRIV 3.878002e-01 2.840127e-01
AGE -2.204067e-03 -1.070443e-03
HOMEKIDS 3.772890e-02 2.514669e-02
YOJ -1.322932e-02 -3.760041e-03
INCOME -3.288182e-06 -3.666900e-06
PARENT1Yes 4.329532e-01 4.124385e-01
HOME_VAL -1.113993e-06 -1.152107e-06
MSTATUSz_No 4.696062e-01 2.913970e-01
SEXz_F . .
EDUCATIONBachelors -3.584300e-01 -6.312949e-02
EDUCATIONMasters -3.054436e-01 -1.464263e-02
EDUCATIONPhD -2.110436e-01 .
EDUCATIONz_High School 1.692823e-02 1.215931e-01
JOBClerical 3.211816e-01 1.264661e-01
JOBDoctor -4.363530e-01 -6.694927e-02
JOBHome Maker 1.540398e-01 .
JOBLawyer -2.538496e-02 .
JOBManager -6.927927e-01 -5.685217e-01
JOBProfessional . .
JOBStudent 3.713066e-02 .
JOBz_Blue Collar 1.440341e-01 .
TRAVTIME 1.342723e-02 8.283642e-03
CAR_USEPrivate -7.674998e-01 -7.250352e-01
BLUEBOOK -2.278061e-05 -1.615039e-05
TIF -5.505445e-02 -3.747593e-02
CAR_TYPEPanel Truck 5.143203e-01 .
CAR_TYPEPickup 5.117494e-01 4.698230e-02
CAR_TYPESports Car 8.750077e-01 3.391422e-01
CAR_TYPEVan 6.136813e-01 1.722648e-03
CAR_TYPEz_SUV 6.625531e-01 2.148920e-01
RED_CARyes 8.537390e-03 .
OLDCLAIM -1.452753e-05 .
CLM_FREQ 1.987771e-01 1.324388e-01
REVOKEDYes 8.481732e-01 5.503006e-01
MVR_PTS 1.053631e-01 9.197091e-02
CAR_AGE -5.696433e-04 -7.149609e-03
URBANICITYz_Highly Rural/ Rural -2.287235e+00 -1.795670e+00
When comparing the accuracy of the two lasso-penalized models, we see that the difference in accuracy is marginal and slightly lower than that of the simple logit models. For this reason, our selected best model will be the reduced model with influential observations and outliers removed.
# Final model with lambda.min
lasso.model.min <- glmnet(x, y, alpha = 1, family = "binomial",
lambda = cv.lasso$lambda.min)
# Make predictions on the held-out test data
x.test <- model.matrix(TARGET_FLAG ~., test.data)[,-1]
# Note: predict.glmnet returns the linear predictor (log-odds) by default; add type = "response" for probabilities.
# The 0.5 cutoff below is therefore applied on the log-odds scale.
probabilities <- lasso.model.min %>% predict(newx = x.test)
predicted.classes <- as.factor(ifelse(probabilities > 0.5, 1, 0))
# Model accuracy rate
observed.classes <- as.factor(test.data$TARGET_FLAG)
mean(predicted.classes == observed.classes)
[1] 0.7786634
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 1163 323
1 38 107
Accuracy : 0.7787
95% CI : (0.7577, 0.7986)
No Information Rate : 0.7364
P-Value [Acc > NIR] : 4.489e-05
Kappa : 0.2759
Mcnemar's Test P-Value : < 2.2e-16
Sensitivity : 0.2488
Specificity : 0.9684
Pos Pred Value : 0.7379
Neg Pred Value : 0.7826
Prevalence : 0.2636
Detection Rate : 0.0656
Detection Prevalence : 0.0889
Balanced Accuracy : 0.6086
'Positive' Class : 1
# Final model with lambda.1se
lasso.model.se1 <- glmnet(x, y, alpha = 1, family = "binomial",
lambda = cv.lasso$lambda.1se)
# Make predictions on the held-out test data
x.test <- model.matrix(TARGET_FLAG ~., test.data)[,-1]
# As above, predict.glmnet returns log-odds by default, so the 0.5 cutoff is applied on the log-odds scale.
probabilities <- lasso.model.se1 %>% predict(newx = x.test)
predicted.classes <- as.factor(ifelse(probabilities > 0.5, 1, 0))
# Model accuracy rate
observed.classes <- as.factor(test.data$TARGET_FLAG)
#mean(predicted.classes == observed.classes)
print(confusionMatrix(predicted.classes,observed.classes, positive = "1"))
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 1184 364
1 17 66
Accuracy : 0.7664
95% CI : (0.7451, 0.7867)
No Information Rate : 0.7364
P-Value [Acc > NIR] : 0.002925
Kappa : 0.188
Mcnemar's Test P-Value : < 2.2e-16
Sensitivity : 0.15349
Specificity : 0.98585
Pos Pred Value : 0.79518
Neg Pred Value : 0.76486
Prevalence : 0.26364
Detection Rate : 0.04047
Detection Prevalence : 0.05089
Balanced Accuracy : 0.56967
'Positive' Class : 1
We subset the dataset to pick out only the observations that were involved in accidents. This is also the population that has previously filed claims, so this information will be used to predict TARGET_AMT
.
The simple multiple regression model performs poorly, with a very low R-squared value. For this reason, we consider a weighted model as a comparison.
data_train2 <- data_train %>% dplyr::select(c(-'INDEX')) %>% filter(TARGET_FLAG==1) %>% dplyr::select(-c('TARGET_FLAG'))
mod3a <- lm(TARGET_AMT ~ ., data_train2)
summary(mod3a)
Call:
lm(formula = TARGET_AMT ~ ., data = data_train2)
Residuals:
Min 1Q Median 3Q Max
-8943 -3176 -1501 480 99578
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.396e+03 1.845e+03 1.841 0.0658 .
KIDSDRIV -1.719e+02 3.166e+02 -0.543 0.5873
AGE 1.835e+01 2.124e+01 0.864 0.3879
HOMEKIDS 2.135e+02 2.071e+02 1.031 0.3028
YOJ 1.916e+01 4.918e+01 0.390 0.6969
INCOME -9.014e-03 6.742e-03 -1.337 0.1814
PARENT1Yes 2.772e+02 5.873e+02 0.472 0.6369
HOME_VAL 2.198e-03 2.020e-03 1.088 0.2767
MSTATUSz_No 8.036e+02 4.935e+02 1.628 0.1036
SEXz_F -1.397e+03 6.564e+02 -2.129 0.0334 *
EDUCATIONBachelors 2.606e+02 6.419e+02 0.406 0.6848
EDUCATIONMasters 1.194e+03 1.084e+03 1.102 0.2707
EDUCATIONPhD 2.396e+03 1.312e+03 1.827 0.0679 .
EDUCATIONz_High School -3.954e+02 5.145e+02 -0.769 0.4423
JOBClerical 3.084e+02 1.203e+03 0.256 0.7977
JOBDoctor -2.116e+03 1.762e+03 -1.201 0.2299
JOBHome Maker -2.353e+01 1.266e+03 -0.019 0.9852
JOBLawyer 3.272e+02 1.029e+03 0.318 0.7506
JOBManager -7.794e+02 1.066e+03 -0.731 0.4647
JOBProfessional 1.061e+03 1.129e+03 0.940 0.3475
JOBStudent 1.147e+02 1.286e+03 0.089 0.9289
JOBz_Blue Collar 5.206e+02 1.146e+03 0.454 0.6497
TRAVTIME 7.528e-01 1.108e+01 0.068 0.9458
CAR_USEPrivate -4.384e+02 5.216e+02 -0.840 0.4008
BLUEBOOK 1.245e-01 3.053e-02 4.077 4.73e-05 ***
TIF -1.574e+01 4.252e+01 -0.370 0.7112
CAR_TYPEPanel Truck -6.403e+02 9.605e+02 -0.667 0.5051
CAR_TYPEPickup -5.637e+01 5.968e+02 -0.094 0.9248
CAR_TYPESports Car 1.061e+03 7.502e+02 1.414 0.1575
CAR_TYPEVan 6.335e+01 7.707e+02 0.082 0.9345
CAR_TYPEz_SUV 9.025e+02 6.668e+02 1.354 0.1760
RED_CARyes -1.933e+02 4.965e+02 -0.389 0.6970
OLDCLAIM 2.502e-02 2.263e-02 1.105 0.2691
CLM_FREQ -1.154e+02 1.580e+02 -0.731 0.4651
REVOKEDYes -1.125e+03 5.166e+02 -2.177 0.0296 *
MVR_PTS 1.108e+02 6.853e+01 1.617 0.1061
CAR_AGE -9.842e+01 4.406e+01 -2.233 0.0256 *
URBANICITYz_Highly Rural/ Rural -9.778e+01 7.562e+02 -0.129 0.8971
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 7690 on 2115 degrees of freedom
Multiple R-squared: 0.03061, Adjusted R-squared: 0.01365
F-statistic: 1.805 on 37 and 2115 DF, p-value: 0.002183
We build the reduced model as before and output the summary. The adjusted R-squared remains very low at 1.8%. The diagnostic Q-Q plot reveals a large deviation from normality in the upper quantiles that heavily affects the results.
modeltgt.null <- lm(TARGET_AMT ~ 1, data=data_train2)
modeltgt.full <- lm(TARGET_AMT ~ ., data=data_train2)
modeltgt.reduced <- step(modeltgt.null,
scope=list(upper=modeltgt.full),
direction="both",
trace=0,
data=data_train2)
summary(modeltgt.reduced)
Call:
lm(formula = TARGET_AMT ~ BLUEBOOK + MVR_PTS + SEX + REVOKED +
MSTATUS + CAR_AGE, data = data_train2)
Residuals:
Min 1Q Median 3Q Max
-8203 -3136 -1559 389 100457
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4374.99590 489.77634 8.933 < 2e-16 ***
BLUEBOOK 0.11270 0.02035 5.537 3.46e-08 ***
MVR_PTS 122.96517 64.23189 1.914 0.0557 .
SEXz_F -637.79264 334.17405 -1.909 0.0565 .
REVOKEDYes -694.16132 409.27610 -1.696 0.0900 .
MSTATUSz_No 542.20258 331.58446 1.635 0.1022
CAR_AGE -49.32254 31.56917 -1.562 0.1183
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 7672 on 2146 degrees of freedom
Multiple R-squared: 0.02104, Adjusted R-squared: 0.0183
F-statistic: 7.685 on 6 and 2146 DF, p-value: 3.452e-08
We explore a weighted least squares model, hoping for better performance, and manage a bump up to roughly 6.8% adjusted R-squared. Compared to the previous model, the upper range of the Q-Q plot increases rapidly.
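The code fitting this baseline OLS model on the full training set is not included in the excerpt; reconstructed from the Call line in the output below, it would be:
# Baseline OLS on all training rows (reconstructed from the Call shown below)
lm4 <- lm(TARGET_AMT ~ . - INDEX - TARGET_FLAG, data = data_train)
summary(lm4)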
Call:
lm(formula = TARGET_AMT ~ . - INDEX - TARGET_FLAG, data = data_train)
Residuals:
Min 1Q Median 3Q Max
-5891 -1699 -760 344 103785
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.065e+03 5.572e+02 1.912 0.055974 .
KIDSDRIV 3.142e+02 1.132e+02 2.777 0.005507 **
AGE 5.221e+00 7.066e+00 0.739 0.459973
HOMEKIDS 7.764e+01 6.535e+01 1.188 0.234873
YOJ -3.947e+00 1.509e+01 -0.262 0.793583
INCOME -4.414e-03 1.805e-03 -2.445 0.014499 *
PARENT1Yes 5.761e+02 2.020e+02 2.852 0.004360 **
HOME_VAL -5.574e-04 5.908e-04 -0.944 0.345402
MSTATUSz_No 5.703e+02 1.449e+02 3.937 8.33e-05 ***
SEXz_F -3.685e+02 1.838e+02 -2.005 0.044978 *
EDUCATIONBachelors -2.585e+02 2.048e+02 -1.262 0.207014
EDUCATIONMasters 2.427e+01 3.000e+02 0.081 0.935520
EDUCATIONPhD 2.871e+02 3.560e+02 0.807 0.419960
EDUCATIONz_High School -8.875e+01 1.719e+02 -0.516 0.605722
JOBClerical 5.292e+02 3.414e+02 1.550 0.121192
JOBDoctor -4.998e+02 4.087e+02 -1.223 0.221451
JOBHome Maker 3.523e+02 3.646e+02 0.966 0.334044
JOBLawyer 2.308e+02 2.956e+02 0.781 0.435074
JOBManager -4.788e+02 2.885e+02 -1.660 0.096959 .
JOBProfessional 4.566e+02 3.088e+02 1.479 0.139258
JOBStudent 2.876e+02 3.740e+02 0.769 0.441842
JOBz_Blue Collar 5.077e+02 3.218e+02 1.577 0.114754
TRAVTIME 1.195e+01 3.222e+00 3.709 0.000209 ***
CAR_USEPrivate -7.796e+02 1.644e+02 -4.742 2.16e-06 ***
BLUEBOOK 1.433e-02 8.623e-03 1.662 0.096568 .
TIF -4.820e+01 1.218e+01 -3.958 7.63e-05 ***
CAR_TYPEPanel Truck 2.651e+02 2.782e+02 0.953 0.340636
CAR_TYPEPickup 3.757e+02 1.707e+02 2.201 0.027747 *
CAR_TYPESports Car 1.022e+03 2.178e+02 4.692 2.75e-06 ***
CAR_TYPEVan 5.146e+02 2.132e+02 2.414 0.015815 *
CAR_TYPEz_SUV 7.516e+02 1.793e+02 4.192 2.80e-05 ***
RED_CARyes -4.822e+01 1.491e+02 -0.323 0.746335
OLDCLAIM -1.056e-02 7.436e-03 -1.420 0.155686
CLM_FREQ 1.417e+02 5.503e+01 2.575 0.010028 *
REVOKEDYes 5.496e+02 1.735e+02 3.167 0.001545 **
MVR_PTS 1.753e+02 2.592e+01 6.764 1.44e-11 ***
CAR_AGE -2.690e+01 1.280e+01 -2.102 0.035598 *
URBANICITYz_Highly Rural/ Rural -1.665e+03 1.394e+02 -11.943 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4544 on 8123 degrees of freedom
Multiple R-squared: 0.07104, Adjusted R-squared: 0.06681
F-statistic: 16.79 on 37 and 8123 DF, p-value: < 2.2e-16
# WLS weights: inverse of the squared fitted values from regressing |residuals| on the fitted values
sd <- 1 / lm(abs(lm4$residuals) ~ lm4$fitted.values)$fitted.values^2  # note: this masks base R's sd()
lm4_wls <- lm(TARGET_AMT ~ . -INDEX -TARGET_FLAG, data = data_train, weights = sd)
# Backward stepwise selection by AIC (stepAIC has no "leapBackward" method; direction controls the search)
lm4_wls_step <- stepAIC(lm4_wls, direction = "backward", trace = FALSE)
summary(lm4_wls_step)
Call:
lm(formula = TARGET_AMT ~ KIDSDRIV + AGE + HOMEKIDS + INCOME +
PARENT1 + HOME_VAL + MSTATUS + SEX + EDUCATION + JOB + TRAVTIME +
CAR_USE + BLUEBOOK + TIF + CAR_TYPE + RED_CAR + CLM_FREQ +
REVOKED + MVR_PTS + CAR_AGE + URBANICITY, data = data_train,
weights = sd)
Weighted Residuals:
Min 1Q Median 3Q Max
-5.784 -0.624 -0.535 -0.008 99.897
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.763e+02 9.580e+01 8.103 6.12e-16 ***
KIDSDRIV 1.001e+02 5.096e+01 1.965 0.049476 *
AGE 4.812e+00 1.029e+00 4.677 2.96e-06 ***
HOMEKIDS 6.926e+01 1.196e+01 5.789 7.35e-09 ***
INCOME -2.047e-03 2.998e-04 -6.827 9.28e-12 ***
PARENT1Yes 4.701e+02 1.642e+02 2.863 0.004211 **
HOME_VAL -2.688e-04 1.121e-04 -2.398 0.016512 *
MSTATUSz_No 2.707e+02 3.861e+01 7.011 2.56e-12 ***
SEXz_F -1.659e+02 2.281e+01 -7.275 3.78e-13 ***
EDUCATIONBachelors -7.303e+01 3.950e+01 -1.849 0.064526 .
EDUCATIONMasters 2.735e+01 4.721e+01 0.579 0.562374
EDUCATIONPhD 1.465e+02 5.415e+01 2.706 0.006831 **
EDUCATIONz_High School -5.321e+01 3.138e+01 -1.696 0.089998 .
JOBClerical 1.607e+02 6.856e+01 2.345 0.019076 *
JOBDoctor -2.739e+02 7.890e+01 -3.471 0.000521 ***
JOBHome Maker 2.728e+01 6.187e+01 0.441 0.659250
JOBLawyer -1.594e+01 6.305e+01 -0.253 0.800387
JOBManager -2.278e+02 7.034e+01 -3.238 0.001209 **
JOBProfessional 2.838e+02 6.985e+01 4.063 4.90e-05 ***
JOBStudent 2.943e+01 7.261e+01 0.405 0.685318
JOBz_Blue Collar 1.655e+02 7.053e+01 2.347 0.018945 *
TRAVTIME 6.244e+00 4.988e-01 12.518 < 2e-16 ***
CAR_USEPrivate -3.486e+02 5.398e+01 -6.458 1.12e-10 ***
BLUEBOOK 4.665e-03 1.357e-03 3.437 0.000592 ***
TIF -2.842e+01 2.052e+00 -13.855 < 2e-16 ***
CAR_TYPEPanel Truck -8.041e+01 1.421e+02 -0.566 0.571436
CAR_TYPEPickup 1.511e+02 3.219e+01 4.693 2.74e-06 ***
CAR_TYPESports Car 4.998e+02 6.071e+01 8.232 < 2e-16 ***
CAR_TYPEVan 5.952e+01 5.357e+01 1.111 0.266600
CAR_TYPEz_SUV 3.734e+02 3.196e+01 11.682 < 2e-16 ***
RED_CARyes 3.275e+01 2.043e+01 1.603 0.108952
CLM_FREQ 1.554e+02 3.033e+01 5.124 3.06e-07 ***
REVOKEDYes 2.367e+02 1.119e+02 2.116 0.034414 *
MVR_PTS 6.601e+01 1.075e+01 6.140 8.65e-10 ***
CAR_AGE -9.653e+00 1.688e+00 -5.719 1.11e-08 ***
URBANICITYz_Highly Rural/ Rural -6.903e+02 4.767e+01 -14.482 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.308 on 8125 degrees of freedom
Multiple R-squared: 0.07231, Adjusted R-squared: 0.06831
F-statistic: 18.09 on 35 and 8125 DF, p-value: < 2.2e-16
We use the final selected models to make the predictions and write the results to file. To predict TARGET_FLAG
we use modelrmv.reduced
, and we use the WLS model for TARGET_AMT
. Once the individuals who are predicted to have accidents are identified, we filter them and predict the claim amount for that subset.
eval_probs <- predict(modelrmv.reduced, newdata=data_test %>% dplyr::select(-INDEX, -TARGET_AMT), type='response')
data_test$TARGET_FLAG <- ifelse(eval_probs > 0.5, 1, 0)
data_test$TARGET_AMT <- 0.0
data_test[data_test$TARGET_FLAG == 1,]$TARGET_AMT <- predict(lm4_wls_step, newdata=data_test %>% filter(TARGET_FLAG==1) %>% dplyr::select(-c('TARGET_FLAG','INDEX')))
data_test
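Finally, the predictions can be written out. The file name below is an assumption; the excerpt does not show the write step.
# Write the evaluation-set predictions to disk (file name is illustrative)
write.csv(data_test, "insurance_eval_predictions.csv", row.names = FALSE)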