In this homework assignment, we will explore, analyze, and model a data set containing approximately 8,000 records, each representing a customer at an auto insurance company. Each record has two response variables. The first response variable, TARGET_FLAG, is a 1 or a 0: a "1" means that the person was in a car crash, and a "0" means that the person was not. The second response variable is TARGET_AMT. This value is zero if the person did not crash their car; if they did, it is a value greater than zero.
The objective is to build multiple linear regression and binary logistic regression models on the training data to predict the probability that a person will crash their car and also the amount of money it will cost if the person does crash their car. We can only use the variables given to us (or variables that we derive from the variables provided).
Below, we load the training and evaluation data and summarize the variables of interest in the data set:
data_train <- read.csv("https://raw.githubusercontent.com/Meccamarshall/Data621/refs/heads/main/Homework4/insurance_training_data.csv", header = TRUE)
data_test <- read.csv("https://raw.githubusercontent.com/Meccamarshall/Data621/refs/heads/main/Homework4/insurance-evaluation-data.csv", header = TRUE)
The dataset consists of 26 variables and 8,161 observations, with the AGE, YOJ, and CAR_AGE variables containing some missing values. As stated previously, TARGET_FLAG and TARGET_AMT are our response variables. Also, 13 of the variables have discrete values and the rest are continuous.
Name | data_train |
---|---|
Number of rows | 8161 |
Number of columns | 26 |
Character columns | 14 |
Numeric columns | 12 |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
INCOME | 0 | 1 | 0 | 8 | 445 | 6613 | 0 |
PARENT1 | 0 | 1 | 2 | 3 | 0 | 2 | 0 |
HOME_VAL | 0 | 1 | 0 | 8 | 464 | 5107 | 0 |
MSTATUS | 0 | 1 | 3 | 4 | 0 | 2 | 0 |
SEX | 0 | 1 | 1 | 3 | 0 | 2 | 0 |
EDUCATION | 0 | 1 | 3 | 13 | 0 | 5 | 0 |
JOB | 0 | 1 | 0 | 13 | 526 | 9 | 0 |
CAR_USE | 0 | 1 | 7 | 10 | 0 | 2 | 0 |
BLUEBOOK | 0 | 1 | 6 | 7 | 0 | 2789 | 0 |
CAR_TYPE | 0 | 1 | 3 | 11 | 0 | 6 | 0 |
RED_CAR | 0 | 1 | 2 | 3 | 0 | 2 | 0 |
OLDCLAIM | 0 | 1 | 2 | 7 | 0 | 2857 | 0 |
REVOKED | 0 | 1 | 2 | 3 | 0 | 2 | 0 |
URBANICITY | 0 | 1 | 19 | 21 | 0 | 2 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
INDEX | 0 | 1.00 | 5151.87 | 2978.89 | 1 | 2559 | 5133 | 7745 | 10302.0 | ▇▇▇▇▇ |
TARGET_FLAG | 0 | 1.00 | 0.26 | 0.44 | 0 | 0 | 0 | 1 | 1.0 | ▇▁▁▁▃ |
TARGET_AMT | 0 | 1.00 | 1504.32 | 4704.03 | 0 | 0 | 0 | 1036 | 107586.1 | ▇▁▁▁▁ |
KIDSDRIV | 0 | 1.00 | 0.17 | 0.51 | 0 | 0 | 0 | 0 | 4.0 | ▇▁▁▁▁ |
AGE | 6 | 1.00 | 44.79 | 8.63 | 16 | 39 | 45 | 51 | 81.0 | ▁▆▇▂▁ |
HOMEKIDS | 0 | 1.00 | 0.72 | 1.12 | 0 | 0 | 0 | 1 | 5.0 | ▇▂▁▁▁ |
YOJ | 454 | 0.94 | 10.50 | 4.09 | 0 | 9 | 11 | 13 | 23.0 | ▂▃▇▃▁ |
TRAVTIME | 0 | 1.00 | 33.49 | 15.91 | 5 | 22 | 33 | 44 | 142.0 | ▇▇▁▁▁ |
TIF | 0 | 1.00 | 5.35 | 4.15 | 1 | 1 | 4 | 7 | 25.0 | ▇▆▁▁▁ |
CLM_FREQ | 0 | 1.00 | 0.80 | 1.16 | 0 | 0 | 0 | 2 | 5.0 | ▇▂▁▁▁ |
MVR_PTS | 0 | 1.00 | 1.70 | 2.15 | 0 | 0 | 1 | 3 | 13.0 | ▇▂▁▁▁ |
CAR_AGE | 510 | 0.94 | 8.33 | 5.70 | -3 | 1 | 8 | 12 | 28.0 | ▆▇▇▃▁ |
The currency symbols and thousands separators present in some values (INCOME, HOME_VAL, BLUEBOOK, and OLDCLAIM) will disrupt our analysis, so we strip them and convert those values to numeric.
# Remove "$" and "," from a currency-formatted string and convert to numeric
strip_dollars <- function(x){
  x <- as.character(x)
  x <- gsub(",", "", x)
  x <- gsub("\\$", "", x)
  as.numeric(x)
}
# Convert the four currency-formatted columns to numeric
# (strip_dollars is vectorized, so no row-by-row processing is needed)
fix_formatting <- function(messy_df){
  messy_df %>%
    mutate(across(c(INCOME, HOME_VAL, BLUEBOOK, OLDCLAIM), strip_dollars))
}
We observed that some variables categorized as discrete actually have a high number of unique values. Upon closer inspection of the variable descriptions, we found that although these variables are encoded as factors, they are in fact continuous. Additionally, the TARGET_FLAG variable is listed as numeric in the summary but should be a binary factor. We'll now correct these data types accordingly.
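A minimal sketch of the type correction, assuming the currency columns are already handled by fix_formatting above; the helper name fix_data_types is ours, not from the original report.
# Recode TARGET_FLAG (read in as numeric) as a binary factor
fix_data_types <- function(df){
  df %>%
    mutate(TARGET_FLAG = factor(TARGET_FLAG, levels = c(0, 1)))
}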
Additionally, some values appear to be invalid (e.g., -3 in CAR_AGE). Since missing values in both YOJ and CAR_AGE are under 5%, we can replace them with the median. We calculate the medians using the training set only and then apply them to both the training and testing sets to prevent data leakage.
We apply the processing steps above to both the training and testing datasets.
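A sketch of that pipeline, assuming the helpers above; the names train_medians and impute_missing are ours, and the medians come from the training set only. AGE (6 missing values) is imputed as well so that the models can use all 8,161 observations.
# Training-set medians, treating the invalid negative CAR_AGE as missing
train_medians <- data_train %>%
  summarise(AGE     = median(AGE, na.rm = TRUE),
            YOJ     = median(YOJ, na.rm = TRUE),
            CAR_AGE = median(ifelse(CAR_AGE < 0, NA, CAR_AGE), na.rm = TRUE))
# Replace invalid and missing values with the training-set medians
impute_missing <- function(df, med){
  df %>%
    mutate(CAR_AGE = ifelse(CAR_AGE < 0, NA, CAR_AGE),   # -3 is not a valid car age
           AGE     = coalesce(AGE, med$AGE),
           YOJ     = coalesce(YOJ, med$YOJ),
           CAR_AGE = coalesce(CAR_AGE, med$CAR_AGE))
}
data_train <- data_train %>% fix_formatting() %>% fix_data_types() %>% impute_missing(train_medians)
data_test  <- data_test  %>% fix_formatting() %>% fix_data_types() %>% impute_missing(train_medians)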
We proceed to examine the distribution of TARGET_FLAG across the numeric variables. Notably, BLUEBOOK, INCOME, and OLDCLAIM contain a higher number of outliers relative to other variables. We also observe that older customers and those with older vehicles, higher home values, or higher incomes are generally involved in fewer car accidents. Conversely, individuals with motor vehicle record points or a high number of prior claims tend to have more accidents.
plot_vars <- c("TARGET_FLAG", names(keep(data_train, is.numeric)))
data_train[plot_vars] %>%
dplyr::select(-INDEX, -TARGET_AMT) %>%
gather(variable, value, -TARGET_FLAG) %>%
ggplot(., aes(TARGET_FLAG, value, color=TARGET_FLAG)) +
geom_boxplot() +
scale_color_brewer(palette="Set1") +
theme_light() +
theme(legend.position = "none") +
facet_wrap(~variable, scales ="free", ncol = 4) +
labs(x = element_blank(), y = element_blank())
data_train %>%
dplyr::select(-TARGET_FLAG, -TARGET_AMT, -INDEX) %>%
keep(is.numeric) %>%
gather() %>%
ggplot(aes(x= value)) +
geom_histogram(fill='pink') +
facet_wrap(~key, scales = 'free')
The variables displayed below (OLDCLAIM, INCOME, BLUEBOOK, and HOME_VAL) need scale transformations. We see that several variables have a high number of zeros. AGE is the only variable that is approximately normally distributed; the rest of the variables show some skewness. We will perform a Box-Cox transformation on the skewed variables.
data_train %>%
dplyr::select(OLDCLAIM, INCOME, BLUEBOOK, HOME_VAL) %>%
gather() %>%
ggplot(aes(x= value)) +
geom_histogram(fill='pink') +
facet_wrap(~key, scales = 'free')
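A minimal sketch of the Box-Cox step described above, assuming caret::preProcess (caret is already used for confusionMatrix later in the report). Box-Cox requires strictly positive values, so the zero-heavy columns are shifted by 1 first; the result is kept in a separate data_train_bc object (our name) for comparison.
library(caret)
bc_cols <- c("OLDCLAIM", "INCOME", "BLUEBOOK", "HOME_VAL")
# Estimate one Box-Cox lambda per column on the shifted, strictly positive values
bc_pre <- preProcess(data_train[bc_cols] + 1, method = "BoxCox")
data_train_bc <- data_train
data_train_bc[bc_cols] <- predict(bc_pre, data_train[bc_cols] + 1)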
data_train %>%
keep(is.numeric) %>%
gather(variable, value, -TARGET_AMT, -INDEX, -CLM_FREQ, -MVR_PTS) %>%
ggplot(., aes(value, TARGET_AMT)) +
geom_point() +
scale_color_brewer(palette="Set1") +
theme_light() +
theme(legend.position = "none") +
facet_wrap(~variable, scales ="free", ncol = 3) +
labs(x = element_blank(), y = element_blank())
We observe that CLM_FREQ, MVR_PTS, and HOME_VAL are the variables most positively correlated with our response variable, TARGET_AMT, indicating that higher claim frequency, motor vehicle record points, and home values are associated with higher target amounts. The remaining variables have relatively weak correlations with the response. Additionally, we notice a moderate negative correlation between HOMEKIDS and INCOME, suggesting that higher-income households tend to have fewer children at home.
corr_dataframe <- data_train %>%
mutate_if(is.factor, as.numeric) %>%
mutate_if(is.character, as.numeric) %>%
select_if(is.numeric)
q <- cor(corr_dataframe, use = "pairwise.complete.obs")
ggcorrplot(q, type = "upper", outline.color = "white",
ggtheme = theme_classic(),
colors = c("#6D9EC1", "white", "#E46726"),
lab = TRUE, show.legend = FALSE, tl.cex = 8, lab_size = 3)
set.seed(42)
accidents <- data_train %>%
filter(TARGET_FLAG == 1)
ggplot(accidents, aes(x=TARGET_AMT)) +
geom_density(fill='pink') +
theme_light() +
geom_vline(aes(xintercept = mean(TARGET_AMT)), lty=2, col="red") +
geom_label(aes(x=25000, y=0.00015, label=paste("mean =", round(mean(TARGET_AMT),0)))) +
geom_vline(aes(xintercept = median(TARGET_AMT)), lty=2, col="darkgreen") +
geom_label(aes(x=25000, y=0.00010, label=paste("median = ", round(median(TARGET_AMT), 0)))) +
labs(title="TARGET_AMT Density Plot", y="Density", x="TARGET_AMT")
The distribution of TARGET_AMT shows a long right tail. Among the crash records plotted above, the mean payout is $5,702 while the median payout is $4,104, confirming that the distribution is skewed to the right. As expected, both the mean and median are higher for observations classified as outliers. Here we treat values above $10,594 as outliers, the cutoff produced by the standard boxplot rule (1.5 × IQR above the third quartile) computed below.
# Outlier cutoff: the smallest crash amount flagged by the boxplot (1.5 * IQR) rule
outlier <- min(boxplot(data_train[data_train$TARGET_FLAG==1,]$TARGET_AMT, plot=FALSE)$out)
# Flag claims at or above the cutoff as outliers and compare group averages
data_train %>%
  mutate(TARGET_AMT_OUTLIER = ifelse(TARGET_AMT >= outlier, "Yes", "No")) %>%
  group_by(TARGET_AMT_OUTLIER) %>%
  summarise(Mean = mean(TARGET_AMT),
            Median = median(TARGET_AMT))
There is an imbalance in the TARGET_FLAG variable. Let's check the class counts and distribution:

   0    1
6008 2153

        0         1
0.7361843 0.2638157
Only 26% of the records correspond to a crash (TARGET_FLAG = 1), while 74% do not. This is a severely imbalanced data set, which would distort the accuracy score in the model-building step if left untreated. To treat this imbalance, we oversample the minority class.
set.seed(42)
# Size of the minority (crash) and majority (no-crash) classes
minority <- nrow(data_train[data_train$TARGET_FLAG == 1,])
majority <- nrow(data_train[data_train$TARGET_FLAG == 0,])
diff <- majority - minority
# Resample minority-class rows (by INDEX, with replacement) to close the gap
minority_index <- data_train[data_train$TARGET_FLAG == 1,]$INDEX
over_sample_train <- data.frame(INDEX = sample(minority_index, diff, TRUE)) %>%
  merge(data_train, .) %>%
  bind_rows(data_train)
data_train_balanced <- over_sample_train
Check the balance again:

   0    1
6008 6008
Our goal is to predict both TARGET_FLAG and TARGET_AMT. Given that TARGET_FLAG is a discrete response variable, it should be modeled using logistic regression to estimate the probability of an individual being involved in an accident.
# Initialize a df that will store the metrics of the models
models.df <- tibble(id=character(), res.deviance=numeric(), null.deviance=numeric(),
                    aic=numeric(), accuracy=numeric(), sensitivity=numeric(),
                    specificity=numeric())
# A function to extract the relevant metrics from the summary and confusion matrix
score_model <- function(id, model, data, output=FALSE) {
  if (output) print(summary(model))
  glm.probs <- predict(model, type="response")
  # Classify at the 0.5 probability threshold
  glm.pred <- ifelse(glm.probs > 0.5, 1, 0)
  results <- tibble(target=data$TARGET_FLAG, pred=glm.pred) %>%
    mutate(pred.class = as.factor(pred), target.class = as.factor(target))
  # Compute the confusion matrix once and reuse it for all metrics
  cm <- confusionMatrix(results$pred.class, results$target.class, positive = "1")
  if (output) print(cm)
  metrics <- list(res.deviance = model$deviance,
                  null.deviance = model$null.deviance,
                  aic = model$aic,
                  accuracy = cm$overall['Accuracy'],
                  sensitivity = cm$byClass['Sensitivity'],
                  specificity = cm$byClass['Specificity'])
  metrics <- lapply(metrics, round, 3)
  if (output) plot(roc(results$target.class, glm.probs), print.auc = TRUE)
  model.df <- tibble(id=id, res.deviance=metrics$res.deviance, null.deviance=metrics$null.deviance,
                     aic=metrics$aic, accuracy=metrics$accuracy, sensitivity=metrics$sensitivity,
                     specificity=metrics$specificity)
  # Return the scored model together with its metric row
  model.list <- list(model=model, df_info=model.df)
  return(model.list)
}
We create three types of models: null, full, and reduced. The null model includes only the intercept, representing the simplest form. The full model includes all predictors, serving as the most complex model. The reduced model is developed by systematically stepping through predictors between these two bounds, retaining only those that are statistically significant.
mod1data <- data_train %>% dplyr::select(-c('TARGET_AMT','INDEX'))
#mod1data <- data_train_balanced %>% select(-c('TARGET_AMT','INDEX'))
model.null <- glm(TARGET_FLAG ~ 1,
data=mod1data,
family = binomial(link="logit")
)
model.full <- glm(TARGET_FLAG ~ .,
data=mod1data,
family = binomial(link="logit")
)
model.reduced <- step(model.null,
scope = list(upper=model.full),
direction="both",
test="Chisq",
trace=0,
data=mod1data)
m1a <- score_model('model.full', model.full, mod1data, output = TRUE)
Call:
glm(formula = TARGET_FLAG ~ ., family = binomial(link = "logit"),
data = mod1data)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -9.286e-01 3.215e-01 -2.889 0.003869 **
KIDSDRIV 3.862e-01 6.122e-02 6.308 2.82e-10 ***
AGE -1.015e-03 4.020e-03 -0.252 0.800672
HOMEKIDS 4.965e-02 3.713e-02 1.337 0.181119
YOJ -1.105e-02 8.582e-03 -1.288 0.197743
INCOME -3.423e-06 1.081e-06 -3.165 0.001551 **
PARENT1Yes 3.820e-01 1.096e-01 3.485 0.000492 ***
HOME_VAL -1.306e-06 3.420e-07 -3.819 0.000134 ***
MSTATUSz_No 4.938e-01 8.357e-02 5.909 3.45e-09 ***
SEXz_F -8.251e-02 1.120e-01 -0.737 0.461416
EDUCATIONBachelors -3.812e-01 1.157e-01 -3.296 0.000981 ***
EDUCATIONMasters -2.903e-01 1.788e-01 -1.624 0.104397
EDUCATIONPhD -1.677e-01 2.140e-01 -0.784 0.433295
EDUCATIONz_High School 1.764e-02 9.506e-02 0.186 0.852802
JOBClerical 4.107e-01 1.967e-01 2.088 0.036763 *
JOBDoctor -4.458e-01 2.671e-01 -1.669 0.095106 .
JOBHome Maker 2.323e-01 2.102e-01 1.106 0.268915
JOBLawyer 1.049e-01 1.695e-01 0.619 0.535958
JOBManager -5.572e-01 1.716e-01 -3.248 0.001161 **
JOBProfessional 1.619e-01 1.784e-01 0.907 0.364168
JOBStudent 2.161e-01 2.145e-01 1.007 0.313729
JOBz_Blue Collar 3.106e-01 1.856e-01 1.674 0.094158 .
TRAVTIME 1.457e-02 1.883e-03 7.736 1.03e-14 ***
CAR_USEPrivate -7.564e-01 9.172e-02 -8.247 < 2e-16 ***
BLUEBOOK -2.084e-05 5.263e-06 -3.959 7.52e-05 ***
TIF -5.547e-02 7.344e-03 -7.553 4.26e-14 ***
CAR_TYPEPanel Truck 5.607e-01 1.618e-01 3.466 0.000528 ***
CAR_TYPEPickup 5.540e-01 1.007e-01 5.500 3.80e-08 ***
CAR_TYPESports Car 1.025e+00 1.299e-01 7.893 2.95e-15 ***
CAR_TYPEVan 6.186e-01 1.265e-01 4.891 1.00e-06 ***
CAR_TYPEz_SUV 7.682e-01 1.113e-01 6.904 5.05e-12 ***
RED_CARyes -9.728e-03 8.636e-02 -0.113 0.910313
OLDCLAIM -1.389e-05 3.910e-06 -3.554 0.000380 ***
CLM_FREQ 1.959e-01 2.855e-02 6.864 6.69e-12 ***
REVOKEDYes 8.874e-01 9.133e-02 9.716 < 2e-16 ***
MVR_PTS 1.133e-01 1.361e-02 8.324 < 2e-16 ***
CAR_AGE -7.196e-04 7.549e-03 -0.095 0.924053
URBANICITYz_Highly Rural/ Rural -2.390e+00 1.128e-01 -21.181 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 9418.0 on 8160 degrees of freedom
Residual deviance: 7297.6 on 8123 degrees of freedom
AIC: 7373.6
Number of Fisher Scoring iterations: 5
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 5551 1235
1 457 918
Accuracy : 0.7927
95% CI : (0.7837, 0.8014)
No Information Rate : 0.7362
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.3963
Mcnemar's Test P-Value : < 2.2e-16
Sensitivity : 0.4264
Specificity : 0.9239
Pos Pred Value : 0.6676
Neg Pred Value : 0.8180
Prevalence : 0.2638
Detection Rate : 0.1125
Detection Prevalence : 0.1685
Balanced Accuracy : 0.6752
'Positive' Class : 1
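The second summary and confusion matrix below come from scoring the reduced model in the same way; the call shown here is our reconstruction of the chunk omitted from the output.
m1b <- score_model('model.reduced', model.reduced, mod1data, output = TRUE)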
Call:
glm(formula = TARGET_FLAG ~ URBANICITY + JOB + MVR_PTS + MSTATUS +
CAR_TYPE + REVOKED + KIDSDRIV + CAR_USE + TIF + TRAVTIME +
INCOME + CLM_FREQ + BLUEBOOK + PARENT1 + EDUCATION + HOME_VAL +
OLDCLAIM, family = binomial(link = "logit"), data = mod1data)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.053e+00 2.559e-01 -4.117 3.84e-05 ***
URBANICITYz_Highly Rural/ Rural -2.389e+00 1.128e-01 -21.181 < 2e-16 ***
JOBClerical 4.141e-01 1.965e-01 2.107 0.035104 *
JOBDoctor -4.475e-01 2.667e-01 -1.678 0.093344 .
JOBHome Maker 2.748e-01 2.042e-01 1.346 0.178234
JOBLawyer 9.715e-02 1.692e-01 0.574 0.565740
JOBManager -5.649e-01 1.714e-01 -3.296 0.000980 ***
JOBProfessional 1.548e-01 1.784e-01 0.868 0.385304
JOBStudent 2.751e-01 2.109e-01 1.304 0.192066
JOBz_Blue Collar 3.098e-01 1.855e-01 1.670 0.094879 .
MVR_PTS 1.143e-01 1.359e-02 8.412 < 2e-16 ***
MSTATUSz_No 4.719e-01 7.955e-02 5.932 2.99e-09 ***
CAR_TYPEPanel Truck 6.090e-01 1.509e-01 4.035 5.46e-05 ***
CAR_TYPEPickup 5.503e-01 1.006e-01 5.469 4.53e-08 ***
CAR_TYPESports Car 9.726e-01 1.074e-01 9.054 < 2e-16 ***
CAR_TYPEVan 6.466e-01 1.221e-01 5.295 1.19e-07 ***
CAR_TYPEz_SUV 7.156e-01 8.596e-02 8.324 < 2e-16 ***
REVOKEDYes 8.927e-01 9.123e-02 9.785 < 2e-16 ***
KIDSDRIV 4.176e-01 5.512e-02 7.576 3.57e-14 ***
CAR_USEPrivate -7.574e-01 9.161e-02 -8.268 < 2e-16 ***
TIF -5.538e-02 7.340e-03 -7.545 4.53e-14 ***
TRAVTIME 1.448e-02 1.881e-03 7.699 1.37e-14 ***
INCOME -3.486e-06 1.076e-06 -3.239 0.001199 **
CLM_FREQ 1.963e-01 2.852e-02 6.882 5.91e-12 ***
BLUEBOOK -2.308e-05 4.719e-06 -4.891 1.00e-06 ***
PARENT1Yes 4.602e-01 9.427e-02 4.882 1.05e-06 ***
EDUCATIONBachelors -3.868e-01 1.089e-01 -3.554 0.000380 ***
EDUCATIONMasters -3.032e-01 1.615e-01 -1.878 0.060385 .
EDUCATIONPhD -1.818e-01 2.002e-01 -0.908 0.363825
EDUCATIONz_High School 1.487e-02 9.469e-02 0.157 0.875229
HOME_VAL -1.342e-06 3.407e-07 -3.939 8.18e-05 ***
OLDCLAIM -1.405e-05 3.907e-06 -3.595 0.000324 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 9418.0 on 8160 degrees of freedom
Residual deviance: 7301.8 on 8129 degrees of freedom
AIC: 7365.8
Number of Fisher Scoring iterations: 5
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 5558 1245
1 450 908
Accuracy : 0.7923
95% CI : (0.7833, 0.8011)
No Information Rate : 0.7362
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.3934
Mcnemar's Test P-Value : < 2.2e-16
Sensitivity : 0.4217
Specificity : 0.9251
Pos Pred Value : 0.6686
Neg Pred Value : 0.8170
Prevalence : 0.2638
Detection Rate : 0.1113
Detection Prevalence : 0.1664
Balanced Accuracy : 0.6734
'Positive' Class : 1
We calculate McFadden's pseudo R-squared (one minus the ratio of the fitted model's log-likelihood to the null model's log-likelihood) for the logistic regression models and observe that the difference between the full model and the reduced model is minimal.
full_model_r2 <- round(1 - logLik(model.full) / logLik(model.null), 4)
reduced_model_r2 <- round(1 - logLik(model.reduced) / logLik(model.null), 4)
paste0('Full model = ', full_model_r2)
paste0('Reduced model = ', reduced_model_r2)
[1] "Full model = 0.2251"
[1] "Reduced model = 0.2247"
Stepwise model
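mod2data is not defined in the chunks shown; the line below is our reconstruction, assuming it mirrors mod1data but drops TARGET_FLAG instead of TARGET_AMT, so that TARGET_AMT is the response.
mod2data <- data_train %>% dplyr::select(-c('TARGET_FLAG','INDEX'))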
model.null <- lm(TARGET_AMT ~ 1, data = mod2data)
model.full <- lm(TARGET_AMT ~ ., data = mod2data)
set.seed(100)
model.stepwise <- step(model.null,
scope = list(lower = model.null, upper = model.full),
direction = "both",
trace = FALSE)
summary(model.stepwise)
Call:
lm(formula = TARGET_AMT ~ MVR_PTS + URBANICITY + JOB + MSTATUS +
CAR_TYPE + KIDSDRIV + CAR_USE + TIF + TRAVTIME + PARENT1 +
INCOME + REVOKED + CAR_AGE + CLM_FREQ + SEX + BLUEBOOK +
OLDCLAIM, data = mod2data)
Residuals:
Min 1Q Median 3Q Max
-5826 -1689 -763 343 103714
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.245e+03 3.821e+02 3.260 0.001120 **
MVR_PTS 1.759e+02 2.588e+01 6.797 1.14e-11 ***
URBANICITYz_Highly Rural/ Rural -1.660e+03 1.393e+02 -11.916 < 2e-16 ***
JOBClerical 3.376e+02 2.950e+02 1.145 0.252389
JOBDoctor -3.534e+02 3.767e+02 -0.938 0.348266
JOBHome Maker 1.998e+02 3.348e+02 0.597 0.550603
JOBLawyer 1.416e+02 2.872e+02 0.493 0.622105
JOBManager -6.899e+02 2.678e+02 -2.576 0.010002 *
JOBProfessional 1.473e+02 2.677e+02 0.550 0.582146
JOBStudent 1.704e+02 3.259e+02 0.523 0.601163
JOBz_Blue Collar 2.898e+02 2.686e+02 1.079 0.280669
MSTATUSz_No 6.009e+02 1.193e+02 5.035 4.88e-07 ***
CAR_TYPEPanel Truck 3.038e+02 2.750e+02 1.105 0.269248
CAR_TYPEPickup 4.025e+02 1.693e+02 2.377 0.017471 *
CAR_TYPESports Car 1.038e+03 2.162e+02 4.798 1.63e-06 ***
CAR_TYPEVan 5.327e+02 2.120e+02 2.512 0.012014 *
CAR_TYPEz_SUV 7.583e+02 1.784e+02 4.250 2.16e-05 ***
KIDSDRIV 3.759e+02 1.021e+02 3.682 0.000232 ***
CAR_USEPrivate -7.302e+02 1.568e+02 -4.657 3.26e-06 ***
TIF -4.751e+01 1.217e+01 -3.904 9.55e-05 ***
TRAVTIME 1.181e+01 3.220e+00 3.667 0.000247 ***
PARENT1Yes 6.438e+02 1.767e+02 3.644 0.000270 ***
INCOME -4.822e-03 1.562e-03 -3.086 0.002033 **
REVOKEDYes 5.558e+02 1.734e+02 3.205 0.001354 **
CAR_AGE -2.733e+01 1.123e+01 -2.433 0.014978 *
CLM_FREQ 1.438e+02 5.497e+01 2.616 0.008908 **
SEXz_F -3.316e+02 1.608e+02 -2.062 0.039242 *
BLUEBOOK 1.463e-02 8.521e-03 1.717 0.086015 .
OLDCLAIM -1.054e-02 7.432e-03 -1.419 0.156047
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4544 on 8132 degrees of freedom
Multiple R-squared: 0.07011, Adjusted R-squared: 0.06691
F-statistic: 21.9 on 28 and 8132 DF, p-value: < 2.2e-16
Model with manually selected predictors

The variables CLM_FREQ, MVR_PTS, HOME_VAL, and INCOME were chosen for the regression model because they show moderate to strong correlations with TARGET_AMT in the correlation heatmap, suggesting they may be influential predictors of claim amount. CLM_FREQ and MVR_PTS are positively correlated with TARGET_AMT, indicating that higher claim frequency and driving points may relate to higher claims. HOME_VAL and INCOME also exhibit a relationship with TARGET_AMT, implying that home values and incomes could correspond to changes in claim amounts. Additionally, the scatter plots and histograms for these variables highlight some variability and range, which may contribute to predicting TARGET_AMT effectively in a linear regression model.
# Create a model with manually selected predictors
model.manual <- lm(TARGET_AMT ~ CLM_FREQ + MVR_PTS + HOME_VAL + INCOME , data = mod2data)
summary(model.manual)
Call:
lm(formula = TARGET_AMT ~ CLM_FREQ + MVR_PTS + HOME_VAL + INCOME,
data = mod2data)
Residuals:
Min 1Q Median 3Q Max
-4253 -1522 -980 -236 104399
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.334e+03 1.053e+02 12.670 < 2e-16 ***
CLM_FREQ 2.788e+02 4.839e+01 5.761 8.65e-09 ***
MVR_PTS 2.295e+02 2.609e+01 8.798 < 2e-16 ***
HOME_VAL -2.295e-03 4.893e-04 -4.689 2.79e-06 ***
INCOME -1.385e-03 1.321e-03 -1.048 0.295
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4638 on 8156 degrees of freedom
Multiple R-squared: 0.02831, Adjusted R-squared: 0.02783
F-statistic: 59.41 on 4 and 8156 DF, p-value: < 2.2e-16
Model with variable transformations

The INCOME_SQ and AGE_INCOME transformations were chosen to better capture the relationships observed in the graphs. The histogram for INCOME is right-skewed, with a concentration of lower values and a long tail, suggesting that a simple linear effect may not fully capture its influence on TARGET_AMT. By adding INCOME_SQ (the square of INCOME), we allow the model to account for potential nonlinear effects, where higher income levels might impact claim amounts differently. Additionally, the interaction term AGE_INCOME was included to explore the combined effect of age and income on claim amounts, as age may modify the influence of income, which could help capture more complex patterns seen in the data.
mod2data <- mod2data %>%
mutate(INCOME_SQ = INCOME^2, AGE_INCOME = AGE * INCOME)
model.interaction <- lm(TARGET_AMT ~ CLM_FREQ + MVR_PTS + HOME_VAL + INCOME + INCOME_SQ + AGE_INCOME , data = mod2data)
summary(model.interaction)
Call:
lm(formula = TARGET_AMT ~ CLM_FREQ + MVR_PTS + HOME_VAL + INCOME +
INCOME_SQ + AGE_INCOME, data = mod2data)
Residuals:
Min 1Q Median 3Q Max
-4312 -1524 -981 -228 104410
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.332e+03 1.267e+02 10.508 < 2e-16 ***
CLM_FREQ 2.790e+02 4.840e+01 5.765 8.47e-09 ***
MVR_PTS 2.292e+02 2.611e+01 8.780 < 2e-16 ***
HOME_VAL -2.277e-03 4.947e-04 -4.602 4.25e-06 ***
INCOME -1.861e-04 4.760e-03 -0.039 0.969
INCOME_SQ 4.198e-10 1.425e-08 0.029 0.977
AGE_INCOME -2.710e-05 8.766e-05 -0.309 0.757
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4639 on 8154 degrees of freedom
Multiple R-squared: 0.02832, Adjusted R-squared: 0.02761
F-statistic: 39.61 on 6 and 8154 DF, p-value: < 2.2e-16
The stepwise model shows the lowest MSE (20,574,015) and RMSE (4,535.86), while the manual and interaction models have nearly identical MSE and RMSE values (approximately 21,498,780 and 4,636.68), indicating similar predictive power. The R-squared values are low across all models, ranging from 0.028 to 0.070, so each model explains very little of the variance in the target variable.
evaluate_model <- function(model, data) {
# Predictions
predictions <- predict(model, newdata = data)
# Metrics
actual <- data$TARGET_AMT
mse <- mean((predictions - actual)^2)
rmse <- sqrt(mse)
r_squared <- summary(model)$r.squared
adj_r_squared <- summary(model)$adj.r.squared
# Return metrics
tibble(
Model = deparse(substitute(model)),
MSE = round(mse, 2),
RMSE = round(rmse, 2),
R_Squared = round(r_squared, 3),
Adj_R_Squared = round(adj_r_squared, 3)
)
}
# Evaluate all models
results <- bind_rows(
evaluate_model(model.stepwise, mod2data),
evaluate_model(model.manual, mod2data),
evaluate_model(model.interaction, mod2data)
)
print(results)
# A tibble: 3 × 5
  Model                   MSE  RMSE R_Squared Adj_R_Squared
  <chr>                 <dbl> <dbl>     <dbl>         <dbl>
1 model.stepwise    20574015. 4536.     0.07          0.067
2 model.manual      21498780. 4637.     0.028         0.028
3 model.interaction 21498527. 4637.     0.028         0.028
We examine the reduced logistic model for any irregularities and potential violations of assumptions. Each continuous predictor appears to have a roughly linear relationship with the logit of the outcome.
library(broom)
# Select only numeric predictors and predict
numdata <- mod1data %>%
dplyr::select_if(is.numeric)
predictors <- colnames(numdata)
probabilities <- predict(model.reduced, type = "response")
# Bind the logit and tidying the data for plot
numdata <- numdata %>%
mutate(logit = log(probabilities/(1-probabilities))) %>%
gather(key = "predictors", value = "predictor.value", -logit)
ggplot(numdata, aes(logit, predictor.value))+
geom_point(size = 0.5, alpha = 0.5) +
geom_smooth(method = "loess") +
theme_bw() +
facet_wrap(~predictors, scales = "free_y")
Multiple linear regression model diagnostics

All three sets of plots include the following diagnostics: residuals versus fitted values, histograms of residuals, Q-Q plots, and residuals versus leverage. For all the models, the actual versus predicted plots show a high concentration of points near zero, suggesting possible issues with model fit, particularly at larger values. The residuals versus fitted plots and histograms indicate skewness, as most residuals are concentrated near zero with some outliers. The Q-Q plots reveal heavy tails in the residual distribution, indicating deviations from normality. Additionally, the three residuals versus leverage plots show a few influential points. Overall, all three models seem to struggle with fitting certain aspects of the data, with potential issues in normality and residual spread.
visualize_models <- function(models, data, target_col) {
par(mfrow = c(2, 2))
for (i in seq_along(models)) {
model <- models[[i]]
model_name <- names(models)[i]
actual <- data[[target_col]]
predicted <- predict(model, newdata = data)
plot(actual, predicted,
xlab = "Actual Values",
ylab = "Predicted Values",
main = paste("Actual vs Predicted:", model_name),
col = "blue", pch = 16)
abline(0, 1, col = "red", lwd = 2)
residuals <- residuals(model)
plot(predicted, residuals,
xlab = "Fitted Values",
ylab = "Residuals",
main = paste("Residuals vs Fitted:", model_name),
col = "purple", pch = 16)
abline(h = 0, col = "red", lwd = 2)
hist(residuals,
breaks = 20,
main = paste("Histogram of Residuals:", model_name),
xlab = "Residuals",
col = "lightblue")
plot(model, main = paste("Diagnostics:", model_name))
}
par(mfrow = c(1, 1))
}
models <- list(
Manual = model.manual,
Stepwise = model.stepwise,
Interaction = model.interaction
)
visualize_models(models, data = mod2data, target_col = "TARGET_AMT")
Binary logistic model selection

The reduced model is the better choice: it fits nearly as well as the full model (McFadden's pseudo R-squared of 0.2247 vs. 0.2251), has a lower AIC (7365.8 vs. 7373.6), and offers comparable accuracy (0.7923 vs. 0.7927) with fewer predictors. Although the full model has a marginally higher sensitivity, the reduced model performs equally well or better in terms of specificity, making it a more efficient and effective model overall.
Multiple linear regression model selection

Based on the evaluation metrics, the stepwise model performs slightly better than the manual and interaction models, with the lowest MSE and RMSE values. However, all models have relatively low R-squared values, indicating that they do not explain much of the variance in the target variable, which suggests the presence of unmodeled factors or insufficient predictors. Given these results, the stepwise model may be the most reliable choice due to its better fit.
Car crash predictions on the evaluation data
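TargetFlag_pred is not defined in the chunks shown; the line below is our reconstruction, assuming the evaluation set is passed through the same cleaning pipeline as the training data (helper names are ours, from the preprocessing sketch above).
TargetFlag_pred <- data_test %>% fix_formatting() %>% fix_data_types() %>% impute_missing(train_medians)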
predictions_prob <- predict(model.reduced, newdata = TargetFlag_pred, type = "response")
predictions_class <- ifelse(predictions_prob > 0.5, 1, 0)
TargetFlag_pred$predictions <- predictions_class
predicted_counts <- table(TargetFlag_pred$predictions)
# Print the table
print(predicted_counts)
   0    1
1781  360
TARGET_AMT predictions on the evaluation data
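Likewise, TargetAmt_pred is our reconstruction; we assume it is the same cleaned evaluation set.
TargetAmt_pred <- data_test %>% fix_formatting() %>% fix_data_types() %>% impute_missing(train_medians)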
predictions <- predict(model.stepwise, newdata = TargetAmt_pred)
TargetAmt_pred$predictions <- predictions
head(TargetAmt_pred[, c('TARGET_AMT','predictions')])