ECNM HW 2
Preparation
Clear Data
Loading Packages
Bringing in Data
Introduction
In the insurance industry, identifying individuals at higher risk of car accidents is a critical component of managing both risk and financial performance. Accurate prediction of crash-prone drivers enables insurers to set fair premiums, adjust coverage plans, and manage payouts more effectively.
Because car crashes range from minor incidents to major, costly accidents, it is essential to estimate not only the likelihood of a crash but also its expected financial impact. Insights from data-driven models help insurance companies minimize losses and provide high-quality services, benefiting both the insurer and the insured.
Variables
1. Cleaning Data
Removing the “z_” prefix from the data set
When analyzing the data, we find a “z_” prefix in front of some values of the categorical variables; a sample of the affected variables is shown below. Because the prefix appears so frequently, our first task is to remove it in order to have a cleaner data set.
MStatus | Sex | EDUCATION | Job | Car Type | Urban City |
---|---|---|---|---|---|
z_No | z_F | Masters | Lawyer | Pickup | z_Highly Rural/ Rural |
Yes | z_F | Masters | Home Maker | Sports Car | Highly Urban/ Urban |
z_No | z_F | Bachelors | Clerical | z_SUV | Highly Urban/ Urban |
Yes | z_F | PhD | NA | Van | Highly Urban/ Urban |
Yes | M | PhD | Manager | Panel Truck | Highly Urban/ Urban |
z_No | z_F | Bachelors | Professional | Minivan | Highly Urban/ Urban |
Yes | M | PhD | Doctor | Sports Car | Highly Urban/ Urban |
Yes | z_F | Bachelors | Clerical | Pickup | Highly Urban/ Urban |
Yes | M | z_High School | z_Blue Collar | Minivan | z_Highly Rural/ Rural |
z_No | M | Bachelors | Professional | Panel Truck | Highly Urban/ Urban |
Distribution by variable
After removal, the “z_” counts are now 0
MStatus | Sex | EDUCATION | Job | Car Type | Urban City |
---|---|---|---|---|---|
0 | 0 | 0 | 0 | 0 | 0 |
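The report does not show the code used to strip the prefix. As a minimal sketch, assuming the prefix only ever appears at the start of a character value, it could be removed and re-counted as follows; the `toy` data frame and the `z_counts` name are illustrative stand-ins, not objects from the report.

```r
# Illustrative toy data with the same kind of "z_" prefixed values
toy <- data.frame(
  MSTATUS    = c("z_No", "Yes"),
  SEX        = c("z_F", "M"),
  URBANICITY = c("z_Highly Rural/ Rural", "Highly Urban/ Urban"),
  stringsAsFactors = FALSE
)

# Strip a leading "z_" from every character column
char_cols <- vapply(toy, is.character, logical(1))
toy[char_cols] <- lapply(toy[char_cols], function(col) sub("^z_", "", col))

# Count remaining "z_" prefixes per column (should all be 0)
z_counts <- vapply(toy, function(col) sum(grepl("^z_", col)), integer(1))
print(z_counts)
```

Anchoring the pattern with `^` avoids touching values that merely contain "z_" somewhere in the middle.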
Renaming Variables
Renaming some variables in order to give them more intuitive names
# Rename PARENT1 to Parent_single
names(tr_dt)[names(tr_dt) == "PARENT1"] <- "Parent_single"
# Rename YOJ to Years_on_job
names(tr_dt)[names(tr_dt) == "YOJ"] <- "Years_on_job"
# Rename TIF to Time_in_force
names(tr_dt)[names(tr_dt) == "TIF"] <- "Time_in_force"
Understanding Missing Variables
In the summary table above, variables such as Years on the Job, Home Value, Car Age, and others have missing data. Before continuing, it is important to have a clean data set. The graph and table below give a visual and numeric view of the key variables that need cleaning.
INDEX TARGET_FLAG TARGET_AMT KIDSDRIV AGE
0 0 0 0 3
HOMEKIDS Years_on_job INCOME Parent_single HOME_VAL
0 375 354 0 368
TRAVTIME CAR_USE BLUEBOOK Time_in_force RED_CAR
0 0 0 0 0
OLDCLAIM CLM_FREQ REVOKED MVR_PTS CAR_AGE
0 0 0 0 399
MSTATUS SEX EDUCATION JOB CAR_TYPE
0 0 0 419 0
URBANICITY
0
Cleaning the data
Instead of simply replacing NAs with the median or mean of the respective variable, we use machine learning to predict the missing values by considering the relationships between all variables in the dataset.
# Create a new variable 'Job_missing' that flags missing values in 'JOB'
tr_dt$Job_missing <- ifelse(is.na(tr_dt$JOB), 1, 0)
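The report does not state which imputation method was used. Purely as an illustration of model-based imputation, the sketch below fills missing `JOB` values with a classification tree from the `rpart` package; the choice of `rpart`, the `toy` data, and the two predictors are all assumptions made for this example.

```r
library(rpart)

set.seed(1)
# Illustrative toy data: job is related to education and income
toy <- data.frame(
  EDUCATION = factor(rep(c("PhD", "Masters", "Bachelors"), each = 20)),
  INCOME    = c(rnorm(20, 120, 10), rnorm(20, 90, 10), rnorm(20, 60, 10)),
  JOB       = factor(rep(c("Manager", "Lawyer", "Clerical"), each = 20))
)
toy$JOB[c(1, 21, 41)] <- NA                       # knock out a few values

# Flag the rows that will be imputed, as in the report
toy$Job_missing <- ifelse(is.na(toy$JOB), 1, 0)

# Fit a tree on complete rows, then predict the missing jobs
fit <- rpart(JOB ~ EDUCATION + INCOME,
             data = toy[toy$Job_missing == 0, ], method = "class")
toy$JOB[toy$Job_missing == 1] <-
  predict(fit, newdata = toy[toy$Job_missing == 1, ], type = "class")

print(toy[toy$Job_missing == 1, c("Job_missing", "EDUCATION", "JOB")])
```

Keeping the `Job_missing` flag makes it possible to inspect the imputed rows afterwards.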
Result of the data clean up
The missing Job values appear to have been imputed plausibly: most of the affected records have PhD or Masters degrees, for which a Manager or Lawyer job is a reasonable prediction.
Job_missing | EDUCATION | JOB |
---|---|---|
1 | PhD | Manager |
1 | PhD | Manager |
1 | PhD | Manager |
1 | Masters | Manager |
1 | PhD | Manager |
1 | PhD | Manager |
1 | Masters | Lawyer |
1 | PhD | Manager |
1 | Masters | Lawyer |
1 | PhD | Manager |
1 | PhD | Manager |
1 | Masters | Lawyer |
1 | PhD | Manager |
1 | Masters | Manager |
1 | Masters | Manager |
We can also see below that we no longer have any missing data
# Summary of missing values after the adjustment
colSums(is.na(tr_dt) | tr_dt == "")
INDEX TARGET_FLAG TARGET_AMT KIDSDRIV AGE
0 0 0 0 0
HOMEKIDS Years_on_job INCOME Parent_single HOME_VAL
0 0 0 0 0
TRAVTIME CAR_USE BLUEBOOK Time_in_force RED_CAR
0 0 0 0 0
OLDCLAIM CLM_FREQ REVOKED MVR_PTS CAR_AGE
0 0 0 0 0
MSTATUS SEX EDUCATION JOB CAR_TYPE
0 0 0 0 0
URBANICITY
0
2. Data Exploration
Summary table with changes
Below is a summary of the data showing key numerical statistics such as the mean, median, minimum, maximum, kurtosis, skew, and standard deviation.
Variable | Mean | Median | Minimum | Maximum | Kurtosis | Skew | SD | NA Count |
---|---|---|---|---|---|---|---|---|
INDEX | 5,157 | 5,152 | 1 | 10,302 | -1.21 | 0.00 | 2,986.61 | 0 |
TARGET_FLAG | 0 | 0 | 0 | 1 | -0.85 | 1.07 | 0.44 | 0 |
TARGET_AMT | 1,467 | 0 | 0 | 107,586 | 128.02 | 9.12 | 4,545.65 | 0 |
KIDSDRIV | 0 | 0 | 0 | 4 | 11.23 | 3.30 | 0.51 | 0 |
AGE | 45 | 45 | 16 | 81 | -0.05 | -0.04 | 8.65 | 0 |
HOMEKIDS | 1 | 0 | 0 | 5 | 0.56 | 1.32 | 1.11 | 0 |
Years_on_job | 11 | 11 | 0 | 23 | 1.28 | -1.23 | 4.05 | 0 |
INCOME | 61,320 | 53,014 | 0 | 367,030 | 2.27 | 1.21 | 47,263.66 | 0 |
HOME_VAL | 154,338 | 159,856 | 0 | 885,282 | 0.08 | 0.51 | 127,111.26 | 0 |
TRAVTIME | 33 | 33 | 5 | 142 | 0.74 | 0.46 | 15.97 | 0 |
BLUEBOOK | 15,642 | 14,370 | 1,500 | 69,740 | 0.92 | 0.82 | 8,381.49 | 0 |
Time_in_force | 5 | 4 | 1 | 25 | 0.42 | 0.88 | 4.15 | 0 |
OLDCLAIM | 4,119 | 0 | 0 | 57,037 | 9.40 | 3.07 | 8,924.67 | 0 |
CLM_FREQ | 1 | 0 | 0 | 5 | 0.31 | 1.21 | 1.16 | 0 |
MVR_PTS | 2 | 1 | 0 | 13 | 1.30 | 1.33 | 2.14 | 0 |
CAR_AGE | 8 | 8 | -3 | 27 | -0.73 | 0.28 | 5.63 | 0 |
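The code behind this table is not shown. A minimal base R sketch of such a summary is given below, assuming simple moment-based definitions of skew and excess kurtosis; the report may have used a package with slightly different bias corrections. The function names and the `toy` data are illustrative.

```r
# Moment-based skewness and excess kurtosis (assumed definitions)
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}
excess_kurtosis <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^4) / mean((x - m)^2)^2 - 3
}

# One row of statistics per numeric column
describe_numeric <- function(df) {
  num <- df[vapply(df, is.numeric, logical(1))]
  t(vapply(num, function(x) c(
    Mean     = mean(x, na.rm = TRUE),
    Median   = median(x, na.rm = TRUE),
    Minimum  = min(x, na.rm = TRUE),
    Maximum  = max(x, na.rm = TRUE),
    Kurtosis = excess_kurtosis(x),
    Skew     = skewness(x),
    SD       = sd(x, na.rm = TRUE),
    NA_Count = sum(is.na(x))
  ), numeric(8)))
}

toy <- data.frame(AGE = c(30, 45, 50, 61, NA),
                  INCOME = c(0, 40000, 53000, 61000, 250000),
                  SEX = c("M", "F", "F", "M", "F"))
stats <- describe_numeric(toy)
print(round(stats, 2))
```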
Graphs
The following graphs help us better understand outliers and distribution patterns, enabling more effective data cleaning.
Distribution of Key Variables
Based on the distribution graphs below, Income, BlueBook, Travel time, and Car age appear to be variables that might benefit from a transformation, as they currently show skewness.
Below are the distributions of the categorical variables. They are based on frequency and also highlight the number of crashes within each total.
This data can be useful for underwriting purposes, to understand the exposure an insurer has in its portfolio. For example, this portfolio is more heavily weighted toward individuals aged 30 and over; if the insurer found that individuals aged 20-30 have a very similar risk profile to those aged 30 and over, it might shift its policy criteria.
Similarly, the insurer appears to have greater exposure to urban areas; if it wants to pursue a risk-averse strategy, targeting rural areas might be of greater benefit.
In the correlation section of the analysis we will be able to show the proportions highlighted for these variables.
Correlation Analysis
Correlations - Numerical data
Below is the correlation matrix for the numeric data set, which shows the strength and direction of the relationship between each pair of variables in the dataset.
Green indicates a positive correlation
Red indicates a negative correlation.
The color’s vibrancy shows the correlation’s strength, where white means no correlation, and dark green or red means a strong correlation, closer to 1 or -1.
The variables that display a higher correlation with the dependent variables (“TARGET_FLAG” and “TARGET_AMT”) will be variables of interest when running our regression models.
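As a hedged sketch of how predictors can be ranked by their correlation with a target, the snippet below builds a correlation matrix with base R's `cor()` and sorts one column of it. The `toy` data and the simulated relationship are invented for illustration and say nothing about the real data.

```r
set.seed(1)
n <- 300
# Illustrative numeric data: crash probability rises with MVR points
toy <- data.frame(
  MVR_PTS  = rpois(n, 2),
  TRAVTIME = rnorm(n, 33, 16)
)
toy$TARGET_FLAG <- rbinom(n, 1, plogis(-1.5 + 0.3 * toy$MVR_PTS))

# Pairwise Pearson correlations between all numeric columns
cor_mat <- cor(toy, use = "pairwise.complete.obs")

# Rank variables by their correlation with the target
target_cor <- sort(cor_mat[, "TARGET_FLAG"], decreasing = TRUE)
print(round(target_cor, 3))
```

The target's correlation with itself is always 1, so the entries of interest start from the second position.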
Correlations - Categorical data
The categorical comparisons below help us understand relationships we couldn’t capture in the correlation plot, which focused on numerical data.
Age vs Education
Looking at the heat map below and concentrating on the age group this insurance company mainly underwrites (30+), one can see that the crash rate decreases as education level increases.
Key variables
Urbanicity vs Car Type
The crash rate graphs of the main categorical variables above showed a strong increase in the crash rate for commercial use compared with private use. For this reason, the graphs below separate the two and compare each with car type.
Commercial vehicles have historically carried a higher risk for several reasons:
- Mileage/Usage Frequency: Commercial vehicles are on the road much more frequently than private use vehicles, increasing the likelihood of accidents.
- Driver Type: Commercial vehicles might have multiple drivers with varying skill levels, whereas private vehicles typically have a limited number of drivers (immediate family).
- Purpose of Use: Commercial vehicles may be used for more demanding tasks (e.g., delivery services, transportation of goods) that involve time pressures or driving in unfamiliar areas.
Car Type vs Car use
Similarly, as seen below, there is a clear relationship between where the vehicles are operated and the crash rate. Urban areas run a much higher risk of accidents, plausibly because of congestion.
What is the average claim amount for those who experienced an accident?
The commercial group contains a larger share of pickups and panel trucks. Panel trucks and vans in particular have high Blue Book values and, as a result, drive a higher average claim. Below is a breakdown of the vehicles which have experienced an accident.
Car Use | Car Type | Count | Avg Bluebook | Avg Claim |
---|---|---|---|---|
Commercial | Minivan | 95 | 14,078 | 6,467 |
Commercial | Panel Truck | 136 | 29,344 | 7,755 |
Commercial | Pickup | 272 | 11,753 | 5,126 |
Commercial | Sports Car | 52 | 12,014 | 4,856 |
Commercial | SUV | 156 | 11,840 | 5,287 |
Commercial | Van | 120 | 20,300 | 6,844 |
Commercial | Commercial total | 831 | 16,555 | 6,056 |
Car Use | Car Type | Count | Avg Bluebook | Avg Claim |
---|---|---|---|---|
Private | Minivan | 191 | 13,809 | 5,373 |
Private | Pickup | 85 | 12,017 | 5,364 |
Private | Sports Car | 181 | 11,719 | 5,282 |
Private | SUV | 398 | 11,162 | 5,058 |
Private | Van | 36 | 19,999 | 4,470 |
Private | Private total | 891 | 13,741 | 5,109 |
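The grouping code is not shown in the report. A minimal base R sketch of a count and average per car use and car type is given below; the `crashes` data frame is invented for illustration and is assumed to be already filtered to drivers with a crash.

```r
# Illustrative data: one row per crashed vehicle
crashes <- data.frame(
  CAR_USE    = c("Commercial", "Commercial", "Private", "Private", "Private"),
  CAR_TYPE   = c("Van", "Van", "SUV", "SUV", "Minivan"),
  BLUEBOOK   = c(20000, 22000, 11000, 12000, 14000),
  TARGET_AMT = c(7000, 6000, 5000, 5200, 5400)
)

# Average Blue Book value and claim amount per group
avgs <- aggregate(cbind(BLUEBOOK, TARGET_AMT) ~ CAR_USE + CAR_TYPE,
                  data = crashes, FUN = mean)
names(avgs)[3:4] <- c("Avg_Bluebook", "Avg_Claim")

# Number of crashed vehicles per group
counts <- aggregate(TARGET_AMT ~ CAR_USE + CAR_TYPE,
                    data = crashes, FUN = length)
names(counts)[3] <- "Count"

result <- merge(counts, avgs, by = c("CAR_USE", "CAR_TYPE"))
print(result)
```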
The average claim amount is driven by several factors. One of them is the value of the car itself, given by the Blue Book. Generally, the higher the Blue Book value, the more expensive the car and, as a result, the repair after a crash.
Other main variables which will drive a higher average claim will be the type of vehicle, the type of use (commercial vs private), and where it is used (urban vs rural setting).
3. Model Development
We first create dummy variables for the categorical predictors.
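The dummy-creation code is not shown. One common way to do this in base R, sketched here on an invented `toy` factor, is `model.matrix()`, which creates one 0/1 column per level while holding one level back as the reference so that the dummies are not perfectly collinear with the intercept.

```r
# Illustrative factor with five education levels
toy <- data.frame(
  EDUCATION = factor(c("PhD", "Masters", "Bachelors",
                       "High School", "<High School", "PhD"))
)

# model.matrix() adds an intercept column; drop it to keep only the dummies
dummies <- model.matrix(~ EDUCATION, data = toy)[, -1]

# Five levels produce four dummy columns; the omitted level is the reference
print(colnames(dummies))
print(dummies)
```

A row of all zeros then denotes the reference level.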
Binary logistic regression Models
Model 1
The first logistic regression model leverages all available variables from the dataset to predict the likelihood of a car accident (TARGET_FLAG). This model includes a range of predictors such as car characteristics, driver demographics, and behavioral factors. Incorporating them allows the model to capture as much information as possible about which variables contribute most to car accidents.
Model 1 will serve as the baseline for the model development.
Call:
glm(formula = TARGET_FLAG ~ BLUEBOOK + CAR_AGE + INCOME + Parent_single +
HOME_VAL + MSTATUS + SEX_male + `EDUCATION<High School` +
EDUCATIONBachelors + EDUCATIONHighSchool + EDUCATIONMasters +
EDUCATIONPhD + TRAVTIME + CAR_USEPrivate + KIDSDRIV + AGE +
HOMEKIDS + Years_on_job + CAR_TYPEPanel_Truck + CAR_TYPEPickup +
CAR_TYPESports_Car + CAR_TYPESUV + CAR_TYPEVan + RED_CAR +
OLDCLAIM + CLM_FREQ + REVOKED + MVR_PTS + URBANICITY + JOBClerical +
JOBDoctor + JOBHome_Maker + JOBLawyer + JOBManager + JOBProfessional +
JOBStudent + Time_in_force, family = binomial, data = tr_dt2)
Coefficients: (1 not defined because of singularities)
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.433e+00 3.858e-01 -6.307 2.85e-10 ***
BLUEBOOK -1.977e-05 5.916e-06 -3.341 0.000834 ***
CAR_AGE 1.313e-03 8.690e-03 0.151 0.879868
INCOME -3.089e-06 1.271e-06 -2.431 0.015073 *
Parent_single 3.871e-01 1.231e-01 3.145 0.001663 **
HOME_VAL -1.089e-06 3.988e-07 -2.731 0.006316 **
MSTATUS -4.912e-01 9.636e-02 -5.097 3.44e-07 ***
SEX_male 1.545e-01 1.252e-01 1.234 0.217230
`EDUCATION<High School` -3.561e-02 2.309e-01 -0.154 0.877405
EDUCATIONBachelors -5.027e-01 1.857e-01 -2.707 0.006787 **
EDUCATIONHighSchool -5.937e-02 2.083e-01 -0.285 0.775624
EDUCATIONMasters -2.515e-01 1.689e-01 -1.489 0.136557
EDUCATIONPhD NA NA NA NA
TRAVTIME 1.342e-02 2.102e-03 6.381 1.76e-10 ***
CAR_USEPrivate -8.020e-01 1.023e-01 -7.843 4.41e-15 ***
KIDSDRIV 4.430e-01 6.938e-02 6.385 1.72e-10 ***
AGE -1.959e-03 4.516e-03 -0.434 0.664412
HOMEKIDS 4.681e-02 4.239e-02 1.104 0.269492
Years_on_job -1.814e-02 9.598e-03 -1.890 0.058812 .
CAR_TYPEPanel_Truck 4.948e-01 1.815e-01 2.726 0.006410 **
CAR_TYPEPickup 5.639e-01 1.119e-01 5.038 4.71e-07 ***
CAR_TYPESports_Car 1.025e+00 1.458e-01 7.033 2.03e-12 ***
CAR_TYPESUV 8.339e-01 1.244e-01 6.702 2.06e-11 ***
CAR_TYPEVan 5.713e-01 1.414e-01 4.041 5.31e-05 ***
RED_CAR -1.001e-02 9.668e-02 -0.104 0.917505
OLDCLAIM -1.533e-05 4.279e-06 -3.584 0.000339 ***
CLM_FREQ 2.035e-01 3.180e-02 6.398 1.57e-10 ***
REVOKED 9.389e-01 1.001e-01 9.375 < 2e-16 ***
MVR_PTS 1.157e-01 1.525e-02 7.588 3.25e-14 ***
URBANICITY 2.357e+00 1.252e-01 18.829 < 2e-16 ***
JOBClerical 5.353e-02 1.196e-01 0.448 0.654367
JOBDoctor -1.299e+00 3.346e-01 -3.882 0.000104 ***
JOBHome_Maker -1.401e-01 1.738e-01 -0.806 0.420183
JOBLawyer -3.153e-01 2.010e-01 -1.569 0.116742
JOBManager -8.046e-01 1.497e-01 -5.374 7.68e-08 ***
JOBProfessional -1.696e-01 1.339e-01 -1.267 0.205252
JOBStudent -1.502e-01 1.465e-01 -1.026 0.305100
Time_in_force -5.643e-02 8.231e-03 -6.856 7.07e-12 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 7533.1 on 6527 degrees of freedom
Residual deviance: 5820.7 on 6491 degrees of freedom
AIC: 5894.7
Number of Fisher Scoring iterations: 5
Logistic Model 1 Results
Magnitude
- Model 1 shows that driving in urban locations has the most substantial impact on car crashes. To obtain a more interpretable measure of the impact, we can exponentiate the coefficient: \(e^{2.357} \approx 10.56\). This result indicates that driving in a “Highly Urban/ Urban” area multiplies the odds of having a crash by approximately 10.56 compared to the reference category (“Highly Rural/ Rural” areas).
- Similarly, being a doctor, owning a sports car, and having a history of a revoked license have sizeable effects on crash odds in the model (negative for doctors, positive for the other two). Interestingly, age has a surprisingly low impact, which may stem from multicollinearity among related factors. Additionally, EDUCATIONPhD has no coefficient, most likely because all five education dummies are included alongside the intercept, which makes the last one perfectly collinear with the rest.
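As a small worked example, the odds ratios can be obtained by exponentiating the coefficients reported in the Model 1 summary. The values below are copied by hand from that output; with a fitted model object, `exp(coef(model))` would give the same quantities for every term.

```r
# Coefficients copied from the Model 1 summary output
coefs <- c(URBANICITY = 2.357, REVOKED = 0.9389, JOBDoctor = -1.299)

# Exponentiate log-odds coefficients to get odds ratios
odds_ratios <- exp(coefs)
print(round(odds_ratios, 2))
```

An odds ratio above 1 means higher crash odds than the reference group, and a value below 1 means lower odds.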
Significance
- The variables with high magnitudes above also carry strong significance levels based on their p-values, supporting their importance to the overall model. Other driving history variables, such as past claims and motor vehicle record (MVR) points, are also statistically significant, indicating their effects are unlikely to be due to random variation.
Direction
- Most variables align directionally with expectations. Two notable exceptions are OLDCLAIM and RED_CAR, as one would expect both to have a positive rather than negative effect on car crashes. The negative OLDCLAIM coefficient may suggest that drivers with a history of claims drive more cautiously to avoid further incidents. The red car myth, which associates red cars with riskier drivers, also appears unsupported here: the RED_CAR coefficient is small and not statistically significant (p ≈ 0.92), suggesting minimal relevance to our model.
Reviewing the model
\[ \text{Sensitivity} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}} \]
The main objective is to correctly identify drivers who are likely to get into a crash. For this reason, a high sensitivity shows the model’s ability to find the most crash-prone drivers.
Model 1 correctly identifies crash-prone drivers 42.74% of the time. While sensitivity is important, the model cannot solely maximize it, as doing so might produce many false positives (predicting crashes for those who are actually low risk), which could make insurers overly cautious and lead to higher premiums for drivers who aren’t at risk.
\[ \text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}} \]
For this reason, precision, which reflects the share of predicted crashes that are true crash cases, is also very important. A high precision means that most of the drivers flagged as “likely to crash” are indeed at risk.
In this case, 66.49% of the drivers the model flags as crash cases actually crashed, which helps insurers avoid issuing unnecessary warnings or premium hikes to safe drivers.
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 4435 986
1 371 736
Accuracy : 0.7921
95% CI : (0.7821, 0.8019)
No Information Rate : 0.7362
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.3955
Mcnemar's Test P-Value : < 2.2e-16
Sensitivity : 0.4274
Specificity : 0.9228
Pos Pred Value : 0.6649
Neg Pred Value : 0.8181
Prevalence : 0.2638
Detection Rate : 0.1127
Detection Prevalence : 0.1696
Balanced Accuracy : 0.6751
'Positive' Class : 1
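As a check on the figures above, sensitivity, precision, specificity, and accuracy can be recomputed directly from the four cells of the Model 1 confusion matrix. The counts are copied from that matrix; the variable names are illustrative.

```r
# Cells of the Model 1 confusion matrix (positive class = 1)
tp <- 736   # predicted 1, actual 1
fn <- 986   # predicted 0, actual 1
fp <- 371   # predicted 1, actual 0
tn <- 4435  # predicted 0, actual 0

sensitivity <- tp / (tp + fn)                    # share of crashes caught
precision   <- tp / (tp + fp)                    # share of flags that are crashes
specificity <- tn / (tn + fp)                    # share of non-crashes cleared
accuracy    <- (tp + tn) / (tp + tn + fp + fn)   # overall share correct

print(round(c(sensitivity = sensitivity, precision = precision,
              specificity = specificity, accuracy = accuracy), 4))
```

The rounded results agree with the Sensitivity, Pos Pred Value, Specificity, and Accuracy lines of the output above.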
Measuring multicollinearity
Multicollinearity occurs when two or more predictor variables are highly correlated, complicating the model's ability to estimate each variable's individual effect. In Model 1, EDUCATIONPhD lacks a coefficient because of a singularity, that is, perfect collinearity: most likely the five education dummies together duplicate the intercept, so R drops the last one. This doesn't imply that a PhD has no effect; in that case its effect is absorbed into the baseline against which the other education levels are measured.
Model 2
Model 2 uses a lasso regression to identify variables that can potentially be excluded: its L1 penalty shrinks coefficients and sets some of them to exactly zero.
Below, we can see that the coefficients for Car Age, Red Car, and Education PhD have been removed. Excluding these less relevant predictors reduces the model's complexity and makes it more interpretable.
Lasso Coefficients:
38 x 1 sparse Matrix of class "dgCMatrix"
s1
(Intercept) -1.408241691
BLUEBOOK -0.164218287
CAR_AGE .
INCOME -0.131438557
Parent_single 0.129674405
HOME_VAL -0.134902343
MSTATUS -0.232351982
SEX_male 0.049020968
`EDUCATION<High School` 0.022673951
EDUCATIONBachelors -0.180219345
EDUCATIONHighSchool 0.011012316
EDUCATIONMasters -0.077224196
EDUCATIONPhD .
TRAVTIME 0.204917083
CAR_USEPrivate -0.404582004
KIDSDRIV 0.218527882
AGE -0.014804231
HOMEKIDS 0.046456211
Years_on_job -0.059453630
CAR_TYPEPanel_Truck 0.107474871
CAR_TYPEPickup 0.180201877
CAR_TYPESports_Car 0.284294270
CAR_TYPESUV 0.329100019
CAR_TYPEVan 0.139031991
RED_CAR .
OLDCLAIM -0.119138586
CLM_FREQ 0.226219431
REVOKED 0.299492206
MVR_PTS 0.244132358
URBANICITY 0.929824690
JOBClerical 0.038275981
JOBDoctor -0.183489791
JOBHome_Maker -0.001680938
JOBLawyer -0.059621365
JOBManager -0.256647335
JOBProfessional -0.029756399
JOBStudent -0.013923982
Time_in_force -0.224177803
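To illustrate why an L1 penalty produces exact zeros, the sketch below implements the soft-thresholding operator that lasso solvers apply to each coefficient. It is a conceptual illustration only, not the fitting routine used for the output above; the function name, the input values, and the penalty are made up.

```r
# Soft-thresholding: shrink z toward zero by lambda, clipping at zero
soft_threshold <- function(z, lambda) {
  sign(z) * pmax(abs(z) - lambda, 0)
}

# Hypothetical unpenalized coefficient estimates
z <- c(strong_neg = -0.5, weak = 0.05, strong_pos = 0.3)

# With lambda = 0.1, the weak effect is set exactly to zero
shrunk <- soft_threshold(z, lambda = 0.1)
print(shrunk)
```

Coefficients whose magnitude is below the penalty are removed entirely, while larger ones are only shrunk, which is how the lasso doubles as a variable selection tool.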
Applying the coefficient analysis from the lasso regression into model 2
# BLR Model 2: removing "EDUCATIONPhD", "CAR_AGE", "RED_CAR"
logit_model2 <- glm(
  TARGET_FLAG ~ BLUEBOOK + INCOME + Parent_single + HOME_VAL + MSTATUS +
    SEX_male + `EDUCATION<High School` + EDUCATIONBachelors +
    EDUCATIONHighSchool + EDUCATIONMasters + TRAVTIME + CAR_USEPrivate +
    KIDSDRIV + AGE + HOMEKIDS + Years_on_job + CAR_TYPEPanel_Truck +
    CAR_TYPEPickup + CAR_TYPESports_Car + CAR_TYPESUV + CAR_TYPEVan +
    OLDCLAIM + CLM_FREQ + REVOKED + MVR_PTS + URBANICITY + JOBClerical +
    JOBDoctor + JOBHome_Maker + JOBLawyer + JOBManager + JOBProfessional +
    JOBStudent + Time_in_force,
  family = binomial,
  data = tr_dt2)
# Summary
summary(logit_model2)
Call:
glm(formula = TARGET_FLAG ~ BLUEBOOK + INCOME + Parent_single +
HOME_VAL + MSTATUS + SEX_male + `EDUCATION<High School` +
EDUCATIONBachelors + EDUCATIONHighSchool + EDUCATIONMasters +
TRAVTIME + CAR_USEPrivate + KIDSDRIV + AGE + HOMEKIDS + Years_on_job +
CAR_TYPEPanel_Truck + CAR_TYPEPickup + CAR_TYPESports_Car +
CAR_TYPESUV + CAR_TYPEVan + OLDCLAIM + CLM_FREQ + REVOKED +
MVR_PTS + URBANICITY + JOBClerical + JOBDoctor + JOBHome_Maker +
JOBLawyer + JOBManager + JOBProfessional + JOBStudent + Time_in_force,
family = binomial, data = tr_dt2)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.417e+00 3.678e-01 -6.569 5.05e-11 ***
BLUEBOOK -1.977e-05 5.914e-06 -3.342 0.000831 ***
INCOME -3.079e-06 1.269e-06 -2.427 0.015242 *
Parent_single 3.874e-01 1.231e-01 3.148 0.001642 **
HOME_VAL -1.090e-06 3.982e-07 -2.738 0.006187 **
MSTATUS -4.910e-01 9.634e-02 -5.096 3.47e-07 ***
SEX_male 1.485e-01 1.117e-01 1.330 0.183477
`EDUCATION<High School` -4.884e-02 2.143e-01 -0.228 0.819733
EDUCATIONBachelors -5.091e-01 1.814e-01 -2.807 0.005006 **
EDUCATIONHighSchool -7.133e-02 1.933e-01 -0.369 0.712095
EDUCATIONMasters -2.511e-01 1.688e-01 -1.487 0.136939
TRAVTIME 1.342e-02 2.102e-03 6.381 1.75e-10 ***
CAR_USEPrivate -8.021e-01 1.023e-01 -7.844 4.36e-15 ***
KIDSDRIV 4.430e-01 6.937e-02 6.386 1.70e-10 ***
AGE -1.936e-03 4.513e-03 -0.429 0.667992
HOMEKIDS 4.665e-02 4.238e-02 1.101 0.271006
Years_on_job -1.815e-02 9.597e-03 -1.892 0.058553 .
CAR_TYPEPanel_Truck 4.944e-01 1.815e-01 2.725 0.006439 **
CAR_TYPEPickup 5.637e-01 1.119e-01 5.036 4.75e-07 ***
CAR_TYPESports_Car 1.025e+00 1.457e-01 7.033 2.02e-12 ***
CAR_TYPESUV 8.343e-01 1.244e-01 6.709 1.97e-11 ***
CAR_TYPEVan 5.710e-01 1.414e-01 4.040 5.35e-05 ***
OLDCLAIM -1.533e-05 4.279e-06 -3.583 0.000339 ***
CLM_FREQ 2.034e-01 3.179e-02 6.399 1.57e-10 ***
REVOKED 9.389e-01 1.001e-01 9.377 < 2e-16 ***
MVR_PTS 1.158e-01 1.525e-02 7.589 3.22e-14 ***
URBANICITY 2.357e+00 1.252e-01 18.829 < 2e-16 ***
JOBClerical 5.347e-02 1.196e-01 0.447 0.654680
JOBDoctor -1.299e+00 3.345e-01 -3.883 0.000103 ***
JOBHome_Maker -1.398e-01 1.738e-01 -0.804 0.421266
JOBLawyer -3.152e-01 2.010e-01 -1.568 0.116853
JOBManager -8.047e-01 1.497e-01 -5.376 7.63e-08 ***
JOBProfessional -1.701e-01 1.338e-01 -1.271 0.203587
JOBStudent -1.503e-01 1.464e-01 -1.026 0.304900
Time_in_force -5.642e-02 8.231e-03 -6.855 7.14e-12 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 7533.1 on 6527 degrees of freedom
Residual deviance: 5820.7 on 6493 degrees of freedom
AIC: 5890.7
Number of Fisher Scoring iterations: 5
Logistic Model 2 Results
Model 2 remains aligned with Model 1 in coefficient magnitude, significance, and direction. The lasso regression helped eliminate EDUCATIONPhD, which had no coefficient, as well as CAR_AGE and RED_CAR, which added little value to the model from a significance standpoint. This reduced the AIC slightly, from 5895 to 5891.
Reviewing the model
\[ \text{Sensitivity} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}} \]
\[ \text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}} \]
Sensitivity and precision remained aligned with Model 1; removing the variables flagged by the lasso regression made no material change to the model's predictions.
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 4434 986
1 372 736
Accuracy : 0.792
95% CI : (0.7819, 0.8018)
No Information Rate : 0.7362
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.3952
Mcnemar's Test P-Value : < 2.2e-16
Sensitivity : 0.4274
Specificity : 0.9226
Pos Pred Value : 0.6643
Neg Pred Value : 0.8181
Prevalence : 0.2638
Detection Rate : 0.1127
Detection Prevalence : 0.1697
Balanced Accuracy : 0.6750
'Positive' Class : 1
Measuring multicollinearity
The variance inflation factor (VIF) helps us understand whether two or more predictor (independent) variables are correlated with each other. Most values look moderate and acceptable, within 1 < VIF ≤ 5. The main exceptions are the education dummies, whose VIFs (about 5.9 to 7.7) are above this preferred range.
vif_logit_model2
BLUEBOOK 2.146520
INCOME 2.854089
Parent_single 1.932535
HOME_VAL 2.069175
MSTATUS 2.173918
SEX_male 2.932498
`EDUCATION<High School` 5.996398
EDUCATIONBachelors 5.943264
EDUCATIONHighSchool 7.661304
EDUCATIONMasters 4.117099
TRAVTIME 1.037053
CAR_USEPrivate 2.419640
KIDSDRIV 1.355057
AGE 1.479564
HOMEKIDS 2.197721
Years_on_job 1.501135
CAR_TYPEPanel_Truck 2.407859
CAR_TYPEPickup 1.818406
CAR_TYPESports_Car 2.163050
CAR_TYPESUV 3.087038
CAR_TYPEVan 1.638123
OLDCLAIM 1.625621
CLM_FREQ 1.454622
REVOKED 1.303804
MVR_PTS 1.160216
URBANICITY 1.135161
JOBClerical 1.886352
JOBDoctor 1.628406
JOBHome_Maker 2.125459
JOBLawyer 3.423225
JOBManager 2.567912
JOBProfessional 2.019039
JOBStudent 1.803192
Time_in_force 1.011236
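For intuition, a VIF can be computed by hand as \(1 / (1 - R_j^2)\), where \(R_j^2\) comes from regressing predictor \(j\) on all the other predictors. The sketch below does this in base R on invented data; it uses the linear-regression definition, so values from a package applied to a logistic model may differ somewhat. The function name `vif_manual` is hypothetical.

```r
# VIF for each column: regress it on the remaining columns
vif_manual <- function(X) {
  X <- as.data.frame(X)
  sapply(names(X), function(v) {
    f  <- reformulate(setdiff(names(X), v), response = v)
    r2 <- summary(lm(f, data = X))$r.squared
    1 / (1 - r2)
  })
}

set.seed(42)
x1 <- rnorm(200)
x2 <- x1 + rnorm(200, sd = 0.3)   # strongly correlated with x1
x3 <- rnorm(200)                  # unrelated to the others

vifs <- vif_manual(data.frame(x1, x2, x3))
print(round(vifs, 2))
```

The correlated pair receives a high VIF, while the unrelated predictor stays close to 1.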
Model 3
Model 3 takes the final model produced in Model 2 and tries to improve on it by applying a log transformation to the variables which showed a skewed distribution in the distribution graphs earlier.
The transformed variables replace the original untransformed variables in the model. A backward stepwise selection then drops predictors that do not contribute meaningfully to the fit of the model based on AIC, leaving a more concise model.
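A minimal sketch of this workflow on invented data is shown below. It assumes `log1p()` for variables that can be zero (such as income), since `log(0)` is not finite; the report does not say how zeros were handled. The `toy` data and the `noise` predictor are illustrative.

```r
set.seed(7)
n <- 500
# Illustrative right-skewed predictors plus an irrelevant one
toy <- data.frame(
  INCOME   = rexp(n, rate = 1 / 60000),
  BLUEBOOK = 1500 + rexp(n, rate = 1 / 14000),
  noise    = rnorm(n)
)

# Log transforms; log1p(x) = log(1 + x) stays finite at x = 0
toy$log_INCOME   <- log1p(toy$INCOME)
toy$log_BLUEBOOK <- log(toy$BLUEBOOK)

# Simulated outcome that depends on log income only
toy$TARGET_FLAG <- rbinom(n, 1, plogis(5 - 0.5 * toy$log_INCOME))

full <- glm(TARGET_FLAG ~ log_INCOME + log_BLUEBOOK + noise,
            family = binomial, data = toy)

# Backward elimination by AIC; trace = 0 suppresses the step-by-step log
reduced <- step(full, direction = "backward", trace = 0)
print(formula(reduced))
print(c(full = AIC(full), reduced = AIC(reduced)))
```

Leaving `trace` at its default prints one table per elimination step, listing the AIC that would result from dropping each remaining term.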
Start: AIC=5869.63
TARGET_FLAG ~ log_BLUEBOOK + log_INCOME + log_TRAVTIME + Parent_single +
HOME_VAL + MSTATUS + SEX_male + CAR_USEPrivate + AGE + KIDSDRIV +
HOMEKIDS + Years_on_job + CAR_TYPEPanel_Truck + CAR_TYPEPickup +
CAR_TYPESports_Car + CAR_TYPESUV + CAR_TYPEVan + OLDCLAIM +
CLM_FREQ + REVOKED + MVR_PTS + URBANICITY + JOBClerical +
JOBDoctor + JOBHome_Maker + JOBLawyer + JOBManager + JOBProfessional +
JOBStudent + Time_in_force + `EDUCATION<High School` + EDUCATIONBachelors +
EDUCATIONHighSchool + EDUCATIONMasters
Df Deviance AIC
- `EDUCATION<High School` 1 5799.7 5867.7
- EDUCATIONHighSchool 1 5799.7 5867.7
- JOBClerical 1 5799.8 5867.8
- HOMEKIDS 1 5799.9 5867.9
- AGE 1 5800.3 5868.3
- EDUCATIONMasters 1 5800.6 5868.6
- Years_on_job 1 5801.0 5869.0
<none> 5799.6 5869.6
- SEX_male 1 5801.6 5869.6
- JOBProfessional 1 5801.7 5869.7
- JOBLawyer 1 5802.2 5870.2
- JOBHome_Maker 1 5803.4 5871.4
- EDUCATIONBachelors 1 5804.5 5872.5
- JOBStudent 1 5805.6 5873.6
- CAR_TYPEPanel_Truck 1 5806.4 5874.4
- Parent_single 1 5809.8 5877.8
- HOME_VAL 1 5812.2 5880.2
- OLDCLAIM 1 5813.6 5881.6
- log_INCOME 1 5815.6 5883.6
- CAR_TYPEVan 1 5817.2 5885.2
- JOBDoctor 1 5817.2 5885.2
- log_BLUEBOOK 1 5819.8 5887.8
- CAR_TYPEPickup 1 5826.7 5894.7
- MSTATUS 1 5826.8 5894.8
- JOBManager 1 5832.4 5900.4
- CLM_FREQ 1 5840.2 5908.2
- log_TRAVTIME 1 5842.0 5910.0
- KIDSDRIV 1 5843.1 5911.1
- Time_in_force 1 5846.3 5914.3
- CAR_TYPESports_Car 1 5850.0 5918.0
- CAR_TYPESUV 1 5851.5 5919.5
- MVR_PTS 1 5857.5 5925.5
- CAR_USEPrivate 1 5861.0 5929.0
- REVOKED 1 5885.4 5953.4
- URBANICITY 1 6310.4 6378.4
Step: AIC=5867.72
TARGET_FLAG ~ log_BLUEBOOK + log_INCOME + log_TRAVTIME + Parent_single +
HOME_VAL + MSTATUS + SEX_male + CAR_USEPrivate + AGE + KIDSDRIV +
HOMEKIDS + Years_on_job + CAR_TYPEPanel_Truck + CAR_TYPEPickup +
CAR_TYPESports_Car + CAR_TYPESUV + CAR_TYPEVan + OLDCLAIM +
CLM_FREQ + REVOKED + MVR_PTS + URBANICITY + JOBClerical +
JOBDoctor + JOBHome_Maker + JOBLawyer + JOBManager + JOBProfessional +
JOBStudent + Time_in_force + EDUCATIONBachelors + EDUCATIONHighSchool +
EDUCATIONMasters
Df Deviance AIC
- EDUCATIONHighSchool 1 5799.7 5865.7
- JOBClerical 1 5799.9 5865.9
- HOMEKIDS 1 5799.9 5865.9
- AGE 1 5800.4 5866.4
- Years_on_job 1 5801.1 5867.1
<none> 5799.7 5867.7
- SEX_male 1 5801.8 5867.8
- EDUCATIONMasters 1 5801.9 5867.9
- JOBProfessional 1 5802.2 5868.2
- JOBLawyer 1 5803.5 5869.5
- JOBHome_Maker 1 5804.1 5870.1
- JOBStudent 1 5805.9 5871.9
- CAR_TYPEPanel_Truck 1 5806.6 5872.6
- Parent_single 1 5810.0 5876.0
- HOME_VAL 1 5812.8 5878.8
- OLDCLAIM 1 5813.7 5879.7
- log_INCOME 1 5815.9 5881.9
- CAR_TYPEVan 1 5817.3 5883.3
- EDUCATIONBachelors 1 5817.6 5883.6
- log_BLUEBOOK 1 5819.9 5885.9
- MSTATUS 1 5826.8 5892.8
- CAR_TYPEPickup 1 5827.1 5893.1
- JOBDoctor 1 5829.0 5895.0
- CLM_FREQ 1 5840.3 5906.3
- log_TRAVTIME 1 5842.1 5908.1
- KIDSDRIV 1 5843.2 5909.2
- Time_in_force 1 5846.5 5912.5
- JOBManager 1 5846.5 5912.5
- CAR_TYPESports_Car 1 5850.1 5916.1
- CAR_TYPESUV 1 5851.8 5917.8
- MVR_PTS 1 5857.6 5923.6
- CAR_USEPrivate 1 5863.2 5929.2
- REVOKED 1 5885.4 5951.4
- URBANICITY 1 6310.4 6376.4
Step: AIC=5865.74
TARGET_FLAG ~ log_BLUEBOOK + log_INCOME + log_TRAVTIME + Parent_single +
HOME_VAL + MSTATUS + SEX_male + CAR_USEPrivate + AGE + KIDSDRIV +
HOMEKIDS + Years_on_job + CAR_TYPEPanel_Truck + CAR_TYPEPickup +
CAR_TYPESports_Car + CAR_TYPESUV + CAR_TYPEVan + OLDCLAIM +
CLM_FREQ + REVOKED + MVR_PTS + URBANICITY + JOBClerical +
JOBDoctor + JOBHome_Maker + JOBLawyer + JOBManager + JOBProfessional +
JOBStudent + Time_in_force + EDUCATIONBachelors + EDUCATIONMasters
Df Deviance AIC
- JOBClerical 1 5799.9 5863.9
- HOMEKIDS 1 5800.0 5864.0
- AGE 1 5800.4 5864.4
- Years_on_job 1 5801.1 5865.1
<none> 5799.7 5865.7
- SEX_male 1 5801.8 5865.8
- JOBProfessional 1 5802.2 5866.2
- EDUCATIONMasters 1 5802.4 5866.4
- JOBLawyer 1 5803.6 5867.6
- JOBHome_Maker 1 5804.1 5868.1
- JOBStudent 1 5805.9 5869.9
- CAR_TYPEPanel_Truck 1 5806.6 5870.6
- Parent_single 1 5810.0 5874.0
- HOME_VAL 1 5812.8 5876.8
- OLDCLAIM 1 5813.7 5877.7
- log_INCOME 1 5815.9 5879.9
- CAR_TYPEVan 1 5817.4 5881.4
- log_BLUEBOOK 1 5819.9 5883.9
- MSTATUS 1 5826.8 5890.8
- CAR_TYPEPickup 1 5827.2 5891.2
- EDUCATIONBachelors 1 5827.7 5891.7
- JOBDoctor 1 5830.6 5894.6
- CLM_FREQ 1 5840.3 5904.3
- log_TRAVTIME 1 5842.1 5906.1
- KIDSDRIV 1 5843.3 5907.3
- Time_in_force 1 5846.5 5910.5
- JOBManager 1 5847.1 5911.1
- CAR_TYPESports_Car 1 5850.2 5914.2
- CAR_TYPESUV 1 5852.0 5916.0
- MVR_PTS 1 5857.6 5921.6
- CAR_USEPrivate 1 5867.4 5931.4
- REVOKED 1 5885.5 5949.5
- URBANICITY 1 6310.4 6374.4
Step: AIC=5863.94
TARGET_FLAG ~ log_BLUEBOOK + log_INCOME + log_TRAVTIME + Parent_single +
HOME_VAL + MSTATUS + SEX_male + CAR_USEPrivate + AGE + KIDSDRIV +
HOMEKIDS + Years_on_job + CAR_TYPEPanel_Truck + CAR_TYPEPickup +
CAR_TYPESports_Car + CAR_TYPESUV + CAR_TYPEVan + OLDCLAIM +
CLM_FREQ + REVOKED + MVR_PTS + URBANICITY + JOBDoctor + JOBHome_Maker +
JOBLawyer + JOBManager + JOBProfessional + JOBStudent + Time_in_force +
EDUCATIONBachelors + EDUCATIONMasters
Df Deviance AIC
- HOMEKIDS 1 5800.2 5862.2
- AGE 1 5800.6 5862.6
- Years_on_job 1 5801.4 5863.4
<none> 5799.9 5863.9
- SEX_male 1 5802.0 5864.0
- EDUCATIONMasters 1 5802.6 5864.6
- JOBProfessional 1 5803.7 5865.7
- JOBLawyer 1 5805.2 5867.2
- JOBHome_Maker 1 5806.0 5868.0
- CAR_TYPEPanel_Truck 1 5807.6 5869.6
- JOBStudent 1 5807.9 5869.9
- Parent_single 1 5810.3 5872.3
- HOME_VAL 1 5813.4 5875.4
- OLDCLAIM 1 5813.9 5875.9
- log_INCOME 1 5816.3 5878.3
- CAR_TYPEVan 1 5818.5 5880.5
- log_BLUEBOOK 1 5820.3 5882.3
- MSTATUS 1 5826.8 5888.8
- EDUCATIONBachelors 1 5828.3 5890.3
- CAR_TYPEPickup 1 5829.0 5891.0
- JOBDoctor 1 5835.0 5897.0
- CLM_FREQ 1 5840.6 5902.6
- log_TRAVTIME 1 5842.2 5904.2
- KIDSDRIV 1 5843.4 5905.4
- Time_in_force 1 5846.6 5908.6
- CAR_TYPESports_Car 1 5850.4 5912.4
- CAR_TYPESUV 1 5852.2 5914.2
- MVR_PTS 1 5857.8 5919.8
- JOBManager 1 5859.7 5921.7
- CAR_USEPrivate 1 5881.2 5943.2
- REVOKED 1 5885.8 5947.8
- URBANICITY 1 6312.1 6374.1
Step: AIC=5862.16
TARGET_FLAG ~ log_BLUEBOOK + log_INCOME + log_TRAVTIME + Parent_single +
HOME_VAL + MSTATUS + SEX_male + CAR_USEPrivate + AGE + KIDSDRIV +
Years_on_job + CAR_TYPEPanel_Truck + CAR_TYPEPickup + CAR_TYPESports_Car +
CAR_TYPESUV + CAR_TYPEVan + OLDCLAIM + CLM_FREQ + REVOKED +
MVR_PTS + URBANICITY + JOBDoctor + JOBHome_Maker + JOBLawyer +
JOBManager + JOBProfessional + JOBStudent + Time_in_force +
EDUCATIONBachelors + EDUCATIONMasters
Df Deviance AIC
- AGE 1 5801.4 5861.4
- Years_on_job 1 5802.0 5862.0
<none> 5800.2 5862.2
- SEX_male 1 5802.2 5862.2
- EDUCATIONMasters 1 5802.8 5862.8
- JOBProfessional 1 5804.0 5864.0
- JOBLawyer 1 5805.4 5865.4
- JOBHome_Maker 1 5806.2 5866.2
- CAR_TYPEPanel_Truck 1 5807.9 5867.9
- JOBStudent 1 5808.0 5868.0
- HOME_VAL 1 5813.6 5873.6
- Parent_single 1 5814.1 5874.1
- OLDCLAIM 1 5814.1 5874.1
- log_INCOME 1 5817.4 5877.4
- CAR_TYPEVan 1 5818.7 5878.7
- log_BLUEBOOK 1 5820.5 5880.5
- MSTATUS 1 5827.3 5887.3
- EDUCATIONBachelors 1 5828.5 5888.5
- CAR_TYPEPickup 1 5829.2 5889.2
- JOBDoctor 1 5835.2 5895.2
- CLM_FREQ 1 5840.9 5900.9
- log_TRAVTIME 1 5842.4 5902.4
- Time_in_force 1 5846.7 5906.7
- CAR_TYPESports_Car 1 5850.8 5910.8
- CAR_TYPESUV 1 5852.5 5912.5
- KIDSDRIV 1 5857.0 5917.0
- MVR_PTS 1 5858.2 5918.2
- JOBManager 1 5860.1 5920.1
- CAR_USEPrivate 1 5881.5 5941.5
- REVOKED 1 5886.3 5946.3
- URBANICITY 1 6312.2 6372.2
Step: AIC=5861.39
TARGET_FLAG ~ log_BLUEBOOK + log_INCOME + log_TRAVTIME + Parent_single +
HOME_VAL + MSTATUS + SEX_male + CAR_USEPrivate + KIDSDRIV +
Years_on_job + CAR_TYPEPanel_Truck + CAR_TYPEPickup + CAR_TYPESports_Car +
CAR_TYPESUV + CAR_TYPEVan + OLDCLAIM + CLM_FREQ + REVOKED +
MVR_PTS + URBANICITY + JOBDoctor + JOBHome_Maker + JOBLawyer +
JOBManager + JOBProfessional + JOBStudent + Time_in_force +
EDUCATIONBachelors + EDUCATIONMasters
Df Deviance AIC
- Years_on_job 1 5802.9 5860.9
- SEX_male 1 5803.1 5861.1
<none> 5801.4 5861.4
- EDUCATIONMasters 1 5804.1 5862.1
- JOBProfessional 1 5805.6 5863.6
- JOBLawyer 1 5807.2 5865.2
- JOBHome_Maker 1 5807.8 5865.8
- JOBStudent 1 5809.1 5867.1
- CAR_TYPEPanel_Truck 1 5809.5 5867.5
- OLDCLAIM 1 5815.3 5873.3
- HOME_VAL 1 5815.7 5873.7
- log_INCOME 1 5818.0 5876.0
- Parent_single 1 5819.7 5877.7
- CAR_TYPEVan 1 5820.4 5878.4
- log_BLUEBOOK 1 5823.5 5881.5
- MSTATUS 1 5827.6 5885.6
- EDUCATIONBachelors 1 5829.8 5887.8
- CAR_TYPEPickup 1 5830.5 5888.5
- JOBDoctor 1 5838.6 5896.6
- CLM_FREQ 1 5841.9 5899.9
- log_TRAVTIME 1 5843.2 5901.2
- Time_in_force 1 5847.8 5905.8
- CAR_TYPESports_Car 1 5850.9 5908.9
- CAR_TYPESUV 1 5852.8 5910.8
- KIDSDRIV 1 5857.9 5915.9
- MVR_PTS 1 5860.4 5918.4
- JOBManager 1 5864.3 5922.3
- CAR_USEPrivate 1 5882.0 5940.0
- REVOKED 1 5887.8 5945.8
- URBANICITY 1 6314.7 6372.7
Step: AIC=5860.86
TARGET_FLAG ~ log_BLUEBOOK + log_INCOME + log_TRAVTIME + Parent_single +
HOME_VAL + MSTATUS + SEX_male + CAR_USEPrivate + KIDSDRIV +
CAR_TYPEPanel_Truck + CAR_TYPEPickup + CAR_TYPESports_Car +
CAR_TYPESUV + CAR_TYPEVan + OLDCLAIM + CLM_FREQ + REVOKED +
MVR_PTS + URBANICITY + JOBDoctor + JOBHome_Maker + JOBLawyer +
JOBManager + JOBProfessional + JOBStudent + Time_in_force +
EDUCATIONBachelors + EDUCATIONMasters
Df Deviance AIC
- SEX_male 1 5804.7 5860.7
<none> 5802.9 5860.9
- EDUCATIONMasters 1 5805.7 5861.7
- JOBProfessional 1 5807.2 5863.2
- JOBLawyer 1 5808.9 5864.9
- JOBHome_Maker 1 5809.0 5865.0
- JOBStudent 1 5810.0 5866.0
- CAR_TYPEPanel_Truck 1 5810.8 5866.8
- OLDCLAIM 1 5816.6 5872.6
- HOME_VAL 1 5817.6 5873.6
- log_INCOME 1 5821.4 5877.4
- CAR_TYPEVan 1 5821.7 5877.7
- Parent_single 1 5822.1 5878.1
- log_BLUEBOOK 1 5824.7 5880.7
- MSTATUS 1 5827.6 5883.6
- CAR_TYPEPickup 1 5832.1 5888.1
- EDUCATIONBachelors 1 5832.1 5888.1
- JOBDoctor 1 5841.0 5897.0
- CLM_FREQ 1 5843.3 5899.3
- log_TRAVTIME 1 5844.8 5900.8
- Time_in_force 1 5849.2 5905.2
- CAR_TYPESports_Car 1 5852.4 5908.4
- CAR_TYPESUV 1 5854.5 5910.5
- KIDSDRIV 1 5860.7 5916.7
- MVR_PTS 1 5861.8 5917.8
- JOBManager 1 5867.0 5923.0
- CAR_USEPrivate 1 5882.5 5938.5
- REVOKED 1 5889.4 5945.4
- URBANICITY 1 6315.5 6371.5
Step: AIC=5860.68
TARGET_FLAG ~ log_BLUEBOOK + log_INCOME + log_TRAVTIME + Parent_single +
HOME_VAL + MSTATUS + CAR_USEPrivate + KIDSDRIV + CAR_TYPEPanel_Truck +
CAR_TYPEPickup + CAR_TYPESports_Car + CAR_TYPESUV + CAR_TYPEVan +
OLDCLAIM + CLM_FREQ + REVOKED + MVR_PTS + URBANICITY + JOBDoctor +
JOBHome_Maker + JOBLawyer + JOBManager + JOBProfessional +
JOBStudent + Time_in_force + EDUCATIONBachelors + EDUCATIONMasters
Df Deviance AIC
<none> 5804.7 5860.7
- EDUCATIONMasters 1 5807.4 5861.4
- JOBProfessional 1 5809.2 5863.2
- JOBLawyer 1 5810.8 5864.8
- JOBHome_Maker 1 5811.9 5865.9
- JOBStudent 1 5812.1 5866.1
- CAR_TYPEPanel_Truck 1 5815.6 5869.6
- OLDCLAIM 1 5818.5 5872.5
- HOME_VAL 1 5819.4 5873.4
- log_INCOME 1 5823.3 5877.3
- Parent_single 1 5823.4 5877.4
- CAR_TYPEVan 1 5827.1 5881.1
- MSTATUS 1 5829.5 5883.5
- EDUCATIONBachelors 1 5833.8 5887.8
- CAR_TYPEPickup 1 5833.8 5887.8
- log_BLUEBOOK 1 5834.5 5888.5
- JOBDoctor 1 5842.5 5896.5
- CLM_FREQ 1 5845.4 5899.4
- log_TRAVTIME 1 5846.8 5900.8
- Time_in_force 1 5851.1 5905.1
- CAR_TYPESports_Car 1 5859.5 5913.5
- KIDSDRIV 1 5862.1 5916.1
- MVR_PTS 1 5863.4 5917.4
- CAR_TYPESUV 1 5868.3 5922.3
- JOBManager 1 5869.2 5923.2
- CAR_USEPrivate 1 5884.3 5938.3
- REVOKED 1 5891.4 5945.4
- URBANICITY 1 6318.3 6372.3
Call:
glm(formula = TARGET_FLAG ~ log_BLUEBOOK + log_INCOME + log_TRAVTIME +
Parent_single + HOME_VAL + MSTATUS + CAR_USEPrivate + KIDSDRIV +
CAR_TYPEPanel_Truck + CAR_TYPEPickup + CAR_TYPESports_Car +
CAR_TYPESUV + CAR_TYPEVan + OLDCLAIM + CLM_FREQ + REVOKED +
MVR_PTS + URBANICITY + JOBDoctor + JOBHome_Maker + JOBLawyer +
JOBManager + JOBProfessional + JOBStudent + Time_in_force +
EDUCATIONBachelors + EDUCATIONMasters, family = binomial,
data = tr_dt2)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.802e-02 6.445e-01 -0.028 0.977698
log_BLUEBOOK -3.376e-01 6.153e-02 -5.486 4.10e-08 ***
log_INCOME -6.966e-02 1.619e-02 -4.302 1.69e-05 ***
log_TRAVTIME 3.853e-01 6.053e-02 6.366 1.94e-10 ***
Parent_single 4.602e-01 1.063e-01 4.329 1.50e-05 ***
HOME_VAL -1.367e-06 3.601e-07 -3.796 0.000147 ***
MSTATUS -4.453e-01 8.882e-02 -5.014 5.34e-07 ***
CAR_USEPrivate -7.612e-01 8.597e-02 -8.854 < 2e-16 ***
KIDSDRIV 4.731e-01 6.245e-02 7.576 3.56e-14 ***
CAR_TYPEPanel_Truck 5.229e-01 1.575e-01 3.320 0.000901 ***
CAR_TYPEPickup 5.905e-01 1.093e-01 5.402 6.60e-08 ***
CAR_TYPESports_Car 9.014e-01 1.211e-01 7.442 9.92e-14 ***
CAR_TYPESUV 7.499e-01 9.518e-02 7.879 3.30e-15 ***
CAR_TYPEVan 6.445e-01 1.348e-01 4.780 1.75e-06 ***
OLDCLAIM -1.575e-05 4.287e-06 -3.674 0.000239 ***
CLM_FREQ 2.043e-01 3.179e-02 6.425 1.31e-10 ***
REVOKED 9.415e-01 1.002e-01 9.394 < 2e-16 ***
MVR_PTS 1.166e-01 1.528e-02 7.629 2.36e-14 ***
URBANICITY 2.360e+00 1.250e-01 18.887 < 2e-16 ***
JOBDoctor -1.467e+00 2.721e-01 -5.393 6.95e-08 ***
JOBHome_Maker -4.407e-01 1.659e-01 -2.656 0.007917 **
JOBLawyer -4.104e-01 1.670e-01 -2.457 0.014014 *
JOBManager -9.189e-01 1.174e-01 -7.825 5.06e-15 ***
JOBProfessional -2.435e-01 1.155e-01 -2.108 0.035025 *
JOBStudent -4.074e-01 1.509e-01 -2.700 0.006930 **
Time_in_force -5.507e-02 8.227e-03 -6.694 2.18e-11 ***
EDUCATIONBachelors -4.529e-01 8.451e-02 -5.359 8.35e-08 ***
EDUCATIONMasters -2.048e-01 1.240e-01 -1.651 0.098780 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 7533.1 on 6527 degrees of freedom
Residual deviance: 5804.7 on 6500 degrees of freedom
AIC: 5860.7
Number of Fisher Scoring iterations: 5
Logistic Model 3 Results
Model 3 improves on Model 2: the AIC has dropped from ~5,891 to ~5,861.
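The trace above has the shape of output produced by backward elimination on AIC. As a minimal sketch of that procedure, assuming base R's `step()` and simulated data (the data frame `sim` and its columns are hypothetical and are not the report's variables):

```r
# Simulated data; all names here are illustrative assumptions.
set.seed(42)
n <- 500
sim <- data.frame(
  x1 = rnorm(n),      # informative predictor
  x2 = rnorm(n),      # informative predictor
  noise1 = rnorm(n),  # unrelated to the outcome
  noise2 = rnorm(n)   # unrelated to the outcome
)
sim$y <- rbinom(n, 1, plogis(-1 + 1.2 * sim$x1 - 0.8 * sim$x2))

# Start from the full logistic model.
full_fit <- glm(y ~ ., data = sim, family = binomial)

# Backward elimination: at each step, drop the term whose removal gives the
# lowest AIC, and stop once keeping everything ("<none>") has the lowest AIC.
reduced_fit <- step(full_fit, direction = "backward", trace = 0)

AIC(full_fit)
AIC(reduced_fit)
formula(reduced_fit)
```

Setting `trace = 1` instead prints one table per step, in the same layout as the trace shown in this section.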
Magnitude
- Similar to Models 1 and 2, Model 3 shows that driving in an urban location has the largest effect on predicted crash risk, with an odds ratio of (\(e^{2.36} \approx 10.59\)). Holding the other predictors fixed, living in a highly urban area multiplies the odds of a crash by approximately 10.59 compared to the reference category (highly rural/rural areas). Drivers with a poor driving record (for example a revoked license) and those working as doctors or managers also remain influential.
- This model also includes log-transformed variables. For log_TRAVTIME the coefficient is 0.385, so a one-unit increase in log_TRAVTIME (which, assuming a natural log, multiplies TRAVTIME by about 2.72) multiplies the odds of a crash by (\(e^{0.385} \approx 1.47\)), or roughly a 47% increase in odds. Longer travel times are therefore associated with moderately higher odds of a crash.
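The odds ratios discussed above come from exponentiating the logistic coefficients. A short sketch using the estimates printed in the Model 3 summary; it assumes the log variables were built with R's natural-log `log()`, which the report does not show:

```r
# Coefficients copied from the Model 3 summary (log-odds scale).
b_urban    <- 2.360    # URBANICITY
b_travtime <- 0.3853   # log_TRAVTIME
b_doctor   <- -1.467   # JOBDoctor

# The odds ratio for a one-unit change in a predictor is exp(coefficient).
or_urban  <- exp(b_urban)
or_doctor <- exp(b_doctor)

# For a log-transformed predictor, multiplying the raw value by k
# multiplies the odds by k^coefficient (assumes natural log).
or_travtime_per_log_unit <- exp(b_travtime)  # TRAVTIME multiplied by e
or_travtime_doubling     <- 2^b_travtime     # TRAVTIME doubled

round(c(urban = or_urban,
        doctor = or_doctor,
        travtime_log_unit = or_travtime_per_log_unit,
        travtime_doubling = or_travtime_doubling), 2)
```

Expressing the effect per doubling of travel time is often easier to communicate than the effect per log unit.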
Significance
- Variables such as URBANICITY, REVOKED, and log_BLUEBOOK have extremely low p-values, suggesting their effects are highly significant in predicting crashes. Other variables such as JOBDoctor, HOME_VAL, and CAR_TYPESUV also show strong significance, which adds credibility to their impact on the likelihood of a crash. log_INCOME, one of the transformed variables, also shows an improved p-value.
- On the other hand, variables with higher p-values, such as EDUCATIONMasters (p ≈ 0.099) and, to a lesser extent, JOBProfessional (p ≈ 0.035), contribute less to the model and could be revisited for relevance.
- Overall, the majority of variables are highly significant, which supports a strong model.
Direction
- The directions of the coefficients align with expectations and are similar to those in Models 1 and 2.
Reviewing the model
\[ \text{Sensitivity} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}} \]
\[ \text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}} \]
Overall, sensitivity improved from 42.74% in Model 2 to 43.44%, indicating a better ability to identify true positives (correctly predicting that an individual will have a car crash). The model therefore captures a slightly higher proportion of actual crash-prone individuals, which is critical in risk assessment for insurance companies.
The model's precision improved from 66.43% to 67.69%, meaning a larger share of the individuals predicted to be high risk are actually crash prone. This reduces false positives and makes the model more reliable in targeting truly crash-prone individuals, which is valuable for risk assessment.
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 4449 974
1 357 748
Accuracy : 0.7961
95% CI : (0.7861, 0.8058)
No Information Rate : 0.7362
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.4069
Mcnemar's Test P-Value : < 2.2e-16
Sensitivity : 0.4344
Specificity : 0.9257
Pos Pred Value : 0.6769
Neg Pred Value : 0.8204
Prevalence : 0.2638
Detection Rate : 0.1146
Detection Prevalence : 0.1693
Balanced Accuracy : 0.6800
'Positive' Class : 1
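The headline metrics can be re-derived directly from the four counts in the confusion matrix above (rows are predictions, columns are the reference). A small sketch; the variable names are illustrative:

```r
# Counts read from the confusion matrix (positive class = 1).
tn <- 4449  # predicted 0, actual 0
fn <- 974   # predicted 0, actual 1
fp <- 357   # predicted 1, actual 0
tp <- 748   # predicted 1, actual 1

sensitivity <- tp / (tp + fn)                   # share of actual crashes caught
precision   <- tp / (tp + fp)                   # share of predicted crashes that are real
specificity <- tn / (tn + fp)                   # share of non-crashes correctly cleared
accuracy    <- (tp + tn) / (tp + tn + fp + fn)  # share of all predictions that are right
error_rate  <- 1 - accuracy

round(100 * c(sensitivity = sensitivity, precision = precision,
              specificity = specificity, accuracy = accuracy,
              error_rate = error_rate), 2)
```

Precision corresponds to the "Pos Pred Value" row in the printed statistics.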
Checking Multicollinearity
Multicollinearity, where two or more predictor (independent) variables are correlated with each other, looks acceptable: every VIF falls within 1 < VIF ≤ 5. Across the board the variables look much better than in previous models in this respect, with the highest VIF at ~2.5.
vif_logit_model3
log_BLUEBOOK 1.450673
log_INCOME 2.468105
log_TRAVTIME 1.025990
Parent_single 1.431710
HOME_VAL 1.737145
MSTATUS 1.840979
CAR_USEPrivate 1.705943
KIDSDRIV 1.090497
CAR_TYPEPanel_Truck 1.829136
CAR_TYPEPickup 1.737838
CAR_TYPESports_Car 1.476969
CAR_TYPESUV 1.794573
CAR_TYPEVan 1.496378
OLDCLAIM 1.622047
CLM_FREQ 1.449709
REVOKED 1.300646
MVR_PTS 1.156608
URBANICITY 1.125220
JOBDoctor 1.076165
JOBHome_Maker 1.885921
JOBLawyer 2.372171
JOBManager 1.581065
JOBProfessional 1.508283
JOBStudent 1.882995
Time_in_force 1.008836
EDUCATIONBachelors 1.281639
EDUCATIONMasters 2.222524
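For intuition on what a VIF measures: in a linear setting, the VIF of predictor j equals 1 / (1 - R²), where R² comes from regressing predictor j on all the other predictors. The sketch below computes this by hand on simulated data; it is illustrative only, and the function that produced the table above may use a slightly different calculation for a logistic model.

```r
# Simulated predictors; x2 is deliberately correlated with x1.
set.seed(1)
n  <- 300
x1 <- rnorm(n)
x2 <- 0.7 * x1 + rnorm(n, sd = 0.7)
x3 <- rnorm(n)  # independent of the others
X  <- data.frame(x1, x2, x3)

# VIF_j = 1 / (1 - R^2_j), with R^2_j from regressing x_j on the other predictors.
vif_manual <- sapply(names(X), function(v) {
  others <- setdiff(names(X), v)
  r2 <- summary(lm(reformulate(others, response = v), data = X))$r.squared
  1 / (1 - r2)
})

round(vif_manual, 2)
```

A VIF near 1 means the predictor is almost uncorrelated with the rest, while larger values mean its coefficient's variance is inflated by overlap with other predictors.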
Linear regression Models
Model 1
The first linear regression model uses all available variables from the dataset to predict how much a crash will cost if the person does crash their car (TARGET_AMT). This model includes a range of predictors such as car characteristics, driver demographics, and behavioral factors. Including them all allows the model to capture as much information as possible about which variables contribute most to crash cost.
Model 1 will serve as the baseline for model development. It uses a filtered data set containing only individuals who have been in a car accident, so that it predicts the cost of an accident more accurately.
Call:
lm(formula = TARGET_AMT ~ BLUEBOOK + CAR_AGE + INCOME + Parent_single +
HOME_VAL + MSTATUS + SEX_male + `EDUCATION<High School` +
EDUCATIONBachelors + EDUCATIONHighSchool + EDUCATIONMasters +
EDUCATIONPhD + TRAVTIME + CAR_USEPrivate + KIDSDRIV + AGE +
HOMEKIDS + Years_on_job + CAR_TYPEPanel_Truck + CAR_TYPEPickup +
CAR_TYPESports_Car + CAR_TYPESUV + CAR_TYPEVan + RED_CAR +
OLDCLAIM + CLM_FREQ + REVOKED + MVR_PTS + URBANICITY + JOBClerical +
JOBDoctor + JOBHome_Maker + JOBLawyer + JOBManager + JOBProfessional +
JOBStudent + Time_in_force, data = tr_dt2_filtered)
Residuals:
Min 1Q Median 3Q Max
-8546 -3033 -1343 594 99723
Coefficients: (1 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.396e+03 2.226e+03 1.975 0.0485 *
BLUEBOOK 1.326e-01 3.293e-02 4.026 5.92e-05 ***
CAR_AGE -8.583e+01 4.861e+01 -1.766 0.0777 .
INCOME -1.019e-02 7.833e-03 -1.301 0.1936
Parent_single 5.781e+02 6.369e+02 0.908 0.3641
HOME_VAL 2.887e-03 2.272e-03 1.271 0.2040
MSTATUS -6.438e+02 5.459e+02 -1.179 0.2384
SEX_male 1.482e+03 7.059e+02 2.099 0.0359 *
`EDUCATION<High School` -1.944e+03 1.368e+03 -1.421 0.1555
EDUCATIONBachelors -1.686e+03 1.140e+03 -1.478 0.1395
EDUCATIONHighSchool -2.401e+03 1.260e+03 -1.906 0.0569 .
EDUCATIONMasters -1.085e+03 1.004e+03 -1.080 0.2802
EDUCATIONPhD NA NA NA NA
TRAVTIME 2.076e+00 1.196e+01 0.174 0.8622
CAR_USEPrivate -6.369e+02 5.669e+02 -1.123 0.2615
KIDSDRIV -2.003e+02 3.396e+02 -0.590 0.5554
AGE 2.261e+01 2.321e+01 0.974 0.3300
HOMEKIDS 1.575e+02 2.305e+02 0.683 0.4946
Years_on_job -2.202e+01 5.463e+01 -0.403 0.6870
CAR_TYPEPanel_Truck -7.621e+02 1.041e+03 -0.732 0.4643
CAR_TYPEPickup -5.245e+02 6.366e+02 -0.824 0.4101
CAR_TYPESports_Car 7.327e+02 8.132e+02 0.901 0.3677
CAR_TYPESUV 7.689e+02 7.146e+02 1.076 0.2820
CAR_TYPEVan -8.709e+02 8.382e+02 -1.039 0.2989
RED_CAR 2.111e+01 5.371e+02 0.039 0.9687
OLDCLAIM 1.195e-02 2.374e-02 0.503 0.6148
CLM_FREQ -2.626e+01 1.690e+02 -0.155 0.8766
REVOKED -7.456e+02 5.410e+02 -1.378 0.1683
MVR_PTS 1.210e+02 7.397e+01 1.636 0.1019
URBANICITY 6.297e+02 8.141e+02 0.773 0.4393
JOBClerical -7.282e+01 6.396e+02 -0.114 0.9094
JOBDoctor -1.530e+03 2.158e+03 -0.709 0.4783
JOBHome_Maker -4.972e+02 9.795e+02 -0.508 0.6118
JOBLawyer 3.299e+02 1.187e+03 0.278 0.7811
JOBManager -5.329e+02 9.193e+02 -0.580 0.5622
JOBProfessional 7.054e+02 7.295e+02 0.967 0.3337
JOBStudent -2.713e+02 7.836e+02 -0.346 0.7292
Time_in_force -1.947e+00 4.565e+01 -0.043 0.9660
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 7409 on 1685 degrees of freedom
Multiple R-squared: 0.03325, Adjusted R-squared: 0.0126
F-statistic: 1.61 on 36 and 1685 DF, p-value: 0.01273
Model 1 Results
Job | Average Target Amt | Average Income | Average Bluebook |
---|---|---|---|
Blue Collar | 5,653 | 54,055 | 13,984 |
Clerical | 5,127 | 31,934 | 11,966 |
Doctor | 5,194 | 140,324 | 18,034 |
Home Maker | 4,828 | 8,268 | 11,426 |
Lawyer | 5,865 | 80,875 | 14,941 |
Manager | 5,778 | 102,687 | 19,760 |
Professional | 6,561 | 68,766 | 17,342 |
Student | 5,058 | 5,347 | 10,441 |
The Adjusted R-squared is 0.013, indicating that the model explains only about 1% of the variance in the cost of the crash, so the predictors have limited explanatory power for this outcome. The F-statistic of 1.61 with a p-value of 0.0127 indicates that the model is statistically significant overall, though the effect size is small.
The three education indicators for a bachelor's degree or less (Bachelors, High School, and <High School) have the largest coefficients in magnitude for predicting the cost of the crash. For example, the average crash cost for individuals with a high school education is about $2,401 lower than for the omitted education baseline (most likely PhD, whose coefficient is reported as NA). While this might sound counterintuitive initially, it is plausible: drivers with a high school education may, for example, drive entry-level cars that are cheaper to repair than those driven by people with a master's degree. Although these coefficients are large, they are not statistically significant based on their p-values.
On the other hand, BLUEBOOK has a much lower p-value, meaning it is more statistically significant, but a smaller coefficient. Its coefficient of approximately 0.133 indicates that each one-unit increase in BLUEBOOK (the value of the car) is associated with an increase of around 0.133 units in crash cost. BLUEBOOK is most likely measured in dollars, so this accumulates into a material effect: a $10,000 difference in car value corresponds to roughly $1,300 in crash cost.
Overall, the only highly significant variable is BLUEBOOK (SEX_male is also significant at the 5% level), which largely makes sense because the cost of repair or the total payout depends on the value of the car. Some variables have an unexpected direction, such as INCOME: for every dollar increase in income, the cost of the crash decreases by $0.01. This effect is neither statistically significant nor large, yet it is surprising, as one would expect the cost of the crash to increase. Similarly, in the table above, doctors have a higher average income and Blue Book value, yet their average crash cost ($5,194) is lower than that of several lower-income groups, such as blue collar workers ($5,653). This is most likely driven by the severity of the accidents and the value of the other car involved, which we don't have data on.
Checking Model Assumptions - Residual Analysis
The diagnostic plots indicate mild non-linearity, suggesting that the model does not fully capture the relationship between predictors and the outcome. The spread of the residuals also increases with the fitted values (heteroscedasticity), which could lead to inefficient estimates.
There are a few influential points that could impact the model as shown in the Residuals and Leverage graph. Lastly, the residuals show some deviation from normality, especially in the tails, which could affect statistical inferences.
Measuring multicollinearity
Multicollinearity occurs when two or more predictor variables are highly correlated, complicating the model's ability to estimate each variable's individual effect. In Model 1, EDUCATIONPhD has no coefficient (the summary reports 1 coefficient not defined because of singularities). The most likely cause is that every education level has its own indicator, so the PhD indicator is an exact linear combination of the other indicators and the intercept, and R drops it to avoid redundancy. This doesn't imply that a PhD has no effect; its effect is absorbed into the intercept and the other education coefficients.
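The NA coefficient can be reproduced on simulated data by including one indicator per category alongside an intercept. This sketch uses hypothetical education levels and costs, not the report's data:

```r
# Simulated data; level names and cost values are illustrative assumptions.
set.seed(7)
n   <- 200
edu <- factor(sample(c("HighSchool", "Bachelors", "Masters", "PhD"),
                     n, replace = TRUE))

# One 0/1 indicator per level, with no level left out.
dummies <- as.data.frame(model.matrix(~ edu - 1))
names(dummies) <- levels(edu)
dummies$cost <- 5000 + 500 * dummies$PhD + rnorm(n, sd = 1000)

# The indicators sum to 1 in every row, which duplicates the intercept,
# so lm() cannot estimate all of them and reports one as NA.
fit <- lm(cost ~ ., data = dummies)
coef(fit)
```

Leaving one level out of the formula (or letting R encode the factor itself) removes the redundancy and makes the omitted level the explicit baseline.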
Model 2
Model 2 will use a lasso regression to identify variables that can be excluded: its L1 penalty shrinks coefficients and sets some of them exactly to zero. In addition, the log terms previously created will be added for consideration.
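For reference, the lasso estimate minimizes the residual sum of squares plus a penalty on the absolute size of the coefficients. In the scaling used by the glmnet package (an assumption here, suggested by the sparse-matrix output format), with \(n\) observations, \(p\) predictors, and penalty weight \(\lambda\):

\[ \hat{\beta}^{\text{lasso}} = \arg\min_{\beta_0,\, \beta} \left\{ \frac{1}{2n} \sum_{i=1}^{n} \left( y_i - \beta_0 - x_i^\top \beta \right)^2 + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert \right\} \]

Larger values of \(\lambda\) push more coefficients exactly to zero, and the intercept \(\beta_0\) is not penalized.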
Below, we can see that only a few coefficients were kept at the optimal lambda: 4 variables (log_BLUEBOOK, SEX_male, MVR_PTS, and CAR_TYPEPanel_Truck). This reduces the model's complexity by excluding less relevant predictors, making it more interpretable.
Lasso Coefficients:
35 x 1 sparse Matrix of class "dgCMatrix"
s1
(Intercept) 5559.85426
log_BLUEBOOK 618.10795
log_INCOME .
Parent_single .
HOME_VAL .
MSTATUS .
SEX_male 48.91519
`EDUCATION<High School` .
EDUCATIONBachelors .
EDUCATIONHighSchool .
EDUCATIONMasters .
log_TRAVTIME .
CAR_USEPrivate .
KIDSDRIV .
AGE .
HOMEKIDS .
Years_on_job .
CAR_TYPEPanel_Truck 69.43976
CAR_TYPEPickup .
CAR_TYPESports_Car .
CAR_TYPESUV .
CAR_TYPEVan .
OLDCLAIM .
CLM_FREQ .
REVOKED .
MVR_PTS 42.89832
URBANICITY .
JOBClerical .
JOBDoctor .
JOBHome_Maker .
JOBLawyer .
JOBManager .
JOBProfessional .
JOBStudent .
Time_in_force .
Applying the coefficient analysis from the lasso regression into model 2
Model 2 will also include the variable REVOKED, which indicates whether the driver's license was revoked in the past 7 years; a revocation suggests a riskier driver.
Outside of the value of the car, the severity of the crash is the next leading factor in determining the cost of the crash, and it can be partially attributed to the riskiness of the driver. SEX_male was significant in Model 1 and is therefore carried forward to Model 2. MVR_PTS likewise had a lower p-value than most other variables in Model 1 and helps us assess the riskiness of the driver.
Call:
lm(formula = TARGET_AMT ~ log_BLUEBOOK + SEX_male + MVR_PTS +
CAR_TYPEPanel_Truck + REVOKED, data = tr_dt2_filtered)
Residuals:
Min 1Q Median 3Q Max
-7802 -3011 -1456 487 100414
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -7254.20 2752.35 -2.636 0.00847 **
log_BLUEBOOK 1306.91 292.98 4.461 8.7e-06 ***
SEX_male 610.68 374.21 1.632 0.10288
MVR_PTS 143.20 69.42 2.063 0.03928 *
CAR_TYPEPanel_Truck 786.95 748.68 1.051 0.29336
REVOKED -581.73 433.69 -1.341 0.17998
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 7380 on 1716 degrees of freedom
Multiple R-squared: 0.02319, Adjusted R-squared: 0.02034
F-statistic: 8.147 on 5 and 1716 DF, p-value: 1.303e-07
Model 2 Results
The Adjusted R-squared is 0.0203. While this is an improvement, the model still explains only about 2% of the variance in the cost of the crash, so the predictors still have limited explanatory power for this outcome. The F-statistic rose to 8.147 with a p-value of 1.303e-07, showing that the model is strongly statistically significant overall, though the effect size remains small.
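log_BLUEBOOK has a coefficient of 1306.91 with a very low p-value, indicating a strong, significant positive association. A one-unit increase in log_BLUEBOOK (roughly a 2.7-fold increase in the car's value, assuming a natural log) is associated with an increase of approximately $1,307 in the crash cost; equivalently, a 1% increase in BLUEBOOK is associated with about $13 more. Similarly, individuals who drive a panel truck are expected to have, on average, a $787 higher crash cost than those who do not, plausibly because of the size of the vehicle, although this effect is not statistically significant (p ≈ 0.29).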
The most significant predictors are log_BLUEBOOK and MVR_PTS. log_BLUEBOOK has a large positive effect on the overall crash cost, while the effect of MVR_PTS is much smaller: each additional point is associated with about $143 more in crash cost.
What is interesting is that individuals with a historically revoked license have, on average, a $582 lower crash cost than those who haven't had their license revoked, though this effect is not statistically significant (p ≈ 0.18). It may suggest that these drivers drive more cautiously after experiencing the consequences of license revocation. In contrast, male drivers are associated with a $611 higher average crash cost compared to female drivers. While females have a slightly higher crash rate, the higher costs associated with male drivers may reflect different driving behaviors that lead to more expensive crashes.
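Because BLUEBOOK enters Model 2 in logs while the outcome is in dollars, its coefficient is easiest to read in terms of percentage changes in car value. A short sketch using the reported estimate, assuming the log variable was built with R's natural-log `log()`:

```r
# Coefficient on log_BLUEBOOK from the Model 2 summary.
b_log_bluebook <- 1306.91

# Change in predicted crash cost (dollars) when BLUEBOOK rises by `pct` percent:
# beta * [log(x * (1 + pct/100)) - log(x)] = beta * log(1 + pct/100).
effect_of_pct_increase <- function(pct) {
  b_log_bluebook * log(1 + pct / 100)
}

effect_of_pct_increase(1)    # 1% more expensive car
effect_of_pct_increase(10)   # 10% more expensive car
effect_of_pct_increase(100)  # car value doubled
```

The effect does not depend on the starting car value, only on the proportional change, which is the main consequence of the log transformation.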
Checking Model Assumptions - Residual Analysis
In Model 2, the diagnostics still indicate issues similar to Model 1, including non-linearity, heteroscedasticity, and non-normal residuals. Although the model fit has improved slightly, as seen in the increased adjusted R-squared and lower p-value, the Residuals vs Fitted plot suggests that the model does not fully capture the underlying relationships. The Scale-Location plot shows heteroscedasticity, meaning the residual variance is not constant, which can affect the efficiency of estimates. Lastly, influential observations remain a concern, though they appear less extreme than in Model 1.
Checking Multicollinearity
Multicollinearity, where two or more predictor (independent) variables are correlated with each other, is not a concern here: all VIFs are close to 1 (the highest is ~1.29).
vif_linear_model2
log_BLUEBOOK 1.188219
SEX_male 1.094768
MVR_PTS 1.001602
CAR_TYPEPanel_Truck 1.289131
REVOKED 1.001240
4.1 Choosing the best Logistic Model
Comparing the results
ROC Curve
The area under the curve (AUC) for Model 3 is 0.8156, meaning that about 81.56% of the time the model ranks a randomly chosen positive instance higher than a randomly chosen negative instance. This indicates that the model has good predictive power. The closer the ROC curve is to the top left corner, the better the model.
Below is an AUC ranking reference:
0.90 - 1.00: Excellent (highly accurate model)
0.80 - 0.90: Good (strong model with good discriminative ability)
0.70 - 0.80: Fair (moderate accuracy, but still useful)
0.60 - 0.70: Poor (weak model)
0.50: No discriminative power (random guessing)
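The ranking interpretation of AUC can be checked numerically. The sketch below uses simulated scores (not the report's predictions) and computes AUC two ways: as the share of positive/negative pairs ranked correctly, and through the equivalent rank-sum formula.

```r
# Simulated model scores; positives tend to score higher than negatives.
set.seed(123)
score_pos <- rnorm(200, mean = 1)  # scores for actual crashes
score_neg <- rnorm(600, mean = 0)  # scores for actual non-crashes

# Pairwise definition: share of (positive, negative) pairs in which the
# positive case has the higher score; ties count as one half.
wins <- outer(score_pos, score_neg, ">") +
  0.5 * outer(score_pos, score_neg, "==")
auc_pairwise <- mean(wins)

# Equivalent rank-sum (Mann-Whitney) formula.
r     <- rank(c(score_pos, score_neg))
n_pos <- length(score_pos)
n_neg <- length(score_neg)
auc_rank <- (sum(r[seq_len(n_pos)]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

c(pairwise = auc_pairwise, rank_sum = auc_rank)
```

Both numbers agree, which is why AUC can be read as a probability of correct ranking rather than as an accuracy at a particular cutoff.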
Model | df | AIC |
---|---|---|
logit_model1 | 37 | 5894.713 |
logit_model2 | 35 | 5890.746 |
logit_model3 | 28 | 5860.677 |
Model | Accuracy | Sensitivity | Precision | Specificity | ErrorRate |
---|---|---|---|---|---|
Model 1 | 79.21% | 42.74% | 66.49% | 92.28% | 20.79% |
Model 2 | 79.20% | 42.74% | 66.43% | 92.26% | 20.80% |
Model 3 | 79.61% | 43.44% | 67.69% | 92.57% | 20.39% |
Model 3 stands out as the most well rounded choice for predicting the likelihood of a crash due to a strong balance between performance and simplicity. It has the highest sensitivity, making it better at identifying the higher risk drivers while maintaining the highest precision, reducing the likelihood of false alarms. Additionally, its low AIC and minimal predictor set indicate that it achieves these results efficiently, supporting its use as a reliable and practical model for predicting crash risks.
Predictive Performance:
Sensitivity: Model 3 has the highest sensitivity (43.44%) compared to Model 1 and Model 2 (both at 42.74%). Sensitivity, the true positive rate, is critical here because it reflects the model's ability to correctly identify drivers who are likely to get into a crash. A higher sensitivity means Model 3 is slightly better at flagging the higher risk drivers, which can make a material difference to an insurance company's bottom line.
Precision: With the highest precision (67.69%), Model 3 outperforms Model 1 (66.49%) and Model 2 (66.43%) in terms of minimizing false positives. In a crash prediction context, a higher precision means that when Model 3 predicts a crash, it is more likely to be correct. This reliability is important, as unnecessary interventions based on false positives could be costly.
Specificity: Measures the model's ability to correctly identify non-crashes. All three models demonstrate high specificity. Model 3 (92.57%) has a slight edge over Model 1 (92.28%) and Model 2 (92.26%), though the difference is marginal.
Accuracy: Measures the proportion of correct predictions (both true positives and true negatives) out of all predictions. It gives an overall sense of how well the model classifies both crash and non-crash cases. All three models exhibit strong accuracy at around 80%, with Model 3 having a slight edge at 79.61%.
Error Rate (1 - Accuracy): The classification error rate is the complement of accuracy, representing the proportion of incorrect predictions. Model 3 has the lowest error rate at 20.39%, which aligns with its slightly higher accuracy.
Model Complexity and AIC (Model Fit) Comparison:
AIC: Model 3 has the lowest AIC (5861), which indicates a better balance between model fit and complexity compared to Model 1 (5895) and Model 2 (5891). The lower AIC suggests that Model 3 provides a good fit to the data without overfitting, making it more reliable for future predictions.
Degrees of Freedom: With the fewest degrees of freedom (28), Model 3 reduces complexity and captures the essential patterns in crash prediction while avoiding unnecessary variables.
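The AIC values compared above can be reproduced from the residual deviance and the number of estimated parameters. For a logistic model fitted to individual 0/1 outcomes, the deviance equals minus twice the log-likelihood, so AIC = deviance + 2k. A sketch using figures from the stepwise trace and the final summary (the helper name is illustrative):

```r
# AIC for a binary-outcome logistic model: deviance + 2 * (number of parameters).
aic_from_deviance <- function(deviance, k) deviance + 2 * k

# Last stepwise decision: keep or drop SEX_male.
aic_with_sex    <- aic_from_deviance(5802.9, 29)  # 28 predictors + intercept
aic_without_sex <- aic_from_deviance(5804.7, 28)  # final Model 3: 27 predictors + intercept

c(with_SEX_male = aic_with_sex, without_SEX_male = aic_without_sex)
```

Dropping SEX_male raised the deviance by only 1.8, less than the penalty of 2 saved by removing one parameter, which is why the AIC fell and the variable was removed.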
Making predictions with model 3 on the evaluation data
Model 3 classifies 17.3% of the evaluation records as crashes, compared with an actual crash rate of 26.4% in the training data. Some gap is expected: the evaluation data is a different sample, and the model is conservative, flagging only 16.9% of the training records as crashes (the detection prevalence in the confusion matrix above). The predicted share on the evaluation data is therefore in line with the model's behavior on the training data, which suggests it generalizes consistently.
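As a sketch of how such a comparison can be made, the example below fits a logistic model on simulated training data, scores a separate simulated evaluation set, and compares the share of predicted crashes. The data, the single predictor `x`, and the 0.5 cutoff are all assumptions for illustration; the report does not state the cutoff it used.

```r
# Simulated training and evaluation sets with a minority positive class.
set.seed(2024)
make_data <- function(n) {
  x <- rnorm(n)
  data.frame(x = x, crash = rbinom(n, 1, plogis(-1.2 + 1.1 * x)))
}
train      <- make_data(2000)
evaluation <- make_data(800)

fit <- glm(crash ~ x, data = train, family = binomial)

# Predicted probabilities on new data, then a 0.5 cutoff (assumed).
p_eval    <- predict(fit, newdata = evaluation, type = "response")
pred_eval <- as.integer(p_eval > 0.5)

predicted_share_eval  <- mean(pred_eval)          # share flagged in evaluation data
predicted_share_train <- mean(fitted(fit) > 0.5)  # share flagged in training data
actual_share_train    <- mean(train$crash)        # true crash rate in training data

round(c(predicted_eval = predicted_share_eval,
        predicted_train = predicted_share_train,
        actual_train = actual_share_train), 3)
```

With a minority positive class, a 0.5 cutoff typically flags fewer cases than the true base rate, so the predicted share is best compared with the predicted share on the training data rather than with the actual crash rate.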
4.2 Choosing the best Linear Model
Comparing the results
Model | R-Squared | Adjusted R-Squared | F-Statistic | Residual Standard Error |
---|---|---|---|---|
Linear Model 1 | 0.0332 | 0.0126 | 1.6098 | 7409.073 |
Linear Model 2 | 0.0232 | 0.0203 | 8.1470 | 7379.952 |
Given these performance metrics, Linear Model 2 demonstrates a better balance of adjusted R-squared, F-statistic, and residual standard error. While both models have relatively low R-squared values suggesting that they only explain a small portion of the variance in crash costs, Model 2's higher adjusted R squared and F statistic provide evidence that it is a more reliable model.
R Squared and Adjusted R-Squared:
- Model 1 has an adjusted R-squared of 0.0126 compared to 0.0203 for Model 2. The higher adjusted R-squared suggests that Model 2 explains a larger share of the variance in the crash cost. Neither value is high: the model relies on the driving record and the value of the driver's car to estimate the crash amount, while the severity of the crash and the value of the other vehicle involved remain unknown, which makes costs hard to predict.
F-Statistic:
- The F statistic for Model 2 (8.15) is substantially higher than that of Model 1 (1.61), indicating a stronger overall significance for Model 2. This suggests that the predictors in Model 2 are more useful in explaining the variance in crash costs compared to Model 1.
Residual Standard Error (RSE):
- Model 2 has a lower residual standard error (7379.95) than Linear Model 1 (7409.07), meaning that Model 2’s predictions have a slightly smaller average error when compared to the actual values. Lower RSE is a positive indicator, as it reflects a more precise model fit.
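The adjusted R-squared and F-statistic values compared above follow from each model's R-squared, the sample size, and the number of predictors. A sketch using the figures printed in the two model summaries (n = 1,722 is implied by the residual degrees of freedom; the helper names are illustrative):

```r
# Adjusted R-squared penalizes R-squared for the number of predictors p.
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Overall F-statistic expressed through R-squared.
f_stat <- function(r2, n, p) (r2 / p) / ((1 - r2) / (n - p - 1))

n <- 1722  # 1685 residual df + 36 predictors + 1 (Model 1); same n for Model 2

model1 <- c(adj_r2 = adj_r2(0.03325, n, 36), f = f_stat(0.03325, n, 36))
model2 <- c(adj_r2 = adj_r2(0.02319, n, 5),  f = f_stat(0.02319, n, 5))

round(rbind(model1, model2), 4)
```

Model 1 has the higher raw R-squared, but the penalty for its 36 predictors pulls its adjusted R-squared below that of the five-predictor Model 2.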
Residual outcomes - Model 2
Linearity (Residuals vs. Fitted)
- The plot suggests some slight non linearity in the relationship between predictors and the outcome variable, as indicated by the upward trend in residuals. This hints that the model may not fully capture certain patterns in the data.
Homogeneity (Scale-Location)
- The Scale-Location plot shows the spread of the residuals increasing as the fitted values grow, indicating heteroscedasticity. The variance of the residuals is not constant, which could affect the efficiency of the model's estimates.
Influential Observations (Residuals vs. Leverage)
- The Residuals vs Leverage plot points to a few potentially influential observations. These influential points could have an outsized effect on the model’s coefficients.
Normality of Residuals (Q-Q Plot)
- The Q-Q plot reveals departures from normality, particularly in the tails, where the residuals deviate from the straight line, as was also seen in Model 1. The normality assumption is therefore not fully met, which may affect hypothesis tests and confidence intervals if strict normality is required.
Making predictions with the model on the evaluation data
The predictions from Model 2 on the evaluation data look well distributed and capture the core accident costs reasonably well, as indicated by the similar third quartiles of the predicted and actual values (shown in the summary table below). However, outliers remain challenging to predict, largely due to variables that are unknown before the crash, such as the accident's severity and the value of the other vehicle involved. These unobserved factors introduce uncertainty and can lead to significant deviations in individual cost predictions.
Statistic | Predicted Value | Actual Value |
---|---|---|
Min | 1,865.84 | 30.28 |
1st Quartile | 4,927.63 | 2,639.50 |
Median | 5,586.87 | 4,093.00 |
Mean | 5,625.53 | 6,270.82 |
3rd Quartile | 6,289.53 | 5,925.50 |
Max | 9,302.43 | 73,783.47 |