ECNM HW 2

Author

Bryan Calderon

Preparation

Clear Data

          used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
Ncells  583502 31.2    1275565 68.2         NA   669422 35.8
Vcells 1110968  8.5    8388608 64.0      16384  1851952 14.2

Loading Packages

Bringing in Data

Introduction

In the insurance industry, identifying individuals at higher risk of car accidents is a critical component of managing both risk and financial performance. Accurate prediction of crash-prone drivers enables insurers to set fair premiums, adjust coverage plans, and manage payouts more effectively.

Because car crashes can range from minor incidents to major, costly accidents, estimating not only the likelihood of a crash but also the expected financial impact is essential. Insights from data-driven models help insurance companies minimize losses and provide high-quality services, benefiting both the insurer and the insured.

Variables

1. Cleaning Data

Removing the “z_” prefix from the data set

Several variables contain values with a “z_” prefix; the table below shows a sample of them. Because the prefix appears so frequently, our first task is to remove it in order to have a cleaner set of data.

'z_' Prefix in Each Variable
MStatus  Sex  EDUCATION      Job            Car Type     Urban City
z_No     z_F  Masters        Lawyer         Pickup       z_Highly Rural/ Rural
Yes      z_F  Masters        Home Maker     Sports Car   Highly Urban/ Urban
z_No     z_F  Bachelors      Clerical       z_SUV        Highly Urban/ Urban
Yes      z_F  PhD            NA             Van          Highly Urban/ Urban
Yes      M    PhD            Manager        Panel Truck  Highly Urban/ Urban
z_No     z_F  Bachelors      Professional   Minivan      Highly Urban/ Urban
Yes      M    PhD            Doctor         Sports Car   Highly Urban/ Urban
Yes      z_F  Bachelors      Clerical       Pickup       Highly Urban/ Urban
Yes      M    z_High School  z_Blue Collar  Minivan      z_Highly Rural/ Rural
z_No     M    Bachelors      Professional   Panel Truck  Highly Urban/ Urban

Distribution by variable

After removal, the “z_” count is 0 for every variable.

'z_' Count post removal
MStatus Sex EDUCATION Job Car Type Urban City
0 0 0 0 0 0
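The report does not show the code used to strip the prefix. A minimal sketch of one way to do it in base R follows, assuming the affected columns are stored as character vectors; the three toy rows are taken from the sample table above, and the AGE values are made up to show that non-character columns are left alone.

```r
# Toy rows based on the sample table; AGE values are made up for illustration
tr_dt <- data.frame(
  MSTATUS  = c("z_No", "Yes", "z_No"),
  SEX      = c("z_F", "z_F", "M"),
  CAR_TYPE = c("Pickup", "Sports Car", "z_SUV"),
  AGE      = c(60, 43, 35),
  stringsAsFactors = FALSE
)

# Strip a leading "z_" from every character column; other columns are untouched
is_chr <- vapply(tr_dt, is.character, logical(1))
tr_dt[is_chr] <- lapply(tr_dt[is_chr], function(x) sub("^z_", "", x))

# Count the remaining "z_" prefixes per character column
z_count <- vapply(tr_dt[is_chr], function(x) sum(startsWith(x, "z_")), integer(1))
print(z_count)
```

Anchoring the pattern with `^` removes only a leading "z_", so a value that happened to contain "z_" elsewhere would be left intact.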

Renaming Variables

Renaming some variables in order to give them more intuitive names

# Rename PARENT1 to Parent_single
names(tr_dt)[names(tr_dt) == "PARENT1"] <- "Parent_single"

# Rename YOJ to Years_on_job
names(tr_dt)[names(tr_dt) == "YOJ"] <- "Years_on_job"

# Rename TIF to Time_in_force
names(tr_dt)[names(tr_dt) == "TIF"] <- "Time_in_force"

Understanding Missing Variables

In the Summary table above, variables such as Years on the Job, Home value, Car age, and others have missing data. Before continuing, it is important to have a clean set of data. The graph and table below give a visual and numeric understanding of the key variables that need cleaning.

        INDEX   TARGET_FLAG    TARGET_AMT      KIDSDRIV           AGE 
            0             0             0             0             3 
     HOMEKIDS  Years_on_job        INCOME Parent_single      HOME_VAL 
            0           375           354             0           368 
     TRAVTIME       CAR_USE      BLUEBOOK Time_in_force       RED_CAR 
            0             0             0             0             0 
     OLDCLAIM      CLM_FREQ       REVOKED       MVR_PTS       CAR_AGE 
            0             0             0             0           399 
      MSTATUS           SEX     EDUCATION           JOB      CAR_TYPE 
            0             0             0           419             0 
   URBANICITY 
            0 

Cleaning the data

Instead of simply replacing NAs with the median or mean of the respective variable, we use machine learning to predict the missing values by considering the relationships between all variables in the dataset.

# Create a new variable 'Job_missing' that flags missing values in 'Job'
tr_dt$Job_missing <- ifelse(is.na(tr_dt$JOB), 1, 0)
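The imputation call itself is not shown in the report, so the method used is unknown. As a rough illustration of the general idea of predicting a missing value from other variables (not the author's method), the sketch below fits a simple regression on the complete rows and uses it to fill the gaps. The data frame `toy` and all of its values are made up.

```r
# Toy data standing in for tr_dt; column names mirror the report, values are made up
toy <- data.frame(
  AGE    = c(25, 32, 41, 47, 53, 60, 38, 45),
  INCOME = c(30000, 42000, NA, 61000, 70000, NA, 50000, 58000)
)

# Flag missing values before imputing so the imputed rows can be audited later
toy$INCOME_missing <- ifelse(is.na(toy$INCOME), 1, 0)

# Fit on the complete rows, then predict the rows where INCOME is missing
miss <- is.na(toy$INCOME)
fit  <- lm(INCOME ~ AGE, data = toy[!miss, ])
toy$INCOME[miss] <- predict(fit, newdata = toy[miss, , drop = FALSE])

# Remaining missing values per column
colSums(is.na(toy))
```

Keeping the flag column, as the report does with Job_missing, makes it possible to inspect the imputed rows afterwards.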

Result of the data clean up

The imputed Job values look plausible: most rows with a missing Job had a PhD or Masters degree, for which Manager or Lawyer is a reasonable prediction.

First 15 Rows with Missing Job Data (Job_missing = 1)
Job_missing EDUCATION JOB
1 PhD Manager
1 PhD Manager
1 PhD Manager
1 Masters Manager
1 PhD Manager
1 PhD Manager
1 Masters Lawyer
1 PhD Manager
1 Masters Lawyer
1 PhD Manager
1 PhD Manager
1 Masters Lawyer
1 PhD Manager
1 Masters Manager
1 Masters Manager

We can also see below that we no longer have any missing data.

# Summary of missing values after the adjustment
colSums(is.na(tr_dt) | tr_dt == "")
        INDEX   TARGET_FLAG    TARGET_AMT      KIDSDRIV           AGE 
            0             0             0             0             0 
     HOMEKIDS  Years_on_job        INCOME Parent_single      HOME_VAL 
            0             0             0             0             0 
     TRAVTIME       CAR_USE      BLUEBOOK Time_in_force       RED_CAR 
            0             0             0             0             0 
     OLDCLAIM      CLM_FREQ       REVOKED       MVR_PTS       CAR_AGE 
            0             0             0             0             0 
      MSTATUS           SEX     EDUCATION           JOB      CAR_TYPE 
            0             0             0             0             0 
   URBANICITY 
            0 

2. Data Exploration

Summary table with changes

Below is a summary of the data showing key numerical statistics such as the median, mean, maximum, kurtosis, and standard deviation.

Summary Table of Numeric Variables
Mean Median Minimum Maximum Kurtosis Skew SD NA Count
INDEX 5,157 5,152 1 10,302 -1.21 0.00 2,986.61 0
TARGET_FLAG 0 0 0 1 -0.85 1.07 0.44 0
TARGET_AMT 1,467 0 0 107,586 128.02 9.12 4,545.65 0
KIDSDRIV 0 0 0 4 11.23 3.30 0.51 0
AGE 45 45 16 81 -0.05 -0.04 8.65 0
HOMEKIDS 1 0 0 5 0.56 1.32 1.11 0
Years_on_job 11 11 0 23 1.28 -1.23 4.05 0
INCOME 61,320 53,014 0 367,030 2.27 1.21 47,263.66 0
HOME_VAL 154,338 159,856 0 885,282 0.08 0.51 127,111.26 0
TRAVTIME 33 33 5 142 0.74 0.46 15.97 0
BLUEBOOK 15,642 14,370 1,500 69,740 0.92 0.82 8,381.49 0
Time_in_force 5 4 1 25 0.42 0.88 4.15 0
OLDCLAIM 4,119 0 0 57,037 9.40 3.07 8,924.67 0
CLM_FREQ 1 0 0 5 0.31 1.21 1.16 0
MVR_PTS 2 1 0 13 1.30 1.33 2.14 0
CAR_AGE 8 8 -3 27 -0.73 0.28 5.63 0

Graphs

The following graphs help us better understand outliers and distribution patterns, enabling more effective data cleaning.

Distribution of Key Variables

Based on the distribution graphs below, Income, BlueBook, Travel time, and Car age appear to be variables that might benefit from a transformation, as they currently show skewness.

Below are the distributions of the categorical variables. They are based on frequency and also highlight the number of crashes that make up those totals.

This data can be useful for underwriting purposes, helping insurers understand the exposure in their portfolio. For example, this portfolio is more heavily weighted toward individuals aged 30 and over; if the insurer found that individuals aged 20-30 have a risk profile very similar to those aged 30+, it might shift its policy criteria.

Similarly, the insurer appears to have greater exposure to urban areas; if it wants to pursue a risk-averse strategy, targeting rural areas might be of greater benefit.

In the correlation section of the analysis we will be able to show the proportions highlighted for these variables.

Correlation Analysis

Correlations - Numerical data

Below is the correlation matrix for the numeric data set, which shows the strength and direction of the relationship between each pair of variables in the dataset.

  • Green indicates a positive correlation.

  • Red indicates a negative correlation.

  • The color’s vibrancy shows the correlation’s strength, where white means no correlation, and dark green or red means a strong correlation, closer to 1 or -1.

The variables that display a higher correlation with the dependent variables (“TARGET_FLAG” and “TARGET_AMT”) will be variables of interest when running our regression models.

Correlations - Categorical data

The categorical data below helps us understand relationships that could not be captured in the correlation plot, which focused on numerical data.

Age vs Education

In the heat map below, focusing on the age group this insurance company mainly underwrites (30+), one can see that the crash rate decreases as education level increases.

Key variables

Urbanicity vs Car Type

In the crash rate graphs above for the main categorical variables, the crash rate was markedly higher for commercial use than for private use. For this reason, the graph below separates the two and compares them by car type.

Commercial vehicles have historically carried a higher risk for several reasons:

  • Mileage/Usage Frequency: Commercial vehicles are on the road much more frequently than private use vehicles, increasing the likelihood of accidents.

  • Driver Type: Commercial vehicles might have multiple drivers with varying skill levels, whereas private vehicles typically have a limited number of drivers (immediate family).

  • Purpose of Use: Commercial vehicles may be used for more demanding tasks (e.g., delivery services, transportation of goods) that involve time pressures or driving in unfamiliar areas.

Car Type vs Car use

Similarly, as seen below, there is a clear relationship between where vehicles are operated and the crash rate. Urban areas run a much higher risk of accidents because of congestion.

What is the average claim amount for those who experienced an accident?

Commercial vehicles include a larger share of pickup trucks and panel trucks. Panel trucks in particular carry a higher Blue Book value, which contributes to a higher average claim. The table below breaks out the vehicles that have been in an accident.

Car Use     Car Type          Count  Avg Bluebook  Avg Claim
Commercial  Minivan              95        14,078      6,467
Commercial  Panel Truck         136        29,344      7,755
Commercial  Pickup              272        11,753      5,126
Commercial  Sports Car           52        12,014      4,856
Commercial  SUV                 156        11,840      5,287
Commercial  Van                 120        20,300      6,844
Commercial  Commercial total    831        16,555      6,056

Car Use     Car Type          Count  Avg Bluebook  Avg Claim
Private     Minivan             191        13,809      5,373
Private     Pickup               85        12,017      5,364
Private     Sports Car          181        11,719      5,282
Private     SUV                 398        11,162      5,058
Private     Van                  36        19,999      4,470
Private     Private total       891        13,741      5,109
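A breakout like the one above can be produced with a grouped aggregation. The following is a small sketch in base R using `aggregate()`, under the assumption that each row is one crash record; the data frame `crashes` and its values are hypothetical, and only the column names follow the report.

```r
# Hypothetical crash records; column names mirror the report, values are made up
crashes <- data.frame(
  CAR_USE    = c("Commercial", "Commercial", "Private", "Private", "Private"),
  CAR_TYPE   = c("Van", "Van", "SUV", "SUV", "Minivan"),
  BLUEBOOK   = c(20000, 22000, 11000, 12000, 14000),
  TARGET_AMT = c(7000, 6000, 5000, 5200, 5400)
)

# Mean Blue Book value and mean claim per (use, type) group
avg <- aggregate(cbind(BLUEBOOK, TARGET_AMT) ~ CAR_USE + CAR_TYPE,
                 data = crashes, FUN = mean)

# Number of crashes per group, merged onto the averages
cnt <- aggregate(TARGET_AMT ~ CAR_USE + CAR_TYPE, data = crashes, FUN = length)
names(cnt)[3] <- "Count"
out <- merge(cnt, avg, by = c("CAR_USE", "CAR_TYPE"))
out[order(out$CAR_USE, out$CAR_TYPE), ]
```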

The average claim amount is driven by several factors, one of which is the value of the car itself, given by its Blue Book value. Generally, the higher the Blue Book value, the more expensive the car and, as a result, the repair after a crash.

Other main variables which will drive a higher average claim will be the type of vehicle, the type of use (commercial vs private), and where it is used (urban vs rural setting).

3. Model Development

We create dummy variables for the categorical predictors.
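The dummy-variable code is not shown in the report. One common way to build such columns in base R is `model.matrix()`, sketched below on a made-up factor; whether the report used this function is an assumption, although the resulting column names follow the same pattern as the model terms (for example JOBClerical).

```r
# Toy data; the JOB levels are a subset of those in the report
toy <- data.frame(
  JOB = factor(c("Blue Collar", "Clerical", "Doctor", "Clerical", "Blue Collar"))
)

# model.matrix() expands the factor into 0/1 columns and leaves out the first
# level ("Blue Collar" here), which becomes the reference category
dummies <- model.matrix(~ JOB, data = toy)[, -1, drop = FALSE]  # drop the intercept column
colnames(dummies)
```

Leaving one level out as the reference keeps the dummy columns from summing to one in every row, which would otherwise duplicate the model intercept.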

Binary logistic regression Models

Model 1

The first logistic regression model leverages all available variables from the dataset to predict the likelihood of a car accident (TARGET_FLAG). This model includes a range of predictors such as car characteristics, driver demographics, and behavioral factors. Incorporating them allows the model to capture as much information as possible on which variables contribute most to car accidents.

Model 1 will serve as the baseline for the model development.


Call:
glm(formula = TARGET_FLAG ~ BLUEBOOK + CAR_AGE + INCOME + Parent_single + 
    HOME_VAL + MSTATUS + SEX_male + `EDUCATION<High School` + 
    EDUCATIONBachelors + EDUCATIONHighSchool + EDUCATIONMasters + 
    EDUCATIONPhD + TRAVTIME + CAR_USEPrivate + KIDSDRIV + AGE + 
    HOMEKIDS + Years_on_job + CAR_TYPEPanel_Truck + CAR_TYPEPickup + 
    CAR_TYPESports_Car + CAR_TYPESUV + CAR_TYPEVan + RED_CAR + 
    OLDCLAIM + CLM_FREQ + REVOKED + MVR_PTS + URBANICITY + JOBClerical + 
    JOBDoctor + JOBHome_Maker + JOBLawyer + JOBManager + JOBProfessional + 
    JOBStudent + Time_in_force, family = binomial, data = tr_dt2)

Coefficients: (1 not defined because of singularities)
                          Estimate Std. Error z value Pr(>|z|)    
(Intercept)             -2.433e+00  3.858e-01  -6.307 2.85e-10 ***
BLUEBOOK                -1.977e-05  5.916e-06  -3.341 0.000834 ***
CAR_AGE                  1.313e-03  8.690e-03   0.151 0.879868    
INCOME                  -3.089e-06  1.271e-06  -2.431 0.015073 *  
Parent_single            3.871e-01  1.231e-01   3.145 0.001663 ** 
HOME_VAL                -1.089e-06  3.988e-07  -2.731 0.006316 ** 
MSTATUS                 -4.912e-01  9.636e-02  -5.097 3.44e-07 ***
SEX_male                 1.545e-01  1.252e-01   1.234 0.217230    
`EDUCATION<High School` -3.561e-02  2.309e-01  -0.154 0.877405    
EDUCATIONBachelors      -5.027e-01  1.857e-01  -2.707 0.006787 ** 
EDUCATIONHighSchool     -5.937e-02  2.083e-01  -0.285 0.775624    
EDUCATIONMasters        -2.515e-01  1.689e-01  -1.489 0.136557    
EDUCATIONPhD                    NA         NA      NA       NA    
TRAVTIME                 1.342e-02  2.102e-03   6.381 1.76e-10 ***
CAR_USEPrivate          -8.020e-01  1.023e-01  -7.843 4.41e-15 ***
KIDSDRIV                 4.430e-01  6.938e-02   6.385 1.72e-10 ***
AGE                     -1.959e-03  4.516e-03  -0.434 0.664412    
HOMEKIDS                 4.681e-02  4.239e-02   1.104 0.269492    
Years_on_job            -1.814e-02  9.598e-03  -1.890 0.058812 .  
CAR_TYPEPanel_Truck      4.948e-01  1.815e-01   2.726 0.006410 ** 
CAR_TYPEPickup           5.639e-01  1.119e-01   5.038 4.71e-07 ***
CAR_TYPESports_Car       1.025e+00  1.458e-01   7.033 2.03e-12 ***
CAR_TYPESUV              8.339e-01  1.244e-01   6.702 2.06e-11 ***
CAR_TYPEVan              5.713e-01  1.414e-01   4.041 5.31e-05 ***
RED_CAR                 -1.001e-02  9.668e-02  -0.104 0.917505    
OLDCLAIM                -1.533e-05  4.279e-06  -3.584 0.000339 ***
CLM_FREQ                 2.035e-01  3.180e-02   6.398 1.57e-10 ***
REVOKED                  9.389e-01  1.001e-01   9.375  < 2e-16 ***
MVR_PTS                  1.157e-01  1.525e-02   7.588 3.25e-14 ***
URBANICITY               2.357e+00  1.252e-01  18.829  < 2e-16 ***
JOBClerical              5.353e-02  1.196e-01   0.448 0.654367    
JOBDoctor               -1.299e+00  3.346e-01  -3.882 0.000104 ***
JOBHome_Maker           -1.401e-01  1.738e-01  -0.806 0.420183    
JOBLawyer               -3.153e-01  2.010e-01  -1.569 0.116742    
JOBManager              -8.046e-01  1.497e-01  -5.374 7.68e-08 ***
JOBProfessional         -1.696e-01  1.339e-01  -1.267 0.205252    
JOBStudent              -1.502e-01  1.465e-01  -1.026 0.305100    
Time_in_force           -5.643e-02  8.231e-03  -6.856 7.07e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 7533.1  on 6527  degrees of freedom
Residual deviance: 5820.7  on 6491  degrees of freedom
AIC: 5894.7

Number of Fisher Scoring iterations: 5

Logistic Model 1 Results

Magnitude

  • Model 1 shows that driving in an urban location has the most substantial impact on the likelihood of a crash. To obtain a more interpretable measure of the impact, we can exponentiate the coefficient: \(e^{2.357} \approx 10.56\). This result indicates that driving in a “Highly Urban/ Urban” area increases the odds of having a crash by a factor of approximately 10.56 compared to the reference category (“Highly Rural/ Rural” areas), holding the other variables constant.
  • Similarly, being a doctor, driving a sports car, and having a history of license revocation have large coefficients in the model. Interestingly, age has a surprisingly low impact, which may stem from multicollinearity among related factors. Additionally, EDUCATIONPhD has no coefficient: the output notes that one coefficient is “not defined because of singularities”, which indicates perfect collinearity, most likely because a dummy variable for every education level is included alongside the intercept.
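To make the magnitudes easier to compare, the log-odds coefficients can be converted to odds ratios with `exp()`. The short sketch below uses four estimates copied from the Model 1 summary; the vector name `log_odds` is only illustrative.

```r
# Log-odds estimates copied from the Model 1 coefficient table
log_odds <- c(URBANICITY = 2.357, CAR_TYPESports_Car = 1.025,
              REVOKED = 0.9389, JOBDoctor = -1.299)

# exp() turns a log-odds coefficient into an odds ratio; values above 1
# raise the odds of a crash and values below 1 lower them
odds_ratio <- exp(log_odds)
round(odds_ratio, 2)
```

With a fitted model object, the same conversion would be `exp(coef(model))`.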

Significance

  • The variables with high magnitudes above are also highly significant based on their p-values, reinforcing their importance to the overall model. Other driving-history variables, such as past claims and motor vehicle record (MVR) points, are also statistically significant, indicating that their effects are unlikely to be due to random variation.

Direction

  • Most variables align directionally with expectations. Two notable exceptions are OLDCLAIM and RED_CAR, as one would expect both to have a positive rather than negative effect on car crashes. The negative OLDCLAIM coefficient may suggest that drivers with a history of claims drive more cautiously to avoid further incidents. The red car myth, which associates red cars with riskier drivers, is also unsupported here: the RED_CAR coefficient is small and not statistically significant (p ≈ 0.92), suggesting minimal relevance to our model.

Reviewing the model

\[ \text{Sensitivity} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}} \]

The main objective is to correctly identify drivers who are likely to get into a crash. For this reason, a high sensitivity will help show the model’s ability to effectively find the most crash prone drivers.

Model 1 correctly identifies 42.74% of the drivers who actually crashed. While sensitivity is important, the model cannot focus solely on maximizing it, as doing so might produce many false positives (predicting crashes for drivers who are actually low risk), which could make insurers overly cautious and lead to higher premiums for drivers who aren’t at risk.

\[ \text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}} \]

For this reason, precision, which reflects the share of predicted crashes that are true crash cases, is also very important. A high precision means that most of the drivers flagged as “likely to crash” are indeed at risk.

In this case, 66.49% of the drivers the model flags as likely to crash did in fact crash, which helps insurers avoid issuing unnecessary warnings or premium hikes to safe drivers.

Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0 4435  986
         1  371  736
                                          
               Accuracy : 0.7921          
                 95% CI : (0.7821, 0.8019)
    No Information Rate : 0.7362          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.3955          
                                          
 Mcnemar's Test P-Value : < 2.2e-16       
                                          
            Sensitivity : 0.4274          
            Specificity : 0.9228          
         Pos Pred Value : 0.6649          
         Neg Pred Value : 0.8181          
             Prevalence : 0.2638          
         Detection Rate : 0.1127          
   Detection Prevalence : 0.1696          
      Balanced Accuracy : 0.6751          
                                          
       'Positive' Class : 1               
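
The reported rates can be checked by hand from the four cells of the Model 1 confusion matrix. The sketch below does this in base R; the variable names (TP, FN, FP, TN) are illustrative, and the counts are copied from the matrix with 1 as the positive class.

```r
# Counts copied from the Model 1 confusion matrix (positive class = 1)
TP <- 736; FN <- 986; FP <- 371; TN <- 4435

sensitivity <- TP / (TP + FN)   # share of actual crashes that were flagged
precision   <- TP / (TP + FP)   # share of flagged drivers who actually crashed
specificity <- TN / (TN + FP)   # share of non-crash drivers correctly cleared
accuracy    <- (TP + TN) / (TP + FN + FP + TN)

round(c(sensitivity = sensitivity, precision = precision,
        specificity = specificity, accuracy = accuracy), 4)
```

Precision is the same quantity that the output labels Pos Pred Value.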
                                          

Measuring multicollinearity

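Multicollinearity occurs when two or more predictor variables are highly correlated, complicating the model's ability to estimate each variable's individual effect. In Model 1, EDUCATIONPhD lacks a coefficient because of a singularity, meaning it is an exact linear combination of other predictors. The most likely cause is that a dummy variable is included for every education level, so the education dummies sum to one and duplicate the intercept, and the model drops one of them to remove the redundancy. This doesn't imply that a PhD has no effect; rather, PhD effectively serves as the reference level against which the other education coefficients are measured.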
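A missing coefficient of this kind can be reproduced on simulated data. The sketch below is an illustration under the assumption that every level of a factor is entered as its own dummy next to an intercept; the data are random and unrelated to the report's dataset.

```r
set.seed(1)
edu <- factor(sample(c("Bachelors", "Masters", "PhD"), 200, replace = TRUE))
y   <- rbinom(200, 1, 0.3)   # simulated outcome, unrelated to edu

# One 0/1 column per level, with NO level left out as the reference
full_dummies <- as.data.frame(model.matrix(~ edu - 1))
names(full_dummies) <- levels(edu)

# The three columns sum to 1 in every row, duplicating the intercept,
# so glm() cannot estimate the last one and reports NA for it
fit <- glm(y ~ Bachelors + Masters + PhD, family = binomial,
           data = cbind(y = y, full_dummies))
coef(fit)
```

Dropping one dummy from the formula removes the NA without changing the fitted probabilities.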

Model 2

Model 2 uses a lasso regression to identify variables that can be excluded. The lasso applies an L1 penalty, which shrinks some coefficients exactly to zero.

Below, we can see that the coefficients for CAR_AGE, RED_CAR, and EDUCATIONPhD have been shrunk to zero (shown as “.”). Dropping these less relevant predictors reduces the model's complexity and makes it more interpretable.
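
For reference, the standard form of the L1-penalized logistic regression objective is written out below. The symbols are not defined in the report and are assumptions of this note: \(y_i\) is the crash flag, \(x_i\) the predictor vector, \(n\) the number of observations, \(p\) the number of predictors, and \(\lambda\) the penalty strength.

\[ \hat{\beta} = \arg\min_{\beta_0, \beta} \left\{ -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i (\beta_0 + x_i^\top \beta) - \log\left(1 + e^{\beta_0 + x_i^\top \beta}\right) \right] + \lambda \sum_{j=1}^{p} |\beta_j| \right\} \]

The first term is the average negative log-likelihood of the logistic model. The second term charges a cost proportional to the absolute size of each coefficient, which is why a large enough \(\lambda\) sets weak predictors exactly to zero.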


Lasso Coefficients:
38 x 1 sparse Matrix of class "dgCMatrix"
                                  s1
(Intercept)             -1.408241691
BLUEBOOK                -0.164218287
CAR_AGE                  .          
INCOME                  -0.131438557
Parent_single            0.129674405
HOME_VAL                -0.134902343
MSTATUS                 -0.232351982
SEX_male                 0.049020968
`EDUCATION<High School`  0.022673951
EDUCATIONBachelors      -0.180219345
EDUCATIONHighSchool      0.011012316
EDUCATIONMasters        -0.077224196
EDUCATIONPhD             .          
TRAVTIME                 0.204917083
CAR_USEPrivate          -0.404582004
KIDSDRIV                 0.218527882
AGE                     -0.014804231
HOMEKIDS                 0.046456211
Years_on_job            -0.059453630
CAR_TYPEPanel_Truck      0.107474871
CAR_TYPEPickup           0.180201877
CAR_TYPESports_Car       0.284294270
CAR_TYPESUV              0.329100019
CAR_TYPEVan              0.139031991
RED_CAR                  .          
OLDCLAIM                -0.119138586
CLM_FREQ                 0.226219431
REVOKED                  0.299492206
MVR_PTS                  0.244132358
URBANICITY               0.929824690
JOBClerical              0.038275981
JOBDoctor               -0.183489791
JOBHome_Maker           -0.001680938
JOBLawyer               -0.059621365
JOBManager              -0.256647335
JOBProfessional         -0.029756399
JOBStudent              -0.013923982
Time_in_force           -0.224177803

Applying the lasso results to Model 2

# BLR Model 2: removing EDUCATIONPhD, CAR_AGE, and RED_CAR
logit_model2 <- glm(TARGET_FLAG ~ BLUEBOOK + INCOME + Parent_single + HOME_VAL + MSTATUS + SEX_male + `EDUCATION<High School` + EDUCATIONBachelors + EDUCATIONHighSchool + EDUCATIONMasters  + TRAVTIME + CAR_USEPrivate + KIDSDRIV + AGE + HOMEKIDS + Years_on_job + CAR_TYPEPanel_Truck + CAR_TYPEPickup + CAR_TYPESports_Car + CAR_TYPESUV + CAR_TYPEVan + OLDCLAIM + CLM_FREQ + REVOKED + MVR_PTS + URBANICITY + JOBClerical + JOBDoctor + JOBHome_Maker + JOBLawyer + JOBManager + JOBProfessional + JOBStudent + Time_in_force,
                                  family = binomial,
                                  data = tr_dt2)

# Summary
summary(logit_model2)

Call:
glm(formula = TARGET_FLAG ~ BLUEBOOK + INCOME + Parent_single + 
    HOME_VAL + MSTATUS + SEX_male + `EDUCATION<High School` + 
    EDUCATIONBachelors + EDUCATIONHighSchool + EDUCATIONMasters + 
    TRAVTIME + CAR_USEPrivate + KIDSDRIV + AGE + HOMEKIDS + Years_on_job + 
    CAR_TYPEPanel_Truck + CAR_TYPEPickup + CAR_TYPESports_Car + 
    CAR_TYPESUV + CAR_TYPEVan + OLDCLAIM + CLM_FREQ + REVOKED + 
    MVR_PTS + URBANICITY + JOBClerical + JOBDoctor + JOBHome_Maker + 
    JOBLawyer + JOBManager + JOBProfessional + JOBStudent + Time_in_force, 
    family = binomial, data = tr_dt2)

Coefficients:
                          Estimate Std. Error z value Pr(>|z|)    
(Intercept)             -2.417e+00  3.678e-01  -6.569 5.05e-11 ***
BLUEBOOK                -1.977e-05  5.914e-06  -3.342 0.000831 ***
INCOME                  -3.079e-06  1.269e-06  -2.427 0.015242 *  
Parent_single            3.874e-01  1.231e-01   3.148 0.001642 ** 
HOME_VAL                -1.090e-06  3.982e-07  -2.738 0.006187 ** 
MSTATUS                 -4.910e-01  9.634e-02  -5.096 3.47e-07 ***
SEX_male                 1.485e-01  1.117e-01   1.330 0.183477    
`EDUCATION<High School` -4.884e-02  2.143e-01  -0.228 0.819733    
EDUCATIONBachelors      -5.091e-01  1.814e-01  -2.807 0.005006 ** 
EDUCATIONHighSchool     -7.133e-02  1.933e-01  -0.369 0.712095    
EDUCATIONMasters        -2.511e-01  1.688e-01  -1.487 0.136939    
TRAVTIME                 1.342e-02  2.102e-03   6.381 1.75e-10 ***
CAR_USEPrivate          -8.021e-01  1.023e-01  -7.844 4.36e-15 ***
KIDSDRIV                 4.430e-01  6.937e-02   6.386 1.70e-10 ***
AGE                     -1.936e-03  4.513e-03  -0.429 0.667992    
HOMEKIDS                 4.665e-02  4.238e-02   1.101 0.271006    
Years_on_job            -1.815e-02  9.597e-03  -1.892 0.058553 .  
CAR_TYPEPanel_Truck      4.944e-01  1.815e-01   2.725 0.006439 ** 
CAR_TYPEPickup           5.637e-01  1.119e-01   5.036 4.75e-07 ***
CAR_TYPESports_Car       1.025e+00  1.457e-01   7.033 2.02e-12 ***
CAR_TYPESUV              8.343e-01  1.244e-01   6.709 1.97e-11 ***
CAR_TYPEVan              5.710e-01  1.414e-01   4.040 5.35e-05 ***
OLDCLAIM                -1.533e-05  4.279e-06  -3.583 0.000339 ***
CLM_FREQ                 2.034e-01  3.179e-02   6.399 1.57e-10 ***
REVOKED                  9.389e-01  1.001e-01   9.377  < 2e-16 ***
MVR_PTS                  1.158e-01  1.525e-02   7.589 3.22e-14 ***
URBANICITY               2.357e+00  1.252e-01  18.829  < 2e-16 ***
JOBClerical              5.347e-02  1.196e-01   0.447 0.654680    
JOBDoctor               -1.299e+00  3.345e-01  -3.883 0.000103 ***
JOBHome_Maker           -1.398e-01  1.738e-01  -0.804 0.421266    
JOBLawyer               -3.152e-01  2.010e-01  -1.568 0.116853    
JOBManager              -8.047e-01  1.497e-01  -5.376 7.63e-08 ***
JOBProfessional         -1.701e-01  1.338e-01  -1.271 0.203587    
JOBStudent              -1.503e-01  1.464e-01  -1.026 0.304900    
Time_in_force           -5.642e-02  8.231e-03  -6.855 7.14e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 7533.1  on 6527  degrees of freedom
Residual deviance: 5820.7  on 6493  degrees of freedom
AIC: 5890.7

Number of Fisher Scoring iterations: 5

Logistic Model 2 Results

Model 2 remains aligned with Model 1 in coefficient magnitude, significance, and direction. The lasso regression helped eliminate EDUCATIONPhD, which had no coefficient, as well as CAR_AGE and RED_CAR, which were not statistically significant and did not add considerable value to the model. This reduced the AIC slightly, from 5895 to 5891.

Reviewing the model

\[ \text{Sensitivity} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}} \]

\[ \text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}} \]

Sensitivity (42.74%) and precision (66.43%) remain essentially unchanged from Model 1, so removing the variables flagged by the lasso had no material effect on the model's predictions.

Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0 4434  986
         1  372  736
                                          
               Accuracy : 0.792           
                 95% CI : (0.7819, 0.8018)
    No Information Rate : 0.7362          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.3952          
                                          
 Mcnemar's Test P-Value : < 2.2e-16       
                                          
            Sensitivity : 0.4274          
            Specificity : 0.9226          
         Pos Pred Value : 0.6643          
         Neg Pred Value : 0.8181          
             Prevalence : 0.2638          
         Detection Rate : 0.1127          
   Detection Prevalence : 0.1697          
      Balanced Accuracy : 0.6750          
                                          
       'Positive' Class : 1               
                                          

Measuring multicollinearity

The variance inflation factor (VIF) measures how strongly each predictor is correlated with the other predictors. For most variables, multicollinearity looks moderate and acceptable, within 1 < VIF ≤ 5. The main exception is education: three of its dummy variables have VIFs between roughly 5.9 and 7.7, above this preferred range.

                        vif_logit_model2
BLUEBOOK                        2.146520
INCOME                          2.854089
Parent_single                   1.932535
HOME_VAL                        2.069175
MSTATUS                         2.173918
SEX_male                        2.932498
`EDUCATION<High School`         5.996398
EDUCATIONBachelors              5.943264
EDUCATIONHighSchool             7.661304
EDUCATIONMasters                4.117099
TRAVTIME                        1.037053
CAR_USEPrivate                  2.419640
KIDSDRIV                        1.355057
AGE                             1.479564
HOMEKIDS                        2.197721
Years_on_job                    1.501135
CAR_TYPEPanel_Truck             2.407859
CAR_TYPEPickup                  1.818406
CAR_TYPESports_Car              2.163050
CAR_TYPESUV                     3.087038
CAR_TYPEVan                     1.638123
OLDCLAIM                        1.625621
CLM_FREQ                        1.454622
REVOKED                         1.303804
MVR_PTS                         1.160216
URBANICITY                      1.135161
JOBClerical                     1.886352
JOBDoctor                       1.628406
JOBHome_Maker                   2.125459
JOBLawyer                       3.423225
JOBManager                      2.567912
JOBProfessional                 2.019039
JOBStudent                      1.803192
Time_in_force                   1.011236
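
The VIF call is not shown in the report. To illustrate what the statistic measures, the sketch below computes the textbook definition, VIF = 1 / (1 - R²), by regressing each predictor on the others. The data are simulated, and the values a package reports for a logistic model may be computed differently, so this is a sketch of the concept and not a reproduction of the table above.

```r
set.seed(7)
n  <- 300
x1 <- rnorm(n)
x2 <- 0.9 * x1 + rnorm(n, sd = 0.3)  # deliberately correlated with x1
x3 <- rnorm(n)                        # independent of the others
X  <- data.frame(x1, x2, x3)

# VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing x_j on the rest
vif <- sapply(names(X), function(v) {
  others <- setdiff(names(X), v)
  r2 <- summary(lm(reformulate(others, response = v), data = X))$r.squared
  1 / (1 - r2)
})
round(vif, 2)
```

The two correlated predictors receive a high VIF, while the independent one stays close to 1.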

Model 3

Model 3 takes the final model from Model 2 and tries to improve on it by applying a log transformation to the variables that showed a skewed distribution in the distribution graphs earlier.

The transformed variables replace their original counterparts in the model. Backward stepwise selection then drops predictors that do not improve the fit of the model based on AIC, leaving a more concise model.
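
The two steps can be sketched on simulated data as follows. The report does not say how zero values were handled before taking logs, so the use of `log1p()` for income is an assumption of this example, as are the toy data frame and the NOISE predictor.

```r
set.seed(42)
n <- 500
toy <- data.frame(
  INCOME   = round(rexp(n, rate = 1 / 60000)),  # right-skewed, may include 0
  TRAVTIME = round(runif(n, 5, 140)),
  NOISE    = rnorm(n)                            # hypothetical irrelevant predictor
)

# log1p(x) = log(1 + x) stays finite when x is 0 (an assumption, see above)
toy$log_INCOME   <- log1p(toy$INCOME)
toy$log_TRAVTIME <- log(toy$TRAVTIME)

# Simulated crash flag that depends on the two log-transformed predictors
eta <- -1 - 0.3 * (toy$log_INCOME - 10) + 0.6 * (toy$log_TRAVTIME - 4)
toy$TARGET_FLAG <- rbinom(n, 1, plogis(eta))

full <- glm(TARGET_FLAG ~ log_INCOME + log_TRAVTIME + NOISE,
            family = binomial, data = toy)

# Backward elimination: at each step, drop the term whose removal lowers AIC most
reduced <- step(full, direction = "backward", trace = 0)
formula(reduced)
c(full = AIC(full), reduced = AIC(reduced))
```

Setting `trace = 0` suppresses the step-by-step log; leaving it at the default prints tables like the ones shown below.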

Start:  AIC=5869.63
TARGET_FLAG ~ log_BLUEBOOK + log_INCOME + log_TRAVTIME + Parent_single + 
    HOME_VAL + MSTATUS + SEX_male + CAR_USEPrivate + AGE + KIDSDRIV + 
    HOMEKIDS + Years_on_job + CAR_TYPEPanel_Truck + CAR_TYPEPickup + 
    CAR_TYPESports_Car + CAR_TYPESUV + CAR_TYPEVan + OLDCLAIM + 
    CLM_FREQ + REVOKED + MVR_PTS + URBANICITY + JOBClerical + 
    JOBDoctor + JOBHome_Maker + JOBLawyer + JOBManager + JOBProfessional + 
    JOBStudent + Time_in_force + `EDUCATION<High School` + EDUCATIONBachelors + 
    EDUCATIONHighSchool + EDUCATIONMasters

                          Df Deviance    AIC
- `EDUCATION<High School`  1   5799.7 5867.7
- EDUCATIONHighSchool      1   5799.7 5867.7
- JOBClerical              1   5799.8 5867.8
- HOMEKIDS                 1   5799.9 5867.9
- AGE                      1   5800.3 5868.3
- EDUCATIONMasters         1   5800.6 5868.6
- Years_on_job             1   5801.0 5869.0
<none>                         5799.6 5869.6
- SEX_male                 1   5801.6 5869.6
- JOBProfessional          1   5801.7 5869.7
- JOBLawyer                1   5802.2 5870.2
- JOBHome_Maker            1   5803.4 5871.4
- EDUCATIONBachelors       1   5804.5 5872.5
- JOBStudent               1   5805.6 5873.6
- CAR_TYPEPanel_Truck      1   5806.4 5874.4
- Parent_single            1   5809.8 5877.8
- HOME_VAL                 1   5812.2 5880.2
- OLDCLAIM                 1   5813.6 5881.6
- log_INCOME               1   5815.6 5883.6
- CAR_TYPEVan              1   5817.2 5885.2
- JOBDoctor                1   5817.2 5885.2
- log_BLUEBOOK             1   5819.8 5887.8
- CAR_TYPEPickup           1   5826.7 5894.7
- MSTATUS                  1   5826.8 5894.8
- JOBManager               1   5832.4 5900.4
- CLM_FREQ                 1   5840.2 5908.2
- log_TRAVTIME             1   5842.0 5910.0
- KIDSDRIV                 1   5843.1 5911.1
- Time_in_force            1   5846.3 5914.3
- CAR_TYPESports_Car       1   5850.0 5918.0
- CAR_TYPESUV              1   5851.5 5919.5
- MVR_PTS                  1   5857.5 5925.5
- CAR_USEPrivate           1   5861.0 5929.0
- REVOKED                  1   5885.4 5953.4
- URBANICITY               1   6310.4 6378.4

Step:  AIC=5867.72
TARGET_FLAG ~ log_BLUEBOOK + log_INCOME + log_TRAVTIME + Parent_single + 
    HOME_VAL + MSTATUS + SEX_male + CAR_USEPrivate + AGE + KIDSDRIV + 
    HOMEKIDS + Years_on_job + CAR_TYPEPanel_Truck + CAR_TYPEPickup + 
    CAR_TYPESports_Car + CAR_TYPESUV + CAR_TYPEVan + OLDCLAIM + 
    CLM_FREQ + REVOKED + MVR_PTS + URBANICITY + JOBClerical + 
    JOBDoctor + JOBHome_Maker + JOBLawyer + JOBManager + JOBProfessional + 
    JOBStudent + Time_in_force + EDUCATIONBachelors + EDUCATIONHighSchool + 
    EDUCATIONMasters

                      Df Deviance    AIC
- EDUCATIONHighSchool  1   5799.7 5865.7
- JOBClerical          1   5799.9 5865.9
- HOMEKIDS             1   5799.9 5865.9
- AGE                  1   5800.4 5866.4
- Years_on_job         1   5801.1 5867.1
<none>                     5799.7 5867.7
- SEX_male             1   5801.8 5867.8
- EDUCATIONMasters     1   5801.9 5867.9
- JOBProfessional      1   5802.2 5868.2
- JOBLawyer            1   5803.5 5869.5
- JOBHome_Maker        1   5804.1 5870.1
- JOBStudent           1   5805.9 5871.9
- CAR_TYPEPanel_Truck  1   5806.6 5872.6
- Parent_single        1   5810.0 5876.0
- HOME_VAL             1   5812.8 5878.8
- OLDCLAIM             1   5813.7 5879.7
- log_INCOME           1   5815.9 5881.9
- CAR_TYPEVan          1   5817.3 5883.3
- EDUCATIONBachelors   1   5817.6 5883.6
- log_BLUEBOOK         1   5819.9 5885.9
- MSTATUS              1   5826.8 5892.8
- CAR_TYPEPickup       1   5827.1 5893.1
- JOBDoctor            1   5829.0 5895.0
- CLM_FREQ             1   5840.3 5906.3
- log_TRAVTIME         1   5842.1 5908.1
- KIDSDRIV             1   5843.2 5909.2
- Time_in_force        1   5846.5 5912.5
- JOBManager           1   5846.5 5912.5
- CAR_TYPESports_Car   1   5850.1 5916.1
- CAR_TYPESUV          1   5851.8 5917.8
- MVR_PTS              1   5857.6 5923.6
- CAR_USEPrivate       1   5863.2 5929.2
- REVOKED              1   5885.4 5951.4
- URBANICITY           1   6310.4 6376.4

Step:  AIC=5865.74
TARGET_FLAG ~ log_BLUEBOOK + log_INCOME + log_TRAVTIME + Parent_single + 
    HOME_VAL + MSTATUS + SEX_male + CAR_USEPrivate + AGE + KIDSDRIV + 
    HOMEKIDS + Years_on_job + CAR_TYPEPanel_Truck + CAR_TYPEPickup + 
    CAR_TYPESports_Car + CAR_TYPESUV + CAR_TYPEVan + OLDCLAIM + 
    CLM_FREQ + REVOKED + MVR_PTS + URBANICITY + JOBClerical + 
    JOBDoctor + JOBHome_Maker + JOBLawyer + JOBManager + JOBProfessional + 
    JOBStudent + Time_in_force + EDUCATIONBachelors + EDUCATIONMasters

                      Df Deviance    AIC
- JOBClerical          1   5799.9 5863.9
- HOMEKIDS             1   5800.0 5864.0
- AGE                  1   5800.4 5864.4
- Years_on_job         1   5801.1 5865.1
<none>                     5799.7 5865.7
- SEX_male             1   5801.8 5865.8
- JOBProfessional      1   5802.2 5866.2
- EDUCATIONMasters     1   5802.4 5866.4
- JOBLawyer            1   5803.6 5867.6
- JOBHome_Maker        1   5804.1 5868.1
- JOBStudent           1   5805.9 5869.9
- CAR_TYPEPanel_Truck  1   5806.6 5870.6
- Parent_single        1   5810.0 5874.0
- HOME_VAL             1   5812.8 5876.8
- OLDCLAIM             1   5813.7 5877.7
- log_INCOME           1   5815.9 5879.9
- CAR_TYPEVan          1   5817.4 5881.4
- log_BLUEBOOK         1   5819.9 5883.9
- MSTATUS              1   5826.8 5890.8
- CAR_TYPEPickup       1   5827.2 5891.2
- EDUCATIONBachelors   1   5827.7 5891.7
- JOBDoctor            1   5830.6 5894.6
- CLM_FREQ             1   5840.3 5904.3
- log_TRAVTIME         1   5842.1 5906.1
- KIDSDRIV             1   5843.3 5907.3
- Time_in_force        1   5846.5 5910.5
- JOBManager           1   5847.1 5911.1
- CAR_TYPESports_Car   1   5850.2 5914.2
- CAR_TYPESUV          1   5852.0 5916.0
- MVR_PTS              1   5857.6 5921.6
- CAR_USEPrivate       1   5867.4 5931.4
- REVOKED              1   5885.5 5949.5
- URBANICITY           1   6310.4 6374.4

Step:  AIC=5863.94
TARGET_FLAG ~ log_BLUEBOOK + log_INCOME + log_TRAVTIME + Parent_single + 
    HOME_VAL + MSTATUS + SEX_male + CAR_USEPrivate + AGE + KIDSDRIV + 
    HOMEKIDS + Years_on_job + CAR_TYPEPanel_Truck + CAR_TYPEPickup + 
    CAR_TYPESports_Car + CAR_TYPESUV + CAR_TYPEVan + OLDCLAIM + 
    CLM_FREQ + REVOKED + MVR_PTS + URBANICITY + JOBDoctor + JOBHome_Maker + 
    JOBLawyer + JOBManager + JOBProfessional + JOBStudent + Time_in_force + 
    EDUCATIONBachelors + EDUCATIONMasters

                      Df Deviance    AIC
- HOMEKIDS             1   5800.2 5862.2
- AGE                  1   5800.6 5862.6
- Years_on_job         1   5801.4 5863.4
<none>                     5799.9 5863.9
- SEX_male             1   5802.0 5864.0
- EDUCATIONMasters     1   5802.6 5864.6
- JOBProfessional      1   5803.7 5865.7
- JOBLawyer            1   5805.2 5867.2
- JOBHome_Maker        1   5806.0 5868.0
- CAR_TYPEPanel_Truck  1   5807.6 5869.6
- JOBStudent           1   5807.9 5869.9
- Parent_single        1   5810.3 5872.3
- HOME_VAL             1   5813.4 5875.4
- OLDCLAIM             1   5813.9 5875.9
- log_INCOME           1   5816.3 5878.3
- CAR_TYPEVan          1   5818.5 5880.5
- log_BLUEBOOK         1   5820.3 5882.3
- MSTATUS              1   5826.8 5888.8
- EDUCATIONBachelors   1   5828.3 5890.3
- CAR_TYPEPickup       1   5829.0 5891.0
- JOBDoctor            1   5835.0 5897.0
- CLM_FREQ             1   5840.6 5902.6
- log_TRAVTIME         1   5842.2 5904.2
- KIDSDRIV             1   5843.4 5905.4
- Time_in_force        1   5846.6 5908.6
- CAR_TYPESports_Car   1   5850.4 5912.4
- CAR_TYPESUV          1   5852.2 5914.2
- MVR_PTS              1   5857.8 5919.8
- JOBManager           1   5859.7 5921.7
- CAR_USEPrivate       1   5881.2 5943.2
- REVOKED              1   5885.8 5947.8
- URBANICITY           1   6312.1 6374.1

Step:  AIC=5862.16
TARGET_FLAG ~ log_BLUEBOOK + log_INCOME + log_TRAVTIME + Parent_single + 
    HOME_VAL + MSTATUS + SEX_male + CAR_USEPrivate + AGE + KIDSDRIV + 
    Years_on_job + CAR_TYPEPanel_Truck + CAR_TYPEPickup + CAR_TYPESports_Car + 
    CAR_TYPESUV + CAR_TYPEVan + OLDCLAIM + CLM_FREQ + REVOKED + 
    MVR_PTS + URBANICITY + JOBDoctor + JOBHome_Maker + JOBLawyer + 
    JOBManager + JOBProfessional + JOBStudent + Time_in_force + 
    EDUCATIONBachelors + EDUCATIONMasters

                      Df Deviance    AIC
- AGE                  1   5801.4 5861.4
- Years_on_job         1   5802.0 5862.0
<none>                     5800.2 5862.2
- SEX_male             1   5802.2 5862.2
- EDUCATIONMasters     1   5802.8 5862.8
- JOBProfessional      1   5804.0 5864.0
- JOBLawyer            1   5805.4 5865.4
- JOBHome_Maker        1   5806.2 5866.2
- CAR_TYPEPanel_Truck  1   5807.9 5867.9
- JOBStudent           1   5808.0 5868.0
- HOME_VAL             1   5813.6 5873.6
- Parent_single        1   5814.1 5874.1
- OLDCLAIM             1   5814.1 5874.1
- log_INCOME           1   5817.4 5877.4
- CAR_TYPEVan          1   5818.7 5878.7
- log_BLUEBOOK         1   5820.5 5880.5
- MSTATUS              1   5827.3 5887.3
- EDUCATIONBachelors   1   5828.5 5888.5
- CAR_TYPEPickup       1   5829.2 5889.2
- JOBDoctor            1   5835.2 5895.2
- CLM_FREQ             1   5840.9 5900.9
- log_TRAVTIME         1   5842.4 5902.4
- Time_in_force        1   5846.7 5906.7
- CAR_TYPESports_Car   1   5850.8 5910.8
- CAR_TYPESUV          1   5852.5 5912.5
- KIDSDRIV             1   5857.0 5917.0
- MVR_PTS              1   5858.2 5918.2
- JOBManager           1   5860.1 5920.1
- CAR_USEPrivate       1   5881.5 5941.5
- REVOKED              1   5886.3 5946.3
- URBANICITY           1   6312.2 6372.2

Step:  AIC=5861.39
TARGET_FLAG ~ log_BLUEBOOK + log_INCOME + log_TRAVTIME + Parent_single + 
    HOME_VAL + MSTATUS + SEX_male + CAR_USEPrivate + KIDSDRIV + 
    Years_on_job + CAR_TYPEPanel_Truck + CAR_TYPEPickup + CAR_TYPESports_Car + 
    CAR_TYPESUV + CAR_TYPEVan + OLDCLAIM + CLM_FREQ + REVOKED + 
    MVR_PTS + URBANICITY + JOBDoctor + JOBHome_Maker + JOBLawyer + 
    JOBManager + JOBProfessional + JOBStudent + Time_in_force + 
    EDUCATIONBachelors + EDUCATIONMasters

                      Df Deviance    AIC
- Years_on_job         1   5802.9 5860.9
- SEX_male             1   5803.1 5861.1
<none>                     5801.4 5861.4
- EDUCATIONMasters     1   5804.1 5862.1
- JOBProfessional      1   5805.6 5863.6
- JOBLawyer            1   5807.2 5865.2
- JOBHome_Maker        1   5807.8 5865.8
- JOBStudent           1   5809.1 5867.1
- CAR_TYPEPanel_Truck  1   5809.5 5867.5
- OLDCLAIM             1   5815.3 5873.3
- HOME_VAL             1   5815.7 5873.7
- log_INCOME           1   5818.0 5876.0
- Parent_single        1   5819.7 5877.7
- CAR_TYPEVan          1   5820.4 5878.4
- log_BLUEBOOK         1   5823.5 5881.5
- MSTATUS              1   5827.6 5885.6
- EDUCATIONBachelors   1   5829.8 5887.8
- CAR_TYPEPickup       1   5830.5 5888.5
- JOBDoctor            1   5838.6 5896.6
- CLM_FREQ             1   5841.9 5899.9
- log_TRAVTIME         1   5843.2 5901.2
- Time_in_force        1   5847.8 5905.8
- CAR_TYPESports_Car   1   5850.9 5908.9
- CAR_TYPESUV          1   5852.8 5910.8
- KIDSDRIV             1   5857.9 5915.9
- MVR_PTS              1   5860.4 5918.4
- JOBManager           1   5864.3 5922.3
- CAR_USEPrivate       1   5882.0 5940.0
- REVOKED              1   5887.8 5945.8
- URBANICITY           1   6314.7 6372.7

Step:  AIC=5860.86
TARGET_FLAG ~ log_BLUEBOOK + log_INCOME + log_TRAVTIME + Parent_single + 
    HOME_VAL + MSTATUS + SEX_male + CAR_USEPrivate + KIDSDRIV + 
    CAR_TYPEPanel_Truck + CAR_TYPEPickup + CAR_TYPESports_Car + 
    CAR_TYPESUV + CAR_TYPEVan + OLDCLAIM + CLM_FREQ + REVOKED + 
    MVR_PTS + URBANICITY + JOBDoctor + JOBHome_Maker + JOBLawyer + 
    JOBManager + JOBProfessional + JOBStudent + Time_in_force + 
    EDUCATIONBachelors + EDUCATIONMasters

                      Df Deviance    AIC
- SEX_male             1   5804.7 5860.7
<none>                     5802.9 5860.9
- EDUCATIONMasters     1   5805.7 5861.7
- JOBProfessional      1   5807.2 5863.2
- JOBLawyer            1   5808.9 5864.9
- JOBHome_Maker        1   5809.0 5865.0
- JOBStudent           1   5810.0 5866.0
- CAR_TYPEPanel_Truck  1   5810.8 5866.8
- OLDCLAIM             1   5816.6 5872.6
- HOME_VAL             1   5817.6 5873.6
- log_INCOME           1   5821.4 5877.4
- CAR_TYPEVan          1   5821.7 5877.7
- Parent_single        1   5822.1 5878.1
- log_BLUEBOOK         1   5824.7 5880.7
- MSTATUS              1   5827.6 5883.6
- CAR_TYPEPickup       1   5832.1 5888.1
- EDUCATIONBachelors   1   5832.1 5888.1
- JOBDoctor            1   5841.0 5897.0
- CLM_FREQ             1   5843.3 5899.3
- log_TRAVTIME         1   5844.8 5900.8
- Time_in_force        1   5849.2 5905.2
- CAR_TYPESports_Car   1   5852.4 5908.4
- CAR_TYPESUV          1   5854.5 5910.5
- KIDSDRIV             1   5860.7 5916.7
- MVR_PTS              1   5861.8 5917.8
- JOBManager           1   5867.0 5923.0
- CAR_USEPrivate       1   5882.5 5938.5
- REVOKED              1   5889.4 5945.4
- URBANICITY           1   6315.5 6371.5

Step:  AIC=5860.68
TARGET_FLAG ~ log_BLUEBOOK + log_INCOME + log_TRAVTIME + Parent_single + 
    HOME_VAL + MSTATUS + CAR_USEPrivate + KIDSDRIV + CAR_TYPEPanel_Truck + 
    CAR_TYPEPickup + CAR_TYPESports_Car + CAR_TYPESUV + CAR_TYPEVan + 
    OLDCLAIM + CLM_FREQ + REVOKED + MVR_PTS + URBANICITY + JOBDoctor + 
    JOBHome_Maker + JOBLawyer + JOBManager + JOBProfessional + 
    JOBStudent + Time_in_force + EDUCATIONBachelors + EDUCATIONMasters

                      Df Deviance    AIC
<none>                     5804.7 5860.7
- EDUCATIONMasters     1   5807.4 5861.4
- JOBProfessional      1   5809.2 5863.2
- JOBLawyer            1   5810.8 5864.8
- JOBHome_Maker        1   5811.9 5865.9
- JOBStudent           1   5812.1 5866.1
- CAR_TYPEPanel_Truck  1   5815.6 5869.6
- OLDCLAIM             1   5818.5 5872.5
- HOME_VAL             1   5819.4 5873.4
- log_INCOME           1   5823.3 5877.3
- Parent_single        1   5823.4 5877.4
- CAR_TYPEVan          1   5827.1 5881.1
- MSTATUS              1   5829.5 5883.5
- EDUCATIONBachelors   1   5833.8 5887.8
- CAR_TYPEPickup       1   5833.8 5887.8
- log_BLUEBOOK         1   5834.5 5888.5
- JOBDoctor            1   5842.5 5896.5
- CLM_FREQ             1   5845.4 5899.4
- log_TRAVTIME         1   5846.8 5900.8
- Time_in_force        1   5851.1 5905.1
- CAR_TYPESports_Car   1   5859.5 5913.5
- KIDSDRIV             1   5862.1 5916.1
- MVR_PTS              1   5863.4 5917.4
- CAR_TYPESUV          1   5868.3 5922.3
- JOBManager           1   5869.2 5923.2
- CAR_USEPrivate       1   5884.3 5938.3
- REVOKED              1   5891.4 5945.4
- URBANICITY           1   6318.3 6372.3

Call:
glm(formula = TARGET_FLAG ~ log_BLUEBOOK + log_INCOME + log_TRAVTIME + 
    Parent_single + HOME_VAL + MSTATUS + CAR_USEPrivate + KIDSDRIV + 
    CAR_TYPEPanel_Truck + CAR_TYPEPickup + CAR_TYPESports_Car + 
    CAR_TYPESUV + CAR_TYPEVan + OLDCLAIM + CLM_FREQ + REVOKED + 
    MVR_PTS + URBANICITY + JOBDoctor + JOBHome_Maker + JOBLawyer + 
    JOBManager + JOBProfessional + JOBStudent + Time_in_force + 
    EDUCATIONBachelors + EDUCATIONMasters, family = binomial, 
    data = tr_dt2)

Coefficients:
                      Estimate Std. Error z value Pr(>|z|)    
(Intercept)         -1.802e-02  6.445e-01  -0.028 0.977698    
log_BLUEBOOK        -3.376e-01  6.153e-02  -5.486 4.10e-08 ***
log_INCOME          -6.966e-02  1.619e-02  -4.302 1.69e-05 ***
log_TRAVTIME         3.853e-01  6.053e-02   6.366 1.94e-10 ***
Parent_single        4.602e-01  1.063e-01   4.329 1.50e-05 ***
HOME_VAL            -1.367e-06  3.601e-07  -3.796 0.000147 ***
MSTATUS             -4.453e-01  8.882e-02  -5.014 5.34e-07 ***
CAR_USEPrivate      -7.612e-01  8.597e-02  -8.854  < 2e-16 ***
KIDSDRIV             4.731e-01  6.245e-02   7.576 3.56e-14 ***
CAR_TYPEPanel_Truck  5.229e-01  1.575e-01   3.320 0.000901 ***
CAR_TYPEPickup       5.905e-01  1.093e-01   5.402 6.60e-08 ***
CAR_TYPESports_Car   9.014e-01  1.211e-01   7.442 9.92e-14 ***
CAR_TYPESUV          7.499e-01  9.518e-02   7.879 3.30e-15 ***
CAR_TYPEVan          6.445e-01  1.348e-01   4.780 1.75e-06 ***
OLDCLAIM            -1.575e-05  4.287e-06  -3.674 0.000239 ***
CLM_FREQ             2.043e-01  3.179e-02   6.425 1.31e-10 ***
REVOKED              9.415e-01  1.002e-01   9.394  < 2e-16 ***
MVR_PTS              1.166e-01  1.528e-02   7.629 2.36e-14 ***
URBANICITY           2.360e+00  1.250e-01  18.887  < 2e-16 ***
JOBDoctor           -1.467e+00  2.721e-01  -5.393 6.95e-08 ***
JOBHome_Maker       -4.407e-01  1.659e-01  -2.656 0.007917 ** 
JOBLawyer           -4.104e-01  1.670e-01  -2.457 0.014014 *  
JOBManager          -9.189e-01  1.174e-01  -7.825 5.06e-15 ***
JOBProfessional     -2.435e-01  1.155e-01  -2.108 0.035025 *  
JOBStudent          -4.074e-01  1.509e-01  -2.700 0.006930 ** 
Time_in_force       -5.507e-02  8.227e-03  -6.694 2.18e-11 ***
EDUCATIONBachelors  -4.529e-01  8.451e-02  -5.359 8.35e-08 ***
EDUCATIONMasters    -2.048e-01  1.240e-01  -1.651 0.098780 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 7533.1  on 6527  degrees of freedom
Residual deviance: 5804.7  on 6500  degrees of freedom
AIC: 5860.7

Number of Fisher Scoring iterations: 5

Logistic Model 3 Results

Model 3 shows an improvement over Model 2: the AIC has dropped from ~5,891 to ~5,861.
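As a quick consistency check on the Model 3 figure, the AIC of a binomial GLM fitted to 0/1 outcomes can be recovered from the residual deviance and the number of estimated parameters. The worked version below assumes \(k = 28\) (27 predictors plus the intercept, consistent with the drop from 6527 to 6500 degrees of freedom in the summary):

\[ \text{AIC} = \text{Residual deviance} + 2k = 5804.7 + 2 \times 28 = 5860.7 \]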

Magnitude

  • Similar to Model 1 and 2, Model 3 highlights that cars driven in urban locations have the most substantial impact on predicting car crashes, with an odds ratio of (\(e^{2.36} = ~10.59\)). This indicates that living in a highly urban area increases the odds of having a crash by approximately 10.59 times compared to the reference category (Suburban or Rural areas). Individuals with historic driving record issues and those with professions of doctor or manager continue to be impactful variables.
  • This model also includes log-transformed variables. For example, a one-unit increase in log_TRAVTIME (which corresponds to multiplying TRAVTIME by \(e \approx 2.72\)) multiplies the odds of a crash by (\(e^{0.385} = ~1.47\)), or roughly a 47% increase in odds. This suggests that proportionally longer travel times are associated with higher odds of a crash.
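The conversion from a logit coefficient to an odds ratio can be sketched as follows. The coefficient values are copied from the Model 3 summary; the vector name `coefs` and the "doubling" example are illustrative assumptions, and the sketch assumes the log variables were built with the natural log.

```r
# Coefficients copied from the Model 3 summary (log-odds scale)
coefs <- c(URBANICITY = 2.360, REVOKED = 0.9415,
           log_TRAVTIME = 0.3853, JOBDoctor = -1.467)

# Exponentiating a logit coefficient gives the odds ratio
odds_ratios <- exp(coefs)
print(round(odds_ratios, 2))

# For a log-transformed predictor, a one-unit change is a large step
# (TRAVTIME multiplied by e, about 2.72). A doubling of TRAVTIME
# multiplies the odds by 2^beta instead.
or_doubling <- 2^coefs[["log_TRAVTIME"]]
print(round(or_doubling, 2))
```

Reading the log-transformed coefficient per doubling (or per 1% change) is often easier to communicate than a one-unit change on the log scale.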

Significance

  • Variables such as URBANICITY, REVOKED, and log_BLUEBOOK have extremely low p-values, suggesting their effects are highly significant in predicting crashes. Similarly, other variables like JOBDoctor, HOME_VAL, and CAR_TYPESUV also show strong significance, which adds credibility to their impact on the likelihood of a crash occurring. Log_INCOME, one of the transformed variables, also shows an improved p-value.
  • On the other hand, variables with higher p-values, such as EDUCATIONMasters (p ≈ 0.099, not significant at the 5% level) and JOBProfessional (p ≈ 0.035, only weakly significant), contribute less to the model and could be revisited for relevance.
  • Overall, the majority of variables are strongly significant, which supports a strong model.

Direction

  • The direction of the effects aligns with expectations and is similar to Models 1 and 2.

Reviewing the model

\[ \text{Sensitivity} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}} \]

\[ \text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}} \]

Overall, the model improved in sensitivity compared to Model 2, from 42.74% to 43.44% (the value reported in the confusion matrix output below), indicating a better ability to identify true positives (accurately predicting instances where an individual will indeed have a car crash). This increase in sensitivity suggests that the enhanced model is capturing a slightly higher proportion of actual crash prone individuals, which is critical in risk assessment for insurance companies.

The model’s precision improved from 66.43% to 67.69% (the Pos Pred Value in the output below), meaning it can better identify actual crash prone individuals among those predicted to be high risk. A higher precision means fewer false positives among the predicted crashes, indicating the model is becoming more reliable in targeting truly crash prone individuals, which is valuable for risk assessment.

Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0 4449  974
         1  357  748
                                          
               Accuracy : 0.7961          
                 95% CI : (0.7861, 0.8058)
    No Information Rate : 0.7362          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.4069          
                                          
 Mcnemar's Test P-Value : < 2.2e-16       
                                          
            Sensitivity : 0.4344          
            Specificity : 0.9257          
         Pos Pred Value : 0.6769          
         Neg Pred Value : 0.8204          
             Prevalence : 0.2638          
         Detection Rate : 0.1146          
   Detection Prevalence : 0.1693          
      Balanced Accuracy : 0.6800          
                                          
       'Positive' Class : 1               
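
The headline metrics can be reproduced by hand from the four cells of the confusion matrix, using the sensitivity and precision formulas given earlier. The counts below are copied from the output above; the variable names are illustrative.

```r
# Counts copied from the confusion matrix (rows = prediction, columns = reference)
tn <- 4449; fn <- 974
fp <- 357;  tp <- 748

sensitivity <- tp / (tp + fn)                       # share of actual crashes caught
precision   <- tp / (tp + fp)                       # share of predicted crashes that are real
specificity <- tn / (tn + fp)                       # share of non-crashes correctly cleared
accuracy    <- (tp + tn) / (tp + tn + fp + fn)

print(round(c(sensitivity = sensitivity, precision = precision,
              specificity = specificity, accuracy = accuracy), 4))
```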
                                          

Checking Multicollinearity

Multicollinearity, which tells us whether two or more predictor (independent) variables are correlated with each other, looks acceptable, with all values within 1 < VIF ≤ 5. Across the board, the variables look much better than in previous models in this respect, with the highest VIF at ~2.5.

                    vif_logit_model3
log_BLUEBOOK                1.450673
log_INCOME                  2.468105
log_TRAVTIME                1.025990
Parent_single               1.431710
HOME_VAL                    1.737145
MSTATUS                     1.840979
CAR_USEPrivate              1.705943
KIDSDRIV                    1.090497
CAR_TYPEPanel_Truck         1.829136
CAR_TYPEPickup              1.737838
CAR_TYPESports_Car          1.476969
CAR_TYPESUV                 1.794573
CAR_TYPEVan                 1.496378
OLDCLAIM                    1.622047
CLM_FREQ                    1.449709
REVOKED                     1.300646
MVR_PTS                     1.156608
URBANICITY                  1.125220
JOBDoctor                   1.076165
JOBHome_Maker               1.885921
JOBLawyer                   2.372171
JOBManager                  1.581065
JOBProfessional             1.508283
JOBStudent                  1.882995
Time_in_force               1.008836
EDUCATIONBachelors          1.281639
EDUCATIONMasters            2.222524
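
For readers unfamiliar with the statistic, the textbook VIF for a predictor is \(1 / (1 - R_j^2)\), where \(R_j^2\) comes from regressing that predictor on all the others. The sketch below illustrates this definition on R's built-in `mtcars` data, not the insurance data; the helper `manual_vif` is hypothetical, and the report's `vif_logit_model3` values were presumably produced by a packaged function that is not shown here.

```r
# Manual VIF: regress each predictor on the others, VIF_j = 1 / (1 - R_j^2)
manual_vif <- function(data, predictors) {
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    r2 <- summary(lm(reformulate(others, response = p), data = data))$r.squared
    1 / (1 - r2)
  })
}

# Illustration on the built-in mtcars data (not the insurance data)
vifs <- manual_vif(mtcars, c("wt", "hp", "disp"))
print(round(vifs, 2))
```

A VIF of 1 means the predictor is uncorrelated with the others; values above roughly 5 are a common rule-of-thumb warning threshold.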

Linear regression Models

Model 1

The first linear regression model leverages all available variables from the dataset to predict the amount of money it will cost if the person does crash their car (TARGET_AMT). This model includes a range of predictors such as car characteristics, driver demographics, and behavioral factors. Incorporating them all allows the model to capture as much information as possible on which variables contribute most to the cost of a crash.

Model 1 will serve as the baseline for the model development and will use a filtered data set which only takes individuals who have gotten in a car accident, to more accurately predict the cost of the accident.


Call:
lm(formula = TARGET_AMT ~ BLUEBOOK + CAR_AGE + INCOME + Parent_single + 
    HOME_VAL + MSTATUS + SEX_male + `EDUCATION<High School` + 
    EDUCATIONBachelors + EDUCATIONHighSchool + EDUCATIONMasters + 
    EDUCATIONPhD + TRAVTIME + CAR_USEPrivate + KIDSDRIV + AGE + 
    HOMEKIDS + Years_on_job + CAR_TYPEPanel_Truck + CAR_TYPEPickup + 
    CAR_TYPESports_Car + CAR_TYPESUV + CAR_TYPEVan + RED_CAR + 
    OLDCLAIM + CLM_FREQ + REVOKED + MVR_PTS + URBANICITY + JOBClerical + 
    JOBDoctor + JOBHome_Maker + JOBLawyer + JOBManager + JOBProfessional + 
    JOBStudent + Time_in_force, data = tr_dt2_filtered)

Residuals:
   Min     1Q Median     3Q    Max 
 -8546  -3033  -1343    594  99723 

Coefficients: (1 not defined because of singularities)
                          Estimate Std. Error t value Pr(>|t|)    
(Intercept)              4.396e+03  2.226e+03   1.975   0.0485 *  
BLUEBOOK                 1.326e-01  3.293e-02   4.026 5.92e-05 ***
CAR_AGE                 -8.583e+01  4.861e+01  -1.766   0.0777 .  
INCOME                  -1.019e-02  7.833e-03  -1.301   0.1936    
Parent_single            5.781e+02  6.369e+02   0.908   0.3641    
HOME_VAL                 2.887e-03  2.272e-03   1.271   0.2040    
MSTATUS                 -6.438e+02  5.459e+02  -1.179   0.2384    
SEX_male                 1.482e+03  7.059e+02   2.099   0.0359 *  
`EDUCATION<High School` -1.944e+03  1.368e+03  -1.421   0.1555    
EDUCATIONBachelors      -1.686e+03  1.140e+03  -1.478   0.1395    
EDUCATIONHighSchool     -2.401e+03  1.260e+03  -1.906   0.0569 .  
EDUCATIONMasters        -1.085e+03  1.004e+03  -1.080   0.2802    
EDUCATIONPhD                    NA         NA      NA       NA    
TRAVTIME                 2.076e+00  1.196e+01   0.174   0.8622    
CAR_USEPrivate          -6.369e+02  5.669e+02  -1.123   0.2615    
KIDSDRIV                -2.003e+02  3.396e+02  -0.590   0.5554    
AGE                      2.261e+01  2.321e+01   0.974   0.3300    
HOMEKIDS                 1.575e+02  2.305e+02   0.683   0.4946    
Years_on_job            -2.202e+01  5.463e+01  -0.403   0.6870    
CAR_TYPEPanel_Truck     -7.621e+02  1.041e+03  -0.732   0.4643    
CAR_TYPEPickup          -5.245e+02  6.366e+02  -0.824   0.4101    
CAR_TYPESports_Car       7.327e+02  8.132e+02   0.901   0.3677    
CAR_TYPESUV              7.689e+02  7.146e+02   1.076   0.2820    
CAR_TYPEVan             -8.709e+02  8.382e+02  -1.039   0.2989    
RED_CAR                  2.111e+01  5.371e+02   0.039   0.9687    
OLDCLAIM                 1.195e-02  2.374e-02   0.503   0.6148    
CLM_FREQ                -2.626e+01  1.690e+02  -0.155   0.8766    
REVOKED                 -7.456e+02  5.410e+02  -1.378   0.1683    
MVR_PTS                  1.210e+02  7.397e+01   1.636   0.1019    
URBANICITY               6.297e+02  8.141e+02   0.773   0.4393    
JOBClerical             -7.282e+01  6.396e+02  -0.114   0.9094    
JOBDoctor               -1.530e+03  2.158e+03  -0.709   0.4783    
JOBHome_Maker           -4.972e+02  9.795e+02  -0.508   0.6118    
JOBLawyer                3.299e+02  1.187e+03   0.278   0.7811    
JOBManager              -5.329e+02  9.193e+02  -0.580   0.5622    
JOBProfessional          7.054e+02  7.295e+02   0.967   0.3337    
JOBStudent              -2.713e+02  7.836e+02  -0.346   0.7292    
Time_in_force           -1.947e+00  4.565e+01  -0.043   0.9660    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7409 on 1685 degrees of freedom
Multiple R-squared:  0.03325,   Adjusted R-squared:  0.0126 
F-statistic:  1.61 on 36 and 1685 DF,  p-value: 0.01273

Model 1 Results

Job           Average Target Amt  Average Income  Average Bluebook
Blue Collar                5,653          54,055            13,984
Clerical                   5,127          31,934            11,966
Doctor                     5,194         140,324            18,034
Home Maker                 4,828           8,268            11,426
Lawyer                     5,865          80,875            14,941
Manager                    5,778         102,687            19,760
Professional               6,561          68,766            17,342
Student                    5,058           5,347            10,441

  • The Adjusted R-squared is 0.013, indicating that the model explains only about 1% of the variance in the cost of the crash, suggesting that the predictors have limited explanatory power for this outcome. The F-statistic of 1.61 with a p-value of 0.0127 indicates that the model is statistically significant overall, though the effect size is small.

  • Having a bachelor's degree or less education (high school and <high school) gives the three variables with the greatest magnitude in predicting the cost of the crash. For example, the average crash cost for individuals with a high school education is $2,401 lower than for the reference education level (PhD, the category without its own coefficient). While this might sound counterintuitive initially, it is plausible, as drivers with a high school education may, for example, drive entry-level cars that are cheaper to repair than those driven by someone with a master's degree. Although these variables have a high magnitude, they are not statistically significant at the 5% level based on their p-values.

  • On the other hand, Blue Book has a much lower p-value, meaning it is more statistically significant, but a lower magnitude. It has a coefficient of approximately 0.133, indicating that each one-unit increase in BLUEBOOK (representing the value of the car) is associated with an increase of around 0.133 units in the cost of the crash. That said, the units for Blue Book are most likely dollars, and since car values run into the thousands, the effect can accumulate to a material impact.

  • Overall, the only highly significant variable is Blue Book (SEX_male is also significant at the 5% level, with p ≈ 0.036), which largely makes sense, as the cost of repair or the total amount will depend on the value of the car. Variables with an unexpected direction include income: for every dollar increase in income, the cost of the crash decreases by about $0.01. This is neither statistically significant nor large in magnitude, but it is still surprising, as one would expect the cost of the crash to increase. Similarly, in the table above, doctors have the highest average income and a high Blue Book value, yet their average crash expense is lower than that of several other job groups. This is most likely driven by the severity of the accidents and the value of the other car in the accident, for which we don't have data.
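
The gap between the multiple and adjusted R-squared follows from the usual adjustment for the number of predictors. As a worked check using the figures in the summary output, and assuming \(n = 1722\) observations and \(p = 36\) estimated slopes (both inferred from the reported degrees of freedom, 36 and 1685):

\[ \bar{R}^2 = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1} = 1 - (1 - 0.03325) \times \frac{1721}{1685} \approx 0.0126 \]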

Checking Model Assumptions - Residual Analysis

The diagnostic plots indicate mild non-linearity, suggesting that the model does not fully capture the relationship between the predictors and the outcome. Furthermore, the spread of the residuals increases with higher fitted values, which indicates heteroscedasticity: the variance of the residuals is not constant, which could lead to inefficient estimates and unreliable standard errors.

There are a few influential points that could impact the model as shown in the Residuals and Leverage graph. Lastly, the residuals show some deviation from normality, especially in the tails, which could affect statistical inferences.


Measuring multicollinearity

Multicollinearity occurs when two or more predictor variables are highly correlated, complicating the model's ability to estimate each variable's individual effect. In Model 1, Education (PhD) lacks a coefficient (the summary reports "1 not defined because of singularities"), most likely because it is perfectly collinear with the other education dummy variables and the intercept, leading the model to exclude it to avoid redundancy. This doesn't imply that a PhD has no effect; rather, PhD effectively becomes the reference category against which the other education levels are measured.
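
The behaviour described above (a dummy variable dropped because of a singularity) can be reproduced with a small simulated example. The data and names below are illustrative assumptions, not the insurance data: when every level of a categorical variable gets its own 0/1 column and an intercept is also present, the columns are perfectly collinear and `lm()` reports `NA` for one of them.

```r
set.seed(1)
# Simulated data, purely illustrative (not the insurance data)
edu <- factor(sample(c("HighSchool", "Bachelors", "Masters", "PhD"),
                     200, replace = TRUE))
y <- rnorm(200)

# One 0/1 column per level: together they sum to 1, i.e. the intercept column
dummies <- as.data.frame(model.matrix(~ edu - 1))
fit <- lm(y ~ ., data = cbind(y = y, dummies))

# The last dummy is aliased with the intercept and the other dummies,
# so lm() cannot estimate it and reports NA
print(coef(fit))
```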

Model 2

Model 2 will use a lasso regression to identify variables that can be excluded: applying an L1 penalty shrinks some coefficients to exactly zero. In addition, the log terms previously created will be added for consideration.

Below, we can see that only a few coefficients were kept at the optimal lambda: 4 variables (log_BLUEBOOK, SEX_male, MVR_PTS, and CAR_TYPEPanel_Truck). This helps reduce the model's complexity by excluding less relevant predictors, making it more interpretable.
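
For reference, the lasso estimate for a continuous outcome is commonly written as the penalized least-squares problem below. This is the standard textbook form; the exact scaling used by the software behind this report is an assumption, and the notation (\(y_i\) for TARGET_AMT, \(x_i\) for the predictor vector, \(\lambda\) for the penalty strength) is introduced here for illustration.

\[ \hat{\beta} = \arg\min_{\beta_0,\, \beta} \; \frac{1}{2n} \sum_{i=1}^{n} \left( y_i - \beta_0 - x_i^{\top}\beta \right)^2 + \lambda \sum_{j=1}^{p} |\beta_j| \]

Larger values of \(\lambda\) push more coefficients to exactly zero, which is why most rows in the coefficient output are shown as a dot.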


Lasso Coefficients:
35 x 1 sparse Matrix of class "dgCMatrix"
                                s1
(Intercept)             5559.85426
log_BLUEBOOK             618.10795
log_INCOME                 .      
Parent_single              .      
HOME_VAL                   .      
MSTATUS                    .      
SEX_male                  48.91519
`EDUCATION<High School`    .      
EDUCATIONBachelors         .      
EDUCATIONHighSchool        .      
EDUCATIONMasters           .      
log_TRAVTIME               .      
CAR_USEPrivate             .      
KIDSDRIV                   .      
AGE                        .      
HOMEKIDS                   .      
Years_on_job               .      
CAR_TYPEPanel_Truck       69.43976
CAR_TYPEPickup             .      
CAR_TYPESports_Car         .      
CAR_TYPESUV                .      
CAR_TYPEVan                .      
OLDCLAIM                   .      
CLM_FREQ                   .      
REVOKED                    .      
MVR_PTS                   42.89832
URBANICITY                 .      
JOBClerical                .      
JOBDoctor                  .      
JOBHome_Maker              .      
JOBLawyer                  .      
JOBManager                 .      
JOBProfessional            .      
JOBStudent                 .      
Time_in_force              .      

Applying the coefficient analysis from the lasso regression into model 2

Model 2 will also include the variable “Revoked”, which indicates whether the driver’s license was revoked in the past 7 years, a sign of a potentially riskier driver.
Outside of the value of the car, the severity of the crash becomes the next leading factor in determining the cost of the crash, which can be partially attributed to the riskiness of the driver. Other variables, such as SEX_male, were shown to be significant in Model 1 and are thus carried forward to Model 2. MVR points similarly had a lower p-value than most other variables in Model 1, which will help us assess the riskiness of the driver.


Call:
lm(formula = TARGET_AMT ~ log_BLUEBOOK + SEX_male + MVR_PTS + 
    CAR_TYPEPanel_Truck + REVOKED, data = tr_dt2_filtered)

Residuals:
   Min     1Q Median     3Q    Max 
 -7802  -3011  -1456    487 100414 

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)         -7254.20    2752.35  -2.636  0.00847 ** 
log_BLUEBOOK         1306.91     292.98   4.461  8.7e-06 ***
SEX_male              610.68     374.21   1.632  0.10288    
MVR_PTS               143.20      69.42   2.063  0.03928 *  
CAR_TYPEPanel_Truck   786.95     748.68   1.051  0.29336    
REVOKED              -581.73     433.69  -1.341  0.17998    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7380 on 1716 degrees of freedom
Multiple R-squared:  0.02319,   Adjusted R-squared:  0.02034 
F-statistic: 8.147 on 5 and 1716 DF,  p-value: 1.303e-07

Model 2 Results

  • The Adjusted R-squared is 0.0203 which while showed improvement, the model still only explains about 2% of the variance in the cost of the crash, suggesting that the predictors still have limited explanatory power for this outcome. The F-statistic of 8.147 with a p-value of 1.303e-07 improved further, showing the strong statistical significant in the model, though the effect size remains small.

  • log Blue Book has a coefficient of 1306.91 with a strong P value, indicating a strong positive association and significance. A 1 unit/dollar increase in log Blue Book is associated with an increase of approximately $1,307 in the crash cost. Similarily, individuals who drive a panel truck, on average are expected to have a $786 higher crash cost to those who do not, mainly because of the size of the vehicle.

  • The most significant predictors are log Blue Book and MVR points. Log Blue Book has a large positive effect on the overall crash cost, while the effect of MVR_PTS is much smaller: each additional MVR point is associated with about $143 higher crash cost.

  • What is interesting is that individuals with a historically revoked license have, on average, a $582 lower crash cost than those who haven’t had their license revoked. This may suggest that these drivers drive more cautiously after experiencing the consequences of license revocation, although the coefficient is not statistically significant (p = 0.18). In contrast, male drivers are associated with a $611 higher average crash cost compared to female drivers (p = 0.10). While females have a slightly higher crash rate, the higher costs associated with male drivers may reflect different driving behaviors that lead to more expensive crashes.
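Since the Blue Book predictor enters Model 2 on a log scale, its coefficient is easier to read in terms of proportional changes in car value. The following is a minimal sketch, assuming that log_BLUEBOOK is the natural log of BLUEBOOK (the transformation itself is not shown in this section); the helper name effect_of_multiplier is illustrative only.

```r
# Coefficient on log_BLUEBOOK, copied from the Model 2 summary.
beta_log_bluebook <- 1306.91

# Assumption: log_BLUEBOOK = log(BLUEBOOK) with the natural log.
# In a level-log model, multiplying BLUEBOOK by a factor k shifts the
# predicted crash cost by beta * log(k), holding other predictors fixed.
effect_of_multiplier <- function(k, beta = beta_log_bluebook) {
  beta * log(k)
}

effect_1pct   <- effect_of_multiplier(1.01)  # car value 1% higher
effect_double <- effect_of_multiplier(2)     # car value twice as high

round(c(one_percent = effect_1pct, doubling = effect_double), 2)
```

Under that assumption, a 1% higher Blue Book value corresponds to roughly $13 more in expected crash cost, and a doubling of the value to roughly $906.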

Checking Model Assumptions - Residual Analysis

In Model 2, the diagnostics still indicate issues similar to Model 1, including non-linearity, heteroscedasticity, and non-normal residuals. Although the model fit has improved slightly, as seen in the higher adjusted R-squared and lower p-value, the Residuals vs Fitted plot suggests that the model does not fully capture the underlying relationships. The Scale Location plot shows heteroscedasticity, meaning the residual variance is not constant, which can affect the efficiency of estimates. Lastly, influential observations remain a concern, though they appear less extreme than in Model 1.

Checking Multicollinearity

Multicollinearity occurs when two or more predictor (independent) variables are correlated with each other. The variance inflation factors (VIF) below are all close to 1, which indicates very little multicollinearity in Model 2.

                    vif_linear_model2
log_BLUEBOOK                 1.188219
SEX_male                     1.094768
MVR_PTS                      1.001602
CAR_TYPEPanel_Truck          1.289131
REVOKED                      1.001240
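To make the VIF values more concrete, they can be translated back into the share of each predictor's variance that is explained by the other predictors. This is a small sketch, assuming the table reports the standard VIF, defined as 1 / (1 - R²) from regressing each predictor on the rest; the function that produced the table is not shown in this section.

```r
# VIF values copied from the table above.
vif_values <- c(
  log_BLUEBOOK        = 1.188219,
  SEX_male            = 1.094768,
  MVR_PTS             = 1.001602,
  CAR_TYPEPanel_Truck = 1.289131,
  REVOKED             = 1.001240
)

# VIF_j = 1 / (1 - R2_j), where R2_j comes from regressing predictor j
# on the remaining predictors. Inverting gives the implied R2_j.
implied_r2 <- 1 - 1 / vif_values

# sqrt(VIF) is the factor by which a coefficient's standard error is
# inflated relative to the case of uncorrelated predictors.
se_inflation <- sqrt(vif_values)

round(cbind(VIF = vif_values,
            implied_R2 = implied_r2,
            SE_inflation = se_inflation), 3)
```

Even the largest value (CAR_TYPEPanel_Truck) implies that only about 22% of that predictor's variance is shared with the others, well below the levels usually treated as a concern.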

4.1 Choosing the best Logistic Model

Comparing the results

ROC Curve

The area under the curve (AUC) for Model 3 is 0.8156, meaning that about 81.56% of the time the model correctly ranks a randomly chosen positive instance higher than a randomly chosen negative instance. This indicates that the model has good predictive power. The closer the ROC curve sits to the top left corner, the better the model.

Below is an AUC rank reference:

  • 0.90 - 1.00: Excellent (highly accurate model)

  • 0.80 - 0.90: Good (strong model with good discriminative ability)

  • 0.70 - 0.80: Fair (moderate accuracy, but still useful)

  • 0.60 - 0.70: Poor (weak model)

  • 0.50: No discriminative power (random guessing)
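The ranking interpretation of AUC can be illustrated directly by comparing every positive case with every negative case. The sketch below uses made-up scores and labels, not the report's data, and the function name auc_by_ranking is hypothetical.

```r
# AUC as the share of (positive, negative) pairs in which the positive
# case receives the higher predicted score; ties count as half.
auc_by_ranking <- function(scores, labels) {
  pos <- scores[labels == 1]
  neg <- scores[labels == 0]
  wins <- outer(pos, neg, function(p, n) (p > n) + 0.5 * (p == n))
  mean(wins)
}

# Hypothetical predicted crash probabilities and true outcomes.
scores <- c(0.90, 0.80, 0.35, 0.60, 0.40, 0.20)
labels <- c(1,    1,    1,    0,    0,    0)

auc_toy <- auc_by_ranking(scores, labels)
auc_toy
```

In this toy example 7 of the 9 positive/negative pairs are ordered correctly, giving an AUC of about 0.78, which would fall in the "Fair" band of the reference above.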

Model Comparison Table for AIC
             df      AIC
logit_model1 37 5894.713
logit_model2 35 5890.746
logit_model3 28 5860.677
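The AIC table can be unpacked a little further. The sketch below assumes that the df column is the number of estimated parameters k, as in the output of R's AIC() function, so that AIC = 2k - 2·logLik; the column names logLik and delta_AIC are added here for illustration.

```r
# Values copied from the AIC comparison table.
aic_table <- data.frame(
  model = c("logit_model1", "logit_model2", "logit_model3"),
  df    = c(37, 35, 28),
  AIC   = c(5894.713, 5890.746, 5860.677)
)

# Assumption: df = number of estimated parameters k.
# AIC = 2k - 2 * logLik  =>  logLik = k - AIC / 2
aic_table$logLik <- aic_table$df - aic_table$AIC / 2

# Difference from the best (lowest) AIC.
aic_table$delta_AIC <- aic_table$AIC - min(aic_table$AIC)

best_model <- aic_table$model[which.min(aic_table$AIC)]
aic_table
```

Model 3 is ahead by roughly 30 to 34 AIC points. A common rule of thumb treats differences above 10 as strong evidence in favor of the lower-AIC model.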


Model Comparison Table for Accuracy, Sensitivity, Precision, Specificity, and Classification Error Rate
Model    Accuracy  Sensitivity  Precision  Specificity  ErrorRate
Model 1  79.21%    42.74%       66.49%     92.28%       20.79%
Model 2  79.20%    42.74%       66.43%     92.26%       20.80%
Model 3  79.61%    43.44%       67.69%     92.57%       20.39%

Model 3 stands out as the most well rounded choice for predicting the likelihood of a crash due to a strong balance between performance and simplicity. It has the highest sensitivity, making it better at identifying the higher risk drivers, while also maintaining the highest precision, reducing the likelihood of false alarms. Additionally, it has the lowest AIC and the smallest predictor set, which indicates that it achieves these results efficiently and supports its use as a reliable and practical model for predicting crash risk.

  • Predictive Performance:

    • Sensitivity: Model 3 has the highest sensitivity (43.44%) compared to Model 1 and Model 2 (both at 42.74%). Sensitivity, the true positive rate, is critical here because it reflects the model’s ability to correctly identify drivers who are likely to get into a crash. A higher sensitivity means Model 3 is slightly better at flagging the higher risk drivers, which can make a material difference to an insurance company’s bottom line.

    • Precision: With the highest precision (67.69%), Model 3 outperforms Model 1 (66.49%) and Model 2 (66.43%) in terms of minimizing false positives. In a crash prediction context, a higher precision means that when Model 3 predicts a crash, it is more likely to be correct. This reliability is important, as unnecessary interventions based on false positives could be costly.

    • Specificity: Measures the model's ability to correctly identify non-crashes. All three models demonstrate high specificity. Model 3 (92.57%) has a slight edge over Model 1 (92.28%) and Model 2 (92.26%), though the difference is marginal.

    • Accuracy: Measures the proportion of correct predictions (both true positives and true negatives) out of all predictions. It gives an overall sense of how well the model is classifying both crash and non-crash cases. All three models exhibit accuracy of around 80%, with Model 3 having a slight edge at 79.61%.

    • Error Rate (1 - Accuracy): The classification error rate is the complement of accuracy, representing the proportion of incorrect predictions. Model 3 has the lowest error rate at 20.39%, which aligns with its slightly higher accuracy.

  • Model Complexity and AIC (Model Fit) Comparison:

    • AIC: Model 3 has the lowest AIC (5861), which indicates a better balance between model fit and complexity compared to Model 1 (5895) and Model 2 (5891). The lower AIC suggests that Model 3 provides a good fit to the data without overfitting, making it more reliable for future predictions.

    • Degrees of Freedom: With the fewest degrees of freedom (28, the number of estimated parameters), Model 3 is the least complex of the three while still capturing the essential patterns in crash prediction and avoiding unnecessary variables.
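The classification metrics discussed above all derive from the four cells of a confusion matrix. The following is a minimal sketch with made-up counts (not the report's actual confusion matrix); the function name classification_metrics is hypothetical.

```r
# Compute the five metrics from confusion-matrix counts.
# tp/fn: actual crashes predicted as crash / as no crash
# fp/tn: actual non-crashes predicted as crash / as no crash
classification_metrics <- function(tp, fn, fp, tn) {
  total    <- tp + fn + fp + tn
  accuracy <- (tp + tn) / total
  c(
    Accuracy    = accuracy,
    Sensitivity = tp / (tp + fn),  # true positive rate
    Precision   = tp / (tp + fp),  # share of predicted crashes that are real
    Specificity = tn / (tn + fp),  # true negative rate
    ErrorRate   = 1 - accuracy
  )
}

# Hypothetical counts for illustration only.
toy_metrics <- classification_metrics(tp = 40, fn = 60, fp = 20, tn = 280)
round(toy_metrics, 4)
```

The toy counts mimic the pattern seen in the report: high specificity and accuracy alongside a sensitivity below 50%, which happens when crashes are the minority class and the model is conservative about predicting them.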

Making predictions with model 3 on the evaluation data

The predictions developed by Model 3 on the evaluation data classify 17.3% of drivers as crashes, compared to a crash rate of 26.4% in the training data. Some difference is expected because the evaluation data is a different sample, and because the model flags fewer than half of actual crashes (sensitivity of about 43%), so its predicted crash rate will sit below the observed rate. With that in mind, the two percentages are reasonably in line, suggesting the model behaves on the evaluation data much as it did on the training data.
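A quick consistency check is to work out what share of drivers Model 3 would be expected to flag in the training data, given its sensitivity and precision. This sketch assumes that 26.4% is the actual crash rate in the training data and that the metrics in the comparison table were computed on that data at the same classification cutoff; neither detail is stated explicitly in this section.

```r
# Inputs taken from the report (see assumptions in the text above).
prevalence  <- 0.264   # assumed actual crash rate in the training data
sensitivity <- 0.4344  # Model 3, from the comparison table
precision   <- 0.6769  # Model 3, from the comparison table

# Share of all drivers who are true positives:
true_positive_share <- sensitivity * prevalence

# Precision = TP / (all predicted positives), so the share of drivers
# flagged as crashes is TP share divided by precision.
flagged_rate_train <- true_positive_share / precision

round(flagged_rate_train, 4)
```

Under these assumptions the model would flag roughly 17% of training drivers, which is close to the 17.3% flagged in the evaluation data.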

4.2 Choosing the best Linear Model

Comparing the results

Comparison of Performance Metrics for Linear Model 1 and Linear Model 2
Model           R-Squared  Adjusted R-Squared  F-Statistic  Residual Standard Error
Linear Model 1  0.0332     0.0126              1.6098       7409.073
Linear Model 2  0.0232     0.0203              8.1470       7379.952

Given these performance metrics, Linear Model 2 demonstrates a better balance of adjusted R-squared, F-statistic, and residual standard error. While both models have relatively low R-squared values suggesting that they only explain a small portion of the variance in crash costs, Model 2's higher adjusted R squared and F statistic provide evidence that it is a more reliable model.

  • R Squared and Adjusted R-Squared:

    • Model 1 has an adjusted R-squared of 0.0126 compared to 0.0203 for Model 2. The higher adjusted R-squared suggests that Model 2 explains a slightly higher percentage of the variance in the crash cost once the number of predictors is taken into account. Neither value is high: the model uses the driving record and the value of the driver’s car to estimate the crash amount, but the severity of the crash and the value of the other vehicle involved are unknown, and this creates the high unpredictability.
  • F-Statistic:

    • The F statistic for Model 2 (8.15) is substantially higher than that of Model 1 (1.61), indicating a stronger overall significance for Model 2. This suggests that the predictors in Model 2 are more useful in explaining the variance in crash costs compared to Model 1.
  • Residual Standard Error (RSE):

    • Model 2 has a lower residual standard error (7379.95) than Linear Model 1 (7409.07), meaning that Model 2’s predictions have a slightly smaller average error when compared to the actual values. Lower RSE is a positive indicator, as it reflects a more precise model fit.
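The adjusted R-squared and F-statistic reported for Linear Model 2 can be reproduced from the multiple R-squared, the number of predictors, and the residual degrees of freedom shown in the model summary. The sketch below uses only those printed values, so small rounding differences are expected.

```r
# Values printed in the Linear Model 2 summary.
r_squared    <- 0.02319
n_predictors <- 5
resid_df     <- 1716
n_obs        <- resid_df + n_predictors + 1  # intercept uses one df

# Adjusted R-squared penalizes R-squared for the number of predictors.
adj_r_squared <- 1 - (1 - r_squared) * (n_obs - 1) / resid_df

# Overall F-statistic: explained variance per predictor relative to
# unexplained variance per residual degree of freedom.
f_statistic <- (r_squared / n_predictors) / ((1 - r_squared) / resid_df)

# Upper-tail p-value of the F distribution.
f_p_value <- pf(f_statistic, n_predictors, resid_df, lower.tail = FALSE)

c(adj_r_squared = adj_r_squared, f_statistic = f_statistic, f_p_value = f_p_value)
```

This also shows why Model 1 can have the higher R-squared (0.0332) but the lower adjusted R-squared: its additional predictors add little explanatory power relative to the degrees of freedom they consume.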

Residual outcomes - Model 2

Linearity (Residuals vs. Fitted)

  • The plot suggests some slight non linearity in the relationship between predictors and the outcome variable, as indicated by the upward trend in residuals. This hints that the model may not fully capture certain patterns in the data.

Homogeneity (Scale-Location)

  • The Scale Location plot shows the spread of the residuals increasing as the fitted values grow, indicating heteroscedasticity. The variance of the residuals is therefore not constant, which could affect the efficiency of the model’s estimates.
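The visual impression from a Scale Location plot can be backed up with a numeric check. The sketch below is a hand-rolled, Breusch-Pagan style test on simulated data (not the report's data): squared residuals are regressed on the fitted values, and n times the R-squared of that auxiliary regression is compared to a chi-squared distribution. All variable names here are illustrative.

```r
set.seed(42)

# Simulated data in which the error spread grows with the predictor.
n <- 500
x <- runif(n, 1, 10)
y <- 2 + 3 * x + rnorm(n, sd = x)

fit <- lm(y ~ x)

# Auxiliary regression: do squared residuals depend on fitted values?
aux <- lm(residuals(fit)^2 ~ fitted(fit))

# Test statistic n * R^2, compared to chi-squared with 1 df
# (one explanatory term in the auxiliary regression).
bp_stat   <- n * summary(aux)$r.squared
bp_pvalue <- pchisq(bp_stat, df = 1, lower.tail = FALSE)

c(statistic = bp_stat, p_value = bp_pvalue)
```

A small p-value points to non-constant variance. For a fitted model such as Model 2, the same idea could be applied to its residuals and fitted values, or a packaged version such as bptest() from the lmtest package could be used if that package is available.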

Influential Observations (Residuals vs. Leverage)

  • The Residuals vs Leverage plot points to a few potentially influential observations. These influential points could have an outsized effect on the model’s coefficients.

Normality of Residuals (Q-Q Plot)

  • As in Model 1, the Q-Q plot for Model 2 shows departures from normality, particularly in the tails, where residuals deviate from the straight line. This indicates that the normality assumption is not fully met, which may affect hypothesis tests and confidence intervals that rely on it.

Making predictions with the model on the evaluation data

The predictions developed by Model 2 on the evaluation data look well distributed and capture the core accident costs reasonably well, as indicated by the similar third quartiles of the predicted and actual values (shown in the summary table below). However, outliers remain challenging to predict: the largest prediction is about $9,302, while the largest actual cost is about $73,783. This is largely due to variables that are unknown before the crash, such as the accident's severity and the value of the other vehicle involved. These unobserved factors introduce uncertainty and can lead to significant deviations in individual cost predictions.

Comparison of Predicted (Evaluation) and Actual (Training) Crash Amounts
Statistic     Predicted Value  Actual Value
Min                  1,865.84         30.28
1st Quartile         4,927.63      2,639.50
Median               5,586.87      4,093.00
Mean                 5,625.53      6,270.82
3rd Quartile         6,289.53      5,925.50
Max                  9,302.43     73,783.47
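A side-by-side summary like the one above can be assembled from two numeric vectors. This is a small sketch using made-up values rather than the report's predictions; the helper six_number is hypothetical and mirrors the statistics that R's summary() reports.

```r
# Min, quartiles, median, mean and max of a numeric vector.
six_number <- function(x) {
  q <- quantile(x, c(0.25, 0.5, 0.75), names = FALSE)
  c(Min = min(x), Q1 = q[1], Median = q[2],
    Mean = mean(x), Q3 = q[3], Max = max(x))
}

# Hypothetical crash amounts for illustration only.
predicted <- c(4000, 5000, 5500, 6000, 7000)
actual    <- c(500, 2500, 4000, 6000, 40000)

comparison <- cbind(Predicted = six_number(predicted),
                    Actual    = six_number(actual))
comparison
```

The toy numbers reproduce the pattern in the report's table: the quartiles can line up while the predicted range stays far narrower than the actual one, because a linear model pulls predictions toward the mean and cannot reproduce extreme costs.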