Predicting The Probability Of A Car Crash And Its Cost

INTRODUCTION:

In this study, we will explore, analyze and model a data set containing approximately 8,000 records, each representing a customer at an auto insurance company. Each record has two response variables. The first response variable, TARGET_FLAG, is a 1 or a 0: a "1" means that the person was in a car crash, and a "0" means that the person was not. The second response variable is TARGET_AMT. This value is zero if the person did not crash their car; if they did, it is a value greater than zero.

Our objective is to build multiple linear regression and binary logistic regression models on the training data to predict the probability that a person will crash their car and the amount of money it will cost if they do. We will only use the variables provided (or variables derived from them). The individual variables are introduced as they appear during the data exploration below.

1. DATA EXPLORATION:

In this section we load and explore the training data set. We will familiarize ourselves with the different variables, i.e. the dependent and independent variables, and check out their distributions. The problem at hand involves car crashes, insurance and the associated cost, which means we will be dealing with many variables, since these kinds of problems depend on multiple factors. So without further ado, let's begin by loading the data set.
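
The chunks that follow call functions from several packages (read_csv() and parse_number() from readr, describe() from psych, mice() from mice, createDataPartition()/train()/confusionMatrix() from caret, vif() from car, roc() from pROC, plus dplyr and ggplot2). These are presumably loaded in a setup chunk that is not shown; a minimal sketch of that setup is given below.

library(knitr)    # kable() for tables
library(readr)    # read_csv(), parse_number()
library(psych)    # describe()
library(mice)     # multiple imputation of missing values
library(caret)    # createDataPartition(), train(), confusionMatrix()
library(car)      # vif()
library(pROC)     # roc(), auc()
library(dplyr)    # data manipulation (rename(), select())
library(ggplot2)  # plotting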

Loading The Training Dataset:

The code chunk below loads the data set that we will use to train our models.

training <- read.csv("https://raw.githubusercontent.com/Umerfarooq122/predicting-the-probability-that-a-person-will-crash-their-car-and-also-the-amount-cost/main/insurance_training_data.csv")

Let's display the first few rows of the data set to check that everything has been loaded into our work environment correctly:

knitr::kable(head(training))
INDEX TARGET_FLAG TARGET_AMT KIDSDRIV AGE HOMEKIDS YOJ INCOME PARENT1 HOME_VAL MSTATUS SEX EDUCATION JOB TRAVTIME CAR_USE BLUEBOOK TIF CAR_TYPE RED_CAR OLDCLAIM CLM_FREQ REVOKED MVR_PTS CAR_AGE URBANICITY
1 0 0 0 60 0 11 $67,349 No $0 z_No M PhD Professional 14 Private $14,230 11 Minivan yes $4,461 2 No 3 18 Highly Urban/ Urban
2 0 0 0 43 0 11 $91,449 No $257,252 z_No M z_High School z_Blue Collar 22 Commercial $14,940 1 Minivan yes $0 0 No 0 1 Highly Urban/ Urban
4 0 0 0 35 1 10 $16,039 No $124,191 Yes z_F z_High School Clerical 5 Private $4,010 4 z_SUV no $38,690 2 No 3 10 Highly Urban/ Urban
5 0 0 0 51 0 14 No $306,251 Yes M <High School z_Blue Collar 32 Private $15,440 7 Minivan yes $0 0 No 0 6 Highly Urban/ Urban
6 0 0 0 50 0 NA $114,986 No $243,925 Yes z_F PhD Doctor 36 Private $18,000 1 z_SUV no $19,217 2 Yes 3 17 Highly Urban/ Urban
7 1 2946 0 34 1 12 $125,301 Yes $0 z_No z_F Bachelors z_Blue Collar 46 Commercial $17,430 1 Sports Car no $0 0 No 0 7 Highly Urban/ Urban

Checking Out The Dimensions, Descriptive Summary And Distributions:

As we can see, we have all the columns described in the introduction. Let's check out the dimensions of the data set.

dim(training)
## [1] 8161   26

As we can see, we have 26 columns in total and 8161 observations. One of those columns is an index column, which we do not need for the analysis, so let's remove it from our data set.

training <- training[-1]

Let's take a quick peek at the descriptive summary of our data set.

knitr::kable(describe(training))
vars n mean sd median trimmed mad min max range skew kurtosis se
TARGET_FLAG 1 8161 0.2638157 0.4407276 0 0.2047787 0.0000 0 1.0 1.0 1.0716614 -0.8516462 0.0048786
TARGET_AMT 2 8161 1504.3246481 4704.0269298 0 593.7121106 0.0000 0 107586.1 107586.1 8.7063034 112.2884386 52.0712628
KIDSDRIV 3 8161 0.1710575 0.5115341 0 0.0252719 0.0000 0 4.0 4.0 3.3518374 11.7801916 0.0056624
AGE 4 8155 44.7903127 8.6275895 45 44.8306513 8.8956 16 81.0 65.0 -0.0289889 -0.0617020 0.0955383
HOMEKIDS 5 8161 0.7212351 1.1163233 0 0.4971665 0.0000 0 5.0 5.0 1.3411271 0.6489915 0.0123571
YOJ 6 7707 10.4992864 4.0924742 11 11.0711853 2.9652 0 23.0 23.0 -1.2029676 1.1773410 0.0466169
INCOME* 7 8161 2875.5505453 2090.6786785 2817 2816.9534385 2799.1488 1 6613.0 6612.0 0.1094699 -1.2853032 23.1427840
PARENT1* 8 8161 1.1319691 0.3384779 1 1.0399755 0.0000 1 2.0 1.0 2.1743561 2.7281589 0.0037468
HOME_VAL* 9 8161 1684.8931503 1697.3791897 1245 1516.4994639 1842.8718 1 5107.0 5106.0 0.5162324 -1.1810965 18.7891522
MSTATUS* 10 8161 1.4003186 0.4899929 1 1.3754021 0.0000 1 2.0 1.0 0.4068189 -1.8347231 0.0054240
SEX* 11 8161 1.5360863 0.4987266 2 1.5451064 0.0000 1 2.0 1.0 -0.1446959 -1.9793056 0.0055207
EDUCATION* 12 8161 3.0906752 1.4448565 3 3.1133405 1.4826 1 5.0 4.0 0.1162654 -1.3799674 0.0159939
JOB* 13 8161 5.6871707 2.6818733 6 5.8145198 2.9652 1 9.0 8.0 -0.3067029 -1.2222635 0.0296870
TRAVTIME 14 8161 33.4857248 15.9083334 33 32.9954051 16.3086 5 142.0 137.0 0.4468174 0.6643331 0.1760974
CAR_USE* 15 8161 1.6288445 0.4831436 2 1.6610507 0.0000 1 2.0 1.0 -0.5332937 -1.7158080 0.0053482
BLUEBOOK* 16 8161 1283.6185516 893.5117428 1124 1259.5665492 1132.7064 1 2789.0 2788.0 0.2472837 -1.3624655 9.8907352
TIF 17 8161 5.3513050 4.1466353 4 4.8402512 4.4478 1 25.0 24.0 0.8908120 0.4224940 0.0459012
CAR_TYPE* 18 8161 3.5297145 1.9653570 3 3.5371420 2.9652 1 6.0 5.0 -0.0047181 -1.5165329 0.0217555
RED_CAR* 19 8161 1.2913859 0.4544287 1 1.2392403 0.0000 1 2.0 1.0 0.9180255 -1.1573709 0.0050303
OLDCLAIM* 20 8161 552.2714128 862.2006829 1 380.3196508 0.0000 1 2857.0 2856.0 1.3085876 0.2461666 9.5441372
CLM_FREQ 21 8161 0.7985541 1.1584527 0 0.5886047 0.0000 0 5.0 5.0 1.2087985 0.2842890 0.0128235
REVOKED* 22 8161 1.1225340 0.3279216 1 1.0281820 0.0000 1 2.0 1.0 2.3018899 3.2991013 0.0036299
MVR_PTS 23 8161 1.6955030 2.1471117 1 1.3138306 1.4826 0 13.0 13.0 1.3478403 1.3754900 0.0237675
CAR_AGE 24 7651 8.3283231 5.7007424 8 7.9632413 7.4130 -3 28.0 31.0 0.2819531 -0.7489756 0.0651737
URBANICITY* 25 8161 1.2045093 0.4033673 1 1.1306479 0.0000 1 2.0 1.0 1.4649406 0.1460688 0.0044651

Before moving on to data preparation, let's check out the distributions of our continuous variables using histograms and of our categorical variables using bar plots.
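
The plots themselves are not reproduced in this document. A minimal sketch of how they could be generated, looping over the numeric columns for histograms and the character columns for bar plots with base graphics, is shown below.

# histograms for numeric columns, bar plots for character (categorical) columns
num_cols <- names(training)[sapply(training, is.numeric)]
cat_cols <- names(training)[sapply(training, is.character)]

par(mfrow = c(3, 4))
for (col in num_cols) hist(training[[col]], main = col, xlab = col)

par(mfrow = c(4, 4))
for (col in cat_cols) barplot(table(training[[col]]), main = col, las = 2)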

We can see that the data set is imbalanced, since our target variable for logistic regression, TARGET_FLAG, does not have an equal number of positive and negative responses. Variables like PARENT1, REVOKED, RED_CAR and URBANICITY are also imbalanced and might not be good predictors, but we will investigate further. Similarly, apart from AGE, the other continuous variables are not normally distributed; however, our first model will be a logistic regression, so this is not an issue, since logistic regression does not assume that the continuous independent variables are normally distributed. Logistic regression is a type of regression analysis designed to predict the probability of an event occurring, and it makes no assumptions about the distribution of the independent variables.

Before we go further ahead with data preparation, we can quickly check whether our data set has any missing values and, if so, which variables contain how many.

knitr::kable(colSums(is.na(training)))
x
TARGET_FLAG 0
TARGET_AMT 0
KIDSDRIV 0
AGE 6
HOMEKIDS 0
YOJ 454
INCOME 0
PARENT1 0
HOME_VAL 0
MSTATUS 0
SEX 0
EDUCATION 0
JOB 0
TRAVTIME 0
CAR_USE 0
BLUEBOOK 0
TIF 0
CAR_TYPE 0
RED_CAR 0
OLDCLAIM 0
CLM_FREQ 0
REVOKED 0
MVR_PTS 0
CAR_AGE 510
URBANICITY 0

As we can see, variables like AGE, YOJ and CAR_AGE contain missing values that need to be fixed, but before jumping into fixing the missing values, let's take a look at the structure of the data and see the data type of each column.

str(training)
## 'data.frame':    8161 obs. of  25 variables:
##  $ TARGET_FLAG: int  0 0 0 0 0 1 0 1 1 0 ...
##  $ TARGET_AMT : num  0 0 0 0 0 ...
##  $ KIDSDRIV   : int  0 0 0 0 0 0 0 1 0 0 ...
##  $ AGE        : int  60 43 35 51 50 34 54 37 34 50 ...
##  $ HOMEKIDS   : int  0 0 1 0 0 1 0 2 0 0 ...
##  $ YOJ        : int  11 11 10 14 NA 12 NA NA 10 7 ...
##  $ INCOME     : chr  "$67,349" "$91,449" "$16,039" "" ...
##  $ PARENT1    : chr  "No" "No" "No" "No" ...
##  $ HOME_VAL   : chr  "$0" "$257,252" "$124,191" "$306,251" ...
##  $ MSTATUS    : chr  "z_No" "z_No" "Yes" "Yes" ...
##  $ SEX        : chr  "M" "M" "z_F" "M" ...
##  $ EDUCATION  : chr  "PhD" "z_High School" "z_High School" "<High School" ...
##  $ JOB        : chr  "Professional" "z_Blue Collar" "Clerical" "z_Blue Collar" ...
##  $ TRAVTIME   : int  14 22 5 32 36 46 33 44 34 48 ...
##  $ CAR_USE    : chr  "Private" "Commercial" "Private" "Private" ...
##  $ BLUEBOOK   : chr  "$14,230" "$14,940" "$4,010" "$15,440" ...
##  $ TIF        : int  11 1 4 7 1 1 1 1 1 7 ...
##  $ CAR_TYPE   : chr  "Minivan" "Minivan" "z_SUV" "Minivan" ...
##  $ RED_CAR    : chr  "yes" "yes" "no" "yes" ...
##  $ OLDCLAIM   : chr  "$4,461" "$0" "$38,690" "$0" ...
##  $ CLM_FREQ   : int  2 0 2 0 2 0 0 1 0 0 ...
##  $ REVOKED    : chr  "No" "No" "No" "No" ...
##  $ MVR_PTS    : int  3 0 3 0 3 0 0 10 0 1 ...
##  $ CAR_AGE    : int  18 1 10 6 17 7 1 7 1 17 ...
##  $ URBANICITY : chr  "Highly Urban/ Urban" "Highly Urban/ Urban" "Highly Urban/ Urban" "Highly Urban/ Urban" ...

As we can see, a lot of the data contains numbers and ideally should have a numeric data type, but because of symbols like $ and commas, R read those columns as character, so that needs to be addressed.

2. DATA PREPARATION

In this section we will prepare our data for the logistic regression model. First we will convert the character variables that actually contain numbers (currency amounts) into numeric.

training$INCOME <- parse_number(training$INCOME)
training$HOME_VAL <- parse_number(training$HOME_VAL)
training$BLUEBOOK <- parse_number(training$BLUEBOOK)
training$OLDCLAIM <- parse_number(training$OLDCLAIM)

Now that we have converted those columns, we can take the remaining character variables and convert them into factors, which is a much more suitable data type for logistic regression.

training$PARENT1 <- as.factor(training$PARENT1)
training$MSTATUS <- as.factor(training$MSTATUS)
training$SEX <- as.factor(training$SEX)
training$EDUCATION <- as.factor(training$EDUCATION)
levels(training$EDUCATION) <- c('<High School','z_High School','Bachelors', 'Masters','PHD')
training$CAR_USE <- as.factor(training$CAR_USE)
training$CAR_TYPE <- as.factor(training$CAR_TYPE)
training$RED_CAR <- as.factor(training$RED_CAR)
training$REVOKED <- as.factor(training$REVOKED)
training$URBANICITY <- as.factor(training$URBANICITY)
training$JOB[training$JOB==""]<- NA
training$JOB <- as.factor(training$JOB)
training$TARGET_FLAG <- as.factor((training$TARGET_FLAG))

Now that we have changed the data type of each variable, let's check the structure again using the str() function.

str(training)
## 'data.frame':    8161 obs. of  25 variables:
##  $ TARGET_FLAG: Factor w/ 2 levels "0","1": 1 1 1 1 1 2 1 2 2 1 ...
##  $ TARGET_AMT : num  0 0 0 0 0 ...
##  $ KIDSDRIV   : int  0 0 0 0 0 0 0 1 0 0 ...
##  $ AGE        : int  60 43 35 51 50 34 54 37 34 50 ...
##  $ HOMEKIDS   : int  0 0 1 0 0 1 0 2 0 0 ...
##  $ YOJ        : int  11 11 10 14 NA 12 NA NA 10 7 ...
##  $ INCOME     : num  67349 91449 16039 NA 114986 ...
##  $ PARENT1    : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 2 1 1 1 1 ...
##  $ HOME_VAL   : num  0 257252 124191 306251 243925 ...
##  $ MSTATUS    : Factor w/ 2 levels "Yes","z_No": 2 2 1 1 1 2 1 1 2 2 ...
##  $ SEX        : Factor w/ 2 levels "M","z_F": 1 1 2 1 2 2 2 1 2 1 ...
##  $ EDUCATION  : Factor w/ 5 levels "<High School",..: 4 5 5 1 4 2 1 2 2 2 ...
##  $ JOB        : Factor w/ 8 levels "Clerical","Doctor",..: 6 8 1 8 2 8 8 8 1 6 ...
##  $ TRAVTIME   : int  14 22 5 32 36 46 33 44 34 48 ...
##  $ CAR_USE    : Factor w/ 2 levels "Commercial","Private": 2 1 2 2 2 1 2 1 2 1 ...
##  $ BLUEBOOK   : num  14230 14940 4010 15440 18000 ...
##  $ TIF        : int  11 1 4 7 1 1 1 1 1 7 ...
##  $ CAR_TYPE   : Factor w/ 6 levels "Minivan","Panel Truck",..: 1 1 6 1 6 4 6 5 6 5 ...
##  $ RED_CAR    : Factor w/ 2 levels "no","yes": 2 2 1 2 1 1 1 2 1 1 ...
##  $ OLDCLAIM   : num  4461 0 38690 0 19217 ...
##  $ CLM_FREQ   : int  2 0 2 0 2 0 0 1 0 0 ...
##  $ REVOKED    : Factor w/ 2 levels "No","Yes": 1 1 1 1 2 1 1 2 1 1 ...
##  $ MVR_PTS    : int  3 0 3 0 3 0 0 10 0 1 ...
##  $ CAR_AGE    : int  18 1 10 6 17 7 1 7 1 17 ...
##  $ URBANICITY : Factor w/ 2 levels "Highly Urban/ Urban",..: 1 1 1 1 1 1 1 1 1 2 ...

Now that each variable has the right data type, we can go ahead and re-check the missing values and fix them.

knitr::kable(colSums(is.na(training)))
x
TARGET_FLAG 0
TARGET_AMT 0
KIDSDRIV 0
AGE 6
HOMEKIDS 0
YOJ 454
INCOME 445
PARENT1 0
HOME_VAL 464
MSTATUS 0
SEX 0
EDUCATION 0
JOB 526
TRAVTIME 0
CAR_USE 0
BLUEBOOK 0
TIF 0
CAR_TYPE 0
RED_CAR 0
OLDCLAIM 0
CLM_FREQ 0
REVOKED 0
MVR_PTS 0
CAR_AGE 510
URBANICITY 0

Now we can see that, in addition to the previous columns, INCOME, HOME_VAL and JOB also contain missing values (blank or unparseable entries became NA during conversion). To take care of the missing values we will rely on the mice package in R, which offers multiple imputation techniques. Since our data set has a mixture of continuous and categorical variables, we need a method that can handle both types; our pick is the random forest method, which handles both data types and, being an ensemble method, tends to give better predictions.

set.seed(2)
training <- mice(training, m=5, maxit = 3, method = 'rf')
## 
##  iter imp variable
##   1   1  AGE  YOJ  INCOME  HOME_VAL  JOB  CAR_AGE
##   1   2  AGE  YOJ  INCOME  HOME_VAL  JOB  CAR_AGE
##   1   3  AGE  YOJ  INCOME  HOME_VAL  JOB  CAR_AGE
##   1   4  AGE  YOJ  INCOME  HOME_VAL  JOB  CAR_AGE
##   1   5  AGE  YOJ  INCOME  HOME_VAL  JOB  CAR_AGE
##   2   1  AGE  YOJ  INCOME  HOME_VAL  JOB  CAR_AGE
##   2   2  AGE  YOJ  INCOME  HOME_VAL  JOB  CAR_AGE
##   2   3  AGE  YOJ  INCOME  HOME_VAL  JOB  CAR_AGE
##   2   4  AGE  YOJ  INCOME  HOME_VAL  JOB  CAR_AGE
##   2   5  AGE  YOJ  INCOME  HOME_VAL  JOB  CAR_AGE
##   3   1  AGE  YOJ  INCOME  HOME_VAL  JOB  CAR_AGE
##   3   2  AGE  YOJ  INCOME  HOME_VAL  JOB  CAR_AGE
##   3   3  AGE  YOJ  INCOME  HOME_VAL  JOB  CAR_AGE
##   3   4  AGE  YOJ  INCOME  HOME_VAL  JOB  CAR_AGE
##   3   5  AGE  YOJ  INCOME  HOME_VAL  JOB  CAR_AGE
training <- complete(training)
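
As a side note (a minimal sketch, not part of the original workflow), keeping the mids object returned by mice() in its own variable before calling complete() makes it easy to sanity-check the imputations, for example by comparing the densities of imputed and observed values with mice's densityplot():

imp <- mice(training, m = 5, maxit = 3, method = 'rf', seed = 2)
densityplot(imp, ~ INCOME + HOME_VAL + CAR_AGE)  # compare imputed vs. observed densities
# training <- complete(imp)                      # same completed data set as above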

Now we can check our data set for any missing values.

sum(is.na(training))
## [1] 0

Now our data set is ready for model training, so let's move on to the modeling section.

3. BUILDING AND SELECTING MODELS :

LOGISTIC REGRESSION:

BUILDING MODELS

Let's remove the TARGET_AMT column from our data set, since that column contains the response for the cost of accidents, which we will model separately.

training_log <- training[-2]

Now let's split the data into training and testing sets. We will split training_log into partial_train and validation: partial_train contains 80% of the data from training_log, and the rest goes into validation, which we will use for testing and evaluating the performance of our models.

set.seed(42)
split <- createDataPartition(training_log$TARGET_FLAG, p=.80, list=FALSE)
partial_train <- training_log[split, ]
validation <- training_log[ -split, ]

Now that our data has been split into two partitions, we can go ahead and create models. In our first model we will use all the variables in the data set with 5-fold cross-validation (and centering/scaling of the predictors) and see how the model performs.

MODEL 1:
m1 <- train(TARGET_FLAG ~., data = partial_train, 
              method = "glm", family = "binomial",
              trControl = trainControl(
                  method = "cv", number = 5,
                  savePredictions = TRUE),
              tuneLength = 5, 
              preProcess = c("center", "scale"))

Here is the summary of our first model

summary(m1)
## 
## Call:
## NULL
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3224  -0.7115  -0.4040   0.6390   3.1684  
## 
## Coefficients:
##                                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                       -1.418410   0.038613 -36.734  < 2e-16 ***
## KIDSDRIV                           0.201483   0.034698   5.807 6.37e-09 ***
## AGE                               -0.018978   0.038786  -0.489 0.624634    
## HOMEKIDS                           0.048120   0.046094   1.044 0.296507    
## YOJ                               -0.084769   0.038473  -2.203 0.027570 *  
## INCOME                            -0.106703   0.058747  -1.816 0.069322 .  
## PARENT1Yes                         0.145814   0.041551   3.509 0.000449 ***
## HOME_VAL                          -0.175370   0.049671  -3.531 0.000415 ***
## MSTATUSz_No                        0.217628   0.046882   4.642 3.45e-06 ***
## SEXz_F                            -0.005406   0.062485  -0.087 0.931051    
## `EDUCATIONz_High School`          -0.191901   0.057747  -3.323 0.000890 ***
## EDUCATIONBachelors                -0.125781   0.076051  -1.654 0.098148 .  
## EDUCATIONMasters                  -0.025369   0.066306  -0.383 0.702016    
## EDUCATIONPHD                      -0.008077   0.047956  -0.168 0.866250    
## JOBDoctor                         -0.182664   0.053863  -3.391 0.000696 ***
## `JOBHome Maker`                   -0.052669   0.043143  -1.221 0.222161    
## JOBLawyer                         -0.119903   0.059343  -2.020 0.043332 *  
## JOBManager                        -0.354018   0.056134  -6.307 2.85e-10 ***
## JOBProfessional                   -0.129952   0.049389  -2.631 0.008509 ** 
## JOBStudent                        -0.099372   0.041837  -2.375 0.017538 *  
## `JOBz_Blue Collar`                -0.088423   0.049560  -1.784 0.074400 .  
## TRAVTIME                           0.222401   0.033396   6.659 2.75e-11 ***
## CAR_USEPrivate                    -0.396063   0.048045  -8.244  < 2e-16 ***
## BLUEBOOK                          -0.197218   0.049663  -3.971 7.15e-05 ***
## TIF                               -0.234870   0.034013  -6.905 5.01e-12 ***
## `CAR_TYPEPanel Truck`              0.176667   0.049360   3.579 0.000345 ***
## CAR_TYPEPickup                     0.212538   0.042009   5.059 4.21e-07 ***
## `CAR_TYPESports Car`               0.293992   0.045466   6.466 1.01e-10 ***
## CAR_TYPEVan                        0.199900   0.040602   4.923 8.50e-07 ***
## CAR_TYPEz_SUV                      0.323606   0.055821   5.797 6.74e-09 ***
## RED_CARyes                         0.010064   0.044221   0.228 0.819976    
## OLDCLAIM                          -0.139755   0.038532  -3.627 0.000287 ***
## CLM_FREQ                           0.237345   0.036995   6.416 1.40e-10 ***
## REVOKEDYes                         0.288161   0.033696   8.552  < 2e-16 ***
## MVR_PTS                            0.229272   0.032802   6.990 2.76e-12 ***
## CAR_AGE                            0.002160   0.047560   0.045 0.963768    
## `URBANICITYz_Highly Rural/ Rural` -0.942472   0.049807 -18.923  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 7536.3  on 6529  degrees of freedom
## Residual deviance: 5861.8  on 6493  degrees of freedom
## AIC: 5935.8
## 
## Number of Fisher Scoring iterations: 5

Let's remove the variables with high p-values to create more models.

MODEL 2:
m2 <- train(TARGET_FLAG ~ KIDSDRIV + 
                  PARENT1 + HOME_VAL + MSTATUS + INCOME + 
                  TRAVTIME + CAR_USE + BLUEBOOK + TIF + CAR_TYPE  + OLDCLAIM + CLM_FREQ + REVOKED + MVR_PTS + URBANICITY, 
              data = partial_train, 
              method = "glm", family = "binomial",
              trControl = trainControl(
                  method = "cv", number = 10,
                  savePredictions = TRUE),
              tuneLength = 5, 
              preProcess = c("center", "scale"))
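
For reference, the coefficient summary of Model 2 can be inspected with the same call used for Model 1 (output not reproduced here):

summary(m2)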

Let's look at the Variance Inflation Factor (VIF) of each predictor and consider removing the ones with the highest VIF.

knitr::kable(vif(m2$finalModel))
x
KIDSDRIV 6.262574
PARENT1Yes 8.015029
HOME_VAL 15.181384
MSTATUSz_No 12.546313
INCOME 14.450041
TRAVTIME 7.070191
CAR_USEPrivate 8.999717
BLUEBOOK 12.693090
TIF 7.382963
CAR_TYPEPanel Truck 12.010741
CAR_TYPEPickup 10.781791
CAR_TYPESports Car 8.852750
CAR_TYPEVan 9.488330
CAR_TYPEz_SUV 11.723275
OLDCLAIM 9.384444
CLM_FREQ 8.679044
REVOKEDYes 7.163466
MVR_PTS 6.822545
URBANICITYz_Highly Rural/ Rural 16.025249

Everything in the coefficient summary for Model 2 looks statistically significant, but for our third model let's remove the predictors with high VIF values.

MODEL 3:
m3 <- train(TARGET_FLAG ~ KIDSDRIV + 
                  PARENT1  + MSTATUS + INCOME + 
                  TRAVTIME + CAR_USE  + TIF   + OLDCLAIM + CLM_FREQ + REVOKED + MVR_PTS + URBANICITY,  
              data = partial_train, 
              method = "glm", family = "binomial",
              trControl = trainControl(
                  method = "cv", number = 10,
                  savePredictions = TRUE),
              tuneLength = 5, 
              preProcess = c("center", "scale"))
summary(m3)
## 
## Call:
## NULL
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2890  -0.7488  -0.4450   0.7090   2.8708  
## 
## Coefficients:
##                                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                       -1.35587    0.03712 -36.529  < 2e-16 ***
## KIDSDRIV                           0.21169    0.03053   6.935 4.06e-12 ***
## PARENT1Yes                         0.19018    0.03450   5.513 3.53e-08 ***
## MSTATUSz_No                        0.25858    0.03661   7.063 1.64e-12 ***
## INCOME                            -0.54003    0.03734 -14.462  < 2e-16 ***
## TRAVTIME                           0.21942    0.03240   6.772 1.27e-11 ***
## CAR_USEPrivate                    -0.39855    0.03140 -12.693  < 2e-16 ***
## TIF                               -0.22456    0.03318  -6.768 1.31e-11 ***
## OLDCLAIM                          -0.13003    0.03739  -3.478 0.000505 ***
## CLM_FREQ                           0.23254    0.03599   6.462 1.03e-10 ***
## REVOKEDYes                         0.29147    0.03260   8.942  < 2e-16 ***
## MVR_PTS                            0.25216    0.03194   7.894 2.93e-15 ***
## `URBANICITYz_Highly Rural/ Rural` -0.86220    0.04903 -17.585  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 7536.3  on 6529  degrees of freedom
## Residual deviance: 6117.1  on 6517  degrees of freedom
## AIC: 6143.1
## 
## Number of Fisher Scoring iterations: 5

Again we can see that everything in Model 3 is statistically significant, so let's move on to selecting a model based on classification metrics.

SELECTING MODEL:

In this section we will look at different metrics, i.e. the confusion matrix, specificity, sensitivity, F1 score, precision and AUC-ROC, and decide which model is optimal for prediction. First we will look at the confusion matrices using fourfold plots.

pred1 <- predict(m1, newdata = validation)
pred2 <- predict(m2, newdata = validation)
pred3 <- predict(m3, newdata = validation)
m1cM <- confusionMatrix(pred1, validation$TARGET_FLAG, 
                        mode = "everything")
m2cM <- confusionMatrix(pred2, validation$TARGET_FLAG, 
                        mode = "everything")
m3cM <- confusionMatrix(pred3, validation$TARGET_FLAG, 
                        mode = "everything")
par(mfrow=c(1,3))
fourfoldplot(m1cM$table, color = c("#B22222", "#2E8B57"), main="Mod1")
fourfoldplot(m2cM$table, color = c("#B22222", "#2E8B57"), main="Mod2")
fourfoldplot(m3cM$table, color = c("#B22222", "#2E8B57"), main="Mod3")

As we can see, Model 2 has an accuracy of around 81%, which is not bad considering it has far fewer predictors than Model 1, and it is noticeably higher than Model 3's accuracy of around 77%. We can also look at other metrics before we decide on Model 2.
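
The accuracy figures quoted above can also be read directly from the confusionMatrix objects; a quick sketch:

round(c(Mod1 = m1cM$overall["Accuracy"],
        Mod2 = m2cM$overall["Accuracy"],
        Mod3 = m3cM$overall["Accuracy"]), 3)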

eval <- data.frame(m1cM$byClass, 
                   m2cM$byClass, 
                   m3cM$byClass)
eval <- data.frame(t(eval))

eval <- dplyr::select(eval, Sensitivity, Specificity, Precision, Recall, F1)
row.names(eval) <- c("Model 1", "Model 2", "Model 3")
knitr::kable(eval)
Sensitivity Specificity Precision Recall F1
Model 1 0.9258951 0.4279070 0.8188513 0.9258951 0.8690895
Model 2 0.9308909 0.4162791 0.8166545 0.9308909 0.8700389
Model 3 0.9333888 0.3697674 0.8053161 0.9333888 0.8646356
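
As a quick sanity check (a minimal sketch, not part of the original analysis), the F1 column above can be reproduced from precision and recall, since F1 is their harmonic mean:

precision <- 0.8166545  # Model 2 precision from the table above
recall    <- 0.9308909  # Model 2 recall from the table above
2 * precision * recall / (precision + recall)  # ~0.870, matching the reported F1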

Based on the metrics above we will again go with Model 2, since it has the highest F1 score. The reason we use F1 as the basis for this data set is that the response variable is imbalanced, and with an imbalanced response it makes more sense to base the decision on F1 than on raw accuracy. We can also check out the ROC curves. Let's create a function for our ROC curve.

getROC <- function(model) {
    name <- deparse(substitute(model))
    pred.prob1 <- predict(model, newdata = validation, type="prob")
    p1 <- data.frame(pred = validation$TARGET_FLAG, prob = pred.prob1[[1]])
    p1 <- p1[order(p1$prob),]
    rocobj <- roc(p1$pred, p1$prob)
    plot(rocobj, asp=NA, legacy.axes = TRUE, print.auc=TRUE,
         xlab = "1 - Specificity", main = name)  # legacy.axes = TRUE plots 1 - specificity on the x-axis
}
par(mfrow = c(1,3))
getROC(m1)
getROC(m2)
getROC(m3)
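
Since the ROC plots are not reproduced in this document, the AUC values could also be extracted numerically with a small helper like the one below (auc_for is introduced here for illustration and is not part of the original analysis):

auc_for <- function(model) {
  probs <- predict(model, newdata = validation, type = "prob")[["1"]]  # predicted P(crash)
  as.numeric(pROC::auc(pROC::roc(validation$TARGET_FLAG, probs)))
}
sapply(list(Mod1 = m1, Mod2 = m2, Mod3 = m3), auc_for)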

Model 2 also has a higher AUC compared to Model 1, and bearing in mind that it uses fewer predictors (which keeps computation down) and has a higher accuracy and F1 score, we will go with Model 2.

MULTIPLE LINEAR REGRESSION:

BUILDING MODELS:

Let's remove the TARGET_FLAG column from our data set, since that column contains the response for the occurrence of accidents, so we will leave it out. But before that we have to fix our data types to fit a linear regression model, so let's convert the factor columns into numeric using label encoding, as shown below.

training$PARENT1 <- as.numeric(as.factor(training$PARENT1))
training$MSTATUS <- as.numeric(as.factor(training$MSTATUS))
training$SEX <- as.numeric(as.factor(training$SEX))
training$EDUCATION <- as.numeric(as.factor(training$EDUCATION))
levels(training$EDUCATION) <- c('<High School','z_High School','Bachelors', 'Masters','PHD')
training$CAR_USE <- as.numeric(as.factor(training$CAR_USE))
training$CAR_TYPE <- as.numeric(as.factor(training$CAR_TYPE))
training$RED_CAR <- as.factor(training$RED_CAR)
training$REVOKED <- as.factor(training$REVOKED)
training$URBANICITY <- as.factor(training$URBANICITY)
training$JOB[training$JOB==""]<- NA
training$JOB <- as.factor(training$JOB)

Now let’s remove the unwanted column(s)

training_lin <- training[-1]

Before moving on with our model, let's check out the distribution of the response variable for multiple linear regression.

hist(training_lin$TARGET_AMT)

The distribution is heavily right-skewed, which is not favorable for linear regression, so we will have to apply some kind of transformation. Let's try a log transformation; we use log1p (i.e. log(1 + x)) because TARGET_AMT is zero whenever there was no crash and log(0) is undefined.

logdata <- log1p(training_lin$TARGET_AMT)

Now checking the distribution again

hist(logdata)

It looks much better now, apart from the large spike at zero; the data set is heavily imbalanced and there is nothing we can do to cure that imbalance. Now let's split the data into training and testing sets. We will split training_lin into partial_train and validation: partial_train contains 80% of the data from training_lin, and the rest goes into validation, which we will use for testing and evaluating the performance of our model.

set.seed(42)
split <- createDataPartition(training_lin$TARGET_AMT, p=.80, list=FALSE)
partial_train <- training_lin[split, ]
validation <- training_lin[ -split, ]

Now we can feed the data to our model. As with the logistic regression models, we will use cross-validation, here with 10 folds.

MODEL 1:
lm1 <- train(log1p(TARGET_AMT) ~ ., data = partial_train, 
              method = "lm", 
              trControl = trainControl(
                  method = "cv", number = 10,
                  savePredictions = TRUE),
              tuneLength = 5, 
              preProcess = c("center", "scale"))

Let’s check out the summary of our model

summary(lm1)
## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.5298 -2.3433 -0.8758  2.1333 10.3663 
## 
## Coefficients:
##                                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                        2.181752   0.040198  54.275  < 2e-16 ***
## KIDSDRIV                           0.249425   0.045865   5.438 5.58e-08 ***
## AGE                               -0.002362   0.048227  -0.049 0.960937    
## HOMEKIDS                           0.064809   0.057767   1.122 0.261943    
## YOJ                               -0.114792   0.048560  -2.364 0.018111 *  
## INCOME                            -0.149950   0.068401  -2.192 0.028398 *  
## PARENT1                            0.212613   0.054489   3.902 9.64e-05 ***
## HOME_VAL                          -0.207018   0.062443  -3.315 0.000920 ***
## MSTATUS                            0.256638   0.058121   4.416 1.02e-05 ***
## SEX                               -0.022231   0.061084  -0.364 0.715909    
## EDUCATION                          0.074819   0.041586   1.799 0.072042 .  
## JOBDoctor                         -0.218108   0.050893  -4.286 1.85e-05 ***
## `JOBHome Maker`                   -0.091788   0.052598  -1.745 0.081016 .  
## JOBLawyer                         -0.200753   0.059296  -3.386 0.000714 ***
## JOBManager                        -0.497817   0.063458  -7.845 5.04e-15 ***
## JOBProfessional                   -0.208254   0.056926  -3.658 0.000256 ***
## JOBStudent                        -0.092906   0.053908  -1.723 0.084860 .  
## `JOBz_Blue Collar`                -0.088332   0.061956  -1.426 0.153994    
## TRAVTIME                           0.251143   0.040974   6.129 9.34e-10 ***
## CAR_USE                           -0.507531   0.053316  -9.519  < 2e-16 ***
## BLUEBOOK                          -0.155611   0.048107  -3.235 0.001224 ** 
## TIF                               -0.278680   0.040373  -6.903 5.59e-12 ***
## CAR_TYPE                           0.303885   0.047908   6.343 2.40e-10 ***
## RED_CARyes                        -0.026205   0.053960  -0.486 0.627249    
## OLDCLAIM                          -0.158343   0.052185  -3.034 0.002421 ** 
## CLM_FREQ                           0.295897   0.050798   5.825 5.98e-09 ***
## REVOKEDYes                         0.413593   0.045617   9.067  < 2e-16 ***
## MVR_PTS                            0.381875   0.044387   8.603  < 2e-16 ***
## CAR_AGE                           -0.108457   0.050032  -2.168 0.030214 *  
## `URBANICITYz_Highly Rural/ Rural` -1.006789   0.044912 -22.417  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.248 on 6499 degrees of freedom
## Multiple R-squared:  0.2218, Adjusted R-squared:  0.2183 
## F-statistic: 63.87 on 29 and 6499 DF,  p-value: < 2.2e-16

As we can see, quite a few predictors are not significant in our model, so let's get rid of those predictors.

MODEL 2:
lm2 <- train(log1p(TARGET_AMT) ~ KIDSDRIV+ MSTATUS+ CAR_TYPE+ JOB+ HOME_VAL+PARENT1+ BLUEBOOK  + CLM_FREQ + REVOKED + OLDCLAIM, data = partial_train, 
              method = "lm", 
              trControl = trainControl(
                  method = "cv", number = 10,
                  savePredictions = TRUE),
              tuneLength = 5, 
              preProcess = c("center", "scale"))

Let's check the multicollinearity among the model's predictors.

knitr::kable(vif(lm2$finalModel))
x
KIDSDRIV 1.068806
MSTATUS 1.758464
CAR_TYPE 1.072852
JOBDoctor 1.263166
JOBHome Maker 1.402136
JOBLawyer 1.591487
JOBManager 1.890681
JOBProfessional 1.770313
JOBStudent 1.514315
JOBz_Blue Collar 1.929418
HOME_VAL 1.815799
PARENT1 1.392560
BLUEBOOK 1.205615
CLM_FREQ 1.391465
REVOKEDYes 1.276010
OLDCLAIM 1.672804

And the significance of the predictors:

summary(lm2)
## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -6.523 -2.287 -1.312  2.667 10.013 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         2.18175    0.04252  51.307  < 2e-16 ***
## KIDSDRIV            0.24495    0.04397   5.571 2.63e-08 ***
## MSTATUS             0.19925    0.05639   3.533 0.000413 ***
## CAR_TYPE            0.27216    0.04405   6.179 6.85e-10 ***
## JOBDoctor          -0.16381    0.04780  -3.427 0.000614 ***
## `JOBHome Maker`    -0.06114    0.05036  -1.214 0.224756    
## JOBLawyer          -0.14842    0.05365  -2.766 0.005684 ** 
## JOBManager         -0.25140    0.05848  -4.299 1.74e-05 ***
## JOBProfessional    -0.08194    0.05658  -1.448 0.147611    
## JOBStudent          0.08649    0.05233   1.653 0.098437 .  
## `JOBz_Blue Collar`  0.25191    0.05907   4.265 2.03e-05 ***
## HOME_VAL           -0.27466    0.05731  -4.793 1.68e-06 ***
## PARENT1             0.25765    0.05018   5.134 2.92e-07 ***
## BLUEBOOK           -0.08655    0.04669  -1.854 0.063850 .  
## CLM_FREQ            0.71057    0.05016  14.165  < 2e-16 ***
## REVOKEDYes          0.51783    0.04804  10.779  < 2e-16 ***
## OLDCLAIM           -0.13009    0.05500  -2.365 0.018054 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.436 on 6512 degrees of freedom
## Multiple R-squared:  0.1274, Adjusted R-squared:  0.1253 
## F-statistic: 59.44 on 16 and 6512 DF,  p-value: < 2.2e-16

SELECTING MODEL:

df <- data.frame()
df <- rbind(df, lm1$results)
df <- rbind(df, lm2$results)

df$intercept <- c("Mod1", "Mod2")
colnames(df)[1] <- "model"
knitr::kable(df)
model RMSE Rsquared MAE RMSESD RsquaredSD MAESD
Mod1 3.255671 0.2155983 2.640527 0.0533824 0.0273154 0.0583600
Mod2 3.440147 0.1246543 2.821055 0.0584894 0.0267984 0.0326541

Even though Model 1 (lm1) has many more predictors, we will still choose it because it achieves a much higher cross-validated \(R^2\); its RMSE and MAE are also lower than those of Model 2.
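
In addition to the cross-validated training metrics above, one could also compare the two models on the held-out validation split (a minimal sketch, not part of the original analysis; rmse_log is a helper introduced here, computed on the log1p scale the models were fit on):

rmse_log <- function(model) {
  preds <- predict(model, newdata = validation)
  sqrt(mean((log1p(validation$TARGET_AMT) - preds)^2))
}
c(Mod1 = rmse_log(lm1), Mod2 = rmse_log(lm2))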

4. EVALUATION

In this section we make predictions using the evaluation data provided. Before we make predictions, we have to fix the data types and deal with any missing values in the evaluation data set. Without further ado, let's load the data set into our environment.

testing <- read_csv("https://raw.githubusercontent.com/Umerfarooq122/predicting-the-probability-that-a-person-will-crash-their-car-and-also-the-amount-cost/main/insurance-evaluation-data.csv")
knitr::kable(head(testing))
INDEX TARGET_FLAG TARGET_AMT KIDSDRIV AGE HOMEKIDS YOJ INCOME PARENT1 HOME_VAL MSTATUS SEX EDUCATION JOB TRAVTIME CAR_USE BLUEBOOK TIF CAR_TYPE RED_CAR OLDCLAIM CLM_FREQ REVOKED MVR_PTS CAR_AGE URBANICITY
3 NA NA 0 48 0 11 $52,881 No $0 z_No M Bachelors Manager 26 Private $21,970 1 Van yes $0 0 No 2 10 Highly Urban/ Urban
9 NA NA 1 40 1 11 $50,815 Yes $0 z_No M z_High School Manager 21 Private $18,930 6 Minivan no $3,295 1 No 2 1 Highly Urban/ Urban
10 NA NA 0 44 2 12 $43,486 Yes $0 z_No z_F z_High School z_Blue Collar 30 Commercial $5,900 10 z_SUV no $0 0 No 0 10 z_Highly Rural/ Rural
18 NA NA 0 35 2 NA $21,204 Yes $0 z_No M z_High School Clerical 74 Private $9,230 6 Pickup no $0 0 Yes 0 4 z_Highly Rural/ Rural
21 NA NA 0 59 0 12 $87,460 No $0 z_No M z_High School Manager 45 Private $15,420 1 Minivan yes $44,857 2 No 4 1 Highly Urban/ Urban
30 NA NA 0 46 0 14 NA No $207,519 Yes M Bachelors Professional 7 Commercial $25,660 1 Panel Truck no $2,119 1 No 2 12 Highly Urban/ Urban
testing <- testing[,-(1:3)]
str(testing)
## tibble [2,141 × 23] (S3: tbl_df/tbl/data.frame)
##  $ KIDSDRIV  : num [1:2141] 0 1 0 0 0 0 0 0 2 0 ...
##  $ AGE       : num [1:2141] 48 40 44 35 59 46 60 54 36 50 ...
##  $ HOMEKIDS  : num [1:2141] 0 1 2 2 0 0 0 0 2 0 ...
##  $ YOJ       : num [1:2141] 11 11 12 NA 12 14 12 12 12 8 ...
##  $ INCOME    : chr [1:2141] "$52,881" "$50,815" "$43,486" "$21,204" ...
##  $ PARENT1   : chr [1:2141] "No" "Yes" "Yes" "Yes" ...
##  $ HOME_VAL  : chr [1:2141] "$0" "$0" "$0" "$0" ...
##  $ MSTATUS   : chr [1:2141] "z_No" "z_No" "z_No" "z_No" ...
##  $ SEX       : chr [1:2141] "M" "M" "z_F" "M" ...
##  $ EDUCATION : chr [1:2141] "Bachelors" "z_High School" "z_High School" "z_High School" ...
##  $ JOB       : chr [1:2141] "Manager" "Manager" "z_Blue Collar" "Clerical" ...
##  $ TRAVTIME  : num [1:2141] 26 21 30 74 45 7 16 27 5 22 ...
##  $ CAR_USE   : chr [1:2141] "Private" "Private" "Commercial" "Private" ...
##  $ BLUEBOOK  : chr [1:2141] "$21,970" "$18,930" "$5,900" "$9,230" ...
##  $ TIF       : num [1:2141] 1 6 10 6 1 1 1 4 4 4 ...
##  $ CAR_TYPE  : chr [1:2141] "Van" "Minivan" "z_SUV" "Pickup" ...
##  $ RED_CAR   : chr [1:2141] "yes" "no" "no" "no" ...
##  $ OLDCLAIM  : chr [1:2141] "$0" "$3,295" "$0" "$0" ...
##  $ CLM_FREQ  : num [1:2141] 0 1 0 0 2 1 0 0 0 0 ...
##  $ REVOKED   : chr [1:2141] "No" "No" "No" "Yes" ...
##  $ MVR_PTS   : num [1:2141] 2 2 0 0 4 2 0 5 0 3 ...
##  $ CAR_AGE   : num [1:2141] 10 1 10 4 1 12 1 NA 9 1 ...
##  $ URBANICITY: chr [1:2141] "Highly Urban/ Urban" "Highly Urban/ Urban" "z_Highly Rural/ Rural" "z_Highly Rural/ Rural" ...
testing$INCOME <- parse_number(testing$INCOME)
testing$HOME_VAL <- parse_number(testing$HOME_VAL)
testing$BLUEBOOK <- parse_number(testing$BLUEBOOK)
testing$OLDCLAIM <- parse_number(testing$OLDCLAIM)

Now that we have converted those columns, we can take the remaining character variables and convert them into factors, which is a much more suitable data type for logistic regression.

testing$PARENT1 <- as.factor(testing$PARENT1)
testing$MSTATUS <- as.factor(testing$MSTATUS)
testing$SEX <- as.factor(testing$SEX)
testing$EDUCATION <- as.factor(testing$EDUCATION)
levels(testing$EDUCATION) <- c('<High School','z_High School','Bachelors', 'Masters','PHD')
testing$CAR_USE <- as.factor(testing$CAR_USE)
testing$CAR_TYPE <- as.factor(testing$CAR_TYPE)
testing$RED_CAR <- as.factor(testing$RED_CAR)
testing$REVOKED <- as.factor(testing$REVOKED)
testing$URBANICITY <- as.factor(testing$URBANICITY)
testing$JOB[testing$JOB==""]<- NA
testing$JOB <- as.factor(testing$JOB)
colSums(is.na(testing))
##   KIDSDRIV        AGE   HOMEKIDS        YOJ     INCOME    PARENT1   HOME_VAL 
##          0          1          0         94        125          0        111 
##    MSTATUS        SEX  EDUCATION        JOB   TRAVTIME    CAR_USE   BLUEBOOK 
##          0          0          0        139          0          0          0 
##        TIF   CAR_TYPE    RED_CAR   OLDCLAIM   CLM_FREQ    REVOKED    MVR_PTS 
##          0          0          0          0          0          0          0 
##    CAR_AGE URBANICITY 
##        129          0
set.seed(2)
testing <- mice(testing, m=5, maxit = 3, method = 'rf')
## 
##  iter imp variable
##   1   1  AGE  YOJ  INCOME  HOME_VAL  JOB  CAR_AGE
##   1   2  AGE  YOJ  INCOME  HOME_VAL  JOB  CAR_AGE
##   1   3  AGE  YOJ  INCOME  HOME_VAL  JOB  CAR_AGE
##   1   4  AGE  YOJ  INCOME  HOME_VAL  JOB  CAR_AGE
##   1   5  AGE  YOJ  INCOME  HOME_VAL  JOB  CAR_AGE
##   2   1  AGE  YOJ  INCOME  HOME_VAL  JOB  CAR_AGE
##   2   2  AGE  YOJ  INCOME  HOME_VAL  JOB  CAR_AGE
##   2   3  AGE  YOJ  INCOME  HOME_VAL  JOB  CAR_AGE
##   2   4  AGE  YOJ  INCOME  HOME_VAL  JOB  CAR_AGE
##   2   5  AGE  YOJ  INCOME  HOME_VAL  JOB  CAR_AGE
##   3   1  AGE  YOJ  INCOME  HOME_VAL  JOB  CAR_AGE
##   3   2  AGE  YOJ  INCOME  HOME_VAL  JOB  CAR_AGE
##   3   3  AGE  YOJ  INCOME  HOME_VAL  JOB  CAR_AGE
##   3   4  AGE  YOJ  INCOME  HOME_VAL  JOB  CAR_AGE
##   3   5  AGE  YOJ  INCOME  HOME_VAL  JOB  CAR_AGE
testing <- complete(testing)

LOGISTIC REGRESSION:

logpred <- predict(m2, testing)
logpred_prob <- predict(m2, testing, type = "prob")
log_df <- cbind(logpred_prob, TARGET_FLAG = logpred)
log_df <- log_df%>%
  rename(prob=2)

MULTIPLE LINEAR REGRESSION:

#fixing data type
testing$PARENT1 <- as.numeric(as.factor(testing$PARENT1))
testing$MSTATUS <- as.numeric(as.factor(testing$MSTATUS))
testing$CAR_TYPE <- as.numeric(as.factor(testing$CAR_TYPE))
testing$SEX <- as.numeric(as.factor(testing$SEX))
testing$EDUCATION <- as.numeric(as.factor(testing$EDUCATION))
testing$CAR_USE <- as.numeric(as.factor(testing$CAR_USE))
# predicting
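# note: lm1 was fit on log1p(TARGET_AMT), so expm1() would be the exact inverse;
# exp() overstates each back-transformed prediction by 1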
amountpred <- exp(predict(lm1, testing))
final <- cbind(log_df, TARGET_AMT = amountpred)

knitr::kable(head(final))
0 prob TARGET_FLAG TARGET_AMT
0.7216442 0.2783558 0 5.948811
0.6772316 0.3227684 0 13.060923
0.8646774 0.1353226 0 7.223364
0.7570752 0.2429248 0 22.487631
0.8098321 0.1901679 0 6.287991
0.7897810 0.2102190 0 7.110726

To confirm that our models actually predict higher costs as the predicted probability of an accident increases, we can plot the two against each other.

ggplot(data = final,  mapping = aes(x= prob, y = TARGET_AMT)) +
  geom_point(color = "blue", size = 3)+labs(x="Predicted Probability of Accident Happening", y="Cost")+theme_bw()

5. CONCLUSION:

In this study we were dealing with an imbalanced data set, and our job was to find out what contributes to a car crash and how much the repair would cost. The data set provided had 26 variables and around 8,000 observations. We started our analysis by exploring the data set, then cleaned and prepared it for training our models. We had to come up with two kinds of models, one for classification and one for regression. For classification we used logistic regression with cross-validation and trained three models; since the data set is imbalanced, the model with the highest F1 score was chosen for the final predictions. For predicting the cost of an accident we formulated two regression models and picked the better one based on cross-validated RMSE, MAE and \(R^2\). Our predictions were not that accurate when it came to the cost of the accidents, but our model did predict a higher TARGET_AMT whenever the predicted probability of an accident was higher.