Predicting The Probability Of A Car Crash And It’s Cost

INTRODUCTION:

In this study, we will explore, analyze and model a data set containing approximately 8000 records representing a customer at an auto insurance company. Each record has two response variables. The first response variable, TARGET_FLAG, is a 1 or a 0. A “1” means that the person was in a car crash. A zero means that the person was not in a car crash. The second response variable is TARGET_AMT. This value is zero if the person did not crash their car. But if they did crash their car, this number will be a value greater than zero.

Our objective is to build multiple linear regression and binary logistic regression models on the training data to predict the probability that a person will crash their car and also the amount of money it will cost if the person does crash their car. You can only use the variables given to you (or variables that you derive from the variables provided). Below is a short description of the variables of interest in the data set:

1. DATA EXPLORATION:

In this section we load and explore the training data set. We will try get familiarize ourselves with different variables i.e. dependent and independent variables, and check out their distributions. The problem at hand is about the car crashes, insurance and the associated cost which indicates that we will be dealing with a lot of variables since these kind of problems are dependent on multiple factors. So before any further due let’s begin by loading the data set.

Loading The Training Dataset:

Below code chunk loads the required data set that we can use to train our model.

training <- read.csv("https://raw.githubusercontent.com/Umerfarooq122/predicting-the-probability-that-a-person-will-crash-their-car-and-also-the-amount-cost/main/insurance_training_data.csv")

Let’s display the fist five row of the data set to check if everything has been loaded into our work environment correctly:

knitr::kable(head(training))

INDEX	TARGET_FLAG	TARGET_AMT	AGE	HOMEKIDS	YOJ	INCOME	PARENT1	HOME_VAL	MSTATUS	SEX	EDUCATION	JOB	TRAVTIME	CAR_USE	BLUEBOOK	TIF	CAR_TYPE	RED_CAR	OLDCLAIM	CLM_FREQ	REVOKED	MVR_PTS	CAR_AGE	URBANICITY
1	0	0	60	0	11	$67,349	No	$0	z_No	M	PhD	Professional	14	Private	$14,230	11	Minivan	yes	$4,461	2	No	3	18	Highly Urban/ Urban
2	0	0	43	0	11	$91,449	No	$257,252	z_No	M	z_High School	z_Blue Collar	22	Commercial	$14,940	1	Minivan	yes	$0	0	No	0	1	Highly Urban/ Urban
4	0	0	35	1	10	$16,039	No	$124,191	Yes	z_F	z_High School	Clerical	5	Private	$4,010	4	z_SUV	no	$38,690	2	No	3	10	Highly Urban/ Urban
5	0	0	51	0	14		No	$306,251	Yes	M	<High School	z_Blue Collar	32	Private	$15,440	7	Minivan	yes	$0	0	No	0	6	Highly Urban/ Urban
6	0	0	50	0	NA	$114,986	No	$243,925	Yes	z_F	PhD	Doctor	36	Private	$18,000	1	z_SUV	no	$19,217	2	Yes	3	17	Highly Urban/ Urban
7	1	2946	34	1	12	$125,301	Yes	$0	z_No	z_F	Bachelors	z_Blue Collar	46	Commercial	$17,430	1	Sports Car	no	$0	0	No	0	7	Highly Urban/ Urban

Checking Out The Dimensions, Desciptive Summary And Distributions:

As we can see that we have got all the columns that are mentioned in the introduction about data set. Let’s check out the dimension of the data set

dim(training)

## [1] 8161   26

As we can see that we have got 26 columns in total and 8161 observations. One of those columns in an index column and we usually do not need it for the analysis so lets remove that from our data set.

training <- training[-1]

Let’s quickly peek into the descriptive summary of our data set

knitr::kable(describe(training))

	vars	n	mean	sd	median	trimmed	mad	min	max	range	skew	kurtosis	se
TARGET_FLAG	1	8161	0.2638157	0.4407276	0	0.2047787	0.0000	0	1.0	1.0	1.0716614	-0.8516462	0.0048786
TARGET_AMT	2	8161	1504.3246481	4704.0269298	0	593.7121106	0.0000	0	107586.1	107586.1	8.7063034	112.2884386	52.0712628
KIDSDRIV	3	8161	0.1710575	0.5115341	0	0.0252719	0.0000	0	4.0	4.0	3.3518374	11.7801916	0.0056624
AGE	4	8155	44.7903127	8.6275895	45	44.8306513	8.8956	16	81.0	65.0	-0.0289889	-0.0617020	0.0955383
HOMEKIDS	5	8161	0.7212351	1.1163233	0	0.4971665	0.0000	0	5.0	5.0	1.3411271	0.6489915	0.0123571
YOJ	6	7707	10.4992864	4.0924742	11	11.0711853	2.9652	0	23.0	23.0	-1.2029676	1.1773410	0.0466169
INCOME*	7	8161	2875.5505453	2090.6786785	2817	2816.9534385	2799.1488	1	6613.0	6612.0	0.1094699	-1.2853032	23.1427840
PARENT1*	8	8161	1.1319691	0.3384779	1	1.0399755	0.0000	1	2.0	1.0	2.1743561	2.7281589	0.0037468
HOME_VAL*	9	8161	1684.8931503	1697.3791897	1245	1516.4994639	1842.8718	1	5107.0	5106.0	0.5162324	-1.1810965	18.7891522
MSTATUS*	10	8161	1.4003186	0.4899929	1	1.3754021	0.0000	1	2.0	1.0	0.4068189	-1.8347231	0.0054240
SEX*	11	8161	1.5360863	0.4987266	2	1.5451064	0.0000	1	2.0	1.0	-0.1446959	-1.9793056	0.0055207
EDUCATION*	12	8161	3.0906752	1.4448565	3	3.1133405	1.4826	1	5.0	4.0	0.1162654	-1.3799674	0.0159939
JOB*	13	8161	5.6871707	2.6818733	6	5.8145198	2.9652	1	9.0	8.0	-0.3067029	-1.2222635	0.0296870
TRAVTIME	14	8161	33.4857248	15.9083334	33	32.9954051	16.3086	5	142.0	137.0	0.4468174	0.6643331	0.1760974
CAR_USE*	15	8161	1.6288445	0.4831436	2	1.6610507	0.0000	1	2.0	1.0	-0.5332937	-1.7158080	0.0053482
BLUEBOOK*	16	8161	1283.6185516	893.5117428	1124	1259.5665492	1132.7064	1	2789.0	2788.0	0.2472837	-1.3624655	9.8907352
TIF	17	8161	5.3513050	4.1466353	4	4.8402512	4.4478	1	25.0	24.0	0.8908120	0.4224940	0.0459012
CAR_TYPE*	18	8161	3.5297145	1.9653570	3	3.5371420	2.9652	1	6.0	5.0	-0.0047181	-1.5165329	0.0217555
RED_CAR*	19	8161	1.2913859	0.4544287	1	1.2392403	0.0000	1	2.0	1.0	0.9180255	-1.1573709	0.0050303
OLDCLAIM*	20	8161	552.2714128	862.2006829	1	380.3196508	0.0000	1	2857.0	2856.0	1.3085876	0.2461666	9.5441372
CLM_FREQ	21	8161	0.7985541	1.1584527	0	0.5886047	0.0000	0	5.0	5.0	1.2087985	0.2842890	0.0128235
REVOKED*	22	8161	1.1225340	0.3279216	1	1.0281820	0.0000	1	2.0	1.0	2.3018899	3.2991013	0.0036299
MVR_PTS	23	8161	1.6955030	2.1471117	1	1.3138306	1.4826	0	13.0	13.0	1.3478403	1.3754900	0.0237675
CAR_AGE	24	7651	8.3283231	5.7007424	8	7.9632413	7.4130	-3	28.0	31.0	0.2819531	-0.7489756	0.0651737
URBANICITY*	25	8161	1.2045093	0.4033673	1	1.1306479	0.0000	1	2.0	1.0	1.4649406	0.1460688	0.0044651

Before moving on to the data preparation for our models lets check out the distribution of our continuous variable using histogram and categorical variables using bar plot

We can see that the data set is imbalance since our target variable for logistic regression TARGET_FLAG does not have equal number of positive and negative responses. Variables like PARENT1,REVOKED,RED_CAR and URBANICITY are also imbalance and might be not be a good predictor but we will further investigate. Similarly, apart from AGE other continuous variables are not normal distributed but in our first which which will be logistic regression so it is not going to be an issue since logistic regression does not assume that the continuous independent variables are normally distributed. Logistic regression is a type of regression analysis that is designed for predicting the probability of an event occurring, and it makes no assumptions about the distribution of the independent variables.

Before we go further ahead with data preparation we can quickly check out if our data set has any missing values and if Yes then which variables contains how many missing values.

knitr::kable(colSums(is.na(training)))

	x
TARGET_FLAG	0
TARGET_AMT	0
KIDSDRIV	0
AGE	6
HOMEKIDS	0
YOJ	454
INCOME	0
PARENT1	0
HOME_VAL	0
MSTATUS	0
SEX	0
EDUCATION	0
JOB	0
TRAVTIME	0
CAR_USE	0
BLUEBOOK	0
TIF	0
CAR_TYPE	0
RED_CAR	0
OLDCLAIM	0
CLM_FREQ	0
REVOKED	0
MVR_PTS	0
CAR_AGE	510
URBANICITY	0

As we can see that columns or variables like AGE,YOJ and CAR_AGE contains missing values which needs to be fixed but before jumping into fixing the mixing values let’s take a look at the structure of the data and see the data type of each column.

str(training)

## 'data.frame':    8161 obs. of  25 variables:
##  $ TARGET_FLAG: int  0 0 0 0 0 1 0 1 1 0 ...
##  $ TARGET_AMT : num  0 0 0 0 0 ...
##  $ KIDSDRIV   : int  0 0 0 0 0 0 0 1 0 0 ...
##  $ AGE        : int  60 43 35 51 50 34 54 37 34 50 ...
##  $ HOMEKIDS   : int  0 0 1 0 0 1 0 2 0 0 ...
##  $ YOJ        : int  11 11 10 14 NA 12 NA NA 10 7 ...
##  $ INCOME     : chr  "$67,349" "$91,449" "$16,039" "" ...
##  $ PARENT1    : chr  "No" "No" "No" "No" ...
##  $ HOME_VAL   : chr  "$0" "$257,252" "$124,191" "$306,251" ...
##  $ MSTATUS    : chr  "z_No" "z_No" "Yes" "Yes" ...
##  $ SEX        : chr  "M" "M" "z_F" "M" ...
##  $ EDUCATION  : chr  "PhD" "z_High School" "z_High School" "<High School" ...
##  $ JOB        : chr  "Professional" "z_Blue Collar" "Clerical" "z_Blue Collar" ...
##  $ TRAVTIME   : int  14 22 5 32 36 46 33 44 34 48 ...
##  $ CAR_USE    : chr  "Private" "Commercial" "Private" "Private" ...
##  $ BLUEBOOK   : chr  "$14,230" "$14,940" "$4,010" "$15,440" ...
##  $ TIF        : int  11 1 4 7 1 1 1 1 1 7 ...
##  $ CAR_TYPE   : chr  "Minivan" "Minivan" "z_SUV" "Minivan" ...
##  $ RED_CAR    : chr  "yes" "yes" "no" "yes" ...
##  $ OLDCLAIM   : chr  "$4,461" "$0" "$38,690" "$0" ...
##  $ CLM_FREQ   : int  2 0 2 0 2 0 0 1 0 0 ...
##  $ REVOKED    : chr  "No" "No" "No" "No" ...
##  $ MVR_PTS    : int  3 0 3 0 3 0 0 10 0 1 ...
##  $ CAR_AGE    : int  18 1 10 6 17 7 1 7 1 17 ...
##  $ URBANICITY : chr  "Highly Urban/ Urban" "Highly Urban/ Urban" "Highly Urban/ Urban" "Highly Urban/ Urban" ...

As we can see that a lot of data has number and ideally should be a numeric data type but because of symbols like $ and , so R read it as character data type so that needs to be addressed.

2. DATA PREPARATION

In this section we will prepare our data for logistic regression model. First we will convert those character data type variables into numeric which has numbers in them.

training$INCOME <- parse_number(training$INCOME)
training$HOME_VAL <- parse_number(training$HOME_VAL)
training$BLUEBOOK <- parse_number(training$BLUEBOOK)
training$OLDCLAIM <- parse_number(training$OLDCLAIM)

Now that we have changed those columns now we can set the data type for other character and convert them into factors which a much more acceptable data type when it comes to logistic regression.

training$PARENT1 <- as.factor(training$PARENT1)
training$MSTATUS <- as.factor(training$MSTATUS)
training$SEX <- as.factor(training$SEX)
training$EDUCATION <- as.factor(training$EDUCATION)
levels(training$EDUCATION) <- c('<High School','z_High School','Bachelors', 'Masters','PHD')
training$CAR_USE <- as.factor(training$CAR_USE)
training$CAR_TYPE <- as.factor(training$CAR_TYPE)
training$RED_CAR <- as.factor(training$RED_CAR)
training$REVOKED <- as.factor(training$REVOKED)
training$URBANICITY <- as.factor(training$URBANICITY)
training$JOB[training$JOB==""]<- NA
training$JOB <- as.factor(training$JOB)
training$TARGET_FLAG <- as.factor((training$TARGET_FLAG))

Now that we have change the data type of each variable let;s check the structure again using str() function.

str(training)

## 'data.frame':    8161 obs. of  25 variables:
##  $ TARGET_FLAG: Factor w/ 2 levels "0","1": 1 1 1 1 1 2 1 2 2 1 ...
##  $ TARGET_AMT : num  0 0 0 0 0 ...
##  $ KIDSDRIV   : int  0 0 0 0 0 0 0 1 0 0 ...
##  $ AGE        : int  60 43 35 51 50 34 54 37 34 50 ...
##  $ HOMEKIDS   : int  0 0 1 0 0 1 0 2 0 0 ...
##  $ YOJ        : int  11 11 10 14 NA 12 NA NA 10 7 ...
##  $ INCOME     : num  67349 91449 16039 NA 114986 ...
##  $ PARENT1    : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 2 1 1 1 1 ...
##  $ HOME_VAL   : num  0 257252 124191 306251 243925 ...
##  $ MSTATUS    : Factor w/ 2 levels "Yes","z_No": 2 2 1 1 1 2 1 1 2 2 ...
##  $ SEX        : Factor w/ 2 levels "M","z_F": 1 1 2 1 2 2 2 1 2 1 ...
##  $ EDUCATION  : Factor w/ 5 levels "<High School",..: 4 5 5 1 4 2 1 2 2 2 ...
##  $ JOB        : Factor w/ 8 levels "Clerical","Doctor",..: 6 8 1 8 2 8 8 8 1 6 ...
##  $ TRAVTIME   : int  14 22 5 32 36 46 33 44 34 48 ...
##  $ CAR_USE    : Factor w/ 2 levels "Commercial","Private": 2 1 2 2 2 1 2 1 2 1 ...
##  $ BLUEBOOK   : num  14230 14940 4010 15440 18000 ...
##  $ TIF        : int  11 1 4 7 1 1 1 1 1 7 ...
##  $ CAR_TYPE   : Factor w/ 6 levels "Minivan","Panel Truck",..: 1 1 6 1 6 4 6 5 6 5 ...
##  $ RED_CAR    : Factor w/ 2 levels "no","yes": 2 2 1 2 1 1 1 2 1 1 ...
##  $ OLDCLAIM   : num  4461 0 38690 0 19217 ...
##  $ CLM_FREQ   : int  2 0 2 0 2 0 0 1 0 0 ...
##  $ REVOKED    : Factor w/ 2 levels "No","Yes": 1 1 1 1 2 1 1 2 1 1 ...
##  $ MVR_PTS    : int  3 0 3 0 3 0 0 10 0 1 ...
##  $ CAR_AGE    : int  18 1 10 6 17 7 1 7 1 17 ...
##  $ URBANICITY : Factor w/ 2 levels "Highly Urban/ Urban",..: 1 1 1 1 1 1 1 1 1 2 ...

Now that our data type is in the right type so now we can goa head and re check the missing values and fix those

knitr::kable(colSums(is.na(training)))

	x
TARGET_FLAG	0
TARGET_AMT	0
KIDSDRIV	0
AGE	6
HOMEKIDS	0
YOJ	454
INCOME	445
PARENT1	0
HOME_VAL	464
MSTATUS	0
SEX	0
EDUCATION	0
JOB	526
TRAVTIME	0
CAR_USE	0
BLUEBOOK	0
TIF	0
CAR_TYPE	0
RED_CAR	0
OLDCLAIM	0
CLM_FREQ	0
REVOKED	0
MVR_PTS	0
CAR_AGE	510
URBANICITY	0

Now we can see in addition to previous columns or variables we have HOME_VAL with missing values too so in order to take care of missing values we will rely on mice package from R. Which have multiple techniques to take care of missing values. Since our data set has a mixture of continuous and categorical variables so we will consider a method that can handle both types and my personal pick would be to use random forest method to look at. Random forest can handle both data type plus it is an ensemble method which is a better approach to predict something.

set.seed(2)
training <- mice(training, m=5, maxit = 3, method = 'rf')

## 
##  iter imp variable
##   1   1  AGE  YOJ  INCOME  HOME_VAL  JOB  CAR_AGE
##   1   2  AGE  YOJ  INCOME  HOME_VAL  JOB  CAR_AGE
##   1   3  AGE  YOJ  INCOME  HOME_VAL  JOB  CAR_AGE
##   1   4  AGE  YOJ  INCOME  HOME_VAL  JOB  CAR_AGE
##   1   5  AGE  YOJ  INCOME  HOME_VAL  JOB  CAR_AGE
##   2   1  AGE  YOJ  INCOME  HOME_VAL  JOB  CAR_AGE
##   2   2  AGE  YOJ  INCOME  HOME_VAL  JOB  CAR_AGE
##   2   3  AGE  YOJ  INCOME  HOME_VAL  JOB  CAR_AGE
##   2   4  AGE  YOJ  INCOME  HOME_VAL  JOB  CAR_AGE
##   2   5  AGE  YOJ  INCOME  HOME_VAL  JOB  CAR_AGE
##   3   1  AGE  YOJ  INCOME  HOME_VAL  JOB  CAR_AGE
##   3   2  AGE  YOJ  INCOME  HOME_VAL  JOB  CAR_AGE
##   3   3  AGE  YOJ  INCOME  HOME_VAL  JOB  CAR_AGE
##   3   4  AGE  YOJ  INCOME  HOME_VAL  JOB  CAR_AGE
##   3   5  AGE  YOJ  INCOME  HOME_VAL  JOB  CAR_AGE

training <- complete(training)

Now we can check our data set for any missing values.

sum(is.na(training))

## [1] 0

Now our data set is ready to train a model so let’s go ahead with modeling section.

3. BUILDING AND SELECTING MODELS :

LOGISTIC REGRESSION:

BUILDING MODELS

Let’s Remove the TARGET_AMT column from out data set since that column contains the response for the cost of accidents so we are going to leave it out.

training_log <- training[-2]

Now let’s split the data into training and testing. We will split the data set training_log into partial_train and validation. Partial_train contains 85% of the data from training_log and the rest is in the validation that we will use for testing or evaluating the performance of our model.

set.seed(42)
split <- createDataPartition(training_log$TARGET_FLAG, p=.80, list=FALSE)
partial_train <- training_log[split, ]
validation <- training_log[ -split, ]

So now that our data has been split into two partitions now we can go ahead and create models. In our first model we will use all the variables in the data set with 10 fold cross validation and see how the model performs

MODEL 1:

m1 <- train(TARGET_FLAG ~., data = partial_train, 
              method = "glm", family = "binomial",
              trControl = trainControl(
                  method = "cv", number = 5,
                  savePredictions = TRUE),
              tuneLength = 5, 
              preProcess = c("center", "scale"))

Here is the summary of our first model

summary(m1)

## 
## Call:
## NULL
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3224  -0.7115  -0.4040   0.6390   3.1684  
## 
## Coefficients:
##                                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                       -1.418410   0.038613 -36.734  < 2e-16 ***
## KIDSDRIV                           0.201483   0.034698   5.807 6.37e-09 ***
## AGE                               -0.018978   0.038786  -0.489 0.624634    
## HOMEKIDS                           0.048120   0.046094   1.044 0.296507    
## YOJ                               -0.084769   0.038473  -2.203 0.027570 *  
## INCOME                            -0.106703   0.058747  -1.816 0.069322 .  
## PARENT1Yes                         0.145814   0.041551   3.509 0.000449 ***
## HOME_VAL                          -0.175370   0.049671  -3.531 0.000415 ***
## MSTATUSz_No                        0.217628   0.046882   4.642 3.45e-06 ***
## SEXz_F                            -0.005406   0.062485  -0.087 0.931051    
## `EDUCATIONz_High School`          -0.191901   0.057747  -3.323 0.000890 ***
## EDUCATIONBachelors                -0.125781   0.076051  -1.654 0.098148 .  
## EDUCATIONMasters                  -0.025369   0.066306  -0.383 0.702016    
## EDUCATIONPHD                      -0.008077   0.047956  -0.168 0.866250    
## JOBDoctor                         -0.182664   0.053863  -3.391 0.000696 ***
## `JOBHome Maker`                   -0.052669   0.043143  -1.221 0.222161    
## JOBLawyer                         -0.119903   0.059343  -2.020 0.043332 *  
## JOBManager                        -0.354018   0.056134  -6.307 2.85e-10 ***
## JOBProfessional                   -0.129952   0.049389  -2.631 0.008509 ** 
## JOBStudent                        -0.099372   0.041837  -2.375 0.017538 *  
## `JOBz_Blue Collar`                -0.088423   0.049560  -1.784 0.074400 .  
## TRAVTIME                           0.222401   0.033396   6.659 2.75e-11 ***
## CAR_USEPrivate                    -0.396063   0.048045  -8.244  < 2e-16 ***
## BLUEBOOK                          -0.197218   0.049663  -3.971 7.15e-05 ***
## TIF                               -0.234870   0.034013  -6.905 5.01e-12 ***
## `CAR_TYPEPanel Truck`              0.176667   0.049360   3.579 0.000345 ***
## CAR_TYPEPickup                     0.212538   0.042009   5.059 4.21e-07 ***
## `CAR_TYPESports Car`               0.293992   0.045466   6.466 1.01e-10 ***
## CAR_TYPEVan                        0.199900   0.040602   4.923 8.50e-07 ***
## CAR_TYPEz_SUV                      0.323606   0.055821   5.797 6.74e-09 ***
## RED_CARyes                         0.010064   0.044221   0.228 0.819976    
## OLDCLAIM                          -0.139755   0.038532  -3.627 0.000287 ***
## CLM_FREQ                           0.237345   0.036995   6.416 1.40e-10 ***
## REVOKEDYes                         0.288161   0.033696   8.552  < 2e-16 ***
## MVR_PTS                            0.229272   0.032802   6.990 2.76e-12 ***
## CAR_AGE                            0.002160   0.047560   0.045 0.963768    
## `URBANICITYz_Highly Rural/ Rural` -0.942472   0.049807 -18.923  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 7536.3  on 6529  degrees of freedom
## Residual deviance: 5861.8  on 6493  degrees of freedom
## AIC: 5935.8
## 
## Number of Fisher Scoring iterations: 5

Let’s remove variables with higher P-values to create more models.

MODEL 2:

m2 <- train(TARGET_FLAG ~ KIDSDRIV + 
                  PARENT1 + HOME_VAL + MSTATUS + INCOME + 
                  TRAVTIME + CAR_USE + BLUEBOOK + TIF + CAR_TYPE  + OLDCLAIM + CLM_FREQ + REVOKED + MVR_PTS + URBANICITY, 
              data = partial_train, 
              method = "glm", family = "binomial",
              trControl = trainControl(
                  method = "cv", number = 10,
                  savePredictions = TRUE),
              tuneLength = 5, 
              preProcess = c("center", "scale"))

let’s look at the Variance inflation factor(VIF) of each predictor and remove the one with highest VIF.

knitr::kable(vif(m2$finalModel))

	x
KIDSDRIV	6.262574
PARENT1Yes	8.015029
HOME_VAL	15.181384
MSTATUSz_No	12.546313
INCOME	14.450041
TRAVTIME	7.070191
CAR_USEPrivate	8.999717
BLUEBOOK	12.693090
TIF	7.382963
`CAR_TYPEPanel Truck`	12.010741
CAR_TYPEPickup	10.781791
`CAR_TYPESports Car`	8.852750
CAR_TYPEVan	9.488330
CAR_TYPEz_SUV	11.723275
OLDCLAIM	9.384444
CLM_FREQ	8.679044
REVOKEDYes	7.163466
MVR_PTS	6.822545
`URBANICITYz_Highly Rural/ Rural`	16.025249

Everything in the summary for model 2 above looks statistically significant but for our third model let’s remove the predictors with high values for VIF.

MODEL 3:

m3 <- train(TARGET_FLAG ~ KIDSDRIV + 
                  PARENT1  + MSTATUS + INCOME + 
                  TRAVTIME + CAR_USE  + TIF   + OLDCLAIM + CLM_FREQ + REVOKED + MVR_PTS + URBANICITY,  
              data = partial_train, 
              method = "glm", family = "binomial",
              trControl = trainControl(
                  method = "cv", number = 10,
                  savePredictions = TRUE),
              tuneLength = 5, 
              preProcess = c("center", "scale"))

summary(m3)

## 
## Call:
## NULL
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2890  -0.7488  -0.4450   0.7090   2.8708  
## 
## Coefficients:
##                                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                       -1.35587    0.03712 -36.529  < 2e-16 ***
## KIDSDRIV                           0.21169    0.03053   6.935 4.06e-12 ***
## PARENT1Yes                         0.19018    0.03450   5.513 3.53e-08 ***
## MSTATUSz_No                        0.25858    0.03661   7.063 1.64e-12 ***
## INCOME                            -0.54003    0.03734 -14.462  < 2e-16 ***
## TRAVTIME                           0.21942    0.03240   6.772 1.27e-11 ***
## CAR_USEPrivate                    -0.39855    0.03140 -12.693  < 2e-16 ***
## TIF                               -0.22456    0.03318  -6.768 1.31e-11 ***
## OLDCLAIM                          -0.13003    0.03739  -3.478 0.000505 ***
## CLM_FREQ                           0.23254    0.03599   6.462 1.03e-10 ***
## REVOKEDYes                         0.29147    0.03260   8.942  < 2e-16 ***
## MVR_PTS                            0.25216    0.03194   7.894 2.93e-15 ***
## `URBANICITYz_Highly Rural/ Rural` -0.86220    0.04903 -17.585  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 7536.3  on 6529  degrees of freedom
## Residual deviance: 6117.1  on 6517  degrees of freedom
## AIC: 6143.1
## 
## Number of Fisher Scoring iterations: 5

Again we can see that everything statistically significant in the model 3 so let;s move on to the selection models based on classification model metrics

SELECTING MODEL:

In this section we will look at different metrics like confusion matrix, specificity, sensitivity, F1 score, precision and AUC ROC and decide which model is the optimal one for the prediction. First we will look at the confusion matrix using four fold plot.

pred1 <- predict(m1, newdata = validation)
pred2 <- predict(m2, newdata = validation)
pred3 <- predict(m3, newdata = validation)
m1cM <- confusionMatrix(pred1, validation$TARGET_FLAG, 
                        mode = "everything")
m2cM <- confusionMatrix(pred2, validation$TARGET_FLAG, 
                        mode = "everything")
m3cM <- confusionMatrix(pred3, validation$TARGET_FLAG, 
                        mode = "everything")
par(mfrow=c(1,3))
fourfoldplot(m1cM$table, color = c("#B22222", "#2E8B57"), main="Mod1")
fourfoldplot(m2cM$table, color = c("#B22222", "#2E8B57"), main="Mod2")
fourfoldplot(m3cM$table, color = c("#B22222", "#2E8B57"), main="Mod3")

As we can see that Model 2 has accuracy around 81% which is not bad considering it has very fewer predictor than Model 1. It has relatively higher accuracy than Model 3 which is around 77%. We can also look at other metrics before we decide on Model 2.

eval <- data.frame(m1cM$byClass, 
                   m2cM$byClass, 
                   m3cM$byClass)
eval <- data.frame(t(eval))

eval <- dplyr::select(eval, Sensitivity, Specificity, Precision, Recall, F1)
row.names(eval) <- c("Model 1", "Model 2", "Model 3")
knitr::kable(eval)

	Sensitivity	Specificity	Precision	Recall	F1
Model 1	0.9258951	0.4279070	0.8188513	0.9258951	0.8690895
Model 2	0.9308909	0.4162791	0.8166545	0.9308909	0.8700389
Model 3	0.9333888	0.3697674	0.8053161	0.9333888	0.8646356

Based on the metrics above again we will go with Model 2 since it has the highest F1 score and the reason why we are making F1 score as basis for this data set is that the output or the response variable is imbalance and whenever one has an imbalance response variable it makes sense to make decision based on F1 score. We can also check out the ROC curve. Let’s create a function for our ROC curve.

getROC <- function(model) {
    name <- deparse(substitute(model))
    pred.prob1 <- predict(model, newdata = validation, type="prob")
    p1 <- data.frame(pred = validation$TARGET_FLAG, prob = pred.prob1[[1]])
    p1 <- p1[order(p1$prob),]
    rocobj <- roc(p1$pred, p1$prob)
    plot(rocobj, asp=NA, legacy.axes = TRUE, print.auc=TRUE,
         xlab="Specificity", main = name)
}

par(mfrow = c(1,3))
getROC(m1)
getROC(m2)
getROC(m3)

Even though Model 2 does have a higher AUC as compared to Model 1 but we have to bear in mind that it uses fewer predictors which adds up while computing and it has higher accuracy and F1 score. So we will go with Model 2

MULTIPLE LINEAR REGRESSION:

BUILDING MODELS:

Let’s Remove the TARGET_FLAG column from out data set since that column contains the response for the events of accidents so we are going to leave it out but before that we have fix our data type to fit into linear regression model. So let’s fix our data type of column that has factors as data type and change into numeric by using label coding as shown below.

training$PARENT1 <- as.numeric(as.factor(training$PARENT1))
training$MSTATUS <- as.numeric(as.factor(training$MSTATUS))
training$SEX <- as.numeric(as.factor(training$SEX))
training$EDUCATION <- as.numeric(as.factor(training$EDUCATION))
levels(training$EDUCATION) <- c('<High School','z_High School','Bachelors', 'Masters','PHD')
training$CAR_USE <- as.numeric(as.factor(training$CAR_USE))
training$CAR_TYPE <- as.numeric(as.factor(training$CAR_TYPE))
training$RED_CAR <- as.factor(training$RED_CAR)
training$REVOKED <- as.factor(training$REVOKED)
training$URBANICITY <- as.factor(training$URBANICITY)
training$JOB[training$JOB==""]<- NA
training$JOB <- as.factor(training$JOB)

Now let’s remove the unwanted column(s)

training_lin <- training[-1]

Before moving on with our model let’s check out the distribution for response variable in multiple linear regression

hist(training_lin$TARGET_AMT)

Seems like the distribution is very favorable for linear regression and we would have to take some kind of transformation. Let’s tro log transformation.

logdata <- log1p(training_lin$TARGET_AMT)

Now checking the distribution again

hist(logdata)

It looks much better now with a huge outlier sitting at 0 since the data set is hugely imbalance and there is nothing that could be done to cure that imbalance. Now let’s split the data into training and testing. We will split the data set training_log into partial_train and validation. Partial_train contains 85% of the data from training_log and the rest is in the validation that we will use for testing or evaluating the performance of our model.

set.seed(42)
split <- createDataPartition(training_lin$TARGET_AMT, p=.80, list=FALSE)
partial_train <- training_lin[split, ]
validation <- training_lin[ -split, ]

We can feed the data to our model. Again just like logistic regression we will use 10 fold cross validation.

MODEL 1:

lm1 <- train(log1p(TARGET_AMT) ~ ., data = partial_train, 
              method = "lm", 
              trControl = trainControl(
                  method = "cv", number = 10,
                  savePredictions = TRUE),
              tuneLength = 5, 
              preProcess = c("center", "scale"))

Let’s check out the summary of our model

summary(lm1)

## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.5298 -2.3433 -0.8758  2.1333 10.3663 
## 
## Coefficients:
##                                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                        2.181752   0.040198  54.275  < 2e-16 ***
## KIDSDRIV                           0.249425   0.045865   5.438 5.58e-08 ***
## AGE                               -0.002362   0.048227  -0.049 0.960937    
## HOMEKIDS                           0.064809   0.057767   1.122 0.261943    
## YOJ                               -0.114792   0.048560  -2.364 0.018111 *  
## INCOME                            -0.149950   0.068401  -2.192 0.028398 *  
## PARENT1                            0.212613   0.054489   3.902 9.64e-05 ***
## HOME_VAL                          -0.207018   0.062443  -3.315 0.000920 ***
## MSTATUS                            0.256638   0.058121   4.416 1.02e-05 ***
## SEX                               -0.022231   0.061084  -0.364 0.715909    
## EDUCATION                          0.074819   0.041586   1.799 0.072042 .  
## JOBDoctor                         -0.218108   0.050893  -4.286 1.85e-05 ***
## `JOBHome Maker`                   -0.091788   0.052598  -1.745 0.081016 .  
## JOBLawyer                         -0.200753   0.059296  -3.386 0.000714 ***
## JOBManager                        -0.497817   0.063458  -7.845 5.04e-15 ***
## JOBProfessional                   -0.208254   0.056926  -3.658 0.000256 ***
## JOBStudent                        -0.092906   0.053908  -1.723 0.084860 .  
## `JOBz_Blue Collar`                -0.088332   0.061956  -1.426 0.153994    
## TRAVTIME                           0.251143   0.040974   6.129 9.34e-10 ***
## CAR_USE                           -0.507531   0.053316  -9.519  < 2e-16 ***
## BLUEBOOK                          -0.155611   0.048107  -3.235 0.001224 ** 
## TIF                               -0.278680   0.040373  -6.903 5.59e-12 ***
## CAR_TYPE                           0.303885   0.047908   6.343 2.40e-10 ***
## RED_CARyes                        -0.026205   0.053960  -0.486 0.627249    
## OLDCLAIM                          -0.158343   0.052185  -3.034 0.002421 ** 
## CLM_FREQ                           0.295897   0.050798   5.825 5.98e-09 ***
## REVOKEDYes                         0.413593   0.045617   9.067  < 2e-16 ***
## MVR_PTS                            0.381875   0.044387   8.603  < 2e-16 ***
## CAR_AGE                           -0.108457   0.050032  -2.168 0.030214 *  
## `URBANICITYz_Highly Rural/ Rural` -1.006789   0.044912 -22.417  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.248 on 6499 degrees of freedom
## Multiple R-squared:  0.2218, Adjusted R-squared:  0.2183 
## F-statistic: 63.87 on 29 and 6499 DF,  p-value: < 2.2e-16

As we can see that a lot predictors are very insignificant in our model so let’s get rid of those predictors

MODEL 2:

lm2 <- train(log1p(TARGET_AMT) ~ KIDSDRIV+ MSTATUS+ CAR_TYPE+ JOB+ HOME_VAL+PARENT1+ BLUEBOOK  + CLM_FREQ + REVOKED + OLDCLAIM, data = partial_train, 
              method = "lm", 
              trControl = trainControl(
                  method = "cv", number = 10,
                  savePredictions = TRUE),
              tuneLength = 5, 
              preProcess = c("center", "scale"))

Let’s check out the multi co-linearity among model’s predictors.

knitr::kable(vif(lm2$finalModel))

	x
KIDSDRIV	1.068806
MSTATUS	1.758464
CAR_TYPE	1.072852
JOBDoctor	1.263166
`JOBHome Maker`	1.402136
JOBLawyer	1.591487
JOBManager	1.890681
JOBProfessional	1.770313
JOBStudent	1.514315
`JOBz_Blue Collar`	1.929418
HOME_VAL	1.815799
PARENT1	1.392560
BLUEBOOK	1.205615
CLM_FREQ	1.391465
REVOKEDYes	1.276010
OLDCLAIM	1.672804

And significance of the predictors:

summary(lm2)

## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -6.523 -2.287 -1.312  2.667 10.013 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         2.18175    0.04252  51.307  < 2e-16 ***
## KIDSDRIV            0.24495    0.04397   5.571 2.63e-08 ***
## MSTATUS             0.19925    0.05639   3.533 0.000413 ***
## CAR_TYPE            0.27216    0.04405   6.179 6.85e-10 ***
## JOBDoctor          -0.16381    0.04780  -3.427 0.000614 ***
## `JOBHome Maker`    -0.06114    0.05036  -1.214 0.224756    
## JOBLawyer          -0.14842    0.05365  -2.766 0.005684 ** 
## JOBManager         -0.25140    0.05848  -4.299 1.74e-05 ***
## JOBProfessional    -0.08194    0.05658  -1.448 0.147611    
## JOBStudent          0.08649    0.05233   1.653 0.098437 .  
## `JOBz_Blue Collar`  0.25191    0.05907   4.265 2.03e-05 ***
## HOME_VAL           -0.27466    0.05731  -4.793 1.68e-06 ***
## PARENT1             0.25765    0.05018   5.134 2.92e-07 ***
## BLUEBOOK           -0.08655    0.04669  -1.854 0.063850 .  
## CLM_FREQ            0.71057    0.05016  14.165  < 2e-16 ***
## REVOKEDYes          0.51783    0.04804  10.779  < 2e-16 ***
## OLDCLAIM           -0.13009    0.05500  -2.365 0.018054 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.436 on 6512 degrees of freedom
## Multiple R-squared:  0.1274, Adjusted R-squared:  0.1253 
## F-statistic: 59.44 on 16 and 6512 DF,  p-value: < 2.2e-16

SELECTING MODEL:

df <- data.frame()
df <- rbind(df, lm1$results)
df <- rbind(df, lm2$results)

df$intercept <- c("Mod1", "Mod2")
colnames(df)[1] <- "model"
knitr::kable(df)

model	RMSE	Rsquared	MAE	RMSESD	RsquaredSD	MAESD
Mod1	3.255671	0.2155983	2.640527	0.0533824	0.0273154	0.0583600
Mod2	3.440147	0.1246543	2.821055	0.0584894	0.0267984	0.0326541

Even though model 1 which is lm1 has a lot of predictors we will still choose that because we get a much higher adjusted $R^2$. Similarly MSE and MAE are also low for our model 1

4. EVALUATION

In this section we make predictions using the evaluation data provided. Before w make prediction we have to work on changing the data type and deal with missing values in any in our evaluation data set. Before any further due let’s upload the data set in our environment.

testing <- read_csv("https://raw.githubusercontent.com/Umerfarooq122/predicting-the-probability-that-a-person-will-crash-their-car-and-also-the-amount-cost/main/insurance-evaluation-data.csv")

knitr::kable(head(testing))

INDEX	TARGET_FLAG	TARGET_AMT	KIDSDRIV	AGE	HOMEKIDS	YOJ	INCOME	PARENT1	HOME_VAL	MSTATUS	SEX	EDUCATION	JOB	TRAVTIME	CAR_USE	BLUEBOOK	TIF	CAR_TYPE	RED_CAR	OLDCLAIM	CLM_FREQ	REVOKED	MVR_PTS	CAR_AGE	URBANICITY
3	NA	NA	0	48	0	11	$52,881	No	$0	z_No	M	Bachelors	Manager	26	Private	$21,970	1	Van	yes	$0	0	No	2	10	Highly Urban/ Urban
9	NA	NA	1	40	1	11	$50,815	Yes	$0	z_No	M	z_High School	Manager	21	Private	$18,930	6	Minivan	no	$3,295	1	No	2	1	Highly Urban/ Urban
10	NA	NA	0	44	2	12	$43,486	Yes	$0	z_No	z_F	z_High School	z_Blue Collar	30	Commercial	$5,900	10	z_SUV	no	$0	0	No	0	10	z_Highly Rural/ Rural
18	NA	NA	0	35	2	NA	$21,204	Yes	$0	z_No	M	z_High School	Clerical	74	Private	$9,230	6	Pickup	no	$0	0	Yes	0	4	z_Highly Rural/ Rural
21	NA	NA	0	59	0	12	$87,460	No	$0	z_No	M	z_High School	Manager	45	Private	$15,420	1	Minivan	yes	$44,857	2	No	4	1	Highly Urban/ Urban
30	NA	NA	0	46	0	14	NA	No	$207,519	Yes	M	Bachelors	Professional	7	Commercial	$25,660	1	Panel Truck	no	$2,119	1	No	2	12	Highly Urban/ Urban

testing <- testing[,-(1:3)]

str(testing)

## tibble [2,141 × 23] (S3: tbl_df/tbl/data.frame)
##  $ KIDSDRIV  : num [1:2141] 0 1 0 0 0 0 0 0 2 0 ...
##  $ AGE       : num [1:2141] 48 40 44 35 59 46 60 54 36 50 ...
##  $ HOMEKIDS  : num [1:2141] 0 1 2 2 0 0 0 0 2 0 ...
##  $ YOJ       : num [1:2141] 11 11 12 NA 12 14 12 12 12 8 ...
##  $ INCOME    : chr [1:2141] "$52,881" "$50,815" "$43,486" "$21,204" ...
##  $ PARENT1   : chr [1:2141] "No" "Yes" "Yes" "Yes" ...
##  $ HOME_VAL  : chr [1:2141] "$0" "$0" "$0" "$0" ...
##  $ MSTATUS   : chr [1:2141] "z_No" "z_No" "z_No" "z_No" ...
##  $ SEX       : chr [1:2141] "M" "M" "z_F" "M" ...
##  $ EDUCATION : chr [1:2141] "Bachelors" "z_High School" "z_High School" "z_High School" ...
##  $ JOB       : chr [1:2141] "Manager" "Manager" "z_Blue Collar" "Clerical" ...
##  $ TRAVTIME  : num [1:2141] 26 21 30 74 45 7 16 27 5 22 ...
##  $ CAR_USE   : chr [1:2141] "Private" "Private" "Commercial" "Private" ...
##  $ BLUEBOOK  : chr [1:2141] "$21,970" "$18,930" "$5,900" "$9,230" ...
##  $ TIF       : num [1:2141] 1 6 10 6 1 1 1 4 4 4 ...
##  $ CAR_TYPE  : chr [1:2141] "Van" "Minivan" "z_SUV" "Pickup" ...
##  $ RED_CAR   : chr [1:2141] "yes" "no" "no" "no" ...
##  $ OLDCLAIM  : chr [1:2141] "$0" "$3,295" "$0" "$0" ...
##  $ CLM_FREQ  : num [1:2141] 0 1 0 0 2 1 0 0 0 0 ...
##  $ REVOKED   : chr [1:2141] "No" "No" "No" "Yes" ...
##  $ MVR_PTS   : num [1:2141] 2 2 0 0 4 2 0 5 0 3 ...
##  $ CAR_AGE   : num [1:2141] 10 1 10 4 1 12 1 NA 9 1 ...
##  $ URBANICITY: chr [1:2141] "Highly Urban/ Urban" "Highly Urban/ Urban" "z_Highly Rural/ Rural" "z_Highly Rural/ Rural" ...

testing$INCOME <- parse_number(testing$INCOME)
testing$HOME_VAL <- parse_number(testing$HOME_VAL)
testing$BLUEBOOK <- parse_number(testing$BLUEBOOK)
testing$OLDCLAIM <- parse_number(testing$OLDCLAIM)

Now that we have changed those columns now we can set the data type for other character and convert them into factors which a much more acceptable data type when it comes to logistic regression.

testing$PARENT1 <- as.factor(testing$PARENT1)
testing$MSTATUS <- as.factor(testing$MSTATUS)
testing$SEX <- as.factor(testing$SEX)
testing$EDUCATION <- as.factor(testing$EDUCATION)
levels(testing$EDUCATION) <- c('<High School','z_High School','Bachelors', 'Masters','PHD')
testing$CAR_USE <- as.factor(testing$CAR_USE)
testing$CAR_TYPE <- as.factor(testing$CAR_TYPE)
testing$RED_CAR <- as.factor(testing$RED_CAR)
testing$REVOKED <- as.factor(testing$REVOKED)
testing$URBANICITY <- as.factor(testing$URBANICITY)
testing$JOB[testing$JOB==""]<- NA
testing$JOB <- as.factor(testing$JOB)

colSums(is.na(testing))

##   KIDSDRIV        AGE   HOMEKIDS        YOJ     INCOME    PARENT1   HOME_VAL 
##          0          1          0         94        125          0        111 
##    MSTATUS        SEX  EDUCATION        JOB   TRAVTIME    CAR_USE   BLUEBOOK 
##          0          0          0        139          0          0          0 
##        TIF   CAR_TYPE    RED_CAR   OLDCLAIM   CLM_FREQ    REVOKED    MVR_PTS 
##          0          0          0          0          0          0          0 
##    CAR_AGE URBANICITY 
##        129          0

set.seed(2)
testing <- mice(testing, m=5, maxit = 3, method = 'rf')

## 
##  iter imp variable
##   1   1  AGE  YOJ  INCOME  HOME_VAL  JOB  CAR_AGE
##   1   2  AGE  YOJ  INCOME  HOME_VAL  JOB  CAR_AGE
##   1   3  AGE  YOJ  INCOME  HOME_VAL  JOB  CAR_AGE
##   1   4  AGE  YOJ  INCOME  HOME_VAL  JOB  CAR_AGE
##   1   5  AGE  YOJ  INCOME  HOME_VAL  JOB  CAR_AGE
##   2   1  AGE  YOJ  INCOME  HOME_VAL  JOB  CAR_AGE
##   2   2  AGE  YOJ  INCOME  HOME_VAL  JOB  CAR_AGE
##   2   3  AGE  YOJ  INCOME  HOME_VAL  JOB  CAR_AGE
##   2   4  AGE  YOJ  INCOME  HOME_VAL  JOB  CAR_AGE
##   2   5  AGE  YOJ  INCOME  HOME_VAL  JOB  CAR_AGE
##   3   1  AGE  YOJ  INCOME  HOME_VAL  JOB  CAR_AGE
##   3   2  AGE  YOJ  INCOME  HOME_VAL  JOB  CAR_AGE
##   3   3  AGE  YOJ  INCOME  HOME_VAL  JOB  CAR_AGE
##   3   4  AGE  YOJ  INCOME  HOME_VAL  JOB  CAR_AGE
##   3   5  AGE  YOJ  INCOME  HOME_VAL  JOB  CAR_AGE

testing <- complete(testing)

LOGISTIC REGRESSION:

logpred <- predict(m2, testing)
logpred_prob <- predict(m2, testing, type = "prob")
log_df <- cbind(logpred_prob, TARGET_FLAG = logpred)

log_df <- log_df%>%
  rename(prob=2)

MUTIPLE LINEAR REGRESSION:

#fixing data type
testing$PARENT1 <- as.numeric(as.factor(testing$PARENT1))
testing$MSTATUS <- as.numeric(as.factor(testing$MSTATUS))
testing$CAR_TYPE <- as.numeric(as.factor(testing$CAR_TYPE))
testing$SEX <- as.numeric(as.factor(testing$SEX))
testing$EDUCATION <- as.numeric(as.factor(testing$EDUCATION))
testing$CAR_USE <- as.numeric(as.factor(testing$CAR_USE))

# predicting
amountpred <- exp(predict(lm1, testing))

final <- cbind(log_df, TARGET_AMT = amountpred)

knitr::kable(head(final))

0	prob	TARGET_AMT
0.7216442	0.2783558	5.948811
0.6772316	0.3227684	13.060923
0.8646774	0.1353226	7.223364
0.7570752	0.2429248	22.487631
0.8098321	0.1901679	6.287991
0.7897810	0.2102190	7.110726

Now to confirm that our model is actually predicting higher costs if the probability of accidents increases so we can plot them against each to confirm our models

ggplot(data = final,  mapping = aes(x= prob, y = TARGET_AMT)) +
  geom_point(color = "blue", size = 3)+labs(x="Predicted Probability of Accident Happening", y="Cost")+theme_bw()

5. CONCLUSION:

In this study we were dealing with an imbalance dataset and our job was to find out what contributes towards car crash and how much would it cost for the repair. The Data set provided had 26 variables and around 8k observations. We started our analysis with exploring the data set followed by cleaning and preparing the data set for training our models. We had to come up with two models i.e. one for classification and one for regression. For classification we use a logistic regression with 10 fold cross validation and trained three models. Since it was imbalance data set so the model with highest F1 score was chosen for the final predictions. For predicting the cost of an accident we formulated two regression models and pick the best one based one Mean absolute error and Mean Squared Error. Our predictions were not that accurate when it came to the cost of the accidents but our model predicted higher TARGET_AMT every time we had higher predicted probability of accident