Predicting The Probability Of A Car Crash And Its Cost
INTRODUCTION:
In this study, we will explore, analyze, and model a data set containing approximately 8,000 records, each representing a customer of an auto insurance company. Each record has two response variables. The first, TARGET_FLAG, is a 1 or a 0: a “1” means that the person was in a car crash, and a “0” means that they were not. The second response variable is TARGET_AMT. This value is zero if the person did not crash their car; if they did, it is a value greater than zero.
Our objective is to build multiple linear regression and binary logistic regression models on the training data to predict the probability that a person will crash their car, and the amount of money it will cost if they do. We will use only the variables provided (or variables derived from them).
1. DATA EXPLORATION:
In this section we load and explore the training data set. We familiarize ourselves with the different variables, i.e. the dependent and independent variables, and check their distributions. The problem at hand concerns car crashes, insurance, and the associated costs, which suggests we will be dealing with many variables, since these kinds of problems depend on multiple factors. So without further ado, let’s begin by loading the data set.
Loading The Training Dataset:
The code chunk below loads the data set that we will use to train our model.
training <- read.csv("https://raw.githubusercontent.com/Umerfarooq122/predicting-the-probability-that-a-person-will-crash-their-car-and-also-the-amount-cost/main/insurance_training_data.csv")
Let’s display the first few rows of the data set to check that everything has been loaded into our work environment correctly:
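The display itself comes from a hidden chunk; a minimal sketch of the call (an assumption):
knitr::kable(head(training))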
| INDEX | TARGET_FLAG | TARGET_AMT | KIDSDRIV | AGE | HOMEKIDS | YOJ | INCOME | PARENT1 | HOME_VAL | MSTATUS | SEX | EDUCATION | JOB | TRAVTIME | CAR_USE | BLUEBOOK | TIF | CAR_TYPE | RED_CAR | OLDCLAIM | CLM_FREQ | REVOKED | MVR_PTS | CAR_AGE | URBANICITY |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 0 | 0 | 60 | 0 | 11 | $67,349 | No | $0 | z_No | M | PhD | Professional | 14 | Private | $14,230 | 11 | Minivan | yes | $4,461 | 2 | No | 3 | 18 | Highly Urban/ Urban |
| 2 | 0 | 0 | 0 | 43 | 0 | 11 | $91,449 | No | $257,252 | z_No | M | z_High School | z_Blue Collar | 22 | Commercial | $14,940 | 1 | Minivan | yes | $0 | 0 | No | 0 | 1 | Highly Urban/ Urban |
| 4 | 0 | 0 | 0 | 35 | 1 | 10 | $16,039 | No | $124,191 | Yes | z_F | z_High School | Clerical | 5 | Private | $4,010 | 4 | z_SUV | no | $38,690 | 2 | No | 3 | 10 | Highly Urban/ Urban |
| 5 | 0 | 0 | 0 | 51 | 0 | 14 | | No | $306,251 | Yes | M | <High School | z_Blue Collar | 32 | Private | $15,440 | 7 | Minivan | yes | $0 | 0 | No | 0 | 6 | Highly Urban/ Urban |
| 6 | 0 | 0 | 0 | 50 | 0 | NA | $114,986 | No | $243,925 | Yes | z_F | PhD | Doctor | 36 | Private | $18,000 | 1 | z_SUV | no | $19,217 | 2 | Yes | 3 | 17 | Highly Urban/ Urban |
| 7 | 1 | 2946 | 0 | 34 | 1 | 12 | $125,301 | Yes | $0 | z_No | z_F | Bachelors | z_Blue Collar | 46 | Commercial | $17,430 | 1 | Sports Car | no | $0 | 0 | No | 0 | 7 | Highly Urban/ Urban |
Checking Out The Dimensions, Descriptive Summary And Distributions:
We have all the columns mentioned in the introduction. Let’s check the dimensions of the data set:
## [1] 8161 26
We have 26 columns in total and 8161 observations. One of those columns is an index column, which we do not need for the analysis, so let’s remove it from our data set, as sketched below.
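The removal chunk is hidden in the original; a minimal sketch:
# Drop the INDEX column; it carries no predictive information
training$INDEX <- NULL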
Let’s quickly peek at the descriptive summary of our data set:
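The table below matches the output of psych::describe() (character columns are marked with an asterisk); a sketch of the hidden call (an assumption):
library(psych)
knitr::kable(describe(training))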
| | vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| TARGET_FLAG | 1 | 8161 | 0.2638157 | 0.4407276 | 0 | 0.2047787 | 0.0000 | 0 | 1.0 | 1.0 | 1.0716614 | -0.8516462 | 0.0048786 |
| TARGET_AMT | 2 | 8161 | 1504.3246481 | 4704.0269298 | 0 | 593.7121106 | 0.0000 | 0 | 107586.1 | 107586.1 | 8.7063034 | 112.2884386 | 52.0712628 |
| KIDSDRIV | 3 | 8161 | 0.1710575 | 0.5115341 | 0 | 0.0252719 | 0.0000 | 0 | 4.0 | 4.0 | 3.3518374 | 11.7801916 | 0.0056624 |
| AGE | 4 | 8155 | 44.7903127 | 8.6275895 | 45 | 44.8306513 | 8.8956 | 16 | 81.0 | 65.0 | -0.0289889 | -0.0617020 | 0.0955383 |
| HOMEKIDS | 5 | 8161 | 0.7212351 | 1.1163233 | 0 | 0.4971665 | 0.0000 | 0 | 5.0 | 5.0 | 1.3411271 | 0.6489915 | 0.0123571 |
| YOJ | 6 | 7707 | 10.4992864 | 4.0924742 | 11 | 11.0711853 | 2.9652 | 0 | 23.0 | 23.0 | -1.2029676 | 1.1773410 | 0.0466169 |
| INCOME* | 7 | 8161 | 2875.5505453 | 2090.6786785 | 2817 | 2816.9534385 | 2799.1488 | 1 | 6613.0 | 6612.0 | 0.1094699 | -1.2853032 | 23.1427840 |
| PARENT1* | 8 | 8161 | 1.1319691 | 0.3384779 | 1 | 1.0399755 | 0.0000 | 1 | 2.0 | 1.0 | 2.1743561 | 2.7281589 | 0.0037468 |
| HOME_VAL* | 9 | 8161 | 1684.8931503 | 1697.3791897 | 1245 | 1516.4994639 | 1842.8718 | 1 | 5107.0 | 5106.0 | 0.5162324 | -1.1810965 | 18.7891522 |
| MSTATUS* | 10 | 8161 | 1.4003186 | 0.4899929 | 1 | 1.3754021 | 0.0000 | 1 | 2.0 | 1.0 | 0.4068189 | -1.8347231 | 0.0054240 |
| SEX* | 11 | 8161 | 1.5360863 | 0.4987266 | 2 | 1.5451064 | 0.0000 | 1 | 2.0 | 1.0 | -0.1446959 | -1.9793056 | 0.0055207 |
| EDUCATION* | 12 | 8161 | 3.0906752 | 1.4448565 | 3 | 3.1133405 | 1.4826 | 1 | 5.0 | 4.0 | 0.1162654 | -1.3799674 | 0.0159939 |
| JOB* | 13 | 8161 | 5.6871707 | 2.6818733 | 6 | 5.8145198 | 2.9652 | 1 | 9.0 | 8.0 | -0.3067029 | -1.2222635 | 0.0296870 |
| TRAVTIME | 14 | 8161 | 33.4857248 | 15.9083334 | 33 | 32.9954051 | 16.3086 | 5 | 142.0 | 137.0 | 0.4468174 | 0.6643331 | 0.1760974 |
| CAR_USE* | 15 | 8161 | 1.6288445 | 0.4831436 | 2 | 1.6610507 | 0.0000 | 1 | 2.0 | 1.0 | -0.5332937 | -1.7158080 | 0.0053482 |
| BLUEBOOK* | 16 | 8161 | 1283.6185516 | 893.5117428 | 1124 | 1259.5665492 | 1132.7064 | 1 | 2789.0 | 2788.0 | 0.2472837 | -1.3624655 | 9.8907352 |
| TIF | 17 | 8161 | 5.3513050 | 4.1466353 | 4 | 4.8402512 | 4.4478 | 1 | 25.0 | 24.0 | 0.8908120 | 0.4224940 | 0.0459012 |
| CAR_TYPE* | 18 | 8161 | 3.5297145 | 1.9653570 | 3 | 3.5371420 | 2.9652 | 1 | 6.0 | 5.0 | -0.0047181 | -1.5165329 | 0.0217555 |
| RED_CAR* | 19 | 8161 | 1.2913859 | 0.4544287 | 1 | 1.2392403 | 0.0000 | 1 | 2.0 | 1.0 | 0.9180255 | -1.1573709 | 0.0050303 |
| OLDCLAIM* | 20 | 8161 | 552.2714128 | 862.2006829 | 1 | 380.3196508 | 0.0000 | 1 | 2857.0 | 2856.0 | 1.3085876 | 0.2461666 | 9.5441372 |
| CLM_FREQ | 21 | 8161 | 0.7985541 | 1.1584527 | 0 | 0.5886047 | 0.0000 | 0 | 5.0 | 5.0 | 1.2087985 | 0.2842890 | 0.0128235 |
| REVOKED* | 22 | 8161 | 1.1225340 | 0.3279216 | 1 | 1.0281820 | 0.0000 | 1 | 2.0 | 1.0 | 2.3018899 | 3.2991013 | 0.0036299 |
| MVR_PTS | 23 | 8161 | 1.6955030 | 2.1471117 | 1 | 1.3138306 | 1.4826 | 0 | 13.0 | 13.0 | 1.3478403 | 1.3754900 | 0.0237675 |
| CAR_AGE | 24 | 7651 | 8.3283231 | 5.7007424 | 8 | 7.9632413 | 7.4130 | -3 | 28.0 | 31.0 | 0.2819531 | -0.7489756 | 0.0651737 |
| URBANICITY* | 25 | 8161 | 1.2045093 | 0.4033673 | 1 | 1.1306479 | 0.0000 | 1 | 2.0 | 1.0 | 1.4649406 | 0.1460688 | 0.0044651 |
Before moving on to the data preparation for our models, let’s check the distributions of the continuous variables using histograms and of the categorical variables using bar plots, as sketched below.
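The report’s own plotting chunk is hidden; a minimal sketch, with styling and column selection as assumptions:
library(ggplot2)
library(dplyr)
library(tidyr)

# Histograms for the numeric variables
training %>%
  select(where(is.numeric)) %>%
  pivot_longer(everything(), names_to = "variable", values_to = "value") %>%
  ggplot(aes(value)) +
  geom_histogram(bins = 30) +
  facet_wrap(~ variable, scales = "free") +
  theme_bw()

# Bar plots for the categorical (character) variables
training %>%
  select(where(is.character)) %>%
  pivot_longer(everything(), names_to = "variable", values_to = "value") %>%
  ggplot(aes(value)) +
  geom_bar() +
  facet_wrap(~ variable, scales = "free") +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))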
We can see that the data set is imbalanced, since our target variable for logistic regression, TARGET_FLAG, does not have an equal number of positive and negative responses. Variables like PARENT1, REVOKED, RED_CAR, and URBANICITY are also imbalanced and might not be good predictors, but we will investigate further. Similarly, apart from AGE, the continuous variables are not normally distributed, but this is not an issue for our first model, which will be logistic regression: logistic regression is designed for predicting the probability of an event occurring and makes no assumptions about the distribution of the independent variables.
Before we go further with data preparation, let’s quickly check whether our data set has any missing values and, if so, which variables contain them and how many.
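A sketch of the hidden counting chunk (the Variable/Missing labels are an assumption):
na_counts <- colSums(is.na(training))
knitr::kable(data.frame(Variable = names(na_counts), Missing = na_counts),
             row.names = FALSE)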
| Variable | Missing |
|---|---|
| TARGET_FLAG | 0 |
| TARGET_AMT | 0 |
| KIDSDRIV | 0 |
| AGE | 6 |
| HOMEKIDS | 0 |
| YOJ | 454 |
| INCOME | 0 |
| PARENT1 | 0 |
| HOME_VAL | 0 |
| MSTATUS | 0 |
| SEX | 0 |
| EDUCATION | 0 |
| JOB | 0 |
| TRAVTIME | 0 |
| CAR_USE | 0 |
| BLUEBOOK | 0 |
| TIF | 0 |
| CAR_TYPE | 0 |
| RED_CAR | 0 |
| OLDCLAIM | 0 |
| CLM_FREQ | 0 |
| REVOKED | 0 |
| MVR_PTS | 0 |
| CAR_AGE | 510 |
| URBANICITY | 0 |
As we can see, the columns AGE, YOJ, and CAR_AGE contain missing values, which need to be fixed. But before jumping into fixing the missing values, let’s take a look at the structure of the data and check the data type of each column.
## 'data.frame': 8161 obs. of 25 variables:
## $ TARGET_FLAG: int 0 0 0 0 0 1 0 1 1 0 ...
## $ TARGET_AMT : num 0 0 0 0 0 ...
## $ KIDSDRIV : int 0 0 0 0 0 0 0 1 0 0 ...
## $ AGE : int 60 43 35 51 50 34 54 37 34 50 ...
## $ HOMEKIDS : int 0 0 1 0 0 1 0 2 0 0 ...
## $ YOJ : int 11 11 10 14 NA 12 NA NA 10 7 ...
## $ INCOME : chr "$67,349" "$91,449" "$16,039" "" ...
## $ PARENT1 : chr "No" "No" "No" "No" ...
## $ HOME_VAL : chr "$0" "$257,252" "$124,191" "$306,251" ...
## $ MSTATUS : chr "z_No" "z_No" "Yes" "Yes" ...
## $ SEX : chr "M" "M" "z_F" "M" ...
## $ EDUCATION : chr "PhD" "z_High School" "z_High School" "<High School" ...
## $ JOB : chr "Professional" "z_Blue Collar" "Clerical" "z_Blue Collar" ...
## $ TRAVTIME : int 14 22 5 32 36 46 33 44 34 48 ...
## $ CAR_USE : chr "Private" "Commercial" "Private" "Private" ...
## $ BLUEBOOK : chr "$14,230" "$14,940" "$4,010" "$15,440" ...
## $ TIF : int 11 1 4 7 1 1 1 1 1 7 ...
## $ CAR_TYPE : chr "Minivan" "Minivan" "z_SUV" "Minivan" ...
## $ RED_CAR : chr "yes" "yes" "no" "yes" ...
## $ OLDCLAIM : chr "$4,461" "$0" "$38,690" "$0" ...
## $ CLM_FREQ : int 2 0 2 0 2 0 0 1 0 0 ...
## $ REVOKED : chr "No" "No" "No" "No" ...
## $ MVR_PTS : int 3 0 3 0 3 0 0 10 0 1 ...
## $ CAR_AGE : int 18 1 10 6 17 7 1 7 1 17 ...
## $ URBANICITY : chr "Highly Urban/ Urban" "Highly Urban/ Urban" "Highly Urban/ Urban" "Highly Urban/ Urban" ...
As we can see, several columns hold numbers and ideally should be numeric, but because of symbols like $ and , R read them as character data, so that needs to be addressed.
2. DATA PREPARATION:
In this section we will prepare our data for the logistic regression model. First we will convert the character variables that actually contain numbers into numeric; parse_number() from the readr package strips symbols such as $ and , before conversion.
training$INCOME <- parse_number(training$INCOME)
training$HOME_VAL <- parse_number(training$HOME_VAL)
training$BLUEBOOK <- parse_number(training$BLUEBOOK)
training$OLDCLAIM <- parse_number(training$OLDCLAIM)
Now that we have converted those columns, we can convert the remaining character variables into factors, a much more suitable data type for logistic regression.
training$PARENT1 <- as.factor(training$PARENT1)
training$MSTATUS <- as.factor(training$MSTATUS)
training$SEX <- as.factor(training$SEX)
training$EDUCATION <- as.factor(training$EDUCATION)
levels(training$EDUCATION) <- c('<High School','z_High School','Bachelors', 'Masters','PHD')
training$CAR_USE <- as.factor(training$CAR_USE)
training$CAR_TYPE <- as.factor(training$CAR_TYPE)
training$RED_CAR <- as.factor(training$RED_CAR)
training$REVOKED <- as.factor(training$REVOKED)
training$URBANICITY <- as.factor(training$URBANICITY)
training$JOB[training$JOB==""]<- NA
training$JOB <- as.factor(training$JOB)
training$TARGET_FLAG <- as.factor(training$TARGET_FLAG)
Now that we have changed the data type of each variable, let’s check the structure again using the str() function.
## 'data.frame': 8161 obs. of 25 variables:
## $ TARGET_FLAG: Factor w/ 2 levels "0","1": 1 1 1 1 1 2 1 2 2 1 ...
## $ TARGET_AMT : num 0 0 0 0 0 ...
## $ KIDSDRIV : int 0 0 0 0 0 0 0 1 0 0 ...
## $ AGE : int 60 43 35 51 50 34 54 37 34 50 ...
## $ HOMEKIDS : int 0 0 1 0 0 1 0 2 0 0 ...
## $ YOJ : int 11 11 10 14 NA 12 NA NA 10 7 ...
## $ INCOME : num 67349 91449 16039 NA 114986 ...
## $ PARENT1 : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 2 1 1 1 1 ...
## $ HOME_VAL : num 0 257252 124191 306251 243925 ...
## $ MSTATUS : Factor w/ 2 levels "Yes","z_No": 2 2 1 1 1 2 1 1 2 2 ...
## $ SEX : Factor w/ 2 levels "M","z_F": 1 1 2 1 2 2 2 1 2 1 ...
## $ EDUCATION : Factor w/ 5 levels "<High School",..: 4 5 5 1 4 2 1 2 2 2 ...
## $ JOB : Factor w/ 8 levels "Clerical","Doctor",..: 6 8 1 8 2 8 8 8 1 6 ...
## $ TRAVTIME : int 14 22 5 32 36 46 33 44 34 48 ...
## $ CAR_USE : Factor w/ 2 levels "Commercial","Private": 2 1 2 2 2 1 2 1 2 1 ...
## $ BLUEBOOK : num 14230 14940 4010 15440 18000 ...
## $ TIF : int 11 1 4 7 1 1 1 1 1 7 ...
## $ CAR_TYPE : Factor w/ 6 levels "Minivan","Panel Truck",..: 1 1 6 1 6 4 6 5 6 5 ...
## $ RED_CAR : Factor w/ 2 levels "no","yes": 2 2 1 2 1 1 1 2 1 1 ...
## $ OLDCLAIM : num 4461 0 38690 0 19217 ...
## $ CLM_FREQ : int 2 0 2 0 2 0 0 1 0 0 ...
## $ REVOKED : Factor w/ 2 levels "No","Yes": 1 1 1 1 2 1 1 2 1 1 ...
## $ MVR_PTS : int 3 0 3 0 3 0 0 10 0 1 ...
## $ CAR_AGE : int 18 1 10 6 17 7 1 7 1 17 ...
## $ URBANICITY : Factor w/ 2 levels "Highly Urban/ Urban",..: 1 1 1 1 1 1 1 1 1 2 ...
Now that our data types are correct, we can go ahead and re-check the missing values and fix them.
| Variable | Missing |
|---|---|
| TARGET_FLAG | 0 |
| TARGET_AMT | 0 |
| KIDSDRIV | 0 |
| AGE | 6 |
| HOMEKIDS | 0 |
| YOJ | 454 |
| INCOME | 445 |
| PARENT1 | 0 |
| HOME_VAL | 464 |
| MSTATUS | 0 |
| SEX | 0 |
| EDUCATION | 0 |
| JOB | 526 |
| TRAVTIME | 0 |
| CAR_USE | 0 |
| BLUEBOOK | 0 |
| TIF | 0 |
| CAR_TYPE | 0 |
| RED_CAR | 0 |
| OLDCLAIM | 0 |
| CLM_FREQ | 0 |
| REVOKED | 0 |
| MVR_PTS | 0 |
| CAR_AGE | 510 |
| URBANICITY | 0 |
Now we can see that, in addition to the previous columns, INCOME, HOME_VAL, and JOB also show missing values (their empty strings became NA after conversion). To take care of the missing values we will rely on the mice package, which offers multiple imputation techniques. Since our data set has a mixture of continuous and categorical variables, we need a method that can handle both, and the random forest method is a good pick: random forests handle both data types, and as an ensemble method they tend to predict well.
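A sketch of the imputation call; m = 5 imputations and maxit = 3 iterations match the log below, while the seed is an assumption:
library(mice)
# Impute with random forests ("rf" handles numeric and factor columns)
imputed <- mice(training, m = 5, maxit = 3, method = "rf", seed = 42)
training <- complete(imputed)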
##
## iter imp variable
## 1 1 AGE YOJ INCOME HOME_VAL JOB CAR_AGE
## 1 2 AGE YOJ INCOME HOME_VAL JOB CAR_AGE
## 1 3 AGE YOJ INCOME HOME_VAL JOB CAR_AGE
## 1 4 AGE YOJ INCOME HOME_VAL JOB CAR_AGE
## 1 5 AGE YOJ INCOME HOME_VAL JOB CAR_AGE
## 2 1 AGE YOJ INCOME HOME_VAL JOB CAR_AGE
## 2 2 AGE YOJ INCOME HOME_VAL JOB CAR_AGE
## 2 3 AGE YOJ INCOME HOME_VAL JOB CAR_AGE
## 2 4 AGE YOJ INCOME HOME_VAL JOB CAR_AGE
## 2 5 AGE YOJ INCOME HOME_VAL JOB CAR_AGE
## 3 1 AGE YOJ INCOME HOME_VAL JOB CAR_AGE
## 3 2 AGE YOJ INCOME HOME_VAL JOB CAR_AGE
## 3 3 AGE YOJ INCOME HOME_VAL JOB CAR_AGE
## 3 4 AGE YOJ INCOME HOME_VAL JOB CAR_AGE
## 3 5 AGE YOJ INCOME HOME_VAL JOB CAR_AGE
Now we can check our data set for any missing values.
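The check is presumably a simple total (an assumption):
sum(is.na(training))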
## [1] 0
Now our data set is ready for training, so let’s move on to the modeling section.
3. BUILDING AND SELECTING MODELS:
LOGISTIC REGRESSION:
BUILDING MODELS
Let’s remove the TARGET_AMT column from our data set, since that column contains the response for the cost of accidents; we leave it out of the classification task, as sketched below.
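The chunk that does this is hidden; a minimal sketch, with training_log being the name the later chunks use:
training_log <- subset(training, select = -TARGET_AMT)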
Now let’s split the data into training and testing. We will split the data set training_log into partial_train and validation. partial_train contains 80% of the data from training_log, and the rest goes into validation, which we will use for evaluating the performance of our models.
set.seed(42)
split <- createDataPartition(training_log$TARGET_FLAG, p=.80, list=FALSE)
partial_train <- training_log[split, ]
validation <- training_log[ -split, ]
So now that our data has been split into two partitions, we can go ahead and create models. In our first model we will use all the variables in the data set with five-fold cross-validation and see how the model performs.
MODEL 1:
m1 <- train(TARGET_FLAG ~., data = partial_train,
method = "glm", family = "binomial",
trControl = trainControl(
method = "cv", number = 5,
savePredictions = TRUE),
tuneLength = 5,
            preProcess = c("center", "scale"))
Here is the summary of our first model:
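The summary below comes from the glm fit inside the caret object; a sketch of the call (an assumption):
summary(m1$finalModel)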
##
## Call:
## NULL
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3224 -0.7115 -0.4040 0.6390 3.1684
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.418410 0.038613 -36.734 < 2e-16 ***
## KIDSDRIV 0.201483 0.034698 5.807 6.37e-09 ***
## AGE -0.018978 0.038786 -0.489 0.624634
## HOMEKIDS 0.048120 0.046094 1.044 0.296507
## YOJ -0.084769 0.038473 -2.203 0.027570 *
## INCOME -0.106703 0.058747 -1.816 0.069322 .
## PARENT1Yes 0.145814 0.041551 3.509 0.000449 ***
## HOME_VAL -0.175370 0.049671 -3.531 0.000415 ***
## MSTATUSz_No 0.217628 0.046882 4.642 3.45e-06 ***
## SEXz_F -0.005406 0.062485 -0.087 0.931051
## `EDUCATIONz_High School` -0.191901 0.057747 -3.323 0.000890 ***
## EDUCATIONBachelors -0.125781 0.076051 -1.654 0.098148 .
## EDUCATIONMasters -0.025369 0.066306 -0.383 0.702016
## EDUCATIONPHD -0.008077 0.047956 -0.168 0.866250
## JOBDoctor -0.182664 0.053863 -3.391 0.000696 ***
## `JOBHome Maker` -0.052669 0.043143 -1.221 0.222161
## JOBLawyer -0.119903 0.059343 -2.020 0.043332 *
## JOBManager -0.354018 0.056134 -6.307 2.85e-10 ***
## JOBProfessional -0.129952 0.049389 -2.631 0.008509 **
## JOBStudent -0.099372 0.041837 -2.375 0.017538 *
## `JOBz_Blue Collar` -0.088423 0.049560 -1.784 0.074400 .
## TRAVTIME 0.222401 0.033396 6.659 2.75e-11 ***
## CAR_USEPrivate -0.396063 0.048045 -8.244 < 2e-16 ***
## BLUEBOOK -0.197218 0.049663 -3.971 7.15e-05 ***
## TIF -0.234870 0.034013 -6.905 5.01e-12 ***
## `CAR_TYPEPanel Truck` 0.176667 0.049360 3.579 0.000345 ***
## CAR_TYPEPickup 0.212538 0.042009 5.059 4.21e-07 ***
## `CAR_TYPESports Car` 0.293992 0.045466 6.466 1.01e-10 ***
## CAR_TYPEVan 0.199900 0.040602 4.923 8.50e-07 ***
## CAR_TYPEz_SUV 0.323606 0.055821 5.797 6.74e-09 ***
## RED_CARyes 0.010064 0.044221 0.228 0.819976
## OLDCLAIM -0.139755 0.038532 -3.627 0.000287 ***
## CLM_FREQ 0.237345 0.036995 6.416 1.40e-10 ***
## REVOKEDYes 0.288161 0.033696 8.552 < 2e-16 ***
## MVR_PTS 0.229272 0.032802 6.990 2.76e-12 ***
## CAR_AGE 0.002160 0.047560 0.045 0.963768
## `URBANICITYz_Highly Rural/ Rural` -0.942472 0.049807 -18.923 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 7536.3 on 6529 degrees of freedom
## Residual deviance: 5861.8 on 6493 degrees of freedom
## AIC: 5935.8
##
## Number of Fisher Scoring iterations: 5
Let’s remove the variables with high p-values and build more models.
MODEL 2:
m2 <- train(TARGET_FLAG ~ KIDSDRIV +
PARENT1 + HOME_VAL + MSTATUS + INCOME +
TRAVTIME + CAR_USE + BLUEBOOK + TIF + CAR_TYPE + OLDCLAIM + CLM_FREQ + REVOKED + MVR_PTS + URBANICITY,
data = partial_train,
method = "glm", family = "binomial",
trControl = trainControl(
method = "cv", number = 10,
savePredictions = TRUE),
tuneLength = 5,
            preProcess = c("center", "scale"))
Now let’s look at the variance inflation factor (VIF) of each predictor and remove those with the highest values.
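The VIF chunk is hidden; a sketch using the car package (the table labels are an assumption):
library(car)
vifs <- vif(m2$finalModel)
knitr::kable(data.frame(Variable = names(vifs), VIF = vifs), row.names = FALSE)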
| Variable | VIF |
|---|---|
| KIDSDRIV | 6.262574 |
| PARENT1Yes | 8.015029 |
| HOME_VAL | 15.181384 |
| MSTATUSz_No | 12.546313 |
| INCOME | 14.450041 |
| TRAVTIME | 7.070191 |
| CAR_USEPrivate | 8.999717 |
| BLUEBOOK | 12.693090 |
| TIF | 7.382963 |
| CAR_TYPEPanel Truck | 12.010741 |
| CAR_TYPEPickup | 10.781791 |
| CAR_TYPESports Car | 8.852750 |
| CAR_TYPEVan | 9.488330 |
| CAR_TYPEz_SUV | 11.723275 |
| OLDCLAIM | 9.384444 |
| CLM_FREQ | 8.679044 |
| REVOKEDYes | 7.163466 |
| MVR_PTS | 6.822545 |
| URBANICITYz_Highly Rural/ Rural | 16.025249 |
Everything in the summary for Model 2 (not shown) looks statistically significant, but several predictors have high VIF values, so for our third model let’s remove the predictors with the highest VIF.
MODEL 3:
m3 <- train(TARGET_FLAG ~ KIDSDRIV +
PARENT1 + MSTATUS + INCOME +
TRAVTIME + CAR_USE + TIF + OLDCLAIM + CLM_FREQ + REVOKED + MVR_PTS + URBANICITY,
data = partial_train,
method = "glm", family = "binomial",
trControl = trainControl(
method = "cv", number = 10,
savePredictions = TRUE),
tuneLength = 5,
            preProcess = c("center", "scale"))
Here is the summary of Model 3:
##
## Call:
## NULL
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2890 -0.7488 -0.4450 0.7090 2.8708
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.35587 0.03712 -36.529 < 2e-16 ***
## KIDSDRIV 0.21169 0.03053 6.935 4.06e-12 ***
## PARENT1Yes 0.19018 0.03450 5.513 3.53e-08 ***
## MSTATUSz_No 0.25858 0.03661 7.063 1.64e-12 ***
## INCOME -0.54003 0.03734 -14.462 < 2e-16 ***
## TRAVTIME 0.21942 0.03240 6.772 1.27e-11 ***
## CAR_USEPrivate -0.39855 0.03140 -12.693 < 2e-16 ***
## TIF -0.22456 0.03318 -6.768 1.31e-11 ***
## OLDCLAIM -0.13003 0.03739 -3.478 0.000505 ***
## CLM_FREQ 0.23254 0.03599 6.462 1.03e-10 ***
## REVOKEDYes 0.29147 0.03260 8.942 < 2e-16 ***
## MVR_PTS 0.25216 0.03194 7.894 2.93e-15 ***
## `URBANICITYz_Highly Rural/ Rural` -0.86220 0.04903 -17.585 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 7536.3 on 6529 degrees of freedom
## Residual deviance: 6117.1 on 6517 degrees of freedom
## AIC: 6143.1
##
## Number of Fisher Scoring iterations: 5
Again, every predictor in Model 3 is statistically significant, so let’s move on to selecting among the models based on classification metrics.
SELECTING MODEL:
In this section we will look at different metrics, such as the confusion matrix, specificity, sensitivity, F1 score, precision, and AUC-ROC, and decide which model is optimal for prediction. First we will look at the confusion matrices using fourfold plots.
pred1 <- predict(m1, newdata = validation)
pred2 <- predict(m2, newdata = validation)
pred3 <- predict(m3, newdata = validation)
m1cM <- confusionMatrix(pred1, validation$TARGET_FLAG,
mode = "everything")
m2cM <- confusionMatrix(pred2, validation$TARGET_FLAG,
mode = "everything")
m3cM <- confusionMatrix(pred3, validation$TARGET_FLAG,
mode = "everything")
par(mfrow=c(1,3))
fourfoldplot(m1cM$table, color = c("#B22222", "#2E8B57"), main="Mod1")
fourfoldplot(m2cM$table, color = c("#B22222", "#2E8B57"), main="Mod2")
fourfoldplot(m3cM$table, color = c("#B22222", "#2E8B57"), main="Mod3")
As we can see, Model 2 has an accuracy of around 81%, which is not bad considering it has far fewer predictors than Model 1, and it is noticeably higher than Model 3’s accuracy of around 77%. We can also look at other metrics before we settle on Model 2.
eval <- data.frame(m1cM$byClass,
m2cM$byClass,
m3cM$byClass)
eval <- data.frame(t(eval))
eval <- dplyr::select(eval, Sensitivity, Specificity, Precision, Recall, F1)
row.names(eval) <- c("Model 1", "Model 2", "Model 3")
knitr::kable(eval)
| | Sensitivity | Specificity | Precision | Recall | F1 |
|---|---|---|---|---|---|
| Model 1 | 0.9258951 | 0.4279070 | 0.8188513 | 0.9258951 | 0.8690895 |
| Model 2 | 0.9308909 | 0.4162791 | 0.8166545 | 0.9308909 | 0.8700389 |
| Model 3 | 0.9333888 | 0.3697674 | 0.8053161 | 0.9333888 | 0.8646356 |
Based on the metrics above, we will again go with Model 2 since it has the highest F1 score. We use the F1 score as the basis for this data set because the response variable is imbalanced, and with an imbalanced response it makes sense to base the decision on the F1 score. We can also check the ROC curves. Let’s create a function for plotting them.
getROC <- function(model) {
name <- deparse(substitute(model))
pred.prob1 <- predict(model, newdata = validation, type="prob")
p1 <- data.frame(pred = validation$TARGET_FLAG, prob = pred.prob1[[1]])
p1 <- p1[order(p1$prob),]
rocobj <- roc(p1$pred, p1$prob)
plot(rocobj, asp=NA, legacy.axes = TRUE, print.auc=TRUE,
       xlab="1 - Specificity", main = name)
}
Even though Model 1 has a slightly higher AUC than Model 2, we have to bear in mind that Model 2 uses fewer predictors, which keeps computation down, and it has the higher accuracy and F1 score. So we will go with Model 2.
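For reference, the ROC plots are presumably produced by calling the helper once per model, along these lines (an assumption):
library(pROC)
par(mfrow = c(1, 3))
getROC(m1)
getROC(m2)
getROC(m3)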
MULTIPLE LINEAR REGRESSION:
BUILDING MODELS:
Let’s remove the TARGET_FLAG column from our data set, since that column contains the response for the crash events, so we leave it out. But before that, we have to fix our data types to fit a linear regression model. So let’s convert the factor columns into numeric via label encoding, as shown below.
training$PARENT1 <- as.numeric(as.factor(training$PARENT1))
training$MSTATUS <- as.numeric(as.factor(training$MSTATUS))
training$SEX <- as.numeric(as.factor(training$SEX))
training$EDUCATION <- as.numeric(as.factor(training$EDUCATION))
training$CAR_USE <- as.numeric(as.factor(training$CAR_USE))
training$CAR_TYPE <- as.numeric(as.factor(training$CAR_TYPE))
training$RED_CAR <- as.factor(training$RED_CAR)
training$REVOKED <- as.factor(training$REVOKED)
training$URBANICITY <- as.factor(training$URBANICITY)
training$JOB[training$JOB==""]<- NA
training$JOB <- as.factor(training$JOB)
Now let’s remove the unwanted column(s), as sketched below.
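The removal chunk is hidden; a minimal sketch, with training_lin being the name the later chunks use:
training_lin <- subset(training, select = -TARGET_FLAG)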
Before moving on with our model, let’s check the distribution of the response variable for the multiple linear regression. The distribution is not favorable for linear regression, so we will have to apply some kind of transformation; let’s try a log transformation and then check the distribution again, as sketched below.
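A sketch of the two distribution checks (base graphics assumed; the log1p transformation matches the model formula below):
par(mfrow = c(1, 2))
hist(training_lin$TARGET_AMT, breaks = 40,
     main = "TARGET_AMT", xlab = "TARGET_AMT")
hist(log1p(training_lin$TARGET_AMT), breaks = 40,
     main = "log1p(TARGET_AMT)", xlab = "log1p(TARGET_AMT)")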
It looks much better now, apart from a huge spike at 0: the data set is heavily imbalanced and there is nothing that can be done to cure that imbalance. Now let’s split the data into training and testing. We will split the data set training_lin into partial_train and validation. partial_train contains 80% of the data from training_lin, and the rest goes into validation, which we will use for evaluating the performance of our model.
set.seed(42)
split <- createDataPartition(training_lin$TARGET_AMT, p=.80, list=FALSE)
partial_train <- training_lin[split, ]
validation <- training_lin[ -split, ]
Now we can feed the data to our model. As with the logistic regression, we will use 10-fold cross-validation.
MODEL 1:
lm1 <- train(log1p(TARGET_AMT) ~ ., data = partial_train,
method = "lm",
trControl = trainControl(
method = "cv", number = 10,
savePredictions = TRUE),
tuneLength = 5,
             preProcess = c("center", "scale"))
Let’s check out the summary of our model:
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.5298 -2.3433 -0.8758 2.1333 10.3663
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.181752 0.040198 54.275 < 2e-16 ***
## KIDSDRIV 0.249425 0.045865 5.438 5.58e-08 ***
## AGE -0.002362 0.048227 -0.049 0.960937
## HOMEKIDS 0.064809 0.057767 1.122 0.261943
## YOJ -0.114792 0.048560 -2.364 0.018111 *
## INCOME -0.149950 0.068401 -2.192 0.028398 *
## PARENT1 0.212613 0.054489 3.902 9.64e-05 ***
## HOME_VAL -0.207018 0.062443 -3.315 0.000920 ***
## MSTATUS 0.256638 0.058121 4.416 1.02e-05 ***
## SEX -0.022231 0.061084 -0.364 0.715909
## EDUCATION 0.074819 0.041586 1.799 0.072042 .
## JOBDoctor -0.218108 0.050893 -4.286 1.85e-05 ***
## `JOBHome Maker` -0.091788 0.052598 -1.745 0.081016 .
## JOBLawyer -0.200753 0.059296 -3.386 0.000714 ***
## JOBManager -0.497817 0.063458 -7.845 5.04e-15 ***
## JOBProfessional -0.208254 0.056926 -3.658 0.000256 ***
## JOBStudent -0.092906 0.053908 -1.723 0.084860 .
## `JOBz_Blue Collar` -0.088332 0.061956 -1.426 0.153994
## TRAVTIME 0.251143 0.040974 6.129 9.34e-10 ***
## CAR_USE -0.507531 0.053316 -9.519 < 2e-16 ***
## BLUEBOOK -0.155611 0.048107 -3.235 0.001224 **
## TIF -0.278680 0.040373 -6.903 5.59e-12 ***
## CAR_TYPE 0.303885 0.047908 6.343 2.40e-10 ***
## RED_CARyes -0.026205 0.053960 -0.486 0.627249
## OLDCLAIM -0.158343 0.052185 -3.034 0.002421 **
## CLM_FREQ 0.295897 0.050798 5.825 5.98e-09 ***
## REVOKEDYes 0.413593 0.045617 9.067 < 2e-16 ***
## MVR_PTS 0.381875 0.044387 8.603 < 2e-16 ***
## CAR_AGE -0.108457 0.050032 -2.168 0.030214 *
## `URBANICITYz_Highly Rural/ Rural` -1.006789 0.044912 -22.417 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.248 on 6499 degrees of freedom
## Multiple R-squared: 0.2218, Adjusted R-squared: 0.2183
## F-statistic: 63.87 on 29 and 6499 DF, p-value: < 2.2e-16
As we can see, quite a few predictors are statistically insignificant in our model, so let’s get rid of them.
MODEL 2:
lm2 <- train(log1p(TARGET_AMT) ~ KIDSDRIV+ MSTATUS+ CAR_TYPE+ JOB+ HOME_VAL+PARENT1+ BLUEBOOK + CLM_FREQ + REVOKED + OLDCLAIM, data = partial_train,
method = "lm",
trControl = trainControl(
method = "cv", number = 10,
savePredictions = TRUE),
tuneLength = 5,
             preProcess = c("center", "scale"))
Let’s check the multicollinearity among the model’s predictors.
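Again a sketch with the car package (the hidden chunk may differ):
vifs2 <- car::vif(lm2$finalModel)
knitr::kable(data.frame(Variable = names(vifs2), VIF = vifs2), row.names = FALSE)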
| Variable | VIF |
|---|---|
| KIDSDRIV | 1.068806 |
| MSTATUS | 1.758464 |
| CAR_TYPE | 1.072852 |
| JOBDoctor | 1.263166 |
| JOBHome Maker | 1.402136 |
| JOBLawyer | 1.591487 |
| JOBManager | 1.890681 |
| JOBProfessional | 1.770313 |
| JOBStudent | 1.514315 |
| JOBz_Blue Collar | 1.929418 |
| HOME_VAL | 1.815799 |
| PARENT1 | 1.392560 |
| BLUEBOOK | 1.205615 |
| CLM_FREQ | 1.391465 |
| REVOKEDYes | 1.276010 |
| OLDCLAIM | 1.672804 |
And the significance of the predictors:
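Presumably via the underlying lm fit (an assumption):
summary(lm2$finalModel)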
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.523 -2.287 -1.312 2.667 10.013
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.18175 0.04252 51.307 < 2e-16 ***
## KIDSDRIV 0.24495 0.04397 5.571 2.63e-08 ***
## MSTATUS 0.19925 0.05639 3.533 0.000413 ***
## CAR_TYPE 0.27216 0.04405 6.179 6.85e-10 ***
## JOBDoctor -0.16381 0.04780 -3.427 0.000614 ***
## `JOBHome Maker` -0.06114 0.05036 -1.214 0.224756
## JOBLawyer -0.14842 0.05365 -2.766 0.005684 **
## JOBManager -0.25140 0.05848 -4.299 1.74e-05 ***
## JOBProfessional -0.08194 0.05658 -1.448 0.147611
## JOBStudent 0.08649 0.05233 1.653 0.098437 .
## `JOBz_Blue Collar` 0.25191 0.05907 4.265 2.03e-05 ***
## HOME_VAL -0.27466 0.05731 -4.793 1.68e-06 ***
## PARENT1 0.25765 0.05018 5.134 2.92e-07 ***
## BLUEBOOK -0.08655 0.04669 -1.854 0.063850 .
## CLM_FREQ 0.71057 0.05016 14.165 < 2e-16 ***
## REVOKEDYes 0.51783 0.04804 10.779 < 2e-16 ***
## OLDCLAIM -0.13009 0.05500 -2.365 0.018054 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.436 on 6512 degrees of freedom
## Multiple R-squared: 0.1274, Adjusted R-squared: 0.1253
## F-statistic: 59.44 on 16 and 6512 DF, p-value: < 2.2e-16
SELECTING MODEL:
df <- data.frame()
df <- rbind(df, lm1$results)
df <- rbind(df, lm2$results)
df$intercept <- c("Mod1", "Mod2")
colnames(df)[1] <- "model"
knitr::kable(df)
| model | RMSE | Rsquared | MAE | RMSESD | RsquaredSD | MAESD |
|---|---|---|---|---|---|---|
| Mod1 | 3.255671 | 0.2155983 | 2.640527 | 0.0533824 | 0.0273154 | 0.0583600 |
| Mod2 | 3.440147 | 0.1246543 | 2.821055 | 0.0584894 | 0.0267984 | 0.0326541 |
Even though Model 1 (lm1) has many more predictors, we will still choose it because it achieves a much higher adjusted \(R^2\). Its RMSE and MAE are also lower than Model 2’s.
4. EVALUATION:
In this section we make predictions using the evaluation data provided. Before we make predictions, we have to fix the data types and deal with any missing values in the evaluation data set. Without further ado, let’s load the data set into our environment.
testing <- read_csv("https://raw.githubusercontent.com/Umerfarooq122/predicting-the-probability-that-a-person-will-crash-their-car-and-also-the-amount-cost/main/insurance-evaluation-data.csv")
Here are the first few rows of the evaluation set:
| INDEX | TARGET_FLAG | TARGET_AMT | KIDSDRIV | AGE | HOMEKIDS | YOJ | INCOME | PARENT1 | HOME_VAL | MSTATUS | SEX | EDUCATION | JOB | TRAVTIME | CAR_USE | BLUEBOOK | TIF | CAR_TYPE | RED_CAR | OLDCLAIM | CLM_FREQ | REVOKED | MVR_PTS | CAR_AGE | URBANICITY |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3 | NA | NA | 0 | 48 | 0 | 11 | $52,881 | No | $0 | z_No | M | Bachelors | Manager | 26 | Private | $21,970 | 1 | Van | yes | $0 | 0 | No | 2 | 10 | Highly Urban/ Urban |
| 9 | NA | NA | 1 | 40 | 1 | 11 | $50,815 | Yes | $0 | z_No | M | z_High School | Manager | 21 | Private | $18,930 | 6 | Minivan | no | $3,295 | 1 | No | 2 | 1 | Highly Urban/ Urban |
| 10 | NA | NA | 0 | 44 | 2 | 12 | $43,486 | Yes | $0 | z_No | z_F | z_High School | z_Blue Collar | 30 | Commercial | $5,900 | 10 | z_SUV | no | $0 | 0 | No | 0 | 10 | z_Highly Rural/ Rural |
| 18 | NA | NA | 0 | 35 | 2 | NA | $21,204 | Yes | $0 | z_No | M | z_High School | Clerical | 74 | Private | $9,230 | 6 | Pickup | no | $0 | 0 | Yes | 0 | 4 | z_Highly Rural/ Rural |
| 21 | NA | NA | 0 | 59 | 0 | 12 | $87,460 | No | $0 | z_No | M | z_High School | Manager | 45 | Private | $15,420 | 1 | Minivan | yes | $44,857 | 2 | No | 4 | 1 | Highly Urban/ Urban |
| 30 | NA | NA | 0 | 46 | 0 | 14 | NA | No | $207,519 | Yes | M | Bachelors | Professional | 7 | Commercial | $25,660 | 1 | Panel Truck | no | $2,119 | 1 | No | 2 | 12 | Highly Urban/ Urban |
## tibble [2,141 × 23] (S3: tbl_df/tbl/data.frame)
## $ KIDSDRIV : num [1:2141] 0 1 0 0 0 0 0 0 2 0 ...
## $ AGE : num [1:2141] 48 40 44 35 59 46 60 54 36 50 ...
## $ HOMEKIDS : num [1:2141] 0 1 2 2 0 0 0 0 2 0 ...
## $ YOJ : num [1:2141] 11 11 12 NA 12 14 12 12 12 8 ...
## $ INCOME : chr [1:2141] "$52,881" "$50,815" "$43,486" "$21,204" ...
## $ PARENT1 : chr [1:2141] "No" "Yes" "Yes" "Yes" ...
## $ HOME_VAL : chr [1:2141] "$0" "$0" "$0" "$0" ...
## $ MSTATUS : chr [1:2141] "z_No" "z_No" "z_No" "z_No" ...
## $ SEX : chr [1:2141] "M" "M" "z_F" "M" ...
## $ EDUCATION : chr [1:2141] "Bachelors" "z_High School" "z_High School" "z_High School" ...
## $ JOB : chr [1:2141] "Manager" "Manager" "z_Blue Collar" "Clerical" ...
## $ TRAVTIME : num [1:2141] 26 21 30 74 45 7 16 27 5 22 ...
## $ CAR_USE : chr [1:2141] "Private" "Private" "Commercial" "Private" ...
## $ BLUEBOOK : chr [1:2141] "$21,970" "$18,930" "$5,900" "$9,230" ...
## $ TIF : num [1:2141] 1 6 10 6 1 1 1 4 4 4 ...
## $ CAR_TYPE : chr [1:2141] "Van" "Minivan" "z_SUV" "Pickup" ...
## $ RED_CAR : chr [1:2141] "yes" "no" "no" "no" ...
## $ OLDCLAIM : chr [1:2141] "$0" "$3,295" "$0" "$0" ...
## $ CLM_FREQ : num [1:2141] 0 1 0 0 2 1 0 0 0 0 ...
## $ REVOKED : chr [1:2141] "No" "No" "No" "Yes" ...
## $ MVR_PTS : num [1:2141] 2 2 0 0 4 2 0 5 0 3 ...
## $ CAR_AGE : num [1:2141] 10 1 10 4 1 12 1 NA 9 1 ...
## $ URBANICITY: chr [1:2141] "Highly Urban/ Urban" "Highly Urban/ Urban" "z_Highly Rural/ Rural" "z_Highly Rural/ Rural" ...
testing$INCOME <- parse_number(testing$INCOME)
testing$HOME_VAL <- parse_number(testing$HOME_VAL)
testing$BLUEBOOK <- parse_number(testing$BLUEBOOK)
testing$OLDCLAIM <- parse_number(testing$OLDCLAIM)
Now that we have converted those columns, we can convert the remaining character variables into factors, a much more suitable data type for logistic regression.
testing$PARENT1 <- as.factor(testing$PARENT1)
testing$MSTATUS <- as.factor(testing$MSTATUS)
testing$SEX <- as.factor(testing$SEX)
testing$EDUCATION <- as.factor(testing$EDUCATION)
levels(testing$EDUCATION) <- c('<High School','z_High School','Bachelors', 'Masters','PHD')
testing$CAR_USE <- as.factor(testing$CAR_USE)
testing$CAR_TYPE <- as.factor(testing$CAR_TYPE)
testing$RED_CAR <- as.factor(testing$RED_CAR)
testing$REVOKED <- as.factor(testing$REVOKED)
testing$URBANICITY <- as.factor(testing$URBANICITY)
testing$JOB[testing$JOB==""]<- NA
testing$JOB <- as.factor(testing$JOB)
Now let’s check the evaluation set for missing values:
## KIDSDRIV AGE HOMEKIDS YOJ INCOME PARENT1 HOME_VAL
## 0 1 0 94 125 0 111
## MSTATUS SEX EDUCATION JOB TRAVTIME CAR_USE BLUEBOOK
## 0 0 0 139 0 0 0
## TIF CAR_TYPE RED_CAR OLDCLAIM CLM_FREQ REVOKED MVR_PTS
## 0 0 0 0 0 0 0
## CAR_AGE URBANICITY
## 129 0
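We impute these with the same mice random forest setup used on the training data; a sketch (object names and seed are assumptions):
imputed_test <- mice(testing, m = 5, maxit = 3, method = "rf", seed = 42)
testing <- complete(imputed_test)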
##
## iter imp variable
## 1 1 AGE YOJ INCOME HOME_VAL JOB CAR_AGE
## 1 2 AGE YOJ INCOME HOME_VAL JOB CAR_AGE
## 1 3 AGE YOJ INCOME HOME_VAL JOB CAR_AGE
## 1 4 AGE YOJ INCOME HOME_VAL JOB CAR_AGE
## 1 5 AGE YOJ INCOME HOME_VAL JOB CAR_AGE
## 2 1 AGE YOJ INCOME HOME_VAL JOB CAR_AGE
## 2 2 AGE YOJ INCOME HOME_VAL JOB CAR_AGE
## 2 3 AGE YOJ INCOME HOME_VAL JOB CAR_AGE
## 2 4 AGE YOJ INCOME HOME_VAL JOB CAR_AGE
## 2 5 AGE YOJ INCOME HOME_VAL JOB CAR_AGE
## 3 1 AGE YOJ INCOME HOME_VAL JOB CAR_AGE
## 3 2 AGE YOJ INCOME HOME_VAL JOB CAR_AGE
## 3 3 AGE YOJ INCOME HOME_VAL JOB CAR_AGE
## 3 4 AGE YOJ INCOME HOME_VAL JOB CAR_AGE
## 3 5 AGE YOJ INCOME HOME_VAL JOB CAR_AGE
LOGISTIC REGRESSION:
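The prediction chunk is hidden; a sketch using the chosen Model 2 (the object names are assumptions):
pred_class <- predict(m2, newdata = testing)                   # predicted 0/1 labels
pred_prob <- predict(m2, newdata = testing, type = "prob")     # class probabilities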
MULTIPLE LINEAR REGRESSION:
#fixing data type
testing$PARENT1 <- as.numeric(as.factor(testing$PARENT1))
testing$MSTATUS <- as.numeric(as.factor(testing$MSTATUS))
testing$CAR_TYPE <- as.numeric(as.factor(testing$CAR_TYPE))
testing$SEX <- as.numeric(as.factor(testing$SEX))
testing$EDUCATION <- as.numeric(as.factor(testing$EDUCATION))
testing$CAR_USE <- as.numeric(as.factor(testing$CAR_USE))
The first few predictions are shown below. The first column is the predicted probability of no crash, prob is the predicted probability of a crash, TARGET_FLAG is the predicted class, and TARGET_AMT is the predicted cost:
| P(TARGET_FLAG = 0) | prob | TARGET_FLAG | TARGET_AMT |
|---|---|---|---|
| 0.7216442 | 0.2783558 | 0 | 5.948811 |
| 0.6772316 | 0.3227684 | 0 | 13.060923 |
| 0.8646774 | 0.1353226 | 0 | 7.223364 |
| 0.7570752 | 0.2429248 | 0 | 22.487631 |
| 0.8098321 | 0.1901679 | 0 | 6.287991 |
| 0.7897810 | 0.2102190 | 0 | 7.110726 |
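A sketch of how the table above could have been assembled; the expm1() back-transformation of the regression prediction from the log1p scale is an assumption, as are the object names:
pred_amt <- predict(lm1, newdata = testing)   # on the log1p scale
final <- data.frame("P(TARGET_FLAG = 0)" = pred_prob[["0"]],
                    prob = pred_prob[["1"]],
                    TARGET_FLAG = pred_class,
                    TARGET_AMT = expm1(pred_amt),
                    check.names = FALSE)
knitr::kable(head(final))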
Now, to confirm that our model predicts higher costs as the predicted probability of an accident increases, we can plot them against each other.
ggplot(data = final, mapping = aes(x= prob, y = TARGET_AMT)) +
  geom_point(color = "blue", size = 3)+labs(x="Predicted Probability of Accident Happening", y="Cost")+theme_bw()
5. CONCLUSION:
In this study we were dealing with an imbalanced data set, and our job was to find out what contributes to a car crash and how much the repairs would cost. The data set provided had 26 variables and around 8k observations. We started our analysis by exploring the data set, followed by cleaning and preparing it for training our models. We had to come up with two kinds of models: one for classification and one for regression. For classification we used logistic regression with cross-validation and trained three models. Since the data set was imbalanced, the model with the highest F1 score was chosen for the final predictions. For predicting the cost of an accident we formulated two regression models and picked the better one based on root mean squared error (RMSE) and mean absolute error (MAE). Our predictions were not that accurate when it came to the cost of the accidents, but our model predicted a higher TARGET_AMT whenever the predicted probability of an accident was higher.