Introduction

In this assignment, we will explore, analyze, and model a data set of approximately 8,000 records, each representing a customer of an auto insurance company. Each record has two response variables. The first, TARGET_FLAG, is 1 if the person was in a car crash and 0 if they were not. The second, TARGET_AMT, is zero if the person did not crash their car and a value greater than zero if they did.
Our objective is to build multiple linear regression and binary logistic regression models on the training data to predict the probability that a person will crash their car and, if they do, how much the crash will cost. We will only use the variables given to us (or variables that we derive from them). Below is a short description of the variables of interest in the data set:
Variable Name Definition Theoretical Effect
INDEX Identification Variable (do not use) None
TARGET_FLAG Was Car in a crash? 1=YES 0=NO None
TARGET_AMT If car was in a crash, what was the cost None
AGE Age of Driver Very young people tend to be risky. Maybe very old people also.
BLUEBOOK Value of Vehicle Unknown effect on probability of collision, but probably affects the payout if there is a crash
CAR_AGE Vehicle Age Unknown effect on probability of collision, but probably affects the payout if there is a crash
CAR_TYPE Type of Car Unknown effect on probability of collision, but probably affects the payout if there is a crash
CAR_USE Vehicle Use Commercial vehicles are driven more, so might increase probability of collision
CLM_FREQ # Claims (Past 5 Years) The more claims you filed in the past, the more you are likely to file in the future
EDUCATION Max Education Level Unknown effect, but in theory more educated people tend to drive more safely
HOMEKIDS # Children at Home Unknown effect
HOME_VAL Home Value In theory, home owners tend to drive more responsibly
INCOME Income In theory, rich people tend to get into fewer crashes
JOB Job Category In theory, white collar jobs tend to be safer
KIDSDRIV # Driving Children When teenagers drive your car, you are more likely to get into crashes
MSTATUS Marital Status In theory, married people drive more safely
MVR_PTS Motor Vehicle Record Points If you get lots of traffic tickets, you tend to get into more crashes
OLDCLAIM Total Claims (Past 5 Years) If your total payout over the past five years was high, this suggests future payouts will be high
PARENT1 Single Parent Unknown effect
RED_CAR A Red Car Urban legend says that red cars (especially red sports cars) are more risky. Is that true?
REVOKED License Revoked (Past 7 Years) If your license was revoked in the past 7 years, you probably are a more risky driver
SEX Gender Urban legend says that women have fewer crashes than men. Is that true?
TIF Time in Force People who have been customers for a long time are usually safer.
TRAVTIME Distance to Work Long drives to work usually suggest greater risk
URBANICITY Home/Work Area Unknown
YOJ Years on Job People who stay at a job for a long time are usually safer

Data Exploration

Our dataset consists of 26 variables and 8,161 observations, with the AGE, YOJ, and CAR_AGE variables containing missing values. As stated previously, TARGET_FLAG and TARGET_AMT are our response variables. Fourteen of the variables are read in as factors (four of these, INCOME, HOME_VAL, BLUEBOOK, and OLDCLAIM, are really currency amounts stored as text) and the rest are numeric. Lastly, the summary output below reports the mean, quartiles, and other summary statistics for each variable.
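
The missing-value claim is easy to verify directly; a minimal check, using insurance_train, the raw training data loaded in the appendix code:

# count NAs per column; only AGE, YOJ, and CAR_AGE should be non-zero
na_counts <- colSums(is.na(insurance_train))
na_counts[na_counts > 0]   # AGE: 6, YOJ: 454, CAR_AGE: 510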

## Observations: 8,161
## Variables: 26
## $ INDEX       <int> 1, 2, 4, 5, 6, 7, 8, 11, 12, 13, 14, 15, 16, 17, 1...
## $ TARGET_FLAG <int> 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0,...
## $ TARGET_AMT  <dbl> 0.000, 0.000, 0.000, 0.000, 0.000, 2946.000, 0.000...
## $ KIDSDRIV    <int> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ AGE         <int> 60, 43, 35, 51, 50, 34, 54, 37, 34, 50, 53, 43, 55...
## $ HOMEKIDS    <int> 0, 0, 1, 0, 0, 1, 0, 2, 0, 0, 0, 0, 0, 0, 0, 3, 0,...
## $ YOJ         <int> 11, 11, 10, 14, NA, 12, NA, NA, 10, 7, 14, 5, 11, ...
## $ INCOME      <fct> "$67,349", "$91,449", "$16,039", "", "$114,986", "...
## $ PARENT1     <fct> No, No, No, No, No, Yes, No, No, No, No, No, No, N...
## $ HOME_VAL    <fct> "$0", "$257,252", "$124,191", "$306,251", "$243,92...
## $ MSTATUS     <fct> z_No, z_No, Yes, Yes, Yes, z_No, Yes, Yes, z_No, z...
## $ SEX         <fct> M, M, z_F, M, z_F, z_F, z_F, M, z_F, M, z_F, z_F, ...
## $ EDUCATION   <fct> PhD, z_High School, z_High School, <High School, P...
## $ JOB         <fct> Professional, z_Blue Collar, Clerical, z_Blue Coll...
## $ TRAVTIME    <int> 14, 22, 5, 32, 36, 46, 33, 44, 34, 48, 15, 36, 25,...
## $ CAR_USE     <fct> Private, Commercial, Private, Private, Private, Co...
## $ BLUEBOOK    <fct> "$14,230", "$14,940", "$4,010", "$15,440", "$18,00...
## $ TIF         <int> 11, 1, 4, 7, 1, 1, 1, 1, 1, 7, 1, 7, 7, 6, 1, 6, 6...
## $ CAR_TYPE    <fct> Minivan, Minivan, z_SUV, Minivan, z_SUV, Sports Ca...
## $ RED_CAR     <fct> yes, yes, no, yes, no, no, no, yes, no, no, no, no...
## $ OLDCLAIM    <fct> "$4,461", "$0", "$38,690", "$0", "$19,217", "$0", ...
## $ CLM_FREQ    <int> 2, 0, 2, 0, 2, 0, 0, 1, 0, 0, 0, 0, 2, 0, 0, 0, 0,...
## $ REVOKED     <fct> No, No, No, No, Yes, No, No, Yes, No, No, No, No, ...
## $ MVR_PTS     <int> 3, 0, 3, 0, 3, 0, 0, 10, 0, 1, 0, 0, 3, 3, 3, 0, 0...
## $ CAR_AGE     <int> 18, 1, 10, 6, 17, 7, 1, 7, 1, 17, 11, 1, 9, 10, 5,...
## $ URBANICITY  <fct> Highly Urban/ Urban, Highly Urban/ Urban, Highly U...
##      INDEX        TARGET_FLAG       TARGET_AMT        KIDSDRIV     
##  Min.   :    1   Min.   :0.0000   Min.   :     0   Min.   :0.0000  
##  1st Qu.: 2559   1st Qu.:0.0000   1st Qu.:     0   1st Qu.:0.0000  
##  Median : 5133   Median :0.0000   Median :     0   Median :0.0000  
##  Mean   : 5152   Mean   :0.2638   Mean   :  1504   Mean   :0.1711  
##  3rd Qu.: 7745   3rd Qu.:1.0000   3rd Qu.:  1036   3rd Qu.:0.0000  
##  Max.   :10302   Max.   :1.0000   Max.   :107586   Max.   :4.0000  
##                                                                    
##       AGE           HOMEKIDS           YOJ            INCOME    
##  Min.   :16.00   Min.   :0.0000   Min.   : 0.0   $0      : 615  
##  1st Qu.:39.00   1st Qu.:0.0000   1st Qu.: 9.0           : 445  
##  Median :45.00   Median :0.0000   Median :11.0   $26,840 :   4  
##  Mean   :44.79   Mean   :0.7212   Mean   :10.5   $48,509 :   4  
##  3rd Qu.:51.00   3rd Qu.:1.0000   3rd Qu.:13.0   $61,790 :   4  
##  Max.   :81.00   Max.   :5.0000   Max.   :23.0   $107,375:   3  
##  NA's   :6                        NA's   :454    (Other) :7086  
##  PARENT1        HOME_VAL    MSTATUS      SEX               EDUCATION   
##  No :7084   $0      :2294   Yes :4894   M  :3786   <High School :1203  
##  Yes:1077           : 464   z_No:3267   z_F:4375   Bachelors    :2242  
##             $111,129:   3                          Masters      :1658  
##             $115,249:   3                          PhD          : 728  
##             $123,109:   3                          z_High School:2330  
##             $153,061:   3                                              
##             (Other) :5391                                              
##             JOB          TRAVTIME            CAR_USE        BLUEBOOK   
##  z_Blue Collar:1825   Min.   :  5.00   Commercial:3029   $1,500 : 157  
##  Clerical     :1271   1st Qu.: 22.00   Private   :5132   $6,000 :  34  
##  Professional :1117   Median : 33.00                     $5,800 :  33  
##  Manager      : 988   Mean   : 33.49                     $6,200 :  33  
##  Lawyer       : 835   3rd Qu.: 44.00                     $6,400 :  31  
##  Student      : 712   Max.   :142.00                     $5,900 :  30  
##  (Other)      :1413                                      (Other):7843  
##       TIF                CAR_TYPE    RED_CAR       OLDCLAIM   
##  Min.   : 1.000   Minivan    :2145   no :5783   $0     :5009  
##  1st Qu.: 1.000   Panel Truck: 676   yes:2378   $1,310 :   4  
##  Median : 4.000   Pickup     :1389              $1,391 :   4  
##  Mean   : 5.351   Sports Car : 907              $4,263 :   4  
##  3rd Qu.: 7.000   Van        : 750              $1,105 :   3  
##  Max.   :25.000   z_SUV      :2294              $1,332 :   3  
##                                                 (Other):3134  
##     CLM_FREQ      REVOKED       MVR_PTS          CAR_AGE      
##  Min.   :0.0000   No :7161   Min.   : 0.000   Min.   :-3.000  
##  1st Qu.:0.0000   Yes:1000   1st Qu.: 0.000   1st Qu.: 1.000  
##  Median :0.0000              Median : 1.000   Median : 8.000  
##  Mean   :0.7986              Mean   : 1.696   Mean   : 8.328  
##  3rd Qu.:2.0000              3rd Qu.: 3.000   3rd Qu.:12.000  
##  Max.   :5.0000              Max.   :13.000   Max.   :28.000  
##                                               NA's   :510     
##                  URBANICITY  
##  Highly Urban/ Urban  :6492  
##  z_Highly Rural/ Rural:1669  
##                              
##                              
##                              
##                              
## 

In the bar chart below, the KIDSDRIV, HOMEKIDS, and PARENT1 variables show that customers without kids account for more crashes in absolute numbers. We do not see a strong effect of an individual’s sex, marital status, or car use on crashes. However, customers whose highest education is high school, blue-collar employees, and SUV owners get into more crashes. Lastly, individuals whose license has been revoked, or who live in an urban area, have a higher chance of a crash.

In the histograms below, we see that several variables have a high number of zeros. AGE is the only variable that is approximately normally distributed; the rest show some skewness. We will perform a Box-Cox transformation on these variables later.

In the box plots below, we see that BLUEBOOK, INCOME, and OLDCLAIM have a high number of outliers compared to the other variables. We also see that people with older cars, higher home values, higher incomes, or longer tenure as customers get into fewer crashes, while people with motor vehicle record points or a high number of old claims tend to get into more crashes.

In the correlation table and plot below, we see that MVR_PTS, CLM_FREQ, and OLDCLAIM are the variables most positively correlated with our response variables, whereas URBANICITY is the most negatively correlated. The rest of the variables are only weakly correlated. (Variables that still contain missing values do not appear in the ranking; see the sketch after the table.)
Correlation
TARGET_FLAG TARGET_AMT
TARGET_FLAG 1.0000000 1.0000000
TARGET_AMT 0.8334240 0.8334240
MVR_PTS 0.2191323 0.1970216
CLM_FREQ 0.2161961 0.1741927
OLDCLAIM 0.1947302 0.1611626
PARENT1 0.1576222 0.1359305
REVOKED 0.1519391 0.1263285
MSTATUS 0.1351248 0.1214701
HOMEKIDS 0.1156210 0.1008356
KIDSDRIV 0.1036683 0.0877148
CAR_TYPE 0.1023650 0.0797487
JOB 0.0612262 0.0488313
TRAVTIME 0.0492559 0.0401971
EDUCATION 0.0428730 0.0397864
SEX 0.0210786 0.0088270
RED_CAR -0.0069473 0.0005877
TIF -0.0823431 -0.0683183
BLUEBOOK -0.1092768 -0.0709830
CAR_USE -0.1426737 -0.1287263
URBANICITY -0.2242509 -0.1904945
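
For reference, a sketch of how the ranking above was produced, mirroring the appendix code: every column, factors included, is first coerced to numeric codes, a rough device for correlating categorical variables. Note that any column still containing NAs yields an NA correlation and is silently dropped by sort(), which is why AGE, YOJ, INCOME, HOME_VAL, and CAR_AGE are absent from the table.

# coerce all columns (factors included) to numeric codes
insurance_corr <- data.frame(lapply(insurance_train_dist,
                                    function(x) as.numeric(as.factor(x))))
# correlation of every column with TARGET_FLAG; NA entries are dropped by sort()
sort(cor(insurance_corr)[, "TARGET_FLAG"], decreasing = TRUE)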


Data Preparation

We have already performed some data transformation for our visualizations. The INCOME, HOME_VAL, BLUEBOOK, and OLDCLAIM variables contain dollar signs and commas; these characters were removed and the values converted to integers using the stringr package. As we saw in the summary statistics, some of the variables have missing values. We will use the mice package with the random forest method to impute the missing data. mice performs multivariate imputation, and using multiple imputations helps resolve the uncertainty introduced by the missingness. Our response variables are excluded as predictors in the imputation model but are still imputed themselves. Lastly, we perform the same data preparation on the unseen evaluation dataset.
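
A condensed sketch of these two steps follows; the full versions, including the evaluation-set preparation, appear in the appendix code, and maxit = 0 here is simply a dry run used to fetch mice's default predictor matrix.

library(stringr)
library(mice)

# 1. strip "$" and "," from a currency field and convert to integer
insurance_train_dist$INCOME <- as.integer(
  str_replace_all(insurance_train_dist$INCOME, "[\\$,]", ""))

# 2. impute with random forests, excluding the responses as predictors
init  <- mice(insurance_train_dist, maxit = 0)   # dry run: defaults only
predM <- init$predictorMatrix
predM[, c("TARGET_FLAG", "TARGET_AMT")] <- 0     # responses predict nothing
insurance_imputed <- complete(
  mice(insurance_train_dist, method = "rf", predictorMatrix = predM))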

## 
##  iter imp variable
##   1   1  AGE  YOJ  INCOME  HOME_VAL  CAR_AGE
##   1   2  AGE  YOJ  INCOME  HOME_VAL  CAR_AGE
##   ... (5 iterations x 5 imputations over the same five variables)
##   5   5  AGE  YOJ  INCOME  HOME_VAL  CAR_AGE
## [1] "Missing value after imputation: 0"
## 
##  iter imp variable
##   1   1  AGE  YOJ  INCOME  HOME_VAL  CAR_AGE
##   ... (same pattern for the evaluation data)
##   5   5  AGE  YOJ  INCOME  HOME_VAL  CAR_AGE
## [1] "Missing value after imputation: 0"
In the table below we check for multicollinearity, and there is no cause for concern: every VIF score is below 3, well under the common cutoff of 5. A sketch of what the score measures follows the table.
VIF Score
TARGET_AMT 1.184646
KIDSDRIV 1.322455
AGE 1.408626
HOMEKIDS 2.068329
YOJ 1.223710
INCOME 2.720449
PARENT1 1.849722
HOME_VAL 2.506717
MSTATUS 2.013524
SEX 2.265299
EDUCATION 1.044088
JOB 1.157348
TRAVTIME 1.038854
CAR_USE 1.353302
BLUEBOOK 1.375440
TIF 1.009161
CAR_TYPE 1.409798
RED_CAR 1.809696
OLDCLAIM 2.201664
CLM_FREQ 2.131016
REVOKED 1.148628
MVR_PTS 1.249189
CAR_AGE 1.311790
URBANICITY 1.243781
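
For reference, this is what a VIF score measures, using AGE as a hypothetical example. The hand-rolled linear version below may differ somewhat from the table, since car::vif() in the appendix works from the fitted logistic model's coefficient covariances.

# VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor x_j
# on all of the other predictors
r2_age  <- summary(lm(AGE ~ . - TARGET_FLAG, data = insurance_vif))$r.squared
vif_age <- 1 / (1 - r2_age)   # should be close to the 1.41 reported for AGE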

Build Models

We will build at least three different multiple linear regression models and three different binary logistic regression models, using the original dataset, the imputed dataset, forward-and-backward stepwise variable selection, and a Box-Cox transformed dataset, to see which yields the best performance.
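
As a roadmap, the seven models fitted below correspond to the following calls. The names and data frames are those constructed in the appendix code, and stepAIC() comes from the MASS package.

model1 <- lm(TARGET_AMT ~ ., data = insurance_corr)           # original values
model2 <- lm(TARGET_AMT ~ ., data = insurance_vif)            # imputed values
model3 <- stepAIC(model2, direction = "both", trace = FALSE)  # stepwise selection
model4 <- lm(TARGET_AMT ~ ., data = in_bc_transformed)        # Box-Cox transformed
model5 <- glm(TARGET_FLAG ~ ., family = "binomial", data = logit_data)
model6 <- stepAIC(model5, direction = "both", trace = FALSE)  # stepwise selection
model7 <- glm(TARGET_FLAG ~ ., family = "binomial", data = in_bc_transformed1)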

Multiple Linear Regression: Model 1

In our linear regression model below, we see that the min/max and 1Q/3Q residuals have different magnitudes and the median is not close to zero, which suggests the residuals are skewed. That is not a good sign, but let's evaluate further. The p-values show that many of the variables are very unlikely to be irrelevant. However, the R-squared is about 0.16, meaning this model explains only about 16% of the variation in TARGET_AMT, and 1,713 observations were dropped due to missingness. Overall, I would say this is a poor model.

## 
## Call:
## lm(formula = TARGET_AMT ~ ., data = insurance_corr)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -898.90 -286.25 -134.43   62.85 1927.07 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.018e+02  7.820e+01   5.138 2.86e-07 ***
## KIDSDRIV     4.205e+01  1.318e+01   3.189  0.00143 ** 
## AGE         -2.046e-01  8.070e-01  -0.254  0.79986    
## HOMEKIDS     1.359e+01  7.532e+00   1.804  0.07121 .  
## YOJ         -3.491e-01  1.590e+00  -0.220  0.82625    
## INCOME      -2.270e-02  4.803e-03  -4.727 2.33e-06 ***
## PARENT1      7.414e+01  2.336e+01   3.173  0.00151 ** 
## HOME_VAL    -1.082e-02  5.505e-03  -1.965  0.04946 *  
## MSTATUS      7.301e+01  1.685e+01   4.332 1.50e-05 ***
## SEX         -9.673e+00  1.756e+01  -0.551  0.58178    
## EDUCATION    6.679e+00  4.132e+00   1.616  0.10606    
## JOB         -6.667e-01  2.347e+00  -0.284  0.77637    
## TRAVTIME     2.185e+00  3.772e-01   5.793 7.25e-09 ***
## CAR_USE     -1.472e+02  1.392e+01 -10.571  < 2e-16 ***
## BLUEBOOK    -2.513e-02  9.504e-03  -2.644  0.00821 ** 
## TIF         -7.286e+00  1.416e+00  -5.147 2.73e-07 ***
## CAR_TYPE     1.877e+01  3.517e+00   5.335 9.87e-08 ***
## RED_CAR     -1.768e+01  1.732e+01  -1.021  0.30740    
## OLDCLAIM    -4.355e-03  1.039e-02  -0.419  0.67518    
## CLM_FREQ     2.315e+01  7.358e+00   3.147  0.00166 ** 
## REVOKED      1.281e+02  1.915e+01   6.693 2.37e-11 ***
## MVR_PTS      2.597e+01  3.034e+00   8.560  < 2e-16 ***
## CAR_AGE     -3.795e+00  1.182e+00  -3.210  0.00133 ** 
## URBANICITY  -2.685e+02  1.590e+01 -16.891  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 470.1 on 6424 degrees of freedom
##   (1713 observations deleted due to missingness)
## Multiple R-squared:  0.1564, Adjusted R-squared:  0.1534 
## F-statistic: 51.79 on 23 and 6424 DF,  p-value: < 2.2e-16

Multiple Linear Regression: Model 2

Our second linear regression model uses the imputed dataset, so no observations are dropped. Again, the min/max and 1Q/3Q residuals have different magnitudes and the median is not close to zero, suggesting skewed residuals. The p-values show that many variables are very unlikely to be irrelevant, but the R-squared is still about 0.16: the model explains only about 16% of the variation. Overall, I would say this is also a poor model.

## 
## Call:
## lm(formula = TARGET_AMT ~ ., data = insurance_vif)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -910.71 -287.81 -135.00   64.62 1925.36 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.554e+02  7.012e+01   5.068 4.11e-07 ***
## KIDSDRIV     5.573e+01  1.172e+01   4.755 2.02e-06 ***
## AGE          9.965e-02  7.190e-01   0.139  0.88977    
## HOMEKIDS     1.395e+01  6.724e+00   2.075  0.03805 *  
## YOJ         -1.217e+00  1.415e+00  -0.860  0.38993    
## INCOME      -2.467e-02  4.229e-03  -5.833 5.65e-09 ***
## PARENT1      6.775e+01  2.096e+01   3.232  0.00123 ** 
## HOME_VAL    -1.074e-02  4.874e-03  -2.204  0.02755 *  
## MSTATUS      8.243e+01  1.509e+01   5.462 4.84e-08 ***
## SEX         -4.134e+00  1.576e+01  -0.262  0.79302    
## EDUCATION    7.852e+00  3.691e+00   2.127  0.03342 *  
## JOB          1.023e-01  2.094e+00   0.049  0.96105    
## TRAVTIME     2.060e+00  3.356e-01   6.140 8.62e-10 ***
## CAR_USE     -1.441e+02  1.247e+01 -11.558  < 2e-16 ***
## BLUEBOOK    -2.404e-02  8.496e-03  -2.829  0.00468 ** 
## TIF         -7.551e+00  1.263e+00  -5.980 2.32e-09 ***
## CAR_TYPE     1.679e+01  3.149e+00   5.333 9.93e-08 ***
## RED_CAR     -3.877e+00  1.545e+01  -0.251  0.80191    
## OLDCLAIM    -7.370e-03  9.185e-03  -0.802  0.42236    
## CLM_FREQ     2.144e+01  6.574e+00   3.261  0.00111 ** 
## REVOKED      1.308e+02  1.700e+01   7.692 1.62e-14 ***
## MVR_PTS      2.598e+01  2.704e+00   9.608  < 2e-16 ***
## CAR_AGE     -2.901e+00  1.044e+00  -2.779  0.00547 ** 
## URBANICITY  -2.727e+02  1.411e+01 -19.322  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 471.6 on 8137 degrees of freedom
## Multiple R-squared:  0.1559, Adjusted R-squared:  0.1535 
## F-statistic: 65.32 on 23 and 8137 DF,  p-value: < 2.2e-16

Multiple Linear Regression: Model 3

Our third linear regression model applies stepwise variable selection to the imputed-data model. The residual pattern is essentially unchanged and the R-squared remains about 0.16, but the retained variables show improved p-values. Overall, I would say this is a somewhat better model.

## 
## Call:
## lm(formula = TARGET_AMT ~ KIDSDRIV + HOMEKIDS + INCOME + PARENT1 + 
##     HOME_VAL + MSTATUS + EDUCATION + TRAVTIME + CAR_USE + BLUEBOOK + 
##     TIF + CAR_TYPE + CLM_FREQ + REVOKED + MVR_PTS + CAR_AGE + 
##     URBANICITY, data = insurance_vif)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -917.42 -287.33 -134.63   64.92 1931.21 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.434e+02  4.990e+01   6.882 6.35e-12 ***
## KIDSDRIV     5.631e+01  1.155e+01   4.875 1.11e-06 ***
## HOMEKIDS     1.285e+01  6.168e+00   2.084 0.037210 *  
## INCOME      -2.553e-02  4.079e-03  -6.260 4.05e-10 ***
## PARENT1      6.786e+01  2.085e+01   3.255 0.001138 ** 
## HOME_VAL    -1.063e-02  4.847e-03  -2.193 0.028357 *  
## MSTATUS      8.373e+01  1.502e+01   5.575 2.55e-08 ***
## EDUCATION    7.889e+00  3.655e+00   2.158 0.030923 *  
## TRAVTIME     2.065e+00  3.353e-01   6.158 7.74e-10 ***
## CAR_USE     -1.449e+02  1.136e+01 -12.758  < 2e-16 ***
## BLUEBOOK    -2.425e-02  8.266e-03  -2.933 0.003364 ** 
## TIF         -7.566e+00  1.262e+00  -5.997 2.10e-09 ***
## CAR_TYPE     1.666e+01  2.740e+00   6.080 1.25e-09 ***
## CLM_FREQ     1.808e+01  5.056e+00   3.576 0.000351 ***
## REVOKED      1.263e+02  1.606e+01   7.862 4.26e-15 ***
## MVR_PTS      2.570e+01  2.672e+00   9.616  < 2e-16 ***
## CAR_AGE     -2.824e+00  1.020e+00  -2.770 0.005618 ** 
## URBANICITY  -2.721e+02  1.409e+01 -19.312  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 471.5 on 8143 degrees of freedom
## Multiple R-squared:  0.1557, Adjusted R-squared:  0.1539 
## F-statistic: 88.34 on 17 and 8143 DF,  p-value: < 2.2e-16

Multiple Linear Regression: Model 4

As seen previously, our dataset has many skewed variables. A distribution is skewed when it is asymmetric, with a longer tail on one side; transforming such an attribute can reduce the skew and make the distribution more normal. The Box-Cox transform performs this operation (it assumes all values are positive). In our linear regression model below, the min/max and 1Q/3Q residuals have quite similar magnitudes and the median is close to zero, which is a good sign. The p-values show that many variables are very unlikely to be irrelevant, and the R-squared rises to about 0.22, meaning this model explains about 22% of the variation. Overall, I would say this is the best linear model.
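
For reference, a sketch of the transformation step as it appears in the appendix. caret's preProcess() estimates one lambda per column by maximum likelihood and, to my knowledge, skips columns that are not strictly positive.

library(caret)
# Box-Cox: y(lambda) = (y^lambda - 1) / lambda   for lambda != 0
#          y(lambda) = log(y)                    for lambda == 0
insurance_boxcox  <- preProcess(insurance_vif, method = "BoxCox")
in_bc_transformed <- predict(insurance_boxcox, insurance_vif)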

## 
## Call:
## lm(formula = TARGET_AMT ~ ., data = in_bc_transformed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.7329 -0.5445 -0.2221  0.5838  2.3677 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.1215962  0.1104381  10.156  < 2e-16 ***
## KIDSDRIV     0.4317947  0.0744265   5.802 6.81e-09 ***
## AGE         -0.0006829  0.0011812  -0.578 0.563185    
## HOMEKIDS     0.1090128  0.0602544   1.809 0.070455 .  
## YOJ          0.0002756  0.0006395   0.431 0.666510    
## INCOME      -0.0019959  0.0002553  -7.819 6.01e-15 ***
## PARENT1      0.1200899  0.0354768   3.385 0.000715 ***
## HOME_VAL    -0.0046780  0.0013364  -3.500 0.000467 ***
## MSTATUS      0.1349029  0.0255845   5.273 1.38e-07 ***
## SEX          0.0030302  0.0249102   0.122 0.903182    
## EDUCATION    0.0168041  0.0089188   1.884 0.059585 .  
## JOB          0.0009349  0.0033124   0.282 0.777774    
## TRAVTIME     0.0111793  0.0013990   7.991 1.52e-15 ***
## CAR_USE     -0.2715000  0.0197274 -13.763  < 2e-16 ***
## BLUEBOOK    -0.0012943  0.0002073  -6.243 4.52e-10 ***
## TIF         -0.0528704  0.0068450  -7.724 1.26e-14 ***
## CAR_TYPE     0.0524925  0.0076967   6.820 9.75e-12 ***
## RED_CAR     -0.0102472  0.0245293  -0.418 0.676139    
## OLDCLAIM    -0.0064209  0.0104941  -0.612 0.540645    
## CLM_FREQ     0.3705831  0.1402113   2.643 0.008232 ** 
## REVOKED      0.2550767  0.0259999   9.811  < 2e-16 ***
## MVR_PTS      0.1331920  0.0184078   7.236 5.06e-13 ***
## CAR_AGE     -0.0242995  0.0048696  -4.990 6.16e-07 ***
## URBANICITY  -0.5197828  0.0225098 -23.091  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7484 on 8137 degrees of freedom
## Multiple R-squared:  0.216,  Adjusted R-squared:  0.2138 
## F-statistic: 97.49 on 23 and 8137 DF,  p-value: < 2.2e-16

Binary Logistic Regression: Model 1

In our logistic regression model below, we see that the min/max and 1Q/3Q deviance residual magnitudes are fairly close and the median is near zero. This model shows many variables with significant p-values. We will see whether the AIC score improves in the following models.

## 
## Call:
## glm(formula = TARGET_FLAG ~ ., family = "binomial", data = logit_data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5089  -0.7261  -0.4149   0.6495   3.1124  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  4.780e-01  3.811e-01   1.254 0.209751    
## KIDSDRIV     3.786e-01  6.044e-02   6.264 3.76e-10 ***
## AGE         -2.564e-03  3.905e-03  -0.657 0.511451    
## HOMEKIDS     5.628e-02  3.656e-02   1.539 0.123708    
## YOJ         -8.098e-03  7.623e-03  -1.062 0.288094    
## INCOME      -1.414e-04  2.256e-05  -6.269 3.64e-10 ***
## PARENT1      3.684e-01  1.085e-01   3.395 0.000687 ***
## HOME_VAL    -9.188e-05  2.636e-05  -3.486 0.000490 ***
## MSTATUS      4.950e-01  8.285e-02   5.975 2.30e-09 ***
## SEX          1.282e-02  8.807e-02   0.146 0.884232    
## EDUCATION    3.345e-02  1.985e-02   1.685 0.091971 .  
## JOB         -7.524e-03  1.133e-02  -0.664 0.506471    
## TRAVTIME     1.529e-02  1.875e-03   8.154 3.51e-16 ***
## CAR_USE     -9.337e-01  6.841e-02 -13.650  < 2e-16 ***
## BLUEBOOK    -2.774e-04  4.702e-05  -5.899 3.66e-09 ***
## TIF         -5.433e-02  7.280e-03  -7.463 8.46e-14 ***
## CAR_TYPE     1.186e-01  1.790e-02   6.626 3.44e-11 ***
## RED_CAR     -2.828e-02  8.552e-02  -0.331 0.740894    
## OLDCLAIM    -4.679e-05  4.500e-05  -1.040 0.298476    
## CLM_FREQ     1.722e-01  3.207e-02   5.371 7.84e-08 ***
## REVOKED      7.663e-01  8.455e-02   9.063  < 2e-16 ***
## MVR_PTS      1.166e-01  1.359e-02   8.578  < 2e-16 ***
## CAR_AGE     -2.245e-02  5.798e-03  -3.871 0.000108 ***
## URBANICITY  -2.316e+00  1.126e-01 -20.571  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 9418.0  on 8160  degrees of freedom
## Residual deviance: 7415.9  on 8137  degrees of freedom
## AIC: 7463.9
## 
## Number of Fisher Scoring iterations: 5

Binary Logistic Regression: Model 2

In our logistic regression model below, we use a forward-and-backward stepwise variable selection algorithm. The min/max and 1Q/3Q deviance residual magnitudes are again fairly close and the median is near zero. The selected variables generally have stronger p-values, and the AIC improves slightly (7455.5 versus 7463.9 for Model 1).

## 
## Call:
## glm(formula = TARGET_FLAG ~ KIDSDRIV + HOMEKIDS + INCOME + PARENT1 + 
##     HOME_VAL + MSTATUS + EDUCATION + TRAVTIME + CAR_USE + BLUEBOOK + 
##     TIF + CAR_TYPE + CLM_FREQ + REVOKED + MVR_PTS + CAR_AGE + 
##     URBANICITY, family = "binomial", data = logit_data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5176  -0.7266  -0.4168   0.6501   3.0850  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  2.278e-01  2.677e-01   0.851 0.394917    
## KIDSDRIV     3.737e-01  5.943e-02   6.288 3.22e-10 ***
## HOMEKIDS     6.084e-02  3.356e-02   1.813 0.069875 .  
## INCOME      -1.486e-04  2.167e-05  -6.860 6.89e-12 ***
## PARENT1      3.793e-01  1.078e-01   3.519 0.000434 ***
## HOME_VAL    -9.399e-05  2.625e-05  -3.580 0.000343 ***
## MSTATUS      5.034e-01  8.250e-02   6.102 1.05e-09 ***
## EDUCATION    3.551e-02  1.968e-02   1.805 0.071097 .  
## TRAVTIME     1.527e-02  1.873e-03   8.154 3.53e-16 ***
## CAR_USE     -9.176e-01  6.211e-02 -14.774  < 2e-16 ***
## BLUEBOOK    -2.750e-04  4.592e-05  -5.987 2.13e-09 ***
## TIF         -5.426e-02  7.271e-03  -7.462 8.50e-14 ***
## CAR_TYPE     1.225e-01  1.547e-02   7.916 2.45e-15 ***
## CLM_FREQ     1.512e-01  2.519e-02   6.002 1.95e-09 ***
## REVOKED      7.352e-01  7.937e-02   9.263  < 2e-16 ***
## MVR_PTS      1.153e-01  1.342e-02   8.588  < 2e-16 ***
## CAR_AGE     -2.137e-02  5.639e-03  -3.789 0.000151 ***
## URBANICITY  -2.305e+00  1.122e-01 -20.540  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 9418.0  on 8160  degrees of freedom
## Residual deviance: 7419.5  on 8143  degrees of freedom
## AIC: 7455.5
## 
## Number of Fisher Scoring iterations: 5

Binary Logistic Regression: Model 3

In our logistic regression model below, fit on the Box-Cox transformed data, the min/max and 1Q/3Q deviance residual magnitudes are fairly close and the median is near zero. This model too shows many variables with significant p-values, and it has the best AIC score so far (7454).

## 
## Call:
## glm(formula = TARGET_FLAG ~ ., family = "binomial", data = in_bc_transformed1)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3319  -0.7280  -0.4164   0.6695   3.1449  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  1.5525873  0.3806356   4.079 4.52e-05 ***
## KIDSDRIV     1.4473495  0.2443105   5.924 3.14e-09 ***
## AGE         -0.0018742  0.0040506  -0.463 0.643578    
## HOMEKIDS     0.4318856  0.2133634   2.024 0.042952 *  
## YOJ          0.0014631  0.0022099   0.662 0.507918    
## INCOME      -0.0067541  0.0008645  -7.813 5.60e-15 ***
## PARENT1      0.2618492  0.1183408   2.213 0.026920 *  
## HOME_VAL    -0.0151869  0.0044261  -3.431 0.000601 ***
## MSTATUS      0.5254929  0.0896994   5.858 4.67e-09 ***
## SEX         -0.0105097  0.0877491  -0.120 0.904665    
## EDUCATION    0.0447810  0.0303002   1.478 0.139432    
## JOB         -0.0046605  0.0112656  -0.414 0.679096    
## TRAVTIME     0.0419614  0.0049896   8.410  < 2e-16 ***
## CAR_USE     -0.9209107  0.0681881 -13.505  < 2e-16 ***
## BLUEBOOK    -0.0046782  0.0007117  -6.573 4.93e-11 ***
## TIF         -0.1811725  0.0238212  -7.606 2.84e-14 ***
## CAR_TYPE     0.1998463  0.0278986   7.163 7.88e-13 ***
## RED_CAR     -0.0318369  0.0855220  -0.372 0.709695    
## OLDCLAIM    -0.0215595  0.0317266  -0.680 0.496796    
## CLM_FREQ     1.1600990  0.4234448   2.740 0.006150 ** 
## REVOKED      0.7456174  0.0811612   9.187  < 2e-16 ***
## MVR_PTS      0.4164365  0.0622056   6.695 2.16e-11 ***
## CAR_AGE     -0.0827487  0.0167869  -4.929 8.25e-07 ***
## URBANICITY  -2.2990567  0.1131799 -20.313  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 9418  on 8160  degrees of freedom
## Residual deviance: 7406  on 8137  degrees of freedom
## AIC: 7454
## 
## Number of Fisher Scoring iterations: 5

Select Models

Binary Logistic Regression Metrics

To choose a model for prediction, we compare various metrics across all three logistic models: accuracy, classification error rate, precision, sensitivity, specificity, F1 score, AUC, and the confusion matrix. Models 5 and 6 yield very similar values, and Model 5 has the highest AUC. We will pick Model 5, built on the imputed data, for our prediction.
Model 5 Model 6 Model 7
Accuracy 0.7861782 0.7869134 0.7859331
Class. Error Rate 0.2138218 0.2130866 0.2140669
Sensitivity 0.3989782 0.3985137 0.3934046
Specificity 0.9249334 0.9260985 0.9265979
Precision 0.6557252 0.6589862 0.6576087
F1 0.4961016 0.4966715 0.4922988
AUC 0.8056318 0.8053902 0.5770795
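
For reference, here is how the first six metrics derive from the four confusion-matrix cell counts. This is a sketch; the appendix uses caret::confusionMatrix(), which computes the same quantities, and TP, FP, TN, FN are simply the counts of true/false positives and negatives.

# classification metrics as functions of confusion-matrix cell counts
metrics_from_counts <- function(TP, FP, TN, FN) {
  accuracy <- (TP + TN) / (TP + FP + TN + FN)
  c(accuracy         = accuracy,
    class_error_rate = 1 - accuracy,
    sensitivity      = TP / (TP + FN),       # true positive rate (recall)
    specificity      = TN / (TN + FP),       # true negative rate
    precision        = TP / (TP + FP),       # positive predictive value
    F1               = 2 * TP / (2 * TP + FP + FN))
}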

## [1] "1779 not in a car crash and 362 in a car crash"

Multiple Linear Regression Metrics

In the residual plot below, we see that the variance of the residuals is not uniform, which indicates that the explanatory variables probably do not fully explain the data. In the quantile-quantile plot, the residuals deviate from a normal distribution. Therefore, I would say that overall this is not a strong model.
F-Statistic
MSE R-Squared value numdf dendf
Model 1 2.201851e+05 0.1564228 51.79085 23 6424
Model 2 2.217497e+05 0.1558662 65.32465 23 8137
Model 3 2.217906e+05 0.1557105 88.34093 17 8143
Model 4 5.584709e-01 0.2160308 97.48848 23 8137
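
The diagnostics above come from R's default plot method for lm objects; a minimal sketch for the chosen linear model:

# residuals vs. fitted (checks constant variance) and normal Q-Q (checks normality)
plot(model4, which = c(1, 2))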

Appendix: R Code

# knitr settings
knitr::opts_chunk$set(warning = F, 
                      message = F,
                      echo = F,
                      fig.align = "center")
# load libraries
library(kableExtra)
library(knitr)
library(tidyverse)
library(psych)
library(ggthemes)
library(ggpubr)
library(stringr)
library(corrplot)
library(RColorBrewer)
library(mice)
library(car)
library(MASS)
library(caret)
library(pROC)
vn <- c("INDEX", "TARGET_FLAG", "TARGET_AMT", "AGE", "BLUEBOOK", "CAR_AGE", "CAR_TYPE", "CAR_USE", "CLM_FREQ", "EDUCATION", "HOMEKIDS", "HOME_VAL", "INCOME", "JOB", "KIDSDRIV", "MSTATUS", "MVR_PTS", "OLDCLAIM", "PARENT1", "RED_CAR", "REVOKED", "SEX", "TIF", "TRAVTIME", "URBANICITY", "YOJ")
df <- c("Identification Variable (do not use)", "Was Car in a crash? 1=YES 0=NO", "If car was in a crash, what was the cost", "Age of Driver", "Value of Vehicle", "Vehicle Age", "Type of Car", "Vehicle Use", "# Claims (Past 5 Years)", "Max Education Level", "# Children at Home", "Home Value", "Income", "Job Category", "# Driving Children", "Marital Status", "Motor Vehicle Record Points", "Total Claims (Past 5 Years)", "Single Parent", "A Red Car", "License Revoked (Past 7 Years)", "Gender", "Time in Force", "Distance to Work", "Home/Work Area", "Years on Job")
te <- c("None", "None", "None", "Very young people tend to be risky. Maybe very old people also.", "Unknown effect on probability of collision, but probably effect the payout if there is a crash", "Unknown effect on probability of collision, but probably effect the payout if there is a crash", "Unknown effect on probability of collision, but probably effect the payout if there is a crash", "Commercial vehicles are driven more, so might increase probability of collision", "The more claims you filed in the past, the more you are likely to file in the future", "Unknown effect, but in theory more educated people tend to drive more safely", "Unknown effect", "In theory, home owners tend to drive more responsibly", "In theory, rich people tend to get into fewer crashes", "In theory, white collar jobs tend to be safer", "When teenagers drive your car, you are more likely to get into crashes", "In theory, married people drive more safely", "If you get lots of traffic tickets, you tend to get into more crashes", "If your total payout over the past five years was high, this suggests future payouts will be high", "Unknown effect", "Urban legend says that red cars (especially red sports cars) are more risky. Is that true?", "If your license was revoked in the past 7 years, you probably are a more risky driver", "Urban legend says that women have less crashes then men. Is that true?", "People who have been customers for a long time are usually more safe.", "Long drives to work usually suggest greater risk", "Unknown", "People who stay at a job for a long time are usually more safe")
kable(cbind(vn, df, te), col.names = c("Variable Name", "Definition", "Theoretical Effect")) %>% 
  kable_styling()
# load data
insurance_train <- read.csv("https://raw.githubusercontent.com/saayedalam/Data/master/insurance_training_data.csv")
insurance_test <- read.csv("https://raw.githubusercontent.com/saayedalam/Data/master/insurance-evaluation-data.csv")

# summary statistics
insurance_train %>% 
  glimpse() %>% 
  summary()
# change data type of some variables for visualization
insurance_train_dist <- insurance_train %>% 
  dplyr::select(-INDEX) %>% 
  mutate(TARGET_FLAG = as.factor(TARGET_FLAG),
         KIDSDRIV = as.factor(KIDSDRIV),
         HOMEKIDS = as.factor(HOMEKIDS),
         PARENT1 = as.factor(PARENT1),
         CLM_FREQ = as.factor(CLM_FREQ),
         INCOME = str_replace_all(INCOME, "[\\$,]", ""),
         HOME_VAL = str_replace_all(HOME_VAL, "[\\$,]", ""),
         BLUEBOOK = str_replace_all(BLUEBOOK, "[\\$,]", ""),
         OLDCLAIM = str_replace_all(OLDCLAIM, "[\\$,]", ""),
         OLDCLAIM = as.integer(OLDCLAIM),
         BLUEBOOK = as.integer(BLUEBOOK),
         HOME_VAL = as.integer(HOME_VAL),
         INCOME = as.integer(INCOME))

# distribution of discrete variables
k <- insurance_train_dist %>% 
  ggplot(aes(KIDSDRIV)) +
  geom_bar(aes(fill = TARGET_FLAG), position = position_dodge())

hk <- insurance_train_dist %>% 
  ggplot(aes(HOMEKIDS)) +
  geom_bar(aes(fill = TARGET_FLAG), position = position_dodge())

p <- insurance_train_dist %>% 
  ggplot(aes(PARENT1)) +
  geom_bar(aes(fill = TARGET_FLAG), position = position_dodge()) 

m <- insurance_train_dist %>% 
  ggplot(aes(MSTATUS)) +
  geom_bar(aes(fill = TARGET_FLAG), position = position_dodge())

s <- insurance_train_dist %>% 
  ggplot(aes(SEX)) +
  geom_bar(aes(fill = TARGET_FLAG), position = position_dodge())

e <- insurance_train_dist %>% 
  ggplot(aes(EDUCATION)) +
  geom_bar(aes(fill = TARGET_FLAG), position = position_dodge())

j <- insurance_train_dist %>% 
  ggplot(aes(JOB)) +
  geom_bar(aes(fill = TARGET_FLAG), position = position_dodge()) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

cu <- insurance_train_dist %>% 
  ggplot(aes(CAR_USE)) +
  geom_bar(aes(fill = TARGET_FLAG), position = position_dodge())

ct <- insurance_train_dist %>% 
  ggplot(aes(CAR_TYPE)) +
  geom_bar(aes(fill = TARGET_FLAG), position = position_dodge()) 

rc <- insurance_train_dist %>% 
  ggplot(aes(RED_CAR)) +
  geom_bar(aes(fill = TARGET_FLAG), position = position_dodge())

r <- insurance_train_dist %>% 
  ggplot(aes(REVOKED)) +
  geom_bar(aes(fill = TARGET_FLAG), position = position_dodge())

uc <- insurance_train_dist %>% 
  ggplot(aes(URBANICITY)) +
  geom_bar(aes(fill = TARGET_FLAG), position = position_dodge()) 

cf <- insurance_train_dist %>% 
  ggplot(aes(CLM_FREQ)) +
  geom_bar(aes(fill = TARGET_FLAG), position = position_dodge())

ggarrange(k + rremove("legend"), hk + rremove("legend"), p + rremove("legend"), m + rremove("legend"), s + rremove("legend"), e + rremove("legend"), j + rremove("legend"), cu + rremove("legend"), ct + rremove("legend"), rc + rremove("legend"), r + rremove("legend"), uc + rremove("legend"), cf + rremove("legend"), ncol = 2, nrow = 7)
# change data type of some variables for visualization
distribution <- insurance_train_dist %>% 
  dplyr::select(c("TARGET_FLAG", "AGE", "YOJ", "INCOME", "HOME_VAL", "TRAVTIME", "BLUEBOOK", "TIF", "OLDCLAIM", "MVR_PTS", "CAR_AGE")) %>% 
  gather(key, value, -TARGET_FLAG) %>% 
  mutate(value = as.integer(value),
         key = as.factor(key),
         TARGET_FLAG = as.factor(TARGET_FLAG))

# histogram of continous variables
distribution %>% 
  ggplot(aes(value)) +
  facet_wrap(~key, scale = "free",  ncol = 3) +
  geom_histogram(binwidth = function(x) 2 * IQR(x) / (length(x)^(1/3)), fill="#56B4E9") +
  theme_minimal()
# boxplot of continous variables
distribution %>% 
  ggplot(aes(x = key, y = value)) +
  geom_boxplot(aes(fill = TARGET_FLAG)) +
  facet_wrap(~ key, scales = 'free', ncol = 3) +
  scale_fill_manual(values=c("#999999", "#E69F00")) +
  theme_minimal()
# change all variable's data type for correlation
insurance_corr <- data.frame(lapply(insurance_train_dist, function(x) as.numeric(as.factor(x))))

# top correlated variables
a <- sort(cor(dplyr::select(insurance_corr, TARGET_FLAG, everything()))[,1], decreasing = T)
b <- sort(cor(dplyr::select(insurance_corr, TARGET_AMT, everything()))[,1], decreasing = T)
kable(cbind(a, b), col.names = c("TARGET_FLAG", "TARGET_AMT")) %>% 
  kable_styling(full_width = F) %>% 
  add_header_above(c(" ", "Correlation" = 2))

# correlation plot
corrplot(cor(dplyr::select(drop_na(insurance_corr), everything())), 
         method = "number", 
         type = "lower",
         col = brewer.pal(n = 12, name = "Paired"), # "Paired" supplies at most 12 colors
         number.cex = .7, tl.cex = .7,
         tl.col = "black", tl.srt = 45)
# imputing training data
init <- mice(insurance_train_dist)
meth <- init$method
predM <- init$predictorMatrix
predM[, c("TARGET_FLAG", "TARGET_AMT")] <- 0 #this code will remove the variable as a predictor but still will be imputed
insurance_impute <- mice(insurance_train_dist, method = 'rf', predictorMatrix=predM)
insurance_imputed <- complete(insurance_impute)
print(paste0("Missing value after imputation: ", sum(is.na(insurance_imputed))))

# preparing evaluation data
insurance_test <- insurance_test %>% 
  dplyr::select(-c(TARGET_FLAG, TARGET_AMT, INDEX)) %>% 
  mutate(KIDSDRIV = as.factor(KIDSDRIV),
         HOMEKIDS = as.factor(HOMEKIDS),
         PARENT1 = as.factor(PARENT1),
         CLM_FREQ = as.factor(CLM_FREQ),
         INCOME = str_replace_all(INCOME, "[\\$,]", ""),
         HOME_VAL = str_replace_all(HOME_VAL, "[\\$,]", ""),
         BLUEBOOK = str_replace_all(BLUEBOOK, "[\\$,]", ""),
         OLDCLAIM = str_replace_all(OLDCLAIM, "[\\$,]", ""),
         OLDCLAIM = as.integer(OLDCLAIM),
         BLUEBOOK = as.integer(BLUEBOOK),
         HOME_VAL = as.integer(HOME_VAL),
         INCOME = as.integer(INCOME))

# imputing evaluation data
init <- mice(insurance_test)
meth <- init$method
predM <- init$predictorMatrix
insurance_eval_impute <- mice(insurance_test, method = 'rf', predictorMatrix=predM)
insurance_eval_imputed <- complete(insurance_eval_impute)
insurance_eval_imputed <- data.frame(lapply(insurance_eval_imputed, function(x) as.numeric(as.factor(x))))
print(paste0("Missing value after imputation: ", sum(is.na(insurance_eval_imputed))))
# check for multicollinearity
insurance_vif <- data.frame(lapply(insurance_imputed, function(x) as.numeric(as.factor(x))))
kable(car::vif(glm(TARGET_FLAG ~ ., data = insurance_vif)), col.names = c("VIF Score")) %>% 
  kable_styling(full_width = F)
# original value model
insurance_corr <- dplyr::select(insurance_corr, -"TARGET_FLAG")
model1 <- lm(TARGET_AMT ~ ., insurance_corr)
summary(model1)
# imputed model
insurance_vif <- dplyr::select(insurance_vif, -"TARGET_FLAG")
model2 <- lm(TARGET_AMT ~ ., insurance_vif)
summary(model2)
# stepwise variable selection model
model3 <- stepAIC(model2, direction = "both", trace = FALSE)
summary(model3)
# boxcox transformation model
insurance_boxcox <- preProcess(insurance_vif, c("BoxCox"))
in_bc_transformed <- predict(insurance_boxcox, insurance_vif)
model4 <- lm(TARGET_AMT ~ ., in_bc_transformed)
summary(model4)
# imputed data model
logit_data <- data.frame(lapply(insurance_imputed, function(x) as.numeric(as.factor(x)))) %>% 
  mutate(TARGET_FLAG = as.factor(TARGET_FLAG)) %>% 
  dplyr::select(-"TARGET_AMT")
  
model5 <- glm(TARGET_FLAG ~ ., family = "binomial", logit_data)
summary(model5)
# stepwise variable selection model
model6 <- stepAIC(model5, direction = "both", trace = FALSE)
summary(model6)
# boxcox transformation model
insurance_boxcox1 <- preProcess(logit_data, c("BoxCox"))
in_bc_transformed1 <- predict(insurance_boxcox1, logit_data)
model7 <- glm(TARGET_FLAG ~ ., family = "binomial", in_bc_transformed1)
summary(model7)
# comparing all binary logistic models using various measures
c1 <- confusionMatrix(as.factor(as.integer(fitted(model5) > .5)), as.factor(model5$y), positive = "1")
c2 <- confusionMatrix(as.factor(as.integer(fitted(model6) > .5)), as.factor(model6$y), positive = "1")
c3 <- confusionMatrix(as.factor(as.integer(fitted(model7) > .5)), as.factor(model7$y), positive = "1")

roc1 <- roc(logit_data$TARGET_FLAG,  predict(model5, logit_data, type = "response"))
roc2 <- roc(logit_data$TARGET_FLAG,  predict(model6, logit_data, type = "response"))
# model7 was fit on the Box-Cox transformed data, so it is scored on that data
roc3 <- roc(logit_data$TARGET_FLAG,  predict(model7, in_bc_transformed1, type = "response"))

metrics1 <- c(c1$overall[1], "Class. Error Rate" = 1 - as.numeric(c1$overall[1]), c1$byClass[c(1, 2, 5, 7)], AUC = roc1$auc)
metrics2 <- c(c2$overall[1], "Class. Error Rate" = 1 - as.numeric(c2$overall[1]), c2$byClass[c(1, 2, 5, 7)], AUC = roc2$auc)
metrics3 <- c(c3$overall[1], "Class. Error Rate" = 1 - as.numeric(c3$overall[1]), c3$byClass[c(1, 2, 5, 7)], AUC = roc3$auc)

kable(cbind(metrics1, metrics2, metrics3), col.names = c("Model 5", "Model 6", "Model 7"))  %>% 
  kable_styling(full_width = T)

# plotting roc curve of model 5
plot(roc(logit_data$TARGET_FLAG,  predict(model5, logit_data, type = "response")), print.auc = TRUE, main = "Model 5")

# predict crash probabilities for the evaluation set and classify at p > .5
pred_prob <- predict(model5, insurance_eval_imputed, type = "response")
eval <- table(as.integer(pred_prob > .5))
print(paste(eval[1], "not in a car crash", "and", eval[2], "in a car crash"))
# comparing all multiple linear regression models using various measures
a1 <- mean((summary(model1))$residuals^2)
a2 <- mean((summary(model2))$residuals^2)
a3 <- mean((summary(model3))$residuals^2)
a4 <- mean((summary(model4))$residuals^2)
a5 <- rbind(a1, a2, a3, a4)
 
b1 <- summary(model1)$r.squared
b2 <- summary(model2)$r.squared
b3 <- summary(model3)$r.squared
b4 <- summary(model4)$r.squared
b5 <- rbind(b1, b2, b3, b4)

c1 <- summary(model1)$fstatistic
c2 <- summary(model2)$fstatistic
c3 <- summary(model3)$fstatistic
c4 <- summary(model4)$fstatistic
c5 <- rbind(c1, c2, c3, c4)

mlr_metrics <- data.frame(cbind(a5, b5, c5), row.names = c("Model 1", "Model 2", "Model 3", "Model 4"))
colnames(mlr_metrics) <- c("MSE", "R-Squared", "value", "numdf", "dendf")
kable(mlr_metrics) %>% 
  kable_styling(full_width = T) %>% 
  add_header_above(c(" ", " " = 2, "F-Statistic" = 3))

# residual plot
plot(model4)

# prediction
prediction <- predict(model4, insurance_eval_imputed, interval = "prediction")