| Variable Name | Definition | Theoretical Effect |
|---|---|---|
| INDEX | Identification Variable (do not use) | None |
| TARGET_FLAG | Was Car in a crash? 1=YES 0=NO | None |
| TARGET_AMT | If car was in a crash, what was the cost | None |
| AGE | Age of Driver | Very young people tend to be risky. Maybe very old people also. |
| BLUEBOOK | Value of Vehicle | Unknown effect on probability of collision, but probably effect the payout if there is a crash |
| CAR_AGE | Vehicle Age | Unknown effect on probability of collision, but probably effect the payout if there is a crash |
| CAR_TYPE | Type of Car | Unknown effect on probability of collision, but probably effect the payout if there is a crash |
| CAR_USE | Vehicle Use | Commercial vehicles are driven more, so might increase probability of collision |
| CLM_FREQ | # Claims (Past 5 Years) | The more claims you filed in the past, the more you are likely to file in the future |
| EDUCATION | Max Education Level | Unknown effect, but in theory more educated people tend to drive more safely |
| HOMEKIDS | # Children at Home | Unknown effect |
| HOME_VAL | Home Value | In theory, home owners tend to drive more responsibly |
| INCOME | Income | In theory, rich people tend to get into fewer crashes |
| JOB | Job Category | In theory, white collar jobs tend to be safer |
| KIDSDRIV | # Driving Children | When teenagers drive your car, you are more likely to get into crashes |
| MSTATUS | Marital Status | In theory, married people drive more safely |
| MVR_PTS | Motor Vehicle Record Points | If you get lots of traffic tickets, you tend to get into more crashes |
| OLDCLAIM | Total Claims (Past 5 Years) | If your total payout over the past five years was high, this suggests future payouts will be high |
| PARENT1 | Single Parent | Unknown effect |
| RED_CAR | A Red Car | Urban legend says that red cars (especially red sports cars) are more risky. Is that true? |
| REVOKED | License Revoked (Past 7 Years) | If your license was revoked in the past 7 years, you probably are a more risky driver |
| SEX | Gender | Urban legend says that women have less crashes then men. Is that true? |
| TIF | Time in Force | People who have been customers for a long time are usually more safe. |
| TRAVTIME | Distance to Work | Long drives to work usually suggest greater risk |
| URBANICITY | Home/Work Area | Unknown |
| YOJ | Years on Job | People who stay at a job for a long time are usually more safe |
Our dataset consists of 26 variables and 8161 observations with AGE, YOJ, and CAR_AGE variables containing some missing values. As stated previously, TARGET_FLAG and TARGET_AMT are our response variables. Also, 13 of the variables have discrete values and the rest of the variables are continuous. Lastly, summary function illustrates mean, quartile and other statistical analysis.
## Observations: 8,161
## Variables: 26
## $ INDEX <int> 1, 2, 4, 5, 6, 7, 8, 11, 12, 13, 14, 15, 16, 17, 1...
## $ TARGET_FLAG <int> 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0,...
## $ TARGET_AMT <dbl> 0.000, 0.000, 0.000, 0.000, 0.000, 2946.000, 0.000...
## $ KIDSDRIV <int> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ AGE <int> 60, 43, 35, 51, 50, 34, 54, 37, 34, 50, 53, 43, 55...
## $ HOMEKIDS <int> 0, 0, 1, 0, 0, 1, 0, 2, 0, 0, 0, 0, 0, 0, 0, 3, 0,...
## $ YOJ <int> 11, 11, 10, 14, NA, 12, NA, NA, 10, 7, 14, 5, 11, ...
## $ INCOME <fct> "$67,349", "$91,449", "$16,039", "", "$114,986", "...
## $ PARENT1 <fct> No, No, No, No, No, Yes, No, No, No, No, No, No, N...
## $ HOME_VAL <fct> "$0", "$257,252", "$124,191", "$306,251", "$243,92...
## $ MSTATUS <fct> z_No, z_No, Yes, Yes, Yes, z_No, Yes, Yes, z_No, z...
## $ SEX <fct> M, M, z_F, M, z_F, z_F, z_F, M, z_F, M, z_F, z_F, ...
## $ EDUCATION <fct> PhD, z_High School, z_High School, <High School, P...
## $ JOB <fct> Professional, z_Blue Collar, Clerical, z_Blue Coll...
## $ TRAVTIME <int> 14, 22, 5, 32, 36, 46, 33, 44, 34, 48, 15, 36, 25,...
## $ CAR_USE <fct> Private, Commercial, Private, Private, Private, Co...
## $ BLUEBOOK <fct> "$14,230", "$14,940", "$4,010", "$15,440", "$18,00...
## $ TIF <int> 11, 1, 4, 7, 1, 1, 1, 1, 1, 7, 1, 7, 7, 6, 1, 6, 6...
## $ CAR_TYPE <fct> Minivan, Minivan, z_SUV, Minivan, z_SUV, Sports Ca...
## $ RED_CAR <fct> yes, yes, no, yes, no, no, no, yes, no, no, no, no...
## $ OLDCLAIM <fct> "$4,461", "$0", "$38,690", "$0", "$19,217", "$0", ...
## $ CLM_FREQ <int> 2, 0, 2, 0, 2, 0, 0, 1, 0, 0, 0, 0, 2, 0, 0, 0, 0,...
## $ REVOKED <fct> No, No, No, No, Yes, No, No, Yes, No, No, No, No, ...
## $ MVR_PTS <int> 3, 0, 3, 0, 3, 0, 0, 10, 0, 1, 0, 0, 3, 3, 3, 0, 0...
## $ CAR_AGE <int> 18, 1, 10, 6, 17, 7, 1, 7, 1, 17, 11, 1, 9, 10, 5,...
## $ URBANICITY <fct> Highly Urban/ Urban, Highly Urban/ Urban, Highly U...
## INDEX TARGET_FLAG TARGET_AMT KIDSDRIV
## Min. : 1 Min. :0.0000 Min. : 0 Min. :0.0000
## 1st Qu.: 2559 1st Qu.:0.0000 1st Qu.: 0 1st Qu.:0.0000
## Median : 5133 Median :0.0000 Median : 0 Median :0.0000
## Mean : 5152 Mean :0.2638 Mean : 1504 Mean :0.1711
## 3rd Qu.: 7745 3rd Qu.:1.0000 3rd Qu.: 1036 3rd Qu.:0.0000
## Max. :10302 Max. :1.0000 Max. :107586 Max. :4.0000
##
## AGE HOMEKIDS YOJ INCOME
## Min. :16.00 Min. :0.0000 Min. : 0.0 $0 : 615
## 1st Qu.:39.00 1st Qu.:0.0000 1st Qu.: 9.0 : 445
## Median :45.00 Median :0.0000 Median :11.0 $26,840 : 4
## Mean :44.79 Mean :0.7212 Mean :10.5 $48,509 : 4
## 3rd Qu.:51.00 3rd Qu.:1.0000 3rd Qu.:13.0 $61,790 : 4
## Max. :81.00 Max. :5.0000 Max. :23.0 $107,375: 3
## NA's :6 NA's :454 (Other) :7086
## PARENT1 HOME_VAL MSTATUS SEX EDUCATION
## No :7084 $0 :2294 Yes :4894 M :3786 <High School :1203
## Yes:1077 : 464 z_No:3267 z_F:4375 Bachelors :2242
## $111,129: 3 Masters :1658
## $115,249: 3 PhD : 728
## $123,109: 3 z_High School:2330
## $153,061: 3
## (Other) :5391
## JOB TRAVTIME CAR_USE BLUEBOOK
## z_Blue Collar:1825 Min. : 5.00 Commercial:3029 $1,500 : 157
## Clerical :1271 1st Qu.: 22.00 Private :5132 $6,000 : 34
## Professional :1117 Median : 33.00 $5,800 : 33
## Manager : 988 Mean : 33.49 $6,200 : 33
## Lawyer : 835 3rd Qu.: 44.00 $6,400 : 31
## Student : 712 Max. :142.00 $5,900 : 30
## (Other) :1413 (Other):7843
## TIF CAR_TYPE RED_CAR OLDCLAIM
## Min. : 1.000 Minivan :2145 no :5783 $0 :5009
## 1st Qu.: 1.000 Panel Truck: 676 yes:2378 $1,310 : 4
## Median : 4.000 Pickup :1389 $1,391 : 4
## Mean : 5.351 Sports Car : 907 $4,263 : 4
## 3rd Qu.: 7.000 Van : 750 $1,105 : 3
## Max. :25.000 z_SUV :2294 $1,332 : 3
## (Other):3134
## CLM_FREQ REVOKED MVR_PTS CAR_AGE
## Min. :0.0000 No :7161 Min. : 0.000 Min. :-3.000
## 1st Qu.:0.0000 Yes:1000 1st Qu.: 0.000 1st Qu.: 1.000
## Median :0.0000 Median : 1.000 Median : 8.000
## Mean :0.7986 Mean : 1.696 Mean : 8.328
## 3rd Qu.:2.0000 3rd Qu.: 3.000 3rd Qu.:12.000
## Max. :5.0000 Max. :13.000 Max. :28.000
## NA's :510
## URBANICITY
## Highly Urban/ Urban :6492
## z_Highly Rural/ Rural:1669
##
##
##
##
##
In the bar chart below, KIDSDRIV, HOMEKIDS, PARENT1 variables tell us having no kids results in more car crash. We do not see a significant effect of individual’s sex, martial status or the type of care they use on car crash. However, high school students, blue collar employees, or SUV owners get into more car crash. Lastly, if the individual’s license is revoked or they are driving on an urban area, they have higher chance of car crash.
In the histogram plot below, we see several variables have high number of zeros. AGE is the only variable that is normally distributed. Rest of the variables show some skewness. We will perform Box-Cox transformation on these variables.
In the box plot below, we see BLUEBOOK, INCOME, OLDCLAIM have high number of outliers compared to other variables. We also see people with older car, higher home value, higher income or older customer get in to less car crash. However, people with motor vehicle record points or high number of old claims tend to get in to more car crash.
| TARGET_FLAG | TARGET_AMT | |
|---|---|---|
| TARGET_FLAG | 1.0000000 | 1.0000000 |
| TARGET_AMT | 0.8334240 | 0.8334240 |
| MVR_PTS | 0.2191323 | 0.1970216 |
| CLM_FREQ | 0.2161961 | 0.1741927 |
| OLDCLAIM | 0.1947302 | 0.1611626 |
| PARENT1 | 0.1576222 | 0.1359305 |
| REVOKED | 0.1519391 | 0.1263285 |
| MSTATUS | 0.1351248 | 0.1214701 |
| HOMEKIDS | 0.1156210 | 0.1008356 |
| KIDSDRIV | 0.1036683 | 0.0877148 |
| CAR_TYPE | 0.1023650 | 0.0797487 |
| JOB | 0.0612262 | 0.0488313 |
| TRAVTIME | 0.0492559 | 0.0401971 |
| EDUCATION | 0.0428730 | 0.0397864 |
| SEX | 0.0210786 | 0.0088270 |
| RED_CAR | -0.0069473 | 0.0005877 |
| TIF | -0.0823431 | -0.0683183 |
| BLUEBOOK | -0.1092768 | -0.0709830 |
| CAR_USE | -0.1426737 | -0.1287263 |
| URBANICITY | -0.2242509 | -0.1904945 |
We have already performed some data transformation for our visualization. INCOME, HOME_VAL, BLUEBOOK, OLDCLAIM variables have dollar sign and comma in them. These were removed (and converted as integer) using the stringr package. As we have seen in our statistical analysis, some of the variables have missing values. We will use the mice package and random forest method to impute the missing data. Mice uses multivariate imputations to estimate the missing values. Using multiple imputations helps in resolving the uncertainty for the missingness. Our response variables will be removed as predictor variables but still will be imputed. Lastly, we will perform the same data preparation for our unseen dataset.
##
## iter imp variable
## 1 1 AGE YOJ INCOME HOME_VAL CAR_AGE
## 1 2 AGE YOJ INCOME HOME_VAL CAR_AGE
## 1 3 AGE YOJ INCOME HOME_VAL CAR_AGE
## 1 4 AGE YOJ INCOME HOME_VAL CAR_AGE
## 1 5 AGE YOJ INCOME HOME_VAL CAR_AGE
## 2 1 AGE YOJ INCOME HOME_VAL CAR_AGE
## 2 2 AGE YOJ INCOME HOME_VAL CAR_AGE
## 2 3 AGE YOJ INCOME HOME_VAL CAR_AGE
## 2 4 AGE YOJ INCOME HOME_VAL CAR_AGE
## 2 5 AGE YOJ INCOME HOME_VAL CAR_AGE
## 3 1 AGE YOJ INCOME HOME_VAL CAR_AGE
## 3 2 AGE YOJ INCOME HOME_VAL CAR_AGE
## 3 3 AGE YOJ INCOME HOME_VAL CAR_AGE
## 3 4 AGE YOJ INCOME HOME_VAL CAR_AGE
## 3 5 AGE YOJ INCOME HOME_VAL CAR_AGE
## 4 1 AGE YOJ INCOME HOME_VAL CAR_AGE
## 4 2 AGE YOJ INCOME HOME_VAL CAR_AGE
## 4 3 AGE YOJ INCOME HOME_VAL CAR_AGE
## 4 4 AGE YOJ INCOME HOME_VAL CAR_AGE
## 4 5 AGE YOJ INCOME HOME_VAL CAR_AGE
## 5 1 AGE YOJ INCOME HOME_VAL CAR_AGE
## 5 2 AGE YOJ INCOME HOME_VAL CAR_AGE
## 5 3 AGE YOJ INCOME HOME_VAL CAR_AGE
## 5 4 AGE YOJ INCOME HOME_VAL CAR_AGE
## 5 5 AGE YOJ INCOME HOME_VAL CAR_AGE
##
## iter imp variable
## 1 1 AGE YOJ INCOME HOME_VAL CAR_AGE
## 1 2 AGE YOJ INCOME HOME_VAL CAR_AGE
## 1 3 AGE YOJ INCOME HOME_VAL CAR_AGE
## 1 4 AGE YOJ INCOME HOME_VAL CAR_AGE
## 1 5 AGE YOJ INCOME HOME_VAL CAR_AGE
## 2 1 AGE YOJ INCOME HOME_VAL CAR_AGE
## 2 2 AGE YOJ INCOME HOME_VAL CAR_AGE
## 2 3 AGE YOJ INCOME HOME_VAL CAR_AGE
## 2 4 AGE YOJ INCOME HOME_VAL CAR_AGE
## 2 5 AGE YOJ INCOME HOME_VAL CAR_AGE
## 3 1 AGE YOJ INCOME HOME_VAL CAR_AGE
## 3 2 AGE YOJ INCOME HOME_VAL CAR_AGE
## 3 3 AGE YOJ INCOME HOME_VAL CAR_AGE
## 3 4 AGE YOJ INCOME HOME_VAL CAR_AGE
## 3 5 AGE YOJ INCOME HOME_VAL CAR_AGE
## 4 1 AGE YOJ INCOME HOME_VAL CAR_AGE
## 4 2 AGE YOJ INCOME HOME_VAL CAR_AGE
## 4 3 AGE YOJ INCOME HOME_VAL CAR_AGE
## 4 4 AGE YOJ INCOME HOME_VAL CAR_AGE
## 4 5 AGE YOJ INCOME HOME_VAL CAR_AGE
## 5 1 AGE YOJ INCOME HOME_VAL CAR_AGE
## 5 2 AGE YOJ INCOME HOME_VAL CAR_AGE
## 5 3 AGE YOJ INCOME HOME_VAL CAR_AGE
## 5 4 AGE YOJ INCOME HOME_VAL CAR_AGE
## 5 5 AGE YOJ INCOME HOME_VAL CAR_AGE
## [1] "Missing value after imputation: 0"
##
## iter imp variable
## 1 1 AGE YOJ INCOME HOME_VAL CAR_AGE
## 1 2 AGE YOJ INCOME HOME_VAL CAR_AGE
## 1 3 AGE YOJ INCOME HOME_VAL CAR_AGE
## 1 4 AGE YOJ INCOME HOME_VAL CAR_AGE
## 1 5 AGE YOJ INCOME HOME_VAL CAR_AGE
## 2 1 AGE YOJ INCOME HOME_VAL CAR_AGE
## 2 2 AGE YOJ INCOME HOME_VAL CAR_AGE
## 2 3 AGE YOJ INCOME HOME_VAL CAR_AGE
## 2 4 AGE YOJ INCOME HOME_VAL CAR_AGE
## 2 5 AGE YOJ INCOME HOME_VAL CAR_AGE
## 3 1 AGE YOJ INCOME HOME_VAL CAR_AGE
## 3 2 AGE YOJ INCOME HOME_VAL CAR_AGE
## 3 3 AGE YOJ INCOME HOME_VAL CAR_AGE
## 3 4 AGE YOJ INCOME HOME_VAL CAR_AGE
## 3 5 AGE YOJ INCOME HOME_VAL CAR_AGE
## 4 1 AGE YOJ INCOME HOME_VAL CAR_AGE
## 4 2 AGE YOJ INCOME HOME_VAL CAR_AGE
## 4 3 AGE YOJ INCOME HOME_VAL CAR_AGE
## 4 4 AGE YOJ INCOME HOME_VAL CAR_AGE
## 4 5 AGE YOJ INCOME HOME_VAL CAR_AGE
## 5 1 AGE YOJ INCOME HOME_VAL CAR_AGE
## 5 2 AGE YOJ INCOME HOME_VAL CAR_AGE
## 5 3 AGE YOJ INCOME HOME_VAL CAR_AGE
## 5 4 AGE YOJ INCOME HOME_VAL CAR_AGE
## 5 5 AGE YOJ INCOME HOME_VAL CAR_AGE
##
## iter imp variable
## 1 1 AGE YOJ INCOME HOME_VAL CAR_AGE
## 1 2 AGE YOJ INCOME HOME_VAL CAR_AGE
## 1 3 AGE YOJ INCOME HOME_VAL CAR_AGE
## 1 4 AGE YOJ INCOME HOME_VAL CAR_AGE
## 1 5 AGE YOJ INCOME HOME_VAL CAR_AGE
## 2 1 AGE YOJ INCOME HOME_VAL CAR_AGE
## 2 2 AGE YOJ INCOME HOME_VAL CAR_AGE
## 2 3 AGE YOJ INCOME HOME_VAL CAR_AGE
## 2 4 AGE YOJ INCOME HOME_VAL CAR_AGE
## 2 5 AGE YOJ INCOME HOME_VAL CAR_AGE
## 3 1 AGE YOJ INCOME HOME_VAL CAR_AGE
## 3 2 AGE YOJ INCOME HOME_VAL CAR_AGE
## 3 3 AGE YOJ INCOME HOME_VAL CAR_AGE
## 3 4 AGE YOJ INCOME HOME_VAL CAR_AGE
## 3 5 AGE YOJ INCOME HOME_VAL CAR_AGE
## 4 1 AGE YOJ INCOME HOME_VAL CAR_AGE
## 4 2 AGE YOJ INCOME HOME_VAL CAR_AGE
## 4 3 AGE YOJ INCOME HOME_VAL CAR_AGE
## 4 4 AGE YOJ INCOME HOME_VAL CAR_AGE
## 4 5 AGE YOJ INCOME HOME_VAL CAR_AGE
## 5 1 AGE YOJ INCOME HOME_VAL CAR_AGE
## 5 2 AGE YOJ INCOME HOME_VAL CAR_AGE
## 5 3 AGE YOJ INCOME HOME_VAL CAR_AGE
## 5 4 AGE YOJ INCOME HOME_VAL CAR_AGE
## 5 5 AGE YOJ INCOME HOME_VAL CAR_AGE
## [1] "Missing value after imputation: 0"
In the table below we check for multicollinearity and there is no cause for concern because the VIF score is at a conservative level for all variables.
| VIF Score | |
|---|---|
| TARGET_AMT | 1.184646 |
| KIDSDRIV | 1.322455 |
| AGE | 1.408626 |
| HOMEKIDS | 2.068329 |
| YOJ | 1.223710 |
| INCOME | 2.720449 |
| PARENT1 | 1.849722 |
| HOME_VAL | 2.506717 |
| MSTATUS | 2.013524 |
| SEX | 2.265299 |
| EDUCATION | 1.044088 |
| JOB | 1.157348 |
| TRAVTIME | 1.038854 |
| CAR_USE | 1.353302 |
| BLUEBOOK | 1.375440 |
| TIF | 1.009161 |
| CAR_TYPE | 1.409798 |
| RED_CAR | 1.809696 |
| OLDCLAIM | 2.201664 |
| CLM_FREQ | 2.131016 |
| REVOKED | 1.148628 |
| MVR_PTS | 1.249189 |
| CAR_AGE | 1.311790 |
| URBANICITY | 1.243781 |
We will build at least three different multiple linear regression models and three different binary logistic regression models using the original dataset, the imputed dataset, forward and backward selected variables and a boxcox transformed dataset to see which one yields the best performance.
In our linear regression model below, we see the min-max and 1Q-3Q have different magnitudes and the median is not close to zero. This means is not good but lets do some more evaluation. The p-value below shows that the probability of this variables to be irrelevant is very low. Lastly, R-squared is 0.15, which means this model explains 15% of the data’s variation. Overall, I would say this is a bad model.
##
## Call:
## lm(formula = TARGET_AMT ~ ., data = insurance_corr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -898.90 -286.25 -134.43 62.85 1927.07
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.018e+02 7.820e+01 5.138 2.86e-07 ***
## KIDSDRIV 4.205e+01 1.318e+01 3.189 0.00143 **
## AGE -2.046e-01 8.070e-01 -0.254 0.79986
## HOMEKIDS 1.359e+01 7.532e+00 1.804 0.07121 .
## YOJ -3.491e-01 1.590e+00 -0.220 0.82625
## INCOME -2.270e-02 4.803e-03 -4.727 2.33e-06 ***
## PARENT1 7.414e+01 2.336e+01 3.173 0.00151 **
## HOME_VAL -1.082e-02 5.505e-03 -1.965 0.04946 *
## MSTATUS 7.301e+01 1.685e+01 4.332 1.50e-05 ***
## SEX -9.673e+00 1.756e+01 -0.551 0.58178
## EDUCATION 6.679e+00 4.132e+00 1.616 0.10606
## JOB -6.667e-01 2.347e+00 -0.284 0.77637
## TRAVTIME 2.185e+00 3.772e-01 5.793 7.25e-09 ***
## CAR_USE -1.472e+02 1.392e+01 -10.571 < 2e-16 ***
## BLUEBOOK -2.513e-02 9.504e-03 -2.644 0.00821 **
## TIF -7.286e+00 1.416e+00 -5.147 2.73e-07 ***
## CAR_TYPE 1.877e+01 3.517e+00 5.335 9.87e-08 ***
## RED_CAR -1.768e+01 1.732e+01 -1.021 0.30740
## OLDCLAIM -4.355e-03 1.039e-02 -0.419 0.67518
## CLM_FREQ 2.315e+01 7.358e+00 3.147 0.00166 **
## REVOKED 1.281e+02 1.915e+01 6.693 2.37e-11 ***
## MVR_PTS 2.597e+01 3.034e+00 8.560 < 2e-16 ***
## CAR_AGE -3.795e+00 1.182e+00 -3.210 0.00133 **
## URBANICITY -2.685e+02 1.590e+01 -16.891 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 470.1 on 6424 degrees of freedom
## (1713 observations deleted due to missingness)
## Multiple R-squared: 0.1564, Adjusted R-squared: 0.1534
## F-statistic: 51.79 on 23 and 6424 DF, p-value: < 2.2e-16
In our linear regression model below, we see the min-max and 1Q-3Q have different magnitudes and the median is not close to zero. This means is not good but lets do some more evaluation. The p-value below shows that the probability of this variables to be irrelevant is very low. Lastly, R-squared is 0.15, which means this model explains 15% of the data’s variation. Overall, I would say this is a bad model.
##
## Call:
## lm(formula = TARGET_AMT ~ ., data = insurance_vif)
##
## Residuals:
## Min 1Q Median 3Q Max
## -910.71 -287.81 -135.00 64.62 1925.36
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.554e+02 7.012e+01 5.068 4.11e-07 ***
## KIDSDRIV 5.573e+01 1.172e+01 4.755 2.02e-06 ***
## AGE 9.965e-02 7.190e-01 0.139 0.88977
## HOMEKIDS 1.395e+01 6.724e+00 2.075 0.03805 *
## YOJ -1.217e+00 1.415e+00 -0.860 0.38993
## INCOME -2.467e-02 4.229e-03 -5.833 5.65e-09 ***
## PARENT1 6.775e+01 2.096e+01 3.232 0.00123 **
## HOME_VAL -1.074e-02 4.874e-03 -2.204 0.02755 *
## MSTATUS 8.243e+01 1.509e+01 5.462 4.84e-08 ***
## SEX -4.134e+00 1.576e+01 -0.262 0.79302
## EDUCATION 7.852e+00 3.691e+00 2.127 0.03342 *
## JOB 1.023e-01 2.094e+00 0.049 0.96105
## TRAVTIME 2.060e+00 3.356e-01 6.140 8.62e-10 ***
## CAR_USE -1.441e+02 1.247e+01 -11.558 < 2e-16 ***
## BLUEBOOK -2.404e-02 8.496e-03 -2.829 0.00468 **
## TIF -7.551e+00 1.263e+00 -5.980 2.32e-09 ***
## CAR_TYPE 1.679e+01 3.149e+00 5.333 9.93e-08 ***
## RED_CAR -3.877e+00 1.545e+01 -0.251 0.80191
## OLDCLAIM -7.370e-03 9.185e-03 -0.802 0.42236
## CLM_FREQ 2.144e+01 6.574e+00 3.261 0.00111 **
## REVOKED 1.308e+02 1.700e+01 7.692 1.62e-14 ***
## MVR_PTS 2.598e+01 2.704e+00 9.608 < 2e-16 ***
## CAR_AGE -2.901e+00 1.044e+00 -2.779 0.00547 **
## URBANICITY -2.727e+02 1.411e+01 -19.322 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 471.6 on 8137 degrees of freedom
## Multiple R-squared: 0.1559, Adjusted R-squared: 0.1535
## F-statistic: 65.32 on 23 and 8137 DF, p-value: < 2.2e-16
In our linear regression model below, we see the min-max and 1Q-3Q have different magnitudes and the median is not close to zero. This means is not good but lets do some more evaluation. The p-value below shows that the probability of this variables to be irrelevant is very low. Lastly, R-squared is 0.15, which means this model explains 15% of the data’s variation. However, we see improved p-value for several variables. Overall, I would say this is a better model.
##
## Call:
## lm(formula = TARGET_AMT ~ KIDSDRIV + HOMEKIDS + INCOME + PARENT1 +
## HOME_VAL + MSTATUS + EDUCATION + TRAVTIME + CAR_USE + BLUEBOOK +
## TIF + CAR_TYPE + CLM_FREQ + REVOKED + MVR_PTS + CAR_AGE +
## URBANICITY, data = insurance_vif)
##
## Residuals:
## Min 1Q Median 3Q Max
## -917.42 -287.33 -134.63 64.92 1931.21
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.434e+02 4.990e+01 6.882 6.35e-12 ***
## KIDSDRIV 5.631e+01 1.155e+01 4.875 1.11e-06 ***
## HOMEKIDS 1.285e+01 6.168e+00 2.084 0.037210 *
## INCOME -2.553e-02 4.079e-03 -6.260 4.05e-10 ***
## PARENT1 6.786e+01 2.085e+01 3.255 0.001138 **
## HOME_VAL -1.063e-02 4.847e-03 -2.193 0.028357 *
## MSTATUS 8.373e+01 1.502e+01 5.575 2.55e-08 ***
## EDUCATION 7.889e+00 3.655e+00 2.158 0.030923 *
## TRAVTIME 2.065e+00 3.353e-01 6.158 7.74e-10 ***
## CAR_USE -1.449e+02 1.136e+01 -12.758 < 2e-16 ***
## BLUEBOOK -2.425e-02 8.266e-03 -2.933 0.003364 **
## TIF -7.566e+00 1.262e+00 -5.997 2.10e-09 ***
## CAR_TYPE 1.666e+01 2.740e+00 6.080 1.25e-09 ***
## CLM_FREQ 1.808e+01 5.056e+00 3.576 0.000351 ***
## REVOKED 1.263e+02 1.606e+01 7.862 4.26e-15 ***
## MVR_PTS 2.570e+01 2.672e+00 9.616 < 2e-16 ***
## CAR_AGE -2.824e+00 1.020e+00 -2.770 0.005618 **
## URBANICITY -2.721e+02 1.409e+01 -19.312 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 471.5 on 8143 degrees of freedom
## Multiple R-squared: 0.1557, Adjusted R-squared: 0.1539
## F-statistic: 88.34 on 17 and 8143 DF, p-value: < 2.2e-16
As seen previously, most of our dataset has many skewed variables. When an attribute has a normal distribution but is shifted, this is called a skew. The distribution of an attribute can be shifted to reduce the skew and make it more normal The Box Cox transform can perform this operation (assumes all values are positive). In our linear regression model below, we see the min-max and 1Q-3Q have quite similar magnitudes and the median is close to zero. This means this model is good but lets do some more evaluation. The p-value below shows that the probability of this variables to be irrelevant is very low. Lastly, R-squared is 0.22, which means this model explains 22% of the data’s variation. Overall, I would say this is the best model.
##
## Call:
## lm(formula = TARGET_AMT ~ ., data = in_bc_transformed)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.7329 -0.5445 -0.2221 0.5838 2.3677
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.1215962 0.1104381 10.156 < 2e-16 ***
## KIDSDRIV 0.4317947 0.0744265 5.802 6.81e-09 ***
## AGE -0.0006829 0.0011812 -0.578 0.563185
## HOMEKIDS 0.1090128 0.0602544 1.809 0.070455 .
## YOJ 0.0002756 0.0006395 0.431 0.666510
## INCOME -0.0019959 0.0002553 -7.819 6.01e-15 ***
## PARENT1 0.1200899 0.0354768 3.385 0.000715 ***
## HOME_VAL -0.0046780 0.0013364 -3.500 0.000467 ***
## MSTATUS 0.1349029 0.0255845 5.273 1.38e-07 ***
## SEX 0.0030302 0.0249102 0.122 0.903182
## EDUCATION 0.0168041 0.0089188 1.884 0.059585 .
## JOB 0.0009349 0.0033124 0.282 0.777774
## TRAVTIME 0.0111793 0.0013990 7.991 1.52e-15 ***
## CAR_USE -0.2715000 0.0197274 -13.763 < 2e-16 ***
## BLUEBOOK -0.0012943 0.0002073 -6.243 4.52e-10 ***
## TIF -0.0528704 0.0068450 -7.724 1.26e-14 ***
## CAR_TYPE 0.0524925 0.0076967 6.820 9.75e-12 ***
## RED_CAR -0.0102472 0.0245293 -0.418 0.676139
## OLDCLAIM -0.0064209 0.0104941 -0.612 0.540645
## CLM_FREQ 0.3705831 0.1402113 2.643 0.008232 **
## REVOKED 0.2550767 0.0259999 9.811 < 2e-16 ***
## MVR_PTS 0.1331920 0.0184078 7.236 5.06e-13 ***
## CAR_AGE -0.0242995 0.0048696 -4.990 6.16e-07 ***
## URBANICITY -0.5197828 0.0225098 -23.091 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7484 on 8137 degrees of freedom
## Multiple R-squared: 0.216, Adjusted R-squared: 0.2138
## F-statistic: 97.49 on 23 and 8137 DF, p-value: < 2.2e-16
In our logistic regression model below, we see min-max and 1Q-3Q magnitudes is quite close and the median is close to zero. This model shows many variables with significant p-value. We will see with following model whether AIC score improves or not.
##
## Call:
## glm(formula = TARGET_FLAG ~ ., family = "binomial", data = logit_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5089 -0.7261 -0.4149 0.6495 3.1124
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.780e-01 3.811e-01 1.254 0.209751
## KIDSDRIV 3.786e-01 6.044e-02 6.264 3.76e-10 ***
## AGE -2.564e-03 3.905e-03 -0.657 0.511451
## HOMEKIDS 5.628e-02 3.656e-02 1.539 0.123708
## YOJ -8.098e-03 7.623e-03 -1.062 0.288094
## INCOME -1.414e-04 2.256e-05 -6.269 3.64e-10 ***
## PARENT1 3.684e-01 1.085e-01 3.395 0.000687 ***
## HOME_VAL -9.188e-05 2.636e-05 -3.486 0.000490 ***
## MSTATUS 4.950e-01 8.285e-02 5.975 2.30e-09 ***
## SEX 1.282e-02 8.807e-02 0.146 0.884232
## EDUCATION 3.345e-02 1.985e-02 1.685 0.091971 .
## JOB -7.524e-03 1.133e-02 -0.664 0.506471
## TRAVTIME 1.529e-02 1.875e-03 8.154 3.51e-16 ***
## CAR_USE -9.337e-01 6.841e-02 -13.650 < 2e-16 ***
## BLUEBOOK -2.774e-04 4.702e-05 -5.899 3.66e-09 ***
## TIF -5.433e-02 7.280e-03 -7.463 8.46e-14 ***
## CAR_TYPE 1.186e-01 1.790e-02 6.626 3.44e-11 ***
## RED_CAR -2.828e-02 8.552e-02 -0.331 0.740894
## OLDCLAIM -4.679e-05 4.500e-05 -1.040 0.298476
## CLM_FREQ 1.722e-01 3.207e-02 5.371 7.84e-08 ***
## REVOKED 7.663e-01 8.455e-02 9.063 < 2e-16 ***
## MVR_PTS 1.166e-01 1.359e-02 8.578 < 2e-16 ***
## CAR_AGE -2.245e-02 5.798e-03 -3.871 0.000108 ***
## URBANICITY -2.316e+00 1.126e-01 -20.571 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 9418.0 on 8160 degrees of freedom
## Residual deviance: 7415.9 on 8137 degrees of freedom
## AIC: 7463.9
##
## Number of Fisher Scoring iterations: 5
In our logistic regression model below, we use forward and backward step-wise variables selection algorithm. We see min-max and 1Q-3Q magnitudes is quite close and the median is close to zero. This model’s variables selection is better with better p-value. However AIC score has not improved.
##
## Call:
## glm(formula = TARGET_FLAG ~ KIDSDRIV + HOMEKIDS + INCOME + PARENT1 +
## HOME_VAL + MSTATUS + EDUCATION + TRAVTIME + CAR_USE + BLUEBOOK +
## TIF + CAR_TYPE + CLM_FREQ + REVOKED + MVR_PTS + CAR_AGE +
## URBANICITY, family = "binomial", data = logit_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5176 -0.7266 -0.4168 0.6501 3.0850
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.278e-01 2.677e-01 0.851 0.394917
## KIDSDRIV 3.737e-01 5.943e-02 6.288 3.22e-10 ***
## HOMEKIDS 6.084e-02 3.356e-02 1.813 0.069875 .
## INCOME -1.486e-04 2.167e-05 -6.860 6.89e-12 ***
## PARENT1 3.793e-01 1.078e-01 3.519 0.000434 ***
## HOME_VAL -9.399e-05 2.625e-05 -3.580 0.000343 ***
## MSTATUS 5.034e-01 8.250e-02 6.102 1.05e-09 ***
## EDUCATION 3.551e-02 1.968e-02 1.805 0.071097 .
## TRAVTIME 1.527e-02 1.873e-03 8.154 3.53e-16 ***
## CAR_USE -9.176e-01 6.211e-02 -14.774 < 2e-16 ***
## BLUEBOOK -2.750e-04 4.592e-05 -5.987 2.13e-09 ***
## TIF -5.426e-02 7.271e-03 -7.462 8.50e-14 ***
## CAR_TYPE 1.225e-01 1.547e-02 7.916 2.45e-15 ***
## CLM_FREQ 1.512e-01 2.519e-02 6.002 1.95e-09 ***
## REVOKED 7.352e-01 7.937e-02 9.263 < 2e-16 ***
## MVR_PTS 1.153e-01 1.342e-02 8.588 < 2e-16 ***
## CAR_AGE -2.137e-02 5.639e-03 -3.789 0.000151 ***
## URBANICITY -2.305e+00 1.122e-01 -20.540 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 9418.0 on 8160 degrees of freedom
## Residual deviance: 7419.5 on 8143 degrees of freedom
## AIC: 7455.5
##
## Number of Fisher Scoring iterations: 5
In our logistic regression model below, we see min-max and 1Q-3Q magnitudes is quite close and the median is close to zero. This model too shows many variables with significant p-value. However, this model has the best AIC score so far.
##
## Call:
## glm(formula = TARGET_FLAG ~ ., family = "binomial", data = in_bc_transformed1)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3319 -0.7280 -0.4164 0.6695 3.1449
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.5525873 0.3806356 4.079 4.52e-05 ***
## KIDSDRIV 1.4473495 0.2443105 5.924 3.14e-09 ***
## AGE -0.0018742 0.0040506 -0.463 0.643578
## HOMEKIDS 0.4318856 0.2133634 2.024 0.042952 *
## YOJ 0.0014631 0.0022099 0.662 0.507918
## INCOME -0.0067541 0.0008645 -7.813 5.60e-15 ***
## PARENT1 0.2618492 0.1183408 2.213 0.026920 *
## HOME_VAL -0.0151869 0.0044261 -3.431 0.000601 ***
## MSTATUS 0.5254929 0.0896994 5.858 4.67e-09 ***
## SEX -0.0105097 0.0877491 -0.120 0.904665
## EDUCATION 0.0447810 0.0303002 1.478 0.139432
## JOB -0.0046605 0.0112656 -0.414 0.679096
## TRAVTIME 0.0419614 0.0049896 8.410 < 2e-16 ***
## CAR_USE -0.9209107 0.0681881 -13.505 < 2e-16 ***
## BLUEBOOK -0.0046782 0.0007117 -6.573 4.93e-11 ***
## TIF -0.1811725 0.0238212 -7.606 2.84e-14 ***
## CAR_TYPE 0.1998463 0.0278986 7.163 7.88e-13 ***
## RED_CAR -0.0318369 0.0855220 -0.372 0.709695
## OLDCLAIM -0.0215595 0.0317266 -0.680 0.496796
## CLM_FREQ 1.1600990 0.4234448 2.740 0.006150 **
## REVOKED 0.7456174 0.0811612 9.187 < 2e-16 ***
## MVR_PTS 0.4164365 0.0622056 6.695 2.16e-11 ***
## CAR_AGE -0.0827487 0.0167869 -4.929 8.25e-07 ***
## URBANICITY -2.2990567 0.1131799 -20.313 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 9418 on 8160 degrees of freedom
## Residual deviance: 7406 on 8137 degrees of freedom
## AIC: 7454
##
## Number of Fisher Scoring iterations: 5
| Model 5 | Model 6 | Model 7 | |
|---|---|---|---|
| Accuracy | 0.7861782 | 0.7869134 | 0.7859331 |
| Class. Error Rate | 0.2138218 | 0.2130866 | 0.2140669 |
| Sensitivity | 0.3989782 | 0.3985137 | 0.3934046 |
| Specificity | 0.9249334 | 0.9260985 | 0.9265979 |
| Precision | 0.6557252 | 0.6589862 | 0.6576087 |
| F1 | 0.4961016 | 0.4966715 | 0.4922988 |
| AUC | 0.8056318 | 0.8053902 | 0.5770795 |
## [1] "1779 not in a car crash and 362 in a car crash"
| MSE | R-Squared | value | numdf | dendf | |
|---|---|---|---|---|---|
| Model 1 | 2.201851e+05 | 0.1558662 | 51.79085 | 23 | 6424 |
| Model 2 | 2.217497e+05 | 0.1557105 | 65.32465 | 23 | 8137 |
| Model 3 | 2.217906e+05 | 0.1564228 | 88.34093 | 17 | 8143 |
| Model 4 | 5.584709e-01 | 0.2160308 | 97.48848 | 23 | 8137 |
# knitr settings
knitr::opts_chunk$set(warning = F,
message = F,
echo = F,
fig.align = "center")
# load libraries
library(kableExtra)
library(knitr)
library(tidyverse)
library(psych)
library(ggthemes)
library(ggpubr)
library(stringr)
library(corrplot)
library(RColorBrewer)
library(mice)
library(car)
library(MASS)
library(caret)
library(pROC)
vn <- c("INDEX", "TARGET_FLAG", "TARGET_AMT", "AGE", "BLUEBOOK", "CAR_AGE", "CAR_TYPE", "CAR_USE", "CLM_FREQ", "EDUCATION", "HOMEKIDS", "HOME_VAL", "INCOME", "JOB", "KIDSDRIV", "MSTATUS", "MVR_PTS", "OLDCLAIM", "PARENT1", "RED_CAR", "REVOKED", "SEX", "TIF", "TRAVTIME", "URBANICITY", "YOJ")
df <- c("Identification Variable (do not use)", "Was Car in a crash? 1=YES 0=NO", "If car was in a crash, what was the cost", "Age of Driver", "Value of Vehicle", "Vehicle Age", "Type of Car", "Vehicle Use", "# Claims (Past 5 Years)", "Max Education Level", "# Children at Home", "Home Value", "Income", "Job Category", "# Driving Children", "Marital Status", "Motor Vehicle Record Points", "Total Claims (Past 5 Years)", "Single Parent", "A Red Car", "License Revoked (Past 7 Years)", "Gender", "Time in Force", "Distance to Work", "Home/Work Area", "Years on Job")
te <- c("None", "None", "None", "Very young people tend to be risky. Maybe very old people also.", "Unknown effect on probability of collision, but probably effect the payout if there is a crash", "Unknown effect on probability of collision, but probably effect the payout if there is a crash", "Unknown effect on probability of collision, but probably effect the payout if there is a crash", "Commercial vehicles are driven more, so might increase probability of collision", "The more claims you filed in the past, the more you are likely to file in the future", "Unknown effect, but in theory more educated people tend to drive more safely", "Unknown effect", "In theory, home owners tend to drive more responsibly", "In theory, rich people tend to get into fewer crashes", "In theory, white collar jobs tend to be safer", "When teenagers drive your car, you are more likely to get into crashes", "In theory, married people drive more safely", "If you get lots of traffic tickets, you tend to get into more crashes", "If your total payout over the past five years was high, this suggests future payouts will be high", "Unknown effect", "Urban legend says that red cars (especially red sports cars) are more risky. Is that true?", "If your license was revoked in the past 7 years, you probably are a more risky driver", "Urban legend says that women have less crashes then men. Is that true?", "People who have been customers for a long time are usually more safe.", "Long drives to work usually suggest greater risk", "Unknown", "People who stay at a job for a long time are usually more safe")
kable(cbind(vn, df, te), col.names = c("Variable Name", "Definition", "Theoretical Effect")) %>%
kable_styling()
# load data
insurance_train <- read.csv("https://raw.githubusercontent.com/saayedalam/Data/master/insurance_training_data.csv")
insurance_test <- read.csv("https://raw.githubusercontent.com/saayedalam/Data/master/insurance-evaluation-data.csv")
# summary statistics
insurance_train %>%
glimpse() %>%
summary()
# change data type of some variables for visualization
insurance_train_dist <- insurance_train %>%
dplyr::select(-INDEX) %>%
mutate(TARGET_FLAG = as.factor(TARGET_FLAG),
KIDSDRIV = as.factor(KIDSDRIV),
HOMEKIDS = as.factor(HOMEKIDS),
PARENT1 = as.factor(PARENT1),
CLM_FREQ = as.factor(CLM_FREQ),
INCOME = str_replace_all(INCOME, "[\\$,]", ""),
HOME_VAL = str_replace_all(HOME_VAL, "[\\$,]", ""),
BLUEBOOK = str_replace_all(BLUEBOOK, "[\\$,]", ""),
OLDCLAIM = str_replace_all(OLDCLAIM, "[\\$,]", ""),
OLDCLAIM = as.integer(OLDCLAIM),
BLUEBOOK = as.integer(BLUEBOOK),
HOME_VAL = as.integer(HOME_VAL),
INCOME = as.integer(INCOME))
# distribution of discrete variables
k <- insurance_train_dist %>%
ggplot(aes(KIDSDRIV)) +
geom_bar(aes(fill = TARGET_FLAG), position = position_dodge())
hk <- insurance_train_dist %>%
ggplot(aes(HOMEKIDS)) +
geom_bar(aes(fill = TARGET_FLAG), position = position_dodge())
p <- insurance_train_dist %>%
ggplot(aes(PARENT1)) +
geom_bar(aes(fill = TARGET_FLAG), position = position_dodge())
m <- insurance_train_dist %>%
ggplot(aes(MSTATUS)) +
geom_bar(aes(fill = TARGET_FLAG), position = position_dodge())
s <- insurance_train_dist %>%
ggplot(aes(SEX)) +
geom_bar(aes(fill = TARGET_FLAG), position = position_dodge())
e <- insurance_train_dist %>%
ggplot(aes(EDUCATION)) +
geom_bar(aes(fill = TARGET_FLAG), position = position_dodge())
j <- insurance_train_dist %>%
ggplot(aes(JOB)) +
geom_bar(aes(fill = TARGET_FLAG), position = position_dodge()) +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
cu <- insurance_train_dist %>%
ggplot(aes(CAR_USE)) +
geom_bar(aes(fill = TARGET_FLAG), position = position_dodge())
ct <- insurance_train_dist %>%
ggplot(aes(CAR_TYPE)) +
geom_bar(aes(fill = TARGET_FLAG), position = position_dodge())
rc <- insurance_train_dist %>%
ggplot(aes(RED_CAR)) +
geom_bar(aes(fill = TARGET_FLAG), position = position_dodge())
r <- insurance_train_dist %>%
ggplot(aes(REVOKED)) +
geom_bar(aes(fill = TARGET_FLAG), position = position_dodge())
uc <- insurance_train_dist %>%
ggplot(aes(URBANICITY)) +
geom_bar(aes(fill = TARGET_FLAG), position = position_dodge())
cf <- insurance_train_dist %>%
ggplot(aes(CLM_FREQ)) +
geom_bar(aes(fill = TARGET_FLAG), position = position_dodge())
ggarrange(k + rremove("legend"), hk + rremove("legend"), p + rremove("legend"), m + rremove("legend"), s + rremove("legend"), e + rremove("legend"), j + rremove("legend"), cu + rremove("legend"), ct + rremove("legend"), rc + rremove("legend"), r + rremove("legend"), uc + rremove("legend"), cf + rremove("legend"), ncol = 2, nrow = 7)
# change data type of some variables for visualization
distribution <- insurance_train_dist %>%
dplyr::select(c("TARGET_FLAG", "AGE", "YOJ", "INCOME", "HOME_VAL", "TRAVTIME", "BLUEBOOK", "TIF", "OLDCLAIM", "MVR_PTS", "CAR_AGE")) %>%
gather(key, value, -TARGET_FLAG) %>%
mutate(value = as.integer(value),
key = as.factor(key),
TARGET_FLAG = as.factor(TARGET_FLAG))
# histogram of continous variables
distribution %>%
ggplot(aes(value)) +
facet_wrap(~key, scale = "free", ncol = 3) +
geom_histogram(binwidth = function(x) 2 * IQR(x) / (length(x)^(1/3)), fill="#56B4E9") +
theme_minimal()
# boxplot of continous variables
distribution %>%
ggplot(aes(x = key, y = value)) +
geom_boxplot(aes(fill = TARGET_FLAG)) +
facet_wrap(~ key, scales = 'free', ncol = 3) +
scale_fill_manual(values=c("#999999", "#E69F00")) +
theme_minimal()
# change all variable's data type for correlation
insurance_corr <- data.frame(lapply(insurance_train_dist, function(x) as.numeric(as.factor(x))))
# top correlated variables
a <- sort(cor(dplyr::select(insurance_corr, TARGET_FLAG, everything()))[,1], decreasing = T)
b <- sort(cor(dplyr::select(insurance_corr, TARGET_AMT, everything()))[,1], decreasing = T)
kable(cbind(a, b), col.names = c("TARGET_FLAG", "TARGET_AMT")) %>%
kable_styling(full_width = F) %>%
add_header_above(c(" ", "Correlation" = 2))
# correlation plot
corrplot(cor(dplyr::select(drop_na(insurance_corr), everything())),
method = "number",
type = "lower",
col = brewer.pal(n = 26, name = "Paired"),
number.cex = .7, tl.cex = .7,
tl.col = "black", tl.srt = 45)
# imputating train data
init <- mice(insurance_train_dist)
meth <- init$method
predM <- init$predictorMatrix
predM[, c("TARGET_FLAG", "TARGET_AMT")] <- 0 #this code will remove the variable as a predictor but still will be imputed
insurance_impute <- mice(insurance_train_dist, method = 'rf', predictorMatrix=predM)
insurance_imputed <- complete(insurance_impute)
print(paste0("Missing value after imputation: ", sum(is.na(insurance_imputed))))
# preparing evaluation data
insurance_test <- insurance_test %>%
dplyr::select(-c(TARGET_FLAG, TARGET_AMT, INDEX)) %>%
mutate(KIDSDRIV = as.factor(KIDSDRIV),
HOMEKIDS = as.factor(HOMEKIDS),
PARENT1 = as.factor(PARENT1),
CLM_FREQ = as.factor(CLM_FREQ),
INCOME = str_replace_all(INCOME, "[\\$,]", ""),
HOME_VAL = str_replace_all(HOME_VAL, "[\\$,]", ""),
BLUEBOOK = str_replace_all(BLUEBOOK, "[\\$,]", ""),
OLDCLAIM = str_replace_all(OLDCLAIM, "[\\$,]", ""),
OLDCLAIM = as.integer(OLDCLAIM),
BLUEBOOK = as.integer(BLUEBOOK),
HOME_VAL = as.integer(HOME_VAL),
INCOME = as.integer(INCOME))
# imputating evaluation data
init <- mice(insurance_test)
meth <- init$method
predM <- init$predictorMatrix
insurance_eval_impute <- mice(insurance_test, method = 'rf', predictorMatrix=predM)
insurance_eval_imputed <- complete(insurance_eval_impute)
insurance_eval_imputed <- data.frame(lapply(insurance_eval_imputed, function(x) as.numeric(as.factor(x))))
print(paste0("Missing value after imputation: ", sum(is.na(insurance_eval_imputed))))
# check for multicollinearity
insurance_vif <- data.frame(lapply(insurance_imputed, function(x) as.numeric(as.factor(x))))
kable((car::vif(glm(TARGET_FLAG ~. , data = insurance_vif))), col.names = c("VIF Score")) %>% #remove tax for high vif score
kable_styling(full_width = F)
# original value model
insurance_corr <- dplyr::select(insurance_corr, -"TARGET_FLAG")
model1 <- lm(TARGET_AMT ~ ., insurance_corr)
summary(model1)
# imputed model
insurance_vif <- dplyr::select(insurance_vif, -"TARGET_FLAG")
model2 <- lm(TARGET_AMT ~ ., insurance_vif)
summary(model2)
# stepwise transformed model
model3 <- stepAIC(model2, direction = "both", trace = FALSE)
summary(model3)
# boxcox transformation model
insurance_boxcox <- preProcess(insurance_vif, c("BoxCox"))
in_bc_transformed <- predict(insurance_boxcox, insurance_vif)
model4 <- lm(TARGET_AMT ~ ., in_bc_transformed)
summary(model4)
# original value model
logit_data <- data.frame(lapply(insurance_imputed, function(x) as.numeric(as.factor(x)))) %>%
mutate(TARGET_FLAG = as.factor(TARGET_FLAG)) %>%
dplyr::select(-"TARGET_AMT")
model5 <- glm(TARGET_FLAG ~ ., family = "binomial", logit_data)
summary(model5)
# stepwise transformed model
model6 <- stepAIC(model5, direction = "both", trace = FALSE)
summary(model6)
# boxcox transformation model
insurance_boxcox1 <- preProcess(logit_data, c("BoxCox"))
in_bc_transformed1 <- predict(insurance_boxcox1, logit_data)
model7 <- glm(TARGET_FLAG ~ ., family = "binomial", in_bc_transformed1)
summary(model7)
# comparing all binary logistic models using various measures
c1 <- confusionMatrix(as.factor(as.integer(fitted(model5) > .5)), as.factor(model5$y), positive = "1")
c2 <- confusionMatrix(as.factor(as.integer(fitted(model6) > .5)), as.factor(model6$y), positive = "1")
c3 <- confusionMatrix(as.factor(as.integer(fitted(model7) > .5)), as.factor(model7$y), positive = "1")
roc1 <- roc(logit_data$TARGET_FLAG, predict(model5, logit_data, interval = "prediction"))
roc2 <- roc(logit_data$TARGET_FLAG, predict(model6, logit_data, interval = "prediction"))
roc3 <- roc(logit_data$TARGET_FLAG, predict(model7, logit_data, interval = "prediction"))
metrics1 <- c(c1$overall[1], "Class. Error Rate" = 1 - as.numeric(c1$overall[1]), c1$byClass[c(1, 2, 5, 7)], AUC = roc1$auc)
metrics2 <- c(c2$overall[1], "Class. Error Rate" = 1 - as.numeric(c2$overall[1]), c2$byClass[c(1, 2, 5, 7)], AUC = roc2$auc)
metrics3 <- c(c3$overall[1], "Class. Error Rate" = 1 - as.numeric(c3$overall[1]), c3$byClass[c(1, 2, 5, 7)], AUC = roc3$auc)
kable(cbind(metrics1, metrics2, metrics3), col.names = c("Model 5", "Model 6", "Model 7")) %>%
kable_styling(full_width = T)
# plotting roc curve of model 3
plot(roc(logit_data$TARGET_FLAG, predict(model5, logit_data, interval = "prediction")), print.auc = TRUE, main = "Model 5" )
# predict
predict <- predict(model5, insurance_eval_imputed, interval = "prediction")
eval <- table(as.integer(predict > .5))
print(paste(eval[1], "not in a car crash", "and", eval[2], "in a car crash"))
# comparing all binary logistic models using various measures
a1 <- mean((summary(model1))$residuals^2)
a2 <- mean((summary(model2))$residuals^2)
a3 <- mean((summary(model3))$residuals^2)
a4 <- mean((summary(model4))$residuals^2)
a5 <- rbind(a1, a2, a3, a4)
b1 <- summary(model2)$r.squared
b2 <- summary(model3)$r.squared
b3 <- summary(model1)$r.squared
b4 <- summary(model4)$r.squared
b5 <- rbind(b1, b2, b3, b4)
c1 <- summary(model1)$fstatistic
c2 <- summary(model2)$fstatistic
c3 <- summary(model3)$fstatistic
c4 <- summary(model4)$fstatistic
c5 <- rbind(c1, c2, c3, c4)
mlr_metrics <- data.frame(cbind(a5, b5, c5), row.names = c("Model 1", "Model 2", "Model 3", "Model 4"))
colnames(mlr_metrics) <- c("MSE", "R-Squared", "value", "numdf", "dendf")
kable(mlr_metrics) %>%
kable_styling(full_width = T) %>%
add_header_above(c(" ", " " = 2, "F-Statistic" = 3))
# residual plot
plot(model4)
# prediction
prediction <- predict(model4, insurance_eval_imputed, interval = "prediction")