Initially, we’ll do a cursory exploration of the data. After that, we’ll iteratively prepare and explore the data, wherever required.
## [1] "Dimension of training set: Number of rows: 8161, Number of cols: 26"
## [1] "Head of training data set:"
## INDEX TARGET_FLAG TARGET_AMT KIDSDRIV AGE HOMEKIDS YOJ INCOME PARENT1
## 1 1 0 0 0 60 0 11 $67,349 No
## 2 2 0 0 0 43 0 11 $91,449 No
## 3 4 0 0 0 35 1 10 $16,039 No
## 4 5 0 0 0 51 0 14 No
## 5 6 0 0 0 50 0 NA $114,986 No
## 6 7 1 2946 0 34 1 12 $125,301 Yes
## HOME_VAL MSTATUS SEX EDUCATION JOB TRAVTIME CAR_USE BLUEBOOK
## 1 $0 z_No M PhD Professional 14 Private $14,230
## 2 $257,252 z_No M z_High School z_Blue Collar 22 Commercial $14,940
## 3 $124,191 Yes z_F z_High School Clerical 5 Private $4,010
## 4 $306,251 Yes M <High School z_Blue Collar 32 Private $15,440
## 5 $243,925 Yes z_F PhD Doctor 36 Private $18,000
## 6 $0 z_No z_F Bachelors z_Blue Collar 46 Commercial $17,430
## TIF CAR_TYPE RED_CAR OLDCLAIM CLM_FREQ REVOKED MVR_PTS CAR_AGE
## 1 11 Minivan yes $4,461 2 No 3 18
## 2 1 Minivan yes $0 0 No 0 1
## 3 4 z_SUV no $38,690 2 No 3 10
## 4 7 Minivan yes $0 0 No 0 6
## 5 1 z_SUV no $19,217 2 Yes 3 17
## 6 1 Sports Car no $0 0 No 0 7
## URBANICITY
## 1 Highly Urban/ Urban
## 2 Highly Urban/ Urban
## 3 Highly Urban/ Urban
## 4 Highly Urban/ Urban
## 5 Highly Urban/ Urban
## 6 Highly Urban/ Urban
## [1] "Structure of training data set:"
## 'data.frame': 8161 obs. of 26 variables:
## $ INDEX : int 1 2 4 5 6 7 8 11 12 13 ...
## $ TARGET_FLAG: int 0 0 0 0 0 1 0 1 1 0 ...
## $ TARGET_AMT : num 0 0 0 0 0 ...
## $ KIDSDRIV : int 0 0 0 0 0 0 0 1 0 0 ...
## $ AGE : int 60 43 35 51 50 34 54 37 34 50 ...
## $ HOMEKIDS : int 0 0 1 0 0 1 0 2 0 0 ...
## $ YOJ : int 11 11 10 14 NA 12 NA NA 10 7 ...
## $ INCOME : chr "$67,349" "$91,449" "$16,039" "" ...
## $ PARENT1 : chr "No" "No" "No" "No" ...
## $ HOME_VAL : chr "$0" "$257,252" "$124,191" "$306,251" ...
## $ MSTATUS : chr "z_No" "z_No" "Yes" "Yes" ...
## $ SEX : chr "M" "M" "z_F" "M" ...
## $ EDUCATION : chr "PhD" "z_High School" "z_High School" "<High School" ...
## $ JOB : chr "Professional" "z_Blue Collar" "Clerical" "z_Blue Collar" ...
## $ TRAVTIME : int 14 22 5 32 36 46 33 44 34 48 ...
## $ CAR_USE : chr "Private" "Commercial" "Private" "Private" ...
## $ BLUEBOOK : chr "$14,230" "$14,940" "$4,010" "$15,440" ...
## $ TIF : int 11 1 4 7 1 1 1 1 1 7 ...
## $ CAR_TYPE : chr "Minivan" "Minivan" "z_SUV" "Minivan" ...
## $ RED_CAR : chr "yes" "yes" "no" "yes" ...
## $ OLDCLAIM : chr "$4,461" "$0" "$38,690" "$0" ...
## $ CLM_FREQ : int 2 0 2 0 2 0 0 1 0 0 ...
## $ REVOKED : chr "No" "No" "No" "No" ...
## $ MVR_PTS : int 3 0 3 0 3 0 0 10 0 1 ...
## $ CAR_AGE : int 18 1 10 6 17 7 1 7 1 17 ...
## $ URBANICITY : chr "Highly Urban/ Urban" "Highly Urban/ Urban" "Highly Urban/ Urban" "Highly Urban/ Urban" ...
There are few fields, which have missing values, which we’ll investigate in greater details later.
At this stage, we’ll explore and prepare iteratively. First we’ll convert the fields, which are supposed to be numeric, into proper numeric format and strings into string format. After reformatting, we’ll check for NA. After that if required, we’ll impute them.
After that we’ll show some boxplots of the numeric fields.
Checking for NA.
## [1] TRUE
NA does exist. So, we’ll impute with mice().
Rechecking for NA after imputation.
## [1] FALSE
We observe that NA were removed. In the following, we’ll visualize with missmap().
Both is.na() and missmap() confirm that NA were eliminated.
Here, we’ll explore the data a little further. First, we’ll take a quick look at min, 1st quartile, median, mean, 2nd quartile, max etc.
## TARGET_FLAG TARGET_AMT KIDSDRIV AGE
## Min. :0.0000 Min. : 0 Min. :0.0000 Min. :16.00
## 1st Qu.:0.0000 1st Qu.: 0 1st Qu.:0.0000 1st Qu.:39.00
## Median :0.0000 Median : 0 Median :0.0000 Median :45.00
## Mean :0.2638 Mean : 1504 Mean :0.1711 Mean :44.78
## 3rd Qu.:1.0000 3rd Qu.: 1036 3rd Qu.:0.0000 3rd Qu.:51.00
## Max. :1.0000 Max. :107586 Max. :4.0000 Max. :81.00
## HOMEKIDS YOJ INCOME PARENT1
## Min. :0.0000 Min. : 0.0 Min. : 0 Length:8161
## 1st Qu.:0.0000 1st Qu.: 9.0 1st Qu.: 28172 Class :character
## Median :0.0000 Median :11.0 Median : 53895 Mode :character
## Mean :0.7212 Mean :10.5 Mean : 61787
## 3rd Qu.:1.0000 3rd Qu.:13.0 3rd Qu.: 85734
## Max. :5.0000 Max. :23.0 Max. :367030
## HOME_VAL MSTATUS SEX EDUCATION
## Min. : 0 Length:8161 Length:8161 Length:8161
## 1st Qu.: 0 Class :character Class :character Class :character
## Median :161166 Mode :character Mode :character Mode :character
## Mean :154958
## 3rd Qu.:238741
## Max. :885282
## JOB TRAVTIME CAR_USE BLUEBOOK
## Length:8161 Min. : 5.00 Length:8161 Min. : 1500
## Class :character 1st Qu.: 22.00 Class :character 1st Qu.: 9280
## Mode :character Median : 33.00 Mode :character Median :14440
## Mean : 33.49 Mean :15710
## 3rd Qu.: 44.00 3rd Qu.:20850
## Max. :142.00 Max. :69740
## TIF CAR_TYPE RED_CAR OLDCLAIM
## Min. : 1.000 Length:8161 Length:8161 Min. : 0
## 1st Qu.: 1.000 Class :character Class :character 1st Qu.: 0
## Median : 4.000 Mode :character Mode :character Median : 0
## Mean : 5.351 Mean : 4037
## 3rd Qu.: 7.000 3rd Qu.: 4636
## Max. :25.000 Max. :57037
## CLM_FREQ REVOKED MVR_PTS CAR_AGE
## Min. :0.0000 Length:8161 Min. : 0.000 Min. :-3.000
## 1st Qu.:0.0000 Class :character 1st Qu.: 0.000 1st Qu.: 1.000
## Median :0.0000 Mode :character Median : 1.000 Median : 8.000
## Mean :0.7986 Mean : 1.696 Mean : 8.354
## 3rd Qu.:2.0000 3rd Qu.: 3.000 3rd Qu.:12.000
## Max. :5.0000 Max. :13.000 Max. :28.000
## URBANICITY
## Length:8161
## Class :character
## Mode :character
##
##
##
For downstream analysis, we’ll reorder the columns into categorical, numeric and target.
First look at the boxplots.
The boxplots show that some of the variables have outliers in them. So, we’ll cap them.
The fields AGE, HOMEKIDS, INCOME, HOME_VAL, TRVTIME, BLUEBOOK, TIF, CLM_FREQ, MVR_PTS, CAR_AGE have higher variance. Let’s ignore the boxplots for TARGET_FLAG and TARGET_AMT.
Histograms tell us how the data is distributed in the dataset (numeric fields).
The histograms show that AGE, YOJ, INCOME, HOME_VAL, TRAVTIME, BLUEBOOK and CAR_AGE are approximately normally distributed. HOME_VALUE, CAR_AGE and CLM_FREQ are quite dispersed.
Now, we’ll explore the Categorical variables.
## PARENT1:
##
## No Yes
## 7084 1077
## MSTATUS:
##
## No Yes
## 3267 4894
## SEX:
##
## F M
## 4375 3786
## EDUCATION:
##
## <High School Bachelors High School Masters PhD
## 1203 2242 2330 1658 728
## JOB:
##
## Blue Collar Clerical Doctor Home Maker Lawyer
## 526 1825 1271 246 641 835
## Manager Professional Student
## 988 1117 712
## CAR_USE:
##
## Commercial Private
## 3029 5132
## CAR_TYPE:
##
## Minivan Panel Truck Pickup Sports Car SUV Van
## 2145 676 1389 907 2294 750
## RED_CAR:
##
## no yes
## 5783 2378
## REVOKED:
##
## No Yes
## 7161 1000
## URBANICITY:
##
## Highly Rural/ Rural Highly Urban/ Urban
## 1669 6492
Observation: In JOB column, 526 rows are empty. So, we’ll impute them with “Unknown”.
## JOB:
##
## Blue Collar Clerical Doctor Home Maker Lawyer Manager
## 1825 1271 246 641 835 988
## Professional Student Unknown
## 1117 712 526
At this point the data is prepared. So, we’ll explore the top correlated variables.
There are 25 variables, among which 15 are numeric and 10 are non-categorical. In order to find the top correlated variables, we’ll give numerical values to the correlated variables.
| TARGET_FLAG | TARGET_AMT | |
|---|---|---|
| TARGET_FLAG | 1.0000000 | 1.0000000 |
| TARGET_AMT | 0.8334240 | 0.8334240 |
| URBANICITY | 0.2242509 | 0.1904945 |
| CLM_FREQ | 0.2161961 | 0.1869848 |
| MVR_PTS | 0.2075451 | 0.1741927 |
| OLDCLAIM | 0.2004106 | 0.1652881 |
| PARENT1 | 0.1576222 | 0.1359305 |
| REVOKED | 0.1519391 | 0.1263285 |
| HOMEKIDS | 0.1175903 | 0.1014967 |
| KIDSDRIV | 0.1036217 | 0.0863553 |
| CAR_TYPE | 0.1035765 | 0.0827170 |
| TRAVTIME | 0.0550685 | 0.0440349 |
| RED_CAR | -0.0069473 | 0.0005877 |
| SEX | -0.0210786 | -0.0088270 |
| JOB | -0.0669944 | -0.0509930 |
| YOJ | -0.0704375 | -0.0512803 |
| EDUCATION | -0.0734429 | -0.0534342 |
| TIF | -0.0818308 | -0.0690666 |
| CAR_AGE | -0.1038815 | -0.0712728 |
| AGE | -0.1041892 | -0.0818184 |
| BLUEBOOK | -0.1100800 | -0.0837608 |
| MSTATUS | -0.1351248 | -0.1182110 |
| CAR_USE | -0.1426737 | -0.1214701 |
| INCOME | -0.1463515 | -0.1287263 |
| HOME_VAL | -0.1847610 | -0.1541847 |
Now, we’ll look at the correlation matrix of the variables.
At this point exploration, preparation and pair-wise correlations of insurance_training_data.csv are done. So, I’ll begin the same exericse for insurance-evaluation-data.csv.
Initially, we’ll do a cursory exploration of the data. After that, we’ll iteratively prepare and explore the data, wherever required.
## [1] "Dimension of training set: Number of rows: 2141, Number of cols: 26"
## [1] "Head of training data set:"
## INDEX TARGET_FLAG TARGET_AMT KIDSDRIV AGE HOMEKIDS YOJ INCOME PARENT1
## 1 3 NA NA 0 48 0 11 $52,881 No
## 2 9 NA NA 1 40 1 11 $50,815 Yes
## 3 10 NA NA 0 44 2 12 $43,486 Yes
## 4 18 NA NA 0 35 2 NA $21,204 Yes
## 5 21 NA NA 0 59 0 12 $87,460 No
## 6 30 NA NA 0 46 0 14 No
## HOME_VAL MSTATUS SEX EDUCATION JOB TRAVTIME CAR_USE BLUEBOOK
## 1 $0 z_No M Bachelors Manager 26 Private $21,970
## 2 $0 z_No M z_High School Manager 21 Private $18,930
## 3 $0 z_No z_F z_High School z_Blue Collar 30 Commercial $5,900
## 4 $0 z_No M z_High School Clerical 74 Private $9,230
## 5 $0 z_No M z_High School Manager 45 Private $15,420
## 6 $207,519 Yes M Bachelors Professional 7 Commercial $25,660
## TIF CAR_TYPE RED_CAR OLDCLAIM CLM_FREQ REVOKED MVR_PTS CAR_AGE
## 1 1 Van yes $0 0 No 2 10
## 2 6 Minivan no $3,295 1 No 2 1
## 3 10 z_SUV no $0 0 No 0 10
## 4 6 Pickup no $0 0 Yes 0 4
## 5 1 Minivan yes $44,857 2 No 4 1
## 6 1 Panel Truck no $2,119 1 No 2 12
## URBANICITY
## 1 Highly Urban/ Urban
## 2 Highly Urban/ Urban
## 3 z_Highly Rural/ Rural
## 4 z_Highly Rural/ Rural
## 5 Highly Urban/ Urban
## 6 Highly Urban/ Urban
## [1] "Structure of training data set:"
## 'data.frame': 2141 obs. of 26 variables:
## $ INDEX : int 3 9 10 18 21 30 31 37 39 47 ...
## $ TARGET_FLAG: logi NA NA NA NA NA NA ...
## $ TARGET_AMT : logi NA NA NA NA NA NA ...
## $ KIDSDRIV : int 0 1 0 0 0 0 0 0 2 0 ...
## $ AGE : int 48 40 44 35 59 46 60 54 36 50 ...
## $ HOMEKIDS : int 0 1 2 2 0 0 0 0 2 0 ...
## $ YOJ : int 11 11 12 NA 12 14 12 12 12 8 ...
## $ INCOME : chr "$52,881" "$50,815" "$43,486" "$21,204" ...
## $ PARENT1 : chr "No" "Yes" "Yes" "Yes" ...
## $ HOME_VAL : chr "$0" "$0" "$0" "$0" ...
## $ MSTATUS : chr "z_No" "z_No" "z_No" "z_No" ...
## $ SEX : chr "M" "M" "z_F" "M" ...
## $ EDUCATION : chr "Bachelors" "z_High School" "z_High School" "z_High School" ...
## $ JOB : chr "Manager" "Manager" "z_Blue Collar" "Clerical" ...
## $ TRAVTIME : int 26 21 30 74 45 7 16 27 5 22 ...
## $ CAR_USE : chr "Private" "Private" "Commercial" "Private" ...
## $ BLUEBOOK : chr "$21,970" "$18,930" "$5,900" "$9,230" ...
## $ TIF : int 1 6 10 6 1 1 1 4 4 4 ...
## $ CAR_TYPE : chr "Van" "Minivan" "z_SUV" "Pickup" ...
## $ RED_CAR : chr "yes" "no" "no" "no" ...
## $ OLDCLAIM : chr "$0" "$3,295" "$0" "$0" ...
## $ CLM_FREQ : int 0 1 0 0 2 1 0 0 0 0 ...
## $ REVOKED : chr "No" "No" "No" "Yes" ...
## $ MVR_PTS : int 2 2 0 0 4 2 0 5 0 3 ...
## $ CAR_AGE : int 10 1 10 4 1 12 1 NA 9 1 ...
## $ URBANICITY : chr "Highly Urban/ Urban" "Highly Urban/ Urban" "z_Highly Rural/ Rural" "z_Highly Rural/ Rural" ...
There are few fields, which have missing values, which we’ll investigate in greater details later.
At this stage, We’ll explore and prepare iteratively. First we’ll convert the fields, which are supposed to be numeric, into proper numeric format and strings into string format. After reformatting, we’ll check for NA. After that if required, we’ll impute them.
After that we’ll show some boxplots of the numeric fields.
Checking for NA.
## [1] TRUE
NA does exist. So, we’ll impute with mice().
Rechecking for NA after imputation.
## [1] FALSE
We observe that NA were removed in all columns except TARGET_FLAG and TARGET_AMT, which is what we want. In the following, we’ll visualize with missmap().
Both is.na() and missmap() confirm that NA were eliminated.
Now, we’ll explore the data a little further. First, we’ll take a quick look at min, 1st quartile, median, mean, 2nd quartile, max etc.
## TARGET_FLAG TARGET_AMT KIDSDRIV AGE
## Min. : NA Min. : NA Min. :0.0000 Min. :17.00
## 1st Qu.: NA 1st Qu.: NA 1st Qu.:0.0000 1st Qu.:39.00
## Median : NA Median : NA Median :0.0000 Median :45.00
## Mean :NaN Mean :NaN Mean :0.1625 Mean :45.02
## 3rd Qu.: NA 3rd Qu.: NA 3rd Qu.:0.0000 3rd Qu.:51.00
## Max. : NA Max. : NA Max. :3.0000 Max. :73.00
## NA's :2141 NA's :2141
## HOMEKIDS YOJ INCOME PARENT1
## Min. :0.0000 Min. : 0.00 Min. : 0 Length:2141
## 1st Qu.:0.0000 1st Qu.: 9.00 1st Qu.: 25713 Class :character
## Median :0.0000 Median :11.00 Median : 51734 Mode :character
## Mean :0.7174 Mean :10.38 Mean : 60210
## 3rd Qu.:1.0000 3rd Qu.:13.00 3rd Qu.: 86321
## Max. :5.0000 Max. :19.00 Max. :291182
##
## HOME_VAL MSTATUS SEX EDUCATION
## Min. : 0 Length:2141 Length:2141 Length:2141
## 1st Qu.: 0 Class :character Class :character Class :character
## Median :159239 Mode :character Mode :character Mode :character
## Mean :154441
## 3rd Qu.:237217
## Max. :669271
##
## JOB TRAVTIME CAR_USE BLUEBOOK
## Length:2141 Min. : 5.00 Length:2141 Min. : 1500
## Class :character 1st Qu.: 22.00 Class :character 1st Qu.: 8870
## Mode :character Median : 33.00 Mode :character Median :14170
## Mean : 33.15 Mean :15469
## 3rd Qu.: 43.00 3rd Qu.:21050
## Max. :105.00 Max. :49940
##
## TIF CAR_TYPE RED_CAR OLDCLAIM
## Min. : 1.000 Length:2141 Length:2141 Min. : 0
## 1st Qu.: 1.000 Class :character Class :character 1st Qu.: 0
## Median : 4.000 Mode :character Mode :character Median : 0
## Mean : 5.245 Mean : 4022
## 3rd Qu.: 7.000 3rd Qu.: 4718
## Max. :25.000 Max. :54399
##
## CLM_FREQ REVOKED MVR_PTS CAR_AGE
## Min. :0.000 Length:2141 Min. : 0.000 Min. : 0.000
## 1st Qu.:0.000 Class :character 1st Qu.: 0.000 1st Qu.: 1.000
## Median :0.000 Mode :character Median : 1.000 Median : 8.000
## Mean :0.809 Mean : 1.766 Mean : 8.212
## 3rd Qu.:2.000 3rd Qu.: 3.000 3rd Qu.:13.000
## Max. :5.000 Max. :12.000 Max. :26.000
##
## URBANICITY
## Length:2141
## Class :character
## Mode :character
##
##
##
##
For downstream analysis, we’ll reorder the columns into categorical, numeric and target.
Let’s take a first look at the boxplots
The boxplots show that some of the variables have outliers in them. So, we’ll cap them.
The fields AGE, HOMEKIDS, INCOME, HOME_VAL, TRVTIME, BLUEBOOK, TIF, CLM_FREQ, MVR_PTS, CAR_AGE have higher variance.
Let’s ignore the boxplots for TARGET_FLAG and TARGET_AMT.
We’ll do the boxplots differently, with gglplot, to check if there are any differences.
Histograms tell us how the data is distributed in the dataset (numeric fields).
The histograms show that AGE, YOJ, HOME_VAL, TRAVTIME, BLUEBOOK and CAR_AGE are approximately normally distributed. HOME_VALUE, CAR_AGE and CLM_FREQ are quite dispersed.
Now, we’ll explore the Categorical variables.
## PARENT1:
##
## No Yes
## 1875 266
## MSTATUS:
##
## No Yes
## 847 1294
## SEX:
##
## F M
## 1170 971
## EDUCATION:
##
## <High School Bachelors High School Masters PhD
## 312 581 622 420 206
## JOB:
##
## Blue Collar Clerical Doctor Home Maker Lawyer
## 139 463 319 75 202 196
## Manager Professional Student
## 269 291 187
## CAR_USE:
##
## Commercial Private
## 760 1381
## CAR_TYPE:
##
## Minivan Panel Truck Pickup Sports Car SUV Van
## 549 177 383 272 589 171
## RED_CAR:
##
## no yes
## 1543 598
## REVOKED:
##
## No Yes
## 1880 261
## URBANICITY:
##
## Highly Rural/ Rural Highly Urban/ Urban
## 403 1738
Observation: In JOB columns, 139 rows are empty. So, we’ have to’ll impute them with “Unknown”.
## JOB:
##
## Blue Collar Clerical Doctor Home Maker Lawyer Manager
## 463 319 75 202 196 269
## Professional Student Unknown
## 291 187 139
At this point the data is prepared. So, we’ll explore the top correlated variables.
There are 25 variables, among which 15 are numeric and 10 are non-categorical. In order to find pai-wise correlations, we’ll give numerical values to the correlated variables.
Now, we’ll look at the correlation matrix of the variables.
At this point exploration, preparation and pair-wise correlations of insurance_evaluation_data.csv are done. So, I’ll begin the building process.
Now, we are in a position to build the models. Initially, we’ll build models with insurance_training_data.csv and determine the model. Then we’ll use that model to predict on insurance-evaluation-data.csv.
Fact: The pre-processed dataset variables, which we’ll use in the following are Ins_train_cap_imputed and Ins_eval_cap_imputed.
We have two tasks here: One is to classify the variable TARGET_FLAG. For classification, we’ll use Logistic Regression. The other task is to predict the value of TARGET_AMT with Linear Regression.
We’ll build two Logistic Regression models and compare the accuracies and select the best one and use that for predicting on Ins_eval_cap_imputed. In order to do the Logistic Regression, we must split the data (80/20 ratio is our choice). So, let’s split the first.
The training and test datasets formed by splitting Ins_train_cap_imputed in 80/20 ratio are Ins_train_cap_imputed_trn and Ins_train_cap_imputed_tst.
We’ll build our first Logistic Regression model, called Model01_Log_Reg.
##
## Call:
## glm(formula = TARGET_FLAG ~ . - TARGET_AMT, family = binomial,
## data = Ins_train_cap_imputed_trn)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2696 -0.7148 -0.3986 0.6484 3.1772
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.353e+00 3.200e-01 -7.351 1.96e-13 ***
## PARENT1Yes 4.291e-01 1.243e-01 3.454 0.000553 ***
## MSTATUSYes -4.984e-01 9.405e-02 -5.299 1.16e-07 ***
## SEXM 5.945e-02 1.250e-01 0.476 0.634204
## EDUCATIONBachelors -4.227e-01 1.292e-01 -3.272 0.001067 **
## EDUCATIONHigh School -5.255e-02 1.069e-01 -0.492 0.622940
## EDUCATIONMasters -3.416e-01 1.999e-01 -1.709 0.087432 .
## EDUCATIONPhD -2.615e-01 2.348e-01 -1.113 0.265523
## JOBClerical 7.402e-02 1.202e-01 0.616 0.538034
## JOBDoctor -5.875e-01 3.113e-01 -1.887 0.059128 .
## JOBHome Maker -8.356e-02 1.738e-01 -0.481 0.630619
## JOBLawyer -1.868e-01 2.113e-01 -0.884 0.376748
## JOBManager -8.325e-01 1.565e-01 -5.319 1.04e-07 ***
## JOBProfessional -1.020e-01 1.347e-01 -0.757 0.448777
## JOBStudent -1.226e-01 1.455e-01 -0.843 0.399393
## JOBUnknown -3.768e-01 2.080e-01 -1.811 0.070122 .
## CAR_USEPrivate -7.932e-01 1.024e-01 -7.748 9.36e-15 ***
## CAR_TYPEPanel Truck 5.483e-01 1.810e-01 3.030 0.002449 **
## CAR_TYPEPickup 5.371e-01 1.125e-01 4.776 1.79e-06 ***
## CAR_TYPESports Car 1.047e+00 1.445e-01 7.248 4.24e-13 ***
## CAR_TYPESUV 7.792e-01 1.237e-01 6.298 3.01e-10 ***
## CAR_TYPEVan 7.086e-01 1.402e-01 5.054 4.34e-07 ***
## RED_CARyes -1.660e-02 9.672e-02 -0.172 0.863717
## REVOKEDYes 7.929e-01 1.029e-01 7.706 1.29e-14 ***
## URBANICITYHighly Urban/ Urban 2.441e+00 1.256e-01 19.432 < 2e-16 ***
## KIDSDRIV 5.976e-01 1.091e-01 5.478 4.30e-08 ***
## AGE -3.216e-03 4.606e-03 -0.698 0.485129
## HOMEKIDS 4.789e-02 4.469e-02 1.072 0.283845
## YOJ -1.851e-02 9.345e-03 -1.981 0.047605 *
## INCOME -2.127e-06 1.356e-06 -1.568 0.116872
## HOME_VAL -1.470e-06 3.751e-07 -3.920 8.87e-05 ***
## TRAVTIME 1.624e-02 2.180e-03 7.451 9.26e-14 ***
## BLUEBOOK -2.351e-05 6.086e-06 -3.862 0.000112 ***
## TIF -6.075e-02 8.537e-03 -7.116 1.11e-12 ***
## OLDCLAIM -1.331e-05 5.398e-06 -2.465 0.013694 *
## CLM_FREQ 1.815e-01 3.283e-02 5.528 3.24e-08 ***
## MVR_PTS 1.089e-01 1.652e-02 6.596 4.22e-11 ***
## CAR_AGE -7.854e-03 8.065e-03 -0.974 0.330193
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 7535.7 on 6528 degrees of freedom
## Residual deviance: 5829.9 on 6491 degrees of freedom
## AIC: 5905.9
##
## Number of Fisher Scoring iterations: 5
The important metric is AIC: 5899.5.
Now, we’ll predict on the test set Ins_train_cap_imputed_tst.
Creation of Confusion Matrix.
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 0
## 1 180 97
## 0 250 1105
##
## Accuracy : 0.7874
## 95% CI : (0.7667, 0.807)
## No Information Rate : 0.7365
## P-Value [Acc > NIR] : 1.083e-06
##
## Kappa : 0.3815
##
## Mcnemar's Test P-Value : 3.356e-16
##
## Sensitivity : 0.4186
## Specificity : 0.9193
## Pos Pred Value : 0.6498
## Neg Pred Value : 0.8155
## Prevalence : 0.2635
## Detection Rate : 0.1103
## Detection Prevalence : 0.1697
## Balanced Accuracy : 0.6690
##
## 'Positive' Class : 1
##
Plotting the AUC under roc curve.
Here we note the following important metrics.
Accuracy of this model is 0.7806.
AUC of the model is 0.801.
We’ll build our second Logistic Regression model, called Model02_Log_Reg.
##
## Call:
## glm(formula = TARGET_FLAG ~ KIDSDRIV + HOMEKIDS + INCOME + PARENT1 +
## HOME_VAL + MSTATUS + EDUCATION + TRAVTIME + CAR_USE + BLUEBOOK +
## TIF + CAR_TYPE + CLM_FREQ + REVOKED + MVR_PTS + CAR_AGE +
## URBANICITY, family = binomial, data = Ins_train_cap_imputed_trn)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3097 -0.7236 -0.4093 0.6644 3.1651
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.454e+00 2.125e-01 -11.546 < 2e-16 ***
## KIDSDRIV 5.831e-01 1.067e-01 5.463 4.67e-08 ***
## HOMEKIDS 5.626e-02 4.047e-02 1.390 0.164453
## INCOME -3.777e-06 1.192e-06 -3.170 0.001526 **
## PARENT1Yes 4.283e-01 1.227e-01 3.492 0.000479 ***
## HOME_VAL -1.377e-06 3.633e-07 -3.790 0.000151 ***
## MSTATUSYes -5.298e-01 9.287e-02 -5.705 1.17e-08 ***
## EDUCATIONBachelors -5.642e-01 1.176e-01 -4.795 1.62e-06 ***
## EDUCATIONHigh School -1.157e-01 1.035e-01 -1.118 0.263703
## EDUCATIONMasters -5.835e-01 1.480e-01 -3.943 8.06e-05 ***
## EDUCATIONPhD -6.015e-01 1.812e-01 -3.320 0.000902 ***
## TRAVTIME 1.689e-02 2.162e-03 7.811 5.65e-15 ***
## CAR_USEPrivate -8.751e-01 8.080e-02 -10.830 < 2e-16 ***
## BLUEBOOK -2.559e-05 5.447e-06 -4.698 2.62e-06 ***
## TIF -6.079e-02 8.473e-03 -7.174 7.26e-13 ***
## CAR_TYPEPanel Truck 4.853e-01 1.598e-01 3.037 0.002390 **
## CAR_TYPEPickup 4.672e-01 1.092e-01 4.279 1.88e-05 ***
## CAR_TYPESports Car 9.915e-01 1.182e-01 8.391 < 2e-16 ***
## CAR_TYPESUV 7.332e-01 9.479e-02 7.735 1.03e-14 ***
## CAR_TYPEVan 6.705e-01 1.319e-01 5.083 3.71e-07 ***
## CLM_FREQ 1.386e-01 2.845e-02 4.870 1.11e-06 ***
## REVOKEDYes 6.770e-01 8.982e-02 7.537 4.83e-14 ***
## MVR_PTS 1.096e-01 1.627e-02 6.733 1.66e-11 ***
## CAR_AGE -8.254e-03 8.012e-03 -1.030 0.302913
## URBANICITYHighly Urban/ Urban 2.387e+00 1.251e-01 19.081 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 7535.7 on 6528 degrees of freedom
## Residual deviance: 5887.2 on 6504 degrees of freedom
## AIC: 5937.2
##
## Number of Fisher Scoring iterations: 5
The important metric is AIC: 5930.1.
Now, we’ll predict on the test set Ins_train_cap_imputed_tst.
Creation of Confusion Matrix.
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 0
## 1 172 97
## 0 258 1105
##
## Accuracy : 0.7825
## 95% CI : (0.7617, 0.8023)
## No Information Rate : 0.7365
## P-Value [Acc > NIR] : 9.894e-06
##
## Kappa : 0.3629
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.4000
## Specificity : 0.9193
## Pos Pred Value : 0.6394
## Neg Pred Value : 0.8107
## Prevalence : 0.2635
## Detection Rate : 0.1054
## Detection Prevalence : 0.1648
## Balanced Accuracy : 0.6597
##
## 'Positive' Class : 1
##
Plotting the AUC under roc curve.
Here we note the following important metrics.
Accuracy of this model is 0.7855.
AUC of the model is 0.802.
In the third Logistic Regression model Model03_Log_Reg, we’ll do stepwise model selection.
##
## Call:
## glm(formula = TARGET_FLAG ~ KIDSDRIV + INCOME + PARENT1 + HOME_VAL +
## MSTATUS + EDUCATION + TRAVTIME + CAR_USE + BLUEBOOK + TIF +
## CAR_TYPE + CLM_FREQ + REVOKED + MVR_PTS + URBANICITY, family = binomial,
## data = Ins_train_cap_imputed_trn)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3376 -0.7220 -0.4088 0.6604 3.1518
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.466e+00 2.099e-01 -11.748 < 2e-16 ***
## KIDSDRIV 6.404e-01 9.858e-02 6.497 8.20e-11 ***
## INCOME -3.846e-06 1.188e-06 -3.236 0.001211 **
## PARENT1Yes 5.191e-01 1.049e-01 4.950 7.42e-07 ***
## HOME_VAL -1.395e-06 3.620e-07 -3.853 0.000117 ***
## MSTATUSYes -4.895e-01 8.819e-02 -5.550 2.85e-08 ***
## EDUCATIONBachelors -6.170e-01 1.102e-01 -5.600 2.15e-08 ***
## EDUCATIONHigh School -1.288e-01 1.031e-01 -1.249 0.211759
## EDUCATIONMasters -6.819e-01 1.250e-01 -5.456 4.86e-08 ***
## EDUCATIONPhD -6.990e-01 1.630e-01 -4.287 1.81e-05 ***
## TRAVTIME 1.681e-02 2.161e-03 7.777 7.43e-15 ***
## CAR_USEPrivate -8.754e-01 8.079e-02 -10.836 < 2e-16 ***
## BLUEBOOK -2.582e-05 5.447e-06 -4.740 2.13e-06 ***
## TIF -6.070e-02 8.468e-03 -7.169 7.58e-13 ***
## CAR_TYPEPanel Truck 4.889e-01 1.597e-01 3.061 0.002208 **
## CAR_TYPEPickup 4.649e-01 1.092e-01 4.259 2.06e-05 ***
## CAR_TYPESports Car 9.925e-01 1.181e-01 8.402 < 2e-16 ***
## CAR_TYPESUV 7.376e-01 9.471e-02 7.788 6.78e-15 ***
## CAR_TYPEVan 6.726e-01 1.318e-01 5.102 3.36e-07 ***
## CLM_FREQ 1.386e-01 2.844e-02 4.872 1.11e-06 ***
## REVOKEDYes 6.809e-01 8.979e-02 7.583 3.38e-14 ***
## MVR_PTS 1.101e-01 1.627e-02 6.765 1.33e-11 ***
## URBANICITYHighly Urban/ Urban 2.385e+00 1.251e-01 19.068 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 7535.7 on 6528 degrees of freedom
## Residual deviance: 5890.2 on 6506 degrees of freedom
## AIC: 5936.2
##
## Number of Fisher Scoring iterations: 5
The important metric is AIC: 5959.5.
Now, we’ll predict on the test set Ins_train_cap_imputed_tst.
Creation of Confusion Matrix.
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 0
## 1 178 95
## 0 252 1107
##
## Accuracy : 0.7874
## 95% CI : (0.7667, 0.807)
## No Information Rate : 0.7365
## P-Value [Acc > NIR] : 1.083e-06
##
## Kappa : 0.3794
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.4140
## Specificity : 0.9210
## Pos Pred Value : 0.6520
## Neg Pred Value : 0.8146
## Prevalence : 0.2635
## Detection Rate : 0.1091
## Detection Prevalence : 0.1673
## Balanced Accuracy : 0.6675
##
## 'Positive' Class : 1
##
Plotting the AUC under roc curve.
Here we note the following important metrics.
Accuracy of this model is 0.7874.
AUC of the model is 0.802.
At this point three Logistic Regression models were built. The accuracy was highest in the third model Model03_Log_Reg. We’ll use this model on the evaluation dataset, for classification.
Having completed three Logistic Regression models we’ll now proceed to build two Linear Regression models.
First Linear Regression model is Model01_Lin_Reg.
Since our goal is to predict the TARGET_AMT, and not classify (as we did in Logistic Regression), we’ll build lm for TARGET_AMT.
##
## Call:
## lm(formula = TARGET_AMT ~ . - TARGET_FLAG, data = Ins_train_cap_imputed)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5146 -1708 -755 350 103696
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.636e+01 4.769e+02 0.139 0.889344
## PARENT1Yes 5.432e+02 2.043e+02 2.659 0.007843 **
## MSTATUSYes -5.611e+02 1.443e+02 -3.890 0.000101 ***
## SEXM 3.551e+02 1.834e+02 1.936 0.052933 .
## EDUCATIONBachelors -2.688e+02 2.050e+02 -1.311 0.189793
## EDUCATIONHigh School -9.559e+01 1.720e+02 -0.556 0.578453
## EDUCATIONMasters 1.538e+01 2.992e+02 0.051 0.959012
## EDUCATIONPhD 2.203e+02 3.526e+02 0.625 0.532127
## JOBClerical 1.699e+01 1.935e+02 0.088 0.930019
## JOBDoctor -9.804e+02 4.356e+02 -2.251 0.024441 *
## JOBHome Maker -1.526e+02 2.707e+02 -0.564 0.573009
## JOBLawyer -2.772e+02 3.108e+02 -0.892 0.372522
## JOBManager -9.816e+02 2.345e+02 -4.187 2.86e-05 ***
## JOBProfessional -4.084e+01 2.131e+02 -0.192 0.848010
## JOBStudent -2.216e+02 2.352e+02 -0.942 0.346054
## JOBUnknown -5.147e+02 3.216e+02 -1.600 0.109576
## CAR_USEPrivate -7.933e+02 1.644e+02 -4.825 1.43e-06 ***
## CAR_TYPEPanel Truck 2.675e+02 2.766e+02 0.967 0.333431
## CAR_TYPEPickup 3.689e+02 1.708e+02 2.160 0.030835 *
## CAR_TYPESports Car 1.014e+03 2.178e+02 4.655 3.29e-06 ***
## CAR_TYPESUV 7.443e+02 1.794e+02 4.148 3.39e-05 ***
## CAR_TYPEVan 5.156e+02 2.133e+02 2.418 0.015639 *
## RED_CARyes -3.782e+01 1.491e+02 -0.254 0.799788
## REVOKEDYes 5.930e+02 1.743e+02 3.402 0.000673 ***
## URBANICITYHighly Urban/ Urban 1.669e+03 1.392e+02 11.991 < 2e-16 ***
## KIDSDRIV 6.076e+02 1.790e+02 3.394 0.000693 ***
## AGE 6.512e+00 7.187e+00 0.906 0.364955
## HOMEKIDS 7.699e+01 6.992e+01 1.101 0.270888
## YOJ -4.280e+00 1.462e+01 -0.293 0.769680
## INCOME -3.939e-03 2.076e-03 -1.897 0.057798 .
## HOME_VAL -7.668e-04 5.857e-04 -1.309 0.190519
## TRAVTIME 1.274e+01 3.340e+00 3.816 0.000137 ***
## BLUEBOOK 1.374e-02 8.979e-03 1.530 0.125958
## TIF -5.122e+01 1.294e+01 -3.957 7.66e-05 ***
## OLDCLAIM -1.689e-02 9.122e-03 -1.851 0.064181 .
## CLM_FREQ 1.654e+02 5.688e+01 2.907 0.003659 **
## MVR_PTS 1.733e+02 2.790e+01 6.212 5.50e-10 ***
## CAR_AGE -2.692e+01 1.232e+01 -2.184 0.028988 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4545 on 8123 degrees of freedom
## Multiple R-squared: 0.07079, Adjusted R-squared: 0.06655
## F-statistic: 16.72 on 37 and 8123 DF, p-value: < 2.2e-16
Here we note the following important metrics.
R-squared: 0.07094.
Adjusted R-squared: 0.06671.
The R-squared values are far from 1. So, the model is not good. In below plot, the points are widely scattered, but not linearly.
In Second Linear Regression model Model02_Lin_Reg, we’ll do stepwise AIC.
##
## Call:
## lm(formula = TARGET_AMT ~ PARENT1 + MSTATUS + SEX + JOB + CAR_USE +
## CAR_TYPE + REVOKED + URBANICITY + KIDSDRIV + INCOME + TRAVTIME +
## BLUEBOOK + TIF + OLDCLAIM + CLM_FREQ + MVR_PTS + CAR_AGE,
## data = Ins_train_cap_imputed)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5198 -1700 -761 337 103636
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.754e+02 3.416e+02 0.514 0.607582
## PARENT1Yes 5.929e+02 1.784e+02 3.324 0.000890 ***
## MSTATUSYes -6.195e+02 1.197e+02 -5.177 2.31e-07 ***
## SEXM 3.242e+02 1.605e+02 2.021 0.043347 *
## JOBClerical 3.153e+01 1.924e+02 0.164 0.869812
## JOBDoctor -6.537e+02 3.552e+02 -1.841 0.065705 .
## JOBHome Maker -1.088e+02 2.484e+02 -0.438 0.661362
## JOBLawyer -1.434e+02 2.416e+02 -0.594 0.552810
## JOBManager -9.721e+02 2.128e+02 -4.569 4.97e-06 ***
## JOBProfessional -1.302e+02 1.976e+02 -0.659 0.509968
## JOBStudent -1.308e+02 2.220e+02 -0.589 0.555900
## JOBUnknown -3.039e+02 2.671e+02 -1.138 0.255334
## CAR_USEPrivate -7.443e+02 1.569e+02 -4.745 2.12e-06 ***
## CAR_TYPEPanel Truck 3.077e+02 2.733e+02 1.126 0.260308
## CAR_TYPEPickup 3.933e+02 1.695e+02 2.320 0.020379 *
## CAR_TYPESports Car 1.029e+03 2.165e+02 4.756 2.01e-06 ***
## CAR_TYPESUV 7.493e+02 1.786e+02 4.196 2.75e-05 ***
## CAR_TYPEVan 5.380e+02 2.121e+02 2.537 0.011204 *
## REVOKEDYes 5.979e+02 1.743e+02 3.431 0.000604 ***
## URBANICITYHighly Urban/ Urban 1.664e+03 1.391e+02 11.962 < 2e-16 ***
## KIDSDRIV 6.978e+02 1.620e+02 4.307 1.67e-05 ***
## INCOME -5.123e-03 1.788e-03 -2.866 0.004174 **
## TRAVTIME 1.262e+01 3.337e+00 3.782 0.000157 ***
## BLUEBOOK 1.415e-02 8.880e-03 1.594 0.110992
## TIF -5.066e+01 1.294e+01 -3.916 9.09e-05 ***
## OLDCLAIM -1.687e-02 9.117e-03 -1.850 0.064308 .
## CLM_FREQ 1.684e+02 5.682e+01 2.963 0.003052 **
## MVR_PTS 1.741e+02 2.786e+01 6.251 4.30e-10 ***
## CAR_AGE -2.719e+01 1.093e+01 -2.488 0.012851 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4545 on 8132 degrees of freedom
## Multiple R-squared: 0.06984, Adjusted R-squared: 0.06663
## F-statistic: 21.81 on 28 and 8132 DF, p-value: < 2.2e-16
Here we note the following important metrics.
R-squared: 0.07012.
Adjusted R-squared: 0.06692.
The second model Model02_Lin_Reg marginally improved over the first model Model01_Lin_Reg. The plot also suggest no proper linear regression.
We ran three Logistic Regression models and two Linear Regression models. Based on the Accuracy and AUC, the third Logistic Regression model Model03_Log_Reg did best and based on R-squared value, the second Linear Regression model Model02_Lin_Reg did best. In the following we’ll name our selections.
Selected models are:
Model03_Log_Reg Model02_Lin_Reg
We’ll use these models to predict the evaluation dataset insurance-evaluation-data.csv.
The data prepared from evaluation dataset is stored in Ins_eval_cap_imputed.
Head of the predicted data.
## PARENT1 MSTATUS SEX EDUCATION JOB CAR_USE CAR_TYPE RED_CAR
## 1 No No M Bachelors Manager Private Van yes
## 2 Yes No M High School Manager Private Minivan no
## 3 Yes No F High School Blue Collar Commercial SUV no
## 4 Yes No M High School Clerical Private Pickup no
## 5 No No M High School Manager Private Minivan yes
## 6 No Yes M Bachelors Professional Commercial Panel Truck no
## REVOKED URBANICITY KIDSDRIV AGE HOMEKIDS YOJ INCOME HOME_VAL
## 1 No Highly Urban/ Urban 0 48 0 11 52881 0
## 2 No Highly Urban/ Urban 1 40 1 11 50815 0
## 3 No Highly Rural/ Rural 0 44 2 12 43486 0
## 4 Yes Highly Rural/ Rural 0 35 2 13 21204 0
## 5 No Highly Urban/ Urban 0 59 0 12 87460 0
## 6 No Highly Urban/ Urban 0 46 0 14 87190 207519
## TRAVTIME BLUEBOOK TIF OLDCLAIM CLM_FREQ MVR_PTS CAR_AGE TARGET_FLAG
## 1 26 21970 1 0 0 2 10 0
## 2 21 18930 6 3295 1 2 1 0
## 3 30 5900 10 0 0 0 10 0
## 4 74 9230 6 0 0 0 4 0
## 5 45 15420 1 26114 2 4 1 0
## 6 7 25660 1 2119 1 2 12 0
## TARGET_AMT
## 1 0
## 2 0
## 3 0
## 4 0
## 5 0
## 6 0