Homework 04

Overview

In this homework assignment, you will explore, analyze, and model a dataset of approximately 8,000 records, each representing a customer of an auto insurance company. Each record has two response variables. The first response variable, TARGET_FLAG, is a 1 or a 0: a “1” means the person was in a car crash, and a “0” means the person was not. The second response variable is TARGET_AMT. This value is zero if the person did not crash their car; if they did crash, it is a value greater than zero.

Your objective is to build multiple linear regression and binary logistic regression models on the training data to predict the probability that a person will crash their car and also the amount of money it will cost if the person does crash their car. You can only use the variables given to you (or variables that you derive from the variables provided). Below is a short description of the variables of interest in the data set:

Data Exploration

As a first step, we will explore both the training and the evaluation data.

## [1] "Dimension of training set:   Number of rows: 8161, Number of cols: 26"
##   INDEX TARGET_FLAG TARGET_AMT KIDSDRIV AGE HOMEKIDS YOJ   INCOME PARENT1
## 1     1           0          0        0  60        0  11  $67,349      No
## 2     2           0          0        0  43        0  11  $91,449      No
## 3     4           0          0        0  35        1  10  $16,039      No
## 4     5           0          0        0  51        0  14               No
## 5     6           0          0        0  50        0  NA $114,986      No
## 6     7           1       2946        0  34        1  12 $125,301     Yes
##   HOME_VAL MSTATUS SEX     EDUCATION           JOB TRAVTIME    CAR_USE BLUEBOOK
## 1       $0    z_No   M           PhD  Professional       14    Private  $14,230
## 2 $257,252    z_No   M z_High School z_Blue Collar       22 Commercial  $14,940
## 3 $124,191     Yes z_F z_High School      Clerical        5    Private   $4,010
## 4 $306,251     Yes   M  <High School z_Blue Collar       32    Private  $15,440
## 5 $243,925     Yes z_F           PhD        Doctor       36    Private  $18,000
## 6       $0    z_No z_F     Bachelors z_Blue Collar       46 Commercial  $17,430
##   TIF   CAR_TYPE RED_CAR OLDCLAIM CLM_FREQ REVOKED MVR_PTS CAR_AGE
## 1  11    Minivan     yes   $4,461        2      No       3      18
## 2   1    Minivan     yes       $0        0      No       0       1
## 3   4      z_SUV      no  $38,690        2      No       3      10
## 4   7    Minivan     yes       $0        0      No       0       6
## 5   1      z_SUV      no  $19,217        2     Yes       3      17
## 6   1 Sports Car      no       $0        0      No       0       7
##            URBANICITY
## 1 Highly Urban/ Urban
## 2 Highly Urban/ Urban
## 3 Highly Urban/ Urban
## 4 Highly Urban/ Urban
## 5 Highly Urban/ Urban
## 6 Highly Urban/ Urban
## [1] "Structure of training data set:"
## 'data.frame':    8161 obs. of  26 variables:
##  $ INDEX      : int  1 2 4 5 6 7 8 11 12 13 ...
##  $ TARGET_FLAG: int  0 0 0 0 0 1 0 1 1 0 ...
##  $ TARGET_AMT : num  0 0 0 0 0 ...
##  $ KIDSDRIV   : int  0 0 0 0 0 0 0 1 0 0 ...
##  $ AGE        : int  60 43 35 51 50 34 54 37 34 50 ...
##  $ HOMEKIDS   : int  0 0 1 0 0 1 0 2 0 0 ...
##  $ YOJ        : int  11 11 10 14 NA 12 NA NA 10 7 ...
##  $ INCOME     : chr  "$67,349" "$91,449" "$16,039" "" ...
##  $ PARENT1    : chr  "No" "No" "No" "No" ...
##  $ HOME_VAL   : chr  "$0" "$257,252" "$124,191" "$306,251" ...
##  $ MSTATUS    : chr  "z_No" "z_No" "Yes" "Yes" ...
##  $ SEX        : chr  "M" "M" "z_F" "M" ...
##  $ EDUCATION  : chr  "PhD" "z_High School" "z_High School" "<High School" ...
##  $ JOB        : chr  "Professional" "z_Blue Collar" "Clerical" "z_Blue Collar" ...
##  $ TRAVTIME   : int  14 22 5 32 36 46 33 44 34 48 ...
##  $ CAR_USE    : chr  "Private" "Commercial" "Private" "Private" ...
##  $ BLUEBOOK   : chr  "$14,230" "$14,940" "$4,010" "$15,440" ...
##  $ TIF        : int  11 1 4 7 1 1 1 1 1 7 ...
##  $ CAR_TYPE   : chr  "Minivan" "Minivan" "z_SUV" "Minivan" ...
##  $ RED_CAR    : chr  "yes" "yes" "no" "yes" ...
##  $ OLDCLAIM   : chr  "$4,461" "$0" "$38,690" "$0" ...
##  $ CLM_FREQ   : int  2 0 2 0 2 0 0 1 0 0 ...
##  $ REVOKED    : chr  "No" "No" "No" "No" ...
##  $ MVR_PTS    : int  3 0 3 0 3 0 0 10 0 1 ...
##  $ CAR_AGE    : int  18 1 10 6 17 7 1 7 1 17 ...
##  $ URBANICITY : chr  "Highly Urban/ Urban" "Highly Urban/ Urban" "Highly Urban/ Urban" "Highly Urban/ Urban" ...

Checking for NA.

## [1] TRUE

There are NAs in our data, so we will impute them using mice().
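A minimal sketch of that imputation step is shown below, assuming the raw training data was read into a data frame called Ins_train and that the currency columns are converted to numbers first; the seed and m = 5 are assumptions, and mice() is left at its defaults (predictive mean matching for numeric fields).

```r
library(mice)

# strip "$" and "," so the currency fields become numeric (blanks turn into NA)
money_cols <- c("INCOME", "HOME_VAL", "BLUEBOOK", "OLDCLAIM")
Ins_train[money_cols] <- lapply(Ins_train[money_cols],
                                function(x) as.numeric(gsub("[$,]", "", x)))

set.seed(123)                                     # assumed seed, for reproducibility
imp <- mice(Ins_train, m = 5, printFlag = FALSE)  # default methods (pmm for numeric)
Ins_train_imputed <- complete(imp)                # keep the first completed dataset

anyNA(Ins_train_imputed)                          # recheck for NAs
```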

Rechecking for NA after imputation.


We observe that the NAs were removed.

Here, we’ll further explore the data. First, we’ll look at the minimum, 1st quartile, median, mean, 3rd quartile and maximum of each field.

Summary of the numeric fields (the ten character fields PARENT1, MSTATUS, SEX, EDUCATION, JOB, CAR_USE, CAR_TYPE, RED_CAR, REVOKED and URBANICITY only report Length:8161 and Class:character, so they are omitted here):

             Min.  1st Qu.   Median      Mean  3rd Qu.     Max.
TARGET_FLAG     0        0        0    0.2638        1        1
TARGET_AMT      0        0        0      1504     1036   107586
KIDSDRIV        0        0        0    0.1711        0        4
AGE            16       39       45     44.78       51       81
HOMEKIDS        0        0        0    0.7212        1        5
YOJ             0        9       11     10.52       13       23
INCOME          0    28088    53899     61719    85596   367030
HOME_VAL        0        0   160699    154533   238115   885282
TRAVTIME        5       22       33     33.49       44      142
BLUEBOOK     1500     9280    14440     15710    20850    69740
TIF             1        1        4     5.351        7       25
OLDCLAIM        0        0        0      4037     4636    57037
CLM_FREQ        0        0        0    0.7986        2        5
MVR_PTS         0        0        1     1.696        3       13
CAR_AGE        -3        1        8     8.329       12       28

Data reordering

For downstream analysis, we’ll reorder the columns into categorical, numeric and target.
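A minimal sketch of that regrouping, assuming the imputed data is in Ins_train_imputed and that INDEX is dropped because it carries no predictive information:

```r
# character columns first, numeric predictors next, the two targets last
cat_cols <- names(Ins_train_imputed)[sapply(Ins_train_imputed, is.character)]
num_cols <- setdiff(names(Ins_train_imputed),
                    c(cat_cols, "INDEX", "TARGET_FLAG", "TARGET_AMT"))
Ins_train_imputed <- Ins_train_imputed[, c(cat_cols, num_cols,
                                           "TARGET_FLAG", "TARGET_AMT")]
```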

Boxplots

First look at the boxplots.

The boxplots show that some of the variables have outliers in them. So, we’ll cap them.

The fields AGE, HOMEKIDS, INCOME, HOME_VAL, TRAVTIME, BLUEBOOK, TIF, CLM_FREQ, MVR_PTS and CAR_AGE have higher variance. We will ignore the boxplots for TARGET_FLAG and TARGET_AMT here.
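A minimal sketch of the capping step is below; capping at the 1st and 99th percentiles is an assumption (the exact rule is not shown above), and the result is stored as Ins_train_cap_imputed, the name used in the modeling sections.

```r
# pull values below/above the chosen percentiles back to the cutoffs
cap_outliers <- function(x, lower = 0.01, upper = 0.99) {
  q <- quantile(x, probs = c(lower, upper), na.rm = TRUE)
  pmin(pmax(x, q[1]), q[2])
}

cap_cols <- c("AGE", "HOMEKIDS", "INCOME", "HOME_VAL", "TRAVTIME",
              "BLUEBOOK", "TIF", "CLM_FREQ", "MVR_PTS", "CAR_AGE")
Ins_train_cap_imputed <- Ins_train_imputed
Ins_train_cap_imputed[cap_cols] <- lapply(Ins_train_cap_imputed[cap_cols], cap_outliers)
```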

Histograms

We will see how the numeric fields are distributed using histograms.

We can see that AGE, YOJ, INCOME, HOME_VAL, TRAVTIME, BLUEBOOK and CAR_AGE are approximately normally distributed. HOME_VAL, CAR_AGE and CLM_FREQ are quite dispersed.

Categorical variables

Now, let’s explore the Categorical variables.

## PARENT1:
## 
##   No  Yes 
## 7084 1077
## MSTATUS:
## 
##   No  Yes 
## 3267 4894
## SEX:
## 
##    F    M 
## 4375 3786
## EDUCATION:
## 
## <High School    Bachelors  High School      Masters          PhD 
##         1203         2242         2330         1658          728
## JOB:
## 
##               Blue Collar     Clerical       Doctor   Home Maker       Lawyer 
##          526         1825         1271          246          641          835 
##      Manager Professional      Student 
##          988         1117          712
## CAR_USE:
## 
## Commercial    Private 
##       3029       5132
## CAR_TYPE:
## 
##     Minivan Panel Truck      Pickup  Sports Car         SUV         Van 
##        2145         676        1389         907        2294         750
## RED_CAR:
## 
##   no  yes 
## 5783 2378
## REVOKED:
## 
##   No  Yes 
## 7161 1000
## URBANICITY:
## 
## Highly Rural/ Rural Highly Urban/ Urban 
##                1669                6492

There are 526 empty rows in the JOB column, so we’ll impute them with “Unknown”.
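One way to do this, assuming the working data frame is Ins_train_cap_imputed:

```r
# relabel the empty JOB values and recount
Ins_train_cap_imputed$JOB[Ins_train_cap_imputed$JOB == ""] <- "Unknown"
table(Ins_train_cap_imputed$JOB)
```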

## JOB:
## 
##  Blue Collar     Clerical       Doctor   Home Maker       Lawyer      Manager 
##         1825         1271          246          641          835          988 
## Professional      Student      Unknown 
##         1117          712          526

Correlations

At this point the data is prepared. So, we’ll explore the top correlated variables.

There are 25 variables, of which 15 are numeric and 10 are categorical. In order to find the top correlated variables, we’ll encode the categorical variables numerically.
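A minimal sketch of that step, assuming the prepared training data is in Ins_train_cap_imputed; the categorical columns are integer-coded with as.factor()/as.numeric() purely to get rough correlations against the two targets.

```r
Ins_num <- Ins_train_cap_imputed
chr_cols <- sapply(Ins_num, is.character)
Ins_num[chr_cols] <- lapply(Ins_num[chr_cols],
                            function(x) as.numeric(as.factor(x)))   # integer codes

# correlation of every variable with the two targets, sorted by TARGET_FLAG
cors <- cor(Ins_num)[, c("TARGET_FLAG", "TARGET_AMT")]
round(cors[order(cors[, "TARGET_FLAG"], decreasing = TRUE), ], 7)
```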

Top Correlated Variables
TARGET_FLAG TARGET_AMT
TARGET_FLAG 1.0000000 1.0000000
TARGET_AMT 0.8334240 0.8334240
URBANICITY 0.2242509 0.1904945
CLM_FREQ 0.2161961 0.1869848
MVR_PTS 0.2075451 0.1741927
OLDCLAIM 0.2004106 0.1652881
PARENT1 0.1576222 0.1359305
REVOKED 0.1519391 0.1263285
HOMEKIDS 0.1175903 0.1014967
KIDSDRIV 0.1036217 0.0863553
CAR_TYPE 0.1035765 0.0827170
TRAVTIME 0.0550685 0.0440349
RED_CAR -0.0069473 0.0005877
SEX -0.0210786 -0.0088270
JOB -0.0669944 -0.0509930
YOJ -0.0678665 -0.0512803
EDUCATION -0.0734429 -0.0532620
TIF -0.0818308 -0.0690666
CAR_AGE -0.1025965 -0.0712728
AGE -0.1041577 -0.0802102
BLUEBOOK -0.1100800 -0.0820739
MSTATUS -0.1351248 -0.1209872
CAR_USE -0.1426737 -0.1214701
INCOME -0.1487891 -0.1287263
HOME_VAL -0.1841877 -0.1522253

Now, we’ll look at the correlation matrix of the variables.

At this point, exploration, preparation and pair-wise correlations of insurance_training_data.csv are done, so we’ll begin the same exercise for insurance-evaluation-data.csv.

Data Exploration of insurance-evaluation-data.csv.

Initially, we’ll do a cursory exploration of the data. After that, we’ll iteratively prepare and explore the data wherever required.

## [1] "Dimension of training set:   Number of rows: 2141, Number of cols: 26"
## [1] "Head of training data set:"
##   INDEX TARGET_FLAG TARGET_AMT KIDSDRIV AGE HOMEKIDS YOJ  INCOME PARENT1
## 1     3          NA         NA        0  48        0  11 $52,881      No
## 2     9          NA         NA        1  40        1  11 $50,815     Yes
## 3    10          NA         NA        0  44        2  12 $43,486     Yes
## 4    18          NA         NA        0  35        2  NA $21,204     Yes
## 5    21          NA         NA        0  59        0  12 $87,460      No
## 6    30          NA         NA        0  46        0  14              No
##   HOME_VAL MSTATUS SEX     EDUCATION           JOB TRAVTIME    CAR_USE BLUEBOOK
## 1       $0    z_No   M     Bachelors       Manager       26    Private  $21,970
## 2       $0    z_No   M z_High School       Manager       21    Private  $18,930
## 3       $0    z_No z_F z_High School z_Blue Collar       30 Commercial   $5,900
## 4       $0    z_No   M z_High School      Clerical       74    Private   $9,230
## 5       $0    z_No   M z_High School       Manager       45    Private  $15,420
## 6 $207,519     Yes   M     Bachelors  Professional        7 Commercial  $25,660
##   TIF    CAR_TYPE RED_CAR OLDCLAIM CLM_FREQ REVOKED MVR_PTS CAR_AGE
## 1   1         Van     yes       $0        0      No       2      10
## 2   6     Minivan      no   $3,295        1      No       2       1
## 3  10       z_SUV      no       $0        0      No       0      10
## 4   6      Pickup      no       $0        0     Yes       0       4
## 5   1     Minivan     yes  $44,857        2      No       4       1
## 6   1 Panel Truck      no   $2,119        1      No       2      12
##              URBANICITY
## 1   Highly Urban/ Urban
## 2   Highly Urban/ Urban
## 3 z_Highly Rural/ Rural
## 4 z_Highly Rural/ Rural
## 5   Highly Urban/ Urban
## 6   Highly Urban/ Urban
## [1] "Structure of training data set:"
## 'data.frame':    2141 obs. of  26 variables:
##  $ INDEX      : int  3 9 10 18 21 30 31 37 39 47 ...
##  $ TARGET_FLAG: logi  NA NA NA NA NA NA ...
##  $ TARGET_AMT : logi  NA NA NA NA NA NA ...
##  $ KIDSDRIV   : int  0 1 0 0 0 0 0 0 2 0 ...
##  $ AGE        : int  48 40 44 35 59 46 60 54 36 50 ...
##  $ HOMEKIDS   : int  0 1 2 2 0 0 0 0 2 0 ...
##  $ YOJ        : int  11 11 12 NA 12 14 12 12 12 8 ...
##  $ INCOME     : chr  "$52,881" "$50,815" "$43,486" "$21,204" ...
##  $ PARENT1    : chr  "No" "Yes" "Yes" "Yes" ...
##  $ HOME_VAL   : chr  "$0" "$0" "$0" "$0" ...
##  $ MSTATUS    : chr  "z_No" "z_No" "z_No" "z_No" ...
##  $ SEX        : chr  "M" "M" "z_F" "M" ...
##  $ EDUCATION  : chr  "Bachelors" "z_High School" "z_High School" "z_High School" ...
##  $ JOB        : chr  "Manager" "Manager" "z_Blue Collar" "Clerical" ...
##  $ TRAVTIME   : int  26 21 30 74 45 7 16 27 5 22 ...
##  $ CAR_USE    : chr  "Private" "Private" "Commercial" "Private" ...
##  $ BLUEBOOK   : chr  "$21,970" "$18,930" "$5,900" "$9,230" ...
##  $ TIF        : int  1 6 10 6 1 1 1 4 4 4 ...
##  $ CAR_TYPE   : chr  "Van" "Minivan" "z_SUV" "Pickup" ...
##  $ RED_CAR    : chr  "yes" "no" "no" "no" ...
##  $ OLDCLAIM   : chr  "$0" "$3,295" "$0" "$0" ...
##  $ CLM_FREQ   : int  0 1 0 0 2 1 0 0 0 0 ...
##  $ REVOKED    : chr  "No" "No" "No" "Yes" ...
##  $ MVR_PTS    : int  2 2 0 0 4 2 0 5 0 3 ...
##  $ CAR_AGE    : int  10 1 10 4 1 12 1 NA 9 1 ...
##  $ URBANICITY : chr  "Highly Urban/ Urban" "Highly Urban/ Urban" "z_Highly Rural/ Rural" "z_Highly Rural/ Rural" ...

A few fields have missing values, which we’ll investigate in greater detail later.

Data Preparation of insurance-evaluation-data.csv.

At this stage, we’ll explore and prepare iteratively. First, we’ll convert the fields that are supposed to be numeric (the currency fields INCOME, HOME_VAL, BLUEBOOK and OLDCLAIM) into proper numeric format, keeping the remaining fields as strings. After reformatting, we’ll check for NAs and, if required, impute them.
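A minimal sketch of the reformatting step, assuming the evaluation data was read into a data frame called Ins_eval:

```r
# strip "$" and "," from the currency fields; blanks become NA, to be imputed later
money_cols <- c("INCOME", "HOME_VAL", "BLUEBOOK", "OLDCLAIM")
Ins_eval[money_cols] <- lapply(Ins_eval[money_cols],
                               function(x) as.numeric(gsub("[$,]", "", x)))
str(Ins_eval[money_cols])
```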

After that we’ll show some boxplots of the numeric fields.

Checking for NA.

## [1] TRUE

NAs do exist, so we’ll impute them with mice().

Rechecking for NA after imputation.

## [1] FALSE

We observe that the NAs were removed in all columns except TARGET_FLAG and TARGET_AMT, which is what we want. In the following, we’ll visualize this with missmap().

Both is.na() and missmap() confirm that the NAs were eliminated.

More Data exploration of insurance-evaluation-data.csv.

Now, we’ll explore the data a little further. First, we’ll take a quick look at the minimum, 1st quartile, median, mean, 3rd quartile and maximum of each field.

Summary of the numeric fields (the ten character fields again report only Length:2141 and Class:character; TARGET_FLAG and TARGET_AMT are entirely missing in the evaluation data, with 2141 NA's each):

             Min.  1st Qu.   Median      Mean  3rd Qu.     Max.
TARGET_FLAG    NA       NA       NA       NaN       NA       NA
TARGET_AMT     NA       NA       NA       NaN       NA       NA
KIDSDRIV        0        0        0    0.1625        0        3
AGE            17       39       45     45.02       51       73
HOMEKIDS        0        0        0    0.7174        1        5
YOJ             0        9       11     10.36       13       19
INCOME          0    25846    51605     60228    86393   291182
HOME_VAL        0        0   158830    152988   236449   669271
TRAVTIME        5       22       33     33.15       43      105
BLUEBOOK     1500     8870    14170     15469    21050    49940
TIF             1        1        4     5.245        7       25
OLDCLAIM        0        0        0      4022     4718    54399
CLM_FREQ        0        0        0     0.809        2        5
MVR_PTS         0        0        1     1.766        3       12
CAR_AGE         0        1        8     8.198       13       26

Data reordering

For downstream analysis, we’ll reorder the columns into categorical, numeric and target.

Boxplots

Let’s take a first look at the boxplots.

The boxplots show that some of the variables have outliers in them. So, we’ll cap them.

The fields AGE, HOMEKIDS, INCOME, HOME_VAL, TRAVTIME, BLUEBOOK, TIF, CLM_FREQ, MVR_PTS and CAR_AGE have higher variance.

Let’s ignore the boxplots for TARGET_FLAG and TARGET_AMT.

We’ll draw the boxplots differently, with ggplot, to check whether there are any differences.

Histograms

Histograms show how the numeric fields in the dataset are distributed.

The histograms show that AGE, YOJ, HOME_VAL, TRAVTIME, BLUEBOOK and CAR_AGE are approximately normally distributed. HOME_VAL, CAR_AGE and CLM_FREQ are quite dispersed.

Categorical variables

Now, we’ll explore the Categorical variables.

## PARENT1:
## 
##   No  Yes 
## 1875  266
## MSTATUS:
## 
##   No  Yes 
##  847 1294
## SEX:
## 
##    F    M 
## 1170  971
## EDUCATION:
## 
## <High School    Bachelors  High School      Masters          PhD 
##          312          581          622          420          206
## JOB:
## 
##               Blue Collar     Clerical       Doctor   Home Maker       Lawyer 
##          139          463          319           75          202          196 
##      Manager Professional      Student 
##          269          291          187
## CAR_USE:
## 
## Commercial    Private 
##        760       1381
## CAR_TYPE:
## 
##     Minivan Panel Truck      Pickup  Sports Car         SUV         Van 
##         549         177         383         272         589         171
## RED_CAR:
## 
##   no  yes 
## 1543  598
## REVOKED:
## 
##   No  Yes 
## 1880  261
## URBANICITY:
## 
## Highly Rural/ Rural Highly Urban/ Urban 
##                 403                1738

Observation: in the JOB column, 139 rows are empty, so we’ll impute them with “Unknown”.

## JOB:
## 
##  Blue Collar     Clerical       Doctor   Home Maker       Lawyer      Manager 
##          463          319           75          202          196          269 
## Professional      Student      Unknown 
##          291          187          139

Correlations.

At this point the data is prepared. So, we’ll explore the top correlated variables.

There are 25 variables, of which 15 are numeric and 10 are categorical. In order to find the pair-wise correlations, we’ll encode the categorical variables numerically.

Now, we’ll look at the correlation matrix of the variables.

At this point, exploration, preparation and pair-wise correlations of insurance-evaluation-data.csv are done, so we’ll begin the model building process.

Building Models

Now, we are ready to build models. We will build models using insurance_training_data.csv and compare them. Then we will use the best models to predict on insurance-evaluation-data.csv.

Our task is to classify the TARGET_FLAG variable using logistic regression and predict the value of TARGET_AMT using Linear Regression.

We split Ins_train_cap_imputed into training and test sets in an 80/20 ratio, named Ins_train_cap_imputed_trn and Ins_train_cap_imputed_tst respectively.
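A minimal sketch of that split using caret; the seed value is an assumption.

```r
library(caret)

set.seed(123)
idx <- createDataPartition(Ins_train_cap_imputed$TARGET_FLAG, p = 0.8, list = FALSE)
Ins_train_cap_imputed_trn <- Ins_train_cap_imputed[idx, ]
Ins_train_cap_imputed_tst <- Ins_train_cap_imputed[-idx, ]
```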

First Logistic Regression model

Here we will build our first logistic regression model, using all predictors except TARGET_AMT.

## 
## Call:
## glm(formula = TARGET_FLAG ~ . - TARGET_AMT, family = binomial, 
##     data = Ins_train_cap_imputed_trn)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2585  -0.7148  -0.3972   0.6502   3.1743  
## 
## Coefficients:
##                                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                   -2.314e+00  3.198e-01  -7.235 4.64e-13 ***
## PARENT1Yes                     4.214e-01  1.242e-01   3.394 0.000690 ***
## MSTATUSYes                    -5.156e-01  9.376e-02  -5.500 3.81e-08 ***
## SEXM                           5.494e-02  1.250e-01   0.440 0.660147    
## EDUCATIONBachelors            -4.152e-01  1.295e-01  -3.207 0.001341 ** 
## EDUCATIONHigh School          -5.172e-02  1.069e-01  -0.484 0.628448    
## EDUCATIONMasters              -3.404e-01  2.006e-01  -1.697 0.089711 .  
## EDUCATIONPhD                  -2.426e-01  2.352e-01  -1.031 0.302486    
## JOBClerical                    6.363e-02  1.201e-01   0.530 0.596115    
## JOBDoctor                     -5.610e-01  3.115e-01  -1.801 0.071752 .  
## JOBHome Maker                 -1.240e-01  1.737e-01  -0.714 0.475328    
## JOBLawyer                     -1.716e-01  2.114e-01  -0.812 0.416873    
## JOBManager                    -8.168e-01  1.564e-01  -5.222 1.77e-07 ***
## JOBProfessional               -9.226e-02  1.347e-01  -0.685 0.493352    
## JOBStudent                    -1.583e-01  1.460e-01  -1.084 0.278294    
## JOBUnknown                    -3.561e-01  2.081e-01  -1.711 0.087085 .  
## CAR_USEPrivate                -7.988e-01  1.024e-01  -7.801 6.14e-15 ***
## CAR_TYPEPanel Truck            5.476e-01  1.810e-01   3.025 0.002488 ** 
## CAR_TYPEPickup                 5.346e-01  1.125e-01   4.754 1.99e-06 ***
## CAR_TYPESports Car             1.043e+00  1.445e-01   7.218 5.29e-13 ***
## CAR_TYPESUV                    7.775e-01  1.237e-01   6.286 3.25e-10 ***
## CAR_TYPEVan                    7.160e-01  1.404e-01   5.101 3.37e-07 ***
## RED_CARyes                    -1.674e-02  9.675e-02  -0.173 0.862651    
## REVOKEDYes                     7.961e-01  1.029e-01   7.740 9.97e-15 ***
## URBANICITYHighly Urban/ Urban  2.445e+00  1.257e-01  19.455  < 2e-16 ***
## KIDSDRIV                       5.976e-01  1.090e-01   5.481 4.23e-08 ***
## AGE                           -3.265e-03  4.603e-03  -0.709 0.478150    
## HOMEKIDS                       5.087e-02  4.469e-02   1.138 0.254940    
## YOJ                           -1.936e-02  9.416e-03  -2.056 0.039789 *  
## INCOME                        -3.214e-06  1.359e-06  -2.364 0.018084 *  
## HOME_VAL                      -1.319e-06  3.757e-07  -3.510 0.000449 ***
## TRAVTIME                       1.619e-02  2.179e-03   7.430 1.09e-13 ***
## BLUEBOOK                      -2.301e-05  6.083e-06  -3.782 0.000155 ***
## TIF                           -6.052e-02  8.538e-03  -7.088 1.36e-12 ***
## OLDCLAIM                      -1.328e-05  5.397e-06  -2.461 0.013864 *  
## CLM_FREQ                       1.812e-01  3.285e-02   5.516 3.48e-08 ***
## MVR_PTS                        1.089e-01  1.652e-02   6.592 4.33e-11 ***
## CAR_AGE                       -6.136e-03  8.097e-03  -0.758 0.448611    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 7535.7  on 6528  degrees of freedom
## Residual deviance: 5827.6  on 6491  degrees of freedom
## AIC: 5903.6
## 
## Number of Fisher Scoring iterations: 5

The important metric is AIC: 5903.6.

Now, we’ll predict on the test set Ins_train_cap_imputed_tst.
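A minimal sketch of the scoring and evaluation step; model1 is an assumed name for the first model, and the 0.5 cutoff is an assumption.

```r
library(caret)
library(pROC)

# predicted crash probabilities on the hold-out set, then a 0/1 classification
probs <- predict(model1, newdata = Ins_train_cap_imputed_tst, type = "response")
preds <- ifelse(probs > 0.5, 1, 0)

confusionMatrix(factor(preds, levels = c(1, 0)),
                factor(Ins_train_cap_imputed_tst$TARGET_FLAG, levels = c(1, 0)),
                positive = "1")

# ROC curve and AUC
roc_obj <- roc(Ins_train_cap_imputed_tst$TARGET_FLAG, probs)
plot(roc_obj)
auc(roc_obj)
```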

Let’s look at the Confusion Matrix.

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    1    0
##          1  181   97
##          0  249 1105
##                                           
##                Accuracy : 0.788           
##                  95% CI : (0.7673, 0.8076)
##     No Information Rate : 0.7365          
##     P-Value [Acc > NIR] : 8.082e-07       
##                                           
##                   Kappa : 0.3838          
##                                           
##  Mcnemar's Test P-Value : 4.747e-16       
##                                           
##             Sensitivity : 0.4209          
##             Specificity : 0.9193          
##          Pos Pred Value : 0.6511          
##          Neg Pred Value : 0.8161          
##              Prevalence : 0.2635          
##          Detection Rate : 0.1109          
##    Detection Prevalence : 0.1703          
##       Balanced Accuracy : 0.6701          
##                                           
##        'Positive' Class : 1               
## 

Plotting the ROC curve and its AUC.

Here we note the following important metrics.

The accuracy of this model is 0.788.

AUC of the model is 0.801.

Second Logistic Regression model

We’ll build our second Logistic Regression model with a reduced set of predictors.

## 
## Call:
## glm(formula = TARGET_FLAG ~ KIDSDRIV + HOMEKIDS + INCOME + PARENT1 + 
##     HOME_VAL + MSTATUS + EDUCATION + TRAVTIME + CAR_USE + BLUEBOOK + 
##     TIF + CAR_TYPE + CLM_FREQ + REVOKED + MVR_PTS + CAR_AGE + 
##     URBANICITY, family = binomial, data = Ins_train_cap_imputed_trn)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2975  -0.7218  -0.4082   0.6586   3.1654  
## 
## Coefficients:
##                                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                   -2.459e+00  2.123e-01 -11.583  < 2e-16 ***
## KIDSDRIV                       5.821e-01  1.067e-01   5.453 4.94e-08 ***
## HOMEKIDS                       5.781e-02  4.044e-02   1.429 0.152879    
## INCOME                        -4.637e-06  1.194e-06  -3.885 0.000102 ***
## PARENT1Yes                     4.235e-01  1.226e-01   3.454 0.000552 ***
## HOME_VAL                      -1.236e-06  3.642e-07  -3.393 0.000692 ***
## MSTATUSYes                    -5.449e-01  9.254e-02  -5.888 3.90e-09 ***
## EDUCATIONBachelors            -5.555e-01  1.178e-01  -4.715 2.42e-06 ***
## EDUCATIONHigh School          -1.135e-01  1.035e-01  -1.097 0.272683    
## EDUCATIONMasters              -5.769e-01  1.488e-01  -3.876 0.000106 ***
## EDUCATIONPhD                  -5.794e-01  1.815e-01  -3.192 0.001414 ** 
## TRAVTIME                       1.685e-02  2.163e-03   7.790 6.71e-15 ***
## CAR_USEPrivate                -8.800e-01  8.089e-02 -10.879  < 2e-16 ***
## BLUEBOOK                      -2.500e-05  5.444e-06  -4.592 4.39e-06 ***
## TIF                           -6.049e-02  8.476e-03  -7.136 9.59e-13 ***
## CAR_TYPEPanel Truck            4.852e-01  1.598e-01   3.036 0.002399 ** 
## CAR_TYPEPickup                 4.700e-01  1.092e-01   4.305 1.67e-05 ***
## CAR_TYPESports Car             9.889e-01  1.182e-01   8.368  < 2e-16 ***
## CAR_TYPESUV                    7.327e-01  9.481e-02   7.728 1.09e-14 ***
## CAR_TYPEVan                    6.800e-01  1.320e-01   5.150 2.61e-07 ***
## CLM_FREQ                       1.380e-01  2.847e-02   4.847 1.25e-06 ***
## REVOKEDYes                     6.812e-01  8.987e-02   7.580 3.45e-14 ***
## MVR_PTS                        1.095e-01  1.628e-02   6.728 1.72e-11 ***
## CAR_AGE                       -5.781e-03  8.058e-03  -0.717 0.473139    
## URBANICITYHighly Urban/ Urban  2.396e+00  1.252e-01  19.138  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 7535.7  on 6528  degrees of freedom
## Residual deviance: 5883.4  on 6504  degrees of freedom
## AIC: 5933.4
## 
## Number of Fisher Scoring iterations: 5

The important metric is AIC: 5933.4.

Now, we’ll predict on the test set Ins_train_cap_imputed_tst.

Let’s look at the Confusion Matrix.

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    1    0
##          1  178   96
##          0  252 1106
##                                           
##                Accuracy : 0.7868          
##                  95% CI : (0.7661, 0.8064)
##     No Information Rate : 0.7365          
##     P-Value [Acc > NIR] : 1.445e-06       
##                                           
##                   Kappa : 0.3781          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.4140          
##             Specificity : 0.9201          
##          Pos Pred Value : 0.6496          
##          Neg Pred Value : 0.8144          
##              Prevalence : 0.2635          
##          Detection Rate : 0.1091          
##    Detection Prevalence : 0.1679          
##       Balanced Accuracy : 0.6670          
##                                           
##        'Positive' Class : 1               
## 

Plotting the ROC curve and its AUC.

Here we note the following important metrics.

The accuracy of this model is 0.7868.

AUC of the model is 0.802.

Third Logistic Regression model

In the third Logistic Regression model, we’ll do stepwise model selection.
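A minimal sketch of that selection, assuming it starts from the second model (called model2 here) and uses MASS::stepAIC in both directions.

```r
library(MASS)

model3 <- stepAIC(model2, direction = "both", trace = FALSE)
summary(model3)
```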

## 
## Call:
## glm(formula = TARGET_FLAG ~ KIDSDRIV + HOMEKIDS + INCOME + PARENT1 + 
##     HOME_VAL + MSTATUS + EDUCATION + TRAVTIME + CAR_USE + BLUEBOOK + 
##     TIF + CAR_TYPE + CLM_FREQ + REVOKED + MVR_PTS + URBANICITY, 
##     family = binomial, data = Ins_train_cap_imputed_trn)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2901  -0.7231  -0.4097   0.6569   3.1630  
## 
## Coefficients:
##                                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                   -2.480e+00  2.104e-01 -11.789  < 2e-16 ***
## KIDSDRIV                       5.824e-01  1.068e-01   5.455 4.89e-08 ***
## HOMEKIDS                       5.804e-02  4.044e-02   1.435 0.151241    
## INCOME                        -4.693e-06  1.191e-06  -3.940 8.15e-05 ***
## PARENT1Yes                     4.242e-01  1.226e-01   3.461 0.000539 ***
## HOME_VAL                      -1.222e-06  3.636e-07  -3.361 0.000777 ***
## MSTATUSYes                    -5.451e-01  9.252e-02  -5.892 3.82e-09 ***
## EDUCATIONBachelors            -5.848e-01  1.106e-01  -5.288 1.24e-07 ***
## EDUCATIONHigh School          -1.183e-01  1.032e-01  -1.146 0.251610    
## EDUCATIONMasters              -6.343e-01  1.257e-01  -5.046 4.50e-07 ***
## EDUCATIONPhD                  -6.353e-01  1.641e-01  -3.872 0.000108 ***
## TRAVTIME                       1.684e-02  2.162e-03   7.788 6.82e-15 ***
## CAR_USEPrivate                -8.797e-01  8.088e-02 -10.876  < 2e-16 ***
## BLUEBOOK                      -2.498e-05  5.445e-06  -4.588 4.47e-06 ***
## TIF                           -6.059e-02  8.473e-03  -7.151 8.62e-13 ***
## CAR_TYPEPanel Truck            4.858e-01  1.598e-01   3.039 0.002370 ** 
## CAR_TYPEPickup                 4.702e-01  1.092e-01   4.307 1.66e-05 ***
## CAR_TYPESports Car             9.896e-01  1.182e-01   8.375  < 2e-16 ***
## CAR_TYPESUV                    7.326e-01  9.482e-02   7.726 1.11e-14 ***
## CAR_TYPEVan                    6.820e-01  1.320e-01   5.166 2.39e-07 ***
## CLM_FREQ                       1.377e-01  2.847e-02   4.836 1.32e-06 ***
## REVOKEDYes                     6.817e-01  8.988e-02   7.584 3.34e-14 ***
## MVR_PTS                        1.095e-01  1.628e-02   6.725 1.76e-11 ***
## URBANICITYHighly Urban/ Urban  2.397e+00  1.252e-01  19.142  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 7535.7  on 6528  degrees of freedom
## Residual deviance: 5883.9  on 6505  degrees of freedom
## AIC: 5931.9
## 
## Number of Fisher Scoring iterations: 5

The important metric is AIC: 5931.9.

Now, we’ll predict on the test set Ins_train_cap_imputed_tst.

Creation of Confusion Matrix.

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    1    0
##          1  177  100
##          0  253 1102
##                                           
##                Accuracy : 0.7837          
##                  95% CI : (0.7629, 0.8035)
##     No Information Rate : 0.7365          
##     P-Value [Acc > NIR] : 5.811e-06       
##                                           
##                   Kappa : 0.3708          
##                                           
##  Mcnemar's Test P-Value : 5.959e-16       
##                                           
##             Sensitivity : 0.4116          
##             Specificity : 0.9168          
##          Pos Pred Value : 0.6390          
##          Neg Pred Value : 0.8133          
##              Prevalence : 0.2635          
##          Detection Rate : 0.1085          
##    Detection Prevalence : 0.1697          
##       Balanced Accuracy : 0.6642          
##                                           
##        'Positive' Class : 1               
## 

Plotting the ROC curve and its AUC.

Here we note the following important metrics.

The accuracy of this model is 0.7837.

AUC of the model is 0.802.

At this point, three Logistic Regression models have been built. Their accuracies and AUCs are very close; we will use the third (stepwise) model, which achieves essentially the same performance with fewer predictors, to classify the evaluation dataset.

Linear Regression Models

First Linear Regression model

Since our goal here is to predict TARGET_AMT rather than classify (as we did with Logistic Regression), we’ll build a linear regression model for TARGET_AMT.

## 
## Call:
## lm(formula = TARGET_AMT ~ . - TARGET_FLAG, data = Ins_train_cap_imputed)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -5140  -1701   -758    354 103703 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    6.410e+01  4.771e+02   0.134 0.893131    
## PARENT1Yes                     5.390e+02  2.042e+02   2.639 0.008327 ** 
## MSTATUSYes                    -5.673e+02  1.442e+02  -3.934 8.42e-05 ***
## SEXM                           3.517e+02  1.835e+02   1.917 0.055261 .  
## EDUCATIONBachelors            -2.797e+02  2.055e+02  -1.361 0.173546    
## EDUCATIONHigh School          -9.839e+01  1.721e+02  -0.572 0.567524    
## EDUCATIONMasters              -8.573e+00  3.002e+02  -0.029 0.977219    
## EDUCATIONPhD                   1.966e+02  3.541e+02   0.555 0.578804    
## JOBClerical                    2.505e+01  1.934e+02   0.130 0.896929    
## JOBDoctor                     -9.795e+02  4.357e+02  -2.248 0.024611 *  
## JOBHome Maker                 -1.557e+02  2.706e+02  -0.575 0.565054    
## JOBLawyer                     -2.724e+02  3.109e+02  -0.876 0.380978    
## JOBManager                    -9.840e+02  2.345e+02  -4.195 2.75e-05 ***
## JOBProfessional               -4.248e+01  2.132e+02  -0.199 0.842035    
## JOBStudent                    -2.248e+02  2.360e+02  -0.952 0.340976    
## JOBUnknown                    -5.146e+02  3.217e+02  -1.600 0.109723    
## CAR_USEPrivate                -7.928e+02  1.644e+02  -4.822 1.45e-06 ***
## CAR_TYPEPanel Truck            2.709e+02  2.766e+02   0.979 0.327531    
## CAR_TYPEPickup                 3.745e+02  1.708e+02   2.192 0.028377 *  
## CAR_TYPESports Car             1.012e+03  2.179e+02   4.644 3.48e-06 ***
## CAR_TYPESUV                    7.439e+02  1.795e+02   4.145 3.43e-05 ***
## CAR_TYPEVan                    5.213e+02  2.133e+02   2.444 0.014556 *  
## RED_CARyes                    -3.659e+01  1.491e+02  -0.245 0.806149    
## REVOKEDYes                     5.943e+02  1.744e+02   3.409 0.000656 ***
## URBANICITYHighly Urban/ Urban  1.673e+03  1.392e+02  12.015  < 2e-16 ***
## KIDSDRIV                       6.023e+02  1.790e+02   3.365 0.000769 ***
## AGE                            6.574e+00  7.183e+00   0.915 0.360107    
## HOMEKIDS                       8.174e+01  6.995e+01   1.169 0.242637    
## YOJ                           -6.412e+00  1.474e+01  -0.435 0.663681    
## INCOME                        -3.949e-03  2.089e-03  -1.890 0.058735 .  
## HOME_VAL                      -6.719e-04  5.866e-04  -1.145 0.252042    
## TRAVTIME                       1.275e+01  3.340e+00   3.817 0.000136 ***
## BLUEBOOK                       1.343e-02  8.970e-03   1.497 0.134462    
## TIF                           -5.089e+01  1.295e+01  -3.931 8.52e-05 ***
## OLDCLAIM                      -1.690e-02  9.122e-03  -1.853 0.063925 .  
## CLM_FREQ                       1.664e+02  5.689e+01   2.924 0.003461 ** 
## MVR_PTS                        1.735e+02  2.790e+01   6.217 5.31e-10 ***
## CAR_AGE                       -2.489e+01  1.237e+01  -2.012 0.044271 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4545 on 8123 degrees of freedom
## Multiple R-squared:  0.07062,    Adjusted R-squared:  0.06639 
## F-statistic: 16.68 on 37 and 8123 DF,  p-value: < 2.2e-16

Here we note the following important metrics.

R-squared: 0.07062.

Adjusted R-squared: 0.06639.

The R-squared values are far from 1, so the model fit is poor. In the plot below, the points are widely scattered and show no clear linear pattern.

Second Linear Regression model

In the second Linear Regression model, we’ll use stepwise selection by AIC.
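A minimal sketch of that reduction, assuming the full linear model above is stored as lm_full.

```r
library(MASS)

lm_step <- stepAIC(lm_full, direction = "both", trace = FALSE)
summary(lm_step)
```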

## 
## Call:
## lm(formula = TARGET_AMT ~ PARENT1 + MSTATUS + SEX + JOB + CAR_USE + 
##     CAR_TYPE + REVOKED + URBANICITY + KIDSDRIV + INCOME + TRAVTIME + 
##     BLUEBOOK + TIF + OLDCLAIM + CLM_FREQ + MVR_PTS + CAR_AGE, 
##     data = Ins_train_cap_imputed)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -5196  -1698   -766    336 103646 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    1.555e+02  3.413e+02   0.455 0.648781    
## PARENT1Yes                     5.962e+02  1.783e+02   3.343 0.000834 ***
## MSTATUSYes                    -6.146e+02  1.197e+02  -5.136 2.88e-07 ***
## SEXM                           3.208e+02  1.605e+02   1.999 0.045662 *  
## JOBClerical                    3.956e+01  1.923e+02   0.206 0.837011    
## JOBDoctor                     -6.685e+02  3.556e+02  -1.880 0.060152 .  
## JOBHome Maker                 -1.066e+02  2.480e+02  -0.430 0.667431    
## JOBLawyer                     -1.534e+02  2.421e+02  -0.634 0.526325    
## JOBManager                    -9.831e+02  2.127e+02  -4.623 3.84e-06 ***
## JOBProfessional               -1.372e+02  1.976e+02  -0.694 0.487429    
## JOBStudent                    -1.296e+02  2.226e+02  -0.582 0.560294    
## JOBUnknown                    -3.164e+02  2.676e+02  -1.182 0.237046    
## CAR_USEPrivate                -7.430e+02  1.569e+02  -4.736 2.21e-06 ***
## CAR_TYPEPanel Truck            3.137e+02  2.733e+02   1.148 0.251206    
## CAR_TYPEPickup                 4.009e+02  1.695e+02   2.366 0.018021 *  
## CAR_TYPESports Car             1.027e+03  2.165e+02   4.746 2.11e-06 ***
## CAR_TYPESUV                    7.486e+02  1.786e+02   4.191 2.80e-05 ***
## CAR_TYPEVan                    5.417e+02  2.121e+02   2.554 0.010654 *  
## REVOKEDYes                     5.990e+02  1.743e+02   3.437 0.000592 ***
## URBANICITYHighly Urban/ Urban  1.667e+03  1.391e+02  11.985  < 2e-16 ***
## KIDSDRIV                       6.949e+02  1.620e+02   4.289 1.81e-05 ***
## INCOME                        -5.025e-03  1.797e-03  -2.796 0.005181 ** 
## TRAVTIME                       1.261e+01  3.338e+00   3.778 0.000159 ***
## BLUEBOOK                       1.377e-02  8.872e-03   1.552 0.120690    
## TIF                           -5.033e+01  1.294e+01  -3.890 0.000101 ***
## OLDCLAIM                      -1.690e-02  9.117e-03  -1.853 0.063853 .  
## CLM_FREQ                       1.692e+02  5.683e+01   2.977 0.002921 ** 
## MVR_PTS                        1.744e+02  2.786e+01   6.258 4.08e-10 ***
## CAR_AGE                       -2.563e+01  1.092e+01  -2.346 0.018999 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4545 on 8132 degrees of freedom
## Multiple R-squared:  0.06969,    Adjusted R-squared:  0.06649 
## F-statistic: 21.76 on 28 and 8132 DF,  p-value: < 2.2e-16

Here we note the following important metrics.

R-squared: 0.06969.

Adjusted R-squared: 0.06649.

The second model is marginally better than the first on adjusted R-squared. The plot likewise suggests that a linear model does not fit the data well.

Model Selection.

We ran three Logistic Regression models and two Linear Regression models. The three Logistic Regression models performed very similarly, and we selected the third (stepwise) model, which matches the best AUC with the fewest predictors. The second Linear Regression model did slightly better on adjusted R-squared, so we selected it for TARGET_AMT.

We’ll use these models to make predictions on the evaluation dataset insurance-evaluation-data.csv, as sketched below.
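A minimal sketch of the scoring step; model3 (logistic) and lm_step (linear) are the assumed names of the selected models, Ins_eval is the assumed name of the prepared evaluation data, and the 0.5 cutoff and the zeroing of TARGET_AMT for predicted non-crashes are assumptions.

```r
# predicted crash flag from the selected logistic model
eval_probs <- predict(model3, newdata = Ins_eval, type = "response")
Ins_eval$TARGET_FLAG <- ifelse(eval_probs > 0.5, 1, 0)

# predicted cost from the selected linear model, kept only for predicted crashes
eval_amt <- predict(lm_step, newdata = Ins_eval)
Ins_eval$TARGET_AMT <- ifelse(Ins_eval$TARGET_FLAG == 1, pmax(eval_amt, 0), 0)

head(Ins_eval)
```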

Head of the predicted data.

##   PARENT1 MSTATUS SEX   EDUCATION          JOB    CAR_USE    CAR_TYPE RED_CAR
## 1      No      No   M   Bachelors      Manager    Private         Van     yes
## 2     Yes      No   M High School      Manager    Private     Minivan      no
## 3     Yes      No   F High School  Blue Collar Commercial         SUV      no
## 4     Yes      No   M High School     Clerical    Private      Pickup      no
## 5      No      No   M High School      Manager    Private     Minivan     yes
## 6      No     Yes   M   Bachelors Professional Commercial Panel Truck      no
##   REVOKED          URBANICITY KIDSDRIV AGE HOMEKIDS YOJ INCOME HOME_VAL
## 1      No Highly Urban/ Urban        0  48        0  11  52881        0
## 2      No Highly Urban/ Urban        1  40        1  11  50815        0
## 3      No Highly Rural/ Rural        0  44        2  12  43486        0
## 4     Yes Highly Rural/ Rural        0  35        2   9  21204        0
## 5      No Highly Urban/ Urban        0  59        0  12  87460        0
## 6      No Highly Urban/ Urban        0  46        0  14 122150   207519
##   TRAVTIME BLUEBOOK TIF OLDCLAIM CLM_FREQ MVR_PTS CAR_AGE TARGET_FLAG
## 1       26    21970   1        0        0       2      10           0
## 2       21    18930   6     3295        1       2       1           0
## 3       30     5900  10        0        0       0      10           0
## 4       74     9230   6        0        0       0       4           0
## 5       45    15420   1    26114        2       4       1           0
## 6        7    25660   1     2119        1       2      12           0
##   TARGET_AMT
## 1          0
## 2          0
## 3          0
## 4          0
## 5          0
## 6          0

Write the data to a CSV file
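A minimal sketch of the export; the output file name is an assumption.

```r
write.csv(Ins_eval, "insurance_evaluation_predicted.csv", row.names = FALSE)
```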