Homework 04

Overview

In this homework assignment, you will explore, analyze, and model a dataset of approximately 8,000 records, each representing a customer of an auto insurance company. Each record has two response variables. The first response variable, TARGET_FLAG, is a 1 or a 0: a “1” means the person was in a car crash, and a “0” means the person was not. The second response variable is TARGET_AMT. This value is zero if the person did not crash their car; if they did crash, it is a value greater than zero.

Your objective is to build multiple linear regression and binary logistic regression models on the training data to predict the probability that a person will crash their car and also the amount of money it will cost if the person does crash their car. You can only use the variables given to you (or variables that you derive from the variables provided). Below is a short description of the variables of interest in the data set:

Data Exploration

As a first step, we will explore both the training and the evaluation data.

## [1] "Dimension of training set:   Number of rows: 8161, Number of cols: 26"
##   INDEX TARGET_FLAG TARGET_AMT KIDSDRIV AGE HOMEKIDS YOJ   INCOME PARENT1
## 1     1           0          0        0  60        0  11  $67,349      No
## 2     2           0          0        0  43        0  11  $91,449      No
## 3     4           0          0        0  35        1  10  $16,039      No
## 4     5           0          0        0  51        0  14               No
## 5     6           0          0        0  50        0  NA $114,986      No
## 6     7           1       2946        0  34        1  12 $125,301     Yes
##   HOME_VAL MSTATUS SEX     EDUCATION           JOB TRAVTIME    CAR_USE BLUEBOOK
## 1       $0    z_No   M           PhD  Professional       14    Private  $14,230
## 2 $257,252    z_No   M z_High School z_Blue Collar       22 Commercial  $14,940
## 3 $124,191     Yes z_F z_High School      Clerical        5    Private   $4,010
## 4 $306,251     Yes   M  <High School z_Blue Collar       32    Private  $15,440
## 5 $243,925     Yes z_F           PhD        Doctor       36    Private  $18,000
## 6       $0    z_No z_F     Bachelors z_Blue Collar       46 Commercial  $17,430
##   TIF   CAR_TYPE RED_CAR OLDCLAIM CLM_FREQ REVOKED MVR_PTS CAR_AGE
## 1  11    Minivan     yes   $4,461        2      No       3      18
## 2   1    Minivan     yes       $0        0      No       0       1
## 3   4      z_SUV      no  $38,690        2      No       3      10
## 4   7    Minivan     yes       $0        0      No       0       6
## 5   1      z_SUV      no  $19,217        2     Yes       3      17
## 6   1 Sports Car      no       $0        0      No       0       7
##            URBANICITY
## 1 Highly Urban/ Urban
## 2 Highly Urban/ Urban
## 3 Highly Urban/ Urban
## 4 Highly Urban/ Urban
## 5 Highly Urban/ Urban
## 6 Highly Urban/ Urban
## [1] "Structure of training data set:"
## 'data.frame':    8161 obs. of  26 variables:
##  $ INDEX      : int  1 2 4 5 6 7 8 11 12 13 ...
##  $ TARGET_FLAG: int  0 0 0 0 0 1 0 1 1 0 ...
##  $ TARGET_AMT : num  0 0 0 0 0 ...
##  $ KIDSDRIV   : int  0 0 0 0 0 0 0 1 0 0 ...
##  $ AGE        : int  60 43 35 51 50 34 54 37 34 50 ...
##  $ HOMEKIDS   : int  0 0 1 0 0 1 0 2 0 0 ...
##  $ YOJ        : int  11 11 10 14 NA 12 NA NA 10 7 ...
##  $ INCOME     : chr  "$67,349" "$91,449" "$16,039" "" ...
##  $ PARENT1    : chr  "No" "No" "No" "No" ...
##  $ HOME_VAL   : chr  "$0" "$257,252" "$124,191" "$306,251" ...
##  $ MSTATUS    : chr  "z_No" "z_No" "Yes" "Yes" ...
##  $ SEX        : chr  "M" "M" "z_F" "M" ...
##  $ EDUCATION  : chr  "PhD" "z_High School" "z_High School" "<High School" ...
##  $ JOB        : chr  "Professional" "z_Blue Collar" "Clerical" "z_Blue Collar" ...
##  $ TRAVTIME   : int  14 22 5 32 36 46 33 44 34 48 ...
##  $ CAR_USE    : chr  "Private" "Commercial" "Private" "Private" ...
##  $ BLUEBOOK   : chr  "$14,230" "$14,940" "$4,010" "$15,440" ...
##  $ TIF        : int  11 1 4 7 1 1 1 1 1 7 ...
##  $ CAR_TYPE   : chr  "Minivan" "Minivan" "z_SUV" "Minivan" ...
##  $ RED_CAR    : chr  "yes" "yes" "no" "yes" ...
##  $ OLDCLAIM   : chr  "$4,461" "$0" "$38,690" "$0" ...
##  $ CLM_FREQ   : int  2 0 2 0 2 0 0 1 0 0 ...
##  $ REVOKED    : chr  "No" "No" "No" "No" ...
##  $ MVR_PTS    : int  3 0 3 0 3 0 0 10 0 1 ...
##  $ CAR_AGE    : int  18 1 10 6 17 7 1 7 1 17 ...
##  $ URBANICITY : chr  "Highly Urban/ Urban" "Highly Urban/ Urban" "Highly Urban/ Urban" "Highly Urban/ Urban" ...

Checking for NA.

## [1] TRUE

There are NAs in our data, so we will impute them using mice().
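A minimal sketch of that imputation step is shown below, assuming the raw training data was read into a data frame called Ins_train and that the currency columns are converted to numbers first; the seed and m = 5 are assumptions, and mice() is left at its defaults (predictive mean matching for numeric fields).

```r
library(mice)

# strip "$" and "," so the currency fields become numeric (blanks turn into NA)
money_cols <- c("INCOME", "HOME_VAL", "BLUEBOOK", "OLDCLAIM")
Ins_train[money_cols] <- lapply(Ins_train[money_cols],
                                function(x) as.numeric(gsub("[$,]", "", x)))

set.seed(123)                                     # assumed seed, for reproducibility
imp <- mice(Ins_train, m = 5, printFlag = FALSE)  # default methods (pmm for numeric)
Ins_train_imputed <- complete(imp)                # keep the first completed dataset

anyNA(Ins_train_imputed)                          # recheck for NAs
```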

Rechecking for NA after imputation.


We observe that the NAs were removed.

Here, we’ll further explore the data. First, we’ll look at the minimum, 1st quartile, median, mean, 3rd quartile and maximum of each field.

Summary of the numeric fields (the ten character fields PARENT1, MSTATUS, SEX, EDUCATION, JOB, CAR_USE, CAR_TYPE, RED_CAR, REVOKED and URBANICITY only report Length:8161 and Class:character, so they are omitted here):

             Min.  1st Qu.   Median      Mean  3rd Qu.     Max.
TARGET_FLAG     0        0        0    0.2638        1        1
TARGET_AMT      0        0        0      1504     1036   107586
KIDSDRIV        0        0        0    0.1711        0        4
AGE            16       39       45     44.78       51       81
HOMEKIDS        0        0        0    0.7212        1        5
YOJ             0        9       11     10.52       13       23
INCOME          0    28088    53899     61719    85596   367030
HOME_VAL        0        0   160699    154533   238115   885282
TRAVTIME        5       22       33     33.49       44      142
BLUEBOOK     1500     9280    14440     15710    20850    69740
TIF             1        1        4     5.351        7       25
OLDCLAIM        0        0        0      4037     4636    57037
CLM_FREQ        0        0        0    0.7986        2        5
MVR_PTS         0        0        1     1.696        3       13
CAR_AGE        -3        1        8     8.329       12       28

Data reordering

For downstream analysis, we’ll reorder the columns into categorical, numeric and target.
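A minimal sketch of that regrouping, assuming the imputed data is in Ins_train_imputed and that INDEX is dropped because it carries no predictive information:

```r
# character columns first, numeric predictors next, the two targets last
cat_cols <- names(Ins_train_imputed)[sapply(Ins_train_imputed, is.character)]
num_cols <- setdiff(names(Ins_train_imputed),
                    c(cat_cols, "INDEX", "TARGET_FLAG", "TARGET_AMT"))
Ins_train_imputed <- Ins_train_imputed[, c(cat_cols, num_cols,
                                           "TARGET_FLAG", "TARGET_AMT")]
```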

Boxplots

First look at the boxplots.

The boxplots show that some of the variables have outliers in them. So, we’ll cap them.

The fields AGE, HOMEKIDS, INCOME, HOME_VAL, TRAVTIME, BLUEBOOK, TIF, CLM_FREQ, MVR_PTS and CAR_AGE have higher variance. We will ignore the boxplots for TARGET_FLAG and TARGET_AMT here.
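A minimal sketch of the capping step is below; capping at the 1st and 99th percentiles is an assumption (the exact rule is not shown above), and the result is stored as Ins_train_cap_imputed, the name used in the modeling sections.

```r
# pull values below/above the chosen percentiles back to the cutoffs
cap_outliers <- function(x, lower = 0.01, upper = 0.99) {
  q <- quantile(x, probs = c(lower, upper), na.rm = TRUE)
  pmin(pmax(x, q[1]), q[2])
}

cap_cols <- c("AGE", "HOMEKIDS", "INCOME", "HOME_VAL", "TRAVTIME",
              "BLUEBOOK", "TIF", "CLM_FREQ", "MVR_PTS", "CAR_AGE")
Ins_train_cap_imputed <- Ins_train_imputed
Ins_train_cap_imputed[cap_cols] <- lapply(Ins_train_cap_imputed[cap_cols], cap_outliers)
```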

Histograms

We will see how the numeric fields are distributed using histograms.

We can see that AGE, YOJ, INCOME, HOME_VAL, TRAVTIME, BLUEBOOK and CAR_AGE are approximately normally distributed. HOME_VAL, CAR_AGE and CLM_FREQ are quite dispersed.

Categorical variables

Now, let’s explore the Categorical variables.

## PARENT1:
## 
##   No  Yes 
## 7084 1077
## MSTATUS:
## 
##   No  Yes 
## 3267 4894
## SEX:
## 
##    F    M 
## 4375 3786
## EDUCATION:
## 
## <High School    Bachelors  High School      Masters          PhD 
##         1203         2242         2330         1658          728
## JOB:
## 
##               Blue Collar     Clerical       Doctor   Home Maker       Lawyer 
##          526         1825         1271          246          641          835 
##      Manager Professional      Student 
##          988         1117          712
## CAR_USE:
## 
## Commercial    Private 
##       3029       5132
## CAR_TYPE:
## 
##     Minivan Panel Truck      Pickup  Sports Car         SUV         Van 
##        2145         676        1389         907        2294         750
## RED_CAR:
## 
##   no  yes 
## 5783 2378
## REVOKED:
## 
##   No  Yes 
## 7161 1000
## URBANICITY:
## 
## Highly Rural/ Rural Highly Urban/ Urban 
##                1669                6492

There are 526 empty rows in the JOB column, so we’ll impute them with “Unknown”.
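One way to do this, assuming the working data frame is Ins_train_cap_imputed:

```r
# relabel the empty JOB values and recount
Ins_train_cap_imputed$JOB[Ins_train_cap_imputed$JOB == ""] <- "Unknown"
table(Ins_train_cap_imputed$JOB)
```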

## JOB:
## 
##  Blue Collar     Clerical       Doctor   Home Maker       Lawyer      Manager 
##         1825         1271          246          641          835          988 
## Professional      Student      Unknown 
##         1117          712          526

Correlations

At this point the data is prepared. So, we’ll explore the top correlated variables.

There are 25 variables, of which 15 are numeric and 10 are categorical. In order to find the top correlated variables, we’ll encode the categorical variables numerically.
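A minimal sketch of that step, assuming the prepared training data is in Ins_train_cap_imputed; the categorical columns are integer-coded with as.factor()/as.numeric() purely to get rough correlations against the two targets.

```r
Ins_num <- Ins_train_cap_imputed
chr_cols <- sapply(Ins_num, is.character)
Ins_num[chr_cols] <- lapply(Ins_num[chr_cols],
                            function(x) as.numeric(as.factor(x)))   # integer codes

# correlation of every variable with the two targets, sorted by TARGET_FLAG
cors <- cor(Ins_num)[, c("TARGET_FLAG", "TARGET_AMT")]
round(cors[order(cors[, "TARGET_FLAG"], decreasing = TRUE), ], 7)
```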

Top Correlated Variables
TARGET_FLAG TARGET_AMT
TARGET_FLAG 1.0000000 1.0000000
TARGET_AMT 0.8334240 0.8334240
URBANICITY 0.2242509 0.1904945
CLM_FREQ 0.2161961 0.1869848
MVR_PTS 0.2075451 0.1741927
OLDCLAIM 0.2004106 0.1652881
PARENT1 0.1576222 0.1359305
REVOKED 0.1519391 0.1263285
HOMEKIDS 0.1175903 0.1014967
KIDSDRIV 0.1036217 0.0863553
CAR_TYPE 0.1035765 0.0827170
TRAVTIME 0.0550685 0.0440349
RED_CAR -0.0069473 0.0005877
SEX -0.0210786 -0.0088270
JOB -0.0669944 -0.0509930
YOJ -0.0678665 -0.0512803
EDUCATION -0.0734429 -0.0532620
TIF -0.0818308 -0.0690666
CAR_AGE -0.1025965 -0.0712728
AGE -0.1041577 -0.0802102
BLUEBOOK -0.1100800 -0.0820739
MSTATUS -0.1351248 -0.1209872
CAR_USE -0.1426737 -0.1214701
INCOME -0.1487891 -0.1287263
HOME_VAL -0.1841877 -0.1522253

Now, we’ll look at the correlation matrix of the variables.

At this point, exploration, preparation and pair-wise correlations of insurance_training_data.csv are done, so we’ll begin the same exercise for insurance-evaluation-data.csv.

Data Exploration of insurance-evaluation-data.csv.

Initially, we’ll do a cursory exploration of the data. After that, we’ll iteratively prepare and explore the data wherever required.

## [1] "Dimension of training set:   Number of rows: 2141, Number of cols: 26"
## [1] "Head of training data set:"
##   INDEX TARGET_FLAG TARGET_AMT KIDSDRIV AGE HOMEKIDS YOJ  INCOME PARENT1
## 1     3          NA         NA        0  48        0  11 $52,881      No
## 2     9          NA         NA        1  40        1  11 $50,815     Yes
## 3    10          NA         NA        0  44        2  12 $43,486     Yes
## 4    18          NA         NA        0  35        2  NA $21,204     Yes
## 5    21          NA         NA        0  59        0  12 $87,460      No
## 6    30          NA         NA        0  46        0  14              No
##   HOME_VAL MSTATUS SEX     EDUCATION           JOB TRAVTIME    CAR_USE BLUEBOOK
## 1       $0    z_No   M     Bachelors       Manager       26    Private  $21,970
## 2       $0    z_No   M z_High School       Manager       21    Private  $18,930
## 3       $0    z_No z_F z_High School z_Blue Collar       30 Commercial   $5,900
## 4       $0    z_No   M z_High School      Clerical       74    Private   $9,230
## 5       $0    z_No   M z_High School       Manager       45    Private  $15,420
## 6 $207,519     Yes   M     Bachelors  Professional        7 Commercial  $25,660
##   TIF    CAR_TYPE RED_CAR OLDCLAIM CLM_FREQ REVOKED MVR_PTS CAR_AGE
## 1   1         Van     yes       $0        0      No       2      10
## 2   6     Minivan      no   $3,295        1      No       2       1
## 3  10       z_SUV      no       $0        0      No       0      10
## 4   6      Pickup      no       $0        0     Yes       0       4
## 5   1     Minivan     yes  $44,857        2      No       4       1
## 6   1 Panel Truck      no   $2,119        1      No       2      12
##              URBANICITY
## 1   Highly Urban/ Urban
## 2   Highly Urban/ Urban
## 3 z_Highly Rural/ Rural
## 4 z_Highly Rural/ Rural
## 5   Highly Urban/ Urban
## 6   Highly Urban/ Urban
## [1] "Structure of training data set:"
## 'data.frame':    2141 obs. of  26 variables:
##  $ INDEX      : int  3 9 10 18 21 30 31 37 39 47 ...
##  $ TARGET_FLAG: logi  NA NA NA NA NA NA ...
##  $ TARGET_AMT : logi  NA NA NA NA NA NA ...
##  $ KIDSDRIV   : int  0 1 0 0 0 0 0 0 2 0 ...
##  $ AGE        : int  48 40 44 35 59 46 60 54 36 50 ...
##  $ HOMEKIDS   : int  0 1 2 2 0 0 0 0 2 0 ...
##  $ YOJ        : int  11 11 12 NA 12 14 12 12 12 8 ...
##  $ INCOME     : chr  "$52,881" "$50,815" "$43,486" "$21,204" ...
##  $ PARENT1    : chr  "No" "Yes" "Yes" "Yes" ...
##  $ HOME_VAL   : chr  "$0" "$0" "$0" "$0" ...
##  $ MSTATUS    : chr  "z_No" "z_No" "z_No" "z_No" ...
##  $ SEX        : chr  "M" "M" "z_F" "M" ...
##  $ EDUCATION  : chr  "Bachelors" "z_High School" "z_High School" "z_High School" ...
##  $ JOB        : chr  "Manager" "Manager" "z_Blue Collar" "Clerical" ...
##  $ TRAVTIME   : int  26 21 30 74 45 7 16 27 5 22 ...
##  $ CAR_USE    : chr  "Private" "Private" "Commercial" "Private" ...
##  $ BLUEBOOK   : chr  "$21,970" "$18,930" "$5,900" "$9,230" ...
##  $ TIF        : int  1 6 10 6 1 1 1 4 4 4 ...
##  $ CAR_TYPE   : chr  "Van" "Minivan" "z_SUV" "Pickup" ...
##  $ RED_CAR    : chr  "yes" "no" "no" "no" ...
##  $ OLDCLAIM   : chr  "$0" "$3,295" "$0" "$0" ...
##  $ CLM_FREQ   : int  0 1 0 0 2 1 0 0 0 0 ...
##  $ REVOKED    : chr  "No" "No" "No" "Yes" ...
##  $ MVR_PTS    : int  2 2 0 0 4 2 0 5 0 3 ...
##  $ CAR_AGE    : int  10 1 10 4 1 12 1 NA 9 1 ...
##  $ URBANICITY : chr  "Highly Urban/ Urban" "Highly Urban/ Urban" "z_Highly Rural/ Rural" "z_Highly Rural/ Rural" ...

A few fields have missing values, which we’ll investigate in greater detail later.

Data Preparation of insurance-evaluation-data.csv.

At this stage, we’ll explore and prepare iteratively. First, we’ll convert the fields that are supposed to be numeric (the currency fields INCOME, HOME_VAL, BLUEBOOK and OLDCLAIM) into proper numeric format, keeping the remaining fields as strings. After reformatting, we’ll check for NAs and, if required, impute them.
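A minimal sketch of the reformatting step, assuming the evaluation data was read into a data frame called Ins_eval:

```r
# strip "$" and "," from the currency fields; blanks become NA, to be imputed later
money_cols <- c("INCOME", "HOME_VAL", "BLUEBOOK", "OLDCLAIM")
Ins_eval[money_cols] <- lapply(Ins_eval[money_cols],
                               function(x) as.numeric(gsub("[$,]", "", x)))
str(Ins_eval[money_cols])
```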

After that we’ll show some boxplots of the numeric fields.

Checking for NA.

## [1] TRUE

NAs do exist, so we’ll impute them with mice().

Rechecking for NA after imputation.

## [1] FALSE

We observe that the NAs were removed in all columns except TARGET_FLAG and TARGET_AMT, which is what we want. In the following, we’ll visualize this with missmap().

Both is.na() and missmap() confirm that the NAs were eliminated.

More Data exploration of insurance-evaluation-data.csv.

Now, we’ll explore the data a little further. First, we’ll take a quick look at the minimum, 1st quartile, median, mean, 3rd quartile and maximum of each field.

Summary of the numeric fields (the ten character fields again report only Length:2141 and Class:character; TARGET_FLAG and TARGET_AMT are entirely missing in the evaluation data, with 2141 NA's each):

             Min.  1st Qu.   Median      Mean  3rd Qu.     Max.
TARGET_FLAG    NA       NA       NA       NaN       NA       NA
TARGET_AMT     NA       NA       NA       NaN       NA       NA
KIDSDRIV        0        0        0    0.1625        0        3
AGE            17       39       45     45.02       51       73
HOMEKIDS        0        0        0    0.7174        1        5
YOJ             0        9       11     10.36       13       19
INCOME          0    25846    51605     60228    86393   291182
HOME_VAL        0        0   158830    152988   236449   669271
TRAVTIME        5       22       33     33.15       43      105
BLUEBOOK     1500     8870    14170     15469    21050    49940
TIF             1        1        4     5.245        7       25
OLDCLAIM        0        0        0      4022     4718    54399
CLM_FREQ        0        0        0     0.809        2        5
MVR_PTS         0        0        1     1.766        3       12
CAR_AGE         0        1        8     8.198       13       26

Data reordering

For downstream analysis, we’ll reorder the columns into categorical, numeric and target.

Boxplots

Let’s take a first look at the boxplots.

The boxplots show that some of the variables have outliers in them. So, we’ll cap them.

The fields AGE, HOMEKIDS, INCOME, HOME_VAL, TRAVTIME, BLUEBOOK, TIF, CLM_FREQ, MVR_PTS and CAR_AGE have higher variance.

Let’s ignore the boxplots for TARGET_FLAG and TARGET_AMT.

We’ll draw the boxplots differently, with ggplot, to check whether there are any differences.

Histograms

Histograms show how the numeric fields in the dataset are distributed.

The histograms show that AGE, YOJ, HOME_VAL, TRAVTIME, BLUEBOOK and CAR_AGE are approximately normally distributed. HOME_VAL, CAR_AGE and CLM_FREQ are quite dispersed.

Categorical variables

Now, we’ll explore the Categorical variables.

## PARENT1:
## 
##   No  Yes 
## 1875  266
## MSTATUS:
## 
##   No  Yes 
##  847 1294
## SEX:
## 
##    F    M 
## 1170  971
## EDUCATION:
## 
## <High School    Bachelors  High School      Masters          PhD 
##          312          581          622          420          206
## JOB:
## 
##               Blue Collar     Clerical       Doctor   Home Maker       Lawyer 
##          139          463          319           75          202          196 
##      Manager Professional      Student 
##          269          291          187
## CAR_USE:
## 
## Commercial    Private 
##        760       1381
## CAR_TYPE:
## 
##     Minivan Panel Truck      Pickup  Sports Car         SUV         Van 
##         549         177         383         272         589         171
## RED_CAR:
## 
##   no  yes 
## 1543  598
## REVOKED:
## 
##   No  Yes 
## 1880  261
## URBANICITY:
## 
## Highly Rural/ Rural Highly Urban/ Urban 
##                 403                1738

Observation: in the JOB column, 139 rows are empty, so we’ll impute them with “Unknown”.

## JOB:
## 
##  Blue Collar     Clerical       Doctor   Home Maker       Lawyer      Manager 
##          463          319           75          202          196          269 
## Professional      Student      Unknown 
##          291          187          139

Correlations.

At this point the data is prepared. So, we’ll explore the top correlated variables.

There are 25 variables, of which 15 are numeric and 10 are categorical. In order to find the pair-wise correlations, we’ll encode the categorical variables numerically.

Now, we’ll look at the correlation matrix of the variables.

At this point, exploration, preparation and pair-wise correlations of insurance-evaluation-data.csv are done, so we’ll begin the model building process.

Building Models

Now, we are ready to build models. We will build models using insurance_training_data.csv and compare them. Then we will use the best models to predict on insurance-evaluation-data.csv.

Our task is to classify the TARGET_FLAG variable using logistic regression and predict the value of TARGET_AMT using Linear Regression.

We split Ins_train_cap_imputed into training and test sets in an 80/20 ratio, named Ins_train_cap_imputed_trn and Ins_train_cap_imputed_tst respectively.
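A minimal sketch of that split using caret; the seed value is an assumption.

```r
library(caret)

set.seed(123)
idx <- createDataPartition(Ins_train_cap_imputed$TARGET_FLAG, p = 0.8, list = FALSE)
Ins_train_cap_imputed_trn <- Ins_train_cap_imputed[idx, ]
Ins_train_cap_imputed_tst <- Ins_train_cap_imputed[-idx, ]
```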

First Logistic Regression model

Here we will build our first logistic regression model, using all predictors except TARGET_AMT.

## 
## Call:
## glm(formula = TARGET_FLAG ~ . - TARGET_AMT, family = binomial, 
##     data = Ins_train_cap_imputed_trn)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2585  -0.7148  -0.3972   0.6502   3.1743  
## 
## Coefficients:
##                                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                   -2.314e+00  3.198e-01  -7.235 4.64e-13 ***
## PARENT1Yes                     4.214e-01  1.242e-01   3.394 0.000690 ***
## MSTATUSYes                    -5.156e-01  9.376e-02  -5.500 3.81e-08 ***
## SEXM                           5.494e-02  1.250e-01   0.440 0.660147    
## EDUCATIONBachelors            -4.152e-01  1.295e-01  -3.207 0.001341 ** 
## EDUCATIONHigh School          -5.172e-02  1.069e-01  -0.484 0.628448    
## EDUCATIONMasters              -3.404e-01  2.006e-01  -1.697 0.089711 .  
## EDUCATIONPhD                  -2.426e-01  2.352e-01  -1.031 0.302486    
## JOBClerical                    6.363e-02  1.201e-01   0.530 0.596115    
## JOBDoctor                     -5.610e-01  3.115e-01  -1.801 0.071752 .  
## JOBHome Maker                 -1.240e-01  1.737e-01  -0.714 0.475328    
## JOBLawyer                     -1.716e-01  2.114e-01  -0.812 0.416873    
## JOBManager                    -8.168e-01  1.564e-01  -5.222 1.77e-07 ***
## JOBProfessional               -9.226e-02  1.347e-01  -0.685 0.493352    
## JOBStudent                    -1.583e-01  1.460e-01  -1.084 0.278294    
## JOBUnknown                    -3.561e-01  2.081e-01  -1.711 0.087085 .  
## CAR_USEPrivate                -7.988e-01  1.024e-01  -7.801 6.14e-15 ***
## CAR_TYPEPanel Truck            5.476e-01  1.810e-01   3.025 0.002488 ** 
## CAR_TYPEPickup                 5.346e-01  1.125e-01   4.754 1.99e-06 ***
## CAR_TYPESports Car             1.043e+00  1.445e-01   7.218 5.29e-13 ***
## CAR_TYPESUV                    7.775e-01  1.237e-01   6.286 3.25e-10 ***
## CAR_TYPEVan                    7.160e-01  1.404e-01   5.101 3.37e-07 ***
## RED_CARyes                    -1.674e-02  9.675e-02  -0.173 0.862651    
## REVOKEDYes                     7.961e-01  1.029e-01   7.740 9.97e-15 ***
## URBANICITYHighly Urban/ Urban  2.445e+00  1.257e-01  19.455  < 2e-16 ***
## KIDSDRIV                       5.976e-01  1.090e-01   5.481 4.23e-08 ***
## AGE                           -3.265e-03  4.603e-03  -0.709 0.478150    
## HOMEKIDS                       5.087e-02  4.469e-02   1.138 0.254940    
## YOJ                           -1.936e-02  9.416e-03  -2.056 0.039789 *  
## INCOME                        -3.214e-06  1.359e-06  -2.364 0.018084 *  
## HOME_VAL                      -1.319e-06  3.757e-07  -3.510 0.000449 ***
## TRAVTIME                       1.619e-02  2.179e-03   7.430 1.09e-13 ***
## BLUEBOOK                      -2.301e-05  6.083e-06  -3.782 0.000155 ***
## TIF                           -6.052e-02  8.538e-03  -7.088 1.36e-12 ***
## OLDCLAIM                      -1.328e-05  5.397e-06  -2.461 0.013864 *  
## CLM_FREQ                       1.812e-01  3.285e-02   5.516 3.48e-08 ***
## MVR_PTS                        1.089e-01  1.652e-02   6.592 4.33e-11 ***
## CAR_AGE                       -6.136e-03  8.097e-03  -0.758 0.448611    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 7535.7  on 6528  degrees of freedom
## Residual deviance: 5827.6  on 6491  degrees of freedom
## AIC: 5903.6
## 
## Number of Fisher Scoring iterations: 5

The important metric is AIC: 5903.6.

Now, we’ll predict on the test set Ins_train_cap_imputed_tst.
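A minimal sketch of the scoring and evaluation step; model1 is an assumed name for the first model, and the 0.5 cutoff is an assumption.

```r
library(caret)
library(pROC)

# predicted crash probabilities on the hold-out set, then a 0/1 classification
probs <- predict(model1, newdata = Ins_train_cap_imputed_tst, type = "response")
preds <- ifelse(probs > 0.5, 1, 0)

confusionMatrix(factor(preds, levels = c(1, 0)),
                factor(Ins_train_cap_imputed_tst$TARGET_FLAG, levels = c(1, 0)),
                positive = "1")

# ROC curve and AUC
roc_obj <- roc(Ins_train_cap_imputed_tst$TARGET_FLAG, probs)
plot(roc_obj)
auc(roc_obj)
```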

Let’s look at the Confusion Matrix.

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    1    0
##          1  181   97
##          0  249 1105
##                                           
##                Accuracy : 0.788           
##                  95% CI : (0.7673, 0.8076)
##     No Information Rate : 0.7365          
##     P-Value [Acc > NIR] : 8.082e-07       
##                                           
##                   Kappa : 0.3838          
##                                           
##  Mcnemar's Test P-Value : 4.747e-16       
##                                           
##             Sensitivity : 0.4209          
##             Specificity : 0.9193          
##          Pos Pred Value : 0.6511          
##          Neg Pred Value : 0.8161          
##              Prevalence : 0.2635          
##          Detection Rate : 0.1109          
##    Detection Prevalence : 0.1703          
##       Balanced Accuracy : 0.6701          
##                                           
##        'Positive' Class : 1               
## 

Plotting the ROC curve and its AUC.

Here we note the following important metrics.

The accuracy of this model is 0.788.

AUC of the model is 0.801.

Second Logistic Regression model

We’ll build our second Logistic Regression model with a reduced set of predictors.

## 
## Call:
## glm(formula = TARGET_FLAG ~ KIDSDRIV + HOMEKIDS + INCOME + PARENT1 + 
##     HOME_VAL + MSTATUS + EDUCATION + TRAVTIME + CAR_USE + BLUEBOOK + 
##     TIF + CAR_TYPE + CLM_FREQ + REVOKED + MVR_PTS + CAR_AGE + 
##     URBANICITY, family = binomial, data = Ins_train_cap_imputed_trn)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2975  -0.7218  -0.4082   0.6586   3.1654  
## 
## Coefficients:
##                                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                   -2.459e+00  2.123e-01 -11.583  < 2e-16 ***
## KIDSDRIV                       5.821e-01  1.067e-01   5.453 4.94e-08 ***
## HOMEKIDS                       5.781e-02  4.044e-02   1.429 0.152879    
## INCOME                        -4.637e-06  1.194e-06  -3.885 0.000102 ***
## PARENT1Yes                     4.235e-01  1.226e-01   3.454 0.000552 ***
## HOME_VAL                      -1.236e-06  3.642e-07  -3.393 0.000692 ***
## MSTATUSYes                    -5.449e-01  9.254e-02  -5.888 3.90e-09 ***
## EDUCATIONBachelors            -5.555e-01  1.178e-01  -4.715 2.42e-06 ***
## EDUCATIONHigh School          -1.135e-01  1.035e-01  -1.097 0.272683    
## EDUCATIONMasters              -5.769e-01  1.488e-01  -3.876 0.000106 ***
## EDUCATIONPhD                  -5.794e-01  1.815e-01  -3.192 0.001414 ** 
## TRAVTIME                       1.685e-02  2.163e-03   7.790 6.71e-15 ***
## CAR_USEPrivate                -8.800e-01  8.089e-02 -10.879  < 2e-16 ***
## BLUEBOOK                      -2.500e-05  5.444e-06  -4.592 4.39e-06 ***
## TIF                           -6.049e-02  8.476e-03  -7.136 9.59e-13 ***
## CAR_TYPEPanel Truck            4.852e-01  1.598e-01   3.036 0.002399 ** 
## CAR_TYPEPickup                 4.700e-01  1.092e-01   4.305 1.67e-05 ***
## CAR_TYPESports Car             9.889e-01  1.182e-01   8.368  < 2e-16 ***
## CAR_TYPESUV                    7.327e-01  9.481e-02   7.728 1.09e-14 ***
## CAR_TYPEVan                    6.800e-01  1.320e-01   5.150 2.61e-07 ***
## CLM_FREQ                       1.380e-01  2.847e-02   4.847 1.25e-06 ***
## REVOKEDYes                     6.812e-01  8.987e-02   7.580 3.45e-14 ***
## MVR_PTS                        1.095e-01  1.628e-02   6.728 1.72e-11 ***
## CAR_AGE                       -5.781e-03  8.058e-03  -0.717 0.473139    
## URBANICITYHighly Urban/ Urban  2.396e+00  1.252e-01  19.138  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 7535.7  on 6528  degrees of freedom
## Residual deviance: 5883.4  on 6504  degrees of freedom
## AIC: 5933.4
## 
## Number of Fisher Scoring iterations: 5

The important metric is AIC: 5933.4.

Now, we’ll predict on the test set Ins_train_cap_imputed_tst.

Let’s look at the Confusion Matrix.

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    1    0
##          1  178   96
##          0  252 1106
##                                           
##                Accuracy : 0.7868          
##                  95% CI : (0.7661, 0.8064)
##     No Information Rate : 0.7365          
##     P-Value [Acc > NIR] : 1.445e-06       
##                                           
##                   Kappa : 0.3781          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.4140          
##             Specificity : 0.9201          
##          Pos Pred Value : 0.6496          
##          Neg Pred Value : 0.8144          
##              Prevalence : 0.2635          
##          Detection Rate : 0.1091          
##    Detection Prevalence : 0.1679          
##       Balanced Accuracy : 0.6670          
##                                           
##        'Positive' Class : 1               
## 

Plotting the ROC curve and its AUC.

Here we note the following important metrics.

The accuracy of this model is 0.7868.

AUC of the model is 0.802.

Third Logistic Regression model

In the third Logistic Regression model, we’ll do stepwise model selection.
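A minimal sketch of that selection, assuming it starts from the second model (called model2 here) and uses MASS::stepAIC in both directions.

```r
library(MASS)

model3 <- stepAIC(model2, direction = "both", trace = FALSE)
summary(model3)
```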

## 
## Call:
## glm(formula = TARGET_FLAG ~ KIDSDRIV + HOMEKIDS + INCOME + PARENT1 + 
##     HOME_VAL + MSTATUS + EDUCATION + TRAVTIME + CAR_USE + BLUEBOOK + 
##     TIF + CAR_TYPE + CLM_FREQ + REVOKED + MVR_PTS + URBANICITY, 
##     family = binomial, data = Ins_train_cap_imputed_trn)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2901  -0.7231  -0.4097   0.6569   3.1630  
## 
## Coefficients:
##                                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                   -2.480e+00  2.104e-01 -11.789  < 2e-16 ***
## KIDSDRIV                       5.824e-01  1.068e-01   5.455 4.89e-08 ***
## HOMEKIDS                       5.804e-02  4.044e-02   1.435 0.151241    
## INCOME                        -4.693e-06  1.191e-06  -3.940 8.15e-05 ***
## PARENT1Yes                     4.242e-01  1.226e-01   3.461 0.000539 ***
## HOME_VAL                      -1.222e-06  3.636e-07  -3.361 0.000777 ***
## MSTATUSYes                    -5.451e-01  9.252e-02  -5.892 3.82e-09 ***
## EDUCATIONBachelors            -5.848e-01  1.106e-01  -5.288 1.24e-07 ***
## EDUCATIONHigh School          -1.183e-01  1.032e-01  -1.146 0.251610    
## EDUCATIONMasters              -6.343e-01  1.257e-01  -5.046 4.50e-07 ***
## EDUCATIONPhD                  -6.353e-01  1.641e-01  -3.872 0.000108 ***
## TRAVTIME                       1.684e-02  2.162e-03   7.788 6.82e-15 ***
## CAR_USEPrivate                -8.797e-01  8.088e-02 -10.876  < 2e-16 ***
## BLUEBOOK                      -2.498e-05  5.445e-06  -4.588 4.47e-06 ***
## TIF                           -6.059e-02  8.473e-03  -7.151 8.62e-13 ***
## CAR_TYPEPanel Truck            4.858e-01  1.598e-01   3.039 0.002370 ** 
## CAR_TYPEPickup                 4.702e-01  1.092e-01   4.307 1.66e-05 ***
## CAR_TYPESports Car             9.896e-01  1.182e-01   8.375  < 2e-16 ***
## CAR_TYPESUV                    7.326e-01  9.482e-02   7.726 1.11e-14 ***
## CAR_TYPEVan                    6.820e-01  1.320e-01   5.166 2.39e-07 ***
## CLM_FREQ                       1.377e-01  2.847e-02   4.836 1.32e-06 ***
## REVOKEDYes                     6.817e-01  8.988e-02   7.584 3.34e-14 ***
## MVR_PTS                        1.095e-01  1.628e-02   6.725 1.76e-11 ***
## URBANICITYHighly Urban/ Urban  2.397e+00  1.252e-01  19.142  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 7535.7  on 6528  degrees of freedom
## Residual deviance: 5883.9  on 6505  degrees of freedom
## AIC: 5931.9
## 
## Number of Fisher Scoring iterations: 5

The important metric is AIC: 5931.9.

Now, we’ll predict on the test set Ins_train_cap_imputed_tst.

Creation of Confusion Matrix.

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    1    0
##          1  177  100
##          0  253 1102
##                                           
##                Accuracy : 0.7837          
##                  95% CI : (0.7629, 0.8035)
##     No Information Rate : 0.7365          
##     P-Value [Acc > NIR] : 5.811e-06       
##                                           
##                   Kappa : 0.3708          
##                                           
##  Mcnemar's Test P-Value : 5.959e-16       
##                                           
##             Sensitivity : 0.4116          
##             Specificity : 0.9168          
##          Pos Pred Value : 0.6390          
##          Neg Pred Value : 0.8133          
##              Prevalence : 0.2635          
##          Detection Rate : 0.1085          
##    Detection Prevalence : 0.1697          
##       Balanced Accuracy : 0.6642          
##                                           
##        'Positive' Class : 1               
## 

Plotting the ROC curve and its AUC.

Here we note the following important metrics.

The accuracy of this model is 0.7837.

AUC of the model is 0.802.

At this point, three Logistic Regression models have been built. Their accuracies and AUCs are very close; we will use the third (stepwise) model, which achieves essentially the same performance with fewer predictors, to classify the evaluation dataset.

Linear Regression Models

First Linear Regression model

Since our goal here is to predict TARGET_AMT rather than classify (as we did with Logistic Regression), we’ll build a linear regression model for TARGET_AMT.

## 
## Call:
## lm(formula = TARGET_AMT ~ . - TARGET_FLAG, data = Ins_train_cap_imputed)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -5140  -1701   -758    354 103703 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    6.410e+01  4.771e+02   0.134 0.893131    
## PARENT1Yes                     5.390e+02  2.042e+02   2.639 0.008327 ** 
## MSTATUSYes                    -5.673e+02  1.442e+02  -3.934 8.42e-05 ***
## SEXM                           3.517e+02  1.835e+02   1.917 0.055261 .  
## EDUCATIONBachelors            -2.797e+02  2.055e+02  -1.361 0.173546    
## EDUCATIONHigh School          -9.839e+01  1.721e+02  -0.572 0.567524    
## EDUCATIONMasters              -8.573e+00  3.002e+02  -0.029 0.977219    
## EDUCATIONPhD                   1.966e+02  3.541e+02   0.555 0.578804    
## JOBClerical                    2.505e+01  1.934e+02   0.130 0.896929    
## JOBDoctor                     -9.795e+02  4.357e+02  -2.248 0.024611 *  
## JOBHome Maker                 -1.557e+02  2.706e+02  -0.575 0.565054    
## JOBLawyer                     -2.724e+02  3.109e+02  -0.876 0.380978    
## JOBManager                    -9.840e+02  2.345e+02  -4.195 2.75e-05 ***
## JOBProfessional               -4.248e+01  2.132e+02  -0.199 0.842035    
## JOBStudent                    -2.248e+02  2.360e+02  -0.952 0.340976    
## JOBUnknown                    -5.146e+02  3.217e+02  -1.600 0.109723    
## CAR_USEPrivate                -7.928e+02  1.644e+02  -4.822 1.45e-06 ***
## CAR_TYPEPanel Truck            2.709e+02  2.766e+02   0.979 0.327531    
## CAR_TYPEPickup                 3.745e+02  1.708e+02   2.192 0.028377 *  
## CAR_TYPESports Car             1.012e+03  2.179e+02   4.644 3.48e-06 ***
## CAR_TYPESUV                    7.439e+02  1.795e+02   4.145 3.43e-05 ***
## CAR_TYPEVan                    5.213e+02  2.133e+02   2.444 0.014556 *  
## RED_CARyes                    -3.659e+01  1.491e+02  -0.245 0.806149    
## REVOKEDYes                     5.943e+02  1.744e+02   3.409 0.000656 ***
## URBANICITYHighly Urban/ Urban  1.673e+03  1.392e+02  12.015  < 2e-16 ***
## KIDSDRIV                       6.023e+02  1.790e+02   3.365 0.000769 ***
## AGE                            6.574e+00  7.183e+00   0.915 0.360107    
## HOMEKIDS                       8.174e+01  6.995e+01   1.169 0.242637    
## YOJ                           -6.412e+00  1.474e+01  -0.435 0.663681    
## INCOME                        -3.949e-03  2.089e-03  -1.890 0.058735 .  
## HOME_VAL                      -6.719e-04  5.866e-04  -1.145 0.252042    
## TRAVTIME                       1.275e+01  3.340e+00   3.817 0.000136 ***
## BLUEBOOK                       1.343e-02  8.970e-03   1.497 0.134462    
## TIF                           -5.089e+01  1.295e+01  -3.931 8.52e-05 ***
## OLDCLAIM                      -1.690e-02  9.122e-03  -1.853 0.063925 .  
## CLM_FREQ                       1.664e+02  5.689e+01   2.924 0.003461 ** 
## MVR_PTS                        1.735e+02  2.790e+01   6.217 5.31e-10 ***
## CAR_AGE                       -2.489e+01  1.237e+01  -2.012 0.044271 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4545 on 8123 degrees of freedom
## Multiple R-squared:  0.07062,    Adjusted R-squared:  0.06639 
## F-statistic: 16.68 on 37 and 8123 DF,  p-value: < 2.2e-16

Here we note the following important metrics.

R-squared: 0.07062.

Adjusted R-squared: 0.06639.

The R-squared values are far from 1, so the model fit is poor. In the plot below, the points are widely scattered and show no clear linear pattern.

Second Linear Regression model

In the second Linear Regression model, we’ll use stepwise selection by AIC.
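A minimal sketch of that reduction, assuming the full linear model above is stored as lm_full.

```r
library(MASS)

lm_step <- stepAIC(lm_full, direction = "both", trace = FALSE)
summary(lm_step)
```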

## 
## Call:
## lm(formula = TARGET_AMT ~ PARENT1 + MSTATUS + SEX + JOB + CAR_USE + 
##     CAR_TYPE + REVOKED + URBANICITY + KIDSDRIV + INCOME + TRAVTIME + 
##     BLUEBOOK + TIF + OLDCLAIM + CLM_FREQ + MVR_PTS + CAR_AGE, 
##     data = Ins_train_cap_imputed)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -5196  -1698   -766    336 103646 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    1.555e+02  3.413e+02   0.455 0.648781    
## PARENT1Yes                     5.962e+02  1.783e+02   3.343 0.000834 ***
## MSTATUSYes                    -6.146e+02  1.197e+02  -5.136 2.88e-07 ***
## SEXM                           3.208e+02  1.605e+02   1.999 0.045662 *  
## JOBClerical                    3.956e+01  1.923e+02   0.206 0.837011    
## JOBDoctor                     -6.685e+02  3.556e+02  -1.880 0.060152 .  
## JOBHome Maker                 -1.066e+02  2.480e+02  -0.430 0.667431    
## JOBLawyer                     -1.534e+02  2.421e+02  -0.634 0.526325    
## JOBManager                    -9.831e+02  2.127e+02  -4.623 3.84e-06 ***
## JOBProfessional               -1.372e+02  1.976e+02  -0.694 0.487429    
## JOBStudent                    -1.296e+02  2.226e+02  -0.582 0.560294    
## JOBUnknown                    -3.164e+02  2.676e+02  -1.182 0.237046    
## CAR_USEPrivate                -7.430e+02  1.569e+02  -4.736 2.21e-06 ***
## CAR_TYPEPanel Truck            3.137e+02  2.733e+02   1.148 0.251206    
## CAR_TYPEPickup                 4.009e+02  1.695e+02   2.366 0.018021 *  
## CAR_TYPESports Car             1.027e+03  2.165e+02   4.746 2.11e-06 ***
## CAR_TYPESUV                    7.486e+02  1.786e+02   4.191 2.80e-05 ***
## CAR_TYPEVan                    5.417e+02  2.121e+02   2.554 0.010654 *  
## REVOKEDYes                     5.990e+02  1.743e+02   3.437 0.000592 ***
## URBANICITYHighly Urban/ Urban  1.667e+03  1.391e+02  11.985  < 2e-16 ***
## KIDSDRIV                       6.949e+02  1.620e+02   4.289 1.81e-05 ***
## INCOME                        -5.025e-03  1.797e-03  -2.796 0.005181 ** 
## TRAVTIME                       1.261e+01  3.338e+00   3.778 0.000159 ***
## BLUEBOOK                       1.377e-02  8.872e-03   1.552 0.120690    
## TIF                           -5.033e+01  1.294e+01  -3.890 0.000101 ***
## OLDCLAIM                      -1.690e-02  9.117e-03  -1.853 0.063853 .  
## CLM_FREQ                       1.692e+02  5.683e+01   2.977 0.002921 ** 
## MVR_PTS                        1.744e+02  2.786e+01   6.258 4.08e-10 ***
## CAR_AGE                       -2.563e+01  1.092e+01  -2.346 0.018999 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4545 on 8132 degrees of freedom
## Multiple R-squared:  0.06969,    Adjusted R-squared:  0.06649 
## F-statistic: 21.76 on 28 and 8132 DF,  p-value: < 2.2e-16

Here we note the following important metrics.

R-squared: 0.06969.

Adjusted R-squared: 0.06649.

The second model is marginally better than the first on adjusted R-squared. The plot likewise suggests that a linear model does not fit the data well.

Model Selection.

We ran three Logistic Regression models and two Linear Regression models. The three Logistic Regression models performed very similarly, and we selected the third (stepwise) model, which matches the best AUC with the fewest predictors. The second Linear Regression model did slightly better on adjusted R-squared, so we selected it for TARGET_AMT.

We’ll use these models to make predictions on the evaluation dataset insurance-evaluation-data.csv, as sketched below.
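A minimal sketch of the scoring step; model3 (logistic) and lm_step (linear) are the assumed names of the selected models, Ins_eval is the assumed name of the prepared evaluation data, and the 0.5 cutoff and the zeroing of TARGET_AMT for predicted non-crashes are assumptions.

```r
# predicted crash flag from the selected logistic model
eval_probs <- predict(model3, newdata = Ins_eval, type = "response")
Ins_eval$TARGET_FLAG <- ifelse(eval_probs > 0.5, 1, 0)

# predicted cost from the selected linear model, kept only for predicted crashes
eval_amt <- predict(lm_step, newdata = Ins_eval)
Ins_eval$TARGET_AMT <- ifelse(Ins_eval$TARGET_FLAG == 1, pmax(eval_amt, 0), 0)

head(Ins_eval)
```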

Head of the predicted data.

##   PARENT1 MSTATUS SEX   EDUCATION          JOB    CAR_USE    CAR_TYPE RED_CAR
## 1      No      No   M   Bachelors      Manager    Private         Van     yes
## 2     Yes      No   M High School      Manager    Private     Minivan      no
## 3     Yes      No   F High School  Blue Collar Commercial         SUV      no
## 4     Yes      No   M High School     Clerical    Private      Pickup      no
## 5      No      No   M High School      Manager    Private     Minivan     yes
## 6      No     Yes   M   Bachelors Professional Commercial Panel Truck      no
##   REVOKED          URBANICITY KIDSDRIV AGE HOMEKIDS YOJ INCOME HOME_VAL
## 1      No Highly Urban/ Urban        0  48        0  11  52881        0
## 2      No Highly Urban/ Urban        1  40        1  11  50815        0
## 3      No Highly Rural/ Rural        0  44        2  12  43486        0
## 4     Yes Highly Rural/ Rural        0  35        2   9  21204        0
## 5      No Highly Urban/ Urban        0  59        0  12  87460        0
## 6      No Highly Urban/ Urban        0  46        0  14 122150   207519
##   TRAVTIME BLUEBOOK TIF OLDCLAIM CLM_FREQ MVR_PTS CAR_AGE TARGET_FLAG
## 1       26    21970   1        0        0       2      10           0
## 2       21    18930   6     3295        1       2       1           0
## 3       30     5900  10        0        0       0      10           0
## 4       74     9230   6        0        0       0       4           0
## 5       45    15420   1    26114        2       4       1           0
## 6        7    25660   1     2119        1       2      12           0
##   TARGET_AMT
## 1          0
## 2          0
## 3          0
## 4          0
## 5          0
## 6          0

Write the data to a CSV file
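A minimal sketch of the export; the output file name is an assumption.

```r
write.csv(Ins_eval, "insurance_evaluation_predicted.csv", row.names = FALSE)
```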