Homework 04
Overview
In this homework assignment, you will explore, analyze and model a dataset containing approximately 8,000 records, each representing a customer of an auto insurance company. Each record has two response variables. The first response variable, TARGET_FLAG, is a 1 or a 0: a “1” means that the person was in a car crash, and a zero means that the person was not in a car crash. The second response variable is TARGET_AMT. This value is zero if the person did not crash their car; if they did crash their car, this number will be a value greater than zero.
Your objective is to build multiple linear regression and binary logistic regression models on the training data to predict the probability that a person will crash their car and also the amount of money it will cost if the person does crash their car. You can only use the variables given to you (or variables that you derive from the variables provided). Below is a short description of the variables of interest in the data set:
Data Exploration
As a first step, we will explore both the training and the evaluation data, beginning with the training set.
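The code chunks are not shown in this report, so below is a minimal sketch of the load-and-inspect step, assuming the two CSV files sit in the working directory. The object names ins_train and ins_eval are illustrative, not necessarily those used in the hidden chunks.

```r
# Read the raw files; keep strings as characters so we can reformat them later.
ins_train <- read.csv("insurance_training_data.csv", stringsAsFactors = FALSE)
ins_eval  <- read.csv("insurance-evaluation-data.csv", stringsAsFactors = FALSE)

# Cursory exploration: dimensions, first rows, and structure.
print(paste0("Dimension of training set: Number of rows: ", nrow(ins_train),
             ", Number of cols: ", ncol(ins_train)))
head(ins_train)
str(ins_train)
```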
## [1] "Dimension of training set: Number of rows: 8161, Number of cols: 26"
## INDEX TARGET_FLAG TARGET_AMT KIDSDRIV AGE HOMEKIDS YOJ INCOME PARENT1
## 1 1 0 0 0 60 0 11 $67,349 No
## 2 2 0 0 0 43 0 11 $91,449 No
## 3 4 0 0 0 35 1 10 $16,039 No
## 4 5 0 0 0 51 0 14 No
## 5 6 0 0 0 50 0 NA $114,986 No
## 6 7 1 2946 0 34 1 12 $125,301 Yes
## HOME_VAL MSTATUS SEX EDUCATION JOB TRAVTIME CAR_USE BLUEBOOK
## 1 $0 z_No M PhD Professional 14 Private $14,230
## 2 $257,252 z_No M z_High School z_Blue Collar 22 Commercial $14,940
## 3 $124,191 Yes z_F z_High School Clerical 5 Private $4,010
## 4 $306,251 Yes M <High School z_Blue Collar 32 Private $15,440
## 5 $243,925 Yes z_F PhD Doctor 36 Private $18,000
## 6 $0 z_No z_F Bachelors z_Blue Collar 46 Commercial $17,430
## TIF CAR_TYPE RED_CAR OLDCLAIM CLM_FREQ REVOKED MVR_PTS CAR_AGE
## 1 11 Minivan yes $4,461 2 No 3 18
## 2 1 Minivan yes $0 0 No 0 1
## 3 4 z_SUV no $38,690 2 No 3 10
## 4 7 Minivan yes $0 0 No 0 6
## 5 1 z_SUV no $19,217 2 Yes 3 17
## 6 1 Sports Car no $0 0 No 0 7
## URBANICITY
## 1 Highly Urban/ Urban
## 2 Highly Urban/ Urban
## 3 Highly Urban/ Urban
## 4 Highly Urban/ Urban
## 5 Highly Urban/ Urban
## 6 Highly Urban/ Urban
## [1] "Structure of training data set:"
## 'data.frame': 8161 obs. of 26 variables:
## $ INDEX : int 1 2 4 5 6 7 8 11 12 13 ...
## $ TARGET_FLAG: int 0 0 0 0 0 1 0 1 1 0 ...
## $ TARGET_AMT : num 0 0 0 0 0 ...
## $ KIDSDRIV : int 0 0 0 0 0 0 0 1 0 0 ...
## $ AGE : int 60 43 35 51 50 34 54 37 34 50 ...
## $ HOMEKIDS : int 0 0 1 0 0 1 0 2 0 0 ...
## $ YOJ : int 11 11 10 14 NA 12 NA NA 10 7 ...
## $ INCOME : chr "$67,349" "$91,449" "$16,039" "" ...
## $ PARENT1 : chr "No" "No" "No" "No" ...
## $ HOME_VAL : chr "$0" "$257,252" "$124,191" "$306,251" ...
## $ MSTATUS : chr "z_No" "z_No" "Yes" "Yes" ...
## $ SEX : chr "M" "M" "z_F" "M" ...
## $ EDUCATION : chr "PhD" "z_High School" "z_High School" "<High School" ...
## $ JOB : chr "Professional" "z_Blue Collar" "Clerical" "z_Blue Collar" ...
## $ TRAVTIME : int 14 22 5 32 36 46 33 44 34 48 ...
## $ CAR_USE : chr "Private" "Commercial" "Private" "Private" ...
## $ BLUEBOOK : chr "$14,230" "$14,940" "$4,010" "$15,440" ...
## $ TIF : int 11 1 4 7 1 1 1 1 1 7 ...
## $ CAR_TYPE : chr "Minivan" "Minivan" "z_SUV" "Minivan" ...
## $ RED_CAR : chr "yes" "yes" "no" "yes" ...
## $ OLDCLAIM : chr "$4,461" "$0" "$38,690" "$0" ...
## $ CLM_FREQ : int 2 0 2 0 2 0 0 1 0 0 ...
## $ REVOKED : chr "No" "No" "No" "No" ...
## $ MVR_PTS : int 3 0 3 0 3 0 0 10 0 1 ...
## $ CAR_AGE : int 18 1 10 6 17 7 1 7 1 17 ...
## $ URBANICITY : chr "Highly Urban/ Urban" "Highly Urban/ Urban" "Highly Urban/ Urban" "Highly Urban/ Urban" ...
Checking for NA.
## [1] TRUE
There are NAs in our data, so we will impute them using mice().
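A minimal sketch of that imputation with the mice package; the number of imputations and the seed are assumptions, and the currency fields are assumed to have already been converted to numeric.

```r
library(mice)

# Multiple imputation on the numeric fields with missing values (AGE, YOJ,
# INCOME, HOME_VAL, CAR_AGE); mice() picks a default method per column type.
set.seed(123)                              # assumed seed
imp <- mice(ins_train, m = 5, printFlag = FALSE)
ins_train_imputed <- complete(imp)
```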
Rechecking for NA after imputation.
We observe that the NAs were removed.
Here, we'll further explore the data. First, we'll look at the minimum, 1st quartile, median, mean, 3rd quartile, maximum, etc.
| TARGET_FLAG | TARGET_AMT | KIDSDRIV | AGE | HOMEKIDS | YOJ | INCOME | PARENT1 | HOME_VAL | MSTATUS | SEX | EDUCATION | JOB | TRAVTIME | CAR_USE | BLUEBOOK | TIF | CAR_TYPE | RED_CAR | OLDCLAIM | CLM_FREQ | REVOKED | MVR_PTS | CAR_AGE | URBANICITY | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Min. :0.0000 | Min. : 0 | Min. :0.0000 | Min. :16.00 | Min. :0.0000 | Min. : 0.00 | Min. : 0 | Length:8161 | Min. : 0 | Length:8161 | Length:8161 | Length:8161 | Length:8161 | Min. : 5.00 | Length:8161 | Min. : 1500 | Min. : 1.000 | Length:8161 | Length:8161 | Min. : 0 | Min. :0.0000 | Length:8161 | Min. : 0.000 | Min. :-3.000 | Length:8161 | |
| 1st Qu.:0.0000 | 1st Qu.: 0 | 1st Qu.:0.0000 | 1st Qu.:39.00 | 1st Qu.:0.0000 | 1st Qu.: 9.00 | 1st Qu.: 28088 | Class :character | 1st Qu.: 0 | Class :character | Class :character | Class :character | Class :character | 1st Qu.: 22.00 | Class :character | 1st Qu.: 9280 | 1st Qu.: 1.000 | Class :character | Class :character | 1st Qu.: 0 | 1st Qu.:0.0000 | Class :character | 1st Qu.: 0.000 | 1st Qu.: 1.000 | Class :character | |
| Median :0.0000 | Median : 0 | Median :0.0000 | Median :45.00 | Median :0.0000 | Median :11.00 | Median : 53899 | Mode :character | Median :160699 | Mode :character | Mode :character | Mode :character | Mode :character | Median : 33.00 | Mode :character | Median :14440 | Median : 4.000 | Mode :character | Mode :character | Median : 0 | Median :0.0000 | Mode :character | Median : 1.000 | Median : 8.000 | Mode :character | |
| Mean :0.2638 | Mean : 1504 | Mean :0.1711 | Mean :44.78 | Mean :0.7212 | Mean :10.52 | Mean : 61719 | NA | Mean :154533 | NA | NA | NA | NA | Mean : 33.49 | NA | Mean :15710 | Mean : 5.351 | NA | NA | Mean : 4037 | Mean :0.7986 | NA | Mean : 1.696 | Mean : 8.329 | NA | |
| 3rd Qu.:1.0000 | 3rd Qu.: 1036 | 3rd Qu.:0.0000 | 3rd Qu.:51.00 | 3rd Qu.:1.0000 | 3rd Qu.:13.00 | 3rd Qu.: 85596 | NA | 3rd Qu.:238115 | NA | NA | NA | NA | 3rd Qu.: 44.00 | NA | 3rd Qu.:20850 | 3rd Qu.: 7.000 | NA | NA | 3rd Qu.: 4636 | 3rd Qu.:2.0000 | NA | 3rd Qu.: 3.000 | 3rd Qu.:12.000 | NA | |
| Max. :1.0000 | Max. :107586 | Max. :4.0000 | Max. :81.00 | Max. :5.0000 | Max. :23.00 | Max. :367030 | NA | Max. :885282 | NA | NA | NA | NA | Max. :142.00 | NA | Max. :69740 | Max. :25.000 | NA | NA | Max. :57037 | Max. :5.0000 | NA | Max. :13.000 | Max. :28.000 | NA |
Data reordering
For downstream analysis, we’ll reorder the columns into categorical, numeric and target.
Boxplots
First look at the boxplots.
The boxplots show that some of the variables have outliers in them. So, we’ll cap them.
The fields AGE, HOMEKIDS, INCOME, HOME_VAL, TRAVTIME, BLUEBOOK, TIF, CLM_FREQ, MVR_PTS, CAR_AGE have higher variance. We will ignore the boxplots for TARGET_FLAG and TARGET_AMT here.
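The exact capping rule is not shown in the report, so the sketch below winsorizes the listed fields at the 1st and 99th percentiles; the cutoffs are assumptions, and assembling the result into Ins_train_cap_imputed (the name used later in the report) is an assumption about naming.

```r
# Cap (winsorize) a numeric vector at the given lower/upper quantiles.
cap_outliers <- function(x, lower = 0.01, upper = 0.99) {
  q <- quantile(x, probs = c(lower, upper), na.rm = TRUE)
  pmin(pmax(x, q[1]), q[2])
}

num_cols <- c("AGE", "HOMEKIDS", "INCOME", "HOME_VAL", "TRAVTIME",
              "BLUEBOOK", "TIF", "CLM_FREQ", "MVR_PTS", "CAR_AGE")
Ins_train_cap_imputed <- ins_train_imputed
Ins_train_cap_imputed[num_cols] <- lapply(Ins_train_cap_imputed[num_cols], cap_outliers)
```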
Histograms
We will see how the numeric fields are distributed using histograms.
We can see that AGE, YOJ, INCOME, HOME_VAL, TRAVTIME, BLUEBOOK and CAR_AGE are approximately normally distributed, although HOME_VAL, CAR_AGE and CLM_FREQ are quite dispersed.
Categorical variables
Now, let’s explore the Categorical variables.
## PARENT1:
##
## No Yes
## 7084 1077
## MSTATUS:
##
## No Yes
## 3267 4894
## SEX:
##
## F M
## 4375 3786
## EDUCATION:
##
## <High School Bachelors High School Masters PhD
## 1203 2242 2330 1658 728
## JOB:
##
## Blue Collar Clerical Doctor Home Maker Lawyer
## 526 1825 1271 246 641 835
## Manager Professional Student
## 988 1117 712
## CAR_USE:
##
## Commercial Private
## 3029 5132
## CAR_TYPE:
##
## Minivan Panel Truck Pickup Sports Car SUV Van
## 2145 676 1389 907 2294 750
## RED_CAR:
##
## no yes
## 5783 2378
## REVOKED:
##
## No Yes
## 7161 1000
## URBANICITY:
##
## Highly Rural/ Rural Highly Urban/ Urban
## 1669 6492
There are 526 empty rows in the JOB column, so we'll impute them with “Unknown”.
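A one-line sketch of that imputation (the data-frame name is illustrative):

```r
# Replace the empty JOB level with an explicit "Unknown" category.
Ins_train_cap_imputed$JOB[Ins_train_cap_imputed$JOB == ""] <- "Unknown"
table(Ins_train_cap_imputed$JOB)
```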
## JOB:
##
## Blue Collar Clerical Doctor Home Maker Lawyer Manager
## 1825 1271 246 641 835 988
## Professional Student Unknown
## 1117 712 526
Correlations
At this point the data is prepared. So, we’ll explore the top correlated variables.
There are 25 variables, of which 15 are numeric and 10 are categorical. In order to find the top correlated variables, we'll encode the categorical variables numerically.
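A sketch of one way to do that encoding; converting the character columns with as.numeric(as.factor(.)) is an assumption about the hidden chunk, not the only possible encoding.

```r
# Recode the character columns as integer codes so cor() can be applied.
ins_num <- Ins_train_cap_imputed
char_cols <- sapply(ins_num, is.character)
ins_num[char_cols] <- lapply(ins_num[char_cols], function(x) as.numeric(as.factor(x)))

# Correlation of every variable with the two targets, sorted by TARGET_FLAG.
cor_mat <- cor(ins_num[, setdiff(names(ins_num), "INDEX")])
cor_tab <- cor_mat[, c("TARGET_FLAG", "TARGET_AMT")]
cor_tab[order(cor_tab[, "TARGET_FLAG"], decreasing = TRUE), ]
```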
| Variable | TARGET_FLAG | TARGET_AMT |
|---|---|---|
| TARGET_FLAG | 1.0000000 | 1.0000000 |
| TARGET_AMT | 0.8334240 | 0.8334240 |
| URBANICITY | 0.2242509 | 0.1904945 |
| CLM_FREQ | 0.2161961 | 0.1869848 |
| MVR_PTS | 0.2075451 | 0.1741927 |
| OLDCLAIM | 0.2004106 | 0.1652881 |
| PARENT1 | 0.1576222 | 0.1359305 |
| REVOKED | 0.1519391 | 0.1263285 |
| HOMEKIDS | 0.1175903 | 0.1014967 |
| KIDSDRIV | 0.1036217 | 0.0863553 |
| CAR_TYPE | 0.1035765 | 0.0827170 |
| TRAVTIME | 0.0550685 | 0.0440349 |
| RED_CAR | -0.0069473 | 0.0005877 |
| SEX | -0.0210786 | -0.0088270 |
| JOB | -0.0669944 | -0.0509930 |
| YOJ | -0.0678665 | -0.0512803 |
| EDUCATION | -0.0734429 | -0.0532620 |
| TIF | -0.0818308 | -0.0690666 |
| CAR_AGE | -0.1025965 | -0.0712728 |
| AGE | -0.1041577 | -0.0802102 |
| BLUEBOOK | -0.1100800 | -0.0820739 |
| MSTATUS | -0.1351248 | -0.1209872 |
| CAR_USE | -0.1426737 | -0.1214701 |
| INCOME | -0.1487891 | -0.1287263 |
| HOME_VAL | -0.1841877 | -0.1522253 |
Now, we’ll look at the correlation matrix of the variables.
At this point, the exploration, preparation and pairwise correlations of insurance_training_data.csv are done, so we'll begin the same exercise for insurance-evaluation-data.csv.
Data Exploration of insurance-evaluation-data.csv.
Initially, we'll do a cursory exploration of the data. After that, we'll iteratively prepare and explore the data wherever required.
## [1] "Dimension of training set: Number of rows: 2141, Number of cols: 26"
## [1] "Head of training data set:"
## INDEX TARGET_FLAG TARGET_AMT KIDSDRIV AGE HOMEKIDS YOJ INCOME PARENT1
## 1 3 NA NA 0 48 0 11 $52,881 No
## 2 9 NA NA 1 40 1 11 $50,815 Yes
## 3 10 NA NA 0 44 2 12 $43,486 Yes
## 4 18 NA NA 0 35 2 NA $21,204 Yes
## 5 21 NA NA 0 59 0 12 $87,460 No
## 6 30 NA NA 0 46 0 14 No
## HOME_VAL MSTATUS SEX EDUCATION JOB TRAVTIME CAR_USE BLUEBOOK
## 1 $0 z_No M Bachelors Manager 26 Private $21,970
## 2 $0 z_No M z_High School Manager 21 Private $18,930
## 3 $0 z_No z_F z_High School z_Blue Collar 30 Commercial $5,900
## 4 $0 z_No M z_High School Clerical 74 Private $9,230
## 5 $0 z_No M z_High School Manager 45 Private $15,420
## 6 $207,519 Yes M Bachelors Professional 7 Commercial $25,660
## TIF CAR_TYPE RED_CAR OLDCLAIM CLM_FREQ REVOKED MVR_PTS CAR_AGE
## 1 1 Van yes $0 0 No 2 10
## 2 6 Minivan no $3,295 1 No 2 1
## 3 10 z_SUV no $0 0 No 0 10
## 4 6 Pickup no $0 0 Yes 0 4
## 5 1 Minivan yes $44,857 2 No 4 1
## 6 1 Panel Truck no $2,119 1 No 2 12
## URBANICITY
## 1 Highly Urban/ Urban
## 2 Highly Urban/ Urban
## 3 z_Highly Rural/ Rural
## 4 z_Highly Rural/ Rural
## 5 Highly Urban/ Urban
## 6 Highly Urban/ Urban
## [1] "Structure of training data set:"
## 'data.frame': 2141 obs. of 26 variables:
## $ INDEX : int 3 9 10 18 21 30 31 37 39 47 ...
## $ TARGET_FLAG: logi NA NA NA NA NA NA ...
## $ TARGET_AMT : logi NA NA NA NA NA NA ...
## $ KIDSDRIV : int 0 1 0 0 0 0 0 0 2 0 ...
## $ AGE : int 48 40 44 35 59 46 60 54 36 50 ...
## $ HOMEKIDS : int 0 1 2 2 0 0 0 0 2 0 ...
## $ YOJ : int 11 11 12 NA 12 14 12 12 12 8 ...
## $ INCOME : chr "$52,881" "$50,815" "$43,486" "$21,204" ...
## $ PARENT1 : chr "No" "Yes" "Yes" "Yes" ...
## $ HOME_VAL : chr "$0" "$0" "$0" "$0" ...
## $ MSTATUS : chr "z_No" "z_No" "z_No" "z_No" ...
## $ SEX : chr "M" "M" "z_F" "M" ...
## $ EDUCATION : chr "Bachelors" "z_High School" "z_High School" "z_High School" ...
## $ JOB : chr "Manager" "Manager" "z_Blue Collar" "Clerical" ...
## $ TRAVTIME : int 26 21 30 74 45 7 16 27 5 22 ...
## $ CAR_USE : chr "Private" "Private" "Commercial" "Private" ...
## $ BLUEBOOK : chr "$21,970" "$18,930" "$5,900" "$9,230" ...
## $ TIF : int 1 6 10 6 1 1 1 4 4 4 ...
## $ CAR_TYPE : chr "Van" "Minivan" "z_SUV" "Pickup" ...
## $ RED_CAR : chr "yes" "no" "no" "no" ...
## $ OLDCLAIM : chr "$0" "$3,295" "$0" "$0" ...
## $ CLM_FREQ : int 0 1 0 0 2 1 0 0 0 0 ...
## $ REVOKED : chr "No" "No" "No" "Yes" ...
## $ MVR_PTS : int 2 2 0 0 4 2 0 5 0 3 ...
## $ CAR_AGE : int 10 1 10 4 1 12 1 NA 9 1 ...
## $ URBANICITY : chr "Highly Urban/ Urban" "Highly Urban/ Urban" "z_Highly Rural/ Rural" "z_Highly Rural/ Rural" ...
There are a few fields with missing values, which we'll investigate in greater detail later.
Data Preparation of insurance-evaluation-data.csv.
At this stage, we'll explore and prepare the data iteratively. First, we'll convert the fields that are supposed to be numeric into proper numeric format and keep the remaining fields as strings. After reformatting, we'll check for NAs and, if required, impute them.
After that, we'll show boxplots of the numeric fields.
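A sketch of that reformatting, assuming the currency columns are INCOME, HOME_VAL, BLUEBOOK and OLDCLAIM and that the “z_” prefixes on the categorical levels are stripped at the same time; the helper name is illustrative.

```r
# Strip "$" and "," and convert the currency fields to numeric
# (empty strings become NA, which mice() will impute later).
to_numeric <- function(x) as.numeric(gsub("[$,]", "", x))
money_cols <- c("INCOME", "HOME_VAL", "BLUEBOOK", "OLDCLAIM")
ins_eval[money_cols] <- lapply(ins_eval[money_cols], to_numeric)

# Clean the "z_" prefix from the categorical levels (e.g. "z_SUV" -> "SUV").
char_cols <- sapply(ins_eval, is.character)
ins_eval[char_cols] <- lapply(ins_eval[char_cols], function(x) gsub("^z_", "", x))
```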
Checking for NA.
## [1] TRUE
NAs do exist, so we'll impute them with mice().
Rechecking for NA after imputation.
## [1] FALSE
We observe that the NAs were removed in all columns except TARGET_FLAG and TARGET_AMT, which is what we want. Next, we'll visualize the missingness with missmap().
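missmap() comes from the Amelia package; a minimal sketch, with an illustrative object name for the imputed evaluation data:

```r
library(Amelia)

# Missingness map: only TARGET_FLAG and TARGET_AMT should still show as missing.
missmap(ins_eval_imputed, main = "Missing values after imputation")
```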
Both is.na() and missmap() confirm that the NAs were eliminated.
More Data exploration of insurance-evaluation-data.csv.
Now, we'll explore the data a little further. First, we'll take a quick look at the minimum, 1st quartile, median, mean, 3rd quartile, maximum, etc.
| TARGET_FLAG | TARGET_AMT | KIDSDRIV | AGE | HOMEKIDS | YOJ | INCOME | PARENT1 | HOME_VAL | MSTATUS | SEX | EDUCATION | JOB | TRAVTIME | CAR_USE | BLUEBOOK | TIF | CAR_TYPE | RED_CAR | OLDCLAIM | CLM_FREQ | REVOKED | MVR_PTS | CAR_AGE | URBANICITY | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Min. : NA | Min. : NA | Min. :0.0000 | Min. :17.00 | Min. :0.0000 | Min. : 0.00 | Min. : 0 | Length:2141 | Min. : 0 | Length:2141 | Length:2141 | Length:2141 | Length:2141 | Min. : 5.00 | Length:2141 | Min. : 1500 | Min. : 1.000 | Length:2141 | Length:2141 | Min. : 0 | Min. :0.000 | Length:2141 | Min. : 0.000 | Min. : 0.000 | Length:2141 | |
| 1st Qu.: NA | 1st Qu.: NA | 1st Qu.:0.0000 | 1st Qu.:39.00 | 1st Qu.:0.0000 | 1st Qu.: 9.00 | 1st Qu.: 25846 | Class :character | 1st Qu.: 0 | Class :character | Class :character | Class :character | Class :character | 1st Qu.: 22.00 | Class :character | 1st Qu.: 8870 | 1st Qu.: 1.000 | Class :character | Class :character | 1st Qu.: 0 | 1st Qu.:0.000 | Class :character | 1st Qu.: 0.000 | 1st Qu.: 1.000 | Class :character | |
| Median : NA | Median : NA | Median :0.0000 | Median :45.00 | Median :0.0000 | Median :11.00 | Median : 51605 | Mode :character | Median :158830 | Mode :character | Mode :character | Mode :character | Mode :character | Median : 33.00 | Mode :character | Median :14170 | Median : 4.000 | Mode :character | Mode :character | Median : 0 | Median :0.000 | Mode :character | Median : 1.000 | Median : 8.000 | Mode :character | |
| Mean :NaN | Mean :NaN | Mean :0.1625 | Mean :45.02 | Mean :0.7174 | Mean :10.36 | Mean : 60228 | NA | Mean :152988 | NA | NA | NA | NA | Mean : 33.15 | NA | Mean :15469 | Mean : 5.245 | NA | NA | Mean : 4022 | Mean :0.809 | NA | Mean : 1.766 | Mean : 8.198 | NA | |
| 3rd Qu.: NA | 3rd Qu.: NA | 3rd Qu.:0.0000 | 3rd Qu.:51.00 | 3rd Qu.:1.0000 | 3rd Qu.:13.00 | 3rd Qu.: 86393 | NA | 3rd Qu.:236449 | NA | NA | NA | NA | 3rd Qu.: 43.00 | NA | 3rd Qu.:21050 | 3rd Qu.: 7.000 | NA | NA | 3rd Qu.: 4718 | 3rd Qu.:2.000 | NA | 3rd Qu.: 3.000 | 3rd Qu.:13.000 | NA | |
| Max. : NA | Max. : NA | Max. :3.0000 | Max. :73.00 | Max. :5.0000 | Max. :19.00 | Max. :291182 | NA | Max. :669271 | NA | NA | NA | NA | Max. :105.00 | NA | Max. :49940 | Max. :25.000 | NA | NA | Max. :54399 | Max. :5.000 | NA | Max. :12.000 | Max. :26.000 | NA | |
| NA’s :2141 | NA’s :2141 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
Data reordering
For downstream analysis, we’ll reorder the columns into categorical, numeric and target.
Boxplots
Let’s take a first look at the boxplots
The boxplots show that some of the variables have outliers in them. So, we’ll cap them.
The fields AGE, HOMEKIDS, INCOME, HOME_VAL, TRAVTIME, BLUEBOOK, TIF, CLM_FREQ, MVR_PTS, CAR_AGE have higher variance.
Let’s ignore the boxplots for TARGET_FLAG and TARGET_AMT.
We'll also draw the boxplots with ggplot2, to check if there are any differences.
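A sketch of the ggplot2 version: reshape the numeric fields to long format and draw one boxplot per variable. The column list and object name are assumptions.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

num_cols <- c("AGE", "HOMEKIDS", "INCOME", "HOME_VAL", "TRAVTIME",
              "BLUEBOOK", "TIF", "CLM_FREQ", "MVR_PTS", "CAR_AGE")

ins_eval_imputed %>%
  select(all_of(num_cols)) %>%
  pivot_longer(everything(), names_to = "variable", values_to = "value") %>%
  ggplot(aes(y = value)) +
  geom_boxplot() +
  facet_wrap(~ variable, scales = "free_y")
```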
Histograms
Histograms tell us how the data is distributed in the dataset (numeric fields).
The histograms show that AGE, YOJ, HOME_VAL, TRAVTIME, BLUEBOOK and CAR_AGE are approximately normally distributed, although HOME_VAL, CAR_AGE and CLM_FREQ are quite dispersed.
Categorical variables
Now, we’ll explore the Categorical variables.
## PARENT1:
##
## No Yes
## 1875 266
## MSTATUS:
##
## No Yes
## 847 1294
## SEX:
##
## F M
## 1170 971
## EDUCATION:
##
## <High School Bachelors High School Masters PhD
## 312 581 622 420 206
## JOB:
##
## Blue Collar Clerical Doctor Home Maker Lawyer
## 139 463 319 75 202 196
## Manager Professional Student
## 269 291 187
## CAR_USE:
##
## Commercial Private
## 760 1381
## CAR_TYPE:
##
## Minivan Panel Truck Pickup Sports Car SUV Van
## 549 177 383 272 589 171
## RED_CAR:
##
## no yes
## 1543 598
## REVOKED:
##
## No Yes
## 1880 261
## URBANICITY:
##
## Highly Rural/ Rural Highly Urban/ Urban
## 403 1738
Observation: in the JOB column, 139 rows are empty, so we'll impute them with “Unknown”.
## JOB:
##
## Blue Collar Clerical Doctor Home Maker Lawyer Manager
## 463 319 75 202 196 269
## Professional Student Unknown
## 291 187 139
Correlations.
At this point the data is prepared. So, we’ll explore the top correlated variables.
There are 25 variables, of which 15 are numeric and 10 are categorical. In order to find the pairwise correlations, we'll encode the categorical variables numerically.
Now, we’ll look at the correlation matrix of the variables.
At this point, the exploration, preparation and pairwise correlations of insurance-evaluation-data.csv are done, so we'll begin the model building process.
Building Models
Now, we are ready to build models. We will build models using insurance_training_data.csv and compare them. Then we will use the best models to predict on insurance-evaluation-data.csv.
Our task is to classify the TARGET_FLAG variable using logistic regression and predict the value of TARGET_AMT using Linear Regression.
We split Ins_train_cap_imputed into training and test sets in an 80/20 ratio, named Ins_train_cap_imputed_trn and Ins_train_cap_imputed_tst.
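A sketch of that split using caret's createDataPartition(), which preserves the TARGET_FLAG class proportions; the seed is an assumption.

```r
library(caret)

set.seed(123)   # assumed seed
idx <- createDataPartition(Ins_train_cap_imputed$TARGET_FLAG, p = 0.8, list = FALSE)
Ins_train_cap_imputed_trn <- Ins_train_cap_imputed[idx, ]
Ins_train_cap_imputed_tst <- Ins_train_cap_imputed[-idx, ]
```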
Logistic Regression Model
Here we will build our first logistic regression model.
##
## Call:
## glm(formula = TARGET_FLAG ~ . - TARGET_AMT, family = binomial,
## data = Ins_train_cap_imputed_trn)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2585 -0.7148 -0.3972 0.6502 3.1743
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.314e+00 3.198e-01 -7.235 4.64e-13 ***
## PARENT1Yes 4.214e-01 1.242e-01 3.394 0.000690 ***
## MSTATUSYes -5.156e-01 9.376e-02 -5.500 3.81e-08 ***
## SEXM 5.494e-02 1.250e-01 0.440 0.660147
## EDUCATIONBachelors -4.152e-01 1.295e-01 -3.207 0.001341 **
## EDUCATIONHigh School -5.172e-02 1.069e-01 -0.484 0.628448
## EDUCATIONMasters -3.404e-01 2.006e-01 -1.697 0.089711 .
## EDUCATIONPhD -2.426e-01 2.352e-01 -1.031 0.302486
## JOBClerical 6.363e-02 1.201e-01 0.530 0.596115
## JOBDoctor -5.610e-01 3.115e-01 -1.801 0.071752 .
## JOBHome Maker -1.240e-01 1.737e-01 -0.714 0.475328
## JOBLawyer -1.716e-01 2.114e-01 -0.812 0.416873
## JOBManager -8.168e-01 1.564e-01 -5.222 1.77e-07 ***
## JOBProfessional -9.226e-02 1.347e-01 -0.685 0.493352
## JOBStudent -1.583e-01 1.460e-01 -1.084 0.278294
## JOBUnknown -3.561e-01 2.081e-01 -1.711 0.087085 .
## CAR_USEPrivate -7.988e-01 1.024e-01 -7.801 6.14e-15 ***
## CAR_TYPEPanel Truck 5.476e-01 1.810e-01 3.025 0.002488 **
## CAR_TYPEPickup 5.346e-01 1.125e-01 4.754 1.99e-06 ***
## CAR_TYPESports Car 1.043e+00 1.445e-01 7.218 5.29e-13 ***
## CAR_TYPESUV 7.775e-01 1.237e-01 6.286 3.25e-10 ***
## CAR_TYPEVan 7.160e-01 1.404e-01 5.101 3.37e-07 ***
## RED_CARyes -1.674e-02 9.675e-02 -0.173 0.862651
## REVOKEDYes 7.961e-01 1.029e-01 7.740 9.97e-15 ***
## URBANICITYHighly Urban/ Urban 2.445e+00 1.257e-01 19.455 < 2e-16 ***
## KIDSDRIV 5.976e-01 1.090e-01 5.481 4.23e-08 ***
## AGE -3.265e-03 4.603e-03 -0.709 0.478150
## HOMEKIDS 5.087e-02 4.469e-02 1.138 0.254940
## YOJ -1.936e-02 9.416e-03 -2.056 0.039789 *
## INCOME -3.214e-06 1.359e-06 -2.364 0.018084 *
## HOME_VAL -1.319e-06 3.757e-07 -3.510 0.000449 ***
## TRAVTIME 1.619e-02 2.179e-03 7.430 1.09e-13 ***
## BLUEBOOK -2.301e-05 6.083e-06 -3.782 0.000155 ***
## TIF -6.052e-02 8.538e-03 -7.088 1.36e-12 ***
## OLDCLAIM -1.328e-05 5.397e-06 -2.461 0.013864 *
## CLM_FREQ 1.812e-01 3.285e-02 5.516 3.48e-08 ***
## MVR_PTS 1.089e-01 1.652e-02 6.592 4.33e-11 ***
## CAR_AGE -6.136e-03 8.097e-03 -0.758 0.448611
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 7535.7 on 6528 degrees of freedom
## Residual deviance: 5827.6 on 6491 degrees of freedom
## AIC: 5903.6
##
## Number of Fisher Scoring iterations: 5
The important metric here is the AIC: 5903.6.
Now, we’ll predict on the test set Ins_train_cap_imputed_tst.
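A sketch of the scoring step: predicted probabilities from the fitted glm, a 0.5 cut-off, and caret's confusionMatrix() with “1” as the positive class. The model object name (logit_mod1) and the cut-off are assumptions.

```r
pred_prob  <- predict(logit_mod1, newdata = Ins_train_cap_imputed_tst, type = "response")
pred_class <- ifelse(pred_prob > 0.5, 1, 0)

caret::confusionMatrix(factor(pred_class, levels = c(1, 0)),
                       factor(Ins_train_cap_imputed_tst$TARGET_FLAG, levels = c(1, 0)),
                       positive = "1")
```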
Let’s look at the Confusion Matrix.
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 0
## 1 181 97
## 0 249 1105
##
## Accuracy : 0.788
## 95% CI : (0.7673, 0.8076)
## No Information Rate : 0.7365
## P-Value [Acc > NIR] : 8.082e-07
##
## Kappa : 0.3838
##
## Mcnemar's Test P-Value : 4.747e-16
##
## Sensitivity : 0.4209
## Specificity : 0.9193
## Pos Pred Value : 0.6511
## Neg Pred Value : 0.8161
## Prevalence : 0.2635
## Detection Rate : 0.1109
## Detection Prevalence : 0.1703
## Balanced Accuracy : 0.6701
##
## 'Positive' Class : 1
##
Plotting the ROC curve and computing the area under it (AUC).
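A sketch of the ROC/AUC computation with the pROC package, reusing the illustrative pred_prob from the previous sketch.

```r
library(pROC)

roc_obj <- roc(Ins_train_cap_imputed_tst$TARGET_FLAG, pred_prob)
plot(roc_obj, main = "ROC curve - logistic model 1")
auc(roc_obj)
```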
Here we note the following important metrics.
Accuracy of this model is 0.788.
AUC of the model is 0.801.
Logistic Regression Model
We'll build our second Logistic Regression model.
##
## Call:
## glm(formula = TARGET_FLAG ~ KIDSDRIV + HOMEKIDS + INCOME + PARENT1 +
## HOME_VAL + MSTATUS + EDUCATION + TRAVTIME + CAR_USE + BLUEBOOK +
## TIF + CAR_TYPE + CLM_FREQ + REVOKED + MVR_PTS + CAR_AGE +
## URBANICITY, family = binomial, data = Ins_train_cap_imputed_trn)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2975 -0.7218 -0.4082 0.6586 3.1654
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.459e+00 2.123e-01 -11.583 < 2e-16 ***
## KIDSDRIV 5.821e-01 1.067e-01 5.453 4.94e-08 ***
## HOMEKIDS 5.781e-02 4.044e-02 1.429 0.152879
## INCOME -4.637e-06 1.194e-06 -3.885 0.000102 ***
## PARENT1Yes 4.235e-01 1.226e-01 3.454 0.000552 ***
## HOME_VAL -1.236e-06 3.642e-07 -3.393 0.000692 ***
## MSTATUSYes -5.449e-01 9.254e-02 -5.888 3.90e-09 ***
## EDUCATIONBachelors -5.555e-01 1.178e-01 -4.715 2.42e-06 ***
## EDUCATIONHigh School -1.135e-01 1.035e-01 -1.097 0.272683
## EDUCATIONMasters -5.769e-01 1.488e-01 -3.876 0.000106 ***
## EDUCATIONPhD -5.794e-01 1.815e-01 -3.192 0.001414 **
## TRAVTIME 1.685e-02 2.163e-03 7.790 6.71e-15 ***
## CAR_USEPrivate -8.800e-01 8.089e-02 -10.879 < 2e-16 ***
## BLUEBOOK -2.500e-05 5.444e-06 -4.592 4.39e-06 ***
## TIF -6.049e-02 8.476e-03 -7.136 9.59e-13 ***
## CAR_TYPEPanel Truck 4.852e-01 1.598e-01 3.036 0.002399 **
## CAR_TYPEPickup 4.700e-01 1.092e-01 4.305 1.67e-05 ***
## CAR_TYPESports Car 9.889e-01 1.182e-01 8.368 < 2e-16 ***
## CAR_TYPESUV 7.327e-01 9.481e-02 7.728 1.09e-14 ***
## CAR_TYPEVan 6.800e-01 1.320e-01 5.150 2.61e-07 ***
## CLM_FREQ 1.380e-01 2.847e-02 4.847 1.25e-06 ***
## REVOKEDYes 6.812e-01 8.987e-02 7.580 3.45e-14 ***
## MVR_PTS 1.095e-01 1.628e-02 6.728 1.72e-11 ***
## CAR_AGE -5.781e-03 8.058e-03 -0.717 0.473139
## URBANICITYHighly Urban/ Urban 2.396e+00 1.252e-01 19.138 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 7535.7 on 6528 degrees of freedom
## Residual deviance: 5883.4 on 6504 degrees of freedom
## AIC: 5933.4
##
## Number of Fisher Scoring iterations: 5
The important metric here is the AIC: 5933.4.
Now, we’ll predict on the test set Ins_train_cap_imputed_tst.
Let’s look at the Confusion Matrix.
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 0
## 1 178 96
## 0 252 1106
##
## Accuracy : 0.7868
## 95% CI : (0.7661, 0.8064)
## No Information Rate : 0.7365
## P-Value [Acc > NIR] : 1.445e-06
##
## Kappa : 0.3781
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.4140
## Specificity : 0.9201
## Pos Pred Value : 0.6496
## Neg Pred Value : 0.8144
## Prevalence : 0.2635
## Detection Rate : 0.1091
## Detection Prevalence : 0.1679
## Balanced Accuracy : 0.6670
##
## 'Positive' Class : 1
##
Plotting the ROC curve and computing the area under it (AUC).
Here we note the following important metrics.
Accuracy of this model is 0.7868.
AUC of the model is 0.802.
Logistic Regression Model
In the third Logistic Regression model, we’ll do stepwise model selection.
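One common way to run the stepwise selection is MASS::stepAIC() starting from the full model; the sketch below is an assumption about how the selection was done, not the exact hidden chunk.

```r
library(MASS)

full_mod <- glm(TARGET_FLAG ~ . - TARGET_AMT, family = binomial,
                data = Ins_train_cap_imputed_trn)
step_mod <- stepAIC(full_mod, direction = "both", trace = FALSE)
summary(step_mod)
```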
##
## Call:
## glm(formula = TARGET_FLAG ~ KIDSDRIV + HOMEKIDS + INCOME + PARENT1 +
## HOME_VAL + MSTATUS + EDUCATION + TRAVTIME + CAR_USE + BLUEBOOK +
## TIF + CAR_TYPE + CLM_FREQ + REVOKED + MVR_PTS + URBANICITY,
## family = binomial, data = Ins_train_cap_imputed_trn)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2901 -0.7231 -0.4097 0.6569 3.1630
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.480e+00 2.104e-01 -11.789 < 2e-16 ***
## KIDSDRIV 5.824e-01 1.068e-01 5.455 4.89e-08 ***
## HOMEKIDS 5.804e-02 4.044e-02 1.435 0.151241
## INCOME -4.693e-06 1.191e-06 -3.940 8.15e-05 ***
## PARENT1Yes 4.242e-01 1.226e-01 3.461 0.000539 ***
## HOME_VAL -1.222e-06 3.636e-07 -3.361 0.000777 ***
## MSTATUSYes -5.451e-01 9.252e-02 -5.892 3.82e-09 ***
## EDUCATIONBachelors -5.848e-01 1.106e-01 -5.288 1.24e-07 ***
## EDUCATIONHigh School -1.183e-01 1.032e-01 -1.146 0.251610
## EDUCATIONMasters -6.343e-01 1.257e-01 -5.046 4.50e-07 ***
## EDUCATIONPhD -6.353e-01 1.641e-01 -3.872 0.000108 ***
## TRAVTIME 1.684e-02 2.162e-03 7.788 6.82e-15 ***
## CAR_USEPrivate -8.797e-01 8.088e-02 -10.876 < 2e-16 ***
## BLUEBOOK -2.498e-05 5.445e-06 -4.588 4.47e-06 ***
## TIF -6.059e-02 8.473e-03 -7.151 8.62e-13 ***
## CAR_TYPEPanel Truck 4.858e-01 1.598e-01 3.039 0.002370 **
## CAR_TYPEPickup 4.702e-01 1.092e-01 4.307 1.66e-05 ***
## CAR_TYPESports Car 9.896e-01 1.182e-01 8.375 < 2e-16 ***
## CAR_TYPESUV 7.326e-01 9.482e-02 7.726 1.11e-14 ***
## CAR_TYPEVan 6.820e-01 1.320e-01 5.166 2.39e-07 ***
## CLM_FREQ 1.377e-01 2.847e-02 4.836 1.32e-06 ***
## REVOKEDYes 6.817e-01 8.988e-02 7.584 3.34e-14 ***
## MVR_PTS 1.095e-01 1.628e-02 6.725 1.76e-11 ***
## URBANICITYHighly Urban/ Urban 2.397e+00 1.252e-01 19.142 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 7535.7 on 6528 degrees of freedom
## Residual deviance: 5883.9 on 6505 degrees of freedom
## AIC: 5931.9
##
## Number of Fisher Scoring iterations: 5
The important metric here is the AIC: 5931.9.
Now, we’ll predict on the test set Ins_train_cap_imputed_tst.
Let's look at the Confusion Matrix.
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 0
## 1 177 100
## 0 253 1102
##
## Accuracy : 0.7837
## 95% CI : (0.7629, 0.8035)
## No Information Rate : 0.7365
## P-Value [Acc > NIR] : 5.811e-06
##
## Kappa : 0.3708
##
## Mcnemar's Test P-Value : 5.959e-16
##
## Sensitivity : 0.4116
## Specificity : 0.9168
## Pos Pred Value : 0.6390
## Neg Pred Value : 0.8133
## Prevalence : 0.2635
## Detection Rate : 0.1085
## Detection Prevalence : 0.1697
## Balanced Accuracy : 0.6642
##
## 'Positive' Class : 1
##
Plotting the ROC curve and computing the area under it (AUC).
Here we note the following important metrics.
Accuracy of this model is 0.7837.
AUC of the model is 0.802.
At this point, three Logistic Regression models have been built. Their test-set accuracy (0.788, 0.787 and 0.784) and AUC (about 0.80) are nearly identical. We will use the third, stepwise-selected model, which achieves comparable performance with fewer predictors, for classification on the evaluation dataset.
Linear Regression model
First Linear Regression model
Since our goal now is to predict TARGET_AMT rather than classify (as we did with Logistic Regression), we'll build a linear regression model for TARGET_AMT.
##
## Call:
## lm(formula = TARGET_AMT ~ . - TARGET_FLAG, data = Ins_train_cap_imputed)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5140 -1701 -758 354 103703
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.410e+01 4.771e+02 0.134 0.893131
## PARENT1Yes 5.390e+02 2.042e+02 2.639 0.008327 **
## MSTATUSYes -5.673e+02 1.442e+02 -3.934 8.42e-05 ***
## SEXM 3.517e+02 1.835e+02 1.917 0.055261 .
## EDUCATIONBachelors -2.797e+02 2.055e+02 -1.361 0.173546
## EDUCATIONHigh School -9.839e+01 1.721e+02 -0.572 0.567524
## EDUCATIONMasters -8.573e+00 3.002e+02 -0.029 0.977219
## EDUCATIONPhD 1.966e+02 3.541e+02 0.555 0.578804
## JOBClerical 2.505e+01 1.934e+02 0.130 0.896929
## JOBDoctor -9.795e+02 4.357e+02 -2.248 0.024611 *
## JOBHome Maker -1.557e+02 2.706e+02 -0.575 0.565054
## JOBLawyer -2.724e+02 3.109e+02 -0.876 0.380978
## JOBManager -9.840e+02 2.345e+02 -4.195 2.75e-05 ***
## JOBProfessional -4.248e+01 2.132e+02 -0.199 0.842035
## JOBStudent -2.248e+02 2.360e+02 -0.952 0.340976
## JOBUnknown -5.146e+02 3.217e+02 -1.600 0.109723
## CAR_USEPrivate -7.928e+02 1.644e+02 -4.822 1.45e-06 ***
## CAR_TYPEPanel Truck 2.709e+02 2.766e+02 0.979 0.327531
## CAR_TYPEPickup 3.745e+02 1.708e+02 2.192 0.028377 *
## CAR_TYPESports Car 1.012e+03 2.179e+02 4.644 3.48e-06 ***
## CAR_TYPESUV 7.439e+02 1.795e+02 4.145 3.43e-05 ***
## CAR_TYPEVan 5.213e+02 2.133e+02 2.444 0.014556 *
## RED_CARyes -3.659e+01 1.491e+02 -0.245 0.806149
## REVOKEDYes 5.943e+02 1.744e+02 3.409 0.000656 ***
## URBANICITYHighly Urban/ Urban 1.673e+03 1.392e+02 12.015 < 2e-16 ***
## KIDSDRIV 6.023e+02 1.790e+02 3.365 0.000769 ***
## AGE 6.574e+00 7.183e+00 0.915 0.360107
## HOMEKIDS 8.174e+01 6.995e+01 1.169 0.242637
## YOJ -6.412e+00 1.474e+01 -0.435 0.663681
## INCOME -3.949e-03 2.089e-03 -1.890 0.058735 .
## HOME_VAL -6.719e-04 5.866e-04 -1.145 0.252042
## TRAVTIME 1.275e+01 3.340e+00 3.817 0.000136 ***
## BLUEBOOK 1.343e-02 8.970e-03 1.497 0.134462
## TIF -5.089e+01 1.295e+01 -3.931 8.52e-05 ***
## OLDCLAIM -1.690e-02 9.122e-03 -1.853 0.063925 .
## CLM_FREQ 1.664e+02 5.689e+01 2.924 0.003461 **
## MVR_PTS 1.735e+02 2.790e+01 6.217 5.31e-10 ***
## CAR_AGE -2.489e+01 1.237e+01 -2.012 0.044271 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4545 on 8123 degrees of freedom
## Multiple R-squared: 0.07062, Adjusted R-squared: 0.06639
## F-statistic: 16.68 on 37 and 8123 DF, p-value: < 2.2e-16
Here we note the following important metrics.
R-squared: 0.07062.
Adjusted R-squared: 0.06639.
The R-squared values are far from 1, so the model explains very little of the variance in TARGET_AMT. In the plot below, the points are widely scattered and show no clear linear pattern.
Linear Regression model
In the second Linear Regression model, we'll do stepwise AIC selection.
##
## Call:
## lm(formula = TARGET_AMT ~ PARENT1 + MSTATUS + SEX + JOB + CAR_USE +
## CAR_TYPE + REVOKED + URBANICITY + KIDSDRIV + INCOME + TRAVTIME +
## BLUEBOOK + TIF + OLDCLAIM + CLM_FREQ + MVR_PTS + CAR_AGE,
## data = Ins_train_cap_imputed)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5196 -1698 -766 336 103646
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.555e+02 3.413e+02 0.455 0.648781
## PARENT1Yes 5.962e+02 1.783e+02 3.343 0.000834 ***
## MSTATUSYes -6.146e+02 1.197e+02 -5.136 2.88e-07 ***
## SEXM 3.208e+02 1.605e+02 1.999 0.045662 *
## JOBClerical 3.956e+01 1.923e+02 0.206 0.837011
## JOBDoctor -6.685e+02 3.556e+02 -1.880 0.060152 .
## JOBHome Maker -1.066e+02 2.480e+02 -0.430 0.667431
## JOBLawyer -1.534e+02 2.421e+02 -0.634 0.526325
## JOBManager -9.831e+02 2.127e+02 -4.623 3.84e-06 ***
## JOBProfessional -1.372e+02 1.976e+02 -0.694 0.487429
## JOBStudent -1.296e+02 2.226e+02 -0.582 0.560294
## JOBUnknown -3.164e+02 2.676e+02 -1.182 0.237046
## CAR_USEPrivate -7.430e+02 1.569e+02 -4.736 2.21e-06 ***
## CAR_TYPEPanel Truck 3.137e+02 2.733e+02 1.148 0.251206
## CAR_TYPEPickup 4.009e+02 1.695e+02 2.366 0.018021 *
## CAR_TYPESports Car 1.027e+03 2.165e+02 4.746 2.11e-06 ***
## CAR_TYPESUV 7.486e+02 1.786e+02 4.191 2.80e-05 ***
## CAR_TYPEVan 5.417e+02 2.121e+02 2.554 0.010654 *
## REVOKEDYes 5.990e+02 1.743e+02 3.437 0.000592 ***
## URBANICITYHighly Urban/ Urban 1.667e+03 1.391e+02 11.985 < 2e-16 ***
## KIDSDRIV 6.949e+02 1.620e+02 4.289 1.81e-05 ***
## INCOME -5.025e-03 1.797e-03 -2.796 0.005181 **
## TRAVTIME 1.261e+01 3.338e+00 3.778 0.000159 ***
## BLUEBOOK 1.377e-02 8.872e-03 1.552 0.120690
## TIF -5.033e+01 1.294e+01 -3.890 0.000101 ***
## OLDCLAIM -1.690e-02 9.117e-03 -1.853 0.063853 .
## CLM_FREQ 1.692e+02 5.683e+01 2.977 0.002921 **
## MVR_PTS 1.744e+02 2.786e+01 6.258 4.08e-10 ***
## CAR_AGE -2.563e+01 1.092e+01 -2.346 0.018999 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4545 on 8132 degrees of freedom
## Multiple R-squared: 0.06969, Adjusted R-squared: 0.06649
## F-statistic: 21.76 on 28 and 8132 DF, p-value: < 2.2e-16
Here we note the following important metrics.
R-squared: 0.06969.
Adjusted R-squared: 0.06649.
The second model's adjusted R-squared is marginally higher than the first model's, but the fit remains poor; the diagnostic plot likewise shows no clear linear relationship.
Model Selection.
We ran three Logistic Regression models and two Linear Regression models. The three Logistic Regression models performed almost identically on accuracy and AUC, so we selected the third, stepwise-reduced model for its parsimony. The second Linear Regression model has a marginally higher adjusted R-squared than the first, so we selected it for predicting TARGET_AMT.
We’ll use these models to predict the evaluation dataset insurance-evaluation-data.csv.
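A sketch of that final scoring step, with illustrative model and object names (logit_mod3, lm_mod2, ins_eval_imputed) and an assumed 0.5 cut-off and output file name.

```r
# Classify TARGET_FLAG with the third logistic model.
eval_prob <- predict(logit_mod3, newdata = ins_eval_imputed, type = "response")
ins_eval_imputed$TARGET_FLAG <- ifelse(eval_prob > 0.5, 1, 0)

# Predict TARGET_AMT with the second linear model; force the amount to zero
# when no crash is predicted and floor negative predictions at zero.
amt_hat <- predict(lm_mod2, newdata = ins_eval_imputed)
ins_eval_imputed$TARGET_AMT <- ifelse(ins_eval_imputed$TARGET_FLAG == 1,
                                      pmax(amt_hat, 0), 0)

write.csv(ins_eval_imputed, "insurance_evaluation_predictions.csv", row.names = FALSE)
head(ins_eval_imputed)
```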
Head of the predicted data.
## PARENT1 MSTATUS SEX EDUCATION JOB CAR_USE CAR_TYPE RED_CAR
## 1 No No M Bachelors Manager Private Van yes
## 2 Yes No M High School Manager Private Minivan no
## 3 Yes No F High School Blue Collar Commercial SUV no
## 4 Yes No M High School Clerical Private Pickup no
## 5 No No M High School Manager Private Minivan yes
## 6 No Yes M Bachelors Professional Commercial Panel Truck no
## REVOKED URBANICITY KIDSDRIV AGE HOMEKIDS YOJ INCOME HOME_VAL
## 1 No Highly Urban/ Urban 0 48 0 11 52881 0
## 2 No Highly Urban/ Urban 1 40 1 11 50815 0
## 3 No Highly Rural/ Rural 0 44 2 12 43486 0
## 4 Yes Highly Rural/ Rural 0 35 2 9 21204 0
## 5 No Highly Urban/ Urban 0 59 0 12 87460 0
## 6 No Highly Urban/ Urban 0 46 0 14 122150 207519
## TRAVTIME BLUEBOOK TIF OLDCLAIM CLM_FREQ MVR_PTS CAR_AGE TARGET_FLAG
## 1 26 21970 1 0 0 2 10 0
## 2 21 18930 6 3295 1 2 1 0
## 3 30 5900 10 0 0 0 10 0
## 4 74 9230 6 0 0 0 4 0
## 5 45 15420 1 26114 2 4 1 0
## 6 7 25660 1 2119 1 2 12 0
## TARGET_AMT
## 1 0
## 2 0
## 3 0
## 4 0
## 5 0
## 6 0