Overview

In this homework assignment, you will explore, analyze and model a dataset containing approximately 8000 records representing a customer at an auto insurance company. Each record has two response variables. The first responsevariable, TARGET_FLAG, is a 1 or a 0. A “1” means that the person was in a car crash. A zero means that the person was not in a car crash.The second responsevariable is TARGET_AMT. This value is zero if the person did not crash their car. But if they did crash their car, this number will be a value greater than zero.

Your objective is to build multiple linear regression and binary logistic regression models on the training data to predict the probability that a person will crash their car and also the amount of money it will cost if the person does crash their car. You can only use the variables given to you (or variables that you derive from the variables provided). Below is a short description of the variables of interest in the data set:

Data Exploration of insurance_training_data.csv.

Initially, we’ll do a cursory exploration of the data. After that, we’ll iteratively prepare and explore the data, wherever required.

## [1] "Dimension of training set:   Number of rows: 8161, Number of cols: 26"

## [1] "Head of training data set:"

##   INDEX TARGET_FLAG TARGET_AMT KIDSDRIV AGE HOMEKIDS YOJ   INCOME PARENT1
## 1     1           0          0        0  60        0  11  $67,349      No
## 2     2           0          0        0  43        0  11  $91,449      No
## 3     4           0          0        0  35        1  10  $16,039      No
## 4     5           0          0        0  51        0  14               No
## 5     6           0          0        0  50        0  NA $114,986      No
## 6     7           1       2946        0  34        1  12 $125,301     Yes
##   HOME_VAL MSTATUS SEX     EDUCATION           JOB TRAVTIME    CAR_USE BLUEBOOK
## 1       $0    z_No   M           PhD  Professional       14    Private  $14,230
## 2 $257,252    z_No   M z_High School z_Blue Collar       22 Commercial  $14,940
## 3 $124,191     Yes z_F z_High School      Clerical        5    Private   $4,010
## 4 $306,251     Yes   M  <High School z_Blue Collar       32    Private  $15,440
## 5 $243,925     Yes z_F           PhD        Doctor       36    Private  $18,000
## 6       $0    z_No z_F     Bachelors z_Blue Collar       46 Commercial  $17,430
##   TIF   CAR_TYPE RED_CAR OLDCLAIM CLM_FREQ REVOKED MVR_PTS CAR_AGE
## 1  11    Minivan     yes   $4,461        2      No       3      18
## 2   1    Minivan     yes       $0        0      No       0       1
## 3   4      z_SUV      no  $38,690        2      No       3      10
## 4   7    Minivan     yes       $0        0      No       0       6
## 5   1      z_SUV      no  $19,217        2     Yes       3      17
## 6   1 Sports Car      no       $0        0      No       0       7
##            URBANICITY
## 1 Highly Urban/ Urban
## 2 Highly Urban/ Urban
## 3 Highly Urban/ Urban
## 4 Highly Urban/ Urban
## 5 Highly Urban/ Urban
## 6 Highly Urban/ Urban

## [1] "Structure of training data set:"

## 'data.frame':    8161 obs. of  26 variables:
##  $ INDEX      : int  1 2 4 5 6 7 8 11 12 13 ...
##  $ TARGET_FLAG: int  0 0 0 0 0 1 0 1 1 0 ...
##  $ TARGET_AMT : num  0 0 0 0 0 ...
##  $ KIDSDRIV   : int  0 0 0 0 0 0 0 1 0 0 ...
##  $ AGE        : int  60 43 35 51 50 34 54 37 34 50 ...
##  $ HOMEKIDS   : int  0 0 1 0 0 1 0 2 0 0 ...
##  $ YOJ        : int  11 11 10 14 NA 12 NA NA 10 7 ...
##  $ INCOME     : chr  "$67,349" "$91,449" "$16,039" "" ...
##  $ PARENT1    : chr  "No" "No" "No" "No" ...
##  $ HOME_VAL   : chr  "$0" "$257,252" "$124,191" "$306,251" ...
##  $ MSTATUS    : chr  "z_No" "z_No" "Yes" "Yes" ...
##  $ SEX        : chr  "M" "M" "z_F" "M" ...
##  $ EDUCATION  : chr  "PhD" "z_High School" "z_High School" "<High School" ...
##  $ JOB        : chr  "Professional" "z_Blue Collar" "Clerical" "z_Blue Collar" ...
##  $ TRAVTIME   : int  14 22 5 32 36 46 33 44 34 48 ...
##  $ CAR_USE    : chr  "Private" "Commercial" "Private" "Private" ...
##  $ BLUEBOOK   : chr  "$14,230" "$14,940" "$4,010" "$15,440" ...
##  $ TIF        : int  11 1 4 7 1 1 1 1 1 7 ...
##  $ CAR_TYPE   : chr  "Minivan" "Minivan" "z_SUV" "Minivan" ...
##  $ RED_CAR    : chr  "yes" "yes" "no" "yes" ...
##  $ OLDCLAIM   : chr  "$4,461" "$0" "$38,690" "$0" ...
##  $ CLM_FREQ   : int  2 0 2 0 2 0 0 1 0 0 ...
##  $ REVOKED    : chr  "No" "No" "No" "No" ...
##  $ MVR_PTS    : int  3 0 3 0 3 0 0 10 0 1 ...
##  $ CAR_AGE    : int  18 1 10 6 17 7 1 7 1 17 ...
##  $ URBANICITY : chr  "Highly Urban/ Urban" "Highly Urban/ Urban" "Highly Urban/ Urban" "Highly Urban/ Urban" ...

There are few fields, which have missing values, which we’ll investigate in greater details later.

Data Preparation of insurance_training_data.csv.

At this stage, we’ll explore and prepare iteratively. First we’ll convert the fields, which are supposed to be numeric, into proper numeric format and strings into string format. After reformatting, we’ll check for NA. After that if required, we’ll impute them.

After that we’ll show some boxplots of the numeric fields.

Checking for NA.

## [1] TRUE

NA does exist. So, we’ll impute with mice().

Rechecking for NA after imputation.

## [1] FALSE

We observe that NA were removed. In the following, we’ll visualize with missmap().

Both is.na() and missmap() confirm that NA were eliminated.

More Data Exploration of insurance_training_data.csv.

Here, we’ll explore the data a little further. First, we’ll take a quick look at min, 1st quartile, median, mean, 2nd quartile, max etc.

##   TARGET_FLAG       TARGET_AMT        KIDSDRIV           AGE       
##  Min.   :0.0000   Min.   :     0   Min.   :0.0000   Min.   :16.00  
##  1st Qu.:0.0000   1st Qu.:     0   1st Qu.:0.0000   1st Qu.:39.00  
##  Median :0.0000   Median :     0   Median :0.0000   Median :45.00  
##  Mean   :0.2638   Mean   :  1504   Mean   :0.1711   Mean   :44.78  
##  3rd Qu.:1.0000   3rd Qu.:  1036   3rd Qu.:0.0000   3rd Qu.:51.00  
##  Max.   :1.0000   Max.   :107586   Max.   :4.0000   Max.   :81.00  
##     HOMEKIDS           YOJ           INCOME         PARENT1         
##  Min.   :0.0000   Min.   : 0.0   Min.   :     0   Length:8161       
##  1st Qu.:0.0000   1st Qu.: 9.0   1st Qu.: 28172   Class :character  
##  Median :0.0000   Median :11.0   Median : 53895   Mode  :character  
##  Mean   :0.7212   Mean   :10.5   Mean   : 61787                     
##  3rd Qu.:1.0000   3rd Qu.:13.0   3rd Qu.: 85734                     
##  Max.   :5.0000   Max.   :23.0   Max.   :367030                     
##     HOME_VAL        MSTATUS              SEX             EDUCATION        
##  Min.   :     0   Length:8161        Length:8161        Length:8161       
##  1st Qu.:     0   Class :character   Class :character   Class :character  
##  Median :161166   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :154958                                                           
##  3rd Qu.:238741                                                           
##  Max.   :885282                                                           
##      JOB               TRAVTIME        CAR_USE             BLUEBOOK    
##  Length:8161        Min.   :  5.00   Length:8161        Min.   : 1500  
##  Class :character   1st Qu.: 22.00   Class :character   1st Qu.: 9280  
##  Mode  :character   Median : 33.00   Mode  :character   Median :14440  
##                     Mean   : 33.49                      Mean   :15710  
##                     3rd Qu.: 44.00                      3rd Qu.:20850  
##                     Max.   :142.00                      Max.   :69740  
##       TIF           CAR_TYPE           RED_CAR             OLDCLAIM    
##  Min.   : 1.000   Length:8161        Length:8161        Min.   :    0  
##  1st Qu.: 1.000   Class :character   Class :character   1st Qu.:    0  
##  Median : 4.000   Mode  :character   Mode  :character   Median :    0  
##  Mean   : 5.351                                         Mean   : 4037  
##  3rd Qu.: 7.000                                         3rd Qu.: 4636  
##  Max.   :25.000                                         Max.   :57037  
##     CLM_FREQ        REVOKED             MVR_PTS          CAR_AGE      
##  Min.   :0.0000   Length:8161        Min.   : 0.000   Min.   :-3.000  
##  1st Qu.:0.0000   Class :character   1st Qu.: 0.000   1st Qu.: 1.000  
##  Median :0.0000   Mode  :character   Median : 1.000   Median : 8.000  
##  Mean   :0.7986                      Mean   : 1.696   Mean   : 8.354  
##  3rd Qu.:2.0000                      3rd Qu.: 3.000   3rd Qu.:12.000  
##  Max.   :5.0000                      Max.   :13.000   Max.   :28.000  
##   URBANICITY       
##  Length:8161       
##  Class :character  
##  Mode  :character  
##                    
##                    
##

Data reordering

For downstream analysis, we’ll reorder the columns into categorical, numeric and target.

Boxplots

First look at the boxplots.

The boxplots show that some of the variables have outliers in them. So, we’ll cap them.

The fields AGE, HOMEKIDS, INCOME, HOME_VAL, TRVTIME, BLUEBOOK, TIF, CLM_FREQ, MVR_PTS, CAR_AGE have higher variance. Let’s ignore the boxplots for TARGET_FLAG and TARGET_AMT.

Histograms

Histograms tell us how the data is distributed in the dataset (numeric fields).

The histograms show that AGE, YOJ, INCOME, HOME_VAL, TRAVTIME, BLUEBOOK and CAR_AGE are approximately normally distributed. HOME_VALUE, CAR_AGE and CLM_FREQ are quite dispersed.

Categorical variables

Now, we’ll explore the Categorical variables.

## PARENT1:

## 
##   No  Yes 
## 7084 1077

## MSTATUS:

## 
##   No  Yes 
## 3267 4894

## SEX:

## 
##    F    M 
## 4375 3786

## EDUCATION:

## 
## <High School    Bachelors  High School      Masters          PhD 
##         1203         2242         2330         1658          728

## JOB:

## 
##               Blue Collar     Clerical       Doctor   Home Maker       Lawyer 
##          526         1825         1271          246          641          835 
##      Manager Professional      Student 
##          988         1117          712

## CAR_USE:

## 
## Commercial    Private 
##       3029       5132

## CAR_TYPE:

## 
##     Minivan Panel Truck      Pickup  Sports Car         SUV         Van 
##        2145         676        1389         907        2294         750

## RED_CAR:

## 
##   no  yes 
## 5783 2378

## REVOKED:

## 
##   No  Yes 
## 7161 1000

## URBANICITY:

## 
## Highly Rural/ Rural Highly Urban/ Urban 
##                1669                6492

Observation: In JOB column, 526 rows are empty. So, we’ll impute them with “Unknown”.

## JOB:

## 
##  Blue Collar     Clerical       Doctor   Home Maker       Lawyer      Manager 
##         1825         1271          246          641          835          988 
## Professional      Student      Unknown 
##         1117          712          526

Correlations

At this point the data is prepared. So, we’ll explore the top correlated variables.

There are 25 variables, among which 15 are numeric and 10 are non-categorical. In order to find the top correlated variables, we’ll give numerical values to the correlated variables.

	Top Correlated Variables
	TARGET_FLAG	TARGET_AMT
TARGET_FLAG	1.0000000	1.0000000
TARGET_AMT	0.8334240	0.8334240
URBANICITY	0.2242509	0.1904945
CLM_FREQ	0.2161961	0.1869848
MVR_PTS	0.2075451	0.1741927
OLDCLAIM	0.2004106	0.1652881
PARENT1	0.1576222	0.1359305
REVOKED	0.1519391	0.1263285
HOMEKIDS	0.1175903	0.1014967
KIDSDRIV	0.1036217	0.0863553
CAR_TYPE	0.1035765	0.0827170
TRAVTIME	0.0550685	0.0440349
RED_CAR	-0.0069473	0.0005877
SEX	-0.0210786	-0.0088270
JOB	-0.0669944	-0.0509930
YOJ	-0.0704375	-0.0512803
EDUCATION	-0.0734429	-0.0534342
TIF	-0.0818308	-0.0690666
CAR_AGE	-0.1038815	-0.0712728
AGE	-0.1041892	-0.0818184
BLUEBOOK	-0.1100800	-0.0837608
MSTATUS	-0.1351248	-0.1182110
CAR_USE	-0.1426737	-0.1214701
INCOME	-0.1463515	-0.1287263
HOME_VAL	-0.1847610	-0.1541847

Now, we’ll look at the correlation matrix of the variables.

At this point exploration, preparation and pair-wise correlations of insurance_training_data.csv are done. So, I’ll begin the same exericse for insurance-evaluation-data.csv.

Data Exploration of insurance-evaluation-data.csv.

Initially, we’ll do a cursory exploration of the data. After that, we’ll iteratively prepare and explore the data, wherever required.

## [1] "Dimension of training set:   Number of rows: 2141, Number of cols: 26"

## [1] "Head of training data set:"

##   INDEX TARGET_FLAG TARGET_AMT KIDSDRIV AGE HOMEKIDS YOJ  INCOME PARENT1
## 1     3          NA         NA        0  48        0  11 $52,881      No
## 2     9          NA         NA        1  40        1  11 $50,815     Yes
## 3    10          NA         NA        0  44        2  12 $43,486     Yes
## 4    18          NA         NA        0  35        2  NA $21,204     Yes
## 5    21          NA         NA        0  59        0  12 $87,460      No
## 6    30          NA         NA        0  46        0  14              No
##   HOME_VAL MSTATUS SEX     EDUCATION           JOB TRAVTIME    CAR_USE BLUEBOOK
## 1       $0    z_No   M     Bachelors       Manager       26    Private  $21,970
## 2       $0    z_No   M z_High School       Manager       21    Private  $18,930
## 3       $0    z_No z_F z_High School z_Blue Collar       30 Commercial   $5,900
## 4       $0    z_No   M z_High School      Clerical       74    Private   $9,230
## 5       $0    z_No   M z_High School       Manager       45    Private  $15,420
## 6 $207,519     Yes   M     Bachelors  Professional        7 Commercial  $25,660
##   TIF    CAR_TYPE RED_CAR OLDCLAIM CLM_FREQ REVOKED MVR_PTS CAR_AGE
## 1   1         Van     yes       $0        0      No       2      10
## 2   6     Minivan      no   $3,295        1      No       2       1
## 3  10       z_SUV      no       $0        0      No       0      10
## 4   6      Pickup      no       $0        0     Yes       0       4
## 5   1     Minivan     yes  $44,857        2      No       4       1
## 6   1 Panel Truck      no   $2,119        1      No       2      12
##              URBANICITY
## 1   Highly Urban/ Urban
## 2   Highly Urban/ Urban
## 3 z_Highly Rural/ Rural
## 4 z_Highly Rural/ Rural
## 5   Highly Urban/ Urban
## 6   Highly Urban/ Urban

## [1] "Structure of training data set:"

## 'data.frame':    2141 obs. of  26 variables:
##  $ INDEX      : int  3 9 10 18 21 30 31 37 39 47 ...
##  $ TARGET_FLAG: logi  NA NA NA NA NA NA ...
##  $ TARGET_AMT : logi  NA NA NA NA NA NA ...
##  $ KIDSDRIV   : int  0 1 0 0 0 0 0 0 2 0 ...
##  $ AGE        : int  48 40 44 35 59 46 60 54 36 50 ...
##  $ HOMEKIDS   : int  0 1 2 2 0 0 0 0 2 0 ...
##  $ YOJ        : int  11 11 12 NA 12 14 12 12 12 8 ...
##  $ INCOME     : chr  "$52,881" "$50,815" "$43,486" "$21,204" ...
##  $ PARENT1    : chr  "No" "Yes" "Yes" "Yes" ...
##  $ HOME_VAL   : chr  "$0" "$0" "$0" "$0" ...
##  $ MSTATUS    : chr  "z_No" "z_No" "z_No" "z_No" ...
##  $ SEX        : chr  "M" "M" "z_F" "M" ...
##  $ EDUCATION  : chr  "Bachelors" "z_High School" "z_High School" "z_High School" ...
##  $ JOB        : chr  "Manager" "Manager" "z_Blue Collar" "Clerical" ...
##  $ TRAVTIME   : int  26 21 30 74 45 7 16 27 5 22 ...
##  $ CAR_USE    : chr  "Private" "Private" "Commercial" "Private" ...
##  $ BLUEBOOK   : chr  "$21,970" "$18,930" "$5,900" "$9,230" ...
##  $ TIF        : int  1 6 10 6 1 1 1 4 4 4 ...
##  $ CAR_TYPE   : chr  "Van" "Minivan" "z_SUV" "Pickup" ...
##  $ RED_CAR    : chr  "yes" "no" "no" "no" ...
##  $ OLDCLAIM   : chr  "$0" "$3,295" "$0" "$0" ...
##  $ CLM_FREQ   : int  0 1 0 0 2 1 0 0 0 0 ...
##  $ REVOKED    : chr  "No" "No" "No" "Yes" ...
##  $ MVR_PTS    : int  2 2 0 0 4 2 0 5 0 3 ...
##  $ CAR_AGE    : int  10 1 10 4 1 12 1 NA 9 1 ...
##  $ URBANICITY : chr  "Highly Urban/ Urban" "Highly Urban/ Urban" "z_Highly Rural/ Rural" "z_Highly Rural/ Rural" ...

There are few fields, which have missing values, which we’ll investigate in greater details later.

Data Preparation of insurance-evaluation-data.csv.

At this stage, We’ll explore and prepare iteratively. First we’ll convert the fields, which are supposed to be numeric, into proper numeric format and strings into string format. After reformatting, we’ll check for NA. After that if required, we’ll impute them.

After that we’ll show some boxplots of the numeric fields.

Checking for NA.

## [1] TRUE

NA does exist. So, we’ll impute with mice().

Rechecking for NA after imputation.

## [1] FALSE

We observe that NA were removed in all columns except TARGET_FLAG and TARGET_AMT, which is what we want. In the following, we’ll visualize with missmap().

Both is.na() and missmap() confirm that NA were eliminated.

More Data exploration of insurance-evaluation-data.csv.

Now, we’ll explore the data a little further. First, we’ll take a quick look at min, 1st quartile, median, mean, 2nd quartile, max etc.

##   TARGET_FLAG     TARGET_AMT      KIDSDRIV           AGE       
##  Min.   : NA    Min.   : NA    Min.   :0.0000   Min.   :17.00  
##  1st Qu.: NA    1st Qu.: NA    1st Qu.:0.0000   1st Qu.:39.00  
##  Median : NA    Median : NA    Median :0.0000   Median :45.00  
##  Mean   :NaN    Mean   :NaN    Mean   :0.1625   Mean   :45.02  
##  3rd Qu.: NA    3rd Qu.: NA    3rd Qu.:0.0000   3rd Qu.:51.00  
##  Max.   : NA    Max.   : NA    Max.   :3.0000   Max.   :73.00  
##  NA's   :2141   NA's   :2141                                   
##     HOMEKIDS           YOJ            INCOME         PARENT1         
##  Min.   :0.0000   Min.   : 0.00   Min.   :     0   Length:2141       
##  1st Qu.:0.0000   1st Qu.: 9.00   1st Qu.: 25713   Class :character  
##  Median :0.0000   Median :11.00   Median : 51734   Mode  :character  
##  Mean   :0.7174   Mean   :10.38   Mean   : 60210                     
##  3rd Qu.:1.0000   3rd Qu.:13.00   3rd Qu.: 86321                     
##  Max.   :5.0000   Max.   :19.00   Max.   :291182                     
##                                                                      
##     HOME_VAL        MSTATUS              SEX             EDUCATION        
##  Min.   :     0   Length:2141        Length:2141        Length:2141       
##  1st Qu.:     0   Class :character   Class :character   Class :character  
##  Median :159239   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :154441                                                           
##  3rd Qu.:237217                                                           
##  Max.   :669271                                                           
##                                                                           
##      JOB               TRAVTIME        CAR_USE             BLUEBOOK    
##  Length:2141        Min.   :  5.00   Length:2141        Min.   : 1500  
##  Class :character   1st Qu.: 22.00   Class :character   1st Qu.: 8870  
##  Mode  :character   Median : 33.00   Mode  :character   Median :14170  
##                     Mean   : 33.15                      Mean   :15469  
##                     3rd Qu.: 43.00                      3rd Qu.:21050  
##                     Max.   :105.00                      Max.   :49940  
##                                                                        
##       TIF           CAR_TYPE           RED_CAR             OLDCLAIM    
##  Min.   : 1.000   Length:2141        Length:2141        Min.   :    0  
##  1st Qu.: 1.000   Class :character   Class :character   1st Qu.:    0  
##  Median : 4.000   Mode  :character   Mode  :character   Median :    0  
##  Mean   : 5.245                                         Mean   : 4022  
##  3rd Qu.: 7.000                                         3rd Qu.: 4718  
##  Max.   :25.000                                         Max.   :54399  
##                                                                        
##     CLM_FREQ       REVOKED             MVR_PTS          CAR_AGE      
##  Min.   :0.000   Length:2141        Min.   : 0.000   Min.   : 0.000  
##  1st Qu.:0.000   Class :character   1st Qu.: 0.000   1st Qu.: 1.000  
##  Median :0.000   Mode  :character   Median : 1.000   Median : 8.000  
##  Mean   :0.809                      Mean   : 1.766   Mean   : 8.212  
##  3rd Qu.:2.000                      3rd Qu.: 3.000   3rd Qu.:13.000  
##  Max.   :5.000                      Max.   :12.000   Max.   :26.000  
##                                                                      
##   URBANICITY       
##  Length:2141       
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
##

Data reordering

For downstream analysis, we’ll reorder the columns into categorical, numeric and target.

Boxplots

Let’s take a first look at the boxplots

The boxplots show that some of the variables have outliers in them. So, we’ll cap them.

The fields AGE, HOMEKIDS, INCOME, HOME_VAL, TRVTIME, BLUEBOOK, TIF, CLM_FREQ, MVR_PTS, CAR_AGE have higher variance.

Let’s ignore the boxplots for TARGET_FLAG and TARGET_AMT.

We’ll do the boxplots differently, with gglplot, to check if there are any differences.

Histograms

Histograms tell us how the data is distributed in the dataset (numeric fields).

The histograms show that AGE, YOJ, HOME_VAL, TRAVTIME, BLUEBOOK and CAR_AGE are approximately normally distributed. HOME_VALUE, CAR_AGE and CLM_FREQ are quite dispersed.

Categorical variables

Now, we’ll explore the Categorical variables.

## PARENT1:

## 
##   No  Yes 
## 1875  266

## MSTATUS:

## 
##   No  Yes 
##  847 1294

## SEX:

## 
##    F    M 
## 1170  971

## EDUCATION:

## 
## <High School    Bachelors  High School      Masters          PhD 
##          312          581          622          420          206

## JOB:

## 
##               Blue Collar     Clerical       Doctor   Home Maker       Lawyer 
##          139          463          319           75          202          196 
##      Manager Professional      Student 
##          269          291          187

## CAR_USE:

## 
## Commercial    Private 
##        760       1381

## CAR_TYPE:

## 
##     Minivan Panel Truck      Pickup  Sports Car         SUV         Van 
##         549         177         383         272         589         171

## RED_CAR:

## 
##   no  yes 
## 1543  598

## REVOKED:

## 
##   No  Yes 
## 1880  261

## URBANICITY:

## 
## Highly Rural/ Rural Highly Urban/ Urban 
##                 403                1738

Observation: In JOB columns, 139 rows are empty. So, we’ have to’ll impute them with “Unknown”.

## JOB:

## 
##  Blue Collar     Clerical       Doctor   Home Maker       Lawyer      Manager 
##          463          319           75          202          196          269 
## Professional      Student      Unknown 
##          291          187          139

Correlations.

At this point the data is prepared. So, we’ll explore the top correlated variables.

There are 25 variables, among which 15 are numeric and 10 are non-categorical. In order to find pai-wise correlations, we’ll give numerical values to the correlated variables.

Now, we’ll look at the correlation matrix of the variables.

At this point exploration, preparation and pair-wise correlations of insurance_evaluation_data.csv are done. So, I’ll begin the building process.

Building Models

Now, we are in a position to build the models. Initially, we’ll build models with insurance_training_data.csv and determine the model. Then we’ll use that model to predict on insurance-evaluation-data.csv.

Fact: The pre-processed dataset variables, which we’ll use in the following are Ins_train_cap_imputed and Ins_eval_cap_imputed.

We have two tasks here: One is to classify the variable TARGET_FLAG. For classification, we’ll use Logistic Regression. The other task is to predict the value of TARGET_AMT with Linear Regression.

We’ll build two Logistic Regression models and compare the accuracies and select the best one and use that for predicting on Ins_eval_cap_imputed. In order to do the Logistic Regression, we must split the data (80/20 ratio is our choice). So, let’s split the first.

The training and test datasets formed by splitting Ins_train_cap_imputed in 80/20 ratio are Ins_train_cap_imputed_trn and Ins_train_cap_imputed_tst.

Logistic Regression Model Model01_Log_Reg

We’ll build our first Logistic Regression model, called Model01_Log_Reg.

## 
## Call:
## glm(formula = TARGET_FLAG ~ . - TARGET_AMT, family = binomial, 
##     data = Ins_train_cap_imputed_trn)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2696  -0.7148  -0.3986   0.6484   3.1772  
## 
## Coefficients:
##                                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                   -2.353e+00  3.200e-01  -7.351 1.96e-13 ***
## PARENT1Yes                     4.291e-01  1.243e-01   3.454 0.000553 ***
## MSTATUSYes                    -4.984e-01  9.405e-02  -5.299 1.16e-07 ***
## SEXM                           5.945e-02  1.250e-01   0.476 0.634204    
## EDUCATIONBachelors            -4.227e-01  1.292e-01  -3.272 0.001067 ** 
## EDUCATIONHigh School          -5.255e-02  1.069e-01  -0.492 0.622940    
## EDUCATIONMasters              -3.416e-01  1.999e-01  -1.709 0.087432 .  
## EDUCATIONPhD                  -2.615e-01  2.348e-01  -1.113 0.265523    
## JOBClerical                    7.402e-02  1.202e-01   0.616 0.538034    
## JOBDoctor                     -5.875e-01  3.113e-01  -1.887 0.059128 .  
## JOBHome Maker                 -8.356e-02  1.738e-01  -0.481 0.630619    
## JOBLawyer                     -1.868e-01  2.113e-01  -0.884 0.376748    
## JOBManager                    -8.325e-01  1.565e-01  -5.319 1.04e-07 ***
## JOBProfessional               -1.020e-01  1.347e-01  -0.757 0.448777    
## JOBStudent                    -1.226e-01  1.455e-01  -0.843 0.399393    
## JOBUnknown                    -3.768e-01  2.080e-01  -1.811 0.070122 .  
## CAR_USEPrivate                -7.932e-01  1.024e-01  -7.748 9.36e-15 ***
## CAR_TYPEPanel Truck            5.483e-01  1.810e-01   3.030 0.002449 ** 
## CAR_TYPEPickup                 5.371e-01  1.125e-01   4.776 1.79e-06 ***
## CAR_TYPESports Car             1.047e+00  1.445e-01   7.248 4.24e-13 ***
## CAR_TYPESUV                    7.792e-01  1.237e-01   6.298 3.01e-10 ***
## CAR_TYPEVan                    7.086e-01  1.402e-01   5.054 4.34e-07 ***
## RED_CARyes                    -1.660e-02  9.672e-02  -0.172 0.863717    
## REVOKEDYes                     7.929e-01  1.029e-01   7.706 1.29e-14 ***
## URBANICITYHighly Urban/ Urban  2.441e+00  1.256e-01  19.432  < 2e-16 ***
## KIDSDRIV                       5.976e-01  1.091e-01   5.478 4.30e-08 ***
## AGE                           -3.216e-03  4.606e-03  -0.698 0.485129    
## HOMEKIDS                       4.789e-02  4.469e-02   1.072 0.283845    
## YOJ                           -1.851e-02  9.345e-03  -1.981 0.047605 *  
## INCOME                        -2.127e-06  1.356e-06  -1.568 0.116872    
## HOME_VAL                      -1.470e-06  3.751e-07  -3.920 8.87e-05 ***
## TRAVTIME                       1.624e-02  2.180e-03   7.451 9.26e-14 ***
## BLUEBOOK                      -2.351e-05  6.086e-06  -3.862 0.000112 ***
## TIF                           -6.075e-02  8.537e-03  -7.116 1.11e-12 ***
## OLDCLAIM                      -1.331e-05  5.398e-06  -2.465 0.013694 *  
## CLM_FREQ                       1.815e-01  3.283e-02   5.528 3.24e-08 ***
## MVR_PTS                        1.089e-01  1.652e-02   6.596 4.22e-11 ***
## CAR_AGE                       -7.854e-03  8.065e-03  -0.974 0.330193    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 7535.7  on 6528  degrees of freedom
## Residual deviance: 5829.9  on 6491  degrees of freedom
## AIC: 5905.9
## 
## Number of Fisher Scoring iterations: 5

The important metric is AIC: 5899.5.

Now, we’ll predict on the test set Ins_train_cap_imputed_tst.

Creation of Confusion Matrix.

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    1    0
##          1  180   97
##          0  250 1105
##                                          
##                Accuracy : 0.7874         
##                  95% CI : (0.7667, 0.807)
##     No Information Rate : 0.7365         
##     P-Value [Acc > NIR] : 1.083e-06      
##                                          
##                   Kappa : 0.3815         
##                                          
##  Mcnemar's Test P-Value : 3.356e-16      
##                                          
##             Sensitivity : 0.4186         
##             Specificity : 0.9193         
##          Pos Pred Value : 0.6498         
##          Neg Pred Value : 0.8155         
##              Prevalence : 0.2635         
##          Detection Rate : 0.1103         
##    Detection Prevalence : 0.1697         
##       Balanced Accuracy : 0.6690         
##                                          
##        'Positive' Class : 1              
##

Plotting the AUC under roc curve.

Here we note the following important metrics.

Accuracy of this model is 0.7806.

AUC of the model is 0.801.

Logistic Regression Model Model02_Log_Reg

We’ll build our second Logistic Regression model, called Model02_Log_Reg.

## 
## Call:
## glm(formula = TARGET_FLAG ~ KIDSDRIV + HOMEKIDS + INCOME + PARENT1 + 
##     HOME_VAL + MSTATUS + EDUCATION + TRAVTIME + CAR_USE + BLUEBOOK + 
##     TIF + CAR_TYPE + CLM_FREQ + REVOKED + MVR_PTS + CAR_AGE + 
##     URBANICITY, family = binomial, data = Ins_train_cap_imputed_trn)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3097  -0.7236  -0.4093   0.6644   3.1651  
## 
## Coefficients:
##                                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                   -2.454e+00  2.125e-01 -11.546  < 2e-16 ***
## KIDSDRIV                       5.831e-01  1.067e-01   5.463 4.67e-08 ***
## HOMEKIDS                       5.626e-02  4.047e-02   1.390 0.164453    
## INCOME                        -3.777e-06  1.192e-06  -3.170 0.001526 ** 
## PARENT1Yes                     4.283e-01  1.227e-01   3.492 0.000479 ***
## HOME_VAL                      -1.377e-06  3.633e-07  -3.790 0.000151 ***
## MSTATUSYes                    -5.298e-01  9.287e-02  -5.705 1.17e-08 ***
## EDUCATIONBachelors            -5.642e-01  1.176e-01  -4.795 1.62e-06 ***
## EDUCATIONHigh School          -1.157e-01  1.035e-01  -1.118 0.263703    
## EDUCATIONMasters              -5.835e-01  1.480e-01  -3.943 8.06e-05 ***
## EDUCATIONPhD                  -6.015e-01  1.812e-01  -3.320 0.000902 ***
## TRAVTIME                       1.689e-02  2.162e-03   7.811 5.65e-15 ***
## CAR_USEPrivate                -8.751e-01  8.080e-02 -10.830  < 2e-16 ***
## BLUEBOOK                      -2.559e-05  5.447e-06  -4.698 2.62e-06 ***
## TIF                           -6.079e-02  8.473e-03  -7.174 7.26e-13 ***
## CAR_TYPEPanel Truck            4.853e-01  1.598e-01   3.037 0.002390 ** 
## CAR_TYPEPickup                 4.672e-01  1.092e-01   4.279 1.88e-05 ***
## CAR_TYPESports Car             9.915e-01  1.182e-01   8.391  < 2e-16 ***
## CAR_TYPESUV                    7.332e-01  9.479e-02   7.735 1.03e-14 ***
## CAR_TYPEVan                    6.705e-01  1.319e-01   5.083 3.71e-07 ***
## CLM_FREQ                       1.386e-01  2.845e-02   4.870 1.11e-06 ***
## REVOKEDYes                     6.770e-01  8.982e-02   7.537 4.83e-14 ***
## MVR_PTS                        1.096e-01  1.627e-02   6.733 1.66e-11 ***
## CAR_AGE                       -8.254e-03  8.012e-03  -1.030 0.302913    
## URBANICITYHighly Urban/ Urban  2.387e+00  1.251e-01  19.081  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 7535.7  on 6528  degrees of freedom
## Residual deviance: 5887.2  on 6504  degrees of freedom
## AIC: 5937.2
## 
## Number of Fisher Scoring iterations: 5

The important metric is AIC: 5930.1.

Now, we’ll predict on the test set Ins_train_cap_imputed_tst.

Creation of Confusion Matrix.

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    1    0
##          1  172   97
##          0  258 1105
##                                           
##                Accuracy : 0.7825          
##                  95% CI : (0.7617, 0.8023)
##     No Information Rate : 0.7365          
##     P-Value [Acc > NIR] : 9.894e-06       
##                                           
##                   Kappa : 0.3629          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.4000          
##             Specificity : 0.9193          
##          Pos Pred Value : 0.6394          
##          Neg Pred Value : 0.8107          
##              Prevalence : 0.2635          
##          Detection Rate : 0.1054          
##    Detection Prevalence : 0.1648          
##       Balanced Accuracy : 0.6597          
##                                           
##        'Positive' Class : 1               
##

Plotting the AUC under roc curve.

Here we note the following important metrics.

Accuracy of this model is 0.7855.

AUC of the model is 0.802.

Logistic Regression Model Model03_Log_Reg

In the third Logistic Regression model Model03_Log_Reg, we’ll do stepwise model selection.

## 
## Call:
## glm(formula = TARGET_FLAG ~ KIDSDRIV + INCOME + PARENT1 + HOME_VAL + 
##     MSTATUS + EDUCATION + TRAVTIME + CAR_USE + BLUEBOOK + TIF + 
##     CAR_TYPE + CLM_FREQ + REVOKED + MVR_PTS + URBANICITY, family = binomial, 
##     data = Ins_train_cap_imputed_trn)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3376  -0.7220  -0.4088   0.6604   3.1518  
## 
## Coefficients:
##                                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                   -2.466e+00  2.099e-01 -11.748  < 2e-16 ***
## KIDSDRIV                       6.404e-01  9.858e-02   6.497 8.20e-11 ***
## INCOME                        -3.846e-06  1.188e-06  -3.236 0.001211 ** 
## PARENT1Yes                     5.191e-01  1.049e-01   4.950 7.42e-07 ***
## HOME_VAL                      -1.395e-06  3.620e-07  -3.853 0.000117 ***
## MSTATUSYes                    -4.895e-01  8.819e-02  -5.550 2.85e-08 ***
## EDUCATIONBachelors            -6.170e-01  1.102e-01  -5.600 2.15e-08 ***
## EDUCATIONHigh School          -1.288e-01  1.031e-01  -1.249 0.211759    
## EDUCATIONMasters              -6.819e-01  1.250e-01  -5.456 4.86e-08 ***
## EDUCATIONPhD                  -6.990e-01  1.630e-01  -4.287 1.81e-05 ***
## TRAVTIME                       1.681e-02  2.161e-03   7.777 7.43e-15 ***
## CAR_USEPrivate                -8.754e-01  8.079e-02 -10.836  < 2e-16 ***
## BLUEBOOK                      -2.582e-05  5.447e-06  -4.740 2.13e-06 ***
## TIF                           -6.070e-02  8.468e-03  -7.169 7.58e-13 ***
## CAR_TYPEPanel Truck            4.889e-01  1.597e-01   3.061 0.002208 ** 
## CAR_TYPEPickup                 4.649e-01  1.092e-01   4.259 2.06e-05 ***
## CAR_TYPESports Car             9.925e-01  1.181e-01   8.402  < 2e-16 ***
## CAR_TYPESUV                    7.376e-01  9.471e-02   7.788 6.78e-15 ***
## CAR_TYPEVan                    6.726e-01  1.318e-01   5.102 3.36e-07 ***
## CLM_FREQ                       1.386e-01  2.844e-02   4.872 1.11e-06 ***
## REVOKEDYes                     6.809e-01  8.979e-02   7.583 3.38e-14 ***
## MVR_PTS                        1.101e-01  1.627e-02   6.765 1.33e-11 ***
## URBANICITYHighly Urban/ Urban  2.385e+00  1.251e-01  19.068  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 7535.7  on 6528  degrees of freedom
## Residual deviance: 5890.2  on 6506  degrees of freedom
## AIC: 5936.2
## 
## Number of Fisher Scoring iterations: 5

The important metric is AIC: 5959.5.

Now, we’ll predict on the test set Ins_train_cap_imputed_tst.

Creation of Confusion Matrix.

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    1    0
##          1  178   95
##          0  252 1107
##                                          
##                Accuracy : 0.7874         
##                  95% CI : (0.7667, 0.807)
##     No Information Rate : 0.7365         
##     P-Value [Acc > NIR] : 1.083e-06      
##                                          
##                   Kappa : 0.3794         
##                                          
##  Mcnemar's Test P-Value : < 2.2e-16      
##                                          
##             Sensitivity : 0.4140         
##             Specificity : 0.9210         
##          Pos Pred Value : 0.6520         
##          Neg Pred Value : 0.8146         
##              Prevalence : 0.2635         
##          Detection Rate : 0.1091         
##    Detection Prevalence : 0.1673         
##       Balanced Accuracy : 0.6675         
##                                          
##        'Positive' Class : 1              
##

Plotting the AUC under roc curve.

Here we note the following important metrics.

Accuracy of this model is 0.7874.

AUC of the model is 0.802.

At this point three Logistic Regression models were built. The accuracy was highest in the third model Model03_Log_Reg. We’ll use this model on the evaluation dataset, for classification.

Having completed three Logistic Regression models we’ll now proceed to build two Linear Regression models.

Linear Regression model Model01_Lin_Reg

First Linear Regression model is Model01_Lin_Reg.

Since our goal is to predict the TARGET_AMT, and not classify (as we did in Logistic Regression), we’ll build lm for TARGET_AMT.

## 
## Call:
## lm(formula = TARGET_AMT ~ . - TARGET_FLAG, data = Ins_train_cap_imputed)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -5146  -1708   -755    350 103696 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    6.636e+01  4.769e+02   0.139 0.889344    
## PARENT1Yes                     5.432e+02  2.043e+02   2.659 0.007843 ** 
## MSTATUSYes                    -5.611e+02  1.443e+02  -3.890 0.000101 ***
## SEXM                           3.551e+02  1.834e+02   1.936 0.052933 .  
## EDUCATIONBachelors            -2.688e+02  2.050e+02  -1.311 0.189793    
## EDUCATIONHigh School          -9.559e+01  1.720e+02  -0.556 0.578453    
## EDUCATIONMasters               1.538e+01  2.992e+02   0.051 0.959012    
## EDUCATIONPhD                   2.203e+02  3.526e+02   0.625 0.532127    
## JOBClerical                    1.699e+01  1.935e+02   0.088 0.930019    
## JOBDoctor                     -9.804e+02  4.356e+02  -2.251 0.024441 *  
## JOBHome Maker                 -1.526e+02  2.707e+02  -0.564 0.573009    
## JOBLawyer                     -2.772e+02  3.108e+02  -0.892 0.372522    
## JOBManager                    -9.816e+02  2.345e+02  -4.187 2.86e-05 ***
## JOBProfessional               -4.084e+01  2.131e+02  -0.192 0.848010    
## JOBStudent                    -2.216e+02  2.352e+02  -0.942 0.346054    
## JOBUnknown                    -5.147e+02  3.216e+02  -1.600 0.109576    
## CAR_USEPrivate                -7.933e+02  1.644e+02  -4.825 1.43e-06 ***
## CAR_TYPEPanel Truck            2.675e+02  2.766e+02   0.967 0.333431    
## CAR_TYPEPickup                 3.689e+02  1.708e+02   2.160 0.030835 *  
## CAR_TYPESports Car             1.014e+03  2.178e+02   4.655 3.29e-06 ***
## CAR_TYPESUV                    7.443e+02  1.794e+02   4.148 3.39e-05 ***
## CAR_TYPEVan                    5.156e+02  2.133e+02   2.418 0.015639 *  
## RED_CARyes                    -3.782e+01  1.491e+02  -0.254 0.799788    
## REVOKEDYes                     5.930e+02  1.743e+02   3.402 0.000673 ***
## URBANICITYHighly Urban/ Urban  1.669e+03  1.392e+02  11.991  < 2e-16 ***
## KIDSDRIV                       6.076e+02  1.790e+02   3.394 0.000693 ***
## AGE                            6.512e+00  7.187e+00   0.906 0.364955    
## HOMEKIDS                       7.699e+01  6.992e+01   1.101 0.270888    
## YOJ                           -4.280e+00  1.462e+01  -0.293 0.769680    
## INCOME                        -3.939e-03  2.076e-03  -1.897 0.057798 .  
## HOME_VAL                      -7.668e-04  5.857e-04  -1.309 0.190519    
## TRAVTIME                       1.274e+01  3.340e+00   3.816 0.000137 ***
## BLUEBOOK                       1.374e-02  8.979e-03   1.530 0.125958    
## TIF                           -5.122e+01  1.294e+01  -3.957 7.66e-05 ***
## OLDCLAIM                      -1.689e-02  9.122e-03  -1.851 0.064181 .  
## CLM_FREQ                       1.654e+02  5.688e+01   2.907 0.003659 ** 
## MVR_PTS                        1.733e+02  2.790e+01   6.212 5.50e-10 ***
## CAR_AGE                       -2.692e+01  1.232e+01  -2.184 0.028988 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4545 on 8123 degrees of freedom
## Multiple R-squared:  0.07079,    Adjusted R-squared:  0.06655 
## F-statistic: 16.72 on 37 and 8123 DF,  p-value: < 2.2e-16

Here we note the following important metrics.

R-squared: 0.07094.

Adjusted R-squared: 0.06671.

The R-squared values are far from 1. So, the model is not good. In below plot, the points are widely scattered, but not linearly.

Linear Regression model Model02_Lin_Reg

In Second Linear Regression model Model02_Lin_Reg, we’ll do stepwise AIC.

## 
## Call:
## lm(formula = TARGET_AMT ~ PARENT1 + MSTATUS + SEX + JOB + CAR_USE + 
##     CAR_TYPE + REVOKED + URBANICITY + KIDSDRIV + INCOME + TRAVTIME + 
##     BLUEBOOK + TIF + OLDCLAIM + CLM_FREQ + MVR_PTS + CAR_AGE, 
##     data = Ins_train_cap_imputed)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -5198  -1700   -761    337 103636 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    1.754e+02  3.416e+02   0.514 0.607582    
## PARENT1Yes                     5.929e+02  1.784e+02   3.324 0.000890 ***
## MSTATUSYes                    -6.195e+02  1.197e+02  -5.177 2.31e-07 ***
## SEXM                           3.242e+02  1.605e+02   2.021 0.043347 *  
## JOBClerical                    3.153e+01  1.924e+02   0.164 0.869812    
## JOBDoctor                     -6.537e+02  3.552e+02  -1.841 0.065705 .  
## JOBHome Maker                 -1.088e+02  2.484e+02  -0.438 0.661362    
## JOBLawyer                     -1.434e+02  2.416e+02  -0.594 0.552810    
## JOBManager                    -9.721e+02  2.128e+02  -4.569 4.97e-06 ***
## JOBProfessional               -1.302e+02  1.976e+02  -0.659 0.509968    
## JOBStudent                    -1.308e+02  2.220e+02  -0.589 0.555900    
## JOBUnknown                    -3.039e+02  2.671e+02  -1.138 0.255334    
## CAR_USEPrivate                -7.443e+02  1.569e+02  -4.745 2.12e-06 ***
## CAR_TYPEPanel Truck            3.077e+02  2.733e+02   1.126 0.260308    
## CAR_TYPEPickup                 3.933e+02  1.695e+02   2.320 0.020379 *  
## CAR_TYPESports Car             1.029e+03  2.165e+02   4.756 2.01e-06 ***
## CAR_TYPESUV                    7.493e+02  1.786e+02   4.196 2.75e-05 ***
## CAR_TYPEVan                    5.380e+02  2.121e+02   2.537 0.011204 *  
## REVOKEDYes                     5.979e+02  1.743e+02   3.431 0.000604 ***
## URBANICITYHighly Urban/ Urban  1.664e+03  1.391e+02  11.962  < 2e-16 ***
## KIDSDRIV                       6.978e+02  1.620e+02   4.307 1.67e-05 ***
## INCOME                        -5.123e-03  1.788e-03  -2.866 0.004174 ** 
## TRAVTIME                       1.262e+01  3.337e+00   3.782 0.000157 ***
## BLUEBOOK                       1.415e-02  8.880e-03   1.594 0.110992    
## TIF                           -5.066e+01  1.294e+01  -3.916 9.09e-05 ***
## OLDCLAIM                      -1.687e-02  9.117e-03  -1.850 0.064308 .  
## CLM_FREQ                       1.684e+02  5.682e+01   2.963 0.003052 ** 
## MVR_PTS                        1.741e+02  2.786e+01   6.251 4.30e-10 ***
## CAR_AGE                       -2.719e+01  1.093e+01  -2.488 0.012851 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4545 on 8132 degrees of freedom
## Multiple R-squared:  0.06984,    Adjusted R-squared:  0.06663 
## F-statistic: 21.81 on 28 and 8132 DF,  p-value: < 2.2e-16

Here we note the following important metrics.

R-squared: 0.07012.

Adjusted R-squared: 0.06692.

The second model Model02_Lin_Reg marginally improved over the first model Model01_Lin_Reg. The plot also suggest no proper linear regression.

Model Selection.

We ran three Logistic Regression models and two Linear Regression models. Based on the Accuracy and AUC, the third Logistic Regression model Model03_Log_Reg did best and based on R-squared value, the second Linear Regression model Model02_Lin_Reg did best. In the following we’ll name our selections.

Selected models are:

Model03_Log_Reg Model02_Lin_Reg

We’ll use these models to predict the evaluation dataset insurance-evaluation-data.csv.

The data prepared from evaluation dataset is stored in Ins_eval_cap_imputed.

Head of the predicted data.

##   PARENT1 MSTATUS SEX   EDUCATION          JOB    CAR_USE    CAR_TYPE RED_CAR
## 1      No      No   M   Bachelors      Manager    Private         Van     yes
## 2     Yes      No   M High School      Manager    Private     Minivan      no
## 3     Yes      No   F High School  Blue Collar Commercial         SUV      no
## 4     Yes      No   M High School     Clerical    Private      Pickup      no
## 5      No      No   M High School      Manager    Private     Minivan     yes
## 6      No     Yes   M   Bachelors Professional Commercial Panel Truck      no
##   REVOKED          URBANICITY KIDSDRIV AGE HOMEKIDS YOJ INCOME HOME_VAL
## 1      No Highly Urban/ Urban        0  48        0  11  52881        0
## 2      No Highly Urban/ Urban        1  40        1  11  50815        0
## 3      No Highly Rural/ Rural        0  44        2  12  43486        0
## 4     Yes Highly Rural/ Rural        0  35        2  13  21204        0
## 5      No Highly Urban/ Urban        0  59        0  12  87460        0
## 6      No Highly Urban/ Urban        0  46        0  14  87190   207519
##   TRAVTIME BLUEBOOK TIF OLDCLAIM CLM_FREQ MVR_PTS CAR_AGE TARGET_FLAG
## 1       26    21970   1        0        0       2      10           0
## 2       21    18930   6     3295        1       2       1           0
## 3       30     5900  10        0        0       0      10           0
## 4       74     9230   6        0        0       0       4           0
## 5       45    15420   1    26114        2       4       1           0
## 6        7    25660   1     2119        1       2      12           0
##   TARGET_AMT
## 1          0
## 2          0
## 3          0
## 4          0
## 5          0
## 6          0

Output the data to a CSV file for a fuller inspection.

Data 621 Homework 04
Section 01 Group 5

Farhana Zahir, Habib Khan, Vijaya Cherukuri, Scott Reed, Shovon Biswas, Alain Kuiete Tchoupou

10/28/2020

Overview

Data Exploration of insurance_training_data.csv.

Data Preparation of insurance_training_data.csv.

More Data Exploration of insurance_training_data.csv.

Data reordering

Boxplots

Histograms

Categorical variables

Correlations

Data Exploration of insurance-evaluation-data.csv.

Data Preparation of insurance-evaluation-data.csv.

More Data exploration of insurance-evaluation-data.csv.

Data reordering

Boxplots

Histograms

Categorical variables

Correlations.

Building Models

Logistic Regression Model Model01_Log_Reg

Logistic Regression Model Model02_Log_Reg

Logistic Regression Model Model03_Log_Reg

Linear Regression model Model01_Lin_Reg

Linear Regression model Model02_Lin_Reg

Model Selection.

Output the data to a CSV file for a fuller inspection.

Data 621 Homework 04 Section 01 Group 5

Farhana Zahir, Habib Khan, Vijaya Cherukuri, Scott Reed, Shovon Biswas, Alain Kuiete Tchoupou

10/28/2020

Overview

Data Exploration of insurance_training_data.csv.

Data Preparation of insurance_training_data.csv.

More Data Exploration of insurance_training_data.csv.

Data reordering

Boxplots

Histograms

Categorical variables

Correlations

Data Exploration of insurance-evaluation-data.csv.

Data Preparation of insurance-evaluation-data.csv.

More Data exploration of insurance-evaluation-data.csv.

Data reordering

Boxplots

Histograms

Categorical variables

Correlations.

Building Models

Logistic Regression Model Model01_Log_Reg

Logistic Regression Model Model02_Log_Reg

Logistic Regression Model Model03_Log_Reg

Linear Regression model Model01_Lin_Reg

Linear Regression model Model02_Lin_Reg

Model Selection.

Output the data to a CSV file for a fuller inspection.

Data 621 Homework 04
Section 01 Group 5