DATA-621 Project 4

Cover Page
Data Exploration
DATA PREPARATION
BUILD MODELS
Select Models
Model Evaluation

Cover Page

DATA621-Assignment-4

By - Harpreet Shoker

Date - 08-Jul-2018

University - City university of New York

Professor - Marcus Ellis

Abstract

The purpose of this assignment is to build a series two different models. The first model will predict whether a person will get into a car crash and the second model will be used to predict the amount as to which the crash will cost. Both of these models will utilize the customers insurance information to predict the two variables stated above. The data set contains approximately 8161 records. Each record represents a customer profile at an auto insurance company. Each record has two response variables. The first response variable, TARGET_FLAG, is a 1 or a 0. A “1” means that the person was in a car crash. A zero means that the person was not in a car crash. The second response variable is TARGET_AMT. This is the amount spent on repairs if there was a crash. This value is zero if the person did not crash their car. But if they did crash their car, this number will be a value greater than zero.

We will be exploring, analyzing, and modeling the training data. Out of the many models we try to build we will go ahead and shortlist one model that works the best. We will then use these models on the test / evaluation data.

To attain our objective, we will be follow the below steps for each modeling exercise:

1 -Data Exploration 2 -Data Preparation 3 -Build Models 4 -Select Models

Data Exploration

Reading the insurance training data set from github

data <- read.csv('https://raw.githubusercontent.com/Harpreet1984/DATA621/master/HW4/insurance_training_data.csv', na.string = c("", "NA"), stringsAsFactors = FALSE)

CHecking missing vallues

Here checking ffor the missing values in the datasset provided using missmap()

We can see in the below plot there are a few fields that have some missing values. We can see that the fields JOB, CAR_AGE, HOME_VAL, YOJ, AGE and INCOME have missing values. These need to either be removed or imputed to continue on with the analysis.

 summary(data)

##      INDEX        TARGET_FLAG       TARGET_AMT        KIDSDRIV     
##  Min.   :    1   Min.   :0.0000   Min.   :     0   Min.   :0.0000  
##  1st Qu.: 2559   1st Qu.:0.0000   1st Qu.:     0   1st Qu.:0.0000  
##  Median : 5133   Median :0.0000   Median :     0   Median :0.0000  
##  Mean   : 5152   Mean   :0.2638   Mean   :  1504   Mean   :0.1711  
##  3rd Qu.: 7745   3rd Qu.:1.0000   3rd Qu.:  1036   3rd Qu.:0.0000  
##  Max.   :10302   Max.   :1.0000   Max.   :107586   Max.   :4.0000  
##                                                                    
##       AGE           HOMEKIDS           YOJ          INCOME         
##  Min.   :16.00   Min.   :0.0000   Min.   : 0.0   Length:8161       
##  1st Qu.:39.00   1st Qu.:0.0000   1st Qu.: 9.0   Class :character  
##  Median :45.00   Median :0.0000   Median :11.0   Mode  :character  
##  Mean   :44.79   Mean   :0.7212   Mean   :10.5                     
##  3rd Qu.:51.00   3rd Qu.:1.0000   3rd Qu.:13.0                     
##  Max.   :81.00   Max.   :5.0000   Max.   :23.0                     
##  NA's   :6                        NA's   :454                      
##    PARENT1            HOME_VAL           MSTATUS         
##  Length:8161        Length:8161        Length:8161       
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##      SEX             EDUCATION             JOB               TRAVTIME     
##  Length:8161        Length:8161        Length:8161        Min.   :  5.00  
##  Class :character   Class :character   Class :character   1st Qu.: 22.00  
##  Mode  :character   Mode  :character   Mode  :character   Median : 33.00  
##                                                           Mean   : 33.49  
##                                                           3rd Qu.: 44.00  
##                                                           Max.   :142.00  
##                                                                           
##    CAR_USE            BLUEBOOK              TIF           CAR_TYPE        
##  Length:8161        Length:8161        Min.   : 1.000   Length:8161       
##  Class :character   Class :character   1st Qu.: 1.000   Class :character  
##  Mode  :character   Mode  :character   Median : 4.000   Mode  :character  
##                                        Mean   : 5.351                     
##                                        3rd Qu.: 7.000                     
##                                        Max.   :25.000                     
##                                                                           
##    RED_CAR            OLDCLAIM            CLM_FREQ        REVOKED         
##  Length:8161        Length:8161        Min.   :0.0000   Length:8161       
##  Class :character   Class :character   1st Qu.:0.0000   Class :character  
##  Mode  :character   Mode  :character   Median :0.0000   Mode  :character  
##                                        Mean   :0.7986                     
##                                        3rd Qu.:2.0000                     
##                                        Max.   :5.0000                     
##                                                                           
##     MVR_PTS          CAR_AGE        URBANICITY       
##  Min.   : 0.000   Min.   :-3.000   Length:8161       
##  1st Qu.: 0.000   1st Qu.: 1.000   Class :character  
##  Median : 1.000   Median : 8.000   Mode  :character  
##  Mean   : 1.696   Mean   : 8.328                     
##  3rd Qu.: 3.000   3rd Qu.:12.000                     
##  Max.   :13.000   Max.   :28.000                     
##                   NA's   :510

we notice here that the CAR_AGE filed is showing that the minimum car age is -3 which seems to be odd.This needs to be re-imputed or removed before the final models are created and selected to get optimal results.The data also contains 10 categorical variables and 16 numeric variables. The categorical data will need to be converted into a numerical field.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.2638  1.0000  1.0000

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0       0       0    1504    1036  107586

Histogram for TARGET_FLAG - first target variable (TARGET_FLAG). The summary stats are saying that the mean is less than .5 which means that the data is not evenly disbuted between 0 and 1.

Histogram for Target_AMT - we see that the amount that the customer paid is also around 0.

Checking for correlation between variables

From the above correlation plot we can see that all of the variables do not correlate with each other.The variablesthat have a low correlation signifiance may not be good choices to include in our final model for any of the target variables.variables that appear to have a relatively strong correlation between each other would be good choices for the overall model.

DATA PREPARATION

Dealing with missing null values

Removing all records that are not complete
Removing invalid values like CAR_AGE below 0
Converting accounting data to numeric field (removing $ sign)

##    INDEX TARGET_FLAG TARGET_AMT KIDSDRIV AGE HOMEKIDS YOJ   INCOME PARENT1
## 1      1           0          0        0  60        0  11  $67,349      No
## 2      2           0          0        0  43        0  11  $91,449      No
## 3      4           0          0        0  35        1  10  $16,039      No
## 6      7           1       2946        0  34        1  12 $125,301     Yes
## 9     12           1       2501        0  34        0  10  $62,978      No
## 10    13           0          0        0  50        0   7 $106,952      No
##    HOME_VAL MSTATUS SEX     EDUCATION           JOB TRAVTIME    CAR_USE
## 1        $0    z_No   M           PhD  Professional       14    Private
## 2  $257,252    z_No   M z_High School z_Blue Collar       22 Commercial
## 3  $124,191     Yes z_F z_High School      Clerical        5    Private
## 6        $0    z_No z_F     Bachelors z_Blue Collar       46 Commercial
## 9        $0    z_No z_F     Bachelors      Clerical       34    Private
## 10       $0    z_No   M     Bachelors  Professional       48 Commercial
##    BLUEBOOK TIF   CAR_TYPE RED_CAR OLDCLAIM CLM_FREQ REVOKED MVR_PTS
## 1   $14,230  11    Minivan     yes   $4,461        2      No       3
## 2   $14,940   1    Minivan     yes       $0        0      No       0
## 3    $4,010   4      z_SUV      no  $38,690        2      No       3
## 6   $17,430   1 Sports Car      no       $0        0      No       0
## 9   $11,200   1      z_SUV      no       $0        0      No       0
## 10  $18,510   7        Van      no       $0        0      No       1
##    CAR_AGE            URBANICITY
## 1       18   Highly Urban/ Urban
## 2        1   Highly Urban/ Urban
## 3       10   Highly Urban/ Urban
## 6        7   Highly Urban/ Urban
## 9        1   Highly Urban/ Urban
## 10      17 z_Highly Rural/ Rural

After performing the above two steps the total observations for the data is reduced from 8161 to 6044 total obsverations

Dealing with Categorical data

here are 10 fields that are categorical. These variables group the data into different sections. Most of the data contains 1 of 2 posibilities (YES/NO etc). This allows me to assign a value of 0 for one possibility and a 1 for another. The education and jobs fields were a little different. They have more than 2 possibilities. The eductaion was grouped by level of academic achievement (Masters and above = 1, below is a 0). The JOB was grouped along the same way, a college education/advanced education was grouped into one (Lwyer, Professional, Manager = 1), everybody else gets a 0. The field of CAR_TYPE is dealt the same way as well. Those who drive a Panel Truck, Pickup or Sports car will be labeled with a 1 and everything else will be labeled with a 0.

The accounting data (dollar amount data) has dollar signs within the data. That means that the data will be treated as a character set. The dollar sign needs to be removed and the data needs to be changed into a number.

blue_book <- unname(sapply(data$BLUEBOOK, str_replace_all, '[,$]', ''))
blue_book <- as.numeric(blue_book)

income <- unname(sapply(data$INCOME, str_replace_all, '[,$]', ''))
income <- as.numeric(income)

home_val <- unname(sapply(data$HOME_VAL, str_replace_all, '[,$]', ''))
home_val <- as.numeric(home_val)

old_claim <- unname(sapply(data$OLDCLAIM, str_replace_all, '[,$]', ''))
old_claim <- as.numeric(old_claim)

data$BLUEBOOK <- blue_book
data$INCOME <- income
data$HOME_VAL <- home_val
data$OLDCLAIM <- old_claim

BUILD MODELS

TARGET_FLAG– This is the target variable that will tell us there was a crash or not for the given customer. If the field takes a value of 0, that means that the customer was not in an accident, or the accident was not their fault. If the field takes a value of 1, that means the customer has been in an accident, or the accident was their fault.

The first thing that we need to do is split the data up into a training set and a test set. We will be taking the data and separating it up so 70% of the data is the training set, and 30% of the data will be the testing set.

Model 1

In this model we are doing a type of stepwise function. We are taking the training set and removing some of the variables that we feel are not good predictors. We run the model against the remaining variables and look at the output. Then, we go back through and remove anymore variables that we do not feel are good predictors. That will leave us with a final function that has all the variables that we fell will make the best predicting fun ction.

## 
## Call:
## lm(formula = TARGET_FLAG ~ TARGET_AMT + PARENT1 + HOME_VAL + 
##     MSTATUS + TRAVTIME + CAR_USE + REVOKED + CAR_AGE + URBANICITY, 
##     data = training_2a)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.2471 -0.2202 -0.1153  0.1558  1.0162 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.520e-01  2.759e-02   9.132  < 2e-16 ***
## TARGET_AMT   4.626e-05  1.229e-06  37.659  < 2e-16 ***
## PARENT1     -1.316e-01  1.800e-02  -7.313 3.11e-13 ***
## HOME_VAL    -4.424e-07  5.151e-08  -8.589  < 2e-16 ***
## MSTATUS     -2.803e-02  1.386e-02  -2.022   0.0432 *  
## TRAVTIME     1.739e-03  3.455e-04   5.035 4.97e-07 ***
## CAR_USE     -1.011e-01  1.163e-02  -8.694  < 2e-16 ***
## REVOKED      1.094e-01  1.648e-02   6.636 3.64e-11 ***
## CAR_AGE     -4.162e-03  1.023e-03  -4.070 4.79e-05 ***
## URBANICITY   2.111e-01  1.370e-02  15.406  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3497 on 4221 degrees of freedom
## Multiple R-squared:  0.375,  Adjusted R-squared:  0.3737 
## F-statistic: 281.4 on 9 and 4221 DF,  p-value: < 2.2e-16

Model 2

In this model, we are untilizing a backwards approach into solving for the overall model. The backwards approach to variable selection starts off withh all variables in the model. I then starts to remove fields, until it gets to a point where removing anymore fields will not be beneficial to the model. That is the point when the final model is found.

To solve for the TARGET_FLAG field, I will be using a probit function. This function is very useful when there are only two possible outcomes for the field that you are trying to predict. This model utilizes backwards selection when picking the variables for the model. It starts out all of the variables that are possible. It then starts to remove variables until it reaches the optimal solution for the function.

## 
## Call:
## glm(formula = TARGET_FLAG ~ KIDSDRIV + HOMEKIDS + INCOME + PARENT1 + 
##     HOME_VAL + MSTATUS + JOB + TRAVTIME + CAR_USE + BLUEBOOK + 
##     TIF + CAR_TYPE + RED_CAR + OLDCLAIM + CLM_FREQ + REVOKED + 
##     MVR_PTS + CAR_AGE + URBANICITY, family = binomial(link = "probit"), 
##     data = train1)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3638  -0.7404  -0.4092   0.6599   3.1440  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -9.934e-01  1.587e-01  -6.259 3.87e-10 ***
## KIDSDRIV     1.256e-01  4.850e-02   2.590 0.009605 ** 
## HOMEKIDS     6.009e-02  2.619e-02   2.295 0.021761 *  
## INCOME      -3.514e-06  8.070e-07  -4.354 1.33e-05 ***
## PARENT1     -2.416e-01  8.701e-02  -2.777 0.005485 ** 
## HOME_VAL    -9.239e-07  2.881e-07  -3.207 0.001340 ** 
## MSTATUS      1.921e-01  6.858e-02   2.801 0.005094 ** 
## JOB         -1.839e-01  5.078e-02  -3.621 0.000293 ***
## TRAVTIME     7.972e-03  1.496e-03   5.328 9.96e-08 ***
## CAR_USE     -4.639e-01  5.253e-02  -8.832  < 2e-16 ***
## BLUEBOOK    -1.531e-05  3.240e-06  -4.726 2.29e-06 ***
## TIF         -3.029e-02  5.795e-03  -5.226 1.73e-07 ***
## CAR_TYPE     1.620e-01  4.987e-02   3.248 0.001164 ** 
## RED_CAR     -2.005e-01  5.342e-02  -3.753 0.000175 ***
## OLDCLAIM    -1.050e-05  3.199e-06  -3.282 0.001032 ** 
## CLM_FREQ     1.430e-01  2.299e-02   6.221 4.93e-10 ***
## REVOKED      5.124e-01  7.443e-02   6.884 5.80e-12 ***
## MVR_PTS      7.396e-02  1.096e-02   6.750 1.48e-11 ***
## CAR_AGE     -8.894e-03  4.813e-03  -1.848 0.064604 .  
## URBANICITY   1.189e+00  7.566e-02  15.713  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4900.6  on 4230  degrees of freedom
## Residual deviance: 3836.3  on 4211  degrees of freedom
## AIC: 3876.3
## 
## Number of Fisher Scoring iterations: 5

Model 3

Here again using probit function, just like during the above backwards function. This function goes the opposite way as the backward function. It starts with a plain function and adds variables until it gets to the optimal solution. Once it cannot add variables to make the equation better, it stops and that is the final output

## 
## Call:
## glm(formula = TARGET_FLAG ~ URBANICITY + INCOME + MVR_PTS + CAR_USE + 
##     PARENT1 + CLM_FREQ + REVOKED + TRAVTIME + TIF + BLUEBOOK + 
##     HOME_VAL + JOB + KIDSDRIV + RED_CAR + OLDCLAIM + CAR_TYPE + 
##     MSTATUS + HOMEKIDS + CAR_AGE, family = binomial(link = "probit"), 
##     data = train1)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3638  -0.7404  -0.4092   0.6599   3.1440  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -9.934e-01  1.587e-01  -6.259 3.87e-10 ***
## URBANICITY   1.189e+00  7.566e-02  15.713  < 2e-16 ***
## INCOME      -3.514e-06  8.070e-07  -4.354 1.33e-05 ***
## MVR_PTS      7.396e-02  1.096e-02   6.750 1.48e-11 ***
## CAR_USE     -4.639e-01  5.253e-02  -8.832  < 2e-16 ***
## PARENT1     -2.416e-01  8.701e-02  -2.777 0.005485 ** 
## CLM_FREQ     1.430e-01  2.299e-02   6.221 4.93e-10 ***
## REVOKED      5.124e-01  7.443e-02   6.884 5.80e-12 ***
## TRAVTIME     7.972e-03  1.496e-03   5.328 9.96e-08 ***
## TIF         -3.029e-02  5.795e-03  -5.226 1.73e-07 ***
## BLUEBOOK    -1.531e-05  3.240e-06  -4.726 2.29e-06 ***
## HOME_VAL    -9.239e-07  2.881e-07  -3.207 0.001340 ** 
## JOB         -1.839e-01  5.078e-02  -3.621 0.000293 ***
## KIDSDRIV     1.256e-01  4.850e-02   2.590 0.009605 ** 
## RED_CAR     -2.005e-01  5.342e-02  -3.753 0.000175 ***
## OLDCLAIM    -1.050e-05  3.199e-06  -3.282 0.001032 ** 
## CAR_TYPE     1.620e-01  4.987e-02   3.248 0.001164 ** 
## MSTATUS      1.921e-01  6.858e-02   2.801 0.005094 ** 
## HOMEKIDS     6.009e-02  2.619e-02   2.295 0.021761 *  
## CAR_AGE     -8.894e-03  4.813e-03  -1.848 0.064604 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4900.6  on 4230  degrees of freedom
## Residual deviance: 3836.3  on 4211  degrees of freedom
## AIC: 3876.3
## 
## Number of Fisher Scoring iterations: 5

TARGET_AMT

This is the target field that says wether the customer had to pay some amount after an accident. This field will only have a value if the TARGET_FLAG field has a 1. If the TARGET_FLAG field is a 0, then this field will be 0 as well. The first thing that we have to do is re-pick the training set. We do not want to use the exact same training set as before, because it is the same data and we really are not changing anything from the first models. We will be using a 70/30 split just like before.

Model 1

Using here stepwise function. In this firstwe select a set of fields from the training set and use that as a base model. We take a look at that model and see what fields should be kept and what fields whould be removed. We remove the fields that we feel are not well correlated and get the final model.

## 
## Call:
## lm(formula = TARGET_AMT ~ HOME_VAL + MSTATUS + CAR_USE + URBANICITY, 
##     data = training_2a)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -6343  -3020  -1409    303  79777 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.964e+03  1.009e+03   4.918 1.01e-06 ***
## HOME_VAL     2.329e-03  2.196e-03   1.061  0.28902    
## MSTATUS      1.465e+03  5.015e+02   2.920  0.00357 ** 
## CAR_USE     -6.579e+02  4.419e+02  -1.489  0.13684    
## URBANICITY  -2.427e+01  9.044e+02  -0.027  0.97860    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7349 on 1116 degrees of freedom
## Multiple R-squared:  0.009452,   Adjusted R-squared:  0.005902 
## F-statistic: 2.662 on 4 and 1116 DF,  p-value: 0.03134

Model 2

Forward selection model - This model takes a “blank” equation and starts to add variables until it finds the optimal solution for the model. It is a very similar process to the first model.

The outout for the model is as follows:

## 
## Call:
## lm(formula = TARGET_AMT ~ BLUEBOOK + MSTATUS + MVR_PTS + CAR_AGE, 
##     data = train2)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -8581  -3060  -1384    535  78019 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 3191.29573  595.00499   5.363 9.92e-08 ***
## BLUEBOOK       0.13152    0.02815   4.672 3.35e-06 ***
## MSTATUS     1231.68260  436.22104   2.824  0.00483 ** 
## MVR_PTS      137.27929   82.39298   1.666  0.09596 .  
## CAR_AGE      -58.15581   41.15696  -1.413  0.15793    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7280 on 1116 degrees of freedom
## Multiple R-squared:  0.02783,    Adjusted R-squared:  0.02434 
## F-statistic: 7.986 on 4 and 1116 DF,  p-value: 2.393e-06

comparing the two models here We can see by the summary statistics of both models, that Model 1’2 coefficients are are statistically significant (below .05 confidence), while model 1 seems to have a few variables that are not significant at all, but the R squared value for model 1 is lower than the R squared of the second model.

Select Models

Now that all of the models have been created and predicted, it is time to pick and choose which are the best. We will pick the best model for TARGET_FLAG and the best model for the TARGET_AMT field. The ROC curve comapres the sensitivity of the model with the specificity of the model. It bascially give the performance of the model. With that curve, we can cacluate the AUC (Area Under the Curve). The higher this number is, the better the performance of the model. We can see that model 1 is still the best choice from all three models.

train3 <- cbind(train1 , answer1a[1:4231], answer, answer2)

rc1 <- roc(factor(TARGET_FLAG) ~ answer1a[1:4231], data=train3)
rc2 <- roc(factor(TARGET_FLAG) ~ answer, data=train3)
rc3 <- roc(factor(TARGET_FLAG) ~ answer2, data=train3)

plot(rc1,main='Model 1 - ROC Curve')

plot(rc2,main='Model 2 - ROC Curve')

plot(rc3,main='Model 3 - ROC Curve')

model <- c('Model 1', 'Model 2', 'Model 3')
area <- c(auc(train1$TARGET_FLAG, answer1a),auc(train1$TARGET_FLAG, answer),auc(train1$TARGET_FLAG, answer2))
df <- data.frame(Model=model,AUC=area)
df

##     Model       AUC
## 1 Model 1 0.7306737
## 2 Model 2 0.6639863
## 3 Model 3 0.6639863

We first check the summary stats with the two models. The first this we check is the MSE (Mean Squared Error). This is the mean of the residuals (actual - predicted) squared. It is a good way to see how accurate your model is. A smaller MSE is always good. The next things is the R squared. This is usually called the goodness of fit. The higher the R squared value the better the model is. The last thing is the F-Stat. It is most often used when comparing statistical models that have been fitted to a data set, in order to identify the model that best fits the population from which the data were sampled. three of these parameters are showing that the best model to use would be model 2. It has highest R squared and the highest F-stat (which means a lower alpha value and most statistically relevant).

From the abov reults we can say

TARGET_FLAG - MODEL 1

TARGET_AMT - MODEL 2

Model Evaluation

The final flag values display the amounts can be seen in the insurance_result.csv

## final_answer
##    0    1 
## 1696   19