
1 DATA EXPLORATION

In the pursuit of determining relationships between car crashes, their costs, and the factors that may play a role in each, a dataset containing 8,161 observations with 25 variables was explored, analyzed, and modeled. The data came from an auto insurance company, with each observation representing one of its customers. Of the 25 variables, two are target variables (car crashes and crash costs) and the other 23 are predictors. TARGET_FLAG is a binary variable where a value of 1 indicates that the customer has made a claim related to a car crash and a value of 0 indicates they have not. The other target variable, TARGET_AMT, is a continuous numerical variable whose value is the payout amount of a claim, if any. The remaining 23 variables split into 13 categorical and 10 numerical predictors.

This data was utilized to compose and evaluate several types of models with the following features:

  • Logistic classification models that aim to predict the probability that a person crashes their car; and,
  • Multiple linear regression models that aim to predict the amount of money it will cost if the person does crash their car.

The intended use case for these models is actuarial in nature: specifically, to calculate insurance rates commensurate with policyholders’ (or policy applicants’) potential risk levels based on attributes such as income, age, distance to work, and tenure as customers.

Data Dictionary
VARIABLE     DEFINITION                                                         TYPE
TARGET_FLAG  car crash = 1, no car crash = 0                                    binary categorical response
TARGET_AMT   car crash cost > 0, no car crash = 0                               continuous numerical response
AGE          driver’s age - very young/old tend to be risky                     continuous numerical predictor
BLUEBOOK     $ value of vehicle                                                 continuous numerical predictor
CAR_AGE      age of vehicle                                                     continuous numerical predictor
CAR_TYPE     type of car (6 types)                                              categorical predictor
CAR_USE      usage of car (commercial/private)                                  binary categorical predictor
CLM_FREQ     number of claims in the past 5 years                               discrete numerical predictor
EDUCATION    max education level (5 types)                                      categorical predictor
HOMEKIDS     number of children at home                                         discrete numerical predictor
HOME_VAL     $ home value - home owners tend to drive more responsibly          continuous numerical predictor
INCOME       $ income - rich people tend to get into fewer crashes              continuous numerical predictor
JOB          job category (8 types, 1 missing) - white collar tend to be safer  categorical predictor
KIDSDRIV     number of driving children - teenagers more likely to crash        discrete numerical predictor
MSTATUS      marital status - married people drive more safely                  binary categorical predictor
MVR_PTS      number of traffic tickets                                          discrete numerical predictor
OLDCLAIM     $ total claims in the past 5 years                                 continuous numerical predictor
PARENT1      single parent                                                      binary categorical predictor
RED_CAR      a red car                                                          binary categorical predictor
REVOKED      license revoked (past 7 years) - more risky driver                 binary categorical predictor
SEX          gender - women may have fewer crashes than men                     binary categorical predictor
TIF          time in force - number of years as a customer                      continuous numerical predictor
TRAVTIME     distance to work                                                   continuous numerical predictor
URBANICITY   urban/rural                                                        binary categorical predictor
YOJ          years on job - longer tenure suggests safer driving                continuous numerical predictor

Inspection of the target variables reveals that where TARGET_FLAG is 0 (i.e., no claim), TARGET_AMT is also 0 (i.e., no payout), which is logically consistent. Additionally, even when TARGET_FLAG is coded 1, indicating that a claim has been made, TARGET_AMT sometimes still has a value of zero, indicating that not all claims result in a payout.
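This can be verified in one pass (train.raw as loaded in the appendix):

sum(train.raw$TARGET_AMT[train.raw$TARGET_FLAG == 0] != 0)  # payouts without a claim: 0
sum(train.raw$TARGET_AMT[train.raw$TARGET_FLAG == 1] == 0)  # claims with a zero payout: > 0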

1.1 Summary Statistics

Continuous and categorical variables were summarized separately for the sake of clarity.

Summary statistics
n min mean median max sd
TARGET_AMT 8161 0 1504.3 0 107586 4704.0
AGE 8155 16 44.8 45 81 8.6
YOJ 7707 0 10.5 11 23 4.1
INCOME 7716 0 61898.1 54028 367030 47572.7
HOME_VAL 7697 0 154867.3 161160 885282 129123.8
TRAVTIME 8161 5 33.5 33 142 15.9
BLUEBOOK 8161 1500 15709.9 14440 69740 8419.7
TIF 8161 1 5.3 4 25 4.2
OLDCLAIM 8161 0 4037.1 0 57037 8777.1
MVR_PTS 8161 0 1.7 1 13 2.1
CAR_AGE 7651 0 8.3 8 28 5.7

EDUCATION, JOB, CAR_TYPE, KIDSDRIV, HOMEKIDS, and CLM_FREQ each comprise multiple categories, while PARENT1, SEX, MSTATUS, CAR_USE, RED_CAR, REVOKED, and URBANICITY are all binary.

Summary statistics for Categorical Variables
EDUCATION          JOB                CAR_TYPE           KIDSDRIV  HOMEKIDS  CLM_FREQ
<High School:1203  Blue Collar :1825  Minivan    :2145   0:7180    0:5289    0:5009
Bachelors   :2242  Clerical    :1271  Panel Truck: 676   1: 636    1: 902    1: 997
Masters     :1658  Professional:1117  Pickup     :1389   2: 279    2:1118    2:1171
PhD         : 728  Manager     : 988  Sports Car : 907   3:  62    3: 674    3: 776
High School :2330  Lawyer      : 835  Van        : 750   4:   4    4: 164    4: 190
                   Student     : 712  SUV        :2294             5:  14    5:  18
                   (Other)     :1413
Summary statistics for Binary Categorical Variables
PARENT1   SEX      MSTATUS   CAR_USE          RED_CAR   REVOKED   URBANICITY
No :7084  M:3786   Yes:4894  Commercial:3029  no :5783  No :7161  Urban:6492
Yes:1077  F:4375   No :3267  Private   :5132  yes:2378  Yes:1000  Rural:1669

1.2 Variable Descriptions

1.2.0.1 KIDSDRIV

KIDSDRIV counts a policyholder’s driving children, with values ranging from 0 to 4. It shows heavy skew, with most cars having no kid drivers (value of 0). Judging from the distribution, having a kid driver appears to be associated with a higher probability of making a claim. Although it was imported as a numeric variable, due to the small range of discrete values it was transformed into a categorical variable with five ordered levels, as shown below.
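The conversion, taken from the appendix, covers all three low-cardinality counts (KIDSDRIV, HOMEKIDS, and CLM_FREQ):

# Convert the low-cardinality count variables to factors
train.raw[, c('CLM_FREQ','HOMEKIDS','KIDSDRIV')] <-
    lapply(train.raw[, c('CLM_FREQ','HOMEKIDS','KIDSDRIV')], as.factor)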

1.2.0.2 AGE

AGE represents the driver’s age and shows a normal distribution centered around 45 years. Looking at the boxplot of age below, there does not appear to be a difference in the distributions between policyholders who made a claim and those who did not. Accordingly, AGE may not be helpful in determining the probability of a claim.

1.2.0.3 HOMEKIDS

HOMEKIDS is a predictor describing the number of children at home ranging from 0 to 5. Although it was imported as a numeric variable, due to the small range of discrete values it was transformed into a categorical variable with six ordered levels.

1.2.0.4 YOJ

YOJ is a predictor describing the number of years on the job. People who stay at a job for a longer time are believed to be safer drivers. Apart from those who are unemployed (values of 0), YOJ seems to show a normal distribution.

1.2.0.5 INCOME

INCOME is a heavily skewed predictor variable, suggesting that outliers should be treated before modeling.

1.2.0.6 HOME_VAL

HOME_VAL is a home value predictor variable. In theory, home owners tend to drive more responsibly. The difference between owners and renters (values of 0) is visible in the summary statistics graph.

1.2.0.7 TRAVTIME

TRAVTIME is a predictor variable describing the distance to work. Long drives to work would suggest greater risk of an accident and claim. However, its graph shows a fairly normal distribution with a slightly longer right tail and little difference between the two target values, such that this variable may not be helpful in determining the probability of making a claim.

1.2.0.8 BLUEBOOK

BLUEBOOK is a predictor variable describing the value of the car. The boxplot demonstrates that the lower the value of the car, the higher the chance of making a claim. It is conceivable that higher-priced cars are driven more carefully.

1.2.0.9 TIF

TIF describes how long the customer has been with the insurance company. Plots reveal that the longer the tenure of a policyholder, the lower the likelihood of a claim - i.e. safe drivers tend to remain so.

1.2.0.10 OLDCLAIM

OLDCLAIM is a predictor describing the value of claims made in the past 5 years. It is very heavily skewed as most policyholders do not make claims.

1.2.0.11 CLM_FREQ

CLM_FREQ is a predictor that describes the frequency of claims in the past 5 years. It suggests that those who have made a claim in the past 5 years are more likely to make another claim. Although it was imported as a numeric variable, due to the small range of discrete values it was transformed into a categorical variable with six ordered levels.

1.2.0.12 MVR_PTS

MVR_PTS is a predictor that describes motor vehicle record points. The rationale is that more traffic tickets suggest less safe driving and a higher likelihood of claims. It appears to be a highly significant variable, as seen in the boxplots below.

1.2.0.13 CAR_AGE

CAR_AGE describes the age of the policyholder’s vehicle. One value is -3, which must be a data-entry error; it is corrected to 0, as shown below.
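The correction, in the appendix’s data.table syntax:

train.raw[CAR_AGE == -3, CAR_AGE := 0]  # single negative vehicle age treated as 0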

1.2.0.14 PARENT1

PARENT1 indicates whether a policyholder is a single parent. This variable has been factorized.

1.2.0.15 SEX

SEX describes the gender of the driver. This variable has been factorized. It does not appear to be a significant variable in the boxplots below.

1.2.0.16 MSTATUS

MSTATUS describes the marital status of the policyholder. The rationale is that married people drive more safely. This variable has been factorized.

1.2.0.17 EDUCATION

EDUCATION describes the education level of the driver. This variable is factorized. It may be correlated with INCOME.

1.2.0.18 JOB

JOB describes the type of job the driver has. This variable is factorized. It may be correlated with INCOME. In theory policyholders with white collar jobs tend to drive more safely.

1.2.0.19 CAR_TYPE

CAR_TYPE describes type of car. This variable is factorized.

1.2.0.20 CAR_USE

CAR_USE describes how the vehicle is used. Commercial vehicles are driven more and may have an elevated probability of accidents and claims. This variable is factorized.

1.2.0.21 RED_CAR

RED_CAR indicates whether the color of the vehicle is red. Red vehicles, especially sports cars, are associated with riskier driving and likelihood of claims. This variable is factorized.

1.2.0.22 REVOKED

REVOKED describes whether a policyholder’s license has been revoked in the past 7 years. License revocation is associated with riskier driving. This variable is factorized. The boxplot reveals that policyholders who previously lost their license are more likely to file claims.

1.2.0.23 URBANICITY

URBANICITY describes whether the driver lives in an urban area or a rural area. This variable has been factorized.

1.2.1 Summary Statistics Graphs

The shapes of the continuous variables’ distributions are very similar for both target values; none of the predictors shows a very large difference between those who file claims and those who do not.

Numeric Data Distributions as a Function of TARGET_FLAG

Examining the distribution of claims across the categorical variables, claim likelihoods appear higher for drivers who are male, urban, blue collar, unmarried, or single parents, as well as for those with commercial vehicles or a revoked license.

Categorical Data Distributions as a Function of TARGET_FLAG

The scales of the continuous variables’ distributions differ considerably, making them difficult to visualize together.

Scaling the distributions by their standard deviations reveals that outliers are abundant among the continuous variables, most of all for OLDCLAIM, BLUEBOOK, TRAVTIME, and INCOME, and to a lesser extent for HOME_VAL and TIF.

Scaled Boxplots

All of the variables show varying levels of skew except YOJ and AGE, which appear the most normally distributed.

Linear relationship between each numeric predictor and the target

1.3 Linearity

Scatter plot between numeric predictors and TARGET_AMT, filtered for rows where TARGET_AMT is greater than 0

Plotted with their raw values, the numeric predictors fail to show any clear linear relationships with TARGET_AMT, except for the faintest linearity in the BLUEBOOK variable.

1.3.1 Log Transformed Data

Scatter plot between log transformed numeric predictors and the log transformed TARGET_AMT, filtered for rows where TARGET_AMT is greater than 0

In an attempt to better discern linear relationships with TARGET_AMT, all numeric predictors and TARGET_AMT underwent a log transformation. As a result, the linearity of BLUEBOOK became more apparent, but there was no obvious effect on the linearity of any of the other variables.

1.3.2 Box-Cox

Even though the linearity plots above don’t show much improvement after a log transformation, a Box-Cox plot shows that a log transformation is recommended for the TARGET_AMT.

Box-Cox Plot
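A sketch of that check, assuming train.num.a is the imputed numeric subset defined in the appendix; a profile log-likelihood peaking near \(\lambda = 0\) supports the log transformation:

library(MASS)
# Box-Cox over the rows with positive claim amounts
claims <- subset(train.num.a, TARGET_AMT > 0, select = -TARGET_FLAG)
bc <- boxcox(TARGET_AMT ~ ., data = claims, lambda = seq(-0.2, 0.2, by = 0.01))
bc$x[which.max(bc$y)]  # lambda with the highest profile log-likelihood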

1.3.3 Square Root Transformed Predictors and Log Transformed Target

A plot of each numerical predictor, square root transformed, against the log transformed TARGET_AMT (as recommended by the Box-Cox plot) still shows little improvement.

Scatter plot between square root transformed numeric predictors and the log transformed TARGET_AMT, filtered for rows where TARGET_AMT is greater than 0

1.4 Missing Data

Missing data

A number of variables are missing observations: AGE, INCOME, YOJ, HOME_VAL, and CAR_AGE. For AGE, the number is inconsequential, but the others each account for between 5% and 6% of the total. Approximately 21% of the cases are missing one of these variables, and an additional 2% are missing more than one. For this reason, we don’t suspect latent factors can account for the absences, and assume these values are missing at random and can be imputed when preparing the data for modeling.
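A quick tally (train.raw as loaded in the appendix) confirms these proportions:

na.counts <- colSums(is.na(train.raw))                      # NAs per variable
round(100 * na.counts[na.counts > 0] / nrow(train.raw), 2)  # as % of 8,161 rows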

2 DATA PREPARATION

2.1 Missing Values

To deal with missing values in INCOME, YOJ, HOME_VAL, and CAR_AGE - and, to a lesser extent, AGE - the MICE (Multivariate Imputation by Chained Equations) package was leveraged. The package assumes missing values are missing at random and creates multiple imputations (replacement values) for multivariate missing data using a method based on Fully Conditional Specification, where each incomplete variable is imputed by a separate model. The method can impute mixes of continuous, binary, unordered categorical, and ordered categorical data, and it can maintain consistency between imputations by means of passive imputation. The quality of the imputed values was inspected using multiple diagnostic plots.
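A condensed version of the imputation workflow from the appendix (the diagnostic run feeds the density plots below; the final single imputation is what the models use):

library(mice)
set.seed(123)
# Diagnostic run: two imputations, two iterations
impute.data <- mice(train, m = 2, maxit = 2, print = FALSE)
densityplot(impute.data)  # observed (blue) vs. imputed (red) densities

# AGE is handled separately with median imputation (only 6 values are missing)
train$AGE[is.na(train$AGE)] <- median(train$AGE, na.rm = TRUE)

# Final single imputation used for modeling
train <- mice::complete(mice(train, m = 1, maxit = 1, print = FALSE))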

Difference between original and imputed data

The blue lines in the density plots above represent the original distributions and the red lines include the imputed values. With the exception of AGE, the distribution of the imputed values accords with the distribution of pre-existing values, so the imputed values were used for those four variables. Since AGE is missing only 6 cases (0.07%) and displayed a strange change to its distribution under mice, it was imputed separately using the median.

Unsurprisingly, higher levels of INCOME are found with higher values of YOJ; more income also means more disposable income, which shows up as correlation with HOME_VAL and BLUEBOOK.

Additionally, MVR_PTS shows a positive correlation with OLDCLAIM.

3 BUILD MODELS

3.1 Classification Models: Models 1, 2 and 3

The first three models use the predictor variables as inputs to binary logistic regression models and interpret their contributions to predicting the likelihood of a claim. We use the drop1 and MASS::stepAIC functions to judge which variables to remove, evaluating AIC statistics as we go; a minimal sketch of this workflow follows.
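The screening loop, sketched with mod.1 (the fitted glm from the appendix):

library(MASS)
drop1(mod.1, test = "Chisq")   # AIC and LR test for deleting each term
mod.1.step <- stepAIC(mod.1, direction = "backward", trace = FALSE)
AIC(mod.1, mod.1.step)         # compare before and after the search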

3.1.1 Model 1 - Base model using categorical predictors only

TARGET_FLAG ~ PARENT1 + SEX + MSTATUS + EDUCATION + JOB + CAR_TYPE + CAR_USE + REVOKED + URBANICITY + KIDSDRIV + HOMEKIDS + CLM_FREQ

For an easily interpretable model aimed at predicting TARGET_FLAG, inputs for Model 1 were restricted to categorical variables alone. The AIC metric as well as the p-values and significance codes suggested that the RED_CAR variable could be removed, so this predictor was dropped from Model 1. Model 1 serves as a base model against which to compare the other models.

Model 1 ROC Curve

Area Under the Curve: 0.80
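The curve and AUC above can be reproduced with pROC (a sketch, assuming mod.1 and train.cat.a from the appendix):

library(pROC)
probs.1 <- predict(mod.1, type = "response")    # predicted claim probabilities
roc.1 <- roc(train.cat.a$TARGET_FLAG, probs.1)  # ROC over observed outcomes
auc(roc.1)                                      # ~0.80 for Model 1
plot(roc.1)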

3.1.1.1 Confusion Matrix

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 5471 1603
##          1  169  408
##                                               
##                Accuracy : 0.768               
##                  95% CI : (0.759, 0.778)      
##     No Information Rate : 0.737               
##     P-Value [Acc > NIR] : 0.000000000173      
##                                               
##                   Kappa : 0.224               
##  Mcnemar's Test P-Value : < 0.0000000000000002
##                                               
##             Sensitivity : 0.970               
##             Specificity : 0.203               
##          Pos Pred Value : 0.773               
##          Neg Pred Value : 0.707               
##               Precision : 0.773               
##                  Recall : 0.970               
##                      F1 : 0.861               
##              Prevalence : 0.737               
##          Detection Rate : 0.715               
##    Detection Prevalence : 0.925               
##       Balanced Accuracy : 0.586               
##                                               
##        'Positive' Class : 0                   
## 
Observations 7651
Dependent variable TARGET_FLAG
Type Generalized linear model
Family binomial
Link logit
𝛘²(37) 1764.49
Pseudo-R² (Cragg-Uhler) 0.30
Pseudo-R² (McFadden) 0.20
AIC 7125.59
BIC 7389.41
Est. S.E. z val. p VIF
(Intercept) -1.68 0.23 -7.27 0.00 NA
PARENT1Yes 0.22 0.12 1.78 0.07 2.34
SEXF -0.31 0.09 -3.38 0.00 2.36
MSTATUSNo 0.68 0.08 8.96 0.00 1.67
EDUCATIONBachelors -0.51 0.11 -4.72 0.00 7.49
EDUCATIONMasters -0.41 0.16 -2.51 0.01 7.49
EDUCATIONPhD -0.49 0.20 -2.51 0.01 7.49
EDUCATIONHigh School -0.03 0.10 -0.35 0.73 7.49
JOBClerical 0.61 0.20 3.11 0.00 14.06
JOBDoctor -0.27 0.26 -1.03 0.30 14.06
JOBHome Maker 0.72 0.20 3.63 0.00 14.06
JOBLawyer 0.16 0.17 0.91 0.36 14.06
JOBManager -0.61 0.17 -3.50 0.00 14.06
JOBProfessional 0.24 0.18 1.35 0.18 14.06
JOBStudent 0.72 0.20 3.52 0.00 14.06
JOBBlue Collar 0.43 0.19 2.28 0.02 14.06
CAR_TYPEPanel Truck 0.21 0.14 1.49 0.14 3.71
CAR_TYPEPickup 0.59 0.10 5.79 0.00 3.71
CAR_TYPESports Car 1.22 0.12 9.84 0.00 3.71
CAR_TYPEVan 0.40 0.12 3.22 0.00 3.71
CAR_TYPESUV 0.95 0.10 9.06 0.00 3.71
CAR_USEPrivate -0.74 0.09 -7.96 0.00 2.47
REVOKEDYes 0.74 0.08 9.10 0.00 1.01
URBANICITYRural -2.20 0.11 -19.34 0.00 1.11
KIDSDRIV1 0.40 0.11 3.51 0.00 1.54
KIDSDRIV2 0.69 0.16 4.20 0.00 1.54
KIDSDRIV3 0.86 0.32 2.72 0.01 1.54
KIDSDRIV4 -11.10 204.58 -0.05 0.96 1.54
HOMEKIDS1 0.34 0.11 3.01 0.00 2.65
HOMEKIDS2 0.28 0.11 2.51 0.01 2.65
HOMEKIDS3 0.27 0.13 2.09 0.04 2.65
HOMEKIDS4 0.05 0.21 0.26 0.80 2.65
HOMEKIDS5 0.03 0.69 0.05 0.96 2.65
CLM_FREQ1 0.64 0.09 7.41 0.00 1.06
CLM_FREQ2 0.68 0.08 8.51 0.00 1.06
CLM_FREQ3 0.67 0.09 7.08 0.00 1.06
CLM_FREQ4 0.97 0.17 5.61 0.00 1.06
CLM_FREQ5 1.00 0.55 1.84 0.07 1.06
Standard errors: MLE

3.1.2 Model 2 - Refined base model plus numerical predictors

TARGET_FLAG ~ MSTATUS + EDUCATION_Bachelors + JOB_Clerical + JOB_Manager + CAR_TYPE + CAR_USE + REVOKED + URBANICITY + KIDSDRIV + HOMEKIDS + CLM_FREQ + INCOME + HOME_VAL + TRAVTIME + BLUEBOOK + TIF + OLDCLAIM + MVR_PTS

Building on Model 1, Model 2 excludes the RED_CAR variable and adds all of the numerical variables to see if they add value to the model. After the initial model statistics were examined, the model was further refined by removing AGE, CAR_AGE, SEX, YOJ, and PARENT1 due to their lack of significance (high p-values). EDUCATION and JOB were only significant at specific levels (‘Bachelors’ for education; ‘Manager’ and ‘Clerical’ for job), so three new binary indicator variables, EDUCATION_Bachelors, JOB_Manager, and JOB_Clerical, were added to the dataset, as sketched below.
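The indicator variables are created directly from the factor columns (taken from the appendix):

train$EDUCATION_Bachelors <- train$EDUCATION == "Bachelors"
train$JOB_Manager  <- train$JOB == "Manager"
train$JOB_Clerical <- train$JOB == "Clerical"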

Model 2 ROC Curve

Area Under the Curve: 0.81

3.1.2.1 Confusion Matrix

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 5455 1505
##          1  185  506
##                                              
##                Accuracy : 0.779              
##                  95% CI : (0.77, 0.788)      
##     No Information Rate : 0.737              
##     P-Value [Acc > NIR] : <0.0000000000000002
##                                              
##                   Kappa : 0.277              
##  Mcnemar's Test P-Value : <0.0000000000000002
##                                              
##             Sensitivity : 0.967              
##             Specificity : 0.252              
##          Pos Pred Value : 0.784              
##          Neg Pred Value : 0.732              
##               Precision : 0.784              
##                  Recall : 0.967              
##                      F1 : 0.866              
##              Prevalence : 0.737              
##          Detection Rate : 0.713              
##    Detection Prevalence : 0.910              
##       Balanced Accuracy : 0.609              
##                                              
##        'Positive' Class : 0                  
## 
Observations 7651
Dependent variable TARGET_FLAG
Type Generalized linear model
Family binomial
Link logit
𝛘²(33) 1986.84
Pseudo-R² (Cragg-Uhler) 0.33
Pseudo-R² (McFadden) 0.23
AIC 6895.24
BIC 7131.28
Est. S.E. z val. p VIF
(Intercept) -0.96 0.16 -6.18 0.00 NA
MSTATUSNo 0.64 0.08 8.55 0.00 1.55
EDUCATION_BachelorsTRUE -0.27 0.07 -3.89 0.00 1.05
JOB_ClericalTRUE 0.26 0.09 3.05 0.00 1.16
JOB_ManagerTRUE -0.77 0.11 -6.79 0.00 1.08
CAR_TYPEPanel Truck 0.52 0.15 3.51 0.00 2.23
CAR_TYPEPickup 0.46 0.10 4.50 0.00 2.23
CAR_TYPESports Car 0.93 0.11 8.47 0.00 2.23
CAR_TYPEVan 0.54 0.12 4.37 0.00 2.23
CAR_TYPESUV 0.67 0.09 7.61 0.00 2.23
CAR_USEPrivate -0.90 0.07 -12.10 0.00 1.52
REVOKEDYes 0.95 0.10 9.96 0.00 1.36
URBANICITYRural -2.29 0.12 -19.72 0.00 1.13
KIDSDRIV1 0.42 0.12 3.64 0.00 1.53
KIDSDRIV2 0.71 0.17 4.25 0.00 1.53
KIDSDRIV3 0.74 0.32 2.29 0.02 1.53
KIDSDRIV4 -11.57 199.91 -0.06 0.95 1.53
HOMEKIDS1 0.42 0.10 4.36 0.00 1.60
HOMEKIDS2 0.36 0.10 3.77 0.00 1.60
HOMEKIDS3 0.36 0.12 3.08 0.00 1.60
HOMEKIDS4 0.21 0.21 1.03 0.30 1.60
HOMEKIDS5 0.25 0.70 0.36 0.72 1.60
CLM_FREQ1 0.58 0.10 5.68 0.00 1.87
CLM_FREQ2 0.67 0.10 6.85 0.00 1.87
CLM_FREQ3 0.62 0.11 5.67 0.00 1.87
CLM_FREQ4 0.81 0.18 4.50 0.00 1.87
CLM_FREQ5 1.14 0.56 2.04 0.04 1.87
INCOME -0.00 0.00 -5.97 0.00 1.84
HOME_VAL -0.00 0.00 -3.34 0.00 1.94
TRAVTIME 0.01 0.00 7.67 0.00 1.04
BLUEBOOK -0.00 0.00 -4.73 0.00 1.76
TIF -0.06 0.01 -7.39 0.00 1.01
OLDCLAIM -0.00 0.00 -4.62 0.00 1.89
MVR_PTS 0.10 0.01 6.83 0.00 1.24
Standard errors: MLE

3.1.3 Model 3 - Binary logistic model

Model 3 takes a similar approach to Model 2 by incorporating all numeric predictor variables plus the categorical predictors found to be significant in the previous model. Skewed numeric predictors (BLUEBOOK, CAR_AGE, HOME_VAL, INCOME, MVR_PTS, OLDCLAIM, TIF, and TRAVTIME) were log transformed and added to the model as additional predictors. AGE and YOJ were not included since they were already approximately normally distributed and were not significant in the previous models. The model was then refined through backward elimination (see the step() sketch after the reduced formula below).

3.1.3.1 Original Model

TARGET_FLAG ~ MSTATUS + EDUCATION_Bachelors + JOB_Clerical + JOB_Manager + CAR_TYPE + CAR_USE + REVOKED + URBANICITY + KIDSDRIV + HOMEKIDS + CLM_FREQ + BLUEBOOK + CAR_AGE + HOME_VAL + INCOME + MVR_PTS + OLDCLAIM + TIF + TRAVTIME + log(BLUEBOOK) + log(CAR_AGE+1) + log(HOME_VAL+1) + log(INCOME+1) + log(MVR_PTS+1) + log(OLDCLAIM+1) + log(TIF) + log(TRAVTIME)

3.1.3.2 Model after backward elimination process

TARGET_FLAG ~ MSTATUS + EDUCATION_Bachelors + JOB_Clerical + JOB_Manager + CAR_TYPE + CAR_USE + REVOKED + URBANICITY + KIDSDRIV + HOMEKIDS + CAR_AGE + HOME_VAL + INCOME + MVR_PTS + OLDCLAIM + log(BLUEBOOK) + log(INCOME+1) + log(OLDCLAIM+1) + log(TIF) + log(TRAVTIME)
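The reduction from the original to the final specification comes from step() (mod.4.raw is the full model’s name in the appendix):

backward.mod.4 <- step(mod.4.raw, direction = "backward", trace = FALSE)
formula(backward.mod.4)  # prints the reduced specification above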

Model 3 ROC Curve

Area Under the Curve: 0.82

3.1.3.3 Confusion Matrix

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 5449 1491
##          1  191  520
##                                              
##                Accuracy : 0.78               
##                  95% CI : (0.771, 0.789)     
##     No Information Rate : 0.737              
##     P-Value [Acc > NIR] : <0.0000000000000002
##                                              
##                   Kappa : 0.284              
##  Mcnemar's Test P-Value : <0.0000000000000002
##                                              
##             Sensitivity : 0.966              
##             Specificity : 0.259              
##          Pos Pred Value : 0.785              
##          Neg Pred Value : 0.731              
##               Precision : 0.785              
##                  Recall : 0.966              
##                      F1 : 0.866              
##              Prevalence : 0.737              
##          Detection Rate : 0.712              
##    Detection Prevalence : 0.907              
##       Balanced Accuracy : 0.612              
##                                              
##        'Positive' Class : 0                  
## 
Observations 7651
Dependent variable TARGET_FLAG
Type Generalized linear model
Family binomial
Link logit
𝛘²(31) 2017.05
Pseudo-R² (Cragg-Uhler) 0.34
Pseudo-R² (McFadden) 0.23
AIC 6861.02
BIC 7083.19
Est. S.E. z val. p VIF
(Intercept) 1.13 0.57 1.96 0.05 NA
MSTATUSNo 0.66 0.08 8.79 0.00 1.55
EDUCATION_BachelorsTRUE -0.26 0.07 -3.61 0.00 1.05
JOB_ClericalTRUE 0.28 0.09 3.01 0.00 1.30
JOB_ManagerTRUE -0.73 0.11 -6.49 0.00 1.08
CAR_TYPEPanel Truck 0.45 0.14 3.17 0.00 1.96
CAR_TYPEPickup 0.46 0.10 4.56 0.00 1.96
CAR_TYPESports Car 0.90 0.11 8.12 0.00 1.96
CAR_TYPEVan 0.55 0.12 4.46 0.00 1.96
CAR_TYPESUV 0.67 0.09 7.64 0.00 1.96
CAR_USEPrivate -0.89 0.08 -11.69 0.00 1.57
REVOKEDYes 0.97 0.10 10.10 0.00 1.37
URBANICITYRural -2.31 0.12 -19.87 0.00 1.13
KIDSDRIV1 0.44 0.12 3.73 0.00 1.53
KIDSDRIV2 0.72 0.17 4.30 0.00 1.53
KIDSDRIV3 0.76 0.33 2.32 0.02 1.53
KIDSDRIV4 -11.62 190.55 -0.06 0.95 1.53
HOMEKIDS1 0.41 0.10 4.17 0.00 1.61
HOMEKIDS2 0.33 0.10 3.44 0.00 1.61
HOMEKIDS3 0.34 0.12 2.84 0.00 1.61
HOMEKIDS4 0.15 0.21 0.72 0.47 1.61
HOMEKIDS5 0.21 0.70 0.30 0.76 1.61
CAR_AGE -0.02 0.01 -3.15 0.00 1.30
HOME_VAL -0.00 0.00 -3.05 0.00 1.97
INCOME -0.00 0.00 -3.08 0.00 2.57
MVR_PTS 0.10 0.01 6.74 0.00 1.24
OLDCLAIM -0.00 0.00 -5.59 0.00 2.33
log(BLUEBOOK) -0.32 0.06 -5.51 0.00 1.48
log(INCOME + 1) -0.03 0.01 -2.82 0.00 1.75
log(OLDCLAIM + 1) 0.08 0.01 8.20 0.00 2.22
log(TIF) -0.23 0.03 -7.26 0.00 1.01
log(TRAVTIME) 0.41 0.05 7.60 0.00 1.03
Standard errors: MLE

3.2 Regression Models: Models 4, 5, and 6

The next three models are multiple linear regression models aimed at predicting the value of claims. They take different approaches, including constraining the cases based on TARGET_FLAG (i.e., on whether or not a claim was filed) and different approaches to selecting explanatory variables.

Twenty-three rows where TARGET_AMT was greater than $45,000 were removed; these rows had a BLUEBOOK value far below the car crash cost. A new variable, mileage, was derived from TRAVTIME and CAR_AGE, as sketched below.
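The filtering and the derived variable, as coded in the appendix (the tiny offset, written here as 1e-25, keeps log(mileage) finite when CAR_AGE is 0):

train_5 <- train %>%
    filter(TARGET_FLAG == 1) %>%
    filter(TARGET_AMT < 45000) %>%
    filter(CAR_AGE >= 0)
train_5$mileage <- train_5$TRAVTIME * (train_5$CAR_AGE + 1e-25) * 440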

3.2.1 Model 4 - Multiple linear regression model

Model 4 is a multiple linear regression model built only on cases with claims, where TARGET_FLAG equals 1. The model is refined using stepwise elimination. From the model summary it can be observed that the Adjusted R-squared value is very low, at 0.01.

TARGET_AMT ~ KIDSDRIV + log(AGE) + AGE + HOMEKIDS + YOJ + log(INCOME + 0.00000000000001) + INCOME + CAR_AGE + log(mileage) + log(BLUEBOOK) + BLUEBOOK + TIF + log(OLDCLAIM + 0.00000000000001) + OLDCLAIM + CLM_FREQ + MVR_PTS + CAR_AGE + PARENT1 + SEX + EDUCATION_Bachelors + JOB_Clerical + JOB_Manager + CAR_TYPE + REVOKED + URBANICITY + MSTATUS + CAR_USE

Model 4 Diagnostic Plots

Observations 1988
Dependent variable TARGET_AMT
Type OLS linear regression
F(40,1947) 1.44
R² 0.03
Adj. R² 0.01
Est. S.E. t val. p VIF
(Intercept) 13448.22 14624.63 0.92 0.36 NA
KIDSDRIV1 148.33 441.87 0.34 0.74 1.93
KIDSDRIV2 468.70 582.65 0.80 0.42 1.93
KIDSDRIV3 -295.77 1051.99 -0.28 0.78 1.93
log(AGE) -4742.81 3434.47 -1.38 0.17 48.66
AGE 127.15 82.37 1.54 0.12 47.96
HOMEKIDS1 259.56 474.76 0.55 0.58 4.39
HOMEKIDS2 222.39 467.48 0.48 0.63 4.39
HOMEKIDS3 368.76 528.90 0.70 0.49 4.39
HOMEKIDS4 1194.72 821.44 1.45 0.15 4.39
HOMEKIDS5 989.90 2604.18 0.38 0.70 4.39
YOJ -47.64 46.47 -1.03 0.31 3.34
log(INCOME + 0.00000000000001) 23.39 15.68 1.49 0.14 3.72
INCOME -0.00 0.00 -0.44 0.66 2.02
CAR_AGE -22.19 28.86 -0.77 0.44 1.97
log(mileage) 96.73 78.94 1.23 0.22 1.59
log(BLUEBOOK) 995.28 468.43 2.12 0.03 7.38
BLUEBOOK -0.01 0.04 -0.17 0.86 10.05
TIF -23.70 29.33 -0.81 0.42 1.03
log(OLDCLAIM + 0.00000000000001) 190.41 331.22 0.57 0.57 3456.36
OLDCLAIM 0.01 0.03 0.33 0.74 7.19
CLM_FREQ1 -7885.99 13329.63 -0.59 0.55 3529.74
CLM_FREQ2 -7845.02 13327.05 -0.59 0.56 3529.74
CLM_FREQ3 -8190.38 13330.17 -0.61 0.54 3529.74
CLM_FREQ4 -9167.22 13325.73 -0.69 0.49 3529.74
CLM_FREQ5 -8837.38 13521.19 -0.65 0.51 3529.74
MVR_PTS 74.09 48.78 1.52 0.13 1.21
PARENT1Yes -163.71 462.82 -0.35 0.72 2.84
SEXF -446.64 413.06 -1.08 0.28 3.25
EDUCATION_BachelorsTRUE -249.72 275.20 -0.91 0.36 1.06
JOB_ClericalTRUE -360.21 345.03 -1.04 0.30 1.33
JOB_ManagerTRUE -557.34 504.60 -1.10 0.27 1.12
CAR_TYPEPanel Truck 185.48 658.16 0.28 0.78 7.50
CAR_TYPEPickup 419.08 404.82 1.04 0.30 7.50
CAR_TYPESports Car 662.50 518.57 1.28 0.20 7.50
CAR_TYPEVan -54.23 520.39 -0.10 0.92 7.50
CAR_TYPESUV 639.50 461.09 1.39 0.17 7.50
REVOKEDYes -573.09 361.22 -1.59 0.11 1.65
URBANICITYRural -488.03 519.51 -0.94 0.35 1.04
MSTATUSNo 759.76 320.47 2.37 0.02 1.98
CAR_USEPrivate -80.55 291.64 -0.28 0.78 1.64
Standard errors: OLS

3.2.2 Model 5 - Multiple linear regression model

TARGET_AMT ~ TARGET_FLAG + log(BLUEBOOK) + MVR_PTS + MSTATUS

Model 5 is a multiple linear regression model built on all cases; in other words, it relaxes the constraint that a claim was filed, and so includes TARGET_AMT values of 0. Backward elimination was used to refine variable selection. The Adjusted R-squared value improved markedly compared to the previous model.

Model 5 Diagnostic Plots

Observations 7628
Dependent variable TARGET_AMT
Type OLS linear regression
F(4,7623) 1428.48
R² 0.43
Adj. R² 0.43
Est. S.E. t val. p VIF
(Intercept) -2401.15 456.75 -5.26 0.00 NA
TARGET_FLAG 5085.74 70.42 72.22 0.00 1.08
log(BLUEBOOK) 241.38 47.66 5.06 0.00 1.01
MVR_PTS 29.51 14.23 2.07 0.04 1.05
MSTATUSNo 158.48 61.28 2.59 0.01 1.02
Standard errors: OLS

3.2.3 Model 6 - Log transformed target as well as predictors

Model 6 is almost the same as Model 5 but does not include TARGET_FLAG as a predictor, since when making predictions about new customers we would not have this information. TARGET_AMT is log transformed, as suggested by the Box-Cox plot, along with the other skewed numeric predictors. Because the target is modeled on the log scale, predictions must be back-transformed to dollars; see the sketch after the formula below.

log(TARGET_AMT + 1) ~ MSTATUS + EDUCATION_Bachelors + JOB_Clerical + JOB_Manager + PARENT1 + CAR_TYPE + CAR_USE + REVOKED + URBANICITY + KIDSDRIV + CAR_AGE + INCOME + MVR_PTS + OLDCLAIM + log(BLUEBOOK) + log(HOME_VAL + 1) + log(INCOME + 1) + log(MVR_PTS + 1) + log(OLDCLAIM + 1) + log(TIF) + log(TRAVTIME)
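A naive back-transformation (mod.7 is this model’s name in the appendix; this sketch ignores retransformation bias):

pred.log <- predict(mod.7, newdata = train)  # predictions of log(TARGET_AMT + 1)
pred.usd <- exp(pred.log) - 1                # back to the dollar scale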

Model 6 Diagnostic Plots

Observations 7651
Dependent variable log(TARGET_AMT + 1)
Type OLS linear regression
F(28,7622) 80.99
R² 0.23
Adj. R² 0.23
Est. S.E. t val. p VIF
(Intercept) 5.19 0.71 7.32 0.00 NA
MSTATUSNo 0.54 0.10 5.17 0.00 1.93
EDUCATION_BachelorsTRUE -0.31 0.08 -3.66 0.00 1.05
JOB_ClericalTRUE 0.40 0.12 3.44 0.00 1.29
JOB_ManagerTRUE -0.88 0.12 -7.31 0.00 1.14
PARENT1Yes 0.70 0.13 5.34 0.00 1.43
CAR_TYPEPanel Truck 0.33 0.17 1.92 0.06 1.94
CAR_TYPEPickup 0.47 0.12 3.84 0.00 1.94
CAR_TYPESports Car 1.01 0.14 7.46 0.00 1.94
CAR_TYPEVan 0.51 0.15 3.45 0.00 1.94
CAR_TYPESUV 0.74 0.10 7.26 0.00 1.94
CAR_USEPrivate -1.17 0.10 -12.17 0.00 1.58
REVOKEDYes 1.33 0.13 10.38 0.00 1.31
URBANICITYRural -2.33 0.10 -22.88 0.00 1.24
KIDSDRIV1 0.69 0.14 4.81 0.00 1.11
KIDSDRIV2 1.07 0.21 5.19 0.00 1.11
KIDSDRIV3 1.05 0.43 2.44 0.01 1.11
KIDSDRIV4 -2.48 2.29 -1.08 0.28 1.11
CAR_AGE -0.03 0.01 -3.72 0.00 1.29
INCOME -0.00 0.00 -4.68 0.00 2.10
MVR_PTS 0.26 0.06 4.60 0.00 11.18
OLDCLAIM -0.00 0.00 -5.32 0.00 2.37
log(BLUEBOOK) -0.32 0.07 -4.59 0.00 1.45
log(HOME_VAL + 1) -0.03 0.01 -3.30 0.00 1.70
log(INCOME + 1) -0.05 0.02 -2.89 0.00 1.74
log(MVR_PTS + 1) -0.30 0.17 -1.83 0.07 10.84
log(OLDCLAIM + 1) 0.10 0.01 7.72 0.00 2.49
log(TIF) -0.29 0.04 -7.41 0.00 1.01
log(TRAVTIME) 0.46 0.06 7.34 0.00 1.03
Standard errors: OLS

4 SELECT MODELS

Confusion Matrix Summary Statistics
Sensitivity Specificity Precision Recall F1
Model.1 0.97 0.20 0.77 0.97 0.86
Model.2 0.97 0.25 0.78 0.97 0.87
Model.3 0.97 0.26 0.79 0.97 0.87

4.1 Binary Logistic Regression

Pseudo R2
llh llhNull G2 McFadden r2ML r2CU
Model.1 -3525 -4407 1764 0.20 0.21 0.30
Model.2 -3414 -4407 1987 0.23 0.23 0.33
Model.3 -3399 -4407 2017 0.23 0.23 0.34

There is no \(R^2\) for logistic regression to evaluate further; however, there are alternatives, collectively called pseudo \(R^2\) measures, that can be used instead. The table above reports several kinds of pseudo \(R^2\) values. McFadden’s was settled on as the primary criterion because it excels at comparing various models fit to the same data. Of the models presented, Models 2 and 3 tie for the highest McFadden’s pseudo \(R^2\) with a value of 0.23. Cragg & Uhler’s pseudo \(R^2\) was examined as a secondary criterion and makes Model 3 the clear winner with a value of 0.34. The comparison table is reproduced by the sketch below.
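The table is produced with pscl::pR2, as in the appendix (mod.1, mod.3, and mod.4 are the appendix names for Models 1, 2, and 3):

pseudo.r2 <- data.frame(pscl::pR2(mod.1), pscl::pR2(mod.3), pscl::pR2(mod.4))
round(t(pseudo.r2), 2)  # rows: Model.1, Model.2, Model.3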

4.2 Multiple Linear Regression

Analysis of the Adjusted \(R^2\) led to the selection of Model 5 as the best of the multiple linear regression models. While it fails to meet the 0.5 threshold mentioned in the assignment proper, it is the best of the models composed and scrutinized. Adjusted \(R^2\) was chosen as the metric because it measures the proportion of variance explained while penalizing the addition of predictors that do not improve the fit.
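For reference, with \(n\) observations and \(p\) predictors,

\[ R^2_{adj} = 1 - \left(1 - R^2\right)\frac{n - 1}{n - p - 1} \]

so an added predictor raises adjusted \(R^2\) only when it improves the fit by more than chance alone would.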

5 Predictions

TARGET_AMT Predictions
n min mean median max sd
TARGET_AMT_Mod5 1798 -635.9 490.0 31.8 5525.8 1527.1
TARGET_AMT_Mod6 1798 -2.6 2.2 2.3 7.9 1.8
TARGET_FLAG Predictions
0    1624
1     174
NA's  343

The predictions show that there are still underlying problems with the models that need to be resolved. All three models produced predictions whose distributions are not on par with those in the original training dataset, indicating that they are not a good fit for the data; the negative minimum predicted claim amounts are one obvious symptom (note that Model 6’s predictions are reported on the log scale). This is evidenced in the histograms below as well as in the summary statistics above.

6 Appendix

The appendix is available as the script.R file in the project4_insurance folder.

https://github.com/betsyrosalen/DATA_621_Business_Analyt_and_Data_Mining

# Proj 4
# DATA EXPLORATION <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

if (!require('car')) (install.packages('car'))
if (!require('caret')) (install.packages('caret'))
if (!require('corrplot')) (install.packages('corrplot'))
if (!require('data.table')) (install.packages('data.table'))
if (!require('dplyr')) (install.packages('dplyr'))
if (!require('DataExplorer')) (install.packages('DataExplorer'))
if (!require('faraway')) (install.packages('faraway'))
#if (!require('fastDummies')) (install.packages('fastDummies'))
if (!require('gridExtra')) (install.packages('gridExtra'))
if (!require('ggfortify')) (install.packages('ggfortify'))
if (!require('ggplot2')) (install.packages('ggplot2'))
if (!require('GGally')) (install.packages('GGally'))
if (!require('huxtable')) (install.packages('huxtable'))
if (!require('jtools')) (install.packages('jtools'))
if (!require('kableExtra')) (install.packages('kableExtra'))
if (!require('MASS')) (install.packages('MASS'))
if (!require('mice')) (install.packages('mice'))
if (!require('plyr')) (install.packages('plyr'))
if (!require('psych')) (install.packages('psych'))
if (!require('pROC')) (install.packages('pROC'))
if (!require('pscl')) (install.packages('pscl'))
if (!require('tidyverse')) (install.packages('tidyverse'))
if (!require('tidyr')) (install.packages('tidyr'))


# load data
train.raw <- read.csv ('https://raw.githubusercontent.com/betsyrosalen/DATA_621_Business_Analyt_and_Data_Mining/master/project4_insurance/data/insurance_training_data.csv',
                   stringsAsFactors = T, header = T)
test <- read.csv('https://raw.githubusercontent.com/betsyrosalen/DATA_621_Business_Analyt_and_Data_Mining/master/project4_insurance/data/insurance-evaluation-data.csv',
                 stringsAsFactors = T, header = T)
train.raw <- as.data.table(within(train.raw, rm('INDEX')))

vars <- rbind(c('TARGET_FLAG','car crash = 1, no car crash = 0','binary categorical response'),
               c('TARGET_AMT','car crash cost = >0, no car crash = 0','continuous numerical response'),
               c('AGE',"driver's age - very young/old tend to be risky",'continuous numerical predictor'),
               c('BLUEBOOK','$ value of vehicle','continuous numerical predictor'),
               c('CAR_AGE','age of vehicle','continuous numerical predictor'),
               c('CAR_TYPE','type of car (6types)','categorical predictor'),
               c('CAR_USE','usage of car (commercial/private)','binary categorical predictor'),
               c('CLM_FREQ','number of claims past 5 years','discrete numerical predictor'),
               c('EDUCATION','max education level (5types)','categorical predictor'),
               c('HOMEKIDS','number of children at home','discrete numerical predictor'),
               c('HOME_VAL','$ home value - home owners tend to drive more responsibly','continuous numerical predictor'),
               c('INCOME','$ income - rich people tend to get into fewer crashes','continuous numerical predictor'),
               c('JOB','job category (8types, 1missing) - white collar tend to be safer','categorical predictor'),
               c('KIDSDRIV','number of driving children - teenagers more likely to crash','discrete numerical predictor'),
               c('MSTATUS','maritial status - married people drive more safely','catogerical predictor'),
               c('MVR_PTS','number of traffic tickets','continuous numerical predictor'),
               c('OLDCLAIM','$ total claims in the past 5 years','continuous numerical predictor'),
               c('PARENT1','single parent','binary categorical predictor'),
               c('RED_CAR','a red car','binary categorical predictor'),
               c('REVOKED','license revoked (past 7 years) - more risky driver','binary categorical predictor'),
               c('SEX','gender - woman may have less crashes than man','binary categorical predictor'),
               c('TIF','time in force - number of years being customer','continuous numerical predictor'),
               c('TRAVTIME','distance to work','continuous numerical predictor'),
               c('URBANCITY','urban/rural','binary categorical predictor'),
               c('YOJ','years on job - the longer they stay more safe','continuous numerical predictor'))

colnames(vars) <- c('VARIABLE','DEFINITION','TYPE')

# ------------------------------------------------------------------------------
# Clean Data
## change BLUEBOOK, HOME_VAL, INCOME, OLDCLAIM $ to numerical value
cleanUSD <- function(num) {
  n <- gsub(",", "", num) # replace , with ""
  n <- as.numeric(gsub("[\\$,]", "", n)) # replace $ with ""
  return(n) }

train.raw$INCOME <- cleanUSD(train.raw$INCOME)
train.raw$BLUEBOOK <- cleanUSD(train.raw$BLUEBOOK)
train.raw$HOME_VAL <- cleanUSD(train.raw$HOME_VAL)
train.raw$OLDCLAIM <- cleanUSD(train.raw$OLDCLAIM)

test$INCOME <- cleanUSD(test$INCOME)
test$BLUEBOOK <- cleanUSD(test$BLUEBOOK)
test$HOME_VAL <- cleanUSD(test$HOME_VAL)
test$OLDCLAIM <- cleanUSD(test$OLDCLAIM)

# Convert 'CLM_FREQ','HOMEKIDS', and 'KIDSDRIV' to Factors
train.raw[, c('CLM_FREQ','HOMEKIDS','KIDSDRIV')] <- 
            lapply(train.raw[, c('CLM_FREQ','HOMEKIDS','KIDSDRIV')], as.factor)
test[, c('CLM_FREQ','HOMEKIDS','KIDSDRIV')] <- 
            lapply(test[, c('CLM_FREQ','HOMEKIDS','KIDSDRIV')], as.factor)

# Fix factor levels
levels(train.raw$URBANICITY) <- list(Urban="Highly Urban/ Urban", Rural="z_Highly Rural/ Rural")
levels(test$URBANICITY) <- list(Urban="Highly Urban/ Urban", Rural="z_Highly Rural/ Rural")

cleanLEVELS <- function(level) {
    l <- gsub("z_", "", levels(level)) # strip the "z_" prefix from factor levels
    return(l) }

levels(train.raw$EDUCATION) <- cleanLEVELS(train.raw$EDUCATION)
levels(test$EDUCATION) <- cleanLEVELS(test$EDUCATION)
levels(train.raw$JOB) <- cleanLEVELS(train.raw$JOB)
levels(test$JOB) <- cleanLEVELS(test$JOB)
levels(train.raw$CAR_TYPE) <- cleanLEVELS(train.raw$CAR_TYPE)
levels(test$CAR_TYPE) <- cleanLEVELS(test$CAR_TYPE)
levels(train.raw$SEX) <- cleanLEVELS(train.raw$SEX)
levels(test$SEX) <- cleanLEVELS(test$SEX)
levels(train.raw$MSTATUS) <- cleanLEVELS(train.raw$MSTATUS)
levels(test$MSTATUS) <- cleanLEVELS(test$MSTATUS)

## change CAR_AGE -3 to 0
train.raw[CAR_AGE == -3, CAR_AGE := 0]

# ------------------------------------------------------------------------------

# Summary Statistics

train.num <- train.raw[, c('TARGET_AMT', 'AGE', 'YOJ','INCOME','HOME_VAL',
                           'TRAVTIME', 'BLUEBOOK', 'TIF','OLDCLAIM', 'MVR_PTS',
                           'CAR_AGE')]
train.cat <- train.raw[, c('TARGET_FLAG', 'PARENT1', 'SEX', 'MSTATUS', 'EDUCATION',
                           'JOB', 'CAR_TYPE', 'CAR_USE', 'RED_CAR', 'REVOKED',
                           'URBANICITY', 'KIDSDRIV', 'HOMEKIDS', 'CLM_FREQ')]

summary.stat.num <- describe(train.num)[,c(2,8,3,5,9,4)]

summary.stat.cat <- describe(train.cat)[,c(2,8,3,5,9,4)]

summary.num <- summary(train.num)

summary.cat1 <- summary(train.cat[, c('EDUCATION', 'JOB', 'CAR_TYPE', 'KIDSDRIV', 
                                        'HOMEKIDS', 'CLM_FREQ')])
summary.cat2 <- summary(train.cat[, c('PARENT1', 'SEX', 'MSTATUS', 'CAR_USE', 
                                        'RED_CAR', 'REVOKED', 'URBANICITY')])


# ------------------------------------------------------------------------------

# Histograms

train.num.graph <- train.raw[, c('TARGET_FLAG', 'TARGET_AMT', 'AGE', 'YOJ','INCOME','HOME_VAL',
                                 'TRAVTIME', 'BLUEBOOK', 'TIF','OLDCLAIM', 'MVR_PTS',
                                 'CAR_AGE')]

hist.num <- train.num.graph %>%
    gather(-TARGET_FLAG, key = "var", value = "val") %>%
    ggplot(aes(x = val, fill=factor(TARGET_FLAG))) +
    geom_histogram(position="dodge", bins=10, alpha=0.5) +
    facet_wrap(~ var, scales = "free") +
    scale_fill_manual("TARGET_FLAG",values = c("#58BFFF", "#3300FF")) +
    xlab("") +
    ylab("") +
    theme(panel.background = element_blank(), legend.position="top")

bar.cat <- train.cat %>%
    gather(-TARGET_FLAG, key = "var", value = "val") %>%
    ggplot(aes(x = val, fill=factor(TARGET_FLAG))) +
    geom_bar(position="dodge", alpha=0.5) +
    facet_wrap(~ var, scales = "free") +
    scale_fill_manual("TARGET_FLAG",values = c("#58BFFF", "#3300FF")) +
    xlab("") +
    ylab("") +
    theme(panel.background = element_blank(), legend.position="top") +
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))

# ------------------------------------------------------------------------------

# BoxPlot

melt.train <- melt(train.num)

outlier.boxplots <- ggplot(melt.train, aes(variable, value)) +
  geom_boxplot(width=.5, fill="#58BFFF", outlier.colour="red", outlier.size = 1) +
  stat_summary(aes(colour="mean"), fun.y=mean, geom="point",
               size=2, show.legend=TRUE) +
  stat_summary(aes(colour="median"), fun.y=median, geom="point",
               size=2, show.legend=TRUE) +
  coord_flip(ylim = c(0, 110), expand = TRUE) +
  scale_y_continuous(labels = scales::comma,
                     breaks = seq(0, 110, by = 10)) +
  labs(colour="Statistics", x="", y="") +
  scale_colour_manual(values=c("#9900FF", "#3300FF")) +
  theme(panel.background=element_blank(), legend.position="top")

# Scaled BoxPlots
scaled.train.num <- as.data.table(scale(train.num[, c('AGE', 'YOJ','INCOME','HOME_VAL',
                                                      'TRAVTIME', 'BLUEBOOK', 'TIF',
                                                      'OLDCLAIM', 'MVR_PTS',
                                                      'CAR_AGE')]))
melt.train <- melt(scaled.train.num)

scaled.boxplots <- ggplot(melt.train, aes(variable, value)) +
    geom_boxplot(width=.5, fill="#58BFFF", outlier.colour="red", outlier.size = 1) +
    stat_summary(aes(colour="mean"), fun.y=mean, geom="point",
                 size=2, show.legend=TRUE) +
    stat_summary(aes(colour="median"), fun.y=median, geom="point",
                 size=2, show.legend=TRUE) +
    coord_flip() +
    #scale_y_continuous(labels = scales::comma,
    #                   breaks = seq(0, 110, by = 10)) +
    labs(colour="Statistics", x="", y="") +
    scale_colour_manual(values=c("#9900FF", "#3300FF")) +
    theme(panel.background=element_blank(), legend.position="top")

# ------------------------------------------------------------------------------

boxplots.target <- train.num.graph %>%
  gather(-TARGET_FLAG,key = "var", value = "val") %>%
  ggplot(aes(x=factor(TARGET_FLAG), y=val)) +
  geom_boxplot(width=.5, fill="#58BFFF", outlier.colour="red", outlier.size = 1) +
  stat_summary(aes(colour="mean"), fun.y=mean, geom="point",
               size=2, show.legend=TRUE) +
  stat_summary(aes(colour="median"), fun.y=median, geom="point",
               size=2, show.legend=TRUE) +
  facet_wrap(~ var, scales = "free", ncol=4) +
  labs(colour="Statistics", x="", y="") +
  scale_colour_manual(values=c("#9900FF", "#3300FF")) +
  theme(panel.background=element_blank())

# ------------------------------------------------------------------------------


## Linearity
linearity <- train.raw[,-1] %>%
    select_if(is.numeric) %>%
    filter(TARGET_AMT>0) %>%
    gather(-TARGET_AMT, key = "var", value = "value") %>%
    ggplot(aes(x = value, y = TARGET_AMT)) +
    geom_point(alpha=0.1) +
    stat_smooth() +
    facet_wrap(~ var, scales = "free", ncol=3) +
    ylab("TARGET_AMT") +
    xlab("") +
    theme(panel.background = element_blank())

## Log Transformed Linearity
logged_vals <- train.raw[,c('TARGET_AMT', 'INCOME','HOME_VAL',
                            'TRAVTIME', 'BLUEBOOK', 'TIF','OLDCLAIM', 'MVR_PTS',
                            'CAR_AGE')] + 1
logged_vals <- logged_vals %>%
    filter(TARGET_AMT>1) %>%
    log()

linearity.log <- logged_vals %>%
    gather(-TARGET_AMT, key = "var", value = "value") %>%
    ggplot(aes(x = value, y = TARGET_AMT)) +
    geom_point(alpha=0.1) +
    stat_smooth() +
    facet_wrap(~ var, scales = "free", ncol=3) +
    ylab("TARGET_AMT") +
    xlab("") +
    theme(panel.background = element_blank())

# ------------------------------------------------------------------------------

# DATA PREPARATION <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

# ------------------------------------------------------------------------------
# Missing Values

#table(is.na(train))
#sapply(train, function(x) sum(is.na(x)))

train <- data.table(train.raw)
train <- train%>%
  filter(CAR_AGE >= 0)
set.seed(123)
impute.data <- mice(train, m = 2, maxit = 2, print = FALSE)

age.med <- median(train$AGE, na.rm = T)
train$AGE[is.na(train$AGE)] <- age.med

train.mice <- mice(train, m = 1, maxit = 1, print = FALSE)
train <- mice::complete(train.mice)

density.plot <- densityplot(impute.data)

# ------------------------------------------------------------------------------
# make.dummy <- train[, c('EDUCATION', 'JOB', 'CAR_TYPE')]
# dummies <- fastDummies::dummy_cols(make.dummy)

# Divide numeric/categorical data AFTER imputing data

train.num.a <- train[, c('TARGET_FLAG', 'TARGET_AMT', 'AGE', 'YOJ','INCOME','HOME_VAL',
                           'TRAVTIME', 'BLUEBOOK', 'TIF','OLDCLAIM', 'MVR_PTS',
                           'CAR_AGE')]

train.cat.a <- train[, c('TARGET_FLAG', 'PARENT1', 'SEX', 'MSTATUS', 'EDUCATION',
                           'JOB', 'CAR_TYPE', 'CAR_USE', 'RED_CAR', 'REVOKED',
                           'URBANICITY', 'KIDSDRIV', 'HOMEKIDS', 'CLM_FREQ')]

# ------------------------------------------------------------------------------

# Does imputed data show linearity?

## Linearity Plot
linearity.new <- train.num.a[,-1] %>%
    select_if(is.numeric) %>%
    filter(TARGET_AMT>0) %>%
    gather(-TARGET_AMT, key = "var", value = "value") %>%
    ggplot(aes(x = value, y = TARGET_AMT)) +
    geom_point(alpha=0.1) +
    stat_smooth() +
    facet_wrap(~ var, scales = "free", ncol=3) +
    ylab("TARGET_AMT") +
    xlab("") +
    theme(panel.background = element_blank())

## Log Transformed Linearity Plot
logged_vals <- train.num.a[,c('TARGET_AMT', 'AGE', 'YOJ','INCOME','HOME_VAL',
                            'TRAVTIME', 'BLUEBOOK', 'TIF','OLDCLAIM', 'MVR_PTS',
                            'CAR_AGE')] + 1
logged_vals <- logged_vals %>%
    filter(TARGET_AMT>1) %>%
    log()

linearity.log.new <- logged_vals %>%
    gather(-TARGET_AMT, key = "var", value = "value") %>%
    ggplot(aes(x = value, y = TARGET_AMT)) +
    geom_point(alpha=0.1) +
    stat_smooth() +
    facet_wrap(~ var, scales = "free", ncol=3) +
    ylab("TARGET_AMT") +
    xlab("") +
    theme(panel.background = element_blank())

# Box-Cox
bc.data <- train.num.a[train.num.a[, 'TARGET_AMT'] > 0, ]  # separate object so the evaluation data in `test` is not overwritten
# Code below added to .Rmd file
#bc_plot <- boxcox(TARGET_AMT~., data=bc.data, lambda=seq(-0.2,0.2,by=0.1))

# Does square root transformation show linearity?

## Square Root Transformed Predictors and Log transformed Target Linearity Plot
X <- train.num.a[train.num.a[, 'TARGET_AMT']>0,
                 c('AGE', 'YOJ','INCOME','HOME_VAL',
                    'TRAVTIME', 'BLUEBOOK', 'TIF','OLDCLAIM', 'MVR_PTS',
                    'CAR_AGE')]
sqroot_vals <- data.table(cbind(log(train.num.a[train.num.a[, 'TARGET_AMT']>0,'TARGET_AMT']),
                     sapply(X, sqrt)))
colnames(sqroot_vals)[1] <- 'TARGET_AMT'

linearity.root <- sqroot_vals %>%
    gather(-TARGET_AMT, key = "var", value = "value") %>%
    ggplot(aes(x = value, y = TARGET_AMT)) +
    geom_point(alpha=0.1) +
    stat_smooth() +
    facet_wrap(~ var, scales = "free", ncol=3) +
    ylab("TARGET_AMT") +
    xlab("") +
    theme(panel.background = element_blank())

## Correlation

#corr.table <- ggpairs(train.num.a %>% dplyr::select(-c(TARGET_AMT, TARGET_FLAG)))

plot.data <- train.num.a
plot.data$TARGET_FLAG <- factor(plot.data$TARGET_FLAG)
corr.plot2 <- plot.data %>% # dplyr::select(-TARGET_AMT) %>%
    ggscatmat(color="TARGET_FLAG", alpha=0.1) +
    scale_color_manual(values=c("#58BFFF", "#3300FF")) +
    theme(panel.background=element_blank(), legend.position="top",
          axis.text.x = element_text(angle=-40, vjust=1, hjust=0))

# correl <- ggpairs(train)
# This plot doesn't work in the script file.  Moved code to our .Rmd file
# The code works to create  correlation table though!
corr.train <- train.num.a %>%
  dplyr::select(-TARGET_FLAG) %>%
  dplyr::select(-TARGET_AMT) %>%
  cor() %>%
  round(2) %>%
  corrplot(method = "circle")

corr.plot <- ggcorrplot::ggcorrplot(corr.train,
                                    type = 'lower',
                                    lab=T,
                                    lab_size=2)

#pairs.plot <- pairs(train.num.a, col=train.num.a$TARGET_FLAG)

# BUILD MODELS<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

## Model 1

model.1 <- train(TARGET_FLAG ~ PARENT1 + SEX + MSTATUS + EDUCATION + JOB + CAR_TYPE + 
                     CAR_USE + REVOKED + URBANICITY + KIDSDRIV + HOMEKIDS + CLM_FREQ,
                 data=train,
                 method='glm',
                 family='binomial',
                 preProcess = c("center", "scale")) 
                 # center and scale data based on the mean and sd

mod.1 <- glm(TARGET_FLAG ~ PARENT1 + SEX + MSTATUS + EDUCATION + JOB + CAR_TYPE + 
                 CAR_USE + REVOKED + URBANICITY + KIDSDRIV + HOMEKIDS + CLM_FREQ,
             family='binomial',
             data = train.cat.a)

mod1_summary <- summ(mod.1, vifs = TRUE)

# Code below added to .Rmd file
#mod1_plot <- par(mfrow=c(2,2)); plot(mod.1)

### Model 1 Summary Statistics
pred.1.raw <- predict(mod.1, newdata = train, type = "response") # predicted probabilities
pred.1 <- as.factor(ifelse(pred.1.raw < .5, 0, 1))
mod1.conf.mat <- confusionMatrix(pred.1, as.factor(train$TARGET_FLAG), mode = "everything")


#==============================================================================#

## Model 2 REMOVED from .Rmd file

#model.2 <- train(TARGET_FLAG ~ PARENT1 + MSTATUS + EDUCATION + JOB + CAR_TYPE + 
#                     CAR_USE + REVOKED + URBANICITY + KIDSDRIV + HOMEKIDS + 
#                     CLM_FREQ + YOJ + INCOME + HOME_VAL + TRAVTIME + BLUEBOOK + 
#                     TIF + OLDCLAIM + MVR_PTS,
#                 data=train,
#                 method='glm',
#                 family='binomial',
#                 preProcess = c("center", "scale")) 
                  # center and scale data based on the mean and sd
#
#mod.2 <- glm(TARGET_FLAG ~ PARENT1 + MSTATUS + EDUCATION + JOB + CAR_TYPE + 
#                     CAR_USE + REVOKED + URBANICITY + KIDSDRIV + HOMEKIDS + 
#                     CLM_FREQ + YOJ + INCOME + HOME_VAL + TRAVTIME + BLUEBOOK + 
#                     TIF + OLDCLAIM + MVR_PTS,
#             family='binomial',
#             data = train)
#
#mod2_summary <- summ(mod.2, vifs = TRUE)

# Code below added to .Rmd file
#mod2_plot <- par(mfrow=c(2,2)); plot(mod.2)

### Model 2 Summary Statistics
#pred.2.raw <- predict(mod.2, newdata = train)
#pred.2 <- as.factor(ifelse(pred.2.raw < .5, 0, 1))
#mod2.conf.mat <- confusionMatrix(pred.2, as.factor(train$TARGET_FLAG), mode = "everything")


#==============================================================================#

## Model 2 (USED TO BE 3)

train$EDUCATION_Bachelors <- train$EDUCATION == "Bachelors"
train$JOB_Manager <- train$JOB == "Manager"
train$JOB_Clerical <- train$JOB == "Clerical"

model.3 <- train(TARGET_FLAG ~ MSTATUS + EDUCATION_Bachelors + JOB_Clerical + 
                    JOB_Manager + CAR_TYPE + CAR_USE + REVOKED + URBANICITY + 
                    KIDSDRIV + HOMEKIDS + CLM_FREQ + INCOME + HOME_VAL + 
                    TRAVTIME + BLUEBOOK + TIF + OLDCLAIM + MVR_PTS,
                 data=train,
                 method='glm',
                 family='binomial',
                 preProcess = c("center", "scale")) 
                 # center and scale data based on the mean and sd

mod.3 <- glm(TARGET_FLAG ~ MSTATUS + EDUCATION_Bachelors + JOB_Clerical + 
                    JOB_Manager + CAR_TYPE + CAR_USE + REVOKED + URBANICITY + 
                    KIDSDRIV + HOMEKIDS + CLM_FREQ + INCOME + HOME_VAL + 
                    TRAVTIME + BLUEBOOK + TIF + OLDCLAIM + MVR_PTS,
             family='binomial',
             data = train)

mod3_summary <- summ(mod.3, vifs = TRUE)

# Code below added to .Rmd file
#mod3_plot <- par(mfrow=c(2,2)); plot(mod.3)

### Model 3 Summary Statistics
pred.3.raw <- predict(mod.3, newdata = train, type = "response") # predicted probabilities
pred.3 <- as.factor(ifelse(pred.3.raw < .5, 0, 1))
mod3.conf.mat <- confusionMatrix(pred.3, as.factor(train$TARGET_FLAG), mode = "everything")


#==============================================================================#

## Model 3 (USED TO BE 4)

mod.4.raw <- glm(TARGET_FLAG ~ MSTATUS + EDUCATION_Bachelors + JOB_Clerical + 
                    JOB_Manager + CAR_TYPE + CAR_USE + REVOKED + URBANICITY + 
                    KIDSDRIV + HOMEKIDS + CLM_FREQ + BLUEBOOK + CAR_AGE + 
                    HOME_VAL + INCOME + MVR_PTS + OLDCLAIM + TIF + TRAVTIME +
                    log(BLUEBOOK) + log(CAR_AGE+1) + log(HOME_VAL+1) + 
                    log(INCOME+1) + log(MVR_PTS+1) + log(OLDCLAIM+1) + log(TIF) + 
                    log(TRAVTIME),
             family='binomial',
             data = na.omit(train))

backward.mod.4 <- step(mod.4.raw, direction = "backward", trace=FALSE)

mod.4 <- glm(TARGET_FLAG ~ MSTATUS + EDUCATION_Bachelors + JOB_Clerical + 
                 JOB_Manager + CAR_TYPE + CAR_USE + REVOKED + URBANICITY + 
                 KIDSDRIV + HOMEKIDS + CAR_AGE + HOME_VAL + INCOME + MVR_PTS + 
                 OLDCLAIM + log(BLUEBOOK) + log(INCOME+1) + 
                 log(OLDCLAIM+1) + log(TIF) + log(TRAVTIME),
            family = "binomial", data = na.omit(train))

mod4_summary <- summ(mod.4, vifs = TRUE)

# Code below added to .Rmd file
#mod4_plot <- par(mfrow=c(2,2)); plot(mod.4)

### Model 4 Summary Statistics
pred.4.raw <- predict(mod.4, newdata = train, type = "response") # predicted probabilities
pred.4 <- as.factor(ifelse(pred.4.raw < .5, 0, 1))
mod4.conf.mat <- confusionMatrix(pred.4, as.factor(train$TARGET_FLAG), mode = "everything")

#==============================================================================#

## Model 4 (USED TO BE 5)

train_5 <- train%>%
  filter(TARGET_FLAG == 1) %>%
  filter(TARGET_AMT<45000) %>%
  filter(CAR_AGE >= 0)
train_5$mileage <- train_5$TRAVTIME*(train_5$CAR_AGE+0.0000000000000000000000001)*440.0

model.5 <- lm(TARGET_AMT~ KIDSDRIV + log(AGE)+ AGE +  HOMEKIDS +
                YOJ  + log(INCOME+0.00000000000001)+INCOME + CAR_AGE +log(mileage)+  
                log(BLUEBOOK)+ BLUEBOOK +
                TIF+log(OLDCLAIM+0.00000000000001)+ OLDCLAIM + CLM_FREQ+ MVR_PTS+ CAR_AGE +
                PARENT1+ SEX+ EDUCATION_Bachelors + JOB_Clerical + JOB_Manager + 
                CAR_TYPE+ REVOKED+ URBANICITY+ MSTATUS+ CAR_USE, data =na.omit(train_5))

mod.5 <- step(model.5, direction = "forward", trace=FALSE)
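# NOTE: with no `scope` argument, forward selection from the full model above
# has nothing to add, so mod.5 retains every term of model.5.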

mod5_summary <- summ(mod.5, vifs = TRUE)

mod5_plot <- autoplot(mod.5, which = 1:6, colour = "#58BFFF",
                        smooth.colour = 'red', smooth.linetype = 'solid',
                        ad.colour = 'black',
                        label.size = 3, label.n = 5, label.colour = "#3300FF",
                        ncol = 2) +
                theme(panel.background=element_blank())

### Model 5 Predictions
pred.5.raw <- predict(mod.5, newdata = train_5)

#==============================================================================#

## Model 5 (USED TO BE 6)

train_6 <- train %>%
  filter(TARGET_AMT < 45000) 

train_6$mileage <- train_6$TRAVTIME*(train_6$CAR_AGE+0.00000000001)*440

model.6.raw <- lm(TARGET_AMT~ TARGET_FLAG + KIDSDRIV + log(AGE) + AGE +  HOMEKIDS +
                    YOJ + log(INCOME+1) + INCOME + CAR_AGE + log(mileage) + 
                    log(BLUEBOOK)+ BLUEBOOK + TIF + log(OLDCLAIM+1) + OLDCLAIM + 
                    CLM_FREQ + MVR_PTS + CAR_AGE + PARENT1 + SEX + 
                    EDUCATION_Bachelors + JOB_Clerical + JOB_Manager + 
                    CAR_TYPE+ REVOKED+ URBANICITY+ MSTATUS+ CAR_USE, 
                  data =na.omit(train_6))

forward.mod.6 <- step(model.6.raw, direction = "forward", trace=FALSE)
mod.6 <- step(model.6.raw, direction = "backward", trace=FALSE)

mod6_summary <- summ(mod.6, vifs = TRUE)

mod6_plot <- autoplot(mod.6, which = 1:6, colour = "#58BFFF",
                      smooth.colour = 'red', smooth.linetype = 'solid',
                      ad.colour = 'black',
                      label.size = 3, label.n = 5, label.colour = "#3300FF",
                      ncol = 2) +
                theme(panel.background=element_blank())

### Model 6 Predictions
pred.6.raw <- predict(mod.6, newdata = train_6)


#==============================================================================#

## Model 6 (USED TO BE 7)

model.7.raw <- lm(log(TARGET_AMT+1) ~ MSTATUS + EDUCATION_Bachelors + 
                      JOB_Clerical + JOB_Manager + SEX + PARENT1 +
                      CAR_TYPE + CAR_USE + REVOKED + URBANICITY + KIDSDRIV + 
                      HOMEKIDS + CLM_FREQ + BLUEBOOK + AGE + YOJ + CAR_AGE + 
                      HOME_VAL + INCOME + MVR_PTS + OLDCLAIM + TIF + TRAVTIME +
                      log(BLUEBOOK) + log(CAR_AGE+1) + log(HOME_VAL+1) + 
                      log(INCOME+1) + log(MVR_PTS+1) + log(OLDCLAIM+1) + log(TIF) + 
                      log(TRAVTIME), 
                  data=train)

forward.mod.7 <- step(model.7.raw, direction = "forward", trace=FALSE)
mod.7 <- step(model.7.raw, direction = "backward", trace=FALSE)

mod7_summary <- summ(mod.7, vifs = TRUE)

mod7_plot <- autoplot(mod.7, which = 1:6, colour = "#58BFFF",
                      smooth.colour = 'red', smooth.linetype = 'solid',
                      ad.colour = 'black',
                      label.size = 3, label.n = 5, label.colour = "#3300FF",
                      ncol = 2) +
    theme(panel.background=element_blank())

### Model 7 Predictions
pred.7.raw <- predict(mod.7, newdata = train)

#==============================================================================#

## Model Evaluations


eval_mods <- data.frame(mod1.conf.mat$byClass,
                        mod3.conf.mat$byClass,
                        mod4.conf.mat$byClass) # add additional model stats

eval_mods <- data.frame(t(eval_mods))
row.names(eval_mods) <- c("Model.1", "Model.2", "Model.3") # add additional models

eval_mods <- dplyr::select(eval_mods, Sensitivity, Specificity, Precision, Recall, F1)


# SELECT MODELS <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

#Pseudo R2

pseudo.r2 <- data.frame(pscl::pR2(mod.1),
                        pscl::pR2(mod.3),
                        pscl::pR2(mod.4))

pseudo.r2 <- data.frame(t(pseudo.r2))

row.names(pseudo.r2) <- c("Model.1", "Model.2", "Model.3")

# Prep test data for predictions
test$EDUCATION_Bachelors <- test$EDUCATION == "Bachelors"
test$JOB_Manager <- test$JOB == "Manager"
test$JOB_Clerical <- test$JOB == "Clerical"

# Predictions
test$TARGET_FLAG <- ifelse(predict(mod.4, newdata = test, type = "response") < .5, 0, 1)
test$TARGET_AMT_Mod5 <- predict(mod.6, newdata = test)
test$TARGET_AMT_Mod6 <- predict(mod.7, newdata = test)

# Prediction Histograms

summary.pred.amt <- describe(test[, c('TARGET_AMT_Mod5', 'TARGET_AMT_Mod6')])[,c(2,8,3,5,9,4)]

summary.pred.flag <- summary(factor(test$TARGET_FLAG))

hist.pred1 <- test[, c('TARGET_AMT_Mod5', 'TARGET_AMT_Mod6')] %>%
    gather() %>%
    ggplot(aes(value)) +
    facet_wrap(~ key, scales = "free") +
    geom_histogram(fill = "#58BFFF") +
    xlab("") +
    ylab("") +
    theme(panel.background = element_blank())

hist.pred2 <- test[, c('TARGET_FLAG', 'TARGET_AMT_Mod5', 'TARGET_AMT_Mod6')] %>%
    gather(-TARGET_FLAG, key = "var", value = "val") %>%
    ggplot(aes(x = val, fill=factor(TARGET_FLAG))) +
    geom_histogram(position="dodge", bins=10, alpha=0.5) +
    facet_wrap(~ var, scales = "free") +
    scale_fill_manual("TARGET_FLAG_Mod3",values = c("#58BFFF", "#3300FF")) +
    xlab("") +
    ylab("") +
    theme(panel.background = element_blank(), legend.position="top")