Introduction

Group 2 of the DATA621 class was asked to analyze and model a data sent containing approximately 8,000 with 2 response and 23 predictor variables, records representing a customer at an auto insurance company.

Each record has two response variables. The first response variable, TARGET_FLAG which is binary (0,1). If someone was in car crash the value is 1 and if the person was not in a car cash the value is 0.

The second response variable is TARGET_ATM. If someone was in a car cash the value is 1 and if they did not crash their car the value is greater than 0.

Objective

The objective is to build multiple linear regression and binary logistic regression models on the training data to predict the probability that a person will crash their car and also the amount of money it will cost if the person does crash their car.

Only the variables that are given or variables derived from the variables provided.

Approach

The team met to discuss this assignment and an approach for completing the assignment. Each of the 5 team members was assigned tasks. The following tasks were assigned:

Data Exploration Data Preparation Build Models *Select Models

Github was used to manage the project. Using Github helped with version control and ensured each team member had access to the latest version of the project documentation.

Slack was used for daily communication during the project and for quick access to code and documentation. Meeting were organized at least twice a week and as needed using “Go to Meetings”.

Data Exploration and Data Preparation

Since the data sets were provided, it was crucial that we understand the data set and determine whether any missing values are present.

Model Building and Selection

Based on the objective of the project, several logistic, multiple and robust regression models were built.

Team Members

-Valerie Briot -Michael D’acampora -Keith Folsom -Brian Kreis -Sharon Morris

Dataset

For reproducibility of the results, the data was loaded to and accessed from a Github repository.

Data Exploration and Statistic Measures

The purpose of the data exploration and statistic measures phase is to understand the data to determine how to process the dataset for modelling.

Missing Values

The majority of variables do not contain missing values. The predictor CAR_AGE (Vehicle Age) contains 510 missing values, YOJ(Years on Job) contain 454, income (INCOME) 445 and home value (HOME_VAL) 464 missing values.

The visualization of missing values below shows that missing values of CAR_AGE, HOME_VAL and YOJ are at 6 percent while INCOME is at 5 percent. The dataset was imputed to account for the missing values.

x
TARGET_FLAG 0
TARGET_AMT 0
KIDSDRIV 0
AGE 6
HOMEKIDS 0
YOJ 454
INCOME 445
PARENT1 0
HOME_VAL 464
MSTATUS 0
SEX 0
EDUCATION 0
JOB 0
TRAVTIME 0
CAR_USE 0
BLUEBOOK 0
TIF 0
CAR_TYPE 0
RED_CAR 0
OLDCLAIM 0
CLM_FREQ 0
REVOKED 0
MVR_PTS 0
CAR_AGE 510
URBANICITY 0

Variable to Variable Analysis

Variable Name Definition Variable Type
TARGET_FLAG Was Car in a crash? 1=YES 0=NO Response
TARGET_AMT If car was in a crash, what was the cost Response
AGE Age of Driver Predictor
BLUEBOOK Value of Vehicle Predictor
CAR_AGE Vehicle Age Predictor
CAR_TYPE Type of Car Predictor
CAR_USE Vehicle Use Predictor
CLM_FREQ # Claims (Past 5 Years) Predictor
EDUCATION Max Education Level Predictor
HOMEKIDS # Children at Home Predictor
HOME_VAL Home Value Predictor
INCOME Income Predictor
JOB Job Category Predictor
KIDSDRIV # Driving Children Predictor
MSTATUS Marital Status Predictor
MVR_PTS Motor Vehicle Record Points Predictor
OLDCLAIM Total Claims (Past 5 Years) Predictor
PARENT1 Single Parent Predictor
RED_CAR A Red Car Predictor
REVOKED License Revoked (Past 7 Years) Predictor
SEX Gender Predictor
TIF Time in Force Predictor
TRAVTIME Distance to Work Predictor
URBANICITY Home/Work Area Predictor
YOJ Years on Job Predictor

Descriptive Statistics

Descriptive statistics was performed for all predictor and response variables to explore the data.

##   TARGET_FLAG TARGET_AMT KIDSDRIV         AGE HOMEKIDS        YOJ
## 1           0          0        0 0.000735204        0 0.05563044
##       INCOME PARENT1   HOME_VAL MSTATUS SEX EDUCATION JOB TRAVTIME CAR_USE
## 1 0.05452763       0 0.05685578       0   0         0   0        0       0
##   BLUEBOOK TIF CAR_TYPE RED_CAR OLDCLAIM CLM_FREQ REVOKED MVR_PTS
## 1        0   0        0       0        0        0       0       0
##      CAR_AGE URBANICITY
## 1 0.06249234          0
##             vars    n      mean        sd median   trimmed       mad  min
## TARGET_FLAG    1 8161      0.26      0.44      0      0.20      0.00    0
## TARGET_AMT     2 8161   1504.32   4704.03      0    593.71      0.00    0
## KIDSDRIV       3 8161      0.17      0.51      0      0.03      0.00    0
## AGE            4 8155     44.79      8.63     45     44.83      8.90   16
## HOMEKIDS       5 8161      0.72      1.12      0      0.50      0.00    0
## YOJ            6 7707     10.50      4.09     11     11.07      2.97    0
## INCOME         7 7716  61898.09  47572.68  54028  56840.98  41792.27    0
## PARENT1*       8 8161      1.13      0.34      1      1.04      0.00    1
## HOME_VAL       9 7697 154867.29 129123.77 161160 144032.07 147867.11    0
## MSTATUS*      10 8161      1.40      0.49      1      1.38      0.00    1
## SEX*          11 8161      1.54      0.50      2      1.55      0.00    1
## EDUCATION*    12 8161      3.09      1.44      3      3.11      1.48    1
## JOB*          13 8161      5.69      2.68      6      5.81      2.97    1
## TRAVTIME      14 8161     33.49     15.91     33     33.00     16.31    5
## CAR_USE*      15 8161      1.63      0.48      2      1.66      0.00    1
## BLUEBOOK      16 8161  15709.90   8419.73  14440  15036.89   8450.82 1500
## TIF           17 8161      5.35      4.15      4      4.84      4.45    1
## CAR_TYPE*     18 8161      3.53      1.97      3      3.54      2.97    1
## RED_CAR*      19 8161      1.29      0.45      1      1.24      0.00    1
## OLDCLAIM      20 8161   4037.08   8777.14      0   1719.29      0.00    0
## CLM_FREQ      21 8161      0.80      1.16      0      0.59      0.00    0
## REVOKED*      22 8161      1.12      0.33      1      1.03      0.00    1
## MVR_PTS       23 8161      1.70      2.15      1      1.31      1.48    0
## CAR_AGE       24 7651      8.33      5.70      8      7.96      7.41   -3
## URBANICITY*   25 8161      1.20      0.40      1      1.13      0.00    1
##                  max    range  skew kurtosis      se    IQR   Q0.1 Q0.25
## TARGET_FLAG      1.0      1.0  1.07    -0.85    0.00      1    0.0     0
## TARGET_AMT  107586.1 107586.1  8.71   112.29   52.07   1036    0.0     0
## KIDSDRIV         4.0      4.0  3.35    11.78    0.01      0    0.0     0
## AGE             81.0     65.0 -0.03    -0.06    0.10     12   34.0    39
## HOMEKIDS         5.0      5.0  1.34     0.65    0.01      1    0.0     0
## YOJ             23.0     23.0 -1.20     1.18    0.05      4    5.0     9
## INCOME      367030.0 367030.0  1.19     2.13  541.58  57889 4380.5 28097
## PARENT1*         2.0      1.0  2.17     2.73    0.00      0    1.0     1
## HOME_VAL    885282.0 885282.0  0.49    -0.02 1471.79 238724    0.0     0
## MSTATUS*         2.0      1.0  0.41    -1.83    0.01      1    1.0     1
## SEX*             2.0      1.0 -0.14    -1.98    0.01      1    1.0     1
## EDUCATION*       5.0      4.0  0.12    -1.38    0.02      3    1.0     2
## JOB*             9.0      8.0 -0.31    -1.22    0.03      5    2.0     3
## TRAVTIME       142.0    137.0  0.45     0.66    0.18     22   13.0    22
## CAR_USE*         2.0      1.0 -0.53    -1.72    0.01      1    1.0     1
## BLUEBOOK     69740.0  68240.0  0.79     0.79   93.20  11570 6000.0  9280
## TIF             25.0     24.0  0.89     0.42    0.05      6    1.0     1
## CAR_TYPE*        6.0      5.0  0.00    -1.52    0.02      5    1.0     1
## RED_CAR*         2.0      1.0  0.92    -1.16    0.01      1    1.0     1
## OLDCLAIM     57037.0  57037.0  3.12     9.86   97.16   4636    0.0     0
## CLM_FREQ         5.0      5.0  1.21     0.28    0.01      2    0.0     0
## REVOKED*         2.0      1.0  2.30     3.30    0.00      0    1.0     1
## MVR_PTS         13.0     13.0  1.35     1.38    0.02      3    0.0     0
## CAR_AGE         28.0     31.0  0.28    -0.75    0.07     11    1.0     1
## URBANICITY*      2.0      1.0  1.46     0.15    0.00      0    1.0     1
##              Q0.75     Q0.9
## TARGET_FLAG      1      1.0
## TARGET_AMT    1036   4904.0
## KIDSDRIV         0      1.0
## AGE             51     56.0
## HOMEKIDS         1      3.0
## YOJ             13     15.0
## INCOME       85986 123180.0
## PARENT1*         1      2.0
## HOME_VAL    238724 316542.6
## MSTATUS*         2      2.0
## SEX*             2      2.0
## EDUCATION*       5      5.0
## JOB*             8      9.0
## TRAVTIME        44     54.0
## CAR_USE*         2      2.0
## BLUEBOOK     20850  27460.0
## TIF              7     11.0
## CAR_TYPE*        6      6.0
## RED_CAR*         2      2.0
## OLDCLAIM      4636   9583.0
## CLM_FREQ         2      3.0
## REVOKED*         1      2.0
## MVR_PTS          3      5.0
## CAR_AGE         12     16.0
## URBANICITY*      1      2.0

Correlation Analysis

The correlation matrix shown below highlights correlations among several predictor variables. Correlation between between claims in the past 5 years (CLM_FREQ) and motor vechile recorded points (MVR_PTS); driving children(KIDSDRV) and age of driver (AGE) is very high at 0.67.

The tables below represent correlation between response and predictor variables.

Correlation with Outcome Variable - TARGET_FLAG

VARIABLE CORRELATION WITH TARGET_FLAG
KIDSDRIV 0.1036683
AGE -0.1032167
HOMEKIDS 0.115621
YOJ -0.0705118
INCOME -0.1420081
HOME_VAL -0.1837371
TRAVTIME 0.0483683
BLUEBOOK -0.1033832
TIF -0.08237
OLDCLAIM 0.1380838
CLM_FREQ 0.2161961
MVR_PTS 0.2191971
CAR_AGE -0.1006506

Correlation with Outcome Variable - TARGET_AMT

VARIABLE CORRELATION WITH TARGET_AMT
KIDSDRIV 0.0553942
AGE -0.0417283
HOMEKIDS 0.061988
YOJ -0.0220852
INCOME -0.0583069
HOME_VAL -0.0856024
TRAVTIME 0.027987
BLUEBOOK -0.0046995
TIF -0.0464808
OLDCLAIM 0.0709533
CLM_FREQ 0.1164192
MVR_PTS 0.1378655
CAR_AGE -0.0588221

Analysis of predictors

Each predictor was exam ed to determine whether transformation is needed.

KIDSDRIV

The Driving Children variable is highly skewed to the right. The outliers are high.

Extreme Observations

Range Values
Lowest None
Highest 4, 3, 2, 1

AGE

The AGE predictor is close to a normal distribution with high outliers of ages 72, 73, 76, 80 & 81 and low 16, 17 and 18.

Extreme Observations

Range Values
Lowest 20, 19, 18, 17, 16
Highest 81, 80, 76, 73, 72, 70

BLUEBOOK

The predictor of car value BLUEBOOK is slightly skewed to the right. There are some outliers a the higher car value level.

Extreme Observations

Range Values
Lowest None
Highest 69740, 65970, 62240, 61050, 57970, 50970, 50180, 49880, 49230, 48620

CAR_AGE

The distribution is normal. There are are no outliers.

Extreme Observations

Range Values
Lowest None
Highest None

CAR_TYPE

z_SUV and Minivan are majority of vehicles insured.

## 
##     Minivan Panel Truck      Pickup  Sports Car         Van       z_SUV 
##        2145         676        1389         907         750        2294

CAR_USE

The majority of cars are privately used.

## 
## Commercial    Private 
##       3029       5132

CLM_FREQ

The distribution of claims is multi modal. With the largest number of claims occurring before year 1. There are no outliers

Extreme Observations

Range Values
Lowest None
Highest None

EDUCATION

Ther majority of insurers are college or high school graduates.

## 
##  <High School     Bachelors       Masters           PhD z_High School 
##          1203          2242          1658           728          2330

HOMEKIDS

The distribution of HOMEKIDS is multimodal. The majority of customers do not have any children. There are some outliers.

Extreme Observations

Range Values
Lowest None
Highest 5, 4, 3

HOME_VAL

The distribution of HOME_VAL is skewed to the left. There are negative values that will require further exploration. There are several outliers on the higher end.

Extreme Observations

Range Values
Lowest None
Highest 885282, 750455, 738153, 682634, 657804, 653952, 649247, 631309, 630267, 611328

INCOME

The distribution INCOME has uni modal and skewed to the right. There are several outliers on the higher end.

Extreme Observations

Range Values
Lowest None
Highest 367030, 332339, 320127, 309628, 306277, 297435, 290846, 284071, 282292, 282198

JOB

The majority of customers work in blue collar jobs.

ggplot(insurance_train, aes(x = JOB)) + 
  geom_bar(fill = "red", width = 0.7) + 
  xlab("Job Category") + ylab("V")

table(insurance_train$JOB)
## 
##                    Clerical        Doctor    Home Maker        Lawyer 
##           526          1271           246           641           835 
##       Manager  Professional       Student z_Blue Collar 
##           988          1117           712          1825

MSTATUS

The majority of customers are married.

## 
##  Yes z_No 
## 4894 3267

MVR_PTS

The distribution of the MVR_PTS is skewed to the right. There are outliers on the higher end.

Extreme Observations

Range Values
Lowest None
Highest 13, 11, 10, 9, 8

OLDCLAIM

The distribution OLDCLAIM is highly skewed to the left. There are several outliers on the higher end.

Extreme Observations

Range Values
Lowest None
Highest 57037, 53986, 53568, 53477, 52507, 52465, 52445, 52068, 51904, 51593

PARENT1

The majority of customers are not single parents.

## 
##   No  Yes 
## 7084 1077

RED_CAR

The majority of the cars are not red.

## 
##   no  yes 
## 5783 2378

TIF

The distribution of TIF is skewed to the right with several outliers.

Extreme Observations

Range Values
Lowest None
Highest 25, 22, 21, 20, 19, 18, 17

TRAVTIME

The distribution of TRAVTIME is skewed to the right with several outliers.

Extreme Observations

Range Values
Lowest None
Highest 142, 134, 124, 113, 103, 101, 98, 97, 95, 93

YOJ

The YOJ distribution is close to normally distributed. There are outliers at both the lower and upper ends.

Extreme Observations

Range Values
Lowest 2
Highest 23

Multicollinearity

This section will test the predictor variables to determine if there is correlation among them. Variance inflaction factor (VIF) is used to detect multicollinearity, specifically among the entire set of predictors versus within pairs of variables.

Testing for collinearity among the predictor variables, we see that none of the numeric predictor variables appear to have a problem with collinearity based on their low VIF scores.

## No variable from the 13 input variables has collinearity problem. 
## 
## The linear correlation coefficients ranges between: 
## min correlation ( TIF ~ AGE ):  0.000275457 
## max correlation ( HOME_VAL ~ INCOME ):  0.5769938 
## 
## ---------- VIFs of the remained variables -------- 
##    Variables      VIF
## 1   KIDSDRIV 1.306299
## 2        AGE 1.406202
## 3   HOMEKIDS 1.704661
## 4        YOJ 1.180470
## 5     INCOME 2.017533
## 6   HOME_VAL 1.569744
## 7   TRAVTIME 1.003802
## 8   BLUEBOOK 1.254096
## 9        TIF 1.003996
## 10  OLDCLAIM 1.327482
## 11  CLM_FREQ 1.461470
## 12   MVR_PTS 1.204448
## 13   CAR_AGE 1.241317

Data Preparation

Missing Values

The majority of cases are complete. Te concern are the 2 predictor variables (CAR_AGE, YOJ) that have more than 5% of missing values.

Predictors without missing values that contain zero values are possible indication zero values are actually missing values. For instance, predictors HOME_VAL and INCOME have zero values which are highly unlikely.

The missing data patterns show that 7,213 out of 8,161 are complete observations, 6 observations are missing the AGE predictor, 432 observations are missing YOJ, 488 observations are missing CAR_AGE and 22 observations are missing YOJ and CAR_AGE.

##      TARGET_FLAG TARGET_AMT KIDSDRIV HOMEKIDS PARENT1 MSTATUS SEX
## 6448           1          1        1        1       1       1   1
##    3           1          1        1        1       1       1   1
##  385           1          1        1        1       1       1   1
##  364           1          1        1        1       1       1   1
##  378           1          1        1        1       1       1   1
##  431           1          1        1        1       1       1   1
##    1           1          1        1        1       1       1   1
##   22           1          1        1        1       1       1   1
##    2           1          1        1        1       1       1   1
##   21           1          1        1        1       1       1   1
##   23           1          1        1        1       1       1   1
##   18           1          1        1        1       1       1   1
##   23           1          1        1        1       1       1   1
##   29           1          1        1        1       1       1   1
##    4           1          1        1        1       1       1   1
##    2           1          1        1        1       1       1   1
##    1           1          1        1        1       1       1   1
##    5           1          1        1        1       1       1   1
##    1           1          1        1        1       1       1   1
##                0          0        0        0       0       0   0
##      EDUCATION JOB TRAVTIME CAR_USE BLUEBOOK TIF CAR_TYPE RED_CAR OLDCLAIM
## 6448         1   1        1       1        1   1        1       1        1
##    3         1   1        1       1        1   1        1       1        1
##  385         1   1        1       1        1   1        1       1        1
##  364         1   1        1       1        1   1        1       1        1
##  378         1   1        1       1        1   1        1       1        1
##  431         1   1        1       1        1   1        1       1        1
##    1         1   1        1       1        1   1        1       1        1
##   22         1   1        1       1        1   1        1       1        1
##    2         1   1        1       1        1   1        1       1        1
##   21         1   1        1       1        1   1        1       1        1
##   23         1   1        1       1        1   1        1       1        1
##   18         1   1        1       1        1   1        1       1        1
##   23         1   1        1       1        1   1        1       1        1
##   29         1   1        1       1        1   1        1       1        1
##    4         1   1        1       1        1   1        1       1        1
##    2         1   1        1       1        1   1        1       1        1
##    1         1   1        1       1        1   1        1       1        1
##    5         1   1        1       1        1   1        1       1        1
##    1         1   1        1       1        1   1        1       1        1
##              0   0        0       0        0   0        0       0        0
##      CLM_FREQ REVOKED MVR_PTS URBANICITY AGE INCOME YOJ HOME_VAL CAR_AGE
## 6448        1       1       1          1   1      1   1        1       1
##    3        1       1       1          1   0      1   1        1       1
##  385        1       1       1          1   1      1   0        1       1
##  364        1       1       1          1   1      0   1        1       1
##  378        1       1       1          1   1      1   1        0       1
##  431        1       1       1          1   1      1   1        1       0
##    1        1       1       1          1   0      0   1        1       1
##   22        1       1       1          1   1      0   0        1       1
##    2        1       1       1          1   0      1   1        0       1
##   21        1       1       1          1   1      1   0        0       1
##   23        1       1       1          1   1      0   1        0       1
##   18        1       1       1          1   1      1   0        1       0
##   23        1       1       1          1   1      0   1        1       0
##   29        1       1       1          1   1      1   1        0       0
##    4        1       1       1          1   1      0   0        0       1
##    2        1       1       1          1   1      0   0        1       0
##    1        1       1       1          1   1      1   0        0       0
##    5        1       1       1          1   1      0   1        0       0
##    1        1       1       1          1   1      0   0        0       0
##             0       0       0          0   6    445 454      464     510
##          
## 6448    0
##    3    1
##  385    1
##  364    1
##  378    1
##  431    1
##    1    2
##   22    2
##    2    2
##   21    2
##   23    2
##   18    2
##   23    2
##   29    2
##    4    3
##    2    3
##    1    3
##    5    3
##    1    4
##      1879

Assumptions of Missing Values

The missing home value data for students and income data for home maker were replaced with zero. This decision was made after examination of the dataset. It is possible that students did not enter home value data because many students does not own a home. Missing income data for home makers maybe due to no information entered since home makers don’t typically earn an income.

Recoding of Predictor Variables

The insurance dataset was recoded to the variables listed below. Most recoding centered around converting factors to dummy variables. Additionally, age-related fields such as AGE and CAR_AGE were also created a ranges for possible benefit during the modeling phase to potentially determine significance within specific ranges.

Transform data - Logistic Regression Dataset

#insurance_trainingT <- read.csv( "https://raw.githubusercontent.com/621-Group2/HW4/master/insurance_training_data_T.csv")

#x1 <- glm(TARGET_FLAG ~., family= binomial(), data = insurance_trainingT)
#car::mmps(x1)

Imput Recoded dataset

BUILD MODELS

Logistic Regression Models

Our first goal is to build logistic models that will be used to predict the likelihood of a crash based on the predictors. We will start with a base model which includes all varaibles.

We check for class bias to make sure that the number of positive cases (accidents) is relatively in balance with the number of negative cases (non-accidents). We seem to have a reasonable amount of both positive and negative cases, and the fact that there are less non-accidents than accidents in the dataset seems to reflect what we would expect in reality.

# Brian - move this up here.  the df Insurance_train wasn't found
insurance_train_impute <- read.csv("https://raw.githubusercontent.com/621-Group2/HW4/master/insurance_training_Impute.csv")

table(insurance_train_impute$TARGET_FLAG)
## 
##    0    1 
## 6008 2153

Model 1 - All Predictors

We will first examine the results of a logistic regression model using all of the predictors. No transformation has been performed on the predictor variables.

## 
## Call:
## glm(formula = target ~ ., family = binomial(link = "logit"), 
##     data = train_all)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4460  -0.7069  -0.3940   0.5608   2.9503  
## 
## Coefficients: (5 not defined because of singularities)
##                                Estimate    Std. Error z value
## (Intercept)               -4.5156774307  1.3856362676  -3.259
## KIDSDRIV                   0.4436739073  0.0740574188   5.991
## MALE                      -0.3812547712  0.1103302323  -3.456
## MARRIED                   -0.4520740025  0.1037317291  -4.358
## SINGLE_PARENT              0.3812830963  0.1341100709   2.843
## LICENSE_REVOKED            0.8912160272  0.1106649915   8.053
## AGE                        0.0122755654  0.0135549946   0.906
## AGE_RANGE_16_19_YRS        1.5162020407  1.6371442238   0.926
## AGE_RANGE_20_29_YRS        1.8555513818  1.0887853725   1.704
## AGE_RANGE_30_39_YRS        1.3748950319  1.0217915333   1.346
## AGE_RANGE_40_49_YRS        0.9676266567  0.9711946882   0.996
## AGE_RANGE_50_59_YRS        1.1852294403  0.9350725097   1.268
## AGE_RANGE_60_69_YRS        1.7696770147  0.9193394808   1.925
## AGE_RANGE_70_YRS_PLUS                NA            NA      NA
## INEXP_DRIVER               0.5868826951  0.8525753654   0.688
## HOMEKIDS                   0.0388910384  0.0453501893   0.858
## YOJ                       -0.0136769013  0.0102354266  -1.336
## INCOME                    -0.0000028830  0.0000013156  -2.191
## HOME_VAL                  -0.0000013388  0.0000004218  -3.174
## TRAVTIME                   0.0132209299  0.0022940177   5.763
## BLUEBOOK                  -0.0000353644  0.0000050797  -6.962
## TIF                       -0.0480361814  0.0088859919  -5.406
## OLDCLAIM                  -0.0000165353  0.0000047287  -3.497
## CLM_FREQ                   0.2170060113  0.0344495394   6.299
## MVR_PTS                    0.1314765628  0.0165186025   7.959
## CAR_AGE                    0.0152189142  0.0182796297   0.833
## CAR_AGE_RANGE_1_YR         0.2450705383  0.2342263021   1.046
## CAR_AGE_RANGE_2_3_YRS      0.5163936164  0.7958336429   0.649
## CAR_AGE_RANGE_3_5_YRS      0.0903700329  0.2867977266   0.315
## CAR_AGE_RANGE_5_10_YRS     0.0777552247  0.1378906799   0.564
## CAR_AGE_RANGE_10_YRS_PLUS            NA            NA      NA
## MAIN_DRIVING_CITY          2.4333560852  0.1397717015  17.410
## RED_CAR                   -0.0958125497  0.1063791164  -0.901
## EDU_HIGH_SCHOOL            0.0704536541  0.1172922774   0.601
## EDU_COLLEGE               -0.3586927612  0.1097308084  -3.269
## EDU_ADV_DEGREE             0.0163360880  0.1674492596   0.098
## VEHICLE_USE_COMMERCIAL     0.4559062825  0.2639983974   1.727
## VEHICLE_CLASS_TRUCK       -0.2567708411  0.1833440128  -1.400
## VEHICLE_CLASS_SUV         -0.3746733481  0.1315306619  -2.849
## VEHICLE_CLASS_CAR                    NA            NA      NA
## SPORTS_CAR                           NA            NA      NA
## RED_SPORTS_CAR             0.0153854435  0.5978064210   0.026
## TRUCK_COMM                 0.7070014217  0.2916129221   2.424
## SUV_COMM                   0.1273187285  0.2726418584   0.467
## CAR_COMM                             NA            NA      NA
## OCCUPATION_CLERICAL        0.2540475486  0.2362478041   1.075
## OCCUPATION_MANAGER        -0.6507643393  0.2046537822  -3.180
## OCCUPATION_BLUE_COLLAR     0.2088493613  0.2235968136   0.934
## OCCUPATION_GOLD_COLLAR    -0.0368443647  0.1896582780  -0.194
## OCCUPATION_STUDENT         0.0565989712  0.2611280051   0.217
## OCCUPATION_HOME_MAKER      0.1282760297  0.2528924048   0.507
## OCCUPATION_PROFESSIONAL    0.0437333423  0.2109696174   0.207
##                                       Pr(>|z|)    
## (Intercept)                           0.001118 ** 
## KIDSDRIV                  0.000000002086254644 ***
## MALE                                  0.000549 ***
## MARRIED                   0.000013119208440227 ***
## SINGLE_PARENT                         0.004468 ** 
## LICENSE_REVOKED           0.000000000000000806 ***
## AGE                                   0.365141    
## AGE_RANGE_16_19_YRS                   0.354380    
## AGE_RANGE_20_29_YRS                   0.088336 .  
## AGE_RANGE_30_39_YRS                   0.178440    
## AGE_RANGE_40_49_YRS                   0.319092    
## AGE_RANGE_50_59_YRS                   0.204967    
## AGE_RANGE_60_69_YRS                   0.054236 .  
## AGE_RANGE_70_YRS_PLUS                       NA    
## INEXP_DRIVER                          0.491223    
## HOMEKIDS                              0.391129    
## YOJ                                   0.181474    
## INCOME                                0.028418 *  
## HOME_VAL                              0.001502 ** 
## TRAVTIME                  0.000000008252383142 ***
## BLUEBOOK                  0.000000000003356510 ***
## TIF                       0.000000064508162518 ***
## OLDCLAIM                              0.000471 ***
## CLM_FREQ                  0.000000000299103262 ***
## MVR_PTS                   0.000000000000001730 ***
## CAR_AGE                               0.405092    
## CAR_AGE_RANGE_1_YR                    0.295423    
## CAR_AGE_RANGE_2_3_YRS                 0.516422    
## CAR_AGE_RANGE_3_5_YRS                 0.752686    
## CAR_AGE_RANGE_5_10_YRS                0.572829    
## CAR_AGE_RANGE_10_YRS_PLUS                   NA    
## MAIN_DRIVING_CITY         < 0.0000000000000002 ***
## RED_CAR                               0.367763    
## EDU_HIGH_SCHOOL                       0.548061    
## EDU_COLLEGE                           0.001080 ** 
## EDU_ADV_DEGREE                        0.922283    
## VEHICLE_USE_COMMERCIAL                0.084181 .  
## VEHICLE_CLASS_TRUCK                   0.161368    
## VEHICLE_CLASS_SUV                     0.004392 ** 
## VEHICLE_CLASS_CAR                           NA    
## SPORTS_CAR                                  NA    
## RED_SPORTS_CAR                        0.979468    
## TRUCK_COMM                            0.015332 *  
## SUV_COMM                              0.640513    
## CAR_COMM                                    NA    
## OCCUPATION_CLERICAL                   0.282221    
## OCCUPATION_MANAGER                    0.001474 ** 
## OCCUPATION_BLUE_COLLAR                0.350281    
## OCCUPATION_GOLD_COLLAR                0.845967    
## OCCUPATION_STUDENT                    0.828405    
## OCCUPATION_HOME_MAKER                 0.611990    
## OCCUPATION_PROFESSIONAL               0.835778    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 6546.4  on 5711  degrees of freedom
## Residual deviance: 5039.0  on 5665  degrees of freedom
## AIC: 5133
## 
## Number of Fisher Scoring iterations: 5

We will examine marginal model plots on the non binary, non factor variables to see if any appear in need of transformation.

Model 2 - All Predictors with some transformed

Based on the marginal model plots above, we will take the log of the following variables to see if it improves the model fit.

There is no statistically significant difference when using the transformed variables based on the below log likelihood test. As such, and because the AIC value is slightly smaller for the non-tranfromed data, we will proceed using the non-transformed data.

## Likelihood ratio test
## 
## Model 1: target ~ KIDSDRIV + MALE + MARRIED + SINGLE_PARENT + LICENSE_REVOKED + 
##     AGE + AGE_RANGE_16_19_YRS + AGE_RANGE_20_29_YRS + AGE_RANGE_30_39_YRS + 
##     AGE_RANGE_40_49_YRS + AGE_RANGE_50_59_YRS + AGE_RANGE_60_69_YRS + 
##     AGE_RANGE_70_YRS_PLUS + INEXP_DRIVER + HOMEKIDS + YOJ + INCOME + 
##     HOME_VAL + TRAVTIME + BLUEBOOK + TIF + OLDCLAIM + CLM_FREQ + 
##     MVR_PTS + CAR_AGE + CAR_AGE_RANGE_1_YR + CAR_AGE_RANGE_2_3_YRS + 
##     CAR_AGE_RANGE_3_5_YRS + CAR_AGE_RANGE_5_10_YRS + CAR_AGE_RANGE_10_YRS_PLUS + 
##     MAIN_DRIVING_CITY + RED_CAR + EDU_HIGH_SCHOOL + EDU_COLLEGE + 
##     EDU_ADV_DEGREE + VEHICLE_USE_COMMERCIAL + VEHICLE_CLASS_TRUCK + 
##     VEHICLE_CLASS_SUV + VEHICLE_CLASS_CAR + SPORTS_CAR + RED_SPORTS_CAR + 
##     TRUCK_COMM + SUV_COMM + CAR_COMM + OCCUPATION_CLERICAL + 
##     OCCUPATION_MANAGER + OCCUPATION_BLUE_COLLAR + OCCUPATION_GOLD_COLLAR + 
##     OCCUPATION_STUDENT + OCCUPATION_HOME_MAKER + OCCUPATION_PROFESSIONAL
## Model 2: target ~ KIDSDRIV + MALE + MARRIED + SINGLE_PARENT + LICENSE_REVOKED + 
##     AGE + AGE_RANGE_16_19_YRS + AGE_RANGE_20_29_YRS + AGE_RANGE_30_39_YRS + 
##     AGE_RANGE_40_49_YRS + AGE_RANGE_50_59_YRS + AGE_RANGE_60_69_YRS + 
##     AGE_RANGE_70_YRS_PLUS + INEXP_DRIVER + HOMEKIDS + YOJ + INCOME + 
##     HOME_VAL + TRAVTIME + BLUEBOOK + TIF + OLDCLAIM + CLM_FREQ + 
##     MVR_PTS + CAR_AGE + CAR_AGE_RANGE_1_YR + CAR_AGE_RANGE_2_3_YRS + 
##     CAR_AGE_RANGE_3_5_YRS + CAR_AGE_RANGE_5_10_YRS + CAR_AGE_RANGE_10_YRS_PLUS + 
##     MAIN_DRIVING_CITY + RED_CAR + EDU_HIGH_SCHOOL + EDU_COLLEGE + 
##     EDU_ADV_DEGREE + VEHICLE_USE_COMMERCIAL + VEHICLE_CLASS_TRUCK + 
##     VEHICLE_CLASS_SUV + VEHICLE_CLASS_CAR + SPORTS_CAR + RED_SPORTS_CAR + 
##     TRUCK_COMM + SUV_COMM + CAR_COMM + OCCUPATION_CLERICAL + 
##     OCCUPATION_MANAGER + OCCUPATION_BLUE_COLLAR + OCCUPATION_GOLD_COLLAR + 
##     OCCUPATION_STUDENT + OCCUPATION_HOME_MAKER + OCCUPATION_PROFESSIONAL
##   #Df  LogLik Df  Chisq            Pr(>Chisq)    
## 1  47 -2519.5                                    
## 2  47 -2519.9  0 0.7974 < 0.00000000000000022 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Stepwise AIC

We proceed in model selection using forward and backward variable selection to minimize AIC values.

## 
## Call:
## glm(formula = target ~ KIDSDRIV + MALE + MARRIED + SINGLE_PARENT + 
##     LICENSE_REVOKED + AGE_RANGE_20_29_YRS + AGE_RANGE_30_39_YRS + 
##     AGE_RANGE_50_59_YRS + AGE_RANGE_60_69_YRS + INCOME + HOME_VAL + 
##     TRAVTIME + BLUEBOOK + TIF + OLDCLAIM + CLM_FREQ + MVR_PTS + 
##     MAIN_DRIVING_CITY + EDU_COLLEGE + VEHICLE_USE_COMMERCIAL + 
##     VEHICLE_CLASS_TRUCK + VEHICLE_CLASS_SUV + TRUCK_COMM + OCCUPATION_MANAGER, 
##     family = binomial(), data = train_all)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4210  -0.7091  -0.3965   0.5650   2.9418  
## 
## Coefficients:
##                             Estimate    Std. Error z value
## (Intercept)            -2.6738521282  0.2092309298 -12.779
## KIDSDRIV                0.4668165400  0.0662150361   7.050
## MALE                   -0.4444944783  0.0881372147  -5.043
## MARRIED                -0.4627243381  0.0986670935  -4.690
## SINGLE_PARENT           0.4057647856  0.1201362905   3.378
## LICENSE_REVOKED         0.8885632276  0.1101999861   8.063
## AGE_RANGE_20_29_YRS     0.7318764674  0.1741951099   4.201
## AGE_RANGE_30_39_YRS     0.3302668286  0.0905015854   3.649
## AGE_RANGE_50_59_YRS     0.3011113991  0.0922250927   3.265
## AGE_RANGE_60_69_YRS     0.9818232160  0.1745944370   5.623
## INCOME                 -0.0000034387  0.0000011628  -2.957
## HOME_VAL               -0.0000012661  0.0000004101  -3.088
## TRAVTIME                0.0132790561  0.0022846053   5.812
## BLUEBOOK               -0.0000355043  0.0000050486  -7.033
## TIF                    -0.0478678881  0.0088461079  -5.411
## OLDCLAIM               -0.0000164749  0.0000047124  -3.496
## CLM_FREQ                0.2174044638  0.0342929153   6.340
## MVR_PTS                 0.1308243448  0.0164179143   7.968
## MAIN_DRIVING_CITY       2.4158454451  0.1381866549  17.482
## EDU_COLLEGE            -0.4274864121  0.0836261860  -5.112
## VEHICLE_USE_COMMERCIAL  0.6264599498  0.1054882858   5.939
## VEHICLE_CLASS_TRUCK    -0.2522446533  0.1750480946  -1.441
## VEHICLE_CLASS_SUV      -0.3604712367  0.1157805632  -3.113
## TRUCK_COMM              0.5677797498  0.1741909711   3.260
## OCCUPATION_MANAGER     -0.7282656807  0.1291164108  -5.640
##                                    Pr(>|z|)    
## (Intercept)            < 0.0000000000000002 ***
## KIDSDRIV               0.000000000001789074 ***
## MALE                   0.000000457786472814 ***
## MARRIED                0.000002735344874061 ***
## SINGLE_PARENT                      0.000731 ***
## LICENSE_REVOKED        0.000000000000000743 ***
## AGE_RANGE_20_29_YRS    0.000026518107696898 ***
## AGE_RANGE_30_39_YRS                0.000263 ***
## AGE_RANGE_50_59_YRS                0.001095 ** 
## AGE_RANGE_60_69_YRS    0.000000018718015554 ***
## INCOME                             0.003103 ** 
## HOME_VAL                           0.002018 ** 
## TRAVTIME               0.000000006158080609 ***
## BLUEBOOK               0.000000000002028562 ***
## TIF                    0.000000062610562919 ***
## OLDCLAIM                           0.000472 ***
## CLM_FREQ               0.000000000230314579 ***
## MVR_PTS                0.000000000000001608 ***
## MAIN_DRIVING_CITY      < 0.0000000000000002 ***
## EDU_COLLEGE            0.000000318980861962 ***
## VEHICLE_USE_COMMERCIAL 0.000000002873463934 ***
## VEHICLE_CLASS_TRUCK                0.149584    
## VEHICLE_CLASS_SUV                  0.001849 ** 
## TRUCK_COMM                         0.001116 ** 
## OCCUPATION_MANAGER     0.000000016967483071 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 6546.4  on 5711  degrees of freedom
## Residual deviance: 5049.8  on 5687  degrees of freedom
## AIC: 5099.8
## 
## Number of Fisher Scoring iterations: 5

As shown below, the stepwise model performs better than both of models which include all predictor models.

kable(all_model_metrics)
Model AIC BIC McFadenR2 HL_Chi HL_p X. Kappa Youden F1Score FPPrct AUC
Model1 5133.008 5445.573 0.230 5712 0 * 0.395 0.487 0.522 15.68 0.809
Model2 - All Pred, Some Transformed 5133.805 5446.371 0.230 5712 0 * 0.394 0.483 0.523 15.60 0.810
Model3 - Step 5099.780 5266.038 0.229 5712 0 * 0.393 0.488 0.520 15.76 0.809

We will now examine the importance of the variables. Based on the below, The top 5 predictors are as follows:

However, it should be noted that there was statistical signifance for a number of variables.

x <- data.frame(varImp(model3))

x$Variable <- rownames(x)

x %>% ggplot(aes(x=reorder(Variable, Overall), y=Overall, fill=Overall)) +
            geom_bar(stat="identity") + coord_flip() + guides(fill=FALSE) +
            xlab("Variable") + ylab("Importance") + 
            ggtitle("Variable Importance")  

exp(model3$coefficients)
##            (Intercept)               KIDSDRIV                   MALE 
##             0.06898597             1.59490877             0.64114831 
##                MARRIED          SINGLE_PARENT        LICENSE_REVOKED 
##             0.62956616             1.50044958             2.43163344 
##    AGE_RANGE_20_29_YRS    AGE_RANGE_30_39_YRS    AGE_RANGE_50_59_YRS 
##             2.07897808             1.39133933             1.35135987 
##    AGE_RANGE_60_69_YRS                 INCOME               HOME_VAL 
##             2.66931855             0.99999656             0.99999873 
##               TRAVTIME               BLUEBOOK                    TIF 
##             1.01336761             0.99996450             0.95325972 
##               OLDCLAIM               CLM_FREQ                MVR_PTS 
##             0.99998353             1.24284669             1.13976756 
##      MAIN_DRIVING_CITY            EDU_COLLEGE VEHICLE_USE_COMMERCIAL 
##            11.19923469             0.65214626             1.87097549 
##    VEHICLE_CLASS_TRUCK      VEHICLE_CLASS_SUV             TRUCK_COMM 
##             0.77705461             0.69734763             1.76434541 
##     OCCUPATION_MANAGER 
##             0.48274550

Model 4 - Robust GLM

Using the variables selected by the stepwise method above in model 3, we now use robust regression to account for influential values.

## 
## Call:  glmrob(formula = target ~ KIDSDRIV + MALE + MARRIED + SINGLE_PARENT +      LICENSE_REVOKED + AGE_RANGE_20_29_YRS + AGE_RANGE_30_39_YRS +      AGE_RANGE_50_59_YRS + AGE_RANGE_60_69_YRS + INCOME + HOME_VAL +      TRAVTIME + BLUEBOOK + TIF + OLDCLAIM + CLM_FREQ + MVR_PTS +      MAIN_DRIVING_CITY + EDU_COLLEGE + VEHICLE_USE_COMMERCIAL +      VEHICLE_CLASS_SUV + TRUCK_COMM + OCCUPATION_MANAGER, family = binomial,      data = train_all, method = "Mqle") 
## 
## 
## Coefficients:
##                             Estimate    Std. Error z value
## (Intercept)            -2.8381794990  0.2241644248 -12.661
## KIDSDRIV                0.4634020889  0.0685031453   6.765
## MALE                   -0.5096301080  0.0844995551  -6.031
## MARRIED                -0.4718594104  0.1028684912  -4.587
## SINGLE_PARENT           0.4296422379  0.1241407977   3.461
## LICENSE_REVOKED         0.8923057418  0.1139822473   7.828
## AGE_RANGE_20_29_YRS     0.7258048156  0.1793860946   4.046
## AGE_RANGE_30_39_YRS     0.3199796235  0.0938741271   3.409
## AGE_RANGE_50_59_YRS     0.2953592718  0.0966993895   3.054
## AGE_RANGE_60_69_YRS     0.9888725094  0.1788670744   5.529
## INCOME                 -0.0000035342  0.0000012169  -2.904
## HOME_VAL               -0.0000013319  0.0000004274  -3.116
## TRAVTIME                0.0126333011  0.0023906766   5.284
## BLUEBOOK               -0.0000374670  0.0000052367  -7.155
## TIF                    -0.0448686059  0.0092457442  -4.853
## OLDCLAIM               -0.0000172254  0.0000048830  -3.528
## CLM_FREQ                0.1970651161  0.0353187393   5.580
## MVR_PTS                 0.1345416020  0.0169008561   7.961
## MAIN_DRIVING_CITY       2.5603458735  0.1615067832  15.853
## EDU_COLLEGE            -0.4101447653  0.0868292355  -4.724
## VEHICLE_USE_COMMERCIAL  0.6981660561  0.1078595587   6.473
## VEHICLE_CLASS_SUV      -0.2772346881  0.0962225974  -2.881
## TRUCK_COMM              0.4112571014  0.1525860175   2.695
## OCCUPATION_MANAGER     -0.8056688419  0.1405016914  -5.734
##                                    Pr(>|z|)    
## (Intercept)            < 0.0000000000000002 ***
## KIDSDRIV                0.00000000001336010 ***
## MALE                    0.00000000162789641 ***
## MARRIED                 0.00000449626334734 ***
## SINGLE_PARENT                      0.000538 ***
## LICENSE_REVOKED         0.00000000000000494 ***
## AGE_RANGE_20_29_YRS     0.00005208947428332 ***
## AGE_RANGE_30_39_YRS                0.000653 ***
## AGE_RANGE_50_59_YRS                0.002255 ** 
## AGE_RANGE_60_69_YRS     0.00000003229198333 ***
## INCOME                             0.003681 ** 
## HOME_VAL                           0.001834 ** 
## TRAVTIME                0.00000012611468336 ***
## BLUEBOOK                0.00000000000083889 ***
## TIF                     0.00000121673449813 ***
## OLDCLAIM                           0.000419 ***
## CLM_FREQ                0.00000002410457060 ***
## MVR_PTS                 0.00000000000000171 ***
## MAIN_DRIVING_CITY      < 0.0000000000000002 ***
## EDU_COLLEGE             0.00000231729723425 ***
## VEHICLE_USE_COMMERCIAL  0.00000000009612826 ***
## VEHICLE_CLASS_SUV                  0.003962 ** 
## TRUCK_COMM                         0.007034 ** 
## OCCUPATION_MANAGER      0.00000000979570387 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Robustness weights w.r * w.x: 
##  5051 weights are ~= 1. The remaining 661 ones are summarized as
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1432  0.5474  0.7114  0.6894  0.8641  0.9961 
## 
## Number of observations: 5712 
## Fitted by method 'Mqle'  (in 4 iterations)
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
## No deviance values available 
## Algorithmic parameters: 
##    acc    tcc 
## 0.0001 1.3450 
## maxit 
##    50 
## test.acc 
##   "coef"

Accuracy of Model 3

## [1] "Accuracy Model3: 0.787668436096366"

## [1] "Area Under the Curve - Model 3: 0.808535371704655"

Accuracy of Model 4

## [1] "Accuracy Model4: 0.787260106165782"

## [1] "Area Under the Curve - Model 4: 0.808304222548732"

We now see how accurate our predictions are of the test set. It appears although they are very similar the non robust method has a slightly higher accuracy. As such we will use model 3, the stepwise method on the non-transformed data.

table(model3fit)
## model3fit
##    0    1 
## 2033  416

Multiple Linear Regression Models

## regression model metrics

lm_model_metrics <- data.frame()

all_lm_predictions <- list()


# Valerie - I started creating an equivalent calc_metrics for linear regression models.
#           metrics are slighlty different. will need to add more like adj R-squared etc.
calc_metrics_lm <- function(model_name, model, test, train, log_trans=FALSE, robust=FALSE) {
  
  
  predicted <- predict(model, test)  # predict on test data
  
  predicted <- if (log_trans) {exp(predicted)} else { predicted }
  actual <-    if (log_trans)  {exp(test$TARGET_AMT)} else {  test$TARGET_AMT}


  compare <- cbind (actual=actual, predicted)  # combine

  model_accuracy <- mean (apply(compare, 1, min)/apply(compare, 1, max))
  
  #MAPE calculates the mean absolute percentage error:
  
  #SMAPE calculates the symmetric mean absolute percentage error:
    
  #MSE calculates mean squared error:

  # RMSE calculates the root mean squared error:

  #https://cran.r-project.org/web/packages/sjPlot/vignettes/sjtlm.html
  #https://stackoverflow.com/questions/30147756/exporting-r-regression-summary-for-publishable-paper
  
  metrics_df <- data.frame(Model=model_name, 
                           AIC=ifelse(robust, NA, round(AIC(model), 3)), 
                           BIC=ifelse(robust, NA, round(BIC(model), 3)), 
                           accuracy=round(model_accuracy, 3), #acc=Metrics::accuracy(compare[, 1], compare[, 2]),
                           MAE=Metrics::mae(compare[, 1], compare[, 2]),
                           MAPE=Metrics::mape(compare[, 1], compare[, 2]),     #MAPE calculates the mean absolute percentage error:
                           SMAPE=Metrics::smape(compare[, 1], compare[, 2]),     #SMAPE calculates the symmetric mean absolute percentage error:
                           MSE=Metrics::mse(compare[, 1], compare[, 2]),       #MSE calculates mean squared error
                           RMSE=Metrics::rmse(compare[, 1], compare[, 2]))     #RMSE calculates the root mean squared error:
                      

  # Result
  result <- list(metrics_df, predicted, compare)

  
  return (result)
  
}

# m <- calc_metrics_lm('t', model2, dev_test, dev_train)

Model 1 Multiple Regression - Baseline Model Non-transformed variables

The base multiple linear regression model of imputed data. This model predicts the cost if a car is in a crash (TARGET_AMT) using all predictor variables. This shows the estimated cost is negative the assumption the car did not crash.

The following conclusions can be made from the model: * The likelihood that a car will crash increase by 3.9% when the insurer drives with kids * The likelihood that a car will crash declined by -5.9% when the insurer is married. * The likelihood that a car will crash increased by 5.4% when the insurer’s license was revoked in the past 7 years.

The R-squared = .07 we reject the null hypothesis

Model Diagnosis

Residual vs Fitted Plot

The plot shows the residuals have a linerar pattern – the majorty are close to the line. There could be a linear relationship between predictor variabes and an outcome bariable. There appear to be some bad leverage outliers – 76910, 85,383 and 7.072.

Scale-Location Plot

This plot shows that residuals are not equally spread along the range of predictors. Thus, the variance is not equal.

Normal Q-Q Plot

This plot shows the residuals are not normally distrubuted.

Residual vs. Leverage Plot

This plot shows there are some infuential outliers. There are cases far beyond the Cook’s distance lines – the other residuals appear to be clustered on the left. The influential observations are 7691, 7270 and 29030.

## 
## Call:
## lm(formula = TARGET_AMT ~ ., data = dev_train)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -10398  -3200  -1561    588  97423 
## 
## Coefficients: (5 not defined because of singularities)
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               -8.771e+02  8.414e+03  -0.104   0.9170    
## KIDSDRIV                  -6.242e+02  4.044e+02  -1.544   0.1229    
## MALE                       1.111e+03  6.608e+02   1.681   0.0929 .  
## MARRIED                   -9.671e+02  6.208e+02  -1.558   0.1195    
## SINGLE_PARENT              8.878e+02  7.191e+02   1.235   0.2171    
## LICENSE_REVOKED           -9.731e+02  6.331e+02  -1.537   0.1245    
## AGE                        6.286e+01  7.833e+01   0.803   0.4224    
## AGE_RANGE_16_19_YRS        1.028e+03  9.058e+03   0.113   0.9097    
## AGE_RANGE_20_29_YRS        1.911e+03  6.708e+03   0.285   0.7758    
## AGE_RANGE_30_39_YRS        2.092e+03  6.339e+03   0.330   0.7414    
## AGE_RANGE_40_49_YRS        1.555e+03  6.066e+03   0.256   0.7978    
## AGE_RANGE_50_59_YRS        1.959e+03  5.852e+03   0.335   0.7379    
## AGE_RANGE_60_69_YRS        3.137e+02  5.764e+03   0.054   0.9566    
## AGE_RANGE_70_YRS_PLUS             NA         NA      NA       NA    
## INEXP_DRIVER               3.893e+02  4.145e+03   0.094   0.9252    
## HOMEKIDS                   3.337e+02  2.550e+02   1.308   0.1909    
## YOJ                       -1.950e+01  6.018e+01  -0.324   0.7459    
## INCOME                    -1.345e-02  8.177e-03  -1.645   0.1001    
## HOME_VAL                   5.128e-03  2.527e-03   2.029   0.0426 *  
## TRAVTIME                   3.529e-01  1.360e+01   0.026   0.9793    
## BLUEBOOK                   1.294e-01  2.890e-02   4.479  8.1e-06 ***
## TIF                       -1.981e+01  5.103e+01  -0.388   0.6979    
## OLDCLAIM                   2.062e-02  2.728e-02   0.756   0.4500    
## CLM_FREQ                  -9.915e+01  1.912e+02  -0.519   0.6041    
## MVR_PTS                    7.726e+01  8.457e+01   0.914   0.3611    
## CAR_AGE                   -1.183e+02  1.163e+02  -1.017   0.3093    
## CAR_AGE_RANGE_1_YR        -9.294e+02  1.420e+03  -0.655   0.5128    
## CAR_AGE_RANGE_2_3_YRS     -2.958e+03  5.781e+03  -0.512   0.6089    
## CAR_AGE_RANGE_3_5_YRS      1.750e+03  1.617e+03   1.082   0.2793    
## CAR_AGE_RANGE_5_10_YRS    -6.555e+01  8.158e+02  -0.080   0.9360    
## CAR_AGE_RANGE_10_YRS_PLUS         NA         NA      NA       NA    
## MAIN_DRIVING_CITY          8.146e+02  1.024e+03   0.795   0.4266    
## RED_CAR                    2.161e+02  6.218e+02   0.348   0.7282    
## EDU_HIGH_SCHOOL           -5.906e+02  6.415e+02  -0.921   0.3573    
## EDU_COLLEGE                3.634e+02  6.230e+02   0.583   0.5598    
## EDU_ADV_DEGREE             1.754e+03  1.063e+03   1.650   0.0991 .  
## VEHICLE_USE_COMMERCIAL    -2.229e+01  1.472e+03  -0.015   0.9879    
## VEHICLE_CLASS_TRUCK       -1.180e+03  1.137e+03  -1.038   0.2996    
## VEHICLE_CLASS_SUV         -3.807e+02  7.765e+02  -0.490   0.6240    
## VEHICLE_CLASS_CAR                 NA         NA      NA       NA    
## SPORTS_CAR                        NA         NA      NA       NA    
## RED_SPORTS_CAR             2.334e+03  2.745e+03   0.850   0.3953    
## TRUCK_COMM                 7.195e+02  1.663e+03   0.433   0.6654    
## SUV_COMM                   3.220e+02  1.525e+03   0.211   0.8327    
## CAR_COMM                          NA         NA      NA       NA    
## OCCUPATION_CLERICAL        9.338e+02  1.457e+03   0.641   0.5218    
## OCCUPATION_MANAGER         8.627e+01  1.270e+03   0.068   0.9459    
## OCCUPATION_BLUE_COLLAR     1.360e+03  1.382e+03   0.984   0.3253    
## OCCUPATION_GOLD_COLLAR     6.541e+02  1.182e+03   0.553   0.5802    
## OCCUPATION_STUDENT         6.179e+02  1.570e+03   0.394   0.6939    
## OCCUPATION_HOME_MAKER      8.776e+02  1.549e+03   0.566   0.5712    
## OCCUPATION_PROFESSIONAL    1.367e+03  1.341e+03   1.020   0.3081    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7832 on 1461 degrees of freedom
## Multiple R-squared:  0.04256,    Adjusted R-squared:  0.01241 
## F-statistic: 1.412 on 46 and 1461 DF,  p-value: 0.03717

Model 2 - Stepwise Variable Selection Using Non-transformed Data

The second model uses a stepwise variable selection with linear regression. We chose to use both the “forward” and “backward” methods to obtain the optimal model based on the lowest AIC (Akaike Information Criterion). This modeling approach uses the non-transformed, recoded dataset.

null.model <- lm(TARGET_AMT ~ 1 , data= dev_train)  # base intercept only model
full.model <- lm(TARGET_AMT ~ . , data= dev_train)  # full model with all predictors

# perform step-wise algorithm

model2 <- step(null.model, scope = list(lower = null.model, upper = full.model), direction = "both", trace = 0, steps = 1000) 

Model Summary

summary(model2)
## 
## Call:
## lm(formula = TARGET_AMT ~ BLUEBOOK + MALE + CAR_AGE_RANGE_3_5_YRS + 
##     SINGLE_PARENT + AGE_RANGE_50_59_YRS, data = dev_train)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -9194  -3165  -1623    401  97838 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           3.155e+03  4.601e+02   6.857 1.03e-11 ***
## BLUEBOOK              1.160e-01  2.426e-02   4.784 1.89e-06 ***
## MALE                  9.376e+02  4.052e+02   2.314   0.0208 *  
## CAR_AGE_RANGE_3_5_YRS 2.482e+03  1.205e+03   2.060   0.0396 *  
## SINGLE_PARENT         1.104e+03  4.930e+02   2.240   0.0253 *  
## AGE_RANGE_50_59_YRS   1.072e+03  5.088e+02   2.107   0.0353 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7784 on 1502 degrees of freedom
## Multiple R-squared:  0.0275, Adjusted R-squared:  0.02426 
## F-statistic: 8.494 on 5 and 1502 DF,  p-value: 6.111e-08

AIC (Akaike Information Criterion) for Model 2 = 3.1310510^{4}
BIC (Bayesian Information Criterion) for Model 2 = 3.13477310^{4}

Most Significant Variables

Variable Coefficient
BLUEBOOK 0.1160452
MALE 937.5715330
CAR_AGE_RANGE_3_5_YRS 2482.0278004
SINGLE_PARENT 1104.1588654
AGE_RANGE_50_59_YRS 1071.8562734

Variable Importance

Interpretation. Five predictor variables are among the most significant in this model - BLUEBOOk, MARRIED, AGE_RANGE_16_19_YRS, HOME_VAL, INCOME, and MALE. Of these, BLUEBOOK, MARRIED, AGE_RANGE_16_19_YRS, and HOME_VAL are the most statistically significant with a p-value less than a signficance value of 0.05. The predictors INCOME and MALE were not statistically significant, although it’s interesting to note that there may be a potential relationship between male drivers and a higher payout amount.

BLUEBOOK value intuitively makes sense as a significant and important predictor in the payout amount; however, the corresponding coefficient of nearly 0 is not intuitive. We see that the predictor MARRIED is associated with a negative coefficient of -1423. This indicating that married drivers tend to have lower payout amounts as compared to single drivers, potentially a strong indicator of safer driving.

Conversely, the predictor AGE_RANGE_16_19_YRS is significant with a positive coefficient of 6,676 indicating that drivers between the ages of 16 to 19 are a less safe group and are associated with higher claim payout amounts.

The Adjusted R-squared value associated with this model is very low at 0.2235. There is a significant amount of variability unaccounted for by this model in the response variable. The F-statistic or F- Test for overall significance does show significance.

Diagnostic Plots

The diagnostics plots below are examined to test the assumptions of linear regression.

The diagnostic plots reveal some potential issues with this model. The Residuals vs. Fitted plot shows a downward trend – as the fitted values increase on the x-axis, the residuals decrease. We would expect to see a flat line if there is homoscedasticity (residuals of equal variance). Heteroscedascity can also seen in the Scale-Location plot. Again we would expect to see a relatively flat trend compared to the updward trend of the red line.

Heteroscedasticity can be confirmed statistically using the NCV test:

car::ncvTest(model2)
## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 701.1675    Df = 1     p = 1.666557e-154

The p-value less than a signficance value of 0.05 confirms that there is definitely a pattern in the residuals (heteroscedasticity).

The Normal Q-Q plot shows a issue with the requirement of normal distibution of residuals. We see a step increase approaching the second quantile confirming violations of normality, which could affect the coefficients of the model. This plus the heteroscedasticity are strong indicators that the a transformation may be required.

Multicollinearity does not appear to be a problem.

car::vif(model2)
##              BLUEBOOK                  MALE CAR_AGE_RANGE_3_5_YRS 
##              1.011677              1.010687              1.001082 
##         SINGLE_PARENT   AGE_RANGE_50_59_YRS 
##              1.047432              1.049612

Additionally, this model is impacted by outliers as shown by Cook’s distance and the leverage plots.

car::outlierTest(model2)
##       rstudent unadjusted p-value Bonferonni p
## 2038 13.485005         3.3268e-39   5.0168e-36
## 1858  9.497724         8.1307e-21   1.2261e-17
## 2063  9.062576         3.8682e-19   5.8333e-16
## 640   8.863147         2.1543e-18   3.2487e-15
## 1357  6.800948         1.4960e-11   2.2559e-08
## 1137  6.788816         1.6232e-11   2.4478e-08
## 1552  6.655228         3.9537e-11   5.9621e-08
## 251   6.158254         9.4246e-10   1.4212e-06
## 1639  6.039104         1.9502e-09   2.9410e-06
## 1200  5.905257         4.3472e-09   6.5556e-06
plot(cooks.distance(model2), pch=23, bg='orange', cex=2, ylab="Cook's distance")

Removal of these two outliers does not change the model in any significant way.

## 
## Call:
## lm(formula = TARGET_AMT ~ BLUEBOOK + MALE + CAR_AGE_RANGE_3_5_YRS + 
##     SINGLE_PARENT + AGE_RANGE_50_59_YRS, data = dev_train_upd)
## 
## Coefficients:
##           (Intercept)               BLUEBOOK                   MALE  
##             3633.6680                 0.1005               666.5298  
## CAR_AGE_RANGE_3_5_YRS          SINGLE_PARENT    AGE_RANGE_50_59_YRS  
##              173.9688               820.8188               492.4396  
## 
## 
## ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
## USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
## Level of Significance =  0.05 
## 
## Call:
##  gvlma(x = mod2) 
## 
##                        Value p-value                   Decision
## Global Stat        5.781e+04 0.00000 Assumptions NOT satisfied!
## Skewness           5.637e+03 0.00000 Assumptions NOT satisfied!
## Kurtosis           5.217e+04 0.00000 Assumptions NOT satisfied!
## Link Function      3.445e+00 0.06343    Assumptions acceptable.
## Heteroscedasticity 5.444e-01 0.46063    Assumptions acceptable.

Transform Multiple Regression Model - Natural Log

Data Preparation

Log transformation was used to transform the imputed data set. Trasformation was applied to the response variable (TARGET_AMT). To adopt the dataset to the requirements of log transformation all zero value observations were removed. The decsion to remove zero values from the response variable instead of adding a value to prevent any data distortion.

Several recoded predictors that were not significant to the model were removed before transformation. The remaining dataset contained 2,152 observations.

Model 3 - Stepwise Variable Selection using Transformed Data

Model 1

Model one is a baseline model of the transformed response variale against all predictors. The summary of the regression model shows F=1.56 with a p-value=0.017, indicating that we accept the null hypothesis that the predictors collectively have a significant effect on the amount paid when there is a crash.

Model Diagnosis

## 
## Call:
## lm(formula = TARGET_AMT ~ ., data = insurance_trainT)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5698 -0.4048  0.0225  0.4055  3.2292 
## 
## Coefficients: (2 not defined because of singularities)
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              8.003e+00  2.411e-01  33.199  < 2e-16 ***
## KIDSDRIV                -9.816e-02  4.069e-02  -2.413   0.0160 *  
## MALE                     2.899e-02  6.789e-02   0.427   0.6694    
## MARRIED                 -1.357e-01  6.408e-02  -2.118   0.0343 *  
## SINGLE_PARENT            3.259e-02  7.449e-02   0.438   0.6618    
## LICENSE_REVOKED         -1.023e-01  6.561e-02  -1.560   0.1191    
## AGE                      4.606e-03  2.717e-03   1.695   0.0902 .  
## INEXP_DRIVER            -2.395e-02  3.300e-01  -0.073   0.9421    
## HOMEKIDS                 4.174e-02  2.620e-02   1.593   0.1113    
## YOJ                      1.603e-03  6.227e-03   0.257   0.7969    
## INCOME                  -1.425e-06  8.343e-07  -1.707   0.0880 .  
## HOME_VAL                 3.246e-07  2.618e-07   1.240   0.2152    
## TRAVTIME                -4.059e-04  1.409e-03  -0.288   0.7733    
## BLUEBOOK                 1.180e-05  2.996e-06   3.941 8.51e-05 ***
## TIF                     -1.102e-03  5.291e-03  -0.208   0.8350    
## OLDCLAIM                 5.231e-06  2.829e-06   1.849   0.0646 .  
## CLM_FREQ                -3.556e-02  1.979e-02  -1.797   0.0726 .  
## MVR_PTS                  1.148e-02  8.690e-03   1.321   0.1868    
## CAR_AGE                  3.729e-04  4.865e-03   0.077   0.9389    
## MAIN_DRIVING_CITY        5.774e-02  1.057e-01   0.546   0.5850    
## RED_CAR                  9.248e-02  6.435e-02   1.437   0.1509    
## EDU_HIGH_SCHOOL         -3.564e-02  6.462e-02  -0.552   0.5813    
## VEHICLE_CLASS_TRUCK     -5.336e-02  1.158e-01  -0.461   0.6450    
## VEHICLE_CLASS_SUV       -1.989e-02  7.930e-02  -0.251   0.8019    
## VEHICLE_CLASS_CAR               NA         NA      NA       NA    
## SPORTS_CAR                      NA         NA      NA       NA    
## RED_SPORTS_CAR           2.885e-01  2.816e-01   1.025   0.3058    
## TRUCK_COMM               2.969e-02  9.340e-02   0.318   0.7506    
## SUV_COMM                 1.273e-02  8.531e-02   0.149   0.8814    
## CAR_COMM                -4.774e-02  1.522e-01  -0.314   0.7539    
## OCCUPATION_CLERICAL     -8.226e-02  1.230e-01  -0.669   0.5036    
## OCCUPATION_MANAGER      -1.012e-02  1.232e-01  -0.082   0.9345    
## OCCUPATION_BLUE_COLLAR  -6.049e-02  1.113e-01  -0.543   0.5870    
## OCCUPATION_GOLD_COLLAR   4.108e-02  1.224e-01   0.335   0.7373    
## OCCUPATION_STUDENT      -1.012e-01  1.386e-01  -0.730   0.4654    
## OCCUPATION_HOME_MAKER   -8.858e-02  1.464e-01  -0.605   0.5452    
## OCCUPATION_PROFESSIONAL -1.958e-02  1.128e-01  -0.174   0.8622    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.815 on 1473 degrees of freedom
## Multiple R-squared:  0.03497,    Adjusted R-squared:  0.0127 
## F-statistic:  1.57 on 34 and 1473 DF,  p-value: 0.01995

Step Multiple Regression - Backward

The output of the Backward AIC Step regression shows that most of the predictors do not have significance to the total amount paid when there is a crash.

Step Multiple Regression - Forward

##         Step Df  Deviance Resid. Df Resid. Dev       AIC
## 1            NA        NA      1507  1013.8642 -596.6990
## 2 + BLUEBOOK -1 13.532555      1506  1000.3317 -614.9626
## 3  + RED_CAR -1  3.565886      1505   996.7658 -618.3478
## 4  + MARRIED -1  3.380337      1504   993.3854 -621.4706
## 5 + KIDSDRIV -1  1.982911      1503   991.4025 -622.4837

Model 4 - Robust Regression

Model 4 applies robust linear regression to the transormed datset and model used in Model 3. Robust regression is applied using the robustbase package. Robust regression is an option for modeling data where outliers may be signicanty impacting the resulting model fit.

For the purposes of the robust model build, we will only consider the non-indicator predictor variables from the dataset.

dt <- dev_train[, c("TARGET_AMT", "KIDSDRIV",  "AGE",  "INCOME", 
                   "HOME_VAL", "TRAVTIME", "BLUEBOOK",  "OLDCLAIM", "MVR_PTS")]

model4 <- robustbase::lmrob(formula(fit5),  data = insurance_trainT, setting = "KS2014")

model4_metrics <- calc_metrics_lm("lm-Model4", model4, insurance_train_test, insurance_trainT, log_trans = T, robust = T)
lm_model_metrics <- rbind(lm_model_metrics, model4_metrics[[1]])

Model Summary

summary(model4)
## 
## Call:
## robustbase::lmrob(formula = formula(fit5), data = insurance_trainT, setting = "KS2014")
##  \--> method = "SMDM"
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -4.78197 -0.39383  0.05044  0.40094  3.26439 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  8.209e+00  4.129e-02 198.830   <2e-16 ***
## BLUEBOOK     5.034e-06  2.159e-06   2.331   0.0199 *  
## RED_CAR      7.488e-02  3.919e-02   1.910   0.0563 .  
## MARRIED     -5.359e-02  3.544e-02  -1.512   0.1307    
## KIDSDRIV    -2.173e-02  2.915e-02  -0.745   0.4561    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Robust residual standard error: 0.6947 
## Multiple R-squared:  0.008349,   Adjusted R-squared:  0.00571 
## Convergence in 8 IRWLS iterations
## 
## Robustness weights: 
##  1112 weights are ~= 1. The remaining 396 ones are summarized as
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.006382 0.599700 0.841900 0.752400 0.965700 0.999000 
## Algorithmic parameters: 
##       tuning.chi1       tuning.chi2       tuning.chi3       tuning.chi4 
##        -5.000e-01         1.500e+00                NA         5.000e-01 
##                bb       tuning.psi1       tuning.psi2       tuning.psi3 
##         5.000e-01        -5.000e-01         1.500e+00         9.500e-01 
##       tuning.psi4        refine.tol           rel.tol         solve.tol 
##                NA         1.000e-07         1.000e-07         1.000e-07 
##       eps.outlier             eps.x warn.limit.reject warn.limit.meanrw 
##         6.631e-05         1.054e-07         5.000e-01         5.000e-01 
##      nResample         max.it       best.r.s       k.fast.s          k.max 
##           1000            500             20              2           2000 
##    maxit.scale      trace.lev            mts     compute.rd      numpoints 
##            200              0           1000              0             10 
## fast.s.large.n 
##           2000 
##               setting                   psi           subsampling 
##              "KS2014"                 "lqq"         "nonsingular" 
##                   cov compute.outlier.stats 
##             ".vcov.w"                "SMDM" 
## seed : int(0)

The resulting summary shows that the function lmrob considers only MVR_PTS and MVR_PTS important as a predictor variables. No observations are identified as having been allocated very small weights due to their identification as outliers.

Diagnostic Plots

The diagnostics plots below are examined to test the assumptions of linear regression.

## recomputing robust Mahalanobis distances
## Warning in robMD(x = if (!is.null(x[["x"]])) x$x else if (!
## is.null(x[["model"]])) model.matrix(x, : Failed to compute robust
## Mahalanobis distances, reverting to robust leverages.
## saving the robust distances 'MD' as part of 'model4'

The resulting Residual vs. Fitted Values plot looks more evenly distributed, indicating a better model fit. However, the Normal Q-Q Plot still shows a step incline as the plot moves closer to the 2nd theoretical quantile value.

Linear Regression Model Metrics

kable(lm_model_metrics)
Model AIC BIC accuracy MAE MAPE SMAPE MSE RMSE
lm-Model1 31368.962 31624.252 0.579 3726.057 1.2715961 0.6021118 55933218 7478.851
lm-Model2 31310.500 31347.730 0.594 3581.130 1.2195981 0.5782837 55215058 7430.684
lm-Model3 3659.035 3690.946 0.629 3100.011 0.7951091 0.5167241 57420909 7577.659
lm-Model4 NA NA 0.631 3077.422 0.7836996 0.5118868 57583438 7588.375

Logistic Regression Model Metrics

kable(all_model_metrics)
Model AIC BIC McFadenR2 HL_Chi HL_p X. Kappa Youden F1Score FPPrct AUC
Model1 5133.008 5445.573 0.230 5712 0 * 0.395 0.487 0.522 15.68 0.809
Model2 - All Pred, Some Transformed 5133.805 5446.371 0.230 5712 0 * 0.394 0.483 0.523 15.60 0.810
Model3 - Step 5099.780 5266.038 0.229 5712 0 * 0.393 0.488 0.520 15.76 0.809