Introduction

The purpose of this homework assignment is to explore, analyze and model a dataset containing 8161 observations and 26 variables. Each record in the dataset represents a customer at an auto insurance company.

Each record has two response variables. The first response variable, TARGET_FLAG, is binary (0, 1). If the person was in a car crash the value is 1, and if the person was not in a car crash the value is 0.

The second response variable is TARGET_AMT. If the person did not crash their car the value is 0, and if they did crash their car the value is greater than 0 and gives the cost of the crash.

Objective and Requirements

The objective is to build multiple linear regression and binary logistic regression models on the training data to predict the probability that a person will crash their car and also the amount of money it will cost if the person does crash their car.

Only the variables provided, or variables derived from them, may be used.

Approach

The team met to discuss this assignment and plan an approach to complete it. Each of the 5 team members was assigned tasks. The following tasks were assigned:

Data Exploration, Data Preparation, Build Models, Select Models

GitHub was used to manage the project. Using GitHub helped with version control and ensured each team member had access to the latest version of the project documentation. Slack was used by the team to communicate during the project and for quick access to code and documentation.

Dataset

For reproducibility of the results, the data was uploaded to and accessed from a GitHub repository.

Data Exploration

Several of the predictor variables contain missing values and outliers. Imputation will be used for the missing values.

Missing Values

The majority of variables do not contain missing values. The predictor CAR_AGE (Vehicle Age) contains 510 missing values, HOME_VAL (Home Value) contains 464, YOJ (Years on Job) contains 454, INCOME contains 445 and AGE contains 6.

## TARGET_FLAG  TARGET_AMT    KIDSDRIV         AGE    HOMEKIDS         YOJ 
##           0           0           0           6           0         454 
##      INCOME     PARENT1    HOME_VAL     MSTATUS         SEX   EDUCATION 
##         445           0         464           0           0           0 
##         JOB    TRAVTIME     CAR_USE    BLUEBOOK         TIF    CAR_TYPE 
##           0           0           0           0           0           0 
##     RED_CAR    OLDCLAIM    CLM_FREQ     REVOKED     MVR_PTS     CAR_AGE 
##           0           0           0           0           0         510 
##  URBANICITY 
##           0
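As a minimal sketch of how per-variable missing counts like those above can be produced, the following uses a small made-up data frame (`toy` and its values are illustrative assumptions, not the insurance data) and base R only.

```r
# Toy data frame standing in for the insurance data (values are made up).
toy <- data.frame(
  AGE     = c(45, NA, 38, 52),
  YOJ     = c(11, 9, NA, NA),
  CAR_AGE = c(8, 1, NA, 12)
)

# Number of missing values in each column.
missing_counts <- colSums(is.na(toy))

# Share of rows that are missing in each column.
missing_share <- colMeans(is.na(toy))

print(missing_counts)
print(round(missing_share, 2))
```

The same two calls applied to the full training data would give one count and one proportion per variable.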

Variables

| Variable Name | Definition | Variable Type |
|---|---|---|
| TARGET_FLAG | Was Car in a crash? 1=YES 0=NO | Response |
| TARGET_AMT | If car was in a crash, what was the cost | Response |
| AGE | Age of Driver | Predictor |
| BLUEBOOK | Value of Vehicle | Predictor |
| CAR_AGE | Vehicle Age | Predictor |
| CAR_TYPE | Type of Car | Predictor |
| CAR_USE | Vehicle Use | Predictor |
| CLM_FREQ | # Claims (Past 5 Years) | Predictor |
| EDUCATION | Max Education Level | Predictor |
| HOMEKIDS | # Children at Home | Predictor |
| HOME_VAL | Home Value | Predictor |
| INCOME | Income | Predictor |
| JOB | Job Category | Predictor |
| KIDSDRIV | # Driving Children | Predictor |
| MSTATUS | Marital Status | Predictor |
| MVR_PTS | Motor Vehicle Record Points | Predictor |
| OLDCLAIM | Total Claims (Past 5 Years) | Predictor |
| PARENT1 | Single Parent | Predictor |
| RED_CAR | A Red Car | Predictor |
| REVOKED | License Revoked (Past 7 Years) | Predictor |
| SEX | Gender | Predictor |
| TIF | Time in Force | Predictor |
| TRAVTIME | Distance to Work | Predictor |
| URBANICITY | Home/Work Area | Predictor |
| YOJ | Years on Job | Predictor |

Descriptive Statistics

Descriptive statistics were computed for all predictor and response variables to explore the data.

##   TARGET_FLAG TARGET_AMT KIDSDRIV         AGE HOMEKIDS        YOJ
## 1           0          0        0 0.000735204        0 0.05563044
##       INCOME PARENT1   HOME_VAL MSTATUS SEX EDUCATION JOB TRAVTIME CAR_USE
## 1 0.05452763       0 0.05685578       0   0         0   0        0       0
##   BLUEBOOK TIF CAR_TYPE RED_CAR OLDCLAIM CLM_FREQ REVOKED MVR_PTS
## 1        0   0        0       0        0        0       0       0
##      CAR_AGE URBANICITY
## 1 0.06249234          0
##             vars    n      mean        sd median   trimmed       mad  min
## TARGET_FLAG    1 8161      0.26      0.44      0      0.20      0.00    0
## TARGET_AMT     2 8161   1504.32   4704.03      0    593.71      0.00    0
## KIDSDRIV       3 8161      0.17      0.51      0      0.03      0.00    0
## AGE            4 8155     44.79      8.63     45     44.83      8.90   16
## HOMEKIDS       5 8161      0.72      1.12      0      0.50      0.00    0
## YOJ            6 7707     10.50      4.09     11     11.07      2.97    0
## INCOME         7 7716  61898.09  47572.68  54028  56840.98  41792.27    0
## PARENT1*       8 8161      1.13      0.34      1      1.04      0.00    1
## HOME_VAL       9 7697 154867.29 129123.77 161160 144032.07 147867.11    0
## MSTATUS*      10 8161      1.40      0.49      1      1.38      0.00    1
## SEX*          11 8161      1.54      0.50      2      1.55      0.00    1
## EDUCATION*    12 8161      3.09      1.44      3      3.11      1.48    1
## JOB*          13 8161      5.69      2.68      6      5.81      2.97    1
## TRAVTIME      14 8161     33.49     15.91     33     33.00     16.31    5
## CAR_USE*      15 8161      1.63      0.48      2      1.66      0.00    1
## BLUEBOOK      16 8161  15709.90   8419.73  14440  15036.89   8450.82 1500
## TIF           17 8161      5.35      4.15      4      4.84      4.45    1
## CAR_TYPE*     18 8161      3.53      1.97      3      3.54      2.97    1
## RED_CAR*      19 8161      1.29      0.45      1      1.24      0.00    1
## OLDCLAIM      20 8161   4037.08   8777.14      0   1719.29      0.00    0
## CLM_FREQ      21 8161      0.80      1.16      0      0.59      0.00    0
## REVOKED*      22 8161      1.12      0.33      1      1.03      0.00    1
## MVR_PTS       23 8161      1.70      2.15      1      1.31      1.48    0
## CAR_AGE       24 7651      8.33      5.70      8      7.96      7.41   -3
## URBANICITY*   25 8161      1.20      0.40      1      1.13      0.00    1
##                  max    range  skew kurtosis      se    IQR   Q0.1 Q0.25
## TARGET_FLAG      1.0      1.0  1.07    -0.85    0.00      1    0.0     0
## TARGET_AMT  107586.1 107586.1  8.71   112.29   52.07   1036    0.0     0
## KIDSDRIV         4.0      4.0  3.35    11.78    0.01      0    0.0     0
## AGE             81.0     65.0 -0.03    -0.06    0.10     12   34.0    39
## HOMEKIDS         5.0      5.0  1.34     0.65    0.01      1    0.0     0
## YOJ             23.0     23.0 -1.20     1.18    0.05      4    5.0     9
## INCOME      367030.0 367030.0  1.19     2.13  541.58  57889 4380.5 28097
## PARENT1*         2.0      1.0  2.17     2.73    0.00      0    1.0     1
## HOME_VAL    885282.0 885282.0  0.49    -0.02 1471.79 238724    0.0     0
## MSTATUS*         2.0      1.0  0.41    -1.83    0.01      1    1.0     1
## SEX*             2.0      1.0 -0.14    -1.98    0.01      1    1.0     1
## EDUCATION*       5.0      4.0  0.12    -1.38    0.02      3    1.0     2
## JOB*             9.0      8.0 -0.31    -1.22    0.03      5    2.0     3
## TRAVTIME       142.0    137.0  0.45     0.66    0.18     22   13.0    22
## CAR_USE*         2.0      1.0 -0.53    -1.72    0.01      1    1.0     1
## BLUEBOOK     69740.0  68240.0  0.79     0.79   93.20  11570 6000.0  9280
## TIF             25.0     24.0  0.89     0.42    0.05      6    1.0     1
## CAR_TYPE*        6.0      5.0  0.00    -1.52    0.02      5    1.0     1
## RED_CAR*         2.0      1.0  0.92    -1.16    0.01      1    1.0     1
## OLDCLAIM     57037.0  57037.0  3.12     9.86   97.16   4636    0.0     0
## CLM_FREQ         5.0      5.0  1.21     0.28    0.01      2    0.0     0
## REVOKED*         2.0      1.0  2.30     3.30    0.00      0    1.0     1
## MVR_PTS         13.0     13.0  1.35     1.38    0.02      3    0.0     0
## CAR_AGE         28.0     31.0  0.28    -0.75    0.07     11    1.0     1
## URBANICITY*      2.0      1.0  1.46     0.15    0.00      0    1.0     1
##              Q0.75     Q0.9
## TARGET_FLAG      1      1.0
## TARGET_AMT    1036   4904.0
## KIDSDRIV         0      1.0
## AGE             51     56.0
## HOMEKIDS         1      3.0
## YOJ             13     15.0
## INCOME       85986 123180.0
## PARENT1*         1      2.0
## HOME_VAL    238724 316542.6
## MSTATUS*         2      2.0
## SEX*             2      2.0
## EDUCATION*       5      5.0
## JOB*             8      9.0
## TRAVTIME        44     54.0
## CAR_USE*         2      2.0
## BLUEBOOK     20850  27460.0
## TIF              7     11.0
## CAR_TYPE*        6      6.0
## RED_CAR*         2      2.0
## OLDCLAIM      4636   9583.0
## CLM_FREQ         2      3.0
## REVOKED*         1      2.0
## MVR_PTS          3      5.0
## CAR_AGE         12     16.0
## URBANICITY*      1      2.0
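The sign of the skew column indicates the direction of the longer tail: positive values mean a right (upper) tail and negative values a left (lower) tail. The sketch below computes a simple moment-based skewness on made-up vectors. The function name `sample_skewness` is hypothetical, and the estimator used for the summary above may differ by a small-sample correction factor.

```r
# Moment-based sample skewness; missing values are dropped first.
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# Made-up vectors: one with a long upper tail, and its mirror image.
right_tailed <- c(0, 0, 0, 1, 1, 2, 10)
left_tailed <- -right_tailed

print(sample_skewness(right_tailed))  # positive: skewed to the right
print(sample_skewness(left_tailed))   # negative: skewed to the left
```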

Correlation Analysis

There is notable correlation among several pairs of variables: CLM_FREQ and MVR_PTS; KIDSDRIV and AGE; TARGET_FLAG and TARGET_AMT (the two response variables); and AGE and HOMEKIDS.

Correlation with Outcome Variable - TARGET_FLAG

| VARIABLE | CORRELATION WITH TARGET_FLAG |
|---|---|
| KIDSDRIV | 0.1036683 |
| AGE | -0.1032167 |
| HOMEKIDS | 0.115621 |
| YOJ | -0.0705118 |
| INCOME | -0.1420081 |
| HOME_VAL | -0.1837371 |
| TRAVTIME | 0.0483683 |
| BLUEBOOK | -0.1033832 |
| TIF | -0.08237 |
| OLDCLAIM | 0.1380838 |
| CLM_FREQ | 0.2161961 |
| MVR_PTS | 0.2191971 |
| CAR_AGE | -0.1006506 |

Correlation with Outcome Variable - TARGET_AMT

| VARIABLE | CORRELATION WITH TARGET_AMT |
|---|---|
| KIDSDRIV | 0.0553942 |
| AGE | -0.0417283 |
| HOMEKIDS | 0.061988 |
| YOJ | -0.0220852 |
| INCOME | -0.0583069 |
| HOME_VAL | -0.0856024 |
| TRAVTIME | 0.027987 |
| BLUEBOOK | -0.0046995 |
| TIF | -0.0464808 |
| OLDCLAIM | 0.0709533 |
| CLM_FREQ | 0.1164192 |
| MVR_PTS | 0.1378655 |
| CAR_AGE | -0.0588221 |
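Correlations of each numeric predictor with a response, like those tabulated above, can be sketched as follows. The data here are simulated (the data frame `toy`, the seed and the coefficients are assumptions for illustration only), and `use = "pairwise.complete.obs"` is one possible way to handle missing values, not necessarily the one used for the tables.

```r
set.seed(1)
n <- 200

# Simulated stand-in data; not the insurance data.
toy <- data.frame(
  MVR_PTS  = rpois(n, 2),
  TRAVTIME = rnorm(n, 33, 16)
)
toy$TARGET_FLAG <- rbinom(n, 1, plogis(-1.5 + 0.4 * toy$MVR_PTS))
toy$MVR_PTS[c(3, 17)] <- NA  # introduce a few missing values

# Correlate every predictor with the response, skipping incomplete pairs.
predictors <- setdiff(names(toy), "TARGET_FLAG")
cors <- sapply(predictors, function(v) {
  cor(toy[[v]], toy$TARGET_FLAG, use = "pairwise.complete.obs")
})

print(round(cors, 3))
```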

Analysis of predictors

Each predictor was examined to determine whether a transformation is needed.

KIDSDRIV

The Driving Children variable is highly skewed to the right. There appear to be outliers.

Extreme Observations

Range Values
Lowest None
Highest 4, 3, 2, 1

AGE

The AGE predictor is approximately normally distributed, with high outliers at ages 72, 73, 76, 80 and 81 and low outliers at ages 16, 17 and 18.

Extreme Observations

Range Values
Lowest 20, 19, 18, 17, 16
Highest 81, 80, 76, 73, 72, 70

BLUEBOOK

The distribution of the car value predictor BLUEBOOK is close to bimodal. There are some outliers at the higher car values.

Extreme Observations

Range Values
Lowest None
Highest 69740, 65970, 62240, 61050, 57970, 50970, 50180, 49880, 49230, 48620

CAR_AGE

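The distribution of vehicle age is approximately normal. There are several outliers among both newer and older cars. The descriptive statistics also show a minimum CAR_AGE of -3, which is not a valid vehicle age.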

Extreme Observations

Range Values
Lowest None
Highest None

CAR_TYPE

z_SUV and Minivan together account for the majority of insured vehicles.

## 
##     Minivan Panel Truck      Pickup  Sports Car         Van       z_SUV 
##        2145         676        1389         907         750        2294

CAR_USE

The majority of cars are privately used.

## 
## Commercial    Private 
##       3029       5132

CLM_FREQ

The distribution of CLM_FREQ is multimodal, with the largest group of customers having zero claims in the past five years.

Extreme Observations

Range Values
Lowest None
Highest None

EDUCATION

## 
##  <High School     Bachelors       Masters           PhD z_High School 
##          1203          2242          1658           728          2330

HOMEKIDS

The distribution of HOMEKIDS is multimodal. The majority of customers do not have any children at home.

Extreme Observations

Range Values
Lowest None
Highest 5, 4, 3

HOME_VAL

The distribution of HOME_VAL is skewed to the right (skewness 0.49). There are zero values that will require further exploration.

Extreme Observations

Range Values
Lowest None
Highest 885282, 750455, 738153, 682634, 657804, 653952, 649247, 631309, 630267, 611328

INCOME

The distribution of INCOME is unimodal and skewed to the right (skewness 1.19).

Extreme Observations

Range Values
Lowest None
Highest 367030, 332339, 320127, 309628, 306277, 297435, 290846, 284071, 282292, 282198

JOB

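library(ggplot2)

# Bar chart of the number of customers in each job category
ggplot(insurance_train, aes(x = JOB)) + 
  geom_bar(fill = "red", width = 0.7) + 
  xlab("Job Category") + ylab("Count")

# Frequency table of job categories
table(insurance_train$JOB)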
## 
##                    Clerical        Doctor    Home Maker        Lawyer 
##           526          1271           246           641           835 
##       Manager  Professional       Student z_Blue Collar 
##           988          1117           712          1825

MSTATUS

## 
##  Yes z_No 
## 4894 3267

MVR_PTS

The distribution of MVR_PTS is skewed to the right (skewness 1.35).

Extreme Observations

Range Values
Lowest None
Highest 13, 11, 10, 9, 8

OLDCLAIM

The distribution of OLDCLAIM is highly skewed to the right (skewness 3.12).

Extreme Observations

Range Values
Lowest None
Highest 57037, 53986, 53568, 53477, 52507, 52465, 52445, 52068, 51904, 51593

PARENT1

The majority of customers are not single parents.

## 
##   No  Yes 
## 7084 1077

RED_CAR

## 
##   no  yes 
## 5783 2378

TIF

The distribution of TIF is skewed to the right (skewness 0.89) with several outliers.

Extreme Observations

Range Values
Lowest None
Highest 25, 22, 21, 20, 19, 18, 17

TRAVTIME

The distribution of TRAVTIME is skewed to the right (skewness 0.45) with several outliers.

Extreme Observations

Range Values
Lowest None
Highest 142, 134, 124, 113, 103, 101, 98, 97, 95, 93

YOJ

The YOJ distribution is close to normally distributed. There are outliers at both the lower and upper ends.

Extreme Observations

Range Values
Lowest 2
Highest 23

Multicollinearity

This section tests the predictor variables for correlation among them. The variance inflation factor (VIF) is used to detect multicollinearity across the entire set of predictors rather than only within pairs of variables.

Testing for collinearity among the predictor variables, we see that none of the numeric predictor variables appear to have a problem with collinearity based on their low VIF scores.

## No variable from the 13 input variables has collinearity problem. 
## 
## The linear correlation coefficients ranges between: 
## min correlation ( TIF ~ HOME_VAL ):  -0.000153687 
## max correlation ( HOME_VAL ~ INCOME ):  0.5796475 
## 
## ---------- VIFs of the remained variables -------- 
##    Variables      VIF
## 1   KIDSDRIV 1.301155
## 2        AGE 1.399282
## 3   HOMEKIDS 1.692351
## 4        YOJ 1.161353
## 5     INCOME 2.008525
## 6   HOME_VAL 1.570257
## 7   TRAVTIME 1.003550
## 8   BLUEBOOK 1.248277
## 9        TIF 1.002856
## 10  OLDCLAIM 1.342261
## 11  CLM_FREQ 1.473253
## 12   MVR_PTS 1.211478
## 13   CAR_AGE 1.218389
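For reference, the VIF of a predictor can be computed by regressing it on all of the other predictors and taking 1 / (1 - R²). The sketch below does this by hand on simulated data; the helper name `manual_vif`, the seed and the simulated columns are assumptions for illustration, and this is not necessarily the routine that produced the output above.

```r
# VIF for each column: regress it on all other columns, then 1 / (1 - R^2).
manual_vif <- function(data) {
  sapply(names(data), function(v) {
    others <- setdiff(names(data), v)
    fit <- lm(reformulate(others, response = v), data = data)
    1 / (1 - summary(fit)$r.squared)
  })
}

set.seed(2)
n <- 300
income <- rnorm(n)

# Simulated predictors: HOME_VAL is built to correlate with INCOME,
# while TRAVTIME is independent of both.
toy <- data.frame(
  INCOME   = income,
  HOME_VAL = 0.6 * income + rnorm(n, sd = 0.8),
  TRAVTIME = rnorm(n)
)

vifs <- manual_vif(toy)
print(round(vifs, 2))
```

A VIF close to 1 means the predictor is nearly uncorrelated with the rest, which is the pattern seen for TRAVTIME and TIF in the output above.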

Data Preparation

Missing Values

The majority of cases are complete. Of concern are the four predictor variables (CAR_AGE, HOME_VAL, YOJ and INCOME) that are each missing in more than 5% of observations. The remaining variables have fewer than 10 missing values each.

Zero values in some predictors may actually represent missing values. For instance, the predictors HOME_VAL and INCOME contain zero values, which are unlikely to be genuine.

The missing data pattern shows that 6,448 of the 8,161 observations are complete. Of the rows missing a single value, 3 are missing only AGE, 385 only YOJ, 364 only INCOME, 378 only HOME_VAL and 431 only CAR_AGE. The remaining 152 observations are missing two or more values; for example, 22 are missing both INCOME and YOJ and 18 are missing both YOJ and CAR_AGE.

##      TARGET_FLAG TARGET_AMT KIDSDRIV HOMEKIDS PARENT1 MSTATUS SEX
## 6448           1          1        1        1       1       1   1
##    3           1          1        1        1       1       1   1
##  385           1          1        1        1       1       1   1
##  364           1          1        1        1       1       1   1
##  378           1          1        1        1       1       1   1
##  431           1          1        1        1       1       1   1
##    1           1          1        1        1       1       1   1
##   22           1          1        1        1       1       1   1
##    2           1          1        1        1       1       1   1
##   21           1          1        1        1       1       1   1
##   23           1          1        1        1       1       1   1
##   18           1          1        1        1       1       1   1
##   23           1          1        1        1       1       1   1
##   29           1          1        1        1       1       1   1
##    4           1          1        1        1       1       1   1
##    2           1          1        1        1       1       1   1
##    1           1          1        1        1       1       1   1
##    5           1          1        1        1       1       1   1
##    1           1          1        1        1       1       1   1
##                0          0        0        0       0       0   0
##      EDUCATION JOB TRAVTIME CAR_USE BLUEBOOK TIF CAR_TYPE RED_CAR OLDCLAIM
## 6448         1   1        1       1        1   1        1       1        1
##    3         1   1        1       1        1   1        1       1        1
##  385         1   1        1       1        1   1        1       1        1
##  364         1   1        1       1        1   1        1       1        1
##  378         1   1        1       1        1   1        1       1        1
##  431         1   1        1       1        1   1        1       1        1
##    1         1   1        1       1        1   1        1       1        1
##   22         1   1        1       1        1   1        1       1        1
##    2         1   1        1       1        1   1        1       1        1
##   21         1   1        1       1        1   1        1       1        1
##   23         1   1        1       1        1   1        1       1        1
##   18         1   1        1       1        1   1        1       1        1
##   23         1   1        1       1        1   1        1       1        1
##   29         1   1        1       1        1   1        1       1        1
##    4         1   1        1       1        1   1        1       1        1
##    2         1   1        1       1        1   1        1       1        1
##    1         1   1        1       1        1   1        1       1        1
##    5         1   1        1       1        1   1        1       1        1
##    1         1   1        1       1        1   1        1       1        1
##              0   0        0       0        0   0        0       0        0
##      CLM_FREQ REVOKED MVR_PTS URBANICITY AGE INCOME YOJ HOME_VAL CAR_AGE
## 6448        1       1       1          1   1      1   1        1       1
##    3        1       1       1          1   0      1   1        1       1
##  385        1       1       1          1   1      1   0        1       1
##  364        1       1       1          1   1      0   1        1       1
##  378        1       1       1          1   1      1   1        0       1
##  431        1       1       1          1   1      1   1        1       0
##    1        1       1       1          1   0      0   1        1       1
##   22        1       1       1          1   1      0   0        1       1
##    2        1       1       1          1   0      1   1        0       1
##   21        1       1       1          1   1      1   0        0       1
##   23        1       1       1          1   1      0   1        0       1
##   18        1       1       1          1   1      1   0        1       0
##   23        1       1       1          1   1      0   1        1       0
##   29        1       1       1          1   1      1   1        0       0
##    4        1       1       1          1   1      0   0        0       1
##    2        1       1       1          1   1      0   0        1       0
##    1        1       1       1          1   1      1   0        0       0
##    5        1       1       1          1   1      0   1        0       0
##    1        1       1       1          1   1      0   0        0       0
##             0       0       0          0   6    445 454      464     510
##          
## 6448    0
##    3    1
##  385    1
##  364    1
##  378    1
##  431    1
##    1    2
##   22    2
##    2    2
##   21    2
##   23    2
##   18    2
##   23    2
##   29    2
##    4    3
##    2    3
##    1    3
##    5    3
##    1    4
##      1879

Assumptions of Missing Values

Missing home value data for students and missing income data for home makers were replaced with zero. This decision was made after examination of the dataset. It is possible that students did not enter home value data because many students do not own a home. Missing income data for home makers may be due to no information being entered, since home makers don't typically earn an income.
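The replacement rule described above can be sketched as follows on a small made-up data frame. The job labels `"Student"` and `"Home Maker"` are taken from the JOB frequency table; the data frame `toy` and its values are assumptions for illustration.

```r
# Made-up rows; only the JOB labels come from the report.
toy <- data.frame(
  JOB      = c("Student", "Home Maker", "Lawyer", "Student"),
  HOME_VAL = c(NA, 150000, NA, 0),
  INCOME   = c(1200, NA, NA, NA),
  stringsAsFactors = FALSE
)

# Students with no recorded home value are assumed not to own a home.
student_no_home <- toy$JOB == "Student" & is.na(toy$HOME_VAL)
toy$HOME_VAL[student_no_home] <- 0

# Home makers with no recorded income are assumed to earn none.
home_maker_no_income <- toy$JOB == "Home Maker" & is.na(toy$INCOME)
toy$INCOME[home_maker_no_income] <- 0

print(toy)
```

Missing values outside these two groups are left as `NA`, so they remain available for imputation.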

Recode Predictors

##  [1] "TARGET_FLAG"               "TARGET_AMT"               
##  [3] "KIDSDRIV"                  "MALE"                     
##  [5] "MARRIED"                   "SINGLE_PARENT"            
##  [7] "LICENSE_REVOKED"           "AGE"                      
##  [9] "AGE_RANGE_16_19_YRS"       "AGE_RANGE_20_29_YRS"      
## [11] "AGE_RANGE_30_39_YRS"       "AGE_RANGE_40_49_YRS"      
## [13] "AGE_RANGE_50_59_YRS"       "AGE_RANGE_60_69_YRS"      
## [15] "AGE_RANGE_70_YRS_PLUS"     "INEXP_DRIVER"             
## [17] "HOMEKIDS"                  "YOJ"                      
## [19] "INCOME"                    "HOME_VAL"                 
## [21] "TRAVTIME"                  "BLUEBOOK"                 
## [23] "TIF"                       "OLDCLAIM"                 
## [25] "CLM_FREQ"                  "MVR_PTS"                  
## [27] "CAR_AGE"                   "CAR_AGE_RANGE_1_YR"       
## [29] "CAR_AGE_RANGE_2_3_YRS"     "CAR_AGE_RANGE_3_5_YRS"    
## [31] "CAR_AGE_RANGE_5_10_YRS"    "CAR_AGE_RANGE_10_YRS_PLUS"
## [33] "MAIN_DRIVING_CITY"         "RED_CAR"                  
## [35] "EDU_HIGH_SCHOOL"           "EDU_COLLEGE"              
## [37] "EDU_ADV_DEGREE"            "VEHICLE_USE_COMMERCIAL"   
## [39] "VEHICLE_CLASS_TRUCK"       "VEHICLE_CLASS_SUV"        
## [41] "VEHICLE_CLASS_CAR"         "SPORTS_CAR"               
## [43] "RED_SPORTS_CAR"            "TRUCK_COMM"               
## [45] "SUV_COMM"                  "CAR_COMM"                 
## [47] "OCCUPATION_CLERICAL"       "OCCUPATION_MANAGER"       
## [49] "OCCUPATION_BLUE_COLLAR"    "OCCUPATION_GOLD_COLLAR"   
## [51] "OCCUPATION_STUDENT"        "OCCUPATION_HOME_MAKER"    
## [53] "OCCUPATION_PROFESSIONAL"
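The recoded names above suggest that AGE was binned into ranges and expanded into 0/1 indicator columns. A minimal sketch of that kind of recoding is shown below. The bin edges, the handling of missing ages (all indicators set to 0) and the example ages are assumptions; only the column names are taken from the list above.

```r
age <- c(16, 19, 20, 45, 59, 60, 81, NA)  # made-up ages

band_labels <- c(
  "AGE_RANGE_16_19_YRS", "AGE_RANGE_20_29_YRS", "AGE_RANGE_30_39_YRS",
  "AGE_RANGE_40_49_YRS", "AGE_RANGE_50_59_YRS", "AGE_RANGE_60_69_YRS",
  "AGE_RANGE_70_YRS_PLUS"
)

# Assumed bin edges: [16,20), [20,30), ..., [60,70), [70,Inf)
age_band <- cut(age, breaks = c(16, 20, 30, 40, 50, 60, 70, Inf),
                right = FALSE, labels = band_labels)

# One 0/1 indicator column per band; a missing age gets 0 in every column.
indicators <- sapply(band_labels, function(b) {
  as.integer(!is.na(age_band) & age_band == b)
})

print(indicators)
```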

Transform data - Logistic Regression Dataset

insurance_trainingT <- read.csv( "https://raw.githubusercontent.com/621-Group2/HW4/master/insurance_training_data_recoded.csv")

#x1 <- glm(TARGET_FLAG ~. -TARGET_AMT, family= binomial(), data = insurance_trainingT)
#car::mmps(x1)

Input Recoded Dataset

BUILD MODELS

Multiple Linear Regression

Model 1 - Base Model

Model 2 - Stepwise Variable Selection Using Non-transformed Data

null.model <- lm(TARGET_AMT ~ 1 , data= dev_train)  # base intercept only model
full.model <- lm(TARGET_AMT ~ . , data= dev_train)  # full model with all predictors


# perform step-wise algorithm

model2 <- step(null.model, scope = list(lower = null.model, upper = full.model), direction = "both", trace = 0, steps = 1000) 

Most Significant Variables

shortlistedVars <- names(unlist(model2[[1]]))                            # get the shortlisted variable.
shortlistedVars <- shortlistedVars[!shortlistedVars %in% "(Intercept)"]  # remove intercept 
print(shortlistedVars)
## [1] "BLUEBOOK"            "MALE"                "RED_CAR"            
## [4] "AGE_RANGE_50_59_YRS" "SINGLE_PARENT"

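Variable Importance

library(caret)    # assumed source of varImp()
library(ggplot2)  # plotting
library(dplyr)    # %>% pipe

# Variable importance for the stepwise model, one row per term
x <- data.frame(varImp(model2))
x$Variable <- rownames(x)

# Horizontal bar chart ordered by importance
x %>% ggplot(aes(x = reorder(Variable, Overall), y = Overall, fill = Overall)) +
  geom_bar(stat = "identity") + coord_flip() + guides(fill = "none") +
  xlab("Variable") + ylab("Importance") +
  ggtitle("Variable Importance")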

Model Summary

summary(model2)
## 
## Call:
## lm(formula = TARGET_AMT ~ BLUEBOOK + MALE + RED_CAR + AGE_RANGE_50_59_YRS + 
##     SINGLE_PARENT, data = dev_train)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -8603  -2975  -1374    635  71334 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         3266.06634  416.49604   7.842 8.36e-15 ***
## BLUEBOOK               0.10269    0.02183   4.705 2.77e-06 ***
## MALE                1489.89233  495.96503   3.004  0.00271 ** 
## RED_CAR             -939.36057  542.09071  -1.733  0.08333 .  
## AGE_RANGE_50_59_YRS  898.46784  452.22227   1.987  0.04713 *  
## SINGLE_PARENT        796.49906  461.46732   1.726  0.08455 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7088 on 1503 degrees of freedom
## Multiple R-squared:  0.02549,    Adjusted R-squared:  0.02225 
## F-statistic: 7.862 on 5 and 1503 DF,  p-value: 2.555e-07

Diagnostic Plots

autoplot(model2, which = 1:6, colour = 'dodgerblue3',
         smooth.colour = 'red', smooth.linetype = 'dashed',
         ad.colour = 'black',
         label.size = 3, label.n = 5, label.colour = 'blue',
         ncol = 3)

The diagnostic plots reveal some potential issues with this model. The Residuals vs. Fitted plot shows a downward trend: as the fitted values increase on the x-axis, the residuals decrease. We would expect to see a flat line if the residuals had equal variance (homoscedasticity). Heteroscedasticity is also seen in the Scale-Location plot, where we would again expect a relatively flat trend rather than the upward trend of the red line.

Heteroscedasticity can be confirmed statistically using the NCV test:

car::ncvTest(model2)
## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 772.6424    Df = 1     p = 4.788895e-170

A p-value below the 0.05 significance level indicates that the residual variance is not constant (heteroscedasticity).
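To illustrate what a score test for non-constant variance measures, the sketch below implements a Breusch-Pagan style statistic by hand: the squared residuals are scaled by their mean, regressed on the fitted values, and half of the regression sum of squares is compared with a chi-square distribution on 1 degree of freedom. The function name `score_test_ncv` and the simulated data are assumptions, and we have not verified that this reproduces the `car::ncvTest` output digit for digit.

```r
# Hand-rolled score test for variance that changes with the fitted values.
score_test_ncv <- function(model) {
  res <- residuals(model)
  n <- length(res)
  scaled_sq <- res^2 / (sum(res^2) / n)  # squared residuals over their mean
  aux <- lm(scaled_sq ~ fitted(model))   # auxiliary regression
  reg_ss <- sum((fitted(aux) - mean(scaled_sq))^2)
  chisq <- reg_ss / 2
  c(chisq = chisq, df = 1,
    p = pchisq(chisq, df = 1, lower.tail = FALSE))
}

set.seed(3)
x <- runif(400, 1, 10)
y_const <- 2 + 3 * x + rnorm(400, sd = 2)        # constant error variance
y_fan   <- 2 + 3 * x + rnorm(400, sd = 0.5 * x)  # variance grows with x

print(score_test_ncv(lm(y_const ~ x)))
print(score_test_ncv(lm(y_fan ~ x)))
```

In the second fit the error spread grows with `x`, so the statistic should be large and the p-value small, which is the same pattern reported for model2.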

The Normal Q-Q plot shows an issue with the requirement that the residuals be normally distributed. We see a steep increase approaching the second quantile. This is a strong indicator that a transformation may be required to satisfy the normality requirement.

Multicollinearity does not appear to be a problem.

car::vif(model2)
##            BLUEBOOK                MALE             RED_CAR 
##            1.015265            1.832129            1.818164 
## AGE_RANGE_50_59_YRS       SINGLE_PARENT 
##            1.058556            1.061311

Additionally, this model is impacted by outliers as shown by Cook’s distance and the leverage plots.

car::outlierTest(model2)
##       rstudent unadjusted p-value Bonferonni p
## 1858 10.440704         1.1045e-24   1.6666e-21
## 2063 10.071278         3.9376e-23   5.9419e-20
## 640   9.739370         8.8889e-22   1.3413e-18
## 1832  8.203562         4.9593e-16   7.4836e-13
## 143   7.896147         5.5094e-15   8.3138e-12
## 1137  7.635144         3.9899e-14   6.0207e-11
## 1552  7.312460         4.2485e-13   6.4109e-10
## 251   6.890893         8.1316e-12   1.2271e-08
## 1639  6.740388         2.2451e-11   3.3879e-08
## 43    6.413704         1.8980e-10   2.8641e-07
plot(cooks.distance(model2), pch=23, bg='orange', cex=2, ylab="Cook's distance")

Let’s remove these two outliers, identified by a Cook’s distance of 0.1 or more, to see whether doing so improves the model.

# Keep only observations whose Cook's distance is below 0.1
dev_train_upd <- dev_train[which(cooks.distance(model2) < 0.1), ]

# Refit the same model on the reduced data
mod2 <- update(model2, data = dev_train_upd)

autoplot(mod2, which = 1:6, colour = 'dodgerblue3',
         smooth.colour = 'red', smooth.linetype = 'dashed',
         ad.colour = 'black',
         label.size = 3, label.n = 5, label.colour = 'blue',
         ncol = 3)

gvlma(mod2)
## 
## Call:
## lm(formula = TARGET_AMT ~ BLUEBOOK + MALE + RED_CAR + AGE_RANGE_50_59_YRS + 
##     SINGLE_PARENT, data = dev_train_upd)
## 
## Coefficients:
##         (Intercept)             BLUEBOOK                 MALE  
##          3580.68120              0.08887           1218.32680  
##             RED_CAR  AGE_RANGE_50_59_YRS        SINGLE_PARENT  
##          -667.44565            648.90297            499.93431  
## 
## 
## ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
## USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
## Level of Significance =  0.05 
## 
## Call:
##  gvlma(x = mod2) 
## 
##                        Value p-value                   Decision
## Global Stat        8.411e+04  0.0000 Assumptions NOT satisfied!
## Skewness           6.806e+03  0.0000 Assumptions NOT satisfied!
## Kurtosis           7.730e+04  0.0000 Assumptions NOT satisfied!
## Link Function      1.284e+00  0.2572    Assumptions acceptable.
## Heteroscedasticity 1.436e-03  0.9698    Assumptions acceptable.

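Removing the outliers does not improve Model 2: the global test and the skewness and kurtosis checks still fail, although the link function and heteroscedasticity assumptions are now acceptable.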