The purpose of this homework assignment is to explore, analyze and model a dataset containing 8161 observations and 26 variables. Each record in the dataset represents a customer at an auto insurance company.
Each record has two response variables. The first response variable, TARGET_FLAG, is binary (0, 1): the value is 1 if the person was in a car crash and 0 if the person was not.
The second response variable is TARGET_AMT. If the person did not crash their car the value is 0, and if the person was in a car crash the value is greater than 0.
The objective is to build multiple linear regression and binary logistic regression models on the training data to predict the probability that a person will crash their car and also the amount of money it will cost if the person does crash their car.
Only the variables that are given, or variables derived from the variables provided, may be used.
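The two modelling tasks described above can be sketched as follows. This is a minimal illustration on simulated data, not the team's actual models: the data frame `toy`, its coefficients, and the choice to fit the amount model only on rows with a crash are all assumptions made for the example.

```r
# Simulated stand-in for the insurance data (all values are made up)
set.seed(2024)
n <- 400
toy <- data.frame(
  MVR_PTS  = rpois(n, 2),
  BLUEBOOK = runif(n, 1500, 40000)
)
toy$TARGET_FLAG <- rbinom(n, 1, plogis(-1.5 + 0.3 * toy$MVR_PTS))
toy$TARGET_AMT  <- ifelse(toy$TARGET_FLAG == 1,
                          2000 + 0.1 * toy$BLUEBOOK + rnorm(n, sd = 500),
                          0)

# Binary logistic regression: probability of a crash
flag_model <- glm(TARGET_FLAG ~ MVR_PTS + BLUEBOOK,
                  family = binomial(), data = toy)
crash_prob <- predict(flag_model, type = "response")

# Multiple linear regression: cost of a crash, fitted on crash rows only
# (an assumption for this sketch)
amt_model <- lm(TARGET_AMT ~ BLUEBOOK, data = subset(toy, TARGET_FLAG == 1))

print(summary(crash_prob))
print(coef(amt_model))
```

Predicted probabilities from the logistic model stay between 0 and 1, while the linear model returns a cost in the same units as TARGET_AMT.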
The team met to discuss this assignment and plan an approach to complete it. Each of the 5 team members was assigned tasks. The following tasks were assigned: Data Exploration, Data Preparation, Build Models, and Select Models.
GitHub was used to manage the project. Using GitHub helped with version control and ensured each team member had access to the latest version of the project documentation. Slack was used by the team to communicate during the project and for quick access to code and documentation.
For reproducibility of the results, the data was loaded to and accessed from a GitHub repository.
Several of the predictor variables contain missing values and outliers. Imputation will be used for the missing values.
## Missing Values
The majority of variables do not contain missing values. The predictor CAR_AGE (Vehicle Age) contains 510 missing values, HOME_VAL (Home Value) 464, YOJ (Years on Job) 454, INCOME 445, and AGE 6.
## TARGET_FLAG TARGET_AMT KIDSDRIV AGE HOMEKIDS YOJ
## 0 0 0 6 0 454
## INCOME PARENT1 HOME_VAL MSTATUS SEX EDUCATION
## 445 0 464 0 0 0
## JOB TRAVTIME CAR_USE BLUEBOOK TIF CAR_TYPE
## 0 0 0 0 0 0
## RED_CAR OLDCLAIM CLM_FREQ REVOKED MVR_PTS CAR_AGE
## 0 0 0 0 0 510
## URBANICITY
## 0
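
Counts like those above can be produced with base R alone. The sketch below uses a small made-up data frame (`toy` is hypothetical and only borrows three column names from the dataset) to show both the count and the share of missing values per column.

```r
# Toy data frame standing in for the insurance data (values are made up)
toy <- data.frame(
  AGE     = c(45, NA, 38, 51),
  YOJ     = c(11, 9, NA, NA),
  CAR_AGE = c(8, NA, 1, 12)
)

# Number of missing values per column
missing_counts <- colSums(is.na(toy))

# Share of missing values per column
missing_share <- colMeans(is.na(toy))

print(missing_counts)
print(round(missing_share, 2))
```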
Variable Name | Definition | Variable Type |
---|---|---|
TARGET_FLAG | Was Car in a crash? 1=YES 0=NO | Response |
TARGET_AMT | If car was in a crash, what was the cost | Response |
AGE | Age of Driver | Predictor |
BLUEBOOK | Value of Vehicle | Predictor |
CAR_AGE | Vehicle Age | Predictor |
CAR_TYPE | Type of Car | Predictor |
CAR_USE | Vehicle Use | Predictor |
CLM_FREQ | # Claims (Past 5 Years) | Predictor |
EDUCATION | Max Education Level | Predictor |
HOMEKIDS | # Children at Home | Predictor |
HOME_VAL | Home Value | Predictor |
INCOME | Income | Predictor |
JOB | Job Category | Predictor |
KIDSDRIV | # Driving Children | Predictor |
MSTATUS | Marital Status | Predictor |
MVR_PTS | Motor Vehicle Record Points | Predictor |
OLDCLAIM | Total Claims (Past 5 Years) | Predictor |
PARENT1 | Single Parent | Predictor |
RED_CAR | A Red Car | Predictor |
REVOKED | License Revoked (Past 7 Years) | Predictor |
SEX | Gender | Predictor |
TIF | Time in Force | Predictor |
TRAVTIME | Distance to Work | Predictor |
URBANICITY | Home/Work Area | Predictor |
YOJ | Years on Job | Predictor |
Descriptive statistics were computed for all predictor and response variables to explore the data.
## TARGET_FLAG TARGET_AMT KIDSDRIV AGE HOMEKIDS YOJ
## 1 0 0 0 0.000735204 0 0.05563044
## INCOME PARENT1 HOME_VAL MSTATUS SEX EDUCATION JOB TRAVTIME CAR_USE
## 1 0.05452763 0 0.05685578 0 0 0 0 0 0
## BLUEBOOK TIF CAR_TYPE RED_CAR OLDCLAIM CLM_FREQ REVOKED MVR_PTS
## 1 0 0 0 0 0 0 0 0
## CAR_AGE URBANICITY
## 1 0.06249234 0
## vars n mean sd median trimmed mad min
## TARGET_FLAG 1 8161 0.26 0.44 0 0.20 0.00 0
## TARGET_AMT 2 8161 1504.32 4704.03 0 593.71 0.00 0
## KIDSDRIV 3 8161 0.17 0.51 0 0.03 0.00 0
## AGE 4 8155 44.79 8.63 45 44.83 8.90 16
## HOMEKIDS 5 8161 0.72 1.12 0 0.50 0.00 0
## YOJ 6 7707 10.50 4.09 11 11.07 2.97 0
## INCOME 7 7716 61898.09 47572.68 54028 56840.98 41792.27 0
## PARENT1* 8 8161 1.13 0.34 1 1.04 0.00 1
## HOME_VAL 9 7697 154867.29 129123.77 161160 144032.07 147867.11 0
## MSTATUS* 10 8161 1.40 0.49 1 1.38 0.00 1
## SEX* 11 8161 1.54 0.50 2 1.55 0.00 1
## EDUCATION* 12 8161 3.09 1.44 3 3.11 1.48 1
## JOB* 13 8161 5.69 2.68 6 5.81 2.97 1
## TRAVTIME 14 8161 33.49 15.91 33 33.00 16.31 5
## CAR_USE* 15 8161 1.63 0.48 2 1.66 0.00 1
## BLUEBOOK 16 8161 15709.90 8419.73 14440 15036.89 8450.82 1500
## TIF 17 8161 5.35 4.15 4 4.84 4.45 1
## CAR_TYPE* 18 8161 3.53 1.97 3 3.54 2.97 1
## RED_CAR* 19 8161 1.29 0.45 1 1.24 0.00 1
## OLDCLAIM 20 8161 4037.08 8777.14 0 1719.29 0.00 0
## CLM_FREQ 21 8161 0.80 1.16 0 0.59 0.00 0
## REVOKED* 22 8161 1.12 0.33 1 1.03 0.00 1
## MVR_PTS 23 8161 1.70 2.15 1 1.31 1.48 0
## CAR_AGE 24 7651 8.33 5.70 8 7.96 7.41 -3
## URBANICITY* 25 8161 1.20 0.40 1 1.13 0.00 1
## max range skew kurtosis se IQR Q0.1 Q0.25
## TARGET_FLAG 1.0 1.0 1.07 -0.85 0.00 1 0.0 0
## TARGET_AMT 107586.1 107586.1 8.71 112.29 52.07 1036 0.0 0
## KIDSDRIV 4.0 4.0 3.35 11.78 0.01 0 0.0 0
## AGE 81.0 65.0 -0.03 -0.06 0.10 12 34.0 39
## HOMEKIDS 5.0 5.0 1.34 0.65 0.01 1 0.0 0
## YOJ 23.0 23.0 -1.20 1.18 0.05 4 5.0 9
## INCOME 367030.0 367030.0 1.19 2.13 541.58 57889 4380.5 28097
## PARENT1* 2.0 1.0 2.17 2.73 0.00 0 1.0 1
## HOME_VAL 885282.0 885282.0 0.49 -0.02 1471.79 238724 0.0 0
## MSTATUS* 2.0 1.0 0.41 -1.83 0.01 1 1.0 1
## SEX* 2.0 1.0 -0.14 -1.98 0.01 1 1.0 1
## EDUCATION* 5.0 4.0 0.12 -1.38 0.02 3 1.0 2
## JOB* 9.0 8.0 -0.31 -1.22 0.03 5 2.0 3
## TRAVTIME 142.0 137.0 0.45 0.66 0.18 22 13.0 22
## CAR_USE* 2.0 1.0 -0.53 -1.72 0.01 1 1.0 1
## BLUEBOOK 69740.0 68240.0 0.79 0.79 93.20 11570 6000.0 9280
## TIF 25.0 24.0 0.89 0.42 0.05 6 1.0 1
## CAR_TYPE* 6.0 5.0 0.00 -1.52 0.02 5 1.0 1
## RED_CAR* 2.0 1.0 0.92 -1.16 0.01 1 1.0 1
## OLDCLAIM 57037.0 57037.0 3.12 9.86 97.16 4636 0.0 0
## CLM_FREQ 5.0 5.0 1.21 0.28 0.01 2 0.0 0
## REVOKED* 2.0 1.0 2.30 3.30 0.00 0 1.0 1
## MVR_PTS 13.0 13.0 1.35 1.38 0.02 3 0.0 0
## CAR_AGE 28.0 31.0 0.28 -0.75 0.07 11 1.0 1
## URBANICITY* 2.0 1.0 1.46 0.15 0.00 0 1.0 1
## Q0.75 Q0.9
## TARGET_FLAG 1 1.0
## TARGET_AMT 1036 4904.0
## KIDSDRIV 0 1.0
## AGE 51 56.0
## HOMEKIDS 1 3.0
## YOJ 13 15.0
## INCOME 85986 123180.0
## PARENT1* 1 2.0
## HOME_VAL 238724 316542.6
## MSTATUS* 2 2.0
## SEX* 2 2.0
## EDUCATION* 5 5.0
## JOB* 8 9.0
## TRAVTIME 44 54.0
## CAR_USE* 2 2.0
## BLUEBOOK 20850 27460.0
## TIF 7 11.0
## CAR_TYPE* 6 6.0
## RED_CAR* 2 2.0
## OLDCLAIM 4636 9583.0
## CLM_FREQ 2 3.0
## REVOKED* 1 2.0
## MVR_PTS 3 5.0
## CAR_AGE 12 16.0
## URBANICITY* 1 2.0
There is correlation among several pairs of variables: CLM_FREQ and MVR_PTS; KIDSDRIV and AGE; TARGET_FLAG and TARGET_AMT; AGE and HOMEKIDS.
VARIABLE | CORRELATION WITH TARGET_FLAG |
---|---|
KIDSDRIV | 0.1036683 |
AGE | -0.1032167 |
HOMEKIDS | 0.115621 |
YOJ | -0.0705118 |
INCOME | -0.1420081 |
HOME_VAL | -0.1837371 |
TRAVTIME | 0.0483683 |
BLUEBOOK | -0.1033832 |
TIF | -0.08237 |
OLDCLAIM | 0.1380838 |
CLM_FREQ | 0.2161961 |
MVR_PTS | 0.2191971 |
CAR_AGE | -0.1006506 |
VARIABLE | CORRELATION WITH TARGET_AMT |
---|---|
KIDSDRIV | 0.0553942 |
AGE | -0.0417283 |
HOMEKIDS | 0.061988 |
YOJ | -0.0220852 |
INCOME | -0.0583069 |
HOME_VAL | -0.0856024 |
TRAVTIME | 0.027987 |
BLUEBOOK | -0.0046995 |
TIF | -0.0464808 |
OLDCLAIM | 0.0709533 |
CLM_FREQ | 0.1164192 |
MVR_PTS | 0.1378655 |
CAR_AGE | -0.0588221 |
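
Correlations of each numeric predictor with a response can be computed along these lines. This is a sketch on simulated data: the data frame `toy` and its effect sizes are invented, and the use of pairwise-complete observations to handle missing values is an assumption, since the handling used for the tables above is not stated.

```r
# Simulated stand-in for the insurance data (all values are made up)
set.seed(1)
n <- 200
toy <- data.frame(
  MVR_PTS  = rpois(n, 2),
  TRAVTIME = rnorm(n, 33, 15)
)
toy$TARGET_FLAG <- rbinom(n, 1, plogis(-1.5 + 0.4 * toy$MVR_PTS))
toy$MVR_PTS[c(3, 10)] <- NA   # plant a few missing values

# Pearson correlation of every predictor with the response,
# skipping rows where either value is missing
predictors <- setdiff(names(toy), "TARGET_FLAG")
cors <- sapply(toy[predictors], function(col) {
  cor(col, toy$TARGET_FLAG, use = "pairwise.complete.obs")
})

print(round(cors, 4))
```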
Each predictor was examined to determine whether transformation is needed.
The Driving Children variable (KIDSDRIV) is highly skewed to the right. There appear to be outliers.
Range | Values |
---|---|
Lowest | None |
Highest | 4, 3, 2, 1 |
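
The direction of skew can be checked numerically: a positive skewness means a long right tail (skewed to the right) and a negative skewness a long left tail. The helper `skewness_simple` below is a hypothetical name for a plain moment-based estimator; it may differ slightly from the `skew` column in the summary table, which can apply a small-sample adjustment.

```r
# Moment-based sample skewness: third central moment divided by
# the second central moment raised to the power 3/2
skewness_simple <- function(x) {
  x  <- x[!is.na(x)]
  m  <- mean(x)
  m2 <- mean((x - m)^2)
  m3 <- mean((x - m)^3)
  m3 / m2^(3 / 2)
}

# Made-up count data with most values at zero and a few large values
right_tailed <- c(0, 0, 0, 0, 0, 0, 1, 1, 2, 4)
left_tailed  <- -right_tailed

print(skewness_simple(right_tailed))  # positive: skewed to the right
print(skewness_simple(left_tailed))   # negative: skewed to the left
```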
The AGE predictor is approximately normally distributed, with high outliers at ages 72, 73, 76, 80 and 81 and low outliers at 16, 17 and 18.
Range | Values |
---|---|
Lowest | 20, 19, 18, 17, 16 |
Highest | 81, 80, 76, 73, 72, 70 |
The distribution of the car value predictor BLUEBOOK is roughly bimodal. There are some outliers at the higher car values.
Range | Values |
---|---|
Lowest | None |
Highest | 69740, 65970, 62240, 61050, 57970, 50970, 50180, 49880, 49230, 48620 |
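The distribution of the age of the vehicle (CAR_AGE) is approximately normal. There are several outliers among both newer and older cars. The summary statistics also show a minimum of -3, an invalid negative age that requires further exploration.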
Range | Values |
---|---|
Lowest | None |
Highest | None |
z_SUV and Minivan together account for the majority of vehicles insured.
##
## Minivan Panel Truck Pickup Sports Car Van z_SUV
## 2145 676 1389 907 750 2294
The majority of cars are privately used.
##
## Commercial Private
## 3029 5132
The distribution of claims is multimodal, with the largest number of claims occurring before year 1.
Range | Values |
---|---|
Lowest | None |
Highest | None |
##
## <High School Bachelors Masters PhD z_High School
## 1203 2242 1658 728 2330
The distribution of HOMEKIDS is multimodal. The majority of customers do not have any children.
Range | Values |
---|---|
Lowest | None |
Highest | 5, 4, 3 |
The distribution of HOME_VAL is skewed to the right (skewness 0.49). The summary statistics show a minimum of 0 and no negative values; the zero values will require further exploration.
Range | Values |
---|---|
Lowest | None |
Highest | 885282, 750455, 738153, 682634, 657804, 653952, 649247, 631309, 630267, 611328 |
The distribution of INCOME is unimodal and skewed to the right (skewness 1.19).
Range | Values |
---|---|
Lowest | None |
Highest | 367030, 332339, 320127, 309628, 306277, 297435, 290846, 284071, 282292, 282198 |
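library(ggplot2)  # provides ggplot(), geom_bar(), xlab(), ylab()
# Bar chart of the number of customers in each job category
ggplot(insurance_train, aes(x = JOB)) +
  geom_bar(fill = "red", width = 0.7) +
  xlab("Job Category") + ylab("V")
# Frequency table of job categories
table(insurance_train$JOB)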
##
## Clerical Doctor Home Maker Lawyer
## 526 1271 246 641 835
## Manager Professional Student z_Blue Collar
## 988 1117 712 1825
##
## Yes z_No
## 4894 3267
The distribution of MVR_PTS is skewed to the right (skewness 1.35).
Range | Values |
---|---|
Lowest | None |
Highest | 13, 11, 10, 9, 8 |
The distribution of OLDCLAIM is highly skewed to the right (skewness 3.12).
Range | Values |
---|---|
Lowest | None |
Highest | 57037, 53986, 53568, 53477, 52507, 52465, 52445, 52068, 51904, 51593 |
The majority of customers are not single parents (7084 of 8161).
##
## No Yes
## 7084 1077
##
## no yes
## 5783 2378
The distribution of TIF is skewed to the right (skewness 0.89) with several outliers.
Range | Values |
---|---|
Lowest | None |
Highest | 25, 22, 21, 20, 19, 18, 17 |
The distribution of TRAVTIME is skewed to the right (skewness 0.45) with several outliers.
Range | Values |
---|---|
Lowest | None |
Highest | 142, 134, 124, 113, 103, 101, 98, 97, 95, 93 |
The YOJ distribution is close to normally distributed. There are outliers at both the lower and upper ends.
Range | Values |
---|---|
Lowest | 2 |
Highest | 23 |
This section tests the predictor variables to determine whether there is correlation among them. The variance inflation factor (VIF) is used to detect multicollinearity across the entire set of predictors, rather than only within pairs of variables.
Testing for collinearity among the predictor variables, we see that none of the numeric predictor variables appear to have a problem with collinearity based on their low VIF scores.
## No variable from the 13 input variables has collinearity problem.
##
## The linear correlation coefficients ranges between:
## min correlation ( TIF ~ HOME_VAL ): -0.000153687
## max correlation ( HOME_VAL ~ INCOME ): 0.5796475
##
## ---------- VIFs of the remained variables --------
## Variables VIF
## 1 KIDSDRIV 1.301155
## 2 AGE 1.399282
## 3 HOMEKIDS 1.692351
## 4 YOJ 1.161353
## 5 INCOME 2.008525
## 6 HOME_VAL 1.570257
## 7 TRAVTIME 1.003550
## 8 BLUEBOOK 1.248277
## 9 TIF 1.002856
## 10 OLDCLAIM 1.342261
## 11 CLM_FREQ 1.473253
## 12 MVR_PTS 1.211478
## 13 CAR_AGE 1.218389
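
For reference, the VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the other predictors. The sketch below computes it by hand in base R on simulated data; `vif_manual` and the data frame `toy` are hypothetical names, and this is not the function that produced the output above.

```r
# VIF for each column: regress it on all other columns and
# convert the resulting R-squared into 1 / (1 - R^2)
vif_manual <- function(data) {
  sapply(names(data), function(v) {
    others <- setdiff(names(data), v)
    fit <- lm(reformulate(others, response = v), data = data)
    1 / (1 - summary(fit)$r.squared)
  })
}

# Simulated predictors: HOME_VAL is built to correlate with INCOME,
# TRAVTIME is independent of both (all values are made up)
set.seed(42)
n <- 300
income   <- rnorm(n, 60, 20)
home_val <- 2 * income + rnorm(n, 0, 30)
travtime <- rnorm(n, 33, 15)
toy <- data.frame(INCOME = income, HOME_VAL = home_val, TRAVTIME = travtime)

vifs <- vif_manual(toy)
print(round(vifs, 3))
```

A VIF close to 1 means the predictor is nearly uncorrelated with the rest; the correlated pair in this toy example ends up with a visibly higher VIF than the independent column.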
The majority of cases are complete. Of concern are the four predictor variables (CAR_AGE, HOME_VAL, YOJ, INCOME) that each have more than 5% missing values. However, the majority of variables have fewer than 10 missing values.
Zero values in some predictors may actually represent missing values. For instance, the predictors HOME_VAL and INCOME contain zero values, which are highly unlikely.
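The missing data patterns show that 6,448 out of 8,161 observations are complete, 3 observations are missing only the AGE predictor, 385 are missing only YOJ, 364 are missing only INCOME, 378 are missing only HOME_VAL, and 431 are missing only CAR_AGE. The remaining observations are missing two or more values; for example, 18 observations are missing both YOJ and CAR_AGE.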
## TARGET_FLAG TARGET_AMT KIDSDRIV HOMEKIDS PARENT1 MSTATUS SEX
## 6448 1 1 1 1 1 1 1
## 3 1 1 1 1 1 1 1
## 385 1 1 1 1 1 1 1
## 364 1 1 1 1 1 1 1
## 378 1 1 1 1 1 1 1
## 431 1 1 1 1 1 1 1
## 1 1 1 1 1 1 1 1
## 22 1 1 1 1 1 1 1
## 2 1 1 1 1 1 1 1
## 21 1 1 1 1 1 1 1
## 23 1 1 1 1 1 1 1
## 18 1 1 1 1 1 1 1
## 23 1 1 1 1 1 1 1
## 29 1 1 1 1 1 1 1
## 4 1 1 1 1 1 1 1
## 2 1 1 1 1 1 1 1
## 1 1 1 1 1 1 1 1
## 5 1 1 1 1 1 1 1
## 1 1 1 1 1 1 1 1
## 0 0 0 0 0 0 0
## EDUCATION JOB TRAVTIME CAR_USE BLUEBOOK TIF CAR_TYPE RED_CAR OLDCLAIM
## 6448 1 1 1 1 1 1 1 1 1
## 3 1 1 1 1 1 1 1 1 1
## 385 1 1 1 1 1 1 1 1 1
## 364 1 1 1 1 1 1 1 1 1
## 378 1 1 1 1 1 1 1 1 1
## 431 1 1 1 1 1 1 1 1 1
## 1 1 1 1 1 1 1 1 1 1
## 22 1 1 1 1 1 1 1 1 1
## 2 1 1 1 1 1 1 1 1 1
## 21 1 1 1 1 1 1 1 1 1
## 23 1 1 1 1 1 1 1 1 1
## 18 1 1 1 1 1 1 1 1 1
## 23 1 1 1 1 1 1 1 1 1
## 29 1 1 1 1 1 1 1 1 1
## 4 1 1 1 1 1 1 1 1 1
## 2 1 1 1 1 1 1 1 1 1
## 1 1 1 1 1 1 1 1 1 1
## 5 1 1 1 1 1 1 1 1 1
## 1 1 1 1 1 1 1 1 1 1
## 0 0 0 0 0 0 0 0 0
## CLM_FREQ REVOKED MVR_PTS URBANICITY AGE INCOME YOJ HOME_VAL CAR_AGE
## 6448 1 1 1 1 1 1 1 1 1
## 3 1 1 1 1 0 1 1 1 1
## 385 1 1 1 1 1 1 0 1 1
## 364 1 1 1 1 1 0 1 1 1
## 378 1 1 1 1 1 1 1 0 1
## 431 1 1 1 1 1 1 1 1 0
## 1 1 1 1 1 0 0 1 1 1
## 22 1 1 1 1 1 0 0 1 1
## 2 1 1 1 1 0 1 1 0 1
## 21 1 1 1 1 1 1 0 0 1
## 23 1 1 1 1 1 0 1 0 1
## 18 1 1 1 1 1 1 0 1 0
## 23 1 1 1 1 1 0 1 1 0
## 29 1 1 1 1 1 1 1 0 0
## 4 1 1 1 1 1 0 0 0 1
## 2 1 1 1 1 1 0 0 1 0
## 1 1 1 1 1 1 1 0 0 0
## 5 1 1 1 1 1 0 1 0 0
## 1 1 1 1 1 1 0 0 0 0
## 0 0 0 0 6 445 454 464 510
##
## 6448 0
## 3 1
## 385 1
## 364 1
## 378 1
## 431 1
## 1 2
## 22 2
## 2 2
## 21 2
## 23 2
## 18 2
## 23 2
## 29 2
## 4 3
## 2 3
## 1 3
## 5 3
## 1 4
## 1879
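
A missing-data pattern summary of this kind can be approximated in base R. The sketch below labels each row by the set of columns that are missing and then counts the labels; the data frame `toy` is made up and the approach is an illustration, not the function that produced the output above.

```r
# Toy data frame standing in for the insurance data (values are made up)
toy <- data.frame(
  AGE     = c(45, NA, 38, 51, 29, 60),
  YOJ     = c(11, 9, NA, NA, 7, 12),
  CAR_AGE = c(8, 3, 1, NA, NA, 10)
)

# Number of rows with no missing values
n_complete <- sum(complete.cases(toy))

# Label each row with the names of its missing columns
pattern <- apply(is.na(toy), 1, function(r) {
  missing <- names(toy)[r]
  if (length(missing) == 0) "complete" else paste(missing, collapse = "+")
})

# Count how often each missing-data pattern occurs
pattern_counts <- table(pattern)

print(n_complete)
print(pattern_counts)
```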
The missing home value data for students and the missing income data for home makers were replaced with zero. This decision was made after examination of the dataset. It is possible that students did not enter home value data because many students do not own a home. Missing income data for home makers may be due to no information being entered, since home makers don’t typically earn an income.
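
The replacement rule described above can be sketched in base R as follows. The data frame `toy` is made up, and the exact code used by the team is not shown in this report, so this is only one way such a rule could be written.

```r
# Toy data frame standing in for the insurance data (values are made up)
toy <- data.frame(
  JOB      = c("Student", "Home Maker", "Lawyer", "Student"),
  HOME_VAL = c(NA, 150000, NA, 0),
  INCOME   = c(5000, NA, NA, NA),
  stringsAsFactors = FALSE
)

# Students with a missing home value are assumed not to own a home
student_no_home <- toy$JOB == "Student" & is.na(toy$HOME_VAL)
toy$HOME_VAL[student_no_home] <- 0

# Home makers with a missing income are assumed to earn no income
home_maker_no_income <- toy$JOB == "Home Maker" & is.na(toy$INCOME)
toy$INCOME[home_maker_no_income] <- 0

print(toy)
```

Missing values for other job categories are left untouched, so they remain available for a separate imputation step.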
## [1] "TARGET_FLAG" "TARGET_AMT"
## [3] "KIDSDRIV" "MALE"
## [5] "MARRIED" "SINGLE_PARENT"
## [7] "LICENSE_REVOKED" "AGE"
## [9] "AGE_RANGE_16_19_YRS" "AGE_RANGE_20_29_YRS"
## [11] "AGE_RANGE_30_39_YRS" "AGE_RANGE_40_49_YRS"
## [13] "AGE_RANGE_50_59_YRS" "AGE_RANGE_60_69_YRS"
## [15] "AGE_RANGE_70_YRS_PLUS" "INEXP_DRIVER"
## [17] "HOMEKIDS" "YOJ"
## [19] "INCOME" "HOME_VAL"
## [21] "TRAVTIME" "BLUEBOOK"
## [23] "TIF" "OLDCLAIM"
## [25] "CLM_FREQ" "MVR_PTS"
## [27] "CAR_AGE" "CAR_AGE_RANGE_1_YR"
## [29] "CAR_AGE_RANGE_2_3_YRS" "CAR_AGE_RANGE_3_5_YRS"
## [31] "CAR_AGE_RANGE_5_10_YRS" "CAR_AGE_RANGE_10_YRS_PLUS"
## [33] "MAIN_DRIVING_CITY" "RED_CAR"
## [35] "EDU_HIGH_SCHOOL" "EDU_COLLEGE"
## [37] "EDU_ADV_DEGREE" "VEHICLE_USE_COMMERCIAL"
## [39] "VEHICLE_CLASS_TRUCK" "VEHICLE_CLASS_SUV"
## [41] "VEHICLE_CLASS_CAR" "SPORTS_CAR"
## [43] "RED_SPORTS_CAR" "TRUCK_COMM"
## [45] "SUV_COMM" "CAR_COMM"
## [47] "OCCUPATION_CLERICAL" "OCCUPATION_MANAGER"
## [49] "OCCUPATION_BLUE_COLLAR" "OCCUPATION_GOLD_COLLAR"
## [51] "OCCUPATION_STUDENT" "OCCUPATION_HOME_MAKER"
## [53] "OCCUPATION_PROFESSIONAL"
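
Indicator variables such as MALE or the AGE_RANGE columns listed above could be derived along the following lines. This is a sketch under stated assumptions: the recoding rules, the `"M"` / `"z_F"` coding of SEX, and the data frame `toy` are all hypothetical, since the report does not show how the recoded file was built.

```r
# Toy data frame; the SEX coding used here is an assumption
toy <- data.frame(
  AGE = c(17, 25, 44, 58, 72),
  SEX = c("M", "z_F", "M", "z_F", "M"),
  stringsAsFactors = FALSE
)

# 0/1 indicator for male drivers
toy$MALE <- as.integer(toy$SEX == "M")

# 0/1 indicators for two of the age ranges
toy$AGE_RANGE_16_19_YRS   <- as.integer(toy$AGE >= 16 & toy$AGE <= 19)
toy$AGE_RANGE_70_YRS_PLUS <- as.integer(toy$AGE >= 70)

print(toy)
```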
insurance_trainingT <- read.csv( "https://raw.githubusercontent.com/621-Group2/HW4/master/insurance_training_data_recoded.csv")
#x1 <- glm(TARGET_FLAG ~. -TARGET_AMT, family= binomial(), data = insurance_trainingT)
#car::mmps(x1)
null.model <- lm(TARGET_AMT ~ 1 , data= dev_train) # base intercept only model
full.model <- lm(TARGET_AMT ~ . , data= dev_train) # full model with all predictors
# perform step-wise algorithm
model2 <- step(null.model, scope = list(lower = null.model, upper = full.model), direction = "both", trace = 0, steps = 1000)
shortlistedVars <- names(unlist(model2[[1]])) # get the shortlisted variable.
shortlistedVars <- shortlistedVars[!shortlistedVars %in% "(Intercept)"] # remove intercept
print(shortlistedVars)
## [1] "BLUEBOOK" "MALE" "RED_CAR"
## [4] "AGE_RANGE_50_59_YRS" "SINGLE_PARENT"
# Variable importance for the stepwise model (varImp() is from the caret package)
x <- data.frame(varImp(model2))
x$Variable <- rownames(x)
x %>% ggplot(aes(x = reorder(Variable, Overall), y = Overall, fill = Overall)) +
  geom_bar(stat = "identity") + coord_flip() + guides(fill = "none") +
  xlab("Variable") + ylab("Importance") +
  ggtitle("Variable Importance")
summary(model2)
##
## Call:
## lm(formula = TARGET_AMT ~ BLUEBOOK + MALE + RED_CAR + AGE_RANGE_50_59_YRS +
## SINGLE_PARENT, data = dev_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8603 -2975 -1374 635 71334
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3266.06634 416.49604 7.842 8.36e-15 ***
## BLUEBOOK 0.10269 0.02183 4.705 2.77e-06 ***
## MALE 1489.89233 495.96503 3.004 0.00271 **
## RED_CAR -939.36057 542.09071 -1.733 0.08333 .
## AGE_RANGE_50_59_YRS 898.46784 452.22227 1.987 0.04713 *
## SINGLE_PARENT 796.49906 461.46732 1.726 0.08455 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7088 on 1503 degrees of freedom
## Multiple R-squared: 0.02549, Adjusted R-squared: 0.02225
## F-statistic: 7.862 on 5 and 1503 DF, p-value: 2.555e-07
autoplot(model2, which = 1:6, colour = 'dodgerblue3',
smooth.colour = 'red', smooth.linetype = 'dashed',
ad.colour = 'black',
label.size = 3, label.n = 5, label.colour = 'blue',
ncol = 3)
The diagnostic plots reveal some potential issues with this model. The Residuals vs. Fitted plot shows a downward trend: as the fitted values increase on the x-axis, the residuals decrease. We would expect to see a flat line if there is homoscedasticity, that is, residuals of equal variance. Heteroscedasticity is also seen in the Scale-Location plot, where we would again expect a relatively flat trend rather than the upward trend of the red line.
Heteroscedasticity can be confirmed statistically using the NCV test:
car::ncvTest(model2)
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 772.6424 Df = 1 p = 4.788895e-170
A p-value below the 0.05 significance level leads us to reject the null hypothesis of constant variance, confirming that there is a pattern in the residuals (heteroscedasticity).
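
To make the idea behind the test concrete, the sketch below computes a Breusch-Pagan / Cook-Weisberg style score statistic by hand: squared residuals are scaled by the maximum-likelihood variance estimate and regressed on the fitted values, and half of the regression sum of squares is compared against a chi-square distribution with 1 degree of freedom. The function name `ncv_score_test` and the simulated data are hypothetical; this is intended to follow the same idea as `car::ncvTest`, not to reproduce its exact output.

```r
# Score test for non-constant variance against the fitted values
ncv_score_test <- function(model) {
  res       <- residuals(model)
  n         <- length(res)
  sigma2_ml <- sum(res^2) / n          # ML estimate of the error variance
  u         <- res^2 / sigma2_ml       # scaled squared residuals (mean 1)
  aux       <- lm(u ~ fitted(model))   # auxiliary regression
  reg_ss    <- sum((fitted(aux) - mean(u))^2)
  chisq     <- reg_ss / 2
  c(chisq = chisq, df = 1,
    p = pchisq(chisq, df = 1, lower.tail = FALSE))
}

# Simulated data whose error spread grows with x (heteroscedastic by design)
set.seed(123)
n <- 500
x <- runif(n, 1, 10)
y <- 1 + 2 * x + rnorm(n, sd = x)
fit <- lm(y ~ x)

result <- ncv_score_test(fit)
print(result)
```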
The Normal Q-Q plot shows an issue with the requirement that the residuals be normally distributed. We see a steep increase approaching the second quantile. This is a strong indicator that a transformation may be required to satisfy the normality requirement.
Multicollinearity does not appear to be a problem.
car::vif(model2)
## BLUEBOOK MALE RED_CAR
## 1.015265 1.832129 1.818164
## AGE_RANGE_50_59_YRS SINGLE_PARENT
## 1.058556 1.061311
Additionally, this model is impacted by outliers as shown by Cook’s distance and the leverage plots.
car::outlierTest(model2)
## rstudent unadjusted p-value Bonferonni p
## 1858 10.440704 1.1045e-24 1.6666e-21
## 2063 10.071278 3.9376e-23 5.9419e-20
## 640 9.739370 8.8889e-22 1.3413e-18
## 1832 8.203562 4.9593e-16 7.4836e-13
## 143 7.896147 5.5094e-15 8.3138e-12
## 1137 7.635144 3.9899e-14 6.0207e-11
## 1552 7.312460 4.2485e-13 6.4109e-10
## 251 6.890893 8.1316e-12 1.2271e-08
## 1639 6.740388 2.2451e-11 3.3879e-08
## 43 6.413704 1.8980e-10 2.8641e-07
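
The quantities in this table can be approximated in base R: each studentized residual is compared against a t distribution, and the two-sided p-value is multiplied by the number of observations for the Bonferroni adjustment. The sketch below uses simulated data with one planted outlier; it illustrates the calculation and is not the `car::outlierTest` implementation.

```r
# Simulated regression data with one planted outlier (row 10)
set.seed(7)
n <- 100
x <- rnorm(n)
y <- 3 + 2 * x + rnorm(n)
y[10] <- y[10] + 15
fit <- lm(y ~ x)

# Externally studentized residuals follow a t distribution
# with (residual df - 1) degrees of freedom
t_stud   <- rstudent(fit)
df_resid <- df.residual(fit) - 1

# Two-sided unadjusted p-values and Bonferroni-adjusted p-values
p_unadj <- 2 * pt(abs(t_stud), df = df_resid, lower.tail = FALSE)
p_bonf  <- pmin(1, p_unadj * n)

# Observations flagged as outliers at the 0.05 level
flagged <- which(p_bonf < 0.05)
print(flagged)
```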
plot(cooks.distance(model2), pch=23, bg='orange', cex=2, ylab="Cook's distance")
Let’s remove these two outliers, identified by a Cook’s distance of 0.1 or greater, to see whether doing so improves the model.
dev_train_upd <- dev_train[which(cooks.distance(model2) < 0.1),]
#dev_train[which(cooks.distance(model2)==2038)]
mod2 <- update(model2,data=dev_train_upd)
autoplot(mod2, which = 1:6, colour = 'dodgerblue3',
smooth.colour = 'red', smooth.linetype = 'dashed',
ad.colour = 'black',
label.size = 3, label.n = 5, label.colour = 'blue',
ncol = 3)
gvlma(mod2)
##
## Call:
## lm(formula = TARGET_AMT ~ BLUEBOOK + MALE + RED_CAR + AGE_RANGE_50_59_YRS +
## SINGLE_PARENT, data = dev_train_upd)
##
## Coefficients:
## (Intercept) BLUEBOOK MALE
## 3580.68120 0.08887 1218.32680
## RED_CAR AGE_RANGE_50_59_YRS SINGLE_PARENT
## -667.44565 648.90297 499.93431
##
##
## ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
## USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
## Level of Significance = 0.05
##
## Call:
## gvlma(x = mod2)
##
## Value p-value Decision
## Global Stat 8.411e+04 0.0000 Assumptions NOT satisfied!
## Skewness 6.806e+03 0.0000 Assumptions NOT satisfied!
## Kurtosis 7.730e+04 0.0000 Assumptions NOT satisfied!
## Link Function 1.284e+00 0.2572 Assumptions acceptable.
## Heteroscedasticity 1.436e-03 0.9698 Assumptions acceptable.
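Removing the outliers does not improve model 2: the global, skewness and kurtosis tests still indicate that the linear model assumptions are not satisfied.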