Group 2 of the DATA621 class was asked to analyze and model a data set containing approximately 8,000 records, each representing a customer at an auto insurance company, with 2 response variables and 23 predictor variables.
Each record has two response variables. The first, TARGET_FLAG, is binary (0/1): the value is 1 if the person was in a car crash and 0 if not.
The second response variable is TARGET_AMT. If the person did not crash their car, the value is 0; if they did, the value is greater than 0 and represents the cost of the crash.
The objective is to build multiple linear regression and binary logistic regression models on the training data to predict the probability that a person will crash their car, as well as the amount of money it will cost if they do.
Only the variables provided, or variables derived from them, were used.
The team met to discuss the assignment and an approach for completing it. Tasks were assigned among the five team members as follows:

- Data Exploration
- Data Preparation
- Build Models
- Select Models
GitHub was used to manage the project; it helped with version control and ensured each team member had access to the latest version of the project documentation.
Slack was used for daily communication during the project and for quick access to code and documentation. Meetings were organized at least twice a week, and as needed, using GoToMeeting.
Since the data sets were provided, it was crucial that we understand the data set and determine whether any missing values are present.
Based on the objective of the project, several logistic, multiple linear, and robust regression models were built.
- Valerie Briot
- Michael D’acampora
- Keith Folsom
- Brian Kreis
- Sharon Morris
For reproducibility of the results, the data was loaded to and accessed from a GitHub repository.
The purpose of the data exploration and statistical measures phase is to understand the data and determine how to process the dataset for modeling.
The majority of variables do not contain missing values. The predictor CAR_AGE (Vehicle Age) contains 510 missing values, YOJ (Years on Job) contains 454, INCOME (Income) 445, and HOME_VAL (Home Value) 464.
The visualization of missing values below shows that CAR_AGE, HOME_VAL, and YOJ are each missing for about 6 percent of records, while INCOME is missing for about 5 percent. The dataset was imputed to account for the missing values.
| Variable | Missing Values |
|---|---|
| TARGET_FLAG | 0 |
| TARGET_AMT | 0 |
| KIDSDRIV | 0 |
| AGE | 6 |
| HOMEKIDS | 0 |
| YOJ | 454 |
| INCOME | 445 |
| PARENT1 | 0 |
| HOME_VAL | 464 |
| MSTATUS | 0 |
| SEX | 0 |
| EDUCATION | 0 |
| JOB | 0 |
| TRAVTIME | 0 |
| CAR_USE | 0 |
| BLUEBOOK | 0 |
| TIF | 0 |
| CAR_TYPE | 0 |
| RED_CAR | 0 |
| OLDCLAIM | 0 |
| CLM_FREQ | 0 |
| REVOKED | 0 |
| MVR_PTS | 0 |
| CAR_AGE | 510 |
| URBANICITY | 0 |
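The missing-value counts shown above can be reproduced with a short summary; a minimal sketch, assuming the raw training data has been read into insurance_train (the data frame name used elsewhere in this report):

missing_counts <- sapply(insurance_train, function(x) sum(is.na(x)))  # NA count per column
kable(data.frame(Variable = names(missing_counts), Missing = missing_counts))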
| Variable Name | Definition | Variable Type |
|---|---|---|
| TARGET_FLAG | Was Car in a crash? 1=YES 0=NO | Response |
| TARGET_AMT | If car was in a crash, what was the cost | Response |
| AGE | Age of Driver | Predictor |
| BLUEBOOK | Value of Vehicle | Predictor |
| CAR_AGE | Vehicle Age | Predictor |
| CAR_TYPE | Type of Car | Predictor |
| CAR_USE | Vehicle Use | Predictor |
| CLM_FREQ | # Claims (Past 5 Years) | Predictor |
| EDUCATION | Max Education Level | Predictor |
| HOMEKIDS | # Children at Home | Predictor |
| HOME_VAL | Home Value | Predictor |
| INCOME | Income | Predictor |
| JOB | Job Category | Predictor |
| KIDSDRIV | # Driving Children | Predictor |
| MSTATUS | Marital Status | Predictor |
| MVR_PTS | Motor Vehicle Record Points | Predictor |
| OLDCLAIM | Total Claims (Past 5 Years) | Predictor |
| PARENT1 | Single Parent | Predictor |
| RED_CAR | A Red Car | Predictor |
| REVOKED | License Revoked (Past 7 Years) | Predictor |
| SEX | Gender | Predictor |
| TIF | Time in Force | Predictor |
| TRAVTIME | Distance to Work | Predictor |
| URBANICITY | Home/Work Area | Predictor |
| YOJ | Years on Job | Predictor |
Descriptive statistics were computed for all predictor and response variables to explore the data.
## TARGET_FLAG TARGET_AMT KIDSDRIV AGE HOMEKIDS YOJ
## 1 0 0 0 0.000735204 0 0.05563044
## INCOME PARENT1 HOME_VAL MSTATUS SEX EDUCATION JOB TRAVTIME CAR_USE
## 1 0.05452763 0 0.05685578 0 0 0 0 0 0
## BLUEBOOK TIF CAR_TYPE RED_CAR OLDCLAIM CLM_FREQ REVOKED MVR_PTS
## 1 0 0 0 0 0 0 0 0
## CAR_AGE URBANICITY
## 1 0.06249234 0
## vars n mean sd median trimmed mad min
## TARGET_FLAG 1 8161 0.26 0.44 0 0.20 0.00 0
## TARGET_AMT 2 8161 1504.32 4704.03 0 593.71 0.00 0
## KIDSDRIV 3 8161 0.17 0.51 0 0.03 0.00 0
## AGE 4 8155 44.79 8.63 45 44.83 8.90 16
## HOMEKIDS 5 8161 0.72 1.12 0 0.50 0.00 0
## YOJ 6 7707 10.50 4.09 11 11.07 2.97 0
## INCOME 7 7716 61898.09 47572.68 54028 56840.98 41792.27 0
## PARENT1* 8 8161 1.13 0.34 1 1.04 0.00 1
## HOME_VAL 9 7697 154867.29 129123.77 161160 144032.07 147867.11 0
## MSTATUS* 10 8161 1.40 0.49 1 1.38 0.00 1
## SEX* 11 8161 1.54 0.50 2 1.55 0.00 1
## EDUCATION* 12 8161 3.09 1.44 3 3.11 1.48 1
## JOB* 13 8161 5.69 2.68 6 5.81 2.97 1
## TRAVTIME 14 8161 33.49 15.91 33 33.00 16.31 5
## CAR_USE* 15 8161 1.63 0.48 2 1.66 0.00 1
## BLUEBOOK 16 8161 15709.90 8419.73 14440 15036.89 8450.82 1500
## TIF 17 8161 5.35 4.15 4 4.84 4.45 1
## CAR_TYPE* 18 8161 3.53 1.97 3 3.54 2.97 1
## RED_CAR* 19 8161 1.29 0.45 1 1.24 0.00 1
## OLDCLAIM 20 8161 4037.08 8777.14 0 1719.29 0.00 0
## CLM_FREQ 21 8161 0.80 1.16 0 0.59 0.00 0
## REVOKED* 22 8161 1.12 0.33 1 1.03 0.00 1
## MVR_PTS 23 8161 1.70 2.15 1 1.31 1.48 0
## CAR_AGE 24 7651 8.33 5.70 8 7.96 7.41 -3
## URBANICITY* 25 8161 1.20 0.40 1 1.13 0.00 1
## max range skew kurtosis se IQR Q0.1 Q0.25
## TARGET_FLAG 1.0 1.0 1.07 -0.85 0.00 1 0.0 0
## TARGET_AMT 107586.1 107586.1 8.71 112.29 52.07 1036 0.0 0
## KIDSDRIV 4.0 4.0 3.35 11.78 0.01 0 0.0 0
## AGE 81.0 65.0 -0.03 -0.06 0.10 12 34.0 39
## HOMEKIDS 5.0 5.0 1.34 0.65 0.01 1 0.0 0
## YOJ 23.0 23.0 -1.20 1.18 0.05 4 5.0 9
## INCOME 367030.0 367030.0 1.19 2.13 541.58 57889 4380.5 28097
## PARENT1* 2.0 1.0 2.17 2.73 0.00 0 1.0 1
## HOME_VAL 885282.0 885282.0 0.49 -0.02 1471.79 238724 0.0 0
## MSTATUS* 2.0 1.0 0.41 -1.83 0.01 1 1.0 1
## SEX* 2.0 1.0 -0.14 -1.98 0.01 1 1.0 1
## EDUCATION* 5.0 4.0 0.12 -1.38 0.02 3 1.0 2
## JOB* 9.0 8.0 -0.31 -1.22 0.03 5 2.0 3
## TRAVTIME 142.0 137.0 0.45 0.66 0.18 22 13.0 22
## CAR_USE* 2.0 1.0 -0.53 -1.72 0.01 1 1.0 1
## BLUEBOOK 69740.0 68240.0 0.79 0.79 93.20 11570 6000.0 9280
## TIF 25.0 24.0 0.89 0.42 0.05 6 1.0 1
## CAR_TYPE* 6.0 5.0 0.00 -1.52 0.02 5 1.0 1
## RED_CAR* 2.0 1.0 0.92 -1.16 0.01 1 1.0 1
## OLDCLAIM 57037.0 57037.0 3.12 9.86 97.16 4636 0.0 0
## CLM_FREQ 5.0 5.0 1.21 0.28 0.01 2 0.0 0
## REVOKED* 2.0 1.0 2.30 3.30 0.00 0 1.0 1
## MVR_PTS 13.0 13.0 1.35 1.38 0.02 3 0.0 0
## CAR_AGE 28.0 31.0 0.28 -0.75 0.07 11 1.0 1
## URBANICITY* 2.0 1.0 1.46 0.15 0.00 0 1.0 1
## Q0.75 Q0.9
## TARGET_FLAG 1 1.0
## TARGET_AMT 1036 4904.0
## KIDSDRIV 0 1.0
## AGE 51 56.0
## HOMEKIDS 1 3.0
## YOJ 13 15.0
## INCOME 85986 123180.0
## PARENT1* 1 2.0
## HOME_VAL 238724 316542.6
## MSTATUS* 2 2.0
## SEX* 2 2.0
## EDUCATION* 5 5.0
## JOB* 8 9.0
## TRAVTIME 44 54.0
## CAR_USE* 2 2.0
## BLUEBOOK 20850 27460.0
## TIF 7 11.0
## CAR_TYPE* 6 6.0
## RED_CAR* 2 2.0
## OLDCLAIM 4636 9583.0
## CLM_FREQ 2 3.0
## REVOKED* 1 2.0
## MVR_PTS 3 5.0
## CAR_AGE 12 16.0
## URBANICITY* 1 2.0
The correlation matrix shown below highlights correlations among several predictor variables; the correlation between claims in the past 5 years (CLM_FREQ) and motor vehicle record points (MVR_PTS), and between driving children (KIDSDRIV) and age of driver (AGE), is very high at 0.67.
The tables below show the correlations between the response variables and the numeric predictors.
| VARIABLE | CORRELATION WITH TARGET_FLAG |
|---|---|
| KIDSDRIV | 0.1036683 |
| AGE | -0.1032167 |
| HOMEKIDS | 0.115621 |
| YOJ | -0.0705118 |
| INCOME | -0.1420081 |
| HOME_VAL | -0.1837371 |
| TRAVTIME | 0.0483683 |
| BLUEBOOK | -0.1033832 |
| TIF | -0.08237 |
| OLDCLAIM | 0.1380838 |
| CLM_FREQ | 0.2161961 |
| MVR_PTS | 0.2191971 |
| CAR_AGE | -0.1006506 |
| VARIABLE | CORRELATION WITH TARGET_AMT |
|---|---|
| KIDSDRIV | 0.0553942 |
| AGE | -0.0417283 |
| HOMEKIDS | 0.061988 |
| YOJ | -0.0220852 |
| INCOME | -0.0583069 |
| HOME_VAL | -0.0856024 |
| TRAVTIME | 0.027987 |
| BLUEBOOK | -0.0046995 |
| TIF | -0.0464808 |
| OLDCLAIM | 0.0709533 |
| CLM_FREQ | 0.1164192 |
| MVR_PTS | 0.1378655 |
| CAR_AGE | -0.0588221 |
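The correlations in these two tables can be computed per predictor against each response; a minimal sketch for TARGET_FLAG, assuming the numeric predictors listed and pairwise-complete observations to handle missing values:

numeric_vars <- c("KIDSDRIV", "AGE", "HOMEKIDS", "YOJ", "INCOME", "HOME_VAL",
                  "TRAVTIME", "BLUEBOOK", "TIF", "OLDCLAIM", "CLM_FREQ",
                  "MVR_PTS", "CAR_AGE")
# Correlation of each numeric predictor with the binary response
cor_flag <- sapply(numeric_vars, function(v)
  cor(insurance_train[[v]], insurance_train$TARGET_FLAG,
      use = "pairwise.complete.obs"))
kable(data.frame(VARIABLE = numeric_vars, CORRELATION = round(cor_flag, 7)))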
Each predictor was examined to determine whether a transformation is needed.
The Driving Children (KIDSDRIV) variable is highly skewed to the right, with outliers at the high end.
| Range | Values |
|---|---|
| Lowest | None |
| Highest | 4, 3, 2, 1 |
The AGE predictor is close to normally distributed, with high outliers at ages 72, 73, 76, 80, and 81, and low outliers at 16, 17, and 18.
| Range | Values |
|---|---|
| Lowest | 20, 19, 18, 17, 16 |
| Highest | 81, 80, 76, 73, 72, 70 |
The car value predictor (BLUEBOOK) is slightly skewed to the right, with some outliers at the higher car values.
| Range | Values |
|---|---|
| Lowest | None |
| Highest | 69740, 65970, 62240, 61050, 57970, 50970, 50180, 49880, 49230, 48620 |
The distribution is normal, and there are no outliers.
| Range | Values |
|---|---|
| Lowest | None |
| Highest | None |
z_SUV and Minivan make up the majority of insured vehicles.
##
## Minivan Panel Truck Pickup Sports Car Van z_SUV
## 2145 676 1389 907 750 2294
The majority of cars are privately used.
##
## Commercial Private
## 3029 5132
The distribution of claims is multimodal, with the largest number of claims occurring before year 1. There are no outliers.
| Range | Values |
|---|---|
| Lowest | None |
| Highest | None |
The majority of customers are college or high school graduates.
##
## <High School Bachelors Masters PhD z_High School
## 1203 2242 1658 728 2330
The distribution of HOMEKIDS is multimodal. The majority of customers do not have any children. There are some outliers.
| Range | Values |
|---|---|
| Lowest | None |
| Highest | 5, 4, 3 |
The distribution of HOME_VAL is skewed to the right (sample skewness 0.49). There are zero values that will require further exploration, and several outliers on the higher end.
| Range | Values |
|---|---|
| Lowest | None |
| Highest | 885282, 750455, 738153, 682634, 657804, 653952, 649247, 631309, 630267, 611328 |
The distribution of INCOME is unimodal and skewed to the right, with several outliers on the higher end.
| Range | Values |
|---|---|
| Lowest | None |
| Highest | 367030, 332339, 320127, 309628, 306277, 297435, 290846, 284071, 282292, 282198 |
The majority of customers work in blue collar jobs.
ggplot(insurance_train, aes(x = JOB)) +
  geom_bar(fill = "red", width = 0.7) +  # bar chart of job category counts
  xlab("Job Category") + ylab("Count")
table(insurance_train$JOB)
##
## Clerical Doctor Home Maker Lawyer
## 526 1271 246 641 835
## Manager Professional Student z_Blue Collar
## 988 1117 712 1825
The majority of customers are married.
##
## Yes z_No
## 4894 3267
The distribution of the MVR_PTS is skewed to the right. There are outliers on the higher end.
| Range | Values |
|---|---|
| Lowest | None |
| Highest | 13, 11, 10, 9, 8 |
The distribution of OLDCLAIM is highly skewed to the right (sample skewness 3.12), with several outliers on the higher end.
| Range | Values |
|---|---|
| Lowest | None |
| Highest | 57037, 53986, 53568, 53477, 52507, 52465, 52445, 52068, 51904, 51593 |
The majority of customers are not single parents.
##
## No Yes
## 7084 1077
The majority of the cars are not red.
##
## no yes
## 5783 2378
The distribution of TIF is skewed to the right with several outliers.
| Range | Values |
|---|---|
| Lowest | None |
| Highest | 25, 22, 21, 20, 19, 18, 17 |
The distribution of TRAVTIME is skewed to the right with several outliers.
| Range | Values |
|---|---|
| Lowest | None |
| Highest | 142, 134, 124, 113, 103, 101, 98, 97, 95, 93 |
The YOJ distribution is close to normally distributed. There are outliers at both the lower and upper ends.
| Range | Values |
|---|---|
| Lowest | 2 |
| Highest | 23 |
This section tests the predictor variables to determine whether there is correlation among them. The variance inflation factor (VIF) is used to detect multicollinearity, specifically across the entire set of predictors rather than within pairs of variables.
Testing for collinearity among the predictor variables, we see that none of the numeric predictor variables appear to have a problem with collinearity based on their low VIF scores.
## No variable from the 13 input variables has collinearity problem.
##
## The linear correlation coefficients ranges between:
## min correlation ( TIF ~ AGE ): 0.000275457
## max correlation ( HOME_VAL ~ INCOME ): 0.5769938
##
## ---------- VIFs of the remained variables --------
## Variables VIF
## 1 KIDSDRIV 1.306299
## 2 AGE 1.406202
## 3 HOMEKIDS 1.704661
## 4 YOJ 1.180470
## 5 INCOME 2.017533
## 6 HOME_VAL 1.569744
## 7 TRAVTIME 1.003802
## 8 BLUEBOOK 1.254096
## 9 TIF 1.003996
## 10 OLDCLAIM 1.327482
## 11 CLM_FREQ 1.461470
## 12 MVR_PTS 1.204448
## 13 CAR_AGE 1.241317
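The report above matches the output format of the usdm package; a minimal sketch of how such a check could be run, assuming the 13 numeric predictors are pulled into their own data frame:

# Stepwise VIF check: flags any variable whose VIF exceeds the threshold
num_pred <- insurance_train[, c("KIDSDRIV", "AGE", "HOMEKIDS", "YOJ", "INCOME",
                                "HOME_VAL", "TRAVTIME", "BLUEBOOK", "TIF",
                                "OLDCLAIM", "CLM_FREQ", "MVR_PTS", "CAR_AGE")]
usdm::vifstep(num_pred, th = 10)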
The majority of cases are complete. The concern is the two predictor variables (CAR_AGE, YOJ) with more than 5% missing values.
For predictors without missing values, zero values can indicate values that are actually missing. For instance, the predictors HOME_VAL and INCOME contain zero values, which are highly unlikely.
The missing data patterns below show that 6,448 of the 8,161 observations are complete; for example, 3 observations are missing only AGE, 385 are missing only YOJ, 431 are missing only CAR_AGE, and 18 are missing both YOJ and CAR_AGE.
## TARGET_FLAG TARGET_AMT KIDSDRIV HOMEKIDS PARENT1 MSTATUS SEX
## 6448 1 1 1 1 1 1 1
## 3 1 1 1 1 1 1 1
## 385 1 1 1 1 1 1 1
## 364 1 1 1 1 1 1 1
## 378 1 1 1 1 1 1 1
## 431 1 1 1 1 1 1 1
## 1 1 1 1 1 1 1 1
## 22 1 1 1 1 1 1 1
## 2 1 1 1 1 1 1 1
## 21 1 1 1 1 1 1 1
## 23 1 1 1 1 1 1 1
## 18 1 1 1 1 1 1 1
## 23 1 1 1 1 1 1 1
## 29 1 1 1 1 1 1 1
## 4 1 1 1 1 1 1 1
## 2 1 1 1 1 1 1 1
## 1 1 1 1 1 1 1 1
## 5 1 1 1 1 1 1 1
## 1 1 1 1 1 1 1 1
## 0 0 0 0 0 0 0
## EDUCATION JOB TRAVTIME CAR_USE BLUEBOOK TIF CAR_TYPE RED_CAR OLDCLAIM
## 6448 1 1 1 1 1 1 1 1 1
## 3 1 1 1 1 1 1 1 1 1
## 385 1 1 1 1 1 1 1 1 1
## 364 1 1 1 1 1 1 1 1 1
## 378 1 1 1 1 1 1 1 1 1
## 431 1 1 1 1 1 1 1 1 1
## 1 1 1 1 1 1 1 1 1 1
## 22 1 1 1 1 1 1 1 1 1
## 2 1 1 1 1 1 1 1 1 1
## 21 1 1 1 1 1 1 1 1 1
## 23 1 1 1 1 1 1 1 1 1
## 18 1 1 1 1 1 1 1 1 1
## 23 1 1 1 1 1 1 1 1 1
## 29 1 1 1 1 1 1 1 1 1
## 4 1 1 1 1 1 1 1 1 1
## 2 1 1 1 1 1 1 1 1 1
## 1 1 1 1 1 1 1 1 1 1
## 5 1 1 1 1 1 1 1 1 1
## 1 1 1 1 1 1 1 1 1 1
## 0 0 0 0 0 0 0 0 0
## CLM_FREQ REVOKED MVR_PTS URBANICITY AGE INCOME YOJ HOME_VAL CAR_AGE
## 6448 1 1 1 1 1 1 1 1 1
## 3 1 1 1 1 0 1 1 1 1
## 385 1 1 1 1 1 1 0 1 1
## 364 1 1 1 1 1 0 1 1 1
## 378 1 1 1 1 1 1 1 0 1
## 431 1 1 1 1 1 1 1 1 0
## 1 1 1 1 1 0 0 1 1 1
## 22 1 1 1 1 1 0 0 1 1
## 2 1 1 1 1 0 1 1 0 1
## 21 1 1 1 1 1 1 0 0 1
## 23 1 1 1 1 1 0 1 0 1
## 18 1 1 1 1 1 1 0 1 0
## 23 1 1 1 1 1 0 1 1 0
## 29 1 1 1 1 1 1 1 0 0
## 4 1 1 1 1 1 0 0 0 1
## 2 1 1 1 1 1 0 0 1 0
## 1 1 1 1 1 1 1 0 0 0
## 5 1 1 1 1 1 0 1 0 0
## 1 1 1 1 1 1 0 0 0 0
## 0 0 0 0 6 445 454 464 510
##
## 6448 0
## 3 1
## 385 1
## 364 1
## 378 1
## 431 1
## 1 2
## 22 2
## 2 2
## 21 2
## 23 2
## 18 2
## 23 2
## 29 2
## 4 3
## 2 3
## 1 3
## 5 3
## 1 4
## 1879
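The pattern matrix above follows the format produced by the mice package; a minimal sketch, assuming mice is installed:

# Each row is a missingness pattern; the left margin counts observations with
# that pattern and the bottom margin counts missing values per variable
mice::md.pattern(insurance_train)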
The missing home value data for students and missing income data for home makers were replaced with zero. This decision was made after examining the dataset: many students do not own a home, so they may have left home value blank, and home makers typically do not earn an income, so no value may have been entered.
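A minimal sketch of this targeted replacement, assuming the JOB level names shown earlier (Student, Home Maker):

# Students with missing home value are assumed to own no home
insurance_train$HOME_VAL[which(insurance_train$JOB == "Student" &
                               is.na(insurance_train$HOME_VAL))] <- 0
# Home makers with missing income are assumed to have no earned income
insurance_train$INCOME[which(insurance_train$JOB == "Home Maker" &
                             is.na(insurance_train$INCOME))] <- 0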
The insurance dataset was recoded into the variables listed below. Most of the recoding centered on converting factors to dummy variables. Additionally, age-related fields such as AGE and CAR_AGE were binned into ranges for possible benefit during the modeling phase, to help determine significance within specific ranges.
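A minimal sketch of the age-range recoding, assuming the break points implied by the AGE_RANGE_* dummy names used in the models below:

# Bin AGE into ranges, then expand into 0/1 dummy columns
age_range <- cut(insurance_train$AGE,
                 breaks = c(16, 19, 29, 39, 49, 59, 69, Inf),
                 labels = c("16_19_YRS", "20_29_YRS", "30_39_YRS", "40_49_YRS",
                            "50_59_YRS", "60_69_YRS", "70_YRS_PLUS"),
                 include.lowest = TRUE)
age_dummies <- model.matrix(~ age_range - 1)  # one indicator column per range
# (rows with missing AGE are dropped here; imputation handles them upstream)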
Our first goal is to build logistic models to predict the likelihood of a crash based on the predictors. We will start with a base model that includes all variables.
We check for class bias to make sure that the number of positive cases (accidents) is reasonably in balance with the number of negative cases (non-accidents). We have a reasonable amount of both, and the fact that there are fewer accidents than non-accidents in the dataset reflects what we would expect in reality.
# Load the imputed training set and check the class balance of the response
insurance_train_impute <- read.csv("https://raw.githubusercontent.com/621-Group2/HW4/master/insurance_training_Impute.csv")
table(insurance_train_impute$TARGET_FLAG)
##
## 0 1
## 6008 2153
We will first examine the results of a logistic regression model using all of the predictors. No transformation has been performed on the predictor variables.
##
## Call:
## glm(formula = target ~ ., family = binomial(link = "logit"),
## data = train_all)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4460 -0.7069 -0.3940 0.5608 2.9503
##
## Coefficients: (5 not defined because of singularities)
## Estimate Std. Error z value
## (Intercept) -4.5156774307 1.3856362676 -3.259
## KIDSDRIV 0.4436739073 0.0740574188 5.991
## MALE -0.3812547712 0.1103302323 -3.456
## MARRIED -0.4520740025 0.1037317291 -4.358
## SINGLE_PARENT 0.3812830963 0.1341100709 2.843
## LICENSE_REVOKED 0.8912160272 0.1106649915 8.053
## AGE 0.0122755654 0.0135549946 0.906
## AGE_RANGE_16_19_YRS 1.5162020407 1.6371442238 0.926
## AGE_RANGE_20_29_YRS 1.8555513818 1.0887853725 1.704
## AGE_RANGE_30_39_YRS 1.3748950319 1.0217915333 1.346
## AGE_RANGE_40_49_YRS 0.9676266567 0.9711946882 0.996
## AGE_RANGE_50_59_YRS 1.1852294403 0.9350725097 1.268
## AGE_RANGE_60_69_YRS 1.7696770147 0.9193394808 1.925
## AGE_RANGE_70_YRS_PLUS NA NA NA
## INEXP_DRIVER 0.5868826951 0.8525753654 0.688
## HOMEKIDS 0.0388910384 0.0453501893 0.858
## YOJ -0.0136769013 0.0102354266 -1.336
## INCOME -0.0000028830 0.0000013156 -2.191
## HOME_VAL -0.0000013388 0.0000004218 -3.174
## TRAVTIME 0.0132209299 0.0022940177 5.763
## BLUEBOOK -0.0000353644 0.0000050797 -6.962
## TIF -0.0480361814 0.0088859919 -5.406
## OLDCLAIM -0.0000165353 0.0000047287 -3.497
## CLM_FREQ 0.2170060113 0.0344495394 6.299
## MVR_PTS 0.1314765628 0.0165186025 7.959
## CAR_AGE 0.0152189142 0.0182796297 0.833
## CAR_AGE_RANGE_1_YR 0.2450705383 0.2342263021 1.046
## CAR_AGE_RANGE_2_3_YRS 0.5163936164 0.7958336429 0.649
## CAR_AGE_RANGE_3_5_YRS 0.0903700329 0.2867977266 0.315
## CAR_AGE_RANGE_5_10_YRS 0.0777552247 0.1378906799 0.564
## CAR_AGE_RANGE_10_YRS_PLUS NA NA NA
## MAIN_DRIVING_CITY 2.4333560852 0.1397717015 17.410
## RED_CAR -0.0958125497 0.1063791164 -0.901
## EDU_HIGH_SCHOOL 0.0704536541 0.1172922774 0.601
## EDU_COLLEGE -0.3586927612 0.1097308084 -3.269
## EDU_ADV_DEGREE 0.0163360880 0.1674492596 0.098
## VEHICLE_USE_COMMERCIAL 0.4559062825 0.2639983974 1.727
## VEHICLE_CLASS_TRUCK -0.2567708411 0.1833440128 -1.400
## VEHICLE_CLASS_SUV -0.3746733481 0.1315306619 -2.849
## VEHICLE_CLASS_CAR NA NA NA
## SPORTS_CAR NA NA NA
## RED_SPORTS_CAR 0.0153854435 0.5978064210 0.026
## TRUCK_COMM 0.7070014217 0.2916129221 2.424
## SUV_COMM 0.1273187285 0.2726418584 0.467
## CAR_COMM NA NA NA
## OCCUPATION_CLERICAL 0.2540475486 0.2362478041 1.075
## OCCUPATION_MANAGER -0.6507643393 0.2046537822 -3.180
## OCCUPATION_BLUE_COLLAR 0.2088493613 0.2235968136 0.934
## OCCUPATION_GOLD_COLLAR -0.0368443647 0.1896582780 -0.194
## OCCUPATION_STUDENT 0.0565989712 0.2611280051 0.217
## OCCUPATION_HOME_MAKER 0.1282760297 0.2528924048 0.507
## OCCUPATION_PROFESSIONAL 0.0437333423 0.2109696174 0.207
## Pr(>|z|)
## (Intercept) 0.001118 **
## KIDSDRIV 0.000000002086254644 ***
## MALE 0.000549 ***
## MARRIED 0.000013119208440227 ***
## SINGLE_PARENT 0.004468 **
## LICENSE_REVOKED 0.000000000000000806 ***
## AGE 0.365141
## AGE_RANGE_16_19_YRS 0.354380
## AGE_RANGE_20_29_YRS 0.088336 .
## AGE_RANGE_30_39_YRS 0.178440
## AGE_RANGE_40_49_YRS 0.319092
## AGE_RANGE_50_59_YRS 0.204967
## AGE_RANGE_60_69_YRS 0.054236 .
## AGE_RANGE_70_YRS_PLUS NA
## INEXP_DRIVER 0.491223
## HOMEKIDS 0.391129
## YOJ 0.181474
## INCOME 0.028418 *
## HOME_VAL 0.001502 **
## TRAVTIME 0.000000008252383142 ***
## BLUEBOOK 0.000000000003356510 ***
## TIF 0.000000064508162518 ***
## OLDCLAIM 0.000471 ***
## CLM_FREQ 0.000000000299103262 ***
## MVR_PTS 0.000000000000001730 ***
## CAR_AGE 0.405092
## CAR_AGE_RANGE_1_YR 0.295423
## CAR_AGE_RANGE_2_3_YRS 0.516422
## CAR_AGE_RANGE_3_5_YRS 0.752686
## CAR_AGE_RANGE_5_10_YRS 0.572829
## CAR_AGE_RANGE_10_YRS_PLUS NA
## MAIN_DRIVING_CITY < 0.0000000000000002 ***
## RED_CAR 0.367763
## EDU_HIGH_SCHOOL 0.548061
## EDU_COLLEGE 0.001080 **
## EDU_ADV_DEGREE 0.922283
## VEHICLE_USE_COMMERCIAL 0.084181 .
## VEHICLE_CLASS_TRUCK 0.161368
## VEHICLE_CLASS_SUV 0.004392 **
## VEHICLE_CLASS_CAR NA
## SPORTS_CAR NA
## RED_SPORTS_CAR 0.979468
## TRUCK_COMM 0.015332 *
## SUV_COMM 0.640513
## CAR_COMM NA
## OCCUPATION_CLERICAL 0.282221
## OCCUPATION_MANAGER 0.001474 **
## OCCUPATION_BLUE_COLLAR 0.350281
## OCCUPATION_GOLD_COLLAR 0.845967
## OCCUPATION_STUDENT 0.828405
## OCCUPATION_HOME_MAKER 0.611990
## OCCUPATION_PROFESSIONAL 0.835778
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 6546.4 on 5711 degrees of freedom
## Residual deviance: 5039.0 on 5665 degrees of freedom
## AIC: 5133
##
## Number of Fisher Scoring iterations: 5
We will examine marginal model plots on the non-binary, non-factor variables to see if any appear to need transformation.
Based on the marginal model plots above, we take the log of several of the continuous variables to see if this improves the model fit.
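A minimal sketch of these two steps, assuming the full model object is named model1 and the recoded training frame is train_all (as in the model calls below); log1p is used because several of these variables contain zeros, and the variable list here is illustrative:

car::mmps(model1)  # marginal model plots for the continuous predictors

train_log <- train_all
for (v in c("INCOME", "HOME_VAL", "BLUEBOOK", "TRAVTIME", "OLDCLAIM")) {
  train_log[[v]] <- log1p(train_log[[v]])  # log(1 + x) handles zero values
}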
There is no statistically significant difference when using the transformed variables, based on the likelihood ratio test below. As such, and because the AIC value is slightly smaller for the non-transformed data, we will proceed using the non-transformed data.
## Likelihood ratio test
##
## Model 1: target ~ KIDSDRIV + MALE + MARRIED + SINGLE_PARENT + LICENSE_REVOKED +
## AGE + AGE_RANGE_16_19_YRS + AGE_RANGE_20_29_YRS + AGE_RANGE_30_39_YRS +
## AGE_RANGE_40_49_YRS + AGE_RANGE_50_59_YRS + AGE_RANGE_60_69_YRS +
## AGE_RANGE_70_YRS_PLUS + INEXP_DRIVER + HOMEKIDS + YOJ + INCOME +
## HOME_VAL + TRAVTIME + BLUEBOOK + TIF + OLDCLAIM + CLM_FREQ +
## MVR_PTS + CAR_AGE + CAR_AGE_RANGE_1_YR + CAR_AGE_RANGE_2_3_YRS +
## CAR_AGE_RANGE_3_5_YRS + CAR_AGE_RANGE_5_10_YRS + CAR_AGE_RANGE_10_YRS_PLUS +
## MAIN_DRIVING_CITY + RED_CAR + EDU_HIGH_SCHOOL + EDU_COLLEGE +
## EDU_ADV_DEGREE + VEHICLE_USE_COMMERCIAL + VEHICLE_CLASS_TRUCK +
## VEHICLE_CLASS_SUV + VEHICLE_CLASS_CAR + SPORTS_CAR + RED_SPORTS_CAR +
## TRUCK_COMM + SUV_COMM + CAR_COMM + OCCUPATION_CLERICAL +
## OCCUPATION_MANAGER + OCCUPATION_BLUE_COLLAR + OCCUPATION_GOLD_COLLAR +
## OCCUPATION_STUDENT + OCCUPATION_HOME_MAKER + OCCUPATION_PROFESSIONAL
## Model 2: target ~ KIDSDRIV + MALE + MARRIED + SINGLE_PARENT + LICENSE_REVOKED +
## AGE + AGE_RANGE_16_19_YRS + AGE_RANGE_20_29_YRS + AGE_RANGE_30_39_YRS +
## AGE_RANGE_40_49_YRS + AGE_RANGE_50_59_YRS + AGE_RANGE_60_69_YRS +
## AGE_RANGE_70_YRS_PLUS + INEXP_DRIVER + HOMEKIDS + YOJ + INCOME +
## HOME_VAL + TRAVTIME + BLUEBOOK + TIF + OLDCLAIM + CLM_FREQ +
## MVR_PTS + CAR_AGE + CAR_AGE_RANGE_1_YR + CAR_AGE_RANGE_2_3_YRS +
## CAR_AGE_RANGE_3_5_YRS + CAR_AGE_RANGE_5_10_YRS + CAR_AGE_RANGE_10_YRS_PLUS +
## MAIN_DRIVING_CITY + RED_CAR + EDU_HIGH_SCHOOL + EDU_COLLEGE +
## EDU_ADV_DEGREE + VEHICLE_USE_COMMERCIAL + VEHICLE_CLASS_TRUCK +
## VEHICLE_CLASS_SUV + VEHICLE_CLASS_CAR + SPORTS_CAR + RED_SPORTS_CAR +
## TRUCK_COMM + SUV_COMM + CAR_COMM + OCCUPATION_CLERICAL +
## OCCUPATION_MANAGER + OCCUPATION_BLUE_COLLAR + OCCUPATION_GOLD_COLLAR +
## OCCUPATION_STUDENT + OCCUPATION_HOME_MAKER + OCCUPATION_PROFESSIONAL
## #Df LogLik Df Chisq Pr(>Chisq)
## 1 47 -2519.5
## 2 47 -2519.9 0 0.7974 < 0.00000000000000022 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We proceed with model selection, using forward and backward variable selection to minimize AIC; a sketch of the search is shown below.
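A minimal sketch of the search, assuming the full additive model fitted above is stored as model1; step() from base R with direction = "both" combines forward and backward moves while minimizing AIC:

model3 <- step(model1, direction = "both", trace = 0)
summary(model3)  # produces the summary shown below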
##
## Call:
## glm(formula = target ~ KIDSDRIV + MALE + MARRIED + SINGLE_PARENT +
## LICENSE_REVOKED + AGE_RANGE_20_29_YRS + AGE_RANGE_30_39_YRS +
## AGE_RANGE_50_59_YRS + AGE_RANGE_60_69_YRS + INCOME + HOME_VAL +
## TRAVTIME + BLUEBOOK + TIF + OLDCLAIM + CLM_FREQ + MVR_PTS +
## MAIN_DRIVING_CITY + EDU_COLLEGE + VEHICLE_USE_COMMERCIAL +
## VEHICLE_CLASS_TRUCK + VEHICLE_CLASS_SUV + TRUCK_COMM + OCCUPATION_MANAGER,
## family = binomial(), data = train_all)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4210 -0.7091 -0.3965 0.5650 2.9418
##
## Coefficients:
## Estimate Std. Error z value
## (Intercept) -2.6738521282 0.2092309298 -12.779
## KIDSDRIV 0.4668165400 0.0662150361 7.050
## MALE -0.4444944783 0.0881372147 -5.043
## MARRIED -0.4627243381 0.0986670935 -4.690
## SINGLE_PARENT 0.4057647856 0.1201362905 3.378
## LICENSE_REVOKED 0.8885632276 0.1101999861 8.063
## AGE_RANGE_20_29_YRS 0.7318764674 0.1741951099 4.201
## AGE_RANGE_30_39_YRS 0.3302668286 0.0905015854 3.649
## AGE_RANGE_50_59_YRS 0.3011113991 0.0922250927 3.265
## AGE_RANGE_60_69_YRS 0.9818232160 0.1745944370 5.623
## INCOME -0.0000034387 0.0000011628 -2.957
## HOME_VAL -0.0000012661 0.0000004101 -3.088
## TRAVTIME 0.0132790561 0.0022846053 5.812
## BLUEBOOK -0.0000355043 0.0000050486 -7.033
## TIF -0.0478678881 0.0088461079 -5.411
## OLDCLAIM -0.0000164749 0.0000047124 -3.496
## CLM_FREQ 0.2174044638 0.0342929153 6.340
## MVR_PTS 0.1308243448 0.0164179143 7.968
## MAIN_DRIVING_CITY 2.4158454451 0.1381866549 17.482
## EDU_COLLEGE -0.4274864121 0.0836261860 -5.112
## VEHICLE_USE_COMMERCIAL 0.6264599498 0.1054882858 5.939
## VEHICLE_CLASS_TRUCK -0.2522446533 0.1750480946 -1.441
## VEHICLE_CLASS_SUV -0.3604712367 0.1157805632 -3.113
## TRUCK_COMM 0.5677797498 0.1741909711 3.260
## OCCUPATION_MANAGER -0.7282656807 0.1291164108 -5.640
## Pr(>|z|)
## (Intercept) < 0.0000000000000002 ***
## KIDSDRIV 0.000000000001789074 ***
## MALE 0.000000457786472814 ***
## MARRIED 0.000002735344874061 ***
## SINGLE_PARENT 0.000731 ***
## LICENSE_REVOKED 0.000000000000000743 ***
## AGE_RANGE_20_29_YRS 0.000026518107696898 ***
## AGE_RANGE_30_39_YRS 0.000263 ***
## AGE_RANGE_50_59_YRS 0.001095 **
## AGE_RANGE_60_69_YRS 0.000000018718015554 ***
## INCOME 0.003103 **
## HOME_VAL 0.002018 **
## TRAVTIME 0.000000006158080609 ***
## BLUEBOOK 0.000000000002028562 ***
## TIF 0.000000062610562919 ***
## OLDCLAIM 0.000472 ***
## CLM_FREQ 0.000000000230314579 ***
## MVR_PTS 0.000000000000001608 ***
## MAIN_DRIVING_CITY < 0.0000000000000002 ***
## EDU_COLLEGE 0.000000318980861962 ***
## VEHICLE_USE_COMMERCIAL 0.000000002873463934 ***
## VEHICLE_CLASS_TRUCK 0.149584
## VEHICLE_CLASS_SUV 0.001849 **
## TRUCK_COMM 0.001116 **
## OCCUPATION_MANAGER 0.000000016967483071 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 6546.4 on 5711 degrees of freedom
## Residual deviance: 5049.8 on 5687 degrees of freedom
## AIC: 5099.8
##
## Number of Fisher Scoring iterations: 5
As shown below, the stepwise model performs better than both models that include all predictors.
kable(all_model_metrics)
| Model | AIC | BIC | McFaddenR2 | HL_Chi | HL_p | X. | Kappa | Youden | F1Score | FPPrct | AUC |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Model1 | 5133.008 | 5445.573 | 0.230 | 5712 | 0 | * | 0.395 | 0.487 | 0.522 | 15.68 | 0.809 |
| Model2 - All Pred, Some Transformed | 5133.805 | 5446.371 | 0.230 | 5712 | 0 | * | 0.394 | 0.483 | 0.523 | 15.60 | 0.810 |
| Model3 - Step | 5099.780 | 5266.038 | 0.229 | 5712 | 0 | * | 0.393 | 0.488 | 0.520 | 15.76 | 0.809 |
We will now examine the importance of the variables; the top predictors are shown in the variable importance plot produced below.
However, it should be noted that a large number of the variables were statistically significant.
x <- data.frame(varImp(model3))
x$Variable <- rownames(x)
x %>% ggplot(aes(x=reorder(Variable, Overall), y=Overall, fill=Overall)) +
geom_bar(stat="identity") + coord_flip() + guides(fill=FALSE) +
xlab("Variable") + ylab("Importance") +
ggtitle("Variable Importance")
Exponentiating the model coefficients converts them to odds ratios:

exp(model3$coefficients)
## (Intercept) KIDSDRIV MALE
## 0.06898597 1.59490877 0.64114831
## MARRIED SINGLE_PARENT LICENSE_REVOKED
## 0.62956616 1.50044958 2.43163344
## AGE_RANGE_20_29_YRS AGE_RANGE_30_39_YRS AGE_RANGE_50_59_YRS
## 2.07897808 1.39133933 1.35135987
## AGE_RANGE_60_69_YRS INCOME HOME_VAL
## 2.66931855 0.99999656 0.99999873
## TRAVTIME BLUEBOOK TIF
## 1.01336761 0.99996450 0.95325972
## OLDCLAIM CLM_FREQ MVR_PTS
## 0.99998353 1.24284669 1.13976756
## MAIN_DRIVING_CITY EDU_COLLEGE VEHICLE_USE_COMMERCIAL
## 11.19923469 0.65214626 1.87097549
## VEHICLE_CLASS_TRUCK VEHICLE_CLASS_SUV TRUCK_COMM
## 0.77705461 0.69734763 1.76434541
## OCCUPATION_MANAGER
## 0.48274550
Using the variables selected by the stepwise method above in model 3, we now use robust regression to account for influential values.
##
## Call: glmrob(formula = target ~ KIDSDRIV + MALE + MARRIED + SINGLE_PARENT + LICENSE_REVOKED + AGE_RANGE_20_29_YRS + AGE_RANGE_30_39_YRS + AGE_RANGE_50_59_YRS + AGE_RANGE_60_69_YRS + INCOME + HOME_VAL + TRAVTIME + BLUEBOOK + TIF + OLDCLAIM + CLM_FREQ + MVR_PTS + MAIN_DRIVING_CITY + EDU_COLLEGE + VEHICLE_USE_COMMERCIAL + VEHICLE_CLASS_SUV + TRUCK_COMM + OCCUPATION_MANAGER, family = binomial, data = train_all, method = "Mqle")
##
##
## Coefficients:
## Estimate Std. Error z value
## (Intercept) -2.8381794990 0.2241644248 -12.661
## KIDSDRIV 0.4634020889 0.0685031453 6.765
## MALE -0.5096301080 0.0844995551 -6.031
## MARRIED -0.4718594104 0.1028684912 -4.587
## SINGLE_PARENT 0.4296422379 0.1241407977 3.461
## LICENSE_REVOKED 0.8923057418 0.1139822473 7.828
## AGE_RANGE_20_29_YRS 0.7258048156 0.1793860946 4.046
## AGE_RANGE_30_39_YRS 0.3199796235 0.0938741271 3.409
## AGE_RANGE_50_59_YRS 0.2953592718 0.0966993895 3.054
## AGE_RANGE_60_69_YRS 0.9888725094 0.1788670744 5.529
## INCOME -0.0000035342 0.0000012169 -2.904
## HOME_VAL -0.0000013319 0.0000004274 -3.116
## TRAVTIME 0.0126333011 0.0023906766 5.284
## BLUEBOOK -0.0000374670 0.0000052367 -7.155
## TIF -0.0448686059 0.0092457442 -4.853
## OLDCLAIM -0.0000172254 0.0000048830 -3.528
## CLM_FREQ 0.1970651161 0.0353187393 5.580
## MVR_PTS 0.1345416020 0.0169008561 7.961
## MAIN_DRIVING_CITY 2.5603458735 0.1615067832 15.853
## EDU_COLLEGE -0.4101447653 0.0868292355 -4.724
## VEHICLE_USE_COMMERCIAL 0.6981660561 0.1078595587 6.473
## VEHICLE_CLASS_SUV -0.2772346881 0.0962225974 -2.881
## TRUCK_COMM 0.4112571014 0.1525860175 2.695
## OCCUPATION_MANAGER -0.8056688419 0.1405016914 -5.734
## Pr(>|z|)
## (Intercept) < 0.0000000000000002 ***
## KIDSDRIV 0.00000000001336010 ***
## MALE 0.00000000162789641 ***
## MARRIED 0.00000449626334734 ***
## SINGLE_PARENT 0.000538 ***
## LICENSE_REVOKED 0.00000000000000494 ***
## AGE_RANGE_20_29_YRS 0.00005208947428332 ***
## AGE_RANGE_30_39_YRS 0.000653 ***
## AGE_RANGE_50_59_YRS 0.002255 **
## AGE_RANGE_60_69_YRS 0.00000003229198333 ***
## INCOME 0.003681 **
## HOME_VAL 0.001834 **
## TRAVTIME 0.00000012611468336 ***
## BLUEBOOK 0.00000000000083889 ***
## TIF 0.00000121673449813 ***
## OLDCLAIM 0.000419 ***
## CLM_FREQ 0.00000002410457060 ***
## MVR_PTS 0.00000000000000171 ***
## MAIN_DRIVING_CITY < 0.0000000000000002 ***
## EDU_COLLEGE 0.00000231729723425 ***
## VEHICLE_USE_COMMERCIAL 0.00000000009612826 ***
## VEHICLE_CLASS_SUV 0.003962 **
## TRUCK_COMM 0.007034 **
## OCCUPATION_MANAGER 0.00000000979570387 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Robustness weights w.r * w.x:
## 5051 weights are ~= 1. The remaining 661 ones are summarized as
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1432 0.5474 0.7114 0.6894 0.8641 0.9961
##
## Number of observations: 5712
## Fitted by method 'Mqle' (in 4 iterations)
##
## (Dispersion parameter for binomial family taken to be 1)
##
## No deviance values available
## Algorithmic parameters:
## acc tcc
## 0.0001 1.3450
## maxit
## 50
## test.acc
## "coef"
## [1] "Accuracy Model3: 0.787668436096366"
## [1] "Area Under the Curve - Model 3: 0.808535371704655"
## [1] "Accuracy Model4: 0.787260106165782"
## [1] "Area Under the Curve - Model 4: 0.808304222548732"
We now see how accurate our predictions are on the test set. Although the two models are very similar, the non-robust method has slightly higher accuracy. As such, we will use Model 3, the stepwise model fit on the non-transformed data.
table(model3fit)
## model3fit
## 0 1
## 2033 416
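A minimal sketch of how these test-set metrics could be computed, assuming a held-out frame named test_all (a hypothetical name) with the recoded response in target, and the pROC package for the AUC:

probs3 <- predict(model3, newdata = test_all, type = "response")
model3fit <- ifelse(probs3 > 0.5, 1, 0)     # classify at the 0.5 cut-off
acc3 <- mean(model3fit == test_all$target)  # proportion of correct predictions
auc3 <- pROC::auc(pROC::roc(test_all$target, probs3))
print(paste("Accuracy Model3:", acc3))
print(paste("Area Under the Curve - Model 3:", auc3))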
## Regression model metrics
lm_model_metrics <- data.frame()
all_lm_predictions <- list()

# Compute evaluation metrics for a linear regression model on the test set.
# If the response was log-transformed (log_trans = TRUE), predictions and
# actuals are back-transformed with exp() before the metrics are computed.
# AIC/BIC are not reported for robust fits (robust = TRUE).
calc_metrics_lm <- function(model_name, model, test, train, log_trans=FALSE, robust=FALSE) {
  predicted <- predict(model, test)  # predict on test data
  predicted <- if (log_trans) exp(predicted) else predicted
  actual <- if (log_trans) exp(test$TARGET_AMT) else test$TARGET_AMT
  compare <- cbind(actual=actual, predicted)  # combine actual and predicted
  model_accuracy <- mean(apply(compare, 1, min) / apply(compare, 1, max))

  metrics_df <- data.frame(Model=model_name,
                           AIC=ifelse(robust, NA, round(AIC(model), 3)),
                           BIC=ifelse(robust, NA, round(BIC(model), 3)),
                           accuracy=round(model_accuracy, 3),
                           MAE=Metrics::mae(compare[, 1], compare[, 2]),     # mean absolute error
                           MAPE=Metrics::mape(compare[, 1], compare[, 2]),   # mean absolute percentage error
                           SMAPE=Metrics::smape(compare[, 1], compare[, 2]), # symmetric MAPE
                           MSE=Metrics::mse(compare[, 1], compare[, 2]),     # mean squared error
                           RMSE=Metrics::rmse(compare[, 1], compare[, 2]))   # root mean squared error

  # Return the metrics along with the predictions and the comparison matrix
  list(metrics_df, predicted, compare)
}
The base multiple linear regression model was fit on the imputed data. This model predicts the cost when a car is in a crash (TARGET_AMT) using all predictor variables. The model produces some negative estimated costs, corresponding to records where the car did not crash (TARGET_AMT = 0).
The following conclusions can be made from the model:

- The likelihood that a car will crash increases by 3.9% when the insured drives with kids.
- The likelihood that a car will crash declines by 5.9% when the insured is married.
- The likelihood that a car will crash increases by 5.4% when the insured's license was revoked in the past 7 years.

With an R-squared of only about 0.07, the model leaves most of the variability unexplained, although the F-test lets us reject the null hypothesis that all coefficients are zero.
The plot shows the residuals have a linear pattern; the majority are close to the line, so there could be a linear relationship between the predictor variables and the outcome variable. There appear to be some bad-leverage outliers: 76910, 85383, and 7072.
This plot shows that the residuals are not equally spread along the range of predictors; thus, the variance is not constant.
This plot shows the residuals are not normally distributed.
This plot shows there are some influential outliers. Several cases lie far beyond the Cook's distance lines, while the other residuals cluster on the left. The influential observations are 7691, 7270, and 29030.
##
## Call:
## lm(formula = TARGET_AMT ~ ., data = dev_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10398 -3200 -1561 588 97423
##
## Coefficients: (5 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8.771e+02 8.414e+03 -0.104 0.9170
## KIDSDRIV -6.242e+02 4.044e+02 -1.544 0.1229
## MALE 1.111e+03 6.608e+02 1.681 0.0929 .
## MARRIED -9.671e+02 6.208e+02 -1.558 0.1195
## SINGLE_PARENT 8.878e+02 7.191e+02 1.235 0.2171
## LICENSE_REVOKED -9.731e+02 6.331e+02 -1.537 0.1245
## AGE 6.286e+01 7.833e+01 0.803 0.4224
## AGE_RANGE_16_19_YRS 1.028e+03 9.058e+03 0.113 0.9097
## AGE_RANGE_20_29_YRS 1.911e+03 6.708e+03 0.285 0.7758
## AGE_RANGE_30_39_YRS 2.092e+03 6.339e+03 0.330 0.7414
## AGE_RANGE_40_49_YRS 1.555e+03 6.066e+03 0.256 0.7978
## AGE_RANGE_50_59_YRS 1.959e+03 5.852e+03 0.335 0.7379
## AGE_RANGE_60_69_YRS 3.137e+02 5.764e+03 0.054 0.9566
## AGE_RANGE_70_YRS_PLUS NA NA NA NA
## INEXP_DRIVER 3.893e+02 4.145e+03 0.094 0.9252
## HOMEKIDS 3.337e+02 2.550e+02 1.308 0.1909
## YOJ -1.950e+01 6.018e+01 -0.324 0.7459
## INCOME -1.345e-02 8.177e-03 -1.645 0.1001
## HOME_VAL 5.128e-03 2.527e-03 2.029 0.0426 *
## TRAVTIME 3.529e-01 1.360e+01 0.026 0.9793
## BLUEBOOK 1.294e-01 2.890e-02 4.479 8.1e-06 ***
## TIF -1.981e+01 5.103e+01 -0.388 0.6979
## OLDCLAIM 2.062e-02 2.728e-02 0.756 0.4500
## CLM_FREQ -9.915e+01 1.912e+02 -0.519 0.6041
## MVR_PTS 7.726e+01 8.457e+01 0.914 0.3611
## CAR_AGE -1.183e+02 1.163e+02 -1.017 0.3093
## CAR_AGE_RANGE_1_YR -9.294e+02 1.420e+03 -0.655 0.5128
## CAR_AGE_RANGE_2_3_YRS -2.958e+03 5.781e+03 -0.512 0.6089
## CAR_AGE_RANGE_3_5_YRS 1.750e+03 1.617e+03 1.082 0.2793
## CAR_AGE_RANGE_5_10_YRS -6.555e+01 8.158e+02 -0.080 0.9360
## CAR_AGE_RANGE_10_YRS_PLUS NA NA NA NA
## MAIN_DRIVING_CITY 8.146e+02 1.024e+03 0.795 0.4266
## RED_CAR 2.161e+02 6.218e+02 0.348 0.7282
## EDU_HIGH_SCHOOL -5.906e+02 6.415e+02 -0.921 0.3573
## EDU_COLLEGE 3.634e+02 6.230e+02 0.583 0.5598
## EDU_ADV_DEGREE 1.754e+03 1.063e+03 1.650 0.0991 .
## VEHICLE_USE_COMMERCIAL -2.229e+01 1.472e+03 -0.015 0.9879
## VEHICLE_CLASS_TRUCK -1.180e+03 1.137e+03 -1.038 0.2996
## VEHICLE_CLASS_SUV -3.807e+02 7.765e+02 -0.490 0.6240
## VEHICLE_CLASS_CAR NA NA NA NA
## SPORTS_CAR NA NA NA NA
## RED_SPORTS_CAR 2.334e+03 2.745e+03 0.850 0.3953
## TRUCK_COMM 7.195e+02 1.663e+03 0.433 0.6654
## SUV_COMM 3.220e+02 1.525e+03 0.211 0.8327
## CAR_COMM NA NA NA NA
## OCCUPATION_CLERICAL 9.338e+02 1.457e+03 0.641 0.5218
## OCCUPATION_MANAGER 8.627e+01 1.270e+03 0.068 0.9459
## OCCUPATION_BLUE_COLLAR 1.360e+03 1.382e+03 0.984 0.3253
## OCCUPATION_GOLD_COLLAR 6.541e+02 1.182e+03 0.553 0.5802
## OCCUPATION_STUDENT 6.179e+02 1.570e+03 0.394 0.6939
## OCCUPATION_HOME_MAKER 8.776e+02 1.549e+03 0.566 0.5712
## OCCUPATION_PROFESSIONAL 1.367e+03 1.341e+03 1.020 0.3081
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7832 on 1461 degrees of freedom
## Multiple R-squared: 0.04256, Adjusted R-squared: 0.01241
## F-statistic: 1.412 on 46 and 1461 DF, p-value: 0.03717
The second model uses a stepwise variable selection with linear regression. We chose to use both the “forward” and “backward” methods to obtain the optimal model based on the lowest AIC (Akaike Information Criterion). This modeling approach uses the non-transformed, recoded dataset.
null.model <- lm(TARGET_AMT ~ 1 , data= dev_train) # base intercept only model
full.model <- lm(TARGET_AMT ~ . , data= dev_train) # full model with all predictors
# perform step-wise algorithm
model2 <- step(null.model, scope = list(lower = null.model, upper = full.model), direction = "both", trace = 0, steps = 1000)
summary(model2)
##
## Call:
## lm(formula = TARGET_AMT ~ BLUEBOOK + MALE + CAR_AGE_RANGE_3_5_YRS +
## SINGLE_PARENT + AGE_RANGE_50_59_YRS, data = dev_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9194 -3165 -1623 401 97838
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.155e+03 4.601e+02 6.857 1.03e-11 ***
## BLUEBOOK 1.160e-01 2.426e-02 4.784 1.89e-06 ***
## MALE 9.376e+02 4.052e+02 2.314 0.0208 *
## CAR_AGE_RANGE_3_5_YRS 2.482e+03 1.205e+03 2.060 0.0396 *
## SINGLE_PARENT 1.104e+03 4.930e+02 2.240 0.0253 *
## AGE_RANGE_50_59_YRS 1.072e+03 5.088e+02 2.107 0.0353 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7784 on 1502 degrees of freedom
## Multiple R-squared: 0.0275, Adjusted R-squared: 0.02426
## F-statistic: 8.494 on 5 and 1502 DF, p-value: 6.111e-08
AIC (Akaike Information Criterion) for Model 2 = 31310.5
BIC (Bayesian Information Criterion) for Model 2 = 31347.73
| Variable | Coefficient |
|---|---|
| BLUEBOOK | 0.1160452 |
| MALE | 937.5715330 |
| CAR_AGE_RANGE_3_5_YRS | 2482.0278004 |
| SINGLE_PARENT | 1104.1588654 |
| AGE_RANGE_50_59_YRS | 1071.8562734 |
Interpretation: the stepwise selection retained five predictors (BLUEBOOK, MALE, CAR_AGE_RANGE_3_5_YRS, SINGLE_PARENT, and AGE_RANGE_50_59_YRS), all of which are statistically significant at the 0.05 level, with BLUEBOOK the most significant (p = 1.89e-06). It is interesting to note the positive coefficient on MALE, suggesting a potential relationship between male drivers and higher payout amounts.
BLUEBOOK intuitively makes sense as a significant and important predictor of the payout amount; its coefficient of roughly 0.116 is small only because it is expressed per dollar of vehicle value, adding about 12 cents to the expected payout for each additional dollar of BLUEBOOK value.
The remaining predictors all carry positive coefficients: single parents (roughly +1,104), drivers aged 50 to 59 (roughly +1,072), and cars aged 3 to 5 years (roughly +2,482) are associated with higher claim payout amounts.
The Adjusted R-squared value associated with this model is very low at 0.0243, leaving a significant amount of variability in the response unaccounted for. The F-test for overall significance does, however, show significance (p = 6.1e-08).
The diagnostics plots below are examined to test the assumptions of linear regression.
The diagnostic plots reveal some potential issues with this model. The Residuals vs. Fitted plot shows a downward trend: as the fitted values increase on the x-axis, the residuals decrease. We would expect to see a flat line under homoscedasticity (residuals of equal variance). Heteroscedasticity can also be seen in the Scale-Location plot, where again we would expect a relatively flat trend rather than the upward trend of the red line.
Heteroscedasticity can be confirmed statistically using the NCV test:
car::ncvTest(model2)
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 701.1675 Df = 1 p = 1.666557e-154
The p-value, far below the 0.05 significance level, confirms that there is a pattern in the residuals (heteroscedasticity).
The Normal Q-Q plot shows an issue with the requirement of normally distributed residuals. We see a steep increase approaching the second quantile, confirming violations of normality, which could affect the coefficients of the model. This, together with the heteroscedasticity, is a strong indicator that a transformation may be required.
Multicollinearity does not appear to be a problem.
car::vif(model2)
## BLUEBOOK MALE CAR_AGE_RANGE_3_5_YRS
## 1.011677 1.010687 1.001082
## SINGLE_PARENT AGE_RANGE_50_59_YRS
## 1.047432 1.049612
Additionally, this model is impacted by outliers as shown by Cook’s distance and the leverage plots.
car::outlierTest(model2)
## rstudent unadjusted p-value Bonferonni p
## 2038 13.485005 3.3268e-39 5.0168e-36
## 1858 9.497724 8.1307e-21 1.2261e-17
## 2063 9.062576 3.8682e-19 5.8333e-16
## 640 8.863147 2.1543e-18 3.2487e-15
## 1357 6.800948 1.4960e-11 2.2559e-08
## 1137 6.788816 1.6232e-11 2.4478e-08
## 1552 6.655228 3.9537e-11 5.9621e-08
## 251 6.158254 9.4246e-10 1.4212e-06
## 1639 6.039104 1.9502e-09 2.9410e-06
## 1200 5.905257 4.3472e-09 6.5556e-06
plot(cooks.distance(model2), pch=23, bg='orange', cex=2, ylab="Cook's distance")
Removal of these outliers does not change the model in any significant way, as the refit below shows.
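A minimal sketch of the refit that produces the output below, assuming the two most extreme rows flagged by outlierTest (2038 and 1858) are the ones dropped:

# Drop the flagged rows by name and refit the stepwise model
dev_train_upd <- dev_train[!(rownames(dev_train) %in% c("2038", "1858")), ]
mod2 <- lm(formula(model2), data = dev_train_upd)
gvlma::gvlma(mod2)  # global validation of the linear model assumptions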
##
## Call:
## lm(formula = TARGET_AMT ~ BLUEBOOK + MALE + CAR_AGE_RANGE_3_5_YRS +
## SINGLE_PARENT + AGE_RANGE_50_59_YRS, data = dev_train_upd)
##
## Coefficients:
## (Intercept) BLUEBOOK MALE
## 3633.6680 0.1005 666.5298
## CAR_AGE_RANGE_3_5_YRS SINGLE_PARENT AGE_RANGE_50_59_YRS
## 173.9688 820.8188 492.4396
##
##
## ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
## USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
## Level of Significance = 0.05
##
## Call:
## gvlma(x = mod2)
##
## Value p-value Decision
## Global Stat 5.781e+04 0.00000 Assumptions NOT satisfied!
## Skewness 5.637e+03 0.00000 Assumptions NOT satisfied!
## Kurtosis 5.217e+04 0.00000 Assumptions NOT satisfied!
## Link Function 3.445e+00 0.06343 Assumptions acceptable.
## Heteroscedasticity 5.444e-01 0.46063 Assumptions acceptable.
A log transformation was applied to the imputed data set, specifically to the response variable (TARGET_AMT). To meet the requirements of the log transformation, all zero-value observations were removed. The decision to remove zero values from the response variable, rather than add a constant, was made to prevent data distortion.
Several recoded predictors that were not significant to the model were removed before transformation. The remaining dataset contained 2,152 observations.
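A minimal sketch of the response transformation, assuming the recoded frame is named insurance_trainT as in the model calls below:

insurance_trainT <- subset(insurance_trainT, TARGET_AMT > 0)  # keep crash records only
insurance_trainT$TARGET_AMT <- log(insurance_trainT$TARGET_AMT)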
Model one is a baseline model of the transformed response variable against all predictors. The summary of the regression model shows F = 1.57 with a p-value of 0.02, so we reject the null hypothesis that no predictor has an effect; the predictors collectively have a significant effect on the amount paid when there is a crash.
##
## Call:
## lm(formula = TARGET_AMT ~ ., data = insurance_trainT)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.5698 -0.4048 0.0225 0.4055 3.2292
##
## Coefficients: (2 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.003e+00 2.411e-01 33.199 < 2e-16 ***
## KIDSDRIV -9.816e-02 4.069e-02 -2.413 0.0160 *
## MALE 2.899e-02 6.789e-02 0.427 0.6694
## MARRIED -1.357e-01 6.408e-02 -2.118 0.0343 *
## SINGLE_PARENT 3.259e-02 7.449e-02 0.438 0.6618
## LICENSE_REVOKED -1.023e-01 6.561e-02 -1.560 0.1191
## AGE 4.606e-03 2.717e-03 1.695 0.0902 .
## INEXP_DRIVER -2.395e-02 3.300e-01 -0.073 0.9421
## HOMEKIDS 4.174e-02 2.620e-02 1.593 0.1113
## YOJ 1.603e-03 6.227e-03 0.257 0.7969
## INCOME -1.425e-06 8.343e-07 -1.707 0.0880 .
## HOME_VAL 3.246e-07 2.618e-07 1.240 0.2152
## TRAVTIME -4.059e-04 1.409e-03 -0.288 0.7733
## BLUEBOOK 1.180e-05 2.996e-06 3.941 8.51e-05 ***
## TIF -1.102e-03 5.291e-03 -0.208 0.8350
## OLDCLAIM 5.231e-06 2.829e-06 1.849 0.0646 .
## CLM_FREQ -3.556e-02 1.979e-02 -1.797 0.0726 .
## MVR_PTS 1.148e-02 8.690e-03 1.321 0.1868
## CAR_AGE 3.729e-04 4.865e-03 0.077 0.9389
## MAIN_DRIVING_CITY 5.774e-02 1.057e-01 0.546 0.5850
## RED_CAR 9.248e-02 6.435e-02 1.437 0.1509
## EDU_HIGH_SCHOOL -3.564e-02 6.462e-02 -0.552 0.5813
## VEHICLE_CLASS_TRUCK -5.336e-02 1.158e-01 -0.461 0.6450
## VEHICLE_CLASS_SUV -1.989e-02 7.930e-02 -0.251 0.8019
## VEHICLE_CLASS_CAR NA NA NA NA
## SPORTS_CAR NA NA NA NA
## RED_SPORTS_CAR 2.885e-01 2.816e-01 1.025 0.3058
## TRUCK_COMM 2.969e-02 9.340e-02 0.318 0.7506
## SUV_COMM 1.273e-02 8.531e-02 0.149 0.8814
## CAR_COMM -4.774e-02 1.522e-01 -0.314 0.7539
## OCCUPATION_CLERICAL -8.226e-02 1.230e-01 -0.669 0.5036
## OCCUPATION_MANAGER -1.012e-02 1.232e-01 -0.082 0.9345
## OCCUPATION_BLUE_COLLAR -6.049e-02 1.113e-01 -0.543 0.5870
## OCCUPATION_GOLD_COLLAR 4.108e-02 1.224e-01 0.335 0.7373
## OCCUPATION_STUDENT -1.012e-01 1.386e-01 -0.730 0.4654
## OCCUPATION_HOME_MAKER -8.858e-02 1.464e-01 -0.605 0.5452
## OCCUPATION_PROFESSIONAL -1.958e-02 1.128e-01 -0.174 0.8622
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.815 on 1473 degrees of freedom
## Multiple R-squared: 0.03497, Adjusted R-squared: 0.0127
## F-statistic: 1.57 on 34 and 1473 DF, p-value: 0.01995
The trace of the stepwise AIC regression on the transformed data, shown below, indicates that most of the predictors do not contribute significantly to the total amount paid when there is a crash; only a handful of variables enter the model.
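A minimal sketch of the search that produces the step trace below, assuming the result is stored as fit5 (the object whose formula is reused in the robust fit later):

null.modelT <- lm(TARGET_AMT ~ 1, data = insurance_trainT)
full.modelT <- lm(TARGET_AMT ~ ., data = insurance_trainT)
fit5 <- step(null.modelT, scope = list(lower = null.modelT, upper = full.modelT),
             direction = "both", trace = 0, steps = 1000)
fit5$anova  # the step summary: variables added and the resulting AIC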
## Step Df Deviance Resid. Df Resid. Dev AIC
## 1 NA NA 1507 1013.8642 -596.6990
## 2 + BLUEBOOK -1 13.532555 1506 1000.3317 -614.9626
## 3 + RED_CAR -1 3.565886 1505 996.7658 -618.3478
## 4 + MARRIED -1 3.380337 1504 993.3854 -621.4706
## 5 + KIDSDRIV -1 1.982911 1503 991.4025 -622.4837
Model 4 applies robust linear regression to the transformed dataset and the model used in Model 3. Robust regression is applied using the robustbase package; it is an option for modeling data where outliers may be significantly impacting the resulting model fit.
For the purposes of the robust model build, we only consider the non-indicator predictor variables from the dataset.
dt <- dev_train[, c("TARGET_AMT", "KIDSDRIV", "AGE", "INCOME",
"HOME_VAL", "TRAVTIME", "BLUEBOOK", "OLDCLAIM", "MVR_PTS")]
model4 <- robustbase::lmrob(formula(fit5), data = insurance_trainT, setting = "KS2014")
model4_metrics <- calc_metrics_lm("lm-Model4", model4, insurance_train_test, insurance_trainT, log_trans = T, robust = T)
lm_model_metrics <- rbind(lm_model_metrics, model4_metrics[[1]])
summary(model4)
##
## Call:
## robustbase::lmrob(formula = formula(fit5), data = insurance_trainT, setting = "KS2014")
## \--> method = "SMDM"
## Residuals:
## Min 1Q Median 3Q Max
## -4.78197 -0.39383 0.05044 0.40094 3.26439
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.209e+00 4.129e-02 198.830 <2e-16 ***
## BLUEBOOK 5.034e-06 2.159e-06 2.331 0.0199 *
## RED_CAR 7.488e-02 3.919e-02 1.910 0.0563 .
## MARRIED -5.359e-02 3.544e-02 -1.512 0.1307
## KIDSDRIV -2.173e-02 2.915e-02 -0.745 0.4561
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Robust residual standard error: 0.6947
## Multiple R-squared: 0.008349, Adjusted R-squared: 0.00571
## Convergence in 8 IRWLS iterations
##
## Robustness weights:
## 1112 weights are ~= 1. The remaining 396 ones are summarized as
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.006382 0.599700 0.841900 0.752400 0.965700 0.999000
## Algorithmic parameters:
## tuning.chi1 tuning.chi2 tuning.chi3 tuning.chi4
## -5.000e-01 1.500e+00 NA 5.000e-01
## bb tuning.psi1 tuning.psi2 tuning.psi3
## 5.000e-01 -5.000e-01 1.500e+00 9.500e-01
## tuning.psi4 refine.tol rel.tol solve.tol
## NA 1.000e-07 1.000e-07 1.000e-07
## eps.outlier eps.x warn.limit.reject warn.limit.meanrw
## 6.631e-05 1.054e-07 5.000e-01 5.000e-01
## nResample max.it best.r.s k.fast.s k.max
## 1000 500 20 2 2000
## maxit.scale trace.lev mts compute.rd numpoints
## 200 0 1000 0 10
## fast.s.large.n
## 2000
## setting psi subsampling
## "KS2014" "lqq" "nonsingular"
## cov compute.outlier.stats
## ".vcov.w" "SMDM"
## seed : int(0)
The resulting summary shows that lmrob retains BLUEBOOK as the only predictor significant at the 0.05 level, with RED_CAR marginal. The robustness weights show that a number of observations were allocated very small weights, identifying them as outliers that were down-weighted in the fit.
The diagnostics plots below are examined to test the assumptions of linear regression.
## recomputing robust Mahalanobis distances
## Warning in robMD(x = if (!is.null(x[["x"]])) x$x else if (!
## is.null(x[["model"]])) model.matrix(x, : Failed to compute robust
## Mahalanobis distances, reverting to robust leverages.
## saving the robust distances 'MD' as part of 'model4'
The resulting Residuals vs. Fitted Values plot looks more evenly distributed, indicating a better model fit. However, the Normal Q-Q plot still shows a steep incline as the plot moves closer to the 2nd theoretical quantile.
kable(lm_model_metrics)
| Model | AIC | BIC | accuracy | MAE | MAPE | SMAPE | MSE | RMSE |
|---|---|---|---|---|---|---|---|---|
| lm-Model1 | 31368.962 | 31624.252 | 0.579 | 3726.057 | 1.2715961 | 0.6021118 | 55933218 | 7478.851 |
| lm-Model2 | 31310.500 | 31347.730 | 0.594 | 3581.130 | 1.2195981 | 0.5782837 | 55215058 | 7430.684 |
| lm-Model3 | 3659.035 | 3690.946 | 0.629 | 3100.011 | 0.7951091 | 0.5167241 | 57420909 | 7577.659 |
| lm-Model4 | NA | NA | 0.631 | 3077.422 | 0.7836996 | 0.5118868 | 57583438 | 7588.375 |
kable(all_model_metrics)
| Model | AIC | BIC | McFaddenR2 | HL_Chi | HL_p | X. | Kappa | Youden | F1Score | FPPrct | AUC |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Model1 | 5133.008 | 5445.573 | 0.230 | 5712 | 0 | * | 0.395 | 0.487 | 0.522 | 15.68 | 0.809 |
| Model2 - All Pred, Some Transformed | 5133.805 | 5446.371 | 0.230 | 5712 | 0 | * | 0.394 | 0.483 | 0.523 | 15.60 | 0.810 |
| Model3 - Step | 5099.780 | 5266.038 | 0.229 | 5712 | 0 | * | 0.393 | 0.488 | 0.520 | 15.76 | 0.809 |