Group 2 of the DATA621 class was asked to analyze and model a data sent containing approximately 8,000 with 2 response and 23 predictor variables, records representing a customer at an auto insurance company.
Each record has two response variables. The first response variable, TARGET_FLAG which is binary (0,1). If someone was in car crash the value is 1 and if the person was not in a car cash the value is 0.
The second response variable is TARGET_ATM. If someone was in a car cash the value is 1 and if they did not crash their car the value is greater than 0.
The objective is to build multiple linear regression and binary logistic regression models on the training data to predict the probability that a person will crash their car and also the amount of money it will cost if the person does crash their car.
Only the variables that are given or variables derived from the variables provided.
The team met to discuss this assignment and an approach for completing the assignment. Each of the 5 team members was assigned tasks. The following tasks were assigned:
Data Exploration Data Preparation Build Models *Select Models
Github was used to manage the project. Using Github helped with version control and ensured each team member had access to the latest version of the project documentation.
Slack was used for daily communication during the project and for quick access to code and documentation. Meeting were organized at least twice a week and as needed using “Go to Meetings”.
Since the data sets were provided, it was crucial that we understand the data set and determine whether any missing values are present.
Based on the objective of the project, several logistic, multiple and robust regression models were built.
-Valerie Briot -Michael D’acampora -Keith Folsom -Brian Kreis -Sharon Morris
For reproducibility of the results, the data was loaded to and accessed from a Github repository.
The purpose of the data exploration and statistic measures phase is to understand the data to determine how to process the dataset for modelling.
The majority of variables do not contain missing values. The predictor CAR_AGE (Vehicle Age) contains 510 missing values, YOJ(Years on Job) contain 454, income (INCOME) 445 and home value (HOME_VAL) 464 missing values.
The visualization of missing values below shows that missing values of CAR_AGE, HOME_VAL and YOJ are at 6 percent while INCOME is at 5 percent. The dataset was imputed to account for the missing values.
| TARGET_FLAG | 0 |
| TARGET_AMT | 0 |
| KIDSDRIV | 0 |
| AGE | 6 |
| HOMEKIDS | 0 |
| YOJ | 454 |
| INCOME | 445 |
| PARENT1 | 0 |
| HOME_VAL | 464 |
| MSTATUS | 0 |
| SEX | 0 |
| EDUCATION | 0 |
| JOB | 0 |
| TRAVTIME | 0 |
| CAR_USE | 0 |
| BLUEBOOK | 0 |
| TIF | 0 |
| CAR_TYPE | 0 |
| RED_CAR | 0 |
| OLDCLAIM | 0 |
| CLM_FREQ | 0 |
| REVOKED | 0 |
| MVR_PTS | 0 |
| CAR_AGE | 510 |
| URBANICITY | 0 |
| Variable Name | Definition | Variable Type |
|---|---|---|
| TARGET_FLAG | Was Car in a crash? 1=YES 0=NO | Response |
| TARGET_AMT | If car was in a crash, what was the cost | Response |
| AGE | Age of Driver | Predictor |
| BLUEBOOK | Value of Vehicle | Predictor |
| CAR_AGE | Vehicle Age | Predictor |
| CAR_TYPE | Type of Car | Predictor |
| CAR_USE | Vehicle Use | Predictor |
| CLM_FREQ | # Claims (Past 5 Years) | Predictor |
| EDUCATION | Max Education Level | Predictor |
| HOMEKIDS | # Children at Home | Predictor |
| HOME_VAL | Home Value | Predictor |
| INCOME | Income | Predictor |
| JOB | Job Category | Predictor |
| KIDSDRIV | # Driving Children | Predictor |
| MSTATUS | Marital Status | Predictor |
| MVR_PTS | Motor Vehicle Record Points | Predictor |
| OLDCLAIM | Total Claims (Past 5 Years) | Predictor |
| PARENT1 | Single Parent | Predictor |
| RED_CAR | A Red Car | Predictor |
| REVOKED | License Revoked (Past 7 Years) | Predictor |
| SEX | Gender | Predictor |
| TIF | Time in Force | Predictor |
| TRAVTIME | Distance to Work | Predictor |
| URBANICITY | Home/Work Area | Predictor |
| YOJ | Years on Job | Predictor |
Descriptive statistics was performed for all predictor and response variables to explore the data.
## TARGET_FLAG TARGET_AMT KIDSDRIV AGE HOMEKIDS YOJ
## 1 0 0 0 0.000735204 0 0.05563044
## INCOME PARENT1 HOME_VAL MSTATUS SEX EDUCATION JOB TRAVTIME CAR_USE
## 1 0.05452763 0 0.05685578 0 0 0 0 0 0
## BLUEBOOK TIF CAR_TYPE RED_CAR OLDCLAIM CLM_FREQ REVOKED MVR_PTS
## 1 0 0 0 0 0 0 0 0
## CAR_AGE URBANICITY
## 1 0.06249234 0
## vars n mean sd median trimmed mad min
## TARGET_FLAG 1 8161 0.26 0.44 0 0.20 0.00 0
## TARGET_AMT 2 8161 1504.32 4704.03 0 593.71 0.00 0
## KIDSDRIV 3 8161 0.17 0.51 0 0.03 0.00 0
## AGE 4 8155 44.79 8.63 45 44.83 8.90 16
## HOMEKIDS 5 8161 0.72 1.12 0 0.50 0.00 0
## YOJ 6 7707 10.50 4.09 11 11.07 2.97 0
## INCOME 7 7716 61898.09 47572.68 54028 56840.98 41792.27 0
## PARENT1* 8 8161 1.13 0.34 1 1.04 0.00 1
## HOME_VAL 9 7697 154867.29 129123.77 161160 144032.07 147867.11 0
## MSTATUS* 10 8161 1.40 0.49 1 1.38 0.00 1
## SEX* 11 8161 1.54 0.50 2 1.55 0.00 1
## EDUCATION* 12 8161 3.09 1.44 3 3.11 1.48 1
## JOB* 13 8161 5.69 2.68 6 5.81 2.97 1
## TRAVTIME 14 8161 33.49 15.91 33 33.00 16.31 5
## CAR_USE* 15 8161 1.63 0.48 2 1.66 0.00 1
## BLUEBOOK 16 8161 15709.90 8419.73 14440 15036.89 8450.82 1500
## TIF 17 8161 5.35 4.15 4 4.84 4.45 1
## CAR_TYPE* 18 8161 3.53 1.97 3 3.54 2.97 1
## RED_CAR* 19 8161 1.29 0.45 1 1.24 0.00 1
## OLDCLAIM 20 8161 4037.08 8777.14 0 1719.29 0.00 0
## CLM_FREQ 21 8161 0.80 1.16 0 0.59 0.00 0
## REVOKED* 22 8161 1.12 0.33 1 1.03 0.00 1
## MVR_PTS 23 8161 1.70 2.15 1 1.31 1.48 0
## CAR_AGE 24 7651 8.33 5.70 8 7.96 7.41 -3
## URBANICITY* 25 8161 1.20 0.40 1 1.13 0.00 1
## max range skew kurtosis se IQR Q0.1 Q0.25
## TARGET_FLAG 1.0 1.0 1.07 -0.85 0.00 1 0.0 0
## TARGET_AMT 107586.1 107586.1 8.71 112.29 52.07 1036 0.0 0
## KIDSDRIV 4.0 4.0 3.35 11.78 0.01 0 0.0 0
## AGE 81.0 65.0 -0.03 -0.06 0.10 12 34.0 39
## HOMEKIDS 5.0 5.0 1.34 0.65 0.01 1 0.0 0
## YOJ 23.0 23.0 -1.20 1.18 0.05 4 5.0 9
## INCOME 367030.0 367030.0 1.19 2.13 541.58 57889 4380.5 28097
## PARENT1* 2.0 1.0 2.17 2.73 0.00 0 1.0 1
## HOME_VAL 885282.0 885282.0 0.49 -0.02 1471.79 238724 0.0 0
## MSTATUS* 2.0 1.0 0.41 -1.83 0.01 1 1.0 1
## SEX* 2.0 1.0 -0.14 -1.98 0.01 1 1.0 1
## EDUCATION* 5.0 4.0 0.12 -1.38 0.02 3 1.0 2
## JOB* 9.0 8.0 -0.31 -1.22 0.03 5 2.0 3
## TRAVTIME 142.0 137.0 0.45 0.66 0.18 22 13.0 22
## CAR_USE* 2.0 1.0 -0.53 -1.72 0.01 1 1.0 1
## BLUEBOOK 69740.0 68240.0 0.79 0.79 93.20 11570 6000.0 9280
## TIF 25.0 24.0 0.89 0.42 0.05 6 1.0 1
## CAR_TYPE* 6.0 5.0 0.00 -1.52 0.02 5 1.0 1
## RED_CAR* 2.0 1.0 0.92 -1.16 0.01 1 1.0 1
## OLDCLAIM 57037.0 57037.0 3.12 9.86 97.16 4636 0.0 0
## CLM_FREQ 5.0 5.0 1.21 0.28 0.01 2 0.0 0
## REVOKED* 2.0 1.0 2.30 3.30 0.00 0 1.0 1
## MVR_PTS 13.0 13.0 1.35 1.38 0.02 3 0.0 0
## CAR_AGE 28.0 31.0 0.28 -0.75 0.07 11 1.0 1
## URBANICITY* 2.0 1.0 1.46 0.15 0.00 0 1.0 1
## Q0.75 Q0.9
## TARGET_FLAG 1 1.0
## TARGET_AMT 1036 4904.0
## KIDSDRIV 0 1.0
## AGE 51 56.0
## HOMEKIDS 1 3.0
## YOJ 13 15.0
## INCOME 85986 123180.0
## PARENT1* 1 2.0
## HOME_VAL 238724 316542.6
## MSTATUS* 2 2.0
## SEX* 2 2.0
## EDUCATION* 5 5.0
## JOB* 8 9.0
## TRAVTIME 44 54.0
## CAR_USE* 2 2.0
## BLUEBOOK 20850 27460.0
## TIF 7 11.0
## CAR_TYPE* 6 6.0
## RED_CAR* 2 2.0
## OLDCLAIM 4636 9583.0
## CLM_FREQ 2 3.0
## REVOKED* 1 2.0
## MVR_PTS 3 5.0
## CAR_AGE 12 16.0
## URBANICITY* 1 2.0
The correlation matrix shown below highlights correlations among several predictor variables. Correlation between between claims in the past 5 years (CLM_FREQ) and motor vechile recorded points (MVR_PTS); driving children(KIDSDRV) and age of driver (AGE) is very high at 0.67.
The tables below represent correlation between response and predictor variables.
| VARIABLE | CORRELATION WITH TARGET_FLAG |
|---|---|
| KIDSDRIV | 0.1036683 |
| AGE | -0.1032167 |
| HOMEKIDS | 0.115621 |
| YOJ | -0.0705118 |
| INCOME | -0.1420081 |
| HOME_VAL | -0.1837371 |
| TRAVTIME | 0.0483683 |
| BLUEBOOK | -0.1033832 |
| TIF | -0.08237 |
| OLDCLAIM | 0.1380838 |
| CLM_FREQ | 0.2161961 |
| MVR_PTS | 0.2191971 |
| CAR_AGE | -0.1006506 |
| VARIABLE | CORRELATION WITH TARGET_AMT |
|---|---|
| KIDSDRIV | 0.0553942 |
| AGE | -0.0417283 |
| HOMEKIDS | 0.061988 |
| YOJ | -0.0220852 |
| INCOME | -0.0583069 |
| HOME_VAL | -0.0856024 |
| TRAVTIME | 0.027987 |
| BLUEBOOK | -0.0046995 |
| TIF | -0.0464808 |
| OLDCLAIM | 0.0709533 |
| CLM_FREQ | 0.1164192 |
| MVR_PTS | 0.1378655 |
| CAR_AGE | -0.0588221 |
Each predictor was exam ed to determine whether transformation is needed.
The Driving Children variable is highly skewed to the right. The outliers are high.
| Range | Values |
|---|---|
| Lowest | None |
| Highest | 4, 3, 2, 1 |
The AGE predictor is close to a normal distribution with high outliers of ages 72, 73, 76, 80 & 81 and low 16, 17 and 18.
| Range | Values |
|---|---|
| Lowest | 20, 19, 18, 17, 16 |
| Highest | 81, 80, 76, 73, 72, 70 |
The predictor of car value BLUEBOOK is slightly skewed to the right. There are some outliers a the higher car value level.
| Range | Values |
|---|---|
| Lowest | None |
| Highest | 69740, 65970, 62240, 61050, 57970, 50970, 50180, 49880, 49230, 48620 |
The distribution is normal. There are are no outliers.
| Range | Values |
|---|---|
| Lowest | None |
| Highest | None |
z_SUV and Minivan are majority of vehicles insured.
##
## Minivan Panel Truck Pickup Sports Car Van z_SUV
## 2145 676 1389 907 750 2294
The majority of cars are privately used.
##
## Commercial Private
## 3029 5132
The distribution of claims is multi modal. With the largest number of claims occurring before year 1. There are no outliers
| Range | Values |
|---|---|
| Lowest | None |
| Highest | None |
Ther majority of insurers are college or high school graduates.
##
## <High School Bachelors Masters PhD z_High School
## 1203 2242 1658 728 2330
The distribution of HOMEKIDS is multimodal. The majority of customers do not have any children. There are some outliers.
| Range | Values |
|---|---|
| Lowest | None |
| Highest | 5, 4, 3 |
The distribution of HOME_VAL is skewed to the left. There are negative values that will require further exploration. There are several outliers on the higher end.
| Range | Values |
|---|---|
| Lowest | None |
| Highest | 885282, 750455, 738153, 682634, 657804, 653952, 649247, 631309, 630267, 611328 |
The distribution INCOME has uni modal and skewed to the right. There are several outliers on the higher end.
| Range | Values |
|---|---|
| Lowest | None |
| Highest | 367030, 332339, 320127, 309628, 306277, 297435, 290846, 284071, 282292, 282198 |
The majority of customers work in blue collar jobs.
ggplot(insurance_train, aes(x = JOB)) +
geom_bar(fill = "red", width = 0.7) +
xlab("Job Category") + ylab("V")
table(insurance_train$JOB)
##
## Clerical Doctor Home Maker Lawyer
## 526 1271 246 641 835
## Manager Professional Student z_Blue Collar
## 988 1117 712 1825
The majority of customers are married.
##
## Yes z_No
## 4894 3267
The distribution of the MVR_PTS is skewed to the right. There are outliers on the higher end.
| Range | Values |
|---|---|
| Lowest | None |
| Highest | 13, 11, 10, 9, 8 |
The distribution OLDCLAIM is highly skewed to the left. There are several outliers on the higher end.
| Range | Values |
|---|---|
| Lowest | None |
| Highest | 57037, 53986, 53568, 53477, 52507, 52465, 52445, 52068, 51904, 51593 |
The majority of customers are not single parents.
##
## No Yes
## 7084 1077
The majority of the cars are not red.
##
## no yes
## 5783 2378
The distribution of TIF is skewed to the right with several outliers.
| Range | Values |
|---|---|
| Lowest | None |
| Highest | 25, 22, 21, 20, 19, 18, 17 |
The distribution of TRAVTIME is skewed to the right with several outliers.
| Range | Values |
|---|---|
| Lowest | None |
| Highest | 142, 134, 124, 113, 103, 101, 98, 97, 95, 93 |
The YOJ distribution is close to normally distributed. There are outliers at both the lower and upper ends.
| Range | Values |
|---|---|
| Lowest | 2 |
| Highest | 23 |
This section will test the predictor variables to determine if there is correlation among them. Variance inflaction factor (VIF) is used to detect multicollinearity, specifically among the entire set of predictors versus within pairs of variables.
Testing for collinearity among the predictor variables, we see that none of the numeric predictor variables appear to have a problem with collinearity based on their low VIF scores.
## No variable from the 13 input variables has collinearity problem.
##
## The linear correlation coefficients ranges between:
## min correlation ( TIF ~ AGE ): -0.0005407089
## max correlation ( HOME_VAL ~ INCOME ): 0.5827817
##
## ---------- VIFs of the remained variables --------
## Variables VIF
## 1 KIDSDRIV 1.295845
## 2 AGE 1.420859
## 3 HOMEKIDS 1.712523
## 4 YOJ 1.164324
## 5 INCOME 2.022984
## 6 HOME_VAL 1.584312
## 7 TRAVTIME 1.003802
## 8 BLUEBOOK 1.248098
## 9 TIF 1.003436
## 10 OLDCLAIM 1.342782
## 11 CLM_FREQ 1.455466
## 12 MVR_PTS 1.198110
## 13 CAR_AGE 1.224164
The majority of cases are complete. Te concern are the 2 predictor variables (CAR_AGE, YOJ) that have more than 5% of missing values.
Predictors without missing values that contain zero values are possible indication zero values are actually missing values. For instance, predictors HOME_VAL and INCOME have zero values which are highly unlikely.
The missing data patterns show that 7,213 out of 8,161 are complete observations, 6 observations are missing the AGE predictor, 432 observations are missing YOJ, 488 observations are missing CAR_AGE and 22 observations are missing YOJ and CAR_AGE.
## TARGET_FLAG TARGET_AMT KIDSDRIV HOMEKIDS PARENT1 MSTATUS SEX
## 6448 1 1 1 1 1 1 1
## 3 1 1 1 1 1 1 1
## 385 1 1 1 1 1 1 1
## 364 1 1 1 1 1 1 1
## 378 1 1 1 1 1 1 1
## 431 1 1 1 1 1 1 1
## 1 1 1 1 1 1 1 1
## 22 1 1 1 1 1 1 1
## 2 1 1 1 1 1 1 1
## 21 1 1 1 1 1 1 1
## 23 1 1 1 1 1 1 1
## 18 1 1 1 1 1 1 1
## 23 1 1 1 1 1 1 1
## 29 1 1 1 1 1 1 1
## 4 1 1 1 1 1 1 1
## 2 1 1 1 1 1 1 1
## 1 1 1 1 1 1 1 1
## 5 1 1 1 1 1 1 1
## 1 1 1 1 1 1 1 1
## 0 0 0 0 0 0 0
## EDUCATION JOB TRAVTIME CAR_USE BLUEBOOK TIF CAR_TYPE RED_CAR OLDCLAIM
## 6448 1 1 1 1 1 1 1 1 1
## 3 1 1 1 1 1 1 1 1 1
## 385 1 1 1 1 1 1 1 1 1
## 364 1 1 1 1 1 1 1 1 1
## 378 1 1 1 1 1 1 1 1 1
## 431 1 1 1 1 1 1 1 1 1
## 1 1 1 1 1 1 1 1 1 1
## 22 1 1 1 1 1 1 1 1 1
## 2 1 1 1 1 1 1 1 1 1
## 21 1 1 1 1 1 1 1 1 1
## 23 1 1 1 1 1 1 1 1 1
## 18 1 1 1 1 1 1 1 1 1
## 23 1 1 1 1 1 1 1 1 1
## 29 1 1 1 1 1 1 1 1 1
## 4 1 1 1 1 1 1 1 1 1
## 2 1 1 1 1 1 1 1 1 1
## 1 1 1 1 1 1 1 1 1 1
## 5 1 1 1 1 1 1 1 1 1
## 1 1 1 1 1 1 1 1 1 1
## 0 0 0 0 0 0 0 0 0
## CLM_FREQ REVOKED MVR_PTS URBANICITY AGE INCOME YOJ HOME_VAL CAR_AGE
## 6448 1 1 1 1 1 1 1 1 1
## 3 1 1 1 1 0 1 1 1 1
## 385 1 1 1 1 1 1 0 1 1
## 364 1 1 1 1 1 0 1 1 1
## 378 1 1 1 1 1 1 1 0 1
## 431 1 1 1 1 1 1 1 1 0
## 1 1 1 1 1 0 0 1 1 1
## 22 1 1 1 1 1 0 0 1 1
## 2 1 1 1 1 0 1 1 0 1
## 21 1 1 1 1 1 1 0 0 1
## 23 1 1 1 1 1 0 1 0 1
## 18 1 1 1 1 1 1 0 1 0
## 23 1 1 1 1 1 0 1 1 0
## 29 1 1 1 1 1 1 1 0 0
## 4 1 1 1 1 1 0 0 0 1
## 2 1 1 1 1 1 0 0 1 0
## 1 1 1 1 1 1 1 0 0 0
## 5 1 1 1 1 1 0 1 0 0
## 1 1 1 1 1 1 0 0 0 0
## 0 0 0 0 6 445 454 464 510
##
## 6448 0
## 3 1
## 385 1
## 364 1
## 378 1
## 431 1
## 1 2
## 22 2
## 2 2
## 21 2
## 23 2
## 18 2
## 23 2
## 29 2
## 4 3
## 2 3
## 1 3
## 5 3
## 1 4
## 1879
The missing home value data for students and income data for home maker were replaced with zero. This decision was made after examination of the dataset. It is possible that students did not enter home value data because many students does not own a home. Missing income data for home makers maybe due to no information entered since home makers don’t typically earn an income.
The insurance dataset was recoded to the variables listed below. Most recoding centered around converting factors to dummy variables. Additionally, age-related fields such as AGE and CAR_AGE were also created a ranges for possible benefit during the modeling phase to potentially determine significance within specific ranges.
#insurance_trainingT <- read.csv( "https://raw.githubusercontent.com/621-Group2/HW4/master/insurance_training_data_T.csv")
#x1 <- glm(TARGET_FLAG ~., family= binomial(), data = insurance_trainingT)
#car::mmps(x1)
## regression model metrics
ll_model_metrics <- data.frame()
all_roc_curves <- list()
all_predictions <- list()
# Valerie - I started creating an equivalent calc_metrics for linear regression models.
# metrics are slighlty different. will need to add more like adj R-squared etc.
calc_metrics_lm <- function(model_name, model, test, train, show=FALSE) {
#y_pred_model <- predict(model, test, type = 'response')
#y_pred_model <- as.factor(ifelse(pred_model > 0.5, 1, 0))
predicted <- predict(model, test) # predict on test data
compare <- cbind (actual=test$TARGET_AMT, predicted) # combine
model_accuracy <- mean (apply(compare, 1, min)/apply(compare, 1, max))
#MAPE calculates the mean absolute percentage error:
#SMAPE calculates the symmetric mean absolute percentage error:
#MSE calculates mean squared error:
# RMSE calculates the root mean squared error:
#https://cran.r-project.org/web/packages/sjPlot/vignettes/sjtlm.html
#https://stackoverflow.com/questions/30147756/exporting-r-regression-summary-for-publishable-paper
metrics_df <- data.frame(Model=model_name,
AIC=round(AIC(model), 3),
BIC=round(BIC(model), 3),
accuracy=round(model_accuracy, 3), #acc=Metrics::accuracy(compare[, 1], compare[, 2]),
MAE=Metrics::mae(compare[, 1], compare[, 2]),
MAPE=Metrics::mape(compare[, 1], compare[, 2]), #MAPE calculates the mean absolute percentage error:
SMAPE=Metrics::smape(compare[, 1], compare[, 2]), #SMAPE calculates the symmetric mean absolute percentage error:
MSE=Metrics::mse(compare[, 1], compare[, 2]), #MSE calculates mean squared error
RMSE=Metrics::rmse(compare[, 1], compare[, 2])) #RMSE calculates the root mean squared error:
# Result
result <- list(metrics_df, predicted, compare)
return (result)
}
# m <- calc_metrics_lm('t', model2, dev_test, dev_train)
The base multiple linear regression model of imputed data. This model predicts the cost if a car is in a crash (TARGET_AMT) using all predictor variables. This shows the estimated cost is negative the assumption the car did not crash.
The following conclusions can be made from the model: * The likelihood that a car will crash increase by 3.9% when the insurer drives with kids * The likelihood that a car will crash declined by -5.9% when the insurer is married. * The likelihood that a car will crash increased by 5.4% when the insurer’s license was revoked in the past 7 years.
The R-squared = .07 we reject the null hypothesis
The plot shows the residuals have a linerar pattern – the majorty are close to the line. There could be a linear relationship between predictor variabes and an outcome bariable. There appear to be some bad leverage outliers – 76910, 85,383 and 7.072.
This plot shows that residuals are not equally spread along the range of predictors. Thus, the variance is not equal.
This plot shows the residuals are not normally distrubuted.
This plot shows there are some infuential outliers. There are cases far beyond the Cook’s distance lines – the other residuals appear to be clustered on the left. The influential observations are 7691, 7270 and 29030.
##
## Call:
## lm(formula = TARGET_AMT ~ ., data = dataBase)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6234 -1685 -755 367 103400
##
## Coefficients: (5 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.835e+03 1.924e+03 -1.473 0.140729
## KIDSDRIV 3.885e+02 1.153e+02 3.370 0.000755 ***
## MALE -3.150e+01 1.533e+02 -0.205 0.837249
## MARRIED -5.865e+02 1.486e+02 -3.947 7.97e-05 ***
## SINGLE_PARENT 5.114e+02 2.026e+02 2.524 0.011632 *
## LICENSE_REVOKED 5.446e+02 1.736e+02 3.138 0.001708 **
## AGE 3.584e+01 1.938e+01 1.850 0.064393 .
## AGE_RANGE_16_19_YRS 5.934e+03 2.306e+03 2.573 0.010108 *
## AGE_RANGE_20_29_YRS 2.528e+03 1.491e+03 1.696 0.089974 .
## AGE_RANGE_30_39_YRS 1.778e+03 1.381e+03 1.287 0.198080
## AGE_RANGE_40_49_YRS 1.166e+03 1.302e+03 0.895 0.370662
## AGE_RANGE_50_59_YRS 1.234e+03 1.245e+03 0.991 0.321485
## AGE_RANGE_60_69_YRS 1.372e+03 1.220e+03 1.125 0.260827
## AGE_RANGE_70_YRS_PLUS NA NA NA NA
## INEXP_DRIVER -1.051e+03 1.270e+03 -0.828 0.407955
## HOMEKIDS 8.205e+01 6.638e+01 1.236 0.216450
## YOJ -1.587e+01 1.481e+01 -1.072 0.283950
## INCOME -4.087e-03 1.829e-03 -2.235 0.025462 *
## HOME_VAL -5.618e-04 6.065e-04 -0.926 0.354265
## TRAVTIME 1.190e+01 3.221e+00 3.696 0.000221 ***
## BLUEBOOK 1.839e-03 7.117e-03 0.258 0.796148
## TIF -4.594e+01 1.219e+01 -3.770 0.000164 ***
## OLDCLAIM -9.225e-03 7.438e-03 -1.240 0.214960
## CLM_FREQ 1.355e+02 5.501e+01 2.463 0.013817 *
## MVR_PTS 1.709e+02 2.601e+01 6.571 5.31e-11 ***
## CAR_AGE -1.834e+01 2.537e+01 -0.723 0.469755
## CAR_AGE_RANGE_1_YR 1.012e+02 3.301e+02 0.306 0.759281
## CAR_AGE_RANGE_2_3_YRS 1.141e+02 1.345e+03 0.085 0.932414
## CAR_AGE_RANGE_3_5_YRS 5.458e+02 4.109e+02 1.328 0.184095
## CAR_AGE_RANGE_5_10_YRS 1.414e+02 1.947e+02 0.726 0.467583
## CAR_AGE_RANGE_10_YRS_PLUS NA NA NA NA
## MAIN_DRIVING_CITY 1.648e+03 1.395e+02 11.814 < 2e-16 ***
## RED_CAR -9.555e+01 1.500e+02 -0.637 0.524099
## EDU_HIGH_SCHOOL 1.876e+01 1.745e+02 0.107 0.914409
## EDU_COLLEGE -1.689e+02 1.595e+02 -1.059 0.289728
## EDU_ADV_DEGREE 3.479e+02 2.214e+02 1.571 0.116152
## VEHICLE_USE_COMMERCIAL 3.418e+01 4.212e+02 0.081 0.935311
## VEHICLE_CLASS_TRUCK -5.493e+02 2.544e+02 -2.159 0.030879 *
## VEHICLE_CLASS_SUV -4.781e+02 1.930e+02 -2.477 0.013264 *
## VEHICLE_CLASS_CAR NA NA NA NA
## SPORTS_CAR NA NA NA NA
## RED_SPORTS_CAR 5.255e+02 8.741e+02 0.601 0.547742
## TRUCK_COMM 1.154e+03 4.520e+02 2.552 0.010725 *
## SUV_COMM 4.310e+02 4.325e+02 0.996 0.319086
## CAR_COMM NA NA NA NA
## OCCUPATION_CLERICAL 5.393e+02 3.412e+02 1.580 0.114061
## OCCUPATION_MANAGER -4.740e+02 2.871e+02 -1.651 0.098811 .
## OCCUPATION_BLUE_COLLAR 6.446e+02 3.254e+02 1.981 0.047600 *
## OCCUPATION_GOLD_COLLAR 1.030e+02 2.762e+02 0.373 0.709163
## OCCUPATION_STUDENT 2.742e+02 3.779e+02 0.726 0.468103
## OCCUPATION_HOME_MAKER 2.587e+02 3.657e+02 0.707 0.479307
## OCCUPATION_PROFESSIONAL 4.456e+02 3.067e+02 1.453 0.146280
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4541 on 8113 degrees of freedom
## Multiple R-squared: 0.07334, Adjusted R-squared: 0.06809
## F-statistic: 13.96 on 46 and 8113 DF, p-value: < 2.2e-16
The second model uses a stepwise variable selection with linear regression. We chose to use both the “forward” and “backward” methods to obtain the optimal model based on the lowest AIC (Akaike Information Criterion). This modeling approach uses the non-transformed, recoded dataset.
null.model <- lm(TARGET_AMT ~ 1 , data= dev_train) # base intercept only model
full.model <- lm(TARGET_AMT ~ . , data= dev_train) # full model with all predictors
# perform step-wise algorithm
model2 <- step(null.model, scope = list(lower = null.model, upper = full.model), direction = "both", trace = 0, steps = 1000)
summary(model2)
##
## Call:
## lm(formula = TARGET_AMT ~ BLUEBOOK + MARRIED + AGE_RANGE_16_19_YRS +
## HOME_VAL + INCOME + MALE, data = dev_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9463 -3128 -1542 373 79362
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.402e+03 4.757e+02 9.252 < 2e-16 ***
## BLUEBOOK 1.173e-01 2.578e-02 4.551 5.78e-06 ***
## MARRIED -1.423e+03 4.818e+02 -2.954 0.00319 **
## AGE_RANGE_16_19_YRS 6.676e+03 3.396e+03 1.966 0.04950 *
## HOME_VAL 4.866e-03 2.306e-03 2.110 0.03502 *
## INCOME -9.953e-03 6.140e-03 -1.621 0.10521
## MALE 5.678e+02 3.945e+02 1.439 0.15030
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7569 on 1502 degrees of freedom
## Multiple R-squared: 0.02624, Adjusted R-squared: 0.02235
## F-statistic: 6.746 on 6 and 1502 DF, p-value: 4.565e-07
AIC (Akaike Information Criterion) for Model 2 = 3.124751210^{4}
BIC (Bayesian Information Criterion) for Model 2 = 3.129006610^{4}
| Variable | Coefficient |
|---|---|
| BLUEBOOK | 0.1173009 |
| MARRIED | -1423.1074373 |
| AGE_RANGE_16_19_YRS | 6676.3098861 |
| HOME_VAL | 0.0048658 |
| INCOME | -0.0099529 |
| MALE | 567.7960181 |
Interpretation. Five predictor variables are among the most significant in this model - BLUEBOOk, MARRIED, AGE_RANGE_16_19_YRS, HOME_VAL, INCOME, and MALE. Of these, BLUEBOOK, MARRIED, AGE_RANGE_16_19_YRS, and HOME_VAL are the most statistically significant with a p-value less than a signficance value of 0.05. The predictors INCOME and MALE were not statistically significant, although it’s interesting to note that there may be a potential relationship between male drivers and a higher payout amount.
BLUEBOOK value intuitively makes sense as a significant and important predictor in the payout amount; however, the corresponding coefficient of nearly 0 is not intuitive. We see that the predictor MARRIED is associated with a negative coefficient of -1423. This indicating that married drivers tend to have lower payout amounts as compared to single drivers, potentially a strong indicator of safer driving.
Conversely, the predictor AGE_RANGE_16_19_YRS is significant with a positive coefficient of 6,676 indicating that drivers between the ages of 16 to 19 are a less safe group and are associated with higher claim payout amounts.
The Adjusted R-squared value associated with this model is very low at 0.2235. There is a significant amount of variability unaccounted for by this model in the response variable. The F-statistic or F- Test for overall significance does show significance.
The diagnostics plots below are examined to test the assumptions of linear regression.
The diagnostic plots reveal some potential issues with this model. The Residuals vs. Fitted plot shows a downward trend – as the fitted values increase on the x-axis, the residuals decrease. We would expect to see a flat line if there is homoscedasticity (residuals of equal variance). Heteroscedascity can also seen in the Scale-Location plot. Again we would expect to see a relatively flat trend compared to the updward trend of the red line.
Heteroscedasticity can be confirmed statistically using the NCV test:
car::ncvTest(model2)
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 465.0705 Df = 1 p = 3.7886e-103
The p-value less than a signficance value of 0.05 confirms that there is definitely a pattern in the residuals (heteroscedasticity).
The Normal Q-Q plot shows a issue with the requirement of normal distibution of residuals. We see a step increase approaching the second quantile confirming violations of normality, which could affect the coefficients of the model. This plus the heteroscedasticity are strong indicators that the a transformation may be required.
Multicollinearity does not appear to be problem.
car::vif(model2)
## BLUEBOOK MARRIED AGE_RANGE_16_19_YRS
## 1.257359 1.528386 1.003361
## HOME_VAL INCOME MALE
## 2.042174 1.859690 1.013293
Additionally, this model is impacted by outliers as shown by Cook’s distance and the leverage plots.
car::outlierTest(model2)
## rstudent unadjusted p-value Bonferonni p
## 1430 10.912140 9.8766e-27 1.4904e-23
## 2063 9.433105 1.4572e-20 2.1990e-17
## 640 9.091363 3.0107e-19 4.5431e-16
## 1552 6.993113 4.0340e-12 6.0873e-09
## 1357 6.956691 5.1846e-12 7.8235e-09
## 1137 6.955736 5.2187e-12 7.8751e-09
## 251 6.305652 3.7657e-10 5.6825e-07
## 1639 6.272875 4.6258e-10 6.9803e-07
## 1200 6.084804 1.4778e-09 2.2300e-06
## 2133 5.947021 3.3912e-09 5.1172e-06
plot(cooks.distance(model2), pch=23, bg='orange', cex=2, ylab="Cook's distance")
Removal of these two outliers did not change the model based.
dev_train_upd <- dev_train[which(cooks.distance(model2) < 0.1),]
#dev_train[which(cooks.distance(model2)==2038)]
mod2 <- update(model2,data=dev_train_upd)
autoplot(mod2, which = 1:6, colour = 'dodgerblue3',
smooth.colour = 'red', smooth.linetype = 'dashed',
ad.colour = 'black',
label.size = 3, label.n = 5, label.colour = 'blue',
ncol = 3)
gvlma(mod2)
##
## Call:
## lm(formula = TARGET_AMT ~ BLUEBOOK + MARRIED + AGE_RANGE_16_19_YRS +
## HOME_VAL + INCOME + MALE, data = dev_train_upd)
##
## Coefficients:
## (Intercept) BLUEBOOK MARRIED
## 4.566e+03 1.027e-01 -1.088e+03
## AGE_RANGE_16_19_YRS HOME_VAL INCOME
## -1.245e+03 3.222e-03 -7.990e-03
## MALE
## 4.131e+02
##
##
## ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
## USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
## Level of Significance = 0.05
##
## Call:
## gvlma(x = mod2)
##
## Value p-value Decision
## Global Stat 6.244e+04 0.0000 Assumptions NOT satisfied!
## Skewness 5.817e+03 0.0000 Assumptions NOT satisfied!
## Kurtosis 5.663e+04 0.0000 Assumptions NOT satisfied!
## Link Function 1.269e+00 0.2600 Assumptions acceptable.
## Heteroscedasticity 3.011e-01 0.5832 Assumptions acceptable.
Removal of the outliers does not improve the model 2.
Log transformation was used to transform the imputed data set. Trasformation was applied to the response variable (TARGET_AMT). To adopt the dataset to the requirements of log transformation all zero value observations were removed. The decsion to remove zero values from the response variable instead of adding a value to prevent any data distortion.
Several recoded predictors that were not significant to the model were removemoved before transformation. The remaining dataset contained 2,152 observations.
Model one is a baseline model of the transformed response variale against all predictors. The summary of the regression model shows F=1.56 with a p-value=0.017, indicating that we accept the null hypothesis that the predictors collectively have a significant effect on the amount paid when there is a crash.
##
## Call:
## lm(formula = TARGET_AMT ~ ., data = insurance_trainT)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.6571 -0.4100 0.0379 0.4113 3.2160
##
## Coefficients: (2 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.136e+00 1.935e-01 42.037 < 2e-16 ***
## KIDSDRIV -3.274e-02 3.321e-02 -0.986 0.3243
## MALE 4.165e-02 5.645e-02 0.738 0.4607
## MARRIED -9.115e-02 5.310e-02 -1.716 0.0862 .
## SINGLE_PARENT 2.219e-02 6.180e-02 0.359 0.7196
## LICENSE_REVOKED -9.607e-02 5.449e-02 -1.763 0.0781 .
## AGE 2.881e-03 2.261e-03 1.274 0.2028
## INEXP_DRIVER 2.065e-01 2.559e-01 0.807 0.4198
## HOMEKIDS 2.569e-02 2.183e-02 1.177 0.2394
## YOJ -2.758e-03 5.102e-03 -0.541 0.5888
## INCOME -9.885e-07 6.913e-07 -1.430 0.1529
## HOME_VAL 1.211e-07 2.159e-07 0.561 0.5750
## TRAVTIME -1.526e-04 1.165e-03 -0.131 0.8958
## BLUEBOOK 1.041e-05 2.513e-06 4.144 3.54e-05 ***
## TIF -2.096e-03 4.476e-03 -0.468 0.6396
## OLDCLAIM 4.476e-06 2.381e-06 1.880 0.0603 .
## CLM_FREQ -3.460e-02 1.655e-02 -2.090 0.0367 *
## MVR_PTS 1.421e-02 7.219e-03 1.969 0.0491 *
## CAR_AGE -2.109e-03 3.974e-03 -0.531 0.5956
## MAIN_DRIVING_CITY 5.948e-02 7.963e-02 0.747 0.4552
## RED_CAR 2.519e-03 5.297e-02 0.048 0.9621
## EDU_HIGH_SCHOOL -1.443e-03 5.320e-02 -0.027 0.9784
## VEHICLE_CLASS_TRUCK -4.193e-03 9.424e-02 -0.044 0.9645
## VEHICLE_CLASS_SUV -9.073e-03 6.334e-02 -0.143 0.8861
## VEHICLE_CLASS_CAR NA NA NA NA
## SPORTS_CAR NA NA NA NA
## RED_SPORTS_CAR 2.393e-01 2.385e-01 1.003 0.3159
## TRUCK_COMM 1.852e-02 7.798e-02 0.237 0.8123
## SUV_COMM 4.060e-02 7.055e-02 0.575 0.5650
## CAR_COMM -1.142e-01 1.199e-01 -0.952 0.3410
## OCCUPATION_CLERICAL -1.000e-01 1.020e-01 -0.981 0.3268
## OCCUPATION_MANAGER -7.976e-02 1.035e-01 -0.771 0.4410
## OCCUPATION_BLUE_COLLAR -1.016e-01 9.279e-02 -1.095 0.2735
## OCCUPATION_GOLD_COLLAR -2.026e-02 1.023e-01 -0.198 0.8429
## OCCUPATION_STUDENT -1.431e-01 1.150e-01 -1.244 0.2135
## OCCUPATION_HOME_MAKER -1.825e-01 1.208e-01 -1.511 0.1310
## OCCUPATION_PROFESSIONAL -5.693e-02 9.451e-02 -0.602 0.5470
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8088 on 2117 degrees of freedom
## Multiple R-squared: 0.02457, Adjusted R-squared: 0.008905
## F-statistic: 1.568 on 34 and 2117 DF, p-value: 0.0197
The output of the Backward AIC Step regression shows that most of the predictors do not have significance to the total amount paid when there is a crash.
## Stepwise Model Path
## Analysis of Deviance Table
##
## Initial Model:
## TARGET_AMT ~ KIDSDRIV + MALE + MARRIED + SINGLE_PARENT + LICENSE_REVOKED +
## AGE + INEXP_DRIVER + HOMEKIDS + YOJ + INCOME + HOME_VAL +
## TRAVTIME + BLUEBOOK + TIF + OLDCLAIM + CLM_FREQ + MVR_PTS +
## CAR_AGE + MAIN_DRIVING_CITY + RED_CAR + EDU_HIGH_SCHOOL +
## VEHICLE_CLASS_TRUCK + VEHICLE_CLASS_SUV + VEHICLE_CLASS_CAR +
## SPORTS_CAR + RED_SPORTS_CAR + TRUCK_COMM + SUV_COMM + CAR_COMM +
## OCCUPATION_CLERICAL + OCCUPATION_MANAGER + OCCUPATION_BLUE_COLLAR +
## OCCUPATION_GOLD_COLLAR + OCCUPATION_STUDENT + OCCUPATION_HOME_MAKER +
## OCCUPATION_PROFESSIONAL
##
## Final Model:
## TARGET_AMT ~ MALE + MARRIED + LICENSE_REVOKED + BLUEBOOK + OLDCLAIM +
## CLM_FREQ + MVR_PTS
##
##
## Step Df Deviance Resid. Df Resid. Dev AIC
## 1 2117 1385.015 -878.3575
## 2 - SPORTS_CAR 0 0.000000000 2117 1385.015 -878.3575
## 3 - VEHICLE_CLASS_CAR 0 0.000000000 2117 1385.015 -878.3575
## 4 - EDU_HIGH_SCHOOL 1 0.000481687 2118 1385.016 -880.3568
## 5 - VEHICLE_CLASS_TRUCK 1 0.001274141 2119 1385.017 -882.3548
## 6 - RED_CAR 1 0.001444202 2120 1385.018 -884.3525
## 7 - TRAVTIME 1 0.011190859 2121 1385.030 -886.3352
## 8 - VEHICLE_CLASS_SUV 1 0.014241661 2122 1385.044 -888.3130
## 9 - OCCUPATION_GOLD_COLLAR 1 0.026545891 2123 1385.070 -890.2718
## 10 - SINGLE_PARENT 1 0.085063039 2124 1385.155 -892.1396
## 11 - TRUCK_COMM 1 0.142827618 2125 1385.298 -893.9177
## 12 - TIF 1 0.127934365 2126 1385.426 -895.7190
## 13 - SUV_COMM 1 0.144193022 2127 1385.570 -897.4950
## 14 - CAR_AGE 1 0.182760271 2128 1385.753 -899.2112
## 15 - OCCUPATION_PROFESSIONAL 1 0.138291667 2129 1385.891 -900.9965
## 16 - YOJ 1 0.208594411 2130 1386.100 -902.6726
## 17 - HOME_VAL 1 0.207081511 2131 1386.307 -904.3511
## 18 - OCCUPATION_MANAGER 1 0.245548632 2132 1386.553 -905.9700
## 19 - OCCUPATION_BLUE_COLLAR 1 0.304889058 2133 1386.857 -907.4968
## 20 - OCCUPATION_CLERICAL 1 0.153094694 2134 1387.011 -909.2593
## 21 - OCCUPATION_STUDENT 1 0.150664774 2135 1387.161 -911.0255
## 22 - MAIN_DRIVING_CITY 1 0.422870406 2136 1387.584 -912.3696
## 23 - INEXP_DRIVER 1 0.387751137 2137 1387.972 -913.7683
## 24 - INCOME 1 0.579689637 2138 1388.552 -914.8697
## 25 - OCCUPATION_HOME_MAKER 1 0.360993999 2139 1388.913 -916.3103
## 26 - KIDSDRIV 1 0.678387456 2140 1389.591 -917.2595
## 27 - AGE 1 0.428542625 2141 1390.019 -918.5959
## 28 - HOMEKIDS 1 0.245897861 2142 1390.265 -920.2152
## 29 - RED_SPORTS_CAR 1 0.981223160 2143 1391.247 -920.6969
## 30 - CAR_COMM 1 0.906329217 2144 1392.153 -921.2955
## Stepwise Model Path
## Analysis of Deviance Table
##
## Initial Model:
## TARGET_AMT ~ KIDSDRIV + MALE + MARRIED + SINGLE_PARENT + LICENSE_REVOKED +
## AGE + INEXP_DRIVER + HOMEKIDS + YOJ + INCOME + HOME_VAL +
## TRAVTIME + BLUEBOOK + TIF + OLDCLAIM + CLM_FREQ + MVR_PTS +
## CAR_AGE + MAIN_DRIVING_CITY + RED_CAR + EDU_HIGH_SCHOOL +
## VEHICLE_CLASS_TRUCK + VEHICLE_CLASS_SUV + VEHICLE_CLASS_CAR +
## SPORTS_CAR + RED_SPORTS_CAR + TRUCK_COMM + SUV_COMM + CAR_COMM +
## OCCUPATION_CLERICAL + OCCUPATION_MANAGER + OCCUPATION_BLUE_COLLAR +
## OCCUPATION_GOLD_COLLAR + OCCUPATION_STUDENT + OCCUPATION_HOME_MAKER +
## OCCUPATION_PROFESSIONAL
##
## Final Model:
## TARGET_AMT ~ KIDSDRIV + MALE + MARRIED + SINGLE_PARENT + LICENSE_REVOKED +
## AGE + INEXP_DRIVER + HOMEKIDS + YOJ + INCOME + HOME_VAL +
## TRAVTIME + BLUEBOOK + TIF + OLDCLAIM + CLM_FREQ + MVR_PTS +
## CAR_AGE + MAIN_DRIVING_CITY + RED_CAR + EDU_HIGH_SCHOOL +
## VEHICLE_CLASS_TRUCK + VEHICLE_CLASS_SUV + VEHICLE_CLASS_CAR +
## SPORTS_CAR + RED_SPORTS_CAR + TRUCK_COMM + SUV_COMM + CAR_COMM +
## OCCUPATION_CLERICAL + OCCUPATION_MANAGER + OCCUPATION_BLUE_COLLAR +
## OCCUPATION_GOLD_COLLAR + OCCUPATION_STUDENT + OCCUPATION_HOME_MAKER +
## OCCUPATION_PROFESSIONAL
##
##
## Step Df Deviance Resid. Df Resid. Dev AIC
## 1 2117 1385.015 -878.3575
library(ridge)
## Warning: package 'ridge' was built under R version 3.4.1
## currently trying Ridge regression; need to add rlm code
linRidgeMod <- linearRidge(TARGET_AMT ~ ., data = dev_train)
predicted <- predict(linRidgeMod, dev_test) # predict on test data
compare <- cbind (actual=dev_test$TARGET_AMT, predicted) # combine
mean (apply(compare, 1, min)/apply(compare, 1, max))
## [1] 0.6023915