Introduction

Group 2 of the DATA621 class was asked to analyze and model a data sent containing approximately 8,000 with 2 response and 23 predictor variables, records representing a customer at an auto insurance company.

Each record has two response variables. The first response variable, TARGET_FLAG which is binary (0,1). If someone was in car crash the value is 1 and if the person was not in a car cash the value is 0.

The second response variable is TARGET_ATM. If someone was in a car cash the value is 1 and if they did not crash their car the value is greater than 0.

Objective

The objective is to build multiple linear regression and binary logistic regression models on the training data to predict the probability that a person will crash their car and also the amount of money it will cost if the person does crash their car.

Only the variables that are given or variables derived from the variables provided.

Approach

The team met to discuss this assignment and an approach for completing the assignment. Each of the 5 team members was assigned tasks. The following tasks were assigned:

Data Exploration Data Preparation Build Models *Select Models

Github was used to manage the project. Using Github helped with version control and ensured each team member had access to the latest version of the project documentation.

Slack was used for daily communication during the project and for quick access to code and documentation. Meeting were organized at least twice a week and as needed using “Go to Meetings”.

Data Exploration and Data Preparation

Since the data sets were provided, it was crucial that we understand the data set and determine whether any missing values are present.

Model Building and Selection

Based on the objective of the project, several logistic, multiple and robust regression models were built.

Team Members

-Valerie Briot -Michael D’acampora -Keith Folsom -Brian Kreis -Sharon Morris

Dataset

For reproducibility of the results, the data was loaded to and accessed from a Github repository.

Data Exploration and Statistic Measures

The purpose of the data exploration and statistic measures phase is to understand the data to determine how to process the dataset for modelling.

Missing Values

The majority of variables do not contain missing values. The predictor CAR_AGE (Vehicle Age) contains 510 missing values, YOJ(Years on Job) contain 454, income (INCOME) 445 and home value (HOME_VAL) 464 missing values.

The visualization of missing values below shows that missing values of CAR_AGE, HOME_VAL and YOJ are at 6 percent while INCOME is at 5 percent. The dataset was imputed to account for the missing values.

TARGET_FLAG	0
TARGET_AMT	0
KIDSDRIV	0
AGE	6
HOMEKIDS	0
YOJ	454
INCOME	445
PARENT1	0
HOME_VAL	464
MSTATUS	0
SEX	0
EDUCATION	0
JOB	0
TRAVTIME	0
CAR_USE	0
BLUEBOOK	0
TIF	0
CAR_TYPE	0
RED_CAR	0
OLDCLAIM	0
CLM_FREQ	0
REVOKED	0
MVR_PTS	0
CAR_AGE	510
URBANICITY	0

Variable to Variable Analysis

Variable Name	Definition	Variable Type
TARGET_FLAG	Was Car in a crash? 1=YES 0=NO	Response
TARGET_AMT	If car was in a crash, what was the cost	Response
AGE	Age of Driver	Predictor
BLUEBOOK	Value of Vehicle	Predictor
CAR_AGE	Vehicle Age	Predictor
CAR_TYPE	Type of Car	Predictor
CAR_USE	Vehicle Use	Predictor
CLM_FREQ	# Claims (Past 5 Years)	Predictor
EDUCATION	Max Education Level	Predictor
HOMEKIDS	# Children at Home	Predictor
HOME_VAL	Home Value	Predictor
INCOME	Income	Predictor
JOB	Job Category	Predictor
KIDSDRIV	# Driving Children	Predictor
MSTATUS	Marital Status	Predictor
MVR_PTS	Motor Vehicle Record Points	Predictor
OLDCLAIM	Total Claims (Past 5 Years)	Predictor
PARENT1	Single Parent	Predictor
RED_CAR	A Red Car	Predictor
REVOKED	License Revoked (Past 7 Years)	Predictor
SEX	Gender	Predictor
TIF	Time in Force	Predictor
TRAVTIME	Distance to Work	Predictor
URBANICITY	Home/Work Area	Predictor
YOJ	Years on Job	Predictor

Descriptive Statistics

Descriptive statistics was performed for all predictor and response variables to explore the data.

##   TARGET_FLAG TARGET_AMT KIDSDRIV         AGE HOMEKIDS        YOJ
## 1           0          0        0 0.000735204        0 0.05563044
##       INCOME PARENT1   HOME_VAL MSTATUS SEX EDUCATION JOB TRAVTIME CAR_USE
## 1 0.05452763       0 0.05685578       0   0         0   0        0       0
##   BLUEBOOK TIF CAR_TYPE RED_CAR OLDCLAIM CLM_FREQ REVOKED MVR_PTS
## 1        0   0        0       0        0        0       0       0
##      CAR_AGE URBANICITY
## 1 0.06249234          0

##             vars    n      mean        sd median   trimmed       mad  min
## TARGET_FLAG    1 8161      0.26      0.44      0      0.20      0.00    0
## TARGET_AMT     2 8161   1504.32   4704.03      0    593.71      0.00    0
## KIDSDRIV       3 8161      0.17      0.51      0      0.03      0.00    0
## AGE            4 8155     44.79      8.63     45     44.83      8.90   16
## HOMEKIDS       5 8161      0.72      1.12      0      0.50      0.00    0
## YOJ            6 7707     10.50      4.09     11     11.07      2.97    0
## INCOME         7 7716  61898.09  47572.68  54028  56840.98  41792.27    0
## PARENT1*       8 8161      1.13      0.34      1      1.04      0.00    1
## HOME_VAL       9 7697 154867.29 129123.77 161160 144032.07 147867.11    0
## MSTATUS*      10 8161      1.40      0.49      1      1.38      0.00    1
## SEX*          11 8161      1.54      0.50      2      1.55      0.00    1
## EDUCATION*    12 8161      3.09      1.44      3      3.11      1.48    1
## JOB*          13 8161      5.69      2.68      6      5.81      2.97    1
## TRAVTIME      14 8161     33.49     15.91     33     33.00     16.31    5
## CAR_USE*      15 8161      1.63      0.48      2      1.66      0.00    1
## BLUEBOOK      16 8161  15709.90   8419.73  14440  15036.89   8450.82 1500
## TIF           17 8161      5.35      4.15      4      4.84      4.45    1
## CAR_TYPE*     18 8161      3.53      1.97      3      3.54      2.97    1
## RED_CAR*      19 8161      1.29      0.45      1      1.24      0.00    1
## OLDCLAIM      20 8161   4037.08   8777.14      0   1719.29      0.00    0
## CLM_FREQ      21 8161      0.80      1.16      0      0.59      0.00    0
## REVOKED*      22 8161      1.12      0.33      1      1.03      0.00    1
## MVR_PTS       23 8161      1.70      2.15      1      1.31      1.48    0
## CAR_AGE       24 7651      8.33      5.70      8      7.96      7.41   -3
## URBANICITY*   25 8161      1.20      0.40      1      1.13      0.00    1
##                  max    range  skew kurtosis      se    IQR   Q0.1 Q0.25
## TARGET_FLAG      1.0      1.0  1.07    -0.85    0.00      1    0.0     0
## TARGET_AMT  107586.1 107586.1  8.71   112.29   52.07   1036    0.0     0
## KIDSDRIV         4.0      4.0  3.35    11.78    0.01      0    0.0     0
## AGE             81.0     65.0 -0.03    -0.06    0.10     12   34.0    39
## HOMEKIDS         5.0      5.0  1.34     0.65    0.01      1    0.0     0
## YOJ             23.0     23.0 -1.20     1.18    0.05      4    5.0     9
## INCOME      367030.0 367030.0  1.19     2.13  541.58  57889 4380.5 28097
## PARENT1*         2.0      1.0  2.17     2.73    0.00      0    1.0     1
## HOME_VAL    885282.0 885282.0  0.49    -0.02 1471.79 238724    0.0     0
## MSTATUS*         2.0      1.0  0.41    -1.83    0.01      1    1.0     1
## SEX*             2.0      1.0 -0.14    -1.98    0.01      1    1.0     1
## EDUCATION*       5.0      4.0  0.12    -1.38    0.02      3    1.0     2
## JOB*             9.0      8.0 -0.31    -1.22    0.03      5    2.0     3
## TRAVTIME       142.0    137.0  0.45     0.66    0.18     22   13.0    22
## CAR_USE*         2.0      1.0 -0.53    -1.72    0.01      1    1.0     1
## BLUEBOOK     69740.0  68240.0  0.79     0.79   93.20  11570 6000.0  9280
## TIF             25.0     24.0  0.89     0.42    0.05      6    1.0     1
## CAR_TYPE*        6.0      5.0  0.00    -1.52    0.02      5    1.0     1
## RED_CAR*         2.0      1.0  0.92    -1.16    0.01      1    1.0     1
## OLDCLAIM     57037.0  57037.0  3.12     9.86   97.16   4636    0.0     0
## CLM_FREQ         5.0      5.0  1.21     0.28    0.01      2    0.0     0
## REVOKED*         2.0      1.0  2.30     3.30    0.00      0    1.0     1
## MVR_PTS         13.0     13.0  1.35     1.38    0.02      3    0.0     0
## CAR_AGE         28.0     31.0  0.28    -0.75    0.07     11    1.0     1
## URBANICITY*      2.0      1.0  1.46     0.15    0.00      0    1.0     1
##              Q0.75     Q0.9
## TARGET_FLAG      1      1.0
## TARGET_AMT    1036   4904.0
## KIDSDRIV         0      1.0
## AGE             51     56.0
## HOMEKIDS         1      3.0
## YOJ             13     15.0
## INCOME       85986 123180.0
## PARENT1*         1      2.0
## HOME_VAL    238724 316542.6
## MSTATUS*         2      2.0
## SEX*             2      2.0
## EDUCATION*       5      5.0
## JOB*             8      9.0
## TRAVTIME        44     54.0
## CAR_USE*         2      2.0
## BLUEBOOK     20850  27460.0
## TIF              7     11.0
## CAR_TYPE*        6      6.0
## RED_CAR*         2      2.0
## OLDCLAIM      4636   9583.0
## CLM_FREQ         2      3.0
## REVOKED*         1      2.0
## MVR_PTS          3      5.0
## CAR_AGE         12     16.0
## URBANICITY*      1      2.0

Correlation Analysis

The correlation matrix shown below highlights correlations among several predictor variables. Correlation between between claims in the past 5 years (CLM_FREQ) and motor vechile recorded points (MVR_PTS); driving children(KIDSDRV) and age of driver (AGE) is very high at 0.67.

The tables below represent correlation between response and predictor variables.

Correlation with Outcome Variable - TARGET_FLAG

VARIABLE	CORRELATION WITH TARGET_FLAG
KIDSDRIV	0.1036683
AGE	-0.1032167
HOMEKIDS	0.115621
YOJ	-0.0705118
INCOME	-0.1420081
HOME_VAL	-0.1837371
TRAVTIME	0.0483683
BLUEBOOK	-0.1033832
TIF	-0.08237
OLDCLAIM	0.1380838
CLM_FREQ	0.2161961
MVR_PTS	0.2191971
CAR_AGE	-0.1006506

Correlation with Outcome Variable - TARGET_AMT

VARIABLE	CORRELATION WITH TARGET_AMT
KIDSDRIV	0.0553942
AGE	-0.0417283
HOMEKIDS	0.061988
YOJ	-0.0220852
INCOME	-0.0583069
HOME_VAL	-0.0856024
TRAVTIME	0.027987
BLUEBOOK	-0.0046995
TIF	-0.0464808
OLDCLAIM	0.0709533
CLM_FREQ	0.1164192
MVR_PTS	0.1378655
CAR_AGE	-0.0588221

Analysis of predictors

Each predictor was exam ed to determine whether transformation is needed.

KIDSDRIV

The Driving Children variable is highly skewed to the right. The outliers are high.

Extreme Observations

Range	Values
Lowest	None
Highest	4, 3, 2, 1

AGE

The AGE predictor is close to a normal distribution with high outliers of ages 72, 73, 76, 80 & 81 and low 16, 17 and 18.

Extreme Observations

Range	Values
Lowest	20, 19, 18, 17, 16
Highest	81, 80, 76, 73, 72, 70

BLUEBOOK

The predictor of car value BLUEBOOK is slightly skewed to the right. There are some outliers a the higher car value level.

Extreme Observations

Range	Values
Lowest	None
Highest	69740, 65970, 62240, 61050, 57970, 50970, 50180, 49880, 49230, 48620

CAR_AGE

The distribution is normal. There are are no outliers.

Extreme Observations

Range	Values
Lowest	None
Highest	None

CAR_TYPE

z_SUV and Minivan are majority of vehicles insured.

## 
##     Minivan Panel Truck      Pickup  Sports Car         Van       z_SUV 
##        2145         676        1389         907         750        2294

CAR_USE

The majority of cars are privately used.

## 
## Commercial    Private 
##       3029       5132

CLM_FREQ

The distribution of claims is multi modal. With the largest number of claims occurring before year 1. There are no outliers

Extreme Observations

Range	Values
Lowest	None
Highest	None

EDUCATION

Ther majority of insurers are college or high school graduates.

## 
##  <High School     Bachelors       Masters           PhD z_High School 
##          1203          2242          1658           728          2330

HOMEKIDS

The distribution of HOMEKIDS is multimodal. The majority of customers do not have any children. There are some outliers.

Extreme Observations

Range	Values
Lowest	None
Highest	5, 4, 3

HOME_VAL

The distribution of HOME_VAL is skewed to the left. There are negative values that will require further exploration. There are several outliers on the higher end.

Extreme Observations

Range	Values
Lowest	None
Highest	885282, 750455, 738153, 682634, 657804, 653952, 649247, 631309, 630267, 611328

INCOME

The distribution INCOME has uni modal and skewed to the right. There are several outliers on the higher end.

Extreme Observations

Range	Values
Lowest	None
Highest	367030, 332339, 320127, 309628, 306277, 297435, 290846, 284071, 282292, 282198

JOB

The majority of customers work in blue collar jobs.

ggplot(insurance_train, aes(x = JOB)) + 
  geom_bar(fill = "red", width = 0.7) + 
  xlab("Job Category") + ylab("V")

table(insurance_train$JOB)

## 
##                    Clerical        Doctor    Home Maker        Lawyer 
##           526          1271           246           641           835 
##       Manager  Professional       Student z_Blue Collar 
##           988          1117           712          1825

MSTATUS

The majority of customers are married.

## 
##  Yes z_No 
## 4894 3267

MVR_PTS

The distribution of the MVR_PTS is skewed to the right. There are outliers on the higher end.

Extreme Observations

Range	Values
Lowest	None
Highest	13, 11, 10, 9, 8

OLDCLAIM

The distribution OLDCLAIM is highly skewed to the left. There are several outliers on the higher end.

Extreme Observations

Range	Values
Lowest	None
Highest	57037, 53986, 53568, 53477, 52507, 52465, 52445, 52068, 51904, 51593

PARENT1

The majority of customers are not single parents.

## 
##   No  Yes 
## 7084 1077

RED_CAR

The majority of the cars are not red.

## 
##   no  yes 
## 5783 2378

TIF

The distribution of TIF is skewed to the right with several outliers.

Extreme Observations

Range	Values
Lowest	None
Highest	25, 22, 21, 20, 19, 18, 17

TRAVTIME

The distribution of TRAVTIME is skewed to the right with several outliers.

Extreme Observations

Range	Values
Lowest	None
Highest	142, 134, 124, 113, 103, 101, 98, 97, 95, 93

YOJ

The YOJ distribution is close to normally distributed. There are outliers at both the lower and upper ends.

Extreme Observations

Range	Values
Lowest	2
Highest	23

Multicollinearity

This section will test the predictor variables to determine if there is correlation among them. Variance inflaction factor (VIF) is used to detect multicollinearity, specifically among the entire set of predictors versus within pairs of variables.

Testing for collinearity among the predictor variables, we see that none of the numeric predictor variables appear to have a problem with collinearity based on their low VIF scores.

## No variable from the 13 input variables has collinearity problem. 
## 
## The linear correlation coefficients ranges between: 
## min correlation ( TIF ~ AGE ):  -0.0005407089 
## max correlation ( HOME_VAL ~ INCOME ):  0.5827817 
## 
## ---------- VIFs of the remained variables -------- 
##    Variables      VIF
## 1   KIDSDRIV 1.295845
## 2        AGE 1.420859
## 3   HOMEKIDS 1.712523
## 4        YOJ 1.164324
## 5     INCOME 2.022984
## 6   HOME_VAL 1.584312
## 7   TRAVTIME 1.003802
## 8   BLUEBOOK 1.248098
## 9        TIF 1.003436
## 10  OLDCLAIM 1.342782
## 11  CLM_FREQ 1.455466
## 12   MVR_PTS 1.198110
## 13   CAR_AGE 1.224164

Data Preparation

Missing Values

The majority of cases are complete. Te concern are the 2 predictor variables (CAR_AGE, YOJ) that have more than 5% of missing values.

Predictors without missing values that contain zero values are possible indication zero values are actually missing values. For instance, predictors HOME_VAL and INCOME have zero values which are highly unlikely.

The missing data patterns show that 7,213 out of 8,161 are complete observations, 6 observations are missing the AGE predictor, 432 observations are missing YOJ, 488 observations are missing CAR_AGE and 22 observations are missing YOJ and CAR_AGE.

##      TARGET_FLAG TARGET_AMT KIDSDRIV HOMEKIDS PARENT1 MSTATUS SEX
## 6448           1          1        1        1       1       1   1
##    3           1          1        1        1       1       1   1
##  385           1          1        1        1       1       1   1
##  364           1          1        1        1       1       1   1
##  378           1          1        1        1       1       1   1
##  431           1          1        1        1       1       1   1
##    1           1          1        1        1       1       1   1
##   22           1          1        1        1       1       1   1
##    2           1          1        1        1       1       1   1
##   21           1          1        1        1       1       1   1
##   23           1          1        1        1       1       1   1
##   18           1          1        1        1       1       1   1
##   23           1          1        1        1       1       1   1
##   29           1          1        1        1       1       1   1
##    4           1          1        1        1       1       1   1
##    2           1          1        1        1       1       1   1
##    1           1          1        1        1       1       1   1
##    5           1          1        1        1       1       1   1
##    1           1          1        1        1       1       1   1
##                0          0        0        0       0       0   0
##      EDUCATION JOB TRAVTIME CAR_USE BLUEBOOK TIF CAR_TYPE RED_CAR OLDCLAIM
## 6448         1   1        1       1        1   1        1       1        1
##    3         1   1        1       1        1   1        1       1        1
##  385         1   1        1       1        1   1        1       1        1
##  364         1   1        1       1        1   1        1       1        1
##  378         1   1        1       1        1   1        1       1        1
##  431         1   1        1       1        1   1        1       1        1
##    1         1   1        1       1        1   1        1       1        1
##   22         1   1        1       1        1   1        1       1        1
##    2         1   1        1       1        1   1        1       1        1
##   21         1   1        1       1        1   1        1       1        1
##   23         1   1        1       1        1   1        1       1        1
##   18         1   1        1       1        1   1        1       1        1
##   23         1   1        1       1        1   1        1       1        1
##   29         1   1        1       1        1   1        1       1        1
##    4         1   1        1       1        1   1        1       1        1
##    2         1   1        1       1        1   1        1       1        1
##    1         1   1        1       1        1   1        1       1        1
##    5         1   1        1       1        1   1        1       1        1
##    1         1   1        1       1        1   1        1       1        1
##              0   0        0       0        0   0        0       0        0
##      CLM_FREQ REVOKED MVR_PTS URBANICITY AGE INCOME YOJ HOME_VAL CAR_AGE
## 6448        1       1       1          1   1      1   1        1       1
##    3        1       1       1          1   0      1   1        1       1
##  385        1       1       1          1   1      1   0        1       1
##  364        1       1       1          1   1      0   1        1       1
##  378        1       1       1          1   1      1   1        0       1
##  431        1       1       1          1   1      1   1        1       0
##    1        1       1       1          1   0      0   1        1       1
##   22        1       1       1          1   1      0   0        1       1
##    2        1       1       1          1   0      1   1        0       1
##   21        1       1       1          1   1      1   0        0       1
##   23        1       1       1          1   1      0   1        0       1
##   18        1       1       1          1   1      1   0        1       0
##   23        1       1       1          1   1      0   1        1       0
##   29        1       1       1          1   1      1   1        0       0
##    4        1       1       1          1   1      0   0        0       1
##    2        1       1       1          1   1      0   0        1       0
##    1        1       1       1          1   1      1   0        0       0
##    5        1       1       1          1   1      0   1        0       0
##    1        1       1       1          1   1      0   0        0       0
##             0       0       0          0   6    445 454      464     510
##          
## 6448    0
##    3    1
##  385    1
##  364    1
##  378    1
##  431    1
##    1    2
##   22    2
##    2    2
##   21    2
##   23    2
##   18    2
##   23    2
##   29    2
##    4    3
##    2    3
##    1    3
##    5    3
##    1    4
##      1879

Assumptions of Missing Values

The missing home value data for students and income data for home maker were replaced with zero. This decision was made after examination of the dataset. It is possible that students did not enter home value data because many students does not own a home. Missing income data for home makers maybe due to no information entered since home makers don’t typically earn an income.

Recoding of Predictor Variables

The insurance dataset was recoded to the variables listed below. Most recoding centered around converting factors to dummy variables. Additionally, age-related fields such as AGE and CAR_AGE were also created a ranges for possible benefit during the modeling phase to potentially determine significance within specific ranges.

Transform data - Logistic Regression Dataset

#insurance_trainingT <- read.csv( "https://raw.githubusercontent.com/621-Group2/HW4/master/insurance_training_data_T.csv")

#x1 <- glm(TARGET_FLAG ~., family= binomial(), data = insurance_trainingT)
#car::mmps(x1)

Imput Recoded dataset

BUILD MODELS

## regression model metrics

ll_model_metrics <- data.frame()
all_roc_curves <- list()
all_predictions <- list()


# Valerie - I started creating an equivalent calc_metrics for linear regression models.
#           metrics are slighlty different. will need to add more like adj R-squared etc.
calc_metrics_lm <- function(model_name, model, test, train, show=FALSE) {
  
  
  #y_pred_model <- predict(model, test, type = 'response')
  #y_pred_model <- as.factor(ifelse(pred_model > 0.5, 1, 0))
  
  predicted <- predict(model, test)  # predict on test data
  compare <- cbind (actual=test$TARGET_AMT, predicted)  # combine

  model_accuracy <- mean (apply(compare, 1, min)/apply(compare, 1, max))
  
  #MAPE calculates the mean absolute percentage error:
  
  #SMAPE calculates the symmetric mean absolute percentage error:
    
  #MSE calculates mean squared error:

  # RMSE calculates the root mean squared error:

  #https://cran.r-project.org/web/packages/sjPlot/vignettes/sjtlm.html
  #https://stackoverflow.com/questions/30147756/exporting-r-regression-summary-for-publishable-paper
  
  metrics_df <- data.frame(Model=model_name, 
                           AIC=round(AIC(model), 3), 
                           BIC=round(BIC(model), 3), 
                           accuracy=round(model_accuracy, 3), #acc=Metrics::accuracy(compare[, 1], compare[, 2]),
                           MAE=Metrics::mae(compare[, 1], compare[, 2]),
                           MAPE=Metrics::mape(compare[, 1], compare[, 2]),     #MAPE calculates the mean absolute percentage error:
                           SMAPE=Metrics::smape(compare[, 1], compare[, 2]),     #SMAPE calculates the symmetric mean absolute percentage error:
                           MSE=Metrics::mse(compare[, 1], compare[, 2]),       #MSE calculates mean squared error
                           RMSE=Metrics::rmse(compare[, 1], compare[, 2]))     #RMSE calculates the root mean squared error:
                      

  # Result
  result <- list(metrics_df, predicted, compare)

  
  return (result)
  
}

# m <- calc_metrics_lm('t', model2, dev_test, dev_train)

Model 1 Multiple Regression - Baseline Model Non-transformed variables

The base multiple linear regression model of imputed data. This model predicts the cost if a car is in a crash (TARGET_AMT) using all predictor variables. This shows the estimated cost is negative the assumption the car did not crash.

The following conclusions can be made from the model: * The likelihood that a car will crash increase by 3.9% when the insurer drives with kids * The likelihood that a car will crash declined by -5.9% when the insurer is married. * The likelihood that a car will crash increased by 5.4% when the insurer’s license was revoked in the past 7 years.

The R-squared = .07 we reject the null hypothesis

Model Diagnosis

Residual vs Fitted Plot

The plot shows the residuals have a linerar pattern – the majorty are close to the line. There could be a linear relationship between predictor variabes and an outcome bariable. There appear to be some bad leverage outliers – 76910, 85,383 and 7.072.

Scale-Location Plot

This plot shows that residuals are not equally spread along the range of predictors. Thus, the variance is not equal.

Normal Q-Q Plot

This plot shows the residuals are not normally distrubuted.

Residual vs. Leverage Plot

This plot shows there are some infuential outliers. There are cases far beyond the Cook’s distance lines – the other residuals appear to be clustered on the left. The influential observations are 7691, 7270 and 29030.

## 
## Call:
## lm(formula = TARGET_AMT ~ ., data = dataBase)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -6234  -1685   -755    367 103400 
## 
## Coefficients: (5 not defined because of singularities)
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               -2.835e+03  1.924e+03  -1.473 0.140729    
## KIDSDRIV                   3.885e+02  1.153e+02   3.370 0.000755 ***
## MALE                      -3.150e+01  1.533e+02  -0.205 0.837249    
## MARRIED                   -5.865e+02  1.486e+02  -3.947 7.97e-05 ***
## SINGLE_PARENT              5.114e+02  2.026e+02   2.524 0.011632 *  
## LICENSE_REVOKED            5.446e+02  1.736e+02   3.138 0.001708 ** 
## AGE                        3.584e+01  1.938e+01   1.850 0.064393 .  
## AGE_RANGE_16_19_YRS        5.934e+03  2.306e+03   2.573 0.010108 *  
## AGE_RANGE_20_29_YRS        2.528e+03  1.491e+03   1.696 0.089974 .  
## AGE_RANGE_30_39_YRS        1.778e+03  1.381e+03   1.287 0.198080    
## AGE_RANGE_40_49_YRS        1.166e+03  1.302e+03   0.895 0.370662    
## AGE_RANGE_50_59_YRS        1.234e+03  1.245e+03   0.991 0.321485    
## AGE_RANGE_60_69_YRS        1.372e+03  1.220e+03   1.125 0.260827    
## AGE_RANGE_70_YRS_PLUS             NA         NA      NA       NA    
## INEXP_DRIVER              -1.051e+03  1.270e+03  -0.828 0.407955    
## HOMEKIDS                   8.205e+01  6.638e+01   1.236 0.216450    
## YOJ                       -1.587e+01  1.481e+01  -1.072 0.283950    
## INCOME                    -4.087e-03  1.829e-03  -2.235 0.025462 *  
## HOME_VAL                  -5.618e-04  6.065e-04  -0.926 0.354265    
## TRAVTIME                   1.190e+01  3.221e+00   3.696 0.000221 ***
## BLUEBOOK                   1.839e-03  7.117e-03   0.258 0.796148    
## TIF                       -4.594e+01  1.219e+01  -3.770 0.000164 ***
## OLDCLAIM                  -9.225e-03  7.438e-03  -1.240 0.214960    
## CLM_FREQ                   1.355e+02  5.501e+01   2.463 0.013817 *  
## MVR_PTS                    1.709e+02  2.601e+01   6.571 5.31e-11 ***
## CAR_AGE                   -1.834e+01  2.537e+01  -0.723 0.469755    
## CAR_AGE_RANGE_1_YR         1.012e+02  3.301e+02   0.306 0.759281    
## CAR_AGE_RANGE_2_3_YRS      1.141e+02  1.345e+03   0.085 0.932414    
## CAR_AGE_RANGE_3_5_YRS      5.458e+02  4.109e+02   1.328 0.184095    
## CAR_AGE_RANGE_5_10_YRS     1.414e+02  1.947e+02   0.726 0.467583    
## CAR_AGE_RANGE_10_YRS_PLUS         NA         NA      NA       NA    
## MAIN_DRIVING_CITY          1.648e+03  1.395e+02  11.814  < 2e-16 ***
## RED_CAR                   -9.555e+01  1.500e+02  -0.637 0.524099    
## EDU_HIGH_SCHOOL            1.876e+01  1.745e+02   0.107 0.914409    
## EDU_COLLEGE               -1.689e+02  1.595e+02  -1.059 0.289728    
## EDU_ADV_DEGREE             3.479e+02  2.214e+02   1.571 0.116152    
## VEHICLE_USE_COMMERCIAL     3.418e+01  4.212e+02   0.081 0.935311    
## VEHICLE_CLASS_TRUCK       -5.493e+02  2.544e+02  -2.159 0.030879 *  
## VEHICLE_CLASS_SUV         -4.781e+02  1.930e+02  -2.477 0.013264 *  
## VEHICLE_CLASS_CAR                 NA         NA      NA       NA    
## SPORTS_CAR                        NA         NA      NA       NA    
## RED_SPORTS_CAR             5.255e+02  8.741e+02   0.601 0.547742    
## TRUCK_COMM                 1.154e+03  4.520e+02   2.552 0.010725 *  
## SUV_COMM                   4.310e+02  4.325e+02   0.996 0.319086    
## CAR_COMM                          NA         NA      NA       NA    
## OCCUPATION_CLERICAL        5.393e+02  3.412e+02   1.580 0.114061    
## OCCUPATION_MANAGER        -4.740e+02  2.871e+02  -1.651 0.098811 .  
## OCCUPATION_BLUE_COLLAR     6.446e+02  3.254e+02   1.981 0.047600 *  
## OCCUPATION_GOLD_COLLAR     1.030e+02  2.762e+02   0.373 0.709163    
## OCCUPATION_STUDENT         2.742e+02  3.779e+02   0.726 0.468103    
## OCCUPATION_HOME_MAKER      2.587e+02  3.657e+02   0.707 0.479307    
## OCCUPATION_PROFESSIONAL    4.456e+02  3.067e+02   1.453 0.146280    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4541 on 8113 degrees of freedom
## Multiple R-squared:  0.07334,    Adjusted R-squared:  0.06809 
## F-statistic: 13.96 on 46 and 8113 DF,  p-value: < 2.2e-16

Model 2 - Stepwise Variable Selection Using Non-transformed Data

The second model uses a stepwise variable selection with linear regression. We chose to use both the “forward” and “backward” methods to obtain the optimal model based on the lowest AIC (Akaike Information Criterion). This modeling approach uses the non-transformed, recoded dataset.

null.model <- lm(TARGET_AMT ~ 1 , data= dev_train)  # base intercept only model
full.model <- lm(TARGET_AMT ~ . , data= dev_train)  # full model with all predictors

# perform step-wise algorithm

model2 <- step(null.model, scope = list(lower = null.model, upper = full.model), direction = "both", trace = 0, steps = 1000)

Model Summary

summary(model2)

## 
## Call:
## lm(formula = TARGET_AMT ~ BLUEBOOK + MARRIED + AGE_RANGE_16_19_YRS + 
##     HOME_VAL + INCOME + MALE, data = dev_train)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -9463  -3128  -1542    373  79362 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          4.402e+03  4.757e+02   9.252  < 2e-16 ***
## BLUEBOOK             1.173e-01  2.578e-02   4.551 5.78e-06 ***
## MARRIED             -1.423e+03  4.818e+02  -2.954  0.00319 ** 
## AGE_RANGE_16_19_YRS  6.676e+03  3.396e+03   1.966  0.04950 *  
## HOME_VAL             4.866e-03  2.306e-03   2.110  0.03502 *  
## INCOME              -9.953e-03  6.140e-03  -1.621  0.10521    
## MALE                 5.678e+02  3.945e+02   1.439  0.15030    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7569 on 1502 degrees of freedom
## Multiple R-squared:  0.02624,    Adjusted R-squared:  0.02235 
## F-statistic: 6.746 on 6 and 1502 DF,  p-value: 4.565e-07

AIC (Akaike Information Criterion) for Model 2 = 3.124751210^{4}
BIC (Bayesian Information Criterion) for Model 2 = 3.129006610^{4}

Most Significant Variables

Variable	Coefficient
BLUEBOOK	0.1173009
MARRIED	-1423.1074373
AGE_RANGE_16_19_YRS	6676.3098861
HOME_VAL	0.0048658
INCOME	-0.0099529
MALE	567.7960181

Variable Importance

Interpretation. Five predictor variables are among the most significant in this model - BLUEBOOk, MARRIED, AGE_RANGE_16_19_YRS, HOME_VAL, INCOME, and MALE. Of these, BLUEBOOK, MARRIED, AGE_RANGE_16_19_YRS, and HOME_VAL are the most statistically significant with a p-value less than a signficance value of 0.05. The predictors INCOME and MALE were not statistically significant, although it’s interesting to note that there may be a potential relationship between male drivers and a higher payout amount.

BLUEBOOK value intuitively makes sense as a significant and important predictor in the payout amount; however, the corresponding coefficient of nearly 0 is not intuitive. We see that the predictor MARRIED is associated with a negative coefficient of -1423. This indicating that married drivers tend to have lower payout amounts as compared to single drivers, potentially a strong indicator of safer driving.

Conversely, the predictor AGE_RANGE_16_19_YRS is significant with a positive coefficient of 6,676 indicating that drivers between the ages of 16 to 19 are a less safe group and are associated with higher claim payout amounts.

The Adjusted R-squared value associated with this model is very low at 0.2235. There is a significant amount of variability unaccounted for by this model in the response variable. The F-statistic or F- Test for overall significance does show significance.

Diagnostic Plots

The diagnostics plots below are examined to test the assumptions of linear regression.

The diagnostic plots reveal some potential issues with this model. The Residuals vs. Fitted plot shows a downward trend – as the fitted values increase on the x-axis, the residuals decrease. We would expect to see a flat line if there is homoscedasticity (residuals of equal variance). Heteroscedascity can also seen in the Scale-Location plot. Again we would expect to see a relatively flat trend compared to the updward trend of the red line.

Heteroscedasticity can be confirmed statistically using the NCV test:

car::ncvTest(model2)

## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 465.0705    Df = 1     p = 3.7886e-103

The p-value less than a signficance value of 0.05 confirms that there is definitely a pattern in the residuals (heteroscedasticity).

The Normal Q-Q plot shows a issue with the requirement of normal distibution of residuals. We see a step increase approaching the second quantile confirming violations of normality, which could affect the coefficients of the model. This plus the heteroscedasticity are strong indicators that the a transformation may be required.

Multicollinearity does not appear to be problem.

car::vif(model2)

##            BLUEBOOK             MARRIED AGE_RANGE_16_19_YRS 
##            1.257359            1.528386            1.003361 
##            HOME_VAL              INCOME                MALE 
##            2.042174            1.859690            1.013293

Additionally, this model is impacted by outliers as shown by Cook’s distance and the leverage plots.

car::outlierTest(model2)

##       rstudent unadjusted p-value Bonferonni p
## 1430 10.912140         9.8766e-27   1.4904e-23
## 2063  9.433105         1.4572e-20   2.1990e-17
## 640   9.091363         3.0107e-19   4.5431e-16
## 1552  6.993113         4.0340e-12   6.0873e-09
## 1357  6.956691         5.1846e-12   7.8235e-09
## 1137  6.955736         5.2187e-12   7.8751e-09
## 251   6.305652         3.7657e-10   5.6825e-07
## 1639  6.272875         4.6258e-10   6.9803e-07
## 1200  6.084804         1.4778e-09   2.2300e-06
## 2133  5.947021         3.3912e-09   5.1172e-06

plot(cooks.distance(model2), pch=23, bg='orange', cex=2, ylab="Cook's distance")

Removal of these two outliers did not change the model based.

dev_train_upd <- dev_train[which(cooks.distance(model2) < 0.1),]



#dev_train[which(cooks.distance(model2)==2038)]


mod2 <- update(model2,data=dev_train_upd)

autoplot(mod2, which = 1:6, colour = 'dodgerblue3',
         smooth.colour = 'red', smooth.linetype = 'dashed',
         ad.colour = 'black',
         label.size = 3, label.n = 5, label.colour = 'blue',
         ncol = 3)

gvlma(mod2)

## 
## Call:
## lm(formula = TARGET_AMT ~ BLUEBOOK + MARRIED + AGE_RANGE_16_19_YRS + 
##     HOME_VAL + INCOME + MALE, data = dev_train_upd)
## 
## Coefficients:
##         (Intercept)             BLUEBOOK              MARRIED  
##           4.566e+03            1.027e-01           -1.088e+03  
## AGE_RANGE_16_19_YRS             HOME_VAL               INCOME  
##          -1.245e+03            3.222e-03           -7.990e-03  
##                MALE  
##           4.131e+02  
## 
## 
## ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
## USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
## Level of Significance =  0.05 
## 
## Call:
##  gvlma(x = mod2) 
## 
##                        Value p-value                   Decision
## Global Stat        6.244e+04  0.0000 Assumptions NOT satisfied!
## Skewness           5.817e+03  0.0000 Assumptions NOT satisfied!
## Kurtosis           5.663e+04  0.0000 Assumptions NOT satisfied!
## Link Function      1.269e+00  0.2600    Assumptions acceptable.
## Heteroscedasticity 3.011e-01  0.5832    Assumptions acceptable.

Removal of the outliers does not improve the model 2.

Transform Multiple Regression Model - Natural Log

Data Preparation

Log transformation was used to transform the imputed data set. Trasformation was applied to the response variable (TARGET_AMT). To adopt the dataset to the requirements of log transformation all zero value observations were removed. The decsion to remove zero values from the response variable instead of adding a value to prevent any data distortion.

Several recoded predictors that were not significant to the model were removemoved before transformation. The remaining dataset contained 2,152 observations.

Model 3 - Stepwise Variable Selection using Transformed Data

Model 1

Model one is a baseline model of the transformed response variale against all predictors. The summary of the regression model shows F=1.56 with a p-value=0.017, indicating that we accept the null hypothesis that the predictors collectively have a significant effect on the amount paid when there is a crash.

Model Diagnosis

## 
## Call:
## lm(formula = TARGET_AMT ~ ., data = insurance_trainT)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.6571 -0.4100  0.0379  0.4113  3.2160 
## 
## Coefficients: (2 not defined because of singularities)
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              8.136e+00  1.935e-01  42.037  < 2e-16 ***
## KIDSDRIV                -3.274e-02  3.321e-02  -0.986   0.3243    
## MALE                     4.165e-02  5.645e-02   0.738   0.4607    
## MARRIED                 -9.115e-02  5.310e-02  -1.716   0.0862 .  
## SINGLE_PARENT            2.219e-02  6.180e-02   0.359   0.7196    
## LICENSE_REVOKED         -9.607e-02  5.449e-02  -1.763   0.0781 .  
## AGE                      2.881e-03  2.261e-03   1.274   0.2028    
## INEXP_DRIVER             2.065e-01  2.559e-01   0.807   0.4198    
## HOMEKIDS                 2.569e-02  2.183e-02   1.177   0.2394    
## YOJ                     -2.758e-03  5.102e-03  -0.541   0.5888    
## INCOME                  -9.885e-07  6.913e-07  -1.430   0.1529    
## HOME_VAL                 1.211e-07  2.159e-07   0.561   0.5750    
## TRAVTIME                -1.526e-04  1.165e-03  -0.131   0.8958    
## BLUEBOOK                 1.041e-05  2.513e-06   4.144 3.54e-05 ***
## TIF                     -2.096e-03  4.476e-03  -0.468   0.6396    
## OLDCLAIM                 4.476e-06  2.381e-06   1.880   0.0603 .  
## CLM_FREQ                -3.460e-02  1.655e-02  -2.090   0.0367 *  
## MVR_PTS                  1.421e-02  7.219e-03   1.969   0.0491 *  
## CAR_AGE                 -2.109e-03  3.974e-03  -0.531   0.5956    
## MAIN_DRIVING_CITY        5.948e-02  7.963e-02   0.747   0.4552    
## RED_CAR                  2.519e-03  5.297e-02   0.048   0.9621    
## EDU_HIGH_SCHOOL         -1.443e-03  5.320e-02  -0.027   0.9784    
## VEHICLE_CLASS_TRUCK     -4.193e-03  9.424e-02  -0.044   0.9645    
## VEHICLE_CLASS_SUV       -9.073e-03  6.334e-02  -0.143   0.8861    
## VEHICLE_CLASS_CAR               NA         NA      NA       NA    
## SPORTS_CAR                      NA         NA      NA       NA    
## RED_SPORTS_CAR           2.393e-01  2.385e-01   1.003   0.3159    
## TRUCK_COMM               1.852e-02  7.798e-02   0.237   0.8123    
## SUV_COMM                 4.060e-02  7.055e-02   0.575   0.5650    
## CAR_COMM                -1.142e-01  1.199e-01  -0.952   0.3410    
## OCCUPATION_CLERICAL     -1.000e-01  1.020e-01  -0.981   0.3268    
## OCCUPATION_MANAGER      -7.976e-02  1.035e-01  -0.771   0.4410    
## OCCUPATION_BLUE_COLLAR  -1.016e-01  9.279e-02  -1.095   0.2735    
## OCCUPATION_GOLD_COLLAR  -2.026e-02  1.023e-01  -0.198   0.8429    
## OCCUPATION_STUDENT      -1.431e-01  1.150e-01  -1.244   0.2135    
## OCCUPATION_HOME_MAKER   -1.825e-01  1.208e-01  -1.511   0.1310    
## OCCUPATION_PROFESSIONAL -5.693e-02  9.451e-02  -0.602   0.5470    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8088 on 2117 degrees of freedom
## Multiple R-squared:  0.02457,    Adjusted R-squared:  0.008905 
## F-statistic: 1.568 on 34 and 2117 DF,  p-value: 0.0197

Step Multiple Regression - Backward

The output of the Backward AIC Step regression shows that most of the predictors do not have significance to the total amount paid when there is a crash.

## Stepwise Model Path 
## Analysis of Deviance Table
## 
## Initial Model:
## TARGET_AMT ~ KIDSDRIV + MALE + MARRIED + SINGLE_PARENT + LICENSE_REVOKED + 
##     AGE + INEXP_DRIVER + HOMEKIDS + YOJ + INCOME + HOME_VAL + 
##     TRAVTIME + BLUEBOOK + TIF + OLDCLAIM + CLM_FREQ + MVR_PTS + 
##     CAR_AGE + MAIN_DRIVING_CITY + RED_CAR + EDU_HIGH_SCHOOL + 
##     VEHICLE_CLASS_TRUCK + VEHICLE_CLASS_SUV + VEHICLE_CLASS_CAR + 
##     SPORTS_CAR + RED_SPORTS_CAR + TRUCK_COMM + SUV_COMM + CAR_COMM + 
##     OCCUPATION_CLERICAL + OCCUPATION_MANAGER + OCCUPATION_BLUE_COLLAR + 
##     OCCUPATION_GOLD_COLLAR + OCCUPATION_STUDENT + OCCUPATION_HOME_MAKER + 
##     OCCUPATION_PROFESSIONAL
## 
## Final Model:
## TARGET_AMT ~ MALE + MARRIED + LICENSE_REVOKED + BLUEBOOK + OLDCLAIM + 
##     CLM_FREQ + MVR_PTS
## 
## 
##                         Step Df    Deviance Resid. Df Resid. Dev       AIC
## 1                                                2117   1385.015 -878.3575
## 2               - SPORTS_CAR  0 0.000000000      2117   1385.015 -878.3575
## 3        - VEHICLE_CLASS_CAR  0 0.000000000      2117   1385.015 -878.3575
## 4          - EDU_HIGH_SCHOOL  1 0.000481687      2118   1385.016 -880.3568
## 5      - VEHICLE_CLASS_TRUCK  1 0.001274141      2119   1385.017 -882.3548
## 6                  - RED_CAR  1 0.001444202      2120   1385.018 -884.3525
## 7                 - TRAVTIME  1 0.011190859      2121   1385.030 -886.3352
## 8        - VEHICLE_CLASS_SUV  1 0.014241661      2122   1385.044 -888.3130
## 9   - OCCUPATION_GOLD_COLLAR  1 0.026545891      2123   1385.070 -890.2718
## 10           - SINGLE_PARENT  1 0.085063039      2124   1385.155 -892.1396
## 11              - TRUCK_COMM  1 0.142827618      2125   1385.298 -893.9177
## 12                     - TIF  1 0.127934365      2126   1385.426 -895.7190
## 13                - SUV_COMM  1 0.144193022      2127   1385.570 -897.4950
## 14                 - CAR_AGE  1 0.182760271      2128   1385.753 -899.2112
## 15 - OCCUPATION_PROFESSIONAL  1 0.138291667      2129   1385.891 -900.9965
## 16                     - YOJ  1 0.208594411      2130   1386.100 -902.6726
## 17                - HOME_VAL  1 0.207081511      2131   1386.307 -904.3511
## 18      - OCCUPATION_MANAGER  1 0.245548632      2132   1386.553 -905.9700
## 19  - OCCUPATION_BLUE_COLLAR  1 0.304889058      2133   1386.857 -907.4968
## 20     - OCCUPATION_CLERICAL  1 0.153094694      2134   1387.011 -909.2593
## 21      - OCCUPATION_STUDENT  1 0.150664774      2135   1387.161 -911.0255
## 22       - MAIN_DRIVING_CITY  1 0.422870406      2136   1387.584 -912.3696
## 23            - INEXP_DRIVER  1 0.387751137      2137   1387.972 -913.7683
## 24                  - INCOME  1 0.579689637      2138   1388.552 -914.8697
## 25   - OCCUPATION_HOME_MAKER  1 0.360993999      2139   1388.913 -916.3103
## 26                - KIDSDRIV  1 0.678387456      2140   1389.591 -917.2595
## 27                     - AGE  1 0.428542625      2141   1390.019 -918.5959
## 28                - HOMEKIDS  1 0.245897861      2142   1390.265 -920.2152
## 29          - RED_SPORTS_CAR  1 0.981223160      2143   1391.247 -920.6969
## 30                - CAR_COMM  1 0.906329217      2144   1392.153 -921.2955

Step Multiple Regression - Forward

## Stepwise Model Path 
## Analysis of Deviance Table
## 
## Initial Model:
## TARGET_AMT ~ KIDSDRIV + MALE + MARRIED + SINGLE_PARENT + LICENSE_REVOKED + 
##     AGE + INEXP_DRIVER + HOMEKIDS + YOJ + INCOME + HOME_VAL + 
##     TRAVTIME + BLUEBOOK + TIF + OLDCLAIM + CLM_FREQ + MVR_PTS + 
##     CAR_AGE + MAIN_DRIVING_CITY + RED_CAR + EDU_HIGH_SCHOOL + 
##     VEHICLE_CLASS_TRUCK + VEHICLE_CLASS_SUV + VEHICLE_CLASS_CAR + 
##     SPORTS_CAR + RED_SPORTS_CAR + TRUCK_COMM + SUV_COMM + CAR_COMM + 
##     OCCUPATION_CLERICAL + OCCUPATION_MANAGER + OCCUPATION_BLUE_COLLAR + 
##     OCCUPATION_GOLD_COLLAR + OCCUPATION_STUDENT + OCCUPATION_HOME_MAKER + 
##     OCCUPATION_PROFESSIONAL
## 
## Final Model:
## TARGET_AMT ~ KIDSDRIV + MALE + MARRIED + SINGLE_PARENT + LICENSE_REVOKED + 
##     AGE + INEXP_DRIVER + HOMEKIDS + YOJ + INCOME + HOME_VAL + 
##     TRAVTIME + BLUEBOOK + TIF + OLDCLAIM + CLM_FREQ + MVR_PTS + 
##     CAR_AGE + MAIN_DRIVING_CITY + RED_CAR + EDU_HIGH_SCHOOL + 
##     VEHICLE_CLASS_TRUCK + VEHICLE_CLASS_SUV + VEHICLE_CLASS_CAR + 
##     SPORTS_CAR + RED_SPORTS_CAR + TRUCK_COMM + SUV_COMM + CAR_COMM + 
##     OCCUPATION_CLERICAL + OCCUPATION_MANAGER + OCCUPATION_BLUE_COLLAR + 
##     OCCUPATION_GOLD_COLLAR + OCCUPATION_STUDENT + OCCUPATION_HOME_MAKER + 
##     OCCUPATION_PROFESSIONAL
## 
## 
##   Step Df Deviance Resid. Df Resid. Dev       AIC
## 1                       2117   1385.015 -878.3575

Model 4 - Robust Regression

library(ridge)

## Warning: package 'ridge' was built under R version 3.4.1

## currently trying Ridge regression; need to add rlm code

linRidgeMod <- linearRidge(TARGET_AMT ~ ., data = dev_train)

predicted <- predict(linRidgeMod, dev_test)  # predict on test data
compare <- cbind (actual=dev_test$TARGET_AMT, predicted)  # combine

mean (apply(compare, 1, min)/apply(compare, 1, max))

## [1] 0.6023915

Group#2 Homework #4

Group 2

4/8/2018

Introduction

Objective

Approach

Data Exploration and Data Preparation

Model Building and Selection

Team Members

Dataset

Data Exploration and Statistic Measures

Missing Values

Variable to Variable Analysis

Descriptive Statistics

Correlation Analysis

Correlation with Outcome Variable - TARGET_FLAG

Correlation with Outcome Variable - TARGET_AMT

Analysis of predictors

KIDSDRIV

Extreme Observations

AGE

Extreme Observations

BLUEBOOK

Extreme Observations

CAR_AGE

Extreme Observations

CAR_TYPE

CAR_USE

CLM_FREQ

Extreme Observations

EDUCATION

HOMEKIDS

Extreme Observations

HOME_VAL

Extreme Observations

INCOME

Extreme Observations

JOB

MSTATUS

MVR_PTS

Extreme Observations

OLDCLAIM

Extreme Observations

PARENT1

RED_CAR

TIF

Extreme Observations

TRAVTIME

Extreme Observations

YOJ

Extreme Observations

Multicollinearity

Data Preparation

Missing Values

Assumptions of Missing Values

Recoding of Predictor Variables

Transform data - Logistic Regression Dataset

Imput Recoded dataset

BUILD MODELS

Model 1 Multiple Regression - Baseline Model Non-transformed variables

Model Diagnosis

Residual vs Fitted Plot

Scale-Location Plot

Normal Q-Q Plot

Residual vs. Leverage Plot

Model 2 - Stepwise Variable Selection Using Non-transformed Data

Model Summary

Most Significant Variables

Variable Importance

Diagnostic Plots

Transform Multiple Regression Model - Natural Log

Data Preparation

Model 3 - Stepwise Variable Selection using Transformed Data

Model 1

Model Diagnosis

Step Multiple Regression - Backward

Step Multiple Regression - Forward

Model 4 - Robust Regression