The purpose of this work is to estimate the probability that an individual (seeking auto insurance) will be in an accident and then to forecast the potential cost of an ensuing claim. A synthetic data set of ~ 8000 observations will be used to construct predictive models for this purpose. The use of synthetic data permits model development absent proprietary information, while also providing a heuristic for quantitative reasoning. Further discussion regarding the dataset can be found in Section 7.
The dataset includes two response variables. The first, ‘target_flag’, is scored as 1 or 0, indicating that an individual has or has not been in an accident, respectively. On this basis, a score of 1 is considered a ‘positive’ result and 0 a ‘negative’ result. The second response variable, ‘target_amt’, refers to the amount paid on an individual’s claim given that an accident occurred. It is a numerical variable ranging from $0 (no accident/claim) upward.
A binary logistic classification model will be constructed to estimate the probability of class membership (1 or 0); log-normal and gamma regression models will be employed to predict payout cost in the event of an accident. This statistical approach builds on classical methods for model training/testing (which underlie more modern, machine learning, strategies for classification and prediction) and invites more explicit consideration of model assumptions and diagnostics.
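For context, the two models can later be combined to estimate an expected claim cost per policyholder. The sketch below is illustrative only; ‘freq_fit’, ‘sev_fit’, and ‘new_clients’ are hypothetical placeholders for a fitted classification model, a fitted severity model, and a data frame of new applicants.
# Illustrative only: combine a fitted frequency (logistic) model with a fitted
# severity model (e.g., gamma with log link) to estimate expected claim cost.
p_accident <- predict(freq_fit, newdata = new_clients, type = "response")          # P(crash)
cost_given_accident <- predict(sev_fit, newdata = new_clients, type = "response")  # E(cost | crash)
expected_cost <- p_accident * cost_given_accident                                  # expected payout per applicant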
The data include the following variables:
Variable | Description |
---|---|
TARGET_FLAG | Was car in crash? 1=Yes, 0=No |
TARGET_AMT | If crash, what was the cost |
AGE | Age of driver |
BLUEBOOK | Value of vehicle |
CAR_AGE | Age of car |
CAR_TYPE | Type of car |
CAR_USE | Vehicle use |
CLM_FREQ | # Claims (past 5 yrs) |
EDUCATION | Max educational level |
HOMEKIDS | # Children at home |
HOME_VAL | Home value |
INCOME | Annual income |
JOB | Job category |
KIDSDRIV | # Driving children |
MSTATUS | Marital status |
MVR_PTS | Motor vehicle record points |
OLDCLAIM | Total claims (past 5 yrs) |
PARENT1 | Single parent |
RED_CAR | A red car |
REVOKED | License revoked (past 7 yrs) |
SEX | Gender |
TIF | Time as customer |
TRAVTIME | Distance to work |
URBANICITY | Home/work area |
YOJ | Years on job |
The project workflow will be represented and discussed following the approach of Wickham and Grolemund (2017):
Project Workflow
# Import libraries
library(tidyverse) # cleaning & wrangling functions
library(janitor) # data cleaning
library(magrittr) # piping
library(flextable) #table layout and formatting
library(dlookr) # exploratory data analysis functions
library(patchwork) # easy plot layout
library(ggpubr) # creates a wrapper for plotting a list
library(viridis) #coloration for box plots
library(broom) # creates a tidy data frame from statistical test results
library(InformationValue) # optimize threshold
library(caret) # modeling
library(ROCR) #roc curve
library(car) # marginal plots
library(MASS) # use for gamma shape function
# Read in data
<- "https://raw.githubusercontent.com/sconnin/insurance_model/main/insurance_dataset.csv"
path
<- read_csv(path) raw
Data processing is required to put variables in tidy form for model building as well as to support initial exploration and visualization. Additional steps are taken to facilitate interpretation of feature variables.
The following processing steps are included here:
# Copy data into working dataframe for downstream use
df <- raw %>%
  clean_names() %>%                                                       # initial clean of col names
  remove_empty(c("rows", "cols")) %>%                                     # remove any empty rows and cols
  distinct() %>%                                                          # remove duplicate rows
  mutate_if(is_character, str_replace_all, '\\$|,|z_|<', '') %>%          # clean any special chars in character variables
  dplyr::select(-index) %>%                                               # remove index
  mutate_if(is_character, str_replace_all, "No|no", 'N') %>%
  mutate_if(is_character, str_replace_all, "Yes|yes", 'Y') %>%
  mutate_if(is_character, str_replace_all, "Highly Urban/ Urban", 'Urban') %>%
  mutate_if(is_character, str_replace_all, "Highly Rural/ Rural", 'Rural') %>%
  mutate_at(vars(income, home_val, bluebook, oldclaim), funs(as.numeric)) %>%  # correct variable type: char to numeric
  mutate_if(is.numeric, round) %>%                                        # round out our numerics
  mutate_if(is_character, ~ factor(.)) %>%                                # convert all character variables to factor for modeling
  mutate(education = fct_relevel(education, c("High School", "Bachelors", "Masters", "PhD"))) %>%  # relevel to show educational attainment steps
  mutate(car_type = fct_relevel(car_type, c("Sports Car", "SUV", "Minivan", "Van", "Pickup", "Panel Truck")))
Recall that the variable target_flag indicates whether or not an individual was in an accident (1 = Yes, 0 = No).
target_flag (level = 0) includes 6008 observations and accounts for ~74% of the data.
target_flag (level = 1) includes 2153 observations and accounts for ~26% of the data.
# Subset levels of target_flag into new dataframes for analyses
target_0 <- df %>%
  dplyr::select(-target_amt) %>%
  filter(target_flag == 0)      # obs not associated with automobile accidents

target_1 <- df %>%
  dplyr::select(-target_amt) %>%
  filter(target_flag == 1)      # obs associated with automobile accidents

# Calculate the proportion of data in each subset of target_flag
df %>%
  group_by(target_flag) %>%
  summarise(n = n()) %>%
  mutate(freq = n / sum(n)) %>%
  flextable() %>%
  set_caption("Count and Frequency of Target Levels")
target_flag | n | freq |
0 | 6,008 | 0.7361843 |
1 | 2,153 | 0.2638157 |
Statistical summaries for the numerical covariates are presented below for each level of target_flag.
# Calculate statistics summary for numeric variables at each level of target_flag
target_0 %>%
  dplyr::select(-target_flag) %>%
  diagnose_numeric() %>%
  dplyr::select(variables, min, mean, median, max, zero, minus, outlier) %>%
  flextable() %>%
  set_caption("Summary Statistics for Target_Flag = 0")
variables | min | mean | median | max | zero | minus | outlier |
kidsdriv | 0 | 0.1393142 | 0 | 4 | 5,407 | 0 | 601 |
age | 16 | 45.3227901 | 46 | 81 | 0 | 0 | 47 |
homekids | 0 | 0.6439747 | 0 | 5 | 4,116 | 0 | 559 |
yoj | 0 | 10.6718337 | 12 | 23 | 379 | 0 | 398 |
income | 0 | 65,951.9700335 | 58,368 | 367,030 | 365 | 0 | 156 |
home_val | 0 | 169,075.4123566 | 175,052 | 885,282 | 1,448 | 0 | 5 |
travtime | 5 | 33.0251332 | 32 | 142 | 0 | 0 | 69 |
bluebook | 1,500 | 16,230.9487350 | 15,000 | 69,740 | 0 | 0 | 74 |
tif | 1 | 5.5557590 | 6 | 25 | 0 | 0 | 29 |
oldclaim | 0 | 3,311.5948735 | 0 | 53,986 | 4,111 | 0 | 622 |
clm_freq | 0 | 0.6486352 | 0 | 5 | 4,111 | 0 | 583 |
mvr_pts | 0 | 1.4137816 | 1 | 11 | 2,998 | 0 | 280 |
car_age | 0 | 8.6709220 | 9 | 28 | 3 | 0 | 2 |
target_1 %>%
  dplyr::select(-target_flag) %>%
  diagnose_numeric() %>%
  dplyr::select(variables, min, mean, median, max, zero, minus, outlier) %>%
  flextable() %>%
  set_caption("Summary Statistics for Target_Flag = 1")
variables | min | mean | median | max | zero | minus | outlier |
kidsdriv | 0 | 0.2596377 | 0 | 4 | 1,773 | 0 | 380 |
age | 16 | 43.3012104 | 43 | 76 | 0 | 0 | 7 |
homekids | 0 | 0.9368323 | 0 | 5 | 1,173 | 0 | 0 |
yoj | 0 | 10.0167488 | 11 | 19 | 246 | 0 | 250 |
income | 0 | 50,641.2980910 | 43,971 | 320,127 | 250 | 0 | 77 |
home_val | 0 | 115,256.5541339 | 114,565 | 750,455 | 846 | 0 | 11 |
travtime | 5 | 34.7710172 | 34 | 97 | 0 | 0 | 12 |
bluebook | 1,500 | 14,255.8987459 | 12,600 | 62,240 | 0 | 0 | 36 |
tif | 1 | 4.7807710 | 4 | 21 | 0 | 0 | 27 |
oldclaim | 0 | 6,061.5499303 | 2,448 | 57,037 | 898 | 0 | 233 |
clm_freq | 0 | 1.2169066 | 1 | 5 | 898 | 0 | 0 |
mvr_pts | 0 | 2.4816535 | 2 | 13 | 714 | 0 | 11 |
car_age | -3 | 7.3674789 | 7 | 25 | 0 | 1 | 0 |
# Calculate percentage of zero values in numerical covariates for each level of target_flag
select_col <- as_mapper(~ all(is.na(.) | . < 1))    # purrr mapper for possible downstream use

df %>%
  group_by(target_flag) %>%                          # note target must be numerical type for the following to run
  dplyr::select(where(is.numeric) & -target_amt) %>%
  summarise_each(~ round(sum(. == 0) / length(.) * 100)) %>%
  purrr::discard(select_col) %>%
  flextable() %>%
  set_caption("Percentage of Zeros in Numerical Covariates by Level of Target_Flag")
target_flag | kidsdriv | homekids | oldclaim | clm_freq | mvr_pts |
0 | 90 | 69 | 68 | 68 | 50 |
1 | 82 | 54 | 42 | 42 | 33 |
# Create statistical summaries for categorical variables at each level of target_flag
target_0 %>%
  diagnose_category() %>%
  flextable() %>%
  set_caption("Summary Statistics for Categorical Variables: Target Level = 0")
variables | levels | N | freq | ratio | rank |
parent1 | N | 6,008 | 5,407 | 89.996671 | 1 |
parent1 | Y | 6,008 | 601 | 10.003329 | 2 |
mstatus | Y | 6,008 | 3,841 | 63.931425 | 1 |
mstatus | N | 6,008 | 2,167 | 36.068575 | 2 |
sex | F | 6,008 | 3,183 | 52.979361 | 1 |
sex | M | 6,008 | 2,825 | 47.020639 | 2 |
education | High School | 6,008 | 2,355 | 39.197736 | 1 |
education | Bachelors | 6,008 | 1,719 | 28.611851 | 2 |
education | Masters | 6,008 | 1,331 | 22.153795 | 3 |
education | PhD | 6,008 | 603 | 10.036618 | 4 |
job | Blue Collar | 6,008 | 1,191 | 19.823569 | 1 |
job | Clerical | 6,008 | 900 | 14.980027 | 2 |
job | Professional | 6,008 | 870 | 14.480692 | 3 |
job | Manager | 6,008 | 851 | 14.164447 | 4 |
job | Lawyer | 6,008 | 682 | 11.351531 | 5 |
job | Home Maker | 6,008 | 461 | 7.673103 | 6 |
job | Student | 6,008 | 446 | 7.423435 | 7 |
job |  | 6,008 | 390 | 6.491345 | 8 |
job | Doctor | 6,008 | 217 | 3.611851 | 9 |
car_use | Private | 6,008 | 4,026 | 67.010652 | 1 |
car_use | Commercial | 6,008 | 1,982 | 32.989348 | 2 |
car_type | Minivan | 6,008 | 1,796 | 29.893475 | 1 |
car_type | SUV | 6,008 | 1,616 | 26.897470 | 2 |
car_type | Pickup | 6,008 | 946 | 15.745672 | 3 |
car_type | Sports Car | 6,008 | 603 | 10.036618 | 4 |
car_type | Van | 6,008 | 549 | 9.137816 | 5 |
car_type | Panel Truck | 6,008 | 498 | 8.288948 | 6 |
red_car | N | 6,008 | 4,246 | 70.672437 | 1 |
red_car | Y | 6,008 | 1,762 | 29.327563 | 2 |
revoked | N | 6,008 | 5,451 | 90.729028 | 1 |
revoked | Y | 6,008 | 557 | 9.270972 | 2 |
urbanicity | Urban | 6,008 | 4,454 | 74.134487 | 1 |
urbanicity | Rural | 6,008 | 1,554 | 25.865513 | 2 |
target_1 %>%
  diagnose_category() %>%
  flextable() %>%
  set_caption("Summary Statistics for Categorical Variables: Target Level = 1")
variables | levels | N | freq | ratio | rank |
parent1 | N | 2,153 | 1,677 | 77.891314 | 1 |
parent1 | Y | 2,153 | 476 | 22.108686 | 2 |
mstatus | N | 2,153 | 1,100 | 51.091500 | 1 |
mstatus | Y | 2,153 | 1,053 | 48.908500 | 2 |
sex | F | 2,153 | 1,192 | 55.364608 | 1 |
sex | M | 2,153 | 961 | 44.635392 | 2 |
education | High School | 2,153 | 1,178 | 54.714352 | 1 |
education | Bachelors | 2,153 | 523 | 24.291686 | 2 |
education | Masters | 2,153 | 327 | 15.188110 | 3 |
education | PhD | 2,153 | 125 | 5.805852 | 4 |
job | Blue Collar | 2,153 | 634 | 29.447283 | 1 |
job | Clerical | 2,153 | 371 | 17.231770 | 2 |
job | Student | 2,153 | 266 | 12.354854 | 3 |
job | Professional | 2,153 | 247 | 11.472364 | 4 |
job | Home Maker | 2,153 | 180 | 8.360427 | 5 |
job | Lawyer | 2,153 | 153 | 7.106363 | 6 |
job | Manager | 2,153 | 137 | 6.363214 | 7 |
job |  | 2,153 | 136 | 6.316767 | 8 |
job | Doctor | 2,153 | 29 | 1.346958 | 9 |
car_use | Private | 2,153 | 1,106 | 51.370181 | 1 |
car_use | Commercial | 2,153 | 1,047 | 48.629819 | 2 |
car_type | SUV | 2,153 | 678 | 31.490943 | 1 |
car_type | Pickup | 2,153 | 443 | 20.575941 | 2 |
car_type | Minivan | 2,153 | 349 | 16.209940 | 3 |
car_type | Sports Car | 2,153 | 304 | 14.119833 | 4 |
car_type | Van | 2,153 | 201 | 9.335810 | 5 |
car_type | Panel Truck | 2,153 | 178 | 8.267534 | 6 |
red_car | N | 2,153 | 1,537 | 71.388760 | 1 |
red_car | Y | 2,153 | 616 | 28.611240 | 2 |
revoked | N | 2,153 | 1,710 | 79.424059 | 1 |
revoked | Y | 2,153 | 443 | 20.575941 | 2 |
urbanicity | Urban | 2,153 | 2,038 | 94.658616 | 1 |
urbanicity | Rural | 2,153 | 115 | 5.341384 | 2 |
We set to missing (NA) the single observation where ‘car_age’ is < 0, given that a negative age is impossible. This appears to be a data-entry error, as there are no other negative values for this variable.
# Set the car_age data-entry error (obs. value = -3) to NA
df$car_age[df$car_age < 0] <- NA
Histograms and QQ-Plots indicate that distributions for ‘target_flag’, ‘kidsdriv’, ‘homekids’, ‘yoj’, ‘income’, ‘oldclaim’, ‘clm_freq’, and ‘mvr_pts’ depart from a normal model. This is confirmed using a Shapiro-Wilk test of normality for numerical values. Transformation of these variables may be desirable during model building. Log, square root, and polynomial transformations may be particularly helpful in this regard.
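As an illustration only (not part of the workflow below), candidate transformations can be previewed directly on the working dataframe; log1p() and sqrt() are shown for two of the right-skewed covariates, and the derived column names are hypothetical.
# Illustrative only: preview skew-reducing transformations for two covariates
df %>%
  mutate(log_income = log1p(income),          # log(1 + x) tolerates the many zero incomes
         sqrt_oldclaim = sqrt(oldclaim)) %>%
  dplyr::select(log_income, sqrt_oldclaim) %>%
  plot_normality()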
From an interpretive standpoint, it is worth noting the high counts of zero values for the following covariates: ‘tif’, ‘oldclaim’, ‘car_age’, ‘income’, and ‘home_val’, particularly the latter two. A zero value for ‘income’ may indicate that an individual is unemployed and/or a student/dependent. A zero value for ‘home_val’ may indicate that an individual is a renter, a dependent, or homeless.
These details can be important in establishing the costs and benefits of insuring a driver, and their absence (as in this dataset) can limit the discriminatory ability of a classification model.
# Evaluate and rank departures from normality for numerical vars - Shapiro-Wilk
df %>%
  as.data.frame() %>%
  dplyr::select(-target_amt, target_flag) %>%
  normality() %>%
  filter(p_value <= 0.01) %>%
  arrange(abs(p_value))
# Identify skewness in covariates across levels of target_flag
target_0 %>% find_skewness(index = FALSE, thres = TRUE)
## [1] "kidsdriv" "homekids" "yoj" "income" "oldclaim" "clm_freq" "mvr_pts"
target_1 %>% find_skewness(index = FALSE, thres = TRUE)
## [1] "kidsdriv" "yoj" "income" "tif" "oldclaim"
# Identify potential transformations to address skew; this does not resolve high 0 counts
df %>%
  dplyr::select(age, yoj, income, home_val, travtime, bluebook, tif, oldclaim, car_age) %>%
  plot_normality()
## Outliers
The variables kidsdriv, oldclaim, homekids, clm_freq, and yoj contain the highest percentages of outliers (~6-17%, depending on target_flag level). The percentage of outliers is generally greater for observations associated with accidents (target_flag = 1) than for those without (target_flag = 0).
The extent to which these and other outliers influence model parametrization is not yet known and should be evaluated as part of the model building process.
# Identify percentage of outliers for each covariate across levels of target_flag
diagnose_outlier(target_0) %>%
  filter(outliers_ratio > 5) %>%
  mutate(ratio_means = outliers_mean / with_mean) %>%   # mean val of outliers / total mean
  arrange(desc(ratio_means)) %>%
  dplyr::select(variables, outliers_ratio, ratio_means) %>%
  flextable() %>%
  set_caption("Outlier Summary: Target_Flag = 0")
variables | outliers_ratio | ratio_means |
kidsdriv | 10.003329 | 9.99667221 |
oldclaim | 10.352863 | 6.94574452 |
homekids | 9.304261 | 5.00025199 |
clm_freq | 9.703728 | 4.97416009 |
yoj | 6.624501 | 0.01742246 |
diagnose_outlier(target_1) %>%
  filter(outliers_ratio > 5) %>%
  mutate(ratio_means = outliers_mean / with_mean) %>%
  arrange(desc(ratio_means)) %>%
  dplyr::select(variables, outliers_ratio, ratio_means) %>%
  flextable() %>%
  set_caption("Outlier Summary: Target_Flag = 1")
variables | outliers_ratio | ratio_means |
kidsdriv | 17.64979 | 5.665789474 |
oldclaim | 10.82211 | 5.302250158 |
yoj | 11.61170 | 0.003194649 |
The dataset includes missing observations in one categorical variable (‘job’) and five numerical variables (‘car_age’, ‘home_val’, ‘yoj’, ‘income’, ‘age’). Across covariates (i.e., predictor variables), the extent of missingness is < 6.5% of the total observations. There are no obvious patterns in the missing data that warrant concern.
# Basic missing-data tables by target level
target_0 %>%
  diagnose() %>%
  dplyr::select(-unique_count, -unique_rate) %>%
  filter(missing_count > 0) %>%
  arrange(desc(missing_count)) %>%
  flextable() %>%
  set_caption("Missing Data Summary: Target Level = 0")
variables | types | missing_count | missing_percent |
job | factor | 390 | 6.49134487 |
car_age | numeric | 368 | 6.12516644 |
home_val | numeric | 343 | 5.70905459 |
income | numeric | 335 | 5.57589880 |
yoj | numeric | 331 | 5.50932091 |
age | numeric | 1 | 0.01664447 |
target_1 %>%
  diagnose() %>%
  dplyr::select(-unique_count, -unique_rate) %>%
  filter(missing_count > 0) %>%
  arrange(desc(missing_count)) %>%
  flextable() %>%
  set_caption("Missing Data Summary: Target Level = 1")
variables | types | missing_count | missing_percent |
car_age | numeric | 142 | 6.5954482 |
job | factor | 136 | 6.3167673 |
yoj | numeric | 123 | 5.7129587 |
home_val | numeric | 121 | 5.6200650 |
income | numeric | 110 | 5.1091500 |
age | numeric | 5 | 0.2322341 |
# Plot missing data to visualize cumulative percentages and col intersections
df %>%
  plot_na_pareto(only_na = TRUE)

df %>%
  plot_na_intersect()
## Impute Missing Values
Given the relatively limited extent of missing data, an argument can be made for imputing these values rather than removing the observations during model building. A variety of imputation methods are available for this purpose (e.g., mean/median/mode substitution, k-nearest neighbors, regression trees, and multivariate imputation by chained equations).
On the basis of initial experimentation, the following methods are selected here: MICE for ‘car_age’ and ‘job’, and recursive partitioning (rpart) for ‘home_val’, ‘yoj’, ‘income’, and ‘age’.
Comparison of density plots for the original and imputed values shows a good fit for each of the numerical covariates.
#Impute numerical vars using dlookr
#car_age
car_age <- imputate_na(df, car_age, target_flag, method = "mice", seed = 999)
##
## iter imp variable
## 1 1 age yoj income home_val job car_age
## 1 2 age yoj income home_val job car_age
## 1 3 age yoj income home_val job car_age
## 1 4 age yoj income home_val job car_age
## 1 5 age yoj income home_val job car_age
## 2 1 age yoj income home_val job car_age
## 2 2 age yoj income home_val job car_age
## 2 3 age yoj income home_val job car_age
## 2 4 age yoj income home_val job car_age
## 2 5 age yoj income home_val job car_age
## 3 1 age yoj income home_val job car_age
## 3 2 age yoj income home_val job car_age
## 3 3 age yoj income home_val job car_age
## 3 4 age yoj income home_val job car_age
## 3 5 age yoj income home_val job car_age
## 4 1 age yoj income home_val job car_age
## 4 2 age yoj income home_val job car_age
## 4 3 age yoj income home_val job car_age
## 4 4 age yoj income home_val job car_age
## 4 5 age yoj income home_val job car_age
## 5 1 age yoj income home_val job car_age
## 5 2 age yoj income home_val job car_age
## 5 3 age yoj income home_val job car_age
## 5 4 age yoj income home_val job car_age
## 5 5 age yoj income home_val job car_age
p1 <- plot(car_age) + theme_minimal() + theme(legend.position = "top")
summary(car_age)
## * Impute missing values based on Multivariate Imputation by Chained Equations
## - method : mice
## - random seed : 999
##
## * Information of Imputation (before vs after)
## Original Imputation
## n 7650.00000000 8161.00000000
## na 511.00000000 0.00000000
## mean 8.32980392 8.32782747
## sd 5.69964327 5.61856877
## se_mean 0.06516538 0.06219479
## IQR 11.00000000 9.00000000
## skewness 0.28251095 0.28530537
## kurtosis -0.74839431 -0.72313838
## p00 0.00000000 0.00000000
## p01 1.00000000 1.00000000
## p05 1.00000000 1.00000000
## p10 1.00000000 1.00000000
## p20 1.00000000 1.00000000
## p25 1.00000000 3.00000000
## p30 5.00000000 5.00000000
## p40 7.00000000 7.00000000
## p50 8.00000000 8.00000000
## p60 10.00000000 10.00000000
## p70 11.00000000 11.00000000
## p75 12.00000000 12.00000000
## p80 13.00000000 13.00000000
## p90 16.00000000 16.00000000
## p95 18.00000000 18.00000000
## p99 21.00000000 21.00000000
## p100 28.00000000 28.00000000
#home_val
home_val <- imputate_na(df, home_val, target_flag, method = "rpart")

p2 <- plot(home_val) + theme_minimal() + theme(legend.position = "top")
summary(home_val)
## * Impute missing values based on Recursive Partitioning and Regression Trees
## - method : rpart
##
## * Information of Imputation (before vs after)
## Original Imputation
## n 7.697000e+03 8.161000e+03
## na 4.640000e+02 0.000000e+00
## mean 1.548673e+05 1.549021e+05
## sd 1.291238e+05 1.276679e+05
## se_mean 1.471789e+03 1.413221e+03
## IQR 2.387240e+05 2.377673e+05
## skewness 4.887855e-01 5.011553e-01
## kurtosis -1.453837e-02 1.752216e-02
## p00 0.000000e+00 0.000000e+00
## p01 0.000000e+00 0.000000e+00
## p05 0.000000e+00 0.000000e+00
## p10 0.000000e+00 0.000000e+00
## p20 0.000000e+00 0.000000e+00
## p25 0.000000e+00 0.000000e+00
## p30 5.473000e+04 5.748428e+04
## p40 1.267360e+05 1.259510e+05
## p50 1.611600e+05 1.603330e+05
## p60 1.907740e+05 1.882360e+05
## p70 2.223548e+05 2.210090e+05
## p75 2.387240e+05 2.377673e+05
## p80 2.590012e+05 2.568280e+05
## p90 3.165426e+05 3.152530e+05
## p95 3.748710e+05 3.779250e+05
## p99 4.993365e+05 4.977460e+05
## p100 8.852820e+05 8.852820e+05
#yoj
yoj <- imputate_na(df, yoj, target_flag, method = "rpart")

p3 <- plot(yoj) + theme_minimal() + theme(legend.position = "top")
summary(yoj)
## * Impute missing values based on Recursive Partitioning and Regression Trees
## - method : rpart
##
## * Information of Imputation (before vs after)
## Original Imputation
## n 7707.00000000 8161.00000000
## na 454.00000000 0.00000000
## mean 10.49928636 10.50694662
## sd 4.09247418 4.04481883
## se_mean 0.04661689 0.04477415
## IQR 4.00000000 4.00000000
## skewness -1.20343604 -1.25339626
## kurtosis 1.17996900 1.33649457
## p00 0.00000000 0.00000000
## p01 0.00000000 0.00000000
## p05 0.00000000 0.00000000
## p10 5.00000000 5.00000000
## p20 8.00000000 8.00000000
## p25 9.00000000 9.00000000
## p30 10.00000000 10.00000000
## p40 11.00000000 11.00000000
## p50 11.00000000 11.00000000
## p60 12.00000000 12.00000000
## p70 13.00000000 13.00000000
## p75 13.00000000 13.00000000
## p80 14.00000000 13.00000000
## p90 15.00000000 14.00000000
## p95 15.00000000 15.00000000
## p99 17.00000000 17.00000000
## p100 23.00000000 23.00000000
# income
income <- imputate_na(df, income, target_flag, method = "rpart")

p4 <- plot(income) + theme_minimal() + theme(legend.position = "top")
summary(income)
## * Impute missing values based on Recursive Partitioning and Regression Trees
## - method : rpart
##
## * Information of Imputation (before vs after)
## Original Imputation
## n 7.716000e+03 8.161000e+03
## na 4.450000e+02 0.000000e+00
## mean 6.189809e+04 6.150153e+04
## sd 4.757268e+04 4.722375e+04
## se_mean 5.415786e+02 5.227437e+02
## IQR 5.788900e+04 5.696500e+04
## skewness 1.186778e+00 1.192689e+00
## kurtosis 2.132505e+00 2.154542e+00
## p00 0.000000e+00 0.000000e+00
## p01 0.000000e+00 0.000000e+00
## p05 0.000000e+00 0.000000e+00
## p10 4.380500e+03 5.427000e+03
## p20 2.234800e+04 2.227100e+04
## p25 2.809700e+04 2.830200e+04
## p30 3.330300e+04 3.370700e+04
## p40 4.367000e+04 4.338500e+04
## p50 5.402800e+04 5.315600e+04
## p60 6.526800e+04 6.507900e+04
## p70 7.794950e+04 7.783100e+04
## p75 8.598600e+04 8.526700e+04
## p80 9.553800e+04 9.471200e+04
## p90 1.231800e+05 1.220700e+05
## p95 1.522740e+05 1.512250e+05
## p99 2.155198e+05 2.143746e+05
## p100 3.670300e+05 3.670300e+05
#age
age <- imputate_na(df, age, method = "rpart")

p5 <- plot(age) + theme_minimal() + theme(legend.position = "top")
summary(age)
## * Impute missing values based on Recursive Partitioning and Regression Trees
## - method : rpart
##
## * Information of Imputation (before vs after)
## Original Imputation
## n 8155.00000000 8161.00000000
## na 6.00000000 0.00000000
## mean 44.79031269 44.78524399
## sd 8.62758946 8.62706878
## se_mean 0.09553830 0.09549741
## IQR 12.00000000 12.00000000
## skewness -0.02899962 -0.02782888
## kurtosis -0.06028252 -0.06119510
## p00 16.00000000 16.00000000
## p01 25.00000000 25.00000000
## p05 30.00000000 30.00000000
## p10 34.00000000 34.00000000
## p20 37.00000000 37.00000000
## p25 39.00000000 39.00000000
## p30 40.00000000 40.00000000
## p40 43.00000000 43.00000000
## p50 45.00000000 45.00000000
## p60 47.00000000 47.00000000
## p70 49.00000000 49.00000000
## p75 51.00000000 51.00000000
## p80 52.00000000 52.00000000
## p90 56.00000000 56.00000000
## p95 59.00000000 59.00000000
## p99 64.00000000 64.00000000
## p100 81.00000000 81.00000000
temp <- cbind(car_age, home_val, yoj, income, age)
temp %<>% as.data.frame()

df %<>%
  dplyr::select(!c(car_age, home_val, yoj, income, age)) %>%
  cbind(temp)

df %<>% mutate_if(is.numeric, round)

# print plots
(p1 | p2) / (p3 | p4) / p5
We can also impute missing values for categorical variables. In this case, the mice method is used to impute values for the variable, ‘job’.
job <- imputate_na(df, job, method = "mice", seed = 999)
##
## iter imp variable
## 1 1 job
## 1 2 job
## 1 3 job
## 1 4 job
## 1 5 job
## 2 1 job
## 2 2 job
## 2 3 job
## 2 4 job
## 2 5 job
## 3 1 job
## 3 2 job
## 3 3 job
## 3 4 job
## 3 5 job
## 4 1 job
## 4 2 job
## 4 3 job
## 4 4 job
## 4 5 job
## 5 1 job
## 5 2 job
## 5 3 job
## 5 4 job
## 5 5 job
plot(job)+theme_minimal()+theme(legend.position = "top")
# combine imputed job back into df
df <- df %>%
  dplyr::select(!job)

df <- cbind(df, job)

df$job <- factor(df$job)
Pairwise comparisons provide a method to identify collinear relationships between covariates. Covariates that display high collinearity (e.g., correlation coefficients < -0.50 or > 0.50) can reduce the precision of model coefficients and make them highly sensitive to small changes in the data, owing to the lack of independence among predictors.
To refine these analyses, pairwise comparisons are assessed following removal of 0 values. Our results indicate a very strong positive correlation between ‘income’ and ‘home_val’ (r ≈ 0.96), a moderate correlation between ‘homekids’ and ‘kidsdriv’ (r ≈ 0.62), and moderate correlations between ‘car_age’ and both ‘home_val’ and ‘income’ (r ≈ 0.53-0.54).
As expected, there is also a strong association between ‘income’ (numerical var.) and ‘job’ type (factor var.), based on a Kruskal-Wallis rank sum test.
Both ‘home_val’ and ‘income’ have similar levels of missing data. An argument can be made for dropping one or the other from the dataset in order to simplify the feature selection process and improve model precision. This may also be true for ‘income’ and ‘job’ type.
Associations between ‘home_val’ and ‘car_age’, and between ‘income’ and ‘car_age’, are less concerning given their lower correlation coefficients and higher variability. See the plots below.
Less clear is the association between ‘homekids’ and ‘kidsdriv’, given their narrow range of integer values.
We will revisit options for feature removal during the model building process.
# Assess collinearity in the absence of zero values in numerical cols
df %>%
  filter_if(is.numeric, all_vars((.) != 0)) %>%
  correlate() %>%
  filter(coef_corr > .5 | coef_corr < -.5) %>%   # set thresholds to limit output
  flextable() %>%
  set_caption("Pairwise Correlation: Numerical Covariates")
var1 | var2 | coef_corr |
homekids | kidsdriv | 0.6242945 |
kidsdriv | homekids | 0.6242945 |
home_val | car_age | 0.5442647 |
income | car_age | 0.5297101 |
car_age | home_val | 0.5442647 |
income | home_val | 0.9618320 |
car_age | income | 0.5297101 |
home_val | income | 0.9618320 |
# Plot relationship between numerical covariates (w/o zero values for clarity) beyond the threshold coef_corr
df %>%
  filter(income != 0 & home_val != 0) %>%
  target_by(income) %>%
  relate(home_val) %>%
  plot()

df %>%
  filter(homekids != 0 & kidsdriv != 0) %>%
  target_by(homekids) %>%
  relate(kidsdriv) %>%
  plot()

df %>%
  filter(home_val != 0 & car_age != 0) %>%
  target_by(home_val) %>%
  relate(car_age) %>%
  plot()

df %>%
  filter(income != 0 & car_age != 0) %>%
  target_by(income) %>%
  relate(car_age) %>%
  plot()

# Plot relationship between numerical and factor variables for select covariates
df %>%
  target_by(income) %>%
  relate(job) %>%
  plot()
df %>% kruskal.test(income ~ job)   # assess association with Kruskal-Wallis rank sum test where > 1 categorical level
##
## Kruskal-Wallis rank sum test
##
## data: .
## Kruskal-Wallis chi-squared = 118656, df = 24, p-value < 2.2e-16
We can visually assess relationships between categorical variables and our response variable, ‘target_flag’, using mosaic plots. In this case, a relationship is indicated by relative differences in variable levels across the two levels of ‘target_flag’.
Note that the width of each plot column is proportional to the count of observations for each level of the variable plotted on the horizontal axis, while the vertical length is proportional to the count of the second variable (i.e., ‘target_flag’) within each level of the first.
On this basis, ‘sex’ and ‘red_car’ show little variation across levels of ‘target_flag’. An argument can therefore be made for dropping them from our set of covariates in order to streamline the modeling process.
# Subset categorical variables into a df of factor vars
mosaic <- df %>%                                      # subset target_flag as factor
  mutate_at(vars(target_flag), funs(as.factor)) %>%
  dplyr::select(where(is.factor))

target <- all_of(names(mosaic)[1])                    # set name for target_flag
predictors <- all_of(names(mosaic)[2:11])

# generate mosaic plots with purrr and dlookr
(plots <- predictors %>% map(function(predictors) mosaic %>% target_by(target) %>% relate(predictors) %>% plot()))
Boxplots make an effective visual tool to assess relationships between ‘target_flag’ and our numerical covariates.
On this basis, it appears that ‘bluebook’, ‘tif’, ‘clm_freq’, ‘mvr_pts’, ‘car_age’, ‘home_val’, and ‘income’ vary by level of target_flag. The presence of outliers, however, argues for caution in this interpretation.
# Subset numerical variables
num_box <- select_if(df, is.numeric) %>%
  mutate_at(vars(target_flag), funs(as.factor)) %>%
  dplyr::select(-target_amt)

# Plot using boxplots
response = names(num_box)[1]            # target_flag (needs to be fct for these plots)
response = purrr::set_names(response)

explain <- names(num_box)[2:14]         # explanatory variables
explain = purrr::set_names(explain)

box_fun = function(x) {
  ggplot(num_box, aes_string(x = x, y = 'target_flag')) +
    geom_boxplot(aes(fill = target_flag, alpha = 0.4), outlier.color = 'red', show.legend = FALSE) +
    scale_fill_viridis(discrete = TRUE, option = "E") +
    coord_flip() +
    theme_classic()
}

box_plots <- map(explain, ~box_fun(.x))   # creates a list of plots using purrr

plot(reduce(box_plots, `+`))              # layout with patchwork
Before forming the training and testing datasets, we drop ‘home_val’ and ‘target_amt’. Variability explained by ‘home_val’ is largely captured by ‘income’, and ‘target_amt’ is not relevant to the classification task.
We also collapse the variable ‘job’ into two levels: ‘non-white-collar’ and ‘white-collar’. While this may result in a slight loss of discriminatory power, it lends itself to model simplification.
Rather than make additional a priori choices to drop other covariates, we will automate variable selection (via the Akaike Information Criterion) as part of the modeling process.
base_df <- df %>%
  dplyr::select(-home_val, -target_amt)   # variance in these fields explained by income

base_df %<>%
  mutate(job = fct_collapse(job,
                            not_white_collar = c("Student", "Blue Collar", "Home Maker", "Clerical", "Manager"),
                            white_collar = c("Doctor", "Lawyer", "Professional")))

# write.csv(base_df, 'base_df.csv', row.names = FALSE)
Prior to modeling, the dataset is randomly sampled to create a subset for model training (‘train_df’) and model evaluation (‘test_df’).
# base_df <- read.csv('base_df.csv')   # hold here to assist model dev

base_df %>%
  count(target_flag) %>%
  mutate(frequency = n / sum(n)) %>%
  flextable() %>%
  set_caption("Count and Frequency of Target Variable by Level")
target_flag | n | frequency |
0 | 6,008 | 0.7361843 |
1 | 2,153 | 0.2638157 |
set.seed(1235)

# Build train/test sets
trainIdx <- createDataPartition(base_df$target_flag, p = .75,
                                list = FALSE, times = 1)

train_df <- base_df[trainIdx, ]
test_df  <- base_df[-trainIdx, ]

# Check counts and frequencies of target_flag for train set and test set
train_df %>%
  count(target_flag) %>%
  mutate(frequency = n / sum(n)) %>%
  flextable() %>%
  set_caption("Count and Frequency of Target Variable: Training Data")
target_flag | n | frequency |
0 | 4,496 | 0.7345205 |
1 | 1,625 | 0.2654795 |
test_df %>%
  count(target_flag) %>%
  mutate(frequency = n / sum(n)) %>%
  flextable() %>%
  set_caption("Count and Frequency of Target Variable: Test Data")
target_flag | n | frequency |
0 | 1,512 | 0.7411765 |
1 | 528 | 0.2588235 |
Selection of a logistic classifier will be based on a comparison of four models: a null (intercept-only) model, a full model containing all covariates, a reduced model obtained via AIC-based stepwise selection, and a transformation model that adds transformed covariates to the reduced specification.
We will assume that the null and full models are valid for the purposes of model comparison. Validity will be inferred from model diagnostics for the reduced and transformed models.
To be considered valid, a binary logistic model should meet the following conditions: the response is binary, each numerical covariate is linearly related to the logit of the response, the covariates are not strongly multicollinear, the observations are independent, and there are no highly influential outliers.
# build and summarize null model
null <- glm(target_flag ~ 1, family = binomial, train_df)
summary(null)
##
## Call:
## glm(formula = target_flag ~ 1, family = binomial, data = train_df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.7855 -0.7855 -0.7855 1.6286 1.6286
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.01768 0.02894 -35.16 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 7084.6 on 6120 degrees of freedom
## Residual deviance: 7084.6 on 6120 degrees of freedom
## AIC: 7086.6
##
## Number of Fisher Scoring iterations: 4
null_stats <- glance(null)   # collect stats for model comparisons

# calculate AUC for model selection process - see 3.3.1 for discussion on AUC
null_predicted <- predict(null, train_df, type = 'response')
null_actuals <- as.numeric(as.character(train_df$target_flag))

AUROC(null_actuals, null_predicted)
## [1] 0.5
Several terms included in the full model are not significant - ‘homekids’, ‘sex’, ‘red_car’, ‘car_age’, ‘age’.
full <- glm(target_flag ~ ., family = binomial, train_df)
summary(full)
##
## Call:
## glm(formula = target_flag ~ ., family = binomial, data = train_df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2763 -0.7277 -0.4188 0.6343 3.1369
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.181e+00 2.886e-01 -4.091 4.30e-05 ***
## kidsdriv 3.674e-01 6.909e-02 5.317 1.05e-07 ***
## homekids 4.955e-02 4.243e-02 1.168 0.242788
## parent1Y 3.938e-01 1.244e-01 3.164 0.001556 **
## mstatusY -6.472e-01 8.341e-02 -7.759 8.54e-15 ***
## sexM 8.418e-02 1.259e-01 0.669 0.503804
## educationBachelors -5.294e-01 9.781e-02 -5.412 6.23e-08 ***
## educationMasters -4.499e-01 1.432e-01 -3.142 0.001676 **
## educationPhD -5.270e-01 1.808e-01 -2.915 0.003562 **
## travtime 1.500e-02 2.146e-03 6.990 2.76e-12 ***
## car_usePrivate -8.287e-01 8.478e-02 -9.774 < 2e-16 ***
## bluebook -2.413e-05 6.085e-06 -3.966 7.30e-05 ***
## tif -5.849e-02 8.500e-03 -6.880 5.97e-12 ***
## car_typeSUV -3.166e-01 1.125e-01 -2.815 0.004878 **
## car_typePickup -5.356e-01 1.522e-01 -3.519 0.000433 ***
## car_typePanel Truck -4.840e-01 2.361e-01 -2.050 0.040381 *
## car_typeMinivan -9.649e-01 1.471e-01 -6.557 5.48e-11 ***
## car_typeVan -4.397e-01 1.955e-01 -2.250 0.024460 *
## red_carY -8.900e-02 9.798e-02 -0.908 0.363671
## oldclaim -1.397e-05 4.536e-06 -3.079 0.002076 **
## clm_freq 1.808e-01 3.269e-02 5.531 3.18e-08 ***
## revokedY 8.662e-01 1.055e-01 8.211 < 2e-16 ***
## mvr_pts 1.220e-01 1.554e-02 7.849 4.19e-15 ***
## urbanicityUrban 2.317e+00 1.289e-01 17.974 < 2e-16 ***
## car_age -5.430e-03 8.827e-03 -0.615 0.538447
## yoj -1.642e-02 8.861e-03 -1.854 0.063810 .
## income -6.124e-06 1.082e-06 -5.657 1.54e-08 ***
## age -3.958e-03 4.550e-03 -0.870 0.384347
## jobwhite_collar 9.978e-02 9.292e-02 1.074 0.282927
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 7084.6 on 6120 degrees of freedom
## Residual deviance: 5566.7 on 6092 degrees of freedom
## AIC: 5624.7
##
## Number of Fisher Scoring iterations: 5
full_stats <- glance(full)   # collect stat metrics for model comparisons

# calculate AUC for model selection process - see 3.3.1 for discussion on AUC
full_predicted <- predict(full, train_df, type = 'response')
full_actuals <- as.numeric(as.character(train_df$target_flag))

AUROC(full_actuals, full_predicted)
## [1] 0.8060705
The reduced model will be constructed using stepwise selection based on the Akaike Information Criterion (AIC) to remove covariates that do not improve model fit.
#Build reduced model
reduced <- glm(target_flag ~ ., family = 'binomial', train_df)

reduced <- step(reduced, trace = 0)   # Akaike stepwise covariate selection; trace = 0 prevents intermediate printing
summary(reduced)
##
## Call:
## glm(formula = target_flag ~ kidsdriv + homekids + parent1 + mstatus +
## education + travtime + car_use + bluebook + tif + car_type +
## oldclaim + clm_freq + revoked + mvr_pts + urbanicity + yoj +
## income, family = "binomial", data = train_df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2627 -0.7316 -0.4190 0.6367 3.1307
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.394e+00 2.081e-01 -6.699 2.10e-11 ***
## kidsdriv 3.576e-01 6.798e-02 5.261 1.44e-07 ***
## homekids 5.948e-02 3.933e-02 1.512 0.130444
## parent1Y 4.094e-01 1.236e-01 3.313 0.000924 ***
## mstatusY -6.469e-01 8.329e-02 -7.767 8.04e-15 ***
## educationBachelors -5.394e-01 8.635e-02 -6.246 4.20e-10 ***
## educationMasters -4.703e-01 1.054e-01 -4.461 8.16e-06 ***
## educationPhD -5.705e-01 1.573e-01 -3.626 0.000287 ***
## travtime 1.499e-02 2.144e-03 6.990 2.75e-12 ***
## car_usePrivate -8.054e-01 8.202e-02 -9.820 < 2e-16 ***
## bluebook -2.505e-05 5.471e-06 -4.578 4.69e-06 ***
## tif -5.835e-02 8.489e-03 -6.874 6.24e-12 ***
## car_typeSUV -3.087e-01 1.122e-01 -2.751 0.005939 **
## car_typePickup -4.992e-01 1.285e-01 -3.885 0.000102 ***
## car_typePanel Truck -4.291e-01 1.842e-01 -2.329 0.019836 *
## car_typeMinivan -9.315e-01 1.212e-01 -7.684 1.55e-14 ***
## car_typeVan -3.858e-01 1.543e-01 -2.500 0.012416 *
## oldclaim -1.401e-05 4.536e-06 -3.089 0.002009 **
## clm_freq 1.795e-01 3.268e-02 5.494 3.94e-08 ***
## revokedY 8.697e-01 1.054e-01 8.249 < 2e-16 ***
## mvr_pts 1.229e-01 1.552e-02 7.921 2.36e-15 ***
## urbanicityUrban 2.319e+00 1.289e-01 17.989 < 2e-16 ***
## yoj -1.688e-02 8.692e-03 -1.942 0.052078 .
## income -5.926e-06 1.068e-06 -5.549 2.88e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 7084.6 on 6120 degrees of freedom
## Residual deviance: 5569.8 on 6097 degrees of freedom
## AIC: 5617.8
##
## Number of Fisher Scoring iterations: 5
reduced_stats <- glance(reduced)   # collect stats for model comparisons
The ability of a binary classifier to discriminate class membership can be evaluated using a confusion (error) matrix as well as a receiver operating characteristic (ROC curve).
A confusion matrix for binary classifiers consists of a 2x2 table with counts of actual and predicted classes for a response variable; providing a measure of correct (true positives, true negatives) and incorrect classifications (false positives, false negatives) at a given probability threshold.
The receiver operating characteristic plots a model’s true positive rate (i.e., sensitivity|recall|probability of detection) against the false positive rate (i.e., 1-specificity|probability of false detection) at various probability thresholds. Importantly, the area under the ROC curve (AUC) offers a measure of model accuracy and can be used as a statistic for model comparison.
The reduced model correctly classifies ~ 75% of the positive class and ~67% of the negative class for ‘target_flag’ at a probability threshold of 0.51. In this instance, the selected threshold optimizes for balance in true positive and false positive rates.
Depending on claim costs, it may be advisable to lower the threshold so that more positive observations (crashes) are correctly classified, at the cost of fewer correct negative classifications. In practice, this reflects a cost-benefit trade-off between potential payouts and premium income that can guide a carrier’s determination of coverage for an individual driver.
The AUC and other measures of performance for the reduced model are discussed in the model selection section.
# Collect predicted percentages
predicted <- predict(reduced, train_df, type = 'response')

# Find optimal threshold to minimize misclassification
optimal <- optimalCutoff(train_df$target_flag, predicted, optimiseFor = "misclasserror")   # 0.51

# Calculate training classification error with .51 threshold --> 0.2595
Miss_Class_Error <- misClassError(train_df$target_flag, predicted, threshold = optimal)

# Convert predicted to 0, 1 values based on threshold
predicts <- ifelse(predicted >= optimal, 1, 0)

# Subset actuals and predicts as numeric for confusion matrix (factors in caret)
actuals <- as.numeric(as.character(train_df$target_flag))

(InformationValue::confusionMatrix(actuals, predicts, threshold = optimal))

# Calculate related metrics
AUC <- AUROC(actuals, predicted)   # --> .81

Sensitivity <- InformationValue::sensitivity(actuals, predicts, threshold = optimal)   # recall: true positive rate
Specificity <- InformationValue::specificity(actuals, predicts, threshold = optimal)   # true negative rate
Precision <- InformationValue::precision(actuals, predicts, threshold = optimal)       # proportion of predicted ones that are actual ones

YDI_Index <- youdensIndex(actuals, predicts, threshold = optimal)   # 0.48; accounts for both false-positive and false-negative rates

# Assess predictive power using concordance
Predict_Power <- Concordance(actuals, predicts)   # predictive power --> .548
Predict_Power <- Predict_Power$Concordance

ks_plot(actuals, predicts)   # cumulative percentage of ones captured by the model vs. that expected at random

Kol_Smirnov <- ks_stat(actuals, predicts)   # Kolmogorov-Smirnov statistic: maximum difference between cumulative true positive and cumulative false positive rates --> .69

# Create a metrics table
reduced_metrics <- cbind(Predict_Power, Miss_Class_Error, AUC, Sensitivity, Specificity, Precision, YDI_Index, Kol_Smirnov)

# Construct ROC curve (note: actuals, predicts, predicted as vectors)
pred <- ROCR::prediction(predicted, actuals)
perf <- ROCR::performance(pred, "tpr", "fpr")

plot(perf, colorize = TRUE, main = "ROC curve for logistic regression on insurance data")

# AUC curve
plotROC(actuals, predicted)
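To illustrate the cost-benefit framing above, a minimal sketch of choosing a cutoff by minimizing expected monetary loss rather than raw misclassification error. The per-error costs (‘cost_fn’, ‘cost_fp’) are hypothetical placeholders; ‘predicted’ and ‘actuals’ are the vectors computed in the chunk above.
# Illustrative only: cost-weighted threshold selection (hypothetical costs)
cost_fn <- 5000   # assumed cost of missing a crash (false negative)
cost_fp <- 500    # assumed cost of wrongly flagging a safe driver (false positive)
cutoffs <- seq(0.05, 0.95, by = 0.01)
exp_loss <- sapply(cutoffs, function(t) {
  pred_class <- ifelse(predicted >= t, 1, 0)
  fn <- sum(actuals == 1 & pred_class == 0)
  fp <- sum(actuals == 0 & pred_class == 1)
  cost_fn * fn + cost_fp * fp
})
cutoffs[which.min(exp_loss)]   # cost-optimal threshold under these assumptions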
As part of the diagnostic process, we check the model residuals for evidence of overdispersion. This occurs when the residuals are more variable than expected for a binomial distribution.
Overdispersion can be evaluated by dividing a model’s residual deviance by its residual degrees of freedom. A value close to 1 indicates no overdispersion; a value substantially greater than 1 indicates otherwise, in which case a quasi-binomial fit (as used for comparison below) may be more appropriate.
There is no evidence of overdispersion in the reduced model.
# Evaluate using deviance and quasibinomial comparison
deviance(reduced)/df.residual(reduced) # if considerably greater than one we should be concerned
## [1] 0.9135302
# Check with a second model fit
quasi_model <- glm(target_flag ~ ., family = 'quasibinomial', base_df)   # note: using base_df

pchisq(summary(quasi_model)$dispersion * reduced$df.residual,
       reduced$df.residual, lower = F)
## [1] 0.1703679
The presence of influential outliers (outliers with high leverage) can negatively affect model results.
Influence can be evaluated on the basis of a model’s standardized residuals and Cook’s distance values.
While the majority of covariates contain outliers, there is little evidence of influential outliers (excepting obs. 5101) in the training set (i.e., standardized residuals < 3 and Cook’s distance < 1).
# Extract model results
reduced_df <- train_df

reduced_df$data <- augment(reduced) %>%   # add model output to dataframe
  mutate(index = 1:n())

# Identify influential outliers
reduced_df$data %>% top_n(10, .cooksd)

plot(reduced, which = 4, id.n = 3)   # keep an eye on obs > 4/n

# Plot standardized residuals
ggplot(reduced_df$data, aes(index, .std.resid)) +
  geom_point(aes(color = target_flag), alpha = .5) +
  theme_bw()

# Filter potential influential data points with abs(.std.resid) > 3
reduced_df$data %>%
  filter(abs(.std.resid) > 3)
Recall that linearity between each numerical covariate and the logit is one requirement of a valid logistic model.
The bivariate plots below indicate this requirement is met for most of the covariates, with the possible exception of ‘travtime’ and ‘oldclaim’. These variables may benefit from transformation.
# create a dataframe for linearity analysis
reduced_df$predicted <- predict(reduced, train_df, type = 'response')   # add predicted

reduced_df %<>% mutate(logit = log(predicted / (1 - predicted)))        # add logit

# plot - check linearity btwn numerical predictors and logit; .data used to pass strings into aes
covar_df <- reduced_df %>%
  dplyr::select(is.numeric)

xvar <- names(covar_df)[2:13]   # numerical covariates
xvar <- purrr::set_names(xvar)

linear_func <- function(x){
  ggplot(covar_df, aes_string(x = x, y = 'logit')) +
    geom_point(alpha = .2) +
    geom_smooth(method = "loess", se = FALSE, color = 'red') +
    theme_minimal()
}

linear_plots <- map(xvar, ~linear_func(.x))   # map plotting function

plot(reduce(linear_plots, `+`))               # reduce for single print
Another requirement of a valid logistic model is the absence of strong multicollinearity among covariates. One measure of multicollinearity is the Variance Inflation Factor (VIF). A high VIF value (e.g., 10 or greater) for a covariate indicates that it is highly collinear with other covariates in the model.
There is no evidence of problematic multicollinearity in the reduced model.
car::vif(reduced)
## GVIF Df GVIF^(1/(2*Df))
## kidsdriv 1.297684 1 1.139159
## homekids 1.831738 1 1.353417
## parent1 1.900427 1 1.378560
## mstatus 1.561623 1 1.249649
## education 1.688907 3 1.091275
## travtime 1.039941 1 1.019775
## car_use 1.490637 1 1.220916
## bluebook 1.760971 1 1.327016
## tif 1.009961 1 1.004968
## car_type 2.206158 5 1.082340
## oldclaim 1.680393 1 1.296300
## clm_freq 1.468197 1 1.211692
## revoked 1.331231 1 1.153790
## mvr_pts 1.159294 1 1.076705
## urbanicity 1.136284 1 1.065966
## yoj 1.198707 1 1.094855
## income 1.890915 1 1.375106
A final model requirement is the independence of observations. This can be assessed by plotting the model residuals against the linear predictor (i.e., the linear combination of coefficients and covariates). Independence is indicated by the absence of pattern in the resulting plot.
There does appear to be some curvature in the plot of residuals vs linear predictor for the reduced model that may point to a departure from independence. This might result, for example, from grouping effects among categorical covariates.
Given that the plotting pattern is not severe and that the model serves a predictive (vs. explanatory) purpose, there is not sufficient evidence to reject the model.
resid_df <- train_df %>% mutate(residuals = residuals(reduced), linpred = predict(reduced))

bins <- group_by(resid_df, cut(linpred, breaks = unique(quantile(linpred, (1:100)/101))))

diag <- summarize(bins, residuals = mean(residuals), linpred = mean(linpred))

plot(residuals ~ linpred, diag, xlab = "linear predictor")
The reduced model is considered valid on the basis of requirements described for logistic regression.
The ‘goodness’ of model fit can be readily evaluated from marginal plots for each variable. Marginal plots graph actual and conditional probabilities (estimated by the model) for each covariate relative to the response variable. The resulting curves align closely in a model that fits the data well.
The reduced model provides a good fit to the data, although some divergence between actual and conditional probabilities is observed in the plots for ‘travtime’, ‘bluebook’, and ‘age’. These departures are not severe and may be lessened by transforming these features.
marginals_reduced <- mmps(reduced, main = NULL)
The transformation model includes terms from the reduced model with transformations to reduce skew in select covariates. On this basis, ‘travtime’ and ‘bluebook’ will be log transformed while ‘tif’ and ‘income’ will be square root transformed.
These transformations follow from earlier model trials (not shown here) and experimentation.
All covariates included in the transformation model are significant at the .05 level.
The transformation model correctly classifies ~ 80% of the positive class and ~68% of the negative class for ‘target_flag’ at a probability threshold of 0.21 - optimizing for low misclassification error. The AUC is ~ .81, similar to the reduced model.
The transformation model is valid (diagnostics not included for the purpose of brevity).
#Build model with transformed vars
trans <- glm(target_flag ~ kidsdriv + parent1 + mstatus + education +
               log(1 + travtime) + car_use + log(1 + bluebook) + sqrt(tif) + car_type + oldclaim +
               clm_freq + revoked + mvr_pts + urbanicity + sqrt(income),
             family = 'binomial', train_df)

trans <- step(trans, trace = 0)
summary(trans)
##
## Call:
## glm(formula = target_flag ~ kidsdriv + parent1 + mstatus + education +
## log(1 + travtime) + car_use + log(1 + bluebook) + sqrt(tif) +
## car_type + oldclaim + clm_freq + revoked + mvr_pts + urbanicity +
## sqrt(income), family = "binomial", data = train_df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3354 -0.7265 -0.4234 0.6335 3.1073
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 7.567e-01 6.202e-01 1.220 0.222427
## kidsdriv 3.938e-01 6.230e-02 6.321 2.61e-10 ***
## parent1Y 4.867e-01 1.069e-01 4.552 5.31e-06 ***
## mstatusY -6.361e-01 7.863e-02 -8.090 5.99e-16 ***
## educationBachelors -5.365e-01 8.562e-02 -6.266 3.70e-10 ***
## educationMasters -4.832e-01 1.027e-01 -4.707 2.51e-06 ***
## educationPhD -6.757e-01 1.474e-01 -4.583 4.59e-06 ***
## log(1 + travtime) 4.220e-01 6.173e-02 6.837 8.09e-12 ***
## car_usePrivate -8.148e-01 8.210e-02 -9.924 < 2e-16 ***
## log(1 + bluebook) -3.313e-01 6.241e-02 -5.309 1.11e-07 ***
## sqrt(tif) -2.563e-01 3.710e-02 -6.910 4.86e-12 ***
## car_typeSUV -2.803e-01 1.127e-01 -2.488 0.012851 *
## car_typePickup -4.762e-01 1.288e-01 -3.698 0.000217 ***
## car_typePanel Truck -5.219e-01 1.744e-01 -2.992 0.002773 **
## car_typeMinivan -9.087e-01 1.216e-01 -7.475 7.72e-14 ***
## car_typeVan -3.606e-01 1.543e-01 -2.337 0.019427 *
## oldclaim -1.407e-05 4.541e-06 -3.099 0.001945 **
## clm_freq 1.811e-01 3.267e-02 5.544 2.96e-08 ***
## revokedY 8.799e-01 1.054e-01 8.348 < 2e-16 ***
## mvr_pts 1.239e-01 1.552e-02 7.983 1.43e-15 ***
## urbanicityUrban 2.318e+00 1.289e-01 17.981 < 2e-16 ***
## sqrt(income) -2.765e-03 3.945e-04 -7.009 2.40e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 7084.6 on 6120 degrees of freedom
## Residual deviance: 5561.0 on 6099 degrees of freedom
## AIC: 5605
##
## Number of Fisher Scoring iterations: 5
trans_stats <- glance(trans)   # save stats for model comparisons
Tests of model validity for the transformation model are comparable to the reduced model and, to avoid repetition, are not reported here. Diagnostics for model performance are included below and in Section 4.
# collect predicted percentages
trans_predicted <- predict(trans, train_df, type = 'response')

# find optimal threshold to minimize misclassification
trans_optimal <- optimalCutoff(train_df$target_flag, trans_predicted, optimiseFor = "misclasserror")

# calculate training classification error at the selected threshold --> 0.2588
Miss_Class_Error <- misClassError(train_df$target_flag, trans_predicted, threshold = trans_optimal)

# convert predicted to 0, 1 values based on threshold
trans_predicts <- ifelse(trans_predicted >= trans_optimal, 1, 0)

# subset actuals and predicts as numeric for confusion matrix
trans_actuals <- as.numeric(as.character(train_df$target_flag))

InformationValue::confusionMatrix(trans_actuals, trans_predicts, threshold = optimal) %>%
  flextable() %>%
  set_caption("Confusion Matrix: Transformation Model")
0 | 1 |
4,207 | 998 |
289 | 627 |
# Calculate related metrics
AUC <- AUROC(trans_actuals, trans_predicted)   # --> .81
Sensitivity <- InformationValue::sensitivity(trans_actuals, trans_predicts, threshold = optimal)   # recall: true positive rate
Specificity <- InformationValue::specificity(trans_actuals, trans_predicts, threshold = optimal)   # true negative rate
Precision <- InformationValue::precision(trans_actuals, trans_predicts, threshold = optimal)       # proportion of predicted ones that are actual ones
YDI_Index <- youdensIndex(trans_actuals, trans_predicts, threshold = optimal)   # accounts for both false-positive and false-negative rates

# Calculate predictive power using concordance
Concord <- Concordance(trans_actuals, trans_predicts)
Predict_Power <- Concord$Concordance

ks_plot(trans_actuals, trans_predicts)   # cumulative percentage of ones captured by the model vs. that expected at random

Kol_Smirnov <- ks_stat(trans_actuals, trans_predicts)   # Kolmogorov-Smirnov statistic: maximum difference between cumulative true positive and cumulative false positive rates

# Create a metrics table
trans_metrics <- cbind(Predict_Power, Miss_Class_Error, AUC, Sensitivity, Specificity, Precision, YDI_Index, Kol_Smirnov)

# Construct ROC curve (note: actuals, predicts, predicted as vectors)
pred_trans <- ROCR::prediction(trans_predicted, trans_actuals)
perf_trans <- ROCR::performance(pred_trans, "tpr", "fpr")

plot(perf_trans, colorize = TRUE, main = "ROC curve for logistic regression on insurance data")

# AUC curve
plotROC(trans_actuals, trans_predicted)
### Model Fit
Marginal plots show improvements in model fit with the transformations selected for ‘travtime’, ‘bluebook’, ‘tif’, and ‘income’.
marginals_trans <- mmps(trans, main = NULL)
There are a variety of methods, both quantitative and qualitative, to aid model selection. Here, selection will proceed using a cross-section of diagnostics that balance overall fit and parsimony (AIC, BIC, residual deviance) with discriminatory power (AUC).
It’s worth noting that diagnostics for the transformation model produced similar results as the reduced model. Both are considered valid.
From the first table below (i.e., comparison of model results), it is clear that the transformation model is an appropriate choice. It produces the lowest AIC and BIC values and the lowest residual deviance. Additionally, its AUC is equivalent to that of the reduced and full models (.81) and greater than that of the null model (.50). It does, however, produce higher standard errors for the transformed covariates.
Other measures of performance are included in the classification metrics table below (reduced and transformation models). Note, however, that they are specific to the probability thresholds selected for each model and are not considered criteria for model selection.
Additional tests can be performed to confirm our model choice (e.g., likelihood ratio tests for nested models) but are not necessary.
# Compare model output summaries
col_name <- c('Null Model', 'Full Model', 'Reduced Model', 'Transformed Model')

comp <- rbind(null_stats, full_stats, reduced_stats, trans_stats)

(compare_models <- cbind(col_name, comp) %>%
  dplyr::select(-c(null.deviance, df.null, logLik, nobs)) %>%
  rename(Model = col_name) %>%
  flextable() %>%
  set_caption("Comparison of Model Results"))
Model | AIC | BIC | deviance | df.residual |
Null Model | 7,086.575 | 7,093.295 | 7,084.575 | 6,120 |
Full Model | 5,624.703 | 5,819.568 | 5,566.703 | 6,092 |
Reduced Model | 5,617.794 | 5,779.061 | 5,569.794 | 6,097 |
Transformed Model | 5,605.049 | 5,752.878 | 5,561.049 | 6,099 |
# Compare select metrics for reduced and transformed models
col_final <- c('Reduced Model', 'Transformed Model')

final_comp <- rbind(reduced_metrics, trans_metrics)

(final_models <- cbind(col_final, final_comp) %>%
  as.data.frame() %>%
  mutate_at(vars(2:9), funs(as.numeric)) %>%
  mutate_if(is.numeric, round, 3) %>%
  rename(Model = col_final) %>%
  flextable() %>%
  set_caption("Classification Metrics: Reduced & Transformed Models"))
Model | Predict_Power | Miss_Class_Error | AUC | Sensitivity | Specificity | Precision | YDI_Index | Kol_Smirnov |
Reduced Model | 0.397 | 0.212 | 0.806 | 0.433 | 0.917 | 0.652 | 0.350 | 0.338 |
Transformed Model | 0.361 | 0.210 | 0.806 | 0.386 | 0.936 | 0.684 | 0.322 | 0.306 |
To assess how well the transformation model generalizes, we apply it to the test dataset and summarize the results. Interestingly, the AUC produced on the test set has increased (.82) relative to the training set - indicating that the model generalizes well.
Assume for a moment that an insurer seeks to minimize their risk of claim payouts (relative to client payments). In this instance, they might elect to lower the probability threshold used to classify new clients.
Below, our results for model performance reflect a cutoff of ~0.25. The model classifies 90% of our positives (target_flag = 1, crash) correctly, as compared to 49% of our negatives (target_flag = 0, no accidents).
Other related measures of performance are included in the metrics table.
# Apply transformation model to test data
trans_test <- glm(target_flag ~ kidsdriv + parent1 + mstatus + education +
                    log(1 + travtime) + car_use + log(1 + bluebook) + sqrt(tif) + car_type + oldclaim +
                    clm_freq + revoked + mvr_pts + urbanicity + sqrt(income),
                  family = 'binomial', test_df)

# Apply fitted model to test sample (predicted probabilities)
predicted_test <- predict(trans_test, test_df, type = "response")

# Find optimal threshold to minimize misclassification
test_optimal <- optimalCutoff(test_df$target_flag, predicted_test, optimiseFor = "Both")   # --> .25

# Calculate test classification error at the selected threshold
Miss_Class_Error <- misClassError(test_df$target_flag, predicted_test, threshold = test_optimal)

# Convert predicted to 0, 1 values based on threshold
predicts_test <- ifelse(predicted_test >= test_optimal, 1, 0)

# Subset actuals and predicts as numeric for confusion matrix
actuals_test <- as.numeric(as.character(test_df$target_flag))

InformationValue::confusionMatrix(actuals_test, predicts_test, threshold = test_optimal) %>%
  flextable() %>%
  set_caption("Confusion Matrix: Transformation Model")
 | Actual 0 | Actual 1 |
---|---|---|
Predicted 0 | 1,092 | 117 |
Predicted 1 | 420 | 411 |
# Calculate related metrics
AUC <- AUROC(actuals_test, predicted_test) # --> .82

Sensitivity <- InformationValue::sensitivity(actuals_test, predicts_test, threshold = test_optimal) # recall: true positive rate
Specificity <- InformationValue::specificity(actuals_test, predicts_test, threshold = test_optimal) # true negative rate
Precision <- InformationValue::precision(actuals_test, predicts_test, threshold = test_optimal) # proportion of predicted positives that are actual positives
YDI_Index <- youdensIndex(actuals_test, predicts_test, threshold = test_optimal) # accounts for both false-positive and false-negative rates

# Calculate predictive power using concordance
Concord <- Concordance(actuals_test, predicts_test)
Predict_Power <- Concord$Concordance

ks_plot(actuals_test, predicts_test) # cumulative percent of ones captured by the model vs. that expected at random

Kol_Smirnov <- ks_stat(actuals_test, predicts_test) # Kolmogorov-Smirnov statistic: maximum difference between the cumulative true-positive and false-positive rates

# Create a metrics table
(test_metrics <- cbind(Predict_Power, Miss_Class_Error, AUC, Sensitivity, Specificity, Precision, YDI_Index, Kol_Smirnov)%>%
    as.data.frame()%>%
    mutate_at(vars(1:8), funs(as.numeric))%>%
    mutate_if(is.numeric, round, 3)%>%
    flextable()%>%
    set_caption("Transformation Model-Test Data: Metrics"))
Predict_Power | Miss_Class_Error | AUC | Sensitivity | Specificity | Precision | YDI_Index | Kol_Smirnov |
---|---|---|---|---|---|---|---|
0.562 | 0.263 | 0.824 | 0.778 | 0.722 | 0.495 | 0.501 | 0.49 |
# Construct ROC curve (note: pass predicted probabilities and actuals as vectors)
pred <- ROCR::prediction(predicted_test, actuals_test)
perf <- ROCR::performance(pred, "tpr", "fpr")

plot(perf, colorize = TRUE, main = "ROC curve for logistic regression on insurance data")

# Construct AUC curve
plotROC(actuals_test, predicted_test)
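To make the payout-risk trade-off described above more concrete, the sensitivity/specificity balance can be tabulated over a grid of candidate cutoffs. The sketch below reuses the test-set objects created above; the grid values are illustrative only.
# Sweep candidate probability cutoffs and record the resulting trade-off
cutoffs <- seq(0.1, 0.6, by = 0.05)

threshold_sweep <- data.frame(
    cutoff = cutoffs,
    sensitivity = sapply(cutoffs, function(t)
        InformationValue::sensitivity(actuals_test, predicted_test, threshold = t)),
    specificity = sapply(cutoffs, function(t)
        InformationValue::specificity(actuals_test, predicted_test, threshold = t)))

threshold_sweep # lower cutoffs capture more crashes (sensitivity) at the cost of specificity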
We can now build a linear model to predict payout cost (‘target_amt’) to a driver in the event of a crash.
It’s worth noting that there are a variety of examples in the published literature of auto insurance claim prediction based on log-normal and/or gamma regression models. This reflects the fact that the claim data are typically right skewed and positive. Both models will be constructed and compared here.
As an aside, a Poisson model can also be used where the response is right skewed and positive, but it restricts the response to integer values.
First, we build a new training and test set for linear modeling, starting with a more limited set of covariates: ‘bluebook’, ‘car_age’, ‘car_type’, ‘oldclaim’, and ‘clm_freq’.
Other variables are omitted owing to the absence of any theoretical relationship to payout cost.
To further simplify our model, we construct an average past payout (‘avg_clm’) for each driver as the ratio of ‘oldclaim’ to ‘clm_freq’, assigning a value of 0 to ‘avg_clm’ where no claims have been filed previously.
The extent to which ‘target_amt’ includes payment for damages other than vehicular (e.g., medical) is unclear. To better focus on the vehicular component, we construct another new variable, ‘value_clm’, as the ratio of ‘target_amt’ to ‘bluebook’. In this context, values > 1 indicate payments in excess of vehicle market value. These observations will be dropped from the analyses. The final set of covariates used for model construction is: ‘bluebook’, ‘car_age’, ‘car_type’, and ‘avg_clm’.
It’s worth noting that no pairwise correlations between these variables raise multicollinearity concerns.
# subset dataframe for the linear model (home_val omitted; its variance is explained by income)
linear_df <- df%>%
    filter(target_amt > 0)%>%
    mutate(avg_clm = round(oldclaim/clm_freq, 1))%>% # average past claim payout to the driver
    mutate(value_clm = round((target_amt/bluebook)*100))%>% # payout-to-value percentage for the current claim
    mutate_at(vars(avg_clm), ~replace(., is.na(.), 0))%>% # if no prior claims, set to 0
    dplyr::select(target_amt, bluebook, car_age, car_type, value_clm, avg_clm)

# use value_clm to drop target_amt with a cost/value ratio > 1, then drop this variable
linear_df %<>%
    filter(value_clm < 100)%>%
    dplyr::select(-value_clm)
linear_df %>%
    filter_if(is.numeric, all_vars((.) != 0))%>%
    correlate()%>%
    filter(coef_corr > .5 | coef_corr < -.5)%>% # set thresholds to limit output
    flextable()%>%
    set_caption("Pairwise Correlation: Numerical Covariates")
var1 | var2 | coef_corr |
---|---|---|
(no pairs exceed the ±0.5 threshold) |  |  |
set.seed(123344)

# Build train/test sets
trainIdx <- createDataPartition(linear_df$target_amt, p = .75,
    list = FALSE, times = 1)

linear_train <- linear_df[trainIdx,]
linear_test <- linear_df[-trainIdx,]
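As a complementary check on the correlation screen above, variance inflation factors can be computed from a preliminary fit. A minimal sketch using car::vif (the car package is already loaded); the helper fit ‘vif_check’ is introduced here purely as a diagnostic aid.
# Generalized VIFs for the candidate covariates; values near 1 suggest little collinearity
vif_check <- lm(target_amt ~ bluebook + car_age + avg_clm + car_type, data = linear_train)
car::vif(vif_check)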
The response variable, ‘target_amt’, is highly right skewed and is positive across all observations. This can be problematic when constructing a multiple regression model using ordinary least squares. As mentioned previously, both log-normal and gamma regression models can be used to model this form of data (see examples below). However, the former cannot be interpreted on the original scale of the data without back-transformation and must also adhere to the requirements of ordinary least squares. In instances where heteroscedasticity is apparent, a log-normal model may be modified using weighted least squares (smaller datasets) or robust techniques.
As a generalized linear model, gamma regression allows interpretation of the results on the original scale and does not require normally distributed errors or constant variance (the variance is proportional to the square of the mean). Two link functions are commonly used for gamma models - the inverse (canonical) link and the log link. The log link is used here because it guarantees positive fitted values and yields multiplicative, easily interpreted effects.
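The practical difference shows up when predictions and coefficients are reported. The sketch below contrasts the two approaches; ‘lognormal_fit’ and ‘gamma_fit’ are placeholder names for models of the kind fit later in this section.
# Log-normal model: predictions are on the log scale and must be back-transformed.
# exp(pred) estimates the median payout; adding sigma^2/2 approximates the mean.
pred_log <- predict(lognormal_fit, newdata = linear_test)
sigma2 <- summary(lognormal_fit)$sigma^2
pred_mean <- exp(pred_log + sigma2/2)

# Gamma model with log link: fitted values are already on the dollar scale, and
# exp(coef) gives the multiplicative change in expected payout per unit of a covariate.
pred_gamma <- predict(gamma_fit, newdata = linear_test, type = "response")
exp(coef(gamma_fit))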
Gamma regression has been used previously to model auto insurance claims. In these instances, the response variable is often an average claim amount or some measure of claim severity. The dataset used here contains payouts for individual claims. Thus, the most relevant predictors will likely be ‘bluebook’, ‘car_age’, and ‘car_type’. The term ‘avg_clm’ will be relevant provided that decisions regarding the current payout are based, in part, on previous payouts to a driver.
Prior examples: McCullagh and Nelder, 1983; Mohammad and McGovern, 2020.
# Plot density distribution
hist <- linear_train%>%
    ggplot(aes(target_amt))+
    geom_density(fill='darkblue', alpha=.5)+
    theme_minimal()+
    labs(title='Distribution Density: target_amt')

# qqplot
points <- linear_train%>%
    ggplot(aes(sample=target_amt))+
    stat_qq(color='darkblue')+
    stat_qq_line()+
    theme_minimal()+
    labs(title='QQ Plot: target_amt', y='Target_amt')

# cumulative frequency plot
cum <- linear_train%>%
    ggplot(aes(target_amt)) +
    stat_ecdf(color='blue')+
    theme_minimal()+
    labs(title = 'Cumulative Frequency Plot: target_amt', y='Frequency')

(hist|points)/cum # plot layout
# univariate stats for target_amt
linear_train$target_amt%>%
    summary()%>%
    tidy()%>%
    flextable()%>%
    set_caption("Univariate stats: target_amt")
minimum | q1 | median | mean | q3 | maximum |
---|---|---|---|---|---|
30 | 2,465.75 | 3,823 | 4,131.775 | 5,257.5 | 27,496 |
Included below is a full set of bivariate plots for the numerical and factor predictors against our response, ‘target_amt’. It is difficult to discern a linear relationship between the numerical covariates and ‘target_amt’. In contrast, there does appear to be a weak association between the levels of ‘car_type’ and ‘target_amt’, with the highest payouts applied to panel trucks.
It’s worth noting that ~80% of the paid claims are less than $10,000 each, and the highest payouts are associated with newer vehicles, few to no past claims, and mid-range bluebook values. It appears that payouts are proportionally smaller for high-value vehicles. (The ~80% share can be checked directly; see the one-line sketch below.)
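A quick check of the ~80% figure from the training data:
# proportion of paid claims under $10,000 in the training set (expected to be ~0.8)
mean(linear_train$target_amt < 10000)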
xvar_names <- linear_train%>%
    dplyr::select(where(is.numeric), -target_amt)%>%
    names() # numerical covariates

xvar_names <- purrr::set_names(xvar_names)

# create function for scatter plots
scatter_func <- function(x){
    ggplot(linear_train, aes_string(x=x, y='target_amt'))+
        geom_point(aes_string(color='car_type'), alpha=.6)+
        geom_smooth(method = "loess", se = FALSE, color='red')+
        theme_minimal()
}

(scatter_plots <- map(xvar_names, ~scatter_func(.x))) # map plotting function
## $bluebook
## $car_age
## $avg_clm
# plot factor covariate car_type vs target_amt (< $10,000 for readability)
linear_train%>%
    filter(target_amt < 10000)%>%
    ggplot(aes(x = reorder(car_type, target_amt, FUN = median), y = target_amt, fill=car_type)) +
    geom_boxplot(outlier.color ='red', show.legend = FALSE)+
    scale_fill_viridis(discrete = TRUE, alpha=0.6)+
    theme_classic()+
    labs(title = 'Payment Amount vs. Car Type: Payments under $10,000')
Prior to modeling, we can visually assess potential covariate interactions as predictors of ‘target_amt’. It appears that ‘car_type’ influences the relationship between each numerical covariate and ‘target_amt’. In contrast, there do not appear to be any interactions among the numerical covariates in relation to ‘target_amt’.
# evaluate interactions between the numerical covariates and car_type using a linear model
numvars <- purrr::set_names(c('bluebook', 'car_age', 'avg_clm'))

filter_df <- linear_train%>%
    filter(target_amt < 10000)

interact <- function(x){
    ggplot(filter_df, aes_string(x = x, y = 'target_amt', color = 'car_type')) +
        geom_smooth(method = 'lm', se=FALSE)+
        theme_classic() +
        labs(x = x, y = "target_amt")
}

interact_plots <- map(numvars, ~interact(.x)) # map plotting function

plot(reduce(interact_plots, `+`)) # reduce to a single patchwork layout

# evaluate interactions among the numerical covariates using conditioning plots
coplot(target_amt~bluebook|car_type, panel = panel.car, col='blue', rows=1, data=filter_df)
coplot(target_amt~bluebook|avg_clm, panel = panel.car, col='blue', rows=1, data=filter_df)
coplot(target_amt~bluebook|car_age, panel = panel.car, col='blue', rows=1, data=filter_df)
coplot(target_amt~car_age|avg_clm, panel = panel.car, col='blue', rows=1, data=filter_df)
## Log-Normal Model: NULL
We will start with a null model with a log transformed response (log-normal model).
# note: gaussian glm with default link (identity) is equivalent to OLS
null_lm <- lm(log(target_amt) ~ 1, data=linear_train)

summary(null_lm)
##
## Call:
## lm(formula = log(target_amt) ~ 1, data = linear_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.7209 -0.3119 0.1266 0.4453 2.0996
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.12215 0.01875 433.2 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7025 on 1403 degrees of freedom
Next we build a full log-normal model with main effects. Only ‘bluebook’ and one level of ‘car_type’ (SUV) are significant at alpha = 0.05, and the adjusted R-squared is only 0.03.
full_lm <- lm(log(target_amt) ~ ., data=linear_train)

summary(full_lm)
##
## Call:
## lm(formula = log(target_amt) ~ ., data = linear_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.6121 -0.3170 0.1268 0.4572 1.8277
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.769e+00 6.839e-02 113.585 < 2e-16 ***
## bluebook 1.665e-05 2.927e-06 5.689 1.56e-08 ***
## car_age -6.733e-04 3.443e-03 -0.196 0.8450
## car_typeSUV 1.278e-01 6.130e-02 2.085 0.0372 *
## car_typePickup 1.227e-01 6.622e-02 1.853 0.0640 .
## car_typePanel Truck 5.629e-02 9.288e-02 0.606 0.5446
## car_typeMinivan 9.422e-02 6.864e-02 1.373 0.1701
## car_typeVan 6.547e-02 8.119e-02 0.806 0.4201
## avg_clm 3.658e-06 2.812e-06 1.301 0.1936
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6917 on 1395 degrees of freedom
## Multiple R-squared: 0.03597, Adjusted R-squared: 0.03044
## F-statistic: 6.507 on 8 and 1395 DF, p-value: 2.384e-08
plot(full_lm)
## Log-Normal Model: Reduced w/ Interactions
Building the log-normal model out further, ‘car_type’ is entered in interaction terms and variable selection is automated using stepwise AIC.
The reduced model has the following form:
log(target_amt) ~ bluebook + car_type + avg_clm + bluebook:car_type + car_type:avg_clm
Relative to the full model, the residual standard error has decreased slightly and the adjusted R-squared is now 0.04. However, it appears that the requirements of constant variance and normality have not been met by the model residuals. There is also a poor fit between the actual and predicted observations for ‘target_amt’.
gauss_interact <- lm(log(target_amt) ~ bluebook*car_type + car_age*car_type + avg_clm*car_type, data=linear_train)

interact_aic <- step(gauss_interact) # reduced model --> log(target_amt) ~ bluebook + car_type + avg_clm + bluebook:car_type + car_type:avg_clm
## Start: AIC=-1028.24
## log(target_amt) ~ bluebook * car_type + car_age * car_type +
## avg_clm * car_type
##
## Df Sum of Sq RSS AIC
## - car_type:car_age 5 1.0717 653.39 -1035.9
## <none> 652.32 -1028.2
## - car_type:avg_clm 5 5.9075 658.22 -1025.6
## - bluebook:car_type 5 8.3065 660.62 -1020.5
##
## Step: AIC=-1035.93
## log(target_amt) ~ bluebook + car_type + car_age + avg_clm + bluebook:car_type +
## car_type:avg_clm
##
## Df Sum of Sq RSS AIC
## - car_age 1 0.0035 653.39 -1037.9
## <none> 653.39 -1035.9
## - car_type:avg_clm 5 5.7389 659.13 -1033.7
## - bluebook:car_type 5 8.3610 661.75 -1028.1
##
## Step: AIC=-1037.92
## log(target_amt) ~ bluebook + car_type + avg_clm + bluebook:car_type +
## car_type:avg_clm
##
## Df Sum of Sq RSS AIC
## <none> 653.39 -1037.9
## - car_type:avg_clm 5 5.7645 659.16 -1035.6
## - bluebook:car_type 5 8.3581 661.75 -1030.1
summary(interact_aic)
##
## Call:
## lm(formula = log(target_amt) ~ bluebook + car_type + avg_clm +
## bluebook:car_type + car_type:avg_clm, data = linear_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.6005 -0.3003 0.1130 0.4506 1.9873
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.792e+00 1.119e-01 69.653 < 2e-16 ***
## bluebook 1.939e-05 7.183e-06 2.700 0.00702 **
## car_typeSUV 5.806e-02 1.384e-01 0.419 0.67501
## car_typePickup 6.983e-02 1.396e-01 0.500 0.61701
## car_typePanel Truck 7.505e-01 2.799e-01 2.681 0.00743 **
## car_typeMinivan -1.746e-01 1.605e-01 -1.088 0.27687
## car_typeVan 5.715e-01 2.926e-01 1.953 0.05100 .
## avg_clm -1.280e-05 7.093e-06 -1.805 0.07130 .
## bluebook:car_typeSUV -2.367e-06 9.337e-06 -0.254 0.79988
## bluebook:car_typePickup -1.881e-06 9.097e-06 -0.207 0.83621
## bluebook:car_typePanel Truck -2.601e-05 1.104e-05 -2.356 0.01861 *
## bluebook:car_typeMinivan 1.381e-05 9.921e-06 1.392 0.16405
## bluebook:car_typeVan -2.780e-05 1.441e-05 -1.929 0.05398 .
## car_typeSUV:avg_clm 2.750e-05 8.746e-06 3.144 0.00170 **
## car_typePickup:avg_clm 2.025e-05 9.185e-06 2.204 0.02766 *
## car_typePanel Truck:avg_clm 3.467e-06 1.256e-05 0.276 0.78259
## car_typeMinivan:avg_clm 1.525e-05 9.796e-06 1.557 0.11978
## car_typeVan:avg_clm 1.225e-05 1.253e-05 0.977 0.32860
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6866 on 1386 degrees of freedom
## Multiple R-squared: 0.05631, Adjusted R-squared: 0.04474
## F-statistic: 4.865 on 17 and 1386 DF, p-value: 2.802e-10
plot(interact_aic)
# check actual vs predicted values for target_amt, another check on model fit
fit <- augment(interact_aic)%>%
    rename(target = 'log(target_amt)')

plot(density(fit$target), ylim=c(0, 3.0))
lines(density(fit$.fitted), col='red')
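The visual diagnostics above can be supplemented with formal tests of the constant-variance and normality assumptions. A minimal sketch, assuming the lmtest package is installed for the Breusch-Pagan test:
# Breusch-Pagan test for heteroscedasticity of the reduced log-normal model
lmtest::bptest(interact_aic)

# Shapiro-Wilk test for normality of the residuals (applicable here since n < 5000)
shapiro.test(residuals(interact_aic))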
## Log-Normal Model: Transformed
Transforming the model covariates can help address violations of the ordinary least squares assumptions. From the results below, however, log transforming the covariates did not lead to a valid model.
# transform the response and predictors to the log scale
train_trans <- linear_train%>%
    mutate_if(is.numeric, ~log(1+.))

gauss_interact_trans <- lm(target_amt ~ bluebook*car_type + car_age*car_type + avg_clm*car_type, data=train_trans)

trans_aic <- step(gauss_interact_trans)
## Start: AIC=-1059.59
## target_amt ~ bluebook * car_type + car_age * car_type + avg_clm *
## car_type
##
## Df Sum of Sq RSS AIC
## - car_type:car_age 5 0.6365 638.55 -1068.2
## - car_type:avg_clm 5 3.6708 641.58 -1061.5
## <none> 637.91 -1059.6
## - bluebook:car_type 5 5.0004 642.91 -1058.6
##
## Step: AIC=-1068.19
## target_amt ~ bluebook + car_type + car_age + avg_clm + bluebook:car_type +
## car_type:avg_clm
##
## Df Sum of Sq RSS AIC
## - car_type:avg_clm 5 3.5766 642.12 -1070.3
## - car_age 1 0.1112 638.66 -1070.0
## <none> 638.55 -1068.2
## - bluebook:car_type 5 5.3812 643.93 -1066.4
##
## Step: AIC=-1070.35
## target_amt ~ bluebook + car_type + car_age + avg_clm + bluebook:car_type
##
## Df Sum of Sq RSS AIC
## - avg_clm 1 0.0015 642.13 -1072.3
## - car_age 1 0.2033 642.33 -1071.9
## <none> 642.12 -1070.3
## - bluebook:car_type 5 5.3367 647.46 -1068.7
##
## Step: AIC=-1072.35
## target_amt ~ bluebook + car_type + car_age + bluebook:car_type
##
## Df Sum of Sq RSS AIC
## - car_age 1 0.2035 642.33 -1073.9
## <none> 642.13 -1072.3
## - bluebook:car_type 5 5.3353 647.46 -1070.7
##
## Step: AIC=-1073.9
## target_amt ~ bluebook + car_type + bluebook:car_type
##
## Df Sum of Sq RSS AIC
## <none> 642.33 -1073.9
## - bluebook:car_type 5 5.3835 647.71 -1072.2
summary(trans_aic)
##
## Call:
## lm(formula = target_amt ~ bluebook + car_type + bluebook:car_type,
## data = train_trans)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.5664 -0.3185 0.1196 0.4496 1.7321
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.08773 0.80300 5.091 4.06e-07 ***
## bluebook 0.41910 0.08590 4.879 1.19e-06 ***
## car_typeSUV 1.36562 1.02930 1.327 0.1848
## car_typePickup 1.41316 1.07695 1.312 0.1897
## car_typePanel Truck 6.27198 2.96697 2.114 0.0347 *
## car_typeMinivan -0.42962 1.12278 -0.383 0.7020
## car_typeVan 6.28641 3.12119 2.014 0.0442 *
## bluebook:car_typeSUV -0.13404 0.11019 -1.216 0.2240
## bluebook:car_typePickup -0.13813 0.11554 -1.196 0.2321
## bluebook:car_typePanel Truck -0.61793 0.29124 -2.122 0.0340 *
## bluebook:car_typeMinivan 0.05108 0.11918 0.429 0.6682
## bluebook:car_typeVan -0.63946 0.31579 -2.025 0.0431 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6793 on 1392 degrees of freedom
## Multiple R-squared: 0.07067, Adjusted R-squared: 0.06333
## F-statistic: 9.624 on 11 and 1392 DF, p-value: < 2.2e-16
plot(trans_aic)
#reduced model = target_amt ~ bluebook + car_type + bluebook:car_type
The first gamma regression model will be of a null form with a log link function.
gamma_null <- glm(target_amt ~ 1, family=Gamma(link=log), data=linear_train)

summary(gamma_null)
##
## Call:
## glm(formula = target_amt ~ 1, family = Gamma(link = log), data = linear_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.80447 -0.47537 -0.07668 0.25103 2.74206
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.32646 0.01742 478 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Gamma family taken to be 0.4260992)
##
## Null deviance: 573.72 on 1403 degrees of freedom
## Residual deviance: 573.72 on 1403 degrees of freedom
## AIC: 25634
##
## Number of Fisher Scoring iterations: 5
glm1 <- glance(gamma_null)
Rather than examining a full gamma model in detail, we construct a reduced main-effects model using stepwise AIC for variable selection.
The reduced model has the following form:
glm(target_amt ~ bluebook)
gamma <- glm(target_amt ~ ., family=Gamma(link=log), data=linear_train)

gamma_reduced <- step(gamma) # reduced model --> glm(target_amt ~ bluebook)
## Start: AIC=25564.7
## target_amt ~ bluebook + car_age + car_type + avg_clm
##
## Df Deviance AIC
## - car_type 5 543.68 25560
## - avg_clm 1 542.50 25564
## - car_age 1 542.53 25565
## <none> 541.92 25565
## - bluebook 1 566.40 25634
##
## Step: AIC=25559.54
## target_amt ~ bluebook + car_age + avg_clm
##
## Df Deviance AIC
## - car_age 1 544.30 25559
## - avg_clm 1 544.33 25559
## <none> 543.68 25560
## - bluebook 1 573.25 25643
##
## Step: AIC=25559.25
## target_amt ~ bluebook + avg_clm
##
## Df Deviance AIC
## - avg_clm 1 544.95 25559
## <none> 544.30 25559
## - bluebook 1 573.27 25640
##
## Step: AIC=25559.02
## target_amt ~ bluebook
##
## Df Deviance AIC
## <none> 544.95 25559
## - bluebook 1 573.72 25639
summary(gamma_reduced)
##
## Call:
## glm(formula = target_amt ~ bluebook, family = Gamma(link = log),
## data = linear_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.74975 -0.47658 -0.06705 0.27690 2.11637
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.042e+00 3.341e-02 240.694 <2e-16 ***
## bluebook 1.800e-05 1.932e-06 9.319 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Gamma family taken to be 0.3505996)
##
## Null deviance: 573.72 on 1403 degrees of freedom
## Residual deviance: 544.95 on 1402 degrees of freedom
## AIC: 25559
##
## Number of Fisher Scoring iterations: 5
glm2 <- glance(gamma_reduced)
The requirements for a valid GLM are explored below and summarized here for the reduced gamma model.
Findings:
For additional information see: Dunn, Peter K. and Gordon K. Smyth, 2018; Balajari, 2013.
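One check recommended by Dunn and Smyth is to inspect randomized quantile residuals, which should be approximately standard normal when the gamma family and link are appropriate. A minimal sketch, assuming the statmod package is installed:
# randomized quantile residuals for the reduced gamma model
qres <- statmod::qresiduals(gamma_reduced)

qqnorm(qres); qqline(qres) # approximate normality supports the gamma/log-link choice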
# plot residuals vs the linear predictor -- looking for constant variance in the deviance residuals
# (a lack of pattern indicates a suitable link)
plot(residuals(gamma_reduced) ~ predict(gamma_reduced, type='link'),
    xlab=expression(hat(eta)), ylab='Deviance Residuals')

plot(density(rstandard(gamma_reduced, type='deviance')), col='blue')

# check for independence of observations
scatter.smooth(rstandard(gamma_reduced, type='deviance'), col='gray')

# check for high leverage
Cooks_Distance <- cooks.distance(gamma_reduced)

plot(Cooks_Distance)

lev <- Cooks_Distance

length(lev[lev > mean(lev) * 2]) # count of observations exceeding twice the mean Cook's distance
## [1] 102
# check for influence -- no influential obs
which(influence.measures(gamma_reduced)$is.inf[,'cook.d'])
## named integer(0)
# check residuals vs fitted values; a well-fit model will show no pattern
scatter.smooth(predict(gamma_reduced, type='response'),
    rstandard(gamma_reduced, type='deviance'), col='gray')
# Density distributions: actual vs predicted values for target_amt, another check on model fit
shape_red <- gamma.shape(gamma_reduced) # extract shape parameter, alpha = 2.73, SE 0.097

predict_red <- predict(gamma_reduced, type='response', se = TRUE,
    dispersion = 1/shape_red$alpha)

summary(gamma_reduced, dispersion = 1/shape_red$alpha)
##
## Call:
## glm(formula = target_amt ~ bluebook, family = Gamma(link = log),
## data = linear_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.74975 -0.47658 -0.06705 0.27690 2.11637
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 8.042e+00 3.414e-02 235.55 <2e-16 ***
## bluebook 1.800e-05 1.974e-06 9.12 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Gamma family taken to be 0.3660863)
##
## Null deviance: 573.72 on 1403 degrees of freedom
## Residual deviance: 544.95 on 1402 degrees of freedom
## AIC: 25559
##
## Number of Fisher Scoring iterations: 5
shape1_df <- cbind(linear_train, predict_red)

shape1_df%>%
    dplyr::select(target_amt, fit)%>%
    pivot_longer(cols=c('target_amt', 'fit'), names_to = 'Response', values_to = 'Cost')%>%
    ggplot(aes(Cost, fill=Response))+
    geom_density(alpha=.25)+
    theme_minimal()+
    labs(title = 'Density Distributions for Actual & Fitted Values: Gamma Reduced Model')
Building on the reduced model, we construct a gamma regression model with ‘car_type’ entered in interaction terms and variable selection via stepwise AIC.
The result has the following form:
target_amt ~ bluebook + car_type + avg_clm + bluebook:car_type + car_type:avg_clm
The model AIC is now 25556 and the residual deviance is 532.
gamma_interact <- glm(target_amt ~ bluebook*car_type + car_age*car_type + avg_clm*car_type, family=Gamma(link='log'), data=linear_train)

gamma_reduced_int <- step(gamma_interact) # --> target_amt ~ bluebook + car_type + avg_clm + bluebook:car_type + car_type:avg_clm
## Start: AIC=25564.64
## target_amt ~ bluebook * car_type + car_age * car_type + avg_clm *
## car_type
##
## Df Deviance AIC
## - car_type:car_age 5 531.80 25557
## <none> 531.11 25565
## - car_type:avg_clm 5 534.71 25565
## - bluebook:car_type 5 537.72 25574
##
## Step: AIC=25556.56
## target_amt ~ bluebook + car_type + car_age + avg_clm + bluebook:car_type +
## car_type:avg_clm
##
## Df Deviance AIC
## - car_age 1 532.18 25556
## <none> 531.80 25557
## - car_type:avg_clm 5 535.35 25557
## - bluebook:car_type 5 538.77 25567
##
## Step: AIC=25555.63
## target_amt ~ bluebook + car_type + avg_clm + bluebook:car_type +
## car_type:avg_clm
##
## Df Deviance AIC
## <none> 532.18 25556
## - car_type:avg_clm 5 535.87 25556
## - bluebook:car_type 5 539.21 25566
summary(gamma_reduced_int)
##
## Call:
## glm(formula = target_amt ~ bluebook + car_type + avg_clm + bluebook:car_type +
## car_type:avg_clm, family = Gamma(link = "log"), data = linear_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.75007 -0.46452 -0.07547 0.27827 2.25739
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.991e+00 9.525e-02 83.896 < 2e-16 ***
## bluebook 2.005e-05 6.115e-06 3.279 0.00107 **
## car_typeSUV -1.643e-02 1.179e-01 -0.139 0.88917
## car_typePickup 3.186e-02 1.189e-01 0.268 0.78868
## car_typePanel Truck 6.674e-01 2.383e-01 2.800 0.00517 **
## car_typeMinivan -2.182e-01 1.367e-01 -1.597 0.11052
## car_typeVan 3.996e-01 2.491e-01 1.604 0.10889
## avg_clm -4.986e-06 6.039e-06 -0.826 0.40912
## bluebook:car_typeSUV 2.541e-06 7.949e-06 0.320 0.74931
## bluebook:car_typePickup -2.018e-07 7.745e-06 -0.026 0.97922
## bluebook:car_typePanel Truck -2.336e-05 9.400e-06 -2.485 0.01307 *
## bluebook:car_typeMinivan 1.632e-05 8.447e-06 1.932 0.05360 .
## bluebook:car_typeVan -2.090e-05 1.227e-05 -1.703 0.08874 .
## car_typeSUV:avg_clm 1.577e-05 7.446e-06 2.118 0.03434 *
## car_typePickup:avg_clm 9.163e-06 7.820e-06 1.172 0.24149
## car_typePanel Truck:avg_clm -1.268e-05 1.069e-05 -1.186 0.23590
## car_typeMinivan:avg_clm 4.659e-06 8.340e-06 0.559 0.57656
## car_typeVan:avg_clm 3.363e-06 1.067e-05 0.315 0.75266
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Gamma family taken to be 0.3417136)
##
## Null deviance: 573.72 on 1403 degrees of freedom
## Residual deviance: 532.18 on 1386 degrees of freedom
## AIC: 25556
##
## Number of Fisher Scoring iterations: 6
glm3 <- glance(gamma_reduced_int)
The requirements for a valid GLM are explored below and summarized here for the reduced gamma model with interactions.
Findings:
Overall, the diagnostics for the reduced gamma model (main effects) and the reduced gamma model with interactions are comparable.
# plot residuals vs the linear predictor -- looking for constant variance in the deviance residuals
# (a lack of pattern indicates a suitable link)
plot(residuals(gamma_reduced_int) ~ predict(gamma_reduced_int, type='link'),
    xlab=expression(hat(eta)), ylab='Deviance Residuals')

plot(density(rstandard(gamma_reduced_int, type='deviance')), col='blue')

# check for independence of observations
scatter.smooth(rstandard(gamma_reduced_int, type='deviance'), col='gray')

# check for high leverage
Cooks_Distance <- cooks.distance(gamma_reduced_int)

plot(Cooks_Distance)

lev <- Cooks_Distance

length(lev[lev > mean(lev) * 2]) # count of observations exceeding twice the mean Cook's distance
## [1] 100
# check for influence -- no influential obs
which(influence.measures(gamma_reduced_int)$is.inf[,'cook.d'])
## named integer(0)
# check residuals vs fitted values; a well-fit model will show no pattern
scatter.smooth(predict(gamma_reduced_int, type='response'),
    rstandard(gamma_reduced_int, type='deviance'), col='gray')
# check actual vs predicted values for target_amt, another check on model fit
plot(density(linear_train$target_amt), ylim=c(0, .0008), main='Actual vs Fitted (red): Target_Amt')
lines(density(predict(gamma_reduced_int, type='response')), col='red')
# extract the gamma shape parameter and use it to set the dispersion for prediction and summary
shape_int <- gamma.shape(gamma_reduced_int) # alpha = 2.79, SE 0.099

predict_int <- predict(gamma_reduced_int, type='response', se = TRUE,
    dispersion = 1/shape_int$alpha)

summary(gamma_reduced_int, dispersion = 1/shape_int$alpha)
##
## Call:
## glm(formula = target_amt ~ bluebook + car_type + avg_clm + bluebook:car_type +
## car_type:avg_clm, family = Gamma(link = "log"), data = linear_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.75007 -0.46452 -0.07547 0.27827 2.25739
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 7.991e+00 9.748e-02 81.971 < 2e-16 ***
## bluebook 2.005e-05 6.259e-06 3.204 0.00135 **
## car_typeSUV -1.643e-02 1.206e-01 -0.136 0.89168
## car_typePickup 3.186e-02 1.216e-01 0.262 0.79337
## car_typePanel Truck 6.674e-01 2.439e-01 2.736 0.00622 **
## car_typeMinivan -2.182e-01 1.399e-01 -1.560 0.11870
## car_typeVan 3.996e-01 2.550e-01 1.567 0.11701
## avg_clm -4.986e-06 6.181e-06 -0.807 0.41981
## bluebook:car_typeSUV 2.541e-06 8.136e-06 0.312 0.75483
## bluebook:car_typePickup -2.018e-07 7.927e-06 -0.025 0.97969
## bluebook:car_typePanel Truck -2.336e-05 9.621e-06 -2.428 0.01518 *
## bluebook:car_typeMinivan 1.632e-05 8.645e-06 1.887 0.05911 .
## bluebook:car_typeVan -2.090e-05 1.256e-05 -1.664 0.09607 .
## car_typeSUV:avg_clm 1.577e-05 7.621e-06 2.070 0.03849 *
## car_typePickup:avg_clm 9.163e-06 8.003e-06 1.145 0.25226
## car_typePanel Truck:avg_clm -1.268e-05 1.094e-05 -1.159 0.24662
## car_typeMinivan:avg_clm 4.659e-06 8.536e-06 0.546 0.58525
## car_typeVan:avg_clm 3.363e-06 1.092e-05 0.308 0.75810
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Gamma family taken to be 0.3579492)
##
## Null deviance: 573.72 on 1403 degrees of freedom
## Residual deviance: 532.18 on 1386 degrees of freedom
## AIC: 25556
##
## Number of Fisher Scoring iterations: 6
shape2_df <- cbind(linear_train, predict_int)

shape2_df%>%
    dplyr::select(target_amt, fit)%>%
    pivot_longer(cols=c('target_amt', 'fit'), names_to = 'Response', values_to = 'Amount')%>%
    ggplot(aes(Amount, fill=Response))+
    geom_density(alpha=.25)+
    theme_minimal()+
    labs(title = 'Density Distributions for Actual & Fitted Values: Gamma Interaction Model')
We cannot make direct comparisons between the log-normal models and the gamma models: the log-normal models are fit to log(target_amt), so their AIC, BIC, and deviance are not on the same scale. Given that the log-normal models also failed several of the ordinary least squares requirements, comparisons are limited to the gamma models.
The gamma model with interactions has the lowest AIC and residual deviance among the gamma models (its BIC is higher than that of the main-effects model, reflecting the added interaction parameters). On this basis, it is the preferred choice among the included models. Given the relatively poor fit to the training data, however, it is unlikely to generalize well.
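Because AIC and BIC are in-sample criteria, the two candidate gamma models can also be compared out of sample with K-fold cross-validation. A minimal sketch using boot::cv.glm (assuming the boot package is installed); delta[1] is the cross-validated average squared prediction error on the response scale, so lower is better.
# 10-fold cross-validated prediction error for each gamma model
set.seed(123344)
cv_main <- boot::cv.glm(linear_train, gamma_reduced, K = 10)$delta[1]
cv_int <- boot::cv.glm(linear_train, gamma_reduced_int, K = 10)$delta[1]

c(main_effects = cv_main, interactions = cv_int)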
# build a table to compare results of the gamma models
col_names <- c('Gamma - Null', 'Gamma-No Interactions', 'Gamma-Interactions')

glm_results <- rbind(glm1, glm2, glm3)

(compare_gammas <- cbind(col_names, glm_results)%>%
    dplyr::select(c(col_names, AIC, BIC, deviance))%>%
    rename(Model = col_names)%>%
    flextable()%>%
    set_caption("Comparison of Gamma Model Results"))
Model | AIC | BIC | deviance |
---|---|---|---|
Gamma - Null | 25,633.98 | 25,644.47 | 573.7193 |
Gamma-No Interactions | 25,559.02 | 25,574.77 | 544.9491 |
Gamma-Interactions | 25,555.63 | 25,655.33 | 532.1792 |
To evaluate how our selected model generalizes and performs, we apply it to the hold-out data (‘linear_test’). Comparing predicted vs. actual values (scatterplot and density distributions), it is clear that the model does not perform well on the available data. This is not surprising given that claim payments are determined by a number of factors not included here.
For example:
Refitting the full gamma interaction model on the holdout data, none of the covariates reach significance at alpha = 0.05 (see the summary below). Model metrics for AIC, BIC, and deviance were 8498.3, 8601.958, and 154.8157, respectively. These values are far lower than those reported for the training dataset, which largely reflects the smaller holdout sample.
# refit the model on the holdout data
gamma_test <- glm(target_amt ~ bluebook*car_type + car_age*car_type + avg_clm*car_type, family=Gamma(link='log'), data=linear_test)

# note: no covariates reach significance at alpha = 0.05 in the holdout refit

# collect predictions and print the summary, accounting for the gamma shape parameter
shape_test <- gamma.shape(gamma_test) # alpha = 3.17, SE 0.19

predict_test <- predict(gamma_test, type='response', se = TRUE,
    dispersion = 1/shape_test$alpha)

summary(gamma_test, dispersion = 1/shape_test$alpha) # print summary
##
## Call:
## glm(formula = target_amt ~ bluebook * car_type + car_age * car_type +
## avg_clm * car_type, family = Gamma(link = "log"), data = linear_test)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.22987 -0.46243 -0.06145 0.26695 1.88389
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 8.154e+00 1.746e-01 46.713 <2e-16 ***
## bluebook 7.708e-06 1.031e-05 0.748 0.4547
## car_typeSUV -6.928e-03 2.209e-01 -0.031 0.9750
## car_typePickup -1.354e-01 2.120e-01 -0.639 0.5230
## car_typePanel Truck -3.353e-01 4.992e-01 -0.672 0.5018
## car_typeMinivan -2.490e-01 2.919e-01 -0.853 0.3936
## car_typeVan 2.794e-01 4.498e-01 0.621 0.5344
## car_age 8.369e-03 1.415e-02 0.591 0.5543
## avg_clm -6.501e-06 6.705e-06 -0.970 0.3323
## bluebook:car_typeSUV -2.416e-06 1.440e-05 -0.168 0.8668
## bluebook:car_typePickup 1.030e-05 1.338e-05 0.770 0.4416
## bluebook:car_typePanel Truck 1.956e-05 1.867e-05 1.047 0.2949
## bluebook:car_typeMinivan 7.946e-06 1.603e-05 0.496 0.6201
## bluebook:car_typeVan -1.650e-05 2.282e-05 -0.723 0.4696
## car_typeSUV:car_age -2.586e-03 1.678e-02 -0.154 0.8775
## car_typePickup:car_age -4.270e-03 1.796e-02 -0.238 0.8121
## car_typePanel Truck:car_age -2.748e-03 1.994e-02 -0.138 0.8904
## car_typeMinivan:car_age 6.771e-03 1.861e-02 0.364 0.7161
## car_typeVan:car_age -3.090e-03 1.990e-02 -0.155 0.8766
## car_typeSUV:avg_clm 7.490e-06 8.764e-06 0.855 0.3927
## car_typePickup:avg_clm 1.130e-05 1.050e-05 1.076 0.2818
## car_typePanel Truck:avg_clm 5.529e-06 2.854e-05 0.194 0.8464
## car_typeMinivan:avg_clm 3.009e-05 1.725e-05 1.744 0.0812 .
## car_typeVan:avg_clm 5.631e-06 1.131e-05 0.498 0.6186
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Gamma family taken to be 0.3151184)
##
## Null deviance: 169.06 on 466 degrees of freedom
## Residual deviance: 154.82 on 443 degrees of freedom
## AIC: 8498.3
##
## Number of Fisher Scoring iterations: 6
glm4 <- glance(gamma_test)%>% # collect select metrics
    dplyr::select(c(AIC, BIC, deviance))

# check fits using density distributions of actual and predicted values
shape3_df <- cbind(linear_test, predict_test)

p1 <- shape3_df%>%
    dplyr::select(target_amt, fit)%>%
    pivot_longer(cols=c('target_amt', 'fit'), names_to = 'Response', values_to = 'Amount')%>%
    ggplot(aes(Amount, fill=Response))+
    geom_density(alpha=.25)+
    theme_minimal()+
    labs(title = 'Density Distributions for Actual & Predicted Values: Hold-Out Data')

p2 <- ggplot(shape3_df, aes(x=target_amt, y=fit))+
    geom_point()+
    geom_smooth(method=lm, se=TRUE)+
    theme_minimal()+
    labs(title='Predicted vs. Actual Claim Payments: Hold-Out Data')

(p1/p2) # print prediction plots
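A more direct check of generalization is to predict the holdout payouts from the model fit on the training data (rather than refitting) and summarize the error. A minimal sketch:
# predict holdout payouts from the training-data model and summarize the error
holdout_pred <- predict(gamma_reduced_int, newdata = linear_test, type = "response")

c(RMSE = sqrt(mean((linear_test$target_amt - holdout_pred)^2)),
  MAE = mean(abs(linear_test$target_amt - holdout_pred)))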
This study sought to classify drivers based on their probability of an auto accident and to then predict the amount of claim payout associated with that accident. The covariates provided in the dataset were sufficient to build/select an effective logistic classifier. Predicting claim payout proved more problematic - owing to the small number of covariates suited for individual payout predictions as well as lack of information on how payment amount was derived. In a number of instances, payments exceeded the ‘bluebook’ value of a vehicle. These observations were dropped in order to better focus on payments resulting from vehicle damage alone. That said, claim determinations are typically influenced by such factors as:
The data included here only provided indirect information tied to the last factor (cost due to damage). Nonetheless, the modeling process was instructive. After comparing results for various log-normal and gamma regression models, a final selection was made on the basis of model validity as well as AIC, BIC, and deviance scores. The selected model was a gamma regression (log link) with two numerical covariates (‘bluebook’ and ‘avg_clm’), the factor ‘car_type’, and their interactions. Applied to a holdout dataset, the model underestimated low and high values and overestimated mid-range values, and no covariate reached significance when the model form was refit to the holdout data. Given these results, a differentiated model based on binned values for ‘bluebook’ may improve the accuracy of payment predictions (one possible implementation is sketched below).
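As one way to operationalize that suggestion, ‘bluebook’ could be binned into value bands and entered as a factor (or used to fit separate models per band). A minimal sketch with illustrative, not prescribed, break points:
# bin bluebook into illustrative value bands and refit the gamma model
linear_train$bluebook_band <- cut(linear_train$bluebook,
    breaks = c(0, 10000, 20000, 30000, Inf),
    labels = c('low', 'mid', 'high', 'premium'))

gamma_banded <- glm(target_amt ~ bluebook_band + car_type + avg_clm,
    family = Gamma(link = 'log'), data = linear_train)

summary(gamma_banded) # fitting separate models per band is an alternative to the factor approach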