The purpose of this work is to estimate the probability that an individual (seeking auto insurance) will be in an accident and then to forecast the potential cost of an ensuing claim. A synthetic data set of ~ 8000 observations will be used to construct predictive models for this purpose. The use of synthetic data permits model development absent proprietary information, while also providing a heuristic for quantitative reasoning. Further discussion regarding the dataset can be found in Section 7.
The dataset includes two response variables. The first, ‘target_flag’, is scored as 1 or 0, indicating that an individual has or has not been in an accident, respectively. On this basis, a score of 1 is considered a ‘positive’ result and 0 a ‘negative’ result. The second response variable, ‘target_amt’, refers to the amount paid on an individual’s claim given that an accident occurred. It is a numerical variable ranging from $0 (no accident/claim) upward.
A binary logistic classification model will be constructed to estimate the probability of class membership (1 or 0); log-normal and gamma regression models will be employed to predict payout cost in the event of an accident. This statistical approach builds on classical methods for model training/testing (which underlie more modern, machine learning, strategies for classification and prediction) and invites more explicit consideration of model assumptions and diagnostics.
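For context, the two models can later be combined to estimate an expected claim cost per policyholder. The sketch below is illustrative only; ‘freq_fit’, ‘sev_fit’, and ‘new_clients’ are hypothetical placeholders for a fitted classification model, a fitted severity model, and a data frame of new applicants.
# Illustrative only: combine a fitted frequency (logistic) model with a fitted
# severity model (e.g., gamma with log link) to estimate expected claim cost.
p_accident <- predict(freq_fit, newdata = new_clients, type = "response")          # P(crash)
cost_given_accident <- predict(sev_fit, newdata = new_clients, type = "response")  # E(cost | crash)
expected_cost <- p_accident * cost_given_accident                                  # expected payout per applicant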
The data include the following variables:
Variable | Description |
---|---|
TARGET_FLAG | Was car in crash? 1=Yes, 0=No |
TARGET_AMT | If crash, what was the cost |
AGE | Age of driver |
BLUEBOOK | Value of vehicle |
CAR_AGE | Age of car |
CAR_TYPE | Type of car |
CAR_USE | Vehicle use |
CLM_FREQ | # Claims (past 5 yrs) |
EDUCATION | Max educational level |
HOMEKIDS | # Children at home |
HOME_VAL | Home value |
INCOME | Annual income |
JOB | Job category |
KIDSDRIV | # Driving children |
MSTATUS | Marital status |
MVR_PTS | Motor vehicle record points |
OLDCLAIM | Total claims (past 5 yrs) |
PARENT1 | Single parent |
RED_CAR | A red car |
REVOKED | License revoked (past 7 yrs) |
SEX | Gender |
TIF | Time as customer |
TRAVTIME | Distance to work |
URBANICITY | Home/work area |
YOJ | Years on job |
The project workflow will be represented and discussed following the approach of Wickham and Grolemund (2017):
Project Workflow
# Import libraries
library(tidyverse) # cleaning & wrangling functions
library(janitor) # data cleaning
library(magrittr) # piping
library(flextable) #table layout and formatting
library(dlookr) # exploratory data analysis functions
library(patchwork) # easy plot layout
library(ggpubr) # creates a wrapper for plotting a list
library(viridis) #coloration for box plots
library(broom) # creates a tidy data frame from statistical test results
library(InformationValue) # optimize threshold
library(caret) # modeling
library(ROCR) #roc curve
library(car) # marginal plots
library(MASS) # use for gamma shape function
# Read in data
<- "https://raw.githubusercontent.com/sconnin/insurance_model/main/insurance_dataset.csv"
path
<- read_csv(path) raw
Data processing is required to put variables in tidy form for model building as well as to support initial exploration and visualization. Additional steps are taken to facilitate interpretation of feature variables.
The following processing steps are included here:
# Copy data into working dataframe for downstream use
df <- raw %>%
  clean_names() %>%                                                       # initial clean of col names
  remove_empty(c("rows", "cols")) %>%                                     # remove any empty rows and cols
  distinct() %>%                                                          # remove duplicate rows
  mutate_if(is_character, str_replace_all, '\\$|,|z_|<', '') %>%          # clean any special chars in character variables
  dplyr::select(-index) %>%                                               # remove index
  mutate_if(is_character, str_replace_all, "No|no", 'N') %>%
  mutate_if(is_character, str_replace_all, "Yes|yes", 'Y') %>%
  mutate_if(is_character, str_replace_all, "Highly Urban/ Urban", 'Urban') %>%
  mutate_if(is_character, str_replace_all, "Highly Rural/ Rural", 'Rural') %>%
  mutate_at(vars(income, home_val, bluebook, oldclaim), funs(as.numeric)) %>%  # correct variable type: char to numeric
  mutate_if(is.numeric, round) %>%                                        # round out our numerics
  mutate_if(is_character, ~ factor(.)) %>%                                # convert all character variables to factor for modeling
  mutate(education = fct_relevel(education, c("High School", "Bachelors", "Masters", "PhD"))) %>%  # relevel to show educational attainment steps
  mutate(car_type = fct_relevel(car_type, c("Sports Car", "SUV", "Minivan", "Van", "Pickup", "Panel Truck")))
Recall that the variable target_flag indicates whether or not an individual was in an accident (1 = Yes, 0 = No).
target_flag (level = 0) includes 6008 observations and accounts for ~74% of the data.
target_flag (level = 1) includes 2153 observations and accounts for ~26% of the data.
# Subset levels of target_flag into new dataframes for analyses
target_0 <- df %>%
  dplyr::select(-target_amt) %>%
  filter(target_flag == 0)      # obs not associated with automobile accidents

target_1 <- df %>%
  dplyr::select(-target_amt) %>%
  filter(target_flag == 1)      # obs associated with automobile accidents

# Calculate the proportion of data in each subset of target_flag
df %>%
  group_by(target_flag) %>%
  summarise(n = n()) %>%
  mutate(freq = n / sum(n)) %>%
  flextable() %>%
  set_caption("Count and Frequency of Target Levels")
target_flag | n | freq |
0 | 6,008 | 0.7361843 |
1 | 2,153 | 0.2638157 |
Statistical summaries for the numerical covariates are presented below for each level of target_flag.
# Calculate statistics summary for numeric variables at each level of target_flag
target_0 %>%
  dplyr::select(-target_flag) %>%
  diagnose_numeric() %>%
  dplyr::select(variables, min, mean, median, max, zero, minus, outlier) %>%
  flextable() %>%
  set_caption("Summary Statistics for Target_Flag = 0")
variables | min | mean | median | max | zero | minus | outlier |
kidsdriv | 0 | 0.1393142 | 0 | 4 | 5,407 | 0 | 601 |
age | 16 | 45.3227901 | 46 | 81 | 0 | 0 | 47 |
homekids | 0 | 0.6439747 | 0 | 5 | 4,116 | 0 | 559 |
yoj | 0 | 10.6718337 | 12 | 23 | 379 | 0 | 398 |
income | 0 | 65,951.9700335 | 58,368 | 367,030 | 365 | 0 | 156 |
home_val | 0 | 169,075.4123566 | 175,052 | 885,282 | 1,448 | 0 | 5 |
travtime | 5 | 33.0251332 | 32 | 142 | 0 | 0 | 69 |
bluebook | 1,500 | 16,230.9487350 | 15,000 | 69,740 | 0 | 0 | 74 |
tif | 1 | 5.5557590 | 6 | 25 | 0 | 0 | 29 |
oldclaim | 0 | 3,311.5948735 | 0 | 53,986 | 4,111 | 0 | 622 |
clm_freq | 0 | 0.6486352 | 0 | 5 | 4,111 | 0 | 583 |
mvr_pts | 0 | 1.4137816 | 1 | 11 | 2,998 | 0 | 280 |
car_age | 0 | 8.6709220 | 9 | 28 | 3 | 0 | 2 |
target_1 %>%
  dplyr::select(-target_flag) %>%
  diagnose_numeric() %>%
  dplyr::select(variables, min, mean, median, max, zero, minus, outlier) %>%
  flextable() %>%
  set_caption("Summary Statistics for Target_Flag = 1")
variables | min | mean | median | max | zero | minus | outlier |
kidsdriv | 0 | 0.2596377 | 0 | 4 | 1,773 | 0 | 380 |
age | 16 | 43.3012104 | 43 | 76 | 0 | 0 | 7 |
homekids | 0 | 0.9368323 | 0 | 5 | 1,173 | 0 | 0 |
yoj | 0 | 10.0167488 | 11 | 19 | 246 | 0 | 250 |
income | 0 | 50,641.2980910 | 43,971 | 320,127 | 250 | 0 | 77 |
home_val | 0 | 115,256.5541339 | 114,565 | 750,455 | 846 | 0 | 11 |
travtime | 5 | 34.7710172 | 34 | 97 | 0 | 0 | 12 |
bluebook | 1,500 | 14,255.8987459 | 12,600 | 62,240 | 0 | 0 | 36 |
tif | 1 | 4.7807710 | 4 | 21 | 0 | 0 | 27 |
oldclaim | 0 | 6,061.5499303 | 2,448 | 57,037 | 898 | 0 | 233 |
clm_freq | 0 | 1.2169066 | 1 | 5 | 898 | 0 | 0 |
mvr_pts | 0 | 2.4816535 | 2 | 13 | 714 | 0 | 11 |
car_age | -3 | 7.3674789 | 7 | 25 | 0 | 1 | 0 |
# Calculate percentage of zero values in numerical covariates for each level of target_flag
select_col <- as_mapper(~ all(is.na(.) | . < 1))    # purrr mapper for possible downstream use

df %>%
  group_by(target_flag) %>%                          # note target must be numerical type for the following to run
  dplyr::select(where(is.numeric) & -target_amt) %>%
  summarise_each(~ round(sum(. == 0) / length(.) * 100)) %>%
  purrr::discard(select_col) %>%
  flextable() %>%
  set_caption("Percentage of Zeros in Numerical Covariates by Level of Target_Flag")
target_flag | kidsdriv | homekids | oldclaim | clm_freq | mvr_pts |
0 | 90 | 69 | 68 | 68 | 50 |
1 | 82 | 54 | 42 | 42 | 33 |
# Create statistical summaries for categorical variables at each level of target_flag
target_0 %>%
  diagnose_category() %>%
  flextable() %>%
  set_caption("Summary Statistics for Categorical Variables: Target Level = 0")
variables | levels | N | freq | ratio | rank |
parent1 | N | 6,008 | 5,407 | 89.996671 | 1 |
parent1 | Y | 6,008 | 601 | 10.003329 | 2 |
mstatus | Y | 6,008 | 3,841 | 63.931425 | 1 |
mstatus | N | 6,008 | 2,167 | 36.068575 | 2 |
sex | F | 6,008 | 3,183 | 52.979361 | 1 |
sex | M | 6,008 | 2,825 | 47.020639 | 2 |
education | High School | 6,008 | 2,355 | 39.197736 | 1 |
education | Bachelors | 6,008 | 1,719 | 28.611851 | 2 |
education | Masters | 6,008 | 1,331 | 22.153795 | 3 |
education | PhD | 6,008 | 603 | 10.036618 | 4 |
job | Blue Collar | 6,008 | 1,191 | 19.823569 | 1 |
job | Clerical | 6,008 | 900 | 14.980027 | 2 |
job | Professional | 6,008 | 870 | 14.480692 | 3 |
job | Manager | 6,008 | 851 | 14.164447 | 4 |
job | Lawyer | 6,008 | 682 | 11.351531 | 5 |
job | Home Maker | 6,008 | 461 | 7.673103 | 6 |
job | Student | 6,008 | 446 | 7.423435 | 7 |
job |  | 6,008 | 390 | 6.491345 | 8 |
job | Doctor | 6,008 | 217 | 3.611851 | 9 |
car_use | Private | 6,008 | 4,026 | 67.010652 | 1 |
car_use | Commercial | 6,008 | 1,982 | 32.989348 | 2 |
car_type | Minivan | 6,008 | 1,796 | 29.893475 | 1 |
car_type | SUV | 6,008 | 1,616 | 26.897470 | 2 |
car_type | Pickup | 6,008 | 946 | 15.745672 | 3 |
car_type | Sports Car | 6,008 | 603 | 10.036618 | 4 |
car_type | Van | 6,008 | 549 | 9.137816 | 5 |
car_type | Panel Truck | 6,008 | 498 | 8.288948 | 6 |
red_car | N | 6,008 | 4,246 | 70.672437 | 1 |
red_car | Y | 6,008 | 1,762 | 29.327563 | 2 |
revoked | N | 6,008 | 5,451 | 90.729028 | 1 |
revoked | Y | 6,008 | 557 | 9.270972 | 2 |
urbanicity | Urban | 6,008 | 4,454 | 74.134487 | 1 |
urbanicity | Rural | 6,008 | 1,554 | 25.865513 | 2 |
target_1 %>%
  diagnose_category() %>%
  flextable() %>%
  set_caption("Summary Statistics for Categorical Variables: Target Level = 1")
variables | levels | N | freq | ratio | rank |
parent1 | N | 2,153 | 1,677 | 77.891314 | 1 |
parent1 | Y | 2,153 | 476 | 22.108686 | 2 |
mstatus | N | 2,153 | 1,100 | 51.091500 | 1 |
mstatus | Y | 2,153 | 1,053 | 48.908500 | 2 |
sex | F | 2,153 | 1,192 | 55.364608 | 1 |
sex | M | 2,153 | 961 | 44.635392 | 2 |
education | High School | 2,153 | 1,178 | 54.714352 | 1 |
education | Bachelors | 2,153 | 523 | 24.291686 | 2 |
education | Masters | 2,153 | 327 | 15.188110 | 3 |
education | PhD | 2,153 | 125 | 5.805852 | 4 |
job | Blue Collar | 2,153 | 634 | 29.447283 | 1 |
job | Clerical | 2,153 | 371 | 17.231770 | 2 |
job | Student | 2,153 | 266 | 12.354854 | 3 |
job | Professional | 2,153 | 247 | 11.472364 | 4 |
job | Home Maker | 2,153 | 180 | 8.360427 | 5 |
job | Lawyer | 2,153 | 153 | 7.106363 | 6 |
job | Manager | 2,153 | 137 | 6.363214 | 7 |
job |  | 2,153 | 136 | 6.316767 | 8 |
job | Doctor | 2,153 | 29 | 1.346958 | 9 |
car_use | Private | 2,153 | 1,106 | 51.370181 | 1 |
car_use | Commercial | 2,153 | 1,047 | 48.629819 | 2 |
car_type | SUV | 2,153 | 678 | 31.490943 | 1 |
car_type | Pickup | 2,153 | 443 | 20.575941 | 2 |
car_type | Minivan | 2,153 | 349 | 16.209940 | 3 |
car_type | Sports Car | 2,153 | 304 | 14.119833 | 4 |
car_type | Van | 2,153 | 201 | 9.335810 | 5 |
car_type | Panel Truck | 2,153 | 178 | 8.267534 | 6 |
red_car | N | 2,153 | 1,537 | 71.388760 | 1 |
red_car | Y | 2,153 | 616 | 28.611240 | 2 |
revoked | N | 2,153 | 1,710 | 79.424059 | 1 |
revoked | Y | 2,153 | 443 | 20.575941 | 2 |
urbanicity | Urban | 2,153 | 2,038 | 94.658616 | 1 |
urbanicity | Rural | 2,153 | 115 | 5.341384 | 2 |
We set to missing (NA) the single observation where ‘car_age’ is < 0, given that a negative age is impossible. This appears to be a data-entry error, as there are no other negative values for this variable.
# Set the car_age data-entry error (obs. value = -3) to NA
df$car_age[df$car_age < 0] <- NA
Histograms and QQ-Plots indicate that distributions for ‘target_flag’, ‘kidsdriv’, ‘homekids’, ‘yoj’, ‘income’, ‘oldclaim’, ‘clm_freq’, and ‘mvr_pts’ depart from a normal model. This is confirmed using a Shapiro-Wilk test of normality for numerical values. Transformation of these variables may be desirable during model building. Log, square root, and polynomial transformations may be particularly helpful in this regard.
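As an illustration only (not part of the workflow below), candidate transformations can be previewed directly on the working dataframe; log1p() and sqrt() are shown for two of the right-skewed covariates, and the derived column names are hypothetical.
# Illustrative only: preview skew-reducing transformations for two covariates
df %>%
  mutate(log_income = log1p(income),          # log(1 + x) tolerates the many zero incomes
         sqrt_oldclaim = sqrt(oldclaim)) %>%
  dplyr::select(log_income, sqrt_oldclaim) %>%
  plot_normality()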
From an interpretive standpoint, it is worth noting the high counts of zero values for the following covariates: ‘tif’, ‘oldclaim’, ‘car_age’, ‘income’, and ‘home_val’, particularly the latter two. A zero value for ‘income’ may indicate that an individual is unemployed and/or a student/dependent. A zero value for ‘home_val’ may indicate that an individual is a renter, a dependent, or homeless.
These details can be important in establishing the costs and benefits of insuring a driver, and their absence (as in this dataset) can limit the discriminatory ability of a classification model.
# Evaluate and rank departures from normality for numerical vars - Shapiro-Wilk
df %>%
  as.data.frame() %>%
  dplyr::select(-target_amt, target_flag) %>%
  normality() %>%
  filter(p_value <= 0.01) %>%
  arrange(abs(p_value))
# Identify skewness in covariates across levels of target_flag
target_0 %>% find_skewness(index = FALSE, thres = TRUE)
## [1] "kidsdriv" "homekids" "yoj" "income" "oldclaim" "clm_freq" "mvr_pts"
target_1 %>% find_skewness(index = FALSE, thres = TRUE)
## [1] "kidsdriv" "yoj" "income" "tif" "oldclaim"
# Identify potential transformations to address skew; this does not resolve high 0 counts
df %>%
  dplyr::select(age, yoj, income, home_val, travtime, bluebook, tif, oldclaim, car_age) %>%
  plot_normality()
## Outliers
The variables kidsdriv, oldclaim, homekids, clm_freq, and yoj contain the highest percentages of outliers (~6-17%, depending on target_flag level). The percentage of outliers is generally greater for observations associated with accidents (target_flag = 1) than for those without (target_flag = 0).
The extent to which these and other outliers influence model parametrization is not yet known and should be evaluated as part of the model building process.
# Identify percentage of outliers for each covariate across levels of target_flag
diagnose_outlier(target_0) %>%
  filter(outliers_ratio > 5) %>%
  mutate(ratio_means = outliers_mean / with_mean) %>%   # mean val of outliers / total mean
  arrange(desc(ratio_means)) %>%
  dplyr::select(variables, outliers_ratio, ratio_means) %>%
  flextable() %>%
  set_caption("Outlier Summary: Target_Flag = 0")
variables | outliers_ratio | ratio_means |
kidsdriv | 10.003329 | 9.99667221 |
oldclaim | 10.352863 | 6.94574452 |
homekids | 9.304261 | 5.00025199 |
clm_freq | 9.703728 | 4.97416009 |
yoj | 6.624501 | 0.01742246 |
diagnose_outlier(target_1) %>%
  filter(outliers_ratio > 5) %>%
  mutate(ratio_means = outliers_mean / with_mean) %>%
  arrange(desc(ratio_means)) %>%
  dplyr::select(variables, outliers_ratio, ratio_means) %>%
  flextable() %>%
  set_caption("Outlier Summary: Target_Flag = 1")
variables | outliers_ratio | ratio_means |
kidsdriv | 17.64979 | 5.665789474 |
oldclaim | 10.82211 | 5.302250158 |
yoj | 11.61170 | 0.003194649 |
The dataset includes missing observations in one categorical variable (‘job’) and five numerical variables (‘car_age’, ‘home_val’, ‘yoj’, ‘income’, ‘age’). Across covariates (i.e., predictor variables), the extent of missingness is < 6.5% of the total observations. There are no obvious patterns in the missing data that warrant concern.
# Basic missing-data tables by target level
target_0 %>%
  diagnose() %>%
  dplyr::select(-unique_count, -unique_rate) %>%
  filter(missing_count > 0) %>%
  arrange(desc(missing_count)) %>%
  flextable() %>%
  set_caption("Missing Data Summary: Target Level = 0")
variables | types | missing_count | missing_percent |
job | factor | 390 | 6.49134487 |
car_age | numeric | 368 | 6.12516644 |
home_val | numeric | 343 | 5.70905459 |
income | numeric | 335 | 5.57589880 |
yoj | numeric | 331 | 5.50932091 |
age | numeric | 1 | 0.01664447 |
target_1 %>%
  diagnose() %>%
  dplyr::select(-unique_count, -unique_rate) %>%
  filter(missing_count > 0) %>%
  arrange(desc(missing_count)) %>%
  flextable() %>%
  set_caption("Missing Data Summary: Target Level = 1")
variables | types | missing_count | missing_percent |
car_age | numeric | 142 | 6.5954482 |
job | factor | 136 | 6.3167673 |
yoj | numeric | 123 | 5.7129587 |
home_val | numeric | 121 | 5.6200650 |
income | numeric | 110 | 5.1091500 |
age | numeric | 5 | 0.2322341 |
# Plot missing data to visualize cumulative percentages and col intersections
df %>%
  plot_na_pareto(only_na = TRUE)

df %>%
  plot_na_intersect()
## Impute Missing Values
Given the relatively limited extent of missing data, an argument can be made for imputing these values rather than removing the observations during model building. A variety of imputation methods are available for this purpose (e.g., mean/median/mode substitution, k-nearest neighbors, regression trees, and multivariate imputation by chained equations).
On the basis of initial experimentation, the following methods are selected here: MICE for ‘car_age’ and ‘job’, and recursive partitioning (rpart) for ‘home_val’, ‘yoj’, ‘income’, and ‘age’.
Comparison of density plots for the original and imputed values shows a good fit for each of the numerical covariates.
#Impute numerical vars using dlookr
#car_age
car_age <- imputate_na(df, car_age, target_flag, method = "mice", seed = 999)
##
## iter imp variable
## 1 1 age yoj income home_val job car_age
## 1 2 age yoj income home_val job car_age
## 1 3 age yoj income home_val job car_age
## 1 4 age yoj income home_val job car_age
## 1 5 age yoj income home_val job car_age
## 2 1 age yoj income home_val job car_age
## 2 2 age yoj income home_val job car_age
## 2 3 age yoj income home_val job car_age
## 2 4 age yoj income home_val job car_age
## 2 5 age yoj income home_val job car_age
## 3 1 age yoj income home_val job car_age
## 3 2 age yoj income home_val job car_age
## 3 3 age yoj income home_val job car_age
## 3 4 age yoj income home_val job car_age
## 3 5 age yoj income home_val job car_age
## 4 1 age yoj income home_val job car_age
## 4 2 age yoj income home_val job car_age
## 4 3 age yoj income home_val job car_age
## 4 4 age yoj income home_val job car_age
## 4 5 age yoj income home_val job car_age
## 5 1 age yoj income home_val job car_age
## 5 2 age yoj income home_val job car_age
## 5 3 age yoj income home_val job car_age
## 5 4 age yoj income home_val job car_age
## 5 5 age yoj income home_val job car_age
p1 <- plot(car_age) + theme_minimal() + theme(legend.position = "top")
summary(car_age)
## * Impute missing values based on Multivariate Imputation by Chained Equations
## - method : mice
## - random seed : 999
##
## * Information of Imputation (before vs after)
## Original Imputation
## n 7650.00000000 8161.00000000
## na 511.00000000 0.00000000
## mean 8.32980392 8.32782747
## sd 5.69964327 5.61856877
## se_mean 0.06516538 0.06219479
## IQR 11.00000000 9.00000000
## skewness 0.28251095 0.28530537
## kurtosis -0.74839431 -0.72313838
## p00 0.00000000 0.00000000
## p01 1.00000000 1.00000000
## p05 1.00000000 1.00000000
## p10 1.00000000 1.00000000
## p20 1.00000000 1.00000000
## p25 1.00000000 3.00000000
## p30 5.00000000 5.00000000
## p40 7.00000000 7.00000000
## p50 8.00000000 8.00000000
## p60 10.00000000 10.00000000
## p70 11.00000000 11.00000000
## p75 12.00000000 12.00000000
## p80 13.00000000 13.00000000
## p90 16.00000000 16.00000000
## p95 18.00000000 18.00000000
## p99 21.00000000 21.00000000
## p100 28.00000000 28.00000000
#home_val
home_val <- imputate_na(df, home_val, target_flag, method = "rpart")

p2 <- plot(home_val) + theme_minimal() + theme(legend.position = "top")
summary(home_val)
## * Impute missing values based on Recursive Partitioning and Regression Trees
## - method : rpart
##
## * Information of Imputation (before vs after)
## Original Imputation
## n 7.697000e+03 8.161000e+03
## na 4.640000e+02 0.000000e+00
## mean 1.548673e+05 1.549021e+05
## sd 1.291238e+05 1.276679e+05
## se_mean 1.471789e+03 1.413221e+03
## IQR 2.387240e+05 2.377673e+05
## skewness 4.887855e-01 5.011553e-01
## kurtosis -1.453837e-02 1.752216e-02
## p00 0.000000e+00 0.000000e+00
## p01 0.000000e+00 0.000000e+00
## p05 0.000000e+00 0.000000e+00
## p10 0.000000e+00 0.000000e+00
## p20 0.000000e+00 0.000000e+00
## p25 0.000000e+00 0.000000e+00
## p30 5.473000e+04 5.748428e+04
## p40 1.267360e+05 1.259510e+05
## p50 1.611600e+05 1.603330e+05
## p60 1.907740e+05 1.882360e+05
## p70 2.223548e+05 2.210090e+05
## p75 2.387240e+05 2.377673e+05
## p80 2.590012e+05 2.568280e+05
## p90 3.165426e+05 3.152530e+05
## p95 3.748710e+05 3.779250e+05
## p99 4.993365e+05 4.977460e+05
## p100 8.852820e+05 8.852820e+05
#yoj
yoj <- imputate_na(df, yoj, target_flag, method = "rpart")

p3 <- plot(yoj) + theme_minimal() + theme(legend.position = "top")
summary(yoj)
## * Impute missing values based on Recursive Partitioning and Regression Trees
## - method : rpart
##
## * Information of Imputation (before vs after)
## Original Imputation
## n 7707.00000000 8161.00000000
## na 454.00000000 0.00000000
## mean 10.49928636 10.50694662
## sd 4.09247418 4.04481883
## se_mean 0.04661689 0.04477415
## IQR 4.00000000 4.00000000
## skewness -1.20343604 -1.25339626
## kurtosis 1.17996900 1.33649457
## p00 0.00000000 0.00000000
## p01 0.00000000 0.00000000
## p05 0.00000000 0.00000000
## p10 5.00000000 5.00000000
## p20 8.00000000 8.00000000
## p25 9.00000000 9.00000000
## p30 10.00000000 10.00000000
## p40 11.00000000 11.00000000
## p50 11.00000000 11.00000000
## p60 12.00000000 12.00000000
## p70 13.00000000 13.00000000
## p75 13.00000000 13.00000000
## p80 14.00000000 13.00000000
## p90 15.00000000 14.00000000
## p95 15.00000000 15.00000000
## p99 17.00000000 17.00000000
## p100 23.00000000 23.00000000
# income
income <- imputate_na(df, income, target_flag, method = "rpart")

p4 <- plot(income) + theme_minimal() + theme(legend.position = "top")
summary(income)
## * Impute missing values based on Recursive Partitioning and Regression Trees
## - method : rpart
##
## * Information of Imputation (before vs after)
## Original Imputation
## n 7.716000e+03 8.161000e+03
## na 4.450000e+02 0.000000e+00
## mean 6.189809e+04 6.150153e+04
## sd 4.757268e+04 4.722375e+04
## se_mean 5.415786e+02 5.227437e+02
## IQR 5.788900e+04 5.696500e+04
## skewness 1.186778e+00 1.192689e+00
## kurtosis 2.132505e+00 2.154542e+00
## p00 0.000000e+00 0.000000e+00
## p01 0.000000e+00 0.000000e+00
## p05 0.000000e+00 0.000000e+00
## p10 4.380500e+03 5.427000e+03
## p20 2.234800e+04 2.227100e+04
## p25 2.809700e+04 2.830200e+04
## p30 3.330300e+04 3.370700e+04
## p40 4.367000e+04 4.338500e+04
## p50 5.402800e+04 5.315600e+04
## p60 6.526800e+04 6.507900e+04
## p70 7.794950e+04 7.783100e+04
## p75 8.598600e+04 8.526700e+04
## p80 9.553800e+04 9.471200e+04
## p90 1.231800e+05 1.220700e+05
## p95 1.522740e+05 1.512250e+05
## p99 2.155198e+05 2.143746e+05
## p100 3.670300e+05 3.670300e+05
#age
age <- imputate_na(df, age, method = "rpart")

p5 <- plot(age) + theme_minimal() + theme(legend.position = "top")
summary(age)
## * Impute missing values based on Recursive Partitioning and Regression Trees
## - method : rpart
##
## * Information of Imputation (before vs after)
## Original Imputation
## n 8155.00000000 8161.00000000
## na 6.00000000 0.00000000
## mean 44.79031269 44.78524399
## sd 8.62758946 8.62706878
## se_mean 0.09553830 0.09549741
## IQR 12.00000000 12.00000000
## skewness -0.02899962 -0.02782888
## kurtosis -0.06028252 -0.06119510
## p00 16.00000000 16.00000000
## p01 25.00000000 25.00000000
## p05 30.00000000 30.00000000
## p10 34.00000000 34.00000000
## p20 37.00000000 37.00000000
## p25 39.00000000 39.00000000
## p30 40.00000000 40.00000000
## p40 43.00000000 43.00000000
## p50 45.00000000 45.00000000
## p60 47.00000000 47.00000000
## p70 49.00000000 49.00000000
## p75 51.00000000 51.00000000
## p80 52.00000000 52.00000000
## p90 56.00000000 56.00000000
## p95 59.00000000 59.00000000
## p99 64.00000000 64.00000000
## p100 81.00000000 81.00000000
temp <- cbind(car_age, home_val, yoj, income, age)
temp %<>% as.data.frame()

df %<>%
  dplyr::select(!c(car_age, home_val, yoj, income, age)) %>%
  cbind(temp)

df %<>% mutate_if(is.numeric, round)

# print plots
(p1 | p2) / (p3 | p4) / p5
We can also impute missing values for categorical variables. In this case, the mice method is used to impute values for the variable, ‘job’.
job <- imputate_na(df, job, method = "mice", seed = 999)
##
## iter imp variable
## 1 1 job
## 1 2 job
## 1 3 job
## 1 4 job
## 1 5 job
## 2 1 job
## 2 2 job
## 2 3 job
## 2 4 job
## 2 5 job
## 3 1 job
## 3 2 job
## 3 3 job
## 3 4 job
## 3 5 job
## 4 1 job
## 4 2 job
## 4 3 job
## 4 4 job
## 4 5 job
## 5 1 job
## 5 2 job
## 5 3 job
## 5 4 job
## 5 5 job
plot(job)+theme_minimal()+theme(legend.position = "top")
# combine imputed job back into df
df <- df %>%
  dplyr::select(!job)

df <- cbind(df, job)

df$job <- factor(df$job)
Pairwise comparisons provide a method to identify collinear relationships between covariates. Covariates that display high collinearity (e.g., correlation coefficients < -0.50 or > 0.50) can reduce the precision of model coefficients and make them highly sensitive to small changes in the data, owing to the lack of independence among predictors.
To refine these analyses, pairwise comparisons are assessed following removal of 0 values. Our results indicate a very strong positive correlation between ‘income’ and ‘home_val’ (r ≈ 0.96), a moderate correlation between ‘homekids’ and ‘kidsdriv’ (r ≈ 0.62), and moderate correlations between ‘car_age’ and both ‘home_val’ and ‘income’ (r ≈ 0.53-0.54).
As expected, there is also a strong association between ‘income’ (numerical var.) and ‘job’ type (factor var.), based on a Kruskal-Wallis rank sum test.
Both ‘home_val’ and ‘income’ have similar levels of missing data. An argument can be made for dropping one or the other from the dataset in order to simplify the feature selection process and improve model precision. This may also be true for ‘income’ and ‘job’ type.
Associations between ‘home_val’ and ‘car_age’, and between ‘income’ and ‘car_age’, are less concerning given their lower correlation coefficients and higher variability. See the plots below.
Less clear is the association between ‘homekids’ and ‘kidsdriv’, given their narrow range of integer values.
We will revisit options for feature removal during the model building process.
# Assess collinearity in the absence of zero values in numerical cols
df %>%
  filter_if(is.numeric, all_vars((.) != 0)) %>%
  correlate() %>%
  filter(coef_corr > .5 | coef_corr < -.5) %>%   # set thresholds to limit output
  flextable() %>%
  set_caption("Pairwise Correlation: Numerical Covariates")
var1 | var2 | coef_corr |
homekids | kidsdriv | 0.6242945 |
kidsdriv | homekids | 0.6242945 |
home_val | car_age | 0.5442647 |
income | car_age | 0.5297101 |
car_age | home_val | 0.5442647 |
income | home_val | 0.9618320 |
car_age | income | 0.5297101 |
home_val | income | 0.9618320 |
# Plot relationship between numerical covariates (w/o zero values for clarity) beyond the threshold coef_corr
df %>%
  filter(income != 0 & home_val != 0) %>%
  target_by(income) %>%
  relate(home_val) %>%
  plot()

df %>%
  filter(homekids != 0 & kidsdriv != 0) %>%
  target_by(homekids) %>%
  relate(kidsdriv) %>%
  plot()

df %>%
  filter(home_val != 0 & car_age != 0) %>%
  target_by(home_val) %>%
  relate(car_age) %>%
  plot()

df %>%
  filter(income != 0 & car_age != 0) %>%
  target_by(income) %>%
  relate(car_age) %>%
  plot()

# Plot relationship between numerical and factor variables for select covariates
df %>%
  target_by(income) %>%
  relate(job) %>%
  plot()
df %>% kruskal.test(income ~ job)   # assess association with Kruskal-Wallis rank sum test where > 1 categorical level
##
## Kruskal-Wallis rank sum test
##
## data: .
## Kruskal-Wallis chi-squared = 118656, df = 24, p-value < 2.2e-16
We can visually assess relationships between categorical variables and our response variable, ‘target_flag’, using mosaic plots. In this case, a relationship is indicated by relative differences in variable levels across the two levels of ‘target_flag’.
Note that the width of each plot column is proportional to the count of observations for each level of the variable plotted on the horizontal axis, while the vertical length is proportional to the count of the second variable (i.e., ‘target_flag’) within each level of the first.
On this basis, ‘sex’ and ‘red_car’ show little variation across levels of ‘target_flag’. An argument can therefore be made for dropping them from our set of covariates in order to streamline the modeling process.
# Subset categorical variables into a df of factor vars
mosaic <- df %>%                                      # subset target_flag as factor
  mutate_at(vars(target_flag), funs(as.factor)) %>%
  dplyr::select(where(is.factor))

target <- all_of(names(mosaic)[1])                    # set name for target_flag
predictors <- all_of(names(mosaic)[2:11])

# generate mosaic plots with purrr and dlookr
(plots <- predictors %>% map(function(predictors) mosaic %>% target_by(target) %>% relate(predictors) %>% plot()))
Boxplots make an effective visual tool to assess relationships between ‘target_flag’ and our numerical covariates.
On this basis, it appears that ‘bluebook’, ‘tif’, ‘clm_freq’, ‘mvr_pts’, ‘car_age’, ‘home_val’, and ‘income’ vary by level of target_flag. The presence of outliers, however, argues for caution in this interpretation.
# Subset numerical variables
num_box <- select_if(df, is.numeric) %>%
  mutate_at(vars(target_flag), funs(as.factor)) %>%
  dplyr::select(-target_amt)

# Plot using boxplots
response = names(num_box)[1]            # target_flag (needs to be fct for these plots)
response = purrr::set_names(response)

explain <- names(num_box)[2:14]         # explanatory variables
explain = purrr::set_names(explain)

box_fun = function(x) {
  ggplot(num_box, aes_string(x = x, y = 'target_flag')) +
    geom_boxplot(aes(fill = target_flag, alpha = 0.4), outlier.color = 'red', show.legend = FALSE) +
    scale_fill_viridis(discrete = TRUE, option = "E") +
    coord_flip() +
    theme_classic()
}

box_plots <- map(explain, ~box_fun(.x))   # creates a list of plots using purrr

plot(reduce(box_plots, `+`))              # layout with patchwork
Before forming the training and testing datasets, we drop ‘home_val’ and ‘target_amt’. Variability explained by ‘home_val’ is largely captured by ‘income’, and ‘target_amt’ is not relevant to the classification task.
We also collapse the variable ‘job’ into two levels: ‘non-white-collar’ and ‘white-collar’. While this may result in a slight loss of discriminatory power, it lends itself to model simplification.
Rather than make additional a priori choices to drop other covariates, we will automate variable selection (via the Akaike Information Criterion) as part of the modeling process.
base_df <- df %>%
  dplyr::select(-home_val, -target_amt)   # variance in these fields explained by income

base_df %<>%
  mutate(job = fct_collapse(job,
                            not_white_collar = c("Student", "Blue Collar", "Home Maker", "Clerical", "Manager"),
                            white_collar = c("Doctor", "Lawyer", "Professional")))

# write.csv(base_df, 'base_df.csv', row.names = FALSE)
Prior to modeling, the dataset is randomly sampled to create a subset for model training (‘train_df’) and model evaluation (‘test_df’).
# base_df <- read.csv('base_df.csv')   # hold here to assist model dev

base_df %>%
  count(target_flag) %>%
  mutate(frequency = n / sum(n)) %>%
  flextable() %>%
  set_caption("Count and Frequency of Target Variable by Level")
target_flag | n | frequency |
0 | 6,008 | 0.7361843 |
1 | 2,153 | 0.2638157 |
set.seed(1235)

# Build train/test sets
trainIdx <- createDataPartition(base_df$target_flag, p = .75,
                                list = FALSE, times = 1)

train_df <- base_df[trainIdx, ]
test_df  <- base_df[-trainIdx, ]

# Check counts and frequencies of target_flag for train set and test set
train_df %>%
  count(target_flag) %>%
  mutate(frequency = n / sum(n)) %>%
  flextable() %>%
  set_caption("Count and Frequency of Target Variable: Training Data")
target_flag | n | frequency |
0 | 4,496 | 0.7345205 |
1 | 1,625 | 0.2654795 |
test_df %>%
  count(target_flag) %>%
  mutate(frequency = n / sum(n)) %>%
  flextable() %>%
  set_caption("Count and Frequency of Target Variable: Test Data")
target_flag | n | frequency |
0 | 1,512 | 0.7411765 |
1 | 528 | 0.2588235 |
Selection of a logistic classifier will be based on a comparison of four models: a null (intercept-only) model, a full model containing all covariates, a reduced model obtained via AIC-based stepwise selection, and a transformation model that adds transformed covariates to the reduced specification.
We will assume that the null and full models are valid for the purposes of model comparison. Validity will be inferred from model diagnostics for the reduced and transformed models.
To be considered valid, a binary logistic model should meet the following conditions: the response is binary, each numerical covariate is linearly related to the logit of the response, the covariates are not strongly multicollinear, the observations are independent, and there are no highly influential outliers.
# build and summarize null model
null <- glm(target_flag ~ 1, family = binomial, train_df)
summary(null)
##
## Call:
## glm(formula = target_flag ~ 1, family = binomial, data = train_df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.7855 -0.7855 -0.7855 1.6286 1.6286
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.01768 0.02894 -35.16 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 7084.6 on 6120 degrees of freedom
## Residual deviance: 7084.6 on 6120 degrees of freedom
## AIC: 7086.6
##
## Number of Fisher Scoring iterations: 4
null_stats <- glance(null)   # collect stats for model comparisons

# calculate AUC for model selection process - see 3.3.1 for discussion on AUC
null_predicted <- predict(null, train_df, type = 'response')
null_actuals <- as.numeric(as.character(train_df$target_flag))

AUROC(null_actuals, null_predicted)
## [1] 0.5
Several terms included in the full model are not significant - ‘homekids’, ‘sex’, ‘red_car’, ‘car_age’, ‘age’.
full <- glm(target_flag ~ ., family = binomial, train_df)
summary(full)
##
## Call:
## glm(formula = target_flag ~ ., family = binomial, data = train_df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2763 -0.7277 -0.4188 0.6343 3.1369
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.181e+00 2.886e-01 -4.091 4.30e-05 ***
## kidsdriv 3.674e-01 6.909e-02 5.317 1.05e-07 ***
## homekids 4.955e-02 4.243e-02 1.168 0.242788
## parent1Y 3.938e-01 1.244e-01 3.164 0.001556 **
## mstatusY -6.472e-01 8.341e-02 -7.759 8.54e-15 ***
## sexM 8.418e-02 1.259e-01 0.669 0.503804
## educationBachelors -5.294e-01 9.781e-02 -5.412 6.23e-08 ***
## educationMasters -4.499e-01 1.432e-01 -3.142 0.001676 **
## educationPhD -5.270e-01 1.808e-01 -2.915 0.003562 **
## travtime 1.500e-02 2.146e-03 6.990 2.76e-12 ***
## car_usePrivate -8.287e-01 8.478e-02 -9.774 < 2e-16 ***
## bluebook -2.413e-05 6.085e-06 -3.966 7.30e-05 ***
## tif -5.849e-02 8.500e-03 -6.880 5.97e-12 ***
## car_typeSUV -3.166e-01 1.125e-01 -2.815 0.004878 **
## car_typePickup -5.356e-01 1.522e-01 -3.519 0.000433 ***
## car_typePanel Truck -4.840e-01 2.361e-01 -2.050 0.040381 *
## car_typeMinivan -9.649e-01 1.471e-01 -6.557 5.48e-11 ***
## car_typeVan -4.397e-01 1.955e-01 -2.250 0.024460 *
## red_carY -8.900e-02 9.798e-02 -0.908 0.363671
## oldclaim -1.397e-05 4.536e-06 -3.079 0.002076 **
## clm_freq 1.808e-01 3.269e-02 5.531 3.18e-08 ***
## revokedY 8.662e-01 1.055e-01 8.211 < 2e-16 ***
## mvr_pts 1.220e-01 1.554e-02 7.849 4.19e-15 ***
## urbanicityUrban 2.317e+00 1.289e-01 17.974 < 2e-16 ***
## car_age -5.430e-03 8.827e-03 -0.615 0.538447
## yoj -1.642e-02 8.861e-03 -1.854 0.063810 .
## income -6.124e-06 1.082e-06 -5.657 1.54e-08 ***
## age -3.958e-03 4.550e-03 -0.870 0.384347
## jobwhite_collar 9.978e-02 9.292e-02 1.074 0.282927
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 7084.6 on 6120 degrees of freedom
## Residual deviance: 5566.7 on 6092 degrees of freedom
## AIC: 5624.7
##
## Number of Fisher Scoring iterations: 5
full_stats <- glance(full)   # collect stat metrics for model comparisons

# calculate AUC for model selection process - see 3.3.1 for discussion on AUC
full_predicted <- predict(full, train_df, type = 'response')
full_actuals <- as.numeric(as.character(train_df$target_flag))

AUROC(full_actuals, full_predicted)
## [1] 0.8060705
The reduced model will be constructed using stepwise selection based on the Akaike Information Criterion (AIC) to remove covariates that do not improve model fit.
#Build reduced model
reduced <- glm(target_flag ~ ., family = 'binomial', train_df)

reduced <- step(reduced, trace = 0)   # Akaike stepwise covariate selection; trace = 0 prevents intermediate printing
summary(reduced)
##
## Call:
## glm(formula = target_flag ~ kidsdriv + homekids + parent1 + mstatus +
## education + travtime + car_use + bluebook + tif + car_type +
## oldclaim + clm_freq + revoked + mvr_pts + urbanicity + yoj +
## income, family = "binomial", data = train_df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2627 -0.7316 -0.4190 0.6367 3.1307
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.394e+00 2.081e-01 -6.699 2.10e-11 ***
## kidsdriv 3.576e-01 6.798e-02 5.261 1.44e-07 ***
## homekids 5.948e-02 3.933e-02 1.512 0.130444
## parent1Y 4.094e-01 1.236e-01 3.313 0.000924 ***
## mstatusY -6.469e-01 8.329e-02 -7.767 8.04e-15 ***
## educationBachelors -5.394e-01 8.635e-02 -6.246 4.20e-10 ***
## educationMasters -4.703e-01 1.054e-01 -4.461 8.16e-06 ***
## educationPhD -5.705e-01 1.573e-01 -3.626 0.000287 ***
## travtime 1.499e-02 2.144e-03 6.990 2.75e-12 ***
## car_usePrivate -8.054e-01 8.202e-02 -9.820 < 2e-16 ***
## bluebook -2.505e-05 5.471e-06 -4.578 4.69e-06 ***
## tif -5.835e-02 8.489e-03 -6.874 6.24e-12 ***
## car_typeSUV -3.087e-01 1.122e-01 -2.751 0.005939 **
## car_typePickup -4.992e-01 1.285e-01 -3.885 0.000102 ***
## car_typePanel Truck -4.291e-01 1.842e-01 -2.329 0.019836 *
## car_typeMinivan -9.315e-01 1.212e-01 -7.684 1.55e-14 ***
## car_typeVan -3.858e-01 1.543e-01 -2.500 0.012416 *
## oldclaim -1.401e-05 4.536e-06 -3.089 0.002009 **
## clm_freq 1.795e-01 3.268e-02 5.494 3.94e-08 ***
## revokedY 8.697e-01 1.054e-01 8.249 < 2e-16 ***
## mvr_pts 1.229e-01 1.552e-02 7.921 2.36e-15 ***
## urbanicityUrban 2.319e+00 1.289e-01 17.989 < 2e-16 ***
## yoj -1.688e-02 8.692e-03 -1.942 0.052078 .
## income -5.926e-06 1.068e-06 -5.549 2.88e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 7084.6 on 6120 degrees of freedom
## Residual deviance: 5569.8 on 6097 degrees of freedom
## AIC: 5617.8
##
## Number of Fisher Scoring iterations: 5
reduced_stats <- glance(reduced)   # collect stats for model comparisons
The ability of a binary classifier to discriminate class membership can be evaluated using a confusion (error) matrix as well as a receiver operating characteristic (ROC curve).
A confusion matrix for binary classifiers consists of a 2x2 table with counts of actual and predicted classes for a response variable; providing a measure of correct (true positives, true negatives) and incorrect classifications (false positives, false negatives) at a given probability threshold.
The receiver operating characteristic plots a model’s true positive rate (i.e., sensitivity|recall|probability of detection) against the false positive rate (i.e., 1-specificity|probability of false detection) at various probability thresholds. Importantly, the area under the ROC curve (AUC) offers a measure of model accuracy and can be used as a statistic for model comparison.
The reduced model correctly classifies ~ 75% of the positive class and ~67% of the negative class for ‘target_flag’ at a probability threshold of 0.51. In this instance, the selected threshold optimizes for balance in true positive and false positive rates.
Depending on claim costs, it may be advisable to lower the threshold so that more positive observations (crashes) are correctly classified, at the cost of fewer correct negative classifications. In practice, this reflects a cost-benefit trade-off between potential payouts and premium income that can guide a carrier’s determination of coverage for an individual driver.
The AUC and other measures of performance for the reduced model are discussed in the model selection section.
# Collect predicted percentages
predicted <- predict(reduced, train_df, type = 'response')

# Find optimal threshold to minimize misclassification
optimal <- optimalCutoff(train_df$target_flag, predicted, optimiseFor = "misclasserror")   # 0.51

# Calculate training classification error with .51 threshold --> 0.2595
Miss_Class_Error <- misClassError(train_df$target_flag, predicted, threshold = optimal)

# Convert predicted to 0, 1 values based on threshold
predicts <- ifelse(predicted >= optimal, 1, 0)

# Subset actuals and predicts as numeric for confusion matrix (factors in caret)
actuals <- as.numeric(as.character(train_df$target_flag))

(InformationValue::confusionMatrix(actuals, predicts, threshold = optimal))

# Calculate related metrics
AUC <- AUROC(actuals, predicted)   # --> .81

Sensitivity <- InformationValue::sensitivity(actuals, predicts, threshold = optimal)   # recall: true positive rate
Specificity <- InformationValue::specificity(actuals, predicts, threshold = optimal)   # true negative rate
Precision <- InformationValue::precision(actuals, predicts, threshold = optimal)       # proportion of predicted ones that are actual ones

YDI_Index <- youdensIndex(actuals, predicts, threshold = optimal)   # 0.48; accounts for both false-positive and false-negative rates

# Assess predictive power using concordance
Predict_Power <- Concordance(actuals, predicts)   # predictive power --> .548
Predict_Power <- Predict_Power$Concordance

ks_plot(actuals, predicts)   # cumulative percentage of ones captured by the model vs. that expected at random

Kol_Smirnov <- ks_stat(actuals, predicts)   # Kolmogorov-Smirnov statistic: maximum difference between cumulative true positive and cumulative false positive rates --> .69

# Create a metrics table
reduced_metrics <- cbind(Predict_Power, Miss_Class_Error, AUC, Sensitivity, Specificity, Precision, YDI_Index, Kol_Smirnov)

# Construct ROC curve (note: actuals, predicts, predicted as vectors)
pred <- ROCR::prediction(predicted, actuals)
perf <- ROCR::performance(pred, "tpr", "fpr")

plot(perf, colorize = TRUE, main = "ROC curve for logistic regression on insurance data")

# AUC curve
plotROC(actuals, predicted)
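To illustrate the cost-benefit framing above, a minimal sketch of choosing a cutoff by minimizing expected monetary loss rather than raw misclassification error. The per-error costs (‘cost_fn’, ‘cost_fp’) are hypothetical placeholders; ‘predicted’ and ‘actuals’ are the vectors computed in the chunk above.
# Illustrative only: cost-weighted threshold selection (hypothetical costs)
cost_fn <- 5000   # assumed cost of missing a crash (false negative)
cost_fp <- 500    # assumed cost of wrongly flagging a safe driver (false positive)
cutoffs <- seq(0.05, 0.95, by = 0.01)
exp_loss <- sapply(cutoffs, function(t) {
  pred_class <- ifelse(predicted >= t, 1, 0)
  fn <- sum(actuals == 1 & pred_class == 0)
  fp <- sum(actuals == 0 & pred_class == 1)
  cost_fn * fn + cost_fp * fp
})
cutoffs[which.min(exp_loss)]   # cost-optimal threshold under these assumptions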
As part of the diagnostic process, we check the model residuals for evidence of overdispersion. This occurs when the residuals are more variable than expected for a binomial distribution.
Overdispersion can be evaluated by dividing a model’s residual deviance by its residual degrees of freedom. A value close to 1 indicates no overdispersion; a value substantially greater than 1 indicates otherwise, in which case a quasi-binomial fit (as used for comparison below) may be more appropriate.
There is no evidence of overdispersion in the reduced model.
# Evaluate using deviance and quasibinomial comparison
deviance(reduced)/df.residual(reduced) # if considerably greater than one we should be concerned
## [1] 0.9135302
# Check with a second model fit
quasi_model <- glm(target_flag ~ ., family = 'quasibinomial', base_df)   # note: using base_df

pchisq(summary(quasi_model)$dispersion * reduced$df.residual,
       reduced$df.residual, lower = F)
## [1] 0.1703679
The presence of influential outliers (outliers with high leverage) can negatively affect model results.
Influence can be evaluated on the basis of a model’s standardized residuals and Cook’s distance values.
While the majority of covariates contain outliers, there is little evidence of influential outliers (excepting obs. 5101) in the training set (i.e., standardized residuals < 3 and Cook’s distance < 1).
# Extract model results
reduced_df <- train_df

reduced_df$data <- augment(reduced) %>%   # add model output to dataframe
  mutate(index = 1:n())

# Identify influential outliers
reduced_df$data %>% top_n(10, .cooksd)

plot(reduced, which = 4, id.n = 3)   # keep an eye on obs > 4/n

# Plot standardized residuals
ggplot(reduced_df$data, aes(index, .std.resid)) +
  geom_point(aes(color = target_flag), alpha = .5) +
  theme_bw()

# Filter potential influential data points with abs(.std.resid) > 3
reduced_df$data %>%
  filter(abs(.std.resid) > 3)
Recall that linearity between each numerical covariate and the logit is one requirement of a valid logistic model.
The bivariate plots below indicate this requirement is met for most of the covariates, with the possible exception of ‘travtime’ and ‘oldclaim’. These variables may benefit from transformation.
# create a dataframe for linearity analysis
reduced_df$predicted <- predict(reduced, train_df, type = 'response')   # add predicted

reduced_df %<>% mutate(logit = log(predicted / (1 - predicted)))        # add logit

# plot - check linearity btwn numerical predictors and logit; .data used to pass strings into aes
covar_df <- reduced_df %>%
  dplyr::select(is.numeric)

xvar <- names(covar_df)[2:13]   # numerical covariates
xvar <- purrr::set_names(xvar)

linear_func <- function(x){
  ggplot(covar_df, aes_string(x = x, y = 'logit')) +
    geom_point(alpha = .2) +
    geom_smooth(method = "loess", se = FALSE, color = 'red') +
    theme_minimal()
}

linear_plots <- map(xvar, ~linear_func(.x))   # map plotting function

plot(reduce(linear_plots, `+`))               # reduce for single print
Another requirement of a valid logistic model is the absence of strong multicollinearity among covariates. One measure of multicollinearity is the Variance Inflation Factor (VIF). A high VIF value (e.g., 10 or greater) for a covariate indicates that it is highly collinear with other covariates in the model.
There is no evidence of problematic multicollinearity in the reduced model.
car::vif(reduced)
## GVIF Df GVIF^(1/(2*Df))
## kidsdriv 1.297684 1 1.139159
## homekids 1.831738 1 1.353417
## parent1 1.900427 1 1.378560
## mstatus 1.561623 1 1.249649
## education 1.688907 3 1.091275
## travtime 1.039941 1 1.019775
## car_use 1.490637 1 1.220916
## bluebook 1.760971 1 1.327016
## tif 1.009961 1 1.004968
## car_type 2.206158 5 1.082340
## oldclaim 1.680393 1 1.296300
## clm_freq 1.468197 1 1.211692
## revoked 1.331231 1 1.153790
## mvr_pts 1.159294 1 1.076705
## urbanicity 1.136284 1 1.065966
## yoj 1.198707 1 1.094855
## income 1.890915 1 1.375106
A final model requirement is the independence of observations. This can be assessed by plotting the model residuals against the linear predictor (i.e., the linear combination of coefficients and covariates). Independence is indicated by the absence of pattern in the resulting plot.
There does appear to be some curvature in the plot of residuals vs linear predictor for the reduced model that may point to a departure from independence. This might result, for example, from grouping effects among categorical covariates.
Given that the plotting pattern is not severe and that the model serves a predictive (vs. explanatory) purpose, there is not sufficient evidence to reject the model.
resid_df <- train_df %>% mutate(residuals = residuals(reduced), linpred = predict(reduced))

bins <- group_by(resid_df, cut(linpred, breaks = unique(quantile(linpred, (1:100)/101))))

diag <- summarize(bins, residuals = mean(residuals), linpred = mean(linpred))

plot(residuals ~ linpred, diag, xlab = "linear predictor")
The reduced model is considered valid on the basis of requirements described for logistic regression.
The ‘goodness’ of model fit can be readily evaluated from marginal plots for each variable. Marginal plots graph actual and conditional probabilities (estimated by the model) for each covariate relative to the response variable. The resulting curves align closely in a model that fits the data well.
The reduced model provides a good fit to the data, although some divergence between actual and conditional probabilities is observed in the plots for ‘travtime’, ‘bluebook’, and ‘age’. These departures are not severe and may be lessened by transforming these features.
marginals_reduced <- mmps(reduced, main = NULL)
The transformation model includes terms from the reduced model with transformations to reduce skew in select covariates. On this basis, ‘travtime’ and ‘bluebook’ will be log transformed while ‘tif’ and ‘income’ will be square root transformed.
These transformations follow from earlier model trials (not shown here) and experimentation.
All covariates included in the transformation model are significant at the .05 level.
The transformation model correctly classifies ~ 80% of the positive class and ~68% of the negative class for ‘target_flag’ at a probability threshold of 0.21 - optimizing for low misclassification error. The AUC is ~ .81, similar to the reduced model.
The transformation model is valid (diagnostics not included for the purpose of brevity).
#Build model with transformed vars
trans <- glm(target_flag ~ kidsdriv + parent1 + mstatus + education +
               log(1 + travtime) + car_use + log(1 + bluebook) + sqrt(tif) + car_type + oldclaim +
               clm_freq + revoked + mvr_pts + urbanicity + sqrt(income),
             family = 'binomial', train_df)

trans <- step(trans, trace = 0)
summary(trans)
##
## Call:
## glm(formula = target_flag ~ kidsdriv + parent1 + mstatus + education +
## log(1 + travtime) + car_use + log(1 + bluebook) + sqrt(tif) +
## car_type + oldclaim + clm_freq + revoked + mvr_pts + urbanicity +
## sqrt(income), family = "binomial", data = train_df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3354 -0.7265 -0.4234 0.6335 3.1073
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 7.567e-01 6.202e-01 1.220 0.222427
## kidsdriv 3.938e-01 6.230e-02 6.321 2.61e-10 ***
## parent1Y 4.867e-01 1.069e-01 4.552 5.31e-06 ***
## mstatusY -6.361e-01 7.863e-02 -8.090 5.99e-16 ***
## educationBachelors -5.365e-01 8.562e-02 -6.266 3.70e-10 ***
## educationMasters -4.832e-01 1.027e-01 -4.707 2.51e-06 ***
## educationPhD -6.757e-01 1.474e-01 -4.583 4.59e-06 ***
## log(1 + travtime) 4.220e-01 6.173e-02 6.837 8.09e-12 ***
## car_usePrivate -8.148e-01 8.210e-02 -9.924 < 2e-16 ***
## log(1 + bluebook) -3.313e-01 6.241e-02 -5.309 1.11e-07 ***
## sqrt(tif) -2.563e-01 3.710e-02 -6.910 4.86e-12 ***
## car_typeSUV -2.803e-01 1.127e-01 -2.488 0.012851 *
## car_typePickup -4.762e-01 1.288e-01 -3.698 0.000217 ***
## car_typePanel Truck -5.219e-01 1.744e-01 -2.992 0.002773 **
## car_typeMinivan -9.087e-01 1.216e-01 -7.475 7.72e-14 ***
## car_typeVan -3.606e-01 1.543e-01 -2.337 0.019427 *
## oldclaim -1.407e-05 4.541e-06 -3.099 0.001945 **
## clm_freq 1.811e-01 3.267e-02 5.544 2.96e-08 ***
## revokedY 8.799e-01 1.054e-01 8.348 < 2e-16 ***
## mvr_pts 1.239e-01 1.552e-02 7.983 1.43e-15 ***
## urbanicityUrban 2.318e+00 1.289e-01 17.981 < 2e-16 ***
## sqrt(income) -2.765e-03 3.945e-04 -7.009 2.40e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 7084.6 on 6120 degrees of freedom
## Residual deviance: 5561.0 on 6099 degrees of freedom
## AIC: 5605
##
## Number of Fisher Scoring iterations: 5
trans_stats <- glance(trans)   # save stats for model comparisons
Tests of model validity for the transformation model are comparable to the reduced model and, to avoid repetition, are not reported here. Diagnostics for model performance are included below and in Section 4.
# collect predicted percentages
trans_predicted <- predict(trans, train_df, type = 'response')

# find optimal threshold to minimize misclassification
trans_optimal <- optimalCutoff(train_df$target_flag, trans_predicted, optimiseFor = "misclasserror")

# calculate training classification error at the selected threshold --> 0.2588
Miss_Class_Error <- misClassError(train_df$target_flag, trans_predicted, threshold = trans_optimal)

# convert predicted to 0, 1 values based on threshold
trans_predicts <- ifelse(trans_predicted >= trans_optimal, 1, 0)

# subset actuals and predicts as numeric for confusion matrix
trans_actuals <- as.numeric(as.character(train_df$target_flag))

InformationValue::confusionMatrix(trans_actuals, trans_predicts, threshold = optimal) %>%
  flextable() %>%
  set_caption("Confusion Matrix: Transformation Model")
0 | 1 |
4,207 | 998 |
289 | 627 |
# Calculate related metrics
AUC <- AUROC(trans_actuals, trans_predicted)   # --> .81
Sensitivity <- InformationValue::sensitivity(trans_actuals, trans_predicts, threshold = optimal)   # recall: true positive rate
Specificity <- InformationValue::specificity(trans_actuals, trans_predicts, threshold = optimal)   # true negative rate
Precision <- InformationValue::precision(trans_actuals, trans_predicts, threshold = optimal)       # proportion of predicted ones that are actual ones
YDI_Index <- youdensIndex(trans_actuals, trans_predicts, threshold = optimal)   # accounts for both false-positive and false-negative rates

# Calculate predictive power using concordance
Concord <- Concordance(trans_actuals, trans_predicts)
Predict_Power <- Concord$Concordance

ks_plot(trans_actuals, trans_predicts)   # cumulative percentage of ones captured by the model vs. that expected at random

Kol_Smirnov <- ks_stat(trans_actuals, trans_predicts)   # Kolmogorov-Smirnov statistic: maximum difference between cumulative true positive and cumulative false positive rates

# Create a metrics table
trans_metrics <- cbind(Predict_Power, Miss_Class_Error, AUC, Sensitivity, Specificity, Precision, YDI_Index, Kol_Smirnov)

# Construct ROC curve (note: actuals, predicts, predicted as vectors)
pred_trans <- ROCR::prediction(trans_predicted, trans_actuals)
perf_trans <- ROCR::performance(pred_trans, "tpr", "fpr")

plot(perf_trans, colorize = TRUE, main = "ROC curve for logistic regression on insurance data")

# AUC curve
plotROC(trans_actuals, trans_predicted)
### Model Fit
Marginal plots show improvements in model fit with the transformations selected for ‘travtime’, ‘bluebook’, ‘tif’, and ‘income’.
marginals_trans <- mmps(trans, main = NULL)
There are a variety of methods, both quantitative and qualitative, to aid model selection. Here, selection will proceed using a cross-section of diagnostics that balance overall fit and parsimony (AIC, BIC, residual deviance) with discriminatory power (AUC).
It’s worth noting that diagnostics for the transformation model produced similar results as the reduced model. Both are considered valid.
From the first table below (i.e., comparison of model results), it is clear that the transformation model is an appropriate choice. It produces the lowest AIC and BIC values and the lowest residual deviance. Additionally, its AUC is equivalent to that of the reduced and full models (.81) and greater than that of the null model (.50). It does, however, produce higher standard errors for the transformed covariates.
Other measures of performance are included in the classification metrics table below (reduced and transformation models). Note, however, that they are specific to the probability thresholds selected for each model and are not considered criteria for model selection.
Additional tests can be performed to confirm our model choice (e.g., likelihood ratio tests for nested models) but are not necessary.
# Compare model output summaries
col_name <- c('Null Model', 'Full Model', 'Reduced Model', 'Transformed Model')

comp <- rbind(null_stats, full_stats, reduced_stats, trans_stats)

(compare_models <- cbind(col_name, comp) %>%
  dplyr::select(-c(null.deviance, df.null, logLik, nobs)) %>%
  rename(Model = col_name) %>%
  flextable() %>%
  set_caption("Comparison of Model Results"))
Model | AIC | BIC | deviance | df.residual |
Null Model | 7,086.575 | 7,093.295 | 7,084.575 | 6,120 |
Full Model | 5,624.703 | 5,819.568 | 5,566.703 | 6,092 |
Reduced Model | 5,617.794 | 5,779.061 | 5,569.794 | 6,097 |
Transformed Model | 5,605.049 | 5,752.878 | 5,561.049 | 6,099 |
# Compare select metrics for reduced and transformed models
col_final <- c('Reduced Model', 'Transformed Model')

final_comp <- rbind(reduced_metrics, trans_metrics)

(final_models <- cbind(col_final, final_comp) %>%
  as.data.frame() %>%
  mutate_at(vars(2:9), funs(as.numeric)) %>%
  mutate_if(is.numeric, round, 3) %>%
  rename(Model = col_final) %>%
  flextable() %>%
  set_caption("Classification Metrics: Reduced & Transformed Models"))
Model | Predict_Power | Miss_Class_Error | AUC | Sensitivity | Specificity | Precision | YDI_Index | Kol_Smirnov |
Reduced Model | 0.397 | 0.212 | 0.806 | 0.433 | 0.917 | 0.652 | 0.350 | 0.338 |
Transformed Model | 0.361 | 0.210 | 0.806 | 0.386 | 0.936 | 0.684 | 0.322 | 0.306 |
To assess how well the transformation model generalizes, we apply it to the test dataset and summarize the results. Interestingly, the AUC produced on the test set has increased (.82) relative to the training set - indicating that the model generalizes well.
Assume for a moment that an insurer seeks to minimize their risk of claim payouts (relative to client payments). In this instance, they might elect to lower the probability threshold used to classify new clients.
Below, our results for model performance reflect a cutoff of ~0.25. The model classifies 90% of our positives (target_flag = 1, crash) correctly, as compared to 49% of our negatives (target_flag = 0, no accidents).
Other related measures of performance are included in the metrics table.
# Apply transformation model to test data
trans_test <- glm(target_flag ~ kidsdriv + parent1 + mstatus + education +
                    log(1 + travtime) + car_use + log(1 + bluebook) + sqrt(tif) + car_type + oldclaim +
                    clm_freq + revoked + mvr_pts + urbanicity + sqrt(income),
                  family = 'binomial', test_df)

# Apply fitted model to test sample (predicted probabilities)
predicted_test <- predict(trans_test, test_df, type = "response")

# Find optimal threshold to minimize misclassification
test_optimal <- optimalCutoff(test_df$target_flag, predicted_test, optimiseFor = "Both")   # --> .25

# Calculate test classification error at the selected threshold
Miss_Class_Error <- misClassError(test_df$target_flag, predicted_test, threshold = test_optimal)

# Convert predicted to 0, 1 values based on threshold
predicts_test <- ifelse(predicted_test >= test_optimal, 1, 0)

# Subset actuals and predicts as numeric for confusion matrix
actuals_test <- as.numeric(as.character(test_df$target_flag))

InformationValue::confusionMatrix(actuals_test, predicts_test, threshold = test_optimal) %>%
  flextable() %>%
  set_caption("Confusion Matrix: Transformation Model")
 | Actual 0 | Actual 1 |
---|---|---|
Predicted 0 | 1,092 | 117 |
Predicted 1 | 420 | 411 |
# Calculate related metrics
AUC <- AUROC(actuals_test, predicted_test) # --> .82

Sensitivity <- InformationValue::sensitivity(actuals_test, predicts_test, threshold = test_optimal) # recall: true positive rate
Specificity <- InformationValue::specificity(actuals_test, predicts_test, threshold = test_optimal) # true negative rate
Precision <- InformationValue::precision(actuals_test, predicts_test, threshold = test_optimal) # proportion of predicted positives that are actual positives
YDI_Index <- youdensIndex(actuals_test, predicts_test, threshold = test_optimal) # accounts for both false-positive and false-negative rates

# Calculate predictive power using concordance
Concord <- Concordance(actuals_test, predicts_test)
Predict_Power <- Concord$Concordance

ks_plot(actuals_test, predicts_test) # cumulative percent of ones captured by the model vs. that expected at random

Kol_Smirnov <- ks_stat(actuals_test, predicts_test) # Kolmogorov-Smirnov statistic: maximum difference between the cumulative true-positive and false-positive rates

# Create a metrics table
(test_metrics <- cbind(Predict_Power, Miss_Class_Error, AUC, Sensitivity, Specificity, Precision, YDI_Index, Kol_Smirnov)%>%
    as.data.frame()%>%
    mutate_at(vars(1:8), funs(as.numeric))%>%
    mutate_if(is.numeric, round, 3)%>%
    flextable()%>%
    set_caption("Transformation Model-Test Data: Metrics"))
Predict_Power | Miss_Class_Error | AUC | Sensitivity | Specificity | Precision | YDI_Index | Kol_Smirnov |
---|---|---|---|---|---|---|---|
0.562 | 0.263 | 0.824 | 0.778 | 0.722 | 0.495 | 0.501 | 0.49 |
# Construct ROC curve (note: pass predicted probabilities and actuals as vectors)
pred <- ROCR::prediction(predicted_test, actuals_test)
perf <- ROCR::performance(pred, "tpr", "fpr")

plot(perf, colorize = TRUE, main = "ROC curve for logistic regression on insurance data")

# Construct AUC curve
plotROC(actuals_test, predicted_test)
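To make the payout-risk trade-off described above more concrete, the sensitivity/specificity balance can be tabulated over a grid of candidate cutoffs. The sketch below reuses the test-set objects created above; the grid values are illustrative only.
# Sweep candidate probability cutoffs and record the resulting trade-off
cutoffs <- seq(0.1, 0.6, by = 0.05)

threshold_sweep <- data.frame(
    cutoff = cutoffs,
    sensitivity = sapply(cutoffs, function(t)
        InformationValue::sensitivity(actuals_test, predicted_test, threshold = t)),
    specificity = sapply(cutoffs, function(t)
        InformationValue::specificity(actuals_test, predicted_test, threshold = t)))

threshold_sweep # lower cutoffs capture more crashes (sensitivity) at the cost of specificity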
We can now build a linear model to predict payout cost (‘target_amt’) to a driver in the event of a crash.
It’s worth noting that there are a variety of examples in the published literature of auto insurance claim prediction based on log-normal and/or gamma regression models. This reflects the fact that the claim data are typically right skewed and positive. Both models will be constructed and compared here.
As an aside, a Poisson model can also be used where the response is right skewed and positive, but it restricts the response to integer values.
First, we build a new training and test set for linear modeling, starting with a more limited set of covariates: ‘bluebook’, ‘car_age’, ‘car_type’, ‘oldclaim’, and ‘clm_freq’.
Other variables are omitted owing to the absence of any theoretical relationship to payout cost.
To further simplify our model, we construct an average past payout (‘avg_clm’) for each driver as the ratio of ‘oldclaim’ to ‘clm_freq’, assigning a value of 0 to ‘avg_clm’ where no claims have been filed previously.
The extent to which ‘target_amt’ includes payment for damages other than vehicular (e.g., medical) is unclear. To better focus on the vehicular component, we construct another new variable, ‘value_clm’, as the ratio of ‘target_amt’ to ‘bluebook’. In this context, values > 1 indicate payments in excess of vehicle market value. These observations will be dropped from the analyses. The final set of covariates used for model construction is: ‘bluebook’, ‘car_age’, ‘car_type’, and ‘avg_clm’.
It’s worth noting that no pairwise correlations between these variables raise multicollinearity concerns.
# subset dataframe for the linear model (home_val omitted; its variance is explained by income)
linear_df <- df%>%
    filter(target_amt > 0)%>%
    mutate(avg_clm = round(oldclaim/clm_freq, 1))%>% # average past claim payout to the driver
    mutate(value_clm = round((target_amt/bluebook)*100))%>% # payout-to-value percentage for the current claim
    mutate_at(vars(avg_clm), ~replace(., is.na(.), 0))%>% # if no prior claims, set to 0
    dplyr::select(target_amt, bluebook, car_age, car_type, value_clm, avg_clm)

# use value_clm to drop target_amt with a cost/value ratio > 1, then drop this variable
linear_df %<>%
    filter(value_clm < 100)%>%
    dplyr::select(-value_clm)
linear_df %>%
    filter_if(is.numeric, all_vars((.) != 0))%>%
    correlate()%>%
    filter(coef_corr > .5 | coef_corr < -.5)%>% # set thresholds to limit output
    flextable()%>%
    set_caption("Pairwise Correlation: Numerical Covariates")
var1 | var2 | coef_corr |
---|---|---|
(no pairs exceed the ±0.5 threshold) |  |  |
set.seed(123344)

# Build train/test sets
trainIdx <- createDataPartition(linear_df$target_amt, p = .75,
    list = FALSE, times = 1)

linear_train <- linear_df[trainIdx,]
linear_test <- linear_df[-trainIdx,]
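As a complementary check on the correlation screen above, variance inflation factors can be computed from a preliminary fit. A minimal sketch using car::vif (the car package is already loaded); the helper fit ‘vif_check’ is introduced here purely as a diagnostic aid.
# Generalized VIFs for the candidate covariates; values near 1 suggest little collinearity
vif_check <- lm(target_amt ~ bluebook + car_age + avg_clm + car_type, data = linear_train)
car::vif(vif_check)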
The response variable, ‘target_amt’, is highly right skewed and is positive across all observations. This can be problematic when constructing a multiple regression model using ordinary least squares. As mentioned previously, both log-normal and gamma regression models can be used to model this form of data (see examples below). However, the former cannot be interpreted on the original scale of the data without back-transformation and must also adhere to the requirements of ordinary least squares. In instances where heteroscedasticity is apparent, a log-normal model may be modified using weighted least squares (smaller datasets) or robust techniques.
As a generalized linear model, gamma regression allows interpretation of the results on the original scale and does not require normally distributed errors or constant variance (the variance is proportional to the square of the mean). Two link functions are commonly used for gamma models - the inverse (canonical) link and the log link. The log link is used here because it guarantees positive fitted values and yields multiplicative, easily interpreted effects.
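The practical difference shows up when predictions and coefficients are reported. The sketch below contrasts the two approaches; ‘lognormal_fit’ and ‘gamma_fit’ are placeholder names for models of the kind fit later in this section.
# Log-normal model: predictions are on the log scale and must be back-transformed.
# exp(pred) estimates the median payout; adding sigma^2/2 approximates the mean.
pred_log <- predict(lognormal_fit, newdata = linear_test)
sigma2 <- summary(lognormal_fit)$sigma^2
pred_mean <- exp(pred_log + sigma2/2)

# Gamma model with log link: fitted values are already on the dollar scale, and
# exp(coef) gives the multiplicative change in expected payout per unit of a covariate.
pred_gamma <- predict(gamma_fit, newdata = linear_test, type = "response")
exp(coef(gamma_fit))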
Gamma regression has been used previously to model auto insurance claims. In these instances, the response variable is often an average claim amount or some measure of claim severity. The dataset used here contains payouts for individual claims. Thus, the most relevant predictors will likely be ‘bluebook’, ‘car_age’, and ‘car_type’. The term ‘avg_clm’ will be relevant provided that decisions regarding the current payout are based, in part, on previous payouts to a driver.
Prior examples: McCullagh and Nelder, 1983; Mohammad and McGovern, 2020.
# Plot density distribution
hist <- linear_train%>%
    ggplot(aes(target_amt))+
    geom_density(fill='darkblue', alpha=.5)+
    theme_minimal()+
    labs(title='Distribution Density: target_amt')

# qqplot
points <- linear_train%>%
    ggplot(aes(sample=target_amt))+
    stat_qq(color='darkblue')+
    stat_qq_line()+
    theme_minimal()+
    labs(title='QQ Plot: target_amt', y='Target_amt')

# cumulative frequency plot
cum <- linear_train%>%
    ggplot(aes(target_amt)) +
    stat_ecdf(color='blue')+
    theme_minimal()+
    labs(title = 'Cumulative Frequency Plot: target_amt', y='Frequency')

(hist|points)/cum # plot layout
# univariate stats for target_amt
linear_train$target_amt%>%
    summary()%>%
    tidy()%>%
    flextable()%>%
    set_caption("Univariate stats: target_amt")
minimum | q1 | median | mean | q3 | maximum |
---|---|---|---|---|---|
30 | 2,465.75 | 3,823 | 4,131.775 | 5,257.5 | 27,496 |
Included below is a full set of bivariate plots for the numerical and factor predictors against our response, ‘target_amt’. It is difficult to discern a linear relationship between the numerical covariates and ‘target_amt’. In contrast, there does appear to be a weak association between the levels of ‘car_type’ and ‘target_amt’, with the highest payouts applied to panel trucks.
It’s worth noting that ~80% of the paid claims are less than $10,000 each, and the highest payouts are associated with newer vehicles, few to no past claims, and mid-range bluebook values. It appears that payouts are proportionally smaller for high-value vehicles. (The ~80% share can be checked directly; see the one-line sketch below.)
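A quick check of the ~80% figure from the training data:
# proportion of paid claims under $10,000 in the training set (expected to be ~0.8)
mean(linear_train$target_amt < 10000)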
xvar_names <- linear_train%>%
    dplyr::select(where(is.numeric), -target_amt)%>%
    names() # numerical covariates

xvar_names <- purrr::set_names(xvar_names)

# create function for scatter plots
scatter_func <- function(x){
    ggplot(linear_train, aes_string(x=x, y='target_amt'))+
        geom_point(aes_string(color='car_type'), alpha=.6)+
        geom_smooth(method = "loess", se = FALSE, color='red')+
        theme_minimal()
}

(scatter_plots <- map(xvar_names, ~scatter_func(.x))) # map plotting function
## $bluebook
## $car_age
## $avg_clm
# plot factor covariate car_type vs target_amt (< $10,000 for readability)
linear_train%>%
    filter(target_amt < 10000)%>%
    ggplot(aes(x = reorder(car_type, target_amt, FUN = median), y = target_amt, fill=car_type)) +
    geom_boxplot(outlier.color ='red', show.legend = FALSE)+
    scale_fill_viridis(discrete = TRUE, alpha=0.6)+
    theme_classic()+
    labs(title = 'Payment Amount vs. Car Type: Payments under $10,000')
Prior to modeling, we can visually assess potential covariate interactions as predictors of ‘target_amt’. It appears that ‘car_type’ influences the relationship between each numerical covariate and ‘target_amt’. In contrast, there do not appear to be any interactions among the numerical covariates in relation to ‘target_amt’.
# evaluate interactions between the numerical covariates and car_type using a linear model
numvars <- purrr::set_names(c('bluebook', 'car_age', 'avg_clm'))

filter_df <- linear_train%>%
    filter(target_amt < 10000)

interact <- function(x){
    ggplot(filter_df, aes_string(x = x, y = 'target_amt', color = 'car_type')) +
        geom_smooth(method = 'lm', se=FALSE)+
        theme_classic() +
        labs(x = x, y = "target_amt")
}

interact_plots <- map(numvars, ~interact(.x)) # map plotting function

plot(reduce(interact_plots, `+`)) # reduce to a single patchwork layout

# evaluate interactions among the numerical covariates using conditioning plots
coplot(target_amt~bluebook|car_type, panel = panel.car, col='blue', rows=1, data=filter_df)
coplot(target_amt~bluebook|avg_clm, panel = panel.car, col='blue', rows=1, data=filter_df)
coplot(target_amt~bluebook|car_age, panel = panel.car, col='blue', rows=1, data=filter_df)
coplot(target_amt~car_age|avg_clm, panel = panel.car, col='blue', rows=1, data=filter_df)
## Log-Normal Model: NULL
We will start with a null model with a log transformed response (log-normal model).
# note: gaussian glm with default link (identity) is equivalent to OLS
null_lm <- lm(log(target_amt) ~ 1, data=linear_train)

summary(null_lm)
##
## Call:
## lm(formula = log(target_amt) ~ 1, data = linear_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.7209 -0.3119 0.1266 0.4453 2.0996
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.12215 0.01875 433.2 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7025 on 1403 degrees of freedom
Next we build a full log-normal model with main effects. Only ‘bluebook’ and one level of ‘car_type’ (SUV) are significant at alpha = 0.05, and the adjusted R-squared is only 0.03.
full_lm <- lm(log(target_amt) ~ ., data=linear_train)

summary(full_lm)
##
## Call:
## lm(formula = log(target_amt) ~ ., data = linear_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.6121 -0.3170 0.1268 0.4572 1.8277
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.769e+00 6.839e-02 113.585 < 2e-16 ***
## bluebook 1.665e-05 2.927e-06 5.689 1.56e-08 ***
## car_age -6.733e-04 3.443e-03 -0.196 0.8450
## car_typeSUV 1.278e-01 6.130e-02 2.085 0.0372 *
## car_typePickup 1.227e-01 6.622e-02 1.853 0.0640 .
## car_typePanel Truck 5.629e-02 9.288e-02 0.606 0.5446
## car_typeMinivan 9.422e-02 6.864e-02 1.373 0.1701
## car_typeVan 6.547e-02 8.119e-02 0.806 0.4201
## avg_clm 3.658e-06 2.812e-06 1.301 0.1936
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6917 on 1395 degrees of freedom
## Multiple R-squared: 0.03597, Adjusted R-squared: 0.03044
## F-statistic: 6.507 on 8 and 1395 DF, p-value: 2.384e-08
plot(full_lm)
## Log-Normal Model: Reduced w/ Interactions
Building the log-normal model out further, ‘car_type’ is entered in interaction terms and variable selection is automated using stepwise AIC.
The reduced model has the following form:
log(target_amt) ~ bluebook + car_type + avg_clm + bluebook:car_type + car_type:avg_clm
Relative to the full model, the residual standard error has decreased slightly and the adjusted R-squared is now 0.04. However, it appears that the requirements of constant variance and normality have not been met by the model residuals. There is also a poor fit between the actual and predicted observations for ‘target_amt’.
gauss_interact <- lm(log(target_amt) ~ bluebook*car_type + car_age*car_type + avg_clm*car_type, data=linear_train)

interact_aic <- step(gauss_interact) # reduced model --> log(target_amt) ~ bluebook + car_type + avg_clm + bluebook:car_type + car_type:avg_clm
## Start: AIC=-1028.24
## log(target_amt) ~ bluebook * car_type + car_age * car_type +
## avg_clm * car_type
##
## Df Sum of Sq RSS AIC
## - car_type:car_age 5 1.0717 653.39 -1035.9
## <none> 652.32 -1028.2
## - car_type:avg_clm 5 5.9075 658.22 -1025.6
## - bluebook:car_type 5 8.3065 660.62 -1020.5
##
## Step: AIC=-1035.93
## log(target_amt) ~ bluebook + car_type + car_age + avg_clm + bluebook:car_type +
## car_type:avg_clm
##
## Df Sum of Sq RSS AIC
## - car_age 1 0.0035 653.39 -1037.9
## <none> 653.39 -1035.9
## - car_type:avg_clm 5 5.7389 659.13 -1033.7
## - bluebook:car_type 5 8.3610 661.75 -1028.1
##
## Step: AIC=-1037.92
## log(target_amt) ~ bluebook + car_type + avg_clm + bluebook:car_type +
## car_type:avg_clm
##
## Df Sum of Sq RSS AIC
## <none> 653.39 -1037.9
## - car_type:avg_clm 5 5.7645 659.16 -1035.6
## - bluebook:car_type 5 8.3581 661.75 -1030.1
summary(interact_aic)
##
## Call:
## lm(formula = log(target_amt) ~ bluebook + car_type + avg_clm +
## bluebook:car_type + car_type:avg_clm, data = linear_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.6005 -0.3003 0.1130 0.4506 1.9873
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.792e+00 1.119e-01 69.653 < 2e-16 ***
## bluebook 1.939e-05 7.183e-06 2.700 0.00702 **
## car_typeSUV 5.806e-02 1.384e-01 0.419 0.67501
## car_typePickup 6.983e-02 1.396e-01 0.500 0.61701
## car_typePanel Truck 7.505e-01 2.799e-01 2.681 0.00743 **
## car_typeMinivan -1.746e-01 1.605e-01 -1.088 0.27687
## car_typeVan 5.715e-01 2.926e-01 1.953 0.05100 .
## avg_clm -1.280e-05 7.093e-06 -1.805 0.07130 .
## bluebook:car_typeSUV -2.367e-06 9.337e-06 -0.254 0.79988
## bluebook:car_typePickup -1.881e-06 9.097e-06 -0.207 0.83621
## bluebook:car_typePanel Truck -2.601e-05 1.104e-05 -2.356 0.01861 *
## bluebook:car_typeMinivan 1.381e-05 9.921e-06 1.392 0.16405
## bluebook:car_typeVan -2.780e-05 1.441e-05 -1.929 0.05398 .
## car_typeSUV:avg_clm 2.750e-05 8.746e-06 3.144 0.00170 **
## car_typePickup:avg_clm 2.025e-05 9.185e-06 2.204 0.02766 *
## car_typePanel Truck:avg_clm 3.467e-06 1.256e-05 0.276 0.78259
## car_typeMinivan:avg_clm 1.525e-05 9.796e-06 1.557 0.11978
## car_typeVan:avg_clm 1.225e-05 1.253e-05 0.977 0.32860
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6866 on 1386 degrees of freedom
## Multiple R-squared: 0.05631, Adjusted R-squared: 0.04474
## F-statistic: 4.865 on 17 and 1386 DF, p-value: 2.802e-10
plot(interact_aic)
# check actual vs predicted values for target_amt, another check on model fit
fit <- augment(interact_aic)%>%
    rename(target = 'log(target_amt)')

plot(density(fit$target), ylim=c(0, 3.0))
lines(density(fit$.fitted), col='red')
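The visual diagnostics above can be supplemented with formal tests of the constant-variance and normality assumptions. A minimal sketch, assuming the lmtest package is installed for the Breusch-Pagan test:
# Breusch-Pagan test for heteroscedasticity of the reduced log-normal model
lmtest::bptest(interact_aic)

# Shapiro-Wilk test for normality of the residuals (applicable here since n < 5000)
shapiro.test(residuals(interact_aic))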
## Log-Normal Model: Transformed
Transforming the model covariates can help address violations of the ordinary least squares assumptions. From the results below, however, log transforming the covariates did not lead to a valid model.
# transform the response and predictors to the log scale
train_trans <- linear_train%>%
    mutate_if(is.numeric, ~log(1+.))

gauss_interact_trans <- lm(target_amt ~ bluebook*car_type + car_age*car_type + avg_clm*car_type, data=train_trans)

trans_aic <- step(gauss_interact_trans)
## Start: AIC=-1059.59
## target_amt ~ bluebook * car_type + car_age * car_type + avg_clm *
## car_type
##
## Df Sum of Sq RSS AIC
## - car_type:car_age 5 0.6365 638.55 -1068.2
## - car_type:avg_clm 5 3.6708 641.58 -1061.5
## <none> 637.91 -1059.6
## - bluebook:car_type 5 5.0004 642.91 -1058.6
##
## Step: AIC=-1068.19
## target_amt ~ bluebook + car_type + car_age + avg_clm + bluebook:car_type +
## car_type:avg_clm
##
## Df Sum of Sq RSS AIC
## - car_type:avg_clm 5 3.5766 642.12 -1070.3
## - car_age 1 0.1112 638.66 -1070.0
## <none> 638.55 -1068.2
## - bluebook:car_type 5 5.3812 643.93 -1066.4
##
## Step: AIC=-1070.35
## target_amt ~ bluebook + car_type + car_age + avg_clm + bluebook:car_type
##
## Df Sum of Sq RSS AIC
## - avg_clm 1 0.0015 642.13 -1072.3
## - car_age 1 0.2033 642.33 -1071.9
## <none> 642.12 -1070.3
## - bluebook:car_type 5 5.3367 647.46 -1068.7
##
## Step: AIC=-1072.35
## target_amt ~ bluebook + car_type + car_age + bluebook:car_type
##
## Df Sum of Sq RSS AIC
## - car_age 1 0.2035 642.33 -1073.9
## <none> 642.13 -1072.3
## - bluebook:car_type 5 5.3353 647.46 -1070.7
##
## Step: AIC=-1073.9
## target_amt ~ bluebook + car_type + bluebook:car_type
##
## Df Sum of Sq RSS AIC
## <none> 642.33 -1073.9
## - bluebook:car_type 5 5.3835 647.71 -1072.2
summary(trans_aic)
##
## Call:
## lm(formula = target_amt ~ bluebook + car_type + bluebook:car_type,
## data = train_trans)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.5664 -0.3185 0.1196 0.4496 1.7321
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.08773 0.80300 5.091 4.06e-07 ***
## bluebook 0.41910 0.08590 4.879 1.19e-06 ***
## car_typeSUV 1.36562 1.02930 1.327 0.1848
## car_typePickup 1.41316 1.07695 1.312 0.1897
## car_typePanel Truck 6.27198 2.96697 2.114 0.0347 *
## car_typeMinivan -0.42962 1.12278 -0.383 0.7020
## car_typeVan 6.28641 3.12119 2.014 0.0442 *
## bluebook:car_typeSUV -0.13404 0.11019 -1.216 0.2240
## bluebook:car_typePickup -0.13813 0.11554 -1.196 0.2321
## bluebook:car_typePanel Truck -0.61793 0.29124 -2.122 0.0340 *
## bluebook:car_typeMinivan 0.05108 0.11918 0.429 0.6682
## bluebook:car_typeVan -0.63946 0.31579 -2.025 0.0431 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6793 on 1392 degrees of freedom
## Multiple R-squared: 0.07067, Adjusted R-squared: 0.06333
## F-statistic: 9.624 on 11 and 1392 DF, p-value: < 2.2e-16
plot(trans_aic)
#reduced model = target_amt ~ bluebook + car_type + bluebook:car_type
The first gamma regression model will be of a null form with a log link function.
gamma_null <- glm(target_amt ~ 1, family=Gamma(link=log), data=linear_train)

summary(gamma_null)
##
## Call:
## glm(formula = target_amt ~ 1, family = Gamma(link = log), data = linear_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.80447 -0.47537 -0.07668 0.25103 2.74206
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.32646 0.01742 478 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Gamma family taken to be 0.4260992)
##
## Null deviance: 573.72 on 1403 degrees of freedom
## Residual deviance: 573.72 on 1403 degrees of freedom
## AIC: 25634
##
## Number of Fisher Scoring iterations: 5
glm1 <- glance(gamma_null)
Rather than examining a full gamma model in detail, we construct a reduced main-effects model using stepwise AIC for variable selection.
The reduced model has the following form:
glm(target_amt ~ bluebook)
gamma <- glm(target_amt ~ ., family=Gamma(link=log), data=linear_train)

gamma_reduced <- step(gamma) # reduced model --> glm(target_amt ~ bluebook)
## Start: AIC=25564.7
## target_amt ~ bluebook + car_age + car_type + avg_clm
##
## Df Deviance AIC
## - car_type 5 543.68 25560
## - avg_clm 1 542.50 25564
## - car_age 1 542.53 25565
## <none> 541.92 25565
## - bluebook 1 566.40 25634
##
## Step: AIC=25559.54
## target_amt ~ bluebook + car_age + avg_clm
##
## Df Deviance AIC
## - car_age 1 544.30 25559
## - avg_clm 1 544.33 25559
## <none> 543.68 25560
## - bluebook 1 573.25 25643
##
## Step: AIC=25559.25
## target_amt ~ bluebook + avg_clm
##
## Df Deviance AIC
## - avg_clm 1 544.95 25559
## <none> 544.30 25559
## - bluebook 1 573.27 25640
##
## Step: AIC=25559.02
## target_amt ~ bluebook
##
## Df Deviance AIC
## <none> 544.95 25559
## - bluebook 1 573.72 25639
summary(gamma_reduced)
##
## Call:
## glm(formula = target_amt ~ bluebook, family = Gamma(link = log),
## data = linear_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.74975 -0.47658 -0.06705 0.27690 2.11637
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.042e+00 3.341e-02 240.694 <2e-16 ***
## bluebook 1.800e-05 1.932e-06 9.319 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Gamma family taken to be 0.3505996)
##
## Null deviance: 573.72 on 1403 degrees of freedom
## Residual deviance: 544.95 on 1402 degrees of freedom
## AIC: 25559
##
## Number of Fisher Scoring iterations: 5
glm2 <- glance(gamma_reduced)
The requirements for a valid GLM are explored below and summarized here for the reduced gamma model.
Findings:
For additional information see: Dunn, Peter K. and Gordon K. Smyth, 2018; Balajari, 2013.
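One check recommended by Dunn and Smyth is to inspect randomized quantile residuals, which should be approximately standard normal when the gamma family and link are appropriate. A minimal sketch, assuming the statmod package is installed:
# randomized quantile residuals for the reduced gamma model
qres <- statmod::qresiduals(gamma_reduced)

qqnorm(qres); qqline(qres) # approximate normality supports the gamma/log-link choice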
# plot residuals vs the linear predictor -- looking for constant variance in the deviance residuals
# (a lack of pattern indicates a suitable link)
plot(residuals(gamma_reduced) ~ predict(gamma_reduced, type='link'),
    xlab=expression(hat(eta)), ylab='Deviance Residuals')

plot(density(rstandard(gamma_reduced, type='deviance')), col='blue')

# check for independence of observations
scatter.smooth(rstandard(gamma_reduced, type='deviance'), col='gray')

# check for high leverage
Cooks_Distance <- cooks.distance(gamma_reduced)

plot(Cooks_Distance)

lev <- Cooks_Distance

length(lev[lev > mean(lev) * 2]) # count of observations exceeding twice the mean Cook's distance
## [1] 102
# check for influence -- no influential obs
which(influence.measures(gamma_reduced)$is.inf[,'cook.d'])
## named integer(0)
# check residuals vs fitted values; a well-fit model will show no pattern
scatter.smooth(predict(gamma_reduced, type='response'),
    rstandard(gamma_reduced, type='deviance'), col='gray')
# Density distributions: actual vs predicted values for target_amt, another check on model fit
shape_red <- gamma.shape(gamma_reduced) # extract shape parameter, alpha = 2.73, SE 0.097

predict_red <- predict(gamma_reduced, type='response', se = TRUE,
    dispersion = 1/shape_red$alpha)

summary(gamma_reduced, dispersion = 1/shape_red$alpha)
##
## Call:
## glm(formula = target_amt ~ bluebook, family = Gamma(link = log),
## data = linear_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.74975 -0.47658 -0.06705 0.27690 2.11637
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 8.042e+00 3.414e-02 235.55 <2e-16 ***
## bluebook 1.800e-05 1.974e-06 9.12 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Gamma family taken to be 0.3660863)
##
## Null deviance: 573.72 on 1403 degrees of freedom
## Residual deviance: 544.95 on 1402 degrees of freedom
## AIC: 25559
##
## Number of Fisher Scoring iterations: 5
shape1_df <- cbind(linear_train, predict_red)

shape1_df%>%
    dplyr::select(target_amt, fit)%>%
    pivot_longer(cols=c('target_amt', 'fit'), names_to = 'Response', values_to = 'Cost')%>%
    ggplot(aes(Cost, fill=Response))+
    geom_density(alpha=.25)+
    theme_minimal()+
    labs(title = 'Density Distributions for Actual & Fitted Values: Gamma Reduced Model')
Building on the reduced model, we construct a gamma regression model with ‘car_type’ entered in interaction terms and variable selection via stepwise AIC.
The result has the following form:
target_amt ~ bluebook + car_type + avg_clm + bluebook:car_type + car_type:avg_clm
The model AIC is now 25556 and the residual deviance is 532.
gamma_interact <- glm(target_amt ~ bluebook*car_type + car_age*car_type + avg_clm*car_type, family=Gamma(link='log'), data=linear_train)

gamma_reduced_int <- step(gamma_interact) # --> target_amt ~ bluebook + car_type + avg_clm + bluebook:car_type + car_type:avg_clm
## Start: AIC=25564.64
## target_amt ~ bluebook * car_type + car_age * car_type + avg_clm *
## car_type
##
## Df Deviance AIC
## - car_type:car_age 5 531.80 25557
## <none> 531.11 25565
## - car_type:avg_clm 5 534.71 25565
## - bluebook:car_type 5 537.72 25574
##
## Step: AIC=25556.56
## target_amt ~ bluebook + car_type + car_age + avg_clm + bluebook:car_type +
## car_type:avg_clm
##
## Df Deviance AIC
## - car_age 1 532.18 25556
## <none> 531.80 25557
## - car_type:avg_clm 5 535.35 25557
## - bluebook:car_type 5 538.77 25567
##
## Step: AIC=25555.63
## target_amt ~ bluebook + car_type + avg_clm + bluebook:car_type +
## car_type:avg_clm
##
## Df Deviance AIC
## <none> 532.18 25556
## - car_type:avg_clm 5 535.87 25556
## - bluebook:car_type 5 539.21 25566
summary(gamma_reduced_int)
##
## Call:
## glm(formula = target_amt ~ bluebook + car_type + avg_clm + bluebook:car_type +
## car_type:avg_clm, family = Gamma(link = "log"), data = linear_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.75007 -0.46452 -0.07547 0.27827 2.25739
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.991e+00 9.525e-02 83.896 < 2e-16 ***
## bluebook 2.005e-05 6.115e-06 3.279 0.00107 **
## car_typeSUV -1.643e-02 1.179e-01 -0.139 0.88917
## car_typePickup 3.186e-02 1.189e-01 0.268 0.78868
## car_typePanel Truck 6.674e-01 2.383e-01 2.800 0.00517 **
## car_typeMinivan -2.182e-01 1.367e-01 -1.597 0.11052
## car_typeVan 3.996e-01 2.491e-01 1.604 0.10889
## avg_clm -4.986e-06 6.039e-06 -0.826 0.40912
## bluebook:car_typeSUV 2.541e-06 7.949e-06 0.320 0.74931
## bluebook:car_typePickup -2.018e-07 7.745e-06 -0.026 0.97922
## bluebook:car_typePanel Truck -2.336e-05 9.400e-06 -2.485 0.01307 *
## bluebook:car_typeMinivan 1.632e-05 8.447e-06 1.932 0.05360 .
## bluebook:car_typeVan -2.090e-05 1.227e-05 -1.703 0.08874 .
## car_typeSUV:avg_clm 1.577e-05 7.446e-06 2.118 0.03434 *
## car_typePickup:avg_clm 9.163e-06 7.820e-06 1.172 0.24149
## car_typePanel Truck:avg_clm -1.268e-05 1.069e-05 -1.186 0.23590
## car_typeMinivan:avg_clm 4.659e-06 8.340e-06 0.559 0.57656
## car_typeVan:avg_clm 3.363e-06 1.067e-05 0.315 0.75266
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Gamma family taken to be 0.3417136)
##
## Null deviance: 573.72 on 1403 degrees of freedom
## Residual deviance: 532.18 on 1386 degrees of freedom
## AIC: 25556
##
## Number of Fisher Scoring iterations: 6
glm3 <- glance(gamma_reduced_int)
The requirements for a valid GLM are explored below and summarized here for the reduced gamma model with interactions.
Findings:
Overall, the diagnostics for the reduced gamma model (main effects) and the reduced gamma model with interactions are comparable.
# plot residuals vs the linear predictor -- looking for constant variance in the deviance residuals
# (a lack of pattern indicates a suitable link)
plot(residuals(gamma_reduced_int) ~ predict(gamma_reduced_int, type='link'),
    xlab=expression(hat(eta)), ylab='Deviance Residuals')

plot(density(rstandard(gamma_reduced_int, type='deviance')), col='blue')

# check for independence of observations
scatter.smooth(rstandard(gamma_reduced_int, type='deviance'), col='gray')

# check for high leverage
Cooks_Distance <- cooks.distance(gamma_reduced_int)

plot(Cooks_Distance)

lev <- Cooks_Distance

length(lev[lev > mean(lev) * 2]) # count of observations exceeding twice the mean Cook's distance
## [1] 100
# check for influence -- no influential obs
which(influence.measures(gamma_reduced_int)$is.inf[,'cook.d'])
## named integer(0)
# check residuals vs fitted values; a well-fit model will show no pattern
scatter.smooth(predict(gamma_reduced_int, type='response'),
    rstandard(gamma_reduced_int, type='deviance'), col='gray')
# check actual vs predicted values for target_amt, another check on model fit
plot(density(linear_train$target_amt), ylim=c(0, .0008), main='Actual vs Fitted (red): Target_Amt')
lines(density(predict(gamma_reduced_int, type='response')), col='red')
# extract the gamma shape parameter and use it to set the dispersion for prediction and summary
shape_int <- gamma.shape(gamma_reduced_int) # alpha = 2.79, SE 0.099

predict_int <- predict(gamma_reduced_int, type='response', se = TRUE,
    dispersion = 1/shape_int$alpha)

summary(gamma_reduced_int, dispersion = 1/shape_int$alpha)
##
## Call:
## glm(formula = target_amt ~ bluebook + car_type + avg_clm + bluebook:car_type +
## car_type:avg_clm, family = Gamma(link = "log"), data = linear_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.75007 -0.46452 -0.07547 0.27827 2.25739
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 7.991e+00 9.748e-02 81.971 < 2e-16 ***
## bluebook 2.005e-05 6.259e-06 3.204 0.00135 **
## car_typeSUV -1.643e-02 1.206e-01 -0.136 0.89168
## car_typePickup 3.186e-02 1.216e-01 0.262 0.79337
## car_typePanel Truck 6.674e-01 2.439e-01 2.736 0.00622 **
## car_typeMinivan -2.182e-01 1.399e-01 -1.560 0.11870
## car_typeVan 3.996e-01 2.550e-01 1.567 0.11701
## avg_clm -4.986e-06 6.181e-06 -0.807 0.41981
## bluebook:car_typeSUV 2.541e-06 8.136e-06 0.312 0.75483
## bluebook:car_typePickup -2.018e-07 7.927e-06 -0.025 0.97969
## bluebook:car_typePanel Truck -2.336e-05 9.621e-06 -2.428 0.01518 *
## bluebook:car_typeMinivan 1.632e-05 8.645e-06 1.887 0.05911 .
## bluebook:car_typeVan -2.090e-05 1.256e-05 -1.664 0.09607 .
## car_typeSUV:avg_clm 1.577e-05 7.621e-06 2.070 0.03849 *
## car_typePickup:avg_clm 9.163e-06 8.003e-06 1.145 0.25226
## car_typePanel Truck:avg_clm -1.268e-05 1.094e-05 -1.159 0.24662
## car_typeMinivan:avg_clm 4.659e-06 8.536e-06 0.546 0.58525
## car_typeVan:avg_clm 3.363e-06 1.092e-05 0.308 0.75810
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Gamma family taken to be 0.3579492)
##
## Null deviance: 573.72 on 1403 degrees of freedom
## Residual deviance: 532.18 on 1386 degrees of freedom
## AIC: 25556
##
## Number of Fisher Scoring iterations: 6
shape2_df <- cbind(linear_train, predict_int)

shape2_df%>%
    dplyr::select(target_amt, fit)%>%
    pivot_longer(cols=c('target_amt', 'fit'), names_to = 'Response', values_to = 'Amount')%>%
    ggplot(aes(Amount, fill=Response))+
    geom_density(alpha=.25)+
    theme_minimal()+
    labs(title = 'Density Distributions for Actual & Fitted Values: Gamma Interaction Model')
We cannot make direct comparisons between the log-normal models and the gamma models: the log-normal models are fit to log(target_amt), so their AIC, BIC, and deviance are not on the same scale. Given that the log-normal models also failed several of the ordinary least squares requirements, comparisons are limited to the gamma models.
The gamma model with interactions has the lowest AIC and residual deviance among the gamma models (its BIC is higher than that of the main-effects model, reflecting the added interaction parameters). On this basis, it is the preferred choice among the included models. Given the relatively poor fit to the training data, however, it is unlikely to generalize well.
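Because AIC and BIC are in-sample criteria, the two candidate gamma models can also be compared out of sample with K-fold cross-validation. A minimal sketch using boot::cv.glm (assuming the boot package is installed); delta[1] is the cross-validated average squared prediction error on the response scale, so lower is better.
# 10-fold cross-validated prediction error for each gamma model
set.seed(123344)
cv_main <- boot::cv.glm(linear_train, gamma_reduced, K = 10)$delta[1]
cv_int <- boot::cv.glm(linear_train, gamma_reduced_int, K = 10)$delta[1]

c(main_effects = cv_main, interactions = cv_int)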
# build a table to compare results of the gamma models
col_names <- c('Gamma - Null', 'Gamma-No Interactions', 'Gamma-Interactions')

glm_results <- rbind(glm1, glm2, glm3)

(compare_gammas <- cbind(col_names, glm_results)%>%
    dplyr::select(c(col_names, AIC, BIC, deviance))%>%
    rename(Model = col_names)%>%
    flextable()%>%
    set_caption("Comparison of Gamma Model Results"))
Model | AIC | BIC | deviance |
---|---|---|---|
Gamma - Null | 25,633.98 | 25,644.47 | 573.7193 |
Gamma-No Interactions | 25,559.02 | 25,574.77 | 544.9491 |
Gamma-Interactions | 25,555.63 | 25,655.33 | 532.1792 |
To evaluate how our selected model generalizes and performs, we apply it to the hold-out data (‘linear_test’). Comparing predicted vs. actual values (scatterplot and density distributions), it is clear that the model does not perform well on the available data. This is not surprising given that claim payments are determined by a number of factors not included here.
For example:
Refitting the full gamma interaction model on the holdout data, none of the covariates reach significance at alpha = 0.05 (see the summary below). Model metrics for AIC, BIC, and deviance were 8498.3, 8601.958, and 154.8157, respectively. These values are far lower than those reported for the training dataset, which largely reflects the smaller holdout sample.
# refit the model on the holdout data
gamma_test <- glm(target_amt ~ bluebook*car_type + car_age*car_type + avg_clm*car_type, family=Gamma(link='log'), data=linear_test)

# note: no covariates reach significance at alpha = 0.05 in the holdout refit

# collect predictions and print the summary, accounting for the gamma shape parameter
shape_test <- gamma.shape(gamma_test) # alpha = 3.17, SE 0.19

predict_test <- predict(gamma_test, type='response', se = TRUE,
    dispersion = 1/shape_test$alpha)

summary(gamma_test, dispersion = 1/shape_test$alpha) # print summary
##
## Call:
## glm(formula = target_amt ~ bluebook * car_type + car_age * car_type +
## avg_clm * car_type, family = Gamma(link = "log"), data = linear_test)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.22987 -0.46243 -0.06145 0.26695 1.88389
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 8.154e+00 1.746e-01 46.713 <2e-16 ***
## bluebook 7.708e-06 1.031e-05 0.748 0.4547
## car_typeSUV -6.928e-03 2.209e-01 -0.031 0.9750
## car_typePickup -1.354e-01 2.120e-01 -0.639 0.5230
## car_typePanel Truck -3.353e-01 4.992e-01 -0.672 0.5018
## car_typeMinivan -2.490e-01 2.919e-01 -0.853 0.3936
## car_typeVan 2.794e-01 4.498e-01 0.621 0.5344
## car_age 8.369e-03 1.415e-02 0.591 0.5543
## avg_clm -6.501e-06 6.705e-06 -0.970 0.3323
## bluebook:car_typeSUV -2.416e-06 1.440e-05 -0.168 0.8668
## bluebook:car_typePickup 1.030e-05 1.338e-05 0.770 0.4416
## bluebook:car_typePanel Truck 1.956e-05 1.867e-05 1.047 0.2949
## bluebook:car_typeMinivan 7.946e-06 1.603e-05 0.496 0.6201
## bluebook:car_typeVan -1.650e-05 2.282e-05 -0.723 0.4696
## car_typeSUV:car_age -2.586e-03 1.678e-02 -0.154 0.8775
## car_typePickup:car_age -4.270e-03 1.796e-02 -0.238 0.8121
## car_typePanel Truck:car_age -2.748e-03 1.994e-02 -0.138 0.8904
## car_typeMinivan:car_age 6.771e-03 1.861e-02 0.364 0.7161
## car_typeVan:car_age -3.090e-03 1.990e-02 -0.155 0.8766
## car_typeSUV:avg_clm 7.490e-06 8.764e-06 0.855 0.3927
## car_typePickup:avg_clm 1.130e-05 1.050e-05 1.076 0.2818
## car_typePanel Truck:avg_clm 5.529e-06 2.854e-05 0.194 0.8464
## car_typeMinivan:avg_clm 3.009e-05 1.725e-05 1.744 0.0812 .
## car_typeVan:avg_clm 5.631e-06 1.131e-05 0.498 0.6186
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Gamma family taken to be 0.3151184)
##
## Null deviance: 169.06 on 466 degrees of freedom
## Residual deviance: 154.82 on 443 degrees of freedom
## AIC: 8498.3
##
## Number of Fisher Scoring iterations: 6
glm4 <- glance(gamma_test)%>% # collect select metrics
    dplyr::select(c(AIC, BIC, deviance))

# check fits using density distributions of actual and predicted values
shape3_df <- cbind(linear_test, predict_test)

p1 <- shape3_df%>%
    dplyr::select(target_amt, fit)%>%
    pivot_longer(cols=c('target_amt', 'fit'), names_to = 'Response', values_to = 'Amount')%>%
    ggplot(aes(Amount, fill=Response))+
    geom_density(alpha=.25)+
    theme_minimal()+
    labs(title = 'Density Distributions for Actual & Predicted Values: Hold-Out Data')

p2 <- ggplot(shape3_df, aes(x=target_amt, y=fit))+
    geom_point()+
    geom_smooth(method=lm, se=TRUE)+
    theme_minimal()+
    labs(title='Predicted vs. Actual Claim Payments: Hold-Out Data')

(p1/p2) # print prediction plots
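A more direct check of generalization is to predict the holdout payouts from the model fit on the training data (rather than refitting) and summarize the error. A minimal sketch:
# predict holdout payouts from the training-data model and summarize the error
holdout_pred <- predict(gamma_reduced_int, newdata = linear_test, type = "response")

c(RMSE = sqrt(mean((linear_test$target_amt - holdout_pred)^2)),
  MAE = mean(abs(linear_test$target_amt - holdout_pred)))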
This study sought to classify drivers based on their probability of an auto accident and to then predict the amount of claim payout associated with that accident. The covariates provided in the dataset were sufficient to build/select an effective logistic classifier. Predicting claim payout proved more problematic - owing to the small number of covariates suited for individual payout predictions as well as lack of information on how payment amount was derived. In a number of instances, payments exceeded the ‘bluebook’ value of a vehicle. These observations were dropped in order to better focus on payments resulting from vehicle damage alone. That said, claim determinations are typically influenced by such factors as:
The data included here only provided indirect information tied to the last factor (cost due to damage). Nonetheless, the modeling process was instructive. After comparing results for various log-normal and gamma regression models, a final selection was made on the basis of model validity as well as AIC, BIC, and deviance scores. The selected model was a gamma regression (log link) with two numerical covariates (‘bluebook’ and ‘avg_clm’), the factor ‘car_type’, and their interactions. Applied to a holdout dataset, the model underestimated low and high values and overestimated mid-range values, and no covariate reached significance when the model form was refit to the holdout data. Given these results, a differentiated model based on binned values for ‘bluebook’ may improve the accuracy of payment predictions (one possible implementation is sketched below).
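As one way to operationalize that suggestion, ‘bluebook’ could be binned into value bands and entered as a factor (or used to fit separate models per band). A minimal sketch with illustrative, not prescribed, break points:
# bin bluebook into illustrative value bands and refit the gamma model
linear_train$bluebook_band <- cut(linear_train$bluebook,
    breaks = c(0, 10000, 20000, 30000, Inf),
    labels = c('low', 'mid', 'high', 'premium'))

gamma_banded <- glm(target_amt ~ bluebook_band + car_type + avg_clm,
    family = Gamma(link = 'log'), data = linear_train)

summary(gamma_banded) # fitting separate models per band is an alternative to the factor approach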