————————————————————————————————————————–
In this assignment, you will explore, analyze and model a data set containing approximately 8000 records, each representing a customer at an auto insurance company. Each record has two response variables. The first response variable, TARGET_FLAG, is a 1 or a 0. A “1” means that the person was in a car crash. A zero means that the person was not in a car crash. The second response variable is TARGET_AMT. This value is zero if the person did not crash their car. But if they did crash their car, this number will be a value greater than zero. Your objective is to build multiple linear regression and binary logistic regression models on the training data to predict the probability that a person will crash their car and also the amount of money it will cost if the person does crash their car. You can only use the variables given to you (or variables that you derive from the variables provided).
The data set we are exploring contains actuarial data related to auto insurance policyholders, including indications of whether each policyholder has been involved in a past car accident and, if so, the amounts of the insurance claims payouts.
There are 8161 rows of actuarial data, each representing behavioral and demographic information about a specific policyholder. For each policyholder we are provided with 23 attributes that could potentially be used as predictor variables, and two response variables:
TARGET_FLAG: Indicates whether a client has been involved in a past car accident
TARGET_AMT: The insurance claim payout related to that past car accident
Our assignment is to try to predict whether or not an insurance policy customer is likely to be involved in a car accident on the basis of the 23 attributes provided in the data set. Then, if it is determined that the customer is likely to be involved in a car accident, we are to provide a predictive estimate of the likely cost of the subsequent auto insurance claim.
The original data set contains 13 numeric predictor variables and one numeric response variable (TARGET_AMT). A summary of the numeric variables is shown in the following table:
| Variable | n | mean | sd | median | min | max | range | skew | kurtosis | se |
|---|---|---|---|---|---|---|---|---|---|---|
| TARGET_AMT | 8161 | 1504 | 4704 | 0 | 0 | 107586 | 107586 | 9 | 112 | 52 |
| KIDSDRIV | 8161 | 0 | 1 | 0 | 0 | 4 | 4 | 3 | 12 | 0 |
| AGE | 8155 | 45 | 9 | 45 | 16 | 81 | 65 | 0 | 0 | 0 |
| HOMEKIDS | 8161 | 1 | 1 | 0 | 0 | 5 | 5 | 1 | 1 | 0 |
| YOJ | 7707 | 10 | 4 | 11 | 0 | 23 | 23 | -1 | 1 | 0 |
| INCOME | 7716 | 61898 | 47573 | 54028 | 0 | 367030 | 367030 | 1 | 2 | 542 |
| HOME_VAL | 7697 | 154867 | 129124 | 161160 | 0 | 885282 | 885282 | 0 | 0 | 1472 |
| TRAVTIME | 8161 | 33 | 16 | 33 | 5 | 142 | 137 | 0 | 1 | 0 |
| BLUEBOOK | 8161 | 15710 | 8420 | 14440 | 1500 | 69740 | 68240 | 1 | 1 | 93 |
| TIF | 8161 | 5 | 4 | 4 | 1 | 25 | 24 | 1 | 0 | 0 |
| OLDCLAIM | 8161 | 4037 | 8777 | 0 | 0 | 57037 | 57037 | 3 | 10 | 97 |
| CLM_FREQ | 8161 | 1 | 1 | 0 | 0 | 5 | 5 | 1 | 0 | 0 |
| MVR_PTS | 8161 | 2 | 2 | 1 | 0 | 13 | 13 | 1 | 1 | 0 |
| CAR_AGE | 7651 | 8 | 6 | 8 | -3 | 28 | 31 | 0 | -1 | 0 |
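The table above is the style of output produced by the describe() function from R’s psych package, which is quite possibly how it was generated; a minimal sketch, assuming the training data has been read into a data frame named hw4 (the name that appears in our model calls below):

```r
# Sketch: per-variable summary statistics for the numeric columns.
# psych::describe() also emits 'trimmed' and 'mad' columns, which were
# omitted from the table above.
library(psych)
num_cols <- sapply(hw4, is.numeric)
describe(hw4[, num_cols])
```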
As we can see in the table, several numeric variables are plagued by missing data values. The missing values appear as either ‘NA’ values or blank values in the data set. The ‘JOB’ categorical variable was also discovered to have a significant number of missing values. In fact, roughly 21% of the 8161 rows contained within the data set were found to have either ‘NA’ or blank values. The missing data values are summarized below by variable name.
| Variable | Missing/NA Values |
|---|---|
| AGE | 6 |
| YOJ | 454 |
| CAR_AGE | 510 |
| HOME_VAL | 464 |
| INCOME | 445 |
| JOB | 526 |
The table also provides evidence of potential skew for some variables, including ‘INCOME’ and ‘OLDCLAIM’, as evidenced by the wide disparities between their means and medians.
Seven of the independent variables (‘PARENT1’, ‘MSTATUS’, ‘SEX’, ‘CAR_USE’, ‘RED_CAR’, ‘REVOKED’, and ‘URBANICITY’) are binary categorical variables which will need to be treated as factors. Each of these variables can only assume one of two predefined text string values. The valid values for these variables are listed in the table shown below.
| PARENT1 | MSTATUS | SEX | CAR_USE | RED_CAR | REVOKED | URBANICITY |
|---|---|---|---|---|---|---|
| No | No | M | Private | no | No | Highly Rural/ Rural |
| Yes | Yes | F | Commercial | yes | Yes | Highly Urban/ Urban |
‘EDUCATION’, ‘JOB’, and ‘CAR_TYPE’ are multi-category variables which will also need to be treated as factors. Each of these variables can assume one of more than two predefined text string values, as indicated in the table below.
| EDUCATION | JOB | CAR_TYPE |
|---|---|---|
| -HS | Blue Collar | Minivan |
| HS | Professional | SUV |
| Bachelors | Manager | Sports Car |
| Masters | Home Maker | Van |
| PhD | Clerical | Panel Truck |
| (-) | Doctor | Pickup |
| (-) | Lawyer | (-) |
| (-) | Student | (-) |
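A minimal sketch of the factor conversion, again assuming the training data frame is named hw4:

```r
# Treat the seven binary and three multi-category string variables as factors.
cat_vars <- c("PARENT1", "MSTATUS", "SEX", "CAR_USE", "RED_CAR", "REVOKED",
              "URBANICITY", "EDUCATION", "JOB", "CAR_TYPE")
hw4[cat_vars] <- lapply(hw4[cat_vars], factor)
```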
Boxplots and barplots of each independent variable relative to the binary response ‘TARGET_FLAG’ variable are a useful and direct way to analyze skew issues while also allowing us to develop some preliminary intuition regarding the predictive aspects of the data:
The boxplots shown above indicate that our skew suspicions regarding the ‘INCOME’ and ‘OLDCLAIM’ variables appear to have been justified. Furthermore, we are able to deduce some predictive aspects of the variables in relation to the ‘TARGET_FLAG’ response variable. For example, it appears as though larger values for ‘AGE’, ‘INCOME’, ‘HOME_VAL’, ‘BLUEBOOK’, ‘TIF’, and ‘CAR_AGE’ will each tend to decrease the likelihood that a policyholder will be involved in a car accident, while larger values for ‘OLDCLAIM’, ‘CLM_FREQ’, ‘MVR_PTS’, and ‘TRAVTIME’ will each tend to increase that likelihood. For ‘YOJ’, however, there appears to be no significant relationship with the ‘TARGET_FLAG’ response.
The barplots allow us to infer additional relationships between different categories of drivers and the ‘TARGET_FLAG’ response variable. For example, segmenting the ‘KIDSDRIV’ and ‘HOMEKIDS’ variables into binary indicators shows us that policyholders who either have children living at home or have teenage drivers who use their vehicles are more likely to have been involved in accidents. Similarly, single parents, females, drivers of commercial vehicles, non-college graduates, students, blue collar workers, drivers of sports cars, and drivers who’ve had their licenses revoked in the past all appear to be more likely to be involved in car accidents than are drivers who are not members of those categories.
The following table summarizes the proportional relationships of our binary predictor variables with the TARGET_FLAG response variable. A relatively large difference between the ‘yes’ and ‘no’ proportions indicates that a policyholder belonging to one of the indicated categories is more likely to be involved in a car accident than a policyholder not belonging to the same category. Note that the ‘SEX’ and ‘RED_CAR’ variables show relatively small differences between groupings and as such may be candidates for exclusion from our regression models.
| Variable | no | yes | diff |
|---|---|---|---|
| H_RENTER | .338 | .662 | .324 |
| URBANICITY | .069 | .314 | .245 |
| REVOKED | .239 | .443 | .240 |
| JOB_COLOR | .389 | .611 | .222 |
| PARENT1 | .237 | .442 | .205 |
| KIDSDRIV | .247 | .387 | .140 |
| HOMEKIDS | .247 | .387 | .140 |
| CAR_USE | .216 | .346 | .130 |
| MSTATUS | .337 | .215 | .122 |
| SEX | .254 | .272 | .018 |
| RED_CAR | .266 | .259 | .007 |
Histograms allow us to more thoroughly examine whether the distribution of a variable is skewed as well as whether there may be high incidence of specific variable values throughout the data set:
The histograms again demonstrate the skew issues we’ve previously identified while also indicating that the distributions of many of the skewed variables are zero-bound. For example, ‘INCOME’, ‘HOME_VAL’, ‘OLDCLAIM’, ‘CLM_FREQ’, ‘MVR_PTS’, and ‘TARGET_AMT’ all appear to be zero-bound, while ‘CAR_AGE’ is dominated by a large spike at CAR_AGE = 1. Each of these variables, along with the ‘BLUEBOOK’ variable, is right-skewed and may need to be transformed using either a log function or perhaps a Box-Cox recommended power transform if their skew negatively impacts our regression models.
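As an illustration of the sort of remedy we might apply later (a sketch only, not a step taken at this stage), a log transform with a +1 offset accommodates the zero-bound variables:

```r
# Compare the raw and log-transformed shapes of a zero-bound, right-skewed variable.
# The +1 offset keeps the zero values defined under the log.
par(mfrow = c(1, 2))
hist(hw4$INCOME, main = "INCOME (raw)", xlab = "INCOME")
hist(log(hw4$INCOME + 1), main = "log(INCOME + 1)", xlab = "log(INCOME + 1)")
```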
A correlation matrix for the numeric variables contained within the data set is provided below. As can be seen in the matrix, none of the numeric variables show evidence of a particularly strong correlation with either of the response variables, and none of the predictors appear to be more than 0.54 correlated with each other.
However, the relative magnitudes of the correlation values can still be of use to us for purposes of constructing effective regression models. For example, a ranking of the correlations of each numeric predictor relative to the ‘TARGET_FLAG’ response can be used during model building to select variables that are more likely to be predictive of the response. The table below provides such a rank ordering:
| Variable | Cor. w/ TARGET_FLAG |
|---|---|
| MVR_PTS | 0.22 |
| CLM_FREQ | 0.22 |
| OLDCLAIM | 0.14 |
| INCOME | -0.14 |
| AGE | -0.10 |
| BLUEBOOK | -0.10 |
| CAR_AGE | -0.10 |
| TIF | -0.08 |
| TRAVTIME | 0.05 |
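A minimal sketch of how such a ranking can be produced, assuming TARGET_FLAG is stored as a numeric 0/1 column in hw4:

```r
# Rank the numeric predictors by the magnitude of their correlation with
# TARGET_FLAG, tolerating the missing values noted earlier.
num_preds <- c("MVR_PTS", "CLM_FREQ", "OLDCLAIM", "INCOME", "AGE",
               "BLUEBOOK", "CAR_AGE", "TIF", "TRAVTIME")
cors <- sapply(num_preds, function(v)
  cor(hw4[[v]], hw4$TARGET_FLAG, use = "pairwise.complete.obs"))
cors[order(abs(cors), decreasing = TRUE)]
```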
Of the predictor variables that appear to be most correlated with each other, their relationships make sense from an intuitive perspective. For example, the correlation of ‘INCOME’ with ‘BLUEBOOK’ and ‘HOME_VAL’ unsurprisingly suggests that people with higher incomes tend to own more expensive vehicles and homes than do people with relatively lower incomes. Similarly, having kids living at home (HOMEKIDS) appears to be correlated with having teenagers that drive your car (KIDSDRIV). These correlations can be used during model building to help us avoid the potential inclusion of additional variables that may be collinear to those already included.
During our data exploration efforts we developed many useful insights into the predictive qualities of the potential predictor variables and discovered evidence of significant quantities of missing data values for several variables. The missing data values span approximately 21% of the observations we’ve been provided, which indicates a need for imputation of the missing values. We also identified several numeric variables (‘CAR_AGE’, ‘HOME_VAL’, ‘INCOME’, ‘OLDCLAIM’, ‘MVR_PTS’, ‘CLM_FREQ’, and ‘BLUEBOOK’) that may require either log or Box-Cox recommended power transformations during model building.
————————————————————————————————————————–
Our data preparation efforts included the imputation of missing values for six of the predictor variables, the creation of three new binary factor variables (each derived from one of three of the provided predictor variables), and the conversion of two numeric predictor variables to binary “0/1” factor variables.
While we also considered the possibility of transforming one or more of the predictor variables that have skewed distributions, we chose not to apply any such transforms prior to building binary models since normal distributions aren’t necessarily required for logistic regression modeling. Transforms can be applied if the marginal model plots for a logistic regression model show evidence of deviance between the modeled data and the actual data, but aren’t required prior to model building.
Similarly, we chose not to preemptively transform any skewed distributions for purposes of linear modeling, and this approach allowed us to make use of different transforms for each of our linear models. Explanations of any model-specific transformations are discussed within the individual model writeups provided in Part 3.
Our first data preparation step focused on the imputation of missing data values for six of the predictor variables: AGE, HOME_VAL, JOB, CAR_AGE, INCOME, and YOJ. For three of these variables (CAR_AGE, INCOME, and YOJ) we used linear regression to generate imputed values for the missing data. For the AGE, HOME_VAL, and JOB variables we relied on simply replacing the missing values with either zeroes or the median of the variable. A summary of the imputation approach for each of the six variables is provided below.
We found a total of six missing values for the AGE variable. Since the data set is comprised of auto insurance policyholder information, we concluded that the missing AGE values required imputation. However, given the relatively small number of missing values found, we concluded it was reasonable to simply set each to the median AGE value without significantly introducing bias into the data set.
The HOME_VAL variable was found to contain 464 blank entries. We concluded it was reasonable to assume that those blank entries could be indicative of the policyholder being a renter rather than a homeowner. Therefore, the blank entries were simply filled with zeroes.
The JOB variable was found to contain 526 blank entries. We concluded it was reasonable to assume that not every auto insurance policyholder would be employed, and that some of the blanks could simply have been the result of the data not having been collected at the time the policy was applied for. Therefore, the blank entries were filled with a newly created job category titled “None Specified”.
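A minimal sketch of these three simple imputations (the column names are from the data set; the code itself is illustrative):

```r
# AGE: only six missing values, so fill with the median.
hw4$AGE[is.na(hw4$AGE)] <- median(hw4$AGE, na.rm = TRUE)

# HOME_VAL: blanks assumed to indicate renters, so fill with zero.
hw4$HOME_VAL[is.na(hw4$HOME_VAL)] <- 0

# JOB: blanks become a new "None Specified" category.
job <- as.character(hw4$JOB)
job[is.na(job) | job == ""] <- "None Specified"
hw4$JOB <- factor(job)
```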
The CAR_AGE, INCOME, and YOJ variables all showed evidence of significant amounts of missing data that could not be explained away through the use of “common sense” explanations. We chose not to use the mean or median as a replacement value for these relatively large numbers of missing values since linear regression would yield imputed values that were much more consistent with the actual distribution of the data while introducing much less potential bias.
While creating our imputation regression models we made use of Variance Inflation Factors (VIF) to verify that there were no collinearity issues and used backward selection to ensure that all p-values were < \(.05\). Each model produced imputed distributions for the subject variables that were consistent with those of the original NA/blank populated data. It is our belief that this consistency indicates that the resulting predicted values for the missing data are an improvement over simply filling the NA’s with a mean or median. Furthermore, the replacement of the NA’s with numerical values allows us to run our final regression models on all records, not just those without NA’s.
The imputation strategies employed for each of these variables were as follows:
510 missing values (NA’s) were identified for the CAR_AGE variable. Since the data set is comprised of auto insurance policyholder information, we concluded it was reasonable to assume that the missing CAR_AGE values required imputation. The resulting regression model was comprised of only two variables (HOME_VAL and EDUCATION) and achieved an adjusted \(R^2\) value of 0.5217.
We also discovered a single negative value (‘-3’) for CAR_AGE which seemed entirely implausible and was therefore assumed to have been the result of a typographical error. The negative value was converted to a positive number via simple use of its absolute value.
445 blank entries were found for the INCOME variable. Since it is reasonable to assume that some policyholders might not be employed, before imputing we cross-validated those blank entries with the YOJ (years on the job) entries for the same records. Of the 445 blank INCOME entries, we found only 29 instances where no YOJ was indicated. Therefore, we concluded it was reasonable to assume that the other 416 blanks should be imputed since it would be logical for a policyholder to have an income if they also had indicated some number of years on a job.
The resulting regression model for those 416 blank entries was comprised of a total of nine variables and achieved an adjusted \(R^2\) value of 0.6658. Any imputed income amounts of less than zero were assigned a value of zero. The remaining 29 blank entries that corresponded with blank entries for YOJ were simply filled with zeroes.
454 missing values (NA’s) were discovered for the YOJ variable. As explained above, only 29 of those missing values corresponded with missing values for the INCOME variable. Therefore, we concluded that the remaining 425 NA’s required imputation since a corresponding value found in the INCOME variable would most likely be indicative of the policyholder being employed and having some amount of tenure in that employment.
The resulting regression model for those 425 NA’s was comprised of a total of eight variables and achieved an adjusted \(R^2\) value of 0.3116. The remaining 29 NA’s that corresponded with blank entries for INCOME were simply filled with zeroes.
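As an illustration, a hedged sketch of the regression-based imputation for CAR_AGE using the two predictors named above; the INCOME and YOJ imputations followed the same pattern with their own predictor sets:

```r
# Fit the imputation model on the complete cases only...
car_age_fit <- lm(CAR_AGE ~ HOME_VAL + EDUCATION, data = hw4,
                  subset = !is.na(CAR_AGE))

# ...then predict the missing values. (For INCOME, negative predictions were
# additionally floored at zero, e.g., via pmax(pred, 0).)
na_idx <- which(is.na(hw4$CAR_AGE))
hw4$CAR_AGE[na_idx] <- predict(car_age_fit, newdata = hw4[na_idx, ])
```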
Our analysis of the CAR_AGE variable indicated that its distribution is dominated by values of (CAR_AGE = 1). Since a value of ‘1’ is indicative of the car being one year old or less, we concluded it was reasonable to assume that the CAR_AGE variable could serve as the basis for a new derived binary ‘0/1’ factor variable to be used during the binary logistic regression modeling process.
Therefore, a variable named NEW_CAR was created via simple thresholding of CAR_AGE, yielding the following interpretation:
NEW_CAR = 0: The automobile is more than 1 year old
NEW_CAR = 1: The automobile is 1 year old or less
The original CAR_AGE variable was not discarded from the data set, but instead was retained for use during the development of linear models.
Our analysis of the HOME_VAL variable indicated that the shape of its distribution is significantly impacted by values of (HOME_VAL = 0). Since a value of ‘0’ would likely be indicative of the policyholder being a renter rather than a homeowner, we concluded it was reasonable to assume that the HOME_VAL variable could serve as the basis for a new derived binary ‘0/1’ factor variable to be used during the binary logistic regression modeling process.
As such, a variable named H_RENTER was created via simple thresholding of HOME_VAL, yielding the following interpretation:
H_RENTER = 0: The policyholder is not a renter, but instead owns their home
H_RENTER = 1: The policyholder is a renter
The original HOME_VAL variable was not discarded from the data set, but instead was retained for use during the development of linear models.
Our analysis of both the HOMEKIDS and KIDSDRIV variables indicated that large percentages of the values of each were zeroes. We therefore concluded that both variables could reasonably be converted to binary ‘0/1’ variables for purposes of our analysis.
Both variables were transformed via simple thresholding where all non-zero values were converted to ‘1’, yielding the following interpretations (see the code sketch after this list):
HOMEKIDS = 0: The policyholder has zero children living at home
HOMEKIDS = 1: The policyholder has one or more children living at home
KIDSDRIV = 0: Zero teenage drivers make use of the policyholder’s insured vehicle
KIDSDRIV = 1: One or more teenage drivers make use of the policyholder’s insured vehicle
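A minimal sketch covering all four of the derived binary variables described above:

```r
# Derive the binary factor variables via simple thresholding.
hw4$NEW_CAR  <- factor(ifelse(hw4$CAR_AGE <= 1, 1, 0))   # 1 = car is a year old or less
hw4$H_RENTER <- factor(ifelse(hw4$HOME_VAL == 0, 1, 0))  # 1 = policyholder is a renter
hw4$HOMEKIDS <- factor(ifelse(hw4$HOMEKIDS > 0, 1, 0))   # 1 = children living at home
hw4$KIDSDRIV <- factor(ifelse(hw4$KIDSDRIV > 0, 1, 0))   # 1 = teenage drivers on vehicle
```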
Our analysis of the JOB variable during data exploration indicated that certain categories of jobs were more likely to be prone to past car accidents than were others. As such, we concluded it was reasonable to assume that the JOB variable could serve as the basis for a new derived binary factor variable to be used during the binary logistic regression modeling process.
The JOB_COLOR variable was created by segregating the (post-imputation) nine separate categories of the JOB variable as follows:
JOB_COLOR = “White”: Doctor, Lawyer, Professional, Manager, Clerical, “None Specified”
JOB_COLOR = “Blue”: Student, Home Maker, z_Blue Collar
The original JOB variable was not discarded from the data set, but instead was retained for use during the development of linear models.
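A sketch of that segregation, using the level names shown above (including the data set’s ‘z_Blue Collar’ label):

```r
# Map the nine post-imputation JOB categories onto a two-level factor.
blue_jobs <- c("Student", "Home Maker", "z_Blue Collar")
hw4$JOB_COLOR <- factor(ifelse(as.character(hw4$JOB) %in% blue_jobs, "Blue", "White"))
```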
As should be evident from the discussion here, our data preparation efforts touched on a large percentage of the predictor variables contained within the training data set. Post-imputation summary statistics for the numeric predictor variables show that we haven’t substantively changed the skew or kurtosis of any of the variables that required imputation:
| Variable | n | mean | sd | median | min | max | range | skew | kurtosis | se |
|---|---|---|---|---|---|---|---|---|---|---|
| AGE | 8161 | 45 | 9 | 45 | 16 | 81 | 65 | 0 | 0 | 0 |
| YOJ | 8161 | 10 | 4 | 11 | 0 | 23 | 23 | -1 | 1 | 0 |
| INCOME | 8161 | 61417 | 47251 | 54005 | 0 | 367030 | 367030 | 1 | 2 | 523 |
| HOME_VAL | 8161 | 146062 | 130427 | 151957 | 0 | 885282 | 885282 | 1 | 0 | 1444 |
| TRAVTIME | 8161 | 33 | 16 | 33 | 5 | 142 | 137 | 0 | 1 | 0 |
| BLUEBOOK | 8161 | 15710 | 8420 | 14440 | 1500 | 69740 | 68240 | 1 | 1 | 93 |
| TIF | 8161 | 5 | 4 | 4 | 1 | 25 | 24 | 1 | 0 | 0 |
| OLDCLAIM | 8161 | 4037 | 8777 | 0 | 0 | 57037 | 57037 | 3 | 10 | 97 |
| CLM_FREQ | 8161 | 1 | 1 | 0 | 0 | 5 | 5 | 1 | 0 | 0 |
| MVR_PTS | 8161 | 2 | 2 | 1 | 0 | 13 | 13 | 1 | 1 | 0 |
| CAR_AGE | 8161 | 8 | 6 | 8 | 0 | 28 | 28 | 0 | -1 | 0 |
Each of the data preparation steps discussed above was subsequently applied to the evaluation data set to ensure proper function of the regression models described below in Part 3. The transformed training and evaluation data sets can be accessed via the following web links:
Training Data Set:
Evaluation Data Set:
————————————————————————————————————————–
This assignment required us to construct both binary logistic and linear regression models based upon the contents of the provided insurance_training_data.csv file. The binary logistic models are intended to predict the likelihood that a person will crash their car, while the linear regression models attempt to predict the likely amount of money it will cost an auto insurance company if the person does, in fact, crash their car. Both metrics would be of great interest to an auto insurance company since both are critically important for calculating the risk-adjusted pricing of auto insurance policies.
Three binary logistic models and two linear regression models were constructed using various model building strategies. Many of the models make use of model-specific data transformations for purposes of improving the performance of the final versions of each. Any model-specific data transformations are described within the individual model discussions provided below.
————————————————————————————————————————–
Each of our three binary logistic regression models used the data set’s TARGET_FLAG attribute as the dependent response variable, while various subsets of the 26 (post data preparation) potential predictor variables were used as independent variables.
Our first binary model applies R’s step() function in an attempt to identify the binary logistic regression model having the lowest Akaike Information Criterion (AIC) score via backward selection based on a specific set of initial potential predictor variables. The resulting model is then further reduced via backward selection to eliminate all non-statistically significant predictors from the model.
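A hedged sketch of that search, reflecting the variable exclusions explained next (the exact initial formula may have differed; INDEX denotes the data set’s row identifier):

```r
# Backward stepwise search on AIC, starting from a model containing all
# candidate predictors except the response TARGET_AMT, the row identifier,
# and three variables excluded to avoid collinearity.
init_fit <- glm(TARGET_FLAG ~ . - TARGET_AMT - INDEX - HOME_VAL - CAR_AGE - JOB,
                family = binomial(link = "logit"), data = hw4)
step_fit <- step(init_fit, direction = "backward", trace = FALSE)
summary(step_fit)  # non-significant terms were then pruned by hand
```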
The initial iteration of this approach excluded the ‘HOME_VAL’, ‘CAR_AGE’ and ‘JOB’ variables for purposes of avoiding collinearity issues relative to the newly created ‘H_RENTER’, ‘NEW_CAR’, and ‘JOB_COLOR’ variables. The step() function and subsequent backward selection on the basis of p-values produced a model comprised of 16 statistically significant predictor variables. However, marginal model plots showed evidence of significant deviance between the modeled data and the actual data for the ‘INCOME’, ‘TRAVTIME’, ‘BLUEBOOK’, ‘CLM_FREQ’, and ‘MVR_PTS’ variables.
In an attempt to address that deviance the natural log of each of the five variables was then added to the 16 variable model as shown below:
glm(formula = TARGET_FLAG ~ KIDSDRIV + HOMEKIDS + INCOME + log(INCOME + 1) + MSTATUS +
EDUCATION + TRAVTIME + log(TRAVTIME) + CAR_USE + BLUEBOOK + log(BLUEBOOK) + TIF + CAR_TYPE +
OLDCLAIM + CLM_FREQ + log(CLM_FREQ + 1) + REVOKED + MVR_PTS + log(MVR_PTS + 1) + URBANICITY +
H_RENTER, family = binomial(link = "logit"), data = hw4)
Subsequent backward selection iterations led to the removal of the ‘BLUEBOOK’, ‘TRAVTIME’, ‘MVR_PTS’, and ‘INCOME’ variables, but indicated the need to retain both the CLM_FREQ variable and its log(CLM_FREQ + 1) transform for purposes of maintaining the integrity of the model. In fact, removing either CLM_FREQ or log(CLM_FREQ + 1) led to evidence of significant deviance in the marginal model plots for whichever of the two was excluded. As such, the end result was a model comprised of 17 predictor variables with performance metrics as indicated in the table below:
| Metric | Score |
|---|---|
| Number of Predictors | 17 |
| AIC | 7395.9 |
| Accuracy | 0.7854 |
| Classification Error Rate | 0.2145 |
| Precision | 0.6478 |
| Sensitivity | 0.4092 |
| Specificity | 0.9203 |
| F1 Score | 0.5016 |
| AUC | 0.6647 |
A plot of the standardized deviance residuals for this model revealed no high leverage outliers, and the marginal model plots show very strong agreement between the modeled data and the actual data, with a minute amount of deviance evident only in the log(TRAVTIME) and log(MVR_PTS + 1) plots:
The logit coefficients for this model are shown below. Please note that coefficients for each category of the multi-category predictors are included.
| Coeff. | Variable | Coeff. | Variable |
|---|---|---|---|
| + 1.268 | Intercept | + 0.379 | CAR_TYPEPanel Truck |
| + 0.548 | KIDSDRIV1 | + 0.508 | CAR_TYPEPickup |
| + 0.359 | HOMEKIDS1 | + 0.918 | CAR_TYPESports Car |
| - 0.064 | log(INCOME + 1) | + 0.577 | CAR_TYPEVan |
| + 0.704 | MSTATUSzNo | + 0.700 | CAR_TYPEzSUV |
| - 0.679 | EDUCATIONBachelors | - 0.001 | OLDCLAIM |
| - 0.732 | EDUCATIONMasters | - 0.340 | CLM_FREQ |
| - 0.959 | EDUCATIONPhD | + 1.248 | log(CLM_FREQ + 1) |
| - 0.090 | EDUCATIONz_High School | + 0.972 | REVOKEDYes |
| + 0.415 | log(TRAVTIME) | + 0.291 | log(MVR_PTS + 1) |
| - 0.835 | CAR_USEPrivate | - 2.293 | URBANICITY-Rural |
| - 0.324 | log(BLUEBOOK) | + 0.170 | H_RENTER1 |
| - 0.052 | TIF | (-) | (-) |
We can infer the following from the coefficients listed above:
KIDSDRIV1: If the policyholder has teenagers that drive the insured vehicle, it is more likely that the policyholder’s vehicle will be involved in an accident. This is unsurprising and conforms to common belief.
HOMEKIDS1: If the policyholder has children living at home with them they are more likely to be involved in an accident. Here we have another unsurprising result that conforms to common belief.
INCOME: The higher a policyholder’s income is, the less likely they are to be involved in an accident. This would seem to indicate that higher income policyholders drive more responsibly than do lower income drivers.
MSTATUS: Unmarried policyholders are more likely to be involved in an accident. This result conforms to what we learned during our data exploration efforts.
EDUCATION: The higher a policyholder’s level of education is, the less likely they are to be involved in an accident. This result conforms with the results of our data exploration efforts discussed earlier.
TRAVTIME: The longer a policyholder’s work commute is, the more likely they are to be involved in an accident. This makes sense intuitively since more miles driven should increase the likelihood of a driver being involved in an accident.
CAR_USE: Drivers of private vehicles are less likely to be involved in accidents than are drivers of commercial vehicles. This seems to indicate that drivers of commercial vehicles are perhaps less responsible drivers than are drivers of private vehicles. This could be due to the fact that most commercial vehicles are owned by businesses: since the drivers of such vehicles usually don’t own them, they may be less careful than they might otherwise be if driving a vehicle they own. Furthermore, the increased insurance costs associated with a commercial vehicle accident would be passed on to the vehicle’s owner rather than the driver.
BLUEBOOK: The higher the bluebook value of the policyholder’s car, the less likely they are to be involved in an accident. This makes sense intuitively since most rational people would try to protect relatively expensive assets they are responsible for.
TIF: The longer a policyholder has been a customer of the company, the less likely they are to be involved in an accident. This makes sense intuitively since insurance companies would typically either impose large rate increases (causing the driver to look elsewhere for insurance coverage) or cancel the policies of bad drivers.
CAR_TYPE: Drivers of sports cars and SUVs are more prone to accidents than are drivers of other types of vehicles, with minivans being the least prone. This seems to indicate that drivers of sports cars and SUVs are perhaps more aggressive and less responsible than other types of drivers.
OLDCLAIM: Policyholders with relatively large cumulative past claims payouts are less likely to be involved in an accident than policyholders with relatively smaller cumulative past claims payouts. This result is somewhat counterintuitive since most people would assume the opposite. However, it may be the case that drivers who’ve had large past claims payouts were perhaps involved in a small number (perhaps only one) of unusually costly accidents rather than a larger number of relatively less costly incidents.
CLM_FREQ: We have two coefficients for this variable: one for log(CLM_FREQ + 1) and one for the untransformed variable, and they indicate conflicting effects on the likelihood of a car crash. Of the two, the coefficient of log(CLM_FREQ + 1) makes the most sense intuitively: the more frequently a driver has had insurance claims in the past, the more likely they are to be involved in an accident. The coefficient for the raw variable indicates the opposite. As mentioned above, both ‘CLM_FREQ’ and log(CLM_FREQ + 1) were retained within the model to ensure its overall integrity. Reviewing the coefficient for ‘CLM_FREQ’ prior to the addition of log(CLM_FREQ + 1) shows that its directionality was positive rather than negative, and would therefore have been consistent with the accepted intuition. As such, the negative directionality indicated here is likely an artifact of log(CLM_FREQ + 1) having been included in the model and not necessarily an indication of a counterintuitive result.
REVOKED: Policyholders who’ve previously had their drivers licenses revoked are more likely to be involved in an accident. This result is unsurprising.
MVR_PTS: The more motor vehicle record points a policyholder has accumulated, the more likely they are to be involved in an accident. This conforms with the results of our data exploration efforts.
URBANICITY: If a policyholder resides in a relatively rural area they are less likely to be involved in an accident. This makes intuitive sense as there is less traffic in rural areas, thereby reducing the likelihood of an accident.
H_RENTER: If a policyholder does not own their own home they are more likely to be involved in an accident. This would seem to indicate that homeowners tend to drive more responsibly than do renters.
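One further note on reading these coefficients: exponentiating a logit coefficient gives the multiplicative change in the odds of a crash associated with a one-unit change in the predictor. For example, using the REVOKED coefficient from the table above:

```r
# REVOKED switching from 'No' to 'Yes' multiplies the odds of a crash by
# exp(coefficient).
exp(0.972)  # ~2.64: a past license revocation roughly 2.6x's the crash odds
```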
————————————————————————————————————————–
Our second binary model applied simple backward selection and a logit link function in an attempt to identify the regression model that included only predictors with statistically significant p-values. The initial modeling iteration excluded the ‘HOME_VAL’, ‘CAR_AGE’ and ‘JOB’ variables for purposes of avoiding collinearity issues relative to the newly created ‘H_RENTER’, ‘NEW_CAR’, and ‘JOB_COLOR’ variables described in Part 2 above. The ‘RED_CAR’, ‘SEX’, ‘TRAVTIME’ and ‘YOJ’ variables were also excluded after our data exploration efforts showed that each of them was unlikely to be useful for predicting the likelihood of a car crash. Furthermore, the ‘EDUCATION’ variable was used as the basis for a new binary field named ‘ED_LEVEL’ indicative of whether a policyholder was college educated. The ‘EDUCATION’ variable itself was excluded from the model to avoid potential collinearity issues with that new binary variable.
Successive iterations removed the least statistically significant predictors (“CAR_AGE”, “AGE”, “NEW_CAR”, “OLDCLAIM”, “PARENT1” and “JOB_COLOR”) until all p-values were below 0.05. Marginal model plots showed evidence of significant deviance between the modeled data and the actual data for the ‘BLUEBOOK’, ‘INCOME’ and ‘MVR_PTS’ variables. Box-Cox recommended transforms were then applied to each of those variables in an attempt to address the observed deviance and the model was refitted.
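As an illustration of how such a recommendation can be obtained, a sketch using car::powerTransform (the exact tool used may have differed):

```r
# Estimate a Box-Cox-style power for the strictly positive BLUEBOOK variable.
library(car)
bb_pt <- powerTransform(hw4$BLUEBOOK)
summary(bb_pt)  # an estimated lambda near 0.5 motivates the sqrt(BLUEBOOK) term
```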
The final result was a model comprised of 13 predictor variables having the performance metrics indicated in the table below:
| Metric | Value |
|---|---|
| Number of Predictors | 13 |
| AIC | 7511.1 |
| Accuracy | 0.7863 |
| Classification Error Rate | 0.2137 |
| Precision | 0.6594 |
| Sensitivity | 0.3929 |
| Specificity | 0.9273 |
| F1 Score | 0.4924 |
| AUC | 0.6601 |
A plot of the standardized deviance residuals for this model revealed no high leverage outliers, and the marginal model plots indicated that the Box-Cox transforms appeared to have sufficiently addressed the earlier observed deviance issues, though both log(MVR_PTS + 1) and sqrt(BLUEBOOK) still exhibit some deviance.
The logit coefficients for this model are shown below. Please note that coefficients for each category of the multi-category predictor ‘CAR_TYPE’ are included.
| Coeff. | Variable | Coeff. | Variable |
|---|---|---|---|
| - 0.338 | Intercept | + 0.919 | CAR_TYPESports Car |
| + 0.584 | KIDSDRIV | + 0.582 | CAR_TYPEVan |
| - 0.003 | sqrt(INCOME) | + 0.697 | CAR_TYPEz_SUV |
| + 0.701 | MSTATUSz_No | + 0.731 | REVOKEDYes |
| - 0.892 | CAR_USEPrivate | + 0.409 | log(MVR_PTS + 1) |
| - 0.006 | sqrt(BLUEBOOK) | - 2.330 | URBANICITY-Rural |
| - 0.054 | TIF | + 0.186 | H_RENTER |
| + 0.485 | CAR_TYPEPanel Truck | + 0.500 | ED_LEVELNot College |
| + 0.473 | CAR_TYPEPickup | + 0.326 | HOMEKIDS |
The ‘KIDSDRIV’, ‘INCOME’, ‘MSTATUS’, ‘CAR_USE’, ‘BLUEBOOK’, ‘TIF’, ‘CAR_TYPE’, ‘REVOKED’, ‘MVR_PTS’, ‘URBANICITY’, ‘H_RENTER’ and ‘HOMEKIDS’ predictors for Binary Model 2 all overlap those discussed in Model 1, and all have directionally similar coefficients to those found in Model 1. As such, the effects of the coefficients for those variables won’t be reiterated here and can be found in the Model 1 discussion above.
For the remaining variable we can state the following:
ED_LEVEL: Policyholders who are not college educated are more likely to be involved in an accident. This is consistent with both the ‘EDUCATION’ effect observed in Binary Model 1 and the results of our data exploration efforts.
————————————————————————————————————————–
The third model made use of forward selection with a logit link function in an attempt to produce a binary logistic regression model yielding the lowest AIC value possible when only statistically significant predictor variables were considered.
The forward selection process made use of the knowledge we gained in Part 1 via the construction of boxplots, barplots, and the correlation matrix. As we discussed in Part 1, several of the potential numeric variables showed evidence of being at least somewhat correlated with the response variable while many specific categories of drivers (as indicated by the various categorical variables) are more likely to be involved in car accidents.
These insights allowed us to carefully select variables to be added to the model based on their propensities to be predictive of the ‘TARGET_FLAG’ response variable, with variables having higher likelihoods of being predictive of the response added first followed by others in descending order of their indicated likelihoods. If the model did not show improved metrics after a variable was added (perhaps due to collinearity), that predictor was discarded and the next was tried.
Using this iterative process, the variables CLM_FREQ, log(CLM_FREQ + 1), INCOME, AGE, log(AGE + 1), log(BLUEBOOK), CAR_AGE, TIF, log(TRAVTIME), H_RENTER, URBANICITY, REVOKED, PARENT1, KIDSDRIV, CAR_USE, MSTATUS, and CAR_TYPE were successfully added to the model, yielding the metrics shown below:
| Metric | Value |
|---|---|
| Number of Predictors | 17 |
| AIC | 7430.4 |
| Accuracy | 0.7827 |
| Classification Error Rate | 0.2173 |
| Precision | 0.6480 |
| Sensitivity | 0.3864 |
| Specificity | 0.9248 |
| F1 Score | 0.4841 |
| AUC | 0.6556 |
The marginal model plots for this model show relatively strong agreement between the modeled data and the actual data, with some minor deviance evident with the ‘INCOME’, ‘AGE’, and log(TRAVTIME) variables:
A plot of the standardized deviance residuals for this model revealed no high leverage outliers, and its logit coefficients were as follows:
| Coeff. | Variable | Coeff. | Variable |
|---|---|---|---|
| + 22.89 | Intercept | + 0.467 | CAR_TYPEPanel Truck |
| + 0.848 | KIDSDRIV1 | + 0.533 | CAR_TYPEPickup |
| - 0.055 | TIF | + 0.886 | CAR_TYPESports Car |
| - 0.001 | INCOME | + 0.619 | CAR_TYPEVan |
| + 0.170 | AGE | + 0.719 | CAR_TYPEzSUV |
| - 7.72 | log(AGE + 1) | + 0.743 | REVOKEDYes |
| + 0.560 | MSTATUSz_No | + 0.416 | log(TRAVTIME) |
| + 0.284 | Parent1Yes | - 2.284 | URBANICITY-Rural |
| - 0.026 | CAR_AGE | - 0.894 | CAR_USEPrivate |
| - 0.310 | CLM_FREQ | + 0.220 | H_RENTER1 |
| + 1.137 | log(CLM_FREQ + 1) | - 0.330 | log(BLUEBOOK) |
With the exception of ‘AGE’, log(AGE + 1), ‘PARENT1’, and ‘CAR_AGE’, the predictors for Binary Model 3 overlap those discussed for Binary Model 1 and have directionally similar coefficients to those found in that model. As such, the effects of the coefficients for those variables won’t be reiterated here and can be found in the Binary Model 1 discussion above.
Of the remaining variables we can state the following:
AGE (and log(AGE + 1)): We have two coefficients related to the ‘AGE’ variable, one for log(AGE + 1) and one for the untransformed variable, and they indicate conflicting effects on the likelihood of a policyholder having a car accident. Both were retained within the model to ensure its overall integrity. Of the two, the directionality of the raw ‘AGE’ variable may be the better indicator of the variable’s effect, since we can’t be sure the directionality of the log(AGE + 1) predictor isn’t simply an artifact of the binary logistic regression modeling process. As such, it appears that the higher a policyholder’s age is, the more likely they are to be involved in a car accident.
Parent1: Single parents appear more likely to be involved in car accidents. This conforms to the observations we made during our data exploration efforts.
CAR_AGE: The older the policyholder’s car is, the less likely it is to be involved in a car accident. This result is somewhat difficult to interpret intuitively. For example, does it indicate that drivers of older cars are somehow more cautious than drivers of relatively newer cars? Are older cars perhaps more widely driven by married college-educated parents who have no children living at home? Uncovering the exact reasons could be an interesting topic for a future study.
————————————————————————————————————————–
Both of our linear regression models used the data set’s ‘TARGET_AMT’ attribute as the dependent response variable, and the data was subsetted to include only those records that contained a value of (TARGET_FLAG = 1). Various subsets of the 26 post-data-preparation predictor variables were used as independent variables.
Our first linear model applies R’s step function in an attempt to identify the linear regression model having the lowest Akaike Information Criterion (AIC) score via backward selection based on a specific set of initial potential predictor variables. The resulting model is then further reduced via backward selection to eliminate all non-statistically significant predictors from the model.
Five numeric variables (INCOME, BLUEBOOK, TRAVTIME, CLM_FREQ, and MVR_PTS) were transformed using natural logs prior to model building due to their strongly right-skewed distributions. The log transformations yielded distributions that were much closer to a normal distribution, thereby greatly improving our prospects for a successful model build.
The initial iteration of this approach excluded the ‘H_RENTER’, ‘CAR_AGE’ and ‘JOB_COLOR’ variables for purposes of avoiding collinearity issues relative to the ‘HOME_VAL’, ‘NEW_CAR’, and ‘JOB’ variables. The step function and subsequent backward selection on the basis of p-values produced a model comprised of only 4 statistically significant predictor variables.
However, diagnostic plots showed unambiguous evidence of a lack of normality in the model’s residuals. An examination of the added variable plots showed no evidence of any of the predictor variables being clearly responsible for the variability of the residuals. The lack of any clear relationship between the predictors and the magnitude of the residuals seemed to preclude the use of potential alternative modeling approaches that we have experience with (e.g., Weighted Least Squares). As such, a natural log transform was then applied to the ‘TARGET_AMT’ response variable and the model was refitted.
Residual plots of the refitted model showed evidence of multiple high leverage outliers which were removed via additional modeling iterations, yielding an improved model. The QQ plot for the improved model showed that there was still some deviance from normality in the residuals. However, the use of the natural log transformation of the response combined with the removal of outliers had greatly improved the overall fit of the model:
A plot of the response variable vs. the fitted values generated by the linear regression model was created by back-transforming the fitted log(TARGET_AMT + 1) values via an exponential function. The plot echoes the lack of normality in the residuals:
The plot clearly indicates that the model has some predictive value as evidenced by the dense collection of plot points close to the least squares line. However, the plot also shows quite clearly that instances of unusually large auto insurance payouts are not captured by the model.
The coefficients and performance metrics for the model are as follows:
| Coefficient | Variable |
|---|---|
| + 7.019 | Intercept |
| + 0.128 | log(BLUEBOOK) |
| + 0.059 | log(MVR_PTS + 1) |
| RSE | R^2 | Adj. R^2 | F Stat. | MSE |
|---|---|---|---|---|
| 0.756 | 0.0156 | 0.0147 | 16.89 | 0.5715 |
and the characteristic equation for the model is:

\[
\log(\mathrm{TARGET\_AMT} + 1) = 7.019 + 0.128 \cdot \log(\mathrm{BLUEBOOK}) + 0.059 \cdot \log(\mathrm{MVR\_PTS} + 1)
\]
As the table indicates, this model has a very low \(R^2\) score even though the predictors are clearly statistically significant. A model having a low \(R^2\) and statistically significant predictors is frequently indicative of the response variable containing a significant amount of variability that simply cannot be explained through the use of the available predictor variables.
For example, in regard to the expected payout for a car accident, factors not captured by the predictive variables (e.g., specific characteristics of past accidents that contributed to payout amounts, perhaps including the types and severities of injuries; the length of any required hospital stays, property damage costs, etc.) are likely driving the variability of the residuals of this model. Therefore, the variability in the TARGET_AMT payouts contained in the training data set most likely can’t be explained by the variables we’ve been provided.
We can infer the following from the coefficients listed above:
BLUEBOOK: The higher the Bluebook value of the policyholder’s vehicle, the higher the likely insurance payout will be in the event of an accident. This is not a surprising result.
MVR_PTS: The more motor vehicle record points a policyholder has accumulated, the higher the likely insurance payout will be in the event of an accident. This may indicate that drivers with poor driving records have a higher likelihood of being involved with relatively more serious accidents than would drivers with better driving records. In other words, bad driving apparently leads to a higher likelihood of being involved in relatively bad car accidents.
This linear regression model required log transforms of five predictor variables as well as the response variable. The final model excluded all of the potential predictors except two (BLUEBOOK and MVR_PTS), both of which had been log transformed. While the \(R^2\) score for the model is very low, models with low \(R^2\) and statistically significant predictors can, in fact, still be useful for predictive purposes as long as extremely precise predictions are not required. On the other hand, if extremely precise predictions of likely auto insurance claims are required, this model should not be utilized. Instead, an auto insurer should seek to gather additional information related to claims payouts (such as those mentioned above) and include that data into the regression modeling process.
————————————————————————————————————————–
Our second linear model applied simple backward selection in an attempt to identify the linear regression model that included only predictors with statistically significant p-values.
The initial iteration of this approach excluded the ‘H_RENTER’, ‘CAR_AGE’ and ‘JOB_COLOR’ variables for purposes of avoiding collinearity issues relative to the ‘HOME_VAL’, ‘NEW_CAR’, and ‘JOB’ variables. Backward selection iterations based on p-values reduced the model to 3 statistically significant variables: ‘SEX’, ‘BLUEBOOK’ and ‘MVR_PTS’. Diagnostic plots showed a need to transform both ‘BLUEBOOK’ and ‘MVR_PTS’ due to their right-skewed distributions (matching what we saw in the binomial model building process). Applying Box-Cox recommended power transforms to both and refitting the model led to the removal of ‘SEX’ since it was no longer statistically significant. The final iteration resulted in a model comprised of 2 predictor variables: sqrt(BLUEBOOK) and log(MVR_PTS+1).
Residual plots for that refitted model showed evidence of multiple high leverage outliers which were removed via additional modeling iterations. During that process the log(MVR_PTS+1) predictor was deemed to be statistically insignificant and was removed, producing a linear model with only a single predictor: sqrt(BLUEBOOK).
As shown below, the QQ plot for that model showed significant deviance from normality in the residuals, indicating that the model violates one of the key requirements of the linear modeling process.
Given that the model contains only a single, nearly normally distributed predictor, the severe lack of normality in the residuals indicates that the response variable is responsible for the variability in the residuals rather than the predictor variable. In such circumstances, a typical remedy involves attempting to transform the response variable using a log or power transform in an attempt to normalize the residuals. However, this approach was already utilized for our first linear model, and duplicating it for this second model would likely yield results nearly identical to the first.
The coefficients and performance metrics for this linear model are as follows:
| Coefficient | Variable |
|---|---|
| + 3067.58 | Intercept |
| + 21.52 | sqrt(BLUEBOOK) |
| RSE | R^2 | Adj. R^2 | F Stat. | MSE |
|---|---|---|---|---|
| 6739 | 0.01203 | 0.01157 | 26.12 | 45413200 |
The ‘BLUEBOOK’ predictor for Linear Model 2 has already been discussed in the writeup for Linear Model 1 above. As such, the effects of that coefficient won’t be reiterated here and can be found in the discussion of Linear Model 1.
This linear regression model required Box-Cox transforms of two of the predictor variables, with the final model excluding all of the potential predictors except one: sqrt(BLUEBOOK). However, despite its simplicity the model suffers from a significant deficiency due to the highly non-normal distribution of its residuals. As such, this model should not be used for predictive purposes.
————————————————————————————————————————–
The chart below summarizes the model statistics for our three binary logistic regression models. The models are listed from left to right in accordance with the order in which they were described in Part 3.
| Metric | Model 1 | Model 2 | Model 3 |
|---|---|---|---|
| # Predictors | 17 | 13 | 17 |
| AIC | 7395.9 | 7511.1 | 7430.4 |
| Accuracy | 0.7854 | 0.7863 | 0.7827 |
| Class.Err.R. | 0.2145 | 0.2137 | 0.2173 |
| Precision | 0.6478 | 0.6594 | 0.6480 |
| Sensitivity | 0.4092 | 0.3929 | 0.3864 |
| Specificity | 0.9203 | 0.9273 | 0.9248 |
| F1 Score | 0.5016 | 0.4924 | 0.4841 |
| AUC | 0.6647 | 0.6601 | 0.6556 |
As we can see in the table, all three of the models are fairly similar in their performance metrics, with Model 1 yielding the lowest AIC score and the highest sensitivity, F1 Score, and AUC, while Model 2 has the smallest number of predictor variables and the highest accuracy, precision, and specificity.
However, the differences in their performance metrics are marginal at best. Of the three, Model 2 is the simplest overall since it has the smallest number of predictor variables while all three models shared similarly strong marginal model plots.
Selecting our preferred model required answering the following question: should we prefer the smaller Model 2 for its simplicity, or does the additional predictive power of the larger models justify their added complexity?
An objective answer to this question was found through the use of a Likelihood Ratio Test. In a likelihood ratio test, models having different numbers of predictors are compared to determine whether or not a smaller logistic regression model should be preferred over a larger logistic regression model assuming both are “valid” in the sense that their marginal model plots show no significant evidence of deviance and their performance metrics are somewhat similar.
In our likelihood ratio test, the smaller model is considered to be the null hypothesis H0, and a p-value for the overall model fit statistic that is less than 0.05 would indicate that we should reject the null hypothesis in favor of the alternative hypothesis HA. In other words, a small p-value would provide evidence against the smaller model and in favor of the larger model.
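For reference, the test statistic compares the maximized log-likelihoods of the two nested models:

\[
\Lambda = 2\left(\ell_{\text{larger}} - \ell_{\text{smaller}}\right) \sim \chi^{2}_{\Delta\,\mathrm{df}}
\]

where \(\ell\) denotes a model’s maximized log-likelihood and \(\Delta\,\mathrm{df}\) is the difference in the number of estimated parameters. In the output below, for example, \(2(-3672.9 - (-3693.2)) \approx 40.6\), matching the reported Chisq value of 40.577.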
We conducted a likelihood ratio test for all three models simultaneously using the lrtest() function from R’s lmtest package. The results are displayed below:
## Likelihood ratio test
##
## Model 1: TARGET_FLAG ~ KIDSDRIV + HOMEKIDS + log(INCOME + 1) + MSTATUS +
## EDUCATION + log(TRAVTIME) + CAR_USE + log(BLUEBOOK) + TIF +
## CAR_TYPE + OLDCLAIM + CLM_FREQ + log(CLM_FREQ + 1) + REVOKED +
## log(MVR_PTS + 1) + URBANICITY + H_RENTER
## Model 2: TARGET_FLAG ~ CLM_FREQ + log(CLM_FREQ + 1) + INCOME + AGE + log(AGE +
## 1) + log(BLUEBOOK) + CAR_AGE + TIF + log(TRAVTIME) + H_RENTER +
## URBANICITY + REVOKED + PARENT1 + KIDSDRIV + CAR_USE + MSTATUS +
## CAR_TYPE
## Model 3: TARGET_FLAG ~ KIDSDRIV + sqrt(INCOME) + MSTATUS + CAR_USE + sqrt(BLUEBOOK) +
## TIF + CAR_TYPE + REVOKED + log(MVR_PTS + 1) + URBANICITY +
## H_RENTER + ED_LEVEL + HOMEKIDS
## #Df LogLik Df Chisq Pr(>Chisq)
## 1 25 -3672.9
## 2 22 -3693.2 -3 40.577 8.042e-09 ***
## 3 18 -3737.5 -4 88.608 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The results show p-values of less than 0.05, indicating that we should reject the null hypothesis H0 in favor of the alternative hypothesis HA. Specifically, there is sufficient statistical evidence for us to justifiably prefer Binary Model 1 (one of the larger models) over Binary Models 2 and 3. Therefore, we selected Binary Model 1 as the basis for our predictions of the ‘TARGET_FLAG’ binary response variable in the Evaluation data set.
————————————————————————————————————————–
The chart below summarizes the model statistics for our two linear models:
| Metric | Model 1 | Model 2 |
|---|---|---|
| RSE | 0.756 | 6739 |
| R^2 | 0.0156 | 0.0120 |
| Adj. R^2 | 0.0147 | 0.0116 |
| F Stat. | 16.89 | 26.12 |
| MSE | 0.5715 | 45413200 |
While both models suffer from low \(R^2\) scores, we’ve previously ascertained that Linear Model 2 violates the requirement for near normality of residuals. As such, Linear Model 2 was eliminated from further consideration. Therefore, we selected Linear Model 1 as the basis for our predictions of the ‘TARGET_AMT’ response variable in the Evaluation data set.
————————————————————————————————————————–
As discussed in Part 2, we applied the same set of transformations used on the Training data set to the Evaluation data set to ensure the applicability of any regression models we would build.
Binary Model 1 was applied to the transformed evaluation data to yield a set of probabilities for the ‘TARGET_FLAG’ response variable. Those results were then rounded using a 0.5 threshold to populate the ‘TARGET_FLAG’ variable with zeroes (indicating the predicted probability of a car accident was less than 0.5) or 1’s (indicating the predicted probability of a car accident was 0.5 or greater).
Summary statistics for the ‘TARGET_FLAG’ variable for both the training data set and updated evaluation data set are shown below (NOTE: “n” represents the number of records within a data set).
| Data Set | n | Mean | sd | Med. | Min | Max | Range | Skew | Kurtosis | SE |
|---|---|---|---|---|---|---|---|---|---|---|
| Evaluation | 2141 | 0.17 | 0.38 | 0.00 | 0.00 | 1.00 | 1 | 1.76 | 1.1 | 0.01 |
| Training | 8161 | 0.26 | 0.44 | 0.00 | 0.00 | 1.00 | 1 | 1.07 | -0.85 | 0 |
Linear Model 1 was then applied against only those rows of the updated Evaluation data set having values of (TARGET_FLAG = 1) to yield a predictive estimate of the likely cost of an auto insurance claim for each such policyholder. Since our preferred linear model relies on a log-transformed response variable, the results of the linear model were then back-transformed via an exponential function to calculate the final ‘TARGET_AMT’ value for each respective row. The results of the linear model were then added back into the broader Evaluation data set to produce our final results.
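A hedged sketch of this two-stage scoring pipeline; binary_model1, linear_model1, and eval_df are illustrative names rather than the exact objects used:

```r
# Stage 1: predicted crash probabilities, thresholded at 0.5.
eval_df$TARGET_FLAG_PROB <- predict(binary_model1, newdata = eval_df,
                                    type = "response")
eval_df$TARGET_FLAG <- ifelse(eval_df$TARGET_FLAG_PROB >= 0.5, 1, 0)

# Stage 2: claim amounts for predicted crashes only, back-transformed from
# the log scale used by the linear model.
eval_df$TARGET_AMT <- 0
crash <- eval_df$TARGET_FLAG == 1
eval_df$TARGET_AMT[crash] <- exp(predict(linear_model1,
                                         newdata = eval_df[crash, ])) - 1
```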
Summary statistics for the ‘TARGET_AMT’ variable for both the training data set and updated evaluation data set are shown below (NOTE: “n” represents the number of records within a data set).
| Data Set | n | Mean | sd | Med. | Min | Max | Range | Skew | Kurtosis | SE |
|---|---|---|---|---|---|---|---|---|---|---|
| Evaluation | 2141 | 665.79 | 1482.43 | 0.00 | 0.00 | 4276 | 4276 | 1.81 | 1.3 | 32.04 |
| Training | 8161 | 1504.32 | 4704.03 | 0.00 | 0.00 | 107586 | 107586 | 8.71 | 112.29 | 52.07 |
The table highlights the variability of the original response variable mentioned earlier. As was discussed in our writeup for Linear Model 1, the variability of the ‘TARGET_AMT’ response variable is largely not a function of any of the predictor variables we’ve been provided. Unsurprisingly, this results in our linear model being unable to conform with the large amount of variance seen in the original ‘TARGET_AMT’ variable.
A sample of the results of the prediction process is shown in the table below. Please note that the estimated probability of a policyholder being involved in a car accident is indicated by the ‘TARGET_FLAG_PROB’ variable shown in the table.
| INDEX | TARGET_FLAG_PROB | TARGET_FLAG | TARGET_AMT |
|---|---|---|---|
| 3 | 0.221 | 0 | 0 |
| 9 | 0.455 | 0 | 0 |
| 10 | 0.126 | 0 | 0 |
| 18 | 0.181 | 0 | 0 |
| 21 | 0.266 | 0 | 0 |
| 30 | 0.163 | 0 | 0 |
| 31 | 0.343 | 0 | 0 |
| 37 | 0.320 | 0 | 0 |
| 39 | 0.021 | 0 | 0 |
| 47 | 0.174 | 0 | 0 |
| 60 | 0.026 | 0 | 0 |
| 62 | 0.563 | 1 | 4088 |
| 63 | 0.830 | 1 | 3619 |
| 64 | 0.118 | 0 | 0 |
| 68 | 0.033 | 0 | 0 |
| 75 | 0.583 | 1 | 4005 |
| 76 | 0.703 | 1 | 3262 |
| 83 | 0.144 | 0 | 0 |
| 87 | 0.515 | 1 | 3855 |
| 92 | 0.376 | 0 | 0 |
| 98 | 0.180 | 0 | 0 |
| 106 | 0.453 | 0 | 0 |
| 107 | 0.104 | 0 | 0 |
| 113 | 0.323 | 0 | 0 |
| 120 | 0.364 | 0 | 0 |
| 123 | 0.413 | 0 | 0 |
| 125 | 0.426 | 0 | 0 |
| 126 | 0.456 | 0 | 0 |
| 128 | 0.135 | 0 | 0 |
| 129 | 0.177 | 0 | 0 |
| 131 | 0.194 | 0 | 0 |
| 135 | 0.450 | 0 | 0 |
| 141 | 0.060 | 0 | 0 |
| 147 | 0.198 | 0 | 0 |
| 148 | 0.108 | 0 | 0 |
| 151 | 0.039 | 0 | 0 |
| 156 | 0.177 | 0 | 0 |
| 157 | 0.091 | 0 | 0 |
| 174 | 0.041 | 0 | 0 |
| 186 | 0.601 | 1 | 4149 |
The full set of predictions listed in order of their ‘INDEX’ identifier can be found at the following web link and is also presented at the beginning of the Appendix:
https://github.com/jtopor/CUNY-MSDA-621/blob/master/HW-4/HW4-PRED-EVAL-COLS-ONLY.csv
Each of our binary logistic regression models appeared to perform reasonably well at predicting the likelihood of an auto insurance customer having a car accident based on the content of the Training data set. As such, it appears as though the predictor variables do, in fact, offer a significant amount of predictive value for such likelihoods.
However, our linear modeling efforts revealed a significant lack of predictive value in the training data relative to the ‘TARGET_AMT’ response variable. None of the potential predictor variables proved to be highly predictive of likely claims payout amounts, even assuming the customer had already been deemed relatively likely to be involved in a car accident. In other words, the large amount of variance seen in the original ‘TARGET_AMT’ response variable is not closely related to any of the predictors we’ve been provided.
Despite this apparent disconnect between ‘TARGET_AMT’ and the predictors, the linear model clearly does have some degree of predictive value, as indicated by the “Response vs. Fitted” plot shown in our Part 3 discussion of Linear Model 1. As we discussed therein, the linear model can, in fact, be used for predictive purposes as long as extremely precise predictive estimates of likely auto insurance claims aren’t necessary. On the other hand, if very precise predictive estimates are required, the linear model should not be used. Instead, an auto insurer would likely need to accumulate additional data that is more closely related to the magnitude of auto insurance claims payments (e.g., physical property damage costs, types and severities of injuries, lengths of hospital stays, etc.) and include such data as part of the linear modeling process.
————————————————————————————————————————–
The Appendix to this document containing all of the R code used for this assignment (as well as other relevant output) can be found in a separate PDF file accessible via this web link:
https://github.com/spsstudent15/2016-02-621-W1/blob/master/Notes/HW4-Appendix.pdf