Building a Regression Model for Auto Insurance Claims

This analysis will explore auto insurance claims in OH, IN, and IL. Our target variable is total claim payout. We will attempt to build a linear regression model to solve for this variable.

Below is a summary of the data points. The data was provided from data bricks and can be found here https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/4954928053318020/1058911316420443/167703932442645/latest.html

Data summary
Name insurance_claims
Number of rows 1000
Number of columns 31
_______________________
Column type frequency:
character 19
numeric 12
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
policy_state 0 1 2 2 0 3 0
policy_csl 0 1 7 8 0 3 0
insured_sex 0 1 4 6 0 2 0
insured_education_level 0 1 2 11 0 7 0
insured_occupation 0 1 5 17 0 14 0
insured_hobbies 0 1 4 14 0 20 0
insured_relationship 0 1 4 14 0 6 0
incident_date 0 1 10 10 0 60 0
incident_type 0 1 10 24 0 4 0
collision_type 0 1 1 15 0 4 0
incident_severity 0 1 10 14 0 4 0
authorities_contacted 0 1 4 9 0 5 0
incident_state 0 1 2 2 0 7 0
incident_city 0 1 8 11 0 7 0
incident_location 0 1 11 23 0 1000 0
police_report_available 0 1 1 3 0 3 0
auto_make 0 1 3 10 0 14 0
auto_model 0 1 2 14 0 39 0
fraud_reported 0 1 1 1 0 2 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
months_as_customer 0 1 203.95 115.11 0.00 115.75 199.5 276.25 479.00 ▅▇▇▃▂
age 0 1 38.95 9.14 19.00 32.00 38.0 44.00 64.00 ▃▇▇▃▂
policy_deductable 0 1 1136.00 611.86 500.00 500.00 1000.0 2000.00 2000.00 ▇▇▁▁▇
policy_annual_premium 0 1 1256.41 244.17 433.33 1089.61 1257.2 1415.69 2047.59 ▁▃▇▃▁
umbrella_limit 0 1 1101000.00 2297406.60 -1000000.00 0.00 0.0 0.00 10000000.00 ▇▁▁▁▁
insured_zip 0 1 501214.49 71701.61 430104.00 448404.50 466445.5 603251.00 620962.00 ▇▃▁▁▅
incident_hour_of_the_day 0 1 11.64 6.95 0.00 6.00 12.0 17.00 23.00 ▇▇▆▇▇
number_of_vehicles_involved 0 1 1.84 1.02 1.00 1.00 1.0 3.00 4.00 ▇▁▁▅▁
bodily_injuries 0 1 0.99 0.82 0.00 0.00 1.0 2.00 2.00 ▇▁▇▁▇
witnesses 0 1 1.49 1.11 0.00 1.00 1.0 2.00 3.00 ▇▇▁▇▇
total_claim_amount 0 1 52761.94 26401.53 100.00 41812.50 58055.0 70592.50 114920.00 ▃▂▇▅▁
auto_year 0 1 2005.10 6.02 1995.00 2000.00 2005.0 2010.00 2015.00 ▇▆▇▇▆

We will build a linear regression model for our target variable - total_claim_payout.

Exploring the independent variables to help build the model

Incident Hour of the Day

## 
## Call:
## lm(formula = total_claim_amount ~ incident_hour_of_the_day, data = insurance_claims)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -58685 -13027   4630  17355  68359 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               43134.2     1591.1  27.110  < 2e-16 ***
## incident_hour_of_the_day    826.8      117.3   7.046 3.42e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 25780 on 998 degrees of freedom
## Multiple R-squared:  0.04739,    Adjusted R-squared:  0.04644 
## F-statistic: 49.65 on 1 and 998 DF,  p-value: 3.418e-12

We see based off the above chart and model that the independent variable of incident_hour_of_the_day appears to have a higher average total claim payout the later it is in the day. We built a regression model with just this variable and the result is significant. However, the adjusted R2 is just 5%.

Police Report - Independent Variable

## 
## Call:
## lm(formula = total_claim_amount ~ PR_dummy, data = insurance_claims)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -54204 -11219   5366  17465  60616 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    49807       1422  35.036   <2e-16 ***
## PR_dummy        4497       1754   2.564   0.0105 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 26330 on 998 degrees of freedom
## Multiple R-squared:  0.006545,   Adjusted R-squared:  0.00555 
## F-statistic: 6.575 on 1 and 998 DF,  p-value: 0.01049

We see here that whether there was a police report or not it was significantly different. However this accounts for very little R2.

Putting together the variables in one model

## 
## Call:
## lm(formula = total_claim_amount ~ PR_dummy + incident_hour_of_the_day + 
##     fraud_reported, data = insurance_claims)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -61355 -13523   4185  17561  64759 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               47804.1     2368.5  20.184  < 2e-16 ***
## PR_dummy                   4437.6     1689.3   2.627  0.00875 ** 
## incident_hour_of_the_day    821.2      115.4   7.115 2.14e-12 ***
## fraud_reportedN           -9986.9     1859.5  -5.371 9.75e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 25360 on 996 degrees of freedom
## Multiple R-squared:  0.08024,    Adjusted R-squared:  0.07747 
## F-statistic: 28.96 on 3 and 996 DF,  p-value: < 2.2e-16

Here we add fraud_reported to the model with the other two discussed independent variables. The model is starting to come together with all 3 independent variables being significant, however very little total variance is explained with just 8%.

Testing Other Independent Variables - Age

## `geom_smooth()` using formula 'y ~ x'

## `geom_smooth()` using formula 'y ~ x'

No real pattern with age or months as a customer with total claim payout.

Number of Vehicles Involved

## 
## Call:
## lm(formula = total_claim_amount ~ number_of_vehicles_involved, 
##     data = insurance_claims)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -46699 -14228   3623  17132  65521 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  39691.9     1658.1   23.94   <2e-16 ***
## number_of_vehicles_involved   7107.2      788.8    9.01   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 25400 on 998 degrees of freedom
## Multiple R-squared:  0.07523,    Adjusted R-squared:  0.0743 
## F-statistic: 81.19 on 1 and 998 DF,  p-value: < 2.2e-16

Here is a regression model for number of vehicles involved

## 
## Call:
## lm(formula = total_claim_amount ~ incident_type, data = insurance_claims)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -43637  -8670   -346   7513  53283 
## 
## Coefficients:
##                                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                           64445.5      722.2  89.234   <2e-16 ***
## incident_typeVehicle Theft           -58928.2     1660.7 -35.485   <2e-16 ***
## incident_typeMulti-vehicle Collision  -2808.4     1011.6  -2.776   0.0056 ** 
## incident_typeParked Car              -59137.2     1739.0 -34.007   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14500 on 996 degrees of freedom
## Multiple R-squared:  0.6993, Adjusted R-squared:  0.6984 
## F-statistic: 772.3 on 3 and 996 DF,  p-value: < 2.2e-16

Incident Type for an independent variable shows all factors as significant and accounts for nearly 70% of the variance. Let’s add this to the other independent variables to see if we can build a final strong model.

Final Model

## 
## Call:
## lm(formula = total_claim_amount ~ PR_dummy + incident_hour_of_the_day + 
##     fraud_reported + incident_type + number_of_vehicles_involved, 
##     data = insurance_claims)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -42964  -8816   -305   7488  52876 
## 
## Coefficients:
##                                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                           65069.99    2440.02  26.668   <2e-16 ***
## PR_dummy                               1693.87     968.36   1.749   0.0806 .  
## incident_hour_of_the_day                 27.07      68.47   0.395   0.6927    
## fraud_reportedN                       -1395.39    1079.39  -1.293   0.1964    
## incident_typeVehicle Theft           -58448.07    1702.96 -34.321   <2e-16 ***
## incident_typeMulti-vehicle Collision   -586.61    3863.14  -0.152   0.8793    
## incident_typeParked Car              -58558.03    1780.01 -32.898   <2e-16 ***
## number_of_vehicles_involved           -1101.99    1860.48  -0.592   0.5538    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14490 on 992 degrees of freedom
## Multiple R-squared:  0.7009, Adjusted R-squared:  0.6988 
## F-statistic: 332.1 on 7 and 992 DF,  p-value: < 2.2e-16

Now when we add the other independent variables they are not significant in the model. Only the incident type is. Therefore for our final model we will just keep the incident type as the only independent variable.

Predicting claim amount with our final model

Above is a data table that adds a prediction column from our regression model with just incident type as the independent variable. As we can see the “pred” column doesn’t always accurately predict the actual total claim payout value. However, in most instances the values are close.

All in all it is very difficult based off the data we are given to put together a linear model that has a high R2 value. The independent variable of incident type accounted for 70% of the variance, and was the only variable we used in our final linear regression model.