This analysis will explore auto insurance claims in OH, IN, and IL. Our target variable is total claim payout. We will attempt to build a linear regression model to solve for this variable.
Below is a summary of the data points. The data was provided from data bricks and can be found here https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/4954928053318020/1058911316420443/167703932442645/latest.html
| Name | insurance_claims |
| Number of rows | 1000 |
| Number of columns | 31 |
| _______________________ | |
| Column type frequency: | |
| character | 19 |
| numeric | 12 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| policy_state | 0 | 1 | 2 | 2 | 0 | 3 | 0 |
| policy_csl | 0 | 1 | 7 | 8 | 0 | 3 | 0 |
| insured_sex | 0 | 1 | 4 | 6 | 0 | 2 | 0 |
| insured_education_level | 0 | 1 | 2 | 11 | 0 | 7 | 0 |
| insured_occupation | 0 | 1 | 5 | 17 | 0 | 14 | 0 |
| insured_hobbies | 0 | 1 | 4 | 14 | 0 | 20 | 0 |
| insured_relationship | 0 | 1 | 4 | 14 | 0 | 6 | 0 |
| incident_date | 0 | 1 | 10 | 10 | 0 | 60 | 0 |
| incident_type | 0 | 1 | 10 | 24 | 0 | 4 | 0 |
| collision_type | 0 | 1 | 1 | 15 | 0 | 4 | 0 |
| incident_severity | 0 | 1 | 10 | 14 | 0 | 4 | 0 |
| authorities_contacted | 0 | 1 | 4 | 9 | 0 | 5 | 0 |
| incident_state | 0 | 1 | 2 | 2 | 0 | 7 | 0 |
| incident_city | 0 | 1 | 8 | 11 | 0 | 7 | 0 |
| incident_location | 0 | 1 | 11 | 23 | 0 | 1000 | 0 |
| police_report_available | 0 | 1 | 1 | 3 | 0 | 3 | 0 |
| auto_make | 0 | 1 | 3 | 10 | 0 | 14 | 0 |
| auto_model | 0 | 1 | 2 | 14 | 0 | 39 | 0 |
| fraud_reported | 0 | 1 | 1 | 1 | 0 | 2 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| months_as_customer | 0 | 1 | 203.95 | 115.11 | 0.00 | 115.75 | 199.5 | 276.25 | 479.00 | ▅▇▇▃▂ |
| age | 0 | 1 | 38.95 | 9.14 | 19.00 | 32.00 | 38.0 | 44.00 | 64.00 | ▃▇▇▃▂ |
| policy_deductable | 0 | 1 | 1136.00 | 611.86 | 500.00 | 500.00 | 1000.0 | 2000.00 | 2000.00 | ▇▇▁▁▇ |
| policy_annual_premium | 0 | 1 | 1256.41 | 244.17 | 433.33 | 1089.61 | 1257.2 | 1415.69 | 2047.59 | ▁▃▇▃▁ |
| umbrella_limit | 0 | 1 | 1101000.00 | 2297406.60 | -1000000.00 | 0.00 | 0.0 | 0.00 | 10000000.00 | ▇▁▁▁▁ |
| insured_zip | 0 | 1 | 501214.49 | 71701.61 | 430104.00 | 448404.50 | 466445.5 | 603251.00 | 620962.00 | ▇▃▁▁▅ |
| incident_hour_of_the_day | 0 | 1 | 11.64 | 6.95 | 0.00 | 6.00 | 12.0 | 17.00 | 23.00 | ▇▇▆▇▇ |
| number_of_vehicles_involved | 0 | 1 | 1.84 | 1.02 | 1.00 | 1.00 | 1.0 | 3.00 | 4.00 | ▇▁▁▅▁ |
| bodily_injuries | 0 | 1 | 0.99 | 0.82 | 0.00 | 0.00 | 1.0 | 2.00 | 2.00 | ▇▁▇▁▇ |
| witnesses | 0 | 1 | 1.49 | 1.11 | 0.00 | 1.00 | 1.0 | 2.00 | 3.00 | ▇▇▁▇▇ |
| total_claim_amount | 0 | 1 | 52761.94 | 26401.53 | 100.00 | 41812.50 | 58055.0 | 70592.50 | 114920.00 | ▃▂▇▅▁ |
| auto_year | 0 | 1 | 2005.10 | 6.02 | 1995.00 | 2000.00 | 2005.0 | 2010.00 | 2015.00 | ▇▆▇▇▆ |
We will build a linear regression model for our target variable - total_claim_payout.
##
## Call:
## lm(formula = total_claim_amount ~ incident_hour_of_the_day, data = insurance_claims)
##
## Residuals:
## Min 1Q Median 3Q Max
## -58685 -13027 4630 17355 68359
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 43134.2 1591.1 27.110 < 2e-16 ***
## incident_hour_of_the_day 826.8 117.3 7.046 3.42e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 25780 on 998 degrees of freedom
## Multiple R-squared: 0.04739, Adjusted R-squared: 0.04644
## F-statistic: 49.65 on 1 and 998 DF, p-value: 3.418e-12
We see based off the above chart and model that the independent variable of incident_hour_of_the_day appears to have a higher average total claim payout the later it is in the day. We built a regression model with just this variable and the result is significant. However, the adjusted R2 is just 5%.
##
## Call:
## lm(formula = total_claim_amount ~ PR_dummy, data = insurance_claims)
##
## Residuals:
## Min 1Q Median 3Q Max
## -54204 -11219 5366 17465 60616
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 49807 1422 35.036 <2e-16 ***
## PR_dummy 4497 1754 2.564 0.0105 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 26330 on 998 degrees of freedom
## Multiple R-squared: 0.006545, Adjusted R-squared: 0.00555
## F-statistic: 6.575 on 1 and 998 DF, p-value: 0.01049
We see here that whether there was a police report or not it was significantly different. However this accounts for very little R2.
##
## Call:
## lm(formula = total_claim_amount ~ PR_dummy + incident_hour_of_the_day +
## fraud_reported, data = insurance_claims)
##
## Residuals:
## Min 1Q Median 3Q Max
## -61355 -13523 4185 17561 64759
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 47804.1 2368.5 20.184 < 2e-16 ***
## PR_dummy 4437.6 1689.3 2.627 0.00875 **
## incident_hour_of_the_day 821.2 115.4 7.115 2.14e-12 ***
## fraud_reportedN -9986.9 1859.5 -5.371 9.75e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 25360 on 996 degrees of freedom
## Multiple R-squared: 0.08024, Adjusted R-squared: 0.07747
## F-statistic: 28.96 on 3 and 996 DF, p-value: < 2.2e-16
Here we add fraud_reported to the model with the other two discussed independent variables. The model is starting to come together with all 3 independent variables being significant, however very little total variance is explained with just 8%.
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
No real pattern with age or months as a customer with total claim payout.
##
## Call:
## lm(formula = total_claim_amount ~ number_of_vehicles_involved,
## data = insurance_claims)
##
## Residuals:
## Min 1Q Median 3Q Max
## -46699 -14228 3623 17132 65521
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39691.9 1658.1 23.94 <2e-16 ***
## number_of_vehicles_involved 7107.2 788.8 9.01 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 25400 on 998 degrees of freedom
## Multiple R-squared: 0.07523, Adjusted R-squared: 0.0743
## F-statistic: 81.19 on 1 and 998 DF, p-value: < 2.2e-16
Here is a regression model for number of vehicles involved
##
## Call:
## lm(formula = total_claim_amount ~ incident_type, data = insurance_claims)
##
## Residuals:
## Min 1Q Median 3Q Max
## -43637 -8670 -346 7513 53283
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 64445.5 722.2 89.234 <2e-16 ***
## incident_typeVehicle Theft -58928.2 1660.7 -35.485 <2e-16 ***
## incident_typeMulti-vehicle Collision -2808.4 1011.6 -2.776 0.0056 **
## incident_typeParked Car -59137.2 1739.0 -34.007 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14500 on 996 degrees of freedom
## Multiple R-squared: 0.6993, Adjusted R-squared: 0.6984
## F-statistic: 772.3 on 3 and 996 DF, p-value: < 2.2e-16
Incident Type for an independent variable shows all factors as significant and accounts for nearly 70% of the variance. Let’s add this to the other independent variables to see if we can build a final strong model.
##
## Call:
## lm(formula = total_claim_amount ~ PR_dummy + incident_hour_of_the_day +
## fraud_reported + incident_type + number_of_vehicles_involved,
## data = insurance_claims)
##
## Residuals:
## Min 1Q Median 3Q Max
## -42964 -8816 -305 7488 52876
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 65069.99 2440.02 26.668 <2e-16 ***
## PR_dummy 1693.87 968.36 1.749 0.0806 .
## incident_hour_of_the_day 27.07 68.47 0.395 0.6927
## fraud_reportedN -1395.39 1079.39 -1.293 0.1964
## incident_typeVehicle Theft -58448.07 1702.96 -34.321 <2e-16 ***
## incident_typeMulti-vehicle Collision -586.61 3863.14 -0.152 0.8793
## incident_typeParked Car -58558.03 1780.01 -32.898 <2e-16 ***
## number_of_vehicles_involved -1101.99 1860.48 -0.592 0.5538
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14490 on 992 degrees of freedom
## Multiple R-squared: 0.7009, Adjusted R-squared: 0.6988
## F-statistic: 332.1 on 7 and 992 DF, p-value: < 2.2e-16
Now when we add the other independent variables they are not significant in the model. Only the incident type is. Therefore for our final model we will just keep the incident type as the only independent variable.
Above is a data table that adds a prediction column from our regression model with just incident type as the independent variable. As we can see the “pred” column doesn’t always accurately predict the actual total claim payout value. However, in most instances the values are close.
All in all it is very difficult based off the data we are given to put together a linear model that has a high R2 value. The independent variable of incident type accounted for 70% of the variance, and was the only variable we used in our final linear regression model.